### Assignment 3 -- Text Classification

You are given training data in the form of product and review information for 1500 products.  Each product is labeled as being in one of three categories: **Books**, **Movies & TV**, or **Music**.

There are two parts to the assignment:  delivering the classifier, and documenting the research that went into choosing the best model.

#### Input Data

There are two files, *products.txt* and *reviews.txt*, which are in the same format as the data files in Assignment 2.
The product category label appears as the key of the **salesRank** attribute.

Notice that there is also an attribute **categories** in this data set.  ***You are not allowed to use the attribute*** *categories* ***for this assignment.***

The file *reviews.txt* contains reviews for the products in *products.txt*.

The code to build your model should assume that these files are present in the same directory as the workbook, that all the records in *products.txt* have one of the three categories above, and that every record in *reviews.txt* refers to a product that appears in *products.txt*.


#### Parts to the Assignment

In the first part of the assignment you will write code to train your model and to preprocess test data. After that, you will answer some questions about the experiments you conducted and the decisions you made in building your "best" model.

-------------------------------------------------

#### The Model and its Evaluation

These are the two functions that will allow evaluation of your model.

The first reads training data, prepares the data (extracts fields from the files, builds the response variable, vectorizes the X, and possibly reduces the feature set), then trains the model on that data (using the fit method).  The model returned should be ready to predict X values.

The second just does the preprocessing steps, producing an X and a y.

To evaluate your model, I will first use build_model to train it -- the two files will be in the same format as the files you got as part of the assignment, but may contain different data records.   After training the model, I will call prepare_data using a different set of test data, then evaluate the model by calling its predict method on the X matrix it produces.

```python
import ast

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Vectorizer fitted by build_model; prepare_data reuses it so that test data
# is mapped onto the same feature columns the model was trained on.
_vectorizer = None


def load_input(file_name):
    """Read one record (a dict written as a Python literal) per line."""
    records = []
    with open(file_name) as f:
        for line in f:
            if line.strip():
                # literal_eval parses literals only; plain eval would run arbitrary code
                records.append(ast.literal_eval('(' + line + ')'))
    return records


def make_training(product_file_name, review_file_name):
    """Return (texts, labels): one entry per review, labeled with its product's category."""
    # Map each product id to its category, the first key of salesRank.
    products = {}
    for raw_product in load_input(product_file_name):
        products[raw_product['asin']] = list(raw_product['salesRank'].keys())[0]

    # Group review texts (summary + body) by product id.
    reviews = {}
    for raw_review in load_input(review_file_name):
        txt = raw_review['summary'] + ' ' + raw_review['reviewText']
        reviews.setdefault(raw_review['asin'], []).append(txt)

    txts = []
    ys = []
    for asin, list_texts in reviews.items():
        for text in list_texts:
            txts.append(text)
            ys.append(products[asin])
    return txts, ys


# Returns a model -- an object that at least implements a predict method.
# The two parameters are names of files containing labeled training data.
# The model returned should already be trained on (fitted to) the data in those two files.
def build_model(product_file_name="products.txt", review_file_name="reviews.txt"):
    global _vectorizer
    txts, y = make_training(product_file_name, review_file_name)
    _vectorizer = CountVectorizer()  # TODO: change the parameters!
    X = _vectorizer.fit_transform(txts)
    print(X.shape)
    nb = MultinomialNB()
    nb.fit(X, y)
    # TODO: optimize the model, hyperparameters, cv
    return nb


# Returns an X matrix and y vector which are prepared using the same preprocessing
# steps used to build and train the model returned by build_model above.
def prepare_data(product_file_name, review_file_name):
    if _vectorizer is None:
        raise RuntimeError("call build_model before prepare_data")
    txts, y = make_training(product_file_name, review_file_name)
    # Transform only, without refitting: a vectorizer refitted on the test data
    # would produce columns that do not match the trained model's features.
    X = _vectorizer.transform(txts)
    return X, y
```

```python
# remove me before submission!!
nb = build_model("products.txt", "reviews.txt")
```

```
(17757, 54770)
```
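
The evaluation procedure described above could be sketched roughly as follows. This is only an illustration: the helper name `evaluate`, the choice of accuracy as the metric, and the toy texts are assumptions, not part of the assignment. The point it shows is that the vectorizer is fitted on the training texts once and only applied (not refitted) to the test texts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB


def evaluate(model, X_test, y_test):
    """Hypothetical helper: fraction of test rows whose prediction matches y_test."""
    return accuracy_score(y_test, model.predict(X_test))


# Toy stand-in for the real files (made-up texts, for illustration only).
train_txts = ["great novel and plot", "great film and cast", "great album and songs"]
train_y = ["Books", "Movies & TV", "Music"]
test_txts = ["a novel with a slow plot", "the cast of this film"]
test_y = ["Books", "Movies & TV"]

# Fit the vocabulary on the training texts only.
vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(train_txts), train_y)

# Reuse the fitted vectorizer on the test texts so the columns line up.
score = evaluate(model, vectorizer.transform(test_txts), test_y)
print(score)
```

With the functions in this notebook, the equivalent would be to call `build_model` on the training files, `prepare_data` on the test files, and pass the results to a helper like `evaluate`.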


-----------------------------------------------------------------------------
### Documenting your Decisions


#### Evaluation and Analysis

In answering these questions, please be sure to show your work, for example the output of commands you used to gather data supporting your decisions.   For each of the cells below containing a question, please leave the question header and text in the notebook you submit.  Put your answer in markdown in the same cell, and add additional cells below the question/answer cell for supporting data (output, tables, graphics).  I will evaluate your notebook from beginning to end prior to evaluating the model or reading your analysis.


----------------------------
#### Model Quality

What do you expect the accuracy of your model to be on a new set of product records it has not seen before?
How many variables are in your model?
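
One common way to estimate accuracy on unseen data is k-fold cross-validation. The sketch below is a minimal illustration on a made-up toy corpus, assuming accuracy as the metric and a plain `CountVectorizer` plus `MultinomialNB` pipeline; it also shows one way to count the model's variables.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the review texts and category labels.
txts = [
    "a gripping novel", "the author writes well", "chapters of this book",
    "a gripping film", "the actors perform well", "scenes of this movie",
    "a gripping album", "the band plays well", "tracks of this record",
]
y = ["Books"] * 3 + ["Movies & TV"] * 3 + ["Music"] * 3

# Putting the vectorizer inside the pipeline refits it in every fold, so the
# held-out fold never contributes vocabulary to training.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(pipeline, txts, y, cv=3, scoring="accuracy")
print("fold accuracies:", scores, "mean:", scores.mean())

# The number of model variables is the number of columns the vectorizer emits.
pipeline.fit(txts, y)
n_features = len(pipeline.named_steps["countvectorizer"].vocabulary_)
print("number of features:", n_features)
```

On the real data, `txts, y = make_training("products.txt", "reviews.txt")` would replace the toy corpus. Because several reviews belong to the same product, splitting by review can place one product in both the training and held-out folds; grouping folds by product (for example with `GroupKFold`) may give an estimate closer to performance on new products.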

------------------------------------------------
#### Input Fields

What input fields from the product and review records did you include in training your model?
How did you decide which fields to use and which to omit?

-------------------------------------------
#### Preprocessing

What preprocessing steps did you use?  At minimum you must evaluate stemming, tokenizing, and stop word removal.  How did you decide which steps improved the model and which did not?
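
The preprocessing variants named above can each be expressed as a vectorizer configuration. The sketch below only builds the variants and compares vocabulary sizes on two made-up sentences; deciding which steps help would still require scoring each variant, for example by cross-validation. `crude_stem` is a hypothetical stand-in for a real stemmer (such as a Porter stemmer) and is not suitable for actual use.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

TOKEN_RE = re.compile(r"[a-z]+")


def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def stemming_tokenizer(text):
    """Lowercase, split on non-letters, then stem each token."""
    return [crude_stem(tok) for tok in TOKEN_RE.findall(text.lower())]


txts = ["The actors acted in two movies", "An actor is acting in the movie she loved"]

variants = {
    "default": CountVectorizer(),
    "stop words removed": CountVectorizer(stop_words="english"),
    # token_pattern=None because the custom tokenizer replaces the default pattern
    "stemmed": CountVectorizer(tokenizer=stemming_tokenizer, token_pattern=None),
}
vocab_sizes = {}
for name, vectorizer in variants.items():
    vectorizer.fit(txts)
    vocab_sizes[name] = len(vectorizer.vocabulary_)
    print(name, vocab_sizes[name], sorted(vectorizer.vocabulary_))
```

In this toy case stemming merges "actors"/"actor" and "acted"/"acting", and stop word removal drops words such as "the", so both variants yield a smaller vocabulary than the default.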

---------------------------------------------
#### Vectorization

What technique did you use to turn the input features into a feature vector?  How did you make that choice?
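
One way to make the vectorization choice is to score several candidate vectorizers under the same classifier and cross-validation split. The following is a minimal sketch on a made-up toy corpus; the candidate list is an assumption, and the numbers it prints say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the review texts and category labels.
txts = [
    "a gripping novel", "the author writes well", "chapters of this book",
    "a gripping film", "the actors perform well", "scenes of this movie",
    "a gripping album", "the band plays well", "tracks of this record",
]
y = ["Books"] * 3 + ["Movies & TV"] * 3 + ["Music"] * 3

candidates = {
    "raw counts": CountVectorizer(),
    "binary presence": CountVectorizer(binary=True),
    "tf-idf": TfidfVectorizer(),
    "unigrams + bigrams": CountVectorizer(ngram_range=(1, 2)),
}

# Same classifier and same folds for every candidate, so only the
# vectorization differs between the scores.
mean_accuracy = {}
for name, vectorizer in candidates.items():
    pipeline = make_pipeline(vectorizer, MultinomialNB())
    mean_accuracy[name] = cross_val_score(pipeline, txts, y, cv=3).mean()
    print(f"{name}: {mean_accuracy[name]:.3f}")
```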

---------------------------------------------
####  Feature Selection

It is important to examine and understand the model features, both to convince yourself that the important features are plausible and to consider removing the unimportant features.  What features did you use in the model, and how did you make that choice?  List some of the most important features.  Are you convinced that they are accurate exemplars of the class, or might they be artifacts of the training set?  List some of the least important features -- do they suggest ways to cut down the model size without significantly affecting accuracy?

In class we looked at three ways of estimating the impact of a term on a model:

* Frequency-based selection
* Mutual information
* Feature log probabilities, i.e. $\log P(f_i \mid C)$

How different are the three measures, i.e. do they all tend to rank the same variables as significant and insignificant?  Which did you use in your model, and why?

-------------------------------------------------
#### Algorithm

Which classification algorithm did you use?  What alternatives did you explore, and how did you make the choice?  What hyperparameter optimization did you perform?
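
A grid search over a pipeline is one way to compare algorithms and tune their hyperparameters in the same experiment. The sketch below is illustrative only: the toy corpus, the two candidate classifiers, and the grid values are assumptions rather than recommended settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the review texts and category labels.
txts = [
    "a gripping novel", "the author writes well", "chapters of this book",
    "a gripping film", "the actors perform well", "scenes of this movie",
    "a gripping album", "the band plays well", "tracks of this record",
]
y = ["Books"] * 3 + ["Movies & TV"] * 3 + ["Music"] * 3

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# Each dict is one family of candidates; the "classifier" step itself is
# swapped, so the search compares algorithms as well as hyperparameters.
param_grid = [
    {"classifier": [MultinomialNB()], "classifier__alpha": [0.1, 0.5, 1.0]},
    {"classifier": [LogisticRegression(max_iter=1000)],
     "classifier__C": [0.1, 1.0, 10.0]},
]

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(txts, y)
print(search.best_params_, search.best_score_)
```

After fitting, `search.cv_results_` holds the per-candidate scores, which is the kind of supporting output the question asks for.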

----------------------------------------
#### Understanding Misclassifications

Even though misclassifications are inevitable, it is important to understand *why* your algorithm makes errors, and whether it is making "understandable" errors.   Choose several examples of misclassification and informally explain why you believe the classifier made the wrong choice.  Is there anything you could do in terms of feature engineering to fix some of the misclassifications?
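
A confusion matrix plus a list of the misclassified texts is a reasonable starting point for this analysis. The sketch below uses made-up training and test sentences, so the specific error it surfaces is only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

labels = ["Books", "Movies & TV", "Music"]
train_txts = [
    "a novel by a famous author",
    "a film with famous actors",
    "an album by a famous band",
]
train_y = labels
test_txts = [
    "the author narrates this novel",
    "a concert film about the band and their album",
    "the band recorded the album",
]
test_y = ["Books", "Movies & TV", "Music"]

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(train_txts), train_y)
predicted = model.predict(vectorizer.transform(test_txts))

# Rows are true classes, columns are predicted classes, both in `labels` order.
matrix = confusion_matrix(test_y, predicted, labels=labels)
print(matrix)

# Collect (text, true label, predicted label) for every error.
misclassified = [
    (text, true, pred)
    for text, true, pred in zip(test_txts, test_y, predicted)
    if true != pred
]
for text, true, pred in misclassified:
    print(f"true={true} predicted={pred}: {text}")
```

Here the concert film is predicted as Music because "band" and "album" outweigh "film", an "understandable" error of the kind the question describes.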