
### Extra Credit Assignment (5 points)
**Deadline:** Friday, April 5, 2019 11:59:59pm. I will create an assignment on Classes, where you can submit your results.

As part of this assignment, there is an attached zip file containing two files: `reviews/reviews_train.tsv` and `reviews/reviews_test.tsv`. Each file contains a number of lines, and each line has two parts separated by a **tab character**. The first part of the line is a piece of text (a "review"). The second part of the line is an integer value, from 1 to 5, corresponding to the star rating that was given alongside the review.
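As a rough sketch of the file layout described above, the snippet below parses two made-up lines with `pandas.read_csv`. The sample text, the column names `text` and `label`, and the absence of a header row are assumptions for illustration; the ratings are written as `5.0` only because the loading code in this notebook converts them with `int(float(...))`.

```python
import csv
import io

import pandas as pd

# Two made-up lines in the same layout: review text, a tab, then the star rating
sample = "Great location and friendly staff.\t5.0\nRoom was noisy.\t2.0\n"

reviews = pd.read_csv(
    io.StringIO(sample),      # replace with the path to the .tsv file
    sep="\t",
    header=None,              # assumption: the files have no header row
    names=["text", "label"],  # hypothetical column names
    quoting=csv.QUOTE_NONE,   # treat quote characters in reviews as plain text
)

# Ratings may be stored as "5.0"; convert them to integers
reviews["label"] = reviews["label"].astype(float).astype(int)
print(reviews)
```

This is only an alternative to reading the file line by line; either approach should produce the same two columns.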

In [1]:
import numpy as np
import pandas as pd

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8

* First, you will need to upload these files to your JupyterHub Notebook server. You can upload files by simple drag 'n' drop, or by using the "Upload" button.

* You will use the `reviews_train.tsv` file to train various classification models (see below). You will then use the trained model(s) on the `reviews_test.tsv` file to test their performance.

In [34]:
something = '4.0'
if float(something) == 4:
    print('yes!')
    print(int(float(something)))

yes!
4


In [67]:
# open file and clean up data
with open('reviews_train.tsv', encoding="utf8") as tr:
    train_rawdata = [x_tr.strip().split("\t") for x_tr in tr.readlines()] 
    train_data_cleaned = []
    for data in train_rawdata:
        tr_cleaned = {'text': data[0], 'label': int(float(data[1]))}
        train_data_cleaned.append(tr_cleaned)
        
train_data = pd.DataFrame(train_data_cleaned)

with open('reviews_test.tsv', encoding="utf8") as te:
    test_rawdata = [x_te.strip().split("\t") for x_te in te.readlines()] 
    test_data_cleaned = []
    for data in test_rawdata:
        te_cleaned = {'text': data[0], 'label': int(float(data[1]))}
        test_data_cleaned.append(te_cleaned)
        
test_data = pd.DataFrame(test_data_cleaned)

In [68]:
print(' ======= TRAIN DATA ======= \n', train_data.head(3))
print(' \n ======= TEST DATA ======= \n', test_data.head(3))

    label                                               text
0      4  Proximity to waterfront and downtown Seattle. ...
1      4  Clean, shopping near by, very pleasant staff, ...
2      5  Everything about our stay was great. Traveling...
 
    label                                               text
0      5  This was our first time visiting the city of N...
1      5  Excellent service, great size rooms and great ...
2      5  Great new hotel with amazing access to shoppin...


In particular, you will train your classifiers on the contents of the `reviews_train.tsv` file (text and label), as they are, meaning you will be working with 5 different classes (the different star ratings that show up in the file). You will train **4 different classifiers**. 

Each of those classifiers will be tested on the contents of the `reviews_test.tsv` file. For the evaluation, you will be using CRCs (Cumulative Response Curves). You will have to create **2 CRCs**.

Each CRC will demonstrate (in the same plot) the performance of each of the 4 classifiers on the `reviews_test.tsv`, when focusing on a _specific star rating_. You get to pick the star ratings to test. For example, if you pick "rating = 2", then the CRC will consider as correct classifications on the `reviews_test.tsv` the ratings that are equal to 2, and will consider as misclassification anything else.

Which two star ratings to show (one per CRC) is up to you. However, one rating must be below 3 and the other greater than or equal to 3. For example, you can pick to plot the pair ("Rating = 2", "Rating = 5"). You cannot pick both plots to be < 3 or both of them to be >= 3.
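To make the "one rating versus everything else" idea concrete, here is a small sketch on made-up numbers (the ratings and scores are invented for illustration and do not come from the review files). A test review counts as a positive when its true rating equals the chosen rating; reviews are ranked by the model's score for that rating, and the positives are accumulated along the ranking.

```python
import numpy as np

# Hypothetical true star ratings of six test reviews
true_ratings = np.array([2, 5, 2, 4, 1, 2])
# Hypothetical scores a classifier assigned to "rating = 2" for the same reviews
scores_for_2 = np.array([0.70, 0.10, 0.40, 0.30, 0.60, 0.20])

target_rating = 2
# 1 where the true rating is the target rating, 0 for anything else
is_target = (true_ratings == target_rating).astype(int)

# Rank reviews from highest to lowest score, then accumulate the positives
order = np.argsort(-scores_for_2, kind="stable")
crc = np.cumsum(is_target[order])
print(crc)  # [1 1 2 2 3 3]
```

The last value of the curve always equals the total number of reviews with the chosen rating, which is why the "random" baseline is a straight line from 0 to that total.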

### Part I -- 4 Classifiers & CRC
---

In [None]:
# Decision-tree model
model_tree = DecisionTreeClassifier(criterion='entropy', max_depth=15)

# Logistic Regression model
model_logregr = LogisticRegression(C=100)

# The SVM model


#### Classifier #4 -- Naive Bayes (NB) 
_[using Bernoulli Naive Bayes (BNB)]_

In [None]:
# Learn the vocabulary of the training text; binary=True records word presence/absence rather than counts
binary_vectorizer_tr = CountVectorizer(binary=True)
binary_vectorizer_tr.fit(train_data['text'])

# (word, column index) pairs from the fitted vocabulary
vocab_ls_tr = list(binary_vectorizer_tr.vocabulary_.items())
vocab_ls_tr[0:10]

In [None]:
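from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Binary features from the vectorizer fitted on the training text; labels are the star ratings
X_train_counts = binary_vectorizer_tr.transform(train_data['text'])
X_test_counts = binary_vectorizer_tr.transform(test_data['text'])
Y_train, Y_test = train_data['label'], test_data['label']

# Naive Bayes has an alpha (additive smoothing) parameter; like the regularization parameter of Logistic Regression, it controls model complexity
model = BernoulliNB()
model.fit(X_train_counts, Y_train)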

In [None]:
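# AUC needs a binary target, so score one rating against the rest (rating 5 is only an example choice)
target_rating = 5
col = list(model.classes_).index(target_rating)  # predict_proba column for that rating

print ("AUC on the count TRAIN data (hard predictions) = %.3f" % metrics.roc_auc_score(Y_train == target_rating, (model.predict(X_train_counts) == target_rating).astype(int)))
print ("AUC on the count TRAIN data (probabilities) = %.3f" % metrics.roc_auc_score(Y_train == target_rating, model.predict_proba(X_train_counts)[:, col]))

print ("AUC on the count TEST data (hard predictions) = %.3f" % metrics.roc_auc_score(Y_test == target_rating, (model.predict(X_test_counts) == target_rating).astype(int)))
print ("AUC on the count TEST data (probabilities) = %.3f" % metrics.roc_auc_score(Y_test == target_rating, model.predict_proba(X_test_counts)[:, col]))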

You can reuse the same classifiers in the two CRCs.


**Note 1:** When training your classifiers, you will have to train them on the dataset with the 5 different classes together. When testing, you will be focusing on a specific rating, one for each CRC that you generate.
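A minimal sketch of what Note 1 implies in scikit-learn: a classifier trained on several classes returns one probability column per class, ordered as in its `classes_` attribute, so the column for the chosen rating has to be looked up rather than assumed. The tiny feature matrix and labels below are made up, and only three ratings are used for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up feature matrix and star ratings (three classes only, for brevity)
X = np.array([[0.0], [0.2], [1.0], [1.2], [2.0], [2.2]])
y = np.array([1, 1, 3, 3, 5, 5])

# Train once on all classes together
clf = LogisticRegression().fit(X, y)

# predict_proba has one column per class, ordered as in clf.classes_
target_rating = 5
target_col = list(clf.classes_).index(target_rating)
proba_for_target = clf.predict_proba(X)[:, target_col]

print(clf.classes_)            # sorted class labels seen during training
print(proba_for_target.shape)  # one score per row of X
```

The same trained model can then be reused for the second CRC by changing only `target_rating`.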


**Note 2:** As you are dealing with text, you may consider the text representation transformation as a different "classifier". That is, you can consider the `CountVectorizer` with binary feature values and the `CountVectorizer` with actual counts as two different "classifiers", even though you use, e.g., `LogisticRegression` for both of them. On the other hand, changing the complexity parameters of a classifier _alone_ is not considered a separate classifier for the purposes of this assignment.
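To illustrate the two representations mentioned in Note 2, the sketch below vectorizes two made-up reviews with `CountVectorizer` in both modes. The only difference is that repeated words are capped at 1 in the binary version.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great great hotel", "bad hotel"]  # made-up reviews

count_vec = CountVectorizer()              # actual term counts
binary_vec = CountVectorizer(binary=True)  # presence/absence only

X_counts = count_vec.fit_transform(docs).toarray()
X_binary = binary_vec.fit_transform(docs).toarray()

# Vocabulary words listed in column order: ['bad', 'great', 'hotel']
print(sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get))
print(X_counts)  # [[0 2 1], [1 0 1]]
print(X_binary)  # [[0 1 1], [1 0 1]]
```

Feeding `X_counts` and `X_binary` to the same model type would count as two different "classifiers" under the rule above.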


Make sure that in your plots you specify which model corresponds to which line. Give a brief description of the results and say which model you would pick.

### Part II -- 2 CRCs
---

In [None]:
# Here's a method that trains a classifier and returns its CRC for one star rating

def train_and_compute_crc( model, x_train, y_train, x_test, y_test, target_rating ):

    # Train the model on all the classes together
    model.fit(x_train, y_train)

    # Let's get the probabilities. FOCUS ON THE TARGET RATING (the "positive" class)
    target_col = list(model.classes_).index(target_rating)
    probabilities = model.predict_proba(x_test)[:, target_col]

    # Create a dataframe that we can conveniently manipulate (TRUE_CLASS is 1 when the true rating is the target)
    is_target = (np.asarray(y_test) == target_rating).astype(int)
    model_df = pd.DataFrame({"PROBABILITY": probabilities, "TRUE_CLASS": is_target})

    # Sort the dataframe rows by the PROBABILITY
    model_df_sorted = model_df.sort_values(by=['PROBABILITY'], ascending=False)

    # Compute the CUMULATIVE correct responses up to each position in the ranking
    return model_df_sorted["TRUE_CLASS"].cumsum()

#### CRC #1 -- Decision Tree Model

#### CRC #2 -- Naive Bayes

### Plot of CRC for all four classifiers
---

In [None]:
# Let's plot the above results, together
plt.plot(range(0, len(dec_tree_crc)), dec_tree_crc, label="Decision Tree")
plt.plot(range(0, len(logReg_crc)), logReg_crc, label="Logistic Regression")
plt.plot([0,len(logReg_crc)], [0,max(logReg_crc)], 'k--', label="Random")
plt.xlabel("Number of test instances targeted (decreasing score)")
plt.ylabel("Number of positives targeted")
plt.title("Cumulative response curve")
plt.legend()
plt.show()

## Useful note for the data

When dealing with text, it is often the case that we must work with Unicode characters. 

You are welcome to ask questions about the EC on the forums.

If you encounter the error "ValueError: np.nan is an invalid document, expected byte or unicode string.", check the following URL to help you resolve it: https://stackoverflow.com/questions/39303912/tfidfvectorizer-in-scikit-learn-valueerror-np-nan-is-an-invalid-document
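For the `np.nan` error mentioned above, two common workarounds are to drop rows with missing text or to replace missing text with an empty string before vectorizing. The sketch below shows both on a made-up frame; the column names match the ones used in this notebook, but the data is invented.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up frame with one missing review text
df = pd.DataFrame({"text": ["nice room", np.nan, "rude staff"], "label": [5, 3, 1]})

# Option A: drop rows whose text is missing
df_dropped = df.dropna(subset=["text"])

# Option B: keep the rows and treat missing text as an empty string
texts_filled = df["text"].fillna("").astype(str)

# The vectorizer now receives only strings, so it no longer raises the np.nan error
X = CountVectorizer().fit_transform(texts_filled)
print(len(df_dropped), X.shape)
```

Option A changes the number of rows, so the labels must be taken from the same filtered frame; Option B keeps rows aligned but adds an all-zero feature row.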