Extra Credit
====

Find the best model among scikit-learn's Logistic Regression, k-NN, Naive Bayes, and SVM algorithms for the `polarity dataset v2.0 (3.0Mb)` from http://www.cs.cornell.edu/People/pabo/movie-review-data/

In [37]:
reset -fs

In [38]:
from sklearn.datasets import load_files

In [39]:
# Load data
# NOTE: Assume the data is already locally stored
sentiment = load_files('./txt_sentoken/', 
                       encoding='utf-8',
                       random_state=42)
sentiment.target_names

['neg', 'pos']
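As a minimal sketch of why `target_names` comes back as `['neg', 'pos']`: `load_files` treats each subdirectory of the container folder as one class. The example below builds a tiny stand-in corpus in a temporary directory; the file names and review texts are hypothetical and are not part of the real dataset.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a tiny stand-in corpus (hypothetical file names and texts) with the
# same two-folder layout as txt_sentoken/: one subdirectory per class.
root = Path(tempfile.mkdtemp())
samples = {
    "neg": ["a dull and lifeless film", "the plot was a mess"],
    "pos": ["a charming and clever film", "the cast was wonderful"],
}
for label, texts in samples.items():
    (root / label).mkdir()
    for i, text in enumerate(texts):
        (root / label / f"review_{i}.txt").write_text(text, encoding="utf-8")

# Folder names become the class names, sorted alphabetically
toy = load_files(str(root), encoding="utf-8", random_state=42)
target_names = toy.target_names
n_docs = len(toy.data)
print(target_names, n_docs)
```

Because `encoding` is given, each entry of `toy.data` is a decoded `str` rather than raw bytes, which is what the text vectorizers expect.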

In [40]:
from sklearn.model_selection import train_test_split

In [41]:
# Create train/test split with labels
# NOTE: `random_state` is not fixed, so the split (and the resulting accuracy) will vary between runs
train_data, test_data, train_target, test_target = train_test_split(sentiment.data,
                                                                    sentiment.target)
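If a repeatable split is wanted while experimenting, one option is to pass `random_state` (and optionally `stratify`) to `train_test_split`. This is only a sketch on hypothetical toy data; the seed value and the `split` helper are illustrative assumptions, and the notebook itself leaves the split random.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for sentiment.data / sentiment.target
docs = [f"review number {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10


def split(seed):
    """Return a stratified 75/25 split that is repeatable for a given seed."""
    return train_test_split(
        docs, labels, test_size=0.25, random_state=seed, stratify=labels
    )


first = split(0)
second = split(0)
print(first == second)  # the same seed reproduces the same split
```

Stratifying keeps the class proportions roughly equal in both parts, which matters less for a balanced dataset but does no harm.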

In [42]:
# Accuracy will be the single metric
from sklearn.metrics import accuracy_score

In [47]:
def fit_extra_credit_model():
    "Fit your single best model, returning model and accuracy."

    # Example solution; Replace with your solution
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    
    vectorizer = CountVectorizer()
    train_features = vectorizer.fit_transform(train_data)
    test_features = vectorizer.transform(test_data)
    
    model = MultinomialNB()
    model.fit(train_features, train_target)
    predicted = model.predict(test_features)
    
    # accuracy_score expects (y_true, y_pred)
    return model, accuracy_score(test_target, predicted)
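One possible way to choose between the four families is to cross-validate one candidate from each on the training data and keep the highest mean accuracy. The sketch below is not the reference solution: the toy corpus, the choice of `TfidfVectorizer`, and the specific estimators and hyperparameters are all assumptions made for illustration. In the notebook, `train_data` and `train_target` would take the place of the toy lists.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus (hypothetical); in the notebook, use train_data / train_target.
toy_docs = [
    "a wonderful and moving film", "great cast and a clever script",
    "funny charming and full of heart", "a brilliant wonderful performance",
    "clever funny and moving", "great film with a brilliant script",
    "a dull and boring film", "terrible cast and a lazy script",
    "boring predictable and far too long", "a terrible lifeless performance",
    "lazy dull and predictable", "awful film with a boring script",
]
toy_labels = [1] * 6 + [0] * 6

# One candidate per acceptable family; hyperparameters are illustrative only.
candidates = {
    "linear_model": LogisticRegression(max_iter=1000),
    "neighbors": KNeighborsClassifier(n_neighbors=3),
    "naive_bayes": MultinomialNB(),
    "svm": LinearSVC(),
}

scores = {}
for name, clf in candidates.items():
    # Vectorizing inside the pipeline keeps each fold's vocabulary
    # from leaking into its held-out part.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    fold_scores = cross_val_score(
        pipeline, toy_docs, toy_labels, cv=3, scoring="accuracy"
    )
    scores[name] = fold_scores.mean()

best_name = max(scores, key=scores.get)
print(scores, best_name)
```

Note that the test cell further down inspects the class of the returned model, so a `Pipeline` object would fail its check (its module is `sklearn.pipeline`). After selecting a family this way, return the fitted estimator itself, as the example solution does.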

In [44]:
# List of acceptable models
# GLM - http://scikit-learn.org/stable/modules/linear_model.html
# k-NN - http://scikit-learn.org/stable/modules/neighbors.html
# Naive Bayes - http://scikit-learn.org/stable/modules/naive_bayes.html
# SVM - http://scikit-learn.org/stable/modules/svm.html

acceptable_algos = {'linear_model', 'neighbors', 'naive_bayes', 'svm'}

In [55]:
"""
Points will be determined later.
Test code for the 'fit_extra_credit_model' function.
This cell should NOT give any errors when it is run; warnings are okay.
"""
model, accuracy = fit_extra_credit_model()
raw_name = str(model.__class__)
model_module = raw_name.split('.')[1]
assert model_module in acceptable_algos # Check that your model is an instance of an acceptable algorithm
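The check above parses the string form of the class. A slightly more direct alternative, sketched here as an assumption rather than as the grader's actual code, reads the class's `__module__` attribute. The helper name `top_level_submodule` and the `FakeEstimator` class are hypothetical and exist only so the example runs without scikit-learn.

```python
def top_level_submodule(obj):
    """Return the second dotted component of the module defining obj's class.

    For an object whose class lives in 'sklearn.naive_bayes', this gives
    'naive_bayes'. Returns '' when the module name has no dot.
    """
    parts = type(obj).__module__.split(".")
    return parts[1] if len(parts) > 1 else ""


# Hypothetical stand-in for an estimator so the sketch runs without scikit-learn
class FakeEstimator:
    pass


FakeEstimator.__module__ = "sklearn.naive_bayes"

acceptable = {"linear_model", "neighbors", "naive_bayes", "svm"}
found = top_level_submodule(FakeEstimator())
print(found, found in acceptable)
```

Taking the second component also handles estimators defined in private submodules, since only the part directly under the package name is compared.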

In [57]:
# Visual inspection
name_formatted = model_module.replace("_", " ").title()
print(f"Model: {name_formatted}")
print(f"Acc: {accuracy}")

Model: Naive Bayes
Acc: 0.822



----