In [3]:
import pandas as pd
import operator

Today's exercise will be done with [scikit-learn](http://scikit-learn.org/stable/index.html), a very popular Python package for machine learning. It implements the most common ML models and methods for preparing features. In fact, many companies use it in production.

Unless you want to use custom models or neural networks, you will be able to do most of your development with scikit-learn.

This notebook will take you through the process of creating a classifier on your own. An example implementation is already filled in to help you get started. 

The exercises are there to guide you, but feel free to experiment beyond them. 

# The Movie Review Model

The sections of the notebook will take you through each step of creating a classifier.

### Loading the data

For this task we will be using a real movie review dataset. The dataset contains the text of each review and a sentiment label (0 = bad, 1 = good). Both the train and test sets contain 50% of each label.

As a reminder, *training data* is used to train/fit your model. *Test data* is used to evaluate performance. We use a separate dataset for evaluation, because it tells us how we will perform on unseen data.
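A quick way to confirm the 50/50 label balance mentioned above is to count the labels. The sketch below uses a tiny made-up DataFrame (`toy_train` is hypothetical, not the real dataset) with the same `text` and `sentiment` column names.

```python
import pandas as pd

# Hypothetical stand-in for the real training data (same column names)
toy_train = pd.DataFrame({
    "text": ["loved it", "hated it", "great fun", "so boring"],
    "sentiment": [1, 0, 1, 0],
})

# Fraction of reviews per label; a balanced set gives 0.5 for each label
balance = toy_train.sentiment.value_counts(normalize=True)
print(balance)
```

Running the same `value_counts(normalize=True)` call on `train_data.sentiment` should tell you whether the real data is balanced as described.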

In [4]:
train_data = pd.read_csv('train.data')
train_data[:1]

Unnamed: 0,text,sentiment
0,"I felt a great joy, after seeing this film, no...",1


In [5]:
print(train_data.text.iloc[0])


I felt a great joy, after seeing this film, not because it is a master piece, but because it convinced me of, that the Portuguese cinema became really very good. We can see here the best Portuguese actores in this field.


In [6]:
print(train_data.text.iloc[1])

Henry Thomas showed a restraint, even when the third act turned into horrible hollywood resolution that could've killed this movie, that kept the dignity of a redemption story and as for pure creepiness-sniffing babies?


In [7]:
test_data = pd.read_csv('test.data')
test_data[:1]

Unnamed: 0,text,sentiment
0,"This film is pretty good, it actually is like ...",1


The data is stored in a Pandas DataFrame. 

To access the columns use `train_data.text` or `train_data.sentiment`.

If you are having trouble with it, I've put each column into a separate variable below.

In [6]:
train_text = train_data.text.values
train_sentiment = train_data.sentiment.values

test_text = test_data.text.values
test_sentiment = test_data.sentiment.values

### Creating the features

The next step is to create features. Scikit-learn has a couple of different functions to help us:

[CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html): Counts the words in a piece of text.

[TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer): Scaled version of counts that weighs rare words higher.
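To see what "weighs rare words higher" means, here is a minimal sketch on a made-up three-review corpus (`toy_reviews` is an assumption for illustration). It compares the inverse-document-frequency weight of a word that appears in every review with one that appears only once.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up): "good" appears in every review, "rare" in only one
toy_reviews = ["good movie", "good film", "good rare plot"]

tfidf = TfidfVectorizer()
tfidf.fit(toy_reviews)

# idf_ holds one inverse-document-frequency weight per vocabulary entry;
# vocabulary_ maps each word to its position in that array
idf_good = tfidf.idf_[tfidf.vocabulary_["good"]]
idf_rare = tfidf.idf_[tfidf.vocabulary_["rare"]]
print("idf(good):", idf_good, "idf(rare):", idf_rare)
```

The rarer word should get the larger weight, so it contributes more to the feature value whenever it does appear.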

Both classes will transform texts to a matrix. Each row in the matrix represents a data point, and each column maps to a word. To use either class, first call the `fit` method to create a map from words to indices. Then call `transform` to actually create the matrix.
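The `fit` then `transform` steps can be sketched on two made-up reviews (the names `toy_reviews`, `toy_vec`, and `X_toy` are just for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews
toy_reviews = ["great great film", "boring film"]

toy_vec = CountVectorizer()
toy_vec.fit(toy_reviews)                # builds the word -> column index map
X_toy = toy_vec.transform(toy_reviews)  # builds the sparse count matrix

print(toy_vec.vocabulary_)  # word -> column index
print(X_toy.toarray())      # one row per review, one column per word
```

Note that the matrix is sparse; `toarray()` is only practical on small examples like this one.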

---

**Exercise: Try using binary counts instead of actual counts**

`CountVectorizer(binary=True)`
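As a rough illustration of the difference, the sketch below vectorizes one made-up review both ways. With `binary=True`, a repeated word is recorded as present (1) rather than counted.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One made-up review with a repeated word
toy_reviews = ["great great great film"]

counts = CountVectorizer().fit_transform(toy_reviews)
binary = CountVectorizer(binary=True).fit_transform(toy_reviews)

# Columns are in alphabetical order: "film", then "great"
count_row = counts.toarray()[0].tolist()
binary_row = binary.toarray()[0].tolist()
print("counts:", count_row)
print("binary:", binary_row)
```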

**Exercise: Try using the TfidfVectorizer and compare performance**

**Exercise: Try different settings for ngram_range - which range is optimal?**

Note: n-grams are phrases of length n (e.g., 'very cool' is a bigram/2-gram).
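To make the n-gram idea concrete, this small sketch fits a vectorizer with `ngram_range=(1, 2)` on a single made-up sentence and lists the features it extracts (single words plus two-word phrases).

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps single words (unigrams) and two-word phrases (bigrams)
toy_vec = CountVectorizer(ngram_range=(1, 2))
toy_vec.fit(["very cool movie"])

# The keys of vocabulary_ are the extracted features
features = sorted(toy_vec.vocabulary_)
print(features)
```

Wider ranges add more columns to the matrix, so the feature count grows quickly on real data.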


*Bonus Exercise: The model seems to have a lot of weird words like "00001". Figure out how to get rid of them.*

**Answer: Set `min_df` higher. This ignores words that occurred in fewer than n reviews:**

`CountVectorizer(min_df=5)`

will ignore words that occurred in fewer than 5 reviews.
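A minimal sketch of the effect, using a made-up three-review corpus and a lower threshold (`min_df=2`) so the filtering is visible on such a small example:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up reviews: "00001" and "plot" each appear in only one review
toy_reviews = ["good film 00001", "good film", "good plot"]

vec_all = CountVectorizer().fit(toy_reviews)
vec_min2 = CountVectorizer(min_df=2).fit(toy_reviews)

# min_df=2 keeps only words found in at least 2 reviews
kept_all = sorted(vec_all.vocabulary_)
kept_min2 = sorted(vec_min2.vocabulary_)
print("default :", kept_all)
print("min_df=2:", kept_min2)
```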

In [31]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vec = TfidfVectorizer(ngram_range=(1, 2), min_df=3)

# first call fit - to know which words you are counting
vec.fit(train_data.text)

# Now create a matrix from the words - the Xs refer to the matrix form
X_train = vec.transform(train_data.text)
X_test = vec.transform(test_data.text)

X_train

<2000x9621 sparse matrix of type '<class 'numpy.float64'>'
	with 140206 stored elements in Compressed Sparse Row format>

### Choosing a model

Scikit-learn implements a number of different classifiers; you can find all the ones for supervised learning [here](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning).

Here are a few simple ones to get you started.

- [Perceptron](http://scikit-learn.org/stable/modules/linear_model.html#perceptron)
- [K Nearest Neighbors](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
- [LinearSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)
- [RandomForest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

For each classifier, the workflow is similar:

1. Create classifier and specify parameters
2. Call the `fit` method on the training data.
3. Call `score` to get accuracy on the test data OR call `predict` to actually get the prediction values.

Note: For `fit` and `score`, you will need to pass in both the text matrix and the labels. For `predict`, you will only need to pass in the text matrix.
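The three steps above can be sketched end to end on a handful of made-up reviews. All of the `toy_*` names and the example texts are assumptions for illustration; Multinomial Naive Bayes is used only because it trains instantly on tiny data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training and test reviews with labels (1 = good, 0 = bad)
toy_train_texts = ["great film", "wonderful acting", "boring plot", "terrible film"]
toy_train_labels = [1, 1, 0, 0]
toy_test_texts = ["wonderful film", "boring film"]
toy_test_labels = [1, 0]

# Build the feature matrices
toy_vec = CountVectorizer().fit(toy_train_texts)
X_toy_train = toy_vec.transform(toy_train_texts)
X_toy_test = toy_vec.transform(toy_test_texts)

# 1. Create the classifier
toy_clf = MultinomialNB()
# 2. Fit on the training matrix and labels
toy_clf.fit(X_toy_train, toy_train_labels)
# 3. Score needs matrix and labels; predict needs only the matrix
toy_acc = toy_clf.score(X_toy_test, toy_test_labels)
toy_pred = toy_clf.predict(X_toy_test)
print("Accuracy:", toy_acc)
print("Predictions:", toy_pred)
```

Swapping in another classifier only changes the line that creates `toy_clf`; the `fit`, `score`, and `predict` calls stay the same.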

---

**Exercise: Try each of the classifiers above and compare the performance**

**Exercise: Try changing the "K" in the KNN method**

`KNeighborsClassifier(n_neighbors=8)`
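One way to compare several values of K is a simple loop. The sketch below uses synthetic data from `make_classification` rather than the movie reviews, so the scores it prints say nothing about the real task; with the notebook's data you would pass `X_train`, `X_test`, and the sentiment labels instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the review dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

# Fit one classifier per K and record its test accuracy
knn_scores = {}
for k in [1, 3, 8, 15]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_tr)
    knn_scores[k] = knn.score(X_te, y_te)

print(knn_scores)
```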

**Exercise: Try changing the number of trees (n_estimators) in the Random Forest Method**

`RandomForestClassifier(n_estimators=8)`

**Exercise: Try changing the other parameters under Random Forest Method**

See the parameters defined in the Random Forest link above.

Feel free to experiment with the settings in the other classifiers, but the Random Forest ones are the most intuitive. 


Each classifier comes with a number of parameters. In practice, these parameters are not set by hand, because it can be hard to predict the effects of each one. Instead, the best parameter settings are found automatically through a process called *hyperparameter tuning*. You can read about it [here](http://scikit-learn.org/0.15/auto_examples/randomized_search.html), but it basically means using a script to try a bunch of parameter settings.
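As a rough sketch of what such a script can look like, the example below uses scikit-learn's `RandomizedSearchCV` to sample a few parameter combinations for a Random Forest. The data is synthetic and the parameter lists in `param_options` are arbitrary choices for illustration, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption, not the review dataset)
X_syn, y_syn = make_classification(n_samples=120, n_features=10, random_state=0)

# Candidate values to sample from (arbitrary illustrative choices)
param_options = {
    "n_estimators": [5, 10, 20],
    "min_samples_split": [2, 5, 10],
    "max_depth": [None, 3, 5],
}

# Try 5 random combinations, each evaluated with 3-fold cross-validation
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_options,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(search.best_score_)
```

Random search only tries a sample of the combinations, which keeps the run time down when the number of possible settings is large.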

*Bonus Exercise: Experiment with [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) or with the [Bagging Classifier](http://scikit-learn.org/0.15/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier)*

*Bonus Exercise: Create an [ensemble](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html) of your favorite classifiers. (Hint: you pass in the classifiers you want to use)*

**Answer below**

*Bonus Exercise: Use the link above to automatically try different settings for one of the classifiers*

**Answer below** + http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV



In [53]:
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
# LinearSVC is a linear model (a linear support vector classifier)
clf = LinearSVC()

# Specify the input, then the output
clf.fit(X_train, train_data.sentiment)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

Finally, we evaluate the accuracy. 

Use the exercises above to see how different changes affect the accuracy.

In [52]:
# Score method outputs the accuracy
print("Accuracy:", clf.score(X_test, test_data.sentiment))

Accuracy: 0.754
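For classifiers, `score` reports the fraction of predictions that match the true labels. The sketch below checks this by hand on synthetic data (an assumption for illustration; the numbers are unrelated to the review dataset).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption, not the review dataset)
X_syn, y_syn = make_classification(n_samples=100, n_features=5, random_state=0)

toy_clf = LinearSVC().fit(X_syn, y_syn)

# Accuracy by hand: the share of predictions equal to the true labels
manual_accuracy = np.mean(toy_clf.predict(X_syn) == y_syn)
score_accuracy = toy_clf.score(X_syn, y_syn)
print(manual_accuracy, score_accuracy)
```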


In [12]:
# See the first 10 predictions 
clf.predict(X_test)[:10]

array([0, 0, 1, 1, 0, 0, 1, 1, 1, 0])

In [42]:
# Ensemble example
from sklearn.ensemble import VotingClassifier

# Meta classifier with three inner ones (the first part of the tuple is just a label)
clf = VotingClassifier([('svc', LinearSVC()), 
                        ('rfc', RandomForestClassifier(n_estimators=30)),
                        ('mnb', MultinomialNB())])
clf.fit(X_train, train_data.sentiment)

print("Accuracy:", clf.score(X_test, test_data.sentiment))

Accuracy: 0.856


In [54]:
# Automatic parameter search
from sklearn.model_selection import GridSearchCV

# GridSearchCV is a wrapper around a classifier and a set of parameters
# Here we will try different settings for the number of estimators and 
# the min_samples_split (another param)
grid_search = GridSearchCV(RandomForestClassifier(), 
                            {'n_estimators': [10, 20, 30],
                            'min_samples_split': [2, 5, 10]})
grid_search.fit(X_train, train_data.sentiment)
grid_search.best_params_



{'min_samples_split': 5, 'n_estimators': 30}

In [55]:
grid_search.score(X_test, test_data.sentiment)

0.806

### Understanding performance

Accuracy is a good first metric to look at to get a sense of model performance; it also lets us compare different parameter settings. However, it does not give us good insight into why the model works.

To get a better understanding, we will analyze the weights that the model assigns to different words. This is similar to the exercise that we did in the slides.

*Note: this analysis will not work with every classifier. It needs a model that exposes per-word weights in `coef_`, such as a linear model.*
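If you are unsure whether a given classifier supports this analysis, one simple check is whether the fitted model has a `coef_` attribute. The sketch below tries three of the classifiers from this notebook on synthetic data (an assumption for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption, not the review dataset)
X_syn, y_syn = make_classification(n_samples=60, n_features=5, random_state=0)

models = [
    LinearSVC(),
    KNeighborsClassifier(),
    RandomForestClassifier(n_estimators=5, random_state=0),
]

# Record which fitted models expose per-feature weights via coef_
has_weights = {}
for model in models:
    model.fit(X_syn, y_syn)
    has_weights[type(model).__name__] = hasattr(model, "coef_")

print(has_weights)
```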

The weights are stored in the `clf.coef_[0]` array. We can map it to the words as follows:


In [15]:
# First, we want to put the words in the order of the coefficients
words_ordered = sorted(vec.vocabulary_.items(), key=operator.itemgetter(1))
# Throw out the indices
words_ordered = [x[0] for x in words_ordered]

# Get the weights
weights = clf.coef_[0]

In [16]:
# Pair the words with the coefficients and order them by weight
word_weights = list(zip(words_ordered, weights))
word_weights_sort = sorted(word_weights, key=operator.itemgetter(1))

In [17]:
# Now look for the most positive and negative words
word_weights_sort[:10]

[('worst', -45.0),
 ('boring', -27.0),
 ('waste', -27.0),
 ('terrible', -24.0),
 ('awful', -22.0),
 ('nothing', -22.0),
 ('bottom', -21.0),
 ('lame', -21.0),
 ('dog', -20.0),
 ('bad', -19.0)]

In [23]:
word_weights_sort[-10:]

[('fun', 16.0),
 ('amazing', 17.0),
 ('enjoyed', 18.0),
 ('everything', 18.0),
 ('liked', 18.0),
 ('brilliant', 19.0),
 ('great', 19.0),
 ('shows', 19.0),
 ('excellent', 23.0),
 ('wonderful', 26.0)]

Congratulations! You have built your first machine learning model!