Toy Example above TBD

Now it's your turn! The rest of the notebook takes you through the process of creating a classifier on your own. An example implementation is already filled in to help you get started. 

The exercises are there to guide you, but feel free to experiment beyond them. 

### Loading the data

For this task we will be using a real movie review dataset. The dataset contains the text of each review and a sentiment label (0 = bad, 1 = good). Both the train and test sets contain 50% of each label.

In [315]:
import pandas as pd

train_data = pd.read_csv('train.data')
train_data[:1]

Unnamed: 0,text,sentiment
0,"I felt a great joy, after seeing this film, no...",1


In [312]:
print(train_data.text.iloc[0])


I felt a great joy, after seeing this film, not because it is a master piece, but because it convinced me of, that the Portuguese cinema became really very good. We can see here the best Portuguese actores in this field.


In [313]:
print(train_data.text.iloc[1])

Henry Thomas showed a restraint, even when the third act turned into horrible hollywood resolution that could've killed this movie, that kept the dignity of a redemption story and as for pure creepiness-sniffing babies?


In [314]:
test_data = pd.read_csv('test.data')
test_data[:1]

Unnamed: 0,text,sentiment
0,"This film is pretty good, it actually is like ...",1


The data is stored in a Pandas DataFrame. 

To access the columns use `train_data.text` or `train_data.sentiment`.

If you are having trouble with it, I've put each column into a separate variable below.

In [310]:
train_text = train_data.text.values
train_sentiment = train_data.sentiment.values

test_text = test_data.text.values
test_sentiment = test_data.sentiment.values
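
As a small illustration of the column access described above, here is a sketch on a made-up DataFrame. The name `toy_data` and its rows are invented for this example; only the column names `text` and `sentiment` come from the real dataset. It also shows one way to check the label balance.

```python
import pandas as pd

# Hypothetical stand-in for train_data: same column names, invented rows.
toy_data = pd.DataFrame({
    "text": ["great film", "terrible plot", "loved it", "boring and slow"],
    "sentiment": [1, 0, 1, 0],
})

# Attribute access and bracket access return the same column.
same_column = toy_data.text.equals(toy_data["text"])

# .values converts a column into a NumPy array.
toy_sentiment = toy_data.sentiment.values

# Share of each label; a balanced dataset gives 0.5 for both.
label_share = toy_data.sentiment.value_counts(normalize=True).sort_index()
print(label_share)
```

Running the same `value_counts` call on `train_data.sentiment` should confirm the 50/50 split, assuming the files match the description.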

### Creating the features

The next step is to create features. Scikit-learn provides two classes to help us:

[CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html): Counts the words in a piece of text.

[TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer): A scaled version of the counts that gives rare words a higher weight.

Note: n-grams are phrases of n consecutive words (e.g., 'very cool' is a bigram, or 2-gram).
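
To make the n-gram idea concrete, here is a minimal sketch on a two-sentence corpus. The corpus and the variable names (`toy_corpus`, `vec_ngrams`, `X_toy`) are made up for illustration. It uses the `ngram_range` parameter of `CountVectorizer` to count both single words and two-word phrases.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus, not part of the review dataset.
toy_corpus = ["very cool movie", "not very cool"]

# ngram_range=(1, 2) keeps single words (unigrams) and two-word phrases (bigrams).
vec_ngrams = CountVectorizer(ngram_range=(1, 2))
X_toy = vec_ngrams.fit_transform(toy_corpus)

# vocabulary_ maps each n-gram to its column index in the matrix.
vocab = sorted(vec_ngrams.vocabulary_)
print(vocab)
print(X_toy.toarray())
```

The bigram 'very cool' gets its own column, so the model can treat it differently from 'very' and 'cool' on their own.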

*Exercise: Try using binary counts instead of actual counts*

*Exercise: Try using the TfidfVectorizer and compare performance*

*Exercise: Try different settings for ngram_range - which range is optimal?*


In [271]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vec = CountVectorizer()

# First call fit to learn the vocabulary (which words to count)
vec.fit(train_data.text)

# Now turn the texts into count matrices - the Xs refer to the matrix form
X_train = vec.transform(train_data.text)
X_test = vec.transform(test_data.text)
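
The sketch below shows why `fit` is called on the training text only. The vocabulary is fixed at that point, so words that appear only in the test text are ignored by `transform`. The two tiny corpora and all `toy_*` names are invented for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented examples; "soundtrack" never appears in the training text.
toy_train = ["good movie", "bad movie"]
toy_test = ["good soundtrack"]

toy_vec = CountVectorizer()
toy_vec.fit(toy_train)  # learns the vocabulary: bad, good, movie

# transform only counts words from the learned vocabulary.
X_toy_test = toy_vec.transform(toy_test)

# Columns are in alphabetical order: bad, good, movie.
print(X_toy_test.toarray())
```

Only the column for 'good' is non-zero here, because 'soundtrack' has no column of its own.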


### Choosing a model

- [K Nearest Neighbors](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
- [LinearSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)
- [RandomForest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

For each classifier, the workflow is similar:

1. Create the classifier and specify its parameters
2. Call the `fit` method on the training data
3. Call the `score` method to get the accuracy on the test data
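
The three steps above can be sketched on a toy problem. This example uses Multinomial Naive Bayes from the list, but any of the other classifiers could be swapped in. The texts, labels and `toy_*` names are invented, so the accuracy says nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy data: 1 = good, 0 = bad.
toy_train_text = ["great film", "wonderful acting", "awful film", "terrible acting"]
toy_train_labels = [1, 1, 0, 0]
toy_test_text = ["great acting", "awful acting"]
toy_test_labels = [1, 0]

# Build the count features, fitting the vocabulary on the training text only.
toy_vec = CountVectorizer()
X_toy_train = toy_vec.fit_transform(toy_train_text)
X_toy_test = toy_vec.transform(toy_test_text)

# 1. Create the classifier and specify its parameters (alpha is the smoothing strength).
toy_clf = MultinomialNB(alpha=1.0)

# 2. Call fit on the training data: input first, then output.
toy_clf.fit(X_toy_train, toy_train_labels)

# 3. Call score to get the accuracy on the test data.
toy_accuracy = toy_clf.score(X_toy_test, toy_test_labels)
print("Toy accuracy:", toy_accuracy)
```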


**Exercise: Try each of the classifiers above and compare the performance**

**Exercise: Try changing the "K" in the KNN method**

**Exercise: Try changing the number of trees (n_estimators) in the Random Forest Method**

**Exercise: Try changing the other parameters under Random Forest Method**

Feel free to experiment with the settings in the other classifiers, but the Random Forest ones are the most intuitive. 


Each classifier comes with a number of parameters. In practice, these parameters are not set by hand, because it can be hard to predict the effect of each one. Instead, the best parameter settings are found automatically through a process called *hyperparameter tuning*. You can read about it [here](http://scikit-learn.org/0.15/auto_examples/randomized_search.html), but it basically means using a script to try a bunch of parameter settings.
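
As a rough sketch of what such a script can look like, the example below uses `RandomizedSearchCV` from `sklearn.model_selection` to try a few Random Forest settings. The toy texts, the candidate values in `param_options`, and the choices `n_iter=4` and `cv=2` are all assumptions made to keep the example small; they are not recommended settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV

# Invented toy data: 1 = good, 0 = bad.
toy_text = [
    "great film", "wonderful acting", "loved this movie", "brilliant and moving",
    "awful film", "terrible acting", "hated this movie", "boring and slow",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_toy = CountVectorizer().fit_transform(toy_text)

# Candidate values for each parameter (arbitrary choices for illustration).
param_options = {
    "n_estimators": [5, 10, 20],
    "max_depth": [2, None],
}

# Try 4 randomly chosen combinations, scoring each with 2-fold cross-validation.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_options,
    n_iter=4,
    cv=2,
    random_state=0,
)
search.fit(X_toy, toy_labels)

print("Best settings:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

The search scores each setting on held-out folds of the training data, so the test set is still untouched when you evaluate the final model.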

*Bonus Exercise: Experiment with [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) or with the [Bagging Classifier](http://scikit-learn.org/0.15/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier)*

*Bonus Exercise: Create an [ensemble](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html) of your favorite classifiers. (Hint: you pass in the classifiers you want to use)*

*Bonus Exercise: Use the link above to automatically try different settings for one of the classifiers*



In [287]:
from sklearn.linear_model import Perceptron

# Perceptron is a very basic linear model
clf = Perceptron()

# Specify the input, then the output
clf.fit(X_train, train_data.sentiment)



Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0,
      fit_intercept=True, max_iter=None, n_iter=None, n_iter_no_change=5,
      n_jobs=None, penalty=None, random_state=0, shuffle=True, tol=None,
      validation_fraction=0.1, verbose=0, warm_start=False)

Finally, we evaluate the accuracy. 

Use the exercises above to see how different changes affect the accuracy.

In [291]:
# The score method returns the accuracy
print("Accuracy:", clf.score(X_test, test_data.sentiment))

Accuracy: 0.758
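
For classifiers, accuracy is the fraction of predictions that match the true labels. The sketch below computes it by hand and with `accuracy_score` from `sklearn.metrics`. The two label arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented labels: three of the four predictions are correct.
toy_true = np.array([1, 0, 1, 1])
toy_pred = np.array([1, 0, 0, 1])

# Fraction of positions where the prediction equals the true label.
manual_accuracy = np.mean(toy_true == toy_pred)

# The same quantity computed by scikit-learn.
library_accuracy = accuracy_score(toy_true, toy_pred)

print(manual_accuracy, library_accuracy)
```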
