# Random Forest for sentiment analysis

In the previous chapter we saw that one of the most promising models was based on Random Forest. It combines quite good accuracy with reasonable training and prediction speed. That matters if we'd like to move our model to production, where the most accurate architecture is not always a feasible choice.

We are going to begin with a simple description of Random Forest, to understand how it works under the hood. In simple words, a Random Forest is an ensemble of Decision Trees that vote in order to reach a single decision about which class a sample belongs to. Each Decision Tree considers a randomly selected subset of features when making its decisions.
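The voting idea can be sketched in a few lines of plain Python. The three "trees" below are hypothetical hand-written rules, each looking at a different subset of word features; they are not learned from data and only illustrate how individual votes are combined. Note that scikit-learn's implementation averages the class probabilities predicted by the trees instead of counting hard votes, so treat this as a simplified picture.

```python
from collections import Counter


# Hypothetical "trees": each one inspects only a small subset of word features.
def tree_a(words):
    return "positive" if "great" in words else "negative"


def tree_b(words):
    return "negative" if "awful" in words or "not" in words else "positive"


def tree_c(words):
    return "positive" if "love" in words or "great" in words else "negative"


FOREST = (tree_a, tree_b, tree_c)


def forest_predict(sentence):
    """Return the majority label and the vote counts for a sentence."""
    words = set(sentence.lower().split())
    votes = Counter(tree(words) for tree in FOREST)
    label, _ = votes.most_common(1)[0]
    return label, dict(votes)


print(forest_predict("I love this great phone"))
print(forest_predict("great but awful"))
```

In the second call the trees disagree, and the label with the most votes wins.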

## Tuning the model parameters

Let's play a little bit with the hyperparameters of the Random Forest Classifier, to optimize its accuracy. First of all, let's list some of the possible parameters and their values: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

From our perspective the following parameters look like the ones we should test out:

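- **n_estimators** - number of decision trees used to make a forest; the default was 10 in older scikit-learn releases and is 100 since version 0.22
- **criterion** - quality function for measuring a split, can be set to "gini" (default) or "entropy"
- **max_features** - the maximum number of features to consider when looking for the best split (int - exact number, float - fraction of all features, "auto", "sqrt", "log2", None); for classifiers "auto" meant the same as "sqrt" and was removed in scikit-learn 1.3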
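To get a feeling for what the different `max_features` settings mean in practice, the sketch below converts each setting into the number of features examined at a single split. The helper `features_per_split` and the vocabulary size are hypothetical and only mirror the rules described in the scikit-learn documentation; they are not part of the library.

```python
import math


def features_per_split(max_features, n_features):
    """Number of candidate features examined at each split (illustrative)."""
    if max_features is None:
        return n_features                      # every feature is considered
    if max_features in ("auto", "sqrt"):       # "auto" meant "sqrt" for classifiers
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):        # a fraction of all features
        return max(1, int(max_features * n_features))
    return int(max_features)                   # an exact number


N_FEATURES = 10_000  # hypothetical TF-IDF vocabulary size
for setting in ("auto", "log2", None, 0.1, 50):
    print(setting, "->", features_per_split(setting, N_FEATURES))
```

With a large vocabulary, `None` makes every split scan all features, so training with that setting can be expected to take much longer than with "auto" or "log2".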

In [None]:
%run 02_data_preparation.ipynb

import itertools

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


N_ESTIMATORS = (5, 10, 25, 50, 100)
CRITERION = ("gini", "entropy")
MAX_FEATURES = ("auto", "log2", None)

# Split the dataset into train and test sets
train_messages, test_messages, train_targets, test_targets = train_test_split(tweets["text"], 
                                                                              tweets["sentiment"],
                                                                              test_size=0.2)

vectorizer = TfidfVectorizer()
for n_estimators, criterion, max_features in itertools.product(N_ESTIMATORS,
                                                               CRITERION,
                                                               MAX_FEATURES):
    # Define the classifier instance
    classifier = RandomForestClassifier(random_state=2018, 
                                        n_estimators=n_estimators, 
                                        criterion=criterion, 
                                        max_features=max_features)
    # Vectorize preprocessed sentences
    train_features = vectorizer.fit_transform(train_messages)

    # Train the model
    %time fit = classifier.fit(train_features.toarray(), train_targets)

    # Check the accuracy of the model on test data and display it
    test_features = vectorizer.transform(test_messages)
    train_predictions = fit.predict(train_features.toarray())
    # accuracy_score expects the true labels first, then the predictions
    train_accuracy = accuracy_score(train_targets, train_predictions)
    test_predictions = fit.predict(test_features.toarray())
    test_accuracy = accuracy_score(test_targets, test_predictions)
    print("Configuration: n_estimators = {}, criterion = {}, max_features = {}\n"
          "Train accuracy score: {}\n"
          "Test accuracy score: {}\n".format(n_estimators, criterion, max_features, 
                                             train_accuracy, test_accuracy))

CPU times: user 6.03 s, sys: 556 ms, total: 6.59 s
Wall time: 6.59 s
Configuration: n_estimators = 5, criterion = gini, max_features = auto
Train accuracy score: 0.9596994535519126
Test accuracy score: 0.6963797814207651

CPU times: user 2.87 s, sys: 528 ms, total: 3.4 s
Wall time: 3.4 s
Configuration: n_estimators = 5, criterion = gini, max_features = log2
Train accuracy score: 0.9631147540983607
Test accuracy score: 0.6796448087431693

CPU times: user 5min, sys: 745 ms, total: 5min
Wall time: 5min 1s
Configuration: n_estimators = 5, criterion = gini, max_features = None
Train accuracy score: 0.9564549180327869
Test accuracy score: 0.7110655737704918

CPU times: user 5.25 s, sys: 564 ms, total: 5.82 s
Wall time: 5.82 s
Configuration: n_estimators = 5, criterion = entropy, max_features = auto
Train accuracy score: 0.9605532786885246
Test accuracy score: 0.6960382513661202

CPU times: user 2.77 s, sys: 660 ms, total: 3.43 s
Wall time: 3.43 s
Configuration: n_estimators = 5, criterion = 

It turned out the following configuration achieves the best accuracy on our test dataset:

`n_estimators = 100, criterion = gini, max_features = auto`

For that reason we are going to create a simple application that uses these parameters for training. It will be a console application that reads sentences from the user and classifies their sentiment.
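The comparison above picks the hyperparameters by their score on the test set, which tends to make that score optimistic. A common alternative, not used in this chapter, is a cross-validated search with scikit-learn's `GridSearchCV` over a `Pipeline`. The sketch below assumes a tiny hypothetical corpus in place of `tweets["text"]` and `tweets["sentiment"]`, a reduced grid, and `"sqrt"` instead of `"auto"` so that it also runs on recent scikit-learn releases.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the tweets dataset
texts = [
    "I love this phone", "what a great day", "really happy with it",
    "best service ever", "I hate waiting", "this is awful",
    "worst flight ever", "really bad support",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Vectorizer and classifier are chained, so each fold fits its own vocabulary
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("forest", RandomForestClassifier(random_state=2018)),
])

# Parameter names are prefixed with the pipeline step they belong to
param_grid = {
    "forest__n_estimators": [5, 10, 25],
    "forest__criterion": ["gini", "entropy"],
    "forest__max_features": ["sqrt", "log2", None],
}

search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=2)
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)
```

With this approach the test set can be kept aside and used only once, to report the accuracy of the selected configuration.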

## Exercise

Please fill in the gaps in the source code below, in order to create an application that will:

- create a Random Forest based model and train it on the whole dataset used in the previous examples, using the selected parameters
- continuously read sentences from the standard input and classify them with the created model, until the "exit" sentence is passed
- display the probabilities of belonging to each class

Some of the functionalities are already prepared - please complete the application.

In [None]:
# Vectorize the dataset
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(tweets["text"])

# Create the model and train it
model = RandomForestClassifier(random_state=2018)
fit_model = model.fit(features, tweets["sentiment"])

# Continuously read the sentences from the standard input
# and classify them with the created model
sentence = None
while True:
    sentence = input("Sentence ('exit' to close): ")
    if sentence == "exit":
        break
    # Classify the message and display the probabilities
    sentence_features = vectorizer.transform((sentence, ))
    probabilities = fit_model.predict_proba(sentence_features)
    # zip() is lazy in Python 3, so convert it to a dict before printing
    print("Sentence: {}\nProbabilities: {}".format(sentence, 
                                                   dict(zip(fit_model.classes_, probabilities[0]))))
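When displaying the result, it helps to pair each class label with its probability explicitly. `predict_proba` returns one row per input sentence, with the columns in the same order as `classes_`, and in Python 3 `zip` is lazy, so it has to be materialized before printing. A minimal sketch, assuming hypothetical labels and values in place of `fit_model.classes_` and `probabilities[0]`; the helper `format_probabilities` is made up for illustration:

```python
def format_probabilities(classes, row):
    """Return the most likely class and a readable summary of all classes."""
    paired = dict(zip(classes, row))
    best = max(paired, key=paired.get)
    summary = ", ".join("{}: {:.2f}".format(label, probability)
                        for label, probability in paired.items())
    return best, summary


# Hypothetical stand-ins for fit_model.classes_ and probabilities[0]
classes = ["negative", "neutral", "positive"]
row = [0.2, 0.1, 0.7]

print(format_probabilities(classes, row))
```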