In [None]:
%run 02_data_preparation.ipynb

import itertools

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


N_ESTIMATORS = (5, 10, 25, 50, 100)
CRITERION = ("gini", "entropy")
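# "auto" means sqrt(n_features) for classifiers; it was removed in
# scikit-learn 1.3, where "sqrt" has to be used instead
MAX_FEATURES = ("auto", "log2", None)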

# Split the dataset into training (80%) and test (20%) subsets
train_messages, test_messages, train_targets, test_targets = train_test_split(tweets["text"], 
                                                                              tweets["sentiment"],
                                                                              test_size=0.2)


# Vectorize the messages once; the TF-IDF features do not depend on the classifier settings
vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(train_messages)
test_features = vectorizer.transform(test_messages)

for n_estimators, criterion, max_features in itertools.product(N_ESTIMATORS,
                                                               CRITERION,
                                                               MAX_FEATURES):
    # Define the classifier instance
    classifier = RandomForestClassifier(random_state=2018,
                                        n_estimators=n_estimators,
                                        criterion=criterion,
                                        max_features=max_features)

    # Train the model; random forests accept the sparse TF-IDF matrix directly
    %time fit = classifier.fit(train_features, train_targets)

    # Check the accuracy of the model on train and test data and display it
    train_predictions = fit.predict(train_features)
    train_accuracy = accuracy_score(train_targets, train_predictions)
    test_predictions = fit.predict(test_features)
    test_accuracy = accuracy_score(test_targets, test_predictions)
    print("Configuration: n_estimators = {}, criterion = {}, max_features = {}\n"
          "Train accuracy score: {}\n"
          "Test accuracy score: {}\n".format(n_estimators, criterion, max_features, 
                                             train_accuracy, test_accuracy))

It turned out that the following configuration achieved the best accuracy on our test dataset:

`n_estimators = 100, criterion = entropy, max_features = auto`

For that reason, we are going to create a simple application that uses these parameters for training. It will be a console application that reads sentences from the user and classifies their sentiment.

## Exercise

Please fill in the gaps in the source code in *exercise/exercise_03.py* to create an application that will:

- create a Random Forest based model and train it on the whole dataset used in the previous examples, using the selected parameters (`n_estimators = 100, criterion = entropy, max_features = auto`)
- continuously read sentences from the standard input and classify them with the created model, until the sentence "exit" is entered
- display the probabilities of belonging to each class

Some of the functionality is already in place; please complete the remaining source code.
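
As a hint for the last point, the snippet below is a minimal sketch of how class probabilities can be read from a fitted forest. The sentences, the labels and the `demo_` names are made up for illustration and are not part of *exercise/exercise_03.py*; only `predict_proba` and `classes_` are the scikit-learn pieces the exercise is likely to need.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training sentences; the exercise itself uses the tweets dataset
texts = ["great flight", "awful delay", "lovely crew", "terrible service"]
labels = ["positive", "negative", "positive", "negative"]

demo_vectorizer = TfidfVectorizer()
demo_classifier = RandomForestClassifier(n_estimators=100,
                                         criterion="entropy",
                                         random_state=2018)
demo_classifier.fit(demo_vectorizer.fit_transform(texts), labels)

# predict_proba returns one row per sentence and one column per entry of classes_
sentence_features = demo_vectorizer.transform(["what a great crew"])
probabilities = demo_classifier.predict_proba(sentence_features)[0]
for label, probability in zip(demo_classifier.classes_, probabilities):
    print("{}: {:.2f}".format(label, probability))
```

The columns follow the order of `classes_`, so pairing the two with `zip` avoids hard-coding which column is which class.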

# Random Forest for sentiment analysis

In the previous chapter we saw that one of the most promising models was based on Random Forest. It combines fairly good accuracy with good runtime performance. This matters if we would like to move our model to production, because the most accurate architecture is not always a feasible choice.

We are going to begin with a simple description of Random Forest, to understand how it works under the hood. In simple terms, a Random Forest is an ensemble of Decision Trees that vote to reach a single decision about which class a sample belongs to. Each Decision Tree is trained on a random sample of the training data and considers only a randomly selected subset of features at each split.

![Random Forest architecture](images/random-forest.png)
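
To make the voting concrete, here is a small sketch on synthetic data (an assumption for illustration, not our tweets). In scikit-learn the "vote" is an average of the per-tree class probabilities rather than a count of hard votes, and the snippet checks that averaging the trees by hand gives the same result as the forest itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the vectorized tweets
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# An odd number of trees avoids exact ties between the two classes
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Every fitted tree is available in estimators_ and predicts on its own
per_tree_proba = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average the trees' probabilities and pick the most probable class
averaged_proba = per_tree_proba.mean(axis=0)
manual_predictions = forest.classes_[averaged_proba.argmax(axis=1)]

print(np.allclose(averaged_proba, forest.predict_proba(X)))
print((manual_predictions == forest.predict(X)).all())
```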

## Tuning the model parameters

Let's experiment a little with the hyperparameters of the Random Forest Classifier to optimize its accuracy. First of all, the possible parameters and their values are listed in the scikit-learn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

From our perspective the following parameters look like the ones we should test out:

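- **n_estimators** - the number of decision trees in the forest; the default was 10 in older scikit-learn releases and is 100 since version 0.22
- **criterion** - the function measuring the quality of a split, can be set to "gini" (default) or "entropy"
- **max_features** - the maximum number of features to consider when looking for the best split (int - exact number, float - fraction of all features, "auto", "sqrt", "log2", None); for classifiers "auto" meant the same as "sqrt", and it was removed in scikit-learn 1.3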
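
A search over such a grid does not have to be written by hand. The sketch below shows one possible alternative using `GridSearchCV` with a `Pipeline`; the tiny corpus and the reduced grid are made up so that the example runs quickly, and `"sqrt"` is used in place of `"auto"` so that it also works on recent scikit-learn releases. Note that it scores each configuration with cross-validation rather than a single train/test split.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus standing in for tweets["text"] and tweets["sentiment"]
texts = ["great flight", "awful delay", "lovely crew", "terrible service",
         "really great crew", "awful terrible delay", "lovely flight", "bad service"]
labels = ["positive", "negative", "positive", "negative",
          "positive", "negative", "positive", "negative"]

# The pipeline refits the vectorizer inside every cross-validation fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("forest", RandomForestClassifier(random_state=2018)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "forest__n_estimators": [5, 10],
    "forest__criterion": ["gini", "entropy"],
    "forest__max_features": ["sqrt", "log2", None],
}

search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```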

In [None]:
%run exercise/exercise_03.py

## Feature importance

One of the biggest advantages of the Random Forest classifier is its ability to report the importance of the features it uses. This lets us check which variables have the most predictive power and understand how the model makes its decisions. The following code snippet displays the feature importances of our model, ordered from the most to the least important:

In [None]:
%matplotlib inline

# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
feature_importances = pd.DataFrame(classifier.feature_importances_,
                                   index=vectorizer.get_feature_names_out(),
                                   columns=("importance", ))
feature_importances = feature_importances.sort_values("importance", 
                                                      ascending=False)
feature_importances.head(n=100)

In [None]:
feature_importances.tail(n=10)
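
To see what these numbers mean, the sketch below fits a small forest on a made-up corpus (the sentences and the `demo_` names are assumptions for illustration). It pairs each importance with its token through the vectorizer's `vocabulary_` mapping, which works regardless of the scikit-learn version, and shows that the importances are non-negative and normalized to sum to 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences standing in for the tweets dataset
texts = ["great flight", "awful delay", "lovely crew", "terrible service",
         "really great crew", "awful terrible delay", "lovely flight", "bad service"]
labels = ["positive", "negative", "positive", "negative",
          "positive", "negative", "positive", "negative"]

demo_vectorizer = TfidfVectorizer()
features = demo_vectorizer.fit_transform(texts)
demo_classifier = RandomForestClassifier(n_estimators=50, random_state=2018)
demo_classifier.fit(features, labels)

importances = demo_classifier.feature_importances_

# vocabulary_ maps token -> column index; sorting by index aligns tokens with columns
tokens = sorted(demo_vectorizer.vocabulary_, key=demo_vectorizer.vocabulary_.get)

# Print the five most important tokens
for index in np.argsort(importances)[::-1][:5]:
    print("{:<10} {:.3f}".format(tokens[index], importances[index]))

print("sum of importances:", importances.sum())
```

Because the values are relative shares, a long tail of near-zero importances, like the one shown by `tail`, simply means that those tokens were rarely or never used for a split.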