# Random Forest for sentiment analysis

In the previous chapter we saw that one of the most promising models was based on Random Forest. It combines fairly good accuracy with reasonable training and prediction speed. Speed matters if we want to move the model to production, where the most accurate architecture is not always a feasible choice. Let's experiment with the hyperparameters of the Random Forest classifier to optimize its accuracy. The full list of parameters and their allowed values is in the scikit-learn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

From our perspective, the following parameters are worth a closer look:

- **n_estimators** - number of decision trees in the forest; defaults to 10 in the scikit-learn version used here (100 since version 0.22)
- **criterion** - function measuring the quality of a split; can be set to "gini" (default) or "entropy"
- **max_features** - maximum number of features considered when looking for the best split (int - exact number, float - fraction of all features, "auto", "sqrt", "log2", None)
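To get a feeling for what these options mean in practice, here is a minimal sketch that counts the combinations in the grid explored in this chapter and estimates how many features each `max_features` setting examines per split. The helper `features_per_split` and the vocabulary size of 10,000 are assumptions made for illustration only; the helper approximates the behaviour described in the scikit-learn documentation and is not part of the library.

```python
import itertools
import math

# Same candidate values as the grid explored in this chapter
N_ESTIMATORS = (5, 10, 25, 50, 100)
CRITERION = ("gini", "entropy")
MAX_FEATURES = ("auto", "log2", None)


def features_per_split(max_features, n_features):
    """Hypothetical helper: approximate number of features examined per split."""
    if max_features is None:
        return n_features
    if max_features in ("auto", "sqrt"):
        # For classifiers "auto" behaves like "sqrt"
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    # A float is treated as a fraction of all features
    return max(1, int(max_features * n_features))


# Every combination corresponds to one model that has to be trained
grid = list(itertools.product(N_ESTIMATORS, CRITERION, MAX_FEATURES))
print("Combinations to train:", len(grid))

n_features = 10_000  # assumed vocabulary size, for illustration only
for option in MAX_FEATURES:
    print(option, "->", features_per_split(option, n_features))
```

With a large TF-IDF vocabulary the gap between `None` (every feature at every split) and `"log2"` (a handful of features) is huge, which is why `max_features` tends to have a strong effect on training time.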

In [None]:
%run 02_data_preparation.ipynb

import itertools

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


N_ESTIMATORS = (5, 10, 25, 50, 100)
CRITERION = ("gini", "entropy")
# "auto" is equivalent to "sqrt" for classifiers; it was removed in scikit-learn 1.3
MAX_FEATURES = ("auto", "log2", None)

# Split the dataset into training and test sets
train_messages, test_messages, train_targets, test_targets = train_test_split(tweets["text"], 
                                                                              tweets["sentiment"],
                                                                              test_size=0.2)

# Vectorize the preprocessed sentences once: the features do not depend on
# the classifier's hyperparameters, so there is no need to recompute them
vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(train_messages)
test_features = vectorizer.transform(test_messages)

for n_estimators, criterion, max_features in itertools.product(N_ESTIMATORS,
                                                               CRITERION,
                                                               MAX_FEATURES):
    # Define the classifier instance
    classifier = RandomForestClassifier(random_state=2018,
                                        n_estimators=n_estimators,
                                        criterion=criterion,
                                        max_features=max_features)

    # Train the model
    %time fit = classifier.fit(train_features.toarray(), train_targets)

    # Check the accuracy of the model on test data and display it
    %time test_predictions = fit.predict(test_features.toarray())
    # accuracy_score expects the true labels first, then the predictions
    accuracy = accuracy_score(test_targets, test_predictions)
    print("Vectorizer: {}\nClassifier: {}\nAccuracy score:{}\n".format(vectorizer,
                                                                       classifier, 
                                                                       accuracy))

CPU times: user 5.65 s, sys: 540 ms, total: 6.19 s
Wall time: 6.19 s
CPU times: user 100 ms, sys: 120 ms, total: 220 ms
Wall time: 220 ms
Vectorizer: TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
Classifier: RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=5, n_jobs=1, oob_score=False, random_state=2018,
            verbose=0, warm_start=False)
Accuracy sc