# Sentiment analysis methods

No ML method is designed specifically for sentiment analysis, so it is worth testing several algorithms and choosing the one that performs best. In our tests we will use the scikit-learn library, which is widely used, especially for building proofs of concept (POCs).

To compare models on a given problem, we need a way to measure their quality. In our case we will use a simple metric called accuracy.

$$\text{accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
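
As a quick illustration of the formula, the sketch below computes accuracy by hand in plain Python. The label values are invented for the example and are not taken from the dataset used in this notebook.

```python
# Hypothetical labels, invented for illustration:
# 1 = positive, 0 = neutral, -1 = negative
true_labels = [1, -1, 0, 1, -1, 0, 1, -1]
predicted_labels = [1, -1, 1, 1, 0, 0, 1, -1]

# Count the positions where the prediction matches the true label
correct = sum(t == p for t, p in zip(true_labels, predicted_labels))

# Accuracy is the share of correct predictions among all predictions
manual_accuracy = correct / len(true_labels)

print("Correct: {} of {}, accuracy: {}".format(correct,
                                               len(true_labels),
                                               manual_accuracy))
```

scikit-learn's `accuracy_score(y_true, y_pred)` computes the same ratio, which is why the cells below rely on it instead of a hand-written loop.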

Our texts carry one of three labels: positive, negative, and neutral. To avoid working with string labels, we convert them to 1, -1, and 0, respectively.

In [1]:
%run 02_data_preparation.ipynb

Let's start by defining a base pipeline for our dataset. We will use some functions declared in previous parts of the training.

In [2]:
import pandas as pd

SENTIMENT_TO_LABEL_MAPPING = {
    "negative": -1,
    "neutral": 0,
    "positive": 1
}

# Load the dataset
raw_tweets = pd.read_csv("data/twitter-airlines-sentiment.csv")

# Preprocess the data with the function declared previously
# Take an explicit copy of the selected columns, so the assignments below
# modify an independent DataFrame and do not raise SettingWithCopyWarning
tweets = raw_tweets[["airline_sentiment", "text"]].copy()
tweets.columns = ["sentiment", "text"]
tweets["text"] = tweets["text"].map(preprocess_text)
tweets["sentiment"] = tweets["sentiment"].map(lambda x: SENTIMENT_TO_LABEL_MAPPING[x])

In [3]:
from sklearn.model_selection import train_test_split


# Divide the dataset into train and test fraction
train_messages, test_messages, train_targets, test_targets = train_test_split(tweets["text"], 
                                                                              tweets["sentiment"],
                                                                              test_size=0.2)
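
The split above is random, so the accuracy scores reported later can change slightly from run to run. Below is a minimal sketch of a reproducible, class-balanced split, assuming that a fixed `random_state` and the `stratify` argument are acceptable for the experiment. The toy data, the `toy_` variable names and the seed value are made up for illustration, and this is not the split that produced the results in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the tweets DataFrame (hypothetical data, illustration only)
toy_tweets = pd.DataFrame({
    "text": ["message {}".format(i) for i in range(20)],
    "sentiment": [-1] * 10 + [0] * 5 + [1] * 5,
})

# random_state fixes the shuffle, so repeated runs give the same split;
# stratify keeps the class proportions the same in the train and test parts
toy_train_messages, toy_test_messages, toy_train_targets, toy_test_targets = train_test_split(
    toy_tweets["text"],
    toy_tweets["sentiment"],
    test_size=0.2,
    random_state=2018,  # arbitrary seed, chosen only for this example
    stratify=toy_tweets["sentiment"],
)

# Show how many examples of each class landed in the test part
print(toy_test_targets.value_counts().to_dict())
```

Stratifying matters most when one class dominates, because a purely random 20% sample could otherwise under-represent the rarer labels in the test set.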

In [5]:
import itertools

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Declare vectorizers to be used
VECTORIZERS = (
    CountVectorizer(),
    TfidfVectorizer(),
)

# Declare classifiers to be used
CLASSIFIERS = (
#     LogisticRegression(C=10e-5, solver="liblinear", max_iter=10000),
#     KNeighborsClassifier(9),
#     SVC(kernel="rbf", C=0.025, probability=True),
#     LinearSVC(),
    DecisionTreeClassifier(),
    RandomForestClassifier(random_state=2018),
#     GaussianNB(),
)

for vectorizer, classifier in itertools.product(VECTORIZERS, CLASSIFIERS):
    # Vectorize preprocessed sentences
    train_features = vectorizer.fit_transform(train_messages)

    # Train the model
    %time fit = classifier.fit(train_features.toarray(), train_targets)

    # Check the accuracy of the model on test data and display it
    test_features = vectorizer.transform(test_messages)
    %time test_predictions = fit.predict(test_features.toarray())
    accuracy = accuracy_score(test_targets, test_predictions)
    print("Vectorizer: {}\nClassifier: {}\nAccuracy score:{}\n".format(vectorizer,
                                                                       classifier, 
                                                                       accuracy))

CPU times: user 53.4 s, sys: 463 ms, total: 53.9 s
Wall time: 53.9 s
CPU times: user 57.8 ms, sys: 104 ms, total: 162 ms
Wall time: 162 ms
Vectorizer: CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
Classifier: DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
Accuracy score:0.7042349726775956

CPU times: user 10.1 s, sys: 396 ms, total: 10.5 s
Wall time: 10.5 s
CPU times: user 94.9 ms, sys: 100 ms, total: 1