# **Introduction**
In this notebook, we will explore some text mining techniques for sentiment analysis. First, we will spend some time preparing the tweets. This will involve cleaning the text data, removing stop words and stemming. [The Twitter US Airline Sentiment data set](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) on Kaggle is nice to work with for this purpose. It contains the tweet’s text and one variable with three possible sentiment values.

To infer the tweets’ sentiment we use two classifiers: *logistic regression* and *multinomial naive Bayes*. We will tune the hyperparameters of both classifiers with grid search.

We will compare the performance with three metrics: precision, recall and the F1 score.


We start by importing the packages and configuring some settings.

In [None]:
!pip install emoji 
import numpy as np 
import pandas as pd 
pd.set_option('display.max_colwidth', None)  # None shows the full text; -1 is deprecated in recent pandas
from time import time
import re
import string
from pprint import pprint
import collections
import emoji
import matplotlib.pyplot as plt

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
import joblib  # sklearn.externals.joblib was removed from scikit-learn; use the standalone package

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Loading the data
We shuffle the data frame in case the rows are sorted by class. This can be done by applying the reindex method to a random permutation of the original indices. In this notebook we focus only on the text variable and the class variable.
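
As a minimal sketch of this shuffling step, the snippet below applies the same reindex-on-a-permutation idea to a small made-up data frame. The `toy` frame and the fixed seed are illustrative assumptions, not part of the data set; `DataFrame.sample(frac=1)` would be an equivalent alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the airline tweets
toy = pd.DataFrame({'text': ['a', 'b', 'c', 'd'],
                    'airline_sentiment': ['negative', 'neutral', 'positive', 'negative']})

rng = np.random.RandomState(0)  # fixed seed only to make the sketch repeatable
# Reorder the rows by a random permutation of the index, then renumber them from 0
shuffled = toy.reindex(rng.permutation(toy.index)).reset_index(drop=True)

print(shuffled)
```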

In [None]:
df = pd.read_csv('sentiment.csv')
df = df.reindex(np.random.permutation(df.index))
df = df.reset_index(drop=True)  # renumber the rows and discard the old index
df = df[['text', 'airline_sentiment']]
df.head()

# Text variable
To analyze the text variable we create a class TextCounts, which computes some basic statistics on the text. This class can also be used later in a Pipeline.

* count_words : number of words in the tweet
* count_mentions : referrals to other Twitter accounts, which are preceded by a @
* count_hashtags : number of tag words, preceded by a #
* count_capital_words : number of uppercase words, could be used to "shout" and express (negative) emotions
* count_excl_quest_marks : number of question or exclamation marks
* count_urls : number of links in the tweet, preceded by http(s)
* count_emojis : number of emoji, which might be a good indication of the sentiment
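
To make these counts concrete, here is a small sketch that applies the same kind of regular expressions to a single made-up tweet. The tweet and the `counts` dictionary are illustrative assumptions; the emoji count is left out so that the snippet needs only the standard library.

```python
import re

# Made-up tweet used only for illustration
tweet = "@VirginAmerica WHY is my flight LATE again?! #fail http://example.com"

counts = {
    'count_words': len(re.findall(r'\w+', tweet)),                    # runs of word characters
    'count_mentions': len(re.findall(r'@\w+', tweet)),                # @account
    'count_hashtags': len(re.findall(r'#\w+', tweet)),                # #tag
    'count_capital_words': len(re.findall(r'\b[A-Z]{2,}\b', tweet)),  # fully uppercase words
    'count_urls': len(re.findall(r'https?://[^\s]+[\s]?', tweet)),    # http(s) links
}
print(counts)
```

Note that `\w+` also splits the URL into several "words", so count_words is a rough measure rather than an exact word count.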


In [None]:
class TextCounts(BaseEstimator, TransformerMixin):
    
    def count_regex(self, pattern, tweet):
        # count the non-overlapping matches of the pattern in the tweet
        return len(re.findall(pattern, tweet))
    
    def fit(self, X, y=None, **fit_params):
        # fit method is used when specific operations need to be done on the train data, but not on the test data
        return self
    
    def transform(self, X, **transform_params):
        # a "word" is any run of alphanumeric characters or underscores
        count_words = X.apply(lambda x: self.count_regex(r'\w+', x)) 
        count_mentions = X.apply(lambda x: self.count_regex(r'@\w+', x))
        count_hashtags = X.apply(lambda x: self.count_regex(r'#\w+', x))
        count_capital_words = X.apply(lambda x: self.count_regex(r'\b[A-Z]{2,}\b', x))
        count_excl_quest_marks = X.apply(lambda x: self.count_regex(r'[!?]', x))  # every ! or ? counts once
        count_urls = X.apply(lambda x: self.count_regex(r'https?://[^\s]+[\s]?', x))
        # We will replace the emoji symbols with a description, which makes using a regex for counting easier
        # Moreover, it will result in having more words in the tweet
        count_emojis = X.apply(lambda x: emoji.demojize(x)).apply(lambda x: self.count_regex(r':[a-z_&]+:', x))
        
        df = pd.DataFrame({'count_words': count_words
                           , 'count_mentions': count_mentions
                           , 'count_hashtags': count_hashtags
                           , 'count_capital_words': count_capital_words
                           , 'count_excl_quest_marks': count_excl_quest_marks
                           , 'count_urls': count_urls
                           , 'count_emojis': count_emojis
                          })
        
        return df

In [None]:
tc = TextCounts()

df_eda = tc.fit_transform(df.text)
df_eda['airline_sentiment'] = df.airline_sentiment
df_eda.head()

# Text Cleaning 
Before we start using the tweets' text, we clean it. We'll do this in the class CleanText:

* remove the mentions, as we want to make the model generalisable to tweets of other airline companies too.
* remove the hash tag sign (#) but not the actual tag as this may contain information
* set all words to lowercase
* remove all punctuation, including the question and exclamation marks
* remove the urls as they do not contain useful information and we did not notice a distinction in the number of urls used between the sentiment classes
* make sure the converted emojis are kept as one word.
* remove digits
* remove stopwords
* apply the PorterStemmer to keep the stem of the words
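
The steps above can be sketched on a single tweet as follows. This is a simplified, standard-library-only illustration: the `clean_tweet` function, the tiny `STOPWORDS` set and the example tweet are assumptions made for the sketch, and stemming is left out because it relies on NLTK.

```python
import re
import string

# Tiny illustrative stop word list; the notebook itself uses NLTK's English list
STOPWORDS = {'the', 'is', 'my', 'to', 'a', 'and', 'on'}
KEEP = {'not', 'no'}  # negations are kept because they carry sentiment

def clean_tweet(text):
    text = re.sub(r'@\w+', '', text)          # drop mentions
    text = re.sub(r'https?://\S+', '', text)  # drop URLs
    # replace every punctuation symbol (including #) by a space
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    text = text.translate(table)
    text = re.sub(r'\d+', '', text)           # drop digits
    words = text.lower().split()
    # remove stop words and single characters, but keep the negations
    words = [w for w in words if (w not in STOPWORDS or w in KEEP) and len(w) > 1]
    return ' '.join(words)

cleaned = clean_tweet("@united My flight UA123 is NOT on time!! http://t.co/abc #delayed")
print(cleaned)
```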


In [None]:
class CleanText(BaseEstimator, TransformerMixin):
    def remove_mentions(self, input_text):
        return re.sub(r'@\w+', '', input_text)
    
    def remove_urls(self, input_text):
        return re.sub(r'https?://[^\s]+[\s]?', '', input_text)  # "s?" matches http and https only
    
    def emoji_oneword(self, input_text):
        # By compressing the underscore, the emoji is kept as one word
        return input_text.replace('_','')
    
    def remove_punctuation(self, input_text):
        # Make translation table
        punct = string.punctuation
        trantab = str.maketrans(punct, len(punct)*' ')  # Every punctuation symbol will be replaced by a space
        return input_text.translate(trantab)

    def remove_digits(self, input_text):
        return re.sub(r'\d+', '', input_text)  # raw string avoids an invalid escape sequence warning
    
    def to_lower(self, input_text):
        return input_text.lower()
    
    def remove_stopwords(self, input_text):
        stopwords_list = stopwords.words('english')
        # Some words which might indicate a certain sentiment are kept via a whitelist
        whitelist = ["n't", "not", "no"]
        words = input_text.split() 
        clean_words = [word for word in words if (word not in stopwords_list or word in whitelist) and len(word) > 1] 
        return " ".join(clean_words) 
    
    def stemming(self, input_text):
        porter = PorterStemmer()
        words = input_text.split() 
        stemmed_words = [porter.stem(word) for word in words]
        return " ".join(stemmed_words)
        
    def fit(self, X, y=None, **fit_params):
        return self
    
    def transform(self, X, **transform_params):
        clean_X = (X.apply(self.remove_mentions)
                    .apply(self.remove_urls)
                    .apply(self.emoji_oneword)
                    .apply(self.remove_punctuation)
                    .apply(self.remove_digits)
                    .apply(self.to_lower)
                    .apply(self.remove_stopwords)
                    .apply(self.stemming))
        return clean_X

*One side-effect of text cleaning is that some rows do not have any words left in their text. To deal with these missing values, we impute them with some placeholder text like [no_text].*


In [None]:
ct = CleanText()

sr_clean = ct.fit_transform(df.text)
empty_clean = sr_clean == ''
print('{} records have no words left after text cleaning'.format(sr_clean[empty_clean].count()))
sr_clean.loc[empty_clean] = '[no_text]'

# Creating test data
To evaluate the trained models we'll need a test set. Evaluating on the train data would not be correct: the models are trained to minimize their cost function on exactly that data, so the resulting score would be overly optimistic.


First we combine the TextCounts variables with the CleanText variable.

In [None]:
df_model = df_eda
df_model['clean_text'] = sr_clean
df_model.columns.tolist()

df_model now contains several variables. However, our vectorizers only need the clean_text variable, while the TextCounts variables can be passed to the classifier as they are. To select specific columns, we use the ColumnExtractor class, which can be used in the Pipeline afterwards.
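
The difference between selecting one column and selecting a list of columns matters here, as the following sketch on a made-up frame suggests (the `toy` frame and the variable names are illustrative assumptions). A single string key returns a one-dimensional Series, which is the shape a text vectorizer expects, while a list of keys returns a two-dimensional DataFrame.

```python
import pandas as pd

# Made-up frame with two count columns and one text column
toy = pd.DataFrame({'count_words': [3, 5],
                    'count_urls': [0, 1],
                    'clean_text': ['flight delay', 'great servic']})

text_only = toy['clean_text']                     # string key -> 1-D Series
counts_only = toy[['count_words', 'count_urls']]  # list of keys -> 2-D DataFrame

print(type(text_only).__name__, text_only.shape)
print(type(counts_only).__name__, counts_only.shape)
```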

In [None]:
class ColumnExtractor(TransformerMixin, BaseEstimator):
    def __init__(self, cols):
        self.cols = cols

    def transform(self, X, **transform_params):        
        return X[self.cols]

    def fit(self, X, y=None, **fit_params):
        return self

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_model.drop('airline_sentiment', axis=1), df_model.airline_sentiment, test_size=0.1, random_state=37)

# Hyperparameter tuning and cross-validation
The vectorizers and classifiers all have configurable parameters. To choose the best parameters, we need to evaluate on a separate validation set that was not used during training. However, a single validation set may not produce reliable results: by chance, a model might perform well on that particular split, and a different split might give different results. To get a more robust estimate, we perform cross-validation.

With cross-validation the data is split into a train and validation set multiple times. The evaluation metric is then averaged over the different folds. Luckily, GridSearchCV applies cross-validation out-of-the-box.
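
As a rough illustration of the averaging over folds, the sketch below runs 5-fold cross-validation with scikit-learn's `cross_val_score` on synthetic data. The data, the seed and the choice of classifier are assumptions made only to demonstrate the mechanics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic two-class data, used only to demonstrate the mechanics
rng = np.random.RandomState(37)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# One accuracy value per fold; the mean is the cross-validated estimate
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores, scores.mean())
```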

To find the best parameters for both a vectorizer and classifier, we create a Pipeline. All this is put into a function for ease of use.
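
Parameters of steps inside a Pipeline are addressed with a double underscore between the step name and the parameter name, which is why the grids in this notebook use keys such as `features__pipe__vect__max_df`. A minimal sketch with a simpler, hypothetical two-step pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical two-step pipeline, simpler than the one built in grid_vect
pipe = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])

# Nested parameters are addressed as <step name>__<parameter name>
pipe.set_params(vect__ngram_range=(1, 2), clf__alpha=0.5)
params = pipe.get_params()
print(params['vect__ngram_range'], params['clf__alpha'])
```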

In our function grid_vect we additionally generate the classification_report on the test data. This provides some interesting metrics per target class, which might be more appropriate here. These metrics are the **precision**, **recall** and **F1 score**.
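
To recall how these three metrics are defined, the following sketch computes them by hand for one target class. The labels are made up for illustration; classification_report performs the same calculation for every class.

```python
# Made-up true and predicted labels; we score the "neg" class against the rest
y_true = ['neg', 'neg', 'neg', 'pos', 'pos', 'neu', 'neg', 'neu']
y_pred = ['neg', 'neg', 'pos', 'neg', 'pos', 'neu', 'neg', 'neg']

pairs = list(zip(y_true, y_pred))
tp = sum(t == 'neg' and p == 'neg' for t, p in pairs)  # correctly predicted as neg
fp = sum(t != 'neg' and p == 'neg' for t, p in pairs)  # wrongly predicted as neg
fn = sum(t == 'neg' and p != 'neg' for t, p in pairs)  # neg tweets that were missed

precision = tp / (tp + fp)  # share of predicted neg that are really neg
recall = tp / (tp + fn)     # share of real neg that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(precision, recall, f1)
```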


In [None]:
def grid_vect(clf, parameters_clf, X_train, X_test, parameters_text=None, vect=None):
    # Note: y_train and y_test are read from the notebook's global scope
    
    textcountscols = ['count_capital_words','count_emojis','count_excl_quest_marks','count_hashtags'
                      ,'count_mentions','count_urls','count_words']
    
    features = FeatureUnion([('textcounts', ColumnExtractor(cols=textcountscols)),
                             ('pipe', Pipeline([('cleantext', ColumnExtractor(cols='clean_text')), ('vect', vect)]))], n_jobs=1)
    
    pipeline = Pipeline([('features', features), ('clf', clf)])
    
    # Join the parameters dictionaries together
    parameters = dict()
    if parameters_text:
        parameters.update(parameters_text)
    parameters.update(parameters_clf)

    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=5)
    
    print("Performing grid search...")
    print()    
    t0 = time()
    grid_search.fit(X_train, y_train)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best CV score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
    
    print()
    print("Test score with best_estimator_: %0.3f" % grid_search.best_estimator_.score(X_test, y_test))
    print("Train score with best_estimator_: %0.3f" % grid_search.best_estimator_.score(X_train, y_train))
    print()
    print("Classification Report Test Data")
    print(classification_report(y_test, grid_search.best_estimator_.predict(X_test)))
    #print("Classification Report Train Data")
    #print(classification_report(y_train, grid_search.best_estimator_.predict(X_train)))   
    return grid_search

In [None]:
# Parameter grid settings for the vectorizers (Count and TFIDF)
parameters_vect = {
    'features__pipe__vect__max_df': (0.25, 0.5, 0.75),
    'features__pipe__vect__ngram_range': ((1, 1), (1, 2), (1, 3)), #((1, 1), (1, 2)), 
    'features__pipe__vect__min_df': (1, 2, 3, 4)   #(1,2) 
}


# Parameter grid settings for MultinomialNB
parameters_mnb = {
    'clf__alpha': (0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.75, 1.0)  #(0.25, 0.5, 0.75)
}


# Parameter grid settings for LogisticRegression
parameters_logreg = {
    'clf__C': (0.25, 0.5, 1.0), #(0.01, 0.5, 1.0, 1.05, 1.1, 1.15, 1.2)
    'clf__penalty': ('l1', 'l2'),
    #'clf__solver': ('lbfgs', 'saga'),
    'clf__max_iter' : (150, 200, 300, 500)
}

# Classifiers
Here we will compare the performance of [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) and [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression).

In [None]:
mnb = MultinomialNB()
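# 'saga' supports both the l1 and l2 penalties in the grid; the default solver
# in recent scikit-learn versions (lbfgs) does not support l1
logreg = LogisticRegression(solver='saga')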

# CountVectorizer
To use words in a classifier, we need to convert the words to numbers. This can be done with a CountVectorizer. Sklearn's CountVectorizer takes all words in all tweets, assigns an ID and counts the frequency of the word per tweet. This bag of words can then be used as input for a classifier. It is what is called a sparse data set, meaning that each record will have many zeroes for the words not occurring in the tweet.
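
A small sketch of this bag-of-words conversion on two made-up, already cleaned tweets (the tweets and variable names are illustrative assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer

tweets = ['flight delay delay', 'great flight']  # made-up, already cleaned tweets

vect = CountVectorizer()
bow = vect.fit_transform(tweets)  # sparse matrix: one row per tweet, one column per word

print(vect.vocabulary_)  # mapping of word -> column index
print(bow.toarray())     # dense view, with zeroes for words absent from a tweet
```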

In [None]:
countvect = CountVectorizer()

In [None]:
# MultinomialNB x CountVectorizer
best_mnb_countvect = grid_vect(mnb, parameters_mnb, X_train, X_test, parameters_text=parameters_vect, vect=countvect)

In [None]:
# LogisticRegression x CountVectorizer
best_logreg_countvect = grid_vect(logreg, parameters_logreg, X_train, X_test, parameters_text=parameters_vect, vect=countvect)

# TF-IDF
One issue with CountVectorizer is that some words occur frequently in tweets of every target class. Such words carry little discriminatory information. TF-IDF can be used to downweight these frequent words.
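
The downweighting can be seen in a small sketch with three made-up tweets, where "flight" occurs in every tweet and "delay" in only one. The tweets and variable names are illustrative assumptions, and the vectorizer is used with its default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ['flight delay', 'flight great', 'flight cancel']  # made-up cleaned tweets

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(tweets).toarray()

col = tfidf.vocabulary_
w_flight = weights[0, col['flight']]  # word that appears in every tweet
w_delay = weights[0, col['delay']]    # word that appears in one tweet only
print(w_flight, w_delay)              # the common word gets the smaller weight
```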

In [None]:
tfidfvect = TfidfVectorizer()

In [None]:
# MultinomialNB x TF-IDF
best_mnb_tfidf = grid_vect(mnb, parameters_mnb, X_train, X_test, parameters_text=parameters_vect, vect=tfidfvect)

In [None]:
# LogisticRegression x TF-IDF
best_logreg_tfidf = grid_vect(logreg, parameters_logreg, X_train, X_test, parameters_text=parameters_vect, vect=tfidfvect)

# Apply the best model on new tweets
We will use the best model and apply it to some new tweets.

Thanks to GridSearchCV, we now know the best hyperparameters, so we can retrain the best model on all the data, including the test data that we split off before.

In [None]:
textcountscols = ['count_capital_words','count_emojis','count_excl_quest_marks','count_hashtags'
                  ,'count_mentions','count_urls','count_words']
    
features = FeatureUnion([('textcounts', ColumnExtractor(cols=textcountscols)), 
                         ('pipe', Pipeline([('cleantext', ColumnExtractor(cols='clean_text'))
                        , ('vect', CountVectorizer(max_df=0.5, min_df=4, ngram_range=(1,2)))]))], n_jobs=-1)

pipeline = Pipeline([('features', features), ('clf', MultinomialNB(alpha=1.0))])

best_model = pipeline.fit(df_model.drop('airline_sentiment', axis=1), df_model.airline_sentiment)

In [None]:
test = pd.Series(["Irish budget airline Ryanair has also resumed limited flights schedule. changed its service so that all food is pre-packaged&must be pre-ordered before flying. Alcohol isn't off menu, though -- chosen to ax hot drinks service instead, throughout July."])

df_counts_neg = tc.transform(test)
df_clean_neg = ct.transform(test)
df_test = df_counts_neg
df_test['clean_text'] = df_clean_neg
best_model.predict(df_test).tolist()

In [None]:
filename = 'model.joblib'
joblib.dump(best_model, filename)
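
A saved model can be restored later with `joblib.load`. The sketch below shows the round trip with a tiny stand-in model and a temporary file path, both of which are assumptions for illustration; in the notebook the object would be best_model.

```python
import os
import tempfile

import joblib
from sklearn.naive_bayes import MultinomialNB

# Tiny stand-in model trained on two made-up count vectors
model = MultinomialNB().fit([[1, 0], [0, 1]], ['negative', 'positive'])

path = os.path.join(tempfile.mkdtemp(), 'model.joblib')
joblib.dump(model, path)       # write the fitted model to disk

restored = joblib.load(path)   # read it back in a later session
print(restored.predict([[2, 0]]))
```

When loading a pipeline that contains custom classes such as ColumnExtractor, those classes need to be defined or importable in the loading session, because joblib stores references to them rather than their code.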