# Introduction

In this notebook, we perform sentiment analysis on a dataset of tweets about US airlines from 2015. We first explore the dataset, do some cleaning and feature selection, and then run a few machine learning algorithms to predict the sentiment of a tweet. The text is prepared with NLP methods such as Term Frequency-Inverse Document Frequency (TF-IDF), which converts strings into numerical vectors for analysis.

#### Workflow:

1. Load the dataset into a dataframe
2. Explore the dataframe
3. Remove unnecessary columns
4. Remove all punctuation from the text column
5. Convert the text column into TF-IDF feature vectors
6. Use classification algorithms to predict the sentiment using K-fold cross-validation with a grid search on a few relevant hyperparameters. The classification algorithms we will be using are Logistic Regression, Multinomial Naive Bayes, and Support Vector Machines
7. Compare the accuracies of the three classifiers


In [277]:
#import the required libraries and functions

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#render charts inline
%matplotlib inline

In [278]:
#load the .csv dataset into a dataframe. The .csv file is actually encoded in ISO-8859-1
df = pd.read_csv('Airline-Sentiment-2-w-AA.csv', encoding = 'ISO-8859-1') 

# Data Exploration

In [279]:
df.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,airline_sentiment,airline_sentiment:confidence,negativereason,negativereason:confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_id,tweet_location,user_timezone
0,681448150,False,finalized,3,2/25/15 5:24,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2/24/15 11:35,5.70306e+17,,Eastern Time (US & Canada)
1,681448153,False,finalized,3,2/25/15 1:53,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2/24/15 11:15,5.70301e+17,,Pacific Time (US & Canada)
2,681448156,False,finalized,3,2/25/15 10:01,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2/24/15 11:15,5.70301e+17,Lets Play,Central Time (US & Canada)
3,681448158,False,finalized,3,2/25/15 3:05,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2/24/15 11:15,5.70301e+17,,Pacific Time (US & Canada)
4,681448159,False,finalized,3,2/25/15 5:50,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2/24/15 11:14,5.70301e+17,,Pacific Time (US & Canada)


In [280]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 20 columns):
_unit_id                        14640 non-null int64
_golden                         14640 non-null bool
_unit_state                     14640 non-null object
_trusted_judgments              14640 non-null int64
_last_judgment_at               14584 non-null object
airline_sentiment               14640 non-null object
airline_sentiment:confidence    14640 non-null float64
negativereason                  9178 non-null object
negativereason:confidence       10522 non-null float64
airline                         14640 non-null object
airline_sentiment_gold          40 non-null object
name                            14640 non-null object
negativereason_gold             32 non-null object
retweet_count                   14640 non-null int64
text                            14640 non-null object
tweet_coord                     1019 non-null object
tweet_created                   14640 

In [281]:
#What are the reasons for the negative tweets?

negative_tweets = df[['airline', 'negativereason']]
df_negative = negative_tweets.groupby('negativereason', as_index=False).count().sort_values(by = 'airline', ascending = False)
df_negative.columns = ['Reason', '# of tweets']
df_negative

Unnamed: 0,Reason,# of tweets
3,Customer Service Issue,2910
7,Late Flight,1665
1,Can't Tell,1190
2,Cancelled Flight,847
8,Lost Luggage,724
0,Bad Flight,580
6,Flight Booking Problems,529
5,Flight Attendant Complaints,481
9,longlines,178
4,Damaged Luggage,74


Customer service issues are the most common reason for negative tweets by a wide margin: 2,910 of the 9,178 tweets with a recorded reason (about 32%), well ahead of late flights at 1,665.
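As a rough check on how dominant each reason is, the counts in the table above can be converted into shares. This is a minimal sketch: the counts are copied by hand into a hypothetical `reason_counts` series rather than read from `df`.

```python
import pandas as pd

# Counts copied by hand from the table above (hypothetical variable name)
reason_counts = pd.Series({
    'Customer Service Issue': 2910,
    'Late Flight': 1665,
    "Can't Tell": 1190,
    'Cancelled Flight': 847,
    'Lost Luggage': 724,
    'Bad Flight': 580,
    'Flight Booking Problems': 529,
    'Flight Attendant Complaints': 481,
    'longlines': 178,
    'Damaged Luggage': 74,
})

# Share of each reason among all tweets that have a recorded negative reason
reason_shares = (reason_counts / reason_counts.sum()).round(3)
print(reason_shares)
```

On the original dataframe, `df['negativereason'].value_counts(normalize=True)` should produce the same shares without copying numbers by hand.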

# Data Cleaning and Feature Selection

In [282]:
#remove all unnecessary columns for our analysis. We only need 'airline_sentiment', 'airline_sentiment:confidence','airline', and 'text' columns
df = df[['airline_sentiment', 'airline_sentiment:confidence','airline', 'text']]

In [283]:
df.head()

Unnamed: 0,airline_sentiment,airline_sentiment:confidence,airline,text
0,neutral,1.0,Virgin America,@VirginAmerica What @dhepburn said.
1,positive,0.3486,Virgin America,@VirginAmerica plus you've added commercials t...
2,neutral,0.6837,Virgin America,@VirginAmerica I didn't today... Must mean I n...
3,negative,1.0,Virgin America,@VirginAmerica it's really aggressive to blast...
4,negative,1.0,Virgin America,@VirginAmerica and it's a really big bad thing...


In [284]:
#for our analysis we will only focus on positive and negative sentiments
df = df[(df['airline_sentiment'] == 'positive')|(df['airline_sentiment'] =='negative')]

In [285]:
df.head(10)

Unnamed: 0,airline_sentiment,airline_sentiment:confidence,airline,text
1,positive,0.3486,Virgin America,@VirginAmerica plus you've added commercials t...
3,negative,1.0,Virgin America,@VirginAmerica it's really aggressive to blast...
4,negative,1.0,Virgin America,@VirginAmerica and it's a really big bad thing...
5,negative,1.0,Virgin America,@VirginAmerica seriously would pay $30 a fligh...
6,positive,0.6745,Virgin America,"@VirginAmerica yes, nearly every time I fly VX..."
8,positive,0.6559,Virgin America,"@virginamerica Well, I didn'tÛ_but NOW I DO! :-D"
9,positive,1.0,Virgin America,"@VirginAmerica it was amazing, and arrived an ..."
11,positive,1.0,Virgin America,@VirginAmerica I &lt;3 pretty graphics. so muc...
12,positive,1.0,Virgin America,@VirginAmerica This is such a great deal! Alre...
13,positive,0.6451,Virgin America,@VirginAmerica @virginmedia I'm flying your #f...


In [286]:
#let's just focus on sentiments which were labeled positive or negative with full confidence
df = df[df['airline_sentiment:confidence']==1]
df.head()

Unnamed: 0,airline_sentiment,airline_sentiment:confidence,airline,text
3,negative,1.0,Virgin America,@VirginAmerica it's really aggressive to blast...
4,negative,1.0,Virgin America,@VirginAmerica and it's a really big bad thing...
5,negative,1.0,Virgin America,@VirginAmerica seriously would pay $30 a fligh...
9,positive,1.0,Virgin America,"@VirginAmerica it was amazing, and arrived an ..."
11,positive,1.0,Virgin America,@VirginAmerica I &lt;3 pretty graphics. so muc...


In [287]:
#reset the indices of the dataframe to be in order from 0
df = df.reset_index(drop = True) #drop=True discards the old index instead of adding it as a column
df.head()

Unnamed: 0,airline_sentiment,airline_sentiment:confidence,airline,text
0,negative,1.0,Virgin America,@VirginAmerica it's really aggressive to blast...
1,negative,1.0,Virgin America,@VirginAmerica and it's a really big bad thing...
2,negative,1.0,Virgin America,@VirginAmerica seriously would pay $30 a fligh...
3,positive,1.0,Virgin America,"@VirginAmerica it was amazing, and arrived an ..."
4,positive,1.0,Virgin America,@VirginAmerica I &lt;3 pretty graphics. so muc...


In [288]:
#how many tweets were made for each airline
df.groupby('airline').agg({'airline_sentiment':'count'}).sort_values(by = 'airline_sentiment', ascending = False)

Unnamed: 0_level_0,airline_sentiment
airline,Unnamed: 1_level_1
United,2418
US Airways,2068
American,1856
Southwest,1296
Delta,1026
Virgin America,233


In this filtered subset, the three most tweeted-about airlines are United Airlines, US Airways, and American Airlines.

In [289]:
#which airline has the highest share of negative tweets? We calculate the ratio of negative to positive tweets, because raw negative counts would penalize the airlines that simply receive the most tweets
#note: every remaining row has a confidence of 1.0, so summing the confidence column is equivalent to counting tweets
table = pd.pivot_table(df, index = ['airline'],columns = ['airline_sentiment'], aggfunc=np.sum)
table['ratio'] = table.iloc[:,0]/table.iloc[:,1]
table

Unnamed: 0_level_0,airline_sentiment:confidence,airline_sentiment:confidence,ratio
airline_sentiment,negative,positive,Unnamed: 3_level_1
airline,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
American,1635.0,221.0,7.39819
Delta,688.0,338.0,2.035503
Southwest,909.0,387.0,2.348837
US Airways,1901.0,167.0,11.383234
United,2120.0,298.0,7.114094
Virgin America,129.0,104.0,1.240385


US Airways has the highest ratio of negative to positive tweets (about 11.4 to 1) and is therefore the most negatively reviewed airline in this subset.
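The pivot table above gets its counts by summing a confidence column that happens to be 1.0 everywhere. A hedged alternative is to count the rows directly with `pd.crosstab`, which does not depend on that coincidence. The sketch below uses a small made-up dataframe (`toy`, with airlines "A" and "B") purely for illustration.

```python
import pandas as pd

# Made-up data for illustration only; not taken from the airline dataset
toy = pd.DataFrame({
    'airline': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'airline_sentiment': ['negative', 'negative', 'positive',
                          'negative', 'positive', 'positive', 'positive'],
})

# Count tweets per airline and sentiment, then compute negative-to-positive ratio
counts = pd.crosstab(toy['airline'], toy['airline_sentiment'])
counts['ratio'] = counts['negative'] / counts['positive']
print(counts)
```

Applied to the notebook's `df`, the same two lines should reproduce the ratio column shown above.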

In [290]:
# For our analysis we need to convert the 'positive' and 'negative' values to numerical 1 and 0 values
def sentiment_class(sentiment):
    if sentiment == 'negative':
        return 0
    else:
        return 1

df['airline_sentiment'] = df['airline_sentiment'].apply(sentiment_class)
df.head(10)

Unnamed: 0,airline_sentiment,airline_sentiment:confidence,airline,text
0,0,1.0,Virgin America,@VirginAmerica it's really aggressive to blast...
1,0,1.0,Virgin America,@VirginAmerica and it's a really big bad thing...
2,0,1.0,Virgin America,@VirginAmerica seriously would pay $30 a fligh...
3,1,1.0,Virgin America,"@VirginAmerica it was amazing, and arrived an ..."
4,1,1.0,Virgin America,@VirginAmerica I &lt;3 pretty graphics. so muc...
5,1,1.0,Virgin America,@VirginAmerica This is such a great deal! Alre...
6,1,1.0,Virgin America,@VirginAmerica Thanks!
7,1,1.0,Virgin America,@VirginAmerica So excited for my first cross c...
8,0,1.0,Virgin America,@VirginAmerica I flew from NYC to SFO last we...
9,1,1.0,Virgin America,I _ü flying @VirginAmerica. ÷¼ü_ÙÔ
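Note that `sentiment_class` returns 1 for anything that is not 'negative', which is safe here only because the neutral tweets were filtered out earlier. A possible alternative, sketched below on a made-up series, is an explicit dictionary with `Series.map`, so that any unexpected label shows up as NaN instead of silently becoming 1. The names `label_map` and `sentiments` are illustrative.

```python
import pandas as pd

# Explicit mapping: labels not listed here become NaN rather than 1
label_map = {'negative': 0, 'positive': 1}

# Made-up example including a label that should not be present after filtering
sentiments = pd.Series(['negative', 'positive', 'neutral'])
encoded = sentiments.map(label_map)
print(encoded)

# A quick guard before modelling: count the labels that failed to map
print(encoded.isna().sum())
```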


In [291]:
#drop the remaining unnecessary columns
df = df.drop(['airline_sentiment:confidence', 'airline'], axis = 1)

In [292]:
#define a function to clean the 'text' column by removing all punctuation and making it lowercase, then strip leading and trailing whitespace
import re

def remove_punctuation(string):
    #a raw string keeps \s from being treated as an (invalid) string escape
    return re.sub(r'[^\sa-zA-Z0-9]', '', string).lower() #remove punctuation and make lowercase
df['text'] = df['text'].apply(remove_punctuation)

df['text'] = df['text'].str.strip() #remove leading and trailing whitespace
df.head(10)

Unnamed: 0,airline_sentiment,text
0,0,virginamerica its really aggressive to blast o...
1,0,virginamerica and its a really big bad thing a...
2,0,virginamerica seriously would pay 30 a flight ...
3,1,virginamerica it was amazing and arrived an ho...
4,1,virginamerica i lt3 pretty graphics so much be...
5,1,virginamerica this is such a great deal alread...
6,1,virginamerica thanks
7,1,virginamerica so excited for my first cross co...
8,0,virginamerica i flew from nyc to sfo last wee...
9,1,i flying virginamerica
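Row 4 in the output above shows a side effect of this cleaning: the HTML entity `&lt;` in "I &lt;3" loses its punctuation and survives as the stray token "lt3". One way to avoid this, sketched below under the assumption that the tweets contain standard HTML entities, is to decode entities with the standard library's `html.unescape` before stripping punctuation. The function name `clean_tweet` is hypothetical.

```python
import html
import re

def clean_tweet(text):
    """Decode HTML entities, drop punctuation, lowercase, and trim whitespace."""
    text = html.unescape(text)                  # '&lt;3' becomes '<3'
    text = re.sub(r'[^\sa-zA-Z0-9]', '', text)  # keep only letters, digits, whitespace
    return text.lower().strip()

cleaned = clean_tweet("@VirginAmerica I &lt;3 pretty graphics. ")
print(cleaned)
```

It could be applied in the same way as the original function, with `df['text'].apply(clean_tweet)`.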


# Develop and Evaluate Machine Learning Models

First we need to convert the 'text' column into numerical feature vectors. We can do this with the TF-IDF vectorizer, which assigns a TF-IDF score to each word based on how often that word occurs in a given tweet and how often it occurs in the other tweets (documents). The score is boosted if the word occurs several times in a tweet but is penalized if it also occurs in many other tweets. The logic is that a word that appears in many documents does little to distinguish one from another, so it does not help much in making predictions.
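To make the weighting concrete, here is a minimal sketch on three made-up documents. With scikit-learn's default settings (`smooth_idf=True`), the inverse document frequency of a term is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. A word that appears in every document therefore gets the minimum IDF of 1, while rarer words get larger values. The corpus and variable names are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: 'flight' appears in every document, the other words in one each
docs = ["flight delayed again", "flight cancelled", "great flight crew"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # sparse matrix: one row per document, one column per word

# Map each vocabulary term to its learned IDF weight
idf = {term: vec.idf_[index] for term, index in vec.vocabulary_.items()}
for term in sorted(idf):
    print(term, round(idf[term], 3))
```

Here "flight" receives the lowest weight because it cannot help tell the three documents apart.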

In [293]:
from sklearn.model_selection import GridSearchCV, cross_val_score, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

lr = LogisticRegression() # initiate a logistic regression model

#identify the features and labels of the training set
train = df['text']
labels = df['airline_sentiment']


tf_idf_vec = TfidfVectorizer() #initiate the TF-IDF Vectorizer
tf_idf = tf_idf_vec.fit_transform(train) #transform the 'text' column into TF-IDF feature vectors
cv_scores = cross_val_score(lr, tf_idf, labels, cv = 5) #run a 5-fold cross validation of the logistic regression model on the TF-IDF feature vectors for predicting sentiments



In [294]:
#calculate the mean accuracy score
mean_score = np.mean(cv_scores)
mean_score

0.9149155882297213
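An accuracy figure is easier to judge against a majority-class baseline, because the filtered data is imbalanced. As a small sketch, the baseline can be worked out from the per-airline negative and positive counts in the ratio table earlier in the notebook. The numbers are copied by hand, and the variable names are illustrative.

```python
# Negative and positive counts per airline, copied from the ratio table above
negative = 1635 + 688 + 909 + 1901 + 2120 + 129
positive = 221 + 338 + 387 + 167 + 298 + 104

# Accuracy of a model that always predicts 'negative'
baseline_accuracy = negative / (negative + positive)
print(negative, positive, round(baseline_accuracy, 4))
```

Any classifier should be compared with this figure rather than with 50%.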

A mean accuracy of 91.5% is a good start. Let's see what the TF-IDF feature vectors look like by converting the TF-IDF feature matrix into a dataframe.

In [295]:
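#convert the tf-idf matrix into a dataframe with the column names corresponding to the unique word tokens
#note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
transformed_df = pd.DataFrame(tf_idf.todense(),columns=tf_idf_vec.get_feature_names())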
print(transformed_df.shape)
transformed_df.head()

(8897, 11759)


Unnamed: 0,0162431184663,0214,021mbps,0223,02282015,03,0303,03032015,0316,0372389047497,...,zip,zippers,zkatcher,zombie,zone,zones,zoom,zrh,zukes,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [296]:
#Obtain predictions from the 5-fold cross validation and append them to the dataframe for a quick visual comparison
cv_predictions = cross_val_predict(lr, tf_idf, labels, cv = 5)
cv_predictions = pd.Series(cv_predictions)
df['predicted_sentiment'] = cv_predictions



In [297]:
df.sample(10, random_state = 42) #obtain a random sample of 10 tweets for a quick visual check

Unnamed: 0,airline_sentiment,text,predicted_sentiment
7430,0,americanair 2284 four hours late flightrs and ...,0
5992,0,usairways please help on hold 3 hours cant cha...,0
5386,0,usairways forced sections 4 and 5 to check the...,0
7868,0,americanair customer service is terriblebeen w...,0
6827,0,usairways already called no other options flig...,0
6567,0,usairways they had to turn the seat cushions o...,0
8132,0,americanair your customer service is inferior ...,0
1127,0,united once again you guys didnt let me down a...,0
5956,0,usairways are you going to do anything to help...,0
6622,0,usairways delays to the max,0


Looks good on this sample, although all ten sampled tweets happen to be negative. Now, let's perform a grid search with cross-validation to determine whether a unigram TF-IDF model provides better results than models that also include longer n-grams. Unigram means that a sentence is broken down into single-word tokens, while an n-gram is a sequence of n consecutive words. An ngram_range of (1,2) keeps unigrams and bigrams, and (1,3) keeps unigrams, bigrams, and trigrams.
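To illustrate what the `ngram_range` setting changes, the sketch below asks a vectorizer for its analyzer (the function that turns a string into tokens) and applies it to one short cleaned tweet from the sample above. Only the tokenization is shown here, with no TF-IDF weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tweet = "usairways delays to the max"

# Unigrams only: each word is its own token
unigram_tokens = TfidfVectorizer(ngram_range=(1, 1)).build_analyzer()(tweet)

# Unigrams and bigrams: single words plus every pair of consecutive words
bigram_tokens = TfidfVectorizer(ngram_range=(1, 2)).build_analyzer()(tweet)

print(unigram_tokens)
print(bigram_tokens)
```

Longer n-grams enlarge the vocabulary quickly, which is one reason they do not automatically improve accuracy on a dataset of this size.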

We will also create a pipeline that first performs TF-IDF vectorization and then logistic regression.
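A pipeline also changes what cross-validation measures. In the earlier cell the vectorizer was fitted on every tweet before the folds were split, so the vocabulary and IDF weights had already seen the validation tweets. Inside a pipeline, the vectorizer is refitted on the training part of each fold only. The sketch below shows the pattern on a tiny made-up dataset (`toy_texts`, `toy_labels`). The scores it prints mean nothing beyond demonstrating the mechanics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Made-up tweets: 1 = positive, 0 = negative
toy_texts = [
    "great flight thanks", "amazing crew great service",
    "thanks for the great trip", "love the amazing service",
    "flight delayed again", "terrible service delayed luggage",
    "lost luggage again", "cancelled flight terrible",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is a pipeline step, so it is fitted per training fold
pipe = Pipeline([
    ('tf_idf_vec', TfidfVectorizer()),
    ('lr', LogisticRegression()),
])

# cv=2 keeps both classes in every training split of this tiny dataset
scores = cross_val_score(pipe, toy_texts, toy_labels, cv=2)
print(scores)
```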

In [298]:
param_grid = [{'tf_idf_vec__ngram_range':[(1,1),(1,2),(1,3)]}] #ngram_range values to search: unigrams only, unigrams + bigrams, and unigrams + bigrams + trigrams
lr_tfidf = Pipeline([('tf_idf_vec', tf_idf_vec), ('lr', LogisticRegression(random_state = 42))]) #create the pipeline for tf-idf vectorization followed by logistic regression
grid_search = GridSearchCV(lr_tfidf, param_grid, cv = 5, scoring = 'accuracy')
grid_search.fit(train, labels)



GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tf_idf_vec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=..., penalty='l2', random_state=42, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'tf_idf_vec__ngram_range': [(1, 1), (1, 2), (1, 3)]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)

In [299]:
print(grid_search.best_params_) #print the hyperparameter which provided the highest mean accuracy score

{'tf_idf_vec__ngram_range': (1, 1)}


In [300]:
#print the scores of all three hyperparameter values
cvres = grid_search.cv_results_
for score, params in zip(cvres['mean_test_score'], cvres['params']):
    print(score, params)

0.9072721141957963 {'tf_idf_vec__ngram_range': (1, 1)}
0.9046869731370125 {'tf_idf_vec__ngram_range': (1, 2)}
0.9013150500168596 {'tf_idf_vec__ngram_range': (1, 3)}


So it seems that the 1-gram model provides the highest mean accuracy score, about 91% (0.907). Let's see how a Multinomial Naive Bayes model performs using the same methodology.

In [301]:
from sklearn.naive_bayes import MultinomialNB
nb_tfidf = Pipeline([('tf_idf_vec', tf_idf_vec), ('multi_nb', MultinomialNB())])
grid_search = GridSearchCV(nb_tfidf, param_grid, cv = 5, scoring = 'accuracy')
grid_search.fit(train, labels)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tf_idf_vec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=...        vocabulary=None)), ('multi_nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'tf_idf_vec__ngram_range': [(1, 1), (1, 2), (1, 3)]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)

In [302]:
print(grid_search.best_params_)

{'tf_idf_vec__ngram_range': (1, 1)}


In [303]:
cvres = grid_search.cv_results_
for score, params in zip(cvres['mean_test_score'], cvres['params']):
    print(score, params)

0.8490502416544903 {'tf_idf_vec__ngram_range': (1, 1)}
0.8443295492862762 {'tf_idf_vec__ngram_range': (1, 2)}
0.8430931774755536 {'tf_idf_vec__ngram_range': (1, 3)}


The highest Naive Bayes accuracy score is 85%, again for the 1-gram model, which is lower than the score obtained with logistic regression. Let's see how a Support Vector Machine (SVM) performs using the same methodology. The SVM here is a linear one, trained by stochastic gradient descent with a hinge loss (SGDClassifier). Note that we will train the model with three different values of alpha, the regularization strength: 0.001, 0.01, and 0.1.

In [304]:
from sklearn.linear_model import SGDClassifier
param_grid = [{'tf_idf_vec__ngram_range':[(1,1),(1,2),(1,3)], 'svm__alpha':[0.001,0.01,0.1]}] #use three different values for alpha, the regularization strength of SGDClassifier (not a learning rate): 0.001, 0.01, and 0.1
svm_tfidf = Pipeline([('tf_idf_vec', tf_idf_vec), ('svm', SGDClassifier(loss='hinge', penalty='l2', random_state=42, max_iter=5, tol = None))])
grid_search = GridSearchCV(svm_tfidf, param_grid, cv = 5, scoring = 'accuracy')
grid_search.fit(train, labels)



GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tf_idf_vec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=...dom_state=42, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'tf_idf_vec__ngram_range': [(1, 1), (1, 2), (1, 3)], 'svm__alpha': [0.001, 0.01, 0.1]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)

In [305]:
print(grid_search.best_params_)

{'svm__alpha': 0.001, 'tf_idf_vec__ngram_range': (1, 1)}


In [306]:
cvres = grid_search.cv_results_
for score, params in zip(cvres['mean_test_score'], cvres['params']):
    print(score, params)

0.8920984601551084 {'svm__alpha': 0.001, 'tf_idf_vec__ngram_range': (1, 1)}
0.8834438574800495 {'svm__alpha': 0.001, 'tf_idf_vec__ngram_range': (1, 2)}
0.8750140496796673 {'svm__alpha': 0.001, 'tf_idf_vec__ngram_range': (1, 3)}
0.8312914465550185 {'svm__alpha': 0.01, 'tf_idf_vec__ngram_range': (1, 1)}
0.8297178824322805 {'svm__alpha': 0.01, 'tf_idf_vec__ngram_range': (1, 2)}
0.8297178824322805 {'svm__alpha': 0.01, 'tf_idf_vec__ngram_range': (1, 3)}
0.8297178824322805 {'svm__alpha': 0.1, 'tf_idf_vec__ngram_range': (1, 1)}
0.8297178824322805 {'svm__alpha': 0.1, 'tf_idf_vec__ngram_range': (1, 2)}
0.8297178824322805 {'svm__alpha': 0.1, 'tf_idf_vec__ngram_range': (1, 3)}


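So, it looks like the highest accuracy score for the SVM model is 89%, obtained with alpha = 0.001 (the regularization strength) and unigram tokens. With the larger alpha values the score settles at or near 0.8297, which matches the share of negative tweets in the data (7,382 of 8,897), suggesting that the more heavily regularized models predict 'negative' for almost every tweet. Therefore, for this particular problem, logistic regression performed best with an accuracy score of 91%, and the SVM did better than Naive Bayes.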
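For a side-by-side view, the best mean cross-validation accuracy of each classifier can be collected into one sorted series. This is only a summary sketch: the values are copied by hand from the grid-search outputs above (rounded to four decimals), and `best_scores` is an illustrative name.

```python
import pandas as pd

# Best mean CV accuracy per model, copied from the grid-search results above
best_scores = pd.Series({
    'Logistic Regression': 0.9073,
    'Linear SVM (SGDClassifier, hinge loss)': 0.8921,
    'Multinomial Naive Bayes': 0.8491,
}).sort_values(ascending=False)

print(best_scores)
print('Best model:', best_scores.idxmax())
```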