## Twitter US Airline Sentiment Analysis 

The aim of this project is to build a sentiment analysis model based on the Twitter US Airline Sentiment dataset.

The Twitter US Airline Dataset:
* Tweets from February 2015 about each of the major US airlines (US Airways, Virgin America, Delta, United, American Airlines, Southwest)
* Each tweet is classified as positive, negative or neutral.


The dataset includes the following features:
- Twitter ID, sentiment confidence score, sentiment, negative reason, airline name, retweet count, user name, tweet text, tweet coordinates, date and time of the tweet, and the location of the tweet.

Download the dataset here: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

## 0 - Packages

In [18]:
import pandas as pd
import numpy as np
import warnings
import nltk
from nltk.tokenize import TweetTokenizer  # tokenizes tweet text
from nltk.stem.snowball import SnowballStemmer  # stems words
from nltk.tokenize import word_tokenize
from sklearn.metrics import classification_report
from time import time
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
import collections
from util_gridsearch import *

In [2]:
%matplotlib inline
warnings.filterwarnings("ignore")

## 1- Import of the Dataset

In [3]:
# Import of data
df = pd.read_csv('Data/Tweets.csv')
df = df[['text', 'airline_sentiment']]
df.head(2)

| | text | airline_sentiment |
|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral |
| 1 | @VirginAmerica plus you've added commercials t... | positive |


In [4]:
print (' 1- Number of tweets in the datasets: ' ,df.shape[0])
print("-----------------------------------------------------")
print(' 2- Number of tweet per type of sentiment :')
print("-----------------------------------------------------")
print(df['airline_sentiment'].value_counts())
print("-----------------------------------------------------")
print(' 3- The part of each type of sentiment in the dataset:')
print("-----------------------------------------------------")
print(df['airline_sentiment'].value_counts(normalize=True) )

 1- Number of tweets in the datasets:  14640
-----------------------------------------------------
 2- Number of tweet per type of sentiment :
-----------------------------------------------------
negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64
-----------------------------------------------------
 3- The part of each type of sentiment in the dataset:
-----------------------------------------------------
negative    0.626913
neutral     0.211680
positive    0.161407
Name: airline_sentiment, dtype: float64


## 2- Text Processing 

The text processing in this project consists of:

- Removing punctuation, tags, emoticons, digits, URLs and hyperlinks (http...)
- Stemming: words are reduced to a root by removing inflection, usually by dropping a suffix.
- Removing stop words: frequent words such as "the", "is", etc. that carry little meaning of their own
- Apostrophes: the intent is to expand contractions to avoid ambiguity, for example "n't" replaced by "not" and "'ll" by "will". In the function below the punctuation step strips the apostrophes first, so contractions end up collapsed instead ("didn't" becomes "didnt").
- Removing the airline names from the text: the airline handles are dropped together with the other @mentions.
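
The cleaning step can be tried on a single string before looking at the full function. The following is a minimal sketch using only Python's `re` module with the same pattern as `Text_Processing`; the helper name `clean_tweet` and the example tweet are made up for illustration.

```python
import re

# Same pattern as in Text_Processing: @mentions, non-alphanumeric characters,
# URLs and digits are all replaced by the empty string.
CLEAN_PATTERN = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|([0-9])")


def clean_tweet(tweet):
    """Lower-case one tweet, strip unwanted characters, normalise whitespace."""
    stripped = CLEAN_PATTERN.sub("", tweet.lower())
    return " ".join(stripped.split())


# Hypothetical tweet, not taken from the dataset
example = "@VirginAmerica thanks for the 2 free drinks! http://t.co/abc123"
cleaned = clean_tweet(example)
print(cleaned)  # thanks for the free drinks

# Apostrophes are removed like any other punctuation
print(clean_tweet("didn't"))  # didnt
```

Stemming and stop-word removal are left out here because they need the NLTK corpora to be downloaded.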

In [5]:
def Text_Processing(text):
    """Clean a pandas Series of raw tweets; returns a Series of cleaned strings."""
    # Lower case
    text = text.str.lower()

    # Regular expression removing @mentions (including the airline handles),
    # punctuation, emoticons, URLs and digits.
    # regex=True is passed explicitly because recent pandas versions treat the
    # pattern as a literal string by default.
    pattern = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|([0-9])"
    text = text.str.replace(pattern, "", regex=True)
    text = text.apply(nltk.word_tokenize)

    # Stemming each word
    stemmer = SnowballStemmer('english')
    text = text.apply(lambda x: [stemmer.stem(y) for y in x])

    # Removing stop words (a set makes the membership test fast)
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text = text.apply(lambda x: [y for y in x if y not in stopwords])

    # Apostrophes were already stripped by the regular expression above, so
    # contractions are collapsed ("didn't" -> "didnt") rather than expanded.

    # Detokenize the cleaned tokens back into one string per tweet
    text_final = text.str.join(" ")
    return text_final

In [6]:
df['text']=Text_Processing(df['text'])

In [7]:
#view after the text processing
df.head()

| | text | airline_sentiment |
|---|---|---|
| 0 | said | neutral |
| 1 | plus youv ad commerci experi tacki | positive |
| 2 | didnt today must mean need take anoth trip | neutral |
| 3 | realli aggress blast obnoxi entertain guest fa... | negative |
| 4 | realli big bad thing | negative |


## 3- Model 

#### - Extracting features from text:
TfidfVectorizer and CountVectorizer both convert text data into numerical vectors, since a model can only process numerical data.

- TF-IDF (term frequency - inverse document frequency): weights the word counts by how rare each word is across the documents, so words that appear in many documents get a lower weight
- CountVectorizer: counts the number of times each word appears in a document

Parameters to define: min_df, max_df, ngram_range

For example:
* min_df=5: only include words that occur in at least 5 documents
* max_df=0.5: only include words that occur in at most 50% of the documents
* ngram_range: (1,1) uses single words only, (1,2) uses single words and pairs of consecutive words
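
The effect of these parameters is easiest to see on a tiny corpus. The sketch below uses three made-up documents (not taken from the dataset) and small thresholds so that the filtering is visible; the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, made up for illustration
docs = ["flight delayed again", "flight cancelled", "great crew great flight"]

# ngram_range=(1, 2) keeps single words and pairs of consecutive words
bigram_vocab = sorted(CountVectorizer(ngram_range=(1, 2)).fit(docs).vocabulary_)

# min_df=2 keeps only terms found in at least 2 documents
min_df_vocab = sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_)

# max_df=0.5 drops terms found in more than 50% of the documents
max_df_vocab = sorted(CountVectorizer(max_df=0.5).fit(docs).vocabulary_)

# TF-IDF: a term present in every document gets the smallest idf weight
tfidf = TfidfVectorizer().fit(docs)
idf_by_term = {term: tfidf.idf_[index] for term, index in tfidf.vocabulary_.items()}
lowest_idf_term = min(idf_by_term, key=idf_by_term.get)

print(bigram_vocab)
print(min_df_vocab)     # ['flight']
print(max_df_vocab)     # ['again', 'cancelled', 'crew', 'delayed', 'great']
print(lowest_idf_term)  # flight
```

Note that "great" occurs twice but in a single document, so its document frequency is 1 and `min_df=2` removes it.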

#### - Model 
- Logistic Regression
- Naive Bayes


#### - Evaluation metrics
- Accuracy and classification report (precision, recall and F1-score per class)
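
Because the classes are imbalanced, accuracy is best read against a majority-class baseline. The short calculation below is a sketch based on the class counts printed in section 1; it shows what a model that always predicts the most frequent class would score.

```python
# Class counts reported in section 1
class_counts = {"negative": 9178, "neutral": 3099, "positive": 2363}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of always predicting the majority class
baseline_accuracy = class_counts[majority_class] / total

# Recall is 1 for the majority class and 0 for the others, so the
# macro-averaged recall of this baseline is 1 / number of classes
baseline_macro_recall = 1 / len(class_counts)

print(majority_class, round(baseline_accuracy, 3), round(baseline_macro_recall, 3))
# negative 0.627 0.333
```

This is why the per-class rows of the classification report are worth reading alongside the overall accuracy.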

In [8]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets : 
#Test data size is 0.2 i.e. 20% of the data, Train data size is the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(df['text'], 
                                                    df['airline_sentiment'], 
                                                    random_state=0, test_size=0.2)

In [9]:
print('The shape of the Training set: ', X_train.shape)
print('The shape of the Test set: ', X_test.shape)

The shape of the Training set:  (11712,)
The shape of the Test set:  (2928,)


### 3-1- First Model 
Before tuning the parameters of the model, let's look at the results of Logistic Regression with CountVectorizer, both using default parameters.

In [10]:
#CountVect with default parameters
vect = CountVectorizer().fit(X_train)

X_train_count=vect.transform(X_train)
X_test_count=vect.transform(X_test)

In [11]:
#Logistic Regression with default parameters
logreg = LogisticRegression().fit(X_train_count, y_train)

# "\033[107m" and "\033[0m" are ANSI codes that highlight the title
print("\033[107m" + "Classification report on Test " + "\033[0m")
print(classification_report(y_test, logreg.predict(X_test_count)))
print("\033[107m" + "Classification report on Train " + "\033[0m")
print(classification_report(y_train, logreg.predict(X_train_count)))

Classification report on Test
              precision    recall  f1-score   support

    negative       0.84      0.89      0.87      1870
     neutral       0.62      0.56      0.59       614
    positive       0.73      0.65      0.68       444

    accuracy                           0.79      2928
   macro avg       0.73      0.70      0.71      2928
weighted avg       0.78      0.79      0.78      2928

Classification report on Train
              precision    recall  f1-score   support

    negative       0.92      0.96      0.94      7308
     neutral       0.85      0.76      0.80      2485
    positive       0.89      0.85      0.87      1919

    accuracy                           0.90     11712
   macro avg       0.88      0.86      0.87     11712
weighted avg       0.90      0.90      0.90     11712



### Comments:

- The accuracy is 90% on the training set and 79% on the test set, so the model is overfitting.
- To reduce overfitting, we will try regularization (l1 or l2).
- Grid search is used to find the best parameters for the classifier (Logistic Regression, Naive Bayes) and for the feature extraction (CountVectorizer/TF-IDF).
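
The helper `grid_vect` used in the next sections lives in `util_gridsearch.py`, which is not shown here. The sketch below is an assumption about what such a helper could look like, inferred only from the logged pipeline steps (`vect`, `tfidf`, `clf`) and the 5-fold search; the name `grid_vect_sketch`, its signature and the toy data are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def grid_vect_sketch(clf, param_grid, X_train, y_train, X_test, y_test,
                     cv=5, n_jobs=None):
    """Hypothetical stand-in for grid_vect: grid search over a text pipeline."""
    pipeline = Pipeline([
        ("vect", CountVectorizer()),     # text -> token counts
        ("tfidf", TfidfTransformer()),   # counts -> (optionally idf-weighted) frequencies
        ("clf", clf),                    # classifier to tune
    ])
    search = GridSearchCV(pipeline, param_grid, cv=cv, n_jobs=n_jobs)
    search.fit(X_train, y_train)
    print("Best CV score: %.2f" % search.best_score_)
    print("Best parameters:", search.best_params_)
    print("Test score: %.2f" % search.best_estimator_.score(X_test, y_test))
    return search


# Toy data, made up for illustration (far too small for a meaningful score)
toy_texts = ["great flight", "love crew", "thank great service", "great crew love",
             "worst delay", "bad service delay", "worst flight ever", "bad delay lost bag"]
toy_labels = ["positive"] * 4 + ["negative"] * 4
toy_grid = {"vect__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.5, 1.6]}

toy_search = grid_vect_sketch(LogisticRegression(), toy_grid,
                              toy_texts, toy_labels, toy_texts, toy_labels, cv=2)
```

The `step__parameter` naming in the grid (for example `clf__C`) is how scikit-learn routes each value to the matching pipeline step.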

### 3-2- Logistic Regression

In [12]:
parameters_logreg = {
     'vect__ngram_range': [(1, 1), (1, 2)],
     'vect__max_df': (0.4,0.5),
     'vect__min_df': (6,7),
     'tfidf__use_idf': (True, False),
     'clf__C': (1.2,1.6,1.7),
     #'clf__penalty': ('l1','l2')
}
logreg = LogisticRegression(penalty='l2')
#see util_gridsearch.py for the function grid_vect
best_logreg_countvect = grid_vect(logreg, parameters_logreg, X_train, y_train, X_test,y_test,model_name='LogisticRegression')

Model : LogisticRegression
-------------------------------------------------------
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   13.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   51.8s
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed:  1.1min finished


Time : done in 64.82s

Best CV score: 0.78
Best parameters set:
	clf__C: 1.6
	tfidf__use_idf: False
	vect__max_df: 0.4
	vect__min_df: 6
	vect__ngram_range: (1, 2)
Test score with best_estimator_: 0.79


Train score with best_estimator_: 0.84


 ** Classification Report Test Data ** 
              precision    recall  f1-score   support

    negative       0.81      0.93      0.87      1870
     neutral       0.67      0.50      0.57       614
    positive       0.77      0.60      0.67       444

    accuracy                           0.79      2928
   macro avg       0.75      0.67      0.70      2928
weighted avg       0.78      0.79      0.77      2928

** Classification Report Train Data ** 
              precision    recall  f1-score   support

    negative       0.85      0.95      0.90      7308
     neutral       0.79      0.60      0.68      2485
    positive       0.84      0.70      0.76      1919

    accuracy                           0.84     11712
   macro avg       0.83
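
The log line "Fitting 5 folds for each of 48 candidates, totalling 240 fits" follows directly from the size of the parameter grid. As a small check, scikit-learn's `ParameterGrid` can enumerate the combinations; the sketch below copies the grid from the cell above and takes the fold count from the log.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as parameters_logreg above
parameters_logreg = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'vect__max_df': (0.4, 0.5),
    'vect__min_df': (6, 7),
    'tfidf__use_idf': (True, False),
    'clf__C': (1.2, 1.6, 1.7),
}

# One candidate per combination of values: 2 * 2 * 2 * 2 * 3
n_candidates = len(ParameterGrid(parameters_logreg))

n_folds = 5  # number of cross-validation folds reported in the log
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 48 240
```

Every value added to one of the lists multiplies the number of fits, which is worth keeping in mind when widening the search.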

### 3-3- Naive Bayes

In [13]:
#Parameters for MultinomialNB
mnb = MultinomialNB()
parameters_mnb = {
     'vect__ngram_range': [(1, 1), (1, 2)],
     'vect__max_df': (0.5,0.7),
     'vect__min_df': (5,7),
     'tfidf__use_idf': (True, False),
     'clf__alpha': (0.001,0.05,0.25),}
#see util_gridsearch.py for the function grid_vect
best_mnb_countvect = grid_vect(mnb, parameters_mnb, X_train, y_train, X_test,y_test,model_name='NaiveBayes')

Model : NaiveBayes
-------------------------------------------------------
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    7.4s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   34.8s
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed:   43.5s finished


Time : done in 44.21s

Best CV score: 0.76
Best parameters set:
	clf__alpha: 0.05
	tfidf__use_idf: False
	vect__max_df: 0.5
	vect__min_df: 5
	vect__ngram_range: (1, 2)
Test score with best_estimator_: 0.77


Train score with best_estimator_: 0.82


 ** Classification Report Test Data ** 
              precision    recall  f1-score   support

    negative       0.77      0.96      0.85      1870
     neutral       0.71      0.36      0.48       614
    positive       0.81      0.51      0.63       444

    accuracy                           0.77      2928
   macro avg       0.76      0.61      0.65      2928
weighted avg       0.76      0.77      0.74      2928

** Classification Report Train Data ** 
              precision    recall  f1-score   support

    negative       0.82      0.97      0.89      7308
     neutral       0.82      0.50      0.62      2485
    positive       0.87      0.69      0.77      1919

    accuracy                           0.82     11712
   macro avg      

### 3-4- Conclusion
<table>
    <tr>
        <td><b>Model</b></td>
        <td><b>Train Accuracy</b></td>
        <td><b>Test Accuracy</b></td>
    </tr>
    <tr>
        <td>Logistic Regression</td>
        <td>84%</td>
        <td>79%</td>
    </tr>
    <tr>
        <td>Naive Bayes</td>
        <td>82%</td>
        <td>77%</td>
    </tr>
</table>

Logistic Regression with CountVectorizer reduces the overfitting (84% train vs. 79% test accuracy, compared with 90% vs. 79% for the default model) and achieves 79% accuracy on the test set, so we select this model for prediction.

The performance of the model may be improved by:
- Adding new features: number of words in the text, number of punctuation marks, etc.
- Using a deep learning approach
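
As an illustration of the first idea, surface features such as word and punctuation counts can be computed from the raw tweet, before the cleaning step removes the punctuation. This is only a sketch: the function name `surface_features` and the feature names are hypothetical, and these features were not used in the models above.

```python
import string


def surface_features(raw_tweet):
    """Count whitespace-separated words and punctuation characters in one raw tweet."""
    words = raw_tweet.split()
    # string.punctuation also contains '@' and '#', so mentions and hashtags count
    n_punctuation = sum(ch in string.punctuation for ch in raw_tweet)
    return {"n_words": len(words), "n_punctuation": n_punctuation}


features = surface_features("Awful experience so far!")
print(features)  # {'n_words': 4, 'n_punctuation': 1}
```

Such columns would have to be combined with the vectorized text before fitting, for example by stacking them next to the count matrix.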


## 4- Prediction 

In [19]:
#Parameters for the best model
model=LogisticRegression(penalty='l2',C=1.6)

vect = CountVectorizer(analyzer='word',  min_df=6,max_df=0.4,ngram_range=(1, 2),stop_words='english').fit(X_train)
# number of features kept (get_feature_names() was removed in scikit-learn 1.2)
len(vect.get_feature_names_out())

2665

In [20]:
model.fit(vect.transform(X_train), y_train)

LogisticRegression(C=1.6, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [21]:
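# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
names = vect.get_feature_names_out()
# coef_[0] is the row of the first class in model.classes_, i.e. 'negative'
df_features = pd.DataFrame({'coef':model.coef_[0],'names':names})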

In [22]:
print('The Smallest Coefs :')
df_features.sort_values(by='coef',ascending=True)[:10]

The Smallest Coefs :


| | coef | names |
|---|---|---|
| 133 | -2.415421 | appl |
| 1353 | -2.133697 | kudo |
| 2283 | -2.079267 | thank |
| 725 | -2.031248 | excel |
| 2493 | -1.985774 | visit |
| 600 | -1.959276 | discount |
| 1230 | -1.904039 | id like |
| 512 | -1.875334 | dal |
| 968 | -1.861136 | flyingitforward |
| 1041 | -1.855351 | golf |


In [23]:
print('The Largest Coefs :')
df_features.sort_values(by='coef',ascending=False)[:10]

The Largest Coefs :


| | coef | names |
|---|---|---|
| 2615 | 3.452032 | worst |
| 1005 | 2.574707 | fuck |
| 424 | 2.499321 | communic |
| 1974 | 2.487044 | ridicul |
| 2224 | 2.344507 | suck |
| 433 | 2.334466 | complaint |
| 1394 | 2.32464 | lie |
| 2025 | 2.319076 | screw |
| 2455 | 2.247134 | unless |
| 1549 | 2.242092 | miser |
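
These coefficients come from `model.coef_[0]`, the row of the first class in `model.classes_`. Since scikit-learn sorts class labels, that row belongs to `negative`, which is why words like "worst" have large positive values and words like "thank" have large negative ones. The sketch below checks the row order on a toy model; the texts and variable names are made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, made up for illustration
texts = ["worst delay", "worst service", "thank great crew", "thank love flight",
         "flight tomorrow", "question flight time"]
labels = ["negative", "negative", "positive", "positive", "neutral", "neutral"]

toy_vect = CountVectorizer().fit(texts)
toy_model = LogisticRegression().fit(toy_vect.transform(texts), labels)

# Terms ordered by their column index in the count matrix
terms = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

# One column of coefficients per class, in the order of toy_model.classes_
coef_table = pd.DataFrame(toy_model.coef_, index=toy_model.classes_, columns=terms).T

print(list(toy_model.classes_))  # ['negative', 'neutral', 'positive']
print(coef_table.loc[["worst", "thank"]])
```

Building the table with one column per class avoids having to remember which index corresponds to which sentiment.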


## Test on new tweets

In [24]:
new_positive_tweets = pd.Series(["Thank you @VirginAmerica for you amazing customer support team on Tuesday and returning my lost bag in less than 24h! #efficiencyiskey #virginamerica"
                      ,"Love flying with you guys ask these years.  Sad that this will be the last trip 😂   @VirginAmerica  #LuxuryTravel"
                      ,"Wow @VirginAmerica This plane is nice and clean & I have enjoyed the trip "])
new_positive_tweets= Text_Processing(new_positive_tweets)
new_positive_tweets

0    thank amaz custom support team tuesday return ...
1     love fli guy ask year sad last trip luxurytravel
2                      wow plane nice clean enjoy trip
dtype: object

In [26]:
print(model.predict(vect.transform(new_positive_tweets)))

['positive' 'positive' 'positive']


In [27]:
new_negative_tweets = pd.Series(["@VirginAmerica shocked my initially with the service, but then went on to shock me further with no response to what my complaint was. #unacceptable @Delta @richardbranson"
                      ,"@VirginAmerica this morning I was forced to repack a suitcase w a medical device because it was barely overweight - wasn't even given an option to pay extra. My spouses suitcase then burst at the seam with the added device and had to be taped shut. Awful experience so far!"
                      ,"Board airplane home. Computer issue. Get off plane, traverse airport to gate on opp side. Get on new plane hour later. Plane too heavy. 8 volunteers get off plane. Ohhh the adventure of travel"])
new_negative_tweets= Text_Processing(new_negative_tweets)

In [28]:
print(model.predict(vect.transform(new_negative_tweets)))

['negative' 'negative' 'negative']
