# Curriculum: Twitter US Airline Sentiment

Steps and tasks:

1. Import the libraries, load the dataset, and print the shape and description of the data. (5 Marks)
2. Understanding of data columns: (5 Marks)
     a. Drop all other columns except “text” and “airline_sentiment”.
     b. Check the shape of the data.
     c. Print the first 5 rows of data.
3. Text pre-processing: Data preparation. (16 Marks)
NOTE:- Each text pre-processing step should be mentioned in the notebook separately.
     a. Html tag removal.
     b. Tokenization.
     c. Remove the numbers.
     d. Removal of Special Characters and Punctuations.
     e. Removal of stopwords
     f. Conversion to lowercase.
     g. Lemmatize or stemming.
     h. Join the words in the list to convert back to text string in the data frame. (So that each row contains the data in text format.)
     i. Print the first 5 rows of data after pre-processing.
4. Vectorization: (10 Marks)
    a. Use CountVectorizer.
    b. Use TfidfVectorizer.
5. Fit and evaluate the model using both types of vectorization. (6+6 Marks)
6. Summarize your understanding of the application of various pre-processing and vectorization
     techniques and the performance of your model on this dataset. (8 Marks)
7. Overall notebook should have: (4 Marks)
     a. Well commented code
     b. Structure and flow


## 1. Import and Data Description

In [1]:
# Import libraries
import re

import numpy as np
import pandas as pd
import nltk
nltk.download('stopwords')
nltk.download('punkt')

from bs4 import BeautifulSoup                                        # HTML parsing library that can use different parsers
from nltk.corpus import stopwords, wordnet                           # Stopwords and WordNet corpora
from nltk.stem import LancasterStemmer, PorterStemmer, WordNetLemmatizer   # Stemmers and lemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize               # Word and sentence tokenizers
from sklearn.feature_extraction.text import CountVectorizer          # For bag of words
from sklearn.feature_extraction.text import TfidfVectorizer          # For TF-IDF

!pip install vaderSentiment

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aurelienvallier/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/aurelienvallier/nltk_data...
[nltk_data]   Package punkt is already up-to-date!




In [3]:
# load dataset
data=pd.read_csv('/Users/aurelienvallier/Desktop/AI & Machine Learning/8- Natural Language Processing/Tweets.csv')

In [4]:
data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [5]:
data.shape

(14640, 15)

In [6]:
data.describe()  # the parentheses are needed to call the method and get the summary statistics

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   tweet_id                      14640 non-null  int64  
 1   airline_sentiment             14640 non-null  object 
 2   airline_sentiment_confidence  14640 non-null  float64
 3   negativereason                9178 non-null   object 
 4   negativereason_confidence     10522 non-null  float64
 5   airline                       14640 non-null  object 
 6   airline_sentiment_gold        40 non-null     object 
 7   name                          14640 non-null  object 
 8   negativereason_gold           32 non-null     object 
 9   retweet_count                 14640 non-null  int64  
 10  text                          14640 non-null  object 
 11  tweet_coord                   1019 non-null   object 
 12  tweet_created                 14640 non-null  object 
 13  t

### Data Description conclusion: 
The data consists of 14,640 tweets described by 15 columns, whose dtypes are a mix of integers (int64), floats (float64) and objects (strings). Several columns, such as negativereason and tweet_coord, contain missing values.

## 2. Understanding of data and dropping of unnecessary columns

#### Since our analysis focuses on the sentiment of the tweet text, we drop all columns except "text" (the content of the tweet) and "airline_sentiment" (the target variable).

In [9]:
data.columns

Index(['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence',
       'negativereason', 'negativereason_confidence', 'airline',
       'airline_sentiment_gold', 'name', 'negativereason_gold',
       'retweet_count', 'text', 'tweet_coord', 'tweet_created',
       'tweet_location', 'user_timezone'],
      dtype='object')

In [10]:
data = data[["airline_sentiment", "text"]].copy()  # copy so that later in-place edits do not act on a view of the original frame
data

Unnamed: 0,airline_sentiment,text
0,neutral,@VirginAmerica What @dhepburn said.
1,positive,@VirginAmerica plus you've added commercials t...
2,neutral,@VirginAmerica I didn't today... Must mean I n...
3,negative,@VirginAmerica it's really aggressive to blast...
4,negative,@VirginAmerica and it's a really big bad thing...
...,...,...
14635,positive,@AmericanAir thank you we got on a different f...
14636,negative,@AmericanAir leaving over 20 minutes Late Flig...
14637,neutral,@AmericanAir Please bring American Airlines to...
14638,negative,"@AmericanAir you have my money, you change my ..."


In [11]:
#Check the shape of the data. 
data.shape # Still 14640 rows but now only 2 columns 

(14640, 2)

In [12]:
# Print the first 5 rows of data.
pd.set_option('display.max_colwidth', None)  # show the full content of each tweet
datafirst5rows = data.iloc[0:5]
print(datafirst5rows)



  airline_sentiment  \
0           neutral   
1          positive   
2           neutral   
3          negative   
4          negative   

                                                                                                                             text  
0                                                                                             @VirginAmerica What @dhepburn said.  
1                                                        @VirginAmerica plus you've added commercials to the experience... tacky.  
2                                                         @VirginAmerica I didn't today... Must mean I need to take another trip!  
3  @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse  
4                                                                         @VirginAmerica and it's a really big bad thing about it  


## 3. Text pre-processing: 

### Data preparation

In [13]:
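# a. HTML tag removal using BeautifulSoup
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")    # Parse the tweet text itself, not the whole DataFrame
    return soup.get_text()                       # Return the text with HTML tags removed

# strip_html is only defined in this cell and is not applied, so the text shown below is unchanged.
# To apply it: data['text'] = data['text'].apply(strip_html)
data.head()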

Unnamed: 0,airline_sentiment,text
0,neutral,@VirginAmerica What @dhepburn said.
1,positive,@VirginAmerica plus you've added commercials to the experience... tacky.
2,neutral,@VirginAmerica I didn't today... Must mean I need to take another trip!
3,negative,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces &amp; they have little recourse"
4,negative,@VirginAmerica and it's a really big bad thing about it
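
Row 3 above still contains the HTML entity `&amp;`. As a rough illustration of what HTML clean-up does to such a tweet, here is a minimal sketch that uses only the Python standard library instead of BeautifulSoup. The function name `strip_html_stdlib`, the regular expression and the sample string are assumptions made for this example, and the regex is a crude tag matcher rather than a real HTML parser.

```python
import html
import re

# Crude pattern for anything that looks like an HTML tag (assumption: no '>' inside attributes).
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_html_stdlib(text):
    """Drop tag-like substrings, then decode entities such as &amp; into plain characters."""
    without_tags = TAG_PATTERN.sub("", text)
    return html.unescape(without_tags)

# Made-up sample modelled on row 3 of the data.
sample = "@VirginAmerica it's really aggressive &amp; they have <b>little</b> recourse"
cleaned = strip_html_stdlib(sample)
print(cleaned)
```

Decoding the entity before tokenization avoids the stray `amp` token that otherwise shows up in the later steps.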


In [14]:
data.shape

(14640, 2)

Shape is unchanged

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   airline_sentiment  14640 non-null  object
 1   text               14640 non-null  object
dtypes: object(2)
memory usage: 228.9+ KB


In [16]:
# b. Tokenization. 
# We will tokenize the words of whole dataframe using nltk.
for i, row in data.iterrows():
    text = data.at[i, 'text']
    words = nltk.word_tokenize(text)
    data.at[i,'text'] = words
data.head()

Unnamed: 0,airline_sentiment,text
0,neutral,"[@, VirginAmerica, What, @, dhepburn, said, .]"
1,positive,"[@, VirginAmerica, plus, you, 've, added, commercials, to, the, experience, ..., tacky, .]"
2,neutral,"[@, VirginAmerica, I, did, n't, today, ..., Must, mean, I, need, to, take, another, trip, !]"
3,negative,"[@, VirginAmerica, it, 's, really, aggressive, to, blast, obnoxious, ``, entertainment, '', in, your, guests, ', faces, &, amp, ;, they, have, little, recourse]"
4,negative,"[@, VirginAmerica, and, it, 's, a, really, big, bad, thing, about, it]"
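
The output above shows that `word_tokenize` splits a mention such as `@VirginAmerica` into `@` and `VirginAmerica`, and contractions into pieces such as `did` and `n't`. If keeping mentions, hashtags and contractions whole is preferred, a small regular-expression tokenizer can be sketched as below. This is a hypothetical helper for illustration (the name `simple_tweet_tokenize` and the pattern are assumptions), not the algorithm NLTK uses.

```python
import re

# Order matters: mentions/hashtags first, then words (optionally with an apostrophe part),
# then any single non-space, non-word character as its own token.
TOKEN_PATTERN = re.compile(r"[@#]\w+|\w+(?:'\w+)?|[^\w\s]")

def simple_tweet_tokenize(text):
    """Split a tweet into tokens while keeping @mentions, #hashtags and contractions intact."""
    return TOKEN_PATTERN.findall(text)

tokens_a = simple_tweet_tokenize("@VirginAmerica What @dhepburn said.")
tokens_b = simple_tweet_tokenize("I didn't today...")
print(tokens_a)
print(tokens_b)
```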


In [17]:
# c. Remove special characters. We use a function to iterate over every token of each tweet.
def remove_Special_Characters(words):
    """Replace every non-letter character (digits included) with a space in a list of tokenized words"""
    new_words = [] # Create empty list to store pre-processed words.
    for word in words: 
        new_word= re.sub("[^a-zA-Z]", " ", word) 
        new_words.append(new_word) 
    return new_words
def normalize(words):
    words = remove_Special_Characters (words)
    return words
# Normalize over the data
for i, row in data.iterrows():
    words = data.at[i, 'text']
    words = normalize(words)
    data.at[i,'text'] = words
data.head()

Unnamed: 0,airline_sentiment,text
0,neutral,"[ , VirginAmerica, What, , dhepburn, said, ]"
1,positive,"[ , VirginAmerica, plus, you, ve, added, commercials, to, the, experience, , tacky, ]"
2,neutral,"[ , VirginAmerica, I, did, n t, today, , Must, mean, I, need, to, take, another, trip, ]"
3,negative,"[ , VirginAmerica, it, s, really, aggressive, to, blast, obnoxious, , entertainment, , in, your, guests, , faces, , amp, , they, have, little, recourse]"
4,negative,"[ , VirginAmerica, and, it, s, a, really, big, bad, thing, about, it]"


In [18]:
# d. Remove numbers. We use a function to iterate over every token of each tweet.
def remove_Numbers(words):
    """Remove digits from a list of tokenized words"""
    new_words = []
    for word in words: 
        new_word= re.sub(r'[0-9]+', '', word)
        new_words.append(new_word) 
    return new_words
def normalize(words):
    words = remove_Numbers(words)
    return words
# Normalize over the data
for i, row in data.iterrows():
    words = data.at[i, 'text']
    words = normalize(words)
    data.at[i,'text'] = words
data.head()


Unnamed: 0,airline_sentiment,text
0,neutral,"[ , VirginAmerica, What, , dhepburn, said, ]"
1,positive,"[ , VirginAmerica, plus, you, ve, added, commercials, to, the, experience, , tacky, ]"
2,neutral,"[ , VirginAmerica, I, did, n t, today, , Must, mean, I, need, to, take, another, trip, ]"
3,negative,"[ , VirginAmerica, it, s, really, aggressive, to, blast, obnoxious, , entertainment, , in, your, guests, , faces, , amp, , they, have, little, recourse]"
4,negative,"[ , VirginAmerica, and, it, s, a, really, big, bad, thing, about, it]"


In [19]:
# e. remove punctuations. We will use a function to iterate through the entire document.
def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []                       
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)    
    return new_words
def normalize(words):
    words = remove_punctuation(words)
    return words
# Normalize over the data
for i, row in data.iterrows():
    words = data.at[i, 'text']
    words = normalize(words)
    data.at[i,'text'] = words
data.head()

Unnamed: 0,airline_sentiment,text
0,neutral,"[ , VirginAmerica, What, , dhepburn, said, ]"
1,positive,"[ , VirginAmerica, plus, you, ve, added, commercials, to, the, experience, , tacky, ]"
2,neutral,"[ , VirginAmerica, I, did, n t, today, , Must, mean, I, need, to, take, another, trip, ]"
3,negative,"[ , VirginAmerica, it, s, really, aggressive, to, blast, obnoxious, , entertainment, , in, your, guests, , faces, , amp, , they, have, little, recourse]"
4,negative,"[ , VirginAmerica, and, it, s, a, really, big, bad, thing, about, it]"
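
In the outputs above, tokens such as `@` and `.` survive as single-space strings: the special-character step substitutes a space rather than an empty string, and `remove_punctuation` only drops tokens that are exactly `''`. A minimal sketch of an alternative that removes non-letters and drops the tokens left empty is shown below. The name `clean_tokens` is hypothetical, and the sketch merges steps c to e into one pass, which is a simplification of the notebook's step-by-step approach.

```python
import re

NON_LETTERS = re.compile(r"[^a-zA-Z]")

def clean_tokens(tokens):
    """Strip every non-letter character from each token and drop tokens that end up empty."""
    cleaned = []
    for token in tokens:
        letters_only = NON_LETTERS.sub("", token)
        if letters_only:                 # skip tokens that held only digits or punctuation
            cleaned.append(letters_only)
    return cleaned

# Tokens of row 0 as produced by the tokenization step.
row0 = ["@", "VirginAmerica", "What", "@", "dhepburn", "said", "."]
cleaned_row0 = clean_tokens(row0)
print(cleaned_row0)
```

Note that contractions are still mangled by this approach (`n't` becomes `nt`), so it cleans up the empty tokens without addressing how contractions are tokenized.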


In [21]:
# f. Remove stopwords. We use a function to iterate over every token of each tweet.
english_stopwords = set(stopwords.words('english'))   # Build the set once; set lookups are much faster than reloading the list per word
def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in english_stopwords:
            new_words.append(word)
    return new_words
def normalize(words):
    words = remove_stopwords(words)
    return words
# Normalize over the data
for i, row in data.iterrows():
    words = data.at[i, 'text']
    words = normalize(words)
    data.at[i,'text'] = words
data.head()

Unnamed: 0,airline_sentiment,text
0,neutral,"[ , VirginAmerica, What, , dhepburn, said, ]"
1,positive,"[ , VirginAmerica, plus, ve, added, commercials, experience, , tacky, ]"
2,neutral,"[ , VirginAmerica, I, n t, today, , Must, mean, I, need, take, another, trip, ]"
3,negative,"[ , VirginAmerica, s, really, aggressive, blast, obnoxious, , entertainment, , guests, , faces, , amp, , little, recourse]"
4,negative,"[ , VirginAmerica, s, really, big, bad, thing]"
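
The output above still contains `What` and `I`. NLTK's English stopword list is lowercase, and stopwords are removed here before the lowercase conversion, so capitalised stopwords slip through. The sketch below illustrates the effect with a small hand-picked stopword set (an assumption made to keep the example self-contained; it is not the NLTK list) and two hypothetical helper functions.

```python
# Hypothetical subset of English stopwords, for illustration only.
SAMPLE_STOPWORDS = {"i", "what", "to", "the", "it", "did"}

def remove_stopwords_case_sensitive(tokens):
    """Compare tokens as they are, which misses capitalised stopwords."""
    return [t for t in tokens if t not in SAMPLE_STOPWORDS]

def remove_stopwords_case_insensitive(tokens):
    """Lowercase each token for the comparison so 'What' and 'I' are caught as well."""
    return [t for t in tokens if t.lower() not in SAMPLE_STOPWORDS]

tokens = ["What", "I", "need", "to", "take", "another", "trip"]
kept_sensitive = remove_stopwords_case_sensitive(tokens)
kept_insensitive = remove_stopwords_case_insensitive(tokens)
print(kept_sensitive)
print(kept_insensitive)
```

Running the lowercase step before stopword removal would have the same effect as the case-insensitive comparison.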


In [22]:
#g. Conversion to lowercase. We will use a function to iterate through the entire document.
def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []                        
    for word in words:
        new_word = word.lower()           # Converting to lowercase
        new_words.append(new_word)        # Append processed words to new list.
    return new_words
def normalize(words):
    words = to_lowercase(words)
    return words
# Normalize over the data
for i, row in data.iterrows():
    words = data.at[i, 'text']
    words = normalize(words)
    data.at[i,'text'] = words
data.head()

Unnamed: 0,airline_sentiment,text
0,neutral,"[ , virginamerica, what, , dhepburn, said, ]"
1,positive,"[ , virginamerica, plus, ve, added, commercials, experience, , tacky, ]"
2,neutral,"[ , virginamerica, i, n t, today, , must, mean, i, need, take, another, trip, ]"
3,negative,"[ , virginamerica, s, really, aggressive, blast, obnoxious, , entertainment, , guests, , faces, , amp, , little, recourse]"
4,negative,"[ , virginamerica, s, really, big, bad, thing]"


In [24]:
# h. lemmatize. We will use a function to iterate through the entire document.
import nltk
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmas = []                           
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)              
    return lemmas
def normalize(words):
    words = lemmatize_verbs(words)
    return words
# Normalize over the data
for i, row in data.iterrows():
    words = data.at[i, 'text']
    words = normalize(words)
    data.at[i,'text'] = words
data.head()

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/aurelienvallier/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


Unnamed: 0,airline_sentiment,text
0,neutral,"[ , virginamerica, what, , dhepburn, say, ]"
1,positive,"[ , virginamerica, plus, ve, add, commercials, experience, , tacky, ]"
2,neutral,"[ , virginamerica, i, n t, today, , must, mean, i, need, take, another, trip, ]"
3,negative,"[ , virginamerica, s, really, aggressive, blast, obnoxious, , entertainment, , guests, , face, , amp, , little, recourse]"
4,negative,"[ , virginamerica, s, really, big, bad, thing]"


In [None]:
# Normalization. We combine the functions defined earlier into a single pipeline.
def normalize(words):
    words = remove_Special_Characters (words)
    words = remove_Numbers (words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    return words

In [None]:
# Iterate the normalize function over the whole data.
for i, row in data.iterrows():
    words = data.at[i, 'text']
    words = normalize(words)
    data.at[i,'text'] = words
data.head()

In [39]:
#h. Join the words in the list to convert back to text string in the data frame. (So that each row contains the data in text format.) 

### At this point each row still holds a list of tokens. The tokens are joined back into a single string with ' '.join(...) right before vectorization (see section 5).
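
As an illustration of the join step, here is a minimal sketch on two token lists copied from the lemmatization output. The helper `join_tokens` is hypothetical; it also skips the whitespace-only tokens left over from the earlier steps.

```python
def join_tokens(tokens):
    """Join tokens with single spaces, skipping tokens that are empty or whitespace-only."""
    return " ".join(token.strip() for token in tokens if token.strip())

rows = [
    [" ", "virginamerica", "what", " ", "dhepburn", "say", " "],
    [" ", "virginamerica", "s", "really", "big", "bad", "thing"],
]
joined = [join_tokens(tokens) for tokens in rows]
print(joined)
```

On the DataFrame this could be applied as `data['text'] = data['text'].apply(join_tokens)`. In that case the vectorization cells should receive the strings directly, because calling `' '.join(...)` on a string would insert a space between every character.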






In [41]:
# i. Print the data after pre-processing.
data2 = str(data)  # string rendering of the whole DataFrame (pandas abbreviates the middle rows); use data.head() to show only the first 5 rows
print(data2)

      airline_sentiment  \
0               neutral   
1              positive   
2               neutral   
3              negative   
4              negative   
...                 ...   
14635          positive   
14636          negative   
14637           neutral   
14638          negative   
14639           neutral   

                                                                                                                                                      text  
0                                                                                                            [ , virginamerica, what,  , dhepburn, say,  ]  
1                                                                               [ , virginamerica, plus,  ve, add, commercials, experience,    , tacky,  ]  
2                                                                      [ , virginamerica, i, n t, today,    , must, mean, i, need, take, another, trip,  ]  
3                        [ , virginamerica,  s,

## 4. Vectorization: (10 Marks) 

###  a. We start by using CountVectorizer. 

In [154]:
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

In [155]:
#split the data into test and train
X=data['text']
Y=data['airline_sentiment'] # this is the target variable
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.30,random_state=1,stratify=Y)

#Converting to array
X_train= np.array(X_train)
X_test=np.array(X_test)
y_train=np.array(y_train)
y_test=np.array(y_test)

In [129]:
# Initialization of CountVectorizer
cv = CountVectorizer(max_features=1000)  # keep only the 1,000 most frequent terms as a starting point

### b. Use TfidfVectorizer

In [156]:
# Import necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer

In [157]:
# Initialization of TfidfVectorizer (default settings, so the vocabulary size is not capped)
vt= TfidfVectorizer()
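
To make the difference between the two vectorizers concrete, here is a minimal hand-rolled sketch of what each one computes, using only the standard library. The three toy documents are made up. The IDF formula is the smoothed variant that scikit-learn uses by default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. The sketch ignores scikit-learn's own tokenization rules and sparse storage, so treat it as an approximation of the idea rather than a reimplementation.

```python
import math
from collections import Counter

# Made-up toy corpus, already cleaned and lowercased.
docs = ["flight late again", "great flight great crew", "late late bag"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({token for doc in tokenized for token in doc})

def count_vector(tokens):
    """Bag of words: how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

# Document frequency and smoothed inverse document frequency per term.
n_docs = len(tokenized)
df = {term: sum(1 for doc in tokenized if term in doc) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_vector(tokens):
    """TF-IDF: raw counts weighted by IDF, then scaled to unit Euclidean length."""
    raw = [count * idf[term] for count, term in zip(count_vector(tokens), vocab)]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw] if norm else raw

count_matrix = [count_vector(doc) for doc in tokenized]
tfidf_matrix = [tfidf_vector(doc) for doc in tokenized]
print(vocab)
print(count_matrix)
print(tfidf_matrix)
```

Terms that appear in fewer documents (such as `again`) get a larger IDF than terms that appear in several (such as `flight`), so TF-IDF down-weights common words while the plain counts treat all terms alike.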

## 5. Fit and evaluate the model using both types of vectorization. (6+6 Marks)


In [64]:
# Let's fit CountVectorizer

In [166]:
# Use fit_transform() 
train_data_features_countV = cv.fit_transform ([' '.join(arr) for arr in X_train])
# ' '.join(...) turns each token list back into a string. CountVectorizer expects strings; passing lists raises "'list' object has no attribute 'lower'".

In [146]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV

In [169]:
# Lets use random forest

# Initialize a Random Forest classifier with 10 trees
forest = RandomForestClassifier(n_estimators = 10,n_jobs=4) 
# Fit the forest to the training set, using the text bag of words as 
# features and the airline_sentiment labels as the response variable
print ("Training the random forest...")
forest = forest.fit( train_data_features_countV, y_train)
# Random forest performance through 10-fold cross-validation
print (forest)
print (np.mean(cross_val_score(forest,train_data_features_countV,y_train,cv=10)))


Training the random forest...
RandomForestClassifier(n_estimators=10, n_jobs=4)
0.731655487804878


#### Evaluation Count Vect: The mean 10-fold cross-validation accuracy with CountVectorizer and random forest is 0.73, which is reasonable considering that no hyperparameter tuning was done.

In [174]:
# Let's fit TfidfVectorizer
# Use fit_transform() 
train_data_features_Tfid= vt.fit_transform ([' '.join(arr) for arr in X_train])

In [175]:
# Lets use random forest

# Initialize a Random Forest classifier with 10 trees
forest = RandomForestClassifier(n_estimators = 10,n_jobs=4) 
# Fit the forest to the training set, using the text bag of words as 
# features and the airline_sentiment labels as the response variable
print ("Training the random forest...")
forest = forest.fit( train_data_features_Tfid, y_train)
# Random forest performance through 10-fold cross-validation
print (forest)
print (np.mean(cross_val_score(forest,train_data_features_Tfid,y_train,cv=10)))



Training the random forest...
RandomForestClassifier(n_estimators=10, n_jobs=4)
0.735851181402439


#### Evaluation Tfid: The mean 10-fold cross-validation accuracy with TfidfVectorizer and random forest is 0.74, which is decent considering that no hyperparameter tuning was done.
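
The scores above come from cross-validation on the training features; `X_test` and `y_test` are not used in these cells. A minimal sketch of a held-out evaluation is given below on made-up toy tweets. All variable names and texts are assumptions for illustration, and the accuracy on such a tiny sample says nothing about the real dataset. The point it shows is that the vectorizer is fitted on the training text only and then applied to the test text with `transform`, not `fit_transform`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score

# Made-up, already pre-processed toy tweets.
train_texts = [
    "flight delay terrible service", "lose bag rude staff",
    "cancel flight no refund", "worst airline ever",
    "great crew thank", "love new seat", "smooth flight friendly staff",
    "amazing service thank",
]
train_labels = ["negative"] * 4 + ["positive"] * 4
test_texts = ["terrible delay lose bag", "thank great flight"]
test_labels = ["negative", "positive"]

# Learn the vocabulary on the training text only.
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(train_texts)
# Reuse the same vocabulary for the test text so the columns line up.
X_test_vec = vectorizer.transform(test_texts)

model = RandomForestClassifier(n_estimators=10, random_state=1)
model.fit(X_train_vec, train_labels)

predictions = model.predict(X_test_vec)
test_accuracy = accuracy_score(test_labels, predictions)
print(predictions, test_accuracy)
```

On the notebook's data the same pattern would use the joined strings from `X_train` and `X_test` and either `cv` or `vt` as the vectorizer.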

## 6. Summarize your understanding of the application of Various Pre-processing and Vectorization and performance of your model on this dataset. (8 Marks)



In summary:
- The pre-processing steps took us from a dataframe with 15 columns of mixed information to a bag of words containing cleaned-up text for NLP. We dropped the columns that are not relevant and subsequently removed punctuation, special characters, numbers and stopwords, all of which create noise and are not easily interpretable as positive, negative or neutral. The bag of words was then vectorized using two different techniques (CountVectorizer and TfidfVectorizer).
- We split the data into train and test sets and trained a supervised model (random forest) on the training set. A supervised model was chosen because the target variable ("airline_sentiment") is known to us.
- The performance of the model on the vectorized bag of words was decent. With no tuning, the mean 10-fold cross-validation accuracy on the training set was about 0.73 with CountVectorizer and 0.74 with TfidfVectorizer, which indicates that both vectorization methods give similar results. This is plausible because tweets are short: most words likely occur at most once per tweet, so the counts and the TF-IDF weights carry largely the same information.
