## NLP Project: Twitter US Airline Sentiment

## Data Description:
A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service").

**Objective:**
- To implement the techniques learnt as a part of the course.

## Alex N Waithera # Project #8 NLP

## 1. Importing the libraries, loading dataset, printing shape of data & data description

In [144]:
# Mount Google Drive to access the dataset stored there.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [145]:
# install and import necessary libraries.

!pip install contractions
import warnings
warnings.filterwarnings('ignore')
import re, string, unicodedata                       
import contractions                                     
from bs4 import BeautifulSoup                           

import numpy as np                                   
import pandas as pd                                     
import nltk                                             

nltk.download('stopwords')                             
nltk.download('punkt')
nltk.download('wordnet')

from nltk.corpus import stopwords                      
from nltk.tokenize import word_tokenize, sent_tokenize 
from nltk.stem.wordnet import WordNetLemmatizer     

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [146]:
#Load dataset.
dataset = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Tweets.csv')

In [147]:
# Check first five(5) rows of data.
dataset.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,5.70306e+17,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2/24/2015 11:35,,Eastern Time (US & Canada)
1,5.70301e+17,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials to the experience... tacky.,,2/24/2015 11:15,,Pacific Time (US & Canada)
2,5.70301e+17,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I need to take another trip!,,2/24/2015 11:15,Lets Play,Central Time (US & Canada)
3,5.70301e+17,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces &amp; they have little recourse",,2/24/2015 11:15,,Pacific Time (US & Canada)
4,5.70301e+17,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing about it,,2/24/2015 11:14,,Pacific Time (US & Canada)


In [148]:
#Print shape of data 
dataset.shape

(14640, 15)

In [149]:
# Dataset description 
dataset.describe()

Unnamed: 0,tweet_id,airline_sentiment_confidence,negativereason_confidence,retweet_count
count,14640.0,14640.0,10522.0,14640.0
mean,5.692184e+17,0.900169,0.638298,0.08265
std,779109200000000.0,0.16283,0.33044,0.745778
min,5.67588e+17,0.335,0.0,0.0
25%,5.68559e+17,0.6923,0.3606,0.0
50%,5.69478e+17,1.0,0.6706,0.0
75%,5.698902e+17,1.0,1.0,0.0
max,5.70311e+17,1.0,1.0,44.0


In [150]:
#print dataset columns
dataset.columns

Index(['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence',
       'negativereason', 'negativereason_confidence', 'airline',
       'airline_sentiment_gold', 'name', 'negativereason_gold',
       'retweet_count', 'text', 'tweet_coord', 'tweet_created',
       'tweet_location', 'user_timezone'],
      dtype='object')

## 2. Dropping all other columns except “text” and “airline_sentiment”:


In [151]:
# Drop all other columns except "text" and "airline_sentiment"
data = dataset.drop(['tweet_id', 'airline_sentiment_confidence', 'negativereason', 'negativereason_confidence', 'airline','airline_sentiment_gold', 'name', 'negativereason_gold', 'retweet_count', 'tweet_coord', 'tweet_created',
'tweet_location', 'user_timezone'], axis=1)

In [152]:
#Check the shape of data 
data.shape

(14640, 2)
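
Selecting the two columns to keep is an alternative to listing every column to drop. A minimal sketch on a small hypothetical frame (the toy values below are assumptions for illustration, not rows from Tweets.csv):

```python
import pandas as pd

# Hypothetical stand-in for the loaded `dataset` frame.
toy = pd.DataFrame({
    "tweet_id": [1, 2],
    "airline_sentiment": ["neutral", "negative"],
    "airline": ["Virgin America", "Virgin America"],
    "text": ["What he said.", "it is a really big bad thing"],
})

# Keep only the two columns of interest; .copy() gives an independent frame
# so later in-place edits to the text column do not touch the original.
toy_data = toy[["airline_sentiment", "text"]].copy()
print(toy_data.shape)
```

This form stays correct if further columns are added to the source file, since only the wanted columns are named.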

In [153]:
#Print first five(5) rows of data
data.head()

Unnamed: 0,airline_sentiment,text
0,neutral,@VirginAmerica What @dhepburn said.
1,positive,@VirginAmerica plus you've added commercials to the experience... tacky.
2,neutral,@VirginAmerica I didn't today... Must mean I need to take another trip!
3,negative,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces &amp; they have little recourse"
4,negative,@VirginAmerica and it's a really big bad thing about it


In [154]:
#Check for null values 
data.isnull().sum(axis=0)   

airline_sentiment    0
text                 0
dtype: int64

In [155]:
# Display full dataframe information (non-truncated text column).
pd.set_option('display.max_colwidth', None)

data.head() 

Unnamed: 0,airline_sentiment,text
0,neutral,@VirginAmerica What @dhepburn said.
1,positive,@VirginAmerica plus you've added commercials to the experience... tacky.
2,neutral,@VirginAmerica I didn't today... Must mean I need to take another trip!
3,negative,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces &amp; they have little recourse"
4,negative,@VirginAmerica and it's a really big bad thing about it


In [156]:
#Univariate analysis of the 'airline_sentiment' variable. 
print(data['airline_sentiment'].value_counts(normalize=True))

negative    0.626913
neutral     0.211680
positive    0.161407
Name: airline_sentiment, dtype: float64


## 3. Text pre-processing: Data preparation.


In [157]:
#Html tag removal 
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

data['text'] = data['text'].apply(lambda x: strip_html(x))
data.head()

Unnamed: 0,airline_sentiment,text
0,neutral,@VirginAmerica What @dhepburn said.
1,positive,@VirginAmerica plus you've added commercials to the experience... tacky.
2,neutral,@VirginAmerica I didn't today... Must mean I need to take another trip!
3,negative,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces & they have little recourse"
4,negative,@VirginAmerica and it's a really big bad thing about it


In [158]:
# Replace contractions 
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)

data['text'] = data['text'].apply(lambda x: replace_contractions(x))
data.head()

Unnamed: 0,airline_sentiment,text
0,neutral,@VirginAmerica What @dhepburn said.
1,positive,@VirginAmerica plus you have added commercials to the experience... tacky.
2,neutral,@VirginAmerica I did not today... Must mean I need to take another trip!
3,negative,"@VirginAmerica it is really aggressive to blast obnoxious ""entertainment"" in your guests' faces & they have little recourse"
4,negative,@VirginAmerica and it is a really big bad thing about it


In [159]:
#Removal of special characters and punctuation
def remove_special_characters(text):
    # Negated character class: any character NOT listed here is removed
    pat = r'[^a-zA-Z0-9.,!?/:;\"\'\s]'
    return re.sub(pat, '', text)
data['text'] = data['text'].apply(lambda x:  remove_special_characters(x))
data.head(5)    

Unnamed: 0,airline_sentiment,text
0,neutral,VirginAmerica What dhepburn said.
1,positive,VirginAmerica plus you have added commercials to the experience... tacky.
2,neutral,VirginAmerica I did not today... Must mean I need to take another trip!
3,negative,"VirginAmerica it is really aggressive to blast obnoxious ""entertainment"" in your guests' faces they have little recourse"
4,negative,VirginAmerica and it is a really big bad thing about it
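
A negated character class works as a whitelist: everything outside the listed ranges is deleted. The upper-case range matters here, because `A-z` spans ASCII 65 to 122 and therefore also covers `[`, `\`, `]`, `^`, `_` and the backtick. A minimal sketch of the difference, using a made-up sample string and hypothetical constant names:

```python
import re

# Whitelist with the intended upper-case range A-Z.
KEEP_STRICT = r'[^a-zA-Z0-9.,!?/:;\"\'\s]'
# Same whitelist with A-z, which also keeps [ \ ] ^ _ and the backtick.
KEEP_LOOSE = r'[^a-zA-z0-9.,!?/:;\"\'\s]'

sample = "@VirginAmerica #fail [gate_9] & more!"

strict = re.sub(KEEP_STRICT, '', sample)
loose = re.sub(KEEP_LOOSE, '', sample)

print(strict)  # brackets and underscore are removed
print(loose)   # brackets and underscore survive
```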


In [160]:
#Removal of URLs
data['text'] = data['text'].apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
data['text'] = data['text'].apply(lambda x: re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', x))
data.head(5)

Unnamed: 0,airline_sentiment,text
0,neutral,VirginAmerica What dhepburn said.
1,positive,VirginAmerica plus you have added commercials to the experience... tacky.
2,neutral,VirginAmerica I did not today... Must mean I need to take another trip!
3,negative,"VirginAmerica it is really aggressive to blast obnoxious ""entertainment"" in your guests' faces they have little recourse"
4,negative,VirginAmerica and it is a really big bad thing about it
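
A minimal sketch of URL removal with a single compiled pattern, assuming links start either with `http://`/`https://` or with `www.`. The function name `remove_urls` and the sample string are illustrative and not part of the notebook:

```python
import re

# Matches http(s) links and bare www. addresses up to the next whitespace.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_urls(text):
    """Return text with URL-like substrings removed and whitespace collapsed."""
    without_urls = URL_PATTERN.sub('', text)
    return ' '.join(without_urls.split())

cleaned = remove_urls("JetBlue see http://t.co/abc123 and www.example.com today")
print(cleaned)
```

Applied to a column, this would look like `data['text'].apply(remove_urls)`, with the result assigned back to the column.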


In [161]:
#Removal of numbers.
def remove_numbers(text):
  text = re.sub(r'\d+', '', text)
  return text

data['text'] = data['text'].apply(lambda x: remove_numbers(x))
data.head(5)

Unnamed: 0,airline_sentiment,text
0,neutral,VirginAmerica What dhepburn said.
1,positive,VirginAmerica plus you have added commercials to the experience... tacky.
2,neutral,VirginAmerica I did not today... Must mean I need to take another trip!
3,negative,"VirginAmerica it is really aggressive to blast obnoxious ""entertainment"" in your guests' faces they have little recourse"
4,negative,VirginAmerica and it is a really big bad thing about it


In [162]:
#Tokenization of data
# Tokenize the words of whole dataframe.
for i, row in data.iterrows():
    text = data.at[i, 'text']
    words = nltk.word_tokenize(text)
    data.at[i,'text'] = words
data.head()

Unnamed: 0,airline_sentiment,text
0,neutral,"[VirginAmerica, What, dhepburn, said, .]"
1,positive,"[VirginAmerica, plus, you, have, added, commercials, to, the, experience, ..., tacky, .]"
2,neutral,"[VirginAmerica, I, did, not, today, ..., Must, mean, I, need, to, take, another, trip, !]"
3,negative,"[VirginAmerica, it, is, really, aggressive, to, blast, obnoxious, ``, entertainment, '', in, your, guests, ', faces, they, have, little, recourse]"
4,negative,"[VirginAmerica, and, it, is, a, really, big, bad, thing, about, it]"


In [163]:
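# Removal of stopwords from the tokens
# Build the stop-word list, keeping 'not' so that negations survive.
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')

# `text` still holds the last tweet from the tokenization loop above,
# so this demonstrates stop-word removal on that single tweet.
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if word not in all_stopwords]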

print(tokens_without_sw)

['AmericanAir', 'ppl', 'need', 'know', 'many', 'seats', 'next', 'flight', '.', 'Plz', 'put', 'us', 'standby', 'people', 'next', 'flight', '?']


In [164]:
# Lemmatization and normalization of the tokenized words 
lemmatizer = WordNetLemmatizer()

def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in tokens_without_sw:
            new_words.append(word)
    return new_words

def lemmatize_list(words):
    new_words = []
    for word in words:
      new_words.append(lemmatizer.lemmatize(word, pos='v'))
    return new_words

def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    words = lemmatize_list(words)
    return ' '.join(words)


In [165]:
# Join the words in the list to convert back to text string in the dataframe. (So that each row contains the data in text format.)
data['text'] = data.apply(lambda row: normalize(row['text']), axis=1)
data.head(10)

Unnamed: 0,airline_sentiment,text
0,neutral,virginamerica what dhepburn say
1,positive,virginamerica plus you have add commercials to the experience tacky
2,neutral,virginamerica i do not today must mean i to take another trip
3,negative,virginamerica it be really aggressive to blast obnoxious entertainment in your guests face they have little recourse
4,negative,virginamerica and it be a really big bad thing about it
5,negative,virginamerica seriously would pay a for that do not have this play it be really the only bad thing about fly va
6,positive,virginamerica yes nearly every time i fly vx this ear worm will not go away
7,neutral,virginamerica really miss a prime opportunity for men without hat parody there
8,positive,virginamerica well i do notbut now i do d
9,positive,virginamerica it be amaze and arrive an hour early you be too good to me
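
In the cells above, `remove_stopwords` filters against `tokens_without_sw`, which holds the tokens of a single tweet, rather than against the stop-word list. This would explain why words such as "need" and "flight" are missing from the output while "the" and "to" remain. A minimal sketch of filtering against a stop-word collection instead; the tiny `STOP_WORDS` set is a hypothetical stand-in so the example runs without NLTK downloads (in the notebook the natural choice would be `all_stopwords`):

```python
# Hypothetical, deliberately small stop-word set for illustration only.
# 'not' is left out on purpose so that negations are preserved.
STOP_WORDS = {"i", "it", "is", "a", "the", "to", "and", "did", "about"}

def remove_stopwords(words, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop-word collection."""
    return [word for word in words if word not in stop_words]

tokens = ["virginamerica", "i", "did", "not", "today", "must", "mean",
          "i", "need", "to", "take", "another", "trip"]
filtered = remove_stopwords(tokens)
print(filtered)
```

Using a set rather than a list keeps each membership test constant-time, which matters when the filter runs over every token of 14,640 tweets.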


## 4. Vectorization.
### a) Using CountVectorizer for Vectorization.

In [166]:
# Vectorization (Convert text data to numbers).
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=2500)              
data_features = vectorizer.fit_transform(data['text'])

data_features = data_features.toarray()  

In [167]:
data_features.shape

(14640, 2500)

### b) Using TfidfVectorizer for Vectorization.

In [168]:
from sklearn.feature_extraction.text import TfidfVectorizer
# create the transform
vectorizer = TfidfVectorizer(max_features=2500)
# encode document
vector = vectorizer.fit_transform(data['text'])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

(14640, 2500)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
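
To make the difference between the two vectorizers concrete, here is a minimal sketch on a three-document toy corpus (the corpus is an assumption for illustration). `CountVectorizer` stores raw term counts, while `TfidfVectorizer` down-weights terms that occur in many documents and, by default, scales each row to unit L2 norm:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, already cleaned and lemmatized.
toy_corpus = ["flight be late", "flight be great", "late late bag"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_corpus).toarray()

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(toy_corpus).toarray()

# Columns follow the alphabetically sorted vocabulary.
print(sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get))
print(counts)
print(np.round(tfidf, 3))
```

In the TF-IDF matrix, "late" in the third document still dominates its row, but "flight" and "be", which appear in two of the three documents, receive lower weights than "great", which appears in only one.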


## 5. Fitting and evaluating the model using both types of vectorization.

### a) Applying CountVectorizer to fit and evaluate the Random Forest Classifier.

In [169]:
# Features and labels
X = data_features
y = data['airline_sentiment']  # string labels: negative, neutral, positive

In [170]:
# Split data into training and testing set.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape)
print(y_train.shape)

(10980, 2500)
(10980,)


In [171]:
# standardization
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [172]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

forest = RandomForestClassifier(n_estimators=10, n_jobs=4)

forest = forest.fit(X_train, y_train)

print(forest)

print(np.mean(cross_val_score(forest, data_features, y, cv=20)))


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=4,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)
0.7354508196721311


In [173]:
# Predicting using the model for X_test data.

Y_pred_class = forest.predict(X_test)

In [174]:
num_classes = 3
from sklearn.metrics import classification_report
target_names = ["Class {}".format(i) for i in range(num_classes)]
print(classification_report(y_test, Y_pred_class, target_names=target_names))

              precision    recall  f1-score   support

     Class 0       0.80      0.93      0.86      2340
     Class 1       0.61      0.46      0.52       738
     Class 2       0.80      0.49      0.61       582

    accuracy                           0.77      3660
   macro avg       0.74      0.63      0.66      3660
weighted avg       0.76      0.77      0.75      3660
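
scikit-learn sorts string labels alphabetically, so Class 0, Class 1 and Class 2 in the report above correspond to negative, neutral and positive (the supports of 2340, 738 and 582 match those class proportions). A minimal sketch of printing the report with the real label names, using made-up predictions:

```python
from sklearn.metrics import classification_report

# Hypothetical true labels and predictions for illustration.
y_true = ["negative", "negative", "neutral", "positive", "negative", "positive"]
y_pred = ["negative", "neutral", "neutral", "positive", "negative", "negative"]

# Sorted unique labels give the same order scikit-learn uses internally.
label_names = sorted(set(y_true))
report = classification_report(y_true, y_pred, labels=label_names,
                               target_names=label_names)
print(report)
```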



### b) Applying TfidfVectorizer to fit and evaluate the Random Forest Classifier.  

In [175]:
X = vector

In [176]:
# Split data into training and testing set.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [177]:
# Using Random Forest to build model for the classification of sentiment analysis.
# Also calculating the cross validation score.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

forest_vector = RandomForestClassifier(n_estimators=10, n_jobs=4)

forest_vector = forest_vector.fit(X_train, y_train)

print(forest_vector)

print(np.mean(cross_val_score(forest_vector, vector, y, cv=20)))

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=4,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)
0.7286885245901639


In [178]:
# Predicting using the model for X_test data.

Y_pred_class2 = forest_vector.predict(X_test)

In [179]:
num_classes = 3
from sklearn.metrics import classification_report
target_names = ["Class {}".format(i) for i in range(num_classes)]
print(classification_report(y_test, Y_pred_class2, target_names=target_names))

              precision    recall  f1-score   support

     Class 0       0.79      0.95      0.86      2340
     Class 1       0.63      0.42      0.51       738
     Class 2       0.81      0.48      0.61       582

    accuracy                           0.77      3660
   macro avg       0.74      0.62      0.66      3660
weighted avg       0.76      0.77      0.75      3660



## 6. Summarize your understanding of the application of various Pre-processing and Vectorization and performance of your model on this dataset.

- The objective of this project was to implement the concepts and techniques of Natural Language Processing (NLP) as learnt in the course. A sentiment analysis was conducted to classify the sentiment of tweets as positive, negative or neutral. 
- The learning outcomes of the project were:
  - a) Basic understanding of text pre-processing.
  - b) What to do after text pre-processing, i.e. Bag of words & TF-IDF.
  - c) Build the classification model.
  - d) Evaluate the model.

Consequently, these outcomes were achieved successfully as follows:  

First, various pre-processing techniques learnt in the course were applied to the dataset which initially contained 14,640 rows and 15 columns. However, all other columns were dropped except “text” and “airline_sentiment” which were the focus of our sentiment analysis. 

Second, the dataset was checked for null values, and none were found in the two retained columns. The distribution of the “airline_sentiment” column was also examined to check for class imbalance that might affect the final model. The classes are imbalanced, with negative tweets making up about 63% of the data, but no synthetic data was added using SMOTE because neither the neutral nor the positive class was considered small enough to require it. Below are the class proportions as captured:
- negative: 0.626913
- neutral: 0.211680
- positive: 0.161407


Third, various pre-processing steps were applied to the input feature, which in this case is the tweet text, i.e. data['text']. The raw tweets contained punctuation marks, numbers, emoji characters, stopwords, non-ASCII characters, URLs, HTML tags, etc. The techniques shown in section three (3) above were therefore applied to clean the tweet text and reduce noise in the data.

Four, tokenization was applied to break each tweet into individual tokens. After tokenization, the tokens were reduced to their base form; lemmatization was preferred over stemming so that the results remain dictionary words.

Five, after tokenization, the tokens were normalized (non-ASCII removal, lowercasing, punctuation removal, the stop-word filtering step and lemmatization) and joined back into a single string per tweet, ready for vectorization. Vectorization was conducted to create a sparse document-term matrix, as shown in section four (4) above, applying both CountVectorizer and TfidfVectorizer.

Six, vectorization converted the tweet text into a numerical matrix that forms our X variable, which was fed into the model (Random Forest Classifier) to predict our target variable (y). Our target variable y has three (3) classes, i.e. negative, neutral and positive. The target was passed to the classifier as the original string labels rather than as a one-hot encoding; the generic names Class 0, Class 1 and Class 2 in the classification reports follow the alphabetical order of these labels. Accordingly, Class 0 represents negative sentiments, Class 1 represents neutral sentiments and Class 2 represents positive sentiments.

Seven, both forms of the vectorized data (the X variable), i.e. the CountVectorizer and TfidfVectorizer features derived from the tweet text, were used to fit the Random Forest Classifier. The 20-fold cross-validation on the full dataset yielded a mean accuracy of about 73% for both (0.735 with CountVectorizer and 0.729 with TfidfVectorizer), while the held-out test set yielded an accuracy of 77% for both. Please see the classification reports produced in section five (5) above.
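
In section four (4) the vectorizers are fitted on all tweets before the data is split or cross-validated. One way to keep vectorization inside each fold is a scikit-learn `Pipeline`. A minimal sketch on a hypothetical toy corpus (the texts, labels, step names and `random_state` value are assumptions, not the notebook's data or settings):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical cleaned tweets, four per class.
texts = [
    "flight be late again", "lose my bag terrible service",
    "cancel flight no help", "rude staff bad delay",
    "what time do flight leave", "need to change my seat",
    "where be gate for flight", "do you fly to boston",
    "great crew thank you", "love the new seat",
    "amaze service arrive early", "best flight ever thank",
]
labels = ["negative"] * 4 + ["neutral"] * 4 + ["positive"] * 4

# The vectorizer is refitted on the training part of every fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("forest", RandomForestClassifier(n_estimators=10, random_state=42)),
])

# For classifiers, an integer cv uses stratified folds.
scores = cross_val_score(pipeline, texts, labels, cv=3)
print(scores.mean())
```

With this arrangement the vocabulary and IDF weights never see the held-out fold, so the cross-validation score is a cleaner estimate of performance on unseen tweets.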

In conclusion, sentiment analysis such as classifying the sentiments of tweets as positive or negative or neutral can be implemented successfully using techniques of NLP as captured above. 