## Etan Ogbemi
### Background and Context:

Twitter has 330 million monthly active users, which allows businesses to reach a broad population and connect with customers without intermediaries. On the other hand, there’s so much information that it’s difficult for brands to quickly detect negative social mentions that could harm their business.

That's why sentiment analysis/classification, which involves monitoring emotions in conversations on social media platforms, has become a key strategy in social media marketing.


Listening to how customers feel about the product/service on Twitter allows companies to understand their audience, keep on top of what’s being said about their brand and their competitors, and discover new trends in the industry.

 

### Data Description:

The data comes from a sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February 2015, and contributors were asked first to classify each tweet as positive, negative, or neutral, and then to categorize the reason for each negative tweet (such as "late flight" or "rude service").

 

### Dataset:

The dataset has the following columns:

* tweet_id
* airline_sentiment
* airline_sentiment_confidence
* negativereason
* negativereason_confidence
* airline
* airline_sentiment_gold
* name
* negativereason_gold
* retweet_count
* text
* tweet_coord
* tweet_created
* tweet_location
* user_timezone

#### Objective
This project will attempt to analyse the sentiments of airline passengers in order to glean information about how passengers feel about the various airlines they travel on. It will also try to make some determinations as to why the passengers have those sentiments.

We will build and tune a predictive model to classify the sentiment expressed in tweets it has not seen (the test data), and then inspect which words the model relies on most. This should in turn provide insights for airlines to improve their services and the travel experience, and so generate more positive sentiments from their passengers. All of this will be achieved using sentiment analysis, encoding, data analysis, feature selection, data pre-processing and vectorization, amongst other techniques we have learnt in this module and throughout this programme.



In [1]:
try:
    from google.colab import drive        # only available inside Google Colab
    drive.mount('/content/drive')
except ImportError:
    print("Not running in Colab; skipping the Google Drive mount.")

In [None]:
import re, string, unicodedata
import numpy as np                                  
import pandas as pd                              
import nltk                                     
from bs4 import BeautifulSoup

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Downloading NLTK resources (stopwords, tokenizer, VADER lexicon, WordNet)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('vader_lexicon')
nltk.download('wordnet')
!pip install vaderSentiment
!pip install textblob
!pip install contractions
!pip install wordcloud

from nltk.corpus import stopwords                   #Stopwords corpus
# from nltk.corpus import vader_lexicon               #vader_lexicon corpus
from nltk.stem import PorterStemmer                 # Stemmer
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer         # Import Lemmatizer
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator, wordcloud


from sklearn.feature_extraction.text import CountVectorizer          #For Bag of words
from sklearn.feature_extraction.text import TfidfVectorizer          #For TF-IDF

from sklearn.ensemble import RandomForestClassifier       # Import Random forest Classifier
from sklearn.metrics import classification_report         # Import Classification report
from sklearn.model_selection import cross_val_score, StratifiedKFold, KFold      
from sklearn.metrics import accuracy_score


import contractions
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()                  # VADER analyzer instance
# set threshold for compound score, we will vary this 
# def classify_compound(text, threshold=0.33):

from textblob import TextBlob
# In order to be able to view long text strings we remove the limit on column width
pd.set_option('display.max_colwidth', None)

In [None]:
# Import the data set
data_original = pd.read_csv("/content/drive/MyDrive/Colab_Notebooks/NLP/Tweets.csv")

In [None]:
# copying data to another variable to avoid any changes to original data
data_copy = data_original.copy()

### Exploratory Data Analysis

We will look at the dataset to determine information about the data and understand its structure and other properties.

In [None]:
data_copy.shape

In [None]:
# viewing the first few rows of the data
data_copy.head(10)

In [None]:
data_copy.info()

In [None]:
# distribution of tweets across airlines
sns.countplot(x='airline', data=data_copy);

In [None]:
# Distribution of sentiments across all tweets
sns.countplot(x='airline_sentiment', data=data_copy);

In [None]:
# Distribution of sentiments by airline
sns.countplot(x='airline', hue='airline_sentiment', data=data_copy);

In [None]:
# Distribution of all negative reasons
plt.figure(figsize=(15,10)) #adjust the size of plot
ax=sns.countplot(x=data_copy['negativereason'],data=data_copy,palette='magma')

ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")  #it will rotate text on x axis

plt.tight_layout()
plt.show()

### Observation
In 2015, when this data was scraped, the [biggest airlines](https://en.wikipedia.org/wiki/List_of_largest_airlines_in_North_America) among those in this survey, in descending order of passengers enplaned (according to Wikipedia and other online sources), were American Airlines, Delta Air Lines, Southwest Airlines, United Airlines, US Airways and Virgin America. It is useful to note, however, that United received more tweets (probably mostly negative, judging from the sentiment analysis below) than bigger airlines like American, Delta and Southwest!

From our analysis, we can see that there were roughly four times as many negative tweets as positive tweets, and roughly three times as many negative tweets as tweets classified as neutral. From this, we could be inclined to surmise that people are more likely to tweet when they have a negative sentiment than when they have a positive or neutral sentiment about their travel experience. However, when we view Virgin America and Delta, the imbalance between negative and positive sentiments appears less pronounced. This may suggest that when travellers receive exceptionally good service, they might be just as inclined to tweet about their experience as when they have bad experiences.
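
Ratios like these can be read directly from the label counts. The following is a minimal sketch: the small `toy` frame is made-up data standing in for `data_copy`, and only the column names `airline` and `airline_sentiment` come from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for data_copy (not the real tweets)
toy = pd.DataFrame({
    "airline": ["United", "United", "United", "United",
                "Delta", "Delta", "Delta", "Delta"],
    "airline_sentiment": ["negative", "negative", "negative", "positive",
                          "negative", "neutral", "positive", "positive"],
})

# Overall share of each sentiment class
overall_share = toy["airline_sentiment"].value_counts(normalize=True)

# Sentiment counts per airline, then the negative-to-positive ratio per airline
per_airline = pd.crosstab(toy["airline"], toy["airline_sentiment"])
neg_to_pos = per_airline["negative"] / per_airline["positive"]

print(overall_share)
print(neg_to_pos)
```

A ratio well above 1 for an airline means its negative tweets dominate; a ratio near or below 1 means the imbalance is much weaker for that airline.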

One may also be tempted to conclude that there is an inverse relationship between the size of an airline and customer satisfaction, although six airlines are too few to establish this. Possible reasons include: smaller airlines work harder to increase market share; larger airlines have less incentive to improve the quality of their service because they are already large; maintaining quality of service becomes more challenging with size; or a combination of these factors.

### Observations

It may be useful to note that the dataset comprises not just the data scraped from users' tweets, but also sentiment analysis that has already been carried out, some of which may be inaccurate.

For example, when viewed in an Excel spreadsheet, row 84 (tweet_id 569933405506310000) has "airline_sentiment" classified as "negative" with a confidence of almost 70%.  Further, the reason for the negative classification is given in the feature "negativereason" as "Late Flight".  However, if we examine the actual tweet by the customer in the feature "text", it reads **"@VirginAmerica you're the best!! Whenever I (begrudgingly) use any other airline I'm delayed and Late Flight".**  This actually appears to be more of a positive sentiment about Virgin America than a negative one, even though the words "Late Flight" appear in the tweet.

In [None]:
data_copy.isnull().sum()

### Data pre processing

There are several features that have no utility for our sentiment analysis and that we will drop: tweet_id, negativereason_confidence, airline_sentiment_gold, name, negativereason_gold, retweet_count, tweet_coord, tweet_created, tweet_location and user_timezone.  Many of them contain several thousand NaN values, and their meaning is unclear.  Since we have a unique index, others like tweet_id and name are redundant.

We will also clean up the text data by removing non-alphabetical characters such as @ signs and numbers, and by replacing contractions.  We will also tokenize and lemmatize the text data, and prepare a stopword-removal step.  From reviewing the data, however, we notice that the text (that is, the tweets) always begins with the Twitter handle of the airline, e.g. @united or @VirginAmerica, so rather than remove just the @ sign, we will remove the whole Twitter handle of the 6 airlines from the text feature.
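
To illustrate the effect of stripping handles and links, here is a minimal sketch on a made-up tweet. The helper name `strip_handles_and_links` and the sample text are illustrative assumptions; the two patterns simply match any run of non-whitespace characters starting with `@` or `http`.

```python
import re

def strip_handles_and_links(text):
    """Remove @mentions and http(s) links from a tweet (illustrative helper)."""
    text = re.sub(r'@\S+', '', text)      # any token starting with @
    text = re.sub(r'http\S+', '', text)   # any token starting with http
    return text.strip()

# Made-up tweet, not taken from the dataset
sample = "@united thanks to @JetBlue staff too, see http://t.co/abc123 for details"
cleaned = strip_handles_and_links(sample)
print(cleaned)
```

Note that a pattern of this kind removes every mention in the tweet, not only the airline handle at the start, so references to competing airlines written as handles are lost as well.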

In [None]:
data_copy.shape

In [None]:
data_copy.head(10)

In [None]:
# Remove Twitter handles (tokens starting with @) and links (tokens starting with http)
# using regular expressions. This is done at this point to make the wordclouds neater;
# more data preparation will be done later.
def remove_twitter_handles(text):
    text = re.sub(r'@\S+', '', text)       # drop every @mention, including the airline handle
    text = re.sub(r'http\S+', '', text)    # drop links
    return text
data_copy['text'] = data_copy['text'].apply(remove_twitter_handles)

In [None]:
data_copy.head(10)

### Changing the name of the text column from "text" to "passenger_tweet"

In [None]:
data_copy.rename(columns = {'text':'passenger_tweet'}, inplace = True)

In [None]:
data_copy.head(10)

### Wordcloud for positive and negative sentiments

In [None]:
def show_wordcloud(data_copy, title):
    text = ' '.join(data_copy['passenger_tweet'].astype(str).tolist())    # Join all tweets into a single string
    stopwords = set(wordcloud.STOPWORDS)                                  # stopword set shipped with the wordcloud package

    fig_wordcloud = wordcloud.WordCloud(stopwords=stopwords,background_color='white', max_words=75,         # limit the cloud to 75 words
                    colormap='viridis', width=800, height=600).generate(text)
    
    plt.figure(figsize=(14,11), frameon=True)                             
    plt.imshow(fig_wordcloud)  
    plt.axis('off')
    plt.title(title, fontsize=30)
    plt.show()

In [None]:
# Negative sentiment wordcloud
show_wordcloud(data_copy[data_copy.airline_sentiment == "negative"], title = "Negative sentiment wordcloud")

In [None]:
# Positive sentiment wordcloud
show_wordcloud(data_copy[data_copy.airline_sentiment == "positive"], title = "Positive sentiment wordcloud")

Since our objective, as indicated earlier, is to perform sentiment analysis based on tweets from customers, our focus will be solely on two of the columns (features) in the dataset: "airline_sentiment", which is a subjective classification of the sentiment expressed in each customer tweet, and "passenger_tweet" (originally "text"), the actual tweets from the customers, in which we hope some sentiment is expressed.

We will therefore drop all the other columns.


In [None]:
# dropping all columns except for airline_sentiment and passenger_tweet
data_copy.drop(["tweet_id", "airline_sentiment_confidence", "negativereason", "negativereason_confidence", "airline", "airline_sentiment_gold", "name", "negativereason_gold", "retweet_count", "tweet_coord", "tweet_created", "tweet_location", "user_timezone"], axis=1, inplace=True)

In [None]:
# Checking to ensure only two features are remaining
data_copy.head(25)

In [None]:
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)

data_copy['passenger_tweet'] = data_copy['passenger_tweet'].apply(replace_contractions)

data_copy.head(25)

In [None]:
def remove_numbers(text):
    """Remove all digits from string of text"""
    text = re.sub(r'\d+', '', text)
    return text

data_copy['passenger_tweet'] = data_copy['passenger_tweet'].apply(remove_numbers)

data_copy.head(25)

In [None]:
data_copy['passenger_tweet'] = data_copy['passenger_tweet'].apply(nltk.word_tokenize)   # Tokenization of data

In [None]:
# Confirm tokenization
data_copy.head(25)

### Data pre-processing

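We will process the data further by removing non-ASCII characters, converting all the words in the tweets to lower case, removing punctuation and lemmatizing.  Contractions and numbers have already been handled, and the tweets tokenized, in the cells above.  A stopword-removal function is defined as well, but its call is left commented out in `normalize`.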
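
One reason to be cautious about stopword removal in sentiment analysis is that common stopword lists include negations. The sketch below uses a tiny hand-written stopword set (an assumption for illustration; NLTK's English list is much longer but also contains "not") to show how a negative tweet can lose its negation.

```python
# Tiny illustrative stopword set; a hypothetical stand-in for stopwords.words('english')
toy_stop_words = {"the", "was", "is", "a", "not", "no"}

def drop_stopwords(tokens, stop_words):
    """Return the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = ["the", "flight", "was", "not", "good"]
filtered = drop_stopwords(tokens, toy_stop_words)
print(filtered)   # the negation "not" is removed along with the other stopwords
```

After filtering, the remaining tokens read as a positive statement, which is the opposite of what the passenger wrote.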

In [None]:
lemmatizer = WordNetLemmatizer()

def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

stop_words = set(stopwords.words('english'))      # NLTK English stopword list

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in stop_words:
            new_words.append(word)
    return new_words

def lemmatize_list(words):
    """Lemmatize each word in list of tokenized words, treating it as a verb"""
    new_words = []
    for word in words:
        new_words.append(lemmatizer.lemmatize(word, pos='v'))
    return new_words

def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
#    words = remove_stopwords(words)
    words = lemmatize_list(words)
    return ' '.join(words)

data_copy['passenger_tweet'] = data_copy['passenger_tweet'].apply(normalize)

In [None]:
# Vectorization (Convert text data to numbers).
# from sklearn.feature_extraction.text import CountVectorizer

Count_vec = CountVectorizer(max_features=2500)                # Keep only 2,500 features, as more features will increase the processing time.
data_features = Count_vec.fit_transform(data_copy['passenger_tweet'])

data_features = data_features.toarray()                        # Convert the data features to array.

In [None]:
data_copy.head(25)

### Observation

I note that some of the lemmatization and punctuation removal was not perfect: for example, in row 19 BOS-FLL becomes bosfll, and in row 15 SFO-PDX becomes sfopdx.  These will tend to be unique words and should therefore have very limited impact during learning and modelling.
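
If route codes of this kind turned out to matter, one option would be to replace them with a placeholder token before the text is lower-cased and punctuation is stripped. This is a minimal sketch, assuming routes are written as two three-letter uppercase codes joined by a hyphen; the helper name `mask_routes` and the `route` placeholder are illustrative and not part of the notebook.

```python
import re

# Two three-letter uppercase airport codes joined by a hyphen, e.g. BOS-FLL
ROUTE_PATTERN = re.compile(r'\b[A-Z]{3}-[A-Z]{3}\b')

def mask_routes(text, placeholder="route"):
    """Replace airport-pair codes with a single placeholder token."""
    return ROUTE_PATTERN.sub(placeholder, text)

masked = mask_routes("Flight BOS-FLL delayed again, SFO-PDX was fine")
print(masked)
```

Mapping every route to the same token turns many one-off words into a single feature, at the cost of losing which route was mentioned.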

In [None]:
# Binary target: negative -> '0'; positive and neutral are both mapped to '1' (non-negative)
data_copy['airline_sentiment'] = data_copy['airline_sentiment'].map(
    {'positive': '1', 'neutral': '1', 'negative': '0'})

In [None]:
data_copy.head(25)

## CountVectorizer

In [None]:
# Vectorization (Convert text data to numbers).
from sklearn.feature_extraction.text import CountVectorizer

bow_vec = CountVectorizer(max_features=2500)                # We will use only 2,500 features and discard the rest
data_features = bow_vec.fit_transform(data_copy['passenger_tweet'])

data_features = data_features.toarray()                        # Convert the data features to array.

In [None]:
data_features.shape

In [None]:
X = data_features

y = data_copy.airline_sentiment

In [None]:
# Split data into training and testing set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)

In [None]:
# Finding optimal number of base learners using k-fold CV ->
base_ln = np.arange(100,400,100).tolist()
base_ln

In [None]:
# K-Fold Cross - validation .
cv_scores = []
for b in base_ln:
    clf = RandomForestClassifier(n_estimators = b)
    scores = cross_val_score(clf, X_train, y_train, cv = 5, scoring = 'accuracy')
    cv_scores.append(scores.mean())

In [None]:
# plotting the error as the number of base learners increases
error = [1 - x for x in cv_scores]                                 # misclassification error for each number of estimators
optimal_learners = base_ln[error.index(min(error))]                # number of estimators with the minimum error
plt.plot(base_ln, error)                                           # misclassification error against number of estimators
xy = (optimal_learners, min(error))
plt.annotate('(%s, %s)' % xy, xy = xy, textcoords='data')
plt.xlabel("Number of base learners")
plt.ylabel("Misclassification Error")
plt.show()

In [None]:
# Training the best model and calculating accuracy on test data .
clf = RandomForestClassifier(n_estimators = optimal_learners)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
count_vectorizer_predicted = clf.predict(X_test)
print(classification_report(y_test ,count_vectorizer_predicted , target_names = ['0' , '1']))
print("Accuracy of the model is : ",accuracy_score(y_test,count_vectorizer_predicted))

In [None]:
# Confusion matrix to get an idea of how the predictions are distributed among the classes.


from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(y_test, count_vectorizer_predicted)

print(conf_mat)

df_cm = pd.DataFrame(conf_mat, index = [i for i in ['0', '1']],
                  columns = [i for i in ['0', '1']])
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True, fmt='g')

While overall the true positives are high and the false negatives and false positives are comparatively low, there is still room for the model to improve.  Nonetheless, an accuracy of 84% is good.
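
The quantities discussed here can be derived by hand from the four cells of a binary confusion matrix. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; they are not the results of this model.

```python
# Hypothetical 2x2 confusion matrix in the layout scikit-learn prints:
# rows are true classes, columns are predicted classes, in the order ['0', '1']
conf = [[80, 20],
        [10, 40]]

tn, fp = conf[0]   # true class '0': predicted '0', predicted '1'
fn, tp = conf[1]   # true class '1': predicted '0', predicted '1'

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)     # of the tweets predicted '1', the share that really are '1'
recall = tp / (tp + fn)        # of the tweets that really are '1', the share that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the classes are imbalanced, accuracy alone can look good while precision or recall for the smaller class lags behind, which is why the classification report is worth reading alongside it.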

In [None]:
all_features = bow_vec.get_feature_names_out()            # vocabulary of the fitted vectorizer (scikit-learn >= 1.0)
feat = clf.feature_importances_
features = np.argsort(feat)[::-1]                         # feature indices, most important first
top_features = ','.join(all_features[i] for i in features[0:50])   # the 50 most important features

print(top_features)

print(" ")
print(" ")

top_wordcloud = WordCloud(background_color="white", colormap='tab20c', width=2000,
                          height=1000).generate(top_features)

# Display the generated image:
plt.figure(figsize=(20, 18))
plt.imshow(top_wordcloud, interpolation='bilinear')
plt.title('Top 50 features WordCloud', fontsize=20)
plt.axis("off")
plt.show()

### Term Frequency - Inverse Document Frequency

TF–IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today. [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
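
To make the weighting concrete, here is a small hand-rolled sketch on a made-up three-document corpus. It uses the smoothed variant idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation, which is what scikit-learn's `TfidfVectorizer` applies with its default settings; the corpus and helper names are illustrative assumptions.

```python
import math
from collections import Counter

# Made-up mini corpus, not taken from the dataset
docs = [
    "flight late again",
    "flight cancelled",
    "great flight great crew",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    """L2-normalised tf-idf weights for one tokenized document."""
    counts = Counter(doc)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(tokenized[2])
print(weights)
```

In the third document, "flight" appears in every document and so receives the lowest weight, while "great" is both rarer across the corpus and repeated within the document, so it receives the highest.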



In [None]:
# Using TfidfVectorizer to convert text data to numbers.

# from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(max_features=500)
data_features = tfidf_vect.fit_transform(data_copy['passenger_tweet'])

data_features = data_features.toarray()

data_features.shape     #feature shape

In [None]:
X = data_features

y = data_copy.airline_sentiment

In [None]:
# Split data into training and testing set.

# from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)

In [None]:
# Finding optimal number of base learners using k-fold CV ->
base_ln = np.arange(100,400,100).tolist()
base_ln

In [None]:
# K-Fold Cross - validation .
cv_scores = []
for b in base_ln:
    clf = RandomForestClassifier(n_estimators = b)
    scores = cross_val_score(clf, X_train, y_train, cv = 5, scoring = 'accuracy')
    cv_scores.append(scores.mean())

In [None]:
# plotting the error as the number of base learners increases
error = [1 - x for x in cv_scores]                                 # misclassification error for each number of estimators
optimal_learners = base_ln[error.index(min(error))]                # number of estimators with the minimum error
plt.plot(base_ln, error)                                           # misclassification error against number of estimators
xy = (optimal_learners, min(error))
plt.annotate('(%s, %s)' % xy, xy = xy, textcoords='data')
plt.xlabel("Number of base learners")
plt.ylabel("Misclassification Error")
plt.show()

In [None]:
# Training the best model and calculating accuracy on test data .
clf = RandomForestClassifier(n_estimators = optimal_learners)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
tf_idf_predicted = clf.predict(X_test)
print(classification_report(y_test , tf_idf_predicted , target_names = ['0' , '1']))
print("Accuracy of the model is : ",accuracy_score(y_test,tf_idf_predicted))

In [None]:
# Print and plot the confusion matrix to get an idea of how the predictions are distributed among the classes.


conf_mat = confusion_matrix(y_test, tf_idf_predicted)

print(conf_mat)

df_cm = pd.DataFrame(conf_mat, index = [i for i in ['0', '1']],
                  columns = [i for i in ['0', '1']])
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True, fmt='g')

The performance of the model is pretty decent, with true positives being dominant and true negatives less so.  Both exceed the false negatives and false positives, although in this case the false negatives are a bit closer to the true negatives.  This is reflected in the slightly better performance of BoW when compared to TF-IDF.

In [None]:
all_features = tfidf_vect.get_feature_names_out()         # vocabulary of the fitted vectorizer (scikit-learn >= 1.0)
feat = clf.feature_importances_
features = np.argsort(feat)[::-1]                         # feature indices, most important first
top_features = ', '.join(all_features[i] for i in features[0:40])  # the 40 most important features

print(top_features)

print(" ")
print(" ")

top_wordcloud = WordCloud(background_color="white", colormap='tab20c', width=2000,
                          height=1000).generate(top_features)

# Display the generated image:
plt.figure(figsize=(14, 11))
plt.imshow(top_wordcloud, interpolation='bilinear')
plt.title('Top 40 features WordCloud', fontsize=20)
plt.axis("off")
plt.show()

## Performance comparison

TF-IDF had an accuracy of 83% as compared to Bag of Words, which had an accuracy of 84%.  Although the numbers are very close, the results were somewhat surprising, as one would have expected TF-IDF to perform better due to its weighting, which often produces better accuracy than the raw word counts used by BoW.  Note, however, that the comparison is not like-for-like: the BoW model was trained on 2,500 features while the TF-IDF model used only 500.
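
Since the two runs above cap the vocabulary at different sizes (2,500 for BoW and 500 for TF-IDF), a fairer comparison would give both vectorizers the same cap and the same cross-validation folds. The following is a minimal sketch under those assumptions, using made-up tweets in place of `passenger_tweet`; the scores it prints say nothing about the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up tweets and labels standing in for passenger_tweet / airline_sentiment
tweets = [
    "flight delay again", "lose my bag", "rude service at gate", "cancel flight no refund",
    "great crew thank you", "love the new seat", "smooth flight thanks", "best airline ever",
]
labels = ["0", "0", "0", "0", "1", "1", "1", "1"]

MAX_FEATURES = 2500   # same vocabulary cap for both vectorizers

results = {}
for name, vectorizer in [("bow", CountVectorizer(max_features=MAX_FEATURES)),
                         ("tfidf", TfidfVectorizer(max_features=MAX_FEATURES))]:
    # The pipeline refits the vectorizer inside each fold, so the validation
    # fold does not influence the vocabulary or the idf weights
    model = make_pipeline(vectorizer,
                          RandomForestClassifier(n_estimators=100, random_state=0))
    scores = cross_val_score(model, tweets, labels, cv=2, scoring="accuracy")
    results[name] = scores.mean()

print(results)
```

Fixing `random_state` and the number of trees keeps the forest itself identical between the two runs, so any difference in the scores comes from the vectorizer.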