
In [None]:
import sys
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import numpy as np
from prettytable import PrettyTable

In [None]:
def read_tweets():
  """Reads Tweets from the US Airline Sentiment dataset.

  Returns:
  DataFrame: Table containing each Tweet, the airline it refers to, and its sentiment

  """
  return pd.read_csv("https://raw.githubusercontent.com/djm160830/twt-airline-sa/master/archive/Tweets.csv",
                     usecols=["airline_sentiment", "airline", "text"])


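Convert text to lowercase so that the same word written in different cases (for example "Delayed" and "delayed") is treated as a single token.

CountVectorizer() converts the collection of text documents (Tweets) into a matrix of token counts, with one row per Tweet.

TfidfTransformer() reweights each token count from CountVectorizer() by its tf-idf score, so tokens that appear in many Tweets carry less weight.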

In [None]:
def preprocess(data):
	"""Converts text to lowercase, label encodes the target, performs a train/test split, and computes tf-idf scores
	
	Parameters: 
	data (DataFrame): Table containing Tweets, which airline the Tweet is referring to, and sentiment of Tweet
	
	Returns:
	counts (sparse matrix): Tf-idf representation of the count-vectorized text from the training dataset
	X_test_tf (sparse matrix): Tf-idf representation of the count-vectorized text from the test dataset
	y_train (Series): Label encoded target variable from training dataset
	y_test (Series): Label encoded target variable from testing dataset
	data['airline_sentiment'] (Series): Label encoded sentiments for the full dataset
	X_test (DataFrame): Raw (untransformed) feature variables of the test dataset
	"""
	# Convert text to lowercase 
	for column in data.columns:
		data[column] = data[column].str.lower()

	# Categorize target variable
	le = preprocessing.LabelEncoder()
	data['airline_sentiment'] = le.fit_transform(data['airline_sentiment'])

	# Split data into training and testing (10% testing) using train_test_split
	X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, 1:], data.iloc[:, 0], test_size=0.10)
 
	# Transform training text using CountVectorizer and TfidfTransformer
	# CountVectorizer(): Counts frequency of words
	# TfidfTransformer(): Adjusts for the fact that some words appear more frequently in general (ex: 'we', 'the').
	count_vect = CountVectorizer()
	counts = count_vect.fit_transform(X_train['text'])  # Learn the vocabulary dictionary and return the count matrix
	transformer = TfidfTransformer(use_idf=True)
	counts = transformer.fit_transform(counts)  # Learn IDF vector (global term weights), and transform the count matrix to a tf-idf representation

	# Process test data
	X_test_cv = count_vect.transform(X_test['text']) 
	X_test_tf = transformer.transform(X_test_cv)

	return counts, X_test_tf, y_train, y_test, data['airline_sentiment'], X_test

In [None]:
t = PrettyTable(['ITERATION', 'ALPHA', 'FIT_PRIOR', 'TRAINING ACCURACY'])

# For each of 5 alpha values (Laplace/Lidstone smoothing), fit the model 3 times with fit_prior=True
# and 3 times with fit_prior=False, and output the parameters and accuracy in a tabular format.
for i, a in enumerate(np.linspace(1.25, 1.0e-10, 5)):
  for fp in [True, True, True, False, False, False]:
    # Reload on every run: preprocess() modifies the DataFrame in place and draws a new random split
    df = read_tweets()

    X_train, X_test, y_train, y_test, target_sentiment, X_test_raw = preprocess(df)

    # Build a Multinomial Naïve Bayes (MNB) model using the training dataset
    model = MultinomialNB(alpha=a, fit_prior=fp).fit(X_train, y_train)

    # Predict on the test set; the accuracy recorded in the table is measured on the training set
    predicted = model.predict(X_test)
    accuracy = model.score(X_train, y_train)
    t.add_row([i+1, a, fp, accuracy])
  if i != 4: t.add_row([' ', ' ', ' ', ' '])  # Blank row between alpha groups
print(t)

# Average sentiment of each airline, and which airline has the highest positive sentiment
df['airline_sentiment'] = target_sentiment
highest_sentiment = df.groupby('airline').agg(mean_sentiment=('airline_sentiment', 'mean')).sort_values(by='mean_sentiment', ascending=False)
print(f'\n{highest_sentiment}')
print(f'\nHighest positive sentiment: \n{highest_sentiment[:1]}')

+-----------+----------------+-----------+--------------------+
| ITERATION |     ALPHA      | FIT_PRIOR | TRAINING ACCURACY  |
+-----------+----------------+-----------+--------------------+
|     1     |      1.25      |    True   | 0.6918639951426837 |
|     1     |      1.25      |    True   | 0.6940649666059502 |
|     1     |      1.25      |    True   | 0.6929265330904675 |
|     1     |      1.25      |   False   | 0.8047207043108683 |
|     1     |      1.25      |   False   | 0.8044930176077717 |
|     1     |      1.25      |   False   | 0.8047207043108683 |
|           |                |           |                    |
|     2     | 0.937500000025 |    True   | 0.7178202792956891 |
|     2     | 0.937500000025 |    True   | 0.7186551305403764 |
|     2     | 0.937500000025 |    True   | 0.7181997571341834 |
|     2     | 0.937500000025 |   False   | 0.831056466302368  |
|     2     | 0.937500000025 |   False   | 0.8313600485731634 |
|     2     | 0.937500000025 |   False  

According to the table, the model achieved high training accuracy with fit_prior=False and **alpha**=1.0e-10.

# What is alpha? 
Scikit-learn's documentation defines it as "Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing)." We want to use something like Laplace/Lidstone smoothing because frequency-based probability estimates, like those computed from the tf-idf features in this example, "might introduce zeroes when multiplying probabilities, leading to a failure in preserving the information contributed by non-zero probabilities." We want to prevent a probability of zero if we know that something has the possibility (even the smallest possibility) of occurring, because "this oversimplification is inaccurate and often unhelpful, particularly in probability-based machine learning techniques".

Laplace/Lidstone smoothing handles this problem of zero probability by adding a smoothing parameter, $\alpha$, to the count $x_i$ of each outcome $i$ in a multinomial distribution with $N$ trials and $k$ possible outcomes (here, feature variables):

\begin{equation}
\frac{x_i+\alpha}{N+\alpha k}
\end{equation}

Setting $\alpha = 1$ is known as Laplace smoothing, and $\alpha < 1$ as Lidstone smoothing. In practice, a small value of $\alpha$ is typically chosen, as seen with this model.
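
To make the smoothing formula concrete, here is a small sketch that applies it to made-up counts. The helper `smoothed_probabilities` and the counts are hypothetical and only illustrate the arithmetic; they are not part of the model code in this notebook.

```python
import numpy as np

def smoothed_probabilities(counts, alpha):
    """Additive smoothing: (x_i + alpha) / (N + alpha * k)."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()   # total number of observations
    k = counts.size    # number of possible outcomes
    return (counts + alpha) / (N + alpha * k)

# Hypothetical word counts for one class; the third word was never observed
counts = [3, 1, 0]

for alpha in (1.0, 0.1, 1.0e-10):
    print(alpha, smoothed_probabilities(counts, alpha))
```

With $\alpha = 1$ the unseen word gets probability $1/7$ instead of zero. As $\alpha$ shrinks towards 0, that probability approaches 0 and the estimates approach the raw frequencies.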

So how would this model predict the mean sentiment of US airlines, after having been trained on Tweets?

In [None]:
X = X_test_raw
p = X["airline"].reset_index().join(pd.Series(predicted, name="sentiment"))
print(f'\n{p.groupby("airline").agg(mean_sentiment=("sentiment", "mean")).sort_values(by="mean_sentiment", ascending=False)}')


                mean_sentiment
airline                       
virgin america        0.762712
delta                 0.757991
southwest             0.632000
united                0.409214
american              0.340351
us airways            0.297872
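
To read the `mean_sentiment` values, it helps to recall how `LabelEncoder` assigns integer codes: it sorts the distinct labels, so the codes follow alphabetical order. The sketch below assumes the dataset's three labels are "negative", "neutral", and "positive"; the short `labels` Series is made up for illustration.

```python
import pandas as pd
from sklearn import preprocessing

# Made-up sample of the three sentiment labels, lowercased as in preprocess()
labels = pd.Series(["positive", "negative", "neutral", "negative"])

le = preprocessing.LabelEncoder()
codes = le.fit_transform(labels)

# classes_ is sorted, so the position of each label is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)        # {'negative': 0, 'neutral': 1, 'positive': 2}
print(codes.mean())   # 0.75
```

Under this encoding, a mean closer to 2 indicates more positive Tweets and a mean closer to 0 indicates more negative ones.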


The ranking predicted on the test data is similar to the one obtained with the training data. United and American swapped places, which makes sense given that their mean sentiments were also very close in the results produced with the training data. We can quantify how accurate the model was on the test data using metrics.accuracy_score:

In [None]:
metrics.accuracy_score(y_test, predicted)

0.7151639344262295
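
As a quick check of what this number means, the sketch below shows on made-up labels that `metrics.accuracy_score` is the fraction of predictions that match the true labels. The arrays `y_true` and `y_pred` are hypothetical and unrelated to the model's actual output.

```python
import numpy as np
from sklearn import metrics

# Made-up encoded labels (0 = negative, 1 = neutral, 2 = positive)
y_true = np.array([0, 0, 1, 2, 2, 0])
y_pred = np.array([0, 1, 1, 2, 0, 0])

acc = metrics.accuracy_score(y_true, y_pred)
manual = np.mean(y_true == y_pred)  # same quantity computed by hand
print(acc, manual)
```

Here four of the six predictions match, so both values equal 4/6.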

# When Laplace smoothing falls short
