
1. Load Essential Libraries

In [None]:
import os
import re
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

2. Dataset

2.1. Download Dataset

In [None]:
# Download data
import requests
response = requests.get("https://drive.google.com/uc?export=download&id=1wHt8PsMLsfX5yNSqrt2fSTcb8LEiclcf")
response.raise_for_status()  # Stop early if the download failed
with open("data.zip", "wb") as file:
    file.write(response.content)

# Unzip data (avoid shadowing the built-in `zip`)
import zipfile
with zipfile.ZipFile('data.zip') as zip_file:
    zip_file.extractall('data')

**2.2. Load Train Data**

The train data has 2 files, each containing 1700 complaining/non-complaining tweets. Every tweet in the data contains at least one airline hashtag.

We will load the train data and label it. Because we use only the text data to classify, we will drop unimportant columns and only keep id, tweet and label columns.

In [None]:
# Load data and set labels
data_complaint = pd.read_csv('data/complaint1700.csv')
data_complaint['label'] = 0
data_non_complaint = pd.read_csv('data/noncomplaint1700.csv')
data_non_complaint['label'] = 1

# Concatenate complaining and non-complaining data
data = pd.concat([data_complaint, data_non_complaint], axis=0).reset_index(drop=True)

# Drop 'airline' column
data.drop(['airline'], inplace=True, axis=1)

# Display 5 random samples
data.sample(5)

Unnamed: 0,id,tweet,label
1994,25374,"@JetBlue Book and enjoy October,# Boo your fr...",1
998,161758,@AmericanAir in laws stuck in phl. Flight canc...,0
2883,116963,@FlyGRFord Missing @AlaskaAir &amp; @HawaiianAir,1
3291,157220,Can't wait to fly @SouthwestAir to Vegas today...,1
1888,20110,"Ah, one of the new @SouthwestAir livery birds ...",1


We will randomly split the entire training data into two sets: a train set with 90% of the data and a validation set with 10% of the data.

We will perform hyperparameter tuning using cross-validation on the train set and use the validation set to compare models.

In [None]:
from sklearn.model_selection import train_test_split

X = data.tweet.values
y = data.label.values

X_train, X_val, y_train, y_val =\
    train_test_split(X, y, test_size=0.1, random_state=2020)

**2.3. Load Test Data**

The test data contains 4555 unlabeled examples. About 300 of them are non-complaining tweets.
Our task is to identify their ids and manually check whether our results are correct.

In [None]:
# Load test data
test_data = pd.read_csv('data/test_data.csv')

# Keep important columns
test_data = test_data[['id', 'tweet']]

# Display 5 samples from the test data
test_data.sample(5)

Unnamed: 0,id,tweet
3283,124401,Seriously @JetBlue we've been waiting an hour ...
2251,86813,"@AmericanAir My lugge was lost last week, spen..."
947,37629,@rothsara @Oregonian RIGHT?!?! I'M SO MAD @Jet...
2996,113495,Thought we were gonna fly the Red Eye.. nope @...
2431,92889,Impresssed w/our Captain &amp; his very succin...


3. Set up GPU for training

Google Colab offers free GPUs and TPUs.
Since we’ll be training a large neural network, it’s best to make use of them.

In [None]:
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: Tesla T4


**Baseline: TF-IDF + Naive Bayes Classifier**

In this baseline approach, first we will use TF-IDF to vectorize our text data. Then we will use the Naive Bayes model as our classifier.

Why Naive Bayes? We experimented with different machine learning algorithms, including Random Forest, Support Vector Machine and XGBoost, and observed that Naive Bayes yields the best performance.
Scikit-learn’s guide to choosing the right estimator also suggests using Naive Bayes for text data. (We also tried using SVD to reduce dimensionality; however, it did not yield better performance.)

**1. Data Preparation**

**1.1. Preprocessing**

In the bag-of-words model, a text is represented as the bag of its words, disregarding grammar and word order.
Therefore, we will want to remove stop words, punctuation and characters that don’t contribute much to the sentence’s meaning.
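To make the bag-of-words idea concrete, here is a minimal sketch that uses only the standard library. The `bag_of_words` helper and the sample sentences are made up for illustration and are not part of the notebook's pipeline.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercased, whitespace-separated tokens (hypothetical helper)."""
    return Counter(text.lower().split())

# Same words, different order: the two bags are identical
a = bag_of_words("flight delayed not happy")
b = bag_of_words("not happy flight delayed")
print(a == b)

# A stop word such as "the" can dominate the counts without adding meaning
c = bag_of_words("the flight was the worst")
print(c.most_common(1))
```

Because word order is discarded, the only signal left is which words occur and how often, which is why uninformative tokens are worth filtering out first.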

In [None]:
import nltk
from nltk.corpus import stopwords

# Download the stop word list used by text_preprocessing
nltk.download('stopwords')

def text_preprocessing(s):
    """
    - Lowercase the sentence
    - Change "'t" to "not"
    - Remove "@name"
    - Isolate and remove punctuations except "?"
    - Remove other special characters
    - Remove stop words except "not" and "can"
    - Remove trailing whitespace
    """
    s = s.lower()
    # Change 't to 'not'
    s = re.sub(r"\'t", " not", s)
    # Remove @name (also when the mention ends the tweet)
    s = re.sub(r'@\S+', ' ', s)
    # Isolate and remove punctuations except '?'
    s = re.sub(r'([\'\"\.\(\)\!\?\\\/\,])', r' \1 ', s)
    s = re.sub(r'[^\w\s\?]', ' ', s)
    # Remove some special characters
    s = re.sub(r'([\;\:\|•«\n])', ' ', s)
    # Remove stopwords except 'not' and 'can'
    # (build the set once per call instead of reloading the list for every word)
    stop_words = set(stopwords.words('english')) - {'not', 'can'}
    s = " ".join([word for word in s.split()
                  if word not in stop_words])
    # Remove trailing whitespace
    s = re.sub(r'\s+', ' ', s).strip()

    return s

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
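To see what the regex steps do to a tweet, the sketch below applies them to one made-up example. `strip_noise` is a hypothetical, trimmed-down stand-in for `text_preprocessing`: it leaves out the stop word filter so that it runs without the NLTK data, and it approximates mention removal with `@\S+`.

```python
import re

def strip_noise(s):
    """Regex-only approximation of text_preprocessing (no stop word removal)."""
    s = s.lower()
    s = re.sub(r"\'t", " not", s)                         # "can't" -> "can not"
    s = re.sub(r'@\S+', ' ', s)                           # drop @mentions
    s = re.sub(r'([\'\"\.\(\)\!\?\\\/\,])', r' \1 ', s)   # isolate punctuation
    s = re.sub(r'[^\w\s\?]', ' ', s)                      # keep only words, spaces, '?'
    return re.sub(r'\s+', ' ', s).strip()                 # collapse whitespace

cleaned = strip_noise("@JetBlue can't believe it!! Delayed AGAIN?")
print(cleaned)  # can not believe it delayed again ?
```

The question mark survives as its own token, and the negation is kept as a separate "not", which is what the stop word exception in `text_preprocessing` is meant to preserve.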


**1.2. TF-IDF Vectorizer**

In information retrieval, TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. We will use TF-IDF to vectorize our text data before feeding it to machine learning algorithms.
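As a rough illustration of what the vectorizer computes, the sketch below rebuilds the TF-IDF matrix by hand for a tiny made-up corpus and compares it with scikit-learn's output. It assumes the same setting used in this notebook (`smooth_idf=False`, all other options at their defaults), in which case the weight is `tf * (ln(n / df) + 1)` followed by L2 normalization of each row.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for this example
docs = ["flight delayed again", "great flight", "delayed delayed luggage"]

vectorizer = TfidfVectorizer(smooth_idf=False)
X = vectorizer.fit_transform(docs).toarray()
vocab = vectorizer.vocabulary_  # word -> column index

# Raw term counts per document
counts = np.zeros_like(X)
for row, doc in enumerate(docs):
    for word in doc.split():
        counts[row, vocab[word]] += 1

# Inverse document frequency without smoothing: ln(n / df) + 1
df = (counts > 0).sum(axis=0)
idf = np.log(len(docs) / df) + 1

# Weight the counts, then L2-normalize each row
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

print(np.allclose(manual, X))
```

Words that appear in every document get the smallest IDF, so they contribute least to the vector.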

In [None]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer

# Preprocess text
X_train_preprocessed = np.array([text_preprocessing(text) for text in X_train])
X_val_preprocessed = np.array([text_preprocessing(text) for text in X_val])

# Calculate TF-IDF
tf_idf = TfidfVectorizer(smooth_idf=False)
X_train_tfidf = tf_idf.fit_transform(X_train_preprocessed)
X_val_tfidf = tf_idf.transform(X_val_preprocessed)

CPU times: user 5.5 s, sys: 791 ms, total: 6.29 s
Wall time: 6.35 s


**2. Train Naive Bayes Classifier**

**2.1. Hyperparameter Tuning**

We will use cross-validation and AUC score to tune hyperparameters of our model. The function get_auc_CV will return the average AUC score from cross-validation.

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

def get_auc_CV(model):
    """
    Return the average AUC score from cross-validation.
    """
    # Set KFold to shuffle data before the split
    kf = StratifiedKFold(5, shuffle=True, random_state=1)

    # Get AUC scores
    auc = cross_val_score(
        model, X_train_tfidf, y_train, scoring="roc_auc", cv=kf)

    return auc.mean()
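For intuition about the metric behind `scoring="roc_auc"`: AUC can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The sketch below checks this on made-up labels and scores; the arrays are illustrative and do not come from the tweet data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and classifier scores
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(pairwise_auc, roc_auc_score(y_true, scores))
```

Here 8 of the 9 positive/negative pairs are ranked correctly, so both values equal 8/9. Because AUC depends only on the ranking of scores, it does not require choosing a classification threshold.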

The MultinomialNB class has one main hyperparameter: alpha, the additive smoothing parameter. The code below will help us find the alpha value that gives us the highest CV AUC score.

In [None]:
from sklearn.naive_bayes import MultinomialNB

# Candidate alpha values. alpha must be passed as a keyword argument:
# MultinomialNB(i) raises a TypeError in recent scikit-learn versions.
alphas = np.arange(1, 10, 0.1)
res = pd.Series([get_auc_CV(MultinomialNB(alpha=i)) for i in alphas],
                index=alphas)

best_alpha = np.round(res.idxmax(), 2)
print('Best alpha: ', best_alpha)

plt.plot(res)
plt.title('AUC vs. Alpha')
plt.xlabel('Alpha')
plt.ylabel('AUC')
plt.show()
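As background on what alpha controls: MultinomialNB estimates each word probability as `(count of word in class + alpha) / (total count in class + alpha * vocabulary size)`, so a larger alpha pulls the estimates toward a uniform distribution. The sketch below reproduces this by hand on a made-up count matrix and compares it with the fitted model; the data is illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Invented word-count matrix: 4 documents, 3 vocabulary words
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 1, 3],
              [0, 0, 2]])
y = np.array([0, 0, 1, 1])
alpha = 1.0

model = MultinomialNB(alpha=alpha).fit(X, y)

# Per-class word counts, then additive smoothing
class_counts = np.array([X[y == c].sum(axis=0) for c in (0, 1)])
smoothed = class_counts + alpha
manual_log_prob = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

print(np.allclose(manual_log_prob, model.feature_log_prob_))
```

In class 0 the third word never occurs, yet with `alpha = 1` it still receives probability 1/7 rather than 0, which keeps a single unseen word from zeroing out the whole class likelihood.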