# Introduction

This notebook is the major project submission for COMP8220, on the language dataset. It contains the following sections, which we used to obtain the required results:

1) We used three conventional models to classify the language dataset. Of these, the Random Forest model gave the best accuracy. The other two models were Logistic Regression and Naive Bayes.
* Logistic Regression Model : Logistic regression is a fast and relatively uncomplicated fundamental classification technique.
* Naive Bayes : As the dataset has four output classes to choose from (anger, fear, joy and sadness), we used Multinomial Naive Bayes. We chose this model because it is well suited to text documents. Its model is smaller than a random forest and it is faster.
* Random Forest : A random forest consists of a large number of individual decision trees that operate as an ensemble. Each tree in the forest gives a class prediction, and the class with the most votes becomes the model's prediction. It is robust against overfitting and gives better results with more samples.

2) Convolutional Neural Network: A convolutional neural network (CNN) is a class of deep neural network. It has one or more convolutional layers and is used for classification, segmentation and other tasks on autocorrelated data. It takes an input and assigns learnable weights (importance) to the objects in the data.

3) Performance : Random forest gave a higher accuracy than the CNN model on the validation dataset, but the CNN model had higher accuracy on both the public and private datasets on Kaggle. The CNN model's accuracy was 0.56865 on the public dataset and 0.64129 on the private dataset, a significant increase.
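
The voting step described for Random Forest above can be sketched in a few lines. This is a minimal illustration with hypothetical votes, not the notebook's model: `majority_vote` and the example labels are assumptions made for this sketch. scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this shows only the simpler idea.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    # most_common(1) returns [(label, count)] for the most frequent label.
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five trees for a single tweet
votes = ["joy", "anger", "joy", "sadness", "joy"]
prediction = majority_vote(votes)
```

With more trees, a single tree's mistake is less likely to change the winning class, which is one reason the ensemble is more stable than any one tree.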

### Importing and downloading all the required Packages

In [102]:
import nltk
import re
from nltk.corpus import stopwords
nltk.download('stopwords')
import pandas as pd
import numpy as np
import sklearn
from textblob import Word
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn import naive_bayes
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from keras.layers import Dense, Dropout
import string
from emot.emo_unicode import UNICODE_EMO, EMOTICONS
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras import layers
from keras import regularizers

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\abhis\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [103]:
import numpy as np
from os.path import join
import pickle

def load_pickle(path):
    with open(path, 'rb') as f:
        file = pickle.load(f)
        print ('Loaded %s..' %path)
        return file

dataset_directory = 'C:/Users/abhis/Downloads/tweet-emotion-detection/language_dataset/'
private_directory = 'C:/Users/abhis/Downloads/tweet-emotion-detection-private/'
emotions = ['anger', 'fear', 'joy', 'sadness']

tweets_train = np.load(join(dataset_directory, 'text_train_tweets.npy'))
labels_train = np.load(join(dataset_directory, 'text_train_labels.npy'))
vocabulary = load_pickle(join(dataset_directory, 'text_word_to_idx.pkl'))

tweets_val = np.load(join(dataset_directory, 'text_val_tweets.npy'))
labels_val = np.load(join(dataset_directory, 'text_val_labels.npy'))

tweets_test_public = np.load(join(dataset_directory, 'text_test_public_tweets_rand.npy'))
tweets_test_private = np.load(join(private_directory, 'text_test_private_tweets.npy'))

idx_to_word = {i: w for w, i in vocabulary.items()}

Loaded C:/Users/abhis/Downloads/tweet-emotion-detection/language_dataset/text_word_to_idx.pkl..


### The cell below contains all the functions used in the notebook to preprocess the data.

* remove_punct : This function removes all punctuation and digits from the text
* tokenization : This function splits longer strings of data into smaller strings (tokens)
* load_data : In this function, we convert the data we obtained into a readable format, in this case into dataframes. We also remove marker tokens from the text such as start, end and user.
* stemming : We use stemming to remove the affixes from a word and obtain the root word
* lemmatizer : We use lemmatization to capture canonical forms based on a word's lemma. Eg : better → good
* convert_emojis : We use the convert_emojis function to convert emojis into their meaning. eg : a sushi emoji will be changed to the word sushi.
* convert_emoticons : We use the convert_emoticons function to convert emoticons into their meaning. eg : a happy emoticon will be changed to the text happy.
* preprocessing : The preprocessing function is used to preprocess the text. It calls the other functions above. We convert emojis and emoticons to text, change the data to lower case, remove stop words and punctuation, and then stem and lemmatize the data.
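
The order of these steps can be illustrated on a single string using only the standard library. This is a simplified sketch under stated assumptions: `STOP_WORDS` is a tiny hypothetical list (the notebook uses NLTK's English stop words), and stemming, lemmatization and emoji conversion are left out because they need external packages.

```python
import re
import string

# Hypothetical, tiny stop-word list used only for this illustration
STOP_WORDS = {"i", "am", "so", "the", "a", "at", "is"}

def simple_preprocess(text):
    """Lowercase, drop stop words, strip punctuation and digits, then tokenize."""
    text = text.lower()
    text = " ".join(w for w in text.split() if w not in STOP_WORDS)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"[0-9]+", "", text)
    # Split on non-word characters and discard empty strings
    return [tok for tok in re.split(r"\W+", text) if tok]

tokens = simple_preprocess("I am SO angry at the 2 delays!!!")
```

Here `tokens` is `["angry", "delays"]`: the stop words, the digit and the exclamation marks have all been removed.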


In [104]:
def remove_punct(text):
    text  = "".join([char for char in text if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)
    return text

def tokenization(text):
    text = re.split(r'\W+', text)
    return text

def load_data(tweet, labels):
    """Decode index-encoded tweets into a dataframe with a "Text" column
    (and a "Label" column when labels are supplied)."""
    texts = []
    for row in tweet:
        words = []
        # As in the original loop, only the first 50 tokens of each tweet are read
        for idx in row[:50]:
            word = idx_to_word[idx]
            if "<END>" in word:
                break  # everything after the end marker is ignored
            if "<START>" in word or "<user>" in word:
                continue  # skip the start marker and user mentions
            words.append(word)
        texts.append(" ".join(words))
    loaded_df = pd.DataFrame({"Text": texts})
    if len(labels) > 0:
        loaded_df["Label"] = list(labels)
    return loaded_df

ps = nltk.PorterStemmer()

def stemming(text):
    text = [ps.stem(word) for word in text]
    return text

wn = nltk.WordNetLemmatizer()

def lemmatizer(text):
    text = [wn.lemmatize(word) for word in text]
    return text

# Converting emojis to words
def convert_emojis(text):
    for emot in UNICODE_EMO:
        text = text.replace(emot, "_".join(UNICODE_EMO[emot].replace(",","").replace(":","").split()))
    return text

# Converting emoticons to words    
def convert_emoticons(text):
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', "_".join(EMOTICONS[emot].replace(",","").split()), text)
    return text

def preprocessing(df):
    df.dropna(inplace=True)
    df= df.apply(lambda x: convert_emojis(x))
    df= df.apply(lambda x: convert_emoticons(x))
    df = df.apply(lambda x: " ".join(x.lower() for x in x.split()))
    stop = stopwords.words('english')
    df = df.apply(lambda x: " ".join(x for x in x.split() if x not in stop))
    df = df.apply(lambda x: remove_punct(x))
    df = df.apply(lambda x: tokenization(x.lower()))
    df = df.apply(lambda x: stemming(x))
    df = df.apply(lambda x: lemmatizer(x))
    
    for i in range(0, len(df)):
        processed_feature = re.sub(r'\W', ' ', str(df[i]))
        processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)
        processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)
        processed_feature = re.sub(r'^b\s+', '', processed_feature)
        df[i] = processed_feature.lower()
    return df

#### Converting the Training Dataset into a dataframe and then preprocessing it

In [105]:
training_dataframe = load_data(tweets_train,labels_train)
training_dataframe["Text"] = preprocessing(training_dataframe["Text"])
training_dataframe.head()

Unnamed: 0,Text,Label
0,make fuck irat jesu nobodi call ppl like haji...,0
1,lol adam bull fake outrag,0
2,pas away earli morn fast furiou style car cra...,0
3,lol wow gonna say realli haha seen chri nah d...,0
4,need bentobox sushi date ricebal spaghetti ol...,0


#### Converting the Validation Dataset into a dataframe and then preprocessing it

In [106]:
validation_dataframe = load_data(tweets_val, labels_val)
validation_dataframe["Text"] = preprocessing(validation_dataframe["Text"])
validation_dataframe.head()

Unnamed: 0,Text,Label
0,fume hijack money move full back poutingfac,0
1,nightmar dream freedom,0
2,cnn realli need get busi number second fallen...,0
3,kikm horni kik nude girl wearyfac horni snap,0
4,fuck tag pictur famili first cut number year ...,0


#### Converting the Public Dataset into a dataframe and then preprocessing it.

The public dataset does not contain labels, but our load_data function requires a labels argument. Hence, we pass an empty list to fulfil the requirement.

In [107]:
label_list = []
public_dataframe = load_data(tweets_test_public,label_list)
public_dataframe["Text"] = preprocessing(public_dataframe["Text"])
public_dataframe.head()

Unnamed: 0,Text
0,omg mother daughter dull ni move dad worri
1,happi birthday repeat miss excit back florida...
2,ever cri middl bomb rest someon woke emerg sl...
3,mental suffer worthless pain
4,courag driver shot bu show courag natur scare...


#### Converting the Private Dataset into a dataframe and then preprocessing it.

The private dataset does not contain labels, but our load_data function requires a labels argument. Hence, we pass an empty list to fulfil the requirement.

In [108]:
label_list = []
private_dataframe = load_data(tweets_test_private,label_list)
private_dataframe["Text"] = preprocessing(private_dataframe["Text"])
private_dataframe.head()

Unnamed: 0,Text
0,whatev decid make sure make happi
1,accept challeng liter even feel exhilar victo...
2,roommat okay spell autocorrect terribl firstw...
3,cute atsu probabl shi photo cherri help uwu
4,rooney fuck untouch fuck dread depay look dec...


#### Combining the dataframes to create a bigger dataset for input which will have more features

In [109]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
combined_dataframe = pd.concat([training_dataframe, validation_dataframe], ignore_index=True)
combined_dataframe.head()

Unnamed: 0,Text,Label
0,make fuck irat jesu nobodi call ppl like haji...,0
1,lol adam bull fake outrag,0
2,pas away earli morn fast furiou style car cra...,0
3,lol wow gonna say realli haha seen chri nah d...,0
4,need bentobox sushi date ricebal spaghetti ol...,0


# Preparing data to be passed for the models

* Converting the text values to vectors.
    * We use a count vectorizer to vectorize the data. CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.
* Converting the labels to one-hot NumPy arrays using get_dummies, to match the vectorized input
* Setting y_test and y_train, and also the values for the combined dataset
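
To make the two encodings concrete, here is a small standard-library sketch of count vectorization and one-hot encoding. The helper names (`build_vocabulary`, `count_vector`, `one_hot`) are hypothetical and this is not how `CountVectorizer` or `get_dummies` are implemented; for instance, `CountVectorizer` by default also drops single-character tokens, which this sketch does not.

```python
def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def count_vector(document, vocabulary):
    """Count occurrences of each vocabulary token; unknown tokens are ignored."""
    vector = [0] * len(vocabulary)
    for tok in document.split():
        if tok in vocabulary:
            vector[vocabulary[tok]] += 1
    return vector

def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot list."""
    return [1 if i == label else 0 for i in range(num_classes)]

docs = ["lol fake outrag", "nightmar dream freedom dream"]
vocab = build_vocabulary(docs)
vectors = [count_vector(d, vocab) for d in docs]
```

The second document contains "dream" twice, so its vector has a 2 in the "dream" column. A token that was not seen when the vocabulary was built contributes nothing, which is why the vocabulary must be fitted before new documents are transformed.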

In [110]:
vectorizer = CountVectorizer()
vectorizer.fit(training_dataframe["Text"].values)

X_train = vectorizer.transform(training_dataframe["Text"].values)
X_test  = vectorizer.transform(validation_dataframe["Text"].values)
y_train = training_dataframe["Label"]
y_test  = validation_dataframe["Label"]
y_np = pd.get_dummies(validation_dataframe['Label']).values
x_np = pd.get_dummies(training_dataframe['Label']).values

In [111]:
# Setting Values for combined Dataset

vectorizer = CountVectorizer()
vectorizer.fit(combined_dataframe["Text"].values)

X_combined = vectorizer.transform(combined_dataframe["Text"].values)
X_test  = vectorizer.transform(validation_dataframe["Text"].values)
y_combined = combined_dataframe["Label"]
c_np=pd.get_dummies(combined_dataframe['Label']).values

## Conventional ML Model:

We used three conventional models and compared their accuracy. Random Forest had the highest accuracy, so we chose it as the best conventional model.

**For all the conventional models, we have fitted the model using the combined dataframe as it will provide more values and more features**

#### The Logistic Regression model gave us an accuracy of 56.30%, which is lower than the Random Forest model. We set two parameters: multi_class, which we kept as 'auto', and solver, for which we used 'liblinear'

In [112]:
classifier = LogisticRegression(multi_class="auto",solver="liblinear")
classifier.fit(X_combined, y_combined)
logistic = classifier.predict(X_test)
accuracy_LR = accuracy_score(logistic, y_test)*100
print("Logistic Regression Accuracy Score -> ", accuracy_LR)

Logistic Regression Accuracy Score ->  56.3013698630137


#### The Naive Bayes model gave us an accuracy of 52.60%, which is the lowest of the 3 models used. We used Multinomial Naive Bayes because it is very efficient for text documents

In [113]:
naive = naive_bayes.MultinomialNB()
naive.fit(X_combined, y_combined)
predictions_NB = naive.predict(X_test)
accuracy_NB =  accuracy_score(predictions_NB, y_test)*100
print("Naive Bayes Accuracy Score -> ", accuracy_NB)

Naive Bayes Accuracy Score ->  52.602739726027394


#### The Random Forest model gave us an accuracy of 60.95%, which was the highest amongst all 3 models. We use the n_estimators parameter, which sets the number of trees in the forest, in this case 1000

In [114]:
RForest = RandomForestClassifier(n_estimators=1000)
RForest.fit(X_combined, y_combined)
predictions_RForest = RForest.predict(X_test)
accuracy_RF = accuracy_score(predictions_RForest, y_test)*100
print("Random Forest Accuracy Score -> ", accuracy_RF)

Random Forest Accuracy Score ->  60.95890410958904


## Notes on the Conventional ML Model

From the accuracies above, we can see that Random Forest is the best conventional model as it has the highest accuracy, so it is the final conventional model used for prediction. This could be because a random forest builds multiple trees and combines their predictions. n_estimators is the hyperparameter we set.
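
The accuracy scores compared above are simply the percentage of predictions that match the true labels. The sketch below shows that calculation with a hypothetical helper, `accuracy_percent`, on toy labels. It also checks that, assuming the validation set has the 1460 samples reported in the training log, the three printed scores are consistent with 890, 822 and 768 correct predictions respectively.

```python
def accuracy_percent(y_true, y_pred):
    """Percentage of positions where the prediction equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)

# Toy labels (hypothetical): 3 of the 4 predictions are correct
toy_accuracy = accuracy_percent([0, 1, 2, 3], [0, 1, 2, 0])

# Counts of correct predictions out of 1460 that reproduce the printed scores
rf_accuracy = 100.0 * 890 / 1460   # Random Forest
lr_accuracy = 100.0 * 822 / 1460   # Logistic Regression
nb_accuracy = 100.0 * 768 / 1460   # Naive Bayes
```

Seen this way, the gap between Random Forest and Logistic Regression is 68 tweets out of 1460.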

### Creating the CSV file for the predictions made by the model

In [124]:
output_df = pd.DataFrame()
output_df["ID"] = [x for x in range(0, len(validation_dataframe))]
output_df["Prediction"] = predictions_RForest
output_df.to_csv("ConventionalModel.csv", index=False)

# Deep Learning Model

The final model that produced the best-performing predictions for the Kaggle submission was the CNN model, with an accuracy of about 60% on the validation dataset. Note that the network built in this notebook consists of fully connected (Dense) layers. The input dimension of the first dense layer is the number of features in the dataset and its output dimension is 1000, which in turn is the input dimension of the following layer, and so on.

The components used for the CNN model are as follows:
* Dense is a standard fully connected layer type that is used in many neural networks.
* ReLU, the rectified linear activation function, returns its input directly when the input is positive and returns zero otherwise.
* The add function is used to add layers to our model.
* A Sequential model is used because the layers are stacked sequentially, from the input layer to the output layer, each with its respective shape.
* We use kernel regularizers to reduce overfitting by penalising large weights. We use the L2 type of kernel regularization.
* As this is a multiclass classification problem, "softmax" has been used as the activation of the output layer.
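
The three numerical pieces in this list (ReLU, softmax and the L2 penalty) can be written out directly. The following is a plain-Python sketch for illustration only; Keras applies these operations to whole tensors, and the function names and example scores here are assumptions made for the sketch.

```python
import math

def relu(x):
    """Return x when it is positive, otherwise 0."""
    return max(0.0, x)

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def l2_penalty(weights, factor=0.0001):
    """L2 regularization term: factor times the sum of squared weights."""
    return factor * sum(w * w for w in weights)

# Hypothetical raw scores for the four emotion classes
probs = softmax([2.0, 1.0, 0.1, -1.0])
```

The largest raw score always receives the largest probability, and the L2 term grows with the squared size of the weights, so it is added to the loss to discourage very large weights.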


In [125]:
input_dim = X_combined.shape[1]  # Number of features

model = Sequential()
# `units` is the Keras 2 name for the layer's output size (formerly `output_dim`).
# Only the first layer needs `input_dim`; later layers infer it from the previous one.
model.add(layers.Dense(units=1000, input_dim=input_dim, activation='relu', kernel_regularizer=regularizers.l2(0.0001)))
model.add(layers.Dense(units=1000, activation='relu', kernel_regularizer=regularizers.l2(0.0001)))
model.add(layers.Dense(units=1000, activation='relu', kernel_regularizer=regularizers.l2(0.0001)))
model.add(layers.Dense(units=1000, activation='relu', kernel_regularizer=regularizers.l2(0.0001)))
model.add(layers.Dense(units=1000, activation='relu', kernel_regularizer=regularizers.l2(0.0001)))
model.add(layers.Dense(4, activation='softmax',kernel_regularizer=regularizers.l2(0.0001)))



We compile the model with the parameters below:
* The optimizer controls how the weights are updated. We used Adam as the optimizer, which adapts the learning rate throughout training.
* We used categorical_crossentropy as the loss function. A lower loss score means the model is performing better.
* We use accuracy as the metric to make it easier for us to evaluate the model.
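
For a single sample, categorical cross-entropy is the negative log of the probability the model assigns to the true class. The sketch below is a hand-written illustration with hypothetical probabilities, not the Keras implementation; the clipping constant `eps` is an assumption added to avoid taking the log of zero.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one sample: y_true is one-hot, y_pred holds class probabilities."""
    clipped = [min(max(p, eps), 1.0 - eps) for p in y_pred]
    return -sum(t * math.log(p) for t, p in zip(y_true, clipped))

true_label = [0, 0, 1, 0]  # one-hot label for the third class

# A confident, correct prediction gives a low loss
confident = categorical_crossentropy(true_label, [0.05, 0.05, 0.85, 0.05])

# A uniform guess over four classes gives a loss of ln(4)
unsure = categorical_crossentropy(true_label, [0.25, 0.25, 0.25, 0.25])
```

This is why a lower loss indicates a better model: the loss falls as more probability is placed on the correct class.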

In [126]:
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_13 (Dense)             (None, 1000)              9813000   
_________________________________________________________________
dense_14 (Dense)             (None, 1000)              1001000   
_________________________________________________________________
dense_15 (Dense)             (None, 1000)              1001000   
_________________________________________________________________
dense_16 (Dense)             (None, 1000)              1001000   
_________________________________________________________________
dense_17 (Dense)             (None, 1000)              1001000   
_________________________________________________________________
dense_18 (Dense)             (None, 4)                 4004      
Total params: 13,821,004
Trainable params: 13,821,004
Non-trainable params: 0
__________________________________________
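
The parameter counts in the summary can be reproduced by hand: a Dense layer has one weight per input-output pair plus one bias per output unit. In the sketch below, `dense_params` is a hypothetical helper, and the input dimension of 9812 is inferred from the first layer's count rather than printed anywhere in the notebook.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Inferred from the first layer: (9812 + 1) * 1000 = 9,813,000
input_dim = 9812

layer_params = (
    [dense_params(input_dim, 1000)]      # first hidden layer
    + [dense_params(1000, 1000)] * 4     # four further hidden layers
    + [dense_params(1000, 4)]            # softmax output layer
)
total_params = sum(layer_params)
```

The first layer accounts for most of the 13,821,004 parameters because the count-vector input is so wide.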

### Fitting the Model

We use the following parameters to fit our model:
* We train on the features and one-hot labels of the combined dataset.
* The number of epochs is the number of times the model cycles through the data. We tried several values and the accuracy saturated between 10 and 20 epochs. Hence we used 10 epochs.
* For the validation data, we use the validation dataset.
* We set the batch_size to 256.
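
The batch size and epoch count together determine how many weight updates are made. As a worked example, taking the 8558 training samples reported in the training log:

```python
import math

n_samples = 8558   # training samples reported in the training log
batch_size = 256
epochs = 10

# Each epoch processes every sample once, in batches of up to 256
steps_per_epoch = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs
```

This gives 34 batches per epoch, the last holding the remaining 110 samples, and 340 weight updates over the 10 epochs.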

In [127]:
history = model.fit(X_combined, c_np,
                    epochs=10,
                    verbose=True,
                    validation_data=(X_test, y_np),
                    batch_size=256)
loss, accuracy = model.evaluate(X_test, y_np, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

Train on 8558 samples, validate on 1460 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Testing Accuracy:  0.6055


# Notes on the Deep Learning Model

The deep learning model had an accuracy of 0.605 on the validation dataset. It used Dense layers, ReLU activations, L2 kernel regularization and a softmax output layer, which are explained in detail above where the model is built.

We also tried a model with a few different components such as Dropout, LSTM and GlobalMaxPooling1D, but its accuracy was significantly lower, at around 52%.

The deep learning accuracy was 0.64082 on the private dataset, whereas it was 0.56865 on the public dataset and 0.6055 on the validation dataset.

# Discussion of Model Performance and Implementation

* Random forest gave the best accuracy of the conventional models and was also slightly higher than the CNN model on the validation dataset.
* On the validation dataset, random forest gave the highest accuracy (60.9589%), compared with the deep learning CNN model, which gave an accuracy of 0.6055 (60.55%).
* On the public dataset, the CNN model had the highest accuracy, 0.56865, while the random forest model gave 0.56028, which was slightly lower.
* On the private dataset too, the CNN model had a higher accuracy, 0.64129, than the Random Forest model.
* The accuracy score on the private dataset was 0.64129, which was about 7.3 percentage points higher than on the public dataset and about 3.6 percentage points higher than on the validation dataset.
* The private dataset accuracy of 0.64129 ranked 25th on the private leaderboard and was only 0.82% lower than the 2nd rank.
* Hence, the CNN was the best machine learning model.
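
The differences between the CNN model's scores quoted above can be checked with a few lines of arithmetic. The values are the accuracies reported in this section; the differences are expressed in percentage points.

```python
private_acc = 0.64129
public_acc = 0.56865
validation_acc = 0.6055

# Differences in percentage points, rounded to two decimals
gain_over_public = round((private_acc - public_acc) * 100, 2)
gain_over_validation = round((private_acc - validation_acc) * 100, 2)
```

The private score is 7.26 points above the public score and 3.58 points above the validation score.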
