### Student Information
Name: Derrick Bol

Student ID: 108065434

GitHub ID: @derrxb

Kaggle name: @derrxb

Kaggle private scoreboard snapshot:

![Snapshot](img/pic0.png)

---

### Instructions

1. First: __This part is worth 30% of your grade.__ Do the **take home** exercises in the [DM19-Lab2-Master Repo](https://github.com/EvaArevalo/DM19-Lab2-Master). You may need to copy some cells from the Lab notebook to this notebook. 


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/t/179d01d4dd984fc5ac45a894822479dd) regarding Emotion Recognition on Twitter. The scoring will be given according to your place in the Private Leaderboard ranking: 
    - **Bottom 40%**: Get 20% of the score (ie. 20% of 30% )

    - **Top 41% - 100%**: Get (101-x)% of the score, where x is your ranking in the leaderboard (ie. (101-x)% of 30% )   
    Submit your last submission __BEFORE the deadline (Nov. 23rd 11:59 pm, Saturday)__. Make sure to take a screenshot of your position at the end of the competition and store it as '''pic0.png''' under the **img** folder of this repository and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (You can use code and comment it). This report should include your preprocessing steps, the feature engineering steps and an explanation of your model. You can also mention different things you tried and insights you gained. 


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook** and **add minimal comments where needed**.


You can submit your homework following these guidelines: [Git Intro & How to hand your homework](https://github.com/EvaArevalo/DM19-Lab1-Master/blob/master/Git%20Intro%20%26%20How%20to%20hand%20your%20homework.ipynb), but make sure to fork the [DM19-Lab2-Homework](https://github.com/EvaArevalo/DM19-Lab2-Homework) repository this time! Also please __DON'T UPLOAD HUGE DOCUMENTS__; use `.gitignore` to exclude them.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Nov. 26th 11:59 pm, Tuesday)__. 

## Part 3: Kaggle Competition Report

### Table of contents
* Summary
* Data Preparation
* Data Preprocessing
* Feature Engineering
* Models and Insights
* Final Model Explanation

### Summary

This report is an overview of the work completed for the Data Mining class's Twitter emotion recognition competition. It covers data preparation, preprocessing, feature engineering, and the models I tried for the competition.

Overall, four models were tried: a random forest, a Naive Bayes classifier, logistic regression, and a neural network. The best result was obtained with the logistic regression model.

### Data Preparation

The data was provided in several separate files, so a major part of the data preparation was merging them into one Pandas dataframe.

In [4]:
import re
import math
import pandas as pd
import numpy as np
import json
from nltk.corpus import stopwords
import nltk

In [6]:
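emotion = pd.read_csv('emotion.csv')
identification = pd.read_csv('data_identification.csv')

# Outer-join on `tweet_id` so that test tweets (which have no emotion label) are kept.
# Passing `left_index=True` together with `on=` is rejected by current pandas, so only `on=` is used.
identification_with_emotion = pd.merge(identification, emotion, on="tweet_id", how="outer")

# Display with `tweet_id` as the index (`set_index` returns a new frame; nothing is modified in place).
identification_with_emotion.set_index('tweet_id')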

tweet_id,identification,emotion
0x28cc61,test,
0x29e452,train,joy
0x2b3819,train,joy
0x2db41f,test,
0x2a2acc,train,trust
...,...,...
0x227e25,train,disgust
0x293813,train,sadness
0x1e1a7e,train,joy
0x2156a5,train,trust


In [7]:
# `tweets_DM.json` contains one JSON object per line, so we parse each line individually.
# We then flatten the nested records so they can be merged with the `identification_with_emotion` dataframe.
raw_tweets = []

with open('tweets_DM.json') as f:
    for line in f:
        raw_tweets.append(json.loads(line))

# `pd.json_normalize` is the public name in pandas >= 1.0 (older releases expose it as `pd.io.json.json_normalize`).
temp_tweets = pd.json_normalize(raw_tweets)
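
The line-by-line parsing above can be tried on a tiny in-memory sample. This is only a sketch: the nested `_source.tweet` layout and the two records are assumptions made up for illustration, and the real `tweets_DM.json` may be structured differently.

```python
import io
import json

import pandas as pd

# Two hypothetical records in JSON Lines form (one JSON object per line).
# The nested layout is an assumption for illustration only.
sample_file = io.StringIO(
    '{"_score": 391, "_source": {"tweet": {"tweet_id": "0x376b20", "text": "hello"}}}\n'
    '{"_score": 433, "_source": {"tweet": {"tweet_id": "0x2d5350", "text": "world"}}}\n'
)

# Parse each non-empty line as its own JSON document.
records = [json.loads(line) for line in sample_file if line.strip()]

# json_normalize flattens nested dicts into dot-separated column names.
flat = pd.json_normalize(records)
print(sorted(flat.columns))
```

Because the flattened names are long (for example `_source.tweet.tweet_id`), renaming the columns right after normalizing keeps the later merge readable.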

In [8]:
# Rename columns and drop unnecessary columns.
temp_tweets.columns = ['score', 'index', 'crawldate', 'type', 'hastags', 'tweet_id', 'text']
temp_tweets = temp_tweets.drop(columns=['index', 'type'])

# Merge temp_tweets with the emotion and train/test identification
tweets = pd.merge(temp_tweets, identification_with_emotion, on="tweet_id")
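
As a sanity check on the merging logic, here is a minimal sketch on toy frames. All values are made up, and it uses `how="left"` plus `validate="one_to_one"` rather than the outer merge in the report, so that a duplicated `tweet_id` would raise an error instead of silently multiplying rows.

```python
import pandas as pd

# Toy stand-ins for the three sources; the values are made up.
identification = pd.DataFrame(
    {"tweet_id": ["a", "b", "c"], "identification": ["train", "test", "train"]}
)
emotion = pd.DataFrame({"tweet_id": ["a", "c"], "emotion": ["joy", "trust"]})
texts = pd.DataFrame({"tweet_id": ["a", "b", "c"], "text": ["t1", "t2", "t3"]})

# Left join: every tweet keeps its split, and test tweets get a NaN emotion.
labels = identification.merge(emotion, on="tweet_id", how="left", validate="one_to_one")

# Attach the tweet text; `validate` guards against duplicated ids.
merged = texts.merge(labels, on="tweet_id", how="inner", validate="one_to_one")
print(merged)
```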

### Data Preprocessing

#### Data Cleaning

For the data preprocessing step, the main thing I did was clean the tweets: removing stop words, user mentions, and unnecessary punctuation. This can be seen in the cell below.

In [9]:
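# Cleans a tweet with a sequence of regex substitutions.
def clean_data(tt):
    text = re.sub(r'\w*\d\w*', '', tt)               # drop tokens that contain a digit
    text = re.sub(r'_+[^a-z]', '', text)             # drop underscores followed by a non-lowercase character
    text = re.sub(r'_+[a-z]', '', text)              # drop underscores followed by a lowercase letter
    # The original patterns relied on adjacent string literals being concatenated;
    # the character classes they actually produced are written out explicitly here.
    text = re.sub(r'[".]', '', text)                 # drop double quotes and periods
    text = text.replace(",", "")                     # drop commas
    text = re.sub(r"['’.*][a-zA-Z0-9]*", '', text)   # drop apostrophes/asterisks and the characters that follow
    text = re.sub(r'@[a-zA-Z0-9]*', '', text)        # drop user mentions

    return text

# Uses the common stop words provided by `nltk` (requires `nltk.download('stopwords')` once).
stop = stopwords.words('english')

# Compile the stop-word pattern once instead of on every call.
stop_pattern = re.compile(r'\b(' + r'|'.join(map(re.escape, stop)) + r')\b\s*')

def remove_stopwords(text):
    # The list is lowercase and the match is case-sensitive, so capitalized words such as "As" are kept.
    return stop_pattern.sub('', text)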

tweets['cleaned_text'] = tweets['text'].apply(lambda x: remove_stopwords(clean_data(x)))
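
To make the effect of the cleaning easier to see, here is a simplified sketch on one of the tweets from the dataset. It is not the full cleaning function: the short stop-word list is a hypothetical stand-in for the `nltk` list, and only mention, comma, and stop-word removal are shown.

```python
import re

# Hypothetical mini stop-word list; the report uses nltk's English list.
STOP_WORDS = ["who", "me", "on", "be", "we", "is", "to"]
STOP_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, STOP_WORDS)) + r")\b\s*")
MENTION_PATTERN = re.compile(r"@[a-zA-Z0-9]*")


def clean(text):
    text = MENTION_PATTERN.sub("", text)  # drop user mentions
    text = text.replace(",", "")          # drop commas
    text = STOP_PATTERN.sub("", text)     # drop stop words and the spaces after them
    return text.strip()


print(clean("@brianklaas As we see, Trump is dangerous to #freepress"))
```

Note that the match is case-sensitive, so a capitalized stop word such as "As" survives. Lowercasing the text first would remove it, at the cost of losing capitalization that may carry emotional emphasis.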

#### Binarization

I also one-hot encoded the emotion labels because I wanted to try building a neural network for the project. Before doing so, I split the dataset into `train_tweets` and `test_tweets`, since only the training tweets have labels.

In [13]:
from sklearn import preprocessing

# Split dataset; `.copy()` gives independent frames, so adding a column below
# does not trigger pandas' SettingWithCopyWarning.
test_tweets = tweets[tweets['identification'] == 'test'].copy()
train_tweets = tweets[tweets['identification'] == 'train'].copy()

# Create labels
label_binarizer = preprocessing.LabelBinarizer()
label_binarizer.fit(train_tweets.emotion)

# One hot encode emotions
train_tweets['bin_emotion'] = label_binarizer.transform(train_tweets['emotion']).tolist()


In [14]:
train_tweets.head()

Unnamed: 0,score,crawldate,hastags,tweet_id,text,identification,emotion,cleaned_text,bin_emotion
1205566.0,391,2015-05-23 11:42:47,[Snapchat],0x376b20,"People who post ""add me on #Snapchat"" must be ...",train,anticipation,People post add #Snapchat must dehydrated Cuz ...,"[0, 1, 0, 0, 0, 0, 0, 0]"
1083048.0,433,2016-01-28 04:52:09,"[freepress, TrumpLegacy, CNN]",0x2d5350,"@brianklaas As we see, Trump is dangerous to #...",train,sadness,As see Trump dangerous #freepress around worl...,"[0, 0, 0, 0, 0, 1, 0, 0]"
1178309.0,376,2016-01-24 23:53:05,[],0x1cd5b0,Now ISSA is stalking Tasha 😂😂😂 <LH>,train,fear,Now ISSA stalking Tasha 😂😂😂 <LH>,"[0, 0, 0, 1, 0, 0, 0, 0]"
687882.0,120,2015-06-11 04:44:05,"[authentic, LaughOutLoud]",0x1d755c,@RISKshow @TheKevinAllison Thx for the BEST TI...,train,joy,Thx BEST TIME tonight What stories! Heartbre...,"[0, 0, 0, 0, 1, 0, 0, 0]"
407631.0,1021,2015-08-18 02:30:07,[],0x2c91a8,Still waiting on those supplies Liscus. <LH>,train,anticipation,Still waiting supplies Liscus <LH>,"[0, 1, 0, 0, 0, 0, 0, 0]"
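
The `bin_emotion` vectors follow the alphabetical order of the class names, which is why `anticipation` is encoded with a 1 in the second position. A minimal sketch with three toy labels (not the competition data) shows the encoding and how to map vectors back to names:

```python
from sklearn.preprocessing import LabelBinarizer

labels = ["joy", "anger", "trust", "joy"]  # toy labels

binarizer = LabelBinarizer()
one_hot = binarizer.fit_transform(labels)

print(binarizer.classes_)  # classes are stored in sorted order
print(one_hot)

# inverse_transform maps one-hot rows (or per-class scores) back to class names.
decoded = binarizer.inverse_transform(one_hot)
print(decoded)
```

The same `inverse_transform` call is what turns a neural network's per-class output back into emotion names for a submission file.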


### Feature Engineering

After the data cleaning and preparation is finished, we split the labeled training data into a training set and a held-out validation set (60%/40%). This gives us labeled data to validate our models against.

In [56]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import metrics
from sklearn.model_selection import train_test_split

# We do this using the `train_tweets` dataframe. The `test_tweets` dataframe is the data that will be used for 
# the Kaggle competition predictions.
sample_tweets = train_tweets.sample(n=1000000)

tweets_x = sample_tweets['cleaned_text']
tweets_y = sample_tweets['emotion']

x_train, x_test, y_train, y_test = train_test_split(tweets_x, tweets_y, random_state=42, test_size=0.4)
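
One option not used above is a stratified split, which keeps the class proportions the same in both parts. The sketch below uses made-up, imbalanced toy labels and a 50/50 split so the counts come out exactly; whether stratification would change the results reported here is untested.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 8 "joy" and 2 "fear" examples (made-up data).
texts = pd.Series([f"tweet {i}" for i in range(10)])
labels = pd.Series(["joy"] * 8 + ["fear"] * 2)

# `stratify=labels` preserves the 80/20 class ratio in both halves.
x_tr, x_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)
print(y_val.value_counts().to_dict())
```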

For the feature engineering step, we use both the bag-of-words approach and TF-IDF. These features are used to train the various models, since the models need numerical input.

##### Bag of Words features via CountVectorizer

In [57]:
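# NOTE: despite the heading, this cell instantiates `TfidfVectorizer` with the same settings as the TF-IDF cell below,
# so the "bag of words" matrix is identical to the TF-IDF one. For raw counts, swap in `CountVectorizer` with the same arguments.
bow_vector = TfidfVectorizer(max_features=20000, min_df=1, stop_words='english', tokenizer=nltk.word_tokenize)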

# learn training data vocab and create a document-term matrix
x_bow_traincv = bow_vector.fit_transform(x_train)

# Transform the testing data into a document-term matrix
x_bow_testcv = bow_vector.transform(x_test)

##### TF-IDF Features via TfidfVectorizer

In [58]:
tfidf_vector = TfidfVectorizer(max_features=20000, min_df=1, stop_words='english', tokenizer=nltk.word_tokenize)

# learn training data vocab and create a document-term matrix
x_tfidf_traincv = tfidf_vector.fit_transform(x_train)

# Transform the testing data into a document-term matrix
x_tfidf_testcv = tfidf_vector.transform(x_test)

In [59]:
# Create the same features for the tweets to be used in the Kaggle competition
kaggle_x_tfidf_cv = tfidf_vector.transform(test_tweets['cleaned_text'])
kaggle_x_bow_cv   = bow_vector.transform(test_tweets['cleaned_text'])
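
To illustrate the difference between the two feature types, here is a small sketch on a two-document toy corpus (not the competition data). `CountVectorizer` yields raw term counts, while `TfidfVectorizer` reweights them by inverse document frequency and L2-normalizes each row by default.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["happy happy day", "sad day"]  # toy corpus

counts = CountVectorizer().fit(docs)
tfidf = TfidfVectorizer().fit(docs)

count_matrix = counts.transform(docs).toarray()
tfidf_matrix = tfidf.transform(docs).toarray()

# Both vectorizers learn the same vocabulary; columns follow its sorted order.
print(sorted(counts.vocabulary_))
print(count_matrix)
print(tfidf_matrix.round(2))
```

In the TF-IDF matrix, "day" appears in both documents and therefore receives a lower weight than "sad" within the second document, even though both occur once.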

### Models and Insights

#### Random Forest

One of my first attempts was to train a random forest to perform the predictions. The random forest was trained with 75 estimators (`n_estimators=75`) on the TFIDF features and achieved an accuracy of 0.43 in the Kaggle competition.
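
The notebook no longer contains the random forest cell, so the following is only a sketch of how such a model could be set up. It assumes scikit-learn's `RandomForestClassifier`, uses made-up toy tweets, and takes only `n_estimators=75` from the description above; `n_jobs` and `random_state` are illustrative choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy data; the real model was trained on the competition tweets.
texts = [
    "so happy today", "happy and glad", "glad day",
    "very sad news", "sad and angry", "angry sad",
]
labels = ["joy", "joy", "joy", "sadness", "sadness", "sadness"]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)

# 75 trees as in the report; n_jobs=-1 trains the trees in parallel.
forest = RandomForestClassifier(n_estimators=75, n_jobs=-1, random_state=42)
forest.fit(features, labels)

print(forest.predict(vectorizer.transform(["happy day"])))
```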

#### Naive Bayes Classifier

I also trained two Naive Bayes classifiers: one intended to use the CountVectorizer features and the other the TFIDF features. These two classifiers can be found below. Both reached exactly the same validation accuracy (0.5095). This is expected here, because the "bag of words" vectorizer above is also a `TfidfVectorizer` with identical settings, so the two models saw the same features and the comparison does not actually measure counts against TF-IDF.

Insights: For the Naive Bayes models, I was initially getting accuracy between 30% and 40% and could not push it higher. I think the issue was that I was performing too much data cleaning on the tweets. After cutting back on the cleaning, I managed to pass 40% accuracy.

In [60]:
# TFIDF Model
tfidf_mnb = MultinomialNB()

# Train model
%time tfidf_mnb.fit(x_tfidf_traincv, y_train)

# Validate model accuracy
tfidf_prediction = tfidf_mnb.predict(x_tfidf_testcv)

print('Testing accuracy: ', metrics.accuracy_score(y_test, tfidf_prediction))

# BOW Model
bow_mnb = MultinomialNB()

# Train model
%time bow_mnb.fit(x_bow_traincv, y_train)

# Validate model accuracy
bow_prediction = bow_mnb.predict(x_bow_testcv)

print('Testing accuracy: ', metrics.accuracy_score(y_test, bow_prediction))

CPU times: user 2.57 s, sys: 63.7 ms, total: 2.63 s
Wall time: 2.67 s
Testing accuracy:  0.5095025
CPU times: user 2.52 s, sys: 17.1 ms, total: 2.54 s
Wall time: 2.55 s
Testing accuracy:  0.5095025
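
Accuracy alone can hide weak performance on the less frequent emotions. As a sketch of a per-class check (with made-up labels and predictions, not the competition results), a confusion matrix and macro-averaged F1 can be computed alongside accuracy:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = ["joy", "joy", "joy", "fear", "fear", "trust"]  # made-up labels
y_pred = ["joy", "joy", "joy", "joy", "fear", "joy"]     # made-up predictions

labels = ["fear", "joy", "trust"]

accuracy = accuracy_score(y_true, y_pred)
# Macro F1 averages the per-class F1 scores, so rare classes count equally.
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)
# Rows are true classes and columns are predicted classes, in `labels` order.
matrix = confusion_matrix(y_true, y_pred, labels=labels)

print(accuracy, macro_f1)
print(matrix)
```

In this toy case the classifier never predicts `trust`, which barely dents accuracy but pulls the macro F1 down noticeably.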


In [61]:
def prepare_submission(t_prediction, tweet_ids, name):
    # Write the predictions to a CSV file with `id` and `emotion` columns.
    dataframe = pd.DataFrame({ 'id': tweet_ids.tolist(), 'emotion': t_prediction.tolist() })
    dataframe.to_csv(name, index=False)

In [62]:
prepare_submission(tfidf_mnb.predict(kaggle_x_tfidf_cv), test_tweets['tweet_id'], 'naive_with_tfidf.csv')
prepare_submission(bow_mnb.predict(kaggle_x_bow_cv), test_tweets['tweet_id'], 'naive_with_bow.csv')

#### Logistic Regression

I also trained a logistic regression model to see if the results could be improved. Overall, the logistic regression model's accuracy was slightly better than that of the Naive Bayes models (0.5196 versus 0.5095 on the validation split).

Insights: Initially, the logistic regression model's results were almost the same as the Naive Bayes results. However, after changing the `C` value and `max_iter`, I managed to get the model to perform better than the Naive Bayes models.

In [65]:
from sklearn.linear_model import LogisticRegression

logisticRegr = LogisticRegression(C=1e4, solver='lbfgs', multi_class='multinomial', max_iter=550)

# Train model
logisticRegr.fit(x_bow_traincv, y_train)

# Validate model
print('training', logisticRegr.score(x_bow_traincv, y_train))
print('testing', logisticRegr.score(x_bow_testcv, y_test))



training 0.5769066666666667
testing 0.51961


In [64]:
log_reg_prediction = logisticRegr.predict(kaggle_x_bow_cv)

# Export
prepare_submission(log_reg_prediction, test_tweets['tweet_id'], 'logreg_with_bow_2.csv')

# Dump the model (`dump` comes from joblib, which is not imported in the cells above)
from joblib import dump
dump(logisticRegr, 'logistic_regresssion_2.joblib')

['logistic_regresssion_2.joblib']
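
The `C` and `max_iter` values above were found by hand. A more systematic alternative, shown here only as a sketch on made-up toy tweets, is a cross-validated grid search with scikit-learn's `GridSearchCV`; the candidate `C` values are illustrative and were not part of the original experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy data standing in for the cleaned tweets.
texts = [
    "so happy today", "happy and glad", "glad sunny day", "what a happy day",
    "very sad news", "sad and angry", "angry all day", "such sad times",
]
labels = ["joy"] * 4 + ["sadness"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=550)),
])

# Smaller C means stronger regularization; the grid is illustrative.
param_grid = {"clf__C": [1e-1, 1e0, 1e2, 1e4]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_)
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so the validation folds never leak into the vocabulary.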

#### Neural Network

I also built a neural network with Keras. However, its results were not as good as I expected: the highest accuracy I managed to get was 42%.

I think this was due to a combination of things that I did not do well. Firstly, training took far too long and my Google Colab session kept timing out. I then tried using a Keras callback to save the model after each epoch of training. However, that still did not help with the training.

I also tried reducing the amount of data cleaning performed on the tweets.

In the end, logistic regression was the model I settled on and tried to improve the most.

### Final Model Explanation

In the end, my best results were obtained using Logistic Regression and the TFIDF features. To get the best results from this model, I tried different `C` and `max_iter` values, settling on `C=1e4` and `max_iter=550`.
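
One way to make the final model easier to reuse is to bundle the vectorizer and classifier into a single pipeline before saving it, so new tweets are always transformed with the same vocabulary. The sketch below assumes toy data and a hypothetical file name; only the `C`, `solver`, and `max_iter` values come from the report.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy data standing in for the cleaned tweets.
texts = ["so happy today", "happy and glad", "very sad news", "sad and angry"]
labels = ["joy", "joy", "sadness", "sadness"]

model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(C=1e4, solver="lbfgs", max_iter=550),
)
model.fit(texts, labels)

new_tweets = ["happy day", "sad news"]
original_predictions = model.predict(new_tweets)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "final_model.joblib")  # hypothetical file name
    dump(model, path)       # saves vectorizer and classifier together
    restored = load(path)
    restored_predictions = restored.predict(new_tweets)

print(list(restored_predictions))
```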