# Exploring how emojis affect sentiment analysis
## Introduction
The aim of this research is to understand how emoticons and emojis influence the polarity, or sentiment, of a sentence in the domain of Sentiment Analysis.

To achieve this aim, the research objectives are:
- Survey how traditional machine learning techniques are applied to the task of Sentiment Analysis.
- Survey public datasets containing tweets or reviews whose sentiment has been manually assigned.
- Implement machine learning techniques and run experiments comparing them on the selected dataset.
- Explain how the presence of emojis and emoticons influences the results. When the polarity of the sentence conflicts with that of the emoji, try to determine whether the text is ironic.

## Useful Links
- [Notion](https://www.notion.so/Research-Methodology-Project-Planning-d4631470aa3a41a5a31614b38937ccc9): Documents of the research including papers, datasets and implementations.
- [Selected dataset](https://www.kaggle.com/crowdflower/twitter-airline-sentiment): The Twitter US Airline Sentiment dataset records how travelers expressed their feelings about each major U.S. airline on Twitter in February 2015. Contributors manually labeled each tweet as positive, negative or neutral.

## Authors
- Serghei Socolovschi [serghei@kth.se](mailto:serghei@kth.se)
- Angel Igareta [alih2@kth.se](mailto:alih2@kth.se)

## General

### Imports
All the imports of the project are gathered here so that they are easier to maintain and there are no redundant imports or conflicting versions.

In [None]:
# Load the TensorBoard notebook extension
%reload_ext tensorboard
!rm -rf ./logs/

In [None]:
# Run only once
!sudo apt install openjdk-8-jdk
!sudo update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
!pip install language-check
!pip install pycontractions
!pip install emot
!pip install demoji
!pip install pyspellchecker

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  fonts-dejavu-core fonts-dejavu-extra libatk-wrapper-java
  libatk-wrapper-java-jni libgail-common libgail18 libgtk2.0-0 libgtk2.0-bin
  libgtk2.0-common libxxf86dga1 openjdk-8-jdk-headless openjdk-8-jre
  openjdk-8-jre-headless x11-utils
Suggested packages:
  gvfs openjdk-8-demo openjdk-8-source visualvm icedtea-8-plugin libnss-mdns
  fonts-ipafont-gothic fonts-ipafont-mincho fonts-wqy-microhei
  fonts-wqy-zenhei fonts-indic mesa-utils
The following NEW packages will be installed:
  fonts-dejavu-core fonts-dejavu-extra libatk-wrapper-java
  libatk-wrapper-java-jni libgail-common libgail18 libgtk2.0-0 libgtk2.0-bin
  libgtk2.0-common libxxf86dga1 openjdk-8-jdk openjdk-8-jdk-headless
  openjdk-8-jre openjdk-8-jre-headless x11-utils
0 upgraded, 15 newly installed, 0 to remove and 16 not upgraded.
Need to get 43.4 MB of archives.
After this 

In [None]:
import csv
import re
import pandas as pd
import numpy as np
import nltk
from time import time  # time() is used to measure training and prediction time

from pycontractions import Contractions # For expanding contractions

# For lemmatization
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# For removing stop words
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

# For removing accents
import unicodedata 

# For removing emoticons
from emot.emo_unicode import EMOTICONS

# For transforming the emojis into their description
import demoji
demoji.download_codes()

# For spellchecking 
from spellchecker import SpellChecker

# Machine Learning Models
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import svm
from sklearn.metrics import classification_report, accuracy_score, f1_score, roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Hypertuning
import itertools 
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Downloading emoji data ...
... OK (Got response in 0.13 seconds)
Writing emoji data to /root/.demoji/codes.json ...
... OK


### Constants

In [None]:
# This dataset is already preprocessed: hashtags, mentions, URLs and punctuation other than emoticons or apostrophes have been removed.
dataset_url = "https://drive.google.com/uc?export=download&id=1wEAHS8-pzvKa7tIz99HJNSfKYwsEqs3T"
contraction_expander = Contractions(api_key="glove-twitter-100")
lemmatizer = WordNetLemmatizer()

# Keep negation words out of the stop word list, since removing them would change the sentiment
english_stop_words = set(stopwords.words('english')) 
english_stop_words.remove('no')
english_stop_words.remove('nor')
english_stop_words.remove('not')

In [None]:
# https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py

## Preprocessing Methods

In [None]:
example_sentence = "I've been to Málaga and Alcorcón to the world's most famous burger shop ❤️ 😋 I ain't flying ☺️👍 :)"

### Emojis and emoticons
This is the core of the research: the main hypertuning is carried out over the preprocessing of the dataset, either removing or keeping each of these elements.

In [None]:
emoji_pattern = re.compile("["
                u"\U0001F600-\U0001F64F"  # emoticons
                u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                u"\U0001F680-\U0001F6FF"  # transport & map symbols
                u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                u"\U00002702-\U000027B0"
                u"\U000024C2-\U0001F251"
                u"\U0001f926-\U0001f937"
                u'\U00010000-\U0010ffff'
                u"\u200d"
                u"\u2640-\u2642"
                u"\u2600-\u2B55"
                u"\u23cf"
                u"\u23e9"
                u"\u231a"
                u"\u3030"
                u"\ufe0f"
    "]+", flags=re.UNICODE)

In [None]:
def remove_emojis_text(text):
  return emoji_pattern.sub(r'', text)

In [None]:
remove_emojis_text(example_sentence)

"I've been to Málaga and Alcorcón to the world's most famous burger shop   I ain't flying  :)"

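One detail of the pattern above is worth checking: the range `\U000024C2-\U0001F251` is very wide and also covers CJK ideographs and other scripts, so non-emoji text in those scripts would be removed as well. The sketch below illustrates this and shows a narrower alternative. The names `broad_range`, `narrow_emoji_pattern` and `remove_emojis_narrow` are hypothetical, the chosen blocks are an assumption about which symbols matter for this dataset, and the narrower pattern is not the one used in the experiments.

```python
import re

# The broad range taken from the pattern above: it spans U+24C2 to U+1F251,
# which also contains CJK ideographs and many other non-emoji characters.
broad_range = re.compile("[\U000024C2-\U0001F251]+")

# Hypothetical narrower alternative limited to common emoji blocks.
narrow_emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\u200d\ufe0f"           # zero-width joiner and variation selector-16
    "]+"
)

def remove_emojis_narrow(text):
    """Remove emojis while leaving text in other scripts untouched."""
    return narrow_emoji_pattern.sub("", text)

sample = "shop \u2764\ufe0f \U0001F60B \u4e2d\u6587 \U0001F44D"
print(repr(broad_range.sub("", sample)))   # the CJK characters are removed too
print(repr(remove_emojis_narrow(sample)))  # only the emojis are removed
```

For an English-only corpus such as the airline tweets the difference is unlikely to matter, but it becomes relevant as soon as the text contains other scripts.
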
In [None]:
def remove_emoticons_text(text):
  emoticon_pattern = re.compile(u'(' + u'|'.join(k for k in EMOTICONS) + u')')
  return emoticon_pattern.sub(r'', text)

In [None]:
remove_emoticons_text("Hi :)")

'Hi '

In [None]:
def remove_hashtags_text(text):
  return re.sub(r"#(\w+)", ' ', text, flags=re.MULTILINE)

In [None]:
remove_hashtags_text("Hi #friend")

'Hi  '

In [None]:
def remove_mentions_text(text):
  return re.sub(r"@(\w+)", ' ', text, flags=re.MULTILINE)

### Remove accents from text
[Source](https://www.kdnuggets.com/2018/08/practitioners-guide-processing-understanding-text-2.html)

In [None]:
def remove_accents(text):
    unaccented_text = ''.join((char for char in unicodedata.normalize('NFD', text) if unicodedata.category(char) != 'Mn'))
    return unaccented_text

In [None]:
# Check that neither emoticons nor emojis are removed
remove_accents(example_sentence)

"I've been to Malaga and Alcorcon to the world's most famous burger shop ❤ 😋 I ain't flying ☺👍 :)"

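To see why this works, it helps to look at what NFD normalization produces. The sketch below (the helper `describe_chars` is a hypothetical name) lists the code points and Unicode categories after decomposition: an accented letter splits into a base letter plus a combining mark of category `Mn`, which is what `remove_accents` filters out. It also shows that the emoji variation selector U+FE0F has category `Mn`, which explains why `❤️` appears as `❤` in the output above.

```python
import unicodedata

def describe_chars(text):
    """Return (code point, Unicode category) pairs for the NFD form of text."""
    decomposed = unicodedata.normalize("NFD", text)
    return [("U+%04X" % ord(char), unicodedata.category(char)) for char in decomposed]

# An accented letter decomposes into a base letter and a combining accent.
print(describe_chars("\u00e1"))        # 'á'
# A heart emoji followed by the variation selector U+FE0F.
print(describe_chars("\u2764\ufe0f"))
```
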
### Expanding Contractions
This step replaces shortened forms of words with their full versions, which helps to standardize the text. For example, don’t becomes do not and I’d becomes I would. [More info on why to use it](https://medium.com/@lukei_3514/dealing-with-contractions-in-nlp-d6174300876b)

The [pycontractions library](https://pypi.org/project/pycontractions/) is used, which follows a three-step approach.

In [None]:
def expand_contractions(text):
  expanded_text = contraction_expander.expand_texts([text], precise=True) # precise=True helps to resolve the ambiguity of "ain't"
  return list(expanded_text)[0]

In [None]:
expand_contractions(example_sentence)



"I have been to Málaga and Alcorcón to the world's most famous burger shop ❤️ 😋 I have not flying ☺️👍 :)"

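For intuition, the simplest form of contraction expansion is a dictionary lookup. The sketch below is a minimal illustration of that idea and not how pycontractions works internally; `CONTRACTION_MAP` and `expand_contractions_simple` are hypothetical names, and the map only covers a handful of unambiguous cases. Ambiguous contractions such as "ain't" or "I'd" are exactly where a plain lookup falls short.

```python
import re

# Hypothetical, deliberately tiny lookup table of unambiguous contractions.
CONTRACTION_MAP = {
    "i've": "I have",
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
}

contraction_pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_simple(text):
    """Replace each known contraction with its expanded form."""
    return contraction_pattern.sub(
        lambda match: CONTRACTION_MAP[match.group(0).lower()], text
    )

print(expand_contractions_simple("I've been here and I don't mind"))
```

Note that this sketch only matches the straight apostrophe (') and does not preserve capitalization at the start of a sentence.
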
### Lemmatization
This approach extracts the root forms of the words in the text, so that different inflections of a word are counted as occurrences of the same term, which helps to standardize the text. Lemmatization was selected instead of stemming because speed is not a major concern in this case and the result is more representative for the type of text being used. [See differences](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python)

In [None]:
def lemmatize_text(text):
  lemmatized_text = " ".join([lemmatizer.lemmatize(word, pos="v") for word in text.split(" ")])
  return lemmatized_text

In [None]:
lemmatize_text(example_sentence)

"I've be to Málaga and Alcorcón to the world's most famous burger shop ❤️ 😋 I ain't fly ☺️👍 :)"

### Removing Stopwords
Remove words that carry little meaning in the text, typically articles, conjunctions and prepositions. Examples of stop words are a, an and the.

In [None]:
def remove_stop_words_text(text):
  word_tokens = word_tokenize(text) 
  filtered_text = " ".join([word for word in word_tokens if word not in english_stop_words])
  return filtered_text

In [None]:
remove_stop_words_text(example_sentence)

"I 've Málaga Alcorcón world 's famous burger shop ❤️ 😋 I ai n't flying ☺️👍 : )"

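The reason the negation words no, nor and not were taken out of the stop word list in the Constants section can be shown with a small, self-contained sketch. The stop word set below is a hypothetical miniature list rather than the NLTK one, and `filter_tokens` is an illustrative helper that splits on whitespace instead of using `word_tokenize`.

```python
# Hypothetical miniature stop word list (the notebook uses the NLTK list).
BASIC_STOP_WORDS = {"i", "am", "the", "with", "a", "was", "not", "no", "nor"}
NEGATIONS = {"no", "nor", "not"}

def filter_tokens(text, stop_words):
    """Drop every whitespace-separated token that is in stop_words."""
    return " ".join(word for word in text.lower().split() if word not in stop_words)

sentence = "I am not happy with the flight"

# With negations treated as stop words the polarity of the sentence flips.
print(filter_tokens(sentence, BASIC_STOP_WORDS))
# With negations kept, the negative meaning is still visible.
print(filter_tokens(sentence, BASIC_STOP_WORDS - NEGATIONS))
```
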
### Transforming Emojis and Emoticons into text
[Source](https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing#Conversion-of-Emoji-to-Words)

In [None]:
def convert_emojis(text):
  emoticon_dic = demoji.findall(text)
  for emot, value in emoticon_dic.items():
    text = text.replace(emot, value.lower().replace(" ", "_") + " ")
  return text

In [None]:
convert_emojis(example_sentence)

"I've been to Málaga and Alcorcón to the world's most famous burger shop red_heart  face_savoring_food  I ain't flying smiling_face thumbs_up  :)"

In [None]:
def convert_emoticons(text):
  for emot in EMOTICONS:
    value = EMOTICONS[emot].lower().replace(",", "").split()
    text = re.sub(u'(' + emot + ')', "_".join(value) + " ", text)
  return text

In [None]:
convert_emoticons(example_sentence)

"I've been to Málaga and Alcorcón to the world's most famous burger shop ❤️ 😋 I ain't flying ☺️👍 happy_face_or_smiley "

In [None]:
convert_emoticons(convert_emojis(example_sentence))

"I've been to Málaga and Alcorcón to the world's most famous burger shop red_heart  face_savoring_food  I ain't flying smiling_face thumbs_up  happy_face_or_smiley "

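If installing demoji is not an option, a rough approximation of the emoji-to-text step can be built with the standard library alone. The sketch below is an assumption-laden alternative, not the method used in this notebook: `describe_symbols` is a hypothetical helper that replaces every character of Unicode category `So` with its official Unicode name, and those names differ from the descriptions demoji returns (for example, the thumbs-up emoji becomes `thumbs_up_sign` rather than `thumbs_up`).

```python
import unicodedata

def describe_symbols(text):
    """Replace symbol characters (category 'So') with their Unicode names."""
    pieces = []
    for char in text:
        if unicodedata.category(char) == "So":
            name = unicodedata.name(char, "unknown symbol")
            pieces.append(" " + name.lower().replace(" ", "_") + " ")
        else:
            pieces.append(char)
    return "".join(pieces)

print(describe_symbols("burger shop \U0001F60B \U0001F44D"))
```
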
### Spelling check
Note: Discarded in preprocessing because of poor results

In [None]:
spell = SpellChecker()
def correct_spellings(text):
    corrected_text = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
            corrected_text.append(spell.correction(word))
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)

In [None]:
correct_spellings('I\'ve bee to Málaga and Alcorcón to the world most famous burge shop')

"I've bee to Málaga and Alcorcón to the world most famous urge shop"

### Preprocess Wrapper
Function that combines the previous transformations into a single call.

In [None]:
def preprocess_text(text, 
                    remove_emojis = False, 
                    remove_emoticons = False, 
                    remove_hashtags = True, 
                    remove_mentions = False, 
                    remove_stop_words = True,
                    transform_lowercase = True, 
                    transform_emojis_text = False, 
                    transform_emoticons_text = False,
                    transform_lemmatize = True):
  preprocessed_text = text
  preprocessed_text = remove_accents(preprocessed_text)
  preprocessed_text = expand_contractions(preprocessed_text)

  # Elements removal
  if remove_emojis:
    preprocessed_text = remove_emojis_text(preprocessed_text)
  if remove_emoticons:
    preprocessed_text = remove_emoticons_text(preprocessed_text)
  if remove_hashtags:
    preprocessed_text = remove_hashtags_text(preprocessed_text)
  if remove_mentions:
    preprocessed_text = remove_mentions_text(preprocessed_text)
  if remove_stop_words:
    preprocessed_text = remove_stop_words_text(preprocessed_text)

  # Transformations
  if transform_lowercase:
    preprocessed_text = preprocessed_text.lower()
  if transform_emojis_text and not remove_emojis:
    preprocessed_text = convert_emojis(preprocessed_text)
  if transform_emoticons_text and not remove_emoticons:
    preprocessed_text = convert_emoticons(preprocessed_text)
  if transform_lemmatize:
    preprocessed_text = lemmatize_text(preprocessed_text)

  # Remove 's if there is any left
  preprocessed_text = re.sub(r'\'s', '', preprocessed_text)
  preprocessed_text = re.sub(r'\s+', ' ', preprocessed_text)

  return preprocessed_text

In [None]:
preprocess_text(example_sentence, remove_emoticons=True)

'i malaga alcorcon world famous burger shop ❤ 😋 i not fly ☺👍'

## Dataset

### Initialization

In [None]:
df = pd.read_csv(dataset_url)
df.head(20)

Unnamed: 0.1,Unnamed: 0,sentiment,sentiment_confidence,text,sentiment_numeric,hasEmoji,hasEmoticon,hasUrl,hasMention,hasHashtag
0,1,positive,1.0,@JetBlue incredible PR team 👏👏👏👏,1,True,False,False,True,False
1,2,neutral,1.0,@SouthwestAir can you please DM me I have a q...,0,False,True,False,True,False
2,3,neutral,0.6733,@SouthwestAir how oh how do we get tickets,0,False,False,False,True,False
3,4,negative,1.0,@united and now your rep just hung up on me af...,-1,False,False,False,True,False
4,5,negative,1.0,@AmericanAir our flight was Cancelled Flightle...,-1,False,False,False,True,True
5,6,neutral,0.669,@SouthwestAir @travelportland welcome to Portl...,0,False,False,True,True,False
6,7,positive,0.6701,@VirginAmerica I m looking forward to watching...,1,False,True,False,True,False
7,8,positive,0.3586,@united Just sent Thanks :),1,False,True,False,True,False
8,9,negative,1.0,@USAirways I tried to call your customer servi...,-1,False,False,False,True,False
9,10,positive,0.656,@united Hubby made it by the skin of his teeth...,1,False,True,False,True,False


### Processing

Only keep texts with confidence > 0.6

In [None]:
df = df[df['sentiment_confidence'] > 0.6].copy()  # copy() avoids SettingWithCopyWarning when adding columns later

Add a second label that merges neutral into positive to obtain a clearer binary distinction (it will be used in hypertuning).

In [None]:
df['sentiment_binary'] = df['sentiment_numeric'].apply(lambda x: x + 1 if x != 1 else x)
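
The lambda above is compact, so it may help to spell out the mapping it produces. The sketch below applies the same expression to the three possible values of `sentiment_numeric`; the function name `to_binary` is hypothetical and only used for illustration.

```python
def to_binary(sentiment_numeric):
    """Same expression as the lambda above: shift -1 and 0 up by one, keep 1."""
    return sentiment_numeric + 1 if sentiment_numeric != 1 else sentiment_numeric

# negative (-1) -> 0, neutral (0) -> 1, positive (1) -> 1
mapping = {label: to_binary(label) for label in (-1, 0, 1)}
print(mapping)
```

An explicit dictionary such as `{-1: 0, 0: 1, 1: 1}` passed to `Series.map` would be an equivalent and arguably more readable way to express the same relabeling.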

Preprocess texts

In [None]:
df['text_clean'] = df['text'].apply(preprocess_text)

In [None]:
df['text_clean_no_graphics'] = df['text_clean'].apply(lambda text: re.sub(r"[^a-zA-Z0-9#@]+", ' ', text))

Show result

In [None]:
df.head(20)

Unnamed: 0.1,Unnamed: 0,sentiment,sentiment_confidence,text,sentiment_numeric,hasEmoji,hasEmoticon,hasUrl,hasMention,hasHashtag,sentiment_binary,text_clean,text_clean_no_graphics
0,1,positive,1.0,@JetBlue incredible PR team 👏👏👏👏,1,True,False,False,True,False,1,@ jetblue incredible pr team 👏👏👏👏,@ jetblue incredible pr team
1,2,neutral,1.0,@SouthwestAir can you please DM me I have a q...,0,False,True,False,True,False,1,@ southwestair please dm i question : ),@ southwestair please dm i question
2,3,neutral,0.6733,@SouthwestAir how oh how do we get tickets,0,False,False,False,True,False,1,@ southwestair oh get ticket,@ southwestair oh get ticket
3,4,negative,1.0,@united and now your rep just hung up on me af...,-1,False,False,False,True,False,0,@ unite rep hang 35 mins hold i ask supervisor...,@ unite rep hang 35 mins hold i ask supervisor...
4,5,negative,1.0,@AmericanAir our flight was Cancelled Flightle...,-1,False,False,False,True,True,0,@ americanair flight cancel flightled rebooked...,@ americanair flight cancel flightled rebooked...
5,6,neutral,0.669,@SouthwestAir @travelportland welcome to Portl...,0,False,False,True,True,False,1,@ southwestair @ travelportland welcome portla...,@ southwestair @ travelportland welcome portla...
6,7,positive,0.6701,@VirginAmerica I m looking forward to watching...,1,False,True,False,True,False,1,@ virginamerica i look forward watch oscars fl...,@ virginamerica i look forward watch oscars fl...
8,9,negative,1.0,@USAirways I tried to call your customer servi...,-1,False,False,False,True,False,0,@ usairways i try call customer service line k...,@ usairways i try call customer service line k...
9,10,positive,0.656,@united Hubby made it by the skin of his teeth...,1,False,True,False,True,False,1,@ unite hubby make skin teeth : ),@ unite hubby make skin teeth
10,11,neutral,0.6809,@USAirways I think it's ok,0,False,False,False,True,False,1,@ usairways i think ok,@ usairways i think ok


In [None]:
df.shape

(1294, 13)

### Data preparation

In [None]:
# For now, use a single train/test split. Future work: cross-validation
sub_df = df.loc[:, ['sentiment_numeric', 'sentiment_binary', 'text_clean']]

train, test = train_test_split(sub_df, test_size=0.1, random_state=1)

# Prepare input data for the algorithms
X_train = train['text_clean'].values
X_test = test['text_clean'].values
sentences = np.append(X_train, X_test)

Y_train = train['sentiment_binary']
Y_test = test['sentiment_binary']

print(X_train)
print(Y_train)

['@ americanair @ airport 8hrs tell nothing anyone rep say wait mechanics finish lunch'
 '@ jetblue tweet get word 😢'
 '@ unite appreciate sentiment able get grind still miss connection' ...
 '@ jetblue i nervous sunday be flight baltimore boston any suggestions i need boston monday'
 '@ unite cat flight delay 1+hour arrive hawaii 5 amp i not able pick tomorrow 😭'
 '@ usairways thank make miss meet dallas 700 dollars toilet 😊']
1010    0
435     0
133     1
316     0
465     1
       ..
728     1
922     0
1118    1
240     0
1082    0
Name: sentiment_binary, Length: 1164, dtype: int64


### Vectorizers
Two vectorizers will be used in the hypertuning:
- Count: Transforms the text in our data frame into a **bag-of-words model**, represented as a sparse matrix of integers in which each entry is the number of occurrences of a word (or n-gram) in a text.
- TF-IDF: Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval, that has also found good use in document classification. The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.

We need to convert the text into a numeric representation such as a bag-of-words model, since the machine learning algorithms cannot operate on raw text.
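
The two weighting schemes can be illustrated without scikit-learn. The sketch below is a simplified, pure-Python approximation: it counts unigrams and bigrams using the same whitespace-based token pattern, and computes the smoothed inverse document frequency that `TfidfVectorizer` uses by default, `ln((1 + n) / (1 + df)) + 1`. The toy documents and the helper names `ngrams` and `smooth_idf` are assumptions for illustration, and the L2 normalization that scikit-learn applies afterwards is omitted.

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"[^\s]+")  # any run of non-whitespace, so emojis survive

def ngrams(text, n_max=2):
    """Return all unigrams and bigrams of the lowercased text."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

docs = ["great flight \U0001F60A", "late flight \U0001F622", "great crew"]
counts = [Counter(ngrams(doc)) for doc in docs]  # bag-of-words counts per document

def smooth_idf(term):
    """Smoothed inverse document frequency of a term over docs."""
    document_frequency = sum(1 for count in counts if term in count)
    return math.log((1 + len(docs)) / (1 + document_frequency)) + 1

print(counts[0])
print(smooth_idf("flight"), smooth_idf("\U0001F60A"))
```

The emoji appears in a single document, so it receives a higher IDF weight than the more common word "flight", which is the effect the TF-IDF description above refers to.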

In [None]:
def get_train_test_vectors(vectorizer_name, X_train, X_test):
  if vectorizer_name == "count":
    vectorizer = CountVectorizer(
        analyzer = 'word',
        ngram_range=(1, 2), # Using both unigrams and bigrams
        token_pattern=r'[^\s]+') # To allow emojis and emoticons
  elif vectorizer_name == 'tf-idf':
    vectorizer = TfidfVectorizer(
        analyzer = 'word',
        ngram_range=(1, 2), # Using both unigrams and bigrams
        token_pattern=r'[^\s]+') # To allow emojis and emoticons
  else:
    raise ValueError("Unknown vectorizer: %s" % vectorizer_name)

  # Fit the vocabulary on all sentences (train and test), built from the
  # arguments instead of relying on the global `sentences` variable
  vectorizer.fit(np.append(X_train, X_test))

  train_vectors = vectorizer.transform(X_train)
  test_vectors = vectorizer.transform(X_test)

  return train_vectors, test_vectors

## Algorithms

In [None]:
train_vectors, test_vectors = get_train_test_vectors("count", X_train, X_test)

### Support Vector Machine (SVM)

In [None]:
def get_svm_metrics(train_vectors, train_labels, test_vectors, test_labels, regularization, kernel, gamma):
  start_time = time()
  # Perform classification with SVM
  svm_classifier = svm.SVC(C = regularization, kernel = kernel, gamma = gamma, probability = True)
  # Train Model
  svm_classifier.fit(train_vectors, train_labels)

  # Predict Model
  svm_classifier_prediction = svm_classifier.predict(test_vectors)
  svm_classifier_prediction_proba = svm_classifier.predict_proba(test_vectors)

  # Convert probability to class for roc_auc_score in case of binary
  if svm_classifier_prediction_proba.shape[1] == 2: 
    svm_classifier_prediction_proba = np.array([np.argmax(x) for x in svm_classifier_prediction_proba])

  # Report metrics
  accuracy = accuracy_score(test_labels, svm_classifier_prediction)
  f1 = f1_score(test_labels, svm_classifier_prediction, average = 'weighted')
  auc = roc_auc_score(test_labels, svm_classifier_prediction_proba, average = 'weighted', multi_class = 'ovr')
  end_time = time() - start_time

  return accuracy, f1, auc, end_time

In [None]:
get_svm_metrics(train_vectors, Y_train, test_vectors, Y_test, 1.0, 'linear', 'auto')

(0.8, 0.8009696969696971, 0.7793432982112228, 1.2597923278808594)
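
One caveat about the AUC values reported in this section: in the binary case the metric functions convert the predicted probabilities into hard class labels (via `argmax`) before calling `roc_auc_score`. AUC is a ranking metric, so collapsing the scores to 0/1 generally changes the result. The pure-Python sketch below illustrates the effect on made-up numbers; `pairwise_auc` is a hypothetical helper that computes AUC as the fraction of positive/negative pairs ranked correctly, and the labels and probabilities are invented for the example.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [score for label, score in zip(labels, scores) if label == 1]
    negatives = [score for label, score in zip(labels, scores) if label == 0]
    wins = 0.0
    for positive in positives:
        for negative in negatives:
            if positive > negative:
                wins += 1.0
            elif positive == negative:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
positive_proba = [0.2, 0.3, 0.4, 0.9]  # invented probabilities of the positive class
hard_predictions = [1 if p >= 0.5 else 0 for p in positive_proba]

print(pairwise_auc(labels, positive_proba))    # ranking is perfect
print(pairwise_auc(labels, hard_predictions))  # information is lost after thresholding
```

With scikit-learn, the usual way to obtain the probability-based value in the binary case is to pass the positive-class column, for example `predict_proba(test_vectors)[:, 1]`, to `roc_auc_score`. Doing so here would change the AUC figures shown in the outputs.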

### Multinomial Naive Bayes (MNB)

In [None]:
def get_mnb_metrics(train_vectors, train_labels, test_vectors, test_labels, alpha, fit_prior):
  start_time = time()
  # Perform classification with MNB
  mnb_classifier = MultinomialNB(alpha = alpha, fit_prior = fit_prior)
  # Train Model
  mnb_classifier.fit(train_vectors, train_labels)
  # Predict Model
  mnb_classifier_prediction = mnb_classifier.predict(test_vectors)
  mnb_classifier_prediction_proba = mnb_classifier.predict_proba(test_vectors)

  # Convert probability to class for roc_auc_score in case of binary
  if mnb_classifier_prediction_proba.shape[1] == 2: 
    mnb_classifier_prediction_proba = np.array([np.argmax(x) for x in mnb_classifier_prediction_proba])

  # Report metrics
  accuracy = accuracy_score(test_labels, mnb_classifier_prediction)
  f1 = f1_score(test_labels, mnb_classifier_prediction, average = 'weighted')
  auc = roc_auc_score(test_labels, mnb_classifier_prediction_proba, average = 'weighted', multi_class = 'ovr')
  end_time = time() - start_time

  return accuracy, f1, auc, end_time

In [None]:
get_mnb_metrics(train_vectors, Y_train, test_vectors, Y_test, 1.0, False)

(0.7923076923076923,
 0.7940828402366864,
 0.8011516785101691,
 0.00892186164855957)

### K-Nearest Neighbors (KNN)


In [None]:
def get_knn_metrics(train_vectors, train_labels, test_vectors, test_labels, n_neighbors, weights):
  start_time = time()
  # Perform classification with KNN
  knn_classifier = KNeighborsClassifier(n_neighbors = n_neighbors, weights = weights)
  # Train Model
  knn_classifier.fit(train_vectors, train_labels)
  # Predict Model
  knn_classifier_prediction = knn_classifier.predict(test_vectors)
  knn_classifier_prediction_proba = knn_classifier.predict_proba(test_vectors)

  # Convert probability to class for roc_auc_score in case of binary
  if knn_classifier_prediction_proba.shape[1] == 2: 
    knn_classifier_prediction_proba = np.array([np.argmax(x) for x in knn_classifier_prediction_proba])

  # Report metrics
  accuracy = accuracy_score(test_labels, knn_classifier_prediction)
  f1 = f1_score(test_labels, knn_classifier_prediction, average = 'weighted')
  auc = roc_auc_score(test_labels, knn_classifier_prediction_proba, average = 'weighted', multi_class = 'ovr')
  end_time = time() - start_time

  return accuracy, f1, auc, end_time

In [None]:
get_knn_metrics(train_vectors, Y_train, test_vectors, Y_test, n_neighbors=3, weights='distance')

(0.5846153846153846,
 0.4731538461538462,
 0.5023278608184268,
 0.02318739891052246)

### Decision Tree (DT)

In [None]:
def get_dt_metrics(train_vectors, train_labels, test_vectors, test_labels, max_depth = None, max_features = None, criterion = "gini"):
  start_time = time()
  # Perform classification with DT
  dt_classifier = DecisionTreeClassifier(max_depth = max_depth, max_features = max_features, criterion = criterion)
  # Train Model
  dt_classifier.fit(train_vectors, train_labels)
  # Predict Model
  dt_classifier_prediction = dt_classifier.predict(test_vectors)
  dt_classifier_prediction_proba = dt_classifier.predict_proba(test_vectors)

  # Convert probability to class for roc_auc_score in case of binary
  if dt_classifier_prediction_proba.shape[1] == 2: 
    dt_classifier_prediction_proba = np.array([np.argmax(x) for x in dt_classifier_prediction_proba])

  # Report metrics
  accuracy = accuracy_score(test_labels, dt_classifier_prediction)
  f1 = f1_score(test_labels, dt_classifier_prediction, average = 'weighted')
  auc = roc_auc_score(test_labels, dt_classifier_prediction_proba, average = 'weighted', multi_class = 'ovr')
  end_time = time() - start_time

  return accuracy, f1, auc, end_time

In [None]:
get_dt_metrics(train_vectors, Y_train, test_vectors, Y_test)

(0.7846153846153846,
 0.7862717911744588,
 0.7887772604753738,
 0.14447522163391113)

### Random Forest (RF)

In [None]:
def get_rf_metrics(train_vectors, train_labels, test_vectors, test_labels, n_trees = 100, max_depth = None, max_features = "sqrt", criterion = "gini"):
  start_time = time()
  # Perform classification with RF
  rf_classifier = RandomForestClassifier(n_estimators = n_trees, max_depth = max_depth, max_features = max_features, criterion = criterion)
  # Train Model
  rf_classifier.fit(train_vectors, train_labels)
  # Predict Model
  rf_classifier_prediction = rf_classifier.predict(test_vectors)
  rf_classifier_prediction_proba = rf_classifier.predict_proba(test_vectors)

  # Convert probability to class for roc_auc_score in case of binary
  if rf_classifier_prediction_proba.shape[1] == 2: 
    rf_classifier_prediction_proba = np.array([np.argmax(x) for x in rf_classifier_prediction_proba])

  # Report metrics
  accuracy = accuracy_score(test_labels, rf_classifier_prediction)
  f1 = f1_score(test_labels, rf_classifier_prediction, average = 'weighted')
  auc = roc_auc_score(test_labels, rf_classifier_prediction_proba, average = 'weighted', multi_class = 'ovr')
  end_time = time() - start_time

  return accuracy, f1, auc, end_time

In [None]:
get_rf_metrics(train_vectors, Y_train, test_vectors, Y_test, n_trees = 10)

(0.7846153846153846,
 0.7831185443992592,
 0.771134525851507,
 0.16304922103881836)

### Deep Neural Network (DNN)
*Discarded because of poor accuracy (probably because of small amount of data)*

In [None]:
def get_dnn_metrics(train_vectors, train_labels, test_vectors, test_labels, activation, optimizer):
  dnn_model = Sequential([
    Dense(25, activation=activation),
    Dense(10, activation=activation),
    Dense(1, activation='sigmoid')
  ])
  # Compile the model  
  dnn_model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy']) # 'sparse_categorical_crossentropy' for multi-class classification.

  # Fit the model
  dnn_model.fit(train_vectors, train_labels, epochs=100, batch_size=16, verbose=0)

  # Return [loss, accuracy] on the test set
  return dnn_model.evaluate(test_vectors, test_labels, verbose=0)

In [None]:
get_dnn_metrics(train_vectors.toarray(), Y_train, test_vectors.toarray(), Y_test, activation='relu', optimizer='sgd')

[0.7176179885864258, 0.800000011920929]

## Hypertuning

In [None]:
# General
METRIC_ACCURACY = 'accuracy'
METRIC_F1_SCORE = 'f1-score'
METRIC_AUC = 'auc'
METRIC_TIME = 'time'

# For sentiment label to use
HP_SENTIMENT_TYPE = hp.HParam('sentiment_type', hp.Discrete(['sentiment_binary', 'sentiment_numeric'])) # Discarded after preselection: 'sentiment_numeric'

# For model
HP_MODEL = hp.HParam('model', hp.Discrete(['svm', 'mnb'])) #, 'knn', 'dt', 'rf']))
HP_VECTORIZER = hp.HParam('vectorizer', hp.Discrete(['tf-idf', 'count']))

# For Data processing
HP_EMOJI_INCLUDED = hp.HParam('emoji_included', hp.Discrete([True, False]))
HP_EMOTICON_INCLUDED = hp.HParam('emoticon_included', hp.Discrete([True, False]))
HP_TRANSFORM_EMOJIS_TEXT = hp.HParam('transform_emojis_text', hp.Discrete([True, False]))
HP_TRANSFORM_EMOTICONS_TEXT = hp.HParam('transform_emoticons_text', hp.Discrete([True, False]))

HP_MENTION_INCLUDED = hp.HParam('mention_included', hp.Discrete([True, False])) # Discarded after preselection: False
HP_HASHTAG_INCLUDED = hp.HParam('hashtag_included', hp.Discrete([True, False])) # Discarded after preselection: True
HP_STOPWORDS_INCLUDED = hp.HParam('stopwords_included', hp.Discrete([True, False])) # Discarded after preselection: True
HP_TRANSFORM_LOWERCASE = hp.HParam('transform_lowercase', hp.Discrete([True, False])) # Discarded after preselection: False

hparam_grid = {
  HP_MODEL: HP_MODEL.domain.values,
  HP_VECTORIZER: HP_VECTORIZER.domain.values,
  HP_EMOJI_INCLUDED: HP_EMOJI_INCLUDED.domain.values,
  HP_EMOTICON_INCLUDED: HP_EMOTICON_INCLUDED.domain.values,
  HP_TRANSFORM_EMOJIS_TEXT: HP_TRANSFORM_EMOJIS_TEXT.domain.values,
  HP_TRANSFORM_EMOTICONS_TEXT: HP_TRANSFORM_EMOTICONS_TEXT.domain.values,
  HP_MENTION_INCLUDED: HP_MENTION_INCLUDED.domain.values,
  HP_HASHTAG_INCLUDED: HP_HASHTAG_INCLUDED.domain.values,
  HP_STOPWORDS_INCLUDED: HP_STOPWORDS_INCLUDED.domain.values,
  HP_TRANSFORM_LOWERCASE: HP_TRANSFORM_LOWERCASE.domain.values,
}

### Algorithms Hyperparameters

#### SVM

In [None]:
HP_MODEL_SVM_C = hp.HParam('svm_regularization', hp.IntInterval(1, 3)) # Best after hypertuning => (1, 3)
HP_MODEL_SVM_KERNEL = hp.HParam('svm_kernel', hp.Discrete(['linear', 'rbf'])) # , 'sigmoid'])) # Best after hypertuning => rbf
HP_MODEL_SVM_GAMMA = hp.HParam('svm_gamma', hp.Discrete(['scale'])) # , 'auto'])) # Best after hypertuning => scale

svm_hparam_grid = {
  HP_MODEL_SVM_C: range(HP_MODEL_SVM_C.domain.min_value, HP_MODEL_SVM_C.domain.max_value),
  HP_MODEL_SVM_KERNEL: HP_MODEL_SVM_KERNEL.domain.values,
  HP_MODEL_SVM_GAMMA: HP_MODEL_SVM_GAMMA.domain.values,
}
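
The grid dictionaries map each hyperparameter to its candidate values, and the tuning functions iterate over every combination through `create_tfparam_grid`, which is not shown in this part of the notebook. As a rough idea of what such an expansion involves, the sketch below builds the Cartesian product with `itertools.product`. The function `expand_grid` and the string keys are hypothetical stand-ins (the notebook uses `HParam` objects as keys), so this should be read as an assumption about the behavior, not as the actual implementation.

```python
import itertools

def expand_grid(param_grid):
    """Yield one dict per combination of the candidate values in param_grid."""
    names = list(param_grid)
    for combination in itertools.product(*(param_grid[name] for name in names)):
        yield dict(zip(names, combination))

# Hypothetical grid mirroring the SVM one, with plain strings as keys.
example_grid = {
    "svm_regularization": range(1, 3),  # yields 1 and 2; the upper bound is excluded
    "svm_kernel": ["linear", "rbf"],
    "svm_gamma": ["scale"],
}

combos = list(expand_grid(example_grid))
for combo in combos:
    print(combo)
```

Because `range` excludes its upper bound, a grid built with `range(min_value, max_value)` never tries `max_value` itself. If the upper end of the interval is meant to be explored, `range(min_value, max_value + 1)` would be needed.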

In [None]:
def hypertune_svm(session_num, train_vectors, y_train, test_vectors, y_test, hparams):
  for svm_hparams in list(create_tfparam_grid(svm_hparam_grid)):
    hparams.update(svm_hparams)

    run_name = "run-%d" % session_num
    print('--- Starting trial: %s' % run_name)
    print({h.name: hparams[h] for h in hparams})

    # Log to tensorboard
    with tf.summary.create_file_writer(run_dir + run_name).as_default():
      # Record the values used in this trial
      hp.hparams(hparams)  
      # Get metrics
      accuracy, f1, auc, time = get_svm_metrics(train_vectors, 
                                  y_train, 
                                  test_vectors, 
                                  y_test, 
                                  svm_hparams[HP_MODEL_SVM_C], 
                                  svm_hparams[HP_MODEL_SVM_KERNEL],
                                  svm_hparams[HP_MODEL_SVM_GAMMA])
      tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1)
      tf.summary.scalar(METRIC_F1_SCORE, f1, step=1)
      tf.summary.scalar(METRIC_AUC, auc, step = 1)
      tf.summary.scalar(METRIC_TIME, time, step = 1)
    session_num += 1
  return session_num

#### MNB

In [None]:
# For MNB
HP_MODEL_MNB_ALPHA = hp.HParam('mnb_alpha', hp.IntInterval(1, 3)) # Best after hypertuning => (1, 3)
HP_MODEL_MNB_PRIOR = hp.HParam('mnb_fit_prior', hp.Discrete([True, False])) # Best after hypertuning => Not clear

mnb_hparam_grid = {
  HP_MODEL_MNB_ALPHA: range(HP_MODEL_MNB_ALPHA.domain.min_value, HP_MODEL_MNB_ALPHA.domain.max_value),
  HP_MODEL_MNB_PRIOR: HP_MODEL_MNB_PRIOR.domain.values
}

In [None]:
def hypertune_mnb(session_num, train_vectors, y_train, test_vectors, y_test, hparams):
  for mnb_hparams in list(create_tfparam_grid(mnb_hparam_grid)):
    hparams.update(mnb_hparams)

    run_name = "run-%d" % session_num
    print('--- Starting trial: %s' % run_name)
    print({h.name: hparams[h] for h in hparams})

    # Log to tensorboard
    with tf.summary.create_file_writer(run_dir + run_name).as_default():
      # Record the values used in this trial
      hp.hparams(hparams)  
      # Get metrics
      # elapsed_time avoids shadowing the name of the standard `time` module
      accuracy, f1, auc, elapsed_time = get_mnb_metrics(
          train_vectors,
          y_train,
          test_vectors,
          y_test,
          mnb_hparams[HP_MODEL_MNB_ALPHA],
          mnb_hparams[HP_MODEL_MNB_PRIOR])
      tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1)
      tf.summary.scalar(METRIC_F1_SCORE, f1, step=1)
      tf.summary.scalar(METRIC_AUC, auc, step=1)
      tf.summary.scalar(METRIC_TIME, elapsed_time, step=1)
    session_num += 1

  return session_num

#### KNN

In [None]:
# For KNN
HP_MODEL_KNN_NEIGHBORS = hp.HParam('knn_n_neighbors', hp.IntInterval(8, 13)) # Best after hypertuning => (9, ) # Try until 20
HP_MODEL_KNN_WEIGHTS = hp.HParam('knn_weights', hp.Discrete(["uniform", "distance"])) # Best after hypertuning => Uniform

knn_hparam_grid = {
  # Note: range() excludes its stop value, so n_neighbors covers min_value..max_value - 1
  HP_MODEL_KNN_NEIGHBORS: range(HP_MODEL_KNN_NEIGHBORS.domain.min_value, HP_MODEL_KNN_NEIGHBORS.domain.max_value),
  HP_MODEL_KNN_WEIGHTS: HP_MODEL_KNN_WEIGHTS.domain.values
}

In [None]:
def hypertune_knn(session_num, train_vectors, y_train, test_vectors, y_test, hparams):
  for knn_hparams in list(create_tfparam_grid(knn_hparam_grid)):
    hparams.update(knn_hparams)

    run_name = "run-%d" % session_num
    print('--- Starting trial: %s' % run_name)
    print({h.name: hparams[h] for h in hparams})

    # Log to tensorboard
    with tf.summary.create_file_writer(run_dir + run_name).as_default():
      # Record the values used in this trial
      hp.hparams(hparams)  
      # Get metrics
      # elapsed_time avoids shadowing the name of the standard `time` module
      accuracy, f1, auc, elapsed_time = get_knn_metrics(
          train_vectors,
          y_train,
          test_vectors,
          y_test,
          knn_hparams[HP_MODEL_KNN_NEIGHBORS],
          knn_hparams[HP_MODEL_KNN_WEIGHTS])
      tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1)
      tf.summary.scalar(METRIC_F1_SCORE, f1, step=1)
      tf.summary.scalar(METRIC_AUC, auc, step=1)
      tf.summary.scalar(METRIC_TIME, elapsed_time, step=1)
    session_num += 1

  return session_num

#### Decision Tree

In [None]:
# For DT
HP_MODEL_DT_MAX_DEPTH = hp.HParam('dt_max_depth', hp.Discrete([10, 20]))
HP_MODEL_DT_CRITERION = hp.HParam('dt_criterion', hp.Discrete(["gini", "entropy"]))
HP_MODEL_DT_MAX_FEATURES = hp.HParam('dt_max_features', hp.Discrete(["sqrt", "log2"])) # Also here the default is None

dt_hparam_grid = {
    HP_MODEL_DT_MAX_DEPTH: HP_MODEL_DT_MAX_DEPTH.domain.values,
    HP_MODEL_DT_CRITERION: HP_MODEL_DT_CRITERION.domain.values,
    HP_MODEL_DT_MAX_FEATURES: HP_MODEL_DT_MAX_FEATURES.domain.values
}

In [None]:
def hypertune_dt(session_num, train_vectors, y_train, test_vectors, y_test, hparams):
  for dt_hparams in list(create_tfparam_grid(dt_hparam_grid)):
    hparams.update(dt_hparams)

    run_name = "run-%d" % session_num
    print('--- Starting trial: %s' % run_name)
    print({h.name: hparams[h] for h in hparams})

    # Log to tensorboard
    with tf.summary.create_file_writer(run_dir + run_name).as_default():
      # Record the values used in this trial
      hp.hparams(hparams)  
      # Get metrics
      # elapsed_time avoids shadowing the name of the standard `time` module
      accuracy, f1, auc, elapsed_time = get_dt_metrics(
          train_vectors,
          y_train,
          test_vectors,
          y_test,
          dt_hparams[HP_MODEL_DT_MAX_DEPTH],
          dt_hparams[HP_MODEL_DT_MAX_FEATURES],
          dt_hparams[HP_MODEL_DT_CRITERION])
      tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1)
      tf.summary.scalar(METRIC_F1_SCORE, f1, step=1)
      tf.summary.scalar(METRIC_AUC, auc, step=1)
      tf.summary.scalar(METRIC_TIME, elapsed_time, step=1)
    session_num += 1

  return session_num

#### Random Forest

In [None]:
# For RF
HP_MODEL_RF_N_TREES = hp.HParam('rf_n_trees', hp.Discrete([10, 20, 50]))
HP_MODEL_RF_MAX_DEPTH = hp.HParam('rf_max_depth', hp.Discrete([10, 20]))
HP_MODEL_RF_CRITERION = hp.HParam('rf_criterion', hp.Discrete(["gini"])) #, "entropy"]))
HP_MODEL_RF_MAX_FEATURES = hp.HParam('rf_max_features', hp.Discrete(["sqrt", "log2"]))

rf_hparam_grid = {
    HP_MODEL_RF_N_TREES: HP_MODEL_RF_N_TREES.domain.values,
    HP_MODEL_RF_MAX_DEPTH: HP_MODEL_RF_MAX_DEPTH.domain.values,
    HP_MODEL_RF_CRITERION: HP_MODEL_RF_CRITERION.domain.values,
    HP_MODEL_RF_MAX_FEATURES: HP_MODEL_RF_MAX_FEATURES.domain.values
}

In [None]:
def hypertune_rf(session_num, train_vectors, y_train, test_vectors, y_test, hparams):
  for rf_hparams in list(create_tfparam_grid(rf_hparam_grid)):
    hparams.update(rf_hparams)

    run_name = "run-%d" % session_num
    print('--- Starting trial: %s' % run_name)
    print({h.name: hparams[h] for h in hparams})

    # Log to tensorboard
    with tf.summary.create_file_writer(run_dir + run_name).as_default():
      # Record the values used in this trial
      hp.hparams(hparams)  
      # Get metrics
      # elapsed_time avoids shadowing the name of the standard `time` module
      accuracy, f1, auc, elapsed_time = get_rf_metrics(
          train_vectors,
          y_train,
          test_vectors,
          y_test,
          rf_hparams[HP_MODEL_RF_N_TREES],
          rf_hparams[HP_MODEL_RF_MAX_DEPTH],
          rf_hparams[HP_MODEL_RF_MAX_FEATURES],
          rf_hparams[HP_MODEL_RF_CRITERION])
      tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1)
      tf.summary.scalar(METRIC_F1_SCORE, f1, step=1)
      tf.summary.scalar(METRIC_AUC, auc, step=1)
      tf.summary.scalar(METRIC_TIME, elapsed_time, step=1)
    session_num += 1

  return session_num
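
The five `hypertune_*` functions above share one structure: iterate over a model grid, name the run, compute metrics, log them, and increment the session counter. A minimal sketch of a generic version follows. The names `hypertune_generic`, `get_metrics` and `log_trial` are hypothetical and not part of the notebook, and `log_trial` stands in for the TensorBoard writer so that the example runs without TensorFlow. Unlike the original functions, it merges the hyperparameters into a new dictionary instead of updating `hparams` in place.

```python
def hypertune_generic(session_num, param_grid, base_hparams, get_metrics, log_trial):
    """Hypothetical generic trial loop.

    param_grid:  list of dicts, one per model-hyperparameter combination
    get_metrics: callable taking one combination and returning a dict of metrics
    log_trial:   callable(run_name, hparams, metrics) standing in for the logger
    """
    for model_hparams in param_grid:
        # Merge data-processing and model hyperparameters for this trial
        hparams = {**base_hparams, **model_hparams}
        run_name = "run-%d" % session_num
        metrics = get_metrics(model_hparams)
        log_trial(run_name, hparams, metrics)
        session_num += 1
    return session_num


# Usage with stand-in callables (no model is trained here)
logged = []
grid = [{"alpha": 1}, {"alpha": 2}]
next_session = hypertune_generic(
    session_num=0,
    param_grid=grid,
    base_hparams={"model": "mnb"},
    get_metrics=lambda combo: {"accuracy": 0.0},  # placeholder, not a real result
    log_trial=lambda name, hparams, metrics: logged.append((name, hparams, metrics)),
)
print(next_session, [name for name, _, _ in logged])
```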

### Data Processing Hyperparameters
A custom parameter-grid creator was developed because the scikit-learn one raised an error when TensorFlow `HParam` objects were used as keys.

In [None]:
def create_tfparam_grid(hparams):
    """Return every combination of the given hyperparameter values as a list of dicts."""
    keys = list(hparams.keys())
    values = hparams.values()

    # Cartesian product of the value lists, one tuple per combination
    combinations = itertools.product(*values)

    # Pair each combination with its keys to build one dictionary per trial
    param_grid = [dict(zip(keys, combination)) for combination in combinations]

    return param_grid
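
One plausible cause of the error mentioned above is an assumption, not something the source confirms: a grid utility that sorts its keys cannot handle key objects that define no ordering. The sketch below reproduces that situation with a hypothetical stand-in class, `UnorderedKey`, instead of the real `HParam` class. It shows that the `itertools.product` approach never needs to compare keys.

```python
import itertools


class UnorderedKey:
    """Stand-in for a key object that is hashable but defines no ordering."""

    def __init__(self, name):
        self.name = name


key_a, key_b = UnorderedKey("emoji_included"), UnorderedKey("model")
grid = {key_a: [False, True], key_b: ["mnb", "svm"]}

# Sorting the (key, values) pairs requires comparing the keys, which is undefined
try:
    sorted(grid.items())
    sorting_failed = False
except TypeError:
    sorting_failed = True

# itertools.product only iterates over the value lists, so no ordering is needed
keys = list(grid.keys())
combinations = [dict(zip(keys, combo)) for combo in itertools.product(*grid.values())]

print(sorting_failed, len(combinations))
```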

In [None]:
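def filter_tfparam_grid(param_grid):
    """Drop combinations that transform emojis or emoticons to text while excluding them."""
    new_param_grid = []
    for combination in param_grid:
        emojis_conflict = combination[HP_TRANSFORM_EMOJIS_TEXT] and not combination[HP_EMOJI_INCLUDED]
        emoticons_conflict = combination[HP_TRANSFORM_EMOTICONS_TEXT] and not combination[HP_EMOTICON_INCLUDED]
        if not (emojis_conflict or emoticons_conflict):
            new_param_grid.append(combination)

    return new_param_grid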
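
The effect of the filter can be sketched on a small grid. The example below uses hypothetical string keys in place of the `HParam` objects and only the emoji pair of flags. Of the four combinations, the one that asks to transform emojis while excluding them is dropped.

```python
import itertools

# Hypothetical string keys standing in for the HParam objects
grid = {
    "emoji_included": [False, True],
    "transform_emojis_text": [False, True],
}
keys = list(grid)
combinations = [dict(zip(keys, values)) for values in itertools.product(*grid.values())]

# Keep a combination unless it transforms emojis that are excluded
kept = [
    combo for combo in combinations
    if not (combo["transform_emojis_text"] and not combo["emoji_included"])
]

for combo in kept:
    print(combo)
```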

In [None]:
example_dict = { HP_EMOJI_INCLUDED: HP_EMOJI_INCLUDED.domain.values, HP_MODEL: HP_MODEL.domain.values }
print(create_tfparam_grid(example_dict))

[{HParam(name='emoji_included', domain=Discrete([False, True]), display_name=None, description=None): False, HParam(name='model', domain=Discrete(['mnb', 'svm']), display_name=None, description=None): 'mnb'}, {HParam(name='emoji_included', domain=Discrete([False, True]), display_name=None, description=None): False, HParam(name='model', domain=Discrete(['mnb', 'svm']), display_name=None, description=None): 'svm'}, {HParam(name='emoji_included', domain=Discrete([False, True]), display_name=None, description=None): True, HParam(name='model', domain=Discrete(['mnb', 'svm']), display_name=None, description=None): 'mnb'}, {HParam(name='emoji_included', domain=Discrete([False, True]), display_name=None, description=None): True, HParam(name='model', domain=Discrete(['mnb', 'svm']), display_name=None, description=None): 'svm'}]


### Run Hyperparameter Tuning

In [None]:
def run(session_num, train_vectors, y_train, test_vectors, y_test, hparams):
  # Dispatch to the tuning function that matches the selected model
  if hparams[HP_MODEL] == "svm":
    return hypertune_svm(session_num, train_vectors, y_train, test_vectors, y_test, hparams)
  elif hparams[HP_MODEL] == "mnb":
    return hypertune_mnb(session_num, train_vectors, y_train, test_vectors, y_test, hparams)
  elif hparams[HP_MODEL] == "knn":
    return hypertune_knn(session_num, train_vectors, y_train, test_vectors, y_test, hparams)
  elif hparams[HP_MODEL] == "dt":
    return hypertune_dt(session_num, train_vectors, y_train, test_vectors, y_test, hparams)
  elif hparams[HP_MODEL] == "rf":
    return hypertune_rf(session_num, train_vectors, y_train, test_vectors, y_test, hparams)
  else:
    # Fail loudly instead of returning None, which would corrupt session_num
    raise ValueError("Unknown model: %s" % hparams[HP_MODEL])
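
As an alternative to the chain of comparisons, the dispatch could be expressed as a dictionary lookup. This is a minimal sketch under stated assumptions: `make_dispatcher` and `run_model` are hypothetical names, and the lambdas are stand-ins for the notebook's `hypertune_*` functions so that the example runs on its own.

```python
def make_dispatcher(handlers):
    """Build a run function from a mapping of model name -> tuning function."""
    def run_model(model_name, *args):
        try:
            handler = handlers[model_name]
        except KeyError:
            raise ValueError("Unknown model: %r" % model_name) from None
        return handler(*args)
    return run_model


# Stand-in handlers; the real ones would be hypertune_svm, hypertune_mnb, ...
run_model = make_dispatcher({
    "svm": lambda session_num: session_num + 1,
    "mnb": lambda session_num: session_num + 2,
})

print(run_model("svm", 0))
print(run_model("mnb", 0))
```

Adding a model then only requires a new dictionary entry, and an unknown model name raises an error instead of silently returning `None`.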

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


#### Preselection of features

In [None]:
%%time
session_num = 0
run_dir_original = '/content/drive/MyDrive/logs/'
run_dir = run_dir_original

sub_df = df.loc[:, ['text', 'sentiment_numeric', 'sentiment_binary']]

# Hold out 20% of the data as the test set
train, test = train_test_split(sub_df, test_size=0.2, random_state=1)

X_train = train['text'].values
X_test = test['text'].values

# Choosing sentiment_numeric or sentiment_binary as label for the models
for sentiment_type in HP_SENTIMENT_TYPE.domain.values[::-1]:
  y_train = train[sentiment_type].values
  y_test = test[sentiment_type].values

  # Hyperparameters about data processing
  param_grid = filter_tfparam_grid(create_tfparam_grid(hparam_grid))
  for hparams in list(param_grid):
    run_dir = run_dir_original + hparams[HP_MODEL] + "/"
    hparams.update({ HP_SENTIMENT_TYPE: sentiment_type })
    
    # Preprocess train and test set with hyperparameters
    def preprocess_text_with_combination(text):
      return preprocess_text(text,
                              remove_emojis=not hparams[HP_EMOJI_INCLUDED], 
                              remove_emoticons=not hparams[HP_EMOTICON_INCLUDED], 
                              remove_hashtags=not hparams[HP_HASHTAG_INCLUDED], 
                              remove_mentions=not hparams[HP_MENTION_INCLUDED],
                              remove_stop_words=not hparams[HP_STOPWORDS_INCLUDED],
                              transform_lowercase=hparams[HP_TRANSFORM_LOWERCASE],
                              transform_emojis_text=hparams[HP_TRANSFORM_EMOJIS_TEXT],
                              transform_emoticons_text=hparams[HP_TRANSFORM_EMOTICONS_TEXT])
      
    train_set = [preprocess_text_with_combination(text) for text in X_train]
    test_set = [preprocess_text_with_combination(text) for text in X_test]

    train_vectors, test_vectors = get_train_test_vectors(hparams[HP_VECTORIZER], train_set, test_set)
    session_num = run(session_num, train_vectors, y_train, test_vectors, y_test, hparams)

#### Model Hypertuning

In [None]:
HP_MODEL = hp.HParam('model', hp.Discrete(['svm', 'mnb', 'knn', 'dt', 'rf']))

hparam_grid = {
  HP_MODEL: HP_MODEL.domain.values,
  HP_VECTORIZER: HP_VECTORIZER.domain.values,
  HP_EMOJI_INCLUDED: HP_EMOJI_INCLUDED.domain.values,
  HP_EMOTICON_INCLUDED: HP_EMOTICON_INCLUDED.domain.values,
  HP_TRANSFORM_EMOJIS_TEXT: HP_TRANSFORM_EMOJIS_TEXT.domain.values,
  HP_TRANSFORM_EMOTICONS_TEXT: HP_TRANSFORM_EMOTICONS_TEXT.domain.values,
}

In [None]:
%%time
session_num = 0
run_dir_original = '/content/drive/MyDrive/logs/Models/'
run_dir = run_dir_original

max_iter = 3 # Number of cross-validation folds
sub_df = df.loc[:, ['text', 'sentiment_binary']]

# Stratified K-fold with number of splits = max_iter
skf = StratifiedKFold(n_splits = max_iter, random_state = 7, shuffle = True)

for train_index, val_index in skf.split(np.zeros(sub_df.shape[0]), sub_df['sentiment_binary']):
  X_train = sub_df.iloc[train_index]['text'].values
  X_test = sub_df.iloc[val_index]['text'].values
  y_train = sub_df.iloc[train_index]['sentiment_binary'].values
  y_test = sub_df.iloc[val_index]['sentiment_binary'].values

  # Hyperparameters about data processing
  param_grid = filter_tfparam_grid(create_tfparam_grid(hparam_grid))
  for hparams in list(param_grid):    
    # Preprocess train and test set with hyperparameters
    def preprocess_text_with_combination(text):
      return preprocess_text(text,
                              remove_emojis=not hparams[HP_EMOJI_INCLUDED], 
                              remove_emoticons=not hparams[HP_EMOTICON_INCLUDED],
                              transform_emojis_text=hparams[HP_TRANSFORM_EMOJIS_TEXT],
                              transform_emoticons_text=hparams[HP_TRANSFORM_EMOTICONS_TEXT])
      
    train_set = [preprocess_text_with_combination(text) for text in X_train]
    test_set = [preprocess_text_with_combination(text) for text in X_test]

    train_vectors, test_vectors = get_train_test_vectors(hparams[HP_VECTORIZER], train_set, test_set)
    session_num = run(session_num, train_vectors, y_train, test_vectors, y_test, hparams)
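
The loop above relies on stratified folds, meaning that each validation fold keeps roughly the same class proportions as the full dataset. The sketch below illustrates that idea in plain Python. It is a simplified illustration and not scikit-learn's algorithm: the function `stratified_folds` is hypothetical, it does no shuffling, and the labels are toy data.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Simplified illustration: deal the indices of each class round-robin into folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# Toy labels: six negative (0) and three positive (1) samples
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]
folds = stratified_folds(labels, n_splits=3)

for fold in folds:
    validation_labels = [labels[i] for i in fold]
    print(sorted(fold), validation_labels.count(0), validation_labels.count(1))
```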

## Results

In [None]:
%tensorboard --logdir /content/drive/MyDrive/logs/Models