# How do you feel my dear? (P8)

## Andrea Pio Cutrera - 965591
### Università degli Studi di Milano - _Data Science and Economics_

**Text Mining and Sentiment Analysis**

Recently, **emotion detection in text** has received attention in the literature on **sentiment analysis**. Detecting emotions is important for studying human communication in different domains, including fictional scripts for TV series and movies. This project studies the emotional profile of fictional scripts from several movies and TV series. In particular, the project covers the following points:

1. Create a **model to predict emotions in text**, using available datasets such as EmoBank, WASSA-2017, or Emotion Detection from Text as training sets (see below);
2. Emotions may be represented either as **categorical classes** or in a continuous space such as Valence-Arousal-Dominance (see for example Warriner, A. B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior research methods, 45(4), 1191-1207.)
3. Exploit the model to **study an emotional profile** of the **main characters** in **one of the movies** included in the Cornell Movie--Dialogs Corpus;
4. Study how this **emotional profile changes over time** as the movie's story unfolds, and how it is affected by the relations among the different characters.
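As a rough illustration of points 3 and 4: once a model yields one probability vector per utterance, a character's emotional profile can be summarized by averaging those vectors, and its evolution by averaging within consecutive chunks of the script. The sketch below uses invented probabilities and hypothetical names (`utterances`, `mean_profile`, `profile_over_time`); it is one possible approach under those assumptions, not necessarily the one used later in the project.

```python
EMOTIONS = ["anger", "fear", "joy", "sadness"]

# (character, position in the script, predicted probabilities): invented values
utterances = [
    ("ALICE", 0, [0.10, 0.10, 0.70, 0.10]),
    ("BOB",   1, [0.60, 0.20, 0.10, 0.10]),
    ("ALICE", 2, [0.10, 0.20, 0.50, 0.20]),
    ("ALICE", 3, [0.20, 0.50, 0.10, 0.20]),
    ("BOB",   4, [0.20, 0.20, 0.20, 0.40]),
    ("ALICE", 5, [0.10, 0.70, 0.10, 0.10]),
]

def mean_profile(vectors):
    """Average a list of probability vectors component by component."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(EMOTIONS))]

def profile_over_time(utterances, character, n_chunks):
    """Split a character's utterances into consecutive chunks and average each chunk."""
    ordered = sorted(utterances, key=lambda u: u[1])
    vectors = [probs for who, _, probs in ordered if who == character]
    size = -(-len(vectors) // n_chunks)  # ceiling division
    return [mean_profile(vectors[i:i + size]) for i in range(0, len(vectors), size)]

def dominant(profile):
    """Name of the emotion with the highest average probability."""
    return EMOTIONS[profile.index(max(profile))]

overall = mean_profile([probs for who, _, probs in utterances if who == "ALICE"])
timeline = [dominant(p) for p in profile_over_time(utterances, "ALICE", 2)]
print(dominant(overall), timeline)
```

In this toy script the character moves from joy in the first half to fear in the second half, which is the kind of change point 4 asks about.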

### Import all the libraries we need

In [2]:
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#!pip install neattext
import neattext.functions as nfx

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

import demoji
demoji.download_codes()  # needed by older demoji releases; recent versions ship the emoji data and deprecate this call

import nltk
# nltk.download('wordnet')
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

from tensorflow.keras.models import Sequential, model_from_json
from tensorflow.keras.layers import Dense, LSTM, Embedding, Bidirectional, Dropout, Input, Conv1D, MaxPooling1D, Flatten, GlobalMaxPooling1D
from kerastuner.tuners import RandomSearch, Hyperband
from kerastuner.engine.hyperparameters import HyperParameters
from keras.models import load_model
import tensorflow as tf

from tensorflow.compat.v1.keras.layers import CuDNNLSTM

import warnings 
warnings.filterwarnings("ignore")

In [3]:
working_directory = !pwd                 # get present working directory
working_directory = working_directory[0] # get the full string 
working_directory

'/Users/andreacutrera/Desktop'

### Load the model trained on the WASSA-2017 Shared Task on Emotion Intensity (EmoInt) dataset
Link: http://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html

In [4]:
# load model
model = load_model('how_do_you_feel_my_dear/final_model/model.h5')
# summarize model
model.summary()

2022-04-22 18:16:53.922072: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 50, 200)           2180000   
                                                                 
 dropout_27 (Dropout)        (None, 50, 200)           0         
                                                                 
 bidirectional_27 (Bidirecti  (None, 50, 100)          100400    
 onal)                                                           
                                                                 
 dropout_28 (Dropout)        (None, 50, 100)           0         
                                                                 
 bidirectional_28 (Bidirecti  (None, 50, 300)          301200    
 onal)                                                           
                                                                 
 dropout_29 (Dropout)        (None, 50, 300)          
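The parameter counts in the visible part of the summary can be checked by hand. The sketch below assumes a vocabulary of 10,900 entries, and 50 and 150 LSTM units per direction; these values are inferred from the output shapes and parameter counts, not read from the training code. It uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights and a bias).

```python
def embedding_params(vocab_size, dim):
    # one dense vector of size `dim` per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and one bias vector
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim, units):
    # a forward and a backward LSTM with identical shapes
    return 2 * lstm_params(input_dim, units)

print(embedding_params(10900, 200))         # embedding layer
print(bidirectional_lstm_params(200, 50))   # first bidirectional layer
print(bidirectional_lstm_params(100, 150))  # second bidirectional layer
```

The three results match the `Param #` column for `embedding`, `bidirectional_27` and `bidirectional_28`.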

In [7]:
# load in variables the data just downloaded
DIR = os.getcwd() + "/how_do_you_feel_my_dear/emotions"

joy_test = pd.read_csv(os.path.join(DIR, "joy_test"), sep="\t", header=None)
sadness_test = pd.read_csv(os.path.join(DIR, "sadness_test"), sep="\t", header=None)
fear_test = pd.read_csv(os.path.join(DIR, "fear_test"), sep="\t", header=None)
anger_test = pd.read_csv(os.path.join(DIR, "anger_test"), sep="\t", header=None)

In [8]:
# load the train and validation files, and reload the test files so that this cell is self-contained

joy_train = pd.read_csv(os.path.join(DIR, "joy_train"), sep="\t", header=None)
sadness_train = pd.read_csv(os.path.join(DIR, "sadness_train"), sep="\t", header=None)
fear_train = pd.read_csv(os.path.join(DIR, "fear_train"), sep="\t", header=None)
anger_train = pd.read_csv(os.path.join(DIR, "anger_train"), sep="\t", header=None)

joy_val = pd.read_csv(os.path.join(DIR, "joy_val"), sep="\t", header=None)
sadness_val = pd.read_csv(os.path.join(DIR, "sadness_val"), sep="\t", header=None)
fear_val = pd.read_csv(os.path.join(DIR, "fear_val"), sep="\t", header=None)
anger_val = pd.read_csv(os.path.join(DIR, "anger_val"), sep="\t", header=None)

joy_test = pd.read_csv(os.path.join(DIR, "joy_test"), sep="\t", header=None)
sadness_test = pd.read_csv(os.path.join(DIR, "sadness_test"), sep="\t", header=None)
fear_test = pd.read_csv(os.path.join(DIR, "fear_test"), sep="\t", header=None)
anger_test = pd.read_csv(os.path.join(DIR, "anger_test"), sep="\t", header=None)

# rename columns all in the same way to get homogeneous datasets which could be concatenated

COLUMNS = ['id', 'text', 'sentiment', 'j', 's', 'f', 'a']

def standardize(df, intensity_col):
    # name the four raw columns, storing the intensity under this emotion's letter
    df = df.rename(columns={0: 'id', 1: 'text', 2: 'sentiment', 3: intensity_col})
    # the intensity columns of the other three emotions are set to 0
    for col in ['j', 's', 'f', 'a']:
        if col != intensity_col:
            df[col] = 0
    return df[COLUMNS]

joy_train, joy_val, joy_test = [standardize(df, 'j') for df in (joy_train, joy_val, joy_test)]
sadness_train, sadness_val, sadness_test = [standardize(df, 's') for df in (sadness_train, sadness_val, sadness_test)]
anger_train, anger_val, anger_test = [standardize(df, 'a') for df in (anger_train, anger_val, anger_test)]
fear_train, fear_val, fear_test = [standardize(df, 'f') for df in (fear_train, fear_val, fear_test)]

# concatenate all the files into a single dataset; it is split again into train, validation and test further down

data = pd.concat([joy_train,
                  sadness_train,
                  fear_train,
                  anger_train,
                  joy_test,
                  sadness_test,
                  fear_test,
                  anger_test,
                  joy_val,
                  sadness_val,
                  fear_val,
                  anger_val])

In [9]:
# clean text functions
# https://github.com/Jcharis/neattext/blob/master/neattext/functions/functions.py

def clean_emoji_output(text):
    return re.sub(":", " ", text)

def strip_lowercase(text):
    return text.strip().lower()

# tokenize
tt = TweetTokenizer()

# lemmatize
def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w) for w in text]

# function which cleans texts
def clean_text(data):
    data['clean_text'] = data['text'].apply(nfx.remove_emails)
    data['clean_text'] = data['clean_text'].apply(nfx.remove_numbers)
    data['clean_text'] = data['clean_text'].apply(nfx.remove_urls)
    data['clean_text'] = data['clean_text'].apply(nfx.remove_userhandles)
    data['clean_text'] = data['clean_text'].apply(demoji.replace_with_desc)
    data['clean_text'] = data['clean_text'].apply(clean_emoji_output)
    data['clean_text'] = data['clean_text'].apply(nfx.remove_special_characters)
    data['clean_text'] = data['clean_text'].apply(nfx.remove_bad_quotes)
    data['clean_text'] = data['clean_text'].apply(nfx.remove_html_tags)
    data['clean_text'] = data['clean_text'].apply(nfx.remove_punctuations)
    data['clean_text'] = data['clean_text'].apply(nfx.remove_stopwords)
    data['clean_text'] = data['clean_text'].apply(nfx.remove_multiple_spaces)
    data['clean_text'] = data['clean_text'].apply(strip_lowercase)
    
    data['tokenize'] = data.clean_text.str.lower().apply(tt.tokenize)
    data['tokenize_lemmatized'] = data['tokenize'].apply(lemmatize_text)
    
    # detokenize
    data['final_text'] = data.tokenize_lemmatized.apply(TreebankWordDetokenizer().detokenize)
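To make the effect of the cleaning chain more concrete, here is a minimal standard-library sketch of the same idea on a single string. The regular expressions and the tiny stop-word list are rough stand-ins chosen for illustration; they are not the implementation used by `neattext`, and the sketch skips emoji conversion and lemmatization.

```python
import re

STOPWORDS = {"so", "the", "a", "is"}  # tiny illustrative list, not neattext's

def rough_clean(text):
    text = re.sub(r"\S+@\S+", " ", text)       # e-mail addresses
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)          # user handles
    text = re.sub(r"\d+", " ", text)           # numbers
    text = re.sub(r"[^A-Za-z\s]", "", text)    # punctuation and other symbols
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return " ".join(words)                     # also collapses repeated spaces

cleaned = rough_clean("@someone I'm SO happy today!! 100% :) http://example.com")
print(cleaned)
```

As in the notebook's own output, the apostrophe is dropped rather than expanded, so "I'm" becomes "im".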

In [10]:
# clean and shuffle
clean_text(data)
data = shuffle(data, random_state=42)

In [11]:
data.head()

| | id | text | sentiment | j | s | f | a | clean_text | tokenize | tokenize_lemmatized | final_text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 11041 | Nothing is more relentless than a dog begging ... | anger | 0.0 | 0.0 | 0.0 | 0.479 | relentless dog begging food | [relentless, dog, begging, food] | [relentless, dog, begging, food] | relentless dog begging food |
| 120 | 40980 | Ok it really just sunk in that I'm seeing the ... | sadness | 0.0 | 0.417 | 0.0 | 0.0 | ok sunk im seeing goat hours wow face rolling ... | [ok, sunk, im, seeing, goat, hours, wow, face,... | [ok, sunk, im, seeing, goat, hour, wow, face, ... | ok sunk im seeing goat hour wow face rolling e... |
| 350 | 41210 | @MariamVeiszadeh #depressing it's so freaking ... | sadness | 0.0 | 0.679 | 0.0 | 0.0 | depressing freaking close | [depressing, freaking, close] | [depressing, freaking, close] | depressing freaking close |
| 505 | 41365 | I feel like a burden every day that I waste bu... | sadness | 0.0 | 0.896 | 0.0 | 0.0 | feel like burden day waste dont know bc discou... | [feel, like, burden, day, waste, dont, know, b... | [feel, like, burden, day, waste, dont, know, b... | feel like burden day waste dont know bc discou... |
| 183 | 31085 | @partydelightsUK it's 5679787. Cannot DM you a... | joy | 0.164 | 0.0 | 0.0 | 0.0 | dm dont follow delight party fail letdown | [dm, dont, follow, delight, party, fail, letdown] | [dm, dont, follow, delight, party, fail, letdown] | dm dont follow delight party fail letdown |


In [12]:
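# the provided train/validation/test files were merged above, so the data is split again here,
# stratified by emotion: 13% for test, then 14% of the remainder for validation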

X_train, X_test, y_train, y_test = train_test_split(data.final_text,
                                                    data.sentiment,
                                                    test_size = 0.13,
                                                    random_state = 42,
                                                    stratify = data.sentiment)

X_train, X_val, y_train, y_val = train_test_split(X_train,
                                                  y_train,
                                                  test_size = 0.14,
                                                  random_state = 42,
                                                  stratify = y_train)
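Because the second split is applied to what remains after the first, the effective share of each subset is not simply 0.13 and 0.14. A short calculation, using only the two `test_size` values from the cell above, gives the approximate proportions (the exact row counts depend on how scikit-learn rounds):

```python
test_fraction = 0.13          # first split: share of the whole dataset
val_fraction_of_rest = 0.14   # second split: share of the remaining 87%

val_fraction = (1 - test_fraction) * val_fraction_of_rest
train_fraction = 1 - test_fraction - val_fraction

print(f"train={train_fraction:.4f} val={val_fraction:.4f} test={test_fraction:.4f}")
```

So roughly 75% of the rows end up in training, 12% in validation and 13% in test.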

In [13]:
# numerical encoding of our labels

label_encoder = LabelEncoder()

# fit the encoder on the training labels only, then reuse the same mapping for the other splits
y_train_enc = label_encoder.fit_transform(y_train)
y_test_enc = label_encoder.transform(y_test)
y_val_enc = label_encoder.transform(y_val)

# one-hot-encode them

y_train = to_categorical(y_train_enc)
y_val = to_categorical(y_val_enc)
y_test = to_categorical(y_test_enc)
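The label-to-index mapping only stays consistent across splits if it is learned once. The pure-Python sketch below mimics what `LabelEncoder` and `to_categorical` do (classes stored in sorted order, then one-hot rows); the helper names `fit_classes`, `transform` and `one_hot` are hypothetical stand-ins, and `subset` is an invented split in which two classes happen to be missing.

```python
def fit_classes(labels):
    # like LabelEncoder, keep the distinct classes in sorted order
    return sorted(set(labels))

def transform(labels, classes):
    index = {c: i for i, c in enumerate(classes)}
    return [index[label] for label in labels]

def one_hot(indices, n_classes):
    return [[1.0 if i == k else 0.0 for k in range(n_classes)] for i in indices]

train = ["sadness", "joy", "anger", "fear", "joy"]
subset = ["sadness", "joy", "joy"]  # hypothetical split with two classes missing

classes = fit_classes(train)
consistent = transform(subset, classes)            # mapping learned on the training labels
refitted = transform(subset, fit_classes(subset))  # mapping learned on the subset itself

print(classes)
print(consistent, refitted)
print(one_hot(consistent, len(classes)))
```

With stratified splits every class appears in every subset, so refitting happens to give the same indices here; the sketch shows why that should not be relied on.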

In [14]:
# here we use the Tokenizer of keras, fitting it on all our training and validation data, 
# getting the sequences and the words with the corresponding index (unique words)

num_words = 10000

tokenizer = Tokenizer(num_words=num_words)  # `nb_words` is the deprecated name of this argument
tokenizer.fit_on_texts(pd.concat([X_train, X_val]))

sequences = tokenizer.texts_to_sequences(pd.concat([X_train, X_val]))
word_index = tokenizer.word_index

print('Found %s unique tokens.' % len(word_index))
padded = pad_sequences(sequences, maxlen=50)  # not used below; a separate name avoids overwriting the DataFrame `data`

Found 10899 unique tokens.
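The count printed above is larger than `num_words` because `word_index` always contains every token seen during fitting; the limit is applied only when texts are converted to sequences, where Keras keeps the `num_words - 1` most frequent words. The following pure-Python sketch imitates that behavior on a toy corpus; `fit_word_index` and `texts_to_sequences` are simplified stand-ins written for this example, not the Keras implementation.

```python
from collections import Counter

def fit_word_index(texts):
    # rank words by frequency; the most frequent word gets index 1 (0 is reserved for padding)
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    # words whose index is >= num_words are silently dropped
    return [
        [word_index[w] for w in text.split() if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

corpus = ["good good good good", "good day day day", "day sad sad", "rain"]
word_index = fit_word_index(corpus)
sequences = texts_to_sequences(["good day sad rain"], word_index, num_words=3)

print(len(word_index), sequences)
```

All four words are in `word_index`, but with `num_words=3` only the two most frequent ones survive in the sequence.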


In [15]:
import pickle

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)

In [18]:
import io
import json
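# to_json() already returns a JSON string; json.dumps wraps it once more as a JSON string literal,
# so reading the file back with json.load yields the original string again
tokenizer_json = tokenizer.to_json()
with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))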

In [20]:
from keras_preprocessing.text import tokenizer_from_json
with open('tokenizer.json') as f:
    tokenizer_json = json.load(f)  # a JSON string, which is what tokenizer_from_json expects

tokenizer = tokenizer_from_json(tokenizer_json)
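The save and load cells work together because the tokenizer configuration is encoded twice on the way out and decoded once on the way in. The sketch below reproduces that round trip with the standard library only; `config_json` is an invented stand-in for the string returned by `tokenizer.to_json()`.

```python
import json

# invented stand-in for the string returned by tokenizer.to_json()
config_json = json.dumps({"num_words": 10000, "word_index": {"im": 2}})

on_disk = json.dumps(config_json, ensure_ascii=False)  # what the saving cell writes to the file
restored = json.loads(on_disk)                         # what json.load returns when reading it back
config = json.loads(restored)                          # a second decode gives the actual dictionary

print(type(restored).__name__, config["num_words"])
```

A single decode therefore returns a string, which is the input type that `tokenizer_from_json` takes.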

In [21]:
# Convert texts into some numeric sequences and make the length of all numeric sequences equal 

X_train_seq = tokenizer.texts_to_sequences(X_train) 
X_train_pad = pad_sequences(X_train_seq, maxlen = 50, padding = 'post') 

X_test_seq = tokenizer.texts_to_sequences(X_test)
X_test_pad = pad_sequences(X_test_seq, maxlen = 50, padding = 'post')

X_val_seq = tokenizer.texts_to_sequences(X_val)
X_val_pad = pad_sequences(X_val_seq, maxlen = 50, padding = 'post')

X_train_pad = np.array(X_train_pad)
X_test_pad = np.array(X_test_pad)
X_val_pad = np.array(X_val_pad)
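For intuition, the padding step can be imitated in a few lines of plain Python. The sketch assumes the settings used above (`padding='post'`) together with the Keras default `truncating='pre'`; `pad_post` is a hypothetical helper, not the library function.

```python
def pad_post(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # truncating='pre': keep only the last `maxlen` ids
        padded.append(seq + [value] * (maxlen - len(seq)))  # padding='post': zeros at the end
    return padded

result = pad_post([[27, 445, 352], [1, 2, 3, 4, 5, 6, 7]], maxlen=5)
print(result)
```

Short sequences are filled with zeros on the right, while sequences longer than `maxlen` lose their first tokens.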

In [22]:
predictions = model.predict(X_test_pad)

In [23]:
y_test[:3]

array([[0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.]], dtype=float32)

In [24]:
predictions[:3]

array([[5.8564013e-03, 3.6703704e-03, 1.9867076e-03, 9.8848653e-01],
       [1.2811157e-03, 5.0674111e-04, 6.3612720e-04, 9.9757606e-01],
       [7.6428088e-03, 6.5521821e-02, 9.2463005e-01, 2.2053479e-03]],
      dtype=float32)
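The two arrays above can be compared by taking the position of the largest value in each row. The snippet copies the three rows shown in the outputs and does the comparison in plain Python; the `argmax` helper is written here only for the example.

```python
y_true = [
    [0., 0., 0., 1.],
    [0., 0., 0., 1.],
    [0., 1., 0., 0.],
]
y_prob = [
    [5.8564013e-03, 3.6703704e-03, 1.9867076e-03, 9.8848653e-01],
    [1.2811157e-03, 5.0674111e-04, 6.3612720e-04, 9.9757606e-01],
    [7.6428088e-03, 6.5521821e-02, 9.2463005e-01, 2.2053479e-03],
]

def argmax(row):
    # index of the largest value in the row
    return max(range(len(row)), key=row.__getitem__)

true_idx = [argmax(row) for row in y_true]
pred_idx = [argmax(row) for row in y_prob]
correct = sum(t == p for t, p in zip(true_idx, pred_idx))

print(true_idx, pred_idx, f"{correct}/{len(true_idx)} correct")
```

The first two examples are classified correctly with high confidence; in the third, the true class is index 1 while the model assigns about 0.92 to index 2.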

In [25]:
# check external validity on the test set
score = model.evaluate(X_test_pad, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

Test loss: 0.46348777413368225
Test accuracy: 0.8409090638160706
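Accuracy is a single number; a confusion matrix shows which emotions get mixed up with which. The sketch below builds one in plain Python from invented class indices (not the notebook's predictions), with rows as true classes and columns as predicted classes, the same convention as `sklearn.metrics.confusion_matrix`.

```python
def confusion_matrix(true_idx, pred_idx, n_classes):
    # matrix[t][p] counts examples of true class t predicted as class p
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_idx, pred_idx):
        matrix[t][p] += 1
    return matrix

# invented labels for illustration (0: anger, 1: fear, 2: joy, 3: sadness)
true_idx = [3, 3, 1, 0, 2, 2]
pred_idx = [3, 3, 2, 0, 2, 1]

matrix = confusion_matrix(true_idx, pred_idx, n_classes=4)
accuracy = sum(matrix[i][i] for i in range(4)) / len(true_idx)

for row in matrix:
    print(row)
print(round(accuracy, 3))
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the number of examples.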


In [26]:
strings = ["Today something went wrong with my exams, I'm so depressed",
           "What a beautiful sunny day, I'm so excited"]
new_texts = pd.DataFrame({"text": strings})

new_texts

| | text |
|---|---|
| 0 | Today something went wrong with my exams, I'm ... |
| 1 | What a beautiful sunny day, I'm so excited |


In [27]:
clean_text(new_texts)
new_texts

| | text | clean_text | tokenize | tokenize_lemmatized | final_text |
|---|---|---|---|---|---|
| 0 | Today something went wrong with my exams, I'm ... | today went wrong exams im depressed | [today, went, wrong, exams, im, depressed] | [today, went, wrong, exam, im, depressed] | today went wrong exam im depressed |
| 1 | What a beautiful sunny day, I'm so excited | beautiful sunny day im excited | [beautiful, sunny, day, im, excited] | [beautiful, sunny, day, im, excited] | beautiful sunny day im excited |


In [28]:
# clean_text already created these three columns; recomputing them here gives the same result
new_texts['tokenize'] = new_texts.clean_text.str.lower().apply(tt.tokenize)
new_texts['tokenize_lemmatized'] = new_texts['tokenize'].apply(lemmatize_text)
new_texts['final_text'] = new_texts.tokenize_lemmatized.apply(TreebankWordDetokenizer().detokenize)
new_texts

| | text | clean_text | tokenize | tokenize_lemmatized | final_text |
|---|---|---|---|---|---|
| 0 | Today something went wrong with my exams, I'm ... | today went wrong exams im depressed | [today, went, wrong, exams, im, depressed] | [today, went, wrong, exam, im, depressed] | today went wrong exam im depressed |
| 1 | What a beautiful sunny day, I'm so excited | beautiful sunny day im excited | [beautiful, sunny, day, im, excited] | [beautiful, sunny, day, im, excited] | beautiful sunny day im excited |


In [29]:
X_seq = tokenizer.texts_to_sequences(new_texts.final_text)
X_pad = pad_sequences(X_seq, maxlen = 50, padding = 'post')

X_pad = np.array(X_pad)
X_pad

array([[  27,  445,  352, 2850,    2, 1257,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0],
       [ 244, 2968,    6,    2,  292,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0]], dtype=int32)

In [30]:
np.argmax(model.predict(X_pad), axis=1)

array([3, 2])

In [247]:
# alphabetical order
# 0: anger
# 1: fear
# 2: joy
# 3: sadness
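With this alphabetical mapping, the predicted indices can be turned back into emotion names. The snippet below does it with a plain list; in the notebook itself, `label_encoder.inverse_transform` should give the same result, assuming the encoder was fitted on these four labels.

```python
EMOTIONS = ["anger", "fear", "joy", "sadness"]  # alphabetical order, as listed above

predicted = [3, 2]  # the indices returned for the two example sentences
labels = [EMOTIONS[i] for i in predicted]

print(labels)
```

The first sentence (about the exams) is labeled sadness and the second (about the sunny day) joy.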