# Workflow:
#### 1. Import Data
#### 2. Prepare the input data
#### 3. Import pre-trained W2V
#### 4. Create Neural Network Pipeline
#### 5. Train The Model
#### 6. Evaluate results

<br>
____________________________________________________________________________________________________________________________

### 1. Import Data

In [1]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
train_path = r"https://raw.githubusercontent.com/lukasgarbas/nlp-text-emotion/master/data/data_train.csv"
test_path = r"https://raw.githubusercontent.com/lukasgarbas/nlp-text-emotion/master/data/data_test.csv"

In [3]:
data_train = pd.read_csv(train_path, encoding='utf-8')
data_test = pd.read_csv(test_path, encoding='utf-8')

#### Checking Data

In [4]:
data_train.head(2)

| | Emotion | Text |
|---|---|---|
| 0 | neutral | There are tons of other paintings that I thin... |
| 1 | sadness | Yet the dog had grown old and less capable , a... |


In [5]:
data_test.head(2)

| | Emotion | Text |
|---|---|---|
| 0 | sadness | I experienced this emotion when my grandfather... |
| 1 | neutral | when I first moved in , I walked everywhere .... |


#### Checking Null Values

In [6]:
data_train.isna().sum()

Emotion    0
Text       0
dtype: int64

In [7]:
data_test.isna().sum()

Emotion    0
Text       0
dtype: int64

#### Value Counts

In [8]:
data_train.Emotion.value_counts()

sadness    1641
joy        1619
neutral    1616
anger      1566
fear       1492
Name: Emotion, dtype: int64

In [9]:
data_test.Emotion.value_counts()

joy        707
anger      693
fear       679
sadness    676
neutral    638
Name: Emotion, dtype: int64

#### Train and Test

In [10]:
X_train = data_train.Text
X_test = data_test.Text

y_train = data_train.Emotion
y_test = data_test.Emotion

#### Merging Train and Test data

In [11]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
data = pd.concat([data_train, data_test], ignore_index=True)
data.head()

| | Emotion | Text |
|---|---|---|
| 0 | neutral | There are tons of other paintings that I thin... |
| 1 | sadness | Yet the dog had grown old and less capable , a... |
| 2 | fear | When I get into the tube or the train without ... |
| 3 | fear | This last may be a source of considerable disq... |
| 4 | anger | She disliked the intimacy he showed towards so... |


#### Variable Initialization

In [12]:
# Number of Labels: joy, anger, fear, sadness, neutral
num_classes = 5

# Number of dimensions for word embedding
embed_num_dims = 300

# Max input length (max num of words)
max_seq_len = 800

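# Order matches the integer label encoding used below (joy=0, fear=1, anger=2, sadness=3, neutral=4)
class_names = ['joy', 'fear', 'anger', 'sadness', 'neutral']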

<br>
____________________________________________________________________________________________________________________________

### 2. Prepare Input Data

- Tokenize the texts and count the unique tokens
- Padding: every input (sentence or text) must have the same length
- Labels must be converted to integers and then one-hot encoded

In [13]:
from nltk.tokenize import word_tokenize
def clean_data(data):
    
    # Remove hashtags (#...) and @mentions
    data = re.sub(r"(#[\d\w\.]+)", '', data)
    data = re.sub(r"(@[\d\w\.]+)", '', data)
    
    # Tokenization using nltk
    data = word_tokenize(data)
    
    return data

In [36]:
v = 'This is an apple'
k = word_tokenize(v)
print(k)

['This', 'is', 'an', 'apple']
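
To see what the two `re.sub` calls in `clean_data` remove, here is a minimal sketch that uses only the standard library. The helper names `strip_tags` and `clean_text` are hypothetical, and `str.split()` is only a stand-in for `word_tokenize` (an assumption made so the example runs without nltk).

```python
import re

def strip_tags(text):
    """Remove hashtags and @mentions with the same patterns as clean_data."""
    text = re.sub(r"(#[\d\w\.]+)", "", text)
    text = re.sub(r"(@[\d\w\.]+)", "", text)
    return text

def clean_text(text):
    # str.split() is a stand-in for nltk's word_tokenize (assumption for this sketch);
    # unlike word_tokenize it does not separate punctuation from words.
    return " ".join(strip_tags(text).split())

sample = "Loving this weather #sunny @john.doe see you soon"
cleaned = clean_text(sample)
print(cleaned)
```

Because the character class includes `.`, a period that directly follows a hashtag or mention is removed together with it.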


In [14]:
texts = [' '.join(clean_data(text)) for text in data.Text]

texts_train = [' '.join(clean_data(text)) for text in X_train]
texts_test = [' '.join(clean_data(text)) for text in X_test]

In [38]:
texts

['There are tons of other paintings that I think are better .',
 'Yet the dog had grown old and less capable , and one day the gillie had come and explained with great sorrow that the dog had suffered a stroke , and must be put down .',
 'When I get into the tube or the train without paying for the ticket .',
 'This last may be a source of considerable disquiet and one might not at first see how such obviously ` immoral `` content could be defended as part of a system of morality .',
 'She disliked the intimacy he showed towards some of them , was resentful of the memories they shared of which she was not a part , and felt excluded .',
 "When my family heard that my Mother 's cousin who lives in England wrote us to tell that he had cancer of the lymph glands .",
 "Finding out I am chosen to collect norms for Chinese aphasia ( I will contribute to China 's catching up with the West in neuropsychology ) .",
 'A spokesperson said : ` Glen is furious that the new ` Anarchy `` promo feature

In [15]:
print(texts_train[100])

Playing NOW on Hardest : BYZPO Radio Show Session Tune in , listen and enjoy .


#### Tokenization + fitting using keras

In [16]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

tok = Tokenizer()
tok.fit_on_texts(texts)

seq_train = tok.texts_to_sequences(texts_train)
seq_test = tok.texts_to_sequences(texts_test)

index_of_words = tok.word_index

# Vocab size is the number of unique words + 1 for the reserved padding index 0
voc_size = len(index_of_words)+1

print(f"Number of unique words:{len(index_of_words)}")

Number of unique words:12087
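
The word index can be pictured as a frequency-ranked dictionary that starts at 1, leaving 0 free for padding. The following is a simplified sketch of that idea in plain Python, not the Keras implementation: the real `Tokenizer` also filters punctuation by default, and the toy sentences are made up for illustration.

```python
from collections import Counter

toy_texts = ["i felt happy", "i felt sad today"]

# Count every lower-cased, whitespace-separated token in the toy corpus
counts = Counter(word for text in toy_texts for word in text.lower().split())

# Most frequent word gets index 1; index 0 stays reserved for padding
toy_word_index = {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

# Vocabulary size = unique words + the reserved padding index
toy_voc_size = len(toy_word_index) + 1

# Replace each word by its integer index
toy_sequences = [[toy_word_index[w] for w in text.lower().split()] for text in toy_texts]

print(toy_word_index)
print(toy_voc_size)
print(toy_sequences)
```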


In [39]:
index_of_words

{'i': 1,
 'the': 2,
 'a': 3,
 'to': 4,
 'and': 5,
 'was': 6,
 'my': 7,
 'of': 8,
 'in': 9,
 'when': 10,
 'that': 11,
 'it': 12,
 'me': 13,
 'had': 14,
 'you': 15,
 'for': 16,
 'at': 17,
 'with': 18,
 'not': 19,
 'he': 20,
 'on': 21,
 "'s": 22,
 'is': 23,
 "n't": 24,
 'we': 25,
 'very': 26,
 'she': 27,
 'but': 28,
 'do': 29,
 'her': 30,
 'have': 31,
 'this': 32,
 'about': 33,
 '’': 34,
 'so': 35,
 'as': 36,
 'be': 37,
 'his': 38,
 'did': 39,
 'an': 40,
 'friend': 41,
 'from': 42,
 'what': 43,
 'time': 44,
 'one': 45,
 'by': 46,
 'were': 47,
 'they': 48,
 'out': 49,
 'felt': 50,
 'are': 51,
 'all': 52,
 "'m": 53,
 'up': 54,
 'after': 55,
 'been': 56,
 'there': 57,
 'would': 58,
 'him': 59,
 'no': 60,
 'got': 61,
 'who': 62,
 'could': 63,
 'just': 64,
 'like': 65,
 'because': 66,
 'home': 67,
 'go': 68,
 'some': 69,
 'see': 70,
 'know': 71,
 'our': 72,
 'can': 73,
 'good': 74,
 'day': 75,
 'get': 76,
 'first': 77,
 'how': 78,
 'your': 79,
 'which': 80,
 'am': 81,
 'night': 82,
 'really': 

#### Padding: each input has the same length 

In [17]:
X_train_pad = pad_sequences(seq_train, maxlen=max_seq_len)
X_test_pad = pad_sequences(seq_test, maxlen=max_seq_len)
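
With its default settings, `pad_sequences` pads on the left and, for sequences longer than `maxlen`, drops tokens from the front. A rough NumPy sketch of that behaviour follows; `pre_pad` is a hypothetical helper written for illustration, not part of Keras.

```python
import numpy as np

def pre_pad(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` tokens of each sequence."""
    padded = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]  # keep the last maxlen tokens
        if trimmed:
            padded[row, -len(trimmed):] = trimmed
    return padded

toy_padded = pre_pad([[1, 2, 3], [4, 5, 6, 7, 8]], maxlen=4)
print(toy_padded)
```

This explains why the padded arrays in this notebook start with long runs of zeros: most texts are far shorter than `max_seq_len`.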

In [18]:
X_train_pad

array([[    0,     0,     0, ...,   119,    51,   345],
       [    0,     0,     0, ...,    37,   277,   154],
       [    0,     0,     0, ...,    16,     2,  1210],
       ...,
       [    0,     0,     0, ...,   876,     4,   909],
       [    0,     0,     0, ...,     1,     6,   117],
       [    0,     0,     0, ..., 10258,   173,    13]])

In [50]:
X_train_pad[100]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

In [19]:
X_test_pad

array([[    0,     0,     0, ...,   397,   141,   120],
       [    0,     0,     0, ...,   172,   663, 10259],
       [    0,     0,     0, ...,     5,   389,   582],
       ...,
       [    0,     0,     0, ...,    12,   194,    23],
       [    0,     0,     0, ...,   106,    16,    59],
       [    0,     0,     0, ...,     9,     2,   534]])

#### Categorize Labels

In [20]:
encoding = {'joy':0, 'fear': 1, 'anger': 2, 'sadness': 3, 'neutral': 4}

# Integer labels
y_train = [encoding[x] for x in data_train.Emotion]
y_test = [encoding[x] for x in data_test.Emotion]

In [21]:
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
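
One-hot encoding turns each integer label into a row with a single 1 at the label's position. A small NumPy sketch of the same idea, using the encoding defined above and three made-up labels:

```python
import numpy as np

encoding = {'joy': 0, 'fear': 1, 'anger': 2, 'sadness': 3, 'neutral': 4}
toy_labels = ['sadness', 'neutral', 'anger']  # made-up labels for illustration

# Step 1: labels -> integer ids
toy_ids = np.array([encoding[label] for label in toy_labels])

# Step 2: integer ids -> one-hot rows (row i of the identity matrix encodes class i)
toy_one_hot = np.eye(len(encoding), dtype="float32")[toy_ids]

# argmax recovers the integer id from a one-hot row
toy_decoded = toy_one_hot.argmax(axis=1)

print(toy_one_hot)
print(toy_decoded)
```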

In [54]:
X_train_pad[0:5]

array([[   0,    0,    0, ...,  119,   51,  345],
       [   0,    0,    0, ...,   37,  277,  154],
       [   0,    0,    0, ...,   16,    2, 1210],
       [   0,    0,    0, ..., 2744,    8, 4147],
       [   0,    0,    0, ...,    5,   50, 3297]])

In [55]:
y_train[5]

array([0., 0., 0., 1., 0.], dtype=float32)

In [23]:
y_test

array([[0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0.],
       ...,
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0.]], dtype=float32)

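### 3. Import pretrained word vectors
- Load pretrained word vectors from a file and build an embedding matrix (the file used below, `wiki-news-300d-1M.vec`, contains fastText vectors stored in the word2vec text format)
- Each word in our corpus is later mapped to its existing word vector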

In [24]:
def create_embedding_matrix(filepath, word_index, embedding_dim):
    voc_size = len(word_index)+1 # Adding again 1 because of reserved 0 index
    embedding_matrix = np.zeros((voc_size, embedding_dim))
    with open(filepath, encoding='utf-8') as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix[idx] = np.array(vector, dtype=np.float32)[:embedding_dim]
    return embedding_matrix
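
To check the logic without downloading the full vector file, the same approach can be tried on a tiny made-up `.vec` file. This is a sketch under assumptions: `build_embedding_matrix` is a hypothetical variant of the function above, and the three-dimensional vectors and the misspelled word `hapy` are invented for illustration.

```python
import os
import tempfile
import numpy as np

def build_embedding_matrix(filepath, word_index, embedding_dim):
    """Row i holds the vector of the word with index i; unknown words stay all-zero."""
    matrix = np.zeros((len(word_index) + 1, embedding_dim))
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            word, *vector = line.rstrip().split(" ")
            # Skips the "count dim" header line and any word outside our vocabulary
            if word in word_index and len(vector) >= embedding_dim:
                matrix[word_index[word]] = np.asarray(vector[:embedding_dim], dtype=np.float32)
    return matrix

toy_vec = "2 3\nhappy 0.1 0.2 0.3\nsad -0.1 -0.2 -0.3\n"
toy_index = {"happy": 1, "sad": 2, "hapy": 3}

# Write the toy vectors to a temporary file, build the matrix, then clean up
with tempfile.NamedTemporaryFile("w", suffix=".vec", delete=False, encoding="utf-8") as tmp:
    tmp.write(toy_vec)
    toy_path = tmp.name

toy_emb = build_embedding_matrix(toy_path, toy_index, embedding_dim=3)
os.remove(toy_path)
print(toy_emb)
```

Row 0 (padding) and the row for `hapy` remain zero because no vector exists for them.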

In [25]:
# import urllib.request
# import zipfile
# import os

# fname = 'embeddings/wiki-news-300d-1M.vec'

# if not os.path.isfile(fname):
#     print('Downloading word vectors...')
#     urllib.request.urlretrieve('https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip',
#                               'wiki-news-300d-1M.vec.zip')
#     print('Unzipping...')
#     with zipfile.ZipFile('wiki-news-300d-1M.vec.zip', 'r') as zip_ref:
#         zip_ref.extractall('embeddings')
#     print('done.')
    
#     os.remove('wiki-news-300d-1M.vec.zip')

In [26]:
fname = 'D:/Download_brave/New folder/wiki-news-300d-1M.vec'

embedd_matrix = create_embedding_matrix(fname, index_of_words, embed_num_dims)
embedd_matrix.shape

(12088, 300)

Some words from our corpus are not included in the pre-trained word vectors. Inspecting them shows that they are mostly spelling errors. It is also worth double-checking the data for noise, e.g. text in other languages or tokenizer errors.

In [29]:
# Inspect unseen words
new_words = 0

for word in index_of_words:
    entry = embedd_matrix[index_of_words[word]]
    if all(v == 0 for v in entry):
        new_words = new_words+1

print('Words found in wiki vocab: '+str(len(index_of_words)-new_words))
print('New words found: '+str(new_words))

Words found in wiki vocab: 11442
New words found: 645
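
The cell above only counts the words without a vector. To actually inspect them, one option is to collect the words whose matrix row is entirely zero. A minimal sketch on made-up toy data (the names `toy_index` and `toy_emb` are hypothetical; in the notebook the same idea would apply to `index_of_words` and `embedd_matrix`):

```python
import numpy as np

toy_index = {"happy": 1, "sad": 2, "hapy": 3}
toy_emb = np.zeros((4, 3))
toy_emb[1] = [0.1, 0.2, 0.3]
toy_emb[2] = [-0.1, -0.2, -0.3]

# True for every row that contains at least one non-zero value
has_vector = toy_emb.any(axis=1)

# Words whose row is all zeros were not found in the pre-trained vectors
missing_words = [word for word, idx in toy_index.items() if not has_vector[idx]]

print(missing_words)
print(len(toy_index) - len(missing_words), "words found")
```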


<br>
___________________________________________________________________________________________________________________________

### 4. Creating the Neural Network Pipeline

#### Embedding Layer
We will use pre-trained word vectors. Without pre-trained weights, the embedding layer would instead be trained from scratch together with the rest of the model.

- <b>vocabulary size:</b> the number of distinct terms used to represent a text. For example, with a vocabulary size of 1000, only the 1000 most frequent terms in the corpus are considered and all other terms are ignored. Here we keep every token seen by the tokenizer (`voc_size`).

- <b>maximum length:</b> the length of the input texts, which must all be the same (`max_seq_len`)

- <b>size of embeddings:</b> more dimensions allow finer semantic distinctions, but beyond a certain threshold the embedding loses its ability to define a coherent and sufficiently general semantic area. Here the size is fixed at 300 by the pre-trained vectors.

- <b>trainable:</b> `True` if you want to fine-tune the vectors during training; here it is `False`, so the pre-trained vectors stay frozen
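
Conceptually, a frozen embedding layer is a table lookup: each integer in the padded input selects one row of the embedding matrix. A small NumPy sketch with made-up sizes (6 vocabulary entries, 4 dimensions, sequences of length 4) shows the resulting shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
toy_voc_size, toy_dim = 6, 4

# Toy embedding matrix; row 0 is the padding index and stays all-zero
toy_matrix = rng.normal(size=(toy_voc_size, toy_dim))
toy_matrix[0] = 0.0

# Two left-padded toy sequences of word indices
toy_batch = np.array([[0, 0, 1, 2],
                      [0, 3, 4, 5]])

# Fancy indexing performs the lookup: (batch, seq_len) -> (batch, seq_len, dim)
toy_embedded = toy_matrix[toy_batch]
print(toy_embedded.shape)
```

In the notebook the corresponding shapes are `(batch, 800)` for the input and `(batch, 800, 300)` for the output of the embedding layer.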

In [32]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Dropout, Embedding, GRU

# Embedding layer before the (bidirectional) recurrent layer
embedd_layer = Embedding(voc_size, 
                         embed_num_dims, 
                         input_length = max_seq_len, 
                         weights = [embedd_matrix], 
                         trainable = False)

#### Model Pipeline
- the input is the first N words of each text (with proper padding)

- the first layer maps each word to its embedding, using a vocabulary of a given size and a given embedding dimension

- an LSTM/GRU layer receives the word embedding of each token in the text as input. The intuition is that the output at each position carries information not only about the current token but also about the tokens before it; in other words, the recurrent layer generates a new encoding of the original input.

- the output layer has as many neurons as there are classes and uses a “softmax” activation function

You can change GRU to LSTM. The results will be very similar but LSTM might take longer to train.
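
The softmax output layer turns the five raw scores into probabilities that sum to 1; the predicted emotion is the class with the highest probability. A minimal NumPy sketch with made-up scores (the list `id_to_label` is a hypothetical helper that follows the integer encoding used for the labels):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Position i holds the label with integer id i (joy=0, fear=1, anger=2, sadness=3, neutral=4)
id_to_label = ['joy', 'fear', 'anger', 'sadness', 'neutral']

toy_logits = np.array([[2.0, 0.5, 0.1, -1.0, 0.3]])  # made-up scores for one text
toy_probs = softmax(toy_logits)
toy_prediction = id_to_label[int(toy_probs.argmax(axis=-1)[0])]

print(toy_probs.round(3))
print(toy_prediction)
```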

In [33]:
gru_output_size = 128
bidirectional = True

model = Sequential()
model.add(embedd_layer)

if bidirectional:
    model.add(Bidirectional(GRU(units=gru_output_size, 
                                dropout=0.2, 
                                recurrent_dropout=0.2)))
    
else:
    model.add(GRU(units=gru_output_size,
                  dropout=0.2,
                  recurrent_dropout=0.2))
    
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 800, 300)          3626400   
_________________________________________________________________
bidirectional (Bidirectional (None, 256)               330240    
_________________________________________________________________
dense (Dense)                (None, 5)                 1285      
Total params: 3,957,925
Trainable params: 331,525
Non-trainable params: 3,626,400
_________________________________________________________________
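
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU uses two bias vectors per gate (the `reset_after=True` convention), which is consistent with the numbers printed above.

```python
voc_size, embed_dim, units, num_classes = 12088, 300, 128, 5

# Embedding: one 300-d vector per vocabulary entry (frozen, hence non-trainable)
embedding_params = voc_size * embed_dim

# GRU: 3 gates, each with input weights, recurrent weights and two bias vectors (assumed)
gru_params = 3 * (embed_dim * units + units * units + 2 * units)

# Bidirectional wrapper: one GRU per direction
bidirectional_params = 2 * gru_params

# Dense: the concatenated 2*128 outputs feed 5 neurons, plus 5 biases
dense_params = 2 * units * num_classes + num_classes

total_params = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total_params)
```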


In [34]:
batch_size=128
epochs=20

# batch_size is not passed to fit, so Keras falls back to its default of 32
hist = model.fit(X_train_pad, y_train, epochs=epochs, validation_data=(X_test_pad, y_test))

Epoch 1/20
  6/248 [..............................] - ETA: 47:15 - loss: 1.6059 - accuracy: 0.1845

KeyboardInterrupt: 
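
The `248` in the progress bar is the number of batches per epoch. A quick check, assuming the training set size equals the sum of the class counts shown earlier:

```python
import math

# Training set size = sum of the per-class counts from value_counts()
n_train = 1641 + 1619 + 1616 + 1566 + 1492

# Keras default batch size is 32 when batch_size is not passed to fit
steps_default = math.ceil(n_train / 32)

# Steps per epoch if batch_size=128 were passed instead
steps_128 = math.ceil(n_train / 128)

print(n_train, steps_default, steps_128)
```

Passing `batch_size=batch_size` to `model.fit` would therefore cut the number of steps per epoch to roughly a quarter; whether that shortens training enough in practice depends on the hardware.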