<h1><center> Sentiment Analysis </center></h1>

<h2> Table of Contents </h2>

* [Introduction](#introduction)
    * [NLP](#introduction_nlp)
    * [Word2Vec](#introduction_w2v)
    * [Tokenizer](#introduction_tokenizer)
* [Prerequisites](#prerequisites)
    * [Packages](#prerequisites_packages)
    * [Prompt](#prerequisites_prompt)
    * [Read Data](#prerequisites_data)
* [EDA](#eda)    
* [Preprocess Data](#preprocess_data)
    * [Process Text](#preprocess_tweet)
    * [Split Dataset](#preprocess_split)

## Introduction <a class="anchor" id="introduction"></a>

> <h3> Natural Language Processing <a class="anchor" id="introduction_nlp"></a> </h3>
> <h3> Word2Vec <a class="anchor" id="introduction_w2v"></a> </h3>
> <h3> Tokenizer <a class="anchor" id="introduction_tokenizer"></a> </h3>

## Prerequisites <a class="anchor" id="prerequisites"></a>

### Packages <a class="anchor" id="prerequisites_packages"></a>
- pandas
- numpy
- matplotlib
- scikit-learn
- gensim
- tensorflow
- nltk
- keras

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import TfidfVectorizer

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout, Embedding, Flatten, Conv1D, MaxPooling1D, LSTM
from keras import utils
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

import gensim
from gensim.models import word2vec

import re

import os
from collections import Counter
import logging
import time
import pickle
import itertools

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

### Prompt  <a class="anchor" id="prerequisites_prompt"></a>

- The Python version must be 3.6 or lower.

### Read Data <a class="anchor" id="prerequisites_data"></a>

In [6]:
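col_names = ["Sentiment", "Id", "Date", "Flag", "User", "Text"]
data = pd.read_csv(r'D:\CSC590_Design_Project\Data\data.csv', names=col_names, encoding="ISO-8859-1")
# Remap the raw labels (0, 2, 4) to (-1, 0, 1); map() returns a new Series, so assign it back
sentiment_conv = {0: -1, 2: 0, 4: 1}
data['Sentiment'] = data['Sentiment'].map(sentiment_conv)
# Keep only the label and the tweet text
data.drop(["Id", "Date", "Flag", "User"], axis=1, inplace=True)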

## Exploratory Data Analysis <a class="anchor" id="eda"></a>

## Preprocess Data <a class="anchor" id="preprocess_data"></a>

### Process Text <a class="anchor" id="preprocess_tweet"></a>

In [7]:
stops = set(stopwords.words("english"))
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
stemmer = SnowballStemmer("english")

def process_text(text,remove_stops = False, stem = False):
    text = str(text).lower().strip()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'@\w*', '', text)  # remove "@user" mentions
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # keep letters only
    words =[]
    for word in text.split():
        if not remove_stops or word not in stops:
            if not stem:
                words.append(word)
            else:
                words.append(stemmer.stem(word))
    return words    

data['Text'] = data['Text'].apply(lambda x: process_text(x,remove_stops = True))
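
To see what the cleaning steps do to a single tweet, here is a minimal, self-contained sketch. It mirrors the regular expressions above but, as an assumption for illustration, swaps NLTK's stop-word list for a tiny hard-coded set (`DEMO_STOPS`) and omits stemming, so it runs without downloading any NLTK data. The sample tweet and the name `clean_tweet` are made up.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's English list instead.
DEMO_STOPS = {"i", "the", "a", "is", "this", "to", "and"}

def clean_tweet(text, remove_stops=False):
    """Lowercase, strip URLs and @mentions, keep letters only, then split into words."""
    text = str(text).lower().strip()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop URLs
    text = re.sub(r"@\w*", "", text)                   # drop @user mentions
    text = re.sub(r"[^a-zA-Z]", " ", text)             # keep letters only
    return [w for w in text.split() if not remove_stops or w not in DEMO_STOPS]

sample = "@bob I LOVE this phone!!! http://example.com/x 10/10"
tokens = clean_tweet(sample, remove_stops=True)
print(tokens)
```

Note that the function returns a list of words rather than a string, which is the input shape Word2Vec expects for each sentence.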

### Split Dataset  <a class="anchor" id="preprocess_split"></a>

In [8]:
train,val = train_test_split(data, test_size=0.3, random_state=42)
val,test = train_test_split(val, test_size=0.5, random_state=42)
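
The two calls above first hold out 30% of the rows and then halve that hold-out, which gives roughly a 70/15/15 train/validation/test split. The sketch below checks that arithmetic on 1,000 toy row indices. It uses only the standard library (`random.shuffle`) rather than scikit-learn, and `split_rows` is a hypothetical helper, so it illustrates the proportions, not `train_test_split` itself.

```python
import random

def split_rows(rows, test_size, seed=42):
    """Shuffle a copy of `rows` and cut off the last `test_size` fraction."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

rows = list(range(1000))
train_rows, holdout = split_rows(rows, test_size=0.3)      # 70% / 30%
val_rows, test_rows = split_rows(holdout, test_size=0.5)   # 15% / 15%
print(len(train_rows), len(val_rows), len(test_rows))
```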

## Word2Vec Model

### Set Up Model
> The model is created with all of its hyperparameters but without `sentences`, so it starts out empty and untrained

#### Parameters:
- `size`: dimensionality of the word vectors
- `window`: maximum distance between the current word and a context word
- `min_count`: minimum number of occurrences a word needs to be kept in the vocabulary
- `workers`: number of worker threads
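
To make `window` and `min_count` concrete, the following standard-library sketch counts word frequencies, drops words seen fewer than `min_count` times, and lists the (center, context) pairs that fall within a given window. It is a simplified illustration, not gensim's implementation (gensim additionally subsamples frequent words and varies the window per position), and the helper names and toy sentences are hypothetical.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only words that occur at least `min_count` times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

def context_pairs(sentence, vocab, window):
    """Yield (center, context) pairs within `window` positions, after vocabulary filtering."""
    kept = [w for w in sentence if w in vocab]
    for i, center in enumerate(kept):
        lo, hi = max(0, i - window), min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, kept[j]

toy_sentences = [["good", "morning", "world"], ["good", "night", "world"]]
vocab = build_vocab(toy_sentences, min_count=2)   # "morning" and "night" are dropped
pairs = list(context_pairs(toy_sentences[0], vocab, window=1))
print(sorted(vocab), pairs)
```

A larger `min_count` shrinks the vocabulary, which is why the training log reports far fewer vocabulary entries than the number of word types collected.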

In [9]:
W2V_SIZE = 300 
W2V_WINDOW = 7  
W2V_MIN_COUNT = 10 
W2V_WORKERS=8


w2v_model = word2vec.Word2Vec(size=W2V_SIZE, window=W2V_WINDOW, min_count=W2V_MIN_COUNT, workers=W2V_WORKERS)

### Build Vocab

> The model scans the training sentences and builds its vocabulary table

In [10]:
train_sentences = train['Text'].tolist()
w2v_model.build_vocab(train_sentences)

2021-05-11 22:16:59,828 : INFO : collecting all words and their counts
2021-05-11 22:16:59,829 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-05-11 22:16:59,849 : INFO : PROGRESS: at sentence #10000, processed 69897 words, keeping 13017 word types
2021-05-11 22:16:59,871 : INFO : PROGRESS: at sentence #20000, processed 140173 words, keeping 19897 word types
2021-05-11 22:16:59,891 : INFO : PROGRESS: at sentence #30000, processed 210800 words, keeping 25339 word types
2021-05-11 22:16:59,916 : INFO : PROGRESS: at sentence #40000, processed 280414 words, keeping 29965 word types
2021-05-11 22:16:59,935 : INFO : PROGRESS: at sentence #50000, processed 350751 words, keeping 34071 word types
2021-05-11 22:16:59,954 : INFO : PROGRESS: at sentence #60000, processed 420700 words, keeping 38041 word types
2021-05-11 22:16:59,977 : INFO : PROGRESS: at sentence #70000, processed 490648 words, keeping 41756 word types
2021-05-11 22:16:59,996 : INFO : PROGRESS: at s

2021-05-11 22:17:01,384 : INFO : PROGRESS: at sentence #720000, processed 5037582 words, keeping 163845 word types
2021-05-11 22:17:01,403 : INFO : PROGRESS: at sentence #730000, processed 5107481 words, keeping 165188 word types
2021-05-11 22:17:01,425 : INFO : PROGRESS: at sentence #740000, processed 5177061 words, keeping 166610 word types
2021-05-11 22:17:01,445 : INFO : PROGRESS: at sentence #750000, processed 5247002 words, keeping 167962 word types
2021-05-11 22:17:01,466 : INFO : PROGRESS: at sentence #760000, processed 5316963 words, keeping 169264 word types
2021-05-11 22:17:01,489 : INFO : PROGRESS: at sentence #770000, processed 5387036 words, keeping 170680 word types
2021-05-11 22:17:01,509 : INFO : PROGRESS: at sentence #780000, processed 5456109 words, keeping 172010 word types
2021-05-11 22:17:01,530 : INFO : PROGRESS: at sentence #790000, processed 5526541 words, keeping 173358 word types
2021-05-11 22:17:01,553 : INFO : PROGRESS: at sentence #800000, processed 559645

### Train Model

#### Parameters:

- `total_examples`: count of sentences
- `epochs`: number of iterations (passes) over the corpus

In [11]:
W2V_EPOCH = 32 
w2v_model.train(train_sentences, total_examples=len(train_sentences), epochs=W2V_EPOCH)

2021-05-11 22:17:06,304 : INFO : training model with 8 workers on 27065 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=7
2021-05-11 22:17:07,333 : INFO : EPOCH 1 - PROGRESS: at 13.40% examples, 921574 words/s, in_qsize 16, out_qsize 0
2021-05-11 22:17:08,365 : INFO : EPOCH 1 - PROGRESS: at 25.53% examples, 871225 words/s, in_qsize 14, out_qsize 1
2021-05-11 22:17:09,412 : INFO : EPOCH 1 - PROGRESS: at 36.48% examples, 824401 words/s, in_qsize 14, out_qsize 4
2021-05-11 22:17:10,435 : INFO : EPOCH 1 - PROGRESS: at 46.43% examples, 788735 words/s, in_qsize 16, out_qsize 1
2021-05-11 22:17:11,466 : INFO : EPOCH 1 - PROGRESS: at 56.51% examples, 767833 words/s, in_qsize 15, out_qsize 0
2021-05-11 22:17:12,486 : INFO : EPOCH 1 - PROGRESS: at 68.38% examples, 775381 words/s, in_qsize 15, out_qsize 0
2021-05-11 22:17:13,499 : INFO : EPOCH 1 - PROGRESS: at 80.94% examples, 787822 words/s, in_qsize 14, out_qsize 1
2021-05-11 22:17:14,534 : INFO : EPOCH 1 - PROGRESS:

2021-05-11 22:17:49,229 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-05-11 22:17:49,232 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-05-11 22:17:49,235 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-05-11 22:17:49,245 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-11 22:17:49,260 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-11 22:17:49,266 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-11 22:17:49,269 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-11 22:17:49,269 : INFO : EPOCH - 5 : training on 7827483 raw words (6988437 effective words) took 8.5s, 819104 effective words/s
2021-05-11 22:17:50,290 : INFO : EPOCH 6 - PROGRESS: at 12.50% examples, 870891 words/s, in_qsize 15, out_qsize 0
2021-05-11 22:17:51,297 : INFO : EPOCH 6 - PROGRESS: at 25.52% examples, 887246 words/s, in_qsize 15, out_qsize 

2021-05-11 22:18:28,295 : INFO : EPOCH 10 - PROGRESS: at 90.39% examples, 886937 words/s, in_qsize 15, out_qsize 0
2021-05-11 22:18:28,955 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-05-11 22:18:28,958 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-05-11 22:18:28,959 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-05-11 22:18:28,963 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-05-11 22:18:28,965 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-11 22:18:28,982 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-11 22:18:28,983 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-11 22:18:28,984 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-11 22:18:28,984 : INFO : EPOCH - 10 : training on 7827483 raw words (6987739 effective words) took 7.8s, 894408 effective words/s
2021-05-11 22:18:30

2021-05-11 22:19:06,754 : INFO : EPOCH 15 - PROGRESS: at 75.81% examples, 870130 words/s, in_qsize 15, out_qsize 0
2021-05-11 22:19:07,758 : INFO : EPOCH 15 - PROGRESS: at 88.59% examples, 872747 words/s, in_qsize 15, out_qsize 0
2021-05-11 22:19:08,539 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-05-11 22:19:08,541 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-05-11 22:19:08,556 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-05-11 22:19:08,568 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-05-11 22:19:08,573 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-11 22:19:08,577 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-11 22:19:08,579 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-11 22:19:08,586 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-11 22:19:08,587 : INFO : EPOCH - 1

2021-05-11 22:19:45,248 : INFO : EPOCH 20 - PROGRESS: at 65.83% examples, 896315 words/s, in_qsize 15, out_qsize 0
2021-05-11 22:19:46,254 : INFO : EPOCH 20 - PROGRESS: at 78.64% examples, 894791 words/s, in_qsize 15, out_qsize 0
2021-05-11 22:19:47,260 : INFO : EPOCH 20 - PROGRESS: at 91.54% examples, 894795 words/s, in_qsize 15, out_qsize 0
2021-05-11 22:19:47,834 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-05-11 22:19:47,837 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-05-11 22:19:47,843 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-05-11 22:19:47,843 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-05-11 22:19:47,844 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-11 22:19:47,864 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-11 22:19:47,869 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-11 22:19:4

2021-05-11 22:20:23,232 : INFO : EPOCH 25 - PROGRESS: at 49.22% examples, 850265 words/s, in_qsize 14, out_qsize 1
2021-05-11 22:20:24,237 : INFO : EPOCH 25 - PROGRESS: at 62.13% examples, 859598 words/s, in_qsize 16, out_qsize 0
2021-05-11 22:20:25,241 : INFO : EPOCH 25 - PROGRESS: at 75.04% examples, 865785 words/s, in_qsize 15, out_qsize 0
2021-05-11 22:20:26,259 : INFO : EPOCH 25 - PROGRESS: at 87.57% examples, 864906 words/s, in_qsize 14, out_qsize 3
2021-05-11 22:20:27,107 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-05-11 22:20:27,109 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-05-11 22:20:27,116 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-05-11 22:20:27,117 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-05-11 22:20:27,138 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-11 22:20:27,140 : INFO : worker thread finished; awaiting finish of 2 more th

2021-05-11 22:21:01,486 : INFO : EPOCH 30 - PROGRESS: at 38.78% examples, 888178 words/s, in_qsize 13, out_qsize 2
2021-05-11 22:21:02,510 : INFO : EPOCH 30 - PROGRESS: at 51.14% examples, 877522 words/s, in_qsize 15, out_qsize 0
2021-05-11 22:21:03,537 : INFO : EPOCH 30 - PROGRESS: at 64.80% examples, 887912 words/s, in_qsize 15, out_qsize 1
2021-05-11 22:21:04,547 : INFO : EPOCH 30 - PROGRESS: at 77.86% examples, 890187 words/s, in_qsize 14, out_qsize 1
2021-05-11 22:21:05,565 : INFO : EPOCH 30 - PROGRESS: at 90.78% examples, 889369 words/s, in_qsize 15, out_qsize 0
2021-05-11 22:21:06,174 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-05-11 22:21:06,196 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-05-11 22:21:06,201 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-05-11 22:21:06,205 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-05-11 22:21:06,209 : INFO : worker thread finished; awai

(223602363, 250479456)
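
The tuple printed at the end is the return value of `train`. As far as the gensim documentation describes it, the first number is the effective word count (words actually used for training) and the second is the raw word count, both summed over all epochs. The short check below reproduces the second number from the per-epoch figure in the log (7827483 raw words) and the 32 epochs configured above; the variable names are illustrative.

```python
RAW_WORDS_PER_EPOCH = 7827483          # "training on 7827483 raw words" in the log
EPOCHS = 32                            # value of W2V_EPOCH
reported = (223602363, 250479456)      # tuple printed after training

total_raw = RAW_WORDS_PER_EPOCH * EPOCHS
effective_share = reported[0] / reported[1]   # fraction of raw words used for training
print(total_raw, round(effective_share, 3))
```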

In [12]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train['Text'])

vocab_size = len(tokenizer.word_index)+1
vocab_size

213921

In [None]:
SEQUENCE_LENGTH = 300
x_train = pad_sequences(tokenizer.texts_to_sequences(train['Text']), maxlen=SEQUENCE_LENGTH)
x_val = pad_sequences(tokenizer.texts_to_sequences(val['Text']), maxlen=SEQUENCE_LENGTH)  # built from the validation split
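
`pad_sequences` brings every sequence to exactly `maxlen` integers. With its default settings it pads with zeros at the front and, for sequences that are too long, drops tokens from the front as well. The pure-Python sketch below imitates that default behavior on toy data; `pad_pre` is a hypothetical stand-in, not the Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and left-truncate so each row has length `maxlen`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                          # keep the last `maxlen` tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # zeros go in front
    return padded

toy = [[5, 2], [7, 1, 4, 9, 3]]
result = pad_pre(toy, maxlen=4)
print(result)
```

Index 0 is reserved for this padding, which is why the vocabulary size is computed as `len(tokenizer.word_index) + 1`.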

In [None]:
labels = [-1,1,0]
encoder = LabelEncoder()
encoder.fit(train['Sentiment'].tolist())

y_train = encoder.transform(train['Sentiment'].tolist())
y_val = encoder.transform(val['Sentiment'].tolist())

y_train = y_train.reshape(-1,1)
y_val = y_val.reshape(-1,1)
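
`LabelEncoder` assigns each distinct label an integer index in sorted order, so whichever raw values the `Sentiment` column holds end up as 0, 1, and so on. A minimal standard-library sketch of that mapping follows; `fit_labels` and `transform_labels` are hypothetical helpers and the toy labels are made up.

```python
def fit_labels(labels):
    """Map each distinct label to its rank in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}

def transform_labels(labels, mapping):
    """Replace each label by its integer index, one per row (column-vector shape)."""
    return [[mapping[label]] for label in labels]

toy_labels = [4, 0, 0, 4]
mapping = fit_labels(toy_labels)
encoded = transform_labels(toy_labels, mapping)
print(mapping, encoded)
```

The nested one-element rows mirror the effect of `reshape(-1, 1)` in the cell above.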

In [None]:
embedding_matrix = np.zeros((vocab_size, W2V_SIZE))
for word, i in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[i] = w2v_model.wv[word]
print(embedding_matrix.shape)
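
The loop above copies each known word's vector into the row given by its tokenizer index, leaving row 0 (padding) and out-of-vocabulary words as zeros. The sketch below shows the same lookup with plain lists and a made-up 2-dimensional vector table (`toy_vectors`) standing in for the trained Word2Vec model; all names and values are illustrative.

```python
toy_word_index = {"good": 1, "bad": 2, "meh": 3}          # tokenizer-style indices start at 1
toy_vectors = {"good": [0.9, 0.1], "bad": [-0.8, 0.2]}   # "meh" has no vector
dim = 2

# One row per index, plus row 0 for padding; everything starts as zeros.
matrix = [[0.0] * dim for _ in range(len(toy_word_index) + 1)]
for word, i in toy_word_index.items():
    if word in toy_vectors:
        matrix[i] = list(toy_vectors[word])
print(matrix)
```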

In [None]:
embedding_layer = Embedding(vocab_size, W2V_SIZE, weights=[embedding_matrix], input_length=SEQUENCE_LENGTH, trainable=False)

In [None]:
model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.5))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

model.summary()

In [None]:
SEQUENCE_LENGTH = 300
EPOCHS = 8
BATCH_SIZE = 1024
model.compile(loss='binary_crossentropy',
              optimizer="adam",
              metrics=['accuracy'])
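# Note: with metrics=['accuracy'], older Keras releases log 'val_acc' while
# Keras >= 2.3 logs 'val_accuracy'; the monitor name must match the installed version.
callbacks = [ ReduceLROnPlateau(monitor='val_loss', patience=5, cooldown=0),
              EarlyStopping(monitor='val_acc', min_delta=1e-4, patience=5)]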
history = model.fit(x_train, y_train,
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS,
                    validation_split=0.1,
                    verbose=1,
                    callbacks=callbacks)