# Part 2: Model Build and Evaluation

This notebook is structured to guide you through the second half of this challenge. If you need additional cells to build and train your classifier, feel free to add them. Otherwise, please refrain from adding cells anywhere else in the notebook during this challenge. Please also do not delete or modify the provided cell headers. You are welcome to add comments, though, if needed! Thank you!

### Import your libraries in the cell below

In [1]:
import pandas as pd
import numpy as np
import re
from tensorflow.keras.models import Sequential

# imports for data pre-processing
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences

### Import in your csv from the previous notebook in the cell below

In [2]:
# train data:
df = pd.read_csv('cleaned_train_data.csv')
# test data:
# df = pd.read_csv('cleaned_test_data.csv')

In [3]:
df.Phrase = df.Phrase.astype(str)

### Build and Train your Classifier in this and the following cell(s) 

In [4]:
# Data pre-processing: fit the tokenizer on the training phrases

X = df.Phrase
y = df.Sentiment
tokenize = Tokenizer()
tokenize.fit_on_texts(X.values)
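
To make the tokenizer step concrete, the sketch below mimics in plain Python how a frequency-ranked word index is built: more frequent words get smaller integers, and 0 is left unused so it can serve as the padding value. It is a simplified stand-in, not the Keras implementation; the helpers `build_word_index` and `texts_to_sequences` and the sample phrases are hypothetical, and punctuation filtering is left out.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is deliberately skipped so it stays free for padding.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

sample = ["a good movie", "a bad movie", "a good good plot"]
word_index = build_word_index(sample)
# "great" never appeared during fitting, so it is silently dropped.
sequences = texts_to_sequences(sample + ["a great movie"], word_index)
```

Words that were not seen while fitting disappear from the encoded sequence, which is why the same fitted tokenizer has to be reused for the test data.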

In [5]:
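# Convert phrases to integer sequences and pad them to a common length

X = tokenize.texts_to_sequences(X)
# Longest phrase in the training data, measured in whitespace-separated words
max_length = max([len(s.split()) for s in df['Phrase']])
X = pad_sequences(X, maxlen=max_length)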
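
The padding step can be illustrated without TensorFlow. The sketch below left-pads each sequence with zeros and, for sequences longer than the target length, keeps the last `maxlen` items, which matches the default `padding='pre'` and `truncating='pre'` behaviour of `pad_sequences`. The helper `pad_pre` and the toy sequences are hypothetical.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) integer sequences to a fixed length."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # empty sequences stay all padding
        trimmed = seq[-maxlen:]  # keep the last maxlen tokens
        padded[row, -len(trimmed):] = trimmed
    return padded

toy = [[1, 2, 3], [4], [5, 6, 7, 8, 9]]
padded = pad_pre(toy, maxlen=4)
```

Because the target length comes from the training phrases, any longer test phrase would lose its leading tokens under these defaults.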

In [6]:
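EMBEDDING_DIM = 100
# +1 because Tokenizer indices start at 1; index 0 is reserved for padding
vocab_size = len(tokenize.word_index) + 1
model = Sequential()
model.add(Embedding(vocab_size, EMBEDDING_DIM, input_length=max_length))
model.add(LSTM(units=128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(5, activation='softmax'))  # one output per sentiment class
# Sparse loss: y holds integer class labels, so no one-hot encoding is needed
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])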

In [7]:
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 29, 100)           1207600   
_________________________________________________________________
lstm (LSTM)                  (None, 128)               117248    
_________________________________________________________________
dense (Dense)                (None, 5)                 645       
Total params: 1,325,493
Trainable params: 1,325,493
Non-trainable params: 0
_________________________________________________________________
None
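
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer sizes, assuming the standard formulas: an embedding table of vocabulary size times embedding dimension, four LSTM gates each with input weights, recurrent weights, and a bias, and a dense layer with one weight per input-output pair plus a bias. The vocabulary size is not stated in the notebook; it is inferred here from the embedding row of the summary (1,207,600 / 100).

```python
EMBEDDING_DIM = 100
LSTM_UNITS = 128
NUM_CLASSES = 5
VOCAB_SIZE = 1_207_600 // EMBEDDING_DIM  # inferred from the embedding row

embedding_params = VOCAB_SIZE * EMBEDDING_DIM
# Four gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * (EMBEDDING_DIM * LSTM_UNITS + LSTM_UNITS * LSTM_UNITS + LSTM_UNITS)
dense_params = LSTM_UNITS * NUM_CLASSES + NUM_CLASSES
total_params = embedding_params + lstm_params + dense_params

print(embedding_params, lstm_params, dense_params, total_params)
```

Nearly all of the weights sit in the embedding table, so vocabulary size dominates the size of this model.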


In [None]:
model.fit(X, y, batch_size=128, epochs=7, verbose=1)

Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7

### Create your Predictions in the cell below

In [None]:
df_test = pd.read_csv('cleaned_test_data.csv')

In [None]:
# Cast to str before selecting the column, so missing values do not break the tokenizer
df_test.Phrase = df_test.Phrase.astype(str)
X_test = df_test.Phrase
X_test = tokenize.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen=max_length)

In [None]:
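# predict_classes was removed in TensorFlow 2.6; take the argmax of the softmax output instead
final_pred = np.argmax(model.predict(X_test), axis=-1)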

### Perform the final evaluation of the Performance of your model in the cell below
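
As one possible shape for this cell, the sketch below turns softmax outputs into class labels and computes accuracy and a confusion matrix with NumPy. It assumes labelled held-out data is available, for example a slice of the training set kept out of `model.fit`; the arrays `probs` and `y_true` are made-up placeholders, not results from this model, and `confusion_matrix` is a hypothetical helper.

```python
import numpy as np

NUM_CLASSES = 5

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(matrix, (y_true, y_pred), 1)
    return matrix

# Placeholder softmax outputs for four phrases (each row sums to 1).
probs = np.array([
    [0.1, 0.6, 0.1, 0.1, 0.1],
    [0.0, 0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.5, 0.1],
    [0.2, 0.2, 0.4, 0.1, 0.1],
])
y_true = np.array([1, 2, 4, 2])  # placeholder labels

y_pred = probs.argmax(axis=-1)        # most probable class per row
accuracy = (y_pred == y_true).mean()  # fraction of correct predictions
matrix = confusion_matrix(y_true, y_pred, NUM_CLASSES)
```

With the real model, `probs` would come from `model.predict` on the held-out inputs. The confusion matrix is worth reading alongside accuracy, since it shows which sentiment classes get mistaken for each other.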