# Creating the Deep Learning Model

We will create a simple deep learning model for sentiment analysis. The dataset is from Kaggle.

## CHECK VERSIONS OF THE FOLLOWING:
 - Python (some issues with v3.9 and above; no problems with v3.7 or v3.8)
 - TensorFlow (most versions are OK; here we use v2.0.0)
 - Keras (testing with v2.3.1; best with v2.4.3)
 - NumPy (breaking issues above v1.19.5; here we use v1.19.2)
 - h5py (v2.10.0)

In [1]:
!python --version

Python 3.7.6


In [2]:
import tensorflow as tf
tf.__version__

'2.0.0'

In [3]:
import keras
keras.__version__

Using TensorFlow backend.


'2.3.1'

In [4]:
import numpy as np
np.__version__

'1.19.2'

In [15]:
import h5py
h5py.__version__

'2.10.0'

### Import libraries

In [5]:
import pandas as pd    # to read .csv files (data)

# using Tensorflow for modeling
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D

# model validation using train-test split
from sklearn.model_selection import train_test_split

import re    # to perform reg-ex on textual data

### Read in the data

In [6]:
data = pd.read_csv('Sentiment.csv') # use local path of data file

data.columns

Index(['id', 'candidate', 'candidate_confidence', 'relevant_yn',
       'relevant_yn_confidence', 'sentiment', 'sentiment_confidence',
       'subject_matter', 'subject_matter_confidence', 'candidate_gold', 'name',
       'relevant_yn_gold', 'retweet_count', 'sentiment_gold',
       'subject_matter_gold', 'text', 'tweet_coord', 'tweet_created',
       'tweet_id', 'tweet_location', 'user_timezone'],
      dtype='object')

### Clean and process the data

In [7]:
# prune the columns that we don't need

data = data[['text', 'sentiment']]
data.head()

| | text | sentiment |
|---|---|---|
| 0 | RT @NancyLeeGrahn: How did everyone feel about... | Neutral |
| 1 | RT @ScottWalker: Didn't catch the full #GOPdeb... | Positive |
| 2 | RT @TJMShow: No mention of Tamir Rice and the ... | Neutral |
| 3 | RT @RobGeorge: That Carly Fiorina is trending ... | Positive |
| 4 | RT @DanScavino: #GOPDebate w/ @realDonaldTrump... | Positive |


In [8]:
# remove unwanted characters via Regex

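def clean_data(text):    # helper function
    text = text.lower()
    # keep only letters, digits, and whitespace (A-Z, not A-z, which also matches [ \ ] ^ _ `)
    new_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # drop the standalone retweet marker "rt" without touching words that contain it
    new_text = re.sub(r'\brt\b', '', new_text)
    return new_text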

data['text'] = data['text'].apply(clean_data) # clean the data
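
To see what this kind of cleaning does to a single tweet, here is a minimal, self-contained sketch. The function name `clean_tweet` and the sample text are made up for illustration; it is a variant of the helper above, not necessarily the notebook's exact code. The last two lines illustrate a common pitfall with character ranges: `A-z` spans more than the letters, so an underscore survives it.

```python
import re

def clean_tweet(text):
    """Hypothetical cleaning helper, for illustration only."""
    text = text.lower()
    # remove everything except lowercase letters, digits, and whitespace
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # remove the retweet marker only when it is a word of its own
    text = re.sub(r'\brt\b', '', text)
    return text.strip()

sample = "RT @user: What a party_night! #GOPDebate"   # made-up tweet
cleaned = clean_tweet(sample)
print(cleaned)   # user what a partynight gopdebate

# Range pitfall: 'A-z' also covers [ \ ] ^ _ ` (ASCII 91 to 96)
kept = re.sub(r'[^a-zA-z]', '', 'a_b')      # underscore is kept
dropped = re.sub(r'[^a-zA-Z]', '', 'a_b')   # underscore is removed
```

Note that the word boundary keeps the "rt" inside "party" intact, whereas a plain `re.sub('rt', '', ...)` would also remove it there.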

In [9]:
data.head()

Unnamed: 0,text,sentiment
0,nancyleegrahn how did everyone feel about the...,Neutral
1,scottwalker didnt catch the full gopdebate la...,Positive
2,tjmshow no mention of tamir rice and the gopd...,Neutral
3,robgeorge that carly fiorina is trending hou...,Positive
4,danscavino gopdebate w realdonaldtrump delive...,Positive


In [10]:
max_feats = 2000

tokenizer = Tokenizer(num_words=max_feats, split=' ')
tokenizer.fit_on_texts(data['text'].values)
# tokenize dataset
X = tokenizer.texts_to_sequences(data['text'].values)
# pad (or truncate) every sequence to a fixed length of 28 tokens
X = pad_sequences(X, maxlen=28)

# convert categorical data to indicator variables (dummies)
Y = pd.get_dummies(data['sentiment']).values
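
By default, `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`), so every row of `X` ends up with the same length. The following pure-Python sketch mimics that default behavior for a single sequence; the helper name `pad_pre` and the token ids are hypothetical and only meant to show the idea.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly maxlen items."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    seq = list(seq)[-maxlen:]                     # keep the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq    # fill the front with zeros

short_seq = pad_pre([5, 9, 2], maxlen=5)              # [0, 0, 5, 9, 2]
long_seq = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)   # [3, 4, 5, 6, 7]
```

Short tweets are therefore filled with leading zeros, and tweets longer than the chosen length lose their first tokens.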

### Split data into train-test datasets (for model validation)

In [11]:
# split data 80:20 (80% training, 20% testing)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
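
As a quick sanity check on the split sizes: with a fractional `test_size`, scikit-learn rounds the test set up and gives the remainder to training. The sketch below reproduces that arithmetic in plain Python. The total of 13871 rows is not stated in the notebook; it is inferred from the training log (11096 + 2775), and `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(n_samples * test_size)   # test set is rounded up
    n_train = n_samples - n_test                # the rest goes to training
    return n_train, n_test

# 13871 is inferred from the training log, not read from the data
n_train, n_test = split_sizes(13871, 0.2)
print(n_train, n_test)   # 11096 2775
```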

### Implement a simple deep learning model

This model uses an embedding layer followed by spatial dropout, two stacked LSTM layers with dropout, and a dense softmax output layer with three units. It is trained with categorical cross-entropy loss and the Adam optimizer.

In [12]:
embed_dim = 128
lstm_out = 196

model = Sequential()
# embedding layer
model.add(Embedding(max_feats, embed_dim, input_length=X.shape[1]))
# dropout
model.add(SpatialDropout1D(0.4))
# LSTM layers
model.add(LSTM(lstm_out, dropout=0.3, recurrent_dropout=0.2, return_sequences=True))
model.add(LSTM(128, recurrent_dropout=0.2))
# Dense with softmax activation
model.add(Dense(3, activation='softmax'))

# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
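
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand from the layer sizes used above. This is a sketch using the standard formulas for embedding, LSTM, and dense layers (the helper names are hypothetical); calling `model.summary()` on the compiled model is the authoritative way to check.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-sized vector per vocabulary index
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

counts = {
    "embedding": embedding_params(2000, 128),   # max_feats x embed_dim
    "lstm_1": lstm_params(128, 196),            # embed_dim -> lstm_out
    "lstm_2": lstm_params(196, 128),            # lstm_out -> 128
    "dense": dense_params(128, 3),              # 128 -> 3 classes
}
total = sum(counts.values())
print(counts, total)
```

Dropout layers add no parameters, so they do not appear in the count.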

### Fit the model

In [13]:
model.fit(X_train, Y_train, epochs=10, batch_size=512, validation_data=(X_test, Y_test))

Train on 11096 samples, validate on 2775 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1f6fdaf3548>

### Store the model

Here we save the model in HDF5 format (.h5 file type).

In [14]:
model.save('py_sentiment.h5')
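
When the saved model is later used for prediction, each output row is a softmax vector of three probabilities, one per sentiment class. The sketch below shows one way to turn such a vector into a label. It assumes the class order is alphabetical, which is how `pd.get_dummies` orders its columns, and it assumes the third class is named "Negative" (only "Neutral" and "Positive" appear in the preview rows). The function name `decode_prediction` and the probabilities are made up.

```python
# Assumed label order: alphabetical, as produced by pd.get_dummies.
# "Negative" is an assumption; it does not appear in the preview rows.
LABELS = ["Negative", "Neutral", "Positive"]

def decode_prediction(probs, labels=LABELS):
    """Return (label, probability) for the most likely class."""
    if len(probs) != len(labels):
        raise ValueError("expected one probability per label")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best], probs[best]

label, score = decode_prediction([0.1, 0.2, 0.7])   # made-up softmax output
print(label, score)   # Positive 0.7
```

It is worth confirming the actual column order with `pd.get_dummies(data['sentiment']).columns` before relying on this mapping.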

# Creating the REST API

See file **sentiment_app.py**. We build the REST API using FastAPI.

### Running the app

The app defined in the Python program **sentiment_app.py** can be run using the following **uvicorn** command:

```$ uvicorn sentiment_app:app --reload```

We use the above command format because
 -  sentiment_app refers to the name of the Python program that defines the app, in this case **sentiment_app.py**
 -  app refers to the name of the variable in **sentiment_app.py** to which we assigned the instantiated app

       ```app = FastAPI()```

    (*see line 11 in* ***sentiment_app.py***)
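
The `module:variable` form of the target can be understood as "import this module, then look up this attribute". The following is a rough sketch of that idea using only the standard library; it is not uvicorn's actual code, and `resolve_target` is a hypothetical helper. The `json:dumps` target stands in for `sentiment_app:app` so that the example runs without the project files.

```python
import importlib

def resolve_target(target):
    """Resolve a 'module:attribute' string to the object it names."""
    module_name, sep, attr_name = target.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError("expected a target of the form 'module:attribute'")
    module = importlib.import_module(module_name)   # e.g. sentiment_app
    return getattr(module, attr_name)               # e.g. app

# Stand-in target: the standard-library json module and its dumps function
dumps = resolve_target("json:dumps")
print(dumps({"a": 1}))   # {"a": 1}
```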
    
Once the server is running, the app is served locally at the following address:

```http://127.0.0.1:8000```


Additionally, the app can be tested through FastAPI's interactive documentation at the /docs route:

```http://127.0.0.1:8000/docs```

(*the prettier app output*)

# Preparing for app deployment

To deploy a version of the app on Heroku, we need the following files:
 - **runtime.txt**, which specifies the Python version to use. In this case **runtime.txt** should read

 ```python-3.7.6```, or ```python-VERSION```


 - **Procfile**, a file with no extension. It should simply read
 
 ```web: uvicorn sentiment_app:app --host=0.0.0.0 --port=${PORT:-5000}```

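  - We use the above values because the server must listen on all interfaces (0.0.0.0), and the port expression uses the port that Heroku assigns through the **PORT** environment variable, falling back to 5000 when **PORT** is not set.
  - The **Procfile** can be created in Visual Studio Code or any other editor that can save a file without an extension.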
 - **requirements.txt**, a text file listing all the libraries used in the project. Our file reads
 
 ```
pandas
scikit-learn
tensorflow==2.0.0
h5py==2.10.0
fastapi
uvicorn
python-multipart
 ```
 
 - **.gitignore** (Git only recognizes this exact file name, not **gitignore.txt**), a file that lists the files to keep out of the repository and therefore out of the Heroku deployment. Ours is as follows:
 
 ```???
 __pycache__
 model.py
 ```
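
The `${PORT:-5000}` expression in the **Procfile** is shell parameter expansion: it evaluates to the value of PORT when that variable is set and non-empty, and to 5000 otherwise. The same fallback logic can be sketched in Python as follows. The helper name `resolve_port` is hypothetical and is not part of **sentiment_app.py**; it is only meant to make the behavior concrete.

```python
import os

def resolve_port(env=None, default=5000):
    """Mimic ${PORT:-5000}: use PORT if set and non-empty, else the default."""
    env = os.environ if env is None else env
    raw = env.get("PORT") or str(default)   # empty string also falls back
    return int(raw)

local_port = resolve_port({})                    # PORT unset, as on a local machine
empty_port = resolve_port({"PORT": ""})          # PORT set but empty
hosted_port = resolve_port({"PORT": "8080"})     # made-up value assigned by the host
print(local_port, empty_port, hosted_port)       # 5000 5000 8080
```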

# Deploying on GitHub

Create a new Git repository to host the project files.

In the project directory, run the following on the command line:

```$ git init```

This command should return a response like the one below:

```Initialized empty Git repository in absolute_path_of_project_directory/.git/```
