<a href="https://colab.research.google.com/github/cagBRT/SentimentTextAnalysis/blob/master/Sentiment_Text_Analysis_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Clone the entire repo.
%cd /content/
!git clone  https://github.com/cagBRT/SentimentTextAnalysis.git cloned-repo
%cd cloned-repo
!ls

In [None]:
from IPython.display import Image
def page(num):
    return Image("images/sentTextAna"+str(num)+ ".png" , width=600)

# **Import the libraries**

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

# Install TensorFlow
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
from tensorflow import keras

In [None]:
import pandas as pd

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping

# **Examine the data**<br>
The data is from three sources: <br>
> Yelp reviews<br>
> Amazon reviews<br>
> IMDb movie reviews<br>

The data has the structure: <br>
>"review text" label source<br>

**sentence**: the review text<br>
**label**: 0 = negative review, 1 = positive review<br>
**source**: yelp, amazon, imdb
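
As a rough illustration of this layout, the sketch below loads two invented review lines (not taken from the real files) from an in-memory string instead of a file. The names `sample_text` and `sample_df` are made up for this example.

```python
import io
import pandas as pd

# Two invented lines in the "sentence<TAB>label" layout (not real data).
sample_text = "The food was great.\t1\nThe service was slow.\t0\n"

# Read the tab-separated text and name the two columns.
sample_df = pd.read_csv(io.StringIO(sample_text), names=['sentence', 'label'], sep='\t')

# The source column is not in the file; it is added after loading.
sample_df['source'] = 'yelp'
print(sample_df)
```

Each row ends up with three fields: the sentence, its 0/1 label, and the source name.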

In [None]:
#!cat yelp_labelled.txt
#Change directory to the cloned repo
%cd /content/cloned-repo/

In [None]:
#create a dataframe containing all three sources
filepath_dict = {'yelp':   'yelp_labelled.txt',
                 'amazon': 'amazon_cells_labelled.txt',
                 'imdb':   'imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
print(df.iloc[0])
print("dataframe shape: ",df.shape)
df['label'].value_counts()

# **Split the review data into train and test sets**

Split the Yelp and Amazon data into training and test sets<br>

[train_test_split](https://www.bitdegree.org/learn/train-test-split)
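
To get a feel for what `train_test_split` does before applying it to the reviews, here is a small sketch on eight invented placeholder reviews. The `toy_*` names are hypothetical and used only in this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Eight placeholder reviews; review i gets label i % 2 (invented data).
toy_sentences = np.array(["review %d" % i for i in range(8)])
toy_labels = np.array([i % 2 for i in range(8)])

# test_size=0.25 keeps 25% of the rows for testing;
# a fixed random_state makes the shuffle reproducible.
toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_sentences, toy_labels, test_size=0.25, random_state=1000)

print(len(toy_train), "training rows,", len(toy_test), "test rows")
```

The sentences and labels are shuffled together, so each sentence keeps its own label after the split.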

In [None]:
from sklearn.model_selection import train_test_split
#select the rows of the data set that are from yelp
df_yelp = df[df['source'] == 'yelp']

sentences_yelp = df_yelp['sentence'].values
y_yelp = df_yelp['label'].values

#do a 75 - 25 split between train and test data
#If int, random_state is the seed used by the random number generator; 
#If RandomState instance, random_state is the random number generator; 
#If None, the random number generator is the RandomState instance used by np.random.
sentences_train_yelp, sentences_test_yelp, y_train_yelp, y_test_yelp = train_test_split(
   sentences_yelp, y_yelp, test_size=0.25, random_state=1000)

#print out the first sentence of the training set
print(sentences_train_yelp[0])

In [None]:
from sklearn.model_selection import train_test_split
#select the rows of the data set that are from amazon
df_amazon = df[df['source'] == 'amazon']

sentences_amazon = df_amazon['sentence'].values
y_amazon = df_amazon['label'].values

#do a 75 - 25 split between train and test data
#If int, random_state is the seed used by the random number generator; 
#If RandomState instance, random_state is the random number generator; 
#If None, the random number generator is the RandomState instance used by np.random.
sentences_train_amazon, sentences_test_amazon, y_train_amazon, y_test_amazon = train_test_split(
   sentences_amazon, y_amazon, test_size=0.25, random_state=1000)

#print out the first sentence of the training set
print(sentences_train_amazon[0])

# **Vectorize the training and test set**
Vectorize the data: <br>

Assign each word a number.<br>
Count the number of times each word appears in the individual review text. 
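
The two steps above can be sketched on a pair of invented reviews. This is only a toy illustration; `toy_reviews`, `toy_vectorizer` and `toy_counts` are names made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented reviews.
toy_reviews = ["good food good service", "bad food"]

toy_vectorizer = CountVectorizer()
# fit_transform builds the vocabulary and counts the words in one call.
toy_counts = toy_vectorizer.fit_transform(toy_reviews)

print(toy_vectorizer.vocabulary_)   # word -> column number
print(toy_counts.toarray())         # one row per review, one column per word
```

`CountVectorizer` numbers the words in alphabetical order, so "bad" is column 0 and "service" is column 3. The first review has a 2 in the "good" column because that word appears twice.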

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
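# 1. Build the bag-of-words (BoW) vocabulary from all of the Yelp sentences.
#    Note: this includes the test sentences; fitting on sentences_train_yelp
#    alone would keep test-set words out of the vocabulary.
vectorizer = CountVectorizer()
vectorizer.fit(sentences_yelp)
vocab_yelp = vectorizer.vocabulary_
vocab_yelp = pd.Series(vocab_yelp)
# 2. Vectorize the training and test sentences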
X_train_yelp = vectorizer.transform(sentences_train_yelp)
X_test_yelp  = vectorizer.transform(sentences_test_yelp)
print("training data: ", X_train_yelp.shape,"\ntest data: ", X_test_yelp.shape)

In [None]:
# 1. Build the bag-of-words (BoW) vocabulary from all of the Amazon sentences.
#    Note: this reuses the name `vectorizer`, replacing the Yelp vectorizer.
vectorizer = CountVectorizer()
vectorizer.fit(sentences_amazon)
vocab_amazon = vectorizer.vocabulary_
vocab_amazon = pd.Series(vocab_amazon)
# 2. Vectorize the training and test sentences
X_train_amazon = vectorizer.transform(sentences_train_amazon)
X_test_amazon  = vectorizer.transform(sentences_test_amazon)
print("training data: ", X_train_amazon.shape,"\ntest data: ", X_test_amazon.shape)

What has been done so far: 
1. Created a vocabulary from all the words used in the Yelp reviews (and a separate one for the Amazon reviews).
2. Assigned each word a number.
3. Counted how often each word appears in each review.<br>

Now check the vectorization of the sentences in the yelp review training and test data. 
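
Printing a row of the vectorized data shows word numbers rather than words. The sketch below, using two invented reviews and hypothetical names (`toy_matrix`, `index_to_word`, `decoded`), shows one possible way to map a sparse row back to readable words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented reviews.
toy_reviews = ["the food was good", "the service was bad bad"]
toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_reviews)

# Invert the vocabulary so a column number maps back to its word.
index_to_word = {int(index): word
                 for word, index in toy_vectorizer.vocabulary_.items()}

# A single sparse row stores only the non-zero columns and their counts.
row = toy_matrix[1]
decoded = {index_to_word[int(col)]: int(count)
           for col, count in zip(row.indices, row.data)}
print(decoded)
```

Words that do not occur in the review have a count of zero and are simply not stored in the sparse row.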

In [None]:
# Select an index between 0 and 749
check=24
print(sentences_train_yelp[check])
print(X_train_yelp[check])
# Each printed entry is (row, word number) followed by that word's count in the sentence.
# The row is always 0 because a single sentence is selected.

In [None]:
# Select an index between 0 and len(sentences_test_yelp) - 1
# (the test set is smaller than the training set)
check=24
print(sentences_test_yelp[check])
print(X_test_yelp[check])
# Each printed entry is (row, word number) followed by that word's count in the sentence

In [None]:
# Select an index between 0 and len(sentences_test_amazon) - 1
check=45
print(sentences_test_amazon[check])
print(X_test_amazon[check])
# Each printed entry is (row, word number) followed by that word's count in the sentence

# **Trial 1: Keras DNN**
Create a DNN with Keras and train it on the Yelp bag-of-words features.<br> 
Compare it to the logistic regression model trained on the same data. 

In [None]:
page(3)

In [None]:
input_dim_yelp = X_train_yelp.shape[1]  # Number of features
print("model inputs = ", input_dim_yelp)

model_yelp = Sequential()
model_yelp.add(layers.Dense(1700, input_dim=input_dim_yelp, activation='relu'))
model_yelp.add(layers.Dense(1000,  activation='relu'))
model_yelp.add(layers.Dense(100,  activation='relu'))
# A sigmoid output stays between 0 and 1, which binary_crossentropy expects
model_yelp.add(layers.Dense(1, activation='sigmoid'))

# **Discussion:**
Given the dataset and the model, what do you expect to happen? 
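
One way to reason about this question is to compare the number of trainable parameters with the number of training reviews. The sketch below counts weights and biases for the layer sizes used above. The input size of 2000 is an assumed placeholder, not the real vocabulary size (the notebook prints the real value), and `dense_param_count` is a helper written only for this example.

```python
def dense_param_count(layer_sizes):
    """Total weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

assumed_input_dim = 2000   # ASSUMPTION: placeholder for the real vocabulary size
layer_sizes = [assumed_input_dim, 1700, 1000, 100, 1]

total_params = dense_param_count(layer_sizes)
training_reviews = 750     # training indices run from 0 to 749

print("parameters:", total_params)
print("parameters per training review:", total_params // training_reviews)
```

With these sizes the model has several thousand parameters for every training review, which is worth keeping in mind when looking at the training and validation curves.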

In [None]:
model_yelp.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model_yelp.summary()

**Train the DNN Model**

In [None]:
history = model_yelp.fit(X_train_yelp, y_train_yelp,
                    epochs=10,
                    verbose=False,
                    validation_data=(X_test_yelp, y_test_yelp),
                    batch_size=20)

In [None]:
loss, accuracy = model_yelp.evaluate(X_train_yelp, y_train_yelp, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model_yelp.evaluate(X_test_yelp, y_test_yelp, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

The accuracy of the scikit-learn model in notebook 1:<br>
>Accuracy for yelp data: 0.7960<br>
Accuracy for amazon data: 0.7960<br>
Accuracy for imdb data: 0.7487<br>

In [None]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')

def plot_history(history):
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()

The deep neural network trained on the bag-of-words features overfits the training data. <br>
Experiment with the model architecture and hyperparameters to see if you can find a better model. 
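
One common option to try is early stopping, which halts training once the validation loss stops improving. Keras provides an `EarlyStopping` callback for this (imported at the top of this notebook but not used). The sketch below is not that callback; it is a simplified plain-Python imitation of the patience idea, run on made-up loss values. The function name and the numbers are invented for illustration.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based.

    Stops once `patience` epochs in a row fail to beat the best loss so far.
    """
    best_loss = float('inf')
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Training ran to the end without triggering the stop rule.
    return len(val_losses), best_epoch

# Made-up validation losses: they improve, then get worse (a sign of overfitting).
made_up_val_losses = [0.69, 0.55, 0.50, 0.52, 0.58, 0.66]
stop_epoch, best_epoch = best_epoch_with_patience(made_up_val_losses, patience=2)
print("stop after epoch", stop_epoch, "- best epoch was", best_epoch)
```

A larger `patience` tolerates more noisy epochs before stopping, at the cost of training longer.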

In [None]:
plot_history(history)

In the bag-of-words (BoW) model, you represented an entire review as a single feature vector. In the next section, each word is represented as a vector. 
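
The difference between the two representations can be sketched on a three-word invented review. The per-word vectors here are simple one-hot vectors, chosen only for illustration; the next section may use a different kind of word vector. The vocabulary and variable names are made up for this example.

```python
import numpy as np

# Invented three-word vocabulary and review.
vocabulary = {"bad": 0, "food": 1, "good": 2}
review = ["good", "food", "good"]

# Bag of words: one count vector for the whole review; word order is lost.
bow_vector = np.zeros(len(vocabulary), dtype=int)
for word in review:
    bow_vector[vocabulary[word]] += 1

# Per-word representation: one vector per word, in the order the words appear.
# ASSUMPTION: one-hot vectors stand in for whatever word vectors are used later.
word_vectors = np.eye(len(vocabulary), dtype=int)[[vocabulary[w] for w in review]]

print("BoW vector shape:", bow_vector.shape)
print("Word vectors shape:", word_vectors.shape)
```

The BoW vector has one entry per vocabulary word, while the per-word form has one row per word in the review, so it keeps the word order.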

# **Assignment #4:**
Modify the DNN to see if you can improve the accuracy and loss. 

# **Assignment #5:** 
Train the DNN with the Amazon reviews.<br>
<br>
Use different variable names than the ones used for the Yelp reviews. 