<a href="https://colab.research.google.com/github/cagBRT/SentimentTextAnalysis/blob/master/Sentiment_Text_Analysis_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Clone the entire repo.
%cd /content/
!git clone  https://github.com/cagBRT/SentimentTextAnalysis.git cloned-repo
%cd cloned-repo
!ls

In [None]:
from IPython.display import Image
def page(num):
    return Image("images/sentTextAna"+str(num)+ ".png" , width=600)

In [None]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')

def plot_history(history):
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()

# **Import the libraries**

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

# Install TensorFlow
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
from tensorflow import keras

In [None]:
import pandas as pd

In [None]:
from keras.models import Sequential
from keras import layers
from keras.callbacks import EarlyStopping

# **Examine the data**<br>
The data is from three sources: <br>
> yelp reviews<br>
> amazon reviews<br>
> movie reviews<br>

The data has the structure: <br>
>"review text" label source<br>

**review text is called**: sentence<br>
**label**: 0 = negative review, 1 = positive review<br>
**source**: yelp, amazon, imdb
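
To make the layout concrete, here is a minimal sketch that parses two made-up lines in the same tab-separated format (sentence, then label). The sample sentences and the `"example"` source name are invented for illustration and do not come from the data files.

```python
import csv
import io

# Made-up lines in the same layout as the data files: sentence <TAB> label
raw = "The soup was cold.\t0\nGreat battery life!\t1\n"

rows = []
for sentence, label in csv.reader(io.StringIO(raw), delimiter="\t"):
    # "example" stands in for the source name (yelp, amazon, imdb)
    rows.append({"sentence": sentence, "label": int(label), "source": "example"})

for row in rows:
    print(row)
```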

In [None]:
#!cat yelp_labelled.txt
#Change directory to the cloned repo
%cd /content/cloned-repo/

In [None]:
#create a dataframe containing all three sources
filepath_dict = {'yelp':   'yelp_labelled.txt',
                 'amazon': 'amazon_cells_labelled.txt',
                 'imdb':   'imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
print(df.iloc[0])
print("dataframe shape: ",df.shape)

In [None]:
from sklearn.model_selection import train_test_split
#select the rows of the data set that are from yelp
df_yelp = df[df['source'] == 'yelp']

sentences_yelp = df_yelp['sentence'].values
y_yelp = df_yelp['label'].values

#do a 75 - 25 split between train and test data
#If int, random_state is the seed used by the random number generator; 
#If RandomState instance, random_state is the random number generator; 
#If None, the random number generator is the RandomState instance used by np.random.
sentences_train_yelp, sentences_test_yelp, y_train_yelp, y_test_yelp = train_test_split(
   sentences_yelp, y_yelp, test_size=0.25, random_state=1000)

#print out the first sentence of the training set
print(sentences_train_yelp[0])

In [None]:
from sklearn.model_selection import train_test_split
#select the rows of the data set that are from amazon
df_amazon = df[df['source'] == 'amazon']

sentences_amazon = df_amazon['sentence'].values
y_amazon = df_amazon['label'].values

#do a 75 - 25 split between train and test data
#If int, random_state is the seed used by the random number generator; 
#If RandomState instance, random_state is the random number generator; 
#If None, the random number generator is the RandomState instance used by np.random.
sentences_train_amazon, sentences_test_amazon, y_train_amazon, y_test_amazon = train_test_split(
   sentences_amazon, y_amazon, test_size=0.25, random_state=1000)

#print out the first sentence of the training set
print(sentences_train_amazon[0])

In [None]:
from keras.preprocessing.text import Tokenizer

#Go through all the reviews and keep only the most frequent words (num_words=3000).
tokenizer_yelp = Tokenizer(num_words=3000) #limit the vocabulary by word frequency

#Update the internal vocabulary based on a list of texts
#Must be run before running texts_to_sequences
tokenizer_yelp.fit_on_texts(sentences_train_yelp)

In [None]:
#Go through all the reviews and keep only the most frequent words (num_words=3000).
tokenizer_amazon = Tokenizer(num_words=3000) #limit the vocabulary by word frequency

#Update the internal vocabulary based on a list of texts
#Must be run before running texts_to_sequences
tokenizer_amazon.fit_on_texts(sentences_train_amazon)

The number assigned to each word depends on its frequency of use across all the sentences. <br>
For example:<br>
>'the' is 1<br>
'and' is 2<br>
'was' is 3<br>
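
As a rough illustration of frequency-ranked indexing, the sketch below builds a word index by hand with `collections.Counter`. The helper `build_word_index` and the toy sentences are hypothetical. The Keras `Tokenizer` also strips punctuation, which this sketch ignores.

```python
from collections import Counter

def build_word_index(sentences):
    """Hypothetical helper: rank words by frequency, starting at index 1."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    # most_common() sorts by descending count; index 0 is never assigned
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy_sentences = [
    "the food was great and the service was great",
    "the pizza was cold",
]
word_index = build_word_index(toy_sentences)

# Replace each word with its index, as texts_to_sequences does
encoded = [[word_index[w] for w in s.lower().split()] for s in toy_sentences]
print(word_index)
print(encoded)
```

The most frequent words receive the smallest indices, so frequent words such as "the" end up near the start of the list.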


In [None]:
#Examples of reviews as sequences of word indices (integers, not yet embeddings)
X_train_yelp = tokenizer_yelp.texts_to_sequences(sentences_train_yelp)
print(sentences_train_yelp[3],X_train_yelp[3])
print(sentences_train_yelp[23],X_train_yelp[23])
print(sentences_train_yelp[620],X_train_yelp[620])

In [None]:
#Examples of reviews as sequences of word indices (integers, not yet embeddings)
X_train_amazon = tokenizer_amazon.texts_to_sequences(sentences_train_amazon)
print(sentences_train_amazon[3],X_train_amazon[3])
print(sentences_train_amazon[23],X_train_amazon[23])
print(sentences_train_amazon[620],X_train_amazon[620])

In [None]:
X_test_yelp = tokenizer_yelp.texts_to_sequences(sentences_test_yelp)
vocab_size_yelp = len(tokenizer_yelp.word_index) + 1  # Adding 1 because of reserved 0 index

print("vocab size=", vocab_size_yelp)

In [None]:
X_test_amazon= tokenizer_amazon.texts_to_sequences(sentences_test_amazon)
vocab_size_amazon = len(tokenizer_amazon.word_index) + 1  # Adding 1 because of reserved 0 index

print("vocab size=", vocab_size_amazon)

The indexing begins with the most common word first (the). <br>
It is important to note that the index 0 is reserved and is not assigned to any word. 

In [None]:
for word in ['the', 'all', 'bad', 'terrible','horrible','lost','lukewarm','bacon']: 
    print('{}: {}'.format(word, tokenizer_yelp.word_index[word]))

The list can be searched by word or by index. 

In [None]:
#What is the least used word in the list? It is the word with the highest index.
print(tokenizer_yelp.index_word[len(tokenizer_yelp.word_index)])
print(tokenizer_amazon.index_word[len(tokenizer_amazon.word_index)])

# **Pad the sequence of words**

One problem is that each text sequence has a different number of words. To fix this, you can use pad_sequences(), which pads each sequence of word indices with zeros. By default it prepends zeros, but here we append them instead (padding='post'). Typically it does not matter whether you prepend or append zeros.

You should also set the maxlen parameter to specify how long the sequences should be. Sequences longer than maxlen are cut down to that length.

When a sentence is fairly short, the resulting feature vector contains mostly zeros.
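
The padding and truncation described above can be sketched in plain Python. The helper `pad_post` is hypothetical and only approximates `pad_sequences(..., padding='post')`. It drops tokens from the front of long sequences, which matches the Keras default `truncating='pre'` as far as I understand it, and it assumes `maxlen` is at least 1.

```python
def pad_post(sequences, maxlen, value=0):
    """Hypothetical helper: pad on the right, keep the last `maxlen` tokens."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # too long: drop tokens from the front
        padded.append(seq + [value] * (maxlen - len(seq)))  # too short: append zeros
    return padded

toy_sequences = [[4, 12, 7], [9], [3, 8, 15, 2, 6, 11]]
padded = pad_post(toy_sequences, maxlen=5)
for row in padded:
    print(row)
```

Every row now has the same length, so the rows can be stacked into one array and fed to the model.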

In [None]:
from keras.utils import pad_sequences
#The maximum length of a review, cut off the extra words 
maxlen = 100
#If a review is less than 100 words, pad the vector with 0s.

X_train_yelp = pad_sequences(X_train_yelp, padding='post', maxlen=maxlen)
X_test_yelp = pad_sequences(X_test_yelp, padding='post', maxlen=maxlen)

print(X_train_yelp.shape,X_test_yelp.shape)
print(y_train_yelp.shape,y_test_yelp.shape, "\n")

index=5
print("The review:\n",sentences_train_yelp[index])
print("\nThe final feature vector:\n",X_train_yelp[index, :])

In [None]:
#The maximum length of a review, cut off the extra words 
maxlen = 100
#If a review is less than 100 words, pad the vector with 0s.

X_train_amazon = pad_sequences(X_train_amazon, padding='post', maxlen=maxlen)
X_test_amazon = pad_sequences(X_test_amazon, padding='post', maxlen=maxlen)

print(X_train_amazon.shape,X_test_amazon.shape)
print(y_train_amazon.shape,y_test_amazon.shape, "\n")

index=5
print("The review:\n",sentences_train_amazon[index])
print("\nThe final feature vector:\n",X_train_amazon[index, :])

# **Train the Embedded Model**

Now you can use the Embedding Layer of Keras which takes the previously calculated integers and maps them to a dense vector of the embedding. <br>
You will need the following parameters:<br>

>input_dim: the size of the vocabulary<br>
output_dim: the size of the dense vector<br>
input_length: the length of the sequence<br>

For each input sequence, the output of the Embedding layer is a 2D matrix with one embedding vector for each word in the sequence (input document).

To connect a Dense layer directly to an Embedding layer, you must first flatten the 2D output matrix to a 1D vector using the Flatten layer.
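
To see the shapes involved, here is a minimal NumPy sketch of what the Embedding and Flatten layers do to a single padded review. The sizes and the random `embedding_matrix` are toy assumptions. In Keras the matrix is a trainable weight, not fixed random numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, maxlen = 8, 4, 5  # toy sizes, not the notebook's

# One row per word index; row 0 is used by the padding index
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

padded_review = np.array([3, 1, 6, 0, 0])   # one padded review of length maxlen
embedded = embedding_matrix[padded_review]  # row lookup: (maxlen, embedding_dim)
flattened = embedded.reshape(-1)            # Flatten: (maxlen * embedding_dim,)

print(embedded.shape, flattened.shape)
```

The Dense layer that follows sees only the flattened vector, so each weight is tied to a fixed position in the review.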

In [None]:
embedding_dim = 50
input_dim_yelp=vocab_size_yelp

model_yelp = Sequential()
model_yelp.add(layers.Embedding(input_dim_yelp, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model_yelp.add(layers.Flatten())
model_yelp.add(layers.Dense(10, activation='relu'))
model_yelp.add(layers.Dense(1, activation='sigmoid'))
model_yelp.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

print("input dim=",input_dim_yelp)
print("output dim of embedding layer=",embedding_dim)
print("input length = ", maxlen)
model_yelp.summary()

In [None]:
embedding_dim = 30
input_dim_amazon=vocab_size_amazon

model_amazon = Sequential()
model_amazon.add(layers.Embedding(input_dim_amazon, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model_amazon.add(layers.Flatten())
model_amazon.add(layers.Dense(10, activation='relu'))
model_amazon.add(layers.Dense(1, activation='sigmoid'))
model_amazon.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

print("input dim=",input_dim_amazon)
print("output dim of embedding layer=",embedding_dim)
print("input length = ", maxlen)
model_amazon.summary()

In [None]:
print(X_train_yelp.shape,X_test_yelp.shape)
print(y_train_yelp.shape,y_test_yelp.shape)

history_yelp = model_yelp.fit(X_train_yelp, y_train_yelp,
                    epochs=10,
                    verbose=False,
                    validation_data=(X_test_yelp,y_test_yelp),
                    batch_size=10)

loss_yelp, accuracy_yelp = model_yelp.evaluate(X_train_yelp, y_train_yelp, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy_yelp))
loss_yelp, accuracy_yelp = model_yelp.evaluate(X_test_yelp, y_test_yelp, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy_yelp))
plot_history(history_yelp)


In [None]:
print(X_train_amazon.shape,X_test_amazon.shape)
print(y_train_amazon.shape,y_test_amazon.shape)

history_amazon = model_amazon.fit(X_train_amazon, y_train_amazon,
                    epochs=10,
                    verbose=False,
                    validation_data=(X_test_amazon,y_test_amazon),
                    batch_size=10)

loss_amazon, accuracy_amazon = model_amazon.evaluate(X_train_amazon, y_train_amazon, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy_amazon))
loss_amazon, accuracy_amazon = model_amazon.evaluate(X_test_amazon, y_test_amazon, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy_amazon))
plot_history(history_amazon)

Flattening the embeddings is typically not a very reliable way to work with sequential data, as you can see in the performance. When working with sequential data, you want to focus on methods that look at local and sequential information instead of absolute positional information.



---



# **Use a MaxPooling Layer**

Another way to work with embeddings is by using a MaxPooling1D/AveragePooling1D or a GlobalMaxPooling1D/GlobalAveragePooling1D layer after the embedding. You can think of the pooling layers as a way to downsample (a way to reduce the size of) the incoming feature vectors.

In the case of max pooling you take the maximum value of all features in the pool for each feature dimension. In the case of average pooling you take the average, but max pooling seems to be more commonly used as it highlights large values.

Global max/average pooling takes the maximum/average over the whole sequence, whereas the other pooling layers require you to define a pool size. Keras provides its own layers for both, which you can add to the sequential model.

Global max pooling = ordinary max pooling layer with pool size equal to the size of the input.<br>

Advantages of Global Pooling:
* It is more native to the convolution structure because it enforces correspondences between feature maps and categories.
* Global average pooling has no parameters to optimize, so overfitting is avoided at this layer.
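
The difference between global and ordinary pooling can be sketched with NumPy on one toy embedded review. The numbers are made up, and the reshape trick for a pool size of 2 is only an illustration of the idea, not how Keras implements its pooling layers.

```python
import numpy as np

# Toy embedded review: 4 time steps (words) x 3 embedding features
embedded = np.array([[0.1, -0.5, 0.3],
                     [0.9, 0.2, -0.4],
                     [-0.2, 0.7, 0.0],
                     [0.4, 0.1, 0.6]])

# Global pooling: collapse the whole sequence to one value per feature
global_max = embedded.max(axis=0)   # shape (3,)
global_avg = embedded.mean(axis=0)  # shape (3,)

# Ordinary max pooling with pool size 2: a shorter sequence remains
pooled = embedded.reshape(2, 2, 3).max(axis=1)  # shape (2, 3)

print(global_max, global_avg, pooled.shape)
```

Whatever the review length, the global result has one value per embedding feature, which is why the Dense layer after `GlobalMaxPool1D` needs far fewer weights than the one after `Flatten`.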

In [None]:
embedding_dim = 50
input_dim=vocab_size_yelp

model_yelp = Sequential()
model_yelp.add(layers.Embedding(input_dim, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model_yelp.add(layers.GlobalMaxPool1D())
model_yelp.add(layers.Dense(10, activation='relu'))
model_yelp.add(layers.Dense(1, activation='sigmoid'))
model_yelp.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model_yelp.summary()


In [None]:
history_yelp = model_yelp.fit(X_train_yelp, y_train_yelp,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test_yelp, y_test_yelp),
                    batch_size=10)
loss_yelp, accuracy_yelp = model_yelp.evaluate(X_train_yelp, y_train_yelp, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy_yelp))
loss_yelp, accuracy_yelp = model_yelp.evaluate(X_test_yelp, y_test_yelp, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy_yelp))
plot_history(history_yelp)

# **Assignment #11:** 
Use the Amazon dataset to train the model with a max pooling layer. 