<a href="https://colab.research.google.com/github/cagBRT/SentimentTextAnalysis/blob/master/Sentiment_Text_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Pre-requisites:**<br>
Python: a 2-day course is sufficient<br>
Keras: an introductory course is sufficient<br>
Logistic regression: a basic understanding<br>




# **Pre-work**
You will need to add a file to your Google Drive. <br>
1. Download the file to your computer. 
>Click on the link below, then click **Download**. <br>
The download can take as long as 15 minutes.<br>
The file to download is: [fileToAddToGoogleDrive](https://drive.google.com/open?id=1zJI1Xz-CgaQqX1UtBcOhUjEKWcSt6QK6)<br>
The file is large: 2 GB<br><br>


2. Upload the file to Google Drive:<br>
>Open Google Drive<br>
On the Drive menu, click on **New** >> **File Upload**<br>
Find the file on your computer, click on it and upload the file. 

The file is large, so the upload may take as long as 15 minutes.<br>
Once the file is on your Google Drive, you can delete it from your computer. 

The file is from a website: [English word vectors](https://fasttext.cc/docs/en/english-vectors.html)<br>
This page gathers several pre-trained word vectors trained using fastText.

# **Mount your Google Drive on this CoLab Notebook**

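Mount Google Drive so that the notebook can read the word-vector file uploaded in the pre-work, then clone the repository that holds the review data sets.

In [0]:
# Mount Google Drive (Colab only); the pretrained vectors are read from it later.
from google.colab import drive
drive.mount('/content/drive')

# Clone the entire repo.
!git clone -l -s https://github.com/cagBRT/SentimentTextAnalysis.git cloned-repo
%cd cloned-repo
!ls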

# **Import the libraries**

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

# Install TensorFlow
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
from tensorflow import keras

In [0]:
import pandas as pd

In [0]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping

# **Examine the data**

In [0]:
#!cat yelp_labelled.txt

In [0]:
filepath_dict = {'yelp':   'yelp_labelled.txt',
                 'amazon': 'amazon_cells_labelled.txt',
                 'imdb':   'imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
print(df.iloc[0])

# **Create a Bag of Words**
Create a bag of words (BoW) for vectorizing the text. 

In [0]:
john_words = ['John likes to run.', 'John hates to be cold.', 'John hates to be late.']

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

# min_df=1 keeps every word; an integer min_df of 0 is rejected by recent scikit-learn versions
vectorizer = CountVectorizer(min_df=1, lowercase=False)
vectorizer.fit(john_words)
vectorizer.vocabulary_

In [0]:
import pandas as pd

vectorizer.transform(john_words).toarray()

dfbow = pd.DataFrame()
dfbow['voc']= vectorizer.vocabulary_
dfbow.sort_values(by=['voc'])
cat_columns = ["voc"]
df_processed = pd.get_dummies(dfbow, prefix_sep="__",columns=cat_columns)
df_processed
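
What the vectorizer computes can be reproduced by hand. The sketch below is a simplified stand-in, not scikit-learn's implementation: the `tokenize` helper and its regular expression are assumptions chosen to mimic the vectorizer's default tokenization with `lowercase=False`.

```python
import re
from collections import Counter

john_words = ['John likes to run.', 'John hates to be cold.', 'John hates to be late.']

def tokenize(sentence):
    # Assumption: a token is a run of two or more word characters, case preserved.
    return re.findall(r"\b\w\w+\b", sentence)

# Vocabulary: every distinct token, sorted, mapped to a column index.
all_tokens = {token for sentence in john_words for token in tokenize(sentence)}
vocabulary = {word: idx for idx, word in enumerate(sorted(all_tokens))}

# One row per sentence, one column per vocabulary word, values are counts.
bow_matrix = []
for sentence in john_words:
    counts = Counter(tokenize(sentence))
    bow_matrix.append([counts.get(word, 0) for word in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row has one column per vocabulary word, so the second and third sentences differ only in the `cold` and `late` columns.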


# **Split the data into train and test sets**

Split the Yelp data into training and test sets<br>

[train_test_split](https://www.bitdegree.org/learn/train-test-split)

In [0]:
from sklearn.model_selection import train_test_split

df_yelp = df[df['source'] == 'yelp']

sentences = df_yelp['sentence'].values
y = df_yelp['label'].values

sentences_train, sentences_test, y_train, y_test = train_test_split(
   sentences, y, test_size=0.25, random_state=1000)
print(sentences_train[0])

# **Vectorize the training and test set**

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)

X_train = vectorizer.transform(sentences_train)
X_test  = vectorizer.transform(sentences_test)

check=0
print(sentences_train[check])
print(X_train[check])
# Each printed entry is (row index, vocabulary index) followed by the count of that word in the sentence

**The training set has:** <br>
750 examples<br>
1714 words in the vocabulary<br>

It is a sparse matrix

In [0]:
X_test

# **Create a logistic regression model and train it**

In [0]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)

print("Accuracy:", score)

# **Baseline: Perform logistic regression on all three data sets**<br>
yelp<br>
amazon<br>
imdb<br>

Get a baseline using logistic regression. This will give us something to compare with the other methods. 

In [0]:
for source in df['source'].unique():
    df_source = df[df['source'] == source]
    reviews = df_source['sentence'].values
    reviews_y = df_source['label'].values

    reviews_train, reviews_test, reviews_y_train, reviews_y_test = train_test_split(
        reviews, reviews_y, test_size=0.25, random_state=1000)

    vectorizer = CountVectorizer()
    vectorizer.fit(reviews_train)
    r_X_train = vectorizer.transform(reviews_train)
    r_X_test  = vectorizer.transform(reviews_test)

    classifier = LogisticRegression()
    classifier.fit(r_X_train, reviews_y_train)
    score = classifier.score(r_X_test, reviews_y_test)
    print('Accuracy for {} data: {:.4f}'.format(source, score))

# **Trial 1: Keras DNN**
Create a DNN using Keras. 
Compare it to the logistic regression using the same data. 

In [0]:
input_dim = X_train.shape[1]  # Number of features
print("model inputs = ", input_dim)
model = Sequential()
# Only the first layer needs the input dimension; later layers infer it.
model.add(layers.Dense(2500, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1000, activation='relu'))
model.add(layers.Dense(1000, activation='relu'))
# Sigmoid keeps the output in (0, 1), as binary cross-entropy expects.
model.add(layers.Dense(1, activation='sigmoid'))

In [0]:
model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.summary()

In [0]:
history = model.fit(X_train, y_train,
                    epochs=20,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)

In [0]:
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

In [0]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')

def plot_history(history):
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()

In [0]:
plot_history(history)

# **Word Embedding**
There are various ways to vectorize text, such as:
*   Words represented as a vector.
*   Characters represented as a vector


In this notebook, you’ll see how to represent words as vectors, which is the common way to use text in neural networks. Two possible ways to represent a word as a vector are:
*   one-hot encoding
*   word embeddings



**One-hot encoding**

In [0]:
cities = ['London', 'Berlin', 'Berlin', 'New York', 'London']
cities

In [0]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
city_labels = encoder.fit_transform(cities)

# Use a separate name so the review DataFrame `df` is not overwritten
df_cities = pd.DataFrame()
df_cities['cities']= cities
df_cities['city_labels']= city_labels
df_cities = df_cities.sort_values(by=['cities'])

cat_columns = ["city_labels"]
df_processed = pd.get_dummies(df_cities, prefix_sep="__",columns=cat_columns)
df_processed


In [0]:
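from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2-D array: one row per sample, one column per feature
encoder = OneHotEncoder()
city_labels = city_labels.reshape(-1, 1)
# The encoder returns a sparse matrix by default; convert it for display
encoder.fit_transform(city_labels).toarray()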
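
The same one-hot encoding can be sketched with NumPy alone, which makes the two steps (label each category with an integer, then turn each integer into a unit vector) explicit. This is an illustrative alternative, not what the scikit-learn encoders do internally.

```python
import numpy as np

cities = ['London', 'Berlin', 'Berlin', 'New York', 'London']

# np.unique returns the sorted distinct values and, with return_inverse=True,
# the integer label of each original entry.
categories, labels = np.unique(cities, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for label i.
one_hot = np.eye(len(categories), dtype=int)[labels]

print(categories)
print(labels)
print(one_hot)
```

Every row contains a single 1, and the vector length equals the number of distinct categories, which is why one-hot vectors grow with the vocabulary.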

**Word embedding**<br>
Word embedding has fewer dimensions than one-hot encoding<br>
Word embedding places similar words near each other<br>



This method represents words as dense word vectors (also called word embeddings). Unlike one-hot encodings, which are hardcoded, word embeddings are trained. This means that word embeddings pack more information into fewer dimensions.

Note that word embeddings do not understand the text as a human would; rather, they map the statistical structure of the language used in the corpus. Their aim is to map semantic meaning into a geometric space, which is called the embedding space.<br>

Semantically similar words, such as numbers or colors, end up close to each other in the embedding space. If the embedding captures the relationships between words well, vector arithmetic becomes possible. A famous example is King - Man + Woman = Queen.
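
The vector arithmetic can be illustrated with a toy example. The four-dimensional vectors below are invented by hand purely for illustration (real embeddings are learned and have far more dimensions), and the `cosine` helper is a hypothetical name.

```python
import numpy as np

# Hand-made toy vectors, NOT learned embeddings.
# Dimensions: royalty, male, female, fruit.
toy_vectors = {
    'king':  np.array([1.0, 1.0, 0.0, 0.0]),
    'queen': np.array([1.0, 0.0, 1.0, 0.0]),
    'man':   np.array([0.0, 1.0, 0.0, 0.0]),
    'woman': np.array([0.0, 0.0, 1.0, 0.0]),
    'apple': np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

target = toy_vectors['king'] - toy_vectors['man'] + toy_vectors['woman']

# Rank the words that were not part of the query by similarity to the target.
query_words = ('king', 'man', 'woman')
candidates = {w: v for w, v in toy_vectors.items() if w not in query_words}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(nearest)
```

With learned embeddings the match is only approximate, so the nearest neighbour is found by similarity ranking rather than exact equality.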

How can you get such a word embedding? <br>
You have two options for this. 

>1. Train your word embeddings during the training of your neural network. <br>
>2. Use pretrained word embeddings which you can directly use in your model. You can leave these word embeddings unchanged during training or you can train them.<br><br>

Now you need to tokenize the data into a format that can be used by the word embeddings. <br><br>
Keras offers a couple of convenience methods for text preprocessing and sequence preprocessing which you can employ to prepare your text.<br>

[Keras Tokenizer ](https://keras.io/preprocessing/text/)

In [0]:
# Import from tensorflow.keras to stay consistent with the model imports above
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=3000) #keep 3000 words

#Updates internal vocabulary based on a list of texts
#Must be run before running texts_to_sequences
tokenizer.fit_on_texts(sentences_train)

X_train = tokenizer.texts_to_sequences(sentences_train)
X_test = tokenizer.texts_to_sequences(sentences_test)
vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index

print("vocab size=", vocab_size)
number = 0
print(sentences_train[number])
print(X_train[number])

Indexing starts with the most common word ("the"), which gets index 1. <br>
It is important to note that the index 0 is reserved and is not assigned to any word. 
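
How such a frequency-ranked index comes about can be sketched in plain Python. This is a simplified stand-in for the tokenizer, not the Keras implementation: the `texts` list is made up, `to_sequence` is a hypothetical helper, and splitting on whitespace ignores the punctuation filtering that the real tokenizer applies.

```python
from collections import Counter

texts = ['the food was good', 'the service was bad', 'the place']

# Count every lowercased, whitespace-separated word across all texts.
counts = Counter(word for text in texts for word in text.lower().split())

# The most frequent word gets index 1; index 0 stays reserved for padding.
word_index = {word: rank
              for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, num_words=None):
    # Words missing from the index are dropped, as are words whose
    # index is num_words or higher when num_words is set.
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    return [i for i in ids if num_words is None or i < num_words]

print(word_index)
print(to_sequence('the food was bad'))
print(to_sequence('the service was good', num_words=5))
```

Note that the index itself contains every word seen during fitting; limiting the number of words only affects which indices survive in the generated sequences.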

In [0]:
# Looking up a word that is not in the training texts
# raises a KeyError
for word in ['the', 'all', 'bad', 'terrible','horrible','lost','lukewarm','bacon']: 
    print('{}: {}'.format(word, tokenizer.word_index[word]))

**Find similar words with gensim**

In [0]:
import gensim.downloader as api
word_vectors = api.load("glove-wiki-gigaword-100")
result = word_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

**Pad the sequence of words**

One problem is that the text sequences usually have different lengths. To counter this, you can use pad_sequences(), which pads each sequence with zeros. By default it prepends zeros, but here we append them. Typically it does not matter whether you prepend or append zeros.

Additionally, you can set the maxlen parameter to specify how long the sequences should be. Sequences longer than maxlen are cut.

For a fairly short sentence, the resulting feature vector contains mostly zeros. 
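
The effect of padding and cutting can be sketched without Keras. `pad_post` is a hypothetical helper written for this illustration; it appends zeros and, as an assumption mirroring the Keras default `truncating='pre'`, drops tokens from the front of sequences that are too long.

```python
def pad_post(sequences, maxlen):
    """Pad with trailing zeros; cut over-long sequences from the front."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                    # keep the last maxlen tokens
        padded.append(seq + [0] * (maxlen - len(seq)))
    return padded

short_and_long = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
result = pad_post(short_and_long, maxlen=4)
print(result)  # [[1, 2, 3, 0], [6, 7, 8, 9]]
```

After this step every sequence has the same length, so the whole data set fits into one rectangular array.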

In [0]:
# Import from tensorflow.keras to stay consistent with the model imports above
from tensorflow.keras.preprocessing.sequence import pad_sequences
#The maximum length of a review 
maxlen = 100
#If a review is less than 100 words, pad the vector with 0s.

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

print(X_train.shape,X_test.shape)
print(y_train.shape,y_test.shape)

index=0
print(sentences_train[index])
print(X_train[index, :])

Now you can use the Embedding Layer of Keras which takes the previously calculated integers and maps them to a dense vector of the embedding. <br>
You will need the following parameters:<br>

>input_dim: the size of the vocabulary<br>
output_dim: the size of the dense vector<br>
input_length: the length of the sequence<br>
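
Conceptually, an embedding layer is a lookup table: each integer in the padded sequence selects one row of a weight matrix. The NumPy sketch below shows how the three parameters determine the shapes; the sizes are toy values chosen for illustration, and the random weights stand in for values that a real layer would learn.

```python
import numpy as np

vocab_size, embedding_dim, maxlen = 6, 3, 4    # toy sizes, not the notebook's values

rng = np.random.default_rng(0)
# One row per word index; random here, learned during training in a real layer.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

padded_sequence = np.array([2, 5, 1, 0])       # word indices, 0 is padding

# Integer indexing selects one row per token.
embedded = embedding_matrix[padded_sequence]
print(embedded.shape)  # (4, 3)
```

A batch of sequences therefore becomes a three-dimensional array: (batch size, sequence length, embedding dimension).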

In [0]:
embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

print("input dim (vocabulary size)=", vocab_size)
print("output dim of embedding layer=",embedding_dim)
model.summary()

In [0]:
print(X_train.shape,X_test.shape)
print(y_train.shape,y_test.shape)
history = model.fit(X_train, y_train,
                    epochs=10,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)


As the performance shows, this is typically not a very reliable way to work with sequential data. When working with sequential data you want to focus on methods that look at local and sequential information instead of absolute positional information.

Another way to work with embeddings is by using a MaxPooling1D/AveragePooling1D or a GlobalMaxPooling1D/GlobalAveragePooling1D layer after the embedding. You can think of the pooling layers as a way to downsample (a way to reduce the size of) the incoming feature vectors.

In the case of max pooling you take the maximum value of all features in the pool for each feature dimension. In the case of average pooling you take the average, but max pooling seems to be more commonly used as it highlights large values.
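
The difference between the pooling variants can be sketched with NumPy on a tiny feature array. The numbers are made up for illustration, and the reshape trick only works because the sequence length is a multiple of the pool size.

```python
import numpy as np

# Toy output of an embedding layer: 4 time steps, 2 feature dimensions.
features = np.array([[1.0, 8.0],
                     [3.0, 2.0],
                     [5.0, 4.0],
                     [7.0, 6.0]])

# Pool size 2: group consecutive time steps in pairs, then reduce each pair.
windows = features.reshape(2, 2, 2)            # (n_windows, pool_size, n_features)
max_pooled = windows.max(axis=1)
avg_pooled = windows.mean(axis=1)

# Global pooling: reduce over all time steps at once, per feature dimension.
global_max = features.max(axis=0)
global_avg = features.mean(axis=0)

print(max_pooled)
print(avg_pooled)
print(global_max, global_avg)
```

Windowed pooling shortens the sequence, while global pooling collapses it to a single vector with one value per feature dimension.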

Global max/average pooling takes the maximum/average over the whole sequence for each feature dimension, whereas the non-global layers require you to define a pool size. Keras provides these as layers that you can add to the sequential model:

In [0]:
embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()


In [0]:
history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

Another option is to use a precomputed embedding space that was trained on a much larger corpus. Word embeddings can be precomputed by training them on a large corpus of text. Among the most popular methods are Word2Vec, developed by Google, and GloVe (Global Vectors for Word Representation), developed by the Stanford NLP Group. This notebook uses the fastText vectors downloaded in the pre-work.<br>

Word2Vec achieves this by employing neural networks, while GloVe uses a co-occurrence matrix and matrix factorization. In both cases you are dealing with dimensionality reduction; Word2Vec is often described as more accurate and GloVe as faster to compute.


In [0]:
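import numpy as np
def create_embedding_matrix(filepath, word_index, embedding_dim):
    """Build a (vocab_size, embedding_dim) matrix from a text file of word vectors."""
    vocab_size = len(word_index) + 1  # Adding again 1 because of reserved 0 index
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    # Each line holds a word followed by its vector components
    with open(filepath, encoding='utf8') as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word] 
                # Keep only the first embedding_dim components of the vector
                embedding_matrix[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix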
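
To see what the function produces without loading the 2 GB file, it can be run on a tiny stand-in. Everything in this sketch is made up for illustration: the toy file contents, the `toy_word_index` dictionary, and the temporary file. The function is repeated so that the example runs on its own.

```python
import os
import tempfile
import numpy as np

def create_embedding_matrix(filepath, word_index, embedding_dim):
    vocab_size = len(word_index) + 1           # row 0 is reserved for padding
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    with open(filepath, encoding='utf8') as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                embedding_matrix[word_index[word]] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]
    return embedding_matrix

# Toy file: a "count dimension" header line, then one word and its vector per line.
toy_vec = ("3 4\n"
           "good 0.1 0.2 0.3 0.4\n"
           "bad -0.1 -0.2 -0.3 -0.4\n"
           "the 0.5 0.5 0.5 0.5\n")
toy_word_index = {'the': 1, 'good': 2, 'lukewarm': 3}   # hypothetical tokenizer index

with tempfile.NamedTemporaryFile('w', suffix='.vec', delete=False,
                                 encoding='utf8') as tmp:
    tmp.write(toy_vec)
    toy_path = tmp.name

matrix = create_embedding_matrix(toy_path, toy_word_index, embedding_dim=2)
os.remove(toy_path)

# Share of matrix rows that received a pretrained vector.
coverage = np.count_nonzero(np.count_nonzero(matrix, axis=1)) / len(matrix)
print(matrix)
print(coverage)
```

Words without a pretrained vector (here `lukewarm`) keep an all-zero row, as does the reserved row 0, so both lower the coverage figure.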

# **CHECK: DOES THE FOLLOWING CODE CELL EXECUTE WITHOUT ERRORS**

In [0]:
embedding_dim = 50
embedding_matrix = create_embedding_matrix(
    '/content/drive/My Drive/ZUp/wiki-news-300d-1M.vec',
    tokenizer.word_index, embedding_dim)

Fraction of the vocabulary covered by the pretrained model

In [0]:
nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1))
nonzero_elements / vocab_size

# **Trial 3: Embedded DNN**

In [0]:
model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix], 
                           input_length=maxlen, 
                           trainable=False))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

In [0]:
history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

In [0]:
model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix], 
                           input_length=maxlen, 
                           trainable=True))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

In [0]:
history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

In [0]:
embedding_dim = 100

model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(layers.Conv1D(128, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()


In [0]:
history = model.fit(X_train, y_train,
                    epochs=10,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

In [0]:
def create_model(num_filters, kernel_size, vocab_size, embedding_dim, maxlen):
    model = Sequential()
    model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
    model.add(layers.Conv1D(num_filters, kernel_size, activation='relu'))
    model.add(layers.GlobalMaxPooling1D())
    model.add(layers.Dense(10, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',metrics=["acc"])
    return model

**Embedding dimension**

In [0]:
param_grid = dict(num_filters=[32, 64, 128],
                  kernel_size=[3, 5, 7],
                  vocab_size=[5000], 
                  embedding_dim=[50],
                  maxlen=[100])
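
It can help to see how many candidate models such a grid describes. The sketch below enumerates the combinations with the standard library and then draws a random sample of five, which is how many candidates the randomized search later in the notebook tries (`n_iter=5`). The sampling here is only an illustration, not how scikit-learn selects candidates internally.

```python
import itertools
import random

param_grid = dict(num_filters=[32, 64, 128],
                  kernel_size=[3, 5, 7],
                  vocab_size=[5000],
                  embedding_dim=[50],
                  maxlen=[100])

# Every combination of the listed values: 3 * 3 * 1 * 1 * 1 = 9 candidates.
keys = list(param_grid)
combinations = [dict(zip(keys, values))
                for values in itertools.product(*param_grid.values())]

# A randomized search evaluates only a sample of the candidates.
random.seed(0)
sampled = random.sample(combinations, k=5)

print(len(combinations))
for candidate in sampled:
    print(candidate)
```

Only `num_filters` and `kernel_size` vary, so the search explores the convolution settings while the vocabulary size, embedding dimension and sequence length stay fixed.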

# **HyperParameter Grid Search of each text set**

In [0]:
filepath_dict = {'yelp':   'yelp_labelled.txt',
                 'amazon': 'amazon_cells_labelled.txt',
                 'imdb':   'imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)

In [0]:
# Import from tensorflow.keras to match the layers used in create_model
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV

# Main settings
epochs = 20
embedding_dim = 50
maxlen = 100
output_file = '/content/drive/My Drive/output.txt'

# Run a randomized hyperparameter search for each source (yelp, amazon, imdb)
for source, frame in df.groupby('source'):
    print('Running grid search for data set :', source)
    # Use only the rows of the current source, not the whole DataFrame
    sentences = frame['sentence'].values
    y = frame['label'].values

    # Train-test split
    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, y, test_size=0.25, random_state=1000)

    # Tokenize words
    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts(sentences_train)
    X_train = tokenizer.texts_to_sequences(sentences_train)
    X_test = tokenizer.texts_to_sequences(sentences_test)

    # Adding 1 because of reserved 0 index
    vocab_size = len(tokenizer.word_index) + 1

    # Pad sequences with zeros
    X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
    X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

    # Parameter grid for grid search
    param_grid = dict(num_filters=[32, 64, 128],
                      kernel_size=[3, 5, 7],
                      vocab_size=[vocab_size],
                      embedding_dim=[embedding_dim],
                      maxlen=[maxlen])
    model = KerasClassifier(build_fn=create_model,
                            epochs=epochs, batch_size=10,
                            verbose=False)
    grid = RandomizedSearchCV(estimator=model, param_distributions=param_grid,
                              cv=4, verbose=1, n_iter=5)
    grid_result = grid.fit(X_train, y_train)

    # Evaluate testing set
    test_accuracy = grid.score(X_test, y_test)

    # Save and evaluate results
    #prompt = input(f'finished {source}; write to file and proceed? [y/n]')
    #if prompt.lower() not in {'y', 'true', 'yes'}:
    #    break
    with open(output_file, 'a') as f:
        s = ('Running {} data set\nBest Accuracy : '
             '{:.4f}\n{}\nTest Accuracy : {:.4f}\n\n')
        output_string = s.format(
            source,
            grid_result.best_score_,
            grid_result.best_params_,
            test_accuracy)
        print(output_string)
        f.write(output_string)