<a href="https://colab.research.google.com/github/cagBRT/SentimentTextAnalysis/blob/master/Sentiment_Text_Analysis_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Clone the entire repo.
%cd /content/
!git clone  https://github.com/cagBRT/SentimentTextAnalysis.git cloned-repo
%cd cloned-repo
!ls

In [None]:
from IPython.display import Image
def page(num):
    return Image("images/sentTextAna"+str(num)+ ".png" , width=600)

# **Pre-requisites:**<br>
Python - a 2-day course is sufficient<br>
Keras - an intro course is sufficient<br>
Logistic regression - a basic understanding<br>
Deep neural networks - an intro course is sufficient<br>
Hyperparameter tuning - an intro course is sufficient<br>



# **Pre-work**
**Do this step before coming to class. It may take as long as 30 minutes to complete the download and upload.**<br>
<br>
You will need to add a file to your Google Drive. <br>
1. Download the file to your computer. 
>Click on the link below. Then click **Download**. <br>
The download can take as long as 15 minutes.<br>
The file to download is: [fileToAddToGoogleDrive](https://drive.google.com/open?id=1zJI1Xz-CgaQqX1UtBcOhUjEKWcSt6QK6)<br>
The file is large: 2 GB<br><br>


2. Upload the file to Google Drive:<br>
>Open Google Drive<br>
On the Drive menu, click on **New** >> **File Upload**<br>
Find the file on your computer, click on it and upload the file. 

The file is large; the upload may take as long as 15 minutes.<br>
Once the file is on you<br><br>
The file comes from the fastText website: [English word vectors](https://fasttext.cc/docs/en/english-vectors.html)<br>
That page gathers several pre-trained word vectors trained using fastText.

# **Sentiment Text Analysis**<br>
This notebook introduces basic machine learning methods for determining whether Yelp, Amazon, and IMDb reviews are positive or negative. <br>
The following topics are covered in this notebook: 
1. Create and train a baseline model.
2. Create and train a simple DNN model.
3. Add an embedding layer to a simple DNN model.
4. Add a MaxPooling layer to an embedding model.
5. Use a pretrained embedding model.
6. Create a CNN model with an embedding layer.
7. Do hyperparameter tuning using random search.

It does not cover recurrent neural networks or LSTM or GRU. 

In [None]:
page(1)

**What is sentiment analysis?**<br>
Sentiment analysis is a branch of natural language processing that identifies, extracts, evaluates, and characterizes the sentiment expressed in text, such as online reviews. Its results are often used to support decision making.<br>

Sentiment analysis assumes various forms, from models that focus on polarity (positive, negative, neutral) to those that detect feelings and emotions (angry, happy, sad, etc), or even models that identify intentions (e.g. interested v. not interested).<br><br>

It’s estimated that 80% of the world’s data is unstructured; in other words, it’s unorganized. Huge amounts of text data (emails, support tickets, chats, social media conversations, surveys, articles, documents, etc.) are created every day, but this data is hard to analyze, understand, and sort through, not to mention time-consuming and expensive to process.

Sentiment analysis, however, helps businesses make sense of all this unstructured text by automatically tagging it.



**Benefits of sentiment analysis**<br>
Benefits of sentiment analysis include:

1. Processes data at scale: Sentiment analysis helps businesses process huge amounts of data in an efficient and cost-effective way.

2. Real-time analysis: Sentiment analysis can identify critical issues in real time. For example, is a PR crisis on social media escalating? Is an angry customer about to churn? Sentiment analysis models can immediately identify these types of situations from text. 

3. Consistent criteria: Tagging text by sentiment is highly subjective, influenced by personal experiences, thoughts, and beliefs. It’s estimated that people only agree around 60-65% of the time when determining the sentiment of a particular text. By using a centralized sentiment analysis system, companies can apply the same criteria to all of their data, helping them improve accuracy and gain better insights.

# **Mount your Google Drive on this CoLab Notebook**
Execute the following code cell<br>
Click on the given link<br>
Select your user name<br>
Click **Allow**<br>
Copy the authorization code<br>
Paste the authorization code into the user input box. <br>
Your Google Drive is now mounted to this notebook. 

In [None]:
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

# **Check that your drive is mounted**
1. On the menu bar, click the **folder icon**<br>
2. Click on the **folder icon with the up arrow**
3. Click on **gdrive**
4. Click on **My Drive**
5. Check for the file called **wiki-news-300d-1M.vec**<br>
If the file is there, you have correctly installed the necessary files for this notebook. <br>




# **scikit-learn**
This notebook uses the [scikit-learn](https://scikit-learn.org/stable/) library.<br>
Scikit-learn is a free software machine learning library for Python.<br>
* Simple and efficient tools for predictive data analysis
* Accessible to everybody, and reusable in various contexts
* Built on NumPy, SciPy, and matplotlib
* Open source, commercially usable - BSD license




# **Import the libraries**

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

# Install TensorFlow
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
from tensorflow import keras

In [None]:
import pandas as pd

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping

# **Examine the data**<br>
The data is from three sources: <br>
> yelp reviews<br>
> amazon reviews<br>
> movie reviews<br>

The data has the structure: <br>
>"review text" label source<br>

**review text is called**: sentence<br>
**label**: 0 = negative review, 1 = positive review<br>
**source**: yelp, amazon, imdb
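
The layout described above can be tried out on a tiny in-memory sample before touching the real files. This is a minimal sketch: the two sample lines and the names `sample` and `toy_df` are made up for illustration, and it assumes the files are tab-separated with no header row.

```python
import io
import pandas as pd

# Hypothetical sample lines in the "sentence<TAB>label" layout (not real reviews).
sample = "Great food and friendly staff.\t1\nThe battery died after a week.\t0\n"

# names= supplies the column headers, assuming the data has no header row.
toy_df = pd.read_csv(io.StringIO(sample), names=['sentence', 'label'], sep='\t')

# Tag every row with where it came from, like the yelp/amazon/imdb source column.
toy_df['source'] = 'toy'

print(toy_df)
```

Each row ends up with the three fields listed above: the sentence, its 0/1 label, and its source.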

In [None]:
#!cat yelp_labelled.txt
#Change directory to the cloned repo
%cd /content/cloned-repo/

In [None]:
#create a dataframe containing all three sources
filepath_dict = {'yelp':   'yelp_labelled.txt',
                 'amazon': 'amazon_cells_labelled.txt',
                 'imdb':   'imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
print(df.iloc[0])
print("dataframe shape: ",df.shape)
df['label'].value_counts()

# **Create a Bag of Words**
The bag-of-words model is a word/sentence representation used in NLP and information retrieval. <br>
In this model, text is represented as a bag of its words, *disregarding grammar and even word order*.

Create a bag of words (BoW) for vectorizing the text. <br>
Take the data and create a vocabulary from all the words in all reviews.<br>
The collection of texts is also called a **corpus** in NLP.<br>
The **vocabulary** in this case is a list of words that occurred in our text where each word has its own index.<br>
The resulting vector is also called a **feature vector**.
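
To make the terms corpus, vocabulary, and feature vector concrete, here is a minimal hand-written sketch in plain Python. The toy corpus and the helper `to_feature_vector` are hypothetical, and splitting on whitespace is a simplification; scikit-learn's `CountVectorizer` tokenizes differently (for example, it ignores punctuation and single-character tokens by default).

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
corpus = ["the food was good", "the service was not good", "good good food"]

# Vocabulary: every distinct word in the corpus, each with its own index.
words = sorted({word for text in corpus for word in text.split()})
vocabulary = {word: index for index, word in enumerate(words)}

def to_feature_vector(text):
    """Count how often each vocabulary word occurs in text."""
    counts = Counter(text.split())
    return [counts[word] for word in words]

feature_vectors = [to_feature_vector(text) for text in corpus]

print(vocabulary)
for text, vector in zip(corpus, feature_vectors):
    print(vector, text)
```

Every feature vector has one entry per vocabulary word, and word order within a sentence plays no role in the result.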

In [None]:
page(2)

# **Bag of Words Example:** <br>
# **Create a bag of words (BoW) for vectorizing the text**

Create a toy set of words to demonstrate BoW.<br>

In [None]:
#1. The words
john_words = ['Sam likes to run.', 
              'John hates to be cold and he hates to run.', 
              'John hates to be late.'
            ]

In [None]:
#2. Create the bag of words
#Each word is assigned an index in sorted order (uppercase sorts before lowercase)
from sklearn.feature_extraction.text import CountVectorizer

#min_df=1 (the default) keeps every word; recent scikit-learn releases reject an integer min_df of 0
vectorizer = CountVectorizer(min_df=1, lowercase=False)
vectorizer.fit(john_words)
vocab = vectorizer.vocabulary_
vocab

In [None]:
#3. Convert the sentences into vectors
#each entry is the number of times the corresponding vocabulary word occurs in the sentence
dfbow = pd.DataFrame()
dfbow['sentences'] = john_words
vector = vectorizer.transform(john_words).toarray().tolist()
dfbow['vector'] = vector
print(vocab)
dfbow

Notice in the above output: 
1. All the words in all the sentences created the vocabulary.<br>
2. Each sentence is turned into a vector. <br>
3. The length of the vector is the length of the vocabulary.<br>
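4. Each entry in the vector is the number of times the word occurs in the sentence; a '0' means the word is not used. For example, 'hates' and 'to' each appear twice in the second sentence, so their entries are 2. 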
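
The vector entries are occurrence counts, so a word that appears twice gets a 2. If only presence or absence matters, `CountVectorizer` accepts `binary=True`. The sketch below compares the two on one sentence from the toy set; the names `count_vec` and `binary_vec` are introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ['John hates to be cold and he hates to run.']

# Default behaviour: each entry is the number of occurrences of the word.
count_vec = CountVectorizer(lowercase=False).fit(sentence)
# binary=True: each entry is 1 if the word occurs at all, otherwise 0.
binary_vec = CountVectorizer(lowercase=False, binary=True).fit(sentence)

counts = count_vec.transform(sentence).toarray()[0]
presence = binary_vec.transform(sentence).toarray()[0]

print(count_vec.vocabulary_)
print(counts)    # occurrence counts
print(presence)  # presence only
```

Whether counts or presence flags work better depends on the data; this notebook uses counts throughout.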

# **Assignment 1**: <br>
1. Write 5 sentences. 
2. Create a BoW from the sentences. 
3. Vectorize the sentences. 

Although the bag-of-words model is one of the most widely used techniques for sentiment analysis, it has three major weaknesses: <br>
* the larger the vocabulary, the larger the feature vector
* it assumes each word is independent of other words
* the feature vectors are extremely sparse

A **sparse matrix**  is a matrix in which most of the elements are zero<br>
A **dense matrix**  is a matrix in which most of the elements are nonzero
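
The difference can be illustrated with SciPy, whose sparse matrix format is also what `CountVectorizer.transform` returns. This is a small sketch with made-up counts; the array values and the names `dense`, `sparse`, and `density` are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical bag-of-words counts: 3 sentences over an 8-word vocabulary.
dense = np.array([
    [1, 0, 0, 0, 2, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1],
])

# The CSR format stores only the nonzero entries and their positions.
sparse = csr_matrix(dense)

# Density: the fraction of entries that are nonzero.
density = sparse.nnz / dense.size

print("stored values:", sparse.nnz, "of", dense.size)
print("density: {:.2f}".format(density))
```

With a realistic vocabulary of thousands of words and short reviews, the density is far lower than in this toy case, which is why the sparse representation saves memory.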


# **Split the review data into train and test sets**

Split the Yelp data into training and test sets.<br>

[train_test_split](https://www.bitdegree.org/learn/train-test-split)
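
As a rough sketch of how `train_test_split` behaves, the example below splits eight made-up items 75/25 and repeats the split with the same `random_state` to show that the result is reproducible. The items, labels, and variable names are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 8 items with alternating labels.
items = ["review {}".format(i) for i in range(8)]
labels = [i % 2 for i in range(8)]

# test_size=0.25 holds out a quarter of the rows; random_state fixes the shuffle.
first = train_test_split(items, labels, test_size=0.25, random_state=1000)
second = train_test_split(items, labels, test_size=0.25, random_state=1000)

items_train, items_test, labels_train, labels_test = first

print(len(items_train), "training rows,", len(items_test), "test rows")
print("same split both times:", first == second)
```

Fixing `random_state` matters when comparing models, because every model is then evaluated on the same held-out rows.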

In [None]:
from sklearn.model_selection import train_test_split
#select the rows of the data set that are from yelp
df_yelp = df[df['source'] == 'yelp']

sentences = df_yelp['sentence'].values
y = df_yelp['label'].values

#do a 75 - 25 split between train and test data
#If int, random_state is the seed used by the random number generator; 
#If RandomState instance, random_state is the random number generator; 
#If None, the random number generator is the RandomState instance used by np.random.
sentences_train, sentences_test, y_train, y_test = train_test_split(
   sentences, y, test_size=0.25, random_state=1000)

#print out the first sentence of the training set
print(sentences_train[0])

# **Vectorize the training and test set**
Vectorize the data: <br>

Assign each word a number.<br>
Count the number of times each word appears in the individual review text. 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
#1. create a BoW vocabulary from the words in the Yelp training set
vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)
vocab = vectorizer.vocabulary_
vocab = pd.Series(vocab)
#2. vectorize the training and test sentences with that vocabulary
X_train = vectorizer.transform(sentences_train)
X_test  = vectorizer.transform(sentences_test)
print("training data: ", X_train.shape,"\ntest data: ", X_test.shape)

What has been done so far: 
1. Created a vocabulary from all the words used in the Yelp training reviews.
2. Assigned each word a number.<br>

Now check the vectorization of the sentences in the yelp review training and test data. 
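
One way to sanity-check a vectorized sentence is to map it back to words with `inverse_transform`. The sketch below uses hypothetical toy reviews (`toy_train`, `toy_test`) rather than the Yelp data, and also shows that a word seen only in the test set is dropped, because the vocabulary is built from the training set alone.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, used only for illustration.
toy_train = ["the food was good", "the service was slow"]
toy_test = ["the pizza was good"]  # "pizza" never appears in toy_train

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(toy_train)
toy_X_test = toy_vectorizer.transform(toy_test)

# inverse_transform lists the vocabulary words present in each row.
recovered = toy_vectorizer.inverse_transform(toy_X_test)[0]

print(sorted(recovered))
print("pizza in vocabulary:", "pizza" in toy_vectorizer.vocabulary_)
```

Words missing from the vocabulary contribute nothing to the feature vector, so a test sentence made up mostly of unseen words carries very little signal.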

In [None]:
#Select a number between 0 and 749
check=24
print(sentences_train[check])
print(X_train[check])
#Each printed line shows (row, word index) followed by the count of that word in the sentence

In [None]:
#Select a number between 0 and 249 (the test set is one third the size of the training set)
check=24
print(sentences_test[check])
print(X_test[check])
#Each printed line shows (row, word index) followed by the count of that word in the sentence

# **Assignment #2**: 
1. Split the Amazon reviews into a training and test set.
2. Create a BoW from the Amazon training reviews and vectorize both the training and test reviews. 
3. Print out the vectorization of two or three reviews from the test and training set. <br>
**Make sure your variable names for the Amazon BoW are different from those for the Yelp BoW.** 

# **Create a logistic regression model and train it**<br>

The [LogisticRegression class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) from scikit-learn implements regularized logistic regression. It can handle both dense and sparse input.
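
As a small illustration of the dense and sparse input support, the sketch below fits the same model on a dense NumPy array and on its sparse equivalent. The count vectors, the three-word vocabulary in the comment, and all variable names are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Hypothetical count vectors over a 3-word vocabulary, e.g. ["bad", "good", "food"].
X_dense = np.array([[0, 1, 1], [0, 2, 0], [1, 0, 1], [2, 0, 0]])
y = np.array([1, 1, 0, 0])  # 1 = positive review, 0 = negative review
X_sparse = csr_matrix(X_dense)

# C is the inverse regularization strength; 1.0 is scikit-learn's default.
dense_model = LogisticRegression(C=1.0).fit(X_dense, y)
sparse_model = LogisticRegression(C=1.0).fit(X_sparse, y)

new_review = np.array([[0, 1, 0]])  # contains only the word "good"
dense_prediction = dense_model.predict(new_review)
sparse_prediction = sparse_model.predict(csr_matrix(new_review))

print(dense_prediction, sparse_prediction)
print(dense_model.coef_)
```

The learned coefficients give one weight per vocabulary word, so a positive weight means the word pushes a review toward the positive class.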

In [None]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)

print("Accuracy:", score)

# **Assignment 3:**

Perform logistic regression on the Amazon data set.<br>


Get a baseline using logistic regression. This will give us something to compare with the other methods. 

In [None]:
#@title 
for source in df['source'].unique():
    df_source = df[df['source'] == source]
    reviews = df_source['sentence'].values
    reviews_y = df_source['label'].values

    reviews_train, reviews_test, reviews_y_train, reviews_y_test = train_test_split(
        reviews, reviews_y, test_size=0.25, random_state=1000)

    vectorizer = CountVectorizer()
    vectorizer.fit(reviews_train)
    r_X_train = vectorizer.transform(reviews_train)
    r_X_test  = vectorizer.transform(reviews_test)

    classifier = LogisticRegression()
    classifier.fit(r_X_train, reviews_y_train)
    score = classifier.score(r_X_test, reviews_y_test)
    print('Accuracy for {} data: {:.4f}'.format(source, score))