<a href="https://colab.research.google.com/github/cagBRT/SentimentTextAnalysis/blob/master/Sentiment_Text_Analysis_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Clone the entire repo.
%cd /content/
!git clone  https://github.com/cagBRT/SentimentTextAnalysis.git cloned-repo
%cd cloned-repo
!ls

In [None]:
from IPython.display import Image
def page(num):
    return Image("images/sentTextAna"+str(num)+ ".png" , width=600)

# **Import the libraries**

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

# Install TensorFlow
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
from tensorflow import keras

In [None]:
import pandas as pd

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping

# **Examine the data**<br>
The data is from three sources: <br>
> yelp reviews<br>
> amazon reviews<br>
> movie reviews<br>

The data has the structure: <br>
>"review text" label source<br>

**review text is called**: sentence<br>
**label**: 0 = negative review, 1 = positive review<br>
**source**: yelp, amazon, imdb

In [None]:
#!cat yelp_labelled.txt
#Change directory to the cloned repo
%cd /content/cloned-repo/

In [None]:
#create a dataframe containing all three sources
filepath_dict = {'yelp':   'yelp_labelled.txt',
                 'amazon': 'amazon_cells_labelled.txt',
                 'imdb':   'imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
print(df.iloc[0])
print("dataframe shape: ",df.shape)
df['label'].value_counts()

# **Split the review data into train and test sets**

Split the Yelp and Amazon reviews into training and test sets.<br>

[train_test_split](https://www.bitdegree.org/learn/train-test-split)
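
To make the role of the seed concrete, here is a minimal plain-Python sketch of a seeded shuffle-and-split. It is a simplified stand-in, not scikit-learn's implementation; the helper name `simple_split` and the toy reviews are invented for illustration.

```python
import random

def simple_split(items, labels, test_size=0.25, seed=1000):
    """Shuffle the row indices with a seeded generator, then cut off the test share."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)          # same seed -> same shuffle
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([items[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([items[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test

# Toy data (hypothetical, not taken from the review files)
reviews = ["great food", "slow service", "loved it", "never again",
           "tasty", "too salty", "friendly staff", "cold fries"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

(train_x, train_y), (test_x, test_y) = simple_split(reviews, labels)
print(len(train_x), len(test_x))  # 6 2
```

Calling `simple_split` again with the same seed returns the same rows in the same order, which is why fixing `random_state` makes a notebook reproducible.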

In [None]:
from sklearn.model_selection import train_test_split
#select the rows of the data set that are from yelp
df_yelp = df[df['source'] == 'yelp']

sentences_yelp = df_yelp['sentence'].values
y_yelp = df_yelp['label'].values

#do a 75 - 25 split between train and test data
#If int, random_state is the seed used by the random number generator; 
#If RandomState instance, random_state is the random number generator; 
#If None, the random number generator is the RandomState instance used by np.random.
sentences_train_yelp, sentences_test_yelp, y_train_yelp, y_test_yelp = train_test_split(
   sentences_yelp, y_yelp, test_size=0.25, random_state=1000)

#print out the first sentence of the training set
print(sentences_train_yelp[0])

In [None]:
from sklearn.model_selection import train_test_split
#select the rows of the data set that are from amazon
df_amazon = df[df['source'] == 'amazon']

sentences_amazon = df_amazon['sentence'].values
y_amazon = df_amazon['label'].values

#do a 75 - 25 split between train and test data
#If int, random_state is the seed used by the random number generator; 
#If RandomState instance, random_state is the random number generator; 
#If None, the random number generator is the RandomState instance used by np.random.
sentences_train_amazon, sentences_test_amazon, y_train_amazon, y_test_amazon = train_test_split(
   sentences_amazon, y_amazon, test_size=0.25, random_state=1000)

#print out the first sentence of the training set
print(sentences_train_amazon[0])

# **Word Embedding**
There are various ways to vectorize text, such as:
*   Words represented as a vector.
*   Characters represented as a vector


In this notebook, you will represent words as vectors, which is the most common way to feed text to a neural network. Two possible ways to represent a word as a vector are:
*   one-hot encoding
*   word embeddings<br>

The first example shown below does one-hot encoding. The second example does word embeddings.



**One-hot encoding example**<br>
In this example, you experiment with one-hot encoding on a small dataset.

In [None]:
#There are five foods, with a vocabulary size of four ('bacon' appears twice)
food = ['bacon', 'egg', 'bacon', 'beer', 'toast']
food

Convert the foods into their one-hot equivalents.
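
Before relying on a library, it can help to see the conversion done by hand. The sketch below is a plain-Python illustration, not the scikit-learn or pandas code path; the names `vocabulary`, `index_of`, and `one_hot` are made up for this example.

```python
food = ['bacon', 'egg', 'bacon', 'beer', 'toast']

# The sorted unique values define the columns (LabelEncoder also sorts its classes).
vocabulary = sorted(set(food))  # ['bacon', 'beer', 'egg', 'toast']
index_of = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of zeros with a single 1 in the word's column."""
    vector = [0] * len(vocabulary)
    vector[index_of[word]] = 1
    return vector

encoded = [one_hot(word) for word in food]
for word, vector in zip(food, encoded):
    print(word, vector)
```

Each vector is as long as the vocabulary, so both occurrences of 'bacon' get the same four-element vector.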

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
food_labels = encoder.fit_transform(food)

#create a dataframe to examine the data
#(named df_food so it does not overwrite the reviews dataframe df)
df_food = pd.DataFrame()
df_food['food']= food
df_food['food_labels']= food_labels
df_food.sort_values(by=['food'])

#Convert the food into one-hot values
cat_columns = ["food_labels"]
df_processed = pd.get_dummies(df_food, prefix_sep="__",columns=cat_columns)
df_processed


In [None]:
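#Or use OneHotEncoder to encode the data
from sklearn.preprocessing import OneHotEncoder

#The default output is a sparse matrix; toarray() makes it dense for display.
#(Newer scikit-learn releases renamed the sparse argument to sparse_output.)
encoder = OneHotEncoder()
food_labels = food_labels.reshape(-1, 1)  #one column, one row per food
encoder.fit_transform(food_labels).toarray()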

# **Assignment #6:**
1. Create a list of 10 household appliances.
2. Convert the list into one-hot encodings

# **Discussion:** 
What are some downsides of one-hot encoding?

# **Word embedding**<br>
A word embedding is a learned representation for text in which words with similar meanings have similar representations. This approach to representing words and documents is often considered one of the key breakthroughs of deep learning on challenging natural language processing problems.




In [None]:
page(4)

Word embeddings represent words as dense vectors that are learned during training, unlike one-hot encodings, which are hardcoded. As a result, word embeddings pack more information into fewer dimensions.

Note that word embeddings do not understand text as a human would; rather, they map the statistical structure of the language used in the corpus. Their aim is to map semantic meaning into a geometric space, called the embedding space.<br>

Semantically similar words, such as numbers or colors, end up close together in the embedding space. If the embedding captures the relationships between words well, vector arithmetic becomes possible. A famous example is King - Man + Woman ≈ Queen.

Word embeddings have fewer dimensions than one-hot encodings<br>
Word embeddings place similar words near each other<br>
One-hot encoding produces a sparse matrix
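
The vector arithmetic idea can be sketched with a toy embedding space. The three-dimensional vectors below are invented by hand purely for illustration; real embeddings are learned from a corpus, have many more dimensions, and their individual dimensions are not directly interpretable.

```python
import math

# Hand-made toy vectors (hypothetical values, not learned from data)
embeddings = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.9, 0.1, 0.8],
    'man':   [0.1, 0.9, 0.1],
    'woman': [0.1, 0.1, 0.9],
    'apple': [0.0, 0.1, 0.1],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed element by element
target = [k - m + w for k, m, w in
          zip(embeddings['king'], embeddings['man'], embeddings['woman'])]

# Skip the input words themselves and pick the closest remaining word
candidates = {w: v for w, v in embeddings.items()
              if w not in ('king', 'man', 'woman')}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```

Similarity is measured by the angle between vectors rather than by exact equality, which is why the result is only approximately "queen" in a real embedding space.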

There are two methods for doing word embedding: <br> 

>1. Train your word embeddings during the training of your neural network. <br>
>2. Use pretrained word embeddings directly in your model. You can leave these word embeddings unchanged during training, or you can train them further.<br><br>

Either way, the text must first be tokenized into a format the embedding can use. <br><br>
Keras offers a couple of convenience methods for text preprocessing and sequence preprocessing which you can employ to prepare your text.<br>

[Keras Tokenizer ](https://keras.io/preprocessing/text/)

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

#Only the most frequent words (indices below 3000) are kept when texts are converted to sequences.
tokenizer = Tokenizer(num_words=3000)

#Update the internal vocabulary based on a list of texts
#Must be run before running texts_to_sequences
tokenizer.fit_on_texts(sentences_train_yelp)

The number assigned to each word depends on how frequently it is used across all the sentences: the more frequent the word, the lower its number. <br>
For example:<br>
>'the' is 1<br>
'and' is 2<br>
'was' is 3<br>
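
The frequency-based numbering can be sketched in plain Python. This is a simplified illustration rather than the Keras implementation: it only lowercases and splits on whitespace, and the toy sentences and the `to_sequence` helper are invented for this example.

```python
from collections import Counter

# Toy sentences (hypothetical, not taken from the review files)
sentences = ["the food was good",
             "the service was slow",
             "the place was the best"]

# Count every word across all sentences
counts = Counter(word for s in sentences for word in s.lower().split())

# Most frequent word gets index 1; index 0 is left unused
ranked = [word for word, _ in counts.most_common()]
word_index = {word: i for i, word in enumerate(ranked, start=1)}

def to_sequence(sentence):
    """Replace each known word with its index; unknown words are skipped."""
    return [word_index[w] for w in sentence.lower().split() if w in word_index]

print(word_index['the'], word_index['was'])  # 1 2
print(to_sequence("the food was slow"))      # [1, 3, 2, 6]
```

Because index 0 is never assigned to a word, the vocabulary size used later is the number of distinct words plus one.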


In [None]:
#Examples of reviews converted to sequences of word indices
X_train_yelp = tokenizer.texts_to_sequences(sentences_train_yelp)
print(sentences_train_yelp[3],X_train_yelp[3])
print(sentences_train_yelp[23],X_train_yelp[23])
print(sentences_train_yelp[620],X_train_yelp[620])

In [None]:
X_test_yelp = tokenizer.texts_to_sequences(sentences_test_yelp)
vocab_size_yelp = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index

print("vocab size=", vocab_size_yelp)

The indexing begins with the most common word first (the). <br>
It is important to note that the index 0 is reserved and is not assigned to any word. 

In [None]:
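for word in ['the', 'all', 'bad', 'terrible','horrible','lost','lukewarm','bacon','atom']: 
    #word_index[word] raises a KeyError for words that never appear in the training reviews
    print('{}: {}'.format(word, tokenizer.word_index.get(word, 'not in vocabulary')))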

The vocabulary can be searched by word (`word_index`) or by index (`index_word`). 

In [None]:
#What is one of the least used words? It has the highest index.
last_index = len(tokenizer.word_index)
print(tokenizer.index_word[last_index])

# **Assignment #7:**
Repeat the tokenization steps above with the Amazon reviews. 
<br>
Use different variable names than the ones used for the Yelp reviews

# **Find similar words with gensim**<br>
Gensim is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning. Gensim is implemented in Python and Cython.

In [None]:
import gensim.downloader as api
word_vectors = api.load("glove-wiki-gigaword-100")
#Find words similar to other words
#With good embeddings, you can do arithmetic with words
result = word_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))
#Try sandwich, tuna, bread

# **Assignment #8:** 
Use the gensim library to find other word equations. <br>
Share them with the class. 