<a href="https://colab.research.google.com/github/ZorkDaNerd/CS345-Text-Recognition/blob/main/Text_Recognition_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*This notebook is part of our text recognition project for CS 345 at Colorado State University.
Original versions were created by Zachary Shimpa, Jenelle Dobyns, and Jordan Rust.
The content is available [on GitHub](https://github.com/ZorkDaNerd/CS345-Text-Recognition).*

*Code help and reference material were provided by Prof. Asa Ben-Hur and CS 345: Machine Learning Foundations and Practice at Colorado State University.
Original versions of these notebooks were created by Asa Ben-Hur with updates by Ross Beveridge.
The content is available [on his GitHub](https://github.com/asabenhur/CS345).*

Possible data sets

https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
(Our dataset comes from this link.)

https://medium.com/mlearning-ai/sentiment-analysis-using-lstm-21767a130857

https://en.wikipedia.org/wiki/Sentiment_analysis

https://valueml.com/sentiment-analysis-using-keras

https://towardsdatascience.com/sentiment-analysis-on-amazon-reviews-45cd169447ac

https://paperswithcode.com/dataset/imdb-movie-reviews

https://www.kaggle.com/code/shubhamptrivedi/sentiment-analysis-on-imdb-movie-reviews

https://www.kaggle.com/code/sohamdas27/imdb-movie-review-eda-sentiment-analysis

https://www.kaggle.com/code/zhangwei20220818/imbd-sentiment-analysis-using-pytorch-lstm

https://www.kaggle.com/code/vincentman0403/sentimental-analysis-on-imdb-by-lstm

# Description of Project

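This project recognizes the sentiment (positive or negative) expressed in text, a natural language processing task known as sentiment analysis. The goal is to do this with a long short-term memory (LSTM) network.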

### Coding languages and packages used in project

Python, with pandas, NumPy, Matplotlib, scikit-learn, TensorFlow/Keras, NLTK, and wordcloud.

Imports

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
import os
from glob import glob
import tensorflow as tf
from tensorflow import keras
from keras.datasets import imdb
from wordcloud import WordCloud,STOPWORDS
import string
import re

Let's import our data

In [None]:
imdb_data=pd.read_csv("https://github.com/ZorkDaNerd/CS345-Text-Recognition/raw/main/Datasets/IMDB%20Dataset/IMDB%20Dataset.csv")
imdb_data.head(10)

# Sentiment count - We can see that our data is perfectly balanced
# imdb_data['sentiment'].value_counts()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive
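
The commented-out `value_counts()` call is how we checked that the classes are balanced. As a minimal sketch of the same check without pandas, assuming a short made-up list of labels in place of the real `sentiment` column:

```python
from collections import Counter

# Hypothetical labels standing in for imdb_data['sentiment']
labels = ["positive", "negative", "positive", "negative", "negative", "positive"]

counts = Counter(labels)
total = sum(counts.values())
# Fraction of reviews per class; a balanced dataset gives 0.5 for each
shares = {label: n / total for label, n in counts.items()}
print(counts)
print(shares)
```

If one class had a much larger share, accuracy alone would be a misleading metric later on.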


Some housekeeping on the data

In [None]:
# Importing 
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Removing HTML tags (regex=True is needed in pandas 2.0+, where str.replace matches literally by default)
imdb_data.review=imdb_data.review.str.replace('<[^<]+?>','',regex=True)
# print(imdb_data['review'])
# Set stopwords to English
stop=set(stopwords.words('english'))
print(stop)

# Removing the stopwords
from nltk.tokenize.toktok import ToktokTokenizer
tokenizer=ToktokTokenizer()

def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stop]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stop]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text
    
#Apply function on review column
imdb_data['review'] = imdb_data['review'].apply(remove_stopwords).str.lower()

sentiment = {'positive': 1, 'negative': 0}
imdb_data['sentiment'] = [sentiment[item] for item in imdb_data['sentiment']]

imdb_data.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


{'why', 'just', 're', "couldn't", 'further', 'all', 't', 'she', 'which', 'myself', 'my', 'weren', "it's", 'these', "aren't", 'when', 'few', 'is', 'your', 'mustn', "you've", 'now', 'had', 'while', 'her', "hadn't", 'about', 'out', 'haven', 'above', "she's", 'from', 'on', 'other', 'through', 'you', 'between', 'can', "that'll", 'there', 'yours', 'here', 'are', "needn't", 've', 'more', 'ourselves', "didn't", 'ma', 'herself', 'any', 'each', 'only', 'ours', 'after', 'i', 'mightn', 'o', "doesn't", "wasn't", 'ain', 'its', 'doesn', 'and', 'did', 'off', 'hadn', 'below', 'was', 'but', 'during', 'hers', 'didn', 'isn', 'nor', 'or', "isn't", 'has', 'at', 'don', 'into', 'those', "won't", 'yourselves', 'an', 'as', 'if', "you'll", 'against', 'been', 'before', 'because', 'same', "should've", 'so', 'shouldn', 'down', "mustn't", 'both', 'whom', 'than', "you're", 'such', "shan't", "don't", 'couldn', 'who', 'itself', 'aren', 'our', 'his', 'then', 'wasn', 'most', 'this', 'with', 'won', 'of', 'shan', "wouldn't

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,1
1,wonderful little production. filming technique...,1
2,thought wonderful way spend time hot summer we...,1
3,basically ' family little boy ( jake ) thinks ...,0
4,"petter mattei ' "" love time money "" visually s...",1
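
To make the cleaning steps easier to follow, here is a small standalone sketch of the same idea on one string. It is a simplification: it splits on whitespace instead of using `ToktokTokenizer`, and `STOP` is a tiny made-up stop-word set rather than NLTK's English list.

```python
import re

# Tiny hypothetical stop-word set; the notebook uses NLTK's English list instead
STOP = {"a", "the", "was", "this", "of", "and"}

# Same pattern as in the cell above: "<", then anything that is not "<", then ">"
TAG_RE = re.compile(r"<[^<]+?>")

def clean_review(text):
    """Strip HTML tags, drop stop words, and lowercase the result."""
    no_tags = TAG_RE.sub("", text)
    kept = [tok for tok in no_tags.split() if tok.lower() not in STOP]
    return " ".join(kept).lower()

cleaned = clean_review("A wonderful little production. <br /><br />The filming was great")
print(cleaned)
```

Because tags are replaced with an empty string, two words separated only by a tag (no space) end up glued together. Replacing tags with a space instead would avoid that.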


Let's convert our data to a NumPy array for faster preprocessing and cleaning

In [None]:
imdb_data = np.array(imdb_data)

#50,000 rows, 2 columns
imdb_data.shape

#Access our reviews column
imdb_data[:,0]

#Access our sentiment column
imdb_data[:,1]

array([1, 1, 1, ..., 0, 0, 0], dtype=object)

Let's use NumPy to remove punctuation that doesn't contribute to the overall sentiment of our reviews

In [None]:
def remove_punctuation(text):
    stripPunct = str.maketrans('', '', string.punctuation)
    return np.array([i.translate(stripPunct) for i in text])

#Apply function on review column
imdb_data[:,0] = remove_punctuation(imdb_data[:,0])

imdb_data[:,1]

array([1, 1, 1, ..., 0, 0, 0], dtype=object)
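
Here is what the translation table does to individual strings. This is a minimal sketch on two made-up examples, using only the standard library:

```python
import string

# Translation table that deletes every ASCII punctuation character
strip_punct = str.maketrans("", "", string.punctuation)

examples = ["wonderful little production.", "a well-known, must-see film!"]
stripped = [text.translate(strip_punct) for text in examples]
print(stripped)
```

Note that hyphens and apostrophes count as punctuation too, so hyphenated words are merged into one token (`well-known` becomes `wellknown`).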

Define X and y

In [None]:
X = imdb_data[:,0]
# Cast the labels into a real integer array; writing them back into imdb_data would keep dtype=object
y = imdb_data[:,1].astype(np.int32)

Now that our data is cleaned up, let's make an 80/20 train/test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
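
For intuition, a shuffled holdout split can be sketched with the standard library alone. This is not scikit-learn's implementation, and `holdout_split` is a hypothetical helper. It shuffles differently from `train_test_split`, so the rows that land in each part will not match, but the sizes will.

```python
import random

def holdout_split(items, test_fraction=0.20, seed=5):
    """Shuffle indices with a fixed seed, then cut off the test portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_fraction))
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test

reviews = [f"review {i}" for i in range(50_000)]  # placeholder data
train, test = holdout_split(reviews)
print(len(train), len(test))
```

With 50,000 reviews this gives 40,000 for training and 10,000 for testing. Fixing the seed makes the split repeatable between runs.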

We convert the words into integer indices so the model can process them

In [None]:
print(X_train[0])
from keras.preprocessing.text import Tokenizer    
from keras.preprocessing.text import text_to_word_sequence 
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
print(X_train[0])

excellent reason edison went straight video  would landed theaters crumbling thud movie lasted entirely long perilously boring notch lowbrow  thanks freeman spacey  obviously spare two weeks next films   bad guys laughable action near nonexistent justin timberlake  acting hate knock guy  sooner realizes pop forte  betterthe movie  bad  mostly like fact cool j given appears shot leading man deserves it  unlike fellow musician costar  act kevin spacey almost always enjoyable well  see gulp several times chews scenery   freeman ability elevate flick three stars  ten   good  when said done  ultimate error movie mundane tiresome piece pseudoaction poppycock fails keep anyone awake also fails make anyone give good crap characters   plain boring said  rent suffering insomnia 
[200, 167, 282, 660, 270, 6, 2065, 1, 4571, 963, 101, 231, 3357, 1081, 3144, 4643, 395, 3510, 32, 2136, 248, 29, 17, 315, 1199, 116, 647, 2817, 3511, 34, 609, 4319, 106, 2487, 1534, 1, 17, 513, 4, 86, 464, 2477, 225, 592
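
The numbers above are frequency ranks: the most common word in the training reviews gets index 1, the next gets 2, and so on. Below is a simplified pure-Python approximation of that behaviour. It is not the Keras code (the real `Tokenizer` also lowercases and filters punctuation by default), and `fit_word_index` and `to_sequences` are hypothetical names.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    # sorted() is stable, so ties keep their order of first appearance
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and ranks >= num_words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

train_texts = ["bad movie bad acting", "good movie"]
word_index = fit_word_index(train_texts)
sequences = to_sequences(["good bad unseen movie acting"], word_index, num_words=4)
print(word_index)
print(sequences)
```

Two things are worth noticing: words never seen during fitting are silently dropped, and with `num_words=5000` only indices below 5000 survive, so rare words vanish from the sequences as well.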

Pad or truncate every training and test review to exactly 50 tokens

In [None]:
from keras.utils import pad_sequences
lenRev = 50
X_train = pad_sequences(X_train, padding='post', maxlen=lenRev)
X_test = pad_sequences(X_test, padding='post', maxlen=lenRev)
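
One detail is easy to miss: the cell sets `padding='post'` but leaves `truncating` at its default, which is `'pre'`. So short reviews get zeros added at the end, while long reviews lose their beginning and keep the last 50 tokens. A minimal pure-Python sketch of that combination for a single sequence (`pad_post` is a hypothetical helper, not part of Keras):

```python
def pad_post(seq, maxlen, value=0):
    """Keep the last `maxlen` items, then right-pad with `value`."""
    kept = seq[-maxlen:]  # mirrors truncating='pre': drop items from the front
    return kept + [value] * (maxlen - len(kept))

short = pad_post([7, 8, 9], maxlen=5)
long = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [7, 8, 9, 0, 0]
print(long)   # [3, 4, 5, 6, 7]
```

If the opening of a review matters more than its ending, passing `truncating='post'` to `pad_sequences` would keep the first 50 tokens instead.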

Convert the arrays to tensor objects so Keras can work with them

In [None]:
X_train = tf.convert_to_tensor(X_train)
X_test = tf.convert_to_tensor(X_test)
y_train = tf.convert_to_tensor(y_train, dtype=tf.int32)
y_test = tf.convert_to_tensor(y_test, dtype=tf.int32)

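Initializing the Keras model. Note that the network below is built from an embedding layer, global average pooling, and dense layers; it does not contain an LSTM layer.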

In [None]:
# Use the public tensorflow.keras API rather than the private tensorflow.python package
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
model = Sequential([Embedding(10000, 17), 
                   GlobalAveragePooling1D(),
                   Dense(17,activation = "relu"),
                   Dense(12,activation = "relu"),
                   Dense(1,activation = "sigmoid")])
model.compile(
    loss = "binary_crossentropy",
    optimizer =  "adam",
    metrics = ["accuracy"])
model.summary()

Model: "sequential_24"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_24 (Embedding)     (None, None, 17)          170000    
_________________________________________________________________
global_average_pooling1d_24  (None, 17)                0         
_________________________________________________________________
dense_72 (Dense)             (None, 17)                306       
_________________________________________________________________
dense_73 (Dense)             (None, 12)                216       
_________________________________________________________________
dense_74 (Dense)             (None, 1)                 13        
Total params: 170,535
Trainable params: 170,535
Non-trainable params: 0
_________________________________________________________________
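
The parameter counts in the summary can be checked by hand. A dense layer has one weight per input/output pair plus one bias per output, and the embedding is a table with one 17-dimensional row per vocabulary slot. As a quick sketch (`dense_params` is just a helper name for this example):

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

embedding = 10000 * 17          # vocabulary size x embedding dimension
pooling = 0                     # averaging has nothing to learn
dense_1 = dense_params(17, 17)  # 306
dense_2 = dense_params(17, 12)  # 216
dense_3 = dense_params(12, 1)   # 13
total = embedding + pooling + dense_1 + dense_2 + dense_3
print(total)  # 170535
```

Almost all of the parameters sit in the embedding table, so the vocabulary size is the main lever on model size.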


Fit the Keras model

In [None]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20, verbose = 1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fbc5dcb35e0>

Calculate the accuracy

In [None]:
loss, accuracy = model.evaluate(X_test, y_test)
print("Accuracy is : ", accuracy*100)

Accuracy is :  81.91999793052673
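
For reference, accuracy here means the share of test reviews whose predicted label matches the true one. The sketch below assumes the sigmoid output is turned into a label with a 0.5 threshold, which is our understanding of the Keras default for binary classification. The labels and scores are made up.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Share of predictions whose thresholded value matches the label."""
    hits = sum(int(p > threshold) == t for t, p in zip(y_true, y_prob))
    return hits / len(y_true)

y_true = [1, 0, 1, 1, 0]            # hypothetical labels
y_prob = [0.9, 0.2, 0.4, 0.7, 0.6]  # hypothetical sigmoid outputs
acc = binary_accuracy(y_true, y_prob)
print("Accuracy is : ", acc * 100)
```

Three of the five predictions are on the correct side of the threshold, so the sketch reports 60% accuracy.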


# Test the model

In [None]:
sample = "The product was very good and satisfying."
sample = sample.translate(str.maketrans('', '', string.punctuation)).lower()
sample = sample.split()
sample = [word for word in sample if word not in stop]
sample

['product', 'good', 'satisfying']

Tokenize the sample to numbers

In [None]:
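# Each word is treated as its own text, so this returns one list per word
sample = tokenizer.texts_to_sequences(sample)
# Flatten the per-word lists into a single sequence of indices
simple_list = []
for sublist in sample:
    for item in sublist:
        simple_list.append(item)
# pad_sequences expects a list of sequences, so wrap the single review
simple_list = [simple_list]
sample_review = pad_sequences(simple_list, padding='post', maxlen=lenRev)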

Predict the sample

In [None]:
ans = model.predict(sample_review)
ans

array([[0.8333976]], dtype=float32)

Tell whether the sample is positive or negative

In [None]:
score = float(ans[0][0])  # model.predict returns an array of shape (1, 1)
if score > 0.6:
    print("The review is positive")
elif score < 0.4:
    print("The review is negative")
else:
    print("The review is not too good nor too bad")

The review is positive


# Project Code

This section puts all of the books into two NumPy arrays: one for the original books and one for the thinned version.

This section removes all of the unwanted words from the books and makes the words lowercase.

How many words and sentences are in the books, and how long would they take to read?

# Findings

# Analysis of results