<a href="https://colab.research.google.com/github/ZorkDaNerd/CS345-Text-Recognition/blob/main/Text_Recognition_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*This notebook is part of our text recognition project for CS 345 at Colorado State University.
Original versions were created by Zachary Shimpa, Jenelle Dobyns, and Jordan Rust.
The content is available [on GitHub](https://github.com/ZorkDaNerd/CS345-Text-Recognition).*

*Code help and reference material were provided by Prof. Asa Ben-Hur and CS 345: Machine Learning Foundations and Practice at Colorado State University.
Original versions of these notebooks were created by Asa Ben-Hur with updates by Ross Beveridge.
The content is available [on his GitHub](https://github.com/asabenhur/CS345).*


Possible data sets

https://github.com/amephraim/nlp/tree/master/code 
Dataset from this

https://medium.com/mlearning-ai/sentiment-analysis-using-lstm-21767a130857

https://en.wikipedia.org/wiki/Sentiment_analysis



# Description of Project

This project classifies the sentiment (positive or negative) of text using a long short-term memory (LSTM) network. This is a form of natural language processing.
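
For readers unfamiliar with LSTMs, the standard textbook formulation of a single LSTM cell is sketched below. This is the general definition, not a description of the specific model trained in this notebook; the symbols ($x_t$ for the input at step $t$, $h_t$ for the hidden state, $c_t$ for the cell state, and the weight matrices $W$, $U$ and biases $b$) are notation introduced here for illustration.

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The forget gate $f_t$ and input gate $i_t$ decide how much of the previous cell state to keep and how much new information to write, which is what lets the network carry context across a long review.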

### Coding languages and packages used in project

Ex: Anaconda, Python, etc.

In [None]:
#Import
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
import os
from glob import glob
import tensorflow as tf
from tensorflow import keras
from keras.datasets import imdb
from wordcloud import WordCloud,STOPWORDS
import string
import re

In [None]:
# Let's import our data
imdb_data=pd.read_csv("https://github.com/ZorkDaNerd/CS345-Text-Recognition/raw/main/Datasets/IMDB%20Dataset/IMDB%20Dataset.csv")
imdb_data.head(10)

# Sentiment count - We can see that our data is perfectly balanced
# imdb_data['sentiment'].value_counts()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

#removing html tags (regex=True is required in pandas 2.0+, where str.replace defaults to literal matching)
imdb_data.review=imdb_data.review.str.replace('<[^<]+?>','',regex=True)

#set stopwords to english (download the NLTK stopword list first if it is not already installed)
nltk.download('stopwords')
stop=set(stopwords.words('english'))
print(stop)

#removing the stopwords
from nltk.tokenize.toktok import ToktokTokenizer
tokenizer=ToktokTokenizer()

#Stemming the text
ps=PorterStemmer()  # create the stemmer once instead of on every call
def simple_stemmer(text):
    text= ' '.join([ps.stem(word) for word in text.split()])
    return text

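#Apply function on review column
#Note: stemming runs before stopword removal, so stems such as 'thi' and 'wa'
#no longer match the stopword list and remain in the text
imdb_data['review']=imdb_data['review'].apply(simple_stemmer)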

def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stop]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stop]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text
    
#Apply function on review column
imdb_data['review']=imdb_data['review'].apply(remove_stopwords)


imdb_data.head()



{'hers', 'couldn', 'we', "she's", 'does', 'they', 'why', 'their', 'with', 'me', 'because', 'that', 'no', 'whom', 'do', 'should', 'been', "won't", 'again', 'further', 'be', "couldn't", 'from', 'shan', 'about', "you're", "wasn't", 'now', 'to', 'own', 'o', 'each', 'same', 'my', 'ours', 'who', 'which', 'haven', 'once', 'an', "weren't", 'theirs', 'needn', 'myself', 'or', 'between', 'not', 'don', "doesn't", 'before', 'so', "shan't", 's', 'too', 'his', 'isn', 'ourselves', 'm', 'and', 'than', 'during', 'yourself', 'shouldn', 'above', 'by', "don't", 'can', 'doing', "isn't", 'll', 'there', "aren't", "should've", 'am', 're', 'your', 'y', "it's", 'itself', "that'll", 'doesn', 'its', 'below', "didn't", 'under', 'him', 'themselves', "needn't", 'then', 'how', 'these', 'wouldn', 'being', 'for', 'will', 've', 'has', 'having', 'if', 'here', 'when', 'any', 'few', 'in', 'as', 'did', 'she', 'while', 'very', 'just', 'mustn', 'mightn', 'didn', "you'd", 'all', 'down', 'other', 'off', "haven't", 'wasn', 'some'

Unnamed: 0,review,sentiment
0,one review ha mention watch 1 oz episod ' hook...,positive
1,wonder littl production. film techniqu veri un...,positive
2,thought thi wa wonder way spend time hot summe...,positive
3,basic ' famili littl boy ( jake ) think ' zomb...,negative
4,"petter mattei ' "" love time money "" visual stu...",positive


In [None]:
# Let's convert our data to a numpy array for faster data preprocessing and cleaning
imdb_data = np.array(imdb_data)

#50,000 rows, 2 columns
imdb_data.shape

#Access our reviews column
imdb_data[:,0]

#Access our sentiment column
imdb_data[:,1]

array(['positive', 'positive', 'positive', ..., 'negative', 'negative',
       'negative'], dtype=object)

In [None]:
# Let's use str.translate to strip punctuation, which doesn't contribute to the overall sentiment of our reviews
import string

# Remove punctuation
def remove_punctuation(text):
    stripPunct = str.maketrans('', '', string.punctuation)
    return np.array([i.translate(stripPunct) for i in text])

#Apply function on review column
imdb_data[:,0] = remove_punctuation(imdb_data[:,0])

imdb_data[:,0]

array(['one review ha mention watch 1 oz episod  hooked right  thi exactli happen meth first thing struck oz wa brutal unflinch scene violence  set right word go trust  thi show faint heart timid thi show pull punch regard drugs  sex violence hardcore  classic use wordit call oz nicknam given oswald maximum secur state penitentary focus mainli emerald city  experiment section prison cell glass front face inwards  privaci high agenda em citi home many  aryans  muslims  gangstas  latinos  christians  italians  irish  scuffles  death stares  dodgi deal shadi agreement never far awayi would say main appeal show due fact goe show  dare forget pretti pictur paint mainstream audiences  forget charm  forget romance  oz  mess around first episod ever saw struck nasti wa surreal   say wa readi  watch  develop tast oz  got accustom high level graphic violence violence  injustic  crook guard  sold nickel  inmat  kill order get away  well mannered  middl class inmat turn prison bitch due lack stree
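
The cleaned output above contains fused tokens such as `wordit` and `awayi`, which is consistent with HTML tags being replaced by an empty string so that the words on either side of a `<br />` run together. The sketch below shows one possible alternative, replacing each tag with a space and then collapsing whitespace. The helper name `strip_tags` and the sample string are hypothetical and only serve the illustration; the regular expression is the one the notebook already uses.

```python
import re

TAG_PATTERN = re.compile(r"<[^<]+?>")  # same tag pattern as in the cleaning cell

def strip_tags(text, replacement=" "):
    """Replace HTML tags with `replacement`, then collapse repeated whitespace."""
    without_tags = TAG_PATTERN.sub(replacement, text)
    return re.sub(r"\s+", " ", without_tags).strip()

# Hypothetical sample modelled on the first review
raw = "a classic use of the word.<br /><br />It is called OZ"

glued = strip_tags(raw, replacement="")    # tags removed with nothing in between
spaced = strip_tags(raw, replacement=" ")  # tags replaced by a space

print(glued)
print(spaced)
```

With the empty replacement the sentence boundary disappears and `word.` and `It` end up adjacent; with a space they stay separate tokens, so later punctuation removal does not merge them.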

In [None]:
# Now that our data is cleaned up, let's make an 80/20 train/test split

X_train, X_test, y_train, y_test = train_test_split(imdb_data[:,0], imdb_data[:,1], test_size=0.20, random_state=5)
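
Before these splits can be fed to an LSTM, the review strings and the `positive`/`negative` labels have to be turned into numbers. The following is a minimal sketch of one way to do that in plain Python, not the notebook's actual pipeline. The helper names (`build_vocab`, `encode_and_pad`, `encode_labels`), the reserved ids, the vocabulary size, and the sequence length are all assumptions chosen for illustration, and the two sample reviews are a stand-in for `X_train` and `y_train`.

```python
from collections import Counter

PAD, UNK = 0, 1  # assumed convention: 0 pads short reviews, 1 marks unknown words

def build_vocab(texts, max_words=10000):
    """Map the most frequent words in the texts to integer ids starting at 2."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(max_words - 2)
    return {word: idx + 2 for idx, (word, _) in enumerate(most_common)}

def encode_and_pad(texts, vocab, max_len=200):
    """Turn each text into a fixed-length list of ids, truncating or right-padding."""
    encoded = []
    for text in texts:
        ids = [vocab.get(word, UNK) for word in text.split()][:max_len]
        encoded.append(ids + [PAD] * (max_len - len(ids)))
    return encoded

def encode_labels(labels):
    """Map 'positive' to 1 and anything else ('negative') to 0."""
    return [1 if label == "positive" else 0 for label in labels]

# Tiny hypothetical stand-in for X_train / y_train
sample_texts = ["wonder littl product", "thought thi wa wonder way"]
sample_labels = ["positive", "negative"]

vocab = build_vocab(sample_texts)  # fit the vocabulary on training data only
X_ids = encode_and_pad(sample_texts, vocab, max_len=6)
y_ids = encode_labels(sample_labels)

print(X_ids)
print(y_ids)
```

Building the vocabulary from the training split only, and mapping unseen test words to `UNK`, keeps information from the test set out of the model's inputs.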

# Project Code

This section puts all of the books into two NumPy arrays: one for the original books and one for the thinned version.

This section removes all of the unwanted words from the books and makes the words lowercase.

How many words and sentences are in the books, and how long would they take to read?
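
One rough way to answer this question is sketched below. It is only an illustration under stated assumptions: the helper name `text_stats` is hypothetical, sentences are split on runs of `.`, `!`, or `?` (a heuristic that miscounts abbreviations), and the reading speed of 200 words per minute is an assumed default, not a figure from this project.

```python
import re

def text_stats(text, words_per_minute=200):
    """Count words and sentences and estimate reading time in minutes."""
    words = text.split()
    # Treat runs of '.', '!' or '?' as sentence terminators (rough heuristic)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    minutes = len(words) / words_per_minute  # assumed reading speed
    return len(words), len(sentences), minutes

sample = "This is a short sample. It has two sentences!"
n_words, n_sentences, minutes = text_stats(sample)

print(n_words, n_sentences, round(minutes, 3))
```

Running the same function on the original and the thinned text would give a direct comparison of how much each cleaning step shortens the reading time.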

# Findings

# Analysis of results