# ***NLP Text Classification***

This simple project classifies text as positive or negative. The model is trained on an Amazon CSV file that contains item reviews.

The goals:

- Use NLTK to preprocess the reviews
- Use an RNN model to classify them
- Build it with TensorFlow, Keras, ...


# **Getting the file**

First, we choose the Amazon reviews CSV file. Then we load it into a DataFrame and look at the reviews of the item.

In [None]:
from google.colab import files
import pandas as pd
import numpy as np
import re
import tensorflow as tf
from sklearn.model_selection import train_test_split
import string
from nltk.corpus import stopwords
import nltk
from collections import Counter
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import layers
from tensorflow import keras

# Upload the file from your PC and get its name so we can read it.
# If you want to re-run this cell after the file has been uploaded,
# either remove the previous file first or set the file_name variable
# passed to read_csv to the name of the CSV file.
uploaded_file = files.upload()
file_name = list(uploaded_file.keys())[0]

# read csv
df = pd.read_csv(file_name)


df

Unnamed: 0,Text,label
0,This is the best apps acording to a bunch of ...,1
1,This is a pretty good version of the game for ...,1
2,this is a really . there are a bunch of levels...,1
3,"This is a silly game and can be frustrating, b...",1
4,This is a terrific game on any pad. Hrs of fun...,1
...,...,...
19991,this app is fricken stupid.it froze on the kin...,0
19992,Please add me!!!!! I need neighbors! Ginger101...,1
19993,love it! this game. is awesome. wish it had m...,1
19994,I love love love this app on my side of fashio...,1


# **Preprocessing data**

Now we clean and transform the data so that the DataFrame is ready to be split into training and test sets.

In [None]:
# remove null values
df = df.dropna()

In [None]:
# Then we clean the reviews by removing punctuation.

def remove_punct(text):
    translator = str.maketrans("", "", string.punctuation)
    return text.translate(translator)

df["Text"] = df.Text.map(remove_punct)
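
One side effect of deleting punctuation outright is that words separated only by a period get glued together, which seems to be what produced `stupidit` in the cleaned reviews further down. The sketch below compares the approach of the cell above with a hypothetical alternative, `punct_to_space` (not part of the original notebook), that replaces punctuation with spaces instead. The sample string is taken from the start of review 19991.

```python
import string

def remove_punct(text):
    # Same idea as the notebook cell: delete every punctuation character.
    return text.translate(str.maketrans("", "", string.punctuation))

def punct_to_space(text):
    # Hypothetical alternative: turn punctuation into spaces,
    # then collapse any runs of whitespace into single spaces.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(text.translate(table).split())

sample = "this app is fricken stupid.it froze"
deleted = remove_punct(sample)
spaced = punct_to_space(sample)

print(deleted)  # this app is fricken stupidit froze
print(spaced)   # this app is fricken stupid it froze
```

The trade-off is that contractions such as `don't` become `don t` with the second approach, so which one works better for this dataset is something to check rather than assume.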

In [None]:
# Next, I remove stop words (also called empty words).
# Stop words: a stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine
# has been programmed to ignore, both when indexing entries for searching and when retrieving them
# as the result of a search query.

nltk.download('stopwords')
stop = set(stopwords.words("english"))

def remove_stopwords(text):
    filtered_words = [word.lower() for word in text.split() if word.lower() not in stop]
    return " ".join(filtered_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
df["Text"] = df.Text.map(remove_stopwords)

In [None]:
df.Text

0        best apps acording bunch people agree bombs eg...
1        pretty good version game free lots different l...
2           really bunch levels find golden eggs super fun
3        silly game frustrating lots fun definitely rec...
4        terrific game pad hrs fun grandkids love great...
                               ...                        
19991    app fricken stupidit froze kindle wont allow p...
19992    please add need neighbors ginger1016 thanks bu...
19993    love game awesome wish free stuff houses didnt...
19994    love love love app side fashion story fights w...
19995    game rip list things make betterbull first nee...
Name: Text, Length: 19996, dtype: object
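
The cleaned reviews still contain words like `wont` and `didnt`, which suggests that contractions stop matching the stop-word list once their apostrophes have been stripped. The sketch below shows how the order of the two cleaning steps changes the result. `STOP` is a tiny hypothetical stand-in for the NLTK list so the example runs offline, and the review sentence is made up for illustration.

```python
import string

# Hypothetical stand-in for stopwords.words("english"); not the real list.
STOP = {"i", "it", "the", "don't", "won't"}

def remove_punct(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_stopwords(text):
    return " ".join(w.lower() for w in text.split() if w.lower() not in STOP)

review = "It won't open and I don't like it"

# Order used in the notebook: punctuation first, then stop words.
punct_first = remove_stopwords(remove_punct(review))
# Reverse order: stop words first, then punctuation.
stop_first = remove_punct(remove_stopwords(review))

print(punct_first)  # wont open and dont like
print(stop_first)   # open and like
```

Neither order is obviously right for sentiment: the notebook's order keeps the negations (as `wont`, `dont`), while the reverse order removes them and turns "don't like" into "like".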

In [None]:
# After the last cleaning step, we need to know how many unique words the DataFrame contains before we tokenize the reviews.
def counter_word(text_col):
    count = Counter()
    for text in text_col.values:
        for word in text.split():
            count[word] += 1
    return count


counter = counter_word(df.Text)
len(counter)

23997
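
As a small illustration of what the counter holds, the sketch below runs the same counting logic on three made-up reviews and then asks for the most frequent words and the words that appear only once. `Counter.update` and `Counter.most_common` are standard-library methods. The sample reviews and the `rare` list are assumptions added for the example.

```python
from collections import Counter

def counter_word(texts):
    # Same result as the nested loop above, using Counter.update per review.
    count = Counter()
    for text in texts:
        count.update(text.split())
    return count

reviews = ["love game awesome", "love app", "game froze"]  # made-up sample
counter = counter_word(reviews)

vocab_size = len(counter)                               # number of unique words
top_two = counter.most_common(2)                        # most frequent first
rare = [word for word, n in counter.items() if n == 1]  # words seen once

print(vocab_size)  # 5
print(top_two)     # [('love', 2), ('game', 2)]
print(rare)        # ['awesome', 'app', 'froze']
```

In the real counter many entries have a count of 1 (for example `acording` and `realustic`), so a list like `rare` gives a rough idea of how much of the vocabulary is typos or one-off words.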

In [None]:
counter

Counter({'best': 1069,
         'apps': 1235,
         'acording': 1,
         'bunch': 64,
         'people': 798,
         'agree': 47,
         'bombs': 1,
         'eggs': 21,
         'pigs': 54,
         'tnt': 1,
         'king': 33,
         'realustic': 1,
         'stuff': 234,
         'pretty': 464,
         'good': 1931,
         'version': 885,
         'game': 5834,
         'free': 2207,
         'lots': 344,
         'different': 786,
         'levels': 439,
         'play': 2011,
         'kids': 477,
         'enjoy': 580,
         'lot': 806,
         'really': 2194,
         'find': 863,
         'golden': 9,
         'super': 186,
         'fun': 2580,
         'silly': 62,
         'frustrating': 80,
         'definitely': 298,
         'recommend': 1033,
         'time': 2432,
         'terrific': 25,
         'pad': 66,
         'hrs': 4,
         'grandkids': 52,
         'love': 3958,
         'great': 4015,
         'entertainment': 60,
         'waiting': 1

# **Split data and Tokenize**

Now we have to split the data into training and test sets. In addition, because the model learns from numbers rather than letters, we must tokenize the words and assign a different token to each one. This lets the model recognize each word and learn from combinations of words during training.

In [None]:
# split text and labels
x_train,X_test,y_train,Y_test = train_test_split(df['Text'],df['label'],random_state=42,test_size=0.2)

# convert it to a numpy array
x_train = x_train.to_numpy()
y_train = y_train.to_numpy()
X_test = X_test.to_numpy()
Y_test = Y_test.to_numpy()

In [None]:
x_train.shape, X_test.shape

((15996,), (4000,))

# Tokenizer
After splitting the data and counting the unique words in the whole dataset, we tokenize all the reviews.

In [None]:
num_unique_words = len(counter)
# vectorize a text corpus by turning each text into a sequence of integers
tokenizer = Tokenizer(num_words=num_unique_words)
tokenizer.fit_on_texts(x_train) # fit only to training

In [None]:
# each word has unique index
word_index = tokenizer.word_index

In [None]:
word_index

{'app': 1,
 'game': 2,
 'great': 3,
 'like': 4,
 'love': 5,
 'use': 6,
 'get': 7,
 'kindle': 8,
 'one': 9,
 'fun': 10,
 'time': 11,
 'dont': 12,
 'really': 13,
 'free': 14,
 'play': 15,
 'would': 16,
 'easy': 17,
 'fire': 18,
 'good': 19,
 'works': 20,
 'even': 21,
 'well': 22,
 'much': 23,
 'apps': 24,
 'got': 25,
 'phone': 26,
 'work': 27,
 'im': 28,
 'want': 29,
 'need': 30,
 'best': 31,
 'also': 32,
 'recommend': 33,
 'way': 34,
 'many': 35,
 'better': 36,
 'day': 37,
 'cant': 38,
 'games': 39,
 'make': 40,
 'playing': 41,
 'ive': 42,
 'alarm': 43,
 'know': 44,
 'version': 45,
 'could': 46,
 'download': 47,
 'find': 48,
 'used': 49,
 'lot': 50,
 'u': 51,
 'every': 52,
 'go': 53,
 'first': 54,
 'people': 55,
 'see': 56,
 'little': 57,
 'think': 58,
 'different': 59,
 'worth': 60,
 'nice': 61,
 'doesnt': 62,
 'using': 63,
 'never': 64,
 'keep': 65,
 'downloaded': 66,
 'tried': 67,
 'android': 68,
 'awesome': 69,
 'ever': 70,
 'makes': 71,
 'try': 72,
 'old': 73,
 'read': 74,
 'simple

In [None]:
# Here I show how the tokenizer works: the output shows the reviews and, below them, the token of each word in each sentence.
x_train_sequences = tokenizer.texts_to_sequences(x_train)
X_test_sequences = tokenizer.texts_to_sequences(X_test)
print(x_train[10:15])
print(x_train_sequences[10:15])

['love dominoes bad app dice wayyyyy small even fire glad daily freebornalready uninstalled'
 'love app going buy pro version helps relax evening happy suceed knowing placements colors short time lets know intuition tune day plus fun'
 'good kindle app qvc site glitches makes easy order using 1 click watch'
 'wanted check game thus looking bowling game therefore downloaded app buying full versionbut sucks dont time add exclamation marks sign disgus'
 'like angry birds love app many different levels girls 10 8 always stealing kindle fire playing']
[[5, 1714, 109, 1, 1553, 8422, 371, 21, 18, 204, 247, 8423, 334], [5, 1, 97, 95, 438, 45, 166, 1251, 3198, 236, 8424, 1069, 8425, 384, 636, 11, 326, 44, 8426, 2094, 37, 265, 10], [19, 8, 1, 693, 492, 859, 71, 17, 427, 63, 195, 569, 180], [190, 226, 2, 1554, 80, 3523, 2, 2203, 66, 1, 570, 245, 8427, 315, 12, 11, 176, 8428, 2202, 679, 8429], [4, 141, 129, 5, 1, 35, 59, 116, 2009, 367, 981, 83, 3524, 8, 18, 41]]
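
To make the mapping above less of a black box, here is a simplified pure-Python stand-in for what the tokenizer does. It is a sketch of the idea, not the Keras implementation: words are ranked by frequency, indices start at 1 (0 is left free for padding), and words that were never seen during fitting are skipped. The helper names and sample texts are assumptions for illustration.

```python
from collections import Counter

def build_word_index(texts):
    # Rank words by descending frequency; the most frequent word gets index 1.
    counts = Counter(word for text in texts for word in text.split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    # Replace each known word by its index; unknown words are dropped.
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

train = ["love app", "love game fun", "app froze"]  # made-up training texts
word_index = build_word_index(train)
sequences = texts_to_sequences(["love new game"], word_index)

print(word_index)  # {'love': 1, 'app': 2, 'game': 3, 'fun': 4, 'froze': 5}
print(sequences)   # [[1, 3]]
```

The word `new` disappears from the encoded review because it is not in the training vocabulary, which is also what happens to unseen test-set words here since no out-of-vocabulary token is configured.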


**Add padding (zeros) on the right so that every sentence has the same length**

This lets the model train on inputs of a fixed length instead of having to handle a different shape for each sentence. Sentences longer than the limit are truncated on the right.

In [None]:
# Pad the sequences to have the same length
# Max number of words in a sequence (chosen arbitrarily)
max_length = 20

train_padded = pad_sequences(x_train_sequences, maxlen=max_length, padding="post", truncating="post")
val_padded = pad_sequences(X_test_sequences, maxlen=max_length, padding="post", truncating="post")
train_padded.shape, val_padded.shape

((15996, 20), (4000, 20))
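
Since `max_length = 20` was picked arbitrarily, one way to sanity-check it is to look at how long the tokenized reviews actually are. The sketch below uses the lengths of the five sequences printed earlier (13, 23, 13, 21 and 16 tokens) and a hypothetical helper, `length_at_percentile`, that applies the nearest-rank method. On the real data you would pass the lengths of all of `x_train_sequences`.

```python
def length_at_percentile(lengths, pct):
    # Nearest-rank percentile: smallest length covering pct% of the sequences.
    ordered = sorted(lengths)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct% of n
    return ordered[rank - 1]

sample_lengths = [13, 23, 13, 21, 16]  # lengths of x_train_sequences[10:15]

p50 = length_at_percentile(sample_lengths, 50)
p80 = length_at_percentile(sample_lengths, 80)
too_long = sum(1 for n in sample_lengths if n > 20)

print(p50, p80)  # 16 21
print(too_long)  # 2
```

Two of these five reviews are longer than 20 tokens and lose their last words to truncation, so checking the same numbers on the full training set may be worthwhile.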

In [None]:
train_padded[10]

array([   5, 1714,  109,    1, 1553, 8422,  371,   21,   18,  204,  247,
       8423,  334,    0,    0,    0,    0,    0,    0,    0], dtype=int32)

In [None]:
print(x_train[10])
print(x_train_sequences[10])
print(train_padded[10])

love dominoes bad app dice wayyyyy small even fire glad daily freebornalready uninstalled
[5, 1714, 109, 1, 1553, 8422, 371, 21, 18, 204, 247, 8423, 334]
[   5 1714  109    1 1553 8422  371   21   18  204  247 8423  334    0
    0    0    0    0    0    0]
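
The effect of `padding="post"` and `truncating="post"` can be reproduced in a few lines of plain Python. This is only an illustrative sketch with a hypothetical helper name, `pad_post`, not a replacement for `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        trimmed = seq[:maxlen]                        # truncating="post": cut the end
        filler = [value] * (maxlen - len(trimmed))    # padding="post": zeros on the right
        padded.append(trimmed + filler)
    return padded

short_and_long = [[5, 1714, 109], [1, 2, 3, 4, 5, 6]]
result = pad_post(short_and_long, maxlen=4)

print(result)  # [[5, 1714, 109, 0], [1, 2, 3, 4]]
```

The short sequence gains a trailing zero and the long one loses its last two tokens, so both end up with exactly `maxlen` entries.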


In [None]:
# Check by reversing the indices
# flip (key, value)
reverse_word_index = dict([(idx, word) for (word, idx) in word_index.items()])
reverse_word_index

{1: 'app',
 2: 'game',
 3: 'great',
 4: 'like',
 5: 'love',
 6: 'use',
 7: 'get',
 8: 'kindle',
 9: 'one',
 10: 'fun',
 11: 'time',
 12: 'dont',
 13: 'really',
 14: 'free',
 15: 'play',
 16: 'would',
 17: 'easy',
 18: 'fire',
 19: 'good',
 20: 'works',
 21: 'even',
 22: 'well',
 23: 'much',
 24: 'apps',
 25: 'got',
 26: 'phone',
 27: 'work',
 28: 'im',
 29: 'want',
 30: 'need',
 31: 'best',
 32: 'also',
 33: 'recommend',
 34: 'way',
 35: 'many',
 36: 'better',
 37: 'day',
 38: 'cant',
 39: 'games',
 40: 'make',
 41: 'playing',
 42: 'ive',
 43: 'alarm',
 44: 'know',
 45: 'version',
 46: 'could',
 47: 'download',
 48: 'find',
 49: 'used',
 50: 'lot',
 51: 'u',
 52: 'every',
 53: 'go',
 54: 'first',
 55: 'people',
 56: 'see',
 57: 'little',
 58: 'think',
 59: 'different',
 60: 'worth',
 61: 'nice',
 62: 'doesnt',
 63: 'using',
 64: 'never',
 65: 'keep',
 66: 'downloaded',
 67: 'tried',
 68: 'android',
 69: 'awesome',
 70: 'ever',
 71: 'makes',
 72: 'try',
 73: 'old',
 74: 'read',
 75: 'si

This demonstrates that the tokenizer works: if we reverse the dictionary created a few cells earlier, which assigns a unique token to each word, we see that each token belongs to a unique word and can be used to decode text.




In [None]:
def decode(sequence):
    return " ".join([reverse_word_index.get(idx, "?") for idx in sequence])

decoded_text = decode(x_train_sequences[10])
print(x_train_sequences[10])
print(decoded_text)

[5, 1714, 109, 1, 1553, 8422, 371, 21, 18, 204, 247, 8423, 334]
love dominoes bad app dice wayyyyy small even fire glad daily freebornalready uninstalled
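
The `decode` function above is applied to the unpadded sequence. If it were applied to a padded row instead, every padding zero would come back as `?`, because index 0 is not in the dictionary. A small variant that skips the padding value is sketched below. The `pad_id` parameter is an assumption added for the example, and the dictionary holds only the first four words of review 10.

```python
# Subset of reverse_word_index, taken from the tokens of review 10 above.
reverse_word_index = {5: "love", 1714: "dominoes", 109: "bad", 1: "app"}

def decode(sequence, pad_id=0):
    # Skip padding tokens; unknown indices still decode to "?".
    return " ".join(reverse_word_index.get(idx, "?") for idx in sequence if idx != pad_id)

padded_row = [5, 1714, 109, 1, 0, 0, 0]
decoded = decode(padded_row)

print(decoded)  # love dominoes bad app
```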


# Create RNN Model

In [None]:
# Create LSTM model

# Word embeddings give us a way to use an efficient, dense representation in which similar words have
# a similar encoding. Importantly, you do not have to specify this encoding by hand. An embedding is a
# dense vector of floating point values (the length of the vector is a parameter you specify).

model = keras.models.Sequential()
model.add(layers.Embedding(num_unique_words, 32, input_length=max_length))

# The layer takes as input an integer matrix of size (batch, input_length),
# and the largest integer (i.e. word index) in the input must be smaller than num_unique_words (vocabulary size).
# Now model.output_shape is (None, input_length, 32), where `None` is the batch dimension.

model.add(layers.LSTM(64, dropout=0.1))
model.add(layers.Dense(1, activation="sigmoid"))

model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 20, 32)            767904    
                                                                 
 lstm_1 (LSTM)               (None, 64)                24832     
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 792801 (3.02 MB)
Trainable params: 792801 (3.02 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
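
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer sizes, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias. The variable names are only for illustration.

```python
vocab_size = 23997   # len(counter), used as the Embedding input dimension
embed_dim = 32       # embedding vector length
lstm_units = 64      # LSTM hidden size

# One 32-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# 4 gates, each with (input + recurrent + bias) weights per unit.
lstm_params = 4 * ((embed_dim + lstm_units + 1) * lstm_units)

# One output unit: 64 weights plus 1 bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 767904 24832 65 792801
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size is the main lever on model size here.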


# Train the model

In [None]:
loss = keras.losses.BinaryCrossentropy(from_logits=False)
optim = keras.optimizers.Adam()
metrics = ["accuracy"]

model.compile(loss=loss, optimizer=optim, metrics=metrics)
model.fit(train_padded, y_train, epochs=10, validation_data=(val_padded, Y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7cba015bb940>

# Prediction with the preprocessed text

In [None]:
predictions = model.predict(train_padded)
predictions = [1 if p > 0.5 else 0 for p in predictions]
print(x_train[10:20])

print(y_train[10:20])
print(predictions[10:20])

[[0.9967172 ]
 [0.9973804 ]
 [0.99891675]
 ...
 [0.99365956]
 [0.99491256]
 [0.00330118]] [0.99999356]
['love dominoes bad app dice wayyyyy small even fire glad daily freebornalready uninstalled'
 'love app going buy pro version helps relax evening happy suceed knowing placements colors short time lets know intuition tune day plus fun'
 'good kindle app qvc site glitches makes easy order using 1 click watch'
 'wanted check game thus looking bowling game therefore downloaded app buying full versionbut sucks dont time add exclamation marks sign disgus'
 'like angry birds love app many different levels girls 10 8 always stealing kindle fire playing'
 'great app especially free love work jigsaw puzzles always way pieces fall floor app dont worry puzzles different settings change'
 'cant get sound stop looping make save slot restart phone'
 'could open real first place think stupid please get app worth'
 'app accurate speech feature isnt accurate typing words sentences none speech recogniti
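
Comparing the thresholded predictions with the labels by eye only goes so far, so a small accuracy helper can summarise the match. The sketch below uses made-up probabilities and labels. Both function names are hypothetical, and the 0.5 threshold mirrors the cell above.

```python
def to_labels(probabilities, threshold=0.5):
    # Same rule as the notebook: above the threshold means positive (1).
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction equals the label.
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

probs = [0.9967, 0.9974, 0.0033, 0.62]  # made-up sigmoid outputs
y_true = [1, 1, 0, 0]                   # made-up labels

y_pred = to_labels(probs)
score = accuracy(y_true, y_pred)

print(y_pred)  # [1, 1, 0, 1]
print(score)   # 0.75
```

Note that the cell above predicts on `train_padded`, so any accuracy computed from it describes data the model has already seen. The same check on `val_padded` and `Y_test` would give a fairer picture.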

# Make a prediction with different text

In [None]:
text = np.array(['worst book read'])
# Convert it to a string
texto_completo = ' '.join(text)

# Split into words
palabras = texto_completo.split()

# Count unique words
conteo_palabras = Counter(palabras)

In [None]:
num_unique_words_pred = len(conteo_palabras)

In [None]:
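# Note: this fits a new tokenizer on the prediction text alone, so its word
# indices are unrelated to the ones the model was trained on.
tokenize_pred = Tokenizer(num_words=num_unique_words_pred)
tokenize_pred.fit_on_texts(text)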

In [None]:
# tokenize words
text_sequences = tokenize_pred.texts_to_sequences(text)

In [None]:
# add padding
pred_padded = pad_sequences(text_sequences, maxlen=20, padding="post", truncating="post")

In [None]:
text

array(['worst book read'], dtype='<U15')

In [None]:
# each word has unique index
word_index_pred = tokenize_pred.word_index
# flip (key, value)
reverse_word_index_pred = dict([(rtx, word_pred) for (word_pred, rtx) in word_index_pred.items()])
reverse_word_index_pred

{1: 'worst', 2: 'book', 3: 'read'}

In [None]:
predictions_pred = model.predict(pred_padded)
print(predictions_pred)
predictions_pred = [1 if p > 0.5 else 0 for p in predictions_pred]
print(predictions_pred)


[[0.18086828]]
[0]
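
One thing worth double-checking in this section: the new tokenizer numbers the words of the new sentence from scratch, while the model only knows the indices of the training vocabulary. The sketch below uses the two dictionaries printed in this notebook to show how the model would read the encoded sentence. It also assumes, as I understand the Keras `Tokenizer`, that `num_words=3` keeps only indices below 3.

```python
# Indices 1-3 of the training vocabulary (from reverse_word_index above).
train_reverse = {1: "app", 2: "game", 3: "great"}

# Indices assigned by the tokenizer that was refit on the new text.
fresh_index = {"worst": 1, "book": 2, "read": 3}
num_words = 3  # len(conteo_palabras) in the cells above

# Encode with the fresh tokenizer, dropping indices >= num_words.
encoded = [fresh_index[w] for w in "worst book read".split() if fresh_index[w] < num_words]

# Decode with the vocabulary the model was actually trained on.
seen_by_model = " ".join(train_reverse[i] for i in encoded)

print(encoded)        # [1, 2]
print(seen_by_model)  # app game
```

If that reading is right, the 0.18 score above reflects the tokens for "app game" rather than "worst book read". A likely fix is to run the new text through `remove_punct` and `remove_stopwords`, encode it with `tokenizer.texts_to_sequences(text)` using the tokenizer fitted on `x_train`, and pad it with `maxlen=max_length` as before.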
