# Text Data Preprocessing

In [1]:
from data_pipeline import ETL_Pipeline
# pip install scipy==1.10.1

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\shak-\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\shak-\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\shak-\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# instantiate data pipeline
etl = ETL_Pipeline()

In [3]:
etl.extract(data_path='data/amazon_movie_reviews.csv')

Data loaded successfully!
Text data extracted successfully!


In [4]:
etl.data.head()

Unnamed: 0.1,Unnamed: 0,rating,review_title,text,images_x,asin,parent_asin,user_id,timestamp,helpful_vote,...,features,description,price,images_y,videos,store,categories,details,bought_together,author
0,0,5.0,Five Stars,"Amazon, please buy the show! I'm hooked!",[],B013488XFS,B013488XFS,AGGZ357AO26RQZVRLGU4D4N52DZQ,1440385637000,0,...,"['IMDb 8.1', '2017', '10 episodes', 'X-Ray', '...",['A\xa0con man (Giovanni Ribisi) on the run fr...,,[{'360w': 'https://images-na.ssl-images-amazon...,[],,"['Suspense', 'Drama']","{'Content advisory': ['Nudity', 'violence', 's...",,
1,1,5.0,Five Stars,My Kiddos LOVE this show!!,[],B00CB6VTDS,B00CB6VTDS,AGKASBHYZPGTEPO6LWZPVJWB2BVA,1461100610000,0,...,"['2014', '13 episodes', 'X-Ray', 'ALL']",['Follow the adventures of Arty and his sideki...,,[{'360w': 'https://images-na.ssl-images-amazon...,[],,['Kids'],{'Audio languages': ['English Dialogue Boost: ...,,
2,2,3.0,Some decent moments...but...,Annabella Sciorra did her character justice wi...,[],B096Z8Z3R6,B096Z8Z3R6,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,1646271834582,0,...,,,,[{'360w': 'https://images-na.ssl-images-amazon...,[],,,"{'Content advisory': ['Violence', 'substance u...",,
3,3,4.0,"Decent Depiction of Lower-Functioning Autism, ...",...there should be more of a range of characte...,[],B09M14D9FZ,B09M14D9FZ,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,1645937761864,1,...,,,,[{'360w': 'https://images-na.ssl-images-amazon...,[],,,"{'Content advisory': ['Violence', 'alcohol use...",,
4,4,5.0,What Love Is...,"...isn't always how you expect it to be, but w...",[],B001H1SVZC,B001H1SVZC,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,1590639227074,0,...,,,,[{'360w': 'https://images-na.ssl-images-amazon...,[],,,"{'Subtitles': ['None available'], 'Directors':...",,


In [5]:
etl.data.dtypes

Unnamed: 0             int64
rating               float64
review_title          object
text                  object
images_x              object
asin                  object
parent_asin           object
user_id               object
timestamp              int64
helpful_vote           int64
verified_purchase       bool
main_category         object
movie_title           object
subtitle              object
average_rating       float64
rating_number        float64
features              object
description           object
price                 object
images_y              object
videos                object
store                 object
categories            object
details               object
bought_together      float64
author                object
dtype: object

## Preprocessing

This is the first step in my Sentiment Analysis project using the Amazon Reviews Dataset. The preprocessing steps for the text data are:
+ Removal of HTML tags
+ Lowercasing the words
+ Expanding contractions
+ Removing special characters
+ Removing stopwords
+ Lemmatization
+ Tokenization

Some additional preprocessing steps that I have not yet implemented, but which may be useful, are:
+ Removal of URLs
+ Removal of extra spaces
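
The two steps not yet implemented could be sketched with regular expressions. This is a minimal illustration rather than code from `ETL_Pipeline`: the names `remove_urls` and `collapse_whitespace` and both patterns are assumptions, and the URL pattern only catches the common `http(s)://` and `www.` forms.

```python
import re

# Hypothetical patterns: they cover common URL shapes, not every valid URL
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def remove_urls(text: str) -> str:
    """Replace each URL with a single space so neighbouring words stay apart."""
    return URL_PATTERN.sub(" ", text)


def collapse_whitespace(text: str) -> str:
    """Squeeze runs of spaces, tabs, and newlines into one space and trim the ends."""
    return WHITESPACE_PATTERN.sub(" ", text).strip()


sample = "Great   show!  See https://example.com/review for more.\n"
cleaned = collapse_whitespace(remove_urls(sample))
print(cleaned)  # Great show! See for more.
```

Running URL removal before whitespace collapsing means the gap left by a removed URL is cleaned up in the same pass.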

Text preprocessing lets us clean, normalize, and transform raw text data into a format that is suitable for NLP tasks, such as sentiment analysis. 

For example, converting text to lowercase and expanding contractions makes the data consistent, so that "Good" and "good", or "isn't" and "is not", are treated as the same tokens. Removing special characters, punctuation, and irrelevant symbols reduces noise in the text, making it easier to analyze. Stopwords are common words like "the", "is", or "and", which appear very frequently but carry little semantic meaning. Removing them reduces the dimensionality of the text data and keeps the words that say the most about the meaning of the text. This can make the NLP model for sentiment analysis more efficient and, in many cases, more accurate.
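
The cleaning steps described above can be sketched with the standard library alone. This is not the code inside `etl.preprocess()`: the function name `clean_text`, the tiny `CONTRACTIONS` map, and the short `STOPWORDS` set are all hypothetical stand-ins for the fuller resources a real pipeline would use, and lemmatization is left out.

```python
import re
from html import unescape

# Tiny illustrative lookups (hypothetical); a real pipeline would use fuller lists
CONTRACTIONS = {"i'm": "i am", "isn't": "is not", "don't": "do not", "it's": "it is"}
STOPWORDS = {"the", "is", "and", "a", "i", "am", "it", "to", "this"}


def clean_text(text: str) -> list[str]:
    """Apply the cleaning steps in order and return the surviving tokens."""
    text = unescape(re.sub(r"<[^>]+>", " ", text))  # drop HTML tags, decode entities
    text = text.lower()                             # lowercase
    for short, full in CONTRACTIONS.items():        # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)           # remove special characters
    tokens = text.split()                           # whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]


print(clean_text("Amazon, please buy the show!<br />I'm hooked!"))
# ['amazon', 'please', 'buy', 'show', 'hooked']
```

Contractions are expanded before special characters are removed, because stripping the apostrophe first would turn "I'm" into "i m" and the lookup would no longer match.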

Lemmatization is a technique that reduces words to their dictionary base forms, or lemmas (e.g., running -> run). This consolidates variations of the same word, reducing the size of the resulting vocabulary and improving model generalization.

One technique I would use if I went back to improve my model's accuracy is generating n-grams. This could help capture context such as negation (e.g., "not good" vs. "good") and give my model a better ability to recognize nuance in language. I, along with many of my classmates, noticed that some words, such as "good" and "like", show up with high frequency in all the rating brackets. In lower-rated reviews, the full phrases are most likely "not good" or "did not like", but my model did not capture this context. If I had included bigrams or trigrams, I could have captured these patterns more effectively.
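
A minimal sketch of n-gram generation over a token list is shown here. The helper name `ngrams` is an assumption, and the example assumes the stopword step keeps negation words such as "not", which would need checking against the stopword list actually used.

```python
def ngrams(tokens: list[str], n: int) -> list[str]:
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["did", "not", "like", "the", "ending"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)   # ['did not', 'not like', 'like the', 'the ending']
print(trigrams)  # ['did not like', 'not like the', 'like the ending']
```

With bigrams as features, "not like" becomes its own vocabulary entry, separate from "like". If the vectorizer is scikit-learn's `CountVectorizer`, its `ngram_range` parameter (for example `ngram_range=(1, 2)`) may be a simpler route than building n-grams by hand.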

In [6]:
etl.preprocess()

Text data preprocessed successfully!
Text data tokenized and lemmatized successfully!
Vocabulary created successfully!


(CountVectorizer(),
 0                       [amazon, please, buy, show, hooked]
 1                                      [kiddos, love, show]
 2         [annabella, sciorra, character, justice, portr...
 3         [range, character, highfunctioning, autism, ja...
 4         [always, expect, know, movie, deep, struggle, ...
                                 ...                        
 999995                                  [go, wrong, martin]
 999996                                  [go, wrong, martin]
 999997    [good, pace, action, good, character, plot, bi...
 999998    [watched, th, whole, thing, one, criterion, ac...
 999999    [start, somewhat, interesting, fade, episode, ...
 Name: text, Length: 1000000, dtype: object)

## Encoding

The `encode()` method takes a `method` argument with the options 'bow' (Bag of Words), 'tf-idf', and 'word2vec'; Word2Vec is the default.
Here, I am testing the 'bow' encoding method.

Encoding is the process of converting the textual data into a numerical format that machine learning algorithms can understand and process. This is achieved by transforming words, phrases, or documents into numerical vectors. My `encode()` method can do this via three separate techniques.  

Bag of Words represents text by counting the frequency of each word in the document and then creating a sparse matrix where each row corresponds to a document, and each column to a unique word in the corpus. The value of each cell is the frequency of the corresponding word in the document. 
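
To make the row and column layout concrete, here is a small pure-Python sketch of Bag of Words on two toy documents. It builds a dense matrix for readability and is not the pipeline's implementation; the names `docs`, `vocab`, and `bow_row` are illustrative, and a corpus of this size would need a sparse format like the one shown in the output of `etl.encode(method='bow')`.

```python
from collections import Counter

# Two toy documents, already tokenized
docs = [
    ["good", "pace", "good", "character"],
    ["kiddos", "love", "show"],
]

# One column per unique word in the corpus, in sorted order
vocab = sorted({tok for doc in docs for tok in doc})
column = {tok: j for j, tok in enumerate(vocab)}


def bow_row(doc: list[str]) -> list[int]:
    """Count each word in the document and place the count in its column."""
    row = [0] * len(vocab)
    for tok, count in Counter(doc).items():
        row[column[tok]] = count
    return row


matrix = [bow_row(doc) for doc in docs]
print(vocab)   # ['character', 'good', 'kiddos', 'love', 'pace', 'show']
print(matrix)  # [[1, 2, 0, 0, 1, 0], [0, 0, 1, 1, 0, 1]]
```

Most cells are zero even in this tiny example, which is why a sparse representation that stores only the nonzero counts is the practical choice.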

TF-IDF also starts from word frequencies, but it weights each word by its importance in the corpus. The weight is the product of the term frequency (TF), which measures how often the word appears in a document, and the inverse document frequency (IDF), which measures how rare the word is across the documents in the corpus.
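
As a worked example, the sketch below uses the textbook form, weight = TF × ln(N / DF), where N is the number of documents and DF is the number of documents containing the word. This is an assumption for illustration: libraries often use smoothed or normalized variants (scikit-learn's `TfidfVectorizer` does by default), so the numbers will not match the pipeline's output exactly.

```python
import math
from collections import Counter

# Three toy documents, already tokenized
docs = [["good", "movie"], ["not", "good"], ["love", "show"]]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(tok for doc in docs for tok in set(doc))


def tfidf(doc: list[str]) -> dict[str, float]:
    """Textbook TF-IDF: (count / doc length) * ln(N / document frequency)."""
    counts = Counter(doc)
    return {
        tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
        for tok, count in counts.items()
    }


weights = tfidf(docs[0])
print(weights)
```

In the first document, "good" and "movie" have the same term frequency, but "good" appears in two of the three documents while "movie" appears in only one, so "movie" receives the larger weight.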

Word2Vec represents words in a continuous vector space where semantically similar words are mapped to nearby points. It captures contextual relationships between words by training a neural net to predict a target word from its surrounding words (the CBOW variant), or the surrounding words from the target (the skip-gram variant). This type of embedding preserves semantic relationships between words, so the model can capture meaning and context better than with BoW or TF-IDF.
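
Training a Word2Vec model is beyond a short example, but the data it learns from is easy to show. The sketch below generates the (target, context) pairs that a skip-gram model would be trained on. The function name `skipgram_pairs` and the `window` parameter are illustrative assumptions, and nothing here reflects how `encode(method='word2vec')` is actually implemented.

```python
def skipgram_pairs(tokens: list[str], window: int = 2) -> list[tuple[str, str]]:
    """Pair each target word with every word at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


pairs = skipgram_pairs(["kiddos", "love", "show"], window=1)
print(pairs)
# [('kiddos', 'love'), ('love', 'kiddos'), ('love', 'show'), ('show', 'love')]
```

Words that occur in similar contexts produce similar sets of pairs, which is what pushes their vectors close together during training.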

In [7]:
etl.encode(method='bow')

Bag of Words Encoding completed successfully!


<1000000x393022 sparse matrix of type '<class 'numpy.int64'>'
	with 20314046 stored elements in Compressed Sparse Row format>