Mounting Google Drive in the notebook

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


---

# Setup

This notebook preprocesses and analyses the review data.

The reviews are loaded from a CSV file that covers about 69K titles, with many reviews per title.

The objective is to prepare these reviews for aspect-based sentiment analysis.





---


## Import Libraries
First, let's import the necessary libraries for our analysis.

In [2]:
!pip install --upgrade textblob nltk




NLTK for preprocessing

In [3]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   U

True

In [4]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [5]:
from textblob import TextBlob

`tqdm` for progress bars

In [6]:
!pip install tqdm
from tqdm import tqdm



Pandas for managing dataframes

In [7]:
import pandas as pd

---

## Load the Data
Next, let's load the reviews data from the CSV file.

In [8]:
file_path = "/content/drive/MyDrive/IMDB Project/review_analysis/data/all_reviews.csv"

In [9]:
all_reviews_df = pd.read_csv(file_path)



---

## Data Overview
Let's take a quick look at the data.

In [10]:
all_reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3680057 entries, 0 to 3680056
Data columns (total 5 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   review    object
 1   reviewer  object
 2   rating    object
 3   imdb_id   object
 4   title     object
dtypes: object(5)
memory usage: 140.4+ MB
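
The `info()` output shows `rating` stored as `object` rather than a numeric dtype. If numeric ratings are needed later, one option is to coerce the column with `pd.to_numeric`. The sketch below uses a small toy frame, and the `"not rated"` entry is a hypothetical example of a non-numeric value, not something observed in the data.

```python
import pandas as pd

# Toy stand-in for the rating column; "not rated" is a hypothetical bad value
toy = pd.DataFrame({"rating": ["8.0", "6.0", None, "not rated"]})

# errors="coerce" turns anything that cannot be parsed into NaN
toy["rating_num"] = pd.to_numeric(toy["rating"], errors="coerce")

print(toy.dtypes)
print(toy)
```

Coercing keeps the row count unchanged, so unparseable ratings simply join the existing missing values.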


In [11]:
all_reviews_df.shape

(3680057, 5)

In [12]:
all_reviews_df.head()

|   | review | reviewer | rating | imdb_id | title |
|---|---|---|---|---|---|
| 0 | Jackie Loh Chan is a motor mechanic whose fath... | bob the moo | NaN | tt0114126 | Thunderbolt |
| 1 | One of the most important things in a Jackie C... | sagacity_ | NaN | tt0114126 | Thunderbolt |
| 2 | I read somewher that Jackie was still recoveri... | rutt13-1 | 8.0 | tt0114126 | Thunderbolt |
| 3 | This moving picture deals with Chan Foh To (Ja... | ma-cortes | 6.0 | tt0114126 | Thunderbolt |
| 4 | This is another action-packed movie starring J... | OllieSuave-007 | 6.0 | tt0114126 | Thunderbolt |


---

## Make a copy of the imported data

In [13]:
df = all_reviews_df.copy()

### Understanding the Data and dropping NaNs

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3680057 entries, 0 to 3680056
Data columns (total 5 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   review    object
 1   reviewer  object
 2   rating    object
 3   imdb_id   object
 4   title     object
dtypes: object(5)
memory usage: 140.4+ MB


In [15]:
df.isna().sum()

review        8736
reviewer      8345
rating      321202
imdb_id       8619
title        18628
dtype: int64
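
To put these counts in proportion, the short calculation below converts them to percentages of the 3,680,057 rows reported by `info()`. All numbers are copied from the outputs above; only the variable names are introduced here.

```python
# Missing-value counts and row total copied from the outputs above
total_rows = 3_680_057
missing = {
    "review": 8_736,
    "reviewer": 8_345,
    "rating": 321_202,
    "imdb_id": 8_619,
    "title": 18_628,
}

# Percentage of rows missing each column
missing_pct = {col: 100 * n / total_rows for col, n in missing.items()}

# Rows that remain once rows without a review are dropped
rows_with_review = total_rows - missing["review"]

for col, pct in missing_pct.items():
    print(f"{col:<10}{pct:6.2f}%")
print("rows with a review:", rows_with_review)
```

On the full frame, `df.isna().mean() * 100` should give the same percentages directly.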

In [16]:
# Drop rows with missing reviews
df = df.dropna(subset=['review'])

# Ensure all reviews are strings
df['review'] = df['review'].astype(str)

---

# Processing

---

## Preprocess the Reviews
Before we can perform aspect-based sentiment analysis, we need to preprocess the reviews. This typically involves steps like tokenization, lowercasing, stopword removal, and lemmatization. Let's define a function to handle this preprocessing.

In [17]:
# Build the stopword set and the lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_review(review):
    """Tokenize, lowercase, remove stopwords, and lemmatize one review."""
    # Tokenize the review and convert each token to lowercase
    tokens = [token.lower() for token in word_tokenize(review)]

    # Remove stopwords
    tokens = [token for token in tokens if token not in STOP_WORDS]

    # Lemmatize the remaining tokens
    return [LEMMATIZER.lemmatize(token) for token in tokens]

Apply the function

In [18]:
tqdm.pandas(desc="Processing reviews")
df['preprocessed_review'] = df['review'].progress_apply(preprocess_review)

Processing reviews: 100%|██████████| 3671321/3671321 [1:13:49<00:00, 828.76it/s] 
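
The pass above took over an hour. Since the same words recur across millions of reviews, caching lemmatizer results per token is one possible way to reduce repeated work. The sketch below shows the idea with `functools.lru_cache`; `slow_lemmatize` is a hypothetical stand-in for `WordNetLemmatizer().lemmatize`, and any actual speed-up on this dataset is an assumption that would need to be measured.

```python
from functools import lru_cache

def slow_lemmatize(token: str) -> str:
    """Hypothetical stand-in for WordNetLemmatizer().lemmatize."""
    # Crude plural stripping, only to make the example self-contained
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

@lru_cache(maxsize=None)
def cached_lemmatize(token: str) -> str:
    # Each distinct token is lemmatized once; repeats are served from the cache
    return slow_lemmatize(token)

tokens = ["movies", "movie", "movies", "scenes", "movies"]
lemmas = [cached_lemmatize(t) for t in tokens]
cache_stats = cached_lemmatize.cache_info()

print(lemmas)
print(cache_stats)
```

With an unbounded cache, memory grows with the vocabulary size, so a finite `maxsize` may be preferable on very large corpora.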


Drop the raw review column, save the result, and view the preprocessed reviews

In [19]:
df = df.drop('review', axis=1)

In [20]:
df.to_csv("/content/drive/MyDrive/IMDB Project/review_analysis/data/preprocessed_reviews.csv")
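
One thing to keep in mind when reloading this file: CSV has no list type, so each token list is written as its string representation. The sketch below, on a toy frame with an in-memory buffer, shows one way to restore the lists with `ast.literal_eval` through the `converters` argument of `pd.read_csv`. The toy data and the use of `index=False` are assumptions for illustration; without `index=False`, the index is written as an extra unnamed column.

```python
import ast
import io

import pandas as pd

# Toy frame mimicking the saved structure
toy = pd.DataFrame({
    "title": ["Thunderbolt"],
    "preprocessed_review": [["jackie", "chan", "movie"]],
})

# Write to an in-memory buffer instead of Drive for the demonstration
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read returns the list as a string
buffer.seek(0)
raw = pd.read_csv(buffer)

# A converter parses the string back into a Python list
buffer.seek(0)
restored = pd.read_csv(
    buffer, converters={"preprocessed_review": ast.literal_eval}
)

print(type(raw.loc[0, "preprocessed_review"]))
print(restored.loc[0, "preprocessed_review"])
```

Parsing millions of rows this way can be slow, so a binary format that preserves lists may be worth considering for the hand-off to the next notebook.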

In [21]:
df['preprocessed_review'].head()

0    [jackie, loh, chan, motor, mechanic, whose, fa...
1    [one, important, thing, jackie, chan, movie, d...
2    [read, somewher, jackie, still, recovering, in...
3    [moving, picture, deal, chan, foh, (, jackie, ...
4    [another, action-packed, movie, starring, jack...
Name: preprocessed_review, dtype: object
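
The fourth row shows that punctuation tokens such as `(` survive the preprocessing. If they turn out to be unwanted downstream, a small filter like the one below could be applied to each token list. This is a sketch, not part of the pipeline above; `drop_punctuation` is a hypothetical helper, and keeping any token with at least one alphanumeric character is an assumed rule.

```python
def drop_punctuation(tokens):
    """Keep tokens that contain at least one letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

# Sample tokens taken from the outputs above
sample = ["moving", "picture", "deal", "chan", "foh", "(", "jackie", "action-packed"]
cleaned = drop_punctuation(sample)

print(cleaned)
```

Hyphenated words such as `action-packed` are kept because they still contain letters.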

# Okay done!
The next notebook uses LDA to topic-model the movie reviews.