## <center> Reviews Data Enrichment & Processing </center>

> To perform our impact analysis, and more specifically the sentiment analysis, we needed more data on how viewers perceived the movies. We thought of enriching our dataset with reviews, as they provide a lot of information about the emotions viewers felt towards a film and thus towards the event the movie might deal with.
>
> We found a [dataset](https://www.kaggle.com/datasets/ebiswas/imdb-review-dataset) containing 5.5 million movie reviews from IMDb and merged it with our dataset, which was already enriched with box-office figures and an inflation index. However, since the file `reviews.csv` is 922.2 MB, we could not upload it to GitHub and process it like our other additional datasets. We therefore decided to process it and enrich the dataset in a separate notebook, run locally on our computers, where we could download the `reviews.csv` file and use it directly.
>
> To run this notebook properly, place it in the same directory as the files `reviews.csv`, `helpers.py`, and `movies_with_events.csv`, the last of which results from the first enrichment of our dataset in our main project notebook.
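
> As a minimal sketch, the directory requirement above can be checked before running the rest of the notebook. The helper `find_missing_files` is hypothetical (it is not part of `helpers.py`); only the three file names come from the text.

```python
from pathlib import Path

# File names taken from the text above; the helper itself is a hypothetical addition.
REQUIRED_FILES = ["reviews.csv", "helpers.py", "movies_with_events.csv"]


def find_missing_files(directory=".", required=REQUIRED_FILES):
    """Return the required file names that are absent from `directory`."""
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]


missing = find_missing_files()
if missing:
    print("Missing files:", ", ".join(missing))
```

> An empty list means the notebook has everything it needs in the current directory.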

In [1]:
# Useful Python packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime


# Additional functions

from helpers import *

In [2]:
# Defining paths to the .csv files

REVIEWS_PATH = "reviews.csv"
MOVIES_EVENTS_PATH = "movies_with_events.csv"


# Loading of the .csv files in DataFrames

reviews_df = pd.read_csv(REVIEWS_PATH)
movies_with_events_df = pd.read_csv(MOVIES_EVENTS_PATH)


# Getting an idea of the reviews dataset

reviews_df.head()

Unnamed: 0,review_id,reviewer,movie,rating,review_summary,review_date,spoiler_tag,review_detail,helpful
0,rw1133942,OriginalMovieBuff21,Kill Bill: Vol. 2 (2004),8.0,Good follow up that answers all the questions,24 July 2005,0,"After seeing Tarantino's Kill Bill Vol: 1, I g...","['0', '1']"
1,rw1133959,lost-in-limbo,Feardotcom (2002),3.0,"""I couldn't make much sense of it myself"". Too...",24 July 2005,0,There's a Website called FearDotCom and anyone...,"['1', '4']"
2,rw1133985,NateManD,Persona (1966),10.0,Persona gives me all the reasons to love art-h...,24 July 2005,0,"Long before ""Muholland Drive"" there was anothe...","['9', '23']"
3,rw1133999,CAMACHO-4,War of the Worlds (2005),3.0,A disappointing film from the team that you Mi...,24 July 2005,0,Spielberg said this film is based on the H.G. ...,"['9', '14']"
4,rw1134010,CAMACHO-4,Mr. & Mrs. Smith (2005),6.0,A fun action movie with great chemistry,24 July 2005,0,"Director Doug Liman, who's gotten famous for m...","['1', '3']"


In [3]:
# Renaming the column to further perform a merge

movies_with_events_df.rename(columns = {'name': 'movie'}, inplace = True)


# Getting an idea of our enriched dataset

movies_with_events_df.head()

Unnamed: 0.1,Unnamed: 0,id_wiki,id_freebase,movie,date,box_office,runtime,lang,country,genre,...,Vietnam War,Black History,Digital Revolution,STDs,Drug Abuse,Atomic Bomb,Genetic Engineering,LGBTQ,Terrorism,events_belongs_to
0,0,330,/m/0ktn59,Actrius,1996.0,,90.0,"['Catalan language', 'Spanish Language']",Spain,"['Drama', 'Comedy-drama']",...,False,False,False,False,False,False,False,False,False,[]
1,1,3217,/m/014hr,Army of Darkness,1992.0,21502796.0,81.0,English Language,United States of America,"['Cult', 'Horror', 'Stop motion', 'Costume dra...",...,False,False,False,False,False,False,True,False,False,['Genetic Engineering']
2,2,3333,/m/0151l,The Birth of a Nation,1915.0,50000000.0,190.0,"['Silent film', 'English Language']",United States of America,"['Silent film', 'Indie', 'Costume drama', 'Epi...",...,False,False,False,False,False,False,False,False,False,[]
3,3,3746,/m/017n9,Blade Runner,1982.0,33139618.0,116.0,"['Japanese Language', 'Cantonese', 'English La...","['United States of America', 'Hong Kong']","['Thriller', 'Cyberpunk', 'Science Fiction', '...",...,False,False,False,False,False,False,True,False,False,['Genetic Engineering']
4,4,3837,/m/018f8,Blazing Saddles,1974.0,119500000.0,93.0,"['Yiddish Language', 'English Language']",United States of America,"['Western', 'Satire', 'Comedy']",...,False,False,False,False,False,False,False,False,False,[]


In [4]:
# The regular expression r' \(\d+\)' matches a space followed by '(', one or more digits and ')'
# Each match is replaced with an empty string, which removes the release year from the title

# Processing the reviews dataset

reviews_df['movie'] = reviews_df['movie'].str.replace(r' \(\d+\)','', regex = True)
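
> To illustrate what this replacement does, here is a small sketch on a toy `Series` (the titles are illustrative, three of them taken from the preview above). The pattern removes any parenthesised run of digits preceded by a space, wherever it occurs in the title. The stricter variant below, which only strips a four-digit year at the end of the title, is an assumption about what one might prefer, not what the notebook uses.

```python
import pandas as pd

titles = pd.Series([
    "Kill Bill: Vol. 2 (2004)",
    "Persona (1966)",
    "2001: A Space Odyssey (1968)",
    "Untitled",  # no year: left unchanged
])

# Same pattern as in the cell above
cleaned = titles.str.replace(r" \(\d+\)", "", regex=True)

# Hypothetical stricter variant: only a 4-digit year at the very end of the title
cleaned_strict = titles.str.replace(r"\s*\(\d{4}\)$", "", regex=True)

print(cleaned.tolist())
# ['Kill Bill: Vol. 2', 'Persona', '2001: A Space Odyssey', 'Untitled']
```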

In [5]:
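# Merging both datasets on the title (left merge: every movie is kept, and a movie with several reviews yields several rows)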

movies_with_events_and_reviews_df = movies_with_events_df.merge(reviews_df, how = 'left', on = ['movie'])
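
> The behaviour of this left merge can be sketched on toy data (the frames below are made up for illustration). Movies without a review keep a single row with NaN in the review columns, movies with several reviews are repeated once per review, and reviews of movies absent from the left frame are discarded. The `indicator=True` argument is an optional addition that is not used in the notebook; it adds a `_merge` column showing where each row came from.

```python
import pandas as pd

movies = pd.DataFrame({"movie": ["Persona", "Actrius"], "date": [1966.0, 1996.0]})
reviews = pd.DataFrame({
    "movie": ["Persona", "Persona", "Feardotcom"],
    "rating": [10.0, 9.0, 3.0],
})

# Left merge on the title, with an indicator column for inspection
merged = movies.merge(reviews, how="left", on="movie", indicator=True)
print(merged)
```

> Note that matching on the title alone cannot tell apart two different films that share the same name, so some reviews may be attached to the wrong movie.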

In [6]:
movies_events_reviews_unique_df = movies_with_events_and_reviews_df.copy()


# Renaming the column 'movie' as in our original dataset

movies_events_reviews_unique_df.rename(columns = {'movie': 'name'}, inplace = True)


# Dropping duplicates: only the first review of each movie is kept (one row per original movie index)

movies_events_reviews_unique_df = movies_events_reviews_unique_df.drop_duplicates(subset = ['Unnamed: 0'], keep = 'first')


# Resetting the index to a default one

index_list = list(movies_events_reviews_unique_df['Unnamed: 0'])
movies_events_reviews_unique_df = movies_events_reviews_unique_df.set_index('Unnamed: 0').reindex(index_list)
movies_events_reviews_unique_df.reset_index(inplace = True)


# Cleaning the columns of our dataset

movies_events_reviews_unique_df = movies_events_reviews_unique_df.drop(['Unnamed: 0'], axis = 1)


# Getting an idea of our final dataset

movies_events_reviews_unique_df.head()

Unnamed: 0,id_wiki,id_freebase,name,date,box_office,runtime,lang,country,genre,id_wiki_movie,...,Terrorism,events_belongs_to,review_id,reviewer,rating,review_summary,review_date,spoiler_tag,review_detail,helpful
0,330,/m/0ktn59,Actrius,1996.0,,90.0,"['Catalan language', 'Spanish Language']",Spain,"['Drama', 'Comedy-drama']",330,...,False,[],,,,,,,,
1,3217,/m/014hr,Army of Darkness,1992.0,21502796.0,81.0,English Language,United States of America,"['Cult', 'Horror', 'Stop motion', 'Costume dra...",3217,...,False,['Genetic Engineering'],rw1132243,Coventry,7.0,Bruce Campbell kicks medieval butt in the last...,22 July 2005,0.0,"Well, one thing you can't possibly say about S...","['1', '7']"
2,3333,/m/0151l,The Birth of a Nation,1915.0,50000000.0,190.0,"['Silent film', 'English Language']",United States of America,"['Silent film', 'Indie', 'Costume drama', 'Epi...",3333,...,False,[],rw1139003,Cineanalyst,10.0,The Birth of an Art,31 July 2005,1.0,"Before ""The Birth of a Nation,"" motion picture...","['171', '241']"
3,3746,/m/017n9,Blade Runner,1982.0,33139618.0,116.0,"['Japanese Language', 'Cantonese', 'English La...","['United States of America', 'Hong Kong']","['Thriller', 'Cyberpunk', 'Science Fiction', '...",3746,...,False,['Genetic Engineering'],rw1144451,whpratt1,10.0,Looking Forward to Year 2019,7 August 2005,0.0,"Being a big fan of Harrison Ford, I just can't...","['0', '1']"
4,3837,/m/018f8,Blazing Saddles,1974.0,119500000.0,93.0,"['Yiddish Language', 'English Language']",United States of America,"['Western', 'Satire', 'Comedy']",3837,...,False,[],rw1159971,standardmetal,9.0,Blazing Saddles-the DVD (30th Anniversary Spec...,28 August 2005,0.0,Blazing Saddles is such a network favorite on ...,"['1', '2']"
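
> The deduplication step above can be sketched on toy data as follows. This is a shorter formulation that should give the same result as the `set_index` / `reindex` / `reset_index` sequence, since the values of `'Unnamed: 0'` are already unique once duplicates are dropped; the toy frame is made up for illustration.

```python
import pandas as pd

merged = pd.DataFrame({
    "Unnamed: 0": [0, 1, 1, 2],  # original movie index, repeated once per review
    "name": ["Actrius", "Persona", "Persona", "Blade Runner"],
    "rating": [None, 10.0, 9.0, 10.0],
})

unique = (
    merged.drop_duplicates(subset=["Unnamed: 0"], keep="first")  # first review per movie
          .drop(columns=["Unnamed: 0"])                          # helper column no longer needed
          .reset_index(drop=True)                                # default 0..n-1 index
)
print(unique)
```

> With `keep='first'`, the review that survives is simply the first one encountered in the file order of `reviews.csv`, which is not necessarily representative of all reviews of that movie.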


In [7]:
# Checking the NaN percentage in each column of our final enriched dataset

compute_nan_count_and_percentage(movies_events_reviews_unique_df)

id_wiki: 0 NaN values, which represents 0.00 % of the column.

id_freebase: 0 NaN values, which represents 0.00 % of the column.

name: 0 NaN values, which represents 0.00 % of the column.

date: 2619 NaN values, which represents 6.21 % of the column.

box_office: 21004 NaN values, which represents 49.77 % of the column.

runtime: 6624 NaN values, which represents 15.70 % of the column.

lang: 5264 NaN values, which represents 12.47 % of the column.

country: 3312 NaN values, which represents 7.85 % of the column.

genre: 411 NaN values, which represents 0.97 % of the column.

id_wiki_movie: 0 NaN values, which represents 0.00 % of the column.

summary: 0 NaN values, which represents 0.00 % of the column.

popularity: 21604 NaN values, which represents 51.19 % of the column.

vote_average: 21604 NaN values, which represents 51.19 % of the column.

vote_count: 21604 NaN values, which represents 51.19 % of the column.

inflation_index: 0 NaN values, which represents 0.00 % of the column.

> There are many NaN values in the review columns, but this still leaves us with around 5,000 films that we can consider for our sentiment analysis.
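
> Since `helpers.py` is not shown here, the following is a hypothetical stand-in for `compute_nan_count_and_percentage`, written only to reproduce the message format of the output above. It also shows how the number of films with at least one review could be counted; the toy frame is made up for illustration.

```python
import pandas as pd


def nan_report(df):
    """Hypothetical equivalent of the helper: one message per column."""
    lines = []
    for column in df.columns:
        n_nan = int(df[column].isna().sum())
        pct = 100 * n_nan / len(df) if len(df) else 0.0
        lines.append(
            f"{column}: {n_nan} NaN values, which represents {pct:.2f} % of the column."
        )
    return lines


toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "review_id": ["rw1", None, None, "rw2"],
})

for line in nan_report(toy):
    print(line)

# Number of films that have a review attached
n_with_review = int(toy["review_id"].notna().sum())
print(n_with_review)
```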

In [8]:
# Saving our final dataset in a .csv file

movies_events_reviews_unique_df.to_csv('data/AdditionalDatasets/movies_events_reviews.csv')
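
> A side note on saving: by default `to_csv` also writes the index, which comes back as an `Unnamed: 0` column when the file is read again with `pd.read_csv`. The sketch below shows the difference on a toy frame, using in-memory buffers instead of files. Whether `index=False` is appropriate here is an assumption, as it depends on whether the main notebook expects that extra column.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["Actrius", "Persona"]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# With index=False, only the real columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'name']
print(cols_without)  # ['name']
```

> Also note that `to_csv` does not create missing directories, so `data/AdditionalDatasets/` must exist before the cell above is run.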

> We can now use the dataset enriched with reviews to perform a sentiment analysis.