# Data Exploration

In [59]:
import pandas as pd
import numpy as np
import re
import clean

## Movies Dataframe
The Movies dataframe contains detailed information about over 940,000 films, including their unique IDs, titles, release dates, poster taglines, short descriptions, durations in minutes, and their ratings.

In [60]:
movies_df = pd.read_csv('../csv/movies.csv')

In [61]:
movies_df.shape

(941597, 7)

This dataset consists of 941,597 rows and 7 columns, providing comprehensive details about a vast collection of films.

In [62]:
movies_df.head()

Unnamed: 0,id,name,date,tagline,description,minute,rating
0,1000001,Barbie,2023.0,She's everything. He's just Ken.,Barbie and Ken are having the time of their li...,114.0,3.86
1,1000002,Parasite,2019.0,Act like you own the place.,"All unemployed, Ki-taek's family takes peculia...",133.0,4.56
2,1000003,Everything Everywhere All at Once,2022.0,The universe is so much bigger than you realize.,An aging Chinese immigrant is swept up in an i...,140.0,4.3
3,1000004,Fight Club,1999.0,Mischief. Mayhem. Soap.,A ticking-time-bomb insomniac and a slippery s...,139.0,4.27
4,1000005,La La Land,2016.0,Here's to the fools who dream.,"Mia, an aspiring actress, serves lattes to mov...",129.0,4.09


In [63]:
movies_df.dtypes

id               int64
name            object
date           float64
tagline         object
description     object
minute         float64
rating         float64
dtype: object

In [64]:
movies_df['date'] = pd.to_datetime(movies_df['date'], format='%Y', errors='coerce')

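The <code>date</code> column was stored as a <code>float</code> holding only the release year, which is not suitable for representing release dates. The cell above converts it to a proper datetime type for accurate analysis; values that cannot be parsed become <code>NaT</code> because of <code>errors='coerce'</code>.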
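
As a minimal sketch of the same idea on toy data, the snippet below turns float years into timestamps. Rendering each year as a four-digit string first is an assumption added here (it is not part of the notebook), so that <code>format='%Y'</code> never sees text such as <code>2023.0</code>. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for movies_df['date']: release years stored as floats, with one gap.
years = pd.Series([2023.0, 1999.0, np.nan])

# Render each year as a plain 4-digit string; keep missing years as None.
year_text = years.map(lambda y: str(int(y)) if pd.notna(y) else None)

# Parse the strings as years; missing or unparseable entries become NaT.
dates = pd.to_datetime(year_text, format='%Y', errors='coerce')

print(dates)
```

Each parsed year lands on January 1st of that year, which matches the <code>2015-01-01</code> style values shown later in the notebook.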

In [65]:
movies_df['name'] = movies_df['name'].astype('string')
movies_df['tagline'] = movies_df['tagline'].astype('string')
movies_df['description'] = movies_df['description'].astype('string')

The text columns stored as <code>object</code> are converted to the pandas <code>string</code> dtype.
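
The same conversion can be written as a single <code>astype</code> call with a column-to-dtype mapping, which may be convenient when many text columns are involved. This is only a sketch on a made-up two-row frame; the <code>text_columns</code> list is an illustrative name.

```python
import pandas as pd

# Made-up sample with the same kind of columns as movies_df.
df = pd.DataFrame({
    'id': [1000001, 1000002],
    'name': ['Barbie', 'Parasite'],
    'tagline': ["She's everything. He's just Ken.", None],
})

# Convert every listed text column in one call; other columns are untouched.
text_columns = ['name', 'tagline']
df = df.astype({col: 'string' for col in text_columns})

print(df.dtypes)
```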

In [66]:
movies_df.dtypes

id                      int64
name           string[python]
date           datetime64[ns]
tagline        string[python]
description    string[python]
minute                float64
rating                float64
dtype: object

Result of the conversions.

### Cleaning text columns

In [67]:
# Strip surrounding whitespace and lowercase the 'name', 'tagline', and 'description' columns
movies_df['name'] = movies_df['name'].str.strip().str.lower()
movies_df['tagline'] = movies_df['tagline'].str.strip().str.lower()
movies_df['description'] = movies_df['description'].str.strip().str.lower()

In [68]:
columns_to_clean = ['description', 'name', 'tagline']
characters_to_replace = ['.', '-', ',', '#']
patterns_to_replace = [ r'^-+$',          # One or more dashes and nothing else
                        r'^-placeholder-$', # The exact word "placeholder" between dashes
                        r'^-Placeholder-$', # The same word with a capital "P"
]

movies_df = clean.clean_data(movies_df, columns_to_clean, characters_to_replace, patterns_to_replace)

These cells clean the <code>name</code>, <code>tagline</code>, and <code>description</code> columns of <code>movies_df</code>: surrounding whitespace is stripped and the text is lowercased. Unwanted values, such as specific characters and placeholder words, are then replaced with missing values (<code>NaN</code>).
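
The source of <code>clean.clean_data</code> is not shown in this notebook. The sketch below is a hypothetical stand-in, under the assumption that a cell is blanked out when it consists solely of one of the unwanted characters or fully matches one of the placeholder patterns. The function name <code>replace_junk_with_na</code> and the sample rows are invented for illustration.

```python
import pandas as pd

def replace_junk_with_na(df, columns, junk_values, junk_patterns):
    """Hypothetical sketch, NOT the real clean.clean_data: blank out junk cells."""
    out = df.copy()
    # Wrap all patterns in one alternation so each column is scanned once.
    combined = "(?:" + "|".join(f"(?:{p})" for p in junk_patterns) + ")"
    for col in columns:
        text = out[col].astype("string")
        # Cells that consist solely of one unwanted character, e.g. "-" or ".".
        is_junk = text.isin(junk_values)
        if junk_patterns:
            # Cells that match a pattern in full; missing cells count as no match.
            hits = text.str.fullmatch(combined).fillna(False).astype(bool)
            is_junk = is_junk | hits
        out[col] = text.mask(is_junk, pd.NA)
    return out

sample = pd.DataFrame({"name": ["barbie", "-", "---", "-placeholder-", "#1"]})
cleaned = replace_junk_with_na(
    sample, ["name"], [".", "-", ",", "#"], [r"^-+$", r"^-placeholder-$"]
)
print(cleaned)
```

Under this assumption a title such as <code>#1</code> survives because it is more than a lone <code>#</code>, which is consistent with such titles still appearing in later outputs.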

### Check for <i>null</i> values

In [69]:
movies_df.isna().sum()

id                  0
name               49
date            91913
tagline        802220
description    160852
minute         181570
rating         850598
dtype: int64

In [70]:
movies_df[movies_df['name'].isna()]

Unnamed: 0,id,name,date,tagline,description,minute,rating
287514,1287515,,2015-01-01,,none is a short film that explores the balance...,4.0,
307905,1307906,,2018-01-01,they're all gonna laugh at you,a surreal visual film about the inner feelings...,9.0,
320567,1320568,,NaT,,directed by lutfi akad,,
323158,1323159,,NaT,,,,
330334,1330335,,NaT,,,,
336644,1336645,,2020-01-01,,"infodemic, pandemic, pandemonium",4.0,
417362,1417363,,2020-01-01,,andrew ohanesian’s conceptual works (primarily...,2.0,
424121,1424122,,2011-01-01,,the film raises some of the problems experienc...,,
457879,1457880,,2022-01-01,what is youth but vastness?,a young soul wanders through their own daydreams,3.0,
490622,1490623,,2000-01-01,,"eleven is a fourteen-minute long vignette, vir...",15.0,


In [71]:
movies_df = movies_df.dropna(subset=['name'])
# drop rows with a null name

We chose to remove tuples with <u>**null**</u> <code>name</code>.

### Check for <i>duplicate</i> values

In [72]:
movies_df.duplicated().sum()

np.int64(0)

In [73]:
movies_df['id'].duplicated().any()

np.False_

In [74]:
# Check whether any films share the same name
movies_df['name'].duplicated().any()

np.True_

In [75]:
duplicati = movies_df[movies_df['name'].duplicated(keep=False)]

duplicati = duplicati.sort_values(by='name')

duplicati

Unnamed: 0,id,name,date,tagline,description,minute,rating
524259,1524260,#1,2022-01-01,a modern story of suspense... in their biggest...,a broke comic shop owner and an obsessive coll...,8.0,
526951,1526952,#1,2011-01-01,,short movie by tomoyasu murata,5.0,
393181,1393182,#1,2009-01-01,,"you're running and you're running away, but yo...",4.0,
204330,1204331,#2,1993-01-01,,joost rekveld's first film is an hommage to th...,12.0,
885075,1885076,#2,2023-01-01,,subject #2's death,2.0,
...,...,...,...,...,...,...,...
676051,1676052,정브르의 동물일기,2021-01-01,,,,
848930,1848931,조폭의 브이로그,2023-01-01,,,,
884959,1884960,조폭의 브이로그,2023-01-01,,,57.0,
782298,1782299,코리안 커넥션,1990-01-01,,,117.0,


We found more than 200,000 tuples with a duplicate <code>name</code> value; we analyze them next.
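
For reference, the difference between the default <code>duplicated()</code> and <code>duplicated(keep=False)</code> used above can be seen on a small made-up series: the default flags only the repeats after the first occurrence, while <code>keep=False</code> flags every member of a duplicated group.

```python
import pandas as pd

# Made-up titles; '#1' appears three times.
names = pd.Series(['#1', 'barbie', '#1', 'parasite', '#1'])

later_copies = names.duplicated()          # repeats after the first occurrence
all_copies = names.duplicated(keep=False)  # every row whose name occurs more than once

print(later_copies.tolist())
print(all_copies.tolist())

# How many rows share each duplicated name.
print(names[all_copies].value_counts())
```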

In [76]:
# Remove duplicates from the DataFrame
movies_df = clean.remove_duplicates(movies_df, clean.get_columns_conditions(movies_df))

We decided to remove duplicates with: 

<ul>
  <li>same <code>name</code>, <code>description</code>, <code>date</code>, <code>tagline</code>, <code>minute</code>, <code>rating</code></li>
  <li>same <code>name</code> but null <code>description</code>, <code>date</code>, <code>tagline</code>, <code>minute</code>, <code>rating</code></li>
  <li>same <code>name</code>, <code>date</code>, <code>description</code> but null <code>tagline</code>, <code>minute</code>, <code>rating</code></li>
  <li>...</li>
</ul>
keeping the first occurrence.
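
The helpers <code>clean.remove_duplicates</code> and <code>clean.get_columns_conditions</code> are not shown here, so the following is only a rough sketch of the three conditions listed explicitly above, not of the elided ones. It relies on the fact that pandas treats missing values in the same column as equal when detecting duplicates, so rows that share a <code>name</code> and are null everywhere else collapse to one row. The sample data is made up.

```python
import numpy as np
import pandas as pd

# Made-up rows: ids 1-2 are identical apart from the id, and so are ids 3-4.
films = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['#1', '#1', '#1', '#1', 'barbie'],
    'date': pd.to_datetime(['2022-01-01', '2022-01-01', None, None, None]),
    'description': ['a broke comic shop owner', 'a broke comic shop owner', None, None, None],
    'minute': [8.0, 8.0, np.nan, np.nan, np.nan],
})

# Compare on every column except the unique id; keep the first row of each group.
compare_columns = [col for col in films.columns if col != 'id']
deduped = films.drop_duplicates(subset=compare_columns, keep='first')

print(deduped['id'].tolist())
```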

In [77]:
movies_df[(movies_df['rating'] > 5) | (movies_df['rating'] < 0)].shape[0]

0

We verified that the <code>rating</code> column contains no values outside the 0 to 5 range.
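
An equivalent range check can be sketched with <code>Series.between</code>. Missing ratings have to be excluded explicitly, because <code>between</code> returns <code>False</code> for them and negating it would otherwise count them as out of range. The values below are made up.

```python
import pandas as pd

# Made-up ratings: two valid, one missing, two outside the 0-5 range.
ratings = pd.Series([3.86, 4.56, None, 5.2, -0.1])

# Out of range means: present, and not within [0, 5] (bounds inclusive).
out_of_range = ratings.notna() & ~ratings.between(0, 5)

print(int(out_of_range.sum()))
```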

---

## Countries Dataframe
This dataset contains over 693,000 entries with the columns 'id', representing the film ID, and 'country', indicating one of the originating countries, with the possibility of multiple countries per film.

In [78]:
countries_df = pd.read_csv('../csv/countries.csv')

In [79]:
countries_df.shape

(693476, 2)

This dataset consists of 693,476 rows and 2 columns, providing detailed information about the countries associated with each film.

In [80]:
countries_df.head()

Unnamed: 0,id,country
0,1000001,UK
1,1000001,USA
2,1000002,South Korea
3,1000003,USA
4,1000004,Germany


In [81]:
countries_df['country'] = countries_df['country'].astype('string')

In [82]:
# Check the data type of the 'country' column
countries_df.dtypes

id                  int64
country    string[python]
dtype: object

In [83]:
countries_df.isna().sum()

id         0
country    0
dtype: int64

There are no null values in either column, and no duplicate entries were found in the dataset.

We also verified that the countries were not duplicated under different names.
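
How that verification was done is not shown. One possible sketch is to group the distinct country names by a normalized key and list the keys that have more than one spelling. It only catches differences in case and surrounding whitespace; true aliases (a country written under two different names) would still need a manual review of the unique values. The sample names are made up.

```python
import pandas as pd

# Made-up country values with one case/whitespace variant.
countries = pd.Series(['USA', 'usa ', 'UK', 'South Korea', 'UK'])

# One row per distinct raw spelling, plus a normalized comparison key.
spellings = pd.DataFrame({'raw': countries.drop_duplicates()})
spellings['key'] = spellings['raw'].str.strip().str.casefold()

# Keys that are written in more than one way in the raw data.
by_key = spellings.groupby('key')['raw'].agg(list)
variants = by_key[by_key.map(len) > 1]

print(variants)
```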

---

## Actors Dataframe
This dataset contains over 5.7 million entries with the columns 'id', representing the film ID, 'name', indicating the name of the actor, and 'role', specifying the character played.

In [84]:
actors_df = pd.read_csv('../csv/actors.csv')

In [85]:
actors_df.shape

(5798450, 3)

This dataset consists of 5,798,450 rows and 3 columns, providing detailed information about the actors, their roles, and the films they appear in.

In [86]:
actors_df.head()

Unnamed: 0,id,name,role
0,1000001,Margot Robbie,Barbie
1,1000001,Ryan Gosling,Ken
2,1000001,America Ferrera,Gloria
3,1000001,Ariana Greenblatt,Sasha
4,1000001,Issa Rae,Barbie


In [87]:
actors_df['name'] = actors_df['name'].astype('string')
actors_df['role'] = actors_df['role'].astype('string')

actors_df.dtypes

id               int64
name    string[python]
role    string[python]
dtype: object

In [88]:
actors_df.isna().sum()

id            0
name          4
role    1361559
dtype: int64

In [89]:
actors_df[actors_df['name'].isnull()]

Unnamed: 0,id,name,role
4145738,1443629,,
4281100,1469981,,Self
4306960,1474958,,Cinematography
5430275,1773264,,


In [90]:
actors_df = actors_df.dropna(subset=['name'])

Tuples with a null <code>name</code> have been deleted.

In [91]:
actors_df.duplicated().sum()

np.int64(946)

In [92]:
actors_df = actors_df.drop_duplicates()

In [93]:
actors_df.duplicated().sum()

np.int64(0)

All the duplicates have been removed.

---

## Crew Dataframe
This dataset contains over 4.7 million entries with the columns 'id', representing the film ID, 'role', indicating the role (e.g., director), and 'name', specifying the name of the person.

In [94]:
crew_df = pd.read_csv('../csv/crew.csv')

In [95]:
crew_df.shape

(4720183, 3)

This dataset consists of 4,720,183 rows and 3 columns, providing detailed information about the crew members, their roles, and the films they are associated with.

In [96]:
crew_df.head()

Unnamed: 0,id,role,name
0,1000001,Director,Greta Gerwig
1,1000001,Producer,Tom Ackerley
2,1000001,Producer,Margot Robbie
3,1000001,Producer,Robbie Brenner
4,1000001,Producer,David Heyman


In [97]:
crew_df['role'] = crew_df['role'].astype('string')
crew_df['name'] = crew_df['name'].astype('string')

crew_df.dtypes

id               int64
role    string[python]
name    string[python]
dtype: object

In [98]:
crew_df.isna().sum()

id      0
role    0
name    1
dtype: int64

In [186]:
crew_df.dropna(subset=['name'], inplace=True)
crew_df.isna().sum()

id      0
role    0
name    0
dtype: int64

In [100]:
crew_df.duplicated().sum()

np.int64(1282)

In [187]:
crew_df = crew_df.drop_duplicates()

In [188]:
crew_df.duplicated().sum()

np.int64(0)

All the duplicates have been removed.

---

## Posters Dataframe
This dataset contains over 940,000 entries with the columns 'id', representing the film ID, and 'link', providing the link to the original film posters.

In [103]:
posters_df = pd.read_csv('../csv/posters.csv')

In [104]:
posters_df.shape

(941597, 2)

This dataset consists of 941,597 rows and 2 columns, providing links to the original film posters associated with each film.

In [105]:
posters_df

Unnamed: 0,id,link
0,1000001,https://a.ltrbxd.com/resized/film-poster/2/7/7...
1,1000002,https://a.ltrbxd.com/resized/film-poster/4/2/6...
2,1000003,https://a.ltrbxd.com/resized/film-poster/4/7/4...
3,1000004,https://a.ltrbxd.com/resized/film-poster/5/1/5...
4,1000005,https://a.ltrbxd.com/resized/film-poster/2/4/0...
...,...,...
941592,1941593,
941593,1941594,
941594,1941595,https://a.ltrbxd.com/resized/film-poster/1/1/8...
941595,1941596,https://a.ltrbxd.com/resized/film-poster/1/1/8...


In [106]:
posters_df['link'] = posters_df['link'].astype('string')

posters_df.dtypes

id               int64
link    string[python]
dtype: object

In [107]:
posters_df.isna().sum()

id           0
link    180712
dtype: int64

In [108]:
posters_df = posters_df.dropna(subset=['link'])

This removes all rows of <code>posters_df</code> that have a <code>NaN</code> value in the <code>link</code> column.

In [109]:
posters_df.duplicated().sum()

np.int64(0)

---

## Releases Dataframe
This dataset contains over 1.3 million entries with the columns 'id', representing the film ID, 'country', indicating the country of release, 'date', showing the release date, 'type', specifying the release type (e.g., Theatrical, Digital), and 'rating', representing the rating received in that country (e.g., PG, etc.).

In [110]:
releases_df = pd.read_csv('../csv/releases.csv')

In [111]:
releases_df.shape

(1332782, 5)

This dataset consists of 1,332,782 rows and 5 columns, providing detailed information about the film releases, including the country, date, type, and rating in each country.

In [112]:
releases_df.head()

Unnamed: 0,id,country,date,type,rating
0,1000001,Andorra,2023-07-21,Theatrical,
1,1000001,Argentina,2023-07-20,Theatrical,ATP
2,1000001,Australia,2023-07-19,Theatrical,PG
3,1000001,Australia,2023-10-01,Digital,PG
4,1000001,Austria,2023-07-20,Theatrical,


In [113]:
releases_df['country'] = releases_df['country'].astype('string')
releases_df['type'] = releases_df['type'].astype('string')
releases_df['rating'] = releases_df['rating'].astype('string')
releases_df['date'] = pd.to_datetime(releases_df['date'])

releases_df.dtypes

id                  int64
country    string[python]
date       datetime64[ns]
type       string[python]
rating     string[python]
dtype: object

The date column was originally stored as an object, which is not suitable for representing release dates, so the cell above converts it to a proper datetime format for accurate analysis.

In [114]:
releases_df.isna().sum()

id              0
country         0
date            0
type            0
rating     998802
dtype: int64

There are many null values in the 'rating' column. They are left as is, since they may represent cases where a rating was not assigned in certain countries.

In [115]:
releases_df.duplicated().sum()

np.int64(0)

---

## Genres Dataframe

In [152]:
genres_df = pd.read_csv('../csv/genres.csv')

In [153]:
genres_df.shape

(1046849, 2)

In [154]:
genres_df.head()

Unnamed: 0,id,genre
0,1000001,Comedy
1,1000001,Adventure
2,1000002,Comedy
3,1000002,Thriller
4,1000002,Drama


In [155]:
genres_df.dtypes

id        int64
genre    object
dtype: object

In [156]:
genres_df['genre'] = genres_df['genre'].astype('string')

In [157]:
genres_df.isna().sum()

id       0
genre    0
dtype: int64

In [158]:
genres_df.duplicated().sum()

np.int64(0)

## Languages Dataframe

In [159]:
languages_df = pd.read_csv('../csv/languages.csv')

In [160]:
languages_df.shape

(1038762, 3)

In [161]:
languages_df.head()

Unnamed: 0,id,type,language
0,1000001,Language,English
1,1000002,Primary language,Korean
2,1000002,Spoken language,English
3,1000002,Spoken language,German
4,1000002,Spoken language,Korean


In [162]:
languages_df.dtypes

id           int64
type        object
language    object
dtype: object

In [163]:
languages_df['language'] = languages_df['language'].astype('string')
languages_df['type'] = languages_df['type'].astype('string')

In [164]:
languages_df.isna().sum()

id          0
type        0
language    0
dtype: int64

In [165]:
languages_df.duplicated().sum()

np.int64(0)

## Studios Dataframe

In [166]:
studios_df = pd.read_csv('../csv/studios.csv')

In [167]:
studios_df.shape

(679283, 2)

In [168]:
studios_df.head()

Unnamed: 0,id,studio
0,1000001,LuckyChap Entertainment
1,1000001,Heyday Films
2,1000001,NB/GG Pictures
3,1000001,Mattel
4,1000001,Warner Bros. Pictures


In [169]:
studios_df.dtypes

id         int64
studio    object
dtype: object

In [170]:
studios_df['studio'] = studios_df['studio'].astype('string')

In [171]:
studios_df.isna().sum()

id         0
studio    10
dtype: int64

In [173]:
studios_df = studios_df.dropna(subset=['studio'])
studios_df.isna().sum()

id        0
studio    0
dtype: int64

In [174]:
studios_df.duplicated().sum()

np.int64(212)

In [177]:
studios_df = studios_df.drop_duplicates()
studios_df.duplicated().sum()

np.int64(0)

## Themes Dataframe

In [179]:
themes_df = pd.read_csv('../csv/themes.csv')

In [180]:
themes_df.shape

(125641, 2)

In [181]:
themes_df.head()

Unnamed: 0,id,theme
0,1000001,Humanity and the world around us
1,1000001,Crude humor and satire
2,1000001,Moving relationship stories
3,1000001,Emotional and captivating fantasy storytelling
4,1000001,Surreal and thought-provoking visions of life ...


In [182]:
themes_df.dtypes

id        int64
theme    object
dtype: object

In [183]:
themes_df['theme'] = themes_df['theme'].astype('string')

In [184]:
themes_df.isna().sum()

id       0
theme    0
dtype: int64

In [185]:
themes_df.duplicated().sum()

np.int64(0)

## Oscar Awards Dataframe
This dataset contains over 10,000 entries about Oscar candidates, with the columns 'year_film', representing the year the film was released, 'year_ceremony', indicating the year of the ceremony, 'ceremony', giving the ceremony number, 'category', specifying the category (e.g., actor), 'name', showing the name of the person, 'film', listing the title of the film, and 'winner', indicating whether they won or not.

In [116]:
oscar_awards_df = pd.read_csv('../csv/the_oscar_awards.csv')

In [117]:
oscar_awards_df.shape

(10889, 7)

This dataset consists of 10,889 rows and 7 columns, providing detailed information about Oscar candidates, including the year of the film, ceremony, category, person, film title, and whether the candidate won.

In [118]:
oscar_awards_df.head()

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
0,1927,1928,1,ACTOR,Richard Barthelmess,The Noose,False
1,1927,1928,1,ACTOR,Emil Jannings,The Last Command,True
2,1927,1928,1,ACTRESS,Louise Dresser,A Ship Comes In,False
3,1927,1928,1,ACTRESS,Janet Gaynor,7th Heaven,True
4,1927,1928,1,ACTRESS,Gloria Swanson,Sadie Thompson,False


In [119]:
oscar_awards_df['category'] = oscar_awards_df['category'].astype('string')
oscar_awards_df['name'] = oscar_awards_df['name'].astype('string')
oscar_awards_df['film'] = oscar_awards_df['film'].astype('string')

oscar_awards_df.dtypes

year_film                 int64
year_ceremony             int64
ceremony                  int64
category         string[python]
name             string[python]
film             string[python]
winner                     bool
dtype: object

In [120]:
oscar_awards_df.isna().sum()

year_film          0
year_ceremony      0
ceremony           0
category           0
name               5
film             319
winner             0
dtype: int64

In [121]:
oscar_awards_df[oscar_awards_df['name'].isnull()]

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
10513,2020,2021,93,JEAN HERSHOLT HUMANITARIAN AWARD,,,True
10514,2020,2021,93,JEAN HERSHOLT HUMANITARIAN AWARD,,,True
10635,2021,2022,94,JEAN HERSHOLT HUMANITARIAN AWARD,,,True
10759,2022,2023,95,JEAN HERSHOLT HUMANITARIAN AWARD,,,True
10885,2023,2024,96,JEAN HERSHOLT HUMANITARIAN AWARD,,,True


In [122]:
oscar_awards_df[oscar_awards_df['film'].isnull()]

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
16,1927,1928,1,ENGINEERING EFFECTS,Ralph Hammeras,,False
18,1927,1928,1,ENGINEERING EFFECTS,Nugent Slaughter,,False
31,1927,1928,1,WRITING (Title Writing),Joseph Farnham,,True
32,1927,1928,1,WRITING (Title Writing),"George Marion, Jr.",,False
33,1927,1928,1,SPECIAL AWARD,Warner Bros.,,True
...,...,...,...,...,...,...,...
10763,2022,2023,95,GORDON E. SAWYER AWARD,Iain Neil,,True
10885,2023,2024,96,JEAN HERSHOLT HUMANITARIAN AWARD,,,True
10886,2023,2024,96,HONORARY AWARD,"To Angela Bassett, who has inspired audiences ...",,True
10887,2023,2024,96,HONORARY AWARD,"To Mel Brooks, for his comedic brilliance, pro...",,True


Null <code>film</code> values were found, but they will be kept, as they may represent important entries for the context of the dataset.

In [123]:
invalid_oscar = oscar_awards_df[oscar_awards_df['name'].isna() & oscar_awards_df['film'].isna()]

invalid_oscar

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
10513,2020,2021,93,JEAN HERSHOLT HUMANITARIAN AWARD,,,True
10514,2020,2021,93,JEAN HERSHOLT HUMANITARIAN AWARD,,,True
10635,2021,2022,94,JEAN HERSHOLT HUMANITARIAN AWARD,,,True
10759,2022,2023,95,JEAN HERSHOLT HUMANITARIAN AWARD,,,True
10885,2023,2024,96,JEAN HERSHOLT HUMANITARIAN AWARD,,,True


In [124]:
oscar_awards_df = oscar_awards_df.drop(invalid_oscar.index)

The <i>Jean Hersholt Humanitarian Award</i> is a special category for exceptional contributions to humanitarian causes.

Since these tuples have no value for <code>name</code>, we decided not to keep them.

In [125]:
oscar_awards_df.duplicated().sum()

np.int64(6)

In [126]:
oscar_awards_df = oscar_awards_df.drop_duplicates()

In [127]:
oscar_awards_df.duplicated().sum()

np.int64(0)

All the duplicates have been removed.

---

## Rotten Tomatoes Reviews Dataframe
This dataset contains over 1.1 million reviews from Rotten Tomatoes, with the columns <code>rotten_tomatoes_link</code>, providing the URL of the Rotten Tomatoes page, <code>movie_title</code>, indicating the title of the movie, <code>critic_name</code>, showing the name of the critic, <code>top_critic</code>, indicating if the critic is a top critic, <code>publisher_name</code>, specifying where the review was published, <code>review_type</code>, categorizing the type of review, <code>review_score</code>, representing the score given in the review, <code>review_date</code>, showing the date of the review, and <code>review_content</code>, containing the text of the actual review.

In [128]:
rotten_tomatoes_df = pd.read_csv('../csv/rotten_tomatoes_reviews.csv')

In [129]:
rotten_tomatoes_df.shape

(1129887, 9)

This dataset consists of 1,129,887 rows and 9 columns, providing detailed information about Rotten Tomatoes reviews, including the review URL, movie title, critic details, review score, date, and content.

In [130]:
rotten_tomatoes_df

Unnamed: 0,rotten_tomatoes_link,movie_title,critic_name,top_critic,publisher_name,review_type,review_score,review_date,review_content
0,m/0814255,Percy Jackson & the Olympians: The Lightning T...,Andrew L. Urban,False,Urban Cinefile,Fresh,,2010-02-06,A fantasy adventure that fuses Greek mythology...
1,m/0814255,Percy Jackson & the Olympians: The Lightning T...,Louise Keller,False,Urban Cinefile,Fresh,,2010-02-06,"Uma Thurman as Medusa, the gorgon with a coiff..."
2,m/0814255,Percy Jackson & the Olympians: The Lightning T...,,False,FILMINK (Australia),Fresh,,2010-02-09,With a top-notch cast and dazzling special eff...
3,m/0814255,Percy Jackson & the Olympians: The Lightning T...,Ben McEachen,False,Sunday Mail (Australia),Fresh,3.5/5,2010-02-09,Whether audiences will get behind The Lightnin...
4,m/0814255,Percy Jackson & the Olympians: The Lightning T...,Ethan Alter,True,Hollywood Reporter,Rotten,,2010-02-10,What's really lacking in The Lightning Thief i...
...,...,...,...,...,...,...,...,...,...
1129882,m/zulu_dawn,Zulu Dawn,Chuck O'Leary,False,Fantastica Daily,Rotten,2/5,2005-11-02,
1129883,m/zulu_dawn,Zulu Dawn,Ken Hanke,False,"Mountain Xpress (Asheville, NC)",Fresh,3.5/5,2007-03-07,"Seen today, it's not only a startling indictme..."
1129884,m/zulu_dawn,Zulu Dawn,Dennis Schwartz,False,Dennis Schwartz Movie Reviews,Fresh,B+,2010-09-16,A rousing visual spectacle that's a prequel of...
1129885,m/zulu_dawn,Zulu Dawn,Christopher Lloyd,False,Sarasota Herald-Tribune,Rotten,3.5/5,2011-02-28,"A simple two-act story: Prelude to war, and th..."


In [131]:
rotten_tomatoes_df.dtypes

rotten_tomatoes_link    object
movie_title             object
critic_name             object
top_critic                bool
publisher_name          object
review_type             object
review_score            object
review_date             object
review_content          object
dtype: object

Many columns are currently stored as <code>object</code>; we will convert them to more useful formats for manipulation.

In [132]:
rotten_tomatoes_df['review_date'] = pd.to_datetime(rotten_tomatoes_df['review_date'])
rotten_tomatoes_df['movie_title'] = rotten_tomatoes_df['movie_title'].astype('string')
rotten_tomatoes_df['critic_name'] = rotten_tomatoes_df['critic_name'].astype('string')
rotten_tomatoes_df['publisher_name'] = rotten_tomatoes_df['publisher_name'].astype('string')
rotten_tomatoes_df['review_type'] = rotten_tomatoes_df['review_type'].astype('string')
rotten_tomatoes_df['review_score'] = rotten_tomatoes_df['review_score'].astype('string')
rotten_tomatoes_df['review_content'] = rotten_tomatoes_df['review_content'].astype('string')

rotten_tomatoes_df.dtypes

rotten_tomatoes_link            object
movie_title             string[python]
critic_name             string[python]
top_critic                        bool
publisher_name          string[python]
review_type             string[python]
review_score            string[python]
review_date             datetime64[ns]
review_content          string[python]
dtype: object

In [133]:
rotten_tomatoes_df.isna().sum()

rotten_tomatoes_link         0
movie_title                  0
critic_name              18521
top_critic                   0
publisher_name               0
review_type                  0
review_score            305902
review_date                  0
review_content           65778
dtype: int64

In [134]:
rotten_tomatoes_df[rotten_tomatoes_df['review_score'].isnull() &
                           rotten_tomatoes_df['review_content'].isnull() &
                           rotten_tomatoes_df['critic_name'].isnull()]

Unnamed: 0,rotten_tomatoes_link,movie_title,critic_name,top_critic,publisher_name,review_type,review_score,review_date,review_content
118910,m/alice_sweet_alice,"Alice, Sweet Alice (Communion)",,False,Film4,Rotten,,2003-05-24,
219988,m/callas_forever,Callas Forever,,True,Denver Rocky Mountain News,Rotten,,2004-12-17,
323790,m/escape_to_witch_mountain,Escape to Witch Mountain,,True,Chicago Reader,Fresh,,2000-01-01,
349208,m/flawless,Flawless,,False,E! Online,Rotten,,2000-01-01,
493769,m/kings_ransom,King's Ransom,,False,Hollywood.com,Fresh,,2005-04-23,
619413,m/nurse_betty,Nurse Betty,,True,Atlanta Journal-Constitution,Fresh,,2000-01-01,
751877,m/saddest_music_in_the_world,The Saddest Music in the World,,False,Premiere Magazine,Rotten,,2004-07-03,
751879,m/saddest_music_in_the_world,The Saddest Music in the World,,False,Premiere Magazine,Rotten,,2004-07-03,


Reviews without <code>review_score</code>, <code>review_content</code>, <code>critic_name</code> were found but will be kept because they may be relevant in the future.

In [135]:
rotten_tomatoes_df.duplicated().sum()

np.int64(119471)

In [136]:
rotten_tomatoes_df = rotten_tomatoes_df.drop_duplicates()

In [137]:
rotten_tomatoes_df.duplicated().sum()

np.int64(0)

All the duplicates have been removed.

# Final thoughts on dataframes

In [138]:
# Find all unique film titles in the Rotten Tomatoes dataset
unique_rotten_tomatoes_movies = rotten_tomatoes_df['movie_title'].unique()

# Find the unique film titles in the movies dataset
unique_movies = movies_df['name'].unique()

# Convert all film titles to strings and then to lowercase for comparison
unique_rotten_tomatoes_movies = [str(title).lower() for title in unique_rotten_tomatoes_movies]
unique_movies = [str(title).lower() for title in unique_movies]

# Find the films that appear in both datasets
common_movies = set(unique_rotten_tomatoes_movies).intersection(set(unique_movies))

# Print the number of films found
print(f"Number of unique films on Rotten Tomatoes: {len(unique_rotten_tomatoes_movies)}")
print(f"Number of unique movies in the movie dataset: {len(unique_movies)}")
print(f"Number of common films: {len(common_movies)}")

Number of unique films on Rotten Tomatoes: 17100
Number of unique movies in the movie dataset: 786963
Number of common films: 15056
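
With the counts printed above, 15,056 of the 17,100 Rotten Tomatoes titles (roughly 88%) have a name match in the movies dataset. The sketch below shows the same title-overlap computation on made-up data, with whitespace stripping added as an extra normalization step that the notebook cell does not perform. Matching on title alone is an approximation: the movies dataset contains many films that share a name, so a match does not guarantee it is the same film.

```python
import pandas as pd

# Made-up titles standing in for the two datasets.
rt_titles = pd.Series(['Zulu Dawn', 'Fight Club ', 'Some Unlisted Film'])
movie_names = pd.Series(['fight club', 'zulu dawn', 'barbie', 'parasite'])

def normalized_set(titles):
    """Return the distinct titles, stripped and lowercased, ignoring missing values."""
    return set(titles.dropna().str.strip().str.lower())

rt_set = normalized_set(rt_titles)
movie_set = normalized_set(movie_names)

# Titles present in both datasets, and the share of Rotten Tomatoes titles matched.
common = rt_set & movie_set
share = len(common) / len(rt_set)

print(f"{len(common)} of {len(rt_set)} titles matched ({share:.1%})")
```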


# Saving modified dataframes

In [139]:
movies_df.to_csv('movie_df.csv', index=False)

In [140]:
actors_df.to_csv('actors_df.csv', index=False)

In [141]:
posters_df.to_csv('posters_df.csv', index=False)

In [142]:
oscar_awards_df.to_csv('oscar_awards_df.csv', index=False)

In [178]:
studios_df.to_csv('studios_df.csv', index=False)

In [189]:
crew_df.to_csv('crew_df.csv', index=False)

In [149]:
oscar_awards_df.isna().sum()

year_film          0
year_ceremony      0
ceremony           0
category           0
name               0
film             314
winner             0
dtype: int64

In [150]:
oscar_awards_df.dtypes

year_film                 int64
year_ceremony             int64
ceremony                  int64
category         string[python]
name             string[python]
film             string[python]
winner                     bool
dtype: object

In [151]:
oscar_awards_df.duplicated().sum()

np.int64(0)