# Data Exploration

In [1]:
import pandas as pd

## Movies Dataframe
The Movies dataframe contains detailed information about over 940,000 films, including their unique IDs, titles, release dates, poster taglines, short descriptions, durations in minutes, and their ratings.

In [2]:
movies_df = pd.read_csv('csv/movies.csv')

In [3]:
movies_df.shape

(941597, 7)

This dataset consists of 941,597 rows and 7 columns, providing comprehensive details about a vast collection of films.

In [4]:
movies_df.head()

Unnamed: 0,id,name,date,tagline,description,minute,rating
0,1000001,Barbie,2023.0,She's everything. He's just Ken.,Barbie and Ken are having the time of their li...,114.0,3.86
1,1000002,Parasite,2019.0,Act like you own the place.,"All unemployed, Ki-taek's family takes peculia...",133.0,4.56
2,1000003,Everything Everywhere All at Once,2022.0,The universe is so much bigger than you realize.,An aging Chinese immigrant is swept up in an i...,140.0,4.3
3,1000004,Fight Club,1999.0,Mischief. Mayhem. Soap.,A ticking-time-bomb insomniac and a slippery s...,139.0,4.27
4,1000005,La La Land,2016.0,Here's to the fools who dream.,"Mia, an aspiring actress, serves lattes to mov...",129.0,4.09


In [5]:
movies_df.dtypes

id               int64
name            object
date           float64
tagline         object
description     object
minute         float64
rating         float64
dtype: object

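The 'date' column holds release years stored as floats (for example, 2023.0), which is not a suitable type for dates, so we convert it to a datetime type. Note that pd.to_datetime reads bare numbers as nanoseconds since the Unix epoch, so the converted values in the outputs that follow appear as 1970-01-01 00:00:00.00000YYYY: the year survives only in the trailing digits.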

In [6]:
movies_df['date'] = pd.to_datetime(movies_df['date'])

In [7]:
movies_df.dtypes

id                      int64
name                   object
date           datetime64[ns]
tagline                object
description            object
minute                float64
rating                float64
dtype: object
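Because the column holds years rather than full dates, the conversion can be made year-aware. The sketch below is a minimal illustration on a small hypothetical series (the names years, naive and parsed are not part of the original notebook): it contrasts the default numeric conversion with parsing through an explicit '%Y' format.

```python
import pandas as pd

# Hypothetical sample mimicking the 'date' column: years as floats, one missing.
years = pd.Series([2023.0, 1999.0, None])

# Default conversion: numbers are read as nanoseconds since 1970-01-01.
naive = pd.to_datetime(years)

# Year-aware conversion: cast to nullable integers, then to strings,
# and parse with an explicit year format. Unparseable entries become NaT.
year_strings = years.astype("Int64").astype(str)
parsed = pd.to_datetime(year_strings, format="%Y", errors="coerce")

print(naive.dt.year.tolist())
print(parsed.dt.year.tolist())
```

If only the year is needed downstream, keeping the column as a nullable integer (astype("Int64")) may be simpler than a datetime.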

### Null Value Analysis

In [8]:
movies_df.isna().sum()

id                  0
name               10
date            91913
tagline        802210
description    160812
minute         181570
rating         850598
dtype: int64
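Raw counts are easier to judge as a share of rows: with 941,597 rows, the counts above mean that roughly 85% of taglines and 90% of ratings are missing. A minimal sketch of that calculation, using a small made-up frame in place of movies_df:

```python
import pandas as pd

# Hypothetical stand-in for movies_df with a few missing values.
toy = pd.DataFrame({
    "name": ["A", "B", None, "D"],
    "tagline": [None, None, None, "x"],
    "rating": [3.5, None, None, None],
})

# isna() yields booleans, so the column mean is the fraction of missing values.
null_share = toy.isna().mean().sort_values(ascending=False)
print(null_share)
```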

In [9]:
movies_df[movies_df['name'].isna()]

Unnamed: 0,id,name,date,tagline,description,minute,rating
287514,1287515,,1970-01-01 00:00:00.000002015,,NONE is a short film that explores the balance...,4.0,
617642,1617643,,NaT,,,,
646520,1646521,,1970-01-01 00:00:00.000002008,,,,
648185,1648186,,NaT,,,,
720294,1720295,,NaT,,"In this directorial debut of Eden Ewardson, he...",8.0,
725369,1725370,,NaT,,,,
741481,1741482,,NaT,,,90.0,
840337,1840338,,NaT,,,,
883228,1883229,,NaT,,,,
894771,1894772,,NaT,,,,


### Removing Null Values

In [10]:
movies_df = movies_df.dropna(subset=['name'])

### Duplicate Analysis

In [11]:
movies_df.duplicated().sum()

np.int64(0)

In [12]:
movies_df['id'].duplicated().any()

np.False_

In [13]:
# Find films with duplicate names
movies_df['name'].duplicated().any()

np.True_

In [27]:
duplicati = movies_df[movies_df['name'].duplicated(keep=False)]

duplicati = duplicati.sort_values(by='name')

duplicati

Unnamed: 0,id,name,date,tagline,description,minute,rating
336644,1336645,#,1970-01-01 00:00:00.000002020,,"Infodemic, pandemic, pandemonium.",4.0,
787774,1787775,#,1970-01-01 00:00:00.000002012,,,,
524259,1524260,#1,1970-01-01 00:00:00.000002022,A modern story of suspense... in their biggest...,A broke comic shop owner and an obsessive coll...,8.0,
393181,1393182,#1,1970-01-01 00:00:00.000002009,,"You're running and you're running away, but yo...",4.0,
526951,1526952,#1,1970-01-01 00:00:00.000002011,,Short movie by Tomoyasu Murata.,5.0,
...,...,...,...,...,...,...,...
769416,1769417,螺旋,1970-01-01 00:00:00.000001981,,,86.0,
864354,1864355,飞跃绝境,1970-01-01 00:00:00.000001991,,,,
896514,1896515,飞跃绝境,NaT,,"In 1964, my country will conduct the first exp...",,
729419,1729420,鬼妹,1970-01-01 00:00:00.000001985,,,91.0,


In [18]:
duplicati = movies_df[movies_df.duplicated(subset=['name', 'date', 'description'], keep=False)]

# Sort the duplicates for easier inspection
duplicati = duplicati.sort_values(by=['name', 'date', 'description'])

# Show the duplicates
duplicati

Unnamed: 0,id,name,date,tagline,description,minute,rating


We decided to drop duplicates that share the same 'name', 'date', and 'description'.

In [17]:
movies_df = movies_df.drop_duplicates(subset=['name', 'date', 'description'])

In [25]:
# Select rows whose 'name' is duplicated (keep=False marks every occurrence)
duplicati = movies_df[movies_df['name'].duplicated(keep=False)]

# Narrow down to rows where both 'date' and 'description' are NaN
duplicati_nan = duplicati[duplicati['date'].isna() & duplicati['description'].isna()]

# Sort the results for readability
duplicati_nan = duplicati_nan.sort_values(by='name')

# Show the duplicates that meet these conditions
duplicati_nan

Unnamed: 0,id,name,date,tagline,description,minute,rating


We also decided to drop duplicates that share the same 'name' but have null 'date' and 'description' values.

In [24]:
movies_df = movies_df.drop(duplicati_nan.index)
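The two-step cleanup can be traced on a small example. This is a sketch on a hypothetical frame (toy, step1 and step2 are illustrative names), not a rerun of the real data: step 1 keeps the first of the rows that agree on 'name', 'date' and 'description'; step 2 removes rows whose name still repeats and that carry neither a date nor a description.

```python
import pandas as pd

# Hypothetical films: ids 1 and 2 are exact repeats, id 3 shares the name
# but has no date or description, ids 4 and 5 are distinct films with one title.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Alpha", "Alpha", "Alpha", "Beta", "Beta"],
    "date": [2001.0, 2001.0, None, 1990.0, 2005.0],
    "description": ["same", "same", None, "first", "remake"],
})

# Step 1: keep the first row of each (name, date, description) group.
step1 = toy.drop_duplicates(subset=["name", "date", "description"])

# Step 2: among names that still repeat, drop rows lacking both date and description.
repeated = step1[step1["name"].duplicated(keep=False)]
empty = repeated[repeated["date"].isna() & repeated["description"].isna()]
step2 = step1.drop(empty.index)

print(step2["id"].tolist())
```

Films that merely share a title (ids 4 and 5 here) survive both steps, which matches the intent of treating same-name films with different dates as distinct.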

In [None]:
movies_df[(movies_df['rating'] > 5) | (movies_df['rating'] < 0)]
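This filter looks for ratings outside the 0 to 5 range. Comparisons against NaN evaluate to False, so films without a rating are not flagged. A small sketch on made-up values illustrates the behaviour:

```python
import pandas as pd

# Hypothetical ratings: one valid, one too high, one missing, one negative.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "rating": [3.9, 5.4, None, -0.1],
})

# Same condition as in the notebook cell; the missing rating is not selected.
out_of_range = toy[(toy["rating"] > 5) | (toy["rating"] < 0)]
print(out_of_range["name"].tolist())
```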

## Countries Dataframe
This dataset contains over 693,000 entries with the columns 'id', representing the film ID, and 'country', indicating one of the originating countries, with the possibility of multiple countries per film.
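Since a film can appear on several rows, one per country, it can help to collect the countries for each film ID. A minimal sketch with made-up rows (the IDs and countries below are illustrative, not taken from the file):

```python
import pandas as pd

# Hypothetical rows in the same layout as countries.csv: one row per film-country pair.
toy = pd.DataFrame({
    "id": [1000001, 1000001, 1000002],
    "country": ["UK", "USA", "South Korea"],
})

# Gather all countries of each film into a list, and count them.
countries_per_film = toy.groupby("id")["country"].agg(list)
n_countries = toy.groupby("id")["country"].size()

print(countries_per_film)
```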

In [18]:
countries_df = pd.read_csv('csv/countries.csv')

In [None]:
countries_df.shape

This dataset consists of 693,476 rows and 2 columns, providing detailed information about the countries associated with each film.

In [18]:
countries_df.head()

In [19]:
countries_df.dtypes

id          int64
country    object
dtype: object

In [None]:
countries_df.isna().sum()

id         0
country    0
dtype: int64

There are no null values in the dataset.

In [None]:
countries_df.duplicated().sum()

There are no duplicate entries in the dataset.
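The same checks (shape, null counts, duplicate rows) recur for every dataframe in this notebook, so they can be bundled into one function. A minimal sketch; the helper name summarize and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect the basic exploration checks for one dataframe."""
    return {
        "shape": df.shape,
        "null_counts": {col: int(n) for col, n in df.isna().sum().items()},
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Hypothetical frame: one repeated row and one missing country.
toy = pd.DataFrame({"id": [1, 1, 2], "country": ["Italy", "Italy", None]})
report = summarize(toy)
print(report)
```

Passing the dataframe as an argument keeps each check tied to the dataframe being explored, for example summarize(countries_df).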

## Actors Dataframe
This dataset contains over 5.7 million entries with the columns 'id', representing the film ID, 'name', indicating the name of the actor, and 'role', specifying the character played.

In [22]:
actors_df = pd.read_csv('../csv/actors.csv')

In [None]:
actors_df.shape

(5798450, 3)

This dataset consists of 5,798,450 rows and 3 columns, providing detailed information about the actors, their roles, and the films they appear in.

In [24]:
actors_df.head()

In [25]:
actors_df.dtypes

id       int64
name    object
role    object
dtype: object

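In [None]:
actors_df.isna().sum()

In [None]:
actors_df.duplicated().sum()

Both checks were originally pointed at countries_df, so the recorded output does not describe actors_df. Whether this dataset is free of null values and duplicate entries remains to be confirmed by running the two cells above.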

## Crew Dataframe
This dataset contains over 4.7 million entries with the columns 'id', representing the film ID, 'role', indicating the role (e.g., director), and 'name', specifying the name of the person.

In [28]:
crew_df = pd.read_csv('../csv/crew.csv')

In [None]:
crew_df.shape

(4720183, 3)

This dataset consists of 4,720,183 rows and 3 columns, providing detailed information about the crew members, their roles, and the films they are associated with.

In [30]:
crew_df.head()

In [31]:
crew_df.dtypes

id       int64
role    object
name    object
dtype: object

In [32]:
crew_df.isna().sum()

id      0
role    0
name    1
dtype: int64

In [None]:
crew_df[crew_df['name'].isnull()]

Unnamed: 0,id,role,name
4562126,1859397,Writer,

One row has a null 'name': a Writer credit for film 1859397. The row is retained, since the credit still records that the film has a writer even though the person's name is missing.

In [None]:
crew_df.duplicated().sum()

This check was originally pointed at countries_df, so whether crew_df contains duplicate entries remains to be confirmed by running the cell above.

## Posters Dataframe
This dataset contains over 940,000 entries with the columns 'id', representing the film ID, and 'link', providing the link to the original film posters.

In [35]:
posters_df = pd.read_csv('../csv/posters.csv')

In [None]:
posters_df.shape

(941597, 2)

This dataset consists of 941,597 rows and 2 columns, providing links to the original film posters associated with each film.

In [37]:
posters_df.head()

In [38]:
posters_df.dtypes

id       int64
link    object
dtype: object

In [39]:
posters_df.isna().sum()

id           0
link    180712
dtype: int64

In [None]:
posters_df.duplicated().sum()

0

## Releases Dataframe
This dataset contains over 1.3 million entries with the columns 'id', representing the film ID, 'country', indicating the country of release, 'date', showing the release date, 'type', specifying the release type (e.g., Theatrical, Digital), and 'rating', representing the rating received in that country (e.g., PG).

In [41]:
releases_df = pd.read_csv('../csv/releases.csv')

In [None]:
releases_df.shape

(1332782, 5)

This dataset consists of 1,332,782 rows and 5 columns, providing detailed information about the film releases, including the country, date, type, and rating in each country.

In [43]:
releases_df.head()

In [None]:
releases_df.dtypes

id          int64
country    object
date       object
type       object
rating     object
dtype: object

The 'date' column is stored as an object (string) type, which is not suitable for representing release dates, so we convert it to a datetime type.

In [45]:
releases_df['date'] = pd.to_datetime(releases_df['date'])
releases_df.dtypes

id                  int64
country            object
date       datetime64[ns]
type               object
rating             object
dtype: object

In [46]:
releases_df.dtypes

In [None]:
releases_df.isna().sum()

id              0
country         0
date            0
type            0
rating     998802
dtype: int64

The 'rating' column has 998,802 null values. They are left as is, since they may represent releases for which no rating was assigned in that country.
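With 998,802 of 1,332,782 rows affected, about 75% of releases carry no rating. To see whether the gaps cluster in particular countries, the missing share can be computed per country. A minimal sketch on made-up rows (the frame toy and its values are hypothetical):

```python
import pandas as pd

# Hypothetical releases: one rated and one unrated in the USA, two unrated in Italy.
toy = pd.DataFrame({
    "country": ["USA", "USA", "Italy", "Italy"],
    "rating": ["PG", None, None, None],
})

# Fraction of releases without a rating, per country.
missing_by_country = toy["rating"].isna().groupby(toy["country"]).mean()
print(missing_by_country)
```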

In [None]:
releases_df.duplicated().sum()

## Oscar Awards Dataframe
This dataset contains over 10,000 entries about Oscar candidates, with the columns 'year_film', representing the year the film was released, 'year_ceremony', indicating the year of the ceremony, 'ceremony', giving the ordinal number of the ceremony, 'category', specifying the category (e.g., actor), 'name', showing the name of the person, 'film', listing the title of the film, and 'winner', indicating whether they won or not.

In [49]:
oscar_awards_df = pd.read_csv('../csv/the_oscar_awards.csv')

In [None]:
oscar_awards_df.shape

(10889, 7)

This dataset consists of 10,889 rows and 7 columns, providing detailed information about Oscar candidates, including the year of the film, ceremony, category, person, film title, and whether the candidate won.

In [51]:
oscar_awards_df.head()

In [52]:
oscar_awards_df.dtypes

year_film         int64
year_ceremony     int64
ceremony          int64
category         object
name             object
film             object
winner             bool
dtype: object

In [53]:
oscar_awards_df.isna().sum()

year_film          0
year_ceremony      0
ceremony           0
category           0
name               5
film             319
winner             0
dtype: int64

In [54]:
oscar_awards_df[oscar_awards_df['name'].isnull()]

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
10513,2020,2021,93,JEAN HERSHOLT HUMANITARIAN AWARD,,,True
10514,2020,2021,93,JEAN HERSHOLT HUMANITARIAN AWARD,,,True
10635,2021,2022,94,JEAN HERSHOLT HUMANITARIAN AWARD,,,True
10759,2022,2023,95,JEAN HERSHOLT HUMANITARIAN AWARD,,,True
10885,2023,2024,96,JEAN HERSHOLT HUMANITARIAN AWARD,,,True


In [None]:
oscar_awards_df[oscar_awards_df['film'].isnull()]

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
16,1927,1928,1,ENGINEERING EFFECTS,Ralph Hammeras,,False
18,1927,1928,1,ENGINEERING EFFECTS,Nugent Slaughter,,False
31,1927,1928,1,WRITING (Title Writing),Joseph Farnham,,True
32,1927,1928,1,WRITING (Title Writing),"George Marion, Jr.",,False
33,1927,1928,1,SPECIAL AWARD,Warner Bros.,,True
...,...,...,...,...,...,...,...
10763,2022,2023,95,GORDON E. SAWYER AWARD,Iain Neil,,True
10885,2023,2024,96,JEAN HERSHOLT HUMANITARIAN AWARD,,,True
10886,2023,2024,96,HONORARY AWARD,"To Angela Bassett, who has inspired audiences ...",,True
10887,2023,2024,96,HONORARY AWARD,"To Mel Brooks, for his comedic brilliance, pro...",,True
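The visible rows with a null 'film' belong to categories such as honorary and special awards, which suggests the gaps may reflect awards not tied to a single film. One way to check this is to count the null-film rows per category. A minimal sketch on a hypothetical frame (the rows below are made up for illustration):

```python
import pandas as pd

# Hypothetical nominations: three awards without a film, one with a film.
toy = pd.DataFrame({
    "category": ["HONORARY AWARD", "HONORARY AWARD", "SPECIAL AWARD", "ACTOR"],
    "film": [None, None, None, "Some Film"],
})

# Count rows with a missing film, grouped by award category.
by_category = toy.loc[toy["film"].isna(), "category"].value_counts()
print(by_category)
```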

## Rotten Tomatoes Reviews Dataframe
This dataset contains over 1.1 million reviews from Rotten Tomatoes, with the columns 'rotten_tomatoes_link', providing the URL of the Rotten Tomatoes page, 'movie_title', indicating the title of the movie, 'critic_name', showing the name of the critic, 'top_critic', indicating if the critic is a top critic, 'publisher_name', specifying where the review was published, 'review_type', categorizing the type of review, 'review_score', representing the score given in the review, 'review_date', showing the date of the review, and 'review_content', containing the text of the actual review.

In [56]:
rotten_tomatoes_df = pd.read_csv('../csv/rotten_tomatoes_reviews.csv')

In [None]:
rotten_tomatoes_df.shape

(1129887, 9)

This dataset consists of 1,129,887 rows and 9 columns, providing detailed information about Rotten Tomatoes reviews, including the review URL, movie title, critic details, review score, date, and content.

In [58]:
rotten_tomatoes_df.head()

In [None]:
rotten_tomatoes_df.dtypes

rotten_tomatoes_link    object
movie_title             object
critic_name             object
top_critic                bool
publisher_name          object
review_type             object
review_score            object
review_date             object
review_content          object
dtype: object

The 'review_date' column is stored as an object (string) type, which is not suitable for representing review dates, so we convert it to a datetime type.

In [60]:
rotten_tomatoes_df['review_date'] = pd.to_datetime(rotten_tomatoes_df['review_date'])
rotten_tomatoes_df.dtypes

In [61]:
rotten_tomatoes_df.isna().sum()

rotten_tomatoes_link         0
movie_title                  0
critic_name              18521
top_critic                   0
publisher_name               0
review_type                  0
review_score            305902
review_date                  0
review_content           65778
dtype: int64

In [None]:
rotten_tomatoes_df[rotten_tomatoes_df['review_score'].isnull() &
                           rotten_tomatoes_df['review_content'].isnull() &
                           rotten_tomatoes_df['critic_name'].isnull()]

Unnamed: 0,rotten_tomatoes_link,movie_title,critic_name,top_critic,publisher_name,review_type,review_score,review_date,review_content
118910,m/alice_sweet_alice,"Alice, Sweet Alice (Communion)",,False,Film4,Rotten,,2003-05-24,
219988,m/callas_forever,Callas Forever,,True,Denver Rocky Mountain News,Rotten,,2004-12-17,
323790,m/escape_to_witch_mountain,Escape to Witch Mountain,,True,Chicago Reader,Fresh,,2000-01-01,
349208,m/flawless,Flawless,,False,E! Online,Rotten,,2000-01-01,
493769,m/kings_ransom,King's Ransom,,False,Hollywood.com,Fresh,,2005-04-23,
619413,m/nurse_betty,Nurse Betty,,True,Atlanta Journal-Constitution,Fresh,,2000-01-01,
751877,m/saddest_music_in_the_world,The Saddest Music in the World,,False,Premiere Magazine,Rotten,,2004-07-03,
751879,m/saddest_music_in_the_world,The Saddest Music in the World,,False,Premiere Magazine,Rotten,,2004-07-03,
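The same selection, rows where score, content and critic name are all missing, can be written over a list of columns, which scales better if more columns are added to the check. A minimal sketch on a hypothetical frame (the critic name and review text are made up):

```python
import pandas as pd

# Hypothetical reviews: only the first row lacks all three fields.
toy = pd.DataFrame({
    "critic_name": [None, "Jane Doe", None],
    "review_score": [None, None, "3/5"],
    "review_content": [None, "Great.", None],
})

# Keep rows where every listed column is null.
cols = ["review_score", "review_content", "critic_name"]
all_missing = toy[toy[cols].isna().all(axis=1)]
print(all_missing.index.tolist())
```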