# Work with the TMDB + IMDB dataset
- [Source](https://www.kaggle.com/datasets/alanvourch/tmdb-movies-daily-updates)

**Metadata**

| **#** | **Column**               | **Description**                                                                                     |
|-------|--------------------------|-----------------------------------------------------------------------------------------------------|
| 1     | **id**                   | Unique identifier for the film in the TMDB.                                                        |
| 2     | **title**                | The official title of the movie.                                                                   |
| 3     | **vote_average**         | Average rating of the movie on a scale from 0 to 10.                                               |
| 4     | **vote_count**           | Number of votes that contributed to the movie's rating.                                            |
| 5     | **status**               | The current release status of the movie (e.g., *Released*, *Post Production*).                     |
| 6     | **release_date**         | The date when the film was officially released.                                                    |
| 7     | **revenue**              | Box office earnings of the movie.                                                                  |
| 8     | **runtime**              | Duration of the movie in minutes.                                                                  |
| 9     | **budget**               | Financial budget allocated for the movie production.                                               |
| 10    | **imdb_id**              | Identifier for the movie in the Internet Movie Database (IMDB).                                    |
| 11    | **original_language**    | The language in which the movie was originally produced.                                           |
| 12    | **original_title**       | The title of the movie in its original language.                                                   |
| 13    | **overview**             | Brief summary of the movie's plot.                                                                 |
| 14    | **popularity**           | Popularity score of the movie on TMDB.                                                             |
| 15    | **tagline**              | Official tagline of the movie.                                                                     |
| 16    | **genres**               | Categories of genres the movie belongs to.                                                         |
| 17    | **production_companies** | Companies involved in producing the movie.                                                        |
| 18    | **production_countries** | Countries where the movie was produced.                                                            |
| 19    | **spoken_languages**     | Languages spoken in the movie.                                                                     |
| 20    | **cast**                 | Cast members of the movie.                                                                         |
| 21    | **director**             | Director(s) of the movie.                                                                          |
| 22    | **director_of_photography** | Director(s) of photography (cinematographers).                                                  |
| 23    | **writers**              | Writers credited on the movie.                                                                     |
| 24    | **producers**            | Producers and executive producers.                                                                 |
| 25    | **music_composer**       | Composer(s) of the movie's music.                                                                  |
| 26    | **imdb_rating**          | Average rating of the movie on IMDB.                                                               |
| 27    | **imdb_votes**           | Number of votes behind the IMDB rating.                                                            |
| 28    | **poster_path**          | Path to the movie's poster image.                                                                  |




In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import re
import matplotlib.pyplot as plt
import seaborn as sns
import langcodes

In [2]:
import sys
sys.path.append('../utils')
import functions

In [3]:
tmdb_df = pd.read_csv('../data/local/raw/TMDB_all_movies.csv')

In [4]:
functions.show_basic_info(tmdb_df)


DataFrame Shape: (1028026, 28)
Number of Rows: 1028026
Number of Columns: 28

Data Types of Columns:
id                           int64
title                       object
vote_average               float64
vote_count                 float64
status                      object
release_date                object
revenue                    float64
runtime                    float64
budget                     float64
imdb_id                     object
original_language           object
original_title              object
overview                    object
popularity                 float64
tagline                     object
genres                      object
production_companies        object
production_countries        object
spoken_languages            object
cast                        object
director                    object
director_of_photography     object
writers                     object
producers                   object
music_composer              object
imdb_rating            

In [5]:
# functions.show_column_summary(tmdb_df)

Columns to drop:
- cast
- director_of_photography
- music_composer
- poster_path
- writers
- tagline

In [6]:
tmdb_df.drop(columns=['cast', 'director_of_photography', 'music_composer', 'poster_path', 'writers', 'tagline'], inplace=True)
tmdb_df.head()

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,budget,imdb_id,...,overview,popularity,genres,production_companies,production_countries,spoken_languages,director,producers,imdb_rating,imdb_votes
0,2,Ariel,7.1,335.0,Released,1988-10-21,0.0,73.0,0.0,tt0094675,...,After the coal mine he works at closes and his...,11.915,"Comedy, Drama, Romance, Crime",Villealfa Filmproductions,Finland,suomi,Aki Kaurismäki,Aki Kaurismäki,7.4,8812.0
1,3,Shadows in Paradise,7.3,369.0,Released,1986-10-17,0.0,74.0,0.0,tt0092149,...,"Nikander, a rubbish collector and would-be ent...",16.287,"Comedy, Drama, Romance",Villealfa Filmproductions,Finland,"suomi, English, svenska",Aki Kaurismäki,Mika Kaurismäki,7.5,7587.0
2,5,Four Rooms,5.8,2628.0,Released,1995-12-09,4257354.0,98.0,4000000.0,tt0113101,...,It's Ted the Bellhop's first night on the job....,21.312,Comedy,"Miramax, A Band Apart",United States of America,English,"Quentin Tarantino, Robert Rodriguez, Alexandre...","Quentin Tarantino, Alexandre Rockwell, Lawrenc...",6.7,112798.0
3,6,Judgment Night,6.5,331.0,Released,1993-10-15,12136938.0,109.0,21000000.0,tt0107286,...,"Four young friends, while taking a shortcut en...",8.924,"Action, Crime, Thriller","Largo Entertainment, JVC, Universal Pictures",United States of America,English,Stephen Hopkins,"Gene Levy, Lloyd Segan, Marilyn Vance",6.6,19361.0
4,8,Life in Loops (A Megacities RMX),7.5,27.0,Released,2006-01-01,0.0,80.0,42000.0,tt0825671,...,Timo Novotny labels his new project an experim...,3.203,Documentary,inLoops,Austria,"English, हिन्दी, 日本語, Pусский, Español",Timo Novotny,"Ulrich Gehmacher, Timo Novotny",8.2,284.0


In [7]:
functions.check_for_duplicates(tmdb_df)


No duplicate rows found in the DataFrame.


Remove all rows where 'status' is not 'Released'

In [8]:
print(tmdb_df['status'].unique())

['Released' 'Rumored' 'Post Production' 'Canceled' 'Planned'
 'In Production' nan]


In [9]:
initial_rows = len(tmdb_df)
tmdb_df = tmdb_df[tmdb_df['status'] == 'Released'] # keep only 'Released' rows (rows with a missing status are dropped too)
final_rows = len(tmdb_df)
removed_rows = initial_rows - final_rows
print(f'Number of rows removed: {removed_rows}')

Number of rows removed: 16758
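
The `unique()` output above includes `nan`, so the equality filter removes rows with a missing status as well as the non-released ones. A minimal sketch on made-up data (the `toy` frame and its values are illustrative, not taken from the dataset) shows that behaviour and one way to break the removed rows down by status:

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration), including one missing status
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'status': ['Released', 'Rumored', np.nan, 'Released'],
})

# NaN == 'Released' evaluates to False, so the NaN row is dropped as well
released = toy[toy['status'] == 'Released']

# Break the removed rows down by status, counting NaN explicitly
removed_by_status = toy.loc[toy['status'] != 'Released', 'status'].value_counts(dropna=False)

print(f'Number of rows removed: {len(toy) - len(released)}')
print(removed_by_status)
```

Passing `dropna=False` to `value_counts` is what makes the missing-status rows visible in the breakdown.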


#### 'release_date' column
- Convert to datetime
- Extract year only
- Convert year to integer

In [10]:
df = tmdb_df.copy()

In [11]:
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
df['release_year'] = df['release_date'].dt.year.astype('Int64')
print(df[['release_date', 'release_year']].head())

  release_date  release_year
0   1988-10-21          1988
1   1986-10-17          1986
2   1995-12-09          1995
3   1993-10-15          1993
4   2006-01-01          2006
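
The conversion above relies on two details: `errors='coerce'` turns unparseable dates into `NaT`, and the nullable `Int64` dtype lets the year column hold integers alongside missing values. A small sketch with invented input strings illustrates both:

```python
import pandas as pd

# Invented inputs: one valid date, one unparseable string, one missing value
raw = pd.Series(['1988-10-21', 'not a date', None])

# Unparseable or missing entries become NaT instead of raising an error
dates = pd.to_datetime(raw, errors='coerce')

# .dt.year yields floats with NaN; Int64 (capital I) keeps integers plus <NA>
years = dates.dt.year.astype('Int64')

print(years)
```

With the plain `int64` dtype the cast would fail as soon as a single date is missing, which is why the nullable variant is used.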


Count the released movies per year in df

In [12]:
release_year_counts = df['release_year'].value_counts().sort_index()
print(release_year_counts)

release_year
1800        1
1865        1
1874        1
1878       31
1882        1
        ...  
2022    41950
2023    47546
2024    44390
2025      194
2026        1
Name: count, Length: 147, dtype: Int64


In [13]:
# release_year_counts = df['release_year'].value_counts().sort_index()

# top_20_years = release_year_counts.nlargest(20).sort_index()

# top_20_years.index = top_20_years.index.astype(str)

# # Create the line plot
# plt.figure(figsize=(10, 6))
# sns.lineplot(x=top_20_years.index, y=top_20_years.values, color='lightcoral')

# plt.title('Top 20 Years with Most Film Releases')
# plt.xlabel('Year')
# plt.ylabel('Releases')
# plt.xticks(rotation=45)
# plt.grid(True)

# plt.show()

In [14]:
df.drop(['overview', 'production_companies', 'production_countries', 'producers', 'status', 'spoken_languages'], axis=1, inplace=True)

In [15]:
df.head()

Unnamed: 0,id,title,vote_average,vote_count,release_date,revenue,runtime,budget,imdb_id,original_language,original_title,popularity,genres,director,imdb_rating,imdb_votes,release_year
0,2,Ariel,7.1,335.0,1988-10-21,0.0,73.0,0.0,tt0094675,fi,Ariel,11.915,"Comedy, Drama, Romance, Crime",Aki Kaurismäki,7.4,8812.0,1988
1,3,Shadows in Paradise,7.3,369.0,1986-10-17,0.0,74.0,0.0,tt0092149,fi,Varjoja paratiisissa,16.287,"Comedy, Drama, Romance",Aki Kaurismäki,7.5,7587.0,1986
2,5,Four Rooms,5.8,2628.0,1995-12-09,4257354.0,98.0,4000000.0,tt0113101,en,Four Rooms,21.312,Comedy,"Quentin Tarantino, Robert Rodriguez, Alexandre...",6.7,112798.0,1995
3,6,Judgment Night,6.5,331.0,1993-10-15,12136938.0,109.0,21000000.0,tt0107286,en,Judgment Night,8.924,"Action, Crime, Thriller",Stephen Hopkins,6.6,19361.0,1993
4,8,Life in Loops (A Megacities RMX),7.5,27.0,2006-01-01,0.0,80.0,42000.0,tt0825671,en,Life in Loops (A Megacities RMX),3.203,Documentary,Timo Novotny,8.2,284.0,2006


In [16]:
functions.show_missing_values(df)


Missing Values in Columns:
id                        0
title                     9
vote_average              0
vote_count                0
release_date         100087
revenue                   0
runtime                   0
budget                    0
imdb_id              424017
original_language         0
original_title            9
popularity                0
genres               293770
director             182074
imdb_rating          579492
imdb_votes           579492
release_year         100087
dtype: int64
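
The implementation of `functions.show_missing_values` is not shown in this notebook. As a rough stand-in, the hypothetical helper `missing_summary` below reports the missing count and percentage per column. The sketch also counts zeros separately, on the assumption that `0.0` in columns such as `revenue` may be a placeholder rather than a real value; the toy frame is invented for illustration.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: missing count and percentage per column."""
    counts = frame.isna().sum()
    percent = (counts / len(frame) * 100).round(1)
    return pd.DataFrame({'missing': counts, 'percent': percent})

# Invented rows for illustration
toy = pd.DataFrame({
    'imdb_id': ['tt0000001', None, None, 'tt0000002'],
    'revenue': [0.0, 0.0, 4257354.0, 12136938.0],
})

summary = missing_summary(toy)
print(summary)

# Zeros are not NaN, so isna() does not see them; count them separately
zero_revenue = int((toy['revenue'] == 0).sum())
print(f'Rows with revenue == 0: {zero_revenue}')
```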


Create a new DataFrame restricted to a span of release years

In [17]:
# Filter rows where 'release_year' is between 2019 and 2024
filtered_df = df[(df['release_year'] >= 2019) & (df['release_year'] <= 2024)]

# Drop rows where 'genres', 'director', 'original_title' or 'title' is missing
filtered_df = filtered_df.dropna(subset=['genres', 'director', 'original_title', 'title'])
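
An equivalent way to express the year filter is `Series.between`, which is inclusive on both ends by default. The sketch below uses an invented toy frame. Filling `<NA>` in the mask with `False` is an explicit choice made here, not something the notebook does.

```python
import pandas as pd

# Invented rows: years inside and outside the span, plus missing values
toy = pd.DataFrame({
    'title': ['A', 'B', None, 'D', 'E'],
    'release_year': pd.array([2018, 2019, 2021, 2024, None], dtype='Int64'),
    'genres': ['Drama', None, 'Comedy', 'Horror', 'Drama'],
})

# between() includes both endpoints; a missing year gives <NA> in the mask,
# which is filled with False so that those rows are excluded explicitly
in_span = toy['release_year'].between(2019, 2024).fillna(False)

# Keep rows in the span, then drop rows missing any required column
subset = toy[in_span].dropna(subset=['title', 'genres'])

print(subset)
```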


In [18]:
functions.show_missing_values(filtered_df)


Missing Values in Columns:
id                        0
title                     0
vote_average              0
vote_count                0
release_date              0
revenue                   0
runtime                   0
budget                    0
imdb_id               84610
original_language         0
original_title            0
popularity                0
genres                    0
director                  0
imdb_rating          110670
imdb_votes           110670
release_year              0
dtype: int64


In [19]:
display(filtered_df)

Unnamed: 0,id,title,vote_average,vote_count,release_date,revenue,runtime,budget,imdb_id,original_language,original_title,popularity,genres,director,imdb_rating,imdb_votes,release_year
3048,5492,Gunner,5.3,94.0,2024-08-16,0.0,90.0,20000000.0,tt12598606,en,Gunner,128.500,"Action, Thriller, Crime",Dimitri Logothetis,3.2,1845.0,2024
15031,25356,The Break-up Artist,3.5,4.0,2019-03-07,0.0,105.0,0.0,tt1266542,en,The Break-up Artist,4.549,"Comedy, Romance",Steve Woo,4.8,1617.0,2019
17769,28930,Samurai Priest Vampire Hunter,3.5,2.0,2023-10-06,0.0,94.0,0.0,tt0791321,en,Samurai Priest Vampire Hunter,10.846,"Horror, Action",Mark Terry,4.0,381.0,2023
20391,32471,Mixtape,7.0,126.0,2021-12-03,0.0,94.0,0.0,tt1587420,en,Mixtape,7.627,"Comedy, Family",Valerie Weiss,6.6,4292.0,2021
24107,38258,Grizzly II: Revenge,3.0,34.0,2020-02-17,0.0,74.0,7500000.0,tt0093119,en,Grizzly II: Revenge,4.055,"Horror, Music, Thriller",André Szöts,2.7,1770.0,2020
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1027901,1400039,My First Year Off Campus,0.0,0.0,2024-12-09,0.0,85.0,0.0,tt33343744,en,My First Year Off Campus,0.000,"Comedy, Horror, Thriller",Chad Bolling,,,2024
1027902,1400040,Despierta,0.0,0.0,2019-12-31,0.0,0.0,0.0,,es,Despierta,0.000,Thriller,"Héctor ""Pinke"" Mejía González",,,2019
1027904,1400042,The LEGO Movie 3: The Third Part,0.0,0.0,2024-11-17,0.0,109.0,0.0,,en,The LEGO Movie 3: The Third Part,0.000,"Animation, Comedy, Action, Family, Adventure",Mike Mitchell,,,2024
1027905,1400043,Blinded,0.0,0.0,2024-12-12,0.0,7.0,0.0,,en,Blinded,0.000,Drama,"Jaigreev, Jesu M Joe",,,2024


In [30]:
movie_df = filtered_df.copy()

In [23]:
# # drop columns
# columns_to_drop = ['overview', 'production_companies', 'production_countries', 'producers', 'release_date', 'spoken_languages', 'status', 'vote_average', 'vote_count']

# df_2018 = df_2018.drop(columns=columns_to_drop)

In [24]:
# df_2018['clean_title'] = functions.prepare_clean_titles(df_2018, 'title')

# # reorder columns
# df_2018 = df_2018[['id', 'title', 'original_title', 'clean_title', 'release_year', 'imdb_id', 'imdb_rating', 'imdb_votes', 'genres', 'director', 'revenue', 'budget', 'runtime', 'original_language', 'popularity']]

In [25]:
# # clean genres
# df_2018['genres'] = functions.clean_genres(df_2018, 'genres')
# df_2018.head()

In [26]:
# # float columns to int

# columns_to_convert = ['imdb_votes', 'revenue', 'budget', 'runtime']

# df_2018[columns_to_convert] = df_2018[columns_to_convert].astype('Int64', errors='ignore')

# df_2018.head()

In [27]:
# # Check empty rows
# total_rows = len(df_2018)

# empty_rows = df_2018.isna().any(axis=1).sum()

# print(f'Total number of rows:\n{total_rows}')
# print(f'\nNumber of rows with empty values:\n{empty_rows}')

In [28]:
# df_2018['language'] = df_2018['original_language'].apply(functions.get_language_name)

# df_2018.head()

Handle unknown languages

In [29]:
print(f'Unique values in language column:\n{df_2018["language"].unique()}')

NameError: name 'df_2018' is not defined

In [None]:
print(f'Value counts in language column:\n{df_2018["language"].value_counts()}')

In [None]:
# occurrences of [cn]
unknown_lang = df_2018[df_2018['original_language'] == 'cn']
# print(unknown_lang)

In [None]:
# replace [cn] with a proper label
df_2018['language'] = df_2018['language'].replace('Unknown language [cn]', 'Cantonese')

In [None]:
# occurrences of xx
unknown_lang = df_2018[df_2018['original_language'] == 'xx']
# print(unknown_lang)

In [None]:
# replace [xx] with a proper label
df_2018['language'] = df_2018['language'].replace('Unknown language [xx]', 'Unknown')
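
The relabeling steps above can also be folded into the lookup itself. The sketch below is a hypothetical stand-in, not the notebook's `functions.get_language_name`: the function name `language_label` and both dictionaries are assumptions, and the name table covers only a few codes for illustration. The `cn` and `xx` labels follow the replacements used above.

```python
# Hypothetical lookup table with only a few codes, for illustration
LANGUAGE_NAMES = {'en': 'English', 'fi': 'Finnish', 'es': 'Spanish'}

# Codes that need a manual label (taken from the replacements above)
SPECIAL_CODES = {'cn': 'Cantonese', 'xx': 'Unknown'}

def language_label(code: str) -> str:
    """Return a readable label for a language code (illustrative only)."""
    if code in SPECIAL_CODES:
        return SPECIAL_CODES[code]
    # Fall back to a marker that is easy to spot in unique() output
    return LANGUAGE_NAMES.get(code, f'Unknown language [{code}]')

for code in ['fi', 'cn', 'xx', 'zz']:
    print(code, '->', language_label(code))
```

Handling the special codes before the general lookup means no separate `replace` pass over the column is needed afterwards.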

In [None]:
print(f'Unique values in language column after relabeling: {df_2018["language"].unique()}')

In [None]:
df_2018.head(50)

#### Check runtime
- Movies with a runtime under 40 minutes are considered short films.
- There are likely many missing values.

In [None]:
# count runtimes strictly between 1 and 40 minutes; runtimes of 0 and 1 are excluded
short_film_count = ((df_2018['runtime'] < 40) & (df_2018['runtime'] > 1)).sum()

print(f'Number of rows with runtime below 40 (excluding 0 and 1): {short_film_count}')
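
The 40-minute rule can be sketched on toy data as follows. Treating a runtime of `0` as "not recorded" is an assumption made for this example, and the threshold `> 0` differs slightly from the `> 1` used in the cell above. The frame and its values are invented.

```python
import pandas as pd

# Invented runtimes in minutes
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'runtime': [0.0, 7.0, 39.0, 90.0],
})

# Assumption: a runtime of 0 means the value was never recorded
has_runtime = toy['runtime'] > 0

# Short films: a recorded runtime under 40 minutes
is_short = has_runtime & (toy['runtime'] < 40)

short_count = int(is_short.sum())
unknown_count = int((~has_runtime).sum())

print(f'Short films: {short_count}, unknown runtime: {unknown_count}')
```

Keeping the two masks separate makes it possible to report unrecorded runtimes instead of silently mixing them with short films.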

Round popularity to one decimal place

In [None]:
df_2018['popularity'] = df_2018['popularity'].round(1)
display(df_2018)

In [None]:
functions.show_missing_values(df_2018)

In [None]:
df_2018.to_csv('../data/local/clean/2018_films.csv')

### Top 10s

In [None]:
# most popular genres:
# split column by commas
df_exploded_genres = df_2018['genres'].str.split(',').explode().str.strip()

# add column for popularity
df_genres_popularity = df_exploded_genres.to_frame(name='genre').join(df_2018['popularity'])

# calculate average popularity
genre_popularity = df_genres_popularity.groupby('genre')['popularity'].mean().sort_values(ascending=False)

print('Most Popular Genres:')
print(genre_popularity.head(10))

# most popular languages:
# calculate average popularity per language
language_popularity = df_2018.groupby('language')['popularity'].mean().sort_values(ascending=False)

print('\nMost Popular Languages:')
print(language_popularity.head(10))
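
The split, explode and join pattern used above is easier to follow on a tiny example. The frame below is invented. The point is that `explode` repeats the original row index, so joining on the index attaches each movie's popularity to every one of its genres.

```python
import pandas as pd

# Invented rows: comma-separated genres and a popularity score
toy = pd.DataFrame({
    'genres': ['Comedy, Drama', 'Drama', 'Comedy'],
    'popularity': [10.0, 20.0, 40.0],
})

# One row per (movie, genre); the index still points at the source row
genres = toy['genres'].str.split(',').explode().str.strip()

# Index-based join copies each movie's popularity onto all its genres
per_genre = genres.to_frame(name='genre').join(toy['popularity'])

# Average popularity per genre, highest first
mean_popularity = per_genre.groupby('genre')['popularity'].mean().sort_values(ascending=False)

print(mean_popularity)
```

A movie with several genres contributes to each genre's average, so the per-genre means are not independent of one another.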

In [None]:
# split 'genres' by commas, explode it, create a row for each genre
df_exploded_genres = df_2018['genres'].str.split(',').explode().str.strip()

# join the 'imdb_rating' column to the exploded genres
df_genres_imdb_rating = df_exploded_genres.to_frame(name='genre').join(df_2018['imdb_rating'])

# calculate average imdb_rating per genre
genre_imdb_rating = df_genres_imdb_rating.groupby('genre')['imdb_rating'].mean().sort_values(ascending=False)

print('Highest Rated Genres (IMDB):')
print(genre_imdb_rating.head(10))

# highest rated languages:
# calculate average imdb_rating per language
language_imdb_rating = df_2018.groupby('language')['imdb_rating'].mean().sort_values(ascending=False)

print('\nHighest Rated Languages (IMDB):')
print(language_imdb_rating.head(10))

### Correlation Plots

In [None]:
# correlations for numeric columns
numeric_columns = ['popularity', 'revenue', 'budget', 'runtime', 'imdb_rating', 'imdb_votes']
correlation_matrix = df_2018[numeric_columns].corr()

# correlation matrix heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='winter', vmin=-1, vmax=1)
plt.title('Correlation Heatmap of Numeric Columns')
plt.show()

In [None]:
# most popular genres
# split and explode the genres column
df_exploded_genres = df_2018['genres'].str.split(',').explode().str.strip()
df_genres_popularity = df_exploded_genres.to_frame(name='genre').join(df_2018['popularity'])

# group and calculate average popularity
genre_popularity = df_genres_popularity.groupby('genre')['popularity'].mean().sort_values(ascending=False)
top_20_genres = genre_popularity.head(20)

top_20_genres.index = top_20_genres.index.str.title()

# top 20 plot
plt.figure(figsize=(10, 8))
sns.barplot(x=top_20_genres.values, y=top_20_genres.index, palette='viridis')
plt.title('Top 20 Most Popular Genres (2018)')
plt.xlabel('Average Popularity')
plt.ylabel('Genre')
plt.show()

In [None]:
# most popular languages
language_popularity = df_2018.groupby('language')['popularity'].mean().sort_values(ascending=False)
top_10_languages = language_popularity.head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_10_languages.values, y=top_10_languages.index, palette='magma')
plt.title('Top 10 Most Popular Languages (2018)')
plt.xlabel('Average Popularity')
plt.ylabel('Language')
plt.show()