# Work with TMDB + IMDB dataset 
- [Source](https://www.kaggle.com/datasets/alanvourch/tmdb-movies-daily-updates)

**Metadata**

| **#** | **Column**               | **Description**                                                                                     |
|-------|--------------------------|-----------------------------------------------------------------------------------------------------|
| 1     | **id**                   | Unique identifier for the film in the TMDB.                                                        |
| 2     | **title**                | The official title of the movie.                                                                   |
| 3     | **vote_average**         | Average rating of the movie on a scale from 0 to 10.                                               |
| 4     | **vote_count**           | Number of votes that contributed to the movie's rating.                                            |
| 5     | **status**               | The current release status of the movie (e.g., *Released*, *Post-Production*).                     |
| 6     | **release_date**         | The date when the film was officially released.                                                    |
| 7     | **revenue**              | Box office earnings of the movie.                                                                  |
| 8     | **runtime**              | Duration of the movie in minutes.                                                                  |
| 9     | **budget**               | Financial budget allocated for the movie production.                                               |
| 10    | **imdb_id**              | Identifier for the movie in the Internet Movie Database (IMDB).                                    |
| 11    | **original_language**    | The language in which the movie was originally produced.                                           |
| 12    | **original_title**       | The title of the movie in its original language.                                                   |
| 13    | **overview**             | Brief summary of the movie's plot.                                                                 |
| 14    | **popularity**           | Popularity score of the movie on TMDB.                                                             |
| 15    | **tagline**              | Official tagline of the movie.                                                                     |
| 16    | **genres**               | Categories of genres the movie belongs to.                                                         |
| 17    | **production_companies** | Companies involved in producing the movie.                                                        |
| 18    | **production_countries** | Countries where the movie was produced.                                                            |
| 19    | **spoken_languages**     | Languages spoken in the movie.                                                                     |
| 20    | **cast**                 | Names of the cast members.                                                                         |
| 21    | **director**             | Director(s) of the movie.                                                                          |
| 22    | **director_of_photography** | Director(s) of photography (cinematographers).                                                  |
| 23    | **writers**              | Writers credited on the movie.                                                                     |
| 24    | **producers**            | Producers and executive producers.                                                                 |
| 25    | **music_composer**       | Composer(s) of the movie's music.                                                                  |
| 26    | **imdb_rating**          | Rating of the movie on IMDB.                                                                       |
| 27    | **imdb_votes**           | Number of votes the movie received on IMDB.                                                        |
| 28    | **poster_path**          | Path to the movie's poster image.                                                                  |




In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
import re
import matplotlib.pyplot as plt
import seaborn as sns
import langcodes

In [None]:
import sys
sys.path.append('../utils')
import functions

In [None]:
tmdb_df = pd.read_csv('../data/local/raw/TMDB_all_movies.csv')

In [None]:
functions.show_basic_info(tmdb_df)

In [None]:
# functions.show_column_summary(tmdb_df)

Columns to drop:
- cast
- director_of_photography
- music_composer
- poster_path
- writers
- tagline

In [None]:
tmdb_df.drop(columns=['cast', 'director_of_photography', 'music_composer', 'poster_path', 'writers', 'tagline'], inplace=True)
tmdb_df.head()

In [None]:
functions.check_for_duplicates(tmdb_df)

Remove all rows where 'status' is not 'Released'

In [None]:
print(tmdb_df['status'].unique())

In [None]:
initial_rows = len(tmdb_df)
tmdb_df = tmdb_df[tmdb_df['status'] == 'Released'] # keep rows where 'status' is 'Released'
final_rows = len(tmdb_df)
removed_rows = initial_rows - final_rows
print(f'Number of rows removed: {removed_rows}')

#### 'release_date' column
- Convert to datetime
- Extract year only
- Convert year to integer

In [None]:
df = tmdb_df.copy()

In [None]:
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
df['release_year'] = df['release_date'].dt.year.astype('Int64')
print(df[['release_date', 'release_year']].head())
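
The three conversion steps can be tried on a small sample to see how unparseable dates behave. The values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up sample: one valid date, one malformed string, one missing value
sample = pd.DataFrame({'release_date': ['2021-07-16', 'not a date', None]})

# errors='coerce' turns anything unparseable into NaT instead of raising
sample['release_date'] = pd.to_datetime(sample['release_date'], errors='coerce')

# 'Int64' (capital I) is the nullable integer dtype, so missing years stay <NA>
sample['release_year'] = sample['release_date'].dt.year.astype('Int64')

print(sample)
```

A plain `int` cast would fail here because of the missing values, which is why the nullable `Int64` dtype is used.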

Get released movie count per year in df

In [None]:
release_year_counts = df['release_year'].value_counts().sort_index()
print(release_year_counts)

In [None]:
# release_year_counts = df['release_year'].value_counts().sort_index()

# top_20_years = release_year_counts.nlargest(20).sort_index()

# top_20_years.index = top_20_years.index.astype(str)

# # Create the line plot
# plt.figure(figsize=(10, 6))
# sns.lineplot(x=top_20_years.index, y=top_20_years.values, color='lightcoral')

# plt.title('Top 20 Years with Most Film Releases')
# plt.xlabel('Year')
# plt.ylabel('Releases')
# plt.xticks(rotation=45)
# plt.grid(True)

# plt.show()

In [None]:
df.drop(['overview', 'production_companies', 'production_countries', 'producers', 'status', 'spoken_languages'], axis=1, inplace=True)

In [None]:
df.head()

In [None]:
functions.show_missing_values(df)
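
The `functions` module is not shown in this notebook. As a rough idea of what a missing-value report can look like, here is a minimal sketch; `missing_summary` is a hypothetical stand-in, not the actual `functions.show_missing_values`.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: count and percentage of missing values per column."""
    counts = frame.isna().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({'missing': counts, 'percent': percent})
    # Report only columns that actually have gaps, largest first
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)

# Made-up data for illustration
demo = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'title': ['A', 'B', None, 'D'],
    'runtime': [90, None, None, 100],
})
report = missing_summary(demo)
print(report)
```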

### New DF from a year span
- Titles between 2019 and 2024
- Drop rows with missing values in the columns 'genres', 'director', 'original_title', and 'title'
- Result: 174702 rows × 17 columns

In [None]:
# Filter rows where 'release_year' is between 2019 and 2024
filtered_df = df[(df['release_year'] >= 2019) & (df['release_year'] <= 2024)]

# Drop rows where 'genres', 'director', 'original_title', or 'title' is missing
filtered_df = filtered_df.dropna(subset=['genres', 'director', 'original_title', 'title'])
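
One thing worth checking is that `dropna` only removes true missing values (`None`/`NaN`); empty or whitespace-only strings survive it. The sketch below uses made-up data to show the difference and one possible way to treat blank strings as missing. Whether the real dataset contains such blanks is an assumption, not something verified here.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
demo = pd.DataFrame({
    'title': ['Film A', 'Film B', 'Film C', 'Film D'],
    'genres': ['Drama', '', None, 'Comedy'],
    'release_year': pd.array([2019, 2021, 2023, 2018], dtype='Int64'),
})

# Year filter equivalent to the >= / <= comparison used above
in_span = demo[demo['release_year'].between(2019, 2024)]

# dropna removes None/NaN only, so the empty string in 'Film B' is kept
only_dropna = in_span.dropna(subset=['genres'])

# Converting blank strings to NaN first makes dropna catch them too
blank_as_na = in_span.assign(
    genres=in_span['genres'].replace(r'^\s*$', np.nan, regex=True)
)
strict = blank_as_na.dropna(subset=['genres'])

print(len(only_dropna), len(strict))
```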


In [None]:
functions.show_missing_values(filtered_df)

In [None]:
display(filtered_df)

In [None]:
movie_df = filtered_df.copy()

In [None]:
# drop columns
columns_to_drop = ['release_date']

movie_df = movie_df.drop(columns=columns_to_drop)

In [None]:
movie_df['clean_title'] = functions.prepare_clean_titles(movie_df, 'title')

# # reorder columns
# movie_df = movie_df[['id', 'title', 'original_title', 'clean_title', 'release_year', 'imdb_id', 'imdb_rating', 'imdb_votes', 'genres', 'director', 'revenue', 'budget', 'runtime', 'original_language', 'popularity']]

In [None]:
# clean genres
movie_df['genres'] = functions.clean_genres(movie_df, 'genres')
movie_df.head()

In [None]:
columns_to_convert = ['imdb_votes', 'revenue', 'budget', 'runtime', 'vote_count']

# Check if the columns exist in the DataFrame
columns_to_convert = [col for col in columns_to_convert if col in movie_df.columns]

# Convert the specified columns to 'Int64' type, handling errors gracefully
movie_df[columns_to_convert] = movie_df[columns_to_convert].apply(pd.to_numeric, errors='coerce').astype('Int64')

# Now the DataFrame should have the correct data types
print(movie_df.head())
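
The effect of `errors='coerce'` followed by the `Int64` cast can be seen on a small made-up frame: values that cannot be parsed become `<NA>` rather than raising.

```python
import pandas as pd

# Made-up data: a non-numeric string and missing values mixed with valid numbers
demo = pd.DataFrame({
    'budget': ['1000000', 'unknown', None],
    'runtime': [95.0, None, 120.0],
})

# Unparseable entries become NaN, then the nullable Int64 dtype stores them as <NA>
converted = demo.apply(pd.to_numeric, errors='coerce').astype('Int64')

print(converted)
print(converted.dtypes)
```

Note that the `Int64` cast expects whole numbers; a column holding fractional values would need rounding first.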


In [None]:
# # Check empty rows
# total_rows = len(movie_df)

# empty_rows = movie_df.isna().any(axis=1).sum()

# print(f'Total number of rows:\n{total_rows}')
# print(f'\nNumber of rows with empty values:\n{empty_rows}')

In [None]:
movie_df['language'] = movie_df['original_language'].apply(functions.get_language_name)

movie_df.head()

Handle unknown languages

In [None]:
print(f"Unique values in language column:\n{movie_df['language'].unique()}")

In [None]:
print(f"Value counts in language column:\n{movie_df['language'].value_counts()}")

In [None]:
# occurrences of [cn]
unknown_lang = movie_df[movie_df['original_language'] == 'cn']
# print(unknown_lang)

In [None]:
# replace [cn] with a proper label
movie_df['language'] = movie_df['language'].replace('Unknown language [cn]', 'Cantonese')

In [None]:
# occurrences of xx
unknown_lang = movie_df[movie_df['original_language'] == 'xx']
# print(unknown_lang)

In [None]:
# replace [xx] with a proper label
movie_df['language'] = movie_df['language'].replace('Unknown language [xx]', 'Unknown')

In [None]:
print(f"Unique values in language column after relabeling: {movie_df['language'].unique()}")
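
An alternative to relabeling after the fact is to handle the non-standard codes at lookup time. The sketch below is one possible approach under stated assumptions: `CODE_OVERRIDES`, `label_language`, and the tiny `names` lookup are all hypothetical and stand in for whatever `functions.get_language_name` does.

```python
import pandas as pd

# Hypothetical override table for codes that are not standard language codes
CODE_OVERRIDES = {'cn': 'Cantonese', 'xx': 'Unknown'}

def label_language(code: str, lookup: dict) -> str:
    """Return an override if one exists, else the looked-up name, else 'Unknown'."""
    if code in CODE_OVERRIDES:
        return CODE_OVERRIDES[code]
    return lookup.get(code, 'Unknown')

# Tiny stand-in for a full code-to-name lookup (assumption, for illustration)
names = {'en': 'English', 'fr': 'French'}

codes = pd.Series(['en', 'cn', 'xx', 'fr', 'zz'])
labels = codes.map(lambda code: label_language(code, names))
print(labels.tolist())
```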

In [None]:
movie_df.drop(columns=['original_language'], inplace=True)

In [None]:
movie_df.head(50)

Round popularity and vote_average to one decimal

In [None]:
movie_df['popularity'] = movie_df['popularity'].round(1)
movie_df['vote_average'] = movie_df['vote_average'].round(1)

display(movie_df)

In [None]:
functions.show_missing_values(movie_df)

In [None]:
display(movie_df)

#### Check runtime
- Movies with a runtime of 40 minutes or less are considered short films and are removed.
- Many runtime values are likely missing.

In [None]:
# Flag short films: runtime between 1 and 40 minutes (inclusive)
short_film_mask = (movie_df['runtime'] >= 1) & (movie_df['runtime'] <= 40)

removed_count = movie_df[short_film_mask].shape[0]

movie_df = movie_df[~short_film_mask]

print(f'Number of rows removed: {removed_count}')
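
Because 'runtime' is a nullable `Int64` column, comparisons on missing values yield `<NA>`, and recent pandas versions treat `<NA>` in a boolean mask as `False` when indexing. The sketch below, on made-up values, shows that negating such a mask also drops rows with a missing runtime, and how filling the mask first keeps them. Which behavior is wanted here is a judgment call.

```python
import pandas as pd

# Made-up runtimes: zero, a short film, a feature, and a missing value
runtime = pd.Series([0, 30, 90, pd.NA], dtype='Int64')

# Comparisons against <NA> give <NA>, so the mask is [False, True, False, <NA>]
short = (runtime >= 1) & (runtime <= 40)

# <NA> stays <NA> after negation and is treated as False when indexing
kept_default = runtime[~short]

# Filling <NA> with False before negating keeps rows with an unknown runtime
kept_with_na = runtime[~short.fillna(False)]

print(len(kept_default), len(kept_with_na))
```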

In [None]:
display(movie_df)

#### Rename and reorder columns

In [None]:
rename_columns = {
    'id': 'tmdb_id',
    'vote_average': 'tmdb_rating',
    'vote_count': 'tmdb_votes'
}

movie_df.rename(columns=rename_columns, inplace=True)

In [None]:
new_column_order = [
    'title', 'clean_title', 'original_title', 'genres', 'director', 'release_year',
    'runtime', 'budget', 'revenue', 'popularity', 'tmdb_rating', 'tmdb_votes', 
    'imdb_rating', 'imdb_votes', 'language', 'tmdb_id', 'imdb_id'
]

# Reorder the DataFrame columns
movie_df = movie_df[new_column_order]


In [None]:
display(movie_df)

#### df to csv

In [None]:
movie_df = movie_df.sort_values(by='tmdb_id').reset_index(drop=True)

movie_df.to_csv('../data/local/clean/films_19to24.csv', index=False)
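
CSV files do not store dtypes, so nullable integer columns that contain gaps come back as floats unless the dtype is given when reading. A small round trip on made-up data, using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Made-up frame with a nullable integer column that has one gap
demo = pd.DataFrame({
    'tmdb_id': [1, 2],
    'budget': pd.array([1000, pd.NA], dtype='Int64'),
})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default read: the empty cell forces 'budget' to float64
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing dtype restores the nullable integer column
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={'budget': 'Int64'})

print(plain.dtypes['budget'], typed.dtypes['budget'])
```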

### Top 10s

In [None]:
# most popular genres:
# split column by commas
df_exploded_genres = movie_df['genres'].str.split(',').explode().str.strip()

# add column for popularity
df_genres_popularity = df_exploded_genres.to_frame(name='genre').join(movie_df['popularity'])

# calculate average popularity
genre_popularity = df_genres_popularity.groupby('genre')['popularity'].mean().sort_values(ascending=False)

print('Most Popular Genres:')
print(genre_popularity.head(10))

# calculate average popularity
language_popularity = movie_df.groupby('language')['popularity'].mean().sort_values(ascending=False)

print('\nMost Popular Languages:')
print(language_popularity.head(10))
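
The split, explode, and join pattern is easier to follow on a tiny made-up frame. After exploding, each genre row keeps the index of its film, which is what lets `join` attach the right popularity value.

```python
import pandas as pd

# Made-up data: the first film belongs to two genres
demo = pd.DataFrame({
    'genres': ['Action, Drama', 'Drama', 'Comedy'],
    'popularity': [10.0, 20.0, 6.0],
})

# One row per (film, genre); the original index is repeated for multi-genre films
exploded = demo['genres'].str.split(',').explode().str.strip()

# Join on the index to bring popularity alongside each genre
joined = exploded.to_frame(name='genre').join(demo['popularity'])

means = joined.groupby('genre')['popularity'].mean().sort_values(ascending=False)
print(means)
```

A film with several genres contributes its popularity once to each of them, so the genre averages are not independent of one another.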

In [None]:
# split 'genres' by commas, explode it, create a row for each genre
df_exploded_genres = movie_df['genres'].str.split(',').explode().str.strip()

# add 'imdb_rating' column to exploded genres
df_genres_imdb_rating = df_exploded_genres.to_frame(name='genre').join(movie_df['imdb_rating'])

# calculate average imdb_rating
genre_imdb_rating = df_genres_imdb_rating.groupby('genre')['imdb_rating'].mean().sort_values(ascending=False)

print('Top Genres by Average IMDB Rating:')
print(genre_imdb_rating.head(10))

# top languages by rating:
# calculate average imdb_rating
language_imdb_rating = movie_df.groupby('language')['imdb_rating'].mean().sort_values(ascending=False)

print('\nTop Languages by Average IMDB Rating:')
print(language_imdb_rating.head(10))

### Correlation Plots

In [None]:
# correlations for numeric columns
numeric_columns = ['popularity', 'revenue', 'budget', 'runtime', 'imdb_rating', 'imdb_votes']
correlation_matrix = movie_df[numeric_columns].corr()

# correlation matrix heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='winter', vmin=-1, vmax=1)
plt.title('Correlation Heatmap of Numeric Columns')
plt.show()

In [None]:
# most popular genres
# split and explode the genres column
df_exploded_genres = movie_df['genres'].str.split(',').explode().str.strip()
df_genres_popularity = df_exploded_genres.to_frame(name='genre').join(movie_df['popularity'])

# group and calculate average popularity
genre_popularity = df_genres_popularity.groupby('genre')['popularity'].mean().sort_values(ascending=False)
top_20_genres = genre_popularity.head(20)

top_20_genres.index = top_20_genres.index.str.title()

# top 20 plot
plt.figure(figsize=(10, 8))
sns.barplot(x=top_20_genres.values, y=top_20_genres.index, palette='viridis')
plt.title('Top 20 Most Popular Genres (2019-2024)')
plt.xlabel('Average Popularity')
plt.ylabel('Genre')
plt.show()

In [None]:
# most popular languages
language_popularity = movie_df.groupby('language')['popularity'].mean().sort_values(ascending=False)
top_10_languages = language_popularity.head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_10_languages.values, y=top_10_languages.index, palette='magma')
plt.title('Top 10 Most Popular Languages (2019-2024)')
plt.xlabel('Average Popularity')
plt.ylabel('Language')
plt.show()