# Working with the TMDB + IMDB dataset
- [Source](https://www.kaggle.com/datasets/alanvourch/tmdb-movies-daily-updates)

**Metadata**

| **#** | **Column**               | **Description**                                                                                     |
|-------|--------------------------|-----------------------------------------------------------------------------------------------------|
| 1     | **id**                   | Unique identifier for the film in the TMDB.                                                        |
| 2     | **title**                | The official title of the movie.                                                                   |
| 3     | **vote_average**         | Average rating of the movie on a scale from 0 to 10.                                               |
| 4     | **vote_count**           | Number of votes that contributed to the movie's rating.                                            |
| 5     | **status**               | The current release status of the movie (e.g., *Released*, *Post-Production*).                     |
| 6     | **release_date**         | The date when the film was officially released.                                                    |
| 7     | **revenue**              | Box office earnings of the movie.                                                                  |
| 8     | **runtime**              | Duration of the movie in minutes.                                                                  |
| 9     | **budget**               | Financial budget allocated for the movie production.                                               |
| 10    | **imdb_id**              | Identifier for the movie in the Internet Movie Database (IMDB).                                    |
| 11    | **original_language**    | The language in which the movie was originally produced.                                           |
| 12    | **original_title**       | The title of the movie in its original language.                                                   |
| 13    | **overview**             | Brief summary of the movie's plot.                                                                 |
| 14    | **popularity**           | Popularity score of the movie on TMDB.                                                             |
| 15    | **tagline**              | Official tagline of the movie.                                                                     |
| 16    | **genres**               | Categories of genres the movie belongs to.                                                         |
| 17    | **production_companies** | Companies involved in producing the movie.                                                        |
| 18    | **production_countries** | Countries where the movie was produced.                                                            |
| 19    | **spoken_languages**     | Languages spoken in the movie.                                                                     |
| 20    | **cast**                 | Names of the cast members.                                                                         |
| 21    | **director**             | Name(s) of the director(s).                                                                        |
| 22    | **director_of_photography** | Name(s) of the director(s) of photography (cinematographers).                                   |
| 23    | **writers**              | Names of the writers.                                                                              |
| 24    | **producers**            | Producers and executive producers.                                                                 |
| 25    | **music_composer**       | Name(s) of the music composer(s).                                                                  |
| 26    | **imdb_rating**          | Rating of the movie on IMDB.                                                                       |
| 27    | **imdb_votes**           | Number of votes behind the IMDB rating.                                                            |
| 28    | **poster_path**          | Path to the movie's poster image.                                                                  |




In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
import re
import matplotlib.pyplot as plt
import seaborn as sns
import langcodes


In [None]:
import sys
sys.path.append('../utils')
import functions

In [None]:
tmdb_df = pd.read_csv('../data/local/raw/TMDB_all_movies.csv')

In [None]:
functions.show_basic_info(tmdb_df)

In [None]:
# functions.show_column_summary(tmdb_df)
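
The `functions` module is not shown in this notebook, so the exact output of `show_basic_info` and `show_column_summary` is unknown. As a rough sketch of what such a helper might report, the hypothetical `show_basic_info_sketch` below (the name and its output columns are assumptions, not the author's implementation) lists dtype, missing count and unique count per column on a small made-up frame:

```python
import pandas as pd


def show_basic_info_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for functions.show_basic_info (assumed behaviour)."""
    print(f'Rows: {len(df)}, columns: {df.shape[1]}')
    # One row per column: dtype, number of missing values, number of distinct values
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing': df.isna().sum(),
        'unique': df.nunique(),
    })


# Made-up sample data, not taken from the TMDB file
sample = pd.DataFrame({'title': ['A', 'B', None], 'runtime': [90.0, None, 120.0]})
info = show_basic_info_sketch(sample)
print(info)
```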

Columns to drop:
- cast
- director_of_photography
- music_composer
- poster_path
- writers
- tagline

In [None]:
tmdb_df.drop(columns=['cast', 'director_of_photography', 'music_composer', 'poster_path', 'writers', 'tagline'], inplace=True)
tmdb_df.head()

In [None]:
functions.check_for_duplicates(tmdb_df)
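
`check_for_duplicates` also lives in the unseen `functions` module. A duplicate check in plain pandas could look like the sketch below. It assumes that `id` should be unique per movie, and it uses made-up rows:

```python
import pandas as pd

# Made-up sample data: the second and third rows are identical
sample = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'title': ['A', 'B', 'B', 'C'],
})

# Rows that repeat an earlier row in every column
full_dupes = int(sample.duplicated().sum())

# Rows that repeat an earlier 'id' (assumption: 'id' identifies a movie)
id_dupes = int(sample.duplicated(subset='id').sum())

# Keep the first occurrence of each id
deduped = sample.drop_duplicates(subset='id', keep='first').reset_index(drop=True)

print(f'Fully duplicated rows: {full_dupes}')
print(f'Duplicated ids: {id_dupes}')
print(f'Rows after dropping duplicates: {len(deduped)}')
```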

Remove all rows where 'status' is not 'Released'

In [None]:
print(tmdb_df['status'].unique())

In [None]:
initial_rows = len(tmdb_df)
tmdb_df = tmdb_df[tmdb_df['status'] == 'Released'] # keep rows where 'status' is 'Released'
final_rows = len(tmdb_df)
removed_rows = initial_rows - final_rows
print(f'Number of rows removed: {removed_rows}')
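
One detail of the filter above: a row whose `status` is missing never equals `'Released'`, so it is removed as well. The sketch below, on made-up rows, counts statuses with `dropna=False` to make that visible before filtering:

```python
import pandas as pd

# Made-up sample data with one missing status
sample = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'status': ['Released', 'Post Production', None, 'Released'],
})

# dropna=False keeps the missing status as its own bucket in the counts
status_counts = sample['status'].value_counts(dropna=False)
print(status_counts)

# Same comparison as in the notebook: missing values compare as False
released = sample[sample['status'] == 'Released']
removed = len(sample) - len(released)
print(f'Number of rows removed: {removed}')
```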

#### 'release_date' column
- Convert to datetime
- Extract year only
- Convert year to integer

In [None]:
df = tmdb_df.copy()

In [None]:
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
df['release_year'] = df['release_date'].dt.year.astype('Int64')
print(df[['release_date', 'release_year']].head())
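
To illustrate why `errors='coerce'` and the nullable `Int64` dtype are paired here, the small sketch below runs the same two steps on made-up values. Unparseable or missing dates become `NaT`, and the year column keeps them as `<NA>` rather than falling back to floats:

```python
import pandas as pd

# Made-up sample data: one valid date, one invalid string, one missing value
sample = pd.DataFrame({'release_date': ['2018-05-04', 'not a date', None]})

# Invalid or missing entries become NaT instead of raising an error
sample['release_date'] = pd.to_datetime(sample['release_date'], errors='coerce')

# Nullable Int64 stores the year as an integer and missing years as <NA>
sample['release_year'] = sample['release_date'].dt.year.astype('Int64')

print(sample)
print(f"Unparseable or missing dates: {sample['release_date'].isna().sum()}")
```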

Count the released movies per year in `df`

In [None]:
release_year_counts = df['release_year'].value_counts().sort_index()
print(release_year_counts)

In [None]:
release_year_counts = df['release_year'].value_counts().sort_index()

top_20_years = release_year_counts.nlargest(20).sort_index()

top_20_years.index = top_20_years.index.astype(str)

# Create the line plot
plt.figure(figsize=(10, 6))
sns.lineplot(x=top_20_years.index, y=top_20_years.values, color='lightcoral')

plt.title('Top 20 Years with Most Film Releases')
plt.xlabel('Year')
plt.ylabel('Releases')
plt.xticks(rotation=45)
plt.grid(True)

plt.show()

### Create new DF from one specific year

In [None]:
df_2018 = df[df['release_year'] == 2018]

df_2018 = df_2018.reset_index(drop=True)

print(f'Number of rows in the 2018 dataset: {len(df_2018)}')

In [None]:
df_2018.head()

Drop Columns
- 'id'
- 'imdb_id'
- 'overview'
- 'production_companies'
- 'production_countries'
- 'producers'
- 'release_date'
- 'spoken_languages'
- 'status'
- 'vote_average'
- 'vote_count'

In [None]:
# Drop specified columns
columns_to_drop = ['id', 'imdb_id', 'overview', 'production_companies', 'production_countries', 'producers', 'release_date', 'spoken_languages', 'status', 'vote_average', 'vote_count']

df_2018 = df_2018.drop(columns=columns_to_drop)

In [None]:
df_2018['clean_title'] = functions.prepare_clean_titles(df_2018, 'title')

# reorder columns
df_2018 = df_2018[['title', 'original_title', 'clean_title', 'release_year', 'imdb_rating', 'imdb_votes', 'genres', 'director', 'revenue', 'budget', 'runtime', 'original_language', 'popularity']]
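
The implementation of `prepare_clean_titles` is not part of this notebook. One common way to normalise titles for matching is sketched below. The function `clean_title_sketch` and its exact rules (lowercase, strip accents, drop punctuation, collapse spaces) are assumptions for illustration, not the author's helper:

```python
import re
import unicodedata

import pandas as pd


def clean_title_sketch(title):
    """Hypothetical title normaliser (assumed rules, see lead-in)."""
    if pd.isna(title):
        return ''
    # Decompose accented characters, then drop the combining marks
    text = unicodedata.normalize('NFKD', str(title))
    text = ''.join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and replace anything that is not a letter, digit or space
    text = re.sub(r'[^a-z0-9\s]', ' ', text.lower())
    # Collapse repeated whitespace
    return re.sub(r'\s+', ' ', text).strip()


# Illustrative titles only
titles = pd.Series(['Amélie', 'Spider-Man: Into the Spider-Verse', None])
clean = titles.apply(clean_title_sketch)
print(clean.tolist())
```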

In [None]:
# clean genres
df_2018['genres'] = functions.clean_genres(df_2018, 'genres')
df_2018.head()

In [None]:
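# Cast float columns to nullable integers (Int64 keeps missing values as <NA>).
# Note: with errors='ignore' a failed cast is skipped silently, so check the dtypes afterwards.

columns_to_convert = ['imdb_votes', 'revenue', 'budget', 'runtime']

df_2018[columns_to_convert] = df_2018[columns_to_convert].astype('Int64', errors='ignore')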

df_2018.head()
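
Casting floats to `Int64` only succeeds when every non-missing value is a whole number. A possible alternative, sketched below on made-up values, is to round first and then cast, so a fractional runtime cannot block the conversion. Whether rounding is acceptable for this dataset is an assumption:

```python
import pandas as pd

# Made-up sample data: one fractional runtime and some missing values
sample = pd.DataFrame({
    'runtime': [90.0, None, 101.4],
    'budget': [1000000.0, 0.0, None],
})

converted = sample.copy()
for col in ['runtime', 'budget']:
    # Round to whole numbers first, then cast to the nullable integer dtype
    converted[col] = sample[col].round().astype('Int64')

print(converted.dtypes)
print(converted)
```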

In [None]:
# Count rows that have at least one missing value
total_rows = len(df_2018)

empty_rows = df_2018.isna().any(axis=1).sum()

print(f'Total number of rows: {total_rows}')
print(f'Number of rows with empty values: {empty_rows}')
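
The row-level count above does not show which columns are responsible. A per-column breakdown, sketched below on made-up data, can help decide what to drop or impute. The zero-budget check assumes that a budget of 0 may stand for "unknown", which this notebook does not confirm:

```python
import pandas as pd

# Made-up sample data
sample = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'budget': [0, 500, None],
    'imdb_rating': [7.1, None, None],
})

# Missing values per column, largest first
missing_per_column = sample.isna().sum().sort_values(ascending=False)

# Share of missing values per column, in percent
missing_share = (sample.isna().mean() * 100).round(1)

# Assumption: a budget of 0 may be a placeholder rather than a real value
zero_budget = int((sample['budget'] == 0).sum())

print(missing_per_column)
print(missing_share)
print(f'Rows with a zero budget: {zero_budget}')
```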


In [None]:
df_2018['language'] = df_2018['original_language'].apply(functions.get_language_name)

df_2018.head()

### Top 10s

In [None]:
# For most popular genres:
# Split the 'genres' column by commas and explode it to create a new row for each genre
df_exploded_genres = df_2018['genres'].str.split(',').explode().str.strip()

# Add the 'popularity' column to the exploded genres
df_genres_popularity = df_exploded_genres.to_frame(name='genre').join(df_2018['popularity'])

# Group by genre and calculate average popularity
genre_popularity = df_genres_popularity.groupby('genre')['popularity'].mean().sort_values(ascending=False)

# Display top 10 most popular genres
print('Most Popular Genres:')
print(genre_popularity.head(10))

# For most popular languages:
# Group by 'language' and calculate average popularity
language_popularity = df_2018.groupby('language')['popularity'].mean().sort_values(ascending=False)

# Display top 10 most popular languages
print('\nMost Popular Languages:')
print(language_popularity.head(10))
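
The split, explode and join pattern used above can be checked on a tiny made-up frame. After `explode`, a movie with two genres appears twice under the same index label, so joining on the index copies its popularity to both rows:

```python
import pandas as pd

# Made-up sample data: the first movie has two genres, the last has none
sample = pd.DataFrame({
    'genres': ['action, drama', 'drama', None],
    'popularity': [10.0, 4.0, 7.0],
})

# One row per (movie, genre); the original index label is repeated
exploded = sample['genres'].str.split(',').explode().str.strip()

# Joining on the index attaches each movie's popularity to all of its genres
joined = exploded.to_frame(name='genre').join(sample['popularity'])

# groupby drops the missing genre by default
mean_popularity = joined.groupby('genre')['popularity'].mean()

print(joined)
print(mean_popularity)
```

A movie with several genres contributes to every one of them, so the genre averages are not independent of each other.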


In [None]:
# Split the 'genres' column by commas and explode it to create a new row for each genre
df_exploded_genres = df_2018['genres'].str.split(',').explode().str.strip()

# Add the 'imdb_rating' column to the exploded genres
df_genres_imdb_rating = df_exploded_genres.to_frame(name='genre').join(df_2018['imdb_rating'])

# Group by genre and calculate average imdb_rating
genre_imdb_rating = df_genres_imdb_rating.groupby('genre')['imdb_rating'].mean().sort_values(ascending=False)

# Display top 10 highest-rated genres
print('Highest-Rated Genres:')
print(genre_imdb_rating.head(10))

# For highest-rated languages:
# Group by 'language' and calculate average imdb_rating
language_imdb_rating = df_2018.groupby('language')['imdb_rating'].mean().sort_values(ascending=False)

# Display top 10 highest-rated languages
print('\nHighest-Rated Languages:')
print(language_imdb_rating.head(10))
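
A plain group mean can be dominated by languages with very few films. One possible safeguard, sketched below on made-up data, is to compute the group size alongside the mean and keep only groups above a threshold. The value of `min_films` is an arbitrary assumption:

```python
import pandas as pd

# Made-up sample data: one language has a single, highly rated film
sample = pd.DataFrame({
    'language': ['English', 'English', 'English', 'Latin'],
    'imdb_rating': [6.0, 7.0, 8.0, 9.5],
})

# Mean rating and number of rated films per language
stats = sample.groupby('language')['imdb_rating'].agg(['mean', 'count'])

min_films = 3  # hypothetical threshold, tune for the real data
reliable = stats[stats['count'] >= min_films].sort_values('mean', ascending=False)

print(stats)
print(reliable)
```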

### Correlations

In [None]:
# 1. Calculate correlations for numeric columns
numeric_columns = ['popularity', 'revenue', 'budget', 'runtime', 'imdb_rating', 'imdb_votes']
correlation_matrix = df_2018[numeric_columns].corr()
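
`DataFrame.corr` defaults to the Pearson coefficient and leaves out missing values pair by pair. For heavily skewed columns such as budget and revenue, a rank-based Spearman coefficient is sometimes a useful comparison. The sketch below uses made-up numbers and is an optional check, not part of the original analysis:

```python
import pandas as pd

# Made-up sample data: strongly skewed values and one missing budget
sample = pd.DataFrame({
    'budget': [1.0, 10.0, 100.0, 1000.0, None],
    'revenue': [2.0, 30.0, 250.0, 5000.0, 40.0],
})

# Pearson measures linear association; the row with a missing budget is skipped
pearson = sample.corr(method='pearson')

# Spearman works on ranks, so it captures any monotonic relationship
spearman = sample.corr(method='spearman')

print(pearson)
print(spearman)
```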

In [None]:
# 2. Plot correlation matrix heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='winter', vmin=-1, vmax=1)
plt.title('Correlation Heatmap of Numeric Columns')
plt.show()

In [None]:
# 3. Top 20 most popular genres
# For genres, split and explode the genres column
df_exploded_genres = df_2018['genres'].str.split(',').explode().str.strip()
df_genres_popularity = df_exploded_genres.to_frame(name='genre').join(df_2018['popularity'])

In [None]:
# Group by genre and calculate average popularity
genre_popularity = df_genres_popularity.groupby('genre')['popularity'].mean().sort_values(ascending=False)
top_20_genres = genre_popularity.head(20)

In [None]:
# Title-case each genre name for display
top_20_genres.index = top_20_genres.index.str.title()

# Plot the top 20 most popular genres
plt.figure(figsize=(10, 8))
sns.barplot(x=top_20_genres.values, y=top_20_genres.index, palette='viridis')
plt.title('Top 20 Most Popular Genres (2018)')
plt.xlabel('Average Popularity')
plt.ylabel('Genre')
plt.show()


In [None]:
# 4. Top 10 most popular languages
language_popularity = df_2018.groupby('language')['popularity'].mean().sort_values(ascending=False)
top_10_languages = language_popularity.head(10)

In [None]:
# Plot the top 10 most popular languages
plt.figure(figsize=(10, 6))
sns.barplot(x=top_10_languages.values, y=top_10_languages.index, palette='magma')
plt.title('Top 10 Most Popular Languages (2018)')
plt.xlabel('Average Popularity')
plt.ylabel('Language')
plt.show()