# Work with TMDB + IMDB dataset 
- [Source](https://www.kaggle.com/datasets/alanvourch/tmdb-movies-daily-updates)

**Metadata**

| **#** | **Column**               | **Description**                                                                                     |
|-------|--------------------------|-----------------------------------------------------------------------------------------------------|
| 1     | **id**                   | Unique identifier for the film in the TMDB.                                                        |
| 2     | **title**                | The official title of the movie.                                                                   |
| 3     | **vote_average**         | Average rating of the movie on a scale from 0 to 10.                                               |
| 4     | **vote_count**           | Number of votes that contributed to the movie's rating.                                            |
| 5     | **status**               | The current release status of the movie (e.g., *Released*, *Post-Production*).                     |
| 6     | **release_date**         | The date when the film was officially released.                                                    |
| 7     | **revenue**              | Box office earnings of the movie.                                                                  |
| 8     | **runtime**              | Duration of the movie in minutes.                                                                  |
| 9     | **budget**               | Financial budget allocated for the movie production.                                               |
| 10    | **imdb_id**              | Identifier for the movie in the Internet Movie Database (IMDB).                                    |
| 11    | **original_language**    | The language in which the movie was originally produced.                                           |
| 12    | **original_title**       | The title of the movie in its original language.                                                   |
| 13    | **overview**             | Brief summary of the movie's plot.                                                                 |
| 14    | **popularity**           | Popularity score of the movie on TMDB.                                                             |
| 15    | **tagline**              | Official tagline of the movie.                                                                     |
| 16    | **genres**               | Categories of genres the movie belongs to.                                                         |
| 17    | **production_companies** | Companies involved in producing the movie.                                                        |
| 18    | **production_countries** | Countries where the movie was produced.                                                            |
| 19    | **spoken_languages**     | Languages spoken in the movie.                                                                     |
| 20    | **cast**                 | Cast members of the movie.                                                                         |
| 21    | **director**             | Director(s) of the movie.                                                                          |
| 22    | **director_of_photography** | Director(s) of photography (cinematographers).                                                  |
| 23    | **writers**              | Writers of the movie.                                                                              |
| 24    | **producers**            | Producers and executive producers.                                                                 |
| 25    | **music_composer**       | Composer(s) of the movie's music.                                                                  |
| 26    | **imdb_rating**          | Rating of the movie on IMDB.                                                                       |
| 27    | **imdb_votes**           | Number of votes the movie received on IMDB.                                                        |
| 28    | **poster_path**          | Path to the movie's poster image.                                                                  |




In [20]:
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns

In [21]:
import sys
sys.path.append('../utils')
import data_cleaning
import data_inspection
import helpers

In [22]:
tmdb_df = pd.read_csv('../data/local/raw/1_TMDB_all_movies.csv')

In [None]:
data_inspection.show_basic_info(tmdb_df)

In [24]:
# drop columns
tmdb_df.drop(columns=['cast', 'director_of_photography', 'music_composer', 'poster_path', 'writers', 'spoken_languages', 'producers'], inplace=True)

In [None]:
# check for duplicates
data_inspection.check_for_duplicates(tmdb_df)

In [None]:
# remove rows where 'status' is not 'Released'
# double quotes outside so the nested single quotes also parse on Python < 3.12
print(f"Unique values in status column:\n{tmdb_df['status'].unique()}\n")

initial_rows = len(tmdb_df)
tmdb_df = tmdb_df[tmdb_df['status'] == 'Released'] # keep rows where 'status' is 'Released'
final_rows = len(tmdb_df)
removed_rows = initial_rows - final_rows
print(f'\nNumber of rows removed: {removed_rows}')

In [27]:
# remove 'status' column after filtering (every remaining row is 'Released')
tmdb_df.drop(['status'], axis=1, inplace=True)

In [28]:
# create copy
df = tmdb_df.copy()

In [None]:
# convert 'release_date' to datetime, extract year and convert it to int
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
df['release_year'] = df['release_date'].dt.year.astype('Int64')
print(df[['release_date', 'release_year']].head())
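
The conversion above relies on two pandas behaviours: `errors='coerce'` turns unparseable dates into `NaT`, and the nullable `Int64` dtype lets the year column hold integers next to missing values. A minimal sketch on a made-up three-row frame (the `sample` data is invented for illustration and is not part of the dataset):

```python
import pandas as pd

# toy data: one valid date, one unparseable string, one missing value
sample = pd.DataFrame({'release_date': ['1999-03-31', 'not a date', None]})

# unparseable or missing entries become NaT instead of raising
sample['release_date'] = pd.to_datetime(sample['release_date'], errors='coerce')

# .dt.year yields floats when NaT is present; 'Int64' keeps integer years plus <NA>
sample['release_year'] = sample['release_date'].dt.year.astype('Int64')

print(sample)
```

Using plain `int` instead of `'Int64'` would raise as soon as a single year is missing, which is why the nullable dtype is used here.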

In [30]:
# drop 'release_date' column
df.drop(columns='release_date', inplace=True)

In [None]:
release_year_counts = df['release_year'].value_counts().sort_index()
print(f'Movie releases per year:\n{release_year_counts}')

In [None]:
# keep only rows within a specific timeframe (here: 1906 to 2024, inclusive)
initial_count = len(df)
df = df[(df['release_year'] >= 1906) & (df['release_year'] <= 2024)]
final_count = len(df)
rows_dropped = initial_count - final_count
print(f'Rows dropped after filtering year: {rows_dropped}')
display(df)

In [None]:
release_year_counts = df['release_year'].value_counts().sort_index()
print(f'Movie releases per year:\n{release_year_counts}')

In [None]:
data_inspection.show_missing_values(df)

In [None]:
# drop missing titles
df = data_cleaning.drop_empty_rows_from_column(df, 'title')

In [None]:
# data conversion to int
columns_to_convert = ['imdb_votes', 'revenue', 'budget', 'runtime', 'vote_count']
data_cleaning.convert_columns_to_int(df, columns_to_convert)
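
The source of `data_cleaning.convert_columns_to_int` is not shown in this notebook. Since the call does not reassign `df`, it presumably modifies the frame in place. The sketch below is one possible implementation under that assumption; the function name `convert_columns_to_int_sketch` and the toy data are hypothetical.

```python
import pandas as pd

def convert_columns_to_int_sketch(frame, columns):
    """Hypothetical stand-in for data_cleaning.convert_columns_to_int."""
    for col in columns:
        # coerce non-numeric entries to NaN, then store as nullable integers
        numeric = pd.to_numeric(frame[col], errors='coerce')
        frame[col] = numeric.round().astype('Int64')
    return frame

# toy data: floats with a gap, and numeric strings mixed with junk
demo = pd.DataFrame({'runtime': [90.0, None, 121.0], 'budget': ['1000', 'n/a', '250']})
convert_columns_to_int_sketch(demo, ['runtime', 'budget'])
print(demo.dtypes)
```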

In [18]:
# df.to_csv('../data/local/raw/2_tmdb_backup.csv', index=False)

In [None]:
# reload df from the backup file while testing the notebook
df = pd.read_csv('../data/local/raw/2_tmdb_backup.csv')

In [None]:
display(df)

In [21]:
# get language names
df['language'] = df['original_language'].apply(helpers.get_language_name)
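
`helpers.get_language_name` is defined outside this notebook. Judging by the labels handled in the following cells (for example `Unknown language [cn]`), it maps a language code to a name and falls back to a bracketed placeholder. A rough sketch of that behaviour, assuming a small hand-written lookup table; `LANGUAGE_NAMES` and `get_language_name_sketch` are hypothetical names, and the real helper may use a full ISO 639 source instead:

```python
# hypothetical, deliberately tiny lookup table; the real helper likely covers far more codes
LANGUAGE_NAMES = {'en': 'English', 'fr': 'French', 'ja': 'Japanese'}

def get_language_name_sketch(code):
    """Hypothetical stand-in for helpers.get_language_name."""
    if not isinstance(code, str) or not code.strip():
        return 'Unknown'  # NaN or empty input
    code = code.strip().lower()
    # unmapped codes keep the raw code visible so they can be relabeled later
    return LANGUAGE_NAMES.get(code, f'Unknown language [{code}]')

print(get_language_name_sketch('en'))
print(get_language_name_sketch('cn'))
```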

In [None]:
# handle unknown language(s)
print(f"Unique values in language column:\n{df['language'].unique()}")
print(f"Value counts in language column:\n{df['language'].value_counts()}")

In [23]:
# replace [cn] with a proper label
unknown_lang = df[df['original_language'] == 'cn']
df['language'] = df['language'].replace('Unknown language [cn]', 'Cantonese')

In [24]:
# replace [xx] with a proper label
unknown_lang = df[df['original_language'] == 'xx']
df['language'] = df['language'].replace('Unknown language [xx]', 'Unknown')

In [None]:
print(f"Unique values in language column after relabeling: {df['language'].unique()}")

In [26]:
# drop column after extracting languages
df.drop(columns='original_language', inplace=True)

In [None]:
# parse 'genres' column: make sure genres are all lowercase, without extra whitespace, and separated by commas
df['genres'] = helpers.clean_genres(df, 'genres')
print('Unique genres:')
print(df['genres'].unique())

In [None]:
def drop_rows_with_specific_genres(df, column_name='genres', genres_to_exclude=None):
    if genres_to_exclude is None:
        genres_to_exclude = {'documentary', 'music'}

    def contains_excluded_genre(genres):
        if isinstance(genres, list):  # If genres is a list
            return any(genre.strip().lower() in genres_to_exclude for genre in genres)
        elif isinstance(genres, str):  # If genres is a comma-separated string
            return any(genre.strip().lower() in genres_to_exclude for genre in genres.split(','))
        return False  # For NaN or other invalid cases

    rows_before = len(df)  # Number of rows before filtering
    filtered_df = df[~df[column_name].apply(contains_excluded_genre)]
    rows_after = len(filtered_df)  # Number of rows after filtering

    rows_dropped = rows_before - rows_after
    print(f'Number of rows dropped: {rows_dropped}')

    return filtered_df

df = drop_rows_with_specific_genres(df, column_name='genres')

In [None]:
df = helpers.drop_rows_by_runtime(df, column_name='runtime', min_runtime=40)

In [30]:
# replace 'United States of America' with 'USA' and 'United Kingdom' with 'UK' in 'production_countries'
# (regex=True makes this a substring replacement, so multi-country cells are handled too)
df['production_countries'] = df['production_countries'].replace(
    {'United States of America': 'USA', 'United Kingdom': 'UK'}, regex=True
)
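
The `regex=True` flag matters because a cell can list several countries. Without it, `replace` only swaps cells whose entire value equals a dictionary key. A small comparison on invented values (the `countries` series is illustrative only):

```python
import pandas as pd

countries = pd.Series(['United States of America, United Kingdom', 'France'])
mapping = {'United States of America': 'USA', 'United Kingdom': 'UK'}

# substring replacement: every occurrence inside the cell is rewritten
with_regex = countries.replace(mapping, regex=True)

# whole-value replacement: the multi-country cell matches no key and stays unchanged
without_regex = countries.replace(mapping)

print(with_regex.tolist())
print(without_regex.tolist())
```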

In [None]:
def remove_title_keywords(df, column_name, words_list):
    if df is None:
        raise ValueError("Input DataFrame is None")
    
    pattern = '|'.join([rf'\b{re.escape(word)}\b' for word in words_list])
    
    initial_count = len(df)  # number of rows before filtering

    df_filtered = df[~df[column_name].str.contains(pattern, case=False, na=False)]

    filtered_count = len(df_filtered)
    print(f'Rows removed: {initial_count - filtered_count}')
    
    return df_filtered

words_to_remove = ['vixen', 'rape', 'slut', 'playboy', 'live at', 'baby einstein', 'championship', 'standup', 'wwe', 'wec', 'fia', 'playoff', 'ufc', 'mma', 'wcw', 'porn', 'snuff', 'nfl', 'nhl', 'raw sex', 'milf', 'molester', 'bondage', 'nba', 'tits', 'f1']
df = remove_title_keywords(df, 'title', words_to_remove)
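
The `\b` word boundaries in the pattern keep short keywords from matching inside longer words. The standalone check below uses only the standard library and a few invented titles to show which ones the pattern would flag:

```python
import re

words = ['wwe', 'live at']
# same construction as in remove_title_keywords: escaped keywords wrapped in word boundaries
pattern = '|'.join(rf'\b{re.escape(word)}\b' for word in words)

titles = ['WWE Raw 2005', 'Alive at Night', 'Live at Wembley', 'Sweet Home']
flagged = [t for t in titles if re.search(pattern, t, flags=re.IGNORECASE)]

print(flagged)
```

'Alive at Night' is not flagged because 'live at' is preceded by a letter there, so no word boundary exists at that position.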

In [None]:
data_inspection.show_missing_values(df)

In [33]:
rename_columns = {
    'id': 'tmdb_id',
    'vote_average': 'tmdb_rating',
    'vote_count': 'tmdb_votes'
}

df.rename(columns=rename_columns, inplace=True)

In [34]:
# round decimals
df['popularity'] = df['popularity'].round(1)
df['tmdb_rating'] = df['tmdb_rating'].round(1)

In [35]:
# generate clean title column
df['clean_title'] = helpers.prepare_clean_titles(df, 'title')

In [None]:
display(df)

In [36]:
## create .csv file
df = df.sort_values(by='tmdb_id').reset_index(drop=True)
# df.to_csv('../data/local/raw/3_tmdb_released_movies.csv', index=False)

## Check some stats

### Top 10s

In [None]:
# most popular genres:
# split column by commas
df_exploded_genres = df['genres'].str.split(',').explode().str.strip()

# add column for popularity
df_genres_popularity = df_exploded_genres.to_frame(name='genre').join(df['popularity'])

# calculate average popularity
genre_popularity = df_genres_popularity.groupby('genre')['popularity'].mean().sort_values(ascending=False)

print('Most Popular Genres:')
print(genre_popularity.head(10))

# most popular languages:
# calculate average popularity
language_popularity = df.groupby('language')['popularity'].mean().sort_values(ascending=False)

print('\nMost Popular Languages:')
print(language_popularity.head(10))
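
The genre ranking works because `explode` keeps the original row index, so joining on the index copies each movie's popularity onto every one of its genres. A toy example with invented values makes the intermediate frame visible:

```python
import pandas as pd

toy = pd.DataFrame({
    'genres': ['action, drama', 'drama', None],
    'popularity': [10.0, 4.0, 7.0],
})

# one row per genre; the index still points back at the source movie
exploded = toy['genres'].str.split(',').explode().str.strip()

# index-based join repeats the movie's popularity for each of its genres
per_genre = exploded.to_frame(name='genre').join(toy['popularity'])

# rows with a missing genre are dropped by groupby
means = per_genre.groupby('genre')['popularity'].mean()

print(per_genre)
print(means)
```

Note that a movie with several genres contributes to each genre's average, so the per-genre means are not independent of one another.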

In [None]:
# split 'genres' by commas, explode it, create a row for each genre
df_exploded_genres = df['genres'].str.split(',').explode().str.strip()

# add 'imdb_rating' column to the exploded genres
df_genres_imdb_rating = df_exploded_genres.to_frame(name='genre').join(df['imdb_rating'])

# calculate average imdb_rating
genre_imdb_rating = df_genres_imdb_rating.groupby('genre')['imdb_rating'].mean().sort_values(ascending=False)

print('Highest-Rated Genres (IMDB):')
print(genre_imdb_rating.head(10))

# highest-rated languages:
# calculate average imdb_rating
language_imdb_rating = df.groupby('language')['imdb_rating'].mean().sort_values(ascending=False)

print('\nHighest-Rated Languages (IMDB):')
print(language_imdb_rating.head(10))  # print the IMDB-based ranking, not language_popularity

### Correlation Plots

In [None]:
# correlations for numeric columns
numeric_columns = ['popularity', 'revenue', 'budget', 'runtime', 'imdb_rating', 'imdb_votes']
correlation_matrix = df[numeric_columns].corr()

# correlation matrix heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='winter', vmin=-1, vmax=1)
plt.title('Correlation Heatmap of Numeric Columns')
plt.show()

In [None]:
# most popular genres
# split and explode the genres column
df_exploded_genres = df['genres'].str.split(',').explode().str.strip()
df_genres_popularity = df_exploded_genres.to_frame(name='genre').join(df['popularity'])

# group and calculate average popularity
genre_popularity = df_genres_popularity.groupby('genre')['popularity'].mean().sort_values(ascending=False)
top_20_genres = genre_popularity.head(20)

top_20_genres.index = top_20_genres.index.str.title()

# top 20 plot
plt.figure(figsize=(10, 8))
sns.barplot(x=top_20_genres.values, y=top_20_genres.index, palette='viridis')
plt.title('Top 20 Most Popular Genres (1906-2024)')
plt.xlabel('Average Popularity')
plt.ylabel('Genre')
plt.show()

In [None]:
# most popular languages
language_popularity = df.groupby('language')['popularity'].mean().sort_values(ascending=False)
top_10_languages = language_popularity.head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_10_languages.values, y=top_10_languages.index, palette='magma')
plt.title('Top 10 Most Popular Languages (1906-2024)')
plt.xlabel('Average Popularity')
plt.ylabel('Language')
plt.show()