# TMDB Clean Films Dataset Summary

## General Information
- **Number of Rows**: 9783  
- **Number of Columns**: 23

## Key Insights
- **Languages**: 61 unique languages  
- **Countries**: 112 unique countries  
- **Genres**: 19 unique genres  

## Runtime Statistics
- **Shortest Film Runtime**: 56 minutes  
- **Longest Film Runtime**: 254 minutes  

## Release Year Range
- **Earliest Year**: 1913
- **Latest Year**: 2024

## Columns in the Dataset
1. `tmdb_id` - Unique ID for the film from TMDB  
2. `imdb_id` - Unique ID for the film from IMDB  
3. `doesthedog_id` - Reference ID for content warnings  
4. `title` - Title of the film  
5. `original_title` - Original title of the film  
6. `genres` - Film genres (comma-separated)  
7. `director` - Film director(s)  
8. `release_year` - Year the film was released  
9. `runtime` - Duration of the film (in minutes)  
10. `budget` - Production budget of the film  
11. `revenue` - Total revenue earned by the film  
12. `profit` - Net profit (revenue - budget)  
13. `popularity` - Popularity score from TMDB  
14. `tmdb_rating` - Average rating on TMDB  
15. `tmdb_votes` - Number of votes on TMDB  
16. `imdb_rating` - Average rating on IMDB  
17. `imdb_votes` - Number of votes on IMDB  
18. `language` - Languages spoken in the film (comma-separated)  
19. `countries` - Countries of origin (comma-separated)  
20. `overview` - Brief summary or description of the film  
21. `tagline` - Film tagline  
22. `events` - Specific events or warnings associated with the film  
23. `has_warnings` - Boolean flag indicating the presence of content warnings  
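
The `profit` definition in the list above (revenue minus budget) can be checked mechanically. The following is a minimal sketch on made-up rows, not on the real CSV; the helper name `profit_mismatches` is hypothetical and is not part of the project's `helpers` module.

```python
import pandas as pd

def profit_mismatches(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows where profit differs from revenue - budget."""
    expected = df["revenue"] - df["budget"]
    return df[df["profit"] != expected]

# Illustrative data only; these are not rows from the dataset.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [100, 50, 0],
    "revenue": [250, 40, 0],
    "profit": [150, -10, 5],  # the last row is deliberately inconsistent
})

bad_rows = profit_mismatches(sample)
print(bad_rows["title"].tolist())
```

An empty result on the real data would suggest the column follows the stated formula.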


---

## Notes
- **Dataset Cleaning**:  
  - Films with runtimes below 40 minutes or above 300 minutes were excluded as likely outliers (e.g., TV series or incorrect entries).  
- **Data Coverage**:  
  - The dataset spans over a century of cinema, from **1913 to 2024**.  
  - Includes a diverse range of films across **61 languages**, **112 countries**, and **19 genres**.


In [36]:
import pandas as pd
import plotly.express as px
from collections import Counter

In [37]:
import sys
sys.path.append('../utils')
sys.path.append('../scripts')
import helpers
import data_inspection
import content_tagging

In [38]:
films = pd.read_csv('../data/clean/tmdb_clean_films.csv')

In [39]:
data_inspection.show_basic_info(films)


DataFrame Shape: (9785, 23)
Number of Rows: 9785
Number of Columns: 23

Data Types of Columns:
tmdb_id             int64
imdb_id            object
doesthedog_id     float64
title              object
original_title     object
genres             object
director           object
release_year        int64
runtime             int64
budget              int64
revenue             int64
profit              int64
popularity        float64
tmdb_rating       float64
tmdb_votes          int64
imdb_rating       float64
imdb_votes          int64
language           object
countries          object
overview           object
tagline            object
events             object
dtype: object

Missing Values per Column:
tmdb_id              0
imdb_id              0
doesthedog_id      494
title                0
original_title       0
genres               1
director             1
release_year         0
runtime              0
budget               0
revenue              0
profit               0
popularity    

In [40]:
print(films.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9785 entries, 0 to 9784
Data columns (total 23 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tmdb_id         9785 non-null   int64  
 1   imdb_id         9785 non-null   object 
 2   doesthedog_id   9291 non-null   float64
 3   title           9785 non-null   object 
 4   original_title  9785 non-null   object 
 5   genres          9784 non-null   object 
 6   director        9784 non-null   object 
 7   release_year    9785 non-null   int64  
 8   runtime         9785 non-null   int64  
 9   budget          9785 non-null   int64  
 10  revenue         9785 non-null   int64  
 11  profit          9785 non-null   int64  
 12  popularity      9785 non-null   float64
 13  tmdb_rating     9785 non-null   float64
 14  tmdb_votes      9785 non-null   int64  
 15  imdb_rating     9785 non-null   float64
 16  imdb_votes      9785 non-null   int64  
 17  language        9785 non-null   o

In [41]:
films = helpers.drop_rows_by_runtime(films, min_runtime=40, max_runtime=300)

Number of rows dropped (runtime < 40 or runtime > 300): 2
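
The source of `helpers.drop_rows_by_runtime` is not shown in this notebook. As a rough sketch only, a filter with the same parameters and a similar message could look like the function below; the name `drop_rows_by_runtime_sketch` is hypothetical, and the inclusive bounds are an assumption inferred from the printed message (`runtime < 40 or runtime > 300`).

```python
import pandas as pd

def drop_rows_by_runtime_sketch(df, min_runtime=40, max_runtime=300):
    """Keep rows with min_runtime <= runtime <= max_runtime and report how many were dropped."""
    keep = df["runtime"].between(min_runtime, max_runtime)  # inclusive on both ends
    dropped = int((~keep).sum())
    print(f"Number of rows dropped (runtime < {min_runtime} or runtime > {max_runtime}): {dropped}")
    return df[keep].reset_index(drop=True)

# Illustrative data only; these are not rows from the dataset.
sample = pd.DataFrame({
    "title": ["short", "feature", "serial"],
    "runtime": [12, 95, 420],
})
filtered = drop_rows_by_runtime_sketch(sample)
```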


DataFrame overview

In [42]:
num_rows, num_cols = films.shape
print(f'Number of Rows: {num_rows}')
print(f'Number of Columns: {num_cols}')

# languages
languages = films['language'].dropna().str.split(',').explode().str.strip()
unique_languages = languages.nunique()
print(f'Number of Languages: {unique_languages}')

# countries
countries = films['countries'].dropna().str.split(',').explode().str.strip()
unique_countries = countries.nunique()
print(f'Number of Countries: {unique_countries}')

# genres
genres = films['genres'].dropna().str.split(',').explode().str.strip()
unique_genres = genres.nunique()
print(f'Number of Genres: {unique_genres}')

# earliest and latest year
earliest_year = films['release_year'].min()
latest_year = films['release_year'].max()
print(f'Earliest Year: {earliest_year}')
print(f'Latest Year: {latest_year}')

# longest and shortest runtime
shortest_runtime = films['runtime'].min()
longest_runtime = films['runtime'].max()
print(f'Shortest Runtime: {shortest_runtime} minutes')
print(f'Longest Runtime: {longest_runtime} minutes')

Number of Rows: 9783
Number of Columns: 23
Number of Languages: 61
Number of Countries: 112
Number of Genres: 19
Earliest Year: 1913
Latest Year: 2024
Shortest Runtime: 56 minutes
Longest Runtime: 254 minutes


#### Categorical Data Analysis

In [43]:
# split and explode values
def split_and_explode(df, column):
    return df[column].dropna().str.split(',').explode().str.strip()

# prepare data
countries = split_and_explode(films, 'countries')
countries_count = countries.value_counts().reset_index()
countries_count.columns = ['Country', 'Count']

languages = split_and_explode(films, 'language')
languages_count = languages.value_counts().reset_index()
languages_count.columns = ['Language', 'Count']

genres = split_and_explode(films, 'genres')
genres_count = genres.value_counts().reset_index()
genres_count.columns = ['Genre', 'Count']

Countries

In [44]:
fig_countries = px.bar(
    countries_count.head(10),
    x='Country',
    y='Count',
    title='Top 10 Countries by Film Presence',
    color='Count',
    color_continuous_scale='Viridis'
)
fig_countries.update_layout(
    xaxis_title='Country',
    yaxis_title='Number of Films',
    xaxis_tickangle=-45
)

fig_countries.show()

Languages

In [45]:
fig_languages = px.bar(
    languages_count.head(10),
    x='Language',
    y='Count',
    title='Top 10 Languages by Film Presence',
    color='Count',
    color_continuous_scale='Viridis'
)
fig_languages.update_layout(
    xaxis_title='Language',
    yaxis_title='Number of Films',
    xaxis_tickangle=-45
)

fig_languages.show()

Genres

In [46]:
fig_genres = px.bar(
    genres_count,
    x='Genre',
    y='Count',
    title='Film Genres Distribution',
    color='Count',
    color_continuous_scale='Plasma'
)
fig_genres.update_layout(
    xaxis_title='Genre',
    yaxis_title='Number of Films',
    xaxis_tickangle=-45
)

fig_genres.show()

Most Popular Genres per Country

In [47]:
# explode countries and genres
countries_exploded = split_and_explode(films, 'countries')
genres_exploded = split_and_explode(films, 'genres')

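# reset both indices so the two series line up by position
# (note: this pairs the i-th exploded country with the i-th exploded genre, not values from the same film)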
countries_repeated = countries_exploded.reset_index(drop=True)
genres_repeated = genres_exploded.reset_index(drop=True)

genres_by_country = pd.DataFrame({'Country': countries_repeated, 'Genre': genres_repeated})

genres_by_country = genres_by_country.dropna()

genres_by_country_count = (
    genres_by_country
    .groupby(['Country', 'Genre'])
    .size()
    .reset_index(name='Count')
)

most_popular_genres_country = genres_by_country_count.loc[
    genres_by_country_count.groupby('Country')['Count'].idxmax()
]

print('Most Popular Genre per Country:')
print(most_popular_genres_country.head(10))  

Most Popular Genre per Country:
        Country            Genre  Count
0   Afghanistan           comedy      1
1       Algeria        adventure      1
8     Argentina            drama      8
13        Aruba            drama      1
20    Australia            drama     37
35      Austria            drama      7
41      Bahamas        animation      1
45   Bangladesh            crime      2
47      Belarus  science fiction      1
53      Belgium            drama     33
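
Because the cell above aligns the exploded countries and genres by position, a country can end up paired with a genre from a different film whenever films list more than one value, so the table above should be read with that caveat. One possible alternative, sketched here on made-up rows, explodes both columns inside the same frame so that every pair comes from a single film. The function name `country_genre_pairs` is hypothetical.

```python
import pandas as pd

def country_genre_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (film, country, genre) combination."""
    pairs = df[["countries", "genres"]].dropna().copy()
    for col in ("countries", "genres"):
        pairs[col] = pairs[col].str.split(",")
        # explode one column at a time; the other column is repeated per new row
        pairs = pairs.explode(col, ignore_index=True)
        pairs[col] = pairs[col].str.strip()
    return pairs.rename(columns={"countries": "Country", "genres": "Genre"})

# Illustrative data only; these are not rows from the dataset.
sample = pd.DataFrame({
    "countries": ["France, Belgium", "Japan"],
    "genres": ["drama, crime", "animation"],
})

pairs = country_genre_pairs(sample)
counts = pairs.groupby(["Country", "Genre"]).size().reset_index(name="Count")
print(counts)
```

The resulting `counts` frame has the same `Country`, `Genre`, `Count` layout as `genres_by_country_count`, so the `idxmax` step could be applied to it unchanged.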


Most Popular Genres per Year

In [48]:
# drop rows where 'genres' is NaN, as they can't be split
films_cleaned = films.dropna(subset=['genres'])

# exploding the genres and year
genres_exploded = split_and_explode(films_cleaned, 'genres')

# repeat 'release_year' once per genre so it lines up with genres_exploded
genres_per_film = films_cleaned['genres'].str.split(',').str.len()
years_exploded = films_cleaned['release_year'].repeat(genres_per_film).reset_index(drop=True)

genres_exploded = genres_exploded.reset_index(drop=True)

genres_by_year = pd.DataFrame({'Year': years_exploded, 'Genre': genres_exploded})

genres_by_year = genres_by_year.dropna()

genres_by_year_count = (
    genres_by_year
    .groupby(['Year', 'Genre'])
    .size()
    .reset_index(name='Count')
)

most_popular_genres_year = genres_by_year_count.loc[
    genres_by_year_count.groupby('Year')['Count'].idxmax()
]


print('\nMost Popular Genre per Year:')
print(most_popular_genres_year.head(10))


Most Popular Genre per Year:
    Year   Genre  Count
0   1913   crime      1
2   1915   drama      2
7   1916   drama      2
10  1918  comedy      1
12  1920   crime      1
16  1921  comedy      1
18  1922   drama      1
20  1923  comedy      1
24  1924   drama      3
31  1925   drama      4


In [49]:
# most popular genre per country
fig_country_vertical = px.bar(
    most_popular_genres_country,
    x='Country',  
    y='Count',
    color='Genre',
    title='Most Popular Genre per Country'
)
fig_country_vertical.show()

# plot without USA
most_popular_genres_country_no_usa = most_popular_genres_country[most_popular_genres_country['Country'] != 'USA']

fig_country_no_usa_vertical = px.bar(
    most_popular_genres_country_no_usa,
    x='Country',  
    y='Count',
    color='Genre',
    title='Most Popular Genre per Country (Excluding USA)'
)
fig_country_no_usa_vertical.show()


In [50]:
fig_year = px.bar(
    most_popular_genres_year,
    x='Year',
    y='Count',
    color='Genre',
    title='Most Popular Genre per Year'
)
fig_year.show()


In [51]:
most_popular_genres_country_grouped = most_popular_genres_country.groupby(['Country', 'Genre'], as_index=False).agg({'Count': 'sum'})

most_popular_genre_per_country = most_popular_genres_country_grouped.sort_values('Count', ascending=False).drop_duplicates('Country')

fig_map = px.choropleth(
    most_popular_genre_per_country, 
    locations='Country', 
    locationmode='country names',
    color='Genre',
    hover_name='Country',
    hover_data={'Genre': True, 'Count': True},
    color_discrete_sequence=px.colors.qualitative.Set3,
    title='Most Popular Genre per Country'
)

fig_map.show()

### Events

Top events per genre

In [52]:
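# NOTE: 'genres' holds comma-separated strings rather than lists, so explode() leaves the rows unchanged
films_split_genres = films.explode('genres')

events_per_genre = {}

for genre, group in films_split_genres.groupby('genres'):
    # skip multi-genre combinations; only films tagged with a single genre are counted
    if len(genre.split(',')) > 1:
        continue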
    
    genre_events = Counter()
    
    group['events'].apply(lambda x: genre_events.update(x.split(',')) if isinstance(x, str) else None)
    
    events_per_genre[genre] = genre_events

top_events_per_genre = {}
for genre, events in events_per_genre.items():
    top_events = events.most_common(5)  
    top_events_per_genre[genre] = [event for event, count in top_events]

top_events_df = pd.DataFrame(
    [(genre, ', '.join(events)) for genre, events in top_events_per_genre.items() if events],  
    columns=['Genre', 'Top Events']
)

display(top_events_df)

   Genre      Top Events
0  action     blood or gore, sexual content, dogs dying, ...
1  animation  kids dying, jump scares, clowns, spitting, ...
2  comedy     sexual content, cheating, alcohol abuse, h...
3  crime      car crashes, animals (besides dog/cat/horse) ...
4  drama      sexual content, blood or gore, drug use, c...
5  family     vomiting
6  fantasy    hospital scenes, people being burned alive, f...
7  horror     blood or gore, choking, someone dies, unco...
8  mystery    jump scares, shaving or cutting, people dyin...
9  romance    sexual content, domestic violence, alcohol a...
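
The cell above only counts films tagged with exactly one genre, since the `genres` strings are never split. A sketch of a variant that credits each event to every genre a film lists is given below, using the same `Counter` idea; the function name `top_events_by_genre` is hypothetical and the rows are made up for illustration.

```python
from collections import Counter, defaultdict

def top_events_by_genre(rows, n=5):
    """rows: iterable of (genres, events) pairs, each a comma-separated string or a missing value."""
    counters = defaultdict(Counter)
    for genres, events in rows:
        # missing values (None or NaN) are not strings, so they are skipped here
        if not isinstance(genres, str) or not isinstance(events, str):
            continue
        event_list = [e.strip() for e in events.split(",") if e.strip()]
        for genre in (g.strip() for g in genres.split(",")):
            if genre:
                counters[genre].update(event_list)
    return {genre: [event for event, _ in c.most_common(n)] for genre, c in counters.items()}

# Illustrative data only; these are not rows from the dataset.
sample_rows = [
    ("horror, mystery", "jump scares, blood or gore"),
    ("horror", "blood or gore"),
    ("comedy", None),
]

result = top_events_by_genre(sample_rows)
print(result)
```

With the notebook's frame, the rows could be supplied as `zip(films['genres'], films['events'])`.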


In [53]:
all_events = []
films['events'].apply(lambda x: all_events.extend(x.split(',')) if isinstance(x, str) else None)

event_counts = Counter(all_events)

# most common events
top_events = event_counts.most_common(10)
print('\nMost Recurring Events:')
for event, count in top_events:
    print(f'{event}: {count}')


Most Recurring Events:
 blood or gore: 2811
 gun violence: 2332
 sexual content: 2301
 restraints: 2003
 choking: 1997
 someone dies: 1861
 kidnapping: 1623
 dead animals: 1582
 unconscious: 1554
 hospital scenes: 1547
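
The leading space in each label above suggests that the events were split on `','` without stripping whitespace, in which case `' blood or gore'` and `'blood or gore'` would be tallied as separate events. A minimal sketch of a stripped count, following the same split, explode, and strip pattern the notebook uses for genres, is shown below; `count_events` is a hypothetical name and the sample values are made up.

```python
import pandas as pd

def count_events(events: pd.Series) -> pd.Series:
    """Count individual events in a comma-separated column, ignoring surrounding whitespace."""
    labels = events.dropna().str.split(",").explode().str.strip()
    return labels[labels != ""].value_counts()

# Illustrative data only; these are not rows from the dataset.
sample = pd.Series([
    "blood or gore, gun violence",
    "gun violence, blood or gore",
    None,
])

event_counts = count_events(sample)
print(event_counts)
```

Without the `str.strip()` step, this sample would yield four labels with a count of one each instead of two labels with a count of two.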


Top Events per Genre

In [54]:
top_events_df = pd.DataFrame(
    [(genre, ', '.join(events)) for genre, events in top_events_per_genre.items() if events], 
    columns=['Genre', 'Top Events']
)

fig_events = px.bar(
    top_events_df,
    x='Genre',
    y='Top Events',
    title='Top Events per Genre',
    labels={'Top Events': 'Top Events', 'Genre': 'Genre'},
    color='Genre',
    text='Top Events',
)

fig_events.update_layout(
    xaxis_title='Genre',
    yaxis_title='Events',
    xaxis={'categoryorder': 'total ascending'}
)

fig_events.show()


Most Recurring Events (Across All Genres)

In [55]:
top_events_all_df = pd.DataFrame(top_events, columns=['Event', 'Count'])

fig_top_events = px.bar(
    top_events_all_df,
    x='Event',
    y='Count',
    title='Most Recurring Events Across All Genres',
    labels={'Event': 'Event', 'Count': 'Count'},
    color='Count',
    text='Count',
)

fig_top_events.update_layout(
    xaxis_title='Event',
    yaxis_title='Count',
    xaxis={'categoryorder': 'total descending'},
    showlegend=False
)

fig_top_events.show()
