# Letterboxd Clean Films Dataset Summary

## General Information
- **Number of Rows**: 18,184  
- **Number of Columns**: 16  

## Key Insights
- **Languages**: 137 unique languages  
- **Countries**: 126 unique countries  
- **Genres**: 19 unique genres  

## Runtime Statistics
- **Shortest Film Runtime**: 40 minutes  
- **Longest Film Runtime**: 300 minutes  

## Release Year Range
- **Earliest Year**: 1911  
- **Latest Year**: 2024  

## Columns in the Dataset
1. `letterboxd_id` - Unique ID for the film  
2. `title` - Title of the film  
3. `release_year` - Year the film was released  
4. `tagline` - Film tagline  
5. `summary` - Brief summary or description  
6. `runtime` - Duration of the film (in minutes)  
7. `letterboxd_rating` - Average Letterboxd rating  
8. `genres` - Film genres (comma-separated)  
9. `language` - Languages spoken in the film (comma-separated)  
10. `countries` - Countries of origin (comma-separated)  
11. `themes` - Themes or topics covered in the film  
12. `director` - Film director(s)  
13. `topics` - Additional topics (if any)  
14. `doesthedog_id` - Reference ID for content warnings  
15. `events` - Specific events or warnings associated with the film  
16. `has_warnings` - Boolean flag indicating content warnings  

---

## Notes
- **Dataset Cleaning**:  
  - Films with runtimes under 40 minutes or over 300 minutes were excluded (305 rows); the upper bound removes likely outliers such as TV series or incorrect entries.  
- **Data Coverage**:  
  - The dataset spans over a century of cinema, from **1911 to 2024**.  
  - Includes a diverse range of films across **137 languages**, **126 countries**, and **19 genres**.


In [28]:
import pandas as pd
import plotly.express as px
from collections import Counter

In [29]:
import sys
sys.path.append('../utils')
sys.path.append('../scripts')
import helpers
import data_inspection
import content_tagging

In [30]:
films = pd.read_csv('../data/clean/letterboxd_clean_films.csv')

In [31]:
data_inspection.show_basic_info(films)


DataFrame Shape: (18489, 16)
Number of Rows: 18489
Number of Columns: 16

Data Types of Columns:
letterboxd_id          int64
title                 object
release_year           int64
tagline               object
summary               object
runtime                int64
letterboxd_rating    float64
genres                object
language              object
countries             object
themes                object
director              object
topics                object
doesthedog_id        float64
events                object
dtype: object

Missing Values per Column:
letterboxd_id            0
title                    0
release_year             0
tagline                  0
summary                  0
runtime                  0
letterboxd_rating        0
genres                  15
language                87
countries               77
themes                   0
director                11
topics               14242
doesthedog_id         1624
events               14242
dtype: int64


In [32]:
print(films.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18489 entries, 0 to 18488
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   letterboxd_id      18489 non-null  int64  
 1   title              18489 non-null  object 
 2   release_year       18489 non-null  int64  
 3   tagline            18489 non-null  object 
 4   summary            18489 non-null  object 
 5   runtime            18489 non-null  int64  
 6   letterboxd_rating  18489 non-null  float64
 7   genres             18474 non-null  object 
 8   language           18402 non-null  object 
 9   countries          18412 non-null  object 
 10  themes             18489 non-null  object 
 11  director           18478 non-null  object 
 12  topics             4247 non-null   object 
 13  doesthedog_id      16865 non-null  float64
 14  events             4247 non-null   object 
dtypes: bool(1), float64(2), int64(3), object(10)
memory usage: 2.1+ MB
None

In [33]:
films = helpers.drop_rows_by_runtime(films, min_runtime=40, max_runtime=300)

Number of rows dropped (runtime < 40 or runtime > 300): 305
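
The body of `helpers.drop_rows_by_runtime` is not shown in this notebook. The sketch below is one way such a filter could look, assuming inclusive bounds (consistent with the printed message and with the 40 and 300 minute extremes reported later). The index reset and the toy data are assumptions for illustration, not the project's actual helper.

```python
import pandas as pd

def drop_rows_by_runtime(df, min_runtime=40, max_runtime=300):
    """Keep rows whose runtime lies in [min_runtime, max_runtime], bounds included."""
    # Series.between is inclusive on both ends by default
    keep = df['runtime'].between(min_runtime, max_runtime)
    dropped = int((~keep).sum())
    print(f'Number of rows dropped (runtime < {min_runtime} or runtime > {max_runtime}): {dropped}')
    # resetting the index is an assumption; the real helper may keep the original index
    return df[keep].reset_index(drop=True)

# toy data, not taken from the dataset
toy = pd.DataFrame({'title': ['a', 'b', 'c', 'd'], 'runtime': [12, 40, 300, 480]})
filtered = drop_rows_by_runtime(toy)
```

On the toy frame, the rows with runtimes 12 and 480 are removed and the two boundary values are kept.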


Dataframe overview

In [34]:
num_rows, num_cols = films.shape
print(f'Number of Rows: {num_rows}')
print(f'Number of Columns: {num_cols}')

# languages
languages = films['language'].dropna().str.split(',').explode().str.strip()
unique_languages = languages.nunique()
print(f'Number of Languages: {unique_languages}')

# countries
countries = films['countries'].dropna().str.split(',').explode().str.strip()
unique_countries = countries.nunique()
print(f'Number of Countries: {unique_countries}')

# genres
genres = films['genres'].dropna().str.split(',').explode().str.strip()
unique_genres = genres.nunique()
print(f'Number of Genres: {unique_genres}')

# earliest and latest year
earliest_year = films['release_year'].min()
latest_year = films['release_year'].max()
print(f'Earliest Year: {earliest_year}')
print(f'Latest Year: {latest_year}')

# longest and shortest runtime
shortest_runtime = films['runtime'].min()
longest_runtime = films['runtime'].max()
print(f'Shortest Runtime: {shortest_runtime} minutes')
print(f'Longest Runtime: {longest_runtime} minutes')

Number of Rows: 18184
Number of Columns: 16


Number of Languages: 137
Number of Countries: 126
Number of Genres: 19
Earliest Year: 1911
Latest Year: 2024
Shortest Runtime: 40 minutes
Longest Runtime: 300 minutes


#### Categorical Data Analysis

In [35]:
# split and explode values
def split_and_explode(df, column):
    return df[column].dropna().str.split(',').explode().str.strip()

# prepare data
countries = split_and_explode(films, 'countries')
countries_count = countries.value_counts().reset_index()
countries_count.columns = ['Country', 'Count']

languages = split_and_explode(films, 'language')
languages_count = languages.value_counts().reset_index()
languages_count.columns = ['Language', 'Count']

genres = split_and_explode(films, 'genres')
genres_count = genres.value_counts().reset_index()
genres_count.columns = ['Genre', 'Count']

Countries

In [36]:
fig_countries = px.bar(
    countries_count.head(10),
    x='Country',
    y='Count',
    title='Top 10 Countries by Film Presence',
    color='Count',
    color_continuous_scale='Viridis'
)
fig_countries.update_layout(
    xaxis_title='Country',
    yaxis_title='Number of Films',
    xaxis_tickangle=-45
)

fig_countries.show()

Languages

In [37]:
fig_languages = px.bar(
    languages_count.head(10),
    x='Language',
    y='Count',
    title='Top 10 Languages by Film Presence',
    color='Count',
    color_continuous_scale='Viridis'
)
fig_languages.update_layout(
    xaxis_title='Language',
    yaxis_title='Number of Films',
    xaxis_tickangle=-45
)

fig_languages.show()

Genres

In [38]:
fig_genres = px.bar(
    genres_count,
    x='Genre',
    y='Count',
    title='Film Genres Distribution',
    color='Count',
    color_continuous_scale='Plasma'
)
fig_genres.update_layout(
    xaxis_title='Genre',
    yaxis_title='Number of Films',
    xaxis_tickangle=-45
)

fig_genres.show()

Most Popular Genres per Country

In [39]:
# explode countries and genres
countries_exploded = split_and_explode(films, 'countries')
genres_exploded = split_and_explode(films, 'genres')

# reset both indices so the two series line up by position
# (the exploded series can differ in length; pairs are positional, not per film)
countries_repeated = countries_exploded.reset_index(drop=True)
genres_repeated = genres_exploded.reset_index(drop=True)

genres_by_country = pd.DataFrame({'Country': countries_repeated, 'Genre': genres_repeated})

genres_by_country = genres_by_country.dropna()

genres_by_country_count = (
    genres_by_country
    .groupby(['Country', 'Genre'])
    .size()
    .reset_index(name='Count')
)

most_popular_genres_country = genres_by_country_count.loc[
    genres_by_country_count.groupby('Country')['Count'].idxmax()
]

print('Most Popular Genre per Country:')
print(most_popular_genres_country.head(10))  

Most Popular Genre per Country:
        Country      Genre  Count
0   Afghanistan     Comedy      1
3       Albania    Romance      2
4       Algeria     Family      2
6    Antarctica   Thriller      1
10    Argentina      Crime      6
21        Aruba  Adventure      3
30    Australia      Drama     73
54      Austria   Thriller      9
57      Bahamas      Crime      1
62     Barbados    Fantasy      1
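
Because `split_and_explode` is applied to `countries` and `genres` separately, the two series can have different lengths, and `reset_index(drop=True)` pairs them by position rather than by film. A minimal sketch of an alternative that keeps each film's countries and genres together is shown below. The helper name `explode_column` and the toy data are hypothetical, and running it on the real data would likely change the table above.

```python
import pandas as pd

def explode_column(df, column):
    """Split a comma-separated string column so each value gets its own row."""
    out = df.dropna(subset=[column]).copy()
    out[column] = out[column].str.split(',')
    out = out.explode(column).reset_index(drop=True)
    out[column] = out[column].str.strip()
    return out

# toy data, not taken from the dataset
toy = pd.DataFrame({
    'countries': ['USA, UK', 'France', None],
    'genres': ['Drama, Crime', 'Comedy', 'Horror'],
})

# explode one column, then the other: every (country, genre) pair comes from the same film
pairs = explode_column(explode_column(toy, 'countries'), 'genres')

pair_counts = (
    pairs.groupby(['countries', 'genres'])
    .size()
    .reset_index(name='Count')
)

# one row per country: the genre with the highest count (ties resolved by first occurrence)
top_genre_per_country = pair_counts.loc[
    pair_counts.groupby('countries')['Count'].idxmax()
]
```

Exploding the DataFrame instead of a single Series carries the other columns along, so no manual realignment is needed.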


Most Popular Genres per Year

In [40]:
# drop rows where 'genres' is NaN, as they can't be split
films_cleaned = films.dropna(subset=['genres'])

# exploding the genres and year
genres_exploded = split_and_explode(films_cleaned, 'genres')

# 'release_year' for each genre
years_exploded = films_cleaned['release_year'].repeat(films_cleaned['genres'].str.split(',').apply(len)).reset_index(drop=True)

years_exploded = years_exploded.reset_index(drop=True)
genres_exploded = genres_exploded.reset_index(drop=True)

genres_by_year = pd.DataFrame({'Year': years_exploded, 'Genre': genres_exploded})

genres_by_year = genres_by_year.dropna()

genres_by_year_count = (
    genres_by_year
    .groupby(['Year', 'Genre'])
    .size()
    .reset_index(name='Count')
)

most_popular_genres_year = genres_by_year_count.loc[
    genres_by_year_count.groupby('Year')['Count'].idxmax()
]


print('\nMost Popular Genre per Year:')
print(most_popular_genres_year.head(10))


Most Popular Genre per Year:
    Year    Genre  Count
0   1911  Fantasy      1
2   1914   Comedy      1
7   1916    Drama      2
11  1919    Drama      3
16  1920    Drama      4
25  1921    Drama      7
33  1922    Drama      2
36  1923   Comedy      4
42  1924   Comedy      2
52  1925    Drama      8
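
The manual `repeat` of `release_year` works because the repeat counts match the number of genres per film. An equivalent and shorter route, sketched here on toy data, is to explode the DataFrame itself so that the year stays attached to each genre. The column names `Year`, `Genre` and `Decade` follow the notebook; the toy frame is an assumption.

```python
import pandas as pd

# toy data, not taken from the dataset
toy = pd.DataFrame({
    'release_year': [1994, 1994, 1995],
    'genres': ['Drama, Crime', 'Drama', None],
})

genres_by_year = (
    toy.dropna(subset=['genres'])
    .assign(Genre=lambda d: d['genres'].str.split(','))
    .explode('Genre')                      # one row per (film, genre); release_year is repeated
    .reset_index(drop=True)
    .assign(
        Genre=lambda d: d['Genre'].str.strip(),
        Decade=lambda d: d['release_year'] // 10 * 10,
    )
    .rename(columns={'release_year': 'Year'})
    .loc[:, ['Year', 'Decade', 'Genre']]
)

# count genres per year, as in the cell above
genres_by_year_count = (
    genres_by_year.groupby(['Year', 'Genre']).size().reset_index(name='Count')
)
```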


In [41]:
import pandas as pd

# Drop rows where 'genres' is NaN, as they can't be split
films_cleaned = films.dropna(subset=['genres'])

# Exploding the genres
genres_exploded = split_and_explode(films_cleaned, 'genres')

# 'release_year' for each genre
years_exploded = films_cleaned['release_year'].repeat(
    films_cleaned['genres'].str.split(',').apply(len)
).reset_index(drop=True)

# Reset index for alignment
years_exploded = years_exploded.reset_index(drop=True)
genres_exploded = genres_exploded.reset_index(drop=True)

# Create a DataFrame with 'Year' and 'Genre'
genres_by_year = pd.DataFrame({'Year': years_exploded, 'Genre': genres_exploded})

# Drop rows with NaN values
genres_by_year = genres_by_year.dropna()

# Add a 'Decade' column (e.g., 1990 -> 1990s)
genres_by_year['Decade'] = (genres_by_year['Year'] // 10) * 10

# Group by 'Decade' and 'Genre', and count occurrences
genres_by_decade_count = (
    genres_by_year
    .groupby(['Decade', 'Genre'])
    .size()
    .reset_index(name='Count')
)

# Sort the genres within each decade by Count in descending order
genres_by_decade_count = genres_by_decade_count.sort_values(
    ['Decade', 'Count'], ascending=[True, False]
)

# Select the top 5 genres for each decade
top_5_genres_per_decade = (
    genres_by_decade_count
    .groupby('Decade')
    .head(5)  # Top 5 rows per Decade
    .reset_index(drop=True)
)

# Print the results
print('\nTop 5 Genres per Decade:')
print(top_5_genres_per_decade)



Top 5 Genres per Decade:
    Decade            Genre  Count
0     1910            Drama      6
1     1910        Adventure      2
2     1910          History      2
3     1910          Romance      2
4     1910           Action      1
5     1920            Drama     60
6     1920          Romance     40
7     1920           Comedy     32
8     1920        Adventure     13
9     1920           Horror      9
10    1930            Drama    266
11    1930          Romance    207
12    1930           Comedy    163
13    1930            Crime     92
14    1930            Music     55
15    1940            Drama    374
16    1940          Romance    242
17    1940         Thriller    182
18    1940           Comedy    169
19    1940            Crime    151
20    1950            Drama    442
21    1950          Romance    229
22    1950            Crime    172
23    1950         Thriller    164
24    1950  Science Fiction    146
25    1960            Drama    433
26    1960           Comedy  

In [42]:
# most popular genre per country
fig_country_vertical = px.bar(
    most_popular_genres_country,
    x='Country',  
    y='Count',
    color='Genre',
    title='Most Popular Genre per Country'
)
fig_country_vertical.show()

# plot without USA
most_popular_genres_country_no_usa = most_popular_genres_country[most_popular_genres_country['Country'] != 'USA']

fig_country_no_usa_vertical = px.bar(
    most_popular_genres_country_no_usa,
    x='Country',  
    y='Count',
    color='Genre',
    title='Most Popular Genre per Country (Excluding USA)'
)
fig_country_no_usa_vertical.show()


In [43]:
fig_year = px.bar(
    most_popular_genres_year,
    x='Year',
    y='Count',
    color='Genre',
    title='Most Popular Genre per Year'
)
fig_year.show()

# Work on a copy so the new column is not written onto a slice of genres_by_year_count
yearly_top = most_popular_genres_year.copy()
yearly_top['Decade'] = (yearly_top['Year'] // 10) * 10

# Sum the yearly winners' counts per decade and genre (only 'Count', so 'Year' is not summed)
most_popular_genres_decade = yearly_top.groupby(['Decade', 'Genre'], as_index=False)['Count'].sum()

# One bar per decade, one colored segment per genre
fig_decade = px.bar(
    most_popular_genres_decade,
    x='Decade',
    y='Count',
    color='Genre',
    title='Most Popular Genre per Decade',
    color_discrete_sequence=px.colors.qualitative.Pastel  # pastel color scheme
)

# Light gray background (#F7F7F7) with dark gray text (#363535)
fig_decade.update_layout(
    plot_bgcolor='#F7F7F7',
    paper_bgcolor='#F7F7F7',
    title_font=dict(size=24, color='#363535'),
    font=dict(color='#363535'),
    legend=dict(
        title=dict(font=dict(color='#363535')),
        font=dict(size=12, color='#363535')
    ),
    width=700,
    height=500
)

# Static export; write_image needs an image-export engine such as kaleido installed
fig_decade.write_image("most_popular_genres_decade.svg", format="svg")

fig_decade.show()


In [44]:
most_popular_genres_country_grouped = most_popular_genres_country.groupby(['Country', 'Genre'], as_index=False).agg({'Count': 'sum'})

most_popular_genre_per_country = most_popular_genres_country_grouped.sort_values('Count', ascending=False).drop_duplicates('Country')

fig_map = px.choropleth(
    most_popular_genre_per_country, 
    locations='Country', 
    locationmode='country names',
    color='Genre',
    hover_name='Country',
    hover_data={'Genre': True, 'Count': True},
    color_discrete_sequence=px.colors.qualitative.Set3,
    title='Most Popular Genre per Country'
)

fig_map.show()

### Themes and Events

In [45]:
# number of themes
films['theme_count'] = films['themes'].apply(lambda x: len(eval(x)) if isinstance(x, str) else 0)  # eval is used to convert string lists to actual lists

# number of events (split by commas and count the resulting list length)
films['event_count'] = films['events'].apply(lambda x: len(x.split(',')) if isinstance(x, str) else 0)

# themes and events across all rows
total_themes = films['theme_count'].sum()
total_events = films['event_count'].sum()

print(f'Total themes: {total_themes}')
print(f'Total events: {total_events}')

Total themes: 97843
Total events: 60456
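
The `themes` column stores Python list literals as strings, which is why the cell above calls `eval`. `ast.literal_eval` from the standard library parses such literals without executing arbitrary code. The sketch below uses it on toy data; the helper name `parse_list_literal` is hypothetical.

```python
import ast
import pandas as pd

def parse_list_literal(value):
    """Parse a string such as "['a', 'b']" into a list; return [] for non-strings (e.g. NaN)."""
    if not isinstance(value, str):
        return []
    # literal_eval only accepts Python literals (lists, strings, numbers, ...)
    return list(ast.literal_eval(value))

# toy data, not taken from the dataset
toy = pd.DataFrame({
    'themes': ["['Epic heroes', 'Crude humor and satire']", "[]", None]
})
toy['theme_list'] = toy['themes'].apply(parse_list_literal)
toy['theme_count'] = toy['theme_list'].apply(len)
```

Parsing once into a `theme_list` column also avoids re-evaluating the same strings in every later cell.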


In [46]:
all_themes = set()
films['themes'].apply(lambda x: all_themes.update(eval(x)) if isinstance(x, str) else None)

all_events = set()
films['events'].apply(lambda x: all_events.update(x.split(',')) if isinstance(x, str) else None)

# unique themes and events
unique_themes_count = len(all_themes)
unique_events_count = len(all_events)

print(f'Unique themes count: {unique_themes_count}')
print(f'Unique events count: {unique_events_count}')

Unique themes count: 109
Unique events count: 301


Top events per genre

In [47]:
films_split_genres = films.explode('genres')

events_per_genre = {}

for genre, group in films_split_genres.groupby('genres'):
    if len(genre.split(',')) > 1:
        continue
    
    genre_events = Counter()
    
    group['events'].apply(lambda x: genre_events.update(x.split(',')) if isinstance(x, str) else None)
    
    events_per_genre[genre] = genre_events

top_events_per_genre = {}
for genre, events in events_per_genre.items():
    top_events = events.most_common(5)  
    top_events_per_genre[genre] = [event for event, count in top_events]

top_events_df = pd.DataFrame(
    [(genre, ', '.join(events)) for genre, events in top_events_per_genre.items() if events],  
    columns=['Genre', 'Top Events']
)

display(top_events_df)

| | Genre | Top Events |
|---|---|---|
| 0 | Action | child abuse, stalking, choking, sexual assa... |
| 1 | Adventure | animals harmed during making, animals (besides... |
| 2 | Animation | dogs dying, kids dying, people being burned ... |
| 3 | Comedy | sexual content, drug use, vomiting, alcoho... |
| 4 | Crime | shaving or cutting, blood or gore, jump scar... |
| 5 | Documentary | sad endings, dead animals, child abuse, se... |
| 6 | Drama | sexual content, blood or gore, child abuse,... |
| 7 | Family | people dying by suicide, kids dying, jump sc... |
| 8 | Horror | blood or gore, sexual content, choking, au... |
| 9 | Music | flashing lights or images, shaky cam, sexual... |
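
`films['genres']` holds comma-separated strings, so `films.explode('genres')` leaves the column unchanged, and the `len(genre.split(',')) > 1` check then restricts the tally to films with a single genre. A minimal sketch that splits the strings first, so a multi-genre film contributes to each of its genres, is shown below on toy data. The frame and its values are assumptions, and applying this to the real data would likely change the rankings.

```python
from collections import Counter
import pandas as pd

# toy data, not taken from the dataset
toy = pd.DataFrame({
    'genres': ['Horror, Thriller', 'Horror', 'Comedy'],
    'events': ['blood or gore, choking', 'blood or gore', None],
})

# split the genre strings into lists before exploding
exploded = toy.dropna(subset=['genres']).copy()
exploded['genres'] = exploded['genres'].str.split(',')
exploded = exploded.explode('genres').reset_index(drop=True)
exploded['genres'] = exploded['genres'].str.strip()

# tally events per genre, stripping whitespace around each event label
events_per_genre = {}
for genre, group in exploded.groupby('genres'):
    counter = Counter()
    for events in group['events'].dropna():
        counter.update(e.strip() for e in events.split(','))
    events_per_genre[genre] = counter

top_events_per_genre = {
    genre: [event for event, _ in counter.most_common(5)]
    for genre, counter in events_per_genre.items()
}
```

The same pattern applies to the per-genre theme tally, with the parsed theme list in place of `events.split(',')`.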


Top themes per genre

In [48]:
films_split_genres = films.explode('genres')

themes_per_genre = {}

for genre, group in films_split_genres.groupby('genres'):
    if len(genre.split(',')) > 1:
        continue
    
    genre_themes = Counter()
    
    group['themes'].apply(lambda x: genre_themes.update(eval(x)) if isinstance(x, str) else None)
    
    themes_per_genre[genre] = genre_themes

top_themes_per_genre = {}
for genre, themes in themes_per_genre.items():
    top_themes = themes.most_common(3)  
    top_themes_per_genre[genre] = [theme for theme, count in top_themes]

top_themes_df = pd.DataFrame(
    [(genre, ', '.join(themes)) for genre, themes in top_themes_per_genre.items() if themes],  
    columns=['Genre', 'Top Themes']
)

display(top_themes_df)

| | Genre | Top Themes |
|---|---|---|
| 0 | Action | Explosive and action-packed heroes vs. villain... |
| 1 | Adventure | Epic heroes, Historical battles and epic heroi... |
| 2 | Animation | Epic heroes, Emotional and captivating fantasy... |
| 3 | Comedy | Crude humor and satire, Gags, jokes, and slaps... |
| 4 | Crime | Crime, drugs and gangsters, Gritty crime and r... |
| 5 | Documentary | Politics and human rights, Fascinating, emotio... |
| 6 | Drama | Moving relationship stories, Powerful stories ... |
| 7 | Family | Fairy-tale fantasy and enchanted magic, Epic h... |
| 8 | Fantasy | Fantasy adventure, heroism, and swordplay, Hum... |
| 9 | History | Politics and human rights, Historical battles ... |

In [49]:
all_themes = []
films['themes'].apply(lambda x: all_themes.extend(eval(x)) if isinstance(x, str) else None)

theme_counts = Counter(all_themes)

# most common themes
top_themes = theme_counts.most_common(10)
print('Most Recurring Themes:')
for theme, count in top_themes:
    print(f'{theme}: {count}')

Most Recurring Themes:
Crude humor and satire: 3291
Horror, the undead and monster classics: 3134
Moving relationship stories: 2658
Gory, gruesome, and slasher horror: 2654
Twisted dark psychological thriller: 2610
Gags, jokes, and slapstick humor: 2584
Terrifying, haunted, and supernatural horror: 2312
Funny jokes and crude humor: 1995
Laugh-out-loud relationship entanglements: 1964
Intense violence and sexual transgression: 1825


In [50]:
all_events = []
films['events'].apply(lambda x: all_events.extend(x.split(',')) if isinstance(x, str) else None)

event_counts = Counter(all_events)

# most common events
top_events = event_counts.most_common(10)
print('\nMost Recurring Events:')
for event, count in top_events:
    print(f'{event}: {count}')


Most Recurring Events:
 blood or gore: 1800
 sexual content: 1406
 gun violence: 1195
 choking: 1127
 dead animals: 1010
 sad endings: 888
 restraints: 878
 torture: 870
 stalking: 849
 animals (besides dog/cat/horse) dying: 848
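
The leading space in the labels above comes from splitting on `','` without stripping, so labels that differ only in surrounding whitespace are counted as separate events. A small sketch of a whitespace-insensitive count on toy data follows; the series is an assumption, and the real totals and the unique-event count may shift once labels are normalized.

```python
from collections import Counter
import pandas as pd

# toy data, not taken from the dataset
toy_events = pd.Series(['blood or gore, choking', 'choking, blood or gore', None])

# naive split: ' choking' and 'choking' are different keys
raw_counts = Counter(
    label for events in toy_events.dropna() for label in events.split(',')
)

# split, explode, strip, and drop empty labels before counting
tokens = toy_events.dropna().str.split(',').explode().str.strip()
tokens = tokens[tokens != '']
event_counts = Counter(tokens.tolist())
```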


Top Theme per Genre

In [51]:
top_themes_df = pd.DataFrame(
    [(genre, ', '.join(themes)) for genre, themes in top_themes_per_genre.items() if themes], 
    columns=['Genre', 'Top Themes']
)

fig_themes = px.bar(
    top_themes_df,
    x='Genre',
    y='Top Themes',
    title='Top Themes per Genre',
    labels={'Top Themes': 'Top Themes', 'Genre': 'Genre'},
    color='Genre',
    text='Top Themes',
)

fig_themes.update_layout(
    xaxis_title='Genre',
    yaxis_title='Themes',
    xaxis={'categoryorder': 'total ascending'}  
)

fig_themes.show()


Top Events per Genre

In [52]:
top_events_df = pd.DataFrame(
    [(genre, ', '.join(events)) for genre, events in top_events_per_genre.items() if events], 
    columns=['Genre', 'Top Events']
)

fig_events = px.bar(
    top_events_df,
    x='Genre',
    y='Top Events',
    title='Top Events per Genre',
    labels={'Top Events': 'Top Events', 'Genre': 'Genre'},
    color='Genre',
    text='Top Events',
)

fig_events.update_layout(
    xaxis_title='Genre',
    yaxis_title='Events',
    xaxis={'categoryorder': 'total ascending'}
)

fig_events.show()


Most Recurring Themes (Across All Genres)

In [53]:
top_themes_all_df = pd.DataFrame(top_themes, columns=['Theme', 'Count'])

fig_top_themes = px.bar(
    top_themes_all_df,
    x='Theme',
    y='Count',
    title='Most Recurring Themes Across All Genres',
    labels={'Theme': 'Theme', 'Count': 'Count'},
    color='Count',
    text='Count',
)

fig_top_themes.update_layout(
    xaxis_title='Theme',
    yaxis_title='Count',
    xaxis={'categoryorder': 'total descending'},
    showlegend=False
)

fig_top_themes.show()


Most Recurring Events (Across All Genres)

In [54]:
top_events_all_df = pd.DataFrame(top_events, columns=['Event', 'Count'])

fig_top_events = px.bar(
    top_events_all_df,
    x='Event',
    y='Count',
    title='Most Recurring Events Across All Genres',
    labels={'Event': 'Event', 'Count': 'Count'},
    color='Count',
    text='Count',
)

fig_top_events.update_layout(
    xaxis_title='Event',
    yaxis_title='Count',
    xaxis={'categoryorder': 'total descending'},
    showlegend=False
)

fig_top_events.show()
