# Letterboxd Clean Films Dataset Summary

## General Information
- **Number of Rows**: 18,184 (after runtime filtering)  
- **Number of Columns**: 16  

## Key Insights
- **Languages**: 137 unique languages  
- **Countries**: 126 unique countries  
- **Genres**: 19 unique genres  

## Runtime Statistics
- **Shortest Film Runtime**: 40 minutes  
- **Longest Film Runtime**: 300 minutes  

## Release Year Range
- **Earliest Year**: 1911  
- **Latest Year**: 2024  

## Columns in the Dataset
1. `letterboxd_id` - Unique ID for the film  
2. `title` - Title of the film  
3. `release_year` - Year the film was released  
4. `tagline` - Film tagline  
5. `summary` - Brief summary or description  
6. `runtime` - Duration of the film (in minutes)  
7. `letterboxd_rating` - Average Letterboxd rating  
8. `genres` - Film genres (comma-separated)  
9. `language` - Languages spoken in the film (comma-separated)  
10. `countries` - Countries of origin (comma-separated)  
11. `themes` - Themes or topics covered in the film  
12. `director` - Film director(s)  
13. `topics` - Additional topics (if any)  
14. `doesthedog_id` - Reference ID for content warnings  
15. `events` - Specific events or warnings associated with the film  
16. `has_warnings` - Boolean flag indicating content warnings  

---

## Notes
- **Dataset Cleaning**:  
  - Films with runtimes exceeding 300 minutes (5 hours) were excluded as likely outliers (e.g., TV series or incorrect entries). This reduces the dataset from 18,489 to 18,184 rows.  
- **Data Coverage**:  
  - The dataset spans over a century of cinema, from **1911 to 2024**.  
  - Includes a diverse range of films across **137 languages**, **126 countries**, and **19 genres**.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind
import plotly.express as px
import ast
from collections import Counter

In [2]:
import sys
sys.path.append('../utils')
sys.path.append('../scripts')
import data_cleaning
import data_inspection
import content_tagging

In [3]:
films = pd.read_csv('../data/clean/letterboxd_clean_films.csv')

In [4]:
data_inspection.show_basic_info(films)


DataFrame Shape: (18489, 16)
Number of Rows: 18489
Number of Columns: 16

Data Types of Columns:
letterboxd_id          int64
title                 object
release_year           int64
tagline               object
summary               object
runtime                int64
letterboxd_rating    float64
genres                object
language              object
countries             object
themes                object
director              object
topics                object
doesthedog_id        float64
events                object
dtype: object

Missing Values per Column:
letterboxd_id            0
title                    0
release_year             0
tagline                  0
summary                  0
runtime                  0
letterboxd_rating        0
genres                  15
language                87
countries               77
themes                   0
director                11
topics               14242
doesthedog_id         1624
events               14242
dtype: int64
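
The raw counts above are easier to compare as shares of the total row count. A minimal sketch, assuming a hypothetical helper named `missing_summary` (it is not part of the project's `data_inspection` module) and a small toy frame standing in for `films`:

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (counts / len(df) * 100).round(1),
    })
    return summary.sort_values('missing', ascending=False)

# Toy stand-in for `films`; the values are illustrative only
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'topics': [None, 'x', None, None],
    'director': ['d1', None, 'd3', 'd4'],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, `topics` and `events` are each missing in 14,242 of 18,489 rows, which is roughly 77%.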


In [5]:
# info() prints its report directly and returns None, so no print() is needed
films.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18489 entries, 0 to 18488
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   letterboxd_id      18489 non-null  int64  
 1   title              18489 non-null  object 
 2   release_year       18489 non-null  int64  
 3   tagline            18489 non-null  object 
 4   summary            18489 non-null  object 
 5   runtime            18489 non-null  int64  
 6   letterboxd_rating  18489 non-null  float64
 7   genres             18474 non-null  object 
 8   language           18402 non-null  object 
 9   countries          18412 non-null  object 
 10  themes             18489 non-null  object 
 11  director           18478 non-null  object 
 12  topics             4247 non-null   object 
 13  doesthedog_id      16865 non-null  float64
 14  events             4247 non-null   object 
dtypes: bool(1), float64(2), int64(3), object(10)
memory usage: 2.1+ MB

In [6]:
# Filter out films longer than 5 hours (300 minutes)
max_film_runtime = 300

# .copy() keeps the filtered frame independent of the original
films = films[films['runtime'] <= max_film_runtime].copy()

new_longest_runtime = films['runtime'].max()
print(f'Longest Film Runtime After Cleaning: {new_longest_runtime} minutes')

Longest Film Runtime After Cleaning: 300 minutes
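
It can help to report how many rows a filter removes, not only the new maximum. The sketch below wraps the same comparison in a hypothetical function, `filter_by_runtime`, and runs it on toy data rather than on `films`:

```python
import pandas as pd

def filter_by_runtime(df: pd.DataFrame, max_minutes: int = 300):
    """Keep rows with runtime <= max_minutes; also return how many were dropped."""
    kept = df[df['runtime'] <= max_minutes].copy()
    dropped = len(df) - len(kept)
    return kept, dropped

# Toy stand-in for `films`; the values are illustrative only
toy = pd.DataFrame({
    'title': ['short', 'feature', 'series'],
    'runtime': [40, 300, 540],
})
kept, dropped = filter_by_runtime(toy)
print(f'Dropped {dropped} of {len(toy)} rows')
```

In this notebook the row count goes from 18,489 to 18,184, so the filter removes 305 rows.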


#### Dataframe Overview

In [7]:
num_rows, num_cols = films.shape
print(f'Number of Rows: {num_rows}')
print(f'Number of Columns: {num_cols}')

# languages
languages = films['language'].dropna().str.split(',').explode().str.strip()
unique_languages = languages.nunique()
print(f'Number of Languages: {unique_languages}')

# countries
countries = films['countries'].dropna().str.split(',').explode().str.strip()
unique_countries = countries.nunique()
print(f'Number of Countries: {unique_countries}')

# genres
genres = films['genres'].dropna().str.split(',').explode().str.strip()
unique_genres = genres.nunique()
print(f'Number of Genres: {unique_genres}')

# earliest and latest year
earliest_year = films['release_year'].min()
latest_year = films['release_year'].max()
print(f'Earliest Year: {earliest_year}')
print(f'Latest Year: {latest_year}')

# longest and shortest runtime
shortest_runtime = films['runtime'].min()
longest_runtime = films['runtime'].max()
print(f'Shortest Runtime: {shortest_runtime} minutes')
print(f'Longest Runtime: {longest_runtime} minutes')

Number of Rows: 18184
Number of Columns: 16
Number of Languages: 137
Number of Countries: 126
Number of Genres: 19
Earliest Year: 1911
Latest Year: 2024
Shortest Runtime: 40 minutes
Longest Runtime: 300 minutes


#### Categorical Data Analysis

In [8]:
# Helper function to split and explode values for analysis
def split_and_explode(df, column):
    return df[column].dropna().str.split(',').explode().str.strip()

# Prepare data for visualization
# 1. Countries
countries = split_and_explode(films, 'countries')
countries_count = countries.value_counts().reset_index()
countries_count.columns = ['Country', 'Count']

# 2. Languages
languages = split_and_explode(films, 'language')
languages_count = languages.value_counts().reset_index()
languages_count.columns = ['Language', 'Count']

# 3. Genres
genres = split_and_explode(films, 'genres')
genres_count = genres.value_counts().reset_index()
genres_count.columns = ['Genre', 'Count']
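
Assigning `.columns` positionally works, but it depends on the column order that `value_counts().reset_index()` happens to produce. A possible alternative, shown here as a sketch with a hypothetical helper `count_values` and toy data, names both columns explicitly:

```python
import pandas as pd

def count_values(series: pd.Series, label: str) -> pd.DataFrame:
    """Count occurrences and return a two-column frame: [label, 'Count']."""
    return (
        series.value_counts()
        .rename_axis(label)          # name the index that holds the categories
        .reset_index(name='Count')   # name the column that holds the counts
    )

# Toy stand-in for the exploded `countries` Series
toy_countries = pd.Series(['USA', 'France', 'USA', 'Japan', 'USA', 'France'])
countries_count = count_values(toy_countries, 'Country')
print(countries_count)
```

The result is sorted by count in descending order, so `.head(10)` still selects the ten most frequent values.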

In [9]:
# Plot 1: Countries
fig_countries = px.bar(
    countries_count.head(10),  # Top 10 countries
    x='Country',
    y='Count',
    title='Top 10 Countries by Film Presence',
    color='Count',
    color_continuous_scale='Viridis'
)
fig_countries.update_layout(
    xaxis_title='Country',
    yaxis_title='Number of Films',
    xaxis_tickangle=-45
)

fig_countries.show()

In [10]:
# Plot 2: Languages
fig_languages = px.bar(
    languages_count.head(10),  # Top 10 languages
    x='Language',
    y='Count',
    title='Top 10 Languages by Film Presence',
    color='Count',
    color_continuous_scale='Cividis'
)
fig_languages.update_layout(
    xaxis_title='Language',
    yaxis_title='Number of Films',
    xaxis_tickangle=-45
)

fig_languages.show()



In [11]:
# Plot 3: Genres
fig_genres = px.bar(
    genres_count,  # All genres
    x='Genre',
    y='Count',
    title='Film Genres Distribution',
    color='Count',
    color_continuous_scale='Plasma'
)
fig_genres.update_layout(
    xaxis_title='Genre',
    yaxis_title='Number of Films',
    xaxis_tickangle=-45
)

fig_genres.show()

In [16]:
# 1. Most Popular Genres per Country
# Prepare data: Explode countries and genres
countries_exploded = split_and_explode(films, 'countries')
genres_exploded = split_and_explode(films, 'genres')

# Reset both indices so the two Series can be combined row by row
# (note: this pairs values by position, not by film)
countries_repeated = countries_exploded.reset_index(drop=True)
genres_repeated = genres_exploded.reset_index(drop=True)

# Combine exploded countries and genres
genres_by_country = pd.DataFrame({'Country': countries_repeated, 'Genre': genres_repeated})

# Drop rows with NaN values
genres_by_country = genres_by_country.dropna()

# Group by Country and Genre to count occurrences
genres_by_country_count = (
    genres_by_country
    .groupby(['Country', 'Genre'])
    .size()
    .reset_index(name='Count')
)

# Find the most popular genre per country
most_popular_genres_country = genres_by_country_count.loc[
    genres_by_country_count.groupby('Country')['Count'].idxmax()
]

# Display the result
print("Most Popular Genre per Country:")
print(most_popular_genres_country.head(10))  # Display the first 10 rows (countries in alphabetical order)



Most Popular Genre per Country:
        Country      Genre  Count
0   Afghanistan     Comedy      1
3       Albania    Romance      2
4       Algeria     Family      2
6    Antarctica   Thriller      1
10    Argentina      Crime      6
21        Aruba  Adventure      3
30    Australia      Drama     73
54      Austria   Thriller      9
57      Bahamas      Crime      1
62     Barbados    Fantasy      1
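
The cell above combines the exploded countries and the exploded genres by row position after resetting their indices, so a country on row *i* is paired with whatever genre lands on row *i*, not with the genres of the same film. One way to keep the pairing per film is to explode both columns on the DataFrame itself. The following is a sketch on toy data; `split_list` is a hypothetical helper, and the column names follow the dataset:

```python
import pandas as pd

def split_list(series: pd.Series) -> pd.Series:
    """Turn 'a, b' strings into ['a', 'b'] lists with whitespace stripped."""
    return series.str.split(',').apply(lambda items: [i.strip() for i in items])

# Toy stand-in for `films`; the values are illustrative only
toy = pd.DataFrame({
    'title': ['F1', 'F2', 'F3'],
    'countries': ['USA, France', 'USA', 'Japan'],
    'genres': ['Drama', 'Drama, Comedy', 'Horror'],
})

# Exploding both list columns yields every (country, genre) pair within a film
pairs = (
    toy.dropna(subset=['countries', 'genres'])
    .assign(Country=lambda d: split_list(d['countries']),
            Genre=lambda d: split_list(d['genres']))
    .explode('Country')
    .explode('Genre')
    .loc[:, ['Country', 'Genre']]
)

counts = pairs.groupby(['Country', 'Genre']).size().reset_index(name='Count')
top = counts.loc[counts.groupby('Country')['Count'].idxmax()]
print(top)
```

With this approach a film with two countries and three genres contributes six pairs, so co-productions count toward each participating country.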


In [21]:
# 2. Most Popular Genres per Year
# First, drop rows where 'genres' is NaN, as they can't be split
films_cleaned = films.dropna(subset=['genres'])

# Exploding the genres and year data properly
genres_exploded = split_and_explode(films_cleaned, 'genres')

# Repeat each film's release year once per genre so it lines up with the exploded genres
genres_per_film = films_cleaned['genres'].str.split(',').str.len()
years_exploded = films_cleaned['release_year'].repeat(genres_per_film).reset_index(drop=True)

# Reset the genre index as well so both Series share the same 0..n-1 index
genres_exploded = genres_exploded.reset_index(drop=True)

# Create a DataFrame to hold the genres and years data
genres_by_year = pd.DataFrame({'Year': years_exploded, 'Genre': genres_exploded})

# Drop rows with NaN values (if any)
genres_by_year = genres_by_year.dropna()

# Group by Year and Genre to count occurrences
genres_by_year_count = (
    genres_by_year
    .groupby(['Year', 'Genre'])
    .size()
    .reset_index(name='Count')
)

# Find the most popular genre per year
most_popular_genres_year = genres_by_year_count.loc[
    genres_by_year_count.groupby('Year')['Count'].idxmax()
]

# Display the result
print("\nMost Popular Genre per Year:")
print(most_popular_genres_year.head(10))  # Display the first 10 years



Most Popular Genre per Year:
    Year    Genre  Count
0   1911  Fantasy      1
2   1914   Comedy      1
7   1916    Drama      2
11  1919    Drama      3
16  1920    Drama      4
25  1921    Drama      7
33  1922    Drama      2
36  1923   Comedy      4
42  1924   Comedy      2
52  1925    Drama      8
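
The year/genre pairs can also be built without repeating the year column by hand, because `DataFrame.explode` carries the other columns along with each exploded value. A minimal sketch on toy data, assuming the same column names as `films`:

```python
import pandas as pd

# Toy stand-in for `films`; the values are illustrative only
toy = pd.DataFrame({
    'release_year': [1920, 1920, 1921],
    'genres': ['Drama, Romance', 'Drama', None],
})

genres_by_year = (
    toy.dropna(subset=['genres'])
    .assign(Genre=lambda d: d['genres'].str.split(','))
    .explode('Genre')                                  # one row per (film, genre)
    .assign(Genre=lambda d: d['Genre'].str.strip())    # remove padding around names
    .rename(columns={'release_year': 'Year'})
    .loc[:, ['Year', 'Genre']]
    .reset_index(drop=True)
)
print(genres_by_year)
```

The resulting frame has the same `Year` and `Genre` columns as the cell above, so the later `groupby` and `idxmax` steps can be applied to it unchanged.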


In [25]:
# Plot the most popular genre per country vertically
fig_country_vertical = px.bar(
    most_popular_genres_country,
    x='Country',  # Country on the x-axis
    y='Count',  # Count on the y-axis
    color='Genre',
    title="Most Popular Genre per Country"
)
fig_country_vertical.show()

# Create another plot without the USA (vertically)
most_popular_genres_country_no_usa = most_popular_genres_country[most_popular_genres_country['Country'] != 'USA']

fig_country_no_usa_vertical = px.bar(
    most_popular_genres_country_no_usa,
    x='Country',  # Country on the x-axis
    y='Count',  # Count on the y-axis
    color='Genre',
    title="Most Popular Genre per Country (Excluding USA)"
)
fig_country_no_usa_vertical.show()


In [22]:
fig_year = px.bar(
    most_popular_genres_year,
    x='Year',
    y='Count',
    color='Genre',
    title="Most Popular Genre per Year"
)
fig_year.show()
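
`idxmax` returns the first row that holds the maximum, so when several genres tie within a year only one of them appears in the chart. Ties are likely in the early years, where the top count is often 1 or 2. The sketch below uses toy counts (not taken from the dataset) to show one way of keeping all tied genres and listing the affected years:

```python
import pandas as pd

# Toy stand-in for `genres_by_year_count`; the values are illustrative only
counts = pd.DataFrame({
    'Year': [2000, 2000, 2001, 2001],
    'Genre': ['Fantasy', 'Horror', 'Comedy', 'Drama'],
    'Count': [1, 1, 1, 2],
})

# Keep every genre whose count equals the yearly maximum
yearly_max = counts.groupby('Year')['Count'].transform('max')
top_with_ties = counts[counts['Count'] == yearly_max]

# Years in which more than one genre shares the top spot
genres_at_top = top_with_ties['Year'].value_counts()
tied_years = sorted(genres_at_top[genres_at_top > 1].index.tolist())
print(tied_years)
```

The same check applies to the per-country result, where many countries have a top count of 1.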


In [52]:
# Prepare the data: Summarize the most popular genre per country
most_popular_genres_country_grouped = most_popular_genres_country.groupby(['Country', 'Genre'], as_index=False).agg({'Count': 'sum'})

# Sort by count to get the most popular genre for each country
most_popular_genre_per_country = most_popular_genres_country_grouped.sort_values('Count', ascending=False).drop_duplicates('Country')

# Now, let's map the data to countries, with color representing the genre
fig_map = px.choropleth(
    most_popular_genre_per_country,  # Data source
    locations='Country',  # Column with country names
    locationmode='country names',  # Using country names
    color='Genre',  # Color by Genre, which will automatically use different colors for each genre
    hover_name='Country',  # What to show when hovering
    hover_data={'Genre': True, 'Count': True},  # Additional data to show on hover
    color_discrete_sequence=px.colors.qualitative.Set3,  # Use a predefined discrete color scale for categorical data
    title="Most Popular Genre per Country"
)

# Show the map
fig_map.show()
