## Film Analysis

The dataset **tmdb_clean_films.csv** contains **9785 rows** and **24 columns** with films released between **1906 and 2024**.

### Questions to Explore:
1. **Sentiment Scores**:  
   - Compare sentiment scores across genres.  
   - Investigate differences in sentiment between movies **with** and **without trigger warnings**.

2. **Sentiment and Reviews**:  
   - Analyze if there is a noticeable difference in sentiment of movie reviews based on **genre** and **trigger warnings**.

3. **Correlations**:  
   - Explore correlations between **sentiment scores**, **movie ratings**, and **box office earnings**.

4. **Impact of Trigger Warnings on Ratings**:  
   - Analyze how the presence of trigger warnings influences movie ratings across platforms (e.g., **IMDb**, **Rotten Tomatoes**).  
   - Use trigger warnings as a categorical variable to compare movie ratings for films **with** and **without trigger warnings**.

5. **Statistical Testing**:  
   - Conduct statistical tests like **t-tests** or **ANOVA** to determine if there’s a significant difference in ratings based on trigger warnings.

6. **Visualizations**:  
   - Create visualizations to illustrate relationships, including:  
     - **Box plots**  
     - **Histograms**  
     - **Scatter plots**

7. **Additional Analysis**:  
   - Perform correlation analysis, hypothesis testing, and trend detection.
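Questions 4 and 5 can be prototyped on synthetic data before the real columns are settled. The sketch below is illustrative only: the column names `rating`, `has_trigger_warning`, and `genre`, and both helper functions, are assumptions and not columns or utilities from the dataset. It runs Welch's t-test (which does not assume equal variances) for the two-group comparison and a one-way ANOVA for a multi-level category.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind, f_oneway

# Synthetic stand-in data; the real dataset's column names may differ.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "rating": np.concatenate([rng.normal(3.4, 0.5, 200), rng.normal(3.1, 0.5, 200)]),
    "has_trigger_warning": [False] * 200 + [True] * 200,
    "genre": rng.choice(["Drama", "Horror", "Comedy"], size=400),
})

def compare_ratings_by_flag(df, rating_col, flag_col):
    """Welch's t-test between films with and without the boolean flag."""
    flagged = df.loc[df[flag_col], rating_col].dropna()
    unflagged = df.loc[~df[flag_col], rating_col].dropna()
    stat, p_value = ttest_ind(flagged, unflagged, equal_var=False)
    return stat, p_value

def compare_ratings_by_group(df, rating_col, group_col):
    """One-way ANOVA across every level of a categorical column."""
    groups = [g[rating_col].dropna() for _, g in df.groupby(group_col)]
    stat, p_value = f_oneway(*groups)
    return stat, p_value

t_stat, t_p = compare_ratings_by_flag(toy, "rating", "has_trigger_warning")
f_stat, f_p = compare_ratings_by_group(toy, "rating", "genre")
print(f"Welch t-test (flagged vs. unflagged): t={t_stat:.2f}, p={t_p:.4f}")
print(f"ANOVA by genre: F={f_stat:.2f}, p={f_p:.4f}")
```

A negative t statistic here means the flagged group has the lower mean rating. On the real data, the same helpers would be pointed at whichever rating and trigger-warning columns actually exist.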

In [1]:
import ast
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from scipy.stats import ttest_ind


In [2]:
import sys
sys.path.append('../utils')
sys.path.append('../scripts')
import data_cleaning
import data_inspection
import content_tagging

In [3]:
films = pd.read_csv('../data/clean/letterboxd_clean_films.csv')

In [None]:
data_inspection.show_basic_info(films)

In [None]:
print(films.info())
print(films.describe())

### Basic Stats and Distributions

In [None]:
# Distribution of ratings
sns.histplot(films['letterboxd_rating'], bins=20, kde=True)
plt.title("Distribution of Movie Ratings")
plt.show()

In [None]:
# Distribution of runtime
sns.histplot(films['runtime'], bins=20, kde=True, color='orange')
plt.title("Distribution of Movie Runtime")
plt.show()

In [None]:
# Movies per release year
plt.figure(figsize=(10, 6))
sns.countplot(x='release_year', data=films, palette='viridis')
plt.xticks(rotation=45)
plt.title("Number of Movies Released Over Time")
plt.show()

In [None]:
# Check basic stats for numerical columns
numerical_cols = ['letterboxd_rating', 'runtime', 'release_year']
print(films[numerical_cols].describe())

# Check null counts and data distribution
print(films.isnull().sum())

### Correlation analysis

In [None]:
# Select only numeric columns
numeric_columns = films.select_dtypes(include=['number']).columns

# Calculate correlation matrix for numeric columns only
correlation_matrix = films[numeric_columns].corr()

# Display heatmap of correlations
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()


### Handling multi-value columns

In [None]:
# Define a function to split and count values in a column
def split_and_count(column):
    # explode() flattens the split lists without the quadratic cost of .sum() on lists
    values = films[column].dropna().str.split(',').explode().str.strip()
    # value_counts() already sorts in descending order
    return values.value_counts()

# Analyze 'genres'
genre_counts = split_and_count('genres')
print("Top 10 Genres:")
display(genre_counts.head(10))

# Analyze 'countries'
country_counts = split_and_count('countries')
print("\nTop 10 Countries:")
display(country_counts.head(10))

# Analyze 'languages'
language_counts = split_and_count('language')
print("\nTop 10 Languages:")
display(language_counts.head(10))

# Analyze 'events'
events_counts = split_and_count('events')
print("\nTop 10 Events:")
display(events_counts.head(10))

In [None]:
# Plot the top genres
plt.figure(figsize=(10, 6))
genre_counts.head(10).plot(kind='bar', color='skyblue')
plt.title("Top 10 Genres")
plt.show()

### Trends Across Different Columns

How Ratings Vary by Release Year

In [None]:
plt.figure(figsize=(12, 6))
# errorbar=None replaces the deprecated ci=None (seaborn >= 0.12)
sns.lineplot(x='release_year', y='letterboxd_rating', data=films, errorbar=None)
plt.title("Average Rating by Release Year")
plt.show()

Most Common Themes Over Time

Convert String Representations to Lists

In [14]:
# Convert the string representation of lists to actual Python lists
films['themes'] = films['themes'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)


Count Themes Across the Dataset

In [None]:
# Flatten the lists in the 'themes' column and count occurrences
all_themes = films['themes'].dropna().explode().str.strip()  # Drop NaNs, explode lists to rows, and strip whitespace

# Count occurrences of each theme
themes_counts = all_themes.value_counts().sort_values(ascending=False)

# Show the top 10 themes
print("Top 10 Themes:")
print(themes_counts.head(10))

# Plot the top 10 themes
plt.figure(figsize=(10, 6))
themes_counts.head(10).plot(kind='bar', color='salmon')
plt.title("Top 10 Themes")
plt.show()


Analyzing Other Columns

In [None]:
# Split genres and clean up the spaces
films['genres'] = films['genres'].apply(lambda x: [genre.strip() for genre in x.split(',')] if isinstance(x, str) else [])

# Now we can explode the genres and count them
all_genres = films['genres'].dropna().explode().str.strip()
genre_counts = all_genres.value_counts().sort_values(ascending=False)

# Show the top 10 genres
print("Top 10 Genres:")
print(genre_counts.head(10))

# Plot the top 10 genres


plt.figure(figsize=(10, 6))
genre_counts.head(10).plot(kind='bar', color='lightblue')
plt.title("Top 10 Genres")
plt.xlabel("Genre")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()

Trends Over Time Based on Themes

In [None]:
# Check the first few rows of the films DataFrame to inspect 'release_year' and 'themes'
print(films.head())

# Check if 'themes' column exists and is not empty
print(films['themes'].isna().sum())  # Check how many rows have missing themes
print(films['release_year'].isna().sum())  # Check how many rows have missing release_year

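# Parse any themes still stored as strings; keep the lists already parsed above
# (falling back to [] here would wipe out every list that was parsed earlier)
films['themes'] = films['themes'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
films_with_themes = films[films['themes'].notna()]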

# Exploding the themes column to create individual rows for each theme
exploded_themes = films_with_themes.explode('themes')

# Check the result of explosion
print(exploded_themes.head())

# Now, group by 'release_year' and 'themes' to count occurrences
theme_trends_by_year = exploded_themes.groupby(['release_year', 'themes']).size().unstack(fill_value=0)

# Check the grouped data structure
print(theme_trends_by_year.head())

# Ensure all values are numeric and fill NaN with 0 if any
theme_trends_by_year = theme_trends_by_year.apply(pd.to_numeric, errors='coerce').fillna(0)

# Check again if data is ready for plotting
print(theme_trends_by_year.head())
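The year-by-theme table built above is usually too wide to plot directly. The following is a minimal sketch, on toy data, of one way to narrow it down: keep only the most frequent themes and sum the counts per decade. The helper name `top_theme_trends` and its parameters are hypothetical and not part of the project's utilities.

```python
import pandas as pd

# Toy stand-in for the films table: one list of themes per film.
toy = pd.DataFrame({
    "release_year": [1999, 1999, 2005, 2005, 2012, 2012, 2012],
    "themes": [["war"], ["love", "war"], ["love"], ["crime"],
               ["love", "crime"], ["war"], []],
})

def top_theme_trends(df, n_themes=2, bin_size=10):
    """Count the n most frequent themes per period of `bin_size` years."""
    # Empty lists explode to NaN, so drop them before counting
    exploded = df.explode("themes").dropna(subset=["themes"])
    counts = exploded.groupby(["release_year", "themes"]).size().unstack(fill_value=0)
    # Keep only the themes with the highest overall counts
    top = counts.sum().nlargest(n_themes).index
    # Map each year to the first year of its period, e.g. 1999 -> 1990
    period_start = (counts.index // bin_size) * bin_size
    return counts[top].groupby(period_start).sum()

trends = top_theme_trends(toy)
print(trends)
```

Calling `trends.plot(marker='o')` on the result would draw one line per theme. Applied to the notebook's data, the same function could be called as `top_theme_trends(films_with_themes, n_themes=10)`, assuming `release_year` holds integers.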




To analyze the correlation between genres and themes and explore how they affect ratings, we need to:

1. **Prepare the Data**:  
   - **Genres**: Split the `genres` column (where genres are separated by commas) into separate binary columns, indicating the presence of each genre.  
   - **Themes**: Explode the `themes` column into individual rows, as done above, but this time keep track of the corresponding genres for each movie.

2. **Aggregate the Data**:  
   - Group by both genre and theme to count the occurrences of each theme per genre.  
   - Group by genre and calculate the average rating for each genre, so we can examine the relationship between themes and ratings.

3. **Calculate the Correlation**:  
   - Compute the correlation between each theme's presence in a genre and that genre's ratings.

Let's go step by step:

- **Step 1: Preprocessing**  
  - Split genres into binary columns.  
  - Explode themes and associate each theme with the respective genre and rating.
- **Step 2: Group and Aggregate**  
  - Count the occurrences of each theme per genre.  
  - Group by genre and calculate the average rating per genre.
- **Step 3: Correlation**  
  - Correlate theme counts with ratings.
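The three steps can be illustrated end to end on a toy table. This is a simplified sketch, not the notebook's exact method: it uses one row per (film, genre, theme) and `pd.crosstab` in place of binary genre columns, and the data values are invented for illustration. With only three genres the resulting correlations are not meaningful; the point is the shape of the pipeline.

```python
import pandas as pd

# Invented toy data that mirrors the notebook's column names.
toy = pd.DataFrame({
    "genres": ["Drama, Romance", "Horror", "Drama", "Horror, Drama"],
    "themes": [["love", "loss"], ["fear"], ["loss"], ["fear", "loss"]],
    "letterboxd_rating": [4.0, 3.0, 3.5, 2.5],
})

# Step 1: one row per (film, genre), then one row per (film, genre, theme)
films_by_genre = (
    toy.assign(genre=toy["genres"].str.split(","))
       .explode("genre")
       .reset_index(drop=True)
)
films_by_genre["genre"] = films_by_genre["genre"].str.strip()
long = films_by_genre.explode("themes").rename(columns={"themes": "theme"})

# Step 2: theme counts per genre and average rating per genre
theme_counts = pd.crosstab(long["genre"], long["theme"])
genre_avg_rating = films_by_genre.groupby("genre")["letterboxd_rating"].mean()

# Step 3: for each theme, correlate its per-genre count with the genre's average rating
theme_rating_corr = theme_counts.corrwith(genre_avg_rating)

print(theme_counts)
print(genre_avg_rating)
print(theme_rating_corr)
```

Because `theme_counts` and `genre_avg_rating` share the genre index, `corrwith` aligns them automatically. A positive value means the theme shows up more often in higher-rated genres.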


In [18]:


# 1. Split genres into binary columns. 'genres' was converted to lists above,
#    so join each list back into a comma-separated string before get_dummies.
genre_strings = films['genres'].apply(lambda g: ','.join(g) if isinstance(g, list) else g)
genre_strings = genre_strings.str.replace(r'\s*,\s*', ',', regex=True)
genre_dummies = genre_strings.str.get_dummies(sep=',').add_prefix('genre_')

# 2. Explode themes (parse values still stored as strings; keep parsed lists as they are)
films['themes'] = films['themes'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
exploded_themes = films.explode('themes')

# 3. Associate each exploded theme with its film's genre flags
expanded_genres = genre_dummies.loc[exploded_themes.index].reset_index(drop=True)

# 4. Combine genres, themes, and ratings (reset both indexes so the rows align)
exploded_themes_with_genres = pd.concat(
    [exploded_themes[['themes', 'letterboxd_rating']].reset_index(drop=True), expanded_genres],
    axis=1)

# 5. Limit the number of themes and genres for efficient processing
# Get the top 20 most frequent themes
top_themes = exploded_themes_with_genres['themes'].value_counts().nlargest(20).index
exploded_themes_with_genres = exploded_themes_with_genres[exploded_themes_with_genres['themes'].isin(top_themes)]

# Get the top 10 most frequent genres (by film count, not by column order)
top_genres = genre_dummies.sum().nlargest(10).index
exploded_themes_with_genres = exploded_themes_with_genres[exploded_themes_with_genres[top_genres].any(axis=1)]

# 6. Group by genre and theme to count occurrences
theme_genre_counts = exploded_themes_with_genres.groupby(['themes'] + top_genres.tolist()).size().unstack(fill_value=0)

# 7. Group by genre combination and calculate average ratings
genre_ratings = exploded_themes_with_genres.groupby(top_genres.tolist())['letterboxd_rating'].mean()




In [19]:
# 8. Merge counts with ratings
theme_genre_counts = theme_genre_counts.stack().reset_index(name='count')
theme_genre_counts = theme_genre_counts.merge(genre_ratings.reset_index(name='avg_rating'), 
                                              on=top_genres.tolist(), 
                                              how='left')




In [None]:
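# 9. Correlate theme counts with ratings: for each theme, correlate how often it
#    appears in a genre combination with that combination's average rating.
#    (corrwith against a Series with a different index would only return NaN.)
valid_rows = theme_genre_counts.dropna(subset=['avg_rating'])
theme_rating_corr = (
    valid_rows.groupby('themes')[['count', 'avg_rating']]
              .apply(lambda g: g['count'].corr(g['avg_rating']))
              .sort_values(ascending=False)
)
print(theme_rating_corr)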
