## Film Analysis

The dataset **clean_films_id.csv** contains **9785 rows** and **21 columns** with films released between **1906 and 2024**.

### Questions to Explore:
1. **Sentiment Scores**:  
   - Compare sentiment scores across genres.  
   - Investigate differences in sentiment between movies **with** and **without trigger warnings**.

2. **Sentiment and Reviews**:  
   - Analyze if there is a noticeable difference in sentiment of movie reviews based on **genre** and **trigger warnings**.

3. **Correlations**:  
   - Explore correlations between **sentiment scores**, **movie ratings**, and **box office earnings**.

4. **Impact of Trigger Warnings on Ratings**:  
   - Analyze how the presence of trigger warnings influences movie ratings across platforms (e.g., **IMDb**, **Rotten Tomatoes**).  
   - Use trigger warnings as a categorical variable to compare movie ratings for films **with** and **without trigger warnings**.

5. **Statistical Testing**:  
   - Conduct statistical tests like **t-tests** or **ANOVA** to determine if there’s a significant difference in ratings based on trigger warnings.

6. **Visualizations**:  
   - Create visualizations to illustrate relationships, including:  
     - **Box plots**  
     - **Histograms**  
     - **Scatter plots**

7. **Additional Analysis**:  
   - Perform correlation analysis, hypothesis testing, and trend detection.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind

In [2]:
import sys
sys.path.append('../utils')
sys.path.append('../scripts')
import data_cleaning
import data_inspection
import content_tagging

In [3]:
films = pd.read_csv('../data/clean/clean_films_id.csv')

In [None]:
films = content_tagging.assign_content_tags(films)
films.head()

In [5]:
# data_inspection.show_basic_info(films)

### Weighted Average

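1. Split multi-valued columns so that each film counts once per category (genre, event, content tag).
2. Calculate the mean rating for each category, weighting each film by its vote count multiplied by its popularity.
3. Divide by the total weight so the means stay on the original rating scale and remain comparable across categories.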
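
To make the weighting concrete, here is a minimal sketch of a weighted group mean on made-up data. The column names mirror the ones used in this notebook, but the values are purely illustrative and the weighting (votes times popularity) is an assumption taken from the function defined further down.

```python
import pandas as pd

# Made-up toy data: values are illustrative only, not taken from the film dataset
toy = pd.DataFrame({
    'genre': ['Drama', 'Drama', 'Comedy', 'Comedy'],
    'imdb_rating': [8.0, 6.0, 7.0, 5.0],
    'imdb_votes': [300, 100, 50, 50],
    'popularity': [1.0, 1.0, 2.0, 2.0],
})

# Weight each film by votes * popularity
toy['weight'] = toy['imdb_votes'] * toy['popularity']
toy['weighted_rating'] = toy['imdb_rating'] * toy['weight']

# Sum of weighted ratings divided by sum of weights, per genre
sums = toy.groupby('genre')[['weighted_rating', 'weight']].sum()
weighted_means = sums['weighted_rating'] / sums['weight']
plain_means = toy.groupby('genre')['imdb_rating'].mean()

print(weighted_means)
print(plain_means)
```

In this toy example the heavily voted Drama film pulls the weighted mean to 7.5, while the unweighted mean is 7.0.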

In [6]:
sample_df = films.copy()

In [7]:
# def remove_outliers(df, columns):
#     """
#     Removes outliers from the specified columns using the IQR method.
    
#     Args:
#         df (pd.DataFrame): The DataFrame from which outliers will be removed.
#         columns (list): The list of columns from which outliers will be removed.

#     Returns:
#         pd.DataFrame: DataFrame with outliers removed for the specified columns.
#     """
#     for col in columns:
#         # Calculate the IQR for the column
#         Q1 = df[col].quantile(0.25)
#         Q3 = df[col].quantile(0.75)
#         IQR = Q3 - Q1
        
#         # Define the upper and lower bounds for the outliers
#         lower_bound = Q1 - 1.5 * IQR
#         upper_bound = Q3 + 1.5 * IQR
        
#         # Remove rows where the column values are outside the bounds
#         df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    
#     return df

# columns_to_check = ['imdb_rating', 'tmdb_rating', 'popularity']
# sample_df = remove_outliers(sample_df, columns_to_check)


In [8]:
sample_genres_df = sample_df.assign(genre=sample_df['genres'].str.split(', ')).explode('genre').dropna(subset=['genre'])

sample_events_df = sample_df.dropna(subset=['events']).assign(event=sample_df['events'].str.split(', ')).explode('event').dropna(subset=['event'])

# Explode the 'content_tags' column, then drop NaN and empty-string tags
sample_content_tags_df = (
    sample_df.dropna(subset=['content_tags'])
    .assign(content_tag=lambda d: d['content_tags'].str.split(', '))
    .explode('content_tag')
    .dropna(subset=['content_tag'])  # Drop NaN values
    .loc[lambda d: d['content_tag'].str.strip() != '']  # Drop empty tags after exploding
)


In [9]:
def weighted_mean_with_votes_and_popularity(df, group_col, rating_cols, votes_cols, popularity_col):
    """
    Calculate weighted averages for multiple ratings using corresponding votes and popularity.
    
    Args:
        df (pd.DataFrame): DataFrame containing the data.
        group_col (str): Column to group by (e.g., 'genre').
        rating_cols (list): List of rating columns (e.g., ['tmdb_rating', 'imdb_rating']).
        votes_cols (list): List of vote columns corresponding to rating columns.
        popularity_col (str): Column for popularity weighting.

    Returns:
        pd.DataFrame: Weighted averages by group, including average popularity.
    """
    results = []
    for group, group_df in df.groupby(group_col):
        weighted_sums = {}
        total_weights = 0
        avg_popularity = group_df[popularity_col].mean()  # Calculate average popularity for the group
        
        for rating_col, votes_col in zip(rating_cols, votes_cols):
            valid_data = group_df.dropna(subset=[rating_col, votes_col])
            weights = valid_data[votes_col] * valid_data[popularity_col]  # Weight by votes and popularity
            weighted_sum = (valid_data[rating_col] * weights).sum()
            total_weight = weights.sum()
            
            weighted_sums[rating_col] = round(weighted_sum / total_weight, 1) if total_weight > 0 else None
            total_weights += total_weight
        
        # Append results with rounding to 1 decimal
        results.append({
            group_col: group,
            **{key: round(value, 1) if value is not None else None for key, value in weighted_sums.items()},
            'average_popularity': round(avg_popularity, 1),
            # 'total_weight': round(total_weights, 1)
        })
    
    return pd.DataFrame(results)


In [None]:
# calculate weighted averages
sample_genre_sentiment_weighted = weighted_mean_with_votes_and_popularity(
    sample_genres_df,
    group_col='genre',
    rating_cols=['tmdb_rating', 'imdb_rating'],
    votes_cols=['tmdb_votes', 'imdb_votes'],
    popularity_col='popularity'
)

print('Weighted Average Ratings by Genre with Popularity and Votes:\n')
print(sample_genre_sentiment_weighted)

In [None]:
sample_event_sentiment_weighted = weighted_mean_with_votes_and_popularity(
    sample_events_df,
    group_col='event',
    rating_cols=['tmdb_rating', 'imdb_rating'],
    votes_cols=['tmdb_votes', 'imdb_votes'],
    popularity_col='popularity'
)

print('\nWeighted Average Ratings by Trigger Warnings (Events) with Popularity:')
print(sample_event_sentiment_weighted)

In [None]:
sample_content_sentiment_weighted = weighted_mean_with_votes_and_popularity(
    sample_content_tags_df,
    group_col='content_tag',
    rating_cols=['tmdb_rating', 'imdb_rating'],
    votes_cols=['tmdb_votes', 'imdb_votes'],
    popularity_col='popularity'
)

print('\nWeighted Average Ratings by Content Tags with Popularity:')
print(sample_content_sentiment_weighted)

### Visuals

In [None]:
# Visualization setup
sns.set(style='whitegrid', palette='muted', font_scale=1.1)

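# Plot weighted average ratings by genre.
# Both layers share one explicit genre order so each bar lines up with its axis label.
genre_order = sample_genre_sentiment_weighted.sort_values(by='tmdb_rating', ascending=False)['genre'].tolist()
plt.figure(figsize=(10, 6))
sns.barplot(
    x='tmdb_rating',
    y='genre',
    data=sample_genre_sentiment_weighted,
    order=genre_order,
    color='skyblue',
    label='TMDB Rating',
)
sns.barplot(
    x='imdb_rating',
    y='genre',
    data=sample_genre_sentiment_weighted,
    order=genre_order,
    color='salmon',
    alpha=0.6,  # Semi-transparent so the TMDB layer stays visible underneath
    label='IMDb Rating',
)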

plt.title('Weighted Average Ratings by Genre')
plt.xlabel('Average Rating')
plt.ylabel('Genre')
plt.legend()
plt.tight_layout()
# plt.show()

In [None]:
# Top Events by Average Popularity
plt.figure(figsize=(10, 6))
sns.barplot(
    x='average_popularity',
    y='event',
    data=sample_event_sentiment_weighted.sort_values(by='average_popularity', ascending=False).head(10),
    color='crimson',
)

plt.title('Top Events by Average Popularity')
plt.xlabel('Popularity')
plt.ylabel('Event')
plt.tight_layout()
# plt.show()

In [None]:
# Top Content Tags by Average Popularity
plt.figure(figsize=(10, 6))
sns.barplot(
    x='average_popularity',
    y='content_tag',
    data=sample_content_sentiment_weighted.sort_values(by='average_popularity', ascending=False).head(10),
    color='goldenrod',
)

plt.title('Top Content Tags by Average Popularity')
plt.xlabel('Popularity')
plt.ylabel('Content Tag')
plt.tight_layout()
# plt.show()

**Impact of Trigger Warnings on Ratings**:  
   - Analyze how the presence of trigger warnings influences movie ratings across platforms (here **IMDb** and **TMDB**).  
   - Use trigger warnings as a categorical variable to compare movie ratings for films **with** and **without trigger warnings**.
   - Welch's t-test (`equal_var=False` in `scipy.stats.ttest_ind`) is applied because it does not assume that the two groups have equal variances.
   - The comparison therefore remains valid when the two groups (with/without warnings) differ in size.
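
As an illustration of the difference between the two variants, the sketch below runs both Student's and Welch's t-test on synthetic groups with unequal sizes and spreads. The group parameters are assumptions chosen for the example and are unrelated to the film data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic groups with different sizes and spreads (not the film data)
group_a = rng.normal(loc=6.5, scale=0.8, size=200)
group_b = rng.normal(loc=6.1, scale=1.2, size=1000)

# Student's t-test pools the variances; Welch's t-test does not
student = ttest_ind(group_a, group_b)
welch = ttest_ind(group_a, group_b, equal_var=False)

print(f'Student: t = {student.statistic:.2f}, p = {student.pvalue:.2e}')
print(f'Welch:   t = {welch.statistic:.2f}, p = {welch.pvalue:.2e}')
```

When the smaller group also has the smaller variance, as here, the pooled test tends to understate the t-statistic, which is one reason Welch's version is a common default.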

In [None]:
# Ensure 'has_warnings' is a boolean
films['has_warnings'] = films['has_warnings'].astype(bool)

# Split the data into two groups: with and without trigger warnings
films_with_warnings = films[films['has_warnings'] == True]
films_without_warnings = films[films['has_warnings'] == False]

# Descriptive Statistics for IMDb Ratings
time_with_mean = films_with_warnings['imdb_rating'].mean()
time_without_mean = films_without_warnings['imdb_rating'].mean()
print(f'Mean IMDb Rating (With Warnings): {time_with_mean:.2f}')
print(f'Mean IMDb Rating (Without Warnings): {time_without_mean:.2f}')

# Independent T-test for IMDb Ratings
t_stat_imdb, p_value_imdb = ttest_ind(
    films_with_warnings['imdb_rating'].dropna(),
    films_without_warnings['imdb_rating'].dropna(),
    equal_var=False  # Welch's t-test if variance is unequal
)
print(f'IMDb Ratings T-test: t-statistic = {t_stat_imdb:.2f}, p-value = {p_value_imdb:.4f}')

# Descriptive Statistics for TMDB Ratings
tmdb_with_mean = films_with_warnings['tmdb_rating'].mean()
tmdb_without_mean = films_without_warnings['tmdb_rating'].mean()
print(f'Mean TMDB Rating (With Warnings): {tmdb_with_mean:.2f}')
print(f'Mean TMDB Rating (Without Warnings): {tmdb_without_mean:.2f}')

# Independent T-test for TMDB Ratings
t_stat_tmdb, p_value_tmdb = ttest_ind(
    films_with_warnings['tmdb_rating'].dropna(),
    films_without_warnings['tmdb_rating'].dropna(),
    equal_var=False
)
print(f'TMDB Ratings T-test: t-statistic = {t_stat_tmdb:.2f}, p-value = {p_value_tmdb:.4f}')

# descriptive stats for popularity
popularity_with_mean = films_with_warnings['popularity'].mean()
popularity_without_mean = films_without_warnings['popularity'].mean()
print(f'Mean popularity (With Warnings): {popularity_with_mean:.2f}')
print(f'Mean popularity (Without Warnings): {popularity_without_mean:.2f}')

# Independent t-test for popularity
t_stat_popularity, p_value_popularity = ttest_ind(
    films_with_warnings['popularity'].dropna(),
    films_without_warnings['popularity'].dropna(),
    equal_var=False
)
# Report the popularity statistics (not the TMDB ones)
print(f'Popularity T-test: t-statistic = {t_stat_popularity:.2f}, p-value = {p_value_popularity:.4f}')

# Conclusion
if p_value_imdb < 0.05:
    print('There is a significant difference in IMDb ratings between movies with and without trigger warnings.')
else:
    print('No significant difference in IMDb ratings between movies with and without trigger warnings.')

if p_value_tmdb < 0.05:
    print('There is a significant difference in TMDB ratings between movies with and without trigger warnings.')
else:
    print('No significant difference in TMDB ratings between movies with and without trigger warnings.')

if p_value_popularity < 0.05:
    print('There is a significant difference in popularity between movies with and without trigger warnings.')
else:
    print('No significant difference in popularity between movies with and without trigger warnings.')

### Analysis Summary

#### Mean Ratings and Popularity:
- **IMDb Rating (With Warnings)**: 6.55
- **IMDb Rating (Without Warnings)**: 6.10
- **TMDB Rating (With Warnings)**: 6.59
- **TMDB Rating (Without Warnings)**: 6.12
- **Popularity (With Warnings)**: 36.67
- **Popularity (Without Warnings)**: 13.58

#### T-tests:
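- **IMDb Ratings**: t-statistic = 20.89, p < 0.0001
- **TMDB Ratings**: t-statistic = 26.98, p < 0.0001
- **Popularity**: printed as t-statistic = 26.98, p-value = 0.0000. These figures duplicate the TMDB result rather than reporting `t_stat_popularity` and `p_value_popularity`, so the popularity statistic should be re-read from those variables.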
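
A p-value printed with four decimal places rounds to 0.0000 whenever it is below 0.00005, which hides its magnitude. The sketch below uses a made-up value (not a result from this notebook) to show how scientific notation keeps that information.

```python
p_tiny = 3.2e-15  # Made-up p-value for illustration only

fixed = f'{p_tiny:.4f}'       # Fixed-point formatting, as used in the cell above
scientific = f'{p_tiny:.2e}'  # Scientific notation preserves the order of magnitude

print(fixed)       # 0.0000
print(scientific)  # 3.20e-15
```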

#### Interpretation:
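- Movies **with trigger warnings** have **higher mean ratings** on IMDb (6.55 vs. 6.10) and TMDB (6.59 vs. 6.12), and **higher mean popularity** (36.67 vs. 13.58), than movies **without trigger warnings**.
- The **t-test results** (p < 0.0001) indicate that the rating differences are statistically significant, meaning they are unlikely to arise from random sampling variation alone. Significance does not show that trigger warnings cause higher ratings, and with several thousand films even a small difference can reach significance.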

#### Conclusion:
- There is a **significant difference** in IMDb ratings, TMDB ratings, and popularity between movies with and without trigger warnings.


In [None]:
import plotly.graph_objects as go
import numpy as np

# Data for the plot
categories = ['IMDb Ratings', 'TMDB Ratings', 'Popularity']
means_with_warnings = [time_with_mean, tmdb_with_mean, popularity_with_mean]
means_without_warnings = [time_without_mean, tmdb_without_mean, popularity_without_mean]
t_stats = [t_stat_imdb, t_stat_tmdb, t_stat_popularity]
p_values = [p_value_imdb, p_value_tmdb, p_value_popularity]

# Create a bar plot with error bars
fig = go.Figure()

# Bar for 'With Warnings' group
fig.add_trace(go.Bar(
    x=categories,
    y=means_with_warnings,
    name='With Warnings',
    marker_color='lightblue',  # Softer pastel color
    error_y=dict(type='data', array=[np.std(films_with_warnings[col].dropna()) for col in ['imdb_rating', 'tmdb_rating', 'popularity']]),
))

# Bar for 'Without Warnings' group
fig.add_trace(go.Bar(
    x=categories,
    y=means_without_warnings,
    name='Without Warnings',
    marker_color='lightcoral',  # Softer pastel color
    error_y=dict(type='data', array=[np.std(films_without_warnings[col].dropna()) for col in ['imdb_rating', 'tmdb_rating', 'popularity']]),
))

# Add T-test results as annotations
for i, category in enumerate(categories):
    fig.add_annotation(
        x=categories[i],
        y=max(means_with_warnings[i], means_without_warnings[i]) + 1,
        text=f"T-statistic: {t_stats[i]:.2f}<br>P-value: {p_values[i]:.4f}",
        showarrow=True,
        arrowhead=2,
        ax=0,
        ay=-40
    )

# Layout of the plot
fig.update_layout(
    title='Comparison of Ratings and Popularity (With vs Without Trigger Warnings)',
    barmode='group',
    xaxis_title='Metric',
    yaxis_title='Mean Value',
    legend_title='Group',
    template='plotly_white',  # White background
    plot_bgcolor='white',     # White background for the plot area
    paper_bgcolor='white',    # White background for the whole plot
)

# Show the plot
fig.show()


### Effect Size
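Statistical significance alone does not show how large a difference is. An effect size metric such as Cohen's d expresses the difference between the group means in units of the pooled standard deviation.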
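
As a worked example, the sketch below computes Cohen's d by hand for two tiny made-up samples. The helper name `pooled_cohen_d` is hypothetical and chosen only for this illustration. By Cohen's conventional benchmarks, values around 0.2, 0.5, and 0.8 are often read as small, medium, and large effects.

```python
import numpy as np

def pooled_cohen_d(x, y):
    """Cohen's d using a pooled standard deviation (hypothetical helper name)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Made-up values: means 4 and 3, both sample variances 4, so the pooled SD is 2
d = pooled_cohen_d([2, 4, 6], [1, 3, 5])
print(f'd = {d:.2f}')  # d = 0.50
```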

In [None]:

def cohen_d(x, y):
    nx, ny = len(x), len(y)
    dof = nx + ny - 2
    return (np.mean(x) - np.mean(y)) / np.sqrt(((nx - 1) * np.std(x, ddof=1)**2 + (ny - 1) * np.std(y, ddof=1)**2) / dof)

d_imdb = cohen_d(
    films_with_warnings['imdb_rating'].dropna(),
    films_without_warnings['imdb_rating'].dropna()
)
print(f"Cohen's d for IMDb Ratings: {d_imdb:.2f}")

In [None]:
# Reuses cohen_d() defined in the previous cell
d_tmdb = cohen_d(
    films_with_warnings['tmdb_rating'].dropna(),
    films_without_warnings['tmdb_rating'].dropna()
)
print(f"Cohen's d for TMDB Ratings: {d_tmdb:.2f}")

In [None]:
# Reuses cohen_d() defined in an earlier cell
d_popularity = cohen_d(
    films_with_warnings['popularity'].dropna(),
    films_without_warnings['popularity'].dropna()
)
print(f"Cohen's d for popularity: {d_popularity:.2f}")

In [None]:
import plotly.express as px

# Top 10 content tags by average popularity
df = sample_content_sentiment_weighted.sort_values(by='average_popularity', ascending=False).head(10)

# Create a horizontal bar plot, colored by popularity
fig = px.bar(df, 
             x='average_popularity', 
             y='content_tag', 
             orientation='h', 
             color='average_popularity', 
             title='Top Content Tags by Average Popularity',
             labels={'average_popularity': 'Popularity', 'content_tag': 'Content Tag'},
             color_continuous_scale='Viridis')

# Show the plot
fig.show()


In [None]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource
from bokeh.palettes import Spectral11  # Use any desired color palette

# Top 10 content tags by average popularity
df = sample_content_sentiment_weighted.sort_values(by='average_popularity', ascending=False).head(10)

# Convert to Bokeh's ColumnDataSource
source = ColumnDataSource(df)

# Create the figure
p = figure(
    title="Top Content Tags by Average Popularity", 
    y_range=df['content_tag'].tolist(),
    height=600,
    width=800,
    toolbar_location=None
)

# Add a horizontal bar renderer
p.hbar(
    y='content_tag', 
    right='average_popularity', 
    height=0.4, 
    color=Spectral11[0],  # Pick a color from a palette
    legend_field='content_tag', 
    source=source
)

# Customize the axes and labels
p.xaxis.axis_label = "Popularity"
p.yaxis.axis_label = "Content Tag"
p.yaxis.major_label_orientation = "horizontal"
p.legend.orientation = "horizontal"
p.legend.location = "top_right"

# Show the plot in the notebook
output_notebook()
show(p)


In [None]:
import altair as alt
import pandas as pd

# Top 10 content tags by average popularity
df = sample_content_sentiment_weighted.sort_values(by='average_popularity', ascending=False).head(10)

# Create a bar chart using Altair
chart = alt.Chart(df).mark_bar(color='goldenrod').encode(
    x=alt.X('average_popularity', title='Popularity'),
    y=alt.Y('content_tag', title='Content Tag', sort=None),  # No sorting for categories
    tooltip=['content_tag', 'average_popularity']  # Add tooltips for interactivity
).properties(
    title='Top Content Tags by Average Popularity',
    width=800,
    height=400
)

# Display the chart
chart.show()
