## Film Analysis

The dataset **clean_films_id.csv** contains **9785 rows** and **21 columns** with films released between **1906 and 2024**.

### Questions to Explore:
1. **Sentiment Scores**:  
   - Compare sentiment scores across genres.  
   - Investigate differences in sentiment between movies **with** and **without trigger warnings**.

2. **Sentiment and Reviews**:  
   - Analyze if there is a noticeable difference in sentiment of movie reviews based on **genre** and **trigger warnings**.

3. **Correlations**:  
   - Explore correlations between **sentiment scores**, **movie ratings**, and **box office earnings**.

4. **Impact of Trigger Warnings on Ratings**:  
   - Analyze how the presence of trigger warnings influences movie ratings across platforms (e.g., **IMDb**, **Rotten Tomatoes**).  
   - Use trigger warnings as a categorical variable to compare movie ratings for films **with** and **without trigger warnings**.

5. **Statistical Testing**:  
   - Conduct statistical tests like **t-tests** or **ANOVA** to determine if there’s a significant difference in ratings based on trigger warnings.

6. **Visualizations**:  
   - Create visualizations to illustrate relationships, including:  
     - **Box plots**  
     - **Histograms**  
     - **Scatter plots**

7. **Additional Analysis**:  
   - Perform correlation analysis, hypothesis testing, and trend detection.

In [1]:
import pandas as pd

In [2]:
import sys
sys.path.append('../utils')
sys.path.append('../scripts')
import data_cleaning
import content_tagging

In [3]:
films = pd.read_csv('../data/clean/clean_films_id.csv')

In [4]:
films = content_tagging.assign_content_tags(films)
films.head()

Unnamed: 0,title,original_title,genres,director,release_year,runtime,budget,revenue,profit,popularity,...,tmdb_votes,imdb_rating,imdb_votes,language,tmdb_id,imdb_id,doesthedog_id,events,has_warnings,content_tags
0,Traffic in Souls,Traffic in Souls,"crime, drama",George Loane Tucker,1913,88,5700,1800000,1794300,2.3,...,19,5.9,751,English,96128,tt0003471,268281.0,,False,
1,The Birth of a Nation,The Birth of a Nation,"drama, history, war",D.W. Griffith,1915,193,100000,11000000,10900000,13.8,...,520,6.1,26938,English,618,tt0004972,68194.0,"sexual assault, blood or gore, falling deaths,...",True,"Sexual Violence and Abuse, Horror and Supernat..."
2,The Cheat,The Cheat,drama,Cecil B. DeMille,1915,59,17311,137365,120054,6.6,...,64,6.5,2892,English,70368,tt0005078,47532.0,,False,
3,Intolerance: Love's Struggle Throughout the Ages,Intolerance: Love's Struggle Throughout the Ages,"drama, history",D.W. Griffith,1916,197,385907,1750000,1364093,7.4,...,329,7.7,17121,English,3059,tt0006864,46705.0,"kids dying, parents dying, sexual assault, blo...",True,"Sexual Violence and Abuse, Horror and Supernat..."
4,"20,000 Leagues Under the Sea","20,000 Leagues Under the Sea","adventure, drama, action, science fiction",Stuart Paton,1916,99,200000,8000000,7800000,7.9,...,52,6.1,2066,English,30266,tt0006333,226637.0,"shaving or cutting, blood or gore, animals (be...",True,"Horror and Supernatural, Physical Violence, An..."


### Weighed Average

1. Count occurrences for each category.
2. Calculate the mean for each category, but weight each category by the number of occurrences.
3. Normalize the means to ensure comparability across categories.

In [5]:
sample_df = films.copy()

In [6]:
sample_genres_df = sample_df.assign(genre=sample_df['genres'].str.split(', ')).explode('genre')
sample_events_df = sample_df.dropna(subset=['events']).assign(event=sample_df['events'].str.split(', ')).explode('event')
sample_content_tags_df = sample_df.dropna(subset=['content_tags']).assign(content_tag=sample_df['content_tags'].str.split(', ')).explode('content_tag')

In [7]:
# Weighted mean function with popularity included in results
def weighted_mean_with_votes_and_popularity(group_df, category_column, rating_columns, vote_column, popularity_column):
    """
    Calculate the weighted average of rating_columns using vote_column as weights,
    and include the popularity in the results.

    Parameters:
    - group_df: DataFrame to group.
    - category_column: Column to group by (e.g., 'genre', 'event').
    - rating_columns: List of rating columns to calculate weighted means for.
    - vote_column: Column containing vote counts to use as weights.
    - popularity_column: Column containing the popularity scores to include in the result.
    """
    # Calculate weighted sums for ratings
    weighted_sums = group_df.groupby(category_column)[rating_columns].apply(
        lambda x: (x.multiply(group_df.loc[x.index, vote_column], axis=0)).sum()
    )
    
    # Calculate total votes per group
    total_votes = group_df.groupby(category_column)[vote_column].sum()

    # Calculate weighted means
    weighted_means = weighted_sums.divide(total_votes, axis=0).reset_index()

    # Get the average popularity for each category
    avg_popularity = group_df.groupby(category_column)[popularity_column].mean().reset_index()

    # Merge popularity with weighted means
    weighted_means = pd.merge(weighted_means, avg_popularity, on=category_column, how='left')

    # Round the results to 2 decimal places
    weighted_means[rating_columns] = weighted_means[rating_columns].round(2)
    weighted_means[popularity_column] = weighted_means[popularity_column].round(2)

    return weighted_means

In [8]:
# calculate weighted averages
sample_genre_sentiment_weighted = weighted_mean_with_votes_and_popularity(
    sample_genres_df, 'genre', ['tmdb_rating', 'imdb_rating'], 'tmdb_votes', 'popularity'
)

# 4. Print Results with Popularity
print('Weighted Average Ratings by Genre with Popularity:')
print(sample_genre_sentiment_weighted)

Weighted Average Ratings by Genre with Popularity:
              genre  tmdb_rating  imdb_rating  popularity
0            action        21.80        21.84       38.66
1         adventure        23.50        23.47       47.90
2         animation        28.38        28.27       58.24
3            comedy        19.88        19.79       23.66
4             crime        21.09        21.34       23.17
5       documentary        14.26        14.14       12.18
6             drama        19.20        19.38       20.04
7            family        26.43        26.11       43.17
8           fantasy        23.39        23.11       38.97
9           history        21.97        22.25       23.75
10           horror        16.70        16.38       30.16
11            music        23.99        22.59       26.70
12          mystery        21.82        21.90       23.43
13          romance        19.78        19.53       19.52
14  science fiction        21.64        21.73       50.10
15         thriller  

In [9]:
sample_event_sentiment_weighted = weighted_mean_with_votes_and_popularity(
    sample_events_df, 'event', ['tmdb_rating', 'imdb_rating'], 'tmdb_votes', 'popularity'
)

print('\nWeighted Average Ratings by Trigger Warnings (Events) with Popularity:')
print(sample_event_sentiment_weighted)


                              event  tmdb_rating  imdb_rating  popularity
0                              9/11       319.20       328.20       47.78
1                       ABA therapy       341.54       356.09       27.38
2            Achilles Tendon injury       379.76       379.26       47.03
3                              BDSM       481.60       488.17       58.73
4          D.I.D. misrepresentation       425.52       436.20       49.75
..                              ...          ...          ...         ...
192     violent mentally ill person       357.16       359.03       42.47
193                        vomiting       269.85       271.45       45.81
194                wet/soiled pants       310.85       312.71       78.19
195  women brutalized for spectacle       382.10       384.77       59.86
196                   women slapped       371.47       374.08       48.54

[197 rows x 4 columns]


In [10]:
sample_content_sentiment_weighted = weighted_mean_with_votes_and_popularity(
    sample_content_tags_df, 'content_tag', ['tmdb_rating', 'imdb_rating'], 'tmdb_votes', 'popularity'
)

print('\nWeighted Average Ratings by Content Tags with Popularity:')
print(sample_content_sentiment_weighted)


Weighted Average Ratings by Content Tags with Popularity:
                             content_tag  tmdb_rating  imdb_rating  popularity
0                                                6.46         6.36       13.59
1          Addiction and Substance Abuse       107.58       108.43       41.20
2                           Animal Abuse       104.90       105.54       46.24
3             Body Shaming and Fatphobia       107.64       108.29       42.67
4             Catastrophes and Accidents       103.97       104.64       42.47
5                            Child Abuse       107.16       107.73       39.13
6                        Death and Grief       106.21       106.85       43.75
7                       Fear and Anxiety       101.60       102.16       42.05
8                   Gore and Body Horror       100.81       101.36       39.49
9                Horror and Supernatural       105.15       105.80       44.16
10                         LGBTQ+ Phobia       111.62       112.46      