## Film Analysis

The dataset **clean_films_id.csv** contains **(x) rows** and **(x) columns** with films released between **1906 and 2024**.

### Questions to Explore:
1. **Sentiment Scores**:  
   - Compare sentiment scores across genres.  
   - Investigate differences in sentiment between movies **with** and **without trigger warnings**.

2. **Sentiment and Reviews**:  
   - Analyze whether the sentiment of movie reviews differs noticeably by **genre** and by the presence of **trigger warnings**.

3. **Correlations**:  
   - Explore correlations between **sentiment scores**, **movie ratings**, and **box office earnings**.

4. **Impact of Trigger Warnings on Ratings**:  
   - Analyze how the presence of trigger warnings influences movie ratings across platforms (e.g., **IMDb**, **Rotten Tomatoes**).  
   - Use trigger warnings as a categorical variable to compare movie ratings for films **with** and **without trigger warnings**.

5. **Statistical Testing**:  
   - Conduct statistical tests like **t-tests** or **ANOVA** to determine if there’s a significant difference in ratings based on trigger warnings.

6. **Visualizations**:  
   - Create visualizations to illustrate relationships, including:  
     - **Box plots**  
     - **Histograms**  
     - **Scatter plots**

7. **Additional Analysis**:  
   - Perform correlation analysis, hypothesis testing, and trend detection.
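
As a rough illustration of the correlation question above, the sketch below computes a rank correlation matrix on a small toy frame. The column names mirror `imdb_rating`, `tmdb_rating`, and `revenue` from the dataset, but the values are invented for illustration, and the choice of Spearman over Pearson is an assumption made here because revenue figures tend to be heavily skewed.

```python
import pandas as pd

# Toy stand-in for the films table; all values are invented for illustration.
toy = pd.DataFrame({
    "imdb_rating": [5.9, 6.1, 6.5, 7.7, 6.0, 7.1],
    "tmdb_rating": [5.8, 6.2, 6.4, 7.5, 6.3, 7.0],
    "revenue": [1_800_000, 11_000_000, 137_365, 1_750_000, 8_000_000, 12_013_393],
})

# Spearman works on ranks, so a few very large revenue values
# do not dominate the coefficient the way they can with Pearson.
corr_matrix = toy.corr(method="spearman")
print(corr_matrix.round(2))
```

The same call on the real frame would be restricted to the numeric columns of interest, for example `films[['imdb_rating', 'tmdb_rating', 'revenue']].corr(method='spearman')`.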

In [16]:
import pandas as pd

In [17]:
import sys
sys.path.append('../utils')
sys.path.append('../scripts')
import data_cleaning
import content_tagging

In [18]:
films = pd.read_csv('../data/clean/clean_films_id.csv')

In [19]:
films = content_tagging.assign_content_tags(films)
display(films)


Unnamed: 0,title,original_title,genres,director,release_year,runtime,budget,revenue,profit,popularity,...,tmdb_votes,imdb_rating,imdb_votes,language,tmdb_id,imdb_id,doesthedog_id,events,has_warnings,content_tags
0,Traffic in Souls,Traffic in Souls,"crime, drama",George Loane Tucker,1913,88,5700,1800000,1794300,2.3,...,19,5.9,751,English,96128,tt0003471,268281.0,,False,
1,The Birth of a Nation,The Birth of a Nation,"drama, history, war",D.W. Griffith,1915,193,100000,11000000,10900000,13.8,...,520,6.1,26938,English,618,tt0004972,68194.0,"sexual assault, blood or gore, falling deaths,...",True,"Sexual Violence and Abuse, Horror and Supernat..."
2,The Cheat,The Cheat,drama,Cecil B. DeMille,1915,59,17311,137365,120054,6.6,...,64,6.5,2892,English,70368,tt0005078,47532.0,,False,
3,Intolerance: Love's Struggle Throughout the Ages,Intolerance: Love's Struggle Throughout the Ages,"drama, history",D.W. Griffith,1916,197,385907,1750000,1364093,7.4,...,329,7.7,17121,English,3059,tt0006864,46705.0,"kids dying, parents dying, sexual assault, blo...",True,"Sexual Violence and Abuse, Horror and Supernat..."
4,"20,000 Leagues Under the Sea","20,000 Leagues Under the Sea","adventure, drama, action, science fiction",Stuart Paton,1916,99,200000,8000000,7800000,7.9,...,52,6.1,2066,English,30266,tt0006333,226637.0,"shaving or cutting, blood or gore, animals (be...",True,"Horror and Supernatural, Physical Violence, An..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9780,The Apprentice,The Apprentice,"drama, history",Ali Abbasi,2024,122,16000000,12013393,-3986607,63.5,...,260,7.1,25086,English,1182047,tt8368368,1224057.0,"flashing lights or images, shaving or cutting,...",True,"Sexual Violence and Abuse, Horror and Supernat..."
9781,Dancing Village: The Curse Begins,Badarawuhi di Desa Penari,"horror, mystery, thriller",Kimo Stamboel,2024,122,1259208,3710622,2451414,15.3,...,17,5.7,3947,Indonesian,1176166,tt28763074,,,False,
9782,Weekend in Taipei,Weekend in Taipei,"action, adventure, thriller",George Huang,2024,101,30000000,2826064,-27173936,287.9,...,59,5.7,1774,English,1167271,tt28142896,1336183.0,,False,
9783,Bramayugam,ഭ്രമയുഗം,"horror, thriller, drama",Rahul Sadasivan,2024,140,3300000,10000000,6700000,9.5,...,40,7.8,12384,Malayalam,1166133,tt27431598,1072596.0,,False,


Create a working DataFrame from the film data. Random sampling of 1,000 rows is available but currently commented out, so the full dataset is used.

In [25]:
# sample_df = films.sample(n=1000, random_state=50)
sample_df = films.copy()

# split 'genres' column into individual rows
sample_genres_df = sample_df.assign(genre=sample_df['genres'].str.split(', ')).explode('genre')

# split 'events' (trigger warnings) into individual rows
sample_events_df = (
    sample_df.dropna(subset=['events'])
    .assign(event=lambda d: d['events'].str.split(', '))
    .explode('event')
)

# split 'content_tags' (warning categories) into individual rows
# note: dropna() does not remove empty strings, so untagged films can form a blank content_tag group
sample_content_tags_df = (
    sample_df.dropna(subset=['content_tags'])
    .assign(content_tag=lambda d: d['content_tags'].str.split(', '))
    .explode('content_tag')
)

# calculate average ratings for movies with and without warnings
sample_warnings_df = sample_df.groupby('has_warnings')[['tmdb_rating', 'imdb_rating']].mean().reset_index()

# average ratings by genre
sample_genre_sentiment_df = sample_genres_df.groupby('genre')[['tmdb_rating', 'imdb_rating']].mean().reset_index()

# average ratings by specific trigger warnings (events)
sample_event_sentiment_df = sample_events_df.groupby('event')[['tmdb_rating', 'imdb_rating']].mean().reset_index()

# average ratings by specific trigger warnings (content_tags)
sample_content_sentiment_df = sample_content_tags_df.groupby('content_tag')[['tmdb_rating', 'imdb_rating']].mean().reset_index()
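
The split-and-explode pattern used in the cell above can be tried on a toy frame. This is a minimal sketch with invented titles and tags; it also shows one possible way to treat empty strings like missing values before exploding, which is an assumption about how untagged films are stored rather than something the cleaning scripts are known to guarantee.

```python
import pandas as pd

# Toy frame with invented rows; column names follow the films table.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["crime, drama", "drama", "horror, mystery, thriller"],
    "content_tags": ["Physical Violence, Death and Grief", "", None],
})

# One row per (film, genre): split the string into a list, then explode it.
genres_long = toy.assign(genre=toy["genres"].str.split(", ")).explode("genre")

# Keep only films whose tag string is neither missing nor empty.
has_tags = toy["content_tags"].fillna("").str.strip() != ""
tags_long = (
    toy[has_tags]
    .assign(content_tag=lambda d: d["content_tags"].str.split(", "))
    .explode("content_tag")
)

print(genres_long[["title", "genre"]])
print(tags_long[["title", "content_tag"]])
```

Because `explode` repeats the original row for every list element, a film with three genres contributes to three genre averages in the later `groupby` steps.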

In [26]:
print("Average Ratings by Genre:")
print(sample_genre_sentiment_df)

Average Ratings by Genre:
              genre  tmdb_rating  imdb_rating
0            action     6.287609     6.191297
1         adventure     6.408254     6.252910
2         animation     6.769057     6.495283
3            comedy     6.220350     6.186622
4             crime     6.451474     6.515256
5       documentary     7.308333     7.275000
6             drama     6.607767     6.689444
7            family     6.437435     6.150365
8           fantasy     6.409244     6.171429
9           history     6.852834     6.943510
10           horror     6.042869     5.832295
11            music     7.138235     6.747059
12          mystery     6.370828     6.377071
13          romance     6.411791     6.437355
14  science fiction     6.271731     6.141635
15         thriller     6.290085     6.258605
16         tv movie     5.566667     4.716667
17              war     6.818063     6.898429
18          western     6.642857     6.797619
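
The table of genre means does not show how many films stand behind each figure, so an unusually low or high average may rest on only a handful of titles. The sketch below, on invented toy data, reports the group size next to the mean and filters on a hypothetical `MIN_FILMS` threshold; both the threshold and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy exploded frame (one row per film-genre pair); values are invented.
toy_genres = pd.DataFrame({
    "genre": ["drama", "drama", "drama", "tv movie", "horror", "horror"],
    "imdb_rating": [6.5, 7.7, 7.1, 4.7, 5.7, 7.8],
})

# Named aggregation: compute the mean and the number of films per genre.
genre_summary = (
    toy_genres.groupby("genre")["imdb_rating"]
    .agg(mean_rating="mean", n_films="count")
    .reset_index()
)

MIN_FILMS = 2  # hypothetical cutoff; choose one that suits the real data
reliable = genre_summary[genre_summary["n_films"] >= MIN_FILMS]
print(reliable)
```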


In [27]:
print("\nAverage Ratings for Movies With and Without Warnings:")
print(sample_warnings_df)


Average Ratings for Movies With and Without Warnings:
   has_warnings  tmdb_rating  imdb_rating
0         False     6.115677     6.100584
1          True     6.589308     6.551982
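
Whether a gap between two group means like this is statistically meaningful can be checked with a two-sample test, as planned in the question list. The following is a minimal sketch using Welch's t-test from SciPy on invented toy ratings; it assumes SciPy is installed and that an unequal-variance test is an acceptable choice for the two groups.

```python
import pandas as pd
from scipy import stats

# Toy data with invented ratings; column names follow the films table.
toy = pd.DataFrame({
    "has_warnings": [False] * 5 + [True] * 5,
    "imdb_rating": [5.9, 6.5, 5.7, 5.7, 6.2, 6.1, 7.7, 6.1, 7.1, 6.9],
})

with_w = toy.loc[toy["has_warnings"], "imdb_rating"]
without_w = toy.loc[~toy["has_warnings"], "imdb_rating"]

# equal_var=False selects Welch's t-test, which does not assume
# that both groups share the same variance.
t_stat, p_value = stats.ttest_ind(with_w, without_w, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

On the real data the two series would come from `films.loc[films['has_warnings'], 'imdb_rating']` and its complement, with missing ratings dropped first.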


In [28]:

print("\nAverage Ratings by Specific Trigger Warnings:")
print(sample_event_sentiment_df)


Average Ratings by Specific Trigger Warnings:
                              event  tmdb_rating  imdb_rating
0                              9/11     6.722581     6.861290
1                       ABA therapy     6.625000     6.275000
2            Achilles Tendon injury     6.589583     6.354167
3                              BDSM     6.745455     6.725455
4          D.I.D. misrepresentation     6.729412     6.688235
..                              ...          ...          ...
192     violent mentally ill person     6.698353     6.649551
193                        vomiting     6.592147     6.552675
194                wet/soiled pants     6.622654     6.568608
195  women brutalized for spectacle     6.825347     6.781250
196                   women slapped     6.877149     6.865611

[197 rows x 3 columns]


In [29]:
print("\nAverage Ratings by Specific Content Tags:")
print(sample_content_sentiment_df)


Average Ratings by Specific Content Tags:
                             content_tag  tmdb_rating  imdb_rating
0                                            6.115138     6.099879
1          Addiction and Substance Abuse     6.662712     6.648107
2                           Animal Abuse     6.608314     6.562412
3             Body Shaming and Fatphobia     6.631359     6.592457
4             Catastrophes and Accidents     6.656332     6.628812
5                            Child Abuse     6.702664     6.664034
6                        Death and Grief     6.701627     6.667514
7                       Fear and Anxiety     6.631551     6.592291
8                   Gore and Body Horror     6.627297     6.594572
9                Horror and Supernatural     6.652543     6.612554
10                         LGBTQ+ Phobia     6.642816     6.644402
11                    Medical and Health     6.644938     6.609984
12             Mental Health and Ableism     6.669928     6.639618
13                 