NLP on Letterboxd df
- Check word frequency in several columns -> 'tagline', 'summary', 'genres', 'themes', 'events'
- Based on correlations, rate from 0 to 5
- Tie back to whether a title has warnings or not
- Figure out how it impacts sentiment on movies
- Sentiment-o-meter
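
One possible reading of the "rate from 0 to 5" step is a linear rescaling of a correlation coefficient from [-1, 1] onto [0, 5]. The sketch below is only an assumption about how such a rating might be defined; `correlation_to_score` is a hypothetical helper and not part of `sentiment_utils`. The example values are genre correlations reported later in this notebook.

```python
def correlation_to_score(r: float) -> float:
    """Linearly map a correlation in [-1, 1] to a score in [0, 5]."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must be between -1 and 1")
    return (r + 1.0) * 2.5


# Genre correlations taken from the "Genres vs Ratings" output.
genre_correlations = {"horror": -0.42, "drama": 0.41, "adventure": -0.03}
genre_scores = {word: correlation_to_score(r) for word, r in genre_correlations.items()}

for word, score in genre_scores.items():
    print(f"{word}: {score:.2f}")
```

With this mapping a correlation of 0 lands on 2.5, so weakly correlated words cluster around the midpoint of the scale.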

The Data
- `cleaned_films` contains 4203 titles and 13 columns.
- There are no Null values.

| **Column Name**        | **Data Type**   | **Description**                                                                 |
|------------------------|-----------------|----------------------------------------------------------------------------------|
| **title**              | object          | The title of the movie.                                                          |
| **release_year**       | int64           | The year the movie was released.                                                 |
| **tagline**            | object          | The movie's tagline (promotional phrase).                                        |
| **summary**            | object          | A brief description of the movie's plot.                                         |
| **runtime**            | int64           | The total runtime of the movie in minutes.                                       |
| **letterboxd_rating**  | float64         | The movie's average rating on Letterboxd.                                        |
| **genres**             | object          | A list of genres the movie belongs to (e.g., Drama, Comedy).                     |
| **language**           | object          | The languages the movie was produced in.                                         |
| **countries**          | object          | The countries where the movie was made or released.                              |
| **themes**             | object          | The central themes explored in the movie (e.g., Love, War, Friendship).          |
| **director**           | object          | The director(s) of the movie.                                                   |
| **events**             | object          | Key events or warnings in the movie (e.g., violence, strong language).           |
| **has_warnings**       | bool            | A boolean indicating if the movie contains warnings for sensitive content.       |
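
The shape, null, and dtype claims above can be checked directly on the frame. A minimal sketch on a toy frame: `toy` stands in for `cleaned_films` and holds only a few illustrative columns and rows.

```python
import pandas as pd

# Toy stand-in for `cleaned_films`; not the real data.
toy = pd.DataFrame(
    {
        "title": ["La La Land", "Whiplash"],
        "release_year": [2016, 2014],
        "letterboxd_rating": [4.09, 4.43],
        "has_warnings": [True, True],
    }
)

n_rows, n_cols = toy.shape                 # (rows, columns)
null_counts = toy.isna().sum()             # number of nulls per column
has_no_nulls = not toy.isna().any().any()  # True when no cell is null

print(n_rows, n_cols, has_no_nulls)
print(toy.dtypes)
```

Running the same three lines against `cleaned_films` would confirm the 4203 x 13 shape and the absence of nulls stated above.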


In [1]:
import pandas as pd
import nltk 

In [2]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /home/bru/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [3]:
import sys
sys.path.append('../utils')
import sentiment_utils

In [4]:

films = pd.read_csv('../data/clean/letterboxd_clean_films.csv')
films.head(2)

Unnamed: 0,letterboxd_id,title,release_year,tagline,summary,runtime,letterboxd_rating,genres,language,countries,themes,director,topics,doesthedog_id,events,has_warnings
0,1000001,Barbie,2023,She's everything. He's just Ken.,Barbie and Ken are having the time of their li...,114,3.86,"Comedy, Adventure",English,"UK, USA","['Humanity and the world around us', 'Crude hu...",Greta Gerwig,,381345.0,,False
1,1000002,Parasite,2019,Act like you own the place.,"All unemployed, Ki-taek's family takes peculia...",133,4.56,"Comedy, Thriller, Drama","Korean, English, German, Korean",South Korea,"['Humanity and the world around us', 'Intense ...",Bong Joon-ho,,19408.0,,False


In [5]:
cleaned_films = films.copy()
cleaned_films.drop(columns=['topics', 'doesthedog_id', 'letterboxd_id'], inplace=True)

In [6]:
cleaned_films = cleaned_films.dropna()

In [7]:
display(cleaned_films)


Unnamed: 0,title,release_year,tagline,summary,runtime,letterboxd_rating,genres,language,countries,themes,director,events,has_warnings
4,La La Land,2016,Here's to the fools who dream.,"Mia, an aspiring actress, serves lattes to mov...",129,4.09,"Drama, Comedy, Music, Romance",English,"Hong Kong, USA","['Song and dance', 'Humanity and the world aro...",Damien Chazelle,"flashing lights or images, shower scenes, sad ...",True
11,Whiplash,2014,The road to greatness can take you to the edge.,"Under the direction of a ruthless instructor, ...",107,4.43,"Drama, Music",English,USA,"['Moving relationship stories', 'Student comin...",Damien Chazelle,"finger or toe mutilation, spitting, car crashe...",True
32,Once Upon a Time in Hollywood,2019,"In this town, it can all change… like that","Los Angeles, 1969. TV star Rick Dalton, a stru...",162,3.76,"Drama, Thriller, Comedy","English, English, Italian, Spanish","China, UK, USA","['Humanity and the world around us', 'Fascinat...",Quentin Tarantino,"people being burned alive, spitting, blood or ...",True
39,Glass Onion,2022,"When the game ends, the mystery begins.",World-famous detective Benoit Blanc heads to G...,140,3.45,"Comedy, Crime, Mystery",English,USA,"['Thrillers and murder mysteries', 'Intriguing...",Rian Johnson,"flashing lights or images, car crashes, people...",True
68,Coco,2017,The celebration of a lifetime,Despite his family’s baffling generations-old ...,105,4.12,"Adventure, Animation, Music, Family","English, English, Spanish",USA,"['Moving relationship stories', 'Song and danc...",Lee Unkrich,"parents dying, spitting, ghosts, child abuse, ...",True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18425,The Triangle,2001,"In the Bermuda Triangle, nothing stays lost fo...",This made-for-TV movie follows a group of frie...,92,2.89,"Thriller, Horror, TV Movie",English,"USA, Canada, Barbados","['Horror, the undead and monster classics', 'T...",Lewis Teague,"kids dying, parents dying, shaving or cutting,...",True
18427,CAT,2022,Drugs. Deceit. Danger.,"Living under an alias, a former police informa...",360,3.50,"Crime, Drama",Hindi,India,"['Crime, drugs and gangsters', 'Intense politi...",Balwinder Singh Janjua,"people being burned alive, flashing lights or ...",True
18429,Wraith,2017,There's Something in My Room,After living in an old mansion for almost 10 y...,99,2.61,"Mystery, Thriller, Horror",English,USA,"['Faith and religion', 'Terrifying, haunted, a...",Michael O. Sajbel,"people being burned alive, spitting, shaky cam...",True
18436,Tulsa,2020,Big changes come in small packages,A desperate marine biker’s life is turned upsi...,120,2.97,"Comedy, Drama",English,USA,"['Faith and religion', 'Moving relationship st...","Gloria Stella, Scott Pryor","kids dying, parents dying, car crashes, people...",True
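
Every row in this preview has `has_warnings` set to True, while the two rows printed before `dropna()` had an empty `events` value and `has_warnings` set to False. If that pattern holds across the dataset (an assumption, not verified here), `dropna()` removes the titles without warnings, which matters for the planned comparison against `has_warnings`. A sketch for checking this, using a toy frame with invented structure:

```python
import pandas as pd

# Toy frame mimicking the pattern seen above: no warnings -> empty `events`.
toy = pd.DataFrame(
    {
        "title": ["Barbie", "Parasite", "La La Land", "Whiplash"],
        "events": [None, None, "flashing lights or images", "spitting"],
        "has_warnings": [False, False, True, True],
    }
)

before = toy["has_warnings"].value_counts().to_dict()
after = toy.dropna()["has_warnings"].value_counts().to_dict()

print("before dropna:", before)
print("after dropna: ", after)

# Restricting dropna to the columns that must be present (here: `title`)
# keeps the titles without warnings.
kept = toy.dropna(subset=["title"])
print(len(kept))
```

If the real data behaves the same way, passing `subset=` with only the text columns under analysis would preserve both groups.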


### 1. Data Preprocessing

**Tagline Word Count**

In [8]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='tagline', top_n=20)

Most common words in tagline:
[('one', 249), ('love', 237), ('story', 155), ('never', 154), ('life', 137), ('world', 123), ('time', 111), ('dont', 100), ('evil', 90), ('new', 85), ('every', 84), ('family', 81), ('terror', 78), ('get', 70), ('man', 65), ('go', 64), ('cant', 62), ('true', 62), ('die', 62), ('hell', 62)]


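
`analyze_most_common_words` lives in `sentiment_utils` and its source is not shown here. Judging from the output (lowercased tokens, `dont` and `cant` without apostrophes, no stop words), it probably does something along these lines. This is a guess at the approach, not the actual implementation; `top_words` and the small `STOP_WORDS` set are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a fuller one
# (for example NLTK's English stop words).
STOP_WORDS = {"a", "an", "and", "he", "her", "his", "is", "it", "just",
              "of", "she", "the", "to", "you"}


def top_words(texts, top_n=20):
    """Count the most frequent non-stop-word tokens across `texts`."""
    counts = Counter()
    for text in texts:
        # Lowercase, drop everything except ASCII letters and whitespace
        # ("don't" -> "dont"), then split on whitespace.
        cleaned = re.sub(r"[^a-z\s]", "", text.lower())
        counts.update(w for w in cleaned.split() if w not in STOP_WORDS)
    return counts.most_common(top_n)


taglines = [
    "She's everything. He's just Ken.",
    "Here's to the fools who dream.",
    "The celebration of a lifetime",
]
print(top_words(taglines, top_n=5))
```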

**Summary Word Count**

In [9]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='summary', top_n=20)

Most common words in summary:
[('young', 617), ('life', 561), ('one', 516), ('new', 494), ('two', 452), ('family', 436), ('find', 351), ('man', 328), ('woman', 325), ('world', 315), ('must', 312), ('friends', 302), ('home', 292), ('group', 286), ('love', 276), ('town', 266), ('mysterious', 264), ('years', 259), ('finds', 243), ('story', 240)]



**Genre Word Count**

In [10]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='genres', top_n=20)

Most common words in genres:
[('drama', 1833), ('horror', 1542), ('thriller', 1244), ('comedy', 1092), ('mystery', 537), ('crime', 523), ('romance', 520), ('action', 498), ('science', 459), ('fiction', 459), ('fantasy', 327), ('adventure', 273), ('music', 254), ('family', 215), ('animation', 195), ('history', 141), ('documentary', 113), ('tv', 105), ('movie', 105), ('war', 103)]


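
The counts for `science` and `fiction` (459 each) and for `tv` and `movie` (105 each) are identical, which suggests that multi-word genres such as Science Fiction and TV Movie are being split into separate tokens. Splitting on the comma separator instead keeps each genre whole. A sketch, assuming `genres` is a comma-separated string as in the preview; `count_genres` is a hypothetical helper.

```python
from collections import Counter


def count_genres(genre_strings):
    """Count whole genres from comma-separated strings."""
    counts = Counter()
    for entry in genre_strings:
        # Split on commas, trim whitespace, skip empty pieces.
        counts.update(g.strip().lower() for g in entry.split(",") if g.strip())
    return counts


genres = [
    "Thriller, Horror, TV Movie",   # as in the preview
    "Mystery, Thriller, Horror",    # as in the preview
    "Science Fiction, Horror",      # invented row for illustration
]
genre_counts = count_genres(genres)
print(genre_counts.most_common(3))
```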

**Themes Word Count**

In [11]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='themes', top_n=20)

Most common words in themes:
[('horror', 5696), ('stories', 1948), ('terrifying', 1579), ('monster', 1516), ('humor', 1453), ('crime', 1422), ('undead', 1229), ('classics', 1229), ('dark', 1204), ('relationship', 1202), ('gory', 1173), ('family', 1124), ('jokes', 1081), ('gruesome', 1062), ('slasher', 1062), ('intense', 1056), ('twisted', 1032), ('psychological', 1032), ('thriller', 1032), ('violence', 997)]


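
In the preview, `themes` is stored as the string form of a Python list (for example `['Song and dance', ...]`), so word-level counting breaks each theme into fragments. A sketch that parses each cell back into a list and counts whole themes, assuming every cell is a valid list literal; `count_themes` is hypothetical and the row combinations below are invented.

```python
import ast
from collections import Counter


def count_themes(theme_cells):
    """Count whole themes from cells holding stringified Python lists."""
    counts = Counter()
    for cell in theme_cells:
        themes = ast.literal_eval(cell)  # parses literals only, runs no code
        counts.update(theme.strip() for theme in themes)
    return counts


cells = [
    "['Humanity and the world around us', 'Song and dance']",
    "['Humanity and the world around us', 'Faith and religion']",
]
theme_counts = count_themes(cells)
print(theme_counts.most_common(1))
```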

**Events Word Count**

In [12]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='events', top_n=20)

Most common words in events:
[('dying', 4676), ('gore', 3218), ('sexual', 3153), ('scenes', 2831), ('animals', 2341), ('abuse', 2336), ('dies', 2025), ('violence', 2010), ('blood', 2006), ('people', 1905), ('content', 1566), ('parents', 1439), ('gun', 1261), ('suicide', 1258), ('car', 1241), ('choking', 1203), ('mutilation', 1122), ('dead', 1082), ('child', 1075), ('sad', 1067)]



### 2. Word Correlation

In [13]:
df = cleaned_films.copy()

**Taglines vs Ratings**

In [14]:
sentiment_utils.word_rating_correlation(df, text_column='processed_tagline', rating_column='letterboxd_rating', top_n=20)

Top Words Correlated with Ratings:
love: 0.11
film: 0.08
evil: -0.08
world: 0.07
story: 0.07
men: 0.07
picture: 0.06
could: 0.06
boy: 0.06
women: 0.06
life: 0.06
america: 0.06
fear: -0.06
unforgettable: 0.05
movie: 0.05
time: 0.05
motion: 0.05
murder: 0.05
woman: 0.05
sometimes: 0.05
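
The source of `word_rating_correlation` is not shown. A common way to obtain per-word values like these is to build a 0/1 indicator for whether a word occurs in each text and take its Pearson correlation with the rating. The sketch below assumes that approach; `word_presence_correlation` is hypothetical, and the toy taglines and ratings are invented.

```python
import pandas as pd


def word_presence_correlation(df, text_column, rating_column, words):
    """Pearson correlation between word presence (0/1) and rating."""
    tokens = df[text_column].str.lower().str.split()
    result = {}
    for word in words:
        present = tokens.apply(lambda ws: int(word in ws))
        # A constant indicator has zero variance, so correlation is undefined.
        if present.nunique() < 2:
            continue
        result[word] = present.corr(df[rating_column])
    # Strongest relationships first, regardless of sign.
    return pd.Series(result).sort_values(key=abs, ascending=False)


toy = pd.DataFrame(
    {
        "processed_tagline": ["love story", "evil returns", "true love", "pure evil"],
        "letterboxd_rating": [4.0, 2.5, 3.8, 2.9],
    }
)
correlations = word_presence_correlation(
    toy, "processed_tagline", "letterboxd_rating", ["love", "evil", "story"]
)
print(correlations)
```

Because the indicator is binary, this is the point-biserial correlation, which is why rare words tend to produce small coefficients even when the films containing them are rated very differently.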


**Summary vs Ratings**

In [15]:
sentiment_utils.word_rating_correlation(df, text_column='processed_summary', rating_column='letterboxd_rating', top_n=20)

Top Words Correlated with Ratings:
story: 0.11
group: -0.11
friends: -0.10
love: 0.10
war: 0.09
evil: -0.09
game: -0.09
shark: -0.08
discover: -0.08
political: 0.08
sinister: -0.08
french: 0.08
mysterious: -0.08
killer: -0.08
men: 0.08
paranormal: -0.07
college: -0.07
terrifying: -0.07
creature: -0.07
fight: -0.07


**Genres vs Ratings**

In [16]:
sentiment_utils.word_rating_correlation(df, text_column='processed_genres', rating_column='letterboxd_rating', top_n=20)

Top Words Correlated with Ratings:
horror: -0.42
drama: 0.41
thriller: -0.24
fiction: -0.17
science: -0.17
documentary: 0.16
history: 0.14
crime: 0.12
action: -0.11
music: 0.11
war: 0.10
animation: 0.10
romance: 0.09
western: 0.07
movie: -0.05
tv: -0.05
fantasy: -0.05
mystery: -0.04
family: -0.03
adventure: -0.03


**Themes vs Ratings**

In [17]:
sentiment_utils.word_rating_correlation(df, text_column='processed_themes', rating_column='letterboxd_rating', top_n=20)

Top Words Correlated with Ratings:
horror: -0.37
monster: -0.36
classics: -0.35
undead: -0.35
drama: 0.31
powerful: 0.30
gory: -0.30
gruesome: -0.30
slasher: -0.30
world: 0.29
stories: 0.29
around: 0.28
humanity: 0.28
us: 0.28
moving: 0.27
aliens: -0.25
captivating: 0.25
creatures: -0.25
scifi: -0.24
life: 0.24


**Events vs Ratings**

In [18]:
sentiment_utils.word_rating_correlation(df, text_column='processed_events', rating_column='letterboxd_rating', top_n=20)

Top Words Correlated with Ratings:
jump: -0.22
scares: -0.22
audio: -0.12
mutilation: -0.11
gore: -0.11
eye: -0.11
hate: 0.10
speech: 0.10
domestic: 0.10
excessive: -0.09
incarceration: 0.09
child: 0.09
sad: 0.09
hospital: 0.08
cheating: 0.08
antisemitism: 0.08
age: 0.08
gap: 0.08
large: 0.08
abandonment: 0.08


## Conclusions

### **Tagline:**
Tagline words correlate only weakly with ratings: every coefficient in the top 20 lies between -0.08 and 0.11. *Love* (0.11), *film* (0.08), *world* (0.07), and *story* (0.07) lean toward higher ratings, while *evil* (-0.08) and *fear* (-0.06) lean toward lower ones. The sign of a coefficient gives the direction of the relationship and its magnitude gives the strength; at these magnitudes, no single tagline word is a meaningful predictor of rating.
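
Correlation strength can be made concrete: squaring a coefficient gives the share of rating variance that a single word accounts for linearly, and with a large sample even a weak correlation can be distinguishable from zero. A sketch using coefficients reported above and assuming n = 4203 (the row count stated for `cleaned_films`); the helper names are hypothetical.

```python
import math


def explained_variance(r: float) -> float:
    """Share of variance accounted for by a linear relationship (r squared)."""
    return r * r


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing whether a Pearson correlation differs from 0."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


n = 4203  # assumed sample size, taken from the row count stated earlier
for word, r in {"love (tagline)": 0.11, "drama (genres)": 0.41}.items():
    print(f"{word}: r^2 = {explained_variance(r):.3f}, t = {t_statistic(r, n):.1f}")
```

*Love* in taglines accounts for roughly 1% of rating variance and *drama* in genres for roughly 17%, so the genre signal is far stronger than any tagline word.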

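### **Summary:**
- Positive Correlation: Words or genres associated with higher ratings include *drama* (genres, 0.41), *documentary* (genres, 0.16), *love* (tagline, 0.11; summary, 0.10), *story* (summary, 0.11), and *political* (summary, 0.08).
- Negative Correlation: Words or genres associated with lower ratings include *horror* (genres, -0.42; themes, -0.37), *thriller* (genres, -0.24), *jump* and *scares* (events, -0.22 each), and *gore* (events, -0.11).
- Weak/Negligible Correlation: Some words have coefficients close to zero, such as *movie* (tagline, 0.05), *time* (tagline, 0.05), *life* (tagline, 0.06), *family* (genres, -0.03), and *adventure* (genres, -0.03).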

### **Genre:**
#### Positive Correlation (higher ratings):
- *Drama*: 0.41
- *Documentary*: 0.16
- *History*: 0.14
- *Crime*: 0.12
- *Music*: 0.11
- *War*: 0.10
- *Animation*: 0.10
- *Romance*: 0.09
- *Western*: 0.07

#### Negative Correlation (lower ratings):
- *Horror*: -0.42
- *Thriller*: -0.24
- *Science Fiction* (tokenized as *science* and *fiction*): -0.17
- *Action*: -0.11
- *TV Movie* (tokenized as *tv* and *movie*): -0.05
- *Fantasy*: -0.05
- *Mystery*: -0.04
- *Family*: -0.03
- *Adventure*: -0.03


### **Themes:**
#### Positive Correlation (higher ratings):
- *Drama*: 0.31
- *Powerful*: 0.30
- *World*: 0.29
- *Stories*: 0.29
- *Around*: 0.28
- *Humanity*: 0.28
- *Us*: 0.28
- *Moving*: 0.27
- *Captivating*: 0.25
- *Life*: 0.24

#### Negative Correlation (lower ratings):
- *Horror*: -0.37
- *Monster*: -0.36
- *Classics*: -0.35
- *Undead*: -0.35
- *Gory*: -0.30
- *Gruesome*: -0.30
- *Slasher*: -0.30
- *Aliens*: -0.25
- *Creatures*: -0.25
- *Scifi*: -0.24



### **Events:**
#### Positive Correlation (higher ratings):
- *Hate*: 0.10
- *Speech*: 0.10
- *Domestic*: 0.10
- *Incarceration*: 0.09
- *Child*: 0.09
- *Sad*: 0.09
- *Hospital*: 0.08
- *Cheating*: 0.08
- *Antisemitism*: 0.08
- *Age*: 0.08
- *Gap*: 0.08
- *Large*: 0.08
- *Abandonment*: 0.08

#### Negative Correlation (lower ratings):
- *Jump*: -0.22
- *Scares*: -0.22
- *Audio*: -0.12
- *Mutilation*: -0.11
- *Gore*: -0.11
- *Eye*: -0.11
- *Excessive*: -0.09