# **NLP on Letterboxd df**

- Check word frequency in several columns -> 'tagline', 'summary', 'genres', 'themes', 'events'
- Based on correlations, rate from 0 to 10
- Tie back to whether a title has warnings or not
- Figure how it impacts sentiment on movies
- Sentiment-o-meter

## **Metadata**

- `cleaned_films` contains 5301 titles and 20 columns.

- There are no Null values.

| **Column Name**        | **Data Type**   | **Description**                                                                 |
|------------------------|-----------------|----------------------------------------------------------------------------------|
| **title**              | object          | The title of the movie.                                                          |
| **release_year**       | int64           | The year the movie was released.                                                 |
| **tagline**            | object          | The movie's tagline (promotional phrase).                                        |
| **summary**            | object          | A brief description of the movie's plot.                                         |
| **runtime**            | int64           | The total runtime of the movie in minutes.                                       |
| **letterboxd_rating**  | float64         | The movie's average rating on Letterboxd.                                        |
| **genres**             | object          | A list of genres the movie belongs to (e.g., Drama, Comedy).                     |
| **language**           | object          | The languages the movie was produced in.                                         |
| **countries**          | object          | The countries where the movie was made or released.                              |
| **themes**             | object          | The central themes explored in the movie (e.g., Love, War, Friendship).          |
| **director**           | object          | The director(s) of the movie.                                                   |
| **events**             | object          | Key events or warnings in the movie (e.g., violence, strong language).           |
| **has_warnings**       | bool            | A boolean indicating if the movie contains warnings for sensitive content.       |


In [1]:
import pandas as pd
import nltk 
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.utils import resample

In [2]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /home/bru/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

Import reusable functions for sentiment analysis from /utils folder.

In [3]:
import sys
sys.path.append('../utils')
import sentiment_utils

In [24]:

films = pd.read_csv('../data/clean/tmdb_clean_films.csv')
films.rename(columns={'overview' : 'summary'}, inplace=True)
films.head(2)

Unnamed: 0,tmdb_id,imdb_id,doesthedog_id,title,original_title,genres,director,release_year,runtime,budget,...,tmdb_rating,tmdb_votes,imdb_rating,imdb_votes,language,countries,summary,tagline,events,has_warnings
0,5,tt0113101,62268.0,Four Rooms,Four Rooms,comedy,"Quentin Tarantino, Robert Rodriguez, Alexandre...",1995,98,4000000,...,5.8,2628,6.7,112798,English,USA,It's Ted the Bellhop's first night on the job....,Twelve outrageous guests. Four scandalous requ...,"blood or gore, needles or syringes are used, d...",True
1,6,tt0107286,236737.0,Judgment Night,Judgment Night,"action, crime, thriller",Stephen Hopkins,1993,109,21000000,...,6.5,331,6.6,19361,English,USA,"Four young friends, while taking a shortcut en...",Don't move. Don't whisper. Don't even breathe.,"car crashes, drownings, people getting hit by ...",True


In [5]:
cleaned_films = films.copy()
cleaned_films.drop(columns=['doesthedog_id', 'tmdb_id', 'imdb_id'], inplace=True)

In [6]:
cleaned_films = cleaned_films.dropna()


In [7]:
display(cleaned_films)


Unnamed: 0,title,original_title,genres,director,release_year,runtime,budget,revenue,profit,popularity,tmdb_rating,tmdb_votes,imdb_rating,imdb_votes,language,countries,summary,tagline,events,has_warnings
0,Four Rooms,Four Rooms,comedy,"Quentin Tarantino, Robert Rodriguez, Alexandre...",1995,98,4000000,4257354,257354,21.3,5.8,2628,6.7,112798,English,USA,It's Ted the Bellhop's first night on the job....,Twelve outrageous guests. Four scandalous requ...,"blood or gore, needles or syringes are used, d...",True
1,Judgment Night,Judgment Night,"action, crime, thriller",Stephen Hopkins,1993,109,21000000,12136938,-8863062,8.9,6.5,331,6.6,19361,English,USA,"Four young friends, while taking a shortcut en...",Don't move. Don't whisper. Don't even breathe.,"car crashes, drownings, people getting hit by ...",True
2,Star Wars,Star Wars,"adventure, action, science fiction",George Lucas,1977,121,11000000,775398007,764398007,98.8,8.2,20622,8.6,1482739,English,USA,Princess Leia is captured and held hostage by ...,"A long time ago in a galaxy far, far away...","people being burned alive, flashing lights or ...",True
3,Finding Nemo,Finding Nemo,"animation, family",Andrew Stanton,2003,100,94000000,940335536,846335536,125.7,7.8,19241,8.2,1139333,English,USA,"Nemo, an adventurous young clownfish, is unexp...",There are 3.7 trillion fish in the ocean. They...,"kids dying, jump scares, parents dying, spitti...",True
4,Forrest Gump,Forrest Gump,"comedy, drama, romance",Robert Zemeckis,1994,142,55000000,677387716,622387716,134.8,8.5,27494,8.8,2326538,English,USA,A man with a low IQ has accomplished great thi...,The world will never be the same once you've s...,"parents dying, shower scenes, shaving or cutti...",True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9775,The Mouse Trap,The Mouse Trap,"horror, thriller",Jamie Bailey,2024,80,800000,5737,-794263,70.3,4.3,41,2.5,2338,English,Canada,"It's Alex's 21st Birthday, but she's stuck at ...",This is not the funhouse.,"jump scares, flashing lights or images, blood ...",True
9776,Longlegs,Longlegs,"horror, thriller, crime",Osgood Perkins,2024,101,10000000,126388179,116388179,153.9,6.6,1475,6.7,141019,English,"Canada, USA",FBI Agent Lee Harker is a gifted new recruit a...,Every year there's another,"kids dying, jump scares, flashing lights or im...",True
9781,Moana 2,Moana 2,"animation, adventure, family, comedy","David G. Derrick Jr., Dana Ledoux Miller, Jaso...",2024,100,150000000,600055655,450055655,4485.0,6.8,424,7.1,30852,English,"Canada, USA",After receiving an unexpected call from her wa...,The ocean is calling them back.,"flashing lights or images, ghosts, bugs, restr...",True
9782,Sound of Hope: The Story of Possum Trot,Sound of Hope: The Story of Possum Trot,drama,Joshua Weigel,2024,130,8500000,11721425,3221425,58.5,6.7,23,7.1,1782,English,USA,"Led by Donna and Reverend W.C. Martin, 22 fami...",The fight for kids begins.,"hate speech, child abuse, minority misrepresen...",True


## **Word Count**
- Display most commong words in *tagline*, *summary*, *genre*, *themes* and *events* columns.

In [8]:
cleaned_films.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5301 entries, 0 to 9783
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           5301 non-null   object 
 1   original_title  5301 non-null   object 
 2   genres          5301 non-null   object 
 3   director        5301 non-null   object 
 4   release_year    5301 non-null   int64  
 5   runtime         5301 non-null   int64  
 6   budget          5301 non-null   int64  
 7   revenue         5301 non-null   int64  
 8   profit          5301 non-null   int64  
 9   popularity      5301 non-null   float64
 10  tmdb_rating     5301 non-null   float64
 11  tmdb_votes      5301 non-null   int64  
 12  imdb_rating     5301 non-null   float64
 13  imdb_votes      5301 non-null   int64  
 14  language        5301 non-null   object 
 15  countries       5301 non-null   object 
 16  summary         5301 non-null   object 
 17  tagline         5301 non-null   object

**Tagline Word Count**

In [9]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='tagline', top_n=30)

Most common words in tagline:
[('one', 370), ('love', 278), ('never', 214), ('story', 208), ('life', 198), ('man', 195), ('world', 187), ('get', 140), ('time', 135), ('back', 130), ('every', 129), ('hes', 126), ('new', 125), ('everything', 111), ('dont', 104), ('ever', 100), ('adventure', 96), ('cant', 92), ('true', 88), ('family', 87), ('nothing', 84), ('theyre', 83), ('go', 83), ('like', 81), ('two', 79), ('comes', 73), ('evil', 73), ('come', 72), ('way', 71), ('begins', 68)]


[('one', 370),
 ('love', 278),
 ('never', 214),
 ('story', 208),
 ('life', 198),
 ('man', 195),
 ('world', 187),
 ('get', 140),
 ('time', 135),
 ('back', 130),
 ('every', 129),
 ('hes', 126),
 ('new', 125),
 ('everything', 111),
 ('dont', 104),
 ('ever', 100),
 ('adventure', 96),
 ('cant', 92),
 ('true', 88),
 ('family', 87),
 ('nothing', 84),
 ('theyre', 83),
 ('go', 83),
 ('like', 81),
 ('two', 79),
 ('comes', 73),
 ('evil', 73),
 ('come', 72),
 ('way', 71),
 ('begins', 68)]

**Summary Word Count**

In [10]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='summary', top_n=20)

Most common words in summary:
[('new', 789), ('life', 759), ('young', 696), ('one', 620), ('world', 582), ('family', 561), ('must', 535), ('find', 491), ('two', 486), ('man', 459), ('finds', 373), ('love', 364), ('years', 348), ('friends', 345), ('help', 340), ('home', 326), ('woman', 321), ('becomes', 301), ('father', 296), ('soon', 286)]


[('new', 789),
 ('life', 759),
 ('young', 696),
 ('one', 620),
 ('world', 582),
 ('family', 561),
 ('must', 535),
 ('find', 491),
 ('two', 486),
 ('man', 459),
 ('finds', 373),
 ('love', 364),
 ('years', 348),
 ('friends', 345),
 ('help', 340),
 ('home', 326),
 ('woman', 321),
 ('becomes', 301),
 ('father', 296),
 ('soon', 286)]

**Genre Word Count**

In [11]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='genres', top_n=20)

Most common words in genres:
[('drama', 2246), ('comedy', 1747), ('thriller', 1556), ('action', 1333), ('adventure', 1062), ('horror', 977), ('crime', 872), ('romance', 836), ('science', 750), ('fiction', 750), ('fantasy', 621), ('family', 558), ('mystery', 555), ('animation', 330), ('history', 254), ('war', 178), ('western', 93), ('tv', 6), ('movie', 6)]


[('drama', 2246),
 ('comedy', 1747),
 ('thriller', 1556),
 ('action', 1333),
 ('adventure', 1062),
 ('horror', 977),
 ('crime', 872),
 ('romance', 836),
 ('science', 750),
 ('fiction', 750),
 ('fantasy', 621),
 ('family', 558),
 ('mystery', 555),
 ('animation', 330),
 ('history', 254),
 ('war', 178),
 ('western', 93),
 ('tv', 6),
 ('movie', 6)]

**Events Word Count**

In [12]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='events', top_n=20)

Most common words in events:
[('dying', 6410), ('scenes', 4888), ('sexual', 4726), ('gore', 4532), ('dies', 4263), ('animals', 3550), ('abuse', 3464), ('violence', 3221), ('people', 3039), ('car', 2824), ('blood', 2779), ('parents', 2332), ('gun', 2237), ('content', 2236), ('someone', 2055), ('restraints', 1914), ('choking', 1906), ('mutilation', 1820), ('suicide', 1810), ('child', 1585)]


[('dying', 6410),
 ('scenes', 4888),
 ('sexual', 4726),
 ('gore', 4532),
 ('dies', 4263),
 ('animals', 3550),
 ('abuse', 3464),
 ('violence', 3221),
 ('people', 3039),
 ('car', 2824),
 ('blood', 2779),
 ('parents', 2332),
 ('gun', 2237),
 ('content', 2236),
 ('someone', 2055),
 ('restraints', 1914),
 ('choking', 1906),
 ('mutilation', 1820),
 ('suicide', 1810),
 ('child', 1585)]

## **Word Correlation**

- Find the correlation between ratings and tagline, summary, genre, themes and events.
- Bar plots displaying positive, neutral or negative related rating average.


In [13]:
df = cleaned_films.copy()

Taglines vs Ratings

In [14]:
sentiment_utils.word_rating_correlation(df, text_column='processed_tagline', rating_column='popularity', top_n=20)

Top Words Correlated with Ratings:
begins: 0.06
heroes: 0.05
woman: -0.05
murder: -0.05
happened: -0.05
dead: -0.04
story: -0.04
comedy: -0.04
problem: 0.04
two: -0.04
girl: -0.04
good: -0.04
saga: 0.04
guy: -0.04
bond: 0.04
love: -0.04
affair: -0.04
evil: -0.04
ends: 0.04
living: -0.04


Summary vs Ratings

In [16]:
sentiment_utils.word_rating_correlation(df, text_column='processed_summary', rating_column='popularity', top_n=20)

Top Words Correlated with Ratings:
world: 0.09
must: 0.08
mission: 0.08
worlds: 0.07
battle: 0.07
warrior: 0.06
prince: 0.06
enemy: 0.06
fate: 0.06
spiderman: 0.06
threat: 0.06
gay: -0.06
adventure: 0.06
humans: 0.06
film: -0.06
team: 0.06
skills: 0.06
planet: 0.06
hogwarts: 0.06
epic: 0.06


Genres vs Ratings

In [17]:
sentiment_utils.word_rating_correlation(df, text_column='processed_genres', rating_column='popularity', top_n=20)

Top Words Correlated with Ratings:
adventure: 0.22
action: 0.22
drama: -0.17
animation: 0.17
family: 0.15
fantasy: 0.14
fiction: 0.10
science: 0.10
romance: -0.08
comedy: -0.05
horror: -0.05
movie: -0.04
tv: -0.04
crime: -0.04
mystery: -0.03
history: -0.03
western: -0.02
war: 0.01
thriller: 0.00


Events vs Ratings

In [18]:
sentiment_utils.word_rating_correlation(df, text_column='processed_events', rating_column='popularity', top_n=20)

Top Words Correlated with Ratings:
screaming: 0.31
dies: 0.30
restraints: 0.29
loud: 0.29
noises: 0.29
sudden: 0.29
someone: 0.29
water: 0.27
unconscious: 0.27
bodies: 0.26
death: 0.25
selfsacrifice: 0.25
watched: 0.25
family: 0.24
choking: 0.24
scenes: 0.23
character: 0.23
major: 0.23
car: 0.23
stabbings: 0.22


## Conclusions

## **Conclusions**

### **Taglines vs Ratings**
- Words like *begins* and *heroes* positively correlate with higher ratings.  
- Words such as *woman*, *murder*, and *love* show slight negative correlations with ratings.  
- Neutral or commonly used words (*good*, *story*, *comedy*) tend to have weak or negligible impact.


### **Summary vs Ratings**
- Words like *world*, *mission*, *battle*, and *epic* positively influence ratings, suggesting an association with grand or action-oriented themes.  
- Negative correlations occur with words such as *gay* and *film*, which might reflect thematic or tonal biases in audience reception.


### **Genres vs Ratings**
- Genres like *adventure* and *action* have the strongest positive correlations, indicating audience preference for exciting and fast-paced content.  
- *Drama* shows a strong negative correlation (-0.17), while *romance*, *horror*, and *comedy* have slight negative associations.  


### **Events vs Ratings**
- Events involving intensity or high stakes (*screaming*, *dies*, *restraints*, *death*, *self-sacrifice*) show strong positive correlations with ratings.  
- This suggests that dramatic, emotionally charged, or suspenseful scenes resonate with audiences.  


### **Overall**
- Positive correlations are associated with action, adventure, and high-intensity words or genres.  
- Negative correlations reflect themes like horror, emotional drama, or specific words (e.g., *love*, *murder*, *film*) that may divide audience opinions.  

---

### **Taglines vs Ratings**  
**Top Words Correlated with Ratings:**  
- *begins*: 0.06  
- *heroes*: 0.05  
- *woman*: -0.05  
- *murder*: -0.05  
- *happened*: -0.05  
- *dead*: -0.04  
- *story*: -0.04  
- *comedy*: -0.04  
- *problem*: 0.04  
- *two*: -0.04  
- *girl*: -0.04  
- *good*: -0.04  
- *saga*: 0.04  
- *guy*: -0.04  
- *bond*: 0.04  
- *love*: -0.04  
- *affair*: -0.04  
- *evil*: -0.04  
- *ends*: 0.04  
- *living*: -0.04  


### **Summary vs Ratings**  
**Top Words Correlated with Ratings:**  
- *world*: 0.09  
- *must*: 0.08  
- *mission*: 0.08  
- *worlds*: 0.07  
- *battle*: 0.07  
- *warrior*: 0.06  
- *prince*: 0.06  
- *enemy*: 0.06  
- *fate*: 0.06  
- *spiderman*: 0.06  
- *threat*: 0.06  
- *gay*: -0.06  
- *adventure*: 0.06  
- *humans*: 0.06  
- *film*: -0.06  
- *team*: 0.06  
- *skills*: 0.06  
- *planet*: 0.06  
- *hogwarts*: 0.06  
- *epic*: 0.06  



### **Genres vs Ratings**  
**Top Words Correlated with Ratings:**  
- *adventure*: 0.22  
- *action*: 0.22  
- *drama*: -0.17  
- *animation*: 0.17  
- *family*: 0.15  
- *fantasy*: 0.14  
- *fiction*: 0.10  
- *science*: 0.10  
- *romance*: -0.08  
- *comedy*: -0.05  
- *horror*: -0.05  
- *movie*: -0.04  
- *tv*: -0.04  
- *crime*: -0.04  
- *mystery*: -0.03  
- *history*: -0.03  
- *western*: -0.02  
- *war*: 0.01  
- *thriller*: 0.00  



### **Events vs Ratings**  
**Top Words Correlated with Ratings:**  
- *screaming*: 0.31  
- *dies*: 0.30  
- *restraints*: 0.29  
- *loud*: 0.29  
- *noises*: 0.29  
- *sudden*: 0.29  
- *someone*: 0.29  
- *water*: 0.27  
- *unconscious*: 0.27  
- *bodies*: 0.26  
- *death*: 0.25  
- *selfsacrifice*: 0.25  
- *watched*: 0.25  
- *family*: 0.24  
- *choking*: 0.24  
- *scenes*: 0.23  
- *character*: 0.23  
- *major*: 0.23  
- *car*: 0.23  
- *stabbings*: 0.22  


In [25]:
regression_df = films.copy()

In [26]:
regression_df.head(3)

Unnamed: 0,tmdb_id,imdb_id,doesthedog_id,title,original_title,genres,director,release_year,runtime,budget,...,tmdb_rating,tmdb_votes,imdb_rating,imdb_votes,language,countries,summary,tagline,events,has_warnings
0,5,tt0113101,62268.0,Four Rooms,Four Rooms,comedy,"Quentin Tarantino, Robert Rodriguez, Alexandre...",1995,98,4000000,...,5.8,2628,6.7,112798,English,USA,It's Ted the Bellhop's first night on the job....,Twelve outrageous guests. Four scandalous requ...,"blood or gore, needles or syringes are used, d...",True
1,6,tt0107286,236737.0,Judgment Night,Judgment Night,"action, crime, thriller",Stephen Hopkins,1993,109,21000000,...,6.5,331,6.6,19361,English,USA,"Four young friends, while taking a shortcut en...",Don't move. Don't whisper. Don't even breathe.,"car crashes, drownings, people getting hit by ...",True
2,11,tt0076759,27949.0,Star Wars,Star Wars,"adventure, action, science fiction",George Lucas,1977,121,11000000,...,8.2,20622,8.6,1482739,English,USA,Princess Leia is captured and held hostage by ...,"A long time ago in a galaxy far, far away...","people being burned alive, flashing lights or ...",True


Try it with equal true and false rows

In [28]:
X = regression_df[['genres', 'tagline', 'summary', 'events']]
y = regression_df['has_warnings']  

X_combined = X['genres'] + ' ' + X['tagline'] + ' ' + X['summary'] + ' ' + X['events']

# fill NaN values in the combined text column
X_combined_filled = X_combined.fillna("missing")

# vectorization to transform the text data
vectorizer = TfidfVectorizer(max_features=1000)
X_tfidf = vectorizer.fit_transform(X_combined_filled)

# sparse matrix to a dense DataFrame
X_tfidf_dense = pd.DataFrame(X_tfidf.toarray(), columns=vectorizer.get_feature_names_out())

#combine TF-IDF features with the target column
df_balanced = pd.concat([X_tfidf_dense, y], axis=1)

# majority and minority classes
df_majority = df_balanced[df_balanced['has_warnings'] == False]
df_minority = df_balanced[df_balanced['has_warnings'] == True]

# undersample the majority class
df_majority_undersampled = resample(
    df_majority,
    replace=False, 
    n_samples=len(df_majority),  
    random_state=42  
)

# combine majority class with the minority class
df_balanced = pd.concat([df_majority_undersampled, df_minority])

df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

# split balanced data into features (X) and target (y)
X_balanced = df_balanced.drop(columns='has_warnings')
y_balanced = df_balanced['has_warnings']

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size=0.3, random_state=42)

# train a Logistic Regression Model
model = LogisticRegression()
model.fit(X_train, y_train)

# make predictions
y_pred = model.predict(X_test)

# evaluate
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

       False       0.91      1.00      0.95      1261
        True       1.00      0.93      0.96      1675

    accuracy                           0.96      2936
   macro avg       0.96      0.96      0.96      2936
weighted avg       0.96      0.96      0.96      2936

