# **NLP on Letterboxd df**

- Check word frequency in several columns -> 'tagline', 'summary', 'genres', 'themes', 'events'
- Based on correlations, rate from 0 to 10
- Tie back to whether a title has warnings or not
- Figure out how it impacts sentiment toward movies
- Sentiment-o-meter

## **Metadata**

- `cleaned_films` contains 5301 titles and 20 columns.

- There are no Null values.

| **Column Name**        | **Data Type**   | **Description**                                                                 |
|------------------------|-----------------|----------------------------------------------------------------------------------|
| **title**              | object          | The title of the movie.                                                          |
| **release_year**       | int64           | The year the movie was released.                                                 |
| **tagline**            | object          | The movie's tagline (promotional phrase).                                        |
| **summary**            | object          | A brief description of the movie's plot.                                         |
| **runtime**            | int64           | The total runtime of the movie in minutes.                                       |
| **letterboxd_rating**  | float64         | The movie's average rating on Letterboxd.                                        |
| **genres**             | object          | A list of genres the movie belongs to (e.g., Drama, Comedy).                     |
| **language**           | object          | The languages the movie was produced in.                                         |
| **countries**          | object          | The countries where the movie was made or released.                              |
| **themes**             | object          | The central themes explored in the movie (e.g., Love, War, Friendship).          |
| **director**           | object          | The director(s) of the movie.                                                   |
| **events**             | object          | Key events or warnings in the movie (e.g., violence, strong language).           |
| **has_warnings**       | bool            | A boolean indicating if the movie contains warnings for sensitive content.       |


In [1]:
import pandas as pd
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.utils import resample

In [None]:
nltk.download('punkt_tab')

Import reusable sentiment-analysis functions from the `/utils` folder.

In [3]:
import sys
sys.path.append('../utils')
import sentiment_utils

In [None]:

films = pd.read_csv('../data/clean/tmdb_clean_films.csv')
films.rename(columns={'overview' : 'summary'}, inplace=True)
films.head(2)

In [5]:
cleaned_films = films.copy()
cleaned_films.drop(columns=['doesthedog_id', 'tmdb_id', 'imdb_id', 'original_title', 'imdb_votes', 'tmdb_votes'], inplace=True)

In [6]:
cleaned_films = cleaned_films.dropna()


In [None]:
display(cleaned_films)


## **Word Count**
- Display the most common words in the *tagline*, *summary*, *genres*, *themes* and *events* columns.
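
The helper `sentiment_utils.analyze_most_common_words` lives in the `/utils` folder and is not shown in this notebook. As a rough, dependency-free sketch of the kind of counting such a helper might perform (the function name `most_common_words`, the regex tokenizer and the tiny stop-word set are all assumptions for illustration, not the actual implementation):

```python
import re
from collections import Counter

# Hypothetical stop-word set for illustration; the real helper may use a fuller list.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is", "his", "her"}

def most_common_words(texts, top_n=20):
    """Return the top_n (word, count) pairs across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep alphabetic tokens only (a simplification).
        tokens = re.findall(r"[a-z]+", str(text).lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)

# Made-up taglines, not rows from the dataset.
sample_taglines = [
    "The saga begins",
    "A love story",
    "The story of a hero",
]
top_words = most_common_words(sample_taglines, top_n=3)
print(top_words)
```

On the real data, the equivalent call would look something like `most_common_words(cleaned_films['tagline'], top_n=30)`.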

In [None]:
cleaned_films.info()

**Tagline Word Count**

In [None]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='tagline', top_n=30)

**Summary Word Count**

In [None]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='summary', top_n=20)

**Genre Word Count**

In [None]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='genres', top_n=20)

**Events Word Count**

In [None]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='events', top_n=20)

## **Word Correlation**

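- Find the correlation between ratings and the words in the tagline, summary, genres, themes and events columns. Note that the cells below pass `popularity` as the rating column.
- Show bar plots of the words most positively or negatively associated with the average rating, with near-zero values read as neutral.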
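
`sentiment_utils.word_rating_correlation` is not shown in this notebook. One common way to get a per-word correlation is to code each title as 1 if the word appears and 0 otherwise, then compute the Pearson correlation between that indicator and the rating column. The sketch below works under that assumption; `word_presence_correlation` is a hypothetical name and the data is made up:

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def word_presence_correlation(texts, ratings, word):
    """Correlate a 0/1 'word appears in text' indicator with the ratings."""
    present = [1.0 if word in text.lower().split() else 0.0 for text in texts]
    return pearson(present, ratings)

# Made-up summaries and ratings, not rows from the dataset.
texts = ["an epic battle", "a quiet drama", "epic space battle", "family drama"]
ratings = [8.0, 5.0, 9.0, 6.0]

r_battle = word_presence_correlation(texts, ratings, "battle")
r_drama = word_presence_correlation(texts, ratings, "drama")
print(f"battle: {r_battle:+.2f}, drama: {r_drama:+.2f}")
```

A positive value means titles containing the word tend to have higher ratings than titles without it; it says nothing about whether the word causes the difference.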


In [13]:
df = cleaned_films.copy()

Taglines vs Ratings

In [None]:
sentiment_utils.word_rating_correlation(df, text_column='processed_tagline', rating_column='popularity', top_n=20)

Summary vs Ratings

In [None]:
sentiment_utils.word_rating_correlation(df, text_column='processed_summary', rating_column='popularity', top_n=20)

Genres vs Ratings

In [None]:
sentiment_utils.word_rating_correlation(df, text_column='processed_genres', rating_column='popularity', top_n=20)

Events vs Ratings

In [None]:
sentiment_utils.word_rating_correlation(df, text_column='processed_events', rating_column='popularity', top_n=20)

## **Conclusions**

### **Taglines vs Ratings**
- Words like *begins* and *heroes* show weak positive correlations with ratings (about 0.05 to 0.06).
- Words such as *woman*, *murder*, and *love* show slight negative correlations with ratings (about -0.04 to -0.05).
- Common words (*good*, *story*, *comedy*) also sit around -0.04, so no tagline word has more than a negligible association with ratings.


### **Summary vs Ratings**
- Words like *world*, *mission*, *battle*, and *epic* correlate positively with ratings (0.06 to 0.09), suggesting an association with grand or action-oriented themes.
- Negative correlations occur with words such as *gay* and *film* (both -0.06), which might reflect thematic or tonal biases in audience reception.


### **Genres vs Ratings**
- Genres like *adventure* and *action* have the strongest positive correlations (0.22 each), which may indicate an audience preference for exciting and fast-paced content.
- *Drama* shows the strongest negative correlation (-0.17), while *romance*, *horror*, and *comedy* have slight negative associations (-0.05 to -0.08).


### **Events vs Ratings**
- Events involving intensity or high stakes (*screaming*, *dies*, *restraints*, *death*, *self-sacrifice*) show the largest positive correlations in this analysis (0.25 to 0.31), which is moderate rather than strong.
- This suggests that dramatic, emotionally charged, or suspenseful scenes may resonate with audiences.


### **Overall**
- Positive correlations are associated with action, adventure, and high-intensity words or genres.  
- Negative correlations reflect themes like horror, emotional drama, or specific words (e.g., *love*, *murder*, *film*) that may divide audience opinions.  
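
To put these magnitudes in context, a Pearson coefficient `r` accounts for roughly `r**2` of the variance, and its statistical significance can be gauged with `t = r * sqrt(n - 2) / sqrt(1 - r**2)`. The sketch below assumes `n = 5301` (the title count quoted in the Metadata section; the effective sample size for each column may differ) and uses only the standard library:

```python
import math

def correlation_summary(r, n):
    """Return (variance explained, t statistic) for a Pearson r over n samples."""
    r_squared = r ** 2
    t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r_squared)
    return r_squared, t_stat

N_TITLES = 5301  # assumed sample size, taken from the Metadata section

examples = [
    ("tagline 'love'", -0.04),
    ("genre 'action'", 0.22),
    ("event 'screaming'", 0.31),
]
for label, r in examples:
    r2, t = correlation_summary(r, N_TITLES)
    print(f"{label}: r={r:+.2f}, r^2={r2:.4f}, t={t:.1f}")
```

Under these assumptions, even the largest coefficient reported here (0.31) corresponds to roughly 10% of the variance, so the associations are better read as tendencies than as strong effects.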

---

### **Taglines vs Ratings**  
**Top Words Correlated with Ratings:**  
- *begins*: 0.06  
- *heroes*: 0.05  
- *woman*: -0.05  
- *murder*: -0.05  
- *happened*: -0.05  
- *dead*: -0.04  
- *story*: -0.04  
- *comedy*: -0.04  
- *problem*: 0.04  
- *two*: -0.04  
- *girl*: -0.04  
- *good*: -0.04  
- *saga*: 0.04  
- *guy*: -0.04  
- *bond*: 0.04  
- *love*: -0.04  
- *affair*: -0.04  
- *evil*: -0.04  
- *ends*: 0.04  
- *living*: -0.04  


### **Summary vs Ratings**  
**Top Words Correlated with Ratings:**  
- *world*: 0.09  
- *must*: 0.08  
- *mission*: 0.08  
- *worlds*: 0.07  
- *battle*: 0.07  
- *warrior*: 0.06  
- *prince*: 0.06  
- *enemy*: 0.06  
- *fate*: 0.06  
- *spiderman*: 0.06  
- *threat*: 0.06  
- *gay*: -0.06  
- *adventure*: 0.06  
- *humans*: 0.06  
- *film*: -0.06  
- *team*: 0.06  
- *skills*: 0.06  
- *planet*: 0.06  
- *hogwarts*: 0.06  
- *epic*: 0.06  



### **Genres vs Ratings**  
**Top Words Correlated with Ratings:**  
- *adventure*: 0.22  
- *action*: 0.22  
- *drama*: -0.17  
- *animation*: 0.17  
- *family*: 0.15  
- *fantasy*: 0.14  
- *fiction*: 0.10  
- *science*: 0.10  
- *romance*: -0.08  
- *comedy*: -0.05  
- *horror*: -0.05  
- *movie*: -0.04  
- *tv*: -0.04  
- *crime*: -0.04  
- *mystery*: -0.03  
- *history*: -0.03  
- *western*: -0.02  
- *war*: 0.01  
- *thriller*: 0.00  



### **Events vs Ratings**  
**Top Words Correlated with Ratings:**  
- *screaming*: 0.31  
- *dies*: 0.30  
- *restraints*: 0.29  
- *loud*: 0.29  
- *noises*: 0.29  
- *sudden*: 0.29  
- *someone*: 0.29  
- *water*: 0.27  
- *unconscious*: 0.27  
- *bodies*: 0.26  
- *death*: 0.25  
- *selfsacrifice*: 0.25  
- *watched*: 0.25  
- *family*: 0.24  
- *choking*: 0.24  
- *scenes*: 0.23  
- *character*: 0.23  
- *major*: 0.23  
- *car*: 0.23  
- *stabbings*: 0.22  


In [18]:
regression_df = films.copy()

In [None]:
regression_df.head(3)

Try it with an equal number of `True` and `False` rows in `has_warnings`.
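
Balancing here means randomly undersampling the larger class down to the size of the smaller one, so both labels end up with the same number of rows. A dependency-free sketch of that idea on made-up data (`undersample_to_balance` is a hypothetical helper, not part of this notebook):

```python
import random

def undersample_to_balance(rows, label_of, seed=42):
    """Randomly drop rows from the larger class until both classes are equal in size."""
    positives = [r for r in rows if label_of(r)]
    negatives = [r for r in rows if not label_of(r)]
    n_keep = min(len(positives), len(negatives))
    rng = random.Random(seed)  # fixed seed for reproducibility
    balanced = rng.sample(positives, n_keep) + rng.sample(negatives, n_keep)
    rng.shuffle(balanced)  # avoid leaving the classes in two contiguous blocks
    return balanced

# Made-up (text, has_warnings) pairs: 2 positives, 5 negatives.
rows = [("a", True), ("b", True), ("c", False), ("d", False),
        ("e", False), ("f", False), ("g", False)]
balanced = undersample_to_balance(rows, label_of=lambda r: r[1])
print(len(balanced), sum(1 for r in balanced if r[1]))
```

The key detail is that the number of rows kept from the larger class equals the size of the smaller class.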

In [None]:
X = regression_df[['genres', 'tagline', 'summary', 'events']]
y = regression_df['has_warnings']  

# fill NaN per column before joining; otherwise one missing field turns the whole row into NaN
X_filled = X.fillna('')
X_combined_filled = X_filled['genres'] + ' ' + X_filled['tagline'] + ' ' + X_filled['summary'] + ' ' + X_filled['events']

# vectorization to transform the text data
vectorizer = TfidfVectorizer(max_features=1000)
X_tfidf = vectorizer.fit_transform(X_combined_filled)

# sparse matrix to a dense DataFrame
X_tfidf_dense = pd.DataFrame(X_tfidf.toarray(), columns=vectorizer.get_feature_names_out())

#combine TF-IDF features with the target column
df_balanced = pd.concat([X_tfidf_dense, y], axis=1)

# split by class (assumes has_warnings == False is the majority class;
# if it is not, resample(replace=False) below raises a ValueError)
df_majority = df_balanced[df_balanced['has_warnings'] == False]
df_minority = df_balanced[df_balanced['has_warnings'] == True]

# undersample the majority class down to the minority class size
df_majority_undersampled = resample(
    df_majority,
    replace=False,
    n_samples=len(df_minority),
    random_state=42
)

# combine majority class with the minority class
df_balanced = pd.concat([df_majority_undersampled, df_minority])

df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

# split balanced data into features (X) and target (y)
X_balanced = df_balanced.drop(columns='has_warnings')
y_balanced = df_balanced['has_warnings']

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size=0.3, random_state=42)

# train a Logistic Regression Model
model = LogisticRegression()
model.fit(X_train, y_train)

# make predictions
y_pred = model.predict(X_test)

# evaluate
print(classification_report(y_test, y_pred))
