# **NLP on Letterboxd df**

- Check word frequency in the text columns: `tagline`, `summary`, `genres`, `themes`, `events`
- Based on correlations, rate from 0 to 5
- Tie back to whether a title has warnings or not
- Figure out how this impacts sentiment on movies
- Sentiment-o-meter

## **Metadata**

- `cleaned_films` contains 4203 titles and 13 columns.

- There are no Null values.

| **Column Name**        | **Data Type**   | **Description**                                                                 |
|------------------------|-----------------|----------------------------------------------------------------------------------|
| **title**              | object          | The title of the movie.                                                          |
| **release_year**       | int64           | The year the movie was released.                                                 |
| **tagline**            | object          | The movie's tagline (promotional phrase).                                        |
| **summary**            | object          | A brief description of the movie's plot.                                         |
| **runtime**            | int64           | The total runtime of the movie in minutes.                                       |
| **letterboxd_rating**  | float64         | The movie's average rating on Letterboxd.                                        |
| **genres**             | object          | A list of genres the movie belongs to (e.g., Drama, Comedy).                     |
| **language**           | object          | The languages the movie was produced in.                                         |
| **countries**          | object          | The countries where the movie was made or released.                              |
| **themes**             | object          | The central themes explored in the movie (e.g., Love, War, Friendship).          |
| **director**           | object          | The director(s) of the movie.                                                   |
| **events**             | object          | Key events or warnings in the movie (e.g., violence, strong language).           |
| **has_warnings**       | bool            | A boolean indicating if the movie contains warnings for sensitive content.       |


In [20]:
import pandas as pd
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.utils import resample
import plotly.express as px

In [None]:
nltk.download('punkt_tab')

Import the reusable sentiment-analysis functions from the `../utils` folder.

In [22]:
import sys
sys.path.append('../utils')
import sentiment_utils

In [None]:

films = pd.read_csv('../data/clean/letterboxd_clean_films.csv')
films.head(2)

In [24]:
films.dropna(subset=['genres', 'language', 'countries', 'director'], inplace=True)

In [25]:
cleaned_films = films.copy()
cleaned_films.dropna(inplace=True)

In [None]:
display(cleaned_films)
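The row, column and null counts quoted in the Metadata section can be checked directly rather than read off the displayed frame. The sketch below is one way to do that; `summarize_frame` is a hypothetical helper and the toy frame only stands in for `films`, so the numbers it prints are not the notebook's.

```python
import pandas as pd


def summarize_frame(frame: pd.DataFrame) -> dict:
    """Return row count, column count and total number of missing values."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "nulls": int(frame.isna().sum().sum()),
    }


# Toy stand-in for `films` (hypothetical values, not the real dataset).
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "tagline": ["One line.", None, "Another line."],
        "letterboxd_rating": [3.5, 4.1, 2.8],
    }
)

before = summarize_frame(toy)
after = summarize_frame(toy.dropna())
print(before)
print(after)
```

Calling `summarize_frame(cleaned_films)` in the notebook should then report the 4203 rows, 13 columns and zero nulls stated above, if the CSV matches that description.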


## **Word Count**
- Display the most common words in the *tagline*, *summary*, *genres*, *themes* and *events* columns.

**Tagline Word Count**

In [None]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='tagline', top_n=30)

**Summary Word Count**

In [None]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='summary', top_n=10)

**Genre Word Count**

In [None]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='genres', top_n=10)

**Themes Word Count**

In [None]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='themes', top_n=10)

**Events Word Count**

In [None]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='events', top_n=10)
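The source of `sentiment_utils.analyze_most_common_words` is not shown in this notebook. As a rough idea of the kind of counting such a helper might perform, here is a minimal standard-library sketch. The function name `most_common_words`, the tiny stopword set and the regex tokenizer are illustrative assumptions, not the helper's actual implementation.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real run would more
# likely use a fuller resource such as the NLTK stopword corpus.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for", "on"}


def most_common_words(texts, top_n=10):
    """Count lowercase word tokens across texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        if not isinstance(text, str):
            continue  # skip missing or non-string entries
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)


# Hypothetical taglines, for illustration only.
taglines = [
    "The story of a lifetime.",
    "A love story for the ages.",
    "Fear is the only story.",
]
top_words = most_common_words(taglines, top_n=3)
print(top_words)
```

In this toy input, "story" appears in all three taglines and so comes out on top; the remaining words each appear once.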

## **Word Correlation**

- Find the correlation between ratings and the words in tagline, summary, genres, themes and events.
- Bar plots show each word's Spearman correlation with the rating, which can be positive, close to zero or negative.
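To make the correlation measure concrete: Spearman's coefficient is the Pearson correlation computed on ranks, with tied values sharing the mean of their ranks. The sketch below works through that definition with the standard library only. The helper names and the toy data are illustrative; the notebook itself relies on `scipy.stats.spearmanr`.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns None if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: 1 if the word appears in a film's text, else 0.
word_present = [1, 1, 0, 0, 1, 0]
ratings = [2.1, 2.5, 3.9, 4.2, 2.8, 3.6]
rho = spearman(word_present, ratings)
print(round(rho, 2))  # -0.88
```

Here the films containing the word are exactly the three lowest-rated ones, so the coefficient is strongly negative. It does not reach -1 because the 0/1 indicator has many ties.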


In [32]:
df = cleaned_films.copy()

Taglines vs Ratings

In [None]:
sentiment_utils.word_rating_correlation(df, text_column='processed_tagline', rating_column='letterboxd_rating', top_n=10)

Summary vs Ratings

In [None]:
sentiment_utils.word_rating_correlation(df, text_column='processed_summary', rating_column='letterboxd_rating', top_n=10)

Genres vs Ratings

In [None]:
sentiment_utils.word_rating_correlation(df, text_column='processed_genres', rating_column='letterboxd_rating', top_n=10)

In [None]:
import pandas as pd
import plotly.express as px
from scipy.stats import spearmanr
from sklearn.feature_extraction.text import CountVectorizer


def word_rating_correlation(df, text_column, rating_column, top_n=10, save_path=None):
    """Plot the top_n words whose counts correlate most strongly with the ratings
    (ranked by absolute Spearman coefficient)."""
    # Work on a string copy so the caller's DataFrame is not modified
    texts = df[text_column].astype(str)
    ratings = df[rating_column].to_numpy()

    # Bag-of-words counts; CountVectorizer lowercases tokens by default
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(texts)
    word_count_df = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())

    # Spearman correlation between each word's count and the rating
    correlations = {}
    for word in word_count_df.columns:
        corr, _ = spearmanr(word_count_df[word], ratings)
        if pd.notna(corr):  # skip constant columns, whose correlation is undefined
            correlations[word] = corr

    # Rank words by absolute correlation, strongest first
    sorted_correlations = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)

    top_correlated_words = sorted_correlations[:top_n]
    print("Top Words Correlated with Ratings:")
    for word, corr in top_correlated_words:
        print(f"{word}: {corr:.2f}")

    words, corrs = zip(*top_correlated_words)

    # Capitalize the words for the plot
    words_capitalized = [word.capitalize() for word in words]

    fig = px.bar(x=words_capitalized, y=corrs, title=f'Word Correlation with Ratings: {text_column}',
                 labels={'x': '', 'y': 'Spearman Correlation'}, color=corrs,
                 color_continuous_scale='Temps_r')
    fig.update_layout(
        xaxis_tickangle=-45,
        plot_bgcolor='#f7f7f7',  # Background color for the plot area
        paper_bgcolor='#f7f7f7',  # Background color around the plot area
        width=900
    )

    if save_path:
        fig.write_image(save_path)
        print(f"Plot saved to {save_path}")

    fig.show()


# Example usage
word_rating_correlation(df, text_column='processed_genres', rating_column='letterboxd_rating', top_n=20)

Themes vs Ratings

In [None]:
sentiment_utils.word_rating_correlation(df, text_column='processed_themes', rating_column='letterboxd_rating', top_n=10)

Events vs Ratings

In [None]:
sentiment_utils.word_rating_correlation(df, text_column='processed_events', rating_column='letterboxd_rating', top_n=10)

## **Conclusions**

### **Tagline:**
The tagline words most strongly correlated with ratings are the ones shown in the tagline bar plot. A positive coefficient means that films whose tagline contains the word tend to have higher ratings, and a negative coefficient means they tend to have lower ratings. The sign gives the direction of the relationship and the absolute value its strength.

### **Summary:**
- Positive Correlation: Words or genres associated with higher ratings include *drama*, *documentary*, *love*, *story*, and *political*.
- Negative Correlation: Words or genres associated with lower ratings include *horror*, *thriller*, *jump*, *scares*, and *gore*.
- Weak/Negligible Correlation: Some words show little or no association with ratings, such as *movie*, *time*, *life*, and *friends*.

### **Genre:**
#### Positive Correlation (higher ratings):
- *Drama*: 0.41
- *Documentary*: 0.16
- *History*: 0.14
- *Crime*: 0.12
- *Music*: 0.11
- *War*: 0.10
- *Animation*: 0.10
- *Romance*: 0.09
- *Western*: 0.07

#### Negative Correlation (lower ratings):
- *Horror*: -0.42
- *Thriller*: -0.24
- *Fiction*: -0.17
- *Science*: -0.17
- *Action*: -0.11
- *Movie*: -0.05
- *TV*: -0.05
- *Fantasy*: -0.05
- *Mystery*: -0.04
- *Family*: -0.03
- *Adventure*: -0.03
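
Pairs such as *Science*/*Fiction* (both -0.17) and *TV*/*Movie* (both -0.05) look like fragments of the multi-word genres "Science Fiction" and "TV Movie", split apart by word-level tokenization. A minimal sketch of label-level tokenization follows, assuming each `genres` value is a comma-separated string. That format is an assumption, and `split_labels` and `label_indicators` are hypothetical helpers.

```python
def split_labels(cell, sep=","):
    """Split one cell into whole labels, keeping multi-word labels intact."""
    if not isinstance(cell, str):
        return []
    return [label.strip() for label in cell.split(sep) if label.strip()]


def label_indicators(cells, sep=","):
    """Build one 0/1 presence vector per label across all rows."""
    rows = [set(split_labels(cell, sep)) for cell in cells]
    labels = sorted(set().union(*rows)) if rows else []
    return {label: [int(label in row) for row in rows] for label in labels}


# Hypothetical genre strings; the real column format may differ.
genres = ["Science Fiction, Action", "Drama", "TV Movie, Drama"]
indicators = label_indicators(genres)
print(indicators)
```

Each indicator vector can then be correlated with the ratings in place of the per-word counts, so that "Science Fiction" gets a single coefficient. With scikit-learn, a similar effect should be achievable by passing a custom `tokenizer` to `CountVectorizer`.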


### **Themes:**
#### Positive Correlation (higher ratings):
- *Story*: 0.11
- *Love*: 0.10
- *Political*: 0.08
- *Men*: 0.08
- *Film*: 0.08
- *World*: 0.07
- *Picture*: 0.06
- *Life*: 0.06
- *Unforgettable*: 0.05
- *Murder*: 0.05
- *Sometimes*: 0.05

#### Negative Correlation (lower ratings):
- *Evil*: -0.09
- *Game*: -0.09
- *Shark*: -0.08
- *Sinister*: -0.08
- *Mysterious*: -0.08
- *Killer*: -0.08
- *Paranormal*: -0.07
- *Terrifying*: -0.07
- *Creature*: -0.07
- *Fight*: -0.07
- *Fear*: -0.06



### **Events:**
#### Positive Correlation (higher ratings):
- *Hate*: 0.10
- *Speech*: 0.10
- *Domestic*: 0.10
- *Incarceration*: 0.09
- *Child*: 0.09
- *Sad*: 0.09
- *Hospital*: 0.08
- *Cheating*: 0.08
- *Antisemitism*: 0.08
- *Age*: 0.08
- *Gap*: 0.08
- *Large*: 0.08
- *Abandonment*: 0.08

#### Negative Correlation (lower ratings):
- *Gory*: -0.30
- *Gruesome*: -0.30
- *Slasher*: -0.30
- *Aliens*: -0.25
- *Jump*: -0.22
- *Scares*: -0.22
- *Audio*: -0.12
- *Mutilation*: -0.11
- *Gore*: -0.11
- *Eye*: -0.11
- *Excessive*: -0.09