# **NLP on Letterboxd df**

- Check word frequency in several columns -> 'tagline', 'summary', 'genres', 'themes', 'events'
- Based on correlations, rate from 0 to 5
- Tie back to whether a title has warnings or not
- Figure out how it impacts sentiment on movies
- Sentiment-o-meter

## **Metadata**

- `cleaned_films` contains 4203 titles and 13 columns.

- There are no Null values.

| **Column Name**        | **Data Type**   | **Description**                                                                 |
|------------------------|-----------------|----------------------------------------------------------------------------------|
| **title**              | object          | The title of the movie.                                                          |
| **release_year**       | int64           | The year the movie was released.                                                 |
| **tagline**            | object          | The movie's tagline (promotional phrase).                                        |
| **summary**            | object          | A brief description of the movie's plot.                                         |
| **runtime**            | int64           | The total runtime of the movie in minutes.                                       |
| **letterboxd_rating**  | float64         | The movie's average rating on Letterboxd.                                        |
| **genres**             | object          | A list of genres the movie belongs to (e.g., Drama, Comedy).                     |
| **language**           | object          | The languages the movie was produced in.                                         |
| **countries**          | object          | The countries where the movie was made or released.                              |
| **themes**             | object          | The central themes explored in the movie (e.g., Love, War, Friendship).          |
| **director**           | object          | The director(s) of the movie.                                                    |
| **events**             | object          | Key events or warnings in the movie (e.g., violence, strong language).           |
| **has_warnings**       | bool            | A boolean indicating if the movie contains warnings for sensitive content.       |
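
The row count, column count, and absence of nulls stated above can be checked with a few lines of pandas. This is a minimal sketch on an invented stand-in frame (`toy`) and a hypothetical helper (`summarize`); neither is part of the notebook.

```python
import pandas as pd

# Toy stand-in for cleaned_films; the rows are invented for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "tagline": ["x", None, "z"],
    "letterboxd_rating": [3.5, 4.0, 2.9],
})

def summarize(df: pd.DataFrame) -> dict:
    """Return row count, column count, and total number of missing cells."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "nulls": int(df.isna().sum().sum()),
    }

before = summarize(toy)           # one tagline is missing
after = summarize(toy.dropna())   # rows with any missing value removed
print(before)
print(after)
```

Running the same helper on the real frame after the `dropna` steps would be one way to confirm the figures quoted in this section.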


In [1]:
import pandas as pd
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.utils import resample
import plotly.express as px

In [2]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /home/bru/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

Import reusable sentiment-analysis functions from the `../utils` folder.

In [3]:
import sys
sys.path.append('../utils')
import sentiment_utils

In [4]:

films = pd.read_csv('../data/clean/letterboxd_clean_films.csv')
films.head(2)

Unnamed: 0,letterboxd_id,title,release_year,tagline,summary,runtime,letterboxd_rating,genres,language,countries,themes,director,topics,doesthedog_id,events,has_warnings
0,1000001,Barbie,2023,She's everything. He's just Ken.,Barbie and Ken are having the time of their li...,114,3.86,"Comedy, Adventure",English,"UK, USA","['Humanity and the world around us', 'Crude hu...",Greta Gerwig,,381345.0,,False
1,1000002,Parasite,2019,Act like you own the place.,"All unemployed, Ki-taek's family takes peculia...",133,4.56,"Comedy, Thriller, Drama","Korean, English, German, Korean",South Korea,"['Humanity and the world around us', 'Intense ...",Bong Joon-ho,,19408.0,,False


In [5]:
# Drop rows that are missing any of the core categorical fields
films.dropna(subset=['genres', 'language', 'countries', 'director'], inplace=True)

In [6]:
# Work on a copy and drop every row that still has a missing value in any column
cleaned_films = films.copy()
cleaned_films.dropna(inplace=True)

In [7]:
display(cleaned_films)


Unnamed: 0,letterboxd_id,title,release_year,tagline,summary,runtime,letterboxd_rating,genres,language,countries,themes,director,topics,doesthedog_id,events,has_warnings
4,1000005,La La Land,2016,Here's to the fools who dream.,"Mia, an aspiring actress, serves lattes to mov...",129,4.09,"Drama, Comedy, Music, Romance",English,"Hong Kong, USA","['Song and dance', 'Humanity and the world aro...",Damien Chazelle,"167,176,222,260,266,269,339,363",12823.0,"flashing lights or images, shower scenes, sad ...",True
11,1000012,Whiplash,2014,The road to greatness can take you to the edge.,"Under the direction of a ruthless instructor, ...",107,4.43,"Drama, Music",English,USA,"['Moving relationship stories', 'Student comin...",Damien Chazelle,"171,180,184,187,188,199,208,212,219,225,233,23...",12593.0,"finger or toe mutilation, spitting, car crashe...",True
32,1000033,Once Upon a Time in Hollywood,2019,"In this town, it can all change… like that","Los Angeles, 1969. TV star Rick Dalton, a stru...",162,3.76,"Drama, Thriller, Comedy","English, English, Italian, Spanish","China, UK, USA","['Humanity and the world around us', 'Fascinat...",Quentin Tarantino,"164,180,188,189,192,193,197,210,212,216,223,22...",20150.0,"people being burned alive, spitting, blood or ...",True
39,1000041,Glass Onion,2022,"When the game ends, the mystery begins.",World-famous detective Benoit Blanc heads to G...,140,3.45,"Comedy, Crime, Mystery",English,USA,"['Thrillers and murder mysteries', 'Intriguing...",Rian Johnson,"167,184,187,190,193,197,201,212,225,232,237,24...",688430.0,"flashing lights or images, car crashes, people...",True
68,1000071,Coco,2017,The celebration of a lifetime,Despite his family’s baffling generations-old ...,105,4.12,"Adventure, Animation, Music, Family","English, English, Spanish",USA,"['Moving relationship stories', 'Song and danc...",Lee Unkrich,"168,180,207,218,237,245,253,262,266,282,291,29...",13647.0,"parents dying, spitting, ghosts, child abuse, ...",True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18425,1097809,The Triangle,2001,"In the Bermuda Triangle, nothing stays lost fo...",This made-for-TV movie follows a group of frie...,92,2.89,"Thriller, Horror, TV Movie",English,"USA, Canada, Barbados","['Horror, the undead and monster classics', 'T...",Lewis Teague,"158,168,177,184,188,189,191,202,212,218,219,22...",17975.0,"kids dying, parents dying, shaving or cutting,...",True
18427,1098166,CAT,2022,Drugs. Deceit. Danger.,"Living under an alias, a former police informa...",360,3.50,"Crime, Drama",Hindi,India,"['Crime, drugs and gangsters', 'Intense politi...",Balwinder Singh Janjua,"164,167,177,181,188,200,224,229,231,232,250,26...",198460.0,"people being burned alive, flashing lights or ...",True
18429,1098239,Wraith,2017,There's Something in My Room,After living in an old mansion for almost 10 y...,99,2.61,"Mystery, Thriller, Horror",English,USA,"['Faith and religion', 'Terrifying, haunted, a...",Michael O. Sajbel,"164,180,181,182,184,188,193,197,199,200,207,20...",17105.0,"people being burned alive, spitting, shaky cam...",True
18436,1098864,Tulsa,2020,Big changes come in small packages,A desperate marine biker’s life is turned upsi...,120,2.97,"Comedy, Drama",English,USA,"['Faith and religion', 'Moving relationship st...","Gloria Stella, Scott Pryor","158,168,184,187,191,193,208,225,230,239,269,298",685816.0,"kids dying, parents dying, car crashes, people...",True


## **Word Count**
- Display the most common words in the *tagline*, *summary*, *genres*, *themes*, and *events* columns.

**Tagline Word Count**

In [8]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='tagline', top_n=30)

Top 30 most common words in 'tagline':

Word           Count     Percentage (%)
-----------------------------------
one            249       1.41
love           237       1.34
story          155       0.88
never          154       0.87
life           137       0.78
world          123       0.70
time           111       0.63
dont           100       0.57
evil           90        0.51
new            85        0.48
every          84        0.48
family         81        0.46
terror         78        0.44
get            70        0.40
man            65        0.37
go             64        0.36
cant           62        0.35
true           62        0.35
die            62        0.35
hell           62        0.35
back           60        0.34
like           58        0.33
fear           58        0.33
hes            58        0.33
kill           58        0.33
woman          57        0.32
lives          56        0.32
death          56        0.32
way            56        0.32
theres         55        0.31

[('one', 249),
 ('love', 237),
 ('story', 155),
 ('never', 154),
 ('life', 137),
 ('world', 123),
 ('time', 111),
 ('dont', 100),
 ('evil', 90),
 ('new', 85),
 ('every', 84),
 ('family', 81),
 ('terror', 78),
 ('get', 70),
 ('man', 65),
 ('go', 64),
 ('cant', 62),
 ('true', 62),
 ('die', 62),
 ('hell', 62),
 ('back', 60),
 ('like', 58),
 ('fear', 58),
 ('hes', 58),
 ('kill', 58),
 ('woman', 57),
 ('lives', 56),
 ('death', 56),
 ('way', 56),
 ('theres', 55)]
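
Tokens such as *dont*, *cant*, *hes*, and *theres* appear in this list because the preprocessing strips apostrophes before stop words are removed, so "don't" becomes "dont" and no longer matches the stop-word list. One possible remedy, sketched below, is to apply the same stripping to the stop words. The sketch uses a small hand-written `BASE_STOP_WORDS` set as a stand-in for NLTK's list and a whitespace split instead of `word_tokenize`; both are assumptions made to keep the example self-contained.

```python
import re
from collections import Counter

# Small stand-in for NLTK's English stop-word list (assumption for this sketch).
BASE_STOP_WORDS = {"don't", "can't", "he's", "there's", "the", "a", "to", "is", "you"}

# Apply the same character stripping to the stop words as to the text,
# so "don't" -> "dont" is still recognised after cleaning.
STOP_WORDS = BASE_STOP_WORDS | {re.sub(r"[^a-z]", "", w) for w in BASE_STOP_WORDS}

def tokenize(text: str) -> list[str]:
    """Lowercase, drop non-letters, split on whitespace, remove stop words."""
    cleaned = re.sub(r"[^a-z\s]", "", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

# Invented taglines for illustration.
taglines = ["Don't look back.", "He's back. There's no escape.", "You can't hide."]
counts = Counter(tok for line in taglines for tok in tokenize(line))
print(counts.most_common(3))
```

With the real NLTK list, the same idea would mean extending `stopwords.words('english')` with its apostrophe-free variants before filtering.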

**Summary Word Count**

In [9]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='summary', top_n=10)

Top 10 most common words in 'summary':

Word           Count     Percentage (%)
-----------------------------------
young          617       0.59
life           561       0.54
one            516       0.49
new            494       0.47
two            452       0.43
family         436       0.42
find           351       0.34
man            328       0.31
woman          325       0.31
world          315       0.30


[('young', 617),
 ('life', 561),
 ('one', 516),
 ('new', 494),
 ('two', 452),
 ('family', 436),
 ('find', 351),
 ('man', 328),
 ('woman', 325),
 ('world', 315)]

**Genre Word Count**

In [10]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='genres', top_n=10)

Top 10 most common words in 'genres':

Word           Count     Percentage (%)
-----------------------------------
drama          1833      17.27
horror         1542      14.53
thriller       1244      11.72
comedy         1092      10.29
mystery        537       5.06
crime          523       4.93
romance        520       4.90
action         498       4.69
science        459       4.32
fiction        459       4.32


[('drama', 1833),
 ('horror', 1542),
 ('thriller', 1244),
 ('comedy', 1092),
 ('mystery', 537),
 ('crime', 523),
 ('romance', 520),
 ('action', 498),
 ('science', 459),
 ('fiction', 459)]
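
*science* and *fiction* have identical counts (459) because word-level tokenization splits the single genre label "Science Fiction" in two. Since the `genres` preview shows comma-separated labels, counting whole labels may be more faithful. A minimal sketch with a hypothetical helper `count_labels` and invented sample rows:

```python
from collections import Counter

def count_labels(values, sep=","):
    """Count whole comma-separated labels instead of individual words."""
    counter = Counter()
    for value in values:
        labels = [part.strip().lower() for part in value.split(sep)]
        counter.update(label for label in labels if label)
    return counter

# Invented sample rows in the same "A, B, C" layout as the genres column.
sample_genres = [
    "Drama, Science Fiction",
    "Thriller, Horror, TV Movie",
    "Science Fiction, Horror",
]
genre_counts = count_labels(sample_genres)
print(genre_counts.most_common())
```

On the real data this could be called as `count_labels(cleaned_films['genres'])`, assuming every value is a string.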

**Themes Word Count**

In [11]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='themes', top_n=10)

Top 10 most common words in 'themes':

Word           Count     Percentage (%)
-----------------------------------
horror         5696      6.75
stories        1948      2.31
terrifying     1579      1.87
monster        1516      1.80
humor          1453      1.72
crime          1422      1.68
undead         1229      1.46
classics       1229      1.46
dark           1204      1.43
relationship   1202      1.42


[('horror', 5696),
 ('stories', 1948),
 ('terrifying', 1579),
 ('monster', 1516),
 ('humor', 1453),
 ('crime', 1422),
 ('undead', 1229),
 ('classics', 1229),
 ('dark', 1204),
 ('relationship', 1202)]
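
Judging by the preview, each `themes` value is a stringified Python list (for example `"['Humanity and the world around us', ...]"`), so word counts mix fragments of several multi-word themes. The sketch below parses such strings with `ast.literal_eval` and counts whole themes. The helper name `parse_theme_list` and the sample rows are assumptions for illustration.

```python
import ast
from collections import Counter

def parse_theme_list(raw: str) -> list[str]:
    """Turn a stringified list such as "['A', 'B']" into a real list of strings."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # not a parseable literal; treat as no themes
    return [str(item) for item in parsed] if isinstance(parsed, list) else []

# Invented rows mimicking the stringified lists seen in the preview.
sample_themes = [
    "['Horror, the undead and monster classics', 'Song and dance']",
    "['Song and dance']",
    "not a list",
]
theme_counts = Counter(t for raw in sample_themes for t in parse_theme_list(raw))
print(theme_counts.most_common(2))
```

Parsing first also keeps themes that contain commas, such as "Horror, the undead and monster classics", in one piece.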

**Events Word Count**

In [12]:
sentiment_utils.analyze_most_common_words(cleaned_films, text_column='events', top_n=10)

Top 10 most common words in 'events':

Word           Count     Percentage (%)
-----------------------------------
dying          4676      3.84
gore           3218      2.64
sexual         3153      2.59
scenes         2831      2.32
animals        2341      1.92
abuse          2336      1.92
dies           2025      1.66
violence       2010      1.65
blood          2006      1.65
people         1905      1.56


[('dying', 4676),
 ('gore', 3218),
 ('sexual', 3153),
 ('scenes', 2831),
 ('animals', 2341),
 ('abuse', 2336),
 ('dies', 2025),
 ('violence', 2010),
 ('blood', 2006),
 ('people', 1905)]
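
The percentages above are shares of all tokens in the column. A complementary view is the share of films whose `events` text mentions a word at least once (document frequency). A minimal sketch, assuming plain strings as input; `document_frequency` and the sample strings are hypothetical:

```python
import re
from collections import Counter

def document_frequency(texts):
    """Return {word: share of texts containing the word at least once}."""
    doc_counts = Counter()
    for text in texts:
        # set() makes each word count once per text, however often it repeats
        words = set(re.findall(r"[a-z]+", text.lower()))
        doc_counts.update(words)
    n_texts = len(texts) or 1  # avoid division by zero on empty input
    return {word: count / n_texts for word, count in doc_counts.items()}

# Invented event strings in the same comma-separated style as the events column.
sample_events = [
    "parents dying, kids dying, car crashes",
    "blood or gore, spitting",
    "parents dying, blood or gore",
    "flashing lights",
]
share = document_frequency(sample_events)
print(share["dying"], share["gore"])
```

Here "dying" occurs three times as a token but in only two of the four texts, which is the distinction this measure captures.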

In [None]:
import pandas as pd
import plotly.graph_objects as go
from collections import Counter
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Build the stop-word set once instead of on every call.
# The NLTK 'stopwords' corpus must be available (nltk.download('stopwords')).
STOP_WORDS = set(stopwords.words('english'))

# Function to preprocess the text
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # Remove non-alphabet characters (apostrophes too: "don't" -> "dont")
    tokens = word_tokenize(text)  # Tokenize text
    return [word for word in tokens if word not in STOP_WORDS]  # Remove stopwords

# Function to analyze the most common words and plot a pie chart
def analyze_most_common_words(df, text_column, top_n=50):
    processed_column = f"processed_{text_column}"
    
    # Preprocess the text column
    df[processed_column] = df[text_column].apply(preprocess_text)  # Assumes preprocess_text is defined
    
    # Flatten all tokens into a single list
    all_tokens = [word for tokens in df[processed_column] for word in tokens]
    total_word_count = len(all_tokens)
    
    # Count the most common words
    word_freq = Counter(all_tokens)
    most_common_words = word_freq.most_common(top_n)
    
    # Print results
    print(f"Top {top_n} most common words in '{text_column}':\n")
    print(f"{'Word':<15}{'Count':<10}{'Percentage (%)':<10}")
    print("-" * 35)
    for word, count in most_common_words:
        percentage = (count / total_word_count) * 100
        print(f"{word:<15}{count:<10}{percentage:.2f}")
    
    # Prepare data for Plotly
    words, counts = zip(*most_common_words)
    
    # Create the pie chart with text outside
    fig = go.Figure(data=[go.Pie(
        labels=words,
        values=counts,
        hoverinfo='label+percent',  # Show label and percentage
        textinfo='label+value+percent',  # Show label, count, and percentage
        textposition='outside',  # Place text outside the chart
        pull=[0.1] * len(words),  # Optionally pull slices apart for better visibility
        marker=dict(colors=['#66b3ff', '#99ff99', '#ffcc99', '#ffb3e6', '#c2c2f0'])  # Optional color scheme
    )])

    # Update layout
    fig.update_layout(
        title=f'Most Common Words in {text_column.capitalize()}',  # Title follows the analyzed column
        showlegend=True  # Optional: Shows the legend
    )

    # Show the figure
    fig.show()
    
    return most_common_words

# Example usage
# cleaned_films is your DataFrame and 'events' is the column containing text data
analyze_most_common_words(cleaned_films, text_column='events', top_n=10)


Top 10 most common words in 'events':

Word           Count     Percentage (%)
-----------------------------------
dying          4676      3.84
gore           3218      2.64
sexual         3153      2.59
scenes         2831      2.32
animals        2341      1.92
abuse          2336      1.92
dies           2025      1.66
violence       2010      1.65
blood          2006      1.65
people         1905      1.56


[('dying', 4676),
 ('gore', 3218),
 ('sexual', 3153),
 ('scenes', 2831),
 ('animals', 2341),
 ('abuse', 2336),
 ('dies', 2025),
 ('violence', 2010),
 ('blood', 2006),
 ('people', 1905)]

In [42]:
import pandas as pd
import plotly.graph_objects as go
from collections import Counter
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Build the stop-word set once instead of on every call.
# The NLTK 'stopwords' corpus must be available (nltk.download('stopwords')).
STOP_WORDS = set(stopwords.words('english'))

# Function to preprocess the text
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # Remove non-alphabet characters (apostrophes too: "don't" -> "dont")
    tokens = word_tokenize(text)  # Tokenize text
    return [word for word in tokens if word not in STOP_WORDS]  # Remove stopwords

# Function to analyze the most common words and plot a pie chart
def analyze_most_common_words(df, text_column, top_n=50, save_path=None):
    processed_column = f"processed_{text_column}"
    
    # Preprocess the text column
    df[processed_column] = df[text_column].apply(preprocess_text)  # Assumes preprocess_text is defined
    
    # Flatten all tokens into a single list
    all_tokens = [word for tokens in df[processed_column] for word in tokens]
    total_word_count = len(all_tokens)
    
    # Count the most common words
    word_freq = Counter(all_tokens)
    most_common_words = word_freq.most_common(top_n)
    
    # Print results
    print(f"Top {top_n} most common words in '{text_column}':\n")
    print(f"{'Word':<15}{'Count':<10}{'Percentage (%)':<10}")
    print("-" * 35)
    for word, count in most_common_words:
        percentage = (count / total_word_count) * 100
        print(f"{word:<15}{count:<10}{percentage:.2f}")
    
    # Prepare data for Plotly
    words, counts = zip(*most_common_words)
    words_capitalized = [word.capitalize() for word in words]  # Capitalize the first letter of each word
    
    # Create the pie chart with text outside
    fig = go.Figure(data=[go.Pie(
        labels=words_capitalized,  # Use capitalized words
        values=counts,
        hoverinfo='label+percent',  # Show label and percentage on hover
        textinfo='label+percent',  # Show label and percentage on each slice
        textposition='outside',  # Place text outside the chart
        pull=[0.1] * len(words),  # Pull slices apart for better visibility
        marker=dict(colors=[
            '#F4D6A0',  # Warm light yellow
            '#A8CBB7',  # Light mint green
            '#D4B9A3',  # Muted salmon beige
            '#A6C6D9',  # Soft pastel blue
            '#B3A0A1',  # Dusty rose brown
            '#6E7B7A',  # Cool greyish teal
            '#99A7A4',  # Soft olive gray
            '#C4D8C1',  # Soft sage green
            '#3E4A49',  # Deep slate gray
            '#B8C6D0',  # Light cool blue-gray
        ])  # Optional color scheme
    )])

    # Update layout
    fig.update_layout(
        title=f'Top Recurring Words in {text_column.capitalize()}',  # Title follows the analyzed column
        showlegend=True,
        plot_bgcolor='#f7f7f7',  # Background color for the plot area
        paper_bgcolor='#f7f7f7',
        width=800
    )
    
    # Static image export requires the optional kaleido package
    if save_path:
        fig.write_image(save_path)
        print(f"Plot saved to {save_path}")

    fig.show()

    return most_common_words

# Example usage
# cleaned_films is your DataFrame and 'events' is the column containing text data
analyze_most_common_words(cleaned_films, text_column='events', top_n=10, save_path='../visuals/events_most_recurring_words.png')


Top 10 most common words in 'events':

Word           Count     Percentage (%)
-----------------------------------
dying          4676      3.84
gore           3218      2.64
sexual         3153      2.59
scenes         2831      2.32
animals        2341      1.92
abuse          2336      1.92
dies           2025      1.66
violence       2010      1.65
blood          2006      1.65
people         1905      1.56
Plot saved to ../visuals/events_most_recurring_words.png


[('dying', 4676),
 ('gore', 3218),
 ('sexual', 3153),
 ('scenes', 2831),
 ('animals', 2341),
 ('abuse', 2336),
 ('dies', 2025),
 ('violence', 2010),
 ('blood', 2006),
 ('people', 1905)]

## **Word Correlation**

- Find the correlation between ratings and the words in the tagline, summary, genres, themes, and events columns.
- Show bar plots of the words whose presence correlates positively, negligibly, or negatively with the Letterboxd rating.
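
The correlation used later in this section is Spearman's rho, which equals the Pearson correlation of the ranked values. The sketch below computes it by hand with pandas on invented data, where `has_word` is a hypothetical 0/1 flag for whether a word occurs in a film's text. It illustrates the measure and is not the `sentiment_utils` implementation.

```python
import pandas as pd

# Invented ratings and a 0/1 flag for whether a given word occurs in each film's text.
toy = pd.DataFrame({
    "has_word": [1, 1, 0, 0, 1, 0],
    "rating":   [2.1, 2.8, 3.9, 4.2, 3.0, 3.5],
})

def spearman(x: pd.Series, y: pd.Series) -> float:
    """Spearman's rho: Pearson correlation of the average ranks (ties share a rank)."""
    return float(x.rank().corr(y.rank()))

rho = spearman(toy["has_word"], toy["rating"])
print(round(rho, 3))
```

In this toy sample the films containing the word all have the lowest ratings, so rho is strongly negative. With thousands of films and mostly rare words, values are typically much closer to zero.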


In [42]:
df = cleaned_films.copy()

Taglines vs Ratings

In [None]:
sentiment_utils.word_rating_correlation(df, text_column='processed_tagline', rating_column='letterboxd_rating', top_n=10)

Summary vs Ratings

In [None]:
sentiment_utils.word_rating_correlation(df, text_column='processed_summary', rating_column='letterboxd_rating', top_n=10)

Genres vs Ratings

In [None]:
sentiment_utils.word_rating_correlation(df, text_column='processed_genres', rating_column='letterboxd_rating', top_n=10)

In [None]:
import pandas as pd
import plotly.express as px
from sklearn.feature_extraction.text import CountVectorizer
from scipy.stats import spearmanr

def word_rating_correlation(df, text_column, rating_column, top_n=10, save_path=None):
    # The processed_* columns hold token lists; join each into one string per row
    # without overwriting the original column
    texts = df[text_column].apply(
        lambda x: ' '.join(x) if isinstance(x, (list, tuple)) else str(x)
    )

    vectorizer = CountVectorizer()  # bag-of-words counts; lowercases tokens by default
    x = vectorizer.fit_transform(texts)
    word_count_df = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())

    # Spearman correlation between each word's per-film count and the rating
    ratings = df[rating_column].to_numpy()
    correlations = {}
    for word in word_count_df.columns:
        corr, _ = spearmanr(word_count_df[word], ratings)
        correlations[word] = corr

    sorted_correlations = sorted(correlations.items(), key=lambda x: abs(x[1]), reverse=True)  # correlation by absolute value

    top_correlated_words = sorted_correlations[:top_n]
    print("Top Words Correlated with Ratings:")
    for word, corr in top_correlated_words:
        print(f"{word}: {corr:.2f}")

    words, corrs = zip(*top_correlated_words)
    
    # Capitalize the words for the plot
    words_capitalized = [word.capitalize() for word in words]

    fig = px.bar(x=words_capitalized, y=corrs, title=f'Word-Rating Correlation for {text_column}',
                 labels={'x': '', 'y': 'Spearman Correlation'}, color=corrs,
                 color_continuous_scale='Temps_r')
    fig.update_layout(
        xaxis_tickangle=-45,
        plot_bgcolor='#f7f7f7',  # Background color for the plot area
        paper_bgcolor='#f7f7f7',
        width=900
    )
    
    if save_path:
        fig.write_image(save_path)
        print(f"Plot saved to {save_path}")
        
    fig.show()

# Example usage with your dataset
word_rating_correlation(df, text_column='processed_genres', rating_column='letterboxd_rating', top_n=20)

Themes vs Ratings

In [None]:
sentiment_utils.word_rating_correlation(df, text_column='processed_themes', rating_column='letterboxd_rating', top_n=10)

Events vs Ratings

In [None]:
sentiment_utils.word_rating_correlation(df, text_column='processed_events', rating_column='letterboxd_rating', top_n=10)

## **Conclusions**

### **Tagline:**
The bar plot shows the tagline words most strongly correlated with ratings. Words with a positive correlation tend to appear in higher-rated films, and words with a negative correlation in lower-rated ones. The Spearman value gives the direction and strength of that monotonic relationship; it does not show that a word causes a higher or lower rating.
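
One way to make a correlation more tangible is to compare the mean rating of films whose text contains a word with the mean rating of the rest. The sketch below does this on invented taglines. `rating_gap` is a hypothetical helper, not part of `sentiment_utils`.

```python
import re
import pandas as pd

def rating_gap(df, text_col, rating_col, word):
    """Mean rating of rows whose text contains `word` minus the mean rating of the rest."""
    pattern = rf"\b{re.escape(word)}\b"  # whole-word match, case-insensitive below
    contains = df[text_col].str.contains(pattern, case=False, regex=True, na=False)
    with_word = df.loc[contains, rating_col].mean()
    without_word = df.loc[~contains, rating_col].mean()
    return with_word - without_word

# Invented taglines and ratings for illustration only.
toy = pd.DataFrame({
    "tagline": ["A love story", "Evil never dies", "Love hurts", "Pure evil"],
    "letterboxd_rating": [4.0, 2.5, 3.6, 2.9],
})
gap_love = rating_gap(toy, "tagline", "letterboxd_rating", "love")
gap_evil = rating_gap(toy, "tagline", "letterboxd_rating", "evil")
print(round(gap_love, 2), round(gap_evil, 2))
```

A positive gap lines up with a positive correlation and a negative gap with a negative one, and the gap is expressed directly in rating points.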

### **Summary:**
- Positive Correlation: Words or genres associated with higher ratings include *drama*, *documentary*, *love*, *story*, and *political*.
- Negative Correlation: Words or genres associated with lower ratings include *horror*, *thriller*, *jump*, *scares*, and *gore*.
- Weak/Negligible Correlation: Some words have minimal or no significant impact on ratings, such as *movie*, *time*, *life*, and *friends*.

### **Genre:**
#### Positive Correlation (higher ratings):
- *Drama*: 0.41
- *Documentary*: 0.16
- *History*: 0.14
- *Crime*: 0.12
- *Music*: 0.11
- *War*: 0.10
- *Animation*: 0.10
- *Romance*: 0.09
- *Western*: 0.07

#### Negative Correlation (lower ratings):
- *Horror*: -0.42
- *Thriller*: -0.24
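- *Fiction*: -0.17 (second half of the *Science Fiction* label)
- *Science*: -0.17 (first half of the *Science Fiction* label)
- *Action*: -0.11
- *Movie*: -0.05 (from the *TV Movie* label)
- *TV*: -0.05 (from the *TV Movie* label)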
- *Fantasy*: -0.05
- *Mystery*: -0.04
- *Family*: -0.03
- *Adventure*: -0.03
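
Because the genre figures above are computed per word, multi-word labels are split into separate entries. A possible alternative, sketched here on invented rows, is to build one 0/1 column per whole label with `Series.str.get_dummies` and correlate each column with the rating. Spearman is obtained by ranking both sides and taking the Pearson correlation. The `sep=", "` value assumes the "A, B" layout seen in the preview.

```python
import pandas as pd

# Invented rows in the same layout as the genres and letterboxd_rating columns.
toy = pd.DataFrame({
    "genres": ["Drama, Science Fiction", "Horror, TV Movie", "Drama", "Horror, Science Fiction"],
    "letterboxd_rating": [4.1, 2.4, 3.9, 2.8],
})

# One 0/1 column per whole genre label.
genre_flags = toy["genres"].str.get_dummies(sep=", ")

# Spearman via ranks: rank both sides, then take the Pearson correlation per column.
rating_rank = toy["letterboxd_rating"].rank()
genre_corr = genre_flags.rank().corrwith(rating_rank).sort_values()
print(genre_corr)
```

With this encoding "Science Fiction" and "TV Movie" each get a single correlation value instead of two identical ones.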


### **Themes:**
#### Positive Correlation (higher ratings):
- *Story*: 0.11
- *Love*: 0.10
- *Political*: 0.08
- *Men*: 0.08
- *Film*: 0.08
- *World*: 0.07
- *Picture*: 0.06
- *Life*: 0.06
- *Unforgettable*: 0.05
- *Murder*: 0.05
- *Sometimes*: 0.05

#### Negative Correlation (lower ratings):
- *Evil*: -0.09
- *Game*: -0.09
- *Shark*: -0.08
- *Sinister*: -0.08
- *Mysterious*: -0.08
- *Killer*: -0.08
- *Paranormal*: -0.07
- *Terrifying*: -0.07
- *Creature*: -0.07
- *Fight*: -0.07
- *Fear*: -0.06



### **Events:**
#### Positive Correlation (higher ratings):
- *Hate*: 0.10
- *Speech*: 0.10
- *Domestic*: 0.10
- *Incarceration*: 0.09
- *Child*: 0.09
- *Sad*: 0.09
- *Hospital*: 0.08
- *Cheating*: 0.08
- *Antisemitism*: 0.08
- *Age*: 0.08
- *Gap*: 0.08
- *Large*: 0.08
- *Abandonment*: 0.08

#### Negative Correlation (lower ratings):
- *Gory*: -0.30
- *Gruesome*: -0.30
- *Slasher*: -0.30
- *Aliens*: -0.25
- *Jump*: -0.22
- *Scares*: -0.22
- *Audio*: -0.12
- *Mutilation*: -0.11
- *Gore*: -0.11
- *Eye*: -0.11
- *Excessive*: -0.09