# Second Axis: Narrative Themes and Movie Plot Analysis

In this axis, we take a Natural Language Processing (NLP) approach to the CMU movie dataset.

In [None]:
from collections import Counter
from pathlib import Path

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
# pyLDAvis and scikit-learn are required by the topic modeling cells further down
import pyLDAvis
import pyLDAvis.lda_model
from PIL import Image
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import tqdm
from wordcloud import WordCloud

# we will need some nltk resources
nltk.download(['stopwords', 'wordnet', 'punkt_tab'])

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# === Load Data and Plot Summaries ===
DATASET_FILEPATH = Path("data/merged_movie_metadata.csv")
PLOT_SUMMARIES_FILEPATH = Path("data/plot_summaries.txt")

df = pd.read_csv(DATASET_FILEPATH)

# Take only unique Wikipedia_ID
df = df.drop_duplicates(subset=['Wikipedia_ID'])

plot_summaries = {}

with open(PLOT_SUMMARIES_FILEPATH, 'r', encoding='utf-8') as file:
    for line in file:
        parts = line.strip().split('\t', 1)
        if len(parts) == 2:
            wiki_id, summary = parts
            plot_summaries[int(wiki_id)] = summary

# Map plot summaries to the DataFrame
df['Plot_Summary_Base'] = df['Wikipedia_ID'].map(plot_summaries)

### Loading and Preparing Movie Plot Summaries

Our dataset includes two main sources for movie plot summaries:

1. **OMDB Data (Plot column)**: This source provides a brief summary of the movie plot in a few sentences.
2. **CMU Movie Summary Corpus (plot_summaries.txt)**: This file, obtained from the [CMU Movie Summary Corpus website](https://www.cs.cmu.edu/~ark/personas/), contains detailed plot summaries for a range of movies.

Since both sources may contain missing values for certain movies, we will combine the two to maximize coverage and ensure we have plot information for as many movies as possible. Below is a check to see how many values are missing in each plot source:

In [None]:
# Check missing values in both plot sources
print(f"Total movies: {len(df)}")
print(f"Missing values in OMDB Plot column: {df['Plot'].isna().sum()}")
print(f"Missing values in CMU Plot Summary column: {df['Plot_Summary_Base'].isna().sum()}")

# Movies missing both plot summaries
print(f"Movies missing both summaries: {(df['Plot'].isna() & df['Plot_Summary_Base'].isna()).sum()}")

## Additional dataset

For our journey, we need to retrieve an additional dataset from the internet. We get it from OMDB ([Open Movie Database](https://www.omdbapi.com)), an open API that returns movie metadata given an IMDb ID or, even more simply, a title.

After getting an API key, we managed to query the database for $73'000$ movies out of the $\approx 83'000$ original movies of `movie_metadata.tsv` from the CMU Movie Summary Corpus.
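As an illustration of what such a query can look like, here is a minimal sketch using only the standard library. It is not the script we actually ran: the helper names `build_omdb_url` and `fetch_movie` are hypothetical, and the sketch assumes the usual OMDb query parameters (`apikey`, `i` for an IMDb ID, `t` for a title, `plot` for the summary length) and the `Response` field that OMDb sets to `"False"` when nothing matches.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OMDB_URL = "https://www.omdbapi.com/"


def build_omdb_url(api_key, title=None, imdb_id=None):
    """Build an OMDb query URL from an IMDb ID (preferred) or a title."""
    if imdb_id is None and title is None:
        raise ValueError("Provide either an IMDb ID or a title")
    params = {"apikey": api_key, "plot": "short"}
    if imdb_id is not None:
        params["i"] = imdb_id  # lookup by IMDb ID
    else:
        params["t"] = title  # lookup by title
    return f"{OMDB_URL}?{urlencode(params)}"


def fetch_movie(api_key, **query):
    """Return the OMDb record as a dict, or None when OMDb reports no match."""
    with urlopen(build_omdb_url(api_key, **query), timeout=10) as response:
        record = json.load(response)
    return record if record.get("Response") == "True" else None


# Building the URL needs no network access; YOUR_KEY is a placeholder.
print(build_omdb_url("YOUR_KEY", title="Blade Runner"))
```

Calling `fetch_movie("YOUR_KEY", imdb_id=...)` for each movie and collecting the returned dictionaries would give a list that can be dumped to a single `.json` file.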

The resulting `.json` file was merged with the original CMU dataset, adding the following columns:

    ["Rated", "Director", "Writer", "Actors", "Plot", "Awards", "Poster", "Ratings", "Metascore", "imdbRating", "imdbVotes", "imdbID", "DVD", "Production", "Website", "Response", "totalSeasons", "Oscar", "Nomination_Awards", "Win_Awards", "Internet_Movie_Database_Rating", "Rotten_Tomatoes_Rating", "Metacritic_Rating"]

We can then proceed to our text-based analysis.

### Text Preprocessing for Movie Plots

To analyze movie plots, we need to process the text data to standardize and clean it. We perform several key steps in this process:

1. **Text selection** - For each movie, we prioritize plot summaries from the [CMU Movie Summary Corpus](https://www.cs.cmu.edu/~ark/personas/) because they tend to be richer in content. If a summary from this source is unavailable (i.e., a missing value), we use the summary from OMDB. If both sources are missing, the result is marked as `NaN`.

2. **Tokenization** - We split each plot summary into individual words (tokens).

3. **Stop words removal** - Commonly used words that don’t contribute meaningful information (known as stop words) are removed. Examples of stop words include "the," "is," "and," "in," "to," etc.

4. **Lemmatization** - Each word is reduced to its base or dictionary form (its lemma). For instance, words like "running," "runs," and "ran" are all converted to "run". This improves the accuracy of our analysis by avoiding separate counts for different forms of the same word.

5. **Frequency analysis** - We analyze the frequency and type of words used.
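The first four steps can be sketched on a toy example as follows. This is only an illustration under simplifying assumptions, not the code behind `tok_by_region`: the stop word set and the lemma dictionary are tiny hand-written stand-ins for the NLTK resources downloaded above, missing values are represented by `None` rather than `NaN`, and the function names `select_plot` and `preprocess` are hypothetical.

```python
import re

# Tiny illustrative resources (assumptions); NLTK's stop word list and
# WordNet lemmatizer would play these roles on real data.
STOP_WORDS = {"the", "is", "and", "in", "to", "a", "of"}
LEMMAS = {"running": "run", "runs": "run", "ran": "run"}


def select_plot(cmu_summary, omdb_plot):
    """Step 1: prefer the CMU summary, fall back to OMDB, else None."""
    return cmu_summary if cmu_summary else omdb_plot


def preprocess(text):
    """Steps 2 to 4: tokenize, drop stop words, and lemmatize one summary."""
    if not text:
        return None
    tokens = re.findall(r"[a-z]+", text.lower())  # step 2: tokenization
    kept = [tok for tok in tokens if tok not in STOP_WORDS]  # step 3
    return [LEMMAS.get(tok, tok) for tok in kept]  # step 4: toy lemmatization


# The CMU summary is missing here, so the OMDB plot is used instead.
summary = select_plot(None, "The detective runs to the station and ran in.")
print(preprocess(summary))
```

A dictionary lookup is of course far cruder than a real lemmatizer, but it makes the effect visible: "runs" and "ran" end up as the same token.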


In [None]:
# === Text Preprocessing ===
from src.utils.data_utils import tok_by_region

# Tokenize plots by region
tokenized_plots_america = tok_by_region(df, 'America')
tokenized_plots_europe = tok_by_region(df, 'Europe')
tokenized_plots_both = tok_by_region(df, 'Both')

We now want to see the most frequently used words in movie plots across our three regions: **America**, **Europe**, and **Both**. We use a `Counter` to count occurrences of each word in the tokenized plots for each region.

In [None]:
# === Word Counts ===
from src.utils.data_utils import get_word_count

word_count_america = get_word_count(tokenized_plots_america)
word_count_europe = get_word_count(tokenized_plots_europe)
word_count_both = get_word_count(tokenized_plots_both)

print("Top 10 words in America:", word_count_america.most_common(10))
print("\nTop 10 words in Europe:", word_count_europe.most_common(10))
print("\nTop 10 words in Both:", word_count_both.most_common(10))
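The helper `get_word_count` lives in `src/utils/data_utils.py` and is not reproduced here. As a rough idea of what such a helper does, the sketch below counts tokens over a collection of token lists and skips movies without a plot. The name `count_words` and the toy data are ours, and the real helper may differ in details.

```python
from collections import Counter
from itertools import chain


def count_words(tokenized_plots):
    """Count token occurrences across token lists, skipping missing plots."""
    valid = (tokens for tokens in tokenized_plots if isinstance(tokens, list))
    return Counter(chain.from_iterable(valid))


# Toy input: two tokenized plots and one missing plot
toy_plots = [["detective", "murder", "city"], None, ["detective", "love"]]
toy_counts = count_words(toy_plots)
print(toy_counts.most_common(1))
```

`Counter.most_common(n)` returns `(word, count)` pairs sorted by decreasing count, which is the format printed by the cell above.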

We now have the most frequently used words in each region’s plot summaries, but these results feel somewhat limited. To gain deeper insights, we’ll use the Stanford CoreNLP-processed summaries, which provide richer linguistic information.

These summaries, derived from `plot_summaries.txt`, have been processed with the [Stanford CoreNLP pipeline](https://stanfordnlp.github.io/CoreNLP/), a tool that performs advanced language processing tasks such as part-of-speech tagging, syntactic parsing, named entity recognition (NER), and coreference resolution. This additional information will allow us to analyze not only word frequency but also the context and roles of words within each summary. We will first extract the **POS tags**.

First, let's configure the folder where we store these summaries (not included in the GitHub repository because of their size):

In [None]:
folder_path = Path("../corenlp_plot_summaries")

assert folder_path.exists(), "Must configure the correct path to corenlp_plot_summaries"

Then, let's see how many movies have had their summaries processed.

 > **Note:** originally, only ~7'000 movies were processed. We added 3'500 missing movies by processing their OMDB plot summaries instead. This was a bit of a headache and, since this is preprocessing, the logic was moved inside `src/scripts/` to lighten the notebook. See [extract_ids_to_run_in_pipeline.py](src/scripts/extract_ids_to_run_in_pipeline.py) and [INTELLIJ_run_Stanford_Pipeline.xml](src/scripts/INTELLIJ_run_Stanford_Pipeline.xml) for details on how this was achieved.

In [None]:
# count matching folders/files
processed_ids = {f.stem.replace(".xml", "") for f in folder_path.glob("*.xml.gz")}
available_ids = set(df["Wikipedia_ID"].astype(str))
matching_files = [folder_path/f"{filename}.xml.gz" for filename in processed_ids.intersection(available_ids)]

print(f"Number of matching folders/files: {len(matching_files)}")
print(f"Number of unique Wiki id's: {df['Wikipedia_ID'].nunique()}")

### Filter using the POS fields

**POS (Part Of Speech) fields contain word metadata. They indicate whether the current word is a noun, a verb, an adjective, etc.**

[An example POS definition can be found here](https://en.wikipedia.org/wiki/Brown_Corpus#Part-of-speech_tags_used):

![POS example definition](data/POS_tokens_BrownCorpusWikipedia.png)

We need to know which POS tags occur in our files to pick the most useful ones for our analysis. To do so without unzipping all the XML files to disk, we can leverage shell tools with the (non-optimized) command below. Even though this method is quite efficient, it would still take a long time to process all the files. Since we only use this command to get a feel for the POS fields, we ran it on a random subset: instead of passing `*.xml.gz` directly, we pass the first 5000 filenames after a random shuffle.


```bash
    zcat  $(ls *.xml.gz | sort -R | head -n5000) | grep -oP "(?<=<POS>)[^<]+" | sort | uniq -c | sort -rn
```

How it works:
 - `$(ls *.xml.gz | sort -R | head -n5000)` returns 5000 randomly chosen filenames from the current directory (`sort -R` shuffles the list, `head` keeps the first 5000).
 - `zcat` prints the content of the .gz (gzipped) files as text.
 - The output is piped to `grep`, which extracts the content of every POS field as a string: "(...) <POS>NN</POS> (...)" => "NN".
 - The output is piped to `sort`, which sorts this sequence of strings so that identical tags are adjacent.
 - The output is piped to `uniq -c`, which prints each string only once, preceded by its number of occurrences.
 - The output is piped to `sort -rn`, which sorts the tags by number of occurrences in descending order.


This gives the following output:

```text
     250061 NN
     185484 IN
     164548 NNP
     162675 DT
     116598 VBZ
      93384 ,
     [...]
```

We redirect the output to a file `data/POS_tokens.csv` and read it with pandas:



In [None]:
df_pos = pd.read_csv("data/POS_tokens.csv", sep=" ", header=None)
df_pos.columns = ["occurrence", "token"]
# Pass figsize to DataFrame.plot: it creates its own figure, so a separate
# plt.figure() call would only leave an empty figure behind
df_pos.plot(x='token', y='occurrence', kind='bar', color='skyblue', figsize=(10, 6))
plt.title('Distribution of the POS field content in 5000 files', fontdict={'fontsize': 16, 'fontweight': 'bold'})
plt.xlabel('POS', fontdict={'fontsize': 14, 'fontweight': 'bold'})
plt.ylabel('Frequency', fontdict={'fontsize': 14, 'fontweight': 'bold'})
plt.xticks(rotation=90)
plt.show()
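For readers who prefer to stay in Python, the shell pipeline used earlier to tally POS tags can be approximated as below. This is a sketch rather than what we ran: the function name `count_pos_tags` is hypothetical, the regular expression mirrors the `grep` pattern, and the demo writes one tiny made-up file to a temporary directory instead of reading the real summaries.

```python
import gzip
import random
import re
import tempfile
from collections import Counter
from pathlib import Path

POS_PATTERN = re.compile(r"<POS>([^<]+)</POS>")


def count_pos_tags(folder, sample_size=5000, seed=0):
    """Count POS tags in a random sample of gzipped CoreNLP XML files."""
    files = sorted(Path(folder).glob("*.xml.gz"))
    random.Random(seed).shuffle(files)  # plays the role of `sort -R`
    counts = Counter()
    for path in files[:sample_size]:  # plays the role of `head -n5000`
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            counts.update(POS_PATTERN.findall(handle.read()))
    return counts


# Demo on a made-up file containing three tokens
with tempfile.TemporaryDirectory() as tmp:
    sample = (
        "<token><word>dog</word><POS>NN</POS></token>"
        "<token><word>barks</word><POS>VBZ</POS></token>"
        "<token><word>cat</word><POS>NN</POS></token>"
    )
    with gzip.open(Path(tmp) / "1.xml.gz", "wt", encoding="utf-8") as handle:
        handle.write(sample)
    demo_counts = count_pos_tags(tmp)

print(demo_counts.most_common())
```

The resulting `Counter` can be turned into a DataFrame directly, which avoids the intermediate CSV file.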

Now that we have a clear idea of the most commonly used POS tags, we can focus on the tags that are most relevant and insightful for our analysis.
Let's now see how many films in our dataset have gone through the Stanford CoreNLP pipeline.

### Running the CoreNLP pipeline on the OMDB plot summaries (fallback)

Out of a total of 11'681 films, 11'319 have been processed through the Stanford CoreNLP pipeline.

To work with the Stanford CoreNLP-processed data, we define the following functions:

1. **`get_sentence_word_metadata`**:
   - This function retrieves metadata from the CoreNLP pipeline output for a given movie ID. 
   - It extracts tokens, their lemmas, and part-of-speech (POS) tags from `.xml.gz` files.

2. **`filter_words_by_pos`**:
   - This function filters tokens based on specified POS tags, allowing us to focus on particular types of words, such as nouns, verbs, or adjectives.

3. **`filter_words_by_pos_ngram`**:
   - This function generates n-grams (sequences of n consecutive words) from the tokenized data, filtered by the specified POS tags. It is useful for capturing patterns and contexts in the text.
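To make these three roles concrete, here is a minimal sketch of how such functions could be written. It is not the implementation in `src/utils/data_utils.py`: the function names are hypothetical, we assume the usual CoreNLP XML layout (`document/sentences/sentence/tokens/token` with `word`, `lemma` and `POS` children), and we assume that an n-gram is kept only when all of its tokens carry an accepted tag.

```python
import gzip
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def read_token_metadata(xml_gz_path):
    """Return one list of (word, lemma, pos) tuples per sentence."""
    with gzip.open(xml_gz_path, "rb") as handle:
        root = ET.parse(handle).getroot()
    # Explicit paths avoid the <sentence> elements of the coreference section
    return [
        [(tok.findtext("word"), tok.findtext("lemma"), tok.findtext("POS"))
         for tok in sentence.findall("./tokens/token")]
        for sentence in root.findall("./document/sentences/sentence")
    ]


def lemmas_with_pos(sentences, pos_tags):
    """Keep the lemmas whose POS tag is in pos_tags."""
    return [lemma for sent in sentences for _, lemma, pos in sent if pos in pos_tags]


def lemma_ngrams_with_pos(sentences, pos_tags, n=2):
    """Join n consecutive lemmas when every token has an accepted tag."""
    ngrams = []
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            window = sent[i:i + n]
            if all(pos in pos_tags for _, _, pos in window):
                ngrams.append(" ".join(lemma for _, lemma, _ in window))
    return ngrams


# Made-up sample in the assumed CoreNLP layout
SAMPLE = """<root><document><sentences><sentence id="1"><tokens>
<token id="1"><word>Detectives</word><lemma>detective</lemma><POS>NNS</POS></token>
<token id="2"><word>chase</word><lemma>chase</lemma><POS>VBP</POS></token>
<token id="3"><word>bank</word><lemma>bank</lemma><POS>NN</POS></token>
<token id="4"><word>robbers</word><lemma>robber</lemma><POS>NNS</POS></token>
</tokens></sentence></sentences></document></root>"""

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "0.xml.gz"
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        handle.write(SAMPLE)
    sample_sentences = read_token_metadata(sample_path)

print(lemmas_with_pos(sample_sentences, {"NN", "NNS"}))
print(lemma_ngrams_with_pos(sample_sentences, {"NN", "NNS"}, n=2))
```

Working on lemmas rather than surface words means that "robbers" and "robber" are counted together, in line with the preprocessing described earlier.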

We also define one set of IDs per region ("America", "Europe", or "Both").

In [None]:
from src.utils.data_utils import filter_words_by_pos, filter_words_by_pos_ngram, get_sentence_word_metadata

# === Define Regions ===
# na=False treats movies without continent information as non-matching
region_ids = {
    "America": set(df[df['Continents'].str.contains("America", na=False)]['Wikipedia_ID'].astype(str)),
    "Europe": set(df[df['Continents'].str.contains("Europe", na=False)]['Wikipedia_ID'].astype(str)),
    "Both": set(df[df['Continents'].str.contains("Both", na=False)]['Wikipedia_ID'].astype(str))
}

Once we have extracted the words we want, the next step is to visualize them. To do this, we use a word cloud to display the most frequent words. In a word cloud, the size of each word represents its frequency of occurrence, which gives an immediate sense of which words dominate.

In [None]:
# === Word Cloud Generation ===
def generate_word_cloud(region, pos_tags, title, mask_image=None, sample_output=True, ngrams=1):
    """Generates a word cloud for a specified region and POS tags, with optional mask and sample output."""
    word_counter = Counter()
    for wiki_id in tqdm(region_ids[region]):
        tokens_metadata = get_sentence_word_metadata(folder_path, wiki_id)
        filtered_words = filter_words_by_pos(tokens_metadata, pos_tags) if ngrams == 1 else filter_words_by_pos_ngram(tokens_metadata, pos_tags, ngrams)
        word_counter.update(filtered_words)
    
    if sample_output:
        # print sample output for verification
        print(f"\nSample of filtered words for {region} - {title}:")
        print(word_counter.most_common(10))
    
    # word_counter.subtract(word_counter.most_common(6))
    
    # generate and display the word cloud with mask if provided
    wordcloud = WordCloud(width=800, height=400, background_color='white', mask=mask_image,
                          contour_color='black', contour_width=1).generate_from_frequencies(word_counter)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f"{title} Word Cloud for {region}")
    plt.show()

Using the POS tags defined above, we can group the tags into specific categories (e.g., nouns, verbs, adjectives) to filter tokens for our word cloud.

In [None]:

# === POS Tags Groups ===
noun_tags = ['NN', 'NNA', 'NNC', 'NNS', 'NNP', 'NNPS']
verb_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBS', 'VBZ']
adjective_tags = ['JJ', 'JJR', 'JJS', 'JJC', 'JJA', 'JJF', 'JJM']

# === Word Cloud Mask Images ===
americas_mask = np.array(Image.open("data/Location_North_America.png"))
europe_mask = np.array(Image.open("data/Location_Europe.png"))
both_mask = np.array(Image.open("data/Location_Both.png"))

In [None]:
# === Generate Word Clouds ===

# The region name is appended to the title by generate_word_cloud
generate_word_cloud("America", noun_tags, "Noun Bigrams", mask_image=americas_mask, ngrams=2)
generate_word_cloud("Europe", verb_tags, "Verb Bigrams", mask_image=europe_mask, ngrams=2)

### [Experimental!] Applying Topic Modeling using LDA
We aim to uncover the underlying themes in movie plot summaries from the **America**, **Europe**, and **Both** datasets using **Latent Dirichlet Allocation (LDA)**. This approach allows us to explore and compare thematic similarities and differences across regions. The methodology is inspired by this TDS article: [End-to-End Topic Modeling in Python](https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0).

We preprocessed the plot data by tokenizing, lemmatizing, and removing stopwords. These cleaned tokens are joined into complete strings and combined into a single corpus (**America**, **Europe**, and **Both**) to create a shared vocabulary using the `TfidfVectorizer`.

This shared vocabulary ensures consistent representation of words and phrases across all datasets. After fitting the vectorizer, we transform each region’s text into numerical representations (TF-IDF matrices). This allows us to analyze and compare themes and topics uniformly across regions.

In [None]:
documents_america = tokenized_plots_america.dropna().apply(lambda tokens: ' '.join(tokens))
documents_europe = tokenized_plots_europe.dropna().apply(lambda tokens: ' '.join(tokens))
documents_both = tokenized_plots_both.dropna().apply(lambda tokens: ' '.join(tokens))

all_documents = pd.concat([documents_america, documents_europe, documents_both])

In [None]:
# Fit vectorizer on the combined corpus
vectorizer = TfidfVectorizer(
    max_features=5000,
    stop_words='english',
    min_df=5,
    max_df=0.85,
    ngram_range=(1, 2)
)
vectorizer.fit(all_documents)

# Transform each region separately
tfidf_america = vectorizer.transform(documents_america)
tfidf_europe = vectorizer.transform(documents_europe)
tfidf_both = vectorizer.transform(documents_both)
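As a reminder of what these matrices contain, and assuming `TfidfVectorizer` keeps its scikit-learn defaults for the options not set above (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), each entry is computed as follows, where $n$ is the number of documents seen during `fit`, $\mathrm{tf}(t, d)$ is the count of term $t$ in document $d$, and $\mathrm{df}(t)$ is the number of fitted documents containing $t$:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Each document vector is then rescaled to unit Euclidean norm. Because the vectorizer is fitted once on `all_documents`, $n$ and $\mathrm{df}(t)$ are shared, so a given term receives the same IDF weight in all three regional matrices.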

We fit separate **Latent Dirichlet Allocation (LDA)** models for each region using their respective TF-IDF matrices. Each model identifies 3 latent topics (`n_components=3`) within the text data for its region. 


In [None]:
# Fit LDA models
lda_america = LatentDirichletAllocation(n_components=3, random_state=42)
lda_america.fit(tfidf_america)

lda_europe = LatentDirichletAllocation(n_components=3, random_state=42)
lda_europe.fit(tfidf_europe)

lda_both = LatentDirichletAllocation(n_components=3, random_state=42)
lda_both.fit(tfidf_both)

We use the `display_topics` function to extract and print the top words for each topic generated by the LDA models. 

This function:
1. Loops through each topic in the model.
2. Selects the top `no_top_words` most important words for the topic (based on their weights).
3. Prints these words to summarize the theme of each topic.

The `tfidf_feature_names`, which represent the vocabulary created by the `TfidfVectorizer`, are used to map the word indices back to their original terms.

In [None]:
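def display_topics(model, feature_names, no_top_words=10):
    """Print the highest-weighted words of each topic, most important first."""
    for topic_idx, topic in enumerate(model.components_):
        # argsort is ascending, so reverse the last `no_top_words` indices
        top_indices = topic.argsort()[-no_top_words:][::-1]
        print(f"Topic {topic_idx + 1}:")
        print(" ".join(feature_names[i] for i in top_indices))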

tfidf_feature_names = vectorizer.get_feature_names_out()

print("America Topics:")
display_topics(lda_america, tfidf_feature_names)

print("\nEurope Topics:")
display_topics(lda_europe, tfidf_feature_names)

print("\nBoth Topics:")
display_topics(lda_both, tfidf_feature_names)

The results are still somewhat messy, but we can infer some patterns. For example, **Topic 1** from the America dataset seems to relate to crime and detective themes. However, the other topics are less clear and need refinement. This was just an initial experiment, and we aim to improve the results by tweaking the parameters and preprocessing steps in future iterations.

### PyLDAvis Visualization

1. **Intertopic Distance Map (Left Panel)**:
   - Each circle represents a topic, and its size reflects how prevalent the topic is in the dataset.
   - The distance between circles shows how distinct the topics are (closer = more similar, farther = more different).

2. **Top-30 Relevant Terms (Right Panel)**:
   - Displays the most relevant words for the selected topic.
   - **Red bars**: Frequency of the word within the selected topic.
   - **Blue bars**: Overall frequency of the word across the dataset.

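3. **λ Slider**:
   - Adjusts how terms are ranked, trading off a term's frequency within the topic against its exclusivity to that topic.
   - **λ = 1**: Ranks terms by their probability within the topic, which favors frequent terms even when they are common across the whole dataset.
   - **λ = 0**: Ranks terms by their lift (probability within the topic divided by overall probability), which favors terms specific to the topic.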
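The ranking behind the λ slider is the relevance measure introduced with LDAvis by Sievert and Shirley (2014): $r(w, t \mid \lambda) = \lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}$. The sketch below applies it to two terms with made-up probabilities, purely to show how the ordering changes with λ. The numbers and the helper names `relevance` and `rank` are illustrative assumptions, not values from our models.

```python
import math


def relevance(p_term_given_topic, p_term, lam):
    """Relevance of a term for a topic: weighted log-probability and log-lift."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)


# Made-up probabilities: (p(w | t), p(w))
toy_terms = {
    "man": (0.050, 0.040),        # frequent in the topic and everywhere else
    "detective": (0.020, 0.004),  # rarer overall, concentrated in the topic
}


def rank(lam):
    """Order the toy terms by decreasing relevance for a given lambda."""
    return sorted(toy_terms, key=lambda w: relevance(*toy_terms[w], lam), reverse=True)


print("lambda = 1:", rank(1.0))
print("lambda = 0:", rank(0.0))
```

With λ = 1 the more frequent term comes first, while with λ = 0 the term with the higher lift does. Intermediate values blend the two criteria.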

In [None]:
from IPython.display import display

pyLDAvis.enable_notebook()

# Only the last expression of a cell is rendered automatically,
# so each visualization is passed to display() explicitly.

# For America
vis_america = pyLDAvis.lda_model.prepare(lda_america, tfidf_america, vectorizer)
display(pyLDAvis.display(vis_america))

# For Europe
vis_europe = pyLDAvis.lda_model.prepare(lda_europe, tfidf_europe, vectorizer)
display(pyLDAvis.display(vis_europe))

# For Both
vis_both = pyLDAvis.lda_model.prepare(lda_both, tfidf_both, vectorizer)
display(pyLDAvis.display(vis_both))