# C. DPM Trope Clustering

This notebook tests the clustering algorithm proposed in the paper (Bamman et al., 2013). It performed poorly for our purpose.

In [1]:
import os

os.chdir("..")

In [2]:
# Bag-of-words JSON output of build_char_word_bags.py
bags_folder = os.path.abspath('data/char_bags')
# Character tropes from the CMU dataset
tropes_file = os.path.abspath('data/tropes/tvtropes.clusters.txt')

In [3]:
from trope_clustering.src.trope_clustering.helpers.DPM_utilities import load_character_bags

#load character bags of words
char_full = load_character_bags(bags_folder)

In [4]:
from trope_clustering.src.trope_clustering.helpers.DPM_utilities import get_dt_matrix

dt_matrix, agent_features, patient_features, attribute_features = get_dt_matrix(char_full)
combined_features = list(agent_features) + list(patient_features) + list(attribute_features)

print("Number of agent features:", len(agent_features))
print("Number of patient features:", len(patient_features))
print("Number of attribute features:", len(attribute_features))
print("Document-term matrix shape:\n", dt_matrix.shape)

Number of agent features: 9291
Number of patient features: 8627
Number of attribute features: 16799
Document-term matrix shape:
 (41529, 34717)


In [5]:
from trope_clustering.src.trope_clustering.helpers.DPM_utilities import load_tropes

#load character tropes
tropes = load_tropes(tropes_file)
print(f"Sample size: {len(tropes)}")
print(f"Number of unique tropes: {len(set(tropes.values()))}")

Sample size: 434
Number of unique tropes: 72


In [6]:
from sklearn.decomposition import LatentDirichletAllocation

# One persona per unique trope (72). No random_state is set, so clusters differ between runs.
lda = LatentDirichletAllocation(n_components=72)
persona_distributions = lda.fit_transform(dt_matrix)

In [7]:
print("Top 5 words for the first 5 personas:")

for idx, topic in enumerate(lda.components_):
    if idx > 4:
        break
    top_indices = topic.argsort()[-5:][::-1]
    print(f"Persona {idx}: {[combined_features[i] for i in top_indices]}")

Top 5 words for the first 5 personas:
Persona 0: ['friend', 'fall', 'new', 'best', 'have']
Persona 1: ['crush', 'star', 'find', 'decide', 'reveal']
Persona 2: ['captain', 'sister', 'tom', 'king', 'team']
Persona 3: ['start', 'take', 'tell', 'leave', 'go']
Persona 4: ['learn', 'leader', 'meet', 'sir', 'journalist']


In [8]:
from trope_clustering.src.trope_clustering.helpers.DPM_utilities import get_overlapping_characters

# Keep only characters that have both a bag of words and a known trope,
# i.e. characters present in both the bag-of-words dataset and the trope test set.
char_filtered, tropes_filtered = get_overlapping_characters(char_full, tropes)

print(f"Sample size of overlapping chars: {len(tropes_filtered)}")

Sample size of overlapping chars: 131


In [9]:
from trope_clustering.src.trope_clustering.helpers.DPM_utilities import get_clusters_dictionary
cluster_dict = get_clusters_dictionary(char_full, char_filtered, persona_distributions.argmax(axis=1), len(set(tropes.values())))

from trope_clustering.src.trope_clustering.helpers.DPM_utilities import get_tropes_dictionary
grouped_by_trope = get_tropes_dictionary(tropes_filtered)

In [10]:
char_of_interest = ["Darth Vader", "Hannibal Lecter", "Anakin Skywalker", "Agent Smith", "Nancy Thompson"]
for char in char_of_interest:
    print(f"{char}:\n")
    print(f"Prediction cluster containing {char} (Only showing characters who exist in tvtropes.clusters.txt):")
    for cluster, characters in cluster_dict.items():
        if char in characters: print(f"Cluster {cluster}: {', '.join(characters)}")
    print(f"\nGroundtruth cluster containing {char} (Only showing characters who exist in tvtropes.clusters.txt):")
    for trope, characters in grouped_by_trope.items():
        if char in characters: print(f"{trope}: {', '.join(characters)}")
    print("\n=====================\n")

Darth Vader:

Prediction cluster containing Darth Vader (Only showing characters who exist in tvtropes.clusters.txt):
Cluster 22: Giacomo Casanova, Darth Vader, Guy

Groundtruth cluster containing Darth Vader (Only showing characters who exist in tvtropes.clusters.txt):
master_swordsman: Obi-Wan Kenobi, Blade, Darth Vader, Bill, Aragorn


Hannibal Lecter:

Prediction cluster containing Hannibal Lecter (Only showing characters who exist in tvtropes.clusters.txt):
Cluster 3: Hannibal Lecter, Harold Lee

Groundtruth cluster containing Hannibal Lecter (Only showing characters who exist in tvtropes.clusters.txt):
cultured_badass: Hannibal Lecter, Dorian Gray


Anakin Skywalker:

Prediction cluster containing Anakin Skywalker (Only showing characters who exist in tvtropes.clusters.txt):
Cluster 40: Captain Nemo, Mr. Kesuke Miyagi, Ashley, Anakin Skywalker

Groundtruth cluster containing Anakin Skywalker (Only showing characters who exist in tvtropes.clusters.txt):
gadgeteer_genius: Tony Star

### Why use the Dirichlet Persona Model:
The model used here to cluster the characters by trope is the Dirichlet Persona Model. We tried it because it is the method described in the paper: <br> Bamman, D., O’Connor, B., & Smith, N. A. (2013). Learning latent personas of film characters. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), 352–361.

### How it works:
Before using this model, we need to preprocess the movie summaries into character bags of words. Each character has a bag of words extracted from the movie summaries. A bag contains attributes, agent verbs (actions the character performed), and patient verbs (actions performed on the character). <br>
From these bags of words we build a document-term matrix with one row per character. A verb that appears once as agent and once as patient is counted as two different features. <br>
Using the document-term matrix, the model (fit in this notebook with scikit-learn's `LatentDirichletAllocation`) associates each character with a latent persona. We take the persona with the highest probability as the character's cluster.

### Initial concerns:
Before implementing this method, our main concerns were the need to specify the number of clusters in advance and the need to manually assign a trope to each cluster. We try to simplify both by using the tvtropes.clusters.txt dataset: a predicted cluster can be matched to a dataset trope if their intersection is large enough, and we therefore set the number of clusters to the number of tropes in the dataset (72). <br>
Another concern is the loss of context and structure when a movie summary is reduced to bags of words before clustering with this method.
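
The matching idea described above could be sketched as follows. This is not code from the project: the function `match_clusters_to_tropes` and the `min_overlap` threshold are hypothetical, and the inputs only mimic the shape of `cluster_dict` and `grouped_by_trope` (a mapping from a cluster id or trope name to a list of character names).

```python
def match_clusters_to_tropes(clusters, tropes, min_overlap=2):
    """Label each cluster with the trope sharing the most characters with it.

    A cluster stays unlabeled (None) when its best overlap is below min_overlap.
    """
    labels = {}
    for cluster_id, members in clusters.items():
        member_set = set(members)
        best_trope, best_overlap = None, 0
        for trope, characters in tropes.items():
            overlap = len(member_set & set(characters))
            if overlap > best_overlap:
                best_trope, best_overlap = trope, overlap
        labels[cluster_id] = best_trope if best_overlap >= min_overlap else None
    return labels


# Hypothetical toy data, not taken from the dataset.
toy_clusters = {0: ["A", "B", "C"], 1: ["D", "E"], 2: ["F"]}
toy_tropes = {"trope_x": ["A", "B", "F"], "trope_y": ["C", "D"]}

toy_labels = match_clusters_to_tropes(toy_clusters, toy_tropes)
print(toy_labels)  # {0: 'trope_x', 1: None, 2: None}
```

With a threshold of 2, only cluster 0 gets a label here. A greedy match like this can also assign the same trope to several clusters, so a one-to-one assignment would need an extra step.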

### Findings:
The results of using this method are shown above. We treat tvtropes.clusters.txt as the ground truth. <br>
There are 131 characters with a ground-truth trope and 72 clusters in total, so instead of printing everything we highlight the results for 5 characters. <br>
The predictions are only clusters that we have not yet associated with any trope. By showing a character's predicted cluster next to its ground-truth cluster, we can check whether any other characters appear in both. <br>
The model performs very poorly. For all the characters we chose to highlight, the predicted cluster and the ground-truth trope share no other character. Re-running the code produces different clusters and results because of randomness. In other runs the model often yields a few clusters in which 2 characters share the same trope, but these are usually accompanied by several misclassified characters. <br>
Because of these poor results, we changed our approach to trope clustering/classification and abandoned this method completely.
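
The qualitative comparison above could also be summarized with a single number. The sketch below is an assumption about how that might be done and was not part of our analysis. It computes cluster purity with a hypothetical helper `cluster_purity`, and the adjusted Rand index with scikit-learn's `adjusted_rand_score`, on toy labels rather than on the real predictions.

```python
from collections import Counter, defaultdict

from sklearn.metrics import adjusted_rand_score


def cluster_purity(true_labels, predicted_clusters):
    """Fraction of items that carry the majority true label of their cluster."""
    by_cluster = defaultdict(list)
    for label, cluster in zip(true_labels, predicted_clusters):
        by_cluster[cluster].append(label)
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in by_cluster.values()
    )
    return majority_total / len(true_labels)


# Hypothetical toy labels: true tropes and predicted cluster ids for 6 characters.
toy_true = ["x", "x", "y", "y", "z", "z"]
toy_pred = [0, 0, 1, 2, 2, 1]

purity = cluster_purity(toy_true, toy_pred)
ari = adjusted_rand_score(toy_true, toy_pred)
print(f"purity={purity:.3f}, adjusted Rand index={ari:.3f}")
```

Purity rises mechanically as clusters get smaller, which matters with 72 clusters for 131 characters, so the adjusted Rand index (which corrects for chance agreement) is probably the more informative of the two.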