In [1]:
import pandas as pd

A key factor in an actor's career trajectory is their ability to play a diverse range of characters, which serves as a measure of their versatility. To assess the range of personas portrayed by the actors in our dataset, we use the method described by Bamman, O'Connor, and Smith in "Learning Latent Personas of Film Characters" (ACL 2013).

In [None]:
# The persona model clusters the characters into 50 personas and assigns each
# character a probability of belonging to each persona.

persona_proba_file_path = './data/25.100.lda.log.txt'
proba_columns = [f'proba_{p}' for p in range(1, 51)]
columns = ['e.id', 'Wikipedia Movie id', 'Movie name', 'charName', 'fullName', 'occurrences', 'max', 'probas']
persona_prob_df = pd.read_csv(persona_proba_file_path, sep='\t', header=None, names=columns)
persona_prob_df.head()

In [None]:
persona_prob_df['probas'] = persona_prob_df['probas'].str.split()
persona_prob_df[proba_columns] = pd.DataFrame(persona_prob_df['probas'].tolist(), dtype=float)
persona_prob_df = persona_prob_df.drop(columns=['probas'])
persona_prob_df.head()
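
As an illustration of the parsing step above, here is a minimal sketch on toy data. The three-persona frame and its values are made up for the example (the real file has 50 probabilities per row), and `str.split(expand=True)` is shown as an alternative way to get one column per probability, not as the approach used in this notebook.

```python
import pandas as pd

# Toy stand-in for the real file: 3 personas instead of 50 (made-up values).
toy = pd.DataFrame({
    'e.id': ['/m/aaa', 'e12'],
    'probas': ['0.1 0.7 0.2', '0.5 0.25 0.25'],
})
toy_cols = [f'proba_{p}' for p in range(1, 4)]

# expand=True returns one column per whitespace-separated token;
# astype(float) converts the string tokens to numbers.
split = toy['probas'].str.split(expand=True).astype(float)
split.columns = toy_cols

# Replace the raw string column with the numeric columns.
toy_wide = pd.concat([toy.drop(columns='probas'), split], axis=1)
print(toy_wide)
```

Both frames share the same default index here, so `pd.concat(..., axis=1)` lines the rows up as expected.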

The resulting dataframe holds the probability vector associated with each character or entity in each movie. To combine this dataset with the characters dataframe, we keep only the characters whose "char/actor freebase id" is known. These IDs start with "/m".

In [None]:
# Keep only the rows whose ID is a Freebase ID (starts with '/m').
is_freebase_id = persona_prob_df['e.id'].str.startswith('/m', na=False)
persona_proba_filtered = persona_prob_df.loc[is_freebase_id].reset_index(drop=True)
persona_proba_filtered.head()

The idea is to associate one persona with each character, so we keep only the persona with the highest probability.

In [None]:
from tqdm import tqdm
max_proba = []
for i in tqdm(range(persona_proba_filtered.shape[0])):
    row = persona_proba_filtered[proba_columns].iloc[i] 
    max_idx = row.argmax()   
    max_proba.append(row.index[max_idx])

In [None]:
persona_proba_filtered['Persona']=max_proba
persona_df=persona_proba_filtered.drop(proba_columns, axis=1)
persona_df['Persona']=persona_df['Persona'].apply(lambda x: x.split('_')[1])
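
The row-by-row loop above can also be written without an explicit loop. The sketch below uses made-up toy probabilities for three personas and pandas' `idxmax(axis=1)`; it is an alternative formulation, and it is only assumed to match the loop when every row has at least one non-missing probability.

```python
import pandas as pd

# Toy probabilities for 3 personas (made-up values, not from the real file).
toy = pd.DataFrame({
    'proba_1': [0.1, 0.5],
    'proba_2': [0.7, 0.2],
    'proba_3': [0.2, 0.3],
})
toy_cols = ['proba_1', 'proba_2', 'proba_3']

# idxmax(axis=1) returns, for each row, the label of the column
# that holds the row maximum (the first one in case of ties).
best_col = toy[toy_cols].idxmax(axis=1)

# Strip the 'proba_' prefix to keep only the persona number, as a string.
toy['Persona'] = best_col.str.split('_').str[1]
print(toy['Persona'].tolist())
```

On the full table this avoids one Python-level iteration per character, which is usually much faster than the loop.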

In [None]:
all_characters = pd.read_csv('Data/MovieSummaries/character.metadata.tsv', sep='\t', header=None)
all_characters.columns = ['Wikipedia movie ID', 'Freebase movie ID', 'Movie release date','Character name','Actor date of birth','Actor gender','Actor height','Actor ethnicity','Actor name','Actor age at movie release','Freebase character/actor map ID','Freebase character ID','Freebase actor ID']
all_characters.head()

In [None]:
# Merge the obtained persona dataframe with the characters dataframe
merged_persona_character=pd.merge(all_characters, persona_df, left_on='Freebase character/actor map ID', right_on='e.id')
merged_persona_character.head(2)
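
Because the merge above is an inner join, characters without a persona (and personas without a matching character) are dropped silently. A minimal sketch of how the match rate could be checked, using made-up IDs and the `indicator=True` option of `pd.merge`:

```python
import pandas as pd

# Toy frames with made-up IDs; column names mirror the ones used above.
characters = pd.DataFrame({
    'Freebase character/actor map ID': ['/m/a1', '/m/a2', '/m/a3'],
    'Actor name': ['Actor A', 'Actor B', 'Actor C'],
})
personas = pd.DataFrame({
    'e.id': ['/m/a1', '/m/a3', '/m/zz'],
    'Persona': ['2', '17', '5'],
})

# An outer join with indicator=True adds a '_merge' column telling whether
# each row came from the left frame, the right frame, or both.
checked = pd.merge(
    characters, personas,
    left_on='Freebase character/actor map ID', right_on='e.id',
    how='outer', indicator=True,
)
counts = checked['_merge'].value_counts()
print(counts)

# Rows present on both sides are exactly what the inner join would keep.
matched = checked[checked['_merge'] == 'both'].drop(columns='_merge')
```

If the `left_only` or `right_only` counts are large on the real data, it may be worth checking how the IDs are formatted in the two files before relying on the merged table.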

In [None]:
# index=False avoids writing the row index as an extra unnamed column
merged_persona_character.to_csv('persona_data.csv', index=False)
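
One way the merged table could then be used to gauge the range of personas an actor has played is to count the distinct personas per actor. The sketch below does this on made-up rows. Counting distinct personas is an assumption made here for illustration, not necessarily the measure used later in the analysis.

```python
import pandas as pd

# Toy stand-in for the merged table (made-up actors and personas).
merged = pd.DataFrame({
    'Actor name': ['Actor A', 'Actor A', 'Actor A', 'Actor B'],
    'Persona': ['2', '17', '2', '5'],
})

# nunique() counts the distinct persona labels within each actor's rows.
personas_per_actor = merged.groupby('Actor name')['Persona'].nunique()
print(personas_per_actor.to_dict())
```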