# Dataset Expansion

This notebook expands the intersection dataset, i.e. the intersection of the CMU dataset and the Bechdel dataset, which was computed and dumped previously in bechdel_intersection.ipynb. We add the following features:
* Female Cast Ratio
* Sentiment Analysis (neutral, negative, positive)
* Summary Pronoun Density
* Summary Gender Mention Density
* GII ([Source](https://ourworldindata.org/grapher/gender-inequality-index-from-the-human-development-report?tab=chart))
* HDI ([Source](https://ourworldindata.org/grapher/human-development-index))


## Load the intersection & align the following datasets so that they all have matching Wikipedia movie IDs:
* Plot summaries
* Movie metadata with Bechdel intersection
* Characters metadata
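
The alignment amounts to keeping only the movie IDs that are shared between tables; the notebook does this with successive `isin` filters. A minimal sketch of the idea in plain Python, using a handful of IDs for illustration (the names `summary_ids`, `movie_ids`, and `character_ids` are hypothetical and not part of the notebook):

```python
# Hypothetical sets of Wikipedia movie IDs present in each table
summary_ids = {171005, 77856, 12053509}
movie_ids = {171005, 77856, 1369204}
character_ids = {171005, 77856, 12053509, 1369204}

# Only IDs found in all three tables survive the alignment
shared_ids = summary_ids & movie_ids & character_ids
print(sorted(shared_ids))
```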

In [8]:
import pandas as pd
import numpy as np
from nltk.tokenize import RegexpTokenizer
pd.options.mode.chained_assignment = None
import time
from tqdm import tqdm
import matplotlib.pyplot as plt

movie_metadata = pd.read_csv('MovieSummaries/movie.metadata.tsv',sep='\t')

movie_metadata.columns = ['1. Wikipedia movie ID',
                          '2. Freebase movie ID',
                          '3. Movie name',
                          '4. Movie release date',
                          '5. Movie box office revenue',
                          '6. Movie runtime',
                          '7. Movie languages (Freebase ID:name tuples)',
                          '8. Movie countries (Freebase ID:name tuples)',
                          '9. Movie genres (Freebase ID:name tuples)']

character_metadata = pd.read_csv('MovieSummaries/character.metadata.tsv',sep='\t')

character_metadata.columns = ['1. Wikipedia movie ID',
                              '2. Freebase movie ID',
                              '3. Movie release date',
                              '4. Character name',
                              '5. Actor date of birth',
                              '6. Actor gender',
                              '7. Actor height (in meters)',
                              '8. Actor ethnicity (Freebase ID)',
                              '9. Actor name',
                              '10. Actor age at movie release',
                              '11. Freebase character/actor map ID',
                              '12. Freebase character ID',
                              '13. Freebase actor ID']

movie_metadata_bechdel = pd.read_csv("CMU_bechdel_added.csv")
print(movie_metadata_bechdel.shape)
movie_metadata_bechdel = movie_metadata_bechdel.drop("Unnamed: 0", axis=1)
movie_metadata_bechdel["actor_mention_score"] = pd.Series(np.zeros((movie_metadata_bechdel.shape[0],))) #add the new column

character_metadata_bechdel = character_metadata.copy(deep = True)
print("Characters: Size before:", character_metadata_bechdel.shape)
character_metadata_bechdel = character_metadata_bechdel[character_metadata_bechdel['2. Freebase movie ID'].isin(movie_metadata_bechdel["2. Freebase movie ID"].to_numpy())]
print("Characters: Size after:", character_metadata_bechdel.shape)

print("Movies: Size before:", movie_metadata_bechdel.shape)
movie_metadata_bechdel = movie_metadata_bechdel[movie_metadata_bechdel['2. Freebase movie ID'].isin(character_metadata_bechdel["2. Freebase movie ID"].to_numpy())]
print("Movies: Size after:", movie_metadata_bechdel.shape)

(6521, 11)
Characters: Size before: (450668, 13)
Characters: Size after: (72458, 13)
Movies: Size before: (6521, 11)
Movies: Size after: (6202, 11)


In [9]:
plot_summaries=pd.read_csv('MovieSummaries/plot_summaries.txt', sep='\t', header=None, names=['id', 'plot_summary'])
plot_summaries_bechdel = plot_summaries[plot_summaries['id'].isin(character_metadata_bechdel['1. Wikipedia movie ID'].to_numpy())]

plot_summaries.head()
print(plot_summaries.shape)
print(plot_summaries_bechdel.shape)

print("Movie Metadata Before sync:",movie_metadata_bechdel.shape)
movie_metadata_bechdel = movie_metadata_bechdel[movie_metadata_bechdel['1. Wikipedia movie ID'].isin(plot_summaries_bechdel["id"].to_numpy())]
print("Movie Metadata After sync:",movie_metadata_bechdel.shape)

(42303, 2)
(5194, 2)
Movie Metadata Before sync: (6202, 11)
Movie Metadata After sync: (5194, 11)


## Summary Gender Mention Density
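We calculate the following simple ratio, where each count is the number of times characters of that gender are mentioned in the summary:
$$\frac{\text{mentions}_{\text{female}}}{\text{mentions}_{\text{female}} + \text{mentions}_{\text{male}}}$$

For each movie, we take its plot summary and its rows in the character metadata. We match the characters' first names (only those that are unique within the cast) against the summary tokens, look up the gender recorded for each matched character, and compute the ratio above. Movies with no matched names get NaN.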
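
As a rough illustration of this ratio, here is a small self-contained sketch in plain Python. It is not the notebook's implementation: the helper `mention_ratio`, the toy summary, and the `cast` mapping are all hypothetical, and it uses `re` instead of NLTK for tokenization.

```python
import re
from collections import Counter

def mention_ratio(summary, first_name_gender):
    """Share of female character mentions among all gendered mentions."""
    # Lowercase and split on word characters, then count each token
    freq = Counter(re.findall(r"\w+", summary.lower()))
    female = sum(freq[n] for n, g in first_name_gender.items() if g == "F")
    male = sum(freq[n] for n, g in first_name_gender.items() if g == "M")
    total = female + male
    # No matched names: the ratio is undefined
    return female / total if total else float("nan")

# Hypothetical summary and cast (first name -> gender)
summary = "Mary meets Bert. Mary and Jane fly away while Bert sings."
cast = {"mary": "F", "jane": "F", "bert": "M"}
print(mention_ratio(summary, cast))
```

Here "mary" and "jane" account for three mentions and "bert" for two, so the ratio is 3 / 5.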

In [10]:
def calculate_actor_mention_score(movie_idx):
    movie_summary = plot_summaries_bechdel.iloc[movie_idx]["plot_summary"]
    
    #Tokenize the movie summary
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(movie_summary)
    tokens = [x.lower() for x in tokens]
    tokens_freq = pd.Series(tokens).value_counts(sort=True)
    
    #align movie dataset & character dataset
    movie_id = plot_summaries_bechdel.iloc[movie_idx]["id"]
    character_list = character_metadata_bechdel[character_metadata_bechdel['1. Wikipedia movie ID'] == movie_id][['4. Character name','6. Actor gender']]
    
    character_list_processed = character_list.copy()
    character_list_processed = character_list_processed.dropna()

    #Lowercase character names
    character_list_processed["4. Character name"] = character_list_processed["4. Character name"].str.lower()

    #Split full name and only get the first name
    character_list_processed["4. Character name"] = character_list_processed["4. Character name"].str.split(' ').str[0]

    character_list_processed = character_list_processed.drop(character_list_processed[character_list_processed["4. Character name"] == "the"].index)

    character_gender_stacked = character_list_processed.drop_duplicates(subset='4. Character name', keep=False)
    character_gender_stacked_idx = character_gender_stacked.set_index("4. Character name")

    #Keep only the characters whose first name appears among the summary tokens
    character_mention_freq = character_gender_stacked[character_gender_stacked["4. Character name"].isin(tokens_freq.index)].copy()

    #Add the number of mentions in the summary, looked up by name so that rows and counts stay aligned
    character_mention_freq["no_mention"] = character_mention_freq["4. Character name"].map(tokens_freq)
    character_mention_freq.columns = ["character_name", "gender", "no_mention"]
    character_list_final = character_mention_freq
    
    #Group by gender and sum the mentions per gender (a missing gender counts as 0 mentions)
    mentions_by_gender = character_list_final.groupby('gender')['no_mention'].sum()
    female_mention = mentions_by_gender.get("F", 0)
    male_mention = mentions_by_gender.get("M", 0)

    #The ratio is undefined when no character name was found in the summary
    if female_mention + male_mention > 0:
        mention_ratio = female_mention/(female_mention + male_mention)
    else:
        mention_ratio = np.nan

    actor_mention_score = round(mention_ratio, 4)
    
    return actor_mention_score, movie_id


In [11]:
#Iterate over every aligned plot summary, starting from position 0
for a in tqdm(range(plot_summaries_bechdel.shape[0])):
    actor_mention_score, movie_id = calculate_actor_mention_score(a)
    
    movie_metadata_bechdel.loc[movie_metadata_bechdel['1. Wikipedia movie ID'] == movie_id, "actor_mention_score"] = actor_mention_score


In [78]:
display(movie_metadata_bechdel)

Unnamed: 0,1. Wikipedia movie ID,2. Freebase movie ID,3. Movie name,4. Movie release date,5. Movie box office revenue,6. Movie runtime,7. Movie languages (Freebase ID:name tuples),8. Movie countries (Freebase ID:name tuples),9. Movie genres (Freebase ID:name tuples),bechdel_score,actor_mention_score
1,171005,/m/016ywb,Henry V,1989-11-08,10161099.0,137.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/04xvh5"": ""Costume drama"", ""/m/082gq"": ""Wa...",3,0.2500
2,77856,/m/0kcn7,Mary Poppins,1964-08-27,102272727.0,139.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0hj3myq"": ""Children's/Family"", ""/m/04t36""...",3,0.5345
5,12053509,/m/02vn81r,Loverboy,1989-04-28,3960327.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/02l7c8"": ""...",3,0.2830
6,1369204,/m/04x8zs,Juarez,1939,,125.0,"{""/m/02h40lc"": ""English Language"", ""/m/06nm1"":...","{""/m/09c7w0"": ""United States of America""}","{""/m/04xvh5"": ""Costume drama"", ""/m/03bxz7"": ""B...",2,
7,5664529,/m/0dyy_v,Vixen!,1968,,70.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01yldk"": ""Softcore Porn"", ""/m/06b0n3"": ""S...",3,
...,...,...,...,...,...,...,...,...,...,...,...
6514,25920477,/m/0b6lqyd,Source Code,2011-03-11,147332697.0,93.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0f8l9c"": ""France"", ""/m/09c7w0"": ""United S...","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",1,0.7500
6516,1191380,/m/04f_y7,Wilde,1997,2158775.0,118.0,"{""/m/02h40lc"": ""English Language""}","{""/m/014tss"": ""Kingdom of Great Britain"", ""/m/...","{""/m/0hn10"": ""LGBT"", ""/m/017fp"": ""Biography"", ...",3,0.1667
6517,54540,/m/0f7hw,Coming to America,1988-06-29,288752301.0,117.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/03p5xs"": ""...",3,0.5200
6519,20244619,/m/04_0j2b,Mirage,1972,,82.0,"{""/m/06nm1"": ""Spanish Language""}","{""/m/016wzw"": ""Peru""}","{""/m/07s9rl0"": ""Drama"", ""/m/03q4nz"": ""World ci...",3,


## Sentiment Analysis

In this section, we estimate the mood of each plot summary with a RoBERTa model trained on Twitter data (`cardiffnlp/twitter-roberta-base-sentiment`). We run the model on the tokenized summary, truncated to 512 tokens, so we do not need further preprocessing such as stop-word removal. In return, we get the probabilities of the following three moods:
1. Negative
2. Neutral
3. Positive
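
The model returns one raw score (logit) per mood, and a softmax turns these into probabilities that sum to 1. A minimal sketch of that step in plain Python; the logit values are hypothetical and chosen only for illustration:

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtract the maximum before exponentiating for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order negative, neutral, positive
logits = [0.5, 2.0, -1.0]
probs = softmax(logits)
print(dict(zip(["negative", "neutral", "positive"], probs)))
```

The largest logit always receives the largest probability, so in this toy case "neutral" dominates.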

In [None]:
import torch
from scipy.special import softmax  #assumption: the original cell does not show where softmax comes from
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "cardiffnlp/twitter-roberta-base-sentiment"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForSequenceClassification.from_pretrained(MODEL).to(device)
model.eval()


scores_final=[]
for i, row in tqdm(plot_summaries.iterrows(), total=plot_summaries.shape[0]):
    #Summaries longer than 512 tokens are truncated
    encoded_text = tokenizer(row['plot_summary'], return_tensors='pt', truncation=True, max_length=512).to(device)

    #Inference only, so no gradients are needed
    with torch.no_grad():
        output = model(**encoded_text)

    #Logits of the single summary in the batch
    scores = output[0][0].cpu().numpy()

    #Convert the logits into probabilities
    scores = softmax(scores)

    scores_dict = {
        'index': row['id'],
        'negative': scores[0],
        'neutral': scores[1],
        'positive': scores[2]
    }
    scores_final.append(scores_dict)
    
sentiments=pd.DataFrame(scores_final)
sentiments.to_csv("sentiment_analysis.csv", index=False)

In [79]:
sentiments_sync = sentiments[sentiments["index"].isin(movie_metadata_bechdel["1. Wikipedia movie ID"])]
sentiments_sync.columns = ["1. Wikipedia movie ID", "negative", "neutral", "positive"]
movie_metadata_bechdel_sentiment = pd.merge(movie_metadata_bechdel, sentiments_sync, on='1. Wikipedia movie ID', how="left")
print("After sentiment scores added:",movie_metadata_bechdel_sentiment.shape)

display(movie_metadata_bechdel_sentiment)

After sentiment scores added: (5194, 14)


Unnamed: 0,1. Wikipedia movie ID,2. Freebase movie ID,3. Movie name,4. Movie release date,5. Movie box office revenue,6. Movie runtime,7. Movie languages (Freebase ID:name tuples),8. Movie countries (Freebase ID:name tuples),9. Movie genres (Freebase ID:name tuples),bechdel_score,actor_mention_score,negative,neutral,positive
0,171005,/m/016ywb,Henry V,1989-11-08,10161099.0,137.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/04xvh5"": ""Costume drama"", ""/m/082gq"": ""Wa...",3,0.2500,0.048826,0.830560,0.120614
1,77856,/m/0kcn7,Mary Poppins,1964-08-27,102272727.0,139.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0hj3myq"": ""Children's/Family"", ""/m/04t36""...",3,0.5345,0.129004,0.691920,0.179076
2,12053509,/m/02vn81r,Loverboy,1989-04-28,3960327.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/02l7c8"": ""...",3,0.2830,0.252773,0.634228,0.112999
3,1369204,/m/04x8zs,Juarez,1939,,125.0,"{""/m/02h40lc"": ""English Language"", ""/m/06nm1"":...","{""/m/09c7w0"": ""United States of America""}","{""/m/04xvh5"": ""Costume drama"", ""/m/03bxz7"": ""B...",2,,0.394391,0.572009,0.033601
4,5664529,/m/0dyy_v,Vixen!,1968,,70.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01yldk"": ""Softcore Porn"", ""/m/06b0n3"": ""S...",3,,0.398135,0.527754,0.074110
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5189,25920477,/m/0b6lqyd,Source Code,2011-03-11,147332697.0,93.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0f8l9c"": ""France"", ""/m/09c7w0"": ""United S...","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",1,0.7500,0.384289,0.565830,0.049881
5190,1191380,/m/04f_y7,Wilde,1997,2158775.0,118.0,"{""/m/02h40lc"": ""English Language""}","{""/m/014tss"": ""Kingdom of Great Britain"", ""/m/...","{""/m/0hn10"": ""LGBT"", ""/m/017fp"": ""Biography"", ...",3,0.1667,0.131713,0.691487,0.176800
5191,54540,/m/0f7hw,Coming to America,1988-06-29,288752301.0,117.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/03p5xs"": ""...",3,0.5200,0.317630,0.585357,0.097013
5192,20244619,/m/04_0j2b,Mirage,1972,,82.0,"{""/m/06nm1"": ""Spanish Language""}","{""/m/016wzw"": ""Peru""}","{""/m/07s9rl0"": ""Drama"", ""/m/03q4nz"": ""World ci...",3,,0.111646,0.724559,0.163794


## Female Cast Ratio

We calculate the female cast ratio as defined by [(Yang et al., 2020)](https://doi.org/10.1145/3411213). For each movie in our dataset, we take its rows in the character metadata and compute the ratio of female cast members to all cast members with a recorded gender.
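
To make the ratio concrete, here is a small sketch in plain Python, independent of pandas. The `cast` list is hypothetical; cast members without a recorded gender are skipped, which mirrors how a pandas `groupby` drops missing keys by default.

```python
from collections import Counter, defaultdict

# Hypothetical (movie ID, actor gender) pairs; None marks a missing gender
cast = [(1, "F"), (1, "M"), (1, "M"), (1, None), (2, "F"), (3, "M")]

counts = defaultdict(Counter)
for movie_id, gender in cast:
    if gender in ("F", "M"):  # ignore cast members without a recorded gender
        counts[movie_id][gender] += 1

# Female cast ratio per movie: F / (F + M)
female_ratio = {
    movie_id: c["F"] / (c["F"] + c["M"]) for movie_id, c in counts.items()
}
print(female_ratio)
```

Movie 1 has one woman among three gendered cast members, movie 2 has an all-female cast, and movie 3 an all-male cast.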

In [35]:
unique_count_wiki = character_metadata_bechdel['1. Wikipedia movie ID'].nunique()
unique_count_freebase = character_metadata_bechdel['2. Freebase movie ID'].nunique()
print("Number of unique Wikipedia movie ID values:", unique_count_wiki)
print("Number of unique Freebase movie ID values:", unique_count_freebase)

Number of unique Wikipedia movie ID values: 6202
Number of unique Freebase movie ID values: 6202


In [60]:
# Group by 1. Wikipedia movie ID and 6. Actor gender, then count the occurrences of each gender
gender_counts = character_metadata_bechdel.groupby(['1. Wikipedia movie ID', '6. Actor gender']).size().unstack(fill_value=0)

# Calculate the ratio of female actors to total actors for each movie
gender_counts['female_ratio'] = gender_counts['F'] / (gender_counts['M'] + gender_counts['F'])
gender_counts = gender_counts.reset_index().drop(columns = ["F", "M"])

# Merge the female ratio from gender_counts into the movie metadata
movie_metadata_bechdel_sentiment_fcr = pd.merge(movie_metadata_bechdel_sentiment, gender_counts, on='1. Wikipedia movie ID', how="left")

In [61]:
display(movie_metadata_bechdel_sentiment_fcr)

Unnamed: 0,1. Wikipedia movie ID,2. Freebase movie ID,3. Movie name,4. Movie release date,5. Movie box office revenue,6. Movie runtime,7. Movie languages (Freebase ID:name tuples),8. Movie countries (Freebase ID:name tuples),9. Movie genres (Freebase ID:name tuples),bechdel_score,actor_mention_score,negative,neutral,positive,female_ratio
0,171005,/m/016ywb,Henry V,1989-11-08,10161099.0,137.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/04xvh5"": ""Costume drama"", ""/m/082gq"": ""Wa...",3,0.2500,0.048826,0.830560,0.120614,0.150000
1,77856,/m/0kcn7,Mary Poppins,1964-08-27,102272727.0,139.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0hj3myq"": ""Children's/Family"", ""/m/04t36""...",3,0.5345,0.129004,0.691920,0.179076,0.461538
2,12053509,/m/02vn81r,Loverboy,1989-04-28,3960327.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/02l7c8"": ""...",3,0.2830,0.252773,0.634228,0.112999,0.428571
3,1369204,/m/04x8zs,Juarez,1939,,125.0,"{""/m/02h40lc"": ""English Language"", ""/m/06nm1"":...","{""/m/09c7w0"": ""United States of America""}","{""/m/04xvh5"": ""Costume drama"", ""/m/03bxz7"": ""B...",2,,0.394391,0.572009,0.033601,0.250000
4,5664529,/m/0dyy_v,Vixen!,1968,,70.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01yldk"": ""Softcore Porn"", ""/m/06b0n3"": ""S...",3,,0.398135,0.527754,0.074110,0.250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5189,25920477,/m/0b6lqyd,Source Code,2011-03-11,147332697.0,93.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0f8l9c"": ""France"", ""/m/09c7w0"": ""United S...","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",1,0.7500,0.384289,0.565830,0.049881,0.235294
5190,1191380,/m/04f_y7,Wilde,1997,2158775.0,118.0,"{""/m/02h40lc"": ""English Language""}","{""/m/014tss"": ""Kingdom of Great Britain"", ""/m/...","{""/m/0hn10"": ""LGBT"", ""/m/017fp"": ""Biography"", ...",3,0.1667,0.131713,0.691487,0.176800,0.454545
5191,54540,/m/0f7hw,Coming to America,1988-06-29,288752301.0,117.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/03p5xs"": ""...",3,0.5200,0.317630,0.585357,0.097013,0.285714
5192,20244619,/m/04_0j2b,Mirage,1972,,82.0,"{""/m/06nm1"": ""Spanish Language""}","{""/m/016wzw"": ""Peru""}","{""/m/07s9rl0"": ""Drama"", ""/m/03q4nz"": ""World ci...",3,,0.111646,0.724559,0.163794,1.000000


## Summary Pronoun Density
Here, we measure the density of gendered pronouns in each plot summary. After tokenizing the summary, we count the occurrences of "she" and "her" and compare them with the occurrences of "he" and "his". Then we simply calculate the following ratio:

$$\text{density}_{\text{pronoun}} = \frac{\text{number}_{\text{she,her}}}{\text{number}_{\text{she,her}} + \text{number}_{\text{he,his}}}$$
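
A worked example of this ratio on a single sentence, as a minimal sketch in plain Python. The helper `pronoun_density` and the sentence are hypothetical, and `re` stands in for the NLTK tokenizer. Matching is case-sensitive, as in the notebook cell, so a capitalized "She" at the start of a sentence is not counted.

```python
import re

def pronoun_density(summary):
    """Share of 'she'/'her' among the tokens 'she', 'her', 'he', 'his'."""
    tokens = re.findall(r"\w+", summary)
    female = tokens.count("she") + tokens.count("her")
    male = tokens.count("he") + tokens.count("his")
    total = female + male
    # No gendered pronouns at all: the density is undefined
    return female / total if total else float("nan")

print(pronoun_density("When she finds her brother, he hides his letter from her."))
```

The sentence contains three female pronouns ("she", "her", "her") and two male ones ("he", "his"), giving 3 / 5.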



In [57]:
from nltk.tokenize import RegexpTokenizer
from tqdm import tqdm

print(plot_summaries.shape)
tokenizer = RegexpTokenizer(r'\w+')
counts=[]

for i, row in tqdm(plot_summaries.iterrows(), total=plot_summaries.shape[0]):
    tokens = tokenizer.tokenize(row['plot_summary'])
    #Count each pronoun among the tokens of this summary
    counts.append({"1. Wikipedia movie ID": row['id'],
                   "she": tokens.count("she"),
                   "her": tokens.count("her"),
                   "he": tokens.count("he"),
                   "his": tokens.count("his")})

(42303, 2)


100%|██████████████████████████████████████████████████████████████████████████| 42303/42303 [00:06<00:00, 6305.76it/s]


In [80]:
genders=pd.DataFrame(counts)
genders["gender_density"]=(genders['she'] + genders['her']) / (genders['she'] + genders['her'] + genders['he'] + genders['his'])
genders_only = genders.drop(columns = ["she", "her", "he", "his"])
movie_metadata_bechdel_sentiment_fcr_pronoun = pd.merge(movie_metadata_bechdel_sentiment_fcr, genders_only, on='1. Wikipedia movie ID', how='left')

display(movie_metadata_bechdel_sentiment_fcr_pronoun)

Unnamed: 0,1. Wikipedia movie ID,2. Freebase movie ID,3. Movie name,4. Movie release date,5. Movie box office revenue,6. Movie runtime,7. Movie languages (Freebase ID:name tuples),8. Movie countries (Freebase ID:name tuples),9. Movie genres (Freebase ID:name tuples),bechdel_score,actor_mention_score,negative,neutral,positive,female_ratio,gender_density
0,171005,/m/016ywb,Henry V,1989-11-08,10161099.0,137.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/04xvh5"": ""Costume drama"", ""/m/082gq"": ""Wa...",3,0.2500,0.048826,0.830560,0.120614,0.150000,0.000000
1,77856,/m/0kcn7,Mary Poppins,1964-08-27,102272727.0,139.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0hj3myq"": ""Children's/Family"", ""/m/04t36""...",3,0.5345,0.129004,0.691920,0.179076,0.461538,0.333333
2,12053509,/m/02vn81r,Loverboy,1989-04-28,3960327.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/02l7c8"": ""...",3,0.2830,0.252773,0.634228,0.112999,0.428571,0.240000
3,1369204,/m/04x8zs,Juarez,1939,,125.0,"{""/m/02h40lc"": ""English Language"", ""/m/06nm1"":...","{""/m/09c7w0"": ""United States of America""}","{""/m/04xvh5"": ""Costume drama"", ""/m/03bxz7"": ""B...",2,,0.394391,0.572009,0.033601,0.250000,0.083333
4,5664529,/m/0dyy_v,Vixen!,1968,,70.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01yldk"": ""Softcore Porn"", ""/m/06b0n3"": ""S...",3,,0.398135,0.527754,0.074110,0.250000,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5189,25920477,/m/0b6lqyd,Source Code,2011-03-11,147332697.0,93.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0f8l9c"": ""France"", ""/m/09c7w0"": ""United S...","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",1,0.7500,0.384289,0.565830,0.049881,0.235294,0.206897
5190,1191380,/m/04f_y7,Wilde,1997,2158775.0,118.0,"{""/m/02h40lc"": ""English Language""}","{""/m/014tss"": ""Kingdom of Great Britain"", ""/m/...","{""/m/0hn10"": ""LGBT"", ""/m/017fp"": ""Biography"", ...",3,0.1667,0.131713,0.691487,0.176800,0.454545,0.055556
5191,54540,/m/0f7hw,Coming to America,1988-06-29,288752301.0,117.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/03p5xs"": ""...",3,0.5200,0.317630,0.585357,0.097013,0.285714,0.206897
5192,20244619,/m/04_0j2b,Mirage,1972,,82.0,"{""/m/06nm1"": ""Spanish Language""}","{""/m/016wzw"": ""Peru""}","{""/m/07s9rl0"": ""Drama"", ""/m/03q4nz"": ""World ci...",3,,0.111646,0.724559,0.163794,1.000000,0.000000


## Auxiliary Features

We add a binary version of the Bechdel score for further analysis: 1 if the movie has a score of 3, 0 otherwise.

In [74]:
movie_metadata_bechdel_sentiment_fcr_pronoun["bechdel_binary"] = (movie_metadata_bechdel_sentiment_fcr_pronoun["bechdel_score"] == 3).astype(int)

In [75]:
display(movie_metadata_bechdel_sentiment_fcr_pronoun)

Unnamed: 0,1. Wikipedia movie ID,2. Freebase movie ID,3. Movie name,4. Movie release date,5. Movie box office revenue,6. Movie runtime,7. Movie languages (Freebase ID:name tuples),8. Movie countries (Freebase ID:name tuples),9. Movie genres (Freebase ID:name tuples),bechdel_score,actor_mention_score,negative,neutral,positive,female_ratio,gender_density,bechdel_binary
0,171005,/m/016ywb,Henry V,1989-11-08,10161099.0,137.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/04xvh5"": ""Costume drama"", ""/m/082gq"": ""Wa...",3,0.2500,0.048826,0.830560,0.120614,0.150000,0.000000,1
1,77856,/m/0kcn7,Mary Poppins,1964-08-27,102272727.0,139.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0hj3myq"": ""Children's/Family"", ""/m/04t36""...",3,0.5345,0.129004,0.691920,0.179076,0.461538,0.333333,1
2,12053509,/m/02vn81r,Loverboy,1989-04-28,3960327.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/02l7c8"": ""...",3,0.2830,0.252773,0.634228,0.112999,0.428571,0.240000,1
3,1369204,/m/04x8zs,Juarez,1939,,125.0,"{""/m/02h40lc"": ""English Language"", ""/m/06nm1"":...","{""/m/09c7w0"": ""United States of America""}","{""/m/04xvh5"": ""Costume drama"", ""/m/03bxz7"": ""B...",2,,0.394391,0.572009,0.033601,0.250000,0.083333,0
4,5664529,/m/0dyy_v,Vixen!,1968,,70.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01yldk"": ""Softcore Porn"", ""/m/06b0n3"": ""S...",3,,0.398135,0.527754,0.074110,0.250000,1.000000,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5189,25920477,/m/0b6lqyd,Source Code,2011-03-11,147332697.0,93.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0f8l9c"": ""France"", ""/m/09c7w0"": ""United S...","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",1,0.7500,0.384289,0.565830,0.049881,0.235294,0.206897,0
5190,1191380,/m/04f_y7,Wilde,1997,2158775.0,118.0,"{""/m/02h40lc"": ""English Language""}","{""/m/014tss"": ""Kingdom of Great Britain"", ""/m/...","{""/m/0hn10"": ""LGBT"", ""/m/017fp"": ""Biography"", ...",3,0.1667,0.131713,0.691487,0.176800,0.454545,0.055556,1
5191,54540,/m/0f7hw,Coming to America,1988-06-29,288752301.0,117.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/03p5xs"": ""...",3,0.5200,0.317630,0.585357,0.097013,0.285714,0.206897,1
5192,20244619,/m/04_0j2b,Mirage,1972,,82.0,"{""/m/06nm1"": ""Spanish Language""}","{""/m/016wzw"": ""Peru""}","{""/m/07s9rl0"": ""Drama"", ""/m/03q4nz"": ""World ci...",3,,0.111646,0.724559,0.163794,1.000000,0.000000,1


In [77]:
#Dump final dataset to csv file
movie_metadata_bechdel_sentiment_fcr_pronoun.to_csv("movie_metadata_all_features.csv", index=False)

## Bechdel Analysis: GII & HDI Addition

To further analyze the dataset and its relation to the Bechdel score, we add the Gender Inequality Index (GII) and the Human Development Index (HDI) of each movie's production countries in its release year. This way, we can carry out a country-level analysis.
([Source for GII](https://ourworldindata.org/grapher/gender-inequality-index-from-the-human-development-report?tab=chart))
([Source for HDI](https://ourworldindata.org/grapher/human-development-index))
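
The join key is a (country, year) pair: the country names come from the JSON string in the countries column, and the year is the first four characters of the release date. A minimal sketch of that step in plain Python; the helper `country_year_pairs` and the lookup table are hypothetical, and the index values are placeholders rather than real data.

```python
import json

# Hypothetical lookup keyed by (country, year); values are placeholders, not real index data
index_by_country_year = {("France", 1998): {"HDI": 0.8, "GII": 0.2}}

def country_year_pairs(countries_json, release_date):
    """Return one (country, year) pair per production country."""
    year = int(str(release_date)[:4])
    return [(name, year) for name in json.loads(countries_json).values()]

pairs = country_year_pairs('{"/m/0f8l9c": "France", "/m/087vz": "Yugoslavia"}', "1998-09-10")
# Countries missing from the lookup (here Yugoslavia) are dropped
matched = {p: index_by_country_year[p] for p in pairs if p in index_by_country_year}
print(pairs)
print(matched)
```

A movie with several production countries therefore contributes one row per country that has index data.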

In [81]:
import pandas as pd
import json

# Load Data
bechdel_movies = pd.read_csv('CMU_bechdel_added.csv')

GII = pd.read_csv('gender-inequality-index-from-the-human-development-report.csv')
HDI = pd.read_csv('human-development-index.csv')

#Add Country Names
bechdel_movies['country_names'] = bechdel_movies['8. Movie countries (Freebase ID:name tuples)'].apply(
    lambda x: list(json.loads(x).values()))

# If a movie was produced in several countries, create one row per country
bechdel_movies = bechdel_movies.explode('country_names')

# Group by Country
grouped_by_country = bechdel_movies.groupby('country_names').size().sort_values().reset_index(name="Movie Count")

# Keep only the countries with more than 20 movies
filt_grouped_by_country = grouped_by_country[grouped_by_country['Movie Count']>20]


# Remove movies without a release date and/or country
# To join the Bechdel data with GII and HDI, every row needs both a country and a release date
bechdel_movies_filt = bechdel_movies[~(bechdel_movies['4. Movie release date'].isnull() | 
                                       (bechdel_movies['8. Movie countries (Freebase ID:name tuples)'] == '{}'))].copy()

print("\nWe go from " + str(len(bechdel_movies['1. Wikipedia movie ID'].unique())) + ' movies to '+ 
      str(len(bechdel_movies_filt['1. Wikipedia movie ID'].unique())) + " movies after dropping those without a release date or country" )

# Keep only the first 4 characters of the release date (the year)
bechdel_movies_filt['Year'] = bechdel_movies_filt['4. Movie release date'].astype(str).str[:4]
bechdel_movies_filt['Year'] = bechdel_movies_filt['Year'].astype(int)


# Combine the HDI and GII dataframes
merge_HDI_GII = pd.merge(HDI, GII, on = ['Entity','Year','Code'], how = 'left')
merge_HDI_GII['Year'] = merge_HDI_GII['Year'].astype(int)


bechdel_movies_filt = bechdel_movies_filt.rename(columns = {'country_names': 'Entity'})
#display(bechdel_movies_filt)

# Add the country code to bechdel_movies_filt
df_entityCode = merge_HDI_GII.loc[:,('Entity','Code')]
df_entityCode = df_entityCode.drop_duplicates(['Entity','Code'])
df_entityCode['Entity'] = df_entityCode['Entity'].replace('United States','United States of America')
merged = pd.merge(bechdel_movies_filt,df_entityCode, on = ['Entity'], how='left')
#display(merged)

# Print how many countries and movies have no country code
nan_countries = merged[merged['Code'].isna()]
nan_countries_n = nan_countries['Entity'].unique()
nan_countries_movies = nan_countries['1. Wikipedia movie ID'].unique()

print("\nThere are " + str(len(nan_countries_n)) + ' countries that are not in the HDI and GII dataframes:')
print(nan_countries_n)
print('Corresponding to ' + str(len(nan_countries_movies)) + ' movies')

# Drop the rows whose country has no data in merge_HDI_GII
merged = merged[~(merged['Code'].isna())]

# Merge the Bechdel data with HDI and GII
merged_data = pd.merge(merged, merge_HDI_GII, on = ['Year','Entity','Code'], how = 'left')

# Drop the rows where both HDI and GII are missing
final = merged_data[(~merged_data['Human Development Index'].isna())|(~merged_data['Gender Inequality Index'].isna())]

n_final_movies = final['1. Wikipedia movie ID'].unique()
print("\nThere are " + str(len(n_final_movies)) + ' movies with HDI and/or GII')
      
display(final)


We go from 6521 movies to 6239 movies after dropping those without a release date or country

There are 20 countries that are not in the HDI and GII dataframes:
['Yugoslavia' 'Soviet Union' 'West Germany' 'England' 'Czech Republic'
 'Serbia and Montenegro' 'Czechoslovakia' 'German Democratic Republic'
 'Weimar Republic' 'Scotland' 'Taiwan' 'Democratic Republic of the Congo'
 'Korea' 'Northern Ireland' 'Kingdom of Great Britain'
 'Palestinian territories' 'Mandatory Palestine' 'Slovak Republic'
 'Puerto Rico' 'Kingdom of Italy']
Corresponding to 174 movies

There are 1839 movies with HDI and/or GII


Unnamed: 0.1,Unnamed: 0,1. Wikipedia movie ID,2. Freebase movie ID,3. Movie name,4. Movie release date,5. Movie box office revenue,6. Movie runtime,7. Movie languages (Freebase ID:name tuples),8. Movie countries (Freebase ID:name tuples),9. Movie genres (Freebase ID:name tuples),bechdel_score,Entity,Year,Code,Human Development Index,Gender Inequality Index
4,56,11633165,/m/02rm6l8,Innocence,1997,,110.0,"{""/m/02hwyss"": ""Turkish Language""}","{""/m/01znc_"": ""Turkey""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...",3,Turkey,1997,TUR,0.641,0.599
12,163,1031231,/m/03_wh5,"Black Cat, White Cat",1998-09-10,351447.0,135.0,"{""/m/012psb"": ""Romani language"", ""/m/02bjrlw"":...","{""/m/0f8l9c"": ""France"", ""/m/087vz"": ""Yugoslavi...","{""/m/06cvj"": ""Romantic comedy"", ""/m/01z4y"": ""C...",3,France,1998,FRA,0.842,0.186
13,163,1031231,/m/03_wh5,"Black Cat, White Cat",1998-09-10,351447.0,135.0,"{""/m/012psb"": ""Romani language"", ""/m/02bjrlw"":...","{""/m/0f8l9c"": ""France"", ""/m/087vz"": ""Yugoslavi...","{""/m/06cvj"": ""Romantic comedy"", ""/m/01z4y"": ""C...",3,Germany,1998,DEU,0.879,0.136
17,196,748616,/m/03813g,"Spring, Summer, Fall, Winter... and Spring",2003-08-14,9524745.0,95.0,"{""/m/02hwhyv"": ""Korean Language""}","{""/m/06qd3"": ""South Korea"", ""/m/0345h"": ""Germa...","{""/m/07s9rl0"": ""Drama"", ""/m/03q4nz"": ""World ci...",1,South Korea,2003,KOR,0.844,0.159
18,196,748616,/m/03813g,"Spring, Summer, Fall, Winter... and Spring",2003-08-14,9524745.0,95.0,"{""/m/02hwhyv"": ""Korean Language""}","{""/m/06qd3"": ""South Korea"", ""/m/0345h"": ""Germa...","{""/m/07s9rl0"": ""Drama"", ""/m/03q4nz"": ""World ci...",1,Germany,2003,DEU,0.905,0.114
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8266,81658,25920477,/m/0b6lqyd,Source Code,2011-03-11,147332697.0,93.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0f8l9c"": ""France"", ""/m/09c7w0"": ""United S...","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",1,France,2011,FRA,0.881,0.126
8268,81686,17288740,/m/043mmgb,La Tour Montparnasse Infernale,2001-03-28,,92.0,"{""/m/064_8sq"": ""French Language""}","{""/m/0f8l9c"": ""France""}","{""/m/05p553"": ""Comedy film"", ""/m/0hj3m_x"": ""Cr...",3,France,2001,FRA,0.847,0.184
8269,81693,1191380,/m/04f_y7,Wilde,1997,2158775.0,118.0,"{""/m/02h40lc"": ""English Language""}","{""/m/014tss"": ""Kingdom of Great Britain"", ""/m/...","{""/m/0hn10"": ""LGBT"", ""/m/017fp"": ""Biography"", ...",3,Japan,1997,JPN,0.871,0.157
8270,81693,1191380,/m/04f_y7,Wilde,1997,2158775.0,118.0,"{""/m/02h40lc"": ""English Language""}","{""/m/014tss"": ""Kingdom of Great Britain"", ""/m/...","{""/m/0hn10"": ""LGBT"", ""/m/017fp"": ""Biography"", ...",3,United Kingdom,1997,GBR,0.842,0.238


In [82]:
#Dump final dataset to csv file
final.to_csv("movies_bechdel_GII_HDI.csv", index=False)