Diversity: 
1. Reorganise ethnicities into a few groups
2. Establish the naive coefficient
3. Multiply it by an extra coefficient that rewards equal representation

A. Start by building the character dataframe
NB: its final name should be "filtered_character"

In [1]:
#basic imports
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import seaborn as sns

In [2]:
#load the data
character_metadata = pd.read_csv('../data/character.metadata.tsv', sep='\t')

# Look through the DataFrame:
print(f'the size of the dataframe is:{character_metadata.shape}') #-- (450668, 13)
print(character_metadata.columns) #the file has no header row, so the first record was read as column names; we need to rename the columns

the size of the dataframe is:(450668, 13)
Index(['975900', '/m/03vyhn', '2001-08-24', 'Akooshay', '1958-08-26', 'F',
       '1.62', 'Unnamed: 7', 'Wanda De Jesus', '42', '/m/0bgchxw',
       '/m/0bgcj3x', '/m/03wcfv7'],
      dtype='object')
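The column index above is really the first data record, because the TSV has no header row. A minimal sketch of one way to keep that record, assuming it is acceptable to name the columns at load time: pass `header=None` together with `names=`. The two-row input and the shortened column list below are toy stand-ins, not the real file.

```python
import io
import pandas as pd

# Toy stand-in for character.metadata.tsv: two tab-separated records, no header row.
# The second record is invented for illustration.
raw = "975900\t/m/03vyhn\t2001-08-24\n123\t/m/xxxx\t1999-01-01\n"

# Shortened, hypothetical column list (the real file has 13 columns).
names = ["Wikipedia_movie_ID", "Freebase_movie_ID", "Movie_release_date"]

# Default behaviour: the first record is consumed as the header.
with_default = pd.read_csv(io.StringIO(raw), sep="\t")

# header=None keeps every record; names= labels the columns directly.
with_names = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=names)

print(with_default.shape)
print(with_names.shape)
```

With the real file, this would presumably give one more row than the shape printed above.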


In [3]:
#rename columns
new_column_names = [
    "Wikipedia_movie_ID",
    "Freebase_movie_ID",
    "Movie_release_date",
    "Character_name",
    "Actor_date_of_birth",
    "Actor_gender",
    "Actor_height_m",
    "Actor_ethnicity",
    "Actor_name",
    "Actor_age_at_movie_release",
    "Freebase_character_actor_map_ID",
    "Freebase_character_ID",
    "Freebase_actor_ID"
]
character_metadata.columns = new_column_names

In [4]:
columns_to_check = ['Wikipedia_movie_ID', 'Movie_release_date', 'Actor_ethnicity']
remaining_rows = {col: character_metadata[col].dropna().shape[0] for col in columns_to_check}
for col, count in remaining_rows.items():
    print(f"Remaining rows for '{col}': {count}")

# keep only rows with a known ethnicity; Movie_release_date can still contain missing values
filtered_character = character_metadata[['Wikipedia_movie_ID', 'Movie_release_date', 'Actor_ethnicity']].dropna(subset=['Actor_ethnicity']).copy()

Remaining rows for 'Wikipedia_movie_ID': 450668
Remaining rows for 'Movie_release_date': 440673
Remaining rows for 'Actor_ethnicity': 106058


In [5]:
#homogeneous release dates (only the year); .str keeps missing dates as NaN instead of turning them into the string 'nan'
filtered_character['Movie_release_date'] = filtered_character['Movie_release_date'].str[:4]
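If numeric years are wanted later (for sorting or plotting over time, say), one option is to convert the four-character prefix to a nullable integer. This is only a sketch on a toy Series; the mix of date formats shown is an assumption about what the column contains.

```python
import numpy as np
import pandas as pd

# Toy release dates in the formats assumed to occur: full date, year only, year-month, missing.
dates = pd.Series(["2001-08-24", "1958", "1998-05", np.nan])

# Slice the first four characters; missing values stay missing.
year_text = dates.str[:4]

# Convert to a nullable integer so missing years do not force a float column.
year = pd.to_numeric(year_text, errors="coerce").astype("Int64")

print(year)
```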

We now have a dataframe, but its ethnicity column holds Freebase IDs rather than readable names. We define a function that looks up the label corresponding to each ID.

In [6]:
def fb_to_label(freebase_id, conversion_table):
    # conversion_table must be indexed by freebase_id and have a 'label' column
    if freebase_id in conversion_table.index:
        return conversion_table.loc[freebase_id, 'label']
    return None

We found a table online that maps freebase_id to wikidata_id and label.
It has about 2 million lines, so after importing it we keep only the lines we need, i.e. those whose freebase_id appears as an ethnicity in filtered_character.
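Since only a small fraction of the mapping table is needed, one possible alternative is to filter it while reading, chunk by chunk, so the full table never has to sit in memory. The sketch below runs on a toy in-memory table; the IDs and labels are invented, and the column name `wikidata_id` is an assumption about the real file.

```python
import io
import pandas as pd

# Toy stand-in for fb_wiki_mapping.tsv; IDs and labels are invented.
raw = (
    "freebase_id\twikidata_id\tlabel\n"
    "/m/aaa\tQ1\tLabel A\n"
    "/m/bbb\tQ2\tLabel B\n"
    "/m/ccc\tQ3\tLabel C\n"
    "/m/ddd\tQ4\tLabel D\n"
)

# IDs we actually need (in the notebook: the unique Actor_ethnicity values).
wanted = {"/m/aaa", "/m/ccc"}

# Read in small chunks and keep only the matching rows of each chunk.
kept = []
for chunk in pd.read_csv(io.StringIO(raw), sep="\t", chunksize=2):
    kept.append(chunk[chunk["freebase_id"].isin(wanted)])

subset = pd.concat(kept).set_index("freebase_id")
print(subset)
```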

In [7]:
#import the file, setting freebase_id as index allows us to use .loc later
fb_wiki_gen = pd.read_csv('../data/fb_wiki_mapping.tsv',sep='\t')
fb_wiki_gen.set_index('freebase_id',inplace=True)

In [8]:
#make list of different existing ethnicities
ethnicities_list = filtered_character['Actor_ethnicity'].unique().tolist()
#now select those from the fb_wiki_gen
fb_wiki_ethnic = fb_wiki_gen.loc[fb_wiki_gen.index.isin(ethnicities_list)]

In [9]:
#we can now change the Actor_ethnicity column from freebase_id to label
filtered_character['Actor_ethnicity']=filtered_character['Actor_ethnicity'].apply(fb_to_label,conversion_table=fb_wiki_ethnic)
#some freebase_ids couldn't be found, so we get None. We will now drop those None values
filtered_character = filtered_character.dropna(subset=['Actor_ethnicity'])
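The row-by-row `apply` can probably be replaced by a vectorised lookup: `Series.map` accepts a Series and uses its index as the lookup key, returning NaN for IDs it cannot find. A small sketch with invented IDs, assuming each freebase_id occurs only once in the conversion table (a duplicated index would make the lookup fail):

```python
import pandas as pd

# Toy conversion table indexed by freebase_id; IDs and labels are invented.
conversion = pd.DataFrame(
    {"label": ["Label A", "Label B"]},
    index=pd.Index(["/m/aaa", "/m/bbb"], name="freebase_id"),
)

# Toy ethnicity column; '/m/zzz' is deliberately absent from the table.
ids = pd.Series(["/m/aaa", "/m/zzz", "/m/bbb"])

# Look up every ID at once; unknown IDs become NaN and can be dropped with dropna().
labels = ids.map(conversion["label"])
print(labels)
```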

In [10]:
filtered_character.sample(5)

Unnamed: 0,Wikipedia_movie_ID,Movie_release_date,Actor_ethnicity
212076,9105282,1997,Argentines
167859,8499221,2010,English people
190110,33747977,1948,Portuguese Americans
376856,11120865,1993,Malaysian Chinese
328648,6739637,2007,Indian


We now have the right dataframe to analyse diversity, but it contains too many distinct ethnicities. We want to merge the very specific ones into broader, more general groups.

In [26]:
import openai

In [None]:
openai.api_key = ''

# openai.api_key = os.getenv('OPENAI_API_KEY')

In [40]:
filtered_character['Actor_ethnicity']=filtered_character['Actor_ethnicity'].astype(str)
ethnicities_labels = filtered_character['Actor_ethnicity'].unique().tolist()

In [84]:
ethnicities_labels

['African Americans',
 'Omaha people',
 'Jewish people',
 'Irish Americans',
 'Indian Americans',
 'Italians',
 'German Americans',
 'Indian',
 'Ezhava',
 'Malayali',
 'Taiwanese',
 'Armenians',
 'Marathi people',
 'Lithuanian American',
 'Italian Americans',
 'Danish Americans',
 'American Jews',
 'Scottish Americans',
 'Puerto Ricans',
 'English people',
 'Irish people',
 'Russian Americans',
 'English Americans',
 'Gujarati people',
 'Spanish Americans',
 'Bihari people',
 'Nair',
 'Cuban Americans',
 'Russians',
 'Yoruba people',
 'Japanese people',
 'Filipino Americans',
 'Swedish Americans',
 'Finnish Americans',
 'Koreans',
 'French',
 'Welsh people',
 'White Americans',
 'Bengali',
 'Uruguayans',
 'Iranian peoples',
 'Mexicans',
 'Dutch Americans',
 'Hungarian Americans',
 'Spaniards',
 'Bunt',
 'Swedes',
 'Sindhis',
 'Tamil',
 'Italian Canadians',
 'Asian Americans',
 'Mexican Americans',
 'Punjabis',
 'White British',
 'Scottish Australian',
 'White Africans of European ances

In [41]:
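def ethnic_spec_to_gen(ethnicities):
    # gpt-3.5-turbo is a chat model, so it needs the chat endpoint rather than
    # openai.Completion (this assumes the legacy openai<1.0 interface)
    prompt = f"Group the following ethnicities into broader categories: {ethnicities}"
    response = openai.ChatCompletion.create(
        model = 'gpt-3.5-turbo',
        messages = [{'role': 'user', 'content': prompt}],
        max_tokens = 500,
        temperature = 0.7,
    )
    return response.choices[0].message.content.strip()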
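The function above returns free text, which then has to be turned into a dictionary by hand. One way this might be made more mechanical, sketched here as an assumption rather than as the procedure actually followed, is to ask for JSON and parse the reply. `build_prompt`, `parse_grouping` and the hard-coded reply are hypothetical, and no API call is made.

```python
import json

def build_prompt(labels):
    # Hypothetical prompt asking for a machine-readable answer.
    return (
        "Group the following ethnicities into broader categories. "
        "Reply with a JSON object mapping each category to a list of labels: "
        + json.dumps(labels)
    )

def parse_grouping(reply_text, labels):
    # Parse the reply and drop any label that was not in the list we sent.
    grouping = json.loads(reply_text)
    known = set(labels)
    return {
        group: [label for label in members if label in known]
        for group, members in grouping.items()
    }

labels = ["Italians", "Tamil", "Yoruba people"]

# Invented reply standing in for a model response; it contains one label we never sent.
fake_reply = '{"European": ["Italians", "Not a real label"], "South Asian": ["Tamil"]}'

parsed = parse_grouping(fake_reply, labels)
print(parsed)
```

Filtering against the labels that were sent guards against the model inventing or rewording labels, which would otherwise never match the dataframe.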

In [88]:
grouping = {
    'African': [
        'African Americans', 'Yoruba people', 'Egyptians', 'Kikuyu', 'Xhosa people', 'Somalis', 
        'Mandinka people', 'Malagasy people', 'Afro-Cuban', 'Sudanese Arabs', 'Kabyle people', 
        'Nigerian Americans', 'Sierra Leone Creole people', 'Zulu', 'Berber', 'Blackfoot Confederacy','Mandinka', 'Kikuyu', 'Xhosa', 'Kabyle', 'Somalis', 'Berber', 'Afrikaners'
    ],
    'South Asian': [
        'Indian', 'Ezhava', 'Malayali', 'Gujarati people', 'Bihari people', 'Punjabis', 'Pashtuns', 
        'Telugu people', 'Sri Lankan Tamils', 'Tamil', 'Kayastha', 'Nair', 'Bengali', 'Marwari', 
        'Sindhis', 'Punjabis', 'Rajput', 'Khatri', 'Bengali Brahmins', 'Kashmiri Pandit', 'Indian Americans','Marwari', 'Konkani', 'Kayastha', 'Niyogi', 'Tamil', 'Ezhava', 'Bengali Brahmins', 
        'Sindhis', 'Gujarati people', 'Punjabis', 'Sri Lankan Tamils', 'Telugu people'
    ],
    'Middle Eastern': [
        'Jewish people', 'Israeli Americans', 'Palestinian Americans', 'Arabs', 'Persians', 'Kurdish', 
        'Tatars', 'Assyrian people', 'Azerbaijanis', 'Kurds', 'Lebanese Americans', 'Lebanese', 
        'Iranian Americans', 'Afghan', 'Turks', 'Armenians','Ashkenazi Jews', 'Sephardi Jews', 'Lebanese', 'Copts', 'Israelis', 'Arabs', 'Kurds', 
        'Tatars', 'Ossetians', 'Azerbaijanis', 'Persians', 'Iranians'
    ],
    'European or American': [
        'Germans', 'Swedes', 'British Indian', 'Spaniards', 'British', 'Russians', 'French', 
        'Italians', 'Greek Americans', 'Finnish Americans', 'Scots', 'Irish Americans', 
        'White British', 'Irish migration to Great Britain', 'German Americans', 'Italians','Catalan people', 'Basque people', 'Latvians', 'Baltic Russians', 'Transylvanian Saxons',
        'Corsicans', 'French Chilean', 'Italian Brazilians', 'Luxembourgish Americans', 'White South Africans',
        'Portuguese Americans', 'French Americans', 'French Canadians', 'British Asian'
    ],
    'Indigenous': [
        'Cherokee', 'Navajo', 'Sioux', 'Mohawk', 'Inuit', 'Metis', 'Quechua', 'Maya', 'Apache', 
        'Blackfoot Confederacy', 'Haudenosaunee', 'Ojibwe', 'Inupiat', 'Cheyenne', 'Taino', 
        'Comanche', 'Oneida', 'Zuni','Blackfoot', 'Mohawk', 'Inuit', 'Sioux', 'Lumbee', 'Cheyennes', 'Nez Perce', 'Oneida', 
        'Aymara', 'Inupiat people', 'Haudenosaunee', 'Apache', 'Ojibwe', 'Cherokee', 'Māori'
    ],
    'Latino': [
        'Mexicans', 'Hispanic', 'Spaniards', 'Puerto Ricans', 'Uruguayans', 'Colombians', 
        'Brazilians', 'Argentines', 'Chilean Americans', 'Venezuelan Americans', 'Dominican Americans','Afro-Cuban', 'Chilean American', 'Mexican Americans', 'Spanish Americans', 
        'Uruguayans', 'Dominican Americans', 'Ecuadorian Americans', 'Colombians', 
        'Spanish immigration to Mexico', 'Venezuelans'
    ],
    'Pacific Islander': [
        'Filipino Americans', 'Hawaiian', 'Samoans', 'Tongans', 'Maori', 'Fijians', 
        'Polynesian', 'Micronesian', 'Guamanian', 'Native Hawaiians', 'Marshallese','Filipino mestizo', 'Kapampangan', 'Samoan New Zealanders', 'Chinese Filipino',
        'Vietnamese people', 'Ryukyuan people', 'Japanese Brazilians', 'Japanese Americans', 
        'Pacific Islander Americans'
    ],
    'Mixed': [
        'Anglo-Indian people', 'Afro-Asians', 'Mulatto', 'Mestizo', 'Métis', 'Eurasian', 
        'British African-Caribbean', 'Hapa', 'Amerasians','Afro-Asians', 'multiracial people', 'Métis', 'British African-Caribbean people', 'White Latin American'
    ],
    'Other': [
        'Han Chinese people', 'Japanese Brazilians', 'Dalit', 'Cossacks', 'Tatars', 
        'Romani people', 'Yakuts', 'Hazaras', 'Yugoslavs', 'Ashkenazi Jews', 'Catalan people', 
        'Corsicans', 'Serbs of Bosnia and Herzegovina', 'Aromanians','Koryo-saram', 'Buryats', 'Hmong American', 'Sierra Leone Creole people', 'Dene', 
        'Chettiar', 'Sherpa', 'Tibetan people', 'Malagasy people', 'Hazaras', 'Gin people', 
        'Aromanians', 'Romanichal'
    ]
}


In [89]:
ethnicity_to_group = {}

# some labels appear under several groups in `grouping`; the last group listed wins
for group, ethnicities in grouping.items():
    for ethnicity in ethnicities:
        ethnicity_to_group[ethnicity] = group

classified_ethnicities = {ethnicity: ethnicity_to_group.get(ethnicity, 'Unknown') for ethnicity in ethnicities_labels}
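Because the inversion keeps only one group per label, it can help to list the labels that are filed under more than one group before relying on the mapping. A minimal sketch on a reduced toy version of the dictionary (the real one is much longer):

```python
from collections import defaultdict

# Reduced toy version of the grouping dictionary, with one label listed twice.
grouping = {
    "African": ["Yoruba people", "Blackfoot Confederacy"],
    "Indigenous": ["Blackfoot Confederacy", "Cherokee"],
    "Other": ["Tatars"],
}

# Collect every group each label is listed under, ignoring repeats within a group.
groups_per_label = defaultdict(list)
for group, members in grouping.items():
    for label in members:
        if group not in groups_per_label[label]:
            groups_per_label[label].append(group)

# Labels assigned to more than one group need a manual decision.
conflicts = {label: groups for label, groups in groups_per_label.items() if len(groups) > 1}
print(conflicts)
```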

In [103]:
# labels missing from ethnicity_to_group (e.g. 'Kapampangan people') get NaN as their group
filtered_character['ethnic_group'] = filtered_character['Actor_ethnicity'].map(ethnicity_to_group)
filtered_character.sample(20)

Unnamed: 0,Wikipedia_movie_ID,Movie_release_date,Actor_ethnicity,ethnic_group
31183,16983442,1999.0,Irish Americans,European or American
440466,1362608,2006.0,English people,European or American
414583,34495806,,Swedish Americans,European or American
224052,10408933,1938.0,French Americans,European or American
179381,358367,1977.0,Jewish people,Middle Eastern
78613,18785526,1950.0,White British,European or American
220022,17987664,2008.0,Kapampangan people,
448978,967721,1987.0,African Americans,African
130922,3731073,1957.0,Jewish people,Middle Eastern
341811,5721950,2003.0,Indian,South Asian


We will now try to build a diversity coefficient.
We start with a naive version that only counts the number of distinct ethnic groups in a film and normalises it by the number of actors listed for that film.
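The naive coefficient described here could be computed along the following lines: group by film, count the distinct ethnic groups, and divide by the number of actor rows. The toy dataframe and the output column names `n_groups`, `n_actors` and `naive_diversity` are assumptions made for illustration.

```python
import pandas as pd

# Toy data: film 1 has four actors from three groups, film 2 has two actors from one group.
df = pd.DataFrame(
    {
        "Wikipedia_movie_ID": [1, 1, 1, 1, 2, 2],
        "ethnic_group": ["African", "African", "Latino", "South Asian", "Latino", "Latino"],
    }
)

# Per film: number of distinct groups and number of actors with a known group.
per_film = df.groupby("Wikipedia_movie_ID")["ethnic_group"].agg(
    n_groups="nunique", n_actors="count"
)

# Naive coefficient: distinct groups divided by actors listed.
per_film["naive_diversity"] = per_film["n_groups"] / per_film["n_actors"]
print(per_film)
```

Note that a film with a single listed actor scores 1.0 under this definition, which is one reason a naive version alone may not be enough.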

In [None]:
#Our dataframe has a Wikipedia movie ID, which we can use to group rows by film. We then count the ethnic groups per film.
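Step 3 of the plan at the top calls for an extra coefficient that rewards equal representation, but its form is not specified. One candidate, offered purely as an assumption, is Pielou's evenness: the Shannon entropy of the group shares divided by its maximum possible value. It equals 1 when all groups present are equally represented and falls towards 0 as one group dominates. The function name `evenness` and the choice to return 0 for a single group are hypothetical.

```python
import numpy as np
import pandas as pd

def evenness(groups):
    # Count how many actors fall into each ethnic group.
    counts = pd.Series(groups).value_counts()
    # With fewer than two groups there is nothing to balance; return 0 by convention (an assumption).
    if len(counts) < 2:
        return 0.0
    # Shares of each group, then Shannon entropy of those shares.
    p = counts / counts.sum()
    shannon = -(p * np.log(p)).sum()
    # Divide by the maximum entropy for this number of groups.
    return float(shannon / np.log(len(counts)))

balanced = evenness(["A", "A", "B", "B"])
skewed = evenness(["A", "A", "A", "B"])
single = evenness(["A", "A"])
print(balanced, skewed, single)
```

Multiplying the naive coefficient by such a factor would leave balanced casts untouched and penalise films where one group dominates.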
