# III. Globalization of populations

In [173]:
import pandas as pd
import geopandas as gpd
import plotly.graph_objects as go
from importlib import reload
import sys
from itertools import combinations
import json
import numpy as np
import plotly.express as px
from shapely.geometry import Point

## III. 1. The diversity of ethnicities in movies

In this section, we study the increase in the number of ethnicities, both in the global film market and within each individual movie. We used a first dictionary to extract the names of the ethnicities from their Freebase IDs and a second one to associate each ethnicity with its country of origin (or with the country in which it is most represented).
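The two-step lookup described above can be sketched as follows. This is only an illustration, not the code of `populationScripts`: the two small dictionaries and the `toy_characters` dataframe are hypothetical stand-ins (their entries are taken from the preview shown further down), and `/m/unknown` is a made-up ID used to show what happens when an ID is missing.

```python
import pandas as pd

# Hypothetical miniature lookup tables; the real dictionaries are much larger.
ethnicity_names = {
    "/m/0x67": "African Americans",
    "/m/064b9n": "Omaha Tribe of Nebraska",
}
ethnicity_countries = {
    "African Americans": "United States",
    "Omaha Tribe of Nebraska": "United States",
}

toy_characters = pd.DataFrame({
    "name": ["Ice Cube", "Rodney A. Grant", "Unknown actor"],
    "ethnicity_freebase_id": ["/m/0x67", "/m/064b9n", "/m/unknown"],
})

# Step 1: Freebase ID -> ethnicity name. Step 2: ethnicity name -> country.
toy_characters["ethnicity_name"] = toy_characters["ethnicity_freebase_id"].map(ethnicity_names)
toy_characters["actor_country"] = toy_characters["ethnicity_name"].map(ethnicity_countries)

# IDs that are absent from the dictionaries map to NaN and are dropped here.
toy_characters = toy_characters.dropna(subset=["ethnicity_name", "actor_country"])
```

Using `Series.map` with a dictionary keeps the lookup vectorized and makes unmatched IDs easy to spot, since they show up as missing values.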

In [169]:
#loading the script
sys.path.append('./src/scripts')
import populationScripts as ps 
reload(ps)

<module 'populationScripts' from 'c:\\Users\\richa\\OneDrive\\Documents\\EPFL\\MA1\\ada\\ada-2024-project-teamcsx24\\./src/scripts\\populationScripts.py'>

In [52]:
# loading the ethnicity dataframe that is going to be used in the section. 
# Adding a column with the name of the ethnicity, a column with the country linked with the ethnicity and a column with the release year 
df_ethnicities = ps.createEthicitiesDf()
df_ethnicities.head()

Unnamed: 0,wiki_id,freebase_id,release_date,character,birth_date,gender,height,ethnicity_freebase_id,name,age,character_actor_freebase_id,character_freebase_id,actor_freebase_id,release_year,ethnicity_name,actor_country
2,975900,/m/03vyhn,2001-08-24,Desolation Williams,1969-06-15,M,1.727,/m/0x67,Ice Cube,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l,2001,African Americans,United States
5,975900,/m/03vyhn,2001-08-24,Commander Helena Braddock,1949-05-26,F,1.727,/m/0x67,Pam Grier,52.0,/m/02vdcfp,/m/0bgchnd,/m/0418ft,2001,African Americans,United States
11,975900,/m/03vyhn,2001-08-24,Tres,1959-03-09,M,,/m/064b9n,Rodney A. Grant,42.0,/m/0bgchrs,/m/0bgchrw,/m/03ydsb,2001,Omaha Tribe of Nebraska,United States
27,3196793,/m/08yl5d,2000-02-16,,1937-11-10,M,,/m/0x67,Albert Hall,62.0,/m/0lr37dy,,/m/01lntp,2000,African Americans,United States
55,2314463,/m/0734w5,2006-01-01,,1974-02-08,M,1.63,/m/041rx,Seth Green,31.0,/m/04htrcg,,/m/0gz5hs,2006,Jewish people,Jewish people


In [53]:
#plotting the distribution of the different ethnicities in the df
ps.plotEthnicityRepartition(df_ethnicities)

In this first plot, we can see that 'Indian' is the most represented ethnicity. However, several of the 10 most represented ethnicities actually come from the USA, hence the need to group the ethnicities by country for a more accurate picture. We will also remove the years before 1908 and after 2010, because the small amount of data for those years might skew the results.

In [54]:
#plotting the distribution of the actors' countries
ps.plotCountryRepartition(df_ethnicities)

From this plot we can already see that a handful of countries account for the majority of actors. However, this plot aggregates all years together. Let's look at how the distribution evolved over time. For better visualization, we will only focus on the 15 main countries and group the rest as 'Others'.

To achieve this, we counted the occurrences of each country and kept only the first 15. For all other rows of the dataframe, 'actor_country' was changed to 'Others'. We grouped the dataframe by year and country using groupby. Then, we added a column 'count' with the number of actors for each country and each year, and a column 'proportion' with the proportion of actors from each country for each year. These columns will be used to plot the graph.
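The grouping steps above can be sketched as follows. This is a minimal illustration under assumptions, not the actual body of `ps.createMainEthnicitiesDf`: the function name `group_main_countries` and the toy data are hypothetical, and only the column names come from the dataframe previews in this notebook.

```python
import pandas as pd

def group_main_countries(df, n_main=15):
    """Keep the n_main most frequent countries, label the others 'Others',
    and return the count and proportion of actors per year and country."""
    df = df.copy()  # work on a copy so the caller's dataframe is left untouched
    main = df["actor_country"].value_counts().nlargest(n_main).index
    df.loc[~df["actor_country"].isin(main), "actor_country"] = "Others"

    counts = (
        df.groupby(["release_year", "actor_country"])
          .size()
          .reset_index(name="count")
    )
    # Proportion = count divided by the total number of actors in that year.
    yearly_total = counts.groupby("release_year")["count"].transform("sum")
    counts["proportion"] = counts["count"] / yearly_total
    return counts

# Toy data (hypothetical) with only two "main" countries kept.
toy = pd.DataFrame({
    "release_year": [1910, 1910, 1910, 1910, 1911, 1911],
    "actor_country": ["Canada", "Canada", "United States",
                      "United States", "Canada", "France"],
})
result = group_main_countries(toy, n_main=2)
```

Relabelling with `.loc` on an explicit copy is also one way to avoid the pandas `SettingWithCopyWarning` that appears when assigning to a slice of a dataframe.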

In [103]:
#preparing the data to plot the histograms :
#Grouping the data by country and by year, adding a count and a proportion columns 
main_ethnicities = ps.createMainEthnicitiesDf(df_ethnicities)
main_ethnicities.head()






Unnamed: 0,release_year,actor_country,count,proportion
0,1909,United States,4,1.0
1,1910,Canada,2,0.5
2,1910,United States,2,0.5
3,1911,Canada,2,1.0
4,1912,Canada,6,0.545455


We used functions to plot both histograms separately using px.histogram. In the following cell, we take the traces from both histograms and add buttons to allow the user to choose which one to display.

In [171]:
#plotting the figure
#Getting traces from both figures
original_traces = ps.plotOriginalGraph(main_ethnicities).data
normalized_traces = ps.plotNormalizedGraph(main_ethnicities).data

# Create a combined figure
fig = go.Figure()

# Add traces for both normalized and original data
for trace in normalized_traces:
    fig.add_trace(trace)
for trace in original_traces:
    fig.add_trace(trace)

# Hide original traces initially
for i in range(len(normalized_traces), len(fig.data)):
    fig.data[i].visible = False

# Add buttons
fig.update_layout(
    barmode="stack",
    xaxis=dict(title="Year", dtick=5),
    updatemenus=[
        {
            "buttons": [
                {
                    "label": "Normalized",
                    "method": "update",
                    # "update" expects [trace updates, layout updates]: title and axis label go in the same layout dict
                    "args": [{"visible": [True] * len(normalized_traces) + [False] * len(original_traces)},
                             {"title.text": "Proportion of actors from each country",
                              "yaxis.title.text": "Proportion of actors"}]
                },
                {
                    "label": "Original",
                    "method": "update",
                    "args": [{"visible": [False] * len(normalized_traces) + [True] * len(original_traces)},
                             {"title.text": "Number of actors from each country",
                              "yaxis.title.text": "Number of actors"}]
                },
            ],
            "direction": "down",
            "showactive": True,
        }
    ],
    width= 700,
    height= 400,
    title_x = 0.5,
)
fig.update_traces(xbins=dict( 
        size=0.5
    ))
#fig.write_html("proportion_ethnicity.html")
fig.show()

This graph shows two things:
- The overall number of actors has increased over time, which is consistent with the increase observed in the number of movies (see the Original version).
- Some countries represent an increasingly large proportion of the actors, like India, China or South Korea. Others have seen their proportion of actors decrease, like the United States and Canada (see the Normalized version).

This confirms that the main players in the movie industry have changed, shifting from Northern countries to Southern countries. However, this graph on its own is more an indicator of the development of countries than of their globalization. To visualize globalization, we must show that this increasing number of ethnicities does not remain isolated but rather takes part in the same movies.

To show this, let's look at the evolution of the diversity in movies.

To plot the following graph, we grouped the data by year and movie Freebase ID to count the ethnicities in each movie, then computed the mean and standard deviation of that count for each year. The plot was made using go.Scatter.
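The aggregation behind this plot can be sketched as follows. It is a hedged illustration, not the code of `ps.averageEthnicityPerYear`: the function name `ethnicity_stats_per_year` and the toy data are hypothetical, and it assumes that *distinct* ethnicities are counted per movie.

```python
import pandas as pd

def ethnicity_stats_per_year(df):
    """Mean and standard deviation of the number of distinct
    ethnicities per movie, for each release year."""
    per_movie = (
        df.groupby(["release_year", "freebase_id"])["ethnicity_name"]
          .nunique()                  # assumption: distinct ethnicities per movie
          .rename("n_ethnicities")
          .reset_index()
    )
    return (
        per_movie.groupby("release_year")["n_ethnicities"]
                 .agg(["mean", "std"])
                 .reset_index()
    )

# Toy data (hypothetical): two movies in 2000, one movie in 2001.
toy = pd.DataFrame({
    "release_year": [2000, 2000, 2000, 2000, 2001],
    "freebase_id": ["/m/a", "/m/a", "/m/b", "/m/b", "/m/c"],
    "ethnicity_name": ["X", "Y", "X", "X", "Z"],
})
stats = ethnicity_stats_per_year(toy)
```

Note that pandas computes the sample standard deviation by default, so a year with a single movie gets a missing value rather than 0.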

In [None]:
#plotting the evolution of the number of ethnicities per movie
fig=ps.averageEthnicityPerYear(df_ethnicities)
fig.show()
#fig.write_html("average_ethnicity_per_movie.html")

Overall, the mean number of ethnicities per movie has increased. The standard deviation has also increased considerably, which means that some movies are becoming very diverse while others remain much less so.

## III. 2. International careers

In this part, we will look at the evolution of careers. We will focus on 'international actors', that is, actors who played in movies produced by different countries. We will consider only single-country productions, since co-productions were already studied earlier. Besides, co-productions are decided by the producers, while single productions better highlight the individual choices of actors.

To create the international_actors_df, we used a function to count the number of countries involved in each production and kept only the single productions. Then, we used a function to obtain the release year from the release date. Finally, we merged the movie_df and the character_df on the movie ID to link the data on the actors with the data on the countries of production.
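The three steps above (filter, extract the year, merge) can be sketched as follows. This is an illustration under assumptions rather than the code of `ps.createInternationalActorDf`: the toy dataframes are hypothetical, `/m/hypothetical` is a made-up country ID, and the sketch assumes the country lists are stored as strings such as `"['/m/09c7w0']"`, as the preview below suggests.

```python
import ast
import pandas as pd

# Hypothetical miniature versions of the movie and character dataframes.
toy_movies = pd.DataFrame({
    "wiki_id": [1, 2],
    "countries_freebase_id": ["['/m/09c7w0']", "['/m/09c7w0', '/m/hypothetical']"],
    "release_date": ["2001-08-24", "1999"],
})
toy_characters = pd.DataFrame({"wiki_id": [1, 1, 2], "name": ["A", "B", "C"]})

# 1. Count the production countries and keep single productions only.
toy_movies["number_production_countries"] = toy_movies["countries_freebase_id"].apply(
    lambda s: len(ast.literal_eval(s))
)
single = toy_movies[toy_movies["number_production_countries"] == 1].copy()

# 2. The release year is the first four characters of the release date.
single["release_year"] = pd.to_numeric(single["release_date"].str[:4], errors="coerce")

# 3. Attach the production country to every character of the remaining movies.
merged = toy_characters.merge(single, on="wiki_id", how="inner")
```

An inner merge drops the characters of co-produced movies, which matches the choice of studying single productions only.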

In [112]:
# Creating a dataframe for our study. The movies and characters dataframes were merged to link actors and countries of production
# A feature was added for the release year of the movie
international_actors_df = ps.createInternationalActorDf()
international_actors_df.head()





Unnamed: 0,wiki_id,freebase_id,release_date,character,birth_date,gender,height,ethnicity_freebase_id,name,age,character_actor_freebase_id,character_freebase_id,actor_freebase_id,countries_freebase_id_x,number_production_countries,countries_freebase_id_y,release_year
0,975900,/m/03vyhn,2001-08-24,Lieutenant Melanie Ballard,1974-08-15,F,1.78,/m/044038p,Natasha Henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4,['/m/09c7w0'],1,2,2001
1,975900,/m/03vyhn,2001-08-24,Sgt Jericho Butler,1967-09-12,M,1.75,,Jason Statham,33.0,/m/02vchl6,/m/0bgchnq,/m/034hyc,['/m/09c7w0'],1,2,2001
2,975900,/m/03vyhn,2001-08-24,Whitlock,1945-08-02,F,1.753,,Joanna Cassidy,56.0,/m/02vd6kw,/m/0bgchmx,/m/06lj1m,['/m/09c7w0'],1,2,2001
3,975900,/m/03vyhn,2001-08-24,McSimms,1944-07-22,M,1.8,,Peter Jason,57.0,/m/0bgchxd,/m/0bgchxh,/m/03d663h,['/m/09c7w0'],1,2,2001
4,975900,/m/03vyhn,2001-08-24,Benchley,1935-08-13,M,,,Doug McGrath,66.0,/m/0bgcj4p,/m/0bgcj4s,/m/02r5d3j,['/m/09c7w0'],1,2,2001


Then, we used the createLinkDf function to create a dataframe in which each row represents a link between 2 countries. This function groups the data by actor to extract all the countries in which this actor played, and creates one row for each possible pair among these countries.

Then, we used findEarliestDate, which takes an actor and a country as arguments and returns the earliest date at which the actor played in a movie from that country.

We kept the later of date1 and date2 as the year in which the link between both countries appears: from this date on, they are linked because an actor has played in both of them.

Finally, we translated the country Freebase IDs into country names using the dictionary 'countries' and computed the number of occurrences of each pair for every year.
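The construction of the links can be sketched as follows. This is a simplified illustration, not the code of `ps.createLinkDf` or `ps.findEarliestDate`: the function `build_links`, the column name `country` and the toy data are hypothetical, and the earliest dates are computed once with a groupby instead of one lookup per row.

```python
from itertools import combinations
import pandas as pd

def build_links(df):
    """One row per actor and per pair of countries the actor played in.
    The link date is the later of the two first appearances."""
    # Earliest year in which each actor played in a movie from each country.
    first_year = df.groupby(["actor", "country"])["release_year"].min()

    rows = []
    for actor, countries in df.groupby("actor")["country"].unique().items():
        for c1, c2 in combinations(sorted(countries), 2):
            d1 = first_year.loc[(actor, c1)]
            d2 = first_year.loc[(actor, c2)]
            rows.append({"actor": actor, "country1": c1, "country2": c2,
                         "date1": d1, "date2": d2, "date": max(d1, d2)})
    columns = ["actor", "country1", "country2", "date1", "date2", "date"]
    return pd.DataFrame(rows, columns=columns)

# Toy data (hypothetical): actor A played in two countries, actor B in one.
toy = pd.DataFrame({
    "actor": ["A", "A", "A", "B"],
    "country": ["France", "USA", "USA", "France"],
    "release_year": [1990, 1995, 1993, 2000],
})
links = build_links(toy)
# Number of links created for each pair of countries and each year.
links_per_year = links.groupby(["country1", "country2", "date"]).size()
```

Sorting the countries before pairing them means a given pair always appears in the same order, so the same link is not counted under two different keys.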

In [None]:
#creating a dataframe with all the links in the form country1, country2, actor, date1, date2, date
df_links = ps.createLinkDf(international_actors_df)

#findEarliestDate is a function that takes an actor and a country and return the earliest date in which the actor played in that country
df_links['date1'] = df_links.apply(lambda row: ps.findEarliestDate(row['actor'], row['country1'], international_actors_df), axis=1)
df_links['date2'] = df_links.apply(lambda row: ps.findEarliestDate(row['actor'], row['country2'], international_actors_df), axis=1)

#date is the latest of date1, date2. The link will start to exist starting from date
df_links['date'] = df_links.apply(lambda row: max(row['date1'], row['date2']), axis=1)

# adding the names of the countries using the dictionary
with open('./data/freebaseIdDictionnaries/countries', 'r') as file:
        countries_dict = json.load(file)
df_links['name1'] = df_links['country1'].apply(lambda x: countries_dict[x.strip("['']")])
df_links['name2'] = df_links['country2'].apply(lambda x: countries_dict[x.strip("['']")])

#computing the number of occurrences of each pair for each year
df_links_grouped = df_links.groupby(['name1', 'name2', 'date'])
links_occurence = pd.DataFrame(df_links_grouped['country1'].count())
links_occurence = links_occurence.reset_index(names=['name1', 'name2', 'date'])


           name1                     name2    date  country1
0        Algeria                    Israel  1977.0         1
1        Algeria                     Italy  1987.0         1
2      Argentina                   Belgium  1993.0         1
3      Argentina                   Bolivia  1999.0         1
4      Argentina                    Brazil  1962.0         1
...          ...                       ...     ...       ...
5551  Yugoslavia              West Germany  1969.0         2
5552  Yugoslavia              West Germany  1979.0         1
5553  Yugoslavia              West Germany  1983.0         1
5554    Zimbabwe            United Kingdom  2012.0         1
5555    Zimbabwe  United States of America  2011.0         1

[5556 rows x 4 columns]


In [172]:
# Cleaning the links and binning them into 5-year periods
links_occurence = links_occurence.dropna()
links_occurence = links_occurence.sort_values('date') # to have dates in chronological order
links_occurence['period'] = (links_occurence['date'] // 5) * 5 # binning the dates into 5-year periods to keep the output file light
links_occurence['date'] = links_occurence['date'].astype(int)

We plotted the data on a map on which two countries are linked when an actor played in both of them. We used geopandas to compute the country centroids, scatter_geo for the base map and add_scattergeo for the lines.

In [None]:
# plotting the map with a slider



# loading the geographical data
world = gpd.read_file('./data/map/ne_110m_admin_0_countries.shp')

# changing data into the right projection to compute the centroids
world_projected = world.to_crs("EPSG:3395")
world_projected["centroid"] = world_projected.geometry.centroid
world["centroid"] = world_projected["centroid"].to_crs(world.crs) # converting back the centroids into the right projection

# Linking each centroid with a country
centroids = world.set_index("NAME")["centroid"] 
links_occurence["coord1"] = links_occurence["name1"].map(centroids)
links_occurence["coord2"] = links_occurence["name2"].map(centroids)

#Fixing centroid of USA (without Alaska), Canada, Norway and France
links_occurence.loc[links_occurence['name1'] == 'United States of America', 'coord1'] = Point(-98.35, 39.50)
links_occurence.loc[links_occurence['name2'] == 'United States of America', 'coord2'] = Point(-98.35, 39.50)

links_occurence.loc[links_occurence['name1'] == 'Canada', 'coord1'] = Point(-105, 56.8)
links_occurence.loc[links_occurence['name2'] == 'Canada', 'coord2'] = Point(-105, 56.8)

links_occurence.loc[links_occurence['name1'] == 'Norway', 'coord1'] = Point(10, 62)
links_occurence.loc[links_occurence['name2'] == 'Norway', 'coord2'] = Point(10, 62)

links_occurence.loc[links_occurence['name1'] == 'France', 'coord1'] = Point(2.25, 46.32)
links_occurence.loc[links_occurence['name2'] == 'France', 'coord2'] = Point(2.25, 46.32)

# Creating the figure
fig = px.scatter_geo(
    world,  # GeoDataFrame
    locations="ISO_A3",  # countries id
    hover_name="NAME",  # names to print
    title="Countries involved in international careers",
    
)

# Adding the links from links_occurence df.
# trace_periods stores, for every trace of fig, the period in which it appears
# (None for the base map, which is always visible).
periods = sorted(links_occurence["period"].unique())
trace_periods = [None] * len(fig.data)
for period in periods:
    # selecting the links on "period" (not "date") so that every year of the 5-year bin is included
    df_period = links_occurence[links_occurence["period"] == period]
    df_period = df_period.dropna(subset=["coord1", "coord2"])

    for _, row in df_period.iterrows():
        fig.add_scattergeo(
            lon=[row["coord1"].x, row["coord2"].x],
            lat=[row["coord1"].y, row["coord2"].y],
            mode="lines",
            line=dict(width=row["country1"] / 10, color="blue"),
            name=f"{row['name1']} ↔ {row['name2']} ({row['period']})",
            opacity=0.8,
        )
        trace_periods.append(period)

# Initial state: only the base map and the links of the first period are visible
for trace, p in zip(fig.data, trace_periods):
    trace.visible = bool(p is None or p <= periods[0])

# Adding a slider: a link stays visible from the period in which it appears onwards
fig.update_layout(showlegend=False, 
sliders = [
    dict(
        active=0,
        currentvalue={"prefix": "Year: "},
        pad={"t": 50},
        steps=[
            dict(
                label=str(int(period)),
                method="update",
                args=[{"visible": [bool(p is None or p <= period) for p in trace_periods]}],
            )
            for period in periods
        ]
    )
],
height=400,
width=700, 
title_x = 0.5)

#fig.write_html("international_careers.html")
fig.show()

Interpretation:

- On the map in **1910**, only a few countries are linked together. This means that actors with an international career come from and go to a very limited number of countries. Most of them stay in countries that share the **same language**, like the *USA, Great Britain and Australia* or *Argentina and Spain*. Others go to **colonies** of their home country, like *Great Britain and India*. Most of the countries involved are countries of the North, meaning rich and developed countries (and their colonies).

- **As time goes by**, the number of countries involved in international careers increases: more populations take part in the process of globalisation.
- The diversity of links between countries also increases: for an international career, the possibilities become much broader!

- By **2013**, almost all countries are involved in the process, except for a large part of Africa and some countries in the Middle East and Central Asia.

Let's compare these results to the KOF globalization index (KOFGI). The KOF Globalization Index measures the overall extent of globalization in a country, encompassing economic, social, and political dimensions. It captures the integration of economies, international interactions, and the exchange of ideas and information across borders.

We downloaded the KOF index data from https://kof.ethz.ch/en/forecasts-and-indicators/indicators/kof-globalisation-index.html. We dropped the last rows, as they do not represent countries. Then, we plotted a map to visualize the evolution of the index over time.

In [164]:
#creating a df with the KOF globalization index
kof_df = pd.read_csv("./data/additionalData/KOFGI_2023.csv", delimiter=';')
kof_df = kof_df[:10556]
kof_df = kof_df.dropna(subset=['KOFGI', 'KOFIpGI', 'year'])#the features we will need
kof_df = kof_df.sort_values('year')
kof_df.head()

Unnamed: 0,code,country,year,KOFGI,KOFGIdf,KOFGIdj,KOFEcGI,KOFEcGIdf,KOFEcGIdj,KOFTrGI,...,KOFIpGIdj,KOFInGI,KOFInGIdf,KOFInGIdj,KOFCuGI,KOFCuGIdf,KOFCuGIdj,KOFPoGI,KOFPoGIdf,KOFPoGIdj
6604,MRT,Mauritania,1970,27.186024,29.442863,24.999575,33.695881,35.241371,31.887442,30.233255,...,33.500343,32.225029,52.444408,18.708572,8.004759,4.630645,10.124382,22.910034,22.48501,23.33506
4160,HTI,Haiti,1970,22.379978,23.714258,20.977736,19.471323,18.761435,20.302002,19.943115,...,13.992142,19.374022,25.310408,13.437638,16.850893,19.106304,14.595482,29.917631,31.310732,28.524534
2236,CYM,Cayman Islands,1970,37.92717,47.302826,26.366627,60.832619,79.347786,,,...,75.014,,,58.083294,,54.854565,,2.371564,3.743128,1.0
6760,MYS,Malaysia,1970,40.994896,47.596378,34.393414,47.968849,52.655167,43.282539,50.455193,...,43.996212,18.711515,15.752656,21.670376,38.697311,41.42112,35.973507,37.60067,49.183662,26.017679
4472,IRQ,Iraq,1970,30.922258,38.107182,23.241814,41.447861,53.049034,25.504503,39.849258,...,18.452021,17.629368,10.622619,24.636116,14.751761,25.721161,3.782361,33.976517,38.739338,29.213696


In [167]:
#map with the KOF globalization index
# Creating the map
fig_kof = px.choropleth(
    kof_df,
    locations="country",  # column containing the country names
    locationmode="country names",  # tells plotly that country names are used
    color="KOFGI",  # column used for the color
    animation_frame="year",
    color_continuous_scale="Inferno",  # color palette
    title="Map of the KOF globalization index",
    labels={"KOFGI": "KOFGI"}
)
fig_kof.update_layout(title_x = 0.5)
#fig_kof.write_html("map_kof.html")
fig_kof.show()

When comparing both maps, we can observe similarities. The countries that are strongly involved in international careers have a high KOF index (yellowish on the map) while the countries that are not involved at all have a low index (purple on the map). 