## Summary of the notebook

0. Import of **libraries**

1. **Import and treatment** of first dataset (the movie one)

2. **Import and treatment** of second dataset (the proportion of women in the population by country and year)

3. **First plot** showing the number of actors by gender and region

4. **Second plot** showing a heatmap of the proportion of women by generation and region

5. With the use of the first 2 datasets, **calculation of** the over- or under- **representation** of women in movies

6. **Third plot:** interactive maps of representation by generation

#### 0. Import of libraries

In [2]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import json

In [3]:
#data_folder = "C:/Users/bapti/ADA2024-projet/MovieSummaries/" #Baptiste
#data_folder = "C:/Users/cricl/PROJADA/MovieSummaries/MovieSummaries/" #Etienne
data_folder = "C:/Users/locar/Desktop/Project ADA/ada-2024-project-ecureuilscosmiques24/" #Léo
#data_folder = "C:/Users/thoma/Documents/ADA_Projet_Data/MovieSummaries/" #Thomas
########### CHANGE THE FOLDER PATH ##################

#### 1. Import and treatment of the first dataset (cleaned_countries.csv)
Here, a function that adds a generation is created; generations are 25-year periods of movie release.

Then, using a JSON file of the countries of the world, we add an ISO column holding the 3-letter code of each country. The JSON file is used again later to draw the world map.

In [4]:
df = pd.read_csv(data_folder+'data_cleaned_countries.csv')
df = df[['Movie_release_date', 'Actor_gender_male', 'Main_country', 'Region']] #remove unused columns

def gen25(year):
    if 1900<year<=1925:
        return "1900-1925"
    if 1925<year<=1950:
        return "1925-1950"
    if 1950<year<=1975:
        return "1950-1975"
    if 1975<year<=2000:
        return "1975-2000"
    if 2000<year<= 2025:
        return "2000-2020"

order = ["2000-2020","1975-2000","1950-1975","1925-1950","1900-1925"]

df['Generation'] = df['Movie_release_date'].apply(gen25)
df['Generation'] = df['Generation'].astype("category")
df['Generation'] = df['Generation'].cat.reorder_categories(order, ordered=True) #assign the result: reorder_categories is not in-place

#opening json file to add the iso file
with open('countries.geo.json', 'r') as file:
    json_data = json.load(file)

#adding ISO to the df
country_iso_mapping = {feature["properties"]["name"]: feature["id"] for feature in json_data["features"]}
df["ISO"] = df["Main_country"].map(country_iso_mapping)

#display of the df
df.head()

Unnamed: 0,Movie_release_date,Actor_gender_male,Main_country,Region,Generation,ISO
0,2001.0,0,United States of America,North America,2000-2020,USA
1,2001.0,0,United States of America,North America,2000-2020,USA
2,2001.0,1,United States of America,North America,2000-2020,USA
3,2001.0,1,United States of America,North America,2000-2020,USA
4,2001.0,0,United States of America,North America,2000-2020,USA
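
As a side note, the 25-year binning done by `gen25` can also be written with `pd.cut`. The sketch below is only an illustration on toy years (not taken from the dataset); it reuses the same bin edges and labels as the cell above, including the `"2000-2020"` label for the last bin.

```python
import pandas as pd

# Bin edges and labels mirror gen25; intervals are (left, right], like 1900<year<=1925.
bins = [1900, 1925, 1950, 1975, 2000, 2025]
labels = ["1900-1925", "1925-1950", "1950-1975", "1975-2000", "2000-2020"]

# Toy release years, chosen to hit the bin boundaries (not from the dataset).
years = pd.Series([1901.0, 1925.0, 1926.0, 2001.0, 1890.0])

# Years outside the bins (here 1890) become NaN; the result is an ordered categorical.
generation = pd.cut(years, bins=bins, labels=labels, right=True)
print(generation.tolist())
```

Because `pd.cut` returns an ordered categorical, no separate `astype("category")` step would be needed with this variant.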


In [5]:
#showing all countries without an ISO code: mostly countries that no longer exist or that appear under a different name in the JSON file
missing_iso_country = df[df["ISO"].isna()]["Main_country"].unique()
missing_df = pd.DataFrame(missing_iso_country, columns=["Main_country"])
print("Main_countries with missing ISO:")
print(missing_df)

Main_countries with missing ISO:
                                Main_country
0                               Soviet Union
1                                  Hong Kong
2                                 Yugoslavia
3                               West Germany
4                            Weimar Republic
5                             Czechoslovakia
6                 German Democratic Republic
7                            Slovak Republic
8                                  Singapore
9                   Kingdom of Great Britain
10                     Serbia and Montenegro
11                               Isle of Man
12                                     Korea
13                                   England
14                                    Serbia
15                                     Wales
16                                     Burma
17                     Republic of Macedonia
18                                   Bahamas
19  Socialist Federal Republic of Yugoslavia
20                    
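
Some names in this list are countries that still exist, so they could be recovered with a small manual mapping. The sketch below is hypothetical and is not part of the notebook: the `manual_iso` dictionary is an assumption, it uses standard ISO 3166-1 alpha-3 codes, and whether `countries.geo.json` contains a shape for each of these codes has not been checked.

```python
import pandas as pd

# Hypothetical manual overrides: name as written in Main_country -> ISO alpha-3 code.
manual_iso = {
    "Hong Kong": "HKG",
    "Singapore": "SGP",
    "Bahamas": "BHS",
    "Serbia": "SRB",
    "Burma": "MMR",
}

# Toy frame standing in for df (not real data).
toy = pd.DataFrame({
    "Main_country": ["Hong Kong", "Soviet Union", "France"],
    "ISO": [None, None, "FRA"],
})

# Fill only the missing ISO codes; names absent from manual_iso stay missing.
toy["ISO"] = toy["ISO"].fillna(toy["Main_country"].map(manual_iso))
print(toy)
```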

#### 2. Import and treatment of second dataset (share-population-female.csv)

This dataset contains the proportion of women in every country of the world per year, from 1950 to 2020. We average this proportion for each country and generation. This allows us to see whether women are over- or under-represented in the movies of each country that has enough data.

Since the data only start in 1950, the first two generations are ignored.

In [6]:
df2 = pd.read_csv(data_folder+'female_population_prop.csv')
df2.rename(columns={'Entity': 'Country', 'Code': 'ISO','Population, female (% of total population)': 'F_prop_population'},inplace=True)
df2['Generation'] = df2['Year'].apply(gen25)

df2 = df2[['Country', 'ISO', 'Generation', 'F_prop_population']].copy() #copy to avoid modifying a slice of the original frame
df2['F_prop_population'] /= 100
df2 = df2.groupby(['Country','ISO', 'Generation'])['F_prop_population'].mean().reset_index()
df2.head()

Unnamed: 0,Country,ISO,Generation,F_prop_population
0,Afghanistan,AFG,1950-1975,0.485638
1,Afghanistan,AFG,1975-2000,0.496461
2,Afghanistan,AFG,2000-2020,0.495188
3,Albania,ALB,1950-1975,0.484692
4,Albania,ALB,1975-2000,0.490257


#### 3. First generic plot showing the number of actors by gender and by region

In [7]:
df_plot = df[df['Region'] != "Dead country"]
df_plot = pd.crosstab(df_plot['Region'], df_plot['Actor_gender_male'])
df_plot['Total'] = df_plot.sum(axis=1)  
df_plot = df_plot.sort_values('Total', ascending=False).drop(columns='Total')

fig = go.Figure()

for gender in df_plot.columns:
    fig.add_trace(go.Bar(
        x=df_plot.index,
        y=df_plot[gender],
        name="Male" if gender == 1 else "Female", #Actor_gender_male: 1 = male, 0 = female
        marker=dict(opacity=0.7)
    ))

fig.update_layout(
    barmode='group',
    title="Number of actors by region and gender",
    xaxis_title="Region",
    yaxis_title="Number of actors (log scale)",
    yaxis_type="log",
    legend_title="Actor Gender",
    template="plotly_white"
)

os.makedirs("plots", exist_ok=True) #make sure the output folder exists
fig.write_html("plots/bar_chart_region_actors.html")
print('Plot saved.')


Plot saved.
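
For readers unfamiliar with `pd.crosstab`, the sketch below shows the kind of count table that feeds the bar chart, on a toy frame (not real data). Renaming the `0`/`1` columns to `Female`/`Male` is an illustrative choice that assumes `Actor_gender_male` is 1 for men and 0 for women.

```python
import pandas as pd

# Toy rows: one row per actor, with a region and a 0/1 gender flag.
toy = pd.DataFrame({
    "Region": ["Asia", "Asia", "Africa", "Asia"],
    "Actor_gender_male": [0, 1, 1, 1],
})

# Rows are regions, columns are gender flags, cells are counts.
counts = pd.crosstab(toy["Region"], toy["Actor_gender_male"])
counts = counts.rename(columns={0: "Female", 1: "Male"})
print(counts)
```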


#### 4. Second plot: heatmap by generation and region

In [8]:
def get_proportion(df, scale, threshold, order):
    '''
    Gives a table of the proportion of female actors (rows: generations, columns: scale).

    df: the dataset
    scale: either 'Region' or 'Main_country' (any column of df works, e.g. 'ISO')
    threshold: cells with fewer than `threshold` rows are set to NaN (not enough data)
    order: order of the generations
    '''
    df = df[df['Region'] != "Dead country"] #remove dead countries
    total_counts = pd.crosstab(df['Generation'], df[scale])
    females = df[df['Actor_gender_male'] == 0]
    female_counts = pd.crosstab(females['Generation'], females[scale])
    #align with total_counts so that cells without any female actor count as 0 instead of NaN
    female_counts = female_counts.reindex(index=total_counts.index, columns=total_counts.columns, fill_value=0)
    mask = total_counts < threshold
    proportions = female_counts / total_counts
    proportions = proportions.mask(mask, other=np.nan)
    proportions = proportions.reindex(order)
    return proportions

threshold = 90
proportions_region = get_proportion(df, 'Region', threshold, order)
#order by number of movies
proportions_region = proportions_region[['North America', 'West Europa', 'Asia', 'East Europa', 'Oceania', 'South America', 'Africa']]


fig = px.imshow(
    proportions_region.values,
    labels=dict(x="Region", y="Generation", color="Proportion of Female Actors"),
    x=proportions_region.columns, 
    y=proportions_region.index,
    color_continuous_scale="RdBu",
    zmin=0.22,zmax=0.42,
    text_auto=".2f"
    )
fig.update_layout(
    title="Proportion of female actors by generation",
    xaxis_title="Region",
    yaxis_title="Generation",
    coloraxis_colorbar=dict(title="Proportion"),
    template="plotly_white"
    )

fig.write_html("plots/heatmap_female_actors.html")
print('Plot saved.')


Plot saved.
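
The thresholding inside `get_proportion` can be illustrated on a toy count table. The numbers below are made up; the sketch only shows how `mask` blanks out cells whose total count is below the threshold.

```python
import pandas as pd

# Toy count tables: rows are generations, columns are regions (made-up numbers).
total = pd.DataFrame(
    {"Asia": [200, 40], "Africa": [95, 10]},
    index=["2000-2020", "1975-2000"],
)
female = pd.DataFrame(
    {"Asia": [70, 15], "Africa": [19, 4]},
    index=total.index,
)

threshold = 90

# Cells with fewer than `threshold` rows in total are replaced by NaN.
proportions = (female / total).mask(total < threshold)
print(proportions)
```

In the heatmap, such NaN cells are simply left blank, which avoids showing proportions computed from very few rows.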


#### 5. Cleaning proportion by country dataset and calculation of the representation

In [9]:
#getting the proportion by country (and not by region like for the heatmap)
proportions_country = get_proportion(df, 'ISO', threshold, order)

proportions_country = proportions_country.reset_index()
proportions_long = pd.melt(proportions_country, id_vars=['Generation'], var_name='ISO', value_name='F_movie_proportion')

#merge the dataframe of proportion in movies with df2 (proportion in population)
df_proportions = pd.merge(df2, proportions_long, on=['ISO', 'Generation'], how='left')

df_proportions = df_proportions.dropna(subset=["F_movie_proportion"])

#computing the representation
df_proportions['Representation'] = (df_proportions['F_movie_proportion']/df_proportions['F_prop_population']-1)*100

#keeping only interesting columns
df_proportions = df_proportions[['Country', 'ISO','Generation','Representation']]
df_proportions.head()

Unnamed: 0,Country,ISO,Generation,Representation
22,Argentina,ARG,1975-2000,-44.100613
23,Argentina,ARG,2000-2020,-22.492896
31,Australia,AUS,1975-2000,-35.434386
32,Australia,AUS,2000-2020,-28.402145
33,Austria,AUT,1950-1975,-22.363489


#### 6. Plot of the representation by country in a world map for each generation

In [11]:
df_1950 = df_proportions[df_proportions['Generation'] == '1950-1975']
df_1975 = df_proportions[df_proportions['Generation'] == '1975-2000']
df_2000 = df_proportions[df_proportions['Generation'] == '2000-2020']

fig = go.Figure()

fig.add_trace(go.Choropleth(
    geojson=json_data,
    locations=df_1950["ISO"],
    z=df_1950["Representation"],
    colorscale=[[0, "red"], [0.75, "white"], [1, "blue"]],
    zmin=-60,
    zmax=20,
    hovertext=df_1950["Country"],
    hoverinfo="text+z",
    visible=True,
    colorbar_title="Representation"
))

fig.add_trace(go.Choropleth(
    geojson=json_data,
    locations=df_1975["ISO"],
    z=df_1975["Representation"],
    colorscale=[[0, "red"], [0.75, "white"], [1, "blue"]],
    zmin=-60,
    zmax=20,
    hovertext=df_1975["Country"],
    hoverinfo="text+z",
    visible=False,
    colorbar_title="Representation"
))

fig.add_trace(go.Choropleth(
    geojson=json_data,
    locations=df_2000["ISO"],
    z=df_2000["Representation"],
    colorscale=[[0, "red"], [0.75, "white"], [1, "blue"]],
    zmin=-60,
    zmax=20,
    hovertext=df_2000["Country"],
    hoverinfo="text+z",
    visible=False,
    colorbar_title="Representation"
))

#add a dropdown menu to switch between generations
fig.update_layout(
    updatemenus=[
        dict(
            buttons=[
                dict(label="1950-1975",method="update",args=[{"visible": [True, False, False]}]),#gen1
                dict(label="1975-2000",method="update",args=[{"visible": [False, True, False]}]),#gen2
                dict(label="2000-2020",method="update",args=[{"visible": [False, False, True]}]) #gen3
            ],
            direction="down",
            showactive=True
        )
    ],
    title="Representation of women in movies across generations",
    title_x=0.5,
)

#geographic layout and projection
fig.update_geos(
    showcountries=True,
    countrycolor="Black",
    projection_type="equirectangular"
)

fig.write_html("plots/women_representation_map.html")
print("HTML plot saved.")
#fig.show()

HTML plot saved.
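
Two details of the map cell can be checked with plain Python. The sketch below (variable names are illustrative, not from the notebook) shows why the white stop of the colorscale sits at 0.75 given `zmin=-60` and `zmax=20`, and how the `visible` lists of the dropdown buttons can be derived from the list of generations instead of being typed by hand.

```python
zmin, zmax = -60, 20

# Position of z = 0 on the normalized [0, 1] colorscale axis.
zero_position = (0 - zmin) / (zmax - zmin)

generations = ["1950-1975", "1975-2000", "2000-2020"]

# One visibility list per generation: only the matching trace is shown.
visibility = {
    gen: [i == j for j in range(len(generations))]
    for i, gen in enumerate(generations)
}

print(zero_position)
print(visibility["1975-2000"])
```

So a representation of 0 (parity with the population) is drawn in white, under-representation shades toward red and over-representation toward blue.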
