# DRACO - Milestone 2: Dataset exploration

This document is structured as follows:

1. Characters Data - Extraction and Processing
2. Movie Data - Extraction and Processing
3. Actors Ethnicities - Exploration

---

In [None]:
import pandas as pd
import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
DATA_FOLDER = './Data/'

CHARACTER_PATH = DATA_FOLDER + 'MovieSummaries/character.metadata.tsv'
MOVIE_PATH = DATA_FOLDER + 'MovieSummaries/movie.metadata.tsv'
ETHNICITY_PATH = DATA_FOLDER + 'ethnicities_data.tsv'
NAME_PATH = DATA_FOLDER + 'MovieSummaries/name.clusters.txt'
PLOT_PATH = DATA_FOLDER + 'MovieSummaries/plot_summaries.txt'

## Characters Data - Extraction and Processing

First, we will load the character dataset and the ethnicity dataset. This is done to subsequently merge the two dataframes, connecting the characters with the ethnicity of the actors.

In [None]:
characters_original = pd.read_csv(CHARACTER_PATH, sep='\t', header=None, 
    names = ["Wikipedia Movie ID", "Freebase Movie ID", "Movie release date", "Character name", "Birth", 
    "Gender", "Height", "Ethnicity ID", "Name", "Age at movie release",
    "Freebase character/actor map ID", "Freebase character ID", "Freebase actor ID"])
characters = characters_original.copy()
characters.head()

In [None]:
ethnicities_data_original = pd.read_csv(ETHNICITY_PATH, sep='\t',  
                               header=0, names=["Ethnicity ID", "Ethnicity"])
ethnicities_data = ethnicities_data_original.copy()
ethnicities_data.head()

After loading both datasets, we perform an inner join on the `Ethnicity ID` column, which keeps only the characters whose actor has a known ethnicity label.

In [None]:
characters = characters_original.copy().merge(ethnicities_data.dropna(), how='inner', on='Ethnicity ID')
characters.head()
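
To see which rows an inner join discards, it can help to audit the merge first. The sketch below uses small made-up dataframes (the names `toy_characters` and `toy_ethnicities` and all their values are hypothetical) and the `indicator=True` option of `merge` to flag where each row comes from.

```python
import pandas as pd

# Toy data for illustration only, not taken from the real dataset.
toy_characters = pd.DataFrame({
    "Name": ["Actor A", "Actor B", "Actor C"],
    "Ethnicity ID": ["/m/001", None, "/m/999"],
})
toy_ethnicities = pd.DataFrame({
    "Ethnicity ID": ["/m/001", "/m/002"],
    "Ethnicity": ["Label 1", "Label 2"],
})

# An outer join with indicator=True adds a '_merge' column telling
# whether a row exists in the left frame, the right frame, or both.
audit = toy_characters.merge(
    toy_ethnicities, how="outer", on="Ethnicity ID", indicator=True
)
print(audit["_merge"].value_counts())

# The inner join keeps only the rows flagged 'both' in the audit.
toy_merged = toy_characters.merge(
    toy_ethnicities.dropna(), how="inner", on="Ethnicity ID"
)
print(toy_merged)
```

Comparing the row counts before and after the join gives a quick idea of how many characters are lost because their actor has no ethnicity label.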

## Movie Data - Extraction and Processing

Next, we load the movie dataset. This is essential as we plan to merge the character dataframe with the movie dataframe, creating a comprehensive dataframe that encompasses all character-related information.

In [None]:
movies_original = pd.read_csv(MOVIE_PATH, sep='\t', header=None, 
    names = ["Wikipedia Movie ID", "Freebase Movie ID", "Movie name","Movie release date", "Box office revenue","Movie runtime","Movie language","Movie countries","Movie genres" ])
movies = movies_original.copy()
movies.head()

We can observe that the three columns `Movie language`, `Movie countries`, and `Movie genres` contain dictionaries. In our case, it would be much more convenient to have lists instead. Let's process them accordingly.

In [None]:
def dict_values_or_nan(raw):
    # Keep only the values of the JSON dictionary; empty dictionaries become the 'NaN' placeholder
    values = list(json.loads(raw).values())
    return values if values else 'NaN'
for column in ["Movie countries", "Movie language", "Movie genres"]:
    movies[column] = movies[column].apply(dict_values_or_nan)
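
As a small illustration of the conversion from JSON dictionaries to lists, here is a sketch on made-up strings. The helper name `names_or_none` and the IDs are hypothetical. Unlike the notebook, this variant uses `None` rather than the `'NaN'` string for empty dictionaries, so that `isna()` can detect them; cells that compare against `"NaN"` would need to be adapted if this variant were adopted.

```python
import json
import pandas as pd

# Made-up examples in the same shape as the raw columns.
toy = pd.Series([
    '{"/m/x1": "English Language"}',
    '{}',
    '{"/m/x2": "Drama", "/m/x3": "Comedy"}',
])

def names_or_none(raw):
    """Return the dictionary values as a list, or None if the dictionary is empty."""
    names = list(json.loads(raw).values())
    return names if names else None

parsed = toy.apply(names_or_none)
print(parsed)
# Empty dictionaries are now regular missing values.
print(parsed.isna())
```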

We should also add a column for the release year, in addition to the release date.

In [None]:
# errors='coerce' turns the dates that cannot be parsed or are out of bounds into NaT
movies["Movie release year"] = pd.to_datetime(movies["Movie release date"], format='mixed', errors='coerce').dt.year
# Remove the rows without a valid release year (copy to avoid a SettingWithCopyWarning)
movies = movies[movies["Movie release year"].notna()].copy()
# Express all release years as int
movies["Movie release year"] = movies["Movie release year"].astype("int")
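
Coercing with `pd.to_datetime` discards every date that falls outside the supported timestamp range (with the default nanosecond resolution, roughly the years 1677 to 2262). If only the year is needed, one possible alternative is to read the leading four digits directly. This is a sketch under the assumption that every date string starts with a four-digit year; the sample values are made up.

```python
import pandas as pd

# Made-up date strings in mixed formats, including a missing value.
toy_dates = pd.Series(["2001-08-24", "1988", "1010-12-02", None, "1995-05"])

# Extract the first four digits and convert them to a nullable integer column.
year_text = toy_dates.str.extract(r"^(\d{4})", expand=False)
years = pd.to_numeric(year_text, errors="coerce").astype("Int64")
print(years)
```

This variant keeps implausible years such as 1010, so they would have to be filtered explicitly rather than dropped silently.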


As the dataset was released in 2013, the data for that year is incomplete and should therefore be removed.

In [None]:
movies = movies[movies['Movie release year'] != 2013]

Next, we remove the animated movies, since their casts are voice actors rather than on-screen actors. In this genre, the appearance and the ethnicity of the actors are generally not as relevant as in live-action movies.

In [None]:
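# First let's see all the unique genres, to discard the animation ones:
unique_genres = set()
for genres in movies['Movie genres']:
    # Skip the 'NaN' placeholder string, whose characters would otherwise be added one by one
    if isinstance(genres, list):
        unique_genres.update(genres)
print(unique_genres)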

In [None]:
values_to_find = ['Anime', 'Animation', 'Computer Animation', 'Clay animation', 'Animated cartoon','Stop motion']
movies = movies[movies['Movie genres'].apply(lambda x: not(any(value in x for value in values_to_find)))]
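
The same kind of filter can be written with set operations, which also makes the handling of missing genres explicit. This is a minimal sketch on toy data; the helper `is_live_action` and the shortened exclusion set are hypothetical.

```python
import pandas as pd

# Toy movies: one animated, one live-action, one without genre information.
toy_movies = pd.DataFrame({
    "Movie name": ["A", "B", "C"],
    "Movie genres": [["Drama", "Animation"], ["Comedy"], "NaN"],
})
excluded = {"Anime", "Animation"}

def is_live_action(genres):
    """Keep a movie unless one of its genres is in the exclusion set."""
    # Non-list placeholders (such as the 'NaN' string) are kept.
    return not isinstance(genres, list) or excluded.isdisjoint(genres)

kept = toy_movies[toy_movies["Movie genres"].apply(is_live_action)]
print(kept)
```

Whether movies without any genre information should be kept or dropped is a design choice; here they are kept.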


Now, let's merge the movie dataframe with the character dataframe to create a single dataframe that encompasses all the information about a character along with details about the films they are involved in.

In [None]:
characters_movies = characters.merge(movies, how='inner', on=['Wikipedia Movie ID',"Freebase Movie ID","Movie release date"])
characters_movies.head()

## Actors Ethnicities - Exploration

### Ethnicity distribution

Let's visualize the proportion of ethnicities among the characters.
For now, we mostly look at the distribution of actors, so we should also remove duplicate actors.

In [None]:
unique_characters_movies = characters_movies.copy().drop_duplicates(subset='Freebase actor ID')
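
Note that `drop_duplicates` keeps the first row it meets for each actor, so every actor ends up attached to a single movie (and therefore a single release year and country). The toy example below, with made-up IDs and titles, illustrates this behavior.

```python
import pandas as pd

# Toy cast list: actor /m/a1 appears in two movies.
toy_cast = pd.DataFrame({
    "Freebase actor ID": ["/m/a1", "/m/a1", "/m/a2"],
    "Movie name": ["Film X", "Film Y", "Film Z"],
    "Movie release year": [1990, 2005, 2001],
})

# Only the first row of each actor survives the deduplication.
unique_actors = toy_cast.drop_duplicates(subset="Freebase actor ID")
print(unique_actors)

# Counting the roles per actor beforehand shows how much is discarded.
roles_per_actor = toy_cast.groupby("Freebase actor ID").size()
print(roles_per_actor)
```

This matters for the analyses over time further down: after deduplication, an actor only counts in the year of the first movie listed for them.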

In [None]:
grouped_ethnicity=unique_characters_movies.groupby('Ethnicity').count()
grouped_ethnicity['Freebase actor ID'].sort_values(ascending=False)[0:20].plot(kind='bar')

In [None]:
grouped_ethinity = unique_characters_movies.groupby(['Ethnicity']).count()
#grouped_ethinity = grouped_ethinity.div(grouped_ethinity.sum(axis=1), axis=0)
main_ethnicities = grouped_ethinity['Freebase actor ID'].sort_values(ascending=False)[0:20]
main_ethnicities.plot(kind='bar')
plt.title("Distribution of the main actor ethnicities")
plt.xlabel("Ethnicities")
plt.ylabel("Number of actors")

We notice that many characters are portrayed by Indian actors. Our project, however, focuses primarily on Hollywood characters. Let's check whether this choice is representative of the movie industry.

In [None]:
characters_movies_main_country = unique_characters_movies[unique_characters_movies["Movie countries"] != "NaN"].copy()
characters_movies_main_country["Movie countries"] = characters_movies_main_country["Movie countries"].apply(lambda x: x[0])

In [None]:
nb_actors_per_industries = characters_movies_main_country.groupby(['Movie countries']).count()['Freebase actor ID'].sort_values(ascending=False)
nb_actors_per_industries.head(10)

To get a better view of the main industries, let's consider only the industries with more than 100 different actors.

In [None]:
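# Industries with at most 100 actors are grouped under 'Other' (this includes the ones with exactly 100)
is_small = nb_actors_per_industries <= 100
other_count = nb_actors_per_industries[is_small].sum()
nb_actors_per_industries = nb_actors_per_industries[~is_small].sort_values(ascending=False)
nb_actors_per_industries['Other'] = other_count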

In [None]:
nb_actors_per_industries.plot.pie(figsize=(10, 10), autopct='%1.1f%%', startangle=90, title='Proportion of actors per industries', label='')


### Hollywoodian analysis



Let's narrow down our selection to Hollywood (American) characters, meaning the characters that appear in American movies.

In [None]:
characters_holywood = unique_characters_movies.copy()[unique_characters_movies["Movie countries"].apply(lambda x: 'United States of America' in x)]

By grouping the data by ethnicity, let's analyze the main ethnicities of Hollywood actors.

In [None]:
H_grouped_ethnicity = characters_holywood.groupby(['Ethnicity']).count()
# Let's take the 10 main actor ethnicities
main_ethinicities = H_grouped_ethnicity['Freebase actor ID'].sort_values(ascending=False)[0:10].index
main_characters_holywood = characters_holywood[characters_holywood['Ethnicity'].isin(main_ethinicities)]
main_ethinicities

Now let's visualize the ethnicity distribution of Hollywood actors.

In [None]:
main_holywoodian_ethnicities =  H_grouped_ethnicity.loc[main_ethinicities]['Freebase actor ID']
main_holywoodian_ethnicities.plot(kind='bar')
plt.title("Distribution of the main Hollywood actor ethnicities")
plt.xlabel("Ethnicities")
plt.ylabel("Number of actors")

Let's analyze how the ethnicity distribution evolves over time.

In [None]:
plt.figure(figsize=(20,10))
sns.histplot(data=main_characters_holywood,x='Movie release year',hue='Ethnicity',multiple='fill',stat='probability')
plt.title("Distribution of ethnicity of actors in the cinema industry over time",fontsize=20)
plt.xticks(np.arange(main_characters_holywood['Movie release year'].min(),main_characters_holywood['Movie release year'].max(),10))
plt.ylabel('Proportion of actors')


In [None]:
plt.figure(figsize=(20,10))
sns.kdeplot(data=main_characters_holywood,x='Movie release year',hue='Ethnicity',multiple='fill')
plt.title("Distribution of ethnicity of actors in the cinema industry over time",fontsize=20)
plt.xticks(np.arange(main_characters_holywood['Movie release year'].min(),main_characters_holywood['Movie release year'].max(),10))
plt.ylabel('Proportion of actors')


In [None]:
# Same plot but with a different orientation, for better visibility
plt.figure(figsize=(20,10))
sns.histplot(data=main_characters_holywood, y='Movie release year', hue='Ethnicity', multiple='fill', stat='probability')
plt.title("Distribution of ethnicity of actors over time",fontsize=20)
plt.yticks(np.arange(main_characters_holywood['Movie release year'].min(), main_characters_holywood['Movie release year'].max(), 10))
plt.xlabel('Proportion')
plt.ylim(2020, 1900)

In [None]:
map_year_actor = main_characters_holywood.pivot_table(index=['Movie release year'], columns='Ethnicity', values='Freebase character ID', aggfunc='count')
df_year_actor = pd.DataFrame(map_year_actor.values,columns=map_year_actor.columns.values.tolist(),index=map_year_actor.index.values.tolist())
plt.figure(figsize=(20,10))
plt.title("Evolution of actor diversity over time")
sns.heatmap(map_year_actor.transpose(),cmap="rocket_r")

In [None]:
sns.histplot(data= main_characters_holywood ,x='Movie release year',stat='count')
plt.title('Evolution of the number of actors over the years')
plt.ylabel('Number of actors')

Let's determine how the diversity (number of different ethnicities) of a movie cast has evolved over time.

**We could try to define a diversity score depending on different factors**
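
One candidate for such a diversity score is the Shannon entropy of the ethnicity frequencies within a cast. This is only a suggestion, not something the project has settled on; the function name `shannon_diversity` and the toy data are hypothetical. A cast with a single ethnicity scores 0, and the score grows both with the number of ethnicities and with how evenly they are represented.

```python
import numpy as np
import pandas as pd

def shannon_diversity(labels):
    """Shannon entropy (in nats) of the label frequencies; 0 means a single group."""
    proportions = pd.Series(labels).value_counts(normalize=True).to_numpy()
    return float((proportions * np.log(1.0 / proportions)).sum())

# Toy casts: one homogeneous movie and one with four ethnicities.
toy_cast = pd.DataFrame({
    "Freebase Movie ID": ["/m/f1"] * 4 + ["/m/f2"] * 4,
    "Ethnicity": ["A", "A", "A", "A", "A", "B", "C", "D"],
})

scores = toy_cast.groupby("Freebase Movie ID")["Ethnicity"].apply(shannon_diversity)
print(scores)
```

Compared with the ratio of unique ethnicities to cast size, this score does not automatically decrease when a cast gets larger, which may make movies of different sizes easier to compare.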

In [None]:
# For each movie, compute the percentage of distinct ethnicities relative to the number of listed cast members
diversity_per_movie = main_characters_holywood.groupby('Freebase Movie ID')['Ethnicity'].apply(lambda x : (len(np.unique(x))/len(x))*100)
# Turn the result into a dataframe so that it can be merged back on the movie ID
diversity_per_movie = pd.DataFrame({'Freebase Movie ID': diversity_per_movie.index.values,'Unique ethnicity':diversity_per_movie.values })

main_characters_holywood = main_characters_holywood.merge(diversity_per_movie,how='left',on='Freebase Movie ID')

In [None]:
main_characters_holywood.groupby('Movie release year')['Unique ethnicity'].mean().plot()
plt.title('Mean proportion of diversity in a movie cast over time')
plt.xlabel('Years')
plt.ylabel('Mean proportion of diversity in a movie cast')

## Gender analysis

Now, let's create a pyramid plot to compare the number of male and female actors for each ethnicity.

In [None]:
H_ethinity_M = main_characters_holywood[main_characters_holywood['Gender'] == 'M'].groupby(['Ethnicity'])['Freebase actor ID'].count().sort_values()
H_ethinity_M_reversed = H_ethinity_M*(-1)
H_ethinity_F = main_characters_holywood[main_characters_holywood['Gender'] == 'F'].groupby(['Ethnicity'])['Freebase actor ID'].count().sort_values()

In [None]:
plt.barh(y=H_ethinity_M.index, width=H_ethinity_M.values, left=H_ethinity_M_reversed.values, color="#4682b4", label="Male")
plt.barh(y=H_ethinity_F.index, width=H_ethinity_F.values, 
         color="#ee7a87", label="Female")
plt.legend()
plt.title("Sex distribution depending on the ethnicity over all time")

In [None]:
proportion_ethinity_M = main_characters_holywood[main_characters_holywood['Gender'] == 'M'].groupby(['Ethnicity'])['Freebase actor ID'].apply(lambda x: len(x)/(len(main_characters_holywood['Gender'])))
proportion_ethinity_F = main_characters_holywood[main_characters_holywood['Gender'] == 'F'].groupby(['Ethnicity'])['Freebase actor ID'].apply(lambda x: len(x)/(len(main_characters_holywood['Gender'])))


*What's the best way of plotting?*

In [None]:
plt.bar(x=proportion_ethinity_M.index, height=proportion_ethinity_M.values, color="#4682b4", label="Male")
plt.bar(x=proportion_ethinity_F.index, height=proportion_ethinity_F.values, 
         color="#ee7a87", label="Female")
plt.legend()
plt.title("Sex distribution depending on the ethnicity over all time")
plt.xticks(rotation=90)
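
Regarding the plotting question above, one option is to build a table of gender shares with `pd.crosstab` and plot it as grouped bars, so that one gender does not hide the other. The sketch below uses made-up data. Note that it normalizes within each ethnicity, which differs from the cell above, where the counts are divided by the total number of rows.

```python
import pandas as pd

# Toy data: ethnicity and gender of a handful of actors.
toy = pd.DataFrame({
    "Ethnicity": ["A", "A", "A", "B", "B"],
    "Gender": ["M", "F", "M", "F", "F"],
})

# Share of each gender within every ethnicity (each row sums to 1).
gender_share = pd.crosstab(toy["Ethnicity"], toy["Gender"], normalize="index")
print(gender_share)
```

Calling `gender_share.plot(kind="bar")` on this table would draw the female and male bars side by side for each ethnicity.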

In [None]:
# Let's group by ethnicity and year, and separate the genders.
H_ethinity_M_year = main_characters_holywood[main_characters_holywood['Gender'] == 'M'].groupby(['Ethnicity', 'Movie release year'])['Freebase actor ID'].count().sort_values()

H_ethinity_F_year = main_characters_holywood[main_characters_holywood['Gender'] == 'F'].groupby(['Ethnicity', 'Movie release year'])['Freebase actor ID'].count().sort_values()

In [None]:
ethnicities_M, years_M = zip(*H_ethinity_M_year.index)
ethnicities_F, years_F = zip(*H_ethinity_F_year.index)

fig, ax = plt.subplots(1, 2, sharey=True, figsize=(20, 20))
plt.subplots_adjust(wspace=0)

# Plot for men
sns.histplot(ax=ax[0], y=list(years_M), weights=np.abs(H_ethinity_M_year.values), hue=list(ethnicities_M), multiple='stack', stat='count', binwidth=1)

ax[0].set_xlabel("Count (Men)")
ax[0].invert_xaxis()# Flip the x-axis
ax[0].set_ylabel("Movie release year")
ax[0].set_ylim(2020, 1900)  # Flip the y-axis
ax[0].set_xlim(150, 0)
ax[0].set_title('Male Distribution')

# Plot for women
sns.histplot(ax=ax[1], y=list(years_F), weights=np.abs(H_ethinity_F_year.values), hue=list(ethnicities_F), multiple='stack', stat='count', binwidth=1, legend= False)
  
ax[1].set_xlabel("Count (Women)")
ax[1].set_ylim(2020, 1900)  # Flip the y-axis
ax[1].set_xlim(0, 150)
ax[1].set_title('Female Distribution')

plt.suptitle("Sex distribution depending on the ethnicity over all time")
plt.show()

In [None]:
fig, ax = plt.subplots(3, 1, figsize=(10, 10))
flierprops = dict(marker='o', markerfacecolor='black', markersize=2)

sns.boxplot(data=main_characters_holywood, x="Age at movie release", y="Ethnicity", 
            ax=ax[0], flierprops=flierprops)
ax[0].set_title('Women and Men')
ax[0].set_xlabel('Age at movie release')

sns.boxplot(data=main_characters_holywood[main_characters_holywood['Gender'] == 'M'], 
            x="Age at movie release", y="Ethnicity", ax=ax[1], flierprops=flierprops)
ax[1].set_title('Men only')
ax[1].set_xlabel('Age at movie release')

sns.boxplot(data=main_characters_holywood[main_characters_holywood['Gender'] == 'F'], 
            x="Age at movie release", y="Ethnicity", ax=ax[2], flierprops=flierprops)
ax[2].set_title('Women only')
ax[2].set_xlabel('Age at movie release')
fig.tight_layout()

plt.suptitle('', fontsize=20)

### Movie revenue depending on the ethnicity proportion

In [None]:
# Group by movie ID rather than by name, so that different movies sharing a title are not mixed together
movie_revenues = main_characters_holywood.groupby('Freebase Movie ID').agg(**{
        'Number of ethnicities': ('Ethnicity', 'nunique'),
        'Revenue': ('Box office revenue', 'first')
    })
movie_revenues = movie_revenues.dropna()

In [None]:
sns.barplot(data=movie_revenues, x='Number of ethnicities', y='Revenue')

## Role analysis

### Unnamed actors

In [None]:
proportion_unnamed_actors = main_characters_holywood.copy().groupby('Ethnicity').apply(lambda x: pd.Series({
        'Unnamed characters': x['Character name'].isna().sum()*100/len(x['Character name'])
    })).sort_values(ascending=False,by = 'Unnamed characters')
proportion_unnamed_actors.plot(kind='bar')
plt.title("Proportion of actors with unnamed roles by ethnicity")
plt.xlabel('Ethnicities')
plt.ylabel('Percentage of unnamed actors')
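
The percentage of unnamed roles per ethnicity can also be obtained without `apply`, by averaging a boolean mask within each group. This is a sketch on toy data with hypothetical values; the mean of a boolean column is the fraction of `True` entries.

```python
import pandas as pd

# Toy data: a missing character name stands for an unnamed role.
toy = pd.DataFrame({
    "Ethnicity": ["A", "A", "B", "B", "B", "B"],
    "Character name": ["Hero", None, None, None, "Villain", None],
})

unnamed_pct = (
    toy["Character name"].isna()   # True where the role has no name
    .groupby(toy["Ethnicity"])     # group the mask by ethnicity
    .mean()                        # fraction of unnamed roles per group
    .mul(100)                      # express it as a percentage
    .sort_values(ascending=False)
)
print(unnamed_pct)
```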

Now let's analyze how the unnamed roles are distributed among ethnicities over time.

In [None]:
unname_actors = main_characters_holywood[main_characters_holywood['Character name'].isna()]

In [None]:
plt.figure(figsize=(20,10))
sns.kdeplot(data=unname_actors,x='Movie release year',hue='Ethnicity',multiple='fill')
plt.title("Distribution of ethnicity of actors with unnamed roles over time",fontsize=20)
plt.xticks(np.arange(unname_actors['Movie release year'].min(),unname_actors['Movie release year'].max(),10))
plt.ylabel('Proportion of actors')


In [None]:
plt.figure(figsize=(20,10))
sns.histplot(data=unname_actors,x='Movie release year',hue='Ethnicity')
plt.title("Distribution of ethnicity of actors with unnamed roles over time",fontsize=20)
plt.xticks(np.arange(main_characters_holywood['Movie release year'].min(),main_characters_holywood['Movie release year'].max(),10))
plt.ylabel('Number of actors')

**Strange, I would have thought that the number of unnamed characters would have decreased over time, as the recording methods get better and better** -> By the way, this figure is not really useful

## Plot analysis

In [None]:
plot = pd.read_csv(PLOT_PATH, sep='\t',header=None,names=['Wikipedia Movie ID','Plot'])

In [None]:
main_characters_holywood = main_characters_holywood.merge(plot,on='Wikipedia Movie ID',how='left')
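
When attaching the plots with a left merge, it may be worth checking that no character row gets duplicated and counting how many movies have no summary. The sketch below uses toy data (all names and values are hypothetical) and the `validate` option of `merge`, which raises an error if a movie ID appears more than once on the right side.

```python
import pandas as pd

# Toy data: four character rows, but plots for only two of the three movies.
toy_characters = pd.DataFrame({
    "Wikipedia Movie ID": [1, 1, 2, 3],
    "Name": ["A", "B", "C", "D"],
})
toy_plots = pd.DataFrame({
    "Wikipedia Movie ID": [1, 2],
    "Plot": ["Summary of movie 1", "Summary of movie 2"],
})

# validate='many_to_one' checks that each movie ID is unique in toy_plots.
merged = toy_characters.merge(
    toy_plots, on="Wikipedia Movie ID", how="left", validate="many_to_one"
)

# A left merge keeps every character row; missing plots show up as NaN.
missing_plots = int(merged["Plot"].isna().sum())
print(len(merged), missing_plots)
```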