# DRACO - Milestone 2: Dataset exploration

This document is structured as follows:

1. Characters Data - Extraction and Processing
2. Movie Data - Extraction and Processing
3. Actors' Ethnicities - Exploration

---

In [379]:
import pandas as pd
import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [380]:
DATA_FOLDER = './Data/'

CHARACTER_PATH = DATA_FOLDER + 'MovieSummaries/character.metadata.tsv'
MOVIE_PATH = DATA_FOLDER + 'MovieSummaries/movie.metadata.tsv'
ETHNICITY_PATH = DATA_FOLDER + 'ethnicities_data.tsv'
NAME_PATH = DATA_FOLDER + 'MovieSummaries/name.clusters.txt'
PLOT_PATH = DATA_FOLDER + 'MovieSummaries/plot_summaries.txt'

## Importing the datasets

### Character data

In [381]:
characters_original = pd.read_csv(CHARACTER_PATH, sep='\t', header=None, 
    names = ["Wikipedia Movie ID", "Freebase Movie ID", "Movie release date", "Character name", "Birth", 
    "Gender", "Height", "Ethnicity ID", "Name", "Age at movie release",
    "Freebase character/actor map ID", "Freebase character ID", "Freebase actor ID"])

### Movie data

In [382]:
movies_original = pd.read_csv(MOVIE_PATH, sep='\t', header=None, 
    names = ["Wikipedia Movie ID", "Freebase Movie ID", "Movie name", "Movie release date",
    "Box office revenue", "Movie runtime", "Movie language", "Movie countries", "Movie genres"])

### Ethnicity data

In [383]:
ethnicities_original = pd.read_csv(ETHNICITY_PATH, sep='\t',  
                               header=0, names=["Ethnicity ID", "Ethnicity"])

## Preprocessing of the Data

### Cleaning the data

#### Character data

Let's see what the character dataset looks like.

In [384]:
characters_original.head()

Unnamed: 0,Wikipedia Movie ID,Freebase Movie ID,Movie release date,Character name,Birth,Gender,Height,Ethnicity ID,Name,Age at movie release,Freebase character/actor map ID,Freebase character ID,Freebase actor ID
0,975900,/m/03vyhn,2001-08-24,Akooshay,1958-08-26,F,1.62,,Wanda De Jesus,42.0,/m/0bgchxw,/m/0bgcj3x,/m/03wcfv7
1,975900,/m/03vyhn,2001-08-24,Lieutenant Melanie Ballard,1974-08-15,F,1.78,/m/044038p,Natasha Henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4
2,975900,/m/03vyhn,2001-08-24,Desolation Williams,1969-06-15,M,1.727,/m/0x67,Ice Cube,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l
3,975900,/m/03vyhn,2001-08-24,Sgt Jericho Butler,1967-09-12,M,1.75,,Jason Statham,33.0,/m/02vchl6,/m/0bgchnq,/m/034hyc
4,975900,/m/03vyhn,2001-08-24,Bashira Kincaid,1977-09-25,F,1.65,,Clea DuVall,23.0,/m/02vbb3r,/m/0bgchp9,/m/01y9xg


In [385]:
characters_original.describe()

Unnamed: 0,Wikipedia Movie ID,Height,Age at movie release
count,450669.0,154824.0,292556.0
mean,13969750.0,1.788893,37.788523
std,10796620.0,4.37994,20.58787
min,330.0,0.61,-7896.0
25%,3759292.0,1.6764,28.0
50%,11890650.0,1.75,36.0
75%,23665010.0,1.83,47.0
max,37501920.0,510.0,103.0


As we can see, there are issues in the dataset:
- First, some ethnicities and genders are not specified.
- Second, the age at movie release spans from -7896 to 103 years.

For our analysis, we have decided to discard all characters that lack a specified ethnicity or a specified gender. Moreover, only strictly positive ages are taken into account. Also, since we are not interested in the height and name of the actor, and since the column `Freebase Movie ID` is redundant with `Wikipedia Movie ID`, we have decided to drop these columns.

In [386]:
characters = characters_original.copy()
characters = characters.drop(['Freebase Movie ID','Height','Name'], axis=1)

In [387]:
characters = characters[characters['Ethnicity ID'].notna()]
characters = characters[characters['Gender'].notna()]
characters = characters[characters['Age at movie release'] > 0]
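To see how much data each criterion costs, the three filters can be applied one at a time while recording the number of surviving rows. The following is a minimal sketch on a toy dataframe; `toy` and `report_filter_steps` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

def report_filter_steps(df):
    """Apply the three character filters in order and record the row count after each step."""
    steps = [
        ("ethnicity specified", lambda d: d["Ethnicity ID"].notna()),
        ("gender specified", lambda d: d["Gender"].notna()),
        ("age > 0", lambda d: d["Age at movie release"] > 0),
    ]
    counts = {"initial": len(df)}
    for label, mask in steps:
        df = df[mask(df)]
        counts[label] = len(df)
    return df, counts

# Toy data (invented for the example): one valid row and one row failing each criterion
toy = pd.DataFrame({
    "Ethnicity ID": ["/m/0x67", None, "/m/0x67", "/m/041rx"],
    "Gender": ["M", "F", None, "F"],
    "Age at movie release": [32.0, 27.0, 40.0, -7896.0],
})
filtered, counts = report_filter_steps(toy)
print(counts)
```

Calling the same function on `characters_original` would give the per-step counts for the real data.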

In [388]:
characters.head()

Unnamed: 0,Wikipedia Movie ID,Movie release date,Character name,Birth,Gender,Ethnicity ID,Age at movie release,Freebase character/actor map ID,Freebase character ID,Freebase actor ID
1,975900,2001-08-24,Lieutenant Melanie Ballard,1974-08-15,F,/m/044038p,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4
2,975900,2001-08-24,Desolation Williams,1969-06-15,M,/m/0x67,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l
5,975900,2001-08-24,Commander Helena Braddock,1949-05-26,F,/m/0x67,52.0,/m/02vdcfp,/m/0bgchnd,/m/0418ft
11,975900,2001-08-24,Tres,1959-03-09,M,/m/064b9n,42.0,/m/0bgchrs,/m/0bgchrw,/m/03ydsb
27,3196793,2000-02-16,,1937-11-10,M,/m/0x67,62.0,/m/0lr37dy,,/m/01lntp


#### Movie data

Let's see what the movie dataset looks like.

In [389]:
movies_original.head()

Unnamed: 0,Wikipedia Movie ID,Freebase Movie ID,Movie name,Movie release date,Box office revenue,Movie runtime,Movie language,Movie countries,Movie genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"


In [390]:
movies_original.describe()

Unnamed: 0,Wikipedia Movie ID,Box office revenue,Movie runtime
count,81741.0,8401.0,61291.0
mean,17407840.0,47993630.0,111.8192
std,10987910.0,112175300.0,4360.07
min,330.0,10000.0,0.0
25%,7323695.0,2083193.0,81.0
50%,17778990.0,10639690.0,93.0
75%,27155730.0,40716960.0,106.0
max,37501920.0,2782275000.0,1079281.0


As we can see, there are issues in the dataset:
- First, the box office revenue of some movies is not specified.
- Second, the movie runtime spans from zero minutes to more than one million minutes.

For our analysis, we have decided to discard all movies that don't have a specified box office revenue and all movies that are shorter than ten minutes. Also, since we are not interested in the movie language, and since the column `Freebase Movie ID` is redundant with `Wikipedia Movie ID`, we have decided to drop these two columns.

In [391]:
movies = movies_original.copy()
movies = movies.drop(['Freebase Movie ID','Movie language'], axis=1)

We can observe that the columns `Movie countries` and `Movie genres` contain dictionaries. In our case, it would be much more convenient to have lists instead. Let's process it accordingly.

In [392]:
# Keep only the names (the dictionary values); an empty dictionary becomes an empty list,
# so that every cell is a list and later set/list operations behave consistently
movies["Movie countries"] = movies["Movie countries"].apply(lambda x: list(json.loads(x).values()))
movies["Movie genres"] = movies["Movie genres"].apply(lambda x: list(json.loads(x).values()))
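The conversion can be checked on a single cell value. This is a small sketch, assuming each cell holds a JSON object string as in the output of `movies_original.head()`; the helper name `dict_values_as_list` is hypothetical, and returning an empty list for an empty dictionary is a choice made for this sketch.

```python
import json

def dict_values_as_list(raw):
    """Parse a JSON object string and return its values as a list."""
    return list(json.loads(raw).values())

raw_countries = '{"/m/09c7w0": "United States of America"}'
raw_empty = '{}'

print(dict_values_as_list(raw_countries))  # ['United States of America']
print(dict_values_as_list(raw_empty))      # []
```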

As the dataset was released in 2013, the data for that year are incomplete and should therefore be removed. To simplify this step, we add a column for the release year alongside the release date.

In [393]:
# errors='coerce' turns the values that are outside the supported bounds (or cannot be parsed) into NaT
movies["Movie release year"] = pd.to_datetime(movies["Movie release date"], format='mixed', errors='coerce').dt.year
# Remove the rows whose release year is missing (NaN); copy to avoid modifying a view below
movies = movies[movies["Movie release year"].notna()].copy()
# Express all release years as int
movies["Movie release year"] = movies["Movie release year"].astype("int")
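The idea of coercing unusable dates to a missing value can also be illustrated per value, without pandas. This is only a sketch under stated assumptions: `release_year` and `FORMATS` are hypothetical names, the three date layouts are assumed (the sample above shows only full dates and bare years), and this version does not apply the timestamp bounds mentioned in the comment of the cell above.

```python
from datetime import datetime

# Assumed date layouts, tried from most to least specific
FORMATS = ("%Y-%m-%d", "%Y-%m", "%Y")

def release_year(value):
    """Return the year as an int, or None when the value matches none of the assumed layouts."""
    if not isinstance(value, str):
        return None
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).year
        except ValueError:
            continue
    return None

samples = ["2001-08-24", "1988", "1987-05", None, "unknown"]
years = [release_year(v) for v in samples]
print(years)
```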

In [394]:
movies = movies[movies['Movie release year'] < 2013]
movies = movies[movies['Box office revenue'].notna()]
movies = movies[movies['Movie runtime'] >= 10]

We should now remove the movies that belong to animation genres, since the people credited in them are voice actors rather than on-screen actors. In this domain, the appearance and the ethnicity of the actors are generally less relevant than in live-action movies.

In [395]:
# First let's see all the unique genres, to discard the animation ones:
unique_genres = set()
movies['Movie genres'].apply(lambda x: unique_genres.update(x))
print(unique_genres)

{'Northern', 'Creature Film', 'Legal drama', 'Epic', 'Adventure Comedy', 'Sci-Fi Adventure', 'Addiction Drama', 'Fantasy Adventure', 'Musical Drama', 'Tamil cinema', 'Roadshow theatrical release', 'Animals', 'Extreme Sports', 'Biker Film', 'Steampunk', 'Comedy', 'Reboot', 'Swashbuckler films', 'Cult', 'Baseball', 'Political drama', 'Political thriller', 'Thriller', 'Crime Drama', 'Film à clef', 'Ensemble Film', 'Existentialism', 'Mystery', 'Demonic child', 'Instrumental Music', 'Detective', 'Inventions & Innovations', 'Erotica', 'Action Thrillers', 'Alien Film', "Children's/Family", 'Jungle Film', 'Camp', 'Revisionist Fairy Tale', 'Sci-Fi Horror', 'Musical comedy', 'Supermarionation', 'Stop motion', 'Natural horror films', 'Detective fiction', 'Revisionist Western', 'Doomsday film', 'Experimental film', 'Parody', 'Caper story', 'Nature', 'Movies About Gladiators', 'Inspirational Drama', 'Werewolf fiction', 'Cavalry Film', 'Archives and records', 'Family Drama', 'Future noir', 'Dogme 95

In [396]:
values_to_find = ['Anime', 'Animation', 'Computer Animation', 'Clay animation', 'Animated cartoon','Stop motion']
movies = movies[movies['Movie genres'].apply(lambda x: not(any(value in x for value in values_to_find)))]
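An equivalent way to express this test is with a set: a movie is kept when its genre list shares no element with the animation genres. The sketch below uses the same six genre names as the cell above; `ANIMATION_GENRES` and `is_live_action` are hypothetical names.

```python
ANIMATION_GENRES = {'Anime', 'Animation', 'Computer Animation',
                    'Clay animation', 'Animated cartoon', 'Stop motion'}

def is_live_action(genres):
    """Return True when none of the movie's genres is an animation genre."""
    return ANIMATION_GENRES.isdisjoint(genres)

print(is_live_action(['Thriller', 'Science Fiction']))  # True
print(is_live_action(['Comedy', 'Animation']))          # False
```

The function could be passed directly to `movies['Movie genres'].apply(...)` to build the same boolean mask.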

In [397]:
movies.head()

Unnamed: 0,Wikipedia Movie ID,Movie name,Movie release date,Box office revenue,Movie runtime,Movie countries,Movie genres,Movie release year
0,975900,Ghosts of Mars,2001-08-24,14010832.0,98.0,[United States of America],"[Thriller, Science Fiction, Horror, Adventure,...",2001
7,10408933,Alexander's Ragtime Band,1938-08-16,3600000.0,106.0,[United States of America],"[Musical, Comedy, Black-and-white]",1938
13,171005,Henry V,1989-11-08,10161099.0,137.0,[United Kingdom],"[Costume drama, War film, Epic, Period piece, ...",1989
17,77856,Mary Poppins,1964-08-27,102272727.0,139.0,[United States of America],"[Children's/Family, Musical, Fantasy, Comedy, ...",1964
21,612710,New Rose Hotel,1999-10-01,21521.0,92.0,[United States of America],"[Thriller, Science Fiction, Future noir, Indie...",1999


#### Ethnicity data

Let's see what the ethnicity dataset looks like.

In [398]:
ethnicities_original.head()

Unnamed: 0,Ethnicity ID,Ethnicity
0,/m/044038p,
1,/m/0x67,African Americans
2,/m/064b9n,Omaha people
3,/m/041rx,Jewish people
4,/m/033tf_,Irish Americans


Here again, some ethnicities have NaN values; we can drop them now.

In [399]:
ethnicities = ethnicities_original.copy()
ethnicities = ethnicities[ethnicities['Ethnicity'].notna()]

In [400]:
ethnicities.head()

Unnamed: 0,Ethnicity ID,Ethnicity
1,/m/0x67,African Americans
2,/m/064b9n,Omaha people
3,/m/041rx,Jewish people
4,/m/033tf_,Irish Americans
5,/m/04gfy7,Indian Americans


### Merging dataframes

Now that all the datasets are cleaned, we can merge them into one big dataframe.

In [401]:
characters_extended = characters.copy().merge(ethnicities, how='inner', on='Ethnicity ID')
characters_extended.head()

Unnamed: 0,Wikipedia Movie ID,Movie release date,Character name,Birth,Gender,Ethnicity ID,Age at movie release,Freebase character/actor map ID,Freebase character ID,Freebase actor ID,Ethnicity
0,975900,2001-08-24,Desolation Williams,1969-06-15,M,/m/0x67,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l,African Americans
1,975900,2001-08-24,Commander Helena Braddock,1949-05-26,F,/m/0x67,52.0,/m/02vdcfp,/m/0bgchnd,/m/0418ft,African Americans
2,3196793,2000-02-16,,1937-11-10,M,/m/0x67,62.0,/m/0lr37dy,,/m/01lntp,African Americans
3,156558,2001-06-27,Yvette,1970-09-11,F,/m/0x67,30.0,/m/0jtx5t,/m/03jnxj_,/m/0blbxk,African Americans
4,156558,2001-06-27,Jody,1978-12-30,M,/m/0x67,22.0,/m/0jtx5h,/m/03jnxf4,/m/01l1b90,African Americans


In [402]:
characters_movies = characters_extended.copy().merge(movies, how='inner', on=['Wikipedia Movie ID','Movie release date'])
characters_movies.head()

Unnamed: 0,Wikipedia Movie ID,Movie release date,Character name,Birth,Gender,Ethnicity ID,Age at movie release,Freebase character/actor map ID,Freebase character ID,Freebase actor ID,Ethnicity,Movie name,Box office revenue,Movie runtime,Movie countries,Movie genres,Movie release year
0,975900,2001-08-24,Desolation Williams,1969-06-15,M,/m/0x67,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l,African Americans,Ghosts of Mars,14010832.0,98.0,[United States of America],"[Thriller, Science Fiction, Horror, Adventure,...",2001
1,975900,2001-08-24,Commander Helena Braddock,1949-05-26,F,/m/0x67,52.0,/m/02vdcfp,/m/0bgchnd,/m/0418ft,African Americans,Ghosts of Mars,14010832.0,98.0,[United States of America],"[Thriller, Science Fiction, Horror, Adventure,...",2001
2,975900,2001-08-24,Tres,1959-03-09,M,/m/064b9n,42.0,/m/0bgchrs,/m/0bgchrw,/m/03ydsb,Omaha people,Ghosts of Mars,14010832.0,98.0,[United States of America],"[Thriller, Science Fiction, Horror, Adventure,...",2001
3,156558,2001-06-27,Yvette,1970-09-11,F,/m/0x67,30.0,/m/0jtx5t,/m/03jnxj_,/m/0blbxk,African Americans,Baby Boy,29381649.0,123.0,[United States of America],"[Crime Fiction, Drama, Coming of age]",2001
4,156558,2001-06-27,Jody,1978-12-30,M,/m/0x67,22.0,/m/0jtx5h,/m/03jnxf4,/m/01l1b90,African Americans,Baby Boy,29381649.0,123.0,[United States of America],"[Crime Fiction, Drama, Coming of age]",2001
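An inner merge drops every character whose movie is absent from the right-hand dataframe, and it would duplicate rows if a movie ID appeared more than once there. The following sketch shows one way to check both on toy data; the dataframes `characters_toy` and `movies_toy` and the last character row are invented for the example, and only the movie ID is used as the key.

```python
import pandas as pd

# Toy stand-ins for the cleaned dataframes (the last character row is invented)
characters_toy = pd.DataFrame({
    "Wikipedia Movie ID": [975900, 975900, 156558, 42],
    "Character name": ["Desolation Williams", "Tres", "Yvette", "Hypothetical character"],
})
movies_toy = pd.DataFrame({
    "Wikipedia Movie ID": [975900, 156558],
    "Movie name": ["Ghosts of Mars", "Baby Boy"],
})

# validate="many_to_one" raises a MergeError if a movie ID is duplicated in movies_toy
merged = characters_toy.merge(movies_toy, how="inner",
                              on="Wikipedia Movie ID", validate="many_to_one")

# A left merge with indicator=True flags the characters that the inner merge drops
audit = characters_toy.merge(movies_toy, how="left",
                             on="Wikipedia Movie ID", indicator=True)
n_dropped = int((audit["_merge"] == "left_only").sum())

print(len(merged), n_dropped)
```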


### Showing the feasibility of the project

In this section, we are only interested in the number of different films, actors and characters we have in the processed dataframe. This is to determine whether the dataframe contains enough data to be analyzed.

In [403]:
nb_movies = len(np.unique(characters_movies['Wikipedia Movie ID']))
print("Number of movies:",nb_movies)

Number of movies: 6503


In [404]:
nb_actors = len(np.unique(characters_movies['Freebase actor ID']))
print("Number of actors:",nb_actors)

Number of actors: 4321


In [405]:
nb_characters = np.shape(characters_movies)[0]
print("Number of characters:",nb_characters)

Number of characters: 27971


### Grouping ethnicities in ethnic groups 

In [406]:
nb_ethnicities = len(np.unique(characters_movies['Ethnicity']))
print('Number of ethnicities:',nb_ethnicities)

Number of ethnicities: 313


In [418]:
print(np.unique(characters_movies['Ethnicity']))

['Acadians' 'African Americans' 'African people'
 'Afro Trinidadians and Tobagonians' 'Afro-Asians' 'Afro-Cuban'
 'Afro-Guyanese' 'Akan people' 'Albanian American' 'Albanians'
 'American Jews' 'Americans' 'Anglo-Celtic Australians'
 'Anglo-Irish people' 'Apache' 'Arab Americans' 'Arab Mexican'
 'Arabs in Bulgaria' 'Argentines' 'Armenian American' 'Armenians'
 'Armenians in Italy' 'Ashkenazi Jews' 'Asian Americans' 'Asian people'
 'Assyrian people' 'Australian American' 'Australians'
 'Austrian Americans' 'Austrians' 'Aymara' 'Bahamian Americans'
 'Baltic Russians' 'Belarusians' 'Belgians' 'Bengali people' 'Berber'
 'Bhutia' 'Black British' 'Black Canadians' 'Black Irish' 'Black people'
 'Blackfoot Confederacy' 'Bolivian American' 'Bosnians'
 'Brazilian Americans' 'British Americans' 'British Asian'
 'British Chinese' 'British Indian people' 'British Jews'
 'British Nigerian' 'British Pakistanis' 'British people'
 'Bulgarian Canadians' 'Bulgarians' 'Bunt' 'Cajun' 'Cambodian Americans'
 

In [426]:
from collections import defaultdict

list_ethnicities = np.unique(characters_movies['Ethnicity'])

# Map a keyword (the first meaningful word of an ethnicity name) to a broader ethnic group
ethnic_groups = {}
ethnic_groups.update(dict.fromkeys(['Indian', 'Pakistani', 'Bangladeshi', 'Chinese', 'Asian', 'Eurasian', 'Filipino'], 'Asian'))
ethnic_groups.update(dict.fromkeys(['Caribbean', 'African', 'Black', 'Afro', 'Cuban'], 'Black, Caribbean or African'))
ethnic_groups.update(dict.fromkeys(['white', 'Anglo', 'Canadian', 'White', 'Australian', 'Argentine', 'Austrian', 'Catalan', 'Danish', 'English', 'German', 'French'], 'White'))
ethnic_groups.update(dict.fromkeys(['Colombian'], 'Mixed or multiple ethnic groups'))
# Any keyword that is not listed above falls back to "Not grouped"
ethnic_groups = defaultdict(lambda: "Not grouped", ethnic_groups)

for ethnicity in list_ethnicities:
    first_word = ethnicity.split()[0]

    # For "British X", use the second word; plain "British people" is counted as White
    if first_word == 'British':
        first_word = ethnicity.split()[1]
        if first_word == 'people':
            first_word = 'White'

    # Keep only the part before a hyphen (e.g. "Afro-Cuban" -> "Afro")
    if '-' in first_word:
        first_word = first_word.split('-')[0]

    # Strip a trailing plural "s" (e.g. "Australians" -> "Australian")
    if first_word[-1] == 's':
        first_word = first_word[:-1]

    print(ethnic_groups[first_word])

Not grouped
Black, Caribbean or African
Black, Caribbean or African
Black, Caribbean or African
Black, Caribbean or African
Black, Caribbean or African
Black, Caribbean or African
Not grouped
Not grouped
Not grouped
Not grouped
Not grouped
White
White
Not grouped
Not grouped
Not grouped
Not grouped
White
Not grouped
Not grouped
Not grouped
Not grouped
Asian
Asian
Not grouped
White
White
White
White
Not grouped
Not grouped
Not grouped
Not grouped
Not grouped
Not grouped
Not grouped
Not grouped
Black, Caribbean or African
Black, Caribbean or African
Black, Caribbean or African
Black, Caribbean or African
Not grouped
Not grouped
Not grouped
Not grouped
Not grouped
Asian
Asian
Asian
Not grouped
Not grouped
Asian
White
Not grouped
Not grouped
Not grouped
Not grouped
Not grouped
White
White
White
White
Not grouped
Not grouped
Not grouped
Not grouped
Asian
Asian
Asian
Asian
Asian
Not grouped
Mixed or multiple ethnic groups
Mixed or multiple ethnic groups
Mixed or multiple ethnic groups
Not gr

As we can see, the dataframe contains a lot of different ethnicities. For our analysis, it could be interesting to group them according to the [UK's list of ethnic groups](https://www.ethnicity-facts-figures.service.gov.uk/style-guide/ethnic-groups).
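To use the grouping in later steps, the keyword lookup can be wrapped in a function that returns the group instead of printing it. The sketch below is not the final mapping: `KEYWORD_TO_GROUP` holds only a subset of the keywords from the cell above, `ethnic_group` is a hypothetical name, and the guard for a bare "British" is an addition of this sketch.

```python
# Partial keyword table, copied from the cell above for illustration only
KEYWORD_TO_GROUP = {}
KEYWORD_TO_GROUP.update(dict.fromkeys(['Indian', 'Pakistani', 'Chinese', 'Asian'], 'Asian'))
KEYWORD_TO_GROUP.update(dict.fromkeys(['African', 'Black', 'Afro'], 'Black, Caribbean or African'))
KEYWORD_TO_GROUP.update(dict.fromkeys(['White', 'Anglo', 'Australian'], 'White'))

def ethnic_group(ethnicity):
    """Return the ethnic group of an ethnicity name, or 'Not grouped' when no keyword matches."""
    words = ethnicity.split()
    keyword = words[0]
    # For "British X", use the second word; plain "British people" is counted as White
    if keyword == 'British' and len(words) > 1:
        keyword = 'White' if words[1] == 'people' else words[1]
    # Keep only the part before a hyphen, then strip a trailing plural "s"
    keyword = keyword.split('-')[0]
    if keyword.endswith('s'):
        keyword = keyword[:-1]
    return KEYWORD_TO_GROUP.get(keyword, 'Not grouped')

for name in ['African Americans', 'British Indian people', 'British people', 'Acadians']:
    print(name, '->', ethnic_group(name))
```

Applied with `characters_movies['Ethnicity'].apply(ethnic_group)`, this would produce one group label per character, which could then be stored in a new column.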