## Idea 3. Towards inclusivity
Society is becoming more inclusive towards minorities, and gender equality is gaining ground. Is the movie industry keeping up?

The trend towards inclusivity can be examined from several angles.

1. Sorting the movies chronologically and by country, and analyzing the proportions of races, genders and ages among their characters.

2. Since we have the movie descriptions as well as the movie characters, we could identify the protagonist (perhaps the character who is mentioned most often in the movie description) and see how protagonists are distributed across race, gender and age in each country, and how this evolves over time.

3. We could further split the characters into good and bad using NLP and classification. It would be interesting to see whether there is any relationship between gender/race/age and being a “good” or “bad” character.
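
The protagonist heuristic from item 2 could be sketched roughly as follows. This is only an illustration under assumptions: the function name `guess_protagonist` is hypothetical, the movie description is assumed to be available as a plain string, and the character names as a list of strings. Counting mentions of the full name or of its first token is a simplification and will miss nicknames and pronouns.

```python
import re
from collections import Counter


def guess_protagonist(summary, character_names):
    """Return the character mentioned most often in the summary, or None.

    Hypothetical helper: a mention is a whole-word, case-insensitive match
    of the full character name or of its first token.
    """
    counts = Counter()
    for name in character_names:
        # Skip missing or empty names (character_name is often null).
        if not isinstance(name, str) or not name.strip():
            continue
        first_token = name.split()[0]
        # Try the longer variant first so "Luke Skywalker" counts once,
        # not once for the full name and again for "Luke".
        variants = sorted({name, first_token}, key=len, reverse=True)
        pattern = r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
        counts[name] += len(re.findall(pattern, summary, flags=re.IGNORECASE))
    if not counts or max(counts.values()) == 0:
        return None
    return counts.most_common(1)[0][0]


example_summary = (
    "Luke leaves Tatooine with Obi-Wan. Luke Skywalker later confronts "
    "Darth Vader, and Luke destroys the station."
)
example_names = ["Luke Skywalker", "Darth Vader", "Obi-Wan Kenobi"]
protagonist = guess_protagonist(example_summary, example_names)
```

Ties are resolved in favour of the name that comes first in the list, so the result should be treated as a rough guess rather than a ground truth.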


### Data we need
#### About movies
1. Movies
2. Countries
3. Release dates

#### About characters
1. Age
2. Race
3. Gender

### Enhancing the data

In [1]:
import pandas as pd

In [2]:
pickle_folder = '../../pickles'

cmu_movies = pd.read_pickle(f'{pickle_folder}/cmu_movies_df.pkl')
imbd_data = pd.read_pickle(f'{pickle_folder}/imdb_data.pkl')
character_metadata = pd.read_pickle(f'{pickle_folder}/character_metadata.pkl')
metacritic = pd.read_pickle(f'{pickle_folder}/metacritic_df.pkl')
oskar = pd.read_pickle(f'{pickle_folder}/oskar_df.pkl')

In [3]:
cmu_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81741 entries, 0 to 81740
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Wikipedia Movie ID    81741 non-null  int64  
 1   Freebase Movie ID     81741 non-null  object 
 2   Movie Name            81741 non-null  object 
 3   Release Date          74839 non-null  object 
 4   Box Office Revenue    8401 non-null   float64
 5   Runtime               61291 non-null  float64
 6   Language Freebase ID  81741 non-null  object 
 7   Language Name         81741 non-null  object 
 8   Country Freebase ID   81741 non-null  object 
 9   Country Name          81741 non-null  object 
 10  Genre Freebase ID     81741 non-null  object 
 11  Genre Name            81741 non-null  object 
dtypes: float64(2), int64(1), object(9)
memory usage: 7.5+ MB


In [22]:
# Keep only complete movie entries (name, country and release date present).
# A single dropna avoids filtering movies_cleaned with masks built from cmu_movies.
movies_cleaned = cmu_movies.dropna(subset=['Movie Name', 'Country Name', 'Release Date'])

movies_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 74839 entries, 0 to 81740
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Wikipedia Movie ID    74839 non-null  int64  
 1   Freebase Movie ID     74839 non-null  object 
 2   Movie Name            74839 non-null  object 
 3   Release Date          74839 non-null  object 
 4   Box Office Revenue    8328 non-null   float64
 5   Runtime               58631 non-null  float64
 6   Language Freebase ID  74839 non-null  object 
 7   Language Name         74839 non-null  object 
 8   Country Freebase ID   74839 non-null  object 
 9   Country Name          74839 non-null  object 
 10  Genre Freebase ID     74839 non-null  object 
 11  Genre Name            74839 non-null  object 
dtypes: float64(2), int64(1), object(9)
memory usage: 7.4+ MB
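
Since `Release Date` is stored as `object`, sorting chronologically needs a numeric year first. Below is a minimal sketch, assuming the dates are strings that start with a four-digit year (for example `2001-08-24` or just `1988`); the helper name `add_release_year` and the new columns `Release Year` and `Decade` are hypothetical and not part of the dataset.

```python
import pandas as pd


def add_release_year(movies, date_col="Release Date"):
    """Return a copy of `movies` with 'Release Year' and 'Decade' columns.

    Assumption: each date is a string beginning with a four-digit year.
    Values that do not match end up as <NA>.
    """
    out = movies.copy()
    # Take the leading four digits as the year; non-matching values become NaN.
    year = out[date_col].astype(str).str.extract(r"^(\d{4})", expand=False)
    out["Release Year"] = pd.to_numeric(year, errors="coerce").astype("Int64")
    # Floor the year to its decade, e.g. 1995 -> 1990.
    out["Decade"] = (out["Release Year"] // 10) * 10
    return out


demo_movies = pd.DataFrame({"Release Date": ["2001-08-24", "1988", None, "1995-07"]})
demo_with_year = add_release_year(demo_movies)
```

Applied to the real data, this would be called as `add_release_year(movies_cleaned)`, after which the frame can be grouped by `Decade` and `Country Name`.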


In [23]:
# Characters
character_metadata.info()

characters_gender = character_metadata[character_metadata['actor_gender'].notnull()]
characters_age = character_metadata[character_metadata['actor_age_at_release'].notnull()]
characters_race = character_metadata[character_metadata['actor_ethnicity_id'].notnull()]




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450669 entries, 0 to 450668
Data columns (total 13 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   wikipedia_movie_id      450669 non-null  int64  
 1   freebase_movie_id       450669 non-null  object 
 2   movie_release_date      440674 non-null  object 
 3   character_name          192794 non-null  object 
 4   actor_dob               344524 non-null  object 
 5   actor_gender            405060 non-null  object 
 6   actor_height            154824 non-null  float64
 7   actor_ethnicity_id      106058 non-null  object 
 8   actor_name              449441 non-null  object 
 9   actor_age_at_release    292556 non-null  float64
 10  character_actor_map_id  450669 non-null  object 
 11  character_id            192804 non-null  object 
 12  actor_id                449854 non-null  object 
dtypes: float64(2), int64(1), object(10)
memory usage: 44.7+ MB
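
The non-null counts above show that the three attributes are covered very unevenly: about 90% of the rows have a gender, 65% an age at release and 24% an ethnicity ID. A small sketch for computing these shares directly is given below; the helper name `attribute_coverage` is hypothetical, and the column names are taken from the `info()` output.

```python
import pandas as pd


def attribute_coverage(characters,
                       columns=("actor_gender", "actor_age_at_release", "actor_ethnicity_id")):
    """Return the share of non-null values per column, highest first."""
    return characters[list(columns)].notna().mean().sort_values(ascending=False)


# Tiny made-up frame, used only to illustrate the call.
demo_characters = pd.DataFrame({
    "actor_gender": ["F", "M", None, "M"],
    "actor_age_at_release": [30.0, None, None, 45.0],
    "actor_ethnicity_id": [None, None, None, "id_1"],
})
demo_coverage = attribute_coverage(demo_characters)
```

On the real data the call would be `attribute_coverage(character_metadata)`. The low ethnicity coverage suggests that any race-related result should state how many characters it is based on.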


In [None]:
# TODO: join characters_gender with movies_cleaned on the Wikipedia movie ID
# and plot the gender distribution with time (release year) on the x-axis

## TODO
- ratio of men to women over time (e.g. histogram per decade)
- join with the IMDb, Metacritic and Rotten Tomatoes data
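
The first TODO item could start from something like the sketch below, which computes the share of female characters per decade. It is a sketch under assumptions: `actor_gender` is assumed to be coded as `"F"` / `"M"`, `movie_release_date` is assumed to be a string starting with a four-digit year, and the helper name `female_share_by_decade` is hypothetical. A share is used instead of a men-to-women ratio because it stays defined when a decade contains no women.

```python
import pandas as pd


def female_share_by_decade(characters):
    """Return the fraction of characters with actor_gender == 'F' per decade.

    Assumptions: gender is coded 'F'/'M'; release dates are strings that
    begin with a four-digit year. Rows missing either value are dropped.
    """
    df = characters.dropna(subset=["actor_gender", "movie_release_date"]).copy()
    year = pd.to_numeric(
        df["movie_release_date"].astype(str).str.extract(r"^(\d{4})", expand=False),
        errors="coerce",
    )
    df["decade"] = (year // 10) * 10
    # Drop rows whose date did not start with a year, then use integer decades.
    df = df.dropna(subset=["decade"])
    df["decade"] = df["decade"].astype(int)
    return df["actor_gender"].eq("F").groupby(df["decade"]).mean()


# Tiny made-up frame, used only to illustrate the call.
demo_cast = pd.DataFrame({
    "actor_gender": ["F", "M", "M", "F", "F", None],
    "movie_release_date": ["1991-05-01", "1995", "1999-12", "2003-01-01", "2008", "2001"],
})
demo_share = female_share_by_decade(demo_cast)
```

On the real data, `female_share_by_decade(characters_gender).plot(kind="bar")` would give one bar per decade, provided matplotlib is installed.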