In [2]:
%load_ext autoreload
%autoreload 2
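from helpers import *  # project helper functions used throughout this notebook
import numpy as np
import pandas as pd  # referenced as `pd` in the cells below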

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Historical Timeline
In our analysis, we want to see how the representation of women in American movies may have been affected by historical events. We therefore need a historical timeline. We use the timeline at https://www.history.com/topics/womens-history/womens-history-us-timeline because it covers major events in women's history in the US.

In [2]:
get_history_timeline()

Successfully accessed https://www.history.com/topics/womens-history/womens-history-us-timeline
Timeline saved in DATA/timeline.csv


# Complementing the dataset after 2010
The provided dataset does not contain information on recent movies. We therefore complement it with IMDB data so that our analysis also covers recent years. Two main datasets need to be completed: the movie dataset and the character dataset. To do so, we use the data available at https://datasets.imdbws.com/ and the Cinemagoer library, which retrieves information from IMDB.

### A) Complementing movie data
To complete the movie dataset, the following archives need to be downloaded from https://datasets.imdbws.com/, unzipped and placed in the /DATA folder:
- title.basics.tsv.gz
- title.akas.tsv.gz

However, these files are missing a lot of the information we need for our analysis (plot summaries, countries, languages). We will therefore also use the Cinemagoer library to retrieve this information. To avoid sending too many useless requests to IMDB through Cinemagoer, we first use the title.basics and title.akas datasets to build a list of IDs of the movies we are interested in.

In [3]:
titles_dataset = pd.read_csv('DATA/title.basics.tsv/data.tsv', sep='\t')
movie_IDs = filter_titles_IDs(titles_dataset)



In [4]:
print(len(movie_IDs)) # we still have 204'389

204389
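The helper `filter_titles_IDs` is defined in `helpers` and is not shown in this notebook. As a rough illustration only, the sketch below shows what such a filter could look like on a toy table. The column names `tconst`, `titleType` and `startYear` come from the IMDB `title.basics` file. The function name `filter_title_ids_sketch`, the toy rows and the 2010-2022 year range are assumptions made for this example.

```python
import pandas as pd

# Toy stand-in for title.basics; the real file marks missing values with "\N".
toy_basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003", "tt0000004"],
    "titleType": ["movie", "tvSeries", "movie", "movie"],
    "startYear": ["2015", "2018", "1999", "\\N"],
})

def filter_title_ids_sketch(titles, first_year=2010, last_year=2022):
    """Hypothetical filter: keep IDs of movies released in [first_year, last_year]."""
    # Non-numeric years such as "\N" become NaN and are excluded by `between`.
    years = pd.to_numeric(titles["startYear"], errors="coerce")
    keep = (titles["titleType"] == "movie") & years.between(first_year, last_year)
    return titles.loc[keep, "tconst"]

sketch_ids = filter_title_ids_sketch(toy_basics)
print(sketch_ids.tolist())
```

Only the first toy row passes both conditions; the TV series, the 1999 movie and the movie without a year are dropped.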


Retrieving all of these movies from IMDB would take too much time. The IDs cover movies from many different countries, and we are only interested in American movies. The downloaded datasets do not contain a 'country' field, but the title.akas dataset does contain the original title and the US title of each movie. We use it to find movies whose US title is the same as an original title, which lets us remove movies that are probably not American. Many non-American movies will still remain, but we can filter those out later.

In [5]:
titles_akas_dataset = pd.read_csv('DATA/title.akas.tsv/data.tsv', sep='\t')
original_titles = titles_akas_dataset[titles_akas_dataset['isOriginalTitle']==1]['title']
# keep only the rows whose title matches one of the original titles
titles_akas_dataset_filtered = titles_akas_dataset[titles_akas_dataset['title'].isin(original_titles)]
# among those, keep only the US entries, i.e. movies whose US title matches an original title
titles_akas_dataset_filtered = titles_akas_dataset_filtered[titles_akas_dataset_filtered['region'] == 'US']



In [6]:
# get only IDs that are in both datasets
common_ids = titles_akas_dataset_filtered[titles_akas_dataset_filtered['titleId'].isin(movie_IDs)]
common_ids = common_ids['titleId'].drop_duplicates()
print(len(common_ids)) # 39'595 movies left

39595
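The title filter above checks each US title against the pooled set of all original titles, so a US title can match the original title of a different movie. The sketch below reproduces that behaviour on a toy table and compares it with a stricter per-movie comparison. The stricter variant is an alternative shown for illustration, not the method used in this notebook, and the toy rows are invented.

```python
import pandas as pd

# Toy stand-in for title.akas (columns as in the IMDB file, values invented).
toy_akas = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt2", "tt2", "tt3", "tt3"],
    "title": ["Heat", "Heat", "Amour", "Love", "Love", "Love"],
    "region": ["\\N", "US", "\\N", "US", "\\N", "US"],
    "isOriginalTitle": [1, 0, 1, 0, 1, 0],
})

us_rows = toy_akas[toy_akas["region"] == "US"]

# Pooled comparison: a US title is kept if it equals ANY movie's original title.
original_titles = toy_akas.loc[toy_akas["isOriginalTitle"] == 1, "title"]
global_match = us_rows.loc[us_rows["title"].isin(original_titles), "titleId"]

# Stricter alternative: the US title must equal the original title of the SAME movie.
originals = toy_akas.loc[toy_akas["isOriginalTitle"] == 1, ["titleId", "title"]]
per_movie = us_rows.merge(originals, on=["titleId", "title"])["titleId"]

print(global_match.tolist())
print(per_movie.tolist())
```

In the toy data, `tt2` has the original title "Amour" but is kept by the pooled comparison because its US title "Love" is the original title of `tt3`. The per-movie comparison drops it. Either way, the later country filter remains necessary.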


We will now retrieve information on these 39'595 movies directly from IMDB using Cinemagoer.

!! THE FOLLOWING CELL TAKES A LONG TIME TO RUN !!
Since it has already been run once and the data was saved, there is no need to run it again, so it is commented out.
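For long retrieval jobs like this one, it can help to save progress as you go so that an interrupted run does not start from scratch. The sketch below is one possible pattern and is not how `get_IMDB_movies_data` is implemented (that helper is not shown here). The function `fetch_with_resume`, the stub `fake_fetch` and the two-column layout are hypothetical; in practice `fetch_one` would wrap the actual IMDB lookup.

```python
import csv
import os
import tempfile

def fetch_with_resume(ids, fetch_one, out_path, fieldnames):
    """Append one CSV row per ID, skipping IDs that a previous run already saved."""
    done = set()
    is_new = not os.path.exists(out_path) or os.path.getsize(out_path) == 0
    if not is_new:
        with open(out_path, newline="", encoding="utf-8") as f:
            done = {row["IMDB_ID"] for row in csv.DictReader(f)}
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        for imdb_id in ids:
            if imdb_id in done:
                continue
            writer.writerow(fetch_one(imdb_id))
            f.flush()  # keep progress on disk if the run is interrupted

# Stand-in for the slow network call; it records which IDs were requested.
calls = []
def fake_fetch(imdb_id):
    calls.append(imdb_id)
    return {"IMDB_ID": imdb_id, "name": f"title of {imdb_id}"}

out_path = os.path.join(tempfile.mkdtemp(), "movies.csv")
fetch_with_resume(["tt1", "tt2"], fake_fetch, out_path, ["IMDB_ID", "name"])
fetch_with_resume(["tt1", "tt2", "tt3"], fake_fetch, out_path, ["IMDB_ID", "name"])
print(calls)  # each ID is requested only once across both runs
```

The second call only requests `tt3`, because `tt1` and `tt2` are already present in the saved file.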

In [None]:
# get_IMDB_movies_data(common_ids)

In [8]:
IMDB_movie_data = pd.read_csv('DATA/IMDB_movies_2010-2022.csv')
IMDB_movie_data_filtered = filter_IMDB_movie_dataset(IMDB_movie_data)

In [9]:
IMDB_movie_data_filtered

Unnamed: 0.1,Unnamed: 0,IMDB_ID,name,release_date,languages,countries,genre,plot_summary
0,1,tt0112502,Bigfoot,2017,['English'],['United States'],"['Horror', 'Thriller']","['A story of a man who, after having been thro..."
1,2,tt0172182,Blood Type,2018,['English'],['United States'],"['Comedy', 'Drama', 'Mystery']","['During a frantic police car chase, a fleeing..."
2,3,tt0195933,Mysteries,2019,['English'],['United States'],,
3,4,tt0293429,Mortal Kombat,2021,"['English', 'Japanese', 'Chinese']",['United States'],"['Action', 'Adventure', 'Fantasy', 'Sci-Fi', '...","[""MMA fighter Cole Young seeks out Earth's gre..."
4,5,tt0297400,Snowblind,2015,['English'],"['United States', 'Canada']","['Crime', 'Drama']","['Revealing the entrepreneurial ingenuity, par..."
...,...,...,...,...,...,...,...,...
1021,1756,tt10544094,The Cold,2019,['English'],['United States'],['Drama'],['Four suburban young adults are trapped toget...
1022,1757,tt10544554,Mechanical Heart,2018,['English'],['United States'],"['Comedy', 'Drama']",['An aging stage actor loses his voice in an a...
1023,1760,tt10545182,Zealot,2019,['English'],['United States'],['Horror'],
1024,1764,tt10547804,The 27 Club,2020,['English'],['United States'],['Comedy'],['When indie pop sensation Imogen Wright belie...


We now have a dataset containing all the information we need on movies from 2010 to 2022. We just need to merge it with the provided dataset.

In [10]:
provided_data = pd.read_csv('DATA/provided_dataset_cleaned') # TODO: CHANGE WITH VARIABLE NAME
# keep only the release year (as an int) so that it has the same format as in the IMDB dataframe and the merge is easier
provided_data['release_date'] = provided_data['release_date'].str[:4].astype(int)
# country and languages are not needed anymore
provided_data = provided_data.drop(columns=['languages', 'countries'])
IMDB_movie_data_filtered = IMDB_movie_data_filtered.drop(columns=['languages', 'countries'])

In [12]:
# merge IMDB and provided dataset
movie_data = pd.merge(provided_data, IMDB_movie_data_filtered, on=['name', 'release_date'], how='outer').copy()
duplicated_cols = ['plot_summary', 'genre']
movie_data = remove_duplicated_columns(movie_data, duplicated_cols)
# save data
#movie_data.to_csv('DATA/movie_data.csv')

In [13]:
movie_data

Unnamed: 0.1,wikipedia_ID,freebase_ID,name,release_date,box_office_revenue,runtime,genre,plot_summary,Unnamed: 0,IMDB_ID
0,975900.0,/m/03vyhn,Ghosts of Mars,2001,14010832.0,98.0,"['Thriller', 'Science Fiction', 'Horror', 'Adv...","Set in the second half of the 22nd century, th...",,
1,3196793.0,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000,,95.0,"['Mystery', 'Biographical film', 'Drama', 'Cri...",,,
2,13696889.0,/m/03cfc81,The Gangsters,1913,,35.0,"['Short Film', 'Silent film', 'Indie', 'Black-...",,,
3,10408933.0,/m/02qc0j7,Alexander's Ragtime Band,1938,3600000.0,106.0,"['Musical', 'Comedy', 'Black-and-white']",,,
4,175026.0,/m/017n1p,Sarah and Son,1930,,86.0,"['Drama', 'Black-and-white']",,,
...,...,...,...,...,...,...,...,...,...,...
30822,,,The Cold,2019,,,['Drama'],['Four suburban young adults are trapped toget...,1756.0,tt10544094
30823,,,Mechanical Heart,2018,,,"['Comedy', 'Drama']",['An aging stage actor loses his voice in an a...,1757.0,tt10544554
30824,,,Zealot,2019,,,['Horror'],,1760.0,tt10545182
30825,,,The 27 Club,2020,,,['Comedy'],['When indie pop sensation Imogen Wright belie...,1764.0,tt10547804
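An outer merge on `name` and `release_date` leaves every other shared column in two copies, which pandas suffixes with `_x` and `_y` by default. The helper `remove_duplicated_columns` is not shown in this notebook; the sketch below is only a guess at what such a step could do, namely take the value from the first dataset and fall back to the second. The function name `coalesce_columns_sketch` and the toy rows (including the made-up movie "Shared Title") are assumptions.

```python
import pandas as pd

provided_toy = pd.DataFrame({
    "name": ["Ghosts of Mars", "Shared Title"],
    "release_date": [2001, 2012],
    "genre": ["['Thriller']", None],
})
imdb_toy = pd.DataFrame({
    "name": ["Shared Title", "Zealot"],
    "release_date": [2012, 2019],
    "genre": ["['Drama']", "['Horror']"],
})

# `genre` exists in both frames, so the merge yields genre_x and genre_y.
merged_toy = pd.merge(provided_toy, imdb_toy, on=["name", "release_date"], how="outer")

def coalesce_columns_sketch(df, cols):
    """Hypothetical stand-in: merge each <col>_x / <col>_y pair into a single <col>."""
    df = df.copy()
    for col in cols:
        # Prefer the left value; use the right value where the left one is missing.
        df[col] = df[f"{col}_x"].combine_first(df[f"{col}_y"])
        df = df.drop(columns=[f"{col}_x", f"{col}_y"])
    return df

toy_movies = coalesce_columns_sketch(merged_toy, ["genre"])
print(toy_movies)
```

The movie present in both toy frames ends up on a single row, with its genre taken from the IMDB side because the provided side has none.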


We now have a dataset containing movies up to 2022 that is ready for our analysis!

### B) Complementing characters' data
To complete the character dataset, the following archives need to be downloaded from https://datasets.imdbws.com/, unzipped and placed in the /DATA folder:
- title.principals.tsv.gz
- name.basics.tsv.gz

In [14]:
# load characters info
characters_data = pd.read_csv('DATA/title.principals.tsv/data.tsv', sep='\t')
IMDB_ids = IMDB_movie_data_filtered['IMDB_ID']
# only keep characters of the filtered movies
characters_data_filtered = clean_IMDB_character_dataset(characters_data, IMDB_ids)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  characters_data.loc[characters_data['category'] == 'actor', 'actor_gender'] = 'M'
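The warning above usually appears when a value is assigned into a DataFrame that was itself obtained by filtering another one. A common way to avoid it is to take an explicit `.copy()` of the filtered rows before adding columns. The sketch below shows this on toy data; it is not the code of `clean_IMDB_character_dataset`, and the function name, the toy IDs and the `actress` to `'F'` mapping are assumptions (only the `actor` to `'M'` assignment is visible in the warning).

```python
import pandas as pd

# Toy stand-in for title.principals (IDs invented for the example).
toy_principals = pd.DataFrame({
    "tconst": ["ttA", "ttA", "ttA", "ttB"],
    "nconst": ["nm1", "nm2", "nm3", "nm4"],
    "category": ["actor", "actress", "director", "actor"],
})
wanted_ids = pd.Series(["ttA"])

def clean_characters_sketch(principals, movie_ids):
    """Hypothetical cleaning step: keep the cast of the selected movies and add a gender column."""
    is_cast = principals["category"].isin(["actor", "actress"])
    in_scope = principals["tconst"].isin(movie_ids)
    # The explicit copy makes `cast` an independent DataFrame, so the
    # assignment below does not trigger SettingWithCopyWarning.
    cast = principals.loc[is_cast & in_scope].copy()
    cast["actor_gender"] = cast["category"].map({"actor": "M", "actress": "F"})
    return cast

toy_cast = clean_characters_sketch(toy_principals, wanted_ids)
print(toy_cast)
```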


In [15]:
# load actors info
actors_data = pd.read_csv('DATA/name.basics.tsv/data.tsv', sep='\t')
# remove useless columns
actors_data = actors_data.drop(columns=['deathYear', 'primaryProfession', 'knownForTitles'])

We now need to merge all the information we have on the characters and the actors. Some information still has to be added to the dataframe: release date, actor age and movie name. These will be added by merging with the movie dataset created above.

In [17]:
# merge actor data and movie data on character data
IMDB_characters_data = merge_datasets_characters(characters_data_filtered, actors_data, IMDB_movie_data_filtered)
IMDB_characters_data.loc[IMDB_characters_data['character_name'] == '\\N', 'character_name'] = None
IMDB_characters_data['release_date'] = IMDB_characters_data['release_date'].astype(float) #so that it's the same type as the provided data

In [18]:
IMDB_characters_data

Unnamed: 0.1,IMDB_ID,actor_IMDB_ID,character_name,actor_gender,actor_name,actor_birthday,Unnamed: 0,name,release_date,actor_age
0,tt0172182,nm0182661,Bum Joe,M,Nicolas Coster,1933.0,2,Blood Type,2018.0,85.0
1,tt0172182,nm0500098,Tiffanie,F,Hudson Leick,1969.0,2,Blood Type,2018.0,49.0
2,tt0172182,nm0090981,Chad,M,Wolfgang Bodison,1966.0,2,Blood Type,2018.0,52.0
3,tt0172182,nm0001730,Mrs. Dow,F,Deborah Shelton,1948.0,2,Blood Type,2018.0,70.0
4,tt0293429,nm1167985,Cole Young,M,Lewis Tan,,4,Mortal Kombat,2021.0,
...,...,...,...,...,...,...,...,...,...,...
3123,tt10548502,nm10791665,Jehovah Witness,M,D.J. Gibson,,1765,Saturday,2020.0,
3124,tt10548502,nm10437154,Black,M,Rodney H. Glover,,1765,Saturday,2020.0,
3125,tt10548502,nm10791663,Roland,M,Deitrick Greer,,1765,Saturday,2020.0,
3126,tt10548502,nm8820193,Earl,M,Cecil M. Henry,,1765,Saturday,2020.0,
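The helper `merge_datasets_characters` is not shown in this notebook. As an illustration of how the columns in the table above can be obtained, the sketch below joins toy cast, actor and movie tables and computes the actor's age as release year minus birth year. The `nconst`, `primaryName` and `birthYear` columns come from the IMDB `name.basics` file; the function name and the exact steps are assumptions. The two toy rows reuse values from the table above.

```python
import pandas as pd

toy_cast = pd.DataFrame({
    "IMDB_ID": ["tt0172182", "tt0293429"],
    "actor_IMDB_ID": ["nm0182661", "nm1167985"],
    "actor_gender": ["M", "M"],
})
toy_actors = pd.DataFrame({
    "nconst": ["nm0182661", "nm1167985"],
    "primaryName": ["Nicolas Coster", "Lewis Tan"],
    "birthYear": ["1933", "\\N"],  # "\N" marks a missing value in the IMDB files
})
toy_movies = pd.DataFrame({
    "IMDB_ID": ["tt0172182", "tt0293429"],
    "name": ["Blood Type", "Mortal Kombat"],
    "release_date": [2018, 2021],
})

def merge_characters_sketch(cast, actors, movies):
    """Hypothetical merge: attach actor and movie information, then derive the age."""
    actors = actors.rename(columns={
        "nconst": "actor_IMDB_ID",
        "primaryName": "actor_name",
        "birthYear": "actor_birthday",
    })
    actors["actor_birthday"] = pd.to_numeric(actors["actor_birthday"], errors="coerce")
    out = cast.merge(actors, on="actor_IMDB_ID", how="left")
    out = out.merge(movies, on="IMDB_ID", how="left")
    # Age at release; stays NaN when the birth year is unknown.
    out["actor_age"] = out["release_date"] - out["actor_birthday"]
    return out

toy_characters = merge_characters_sketch(toy_cast, toy_actors, toy_movies)
print(toy_characters[["actor_name", "name", "release_date", "actor_age"]])
```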


We can now merge this IMDB character dataset with the provided character dataset.

In [19]:
#provided_characters = character_metadata_noNA_genderYear_personnas #TODO: replace by variable name
provided_characters = pd.read_csv('DATA/US_movies_with_movie_names')
#change birthday into birth year
provided_characters['actor_birthday'] = provided_characters['actor_birthday'].str[:4].astype(float)
provided_characters['release_date'] = provided_characters['release_date'].str[:4].astype(float)

In [22]:
#first put everything in lower case
provided_characters['character_name'] = name_to_lowercase(provided_characters, 'character_name')
provided_characters['actor_name'] = name_to_lowercase(provided_characters, 'actor_name')
IMDB_characters_data['character_name'] = name_to_lowercase(IMDB_characters_data, 'character_name')
IMDB_characters_data['actor_name'] = name_to_lowercase(IMDB_characters_data, 'actor_name')
# merge datasets
characters_data = pd.merge(provided_characters, IMDB_characters_data, on=['name', 'character_name', 'actor_name', 'release_date'], how='outer').copy()
# remove columns that were duplicated
duplicated_cols = ['actor_birthday', 'actor_gender', 'actor_age']
characters_data = remove_duplicated_columns(characters_data, duplicated_cols)
# save file
#characters_data.to_csv('DATA/characters_data.csv', index=False)

In [23]:
characters_data

Unnamed: 0.1,wikipedia_ID,freebase_ID,release_date,character_name,actor_birthday,actor_gender,actor_height,actor_ethnicity,actor_name,actor_age,freebase_character_actor_mapID,freebase_character_ID,freebase_actor_ID,personnas,name,IMDB_ID,actor_IMDB_ID,Unnamed: 0
0,975900.0,/m/03vyhn,2001.0,akooshay,1958.0,F,1.620,,wanda de jesus,42.0,/m/0bgchxw,/m/0bgcj3x,/m/03wcfv7,,Ghosts of Mars,,,
1,975900.0,/m/03vyhn,2001.0,lieutenant melanie ballard,1974.0,F,1.780,/m/044038p,natasha henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4,,Ghosts of Mars,,,
2,975900.0,/m/03vyhn,2001.0,desolation williams,1969.0,M,1.727,/m/0x67,ice cube,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l,,Ghosts of Mars,,,
3,975900.0,/m/03vyhn,2001.0,sgt jericho butler,1967.0,M,1.750,,jason statham,33.0,/m/02vchl6,/m/0bgchnq,/m/034hyc,,Ghosts of Mars,,,
4,975900.0,/m/03vyhn,2001.0,bashira kincaid,1977.0,F,1.650,,clea duvall,23.0,/m/02vbb3r,/m/0bgchp9,/m/01y9xg,,Ghosts of Mars,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
398294,,,2020.0,jehovah witness,,M,,,d.j. gibson,,,,,,Saturday,tt10548502,nm10791665,1765.0
398295,,,2020.0,black,,M,,,rodney h. glover,,,,,,Saturday,tt10548502,nm10437154,1765.0
398296,,,2020.0,roland,,M,,,deitrick greer,,,,,,Saturday,tt10548502,nm10791663,1765.0
398297,,,2020.0,earl,,M,,,cecil m. henry,,,,,,Saturday,tt10548502,nm8820193,1765.0
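One way to sanity-check an outer merge like the one above is to pass `indicator=True` to `pd.merge`, which adds a `_merge` column saying whether each row came from the left frame, the right frame or both. The sketch below uses toy data with the same four key columns; the rows, including the made-up movie "Overlap Movie", are invented for the example.

```python
import pandas as pd

keys = ["name", "character_name", "actor_name", "release_date"]

provided_toy = pd.DataFrame({
    "name": ["Ghosts of Mars", "Overlap Movie"],
    "character_name": ["akooshay", "hero"],
    "actor_name": ["wanda de jesus", "some actor"],
    "release_date": [2001.0, 2011.0],
})
imdb_toy = pd.DataFrame({
    "name": ["Overlap Movie", "Saturday"],
    "character_name": ["hero", "earl"],
    "actor_name": ["some actor", "cecil m. henry"],
    "release_date": [2011.0, 2020.0],
})

# `_merge` takes the values "left_only", "right_only" or "both".
checked = pd.merge(provided_toy, imdb_toy, on=keys, how="outer", indicator=True)
counts = checked["_merge"].value_counts()
print(counts)
```

A large "both" count would indicate substantial overlap between the two sources, while a count of zero could point to a mismatch in the key columns (for example, differences in name casing or date type).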
