# milestone P2
## Investigating the influence of history on the movie industry
#### Team ADAcADAbra:
DEMIRTAS Enes Eray\
MAILLARD Alexandre Benjamin\
SIRIPATTHITI Punnawat\
ZEMP Manuel Nicolas

## General Information

This notebook presents how we prepared the raw data for our analysis. In the first part we import the raw data. In the second part we check it for errors and bring the dataframes into a format that suits our needs. In the third part we present the main pipelines for our analysis, which will be carried out in the course of milestone P3.


#### structure of files and directories

- **data/**
  - *character.metadata.tsv*
  - *movie.metadata.tsv*
  - *plot_summaries.txt*
- *P2_data_preparation.ipynb*



Directories are in **bold** and files are in *italic*

#### libraries used

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from numpy.fft import fft
import scipy.signal as scisi
import scipy.stats as scist

---

## 1. import data and create dataframes

In [2]:
path = 'data/'

In [3]:
columns_movie = ['Wikipedia movie ID', 'Freebase movie ID', 'Movie name', 'Movie release date', 'Movie box office revenue', 'Movie runtime', 
                 'Movie languages', 'Movie countries', 'Movie genres']
movie_df = pd.read_csv(path+'movie.metadata.tsv', sep='\t', names=columns_movie)

In [4]:
columns_character = ['Wikipedia_movie_ID', 'Freebase_movie_ID', 'Movie_release_date', 'Character_name', 'Actor_date_of_birth', 'Actor_gender', 
                     'Actor_height', 'Actor_ethnicity', 'Actor_name', 'Actor_age_at_movie_release', 'Freebase_character_actor_map_ID', 
                     'Freebase_character_ID', 'Freebase_actor_ID']
character_df = pd.read_csv(path+'character.metadata.tsv', sep='\t', names=columns_character)

In [5]:
columns_summaries = ['Wikipedia movie ID','Summary']
summaries_df = pd.read_csv(path+'plot_summaries.txt',sep='\t|\n', names=columns_summaries, engine='python')

### a first glimpse at the data

In [6]:
movie_df

Unnamed: 0,Wikipedia movie ID,Freebase movie ID,Movie name,Movie release date,Movie box office revenue,Movie runtime,Movie languages,Movie countries,Movie genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"
...,...,...,...,...,...,...,...,...,...
81736,35228177,/m/0j7hxnt,Mermaids: The Body Found,2011-03-19,,120.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama""}"
81737,34980460,/m/0g4pl34,Knuckle,2011-01-21,,96.0,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0..."
81738,9971909,/m/02pygw1,Another Nice Mess,1972-09-22,,66.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06nbt"": ""Satire"", ""/m/01z4y"": ""Comedy""}"
81739,913762,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ..."


In [7]:
character_df

Unnamed: 0,Wikipedia_movie_ID,Freebase_movie_ID,Movie_release_date,Character_name,Actor_date_of_birth,Actor_gender,Actor_height,Actor_ethnicity,Actor_name,Actor_age_at_movie_release,Freebase_character_actor_map_ID,Freebase_character_ID,Freebase_actor_ID
0,975900,/m/03vyhn,2001-08-24,Akooshay,1958-08-26,F,1.620,,Wanda De Jesus,42.0,/m/0bgchxw,/m/0bgcj3x,/m/03wcfv7
1,975900,/m/03vyhn,2001-08-24,Lieutenant Melanie Ballard,1974-08-15,F,1.780,/m/044038p,Natasha Henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4
2,975900,/m/03vyhn,2001-08-24,Desolation Williams,1969-06-15,M,1.727,/m/0x67,Ice Cube,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l
3,975900,/m/03vyhn,2001-08-24,Sgt Jericho Butler,1967-09-12,M,1.750,,Jason Statham,33.0,/m/02vchl6,/m/0bgchnq,/m/034hyc
4,975900,/m/03vyhn,2001-08-24,Bashira Kincaid,1977-09-25,F,1.650,,Clea DuVall,23.0,/m/02vbb3r,/m/0bgchp9,/m/01y9xg
...,...,...,...,...,...,...,...,...,...,...,...,...,...
450664,913762,/m/03pcrp,1992-05-21,Elensh,1970-05,F,,,Dorothy Elias-Fahn,,/m/0kr406c,/m/0kr406h,/m/0b_vcv
450665,913762,/m/03pcrp,1992-05-21,Hibiki,1965-04-12,M,,,Jonathan Fahn,27.0,/m/0kr405_,/m/0kr4090,/m/0bx7_j
450666,28308153,/m/0cp05t9,1957,,1941-11-18,M,1.730,/m/02w7gg,David Hemmings,15.0,/m/0g8ngmc,,/m/022g44
450667,28308153,/m/0cp05t9,1957,,,,,,Roberta Paterson,,/m/0g8ngmj,,/m/0g8ngmm


In [8]:
summaries_df

Unnamed: 0,Wikipedia movie ID,Summary
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
1,31186339,The nation of Panem consists of a wealthy Capi...
2,20663735,Poovalli Induchoodan is sentenced for six yea...
3,2231378,"The Lemon Drop Kid , a New York City swindler,..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...
...,...,...
42301,34808485,"The story is about Reema , a young Muslim scho..."
42302,1096473,"In 1928 Hollywood, director Leo Andreyev look..."
42303,35102018,American Luthier focuses on Randy Parsons’ tra...
42304,8628195,"Abdur Rehman Khan , a middle-aged dry fruit se..."


---

## 2. transforming the data to our needs

For our analysis, only some parts of the data are of interest to us. Here we look at the dataframes, check their formats, missing values and errors, rectify them as well as possible and transform the dataframes to suit our needs.

#### data of interest to us in frames:
- movie_df
    - wikipedia_ID
    - name of the movie
    - release date
    - movie genres
- character_df
    - wikipedia_ID
    - gender of actors
    - age of actors
- summaries_df
    - wikipedia_ID
    - summaries

#### to prepare this data, the following steps have to be performed:

**movie_df:** \
Wikipedia_ID: check if all IDs are present in *plot_summaries>wikipedia_ID*, else filter them out \
movie name: check if it exists (else find it through research on the internet) \
release date: check if values are plausible, cross-check with the release date from *character.metadata.tsv* \
genre: filter, so that we only have one genre per movie, and group similar genres together

**character_df:** \
Wikipedia_ID: check if all IDs are present in *plot_summaries>wikipedia_ID*, else filter them out \
release date: check if values are plausible, cross-check with the release date from *movie.metadata.tsv* \
actor gender: check if values are M/F, encode them as 1: man, 0: woman (if there is no value, put NaN) \
actor age: check if it exists (else put NaN)

**summaries_df:** \
check if all data exists (is there an ID for every summary and vice versa) \
check if every ID also has data in the other dataframes
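
The steps above could be sketched roughly as follows. This is only an illustration on tiny stand-in dataframes, not the final cleaning code: the new column names (`release_year`, `genre_list`, `gender_code`) are our own placeholders, and the choice of a single genre per movie is deliberately left open.

```python
import json

import numpy as np
import pandas as pd

# Tiny stand-in frames (made-up rows); the real ones come from the files loaded above.
movie_df = pd.DataFrame({
    'Wikipedia movie ID': [1, 2, 3],
    'Movie release date': ['2001-08-24', '1988', None],
    'Movie genres': ['{"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction"}',
                     '{"/m/07s9rl0": "Drama"}',
                     '{}'],
})
character_df = pd.DataFrame({
    'Wikipedia_movie_ID': [1, 1, 2, 4],
    'Actor_gender': ['F', 'M', None, 'M'],
})
summaries_df = pd.DataFrame({'Wikipedia movie ID': [1, 2],
                             'Summary': ['...', '...']})

# Keep only movies and characters that have a plot summary.
ids_with_summary = set(summaries_df['Wikipedia movie ID'])
movies = movie_df[movie_df['Wikipedia movie ID'].isin(ids_with_summary)].copy()
characters = character_df[
    character_df['Wikipedia_movie_ID'].isin(ids_with_summary)].copy()

# Release dates appear as 'YYYY-MM-DD', 'YYYY-MM' or 'YYYY'; keep the year only.
movies['release_year'] = pd.to_numeric(
    movies['Movie release date'].str[:4], errors='coerce')

# Genres are stored as JSON dictionaries {freebase_id: genre_name}.
movies['genre_list'] = movies['Movie genres'].apply(
    lambda s: list(json.loads(s).values()))

# Encode gender as 1 (man) and 0 (woman); anything else becomes NaN.
characters['gender_code'] = characters['Actor_gender'].map({'M': 1, 'F': 0})
```

Working on copies (`.copy()`) keeps the raw dataframes untouched, so the checks can be rerun with different filters.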

In [9]:
# code of Alex and Kla

---

## 3. how to go on

In the first two parts we prepared the data for our analysis. Now that it is ready, we lay out a detailed plan for how we will process the data and extract the information needed to tell the data story we are interested in.

### general principle of analysis

add some text -------------------------------------------------------------------------

### definitions

In our analysis we need the following terms: 

time of impact: \
time span over which we can relate a change in the dataset to the event. The event happens at time x, so over the next few years (x + time of impact) we expect to see changes in the dataset, which we can correlate with the event. 

time of reference: \
time span over which we evaluate the data as a reference, in order to measure changes that might occur due to an event that happened after the time of reference. 

keywords: \
we define keywords that are characteristic of each event. They are words and word combinations that uniquely correspond to the topic of an event. With the help of these words we hope to identify the movies that deal with the specific topic this event is related to. 

### concrete pipelines

There are different pipelines for "abrupt changes" and "slow developments":

#### abrupt changes

- clearly define event characteristics: what did the event influence, where could changes occur.
    - primary keywords: words that are unique to this topic and are rarely used in other topics
    - secondary keywords: words that are relevant to this topic, but are also used frequently in other contexts
    - classification rule:
        - if a movie summary contains a defined number of primary keywords, it is identified as a movie on this topic.
        - if it contains at least one primary keyword and several secondary keywords, it is also identified as a movie on this topic.
        - otherwise it is not classified as a movie on this topic.
- define time of impact (how many years after the event can there be an influence)
    - optional: vary it and see whether it changes something
- option 1
    - define time of reference
    - define hypotheses, run hypothesis tests
- option 2
    - run a regression analysis for the time before the event, to get the already existing trends and the expected values
    - run a hypothesis test with the expected values and real values in the defined time of impact
- quantify uncertainty
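
The keyword rule from this pipeline could look roughly like the sketch below. The keyword sets are the war keywords listed later in this notebook. The function name `is_topic_movie`, the thresholds `min_primary=2` and `min_secondary=3`, and the decision to count distinct keywords rather than occurrences are assumptions that still need to be tuned.

```python
import re

PRIMARY = {'war', 'battle', 'battles', 'military', 'army', 'troops',
           'soldier', 'soldiers', 'weapon', 'weapons'}
SECONDARY = {'conflict', 'conflicts', 'fight', 'enemy', 'end', 'attacking',
             'country', 'countries', 'violence', 'death', 'crisis', 'world',
             'strategy'}


def is_topic_movie(summary, primary, secondary, min_primary=2, min_secondary=3):
    """Return True if the summary matches the keyword rule for a topic."""
    # Lower-case the text and split it into a set of distinct words.
    tokens = set(re.findall(r"[a-z']+", summary.lower()))
    n_primary = len(tokens & primary)
    n_secondary = len(tokens & secondary)
    # Rule 1: enough primary keywords on their own.
    if n_primary >= min_primary:
        return True
    # Rule 2: at least one primary keyword backed by several secondary ones.
    return n_primary >= 1 and n_secondary >= min_secondary


print(is_topic_movie("The army sends troops to the front.", PRIMARY, SECONDARY))
print(is_topic_movie("Two friends fight over a country house.", PRIMARY, SECONDARY))
```

On the real data the function could be applied with `summaries_df['Summary'].apply(...)`. A simple word match like this ignores context, so some misclassification is to be expected.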


#### slow developments:

- clearly define characteristics of interest: what does the movement influence? where might structural changes occur due to it?
- define time step (for example: within 10 years, what has changed?)
- option 1
    - define hypotheses, run hypothesis tests
- option 2
    - identify trends through a regression analysis
    - run a hypothesis test with the expected values and real values in the defined time of impact
- quantify uncertainty
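
As an illustration of option 2, a trend could be identified with a simple linear regression over yearly values. The numbers below are toy values made up for the sketch, not results, and a linear trend is only one possible assumption about how a slow development looks.

```python
import numpy as np
import scipy.stats as scist

# Toy yearly values (made up), e.g. the share of women among credited actors.
years = np.arange(1950, 1960)
wiggle = np.where(years % 2 == 0, 0.002, -0.002)   # deterministic "noise"
share = 0.30 + 0.005 * (years - 1950) + wiggle

# Ordinary least-squares fit of share against year.
fit = scist.linregress(years, share)

# fit.slope is the estimated change per year;
# fit.pvalue tests the null hypothesis that the slope is zero.
print(f"slope per year: {fit.slope:.4f}, p-value: {fit.pvalue:.2g}")
```

A small p-value here only says that a linear trend is present in the chosen interval. It does not say what caused it.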


### Explicit preparation for first analysis

World War II
- expected changes for
    - genres of movies
        - more tragic movies
        - fewer comedies
    - name of movies (hard to identify since a title only has a few words - maybe leave this out)
        - topic change: more movies about war
    - summaries
        - topic change: more movies about war
            - keywords of a war movie: 
                - primary: war, battle(s), military, army, troops, soldier(s), weapon(s)
                - secondary: conflict(s), fight, enemy, end, attacking, country, countries, violence, death, crisis, world, strategy
        - in general harsher movie plots (more tragic, hard life, sad, fate)
- identify the set of movies of interest in the time of reference, calculate their frequency (or absolute numbers)
- perform a regression analysis and get the predicted values for the time of impact (about frequency of movies of that type, or about the absolute number)
- null hypothesis: the real values do not differ significantly from the expected values, meaning there are no significant changes in the dataset. 
- run hypothesis test (and hopefully reject null hypothesis)
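
The last three steps could be sketched as follows. All numbers are toy values made up for illustration, the year ranges for the time of reference and the time of impact are placeholders, and the one-sample t-test on the residuals is only one possible test. It ignores the uncertainty of the regression prediction itself, which would have to be handled when quantifying uncertainty.

```python
import numpy as np
import scipy.stats as scist

# Placeholder periods (assumptions, still to be defined properly).
ref_years = np.arange(1925, 1939)       # time of reference
impact_years = np.arange(1939, 1949)    # time of impact

# Toy yearly share of war movies (made up, not results).
ref_share = (0.02 + 0.001 * (ref_years - 1925)
             + np.where(ref_years % 2 == 0, 0.001, -0.001))
impact_share = (0.05 + 0.001 * (impact_years - 1925)
                + np.where(impact_years % 2 == 0, 0.002, -0.002))

# Fit the existing trend on the time of reference only.
fit = scist.linregress(ref_years, ref_share)

# Extrapolate the trend to get the expected values in the time of impact.
expected = fit.intercept + fit.slope * impact_years

# Null hypothesis: observed minus expected has mean zero.
residuals = impact_share - expected
t_stat, p_value = scist.ttest_1samp(residuals, popmean=0.0)
print(f"mean residual: {residuals.mean():.4f}, p-value: {p_value:.2g}")
```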


Gender Neutrality
- expected changes (we expect the ideal: equity between men and women)
    - gender partition of actors in a movie is 50:50
    - age distribution by gender: a normal distribution is expected
        - evaluate these for different genres
    - summaries
        - no scenes or language that are offensive with respect to gender
        - option 1
            - detect movies with a combination of keywords that suggest offensive scenes (frequency or absolute values) in the time of reference and the time of impact
        - option 2
            - train an ML model to detect offensive scenes (this requires a training dataset!)
            - run the algorithm on the dataset and identify these movies (frequency or absolute values) in the time of reference and the time of impact
- define the interval of progression (for example, every 20 years we expect to see a development). In the first iteration, the time of reference is the first x years and the time of impact is the next x years. In the second iteration, the previous time of impact becomes the time of reference and the x years after it become the new time of impact, and so on.
- repeat for each time step:
    - run regression analysis for reference time
    - predict values for time of impact (frequency or absolute values)
    - null hypothesis: the real values do not differ significantly from the expected values
    - run hypothesis test (and hopefully reject null hypothesis)
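
A minimal sketch of the iteration over time steps, together with one possible check of the 50:50 expectation. The helper `rolling_windows`, the year range and the actor counts are hypothetical, and the chi-square goodness-of-fit test is just one assumed choice of test.

```python
import scipy.stats as scist


def rolling_windows(first_year, last_year, step):
    """Yield (reference, impact) pairs of half-open year ranges [start, end)."""
    start = first_year
    # Stop as soon as the impact window would run past the last year.
    while start + 2 * step <= last_year + 1:
        reference = (start, start + step)
        impact = (start + step, start + 2 * step)
        yield reference, impact
        # The current time of impact becomes the next time of reference.
        start += step


windows = list(rolling_windows(1920, 1999, 20))
for reference, impact in windows:
    print("reference:", reference, "impact:", impact)

# Toy counts of women and men among the actors of one window (made up).
n_women, n_men = 30, 70
# With no expected frequencies given, chisquare tests against equal counts,
# which corresponds to the 50:50 expectation.
chi2_stat, p_value = scist.chisquare([n_women, n_men])
print(f"chi-square: {chi2_stat:.1f}, p-value: {p_value:.2g}")
```

Half-open ranges avoid counting a boundary year in two windows at once.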

### First analysis

In this section we provide our first code for the actual analysis. To our knowledge this goes beyond the scope of milestone P2, but writing this code already helped us make our analysis more concrete, which is why we include it in this notebook nonetheless. The code is not finished yet: it still contains parts that we will surely change, and it only serves as an inspiration. 

In [11]:
# code of Eray