# Milestone 2: Data Insights and Feasibility Check of Our Idea

<hr style="clear:both">

!! Explain our idea clearly here !!


**Guideline for us (to delete when submitting the milestone):**
This notebook presents the group's journey through the data analysis. We show that we:
- can handle the data at its full size.
- understand what's in the data: formats, distributions, missing values, correlations, and so on.
- have considered ways to enrich, filter, and transform the data according to our needs.
- have a reasonable plan and ideas for the methods we're going to use, giving their essential mathematical details in the notebook.
- have a reasonable and sound plan for the analysis and its communication, potentially discussing alternatives that we considered but dropped.

**Authors:** [Luca Carroz](https://people.epfl.ch/emilie.carroz), [David Schroeter](https://people.epfl.ch/david.schroeter), [Xavier Ogay](https://people.epfl.ch/xavier.ogay), [Joris Monnet](https://people.epfl.ch/joris.monnet), [Paulo Ribeiro de Carvalho](https://people.epfl.ch/paulo.ribeirodecarvalho)

<hr style="clear:both">

## Import

In [1]:
import numpy as np
import pandas as pd

## Load Data

In [2]:
# Make sure your data is stored under this path (relative to the notebook)
data_path = 'dataset/MovieSummaries'

# Read the tab-separated files into pandas DataFrames (the raw files have no header row)
character = pd.read_csv(f"{data_path}/character.metadata.tsv", delimiter='\t', header=None)
movie = pd.read_csv(f"{data_path}/movie.metadata.tsv", delimiter='\t', header=None)
name = pd.read_csv(f"{data_path}/name.clusters.txt", delimiter='\t', header=None)

## Data Formatting
Note that the raw data sets come without headers. We therefore looked at the documentation and set meaningful column names. The documentation of the data set is available [here](http://www.cs.cmu.edu/~ark/personas/).

In [3]:
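# Per the data set documentation, the last two columns ("dont_know", "dont_know_bis")
# hold the Freebase character ID and the Freebase actor ID, respectively.
character.columns = ["movie_wiki_ID", "movie_freebase_ID", "movie_release_date", "character_name",
                     "actor_date_of_birth", "actor_gender", "actor_height", "actor_ethnicity", "actor_name",
                     "actor_age_movie_release", "freebase_character_map", "dont_know", "dont_know_bis"]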

movie.columns = ["wiki_ID", "freebase_ID", "name", "release_date", "box_office_revenue", "runtime",
                 "languages", "countries", "genres"]

name.columns = ["character_name", "freebase_character_map"]

Now we can display the DataFrames with the right column names and start working on them.

### Character Metadata

Let's explore the first dataset provided to us. First, we display the raw data with the correct label for each variable. We see that some variables are textual (e.g. `movie_freebase_ID`, `actor_name`), categorical (e.g. `actor_gender`) or numerical (e.g. `actor_height`, `actor_age_movie_release`).

In [4]:
display(character)

| | movie_wiki_ID | movie_freebase_ID | movie_release_date | character_name | actor_date_of_birth | actor_gender | actor_height | actor_ethnicity | actor_name | actor_age_movie_release | freebase_character_map | dont_know | dont_know_bis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 975900 | /m/03vyhn | 2001-08-24 | Akooshay | 1958-08-26 | F | 1.620 | NaN | Wanda De Jesus | 42.0 | /m/0bgchxw | /m/0bgcj3x | /m/03wcfv7 |
| 1 | 975900 | /m/03vyhn | 2001-08-24 | Lieutenant Melanie Ballard | 1974-08-15 | F | 1.780 | /m/044038p | Natasha Henstridge | 27.0 | /m/0jys3m | /m/0bgchn4 | /m/0346l4 |
| 2 | 975900 | /m/03vyhn | 2001-08-24 | Desolation Williams | 1969-06-15 | M | 1.727 | /m/0x67 | Ice Cube | 32.0 | /m/0jys3g | /m/0bgchn_ | /m/01vw26l |
| 3 | 975900 | /m/03vyhn | 2001-08-24 | Sgt Jericho Butler | 1967-09-12 | M | 1.750 | NaN | Jason Statham | 33.0 | /m/02vchl6 | /m/0bgchnq | /m/034hyc |
| 4 | 975900 | /m/03vyhn | 2001-08-24 | Bashira Kincaid | 1977-09-25 | F | 1.650 | NaN | Clea DuVall | 23.0 | /m/02vbb3r | /m/0bgchp9 | /m/01y9xg |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 450664 | 913762 | /m/03pcrp | 1992-05-21 | Elensh | 1970-05 | F | NaN | NaN | Dorothy Elias-Fahn | NaN | /m/0kr406c | /m/0kr406h | /m/0b_vcv |
| 450665 | 913762 | /m/03pcrp | 1992-05-21 | Hibiki | 1965-04-12 | M | NaN | NaN | Jonathan Fahn | 27.0 | /m/0kr405_ | /m/0kr4090 | /m/0bx7_j |
| 450666 | 28308153 | /m/0cp05t9 | 1957 | NaN | 1941-11-18 | M | 1.730 | /m/02w7gg | David Hemmings | 15.0 | /m/0g8ngmc | NaN | /m/022g44 |
| 450667 | 28308153 | /m/0cp05t9 | 1957 | NaN | NaN | NaN | NaN | NaN | Roberta Paterson | NaN | /m/0g8ngmj | NaN | /m/0g8ngmm |


Next, we can compute some summary metrics to gain insight into each variable.
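One such metric is the share of missing values per column. The sketch below is only an illustration: `sample` is a toy stand-in for `character`, built from a few rows and columns of the display above, and `missing_share` is a name introduced here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `character`: four rows and four columns taken from the display above.
sample = pd.DataFrame({
    "actor_name": ["Wanda De Jesus", "Ice Cube", "Dorothy Elias-Fahn", "Roberta Paterson"],
    "actor_gender": ["F", "M", "F", np.nan],
    "actor_height": [1.62, 1.727, np.nan, np.nan],
    "actor_ethnicity": [np.nan, "/m/0x67", np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the full data, the same expression applied to `character` should rank the columns by how incomplete they are, which helps decide which attributes are usable.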

### Movie Metadata

Explanations...

In [5]:
display(movie)

| | wiki_ID | freebase_ID | name | release_date | box_office_revenue | runtime | languages | countries | genres |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 975900 | /m/03vyhn | Ghosts of Mars | 2001-08-24 | 14010832.0 | 98.0 | {"/m/02h40lc": "English Language"} | {"/m/09c7w0": "United States of America"} | {"/m/01jfsb": "Thriller", "/m/06n90": "Science... |
| 1 | 3196793 | /m/08yl5d | Getting Away with Murder: The JonBenét Ramsey ... | 2000-02-16 | NaN | 95.0 | {"/m/02h40lc": "English Language"} | {"/m/09c7w0": "United States of America"} | {"/m/02n4kr": "Mystery", "/m/03bxz7": "Biograp... |
| 2 | 28463795 | /m/0crgdbh | Brun bitter | 1988 | NaN | 83.0 | {"/m/05f_3": "Norwegian Language"} | {"/m/05b4w": "Norway"} | {"/m/0lsxr": "Crime Fiction", "/m/07s9rl0": "D... |
| 3 | 9363483 | /m/0285_cd | White Of The Eye | 1987 | NaN | 110.0 | {"/m/02h40lc": "English Language"} | {"/m/07ssc": "United Kingdom"} | {"/m/01jfsb": "Thriller", "/m/0glj9q": "Erotic... |
| 4 | 261236 | /m/01mrr1 | A Woman in Flames | 1983 | NaN | 106.0 | {"/m/04306rv": "German Language"} | {"/m/0345h": "Germany"} | {"/m/07s9rl0": "Drama"} |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 81736 | 35228177 | /m/0j7hxnt | Mermaids: The Body Found | 2011-03-19 | NaN | 120.0 | {"/m/02h40lc": "English Language"} | {"/m/09c7w0": "United States of America"} | {"/m/07s9rl0": "Drama"} |
| 81737 | 34980460 | /m/0g4pl34 | Knuckle | 2011-01-21 | NaN | 96.0 | {"/m/02h40lc": "English Language"} | {"/m/03rt9": "Ireland", "/m/07ssc": "United Ki... | {"/m/03bxz7": "Biographical film", "/m/07s9rl0... |
| 81738 | 9971909 | /m/02pygw1 | Another Nice Mess | 1972-09-22 | NaN | 66.0 | {"/m/02h40lc": "English Language"} | {"/m/09c7w0": "United States of America"} | {"/m/06nbt": "Satire", "/m/01z4y": "Comedy"} |
| 81739 | 913762 | /m/03pcrp | The Super Dimension Fortress Macross II: Lover... | 1992-05-21 | NaN | 150.0 | {"/m/03_9r": "Japanese Language"} | {"/m/03_3d": "Japan"} | {"/m/06n90": "Science Fiction", "/m/0gw5n2f": ... |
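The `languages`, `countries` and `genres` columns appear to hold JSON-encoded dictionaries that map a Freebase ID to a readable name. A minimal sketch of how such cells could be turned into lists of names follows. The helper `parse_names` and the column `genre_names` are hypothetical names introduced here, and the toy frame only uses rows whose `genres` cell is not truncated in the display.

```python
import json

import pandas as pd

# Toy stand-in for `movie`, using rows whose `genres` cell is shown in full above.
sample = pd.DataFrame({
    "name": ["A Woman in Flames", "Another Nice Mess", "Mermaids: The Body Found"],
    "genres": [
        '{"/m/07s9rl0": "Drama"}',
        '{"/m/06nbt": "Satire", "/m/01z4y": "Comedy"}',
        '{"/m/07s9rl0": "Drama"}',
    ],
})


def parse_names(cell):
    """Return the names stored in a JSON-encoded {freebase_id: name} cell."""
    if pd.isna(cell):
        return []
    return list(json.loads(cell).values())


# One list of genre names per movie.
sample["genre_names"] = sample["genres"].apply(parse_names)

# One row per (movie, genre) pair, then count how often each genre occurs.
genre_counts = sample["genre_names"].explode().value_counts()
print(genre_counts)
```

The same helper could be applied to `languages` and `countries`, assuming they follow the same encoding as the rows displayed here.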


## Simple Data Insight

Let's try to gain some insight into the data provided by the course. We focus here on a statistical description of some attributes, and also on explaining which attributes might be interesting for our final project.

### Character Metadata

Let's begin with the Character data set.

In [6]:
character.describe()

| | movie_wiki_ID | actor_height | actor_age_movie_release |
|---|---|---|---|
| count | 450669.0 | 154824.0 | 292556.0 |
| mean | 13969750.0 | 1.788893 | 37.788523 |
| std | 10796620.0 | 4.37994 | 20.58787 |
| min | 330.0 | 0.61 | -7896.0 |
| 25% | 3759292.0 | 1.6764 | 28.0 |
| 50% | 11890650.0 | 1.75 | 36.0 |
| 75% | 23665010.0 | 1.83 | 47.0 |
| max | 37501920.0 | 510.0 | 103.0 |
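The summary above contains values that cannot be real: a maximum `actor_height` of 510 and a minimum `actor_age_movie_release` of -7896. One possible way to handle them, sketched below, is to replace values outside a plausible range with missing values. The two ranges are assumptions made for this illustration, not values taken from the data set documentation, and the toy frame simply places the two extremes next to a few regular values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `character`: regular values plus the two extremes reported by describe().
sample = pd.DataFrame({
    "actor_height": [1.62, 1.78, 510.0, np.nan],
    "actor_age_movie_release": [42.0, 27.0, -7896.0, 15.0],
})

# Assumed plausibility ranges (hypothetical thresholds, to be discussed in the group).
HEIGHT_RANGE_M = (0.5, 2.5)
AGE_RANGE_YEARS = (0, 110)

cleaned = sample.copy()
# Series.where keeps values where the condition holds and sets the others to NaN.
cleaned["actor_height"] = sample["actor_height"].where(
    sample["actor_height"].between(*HEIGHT_RANGE_M)
)
cleaned["actor_age_movie_release"] = sample["actor_age_movie_release"].where(
    sample["actor_age_movie_release"].between(*AGE_RANGE_YEARS)
)

# Number of values discarded per column (already-missing values are not counted).
n_discarded = cleaned.isna().sum() - sample.isna().sum()
print(n_discarded)
```

Setting implausible entries to missing rather than dropping the rows keeps the other attributes of the same character available for later analysis.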


### Movie Metadata

Next, the Movie data set.

In [7]:
movie.describe()

| | wiki_ID | box_office_revenue | runtime |
|---|---|---|---|
| count | 81741.0 | 8401.0 | 61291.0 |
| mean | 17407840.0 | 47993630.0 | 111.8192 |
| std | 10987910.0 | 112175300.0 | 4360.07 |
| min | 330.0 | 10000.0 | 0.0 |
| 25% | 7323695.0 | 2083193.0 | 81.0 |
| 50% | 17778990.0 | 10639690.0 | 93.0 |
| 75% | 27155730.0 | 40716960.0 | 106.0 |
| max | 37501920.0 | 2782275000.0 | 1079281.0 |
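For `runtime`, the mean (about 112) sits above the 75% quantile (106) and the standard deviation is about 4360, which suggests that a few extreme values such as the maximum of 1079281 dominate these statistics. A rough sketch of one common remedy, Tukey's fences on the interquartile range, is given below. It is a technique chosen for this illustration, the factor 1.5 is the conventional default and an assumption here, and the toy series combines the runtimes displayed earlier with the reported minimum and maximum.

```python
import pandas as pd

# Runtimes of the nine displayed movies, plus the min (0.0) and max (1079281.0) from describe().
runtime = pd.Series(
    [98.0, 95.0, 83.0, 110.0, 106.0, 120.0, 96.0, 66.0, 150.0, 0.0, 1079281.0],
    name="runtime",
)

# The median is far less sensitive to extreme values than the mean.
median = runtime.median()

# Tukey's fences: flag values further than 1.5 * IQR outside the quartiles.
q1, q3 = runtime.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
suspicious = runtime[(runtime < lower) | (runtime > upper)]

print(f"median={median}, fences=({lower}, {upper})")
print(suspicious)
```

On the full data the fences would differ from this toy example, and legitimately long films may be flagged too, so flagged values deserve a manual look before being discarded.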


## Additional Data Set

Let's enrich our dataset with an additional one downloaded [here (LINK NOT PROVIDED YET)](https://google.com).

Here we should also provide some statistical insight into this new dataset to make sure it is worth working with.

In [8]:
pass

## Joining Data Sets

Explain how we are going to join the datasets into one meaningful dataset that serves our goal.

In [9]:
pass
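As a starting point, the two provided tables share the Wikipedia movie ID (`movie_wiki_ID` in `character`, `wiki_ID` in `movie`). The sketch below shows one way such a join could look with `DataFrame.merge`. It uses toy frames built from rows displayed earlier, and the choice of a left join is an assumption made for this illustration, not a decision of the group.

```python
import pandas as pd

# Toy stand-ins for `character` and `movie`, built from rows displayed earlier.
character_sample = pd.DataFrame({
    "movie_wiki_ID": [975900, 975900, 28308153],
    "actor_name": ["Wanda De Jesus", "Ice Cube", "David Hemmings"],
})
movie_sample = pd.DataFrame({
    "wiki_ID": [975900, 261236],
    "name": ["Ghosts of Mars", "A Woman in Flames"],
})

merged = character_sample.merge(
    movie_sample,
    left_on="movie_wiki_ID",
    right_on="wiki_ID",
    how="left",               # keep every character row, matched or not
    validate="many_to_one",   # raise if a wiki_ID appears twice in the movie table
    indicator=True,           # adds a `_merge` column: "both" or "left_only"
)

# How many character rows found their movie, and how many did not.
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

The `validate` and `indicator` arguments make silent row duplication or unmatched keys visible early, which matters once an additional dataset with its own identifiers is joined in.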

## Results Expected

Let's try to explain which main plots we want to bring to our readers. Do we also have an interactive plot in mind, and so on?

In [10]:
# It might be interesting to produce a quick, rough version of the graph here.
pass
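As one example of a quick first plot, the number of movies per release year could be computed as sketched below. The displayed `release_date` values mix full dates (`2001-08-24`) and bare years (`1988`), so this sketch only keeps the first four characters. The toy series holds the nine release dates displayed earlier, and `release_year` and `movies_per_year` are names introduced here.

```python
import pandas as pd

# Release dates of the nine movies displayed earlier (mixed "YYYY-MM-DD" and "YYYY" formats).
release_date = pd.Series(
    ["2001-08-24", "2000-02-16", "1988", "1987", "1983",
     "2011-03-19", "2011-01-21", "1972-09-22", "1992-05-21"],
    name="release_date",
)

# Keep the leading four characters as the year; unparsable entries become missing.
release_year = pd.to_numeric(release_date.str.slice(0, 4), errors="coerce").astype("Int64")

# Movies per year, ordered chronologically (missing years are ignored by value_counts).
movies_per_year = release_year.value_counts().sort_index()
print(movies_per_year)

# With matplotlib installed, `movies_per_year.plot(kind="bar")` would draw a rough chart.
```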