# Milestone 2: Insight Of Data And Feasibility Check Of Our Idea

<hr style="clear:both">

This notebook traces our journey through the feasibility check of our idea: checking trends in music composition for the movie industry.

**Project Mentor:** [Aoxiang Fan](https://people.epfl.ch/aoxiang.fan) ([Email](mailto:aoxiang.fan@epfl.ch))
**Authors:** [Luca Carroz](https://people.epfl.ch/emilie.carroz), [David Schroeter](https://people.epfl.ch/david.schroeter), [Xavier Ogay](https://people.epfl.ch/xavier.ogay), [Joris Monnet](https://people.epfl.ch/joris.monnet), [Paulo Ribeiro de Carvalho](https://people.epfl.ch/paulo.ribeirodecarvalho)

<hr style="clear:both">

## Storyline

**Title: Stanislas' music dream: Road to Hollywood!**

A 20-year-old aspiring musician, Stanislas, fueled by a passion for the film industry, embarks on a quest to launch his career. His ultimate dream? To hear one of his productions featured in a Hollywood film and become one of the planet's top composers. To increase his chances, he turns to a team of Data Scientists known as LSD.

The "LearningtheSecretsofData" team's mission is to identify trends shared among successful music composers and compositions, ultimately optimizing choices for our young musician. This is not an easy task, but the team is driven by the wish to help Stanislas. How could they provoke a cascADA of successful choices in Stany's career?

Which music genre should Stany focus on? Will this new direction be enough for him to conquer show business? Should he invest in a ludicrous website to promote himself? Or should he even consider changing nationality to achieve his goal? Let’s see what plan LSD has concocted for Stanislas.


## Needed data

To help Stanislas with his dream, we need specific information about each analysed movie. To support our final answer about potential trends among composers in the movie industry, we will analyse a few questions:

1) Which music genres appear most frequently in movies?
2) What is the average age of composers at their:
   - first movie appearance?
   - biggest box office revenue?
3) How do the top composers' careers progress over the years?
4) Where do composers come from?
5) Does the composer's gender matter?
6) Does having a personal website correlate with a composer's success?
7) Is there a correlation between box office revenue and the popularity of a movie's playlist?
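For correlation questions such as 6) and 7), one possible approach is a rank (Spearman) correlation, which is less sensitive to the heavy right tail of box office revenues than a plain Pearson correlation. The sketch below is only an illustration: the column name `playlist_popularity` and the toy values are assumptions, not part of our data set.

```python
import pandas as pd


def spearman_corr(x: pd.Series, y: pd.Series) -> float:
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    # Keep only the rows where both values are available.
    paired = pd.concat([x, y], axis=1).dropna()
    return paired.iloc[:, 0].rank().corr(paired.iloc[:, 1].rank())


# Toy values for illustration only (not taken from our data).
toy = pd.DataFrame({
    "box_office_revenue": [1e6, 5e6, 2e7, 3e8],
    "playlist_popularity": [12, 30, 41, 77],  # hypothetical column name
})
rho = spearman_corr(toy["box_office_revenue"], toy["playlist_popularity"])
```

A value of `rho` close to 1 or -1 indicates a strong monotonic relationship; it says nothing about causality.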

List of needed information about each movie's **composer**:

- Name
- Birthday
- Gender
- Homepage
- Place of birth
- First appearance in movie credits
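
The attributes listed above could be held in a small record type. The following is a minimal sketch, not the actual class from our `tmdb` library: the field names follow those visible later in this notebook (`id`, `name`, `birthday`, `gender`, `homepage`, `place_of_birth`, `date_first_appearance`), while the types and the `age_at` helper are assumptions added for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Composer:
    id: int
    name: str
    birthday: Optional[date] = None               # assumed type
    gender: Optional[int] = None                  # assumed to be an integer code
    homepage: Optional[str] = None
    place_of_birth: Optional[str] = None
    date_first_appearance: Optional[date] = None  # assumed type

    def age_at(self, when: date) -> Optional[int]:
        """Age in full years on `when`, or None if the birthday is unknown."""
        if self.birthday is None:
            return None
        had_birthday = (when.month, when.day) >= (self.birthday.month, self.birthday.day)
        return when.year - self.birthday.year - (0 if had_birthday else 1)


# Fictional example: the birthday is made up for our storyline character.
stany = Composer(id=0, name="Stanislas", birthday=date(2003, 5, 17))
age = stany.age_at(date(2023, 11, 17))  # 20
```

With such a helper, the age at first movie appearance would simply be `composer.age_at(composer.date_first_appearance)` whenever both dates are known.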

List of needed information about the composers' **music**:

- Genre
- Spotify popularity


## Import

In [1]:
# Import all needed libraries
from helpers import *

# Load autoreload extension
%load_ext autoreload

# Set autoreload mode
%autoreload 2

## Load Data

Only `movie.metadata.tsv` is loaded, since the other data sets are not relevant to our analysis.

The raw data sets come without headers, so we took meaningful column names from the documentation, which can be found [here](http://www.cs.cmu.edu/~ark/personas/).
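
As an illustration of loading a headerless TSV file with explicit column names, here is a minimal sketch. It is not the actual `load_movies` helper (which lives in `helpers` and is not shown here); the function name `load_movies_sketch` and the choice of `quoting=csv.QUOTE_NONE` are assumptions. The column names are the ones displayed in this notebook.

```python
import csv
import io

import pandas as pd

MOVIE_COLUMNS = [
    "wiki_movieID", "freebase_movieID", "name", "release_date",
    "box_office_revenue", "runtime", "languages", "countries", "genres",
]


def load_movies_sketch(source) -> pd.DataFrame:
    """Read a tab-separated file without header row and name its columns."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,             # the raw file has no header line
        names=MOVIE_COLUMNS,
        quoting=csv.QUOTE_NONE,  # assumption: treat double quotes as plain characters
    )


# One row in the raw format, with the genres field shortened for the example.
sample = io.StringIO(
    "975900\t/m/03vyhn\tGhosts of Mars\t2001-08-24\t14010832\t98.0\t"
    '{"/m/02h40lc": "English Language"}\t'
    '{"/m/09c7w0": "United States of America"}\t'
    '{"/m/01jfsb": "Thriller"}\n'
)
demo = load_movies_sketch(sample)
```

The same function accepts a file path instead of the in-memory sample.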

In [2]:
# Make sure your data is stored at the following path
data_path = 'dataset/MovieSummaries/movie.metadata.tsv'

# Load the text file into Pandas DataFrame
movie = load_movies(data_path)

# Display the dataframe
display(movie)

| | wiki_movieID | freebase_movieID | name | release_date | box_office_revenue | runtime | languages | countries | genres |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 975900 | /m/03vyhn | Ghosts of Mars | 2001-08-24 | 14010832 | 98.0 | {"/m/02h40lc": "English Language"} | {"/m/09c7w0": "United States of America"} | {"/m/01jfsb": "Thriller", "/m/06n90": "Science... |
| 1 | 3196793 | /m/08yl5d | Getting Away with Murder: The JonBenét Ramsey ... | 2000-02-16 | | 95.0 | {"/m/02h40lc": "English Language"} | {"/m/09c7w0": "United States of America"} | {"/m/02n4kr": "Mystery", "/m/03bxz7": "Biograp... |
| 2 | 28463795 | /m/0crgdbh | Brun bitter | 1988 | | 83.0 | {"/m/05f_3": "Norwegian Language"} | {"/m/05b4w": "Norway"} | {"/m/0lsxr": "Crime Fiction", "/m/07s9rl0": "D... |
| 3 | 9363483 | /m/0285_cd | White Of The Eye | 1987 | | 110.0 | {"/m/02h40lc": "English Language"} | {"/m/07ssc": "United Kingdom"} | {"/m/01jfsb": "Thriller", "/m/0glj9q": "Erotic... |
| 4 | 261236 | /m/01mrr1 | A Woman in Flames | 1983 | | 106.0 | {"/m/04306rv": "German Language"} | {"/m/0345h": "Germany"} | {"/m/07s9rl0": "Drama"} |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 81736 | 35228177 | /m/0j7hxnt | Mermaids: The Body Found | 2011-03-19 | | 120.0 | {"/m/02h40lc": "English Language"} | {"/m/09c7w0": "United States of America"} | {"/m/07s9rl0": "Drama"} |
| 81737 | 34980460 | /m/0g4pl34 | Knuckle | 2011-01-21 | | 96.0 | {"/m/02h40lc": "English Language"} | {"/m/03rt9": "Ireland", "/m/07ssc": "United Ki... | {"/m/03bxz7": "Biographical film", "/m/07s9rl0... |
| 81738 | 9971909 | /m/02pygw1 | Another Nice Mess | 1972-09-22 | | 66.0 | {"/m/02h40lc": "English Language"} | {"/m/09c7w0": "United States of America"} | {"/m/06nbt": "Satire", "/m/01z4y": "Comedy"} |
| 81739 | 913762 | /m/03pcrp | The Super Dimension Fortress Macross II: Lover... | 1992-05-21 | | 150.0 | {"/m/03_9r": "Japanese Language"} | {"/m/03_3d": "Japan"} | {"/m/06n90": "Science Fiction", "/m/0gw5n2f": ... |


A quick look at the table already tells us something. The data set contains $81,741$ movies, and its variables are either textual (e.g. `freebase_movieID`, `name`, ...) or numerical (e.g. `box_office_revenue`, `runtime`, ...).

Let's get a more meaningful insight into our current data.
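
The `languages`, `countries` and `genres` columns displayed above hold strings that map Freebase IDs to readable names. A minimal sketch of how such a string could be turned into a plain list of names follows; it assumes the strings are valid JSON, and the function name `names_from_freebase_map` is hypothetical.

```python
import json


def names_from_freebase_map(raw: str) -> list:
    """Return the readable names from a '{"<freebase id>": "<name>"}' string."""
    return list(json.loads(raw).values())


languages = names_from_freebase_map('{"/m/02h40lc": "English Language"}')
# languages == ['English Language']
```

Applied to a whole column, this could be written as `movie['genres'].map(names_from_freebase_map)`.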

In [3]:
insight(movie)

| | class | missing_values (%) |
|---|---|---|
| wiki_movieID | `<class 'int'>` | 0.0 |
| freebase_movieID | `<class 'str'>` | 0.0 |
| name | `<class 'str'>` | 0.0 |
| release_date | `<class 'str'>` | 8.44 |
| box_office_revenue | `<class 'int'>` | 89.72 |
| runtime | `<class 'float'>` | 25.02 |
| languages | `<class 'str'>` | 0.0 |
| countries | `<class 'str'>` | 0.0 |
| genres | `<class 'str'>` | 0.0 |


We see that only a few movies ($8,401$) have information about their box office revenue. Moreover, not all the attributes we need are stored in the raw data. Since these are important metrics for our analysis, let's try to retrieve as much information as possible via API requests (procedure explained below).
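
A per-column summary of value class and share of missing values, like the one reported above, could be computed along the following lines. This is a sketch under assumptions, not the actual `insight` helper: the name `insight_sketch` is hypothetical, and the class is read from the first non-missing value of each column.

```python
import pandas as pd


def insight_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Report, for each column, the class of its values and the % of missing ones."""
    rows = {}
    for col in df.columns:
        non_null = df[col].dropna()
        rows[col] = {
            # Class of the first available value, None if the column is empty.
            "class": type(non_null.iloc[0]) if len(non_null) else None,
            "missing_values": round(100 * df[col].isna().mean(), 2),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy frame for illustration only.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "runtime": [98.0, None, 83.0, 110.0],
})
summary = insight_sketch(toy)
```

In the toy frame, one of the four `runtime` values is missing, so its `missing_values` entry is 25.0.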

## Clean/Enrich Data

To enrich our dataset, we use a free-to-use API ([TMDB](https://www.themoviedb.org/?language=fr)). Some important features are still missing after this procedure; we then drop the affected movies from our analysis.

A dedicated script is meant to be run once to create our `clean_enrich_movie.csv` dataset. See `enrich_data.py` and its linked library `tmdb/tmdb.py` for more details on how we retrieved this information. Please note that a personal API key is needed to run the script successfully ([create key](https://developer.themoviedb.org/reference/intro/getting-started)). Make sure to create a `.env` file containing your API bearer token, using `.env_example` as a template.
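
To give an idea of what an authenticated TMDB request can look like, here is a minimal sketch using only the standard library. It is not the code from `tmdb/tmdb.py`: the helper names, the environment variable name `TMDB_BEARER_TOKEN` and the rule "keep crew members whose job contains 'Composer'" are all assumptions made for illustration.

```python
import json
import os
import urllib.parse
import urllib.request

TMDB_API = "https://api.themoviedb.org/3"


def read_token(variable: str = "TMDB_BEARER_TOKEN") -> str:
    """Read the bearer token from the environment (hypothetical variable name)."""
    return os.environ[variable]


def tmdb_get(path: str, token: str, **params) -> dict:
    """GET a TMDB v3 endpoint, authenticating with the bearer token."""
    url = f"{TMDB_API}{path}"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}", "accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def composer_ids(credits: dict) -> list:
    """IDs of crew members whose job mentions 'Composer' (assumed filter)."""
    return [
        member["id"]
        for member in credits.get("crew", [])
        if "Composer" in member.get("job", "")
    ]


# Example usage (needs a valid token and network access, so it is left commented):
# credits = tmdb_get("/movie/19995/credits", read_token())
# ids = composer_ids(credits)
```

Separating the network call from the parsing keeps `composer_ids` easy to check on a saved response without contacting the API.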

In [4]:
# TODO update the markdown just above with the final API procedure. What we retrieve and so on.
# Load the cleaned and enriched data set (created via enrich_movie_data.py script)
enhanced_movie = pd.read_pickle('dataset/clean_enrich_movies.pickle')

# Display the loaded dataframe
display(enhanced_movie)

| | name | release_date | box_office_revenue | countries | genres | tmdb_id | composers |
|---|---|---|---|---|---|---|---|
| 0 | Avatar | 2009 | 2782275172 | [United States of America, United Kingdom] | [Thriller, Science Fiction, Adventure, Compute... | 19995 | [Composer(id=1729, name='James Horner', birthd... |
| 1 | Titanic | 1997 | 2185372302 | [United States of America] | [Tragedy, Costume drama, Historical fiction, A... | 597 | [Composer(id=1729, name='James Horner', birthd... |
| 2 | The Avengers | 2012 | 1511757910 | [United States of America] | [Science Fiction, Action] | 24428 | [Composer(id=37, name='Alan Silvestri', birthd... |
| 3 | Harry Potter and the Deathly Hallows – Part 2 | 2011 | 1328111219 | [United States of America, United Kingdom] | [Drama, Mystery, Fantasy, Adventure] | 12445 | [Composer(id=2949, name='Alexandre Desplat', b... |
| 4 | Transformers: Dark of the Moon | 2011 | 1123746996 | [United States of America] | [Alien Film, Science Fiction, Action, Adventure] | 38356 | [Composer(id=18264, name='Steve Jablonsky', bi... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 8322 | Frankie and Alice | 2010 | 10670 | [Canada] | [Biography, Drama] | 55061 | [Composer(id=70789, name='Andrew Lockington', ... |
| 8323 | Fighting Tommy Riley | 2005 | 10514 | [United States of America] | [LGBT, Sports, Drama, Indie, Boxing] | 47534 | |
| 8324 | Logan | 2010 | 10474 | [United States of America] | [Family Film, Drama, Comedy] | 44010 | |
| 8325 | GhettoPhysics | 2010 | 10200 | [] | [Political cinema, Documentary] | -1 | |


Let's take a deeper look at this new dataframe. First, let's look at the percentage of missing values now.

In [5]:
insight(enhanced_movie)

| | class | missing_values (%) |
|---|---|---|
| name | `<class 'str'>` | 0.0 |
| release_date | `<class 'str'>` | 0.0 |
| box_office_revenue | `<class 'int'>` | 0.0 |
| countries | `<class 'list'>` | 0.0 |
| genres | `<class 'list'>` | 0.0 |
| tmdb_id | `<class 'int'>` | 0.0 |
| composers | `<class 'list'>` | 30.92 |


We also display summary statistics to better understand the distribution of a few metrics in the data.

In [6]:
# TODO see if we can give better insight of data.
insight_enhance(enhanced_movie)

count              8327.0
mean      48269077.422241
std      112467320.603979
min               10000.0
25%             2105039.0
50%            10815378.0
75%            41050113.0
max          2782275172.0
Name: box_office_revenue, dtype: Float64

There is 30.92% of nan composers

Considering the first composer of the list if multiple have been returned for a movie, we can compute the following statistics on the retrieved data:

	 - There is 0.00% of nan name for composers
	 - There is 18.08% of nan birthday for composers
	 - There is 0.00% of nan gender for composers
	 - There is 72.24% of nan homepage for composers
	 - There is 22.93% of nan place of birth for composers
	 - There is 100.00% of nan first appearance in movie for composers


  composers_no_na_name = composers_no_na.agg(lambda c: c[0].name)
  composers_no_na_birthday = composers_no_na.agg(lambda c: c[0].birthday)
  composers_no_na_gender = composers_no_na.agg(lambda c: c[0].gender)
  composers_no_na_homepage = composers_no_na.agg(lambda c: c[0].homepage)
  composers_no_na_place_of_birth = composers_no_na.agg(lambda c: c[0].place_of_birth)
  composers_no_na_first_appearance_in_movie = composers_no_na.agg(lambda c: c[0].date_first_appearance)
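
The per-attribute percentages above are computed on the first composer of each movie. A self-contained sketch of that computation is given below. It is an approximation of what `insight_enhance` may do, not its actual code: the function name is hypothetical, `Series.map` is used in place of `agg`, and `SimpleNamespace` objects stand in for real `Composer` instances.

```python
from types import SimpleNamespace

import pandas as pd


def first_composer_nan_share(composers: pd.Series, attribute: str) -> float:
    """Percentage of first-listed composers whose `attribute` is missing."""
    # Drop movies without any composer information.
    composers_no_na = composers.dropna()
    # Assumes every remaining list holds at least one composer.
    values = composers_no_na.map(lambda c: getattr(c[0], attribute))
    return 100 * values.isna().mean()


# Toy stand-ins for Composer objects, for illustration only.
toy = pd.Series([
    [SimpleNamespace(name="A", homepage=None)],
    [SimpleNamespace(name="B", homepage="https://example.org")],
    None,  # movie without composer information
])
share = first_composer_nan_share(toy, "homepage")  # 50.0
```

An empty composer list would raise an `IndexError` in this sketch, so such rows would need to be filtered out beforehand.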


## Results Expected

Let's try to explain the main plots we want to bring to our readers. Do we also have in mind an interactive plot, and so on?

In [7]:
pass

In [8]:
import pandas as pd

# Load the two data sets to compare (with and without enrichment)
test = pd.read_csv('test')
test_without = pd.read_csv('test_without')

# Count the movies whose box office revenue is missing in each data set
nb_test = test['box_office_revenue'].isna().sum()
nb_test_without = test_without['box_office_revenue'].isna().sum()
print(f'Number of movies without box office revenue : {nb_test}')
print(f'Number of movies without box office revenue (without enrich) : {nb_test_without}')

FileNotFoundError: [Errno 2] No such file or directory: 'test'