In [None]:
# Copyright (C) 2020 Artefact
# licence-information@artefact.com

# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU Affero General Public License for more details.

# You should have received a copy of the GNU Affero General Public License
# along with this program.  If not, see <https://www.gnu.org/licenses/>.

<table><tr>
<td> <img src="https://upload.wikimedia.org/wikipedia/fr/thumb/e/e5/Logo_%C3%A9cole_des_ponts_paristech.svg/676px-Logo_%C3%A9cole_des_ponts_paristech.svg.png" width="200"  height="200" hspace="200"/> </td>
<td> <img src="https://pbs.twimg.com/profile_images/1156541928193896448/5ihYIbCQ_200x200.png" width="200" height="200" /> </td>
</tr></table>

<br/>

<h1><center>Session 9 - Data visualization</center></h1>



<font size="3">The goal of this session is to discover insights in the data through visualizations. It is divided into **2** parts:
- **Exploration of raw data**
- **Restitution of prediction results**

In each part, some **guidelines** and **hints** will be given, but you are free to make any graph that helps you better understand the dataset.
</font>

# Useful libraries

**[OPTIONAL]** If you first want to create a virtual env with Conda and use it in your Notebook, you can run the following commands in a Terminal:
- conda create -n ENV_NAME python=3.8
- conda activate ENV_NAME
- conda install pip
- ipython kernel install --name ENV_NAME --user

If you do not have the needed libraries already installed please run the following commands in a Terminal:
- pip install pandas==1.2.3
- pip install numpy==1.18.1
- pip install matplotlib==3.1.3
- pip install seaborn==0.10.0
- pip install plotly==4.8.2
- conda install -c plotly plotly_express [OR] pip install plotly-express==0.4.1

In [3]:
import pandas as pd
import numpy as np
import json
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px

%matplotlib inline

In [4]:
import warnings
warnings.filterwarnings('ignore')

# Dataset introduction

We will use the movies dataset from the previous sessions, with only a limited number of variables.
A column ``sales_predicted`` has been added: it contains the number of sales predicted for each movie by a machine learning model.

Our goal will be to explore this dataset and analyze the performance of the given predictions.

In [5]:
df = pd.read_csv('data_visualisation.csv')
print(df.shape)
df.head()

(6676, 12)


Unnamed: 0,release_date,year,month,budget,runtime,genre,prod,original_lang,is_part_of_collection,nb_movie_collection,sales,sales_predicted
0,2000-01-01,2000,1,21182915.0,120.0,Comédie,FR,fr,0,0.0,139087,142667.0
1,2000-01-05,2000,1,22000000.0,142.0,Action,US,en,0,0.0,66228,73028.0
2,2000-01-05,2000,1,20098347.0,77.0,Drame,OTHER,es,0,0.0,1463152,1532683.0
3,2000-01-05,2000,1,22985221.0,116.0,Drame,CA,en,0,0.0,32954,35909.0
4,2000-01-12,2000,1,40000000.0,99.0,Action,FR,en,1,2.0,223564,215181.0


# Part 1 - Exploration of raw data

The objective of this section is to build visualizations based on raw data to get insights from the movies in the dataset. You will use Matplotlib, Seaborn and Plotly to achieve this goal. Here are galleries that show the kind of plots you can make with each library:
- [Matplotlib](https://matplotlib.org/stable/gallery/index.html)
- [Seaborn](https://seaborn.pydata.org/examples/index.html)
- [Plotly](https://plotly.com/python/plotly-express/)

The variables available in the dataset have different types: for example, ``genre`` is categorical and ``budget`` is numerical. The kinds of graphs you can make depend on the variable types. In this section, we will plot both categorical and numerical features.

Do not hesitate to check the documentation to understand how to build the different kinds of plots.

**Categorical :**
- Bar plot : [Matplotlib](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html), [Seaborn](https://seaborn.pydata.org/generated/seaborn.barplot.html), [Plotly](https://plotly.com/python/bar-charts/)
- Pie chart : [Matplotlib](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.pie.html), [Plotly](https://plotly.com/python/pie-charts/)
- Sunburst : [Plotly](https://plotly.com/python/sunburst-charts/)

**Numerical :**
- Distribution : [Matplotlib](https://matplotlib.org/stable/gallery/statistics/hist.html), [Seaborn](https://seaborn.pydata.org/tutorial/distributions.html), [Plotly](https://plotly.com/python/distplot/)
- Violin plot : [Matplotlib](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.violinplot.html), [Seaborn](https://seaborn.pydata.org/generated/seaborn.violinplot.html), [Plotly](https://plotly.com/python/violin/)
- Box plot : [Matplotlib](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html), [Seaborn](https://seaborn.pydata.org/generated/seaborn.boxplot.html), [Plotly](https://plotly.com/python/box-plots/)
- 2D density plot : [Seaborn](https://seaborn.pydata.org/generated/seaborn.jointplot.html), [Plotly](https://plotly.com/python/v3/density-plots/)

**Visualizing many variables on the same graph :**
- Scatter plot : [Matplotlib](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html), [Seaborn](https://seaborn.pydata.org/generated/seaborn.scatterplot.html), [Plotly](https://plotly.com/python/line-and-scatter/)
- Heat map : [Seaborn](https://seaborn.pydata.org/generated/seaborn.heatmap.html), [Plotly](https://plotly.com/python/heatmaps/)

You can choose the color palette for your plots for [Matplotlib](https://matplotlib.org/stable/gallery/color/colormap_reference.html), [Seaborn](https://seaborn.pydata.org/tutorial/color_palettes.html) and [Plotly](https://plotly.com/python/discrete-color/).

## Part 1 instructions
- For each exercise, think about the data at your disposal and choose what you want to plot: what are you looking for in your exploration?
- Choose the most appropriate graph for your needs
- Write down any insight you may find
- Feel free to add other variables to your analysis at any time

**We recommend the use of Seaborn for data exploration. Matplotlib is also a good option. Plotly can be a little more complex to implement but provides interactive graphs that you can experiment with if you have extra time.**

## Exercise 1: Exploring a categorical feature
The objective of this section is to choose and build an appropriate graph for a categorical feature.

Here are some suggestions:
- What is the composition of movie genres in your dataframe?
- What about the production country, or the language of the movie?

What kind of insights can you draw from your graphs?

For example, can you tell if there is a dominant country of production?
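As one possible starting point (a minimal sketch, not the expected solution), the snippet below counts categories and draws an ordered bar plot. It uses a tiny hand-made dataframe named `toy` with invented values so that it runs on its own; with the course data you would pass `df` instead.

```python
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# Toy stand-in for the course dataset (invented values, same column names)
toy = pd.DataFrame({
    "genre": ["Drame", "Action", "Drame", "Comédie", "Drame", "Action"],
    "prod": ["FR", "US", "FR", "FR", "OTHER", "US"],
})

# Share of movies per production country, largest first
prod_share = toy["prod"].value_counts(normalize=True)
print(prod_share)

# Bar plot of the number of movies per genre, ordered by frequency
fig, ax = plt.subplots(figsize=(6, 4))
sns.countplot(data=toy, x="genre", order=toy["genre"].value_counts().index, ax=ax)
ax.set_title("Number of movies per genre (toy data)")
ax.set_xlabel("Genre")
ax.set_ylabel("Number of movies")
```

Ordering the bars by frequency usually makes a dominant category easier to spot than the default order of appearance.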

In [1]:
# Create at least one graph to explore a categorical feature



## Exercise 2: Exploring a numerical feature

The objective of this section is to choose and build an appropriate graph for a numerical feature.

Here are some suggestions:
- How is the amount of sales distributed?
- What about the duration of the movies?

What kind of insights can you draw from your graphs?

Feel free to cross your analyses with other variables.

For example, can you tell if the distribution of budgets is different depending on the country of production?
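The question about budgets per country could be approached as in the sketch below, which pairs a histogram of one numerical feature with a box plot split by a categorical one. The dataframe `toy` and its values are invented for illustration; the log scale is an assumption that budgets span several orders of magnitude, which you should check on the real data.

```python
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# Toy stand-in for the course dataset (invented values, same column names)
toy = pd.DataFrame({
    "prod": ["FR", "FR", "FR", "US", "US", "US"],
    "budget": [2e6, 5e6, 2.1e7, 2.2e7, 4e7, 1.5e8],
    "runtime": [95.0, 120.0, 77.0, 142.0, 99.0, 116.0],
})

# Summary statistics per country, to back up what the plot suggests
budget_stats = toy.groupby("prod")["budget"].describe()
print(budget_stats[["count", "50%", "max"]])

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Left: distribution of a single numerical feature
ax_hist.hist(toy["runtime"], bins=5)
ax_hist.set_title("Runtime distribution (toy data)")
ax_hist.set_xlabel("Runtime (minutes)")
ax_hist.set_ylabel("Number of movies")

# Right: budget distribution per production country, on a log scale
sns.boxplot(data=toy, x="prod", y="budget", ax=ax_box)
ax_box.set_yscale("log")
ax_box.set_title("Budget per production country (toy data)")
```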

In [10]:
# Create at least one graph to explore a numerical feature



## Exercise 3: Combining multiple variables

The objective of this section is to focus on graphs combining at least 2 different variables (of any type).

What interesting insights can you find by combining multiple features on your graphs?

Here are some suggestions:
- Evolution of a numerical variable over time
- Composition of a categorical variable over time
- An insightful graph combining 3 or more variables while staying easy to understand
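For the "evolution over time" suggestion, one option is to aggregate first and plot second, as in this sketch. It combines three variables (`year`, `genre`, `sales`) using a pivot table; the dataframe `toy` and its values are invented so the snippet is self-contained.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Toy stand-in for the course dataset (invented values, same column names)
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "genre": ["Drame", "Action", "Drame", "Drame", "Action", "Action"],
    "sales": [100_000, 300_000, 200_000, 120_000, 500_000, 700_000],
})

# One row per year, one column per genre, mean sales in each cell
sales_by_year_genre = toy.pivot_table(
    index="year", columns="genre", values="sales", aggfunc="mean"
)
print(sales_by_year_genre)

# One line per genre: time on the x axis, mean sales on the y axis
ax = sales_by_year_genre.plot(marker="o", figsize=(7, 4))
ax.set_xticks(list(sales_by_year_genre.index))
ax.set_title("Mean sales per year and genre (toy data)")
ax.set_xlabel("Year")
ax.set_ylabel("Mean sales")
```

The same pivoted table can also feed a heat map when there are too many categories for readable lines.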

In [11]:
# Create at least one insightful graph combining multiple features.



## [OPTIONAL] To go further

If you've finished early, try to build interactive graphs with Plotly.

In [12]:
# Feel free to experiment with Plotly to build interactive visuals



# Part 2 - Restitution of model results

The objective of this section is to analyse the performance of the machine learning model that generated predictions in the ``sales_predicted`` column.

**IMPORTANT NOTE**: When exploring data, it's not uncommon to create graphs with no obvious insights. However, when you are sharing your results, you need to make sure that each graph carries a clear message.

The quality of each prediction can be described by its **percentage error** or its **absolute percentage error**. To quantify model performance, the idea is to aggregate these metrics on various subsets of the dataset.

To save time, we already computed some metrics for you to analyze your results, but feel free to add more if you have enough time.
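To illustrate what "aggregating on a subset" can mean, the sketch below computes both error metrics on a tiny invented dataframe `toy` and averages them per genre. The error definition follows the one used in this notebook (actual minus predicted, divided by actual); the aggregate column names `mean_percent_error` and `mape` are illustrative choices.

```python
import pandas as pd

# Toy stand-in for the course dataset (invented values, same column names)
toy = pd.DataFrame({
    "genre": ["Drame", "Drame", "Action", "Action"],
    "sales": [100, 200, 400, 500],
    "sales_predicted": [110.0, 180.0, 400.0, 600.0],
})

# Per-movie errors: sign shows direction, absolute value shows magnitude
toy["percent_error"] = (toy["sales"] - toy["sales_predicted"]) / toy["sales"]
toy["abs_percent_error"] = toy["percent_error"].abs()

# Aggregate per genre: mean signed error (bias) and mean absolute error (MAPE)
error_by_genre = toy.groupby("genre").agg(
    n_movies=("sales", "size"),
    mean_percent_error=("percent_error", "mean"),
    mape=("abs_percent_error", "mean"),
)
print(error_by_genre)
```

With this sign convention, a negative mean percentage error indicates that the model tends to over-predict sales for that subset, while the mean absolute percentage error measures how far off predictions are regardless of direction. Signed errors can cancel out, so the two metrics are worth reading together.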

In [7]:
# Compute prediction errors
df['percent_error'] = (df['sales'] - df['sales_predicted']) / df['sales']
df['abs_percent_error'] = abs(df['percent_error'])

In [14]:
# Create as many graphs as you want to give the most insight into the performance of the predictions.
# Evaluate the performance globally, but also look for interesting insights by crossing variables.

