# Introduction

In this laboratory, you will get acquainted with the basic visualization techniques for Exploratory Data Analysis (EDA). You will work with charts to
- analyze a data distribution, such as **histograms**, **boxplots** and **violin plots**
- explore relationships and (linear) correlations in your data, such as **scatterplots** or **heatmaps**
- visualize the trend and some salient information of a **time series**.

Choosing the right type of chart for the data at hand is a never-ending job. Some [communities](https://www.reddit.com/r/dataisbeautiful/) have taken it seriously for a while. You will soon find yourself eager to try many approaches: this laboratory will help you get started.

## Structure

Each main section of the notebook introduces one of the chart categories listed above. Within each section, you will find a brief description of the type of plot you are required to build, along with a short snippet of code that produces the chart on synthetic data.

The whole notebook is written in **Python**. The code level is introductory: if you are a Python master, bear with me.

For your convenience, some parts of the notebook are pre-filled: we wrote the boilerplate code for you. Although the notebook is sequential, each section is self-contained: feel free to skip back and forth if you are more interested in specific parts.
  
---
 
So, to recap. Let's say your client provides you and your team with a large set of raw, unprocessed data, and you are the one in charge of a first EDA pass. And you happen to know Python.

![Data Viz](https://venngage-wordpress.s3.amazonaws.com/uploads/2020/06/image17.png)

## Before we start: Python & Co.

Python is the *de facto standard* ecosystem for data scientists and practitioners. If I had to summarize its advantages over other languages like R, or even some GUI-enabled tools, I would say:

- a large, well-documented standard library, and a *huge* number of third-party libraries for virtually *everything* you need
- a large, supportive community with a lot of resources
- modern Python keeps the syntax to a minimum. For data science and visualization experiments, the Python code reads almost like *plain English*
- a series of online tools that let you experiment with the language: you only need a Google Account to create [Colab](https://colab.research.google.com/notebook) Notebooks, and the same applies to [Kaggle](https://www.kaggle.com/), [Deepnote](https://deepnote.com/), and many more.

In this laboratory, we will use three popular libraries: [NumPy](https://numpy.org/devdocs/index.html), [pandas](https://pandas.pydata.org/docs/) and [seaborn](https://seaborn.pydata.org/). Let's run the cell below to check and install all the dependencies for these libraries (do not worry about warnings).

In [4]:
!pip install --quiet --user numpy
!pip install --quiet --user pandas
!pip install --quiet --user seaborn

Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.


#### Imports

We can now import them, giving each a short alias:

In [13]:
import numpy as np
import pandas as pd
import seaborn as sns

# apply seaborn's default theme, with the "notebook" plotting context, to all the charts in the notebook
sns.set_theme(context="notebook")
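
If you want to double-check that the installation went through, one option is to ask Python which versions it can see. The snippet below is only a sketch: the helper name `installed_versions` is made up for this notebook, and it relies on `importlib.metadata` from the standard library (Python 3.8 or newer).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("numpy", "pandas", "seaborn")):
    """Return a {package: version-or-None} mapping (hypothetical helper)."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # the package is not installed in this environment
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

A package reported as "not installed" usually means the install cell ran against a different Python environment than the one running the notebook.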

### Data Manipulation

NumPy and pandas let you easily handle numerical arrays and tabular data, respectively.
Simply put, NumPy (or numpy in the following) is your best companion for the Geometry class, while pandas gives you table-like access to your data (with index and column names, and so on). "Tables" in pandas are known as DataFrames.

They are thoroughly covered in the course *Data Science Lab: process and methods* at Politecnico di Torino. If you are interested, feel free to visit the [course website](https://dbdmg.polito.it/wordpress/teaching/data-science-lab-process-and-methods-2020-2021/) for slides and laboratories.

To give a practical example, let's see how we create a simple array with the integers from 0 to 99 in numpy:

In [8]:
np.arange(100)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

or how we create a table-like dataset with pandas:

In [12]:
pd.DataFrame({"Age": [22, 43, 19], "Gender": ["Male", "Female", "Female"]})

|   | Age | Gender |
|---|-----|--------|
| 0 | 22  | Male   |
| 1 | 43  | Female |
| 2 | 19  | Female |
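
Creating the objects is only half of the story. As a minimal sketch of what "handling" them looks like, the cell below applies a few common operations to the same toy data: element-wise arithmetic and reshaping in numpy, then column access, filtering and grouping in pandas. The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# NumPy: arithmetic is applied element-wise, with no explicit loop
values = np.arange(10)            # integers 0..9
squares = values ** 2             # element-wise power
grid = values.reshape(2, 5)       # same data arranged as 2 rows x 5 columns
column_means = grid.mean(axis=0)  # mean of each column

# pandas: columns are accessed by name, rows are filtered with boolean masks
people = pd.DataFrame(
    {"Age": [22, 43, 19], "Gender": ["Male", "Female", "Female"]}
)
mean_age = people["Age"].mean()                              # average of one column
women = people[people["Gender"] == "Female"]                 # keep matching rows only
mean_age_by_gender = people.groupby("Gender")["Age"].mean()  # one value per group

print(column_means)
print(mean_age_by_gender)
```

Most of the exercises rely on these same few moves: pick a column, filter some rows, aggregate by group.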


### Data Visualization

The most widely used library for data visualization in Python is [matplotlib](https://matplotlib.org/). However, in this laboratory, you will use **seaborn**.
Seaborn is a matplotlib wrapper which provides simpler-to-use, high-level APIs to produce charts with better quality and lower effort.

## Datasets

You will deal with two different types of data.

First, we will focus on tabular data, with the [Stroke Prediction Dataset](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset). The dataset contains

Next, we will do some finance. We collected the stock prices of eight publicly traded companies, namely Amazon (AMZN), Apple (AAPL), Alphabet Inc (GOOG), Microsoft (MSFT), Johnson & Johnson (JNJ), Pfizer (PFE), Sanofi (SNY), and AstraZeneca (AZN). Not by coincidence, four of them belong to the tech sector, and the rest to healthcare.
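
If you have never handled price data, the sketch below shows the typical shape of such a table in pandas: one row per trading day, one column per company. Everything here is synthetic. The tickers `AAA` and `BBB` are made up, and the prices are random walks, not real market data or the layout of the files used later.

```python
import numpy as np
import pandas as pd

# synthetic prices, NOT real market data: random walks for two made-up tickers
rng = np.random.default_rng(seed=42)
dates = pd.date_range("2021-01-04", periods=60, freq="B")  # business days only
steps = rng.normal(loc=0.0, scale=1.0, size=(60, 2))
prices = pd.DataFrame(
    100 + steps.cumsum(axis=0), index=dates, columns=["AAA", "BBB"]
)

# a 5-day rolling mean smooths out daily noise and highlights the trend
trend = prices.rolling(window=5).mean()

# day-over-day relative change, a common first look at volatility
daily_returns = prices.pct_change()

print(prices.head())
```

The first four rows of `trend` and the first row of `daily_returns` are empty (NaN), since there is not enough history yet to compute them.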

Run the cell below to download and extract the datasets.

In [17]:
!wget -q https://dbdmg.polito.it/wordpress/wp-content/uploads/2021/05/datasets_Data_Theory_Python.zip -O datasets.zip
!unzip -qu datasets.zip
!rm datasets.zip

# Exercise 1. Tabular Data: Stroke Data

# Exercise 2. Data Correlation

# Exercise 3. Time series

# Conclusion (and what we have left behind)

# Bonus #1. Dimensionality Reduction

# Bonus #2: Interactivity

# Credits

This notebook was created by [Giuseppe Attanasio](https://gattanasio.cc), PhD student @ Politecnico di Torino.
You are free to download, edit and publish newer versions of the notebook.

Credit to *fedesoriano* for sharing the Stroke Prediction Dataset on Kaggle.
Ticker data is retrieved from Yahoo Finance.

*v1: 02/05/2021*
