# Data Visualization

![Data Science Workflow](img/ds-workflow.png)

## Data Visualization

Key skill today
> *“The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - that’s going to be a hugely important skill in the next decades.”*

[Hal Varian (Google’s Chief Economist)](https://en.wikipedia.org/wiki/Hal_Varian)

## Data Visualization for a Data Scientist
1. **Data Quality**: Explore data quality including identifying outliers
2. **Data Exploration**: Understand the data by visualizing it
3. **Data Presentation**: Present results

## The power of Data Visualization

### Consider the following data
- What is the connection?
- Do you see any patterns?

### Visualizing the same data
- Let's try to visualize the data

[Matplotlib](https://matplotlib.org) is an easy-to-use visualization library for Python.

In a notebook, you get started with:
```Python
import matplotlib.pyplot as plt
%matplotlib inline
```

### What Data Visualization gives
- Absorb information quickly
- Improve insights
- Make faster decisions

## Data Quality
### Is the data quality usable?

Consider the dataset: `files/sample_height.csv`

#### Check for missing values
[`isna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html)[`.any()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html): Checks each column for missing values and returns `True` for every column that has at least one
```Python
data.isna().any()
```
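A minimal sketch of the check on a small made-up DataFrame. The column name `height` and the values are assumptions, used only so the example runs without `files/sample_height.csv`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for files/sample_height.csv; the column name is an assumption.
data = pd.DataFrame({"height": [172.0, 181.5, np.nan, 165.2]})

# One boolean per column: True if the column has at least one missing value.
missing = data.isna().any()

# Counting the missing values per column shows how big the problem is.
missing_count = data.isna().sum()

print(missing)
print(missing_count)
```

`isna().sum()` is a common follow-up: it tells you whether one value or half the column is missing.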

#### Visualize data
- Notice: you need to know something about the data
- We know that it is heights of humans in centimeters
- This could be checked with a histogram
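A sketch of how such a check could look. The heights below are made up, and two of them are deliberately entered in meters to show how a unit mistake stands out in a histogram.

```python
import pandas as pd

# Made-up heights; the last two look like meters, not centimeters.
data = pd.DataFrame({"height": [172.0, 181.5, 165.2, 190.1, 158.7, 1.78, 1.65]})

# Values far away from the main group show up as separate bars.
ax = data["height"].plot.hist(bins=20, title="Heights (cm)")

# Domain knowledge as a filter: adult heights below 100 cm are suspicious.
suspicious = data[data["height"] < 100]
print(suspicious)
```

The threshold of 100 cm is an arbitrary choice for this illustration; the point is that you need to know what the values represent to judge them.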

### Identifying outliers

Consider the dataset: `files/sample_age.csv`

#### Visualize with a histogram
- This gives fast insights

#### Describe the data
[`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html): Generates descriptive statistics (count, mean, standard deviation, min, quartiles, max) for the DataFrame
```Python
data.describe()
```
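A small sketch of how `describe()` can point at outliers. The ages are made up and the column name `age` is an assumption; the implausible value is planted on purpose.

```python
import pandas as pd

# Made-up ages; 180 is planted as an obvious outlier.
data = pd.DataFrame({"age": [25, 31, 42, 38, 29, 180, 35, 27]})

# The min and max rows are the quickest place to spot implausible values.
stats = data.describe()
print(stats)

# Rows outside a plausible range (an assumption for this example) can then be inspected.
outliers = data[(data["age"] < 0) | (data["age"] > 120)]
```

Whether an outlier is an error or a real observation still has to be decided from what you know about the data.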

## Data Exploration

### Data Visualization
- Absorb information quickly
- Improve insights
- Make faster decisions

### World Bank
The [World Bank](https://www.worldbank.org/en/home) is a great source of datasets

#### CO2 per capita
- Let's explore this dataset [EN.ATM.CO2E.PC](https://data.worldbank.org/indicator/EN.ATM.CO2E.PC)
- Already available here: `files/WorldBank-ATM.CO2E.PC_DS2.csv`

#### Explore typical Data Visualizations
- Simple plot
- Set title
- Set labels
- Adjust axis

#### Read the data
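A minimal sketch of reading the data, assuming the CSV has a `Year` column followed by one column per country code. The inline text is a made-up stand-in so the example runs without the file; with the real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Made-up stand-in for files/WorldBank-ATM.CO2E.PC_DS2.csv.
# The layout (a Year column plus one column per country code) is an assumption,
# and the numbers are invented for illustration.
csv_text = """Year,USA,WLD
2015,15.0,4.4
2016,14.8,4.4
2017,14.6,4.5
"""

# index_col=0 uses the first column (the years) as the row index.
data = pd.read_csv(io.StringIO(csv_text), index_col=0)

# A quick look at the first rows confirms the data was parsed as expected.
print(data.head())
```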

#### Simple plot
- ```.plot()``` Creates a simple plot of data
- This gives you an idea of the data
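As a sketch, a single column can be plotted like this. The values are made up for illustration and are not the World Bank figures.

```python
import pandas as pd

# Made-up CO2 per capita values; the real data comes from the World Bank file.
years = pd.Index(range(2013, 2018), name="Year")
data = pd.DataFrame(
    {"USA": [16.3, 16.5, 16.0, 15.5, 15.2], "WLD": [4.6, 4.5, 4.5, 4.5, 4.5]},
    index=years,
)

# Select one column (a Series) and draw it as a line against the index.
ax = data["USA"].plot()
```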

#### Adding title and labels
Arguments
- ```title='Title'``` adds the title
- ```xlabel='X label'``` adds or changes the X-label
- ```ylabel='Y label'``` adds or changes the Y-label

#### Adding axis range
- ```xlim=(min, max)``` or ```xlim=min``` Sets the x-axis range
- ```ylim=(min, max)``` or ```ylim=min``` Sets the y-axis range
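The title, label, and axis range arguments can be combined in one call. A sketch with made-up values, where the label texts are only examples:

```python
import pandas as pd

# Made-up CO2 per capita values for illustration.
years = pd.Index(range(2010, 2018), name="Year")
usa = pd.Series([17.4, 17.0, 16.3, 16.3, 16.5, 16.0, 15.5, 15.2], index=years)

ax = usa.plot(
    title="CO2 per capita in USA",
    xlabel="Year",
    ylabel="CO2 (metric tons per capita)",
    xlim=(2012, 2017),  # a tuple sets both ends of the axis
    ylim=0,             # a single value sets only the lower end
)
```

Starting the y-axis at 0 is a deliberate choice here: a truncated axis can make a small decline look dramatic.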

### Comparing data
- Explore **USA** and **WLD**

#### Set the figure size
- ```figsize=(width, height)``` in inches
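A sketch of the comparison, assuming the DataFrame has `USA` and `WLD` columns. The values are made up.

```python
import pandas as pd

# Made-up CO2 per capita values for illustration.
years = pd.Index(range(2013, 2018), name="Year")
data = pd.DataFrame(
    {"USA": [16.3, 16.5, 16.0, 15.5, 15.2], "WLD": [4.6, 4.5, 4.5, 4.5, 4.5]},
    index=years,
)

# Selecting two columns draws one line per column in the same chart.
# figsize is (width, height) in inches.
ax = data[["USA", "WLD"]].plot(figsize=(12, 6), ylim=0)
```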

### Bar plot
- ```.plot.bar()``` Create a bar plot

### Plot a range of data
- ```.loc[from:to]``` applied to the DataFrame selects a range of rows by label (both ends inclusive)
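A sketch that combines a bar plot with a label range, using made-up values indexed by year:

```python
import pandas as pd

# Made-up CO2 per capita values for illustration.
years = pd.Index(range(2010, 2018), name="Year")
usa = pd.Series([17.4, 17.0, 16.3, 16.3, 16.5, 16.0, 15.5, 15.2], index=years)

# .loc slices by label, so 2014 and 2017 are both included.
recent = usa.loc[2014:2017]

# One bar per selected year.
ax = recent.plot.bar()
```

Unlike positional slicing with `.iloc` or plain Python lists, the end label is part of the result.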

### Histogram
- ```.plot.hist()``` Create a histogram
- ```bins=<number of bins>``` Specify the number of bins in the histogram.
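A short sketch of a histogram with an explicit number of bins. The values are made up and stand in for one year of CO2 per capita across countries.

```python
import pandas as pd

# Made-up CO2 per capita values for a handful of hypothetical countries.
co2 = pd.Series([0.5, 0.9, 1.2, 2.1, 4.8, 7.3, 9.9, 15.2, 19.6])

# bins controls how finely the value range is divided.
ax = co2.plot.hist(bins=5)
```

Too few bins hide structure and too many make the chart noisy, so it is worth trying a couple of values.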

### Pie chart
- ```.plot.pie()``` Creates a Pie Chart

### Value counts and pie charts
- A simple chart of values above/below a threshold
- ```.value_counts()``` Counts occurrences of values in a Series (or DataFrame column)
- A few arguments to ```.plot.pie()```
    - ```colors=<list of colors>```
    - ```labels=<list of labels>```
    - ```title='<title>'```
    - ```ylabel='<label>'```
    - ```autopct='%1.1f%%'``` sets percentages on chart
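A sketch of the threshold idea: count how many values fall below a limit and show the split as a pie chart. The values, the threshold of 5, and the label texts are all assumptions made for this example.

```python
import pandas as pd

# Made-up CO2 per capita values for a handful of hypothetical countries.
co2 = pd.Series([0.5, 1.2, 4.8, 7.3, 9.9, 15.2, 2.1, 0.9])

# Boolean Series: True where the value is below the (arbitrary) threshold.
below = co2 < 5

# Count True/False and fix the order so labels and colors line up.
counts = below.value_counts().reindex([True, False])

ax = counts.plot.pie(
    colors=["green", "red"],
    labels=["Below 5", "5 or above"],
    title="CO2 per capita",
    ylabel="",          # hide the default axis label
    autopct="%1.1f%%",  # print the percentage on each slice
)
```

Reindexing is there because `value_counts()` sorts by frequency, so without it the order of the slices depends on the data.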

### Scatter plot
- Assume we want to investigate if GDP per capita and CO2 per capita are correlated
    - Data available in *'files/co2_gdp_per_capita.csv'*
- ```.plot.scatter(x=<label>, y=<label>)``` Create a scatter plot
- ```.corr()``` Compute pairwise correlation of columns ([docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html))
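A sketch of both steps on made-up numbers. The column names are assumptions and may differ from those in `files/co2_gdp_per_capita.csv`.

```python
import pandas as pd

# Made-up values for illustration; the column names are assumptions.
data = pd.DataFrame(
    {
        "GDP per capita": [1200, 4500, 9800, 21000, 38000, 52000],
        "CO2 per capita": [0.4, 1.5, 3.9, 6.2, 9.1, 14.3],
    }
)

# Each row becomes one point; a trend in the cloud hints at a correlation.
ax = data.plot.scatter(x="GDP per capita", y="CO2 per capita")

# Pairwise Pearson correlation between all numeric columns.
corr = data.corr()
print(corr)
```

A value close to 1 or -1 indicates a strong linear relationship, but it does not say anything about cause and effect.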

## Data Presentation
- This is about making data easy to digest

### The message
Assume we want to give a picture of how US CO2 per capita compares to the rest of the world

#### Preparation
- Let's take 2017 (as more recent data is incomplete)
- What are the mean, max, and min CO2 per capita in the world?

#### And in the US?

#### How can we tell a story?
- US is above the mean
- US is not the max
- US is above the 75th percentile
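The three statements can be checked directly against the summary statistics. A sketch with made-up 2017 values, where `C1` to `C7` are hypothetical country codes:

```python
import pandas as pd

# Made-up 2017 values; C1..C7 are hypothetical country codes.
year_2017 = pd.Series(
    {"USA": 14.8, "C1": 32.2, "C2": 8.9, "C3": 7.2,
     "C4": 1.8, "C5": 2.2, "C6": 0.6, "C7": 4.6}
)

# describe() returns mean, min, max and the quartiles in one go.
stats = year_2017.describe()
usa = year_2017["USA"]

above_mean = usa > stats["mean"]  # US is above the mean
is_max = usa == stats["max"]      # US is not the max
above_q75 = usa > stats["75%"]    # US is above the 75th percentile

print(above_mean, is_max, above_q75)
```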

#### Some more advanced matplotlib
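One possible way to turn the story into a chart is a histogram with the mean marked and an arrow pointing at the US value. This is only a sketch: the values are made up, `C1` to `C7` are hypothetical country codes, and the colors and texts are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up 2017 values; C1..C7 are hypothetical country codes.
year_2017 = pd.Series(
    {"USA": 14.8, "C1": 32.2, "C2": 8.9, "C3": 7.2,
     "C4": 1.8, "C5": 2.2, "C6": 0.6, "C7": 4.6}
)

# Creating the figure and axes explicitly gives full control over the chart.
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(year_2017, bins=8, color="lightgray", edgecolor="white")

# Mark the mean of the values with a dashed vertical line.
mean = year_2017.mean()
ax.axvline(mean, color="black", linestyle="--", label="Mean")

# Point an arrow at the bar that contains the US value.
ax.annotate(
    "USA",
    xy=(year_2017["USA"], 1),
    xytext=(year_2017["USA"], 2),
    ha="center",
    arrowprops={"arrowstyle": "->"},
)

ax.set_title("CO2 per capita in 2017")
ax.set_xlabel("CO2 per capita")
ax.set_ylabel("Number of countries")
ax.legend()
```

Muted colors for the background data and a single annotation keep the reader's attention on the message.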

### Creative story telling with data visualization

Check out this video: https://www.youtube.com/watch?v=jbkSRLYSojo