In [None]:
# Run before lecture to load datasets and do simple prep
import altair as alt #plotting
import pandas as pd #data wrangling/plotting
import numpy as np 
alt.data_transformers.disable_max_rows()


#Mauna Loa
co2_df = pd.read_csv("data/co2_df.csv",index_col=[0])
co2_df['date'] = pd.to_datetime(co2_df['date']).dt.floor('d')


#Top 12 Island landmasses

islands_df = pd.read_csv(
    "data/islands.csv",
    names = ['landmass', 'size'],
    skiprows=1
)

continents = ['Africa', 'Antarctica', 'Asia', 'Australia', 'Europe', 'North America', 'South America']

islands_df['landmass_type'] = np.where(islands_df['landmass'].isin(continents),'continent','other')


#gapminder

gapminder = pd.read_csv("data/gapminder.csv")
gapminder_filter = gapminder.loc[:,["country", "year", "continent", "life_expectancy"]]
gapminder_2016=gapminder_filter[gapminder_filter["year"]==2016].reset_index()

#mtcars

mtcars= pd.read_csv("data/mtcars.csv",index_col=0)


#diamonds
diamonds = pd.read_csv("data/diamonds.csv",index_col=0)

# DSCI 100 - Introduction to Data Science


## Lecture 4 - Data visualization in Python


**Attribution:** images in these slides that are not accompanied by code mostly come from <br>[The Fundamentals of Data Visualization by Claus O. Wilke](https://clauswilke.com/dataviz/)

<img src="img/visual-data-exploration_modified.png" width="500"/>

*Artwork modified from original by @allison_horst*

# Housekeeping


### Today: Visualization  


<center>
<img src="https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png" width="1200"/>
</center>

*image source: [R for Data Science](https://r4ds.had.co.nz/) by Grolemund & Wickham*

### Designing a visualization: ask a question, then answer it

The purpose of a visualization is to *answer a question* about a dataset of interest.

A good visualization answers the question clearly. A *great* visualization also hints at the question itself.

Visualizations alone help us answer two types of questions:

- **descriptive:** What are the largest 7 landmasses on Earth?
- **exploratory:** Is there a relationship between penguin body mass and bill length?
- ~~inferential~~
- ~~predictive~~
- ~~causal~~
- ~~mechanistic~~

(we need more tools + visualizations to answer the others)

- Descriptive: a question that asks about summarized characteristics of a data set without interpretation (i.e., report a fact). (describe characteristics)

- Exploratory: a question that asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. (discovery of ideas and thoughts)

- Inferential: a question that asks whether an association observed in your exploratory analysis holds in a different sample that is representative of the population. (infer what is true)

- Predictive: a question about what predicts an outcome, e.g. what predicts whether someone will eat a certain diet.

- Causal: a question about whether changing one factor will change another factor.

- Mechanistic: a question about how one factor changes another, e.g. how diet leads to a reduction in the number of viral illnesses.

### Creating visualizations in Python

- It's an iterative procedure. Try things, make mistakes, and refine! 
- We will use the `altair` package. There are three key aspects of plots in `altair`:
    1. `alt.Chart` creates a base chart and specifies which data to use.
    2. `mark_*` specifies which graphical marks to use when displaying the data.
    3. `encode` indicates how data frame columns should be encoded to visual channels (x, y, colors, etc).

- All aspects of our chart are chained together using dot notation, e.g.

```python
alt.Chart(df).mark_circle().encode(
    x='column1',
    y='column2',
)
```

### Scatter Plots

To visualize the relationship between two quantitative variables

e.g. Is there a relationship between horsepower and fuel economy of an engine? Does the number of cylinders affect that relationship?

#### Types of variables 
A **variable** refers to a characteristic of interest and can be: 

1. categorical: can be divided into groups (categories) e.g. marital status 
2. quantitative: measured on a numeric scale (usually units are attached) e.g. height

In [None]:
# Load pandas for reading in data
import pandas as pd


mtcars = pd.read_csv("data/mtcars.csv").rename(columns={'Unnamed: 0': 'model'})
mtcars

In [None]:
# Is there a relationship between fuel economy (mpg) and horsepower?


Build up one-by-one:

```python
alt.Chart(mtcars).mark_circle()
```

then add:

```python
.encode(
    x="hp"
)
```

then also add:

```python
    y="mpg"
```

Ask students: does there seem to be a relationship between the fuel efficiency and horsepower?

As horsepower increases, miles per gallon (fuel efficiency) tends to decrease (a negative relationship). But is this true for all cars? Can we group the data in some way to find out more? What about grouping by the number of cylinders (the size) of the engine?

```python
# Color per cylinder
alt.Chart(mtcars).mark_circle().encode(
    x="hp",
    y="mpg", 
    color="cyl"
)
```

To make the cylinder column use a categorical color map, we need to add the `:N` suffix to the column name to tell altair that this is a "nominal" (categorical) column.

Ask students, does there seem to be a relationship between these two variables within cars with the same engine size?

Cars with more cylinders tend to have higher horsepower and lower fuel efficiency. We can make this plot easier to understand by adding axis labels. When doing this we need to use the helper functions `alt.X` and `alt.Y`.

```python
# Add labels
alt.Chart(mtcars).mark_circle().encode(
    x=alt.X("hp").title("Horsepower"),
    y=alt.Y("mpg").title("Miles per Gallon"), 
    color=alt.Color("cyl:N").title("Cylinders")
)
```

### Line Plots

To visualize trends with respect to an independent quantity 

e.g. How has atmospheric carbon dioxide changed over the last 40 years?

<center><img src="https://media.sciencephoto.com/e1/80/03/84/e1800384-800px-wm.jpg" width="600"/> 

Mauna Loa Research Station</center>

In [None]:
# Inspect the data
co2_df

In [None]:
# How does atmospheric CO2 concentration change over time?


Start with `mark_circle` as we just learned:

```
alt.Chart(co2_df).mark_circle().encode(
    x="date", 
    y="concentration"
)
```

Note the `scale` method on the y-axis; `zero=False` means the axis does not have to start at zero:
```
alt.Chart(co2_df).mark_circle().encode(
    x="date", 
    y=alt.Y("concentration").scale(zero=False)
)
```

The visualization shows a clear upward trend in the atmospheric concentration of CO2 over time.
However, something is not quite right here; ask students what is wrong with this plot (overplotting).

Switch to `mark_line`:

```
alt.Chart(co2_df).mark_line().encode(
    x="date", 
    y=alt.Y("concentration").scale(zero=False)
)
```

Share additional conclusion: The concentration seems to oscillate as well.

Optionally add labels or just mention it.
Can show that when chaining several methods, it's nice to have them on separate rows.

```
alt.Chart(co2_df).mark_line().encode(
    x=alt.X("date").title('Year'), 
    y=alt.Y("concentration")
        .scale(zero=False)
        .title('Concentration (ppm)')
)
```

### Bar Plots

To visualize the comparison of amounts

e.g. Which are the largest 12 island landmasses on Earth? Are they all continents or are there some other islands with large landmasses as well?

<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Worldmap_LandAndPolitical.jpg/1200px-Worldmap_LandAndPolitical.jpg" width="800"/>
    Source: Wiktionary
</center>

In [None]:
# Inspect the data
islands_df

In [None]:
# What are the largest 12 island landmasses on Earth?


Simplest approach first:

```
alt.Chart(islands_df).mark_bar().encode(
    x="landmass",
    y="size"
)
```

The major issues are that the smaller landmasses’ sizes are hard to distinguish,
and the names of the landmasses are rotated by default so that the labels fit.
Let's switch X and Y and also look at only the largest landmasses:

```
islands_top12 = islands_df.nlargest(12, "size")  

alt.Chart(islands_top12).mark_bar().encode(
    x="size",
    y="landmass"
)
```
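
To see what `nlargest` does on its own, here is a tiny sketch with a hypothetical table (`toy_islands`, with made-up names and sizes): it keeps the rows with the largest values in the given column, ordered from largest to smallest.

```python
import pandas as pd

# Hypothetical toy table; names and sizes are made up for illustration
toy_islands = pd.DataFrame({
    "landmass": ["A", "B", "C", "D", "E"],
    "size": [40, 900, 15, 300, 120],
})

# Keep the 3 rows with the largest "size", sorted in descending order
top3 = toy_islands.nlargest(3, "size")
print(top3)
```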


To make the plot easier to read, we can reorder the bars by size. Generally we put the largest bar closest to the axis, but this is not a hard rule.
This will also allow us to make the answer to our question clearer.
```
alt.Chart(islands_top12).mark_bar().encode(
    x="size",
    y=alt.Y("landmass").sort('x')
)
```

Furthermore, we might want to color the bars based on whether they are a continent or not:
```
alt.Chart(islands_top12).mark_bar().encode(
    x="size",
    y=alt.Y("landmass").sort('x'),
    color='landmass_type'
)
```

We should also change the labels to meaningful names:
```
alt.Chart(islands_top12).mark_bar().encode(
    x=alt.X("size").title('Size (1000 square mi)'),
    y=alt.Y("landmass").sort('x').title('Landmass'),
    color=alt.Color('landmass_type').title('Type')
)
```

### Histograms

To visualize the distribution of a single quantitative variable

e.g. Was there a difference in life expectancy across different continents in 2016?

In [None]:
# Inspect the data
gapminder_2016

In [None]:
# Was there a difference in life expectancy across different continents in 2016?


First let's create a histogram of all countries' life expectancies.

```
alt.Chart(gapminder_2016).mark_bar().encode(
    x='life_expectancy',
    y='count()'
)
```

Hmmm, that looks kind of correct and kind of not correct.
Ask students: why does this plot look so spiky and what could we do to fix it?

We need to bin/group the x-variable:

```
alt.Chart(gapminder_2016).mark_bar().encode(
    x=alt.X("life_expectancy").bin(),
    y='count()',
)
```
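
The effect of binning on the counts can be sketched with plain pandas. The values below are made up, and the 5-year bin width is an assumption chosen for illustration; altair picks its own bin edges, so this shows the idea rather than reproducing the chart.

```python
import pandas as pd

# Made-up life expectancies, for illustration only
life = pd.Series([61.2, 63.8, 71.5, 72.1, 72.9, 74.4, 79.0, 81.3, 82.6])

# Without binning, every distinct value gets its own bar of height 1 (the spikes)
unbinned_counts = life.value_counts()

# With binning, values are grouped into 5-year intervals [60, 65), [65, 70), ...
edges = list(range(60, 90, 5))  # 60, 65, 70, 75, 80, 85
binned_counts = pd.cut(life, bins=edges, right=False).value_counts(sort=False)
print(binned_counts)
```

The binned counts vary from interval to interval, which is what gives a histogram its shape.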

Then color by continent.

```
alt.Chart(gapminder_2016).mark_bar().encode(
    x=alt.X("life_expectancy").bin(),
    y='count()',
    color="continent",
)
```

The position of the bars defaults to "stack", meaning that the bars for each continent are stacked on top of each other, which is not very easy to read.
This is not an effective way to convey this information, so let’s try a different strategy: creating multiple separate histograms using faceting.

```
alt.Chart(gapminder_2016).mark_bar().encode(
    x=alt.X("life_expectancy").bin(),
    y='count()',
    color="continent",
).facet(
    'continent'
)
```
When we want to compare the x-axes between the facets, as in this case,
it is more effective to arrange the facets in a single column instead of in a row,
since we can more easily compare the position of the different histograms along the x-axis.

```
alt.Chart(gapminder_2016).mark_bar().encode(
    x=alt.X("life_expectancy").bin(),
    y='count()',
    color="continent",
).facet(
    'continent',
    columns=1
)
```

Changing the height of the charts so that we can see everything at once simplifies things further:

```
alt.Chart(gapminder_2016).mark_bar().encode(
    x=alt.X("life_expectancy").bin(),
    y='count()',
    color="continent",
).properties(
    height=100
).facet(
    'continent',
    columns=1
)

```
That looks much better and is easy to compare! (Mention that axis labels should be changed to be human readable and optionally show it as well as how to make the plot taller)

## A few rules of thumb for creating effective visualizations

### Rule of Thumb: No tables / pie charts / 3D

<img src="img/pie.png" width="1200" />

Which one is easier to interpret?

Pie chart:
- the colours don't mean anything (unnecessary)
- it is hard to see the size of each slice relative to the other slices

### Rule of Thumb: No tables / pie charts / 3D

<img align="left" src="https://clauswilke.com/dataviz/no_3d_files/figure-html/VA-death-rates-3d-1.png" width="600" />
<img align="right" src="https://clauswilke.com/dataviz/no_3d_files/figure-html/VA-death-rates-Trellis-1.png" width="800" />

- the third dimension does not improve the reading of the data
- these plots are difficult to interpret because of the distorting effect of perspective associated with the third dimension.
- 3D is discouraged for charts in general, and should only be used for very specific applications
- the bars or slices in a pie graph that are closer to the reader appear to be larger than those in the back due to the angle at which they're presented

### Rule of Thumb: Use simple, colourblind-friendly colour palettes
<img align="left" src="https://clauswilke.com/dataviz/pitfalls_of_color_use_files/figure-html/popgrowth-vs-popsize-colored-1.png" width="500" />
<img align="right" src="https://clauswilke.com/dataviz/pitfalls_of_color_use_files/figure-html/popgrowth-vs-popsize-bw-1.png" width="800" />

- https://www.color-blindness.com/coblis-color-blindness-simulator/


### Rule of Thumb: Include labels and legends, make them legible

Remember: a great visualization tells its own story without needing you to be there explaining things.

<img align="left" src="https://clauswilke.com/dataviz/small_axis_labels_files/figure-html/Aus-athletes-small-1.png" width="700" />
<img align="right" src="https://clauswilke.com/dataviz/small_axis_labels_files/figure-html/Aus-athletes-good-1.png" width="700" />

<img align="left" src="https://clauswilke.com/dataviz/figure_titles_captions_files/figure-html/tech-stocks-minimal-labeling-bad-1.png" width="700" />
<img align="right" src="https://clauswilke.com/dataviz/figure_titles_captions_files/figure-html/tech-stocks-minimal-labeling-1.png" width="700" />

### Rule of Thumb: avoid overplotting

Generally, we need to use an alternative graphical mark, or adjust how the current mark is drawn.

In [None]:

diamond_plot =  alt.Chart(diamonds).mark_circle().encode(
    x=alt.X("carat",title="Size (carat)"),
    y=alt.Y("price",title="Price (US dollars)")
)
diamond_plot

Add `opacity=0.5` to `mark_*` to make the points semi-transparent:
- `opacity` is a transparency setting between 0 (fully transparent) and 1 (fully opaque)

- too many colours (overwhelming)
- less is more


- Make sure to use colour schemes that are understandable by those with colourblindness. For example, the RColorBrewer R library provides the ability to pick such colour schemes, and you can check your visualizations after you have created them by uploading them to online tools such as the colour blindness simulator.
- Redundancy can be helpful; sometimes conveying the same message in multiple ways reinforces it for the audience. For instance, you can also consider using shapes to represent different groups.

## Go and create!


![](https://media.giphy.com/media/d31vTpVi1LAcDvdm/giphy.gif)

## What did we learn today?

-
- 
- 

### Saving the visualization

There are two major types of image format for storing your visualization:

- **raster graphics**
    - stored as a grid of *pixels* each with their own colour
    - storage size / display time is (roughly) independent of how complicated the image is
    - zooming in / resizing causes loss of quality
    - JPEG (`.jpg`) for natural images, PNG (`.png`) for line drawings/plots
- **vector graphics**
    - stored as a collection of mathematical objects (lines, geometric shapes, curves)
    - storage size / display time depends on how complicated the image is (how many objects)
    - can zoom in / resize arbitrarily and it still looks good
    - SVG (`.svg`) for general usage

    
<center><img src="img/faithful.png" width="700"/></center>
<center><img src="img/raster.png" width="700"/></center>
<center>Zoomed in raster (left) and vector (right)</center>
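
The storage-size points above can be illustrated with a rough standard-library sketch. The helper names `svg_of_circles` and `raster_size_bytes` are hypothetical, and the raster calculation assumes an uncompressed image with 3 bytes per pixel; real formats such as PNG compress the pixels, which is why the independence is only rough.

```python
def svg_of_circles(n):
    """Return SVG text (a vector format) describing n circles in a row."""
    shapes = "".join(
        f'<circle cx="{10 * i + 5}" cy="50" r="4"/>' for i in range(n)
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {10 * n} 100">'
        f"{shapes}</svg>"
    )


def raster_size_bytes(width, height, bytes_per_pixel=3):
    """Uncompressed raster size: one colour value is stored for every pixel."""
    return width * height * bytes_per_pixel


# Vector: storage grows with the number of objects in the image
small_svg = svg_of_circles(10)
large_svg = svg_of_circles(1000)
print(len(small_svg), len(large_svg))

# Raster: storage depends on the pixel grid, not on how many circles are drawn
print(raster_size_bytes(800, 600))
```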