In [None]:
# Run before lecture to load datasets and do simple prep
library(tidyverse) #all our data wrangling/plotting
library(repr) #for changing the dimensions of plots
options(repr.matrix.max.rows = 6)

#Mauna Loa
co2_df <- data.frame(concentration = as.matrix(co2), date = time(co2))

#Top 12 Island landmasses
islands_df <- enframe(islands)
colnames(islands_df) <- c('landmass', 'size')
islands_df <- top_n(islands_df, 12, size)

continents <- c('Africa', 'Antarctica', 'Asia', 'Australia', 'Europe', 'North America', 'South America')
islands_df <- mutate(islands_df, is_continent = ifelse(landmass %in% continents, 'Continent', 'Other'))

install.packages("dslabs")
library(dslabs) # gapminder data
data(gapminder)
gapminder <- gapminder  %>% select(country, year, population, continent, life_expectancy)
#old faithful, mtcars -- nothing to do


# DSCI 100 - Introduction to Data Science


## Lecture 4 - Data visualization in R


**Attribution:** images in these slides that are not accompanied by code mostly come from <br>[The Fundamentals of Data Visualization by Claus O. Wilke](https://clauswilke.com/dataviz/)

<img src="https://github.com/allisonhorst/stats-illustrations/blob/master/rstats-artwork/ggplot2_exploratory.png?raw=true" width=500>

*Illustration by Allison Horst*

# Housekeeping

- Quiz next week! 
    - June 2, 10:30 - 11:15 pm
    - open book (but not collaborative)
    - will be served on Canvas
    - practice questions will be posted on Canvas 
    - covers all the material from weeks 1 - 4 (but none of the optional web scraping is included)
    - make sure you have either Chrome, Firefox, or some other browser compatible with all of Canvas' functionality. Sometimes Safari has trouble loading images and math.
    - Download your worksheets/tutorials onto your local computer just in case!
        - File > Download as > HTML

### Today: Visualization  


<center>
<img src="https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png" width="1200"/>
</center>

*image source: [R for Data Science](https://r4ds.had.co.nz/) by Grolemund & Wickham*

### Designing a visualization: ask a question, then answer it

The purpose of a visualization is to *answer a question* about a dataset of interest.

A good visualization answers the question clearly. A *great* visualization also hints at the question itself.

Visualizations alone help us answer two types of questions:

- **descriptive:** What are the largest 7 landmasses on Earth?
- **exploratory:** Is there a relationship between penguin body mass and bill length?
- ~~inferential~~
- ~~predictive~~
- ~~causal~~
- ~~mechanistic~~

(we need more tools + visualizations to answer the others)




- Descriptive: asks about summarized characteristics of a data set without interpretation, i.e., reports a fact (describe characteristics).

- Exploratory: asks whether there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study (discovery of ideas and thoughts).

- Inferential: asks whether an association observed in your exploratory analysis holds in a different sample that is representative of the population (infer what is true).

- Predictive: asks what predicts an outcome, e.g. what predicts whether someone will eat a certain diet.

- Causal: asks whether changing one factor will change another factor.

- Mechanistic: asks how one factor leads to a change in another, e.g. how diet leads to a reduction in the number of viral illnesses.

### Creating visualizations in R

- It's an iterative procedure. Try things, make mistakes, and refine! 
- We will use `ggplot2`. There are three key aspects of plots in `ggplot2`:
    1. **aesthetic mappings:** map dataframe columns to visual properties
    2. **geometric objects:** encode how to display those visual properties
    3. **scales:** transform variables, set limits
    
- Add these one by one using `+`
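
A minimal sketch of the three pieces added one at a time, using the built-in `mtcars` data. The log scale on the x-axis is only an illustrative choice of scale, and the name `layered_plot` is made up for this example.

```r
library(ggplot2)

# 1. aesthetic mapping: hp goes to the x position, mpg to the y position
layered_plot <- ggplot(mtcars, aes(x = hp, y = mpg))

# 2. geometric object: draw each row of the data frame as a point
layered_plot <- layered_plot + geom_point()

# 3. scale: show the x-axis on a log10 scale (illustrative choice)
layered_plot <- layered_plot + scale_x_log10()

layered_plot
```

Printing the plot after each step shows what every `+` contributes.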

### Types of variables 
A **variable** refers to a characteristic of interest and can be: 

1. categorical: can be divided into groups (categories) e.g. marital status 
2. quantitative: measured on a numeric scale (usually units are attached) e.g. height
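
As a rough illustration in R, a quantitative variable is usually stored as a number, while a categorical variable can be represented as a factor. The sketch below uses `mtcars`; the name `cyl_cat` is made up for this example.

```r
# mpg is quantitative: it is stored as a number measured on a scale
is.numeric(mtcars$mpg)

# cyl is stored as a number, but it only takes a few distinct values,
# so it is often more useful to treat it as categorical with factor()
cyl_cat <- factor(mtcars$cyl)
levels(cyl_cat)
```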

In [None]:
#load libraries

library(tidyverse) # all our data wrangling/plotting
library(repr) # for changing the dimensions of plots


### Scatter Plots

To visualize the relationship between two quantitative variables

e.g. Is there a relationship between horsepower and fuel economy of an engine? Does the number of cylinders affect that relationship?

<br>

<center>
<img src="https://i.imgur.com/J7o9uj0.gif?noredirect" width="550"/>
</center>

In [None]:
# inspect the data
mtcars

In [None]:
#Is there a relationship between fuel economy and horsepower? Does # cylinders affect it?

options(repr.plot.width = 7, repr.plot.height = 5) 

ggplot(
    mtcars,  
)


Build up one by one: start with `geom_point` without colouring by `cyl`, then colour by `factor(cyl)`.
```
options(repr.plot.width = 5, repr.plot.height = 5)                           # change size of output plot
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +                  # treat cyl as a factor (categorical variable)
    geom_point() +                                                           # make it a scatterplot
    labs(x = 'Horsepower', y = 'Miles per Gallon', color = '# Cylinders') +  # labels on the x-axis, y-axis and legend title
    theme(text = element_text(size = 18))                                    # make the text size bigger
```

- As horsepower increases, miles per gallon (fuel efficiency) tends to decrease (negative relationship).
- Cars with more cylinders tend to have higher horsepower and lower fuel efficiency.
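
The two observations can also be checked numerically. This is only a quick sanity check, not a replacement for the plot; the names `hp_mpg_cor` and `by_cyl` are made up for this example.

```r
# Correlation between horsepower and fuel economy:
# a negative value matches the downward trend in the scatter plot
hp_mpg_cor <- cor(mtcars$hp, mtcars$mpg)
hp_mpg_cor

# Average horsepower and fuel economy for each number of cylinders
by_cyl <- aggregate(cbind(hp, mpg) ~ cyl, data = mtcars, FUN = mean)
by_cyl
```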

### Line Plots

To visualize trends with respect to an independent quantity 

e.g. How has atmospheric carbon dioxide changed over the last 40 years?

<center><img src="https://media.sciencephoto.com/e1/80/03/84/e1800384-800px-wm.jpg" width="600"/> Mauna Loa Research Station</center>

In [None]:
#inspect the data
co2_df


In [None]:
#How does atmospheric CO2 concentration change over time?
options(repr.plot.width = 7, repr.plot.height = 5) 
co2_plot <- ggplot(co2_df, ...) 
co2_plot

Start with `geom_point`, show students and ask "what's wrong here?", then switch to `geom_line`.
```
ggplot(co2_df, aes(x = date, y = concentration)) +
    geom_line() +
    labs(x = 'Date', y = 'CO2 Concentration') +
    theme(text = element_text(size = 18))
```
- The visualization shows a clear upward trend in the atmospheric concentration of CO2 over time.
- In addition to increasing over time, the concentration seems to oscillate as well.

- If R prints a message mentioning "ts": `ts` is R's time series class, and the `date` column was created from a time series object.
- A time series is a series of data points in which each data point is associated with a timestamp.
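
One possible way to avoid the "ts" message is to convert the time series to plain numbers before building the data frame. This is a sketch, assuming plain numeric columns are acceptable for the plot; `co2_plain` is a made-up name.

```r
# time(co2) is itself a time series ("ts") object
class(time(co2))

# Assumption: plain numeric columns are fine for plotting here
co2_plain <- data.frame(
    concentration = as.numeric(co2),   # CO2 measurements as plain numbers
    date = as.numeric(time(co2))       # decimal years as plain numbers
)
class(co2_plain$date)
```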

### Bar Plots

To visualize the comparison of amounts

e.g. Are Earth's biggest landmasses continents? If so, what are the next largest few landmasses?

<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Worldmap_LandAndPolitical.jpg/1200px-Worldmap_LandAndPolitical.jpg" width="800"/>
    Source: Wikimedia Commons
</center>

In [None]:
#What are the largest 12 landmasses on Earth?
islands_df
options(repr.plot.width = 7, repr.plot.height = 5) # change plot size 

Start with a simple `geom_bar`, then flip, then reorder, then theme/legend.

- By default, `geom_bar` uses `stat = "count"`. This makes the height of each bar equal to the number of cases in each group, and it is incompatible with mapping values to the y aesthetic.
- If you want the heights of the bars to represent values in the data, use `stat = "identity"` and map a value to the y aesthetic.
- Notice that without `stat = "identity"` and `y = size`, R counts the number of rows for each landmass in the `landmass` column, which is just 1 each.
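
The difference between counting rows and plotting values can be seen on a small example. The sketch below builds a three-row data frame from the built-in `islands` data (`top_islands` is a made-up name) and uses `geom_col()`, which is ggplot2's shorthand for `geom_bar(stat = "identity")`.

```r
library(ggplot2)

# Three largest landmasses from the built-in islands data
top_islands <- data.frame(landmass = names(islands), size = as.numeric(islands))
top_islands <- head(top_islands[order(-top_islands$size), ], 3)

# Default stat = "count": every bar has height 1 (one row per landmass)
count_plot <- ggplot(top_islands, aes(x = landmass)) +
    geom_bar()

# Bar heights taken from the size column
value_plot <- ggplot(top_islands, aes(x = landmass, y = size)) +
    geom_col()

count_plot
value_plot
```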

```
ggplot(islands_df, aes(x = landmass, y = size)) +        # add y = size 
    geom_bar(stat = "identity") +                        # make height equal to the y
    labs(x = 'Landmass', y = 'Size (1000 square mi)') +  # change labels
    theme(text = element_text(size = 18))                # make labels bigger
```

coord_flip() then reorder

```
ggplot(islands_df, aes(x = reorder(landmass, size), y = size)) +  # reorder landmass according to size
    geom_bar(stat = "identity") +
    labs(x = 'Landmass', y = 'Size (1000 square mi)') +
    coord_flip() +                                                # flip so the bars are horizontal
    theme(text = element_text(size = 18))
```
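
To see what `reorder` does on its own, here is a small sketch with three landmasses and their sizes from the `islands` data. The vector names are made up for this example.

```r
landmass <- c("Asia", "Africa", "Europe")
size <- c(16988, 11506, 3745)

# A plain factor orders its levels alphabetically
levels(factor(landmass))

# reorder() orders the levels by the second argument (smallest first),
# which is the order ggplot uses along the axis
levels(reorder(landmass, size))
```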


colour by whether it is a continent or not 
```
ggplot(islands_df, aes(x = reorder(landmass, size), y = size, fill = is_continent)) +
    geom_bar(stat = "identity") +
    labs(x = 'Landmass', y = 'Size (1000 square mi)', fill = 'Type') +
    coord_flip() +
    theme(text = element_text(size = 18), legend.position = c(0.75, 0.45))
    # legend position is the x and y position in the chart area, where (0,0) is the bottom left and (1,1) is the top right
```                 

### Histograms

To visualize the distribution of a single quantitative variable

e.g. Was there a difference in life expectancy across different continents in 2016?

In [None]:
gapminder

gap_2016 <- gapminder %>% 
    filter(year == 2016)

gap_hist <- ggplot(gap_2016, aes(x = life_expectancy,  fill=continent)) +
    geom_histogram() + 
    labs(x = "Life Expectancy (years)") + 
    theme(text = element_text(size = 18))

gap_hist


- `position` defaults to "stack", so bars are stacked on top of each other: not very easy to read and can be misleading
    - (dodge: bars are beside each other)
    - (identity: bars are overlaid on top of one another)
```
ggplot(gap_2016, aes(x = life_expectancy, fill = continent)) +   # colour the plot according to continent, fill by continent
  geom_histogram() + 
  labs(x = "Life Expectancy (years)", fill = "Continent") +      # change the label for the legend and x-axis
  theme(text = element_text(size = 18))                    
```

- Try overlaying the histograms on top of each other.
- Change the transparency of the histograms.
```
ggplot(gap_2016, aes(x = life_expectancy, fill = continent)) +
  geom_histogram(position = "identity", alpha = 0.6) +       # overlay the histograms and change the transparency
  labs(x = "Life Expectancy (years)", fill = "Continent") + 
  theme(text = element_text(size = 18)) 
```
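
The three `position` options can be compared side by side. The sketch below uses `mtcars` instead of `gapminder` so that it runs without extra packages; the bin width of 5 and the plot names are illustrative choices.

```r
library(ggplot2)

# Shared mapping: histogram of mpg, filled by number of cylinders
base_plot <- ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
    labs(x = "Miles per Gallon", fill = "# Cylinders")

# stack (the default): bars in the same bin sit on top of each other
stacked <- base_plot + geom_histogram(binwidth = 5, position = "stack")

# dodge: bars in the same bin sit beside each other
dodged <- base_plot + geom_histogram(binwidth = 5, position = "dodge")

# identity: every bar starts at zero, so bars overlap and need transparency
overlaid <- base_plot + geom_histogram(binwidth = 5, position = "identity", alpha = 0.6)

stacked
dodged
overlaid
```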

- Overlaying isn't the clearest way to convey the information.
- Let's try a different strategy: create a separate histogram for each continent and arrange them in a vertical stack of panels.
- In our case we only want to split vertically along the continent variable, so we use `continent ~ .` as the argument to `facet_grid`.
```
ggplot(gap_2016, aes(x = life_expectancy, fill = continent)) +
  geom_histogram(position = "identity") + 
  labs(x = "Life Expectancy (years)", fill = "Continent") + 
  theme(text = element_text(size = 18)) + 
  facet_grid(continent ~ .) # can show facet_grid both ways and ask which is better to answer our question?
```
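
The two directions of the `facet_grid` formula can be tried on a small self-contained example. This sketch uses `mtcars` rather than `gapminder`; the variable on the left of `~` defines the rows and the variable on the right defines the columns.

```r
library(ggplot2)

mpg_hist <- ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(binwidth = 5) +
    labs(x = "Miles per Gallon")

# cyl ~ . : one row per cylinder count, so the x-axes line up vertically
rows_plot <- mpg_hist + facet_grid(cyl ~ .)

# . ~ cyl : one column per cylinder count, so the y-axes line up horizontally
cols_plot <- mpg_hist + facet_grid(. ~ cyl)

rows_plot
cols_plot
```

Stacking the panels in rows tends to make it easier to compare positions along the shared x-axis.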

### Rule of Thumb: No tables / pie charts / 3D

<img src="img/pie.png" width="1200" />

Which one is easier to interpret?

- Pie chart: the colours don't mean anything (unnecessary)
- It is hard to see the size of slices relative to the other slices

### Rule of Thumb: No tables / pie charts / 3D

<img align="left" src="https://clauswilke.com/dataviz/no_3d_files/figure-html/VA-death-rates-3d-1.png" width="600" />
<img align="right" src="https://clauswilke.com/dataviz/no_3d_files/figure-html/VA-death-rates-Trellis-1.png" width="800" />

- the third dimension does not improve the reading of the data
- these plots are difficult to interpret because of the distorted effect of perspective associated with the third dimension. 
- 3D is discouraged for charts in general
- the bars or slices in a pie graph that are closer to the reader appear to be larger than those in the back due to the angle at which they're presented

### Rule of Thumb: Use simple, colourblind-friendly colour palettes
<img align="left" src="https://clauswilke.com/dataviz/pitfalls_of_color_use_files/figure-html/popgrowth-vs-popsize-colored-1.png" width="500" />
<img align="right" src="https://clauswilke.com/dataviz/pitfalls_of_color_use_files/figure-html/popgrowth-vs-popsize-bw-1.png" width="800" />

- https://www.color-blindness.com/coblis-color-blindness-simulator/


### Rule of Thumb: Include labels and legends, make them legible

Remember: a great visualization tells its own story without needing you to be there explaining things

<img align="left" src="https://clauswilke.com/dataviz/small_axis_labels_files/figure-html/Aus-athletes-small-1.png" width="700" />
<img align="right" src="https://clauswilke.com/dataviz/small_axis_labels_files/figure-html/Aus-athletes-good-1.png" width="700" />

<img align="left" src="https://clauswilke.com/dataviz/figure_titles_captions_files/figure-html/tech-stocks-minimal-labeling-bad-1.png" width="700" />
<img align="right" src="https://clauswilke.com/dataviz/figure_titles_captions_files/figure-html/tech-stocks-minimal-labeling-1.png" width="700" />

### Rule of Thumb: avoid overplotting

Generally, you need to add transparency or use an alternative geometric object

In [None]:
options(repr.plot.width = 4, repr.plot.height = 4)
diamond_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point() +
    xlab("Size (carat)") +
    ylab("Price (US dollars)")
diamond_plot

Add `alpha = 0.2` to `geom_point`
- `alpha` is a transparency setting between 0 (fully transparent) and 1 (fully opaque)
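
A sketch of both remedies on the `diamonds` data: transparent points, and an alternative geometric object. `geom_bin_2d()` is one possible alternative, chosen here as an assumption; it counts the points falling in each rectangular bin instead of drawing them individually.

```r
library(ggplot2)

# Remedy 1: make the points transparent so dense regions appear darker
alpha_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point(alpha = 0.2) +
    xlab("Size (carat)") +
    ylab("Price (US dollars)")

# Remedy 2 (illustrative alternative): colour rectangular bins by point count
binned_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_bin_2d(bins = 30) +
    xlab("Size (carat)") +
    ylab("Price (US dollars)")

alpha_plot
binned_plot
```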

- too many colours (overwhelming)
- less is more


- Make sure to use colour schemes that are understandable by those with colour blindness. For example, the RColorBrewer R library provides the ability to pick such colour schemes, and you can check your visualizations after you have created them by uploading them to online tools such as the colour blindness simulator.
- Redundancy can be helpful; sometimes conveying the same message in multiple ways reinforces it for the audience. For instance, you can also consider using shapes to represent different groups.
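
A sketch combining both ideas on the `mtcars` scatter plot: a ColorBrewer palette plus a redundant shape encoding. The `"Dark2"` palette is an illustrative choice; RColorBrewer lists it among its colourblind-friendly palettes.

```r
library(ggplot2)

# Map the same variable to both colour and shape (redundant encoding)
friendly_plot <- ggplot(mtcars, aes(x = hp, y = mpg,
                                    color = factor(cyl), shape = factor(cyl))) +
    geom_point(size = 3) +
    scale_color_brewer(palette = "Dark2") +   # ColorBrewer palette (illustrative choice)
    labs(x = "Horsepower", y = "Miles per Gallon",
         color = "# Cylinders", shape = "# Cylinders") +  # same title merges the two legends
    theme(text = element_text(size = 18))

friendly_plot
```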

## Go and create!

1. Go to your project groups
2. Work on worksheet 4 
3. Remember you can ask your group for help and share screen if needed! 


![](https://media.giphy.com/media/d31vTpVi1LAcDvdm/giphy.gif)

## What did we learn today?

-
- 
- 

### Saving the visualization

There are two major types of image format for storing your visualization:

- **raster graphics**
    - stored as a grid of *pixels* each with their own colour
    - storage size / display time is (roughly) independent of how complicated the image is
    - zooming in / resizing causes loss of quality
    - JPEG (`.jpg`) for natural images, PNG (`.png`) for line drawings/plots
- **vector graphics**
    - stored as a collection of mathematical objects (lines, geometric shapes, curves)
    - storage size / display time depends on how complicated the image is (how many objects)
    - can zoom in / resize arbitrarily and it still looks good
    - SVG (`.svg`) for general usage

    
<center><img src="img/faithful.png" width="700"/></center>
<center><img src="img/raster.png" width="700"/></center>
<center>Zoomed in raster (left) and vector (right)</center>
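
A minimal sketch of saving one plot in a raster and a vector format with `ggsave`, which picks the format from the file extension. PDF is used as the vector format here because saving `.svg` through `ggsave` requires the additional svglite package; the file names and the temporary output directory are made up for this example.

```r
library(ggplot2)

faithful_plot <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
    geom_point() +
    labs(x = "Waiting time (mins)", y = "Eruption duration (mins)")

# Hypothetical output locations in a temporary directory
png_path <- file.path(tempdir(), "faithful.png")   # raster
pdf_path <- file.path(tempdir(), "faithful.pdf")   # vector

# width and height are in inches by default
ggsave(png_path, faithful_plot, width = 7, height = 5)
ggsave(pdf_path, faithful_plot, width = 7, height = 5)

# Compare the sizes of the two files (in bytes)
file.size(c(png_path, pdf_path))
```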