# Introduction to R

<font color='blue'> keywords:<font color='black'>  
**Visualization:** scatter plot, bar graph, line graph, box plot  
**Transformations:** select, filter, arrange, mutate, group by, summarize

<font color='blue'>Learning outcomes<font color='black'> 
- data visualization with ggplot2
- data transformation with dplyr

In [None]:
# load required packages
library(tidyverse) # a collection of packages for loading, transforming, analyzing, and visualizing data
# these packages also contain some built-in datasets
# visit https://www.tidyverse.org/ for more information

## Visualization

In the lecture, we saw the following template to generate plots with **ggplot2**

    ggplot(data = <DATA>) +
        <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Type `?geom_` followed by `TAB` to get a list of `<GEOM_FUNCTION>`s. Here are some examples:
 - Scatterplot - geom_point()
 - Bar chart - geom_bar()
 - Line chart - geom_line()
 - Boxplot - geom_boxplot()
 - Histogram - geom_histogram()

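As a minimal sketch of how the template is filled in, the block below swaps two different geoms into `<GEOM_FUNCTION>`. It assumes the `mpg` dataset that ships with ggplot2; the variable names `p_hist` and `p_box` and the `binwidth` value are illustrative choices, not part of the lecture material.

```r
library(ggplot2)  # provides the geom_* functions and the mpg dataset

# Histogram: distribution of highway mileage (binwidth = 2 is an arbitrary choice)
p_hist <- ggplot(data = mpg) +
    geom_histogram(mapping = aes(x = hwy), binwidth = 2)

# Box plot: highway mileage for each class of car
p_box <- ggplot(data = mpg) +
    geom_boxplot(mapping = aes(x = class, y = hwy))

print(p_hist)
print(p_box)
```

Only the geom and the mappings change between the two plots; the `ggplot(data = ...)` part stays the same.
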
In [None]:
# For the following few plots we will use the mpg dataset,
# which is included in the ggplot2 package
# visit https://ggplot2.tidyverse.org/reference/mpg.html for more information
# (use the data() command to list the available datasets, grouped by package)
# First, let us view the columns of the dataset
str(mpg)

### Scatter plot

In [None]:
# Here is an example scatter plot that shows the relationship between `cty` and `hwy`
ggplot(data = mpg) +
    geom_point(mapping = aes(x = cty, y = hwy))

In [None]:
# The above figure may be a bit too large. We can change the display size of the figures as follows
options(repr.plot.width=5, repr.plot.height=3)

# Now plot the figure again and see the difference
ggplot(data = mpg) +
    geom_point(mapping = aes(x = cty, y = hwy))

#### Practice
We can add more options to change the `colour`, `fill`, `size`, `shape`, and transparency (`alpha`) of the points in the figure. Change the values of those options in the following command to see their effects.

In [None]:
ggplot(data = mpg) +
    geom_point(mapping = aes(x = cty, y = hwy), color='red', size = 2, shape = 21, alpha=1, fill='blue')

In [None]:
# Recall that adding `color = class` inside aes, as follows, distinguishes between the types of car
ggplot(data = mpg) +
    geom_point(mapping = aes(x = hwy, y = cty, color=class), size = 2)

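The two cells above use `color` in two different ways: outside `aes()` it sets one fixed value for every point, while inside `aes()` it maps a column to color. The sketch below contrasts the two and shows a common pitfall; it assumes the `mpg` dataset, and the variable names are illustrative.

```r
library(ggplot2)

# Setting: color is a fixed argument of the geom, so every point is blue
p_set <- ggplot(data = mpg) +
    geom_point(mapping = aes(x = cty, y = hwy), color = "blue")

# Mapping: color varies with the `drv` column, and a legend is added
p_map <- ggplot(data = mpg) +
    geom_point(mapping = aes(x = cty, y = hwy, color = drv))

# Pitfall: inside aes(), "blue" is treated as a constant data value, so the
# points get the first default palette color and the legend shows the label "blue"
p_pitfall <- ggplot(data = mpg) +
    geom_point(mapping = aes(x = cty, y = hwy, color = "blue"))

print(p_set)
print(p_map)
print(p_pitfall)
```
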
#### Exercise

Draw the scatter plot to show the relationship between `displ` and `hwy` where  
- points' colors are distinguished by fuel type (`fl`)
- points' size is set to 2
- points have star shape ([here](https://www.datanovia.com/en/blog/ggplot-point-shapes-best-tips/) is more information on setting shape)

In [None]:
### YOUR SOLUTION GOES HERE
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, color=fl), size = 2, shape = 8)

### Bar plot

In [None]:
# Let us look at another dataset - diamonds
str(diamonds)

In [None]:
# Generate a bar graph to display counts of diamonds by cut
ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut))

In [None]:
# We can further split each bar (one per cut) by clarity using the `fill` option inside `aes` as follows
ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, fill = clarity), position = 'dodge', color='gray')

#### Practice

- In the following command, `position` is set to `identity`, so the sub-bars are drawn on top of each other (overlapping) rather than stacked; the default, `position = 'stack'`, stacks them vertically
- Change the values of `color`, `width`, and `show.legend` to see their effects

In [None]:
ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, fill = clarity), position = 'identity', 
                 color='gray', width=0.1, show.legend=FALSE)

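To compare the position adjustments side by side, here is a small sketch using the same `diamonds` data. The shared `base` object, the plot variable names, and the `alpha` value are illustrative assumptions.

```r
library(ggplot2)  # provides the diamonds dataset

# Shared data and mappings; only the position adjustment differs below
base <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))

# "stack" (the default): sub-bars are piled on top of each other
p_stack <- base + geom_bar(position = "stack")

# "dodge": sub-bars are placed side by side
p_dodge <- base + geom_bar(position = "dodge")

# "fill": like "stack", but every bar is rescaled to height 1 to show proportions
p_fill <- base + geom_bar(position = "fill")

# "identity": every sub-bar starts at zero, so the bars overlap;
# a low alpha makes the bars hidden behind others visible
p_identity <- base + geom_bar(position = "identity", alpha = 0.3)

print(p_stack)
print(p_identity)
```
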
## Data Transformation

In [None]:
# For the subsequent plots we will use the flights dataset, as mentioned in the lecture

# First, load the package that contains the dataset
# install.packages('nycflights13') # may be required if you have not yet installed the package
library(nycflights13)   # for the "nycflights13: Flights that Departed NYC in 2013" dataset
# visit https://cran.r-project.org/web/packages/nycflights13/index.html for more information

# Again, view the columns in the dataset 
str(flights)

#### Example
Let's recall some steps for transforming data through an example

**Projecting a subset of variables via select**

In [None]:
# Create a subset of the flights data set for carrier and the delay
flights_carrier_day <- select(flights, carrier, arr_delay, dep_delay)

**Selecting a subset of observations via filter**

In [None]:
# Create a subset of the flights data set with details for winter (December 22 to March 21)
# The parentheses make explicit that & is evaluated before |
winter_flights <- filter(flights, month == 1 | month == 2 | (month == 12 & day > 21) | (month == 3 & day < 22))

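The winter filter relies on `&` being evaluated before `|`. The sketch below checks this on a tiny made-up table (the `dates` tibble is hypothetical, not taken from `flights`) and shows an equivalent condition written with `%in%` and explicit parentheses.

```r
library(dplyr)

# A tiny, made-up table of dates used only for illustration
dates <- tibble(
    month = c(1, 3, 3, 6, 12, 12),
    day   = c(15, 10, 25, 1, 5, 25)
)

# Without parentheses: & binds tighter than |, so this reads as
# Jan, or Feb, or (Dec after the 21st), or (Mar before the 22nd)
implicit <- filter(dates, month == 1 | month == 2 | month == 12 & day > 21 | month == 3 & day < 22)

# The same condition with %in% and explicit parentheses
explicit <- filter(dates, month %in% c(1, 2) | (month == 12 & day > 21) | (month == 3 & day < 22))

nrow(implicit)                 # 3 rows: Jan 15, Mar 10, Dec 25
all.equal(implicit, explicit)  # TRUE: both conditions select the same rows
```
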
**Sorting of observations via arrange**

In [None]:
# Find the least delayed flights (ascending order of delay)
least_delayed <- arrange(flights, dep_delay, arr_delay)
# Find the most delayed flights (descending order of delay)
most_delayed <- arrange(flights, desc(dep_delay), desc(arr_delay))

**Adding new computed variables via mutate**

In [None]:
# Sort planes (tailnum) by their average speed in descending order,
# where speed is defined as distance / air_time
# Check where the NaNs come from
flights_by_speed <- mutate(flights, speed = distance / air_time) %>%
                    group_by(tailnum) %>%
                    summarise(ave_speed = mean(speed, na.rm = TRUE)) %>%
                    arrange(desc(ave_speed))

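One likely source of the `NaN` values is a plane whose flights all have a missing `air_time`: once `na.rm = TRUE` removes every value, there is nothing left to average. The base R sketch below reproduces this with made-up values; the `speed` vector is hypothetical.

```r
# Speeds of a hypothetical plane whose flights all lack air_time
speed <- c(NA_real_, NA_real_)

# Missing values propagate: the result is NA
mean_with_na <- mean(speed)

# After removing the NAs the vector is empty, and the mean of nothing is NaN
mean_without_na <- mean(speed, na.rm = TRUE)

print(mean_with_na)     # NA
print(mean_without_na)  # NaN
```

Filtering out rows with a missing `air_time` before grouping is one way to avoid these groups, though a plane with no usable flights then disappears from the result entirely.
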
**Grouping observations via group by and aggregation of groups via summarize**

In [None]:
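# For each destination, compute the number of flights, the average distance, and the average arrival delay
# Keep destinations with more than 20 flights and exclude Honolulu (HNL)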
delays <- flights %>%
    group_by(dest) %>%
    summarise(
        count = n(),
        dist = mean(distance, na.rm = TRUE),
        delay = mean(arr_delay, na.rm = TRUE)
    ) %>%
    filter(count > 20, dest != 'HNL')

**Drawing a scatter plot to examine the relationship between `dist` and `delay`**

In [None]:
ggplot(data = delays, mapping = aes(x = dist, y = delay)) +
    geom_point() +
    geom_smooth()

#### Exercise  
Draw a bar chart or a pie chart of the number of delayed flights per airline on January 1st. Hint: for a pie chart, draw a stacked bar chart and add `coord_polar()`, which is a coordinate system rather than a `<GEOM_FUNCTION>`

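As a hedged illustration of the pie chart technique on a different dataset (so the exercise itself stays open), the sketch below turns a single stacked bar of `diamonds` into a pie. The variable names and the empty-string `x` mapping are illustrative choices.

```r
library(ggplot2)  # provides the diamonds dataset

# One stacked bar (x is a constant), with one segment per cut
p_bar <- ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = "", fill = cut), width = 1)

# Bending the y axis into a circle turns the segments into pie slices
p_pie <- p_bar + coord_polar(theta = "y")

print(p_pie)
```
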
In [None]:
### YOUR SOLUTION GOES HERE

#filter the flights for Jan 1st

#get delay frequency of airlines

#rename the column headers in delayFrequency to carrier and frequency

# sort the airlines by delay frequency, from most to least

#plot a bar / pie chart


#### Exercise 

Draw a box plot of delay time with respect to the origin airport

In [None]:
# YOUR SOLUTION GOES HERE

#select the delay time and the origin airport, then filter out missing values


# make a box plot of departure delay (dep_delay) for each origin (origin)

#What if we limit the delay time to the range between -180 and 180?


#### Exercise

Draw a scatter plot of the daily average delay time against the daily average flight time (`air_time`)

In [None]:
# YOUR SOLUTION GOES HERE

#group flight records by day


#compute the daily averages of delay time and flying time

#draw the scatter-plot of average delay (avg_delay) by average flight time (avg_fly)

# What if we add a smoothing line to help interpret the plot?


#### More exercises

**Plot the frequency of delayed flights by month**

In [None]:
### YOUR SOLUTION GOES HERE

**Plot the box-plot of delay time with respect to airlines**

In [None]:
### YOUR SOLUTION GOES HERE

**Plot the scatter-plot between hourly average of flying distance and flying time**

In [None]:
### YOUR SOLUTION GOES HERE

<font color='green'> Congratulations! <font color='black'>You have seen:<font color='green'>  
**Visualization:** pie chart, box plot, scatter plot 

**transformations:** select, filter, arrange, mutate, group by, summarize 

## References

[R for Data Science](http://r4ds.had.co.nz/)

[R Graphics Cookbook](http://www.cookbook-r.com/Graphs/)

[NIST/SEMATECH e-Handbook of Statistical Methods](http://www.itl.nist.gov/div898/handbook/eda/section3/eda34.htm), accessed 26-10-2017

[Data Visualization Cheat Sheet](https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf)

[Data Transformation Cheat Sheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)