# 7. Visualization with ggplot

## Visualizations in R

In this next section, we will introduce the `ggplot2` package and some of the basic plot types you can use. We will come back to data visualization later in the course because it is such an important part of data analysis. Oftentimes, a graph conveys information more clearly than a summary table.

### Scatter plots

We'll begin with a scatter plot in `ggplot2` to introduce the three key elements of a `ggplot2` object: the data, the aesthetics, and the layers. Take a look at the code below, which creates a scatter plot of 2019 total trips (the y-axis) vs. 2015 total trips (the x-axis). To start a graph, we create a `ggplot` object as below. Note that this brings up a gray box; this will be the base that we build up from. We include the data as an argument so that any future layers we add will use the same data.
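As a minimal sketch of this first step, the code below creates the base object. The data frame `trips` and its columns `trips_2015` and `trips_2019` are made-up stand-ins for the course data, so treat the names and values as assumptions.

```r
library(ggplot2)

# Hypothetical stand-in for the trips data: made-up total trip
# counts for 2015 and 2019, one row per location.
trips <- data.frame(
  trips_2015 = c(120, 340, 560, 810, 1020),
  trips_2019 = c(150, 390, 700, 905, 1300)
)

# Only the data is supplied here, so printing p draws an empty gray panel.
p <- ggplot(data = trips)
p
```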

Now we can start adding layers to our `ggplot` object. One type of layer is a **geometry** layer. These layers specify what kind of plot to draw. Below, we use `geom_point` to specify that we want a scatter plot. Last, we need to pass an aesthetic mapping to the geometry layer to tell it which columns to display and how. Below, we just specify the x-axis and y-axis variables.
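A sketch of adding the geometry layer might look like the following. As before, `trips` is a made-up data frame used only for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the trips data (made-up values).
trips <- data.frame(
  trips_2015 = c(120, 340, 560, 810, 1020),
  trips_2019 = c(150, 390, 700, 905, 1300)
)

# geom_point() draws a scatter plot; aes() maps columns to the x and y axes.
p <- ggplot(data = trips) +
  geom_point(aes(x = trips_2015, y = trips_2019))
p
```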

If we want to improve our plot, we may want to add different labels and a title. To do so, we add a `labs` layer in which we can specify all labels. Additionally, I have passed more information to the geometry layer by changing the color and size. 
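One way this could look is sketched below. The data frame, the label text, and the particular color and size values are all illustrative assumptions.

```r
library(ggplot2)

# Hypothetical stand-in for the trips data (made-up values).
trips <- data.frame(
  trips_2015 = c(120, 340, 560, 810, 1020),
  trips_2019 = c(150, 390, 700, 905, 1300)
)

p <- ggplot(data = trips) +
  # color and size are set outside aes(), so they apply to every point.
  geom_point(aes(x = trips_2015, y = trips_2019), color = "blue", size = 2) +
  # labs() sets the axis labels and the title in one place.
  labs(
    x = "Total trips in 2015",
    y = "Total trips in 2019",
    title = "Total trips: 2019 vs. 2015"
  )
p
```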

density plot

Statisticians use visualizations and tables to communicate results in an easy-to-interpret format. We learned the basics of ggplot earlier in the course. Here, we expand on that knowledge and learn more about the different layers we can use in ggplot. As with the tidyverse functions, there are quite a few functions to cover. In class, we will practice using these skills to build up creative and informative visualizations of data. Beyond visualizations, we will also talk about how to create well-formatted tables in R using the `kable` function (from the `knitr` package) and the `gt` package. This notebook is video heavy because I wanted to make sure each piece of code generating the plots was highlighted.


## Visual Plots

The video below reviews the main components of a plot using the `ggplot2` library that we have seen before: the data, aesthetic, and layers.

https://www.youtube.com/embed/SzqtP7Y2BQ4

In the example below, we create a scatter plot of arrival and departure delays from the NYC flight data you saw before. We will use a sample of the data for these examples. Try changing the size, color, and alpha values from the ones I have set below. What does alpha correspond to?
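A sketch of such a plot is shown here. To keep it self-contained, `flights_sample` is simulated data that only borrows the column names `dep_delay` and `arr_delay` from the flight data; it is an assumption, not the sample used in the video.

```r
library(ggplot2)

# Simulated stand-in for a sample of the flight data (not real delays).
set.seed(1)
flights_sample <- data.frame(dep_delay = round(rexp(200, rate = 1 / 30)))
flights_sample$arr_delay <- flights_sample$dep_delay + round(rnorm(200, sd = 15))

# Try changing size, color, and alpha to see how the points respond.
p <- ggplot(data = flights_sample) +
  geom_point(
    aes(x = dep_delay, y = arr_delay),
    size = 1, color = "blue", alpha = 0.5
  ) +
  labs(x = "Departure delay (minutes)", y = "Arrival delay (minutes)")
p
```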

We can also add more layers to our visualization. For example, suppose I want to add a line showing where departure delay equals arrival delay. I can do so as below using a `geom_line` layer. Note that the line layer uses its own aesthetic, which does not come from the same data as the points. Therefore, we move the data and aesthetics from the initial `ggplot` call into the layers themselves.
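The idea can be sketched as follows. Both data frames are assumptions: `flights_sample` is simulated, and `line_df` is a hypothetical two-row helper holding the endpoints of the reference line.

```r
library(ggplot2)

# Simulated stand-in for a sample of the flight data (not real delays).
set.seed(1)
flights_sample <- data.frame(dep_delay = round(rexp(200, rate = 1 / 30)))
flights_sample$arr_delay <- flights_sample$dep_delay + round(rnorm(200, sd = 15))

# Hypothetical helper data: two endpoints of the line y = x.
line_df <- data.frame(x = c(0, 200), y = c(0, 200))

# ggplot() is called with no data; each layer brings its own data and aes().
p <- ggplot() +
  geom_point(data = flights_sample, aes(x = dep_delay, y = arr_delay), alpha = 0.5) +
  geom_line(data = line_df, aes(x = x, y = y), color = "red")
p
```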

### Adjusting Axes and Adding Text

In the video below, I show how we can add to this plot by adjusting the axes and adding an annotation corresponding to the red line. In particular, the different ggplot functions used here are:

  - [Annotation](https://ggplot2.tidyverse.org/reference/annotate.html) (`annotate`): this function allows us to add layers to the graph that do not come from a data frame. In the example below, we show how we can use this function to create a line as well as to add a text annotation to the plot.  
  
  -  [Continuous Scaling](https://ggplot2.tidyverse.org/reference/scale_continuous.html) (`scale_x_continuous`, `scale_y_continuous`): there are several scaling functions that allow us to specify how we want different aesthetic elements to be displayed. In the example below, we use two functions to set the breaks and labels along the x and y axes. 
  
  -  [Coordinate Limits](https://ggplot2.tidyverse.org/reference/coord_cartesian.html) (`coord_cartesian`): this function allows you to set the x and y limits for your graph so you can zoom in on an area of interest.  
  
  -  [Default Themes](https://ggplot2.tidyverse.org/reference/ggtheme.html) (`theme_minimal`): ggplot provides default themes you can use to set the background, axis lines, and colors. Besides the minimal theme below, you can also use `theme_classic` and `theme_bw`.
  
https://www.youtube.com/embed/znz3qYtIJ7w
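The four functions listed above can be combined as in the sketch below. The data are simulated, and the specific breaks, labels, limits, and text position are illustrative choices rather than the ones used in the video.

```r
library(ggplot2)

# Simulated stand-in for a sample of the flight data (not real delays).
set.seed(1)
flights_sample <- data.frame(dep_delay = round(rexp(200, rate = 1 / 30)))
flights_sample$arr_delay <- flights_sample$dep_delay + round(rnorm(200, sd = 15))

delay_breaks <- seq(0, 120, by = 30)
delay_labels <- c("0", "30 min", "1 hr", "1.5 hr", "2 hr")

p <- ggplot(data = flights_sample) +
  geom_point(aes(x = dep_delay, y = arr_delay), alpha = 0.5) +
  # annotate() adds a line segment and a text label without a data frame.
  annotate("segment", x = 0, xend = 120, y = 0, yend = 120, color = "red") +
  annotate("text", x = 40, y = 100, label = "Arrival delay = departure delay",
           color = "red") +
  # Set where the axis ticks fall and what they are called.
  scale_x_continuous(breaks = delay_breaks, labels = delay_labels) +
  scale_y_continuous(breaks = delay_breaks, labels = delay_labels) +
  # Zoom in on delays between 0 and 120 minutes without dropping any rows.
  coord_cartesian(xlim = c(0, 120), ylim = c(0, 120)) +
  theme_minimal()
p
```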

### Adding Grouping  

We can also use data frame columns to split a layer into different groups. The example below shows how to do this and demonstrates scaling by these groups. In particular, we want to construct a plot showing the distribution of short departure delays (defined to be between 0 and 120 minutes) for the top five airlines. The resulting plot shows that the distribution is roughly similar for each airline. It uses the following tools.  

  -  Adding a group to the aesthetic: here we use the color to differentiate the group. We could also have the shape or size correspond to variables in our data.   
  
  -  [Scaling aesthetics](https://ggplot2.tidyverse.org/reference/scale_colour_discrete.html) (`scale_color_discrete`): this function allows us to map labels to the colors. If we wanted to set the color for each group by hand, we could use a manual scale such as `scale_color_manual` instead.  
  
  -  [Other theme options](https://ggplot2.tidyverse.org/reference/theme.html) (`theme`): We expand on the given theme by specifying the legend position to be at the bottom of the figure. This function can be used to specify other things like the grid lines and background color. 
  
https://www.youtube.com/embed/el6q6Meag4M
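A sketch of a grouped plot using these tools is given below. I assume a density plot (`geom_density`) for the distribution, and `short_delays` is simulated data with made-up delays; the five carrier codes and airline names are only there to give the legend something to label.

```r
library(ggplot2)

# Simulated stand-in: 100 made-up short delays (0 to 120 min) per carrier.
set.seed(2)
short_delays <- data.frame(
  carrier = rep(c("AA", "B6", "DL", "EV", "UA"), each = 100),
  dep_delay = pmin(rexp(500, rate = 1 / 25), 120)
)

p <- ggplot(data = short_delays) +
  # Mapping color to carrier draws one density curve per airline.
  geom_density(aes(x = dep_delay, color = carrier)) +
  # Labels are given in the alphabetical order of the carrier codes.
  scale_color_discrete(
    name = "Airline",
    labels = c("American", "JetBlue", "Delta", "ExpressJet", "United")
  ) +
  labs(x = "Departure delay (minutes)", y = "Density") +
  theme_minimal() +
  # theme() fine-tunes the chosen theme; here it moves the legend.
  theme(legend.position = "bottom")
p
```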

### Working with Discrete Axes

Now suppose that we want to create a bar plot for the percentage of trips by each airline. The code below calculates those percentages. 
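One way to compute such percentages with `dplyr` is sketched here. The data frame `flights_sample` and the column names `n_trips` and `pct` are assumptions for illustration, and the counts are made up.

```r
library(dplyr)

# Made-up stand-in: one row per trip, labeled with a carrier code.
flights_sample <- data.frame(
  carrier = c(rep("UA", 40), rep("B6", 30), rep("DL", 20), rep("AA", 10))
)

carrier_pct <- flights_sample %>%
  count(carrier, name = "n_trips") %>%        # trips per airline
  mutate(pct = 100 * n_trips / sum(n_trips))  # share of all trips, in percent

carrier_pct
```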

The example below shows how to use the `geom_bar` layer to display the percentage (rather than the default count) and also shows how to update the plot with the following tools.  

  -   Ordering the x-axis by value: when you set up the aesthetic, you can specify the order of your discrete variable.  

  -  [Flipping the coordinate system](https://ggplot2.tidyverse.org/reference/coord_flip.html) (`coord_flip`): this function flips the x and y axes.
  
  -  [Renaming the discrete x values](https://ggplot2.tidyverse.org/reference/scale_discrete.html) (`scale_x_discrete`): similar to the continuous example, we can add our own labels to the plot using a scaling function. We did not use breaks in this function because the x-axis is discrete and we specified the labels in order.  
  
  -  [Changing the color scheme](http://www.sthda.com/english/wiki/ggplot2-colors-how-to-change-colors-automatically-and-manually) (`scale_fill_brewer`): rather than using the default colors, we choose a predefined color scheme to use for the fill. Beyond these set schemes, you can also set the colors manually. The link provided shows many different examples (including Wes Anderson film themes).
  
https://www.youtube.com/embed/gF0NdILxcqA
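The tools listed above can be sketched together as follows. The percentages in `carrier_pct` are made up, and using `reorder` for the ordering and the `"Set2"` palette are my assumptions about reasonable choices, not necessarily those in the video.

```r
library(ggplot2)

# Made-up percentages of trips per airline.
carrier_pct <- data.frame(
  carrier = c("AA", "B6", "DL", "UA"),
  pct = c(10, 30, 20, 40)
)

p <- ggplot(data = carrier_pct) +
  # stat = "identity" plots the pct values instead of counting rows;
  # reorder() sorts the carriers by pct (ascending: AA, DL, B6, UA).
  geom_bar(
    aes(x = reorder(carrier, pct), y = pct, fill = carrier),
    stat = "identity", show.legend = FALSE
  ) +
  coord_flip() +
  # Labels follow the reordered carriers, so no breaks are needed.
  scale_x_discrete(labels = c("American", "Delta", "JetBlue", "United")) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Airline", y = "Percentage of trips")
p
```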

### Faceting 

Lastly, we show how to separate the plot by group with *facets*. We demonstrate how to use the `facet_wrap` function to create a grid of plots by carrier. The first argument to this function is the formula that we want to facet on (to facet by two variables `x1` and `x2` we would use `~x1+x2`). This separates the graph by those facets. Additional arguments can give the number of rows or columns to use when displaying the graphs.
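A minimal faceting sketch, again on simulated data, might look like this. The data frame `short_delays` and the choice of a density plot with two rows of panels are assumptions.

```r
library(ggplot2)

# Simulated stand-in: 100 made-up short delays (0 to 120 min) per carrier.
set.seed(2)
short_delays <- data.frame(
  carrier = rep(c("AA", "B6", "DL", "EV", "UA"), each = 100),
  dep_delay = pmin(rexp(500, rate = 1 / 25), 120)
)

p <- ggplot(data = short_delays) +
  geom_density(aes(x = dep_delay)) +
  # One panel per carrier, arranged in two rows.
  facet_wrap(~carrier, nrow = 2) +
  labs(x = "Departure delay (minutes)", y = "Density")
p
```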

**Other packages not covered:** The `gridExtra` package allows you to arrange multiple plots in rows and columns. Another package that can be very useful for visualization in the exploratory phase of data analysis is the `GGally` library. This library allows you to view pairwise scatterplots and create visually appealing correlation matrices. 

I know that there are a lot of different options available in ggplot and the number of functions may seem overwhelming. We will practice these more in class. A few key things may help you remember which functions to look at. The `geom` layers all have to do with displaying data points on the graph. The `scale` functions correspond to changing how an aspect of the graph is displayed. For example, `scale_color_continuous` would specify how to display colors that are used on a continuous scale. On the other hand, `scale_x_discrete` would specify how to display the x-axis when x is discrete. Last, the `theme` and `annotate` functions help to fine-tune the display.