In this file, we'll learn to use one of the most popular tidyverse packages: `ggplot2`. The `ggplot2` package is so popular among R users because of its consistent syntax and the efficiency with which we can use it to create high-quality visualizations.

We'll create line graphs to visualize and understand changes in United States life expectancies over time.

The [National Center for Health Statistics](https://www.cdc.gov/nchs/index.htm) has been tracking United States mortality trends since 1900. They've compiled [data](https://catalog.data.gov/dataset/age-adjusted-death-rates-and-life-expectancy-at-birth-all-races-both-sexes-united-sta-1900) on United States citizen life expectancy by race and sex.

`library(readr)`

`life_expec <- read_csv("life_expec.csv")`

Here are the first ten rows of the `life_expec` data frame:

![image.png](attachment:image.png)

Each column in the data frame contains a variable pertaining to the population of individuals born each year. Here's an explanation of each variable:

* `Year`: The year of birth.
* `Race`: The races represented in the measured population.
* `Sex`: The sex of the measured population.
* `Avg_Life_Expec`: The average life expectancy, in years, at birth of people born in a given year.
* `Age_Adj_Death_Rate`: The age-adjusted death rate of people born in a given year. The [age-adjusted death rate](http://health.mo.gov/data/mica/CDP_MICA/AARate.html) is a metric that adjusts the death rate for differences in populations' age distributions so that comparisons between populations are fair.

Since collection of these data began in 1900, we have over 100 years of data that we can use to study changes in average U.S. life expectancy over time.

We may suspect that, as health care has improved, people are living longer than they did earlier in the 20th century. As we scroll through the data frame, we'll see that, indeed, life expectancy has generally increased and death rate has decreased over time.

While scanning a data frame may give us some sense of general patterns, creating a visualization of the data allows for a more detailed analysis, such as identifying when historical events led to temporary decreases in life expectancy. Exploring data visually is usually one of the first steps data scientists take when working with new data.

We have seen data represented as text stored in vectors, matrices, lists, and data frames.

In contrast, [plots](https://en.wikipedia.org/wiki/Plot_%28graphics%29) are visual representations that use graphics like dots, lines, and bars to help us look for patterns in data.

There are many types of plots we can use to visualize data. Selecting the appropriate plot for our data, and the questions we want to use it to answer, is an important skill that we'll hone over time.

We are interested in the relationship between life expectancy (the variable `Avg_Life_Expec`) and time (the variable `Year`).

For this task, we'll use a **line chart**, which is a type of plot especially useful for visualizing changes over time. A line chart displays information as a series of data points connected by a line.

Line charts are useful for depicting data that are [continuous](https://en.wikipedia.org/wiki/Continuous_or_discrete_variable#Continuous_variable), meaning the data can take any value within a range. Average life expectancy, as measured in years, is an example of continuous data.

To create line charts and other types of data visualizations, we'll learn to use the `ggplot2` package.

`library(ggplot2)`

The `gg` in `ggplot2` stands for ["Grammar of Graphics"](http://vita.had.co.nz/papers/layered-grammar.pdf), which refers to a system for data visualization first described by [Leland Wilkinson](https://en.wikipedia.org/wiki/Leland_Wilkinson).

[Hadley Wickham](http://hadley.nz/), chief data scientist at RStudio, used the principles of the Grammar of Graphics to develop `ggplot2` to allow systematic, consistent, time-efficient creation of data visualizations.

Let's dive into creating a line chart to visualize the change in U.S. life expectancy over time. To begin making a plot, use the `ggplot()` function and specify the data frame we'll be visualizing data from:

`ggplot(data = data_frame)`

This step creates an empty coordinate system that we can add layers to.
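As a minimal sketch of what this first step produces, the code below calls `ggplot()` on `economics`, a data frame that ships with `ggplot2`. Using `economics` is an assumption made only so the example runs on its own; it is not part of this lesson's data.

```r
library(ggplot2)

# `economics` is a built-in ggplot2 data frame used here as a stand-in
# for any data frame (an assumption for illustration only).
p <- ggplot(data = economics)

# No geoms have been added yet, so the plot's list of layers is empty.
length(p$layers)

# Printing the object draws an empty panel with no axes or data.
print(p)
```

Storing the plot in a variable such as `p` is optional, but it makes it easy to inspect the object or add layers to it later.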

**Task**

* Use the `ggplot()` function to create the first layer of our plot using the `life_expec` data frame.

**Answer**

`ggplot(data = life_expec)`

We've created an empty graph for our data that we will now add layers to.

The first layer we'll add is one to map our data points to scales and a coordinate system, which generates axes. To define the variables we want to map to our graph, we'll add a layer using `aes()`, which is short for "aesthetics", to our graph:

`ggplot(data = data_frame, aes(x = variable_1, y = variable_2))`

When graphing two-dimensional data, as we will do in this file, ggplot2 by default uses the [Cartesian coordinate system](https://en.wikipedia.org/wiki/Cartesian_coordinate_system). This means that our graphs have two axes:

![image.png](attachment:image.png)

How do we know which axis to use for which variable, though?

The answer to this question is informed by what we think the relationship between two variables is:

* The variable that changes **depending on** the other variable is called the [dependent variable](https://en.wikipedia.org/wiki/Dependent_and_independent_variables). We assign this variable to the vertical axis, or y-axis.

* The variable that changes **independent of** the other variable is called the [independent variable](https://en.wikipedia.org/wiki/Dependent_and_independent_variables). We assign this variable to the horizontal axis, or x-axis.

In the case of our `life_expec` data, `Avg_Life_Expec` changes as time progresses, so we consider it to be the dependent variable. The `Year` variable represents time and is the independent variable.

**Task**

* Add an `aes()` layer to our graph specifying `Avg_Life_Expec` as the dependent variable and `Year` as the independent variable.

**Answer**

`ggplot(data = life_expec, aes(x = Year, y = Avg_Life_Expec))`

Now, our chart has a coordinate system and axes. With this foundation in place, the next step is to add geometric symbols to the graph to represent data points.

To add a line representing the relationship between the `Year` and `Avg_Life_Expec` variables to our graph, we'll add a `geom_line()` layer to our graph:

`ggplot(data = data_frame,
  aes(x = variable_1, y = variable_2)) +
  geom_line()`

Notice how we add each new layer to the graph using a `+` at the end of the preceding line of code. This syntax is consistent for any type of visualization we'll create using `ggplot2`.

Using `ggplot2`, we can add geometric objects of different types to a graph depending on what type of data we're working with and the relationships between variables we're looking to explore. 
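To illustrate how swapping the geometric object changes the plot, here is a small sketch that reuses one data-plus-aesthetics foundation with two different geoms. It assumes the built-in `economics` data frame from `ggplot2` (with its `date` and `unemploy` columns) rather than this lesson's data, and the variable names `base`, `line_plot`, and `point_plot` are only illustrative.

```r
library(ggplot2)

# Shared foundation: data plus aesthetic mapping.
# `economics` is a built-in ggplot2 data frame (a stand-in for illustration).
base <- ggplot(data = economics, aes(x = date, y = unemploy))

# A line layer connects consecutive observations.
line_plot <- base + geom_line()

# A point layer draws one dot per observation instead.
point_plot <- base + geom_point()

print(line_plot)
print(point_plot)
```

Both plots share the same axes; only the layer added with `+` differs.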

**Task**

* Add a layer to our graph to illustrate the relationship between `Year` and `Avg_Life_Expec`

**Answer**

`ggplot(data = life_expec,
  aes(x = Year, y = Avg_Life_Expec)) +
  geom_line()`

We have now produced a graph with the relationship between `Year` and `Avg_Life_Expec` represented by a line:

![image.png](attachment:image.png)

However, the graph we've created appears to have multiple data points for `Avg_Life_Expec` for each instance of `Year`. Let's take a look at a single year from the `life_expec` data frame to see what's going on:

`library(dplyr)`

`life_expec %>% filter(Year == 2000)`

For the year 2000, there are nine data points:

![image.png](attachment:image.png)

This is because, for each year, average life expectancies for multiple populations (by sex and race) are included in the data set.
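One way to confirm this kind of structure is to count rows per year and list the distinct population groups. The sketch below does so on `toy_life_expec`, a hypothetical stand-in that borrows the column names of `life_expec`; its values are invented placeholders, not NCHS figures.

```r
library(dplyr)

# Hypothetical stand-in for `life_expec`: same column names, placeholder
# values (invented for illustration only, not real NCHS data).
toy_life_expec <- tibble(
  Year = rep(c(1999, 2000), each = 4),
  Race = rep(c("All Races", "All Races", "Black", "White"), times = 2),
  Sex = rep(c("Both Sexes", "Female", "Both Sexes", "Both Sexes"), times = 2),
  Avg_Life_Expec = c(70, 75, 68, 71, 71, 76, 69, 72)
)

# Count how many rows share each year; more than one row per year means
# several populations are recorded for that year.
rows_per_year <- toy_life_expec %>% count(Year)
rows_per_year

# List the distinct Race/Sex combinations behind those rows.
toy_life_expec %>% distinct(Race, Sex)
```

Running the same two calls on the real `life_expec` data frame should reveal which combinations of `Race` and `Sex` it contains.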

To get a sense for the change over time of life expectancy for the entire U.S. population, let's use data for all races and both sexes to create the line graph.

![image.png](attachment:image.png)

**Task**

* Create a new line graph containing only average life expectancy data for the entire U.S. population.

**Answer** 

`life_expec_filter <- life_expec %>% filter(Race == "All Races" & Sex == "Both Sexes")`

`ggplot(data = life_expec_filter,
  aes(x = Year, y = Avg_Life_Expec)) +
  geom_line()`

Here's the line graph we've created:

![image.png](attachment:image.png)

This visualization of the relationship between `Avg_Life_Expec` and `Year` allows for the quick detection of some interesting patterns:

* Generally, average life expectancy of the U.S. population has been increasing over time.
* Before the 1950s, life expectancy fluctuated substantially from year to year.
* Just before 1920, life expectancy dropped dramatically and then quickly recovered.

When creating data visualizations, it's good practice to make sure that the visual can "stand alone". That is, if someone were to find our graph, they should be able to clearly understand what the visualization represents.

One way to ensure that our graph is easy to understand is by adding a title to it. We can do so by adding another layer to our graph using the function `labs()`, short for "labels":

`ggplot(data = data_frame, 
  aes(x = variable_1, y = variable_2)) +
  geom_line() +
  labs(title = "Title of Graph")`

It's also important to make sure that someone who looks at our graph can understand the data that are being represented. In the case of the graph we've been working on, it may be unclear what that y-axis label, `Avg_Life_Expec`, refers to.

To change axis labels, we can also use the `labs()` function. To specify new labels for the x- or y-axis, use the arguments `x =` or `y =` within `labs()`.

`ggplot(data = data_frame,
  aes(x = variable_1, y = variable_2)) +
  geom_line() +
  labs(title = "Title of Graph", x = "new x label", y = "new y label")`
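Here is a short, hedged example of the `labs()` pattern applied to a different data set. It assumes the built-in `economics` data frame from `ggplot2`, whose `unemploy` column records the number of unemployed people in thousands; the title and label text are illustrative choices.

```r
library(ggplot2)

# `economics` is a built-in ggplot2 data frame (a stand-in for illustration).
p <- ggplot(data = economics, aes(x = date, y = unemploy)) +
  geom_line() +
  # Replace the default labels (the column names) with readable text.
  labs(
    title = "U.S. Unemployment Over Time",
    x = "Date",
    y = "Unemployed Persons (Thousands)"
  )

print(p)
```

Any label left out of `labs()` keeps its default, which is the name of the mapped variable.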

**Task**

* Give our graph a descriptive title that communicates the goal of the visualization: "United States Life Expectancy: 100 Years of Change"
* Give the y-axis this label: "Average Life Expectancy (Years)"

**Answer**

`ggplot(data = life_expec_filter,
  aes(x = Year, y = Avg_Life_Expec)) +
  geom_line() +
  labs(title = "United States Life Expectancy: 100 Years of Change", y = "Average Life Expectancy (Years)")`

![image.png](attachment:image.png)

The line graph's gray background and white grid lines are unnecessary for understanding the data and are a bit distracting. Let's simplify the graph background to help the line representation of the data stand out.

To modify non-data `ggplot2` graph components, including background color, we can add a layer to our graph using `theme()`.

Within the `theme()` layer, we'll use the argument `panel.background = element_rect(fill = "background_color")` to specify the fill color of the background rectangle (which is what "rect" stands for).

`ggplot(data = data_frame,
  aes(x = variable_1, y = variable_2)) +
  geom_line() +
  labs(title = "Title of Graph", x = "new x label", y = "new y label") + 
  theme(panel.background = element_rect(fill = "background_color"))`
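The sketch below shows the `theme()` pattern on a separate example so the effect of the `fill` argument is easy to see. It assumes the built-in `economics` data frame from `ggplot2`, and the color `"ivory"` is an arbitrary choice for illustration.

```r
library(ggplot2)

# `economics` is a built-in ggplot2 data frame (a stand-in for illustration).
p <- ggplot(data = economics, aes(x = date, y = unemploy)) +
  geom_line() +
  # Non-data styling: fill the panel behind the line with a named color.
  theme(panel.background = element_rect(fill = "ivory"))

print(p)
```

Because `theme()` only changes non-data components, the line itself is drawn exactly as before.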

**Task**

* Change the background color of our line graph to white.

**Answer**

`ggplot(data = life_expec_filter,
  aes(x = Year, y = Avg_Life_Expec)) +
  geom_line() +
  labs(title = "United States Life Expectancy: 100 Years of Change", y = "Average Life Expectancy (Years)") +
  theme(panel.background = element_rect(fill = "white"))`

We've now created a clear, informative visualization of the `life_expec` data:

![image.png](attachment:image.png)

Our line graph makes it easy to identify interesting features of the data that would have been harder to envision when looking at a table of data. Let's take a look at some of them.

* First, we can see that, generally, the average U.S. life expectancy has increased over time.
* However, notice the sharp drop in life expectancy around the year 1920, from about 55 to under 40 years. What could have caused this? A bit of research reveals that, in 1918, there was a deadly [influenza epidemic](https://www.archives.gov/exhibits/influenza-epidemic/) that affected the U.S. population. This could explain the rapid decrease in average life expectancy.
* It's also interesting to observe that average life expectancy fluctuated between 1900 and 1950 before becoming more stable from year to year after about 1950. Do you have any ideas about possible causes?