# Tutorial 4: Effective Data Visualization 

### Lecture and Tutorial Learning Goals:

Expand your data visualization knowledge and tool set beyond what we have seen and practiced so far. We will move beyond scatter plots and learn other effective ways to visualize data, as well as some general rules of thumb to follow when creating visualizations. All visualization tasks this week will be applied to real-world data sets. Remember that answering questions with data is an iterative process, and each step you take should have a good reason behind it.

After completing this week's lecture and tutorial work, you will be able to:

- Describe when to use the following kinds of visualizations:
    - scatter plots
    - line plots
    - bar plots
    - histogram plots
- Given a dataset and a question, select from the above plot types to create a visualization that best answers the question
- Given a visualization and a question, evaluate the effectiveness of the visualization and suggest improvements to better answer the question
- Identify rules of thumb for creating effective visualizations
- Define the three key aspects of ggplot objects:
    - aesthetic mappings
    - geometric objects
    - scales
- Use the `ggplot2` library in R to create and refine the above visualizations using:
    - geometric objects: `geom_point`, `geom_line`, `geom_histogram`, `geom_bar`, `geom_vline`, `geom_hline`
    - scales: `scale_x_continuous`, `scale_y_continuous`
    - aesthetic mappings: `x`, `y`, `fill`, `colour`, `shape`
    - labelling: `xlab`, `ylab`, `labs`
    - font control and legend positioning: `theme`
    - subplots: `facet_grid`
- Describe the difference between raster and vector output formats
- Use `ggsave` to save visualizations in `.png` and `.svg` format
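
As a minimal sketch of how these pieces fit together, the example below uses the built-in `mtcars` data (not the tutorial data) to map columns to aesthetics, add a geometric object, adjust a scale, and save the result. The object names, file names, and plot dimensions are arbitrary choices for illustration, and saving as `.svg` assumes the `svglite` package is installed.

```r
library(ggplot2)

# Aesthetic mappings: which columns drive x, y, and colour
example_plot <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
    geom_point() +                              # geometric object: points
    scale_x_continuous(limits = c(1, 6)) +      # scale: control the x-axis range
    xlab("Weight (1000 lbs)") +
    ylab("Fuel economy (miles per gallon)") +
    labs(colour = "Number of cylinders") +
    theme(text = element_text(size = 14), legend.position = "bottom")

# Save a raster copy; the path is a temporary file chosen for illustration
png_path <- file.path(tempdir(), "example_plot.png")
ggsave(png_path, example_plot, width = 6, height = 4)

# A vector copy can be saved the same way if svglite is available
if (requireNamespace("svglite", quietly = TRUE)) {
    svg_path <- file.path(tempdir(), "example_plot.svg")
    ggsave(svg_path, example_plot, width = 6, height = 4)
}
```

`ggsave` picks the output format from the file extension, so only the file name changes between the two calls.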

Any place you see `...`, you must fill in the function, variable, or data to complete the code. Replace `fail()` with your completed code and run the cell!

This worksheet covers parts of [the Visualization chapter](https://datasciencebook.ca/viz.html) of the online textbook. You should read this chapter before attempting the worksheet.

In [None]:
### Run this cell before continuing. 

library(tidyverse)
library(repr)
options(repr.matrix.max.rows = 6)
source("cleanup.R")

**Question 0.1** 
<br> {points: 1}

Match the following definitions with the corresponding aesthetic mapping or function used in R:

*Definitions*

A. Prevents a chart from being stacked. It preserves the vertical position of a plot while adjusting the horizontal position. 

B. In bar charts, this aesthetic fills in the bars by a specific colour or separates the counts by a variable different from the x-axis. 

C. In bar charts, it outlines the bars but in scatterplots, it fills in the points (colouring them based on a particular variable aside from the x/y-axis). 

D. By default, a bar chart makes the height of each bar equal to the number of cases in each group, which is incompatible with mapping values to the y aesthetic. This stat instead allows the y-axis to represent particular values from the data rather than just counts. 

E. This aesthetic allows further visualization of data by varying data points by shape (modifying their shape based on a particular variable aside from the x/y-axis).

F. Labels the y-axis. 


*Aesthetics and Functions*

1. `colour`
2. `dodge`
3. `fill`
4. `identity`
5. `ylab`
6. `shape`

For every definition, create an object named with the letter of the definition and assign it the corresponding number from the list above. For example: `B <- 1`

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of A is not numeric"= setequal(digest(paste(toString(class(A)), "e8756")), "4f3d5cffe07acdf173387725e4f08dcf"))
stopifnot("value of A is not correct (rounded to 2 decimal places)"= setequal(digest(paste(toString(round(A, 2)), "e8756")), "b4b556abc9a04498c66ad9ae2ee600ec"))
stopifnot("length of A is not correct"= setequal(digest(paste(toString(length(A)), "e8756")), "0e0dbda4fa7df509121cb4d7de2e21ec"))
stopifnot("values of A are not correct"= setequal(digest(paste(toString(sort(A)), "e8756")), "b4b556abc9a04498c66ad9ae2ee600ec"))

stopifnot("type of B is not numeric"= setequal(digest(paste(toString(class(B)), "e8757")), "429a02725513a891e7fb6fc0437d9d2e"))
stopifnot("value of B is not correct (rounded to 2 decimal places)"= setequal(digest(paste(toString(round(B, 2)), "e8757")), "cc4690ff3c2b9cbf11e142e335d94306"))
stopifnot("length of B is not correct"= setequal(digest(paste(toString(length(B)), "e8757")), "8708a417e7584bff0e260ffa72384309"))
stopifnot("values of B are not correct"= setequal(digest(paste(toString(sort(B)), "e8757")), "cc4690ff3c2b9cbf11e142e335d94306"))

stopifnot("type of C is not numeric"= setequal(digest(paste(toString(class(C)), "e8758")), "581457b210c0937f5379af270394e22b"))
stopifnot("value of C is not correct (rounded to 2 decimal places)"= setequal(digest(paste(toString(round(C, 2)), "e8758")), "5f7227da558f9d8d0d353cc80d8254dd"))
stopifnot("length of C is not correct"= setequal(digest(paste(toString(length(C)), "e8758")), "5f7227da558f9d8d0d353cc80d8254dd"))
stopifnot("values of C are not correct"= setequal(digest(paste(toString(sort(C)), "e8758")), "5f7227da558f9d8d0d353cc80d8254dd"))

stopifnot("type of D is not numeric"= setequal(digest(paste(toString(class(D)), "e8759")), "a59b5c0b6facf510c0749c511363f5ef"))
stopifnot("value of D is not correct (rounded to 2 decimal places)"= setequal(digest(paste(toString(round(D, 2)), "e8759")), "ff09bd4e0a4f7245b0dad7609faf2050"))
stopifnot("length of D is not correct"= setequal(digest(paste(toString(length(D)), "e8759")), "ca7306d405e881547cb23899fb6647af"))
stopifnot("values of D are not correct"= setequal(digest(paste(toString(sort(D)), "e8759")), "ff09bd4e0a4f7245b0dad7609faf2050"))

stopifnot("type of E is not numeric"= setequal(digest(paste(toString(class(E)), "e875a")), "97936fe9a83ce14e71ccb23f38660197"))
stopifnot("value of E is not correct (rounded to 2 decimal places)"= setequal(digest(paste(toString(round(E, 2)), "e875a")), "18ba5aa4fad2986e18f522546dc957bb"))
stopifnot("length of E is not correct"= setequal(digest(paste(toString(length(E)), "e875a")), "97463bd9b1c9bc03d2d78837d19dfabb"))
stopifnot("values of E are not correct"= setequal(digest(paste(toString(sort(E)), "e875a")), "18ba5aa4fad2986e18f522546dc957bb"))

stopifnot("type of F is not numeric"= setequal(digest(paste(toString(class(F)), "e875b")), "2dd1f324b4618ad5f0b329de4df2e296"))
stopifnot("value of F is not correct (rounded to 2 decimal places)"= setequal(digest(paste(toString(round(F, 2)), "e875b")), "3bb5bd0c5aa23f5ea189fbd928325d11"))
stopifnot("length of F is not correct"= setequal(digest(paste(toString(length(F)), "e875b")), "e0427343d2abd635a90f6650f64488d0"))
stopifnot("values of F are not correct"= setequal(digest(paste(toString(sort(F)), "e875b")), "3bb5bd0c5aa23f5ea189fbd928325d11"))

print('Success!')

**Question 0.2** True or False:
<br> {points: 1}

We should save a plot as an `.svg` file if we want to be able to rescale it without losing quality.

*Assign your answer to an object called `answer0.2`. Make sure your answer is in lowercase letters and is surrounded by quotation marks (e.g. `"true"` or `"false"`).*

In [None]:
# Replace the fail() with your answer.
 
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer0.2 is not character"= setequal(digest(paste(toString(class(answer0.2)), "1f34d")), "638069ba8e223b370e5b7ffbc657e4df"))
stopifnot("length of answer0.2 is not correct"= setequal(digest(paste(toString(length(answer0.2)), "1f34d")), "ef68b1075103f89ad725c62972606ae8"))
stopifnot("value of answer0.2 is not correct"= setequal(digest(paste(toString(tolower(answer0.2)), "1f34d")), "ff63ee396be81bebb7084774c1cd512a"))
stopifnot("letters in string value of answer0.2 are correct but case is not correct"= setequal(digest(paste(toString(answer0.2), "1f34d")), "ff63ee396be81bebb7084774c1cd512a"))

print('Success!')

## 1. Data on Personal Medical Costs 

As we saw in the worksheet, data scientists work in all types of organizations and on all kinds of problems. One such type of organization is a private-sector company that works with health data. Today we will look at data on personal medical costs. Many factors affect health and, consequently, medical costs. Our goal for today is to determine how the variables in the data are related to the medical costs billed by health insurance companies. 


To analyze this, we will be looking at a dataset that includes the following columns:

* `age`: age of primary beneficiary
* `sex`: insurance contractor gender: female, male
* `bmi`: body mass index, an objective index of body weight relative to height (kg/$m^{2}$), calculated as weight divided by height squared; it indicates whether a person's weight is relatively high or low for their height, with an ideal range of 18.5 to 24.9
* `children`: number of children covered by health insurance / number of dependents
* `smoker`: whether the beneficiary smokes
* `region`: the beneficiary's residential area in the US: northeast, southeast, southwest, northwest.
* `charges`: individual medical costs billed by health insurance
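
For reference, BMI is conventionally computed as weight in kilograms divided by the square of height in metres. The short calculation below uses made-up weight and height values, not values from the data set, and the variable names are our own.

```r
# Hypothetical person, for illustration only
weight_kg <- 70
height_m <- 1.75

# BMI = weight (kg) / height (m)^2
bmi <- weight_kg / height_m^2
round(bmi, 1)
```

For these illustrative numbers the result falls inside the 18.5 to 24.9 range mentioned above.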

*This dataset was taken from the [collection of Data Sets](https://github.com/stedy/Machine-Learning-with-R-datasets) created and curated for the [Machine Learning with R](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r) book by Brett Lantz.*

**Question 1.1** Yes or No: 
<br> {points: 1}

Based on the information given in the cell above, do you think the column `charges` includes quantitative/numerical data? 

*Assign your answer to an object called `answer1.1`. Make sure your answer is written in lowercase and is surrounded by quotation marks (e.g. `"yes"` or `"no"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer1.1 is not character"= setequal(digest(paste(toString(class(answer1.1)), "38012")), "0c47ee4fa16f033c16b42cf680515056"))
stopifnot("length of answer1.1 is not correct"= setequal(digest(paste(toString(length(answer1.1)), "38012")), "680b4f9f86662a6563b45916d7b72e71"))
stopifnot("value of answer1.1 is not correct"= setequal(digest(paste(toString(tolower(answer1.1)), "38012")), "84765aa38a86621fb4bc2863ad15eead"))
stopifnot("letters in string value of answer1.1 are correct but case is not correct"= setequal(digest(paste(toString(answer1.1), "38012")), "84765aa38a86621fb4bc2863ad15eead"))

print('Success!')

**Question 1.2** Multiple Choice:
<br> {points: 1}

Assuming overplotting is not an issue, which plot would be the most effective for visualizing the relationship between `age` and `charges`?

A. Scatterplot 

B. Stacked Bar Plot 

C. Bar Plot 

D. Histogram 

*Assign your answer to an object called `answer1.2`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer1.2 is not character"= setequal(digest(paste(toString(class(answer1.2)), "551dd")), "9c29b528886527f1f8996be06bb6653a"))
stopifnot("length of answer1.2 is not correct"= setequal(digest(paste(toString(length(answer1.2)), "551dd")), "5f5cda26323755a289a1e1ab4a1f9b18"))
stopifnot("value of answer1.2 is not correct"= setequal(digest(paste(toString(tolower(answer1.2)), "551dd")), "95f908f98317e1490c81ec98ef6fbef9"))
stopifnot("letters in string value of answer1.2 are correct but case is not correct"= setequal(digest(paste(toString(answer1.2), "551dd")), "a1434b88e0f556d46a6dfa603a430b1b"))

print('Success!')

**Question 1.3**
<br> {points: 1}

Read the `insurance.csv` file in the `data/` folder and use `tail` to view the last 6 individuals presented. 

*Assign your answer to an object called `insurance`.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
tail(insurance) # preview the last 6 rows of the data set

In [None]:
library(digest)
stopifnot("insurance should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(insurance)), "7029a")), "7a3d8f115d0cbf056bc5e3b99f7b23ca"))
stopifnot("dimensions of insurance are not correct"= setequal(digest(paste(toString(dim(insurance)), "7029a")), "c1491bebea9f4ac66bea66140e0cbe60"))
stopifnot("column names of insurance are not correct"= setequal(digest(paste(toString(sort(colnames(insurance))), "7029a")), "1173e8e4817f249b12b9dcd4d3b251ca"))
stopifnot("types of columns in insurance are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(insurance, class)))), "7029a")), "11e5f844d521855e65b661cc3b24b37e"))
stopifnot("values in one or more numerical columns in insurance are not correct"= setequal(digest(paste(toString(if (any(sapply(insurance, is.numeric))) sort(round(sapply(insurance[, sapply(insurance, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "7029a")), "68ca7ac2df6c3976b0c561c22d3464e3"))
stopifnot("values in one or more character columns in insurance are not correct"= setequal(digest(paste(toString(if (any(sapply(insurance, is.character))) sum(sapply(insurance[sapply(insurance, is.character)], function(x) length(unique(x)))) else 0), "7029a")), "3160de08a210aed06159cd6cbdd4c08f"))
stopifnot("values in one or more factor columns in insurance are not correct"= setequal(digest(paste(toString(if (any(sapply(insurance, is.factor))) sum(sapply(insurance[, sapply(insurance, is.factor)], function(col) length(unique(col)))) else 0), "7029a")), "a159d74bfefce82f5dad24d7f8e790a0"))

print('Success!')

**Question 1.4** 
<br> {points: 3}

Looking over the loaded data shown above, what observations can you make about the relationship between medical charges and age? How about medical charges and BMI? Finally, what about medical charges and smoking? 

Also, comment on whether our observations might change if we visualize the data, and on whether visualizing the data might make it easier to see relationships than trying to read them directly from the data table.

Answer in the cell below.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.5**
<br> {points: 1}

According to the National Heart, Lung and Blood Institute of the US: "The higher your BMI, the higher your risk for certain diseases such as heart disease, high blood pressure, type 2 diabetes, gallstones, breathing problems, and certain cancers". 

Based on this information, we can hypothesize that individuals with a higher BMI are likely to have more medical costs. Let's use our data and see if this holds true. Create a scatter plot of `charges` (y-axis) versus `bmi` (x-axis).

In the scaffolding we provide below, we suggest that you set `alpha` to a value between 0.2 and 0.4. `alpha` sets the transparency of points on a scatter plot, and increasing the transparency of points is one tool you can use to deal with overplotting.
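
To see what `alpha` does, here is a small sketch on simulated data (not the insurance data) in which many points overlap. The column names `x_value` and `y_value`, the object names, and the choice of `alpha = 0.3` are assumptions made for illustration.

```r
library(ggplot2)

set.seed(1)
# Simulated, heavily overlapping points
overlap_data <- data.frame(x_value = rnorm(5000), y_value = rnorm(5000))

# Fully opaque points: dense regions merge into a solid blob
opaque_plot <- ggplot(overlap_data, aes(x = x_value, y = y_value)) +
    geom_point()

# Semi-transparent points: darker areas show where more points overlap
transparent_plot <- ggplot(overlap_data, aes(x = x_value, y = y_value)) +
    geom_point(alpha = 0.3)

transparent_plot
```

Because `alpha` here is a fixed value rather than a mapping to a column, it is passed to the geom outside of `aes()`.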

*Assign your answer to an object called `bmi_plot`. Make sure to label your axes appropriately.*

In [None]:
options(repr.plot.width = 8, repr.plot.height = 8) # Remember to set your plot sizes to an appropriate size

#... <- insurance |>
#    ggplot(aes(x = ..., y =  ...)) + 
#        geom_...(alpha = ...) + # Controls the transparency of the points; set it to an appropriate value
#        xlab(...) +
#        ylab(...) +
#        ggtitle(...)

# your code here
fail() # No Answer - remove if you provide an answer
bmi_plot

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(bmi_plot$layers)), function(i) {c(class(bmi_plot$layers[[i]]$geom))[1]})), "9fc61")), "1e050c25f34adbbf10f53434343ddca4"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(bmi_plot$layers)), function(i) {rlang::get_expr(c(bmi_plot$layers[[i]]$mapping, bmi_plot$mapping)$x)}), as.character))), "9fc61")), "d3d244cfc58b8cfaaf5235e3066029c2"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(bmi_plot$layers)), function(i) {rlang::get_expr(c(bmi_plot$layers[[i]]$mapping, bmi_plot$mapping)$y)}), as.character))), "9fc61")), "e064324a2ec8a508778076ef23fb3c91"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(bmi_plot$layers[[1]]$mapping, bmi_plot$mapping)$x)!= bmi_plot$labels$x), "9fc61")), "890b80a4d6f2cd5c599b26347d3d9976"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(bmi_plot$layers[[1]]$mapping, bmi_plot$mapping)$y)!= bmi_plot$labels$y), "9fc61")), "890b80a4d6f2cd5c599b26347d3d9976"))
stopifnot("incorrect colour variable in bmi_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(bmi_plot$layers[[1]]$mapping, bmi_plot$mapping)$colour)), "9fc61")), "ad35902feaa4f02b44ed80cc1cc8a53f"))
stopifnot("incorrect shape variable in bmi_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(bmi_plot$layers[[1]]$mapping, bmi_plot$mapping)$shape)), "9fc61")), "ad35902feaa4f02b44ed80cc1cc8a53f"))
stopifnot("the colour label in bmi_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(bmi_plot$layers[[1]]$mapping, bmi_plot$mapping)$colour) != bmi_plot$labels$colour), "9fc61")), "ad35902feaa4f02b44ed80cc1cc8a53f"))
stopifnot("the shape label in bmi_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(bmi_plot$layers[[1]]$mapping, bmi_plot$mapping)$colour) != bmi_plot$labels$shape), "9fc61")), "ad35902feaa4f02b44ed80cc1cc8a53f"))
stopifnot("fill variable in bmi_plot is not correct"= setequal(digest(paste(toString(quo_name(bmi_plot$mapping$fill)), "9fc61")), "43cdd95815798dc118e89f4369f07d56"))
stopifnot("fill label in bmi_plot is not informative"= setequal(digest(paste(toString((quo_name(bmi_plot$mapping$fill) != bmi_plot$labels$fill)), "9fc61")), "ad35902feaa4f02b44ed80cc1cc8a53f"))
stopifnot("position argument in bmi_plot is not correct"= setequal(digest(paste(toString(class(bmi_plot$layers[[1]]$position)[1]), "9fc61")), "ac9068d46e9b51e26909d2e264d1a284"))

stopifnot("bmi_plot$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(bmi_plot$data)), "9fc62")), "aeb294a87439937001e81613556efb2a"))
stopifnot("dimensions of bmi_plot$data are not correct"= setequal(digest(paste(toString(dim(bmi_plot$data)), "9fc62")), "3d9f555d1098cd4faf8dc091d64bf8e2"))
stopifnot("column names of bmi_plot$data are not correct"= setequal(digest(paste(toString(sort(colnames(bmi_plot$data))), "9fc62")), "7fd769c5ad9ce133f838b629684398e5"))
stopifnot("types of columns in bmi_plot$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(bmi_plot$data, class)))), "9fc62")), "6d63a6d37b142ea46f5176bd04ea68e9"))
stopifnot("values in one or more numerical columns in bmi_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(bmi_plot$data, is.numeric))) sort(round(sapply(bmi_plot$data[, sapply(bmi_plot$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "9fc62")), "2b4101157b1819969ac2b75701e88a39"))
stopifnot("values in one or more character columns in bmi_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(bmi_plot$data, is.character))) sum(sapply(bmi_plot$data[sapply(bmi_plot$data, is.character)], function(x) length(unique(x)))) else 0), "9fc62")), "c0e9e88ceb2165bfd0dd004ce5b13f72"))
stopifnot("values in one or more factor columns in bmi_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(bmi_plot$data, is.factor))) sum(sapply(bmi_plot$data[, sapply(bmi_plot$data, is.factor)], function(col) length(unique(col)))) else 0), "9fc62")), "c95c8451f83df871544ec4394bdedd36"))

print('Success!')

**Question 1.6**
<br> {points: 3}

Analysis: Comment on the effectiveness of the plot, taking into consideration the rules of thumb discussed in lecture. Comment on what is done well and on what could be improved. 

Answer in the cell below.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.7**
<br> {points: 3}

Analysis: What do you observe from the scatter plot? Do the data suggest that there might be evidence of a relationship between BMI and medical costs of individuals? 
From this plot alone, can we say higher BMI causes higher medical charges? Why or why not? 

Answer in the cell below. 

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.8**
<br> {points: 3}

Again, based on information from the National Heart, Lung and Blood Institute of the US, smoking cigarettes is said to be a risk factor for obesity. Create the same plot as you did in **Question 1.5** but this time add the `colour` aesthetic to observe if smoking might affect the body mass of individuals. Also, use `labs` to format your legend title. You may want to pass `alpha = 0.4` to the scatter geometric object to make the scatter points translucent (just for your own ease of visualization; you don't have to and we won't check that when grading).
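
As a sketch of the same idea on a different data set, the example below colours points by a categorical column and renames the legend with `labs`. It uses the `mpg` data that ships with `ggplot2`; the object name and the axis and legend wording are our own choices.

```r
library(ggplot2)

colour_plot <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
    geom_point(alpha = 0.4) +
    xlab("Engine displacement (litres)") +
    ylab("Highway fuel economy (miles per gallon)") +
    labs(colour = "Drive train")   # legend title for the colour aesthetic

colour_plot
```

The argument name inside `labs` matches the aesthetic whose legend you want to title, so a `fill` legend would be titled with `labs(fill = ...)` instead.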

*Assign your answer to an object called `smoke_plot`. Make sure to label your axes appropriately.*

In [None]:
options(repr.plot.width = 8, repr.plot.height = 8) # Remember to set your plot sizes to an appropriate size

# add the code for your plot here!

# your code here
fail() # No Answer - remove if you provide an answer


In [None]:
# Most of the tests for this question are hidden. You have to decide whether you've created a good visualization!
# here's one test to at least ensure you named the plot object correctly:
library(digest)
stopifnot("type of exists('smoke_plot') is not logical"= setequal(digest(paste(toString(class(exists('smoke_plot'))), "bb69")), "63dac263447731c7c69c611608650e46"))
stopifnot("logical value of exists('smoke_plot') is not correct"= setequal(digest(paste(toString(exists('smoke_plot')), "bb69")), "639e91ec7f6945ab6378fe2dfd01c630"))

print('Success!')

**Question 1.9.0** (Analyzing the Graph) True or False: 
<br> {points: 1}

Smokers generally have a lower BMI than non-smokers. 

*Assign your answer to an object called `answer1.9.0`. Make sure your answer is in lowercase and is surrounded by quotation marks (e.g. `"true"` or `"false"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer1.9.0 is not character"= setequal(digest(paste(toString(class(answer1.9.0)), "4f043")), "6bfd2c585ac2df0d90b1c27ded4e0b4a"))
stopifnot("length of answer1.9.0 is not correct"= setequal(digest(paste(toString(length(answer1.9.0)), "4f043")), "3ccc8989558773aaa96b19301a85697f"))
stopifnot("value of answer1.9.0 is not correct"= setequal(digest(paste(toString(tolower(answer1.9.0)), "4f043")), "afb974b1ee4cf7b1739668a351cfc96c"))
stopifnot("letters in string value of answer1.9.0 are correct but case is not correct"= setequal(digest(paste(toString(answer1.9.0), "4f043")), "afb974b1ee4cf7b1739668a351cfc96c"))

print('Success!')

**Question 1.9.1** (Analyzing the Graph) True or False: 
<br> {points: 1}

Smokers generally have higher medical charges than non-smokers.

*Assign your answer to an object called `answer1.9.1`. Make sure your answer is in lowercase and is surrounded by quotation marks (e.g. `"true"` or `"false"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer1.9.1 is not character"= setequal(digest(paste(toString(class(answer1.9.1)), "ed15b")), "b1404bd0c8ede04b0e8af37ec760bc9d"))
stopifnot("length of answer1.9.1 is not correct"= setequal(digest(paste(toString(length(answer1.9.1)), "ed15b")), "07ffc81a1422e81c5a116a8c30c72c2c"))
stopifnot("value of answer1.9.1 is not correct"= setequal(digest(paste(toString(tolower(answer1.9.1)), "ed15b")), "56d1de7386c9386770dcad3a39d727f2"))
stopifnot("letters in string value of answer1.9.1 are correct but case is not correct"= setequal(digest(paste(toString(answer1.9.1), "ed15b")), "56d1de7386c9386770dcad3a39d727f2"))

print('Success!')

**Question 1.10**
<br> {points: 1}

Finally, create a bar graph that displays the proportion of smokers for both females and males in the data set. Use sex as the horizontal axis, and colour the bars to differentiate between smokers and non-smokers. This could, for example, be used to help us determine whether we should consider smoking behaviour when exploring whether there is a relationship between sex and medical costs.
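
To make clear what a proportion bar chart computes, the sketch below works out within-group proportions by hand and then draws the matching chart with `position = "fill"`. It uses the `mpg` data from `ggplot2` rather than the insurance data, and the object names are hypothetical.

```r
library(ggplot2)
library(dplyr)

# Proportion of each drive train within each vehicle class
drv_props <- mpg |>
    count(class, drv) |>
    group_by(class) |>
    mutate(prop = n / sum(n)) |>
    ungroup()

# position = "fill" rescales every bar to height 1,
# so each coloured segment shows a within-group proportion
prop_plot <- ggplot(mpg, aes(x = class, fill = drv)) +
    geom_bar(position = "fill") +
    xlab("Vehicle class") +
    ylab("Proportion of vehicles") +
    labs(fill = "Drive train")

prop_plot
```

The `prop` column in `drv_props` sums to 1 within each class, which is the same quantity the segment heights in `prop_plot` represent.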

*Assign your answer to an object called `bar_plot`. Make sure to label your axes appropriately.*

>*Note - many historical datasets treated sex as a variable where the possible values are only binary: male or female. This representation in this question reflects how the data were historically collected and is not meant to imply that we believe that sex is binary.*

In [None]:
#... <- insurance |>
#    ggplot(aes(x = ..., fill = ...)) + 
#    ..._...(position = 'fill') + 
#    xlab(...) +
#    ylab(...) +
#    labs(fill = "Does the person smoke") +
#    ggtitle(...)


# your code here
fail() # No Answer - remove if you provide an answer
bar_plot

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(bar_plot$layers)), function(i) {c(class(bar_plot$layers[[i]]$geom))[1]})), "b90b4")), "5aabe9bd076a0badeeb3dd99058a4282"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(bar_plot$layers)), function(i) {rlang::get_expr(c(bar_plot$layers[[i]]$mapping, bar_plot$mapping)$x)}), as.character))), "b90b4")), "92b4721f1caaf539af762ff39320b950"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(bar_plot$layers)), function(i) {rlang::get_expr(c(bar_plot$layers[[i]]$mapping, bar_plot$mapping)$y)}), as.character))), "b90b4")), "f59d8bd8f25289db287f8a852f34c655"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(bar_plot$layers[[1]]$mapping, bar_plot$mapping)$x)!= bar_plot$labels$x), "b90b4")), "88f90e17f7939fb41c43d53193f618a0"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(bar_plot$layers[[1]]$mapping, bar_plot$mapping)$y)!= bar_plot$labels$y), "b90b4")), "f59d8bd8f25289db287f8a852f34c655"))
stopifnot("incorrect colour variable in bar_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(bar_plot$layers[[1]]$mapping, bar_plot$mapping)$colour)), "b90b4")), "f59d8bd8f25289db287f8a852f34c655"))
stopifnot("incorrect shape variable in bar_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(bar_plot$layers[[1]]$mapping, bar_plot$mapping)$shape)), "b90b4")), "f59d8bd8f25289db287f8a852f34c655"))
stopifnot("the colour label in bar_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(bar_plot$layers[[1]]$mapping, bar_plot$mapping)$colour) != bar_plot$labels$colour), "b90b4")), "f59d8bd8f25289db287f8a852f34c655"))
stopifnot("the shape label in bar_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(bar_plot$layers[[1]]$mapping, bar_plot$mapping)$colour) != bar_plot$labels$shape), "b90b4")), "f59d8bd8f25289db287f8a852f34c655"))
stopifnot("fill variable in bar_plot is not correct"= setequal(digest(paste(toString(quo_name(bar_plot$mapping$fill)), "b90b4")), "ef4882355299372710427399c36e1543"))
stopifnot("fill label in bar_plot is not informative"= setequal(digest(paste(toString((quo_name(bar_plot$mapping$fill) != bar_plot$labels$fill)), "b90b4")), "88f90e17f7939fb41c43d53193f618a0"))
stopifnot("position argument in bar_plot is not correct"= setequal(digest(paste(toString(class(bar_plot$layers[[1]]$position)[1]), "b90b4")), "b6e6e6e02f04b8724621d0cdc9291c8f"))

stopifnot("bar_plot$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(bar_plot$data)), "b90b5")), "8fc73d7af6df65245d347d5375db6c41"))
stopifnot("dimensions of bar_plot$data are not correct"= setequal(digest(paste(toString(dim(bar_plot$data)), "b90b5")), "b8dc69cc681f114b28738773fe76c134"))
stopifnot("column names of bar_plot$data are not correct"= setequal(digest(paste(toString(sort(colnames(bar_plot$data))), "b90b5")), "ee70eaefdb24bd6d9a43f20578ca3118"))
stopifnot("types of columns in bar_plot$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(bar_plot$data, class)))), "b90b5")), "9e94d802ff655a0fb4f5606020dbc40d"))
stopifnot("values in one or more numerical columns in bar_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(bar_plot$data, is.numeric))) sort(round(sapply(bar_plot$data[, sapply(bar_plot$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "b90b5")), "9018ac3d9edfb2e335c5776d56cb8b84"))
stopifnot("values in one or more character columns in bar_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(bar_plot$data, is.character))) sum(sapply(bar_plot$data[sapply(bar_plot$data, is.character)], function(x) length(unique(x)))) else 0), "b90b5")), "761ef152aaba914792d92c5e3dc5f70d"))
stopifnot("values in one or more factor columns in bar_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(bar_plot$data, is.factor))) sum(sapply(bar_plot$data[, sapply(bar_plot$data, is.factor)], function(col) length(unique(col)))) else 0), "b90b5")), "1c9a40bf137e5c82f7482f43ef1778eb"))

print('Success!')

**Question 1.11**
<br> {points: 1}

Based on the graph, is the proportion of smokers higher amongst men or women?

*Assign your answer to an object called `answer1.11`. Make sure your answer is in lowercase and is surrounded by quotation marks (e.g. `"male"` or `"female"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer1.11 is not character"= setequal(digest(paste(toString(class(answer1.11)), "add13")), "4d5397c95785e4f7614fc4d0056b3376"))
stopifnot("length of answer1.11 is not correct"= setequal(digest(paste(toString(length(answer1.11)), "add13")), "6a57415a6f69812e8e9ef5ee4900ec55"))
stopifnot("value of answer1.11 is not correct"= setequal(digest(paste(toString(tolower(answer1.11)), "add13")), "dd6e24acdab1da830de39c00c8df64e2"))
stopifnot("letters in string value of answer1.11 are correct but case is not correct"= setequal(digest(paste(toString(answer1.11), "add13")), "dd6e24acdab1da830de39c00c8df64e2"))

print('Success!')

## 2. Color Palettes (beyond the defaults)
{points: 1}

In the worksheet and this tutorial, you have seen the same colours again and again. These are from the default `ggplot2` color palette. What if you want different colors? We can do this! In R, one of the libraries that provides alternative color palettes is the `RColorBrewer` library. 

For this question:

1. Load the `RColorBrewer` library.
2. Print the list of palettes available to you with the `display.brewer.all()` function (you can also print out a list of color blind friendly palettes with `display.brewer.all(colorblindFriendly = TRUE)`).
3. Use the chart you created in Q1.10 and change the color palette to your favourite from `RColorBrewer`. Remember that instead of recreating the entire chart from scratch, you can use the `bar_plot` variable you already created and just add the color palette change with the `+` operator (it is also fine if you prefer to copy all the code).
    - For the fill aesthetic with a categorical variable, the function is: `scale_fill_brewer(palette = '...')`
    - For the fill aesthetic with a numeric variable, the function is: `scale_fill_distiller(palette = '...')`

You can read more about the `scale_fill_*` functions in the documentation here: https://ggplot2.tidyverse.org/reference/scale_brewer.html. Optionally, you can also use this [color blindness simulator](https://www.color-blindness.com/coblis-color-blindness-simulator/) to check if your visualization is color blind friendly.
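
To illustrate how a palette is added with `+`, here is a minimal sketch on a made-up data frame. The names `toy_counts`, `toy_plot`, and `toy_plot_palette`, the column names, and the choice of the `"Set2"` palette are all assumptions for illustration only; they are not part of the tutorial data or the expected answer.

```r
library(ggplot2)
library(RColorBrewer)

# Hypothetical toy data, unrelated to the tutorial data set
toy_counts <- data.frame(
    group = c("a", "a", "b", "b"),
    category = c("x", "y", "x", "y"),
    n = c(3, 5, 4, 2)
)

# A stacked bar chart that uses the default ggplot2 fill colors
toy_plot <- ggplot(toy_counts, aes(x = group, y = n, fill = category)) +
    geom_bar(stat = "identity") +
    labs(x = "Group", y = "Count", fill = "Category")

# Reuse the existing plot object and add a ColorBrewer palette with `+`
# (`category` is categorical, so scale_fill_brewer applies here)
toy_plot_palette <- toy_plot + scale_fill_brewer(palette = "Set2")
toy_plot_palette
```

Because the palette is added to the saved plot object, the data, mappings, and labels of the original chart stay unchanged; only the fill scale is swapped.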


*Assign your answer to an object called `bar_plot_palette`.*

In [None]:
## Run this cell to explore the RColorBrewer features (steps 1 and 2 above)
library(RColorBrewer)
display.brewer.all()

In [None]:
## Enter your code to answer step 3 here

# your code here
fail() # No Answer - remove if you provide an answer
bar_plot_palette

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(bar_plot_palette$layers)), function(i) {c(class(bar_plot_palette$layers[[i]]$geom))[1]})), "c5183")), "2bff9314413d329789f6da739cf4cf99"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(bar_plot_palette$layers)), function(i) {rlang::get_expr(c(bar_plot_palette$layers[[i]]$mapping, bar_plot_palette$mapping)$x)}), as.character))), "c5183")), "3580250e2a85f5c0b49691b5a524b45f"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(bar_plot_palette$layers)), function(i) {rlang::get_expr(c(bar_plot_palette$layers[[i]]$mapping, bar_plot_palette$mapping)$y)}), as.character))), "c5183")), "6afa11412fbc16e2d814104f9c265652"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(bar_plot_palette$layers[[1]]$mapping, bar_plot_palette$mapping)$x)!= bar_plot_palette$labels$x), "c5183")), "658085f52f1054fb111423a8ccaacf4c"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(bar_plot_palette$layers[[1]]$mapping, bar_plot_palette$mapping)$y)!= bar_plot_palette$labels$y), "c5183")), "6afa11412fbc16e2d814104f9c265652"))
stopifnot("incorrect colour variable in bar_plot_palette, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(bar_plot_palette$layers[[1]]$mapping, bar_plot_palette$mapping)$colour)), "c5183")), "6afa11412fbc16e2d814104f9c265652"))
stopifnot("incorrect shape variable in bar_plot_palette, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(bar_plot_palette$layers[[1]]$mapping, bar_plot_palette$mapping)$shape)), "c5183")), "6afa11412fbc16e2d814104f9c265652"))
stopifnot("the colour label in bar_plot_palette is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(bar_plot_palette$layers[[1]]$mapping, bar_plot_palette$mapping)$colour) != bar_plot_palette$labels$colour), "c5183")), "6afa11412fbc16e2d814104f9c265652"))
stopifnot("the shape label in bar_plot_palette is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(bar_plot_palette$layers[[1]]$mapping, bar_plot_palette$mapping)$colour) != bar_plot_palette$labels$shape), "c5183")), "6afa11412fbc16e2d814104f9c265652"))
stopifnot("fill variable in bar_plot_palette is not correct"= setequal(digest(paste(toString(quo_name(bar_plot_palette$mapping$fill)), "c5183")), "5f96543addd02003939b6b99ff6e5903"))
stopifnot("fill label in bar_plot_palette is not informative"= setequal(digest(paste(toString((quo_name(bar_plot_palette$mapping$fill) != bar_plot_palette$labels$fill)), "c5183")), "658085f52f1054fb111423a8ccaacf4c"))
stopifnot("position argument in bar_plot_palette is not correct"= setequal(digest(paste(toString(class(bar_plot_palette$layers[[1]]$position)[1]), "c5183")), "90066ac1001667da6b1ad3b63aa065b7"))

stopifnot("bar_plot_palette$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(bar_plot_palette$data)), "c5184")), "2535c86a524a13b7ee5a4885deedd155"))
stopifnot("dimensions of bar_plot_palette$data are not correct"= setequal(digest(paste(toString(dim(bar_plot_palette$data)), "c5184")), "d3f3d6b1cd01fb80f8b2dc18304eba42"))
stopifnot("column names of bar_plot_palette$data are not correct"= setequal(digest(paste(toString(sort(colnames(bar_plot_palette$data))), "c5184")), "1bc7100d8b2707aefbeaebcc255b03b4"))
stopifnot("types of columns in bar_plot_palette$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(bar_plot_palette$data, class)))), "c5184")), "bf230a6b3b93dc5c3b5613d220043936"))
stopifnot("values in one or more numerical columns in bar_plot_palette$data are not correct"= setequal(digest(paste(toString(if (any(sapply(bar_plot_palette$data, is.numeric))) sort(round(sapply(bar_plot_palette$data[, sapply(bar_plot_palette$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "c5184")), "e199b7995663fbf782c0c921b4526152"))
stopifnot("values in one or more character columns in bar_plot_palette$data are not correct"= setequal(digest(paste(toString(if (any(sapply(bar_plot_palette$data, is.character))) sum(sapply(bar_plot_palette$data[sapply(bar_plot_palette$data, is.character)], function(x) length(unique(x)))) else 0), "c5184")), "e1938f63f7b6ca3a8051c4fc6f43058e"))
stopifnot("values in one or more factor columns in bar_plot_palette$data are not correct"= setequal(digest(paste(toString(if (any(sapply(bar_plot_palette$data, is.factor))) sum(sapply(bar_plot_palette$data[, sapply(bar_plot_palette$data, is.factor)], function(col) length(unique(col)))) else 0), "c5184")), "629a9e8b6dc2696f387f87ac054f2964"))

stopifnot("type of !identical(scales::hue_pal()(2), unique(ggplot_build(bar_plot_palette)$data[[1]]$fill)) is not logical"= setequal(digest(paste(toString(class(!identical(scales::hue_pal()(2), unique(ggplot_build(bar_plot_palette)$data[[1]]$fill)))), "c5185")), "61f426b684c0850e4c6b89709e361679"))
stopifnot("logical value of !identical(scales::hue_pal()(2), unique(ggplot_build(bar_plot_palette)$data[[1]]$fill)) is not correct"= setequal(digest(paste(toString(!identical(scales::hue_pal()(2), unique(ggplot_build(bar_plot_palette)$data[[1]]$fill))), "c5185")), "7e1edcec31d6c5ff13cdc7d4039ab464"))

print('Success!')

## 3. Fast-Food Chains in the United States (Continued)
<br> {points: 3}

In `worksheet_viz`, we explored this data set through some visualizations. Now it is all up to you. The goal of this assignment is the same as in the worksheet: create **one** plot that helps you figure out which fast food chain to open and which state would be the least competitive.

After creating your visualization you need to write a paragraph explaining your visualization and why you chose it. Also, explain your conclusion from the visualization and reasoning as to how you came to that conclusion. You can use properly-cited outside information here to help support your reasoning (but **do not** download and analyze any data from an outside source in this notebook -- our autograder will not be able to see it). Finally, if there is some way that you could improve your visualization, but don't yet know how to do it, please explain what you would do if you knew how.

In answering this question, there is no need to restrict yourself to the west coast of the USA. Consider all states that you have data for. You have a variety of graphs to choose from, but before starting the assignment, discuss with a partner which plot would be best suited to answer this question.

*Note that some restaurant names are spelled incorrectly in the data. For the purpose of this exercise you can ignore this and only count the spelling with the most entries for each restaurant.*

<img src="mcdonalds.jpg" width = "600"/>


Hint: The function `pull` from the `dplyr` package selects a column in a data frame and transforms it into a vector. Note: There are different ways you can complete this question so you don't necessarily need to use `pull` (you may find a solution without using it) but it may be helpful.
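
As a rough sketch of what `pull` does, the example below uses a small made-up tibble. The names `toy_restaurants`, `top_names`, and `toy_filtered`, and all of the values, are hypothetical; this shows one possible pattern, not the required approach for this question.

```r
library(dplyr)

# Hypothetical toy data, unrelated to the fast-food data set
toy_restaurants <- tibble(
    name = c("A", "B", "A", "C", "A", "B"),
    state = c("X", "X", "Y", "Y", "Z", "Z")
)

# Count rows per name, keep the two most frequent,
# and pull the `name` column out as a plain character vector
top_names <- toy_restaurants |>
    count(name, sort = TRUE) |>
    slice_head(n = 2) |>
    pull(name)

# A vector like this can then be used inside other functions, e.g. filter
toy_filtered <- toy_restaurants |>
    filter(name %in% top_names)
```

The difference from `select` is that `pull` returns a vector rather than a data frame, which is what functions such as `%in%` expect.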

In [None]:
# write the code for your plot here
# your code here
fail() # No Answer - remove if you provide an answer

*Write a paragraph explaining your visualization and why you chose it. Also explain your conclusion from the visualization and reasoning as to how you came to that conclusion. You can use properly-cited outside information here to help support your reasoning (but **do not** download and analyze any data from an outside source in this notebook -- our autograder will not be able to see it). Finally, if there is some way that you could improve your visualization, but don't yet know how to do it, please explain what you would do if you knew how.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

In [None]:
source("cleanup.R")