Commit

correct numbers for questions in r code
atheobold committed Apr 13, 2023
1 parent c6f2dc2 commit e93da4a
Showing 2 changed files with 9 additions and 10 deletions.
4 changes: 2 additions & 2 deletions _freeze/labs/lab-2/execute-results/html.json
@@ -1,7 +1,7 @@
{
"hash": "3b9538c89017606254dba67382c341ec",
"hash": "737666e97a406534e80bad601671c206",
"result": {
"markdown": "---\ntitle: \"Visualizing and Summarizing Numerical Data\"\nauthor: \"Your group's names here!\"\ndate: \"April 13, 2023\"\nformat: html\nembed-resources: true\nstandalone: true\neditor: visual\nexecute: \n echo: true\n eval: false\n warning: false\n message: false\n---\n\n\n## Getting started\n\n### Load packages\n\nLet's load the following packages:\n\n- The **tidyverse** \"umbrella\" package which houses a suite of many different `R` packages for data wrangling and data visualization\n\n- Note: This did not work last week, but should this week!\n\n- The **openintro** `R` package: houses the dataset we will be working with\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Package for functions \nlibrary(tidyverse)\n\n# Package for data\nlibrary(openintro)\n```\n:::\n\n\n### The data\n\nThe [Bureau of Transportation Statistics](http://www.rita.dot.gov/bts/about/) (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). As its name implies, BTS collects and makes transportation data available, such as the flights data we will be working with in this lab.\n\nFirst, we'll view the `nycflights` data frame. Run the following code to load in the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(nycflights)\n```\n:::\n\n\nThe **codebook** (description of the variables) can be accessed by pulling up the help file by typing a `?` before the name of the dataset:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?nycflights\n```\n:::\n\n\nRemember that you can use `glimpse()` to take a quick peek at your data to understand its contents better.\n\n**Question 1**\n\n**(a) How large is the `nycflights` dataset? (i.e. How many rows and columns does it have?)**\n\n**(b) Are there numerical variables in the dataset? If so, what are their names?**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# You code for question 1 goes here! 
\n```\n:::\n\n\n### Departure Delays\n\nLet's start by examining the distribution of departure delays (`dep_delay`) of all flights with a histogram.\n\n**Question 2 -- Create a histogram of the `dep_delay` variable from the `nycflights` data**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code for question 2 goes here! \n```\n:::\n\n\nHistograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data is split between the different bins.\n\nYou can easily define the binwidth you want to use, by specifying the `binwidth` argument inside of `geom_histogram()`, like so: `geom_histogram(binwidth = 15)`\n\n**Question 3**\n\n**(a) Make two other histograms, one with a `binwidth` of 15 and one with a `binwidth` of 150.**\n\n\n::: {.cell layout-nrow=\"1\"}\n\n```{.r .cell-code}\n# Your code for question 3 goes here! \n```\n:::\n\n\n**(b) How do these three histograms compare? Are features revealed in one that are obscured in another?**\n\n## SFO Dstinations\n\nOne of the variables refers to the destination (i.e. airport) of the flight, which have three letter abbreviations. For example, flights into Los Angeles have a `dest` of `\"LAX\"`, flights into San Francisco have a `dest` of `\"SFO\"`, and flights into Chicago (O'Hare) have a `dest` of `\"ORD\"`.\n\nIf you want to visualize only on delays of flights headed to Los Angeles, you need to first `filter()` the data for flights with that destination (e.g., `filter(dest == \"LAX\")`) and then make a histogram of the departure delays of only those flights.\n\n**Logical operators:** Filtering for certain observations (e.g. flights from a particular airport) is often of interest in data frames where we might want to examine observations with certain characteristics separately from the rest of the data. To do so, you can use the `filter()` function and a series of **logical operators**. 
The most commonly used logical operators for data analysis are as follows:\n\n- `==` means \"equal to\"\n- `!=` means \"not equal to\"\n- `>` or `<` means \"greater than\" or \"less than\"\n- `>=` or `<=` means \"greater than or equal to\" or \"less than or equal to\"\n\n**Question 4 -- Fill in the code to create a new dataframe named `sfo_flights` that is the result of `filter()`ing only the observations whose destination was San Francisco.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code for question 4 goes here! \n\nsfo_flights <- filter(nycflights, \n dest == )\n```\n:::\n\n\n### Multiple Data Filters\n\nYou can filter based on multiple criteria! Within the `filter()` function, each criteria is separated using commas. For example, suppose you are interested in flights leaving from LaGuardia (LGA) in February:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfilter(nycflights, \n origin == \"LGA\", \n month == 2)\n\n## Remember months are coded as numbers!\n```\n:::\n\n\nNote that you can separate the conditions using commas if you want flights that are both leaving from LGA **and** flights in February. If you are interested in either flights leaving from LGA **or** flights that happened in February, you can use the `|` instead of the comma.\n\n**Question 5 -- Fill in the code below to find the number of flights flying into SFO in July that arrived early. What does the result of this code tell you?**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Code for exercise 5 here! 
\n\nfilter(sfo_flights, \n month == __, \n arr_delay > __) |> \n dim()\n```\n:::\n\n\n## Data Summaries\n\nYou can also obtain numerical summaries for the flights headed to SFO, using the `summarise()` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummarise(sfo_flights, \n mean_dd = mean(dep_delay), \n median_dd = median(dep_delay), \n n = n())\n```\n:::\n\n\nNote that in the `summarise()` function I've created a list of three different numerical summaries that I'm interested in.\n\nThe names of these elements are user defined, like `mean_dd`, `median_dd`, `n`, and you can customize these names as you like (just don't use spaces in your names!).\n\nCalculating these summary statistics also requires that you know the summary functions you would like to use.\n\n**Summary statistics:** Some useful function calls for summary statistics for a single numerical variable are as follows:\n\n- `mean()`: calculates the average\n- `median()`: calculates the median\n- `sd()`: calculates the standard deviation\n- `var()`: calculates the variances\n- `IQR()`: calculates the inner quartile range (Q3 - Q1)\n- `min()`: finds the minimum\n- `max()`: finds the maximum\n- `n()`: reports the sample size\n\nNote that each of these functions takes a single variable as an input and returns a single value as an output.\n\n## Summaries vs. Visualizations\n\n*If I'm flying from New York to San Francisco, should I expect that my flights will typically arrive on time?*\n\nLet's think about how you could answer this question. One option is to summarize the data and inspect the output. Another option is to plot the delays and inspect the plots. Let's try both!\n\n**Question 6 -- Calculate the following statistics for the arrival delays in the `sfo_flights` dataset:**\n\n- mean\n- median\n- max\n- min\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Code for exercise 6 goes here! \n```\n:::\n\n\n**Question 7 -- Using the above summary statistics, what is your answer be to my question? 
What should I expect if I am flying from New York to San Francisco?**\n\n**Question 8 -- Now, rather than calculating summary statistics, plot the distribution of arrival delays for the `sfo_flights` dataset.**\n\n*Choose the type of plot you believe is appropriate for visualizing the **distribution** of departure delays.*\n\n*Don't forget to give your visualization informative axis labels!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Code for exercise 7 goes here! \n```\n:::\n\n\n**Question 9 -- Using the plot above, what is your answer be to my question? What should I expect if I am flying from New York to San Francisco?**\n\n**Question 10 -- How did your answer change when using the plot versus using the summary statistics? i.e. What were you able to see in the plot that could could not \"see\" with the summary statistics?**\n",
"markdown": "---\ntitle: \"Visualizing and Summarizing Numerical Data\"\nauthor: \"Your group's names here!\"\ndate: \"April 13, 2023\"\nformat: html\nembed-resources: true\nstandalone: true\neditor: visual\nexecute: \n echo: true\n eval: false\n warning: false\n message: false\n---\n\n\n## Getting started\n\n### Load packages\n\nLet's load the following packages:\n\n- The **tidyverse** \"umbrella\" package which houses a suite of many different `R` packages for data wrangling and data visualization\n\n- Note: This did not work last week, but should this week!\n\n- The **openintro** `R` package: houses the dataset we will be working with\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Package for functions \nlibrary(tidyverse)\n\n# Package for data\nlibrary(openintro)\n```\n:::\n\n\n### The data\n\nThe [Bureau of Transportation Statistics](http://www.rita.dot.gov/bts/about/) (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). As its name implies, BTS collects and makes transportation data available, such as the flights data we will be working with in this lab.\n\nFirst, we'll view the `nycflights` data frame. Run the following code to load in the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(nycflights)\n```\n:::\n\n\nThe **codebook** (description of the variables) can be accessed by pulling up the help file by typing a `?` before the name of the dataset:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?nycflights\n```\n:::\n\n\nRemember that you can use `glimpse()` to take a quick peek at your data to understand its contents better.\n\n**Question 1**\n\n**(a) How large is the `nycflights` dataset? (i.e. How many rows and columns does it have?)**\n\n**(b) Are there numerical variables in the dataset? If so, what are their names?**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# You code for exercise 1 goes here! 
\n```\n:::\n\n\n### Departure Delays\n\nLet's start by examining the distribution of departure delays (`dep_delay`) of all flights with a histogram.\n\n**Question 2 -- Create a histogram of the `dep_delay` variable from the `nycflights` data**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code for exercise 2 goes here! \n```\n:::\n\n\nHistograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data is split between the different bins.\n\nYou can easily define the binwidth you want to use, by specifying the `binwidth` argument inside of `geom_histogram()`, like so: `geom_histogram(binwidth = 15)`\n\n**Question 3**\n\n**(a) Make two other histograms, one with a `binwidth` of 15 and one with a `binwidth` of 150.**\n\n\n::: {.cell layout-nrow=\"1\"}\n\n```{.r .cell-code}\n# Your code for exercise 3 goes here! \n```\n:::\n\n\n**(b) How do these three histograms compare? Are features revealed in one that are obscured in another?**\n\n## SFO Dstinations\n\nOne of the variables refers to the destination (i.e. airport) of the flight, which have three letter abbreviations. For example, flights into Los Angeles have a `dest` of `\"LAX\"`, flights into San Francisco have a `dest` of `\"SFO\"`, and flights into Chicago (O'Hare) have a `dest` of `\"ORD\"`.\n\nIf you want to visualize only on delays of flights headed to Los Angeles, you need to first `filter()` the data for flights with that destination (e.g., `filter(dest == \"LAX\")`) and then make a histogram of the departure delays of only those flights.\n\n**Logical operators:** Filtering for certain observations (e.g. flights from a particular airport) is often of interest in data frames where we might want to examine observations with certain characteristics separately from the rest of the data. To do so, you can use the `filter()` function and a series of **logical operators**. 
The most commonly used logical operators for data analysis are as follows:\n\n- `==` means \"equal to\"\n- `!=` means \"not equal to\"\n- `>` or `<` means \"greater than\" or \"less than\"\n- `>=` or `<=` means \"greater than or equal to\" or \"less than or equal to\"\n\n**Question 4 -- Fill in the code to create a new dataframe named `sfo_flights` that is the result of `filter()`ing only the observations whose destination was San Francisco.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fill in the code for exercise 4 here! \n\nsfo_flights <- filter(nycflights, \n dest == )\n```\n:::\n\n\n### Multiple Data Filters\n\nYou can filter based on multiple criteria! Within the `filter()` function, each criteria is separated using commas. For example, suppose you are interested in flights leaving from LaGuardia (LGA) in February:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Remember months are coded as numbers!\nfilter(nycflights, \n origin == \"LGA\", \n month == 2)\n```\n:::\n\n\nNote that you can separate the conditions using commas if you want flights that are both leaving from LGA **and** flights in February. If you are interested in either flights leaving from LGA **or** flights that happened in February, you can use the `|` instead of the comma.\n\n**Question 5 -- Fill in the code below to find the number of flights flying into SFO in July that arrived early. What does the result of this code tell you?**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Fill in the code for exercise 5 here! 
\n\nfilter(sfo_flights, \n month == __, \n arr_delay > __) |> \n dim()\n```\n:::\n\n\n## Data Summaries\n\nYou can also obtain numerical summaries for the flights headed to SFO, using the `summarise()` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummarise(sfo_flights, \n mean_dd = mean(dep_delay), \n median_dd = median(dep_delay), \n n = n())\n```\n:::\n\n\nNote that in the `summarise()` function I've created a list of three different numerical summaries that I'm interested in.\n\nThe names of these elements are user defined, like `mean_dd`, `median_dd`, `n`, and you can customize these names as you like (just don't use spaces in your names!).\n\nCalculating these summary statistics also requires that you know the summary functions you would like to use.\n\n**Summary statistics:** Some useful function calls for summary statistics for a single numerical variable are as follows:\n\n- `mean()`: calculates the average\n- `median()`: calculates the median\n- `sd()`: calculates the standard deviation\n- `var()`: calculates the variances\n- `IQR()`: calculates the inner quartile range (Q3 - Q1)\n- `min()`: finds the minimum\n- `max()`: finds the maximum\n- `n()`: reports the sample size\n\nNote that each of these functions takes a single variable as an input and returns a single value as an output.\n\n## Summaries vs. Visualizations\n\n*If I'm flying from New York to San Francisco, should I expect that my flights will typically arrive on time?*\n\nLet's think about how you could answer this question. One option is to summarize the data and inspect the output. Another option is to plot the delays and inspect the plots. Let's try both!\n\n**Question 6 -- Calculate the following statistics for the arrival delays in the `sfo_flights` dataset:**\n\n- mean\n- median\n- max\n- min\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Code for exercise 6 goes here! \n```\n:::\n\n\n**Question 7 -- Using the above summary statistics, what is your answer be to my question? 
What should I expect if I am flying from New York to San Francisco?**\n\n**Question 8 -- Now, rather than calculating summary statistics, plot the distribution of arrival delays for the `sfo_flights` dataset.**\n\n*Choose the type of plot you believe is appropriate for visualizing the **distribution** of departure delays.*\n\n*Don't forget to give your visualization informative axis labels!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Code for exercise 8 goes here! \n```\n:::\n\n\n**Question 9 -- Using the plot above, what is your answer be to my question? What should I expect if I am flying from New York to San Francisco?**\n\n**Question 10 -- How did your answer change when using the plot versus using the summary statistics? i.e. What were you able to see in the plot that could could not \"see\" with the summary statistics?**\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
15 changes: 7 additions & 8 deletions labs/lab-2.qmd
@@ -60,7 +60,7 @@ Remember that you can use `glimpse()` to take a quick peek at your data to under
**(b) Are there numerical variables in the dataset? If so, what are their names?**

```{r data-inspect}
- # You code for question 1 goes here!
+ # You code for exercise 1 goes here!
```
@@ -72,7 +72,7 @@ Let's start by examining the distribution of departure delays (`dep_delay`) of a
**Question 2 -- Create a histogram of the `dep_delay` variable from the `nycflights` data**

```{r histogram}
- # Your code for question 2 goes here!
+ # Your code for exercise 2 goes here!
```

@@ -86,7 +86,7 @@ You can easily define the binwidth you want to use, by specifying the `binwidth`

```{r binwidth}
#| layout-nrow: 1
- # Your code for question 3 goes here!
+ # Your code for exercise 3 goes here!
```
@@ -109,7 +109,7 @@ If you want to visualize only on delays of flights headed to Los Angeles, you ne
**Question 4 -- Fill in the code to create a new dataframe named `sfo_flights` that is the result of `filter()`ing only the observations whose destination was San Francisco.**

```{r sfo}
- # Your code for question 4 goes here!
+ # Fill in the code for exercise 4 here!
sfo_flights <- filter(nycflights,
dest == )
@@ -120,19 +120,18 @@ sfo_flights <- filter(nycflights,
You can filter based on multiple criteria! Within the `filter()` function, each criteria is separated using commas. For example, suppose you are interested in flights leaving from LaGuardia (LGA) in February:

```{r lga-feb}
+ ## Remember months are coded as numbers!
filter(nycflights,
origin == "LGA",
month == 2)
- ## Remember months are coded as numbers!
```

Note that you can separate the conditions using commas if you want flights that are both leaving from LGA **and** flights in February. If you are interested in either flights leaving from LGA **or** flights that happened in February, you can use the `|` instead of the comma.

**Question 5 -- Fill in the code below to find the number of flights flying into SFO in July that arrived early. What does the result of this code tell you?**

```{r}
- ## Code for exercise 5 here!
+ ## Fill in the code for exercise 5 here!
filter(sfo_flights,
month == __,
@@ -198,7 +197,7 @@ Let's think about how you could answer this question. One option is to summarize
*Don't forget to give your visualization informative axis labels!*

```{r arr-delay-plot}
- ## Code for exercise 7 goes here!
+ ## Code for exercise 8 goes here!
```
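
Reviewer note: the lab text in the `lga-feb` hunk says that comma-separated conditions in `filter()` mean **and**, while `|` means **or**. A minimal sketch of that difference follows. It assumes only that **dplyr** is installed; `toy_flights` and its values are hypothetical stand-ins for `nycflights`, not part of this commit.

```r
library(dplyr)

# Hypothetical toy data standing in for nycflights (values are made up)
toy_flights <- tibble(
  origin = c("LGA", "LGA", "JFK", "EWR"),
  month  = c(2, 7, 2, 5)
)

# AND: with commas, a row is kept only if every condition holds
lga_and_feb <- filter(toy_flights, origin == "LGA", month == 2)

# OR: with `|`, a row is kept if at least one condition holds
lga_or_feb <- filter(toy_flights, origin == "LGA" | month == 2)

nrow(lga_and_feb)  # 1 row: the February flight from LGA
nrow(lga_or_feb)   # 3 rows: both LGA flights plus the February JFK flight
```

On the real data the same pattern applies with `nycflights` in place of `toy_flights`. The row counts will differ.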
