diff --git a/_freeze/critique/critique-2/execute-results/html.json b/_freeze/critique/critique-2/execute-results/html.json index cc7f2ae2..4085600d 100644 --- a/_freeze/critique/critique-2/execute-results/html.json +++ b/_freeze/critique/critique-2/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "7f372c1e58c005995fcef2c1bca8426c", + "hash": "d554846cd1ce962e4a5fe17d8c98b92f", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Statistical Critique 2: Exploring p-values\"\nsubtitle: \"Due March 4, 2024 by 5pm\"\nformat: \n html:\n table-of-contents: true\n toc-depth: 2\n number-sections: true\n number-depth: 1\neditor: visual\n---\n\n\n![](images/significant.jpeg)\n\n## Assignment Details\n\nIn your second statistical critique, you will focus on critiquing another key aspect of any statistical argument---statistical significance. No doubt you have seen $p$-values in a previous statistical course and / or disciplinary course, and this week you're adding to that knowledge. For this critique you will compare the model you selected in your Midterm Project with what model you would have chosen based on a statistical test.\n\nThis critique involves coding! You can find a template for critique on [Posit Cloud](https://posit.cloud/).\n\n# Part Zero: p-values in Multiple Linear Regression\n\nFor the first step of this critique, you are required to read about how p-values can be used in the context of multiple linear regression: [Extending to Multiple Linear Regression](../weeks/chapters/week-8-reading-mlr.qmd \"Extending to Multiple Linear Regression\")\n\n# Part One: Revisiting the Midterm Project\n\nFor the first part of this critique, you are going to revisit the model you selected for your Midterm Project. You need to copy-and-paste the code you wrote in your Midterm Project to create your 2-3 visualizations. 
After these visualizations, you should write a 2-3 sentence justification as to *why* you chose the model you did in your Midterm Project.\n\n# Part Two: Using p-values Instead\n\nFor this second part, you are tasked with testing what regression model you would have chosen if you had used p-values to make your decision. Regardless of the model you chose for your Midterm Project, you will fit the **most complex** regression model. If you used two numerical explanatory variables, the most complex model has **both** variables included. If you used one numerical and one categorical explanatory variable, the most complex model is the different slopes (interaction) model.\n\n### For two numerical explanatory variables\n\n1. fit a multiple linear regression with **both** variables included:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_model <- lm(bill_length_mm ~ body_mass_g + flipper_length_mm, \n data = penguins)\n```\n:::\n\n\n2. run an ANOVA to test if each variable should be included:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(my_model)\n```\n:::\n\n\n### For one numerical and one categorical explanatory variable\n\n1. fit a different slopes multiple linear regression:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_model <- lm(bill_length_mm ~ flipper_length_mm * species, \n data = penguins)\n```\n:::\n\n\n2. run an ANOVA to test for different slopes\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(my_model)\n```\n:::\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n# Part Three: Learning More about the Backlash Against $p$-values\n\n> \"The p-value was never intended to be a substitute for scientific reasoning.\" Ron Wasserstein, Executive Director of the American Statistical Association\n\nIssues with the use of $p$-values had gotten so problematic that the American Statistical Association (ASA)[^1] put out a statement in 2016 titled, [\"The ASA Statement on Statistical Significance and $p$-Values\"](https://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf). 
This statement includes six principles which address misconceptions and misuse of the $p$-value.\n\n[^1]: This is my professional organization.\n\nIn March of 2019, Valentin Amrhein, Sander Greenland, Blake McShane and more than 800 signatories published an article in Nature [calling for an end to \"statistical significance\"](https://www.nature.com/articles/d41586-019-00857-9). The article details how, on top of the many common misunderstandings about hypothesis testing and $p$-values, there is an incentive for researchers to \"cherry pick\" only the results that are \"statistically significant\" while dismissing those that aren't. There are two problems with this system:\n\n1. it incentivizes researchers to do whatever it takes to obtain \"significant\" p-values, even through dishonest means\n2. it dismisses the importance of results where no \"significant\" effects are found\n\n\n\n\n\n\n\nFor Part Three, you are going to inspect what the publication requirements are for journal the article you selected (in Week 1) was published in. \n\n:::{.callout-tip}\n# Statistics in Your Field\nYou are revisiting (again) the article you chose in Week 1 for the \"Statistics in your Field\" assignment! \n:::\n\nFirst, go to the website for the journal where your article was published. Now, find their criteria for publication. If you are having a difficult time finding these criteria, it may be simpler to Google \"*title of journal* publication criteria,\" substituting the name of your journal.\n\nSearch through the criteria and see what the requirements are for (1) the \"significance\" of the findings and (2) the availability of the data and / or analyses. 
Describe what you find!\n\n::: callout-tip\nFeel free to type out what you find while searching the journal or simply copy-and-paste the criteria you find listed on their website.\n:::\n\n# Part Four: Lessons Learned\n\nNow that you have explored the use of p-values for model selection and publication criteria, write down **two** things you have learned that you will take with you.\n", + "markdown": "---\ntitle: \"Statistical Critique 2: Exploring p-values\"\nformat: \n html:\n table-of-contents: true\n toc-depth: 2\n number-sections: true\n number-depth: 1\neditor: visual\n---\n\n\n![](images/significant.jpeg)\n\n## Assignment Details\n\nIn your second statistical critique, you will focus on critiquing another key aspect of any statistical argument---statistical significance. No doubt you have seen $p$-values in a previous statistical course and / or disciplinary course, and this week you're adding to that knowledge. For this critique you will compare the model you selected in your Midterm Project with the model you would have chosen based on a statistical test.\n\nThis critique involves coding! You can find a template for the critique on [Posit Cloud](https://posit.cloud/).\n\n# Part Zero: p-values in Multiple Linear Regression\n\nFor the first step of this critique, you are required to read about how p-values can be used in the context of multiple linear regression: [Extending to Multiple Linear Regression](../weeks/chapters/week-8-reading-mlr.qmd \"Extending to Multiple Linear Regression\")\n\n# Part One: Revisiting the Midterm Project\n\nFor the first part of this critique, you are going to revisit the model you selected for your Midterm Project. You need to copy-and-paste the code you wrote in your Midterm Project to create your 2-3 visualizations. 
After these visualizations, you should write a 2-3 sentence justification as to *why* you chose the model you did in your Midterm Project.\n\n# Part Two: Using p-values Instead\n\nFor this second part, you are tasked with testing what regression model you would have chosen if you had used p-values to make your decision. Regardless of the model you chose for your Midterm Project, you will fit the **most complex** regression model. If you used two numerical explanatory variables, the most complex model has **both** variables included. If you used one numerical and one categorical explanatory variable, the most complex model is the different slopes (interaction) model.\n\n### For two numerical explanatory variables\n\n1. fit a multiple linear regression with **both** variables included:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_model <- lm(bill_length_mm ~ body_mass_g + flipper_length_mm, \n data = penguins)\n```\n:::\n\n\n2. run an ANOVA to test if each variable should be included:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(my_model)\n```\n:::\n\n\n### For one numerical and one categorical explanatory variable\n\n1. fit a different slopes multiple linear regression:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_model <- lm(bill_length_mm ~ flipper_length_mm * species, \n data = penguins)\n```\n:::\n\n\n2. run an ANOVA to test for different slopes\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(my_model)\n```\n:::\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n# Part Three: Learning More about the Backlash Against $p$-values\n\n> \"The p-value was never intended to be a substitute for scientific reasoning.\" Ron Wasserstein, Executive Director of the American Statistical Association\n\nIssues with the use of $p$-values had gotten so problematic that the American Statistical Association (ASA)[^1] put out a statement in 2016 titled, [\"The ASA Statement on Statistical Significance and $p$-Values\"](https://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf). 
This statement includes six principles that address misconceptions and misuse of the $p$-value.\n\n[^1]: This is my professional organization.\n\nIn March of 2019, Valentin Amrhein, Sander Greenland, Blake McShane and more than 800 signatories published an article in Nature [calling for an end to \"statistical significance\"](https://www.nature.com/articles/d41586-019-00857-9). The article details how, on top of the many common misunderstandings about hypothesis testing and $p$-values, there is an incentive for researchers to \"cherry pick\" only the results that are \"statistically significant\" while dismissing those that aren't. There are two problems with this system:\n\n1. it incentivizes researchers to do whatever it takes to obtain \"significant\" p-values, even through dishonest means\n2. it dismisses the importance of results where no \"significant\" effects are found\n\n\n\n\n\n\n\nFor Part Three, you are going to inspect the publication requirements of the journal in which the article you selected (in Week 1) was published.\n\n::: callout-tip\n# Statistics in Your Field\n\nYou are revisiting (again) the article you chose in Week 1 for the \"Statistics in your Field\" assignment!\n:::\n\nFirst, go to the website for the journal where your article was published. Now, find their criteria for publication. If you are having a difficult time finding these criteria, it may be simpler to Google \"*title of journal* publication criteria,\" substituting the name of your journal.\n\nSearch through the criteria and see what the requirements are for (1) the \"significance\" of the findings and (2) the availability of the data and / or analyses. 
Describe what you find!\n\n::: callout-tip\nFeel free to type out what you find while searching the journal or simply copy-and-paste the criteria you find listed on their website.\n:::\n\n# Part Four: Lessons Learned\n\nNow that you have explored the use of p-values for model selection and publication criteria, write down **two** things you have learned that you will take with you.\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/labs/lab-1/execute-results/html.json b/_freeze/labs/lab-1/execute-results/html.json index 40c80600..8615285f 100644 --- a/_freeze/labs/lab-1/execute-results/html.json +++ b/_freeze/labs/lab-1/execute-results/html.json @@ -1,9 +1,11 @@ { - "hash": "90fda53ebdc205ed529529fb22e68be6", + "hash": "6a23caebccb8846818a09dd99cb213a1", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Lab 1: Welcome to Posit Cloud!\"\nauthor: \"Your Name Here!\"\ndate: \"January 9, 2024\"\nformat: html\neditor: visual\nexecute: \n echo: true\n eval: false\n---\n\n\n# Quarto\n\nThis is a Quarto document!\n\nQuarto is a software that allows you to interweave text and R code to create HTML, PDF, and Microsoft Word documents\n\nThere are two ways to view a Quarto document, (1) as the \"Source\" file, or (2) as the \"Visual\" file. We will **only** use the Visual option in this class, as it allows you to interact with Quarto similar to how you interact with Word.\n\n## Formatting your Document\n\nSimilar to a Word Doc, there are a variety of ways you can spice up a Quarto document! Let's explore a few.\n\n**Question 1:** Using the formatting options, make a numbered list of your top three favorite animals.\n\n**Question 2:** Using the formatting options, insert an image of your favorite animal.\n\n**Question 3:** Now, change the \"Formatting your Document\" section name to the name of your favorite animal. 
Make sure your header is a level 1 -- use the Header 1 formatting option!\n\n## R Code\n\nYou can differentiate the R code within a Quarto file from the body of the document, based on the gray boxes that start with an `{r}.`\n\nHere is an example of an R code chunk:\n\n\n\n\n\nNotice in the line after the `{r}` there are two lines that start with `#|` – this is the symbol that declares options for a code chunk. The `#| label:` allows us to specify a name for a code chunk, I typically choose a name that tells me what the code chunk does (e.g., load-packages, clean-data). The `#| include: false` option at the beginning of the code chunk controls how the code output looks in our final rendered document.\n\nThis code chunk has two things we want to pay attention to:\n\n1. The `library(tidyverse)` code loads in an R package called the \"tidyverse\". This is code you will have in **every** lab assignment for this class!\n\n2. Code comments which are denoted by a `#` symbol. Code comments are a way for you (and me) to write what the code is doing, without R thinking what we are writing is code it should execute.\n\n## Rendering\n\nWhen you click the **Render** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.\n\n**Question 4:** Do you see the above code chunk when you knit the document? Why do you think this is the case?\n\n## Including Code Output\n\nYou can include code output in your knitted document:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglimpse(mpg)\n```\n:::\n\n\n**Question 5:** What do you think the above code does? 
What type of output does it give you?\\\n*Hint: You have saw this type of output on Tuesday!\\\n*\n\n## Including Plots\n\nYou can also embed plots in the rendered document.\n\nHere is an example of a plot.\n\n\n::: {.cell}\n\n:::\n\n\n**Question 6**: What do you think the `echo: false` option does in the above code chunk?\n\n**Question 7:** What do you think the `mapping = aes(y = manufacturer, x = hwy))` code does?\n\n**Question 8:** What do you think the `labs(x = \"Highway Miles Per Gallon\", y = \"Car Manufacturer\")` code does?\n", - "supporting": [], + "markdown": "---\ntitle: \"Lab 1: Welcome to Posit Cloud!\"\nauthor: \"Your Name Here!\"\nformat: html\neditor: visual\nembed-resources: true\nexecute: \n echo: true\n eval: false\n---\n\n\n# Quarto\n\nThis is a Quarto document!\n\nQuarto is software that allows you to interweave text and R code to create HTML, PDF, and Microsoft Word documents.\n\nThere are two ways to view a Quarto document: (1) as the \"Source\" file, or (2) as the \"Visual\" file. We will **only** use the Visual option in this class, as it allows you to interact with Quarto similar to how you interact with Word.\n\n## Formatting your Document\n\nSimilar to a Word Doc, there are a variety of ways you can spice up a Quarto document! Let's explore a few.\n\n**Question 1:** Using the formatting options, make a numbered list of your top three favorite animals.\n\n**Question 2:** Using the formatting options, insert an image of your favorite animal.\n\n**Question 3:** Now, change the \"Formatting your Document\" section name to the name of your favorite animal. 
Make sure your header is a level 1 -- use the Header 1 formatting option!\n\n## R Code\n\nYou can differentiate the R code within a Quarto file from the body of the document, based on the gray boxes that start with `{r}`.\n\nHere is an example of an R code chunk:\n\n\n\n\n\nNotice that after the `{r}` there are two lines that start with `#|` – this is the symbol that declares options for a code chunk. The `#| label:` option allows us to specify a name for a code chunk; I typically choose a name that tells me what the code chunk does (e.g., load-packages, clean-data). The `#| include: false` option at the beginning of the code chunk controls how the code output looks in our final rendered document.\n\nThis code chunk has two things we want to pay attention to:\n\n1. The `library(tidyverse)` code loads in an R package called the \"tidyverse\". This is code you will have in **every** lab assignment for this class!\n\n2. Code comments, which are denoted by a `#` symbol. Code comments are a way for you (and me) to write what the code is doing, without R thinking what we are writing is code it should execute.\n\n## Rendering\n\nWhen you click the **Render** button, a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.\n\n**Question 4:** Do you see the above code chunk when you knit the document? Why do you think this is the case?\n\n## Including Code Output\n\nYou can include code output in your knitted document:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglimpse(mpg)\n```\n:::\n\n\n**Question 5:** What do you think the above code does? 
What type of output does it give you?\\\n*Hint: You saw this type of output on Tuesday!\\\n*\n\n## Including Plots\n\nYou can also embed plots in the rendered document.\n\nHere is an example of a plot.\n\n\n::: {.cell}\n\n:::\n\n\n**Question 6:** What do you think the `echo: false` option does in the above code chunk?\n\n**Question 7:** What do you think the `mapping = aes(y = manufacturer, x = hwy))` code does?\n\n**Question 8:** What do you think the `labs(x = \"Highway Miles Per Gallon\", y = \"Car Manufacturer\")` code does?\n", "supporting": [ "lab-1_files" ], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/labs/lab-2/execute-results/html.json b/_freeze/labs/lab-2/execute-results/html.json index c5286311..e79c6545 100644 --- a/_freeze/labs/lab-2/execute-results/html.json +++ b/_freeze/labs/lab-2/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "a4bd2aeca013687c84318274b08e3cf5", + "hash": "f1b2ab48c113df793c9ce69977ba52f2", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Lab 2: Visualizing and Summarizing Numerical Data\"\nauthor: \"Your group's names here!\"\ndate: \"January 19, 2024\"\nformat: html\neditor: visual\nembed-resources: true\nexecute: \n echo: true\n eval: false\n warning: false\n message: false\n---\n\n\n## Getting started\n\n### Load packages\n\nLet's load the following packages:\n\n- The **tidyverse** \"umbrella\" package which houses a suite of many different `R` packages for data wrangling and data visualization\n\n- The **openintro** `R` package: houses the dataset we will be working with\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Package for functions \nlibrary(tidyverse)\n\n# Package for data\nlibrary(openintro)\n```\n:::\n\n\n### The data\n\nThe [Bureau of Transportation Statistics](http://www.rita.dot.gov/bts/about/) (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). 
As its name implies, BTS collects and makes transportation data available, such as the flights data we will be working with in this lab.\n\nFirst, we'll view the `nycflights` data frame. Run the following code to load in the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(nycflights)\n```\n:::\n\n\nThe **codebook** (description of the variables) can be accessed by pulling up the help file by typing a `?` before the name of the dataset:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?nycflights\n```\n:::\n\n\nRemember that you can use `glimpse()` to take a quick peek at your data to understand its contents better.\n\n**Question 1**\n\n**(a) How large is the `nycflights` dataset? (i.e. How many rows and columns does it have?)**\n\n**(b) Are there numerical variables in the dataset? If so, what are their names?**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# You code for exercise 1 goes here! Yes, your answer should use code!\n```\n:::\n\n\n### Departure Delays\n\nLet's start by examining the distribution of departure delays (`dep_delay`) of all flights with a histogram.\n\n**Question 2 -- Create a histogram of the `dep_delay` variable from the `nycflights` data.** *Don't forget to give your visualization informative axis labels that include the units the variable was measured in!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code for exercise 2 goes here! 
\n```\n:::\n\n\nHistograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data into different bins.\n\nYou can easily define the binwidth you want to use, by specifying the `binwidth` argument inside of `geom_histogram()`, like so:\n\n`geom_histogram(binwidth = 15)`\n\n**Question 3**\n\n**(a) Make two other histograms, one with a `binwidth` of 15 and one with a `binwidth` of 150.** *Feel free to copy-and-paste the code you used for Question 2 and modify the `binwidth`*.\n\n\n::: {.cell layout-nrow=\"1\"}\n\n```{.r .cell-code}\n# Your code for exercise 3 goes here! \n```\n:::\n\n\n**(b) How do these three histograms compare? Are features revealed in one that are obscured in another?**\n\n## SFO Destinations\n\nOne of the variables refers to the destination (i.e. airport) of the flight, which have three letter abbreviations. For example, flights into Los Angeles have a `dest` of `\"LAX\"`, flights into San Francisco have a `dest` of `\"SFO\"`, and flights into Chicago (O'Hare) have a `dest` of `\"ORD\"`.\n\nIf you want to visualize only on delays of flights headed to Los Angeles, you need to first `filter()` the data for flights with that destination (e.g., `filter(dest == \"LAX\")`) and then make a histogram of the departure delays of only those flights.\n\n**Logical operators:** Filtering for certain observations (e.g. flights from a particular airport) is often of interest in data frames where we might want to examine observations with certain characteristics separately from the rest of the data. To do so, you can use the `filter()` function and a series of **logical operators**. 
The most commonly used logical operators for data analysis are as follows:\n\n- `==` means \"equal to\"\n- `!=` means \"not equal to\"\n- `>` or `<` means \"greater than\" or \"less than\"\n- `>=` or `<=` means \"greater than or equal to\" or \"less than or equal to\"\n\n**Question 4 -- Fill in the code to create a new dataframe named `sfo_flights` that is the result of `filter()`ing only the observations whose destination was San Francisco.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsfo_flights <- filter(nycflights, \n dest == )\n```\n:::\n\n\n### Multiple Data Filters\n\nYou can filter based on multiple criteria! Within the `filter()` function, each criteria is separated using commas. For example, suppose you are interested in flights leaving from LaGuardia (LGA) in February:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfilter(nycflights, \n origin == \"LGA\", \n month == 2)\n\n## Remember months are coded as numbers (February = 2)!\n```\n:::\n\n\nNote that you can separate the conditions using commas if you want flights that are both leaving from LGA **and** flights in February. If you are interested in either flights leaving from LGA **or** flights that happened in February, you can use the `|` instead of the comma.\n\n**Question 5 -- Fill in the code below to find the number of flights flying into SFO in July that arrived early.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfilter(sfo_flights, \n month == __, \n arr_delay > __) %>% \n dim()\n```\n:::\n\n\n**Question 6 -- When you ran the code above it output two numbers. 
What do those numbers tell you about the number of flights that met your criteria (SFO, July, arrived early)?**\n\n## Data Summaries\n\nYou can also obtain numerical summaries for the flights headed to SFO, using the `summarise()` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummarise(sfo_flights, \n mean_dd = mean(dep_delay), \n median_dd = median(dep_delay), \n n = n())\n```\n:::\n\n\nNote that in the `summarise()` function I've created a list of three different numerical summaries that I'm interested in.\n\nThe names of these elements are user defined, like `mean_dd`, `median_dd`, `n`, and you can customize these names as you like (just don't use spaces in your names!).\n\nCalculating these summary statistics also requires that you know the summary functions you would like to use.\n\n**Summary statistics:** Some useful function calls for summary statistics for a single numerical variable are as follows:\n\n- `mean()`: calculates the average\n- `median()`: calculates the median\n- `sd()`: calculates the standard deviation\n- `var()`: calculates the variances\n- `IQR()`: calculates the inner quartile range (Q3 - Q1)\n- `min()`: finds the minimum\n- `max()`: finds the maximum\n- `n()`: reports the sample size\n\nNote that each of these functions takes a single variable as an input and returns a single value as an output.\n\n## Summaries vs. Visualizations\n\n*If I'm flying from New York to San Francisco, should I expect that my flights will typically arrive on time?*\n\nLet's think about how you could answer this question. One option is to summarize the data and inspect the output. Another option is to plot the delays and inspect the plots. Let's try both!\n\n**Question 7 -- Calculate the following statistics for the arrival delays in the `sfo_flights` dataset:**\n\n- mean\n- median\n- max\n- min\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Code for exercise 7 goes here! 
\n```\n:::\n\n\n**Question 8 -- Using the above summary statistics, what is your answer be to my question? What should I expect if I am flying from New York to San Francisco?**\n\n**Question 9 -- Now, rather than calculating summary statistics, plot the distribution of arrival delays for the `sfo_flights` dataset.**\n\n*Choose the type of plot you believe is appropriate for visualizing the **distribution** of arrival delays. Don't forget to give your visualization informative axis labels that include the units of measurement!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Code for exercise 9 goes here! \n```\n:::\n\n\n**Question 10 -- Using the plot above, what is your answer be to my question? What should I expect if I am flying from New York to San Francisco?**\n\n**Question 11 -- How did your answer change when using the plot versus using the summary statistics? i.e. What were you able to see in the plot that could could not \"see\" with the summary statistics?**\n", + "markdown": "---\ntitle: \"Lab 2: Visualizing and Summarizing Numerical Data\"\nauthor: \"Your group's names here!\"\nformat: html\neditor: visual\nembed-resources: true\nexecute: \n echo: true\n eval: false\n warning: false\n message: false\n---\n\n\n## Getting started\n\n### Load packages\n\nLet's load the following packages:\n\n- The **tidyverse** \"umbrella\" package which houses a suite of many different `R` packages for data wrangling and data visualization\n\n- The **openintro** `R` package: houses the dataset we will be working with\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Package for functions \nlibrary(tidyverse)\n\n# Package for data\nlibrary(openintro)\n```\n:::\n\n\n### The data\n\nThe [Bureau of Transportation Statistics](http://www.rita.dot.gov/bts/about/) (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). 
As its name implies, BTS collects and makes transportation data available, such as the flights data we will be working with in this lab.\n\nFirst, we'll view the `nycflights` data frame. Run the following code to load in the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(nycflights)\n```\n:::\n\n\nThe **codebook** (description of the variables) can be accessed by pulling up the help file by typing a `?` before the name of the dataset:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?nycflights\n```\n:::\n\n\nRemember that you can use `glimpse()` to take a quick peek at your data to understand its contents better.\n\n**Question 1**\n\n**(a) How large is the `nycflights` dataset? (i.e. How many rows and columns does it have?)**\n\n**(b) Are there numerical variables in the dataset? If so, what are their names?**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code for exercise 1 goes here! Yes, your answer should use code!\n```\n:::\n\n\n### Departure Delays\n\nLet's start by examining the distribution of departure delays (`dep_delay`) of all flights with a histogram.\n\n**Question 2 -- Create a histogram of the `dep_delay` variable from the `nycflights` data.** *Don't forget to give your visualization informative axis labels that include the units the variable was measured in!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code for exercise 2 goes here! 
\n```\n:::\n\n\nHistograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data are split into different bins.\n\nYou can easily define the binwidth you want to use by specifying the `binwidth` argument inside of `geom_histogram()`, like so:\n\n`geom_histogram(binwidth = 15)`\n\n**Question 3**\n\n**(a) Make two other histograms, one with a `binwidth` of 15 and one with a `binwidth` of 150.** *Feel free to copy-and-paste the code you used for Question 2 and modify the `binwidth`*.\n\n\n::: {.cell layout-nrow=\"1\"}\n\n```{.r .cell-code}\n# Your code for exercise 3 goes here! \n```\n:::\n\n\n**(b) How do these three histograms compare? Are features revealed in one that are obscured in another?**\n\n## SFO Destinations\n\nOne of the variables refers to the destination (i.e. airport) of the flight, which has a three-letter abbreviation. For example, flights into Los Angeles have a `dest` of `\"LAX\"`, flights into San Francisco have a `dest` of `\"SFO\"`, and flights into Chicago (O'Hare) have a `dest` of `\"ORD\"`.\n\nIf you want to visualize only the delays of flights headed to Los Angeles, you need to first `filter()` the data for flights with that destination (e.g., `filter(dest == \"LAX\")`) and then make a histogram of the departure delays of only those flights.\n\n**Logical operators:** Filtering for certain observations (e.g. flights from a particular airport) is often of interest in data frames where we might want to examine observations with certain characteristics separately from the rest of the data. To do so, you can use the `filter()` function and a series of **logical operators**. 
The most commonly used logical operators for data analysis are as follows:\n\n- `==` means \"equal to\"\n- `!=` means \"not equal to\"\n- `>` or `<` means \"greater than\" or \"less than\"\n- `>=` or `<=` means \"greater than or equal to\" or \"less than or equal to\"\n\n**Question 4 -- Fill in the code to create a new dataframe named `sfo_flights` that is the result of `filter()`ing only the observations whose destination was San Francisco.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsfo_flights <- filter(nycflights, \n dest == )\n```\n:::\n\n\n### Multiple Data Filters\n\nYou can filter based on multiple criteria! Within the `filter()` function, the criteria are separated using commas. For example, suppose you are interested in flights leaving from LaGuardia (LGA) in February:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfilter(nycflights, \n origin == \"LGA\", \n month == 2)\n\n## Remember months are coded as numbers (February = 2)!\n```\n:::\n\n\nNote that you can separate the conditions using commas if you want flights that are both leaving from LGA **and** flights in February. If you are interested in either flights leaving from LGA **or** flights that happened in February, you can use the `|` instead of the comma.\n\n**Question 5 -- Fill in the code below to find the number of flights flying into SFO in July that arrived early.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfilter(sfo_flights, \n month == __, \n arr_delay > __) %>% \n dim()\n```\n:::\n\n\n**Question 6 -- When you ran the code above, it output two numbers. 
What do those numbers tell you about the number of flights that met your criteria (SFO, July, arrived early)?**\n\n## Data Summaries\n\nYou can also obtain numerical summaries for the flights headed to SFO, using the `summarise()` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummarise(sfo_flights, \n mean_dd = mean(dep_delay), \n median_dd = median(dep_delay), \n n = n())\n```\n:::\n\n\nNote that in the `summarise()` function I've created a list of three different numerical summaries that I'm interested in.\n\nThe names of these elements are user defined, like `mean_dd`, `median_dd`, `n`, and you can customize these names as you like (just don't use spaces in your names!).\n\nCalculating these summary statistics also requires that you know the summary functions you would like to use.\n\n**Summary statistics:** Some useful function calls for summary statistics for a single numerical variable are as follows:\n\n- `mean()`: calculates the average\n- `median()`: calculates the median\n- `sd()`: calculates the standard deviation\n- `var()`: calculates the variance\n- `IQR()`: calculates the interquartile range (Q3 - Q1)\n- `min()`: finds the minimum\n- `max()`: finds the maximum\n- `n()`: reports the sample size\n\nNote that each of these functions takes a single variable as an input and returns a single value as an output.\n\n## Summaries vs. Visualizations\n\n*If I'm flying from New York to San Francisco, should I expect that my flights will typically arrive on time?*\n\nLet's think about how you could answer this question. One option is to summarize the data and inspect the output. Another option is to plot the delays and inspect the plots. Let's try both!\n\n**Question 7 -- Calculate the following statistics for the arrival delays in the `sfo_flights` dataset:**\n\n- mean\n- median\n- max\n- min\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Code for exercise 7 goes here! 
\n```\n:::\n\n\n**Question 8 -- Using the above summary statistics, what would your answer be to my question? What should I expect if I am flying from New York to San Francisco?**\n\n**Question 9 -- Now, rather than calculating summary statistics, plot the distribution of arrival delays for the `sfo_flights` dataset.**\n\n*Choose the type of plot you believe is appropriate for visualizing the **distribution** of arrival delays. Don't forget to give your visualization informative axis labels that include the units of measurement!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Code for exercise 9 goes here! \n```\n:::\n\n\n**Question 10 -- Using the plot above, what would your answer be to my question? What should I expect if I am flying from New York to San Francisco?**\n\n**Question 11 -- How did your answer change when using the plot versus using the summary statistics? i.e. What were you able to see in the plot that you could not \"see\" with the summary statistics?**\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/labs/lab-3/execute-results/html.json b/_freeze/labs/lab-3/execute-results/html.json index 4aa679ac..7aeb91f3 100644 --- a/_freeze/labs/lab-3/execute-results/html.json +++ b/_freeze/labs/lab-3/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "5a5aae73605631f0785d95b5bb6073bc", + "hash": "a3450ef8279e0782a52588459b1ac695", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Lab 3: Incorporating Categorical Variables\"\nauthor: \"Your group's names here!\"\ndate: \"January 26, 2024\"\nformat: html\neditor: visual\nembed-resources: true\nexecute: \n echo: true\n eval: false\n message: false\n warning: false\n---\n\n\n# Getting started\n\n## R Resources\n\nYou should have at least one member of your lab group pull up the R resources from Canvas. 
Specifically, the \"cheatsheets\" from Weeks 2 & 3 will be very helpful while completing this assignment.\n\n## Load packages\n\nIn this lab, we will explore and visualize the data using packages housed in the **tidyverse** suite of packages.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Package for ggplot and dplyr tools\nlibrary(tidyverse)\n\n## Package for ecological data\nlibrary(lterdatasampler)\n\n## Package for density ridge plots\nlibrary(ggridges)\n```\n:::\n\n\n## The data\n\nIn this lab we will work with data from the H.J. Andrews Experimental Forest. The following is a description of the data:\n\n> Populations of West Slope cutthroat trout (Onchorhyncus clarki clarki) in two standard reaches of Mack Creek in the H.J. Andrews Experimental Forest have been monitored since 1987. Monitoring of Pacific Giant Salamanders, Dicamptodon tenebrosus began in 1993. The two standard reaches are in a section of clearcut forest (ca. 1963) and an upstream 500 year old coniferous forest. Sub-reaches are sampled with 2-pass electrofishing, and all captured vertebrates are measured and weighed. Additionally, a set of channel measurements are taken with each sampling. This study constitutes one of the longest continuous records of salmonid populations on record.\n\nFirst, we'll view the `and_vertebrates` dataframe where these data are stored.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nView(and_vertebrates)\n```\n:::\n\n\n## Exploring the Dataset\n\nThe **codebook** (description of the variables) can be accessed by pulling up the help file by typing a `?` before the name of the dataset:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?and_vertebrates\n```\n:::\n\n\n**Question 1 -- How large is the `and_vertebrates` dataset? (i.e. How many rows and columns does the dataset have?)**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Your code for question 1 (and 2) goes here!\n```\n:::\n\n\n**Question 2 -- Are there categorical variables in the dataset? 
If so, what are their names?**\n\n## Accessing the Levels of a Variable\n\nThe `species` variable refers to the species of the animal which was captured. You can use the `distinct()` function to access the distinct values of a categorical variable (e.g., `distinct(nycflights, carrier)`). Notice the first input is the name of the dataset and the second input is the name of the categorical variable!\n\n**Question 3 -- Use the `distinct()` function to discover the levels / values of the `species` variable.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Your code for question 3 goes here!\n```\n:::\n\n\n# Data Wrangling\n\nAlright, you should have found that there is more than one species included in these data. For our analysis, we are only interested in Cutthroat trout.\n\nThis study used electrofishing to capture the trout. Electrofishing is a technique that uses direct current electricity flowing between a submerged cathode and anode, to insert an electric current into the water. This current stuns fish in a (hopefully) non-lethal manner, in order to capture them for marking and measuring. Technically, smaller fish are less affected by the current, so there presumably is a size of fish that is \"uncatchable\". For our analysis, we are going to filter out fish that are less than 4 inches long, as this size of fish is nearly impossible to capture.\n\n**Question 4 -- Use the `filter()` function to include *only* observations on Cutthroat trout whose `length_1_mm` is greater than 4 inches (or 101 mm).** *The only part you need to remove is the `...`! 
Keep the `trout <-`!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrout <- ...\n```\n:::\n\n\n# Data Visualizations\n\nAlright, now that we've gotten our data ready for analysis, let's start with some visualizations.\n\n**Question 5 -- Using `ggplot()` create a visualization of the *distribution* of the lengths of the Cutthroat trout (from the `trout` dataset you `filter`ed above).** *Your plot should have axis labels that describe the variable being plotted, and its associated units!*\n\n*Keep in mind your plot should only extend to 101mm if you completed #4 correctly.*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Your code for question 5 goes here!\n```\n:::\n\n\n**Question 6 -- Name three possible sources of variation for the length of a Cutthroat trout.**\n\n## Adding Categorical Variables\n\nWhen we are interested in comparing the distribution of a numerical variable across groups of a categorical variable, we \"typically\" see people use stacked histograms or side-by-side boxplots. I believe an unsung hero of these types of comparisons is the **ridge plot**.\n\nAs introduced in *Introduction to Modern Statistics*, a ridge plot essentially has multiple density plots stacked in the same plotting window. A key feature of ridge plots is that a categorical variable is **always** on the y-axis, with a numeric variable on the x-axis.\n\nIn R, we use the `geom_density_ridges()` function from the **ggridges** package to create a ridge plot. Yes, this is new, but don't worry! 
The function has the same layout as things you've seen before.\n\n**Question 7 -- Fill in the code below to create a ridge plot comparing the lengths of Cutthroat trout between the different types of channels (`unittype`).** *Be sure to add nice axis labels to your plot, which describe the variables being plotted (and their units)!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(data = trout, \n mapping = aes(x = , \n y = )\n ) +\n geom_density_ridges() \n```\n:::\n\n\n**Question 8 -- Modify your plot from #7 to incorporate the `section` of the forest into your plot, using either `color` or `facet`s.**\n\n*Hint: The `fill` aesthetic will **fill** the ridge plots with color.*\n\n**Question 9 -- Based on your plot, how different are the lengths of the Cutthroat trout between the different channel types and forest sections?** *Be sure to address how the centers and shapes of these distributions compare!*\n\n# Data Summaries\n\nPaired with visualizations, summary statistics can provide a clearer picture for the comparisons we are interested in. To obtain summary statistics for different groups of a categorical variable, we need to use our friend the `group_by()` function.\n\n**Question 10 -- Find the average length of Cutthroat trout from the different channel types (`unittype`).** *Be sure to use the `trout` dataset from Question 4!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Your code for question 10 goes here!\n```\n:::\n\n\n**Question 11 -- Find the average length of Cutthroat trout from the different channel types (`unittype`) *and* forest `section`.** *Be sure to use the `trout` dataset from Question 4!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Your code for question 11 goes here!\n```\n:::\n\n\n**Question 12 -- How do the differences in these averages compare with what you saw in your visualization in Question 9? 
Why do you believe they are similar or different from what you saw in the visualizations?**\n\n*Hint: Your response should directly compare the statistics you got in Question 11 with the density ridges you saw in Question 8.*\n", + "markdown": "---\ntitle: \"Lab 3: Incorporating Categorical Variables\"\nauthor: \"Your group's names here!\"\nformat: html\neditor: visual\nembed-resources: true\nexecute: \n echo: true\n eval: false\n message: false\n warning: false\n---\n\n\n# Getting started\n\n## R Resources\n\nYou should have at least one member of your lab group pull up the R resources from Canvas. Specifically, the \"cheatsheets\" from Weeks 2 & 3 will be very helpful while completing this assignment.\n\n## Load packages\n\nIn this lab, we will explore and visualize the data using packages housed in the **tidyverse** suite of packages.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Package for ggplot and dplyr tools\nlibrary(tidyverse)\n\n## Package for ecological data\nlibrary(lterdatasampler)\n\n## Package for density ridge plots\nlibrary(ggridges)\n```\n:::\n\n\n## The data\n\nIn this lab we will work with data from the H.J. Andrews Experimental Forest. The following is a description of the data:\n\n> Populations of West Slope cutthroat trout (Onchorhyncus clarki clarki) in two standard reaches of Mack Creek in the H.J. Andrews Experimental Forest have been monitored since 1987. Monitoring of Pacific Giant Salamanders, Dicamptodon tenebrosus began in 1993. The two standard reaches are in a section of clearcut forest (ca. 1963) and an upstream 500 year old coniferous forest. Sub-reaches are sampled with 2-pass electrofishing, and all captured vertebrates are measured and weighed. Additionally, a set of channel measurements are taken with each sampling. 
This study constitutes one of the longest continuous records of salmonid populations on record.\n\nFirst, we'll view the `and_vertebrates` dataframe where these data are stored.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nView(and_vertebrates)\n```\n:::\n\n\n## Exploring the Dataset\n\nThe **codebook** (description of the variables) can be accessed by pulling up the help file by typing a `?` before the name of the dataset:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?and_vertebrates\n```\n:::\n\n\n**Question 1 -- How large is the `and_vertebrates` dataset? (i.e. How many rows and columns does the dataset have?)**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Your code for question 1 (and 2) goes here!\n```\n:::\n\n\n**Question 2 -- Are there categorical variables in the dataset? If so, what are their names?**\n\n## Accessing the Levels of a Variable\n\nThe `species` variable refers to the species of the animal which was captured. You can use the `distinct()` function to access the distinct values of a categorical variable (e.g., `distinct(nycflights, carrier)`). Notice the first input is the name of the dataset and the second input is the name of the categorical variable!\n\n**Question 3 -- Use the `distinct()` function to discover the levels / values of the `species` variable.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Your code for question 3 goes here!\n```\n:::\n\n\n# Data Wrangling\n\nAlright, you should have found that there is more than one species included in these data. For our analysis, we are only interested in Cutthroat trout.\n\nThis study used electrofishing to capture the trout. Electrofishing is a technique that uses direct current electricity flowing between a submerged cathode and anode, to insert an electric current into the water. This current stuns fish in a (hopefully) non-lethal manner, in order to capture them for marking and measuring. 
Technically, smaller fish are less affected by the current, so there presumably is a size of fish that is \"uncatchable\". For our analysis, we are going to filter out fish that are less than 4 inches long, as this size of fish is nearly impossible to capture.\n\n**Question 4 -- Use the `filter()` function to include *only* observations on Cutthroat trout whose `length_1_mm` is greater than 4 inches (or 101 mm).** *The only part you need to remove is the `...`! Keep the `trout <-`!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrout <- ...\n```\n:::\n\n\n# Data Visualizations\n\nAlright, now that we've gotten our data ready for analysis, let's start with some visualizations.\n\n**Question 5 -- Using `ggplot()` create a visualization of the *distribution* of the lengths of the Cutthroat trout (from the `trout` dataset you `filter`ed above).** *Your plot should have axis labels that describe the variable being plotted, and its associated units!*\n\n*Keep in mind your plot should only extend to 101mm if you completed #4 correctly.*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Your code for question 5 goes here!\n```\n:::\n\n\n**Question 6 -- Name three possible sources of variation for the length of a Cutthroat trout.**\n\n## Adding Categorical Variables\n\nWhen we are interested in comparing the distribution of a numerical variable across groups of a categorical variable, we \"typically\" see people use stacked histograms or side-by-side boxplots. I believe an unsung hero of these types of comparisons is the **ridge plot**.\n\nAs introduced in *Introduction to Modern Statistics*, a ridge plot essentially has multiple density plots stacked in the same plotting window. A key feature of ridge plots is that a categorical variable is **always** on the y-axis, with a numeric variable on the x-axis.\n\nIn R, we use the `geom_density_ridges()` function from the **ggridges** package to create a ridge plot. Yes, this is new, but don't worry! 
The function has the same layout as things you've seen before.\n\n**Question 7 -- Fill in the code below to create a ridge plot comparing the lengths of Cutthroat trout between the different types of channels (`unittype`).** *Be sure to add nice axis labels to your plot, which describe the variables being plotted (and their units)!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(data = trout, \n mapping = aes(x = , \n y = )\n ) +\n geom_density_ridges() \n```\n:::\n\n\n**Question 8 -- Modify your plot from #7 to incorporate the `section` of the forest into your plot, using either `color` or `facet`s.**\n\n*Hint: The `fill` aesthetic will **fill** the ridge plots with color.*\n\n**Question 9 -- Based on your plot, how different are the lengths of the Cutthroat trout between the different channel types and forest sections?** *Be sure to address how the centers and shapes of these distributions compare!*\n\n# Data Summaries\n\nPaired with visualizations, summary statistics can provide a clearer picture for the comparisons we are interested in. To obtain summary statistics for different groups of a categorical variable, we need to use our friend the `group_by()` function.\n\n**Question 10 -- Find the average length of Cutthroat trout from the different channel types (`unittype`).** *Be sure to use the `trout` dataset from Question 4!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Your code for question 10 goes here!\n```\n:::\n\n\n**Question 11 -- Find the average length of Cutthroat trout from the different channel types (`unittype`) *and* forest `section`.** *Be sure to use the `trout` dataset from Question 4!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Your code for question 11 goes here!\n```\n:::\n\n\n**Question 12 -- How do the differences in these averages compare with what you saw in your visualization in Question 9? 
Why do you believe they are similar or different from what you saw in the visualizations?**\n\n*Hint: Your response should directly compare the statistics you got in Question 11 with the density ridges you saw in Question 8.*\n", "supporting": [ "lab-3_files" ], diff --git a/_freeze/labs/lab-4/execute-results/html.json b/_freeze/labs/lab-4/execute-results/html.json index 113b90e3..4bcc7def 100644 --- a/_freeze/labs/lab-4/execute-results/html.json +++ b/_freeze/labs/lab-4/execute-results/html.json @@ -1,7 +1,8 @@ { - "hash": "ed19023639db17987936b8a102a0323b", + "hash": "1b2f109a93cb7686ab5f9d86125a8e26", "result": { - "markdown": "---\ntitle: \"Lab 4: Simple Linear Regression\"\nauthor: \"The names of the people working in your group TODAY!\"\ndate: \"February 1, 2024\"\nformat: html\nembed-resources: true\neditor: visual\nexecute: \n eval: false\n echo: true\n warning: false\n message: false\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(lterdatasampler)\n\n# New package from reading\nlibrary(moderndive)\n```\n:::\n\n\n## Data for Today\n\nToday we'll be working with data on lake ice duration for two lakes surrounding Madison, Wisconsin. This dataset contains information on the number of days of ice (ice duration) on each lake for years between 1851 and 2019. These data are stored in the `ntl_icecover` dataset, which lives in the **lterdatasampler** package.\n\nAccording to the EPA, lake ice duration can be an indicator of climate change. This is because lake ice is dependent on several environmental factors, so changes in these factors will influence the formation of ice on top of lakes. As a result, the study and analysis of lake ice formation can inform scientists about how quickly the climate is changing, and are critical to minimizing disruptions to lake ecosystems.\n\n## Inspecting the Data\n\n**Question 1 -- How large is the `ntl_icecover` dataset? (i.e. 
How many rows and columns does it have?)**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to answer question 1 goes here!\n```\n:::\n\n\n## Visualize a Simple Linear Regression\n\nLet's start with tools to visualize and summarize linear regression.\n\n### Tools\n\n1. Visualize the relationship between x & y -- `geom_point()`\n2. Visualize the linear regression line -- `geom_smooth()`\n\nWe will be investigating the relationship between the `ice_duration` of each lake and the `year`.\n\n### Step 1\n\n**Question 2 -- Make a scatterplot of the relationship between the `ice_duration` (response) and the `year` (explanatory).** *Be sure to make the axis labels look nice, including any necessary units!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to answer question 2 goes here!\n```\n:::\n\n\n**Question 3 -- Describe the relationship you see in the scatterplot. Be sure to address the four aspects we discussed in class: form, direction, strength, and unusual points.** *Hint: You need to explicitly state **where** the unusual observations are!*\n\n### Step 2\n\nTo add a regression line on top of a scatterplot, you add (`+`) a `geom_smooth()` layer to your plot. However, if you add a \"plain\" `geom_smooth()` to the plot, it uses a wiggly line. You need to tell `geom_smooth()` what type of line you want it to use! We can get a straight line by including `method = \"lm\"` **inside** of `geom_smooth()`.\n\n**Question 4 -- Add a linear regression line to the scatterplot you made in Question 2.** *No code goes here, you need to modify your scatterplot from Question 2!*\n\n## Fit a Simple Linear Regression Model\n\nNext, we are going to summarize the relationship between `ice_duration` and `year` with a linear regression equation.\n\n### Tools\n\n1. Calculate the correlation between x & y -- `get_correlation()`\n2. Model the relationship between x & y -- `lm()`\n3. 
Explore coefficient estimates -- `get_regression_table()`\n\n### Step 1\n\n**Question 5 -- Calculate the correlation between these variables, using the `get_correlation()` function.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to answer question 5 goes here!\n```\n:::\n\n\n### Step 2\n\nNext, we will \"fit\" a linear regression with the `lm()` function. Remember, the \"formula\" for `lm()` is `response_variable ~ explanatory_variable`. Also recall that you need to tell `lm()` where the data live using the `data =` argument!\n\n**Question 6 -- Fit a linear regression modeling the relationship between `ice_duration` and `year`.** *The only part you need to remove is the `...`! Keep the `ice_lm <-`!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to answer question 6 goes here!\n\nice_lm <- ...\n```\n:::\n\n\n### Step 3\n\nFinally, to get the regression equation, we need to grab the coefficients out of the linear model object you made in Step 2. The `get_regression_table()` function is a handy tool to do just that!\n\n**Question 7 -- Use the `get_regression_table()` function to obtain the coefficient estimates for the `ice_lm` regression you fit in Question 6.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to answer question 7 goes here!\n```\n:::\n\n\n**Question 8 -- Using the coefficient estimates above, write out the estimated regression equation.** *Your equation needs to be in the context of the variables, not in generic* $x$ and $y$ statements!\n\n**Question 9 -- Interpret the value of the slope coefficient.** *Your interpretation needs to be in the context of the variables, not in generic* $x$ and $y$ statements!\n\n**Question 10 -- Sometimes interpreting a 1-unit increase in the explanatory variable is not a meaningful change, so we instead use larger increases. 
Based on your slope interpretation from Q9, what do you expect to happen to the duration of ice for an increase of 100 years?**\n\n## A preview of what's to come\n\nIn our analysis above, we only looked at the relationship between ice duration and year, not accounting for which lake the measurements came from. That is another (categorical) explanatory variable we could include in our regression model!\n\n**Question 11 -- Using the code you wrote for Question 4 (with the regression line added), add a `color` for the name of the lake (`lakeid`).**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to answer question 11 goes here!\n```\n:::\n", + "engine": "knitr", + "markdown": "---\ntitle: \"Lab 4: Simple Linear Regression\"\nauthor: \"The names of the people working in your group TODAY!\"\nformat: html\nembed-resources: true\neditor: visual\nexecute: \n eval: false\n echo: true\n warning: false\n message: false\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(lterdatasampler)\n\n# New package from reading\nlibrary(moderndive)\n```\n:::\n\n\n## Data for Today\n\nToday we'll be working with data on lake ice duration for two lakes surrounding Madison, Wisconsin. This dataset contains information on the number of days of ice (ice duration) on each lake for years between 1851 and 2019. These data are stored in the `ntl_icecover` dataset, which lives in the **lterdatasampler** package.\n\nAccording to the EPA, lake ice duration can be an indicator of climate change. This is because lake ice is dependent on several environmental factors, so changes in these factors will influence the formation of ice on top of lakes. As a result, the study and analysis of lake ice formation can inform scientists about how quickly the climate is changing, and are critical to minimizing disruptions to lake ecosystems.\n\n## Inspecting the Data\n\n**Question 1 -- How large is the `ntl_icecover` dataset? (i.e. 
How many rows and columns does it have?)**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to answer question 1 goes here!\n```\n:::\n\n\n## Visualize a Simple Linear Regression\n\nLet's start with tools to visualize and summarize linear regression.\n\n### Tools\n\n1. Visualize the relationship between x & y -- `geom_point()`\n2. Visualize the linear regression line -- `geom_smooth()`\n\nWe will be investigating the relationship between the `ice_duration` of each lake and the `year`.\n\n### Step 1\n\n**Question 2 -- Make a scatterplot of the relationship between the `ice_duration` (response) and the `year` (explanatory).** *Be sure to make the axis labels look nice, including any necessary units!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to answer question 2 goes here!\n```\n:::\n\n\n**Question 3 -- Describe the relationship you see in the scatterplot. Be sure to address the four aspects we discussed in class: form, direction, strength, and unusual points.** *Hint: You need to explicitly state **where** the unusual observations are!*\n\n### Step 2\n\nTo add a regression line on top of a scatterplot, you add (`+`) a `geom_smooth()` layer to your plot. However, if you add a \"plain\" `geom_smooth()` to the plot, it uses a wiggly line. You need to tell `geom_smooth()` what type of line you want it to use! We can get a straight line by including `method = \"lm\"` **inside** of `geom_smooth()`.\n\n**Question 4 -- Add a linear regression line to the scatterplot you made in Question 2.** *No code goes here, you need to modify your scatterplot from Question 2!*\n\n## Fit a Simple Linear Regression Model\n\nNext, we are going to summarize the relationship between `ice_duration` and `year` with a linear regression equation.\n\n### Tools\n\n1. Calculate the correlation between x & y -- `get_correlation()`\n2. Model the relationship between x & y -- `lm()`\n3. 
Explore coefficient estimates -- `get_regression_table()`\n\n### Step 1\n\n**Question 5 -- Calculate the correlation between these variables, using the `get_correlation()` function.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to answer question 5 goes here!\n```\n:::\n\n\n### Step 2\n\nNext, we will \"fit\" a linear regression with the `lm()` function. Remember, the \"formula\" for `lm()` is `response_variable ~ explanatory_variable`. Also recall that you need to tell `lm()` where the data live using the `data =` argument!\n\n**Question 6 -- Fit a linear regression modeling the relationship between `ice_duration` and `year`.** *The only part you need to remove is the `...`! Keep the `ice_lm <-`!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to answer question 6 goes here!\n\nice_lm <- ...\n```\n:::\n\n\n### Step 3\n\nFinally, to get the regression equation, we need to grab the coefficients out of the linear model object you made in Step 2. The `get_regression_table()` function is a handy tool to do just that!\n\n**Question 7 -- Use the `get_regression_table()` function to obtain the coefficient estimates for the `ice_lm` regression you fit in Question 6.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to answer question 7 goes here!\n```\n:::\n\n\n**Question 8 -- Using the coefficient estimates above, write out the estimated regression equation.** *Your equation needs to be in the context of the variables, not in generic* $x$ and $y$ statements!\n\n**Question 9 -- Interpret the value of the slope coefficient.** *Your interpretation needs to be in the context of the variables, not in generic* $x$ and $y$ statements!\n\n**Question 10 -- Sometimes interpreting a 1-unit increase in the explanatory variable is not a meaningful change, so we instead use larger increases. 
Based on your slope interpretation from Q9, what do you expect to happen to the duration of ice for an increase of 100 years?**\n\n## A preview of what's to come\n\nIn our analysis above, we only looked at the relationship between ice duration and year, not accounting for which lake the measurements came from. That is another (categorical) explanatory variable we could include in our regression model!\n\n**Question 11 -- Using the code you wrote for Question 4 (with the regression line added), add a `color` for the name of the lake (`lakeid`).**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to answer question 11 goes here!\n```\n:::\n", "supporting": [ "lab-4_files" ], diff --git a/_freeze/labs/lab-6/execute-results/html.json b/_freeze/labs/lab-6/execute-results/html.json index 3b1cc25e..6320b95d 100644 --- a/_freeze/labs/lab-6/execute-results/html.json +++ b/_freeze/labs/lab-6/execute-results/html.json @@ -1,9 +1,11 @@ { - "hash": "e5b63b8b6bc7cc8a3f2524239cbf5571", + "hash": "d2eed9358a048d26202d52c15e1bd765", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Lab 6: Predicting Professor Evaluation Scores\"\nauthor: \"Your group's names here!\"\ndate: \"February 15, 2024\"\nformat: html\neditor: visual\nexecute: \n eval: false\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(moderndive)\nlibrary(openintro)\n\nevals <- evals |> \n mutate(large_class = if_else(cls_students > 100, \n \"large class\", \n \"regular class\"), \n eval_completion = cls_did_eval / cls_students \n ) |> \n select(-cls_did_eval, \n -cls_students, \n -prof_id,\n -course_id, \n -bty_f1lower, \n -bty_f1upper, \n -bty_f2upper, \n -bty_m1lower, \n -bty_m1upper, \n -bty_m2upper)\n```\n:::\n\n\n## Your Challenge\n\nThis week you have learned about model selection. 
During class you worked on performing a backward selection process to determine the \"best\" model for penguin body mass.\n\nToday, you are going to use **forward selection** to determine the \"best\" model for a professor's evaluation score. This task will require you to fit **tons** of linear regressions. **You must be able to show me exactly how you got to your top model.** Meaning, I need to see a record of **every** model you fit and compared along the way.\n\n## Forward Selection\n\nThe forward selection process starts with a model with **no** predictor variables. That means, this model predicts the *same* mean evaluation score for every professor. I've fit this model for you below!\n\n\n::: {.cell}\n\n```{.r .cell-code}\none_mean <- lm(score ~ 1, data = evals)\n```\n:::\n\n\nYou can pull out the adjusted $R^2$ for this model using the `get_regression_summaries()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nget_regression_summaries(one_mean)\n```\n:::\n\n\nBased on this output, we are starting with a **really** low adjusted $R^2$. So, things can only get better from here!\n\n### Step 1\n\n**Rules: You can only add a variable to the model if it improves the adjusted** $R^2$ by at least 2% (0.02).\n\nAlright, so now we get cooking. The next step is to fit **every** model with **one** explanatory variable. 
I've provided a list of every explanatory variable you are allowed to consider!\n\n- `rank` -- rank of professor\n- `ethnicity` -- ethnicity of the professor\n- `gender` -- gender of the professor\n- `language` -- language of school where professor received education\n- `age` -- age of the professor\n- `cls_perc_eval` -- the percentage of students who completed the evaluation\n- `cls_level` -- class level\n- `cls_profs` -- number of professors teaching sections in course: single, multiple\n- `cls_credits` -- credits of class: one credit (lab, PE, etc.), multi credit\n- `bty_avg` -- average beauty rating of the professor\n- `pic_outfit` -- outfit of professor in picture\n- `pic_color` -- color of professor's picture\n- `large_class` -- whether the class had over 100 students\n- `eval_completion` -- proportion of students who completed the evaluation\n\nWoof, that's 14 different variables. That means, for this first round, you will need to compare the adjusted $R^2$ for [**14**]{.underline} different models to decide what variable should be added.\n\nEvery model you fit will have the *same* format:\n\n``` \nname_of_model <- lm(score ~ , data = evals)\n```\n\nBut, the name of the model will need to change. I've started the process for you, using the naming style of `one_` followed by the variable name (e.g., `one_id`, `one_bty`, etc.).\n\n\n::: {.cell}\n\n```{.r .cell-code}\none_rank <- lm(score ~ rank, data = evals)\none_ethnicity <- lm(score ~ ethnicity, data = evals)\none_gender <- lm(score ~ gender, data = evals)\none_language <- lm(score ~ language, data = evals)\none_age <- lm(score ~ age, data = evals)\none_perc_eval <- lm(score ~ cls_perc_eval, data = evals)\none_level <- lm(score ~ cls_level, data = evals)\n\n## Now, you need to fit the other seven models! \n```\n:::\n\n\nAlright, now that you've fit the models, you need to inspect the adjusted $R^2$ values to see which of these 14 models is the \"top\" model -- the model with the highest adjusted $R^2$! 
Similar to before, I've provided you with some code to get you started, but you need to write the remaining code.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nget_regression_summaries(one_rank)\nget_regression_summaries(one_ethnicity)\nget_regression_summaries(one_gender)\nget_regression_summaries(one_language)\nget_regression_summaries(one_age)\nget_regression_summaries(one_perc_eval)\nget_regression_summaries(one_level)\n\n## Now, you need to compare the other seven models! \n```\n:::\n\n\n**1. What model was your top model? Specifically, which variable was selected to be included?**\n\n### Step 2 - Adding a Second Variable\n\nAlright, you've added one variable, the next step is to decide if you should add a second variable. This process looks nearly identical to the previous step, with one major change: **every model you fit needs to contain the variable you decided to add**. So, if you decided to add the `bty_avg` variable, every model you fit would look like this:\n\n``` \nname_of_model <- lm(score ~ bty_avg + , data = evals)\n```\n\nAgain, the name of the model will need to change. This round, you are on your own -- I've provided you with no code. Here are my recommendations:\n\n- name each model `two_` followed by the names of both variables included in the model (e.g., `two_bty_id`)\n- go through each variable step-by-step just like you did before\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Code to fit all 13 models that add a second variable to your top model goes here!\n```\n:::\n\n\nAlright, now you should have 13 more models to compare! Like before, you need to inspect the adjusted $R^2$ values to see which of these 13 models is the \"top\" model.\n\n**Rules: You can only add a variable to the model if it improves adjusted** $R^2$ by at least 2% (0.02) from the model you chose in Question 1.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Code to compare all 13 models you fit goes here!\n```\n:::\n\n\n**2. What model was your top model? 
State which variables are included in the model you chose!**\n\n### Step 3 - Adding a Third Variable\n\nAs you might have expected, in this step we add a *third* variable to our top model from the previous step. This process should be getting familiar at this point!\n\nThis process of fitting 12-14 models at a time is getting rather tedious! So, I've written some code that will carry out this process for us in **one** pipeline! This is how the code looks:\n\n``` \nevals %>% \n map(.f = ~lm(score ~ .x + , data = evals)) %>% \n map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>% \n select(-score,\n -,\n -\n ) %>% \n pivot_longer(cols = everything(), \n names_to = \"variable\", \n values_to = \"adj_r_sq\") %>% \n slice_max(adj_r_sq)\n```\n\nWoah, that's a lot. The only thing you need to change is:\n\n- add in the names of the variables you selected in Steps 1 & 2 in the `~lm(score ~ .x + , data = evals)` step\n\n- add in the names of the variables you selected in Steps 1 & 2 in the `select(-score, -, -)` step\n\nFor example, if you chose `gender` and `age` in Steps 1 and 2, your code on the first line would look like:\n\n``` \nmap(.f = ~lm(score ~ .x + gender + age, data = evals)) %>% \n```\n\nand your code on the fourth line would look like:\n\n``` \n select(-score,\n -gender,\n -age\n ) %>% \n```\n\nYour turn!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Change the in line 2 to the names of the variables you selected in Steps 1 & 2\n## Change the and in line 4 to the names of the variables you selected in Steps 1 & 2\n\nevals %>% \n map(.f = ~lm(score ~ .x + , data = evals)) %>% \n map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>% \n select(-score, \n -, \n -\n ) %>% \n pivot_longer(cols = everything(), \n names_to = \"variable\", \n values_to = \"adj_r_sq\") %>% \n slice_max(adj_r_sq)\n```\n:::\n\n\nThe output of this code is the variable that has the highest adjusted $R^2$. 
Compare this value to the value of your \"top\" model from Step 2 and see if it improved adjusted $R^2$ by at least 2% (0.02). If so, this variable should be added. If not, then your model from Step 2 is the \"best\" model!\n\n**3. What model was your top model? State which variables are included in the model you chose!**\n\n### Step 4 - Adding a Fourth Variable\n\n**If you decided to add a variable in Step 3, then you keep going! If you didn't add a variable in Step 3, then you stop!**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Change the in line 2 to the names of the variables you selected in Steps 1, 2, & 3\n## Change the and in line 4 to the names of the variables you selected in Steps 1, 2 & 3\n\nevals %>% \n map(.f = ~lm(score ~ .x + , data = evals)) %>% \n map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>% \n select(-score, \n -, \n -, \n -\n ) %>% \n pivot_longer(cols = everything(), \n names_to = \"variable\", \n values_to = \"adj_r_sq\") %>% \n slice_max(adj_r_sq)\n```\n:::\n\n\nThe output of this code is the variable that has the highest adjusted $R^2$. Compare this value to the value of your \"top\" model from Step 3 and see if it improved adjusted $R^2$ by at least 2% (0.02). If so, this variable should be added. If not, then your model from Step 3 is the \"best\" model!\n\n**4. What model was your top model? You must state which variables are included in the model you chose!**\n\n### Step 5 - Adding a Fifth Variable\n\n**If you decided to add a variable in Step 4, then you keep going! 
If you didn't add a variable in Step 4, then you stop!**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Change the in line 2 to the names of the variables you selected in Steps 1, 2, 3 & 4\n## Change the and in line 4 to the names of the variables you selected in Steps 1, 2, 3 & 4\n\nevals %>% \n map(.f = ~lm(score ~ .x + , data = evals)) %>% \n map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>% \n select(-score, \n -, \n -, \n -,\n -\n ) %>% \n pivot_longer(cols = everything(), \n names_to = \"variable\", \n values_to = \"adj_r_sq\") %>% \n slice_max(adj_r_sq)\n```\n:::\n\n\nThe output of this code is the variable that has the highest adjusted $R^2$. Compare this value to the value of your \"top\" model from Step 4 and see if it improved adjusted $R^2$ by at least 2% (0.02). If so, this variable should be added. If not, then your model from Step 4 is the \"best\" model!\n\n**5. What model was your top model? You must state which variables are included in the model you chose!**\n\n## Comparing with the `step()` Function\n\nLet's check the forward selection model you found with what model the `step()` function decides is best. Forward selection must start from the intercept-only model, with the full model supplied as the scope. Run the code chunk below to obtain the \"best\" model chosen by this function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnull_model <- lm(score ~ 1, data = evals)\nfull_model <- lm(score ~ ., data = evals)\nstep(null_model, scope = formula(full_model), direction = \"forward\")\n```\n:::\n\n\n6. **Did the `step()` function choose the same model as you? 
If your \"best\" models do not agree, why do you think this might have happened?**\n", - "supporting": [], + "markdown": "---\ntitle: \"Lab 6: Predicting Professor Evaluation Scores\"\nauthor: \"Your group's names here!\"\nformat: html\nembed-resources: true\neditor: visual\nexecute: \n eval: false\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(moderndive)\nlibrary(openintro)\n\nevals <- evals |> \n mutate(large_class = if_else(cls_students > 100, \n \"large class\", \n \"regular class\"), \n eval_completion = cls_did_eval / cls_students \n ) |> \n select(-cls_did_eval, \n -cls_students, \n -prof_id,\n -course_id, \n -bty_f1lower, \n -bty_f1upper, \n -bty_f2upper, \n -bty_m1lower, \n -bty_m1upper, \n -bty_m2upper)\n```\n:::\n\n\n## Your Challenge\n\nThis week you have learned about model selection. During class you worked on performing a backward selection process to determine the \"best\" model for penguin body mass.\n\nToday, you are going to use **forward selection** to determine the \"best\" model for a professor's evaluation score. This task will require you to fit **tons** of linear regressions. **You must be able to show me exactly how you got to your top model.** Meaning, I need to see a record of **every** model you fit and compared along the way.\n\n## Forward Selection\n\nThe forward selection process starts with a model with **no** predictor variables. That means, this model predicts the *same* mean evaluation score for every professor. I've fit this model for you below!\n\n\n::: {.cell}\n\n```{.r .cell-code}\none_mean <- lm(score ~ 1, data = evals)\n```\n:::\n\n\nYou can pull out the adjusted $R^2$ for this model using the `get_regression_summaries()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nget_regression_summaries(one_mean)\n```\n:::\n\n\nBased on this output, we are starting with a **really** low adjusted $R^2$. 
So, things can only get better from here!\n\n### Step 1\n\n**Rules: You can only add a variable to the model if it improves the adjusted** $R^2$ by at least 2% (0.02).\n\nAlright, so now we get cooking. The next step is to fit **every** model with **one** explanatory variable. I've provided a list of every explanatory variable you are allowed to consider!\n\n- `rank` -- rank of professor\n- `ethnicity` -- ethnicity of the professor\n- `gender` -- gender of the professor\n- `language` -- language of school where professor received education\n- `age` -- age of the professor\n- `cls_perc_eval` -- the percentage of students who completed the evaluation\n- `cls_level` -- class level\n- `cls_profs` -- number of professors teaching sections in course: single, multiple\n- `cls_credits` -- credits of class: one credit (lab, PE, etc.), multi credit\n- `bty_avg` -- average beauty rating of the professor\n- `pic_outfit` -- outfit of professor in picture\n- `pic_color` -- color of professor's picture\n- `large_class` -- whether the class had over 100 students\n- `eval_completion` -- proportion of students who completed the evaluation\n\nWoof, that's 14 different variables. That means, for this first round, you will need to compare the adjusted $R^2$ for [**14**]{.underline} different models to decide what variable should be added.\n\nEvery model you fit will have the *same* format:\n\n``` \nname_of_model <- lm(score ~ , data = evals)\n```\n\nBut, the name of the model will need to change. 
I've started the process for you, using the naming style of `one_` followed by the variable name (e.g., `one_id`, `one_bty`, etc.).\n\n\n::: {.cell}\n\n```{.r .cell-code}\none_rank <- lm(score ~ rank, data = evals)\none_ethnicity <- lm(score ~ ethnicity, data = evals)\none_gender <- lm(score ~ gender, data = evals)\none_language <- lm(score ~ language, data = evals)\none_age <- lm(score ~ age, data = evals)\none_perc_eval <- lm(score ~ cls_perc_eval, data = evals)\none_level <- lm(score ~ cls_level, data = evals)\n\n## Now, you need to fit the other seven models! \n```\n:::\n\n\nAlright, now that you've fit the models, you need to inspect the adjusted $R^2$ values to see which of these 14 models is the \"top\" model -- the model with the highest adjusted $R^2$! Similar to before, I've provided you with some code to get you started, but you need to write the remaining code.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nget_regression_summaries(one_rank)\nget_regression_summaries(one_ethnicity)\nget_regression_summaries(one_gender)\nget_regression_summaries(one_language)\nget_regression_summaries(one_age)\nget_regression_summaries(one_perc_eval)\nget_regression_summaries(one_level)\n\n## Now, you need to compare the other seven models! \n```\n:::\n\n\n**1. What model was your top model? Specifically, which variable was selected to be included?**\n\n### Step 2 - Adding a Second Variable\n\nAlright, you've added one variable, the next step is to decide if you should add a second variable. This process looks nearly identical to the previous step, with one major change: **every model you fit needs to contain the variable you decided to add**. So, if you decided to add the `bty_avg` variable, every model you fit would look like this:\n\n``` \nname_of_model <- lm(score ~ bty_avg + , data = evals)\n```\n\nAgain, the name of the model will need to change. This round, you are on your own -- I've provided you with no code. 
Here are my recommendations:\n\n- name each model `two_` followed by the names of both variables included in the model (e.g., `two_bty_id`)\n- go through each variable step-by-step just like you did before\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Code to fit all 13 models that add a second variable to your top model goes here!\n```\n:::\n\n\nAlright, now you should have 13 more models to compare! Like before, you need to inspect the adjusted $R^2$ values to see which of these 13 models is the \"top\" model.\n\n**Rules: You can only add a variable to the model if it improves adjusted** $R^2$ by at least 2% (0.02) from the model you chose in Question 1.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Code to compare all 13 models you fit goes here!\n```\n:::\n\n\n**2. What model was your top model? State which variables are included in the model you chose!**\n\n### Step 3 - Adding a Third Variable\n\nAs you might have expected, in this step we add a *third* variable to our top model from the previous step. This process should be getting familiar at this point!\n\nThis process of fitting 12-14 models at a time is getting rather tedious! So, I've written some code that will carry out this process for us in **one** pipeline! This is how the code looks:\n\n``` \nevals %>% \n map(.f = ~lm(score ~ .x + , data = evals)) %>% \n map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>% \n select(-score,\n -,\n -\n ) %>% \n pivot_longer(cols = everything(), \n names_to = \"variable\", \n values_to = \"adj_r_sq\") %>% \n slice_max(adj_r_sq)\n```\n\nWoah, that's a lot. 
The only thing you need to change is:\n\n- add in the names of the variables you selected in Steps 1 & 2 in the `~lm(score ~ .x + , data = evals)` step\n\n- add in the names of the variables you selected in Steps 1 & 2 in the `select(-score, -, -)` step\n\nFor example, if you chose `gender` and `age` in Steps 1 and 2, your code on the first line would look like:\n\n``` \nmap(.f = ~lm(score ~ .x + gender + age, data = evals)) %>% \n```\n\nand your code on the fourth line would look like:\n\n``` \n select(-score,\n -gender,\n -age\n ) %>% \n```\n\nYour turn!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Change the in line 2 to the names of the variables you selected in Steps 1 & 2\n## Change the and in line 4 to the names of the variables you selected in Steps 1 & 2\n\nevals %>% \n map(.f = ~lm(score ~ .x + , data = evals)) %>% \n map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>% \n select(-score, \n -, \n -\n ) %>% \n pivot_longer(cols = everything(), \n names_to = \"variable\", \n values_to = \"adj_r_sq\") %>% \n slice_max(adj_r_sq)\n```\n:::\n\n\nThe output of this code is the variable that has the highest adjusted $R^2$. Compare this value to the value of your \"top\" model from Step 2 and see if it improved adjusted $R^2$ by at least 2% (0.02). If so, this variable should be added. If not, then your model from Step 2 is the \"best\" model!\n\n**3. What model was your top model? State which variables are included in the model you chose!**\n\n### Step 4 - Adding a Fourth Variable\n\n**If you decided to add a variable in Step 3, then you keep going! 
If you didn't add a variable in Step 3, then you stop!**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Change the in line 2 to the names of the variables you selected in Steps 1, 2, & 3\n## Change the and in line 4 to the names of the variables you selected in Steps 1, 2 & 3\n\nevals %>% \n map(.f = ~lm(score ~ .x + , data = evals)) %>% \n map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>% \n select(-score, \n -, \n -, \n -\n ) %>% \n pivot_longer(cols = everything(), \n names_to = \"variable\", \n values_to = \"adj_r_sq\") %>% \n slice_max(adj_r_sq)\n```\n:::\n\n\nThe output of this code is the variable that has the highest adjusted $R^2$. Compare this value to the value of your \"top\" model from Step 3 and see if it improved adjusted $R^2$ by at least 2% (0.02). If so, this variable should be added. If not, then your model from Step 3 is the \"best\" model!\n\n**4. What model was your top model? You must state which variables are included in the model you chose!**\n\n### Step 5 - Adding a Fifth Variable\n\n**If you decided to add a variable in Step 4, then you keep going! If you didn't add a variable in Step 4, then you stop!**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Change the in line 2 to the names of the variables you selected in Steps 1, 2, 3 & 4\n## Change the and in line 4 to the names of the variables you selected in Steps 1, 2, 3 & 4\n\nevals %>% \n map(.f = ~lm(score ~ .x + , data = evals)) %>% \n map_df(.f = ~get_regression_summaries(.x)$adj_r_squared) %>% \n select(-score, \n -, \n -, \n -,\n -\n ) %>% \n pivot_longer(cols = everything(), \n names_to = \"variable\", \n values_to = \"adj_r_sq\") %>% \n slice_max(adj_r_sq)\n```\n:::\n\n\nThe output of this code is the variable that has the highest adjusted $R^2$. Compare this value to the value of your \"top\" model from Step 4 and see if it improved adjusted $R^2$ by at least 2% (0.02). If so, this variable should be added. If not, then your model from Step 4 is the \"best\" model!\n\n**5. 
What model was your top model? You must state which variables are included in the model you chose!**\n\n## Comparing with the `step()` Function\n\nLet's check the forward selection model you found with what model the `step()` function decides is best. Forward selection must start from the intercept-only model, with the full model supplied as the scope. Run the code chunk below to obtain the \"best\" model chosen by this function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnull_model <- lm(score ~ 1, data = evals)\nfull_model <- lm(score ~ ., data = evals)\nstep(null_model, scope = formula(full_model), direction = \"forward\")\n```\n:::\n\n\n6. **Did the `step()` function choose the same model as you? If your \"best\" models do not agree, why do you think this might have happened?**\n", + "supporting": [ + "lab-6_files" + ], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/labs/lab-7/execute-results/html.json b/_freeze/labs/lab-7/execute-results/html.json index 199da3a8..548cb811 100644 --- a/_freeze/labs/lab-7/execute-results/html.json +++ b/_freeze/labs/lab-7/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "31bb68709e780b84448a13268c3573da", + "hash": "490439b6a1f47292d50e9fce2e40538c", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Lab 7: Confidence Intervals for Water Temperature and Latitude\"\nauthor: \"The names of your group members here!\"\ndate: \"February 22, 2024\"\nformat: html\neditor: visual\nexecute: \n echo: true\n eval: false\n message: false\n warning: false\n---\n\n\n\n\n## Data\n\nToday we will explore the `pie_crab` dataset contained in the **lterdatasampler** R package. The data is from a study by Johnson et al. at the Plum Island Ecosystem Long Term Ecological Research site, studying the relationship between the size (carapace width) of a Fiddler Crab and the geographical location of its habitat. 
These data can be used to investigate whether Bergmann's Rule applies to Fiddler Crabs -- specifically, that the size of a crab increases as the distance from the equator increases.\n\n### Motivation\n\nThe students who investigated this relationship for their Midterm Project found that when *both* latitude *and* water temperature are included as explanatory variables in the multiple regression model, the coefficient associated with water temperature doesn't make sense. Namely, the model suggests warmer water temperatures are associated with larger crab sizes. However, we know that the water is warmer near the equator, which is where the crab sizes should be **smaller**. Rather perplexing!\n\nThe moral of the story is that water temperature and latitude are highly correlated with each other, so including them both as explanatory variables leads to *multicollinearity* -- something we **do not** want in our multiple linear regression.\n\n### Our Investigation\n\nThe focus of this lab is on quantifying the relationship between water temperature (response) and latitude (explanatory) for marshes (sites) along the Atlantic coast.\n\n## Cleaning the Data\n\nThe data contains information on a total of 392 Fiddler Crabs caught at 13 marshes on the Atlantic coast of the United States in summer 2016. However, at each marsh, there is only **one** recorded water temperature. Meaning, we need to collapse our dataset to have only **one** observation per marsh.\n\n**1. Fill in the code below to create a new dataset called `marsh_info` which has 13 observations -- one per marsh.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmarsh_info <- pie_crab %>% \n group_by(____) %>% \n slice_sample(n = 1) %>% \n ungroup()\n```\n:::\n\n\n**From this point forward, you should use the `marsh_info` dataset for EVERY problem.** Keep in mind that you are no longer analyzing data on crabs! The dataset you have is on marshes along the Atlantic coast!\n\n## Visualizing Relationships\n\n**2. 
Create a scatterplot modeling the relationship between latitude (explanatory) and water temperature (response) for these 13 marshes.** *Don't forget to add descriptive axis labels!*\n\n\n::: {.cell}\n\n:::\n\n\n**3. Describe the relationship you see in the scatterplot. Be sure to address the four aspects we discussed in class: form, direction, strength, and unusual points!** Keep in mind that you are no longer analyzing data on crabs! The dataset you have is on marshes along the Atlantic coast!\n\n### Summarizing the Relationship\n\nNow that you've visualized the relationship, let's summarize this relationship with a statistic. Specifically, we are interested in the slope statistic, as it captures the relationship between latitude and water temperature.\n\n**4. Fill in the code below to calculate the observed slope for the relationship between the water temperature (response) and latitude (explanatory).**\n\n*Note: Nothing will be output when you run this code!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\nobs_slope <- marsh_info %>% \n specify(response = ____, \n explanatory = ____) %>% \n calculate(stat = ____)\n```\n:::\n\n\n## Bootstrap Distribution\n\nNow that we have the observed slope statistic, let's see what variability we might get in the slope statistic for other samples (marshes) we might have gotten from the population (the Atlantic coast of the US).\n\nAs a refresher, when we use resampling to obtain our bootstrap distribution, our steps look like the following:\n\nStep 1: `specify()` the response and explanatory variables\n\nStep 2: `generate()` lots of bootstrap resamples\n\nStep 3: `calculate()` for each of the `generated()` samples, calculate the statistic you are interested in\n\nLet's give this a try!\n\n**5. 
Fill in the code to generate 500 bootstrap slope statistics (from 500 bootstrap resamples).**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbootstrap <- marsh_info %>% \n specify(response = ____, \n explanatory = ____) %>% \n generate(reps = ____, \n type = ____) %>% \n calculate(stat = ____)\n```\n:::\n\n\nAlright, now that we have the bootstrap slope statistics, let's see how they look! Let's use the `visualize()` function (not `ggplot()`!) to make a quick visualization of the statistics you calculated above.\n\n**6. Use the `visualize()` function to create a simple histogram of your 500 bootstrap statistics.** *It would be nice to change the x-axis label to describe what statistic is being plotted!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to visualize bootstrap statistics\n```\n:::\n\n\n## Confidence Interval\n\nThe next step is to obtain our confidence interval! First we need to determine what percentage of statistics we want to keep in the confidence interval. 90%? 95%? 99%? 80%?\n\nThis seems like a study where we care a bit less about our interval capturing the true value, at least compared to something like a medical study. So, I think this could be a great instance to use a 90% confidence interval.\n\n**7. Use the `get_confidence_interval()` function to find the 90% confidence interval from your bootstrap distribution, using the percentile method!**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to obtain a 90% PERCENTILE based confidence interval\n```\n:::\n\n\n**8. Interpret the confidence interval you obtained in #7. Make sure to include the context of the data and the population of interest!**\n\nJust for fun, let's compare the confidence interval we obtained using the percentile method with an interval found using the SE method.\n\n**9. 
Use the `get_confidence_interval()` function to find the 90% confidence interval from your bootstrap distribution, using the SE method!** *Remember -- with the SE method, you need to specify the point estimate!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to obtain a 90% SE based confidence interval\n```\n:::\n\n\n**10. How do your confidence intervals compare? Based on the shape of the bootstrap distribution, would you expect these methods to yield similar results?**\n\n*Hint: Think about the conditions for using the SE method to obtain a confidence interval!*\n\n## Bootstrap Assumptions\n\nA bootstrap distribution aims to simulate the variability we'd get from other samples from our population. However, the accuracy of these samples relies on the quality of our original sample.\n\n**11. Based on the information given, how do you feel about the assumption a bootstrap distribution makes about the original sample? What issues do you believe might prevent this assumption from being appropriate?**\n", + "markdown": "---\ntitle: \"Lab 7: Confidence Intervals for Water Temperature and Latitude\"\nauthor: \"The names of your group members here!\"\nformat: html\nembed-resources: true\neditor: visual\nexecute: \n echo: true\n eval: false\n message: false\n warning: false\n---\n\n\n\n\n## Data\n\nToday we will explore the `pie_crab` dataset contained in the **lterdatasampler** R package. The data is from a study by Johnson et al. at the Plum Island Ecosystem Long Term Ecological Research site, studying the relationship between the size (carapace width) of a Fiddler Crab and the geographical location of its habitat. 
These data can be used to investigate whether Bergmann's Rule applies to Fiddler Crabs -- specifically, that the size of a crab increases as the distance from the equator increases.\n\n### Motivation\n\nThe students who investigated this relationship for their Midterm Project found that when *both* latitude *and* water temperature are included as explanatory variables in the multiple regression model, the coefficient associated with water temperature doesn't make sense. Namely, the model suggests warmer water temperatures are associated with larger crab sizes. However, we know that the water is warmer near the equator, which is where the crab sizes should be **smaller**. Rather perplexing!\n\nThe moral of the story is that water temperature and latitude are highly correlated with each other, so including them both as explanatory variables leads to *multicollinearity* -- something we **do not** want in our multiple linear regression.\n\n### Our Investigation\n\nThe focus of this lab is on quantifying the relationship between water temperature (response) and latitude (explanatory) for marshes (sites) along the Atlantic coast.\n\n## Cleaning the Data\n\nThe data contains information on a total of 392 Fiddler Crabs caught at 13 marshes on the Atlantic coast of the United States in summer 2016. However, at each marsh, there is only **one** recorded water temperature. Meaning, we need to collapse our dataset to have only **one** observation per marsh.\n\n**1. Fill in the code below to create a new dataset called `marsh_info` which has 13 observations -- one per marsh.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmarsh_info <- pie_crab %>% \n group_by(____) %>% \n slice_sample(n = 1) %>% \n ungroup()\n```\n:::\n\n\n**From this point forward, you should use the `marsh_info` dataset for EVERY problem.** Keep in mind that you are no longer analyzing data on crabs! The dataset you have is on marshes along the Atlantic coast!\n\n## Visualizing Relationships\n\n**2. 
Create a scatterplot modeling the relationship between latitude (explanatory) and water temperature (response) for these 13 marshes.** *Don't forget to add descriptive axis labels!*\n\n\n::: {.cell}\n\n:::\n\n\n**3. Describe the relationship you see in the scatterplot. Be sure to address the four aspects we discussed in class: form, direction, strength, and unusual points!** Keep in mind that you are no longer analyzing data on crabs! The dataset you have is on marshes along the Atlantic coast!\n\n### Summarizing the Relationship\n\nNow that you've visualized the relationship, let's summarize this relationship with a statistic. Specifically, we are interested in the slope statistic, as it captures the relationship between latitude and water temperature.\n\n**4. Fill in the code below to calculate the observed slope for the relationship between the water temperature (response) and latitude (explanatory).**\n\n*Note: Nothing will be output when you run this code!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\nobs_slope <- marsh_info %>% \n specify(response = ____, \n explanatory = ____) %>% \n calculate(stat = ____)\n```\n:::\n\n\n## Bootstrap Distribution\n\nNow that we have the observed slope statistic, let's see what variability we might get in the slope statistic for other samples (marshes) we might have gotten from the population (the Atlantic coast of the US).\n\nAs a refresher, when we use resampling to obtain our bootstrap distribution, our steps look like the following:\n\nStep 1: `specify()` the response and explanatory variables\n\nStep 2: `generate()` lots of bootstrap resamples\n\nStep 3: `calculate()` for each of the `generated()` samples, calculate the statistic you are interested in\n\nLet's give this a try!\n\n**5. 
Fill in the code to generate 500 bootstrap slope statistics (from 500 bootstrap resamples).**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbootstrap <- marsh_info %>% \n specify(response = ____, \n explanatory = ____) %>% \n generate(reps = ____, \n type = ____) %>% \n calculate(stat = ____)\n```\n:::\n\n\nAlright, now that we have the bootstrap slope statistics, let's see how they look! Let's use the `visualize()` function (not `ggplot()`!) to make a quick visualization of the statistics you calculated above.\n\n**6. Use the `visualize()` function to create a simple histogram of your 500 bootstrap statistics.** *It would be nice to change the x-axis label to describe what statistic is being plotted!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to visualize bootstrap statistics\n```\n:::\n\n\n## Confidence Interval\n\nThe next step is to obtain our confidence interval! First we need to determine what percentage of statistics we want to keep in the confidence interval. 90%? 95%? 99%? 80%?\n\nThis seems like a study where we care a bit less about our interval capturing the true value, at least compared to something like a medical study. So, I think this could be a great instance to use a 90% confidence interval.\n\n**7. Use the `get_confidence_interval()` function to find the 90% confidence interval from your bootstrap distribution, using the percentile method!**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to obtain a 90% PERCENTILE based confidence interval\n```\n:::\n\n\n**8. Interpret the confidence interval you obtained in #7. Make sure to include the context of the data and the population of interest!**\n\nJust for fun, let's compare the confidence interval we obtained using the percentile method with an interval found using the SE method.\n\n**9. 
Use the `get_confidence_interval()` function to find the 90% confidence interval from your bootstrap distribution, using the SE method!** *Remember -- with the SE method, you need to specify the point estimate!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Code to obtain a 90% SE based confidence interval\n```\n:::\n\n\n**10. How do your confidence intervals compare? Based on the shape of the bootstrap distribution, would you expect these methods to yield similar results?**\n\n*Hint: Think about the conditions for using the SE method to obtain a confidence interval!*\n\n## Bootstrap Assumptions\n\nA bootstrap distribution aims to simulate the variability we'd get from other samples from our population. However, the accuracy of these samples relies on the quality of our original sample.\n\n**11. Based on the information given, how do you feel about the assumption a bootstrap distribution makes about the original sample? What issues do you believe might prevent this assumption from being appropriate?**\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/labs/lab-8/execute-results/html.json b/_freeze/labs/lab-8/execute-results/html.json index de0aed12..58f81dc2 100644 --- a/_freeze/labs/lab-8/execute-results/html.json +++ b/_freeze/labs/lab-8/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "0be71513110a4a4e0470a796363a7f04", + "hash": "526f252971fabbe875e51d2c06360587", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Lab 8: Evaluating Conditions & Conducting Hypothesis Tests\"\nauthor: \"Your group's names here!\"\ndate: \"Leap Day, 2024\"\nformat: html\neditor: visual\nembed-resources: true\nexecute: \n echo: true\n eval: false\n message: false\n warning: false\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(gapminder)\nlibrary(infer)\nlibrary(moderndive)\n```\n:::\n\n\n# Today's Data\n\nHere is a description of the `gapminder` dataset as written by two of your peers:\n\n> The gapminder dataset 
contains data representing three numerical attributes of a country: its average life expectancy in years, its population, and its per-capita GDP in \"international dollars\" (a hypothetical currency with the purchasing power of the U.S. dollar in 2005).\n\n> GDP data from years 1990 to 2019 come from the World Bank, which was published in May 2022. Data from previous years come from the Maddison Project Database and the Penn World Table. The life expectancy data were compiled from three main sources: Mattias Lindgren and Klara Johansson, gapminder version 7 (1800-1970), the Institute for Health Metrics and Evaluation (1970-2016), and the United Nations population data (2017-present). The geographical data for country and continent are based off of current borders as determined by the United Nations.\n\n> In total, there are 1704 observations and 6 variables, with the country variable being a factor containing 142 levels. The data were retrieved in 2008 and 2009 from [Gapminder](https://gapminder.org/), an organization that collects world data. The dataset was also manually cleaned by Dr. Jenny Bryan and her STAT 545 students.\n\n> Gapminder itself collected its GDP data from the World Bank, its life expectancy data from various studies published by the Institute for Health Metrics and Evaluation, and its population data from the US Census Bureau.\n\n## Question of Interest\n\nThe objective of this data analysis is to answer the question:\n\n> What is the relationship between life expectancy and GDP per capita?\n\n# Exploratory data analysis\n\nLet's load the `gapminder` data into our workspace and start exploring!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(gapminder)\n```\n:::\n\n\n## Data Visualization\n\n**1. Create a scatterplot of the relationship between life expectancy (response) and GDP (explanatory).**\n\n*Remember to include nice axis labels (with units!).*\n\n\n::: {.cell}\n\n:::\n\n\nWhat you see should make you concerned about using a linear regression! 
So, let's play with some variable transformations.\n\nYou can explore if a log-transformation of the y-variable would make the relationship more linear by adding a `scale_y_log10()` layer to your plot, like so:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlterdatasampler::hbr_maples %>% \n ggplot(mapping = aes(x = stem_length, \n y = stem_dry_mass)\n ) +\n geom_point() + \n scale_y_log10()\n```\n:::\n\n\nSimilarly, you can explore if a log-transformation of the x-variable would be helpful by adding a `scale_x_log10()` layer to your plot.\n\n**2. Using `scale_x_log10()` and `scale_y_log10()`, decide on what relationship between life expectancy and GDP per capita appears the most linear. There should only be *one* plot for this problem!**\n\n*Remember to include nice axis labels (with any transformed units!).*\n\n\n::: {.cell}\n\n:::\n\n\n# Statistical Model\n\n**3. Fill in the code to fit the regression model you chose in #2.**\n\n*To include a variable with a log transformation in your model, you input the `variable` as `log(variable)` inside the `lm()` function (e.g., `lm(log(stem_dry_mass) ~ stem_length, data = hbr_maples)`).*\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngapminder_lm <- lm(____ ~ ____, data = gapminder)\n```\n:::\n\n\n## Assessing Model Conditions\n\nThe next step is to check the conditions of our statistical model; we do this by analyzing our residuals and how the data were collected.\n\n### Independence of Observations\n\nEach row of the `gapminder` dataset is an observation for one country for one year (from 1952 to 2007).\n\n**4. Do you believe it is reasonable to assume these observations are independent of one another?**\n\n*Hint: This condition says the rows* of the dataset are independent of each other. 
Look at the rows of the dataset -- is there any reason to believe there are relationships between the rows?\n\n### Normality of Residuals\n\nI've provided code to visualize the residuals from the model you fit in #3 below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbroom::augment(gapminder_lm) %>% \n ggplot(mapping = aes(x = .resid)) +\n geom_histogram() +\n labs(x = \"Residual\")\n```\n:::\n\n\n**5. Based on the distribution of residuals, do you believe the condition of normality is violated? Why or why not?**\n\n### Equal Variance of Residuals\n\nI've provided code to visualize the residuals versus the explanatory variable from the model you fit in #3 below. With this plot, we want to assess if the variability (spread) of the residuals changes based on the values of the explanatory variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbroom::augment(gapminder_lm) %>% \n ggplot(mapping = aes(y = .resid, x = `log(gdpPercap)`)) +\n geom_point() + \n geom_hline(yintercept = 0, color = \"red\", linewidth = 3) +\n labs(x = \"Log Transformed GDP Per Capita\")\n```\n:::\n\n\n**6. Based on the plot above, do you believe the condition of equal variance is violated? Why or why not?**\n\n# Inference\n\n## Stating the Hypotheses\n\nNow that you've decided which regression appears the most linear, let's perform a hypothesis test for the slope coefficient.\n\n**7. Write the hypotheses [*in words*]{.underline} for testing if there is a linear relationship between the variables you used for your model in #3.**\n\n*Keep in mind, if you log-transformed y, you are testing if there is a linear relationship between log(y) and x!*\n\n$H_0$:\n\n$H_A$:\n\n## Obtaining a p-value Using Simulation\n\nNext, we will work through creating a permutation distribution using tools from the **infer** package.\n\n**8. 
First, we need to find the observed slope statistic, which we will save as `obs_slope`.**\n\n*Keep in mind, if you log-transformed y, you need to use log(y) as your response variable!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\nobs_slope <- gapminder %>%\n specify(response = ____, explanatory = ____) %>%\n calculate(stat = \"slope\")\n```\n:::\n\n\nAfter you have calculated your observed statistic, you need to create a permutation distribution of statistics that might have occurred if the null hypothesis was true.\n\n**9. Generate 500 permuted statistics for the permutation distribution and save these statistics in an object named `null_dist`.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnull_dist <- \n```\n:::\n\n\nWe can visualize this null distribution with the following code:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvisualise(null_dist) \n```\n:::\n\n\nNow that you have calculated the observed statistic and generated a permutation distribution, you can calculate the p-value for your hypothesis test using the function `get_p_value()` from the infer package.\n\n**10. Fill in the code below to calculate the p-value for the hypothesis test you stated in #7.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nget_p_value(null_dist, \n obs_stat = ____, \n direction = ____)\n```\n:::\n\n\n**11. Based on your p-value and an** $\\alpha = 0.1$, **what decision would you reach regarding the hypotheses you stated in #7?**\n\n## Obtaining a p-value Using Theory\n\nAs we saw in the reading this week, the output from the `get_regression_table()` function provides us with theory-based estimates of our standard error, $t$-statistic, and p-value.\n\n**12. Use the `get_regression_table()` function to obtain the theory-based p-value for your hypothesis test.**\n\n*Hint: You'll want to use the model you fit in #3.*\n\n\n::: {.cell}\n\n:::\n\n\n**13. How does this p-value compare to what you obtained in #11?**\n\n**14. Why do you believe these p-values were similar or different?**\n\n**15. 
Based on your answers to #4-6, which p-value do you believe is the most reliable? Why?** *Note: If you believe neither is reliable, say so and state why.*\n", + "markdown": "---\ntitle: \"Lab 8: Evaluating Conditions & Conducting Hypothesis Tests\"\nauthor: \"Your group's names here!\"\nformat: html\neditor: visual\nembed-resources: true\nexecute: \n echo: true\n eval: false\n message: false\n warning: false\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(gapminder)\nlibrary(infer)\nlibrary(moderndive)\n```\n:::\n\n\n# Today's Data\n\nHere is a description of the `gapminder` dataset as written by two of your peers:\n\n> The gapminder dataset contains data representing three numerical attributes of a country: its average life expectancy in years, its population, and its per-capita GDP in \"international dollars\" (a hypothetical currency with the purchasing power of the U.S. dollar in 2005).\n\n> GDP data from years 1990 to 2019 come from the World Bank, which was published in May 2022. Data from previous years come from the Maddison Project Database and the Penn World Table. The life expectancy data were compiled from three main sources: Mattias Lindgren and Klara Johansson, gapminder version 7 (1800-1970), the Institute for Health Metrics and Evaluation (1970-2016), and the United Nations population data (2017-present). The geographical data for country and continent are based off of current borders as determined by the United Nations.\n\n> In total, there are 1704 observations and 6 variables, with the country variable being a factor containing 142 levels. The data were retrieved in 2008 and 2009 from [Gapminder](https://gapminder.org/), an organization that collects world data. The dataset was also manually cleaned by Dr. 
Jenny Bryan and her STAT 545 students.\n\n> Gapminder itself collected its GDP data from the World Bank, its life expectancy data from various studies published by the Institute for Health Metrics and Evaluation, and its population data from the US Census Bureau.\n\n## Question of Interest\n\nThe objective of this data analysis is to answer the question:\n\n> What is the relationship between life expectancy and GDP per capita?\n\n# Exploratory data analysis\n\nLet's load the `gapminder` data into our workspace and start exploring!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(gapminder)\n```\n:::\n\n\n## Data Visualization\n\n**1. Create a scatterplot of the relationship between life expectancy (response) and GDP (explanatory).**\n\n*Remember to include nice axis labels (with units!).*\n\n\n::: {.cell}\n\n:::\n\n\nWhat you see should make you concerned about using a linear regression! So, let's play with some variable transformations.\n\nYou can explore if a log-transformation of the y-variable would make the relationship more linear by adding a `scale_y_log10()` layer to your plot, like so:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlterdatasampler::hbr_maples %>% \n ggplot(mapping = aes(x = stem_length, \n y = stem_dry_mass)\n ) +\n geom_point() + \n scale_y_log10()\n```\n:::\n\n\nSimilarly, you can explore if a log-transformation of the x-variable would be helpful by adding a `scale_x_log10()` layer to your plot.\n\n**2. Using `scale_x_log10()` and `scale_y_log10()`, decide on what relationship between life expectancy and GDP per capita appears the most linear. There should only be *one* plot for this problem!**\n\n*Remember to include nice axis labels (with any transformed units!).*\n\n\n::: {.cell}\n\n:::\n\n\n# Statistical Model\n\n**3. 
Fill in the code to fit the regression model you chose in #2.**\n\n*To include a variable with a log transformation in your model, you input the `variable` as `log(variable)` inside the `lm()` function (e.g., `lm(log(stem_dry_mass) ~ stem_length, data = hbr_maples)`).*\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngapminder_lm <- lm(____ ~ ____, data = gapminder)\n```\n:::\n\n\n## Assessing Model Conditions\n\nThe next step is to check the conditions of our statistical model; we do this by analyzing our residuals and how the data were collected.\n\n### Independence of Observations\n\nEach row of the `gapminder` dataset is an observation for one country for one year (from 1952 to 2007).\n\n**4. Do you believe it is reasonable to assume these observations are independent of one another?**\n\n*Hint: This condition says the rows* of the dataset are independent of each other. Look at the rows of the dataset -- is there any reason to believe there are relationships between the rows?\n\n### Normality of Residuals\n\nI've provided code to visualize the residuals from the model you fit in #3 below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbroom::augment(gapminder_lm) %>% \n ggplot(mapping = aes(x = .resid)) +\n geom_histogram() +\n labs(x = \"Residual\")\n```\n:::\n\n\n**5. Based on the distribution of residuals, do you believe the condition of normality is violated? Why or why not?**\n\n### Equal Variance of Residuals\n\nI've provided code to visualize the residuals versus the explanatory variable from the model you fit in #3 below. With this plot, we want to assess if the variability (spread) of the residuals changes based on the values of the explanatory variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbroom::augment(gapminder_lm) %>% \n ggplot(mapping = aes(y = .resid, x = `log(gdpPercap)`)) +\n geom_point() + \n geom_hline(yintercept = 0, color = \"red\", linewidth = 3) +\n labs(x = \"Log Transformed GDP Per Capita\")\n```\n:::\n\n\n**6. 
Based on the plot above, do you believe the condition of equal variance is violated? Why or why not?**\n\n# Inference\n\n## Stating the Hypotheses\n\nNow that you've decided which regression appears the most linear, let's perform a hypothesis test for the slope coefficient.\n\n**7. Write the hypotheses [*in words*]{.underline} for testing if there is a linear relationship between the variables you used for your model in #3.**\n\n*Keep in mind, if you log-transformed y, you are testing if there is a linear relationship between log(y) and x!*\n\n$H_0$:\n\n$H_A$:\n\n## Obtaining a p-value Using Simulation\n\nNext, we will work through creating a permutation distribution using tools from the **infer** package.\n\n**8. First, we need to find the observed slope statistic, which we will save as `obs_slope`.**\n\n*Keep in mind, if you log-transformed y, you need to use log(y) as your response variable!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\nobs_slope <- gapminder %>%\n specify(response = ____, explanatory = ____) %>%\n calculate(stat = \"slope\")\n```\n:::\n\n\nAfter you have calculated your observed statistic, you need to create a permutation distribution of statistics that might have occurred if the null hypothesis was true.\n\n**9. Generate 500 permuted statistics for the permutation distribution and save these statistics in an object named `null_dist`.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnull_dist <- \n```\n:::\n\n\nWe can visualize this null distribution with the following code:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvisualise(null_dist) \n```\n:::\n\n\nNow that you have calculated the observed statistic and generated a permutation distribution, you can calculate the p-value for your hypothesis test using the function `get_p_value()` from the infer package.\n\n**10. 
Fill in the code below to calculate the p-value for the hypothesis test you stated in #7.**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nget_p_value(null_dist, \n obs_stat = ____, \n direction = ____)\n```\n:::\n\n\n**11. Based on your p-value and an** $\alpha = 0.1$, **what decision would you reach regarding the hypotheses you stated in #7?**\n\n## Obtaining a p-value Using Theory\n\nAs we saw in the reading this week, the output from the `get_regression_table()` function provides us with theory-based estimates of our standard error, $t$-statistic, and p-value.\n\n**12. Use the `get_regression_table()` function to obtain the theory-based p-value for your hypothesis test.**\n\n*Hint: You'll want to use the model you fit in #3.*\n\n\n::: {.cell}\n\n:::\n\n\n**13. How does this p-value compare to what you obtained in #11?**\n\n**14. Why do you believe these p-values were similar or different?**\n\n**15. Based on your answers to #4-6, which p-value do you believe is the most reliable? Why?** *Note: If you believe neither is reliable, say so and state why.*\n", "supporting": [ "lab-8_files" ], diff --git a/_freeze/labs/lab-9/execute-results/html.json b/_freeze/labs/lab-9/execute-results/html.json index 04162511..5b088a0c 100644 --- a/_freeze/labs/lab-9/execute-results/html.json +++ b/_freeze/labs/lab-9/execute-results/html.json @@ -1,7 +1,8 @@ { - "hash": "a481ec550223287ddbd2d2dfee2da9fe", + "hash": "37b47eb9d5da36cf4d8c3901329b0bbb", "result": { - "markdown": "---\ntitle: \"Lab 9 -- One-Way ANOVA\"\nauthor: \"Your group's names here!\"\ndate: \"June 2, 2023\"\nformat: html\neditor: visual\nexecute: \n echo: true\n eval: false\n message: false\n warning: false\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(infer)\nlibrary(ggridges)\nlibrary(broom)\n```\n:::\n\n\n## Today's Data\n\nThese data come from the Gapminder Foundation, an organization interested in increasing the use and understanding of statistics and other information about social, 
economic and environmental development at local, national and global levels.\n\nToday we will be comparing math achievement scores across continents and years. Math achievement was measured for 42 countries based on their average score for the grade 8 international TIMSS test.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmath_scores <- read_csv(here::here(\"labs\", \n \"data\",\n \"math_scores.csv\")\n )\n\n# Creating a year_cat variable that is the categorical version of year\nmath_scores <- mutate(math_scores, \n year_cat = as.factor(year)\n )\n\n# Removing the missing values from the grade_8_math_score variable\nmath_scores <- drop_na(data = math_scores, \n grade_8_math_score)\n```\n:::\n\n\n## Data Visualizations\n\nThe first step for a statistical analysis should always be creating visualizations of the data. Similar to what you are expected to do for your project, you will make three density ridge plots:\n\n- visualizing the relationship between math score and year\n- visualizing the relationship between math score and continent\n- visualizing the relationship of math score with both year **and** continent\n\n[**Question 1**]{.underline} -- Fill in the code below to visualize the distribution of grade 8 math scores over time.\n\n**Don't forget to include axis labels!**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(data = math_scores, \n mapping = aes(x = ____, \n y = ____)) +\n geom_density_ridges(scale = 1) \n```\n:::\n\n\n*Note: I've included a `scale = 1` argument to show you how to keep the density plots from overlapping!*\n\n[**Question 2**]{.underline} -- What do you see in the plot you made? How do the centers (means) of the distributions compare? 
What about the variability (spread) of the distributions?\n\n[**Question 3**]{.underline} -- Write the code to visualize the distribution of grade 8 math scores for the six different continents.\n\n**Don't forget to include axis labels!**\n\n\n::: {.cell}\n\n:::\n\n\n[**Question 4**]{.underline} -- What do you see in the plot you made? How do the centers (means) of the distributions compare? What about the variability (spread) of the distributions?\n\n[**Question 5**]{.underline} -- Write the code to visualize the distribution of grade 8 math scores for the six different continents for each of the four years.\n\n***Remember, you could either include a facet or a color here! Also remember you can use `alpha` to change the transparency of your density ridges!***\n\n\n::: {.cell}\n\n:::\n\n\n[**Question 6**]{.underline} -- What do you see in the plot you made? Does it seem that the relationship between year and grade 8 math scores changes based on the continent of the country?\n\n## Statistical Model\n\nFor our analysis we will be using an analysis of variance (ANOVA) model. An ANOVA is an appropriate statistical model as we have a continuous response variable (grade 8 math score) and categorical explanatory variables (year, continent). Year is not considered to be a continuous numerical variable as we have only four measurements in time (1996, 1999, 2003, 2007).\n\n### Model Conditions\n\nAn ANOVA has model conditions that are very similar to what we learned for linear regression. In this section we will evaluate the conditions of the model.\n\nFor this section, it might be helpful to know how many observations there are for each year and for each continent. 
I have written code below to provide you with a table of these numbers:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncount(math_scores, continent, year) %>% \n pivot_wider(names_from = continent, \n values_from = n, \n values_fill = 0) %>% \n janitor::adorn_totals(where = c(\"row\", \"col\"))\n```\n:::\n\n\n#### Independence\n\nBased on the table we know:\n\n- each year has measurements on about six continents\n- each continent has measurements for about four years\n\nUse this information to evaluate the condition of independence of observations.\n\n[**Question 7**]{.underline} -- Is it reasonable to assume that the observations *within* a continent are independent of each other?\n\n[**Question 8**]{.underline} -- Is it reasonable to assume that the observations *within* a year are independent of each other?\n\n[**Question 9**]{.underline} -- Is it reasonable to assume that the observations *between* continents are independent of each other?\n\n[**Question 10**]{.underline} -- Is it reasonable to assume that the observations *between* years are independent of each other?\n\n#### Normality\n\nNow we will evaluate the normality of the distributions of grade 8 math scores across years and across continents -- the plot you created in #5. *Keep in mind, the normality condition is very important when the sample sizes for each group are relatively small.*\n\n[**Question 11**]{.underline} -- Is it reasonable to say that the grade 8 math scores across the four years and six continents are normally distributed?\n\n#### Equal Variance\n\nNow we will evaluate the equal variance of the distributions of grade 8 math scores across years and across continents -- the plot you created in #5. *Keep in mind, the constant variance condition is especially important when the sample sizes differ between groups.*\n\nFor this section, it might be helpful to know the variances for each year / continent combination. 
I have written code below to provide you with a table of these numbers:\n\n*Keep in mind a variance of `NA` can happen for two reasons: (1) there is no data, or (2) there is only one observation.*\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmath_scores %>% \n group_by(year, continent) %>% \n summarize(var = var(grade_8_math_score, na.rm = TRUE)\n ) %>% \n pivot_wider(names_from = continent, values_from = var)\n```\n:::\n\n\nLooking at the table, we can see that the largest variance of 10257 (North America, 2007) is nearly 27 times larger than the smallest variance of 381 (Europe, 2003). That's a lot! So, our equal variance condition is definitely violated.\n\nBut, we have learned tools to attempt to remedy this issue! Let's take the log of `grade_8_math_score` and see how the variances compare.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmath_scores %>% \n group_by(year, continent) %>% \n summarize(log_var = var(log(grade_8_math_score))\n ) %>%\n pivot_wider(names_from = continent, values_from = log_var)\n```\n:::\n\n\n[**Question 12**]{.underline} -- Based on the variances in the table above, is it reasonable to say that the *log* grade 8 math scores across the four years and six continents have equal variability?\n\n## One-Way ANOVA Inference\n\nWe are going to test out both methods for conducting a hypothesis test for an ANOVA -- theory-based and simulation-based methods. Keep in mind **both** methods require independence of observations **and** equal variability. 
Normality, however, is only a condition of theory-based methods.\n\n### Testing for a Difference Between Years\n\nSince the distribution of grade 8 math scores across the four years wasn't horribly non-Normal, let's give a theory-based method a try.\n\n[**Question 13**]{.underline} -- Fill in the code below to conduct a one-way ANOVA modeling the relationship between mean grade 8 math score and year.\n\n*Keep in mind the response variable comes first and the explanatory variable comes second!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\naov(____ ~ ____, data = math_scores) %>% \n broom::tidy()\n```\n:::\n\n\n[**Question 14**]{.underline} -- At an $\alpha = 0.1$, what decision would you reach for your hypothesis test?\n\n[**Question 15**]{.underline} -- What would you conclude about the relationship between the mean grade 8 math scores and year?\n\n### Testing for a Difference Between Continents\n\nThe distribution of grade 8 math scores across the six continents didn't look very Normal, so let's give a simulation-based method a try.\n\nI've gotten you started by calculating the observed F-statistic for the relationship between a country's grade 8 math score and its continent.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nobs_F <- math_scores %>% \n specify(response = grade_8_math_score, \n explanatory = continent) %>% \n calculate(stat = \"F\")\n```\n:::\n\n\n[**Question 16**]{.underline} -- Write the code to generate a permutation distribution of resampled F-statistics.\n\n\n::: {.cell}\n\n:::\n\n\n[**Question 17**]{.underline} -- Visualize the null distribution and shade how the p-value should be calculated.\n\n*Keep in mind you only look at the right tail for an ANOVA!*\n\n\n::: {.cell}\n\n:::\n\n\n[**Question 18**]{.underline} -- Calculate the p-value for the observed F-statistic.\n\n\n::: {.cell}\n\n:::\n\n\n[**Question 19**]{.underline} -- At an $\alpha = 0.1$, what decision would you reach for your hypothesis test?\n\n[**Question 20**]{.underline} -- What 
would you conclude about the relationship between the mean grade 8 math scores and continent?\n", + "engine": "knitr", + "markdown": "---\ntitle: \"Lab 9 -- One-Way ANOVA\"\nauthor: \"Your group's names here!\"\nformat: html\neditor: visual\nexecute: \n echo: true\n eval: false\n message: false\n warning: false\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(infer)\nlibrary(ggridges)\nlibrary(broom)\n```\n:::\n\n\n## Today's Data\n\nThese data come from the Gapminder Foundation, an organization interested in increasing the use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels.\n\nToday we will be comparing math achievement scores across continents and years. Math achievement was measured for 42 countries based on their average score for the grade 8 international TIMSS test.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmath_scores <- read_csv(here::here(\"labs\", \n \"data\",\n \"math_scores.csv\")\n )\n\n# Creating a year_cat variable that is the categorical version of year\nmath_scores <- mutate(math_scores, \n year_cat = as.factor(year)\n )\n\n# Removing the missing values from the grade_8_math_score variable\nmath_scores <- drop_na(data = math_scores, \n grade_8_math_score)\n```\n:::\n\n\n## Data Visualizations\n\nThe first step for a statistical analysis should always be creating visualizations of the data. 
Similar to what you are expected to do for your project, you will make three density ridge plots:\n\n- visualizing the relationship between math score and year\n- visualizing the relationship between math score and continent\n- visualizing the relationship of math score with both year **and** continent\n\n[**Question 1**]{.underline} -- Fill in the code below to visualize the distribution of grade 8 math scores over time.\n\n**Don't forget to include axis labels!**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(data = math_scores, \n mapping = aes(x = ____, \n y = ____)) +\n geom_density_ridges(scale = 1) \n```\n:::\n\n\n*Note: I've included a `scale = 1` argument to show you how to keep the density plots from overlapping!*\n\n[**Question 2**]{.underline} -- What do you see in the plot you made? How do the centers (means) of the distributions compare? What about the variability (spread) of the distributions?\n\n[**Question 3**]{.underline} -- Write the code to visualize the distribution of grade 8 math scores for the six different continents.\n\n**Don't forget to include axis labels!**\n\n\n::: {.cell}\n\n:::\n\n\n[**Question 4**]{.underline} -- What do you see in the plot you made? How do the centers (means) of the distributions compare? What about the variability (spread) of the distributions?\n\n[**Question 5**]{.underline} -- Write the code to visualize the distribution of grade 8 math scores for the six different continents for each of the four years.\n\n***Remember, you could either include a facet or a color here! Also remember you can use `alpha` to change the transparency of your density ridges!***\n\n\n::: {.cell}\n\n:::\n\n\n[**Question 6**]{.underline} -- What do you see in the plot you made? Does it seem that the relationship between year and grade 8 math scores changes based on the continent of the country?\n\n## Statistical Model\n\nFor our analysis we will be using an analysis of variance (ANOVA) model. 
An ANOVA is an appropriate statistical model as we have a continuous response variable (grade 8 math score) and categorical explanatory variables (year, continent). Year is not considered to be a continuous numerical variable as we have only four measurements in time (1996, 1999, 2003, 2007).\n\n### Model Conditions\n\nAn ANOVA has model conditions that are very similar to what we learned for linear regression. In this section we will evaluate the conditions of the model.\n\nFor this section, it might be helpful to know how many observations there are for each year and for each continent. I have written code below to provide you with a table of these numbers:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncount(math_scores, continent, year) %>% \n pivot_wider(names_from = continent, \n values_from = n, \n values_fill = 0) %>% \n janitor::adorn_totals(where = c(\"row\", \"col\"))\n```\n:::\n\n\n#### Independence\n\nBased on the table we know:\n\n- each year has measurements on about six continents\n- each continent has measurements for about four years\n\nUse this information to evaluate the condition of independence of observations.\n\n[**Question 7**]{.underline} -- Is it reasonable to assume that the observations *within* a continent are independent of each other?\n\n[**Question 8**]{.underline} -- Is it reasonable to assume that the observations *within* a year are independent of each other?\n\n[**Question 9**]{.underline} -- Is it reasonable to assume that the observations *between* continents are independent of each other?\n\n[**Question 10**]{.underline} -- Is it reasonable to assume that the observations *between* years are independent of each other?\n\n#### Normality\n\nNow we will evaluate the normality of the distributions of grade 8 math scores across years and across continents -- the plot you created in #5. 
*Keep in mind, the normality condition is very important when the sample sizes for each group are relatively small.*\n\n[**Question 11**]{.underline} -- Is it reasonable to say that the grade 8 math scores across the four years and six continents are normally distributed?\n\n#### Equal Variance\n\nNow we will evaluate the variability of the distributions of grade 8 math scores across years and across continents. *Keep in mind, the constant variance condition is especially important when the sample sizes differ between groups.*\n\nFor this section, it might be helpful to know the variances for each year / continent combination. I have written code below to provide you with a table of these numbers:\n\n*Keep in mind a variance of `NA` can happen for two reasons: (1) there is no data, or (2) there is only one observation.*\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmath_scores %>% \n group_by(year, continent) %>% \n summarize(var = var(grade_8_math_score, na.rm = TRUE)\n ) %>% \n pivot_wider(names_from = continent, values_from = var)\n```\n:::\n\n\nLooking at the table, we can see that the largest variance of 10257 (North America, 2007) is nearly 27 times larger than the smallest variance of 381 (Europe, 2003). That's a lot! So, our equal variance condition is definitely violated.\n\nBut, we have learned tools to attempt to remedy this issue! 
Let's take the log of `grade_8_math_score` and see how the variances compare.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmath_scores %>% \n group_by(year, continent) %>% \n summarize(log_var = var(log(grade_8_math_score), na.rm = TRUE)\n ) %>%\n pivot_wider(names_from = continent, values_from = log_var)\n```\n:::\n\n\n[**Question 12**]{.underline} -- Based on the variances in the table above, is it reasonable to say that the *log* grade 8 math scores across the four years and six continents have equal variability?\n\n## One-Way ANOVA Inference\n\nWe are going to test out both methods for conducting a hypothesis test for an ANOVA -- theory-based and simulation-based methods. Keep in mind **both** methods require independence of observations **and** equal variability. Normality, however, is only a condition of theory-based methods.\n\n### Testing for a Difference Between Years\n\nSince the distribution of grade 8 math scores across the four years wasn't severely non-Normal, let's give a theory-based method a try.\n\n[**Question 13**]{.underline} -- Fill in the code below to conduct a one-way ANOVA modeling the relationship between mean grade 8 math score and year.\n\n*Keep in mind the response variable comes first and the explanatory variable comes second!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\naov(____ ~ ____, data = math_scores) %>% \n broom::tidy()\n```\n:::\n\n\n[**Question 14**]{.underline} -- At an $\\alpha = 0.1$, what decision would you reach for your hypothesis test?\n\n[**Question 15**]{.underline} -- What would you conclude about the relationship between the mean grade 8 math scores and year?\n\n### Testing for a Difference Between Continents\n\nSince the distribution of grade 8 math scores across the six continents didn't look very Normal, let's give a simulation-based method a try.\n\nI've gotten you started by calculating the observed F-statistic for the relationship between a country's grade 8 math score and its continent.\n\n\n::: {.cell}\n\n```{.r 
.cell-code}\nobs_F <- math_scores %>% \n specify(response = grade_8_math_score, \n explanatory = continent) %>% \n calculate(stat = \"F\")\n```\n:::\n\n\n[**Question 16**]{.underline} -- Write the code to generate a permutation distribution of resampled F-statistics.\n\n\n::: {.cell}\n\n:::\n\n\n[**Question 17**]{.underline} -- Visualize the null distribution and shade how the p-value should be calculated\n\n*Keep in mind you only look at the right tail for an ANOVA!*\n\n\n::: {.cell}\n\n:::\n\n\n[**Question 18**]{.underline} -- Calculate the p-value for the observed F-statistic\n\n\n::: {.cell}\n\n:::\n\n\n[**Question 19**]{.underline} -- At an $\\alpha = 0.1$, what decision would you reach for your hypothesis test?\n\n[**Question 20**]{.underline} -- What would you conclude about the relationship between the mean grade 8 math scores and continent?\n", "supporting": [ "lab-9_files" ], diff --git a/_freeze/site_libs/revealjs/dist/theme/quarto.css b/_freeze/site_libs/revealjs/dist/theme/quarto.css index 73c6aa9f..6103e1e1 100644 --- a/_freeze/site_libs/revealjs/dist/theme/quarto.css +++ b/_freeze/site_libs/revealjs/dist/theme/quarto.css @@ -1,4 +1,4 @@ -@import'https://fonts.googleapis.com/css2?family=Atkinson+Hyperlegible&display=swap';@import"https://fonts.googleapis.com/css2?family=Delius+Unicase&display=swap";@import"./fonts/source-sans-pro/source-sans-pro.css";:root{--r-background-color: #fff;--r-main-font: Atkinson Hyperlegible, sans-serif;--r-main-font-size: 40px;--r-main-color: #222;--r-block-margin: 12px;--r-heading-margin: 0 0 12px 0;--r-heading-font: Atkinson Hyperlegible, sans-serif;--r-heading-color: #75AADB;--r-heading-line-height: 1.2;--r-heading-letter-spacing: normal;--r-heading-text-transform: none;--r-heading-text-shadow: none;--r-heading-font-weight: 600;--r-heading1-text-shadow: none;--r-heading1-size: 2.5em;--r-heading2-size: 1.6em;--r-heading3-size: 1.3em;--r-heading4-size: 1em;--r-code-font: SFMono-Regular, Menlo, Monaco, Consolas, Liberation 
Mono, Courier New, monospace;--r-link-color: #75AADB;--r-link-color-dark: #3885cb;--r-link-color-hover: #9dc3e6;--r-selection-background-color: #dae8f5;--r-selection-color: #fff}.reveal-viewport{background:#fff;background-color:var(--r-background-color)}.reveal{font-family:var(--r-main-font);font-size:var(--r-main-font-size);font-weight:normal;color:var(--r-main-color)}.reveal ::selection{color:var(--r-selection-color);background:var(--r-selection-background-color);text-shadow:none}.reveal ::-moz-selection{color:var(--r-selection-color);background:var(--r-selection-background-color);text-shadow:none}.reveal .slides section,.reveal .slides section>section{line-height:1.3;font-weight:inherit}.reveal h1,.reveal h2,.reveal h3,.reveal h4,.reveal h5,.reveal h6{margin:var(--r-heading-margin);color:var(--r-heading-color);font-family:var(--r-heading-font);font-weight:var(--r-heading-font-weight);line-height:var(--r-heading-line-height);letter-spacing:var(--r-heading-letter-spacing);text-transform:var(--r-heading-text-transform);text-shadow:var(--r-heading-text-shadow);word-wrap:break-word}.reveal h1{font-size:var(--r-heading1-size)}.reveal h2{font-size:var(--r-heading2-size)}.reveal h3{font-size:var(--r-heading3-size)}.reveal h4{font-size:var(--r-heading4-size)}.reveal h1{text-shadow:var(--r-heading1-text-shadow)}.reveal p{margin:var(--r-block-margin) 0;line-height:1.3}.reveal h1:last-child,.reveal h2:last-child,.reveal h3:last-child,.reveal h4:last-child,.reveal h5:last-child,.reveal h6:last-child{margin-bottom:0}.reveal img,.reveal video,.reveal iframe{max-width:95%;max-height:95%}.reveal strong,.reveal b{font-weight:bold}.reveal em{font-style:italic}.reveal ol,.reveal dl,.reveal ul{display:inline-block;text-align:left;margin:0 0 0 1em}.reveal ol{list-style-type:decimal}.reveal ul{list-style-type:disc}.reveal ul ul{list-style-type:square}.reveal ul ul ul{list-style-type:circle}.reveal ul ul,.reveal ul ol,.reveal ol ol,.reveal ol ul{display:block;margin-left:40px}.reveal 
dt{font-weight:bold}.reveal dd{margin-left:40px}.reveal blockquote{display:block;position:relative;width:70%;margin:var(--r-block-margin) auto;padding:5px;font-style:italic;background:rgba(255,255,255,.05);box-shadow:0px 0px 2px rgba(0,0,0,.2)}.reveal blockquote p:first-child,.reveal blockquote p:last-child{display:inline-block}.reveal q{font-style:italic}.reveal pre{display:block;position:relative;width:90%;margin:var(--r-block-margin) auto;text-align:left;font-size:.55em;font-family:var(--r-code-font);line-height:1.2em;word-wrap:break-word;box-shadow:0px 5px 15px rgba(0,0,0,.15)}.reveal code{font-family:var(--r-code-font);text-transform:none;tab-size:2}.reveal pre code{display:block;padding:5px;overflow:auto;max-height:400px;word-wrap:normal}.reveal .code-wrapper{white-space:normal}.reveal .code-wrapper code{white-space:pre}.reveal table{margin:auto;border-collapse:collapse;border-spacing:0}.reveal table th{font-weight:bold}.reveal table th,.reveal table td{text-align:left;padding:.2em .5em .2em .5em;border-bottom:1px solid}.reveal table th[align=center],.reveal table td[align=center]{text-align:center}.reveal table th[align=right],.reveal table td[align=right]{text-align:right}.reveal table tbody tr:last-child th,.reveal table tbody tr:last-child td{border-bottom:none}.reveal sup{vertical-align:super;font-size:smaller}.reveal sub{vertical-align:sub;font-size:smaller}.reveal small{display:inline-block;font-size:.6em;line-height:1.2em;vertical-align:top}.reveal small *{vertical-align:top}.reveal img{margin:var(--r-block-margin) 0}.reveal a{color:var(--r-link-color);text-decoration:none;transition:color .15s ease}.reveal a:hover{color:var(--r-link-color-hover);text-shadow:none;border:none}.reveal .roll span:after{color:#fff;background:var(--r-link-color-dark)}.reveal .r-frame{border:4px solid var(--r-main-color);box-shadow:0 0 10px rgba(0,0,0,.15)}.reveal a .r-frame{transition:all .15s linear}.reveal a:hover .r-frame{border-color:var(--r-link-color);box-shadow:0 0 
20px rgba(0,0,0,.55)}.reveal .controls{color:var(--r-link-color)}.reveal .progress{background:rgba(0,0,0,.2);color:var(--r-link-color)}@media print{.backgrounds{background-color:var(--r-background-color)}}.top-right{position:absolute;top:1em;right:1em}.visually-hidden{border:0;clip:rect(0 0 0 0);height:auto;margin:0;overflow:hidden;padding:0;position:absolute;width:1px;white-space:nowrap}.hidden{display:none !important}.zindex-bottom{z-index:-1 !important}figure.figure{display:block}.quarto-layout-panel{margin-bottom:1em}.quarto-layout-panel>figure{width:100%}.quarto-layout-panel>figure>figcaption,.quarto-layout-panel>.panel-caption{margin-top:10pt}.quarto-layout-panel>.table-caption{margin-top:0px}.table-caption p{margin-bottom:.5em}.quarto-layout-row{display:flex;flex-direction:row;align-items:flex-start}.quarto-layout-valign-top{align-items:flex-start}.quarto-layout-valign-bottom{align-items:flex-end}.quarto-layout-valign-center{align-items:center}.quarto-layout-cell{position:relative;margin-right:20px}.quarto-layout-cell:last-child{margin-right:0}.quarto-layout-cell figure,.quarto-layout-cell>p{margin:.2em}.quarto-layout-cell img{max-width:100%}.quarto-layout-cell .html-widget{width:100% !important}.quarto-layout-cell div figure p{margin:0}.quarto-layout-cell figure{display:block;margin-inline-start:0;margin-inline-end:0}.quarto-layout-cell table{display:inline-table}.quarto-layout-cell-subref figcaption,figure .quarto-layout-row figure figcaption{text-align:center;font-style:italic}.quarto-figure{position:relative;margin-bottom:1em}.quarto-figure>figure{width:100%;margin-bottom:0}.quarto-figure-left>figure>p,.quarto-figure-left>figure>div{text-align:left}.quarto-figure-center>figure>p,.quarto-figure-center>figure>div{text-align:center}.quarto-figure-right>figure>p,.quarto-figure-right>figure>div{text-align:right}.quarto-figure>figure>div.cell-annotation,.quarto-figure>figure>div 
code{text-align:left}figure>p:empty{display:none}figure>p:first-child{margin-top:0;margin-bottom:0}figure>figcaption.quarto-float-caption-bottom{margin-bottom:.5em}figure>figcaption.quarto-float-caption-top{margin-top:.5em}div[id^=tbl-]{position:relative}.quarto-figure>.anchorjs-link{position:absolute;top:.6em;right:.5em}div[id^=tbl-]>.anchorjs-link{position:absolute;top:.7em;right:.3em}.quarto-figure:hover>.anchorjs-link,div[id^=tbl-]:hover>.anchorjs-link,h2:hover>.anchorjs-link,h3:hover>.anchorjs-link,h4:hover>.anchorjs-link,h5:hover>.anchorjs-link,h6:hover>.anchorjs-link,.reveal-anchorjs-link>.anchorjs-link{opacity:1}#title-block-header{margin-block-end:1rem;position:relative;margin-top:-1px}#title-block-header .abstract{margin-block-start:1rem}#title-block-header .abstract .abstract-title{font-weight:600}#title-block-header a{text-decoration:none}#title-block-header .author,#title-block-header .date,#title-block-header .doi{margin-block-end:.2rem}#title-block-header .quarto-title-block>div{display:flex}#title-block-header .quarto-title-block>div>h1{flex-grow:1}#title-block-header .quarto-title-block>div>button{flex-shrink:0;height:2.25rem;margin-top:0}tr.header>th>p:last-of-type{margin-bottom:0px}table,table.table{margin-top:.5rem;margin-bottom:.5rem}caption,.table-caption{padding-top:.5rem;padding-bottom:.5rem;text-align:center}figure.quarto-float-tbl figcaption.quarto-float-caption-top{margin-top:.5rem;margin-bottom:.25rem;text-align:center}figure.quarto-float-tbl figcaption.quarto-float-caption-bottom{padding-top:.25rem;margin-bottom:.5rem;text-align:center}.utterances{max-width:none;margin-left:-8px}iframe{margin-bottom:1em}details{margin-bottom:1em}details[show]{margin-bottom:0}details>summary{color:#6f6f6f}details>summary>p:only-child{display:inline}pre.sourceCode,code.sourceCode{position:relative}p code:not(.sourceCode){white-space:pre-wrap}code{white-space:pre}@media 
print{code{white-space:pre-wrap}}pre>code{display:block}pre>code.sourceCode{white-space:pre}pre>code.sourceCode>span>a:first-child::before{text-decoration:none}pre.code-overflow-wrap>code.sourceCode{white-space:pre-wrap}pre.code-overflow-scroll>code.sourceCode{white-space:pre}code a:any-link{color:inherit;text-decoration:none}code a:hover{color:inherit;text-decoration:underline}ul.task-list{padding-left:1em}[data-tippy-root]{display:inline-block}.tippy-content .footnote-back{display:none}.tippy-content{overflow-x:auto}.quarto-embedded-source-code{display:none}.quarto-unresolved-ref{font-weight:600}.quarto-cover-image{max-width:35%;float:right;margin-left:30px}.cell-output-display .widget-subarea{margin-bottom:1em}.cell-output-display:not(.no-overflow-x),.knitsql-table:not(.no-overflow-x){overflow-x:auto}.panel-input{margin-bottom:1em}.panel-input>div,.panel-input>div>div{display:inline-block;vertical-align:top;padding-right:12px}.panel-input>p:last-child{margin-bottom:0}.layout-sidebar{margin-bottom:1em}.layout-sidebar .tab-content{border:none}.tab-content>.page-columns.active{display:grid}div.sourceCode>iframe{width:100%;height:300px;margin-bottom:-0.5em}a{text-underline-offset:3px}div.ansi-escaped-output{font-family:monospace;display:block}/*! 
+@import'https://fonts.googleapis.com/css2?family=Atkinson+Hyperlegible&display=swap';@import"https://fonts.googleapis.com/css2?family=Delius+Unicase&display=swap";@import"./fonts/source-sans-pro/source-sans-pro.css";:root{--r-background-color: #fff;--r-main-font: Atkinson Hyperlegible, sans-serif;--r-main-font-size: 40px;--r-main-color: #222;--r-block-margin: 12px;--r-heading-margin: 0 0 12px 0;--r-heading-font: Atkinson Hyperlegible, sans-serif;--r-heading-color: #75AADB;--r-heading-line-height: 1.2;--r-heading-letter-spacing: normal;--r-heading-text-transform: none;--r-heading-text-shadow: none;--r-heading-font-weight: 600;--r-heading1-text-shadow: none;--r-heading1-size: 2.5em;--r-heading2-size: 1.6em;--r-heading3-size: 1.3em;--r-heading4-size: 1em;--r-code-font: SFMono-Regular, Menlo, Monaco, Consolas, Liberation Mono, Courier New, monospace;--r-link-color: #75AADB;--r-link-color-dark: #3885cb;--r-link-color-hover: #9dc3e6;--r-selection-background-color: #dae8f5;--r-selection-color: #fff}.reveal-viewport{background:#fff;background-color:var(--r-background-color)}.reveal{font-family:var(--r-main-font);font-size:var(--r-main-font-size);font-weight:normal;color:var(--r-main-color)}.reveal ::selection{color:var(--r-selection-color);background:var(--r-selection-background-color);text-shadow:none}.reveal ::-moz-selection{color:var(--r-selection-color);background:var(--r-selection-background-color);text-shadow:none}.reveal .slides section,.reveal .slides section>section{line-height:1.3;font-weight:inherit}.reveal h1,.reveal h2,.reveal h3,.reveal h4,.reveal h5,.reveal h6{margin:var(--r-heading-margin);color:var(--r-heading-color);font-family:var(--r-heading-font);font-weight:var(--r-heading-font-weight);line-height:var(--r-heading-line-height);letter-spacing:var(--r-heading-letter-spacing);text-transform:var(--r-heading-text-transform);text-shadow:var(--r-heading-text-shadow);word-wrap:break-word}.reveal h1{font-size:var(--r-heading1-size)}.reveal 
h2{font-size:var(--r-heading2-size)}.reveal h3{font-size:var(--r-heading3-size)}.reveal h4{font-size:var(--r-heading4-size)}.reveal h1{text-shadow:var(--r-heading1-text-shadow)}.reveal p{margin:var(--r-block-margin) 0;line-height:1.3}.reveal h1:last-child,.reveal h2:last-child,.reveal h3:last-child,.reveal h4:last-child,.reveal h5:last-child,.reveal h6:last-child{margin-bottom:0}.reveal img,.reveal video,.reveal iframe{max-width:95%;max-height:95%}.reveal strong,.reveal b{font-weight:bold}.reveal em{font-style:italic}.reveal ol,.reveal dl,.reveal ul{display:inline-block;text-align:left;margin:0 0 0 1em}.reveal ol{list-style-type:decimal}.reveal ul{list-style-type:disc}.reveal ul ul{list-style-type:square}.reveal ul ul ul{list-style-type:circle}.reveal ul ul,.reveal ul ol,.reveal ol ol,.reveal ol ul{display:block;margin-left:40px}.reveal dt{font-weight:bold}.reveal dd{margin-left:40px}.reveal blockquote{display:block;position:relative;width:70%;margin:var(--r-block-margin) auto;padding:5px;font-style:italic;background:rgba(255,255,255,.05);box-shadow:0px 0px 2px rgba(0,0,0,.2)}.reveal blockquote p:first-child,.reveal blockquote p:last-child{display:inline-block}.reveal q{font-style:italic}.reveal pre{display:block;position:relative;width:90%;margin:var(--r-block-margin) auto;text-align:left;font-size:.55em;font-family:var(--r-code-font);line-height:1.2em;word-wrap:break-word;box-shadow:0px 5px 15px rgba(0,0,0,.15)}.reveal code{font-family:var(--r-code-font);text-transform:none;tab-size:2}.reveal pre code{display:block;padding:5px;overflow:auto;max-height:400px;word-wrap:normal}.reveal .code-wrapper{white-space:normal}.reveal .code-wrapper code{white-space:pre}.reveal table{margin:auto;border-collapse:collapse;border-spacing:0}.reveal table th{font-weight:bold}.reveal table th,.reveal table td{text-align:left;padding:.2em .5em .2em .5em;border-bottom:1px solid}.reveal table th[align=center],.reveal table td[align=center]{text-align:center}.reveal table 
th[align=right],.reveal table td[align=right]{text-align:right}.reveal table tbody tr:last-child th,.reveal table tbody tr:last-child td{border-bottom:none}.reveal sup{vertical-align:super;font-size:smaller}.reveal sub{vertical-align:sub;font-size:smaller}.reveal small{display:inline-block;font-size:.6em;line-height:1.2em;vertical-align:top}.reveal small *{vertical-align:top}.reveal img{margin:var(--r-block-margin) 0}.reveal a{color:var(--r-link-color);text-decoration:none;transition:color .15s ease}.reveal a:hover{color:var(--r-link-color-hover);text-shadow:none;border:none}.reveal .roll span:after{color:#fff;background:var(--r-link-color-dark)}.reveal .r-frame{border:4px solid var(--r-main-color);box-shadow:0 0 10px rgba(0,0,0,.15)}.reveal a .r-frame{transition:all .15s linear}.reveal a:hover .r-frame{border-color:var(--r-link-color);box-shadow:0 0 20px rgba(0,0,0,.55)}.reveal .controls{color:var(--r-link-color)}.reveal .progress{background:rgba(0,0,0,.2);color:var(--r-link-color)}@media print{.backgrounds{background-color:var(--r-background-color)}}.top-right{position:absolute;top:1em;right:1em}.visually-hidden{border:0;clip:rect(0 0 0 0);height:auto;margin:0;overflow:hidden;padding:0;position:absolute;width:1px;white-space:nowrap}.hidden{display:none !important}.zindex-bottom{z-index:-1 !important}figure.figure{display:block}.quarto-layout-panel{margin-bottom:1em}.quarto-layout-panel>figure{width:100%}.quarto-layout-panel>figure>figcaption,.quarto-layout-panel>.panel-caption{margin-top:10pt}.quarto-layout-panel>.table-caption{margin-top:0px}.table-caption p{margin-bottom:.5em}.quarto-layout-row{display:flex;flex-direction:row;align-items:flex-start}.quarto-layout-valign-top{align-items:flex-start}.quarto-layout-valign-bottom{align-items:flex-end}.quarto-layout-valign-center{align-items:center}.quarto-layout-cell{position:relative;margin-right:20px}.quarto-layout-cell:last-child{margin-right:0}.quarto-layout-cell 
figure,.quarto-layout-cell>p{margin:.2em}.quarto-layout-cell img{max-width:100%}.quarto-layout-cell .html-widget{width:100% !important}.quarto-layout-cell div figure p{margin:0}.quarto-layout-cell figure{display:block;margin-inline-start:0;margin-inline-end:0}.quarto-layout-cell table{display:inline-table}.quarto-layout-cell-subref figcaption,figure .quarto-layout-row figure figcaption{text-align:center;font-style:italic}.quarto-figure{position:relative;margin-bottom:1em}.quarto-figure>figure{width:100%;margin-bottom:0}.quarto-figure-left>figure>p,.quarto-figure-left>figure>div{text-align:left}.quarto-figure-center>figure>p,.quarto-figure-center>figure>div{text-align:center}.quarto-figure-right>figure>p,.quarto-figure-right>figure>div{text-align:right}.quarto-figure>figure>div.cell-annotation,.quarto-figure>figure>div code{text-align:left}figure>p:empty{display:none}figure>p:first-child{margin-top:0;margin-bottom:0}figure>figcaption.quarto-float-caption-bottom{margin-bottom:.5em}figure>figcaption.quarto-float-caption-top{margin-top:.5em}div[id^=tbl-]{position:relative}.quarto-figure>.anchorjs-link{position:absolute;top:.6em;right:.5em}div[id^=tbl-]>.anchorjs-link{position:absolute;top:.7em;right:.3em}.quarto-figure:hover>.anchorjs-link,div[id^=tbl-]:hover>.anchorjs-link,h2:hover>.anchorjs-link,h3:hover>.anchorjs-link,h4:hover>.anchorjs-link,h5:hover>.anchorjs-link,h6:hover>.anchorjs-link,.reveal-anchorjs-link>.anchorjs-link{opacity:1}#title-block-header{margin-block-end:1rem;position:relative;margin-top:-1px}#title-block-header .abstract{margin-block-start:1rem}#title-block-header .abstract .abstract-title{font-weight:600}#title-block-header a{text-decoration:none}#title-block-header .author,#title-block-header .date,#title-block-header .doi{margin-block-end:.2rem}#title-block-header .quarto-title-block>div{display:flex}#title-block-header .quarto-title-block>div>h1{flex-grow:1}#title-block-header 
.quarto-title-block>div>button{flex-shrink:0;height:2.25rem;margin-top:0}tr.header>th>p:last-of-type{margin-bottom:0px}table,table.table{margin-top:.5rem;margin-bottom:.5rem}caption,.table-caption{padding-top:.5rem;padding-bottom:.5rem;text-align:center}figure.quarto-float-tbl figcaption.quarto-float-caption-top{margin-top:.5rem;margin-bottom:.25rem;text-align:center}figure.quarto-float-tbl figcaption.quarto-float-caption-bottom{padding-top:.25rem;margin-bottom:.5rem;text-align:center}.utterances{max-width:none;margin-left:-8px}iframe{margin-bottom:1em}details{margin-bottom:1em}details[show]{margin-bottom:0}details>summary{color:#6f6f6f}details>summary>p:only-child{display:inline}pre.sourceCode,code.sourceCode{position:relative}p code:not(.sourceCode){white-space:pre-wrap}code{white-space:pre}@media print{code{white-space:pre-wrap}}pre>code{display:block}pre>code.sourceCode{white-space:pre}pre>code.sourceCode>span>a:first-child::before{text-decoration:none}pre.code-overflow-wrap>code.sourceCode{white-space:pre-wrap}pre.code-overflow-scroll>code.sourceCode{white-space:pre}code a:any-link{color:inherit;text-decoration:none}code a:hover{color:inherit;text-decoration:underline}ul.task-list{padding-left:1em}[data-tippy-root]{display:inline-block}.tippy-content .footnote-back{display:none}.footnote-back{margin-left:.2em}.tippy-content{overflow-x:auto}.quarto-embedded-source-code{display:none}.quarto-unresolved-ref{font-weight:600}.quarto-cover-image{max-width:35%;float:right;margin-left:30px}.cell-output-display .widget-subarea{margin-bottom:1em}.cell-output-display:not(.no-overflow-x),.knitsql-table:not(.no-overflow-x){overflow-x:auto}.panel-input{margin-bottom:1em}.panel-input>div,.panel-input>div>div{display:inline-block;vertical-align:top;padding-right:12px}.panel-input>p:last-child{margin-bottom:0}.layout-sidebar{margin-bottom:1em}.layout-sidebar 
.tab-content{border:none}.tab-content>.page-columns.active{display:grid}div.sourceCode>iframe{width:100%;height:300px;margin-bottom:-0.5em}a{text-underline-offset:3px}div.ansi-escaped-output{font-family:monospace;display:block}/*! * * ansi colors from IPython notebook's * diff --git a/_freeze/slides/week10-day1/execute-results/html.json b/_freeze/slides/week10-day1/execute-results/html.json index 069ee871..1f706fa4 100644 --- a/_freeze/slides/week10-day1/execute-results/html.json +++ b/_freeze/slides/week10-day1/execute-results/html.json @@ -2,7 +2,7 @@ "hash": "b77fd5c2888783270b90dfedbab7d94c", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Week 10: Two-way ANOVA\"\nformat: \n revealjs:\n theme: style.scss\neditor: visual\n---\n\n\n\n\n# Week 10\n\n## Wrapping Up Revisions\n\n::: incremental\n- Statistical Critique 2 revisions are due by Thursday\n- Lab 8 revisions are due by Thursday\n- Final revisions on **all** assignments will be accepted until this Sunday, March 17\n:::\n\n. . .\n\n::: callout-caution\n# One round of revisions\n\nYou will only have time for *one* round of revisions on Lab 8 and Statistical Critique 2, so make sure you feel confident about your revisions.\n:::\n\n## Final Project\n\n- Feedback (from me) will be provided no later than Thursday evening\n- Peer feedback on Thursday\n - Print your report!\n\n# Two-Way ANOVA Models\n\n## \n\n::: {style=\"font-size: 3em; color: #000000;\"}\nTwo-way ANOVA\n:::\n\n
\n\n::: {style=\"font-size: 2em; color: #0F4C81;\"}\nGoal:\n:::\n\nAssess if [multiple]{style=\"color: #0F4C81\"} categorical variables have a relationship with the response.\n\n## Modeling Options\n\n::: columns\n::: {.column width=\"50%\"}\n::: {style=\"font-size: 1.25em; color: #ed8402;\"}\nAdditive Model\n:::\n\n::: fragment\nAssess if each explanatory variable has a meaningful relationship with the response, conditional on the variable(s) included in the model.\n:::\n:::\n\n::: {.column width=\"50%\"}\n::: {style=\"font-size: 1.25em; color: #0F4C81;\"}\nInteraction Model\n:::\n\n::: fragment\nAssess if the relationship between one categorical explanatory variable and the response **differs** based on the values of another categorical variable.\n:::\n:::\n:::\n\n## What are we looking for?\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](week10-day1_files/figure-revealjs/twa-plot-year-1.png){width=960}\n:::\n:::\n\n\n## Another way to think about it...\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](week10-day1_files/figure-revealjs/twa-facet-maples-1.png){width=960}\n:::\n:::\n\n\n# Interaction Two-way ANOVA\n\n## Research Question\n\n> Does the relationship between stem dry mass and calcium treatment for sugar maples differ based on the year the treatment was applied?\n\n. . .\n\n
\n\nOr, because the study was an experiment...\n\n> Does the effect of calcium treatment on the stem dry mass of sugar maples differ based on the year of the treatment?\n\n## Conditions\n\n- Independence of observations\n\n::: small\n> Observations are independent *within* groups **and** *between* groups\n:::\n\n. . .\n\n- Equal variability of the groups\n\n::: small\n> The spread of the distributions is similar across groups\n:::\n\n. . .\n\n- Normality of the residuals\n\n::: small\n> The distribution of residuals for each group is approximately normal\n:::\n\n## Theory-based Two-Way ANOVA\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"false\"}\naov(stem_dry_mass ~ watershed * year_cat, \n data = hbr_maples_small)\n```\n:::\n\n\n
\n\n\n::: {.cell}\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n<thead>\n<tr><th>term</th><th>df</th><th>sumsq</th><th>meansq</th><th>statistic</th><th>p.value</th></tr>\n</thead>\n<tbody>\n<tr><td>watershed</td><td>1</td><td>0.016738401</td><td>0.0167384006</td><td>73.57493</td><td>1.714080e-15</td></tr>\n<tr><td>year_cat</td><td>1</td><td>0.109417457</td><td>0.1094174573</td><td>480.95287</td><td>2.232451e-57</td></tr>\n<tr><td>watershed:year_cat</td><td>1</td><td>0.004320073</td><td>0.0043200727</td><td>18.98921</td><td>2.013559e-05</td></tr>\n<tr><td>Residuals</td><td>221</td><td>0.050277812</td><td>0.0002275014</td><td>NA</td><td>NA</td></tr>\n</tbody>\n</table>
\n```\n\n:::\n:::\n\n\n. . .\n\n::: small\nThe `watershed:year_cat` line is testing if the relationship between the calcium treatment (`watershed`) and stem dry mass differs between 2003 and 2004.\n:::\n\n. . .\n\n
\n\n
Does it?
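\n\n. . .\n\n::: small\nAs a sketch of how you could pull that decision out of the table (assuming the tidied ANOVA output above was saved as `maples_anova` -- a hypothetical name):\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"false\"}\n# hypothetical object: `maples_anova` holds the tidied ANOVA table above\nmaples_anova %>% \n  filter(term == \"watershed:year_cat\") %>% \n  pull(p.value)\n# the interaction p-value (about 2e-05) is far below common alpha levels,\n# so there is evidence the treatment effect differs by year\n```\n:::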
\n\n## How are those p-values calculated?\n\nThe p-values in the previous table use **Type I** sums of squares.\n\n> Type I sums of squares are \"sequential,\" meaning variables are tested in the order they are listed.\n\n. . .\n\n
\n\nSo, the p-value for `watershed:year_cat` is **conditional** on including `watershed` and `year_cat` as explanatory variables.\n\n. . .\n\n
\n\nIs that what we want????\n\n## Testing \"main effects\"\n\nIf there is evidence of an interaction, we **do not** test if the main effects are \"significant.\"\n\n. . .\n\n
\n\nWhy?\n\n. . .\n\n
\n\nThe interactions depend on these variables, so they should be included in the model!\n\n## Interpreting \"main effects\"\n\nWhen interaction effects are present, the interpretation of the main effects is incomplete or misleading\n\n::: center\n![](images/car-interaction.png)\n:::\n\n# Additive Two-way ANOVA\n\n## What if our analysis found no evidence of an interaction?\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](week10-day1_files/figure-revealjs/twa-plot-watershed-1.png){fig-align='center' width=960}\n:::\n:::\n\n\n## Testing for a relationship for each variable\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"false\"}\naov(stem_dry_mass ~ elevation + watershed, \n data = hbr_maples_small) %>% \n tidy()\n```\n:::\n\n\n
\n\n\n::: {.cell}\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n<thead>\n<tr><th>term</th><th>df</th><th>sumsq</th><th>meansq</th><th>statistic</th><th>p.value</th></tr>\n</thead>\n<tbody>\n<tr><td>elevation</td><td>1</td><td>0.0001345297</td><td>0.0001345297</td><td>1.82845</td><td>1.783595e-01</td></tr>\n<tr><td>watershed</td><td>1</td><td>0.0035733249</td><td>0.0035733249</td><td>48.56658</td><td>9.607445e-11</td></tr>\n<tr><td>Residuals</td><td>149</td><td>0.0109627935</td><td>0.0000735758</td><td>NA</td><td>NA</td></tr>\n</tbody>\n</table>
\n```\n\n:::\n:::\n\n\n
\n\n. . .\n\n::: {style=\"text-align: center;\"}\nDo you think it matters which variable comes first?\n:::\n\n## Let's see...\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"false\"}\naov(stem_dry_mass ~ watershed + elevation, \n data = hbr_maples) %>% \n tidy()\n```\n:::\n\n\n
\n\n\n::: {.cell}\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n<thead>\n<tr><th>term</th><th>df</th><th>sumsq</th><th>meansq</th><th>statistic</th><th>p.value</th></tr>\n</thead>\n<tbody>\n<tr><td>watershed</td><td>1</td><td>0.0065062507</td><td>6.506251e-03</td><td>86.658504</td><td>9.073535e-18</td></tr>\n<tr><td>elevation</td><td>1</td><td>0.0005821935</td><td>5.821935e-04</td><td>7.754392</td><td>5.791052e-03</td></tr>\n<tr><td>Residuals</td><td>237</td><td>0.0177937692</td><td>7.507919e-05</td><td>NA</td><td>NA</td></tr>\n</tbody>\n</table>
\n```\n\n:::\n:::\n\n\n
\n\nDid we get the same p-values as before?\n\n## Sequential Versus Partial Sums of Squares\n\nSimilar to before, the p-values in the ANOVA table use Type I (sequential) sums of squares.\n\n::: incremental\n::: small\n- The p-value for each variable is conditional on the variable(s) that came *before* it.\n- The p-value for `elevation` is conditional on `watershed` being included in the model\n- The p-value for `watershed` is conditional on...nothing.\n:::\n:::\n\n. . .\n\n::: small\nIf we want the p-value for each explanatory variable to be conditional on **every** variable included in the model, then we need to use a different type of sums of squares!\n:::\n\n## Partial Sums of Squares\n\n> Type III sums of squares are \"partial,\" meaning every term in the model is tested in light of the other terms in the model.\n\n. . .\n\n::: small\n- The p-value for `elevation` is conditional on `watershed` being included in the model\n- The p-value for `watershed` is conditional on `elevation` being included in the model\n:::\n\n. . .\n\n::: callout-tip\n# Only different for variables that *were not* first\n\nWe could have used Type III sums of squares for the interaction model and would have gotten the same p-value!\n:::\n\n## Getting the Conditional Tests for Every Variable\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"false\"}\nlibrary(car)\n\nwater_elev_lm <- lm(stem_dry_mass ~ watershed + elevation, \n data = hbr_maples_small) \n\nAnova(water_elev_lm, type = \"III\")\n```\n:::\n\n\n::: callout-tip\n# Load in the `car` package!\n:::\n\n## Additive Model Hypothesis Tests\n\n\n::: {.cell}\n::: {.cell-output-display}\n\n```{=html}\n
<table>
<thead>
<tr><th>term</th><th>sumsq</th><th>df</th><th>statistic</th><th>p.value</th></tr>
</thead>
<tbody>
<tr><td>(Intercept)</td><td>0.0370595336</td><td>1</td><td>503.691919</td><td>1.193044e-49</td></tr>
<tr><td>watershed</td><td>0.0035733249</td><td>1</td><td>48.566583</td><td>9.607445e-11</td></tr>
<tr><td>elevation</td><td>0.0003059393</td><td>1</td><td>4.158151</td><td>4.320201e-02</td></tr>
<tr><td>Residuals</td><td>0.0109627935</td><td>149</td><td>NA</td><td>NA</td></tr>
</tbody>
</table>
\n```\n\n:::\n:::\n\n\n
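As a sketch of how a single conditional p-value could be pulled out of this Type III table programmatically (assuming the `car`, `broom`, and `dplyr` packages and the `hbr_maples_small` data used in the chunk above):

```r
library(car)
library(broom)
library(dplyr)

water_elev_lm <- lm(stem_dry_mass ~ watershed + elevation, 
                    data = hbr_maples_small)

# Type III: each term is tested conditional on every other term in the model
Anova(water_elev_lm, type = "III") %>% 
  tidy() %>% 
  filter(term == "elevation") %>% 
  pull(p.value)
```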
\n\n::: {style=\"text-align: center;\"}\n**What do you think the `elevation` line is testing?**\n\n::: fragment\n**What would you decide?**\n:::\n:::\n\n## Keeping \"Non-significant\" Variables\n\n
\n\nShould you always remove variables with \"large\" p-values from an ANOVA?\n\n. . .\n\n
\n\nNo!\n\nEven \"non-significant\" variables explain some amount of the variation in the response, which makes your estimates of a treatment effect more precise!\n\n# Steps for Final Project\n\n## Hypothesis Test Steps\n\n::: columns\n::: {.column width=\"35%\"}\n::: {style=\"font-size: 1.5em;\"}\nStep 1: Fit a one-way ANOVA model for each categorical variable\n:::\n:::\n\n::: {.column width=\"5%\"}\n:::\n\n::: {.column width=\"60%\"}\n::: fragment\n::: {style=\"font-size: 1.5em; color: #0F4C81;\"}\nStep 2: Decide if each explanatory variable has a meaningful relationship with the response variable\n:::\n:::\n\n::: fragment\n- If yes, then go to Step 3!\n- If no, then report which variable (if any) has the strongest relationship with the response.\n:::\n:::\n:::\n\n## Step 3 -- Fit an Additive Two-way ANOVA\n\nIf there is evidence that **both** variables have a relationship with the response variable, then you fit an *additive* two-way ANOVA.\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"false\"}\nlibrary(car) \nlibrary(broom) \n\nmy_model <- lm( ~ + ,\n data = ) \n\nAnova(my_model, type = \"III\") %>% \n tidy()\n```\n:::\n\n\n::: callout-tip\n# Don't forget to load in the `car` and `broom` packages!\n:::\n\n## What about interaction models?\n\n

\n\nFor the sake of time, we **are not** fitting interaction models for the Final Project.\n\n# Do you always expect your main effects to be \"significant\" in a two-way ANOVA?\n\n# Work Session\n\n## Your Options\n\n1. Complete your revisions on Lab 8\n2. Complete your revisions on Statistical Critique 2\n3. Fit your two-way ANOVA model for your Final Project and interpret the results\n4. Finish any remaining revisions on lab or statistical critiques\n", + "markdown": "---\ntitle: \"Week 10: Two-way ANOVA\"\nformat: \n revealjs:\n theme: style.scss\neditor: visual\n---\n\n\n\n\n# Week 10\n\n## Wrapping Up Revisions\n\n::: incremental\n- Statistical Critique 2 revisions are due by Thursday\n- Lab 8 revisions are due by Thursday\n- Final revisions on **all** assignments will be accepted until this Sunday, March 17\n:::\n\n. . .\n\n::: callout-caution\n# One round of revisions\n\nYou will only have time for *one* round of revisions on Lab 8 and Statistical Critique 2, so make sure you feel confident about your revisions.\n:::\n\n## Final Project\n\n- Feedback (from me) will be provided no later than Thursday evening\n- Peer feedback on Thursday\n - Print your report!\n\n# Two-Way ANOVA Models\n\n## \n\n::: {style=\"font-size: 3em; color: #000000;\"}\nTwo-way ANOVA\n:::\n\n
\n\n::: {style=\"font-size: 2em; color: #0F4C81;\"}\nGoal:\n:::\n\nAssess if [multiple]{style=\"color: #0F4C81\"} categorical variables have a relationship with the response.\n\n## Modeling Options\n\n::: columns\n::: {.column width=\"50%\"}\n::: {style=\"font-size: 1.25em; color: #ed8402;\"}\nAdditive Model\n:::\n\n::: fragment\nAssess if each explanatory variable has a meaningful relationship with the response, conditional on the variable(s) included in the model.\n:::\n:::\n\n::: {.column width=\"50%\"}\n::: {style=\"font-size: 1.25em; color: #0F4C81;\"}\nInteraction Model\n:::\n\n::: fragment\nAssess if the relationship between one categorical explanatory variable and the response **differs** based on the values of another categorical variable.\n:::\n:::\n:::\n\n## What are we looking for?\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](week10-day1_files/figure-revealjs/twa-plot-year-1.png){width=960}\n:::\n:::\n\n\n## Another way to think about it...\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](week10-day1_files/figure-revealjs/twa-facet-maples-1.png){width=960}\n:::\n:::\n\n\n# Interaction Two-way ANOVA\n\n## Research Question\n\n> Does the relationship between stem dry mass and calcium treatment for sugar maples differ based on the year the treatment was applied?\n\n. . .\n\n
\n\nOr, because the study was an experiment...\n\n> Does the effect of calcium treatment on the stem dry mass of sugar maples differ based on the year of the treatment?\n\n## Conditions\n\n- Independence of observations\n\n::: small\n> Observations are independent *within* groups **and** *between* groups\n:::\n\n. . .\n\n- Equal variability of the groups\n\n::: small\n> The spread of the distributions are similar across groups\n:::\n\n. . .\n\n- Normality of the residuals\n\n::: small\n> The distribution of residuals for each group is approximately normal\n:::\n\n## Theory-based Two-Way ANOVA\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"false\"}\naov(stem_dry_mass ~ watershed * year_cat, \n data = hbr_maples_small)\n```\n:::\n\n\n
\n\n\n::: {.cell}\n::: {.cell-output-display}\n\n```{=html}\n
<table>
<thead>
<tr><th>term</th><th>df</th><th>sumsq</th><th>meansq</th><th>statistic</th><th>p.value</th></tr>
</thead>
<tbody>
<tr><td>watershed</td><td>1</td><td>0.016057515</td><td>0.0160575147</td><td>71.86801</td><td>3.285643e-15</td></tr>
<tr><td>year_cat</td><td>1</td><td>0.118720475</td><td>0.1187204750</td><td>531.35277</td><td>1.034508e-60</td></tr>
<tr><td>watershed:year_cat</td><td>1</td><td>0.003445855</td><td>0.0034458550</td><td>15.42248</td><td>1.148968e-04</td></tr>
<tr><td>Residuals</td><td>221</td><td>0.049378166</td><td>0.0002234306</td><td>NA</td><td>NA</td></tr>
</tbody>
</table>
\n```\n\n:::\n:::\n\n\n. . .\n\n::: small\nThe `watershed:year_cat` line is testing if the relationship between the calcium treatment (`watershed`) and stem dry mass differs between 2003 and 2004.\n:::\n\n. . .\n\n
\n\n
Does it?
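One way to answer is to pull the interaction row out of the tidied ANOVA table. A minimal sketch, assuming the `broom` and `dplyr` packages are loaded and the `hbr_maples_small` data from the chunk above is available:

```r
library(broom)
library(dplyr)

# Tidy the interaction two-way ANOVA and keep only the interaction row
aov(stem_dry_mass ~ watershed * year_cat, 
    data = hbr_maples_small) %>% 
  tidy() %>% 
  filter(term == "watershed:year_cat") %>% 
  pull(p.value)
```

With the output shown above, this would return roughly 1.1e-04 -- evidence that the effect of the calcium treatment differs between 2003 and 2004.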
\n\n## How are those p-values calculated?\n\nThe p-values in the previous table use **Type I** sums of squares.\n\n> Type I sums of squares are \"sequential,\" meaning variables are tested in the order they are listed.\n\n. . .\n\n
\n\nSo, the p-value for `watershed:year_cat` is **conditional** on including `watershed` and `year_cat` as explanatory variables.\n\n. . .\n\n
\n\nIs that what we want????\n\n## Testing \"main effects\"\n\nIf there is evidence of an interaction, we **do not** test if the main effects are \"significant.\"\n\n. . .\n\n
\n\nWhy?\n\n. . .\n\n
\n\nThe interactions depend on these variables, so they should be included in the model!\n\n## Interpreting \"main effects\"\n\nWhen interaction effects are present, the interpretation of the main effects is incomplete or misleading\n\n::: center\n![](images/car-interaction.png)\n:::\n\n# Additive Two-way ANOVA\n\n## What if our analysis found no evidence of an interaction?\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](week10-day1_files/figure-revealjs/twa-plot-watershed-1.png){fig-align='center' width=960}\n:::\n:::\n\n\n## Testing for a relationship for each variable\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"false\"}\naov(stem_dry_mass ~ elevation + watershed, \n data = hbr_maples_small) %>% \n tidy()\n```\n:::\n\n\n
\n\n\n::: {.cell}\n::: {.cell-output-display}\n\n```{=html}\n
<table>
<thead>
<tr><th>term</th><th>df</th><th>sumsq</th><th>meansq</th><th>statistic</th><th>p.value</th></tr>
</thead>
<tbody>
<tr><td>elevation</td><td>1</td><td>0.0005563435</td><td>5.563435e-04</td><td>7.118603</td><td>8.479264e-03</td></tr>
<tr><td>watershed</td><td>1</td><td>0.0029220631</td><td>2.922063e-03</td><td>37.388783</td><td>8.206184e-09</td></tr>
<tr><td>Residuals</td><td>148</td><td>0.0115667133</td><td>7.815347e-05</td><td>NA</td><td>NA</td></tr>
</tbody>
</table>
\n```\n\n:::\n:::\n\n\n
\n\n. . .\n\n::: {style=\"text-align: center;\"}\nDo you think it matters which variable comes first?\n:::\n\n## Let's see...\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"false\"}\naov(stem_dry_mass ~ watershed + elevation, \n data = hbr_maples) %>% \n tidy()\n```\n:::\n\n\n
\n\n\n::: {.cell}\n::: {.cell-output-display}\n\n```{=html}\n
<table>
<thead>
<tr><th>term</th><th>df</th><th>sumsq</th><th>meansq</th><th>statistic</th><th>p.value</th></tr>
</thead>
<tbody>
<tr><td>watershed</td><td>1</td><td>0.0065062507</td><td>6.506251e-03</td><td>86.658504</td><td>9.073535e-18</td></tr>
<tr><td>elevation</td><td>1</td><td>0.0005821935</td><td>5.821935e-04</td><td>7.754392</td><td>5.791052e-03</td></tr>
<tr><td>Residuals</td><td>237</td><td>0.0177937692</td><td>7.507919e-05</td><td>NA</td><td>NA</td></tr>
</tbody>
</table>
\n```\n\n:::\n:::\n\n\n
\n\nDid we get the same p-values as before?\n\n## Sequential Versus Partial Sums of Squares\n\nSimilar to before, the p-values in the ANOVA table use Type I (sequential) sums of squares.\n\n::: incremental\n::: small\n- The p-value for each variable is conditional on the variable(s) that came *before* it.\n- The p-value for `elevation` is conditional on `watershed` being included in the model\n- The p-value for `watershed` is conditional on...nothing.\n:::\n:::\n\n. . .\n\n::: small\nIf we want the p-value for each explanatory variable to be conditional on **every** variable included in the model, then we need to use a different type of sums of squares!\n:::\n\n## Partial Sums of Squares\n\n> Type III sums of squares are \"partial,\" meaning every term in the model is tested in light of the other terms in the model.\n\n. . .\n\n::: small\n- The p-value for `elevation` is conditional on `watershed` being included in the model\n- The p-value for `watershed` is conditional on `elevation` being included in the model\n:::\n\n. . .\n\n::: callout-tip\n# Only different for variables that *were not* first\n\nWe could have used Type III sums of squares for the interaction model and would have gotten the same p-value!\n:::\n\n## Getting the Conditional Tests for Every Variable\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"false\"}\nlibrary(car)\n\nwater_elev_lm <- lm(stem_dry_mass ~ watershed + elevation, \n data = hbr_maples_small) \n\nAnova(water_elev_lm, type = \"III\")\n```\n:::\n\n\n::: callout-tip\n# Load in the `car` package!\n:::\n\n## Additive Model Hypothesis Tests\n\n\n::: {.cell}\n::: {.cell-output-display}\n\n```{=html}\n
<table>
<thead>
<tr><th>term</th><th>sumsq</th><th>df</th><th>statistic</th><th>p.value</th></tr>
</thead>
<tbody>
<tr><td>(Intercept)</td><td>0.0376934983</td><td>1</td><td>482.301032</td><td>2.023579e-48</td></tr>
<tr><td>watershed</td><td>0.0029220631</td><td>1</td><td>37.388783</td><td>8.206184e-09</td></tr>
<tr><td>elevation</td><td>0.0004471432</td><td>1</td><td>5.721348</td><td>1.801533e-02</td></tr>
<tr><td>Residuals</td><td>0.0115667133</td><td>148</td><td>NA</td><td>NA</td></tr>
</tbody>
</table>
\n```\n\n:::\n:::\n\n\n
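If you want the conditional p-value for a single term from the Type III table programmatically, here is one sketch (assuming the `car`, `broom`, and `dplyr` packages and the `hbr_maples_small` data from the previous slides):

```r
library(car)
library(broom)
library(dplyr)

water_elev_lm <- lm(stem_dry_mass ~ watershed + elevation, 
                    data = hbr_maples_small)

# Type III: each term is tested conditional on every other term in the model
Anova(water_elev_lm, type = "III") %>% 
  tidy() %>% 
  filter(term == "elevation") %>% 
  pull(p.value)
```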
\n\n::: {style=\"text-align: center;\"}\n**What do you think the `elevation` line is testing?**\n\n::: fragment\n**What would you decide?**\n:::\n:::\n\n## Keeping \"Non-significant\" Variables\n\n
\n\nShould you always remove variables with \"large\" p-values from an ANOVA?\n\n. . .\n\n
\n\nNo!\n\nEven \"non-significant\" variables explain some amount of the variation in the response, which makes your estimates of a treatment effect more precise!\n\n# Steps for Final Project\n\n## Hypothesis Test Steps\n\n::: columns\n::: {.column width=\"35%\"}\n::: {style=\"font-size: 1.5em;\"}\nStep 1: Fit a one-way ANOVA model for each categorical variable\n:::\n:::\n\n::: {.column width=\"5%\"}\n:::\n\n::: {.column width=\"60%\"}\n::: fragment\n::: {style=\"font-size: 1.5em; color: #0F4C81;\"}\nStep 2: Decide if each explanatory variable has a meaningful relationship with the response variable\n:::\n:::\n\n::: fragment\n- If yes, then go to Step 3!\n- If no, then report which variable (if any) has the strongest relationship with the response.\n:::\n:::\n:::\n\n## Step 3 -- Fit an Additive Two-way ANOVA\n\nIf there is evidence that **both** variables have a relationship with the response variable, then you fit an *additive* two-way ANOVA.\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"false\"}\nlibrary(car) \nlibrary(broom) \n\nmy_model <- lm( ~ + ,\n data = ) \n\nAnova(my_model, type = \"III\") %>% \n tidy()\n```\n:::\n\n\n::: callout-tip\n# Don't forget to load in the `car` and `broom` packages!\n:::\n\n## What about interaction models?\n\n

\n\nFor the sake of time, we **are not** fitting interaction models for the Final Project.\n\n# Do you always expect your main effects to be \"significant\" in a two-way ANOVA?\n\n# Work Session\n\n## Your Options\n\n1. Complete your revisions on Lab 8\n2. Complete your revisions on Statistical Critique 2\n3. Fit your two-way ANOVA model for your Final Project and interpret the results\n4. Finish any remaining revisions on lab or statistical critiques\n", "supporting": [ "week10-day1_files" ], diff --git a/_freeze/slides/week10-day1/figure-revealjs/twa-facet-maples-1.png b/_freeze/slides/week10-day1/figure-revealjs/twa-facet-maples-1.png index fd3bb0ba..535cc1e2 100644 Binary files a/_freeze/slides/week10-day1/figure-revealjs/twa-facet-maples-1.png and b/_freeze/slides/week10-day1/figure-revealjs/twa-facet-maples-1.png differ diff --git a/_freeze/slides/week10-day1/figure-revealjs/twa-plot-watershed-1.png b/_freeze/slides/week10-day1/figure-revealjs/twa-plot-watershed-1.png index eba1f567..01874df3 100644 Binary files a/_freeze/slides/week10-day1/figure-revealjs/twa-plot-watershed-1.png and b/_freeze/slides/week10-day1/figure-revealjs/twa-plot-watershed-1.png differ diff --git a/_freeze/slides/week10-day1/figure-revealjs/twa-plot-year-1.png b/_freeze/slides/week10-day1/figure-revealjs/twa-plot-year-1.png index 0cade7d4..cf68dea6 100644 Binary files a/_freeze/slides/week10-day1/figure-revealjs/twa-plot-year-1.png and b/_freeze/slides/week10-day1/figure-revealjs/twa-plot-year-1.png differ diff --git a/_freeze/slides/week10-day2/execute-results/html.json b/_freeze/slides/week10-day2/execute-results/html.json index a945c904..aebad3ed 100644 --- a/_freeze/slides/week10-day2/execute-results/html.json +++ b/_freeze/slides/week10-day2/execute-results/html.json @@ -2,7 +2,7 @@ "hash": "f18382bcad08daab565f19756dc90408", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"STAT 313 Last Day\"\ntitle-slide-attributes:\n data-background-image: images/sad_cat.jpeg\n 
data-background-size: cover\n data-background-opacity: \"0.5\"\nformat: \n revealjs:\n theme: style.scss\neditor: visual\n---\n\n\n\n\n# Deadlines\n\n. . .\n\n::: incremental\n- Lab 8 revisions due tonight\n- Statistical Critique revisions due tonight\n- All other final revisions are due by Sunday\n- Final Project is due by Sunday\n:::\n\n# Peer Review Session (45-minutes)\n\n# Structure of Final Projects\n\n## Findings\n\nThe results of each hypothesis test go **directly below** the test.\n\n
\n\n::: columns\n::: {.column width=\"47%\"}\n[**Theory-based Methods**]{.underline}\n\nYour decision & conclusion for your hypothesis test *go directly below your ANOVA table*.\n:::\n\n::: {.column width=\"3%\"}\n:::\n\n::: {.column width=\"47%\"}\n[**Simulation-based Methods**]{.underline}\n\nYour decision & conclusion for your hypothesis test *go directly below your permutation distribution and p-value*.\n:::\n:::\n\n## Hypothesis Test Conclusions\n\n> Conclusions should be written in terms of the alternative hypothesis\n\n
\n\n::: columns\n::: {.column width=\"45%\"}\n**Did you reject the null hypothesis?**\n\nThen you have evidence that at least one group has a different mean!\n:::\n\n::: {.column width=\"5%\"}\n:::\n\n::: {.column width=\"45%\"}\n**Did you fail to reject the null hypothesis?**\n\nThen you have [insufficient evidence](https://critical-inference.com/the-problem-with-no-evidence-and-is-it-enough-to-bust-a-myth/) that at least one group has a different mean!\n:::\n:::\n\n## Model Validity\n\n> In this section you discuss the **reliability** of the p-values you obtained based on the model conditions.\n\n::: small\n- Independence\n\n - *within* groups\n - *between* groups\n\n- Normality of the distributions for each group\n\n- Equal variance of the distributions for each group\n:::\n\n::: callout-caution\n# Conditions for Each Test\n\nEach one-way ANOVA test considers different groups. So, your conditions should be evaluated for each test separately.\n:::\n\n## [Conditions are never met!](https://critical-inference.com/assumptions-are-not-met-period/)\n\n$H_0$: the condition is met\n\n$H_A$: the condition is violated\n\n
\n\n. . .\n\nJust like we never say \"I accept the null hypothesis,\" we never say a condition is \"met.\" Instead, we say there is no evidence that the condition is violated.\n\n## Study Limitations\n\nThis section summarizes your understanding of the foundational aspects of experimental design.\n\n::: columns\n::: {.column width=\"40%\"}\n::: small\n> Based on the sampling method used, what larger population can you infer the results or your analysis onto?\n\n- What were the inclusion criteria of the observations?\n- How does that influence the population you can infer your findings onto?\n:::\n:::\n\n::: {.column width=\"3%\"}\n:::\n\n::: {.column width=\"55%\"}\n::: small\n> Based on the design of the study, what type of statements can be made about the relationship between the explanatory and response variables?\n\n- Were the explanatory variables randomly assigned to control for confounding variables?\n - How does that influence what you can and cannot say about the relationships between the variables?\n:::\n:::\n:::\n\n## Overall Conclusions\n\n::: small\n> Based on the results of your analysis what is your conclusion for the questions of interest? Connect your conclusion(s) to the relationships you saw in the visualizations you made and the results of your hypothesis tests.\n:::\n\n
\n\n. . .\n\n::: columns\n::: {.column width=\"30%\"}\n::: small\nDid your distributions look similar but your hypothesis test said at least one group was different?\n\nThink about how sample size affects p-values!\n:::\n:::\n\n::: {.column width=\"3%\"}\n:::\n\n::: {.column width=\"30%\"}\n::: small\n::: fragment\nDid you reject the null hypothesis for your one-way ANOVA?\n\nLook back at your visualizations -- which group(s) look the most different?\n:::\n:::\n:::\n\n::: {.column width=\"3%\"}\n:::\n\n::: {.column width=\"30%\"}\n::: small\n::: fragment\nDid you fail to reject the null hypothesis for your one-way ANOVA?\n\nLook back at your visualizations -- do all of the groups look the same?\n:::\n:::\n:::\n:::\n\n# Remedying Condition Violations\n\n## Do you have really skewed data?\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](week10-day2_files/figure-revealjs/channel--1.png){fig-align='center' width=960}\n:::\n:::\n\n\n## Try using a log transformation!\n\n::: columns\n::: {.column width=\"50%\"}\n**Un-transformed Variances**\n\n\n::: {.cell}\n::: {.cell-output-display}\n\n```{=html}\n
<table>
<thead>
<tr><th>unittype</th><th>var</th></tr>
</thead>
<tbody>
<tr><td>C</td><td>84.826493</td></tr>
<tr><td>I</td><td>80.783024</td></tr>
<tr><td>IP</td><td>5.729663</td></tr>
<tr><td>P</td><td>129.138547</td></tr>
<tr><td>R</td><td>59.096925</td></tr>
<tr><td>S</td><td>49.280157</td></tr>
<tr><td>SC</td><td>49.923399</td></tr>
<tr><td>NA</td><td>112.284569</td></tr>
</tbody>
</table>
\n```\n\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n::: fragment\n**Log Transformed Variances**\n\n\n::: {.cell}\n::: {.cell-output-display}\n\n```{=html}\n
<table>
<thead>
<tr><th>unittype</th><th>var</th></tr>
</thead>
<tbody>
<tr><td>C</td><td>1.6978424</td></tr>
<tr><td>I</td><td>0.8788463</td></tr>
<tr><td>IP</td><td>0.8990321</td></tr>
<tr><td>P</td><td>1.5659500</td></tr>
<tr><td>R</td><td>1.3621461</td></tr>
<tr><td>S</td><td>2.1770549</td></tr>
<tr><td>SC</td><td>1.6514917</td></tr>
<tr><td>NA</td><td>0.7591942</td></tr>
</tbody>
</table>
\n```\n\n:::\n:::\n\n:::\n:::\n:::\n\n. . .\n\n
\n\n
What do you think? Did it work?
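For reference, a grouped summary can produce both variance tables at once. This is a sketch only -- the data frame `channel_data` and the columns `unittype` and `response` are hypothetical stand-ins, since the original chunk is not shown:

```r
library(dplyr)

# Compare each group's variance on the raw and log scales
channel_data %>% 
  group_by(unittype) %>% 
  summarize(raw_var = var(response, na.rm = TRUE),
            log_var = var(log(response), na.rm = TRUE))
```

If the response contains zeros, a shifted log such as `log(response + 1)` may be needed before comparing variances.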
\n\n# Final Presentations\n\n## Presentation Structure\n\n> You will give a 3-minute presentation on **one** aspect of your final project you found the most interesting. Notice, you need to pick one aspect, since your presentation is so short.\n\n. . .\n\nHere are some examples of what you could choose:\n\n- The relationships you saw in the visualizations\n- The design of the study\n- The model you found best represents the relationships between variables you selected\n\n## Presentation Slides\n\nFor your presentation you are allowed to make **two** slides:\n\n1. A title slide (make it fun!) with your name\n2. A content slide\n\nYour slides **must be submitted as a PDF**.\n\n::: callout-warning\n# Deadline for slides\n\nSlides are due by [5pm the night before your final exam timeslot]{.underline}. If you do not submit slides by the deadline, you will not be allowed to present.\n:::\n\n# Some Closing Thoughts...\n\n## I hope you leave this class understanding...\n\n- Reproducibility is a foundational aspect to scientific research.\n\n- Data visualizations tell you a story, where statistical tests only tell you a summary.\n\n- Multiple regression and ANOVA are powerful tools to explore multivariate relationships.\n\n- A well thought out study is more powerful than any statistical analysis.\n\n## The Discipline of Statistics\n\nThe field of Statistics was developed to evaluate evidence obtained from data. Over the last century, the use of statistics has become embedded as a component of the scientific process for many disciplines.\n\n. . .\n\n::: {style=\"font-size: 0.75em;\"}\n> \"Significance, the new s-word, is overused and underdefined in the realm of connecting statistical results to the underlying science.\" [(Higgs, 2013)](https://www.americanscientist.org/article/do-we-really-need-the-s-word)\n:::\n\n. . 
.\n\n::: {style=\"font-size: 0.75em;\"}\n> \"I advocate a simple solution: Replace the s-word with words describing what you actually mean by it.\"\n:::\n\n## Foundational ideas taught in statistics courses were invented by:\n\n- Francis Galton\n- Karl Pearson\n- Ronald Fisher\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n## \n\n::: columns\n::: {.column width=\"40%\"}\n![](images/happy-cat.webp)\n:::\n\n::: {.column width=\"5%\"}\n:::\n\n::: {.column width=\"45%\"}\n::: {style=\"font-size: 2.75em; color: #000000;\"}\nRemember to give yourself praise!\n:::\n:::\n:::\n", + "markdown": "---\ntitle: \"STAT 313 Last Day\"\ntitle-slide-attributes:\n data-background-image: images/sad_cat.jpeg\n data-background-size: cover\n data-background-opacity: \"0.5\"\nformat: \n revealjs:\n theme: style.scss\neditor: visual\n---\n\n\n\n\n# Deadlines\n\n. . .\n\n::: incremental\n- Lab 8 revisions due tonight\n- Statistical Critique revisions due tonight\n- All other final revisions are due by Sunday\n- Final Project is due by Sunday\n:::\n\n# Peer Review Session (45-minutes)\n\n# Structure of Final Projects\n\n## Findings\n\nThe results of each hypothesis test go **directly below** the test.\n\n
\n\n::: columns\n::: {.column width=\"47%\"}\n[**Theory-based Methods**]{.underline}\n\nYour decision & conclusion for your hypothesis test *go directly below your ANOVA table*.\n:::\n\n::: {.column width=\"3%\"}\n:::\n\n::: {.column width=\"47%\"}\n[**Simulation-based Methods**]{.underline}\n\nYour decision & conclusion for your hypothesis test *go directly below your permutation distribution and p-value*.\n:::\n:::\n\n## Hypothesis Test Conclusions\n\n> Conclusions should be written in terms of the alternative hypothesis\n\n
\n\n::: columns\n::: {.column width=\"45%\"}\n**Did you reject the null hypothesis?**\n\nThen you have evidence that at least one group has a different mean!\n:::\n\n::: {.column width=\"5%\"}\n:::\n\n::: {.column width=\"45%\"}\n**Did you fail to reject the null hypothesis?**\n\nThen you have [insufficient evidence](https://critical-inference.com/the-problem-with-no-evidence-and-is-it-enough-to-bust-a-myth/) that at least one group has a different mean!\n:::\n:::\n\n## Model Validity\n\n> In this section you discuss the **reliability** of the p-values you obtained based on the model conditions.\n\n::: small\n- Independence\n\n - *within* groups\n - *between* groups\n\n- Normality of the distributions for each group\n\n- Equal variance of the distributions for each group\n:::\n\n::: callout-caution\n# Conditions for Each Test\n\nEach one-way ANOVA test considers different groups. So, your conditions should be evaluated for each test separately.\n:::\n\n## [Conditions are never met!](https://critical-inference.com/assumptions-are-not-met-period/)\n\n$H_0$: the condition is met\n\n$H_A$: the condition is violated\n\n
\n\n. . .\n\nJust like we never say \"I accept the null hypothesis,\" we never say a condition is \"met.\" Instead, we say there is no evidence that the condition is violated.\n\n## Study Limitations\n\nThis section summarizes your understanding of the foundational aspects of experimental design.\n\n::: columns\n::: {.column width=\"40%\"}\n::: small\n> Based on the sampling method used, what larger population can you infer the results or your analysis onto?\n\n- What were the inclusion criteria of the observations?\n- How does that influence the population you can infer your findings onto?\n:::\n:::\n\n::: {.column width=\"3%\"}\n:::\n\n::: {.column width=\"55%\"}\n::: small\n> Based on the design of the study, what type of statements can be made about the relationship between the explanatory and response variables?\n\n- Were the explanatory variables randomly assigned to control for confounding variables?\n - How does that influence what you can and cannot say about the relationships between the variables?\n:::\n:::\n:::\n\n## Overall Conclusions\n\n::: small\n> Based on the results of your analysis what is your conclusion for the questions of interest? Connect your conclusion(s) to the relationships you saw in the visualizations you made and the results of your hypothesis tests.\n:::\n\n
\n\n. . .\n\n::: columns\n::: {.column width=\"30%\"}\n::: small\nDid your distributions look similar but your hypothesis test said at least one group was different?\n\nThink about how sample size affects p-values!\n:::\n:::\n\n::: {.column width=\"3%\"}\n:::\n\n::: {.column width=\"30%\"}\n::: small\n::: fragment\nDid you reject the null hypothesis for your one-way ANOVA?\n\nLook back at your visualizations -- which group(s) look the most different?\n:::\n:::\n:::\n\n::: {.column width=\"3%\"}\n:::\n\n::: {.column width=\"30%\"}\n::: small\n::: fragment\nDid you fail to reject the null hypothesis for your one-way ANOVA?\n\nLook back at your visualizations -- do all of the groups look the same?\n:::\n:::\n:::\n:::\n\n# Remedying Condition Violations\n\n## Do you have really skewed data?\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](week10-day2_files/figure-revealjs/channel--1.png){fig-align='center' width=960}\n:::\n:::\n\n\n## Try using a log transformation!\n\n::: columns\n::: {.column width=\"50%\"}\n**Un-transformed Variances**\n\n\n::: {.cell}\n::: {.cell-output-display}\n\n```{=html}\n
<table>
<thead>
<tr><th>unittype</th><th>var</th></tr>
</thead>
<tbody>
<tr><td>C</td><td>84.826493</td></tr>
<tr><td>I</td><td>80.783024</td></tr>
<tr><td>IP</td><td>5.729663</td></tr>
<tr><td>P</td><td>129.138547</td></tr>
<tr><td>R</td><td>59.096925</td></tr>
<tr><td>S</td><td>49.280157</td></tr>
<tr><td>SC</td><td>49.923399</td></tr>
<tr><td>NA</td><td>112.284569</td></tr>
</tbody>
</table>
\n```\n\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n::: fragment\n**Log Transformed Variances**\n\n\n::: {.cell}\n::: {.cell-output-display}\n\n```{=html}\n
<table>
<thead>
<tr><th>unittype</th><th>var</th></tr>
</thead>
<tbody>
<tr><td>C</td><td>1.6978424</td></tr>
<tr><td>I</td><td>0.8788463</td></tr>
<tr><td>IP</td><td>0.8990321</td></tr>
<tr><td>P</td><td>1.5659500</td></tr>
<tr><td>R</td><td>1.3621461</td></tr>
<tr><td>S</td><td>2.1770549</td></tr>
<tr><td>SC</td><td>1.6514917</td></tr>
<tr><td>NA</td><td>0.7591942</td></tr>
</tbody>
</table>
\n```\n\n:::\n:::\n\n:::\n:::\n:::\n\n. . .\n\n
\n\n
What do you think? Did it work?
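For reference, a grouped summary can produce both variance tables at once. This is a sketch only -- the data frame `channel_data` and the columns `unittype` and `response` are hypothetical stand-ins, since the original chunk is not shown:

```r
library(dplyr)

# Compare each group's variance on the raw and log scales
channel_data %>% 
  group_by(unittype) %>% 
  summarize(raw_var = var(response, na.rm = TRUE),
            log_var = var(log(response), na.rm = TRUE))
```

If the response contains zeros, a shifted log such as `log(response + 1)` may be needed before comparing variances.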
\n\n# Final Presentations\n\n## Presentation Structure\n\n> You will give a 3-minute presentation on **one** aspect of your final project you found the most interesting. Notice, you need to pick one aspect, since your presentation is so short.\n\n. . .\n\nHere are some examples of what you could choose:\n\n- The relationships you saw in the visualizations\n- The design of the study\n- The model you found best represents the relationships between variables you selected\n\n## Presentation Slides\n\nFor your presentation you are allowed to make **two** slides:\n\n1. A title slide (make it fun!) with your name\n2. A content slide\n\nYour slides **must be submitted as a PDF**.\n\n::: callout-warning\n# Deadline for slides\n\nSlides are due by [5pm the night before your final exam timeslot]{.underline}. If you do not submit slides by the deadline, you will not be allowed to present.\n:::\n\n# Some Closing Thoughts...\n\n## I hope you leave this class understanding...\n\n- Reproducibility is a foundational aspect to scientific research.\n\n- Data visualizations tell you a story, where statistical tests only tell you a summary.\n\n- Multiple regression and ANOVA are powerful tools to explore multivariate relationships.\n\n- A well thought out study is more powerful than any statistical analysis.\n\n## The Discipline of Statistics\n\nThe field of Statistics was developed to evaluate evidence obtained from data. Over the last century, the use of statistics has become embedded as a component of the scientific process for many disciplines.\n\n. . .\n\n::: {style=\"font-size: 0.75em;\"}\n> \"Significance, the new s-word, is overused and underdefined in the realm of connecting statistical results to the underlying science.\" [(Higgs, 2013)](https://www.americanscientist.org/article/do-we-really-need-the-s-word)\n:::\n\n. . 
.\n\n::: {style=\"font-size: 0.75em;\"}\n> \"I advocate a simple solution: Replace the s-word with words describing what you actually mean by it.\"\n:::\n\n## Foundational ideas taught in statistics courses were invented by:\n\n- Francis Galton\n- Karl Pearson\n- Ronald Fisher\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n## \n\n::: columns\n::: {.column width=\"40%\"}\n![](images/happy-cat.webp)\n:::\n\n::: {.column width=\"5%\"}\n:::\n\n::: {.column width=\"45%\"}\n::: {style=\"font-size: 2.75em; color: #000000;\"}\nRemember to give yourself praise!\n:::\n:::\n:::\n", "supporting": [ "week10-day2_files" ], diff --git a/_freeze/slides/week10-day2/figure-revealjs/channel--1.png b/_freeze/slides/week10-day2/figure-revealjs/channel--1.png index 1159fbcf..9343117d 100644 Binary files a/_freeze/slides/week10-day2/figure-revealjs/channel--1.png and b/_freeze/slides/week10-day2/figure-revealjs/channel--1.png differ diff --git a/critique/critique-1.qmd b/critique/critique-1.qmd index 4d5af474..217c7c1f 100644 --- a/critique/critique-1.qmd +++ b/critique/critique-1.qmd @@ -1,6 +1,5 @@ --- title: "Statistical Critique 1" -subtitle: "Due February 5 by 5pm" format: html: table-of-contents: true @@ -100,22 +99,22 @@ Your critique of the visualization you selected needs to address the following q ::: callout-tip # If you are referencing a table -Similar to a visualization, the aesthetics of a table are variables being mapped to aspects of the table. Below is a table from Coyne et al. [-@coyne2020]. I like to think of the rows and columns of a table as similar to the x- and y-axis of a visualization. +Similar to a visualization, the aesthetics of a table are variables being mapped to aspects of the table. Below is a table from Coyne et al. [-@coyne2020]. I like to think of the rows and columns of a table as similar to the x- and y-axis of a visualization. 
![](images/table-example.png){fig-alt="An image of a table from a scientific journal."} -- I start by noticing that the "study variables" are mapped to the rows (e.g., Social Network, Depressive Sym., Anxiety). -- Then, I notice that the columns are associated with different values of Age. -- Finally, I notice tha there are actually *two* rows per study variable, one associated with the mean and one associated with the standard deviation. +- I start by noticing that the "study variables" are mapped to the rows (e.g., Social Network, Depressive Sym., Anxiety). +- Then, I notice that the columns are associated with different values of Age. +- Finally, I notice tha there are actually *two* rows per study variable, one associated with the mean and one associated with the standard deviation. If I were to sketch out how this table would translate into a visualization, I would imagine the x-axis would be Age, the y-axis would be the value of the variable, there would be three facets (one per study variable), and there would be two types of points (one for the mean, one for the standard deviation). Here is a rough sketch of my mental image: ![](images/table-to-plot.jpg){fig-alt="A drawing of how I would convert the table into a visualization. There are two plots, one with a title 'Social Network' and one with a title 'Anxiety', representing two of the variables the study researched. Each plot has an x-axis with different values of age (ranging from 13 to 20), and a y-axis ranging from 0 to 5. For each age, there are two points plotted. One point is plotted as a red triangle, representing the mean value for a specific age. A second point is plotted as a green circle, representing the standard devation for a specific age."} -In this plot, there are four aesthetics: +In this plot, there are four aesthetics: -1. the age (included on the x-axis) -2. the study variable (included as facets) -3. the statistic measured (included as a color) -4. 
the value of the statistic (included on the y-axis) +1. the age (included on the x-axis) +2. the study variable (included as facets) +3. the statistic measured (included as a color) +4. the value of the statistic (included on the y-axis) ::: diff --git a/critique/critique-2.qmd b/critique/critique-2.qmd index 6a7d650f..007cd349 100644 --- a/critique/critique-2.qmd +++ b/critique/critique-2.qmd @@ -1,6 +1,5 @@ --- title: "Statistical Critique 2: Exploring p-values" -subtitle: "Due March 4, 2024 by 5pm" format: html: table-of-contents: true @@ -99,11 +98,12 @@ In March of 2019, Valentin Amrhein, Sander Greenland, Blake McShane and more tha -For Part Three, you are going to inspect what the publication requirements are for journal the article you selected (in Week 1) was published in. +For Part Three, you are going to inspect what the publication requirements are for journal the article you selected (in Week 1) was published in. -:::{.callout-tip} +::: callout-tip # Statistics in Your Field -You are revisiting (again) the article you chose in Week 1 for the "Statistics in your Field" assignment! + +You are revisiting (again) the article you chose in Week 1 for the "Statistics in your Field" assignment! ::: First, go to the website for the journal where your article was published. Now, find their criteria for publication. If you are having a difficult time finding these criteria, it may be simpler to Google "*title of journal* publication criteria," substituting the name of your journal. diff --git a/docs/LICENSE.html b/docs/LICENSE.html index 41b3d1db..ea1796db 100644 --- a/docs/LICENSE.html +++ b/docs/LICENSE.html @@ -2,7 +2,7 @@ - + @@ -91,7 +91,7 @@