# PSYC 111A Final Exam - Winter 2021

In [0]:
# Run this code to load the R Packages
suppressMessages(suppressWarnings(suppressPackageStartupMessages({
  library(mosaic)
  library(supernova)
  library(Lock5withR)
  library(gridExtra)
  library(magrittr)
  library(tidyverse)
  library(testthat)})))

In this final exam, you will consider a real research study from beginning to end. The first section will focus on the design and methodological choices about how the study was conducted. The second section will focus on cleaning and analyzing the data from this study. Here is a brief summary of the background and methods of the study: 

*Previous evidence suggests that mindfulness training may improve aspects of psychosocial well-being. Whilst mindfulness is traditionally taught in person, consumers are increasingly turning to mindfulness-based smartphone apps as an alternative delivery medium for training. Despite this growing trend, few studies have explored whether mindfulness delivered via a smartphone app can enhance psychosocial well-being within the general public. The present pilot randomized controlled trial compared the impact of engaging with the self-guided mindfulness meditation (MM) app ‘Headspace’ for a period of 10 or 30 days, to a waitlist (WL) control, using a cohort of adults from the general population. The Satisfaction with Life Scale, Perceived Stress Scale, and Wagnild Resilience Scale were administered online at baseline and after 10 and 30 days of the intervention. (Champion, Economides, & Chandler, 2018)*


# Part 1 - Design and Methodology  

### (37 points total)

Some of the following questions are multiple choice. In these instances, you will be asked to save your answer to a specific variable. When saving your answer, please remember to use quotes around the letter and make it lowercase (ex. "a"). There are visible test functions you can use to verify whether the answer is properly formatted (**NOTE: this will <u>NOT</u> tell you if the answer is correct**).

The other questions are open response. For these, please answer in **<u>150 words max</u>**. Please be clear and concise in your answers.

1.	In your own words, what is the broad question that this study is investigating? What is the specific question that will be tested with these methods? **(3 points)**

> The broad question this study is investigating is the impact of mindfulness training on psychosocial well-being. The specific question that will be tested is whether self-guided mindfulness training via the app 'Headspace' enhances psychosocial well-being.

2.	What kind of research design is being used? Save your answer to `mc.2`. **(1 point)**

    a.	Experimental
    
    b.	Quasi-experimental
    
    c.	Correlational
    
    d.	Descriptive

In [0]:
mc.2 <- "a"

In [0]:
if (test_that("testMC2.1", {
    expect_true(mc.2 %in% c("a", "b", "c", "d"))
})) {
    print("Your answer is formatted properly!")
} else {
    stop("mc.2 should be one of a, b, c, d")
}


3.	The main independent variable in this design is the mindfulness intervention, in which people either engage in self-guided meditation (MM) or a waitlist control (WL) that gets no intervention. How many levels does the independent variable have, and was it manipulated within or between subjects? Save your answer to `mc.3` **(1 point)**

    a. 2 levels, within subjects

    b. 3 levels, within subjects

    c. 2 levels, between subjects

    d. 3 levels, between subjects

In [0]:
mc.3 <- "c"

In [0]:
if (test_that("testMC3.1", {
    expect_true(mc.3 %in% c("a", "b", "c", "d"))
})) {
    print("Your answer is formatted properly!")
} else {
    stop("mc.3 should be one of a, b, c, d")
}


4.	Another independent variable is the timepoint at which participants are measured. All participants (in both the MM and WL conditions) completed measures of well-being at baseline (the day they began the study), 10 days later, and 30 days later. How many levels does this independent variable have, and was it manipulated within or between subjects? Save your answer to `mc.4`. **(1 point)**

    a.	2 levels, within subjects
    
    b.	3 levels, within subjects
    
    c.	2 levels, between subjects
    
    d.	3 levels, between subjects


In [0]:
mc.4 <- "b"

In [0]:
if (test_that("testMC4.1", {
    expect_true(mc.4 %in% c("a", "b", "c", "d"))
})) {
    print("Your answer is formatted properly!")
} else {
    stop("mc.4 should be one of a, b, c, d")
}


5.	There were 3 main dependent variables measured in this study. One of them, the Satisfaction with Life Scale, is described in the paper like this: “The Satisfaction with Life Scale (SWLS) is a 5-item scale that assesses global satisfaction (e.g. ‘so far I have gotten the important things I want in my life’). Individuals indicate their degree of agreement or disagreement with the items using a 7-point Likert scale (1 = strongly disagree to 7 = strongly agree). The scale is shown to have strong psychometric properties including internal consistency (α = .87) and a high test-retest reliability (α = .82).” Explain what the authors mean when they say this scale has strong internal consistency and test-retest reliability. **(3 points)**

> Internal consistency means that there is a high correlation between the 5 items on the scale on how well they measure the same construct. For example, a participant who indicated high satisfaction with life according to one item would indicate the same when rating agreement with another item. Test-retest reliability refers to the strong correlation between scores across time. Since this scale has strong test-retest reliability, a participant who completes the test twice will score very similarly.
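
As a rough illustration of these two reliability ideas, the sketch below simulates responses to a hypothetical 5-item, 7-point scale (the data are made up and are not from the study). It computes Cronbach's alpha from its standard formula and correlates total scores from two simulated administrations. The helper names `simulate_items` and `cronbach_alpha` are assumptions chosen for this example.

```r
# Simulated (not real) data: 200 people answering 5 items on a 1-7 scale
set.seed(111)
n_people <- 200
trait <- rnorm(n_people)  # each person's underlying level of satisfaction

# Hypothetical helper: each item is the shared trait plus item-specific noise
simulate_items <- function(trait, n_items = 5) {
  sapply(seq_len(n_items), function(j) {
    raw <- 4 + 1.2 * trait + rnorm(length(trait))
    pmin(pmax(round(raw), 1), 7)  # keep responses on the 1-7 scale
  })
}
items_t1 <- simulate_items(trait)  # first administration
items_t2 <- simulate_items(trait)  # second administration, same people

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)
cronbach_alpha <- function(x) {
  k <- ncol(x)
  (k / (k - 1)) * (1 - sum(apply(x, 2, var)) / var(rowSums(x)))
}

alpha_t1 <- cronbach_alpha(items_t1)                   # internal consistency
retest_r <- cor(rowSums(items_t1), rowSums(items_t2))  # test-retest correlation
round(c(alpha = alpha_t1, test_retest = retest_r), 2)
```

Both values come out high here only because every simulated item depends on the same underlying trait, which is the idea behind both kinds of reliability.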

6. What type and scale of data will the Satisfaction With Life Scale (SWLS) produce? Save your answer to `mc.6` **(1 point)**

    a. Self-report, ratio

    b. Self-report, ordinal

    c. Behavioral, ratio

    d. Behavioral, ordinal

In [0]:
mc.6 <- "b"

In [0]:
if (test_that("testMC6.1", {
    expect_true(mc.6 %in% c("a", "b", "c", "d"))
})) {
    print("Your answer is formatted properly!")
} else {
    stop("mc.6 should be one of a, b, c, d")
}


7.	In your own words, explain why there is some controversy over whether it is appropriate to treat Likert scale data as continuous. Does it ever make sense to model Likert data with a mean? Why or why not? **(3 points)** 

> There is controversy over whether it is appropriate to treat Likert scale data as continuous, because a Likert scale technically consists of ordered categories. It does not make sense to model Likert data with a mean, because the mode could be used to figure out the most common answer. Assigning values to scale points and finding the mean could result in a value that does not correspond to a specific Likert scale point. Also, intervals between scale values cannot always be considered equal.
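
To make the concern concrete, here is a small sketch with made-up responses to a single 7-point item. It only illustrates how the mean, median, and mode can disagree; it is not data from the study.

```r
# Hypothetical responses to one 7-point Likert item
# (1 = strongly disagree, 7 = strongly agree)
responses <- c(1, 2, 2, 6, 7, 7, 7)

item_mean <- mean(responses)      # 32 / 7, about 4.57: not a point on the scale
item_median <- median(responses)  # 6: an actual scale point
item_counts <- table(responses)   # frequency of each response option

# The mode is the response option with the highest frequency
item_mode <- as.numeric(names(item_counts)[which.max(item_counts)])

c(mean = item_mean, median = item_median, mode = item_mode)
```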

8.	Another dependent variable in this study, the Perceived Stress Scale (PSS), is described like this: “The Perceived Stress Scale (PSS) measures the degree to which situations in an individual’s life might be interpreted as stressful. The 10-item scale aims to establish how unpredictable, uncontrollable and overloaded respondents’ find their lives with respect to the previous month. Respondents answer questions about their feelings and thoughts using a 5-point Likert scale (0 = never to 4 = very often). Scores range from 0–40 with higher scores indicating more stress. The scale has high test-retest reliability (α = .85) and its validity with other measures ranges from .52-.76.” What type of validity are the authors referring to when they say, “validity with other measures ranges from .52-.76”? Save your answer to `mc.8` **(1 point)**

    a.	Internal validity

    b.	External validity

    c.	Convergent validity

    d.	Divergent validity

In [0]:
mc.8 <- "c"

In [0]:
if (test_that("testMC8.1", {
    expect_true(mc.8 %in% c("a", "b", "c", "d"))
})) {
    print("Your answer is formatted properly!")
} else {
    stop("mc.8 should be one of a, b, c, d")
}


9.	What part of the description in the previous question is the conceptual definition of perceived stress? What part is the operational definition? **(3 points)**

> The conceptual definition of perceived stress is "how unpredictable, uncontrollable and overloaded respondents' find their lives." The operational definition is a participant's score on the Perceived Stress Scale.

10.	To check their manipulation, the researchers also measured participants’ engagement with the Headspace app over the 30-day course of the study. Participants were asked to report the number of days they had used the application. What type and scale of data did the researchers get from this measure of engagement? Save your answer to `mc.10` **(1 point)**

    a.	Self-report, ratio

    b.	Self-report, interval

    c.	Behavioral, ratio

    d.	Behavioral, interval

In [0]:
mc.10 <- "a"

In [0]:
if (test_that("testMC10.1", {
    expect_true(mc.10 %in% c("a", "b", "c", "d"))
})) {
    print("Your answer is formatted properly!")
} else {
    stop("mc.10 should be one of a, b, c, d")
}


11.	The authors recruited their sample by sending a mass email to a cohort of employees at a particular company in the UK, inviting them to volunteer to take part in a study on mindfulness and well-being. This email also encouraged all recipients to identify and suggest other people they knew who would be interested in participating. What type of sampling method best describes this procedure? Save your answer to `mc.11` **(1 point)**

    a.	Snowball

    b.	Simple random

    c.	Systematic

    d.	Convenience

In [0]:
mc.11 <- "a"

In [0]:
if (test_that("testMC11.1", {
    expect_true(mc.11 %in% c("a", "b", "c", "d"))
})) {
    print("Your answer is formatted properly!")
} else {
    stop("mc.11 should be one of a, b, c, d")
}


12.	What is the sampling frame in this study? Save your answer to `mc.12` **(1 point)**

    a.	All adults in the UK
    
    b.	All employees of the company
    
    c.	The cohort of employees at the company
    
    d.	The cohort of employees and their social contacts

In [0]:
mc.12 <- "d"

In [0]:
if (test_that("testMC12.1", {
    expect_true(mc.12 %in% c("a", "b", "c", "d"))
})) {
    print("Your answer is formatted properly!")
} else {
    stop("mc.12 should be one of a, b, c, d")
}


13.	Do you think the authors’ sample is likely to be biased? Why or why not? **(3 points)**

> Since this study uses non-probability sampling, meaning that not all adults from the general population have an equal probability of being selected, this sample is likely to be biased. The cohort of employees at the UK company is not representative of all adults from the general population. Also, since people are invited to participate and can choose not to, this could result in self-selection bias. For example, people might be more willing to participate if they already enjoy using Headspace or are interested in mindfulness practices, leading to biased results that cannot be generalized.

14.	A biased sample mostly presents a threat to which of the following? Save your answer to `mc.14` **(1 point)**

    a.	Internal validity
    
    b.	External validity
    
    c.	Measurement validity
    
    d.	Measurement reliability

In [0]:
mc.14 <- "b"

In [0]:
if (test_that("testMC14.1", {
    expect_true(mc.14 %in% c("a", "b", "c", "d"))
})) {
    print("Your answer is formatted properly!")
} else {
    stop("mc.14 should be one of a, b, c, d")
}


15.	When screening participants for the study, the researchers excluded anyone who had a history of, presence, or ongoing treatment for a psychological disorder, and anyone above a pre-determined threshold on a general health questionnaire. Their reasoning was that they wanted to exclude participants who are likely to be experiencing high levels of distress. Doing this helps to mitigate ___ in their measurement of well-being. Save your answer to `mc.15` **(1 point)**

    a.	Range effects

    b.	Experimenter bias

    c.	Demand characteristics

    d.	Socially desirable responding

In [0]:
mc.15 <- "a"

In [0]:
if (test_that("testMC15.1", {
    expect_true(mc.15 %in% c("a", "b", "c", "d"))
})) {
    print("Your answer is formatted properly!")
} else {
    stop("mc.15 should be one of a, b, c, d")
}


16.	Beyond the exclusion criteria above, participants met inclusion criteria if they were over 18 and had access to a smartphone. Given the information about how the sample was recruited and how volunteers were excluded, what is the best description of the population the researchers are interested in studying? Save your answer to `mc.16` **(1 point)**

    a.	All adults in the UK

    b.	All adults with smartphone access

    c.	All adults without clinically-diagnosed mental distress or major life stressors

    d.	All adults with smartphone access and without clinically-diagnosed mental distress or major life stressors

In [0]:
mc.16 <- "d"

In [0]:
if (test_that("testMC16.1", {
    expect_true(mc.16 %in% c("a", "b", "c", "d"))
})) {
    print("Your answer is formatted properly!")
} else {
    stop("mc.16 should be one of a, b, c, d")
}


17.	The authors report that they ultimately recruited 74 participants who met all inclusion and exclusion criteria, but that only 62 of them successfully completed the intervention and measurements at all 3 time points. They note that the rate of dropout was higher for the MM group (23.7%) than the WL group (8.33%), but this was not a statistically significant difference. What kind of confound could this produce in their design? Save your answer to `mc.17` **(1 point)**

    a.	Order effects

    b.	Differential attrition

    c.	Assignment bias

    d.	Sampling bias

In [0]:
mc.17 <- "b"

In [0]:
if (test_that("testMC17.1", {
    expect_true(mc.17 %in% c("a", "b", "c", "d"))
})) {
    print("Your answer is formatted properly!")
} else {
    stop("mc.17 should be one of a, b, c, d")
}


18.	In your own words, explain what it would mean if the results of this study were confounded by the dropout rate. What consequence would this have on the conclusions that could be drawn from the data? **(3 points)**

> If the results of this study were confounded by the dropout rate, the groups that remain at day 30 would differ in more than just the intervention, which would threaten the validity of the results. For example, it could result in a Type II error. Since more participants dropped out of the MM group, and those who left may differ systematically from those who stayed, the researchers might conclude that there was no significant effect of Headspace mindfulness training when there actually is one.

19.	In this study, participants were randomly assigned to the MM or WL conditions. This type of control measure is most likely to eliminate ___ confounds in their design. Save your answer to `mc.19` **(1 point)**

    a.	Progressive error

    b.	Carryover effect

    c.	Individual differences

    d.	Order effect

In [0]:
mc.19 <- "c"

In [0]:
if (test_that("testMC19.1", {
    expect_true(mc.19 %in% c("a", "b", "c", "d"))
})) {
    print("Your answer is formatted properly!")
} else {
    stop("mc.19 should be one of a, b, c, d")
}


20.	The authors reported that the random assignment procedure resulted in two groups with very similar mean ages, but with unbalanced proportions of female participants (more females on the waitlist than in the intervention). Describe a different control strategy the authors could have used to reduce the chances of gender confounding their design. **(3 points)**

> The authors could have used a matched or blocked design to balance the proportions of males and females across conditions. After recruiting their sample, the authors could have first split participants by gender and then randomly assigned them to a condition within each gender. This would decrease the chance of gender confounding the design.
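
Splitting participants by gender before randomly assigning them can be sketched as follows. The roster is hypothetical (40 made-up volunteers with an uneven sex split), and the loop is only one simple way to randomize within blocks, not the procedure used in the study.

```r
set.seed(2021)

# Hypothetical roster: 26 female and 14 male volunteers
roster <- data.frame(
  id = 1:40,
  Sex = rep(c("female", "male"), times = c(26, 14))
)

# Randomize separately within each sex so both conditions
# receive the same share of each sex
roster$Group <- NA_character_
for (s in unique(roster$Sex)) {
  rows <- which(roster$Sex == s)
  balanced_labels <- rep(c("app", "waitlist"), length.out = length(rows))
  roster$Group[rows] <- sample(balanced_labels)  # shuffle labels within the block
}

assignment_table <- table(roster$Sex, roster$Group)
assignment_table
```

Who ends up in which condition is still random, but the sex split is balanced by construction rather than left to chance.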

21.	What the authors of this study ultimately want to know is whether engaging with the Headspace app for 30 days can improve mental well-being (as measured by the scales described above). Based on what we have described in these questions, will they be able to draw that kind of causal conclusion? Why or why not? **(3 points)**

> The authors will not be able to draw a causal conclusion from their study. First, because of attrition, some subgroups in the MM group will be underrepresented in the final data. This attrition has the potential to be differential, meaning the final data could be biased, and this would compromise the generalizability and external validity of the findings. Also, if gender is an explanatory variable for the effect of self-guided mindfulness, then the unbalanced proportions of female participants could bias the results. Since gender and attrition have the potential to confound the results, a causal conclusion cannot be drawn.

# Part 2 - Cleaning and Analyzing

### (63 points total)

In the following sections you will clean and analyze the data from the same study you were evaluating in Part 1. There are 10 subsections to this part, addressing all of the coding skills and statistical concepts we have covered this quarter.

In the questions, if we ask you to save the answer to a `variable` or `object`, it usually means we want you to save the answer to the **`variable` on its own**. If we ask you to save the answer to a `$variable`, it usually means we want you to save the answer to **a column named *variable*** in the dataframe.

In [0]:
#Run this code to create the dataframe where your individualized data will be input
suppressMessages(suppressWarnings({
    mindfulness = read_csv("https://www.ethanhurwitz.com/datasets/mindfulness_study.csv")
    }))

Save your AD/SSID login (the part of your UCSD email address before the @ucsd.edu) to a variable called `ADlogin` -- This will be used to load your specific data. 

**NOTE**: Each student has a unique dataset to work with. All the autograding will test for the answers for *your specific dataset*, so please make sure you enter this correctly and run the cells right below! (there will be no output, just save `ADlogin` then run them)

In [0]:
ADlogin <- "assingh"

The data frame `mindfulness` contains data from a study comparing the effects of an app-based mindfulness intervention and a waitlist control group. This dataset contains the following variables: 
- `Group (A = app, B = waitlist)` - The group participants were randomized to
- `Age` - Participant age
- `Sex (1 = male, 2 = female)` - Participant's biological sex
- `SWLS baseline` - Satisfaction with Life Scale scores at baseline
- `SWLS day 10` - Satisfaction with Life Scale scores after 10 days of intervention
- `SWLS day 30` - Satisfaction with Life Scale scores after 30 days of intervention
- `PSS baseline` - Perceived Stress Scale scores at baseline
- `PSS day 10` - Perceived Stress Scale scores after 10 days of intervention
- `PSS day 30` - Perceived Stress Scale scores after 30 days of intervention
- `WRS baseline` - Wagnild Resilience Scale scores at baseline
- `WRS day 10` - Wagnild Resilience Scale scores after 10 days of intervention
- `WRS day 30` - Wagnild Resilience Scale scores after 30 days of intervention
- `GHQ` - General Health Questionnaire scores (higher total scores indicate greater distress)
- `daysUsingApp` - Number of days App was used during the intervention period

## 1.0 - Cleaning and Prepping the Data

As is often the case, the data we have is not immediately ready to be worked with. It needs a bit of cleaning and manipulation. Let's outline a few things that should be addressed:

- We want to change the variable names so they are both more informative and easier to work with.
- We want to recode several of the variables.
- We want to compute some convenience variables.

***NOTE:*** **Please save all the changes to the dataframe `mindfulness` instead of creating a new dataframe**. But before making any changes to your data, you might want to create a copy into a new dataframe called `mindfulness_original`. We'll still use `mindfulness` as the main dataframe to work with, but keep this new one as a backup of your original data just in case!

1.1 - Let's start with changing our variable names. For consistency in grading and comparing, let's all use the same naming convention here. Make the following changes: **(0.5 points)**

   - Change `Group (A = app, B = waitlist)` to `Group`

In [0]:
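# Plain assignment keeps the backup identical to the original;
# data.frame() would rewrite column names that contain spaces or parentheses
mindfulness_original <- mindfulness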

mindfulness <- mindfulness %>%
  rename("Group" = `Group (A = app, B = waitlist)`)


   - Change `Sex (1 = male, 2 = female)` to `Sex` **(0.5 points)**

In [0]:
mindfulness <- mindfulness %>%
  rename("Sex" = `Sex (1 = male, 2 = female)`)



   - For all measures taken at day 10 and day 30, remove the word "day". *(HINT: Be mindful of the hanging space after the word "day")* (E.g. `SWLS day 10` after renaming should be `SWLS 10`) **(0.5 points)**

In [0]:
colnames(mindfulness) <- gsub("day ","",colnames(mindfulness))


   - For all SWLS, PSS and WRS measures, replace the space with an underscore. **(0.5 points)**

In [0]:
colnames(mindfulness) <- gsub(x = colnames(mindfulness), pattern = " ", replacement = "_")  
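
The two `gsub()` renaming steps above can be checked on a plain character vector before touching the real dataframe. The names below are a hypothetical subset of the columns as they would look after `Group` and `Sex` have been renamed.

```r
# Hypothetical subset of column names after Group and Sex have been renamed
old_names <- c("Group", "Sex", "SWLS baseline", "SWLS day 10",
               "WRS day 30", "daysUsingApp")

# Step 1: drop the word "day" together with its trailing space
step_1 <- gsub("day ", "", old_names, fixed = TRUE)

# Step 2: replace every remaining space with an underscore
new_names <- gsub(" ", "_", step_1, fixed = TRUE)

new_names
```

Because the pattern is `"day "` with the trailing space, `daysUsingApp` is left alone: in that name "day" is followed by "s", not by a space.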


1.2 - Next, let's recode some variables.

- Change `Group` to a factor with values based on the coding convention specified in the original variable name. **(0.5 points)**

In [0]:
mindfulness$Group <- as.factor(mindfulness$Group)

mindfulness <- mindfulness %>%
mutate(Group = recode(Group, "A" = "app", "B" = "waitlist")) 


- Change `Sex` to a factor with values based on the coding convention specified in the original variable name. **(0.5 points)**

In [0]:
mindfulness$Sex <- as.factor(mindfulness$Sex)

mindfulness <- mindfulness %>%
mutate(Sex = recode(Sex, "1" = "male", "2" = "female")) 
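
An alternative sketch of the same recoding uses base R's `factor()`, which sets the levels and their labels in one step. The vectors below are made-up stand-ins for the `Group` and `Sex` columns.

```r
# Hypothetical raw codes, in the same format as the original columns
group_codes <- c("A", "B", "B", "A")
sex_codes <- c(1, 2, 2, 1)

# levels = the values found in the data; labels = the names they should get
group_factor <- factor(group_codes, levels = c("A", "B"),
                       labels = c("app", "waitlist"))
sex_factor <- factor(sex_codes, levels = c(1, 2),
                     labels = c("male", "female"))

# Cross-tabulate to confirm that the codes map to the intended labels
table(group_factor, sex_factor)
```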


1.3 - Finally, let's compute some convenience variables. We are interested in the effects of some intervention. So we want to compare some scores from before the intervention to scores from after it. For our 3 main measures, let's see how scores changed from baseline to the final assessment (at 30 days). 

Specifically, add 3 new variables to `mindfulness` called, `$SWLS_Dif`, `$PSS_Dif`, and `$RWS_Dif` that show these differences. **(1.5 points)**

In [0]:
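mindfulness <- mindfulness %>%
  mutate(SWLS_Dif = SWLS_30 - SWLS_baseline,
         PSS_Dif = PSS_30 - PSS_baseline,
         # The resilience columns are named WRS_* after renaming;
         # the requested name for the new variable is RWS_Dif
         RWS_Dif = WRS_30 - WRS_baseline)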


1.4 Are there any negative values in these variables? If so, is that a problem? What would negative values here mean? **(1 point)**

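> There are negative values in these variables. This is not a problem; a negative value just means that the participant's score was lower at day 30 than at baseline. For `SWLS_Dif` and `RWS_Dif`, that indicates decreased life satisfaction or resilience, whereas for `PSS_Dif` it indicates decreased perceived stress, which is an improvement in well-being.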
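
The sign of a difference score is easiest to read on a tiny example. The three participants below are made up; the column names mirror the renamed study variables.

```r
# Three hypothetical participants (not study data)
toy <- data.frame(
  SWLS_baseline = c(20, 25, 30), SWLS_30 = c(24, 25, 27),
  PSS_baseline  = c(18, 15, 12), PSS_30  = c(12, 15, 16)
)

# Day-30 score minus baseline score
toy$SWLS_Dif <- toy$SWLS_30 - toy$SWLS_baseline  # 4, 0, -3
toy$PSS_Dif  <- toy$PSS_30 - toy$PSS_baseline    # -6, 0, 4

toy[, c("SWLS_Dif", "PSS_Dif")]
```

The first participant improved on both measures (more satisfaction, less stress), the second did not change, and the third got worse on both. Note that improvement shows up as a positive `SWLS_Dif` but a negative `PSS_Dif`.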

## 2.0 - Summary Stats and Demographics

Let's take a moment to consider our participants here. Who is this sample comprised of? Let's summarize some basic information about the participants who comprise this sample.

2.1 - How many participants are in this study? Save this value to an object called `nParticipants`. **(0.5 points)**

In [0]:
nParticipants = count(mindfulness) %>%
pull()


2.2 - What about for each group? Find the number of participants in each group and save the values to objects called `nParticipantsApp` and `nParticipantsWaitlist`. **(0.5 points)**

In [0]:
nParticipantsApp <- (count(filter(mindfulness, Group == "app"))) %>%
pull() %>%
as.numeric()

nParticipantsWaitlist <- (count(filter(mindfulness, Group == "waitlist"))) %>%
pull() %>%
as.numeric()

nParticipantsApp
nParticipantsWaitlist

2.3 - What is the average age of participants in this study? In each group? Round these values to 2 decimal places and save to objects called: `mAge`, `mAgeApp`, and `mAgeWaitlist`. **(0.75 points)**

In [0]:
#mAge
mAge <- round(mean(mindfulness$Age), 2)

#mAgeApp
mAgeApp <- round(mean(mindfulness$Age[mindfulness$Group == "app"]), 2)

#mAgeWaitlist
mAgeWaitlist <- round(mean(mindfulness$Age[mindfulness$Group == "waitlist"]), 2)


2.4 - What about the distribution of sex? How many of these participants were female? In each group? Save these values to objects called `nFemales`, `nFemalesApp`, and `nFemalesWaitlist`. **(0.75 points)**

In [0]:
#nFemales
nFemales <- mindfulness %>%
tally(Sex == "female") %>%
pull()

#nFemalesApp: count only the females within the app group
nFemalesApp <- sum(mindfulness$Group == "app" & mindfulness$Sex == "female")

#nFemalesWaitlist: count only the females within the waitlist group
nFemalesWaitlist <- sum(mindfulness$Group == "waitlist" & mindfulness$Sex == "female")
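
As a cross-check on the per-group counts and means, the same summaries can be sketched with base R's `table()` and `tapply()`. The five rows below are hypothetical and only mimic the structure of `mindfulness`.

```r
# Hypothetical mini dataset with the same column names as the study data
toy <- data.frame(
  Group = c("app", "app", "app", "waitlist", "waitlist"),
  Sex   = c("female", "male", "female", "female", "female"),
  Age   = c(30, 41, 25, 38, 52)
)

# Counts of every Group-by-Sex combination in one table
sex_by_group <- table(toy$Group, toy$Sex)
n_females_app <- sex_by_group["app", "female"]
n_females_waitlist <- sex_by_group["waitlist", "female"]

# Mean age per group, rounded to 2 decimal places
mean_age_by_group <- round(tapply(toy$Age, toy$Group, mean), 2)

sex_by_group
mean_age_by_group
```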


2.5 - Do our groups have different scores at baseline? Find the average score on our 3 measures for each group. Do these values seem substantially different to you (no statistical analysis needed)? Would it be a problem if they were? Why or why not? **(2 points)**

In [0]:
#APP
#m.app.SWLS_baseline
app.SWLS_baseline <- mindfulness %>%
subset(Group == "app", select=c(SWLS_baseline)) 
m.app.SWLS_baseline <- mean(app.SWLS_baseline$SWLS_baseline)
m.app.SWLS_baseline

#m.app.PSS_baseline
app.PSS_baseline <- mindfulness %>%
subset(Group == "app", select=c(PSS_baseline)) 
m.app.PSS_baseline <- mean(app.PSS_baseline$PSS_baseline)
m.app.PSS_baseline

#m.app.RWS_baseline (the resilience column is named WRS_baseline)
app.RWS_baseline <- mindfulness %>%
subset(Group == "app", select=c(WRS_baseline)) 
m.app.RWS_baseline <- mean(app.RWS_baseline$WRS_baseline)
m.app.RWS_baseline

#WAITLIST
#m.waitlist.SWLS_baseline
waitlist.SWLS_baseline <- mindfulness %>%
subset(Group == "waitlist", select=c(SWLS_baseline)) 
m.waitlist.SWLS_baseline <- mean(waitlist.SWLS_baseline$SWLS_baseline)
m.waitlist.SWLS_baseline

#m.waitlist.PSS_baseline
waitlist.PSS_baseline <- mindfulness %>%
subset(Group == "waitlist", select=c(PSS_baseline)) 
m.waitlist.PSS_baseline <- mean(waitlist.PSS_baseline$PSS_baseline)
m.waitlist.PSS_baseline

#m.waitlist.RWS_baseline (the resilience column is named WRS_baseline)
waitlist.RWS_baseline <- mindfulness %>%
subset(Group == "waitlist", select=c(WRS_baseline)) 
m.waitlist.RWS_baseline <- mean(waitlist.RWS_baseline$WRS_baseline)
m.waitlist.RWS_baseline


> Both groups had very similar baseline scores on all three measures. For SWLS, the means were 24.17 (app) and 24.55 (waitlist). For PSS, they were 16.90 (app) and 17.73 (waitlist). For WRS, they were 73.69 (app) and 75 (waitlist). Substantially different baseline scores would not prevent the analysis, because you could still calculate the difference between baseline and the 10- or 30-day scores. However, different baselines can create a confound. If the scores for the experimental group were already very high compared to the control group, there would be less room for Headspace to improve well-being. This would limit the generalizability of the results and how much comparison can be done between the two groups, because the app's effect would not have been tested on people at average levels of psychosocial well-being.
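
The six separate baseline means can also be obtained in one call. The sketch below uses base R's `aggregate()` on made-up numbers; the column names follow the renamed study variables, but the values are hypothetical.

```r
# Hypothetical baseline scores for six participants (not study data)
toy <- data.frame(
  Group         = rep(c("app", "waitlist"), each = 3),
  SWLS_baseline = c(24, 22, 26, 25, 23, 27),
  PSS_baseline  = c(16, 18, 17, 17, 19, 18)
)

# One row per group, one column of means per measure
baseline_means <- aggregate(cbind(SWLS_baseline, PSS_baseline) ~ Group,
                            data = toy, FUN = mean)
baseline_means
```

Putting the group means side by side in one table makes it easier to judge by eye whether the groups started at similar levels.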

## 3.0 - Explore Variation

3.1 - Let's start by exploring variation in `PSS_Dif`, one of the main outcome measures of this study. Create a histogram that would help us do this. **(0.5 points)**

In [0]:
mindfulness %>%
ggplot(aes(x = PSS_Dif)) +
geom_histogram(bins = 10, color = "black", fill = "palevioletred1") 

3.2 - Find one of the tallest bars and the lowest bars in the histogram. What does it mean for a bar to be really low, or really tall in this histogram? **(0.5 points)**

> One of the tallest bars is approximately at -5 and one of the lowest bars is at -15. If a bar is really tall, it means that many participants' perceived stress changed by about that much over the course of the study. In this case, many people saw a decrease in stress of about 5 points. If a bar is really low, it means that few participants' perceived stress changed by about that much over the course of the study. In this case, few people saw a decrease in stress of about 15 points.

3.3 - Make a boxplot of the same variable. **(0.5 points)**

In [0]:
mindfulness %>%
ggplot(aes(y = PSS_Dif)) +
geom_boxplot(color = "black", fill = "palevioletred1") 

3.4 - How would you describe this distribution of `PSS_Dif` scores? What have we learned by looking at this distribution? **(1 point)**

> The spread of change in stress is medium, with values ranging between -15 and 15. Most people perceived a decrease in stress between 0 and 5 points, no change in perceived stress, or an increase in perceived stress between 0 and 4 points. The median change in stress perception was a decrease of 2 points. There were three participants who saw a very large decrease in stress compared to the other participants. This distribution shows that over the study almost half experienced a decrease in stress while the rest experienced an increase. This could be better analyzed by faceting by `Group`.

## 4.0 - Empty Model

4.1 - If we didn’t know anything about a participant, we might simply guess that they would have an average amount for their `PSS_Dif` score. How would we write this as a word equation? **(0.5 points)**

> `PSS_Dif` = mean(`PSS_Dif`) + error

4.2 -  Create an empty model called `empty.model` for `PSS_Dif` in the coding cell, and write it in your word equation in the markdown cell below. **(1 point)**

In [0]:
empty.model <- lm(PSS_Dif ~ NULL, data = mindfulness)
empty.model

> `PSS_Dif` = -1.161 + error

4.3 - Add your empty model to the histogram you made of `PSS_Dif` above. **(0.5 points)**

In [0]:
mindfulness %>%
ggplot(aes(x = PSS_Dif)) +
geom_histogram(bins = 10, color = "black", fill = "palevioletred1") +
geom_vline(aes(xintercept = mean(PSS_Dif)), color = "blue", size = 1)

4.4 - What will this model predict as the `PSS_Dif` for each person? Use the `predict()` function to generate predictions for each person in the data set, and save them back to the dataset as a variable called `$PSS_Difpred`. **(1 point)**

In [0]:
mindfulness <- mindfulness %>%
mutate(PSS_Difpred = predict(empty.model))


> This model will predict -1.161 for everyone because it is the mean of `PSS_Dif`.
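
To see why every prediction is identical, here is a minimal sketch on five made-up `PSS_Dif` values. It writes the intercept-only model as `~ 1`, the standard base R formula for the same empty model that the cell above specifies with `~ NULL`.

```r
# Five hypothetical difference scores (not study data)
toy <- data.frame(PSS_Dif = c(-5, 2, 0, -1, -1))

# Empty model: a single parameter, the intercept
empty_fit <- lm(PSS_Dif ~ 1, data = toy)

b0 <- unname(coef(empty_fit)[1])     # the estimate equals the sample mean
preds <- unname(predict(empty_fit))  # one prediction per row

b0
preds
```

With no explanatory variable, the least-squares estimate of the intercept is the sample mean, so `predict()` returns that one number for every row.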

4.5 - The empty model will probably be wrong -- but it’s better than nothing. The nice thing about actually having a model is that we can quantify exactly how wrong we are! Manually compute the residuals for each score and save them to a variable in the dataframe called `$residMan`. **(0.5 points)**

In [0]:
mindfulness <- mindfulness %>%
mutate(residMan = PSS_Dif - PSS_Difpred)


4.6 - Now use an r function to compute the residuals, and save them to `$residFun`. **(0.5 points)**

In [0]:
mindfulness <- mindfulness %>%
mutate(residFun = resid(empty.model))
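
The manual and function-based residuals should agree exactly. The sketch below checks this on a small made-up vector, using hypothetical object names.

```r
# Hypothetical difference scores (not study data)
toy <- data.frame(PSS_Dif = c(-5, 2, 0, -1, -1))
empty_fit <- lm(PSS_Dif ~ 1, data = toy)

# Manual residuals: observed score minus predicted score
resid_manual <- toy$PSS_Dif - unname(predict(empty_fit))

# The same quantity from R's built-in extractor
resid_fun <- unname(resid(empty_fit))

# DATA = MODEL + ERROR for the first row: -5 = -1 + (-4)
first_row <- c(data = toy$PSS_Dif[1],
               model = unname(coef(empty_fit)[1]),
               error = resid_fun[1])

cbind(resid_manual, resid_fun)
first_row
```

Around the mean, the residuals always sum to zero (up to rounding error), which is why the empty model balances the errors above and below it.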


Make a histogram of the residuals from the empty model. Are the errors balanced in each direction? What is the standard amount that an observation in this distribution deviates from its predicted score? **(1 point)**

In [0]:
mindfulness %>%
ggplot(aes(x = residFun)) +
geom_histogram(bins = 10, color = "black", fill = "deepskyblue3") 


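> The errors are not completely balanced in each direction. More people seem to be overestimated by the mean than underestimated. The standard amount that an observation in this distribution deviates from its predicted score is the standard deviation of the residuals, which for the empty model is the same as the standard deviation of `PSS_Dif`. The residuals themselves are centered at 0 points.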

4.7 - Explain why the histograms of `PSS_Dif` and `residFun / residMan` have the same shape but different values on the x-axis. **(1 point)**

- Explain why the distribution of a variable and the distribution of residuals around the empty model have the same shape. Will the x-axis always be the same too? Why or why not?

> The `PSS_Dif` and `residFun` / `residMan` distributions have the same shape because the same value, the mean, is subtracted from every point in the distribution to get the residuals. Subtracting a constant shifts the whole distribution along the x-axis without changing its shape. The values on the x-axis will differ whenever the mean is not 0, because the `PSS_Dif` histogram shows the actual data points and the `residFun` / `residMan` histogram shows the difference between the actual data and the predicted value. 
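
A short numeric sketch of this point, on made-up values: subtracting the mean moves the center to 0 but leaves the spread untouched.

```r
# Hypothetical difference scores (not study data); their mean is -2
scores <- c(-15, -5, -2, 0, 4, 6)
residuals_empty <- scores - mean(scores)  # residuals around the empty model

# The center moves to 0 ...
center <- c(scores = mean(scores), residuals = mean(residuals_empty))

# ... but the standard deviation and range width are unchanged
spread <- c(scores = sd(scores), residuals = sd(residuals_empty))
width <- c(scores = diff(range(scores)),
           residuals = diff(range(residuals_empty)))

center
spread
width
```

The standard deviation printed in `spread` is also the "standard amount" by which an observation deviates from its predicted score under the empty model.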

4.8 - Are the estimates from our model considered statistics or parameters? Do you think these estimates reflect the true population distribution of `PSS_Dif`? Explain your answer. **(1 point)**

> The estimates from the model are statistics, because they are computed from our sample; they serve as estimates of the unknown population parameters. These estimates will probably not match the true population distribution of `PSS_Dif` exactly because of sampling variation.

4.9 - Write the empty model in General Linear Model (GLM) notation. Which parts correspond to our word equation? To Data? Model? Error? **(1 point)**

> - $Y_i = b_0 + e_i$
> - $Y_i$ is the data, $b_0$ is the model, and $e_i$ is the error.

4.10 - Express the model in GLM notation for the first participant in the dataset. **(0.5 points)**

In [0]:
mindfulness[1:1, "PSS_Dif"]

> $-5 = -1.161 + (-3.839)$

4.11 - If we were to shuffle `PSS_Dif` and plot the data and mean, and then reshuffle and plot the data and mean again, would the mean **<u>definitely</u>** be in the same place on both graphs? Why or why not? **(1 point)**

> Yes, the mean would definitely be in the same place on both graphs. Shuffling only reassigns the same `PSS_Dif` values to different observations; it does not change any value, so the sum of the values, and therefore the mean, stays the same.
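
The point can be checked with a short sketch. It uses made-up scores and base R's `sample()` as a stand-in for a shuffle function, so the vector, the seed, and the variable names are all assumptions for illustration.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
y <- c(-12, -5, -3, 0, 2, 7)     # hypothetical scores, NOT the study data

shuffled_once  <- sample(y)      # same values, new order
shuffled_twice <- sample(y)      # same values, another order

# All three means are identical because the set of values never changes
mean(y)
mean(shuffled_once)
mean(shuffled_twice)
```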

4.12 - Why is the empty model a stand-in for a DGP of randomness? **(1 point)**

> The empty model is a stand-in for a DGP of randomness because it uses no explanatory variables: it predicts the same value (the mean) for everyone and treats all of the variation around that value as unexplained, random error.

## 5.0 -  Predicting Probabilities

5.1 - Fit a normal distribution over the histogram of `PSS_Dif`, add the empty model in blue, and the first participant in another color of your choosing. **(1.5 points)**

In [0]:
P1_PSS_Dif <- mindfulness[1:1, "PSS_Dif"] %>%
pull()

mindfulness %>%
ggplot(aes(x = PSS_Dif)) +
geom_histogram(aes(y = ..density..), bins = 10, color = "black", fill = "palevioletred1") +
geom_vline(aes(xintercept = mean(PSS_Dif)), color = "blue", size = 2) +
geom_vline(xintercept = P1_PSS_Dif, color = "green", size = 2) +
geom_vline(xintercept = mean(mindfulness$PSS_Dif) - sd(mindfulness$PSS_Dif), color = "red", size = 2) +
stat_function(fun = dnorm, args = list(mean = mean(mindfulness$PSS_Dif), 
                                       sd = sd(mindfulness$PSS_Dif)), color = "purple", size = 2)

           
#First participant indicated by the green line.

5.2 - Is this participant within 1 standard deviation of the mean (e.g., zone 1)? 
If so, save "yes" to the object `ans.5.2`. If no, save "no". **(0.5 points)**

In [0]:
ans.5.2 <- "yes"

5.3 - The area under the normal curve is 1 (you can think about it as 100%). Estimate the percentage of participants that have a higher `PSS_Dif` than participant 1 **just by looking at your histogram.** Explain how you made your estimate. **(1 point)**

> - About 67% of the participants have a higher `PSS_Dif` than participant 1.
> - In a normal curve, the mean and the median are the same, so 50% of the scores fall above the mean and 50% below. The area between the mean and one standard deviation below the mean is about 34%. Since participant 1 falls roughly halfway between the mean and one standard deviation below the mean, the scores between participant 1 and the mean represent about 17% of the data. As a result, it can be estimated that 50% + 17% = 67% of the participants have a higher `PSS_Dif` than participant 1.
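
As a rough check on the eyeball estimate in 5.3, the sketch below asks the standard normal curve directly. It assumes participant 1 sits exactly half a standard deviation below the mean, which is the same simplification the estimate uses. The result comes out slightly above 67% because the curve is taller near the mean, so the half of the zone closest to the mean holds more than half of its area.

```r
# Assumed position of participant 1: half a standard deviation below the mean
z <- -0.5

# Share of a normal distribution that lies above that point
above <- 1 - pnorm(z)
round(above, 2)   # about 0.69
```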

5.4 - What is the *discrete probability* that a randomly selected participant will have a `PSS_Dif` less than or equal to participant 1? Answer this question **using the data as a model**. Round your answer to 2 decimal places and save to an object called `ans.5.4`. **(1 point)**

In [0]:
ans.5.4 <- mindfulness %>%
  filter(PSS_Dif <= P1_PSS_Dif) %>%
  nrow() / nrow(mindfulness)

# round() already returns a number, so no format()/as.numeric() round trip is needed
ans.5.4 <- round(ans.5.4, 2)

ans.5.4

5.5 - Now find the probability **using the normal distribution as a model**. Round your answer to 2 decimal places and save to an object called `ans.5.5`. **(1 point)**

In [0]:
ans.5.5 <- pnorm(P1_PSS_Dif, mean = mean(mindfulness$PSS_Dif), sd = sd(mindfulness$PSS_Dif))
ans.5.5 <- round(ans.5.5, 2)
ans.5.5

5.6 - Compute the z-score for all participants and save them to a variable in the dataframe called `$zScore` **(1 point)**

In [0]:
mindfulness <- mindfulness %>%
mutate(zScore = (PSS_Dif - mean(mindfulness$PSS_Dif)) / sd(mindfulness$PSS_Dif)) 

mindfulness$zScore <- round(mindfulness$zScore, digits = 2)


5.7 - What does the z-score of participant 1 mean? **(1 point)**

> It means that participant 1's score is 0.52 standard deviations below the mean.
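
A small sketch connecting the z-score to the normal-model probability from 5.5. The scores are made up, and `scale()` is used only as an independent check on the manual formula.

```r
y <- c(-12, -5, -3, 0, 2, 7)          # hypothetical scores, NOT the study data

z_manual <- (y - mean(y)) / sd(y)     # z-score by hand
z_scale  <- as.numeric(scale(y))      # same calculation via base R's scale()
all.equal(z_manual, z_scale)

# The normal model gives the same probability from a raw score or from its z-score
pnorm(y[2], mean = mean(y), sd = sd(y))
pnorm(z_manual[2])
```

A negative z-score such as -0.52 therefore says two things at once: the score is below the mean, and fewer than half of the scores under the normal model fall below it.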

5.8 - Why is the normal distribution helpful for predicting future observations? **(1 point)**

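> The normal distribution is helpful for predicting future observations because any single sample is lumpy and varies from one sample to the next. By assuming that the distribution of residuals is roughly normal, we get a smooth model that depends only on the mean and standard deviation, so our probability estimates can generalize beyond our particular sample, including to values that did not happen to occur in it.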

5.9 - Do samples from a normal distribution **always look** normal? Explain why or why not. **(1 point)**

> Samples from a normal distribution do not always look normal because of sampling variation. No sample will perfectly represent the population or larger sample it is taken from, and small samples in particular can look skewed or lumpy.
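
The following sketch illustrates this with simulated data. The sample size of 20, the number of samples, and the seed are arbitrary choices, not values from the study.

```r
set.seed(111)   # arbitrary seed, only for reproducibility

# Four samples of n = 20, all drawn from the same normal DGP (mean 0, sd 1)
small_samples <- replicate(4, rnorm(20, mean = 0, sd = 1))

# Each column is one sample; the sample means and SDs wobble around 0 and 1
round(colMeans(small_samples), 2)
round(apply(small_samples, 2, sd), 2)

# Histograms of the four samples side by side: none is a perfect bell shape
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  hist(small_samples[, i], main = paste("Sample", i), xlab = "value")
}
par(old_par)
```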

5.10 - Even though the distribution of `PSS_Dif` doesn't look perfectly normal, could the DGP of `PSS_Dif` be roughly normal? Explain why or why not. **(1 point)**

> The DGP of `PSS_Dif` could be roughly normal even though the distribution of `PSS_Dif` doesn't look perfectly normal. The sample of `PSS_Dif` is relatively small, which leaves more room for sampling variation. With a larger sample, the sample distribution would tend to look more like the population distribution, which could be normal.

## 6.0 - Explanatory Model <SPAN STYLE="font-size:18px">(modeling variation)</SPAN>

6.1 - If we think `Group` can help us explain the variation we see in `PSS_Dif`, how can we write this idea as a word equation? Which is the outcome and which is the explanatory variable? **(1 point)**

> `PSS_Dif` = `Group` + other stuff
> - `PSS_Dif` is the outcome variable
> - `Group` is the explanatory variable

6.2 - Create a histogram to help us see whether `Group` might be related to `PSS_Dif`. **(0.5 points)**

In [0]:
mindfulness %>%
ggplot(aes(x = PSS_Dif, fill = Group)) +
geom_histogram(bins = 10, color = "black") +
facet_grid(rows = vars(Group))

6.3 - Looking at the histograms, what do you notice about these distributions? Does any `Group` seem like it tends to have higher or lower `PSS_Dif` scores? **(0.5 points)**

> The waitlist group has higher `PSS_Dif` scores. The app group has a center around -5 while the waitlist group has a center around 5. This indicates that people in the waitlist group experienced an increase in perceived stress while people in the app group experienced a decrease in perceived stress by the end of the experiment.

6.4 - Now create a boxplot with an overlaid jitterplot to explore the same hypothesis. **(1 point)**

In [0]:
mindfulness %>%
ggplot(aes(y = PSS_Dif, x = Group, fill = Group)) +
geom_boxplot(color = "black") +
geom_jitter() 

6.5 - Based on these visualizations, if you knew that a person was in the App group and had to guess their `PSS_Dif` score, would you adjust your guess to be a little lower or a little higher? Why? **(1 point)**

> I would adjust my guess to be lower because the interquartile range and median of the app group are lower than those of the waitlist group.

6.6 - One definition of “explain variation” is that if we know a little bit more about some observation, we can use that information to make a better prediction of some outcome. Now that we’ve *visualized* the Group Hypothesis, does Group help us explain variation on `PSS_Dif`? **(1 point)**

> `Group` does help explain some of the variation in `PSS_Dif`. People in the app group tend to have lower `PSS_Dif` scores than those in the waitlist group. The interquartile range of the app group was approximately between -9 and -5, while the interquartile range of the waitlist group was approximately between -5 and 6. This indicates that most people in the app group experienced a decrease in stress compared to the waitlist group.

## 7.0 - Group Model Predictions

7.1 - Create a model for the `Group` hypothesis you visualized above and save it to a variable called `group.model`. **(0.5 points)**

In [0]:
group.model <- lm(PSS_Dif ~ Group, data = mindfulness)
group.model

7.2 - Find the best fitting model parameters and put the numbers into GLM notation. **(0.5 points)**

> $Y_i = -5.483 + 8.119X_i + e_i$

7.3 - Interpret the Group model’s estimates from your code above. How are those numbers connected to the means from the two groups? **(1 point)**

> - -5.483 is the `PSS_Dif` mean of the app group. 8.119 is the distance of the waitlist group `PSS_Dif` mean from the app group mean. This means that the waitlist group `PSS_Dif` mean is 2.636.

7.4 - What is the model predicting for each person? **(0.5 points)**

> The model is predicting a `PSS_Dif` score of -5.483 for people in the app group and a `PSS_Dif` score of 2.636 for people in the waitlist group.

7.5 - What would $X_i$ be for someone in the **waitlist** group? What would the group model predict for someone in the waitlist group? How do you find that information? **(1 point)**

> $X_i$ would be 1 for someone in the waitlist group. The group model would predict 2.636 for someone in the waitlist group. It can be calculated by adding -5.483 + 8.119(1).

7.6 - What would $X_i$ be for someone in the **app** group? What would the group model predict for someone in the app group? How do you find that information? **(1 point)**

> $X_i$ would be 0 for someone in the app group. The group model would predict -5.483 for someone in the app group. It can be calculated by adding -5.483 + 8.119(0).
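
The arithmetic in 7.5 and 7.6 can be written as one short sketch. The coefficients are the ones reported in 7.2; the names `b0`, `b1`, and `x` are introduced here only for illustration, and the 0/1 coding is the one described in the answers above.

```r
b0 <- -5.483   # intercept: mean PSS_Dif of the app group (from 7.2)
b1 <- 8.119    # increment added for the waitlist group (from 7.2)

# Dummy coding: 0 = app, 1 = waitlist
x <- c(app = 0, waitlist = 1)

# Prediction for each group: b0 + b1 * X_i
pred <- b0 + b1 * x
pred
```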

## 8.0 - How good is this model?

8.1 - A: Save the predictions of the group model into the data frame as `$groupPred`, and save the residuals of the model as `$groupResid`. **(1 point)**

In [0]:
mindfulness <- mindfulness %>%
mutate(groupPred = predict(group.model))

mindfulness <- mindfulness %>%
mutate(groupResid = resid(group.model))


8.1 - B: Add the predictions from the empty model (in blue) and the group model (different color for each group) to your faceted histogram from above. What part of the visualization depicts: 

- the residuals from the empty model?
- the residuals from the group model?
- the part of the error that has been explained? Which model did the explaining? 

**(1.5 points)**

In [0]:
groupPred.mean <- mindfulness %>%
        group_by(Group) %>%
        summarise(Mean = mean(groupPred))

mindfulness %>%
ggplot(aes(x = PSS_Dif, fill = Group)) +
geom_histogram(bins = 10, color = "black", alpha = 0.4) +
geom_vline(aes(xintercept = mean(PSS_Dif)), color = "blue", size = 1.7) +
geom_vline(data = groupPred.mean, aes(xintercept = Mean, color = Group), size = 1.7) +
facet_grid(rows = vars(Group))



> The residuals from the empty model are depicted by the distance between each bar and the blue line. The residuals from the group model for the app group are depicted by the distance between each bar and the red line. The residuals from the group model for the waitlist group are depicted by the distance between each bar and the turquoise line. The part of error that has been explained is the distance between the blue line and the group prediction line for each group. The group model did the explaining.

8.2 - Why do we quantify error in this model with the sum of squares rather than the sum of the residuals? **(1 point)**

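> We quantify error with the sum of squares because the residuals around the mean always sum to 0: the positive and negative residuals cancel out, so their sum says nothing about how much error there is. Squaring makes every residual positive, and the mean is the value that minimizes the sum of squared residuals. Sums of squares can also be partitioned: SS Model represents the variation explained by the group model, and SS Error tells us how much variation is left unexplained.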
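
A minimal sketch of the two properties that make the sum of squares useful, again with made-up scores rather than the study data. The helper `ss_around()` is hypothetical and defined only for this example.

```r
y <- c(-12, -5, -3, 0, 2, 7)   # hypothetical scores, NOT the study data
res <- y - mean(y)             # residuals from the empty model

sum(res)      # zero up to floating-point error: positives and negatives cancel
sum(res^2)    # sum of squares: every term is positive, so error accumulates

# Hypothetical helper: sum of squared deviations around any candidate prediction
ss_around <- function(guess) sum((y - guess)^2)

ss_around(mean(y))     # smallest possible value
ss_around(median(y))   # larger, because the median is not the mean here
```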

8.3 - Now run `anova()` or `supernova()` on the group model. Find the three sums of squares.
 
- Which tells us the SS (the leftover error) in the empty model? Save the value rounded to 2 decimal places to an object called `SS.empty`.
- Which tells us the SS (the leftover error) in the group model? Save the value rounded to 2 decimal places to an object called `SS.unexplained`.
- Which tells us how much of the error has been explained by the group model? Save the value rounded to 2 decimal places to an object called `SS.explained`.

**(1.5 points)**

In [0]:
anova.mindfulness <- anova(group.model)
anova.mindfulness

# Sum Sq is the second column of the anova table: row 1 = Group, row 2 = Residuals
SS.empty <- anova.mindfulness[[2]][1] + anova.mindfulness[[2]][2]
SS.empty <- round(SS.empty, 2)
SS.empty

SS.unexplained <- sum(resid(group.model)^2)
SS.unexplained <- round(SS.unexplained, 2)
SS.unexplained

SS.explained <- sum((predict(group.model) - mean(mindfulness$PSS_Dif))^2)
SS.explained <- round(SS.explained, 2)
SS.explained


> - The sum of the Sum Sq Group and Sum Sq Residuals tells us the SS in the empty model (3302.39).
> - Sum Sq Residuals tells us the SS in the group model (2284.88). 
> - The Sum Sq Group tells us how much of the error has been explained by the group model (1017.51).

8.4 - What is the relationship between SS Model, SS Error, and SS Total? Take a look at the word equation for the group hypothesis. Any similarities? **(1 point)**

> SS Total is the sum of SS Model and SS Error. Looking at the word equation for the group model, SS Model is similar to `Group` because `Group` is the explanatory variable and SS Model is explained variation. SS Error is similar to `other stuff` because SS Error is unexplained error and other stuff is unexplained variation.

8.5 - What is the relationship between PRE and the different sums of squares? **(1 point)**

> PRE is the proportion of the total error from the empty model that is explained by the explanatory variables in the group model, so it is equal to SS Model / SS Total, or equivalently 1 - SS Error / SS Total.
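
Using the three sums of squares reported in 8.3, the relationship can be checked in a few lines. The variable names are chosen for this sketch only.

```r
ss_total <- 3302.39   # SS Total (empty model), from 8.3
ss_model <- 1017.51   # SS Model (explained by Group), from 8.3
ss_error <- 2284.88   # SS Error (left over in the group model), from 8.3

# PRE as the explained share of the total error
pre <- ss_model / ss_total
round(pre, 2)   # 0.31

# Same quantity expressed through the leftover error
1 - ss_error / ss_total
```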

8.6 - Interpret the PRE for the group model. Is that a lot? Is that a little? Is it hard to tell? **(1 point)**

> PRE for the group model is 0.31. It means 31% of the variation in `PSS_Dif` scores can be explained by `Group`. It is not a little, but it is hard to tell its significance from PRE alone. The effect size can be better analyzed by calculating Cohen's d.

8.7 - Is it likely that we could have gotten this PRE from a random process 
(where these groups are essentially the same but they are just a little 
different because of randomness)? Explain your answer. **(1 point)**

> It is unlikely that this PRE could have resulted from a random process because it is not a low value, but it is possible due to sampling variation and a small sample size.

## 9.0 - More Than 2 Groups

9.1 Let's now consider another question with this data. Maybe `Age` is important. Let's consider whether the change in satisfaction with life (`SWLS_Dif`) is explained by participants' `Age`.

1. Write this as a word equation.
2. What is wrong with using `Age` as is?

**(1 point)**

> `SWLS_Dif` = `Age` + other stuff
> - Age is a quantitative variable so it would need to be coded into groups.

9.2 Use some R code to create 3 distinct groups for `Age` with the levels `low`, `medium`, and `high`, and save it in the dataframe as a variable called `$Age3`. **(0.5 points)**

In [0]:
mindfulness$Age3 <- ntile(mindfulness$Age, 3)

mindfulness <- mindfulness %>%
mutate(Age3 = factor(mindfulness$Age3, levels = 1:3, labels = c("low", "medium", "high")))


9.3 Fit a model predicting `SWLS_Dif` from `Age3` and save it to a variable called `model.Age3`. Then use this model to make predictions for each participant, and save them to the dataframe as a variable called `$Age3pred`. **(1 point)**

In [0]:
model.Age3 <- lm(SWLS_Dif ~ Age3, data = mindfulness)
model.Age3

mindfulness <- mindfulness %>%
mutate(Age3pred = predict(model.Age3))


9.4 Create a visualization to represent the variation in `SWLS_Dif` by `Age3`. Include the actual data and the predicted values. 

*HINT: make sure you can identify the actual data*

**(0.5 points)**

In [0]:
groupPred.mean.Age3 <- mindfulness %>%
        group_by(Age3) %>%
        summarise(Mean.Age3pred = mean(Age3pred))

mindfulness %>%
ggplot(aes(x = SWLS_Dif, fill = Age3)) +
geom_histogram(bins = 10, color = "black") +
geom_vline(data = groupPred.mean.Age3, aes(xintercept = Mean.Age3pred), size = 1) +
facet_grid(rows = vars(Age3))

9.5 Print out the parameter estimates from the model. What does $b_0$ tell us? How about $b_1$ and $b_2$? 
What part of the graph above corresponds to these numbers? **(2 points)**

> - $b_0$ tells us the mean `SWLS_Dif` score of people in the low age group. It is 0.04762.
> - $b_1$ tells us how far the mean `SWLS_Dif` score of the medium age group is from that of the low age group. Since $b_1$ is 2.52381, the mean `SWLS_Dif` of the medium age group is 0.04762 + 2.52381 = 2.57143.
> - $b_2$ tells us how far the mean `SWLS_Dif` score of the high age group is from that of the low age group. Since $b_2$ is 2.65238, the mean `SWLS_Dif` of the high age group is 0.04762 + 2.65238 = 2.7.
> - The predicted values are the black lines. $b_0$ is the line on the low age graph. $b_1$ is the distance between the black lines on the low age graph and the medium age graph. $b_2$ is the distance between the black lines on the low age graph and the high age graph. 

9.6 Run `anova()` or `supernova()` on the `Age3` model. How does the PRE compare to a 2-group model? *(Hint: you need to make a new model to answer this.)* **(0.5 points)**

In [0]:
mindfulness$Age2 <- ntile(mindfulness$Age, 2)

mindfulness <- mindfulness %>%
mutate(Age2 = factor(mindfulness$Age2, levels = 1:2, labels = c("low", "high")))

model.Age2 <- lm(SWLS_Dif ~ Age2, data = mindfulness)

anova(model.Age3)
anova(model.Age2)


> The PRE for the 3-group model is higher than that of the 2-group model. This means that more variation is explained by 3-group model.

9.7 - A: Create a variable called `$Age20` with 20 distinct groups for `Age` (no need to name them), use it in a model to generate predictions (save to `$Age20pred`). **(0.5 points)**

In [0]:
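# Convert to a factor so lm() treats Age20 as 20 groups rather than as one numeric predictor
mindfulness$Age20 <- factor(ntile(mindfulness$Age, 20))
model.Age20 <- lm(SWLS_Dif ~ Age20, data = mindfulness)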

mindfulness <- mindfulness %>%
mutate(Age20pred = predict(model.Age20))

mindfulness


9.7 - B: Create a visualization to represent the variation in `SWLS_Dif` by `Age20`. Include the actual data and the predicted values. **(0.5 points)**

In [0]:
groupPred.mean.Age20 <- mindfulness %>%
        group_by(Age20) %>%
        summarise(Mean.Age20pred = mean(Age20pred))

mindfulness %>%
ggplot(aes(x = SWLS_Dif, fill = Age20)) +
geom_histogram(bins = 10, color = "black") +
geom_vline(data = groupPred.mean.Age20, aes(xintercept = Mean.Age20pred), size = 1) +
facet_grid(rows = vars(Age20))

anova(model.Age20)

9.8 - How many more parameter estimates (the $b$s) will this model have compared to the empty model? How can you tell? **(1 point)**

> The empty model estimates one parameter: $b_0$. The `Age20` model estimates 20 parameters, $b_0$ through $b_{19}$ (one per group). Therefore this model has 19 more parameter estimates than the empty model.
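
The count can be confirmed by asking a fitted model for its coefficients. The sketch below uses simulated data with made-up group labels, so `y` and `age20` are stand-ins rather than the study variables. It also shows why the grouping variable has to be a factor: a numeric group code yields only an intercept and a slope.

```r
set.seed(20)                       # arbitrary seed
age20 <- rep(1:20, each = 3)       # hypothetical group codes, 3 people per group
y <- rnorm(length(age20))          # hypothetical outcome

length(coef(lm(y ~ 1)))               # empty model: 1 parameter
length(coef(lm(y ~ factor(age20))))   # 20 groups as a factor: 20 parameters
length(coef(lm(y ~ age20)))           # numeric code: only 2 (intercept and slope)
```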

## 10.0 - Comparing Models and the F Ratio

10.1 - Compare the anova output of the models you created above (`Age3`, `Age20`, and the model with 2 groups for `Age`). Which model explains more error? Which model is least likely to be well-explained by a random DGP? **(1 point)**

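> The `Age20` model explains more error because it has a higher PRE and the smallest sum of squares error. However, a smaller SS Error does not by itself make a random DGP less plausible, because adding parameters always reduces SS Error in the sample. The model least likely to be well-explained by a random DGP is the one with the largest F ratio (and smallest p-value) in the anova output, and that is not the `Age20` model, whose F is smaller than the others (see 10.2).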

10.2 - Why is F smaller in the `Age20` model? **(1 point)**

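> F is smaller in the `Age20` model because F is the ratio of MS Model to MS Error, and MS Model is SS Model divided by the model degrees of freedom. The `Age20` model uses 19 degrees of freedom for its extra parameters (compared with 2 for `Age3` and 1 for the 2-group model), so the error explained per parameter is small, while the error degrees of freedom shrink.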
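
To see how degrees of freedom pull F down, F can be rewritten in terms of PRE. The function below is a sketch under stated assumptions: `f_ratio()` is a hypothetical helper, and the PRE values and the sample size of 60 are made up for illustration, not taken from the anova output.

```r
# Hypothetical helper: F ratio for a one-way group model, computed from PRE
f_ratio <- function(pre, n_groups, n) {
  df_model <- n_groups - 1          # parameters beyond the empty model
  df_error <- n - n_groups          # observations left over for error
  (pre / df_model) / ((1 - pre) / df_error)
}

# Made-up numbers: the 20-group model has the higher PRE but the lower F
f_ratio(pre = 0.10, n_groups = 3,  n = 60)   # about 3.17
f_ratio(pre = 0.30, n_groups = 20, n = 60)   # about 0.90
```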

10.3 - Are models with a high PRE always better? Why or why not? **(2 points)**

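> Models with a high PRE are not always better, especially if the high PRE was achieved by splitting the data up into many groups. Adding parameters can only lower the SS Error in the sample, so PRE increases even when the extra groups are just fitting random noise. Such a model is overfit: it describes this sample well but is unlikely to predict new data well, and its F ratio shows that it explains little error per parameter. There might also be other variables affecting `SWLS_Dif` outside of `Age`.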
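
An extreme case makes the point concrete. In the sketch below the outcome is pure simulated noise and every observation gets its own group; all names, the seed, and the sample size are assumptions for illustration only.

```r
set.seed(3)                        # arbitrary seed
n <- 30
noise <- rnorm(n)                  # outcome that is pure noise
own_group <- factor(seq_len(n))    # one group per observation

fit <- lm(noise ~ own_group)

ss_total <- sum((noise - mean(noise))^2)
ss_error <- sum(resid(fit)^2)
pre <- 1 - ss_error / ss_total
pre   # 1 (up to rounding), even though nothing real was explained
```

The model reproduces the sample perfectly because it has as many parameters as observations, which is exactly why a high PRE has to be weighed against the number of parameters spent to get it.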