# Mental Health in Tech Survey (COMPLETE)

## Chapter 4.6-4.9 Categorical, and More-Than-One, Explanatory Variables 

In [None]:
# This code will load the R packages we will use
library(coursekata)

# set styles
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles_v2.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))

# Load the data frame
tech1 <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vS1tVN587A_zyDpFigG4Enccfn7zeTnhdjcd-B-71tFQHrELtV9iJGqXNGTHXwbVBBIfr041MdwwXJ_/pub?output=csv", header=TRUE)
# Filter to US-based, non-self-employed respondents at tech companies,
# based on common comments about the applicability of responses
tech2 <- filter(tech1, Country == "United States")
tech3 <- filter(tech2, self_employed == "No")
tech4 <- filter(tech3, tech_company == "Yes")
tech <- select(tech4, Timestamp, Gender, family_history, treatment, work_interfere, remote_work, benefits, comments)
# reorder work_interfere levels
tech$work_interfere <- factor(tech$work_interfere, levels = c("Never", "Rarely", "Sometimes", "Often"))

<div class="teacher-note">
    <b>Teacher Note:</b> The purpose of this mini-JNB is to practice visualizing relationships between two categorical variables and among three variables, then interpreting the plots to judge whether the explanatory variables explain any variation in the outcome variable.

</div>

Here is the link to the [full variable descriptions](https://docs.google.com/document/d/e/2PACX-1vQnqUQaNNDVT3sKG8_pbfUhxWZy3Gq9BHeMd9Fpw6tnPwUmftOdTiR3l-f2XQTKmdJmgR1a7gLn4Ctj/pub), if desired.

## 1 About the `tech` data 

We will be using the `tech` data frame. This dataset is from a survey that measures attitudes towards mental health, beliefs about mental health in the workplace, and frequency of mental health issues in the tech workplace.

[data source](https://www.kaggle.com/datasets/osmi/mental-health-in-tech-survey?resource=download)

### 1.1 Run some code to look at the `tech` data frame

In [None]:
# 1.1
# run code here



In [None]:
# 1.1
# run code here


# sample code
tail(tech)



Here is a description of each of the variables, or their corresponding survey question:

- `Timestamp` The time the survey was submitted
- `Gender` The gender of the participant (Male, Female, T-F = transgender female, Alt Gen = nonbinary, gender queer, fluid, etc.) 
- `family_history` Do you have a family history of mental illness?
- `treatment` Have you sought treatment for a mental health condition?
- `work_interfere` If you have a mental health condition, do you feel that it interferes with your work?
- `remote_work` Do you work remotely (outside of an office) at least 50% of the time?
- `benefits` Does your employer provide mental health benefits?
- `comments` Any additional notes or comments

## 2 Explaining Variation in `treatment`

Let's explore the data and see if we can explain variation in tech workers who do or do not seek out treatment for a mental health condition.

Take a look at the distribution of `treatment`. With no explanatory variable yet, the word equation is `treatment` = other stuff.

In [None]:
# 2
# Run this code 

gf_bar(~treatment, data = tech, fill = "gold2") %>%
  gf_labs(title = "Have you sought treatment for a mental health condition?")

tally(~treatment, data = tech)

### 2.1 If you had to guess which category a random tech worker probably belongs to, which would you guess? How confident would you feel in your guess?

2.1 Response:

<div class="teacher-note">

<b>Sample Response</b>: Probably the "Yes" group, but I wouldn't feel too confident because the two groups are not that far off from one another.
    
</div>
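
As a rough illustration of the reasoning in 2.1, the base R sketch below finds the most common category and how often that guess would be correct. The `toy_treatment` vector is made up for illustration and is not the survey's actual data; with the real data frame you could pass `tech$treatment` instead.

```r
# Toy responses (hypothetical values, not the survey's actual counts)
toy_treatment <- c("Yes", "No", "Yes", "Yes", "No", "Yes",
                   "No", "No", "Yes", "No", "Yes")

# Count each category, then convert the counts to proportions
counts <- table(toy_treatment)
props <- prop.table(counts)

# With no other information, the best single guess is the most common category
best_guess <- names(counts)[which.max(counts)]

# Proportion of this toy sample for which that guess would be right
hit_rate <- max(props)

print(counts)
print(best_guess)          # "Yes" (6 of the 11 toy responses)
print(round(hit_rate, 2))  # 0.55
```

A hit rate close to 0.5 matches the intuition in the sample response: when the two groups are nearly the same size, the guess is only slightly better than a coin flip.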

### 2.2 Categorize the variables below according to how much variation you think they would explain in `treatment` (either: "not much", "some", or "a lot").

Another way of thinking about this is: If I knew *this* extra information about the tech worker, I would probably be able to improve my guess of their `treatment` response.

- `Timestamp` The time the survey was submitted
- `Gender` The gender of the participant
- `family_history` Do you have a family history of mental illness?
- `work_interfere` If you have a mental health condition, do you feel that it interferes with your work?
- `remote_work` Do you work remotely (outside of an office) at least 50% of the time?
- `benefits` Does your employer provide mental health benefits?
- `comments` Any additional notes or comments

2.2 Response:

<div class="teacher-note">

<b>Sample Response</b>: 
   
- `Timestamp` - not much
- `Gender` - some, or a lot
- `family_history` - some, or a lot
- `work_interfere` - some, or a lot
- `remote_work` - not much, some, or a lot
- `benefits` - some, or a lot
- `comments` - not much
    
*Students may have varying impressions of how big of an effect some of the variables may have, but they should at least notice that variables such as `Timestamp` and `comments` will likely not explain much variation in `treatment`.*
    
</div>

### 2.3 Write a word equation that expresses the idea that `work_interfere` might help us explain the variation we see in `treatment`.

2.3 Response:

<div class="teacher-note">

<b>Sample Response</b>: `treatment` = `work_interfere` + other stuff
    
</div>

### 2.4 Visualize this hypothesis, and generate its contingency table.

In [None]:
# 2.4
# Run code here





In [None]:
# 2.4
# Run code here


# sample response (filtering out rows where work_interfere is missing)
tally(treatment ~ work_interfere, data = filter(tech, !is.na(work_interfere)))

gf_bar(~treatment, data = filter(tech, !is.na(work_interfere)), fill = ~work_interfere) %>%
  gf_facet_grid(. ~ work_interfere)



### 2.5 If you knew someone had responded "Never" to the `work_interfere` question, what would you guess to be their `treatment` category? 

How about if they responded "Often" to the `work_interfere` question?

2.5 Response:

<div class="teacher-note">

<b>Sample Response</b>: 
    
- Never --> "No" category
- Often --> "Yes" category
    
</div>
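
One way to make the reasoning in 2.5 concrete is to find the most common `treatment` response within each `work_interfere` group. The sketch below uses base R and a small made-up data frame (`toy` holds hypothetical values, not the survey responses), so the numbers only illustrate the method.

```r
# Hypothetical toy data; the real notebook would use the tech data frame
toy <- data.frame(
  work_interfere = factor(
    c(rep("Never", 4), rep("Rarely", 3), rep("Sometimes", 3), rep("Often", 2)),
    levels = c("Never", "Rarely", "Sometimes", "Often")
  ),
  treatment = c("No", "No", "No", "Yes",   # Never
                "No", "Yes", "Yes",        # Rarely
                "Yes", "Yes", "No",        # Sometimes
                "Yes", "Yes")              # Often
)

# Contingency table: rows are treatment, columns are work_interfere
counts <- table(treatment = toy$treatment, work_interfere = toy$work_interfere)

# Proportion of each treatment response within each work_interfere column
col_props <- prop.table(counts, margin = 2)

# Best guess for each group: the treatment category with the largest count
predicted <- rownames(counts)[apply(counts, 2, which.max)]
names(predicted) <- colnames(counts)

print(round(col_props, 2))
print(predicted)
```

In this toy data the guess is "No" for the "Never" group and "Yes" for the "Often" group, which is the same kind of group-by-group prediction the question asks for.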

### 2.6 Which word equation appears to be the better representation of the data generating process? Why?

1. `treatment` = `work_interfere` + other stuff
2. `treatment` = other stuff

2.6 Response:

<div class="teacher-note">

<b>Sample Response</b>: 
    
`treatment` = `work_interfere` + other stuff
    
Because I can probably make a slightly better guess at someone's `treatment` category if I knew their `work_interfere` category.
  
Because the groups show a lot of differences, suggesting `work_interfere` might be involved in the data generating process.
    
Because keeping that variable in our word equation helps us explain some of the variation in our outcome variable.

</div>
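
The idea of "making a better guess" in 2.6 can be sketched numerically by comparing how often each word equation's best guess would be right. This is only one informal way to compare the two, shown here in base R on made-up data (`toy` is hypothetical, not the survey responses).

```r
# Hypothetical toy data; the real notebook would use the tech data frame
toy <- data.frame(
  work_interfere = factor(
    c(rep("Never", 4), rep("Rarely", 3), rep("Sometimes", 3), rep("Often", 2)),
    levels = c("Never", "Rarely", "Sometimes", "Often")
  ),
  treatment = c("No", "No", "No", "Yes",
                "No", "Yes", "Yes",
                "Yes", "Yes", "No",
                "Yes", "Yes")
)

counts <- table(treatment = toy$treatment, work_interfere = toy$work_interfere)
n <- sum(counts)

# treatment = other stuff: guess the overall most common category for everyone
empty_correct <- max(rowSums(counts)) / n

# treatment = work_interfere + other stuff: guess the most common category in each group
group_correct <- sum(apply(counts, 2, max)) / n

print(round(c(empty = empty_correct, with_work_interfere = group_correct), 2))
```

If the second proportion is noticeably larger than the first, knowing `work_interfere` improves the guesses, which is what it means for the variable to explain some variation in `treatment`.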

### 2.7 Are there any other potential explanations for why we are seeing this particular pattern of variation?

2.7 Response:

<div class="teacher-note">

<b>Sample Response</b>: 
    
- There could be other real things that are part of the data generating process, but that aren't `work_interfere` directly.
- It could just be random chance.

</div>
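
To get a feel for the "random chance" explanation in 2.7, one option is to shuffle the `work_interfere` labels and see how much apparent pattern shows up when there is no real relationship. The sketch below is a minimal illustration in base R using `sample()`; the `toy` data, the `hit_rate()` helper, the seed, and the number of shuffles are all assumptions made for this example.

```r
set.seed(10)  # arbitrary seed so the shuffles are repeatable

# Hypothetical toy data; the real notebook would use the tech data frame
toy <- data.frame(
  work_interfere = factor(
    c(rep("Never", 4), rep("Rarely", 3), rep("Sometimes", 3), rep("Often", 2)),
    levels = c("Never", "Rarely", "Sometimes", "Often")
  ),
  treatment = c("No", "No", "No", "Yes",
                "No", "Yes", "Yes",
                "Yes", "Yes", "No",
                "Yes", "Yes")
)

# Hypothetical helper: proportion of correct guesses when we guess the
# most common treatment category within each group
hit_rate <- function(treatment, group) {
  counts <- table(treatment, group)
  sum(apply(counts, 2, max)) / sum(counts)
}

# Hit rate for the toy data as observed
observed <- hit_rate(toy$treatment, toy$work_interfere)

# Shuffling the group labels breaks any real link between the two variables,
# so whatever pattern remains in a shuffled data set is due to chance alone
shuffled <- replicate(1000, hit_rate(toy$treatment, sample(toy$work_interfere)))

# Share of shuffles that look at least as patterned as the observed data
print(observed)
print(mean(shuffled >= observed))
```

If many shuffles do as well as the observed data, random chance remains a plausible explanation for the pattern; with a data set as small as this toy example, that is quite likely.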

## 3 The `benefits` Hypothesis

In [None]:
# 3
# Run this code

tally(treatment ~ benefits, data = tech, format = "proportion")

gf_props(~ treatment , data = tech, fill = "coral1") %>%
  gf_facet_grid(.~benefits)

### 3.1 What question or hypothesis is the distribution above trying to explore? Can we tell what the word equation should be based on the code? If so, put it into a word equation.

3.1 Response:

<div class="teacher-note">

<b>Sample Response</b>: 
    
It's trying to explore the hypothesis that whether an employer provides mental health benefits may impact how likely a tech worker is to seek treatment for mental health issues.
    
`treatment` = `benefits` + other stuff

</div>

### 3.2 What do you think about this hypothesis? Does it help us explain variation? How can you tell?

3.2 Response:

<div class="teacher-note">

<b>Sample Response</b>: 
    
It seems like it helps us make a better prediction of `treatment` if we know their response to `benefits`. I would probably predict "Yes" on `treatment` for those who report "Yes" or "No" to the `benefits` question, and I would probably predict "No" on `treatment` for those who report "Don't know" to the `benefits` question.

</div>

### 3.3 How might we add the additional explanatory variable of `family_history` to this hypothesis? Add it to the plot and add it to your word equation.

In [None]:
# 3.3
# Add family_history to your plot here




In [None]:
# 3.3
# Add family_history to your plot here

# Sample responses

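# Option 1: counts, with one panel for each family_history x benefits combination
gf_bar(~treatment, data = tech, fill = ~family_history) %>%
  gf_facet_grid(family_history ~ benefits)

# Option 2: proportions, faceted by benefits, with family_history shown by fill color
gf_props(~treatment, data = tech, fill = ~family_history) %>%
  gf_facet_grid(. ~ benefits)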


3.3 Response:

Add `family_history` to your word equation here:



<div class="teacher-note">

<b>Sample Response</b>: `treatment` = `benefits` + `family_history` + other stuff
</div>

### 3.4 What `treatment` category would you predict for a tech worker who responded "Yes" to the `family_history` question, and "No" to the `benefits` question? 

What if they said "Yes" to the `family_history` question, and "Don't know" to the `benefits` question?

3.4 Response:

<div class="teacher-note">

<b>Sample Response</b>: 
    
- "Yes" `family_history` & "No" `benefits` --> "Yes" `treatment` category
- "Yes" `family_history` & "Don't know" `benefits` --> "Yes" `treatment` category
    
</div>

### 3.5 Have we explained any more variation in `treatment` by adding `family_history` to our word equation? How can you tell?

3.5 Response:

<div class="teacher-note">

<b>Sample Response</b>: Yes, we have. We would make different predictions for `treatment` if we knew both `benefits` and `family_history` than if we knew `benefits` alone. For instance, we would probably no longer predict "No" on `treatment` for the "Don't know" `benefits` group if we also knew that they said "Yes" to `family_history`.
    
</div>
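
The predictions in 3.4 and the comparison in 3.5 can be sketched with a three-way table. The example below uses base R on made-up data (`toy` is hypothetical and was built so that every combination of `benefits` and `family_history` has at least one respondent); it is an informal illustration, not the survey's actual result.

```r
# Hypothetical toy data, grouped by (benefits, family_history) combination
toy <- data.frame(
  benefits = c(rep("Yes", 6), rep("No", 5), rep("Don't know", 5)),
  family_history = c(rep("Yes", 3), rep("No", 3),
                     rep("Yes", 2), rep("No", 3),
                     rep("Yes", 3), rep("No", 2)),
  treatment = c("Yes", "Yes", "No",   # benefits Yes, family_history Yes
                "Yes", "No", "No",    # benefits Yes, family_history No
                "Yes", "Yes",         # benefits No, family_history Yes
                "No", "No", "Yes",    # benefits No, family_history No
                "Yes", "Yes", "No",   # benefits Don't know, family_history Yes
                "No", "No")           # benefits Don't know, family_history No
)

# Three-way table: treatment x benefits x family_history
counts <- table(treatment = toy$treatment,
                benefits = toy$benefits,
                family_history = toy$family_history)

# Best guess for each benefits x family_history combination
predicted <- apply(counts, c(2, 3),
                   function(x) dimnames(counts)$treatment[which.max(x)])

# Proportion of correct guesses using both explanatory variables
hit_both <- sum(apply(counts, c(2, 3), max)) / sum(counts)

# Proportion of correct guesses using benefits alone
counts_benefits <- table(toy$treatment, toy$benefits)
hit_benefits <- sum(apply(counts_benefits, 2, max)) / sum(counts_benefits)

print(predicted)
print(round(c(benefits_only = hit_benefits, both = hit_both), 2))
```

In this toy data the guess for the "Don't know" `benefits` group changes from "No" to "Yes" once we also know the person reported a family history, and the proportion of correct guesses rises, which is the pattern described in the 3.5 sample response.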