# Putting it all together

At this point you've had experience with many different parts of the data science lifecycle, from the nitty-gritty of programming and cleaning data to running statistical tests and making inferences. In this lab, we're going to (i) tie up some loose ends and (ii) put you in the driver's seat by giving you a dataset and asking you to run analyses and make recommendations. 

## Some odds and ends that are good to know

### Fun with column names

* Creating a new column in a data frame:  
    * `d$new_col_name = new_col_vector`  
* Getting a vector of column names:  
    * `colnames(d)`  
* Changing column names:  
    * `colnames(d) = vector_of_new_column_names`  
        * This is a rare case where we put a function call to the *left* of the assignment operator.  
* Getting fancy with column names:  
    * `colnames(d)[colnames(d) == 'patient'] = 'patient_id'`  
        * This ^ takes only the column named `patient` and changes its name to `patient_id`. Can you see why this works?
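
The column-name operations above can be tried on a small data frame. This is a minimal sketch; the data frame `d` and its values are made up purely for illustration.

```r
# Hypothetical toy data frame, used only for illustration
d = data.frame(patient = 1:3, temp = c(98.6, 99.1, 97.8))

# Create a new column (the vector needs one value per row)
d$fever = d$temp > 99

# Get the vector of column names
colnames(d)

# This comparison gives a logical vector with one element per column name,
# so indexing colnames(d) with it selects only the name we want to replace
colnames(d) == 'patient'
colnames(d)[colnames(d) == 'patient'] = 'patient_id'
colnames(d)
```

Printing the intermediate comparison `colnames(d) == 'patient'` is a useful way to check which name will be overwritten before doing the assignment.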

### Changing data types
Way long ago we talked about data types. R has several built-in data types, and it's useful to make sure your variables have the type you intend:  

* Factor: Often useful for categorical data  
* String: A data type that is treated as text (R calls this type `character`).  
    * An example of this might be if we ask a patient an open ended question and they type a response. We want to treat this response as text, and not as something like a factor, because there's no meaningful way of grouping that type of text.  
* Integer: For integer data to be treated as numbers.  
* Numeric / double: For real numbers (with decimals).  

You can switch between data types with the following:  

`d$col_to_convert = as.numeric(d$col_to_convert)`
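
As a small sketch of type conversion, the example below uses a made-up `age` column and a made-up factor `f`. It also shows one thing to watch for: calling `as.numeric()` directly on a factor returns the internal level codes rather than the labels.

```r
# Hypothetical column that was read in as text instead of numbers
d = data.frame(age = c("34", "51", "27"), stringsAsFactors = FALSE)
class(d$age)                  # "character"
d$age = as.numeric(d$age)
class(d$age)                  # "numeric"

# Caution with factors: as.numeric() gives the level codes, not the labels,
# so convert to character first when the labels are the numbers you want
f = factor(c("10", "20", "10"))
as.numeric(f)                 # 1 2 1 (level codes)
as.numeric(as.character(f))   # 10 20 10
```

Checking `class()` before and after a conversion is a quick way to confirm the column ended up as the type you intended.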

### Is this in that?

Checking whether some value(s) are in some container of values is a common and useful operation. The simplest case looks like:

In [1]:
5 %in% 1:10

The output of the `%in%` operator is a logical vector whose length equals the length of the argument on the *left* side of the operator. Here is a more complex example:

In [2]:
c(5, 6, 15, 2, 21) %in% 1:10

This function can come in super handy throughout a data analysis, when, for example, keeping only some special subset of observations.  

*Example.* Let's say we had some data where each row is a patient with their id and body temperature:

In [4]:
d = data.frame(patient_id = 1:10, temp = rnorm(10, 98, 1))
d

| patient_id | temp |
|---|---|
| `<int>` | `<dbl>` |
| 1 | 98.34231 |
| 2 | 97.40769 |
| 3 | 98.35531 |
| 4 | 96.73642 |
| 5 | 97.96423 |
| 6 | 97.2199 |
| 7 | 98.15124 |
| 8 | 99.20742 |
| 9 | 98.88801 |
| 10 | 98.84442 |


Imagine I want to analyze data only from patients 3, 7, 5, and 1. Instead of writing a long logical expression to pick out each of these, I can use `%in%`:

In [5]:
to_keep = c(3,7,5,1)
d[d$patient_id %in% to_keep, ]

| | patient_id | temp |
|---|---|---|
| | `<int>` | `<dbl>` |
| 1 | 1 | 98.34231 |
| 3 | 3 | 98.35531 |
| 5 | 5 | 97.96423 |
| 7 | 7 | 98.15124 |


Can you see why that worked? `d$patient_id %in% to_keep` creates a logical vector with one element per row of `d`, and this logical vector is then used to index the rows of the data frame. Note that the rows come back in the order they appear in `d`, not in the order listed in `to_keep`.
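
To make the intermediate step visible, the sketch below stores the logical vector in a variable before using it as a row index. The names `keep_rows`, `kept`, and `dropped` and the seed are illustrative choices, not part of the lab data.

```r
set.seed(1)  # arbitrary seed so the simulated temperatures are reproducible
d = data.frame(patient_id = 1:10, temp = rnorm(10, 98, 1))
to_keep = c(3, 7, 5, 1)

# One TRUE/FALSE per row of d: TRUE where the id appears in to_keep
keep_rows = d$patient_id %in% to_keep
keep_rows
sum(keep_rows)   # how many rows will be kept

kept = d[keep_rows, ]       # rows whose id is in to_keep
dropped = d[!keep_rows, ]   # negating the vector gives everyone else
```

Negating the logical vector with `!` is a convenient way to get the complementary subset, for example when excluding a list of patients instead of keeping them.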

## t-tests
I won't go into too much detail about t-tests here, but I will point out the different types of t-tests that you can do, when you should use each one, and how to run them in R.

* One-sample: When comparing a single vector of data against a null hypothesis with a known *mu* (e.g., you are testing a car manufacturer's claim that a certain car gets 40 mpg, and you have mpg data from 40 cars).  
* Paired sample: When you want to compare two measurements taken from the same set of cases (e.g., you want to compare pre- and post-test scores from the same set of students before and after completing a prep course).  
* Independent sample: When you want to compare two measurements taken from different sets of cases (e.g., you're comparing temperatures from one set of patients against temperatures from a different set of patients).  

**Doing different t tests in R**

We use the `t.test` function for all types of t-tests; the critical arguments are `mu` and `paired`.  

* *One-sample*: `t.test(vector_x, mu = 50)`  
* *Paired sample*: `t.test(vector_x, vector_y, paired = TRUE)`  
* *Independent sample:* `t.test(vector_x, vector_y) # why don't i need to specify the paired argument here?`
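
The three calls above can be tried on simulated data. This is only a sketch: every vector below is made up with `rnorm`, and the means, standard deviations, and seed are arbitrary assumptions chosen to mirror the examples in the text.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Simulated (made-up) data for illustration
mpg = rnorm(40, mean = 39, sd = 3)          # mpg measured on 40 cars
pre = rnorm(25, mean = 70, sd = 8)          # test scores before a prep course
post = pre + rnorm(25, mean = 3, sd = 4)    # same 25 students afterwards
group_a = rnorm(30, mean = 98.2, sd = 1)    # temperatures, one set of patients
group_b = rnorm(30, mean = 98.8, sd = 1)    # temperatures, a different set

one_sample = t.test(mpg, mu = 40)              # compare against the claimed 40 mpg
paired = t.test(pre, post, paired = TRUE)      # same cases measured twice
independent = t.test(group_a, group_b)         # different cases in each group

# Each result is a list; useful pieces can be pulled out by name
one_sample$p.value
paired$conf.int
independent$method
```

Storing the result in a variable, rather than only printing it, makes it easy to reuse the p-value or confidence interval later in a report.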

## Assignment
The state of Pennsylvania is drowning in data and needs your help. Flu cases are on the rise, and the department of health needs you to determine whether the state is safe. You first need to figure out what "safe" means, and then use your definition to determine whether the current flu situation in PA is safe.

We'll be using the total number of confirmed hospitalizations to inform our judgments.  
* The dataset `by_week.csv` contains, on each row, the total number of flu hospitalizations across the state of PA in a given week.  
* The dataset `last_week.csv` contains, on each row, the total number of flu hospitalizations across different hospitals in PA for the week of `2022-04-15`.

1. Assume that the total hospitalizations in each week were generated as a random sample $(W_1, W_2, W_3, \ldots, W_N)$, where the random variables $W_i$ are independent and identically distributed and $W_i$ corresponds to the number of hospitalizations in week $i$. Which distribution would you use as the probability function for these random variables, and why?

2. Estimate the parameters for this distribution using the method of moments.  

3. Calculate 95% confidence intervals around your parameter estimates.  

4. Based on what you have estimated so far, come up with a threshold for "safe" or "not safe". In other words, come up with some number of hospitalizations where, if this number is exceeded, we would conclude that PA is not safe and if this number is not exceeded we would conclude PA is safe. Support your decision with information from the analysis that you have conducted thus far.  

5. Generate a plot of number of hospitalizations over time.

6. Assess whether the number of hospitalizations in the week of `2022-04-15` should lead us to conclude that PA is safe or that PA is not safe. Write up your response as if you were communicating to the PA department of health and making a recommendation about whether the state is safe or not.  

