# Homework #1: Simple Linear Regression

## Background
The New York Times has published data on case counts and deaths for each county and state across the US. They also have a few additional datasets, one of which is a July 2020 survey of how regularly people in each county wore masks. This data comes from an online survey with about 250,000 responses.

You've been tasked with looking at the relationship between reported mask usage and case counts per capita (at the county level) for the state of Utah. This analysis uses real data, so expect some cleaning and manipulation.

Data and descriptions are available at their GitHub page (https://github.com/nytimes/covid-19-data) but for consistency I've uploaded the relevant ones to the Canvas page under "Datasets".

Relevant Datasets:
* `county_census.csv`: Population estimates in 2019 by county
* `mask-use-by-county.csv`: Reported mask usage taken from the NY Times July 2020 survey
* `us-counties.csv`: Case and death counts by county


### Task 1
First, you'll need to read in the three datasets and `merge` them together. The `SimpleLinearRegression.ipynb` file may be a useful reference.

In [1]:
# Import necessary libraries

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
from math import sqrt

import os
curr_dir = os.getcwd()
curr_dir

'C:\\Users\\Chanc\\OneDrive\\Desktop\\DATA 5600\\HW1'

In [2]:
# Load all three datasets from the current working directory so the notebook runs on any machine

county_census_df = pd.read_csv(os.path.join(curr_dir, 'county-census.csv'))
mask_use_by_county_df = pd.read_csv(os.path.join(curr_dir, 'mask-use-by-county.csv'))
us_counties_df = pd.read_csv(os.path.join(curr_dir, 'us-counties.csv'))

In [3]:
# Create merged dataframe

merged_df = (
    county_census_df
    .merge(mask_use_by_county_df, left_on='FIPS', right_on='COUNTYFP')
    .merge(us_counties_df, left_on='FIPS', right_on='fips')
)

In [4]:
# Remove unnecessary columns. Some were empty for Utah and others didn't apply to the analysis.

columns_to_remove = ['Unnamed: 0','state','county','STNAME','SUMLEV','fips','COUNTYFP','CTYNAME','confirmed_cases','confirmed_deaths','probable_cases','probable_deaths']
merged_df = merged_df.drop(columns=columns_to_remove)

In [5]:
# Normalize column names by making them lower case

merged_df.columns = merged_df.columns.str.lower()

In [6]:
# View head to check results

merged_df.head()

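| | region | division | state | county | popestimate2019 | fips | never | rarely | sometimes | frequently | always | date | cases | deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 6 | 1 | 1 | 55869 | 1001 | 0.053 | 0.074 | 0.134 | 0.295 | 0.444 | 2022-05-05 | 15840 | 216.0 |
| 1 | 3 | 6 | 1 | 3 | 223234 | 1003 | 0.083 | 0.059 | 0.098 | 0.323 | 0.436 | 2022-05-05 | 55713 | 680.0 |
| 2 | 3 | 6 | 1 | 5 | 24686 | 1005 | 0.067 | 0.121 | 0.12 | 0.201 | 0.491 | 2022-05-05 | 5671 | 98.0 |
| 3 | 3 | 6 | 1 | 7 | 22394 | 1007 | 0.02 | 0.034 | 0.096 | 0.278 | 0.572 | 2022-05-05 | 6444 | 104.0 |
| 4 | 3 | 6 | 1 | 9 | 57826 | 1009 | 0.053 | 0.114 | 0.18 | 0.194 | 0.459 | 2022-05-05 | 14985 | 243.0 |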
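
A merge like the one above silently drops counties whose keys do not match in both files. The sketch below shows one way to check for that, using pandas' `validate` and `indicator` options on a left join. The two toy frames are hypothetical stand-ins (only the key column names `FIPS` and `COUNTYFP` come from the notebook; the `ALWAYS` column name is an assumption about the raw file), so treat this as an illustration rather than the notebook's actual data.

```python
import pandas as pd

# Hypothetical toy frames that mimic the key columns used in the merge above.
census = pd.DataFrame({
    'FIPS': [49001, 49003, 49005],
    'POPESTIMATE2019': [6710, 56046, 128289],
})
masks = pd.DataFrame({
    'COUNTYFP': [49001, 49003],   # 49005 is deliberately missing
    'ALWAYS': [0.320, 0.419],     # assumed raw column name
})

# how='left' keeps every census row; validate raises if keys are duplicated;
# indicator adds a '_merge' column saying where each row was found.
checked = census.merge(
    masks,
    left_on='FIPS',
    right_on='COUNTYFP',
    how='left',
    validate='one_to_one',
    indicator=True,
)

# Rows that would have been dropped by the default inner join.
unmatched = checked[checked['_merge'] != 'both']
print(unmatched[['FIPS', '_merge']])
```

If `unmatched` is empty on the real data, the inner join used in the notebook loses no counties at that step.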


### Task 2
Since we're only interested in Utah, create a new data frame that contains only the merged data for the state of Utah. If you need to do some Googling, I'd suggest searching "conditional subset pandas dataframe".

In [7]:
# Keep only the Utah rows (Utah's state FIPS code is 49)

ut_df = merged_df[merged_df['state'] == 49]
ut_df.head()

| | region | division | state | county | popestimate2019 | fips | never | rarely | sometimes | frequently | always | date | cases | deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2768 | 4 | 8 | 49 | 1 | 6710 | 49001 | 0.099 | 0.026 | 0.271 | 0.283 | 0.32 | 2022-05-05 | 1609 | 16.0 |
| 2769 | 4 | 8 | 49 | 3 | 56046 | 49003 | 0.084 | 0.116 | 0.111 | 0.27 | 0.419 | 2022-05-05 | 14100 | 124.0 |
| 2770 | 4 | 8 | 49 | 5 | 128289 | 49005 | 0.09 | 0.108 | 0.08 | 0.31 | 0.411 | 2022-05-05 | 37192 | 109.0 |
| 2771 | 4 | 8 | 49 | 7 | 20463 | 49007 | 0.107 | 0.134 | 0.114 | 0.293 | 0.353 | 2022-05-05 | 5234 | 46.0 |
| 2772 | 4 | 8 | 49 | 9 | 950 | 49009 | 0.091 | 0.296 | 0.102 | 0.33 | 0.181 | 2022-05-05 | 151 | 0.0 |
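
Conditional subsetting can be written in more than one way. The sketch below, on a small toy frame built from three rows shown in the outputs above, compares a boolean mask with `DataFrame.query` and takes an explicit `.copy()` so that columns added to the subset later do not trigger pandas' `SettingWithCopyWarning`. The frame name `toy` is hypothetical.

```python
import pandas as pd

# Toy frame using three rows from the outputs above (one Alabama, two Utah).
toy = pd.DataFrame({
    'state': [1, 49, 49],
    'fips': [1001, 49001, 49003],
    'cases': [15840, 1609, 14100],
    'popestimate2019': [55869, 6710, 56046],
})

# Boolean mask plus an explicit copy: the result is independent of `toy`.
utah = toy.loc[toy['state'] == 49].copy()

# Equivalent selection written as a query string.
utah_q = toy.query('state == 49')

print(utah.equals(utah_q))
print(utah)
```

Either form works; the explicit copy matters mainly when the subset is modified afterwards.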


### Task 3
The case count data is just an absolute cumulative count, meaning that larger counties unsurprisingly have much larger case counts. This may skew our results since we're just interested in the relative relationship between reported mask usage and cases. Create a new variable that is a ratio of `cases/population`. Note: the name of the population variable is `POPESTIMATE2019`.

In [8]:
# Create cases_per_capita variable

# Work on an explicit copy so the assignment does not raise SettingWithCopyWarning
ut_df = ut_df.copy()
ut_df['cases_per_capita'] = ut_df['cases'] / ut_df['popestimate2019']


In [9]:
# Drop rows with missing values so they do not skew the regression

ut_df = ut_df.dropna()

In [10]:
# Show full dataframe to check things before analysis is done

ut_df

| | region | division | state | county | popestimate2019 | fips | never | rarely | sometimes | frequently | always | date | cases | deaths | cases_per_capita |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2768 | 4 | 8 | 49 | 1 | 6710 | 49001 | 0.099 | 0.026 | 0.271 | 0.283 | 0.32 | 2022-05-05 | 1609 | 16.0 | 0.239791 |
| 2769 | 4 | 8 | 49 | 3 | 56046 | 49003 | 0.084 | 0.116 | 0.111 | 0.27 | 0.419 | 2022-05-05 | 14100 | 124.0 | 0.251579 |
| 2770 | 4 | 8 | 49 | 5 | 128289 | 49005 | 0.09 | 0.108 | 0.08 | 0.31 | 0.411 | 2022-05-05 | 37192 | 109.0 | 0.289908 |
| 2771 | 4 | 8 | 49 | 7 | 20463 | 49007 | 0.107 | 0.134 | 0.114 | 0.293 | 0.353 | 2022-05-05 | 5234 | 46.0 | 0.255779 |
| 2772 | 4 | 8 | 49 | 9 | 950 | 49009 | 0.091 | 0.296 | 0.102 | 0.33 | 0.181 | 2022-05-05 | 151 | 0.0 | 0.158947 |
| 2773 | 4 | 8 | 49 | 11 | 355481 | 49011 | 0.035 | 0.055 | 0.077 | 0.314 | 0.518 | 2022-05-05 | 100743 | 403.0 | 0.283399 |
| 2774 | 4 | 8 | 49 | 13 | 19938 | 49013 | 0.051 | 0.259 | 0.033 | 0.469 | 0.188 | 2022-05-05 | 4371 | 27.0 | 0.21923 |
| 2775 | 4 | 8 | 49 | 15 | 10012 | 49015 | 0.122 | 0.109 | 0.106 | 0.239 | 0.424 | 2022-05-05 | 2472 | 28.0 | 0.246904 |
| 2776 | 4 | 8 | 49 | 17 | 5051 | 49017 | 0.062 | 0.056 | 0.232 | 0.264 | 0.386 | 2022-05-05 | 929 | 16.0 | 0.183924 |
| 2777 | 4 | 8 | 49 | 19 | 9754 | 49019 | 0.056 | 0.111 | 0.098 | 0.176 | 0.558 | 2022-05-05 | 2322 | 6.0 | 0.238056 |
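
As a quick sanity check on the new column, the ratio can be recomputed by hand for a few rows. The sketch below uses the first three Utah counties shown above and compares `cases / population` with the displayed `cases_per_capita` values; the variable names are only for illustration.

```python
# (county code, cumulative cases, 2019 population, displayed cases_per_capita)
rows = [
    (1, 1609, 6710, 0.239791),
    (3, 14100, 56046, 0.251579),
    (5, 37192, 128289, 0.289908),
]

for county, cases, population, displayed in rows:
    recomputed = cases / population
    # The table shows six decimals, so allow for display rounding.
    matches = abs(recomputed - displayed) < 1e-6
    print(f"county {county}: recomputed={recomputed:.6f} displayed={displayed} match={matches}")
```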


### Task 4
Finally, run a regression where our **predictor variable** is the proportion of people who responded "Always" to the question "*How often do you wear a mask in public when you expect to be within six feet of another person?*" and the **response variable** is your newly created `cases_per_capita` variable.

In [11]:
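# Fit the model and show the summary.
# Note: as written, `always` is the response and `cases_per_capita` is the predictor;
# the Task 4 wording corresponds to the formula 'cases_per_capita ~ always'.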

results = smf.ols('always ~ cases_per_capita', data=ut_df).fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | always | R-squared: | 0.284 |
| Model: | OLS | Adj. R-squared: | 0.258 |
| Method: | Least Squares | F-statistic: | 10.72 |
| Date: | Thu, 09 May 2024 | Prob (F-statistic): | 0.0029 |
| Time: | 18:29:49 | Log-Likelihood: | 22.66 |
| No. Observations: | 29 | AIC: | -41.32 |
| Df Residuals: | 27 | BIC: | -38.59 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.0600 | 0.112 | 0.536 | 0.596 | -0.170 | 0.290 |
| cases_per_capita | 1.4642 | 0.447 | 3.275 | 0.003 | 0.547 | 2.382 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.88 | Durbin-Watson: | 1.687 |
| Prob(Omnibus): | 0.391 | Jarque-Bera (JB): | 1.094 |
| Skew: | -0.029 | Prob(JB): | 0.579 |
| Kurtosis: | 2.05 | Cond. No. | 22.2 |
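
Task 4 names `always` as the predictor and `cases_per_capita` as the response, which in a statsmodels formula would read `'cases_per_capita ~ always'`. The sketch below illustrates how the two directions relate, using closed-form simple-regression formulas in NumPy rather than statsmodels so it stays self-contained. It uses only the ten Utah rows displayed earlier, not the full 29-county data, so its numbers will not match the summary above; the helper `simple_ols` is hypothetical.

```python
import numpy as np

# Values copied from the ten Utah rows displayed earlier (a subset of the data).
always = np.array([0.320, 0.419, 0.411, 0.353, 0.181,
                   0.518, 0.188, 0.424, 0.386, 0.558])
cases_per_capita = np.array([0.239791, 0.251579, 0.289908, 0.255779, 0.158947,
                             0.283399, 0.219230, 0.246904, 0.183924, 0.238056])

def simple_ols(x, y):
    """Return (intercept, slope, r_squared) for the fit y = intercept + slope * x."""
    slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    intercept = y.mean() - slope * x.mean()
    r_squared = np.corrcoef(x, y)[0, 1] ** 2
    return intercept, slope, r_squared

# Direction described in Task 4: predictor = always, response = cases_per_capita.
b0, b1, r2 = simple_ols(always, cases_per_capita)

# Direction of the formula 'always ~ cases_per_capita'.
a0, a1, r2_rev = simple_ols(cases_per_capita, always)

print(f"cases_per_capita ~ always: intercept={b0:.4f}, slope={b1:.4f}, R^2={r2:.4f}")
print(f"always ~ cases_per_capita: intercept={a0:.4f}, slope={a1:.4f}, R^2={r2_rev:.4f}")
```

With a single predictor, swapping the two variables leaves R-squared unchanged, while the slope and intercept change, so the choice of direction matters mainly for how the coefficients are interpreted.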


### Questions

1. Are the coefficient estimates significant? What evidence is there to support your answer?

2. What proportion of variance in the response is explained by this model?

3. How would you interpret the estimates of the coefficient?

4. Does your model make sense intuitively? What could explain this result?



### Answers

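1. Yes. The slope on `cases_per_capita` has a p-value of 0.003, well below the usual 0.05 threshold, and its 95% confidence interval (0.547 to 2.382) excludes zero, so the relationship is unlikely to be due to random chance alone. The intercept, with a p-value of 0.596, is not significantly different from zero.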

2. R-squared is 0.284, so about 28.4% of the variance in the response is explained by the model. As fitted, the response is the proportion who always wear a mask and the predictor is cases per capita.

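3. The coefficient estimate for `cases_per_capita` is 1.4642, so each one-unit increase in `cases_per_capita` is associated with an expected increase of 1.4642 in `always`. Because both variables are proportions, a more practical reading is that an increase of 0.01 in cases per capita (one additional case per 100 residents) is associated with an increase of about 0.015 (1.5 percentage points) in the proportion who always wear a mask.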

4. Intuitively, the model does make sense. It suggests that counties with higher case counts per capita also report wearing masks more often, which is plausible if people become more aware and take more precautions when they see more cases around them. One caveat is timing: the mask survey is from July 2020, while the case counts are cumulative through 2022-05-05, so the result shows an association rather than evidence that rising cases caused more mask wearing.
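
The figures quoted in these answers can be cross-checked against each other with a little arithmetic. The sketch below takes the rounded values from the regression summary and recomputes the t-statistic, the F-statistic, and adjusted R-squared using the standard formulas for a model with one predictor. Small differences from the summary are expected because the inputs are already rounded.

```python
# Rounded values read off the regression summary.
coef, std_err = 1.4642, 0.447
r_squared, n_obs, df_resid = 0.284, 29, 27

# t-statistic for the slope: estimate divided by its standard error.
t_stat = coef / std_err

# With one predictor (df_model = 1), F = (R^2 / (1 - R^2)) * df_resid.
f_stat = (r_squared / (1 - r_squared)) * df_resid

# Adjusted R^2 penalizes R^2 for the number of estimated parameters.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Expected change in `always` for a 0.01 increase in cases_per_capita.
effect_per_point = coef * 0.01

print(f"t = {t_stat:.3f}, F = {f_stat:.2f}, adj. R^2 = {adj_r_squared:.3f}")
print(f"change in always per 0.01 increase in cases_per_capita: {effect_per_point:.4f}")
```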