# Simple Linear Regression
By Adrian Chavez-Loya

## Background
The New York Times has published case counts and deaths for each county and state across the US. They also have a few additional datasets, one of which is a July 2020 survey of how regularly people in each county wore masks. This data comes from an online survey with about 250,000 responses.

You've been tasked with looking at the ***relationship between reported mask usage and case counts per capita (at the county level) for the state of Utah***. This analysis uses real data, so expect some cleaning and manipulation.

Data and descriptions are available at their GitHub page (https://github.com/nytimes/covid-19-data) but for consistency I've uploaded the relevant ones to the Canvas page under "Datasets".

Relevant Datasets:
* `county_census.csv`: Population estimates in 2019 by county
* `mask-use-by-county.csv`: Reported mask usage taken from the NY Times July 2020 survey
* `us-counties.csv`: Case and death counts by county


### Task 1
First, you'll need to read in the three datasets and `merge` them together. The `SimpleLinearRegression.ipynb` file may be a useful reference.

In [37]:
from pathlib import Path
import pandas as pd

# Folder that holds the three CSV files (local path; adjust for your machine)
DATA_DIR = Path('/Users/adrianchavezloya/Desktop/Summer 2024/Intro to Regression:Machine Learning/Module 1 and 2/HW1_Simple_Linear_Regression')

# Read in the three data sets
county_census = pd.read_csv(DATA_DIR / 'county_census.csv')
mask_data = pd.read_csv(DATA_DIR / 'mask-use-by-county.csv')
us_counties = pd.read_csv(DATA_DIR / 'us-counties.csv')

# Merge the three data sets on the county FIPS code
# (merge defaults to an inner join, so counties missing from either side are dropped)
merged_data = us_counties.merge(mask_data, left_on='fips', right_on='COUNTYFP')
merged_data = merged_data.merge(county_census, left_on='fips', right_on='FIPS')
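
A merge on FIPS codes only lines up when both key columns share a dtype and each county appears once in the lookup table. The sketch below shows one way to check this with the `indicator` and `validate` options of `merge`. The `demo_*` frames and all of their values are made up for illustration and are not the NYT data.

```python
import pandas as pd

# Toy stand-ins for the real files; every value here is invented.
demo_cases = pd.DataFrame({"fips": [49001.0, 49003.0, 49005.0],
                           "cases": [10, 20, 30]})
demo_masks = pd.DataFrame({"COUNTYFP": [49001, 49003],
                           "ALWAYS": [0.4, 0.6]})

# Cast the float key to int so both keys share a dtype.
# (Real data with missing FIPS values would need those rows handled first.)
demo_cases["fips"] = demo_cases["fips"].astype("int64")

# how="left" keeps unmatched rows, indicator=True adds a `_merge` column,
# and validate="m:1" raises if a county appears twice in demo_masks.
demo_merged = demo_cases.merge(demo_masks, left_on="fips", right_on="COUNTYFP",
                               how="left", indicator=True, validate="m:1")

# Rows present in the case data but absent from the mask survey.
unmatched = demo_merged[demo_merged["_merge"] == "left_only"]
print(unmatched["fips"].tolist())
```

Running a left join with `indicator=True` once, before switching back to the default inner join, is a cheap way to see how many counties would otherwise be dropped silently.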

### Task 2
Since we're only interested in Utah, create a new data frame that contains only the merged data for the state of Utah. If you need to do some Googling, I'd suggest searching "conditional subset pandas dataframe".

In [39]:
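# Filter for Utah; .copy() gives an independent frame so that adding
# columns later does not raise a SettingWithCopyWarning
utah_data = merged_data[merged_data['state'] == 'Utah'].copy()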
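
As a minimal sketch of how a conditional subset works, the example below builds the boolean mask explicitly before using it to index the frame. The `demo` frame and its rows are invented for illustration.

```python
import pandas as pd

# Invented rows for illustration only.
demo = pd.DataFrame({"state": ["Utah", "Nevada", "Utah"],
                     "county": ["Salt Lake", "Clark", "Utah"],
                     "cases": [100, 200, 50]})

# The comparison yields a boolean Series; indexing with it keeps the True rows.
is_utah = demo["state"] == "Utah"

# .copy() makes an independent frame, so later column assignments
# act on demo_utah itself rather than on a slice of demo.
demo_utah = demo[is_utah].copy()

print(is_utah.tolist())
print(demo_utah["county"].tolist())
```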

### Task 3
The case count data is just an absolute cumulative count, meaning that larger counties unsurprisingly have much larger case counts. This may skew our results since we're just interested in the relative relationship between reported mask usage and cases. Create a new variable that is a ratio of `cases/population`. Note: the name of the population variable is `POPESTIMATE2019`.

In [45]:
# Cases per capita: cumulative cases divided by the 2019 population estimate
utah_data.loc[:, 'cases_per_capita'] = utah_data['cases'] / utah_data['POPESTIMATE2019']
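
To illustrate why the ratio matters, here is a small sketch with two made-up counties: the larger one has ten times the cases but half the rate. It uses `DataFrame.assign`, an alternative to `.loc` that returns a new frame. The `demo` values are assumptions chosen for the example.

```python
import pandas as pd

# Two invented counties: A is large, B is small.
demo = pd.DataFrame({"county": ["A", "B"],
                     "cases": [500, 50],
                     "POPESTIMATE2019": [10000, 500]})

# assign returns a new frame with the extra column and leaves `demo` unchanged.
demo_rates = demo.assign(
    cases_per_capita=demo["cases"] / demo["POPESTIMATE2019"]
)

print(demo_rates[["county", "cases_per_capita"]])
```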

### Task 4
Finally, run a regression where our **predictor variable** is the proportion of people that responded "Always" to the question of "*How often do you wear a mask in public when you expect to be within six feet of another person?*" and the **response variable** is your newly created `cases_per_capita` variable.

In [47]:
# Import statsmodels
import statsmodels.api as sm

# Extract relevant variables
X = utah_data['ALWAYS']  #People who always wear masks 
y = utah_data['cases_per_capita'] 

# Add a constant to the predictor variable
X = sm.add_constant(X) 

# Fit the regression model using OLS
model = sm.OLS(y, X).fit() 

# Print regression results using summary
print(model.summary()) 


                            OLS Regression Results                            
Dep. Variable:       cases_per_capita   R-squared:                       0.284
Model:                            OLS   Adj. R-squared:                  0.258
Method:                 Least Squares   F-statistic:                     10.72
Date:                Thu, 23 May 2024   Prob (F-statistic):            0.00290
Time:                        15:47:20   Log-Likelihood:                 51.958
No. Observations:                  29   AIC:                            -99.92
Df Residuals:                      27   BIC:                            -97.18
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.1642      0.026      6.300      0.0
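
For intuition about what the OLS fit computes with a single predictor, the sketch below works out the slope, intercept, and R-squared by hand from the usual closed-form expressions. The `x` and `y` values are invented for illustration and are not the Utah data.

```python
# Invented (x, y) pairs for illustration; not the Utah data.
x = [0.2, 0.4, 0.6, 0.8]
y = [0.10, 0.16, 0.18, 0.24]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope = sum of cross-deviations / sum of squared x-deviations.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx

# The fitted line passes through the point of means.
intercept = y_bar - slope * x_bar

# R-squared = 1 - residual sum of squares / total sum of squares.
fitted = [intercept + slope * xi for xi in x]
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(round(slope, 4), round(intercept, 4), round(r_squared, 4))
```

The intercept here plays the role of the `const` column that `sm.add_constant` adds in the statsmodels version.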

### Questions

1. ***Are the coefficient estimates significant? What evidence is there to support your answer?***
The coefficient estimate for the `ALWAYS` variable (the proportion of respondents who always wore a mask) is 0.1941 with a standard error of 0.059. The t-statistic is 3.275 and the associated p-value is 0.003. Since the p-value is below the conventional significance level of 0.05, the `ALWAYS` coefficient is statistically significant. The intercept (0.1642, t = 6.300) is also significant.

2. ***What proportion of variance in the response is explained by this model?***
The R-squared value of 0.284 indicates that approximately 28.4% of the variance in the response `cases_per_capita` is explained by the single predictor `ALWAYS`.

3. ***How would you interpret the estimates of the coefficient?***
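The coefficient estimate for the `ALWAYS` variable is 0.1941. Because `ALWAYS` is a proportion between 0 and 1, a full one-unit increase is the jump from nobody to everybody always wearing a mask; a more practical reading is that a 10-percentage-point increase in `ALWAYS` is associated with an expected increase of about 0.019 in `cases_per_capita`. The intercept, 0.1642, is the expected `cases_per_capita` for a county where `ALWAYS` is 0.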

4. ***Does your model make sense intuitively? What could explain this result?***
The positive coefficient means that counties where more respondents reported always wearing a mask tend to have higher cases per capita, which is the opposite of what one would expect if masks reduce transmission. However, correlation **does not** imply causation. One plausible explanation, not tested here, is confounding or reverse causality: for example, denser counties may have had both more cases and more mask wearing. Further analysis is needed to understand the underlying factors driving this relationship.
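
The figures quoted in the answers above can be cross-checked with a little arithmetic. In a regression with one predictor, the t-statistic is the coefficient divided by its standard error, the F-statistic equals the squared t-statistic, and R-squared can be recovered from F and the degrees of freedom. The sketch below only reuses numbers printed in the summary and the answers; the helper `predicted_cases_per_capita` is a hypothetical name introduced for illustration. Small gaps from the reported values are expected because the inputs are rounded.

```python
# Values copied from the summary output and the answers above.
intercept = 0.1642   # `const` row
slope = 0.1941       # ALWAYS coefficient
std_err = 0.059      # standard error of the slope (rounded in the summary)
t_reported = 3.275
f_reported = 10.72
n_obs, df_model, df_resid = 29, 1, 27

# Q1: t = coefficient / standard error, and with one predictor F = t**2.
t_from_rounded = slope / std_err
f_from_t = t_reported ** 2

# Q2: R-squared and adjusted R-squared recovered from F and the degrees of freedom.
r_squared = f_reported * df_model / (f_reported * df_model + df_resid)
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Q3: fitted change for a 10-percentage-point rise in the ALWAYS share.
def predicted_cases_per_capita(always_share):
    """Fitted value for a county where `always_share` (0 to 1) always wear masks."""
    return intercept + slope * always_share

delta_10_points = predicted_cases_per_capita(0.60) - predicted_cases_per_capita(0.50)

print(f"t ~ {t_from_rounded:.2f}, F ~ {f_from_t:.2f}")
print(f"R^2 ~ {r_squared:.3f}, adjusted R^2 ~ {adj_r_squared:.3f}")
print(f"10-point change ~ {delta_10_points:.4f}")
```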