# Lab 4: Inference

## Reminders on using Google CoLab:

* Every time you reopen the lab, you need to upload the data file again
* Every time you reopen the lab, you need to run all cells again from the beginning
* If you need to add cells, hover your mouse between two cells and 
    * select `+ Code` to create a cell for your code
    * select `+ Text` to create a cell for discussing your results, including your answers to the lab's questions

You are welcome to save a copy of this project into your own Google folder.
* In the `File` menu, select `Save a copy in Drive`
* You can now edit the project and save your work

## Instructions

In Lab 4, we are going to use what we learned about Normal distributions in Lab 3 to make inferences from samples. We are going to look at these 4 topics:
1. Confidence intervals with 1 sample
2. Hypothesis testing with 1 sample
3. Inference with 2 samples (both confidence intervals and hypothesis testing)
4. Chi-squared tests

For the example code, I am using the [Duke Forest Housing Dataset](https://www.openintro.org/data/csv/duke_forest.csv) from our [textbook dataset website](https://www.openintro.org/data/). You will want to download the dataset and then load it into Google CoLab.

After we go through the example code, your job will be to select a dataset from the [textbook dataset website](https://www.openintro.org/data/) and then use these tools to perform an inference. You are welcome to use the same dataset that you used for Labs 1, 2, and/or 3. You should also review the previous labs for tools that you may need in this one:
* Lab 1: Graphing
* Lab 2: Summary Statistics
* Lab 3: Normal Distributions

__In this example, I am going to do inference on a proportion that is built from a quantitative variable (`area`).__

## Load the Data

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats

import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
duke = pd.read_csv("../Datasets/duke_forest.csv")
duke.head()

Unnamed: 0,address,price,bed,bath,area,type,year_built,heating,cooling,parking,lot,hoa,url
0,"1 Learned Pl, Durham, NC 27705",1520000,3,4.0,6040,Single Family,1972,"Other, Gas",central,0 spaces,0.97,,https://www.zillow.com/homedetails/1-Learned-P...
1,"1616 Pinecrest Rd, Durham, NC 27705",1030000,5,4.0,4475,Single Family,1969,"Forced air, Gas",central,"Carport, Covered",1.38,,https://www.zillow.com/homedetails/1616-Pinecr...
2,"2418 Wrightwood Ave, Durham, NC 27705",420000,2,3.0,1745,Single Family,1959,"Forced air, Gas",central,"Garage - Attached, Covered",0.51,,https://www.zillow.com/homedetails/2418-Wright...
3,"2527 Sevier St, Durham, NC 27705",680000,4,3.0,2091,Single Family,1961,"Heat pump, Other, Electric, Gas",central,"Carport, Covered",0.84,,https://www.zillow.com/homedetails/2527-Sevier...
4,"2218 Myers St, Durham, NC 27707",428500,4,3.0,1772,Single Family,2020,"Forced air, Gas",central,0 spaces,0.16,,https://www.zillow.com/homedetails/2218-Myers-...


## Separating data into groups using quantitative data

In [3]:
# Create a condition and see what data fits
duke['area'] >= 3500

0      True
1      True
2     False
3     False
4     False
      ...  
93    False
94    False
95    False
96     True
97     True
Name: area, Length: 98, dtype: bool

In [4]:
# Take the sum of the True values to count how many

count_largehouses = (duke['area'] >= 3500).sum()

print("Number of houses with area >= 3500 sqft: ", count_largehouses)
print("Total number of houses: ", len(duke))    

p_largehouses = count_largehouses / len(duke)

print("Proportion of houses with area >= 3500 sqft: ", p_largehouses)

Number of houses with area >= 3500 sqft:  19
Total number of houses:  98
Proportion of houses with area >= 3500 sqft:  0.19387755102040816
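
Before using the Normal model for this proportion, it can help to check that the sample has enough "successes" and "failures". The sketch below applies the success-failure condition, assuming the common rule-of-thumb cutoff of 10 for each group (the cutoff and the variable names are assumptions for illustration, not part of the original lab code). The counts are copied from the output above.

```python
# Counts copied from the output above: 19 large houses out of 98
n = 98
count_large = 19
p_hat = count_large / n

successes = n * p_hat        # number of large houses
failures = n * (1 - p_hat)   # number of houses that are not large

threshold = 10  # assumed rule-of-thumb cutoff for the success-failure condition
condition_met = successes >= threshold and failures >= threshold

print("Successes:", round(successes), " Failures:", round(failures))
print("Success-failure condition met:", condition_met)
```

For a hypothesis test, the same check is usually done with the null value $p_0$ in place of $\hat{p}$.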


## Confidence Interval for 1 sample

Recall the steps for forming a confidence interval:
1. Confidence Level
2. Critical Value
3. Find your Point Estimate ($\hat{p}$)
4. Calculate SE (Standard Error): $SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
5. Calculate ME (Margin of Error): $ME = z_c SE$
6. Calculate the confidence interval: $p = \hat{p}\pm ME$
7. Make your conclusion

__Question__: Find the 95% confidence interval for the proportion of large houses in the Duke Forest area.

In [5]:
# Longhand method for large houses confidence interval
C = 0.95
tail_area = (1 - C) / 2  # area in each tail outside the confidence level

zcrit = stats.norm.ppf(tail_area)  # This gives the critical value of the left tail, so it will be negative
zcrit = abs(zcrit)                 # Critical value should be positive

se_largehouses = np.sqrt((p_largehouses * (1 - p_largehouses)) / len(duke))  # Standard Error
me_largehouses = zcrit * se_largehouses                                      # Margin of Error

print("95% confidence interval for proportion of houses with area >= 3500 sqft:")
print("Lower bound = ", p_largehouses - me_largehouses)
print("Upper bound = ", p_largehouses + me_largehouses)

95% confidence interval for proportion of houses with area >= 3500 sqft:
Lower bound =  0.11560683196507983
Upper bound =  0.2721482700757365


In [6]:
# Python shortcut for large houses confidence interval

###  NOTE:  You do not have to do both methods. This is just to show you that they give the same result.  ###

C = 0.95

se_largehouses = np.sqrt((p_largehouses * (1 - p_largehouses)) / len(duke))
ci_largehouses = stats.norm.interval(C, loc=p_largehouses, scale=se_largehouses)

print("95% confidence interval for proportion of houses with area >= 3500 sqft: ", ci_largehouses)

95% confidence interval for proportion of houses with area >= 3500 sqft:  (np.float64(0.11560683196507983), np.float64(0.2721482700757365))
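
As a check on the code, the same interval can be worked by hand from the values printed earlier ($\hat{p} = 19/98$, $n = 98$, $z_c \approx 1.96$). The numbers below are rounded, so the last digit may differ slightly from the Python output:

$$
\hat{p} = \frac{19}{98} \approx 0.1939, \qquad SE = \sqrt{\frac{0.1939\,(1-0.1939)}{98}} \approx 0.03993
$$

$$
ME = z_c \, SE \approx 1.96 \times 0.03993 \approx 0.0783, \qquad \hat{p} \pm ME \approx (0.116,\ 0.272)
$$

One way to word the conclusion (step 7): we are 95% confident that between about 11.6% and 27.2% of houses in the Duke Forest area are large (area of at least 3500 sqft).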


## Hypothesis Test for 1 sample

Recall the steps for performing a hypothesis test:
1. Determine the null value
2. Hypotheses
3. Level of Significance
4. Point Estimate ($\hat{p}$) and Standard Error
5. Calculate Test Statistic ($Z=\frac{\hat{p}-p_0}{SE}$)
6. Calculate the p-value and compare to level of significance
7. Make your conclusion

__Question__: In some areas near Duke Forest, at least 45% of homes are classified as large (area of at least 3500 sqft). Using a 5% level of significance, test whether the proportion of large houses in Duke Forest is at least 45% or is below that value.

* $p_0 = 0.45$
* H0: $p = 0.45$
* HA: $p < 0.45$
* $\alpha = 0.05$

In [7]:
alpha = 0.05
null_value = 0.45

# Test Statistic (the standard error uses the null value, because the test assumes H0 is true)
se = np.sqrt((null_value * (1 - null_value)) / len(duke))
z_largehouses = (p_largehouses - null_value) / se
print("Test statistic (z) for large houses hypothesis test: ", z_largehouses) # This will be negative since p_largehouses < null_value

# P-value
p_value_largehouses = stats.norm.cdf(z_largehouses)  # since this is a left-tailed test
print("P-value for large houses hypothesis test: ", p_value_largehouses)

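# Make decision and print the conclusion that matches it
if p_value_largehouses < alpha:
    print("Reject the null hypothesis.")
    print("Conclusion: There is sufficient evidence to conclude that the proportion of large houses is less than 45%.")
else:
    print("Fail to reject the null hypothesis.")
    print("Conclusion: There is not sufficient evidence to conclude that the proportion of large houses is less than 45%.")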

Test statistic (z) for large houses hypothesis test:  -5.096512362405544
P-value for large houses hypothesis test:  1.7298399024450022e-07
Reject the null hypothesis.
Conclusion: There is sufficient evidence to conclude that the proportion of large houses is less than 45%.
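
The test statistic can also be checked by hand. Using the rounded values $\hat{p} \approx 0.1939$, $p_0 = 0.45$, and $n = 98$ (so the last digit may differ slightly from the Python output):

$$
SE = \sqrt{\frac{p_0(1-p_0)}{n}} = \sqrt{\frac{0.45 \times 0.55}{98}} \approx 0.05025
$$

$$
Z = \frac{\hat{p} - p_0}{SE} \approx \frac{0.1939 - 0.45}{0.05025} \approx -5.10
$$

A sample proportion more than 5 standard errors below the null value is very unlikely if $p = 0.45$, which is why the left-tail p-value is far below $\alpha = 0.05$.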


-----

## Your turn to do inference

Your job now is to select a dataset from the [textbook dataset website](https://www.openintro.org/data/) and perform a similar analysis. You are welcome to use the same dataset that you used for Lab 1, Lab 2, and/or Lab 3.

### Instructions
1. Look at your dataset and create a question that you can answer. 
    * Look for a clear comparison (for example, set up a two-way (contingency) table between two variables and see whether the proportion of one variable changes based on the other)
    * Set up your question using this comparison. 
    * For example, your sample data shows that the percentage of customers who like diet sodas is higher for Coke drinkers than it is for Pepsi drinkers. So, your question may be, "Is there a higher proportion of Coke drinkers who prefer diet soda than there is for Pepsi drinkers?"
2. Determine a null value
3. Determine your null and alternate hypotheses
4. Determine your level of significance
    * Make sure you do this *before* doing any calculations
5. Perform the hypothesis test
    * Find the test statistic
    * Find the p-value
    * Determine whether you reject H0 or not
    * Make a conclusion
6. Find the 95% confidence interval of your study
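
For step 1, one possible way to look for a comparison is a table of counts and row proportions made with `pd.crosstab`. The sketch below uses a tiny made-up survey (the `survey` DataFrame and its `brand` and `diet` columns are hypothetical, for illustration only). With your own data, you would pass two columns from your dataset instead.

```python
import pandas as pd

# Hypothetical survey responses, made up for illustration only
survey = pd.DataFrame({
    "brand": ["Coke", "Coke", "Coke", "Coke", "Pepsi", "Pepsi", "Pepsi", "Pepsi"],
    "diet":  ["yes",  "yes",  "yes",  "no",   "yes",   "no",    "no",    "no"],
})

# Table of counts: rows are brands, columns are diet preference
counts = pd.crosstab(survey["brand"], survey["diet"])

# Same table as proportions within each row (each row sums to 1)
row_props = pd.crosstab(survey["brand"], survey["diet"], normalize="index")

print(counts)
print(row_props)
```

If the row proportions look different between the groups, that difference is a candidate for your question.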

Note that your question may be for 1 sample or for 2 samples. You can choose what you need to answer your question.
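
If your question compares two groups, the same steps can be adapted to a difference of two proportions. The sketch below is one possible setup, not a required method: the counts are made up, the variable names are hypothetical, and it assumes a pooled standard error for the test and an unpooled one for the confidence interval. It uses Python's built-in `statistics.NormalDist` so that it runs on its own; `NormalDist().cdf` and `NormalDist().inv_cdf` play the same role as `stats.norm.cdf` and `stats.norm.ppf` in the example above.

```python
import math
from statistics import NormalDist

# Hypothetical counts, made up for illustration only
n_coke, diet_coke = 120, 54     # Coke drinkers, and how many of them prefer diet
n_pepsi, diet_pepsi = 100, 33   # Pepsi drinkers, and how many of them prefer diet

alpha = 0.05                    # level of significance, chosen before calculating
p1 = diet_coke / n_coke
p2 = diet_pepsi / n_pepsi

# Hypothesis test: H0: p1 = p2, HA: p1 > p2 (right-tailed test)
p_pool = (diet_coke + diet_pepsi) / (n_coke + n_pepsi)   # pooled proportion under H0
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_coke + 1 / n_pepsi))
z = (p1 - p2) / se_pool
p_value = 1 - NormalDist().cdf(z)                        # area in the right tail

print("Test statistic (z):", z)
print("P-value:", p_value)
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

# 95% confidence interval for p1 - p2 (unpooled standard error)
C = 0.95
z_crit = NormalDist().inv_cdf(1 - (1 - C) / 2)
se_diff = math.sqrt(p1 * (1 - p1) / n_coke + p2 * (1 - p2) / n_pepsi)
ci = (p1 - p2 - z_crit * se_diff, p1 - p2 + z_crit * se_diff)
print("95% confidence interval for p1 - p2:", ci)
```

The alternative hypothesis here is right-tailed because the example question asks whether the first proportion is higher. For a "different from" question, the p-value would use both tails instead.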

-----
To submit your assignment,
* Click on the `Share` button above
* Under General Access, select `Anyone with the link`
* Copy the link
* Paste the link into Canvas
* Click `Submit`