# Phase 2 Code Challenge

This code challenge is designed to test your understanding of the Phase 2 material. It covers:

- Normal Distribution
- Statistical Tests
- Bayesian Statistics
- Linear Regression

_Read the instructions carefully_. You will be asked both to write code and to answer short answer questions.

## Code Tests

We have provided some code tests for you to run to check that your work meets the item specifications. Passing these tests does not necessarily mean that you have gotten the item correct - there are additional hidden tests. However, if any of the tests do not pass, this tells you that your code is incorrect and needs changes to meet the specification. To determine what the issue is, read the comments in the code test cells, the error message you receive, and the item instructions.

## Short Answer Questions 

For the short answer questions...

* _Use your own words_. It is OK to refer to outside resources when crafting your response, but _do not copy text from another source_.

* _Communicate clearly_. We are not grading your writing skills, but you can only receive full credit if your teacher is able to fully understand your response. 

* _Be concise_. You should be able to answer most short answer questions in a sentence or two. Writing unnecessarily long answers increases the risk of you being unclear or saying something incorrect.

In [1]:
# Run this cell without changes to import the necessary libraries

import itertools
import numpy as np
import pandas as pd 
from numbers import Number
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

import pickle

---
## Part 1: Normal Distribution [Suggested time: 20 minutes]
---
In this part, you will analyze check totals at a TexMex restaurant. We know that the population distribution of check totals for the TexMex restaurant is normally distributed with a mean of \\$20 and a standard deviation of \\$3. 

### 1.1) Create a numeric variable `z_score_26` containing the z-score for a \\$26 check. 

In [2]:
p_mean = 20
std = 3

In [3]:
# CodeGrade step1.1
# Replace None with appropriate code

z_score_26 = (26 - p_mean) / std

In [7]:
z_score_26

2.0

In [4]:
# This test confirms that you have created a numeric variable named z_score_26

assert isinstance(z_score_26, Number)

### 1.2) Create a numeric variable `p_under_26` containing the approximate proportion of all checks that are less than \\$26.

Hint: Use the answer from the previous question along with the empirical rule, a Python function, or this [z-table](https://www.math.arizona.edu/~rsims/ma464/standardnormaltable.pdf).
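
The routes named in the hint give nearly the same answer. A minimal sketch, assuming a z-score of 2 (the value from Question 1.1); the names `empirical_estimate` and `exact_proportion` are illustrative and not part of the challenge:

```python
from scipy import stats

z = 2.0  # z-score of a $26 check, from Question 1.1

# Empirical rule: about 95% of values lie within 2 standard deviations of
# the mean, so about 2.5% lie in each tail.
empirical_estimate = 0.95 + (1 - 0.95) / 2

# The standard normal CDF gives the proportion of values below a z-score.
exact_proportion = stats.norm.cdf(z)

print(round(empirical_estimate, 3), round(exact_proportion, 4))
```

The empirical rule is a rounded approximation, so it lands slightly below the CDF value.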

In [8]:
# CodeGrade step1.2
# Replace None with appropriate code

# norm.cdf returns the proportion of values below the given z-score
p_under_26 = stats.norm.cdf(z_score_26)
p_under_26

0.9772498680518208

In [6]:
# This test confirms that you have created a numeric variable named p_under_26

assert isinstance(p_under_26, Number)

# These tests confirm that p_under_26 is a value between 0 and 1

assert p_under_26 >= 0
assert p_under_26 <= 1

### 1.3) Create numeric variables `conf_low` and `conf_high` containing the lower and upper bounds (respectively) of a 95% confidence interval for the mean of one waiter's check amounts using the information below. 

One week, a waiter gets 100 checks with a mean of \\$19 and a standard deviation of \\$3.

In [9]:
# CodeGrade step1.3
# Replace None with appropriate code

n = 100
mean = 19
std = 3

# Critical value from the t-distribution, since the standard deviation is a sample estimate
t_crit = stats.t.ppf(.975, df=(n-1))
sterr = std / np.sqrt(n)

conf_low = mean - (t_crit * sterr)
conf_high = mean + (t_crit * sterr)

In [10]:
# These tests confirm that you have created numeric variables named conf_low and conf_high

assert isinstance(conf_low, Number)
assert isinstance(conf_high, Number)

# This test confirms that conf_low is below conf_high

assert conf_low < conf_high

# These statements print your answers for reference to help answer the next question

print('The lower bound of the 95% confidence interval is {}'.format(conf_low))
print('The upper bound of the 95% confidence interval is {}'.format(conf_high))

The lower bound of the 95% confidence interval is 18.404734914547394
The upper bound of the 95% confidence interval is 19.595265085452606
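
As a cross-check on the bounds printed above, SciPy can build the same interval in one call. This is a sketch, assuming the t-distribution with n - 1 degrees of freedom is the intended source of the critical value; `check_low` and `check_high` are illustrative names.

```python
import numpy as np
from scipy import stats

n, sample_mean, sample_std = 100, 19, 3
standard_error = sample_std / np.sqrt(n)

# The first positional argument is the confidence level.
check_low, check_high = stats.t.interval(
    0.95, df=n - 1, loc=sample_mean, scale=standard_error
)
print(check_low, check_high)
```

With 99 degrees of freedom the t critical value (about 1.98) is close to the normal value of 1.96, so a z-based interval would be only slightly narrower.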


### 1.4) Short Answer: Interpret the 95% confidence interval you just calculated in Question 1.3.

In [None]:
# We are 95% confident that this waiter's true mean check amount is between about $18.40 and $19.60.


---
## Part 2: Statistical Testing [Suggested time: 20 minutes]
---
The TexMex restaurant recently introduced queso to its menu.

We have a random sample containing 2000 check totals, all from different customers: 1000 check totals for orders without queso ("no queso") and 1000 check totals for orders with queso ("queso").

In the cell below, we load the sample data for you into the arrays `no_queso` and `queso` for the "no queso" and "queso" order check totals, respectively.

In [11]:
# Run this cell without changes

# Load the sample data 
no_queso = pickle.load(open('./no_queso.pkl', 'rb'))
queso = pickle.load(open('./queso.pkl', 'rb'))

### 2.1) Short Answer: State null and alternative hypotheses to use for testing whether customers who order queso spend different amounts of money from customers who do not order queso.

In [13]:
# Your answer here

# Null: There is no difference in mean check total between customers who order queso and customers who do not.

# Alternative: There is a difference in mean check total between customers who order queso and customers who do not.

### 2.2) Short Answer: What would it mean to make a Type I error for this specific hypothesis test?

Your answer should be _specific to this context,_  not a general statement of what Type I error is.

In [None]:
# Your answer here

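# Type I error (false positive): Rejecting the null hypothesis and concluding that queso and no-queso customers
# spend different amounts when there is actually no difference in their spending.

# Type II error (false negative), for contrast: Failing to reject the null hypothesis when the two groups
# actually do spend different amounts.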
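
One way to make a Type I error concrete is to simulate a world where the null hypothesis is true and count how often the test still rejects it. The sketch below uses made-up check totals (normal, mean 20, standard deviation 3) for both groups, not the restaurant's data; the seed, sample sizes, and number of trials are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_trials = 2000

false_positives = 0
for _ in range(n_trials):
    # Both groups come from the same distribution, so the null is true.
    group_a = rng.normal(loc=20, scale=3, size=200)
    group_b = rng.normal(loc=20, scale=3, size=200)
    if stats.ttest_ind(group_a, group_b).pvalue < alpha:
        false_positives += 1

# Share of trials in which a true null hypothesis was rejected.
type_1_rate = false_positives / n_trials
print(type_1_rate)
```

The rejection rate should land close to alpha, which is exactly what the significance level controls.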

### 2.3) Create a numeric variable `p_value` containing the p-value associated with a statistical test of your hypotheses. 

You must identify and implement the correct statistical test for this scenario. You can assume the two samples have equal variances.

Hint: Use `scipy.stats` to calculate the answer - it has already been imported as `stats`. Relevant documentation can be found [here](https://docs.scipy.org/doc/scipy/reference/stats.html#statistical-tests).
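
For two independent samples with equal variances, one candidate in `scipy.stats` is the two-sample t-test, `stats.ttest_ind`. The sketch below runs it on synthetic arrays, since the pickled data is not reproduced here; `sample_a`, `sample_b`, and their parameters are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample_a = rng.normal(loc=20, scale=3, size=1000)  # stand-in for one group
sample_b = rng.normal(loc=21, scale=3, size=1000)  # stand-in for the other group

# equal_var=True (the default) gives Student's t-test; the test is two-sided by default.
result = stats.ttest_ind(sample_a, sample_b, equal_var=True)
example_p_value = float(result.pvalue)
print(result.statistic, example_p_value)
```

A two-sided test matches an alternative hypothesis that only claims the groups differ, without naming a direction.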

In [16]:
stats.ttest_ind(queso, no_queso)

Ttest_indResult(statistic=45.16857748646329, pvalue=1.29670967092511e-307)

In [None]:
q_mean = queso.mean()
nq_mean = no_queso.mean()
nq_std = no_queso.std()
n = len(queso)

In [18]:
z = (q_mean - nq_mean) / (nq_std / np.sqrt(n))

In [20]:
# CodeGrade step2.3
# Replace None with appropriate code

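# Two-sided, two-sample t-test with equal variances, matching the hypotheses in 2.1
p_value = stats.ttest_ind(queso, no_queso, equal_var=True).pvalue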

In [22]:
# This test confirms that you have created a numeric variable named p_value

assert isinstance(p_value, Number)

### 2.4) Short Answer: Can you reject the null hypothesis using a significance level of $\alpha$ = 0.05? Explain why or why not.

In [None]:
# Your answer here

# We reject the null hypothesis because the p-value (about 1.3e-307) is far below alpha = 0.05, so the difference
# in spending is statistically significant.

---
## Part 3: Bayesian Statistics [Suggested time: 15 minutes]
---
A medical test is designed to diagnose a certain disease. The test has a false positive rate of 10%, meaning that 10% of people without the disease will get a positive test result. The test has a false negative rate of 2%, meaning that 2% of people with the disease will get a negative result. Only 1% of the population has this disease.

### 3.1) Create a numeric variable `p_pos_test` containing the probability of a person receiving a positive test result.

Assume that the person being tested is randomly selected from the broader population.
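
A positive result can come from two groups: people with the disease who test positive, and people without it who test positive. A sketch of that decomposition (the law of total probability), using the rates from the scenario; the variable names are illustrative:

```python
false_positive_rate = 0.10   # P(positive | no disease)
false_negative_rate = 0.02   # P(negative | disease)
prevalence = 0.01            # P(disease)

sensitivity = 1 - false_negative_rate  # P(positive | disease)

# Add the two disjoint ways of getting a positive result.
p_true_positive = prevalence * sensitivity
p_false_positive = (1 - prevalence) * false_positive_rate
p_positive = p_true_positive + p_false_positive
print(round(p_positive, 4))
```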

In [23]:
# CodeGrade step3.1
# Replace None with appropriate code
    
false_pos_rate = 0.1
false_neg_rate = 0.02
population_rate = 0.01

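# P(positive) = P(no disease) * P(positive | no disease) + P(disease) * P(positive | disease)
p_pos_test = ((1 - population_rate) * false_pos_rate) + (population_rate * (1 - false_neg_rate))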

In [24]:
# This test confirms that you have created a numeric variable named p_pos_test

assert isinstance(p_pos_test, Number)

# These tests confirm that p_pos_test is a value between 0 and 1

assert p_pos_test >= 0
assert p_pos_test <= 1

### 3.2) Create a numeric variable `p_disease_given_pos` containing the probability of a person actually having the disease if they receive a positive test result.

Assume that the person being tested is randomly selected from the broader population.

Hint: Use your answer to the previous question to help answer this one.
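
Bayes' theorem can also be checked by counting. The sketch below imagines a hypothetical population of 100,000 people (the size is arbitrary) and applies the scenario's rates to it:

```python
population = 100_000
with_disease = population * 1 // 100           # 1% prevalence -> 1,000 people
without_disease = population - with_disease    # 99,000 people

true_positives = with_disease * 98 // 100      # 2% false negatives -> 980 test positive
false_positives = without_disease * 10 // 100  # 10% false positives -> 9,900 test positive

# Among everyone who tests positive, what share actually has the disease?
share_diseased_among_positives = true_positives / (true_positives + false_positives)
print(round(share_diseased_among_positives, 4))
```

Even after a positive result, only about 9% of those people have the disease, because false positives from the much larger healthy group outnumber the true positives.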

In [27]:
# CodeGrade step3.2
# Replace None with appropriate code
    
false_pos_rate = 0.1
false_neg_rate = 0.02
population_rate = 0.01

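# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease_given_pos = ((1 - false_neg_rate) * population_rate) / p_pos_test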

In [28]:
# This test confirms that you have created a numeric variable named p_disease_given_pos

assert isinstance(p_disease_given_pos, Number)

# These tests confirm that p_disease_given_pos is a value between 0 and 1

assert p_disease_given_pos >= 0
assert p_disease_given_pos <= 1

---

## Part 4: Linear Regression [Suggested Time: 20 min]
---

In this section, you'll run regression models with [automobile price](https://archive.ics.uci.edu/ml/datasets/Automobile) data.

We will use these columns of the dataset:

- `body-style`: categorical, hardtop, wagon, sedan, hatchback, or convertible
- `length`: continuous
- `width`: continuous
- `height`: continuous
- `engine-size`: continuous
- `horsepower`: continuous
- `city-mpg`: continuous
- `price`: continuous

We will use `price` as the target and all other columns as features. The units of `price` are US dollars in 1985.

In [30]:
# Run this cell without changes

# Load data into pandas
data = pd.read_csv("automobiles.csv")

# Data cleaning
data = data[(data["horsepower"] != "?") & (data["price"] != "?")]
data["horsepower"] = data["horsepower"].astype(int)
data["price"] = data["price"].astype(int)

# Select subset of columns
data = data[["body-style", "length", "width", "height", "engine-size", "horsepower", "city-mpg", "price"]]
data

| | body-style | length | width | height | engine-size | horsepower | city-mpg | price |
|---|---|---|---|---|---|---|---|---|
| 0 | convertible | 168.8 | 64.1 | 48.8 | 130 | 111 | 21 | 13495 |
| 1 | convertible | 168.8 | 64.1 | 48.8 | 130 | 111 | 21 | 16500 |
| 2 | hatchback | 171.2 | 65.5 | 52.4 | 152 | 154 | 19 | 16500 |
| 3 | sedan | 176.6 | 66.2 | 54.3 | 109 | 102 | 24 | 13950 |
| 4 | sedan | 176.6 | 66.4 | 54.3 | 136 | 115 | 18 | 17450 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 200 | sedan | 188.8 | 68.9 | 55.5 | 141 | 114 | 23 | 16845 |
| 201 | sedan | 188.8 | 68.8 | 55.5 | 141 | 160 | 19 | 19045 |
| 202 | sedan | 188.8 | 68.9 | 55.5 | 173 | 134 | 18 | 21485 |
| 203 | sedan | 188.8 | 68.9 | 55.5 | 145 | 106 | 26 | 22470 |


### 4.1) Build a StatsModels `OLS` model `numeric_mod` that uses all numeric features to predict `price`

In other words, this model should use all features in `data` except for `body-style` (because `body-style` is categorical).

In [33]:
# CodeGrade step4.1
# Replace None with appropriate code
    
y = data["price"]
X = sm.add_constant(data.drop(["price", "body-style"], axis=1))

numeric_mod = sm.OLS(y,X)

In [34]:
# This test confirms that you have created a variable named numeric_mod containing a StatsModels OLS model

assert type(numeric_mod) == sm.OLS

In [35]:
# This code prints your model summary for reference to help answer the next question

numeric_results = numeric_mod.fit()
print(numeric_results.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.820
Model:                            OLS   Adj. R-squared:                  0.815
Method:                 Least Squares   F-statistic:                     146.2
Date:                Tue, 06 Dec 2022   Prob (F-statistic):           8.43e-69
Time:                        12:22:42   Log-Likelihood:                -1899.0
No. Observations:                 199   AIC:                             3812.
Df Residuals:                     192   BIC:                             3835.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const        -7.07e+04   1.36e+04     -5.197      

### 4.2) Short Answer: Are all of these features statistically significant? If not, which features are not? How did you determine this from the model output?

Include the alpha level you are using in your answer.

In [None]:
# Your answer here

# No. `length` and `city-mpg` are not statistically significant, since their p-values in the P>|t| column are
# greater than our alpha (0.05).

### 4.3) Short Answer: Let's say we want to add `body-style` to our model. Run the cell below to view the values of `body-style`. Given the output, how many one-hot encoded features should be added?

Explain your answer. ***Hint:*** you might want to mention the dummy variable trap and/or the reference category.

In [36]:
# Run this cell without changes

data["body-style"].value_counts().sort_index()

convertible     6
hardtop         8
hatchback      67
sedan          94
wagon          24
Name: body-style, dtype: int64
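
The dummy variable trap can be seen directly: with a constant in the model, a full set of indicator columns always sums to 1, duplicating the constant. A sketch with the five category labels from the output above (one row per label, purely for illustration; `with_constant` is a hypothetical helper):

```python
import numpy as np
import pandas as pd

styles = pd.Series(
    ["convertible", "hardtop", "hatchback", "sedan", "wagon"], name="body-style"
)

full = pd.get_dummies(styles, dtype=int)                      # 5 indicator columns
reduced = pd.get_dummies(styles, drop_first=True, dtype=int)  # 4 columns; "convertible" is the reference

def with_constant(dummies):
    # Prepend a column of ones, as sm.add_constant would.
    return np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])

# Full set + constant: 6 columns but rank 5 (perfectly collinear).
print(with_constant(full).shape[1], np.linalg.matrix_rank(with_constant(full)))
# Reduced set + constant: 5 columns and rank 5 (no redundancy).
print(with_constant(reduced).shape[1], np.linalg.matrix_rank(with_constant(reduced)))
```

Dropping one category removes the redundancy, and the dropped category becomes the baseline against which the other coefficients are read.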

In [None]:
# Your answer here

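# Body-style has 5 unique values, so 4 one-hot encoded features should be added. One category is dropped to serve
# as the reference category; keeping all 5 alongside the constant would create perfect multicollinearity
# (the dummy variable trap).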

### 4.4) Prepare `body-style` for modeling using `pd.get_dummies`. Then create a StatsModels `OLS` model `all_mod` that predicts `price` using all (including one-hot encoded) other features.

In [48]:
# CodeGrade step4.4
# Replace None with appropriate code

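# One-hot encode body-style, dropping the first category as the reference;
# dtype=int keeps the dummies numeric on pandas versions that default to booleans
dum = pd.get_dummies(data["body-style"], drop_first=True, dtype=int)
new_table = pd.concat([data, dum], axis=1)

# Drop the target and the original categorical column, then add the intercept
X_ohe = sm.add_constant(new_table.drop(['price', 'body-style'], axis=1))

all_mod = sm.OLS(y, X_ohe)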

In [49]:
# This test confirms that you have created a variable named all_mod containing a StatsModels OLS model

assert type(all_mod) == sm.OLS

# This code prints your model summary for reference to help answer the next question

all_results = all_mod.fit()
print(all_results.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.840
Model:                            OLS   Adj. R-squared:                  0.832
Method:                 Least Squares   F-statistic:                     98.92
Date:                Tue, 06 Dec 2022   Prob (F-statistic):           2.34e-69
Time:                        12:39:57   Log-Likelihood:                -1887.3
No. Observations:                 199   AIC:                             3797.
Df Residuals:                     188   BIC:                             3833.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const       -7.519e+04   1.32e+04     -5.718      

### 4.5) Short Answer: Does this model do a better job of explaining automobile price than the previous model using only numeric features? Explain how you determined this based on the model output. 

In [None]:
# Your answer here

# Yes. Because the two models have different numbers of features, we compare adjusted R-squared rather than
# R-squared. The model with the dummy variables has the higher adjusted R-squared (0.832 vs. 0.815), so it does a
# better job of explaining price.
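
Adjusted R-squared penalizes R-squared for the number of predictors, which is why it suits a comparison between models of different sizes. The sketch below applies the standard formula to the rounded figures in the two summaries above; `adjusted_r_squared` is an illustrative helper, not a StatsModels function.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R-squared, observation count, and Df Model as displayed in each summary.
numeric_adj = adjusted_r_squared(0.820, n_obs=199, n_predictors=6)
all_adj = adjusted_r_squared(0.840, n_obs=199, n_predictors=10)
print(round(numeric_adj, 3), round(all_adj, 3))
```

Because the inputs are the rounded R-squared values, the results agree with the reported 0.815 and 0.832 only to within rounding. On a fitted StatsModels results object the exact value is available as the `rsquared_adj` attribute.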