## Hypothesis Testing

### P-Value Interpretation

1 sided, alpha = 0.05

P-value < 0.05

    - a sample mean this far from the hypothesized mean would be unlikely if the null hypothesis were true
    - reject null hypothesis
    - the data support the alternative hypothesis
    - statistically significant difference
P-value >= 0.05
    
    - our sample mean is consistent with the hypothesized population mean
    - fail to reject null hypothesis
    - the data do not support the alternative hypothesis
    - difference is not statistically significant
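
The decision rule at a 0.05 threshold can be sketched as a small helper. The function name and the default alpha are illustrative assumptions, not part of any library.

```python
def interpret_p_value(p_value, alpha=0.05):
    """Return the conventional decision for a p-value at significance level alpha."""
    if p_value < alpha:
        # Data this extreme would be unlikely if the null hypothesis were true
        return "reject the null hypothesis (statistically significant)"
    # Not enough evidence against the null hypothesis
    return "fail to reject the null hypothesis (not statistically significant)"


print(interpret_p_value(0.03))
print(interpret_p_value(0.20))
```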

<hr style="border:2px solid gray">

### One-Sample T-Test

 - to compare sample average to hypothetical population average
 - used for quantitative data to compare sample mean to expected population mean

example questions this could answer
- is the average amount of time spent on a website different from 5 minutes
- is the average amount of money that customers spend on a purchase more than 10 USD
- is the average time a person spends reading an email different from an expected value

In [1]:
from scipy.stats import ttest_1samp

tstat, pval = ttest_1samp(sample_distribution, expected_mean)
- sample_distribution - the sample of observations to test (50 or so values)
- expected_mean - the hypothesized population mean that the sample mean is compared against

Assumptions
- The sample was randomly selected from the population
- The individual observations were independent
- The data is normally distributed without outliers OR the sample size is large (enough)

In [3]:
tstat, pval = ttest_1samp(prices, 1000)

Interpretation

    Null Hypothesis
    - no statistically significant difference between the sample mean and the expected mean
    - pval >= 0.05
    
    Alternative Hypothesis
    - statistically significant difference between the sample mean and the expected mean
    - pval < 0.05
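
A runnable sketch of the whole test, assuming made-up data: the `prices` array here is randomly generated for illustration (it is not the dataset used in the cell above), and 1000 is the expected mean being tested.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(seed=0)

# Hypothetical sample: 50 purchase prices drawn around a true mean of 1050
prices = rng.normal(loc=1050, scale=100, size=50)

# Two-sided test of "population mean == 1000"
tstat, pval = ttest_1samp(prices, 1000)

alpha = 0.05
if pval < alpha:
    print(f"pval={pval:.4f}: reject the null hypothesis")
else:
    print(f"pval={pval:.4f}: fail to reject the null hypothesis")
```

The t statistic is the distance between the sample mean and the expected mean, measured in standard errors, so a larger sample or a smaller spread makes the same difference easier to detect.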

<hr style="border:2px solid gray">

### Binomial Test


- to compare the frequency of some outcome in a sample to the expected probability of that outcome
- used for binary categorical data to compare a sample frequency to an expected population-level probability

example questions this could answer
- if I am expected to be 50% cringe and I am hooked up to a cringe detector and it reads that I am 10% cringe, is that statistically significantly different from 50% cringe?
- if 90% of ticketed passengers are expected to show up to their flight, but only 80% show up, is that significantly different from 90%?
- we want students to have a 70% chance of getting a question right (lower is too hard, higher is too easy); they score 60% on it, is that significantly different from 70%?

#### Random binary categorical

np.random.choice() is often used to simulate the binary outcomes for this hypothesis test.

[random numbers notebook](http://localhost:8888/notebooks/Sampling%20and%20Random%20Numbers.ipynb)
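
A minimal sketch of how np.random.choice() can simulate a binomial test. The numbers (500 trials, 10% success probability, 41 observed successes) and the variable names are example values chosen for illustration.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the sketch is reproducible

n_trials = 500      # size of each simulated sample
p_success = 0.1     # success probability under the null hypothesis
observed = 41       # observed number of successes (example value)
n_sims = 10_000     # number of simulated samples

# Each row is one simulated sample of 500 binary outcomes (1 = success)
outcomes = np.random.choice(
    [1, 0], size=(n_sims, n_trials), p=[p_success, 1 - p_success]
)
successes = outcomes.sum(axis=1)

# One-sided p-value: share of simulated samples with 41 or fewer successes
p_value_sim = np.mean(successes <= observed)
print(p_value_sim)
```

The simulated p-value is an approximation, so it shifts slightly with the seed and the number of simulations.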

How to calculate a p value in python

[Binomial Proof](http://localhost:8888/notebooks/Binomial%20Function%20Proof.ipynb)

binom_test(observed_successes, n=number_of_trials, p=probability_of_success, alternative=hypothesis)
- observed_successes = how many times the outcome of interest occurred in the trials
- n = number of trials
- p = probability of success in each trial under the null hypothesis
- alternative =
    - 'less' - use when the alternative hypothesis is that the true probability is less than p
    - 'greater' - use when the alternative hypothesis is that the true probability is greater than p
    - 'two-sided' - use when the alternative hypothesis is that the true probability differs from p in either direction (the default)

#### One Sided Binomial Test

In [3]:
from scipy.stats import binom_test
p_value_1sided = binom_test(41, n=500, p=.1, alternative='less')

#### Two Sided Binomial Test

In [5]:
from scipy.stats import binom_test
p_value_2sided = binom_test(41, n=500, p=.1)
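
In newer SciPy releases binom_test is no longer available (it was deprecated and, as far as I know, removed in SciPy 1.12). binomtest, available since SciPy 1.7, covers the same cases and returns a result object with the p-value in its pvalue attribute. A minimal sketch with the same numbers as above:

```python
from scipy.stats import binomtest

# One-sided: is the true success probability less than 0.1?
result_less = binomtest(41, n=500, p=0.1, alternative='less')

# Two-sided: does the true success probability differ from 0.1 in either direction?
result_two = binomtest(41, n=500, p=0.1, alternative='two-sided')

print(result_less.pvalue)
print(result_two.pvalue)
```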

Interpretation
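- 1 sided: compare the p-value to alpha = .05
- 2 sided: compare the p-value to alpha = .05 (the two-sided p-value already covers both tails, .025 in each)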

    Null Hypothesis
    - no significant difference between the observed sample frequency and the expected probability

<hr style="border:2px solid gray">

### Linear Regression

**before fitting a linear or multiple linear regression model, EDA and visualization must be done to roughly see relationships**

- models the relationship between a quantitative variable and one or more other variables



Example Questions this could answer

- what is the relationship between apartment size and rental price for NYC apartments
- is a mother's height a good predictor of her child's adult height

[Linear Regression Proof](http://localhost:8888/notebooks/Linear%20Regression%20Proof.ipynb)

**Step 1: make a scatter plot of the two quantitative columns that you are trying to compare**

In [3]:
import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data path')
# input on the x-axis, output on the y-axis
plt.scatter(df.input_column, df.output_column)

> check whether the input and output have a linear relationship

**Step 2: fit a regression line to the data**

model = sm.OLS.from_formula('output ~ input', data = df)<br>
results = model.fit()<br>
results.params

Output:

<span style="color:blue">Intercept   -21.67<br>
[input name]        0.50<br>
dtype: float64</span>

- Intercept - this shows us the best fit intercept to use
- [input name] (height in this example) - this shows us the best fit slope to use

these two values define the line that minimizes the loss function (the sum of squared residuals)
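
Written out, the fitted line and the loss that ordinary least squares minimizes look like this. The symbols are chosen here for illustration: $b_0$ is the intercept, $b_1$ the slope, $x_i$ the input, $y_i$ the observed output, and $\hat{y}_i$ the predicted output.

```latex
\hat{y}_i = b_0 + b_1 x_i,
\qquad
L(b_0, b_1) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```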

results.params[0] will give you the y intercept<br>
results.params[1] will give you the slope<br>

###### to plot the regression line it would look like this

results.params[0] + results.params[1]*df.input_column

predicted output = intercept + slope*(input)

 - can also make predictions with the regression formula

In [8]:
height = 160
weight = 0.5*(height) - 21.6
print(weight)

58.4


 prediction = results.predict({'input':[value of input]})
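
A self-contained sketch of fitting and predicting, assuming made-up height and weight data. To keep it dependency-light it uses np.polyfit in place of sm.OLS; for a single input both give the same least-squares intercept and slope.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical data that roughly follows weight = 0.5 * height - 21.67
height = rng.uniform(150, 200, size=200)
weight = 0.5 * height - 21.67 + rng.normal(0, 3, size=200)

# With deg=1, np.polyfit returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(height, weight, deg=1)

# Predict the output for a new input value
predicted_weight = intercept + slope * 160
print(slope, intercept, predicted_weight)
```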

**Step 3: plot regression with line**

plt.scatter(df.input, df.output)<br>
plt.plot(df.input, results.predict(df))

or, equivalently, using the fitted parameters directly:

plt.plot(df.input, results.params[0] + results.params[1]*df.input)

![image.png](attachment:image.png)

**Step 4: Test the 2 remaining assumptions**

   **test 2: normality of residuals**

fitted_values = results.predict(df)<br>
residuals = df.output_column - fitted_values<br>
plt.hist(residuals)

![image.png](attachment:image.png)

look for:
- no multiple humps
- relatively normal
- no skew

**test 3: Homoscedasticity**

plt.scatter(fitted_values, residuals)


![image.png](attachment:image.png)

look for:
- patterns or asymmetry around the y=0 line
- if there are patterns or asymmetry, homoscedasticity is not met and linear regression may not be appropriate

![image.png](attachment:image.png)

- Homoscedasticity is not met
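
Both residual checks can also be roughed out numerically. This is a sketch on made-up data with constant noise, using np.polyfit in place of sm.OLS and simple summaries in place of the plots; the split at the median fitted value is an illustrative choice, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Hypothetical data with a linear trend and a constant noise level
x = rng.uniform(0, 10, size=300)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, size=300)

slope, intercept = np.polyfit(x, y, deg=1)
fitted_values = intercept + slope * x
residuals = y - fitted_values

# Normality: the histogram counts should rise to a single peak and fall again
counts, bin_edges = np.histogram(residuals, bins=10)
print(counts)

# Homoscedasticity: residual spread should be similar for low and high fitted values
cutoff = np.median(fitted_values)
low = residuals[fitted_values <= cutoff]
high = residuals[fitted_values > cutoff]
print(low.std(), high.std())
```

If the two spreads differ a lot, or the histogram has several humps or a long tail, the plots above are worth a closer look.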

#### Regression with a Categorical Predictor

Calculate Group Mean<br>
df.groupby('input_column (binary)').mean().output_column<br>
plt.scatter(df.input_column, df.output_column)


Example

model = sm.OLS.from_formula('height ~ play_bball', data)<br>
results = model.fit()<br>
print(results.params)

Output

<span style="color:blue">Intercept     169.016<br>
play_bball     14.628<br>
dtype: float64</span>

<span style='color:orange'>this method works with 'yes'/'no', True/False, and 1/0</span>

The slope is the difference in group means<br>
mean of the group coded 1 (or 'yes'/True) - mean of the reference group coded 0 (or 'no'/False)
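
The link between the slope and the group means can be checked on made-up data. This sketch assumes a 0/1 predictor and uses np.polyfit in place of sm.OLS; the data are randomly generated and only loosely modeled on the output above.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical data: 0/1 group label and a numeric outcome
play_bball = rng.integers(0, 2, size=100)
height = 169 + 14.6 * play_bball + rng.normal(0, 7, size=100)

# Least-squares line through the two groups
slope, intercept = np.polyfit(play_bball, height, deg=1)

mean_0 = height[play_bball == 0].mean()
mean_1 = height[play_bball == 1].mean()

print(intercept, mean_0)        # intercept matches the mean of group 0
print(slope, mean_1 - mean_0)   # slope matches the difference in group means
```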

In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

students = pd.read_csv('test_data.csv')

# Calculate and print group means
mean_score_no_breakfast = np.mean(students.score[students.breakfast == 0])
mean_score_breakfast = np.mean(students.score[students.breakfast == 1])
print(mean_score_no_breakfast, mean_score_breakfast)

# Fit the model and print the coefficients
score = sm.OLS.from_formula('score ~ breakfast', students)
results = score.fit()
print(results.params)

# Calculate and print the difference in group means (should match the slope)
print(mean_score_breakfast - mean_score_no_breakfast)