In [None]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.weightstats import ztest

In [None]:
df = pd.read_csv('./data/train.csv')

In [None]:
df.head()

In [None]:
mean_count = df.groupby('Neighborhood')['SalePrice'].agg(['mean','count'])

In [None]:
mean_count['diff'] = mean_count['mean'] - df['SalePrice'].mean()

In [None]:
mean_count.sort_values(by='diff')

In [None]:
nr_df = df[ df['Neighborhood'] == 'NridgHt'].copy()
ot_df = df[ df['Neighborhood'] == 'OldTown'].copy()
sw_df = df[ df['Neighborhood'] == 'SawyerW'].copy()

## Hypothesis:

$H_0$: there is no difference between the mean of the group the sample represents and the population mean; any gap in the observed sample mean is due to sampling variation.

$H_a$: there is a difference (two-sided).

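The t-test uses the [t-distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution).

The z-test uses the [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution).

As a rule of thumb, use the t-test when the sample size is below 30 or the population standard deviation is unknown.

As the sample size grows, the t-distribution approaches the normal; by around $n = 30$ the two are already close.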
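
To see how quickly the t-distribution closes in on the normal, the sketch below compares two-sided critical values at a few sample sizes. The sample sizes and the 0.05 significance level are arbitrary choices for illustration.

```python
import scipy.stats as stats

alpha = 0.05  # two-sided significance level (illustrative choice)

# Critical value of the standard normal for a two-sided test
z_crit = stats.norm.ppf(1 - alpha / 2)

# Critical values of the t-distribution for increasing sample sizes
sample_sizes = [5, 10, 30, 100, 1000]
t_crits = {n: stats.t.ppf(1 - alpha / 2, df=n - 1) for n in sample_sizes}

for n, t_crit in t_crits.items():
    print(f"n={n:>4}: t*={t_crit:.3f}  (normal z*={z_crit:.3f})")
```

The t critical value is always the larger of the two, and the gap shrinks as $n$ increases.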

## Z-Test

When we are working with a sampling distribution, the z score is equal to <br><br>  $\Large z = \dfrac{{\bar{x}} - \mu_{0}}{\dfrac{\sigma}{\sqrt{n}}}$

$\bar{x}$ is the sample mean.
<br>$\mu_{0}$ is the mean associated with the null hypothesis.
<br>$\sigma$ is the population standard deviation.
<br>$n$ is the sample size; dividing by $\sqrt{n}$ reflects that we are dealing with a sample of the population, not the entire population.

The denominator, $\frac{\sigma}{\sqrt{n}}$, is the standard error.
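
As a quick check of the arithmetic, here is a minimal sketch of the formula with made-up numbers. The helper name `z_score_of_mean` and all input values are hypothetical and are not taken from the housing data.

```python
import numpy as np

def z_score_of_mean(x_bar, mu_0, sigma, n):
    """Return (standard_error, z) for a sample mean under the null mean mu_0."""
    standard_error = sigma / np.sqrt(n)
    z = (x_bar - mu_0) / standard_error
    return standard_error, z

# Hypothetical values, chosen only to keep the arithmetic easy to follow:
# standard error = 15 / sqrt(36) = 2.5, so z = (105 - 100) / 2.5 = 2.0
example_se, example_z = z_score_of_mean(x_bar=105.0, mu_0=100.0, sigma=15.0, n=36)
print(example_se, example_z)
```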

In [None]:
pop_mu = df['SalePrice'].mean()
pop_mu

In [None]:
# Treating the full dataset as the population, so use ddof=0 (pandas defaults to ddof=1)
pop_std = df['SalePrice'].std(ddof=0)
pop_std

In [None]:
x_bar = nr_df['SalePrice'].mean()
x_bar

In [None]:
n = nr_df.shape[0]
n

In [None]:
# z score
z = (x_bar - pop_mu)/(pop_std/np.sqrt(n))

z

In [None]:
# The CDF gives the percentile: P(Z <= z)
print(stats.norm.cdf(z))

# The survival function gives the upper-tail probability: P(Z > z) = 1 - CDF
print(stats.norm.sf(z))
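
Since the hypothesis is two-sided, the p-value has to count both tails. A minimal sketch, assuming a hypothetical helper named `two_sided_p_value`:

```python
import scipy.stats as stats

def two_sided_p_value(z_score):
    """Probability of a standard-normal value at least as extreme as z_score."""
    return 2 * stats.norm.sf(abs(z_score))

print(two_sided_p_value(1.96))   # close to 0.05
print(two_sided_p_value(-1.96))  # same value, by symmetry
```

Taking `abs` first means the function works whether the sample mean falls above or below the population mean.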

Now repeat the test with a neighborhood whose mean is close to the population mean.

In [None]:
x_bar = sw_df['SalePrice'].mean()
x_bar

In [None]:
n = sw_df.shape[0]
n

In [None]:
# z score
z = (x_bar - pop_mu)/(pop_std/np.sqrt(n))

z

In [None]:
# The CDF gives the percentile: P(Z <= z)
print(stats.norm.cdf(z))

# The survival function gives the upper-tail probability: P(Z > z) = 1 - CDF
print(stats.norm.sf(z))

Using statsmodels

In [None]:
# statsmodels one-sample z-test (it estimates the standard deviation from the sample)
ztest_score, p_value = ztest(nr_df['SalePrice'], value=pop_mu, alternative='two-sided')
ztest_score, p_value

In [None]:
# statsmodels two-sample z-test
ztest_score, p_value = ztest(x1=nr_df['SalePrice'], x2=ot_df['SalePrice'], alternative='two-sided')
ztest_score, p_value

## T-Test

> **$t$-test**:
> 
> - Calculate the **$t$-statistic** using the sample's standard deviation $s$:
> $$\large t = \frac{\bar{x}-\mu_{0}}{\frac{s}{\sqrt{n}}}$$
> - We calculate the p-value from the **$t$-distribution**

$\bar{x}$ is the sample mean.
<br>$\mu_{0}$ is the mean associated with the null hypothesis.
<br>$s$ is the sample standard deviation.
<br>$n$ is the sample size; dividing by $\sqrt{n}$ reflects that we are dealing with a sample of the population, not the entire population.

One sample (compare to population)

In [None]:
# Assume a significance level of alpha = 0.05
x_bar = nr_df['SalePrice'].mean()
s = nr_df['SalePrice'].std()
n = nr_df.shape[0]

t_stat = (x_bar - pop_mu)/(s/np.sqrt(n))
t_stat

In [None]:
# Calculate the two-sided t-critical value t* (alpha/2 in each tail)
crit_t = stats.t.ppf(1 - 0.05/2, n-1)
crit_t

In [None]:
# Calculate the p-value (two-tailed, so multiply by 2)
stats.t.sf(abs(t_stat),df=n-1)*2
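
The critical value and the p-value are two routes to the same decision. The sketch below shows that they agree for a two-sided test; the helper name `two_sided_t_decision` and the example statistics are hypothetical.

```python
import scipy.stats as stats

def two_sided_t_decision(t_stat, n, alpha=0.05):
    """Return (reject_by_critical_value, reject_by_p_value) for a two-sided t-test."""
    dof = n - 1
    crit = stats.t.ppf(1 - alpha / 2, dof)   # upper critical value t*
    p = 2 * stats.t.sf(abs(t_stat), dof)     # two-sided p-value
    return bool(abs(t_stat) > crit), bool(p < alpha)

# Illustrative statistics for a sample of size 20 (not from the housing data)
print(two_sided_t_decision(2.5, n=20))
print(two_sided_t_decision(1.0, n=20))
```

Rejecting when $|t| > t^*$ is equivalent to rejecting when the two-sided p-value is below $\alpha$, so both entries of each tuple match.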

#### Check it!

In [None]:
t_statistic, p_value = stats.ttest_1samp(nr_df['SalePrice'],popmean=pop_mu, alternative='two-sided')

t_statistic, p_value

### Two Sample T-Test

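Check the variances of the two samples. If they differ substantially, use Welch's t-test (`equal_var=False`) instead of the pooled-variance version.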
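
Comparing the two variances by eye is usually enough here. For a more formal check, one option (an addition to the notebook's approach, not part of it) is Levene's test from `scipy.stats`. The sketch below uses synthetic data in place of the real neighborhoods.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for two neighborhoods' sale prices (not the real data)
sample_a = rng.normal(loc=300_000, scale=90_000, size=80)
sample_b = rng.normal(loc=130_000, scale=50_000, size=110)

print(sample_a.var(ddof=1), sample_b.var(ddof=1))

# Levene's test: H0 is that both samples come from populations with equal variance
levene_stat, levene_p = stats.levene(sample_a, sample_b)
print(levene_stat, levene_p)

# A small p-value suggests unequal variances, which argues for equal_var=False
use_welch = bool(levene_p < 0.05)
```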

In [None]:
nr_df['SalePrice'].var()

In [None]:
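# OldTown sample, as in the two-sample z-test above
ot_df['SalePrice'].var()

In [None]:
# equal_var=False runs Welch's t-test, which does not assume equal variances
t_statistic, p_value = stats.ttest_ind(nr_df['SalePrice'], 
                                       ot_df['SalePrice'], 
                                       equal_var=False, 
                                       alternative='two-sided')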

t_statistic, p_value
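
To show what `equal_var=False` computes, here is a minimal sketch of Welch's t-test done by hand and compared against SciPy. The helper name `welch_t_test` is hypothetical and the data are synthetic.

```python
import numpy as np
import scipy.stats as stats

def welch_t_test(a, b):
    """Two-sided Welch t-test computed by hand; returns (t, dof, p)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size  # squared standard errors
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    p = 2 * stats.t.sf(abs(t), dof)
    return t, dof, p

rng = np.random.default_rng(1)
a = rng.normal(10, 3, size=40)   # synthetic data, for illustration only
b = rng.normal(12, 6, size=55)

t_manual, dof_manual, p_manual = welch_t_test(a, b)
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

The manual statistic and p-value should match SciPy's up to floating-point error. The degrees of freedom are generally not an integer.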