### Importing Libraries

In [2]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns 

### Loading Data

In [3]:
data = pd.read_csv('breast-cancer.csv')

In [4]:
data.head()

Unnamed: 0,119513,N,31,18.02,27.6,117.5,1013,0.09489,0.1036,0.1086,0.07055,0.1865,0.06333,0.6249,1.89,3.972,71.55,0.004433,0.01421,0.03233,0.009854,0.01694,0.003495,21.63,37.08,139.7,1436,0.1195,0.1926,0.314,0.117,0.2677,0.08113,5,5.1
0,8423,N,61,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,3.0,2
1,842517,N,116,21.37,17.44,137.5,1373.0,0.08836,0.1189,0.1255,0.0818,0.2333,0.0601,0.5854,0.6105,3.928,82.15,0.006167,0.03449,0.033,0.01805,0.03094,0.005039,24.9,20.98,159.1,1949.0,0.1188,0.3449,0.3414,0.2032,0.4334,0.09067,2.5,0
2,843483,N,123,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,2.0,0
3,843584,R,27,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,3.5,0
4,843786,R,77,12.75,15.29,84.6,502.7,0.1189,0.1569,0.1664,0.07666,0.1995,0.07164,0.3877,0.7402,2.999,30.85,0.007775,0.02987,0.04561,0.01357,0.01774,0.005114,15.51,20.37,107.3,733.2,0.1706,0.4196,0.5999,0.1709,0.3485,0.1179,2.5,0


In [6]:
data.shape

(197, 35)

It is assumed that the mean systolic blood pressure is μ = 120 mm Hg. In the Honolulu Heart Study, a sample of n = 100 people had an average systolic blood pressure of 130.1 mm Hg with a standard deviation of 21.21 mm Hg. Is the group significantly different (with respect to systolic blood pressure!) from the regular population?

Set up the hypothesis test.
Write down all the steps followed for setting up the test.
Calculate the test statistic by hand and also code it in Python. It should be 4.76190. We will take a look at how to make decisions based on this calculated value.

We want to test whether the **sample mean** differs from the **population mean** of 120. We also know that our **sample** contains 100 individuals.

$t = \frac{(\bar{X}-\mu)}{\hat{\sigma}/\sqrt{n}}$

where:

* $\bar{X}$ is the **sample mean**
* $\mu$ is the **population mean**
* $\hat{\sigma}$ is the **sample standard deviation**
* $n$ is the number of measurements in our sample
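
Plugging the numbers from the problem statement into this formula gives the hand calculation. The hypotheses below assume a two-sided alternative, since the question only asks whether the group is different, not in which direction:

$H_0: \mu = 120 \qquad H_1: \mu \neq 120$

$t = \frac{130.1 - 120}{21.21/\sqrt{100}} = \frac{10.1}{2.121} \approx 4.7619$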

In [7]:
import math

sample_mean = 130.1
pop_mean = 120
sample_std = 21.21
n = 100
statistic = (sample_mean - pop_mean)/(sample_std/math.sqrt(n))
print("Statistic is: ", statistic)

Statistic is:  4.761904761904759
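
One way to turn this value into a decision is to compare it with the critical value of the t-distribution, or equivalently to compute a p-value. The sketch below assumes a two-sided test at a significance level of 0.05 with n - 1 degrees of freedom; the names `alpha`, `t_crit`, and `reject_h0` are illustrative and not part of the original exercise.

```python
import math
from scipy import stats

sample_mean, pop_mean, sample_std, n = 130.1, 120, 21.21, 100
alpha = 0.05  # assumed significance level

t_stat = (sample_mean - pop_mean) / (sample_std / math.sqrt(n))
df = n - 1  # degrees of freedom for a one-sample t-test

# Two-sided critical value: reject H0 when |t| exceeds it.
t_crit = stats.t.ppf(1 - alpha / 2, df)

# Two-sided p-value: probability of a statistic at least this extreme under H0.
p_value = 2 * stats.t.sf(abs(t_stat), df)

reject_h0 = abs(t_stat) > t_crit
print("t = {:.4f}, critical value = {:.4f}, p-value = {:.3g}".format(t_stat, t_crit, p_value))
print("Reject H0:", reject_h0)
```

With 99 degrees of freedom the critical value is about 1.98, so a statistic of 4.76 falls well inside the rejection region.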


In [8]:
from scipy import stats
from numpy.random import normal


samples = {}

# Draw 10 random samples of size n from a normal distribution that matches
# the observed sample mean and standard deviation.
for i in range(10):
    sample_name = "sample_" + str(i)
    samples[sample_name] = normal(loc=130.1, scale=21.21, size=n)
    # Use separate key names so the numeric sample_mean and sample_std
    # defined earlier are not overwritten with strings.
    mean_key = sample_name + "_mean"
    samples[mean_key] = np.mean(samples[sample_name])
    std_key = sample_name + "_std"
    samples[std_key] = np.std(samples[sample_name], ddof=1)  # ddof=1: sample standard deviation
    stat_key = sample_name + "_t-statistic"
    samples[stat_key] = (samples[mean_key] - pop_mean) / (samples[std_key] / math.sqrt(n))
    print("The t-statistic for the sample {} is: {}".format(i, samples[stat_key]))



The t-statistic for the sample 0 is: 5.20335118828276
The t-statistic for the sample 1 is: 3.4255208652665337
The t-statistic for the sample 2 is: 5.620423518681785
The t-statistic for the sample 3 is: 2.380604277571787
The t-statistic for the sample 4 is: 5.360824230880215
The t-statistic for the sample 5 is: 7.134059584942099
The t-statistic for the sample 6 is: 5.436321079340408
The t-statistic for the sample 7 is: 4.092688553489643
The t-statistic for the sample 8 is: 4.084864999104343
The t-statistic for the sample 9 is: 5.765757266773386


Now that we have the t-statistic for each random sample, let's run a two-tailed test. Why two-tailed? Because we want the probability of getting a **sample mean** that deviates from the **population mean** by more than our t-statistic, in either direction. We don't care whether our **sample mean** is larger or smaller than the **population mean**.

Therefore, we can ask ourselves what the probability is of observing a deviation outside the interval from -t to t, assuming the null hypothesis is true. That probability is the p-value.
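
To make the two-tailed idea concrete, the following sketch computes the p-value by hand as the area in both tails of the t-distribution and checks it against `stats.ttest_1samp`. It is only an illustration: the seed and the single simulated sample are assumptions chosen to make the example reproducible, and they do not correspond to any of the samples generated above.

```python
import math
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # seed chosen only to make the sketch reproducible
sample = rng.normal(loc=130.1, scale=21.21, size=100)
pop_mean = 120

n = sample.size
t_stat = (sample.mean() - pop_mean) / (sample.std(ddof=1) / math.sqrt(n))

# Two-tailed p-value: area in both tails, below -|t| and above +|t|.
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# The library routine should agree with the manual computation.
result = stats.ttest_1samp(sample, pop_mean)
print("manual:", t_stat, p_manual)
print("scipy: ", result.statistic, result.pvalue)
```

Doubling the upper-tail area works because the t-distribution is symmetric around zero.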

In [9]:
alpha = 0.05  # significance level
print("Assuming a significance level of 0.05")
print()

for i in range(10):
    sample_name = "sample_" + str(i)
    # Two-sided one-sample t-test against the population mean; index 1 is the p-value.
    p_value = stats.ttest_1samp(samples[sample_name], pop_mean)[1]
    print("The p-value of sample {} is: {:-5.3}".format(i, p_value))
    if p_value < alpha:
        print("Therefore we discard the null hypothesis Ho, as it's very unlikely to get sample {} given Ho.".format(i))
    print()

Assuming a significance level of 0.05

The p-value of sample 0 is: 1.06e-06
Therefore we discard the null hypothesis Ho, as it's very unlikely to get sample 0 given Ho.

The p-value of sample 1 is: 0.000895
Therefore we discard the null hypothesis Ho, as it's very unlikely to get sample 1 given Ho.

The p-value of sample 2 is: 1.76e-07
Therefore we discard the null hypothesis Ho, as it's very unlikely to get sample 2 given Ho.

The p-value of sample 3 is: 0.0192
Therefore we discard the null hypothesis Ho, as it's very unlikely to get sample 3 given Ho.

The p-value of sample 4 is: 5.42e-07
Therefore we discard the null hypothesis Ho, as it's very unlikely to get sample 4 given Ho.

The p-value of sample 5 is: 1.63e-10
Therefore we discard the null hypothesis Ho, as it's very unlikely to get sample 5 given Ho.

The p-value of sample 6 is: 3.92e-07
Therefore we discard the null hypothesis Ho, as it's very unlikely to get sample 6 given Ho.

The p-value of sample 7 is: 8.71e-05
Therefore