# SciPy for statistics

### T-Test

A t-test is used to check whether two samples of data differ significantly.
It compares their **mean** values and calculates a **p-value**\[1], from which we can judge whether the difference between group A and group B is larger than chance alone would explain.

\[1]. The p-value is the probability of observing a difference between the samples at least as large as the one measured, assuming that the null hypothesis (no real difference between the groups) is true. The closer it is to 0, the less plausible it is that the observed difference arose by chance, and the stronger the evidence against the null hypothesis.
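To make the link between the means and the p-value concrete, here is a minimal sketch that computes the t-statistic and the two-sided p-value by hand and compares them with `scipy.stats.ttest_ind`. The two arrays are made-up numbers chosen for illustration, not values from the coffee dataset, and the formula shown is the classic equal-variance (Student's) version.

```python
import numpy as np
from scipy import stats as st

# Made-up samples, purely for illustration (NOT from the coffee dataset)
group_a = np.array([2.0, 3.0, 3.0, 4.0, 5.0, 4.0])
group_b = np.array([1.0, 2.0, 2.0, 3.0, 2.0, 3.0])

n_a, n_b = len(group_a), len(group_b)

# Pooled sample variance (equal-variance Student's t-test)
pooled_var = (
    (n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)
) / (n_a + n_b - 2)

# t-statistic: difference of means divided by its standard error
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(
    pooled_var * (1 / n_a + 1 / n_b)
)

# Two-sided p-value: probability of |t| at least this large if H0 is true
dof = n_a + n_b - 2
p_manual = 2 * st.t.sf(abs(t_manual), dof)

# SciPy should agree with the manual calculation
result = st.ttest_ind(group_a, group_b)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```

Both lines should print the same pair of numbers, which shows that the p-value is simply the tail area of the t-distribution beyond the observed statistic.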

In [1]:
%pip install pandas scipy numpy matplotlib -Uq

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
from scipy import stats as st
import matplotlib.pyplot as plt

In [3]:
data_code = pd.read_csv("./data/CodeAndCoffeeModified.csv")
data_code.head()

Unnamed: 0,CodingHours,CoffeeCupsPerDay,CoffeeTime,CodingWithoutCoffee,CoffeeType,CoffeeSolveBugs,Gender,AgeRange
0,8,2,Before coding,Yes,Caffè latte,Sometimes,Female,18 to 29
1,3,2,Before coding,Yes,Americano,Yes,Female,30 to 39
2,5,3,While coding,No,Nescafe,Yes,Female,18 to 29
3,10,3,While coding,Sometimes,Turkish,No,Male,18 to 29
4,8,2,While coding,Sometimes,Nescafe,Yes,Male,30 to 39


### Propose a null hypothesis

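The null hypothesis always states that there is no effect; the test can only reject it or fail to reject it.

$H_0$: Coffee consumption is the same regardless of working hours

$H_1$: Programmers who code longer hours drink a different amount of coffee than those who code fewer hours (the expectation being that they drink more)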

`tip`: use `$H_n$` to generate formulas in Markdown. [More information](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/writing-mathematical-expressions)

In [16]:
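# Keep only respondents who did not answer "Yes" to coding without coffee
data_code_coding = data_code[data_code["CodingWithoutCoffee"] != "Yes"]

# Split by daily coding hours; rows with exactly 5 hours fall into neither group
hard_workers = data_code_coding[data_code_coding["CodingHours"] > 5]
soft_workers = data_code_coding[data_code_coding["CodingHours"] < 5]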

#### Make the datasets of equal length

In [17]:
# info() prints its summary itself and returns None, so no print() is needed
hard_workers.info()
soft_workers.info()

<class 'pandas.core.frame.DataFrame'>
Index: 46 entries, 3 to 96
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   CodingHours          46 non-null     int64 
 1   CoffeeCupsPerDay     46 non-null     int64 
 2   CoffeeTime           46 non-null     object
 3   CodingWithoutCoffee  46 non-null     object
 4   CoffeeType           46 non-null     object
 5   CoffeeSolveBugs      46 non-null     object
 6   Gender               46 non-null     object
 7   AgeRange             46 non-null     object
dtypes: int64(2), object(6)
memory usage: 3.2+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 17 entries, 9 to 95
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   CodingHours          17 non-null     int64 
 1   CoffeeCupsPerDay     17 non-null     int64 
 2   CoffeeTime           17 non-null     object
 3   CodingWithoutCof

In [18]:
# Keep only the first len(soft_workers) rows so both groups are the same size
hard_workers = hard_workers.iloc[:len(soft_workers)]
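Note that `st.ttest_ind` does not require samples of equal length, and trimming the larger group throws data away. As an alternative sketch, Welch's t-test (`equal_var=False`) works with unequal sizes and does not assume equal variances. The arrays below are synthetic stand-ins with the same group sizes as above (46 and 17); they are not the real survey answers, and the variable names are hypothetical.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-ins for cups per day (NOT the real data); sizes 46 and 17
long_hours_cups = rng.poisson(lam=3, size=46)
short_hours_cups = rng.poisson(lam=3, size=17)

# Welch's t-test: unequal sample sizes and unequal variances are both fine
welch = st.ttest_ind(long_hours_cups, short_hours_cups, equal_var=False)

print(f"t = {welch.statistic:.3f}, p = {welch.pvalue:.3f}")
```

With the real data, the same call could be made on the untrimmed `CoffeeCupsPerDay` columns of the two groups.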

In [22]:
alpha = 0.05           # Significance level (5%)

results = st.ttest_ind(
    hard_workers["CoffeeCupsPerDay"],
    soft_workers["CoffeeCupsPerDay"],
)

print(f'p-value: {results.pvalue}\n')

if results.pvalue < alpha:
    print('Null hypothesis rejected')
else:
    print('No reason to reject the null hypothesis')

p-value: 0.315804659699638

No reason to reject the null hypothesis
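The test above is two-sided: it asks whether the means differ in either direction. The question in this notebook is directional (do people who code longer drink *more* coffee?), so a one-sided test is another option. A minimal sketch, assuming SciPy 1.6 or newer, where `ttest_ind` accepts the `alternative` parameter; the arrays are made-up values for illustration only.

```python
import numpy as np
from scipy import stats as st

# Made-up cups-per-day values, for illustration only (NOT the survey data)
long_hours_cups = np.array([3, 4, 2, 5, 3, 4, 3])
short_hours_cups = np.array([2, 3, 2, 3, 1, 2, 3])

# Two-sided: are the means different in either direction?
two_sided = st.ttest_ind(long_hours_cups, short_hours_cups)

# One-sided: is the mean of the first sample greater than that of the second?
one_sided = st.ttest_ind(long_hours_cups, short_hours_cups, alternative="greater")

print(f"two-sided p = {two_sided.pvalue:.4f}")
print(f"one-sided p = {one_sided.pvalue:.4f}")
```

When the first sample's mean is the larger one, as in this toy data, the one-sided p-value is half the two-sided one. The direction of the test should be chosen before looking at the data, not afterwards to obtain a smaller p-value.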


### Conclusion: the t-test found no significant evidence that working long hours means drinking more coffee

Don't drink too much coffee, kids! And don't work long hours either...