# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab. In this lab we apply hypothesis tests to real datasets and interpret the results.

The lab covers testing the mean of a single sample (one-sample t-test), comparing two independent groups (two-sample t-test), comparing dependent samples (paired t-test), and comparing means across more than two groups (Analysis of Variance, ANOVA).

So grab your statistical tools, prepare your hypotheses, and let's get started!

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [1]:
# Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df_pokemon = df.copy()
df_pokemon.head()

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False


In [3]:
df_pokemon.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        799 non-null    object
 1   Type 1      800 non-null    object
 2   Type 2      414 non-null    object
 3   HP          800 non-null    int64 
 4   Attack      800 non-null    int64 
 5   Defense     800 non-null    int64 
 6   Sp. Atk     800 non-null    int64 
 7   Sp. Def     800 non-null    int64 
 8   Speed       800 non-null    int64 
 9   Generation  800 non-null    int64 
 10  Legendary   800 non-null    bool  
dtypes: bool(1), int64(7), object(3)
memory usage: 63.4+ KB


In [4]:
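# Lowercase the column names and replace spaces with underscores.
# Note: "Sp. Atk" becomes "sp._atk" (the dot is kept), which is the name used below.
df_pokemon.columns = [col.lower().replace(" ", "_") for col in df_pokemon.columns]
# dropna() drops every row with any missing value. Because type_2 is missing for
# single-type Pokémon, this keeps only the 414 dual-type rows out of 800.
df_pokemon = df_pokemon.dropna()
df_pokemon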

Unnamed: 0,name,type_1,type_2,hp,attack,defense,sp._atk,sp._def,speed,generation,legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
6,Charizard,Fire,Flying,78,84,78,109,85,100,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True
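
The cell above calls `dropna()` with no arguments, which removes a row if any column is missing. As a minimal sketch of the difference this makes, the toy frame below (hypothetical names and made-up values, not the lab's data) compares that call with `dropna(subset=["name"])`, which would keep single-type rows. Whether single-type Pokémon should be kept is an analysis choice; the results later in this lab are based on the dual-type rows only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the lab's columns; all values are made up.
toy = pd.DataFrame({
    "name": ["A", "B", None, "D"],
    "type_1": ["Grass", "Fire", "Water", "Dragon"],
    "type_2": ["Poison", np.nan, np.nan, "Flying"],
    "hp": [45, 39, 44, 78],
})

# With no arguments, a row is dropped if ANY column is missing,
# so every single-type row (type_2 is NaN) disappears.
all_dropped = toy.dropna()

# Restricting the check to `name` keeps single-type rows
# and only drops the row with a missing name.
name_only = toy.dropna(subset=["name"])

print(len(toy), len(all_dropped), len(name_only))
```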


In [5]:
df_pokemon.loc[:, 'type_1'] = (
    df_pokemon['type_1']
    .astype(object)        # force the object dtype
    .astype(str)           # convert every value to a Python string
)

df_pokemon.loc[:, 'type_2'] = (
    df_pokemon['type_2']
    .astype(object)        # force the object dtype
    .astype(str)           # convert every value to a Python string
)
print(type(df_pokemon['type_1'].iloc[0]))
print(type(df_pokemon['type_2'].iloc[0]))

<class 'str'>
<class 'str'>


In [6]:
for col in df_pokemon.columns:
    nunique = df_pokemon[col].nunique()
    print(f"--- Column analysis: {col} {nunique} ---")

--- Column analysis: name 414 ---
--- Column analysis: type_1 18 ---
--- Column analysis: type_2 18 ---
--- Column analysis: hp 72 ---
--- Column analysis: attack 95 ---
--- Column analysis: defense 86 ---
--- Column analysis: sp._atk 85 ---
--- Column analysis: sp._def 74 ---
--- Column analysis: speed 94 ---
--- Column analysis: generation 6 ---
--- Column analysis: legendary 2 ---


- We posit that Dragon-type Pokémon have, on average, more HP than Grass-type Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.

In [11]:
# 1. Select the HP of every Dragon Pokémon (type_1 OR type_2)
dragon_hp = df_pokemon[(df_pokemon['type_1'] == 'Dragon') | (df_pokemon['type_2'] == 'Dragon')]['hp']
mean_dragon_hp = dragon_hp.mean()
# 2. Select the HP of every Grass Pokémon (type_1 OR type_2)
grass_hp = df_pokemon[(df_pokemon['type_1'] == 'Grass') | (df_pokemon['type_2'] == 'Grass')]['hp']
mean_grass_hp = grass_hp.mean()

from scipy import stats

# H0: mean HP (Dragon) <= mean HP (Grass); H1: mean HP (Dragon) > mean HP (Grass)
# Welch's t-test (equal_var=False) does not assume equal variances in the two groups.
# alternative='greater' makes the test one-sided, matching the directional claim.
t_stat, p_value = stats.ttest_ind(dragon_hp, grass_hp, equal_var=False, alternative='greater')

# Report the group means, the p-value, and the decision at the 5% level
alpha = 0.05
print(f"--- Statistical Findings (Alpha = {alpha}) ---")
print(f"Dragon Mean HP: {mean_dragon_hp:.2f}")
print(f"Grass Mean HP: {mean_grass_hp:.2f}")
print(f"P-value: {p_value:.4f}")

if p_value < alpha:
    print("Decision: Reject the Null Hypothesis (H0).")
    print("Conclusion: Dragon Pokemons have significantly more HP than Grass Pokemons.")
else:
    print("Decision: Fail to reject the Null Hypothesis (H0).")
    print("Conclusion: The difference in HP is not statistically significant.")

--- Statistical Findings (Alpha = 0.05) ---
Dragon Mean HP: 88.85
Grass Mean HP: 66.94
P-value: 0.0000
Decision: Reject the Null Hypothesis (H0).
Conclusion: Dragon Pokemons have significantly more HP than Grass Pokemons.
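
The claim tested above is directional (one group has a higher mean than the other), so it helps to see how a one-sided test relates to the default two-sided one. The following is a minimal sketch on synthetic data, not the Pokémon dataset: the group names, means, spreads, and sample sizes are all assumptions chosen for illustration. It relies on the `alternative` parameter of `scipy.stats.ttest_ind`, available in SciPy 1.6 and later.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples with made-up parameters (assumption, for illustration only).
group_a = rng.normal(loc=85, scale=20, size=50)
group_b = rng.normal(loc=67, scale=15, size=90)

# Two-sided Welch test: H1 is "the means differ".
t_two, p_two = stats.ttest_ind(group_a, group_b, equal_var=False)

# One-sided Welch test: H1 is "mean(group_a) > mean(group_b)".
t_one, p_one = stats.ttest_ind(
    group_a, group_b, equal_var=False, alternative="greater"
)

# The t statistic is the same; when it is positive,
# the one-sided p-value is half the two-sided one.
print(f"t = {t_one:.3f}")
print(f"two-sided p = {p_two:.3g}, one-sided p = {p_one:.3g}")
```

If the observed difference pointed the other way (a negative t statistic), the one-sided p-value would be large, so the direction should be fixed before looking at the data.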


- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.


In [14]:
from scipy import stats

# 1. Split the data into the two groups with boolean masks
legendary = df_pokemon[df_pokemon['legendary'] == True]
non_legendary = df_pokemon[df_pokemon['legendary'] == False]

# 2. List of stats to test
stats_to_test = ['hp', 'attack', 'defense', 'sp._atk', 'sp._def', 'speed']

print(f"{'Stat':<10} | {'P-Value':<10} | {'Conclusion'}")
print("-" * 45)

# 3. Run the same two-sided Welch's t-test on each stat
for col in stats_to_test:
    # Perform the T-Test for each stat
    t_stat, p_val = stats.ttest_ind(legendary[col], non_legendary[col], equal_var=False)

    # Decision at the 5% significance level
    if p_val < 0.05:
        result = "Significant Difference"
    else:
        result = "No Significant Difference"
    
    print(f"{col:<10} | {p_val:<10.4f} | {result}")

Stat       | P-Value    | Conclusion
---------------------------------------------
hp         | 0.0000     | Significant Difference
attack     | 0.0000     | Significant Difference
defense    | 0.0000     | Significant Difference
sp._atk    | 0.0000     | Significant Difference
sp._def    | 0.0000     | Significant Difference
speed      | 0.0000     | Significant Difference


**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [15]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, at the 5% significance level, comment on your findings.
 

In [16]:
import pandas as pd
import numpy as np
from scipy import stats

# 1. Define coordinates
SCHOOL = np.array([-118, 34])
HOSPITAL = np.array([-122, 37])

# 2. Function to calculate Euclidean distance
def get_distance(row, target_coord):
    house_coord = np.array([row['longitude'], row['latitude']])
    return np.linalg.norm(house_coord - target_coord)

# 3. Calculate distances for each house
df['dist_school'] = df.apply(lambda x: get_distance(x, SCHOOL), axis=1)
df['dist_hospital'] = df.apply(lambda x: get_distance(x, HOSPITAL), axis=1)

# 4. Create the "Close" vs "Far" groups
# A house is "Close" if distance to school < 0.5 OR distance to hospital < 0.5
mask_close = (df['dist_school'] < 0.5) | (df['dist_hospital'] < 0.5)
close_houses = df[mask_close]['median_house_value']
far_houses = df[~mask_close]['median_house_value']

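# 5. Statistical test: one-sided Welch's t-test
# H1: houses close to a school or hospital have a higher mean value than houses far from both
t_stat, p_val = stats.ttest_ind(close_houses, far_houses, equal_var=False, alternative='greater')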

# 6. Findings
print(f"Mean Price (Close): ${close_houses.mean():,.2f}")
print(f"Mean Price (Far): ${far_houses.mean():,.2f}")
print(f"P-value: {p_val:.4f}")

if p_val < 0.05:
    print("Conclusion: Significant difference found. Proximity affects house prices.")
else:
    print("Conclusion: No significant difference found.")

Mean Price (Close): $246,951.98
Mean Price (Far): $180,678.44
P-value: 0.0000
Conclusion: Significant difference found. Proximity affects house prices.
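
The distance step above applies a Python function to each row. As a sketch of an alternative, the same Euclidean distance can be computed on whole columns at once with `np.hypot`. The helper name `euclidean_distance` and the three toy rows are hypothetical; the column names and the 0.5 threshold follow the lab. The distance is measured in degrees of longitude and latitude, as the exercise specifies, so it is not a true ground distance.

```python
import numpy as np
import pandas as pd

SCHOOL = (-118.0, 34.0)
HOSPITAL = (-122.0, 37.0)

def euclidean_distance(frame, target):
    """Distance (in degrees) from each row's (longitude, latitude) to target."""
    return np.hypot(frame["longitude"] - target[0], frame["latitude"] - target[1])

# Hypothetical rows for illustration; not taken from the housing dataset.
toy = pd.DataFrame({
    "longitude": [-118.3, -122.0, -114.31],
    "latitude": [34.3, 37.2, 34.19],
})

dist_school = euclidean_distance(toy, SCHOOL)
dist_hospital = euclidean_distance(toy, HOSPITAL)

# A row is "close" if it is within 0.5 of the school OR the hospital.
is_close = (dist_school < 0.5) | (dist_hospital < 0.5)
print(is_close.tolist())
```

On a frame with many thousands of rows, column-wise arithmetic like this is typically much faster than `df.apply(..., axis=1)`, and it produces the same distances.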
