# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab, where we embark on an enlightening journey through the realm of statistical decision-making! In this laboratory, we delve into various scenarios, applying the powerful tools of hypothesis testing to scrutinize and interpret data.

From testing the mean of a single sample (One Sample T-Test), to investigating differences between independent groups (Two Sample T-Test), and exploring relationships within dependent samples (Paired Sample T-Test), our exploration knows no bounds. Furthermore, we'll venture into the realm of Analysis of Variance (ANOVA), unraveling the complexities of comparing means across multiple groups.

So, grab your statistical tools, prepare your hypotheses, and let's embark on this fascinating journey of exploration and discovery in the world of hypothesis testing!

**Challenge 1**

In this challenge, we will be working with pokemon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [25]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np



In [26]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Pokemons of type Dragon have, on average, more HP stats than Grass. Choose the propper test and, with 5% significance, comment your findings.

In [28]:
#Set the hypothesis

#H0: mu (HP) Dragons !> mu (HP) Grass
#H1: mu (HP) Dragons > mu (HP) Grass

In [29]:
# Choose the significance level
alpha = 0.05

In [30]:
dragon_1 = df[df["Type 1"]=="Dragon"]["HP"] 
grass_1 = df[df["Type 1"]=="Grass"]["HP"]

In [31]:
st.ttest_ind(dragon_1,grass_1, alternative = 'greater')

TtestResult(statistic=3.590444254130357, pvalue=0.0002567969150153481, df=100.0)

In [32]:
_, p_value1 = st.ttest_ind(dragon_1,grass_1, alternative = 'greater')
p_value1

0.0002567969150153481

In [33]:
dragon_2 = df[df["Type 2"]=="Dragon"]["HP"] 
grass_2 = df[df["Type 2"]=="Grass"]["HP"]

In [34]:
st.ttest_ind(dragon_2,grass_2, alternative='greater')

TtestResult(statistic=2.762339453101214, pvalue=0.004275578527673753, df=41.0)

In [35]:
_, p_value2 = st.ttest_ind(dragon_2,grass_2, alternative = 'greater')
p_value2

0.004275578527673753

In [36]:
if p_value1 > alpha and p_value2 > alpha:
    print("We are not able to reject the null hypothesis")
else:
    print("We reject the null hypothesis")

We reject the null hypothesis


In [37]:
# Because p_value is lower than significance level, we can reject the null hypothesis 
# On average, Pokemons of type Dragon have more HP stats than Pokemons of type Grass

- We posit that Legendary Pokemons have different stats (HP, Attack, Defense, Sp.Atk, Sp.Def, Speed) when comparing with Non-Legendary. Choose the propper test and, with 5% significance, comment your findings.


In [39]:
#Set the hypothesis

#H0: mu Legendary (HP, Attack, Defense, Sp.Attack, Sp.Def, Speed) = mu Non-Legendary (HP, Attack, Defense, Sp.Attack, Sp.Def, Speed)
#H1: mu Legendary (HP, Attack, Defense, Sp.Attack, Sp.Def, Speed) != mu Non-Legendary (HP, Attack, Defense, Sp.Attack, Sp.Def, Speed)


In [40]:
# Choose the significance level
alpha = 0.05

In [41]:
legendary_hp = df[df["Legendary"]==True]["HP"] 
non_legendary_hp = df[df["Legendary"]==False]["HP"]
legendary_hp , non_legendary_hp

(156     90
 157     90
 158     90
 162    106
 163    106
       ... 
 795     50
 796     50
 797     80
 798     80
 799     80
 Name: HP, Length: 65, dtype: int64,
 0      45
 1      60
 2      80
 3      80
 4      39
        ..
 787    85
 788    55
 789    95
 790    40
 791    85
 Name: HP, Length: 735, dtype: int64)

In [42]:

legendary_attack = df[df["Legendary"]==True]["Attack"] 
non_legendary_attack = df[df["Legendary"]==False]["Attack"]
legendary_attack , non_legendary_attack

(156     85
 157     90
 158    100
 162    110
 163    190
       ... 
 795    100
 796    160
 797    110
 798    160
 799    110
 Name: Attack, Length: 65, dtype: int64,
 0       49
 1       62
 2       82
 3      100
 4       52
       ... 
 787    100
 788     69
 789    117
 790     30
 791     70
 Name: Attack, Length: 735, dtype: int64)

In [43]:
legendary_defense = df[df["Legendary"]==True]["Defense"] 
non_legendary_defense = df[df["Legendary"]==False]["Defense"]
legendary_defense , non_legendary_defense

(156    100
 157     85
 158     90
 162     90
 163    100
       ... 
 795    150
 796    110
 797     60
 798     60
 799    120
 Name: Defense, Length: 65, dtype: int64,
 0       49
 1       63
 2       83
 3      123
 4       43
       ... 
 787    122
 788     85
 789    184
 790     35
 791     80
 Name: Defense, Length: 735, dtype: int64)

In [44]:
legendary_sp_attack = df[df["Legendary"]==True]["Sp. Atk"] 
non_legendary_sp_attack = df[df["Legendary"]==False]["Sp. Atk"]
legendary_sp_attack , non_legendary_sp_attack

(156     95
 157    125
 158    125
 162    154
 163    154
       ... 
 795    100
 796    160
 797    150
 798    170
 799    130
 Name: Sp. Atk, Length: 65, dtype: int64,
 0       65
 1       80
 2      100
 3      122
 4       60
       ... 
 787     58
 788     32
 789     44
 790     45
 791     97
 Name: Sp. Atk, Length: 735, dtype: int64)

In [45]:
legendary_sp_def = df[df["Legendary"]==True]["Sp. Def"] 
non_legendary_sp_def = df[df["Legendary"]==False]["Sp. Def"]
legendary_sp_def , non_legendary_sp_def

(156    125
 157     90
 158     85
 162     90
 163    100
       ... 
 795    150
 796    110
 797    130
 798    130
 799     90
 Name: Sp. Def, Length: 65, dtype: int64,
 0       65
 1       80
 2      100
 3      120
 4       50
       ... 
 787     75
 788     35
 789     46
 790     40
 791     80
 Name: Sp. Def, Length: 735, dtype: int64)

In [46]:
legendary_speed = df[df["Legendary"]==True]["Speed"] 
non_legendary_speed = df[df["Legendary"]==False]["Speed"]
legendary_speed , non_legendary_speed

(156     85
 157    100
 158     90
 162    130
 163    130
       ... 
 795     50
 796    110
 797     70
 798     80
 799     70
 Name: Speed, Length: 65, dtype: int64,
 0       45
 1       60
 2       80
 3       80
 4       65
       ... 
 787     54
 788     28
 789     28
 790     55
 791    123
 Name: Speed, Length: 735, dtype: int64)

In [47]:
_, p_value1 = st.ttest_ind(legendary_hp,non_legendary_hp, alternative ='two-sided')

In [48]:
_, p_value2 = st.ttest_ind(legendary_attack,non_legendary_attack, alternative ='two-sided')

In [49]:
_, p_value3 = st.ttest_ind(legendary_defense,non_legendary_defense, alternative ='two-sided')

In [50]:
_, p_value4 = st.ttest_ind(legendary_sp_attack,non_legendary_sp_attack, alternative ='two-sided')

In [51]:
_, p_value5 = st.ttest_ind(legendary_sp_def,non_legendary_sp_def, alternative ='two-sided')

In [52]:
_, p_value6 = st.ttest_ind(legendary_speed,non_legendary_speed, alternative ='two-sided')

In [53]:
if p_value1 > alpha and p_value2 > alpha and p_value3 > alpha and p_value4 and p_value5 > alpha and p_value6 > alpha   :
    print("We are not able to reject the null hypothesis")
else:
    print("We reject the null hypothesis")

We reject the null hypothesis


**Challenge 2**

In this challenge, we will be working with california-housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [56]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the propper test and, with 5% significance, comment your findings.
 

In [58]:
import pandas as pd
import numpy as np
data = df[['latitude', 'longitude']]
lat_long_df = pd.DataFrame(data)

# # Coordinates for the school and hospital
school_coordinates = (-118, 37)   # school coordinates
hospital_coordinates = (-122, 34)  # hospital coordinates

def calculate_euclidean_distance(df, school_coordinates, hospital_coordinates):

    school_lon, school_lat = school_coordinates
    hospital_lon, hospital_lat = hospital_coordinates

    df['distance_to_school'] = np.sqrt((df['latitude'] - school_lat) ** 2 + (df['longitude'] - school_lon) ** 2)
    df['distance_to_hospital'] = np.sqrt((df['latitude'] - hospital_lat) ** 2 + (df['longitude'] - hospital_lon) ** 2)
    

    return df

calculate_euclidean_distance(lat_long_df, school_coordinates, hospital_coordinates)

Unnamed: 0,latitude,longitude,distance_to_school,distance_to_hospital
0,34.19,-114.31,4.638125,7.692347
1,34.40,-114.47,4.384165,7.540617
2,33.69,-114.56,4.773856,7.446456
3,33.64,-114.57,4.801510,7.438716
4,33.57,-114.57,4.850753,7.442432
...,...,...,...,...
16995,40.58,-124.26,7.211380,6.957298
16996,40.69,-124.27,7.275232,7.064630
16997,41.84,-124.30,7.944533,8.170410
16998,41.80,-124.30,7.920227,8.132035


In [97]:
df_houses_close = df[(lat_long_df['distance_to_school'] < 0.50) | (lat_long_df['distance_to_hospital'] < 0.5)]
df_houses_close   

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
4523,-118.05,36.64,34.0,2090.0,478.0,896.0,426.0,2.0357,74200.0
5596,-118.18,37.35,16.0,3806.0,794.0,1501.0,714.0,2.1212,108300.0
5597,-118.18,36.63,23.0,2311.0,487.0,1019.0,384.0,2.2574,104700.0
6776,-118.3,37.17,22.0,3480.0,673.0,1541.0,636.0,2.75,94500.0
6904,-118.31,36.94,35.0,2563.0,530.0,861.0,371.0,2.325,80600.0


In [99]:
df_houses_far = df[(lat_long_df['distance_to_school'] >= 0.50) | (lat_long_df['distance_to_hospital'] >= 0.5)]
df_houses_far   

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0
...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0


In [142]:
alpha = 0.05

In [144]:
_, p_values = st.ttest_ind( df_houses_close,df_houses_far, alternative = 'greater')
p_values

array([0.06496467, 0.08353479, 0.6772252 , 0.41619587, 0.38931952,
       0.69780538, 0.48845359, 0.96841823, 0.98657781])

In [146]:
for p_value in p_values:
    if p_value > alpha:
        print(f"P-value: {p_value:.4f} - We are not able to reject the null hypothesis")
    else:
        print(f"P-value: {p_value:.4f} - We reject the null hypothesis")

P-value: 0.0650 - We are not able to reject the null hypothesis
P-value: 0.0835 - We are not able to reject the null hypothesis
P-value: 0.6772 - We are not able to reject the null hypothesis
P-value: 0.4162 - We are not able to reject the null hypothesis
P-value: 0.3893 - We are not able to reject the null hypothesis
P-value: 0.6978 - We are not able to reject the null hypothesis
P-value: 0.4885 - We are not able to reject the null hypothesis
P-value: 0.9684 - We are not able to reject the null hypothesis
P-value: 0.9866 - We are not able to reject the null hypothesis


In [148]:
# H0: the houses close to either a school or a hospital are not more expensive
# H1: the houses close to either a school or a hospital are  more expensive
# p_value is negative, so we can reject  null hypothesis 