# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab, where we embark on an enlightening journey through the realm of statistical decision-making! In this laboratory, we delve into various scenarios, applying the powerful tools of hypothesis testing to scrutinize and interpret data.

From testing the mean of a single sample (One Sample T-Test), to investigating differences between independent groups (Two Sample T-Test), and exploring relationships within dependent samples (Paired Sample T-Test), our exploration knows no bounds. Furthermore, we'll venture into the realm of Analysis of Variance (ANOVA), unraveling the complexities of comparing means across multiple groups.

So, grab your statistical tools, prepare your hypotheses, and let's embark on this fascinating journey of exploration and discovery in the world of hypothesis testing!

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [16]:
# Libraries
import numpy as np
import pandas as pd
from scipy import stats



In [18]:
df_pokemon = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df_pokemon

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Pokémon of type Dragon have, on average, higher HP than Pokémon of type Grass. Choose the proper test and, with 5% significance, comment on your findings.

In [21]:
df_dragon = df_pokemon[df_pokemon['Type 1'] == 'Dragon']
df_grass = df_pokemon[df_pokemon['Type 1'] == 'Grass']

# Two-sample Kolmogorov-Smirnov test: compares the full HP distributions of the two types (two-sided)
ks_statistic, ks_p_value = stats.kstest(df_dragon["HP"], df_grass["HP"])
ks_statistic, ks_p_value

(0.3821428571428571, 0.002158972680295076)

In [23]:
if ks_p_value < 0.05:
    print("Reject the null hypothesis: Dragon Pokemon have significantly different HP than Grass-type.")
else:
    print("Fail to reject the null hypothesis: No significant difference in HP between Dragon and Grass types.")

Reject the null hypothesis: Dragon Pokemon have significantly different HP than Grass-type.
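
The question is directional and concerns mean HP, whereas the Kolmogorov-Smirnov test above compares whole distributions in a two-sided way. A minimal sketch of one alternative, Welch's one-sided t-test, is shown below. It assumes SciPy 1.6 or later (for the `alternative` argument); the helper name `mean_is_greater` and the HP values are made up for illustration and are not taken from `pokemon.csv`.

```python
from scipy import stats


def mean_is_greater(sample_a, sample_b, alpha=0.05):
    """Welch's t-test of H0: mean(a) <= mean(b) against H1: mean(a) > mean(b)."""
    result = stats.ttest_ind(
        sample_a, sample_b, equal_var=False, alternative="greater"
    )
    return result.statistic, result.pvalue, result.pvalue < alpha


# Made-up HP values for illustration only (hypothetical, not the lab's data).
hp_dragon_demo = [80, 95, 100, 75, 90, 105, 85]
hp_grass_demo = [45, 60, 80, 65, 50, 70, 55, 60]

t_stat, p_value, reject = mean_is_greater(hp_dragon_demo, hp_grass_demo)
print(f"t = {t_stat:.3f}, one-sided p = {p_value:.4g}, reject H0: {reject}")
```

In the notebook, the same helper could presumably be called as `mean_is_greater(df_dragon["HP"], df_grass["HP"])`.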


- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, with 5% significance, comment on your findings.


In [58]:
# Welch's T-test (equal_var=False) compares the means of two independent groups without assuming equal variances

# Stats to compare
stats_columns = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

# Split dataset into Legendary and Non-Legendary Pokémon
df_leg = df_pokemon[df_pokemon["Legendary"] == True]
df_non_leg = df_pokemon[df_pokemon["Legendary"] == False]

# Perform T-test for each stat
for stat in stats_columns:
    t_stat, p_value = stats.ttest_ind(df_leg[stat], df_non_leg[stat], equal_var=False)
    print(f"T-test for {stat}: Statistic = {t_stat:.4f}, p-value = {p_value:.4e}")
    
    # Interpret results
    if p_value < 0.05:
        print(f"Statistically significant result: Reject the null hypothesis for {stat} (Legendary Pokémon differ)")
    else:
        print(f"Not statistically significant: Fail to reject the null hypothesis for {stat} (No significant difference)")
    
    print()

# Findings: every p-value is far below 0.05, so the null hypothesis of equal means is rejected
# for all six stats (HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed).
# The positive t-statistics show that the Legendary group has the higher mean in each case.

T-test for HP: Statistic = 8.9814, p-value = 1.0027e-13
Statistically significant result: Reject the null hypothesis for HP (Legendary Pokémon differ)

T-test for Attack: Statistic = 10.4381, p-value = 2.5204e-16
Statistically significant result: Reject the null hypothesis for Attack (Legendary Pokémon differ)

T-test for Defense: Statistic = 7.6371, p-value = 4.8270e-11
Statistically significant result: Reject the null hypothesis for Defense (Legendary Pokémon differ)

T-test for Sp. Atk: Statistic = 13.4174, p-value = 1.5515e-21
Statistically significant result: Reject the null hypothesis for Sp. Atk (Legendary Pokémon differ)

T-test for Sp. Def: Statistic = 10.0157, p-value = 2.2949e-15
Statistically significant result: Reject the null hypothesis for Sp. Def (Legendary Pokémon differ)

T-test for Speed: Statistic = 11.4750, p-value = 1.0490e-18
Statistically significant result: Reject the null hypothesis for Speed (Legendary Pokémon differ)
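
Running six tests at the 5% level each raises the chance of at least one false positive. One conservative safeguard, which is an addition here and not part of the lab statement, is the Bonferroni correction: compare each p-value with 0.05 divided by the number of tests. The sketch below applies it to the p-values copied from the output above.

```python
# p-values copied from the T-test output above.
p_values = {
    "HP": 1.0027e-13,
    "Attack": 2.5204e-16,
    "Defense": 4.8270e-11,
    "Sp. Atk": 1.5515e-21,
    "Sp. Def": 2.2949e-15,
    "Speed": 1.0490e-18,
}

alpha = 0.05
# Bonferroni: divide the significance level by the number of tests performed.
adjusted_alpha = alpha / len(p_values)

still_significant = {stat: p < adjusted_alpha for stat, p in p_values.items()}
print(f"Per-test threshold: {adjusted_alpha:.5f}")
print(still_significant)
```

All six p-values remain below the stricter threshold, so the conclusion does not depend on whether the correction is applied.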



**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [27]:
df_houses = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df_houses.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close to and far from either a hospital or school.
- Choose the proper test and, with 5% significance, comment on your findings.
 

In [29]:
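#1. Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.

# Coordinates for the school and hospital
# Note: the task statement lists the school at (-118, 34) and the hospital at (-122, 37);
# the latitudes used here are swapped relative to it, and the distances printed below reflect these values.
school = {"longitude": -118, "latitude": 37}
hospital = {"longitude": -122, "latitude": 34}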

# Function to calculate the Euclidean distance
def calculate_distance(house, point):
    return np.sqrt((house["latitude"] - point["latitude"]) ** 2 + 
                   (house["longitude"] - point["longitude"]) ** 2)

# Calculate distances from each house to the school and hospital
df_houses["distance_to_school"] = df_houses.apply(lambda house: calculate_distance(house, school), axis=1)
df_houses["distance_to_hospital"] = df_houses.apply(lambda house: calculate_distance(house, hospital), axis=1)

# Print the results
print("Distances from each house to the school:")
print(df_houses[["distance_to_school"]])

print("\nDistances from each house to the hospital:")
print(df_houses[["distance_to_hospital"]])

Distances from each house to the school:
       distance_to_school
0                4.638125
1                4.384165
2                4.773856
3                4.801510
4                4.850753
...                   ...
16995            7.211380
16996            7.275232
16997            7.944533
16998            7.920227
16999            7.270083

[17000 rows x 1 columns]

Distances from each house to the hospital:
       distance_to_hospital
0                  7.692347
1                  7.540617
2                  7.446456
3                  7.438716
4                  7.442432
...                     ...
16995              6.957298
16996              7.064630
16997              8.170410
16998              8.132035
16999              6.949396

[17000 rows x 1 columns]


In [44]:
#2. Divide your dataset into houses close to and far from either a hospital or school.

# Threshold for "close" vs. "far" (note: the task statement specifies 0.50; this cell uses 1)
distance_threshold = 1

# Flag proximity to each point
df_houses["close_to_school"] = df_houses["distance_to_school"] <= distance_threshold
df_houses["close_to_hospital"] = df_houses["distance_to_hospital"] <= distance_threshold

# A house is "close" if it is near either the school or the hospital
df_houses["close"] = df_houses["close_to_school"] | df_houses["close_to_hospital"]
df_houses["category"] = np.where(df_houses["close"], "Close", "Far")

# Print results
print(df_houses[["distance_to_school", "distance_to_hospital", "category"]])
df_houses["close"].value_counts()

       distance_to_school  distance_to_hospital category
0                4.638125              7.692347      Far
1                4.384165              7.540617      Far
2                4.773856              7.446456      Far
3                4.801510              7.438716      Far
4                4.850753              7.442432      Far
...                   ...                   ...      ...
16995            7.211380              6.957298      Far
16996            7.275232              7.064630      Far
16997            7.944533              8.170410      Far
16998            7.920227              8.132035      Far
16999            7.270083              6.949396      Far

[17000 rows x 3 columns]


close
False    16982
True        18
Name: count, dtype: int64
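
A vectorized sketch of the distance and split steps that follows the task statement literally: school at (-118, 34), hospital at (-122, 37), and "close" meaning a distance lower than 0.50. The names `distance_to` and `is_close` and the three demo locations are hypothetical; with different coordinates or a different cutoff the group sizes will differ from the counts shown above.

```python
import numpy as np

# Coordinates as (longitude, latitude), copied from the task statement.
SCHOOL = (-118.0, 34.0)
HOSPITAL = (-122.0, 37.0)
CUTOFF = 0.50  # "close" means a distance strictly lower than this


def distance_to(longitude, latitude, point):
    """Euclidean distance (in degrees) from each location to `point`."""
    lon = np.asarray(longitude, dtype=float)
    lat = np.asarray(latitude, dtype=float)
    return np.hypot(lon - point[0], lat - point[1])


def is_close(longitude, latitude, cutoff=CUTOFF):
    """True where a location is within `cutoff` of the school or the hospital."""
    near_school = distance_to(longitude, latitude, SCHOOL) < cutoff
    near_hospital = distance_to(longitude, latitude, HOSPITAL) < cutoff
    return near_school | near_hospital


# Three made-up locations (hypothetical, not rows from the dataset).
demo_lon = [-118.1, -122.3, -114.31]
demo_lat = [34.2, 37.1, 34.19]
close_mask = is_close(demo_lon, demo_lat)
print(close_mask)
```

Because the functions accept any array-like input, they should also work on whole columns, for example `is_close(df_houses["longitude"], df_houses["latitude"])`, which avoids the row-by-row `apply`.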

In [49]:
#3. Choose the proper test and, with 5% significance, comment on your findings.

from scipy.stats import ttest_ind

# Split dataset
close_houses = df_houses[df_houses["category"] == "Close"]["median_house_value"]
far_houses = df_houses[df_houses["category"] == "Far"]["median_house_value"]

# Perform T-test
t_statistic, p_value = ttest_ind(close_houses, far_houses, equal_var=False)

print(f"T-test Statistic: {t_statistic}, p-value: {p_value}")

# Interpretation
if p_value < 0.05:
    print("Statistically significant: Reject the null hypothesis (Price differs between groups)")
else:
    print("Not statistically significant: Fail to reject the null hypothesis (No significant price difference)")


T-test Statistic: -11.145497346724056, p-value: 2.4917760307821917e-09
Statistically significant: Reject the null hypothesis (Price differs between groups)
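
The hypothesis is directional (houses close to a school or hospital are more expensive), while the test above is two-sided. For a t-test, the one-sided p-value follows from the two-sided one and the sign of the statistic, because the t distribution is symmetric. The sketch below applies that conversion to the values recorded above; the helper name `one_sided_p_greater` is made up for illustration.

```python
def one_sided_p_greater(t_statistic, two_sided_p):
    """p-value for H1: mean(first group) > mean(second group), from a two-sided t-test."""
    half = two_sided_p / 2
    # A positive statistic points in the direction of H1; a negative one points away from it.
    return half if t_statistic > 0 else 1 - half


# Values copied from the recorded output above (first group: Close, second group: Far).
t_statistic = -11.145497346724056
two_sided_p = 2.4917760307821917e-09

p_greater = one_sided_p_greater(t_statistic, two_sided_p)
print(f"One-sided p-value (Close > Far): {p_greater:.6f}")
print("Reject H0" if p_greater < 0.05 else "Fail to reject H0")
```

The statistic is negative, meaning the Close group has the lower sample mean, so for the groups as built in this notebook the data do not support the claim that close houses are more expensive. With SciPy 1.6 or later, passing `alternative="greater"` to `ttest_ind` should give the same one-sided p-value directly.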
