# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab, where we embark on an enlightening journey through the realm of statistical decision-making! In this laboratory, we delve into various scenarios, applying the powerful tools of hypothesis testing to scrutinize and interpret data.

From testing the mean of a single sample (One Sample T-Test), to investigating differences between independent groups (Two Sample T-Test), and exploring relationships within dependent samples (Paired Sample T-Test), our exploration knows no bounds. Furthermore, we'll venture into the realm of Analysis of Variance (ANOVA), unraveling the complexities of comparing means across multiple groups.

So, grab your statistical tools, prepare your hypotheses, and let's embark on this fascinating journey of exploration and discovery in the world of hypothesis testing!

**Challenge 1**

In this challenge, we will be working with Pokemon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [1]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np



In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Dragon-type Pokemon have, on average, higher HP than Grass-type Pokemon. Choose the proper test and, at a 5% significance level, comment on your findings.

In [13]:
# "On average" refers to means (mu) -> t-test
# Two independent samples -> two-sample test
# The claim has a direction -> one-tailed
# The direction is to the right -> alternative='greater'

#Set the hypothesis

# H0: mu_dragon_class <= mu_grass_class
# H1: mu_dragon_class > mu_grass_class

# Set significance level at 5% (95% confidence)
alpha = 0.05

# Collect the data
dragon_class = df[df['Type 1']=='Dragon']['HP']
grass_class = df[df['Type 1']=='Grass']['HP']


# Stats test
t_value_ch1, p_value_ch1 = st.ttest_ind(dragon_class, grass_class, alternative='greater', equal_var=False)
p_value_ch1

# Decision making
print(t_value_ch1, p_value_ch1)
if p_value_ch1 > alpha:
    print("We are not able to reject the null hypothesis")
else:
    print("We reject the null hypothesis")

3.3349632905124063 0.0007993609745420598
We reject the null hypothesis
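
The `equal_var=False` argument used above selects Welch's t-test, which does not assume the two groups share a variance. As a minimal sketch of what that statistic measures, the block below computes Welch's t and the Welch-Satterthwaite degrees of freedom by hand. The helper name `welch_t` and the sample values are assumptions made up for illustration; they are not taken from `pokemon.csv`.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance uses n - 1)
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical HP values, invented for illustration only
group_a = [80, 95, 70, 110, 90]
group_b = [60, 65, 55, 70, 75, 62]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, dof = {dof:.1f}")
```

A positive t means the first group's sample mean is larger, which is the direction that `alternative='greater'` tests for.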


- We posit that Legendary Pokemon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with non-Legendary Pokemon. Choose the proper test and, at a 5% significance level, comment on your findings.


In [24]:
# Several stats to compare -> run one test per stat
# Same idea as the previous question -> two-sample t-test
# Only asking whether the means differ -> two-tailed

#Set the hypothesis

# H0: mu_legendary = mu_non_legendary
# H1: mu_legendary != mu_non_legendary

# Set significance level at 5% (95% confidence)
alpha = 0.05

# Define the stats to compare
stats_columns = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

# Collect the data
legendary = df[df['Legendary']==True]
non_legendary = df[df['Legendary']==False]

# Dictionary to store the results
results = {}


# Perform t-tests for each stat
for stat in stats_columns:
    t_stat_l, p_value_l = st.ttest_ind(legendary[stat], non_legendary[stat], equal_var=False)
    results[stat] = {'t-statistic': t_stat_l, 'p-value': p_value_l}

# Displaying the results
for stat, result in results.items():
    print(f"Stat: {stat}, t-statistic: {result['t-statistic']:.4f}, p-value: {result['p-value']:.4f}")
    # Use the p-value stored for this stat, not the one left over from the last loop iteration
    if result['p-value'] < alpha:
        print('There is a significant difference b/w Legendary & Non-legendary Pokemons\n')
    else:
        print('No Significant Difference b/w Legendary & Non-legendary Pokemons\n')

Stat: HP, t-statistic: 8.9814, p-value: 0.0000
There is a significant difference b/w Legendary & Non-legendary Pokemons

Stat: Attack, t-statistic: 10.4381, p-value: 0.0000
There is a significant difference b/w Legendary & Non-legendary Pokemons

Stat: Defense, t-statistic: 7.6371, p-value: 0.0000
There is a significant difference b/w Legendary & Non-legendary Pokemons

Stat: Sp. Atk, t-statistic: 13.4174, p-value: 0.0000
There is a significant difference b/w Legendary & Non-legendary Pokemons

Stat: Sp. Def, t-statistic: 10.0157, p-value: 0.0000
There is a significant difference b/w Legendary & Non-legendary Pokemons

Stat: Speed, t-statistic: 11.4750, p-value: 0.0000
There is a significant difference b/w Legendary & Non-legendary Pokemons
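
Running six separate tests at alpha = 0.05 raises the chance of at least one false positive. One common safeguard, not part of the original lab, is the Bonferroni correction, which compares each p-value against alpha divided by the number of tests. The sketch below uses hypothetical p-values invented for illustration; with the p-values printed above (all close to zero) the conclusions would not change.

```python
# Hypothetical p-values, invented for illustration (not the lab's results)
p_values = {"HP": 0.004, "Attack": 0.020, "Defense": 0.009, "Speed": 0.300}
alpha = 0.05

# Bonferroni: compare each p-value against alpha / number_of_tests
adjusted_alpha = alpha / len(p_values)

significant = {stat: p < adjusted_alpha for stat, p in p_values.items()}
for stat, is_significant in significant.items():
    verdict = "significant" if is_significant else "not significant"
    print(f"{stat}: p = {p_values[stat]:.3f} -> {verdict} at alpha = {adjusted_alpha:.4f}")
```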



**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [50]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, at a 5% significance level, comment on your findings.
 

In [47]:
# Define the euclidean distance function
def calculate_distance(lon1, lat1, lon2, lat2):
    return np.sqrt((lat2 - lat1) ** 2 + (lon2 - lon1) ** 2)

In [51]:
# Define fixed locations
school_coor = (-118, 37)
hospital_coor = (-122, 34)

# Calculate the distance from each house to the school
df['distance_to_school'] = df.apply(
    lambda row: calculate_distance(school_coor[0], school_coor[1], row['longitude'], row['latitude']),
    axis=1
)

# Calculate the distance from each house to the hospital
df['distance_to_hospital'] = df.apply(
    lambda row: calculate_distance(hospital_coor[0], hospital_coor[1], row['longitude'], row['latitude']),
    axis=1
)
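
The row-wise `df.apply` calls above can also be written as a vectorized NumPy computation, which is usually faster on larger tables. This is a minimal sketch under assumptions: the helper name `distance_to_point` and the sample coordinates are hypothetical, and distances are plain Euclidean distances in degrees, as in the lab.

```python
import numpy as np


def distance_to_point(lon, lat, point):
    """Euclidean distance (in degrees) from arrays of coordinates to one (lon, lat) point."""
    return np.hypot(np.asarray(lon) - point[0], np.asarray(lat) - point[1])


# Hypothetical coordinates, invented for illustration
lon = np.array([-118.0, -118.3, -121.0])
lat = np.array([37.0, 37.3, 34.0])
school_coor = (-118, 37)

distances = distance_to_point(lon, lat, school_coor)
is_close = distances < 0.5  # boolean mask, same threshold as the lab
print(distances, is_close)
```

With the housing DataFrame, the same helper could be called as `distance_to_point(df['longitude'], df['latitude'], school_coor)`.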

In [54]:
# Define threshold
threshold = 0.5

# Divide your dataset into houses close and far from either a hospital or school
close_df = df[(df['distance_to_school']< threshold) | (df['distance_to_hospital']< threshold)] 
far_df = df[(df['distance_to_school'] >= threshold) & (df['distance_to_hospital'] >= threshold)]


In [57]:
# We posit that houses close to either a school or a hospital are more expensive

# Comparing the mean of median_house_value between two independent groups (close vs. far)
# with a directional claim -> one-tailed two-sample t-test

#Set the hypothesis

# H0: mu_close_median_house_value <= mu_far_median_house_value
# H1: mu_close_median_house_value > mu_far_median_house_value

# Set significance level at 5% (95% confidence)
alpha = 0.05

#Get only the median house values for these groups
close_prices = close_df['median_house_value']
far_prices = far_df['median_house_value']

# Stats test
t_value_ch2, p_value_ch2 = st.ttest_ind(close_prices, far_prices, alternative='greater', equal_var=False)
p_value_ch2

# Decision making
print(t_value_ch2, p_value_ch2)
if p_value_ch2 > alpha:
    print("We are not able to reject the null hypothesis")
else:
    print("We reject the null hypothesis")

-17.174167998688404 0.9999738999071939
We are not able to reject the null hypothesis
