# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab. In this lab, we apply hypothesis tests to several scenarios and interpret the results.

We cover testing the mean of a single sample (One Sample T-Test), comparing two independent groups (Two Sample T-Test), and comparing dependent samples (Paired Sample T-Test). We also use Analysis of Variance (ANOVA) to compare means across more than two groups.

So grab your statistical tools, state your hypotheses, and let's get started!

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [23]:
#libraries
import pandas as pd
import numpy as np
from scipy import stats



In [24]:
pokemon_data = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
pokemon_data

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [25]:
pokemon_data.shape

(800, 11)

In [26]:
# Print the number of unique values in each column
unique_values = pokemon_data.nunique()
print(unique_values)

Name          799
Type 1         18
Type 2         18
HP             94
Attack        111
Defense       103
Sp. Atk       105
Sp. Def        92
Speed         108
Generation      6
Legendary       2
dtype: int64


In [27]:
# Print the unique values of each column
for column in pokemon_data.columns:
    unique_values = pokemon_data[column].unique()
    print(f"Column: {column}")
    print(f"Unique values ({len(unique_values)}): {unique_values}")

Column: Name
Unique values (800): ['Bulbasaur' 'Ivysaur' 'Venusaur' 'Mega Venusaur' 'Charmander'
 'Charmeleon' 'Charizard' 'Mega Charizard X' 'Mega Charizard Y' 'Squirtle'
 'Wartortle' 'Blastoise' 'Mega Blastoise' 'Caterpie' 'Metapod'
 'Butterfree' 'Weedle' 'Kakuna' 'Beedrill' 'Mega Beedrill' 'Pidgey'
 'Pidgeotto' 'Pidgeot' 'Mega Pidgeot' 'Rattata' 'Raticate' 'Spearow'
 'Fearow' 'Ekans' 'Arbok' 'Pikachu' 'Raichu' 'Sandshrew' 'Sandslash'
 'Nidoran♀' 'Nidorina' 'Nidoqueen' 'Nidoran♂' 'Nidorino' 'Nidoking'
 'Clefairy' 'Clefable' 'Vulpix' 'Ninetales' 'Jigglypuff' 'Wigglytuff'
 'Zubat' 'Golbat' 'Oddish' 'Gloom' 'Vileplume' 'Paras' 'Parasect'
 'Venonat' 'Venomoth' 'Diglett' 'Dugtrio' 'Meowth' 'Persian' 'Psyduck'
 'Golduck' 'Mankey' nan 'Growlithe' 'Arcanine' 'Poliwag' 'Poliwhirl'
 'Poliwrath' 'Abra' 'Kadabra' 'Alakazam' 'Mega Alakazam' 'Machop'
 'Machoke' 'Machamp' 'Bellsprout' 'Weepinbell' 'Victreebel' 'Tentacool'
 'Tentacruel' 'Geodude' 'Graveler' 'Golem' 'Ponyta' 'Rapidash' 'Slowpoke'
 

In [28]:
# Check whether any Pokémon has the same value in Type 1 and Type 2
pokemon_data['Same Type'] = pokemon_data['Type 1'] == pokemon_data['Type 2']

# Verify results
print(pokemon_data["Same Type"].value_counts())

Same Type
False    800
Name: count, dtype: int64


- We posit that Dragon-type Pokémon have, on average, higher HP than Grass-type Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.

In [29]:
# Set the hypotheses (the claim is directional, so the test is one-sided)

# H0: mu_hp_dragon <= mu_hp_grass
# H1: mu_hp_dragon > mu_hp_grass

# Significance level = 0.05

# Select HP for Pokémon whose Type 1 or Type 2 is Dragon, and likewise for Grass
dragon_hp = pokemon_data[(pokemon_data['Type 1'] == 'Dragon') | (pokemon_data['Type 2'] == 'Dragon')]['HP']
grass_hp = pokemon_data[(pokemon_data['Type 1'] == 'Grass') | (pokemon_data['Type 2'] == 'Grass')]['HP']


# Perform Welch's t-test (unequal variances); ttest_ind returns a two-sided p-value
t_stat, p_value = stats.ttest_ind(dragon_hp, grass_hp, equal_var=False)

# Results
print(f"Test Statistic: {t_stat}")
print(f"P-Value: {p_value}")

# One-sided decision: the t-statistic must be positive and
# half of the two-sided p-value must be below alpha
alpha = 0.05
if t_stat > 0 and p_value / 2 < alpha:
    print("Reject the null hypothesis: Dragon Pokémon have higher HP on average than Grass Pokémon.")
else:
    print("Fail to reject the null hypothesis: No evidence that Dragon Pokémon have higher HP on average than Grass Pokémon.")



Test Statistic: 4.097528915272702
P-Value: 0.00010181538122353851
Reject the null hypothesis: Dragon Pokémon have higher HP on average than Grass Pokémon.
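
To make the effect of `equal_var=False` more concrete, here is a minimal sketch of how Welch's t-statistic and its approximate degrees of freedom (the Welch-Satterthwaite formula) can be computed by hand. The helper name `welch_t` and the two small samples are hypothetical and only serve as an illustration; they are not taken from the dataset, and this is not how SciPy is implemented internally.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t-statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance uses n - 1)
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    # Difference of means divided by the combined standard error
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Approximate degrees of freedom when the variances are not assumed equal
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, dof


# Hypothetical HP values, for illustration only
group_a = [80, 95, 70, 110, 90]
group_b = [60, 65, 55, 70, 75]

t_stat_example, dof_example = welch_t(group_a, group_b)
print(f"t = {t_stat_example:.3f}, dof = {dof_example:.2f}")
```

A positive t-statistic means the first sample has the larger mean, which is why the sign matters when the claim is directional.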


- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.


In [30]:
# Set the hypotheses (tested separately for each stat)

# H0: mu_legendary_stat = mu_nonlegendary_stat
# H1: mu_legendary_stat != mu_nonlegendary_stat

# Significance level = 0.05
alpha = 0.05

# Separate Legendary and Non-Legendary Pokémon
legendary_stats = pokemon_data[pokemon_data['Legendary'] == True][['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]
non_legendary_stats = pokemon_data[pokemon_data['Legendary'] == False][['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]

# Collect the test statistic and p-value for each stat
results = {}

for stat in legendary_stats.columns:
    # Welch's t-test (unequal variances) for this stat
    test_stat, p_value = stats.ttest_ind(legendary_stats[stat], non_legendary_stats[stat], equal_var=False)
    results[stat] = {'Statistic': test_stat, 'P-Value': p_value}

    # Print results
    print("Results:")
    if p_value < alpha:
        print(f"Reject the null hypothesis for {stat}: Legendary Pokémon have significantly different {stat} compared to Non-Legendary Pokémon (p-value={p_value:.4f}).")
    else:
        print(f"Fail to reject the null hypothesis for {stat}: No significant difference in {stat} between Legendary and Non-Legendary Pokémon (p-value={p_value:.4f}).")


Results:
Reject the null hypothesis for HP: Legendary Pokémon have significantly different HP compared to Non-Legendary Pokémon (p-value=0.0000).
Results:
Reject the null hypothesis for Attack: Legendary Pokémon have significantly different Attack compared to Non-Legendary Pokémon (p-value=0.0000).
Results:
Reject the null hypothesis for Defense: Legendary Pokémon have significantly different Defense compared to Non-Legendary Pokémon (p-value=0.0000).
Results:
Reject the null hypothesis for Sp. Atk: Legendary Pokémon have significantly different Sp. Atk compared to Non-Legendary Pokémon (p-value=0.0000).
Results:
Reject the null hypothesis for Sp. Def: Legendary Pokémon have significantly different Sp. Def compared to Non-Legendary Pokémon (p-value=0.0000).
Results:
Reject the null hypothesis for Speed: Legendary Pokémon have significantly different Speed compared to Non-Legendary Pokémon (p-value=0.0000).
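
Because six separate tests are run at the 5% level, the chance of at least one false positive across the whole family is higher than 5%. One common (and conservative) way to account for this is a Bonferroni correction, sketched below. The function name `bonferroni` and the p-values are hypothetical and for illustration only; they are not the results of this lab, where every printed p-value rounds to 0.0000.

```python
def bonferroni(p_values, alpha=0.05):
    """Return per-test reject decisions and the Bonferroni-adjusted threshold."""
    # Divide the family-wise significance level by the number of tests
    adjusted_alpha = alpha / len(p_values)
    decisions = {name: p < adjusted_alpha for name, p in p_values.items()}
    return decisions, adjusted_alpha


# Hypothetical p-values, for illustration only
example_p = {
    "HP": 0.0004,
    "Attack": 0.02,
    "Defense": 0.009,
    "Sp. Atk": 0.0001,
    "Sp. Def": 0.03,
    "Speed": 0.2,
}

decisions, adjusted_alpha = bonferroni(example_p)
print(f"Adjusted alpha: {adjusted_alpha:.5f}")
for name, reject in decisions.items():
    print(f"{name}: {'reject H0' if reject else 'fail to reject H0'}")
```

In this sketch, a p-value such as 0.02 would pass the unadjusted 5% threshold but not the adjusted one.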


**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [31]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if its distance to either one is less than 0.50 (measured in the same units as the coordinates).

Hint:
- Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close to and far from either a hospital or a school.
- Choose the proper test and, at the 5% significance level, comment on your findings.
 

In [32]:
# Define Euclidean distance function
def euclidean_distance(x1, y1, x2, y2):
    return np.sqrt((x2 - x1)**2 + (y2 - y1)**2)

In [33]:
# Coordinates of the school and hospital
school_coords = (-118, 37)
hospital_coords = (-122, 34)

In [36]:
# Calculate distances to school and hospital for each house
df['distance_to_school'] = df.apply(
    lambda row: euclidean_distance(row['longitude'], row['latitude'], *school_coords), axis=1
)

df['distance_to_hospital'] = df.apply(
    lambda row: euclidean_distance(row['longitude'], row['latitude'], *hospital_coords), axis=1
)

In [50]:
# Classify houses based on the distance
df['close_to_school_or_hospital'] = df.apply(
    lambda row: 1 if row['distance_to_school'] < 0.50 or row['distance_to_hospital'] < 0.50 else 0, axis=1
)
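
As a quick sanity check of the classification rule, the sketch below applies the same distance threshold to two individual points using only the standard library. The helper name `is_close` and the first test point are hypothetical; the second point is the first row shown in the dataset preview.

```python
import math

SCHOOL = (-118, 37)
HOSPITAL = (-122, 34)
THRESHOLD = 0.50


def is_close(longitude, latitude):
    """Return True if the point is within THRESHOLD of the school or the hospital."""
    d_school = math.hypot(longitude - SCHOOL[0], latitude - SCHOOL[1])
    d_hospital = math.hypot(longitude - HOSPITAL[0], latitude - HOSPITAL[1])
    return min(d_school, d_hospital) < THRESHOLD


# Hypothetical point: distance to the school is sqrt(0.3**2 + 0.2**2), about 0.36
print(is_close(-118.3, 37.2))    # True

# First row of the dataset preview: several units away from both locations
print(is_close(-114.31, 34.19))  # False
```

Taking the minimum of the two distances is equivalent to the `or` condition used in the cell above.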

In [54]:
# Split the dataset into two groups: close to school/hospital and far from them
close_houses = df[df['close_to_school_or_hospital'] == 1]
far_houses = df[df['close_to_school_or_hospital'] == 0]

# Set the hypotheses

# H0: mu_close_houses = mu_far_houses
# H1: mu_close_houses != mu_far_houses
# The claim "more expensive" is supported only if H0 is rejected
# and the t-statistic is positive (close mean > far mean).

# Significance level = 0.05
alpha = 0.05

# Perform a two-sample t-test on the mean of median_house_value in each group
t_stat, p_value = stats.ttest_ind(close_houses['median_house_value'], far_houses['median_house_value'])

# Print the results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Conclusion based on the p-value and the sign of the t-statistic
if p_value >= alpha:
    print("Fail to reject the null hypothesis: No significant price difference.")
elif t_stat > 0:
    print("Reject the null hypothesis: Houses close to schools or hospitals are more expensive.")
else:
    print("Reject the null hypothesis: Houses close to schools or hospitals are less expensive.")

T-statistic: -2.2146147257665834
P-value: 0.026799733071128685
Reject the null hypothesis: Houses close to schools or hospitals are less expensive.
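
Both claims in this lab are directional ("more HP", "more expensive"), while `ttest_ind` reports a two-sided p-value by default. Since the t distribution is symmetric, a two-sided p-value can be converted to a one-sided one using the sign of the t-statistic. The sketch below does this for the two results printed above; the helper name `one_sided_p` is hypothetical.

```python
def one_sided_p(t_stat, p_two_sided, alternative="greater"):
    """Convert a two-sided t-test p-value into a one-sided p-value."""
    half = p_two_sided / 2
    if alternative == "greater":
        # Evidence for "first group mean is larger" needs a positive t
        return half if t_stat > 0 else 1 - half
    if alternative == "less":
        return half if t_stat < 0 else 1 - half
    raise ValueError("alternative must be 'greater' or 'less'")


# Values reported in the outputs of Challenge 1 (Dragon vs. Grass HP) and Challenge 2
p_dragon = one_sided_p(4.097528915272702, 0.00010181538122353851)
p_housing = one_sided_p(-2.2146147257665834, 0.026799733071128685)

print(f"Dragon HP > Grass HP: one-sided p = {p_dragon:.6f}")
print(f"Close houses > far houses: one-sided p = {p_housing:.4f}")
```

Under this conversion, the Dragon claim remains significant at the 5% level, whereas the negative t-statistic in the housing test gives a one-sided p-value close to 1, so the data do not support the claim that close houses are more expensive.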
