# Lab | Hypothesis Testing

**Objective**

This lab applies hypothesis testing to real data sets: you state a null and an alternative hypothesis, choose a suitable test, and interpret the result at a given significance level.

The tests covered are the one-sample t-test (mean of a single sample), the two-sample t-test (difference between independent groups), the paired-sample t-test (differences within dependent samples), and Analysis of Variance (ANOVA) for comparing means across more than two groups.

For each exercise, write down your hypotheses before you run the test.

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [1]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np
from scipy.stats import ttest_ind

In [2]:
pokemon_data = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
pokemon_data

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Dragon-type Pokémon have, on average, higher HP than Grass-type Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.

In [3]:
# H0: The mean HP of Dragon-type Pokémon is less than or equal to the mean HP of Grass-type Pokémon.
# H1: The mean HP of Dragon-type Pokémon is greater than the mean HP of Grass-type Pokémon.

# Collecting data
dragon_hp = pokemon_data[pokemon_data['Type 1'] == 'Dragon']['HP']
grass_hp = pokemon_data[pokemon_data['Type 1'] == 'Grass']['HP']

# Perform two-sample t-test (one-tailed test for greater mean)
t_stat, p_value = st.ttest_ind(dragon_hp, grass_hp, alternative='greater', equal_var=False)

# Results
print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")

# Decision
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The HP of Dragon-type Pokémon is significantly higher on average than Grass-type Pokémon.")
else:
    print("Fail to reject the null hypothesis: The HP of Dragon-type Pokémon is not significantly higher on average than Grass-type Pokémon.")

T-statistic: 3.3350, P-value: 0.0008
Reject the null hypothesis: The HP of Dragon-type Pokémon is significantly higher on average than Grass-type Pokémon.
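The call above passes `equal_var=False`, which selects Welch's t-test. As a rough illustration of what that computes, the sketch below rebuilds the one-sided Welch statistic and p-value by hand and compares them with `ttest_ind`. The two samples are made-up random numbers standing in for the HP columns, not the Pokémon data.

```python
import numpy as np
from scipy import stats as st

# Made-up samples standing in for two HP columns (NOT the real data)
rng = np.random.default_rng(0)
a = rng.normal(83, 24, size=32)
b = rng.normal(67, 19, size=70)

# Welch's t statistic: difference in means over the unpooled standard error
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation of the degrees of freedom
df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# One-sided p-value for H1: mean(a) > mean(b)
p_manual = st.t.sf(t_manual, df)

# Same test through SciPy
t_scipy, p_scipy = st.ttest_ind(a, b, equal_var=False, alternative='greater')

print(f"manual: t={t_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.4f}")
```

The two lines should agree up to floating-point error, which shows that the one-sided p-value is the upper-tail area of a t distribution with the Welch degrees of freedom.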


- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.


In [4]:
# H0: For each stat, Legendary and Non-Legendary Pokémon have the same mean.
# H1: For each stat, the means of the two groups differ (two-sided).
# The first loop below also runs the directional version (H1: the Legendary mean is greater);
# the second loop runs the two-sided test that matches the statement above.

legendary = pokemon_data[pokemon_data['Legendary'] == True]
non_legendary = pokemon_data[pokemon_data['Legendary'] == False]

stats = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

# Perform one-tailed t-tests for each stat
results = {}
for stat in stats:
    t_stat, p_value = st.ttest_ind(legendary[stat], non_legendary[stat], alternative='greater', equal_var=False)
    results[stat] = {'t_statistic': t_stat, 'p_value': p_value}

# Print the results
alpha = 0.05
for stat, result in results.items():
    print(f"{stat}:")
    print(f"  T-statistic: {result['t_statistic']:.4f}")
    print(f"  P-value: {result['p_value']:.4f}")
    if result['p_value'] < alpha:
        print(f"  Reject the null hypothesis: The mean {stat} of Legendary Pokémon is significantly greater than the mean {stat} of Non-Legendary Pokémon.")
    else:
        print(f"  Fail to reject the null hypothesis: The mean {stat} of Legendary Pokémon is not significantly greater than the mean {stat} of Non-Legendary Pokémon.")
print("\nHypothesis 2: Legendary vs. Non-Legendary Stats")

for stat in stats:
    t_stat, p_value = ttest_ind(legendary[stat], non_legendary[stat])
    print(f"{stat} - T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")
    if p_value < 0.05:
        print(f"Reject the null hypothesis: There is a significant difference in {stat} between Legendary and Non-Legendary Pokemons.")
    else:
        print(f"Fail to reject the null hypothesis: No significant difference in {stat} between Legendary and Non-Legendary Pokemons.")

HP:
  T-statistic: 8.9814
  P-value: 0.0000
  Reject the null hypothesis: The mean HP of Legendary Pokémon is significantly greater than the mean HP of Non-Legendary Pokémon.
Attack:
  T-statistic: 10.4381
  P-value: 0.0000
  Reject the null hypothesis: The mean Attack of Legendary Pokémon is significantly greater than the mean Attack of Non-Legendary Pokémon.
Defense:
  T-statistic: 7.6371
  P-value: 0.0000
  Reject the null hypothesis: The mean Defense of Legendary Pokémon is significantly greater than the mean Defense of Non-Legendary Pokémon.
Sp. Atk:
  T-statistic: 13.4174
  P-value: 0.0000
  Reject the null hypothesis: The mean Sp. Atk of Legendary Pokémon is significantly greater than the mean Sp. Atk of Non-Legendary Pokémon.
Sp. Def:
  T-statistic: 10.0157
  P-value: 0.0000
  Reject the null hypothesis: The mean Sp. Def of Legendary Pokémon is significantly greater than the mean Sp. Def of Non-Legendary Pokémon.
Speed:
  T-statistic: 11.4750
  P-value: 0.0000
  Reject the null
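The per-stat loop can also be written as a single call, since `ttest_ind` works column-wise on 2-D input (`axis=0` by default). The sketch below shows that pattern on made-up data. The frames `group_a` and `group_b` and their values are hypothetical stand-ins for the Legendary and Non-Legendary subsets.

```python
import numpy as np
import pandas as pd
from scipy import stats as st

# Hypothetical stand-ins for the two groups (random numbers, NOT the real data)
rng = np.random.default_rng(1)
stats_cols = ['HP', 'Attack', 'Defense']
group_a = pd.DataFrame(rng.normal(90, 20, size=(40, 3)), columns=stats_cols)
group_b = pd.DataFrame(rng.normal(70, 20, size=(200, 3)), columns=stats_cols)

# One two-sided Welch t-test per column (axis=0 is the default)
res = st.ttest_ind(
    group_a[stats_cols].to_numpy(),
    group_b[stats_cols].to_numpy(),
    equal_var=False,
)

# Collect the results in a table, one row per stat
alpha = 0.05
summary = pd.DataFrame(
    {'t_statistic': res.statistic, 'p_value': res.pvalue},
    index=stats_cols,
)
summary['reject_h0'] = summary['p_value'] < alpha
print(summary)
```

A table like this keeps the statistic, the p-value, and the decision for every stat in one place, which is easier to scan than repeated print statements.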

**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [5]:
california_data = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
california_data.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close to and far from either a hospital or a school.
- Choose the proper test and, at the 5% significance level, comment on your findings.
 

In [6]:
# Define the coordinates for the school and hospital
school_coords = (-118, 37)
hospital_coords = (-122, 34)

# Function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital
def euclidean_distance(lat1, lon1, lat2, lon2):
    return np.sqrt((lat1 - lat2)**2 + (lon1 - lon2)**2)

# Calculate the distances
california_data['distance_to_school'] = euclidean_distance(california_data['latitude'], california_data['longitude'], school_coords[1], school_coords[0])
california_data['distance_to_hospital'] = euclidean_distance(california_data['latitude'], california_data['longitude'], hospital_coords[1], hospital_coords[0])

# Display the first few rows with the new distance columns
print(california_data[['latitude', 'longitude', 'distance_to_school', 'distance_to_hospital']].head())

   latitude  longitude  distance_to_school  distance_to_hospital
0     34.19    -114.31            4.638125              7.692347
1     34.40    -114.47            4.384165              7.540617
2     33.69    -114.56            4.773856              7.446456
3     33.64    -114.57            4.801510              7.438716
4     33.57    -114.57            4.850753              7.442432


In [7]:
# Define the threshold
threshold = 0.50

# Classify houses
california_data['close_to_school_or_hospital'] = (california_data['distance_to_school'] <= threshold) | (california_data['distance_to_hospital'] <= threshold)

# Display the classification
print(california_data[['distance_to_school', 'distance_to_hospital', 'close_to_school_or_hospital']].head())

   distance_to_school  distance_to_hospital  close_to_school_or_hospital
0            4.638125              7.692347                        False
1            4.384165              7.540617                        False
2            4.773856              7.446456                        False
3            4.801510              7.438716                        False
4            4.850753              7.442432                        False
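The first five rows are all far from both landmarks, so the output above does not show the `True` branch of the classification. The sketch below runs the same logic on three made-up points, two of which lie within the 0.50 threshold. The `points` frame and `landmarks` dictionary are hypothetical names. The sketch uses `np.hypot` for the distance and a strict `<` to match "lower than 0.50" in the task statement.

```python
import numpy as np
import pandas as pd

# Three made-up locations (NOT rows from the housing data)
points = pd.DataFrame({
    'longitude': [-118.2, -122.0, -120.0],
    'latitude': [37.3, 34.2, 35.5],
})

# Landmark coordinates from the task statement, as (longitude, latitude)
landmarks = {'school': (-118, 37), 'hospital': (-122, 34)}
threshold = 0.50

# Euclidean distance in degrees from every point to each landmark
for name, (lon, lat) in landmarks.items():
    points[f'distance_to_{name}'] = np.hypot(
        points['longitude'] - lon, points['latitude'] - lat
    )

# A point is "close" if it is within the threshold of at least one landmark
distance_cols = ['distance_to_school', 'distance_to_hospital']
points['close'] = (points[distance_cols] < threshold).any(axis=1)

print(points)
print(points['close'].value_counts())
```

Running `value_counts()` on the real `close_to_school_or_hospital` column is a quick way to confirm that both groups are non-empty before testing.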


In [8]:
# Separate the data into 'close' and 'far' groups
close_group = california_data[california_data['close_to_school_or_hospital'] == True]['median_house_value']
far_group = california_data[california_data['close_to_school_or_hospital'] == False]['median_house_value']

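# H0: The mean house value close to a school or hospital is less than or equal to the mean house value far from both.
# H1: The mean house value close to a school or hospital is greater.
# Perform a one-tailed Welch t-test (unequal variances)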
t_stat, p_value = st.ttest_ind(close_group, far_group, equal_var=False, alternative='greater')

# Print the results
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

T-statistic: -17.1742
P-value: 1.0000
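The cell above stops at the test statistic, so the decision step is sketched here. With a p-value of 1.0000 against a 5% significance level, the null hypothesis is not rejected. The negative t-statistic also shows that the close group has the lower sample mean, the opposite of what was posited. The helper `decide` is a hypothetical name. The random samples exist only to show that, for the same data, the `'greater'` and `'less'` p-values sum to 1.

```python
import numpy as np
from scipy import stats as st

def decide(p_value, alpha=0.05):
    """Return the test decision for a given p-value and significance level."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# p-value reported in the output above
print(decide(1.0000))

# Made-up samples (NOT the housing data) where the first group has the lower mean
rng = np.random.default_rng(2)
close = rng.normal(180, 50, size=30)
far = rng.normal(210, 60, size=300)

# The two one-sided alternatives are complementary tail areas
p_greater = st.ttest_ind(close, far, equal_var=False, alternative='greater').pvalue
p_less = st.ttest_ind(close, far, equal_var=False, alternative='less').pvalue
print(f"p_greater={p_greater:.4f}, p_less={p_less:.4f}, sum={p_greater + p_less:.4f}")
```

A p-value near 1 for `alternative='greater'` therefore suggests evidence in the opposite direction, which a two-sided or `'less'` test would be needed to confirm.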
