# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab. In this lab we apply hypothesis tests to several datasets and interpret the results at a fixed significance level.

The tests covered are the One Sample T-Test (mean of a single sample), the Two Sample T-Test (difference between independent groups), the Paired Sample T-Test (difference within dependent samples), and Analysis of Variance (ANOVA) for comparing means across more than two groups.

For each challenge, state your hypotheses, choose the test that matches the data, and comment on what the result does and does not show.

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [54]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np



In [56]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

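| | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | Mega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 795 | Diancie | Rock | Fairy | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True |
| 796 | Mega Diancie | Rock | Fairy | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True |
| 797 | Hoopa Confined | Psychic | Ghost | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True |
| 798 | Hoopa Unbound | Psychic | Dark | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |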


- We posit that Dragon-type Pokémon have, on average, higher HP than Grass-type Pokémon. Choose the proper test and, at a 5% significance level, comment on your findings.

In [59]:
# H0: the mean HP of Dragon-type Pokémon is equal to the mean HP of Grass-type Pokémon.
# H1: the mean HP of Dragon-type Pokémon is higher than the mean HP of Grass-type Pokémon.
# We use a two-sample t-test for independent samples because Dragon and Grass Pokémon form two independent groups.
# The test is one-sided (alternative='greater') because H1 states a direction.
# We choose a significance level of 5%, i.e. alpha = 0.05.
# If the p-value is less than 0.05, we reject H0. Otherwise, we fail to reject it.
# Note: groups are selected by primary type ('Type 1') only.

hp_dragon = df[df['Type 1'] == 'Dragon']['HP']
hp_grass = df[df['Type 1'] == 'Grass']['HP']


t_statistic, p_value = st.ttest_ind(hp_dragon, hp_grass, alternative='greater')

print("T-Statistic:", t_statistic)
print("P-Value:", p_value)


if p_value < 0.05:
    print("Reject H0: Dragon Pokémon have, on average, more HP than Grass Pokémon.")
else:
    print("Fail to reject H0: we cannot conclude that Dragon Pokémon have more HP than Grass Pokémon.")

T-Statistic: 3.590444254130357
P-Value: 0.0002567969150153481
Reject H0: Dragon Pokémon have, on average, more HP than Grass Pokémon.
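
The call above uses the default `equal_var=True`, which pools the variances of the two groups. When group sizes and spreads differ, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below compares the two variants on hypothetical, randomly generated samples (the values, sizes, and names `group_a` and `group_b` are assumptions for illustration, not the Pokémon data):

```python
import numpy as np
from scipy import stats

# Hypothetical samples standing in for two groups of unequal size and spread.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=85, scale=25, size=32)
group_b = rng.normal(loc=67, scale=19, size=70)

# Student's t-test pools the two variances; Welch's t-test does not.
student = stats.ttest_ind(group_a, group_b, alternative="greater")
welch = stats.ttest_ind(group_a, group_b, equal_var=False, alternative="greater")

print(f"Student: t={student.statistic:.3f}, p={student.pvalue:.5f}")
print(f"Welch:   t={welch.statistic:.3f}, p={welch.pvalue:.5f}")
```

If both variants lead to the same decision at the chosen significance level, the conclusion does not hinge on the equal-variance assumption.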


- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, at a 5% significance level, comment on your findings.

In [62]:
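# For each stat: H0 = equal means, H1 = different means (two-sided).
# We run one Welch's t-test (equal_var=False) per stat.
# Separate Legendary and Non-Legendary Pokémon
legendary = df[df['Legendary']]
non_legendary = df[~df['Legendary']]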

# Statistics to compare
stats_to_compare = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

# Dictionary to store results
results = {}

for stat in stats_to_compare:
    t_stat, p_value = st.ttest_ind(legendary[stat], non_legendary[stat], equal_var=False)
    results[stat] = (t_stat, p_value)

for stat, (t_stat, p_value) in results.items():
    print(f"Stat: {stat}, T-Statistic: {t_stat}, P-Value: {p_value}")

    if p_value < 0.05:
        print(f"Reject the null hypothesis for {stat}. There is a significant difference in {stat} between Legendary and Non-Legendary Pokémon.\n")
    else:
        print(f"Fail to reject the null hypothesis for {stat}. There is no significant difference in {stat} between Legendary and Non-Legendary Pokémon.\n")


Stat: HP, T-Statistic: 8.981370483625046, P-Value: 1.0026911708035284e-13
Reject the null hypothesis for HP. There is a significant difference in HP between Legendary and Non-Legendary Pokémon.

Stat: Attack, T-Statistic: 10.438133539322203, P-Value: 2.520372449236646e-16
Reject the null hypothesis for Attack. There is a significant difference in Attack between Legendary and Non-Legendary Pokémon.

Stat: Defense, T-Statistic: 7.637078164784618, P-Value: 4.8269984949193316e-11
Reject the null hypothesis for Defense. There is a significant difference in Defense between Legendary and Non-Legendary Pokémon.

Stat: Sp. Atk, T-Statistic: 13.417449984138461, P-Value: 1.5514614112239812e-21
Reject the null hypothesis for Sp. Atk. There is a significant difference in Sp. Atk between Legendary and Non-Legendary Pokémon.

Stat: Sp. Def, T-Statistic: 10.015696613114878, P-Value: 2.2949327864052826e-15
Reject the null hypothesis for Sp. Def. There is a significant difference in Sp. Def between Legendary and Non-Legendary Pokémon.
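
Because six separate tests are run at the 5% level, the chance of at least one false positive across them is higher than 5%. One simple, conservative adjustment is the Bonferroni correction, which compares each p-value with `alpha / n_tests`. The sketch below applies it to the p-values printed above; it is an optional extra check, not part of the original exercise, and the Speed result is left out because it does not appear in the output shown.

```python
# p-values copied from the printed output above (Speed is not shown there, so it is omitted).
p_values = {
    "HP": 1.0026911708035284e-13,
    "Attack": 2.520372449236646e-16,
    "Defense": 4.8269984949193316e-11,
    "Sp. Atk": 1.5514614112239812e-21,
    "Sp. Def": 2.2949327864052826e-15,
}

alpha = 0.05
n_tests = 6  # six stats are tested in the loop, including Speed
threshold = alpha / n_tests  # Bonferroni-adjusted significance level

# A stat stays significant only if its p-value is below the adjusted threshold.
significant = {stat: p < threshold for stat, p in p_values.items()}

for stat, is_significant in significant.items():
    print(f"{stat}: p={p_values[stat]:.3e}, significant after Bonferroni: {is_significant}")
```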

**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [42]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
| 1 | -114.47 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.82 | 80100.0 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.925 | 65500.0 |


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

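We consider a house (neighborhood) to be close to a school or hospital if its Euclidean distance to that point is less than 0.50, measured in the same units as the coordinates (degrees of longitude and latitude).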

Hint:
- Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses that are close to either a hospital or a school and houses that are far from both.
- Choose the proper test and, at a 5% significance level, comment on your findings.
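
As a quick illustration of the first two hints, the sketch below flags points as close or far on a tiny demo table. The helper `distance_to`, the table `demo`, and its second and third rows are hypothetical; only the school and hospital coordinates, the 0.50 threshold, and the first row (taken from the data preview above) come from the exercise.

```python
import numpy as np
import pandas as pd

SCHOOL = (-118, 37)      # (longitude, latitude) from the exercise
HOSPITAL = (-122, 34)    # (longitude, latitude) from the exercise
THRESHOLD = 0.50

def distance_to(lon, lat, point):
    """Euclidean distance from (lon, lat) to point, in coordinate units."""
    return np.hypot(lon - point[0], lat - point[1])

# First row is the first house in the dataset preview; the other two are made up.
demo = pd.DataFrame({
    "longitude": [-114.31, -118.30, -121.90],
    "latitude": [34.19, 37.20, 34.10],
})

# A point is close if it is within the threshold of the school OR the hospital.
demo["is_close"] = (
    (distance_to(demo["longitude"], demo["latitude"], SCHOOL) < THRESHOLD)
    | (distance_to(demo["longitude"], demo["latitude"], HOSPITAL) < THRESHOLD)
)
print(demo)
```

`np.hypot` computes the same value as `np.sqrt(dx**2 + dy**2)` and works element-wise on whole columns.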
 

In [50]:
import numpy as np
import pandas as pd
from scipy import stats

# Load the data
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv'
df = pd.read_csv(url)

# Show the first rows of the DataFrame
print(df.head())

# Coordinates of the school and the hospital (longitude, latitude)
school_coords = (-118, 37)
hospital_coords = (-122, 34)

# Euclidean distance between the points (x1, y1) and (x2, y2)
def calculate_distance(x1, y1, x2, y2):
    return np.sqrt((x2 - x1)**2 + (y2 - y1)**2)

# Distance from each house to the school and to the hospital
df['distance_to_school'] = calculate_distance(df['longitude'], df['latitude'], *school_coords)
df['distance_to_hospital'] = calculate_distance(df['longitude'], df['latitude'], *hospital_coords)

# A house is close if either distance is less than 0.50
df['is_close'] = (df['distance_to_school'] < 0.50) | (df['distance_to_hospital'] < 0.50)

# Split the house values into close and far groups
close_houses = df[df['is_close']]['median_house_value']
far_houses = df[~df['is_close']]['median_house_value']

# Two-sample Welch's t-test (two-sided by default)
t_stat, p_value = stats.ttest_ind(close_houses, far_houses, equal_var=False)

print(f"T-Statistic: {t_stat}, P-Value: {p_value}")

# Interpretation. The test is two-sided, so it detects a difference in either direction;
# the sign of t_stat gives the direction (negative: the close group has the lower mean).
if p_value < 0.05:
    print("Reject the null hypothesis: there is a significant difference in house prices between houses close to a school or hospital and those that are not.")
else:
    print("Fail to reject the null hypothesis: there is no significant difference in house prices between houses close to a school or hospital and those that are not.")



   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -114.31     34.19                15.0       5612.0          1283.0   
1    -114.47     34.40                19.0       7650.0          1901.0   
2    -114.56     33.69                17.0        720.0           174.0   
3    -114.57     33.64                14.0       1501.0           337.0   
4    -114.57     33.57                20.0       1454.0           326.0   

   population  households  median_income  median_house_value  
0      1015.0       472.0         1.4936             66900.0  
1      1129.0       463.0         1.8200             80100.0  
2       333.0       117.0         1.6509             85700.0  
3       515.0       226.0         3.1917             73400.0  
4       624.0       262.0         1.9250             65500.0  
T-Statistic: -17.174167998688404, P-Value: 5.220018561223529e-05
Reject the null hypothesis: there is a significant difference in house prices between houses close to a school or hospital and those that are not.
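
The posited claim is directional (close houses are more expensive), while the test above is two-sided. A one-sided test with `alternative="greater"` addresses the direction directly. The sketch below shows how the two p-values relate, using hypothetical, randomly generated values (the names `close_values` and `far_values` and all numbers are assumptions, not the housing data):

```python
import numpy as np
from scipy import stats

# Hypothetical house values: a small "close" group and a large "far" group.
rng = np.random.default_rng(42)
close_values = rng.normal(loc=150_000, scale=40_000, size=30)
far_values = rng.normal(loc=200_000, scale=100_000, size=1000)

# Two-sided: H1 is "the means differ".
two_sided = stats.ttest_ind(close_values, far_values, equal_var=False)

# One-sided: H1 is "the mean of the first sample is greater".
one_sided = stats.ttest_ind(
    close_values, far_values, equal_var=False, alternative="greater"
)

print(f"t = {two_sided.statistic:.3f}")
print(f"two-sided p = {two_sided.pvalue:.3g}")
print(f"one-sided p (greater) = {one_sided.pvalue:.3g}")
```

The t-statistic is the same in both calls. When it is positive, the one-sided p-value is half the two-sided one; when it is negative, the one-sided p-value is one minus half the two-sided one, so a significant two-sided result with a negative t-statistic does not support a "greater" hypothesis.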

In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   longitude             17000 non-null  float64
 1   latitude              17000 non-null  float64
 2   housing_median_age    17000 non-null  float64
 3   total_rooms           17000 non-null  float64
 4   total_bedrooms        17000 non-null  float64
 5   population            17000 non-null  float64
 6   households            17000 non-null  float64
 7   median_income         17000 non-null  float64
 8   median_house_value    17000 non-null  float64
 9   distance_to_school    17000 non-null  float64
 10  distance_to_hospital  17000 non-null  float64
 11  is_close              17000 non-null  bool   
dtypes: bool(1), float64(11)
memory usage: 1.4 MB
