# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab. In this lab we apply hypothesis tests to several scenarios and practice interpreting the results.

The tests in scope are the One Sample T-Test (the mean of a single sample), the Two Sample T-Test (differences between independent groups), the Paired Sample T-Test (dependent samples), and Analysis of Variance (ANOVA), which compares means across multiple groups.

For each exercise, state your hypotheses, choose a test, and interpret the result at the given significance level.

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [1]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np



In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df.head(30)

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,58,64,58,80,65,80,1,False
6,Charizard,Fire,Flying,78,84,78,109,85,100,1,False
7,Mega Charizard X,Fire,Dragon,78,130,111,130,85,100,1,False
8,Mega Charizard Y,Fire,Flying,78,104,78,159,115,100,1,False
9,Squirtle,Water,,44,48,65,50,64,43,1,False


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        799 non-null    object
 1   Type 1      800 non-null    object
 2   Type 2      414 non-null    object
 3   HP          800 non-null    int64 
 4   Attack      800 non-null    int64 
 5   Defense     800 non-null    int64 
 6   Sp. Atk     800 non-null    int64 
 7   Sp. Def     800 non-null    int64 
 8   Speed       800 non-null    int64 
 9   Generation  800 non-null    int64 
 10  Legendary   800 non-null    bool  
dtypes: bool(1), int64(7), object(3)
memory usage: 63.4+ KB


- We posit that Dragon-type Pokémon have, on average, higher HP than Grass-type Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.

In [4]:
df['Type 1'].unique()

array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

In [5]:
df['Type 2'].unique()

array(['Poison', nan, 'Flying', 'Dragon', 'Ground', 'Fairy', 'Grass',
       'Fighting', 'Psychic', 'Steel', 'Ice', 'Rock', 'Dark', 'Water',
       'Electric', 'Fire', 'Ghost', 'Bug', 'Normal'], dtype=object)

In [6]:
#H0: AVG HP Dragon <= AVG HP Grass

#H1: We believe Pokémon of type Dragon have, on average, more HP than Grass
#AVG HP Dragon > AVG HP Grass

In [7]:
df_dragon = df[(df["Type 1"] == "Dragon") | (df["Type 2"] =="Dragon")]["HP"]
df_grass = df[(df["Type 1"] == "Grass") | (df["Type 2"] =="Grass")]["HP"]

In [8]:
st.ttest_ind(df_dragon,df_grass, equal_var=False, alternative = "greater")

TtestResult(statistic=4.097528915272702, pvalue=5.0907690611769275e-05, df=77.58086781513519)

In [9]:
# The p-value (about 5.1e-05) is far below 0.05, so we reject the null hypothesis.
# Dragon-type Pokémon have a higher average HP than Grass-type Pokémon.
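
To make the `equal_var=False, alternative="greater"` call less of a black box, here is a minimal sketch that recomputes Welch's t statistic, the Welch-Satterthwaite degrees of freedom, and the one-sided p-value by hand, then compares them with `st.ttest_ind`. The helper name `welch_greater` and the two samples are illustrative assumptions; the samples are synthetic and are not the Pokémon data.

```python
import numpy as np
import scipy.stats as st


def welch_greater(a, b):
    """One-sided Welch t-test of H1: mean(a) > mean(b), computed by hand."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each sample mean (sample variance / n)
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Upper-tail probability, because the alternative is "greater"
    p = st.t.sf(t, dof)
    return t, dof, p


# Synthetic HP-like samples (made up for illustration, NOT the lab data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=80, scale=25, size=50)
group_b = rng.normal(loc=65, scale=20, size=95)

t_manual, dof_manual, p_manual = welch_greater(group_a, group_b)
res = st.ttest_ind(group_a, group_b, equal_var=False, alternative="greater")

print("manual:", t_manual, p_manual, "dof:", dof_manual)
print("scipy: ", res.statistic, res.pvalue)
```

The two lines should agree up to floating-point error. The decision rule is then the same as in the lab: reject H0 when the p-value is below 0.05.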

- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.


In [10]:
#H0: mean total stats of Legendary Pokémon == mean total stats of Non-Legendary Pokémon

#H1: mean total stats of Legendary Pokémon != mean total stats of Non-Legendary Pokémon

#Legendary is a boolean column (True or False)

In [11]:
df['Legendary'].unique()

array([False,  True])

In [12]:
# Sum the six base stats into one total per Pokémon
df['stats'] = df[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']].sum(axis=1)

In [13]:
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,stats
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False,318
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False,405
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False,525
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False,625
4,Charmander,Fire,,39,52,43,60,50,65,1,False,309
...,...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True,600
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True,700
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True,600
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True,680


In [14]:
df_legendary = df[df["Legendary"] == True]["stats"]
df_non_legendary = df[df["Legendary"] == False]["stats"]

In [15]:
df_legendary

156    580
157    580
158    580
162    680
163    780
      ... 
795    600
796    700
797    600
798    680
799    600
Name: stats, Length: 65, dtype: int64

In [16]:
st.ttest_ind(df_legendary, df_non_legendary, equal_var=False, alternative = "two-sided")

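TtestResult(statistic=25.8335743895517, pvalue=9.357954335957446e-47, df=102.79988763435729)

# The p-value (about 9.4e-47) is far below 0.05, so we reject the null hypothesis.
# Legendary and Non-Legendary Pokémon differ in mean total stats (Legendary are higher in this sample).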
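
The test above compares the sum of the six stats. The exercise lists the stats separately, so another possible reading is to run one two-sided Welch test per stat. The sketch below does that on a synthetic frame; the helper `per_stat_welch`, the `toy` frame, and its values are assumptions made for illustration and are not the Pokémon data. Only the column names are taken from the lab. No multiple-testing correction is applied here, so running six tests raises the chance of at least one false positive.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

STAT_COLS = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]


def per_stat_welch(frame, group_col="Legendary", cols=STAT_COLS, alpha=0.05):
    """Run a two-sided Welch t-test for each column in `cols`."""
    is_group = frame[group_col].astype(bool)
    rows = []
    for col in cols:
        res = st.ttest_ind(
            frame.loc[is_group, col],
            frame.loc[~is_group, col],
            equal_var=False,
            alternative="two-sided",
        )
        rows.append(
            {
                "stat": col,
                "t": res.statistic,
                "p_value": res.pvalue,
                "reject_H0": bool(res.pvalue < alpha),
            }
        )
    return pd.DataFrame(rows)


# Synthetic frame with the same column names as the lab data (values are made up)
rng = np.random.default_rng(1)
n = 200
legendary = rng.random(n) < 0.2
toy = pd.DataFrame({"Legendary": legendary})
for col in STAT_COLS:
    # Shift the flagged group upward so the two groups differ by construction
    toy[col] = rng.normal(loc=70, scale=15, size=n) + 25 * legendary

summary = per_stat_welch(toy)
print(summary)
```

On the real data the call would presumably be `per_stat_welch(df)`, with each row of the result read against the 5% level.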

**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [17]:
df2 = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df2.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

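- School coordinates (longitude, latitude): (-118, 37)
- Hospital coordinates (longitude, latitude): (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if its Euclidean distance to that point, measured in coordinate degrees, is less than 0.50.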

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, at the 5% significance level, comment on your findings.
 

In [18]:
#Calculate the Euclidean distance from each house to the school and hospital

def calculate_distances(df, threshold=0.50):
    school_lat, school_lon = 37, -118
    hospital_lat, hospital_lon = 34, -122
    
    # Calculate distance to school
    df['distance_to_school'] = np.sqrt(
        (df['latitude'] - school_lat) ** 2 +
        (df['longitude'] - school_lon) ** 2
    )
    
    # Calculate distance to hospital
    df['distance_to_hospital'] = np.sqrt(
        (df['latitude'] - hospital_lat) ** 2 +
        (df['longitude'] - hospital_lon) ** 2
    )
    
    # Determine if the house is close to either the school or the hospital
    df['close_to_school_or_hospital'] = (
        (df['distance_to_school'] < threshold) | 
        (df['distance_to_hospital'] < threshold)
    )
    
    return df
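
The function above adds its columns to the DataFrame passed in, so `df2` itself is modified. As a hedged alternative, the sketch below returns only the boolean flag and leaves the input untouched, using `np.hypot` for the Euclidean distance. The helper name `is_close` and the three-row `toy` frame are assumptions for illustration; the coordinates and the 0.50 threshold come from the exercise statement.

```python
import numpy as np
import pandas as pd

# Reference points as (longitude, latitude), taken from the exercise statement
SCHOOL = (-118.0, 37.0)
HOSPITAL = (-122.0, 34.0)


def is_close(frame, threshold=0.50):
    """Return a boolean Series: True where a row is within `threshold` of either point."""
    d_school = np.hypot(frame["longitude"] - SCHOOL[0], frame["latitude"] - SCHOOL[1])
    d_hospital = np.hypot(frame["longitude"] - HOSPITAL[0], frame["latitude"] - HOSPITAL[1])
    return (d_school < threshold) | (d_hospital < threshold)


# Tiny made-up frame: 0.3 from the school, 0.4 from the hospital, far from both
toy = pd.DataFrame(
    {
        "longitude": [-118.3, -122.0, -120.0],
        "latitude": [37.0, 34.4, 36.0],
    }
)
flags = is_close(toy)
print(flags.tolist())  # [True, True, False]
```

The resulting mask can be used directly to split a column, for example `frame.loc[flags, "median_house_value"]` and `frame.loc[~flags, "median_house_value"]`.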

In [19]:
df_with_distances = calculate_distances(df2)
df_with_distances

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,distance_to_school,distance_to_hospital,close_to_school_or_hospital
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,4.638125,7.692347,False
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,4.384165,7.540617,False
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,4.773856,7.446456,False
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,4.801510,7.438716,False
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,4.850753,7.442432,False
...,...,...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,7.211380,6.957298,False
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,7.275232,7.064630,False
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,7.944533,8.170410,False
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,7.920227,8.132035,False


In [20]:
df2['close_to_school_or_hospital'].value_counts()

close_to_school_or_hospital
False    16995
True         5
Name: count, dtype: int64

## We posit that houses close to either a school or a hospital are more expensive.

H0: AVG price close <= AVG price far

H1: AVG price close > AVG price far

In [21]:
close = df2[df2['close_to_school_or_hospital'] == True]['median_house_value']
far = df2[df2['close_to_school_or_hospital'] == False]['median_house_value']

In [22]:
st.ttest_ind(close, far, equal_var=False, alternative = "greater")

TtestResult(statistic=-17.174167998688404, pvalue=0.9999738999071939, df=4.145382282040222)

In [23]:
close

4523     74200.0
5596    108300.0
5597    104700.0
6776     94500.0
6904     80600.0
Name: median_house_value, dtype: float64

In [24]:
far

0         66900.0
1         80100.0
2         85700.0
3         73400.0
4         65500.0
           ...   
16995    111400.0
16996     79000.0
16997    103600.0
16998     85800.0
16999     94600.0
Name: median_house_value, Length: 16995, dtype: float64

In [25]:
# The p-value (about 0.99997) is far above 0.05, so we fail to reject H0.
# The data give no evidence that houses close to the school or hospital are more expensive;
# the negative test statistic shows that, in this sample, the close houses are cheaper on average.
# Only 5 houses fall in the "close" group, so this conclusion rests on a very small sample.
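
A negative statistic paired with a p-value near 1 can look odd at first. With `alternative="greater"` the p-value is the upper-tail probability, so a statistic far below zero gives a p-value close to 1, and the `"less"` alternative gives the complementary value. The sketch below illustrates this on synthetic samples; the group sizes and values are assumptions chosen to mimic a small, cheaper group against a large one and are not the housing data.

```python
import numpy as np
import scipy.stats as st

# Synthetic samples (made up): a small low-valued group and a large high-valued group
rng = np.random.default_rng(2)
small_low = rng.normal(loc=90_000, scale=15_000, size=5)
large_high = rng.normal(loc=200_000, scale=100_000, size=1_000)

greater = st.ttest_ind(small_low, large_high, equal_var=False, alternative="greater")
less = st.ttest_ind(small_low, large_high, equal_var=False, alternative="less")

# Same statistic in both calls; only the tail used for the p-value changes
print("greater:", greater.statistic, greater.pvalue)
print("less:   ", less.statistic, less.pvalue)
print("p-values sum to:", greater.pvalue + less.pvalue)
```

Failing to reject H0 under `"greater"` only says there is no evidence that the first group has the higher mean. A claim that the first group is cheaper would need its own test with `alternative="less"`.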