# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab! In this lab, we apply hypothesis tests to several scenarios and interpret the results.

The tests covered are the one-sample t-test (the mean of a single sample), the two-sample t-test (differences between independent groups), the paired-sample t-test (differences within dependent samples), and Analysis of Variance (ANOVA), which compares means across more than two groups.

So grab your statistical tools, prepare your hypotheses, and let's get started!

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [4]:
# Libraries
import pandas as pd
import scipy.stats as st
import numpy as np

In [5]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Pokémon of type Dragon have, on average, higher HP than those of type Grass. Choose the proper test and, at the 5% significance level, comment on your findings.

In [None]:
# The cells below compare Fire (not Dragon) with Grass.
# H0: mean HP (Fire) <= mean HP (Grass)
# H1: mean HP (Fire) >  mean HP (Grass)

In [20]:
df_fire_count = df[df["Type 1"] == "Fire"]["HP"].count()
df_grass_count = df[df["Type 1"] == "Grass"]["HP"].count()
df_fire_count, df_grass_count

(52, 70)

In [16]:
df_fire_mean = df[df["Type 1"] == "Fire"]["HP"].mean()
df_grass_mean = df[df["Type 1"] == "Grass"]["HP"].mean()
df_fire_mean, df_grass_mean

(69.90384615384616, 67.27142857142857)

In [26]:
df_fire = df[df["Type 1"] == "Fire"]["HP"]
df_grass = df[df["Type 1"] == "Grass"]["HP"]


In [25]:
# Welch's t-test (no equal-variance assumption); two-sided by default
st.ttest_ind(df_fire, df_grass, equal_var=False)

TtestResult(statistic=0.7391940086169969, pvalue=0.46135826534621194, df=110.37960902080208)
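
The reported two-sided p-value (about 0.461) is above 0.05, so this test does not reject equal mean HP for Fire and Grass. Because the statistic is positive, the one-sided p-value for "Fire > Grass" is half of that (about 0.231), which is also above 0.05. Since the hypothesis in the task is directional and names Dragon, a one-sided test on Dragon versus Grass may be the closer match. The following is a minimal sketch under that assumption; the helper name `welch_greater` is hypothetical and not part of the original notebook.

```python
import scipy.stats as st


def welch_greater(sample_a, sample_b, alpha=0.05):
    """One-sided Welch t-test of H1: mean(sample_a) > mean(sample_b).

    Returns the t statistic, the one-sided p-value, and whether H0 is
    rejected at significance level `alpha`.
    """
    result = st.ttest_ind(
        sample_a, sample_b, equal_var=False, alternative="greater"
    )
    return result.statistic, result.pvalue, bool(result.pvalue < alpha)


# Usage with the lab's DataFrame (assumes `df` is loaded as in the cells above):
# dragon_hp = df[df["Type 1"] == "Dragon"]["HP"]
# grass_hp = df[df["Type 1"] == "Grass"]["HP"]
# t_stat, p_value, reject_h0 = welch_greater(dragon_hp, grass_hp)
```

If `reject_h0` comes back `True`, the data support the claim that Dragon HP is higher on average; otherwise there is not enough evidence at the 5% level.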

- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with non-Legendary Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.


In [53]:
legendary_true = df[df['Legendary'] == True]
legendary_false = df[df['Legendary'] == False]
legendary_true, legendary_false

(               Name    Type 1    Type 2   HP  Attack  Defense  Sp. Atk  \
 156        Articuno       Ice    Flying   90      85      100       95   
 157          Zapdos  Electric    Flying   90      90       85      125   
 158         Moltres      Fire    Flying   90     100       90      125   
 162          Mewtwo   Psychic       NaN  106     110       90      154   
 163   Mega Mewtwo X   Psychic  Fighting  106     190      100      154   
 ..              ...       ...       ...  ...     ...      ...      ...   
 795         Diancie      Rock     Fairy   50     100      150      100   
 796    Mega Diancie      Rock     Fairy   50     160      110      160   
 797  Hoopa Confined   Psychic     Ghost   80     110       60      150   
 798   Hoopa Unbound   Psychic      Dark   80     160       60      170   
 799       Volcanion      Fire     Water   80     110      120      130   
 
      Sp. Def  Speed  Generation  Legendary  
 156      125     85           1       True  
 157  

In [55]:
df.groupby("Legendary")[["Attack","HP","Defense","Sp. Atk","Sp. Def","Speed"]]

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000020C03275290>
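
A `groupby` selection on its own only returns a lazy `DataFrameGroupBy` object, as the output above shows; an aggregation such as `.mean()` is needed to see numbers. A minimal sketch, assuming the goal of the cell was to compare average stats per group (the helper name `mean_stats_by_group` is hypothetical):

```python
import pandas as pd


def mean_stats_by_group(frame, group_col, stat_cols):
    """Return the mean of each column in `stat_cols` for every value of `group_col`."""
    return frame.groupby(group_col)[stat_cols].mean()


# Usage with the lab's DataFrame (assumes `df` is loaded as in the cells above):
# stats = ["Attack", "HP", "Defense", "Sp. Atk", "Sp. Def", "Speed"]
# mean_stats_by_group(df, "Legendary", stats)
```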

In [56]:
df_legendary = df[df["Legendary"] == True]["Attack"]
df_legendary_not = df[df["Legendary"] == False]["Attack"]
df_legendary, df_legendary_not

(156     85
 157     90
 158    100
 162    110
 163    190
       ... 
 795    100
 796    160
 797    110
 798    160
 799    110
 Name: Attack, Length: 65, dtype: int64,
 0       49
 1       62
 2       82
 3      100
 4       52
       ... 
 787    100
 788     69
 789    117
 790     30
 791     70
 Name: Attack, Length: 735, dtype: int64)

In [57]:
# Welch's t-test on Attack only; two-sided, matching the "different stats" hypothesis
st.ttest_ind(df_legendary, df_legendary_not, equal_var=False)

TtestResult(statistic=10.438133539322203, pvalue=2.520372449236646e-16, df=75.88324448141854)
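
For Attack, the reported p-value (about 2.5e-16) is far below 0.05, so the hypothesis of equal mean Attack for Legendary and non-Legendary Pokémon is rejected. The task lists six stats, while the cells above test only Attack. The following is a minimal sketch that repeats the same two-sided Welch t-test for each stat; the helper name `compare_legendary_stats` and the layout of the result table are assumptions, not part of the original notebook.

```python
import pandas as pd
import scipy.stats as st

STAT_COLUMNS = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]


def compare_legendary_stats(pokemon, stat_columns=STAT_COLUMNS, alpha=0.05):
    """Run a two-sided Welch t-test per stat, Legendary vs non-Legendary.

    Returns one row per stat with the t statistic, the p-value, and whether
    H0 (equal means) is rejected at significance level `alpha`.
    """
    is_legendary = pokemon["Legendary"].astype(bool)
    legendary = pokemon[is_legendary]
    non_legendary = pokemon[~is_legendary]

    rows = []
    for column in stat_columns:
        result = st.ttest_ind(
            legendary[column], non_legendary[column], equal_var=False
        )
        rows.append({
            "stat": column,
            "t_statistic": result.statistic,
            "p_value": result.pvalue,
            "reject_h0": bool(result.pvalue < alpha),
        })
    return pd.DataFrame(rows).set_index("stat")


# Usage with the lab's DataFrame (assumes `df` holds the Pokémon data):
# compare_legendary_stats(df)
```

Each row can then be read on its own: a `True` in `reject_h0` means the two groups differ in that stat at the 5% level.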

**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [65]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0
...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0


In [67]:
target_longitude = -118.37
result = df[df['longitude'] == target_longitude]
result

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
7577,-118.37,36.19,10.0,443.0,111.0,48.0,21.0,3.1250,71300.0
7578,-118.37,34.43,11.0,17339.0,2866.0,8721.0,2803.0,5.9507,225200.0
7579,-118.37,34.24,40.0,1283.0,246.0,594.0,236.0,4.1121,229200.0
7580,-118.37,34.23,32.0,1444.0,317.0,1177.0,311.0,3.6000,164600.0
7581,-118.37,34.22,17.0,1787.0,463.0,1671.0,448.0,3.5521,151500.0
...,...,...,...,...,...,...,...,...,...
7683,-118.37,33.81,36.0,2031.0,339.0,817.0,337.0,5.1271,458300.0
7684,-118.37,33.81,36.0,1283.0,209.0,563.0,209.0,6.9296,500001.0
7685,-118.37,33.81,33.0,5057.0,790.0,2021.0,748.0,6.8553,482200.0
7686,-118.37,33.79,36.0,1596.0,234.0,654.0,223.0,8.2064,500001.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, at the 5% significance level, comment on your findings.
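
The first two hints can be sketched as follows. A later cell calls `classify_houses`, whose definition does not appear in this notebook, so the version below is a reconstruction under assumptions: points are treated as (longitude, latitude) pairs, distance is plain Euclidean distance in degrees, and the column names are taken from the output of that later cell. The distances it produces for the first row (about 4.638125 and 7.692347) agree with the values shown there.

```python
import numpy as np
import pandas as pd

# (longitude, latitude) pairs from the task statement
SCHOOL = (-118, 37)
HOSPITAL = (-122, 34)


def euclidean_distance(frame, point):
    """Euclidean distance (in degrees) from every row of `frame` to `point`."""
    return np.sqrt(
        (frame["longitude"] - point[0]) ** 2
        + (frame["latitude"] - point[1]) ** 2
    )


def classify_houses(houses, threshold=0.50):
    """Return a copy of `houses` with distance columns and a 0/1 proximity flag."""
    houses = houses.copy()
    houses["distance_to_school"] = euclidean_distance(houses, SCHOOL)
    houses["distance_to_hospital"] = euclidean_distance(houses, HOSPITAL)
    close = (
        (houses["distance_to_school"] < threshold)
        | (houses["distance_to_hospital"] < threshold)
    )
    houses["close_to_either"] = close.astype(int)
    return houses
```

Working on a copy keeps the original DataFrame unchanged, so the classification can be rerun with a different `threshold` without reloading the data.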
 

In [72]:
import pandas as pd
from scipy.spatial import distance_matrix

def find_neighbors_within_distance(result, distance=0.5):
    """Return the rows of `result` that lie within `distance` of at least one other row.

    This compares the houses with each other, not with the school or hospital.
    """
    points = result[['longitude', 'latitude']].values
    dist_matrix = distance_matrix(points, points)
    
    neighbors = []
    for i in range(len(points)):
        for j in range(len(points)):
            # Skip the zero distance from a point to itself
            if i != j and dist_matrix[i, j] <= distance:
                neighbors.append(j)
    
    # Get unique neighbor indices
    unique_neighbors = list(set(neighbors))
    
    # Get the DataFrame rows corresponding to unique neighbors
    neighbor_df = result.iloc[unique_neighbors]
    
    return neighbor_df

find_neighbors_within_distance(result, distance=0.5)


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
7578,-118.37,34.43,11.0,17339.0,2866.0,8721.0,2803.0,5.9507,225200.0
7579,-118.37,34.24,40.0,1283.0,246.0,594.0,236.0,4.1121,229200.0
7580,-118.37,34.23,32.0,1444.0,317.0,1177.0,311.0,3.6000,164600.0
7581,-118.37,34.22,17.0,1787.0,463.0,1671.0,448.0,3.5521,151500.0
7582,-118.37,34.22,11.0,2127.0,581.0,1989.0,530.0,2.9028,174100.0
...,...,...,...,...,...,...,...,...,...
7683,-118.37,33.81,36.0,2031.0,339.0,817.0,337.0,5.1271,458300.0
7684,-118.37,33.81,36.0,1283.0,209.0,563.0,209.0,6.9296,500001.0
7685,-118.37,33.81,33.0,5057.0,790.0,2021.0,748.0,6.8553,482200.0
7686,-118.37,33.79,36.0,1596.0,234.0,654.0,223.0,8.2064,500001.0


In [82]:
houses = df

# Classify houses: adds the distance_to_school, distance_to_hospital
# and close_to_either columns
houses = classify_houses(houses)

# Divide DataFrame into two sets based on proximity to a school or hospital
neighbor_close = houses[houses['close_to_either'] == 1]
neighbor_far = houses[houses['close_to_either'] == 0]
neighbor_far

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,distance_to_school,distance_to_hospital,close_to_either
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,4.638125,7.692347,0
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,4.384165,7.540617,0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,4.773856,7.446456,0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,4.801510,7.438716,0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,4.850753,7.442432,0
...,...,...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,7.211380,6.957298,0
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,7.275232,7.064630,0
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,7.944533,8.170410,0
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,7.920227,8.132035,0


In [84]:
from scipy.stats import mannwhitneyu

# Conduct a one-sided Mann-Whitney U test on the house values of the two groups
# (H1: houses close to a school or hospital are more expensive)
statistic, p_value = mannwhitneyu(
    neighbor_close['median_house_value'],
    neighbor_far['median_house_value'],
    alternative='greater',
)

# Interpret the findings
alpha = 0.05
if p_value < alpha:
    print("There is a significant difference in median house values between houses close to either a school or hospital and houses far from both.")
else:
    print("There is no significant difference in median house values between houses close to either a school or hospital and houses far from both.")


There is no significant difference in median house values between houses close to either a school or hospital and houses far from both.
