# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab. In this lab we apply hypothesis tests to real datasets: we state the hypotheses, choose a suitable test, and interpret the result at a fixed significance level.

The lab covers testing the mean of a single sample (One Sample T-Test), comparing two independent groups (Two Sample T-Test), comparing dependent samples (Paired Sample T-Test), and comparing means across multiple groups with Analysis of Variance (ANOVA).

**Challenge 1**

In this challenge, we will be working with pokemon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [1]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Dragon-type Pokemon have, on average, higher HP than Grass-type Pokemon. Choose the proper test and, at the 5% significance level, comment on your findings.

In [6]:
# Split data into dragon and grass type subsets, get HP
dragons = df[df['Type 1'] == 'Dragon']['HP']
grass = df[df["Type 1"] == 'Grass']['HP']

In [None]:
# Set the hypotheses

# The claim is directional (Dragon HP > Grass HP), so a one-sided alternative matches it.
# H0: Mean HP of Dragon-type pokemon is less than or equal to mean HP of Grass-type pokemon.
# H1: Mean HP of Dragon-type pokemon is greater than mean HP of Grass-type pokemon.

# Significance level = 0.05

In [11]:
# Runs a two-sample Welch t-test (equal_var=False) comparing Dragon-type and Grass-type pokemon.
# ttest_ind returns a two-sided p-value by default.
stat, p_value = st.ttest_ind(dragons, grass, equal_var=False)
print("Stat:", stat, " p_value:", p_value)

# The two-sided p_value is below 0.05, so the two means differ significantly.
# The t-statistic is positive (Dragon minus Grass), so the one-sided p_value is half the two-sided one and is also below 0.05.
# We reject H0 and conclude that Dragon-type pokemon have higher average HP than Grass-type pokemon.

Stat: 3.3349632905124063  p_value: 0.0015987219490841197
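
Because the claim is directional, the one-sided test can also be requested directly. The following is a minimal sketch, assuming SciPy 1.6 or later (where `ttest_ind` accepts the `alternative` argument). The HP values and the names `dragon_hp` and `grass_hp` are made up for illustration and are not taken from pokemon.csv.

```python
import numpy as np
import scipy.stats as st

# Hypothetical HP samples, invented for this sketch (not real pokemon data).
dragon_hp = np.array([41, 58, 61, 75, 80, 91, 95, 100, 105, 125])
grass_hp = np.array([35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 90])

# Default call: two-sided Welch t-test (H1: the means differ).
two_sided = st.ttest_ind(dragon_hp, grass_hp, equal_var=False)

# One-sided Welch t-test (H1: mean of the first sample is greater).
one_sided = st.ttest_ind(dragon_hp, grass_hp, equal_var=False, alternative="greater")

print("t:", one_sided.statistic)
print("two-sided p_value:", two_sided.pvalue)
print("one-sided p_value:", one_sided.pvalue)
```

When the t-statistic is positive, the one-sided p-value is half the two-sided one, so a two-sided rejection at 5% implies a one-sided rejection as well.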


- We posit that Legendary Pokemon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with non-Legendary Pokemon. Choose the proper test and, at the 5% significance level, comment on your findings.


In [15]:
# Create a list of columns to keep - combat stats
combat_stats = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def','Speed']

# Create combat stat subsets of the dataframe separated by legendary status
legendary = df[df['Legendary']==1][combat_stats]
regular = df[df['Legendary']==0][combat_stats]

In [None]:
# Set the hypotheses (tested separately for each combat stat)

# H0: Legendary and regular (non-legendary) pokemon have the same mean for the stat.
# H1: Legendary and regular pokemon have different means for the stat.

# Significance level
alpha = 0.05

In [22]:
# Runs a two-sample t-test comparing hit points for legendary and non-legendary pokemon.
# We do not assume equal variances: regular pokemon span several evolution stages,
# with later stages much stronger than earlier ones, while legendary pokemon appear
# only in their strongest form, so we expect them to be more alike in power.
# For this reason, we use equal_var=False (Welch's t-test).
stat, p_value = st.ttest_ind(legendary['HP'],regular['HP'], equal_var=False)   
print("Stat:", stat, " p_value:", p_value)
if p_value < 0.05:
    print("Reject null hypothesis: Legendary pokemon have significantly different hit points than non-legendary pokemon.")
else:
    print("Fail to reject null hypothesis: Legendary pokemon do not have significantly different hit points than non-legendary pokemon.")

if stat > 0:
    print("Legendary pokemon have higher hit points than non-legendary pokemon.")
else:
    print("Legendary pokemon have lower hit points than non-legendary pokemon.")


Stat: 8.981370483625046  p_value: 1.0026911708035284e-13
Reject null hypothesis: Legendary pokemon have significantly different hit points than non-legendary pokemon.
Legendary pokemon have higher hit points than non-legendary pokemon.
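
Passing `equal_var=False` selects Welch's t-test, which does not pool the two sample variances. As a rough sketch of what that means, the statistic and its approximate degrees of freedom (Welch-Satterthwaite) can be computed by hand with the standard library. The function name `welch_t` and the sample values are hypothetical.

```python
import math
import statistics


def welch_t(a, b):
    """Return Welch's t-statistic and approximate degrees of freedom for two samples."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    # Sample variances (n - 1 in the denominator), each divided by its own sample size.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1))
    return t, dof


# Made-up samples with clearly different spreads.
t, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t, 3), round(dof, 2))  # -1.897 5.88
```

The sign of `t` follows the order of the arguments: it is positive when the first sample has the larger mean, which is what the `stat > 0` checks in the cells here rely on.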


In [21]:
# Runs a two sample t-test comparing attack for legendary pokemon and non-legendary pokemon 
stat, p_value = st.ttest_ind(legendary['Attack'],regular['Attack'], equal_var=False)
print("Stat:", stat, " p_value:", p_value)
if p_value < 0.05:
    print("Reject null hypothesis: Legendary pokemon have significantly different attack power than non-legendary pokemon.")
else:
    print("Fail to reject null hypothesis: Legendary pokemon do not have significantly different attack power than non-legendary pokemon.")

if stat > 0:
    print("Legendary pokemon have higher attack power than non-legendary pokemon.")
else:
    print("Legendary pokemon have lower attack power than non-legendary pokemon.")

Stat: 10.438133539322203  p_value: 2.520372449236646e-16
Reject null hypothesis: Legendary pokemon have significantly different attack power than non-legendary pokemon.
Legendary pokemon have higher attack power than non-legendary pokemon.


In [23]:
# Runs a two sample t-test comparing defense for legendary pokemon and non-legendary pokemon 
stat, p_value = st.ttest_ind(legendary['Defense'],regular['Defense'], equal_var=False)
print("Stat:", stat, " p_value:", p_value)
if p_value < 0.05:
    print("Reject null hypothesis: Legendary pokemon have significantly different defense power than non-legendary pokemon.")
else:
    print("Fail to reject null hypothesis: Legendary pokemon do not have significantly different defense power than non-legendary pokemon.")

if stat > 0:
    print("Legendary pokemon have higher defense power than non-legendary pokemon.")
else:
    print("Legendary pokemon have lower defense power than non-legendary pokemon.")

Stat: 7.637078164784618  p_value: 4.8269984949193316e-11
Reject null hypothesis: Legendary pokemon have significantly different defense power than non-legendary pokemon.
Legendary pokemon have higher defense power than non-legendary pokemon.


In [24]:
# Runs a two sample t-test comparing special attack for legendary pokemon and non-legendary pokemon 
stat, p_value = st.ttest_ind(legendary['Sp. Atk'],regular['Sp. Atk'], equal_var=False)
print("Stat:", stat, " p_value:", p_value)
if p_value < 0.05:
    print("Reject null hypothesis: Legendary pokemon have significantly different special attack power than non-legendary pokemon.")
else:
    print("Fail to reject null hypothesis: Legendary pokemon do not have significantly different special attack power than non-legendary pokemon.")

if stat > 0:
    print("Legendary pokemon have higher special attack power than non-legendary pokemon.")
else:
    print("Legendary pokemon have lower special attack power than non-legendary pokemon.")

Stat: 13.417449984138461  p_value: 1.5514614112239812e-21
Reject null hypothesis: Legendary pokemon have significantly different special attack power than non-legendary pokemon.
Legendary pokemon have higher special attack power than non-legendary pokemon.


In [25]:
# Runs a two sample t-test comparing special defense for legendary pokemon and non-legendary pokemon 
stat, p_value = st.ttest_ind(legendary['Sp. Def'],regular['Sp. Def'], equal_var=False)
print("Stat:", stat, " p_value:", p_value)
if p_value < 0.05:
    print("Reject null hypothesis: Legendary pokemon have significantly different special defense power than non-legendary pokemon.")
else:
    print("Fail to reject null hypothesis: Legendary pokemon do not have significantly different special defense power than non-legendary pokemon.")

if stat > 0:
    print("Legendary pokemon have higher special defense power than non-legendary pokemon.")
else:
    print("Legendary pokemon have lower special defense power than non-legendary pokemon.")

Stat: 10.015696613114878  p_value: 2.2949327864052826e-15
Reject null hypothesis: Legendary pokemon have significantly different special defense power than non-legendary pokemon.
Legendary pokemon have higher special defense power than non-legendary pokemon.


In [26]:
# Runs a two sample t-test comparing speed for legendary pokemon and non-legendary pokemon 
stat, p_value = st.ttest_ind(legendary['Speed'],regular['Speed'], equal_var=False)
print("Stat:", stat, " p_value:", p_value)
if p_value < 0.05:
    print("Reject null hypothesis: Legendary pokemon have significantly different speed than non-legendary pokemon.")
else:
    print("Fail to reject null hypothesis: Legendary pokemon do not have significantly different speed than non-legendary pokemon.")

if stat > 0:
    print("Legendary pokemon have higher speed than non-legendary pokemon.")
else:
    print("Legendary pokemon have lower speed than non-legendary pokemon.")

Stat: 11.47504444631443  p_value: 1.049016311882451e-18
Reject null hypothesis: Legendary pokemon have significantly different speed than non-legendary pokemon.
Legendary pokemon have higher speed than non-legendary pokemon.


From these test results, we conclude that, on average, legendary pokemon outscore non-legendary pokemon on every combat stat. In all six tests we rejected the null hypothesis of equal means at the 5% significance level, and every t-statistic was positive, meaning the legendary mean was the higher one.
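
The six near-identical cells above could be collapsed into one loop. The sketch below assumes the flag column is boolean; the helper name `compare_groups` and the toy data are hypothetical. It also applies a Bonferroni-adjusted threshold, which is an addition not used in the original analysis, to account for running several tests at once.

```python
import numpy as np
import pandas as pd
import scipy.stats as st


def compare_groups(data, flag_col, columns, alpha=0.05):
    """Run one Welch t-test per column, splitting rows on a boolean flag column."""
    flagged = data[data[flag_col]]
    others = data[~data[flag_col]]
    # Bonferroni adjustment (an addition, not in the original lab):
    # divide alpha by the number of tests to limit false positives overall.
    threshold = alpha / len(columns)
    rows = []
    for col in columns:
        result = st.ttest_ind(flagged[col], others[col], equal_var=False)
        rows.append({
            "stat": col,
            "t": result.statistic,
            "p_value": result.pvalue,
            "reject_h0": bool(result.pvalue < threshold),
        })
    return pd.DataFrame(rows)


# Synthetic stand-in for the pokemon data; values are random, not real stats.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Legendary": np.arange(n) < 20,
    "HP": rng.normal(70, 20, n),
    "Attack": rng.normal(80, 25, n),
})
toy.loc[toy["Legendary"], ["HP", "Attack"]] += 30  # give the flagged group higher means

summary = compare_groups(toy, "Legendary", ["HP", "Attack"])
print(summary)
```

With the lab's data, the equivalent call would presumably be `compare_groups(df, 'Legendary', combat_stats)`, provided the `Legendary` column is loaded as a boolean.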

**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [28]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0
...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, with 5% significance, comment your findings.
 

In [73]:
# Define function to compute distance from school
def school_dist(lat1, lon1):
    """Return the distance from the given latitude/longitude to the school at (-118, 37)."""
    # Scales longitude by cos(latitude) before measuring distance.
    # Caveat: np.cos expects radians, but the latitudes here are in degrees, so this is not
    # a true projection. The exercise only asks for Euclidean distance on the raw coordinates.
    x1 = lon1 * np.cos(lat1)
    y1 = lat1
    school_lon = -118 * np.cos(37)  # school longitude, scaled the same way as x1
    school_lat = 37                 # school latitude
    # Computes Euclidean distance from the scaled point to the scaled school position.
    distance = np.sqrt((x1 - school_lon)**2 + (y1 - school_lat)**2)
    return distance

In [74]:
# Creates a new column in df that contains each house's euclidean distance from the school. 
df['school_dist'] = school_dist(df['latitude'],df['longitude'])

In [66]:
# Define function to compute distance from hospital
def hosp_dist(lat1, lon1):
    """Return the distance from the given latitude/longitude to the hospital at (-122, 34)."""
    # Scales longitude by cos(latitude) before measuring distance.
    # Caveat: np.cos expects radians, but the latitudes here are in degrees, so this is not
    # a true projection. The exercise only asks for Euclidean distance on the raw coordinates.
    x1 = lon1 * np.cos(lat1)
    y1 = lat1
    hosp_lon = -122 * np.cos(34)  # hospital longitude, scaled the same way as x1
    hosp_lat = 34                 # hospital latitude
    # Computes Euclidean distance from the scaled point to the scaled hospital position.
    distance = np.sqrt((x1 - hosp_lon)**2 + (y1 - hosp_lat)**2)
    return distance

In [67]:
# Creates a new column in df that contains each house's euclidean distance from the hospital. 
df['hosp_dist'] = hosp_dist(df['latitude'],df['longitude'])
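
The exercise hint asks for plain Euclidean distance on the coordinates, with the 0.50 threshold expressed in those same units. A minimal sketch of that approach follows, using a single reusable function. The name `euclidean_dist` and the three sample points are hypothetical. The landmark coordinates are read as (longitude, latitude), which is an assumption based on their values.

```python
import numpy as np


def euclidean_dist(lon, lat, target_lon, target_lat):
    """Euclidean distance, in coordinate units, from each point to a target point."""
    return np.sqrt((lon - target_lon) ** 2 + (lat - target_lat) ** 2)


# Landmark positions from the exercise, read as (longitude, latitude).
SCHOOL = (-118, 37)
HOSPITAL = (-122, 34)

# Three made-up neighborhoods: near the school, near the hospital, near neither.
lon = np.array([-118.3, -122.0, -120.0])
lat = np.array([37.3, 34.2, 35.5])

school_d = euclidean_dist(lon, lat, *SCHOOL)
hosp_d = euclidean_dist(lon, lat, *HOSPITAL)

# A neighborhood is "close" if it is within 0.5 of either landmark.
is_close = (school_d < 0.5) | (hosp_d < 0.5)
print(is_close)  # [ True  True False]
```

The function works unchanged on pandas Series, so `euclidean_dist(df['longitude'], df['latitude'], *SCHOOL)` would return one distance per row.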

In [88]:
# Creates subsets of data based on nearness to the school or hospital.
# "Far" is the exact complement of "close", so a distance of exactly 0.5 counts as far.
close_house = df[(df['hosp_dist']<0.5) | (df['school_dist']<0.5)]['median_house_value']
far_house = df[(df['hosp_dist']>=0.5) & (df['school_dist']>=0.5)]['median_house_value']


In [91]:
# Runs a two-sample t-test comparing median house values for houses close to either the school or hospital with houses that are close to neither.
# H0: mean house value is the same in both groups. H1 (the claim): houses close to either are more expensive.
# ttest_ind is two-sided by default, so the sign of the t-statistic is checked below to confirm the direction.
# We assume equal variances (the default) because nothing suggests that the two groups differ in spread.
stat, p_value = st.ttest_ind(close_house,far_house)
print("Stat:", stat, " p_value:", p_value)
if p_value < 0.05:
    print("Reject null hypothesis. There is a statistically significant difference in house prices between houses close to either the hospital or the school and houses that are close to neither.")
else:
    print("Fail to reject the null. We find no statistically significant difference between houses close to either the school or the hospital and houses that are close to neither.")

if stat > 0:
    print("Houses close to either the school or the hospital are more expensive than houses that are not.")
else:
    print("Houses close to either the school or the hospital are less expensive than houses that are not.") 

Stat: 2.190010046812842  p_value: 0.02853704837377385
Reject null hypothesis. There is a statistically significant difference in house prices between houses close to either the hospital or the school and houses that are close to neither.
Houses close to either the school or the hospital are more expensive than houses that are not.
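
The equal-variance assumption above can be checked rather than assumed. One possible sketch: run Levene's test on the two groups, then use a one-sided Welch t-test, which is valid whether or not the variances match. The prices below are synthetic and the names `close_prices` and `far_prices` are hypothetical. The `alternative` argument assumes SciPy 1.6 or later.

```python
import numpy as np
import scipy.stats as st

# Synthetic house values, invented for this sketch (not the California data).
rng = np.random.default_rng(42)
close_prices = rng.normal(220_000, 60_000, size=80)
far_prices = rng.normal(200_000, 110_000, size=400)

# Levene's test: H0 is that both groups have equal variance.
lev_stat, lev_p = st.levene(close_prices, far_prices, center="median")
print("Levene stat:", lev_stat, " p_value:", lev_p)

# One-sided Welch t-test: H1 is that the close group has the higher mean.
stat, p_value = st.ttest_ind(close_prices, far_prices, equal_var=False, alternative="greater")
print("Welch stat:", stat, " one-sided p_value:", p_value)
```

A small Levene p-value would suggest unequal variances, in which case `equal_var=False` is the safer choice. Using Welch's test by default avoids having to make that decision at all.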
