<a href="https://colab.research.google.com/github/datascience-uniandes/hypothesis-testing-tutorial/blob/master/hypothesis-testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hypothesis Testing

MINE-4101: Applied Data Science  
Universidad de los Andes  
  
**Dataset:** Airbnb Listings - Mexico City, Distrito Federal, Mexico [[dataset](http://insideairbnb.com/get-the-data/) | [dictionary](https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit?usp=sharing)]. This dataset contains information about Airbnb property listings in Mexico City, including neighborhood, property type, price per night, number of reviews, review scores, availability, amenities, and more.

**Business Context:** Property Investment and Vacation Rental Strategy. You're a consultant for individuals or firms looking to invest in properties for Airbnb rentals. They want to identify the most lucrative neighborhoods, optimal pricing strategies, and understand the factors that contribute to positive reviews and frequent bookings. <span style="color: red;">Since you currently only have a sample of all the properties listed in the city, you must ensure that the insights you extract from your analysis can be generalized to the entire set of properties.</span>

Last update: September, 2023

In [None]:
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import ttest_ind, chi2_contingency

In [None]:
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)

## 1. Loading the data

In [None]:
listings_df = pd.read_csv("./data/listings.csv.gz").sample(frac=0.01, random_state=100)

In [None]:
listings_df.shape

In [None]:
listings_df.dtypes

In [None]:
listings_df.sample(5)

## 2. Transforming the data

In [None]:
listings_df["price"] = listings_df["price"].str.replace("[$,]", "", regex=True).astype(float)
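The cleaning step above can be checked on a few values before trusting it on the full column. The following is a minimal sketch that assumes prices arrive as strings such as `"$1,250.00"`; the sample values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw values in the "$1,234.00" style assumed for the price column
raw_prices = pd.Series(["$1,250.00", "$980.00", "$12,000.00"])

# Inside a character class, "$" and "," are literal characters, so this strips
# the currency symbol and the thousands separators before casting to float
clean_prices = raw_prices.str.replace("[$,]", "", regex=True).astype(float)
print(clean_prices.tolist())
```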

## 3. Removing some critical outliers based on listing price

In [None]:
q1 = listings_df["price"].quantile(0.25)
q3 = listings_df["price"].quantile(0.75)
iqr = q3 - q1

In [None]:
listings_df = listings_df.loc[listings_df["price"] <= (q3 + 1.5 * iqr)]
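To see what this filter does, here is a small sketch on made-up prices (the values are hypothetical). It applies the same upper fence, `q3 + 1.5 * iqr`, and reports how many rows survive. Note that, as in the cell above, only unusually high prices are removed; no lower fence is applied.

```python
import pandas as pd

# Hypothetical prices; the last value is an obvious extreme
prices = pd.Series([400.0, 500.0, 600.0, 700.0, 800.0, 900.0, 1000.0, 25000.0])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Upper fence used by the notebook: anything above it is treated as an outlier
upper_fence = q3 + 1.5 * iqr
kept = prices[prices <= upper_fence]

print(f"upper fence: {upper_fence}")
print(f"rows kept: {len(kept)} of {len(prices)}")
```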

## 4. Business question 1

After selecting a couple of neighborhoods with good investment potential, analyze the listing prices in those neighborhoods. On average, does one of the two neighborhoods have higher prices than the other?

In [None]:
listings_df["neighbourhood_cleansed"].value_counts(dropna=False, normalize=True)

In [None]:
selected_neighborhoods = ["Miguel Hidalgo", "Benito Juárez"]

In [None]:
# Showing some statistics for neighborhoods of interest
listings_df.loc[listings_df["neighbourhood_cleansed"].isin(selected_neighborhoods)].groupby("neighbourhood_cleansed")["price"].describe()

In [None]:
# Plotting price distribution by neighborhood
fig, ax = plt.subplots(1, 1, figsize=(20, 8))
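sns.kdeplot(
    data=listings_df.loc[listings_df["neighbourhood_cleansed"].isin(selected_neighborhoods)],
    x="price",
    hue="neighbourhood_cleansed",
    hue_order=selected_neighborhoods,  # Fixing the order so curve colors match the mean lines below
    palette=["steelblue", "orange"],
    bw_adjust=.3,
    ax=ax
)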
for (neighborhood, color) in zip(selected_neighborhoods, ["steelblue", "orange"]):
    ax.axvline(listings_df.loc[listings_df["neighbourhood_cleansed"] == neighborhood, "price"].mean(), color=color, linestyle="dashed", linewidth=2, ymax=0.2)
plt.title("Price distribution by neighborhood (with means)")
plt.show()

**Step 1.** Define the null and alternative hypotheses, where $\mu_1$ and $\mu_2$ are the mean listing prices of the two selected neighborhoods:

$$
H_0: \mu_1 = \mu_2 \\ H_a: \mu_1 \neq \mu_2
$$

**Step 2.** Choose the appropriate test: two-sample t-test (Welch's version, which does not assume equal variances between the two groups).  

**Step 3.** Calculate the p-value:

In [None]:
# Performing the two-sample t-test
t_stat, p_value = ttest_ind(
    listings_df.loc[listings_df["neighbourhood_cleansed"] == selected_neighborhoods[0], "price"],
    listings_df.loc[listings_df["neighbourhood_cleansed"] == selected_neighborhoods[1], "price"],
    equal_var=False
)
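Passing `equal_var=False` makes `ttest_ind` run Welch's t-test, which uses an unpooled standard error. As an illustration only, the sketch below recomputes the statistic by hand on synthetic data (two hypothetical groups drawn from normal distributions, not the Airbnb prices) and compares it with the SciPy result.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic prices for two hypothetical groups (not the Airbnb data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=1200, scale=400, size=80)
group_b = rng.normal(loc=1000, scale=250, size=60)

# Welch's statistic: difference in sample means over the unpooled standard error
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)
standard_error = np.sqrt(var_a / len(group_a) + var_b / len(group_b))
t_manual = (group_a.mean() - group_b.mean()) / standard_error

# Same test through SciPy; equal_var=False selects Welch's version
t_scipy, p_scipy = ttest_ind(group_a, group_b, equal_var=False)

print("Manual t-statistic:", t_manual)
print("SciPy t-statistic: ", t_scipy)
print("P-value:           ", p_scipy)
```

The two statistics should agree up to floating-point error; the p-value additionally depends on the Welch-Satterthwaite degrees of freedom, which SciPy computes internally.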

In [None]:
# Printing the results
print("T-statistic:", t_stat)
print("P-value:", p_value)

**Step 4.** Determine the statistical significance:

In [None]:
# Evaluating significance
alpha = 0.05  # Choosing a significance level (commonly 0.05)
if p_value < alpha:
    print(f"REJECT THE NULL HYPOTHESIS: The difference in mean listing price between {selected_neighborhoods[0]} and {selected_neighborhoods[1]} neighborhoods is statistically significant.")
else:
    print(f"FAIL TO REJECT THE NULL HYPOTHESIS: The difference in mean listing price between {selected_neighborhoods[0]} and {selected_neighborhoods[1]} neighborhoods is not statistically significant.")

**Potential implication for an investor:**  
If the difference is statistically significant, guests pay more on average in one neighborhood than in the other. Whether that neighborhood is the better investment still depends on land prices and expected profit margins in each one.

## 5. Business question 2

To select the best room type to invest in, determine whether some room types are more predominant in certain neighborhoods.

In [None]:
neighborhood_frec_cumsum = listings_df["neighbourhood_cleansed"].value_counts(dropna=False, normalize=True).cumsum()
neighborhood_frec_cumsum

In [None]:
# Filtering by Pareto's rule at 90%: keeping the most frequent neighborhoods while their cumulative share stays below 0.9
most_representative_neighborhoods = neighborhood_frec_cumsum.loc[neighborhood_frec_cumsum < 0.9].index.tolist()
most_representative_neighborhoods
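One detail of this filter is worth seeing on a toy example: because the condition is a strict `< 0.9` on the cumulative share, the category that crosses the 90% mark is itself excluded, so the selected neighborhoods may cover noticeably less than 90% of the listings. The labels and shares below are hypothetical.

```python
import pandas as pd

# Hypothetical neighborhood labels with shares of 50%, 30%, 15%, and 5%
labels = pd.Series(["A"] * 50 + ["B"] * 30 + ["C"] * 15 + ["D"] * 5)

# value_counts sorts by frequency, so cumsum gives the cumulative share
cumulative_share = labels.value_counts(normalize=True).cumsum()

# Same rule as the notebook: keep categories whose cumulative share is below 0.9
selected = cumulative_share.loc[cumulative_share < 0.9].index.tolist()

print(cumulative_share)
print(selected)
```

Here `"C"` takes the cumulative share from 80% to 95%, so it is dropped and the selection covers 80% of the rows.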

In [None]:
listings_df["room_type"].value_counts(dropna=False, normalize=True)

In [None]:
# Keeping only the most representative neighborhoods, then cross-tabulating neighborhood against room type
filtered_df = listings_df.loc[listings_df["neighbourhood_cleansed"].isin(most_representative_neighborhoods)]
contingency_table = pd.crosstab(
    filtered_df["neighbourhood_cleansed"],
    filtered_df["room_type"]
)
contingency_table

**Step 1.** Define the null and alternative hypotheses for the two variables, neighborhood and room type:

$$
H_0: \text{The variables are independent} \\ H_a: \text{The variables are not independent}
$$

**Step 2.** Choose the appropriate test: chi-square test of independence.  

**Step 3.** Calculate the p-value:

In [None]:
# Performing the chi-square test
chi2, p_value, _, expected_freq = chi2_contingency(contingency_table)
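`chi2_contingency` compares the observed counts with the counts expected if the two variables were independent. The following sketch, built on a hypothetical 2x3 table rather than the real contingency table, reproduces both the expected frequencies and the statistic by hand.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts (rows: two neighborhoods, columns: three room types)
observed = np.array([
    [30, 10, 5],
    [20, 40, 15],
])

# Expected count under independence: row total * column total / grand total
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
expected_manual = np.outer(row_totals, col_totals) / observed.sum()

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2_scipy, p_scipy, dof, expected_scipy = chi2_contingency(observed)

print("Expected (manual):\n", expected_manual)
print("Chi-square (manual):", chi2_manual)
print("Chi-square (SciPy): ", chi2_scipy)
print("Degrees of freedom: ", dof)
```

A table larger than 2x2 is used on purpose: by default SciPy applies Yates' continuity correction only when there is a single degree of freedom, in which case the manual statistic above would not match.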

In [None]:
# Printing the results
print("Chi-Square:", chi2)
print("P-value:", p_value)
print("Expected Frequencies:\n", expected_freq.round(0))
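The expected frequencies are worth inspecting because the chi-square approximation becomes unreliable when many cells have small expected counts, which can happen with a small sample and rare room types. A common rule of thumb (an assumption here, not something stated in this notebook) is that no more than about 20% of cells should have an expected count below 5. The sketch below counts such cells for a hypothetical table.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts with one rare column and one small row
observed = np.array([
    [40, 12, 1],
    [35, 20, 2],
    [8, 3, 0],
])

_, _, _, expected = chi2_contingency(observed)

# Cells whose expected count falls below the rule-of-thumb threshold of 5
cells_below_5 = int((expected < 5).sum())
share_below_5 = cells_below_5 / expected.size

print("Expected frequencies:\n", expected.round(2))
print(f"Cells with expected count below 5: {cells_below_5} of {expected.size} ({share_below_5:.0%})")
```

When the share is high, one option is to merge rare categories or drop sparse neighborhoods before running the test.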

**Step 4.** Determine the statistical significance:

In [None]:
# Evaluating significance
alpha = 0.05  # Choosing a significance level (commonly 0.05)
if p_value < alpha:
    print("REJECT THE NULL HYPOTHESIS: There's a statistically significant dependency between neighborhood and room type.")
else:
    print("FAIL TO REJECT THE NULL HYPOTHESIS: There's no statistically significant dependency between neighborhood and room type.")

**Potential implication for an investor:**  
If the null hypothesis is not rejected, there is not enough evidence to say that certain room types are more predominant in some neighborhoods, so the decision about which room type to offer could depend more on different factors like...