
# Introduction: What is Inferential Statistics?

## How can we make sense of the Shire's housing market?

Imagine you're a curious hobbit named Frodo Baggins, and you've just inherited Bag End from your uncle Bilbo. As you settle into your new home, you start wondering about the housing market in the Shire. How much are other hobbit-holes worth? What factors influence their prices? Is Bag End truly as valuable as everyone says?

To answer these questions, you could try to gather information about every single hobbit-hole in the Shire. But that would take years, and by the time you finished, the market would have changed! This is where **inferential statistics** comes to the rescue.

**Inferential statistics** is a powerful tool in data science that allows us to draw conclusions about a large group (called a **population**) based on a smaller, representative subset (called a **sample**). It's like tasting a spoonful of Farmer Maggot's mushroom soup to judge the flavor of the entire pot.

In the context of our Shire housing market example, inferential statistics would help us:

1. **Estimate** the average price of hobbit-holes in the entire Shire based on data from a few neighborhoods.
2. **Predict** how factors like square footage or distance to the nearest pub might affect a hobbit-hole's price.
3. **Test hypotheses** about the housing market, such as whether homes in Hobbiton are significantly more expensive than those in Buckland.

Key concepts in inferential statistics include:

- **Population**: The entire group we're interested in studying (e.g., all hobbit-holes in the Shire).
- **Sample**: A subset of the population that we actually measure or observe.
- **Parameter**: A numerical value that describes a characteristic of the population (e.g., the true average price of all hobbit-holes).
- **Statistic**: A numerical value calculated from the sample data that estimates a population parameter.
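
The four terms above can be illustrated with a small simulation. This is only a sketch: the "population" is generated with NumPy rather than taken from the Shire dataset, and the mean of 3500, the spread of 800, and the sample size of 100 are arbitrary assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical population: simulated prices of 10,000 hobbit-holes (not real data)
rng = np.random.default_rng(seed=42)
population = rng.normal(loc=3500, scale=800, size=10_000)

# Parameter: the true population mean (normally unknown in practice)
population_mean = population.mean()

# Sample: 100 hobbit-holes drawn at random, without replacement
sample = rng.choice(population, size=100, replace=False)

# Statistic: the sample mean, which estimates the parameter
sample_mean = sample.mean()

print(f"Population mean (parameter): {population_mean:.1f}")
print(f"Sample mean (statistic):     {sample_mean:.1f}")
```

The two numbers will usually be close but not identical; quantifying that gap is what the rest of the chapter is about.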

Throughout this chapter, we'll explore various inferential statistics techniques using our Shire housing dataset. We'll learn how to:

- Perform **t-tests** to compare means between groups (Are Hobbiton houses really more expensive?)
- Calculate **z-scores** to understand how unusual a particular observation is (Is Bag End exceptionally large?)
- Interpret **p-values** to assess the strength of our statistical evidence
- Use **chi-squared tests** to examine relationships between categorical variables (Is there an association between neighborhood and garden size?)
- Conduct **hypothesis tests** to make decisions based on data
- Explore **correlations** between variables (Does distance to the nearest pub affect house prices?)

By the end of this chapter, you'll have the tools to make data-driven decisions about the Shire's housing market – or any other dataset you encounter on your adventures through Middle-earth and beyond!



## Intro to Shire Housing Dataset: What secrets does our hobbit-hole data hold?

Welcome to the Shire, a land of rolling hills, cozy hobbit-holes, and... data! Our journey through inferential statistics will be guided by a dataset containing information about hobbit-holes across various neighborhoods in the Shire. Let's take a closer look at what we're working with.

Our dataset includes the following variables:

1.  **SquareFootage**: The size of the hobbit-hole in square feet.
2.  **Age**: How many years ago the hobbit-hole was built.
3.  **Neighborhood**: The area of the Shire where the hobbit-hole is located.
4.  **GardenSize**: The size of the garden in square feet.
5.  **DistanceToPub**: How far the hobbit-hole is from the nearest pub, in miles.
6.  **Price**: The price of the hobbit-hole in gold pieces.

Here's a glimpse of our data:

In [1]:
import pandas as pd
import numpy as np
!wget https://github.com/brendanpshea/data-science/raw/main/data/shire_house_prices.csv -q
shire_df = pd.read_csv('shire_house_prices.csv')
shire_df.head()


|   | SquareFootage | Age | Neighborhood | GardenSize | DistanceToPub | Price |
|---|---|---|---|---|---|---|
| 0 | 711 | 37 | Buckland | 516 | 0.7 | 3477 |
| 1 | 458 | 46 | Bywater | 590 | 1.7 | 2543 |
| 2 | 671 | 29 | Michel Delving | 499 | 1.4 | 2909 |
| 3 | 804 | 36 | Tuckborough | 598 | 1.5 | 4343 |
| 4 | 1134 | 41 | Tuckborough | 481 | 1.8 | 4429 |


## Bilbo's Hobbit-Hole (Descriptive) Statistics Class
Before diving into inferential statistics, let's take a moment to review some of the concepts of **descriptive statistics.** We'll do so with the help of Bilbo Baggins...

In a cozy classroom in Hobbiton, Bilbo Baggins stood before a group of young hobbits, both lads and lasses. Their eyes were wide with curiosity as Bilbo began his lesson on the fascinating world of statistics.

"Now, my dear young hobbits," Bilbo began with a twinkle in his eye, "before we embark on our grand statistical adventure, let's revisit some of the basics. I have here some interesting numbers about hobbit-holes in the Shire. Let's see what tales they can tell us!"

Bilbo unfurled a large parchment with the following information (the output of the Pandas method `df.describe()`):

In [2]:
shire_df.describe(include='all').round(1)

|   | SquareFootage | Age | Neighborhood | GardenSize | DistanceToPub | Price |
|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000 | 1000.0 | 1000.0 | 1000.0 |
| unique |  |  | 5 |  |  |  |
| top |  |  | Buckland |  |  |  |
| freq |  |  | 211 |  |  |  |
| mean | 802.2 | 30.4 |  | 495.6 | 1.0 | 3495.4 |
| std | 197.1 | 14.1 |  | 99.7 | 0.5 | 779.7 |
| min | 245.0 | 0.0 |  | 159.0 | 0.0 | 1447.0 |
| 25% | 671.0 | 21.0 |  | 428.8 | 0.7 | 2949.2 |
| 50% | 803.0 | 30.0 |  | 492.0 | 1.0 | 3431.5 |
| 75% | 936.2 | 40.0 |  | 560.2 | 1.3 | 4003.0 |


"Now then," Bilbo continued, "who can tell me what we're looking at here?"

A young hobbit lass named Primrose raised her hand. "Mr. Bilbo, sir, it looks like information about hobbit-holes!"

"Excellent, Primrose!" Bilbo beamed. "Indeed it is. We have data on 1000 hobbit-holes in the Shire. Let's explore what these numbers can tell us."

### Sample Size and Variables

"First," Bilbo pointed out, "we can see that we have information on 1000 hobbit-holes. This is our sample size. We're looking at several characteristics, or variables: SquareFootage, Age, Neighborhood, GardenSize, DistanceToPub, and Price."

### Measures of Central Tendency

"Now, who remembers what we call the 'typical' or 'average' value?" Bilbo asked.

A young hobbit lad named Hamfast piped up, "The mean, Mr. Bilbo!"

"Very good, Hamfast!" Bilbo nodded. "The 'mean' row shows us the average for each characteristic. For example, the average hobbit-hole in our sample:

-   Is about 802 square feet
-   Is about 30 years old
-   Has a garden of about 496 square feet
-   Is 1 mile from the nearest pub
-   Costs about 3,495 gold pieces"

"But the mean isn't the only way to find the 'middle' of our data," Bilbo continued. "Who remembers another?"

Primrose's hand shot up again. "The median, Mr. Bilbo!"

"Excellent, Primrose!" Bilbo smiled. "The median is the middle value when we order all the data. It's shown in the '50%' row. For example, the median hobbit-hole price is 3,432 gold pieces. This means half of the hobbit-holes cost more than this, and half cost less."

"And for our neighborhood data," Bilbo added, "we have something called the mode. It's the most common category. Here, Buckland is the most common neighborhood, with 211 hobbit-holes."

### Measures of Spread

"Now," Bilbo continued, his eyes twinkling with excitement, "let's talk about how spread out our data is. This tells us about the variety of hobbit-holes in the Shire."

"Who can tell me how we might describe the spread of data?" Bilbo asked.

A quiet hobbit named Marigold spoke up, "The range, Mr. Bilbo?"

"Wonderful, Marigold!" Bilbo exclaimed. "The range is the difference between the largest and smallest values. For price, the range is 6,743 - 1,447 = 5,296 gold pieces. Quite a difference between the most and least expensive hobbit-holes!"

"Another important measure of spread," Bilbo continued, "is the standard deviation. It tells us how far, on average, values tend to be from the mean. For price, it's about 780 gold pieces."

### Percentiles

"Lastly," Bilbo said, "let's look at the percentiles. The '25%' and '75%' rows show us the first and third quartiles. For price, 25% of hobbit-holes cost less than 2,949 gold pieces, and 75% cost less than 4,003 gold pieces."

### Conclusion

As the young hobbits scribbled notes, Bilbo summarized, "These descriptive statistics give us a quick but informative glimpse into the hobbit-holes of the Shire. They tell us what's typical, what's unusual, and how much variety exists."

"Remember," Bilbo said with a wink, "just as every hobbit-hole tells a story about its inhabitants, every number in these statistics tells us something about our community. In our future lessons, we'll learn how to use this information to make inferences and predictions. But for now, let's practice calculating these statistics ourselves!"

The young hobbits eagerly pulled out their abacuses, ready to dive into the world of descriptive statistics under Bilbo's guidance. As they worked, Bilbo smiled, knowing that these fundamental concepts would serve as the foundation for their future statistical adventures.
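
For readers following along in Python rather than on an abacus, the statistics from Bilbo's lesson can each be computed directly with pandas. The sketch below uses a tiny made-up table (`mini_df` and its five prices are hypothetical, not rows from the Shire dataset) so that every result is easy to check by hand.

```python
import pandas as pd

# Hypothetical mini-dataset (made-up values, same column names as shire_df)
mini_df = pd.DataFrame({
    "Neighborhood": ["Buckland", "Hobbiton", "Buckland", "Bywater", "Buckland"],
    "Price": [3000, 3400, 3600, 2500, 4500],
})
price = mini_df["Price"]

# Measures of central tendency
mean_price = price.mean()
median_price = price.median()
mode_neighborhood = mini_df["Neighborhood"].mode()[0]  # most common category

# Measures of spread
price_range = price.max() - price.min()
std_price = price.std()  # sample standard deviation (ddof=1), as in describe()

# Percentiles: first and third quartiles
q1, q3 = price.quantile([0.25, 0.75])

print(f"Mean: {mean_price}, median: {median_price}, mode: {mode_neighborhood}")
print(f"Range: {price_range}, std: {std_price:.1f}, Q1: {q1}, Q3: {q3}")
```

Replacing `mini_df` with `shire_df` should give the same figures that appear on Bilbo's parchment.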

## T-tests: Comparing Hobbit-Hole Prices

### Are Hobbiton homes really more expensive than those in Buckland?

Meet Rosie Cotton, a savvy hobbit real estate agent who's curious about the housing market in different parts of the Shire. She's heard rumors that homes in Hobbiton are more expensive than those in Buckland, but she wants to use data to confirm this claim. This is where **t-tests** come in handy!

A **t-test** is a statistical method used to determine if there's a significant difference between the means (averages) of two groups. It's like comparing the average size of apples from two different orchards to see if one orchard produces larger apples.

Key terms and concepts:

- **Null hypothesis (H₀)**: The default assumption that there is no difference between the groups' population means. In Rosie's case, the null hypothesis would be: "The average price of Hobbiton homes is the same as the average price of Buckland homes."

- **Alternative hypothesis (H₁)**: The claim we weigh against the null hypothesis. For Rosie: "The average price of Hobbiton homes differs from the average price of Buckland homes."

- **Sample mean**: The average value calculated from our sample data.

- **Standard deviation**: A measure of how spread out the data points are from the mean.

- **Degrees of freedom (df)**: A numerical value related to the sample size that affects the shape of the t-distribution.

- **T-statistic**: A value calculated from our data that we use to determine if the difference between groups is statistically significant.

- **P-value**: The probability of obtaining results at least as extreme as our observed data, assuming the null hypothesis is true.

- **Significance level (α)**: The threshold we use to decide if our result is statistically significant, typically set at 0.05 (5%).
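
To see how these pieces fit together, the sketch below computes a two-sample t-statistic by hand and checks it against SciPy. It assumes the classic pooled-variance (Student's) form, which is what `scipy.stats.ttest_ind` uses by default; the two groups are simulated, and their means, spread, and sizes are arbitrary assumptions.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical samples: simulated prices for two groups of 30 homes each
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=3500, scale=700, size=30)
group_b = rng.normal(loc=3200, scale=700, size=30)

n_a, n_b = len(group_a), len(group_b)
mean_a, mean_b = group_a.mean(), group_b.mean()        # sample means
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)  # sample variances

# Degrees of freedom for the pooled-variance t-test
dof = n_a + n_b - 2

# Pooled variance, then the t-statistic: difference in means / standard error
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
t_manual = (mean_a - mean_b) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value: probability of a |t| at least this large under H0
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# SciPy should agree with the hand calculation
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)
print(f"manual: t = {t_manual:.3f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.3f}, p = {p_scipy:.4f}")
```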

### Step-by-step example:

1. Rosie collects data on home prices from 30 houses in Hobbiton and 30 in Buckland.

2. She calculates the sample means (these are made-up numbers, not taken from our actual dataset):
   - Hobbiton mean price: 3500 gold pieces
   - Buckland mean price: 3200 gold pieces

3. Rosie performs a two-sample t-test (comparing two independent groups).

4. She calculates the t-statistic and p-value using statistical software.

5. Results: t-statistic = 2.1, p-value = 0.04

6. Interpretation: Since the p-value (0.04) is less than the significance level (0.05), Rosie rejects the null hypothesis. She concludes that there is a statistically significant difference in average home prices between Hobbiton and Buckland.
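
Steps 4 to 6 can be sketched in a few lines. The example assumes a two-sided test with pooled degrees of freedom (30 + 30 - 2 = 58), which the walkthrough does not state explicitly; under that assumption, a t-statistic of 2.1 gives a p-value of roughly 0.04, in line with Rosie's result.

```python
import scipy.stats as stats

t_stat = 2.1        # t-statistic from Rosie's (made-up) example
dof = 30 + 30 - 2   # assumed degrees of freedom for two groups of 30
alpha = 0.05        # significance level

# Two-sided p-value from the t-distribution's survival function
p_value = 2 * stats.t.sf(abs(t_stat), dof)

# Decision rule: reject H0 when the p-value falls below alpha
reject_null = p_value < alpha

print(f"p-value: {p_value:.3f}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```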

### Performing a t-test in Python:

Here's a concise way to perform a t-test using Python:

```python
import scipy.stats as stats

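# Hypothetical example prices in gold pieces (made-up values for illustration)
hobbiton_prices = [3600, 3400, 3550, 3700, 3450]
buckland_prices = [3200, 3100, 3300, 3250, 3150]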
t_stat, p_value = stats.ttest_ind(hobbiton_prices, buckland_prices)

print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_value:.4f}")
```

### Applying t-tests to the Shire dataset

Let's use our Shire housing dataset to compare prices between Hobbiton and Buckland:

In [4]:
import scipy.stats as stats
# Filter data for Hobbiton and Buckland
hobbiton = shire_df[shire_df["Neighborhood"] == "Hobbiton"]["Price"]
buckland = shire_df[shire_df["Neighborhood"] == "Buckland"]["Price"]

# print sample means
print("Hobbiton mean price:", hobbiton.mean().round(2))
print("Buckland mean price:", buckland.mean().round(2))

# Perform t-test
t_stat, p_value = stats.ttest_ind(hobbiton, buckland)

print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_value:.4f}")

Hobbiton mean price: 3408.1
Buckland mean price: 3617.11
T-statistic: -2.68
P-value: 0.0077


The code above runs a t-test on our Shire dataset, comparing home prices in Hobbiton and Buckland to see whether the difference between the two neighborhoods is statistically significant.

This result shows:

- The average house price in Hobbiton is 3408.1 gold pieces.
- The average house price in Buckland is 3617.11 gold pieces.
- The t-statistic of -2.68 is negative because the Hobbiton mean (passed first) is lower than the Buckland mean; its magnitude reflects how large the gap is relative to the variability in the data.
- The p-value (more on this later) of 0.0077 is less than the common significance level of 0.05.

Interpretation: There is strong statistical evidence (p = 0.0077, well below 0.05) that average house prices differ between the two neighborhoods. In our sample, Buckland homes are about 209 gold pieces more expensive on average, so the rumor Rosie heard has it backwards.


Remember, though, a t-test is just one tool in our statistical toolbox.

It's great for comparing means between two groups, but it has limitations. For example, it assumes our data is normally distributed and doesn't account for other factors that might influence home prices. As we progress through this chapter, we'll explore other statistical methods that can provide deeper insights into our Shire housing market!
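
One way to probe these assumptions is sketched below. It uses the Shapiro-Wilk test (`scipy.stats.shapiro`) as a rough normality check and Welch's t-test (`ttest_ind` with `equal_var=False`), which drops the equal-variance assumption built into the default call. Neither technique is used elsewhere in this chapter, and the data here are simulated with arbitrary means and spreads, so treat this as an illustration rather than an analysis of the Shire dataset.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical prices: two simulated groups with different spreads
rng = np.random.default_rng(seed=1)
hobbiton_sim = rng.normal(loc=3400, scale=600, size=200)
buckland_sim = rng.normal(loc=3600, scale=900, size=200)

# Shapiro-Wilk: a small p-value suggests the data are not normally distributed
_, p_norm_hobbiton = stats.shapiro(hobbiton_sim)
_, p_norm_buckland = stats.shapiro(buckland_sim)
print(f"Shapiro-Wilk p-values: {p_norm_hobbiton:.3f}, {p_norm_buckland:.3f}")

# Student's t-test (default) assumes both groups share the same variance
t_student, p_student = stats.ttest_ind(hobbiton_sim, buckland_sim)

# Welch's t-test does not assume equal variances
t_welch, p_welch = stats.ttest_ind(hobbiton_sim, buckland_sim, equal_var=False)

print(f"Student: t = {t_student:.2f}, p = {p_student:.4f}")
print(f"Welch:   t = {t_welch:.2f}, p = {p_welch:.4f}")
```

With equal group sizes the two t-statistics coincide and only the degrees of freedom (and hence the p-values) differ; with unequal group sizes and unequal spreads the two tests can disagree more noticeably.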

## Z-scores: How Unusual is Bag End?

### Is Bilbo's home truly extraordinary among hobbit-holes?

Imagine Lobelia Sackville-Baggins, always curious (and perhaps a bit envious) about her cousin Bilbo's famous home, Bag End. She wonders: just how exceptional is Bag End compared to other hobbit-holes in the Shire? To answer this question objectively, we can use a statistical tool called the **Z-score**.

A **Z-score** (also known as a standard score) tells us how many standard deviations an individual data point is from the mean of a dataset. It's like measuring how far a hobbit's height is from the average: are they unusually tall, unusually short, or right in the middle?

Key terms and concepts:

- **Mean (μ)**: The average value in a dataset, calculated by summing all values and dividing by the number of data points.

- **Standard Deviation (σ)**: A measure of how spread out the data is from the mean. A low standard deviation indicates that most data points are close to the mean, while a high standard deviation suggests the data is more spread out.

- **Normal Distribution**: A symmetric, bell-shaped distribution of data where most values cluster around the mean, with fewer values towards the extremes.

- **Z-score**: A measure of how many standard deviations a data point is from the mean. Calculated as: Z = (X - μ) / σ, where X is the individual value, μ is the mean, and σ is the standard deviation.

- **Percentile**: The percentage of scores in a distribution that fall at or below a particular score.

### Step-by-step example:

1. Lobelia gathers data on the square footage of 100 hobbit-holes in the Shire.

2. She calculates the mean (μ) square footage: 800 sq ft
   And the standard deviation (σ): 150 sq ft

3. Bag End's square footage (X): 1200 sq ft

4. Lobelia calculates the Z-score for Bag End:
   Z = (X - μ) / σ = (1200 - 800) / 150 = 2.67

5. Interpretation:
   - A Z-score of 2.67 means Bag End is 2.67 standard deviations above the mean.
   - In a normal distribution, about 99.6% of values fall below a Z-score of 2.67.
   - So, if hobbit-hole sizes are roughly normally distributed, Bag End is larger than approximately 99.6% of hobbit-holes in the Shire!
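
The arithmetic in this example can be checked with a short sketch. It uses `scipy.stats.norm.cdf`, which returns the share of a normal distribution lying below a given Z-score; that hobbit-hole sizes follow a normal distribution is an assumption of the example, not something the data guarantee.

```python
import scipy.stats as stats

# Values from Lobelia's (made-up) example
mu = 800       # mean square footage
sigma = 150    # standard deviation
x = 1200       # Bag End's square footage

# Z-score: distance from the mean in units of standard deviations
z = (x - mu) / sigma

# Share of a normal distribution that falls below this Z-score
percentile = stats.norm.cdf(z) * 100

print(f"Z-score: {z:.2f}")
print(f"Percentile under a normal distribution: {percentile:.1f}%")
```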

### Calculating Z-scores in Python:

Here's a concise way to calculate Z-scores using Python:

In [6]:
# Bag End's square footage (assuming it's not in the dataset)
bag_end_size = 1200

# Calculate Z-score
z_score = (bag_end_size - shire_df['SquareFootage'].mean()) / shire_df['SquareFootage'].std()

print(f"Z-score of Bag End: {z_score:.2f}")

# Calculate percentile
percentile = (shire_df['SquareFootage'] < bag_end_size).mean() * 100
print(f"Bag End is larger than {percentile:.1f}% of hobbit-holes in our dataset")

Z-score of Bag End: 2.02
Bag End is larger than 98.0% of hobbit-holes in our dataset


Our results show that, at 1,200 square feet, Bag End has a Z-score of about 2.02, meaning it sits just over 2 standard deviations above the mean. It is larger than 98% of the hobbit-holes in our dataset, which is close to what a normal distribution would predict for this Z-score (about 97.8%).
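
The same calculation extends to a whole column at once. The sketch below standardizes every value in a simulated `SquareFootage` series (hypothetical data generated with NumPy, not the Shire dataset) and compares the empirical percentile of a 1,200 sq ft home with the percentile a normal distribution would predict.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Hypothetical sizes: 1,000 simulated hobbit-holes (not the real dataset)
rng = np.random.default_rng(seed=7)
sizes = pd.Series(rng.normal(loc=800, scale=200, size=1000), name="SquareFootage")

# Z-score for every hobbit-hole: the result has mean 0 and standard deviation 1
z_scores = (sizes - sizes.mean()) / sizes.std()

# Z-score of a 1,200 sq ft home relative to this sample
bag_end_z = (1200 - sizes.mean()) / sizes.std()

# Empirical percentile (share of the sample that is smaller)
empirical_pct = (sizes < 1200).mean() * 100

# Percentile predicted by a normal distribution for the same Z-score
normal_pct = stats.norm.cdf(bag_end_z) * 100

print(f"Z-score: {bag_end_z:.2f}")
print(f"Empirical percentile: {empirical_pct:.1f}%")
print(f"Normal-theory percentile: {normal_pct:.1f}%")
```

When the data are close to normal, the two percentiles should roughly agree; a large gap between them is a hint that the normal approximation is a poor fit.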