# Correlation Demo: Coffee & Happiness

Welcome!   
In this short notebook, we’ll explore how to **measure relationships between variables** using two types of correlation coefficients:

- **Pearson correlation (ρ)** – for linear relationships  
- **Spearman rank correlation (rₛ)** – for monotonic (ranked) relationships

We'll use a fun real-world dataset linking **coffee consumption per capita** with **self-reported happiness scores** across countries.

---

**What you’ll see:**
- A visual overview of the data
- Step-by-step calculation of both correlation types
- A deeper look at what correlation values actually mean
- Manual implementation of Pearson and Spearman formulas

---

This notebook is intended as a **teaching demo**, originally prepared for a lesson recording.

Let’s get started!


# 1. Load the Dataset

We use a publicly available dataset linking **coffee consumption per capita** with **average happiness score** by country.

We'll:
- Load the dataset directly from a GitHub URL
- Preview the first few rows
- Inspect the column names and summary statistics
- Drop any rows with missing values


In [None]:
import pandas as pd

# load the data

url = "https://raw.githubusercontent.com/batloon/data-projects/main/coffee_is_happiness/data/coffee_happiness_correlation.csv"
df = pd.read_csv(url)

# Display the first rows
df.head()

In [None]:
# Show columns
print(df.columns)

# Basic statistics
df.describe()


In [None]:
# clean the data: delete rows with missing values
df = df.dropna()

df.describe()

# 2. Visualize the Relationship (Scatter Plot)

Let’s plot the data to visually assess whether coffee consumption and happiness appear to be related.

We’ll use a scatterplot, where:
- The X-axis shows coffee consumption per person (kg/year)
- The Y-axis shows the happiness score


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,5))
sns.scatterplot(
    x="Coffee_Consumption_Per_Capita_KG", y="Happiness_Score", data=df,
    color="navy", s=60, edgecolor="white",
)
plt.title("Coffee Consumption vs Happiness Score")
plt.xlabel("Coffee Consumption per person (kg/year)")
plt.ylabel("Happiness Score")
plt.grid(True)
plt.show()



# Scatter Plot with Regression Line

We now add a **linear regression line** to the scatterplot using `seaborn.regplot`.

This helps visualize whether the relationship is approximately linear, which is important for interpreting the **Pearson correlation** later.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,5))

sns.regplot(
    x="Coffee_Consumption_Per_Capita_KG",
    y="Happiness_Score",
    data=df,
    ci=None,  # Hide confidence interval
    scatter_kws={'color': 'navy', 's': 60, 'edgecolor': 'white'},  # Color and size of points
    line_kws={'color': 'red', 'linewidth': 2}  # Color and width of regression line
)

plt.title("Coffee Consumption vs Happiness Score (with Regression Line)")
plt.xlabel("Coffee Consumption per person (kg/year)")
plt.ylabel("Happiness Score")
plt.grid(True)
plt.show()


# 3. Pearson Correlation Coefficient (ρ)

Pearson's ρ measures **how well a linear equation describes the relationship** between two variables.

It ranges from **–1** (perfect negative linear relationship) to **+1** (perfect positive linear relationship).  
A value close to **0** suggests little or no *linear* relationship; the variables may still be related in a nonlinear way.


In [None]:
# calculate Pearson correlation coefficient
pearson_correlation = df["Coffee_Consumption_Per_Capita_KG"].corr(df["Happiness_Score"])
print(f"Pearson correlation coefficient: {pearson_correlation:.2f}")

# Pearson Correlation: Manual Calculation

Let’s manually compute Pearson’s correlation using its mathematical formula:

$$
\rho = \frac{\text{cov}(X, Y)}{\sigma_X \cdot \sigma_Y}
$$

This helps us understand what Python libraries do under the hood.
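
Written out with sample estimates (assuming the $n-1$ convention, which is also the default of pandas' `std()`), the formula expands to:

$$
\rho = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}}
= \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}
$$

The factor $\frac{1}{n-1}$ cancels, so dividing by $n$ instead gives the same result, provided the covariance and both standard deviations use the same divisor. Mixing the two conventions scales the result by $\frac{n-1}{n}$ or its inverse.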


In [None]:
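# calculate Pearson correlation coefficient using the formula
def pearson_correlation_f(x, y):
    # covariance: mean product of deviations from the means (divides by n)
    cov_xy = (x - x.mean()).dot(y - y.mean()) / len(x)
    # ddof=0 makes std() divide by n as well; pandas defaults to ddof=1 (n - 1)
    return cov_xy / (x.std(ddof=0) * y.std(ddof=0))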


pearson_correlation_manual = pearson_correlation_f(df["Coffee_Consumption_Per_Capita_KG"], df["Happiness_Score"])
print(f"Pearson correlation coefficient (manual): {pearson_correlation_manual:.2f}")

# 4. Spearman Rank Correlation (rₛ)

Spearman’s rₛ is similar to Pearson’s ρ, but instead of using raw values, it uses **ranks**.

It is useful when the relationship is **monotonic but not linear**, and because ranks ignore how far apart the values are, it is **more robust to outliers** than Pearson’s ρ.


In [None]:
# calculate Spearman correlation coefficient
spearman_correlation = df["Coffee_Consumption_Per_Capita_KG"].corr(df["Happiness_Score"], method='spearman')
print(f"Spearman correlation coefficient: {spearman_correlation:.2f}")

# Spearman Correlation: Manual Calculation

Here we calculate Spearman’s correlation manually by:
1. Ranking both variables
2. Applying the Pearson formula to the **ranked data**

This demonstrates how Spearman is just **Pearson on ranks**.
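
One detail in the ranking step is how to handle ties. The default of pandas' `rank()` is `method="average"`, which gives tied values the mean of the ranks they would otherwise occupy. The sketch below shows that rule in plain Python; `average_ranks` is a hypothetical helper written for illustration, and the scores are made up.

```python
def average_ranks(values):
    """Rank from 1 to n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` over the run of equal values that begins at `start`
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(start, end + 1):
            result[order[k]] = shared
        start = end + 1
    return result

scores = [7.1, 6.4, 7.1, 5.0]  # made-up values with one tie
print(average_ranks(scores))   # → [3.5, 2.0, 3.5, 1.0]
```

The two tied scores would have taken ranks 3 and 4, so each receives 3.5.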


In [None]:
# calculate Spearman correlation coefficient as Pearson correlation on ranks
def spearman_correlation_f(x, y):
    rank_x = x.rank()  # tied values share their average rank (pandas default)
    rank_y = y.rank()
    return pearson_correlation_f(rank_x, rank_y)

spearman_correlation_manual = spearman_correlation_f(df["Coffee_Consumption_Per_Capita_KG"], df["Happiness_Score"])
print(f"Spearman correlation coefficient (manual): {spearman_correlation_manual:.2f}")