---
title: "Exploratory Data Analysis & Applied Statistics: From Data to Insights"
subtitle: "Master's in Business Data Science"
author: "Roman Jurowetzki"
---

## 🎯 Introduction: The Art of Asking Business Questions

Welcome! You've mastered the mechanics of pandas. Now, we shift our focus from *how* to manipulate data to *why*. This session is about turning you into a "business translator"—someone who can convert raw data into a compelling story that drives decisions.

Our goal is not to make you a statistician. It's to give you a framework for exploring data confidently, asking smart questions, and knowing when a pattern you've found is real versus just random noise.

::: {.callout-important}
#### 🎯 REALITY CHECK FOR BUSINESS STUDENTS

❌ You **DON'T** need to memorize formulas.
❌ You **DON'T** need to calculate by hand.
❌ You **DON'T** need to become a statistician.

✅ You **DO** need to ask the right business questions.
✅ You **DO** need to interpret results for a business audience.
✅ You **DO** need to know when a pattern is trustworthy.

**Your job is to be a business translator, not a math wizard!**
:::

**Our Guiding Problem:** We'll work with an HR dataset for a Danish tech company. Management is concerned about employee turnover and wants to understand the key drivers of satisfaction and performance. Our mission is to provide them with clear, data-backed insights.

::: {.callout-note}
#### Pacing Guide
This notebook is designed for a 150-minute interactive session (fitting within a 4x45 minute block structure with breaks). If you are working through it on your own, take your time to experiment with the code and reflect on the interpretation of each output. The goal is understanding, not speed.
:::

### Notebook Setup

First, let's import our tools and generate a more realistic dataset. Notice how we're now intentionally creating relationships between variables to make our analysis more interesting.

In [None]:
# Standard imports for analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set a professional plot style
sns.set_theme(style="whitegrid", palette="viridis")

In [None]:
# --- ADVANCED DATA GENERATION TO CREATE REALISTIC CORRELATIONS ---
np.random.seed(42)
n_employees = 500

# Define the base characteristics and their relationships (correlation matrix)
means = [3.5, 0.7, 5, 200, 6, 600000] # satisfaction, evaluation, projects, hours, tenure, salary_base
stds = [0.8, 0.15, 1.5, 40, 2.5, 150000]
corr_matrix = [
    [1.0, 0.2, 0.1, -0.55, -0.1, 0.0],  # satisfaction
    [0.2, 1.0, 0.80, 0.3, 0.1, 0.2],   # evaluation
    [0.1, 0.80, 1.0, 0.2, 0.0, 0.1],   # projects
    [-0.55, 0.3, 0.2, 1.0, 0.0, 0.1],   # hours
    [-0.1, 0.1, 0.0, 0.0, 1.0, 0.88],  # tenure
    [0.0, 0.2, 0.1, 0.1, 0.88, 1.0]    # salary_base
]

# Create a covariance matrix
cov_matrix = np.outer(stds, stds) * corr_matrix

# Generate the core numerical data
numerical_data = np.random.multivariate_normal(means, cov_matrix, size=n_employees)

# Create the DataFrame
df = pd.DataFrame(numerical_data, columns=[
    'satisfaction_score', 'last_evaluation', 'projects_completed',
    'avg_monthly_hours', 'tenure_years', 'annual_salary_dkk'
])

# Clip data to realistic ranges and set data types
df['satisfaction_score'] = df['satisfaction_score'].clip(1, 5)
df['last_evaluation'] = df['last_evaluation'].clip(0, 1)
df['projects_completed'] = df['projects_completed'].clip(2, 8).astype(int)
df['avg_monthly_hours'] = df['avg_monthly_hours'].clip(100, 300).astype(int)
df['tenure_years'] = df['tenure_years'].clip(2, 11).astype(int)
df['annual_salary_dkk'] = df['annual_salary_dkk'].clip(300000)

# Add categorical and ID columns
df['employee_id'] = range(1, n_employees + 1)
df['department'] = np.random.choice(['Engineering', 'Sales', 'Marketing', 'HR'], n_employees, p=[0.4, 0.3, 0.2, 0.1])
df['attrition'] = np.random.choice([0, 1], n_employees, p=[0.85, 0.15])

# Introduce some realistic business logic and data issues
df.loc[df['department'] == 'Engineering', 'annual_salary_dkk'] *= 1.3
df.loc[df['department'] == 'Sales', 'satisfaction_score'] *= 0.85 # Intentionally create a difference for ANOVA
df.loc[df.sample(frac=0.05, random_state=1).index, 'satisfaction_score'] = np.nan
# Ensure the outlier is in the Engineering department
eng_indices = df[df['department'] == 'Engineering'].index
if len(eng_indices) > 0:
    outlier_index = np.random.choice(eng_indices)
    df.loc[outlier_index, 'annual_salary_dkk'] = 2500000 # Outlier in DKK

print("HR Analytics dataset (DKK) with realistic correlations created successfully!")

---

# Part 1: Your Systematic Exploration Framework (30 mins)

Professional analysis isn't random; it's a systematic funnel. We start broad and get progressively narrower. Let's use a simple five-question framework.

::: {.callout-note}
### 📋 YOUR EDA CHECKLIST - PRINT THIS OUT!

**DATA QUALITY CHECK**
□ **1. SHAPE:** How big is the data? (`df.shape`, `df.info()`)
□ **2. MISSING:** Where are the gaps? (`df.isnull().sum()`)
□ **3. OUTLIERS:** Any weird values? (`df.describe()`, `sns.boxplot()`)
□ **4. CATEGORIES:** What groups exist? (`df['column'].value_counts()`)

**BUSINESS INSIGHTS CHECK**
□ **5. RELATIONSHIPS:** How do things connect? (`df.corr(numeric_only=True)`, `sns.heatmap()`, `sns.boxplot()`)

**P-VALUE TRANSLATION GUIDE**
*   `p < 0.05` → "**This pattern is likely REAL.** It's not just random chance."
*   `p ≥ 0.05` → "**This pattern could be FAKE.** We can't be sure it's not just random noise."
:::

## 1.1 First Look: Shape & Structure

In [None]:
# ✅ 1. SHAPE: How big is the data?
print("Dataset Shape:", df.shape)
print("\nDataset Info:")
df.info()

We have 500 employees and 9 columns (including the `employee_id` identifier). `satisfaction_score` has missing values.

## 1.2 Data Quality Check: Missing Values & Outliers

#### ✅ 2. MISSING: Where are the gaps?

In [None]:
print("Missing values per column:\n", df.isnull().sum())

`satisfaction_score` has 25 missing values. We'll proceed, but in a real project, this requires a clear handling strategy.
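
To make "a clear handling strategy" more concrete, here is a minimal sketch comparing two common options: dropping incomplete rows versus filling gaps with the median. The `toy` DataFrame and its values are invented for illustration and are not the HR dataset. Which option is appropriate depends on why the values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the HR dataset): two of six scores are missing
toy = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5, 6],
    "satisfaction_score": [4.0, np.nan, 3.0, 5.0, np.nan, 2.0],
})

# Option A: drop rows with a missing score (simple, but shrinks the sample)
dropped = toy.dropna(subset=["satisfaction_score"])

# Option B: fill gaps with the median (keeps every row, but understates variability)
median_score = toy["satisfaction_score"].median()
filled = toy.assign(satisfaction_score=toy["satisfaction_score"].fillna(median_score))

print("Rows after dropping:", len(dropped))
print("Rows after filling: ", len(filled))
print("Std before vs. after filling:",
      round(toy["satisfaction_score"].std(), 2), "vs.",
      round(filled["satisfaction_score"].std(), 2))
```

Filling keeps the sample size but makes the scores look less spread out than they really are, which can make later tests look more confident than they should.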

#### ✅ 3. OUTLIERS: Anything weird?

In [None]:
# Use style.format to make the large numbers readable
df.describe().style.format("{:,.2f}")

The `max` salary of 2,500,000 DKK is a huge outlier compared to the 75th percentile of ~770k DKK. Let's visualize it.

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['annual_salary_dkk'])
plt.title('Salary Distribution - Spot the Outlier!', fontsize=16)
plt.xlabel('Annual Salary (DKK)')
plt.show()

The boxplot confirms a significant outlier that will skew any analysis based on averages. In a full analysis, our next step would be to investigate it: is it a data entry error, or a C-level executive? The answer determines whether we keep, remove, or cap the value.
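
As a rough illustration of how much that choice matters, the sketch below flags outliers with the same 1.5 × IQR convention that boxplots use, then compares the mean under each option. The `salaries` values are invented for illustration, and capping at the upper fence is just one possible choice.

```python
import pandas as pd

# Hypothetical salaries in DKK (invented for illustration), with one extreme value
salaries = pd.Series([520_000, 560_000, 600_000, 640_000, 700_000, 2_500_000])

# Boxplot convention: flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (salaries < lower) | (salaries > upper)

# Compare the three options: keep, remove, or cap at the upper fence
print("Mean (keep):  ", round(salaries.mean()))
print("Mean (remove):", round(salaries[~is_outlier].mean()))
print("Mean (cap):   ", round(salaries.clip(upper=upper).mean()))
print("Median (keep):", round(salaries.median()))
```

Notice that the median barely reacts to the extreme value, which is why it is often reported alongside the mean for salary data.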

#### ✅ 4. CATEGORIES: What groups exist?

In [None]:
df['department'].value_counts()

Engineering is the largest department. HR is very small, so findings for this group should be treated with caution.
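
One way to put a number on "treat with caution" is the standard error of the mean, which grows as a group gets smaller. The sketch below uses simulated scores (an assumption for illustration, not the HR dataset) with the same spread in both groups but very different group sizes.

```python
import numpy as np
import pandas as pd

# Simulated scores (invented for illustration): same spread, very different group sizes
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "department": ["Engineering"] * 200 + ["HR"] * 20,
    "satisfaction_score": rng.normal(loc=3.5, scale=0.8, size=220),
})

# The standard error of the mean (sem) is std / sqrt(n): fewer people, shakier average
summary = toy.groupby("department")["satisfaction_score"].agg(["count", "mean", "std", "sem"])
print(summary.round(3))
```

The smaller group's average comes with a noticeably larger standard error, so the same observed gap is weaker evidence there.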

## 1.3 Business Insights: How Do Things Connect?

#### ✅ 5. RELATIONSHIPS: How do things connect?
A correlation matrix is the best starting point for numerical variables.

::: {.callout-tip}
#### Rule of Thumb: Interpreting Correlation Strength
- **0.0 to 0.3 (or 0.0 to -0.3):** Weak relationship.
- **0.3 to 0.6 (or -0.3 to -0.6):** Moderate relationship.
- **0.6 and above (or -0.6 and below):** Strong relationship.
:::
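
If it helps, the rule of thumb can be written as a small helper. `correlation_strength` is a hypothetical function for illustration, not part of pandas or SciPy. Treating values exactly at 0.3 and 0.6 as the higher category is an assumption, since the rule of thumb does not say.

```python
def correlation_strength(r: float) -> str:
    """Label a correlation coefficient using the rule of thumb above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("A correlation coefficient must lie between -1 and 1.")
    if r == 0:
        return "no linear relationship"

    size = abs(r)  # the sign gives the direction, the absolute value gives the strength
    if size < 0.3:
        strength = "weak"
    elif size < 0.6:
        strength = "moderate"
    else:
        strength = "strong"

    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


for value in (0.1, -0.55, 0.77):
    print(value, "->", correlation_strength(value))
```

These cut-offs are conventions rather than laws. What counts as "strong" varies by field.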

In [None]:
numerical_features = ['satisfaction_score', 'last_evaluation', 'projects_completed', 'avg_monthly_hours', 'tenure_years', 'annual_salary_dkk']
corr_matrix = df[numerical_features].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
plt.title('Correlation Matrix of Key Numerical Features', fontsize=16)
plt.show()

**Key initial findings:**
*   **`tenure_years` & `annual_salary_dkk` (0.67):** Strong positive correlation. Employees with longer tenure tend to earn more.
*   **`avg_monthly_hours` & `satisfaction_score` (-0.64):** Strong negative correlation by our rule of thumb (just past the 0.6 mark). Suggests a possible burnout effect.
*   **`projects_completed` & `last_evaluation` (0.77):** Strong positive correlation. Employees with higher evaluations tend to complete more projects.

---

# Part 2: Statistical Storytelling with Visuals (15 mins)

Let's use visuals to investigate specific business questions.

### Business Question 1: "Are some departments paid more than others?"

In [None]:
plt.figure(figsize=(12, 7))
sns.boxplot(data=df, x='department', y='annual_salary_dkk')
plt.title('Salary Distribution by Department', fontsize=16)
plt.ylabel('Annual Salary (DKK)')
plt.xlabel('Department')
plt.gca().get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, p: format(int(x), ',')))
plt.show()

The visual story is clear: Engineering salaries are structurally higher than in the other departments. Sales, Marketing, and HR sit much closer together, so any ordering among those three should be read with caution. The outlier we found earlier is confirmed to be in the Engineering department.

---

# Part 3: Is It Real or Just Random? Your Statistical Toolkit (45 mins)

Our chart shows a difference. But is it "statistically significant," or a fluke? Statistical tests tell us how surprising that difference would be if only random chance were at work.

### 3.1 Your Mental Model for Choosing a Test

::: {.callout-tip}
### Which Test Do I Use? A Simple Guide

*   🤔 **My Question:** "Am I comparing **averages** between **2 groups**?"
    *   👉 **Your Tool:** **T-test** (e.g., Average salary of Sales vs. Engineering)

*   🤔 **My Question:** "Am I comparing **averages** between **3+ groups**?"
    *   👉 **Your Tool:** **ANOVA** (e.g., Average satisfaction across all 4 departments)

*   🤔 **My Question:** "Am I looking for a relationship between **categories**?"
    *   👉 **Your Tool:** **Chi-Square Test** (e.g., Does department relate to attrition?)

*   🤔 **My Question:** "Are two **numbers** moving together?"
    *   👉 **Your Tool:** **Correlation Test** (e.g., Do hours worked go up with salary?)
:::

### 3.2 The T-Test: Comparing Averages Between Two Groups

**Business Question:** "Our chart shows a salary gap between Engineering and Sales. Is that gap real, or could it be random chance?"

In [None]:
# 1. Isolate the data for the two groups
salary_eng = df[df['department'] == 'Engineering']['annual_salary_dkk'].dropna()
salary_sales = df[df['department'] == 'Sales']['annual_salary_dkk'].dropna()

print(f"Engineering average salary: {salary_eng.mean():,.0f} DKK")
print(f"Sales average salary:     {salary_sales.mean():,.0f} DKK")

# 2. Run the test (the computer does the math)
# Welch's t-test (equal_var=False): a safer choice when group sizes or variances may differ
t_stat, p_value = stats.ttest_ind(salary_eng, salary_sales, equal_var=False)

# 3. Translate the results into plain English
print(f"\nThe P-value is: {p_value:.6f}")

if p_value < 0.05:
    print("\n✅ Conclusion: The difference is REAL and statistically significant.")
else:
    print("\n❌ Conclusion: The difference could be FAKE. We can't be sure it's not just random noise.")

### 3.3 The ANOVA Test: Comparing Averages Across Multiple Groups

What if we want to compare all four departments at once?

**Business Question:** "Is there a statistically significant difference in employee satisfaction scores across *any* of our departments?"

In [None]:
# 1. Visualize First! This helps form our hypothesis.
plt.figure(figsize=(12, 7))
sns.boxplot(data=df, x='department', y='satisfaction_score')
plt.title('Satisfaction Score by Department', fontsize=16)
plt.ylabel('Satisfaction Score (1-5)')
plt.xlabel('Department')
plt.show()

The boxplot suggests that the Sales department has a lower median satisfaction score. But is this difference significant when considering all four departments? ANOVA can tell us.

In [None]:
# 2. Isolate the data for all groups
satisfaction_eng = df[df['department'] == 'Engineering']['satisfaction_score'].dropna()
satisfaction_sales = df[df['department'] == 'Sales']['satisfaction_score'].dropna()
satisfaction_mktg = df[df['department'] == 'Marketing']['satisfaction_score'].dropna()
satisfaction_hr = df[df['department'] == 'HR']['satisfaction_score'].dropna()

# 3. Run the test
f_stat, p_value = stats.f_oneway(satisfaction_eng, satisfaction_sales, satisfaction_mktg, satisfaction_hr)

# 4. Translate the results
print(f"The P-value is: {p_value:.6f}")

if p_value < 0.05:
    conclusion = "✅ Yes, there is a statistically significant difference in satisfaction scores between at least two of the departments."
    explanation = "The variation between department averages is large enough that it's unlikely to be due to random chance."
else:
    conclusion = "❌ No, we cannot conclude there's a significant difference in satisfaction scores across departments."
    explanation = "Any differences we see in the chart could just be random noise."

print("\nTranslation:")
print(explanation)
print("\nConclusion:")
print(conclusion)

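**Key Takeaway:** The ANOVA test confirms what our eyes suspected from the boxplot. A p-value close to zero means the departmental differences in satisfaction are very unlikely to be random noise. Keep in mind that ANOVA only tells us that *at least one* department differs from the others, not which one.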
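
A significant ANOVA result does not say *which* departments differ. One simple follow-up, sketched below on simulated data, is to run a t-test for every pair of groups and tighten the threshold with a Bonferroni correction. The group names, sizes, and scores are invented for illustration. Dedicated post-hoc procedures exist as well, and this is only one option.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Simulated scores (invented for illustration): "Sales" is shifted lower on purpose
rng = np.random.default_rng(1)
groups = {
    "Engineering": rng.normal(3.5, 0.8, 150),
    "Sales": rng.normal(3.0, 0.8, 150),
    "Marketing": rng.normal(3.5, 0.8, 150),
}

pairs = list(combinations(groups, 2))
alpha_adjusted = 0.05 / len(pairs)  # Bonferroni: split the 5% budget across all comparisons

results = {}
for a, b in pairs:
    # Welch's t-test for each pair, as in the two-group example earlier
    _, p = stats.ttest_ind(groups[a], groups[b], equal_var=False)
    results[(a, b)] = p
    verdict = "differs" if p < alpha_adjusted else "no clear difference"
    print(f"{a} vs {b}: p = {p:.4f} -> {verdict}")
```

The stricter threshold is the price of asking several questions at once: each extra comparison is another chance for random noise to look like a finding.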

### 3.4 The Chi-Square Test: Finding Relationships in Categories

Now let's switch from numbers (like salary) to categories.

**Business Question:** "Are employees in certain departments more likely to leave the company (attrition) than others?"

In [None]:
# 1. Prepare the data: Create a contingency table (crosstab)
contingency_table = pd.crosstab(df['department'], df['attrition'])
print("Count of employees who stayed (0) vs. left (1) by Department:")
print(contingency_table)

# 2. Visualize: A normalized bar chart is best for comparing proportions
contingency_table.div(contingency_table.sum(axis=1), axis=0).plot(kind='bar', stacked=True, figsize=(12, 7))
plt.title('Attrition Rate by Department', fontsize=16)
plt.ylabel('Proportion of Employees')
plt.xlabel('Department')
plt.xticks(rotation=0)
plt.legend(title='Attrition', labels=['Stayed', 'Left'])
plt.show()

The chart seems to show a higher proportion of employees leaving from Sales and Marketing compared to Engineering and HR. Is this association statistically significant?

In [None]:
# 3. Run the test
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# 4. Translate the results
print(f"The P-value is: {p_value:.4f}")

if p_value < 0.05:
    conclusion = "✅ Yes, there is a statistically significant association between an employee's department and whether they leave the company."
    explanation = "The department an employee works in appears to be related to their likelihood of attrition."
else:
    conclusion = "❌ No, we cannot conclude there's a significant association."
    explanation = "Based on this data, department and attrition appear to be independent. Any pattern in the chart could be random."

print("\nTranslation:")
print(explanation)
print("\nConclusion:")
print(conclusion)
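
Before trusting a chi-square result, it is worth looking at the `expected` counts that `stats.chi2_contingency` returns. A common rule of thumb says the test becomes unreliable when expected counts fall below about 5, which can happen with small groups. The counts in this sketch are invented for illustration and are not the HR dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical counts (invented for illustration): rows = departments, columns = stayed / left
observed = np.array([
    [170, 30],   # a large department
    [45, 5],     # a small department
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Expected counts are what we would see if department and attrition were unrelated
print("Expected counts under independence:")
print(expected.round(1))
print("Smallest expected count:", round(float(expected.min()), 1))
print("All expected counts >= 5:", bool((expected >= 5).all()))
```

If some expected counts are too small, merging small categories or collecting more data is usually safer than reporting the p-value as is.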

::: {.callout-important}
#### Statistically Significant vs. Business Significant
Imagine our t-test found a statistically significant (p < 0.05) salary difference of **1,500 DKK per year**.

- **Is it real?** Statistically, yes. The pattern is not random.
- **Does it matter?** From a business perspective, probably not. A difference that small might not affect morale, and a company-wide initiative to "fix" it would be a waste of resources.

**Your job is to find things that are BOTH statistically real AND matter to the business.**
:::
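
The gap between "statistically real" and "matters to the business" can be illustrated with a simulation. With very large groups, even a tiny gap produces a small p-value, so it helps to report the size of the effect alongside it. The salaries below are simulated (an assumption for illustration), and Cohen's d is one common way to express a gap in units of standard deviation.

```python
import numpy as np
from scipy import stats

# Simulated salaries in DKK (invented for illustration): a tiny true gap, very large groups
rng = np.random.default_rng(7)
group_a = rng.normal(600_000, 20_000, 50_000)
group_b = rng.normal(601_500, 20_000, 50_000)

_, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Effect size: the raw gap, and the gap in units of standard deviation (Cohen's d)
gap = group_b.mean() - group_a.mean()
pooled_std = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = gap / pooled_std

print(f"p-value:   {p_value:.2g}")
print(f"Gap:       {gap:,.0f} DKK")
print(f"Cohen's d: {cohens_d:.3f}")
```

A Cohen's d well below 0.2 is conventionally read as a very small effect, however small the p-value is.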

::: {.callout-warning}
### ⚠️ Common Pitfalls to Avoid

- **Correlation ≠ Causation:** Just because tenure and salary are correlated doesn't mean staying longer *causes* a higher salary. A promotion (a factor we don't have) could be the true cause.
- **Sample Size Matters:** Our HR department is small. Any finding there is less reliable than a finding for the large Engineering department.
- **The Multiple Testing Problem:** If you run 20 independent tests on pure noise, there is roughly a 64% chance that at least one of them comes out "significant" (p < 0.05) by pure random luck. Test clear business questions, don't just hunt for low p-values.
:::
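
The arithmetic behind the multiple testing problem is short enough to check directly. This sketch assumes the tests are independent and that there is no real effect in any of them. The Bonferroni threshold at the end is one simple, conservative correction.

```python
alpha = 0.05    # significance threshold used throughout this notebook
n_tests = 20    # number of independent tests, none with a true effect

# Chance that at least one test comes out "significant" by luck alone
p_any_false_positive = 1 - (1 - alpha) ** n_tests

# Expected number of false positives, and the stricter Bonferroni threshold per test
expected_false_positives = alpha * n_tests
bonferroni_alpha = alpha / n_tests

print(f"Chance of at least one false positive: {p_any_false_positive:.0%}")
print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"Bonferroni threshold per test: {bonferroni_alpha:.4f}")
```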

---

# Part 4: Building the Business Presentation (30 mins)

## 🚀 Your Turn: Building a Business Case with Data (In-Class Practice)

**Your Mission:** You are a data scientist at the company. Management is concerned about **employee burnout** and its impact on satisfaction and retention. Your manager has asked you to investigate: **"Is there evidence that overworked employees are less satisfied and more likely to leave?"**

Work in pairs. Your goal is to prepare a 2-minute presentation for management based on your findings. Use the "Which Test Do I Use?" guide to help you.

### Your Analysis Must Include:

1.  **Hypothesis 1 (Correlation):** Is there a relationship between `avg_monthly_hours` and `satisfaction_score`?
    *   **Visualize:** Create a scatter plot with a regression line (`sns.regplot`).
    *   **Test:** Run a Pearson correlation test (`stats.pearsonr`).
    *   **Translate:** Explain the result in plain English.
    *   **Hint:** Remember that `satisfaction_score` has missing values. How will you handle them before running the correlation test?

2.  **Hypothesis 2 (T-Test):** Do employees who leave the company (`attrition` = 1) work more hours on average than those who stay?
    *   **Visualize:** Create a boxplot comparing `avg_monthly_hours` for the two `attrition` groups.
    *   **Test:** Run a T-test.
    *   **Translate:** Explain the result in plain English.

3.  **Business Recommendation:** Based on your findings, what is your **one key takeaway** for management? Propose **one concrete action** they could take.

----