# Module 9: Exploratory Data Analysis

## Topics Covered
1. The EDA Process and Workflow
2. Understanding Your Data
3. Univariate Analysis
4. Bivariate Analysis
5. Multivariate Analysis
6. Identifying Patterns and Anomalies
7. EDA Case Study: Real Dataset

## Learning Objectives

By the end of this module, you will be able to:
- Follow a systematic approach to exploring new datasets
- Summarize data using appropriate statistics and visualizations
- Identify relationships between variables
- Detect outliers, anomalies, and data quality issues
- Generate hypotheses from data exploration
- Document findings for stakeholders

---
# Section 1: The EDA Process and Workflow
---

## What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is the critical first step in any data science project. It involves:

- **Understanding** the structure and content of your data
- **Summarizing** main characteristics using statistics and visualizations
- **Discovering** patterns, relationships, and anomalies
- **Formulating** hypotheses for further analysis

### Why This Matters in Data Science

EDA helps you:
- Avoid costly mistakes from misunderstanding your data
- Make informed decisions about data cleaning and preprocessing
- Choose appropriate modeling techniques
- Communicate findings to stakeholders effectively

## The EDA Workflow

A systematic approach to EDA:

1. **Data Collection & Loading**
   - Load data from various sources
   - Initial inspection of structure

2. **Data Quality Assessment**
   - Check for missing values
   - Identify data types
   - Detect duplicates

3. **Univariate Analysis**
   - Analyze each variable individually
   - Distribution, central tendency, spread

4. **Bivariate Analysis**
   - Relationships between pairs of variables
   - Correlations, comparisons

5. **Multivariate Analysis**
   - Complex relationships among multiple variables
   - Patterns and clusters

6. **Findings & Insights**
   - Summarize key discoveries
   - Document for next steps

In [None]:
# Import libraries for EDA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.2f}'.format)
sns.set_style('whitegrid')

print("Libraries imported successfully")

---
# Section 2: Understanding Your Data
---

The first step is to understand what you're working with: structure, size, data types, and quality.

In [None]:
# Load our datasets
sales = pd.read_csv('assets/datasets/sales_data.csv', parse_dates=['date'])
employees = pd.read_csv('assets/datasets/employees.csv')

print(f"Sales dataset: {sales.shape[0]} rows, {sales.shape[1]} columns")
print(f"Employees dataset: {employees.shape[0]} rows, {employees.shape[1]} columns")

In [None]:
# Example: First look at the data

print("Sales Data - First 5 Rows:")
print(sales.head())

In [None]:
# Example: Data types and memory usage

print("Data Types and Info:")
sales.info()  # info() prints its report directly and returns None, so no print() wrapper

In [None]:
# Example: Quick statistics

print("Numeric Summary:")
print(sales.describe())

In [None]:
# Example: Categorical summary

print("Categorical Summary:")
print(sales.describe(include='object'))

In [None]:
# Example: Missing values analysis

def missing_values_summary(df):
    """Create a summary of missing values."""
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    
    summary = pd.DataFrame({
        'Missing Count': missing,
        'Missing %': missing_pct,
        'Data Type': df.dtypes
    })
    
    return summary[summary['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

print("Missing Values in Sales Data:")
print(missing_values_summary(sales))

In [None]:
# Example: Check for duplicates

print(f"Duplicate rows in sales: {sales.duplicated().sum()}")
print(f"Duplicate transaction IDs: {sales['transaction_id'].duplicated().sum()}")
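By default `duplicated()` marks only the second and later copies, so the counts above do not show which rows they duplicate. A minimal sketch of pulling every copy for side-by-side inspection with `keep=False`; the toy frame and its values are made up for illustration and are not taken from the course dataset:

```python
import pandas as pd

# Hypothetical toy data: transaction 102 appears twice.
toy = pd.DataFrame({
    "transaction_id": [101, 102, 102, 103],
    "total_amount": [20.0, 35.5, 35.5, 12.0],
})

# Default keep='first' flags only the later copy.
later_copies = toy.duplicated(subset="transaction_id")

# keep=False flags every row that shares an ID with another row.
all_copies = toy[toy.duplicated(subset="transaction_id", keep=False)]

print(f"Later copies flagged: {later_copies.sum()}")
print(all_copies)
```

Seeing both copies together makes it easier to judge whether they are true duplicates or distinct records that happen to share an ID.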

In [None]:
# Example: Unique values in categorical columns

categorical_cols = sales.select_dtypes(include='object').columns

print("Unique Values per Categorical Column:")
print("-" * 40)
for col in categorical_cols:
    unique_count = sales[col].nunique()
    print(f"{col}: {unique_count} unique values")
    if unique_count <= 10:
        print(f"  Values: {sales[col].unique().tolist()}")

In [None]:
# Example: Date range analysis

print(f"Date range: {sales['date'].min()} to {sales['date'].max()}")
print(f"Time span: {(sales['date'].max() - sales['date'].min()).days} days")
print(f"Unique dates: {sales['date'].nunique()}")
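When the number of unique dates is smaller than the time span, some days have no records. One way to list those days, sketched on a small made-up series (the dates are hypothetical, not from the sales data), is to compare the observed dates against a full daily range:

```python
import pandas as pd

# Hypothetical toy dates: 2024-01-03 and 2024-01-04 have no rows.
dates = pd.Series(pd.to_datetime([
    "2024-01-01", "2024-01-02", "2024-01-02", "2024-01-05",
]))

# Every calendar day between the first and last observation.
full_range = pd.date_range(dates.min(), dates.max(), freq="D")

# Days in the full range that never appear in the data.
missing_days = full_range.difference(pd.DatetimeIndex(dates.unique()))

print(f"Days with no records: {len(missing_days)}")
print(missing_days.strftime("%Y-%m-%d").tolist())
```

Whether a gap is a problem depends on the business: a store closed on weekends would be expected to show regular gaps.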

## EDA Helper Function

Let's create a reusable function for initial data exploration.

In [None]:
def initial_eda(df, name="Dataset"):
    """Perform initial exploratory data analysis on a DataFrame."""
    print(f"{'='*60}")
    print(f"EDA Report: {name}")
    print(f"{'='*60}")
    
    # Basic info
    print(f"\n1. SHAPE: {df.shape[0]} rows, {df.shape[1]} columns")
    
    # Data types
    print(f"\n2. DATA TYPES:")
    print(df.dtypes.value_counts())
    
    # Missing values
    missing = df.isnull().sum()
    if missing.sum() > 0:
        print(f"\n3. MISSING VALUES:")
        print(missing[missing > 0])
    else:
        print(f"\n3. MISSING VALUES: None")
    
    # Duplicates
    dups = df.duplicated().sum()
    print(f"\n4. DUPLICATE ROWS: {dups}")
    
    # Numeric summary
    numeric_cols = df.select_dtypes(include=np.number).columns
    if len(numeric_cols) > 0:
        print(f"\n5. NUMERIC COLUMNS SUMMARY:")
        print(df[numeric_cols].describe().round(2))
    
    # Categorical summary
    cat_cols = df.select_dtypes(include='object').columns
    if len(cat_cols) > 0:
        print(f"\n6. CATEGORICAL COLUMNS:")
        for col in cat_cols:
            print(f"   {col}: {df[col].nunique()} unique values")
    
    print(f"\n{'='*60}")

# Test the function
initial_eda(sales, "Sales Data")

---
# Section 3: Univariate Analysis
---

Univariate analysis examines each variable individually to understand its distribution and characteristics.

## Numeric Variables

For numeric variables, we examine:
- Central tendency (mean, median, mode)
- Spread (standard deviation, range, IQR)
- Distribution shape (skewness, kurtosis)
- Outliers
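These measures interact: a single extreme value pulls the mean toward it while leaving the median almost unchanged, and the skewness statistic picks up the resulting asymmetry. A small sketch on made-up numbers (the values are illustrative only):

```python
import pandas as pd

# Hypothetical amounts: four typical values and one extreme one.
amounts = pd.Series([10.0, 12.0, 11.0, 13.0, 104.0])

mean_val = amounts.mean()      # pulled upward by the extreme value
median_val = amounts.median()  # stays with the typical values
skew_val = amounts.skew()      # positive for a long right tail

print(f"Mean:   {mean_val:.2f}")
print(f"Median: {median_val:.2f}")
print(f"Skew:   {skew_val:.2f}")
```

A mean well above the median is a quick hint of right skew, which is worth confirming with a histogram.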

In [None]:
# Example: Detailed analysis of a numeric variable

def analyze_numeric(series, name):
    """Detailed analysis of a numeric variable."""
    print(f"Analysis of: {name}")
    print("-" * 40)
    
    # Central tendency
    print(f"Mean: {series.mean():.2f}")
    print(f"Median: {series.median():.2f}")
    print(f"Mode: {series.mode().iloc[0]:.2f}")
    
    # Spread
    print(f"\nStd Dev: {series.std():.2f}")
    print(f"Variance: {series.var():.2f}")
    print(f"Range: {series.min():.2f} - {series.max():.2f}")
    print(f"IQR: {series.quantile(0.75) - series.quantile(0.25):.2f}")
    
    # Shape
    print(f"\nSkewness: {series.skew():.2f}")
    print(f"Kurtosis: {series.kurtosis():.2f}")
    
    # Percentiles
    print(f"\nPercentiles:")
    for p in [5, 25, 50, 75, 95]:
        print(f"  {p}th: {series.quantile(p/100):.2f}")

analyze_numeric(sales['total_amount'], 'Total Amount')

In [None]:
# Example: Visualizing numeric distribution

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Histogram with KDE
sns.histplot(sales['total_amount'], kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Distribution of Transaction Amount')
axes[0, 0].axvline(sales['total_amount'].mean(), color='red', linestyle='--', label='Mean')
axes[0, 0].axvline(sales['total_amount'].median(), color='green', linestyle='--', label='Median')
axes[0, 0].legend()

# Box plot
sns.boxplot(x=sales['total_amount'], ax=axes[0, 1])
axes[0, 1].set_title('Box Plot of Transaction Amount')

# Violin plot
sns.violinplot(x=sales['total_amount'], ax=axes[1, 0])
axes[1, 0].set_title('Violin Plot of Transaction Amount')

# QQ plot (for normality check)
from scipy import stats
stats.probplot(sales['total_amount'], dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot (Normality Check)')

plt.tight_layout()
plt.show()

In [None]:
# Example: Analyzing all numeric columns

numeric_cols = sales.select_dtypes(include=np.number).columns.tolist()

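# squeeze=False keeps `axes` two-dimensional even when there is only one numeric column
fig, axes = plt.subplots(1, len(numeric_cols), figsize=(5*len(numeric_cols), 4), squeeze=False)

for i, col in enumerate(numeric_cols):
    sns.histplot(sales[col].dropna(), kde=True, ax=axes[0, i])
    axes[0, i].set_title(f'Distribution: {col}')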

plt.tight_layout()
plt.show()

## Categorical Variables

For categorical variables, we examine:
- Frequency distribution
- Proportion of each category
- Rare categories
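Rare categories can be flagged by comparing each category's share of rows against a cutoff. The sketch below uses a hypothetical helper, `rare_categories`, and an arbitrary 5% threshold; both are assumptions for illustration, and the right cutoff depends on the dataset:

```python
import pandas as pd

def rare_categories(series, min_share=0.05):
    """Return categories whose share of non-missing rows is below min_share."""
    shares = series.value_counts(normalize=True)
    return shares[shares < min_share]

# Hypothetical toy data: 'Gadget' appears once in 25 rows (4%).
products = pd.Series(["Widget"] * 14 + ["Gizmo"] * 10 + ["Gadget"])

rare = rare_categories(products)
print(rare)
```

Categories flagged this way are candidates for grouping into an "Other" bucket or for a closer look at whether they are typos of a more common label.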

In [None]:
# Example: Categorical variable analysis

def analyze_categorical(series, name, top_n=10):
    """Detailed analysis of a categorical variable."""
    print(f"Analysis of: {name}")
    print("-" * 40)
    
    # Basic stats
    print(f"Unique values: {series.nunique()}")
    print(f"Most common: {series.mode().iloc[0]}")
    print(f"Missing: {series.isnull().sum()}")
    
    # Value counts
    print(f"\nTop {top_n} Values:")
    vc = series.value_counts()
    vc_pct = series.value_counts(normalize=True) * 100
    
    for i, (val, count) in enumerate(vc.head(top_n).items()):
        print(f"  {val}: {count} ({vc_pct[val]:.1f}%)")

analyze_categorical(sales['product'], 'Product')

In [None]:
# Example: Visualizing categorical distributions

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Category distribution
sales['category'].value_counts().plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('Transactions by Category')
axes[0].set_xlabel('Category')
axes[0].set_ylabel('Count')

# Region distribution
sales['region'].value_counts().plot(kind='bar', ax=axes[1], color='coral')
axes[1].set_title('Transactions by Region')
axes[1].set_xlabel('Region')

# Pie chart for category
sales['category'].value_counts().plot(kind='pie', ax=axes[2], autopct='%1.1f%%')
axes[2].set_title('Category Distribution')
axes[2].set_ylabel('')

plt.tight_layout()
plt.show()

## Practice Exercise 3.1

**Task:** Perform univariate analysis on the employees dataset:
1. Analyze the salary distribution (statistics and visualization)
2. Analyze the department distribution
3. Identify any potential outliers in salary

In [None]:
# Your code here


In [None]:
# Solution 3.1

employees = pd.read_csv('assets/datasets/employees.csv')

# 1. Salary analysis
print("1. SALARY ANALYSIS")
print("="*40)
analyze_numeric(employees['salary'], 'Salary')

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

sns.histplot(employees['salary'], kde=True, ax=axes[0])
axes[0].set_title('Salary Distribution')

sns.boxplot(x=employees['salary'], ax=axes[1])
axes[1].set_title('Salary Box Plot')

# 2. Department distribution
employees['department'].value_counts().plot(kind='bar', ax=axes[2], color='steelblue')
axes[2].set_title('Employees by Department')
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# 3. Outlier detection
print("\n3. OUTLIER DETECTION")
print("="*40)
Q1 = employees['salary'].quantile(0.25)
Q3 = employees['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = employees[(employees['salary'] < lower_bound) | (employees['salary'] > upper_bound)]
print(f"IQR: ${IQR:,.2f}")
print(f"Bounds: ${lower_bound:,.2f} - ${upper_bound:,.2f}")
print(f"Outliers found: {len(outliers)}")
if len(outliers) > 0:
    print(outliers[['first_name', 'last_name', 'department', 'salary']].head(10))

---
# Section 4: Bivariate Analysis
---

Bivariate analysis examines the relationship between two variables.

## Numeric vs Numeric

For two numeric variables, we examine:
- Correlation coefficient
- Scatter plots
- Trend analysis
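The default correlation in pandas is Pearson's, which measures linear association. For a relationship that is monotonic but curved, a rank-based (Spearman) correlation can be more informative. A minimal sketch on made-up data, computing Spearman as the Pearson correlation of the ranks:

```python
import pandas as pd

# Hypothetical data: y rises with x, but along a curve rather than a line.
x = pd.Series(range(1, 11), dtype=float)
y = x ** 3

# Pearson: strength of the linear relationship.
pearson_r = x.corr(y)

# Spearman: Pearson correlation computed on the ranks of the values.
spearman_r = x.rank().corr(y.rank())

print(f"Pearson r:  {pearson_r:.3f}")
print(f"Spearman r: {spearman_r:.3f}")
```

A noticeable gap between the two coefficients suggests the relationship is not a straight line, which a scatter plot can confirm.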

In [None]:
# Example: Correlation between numeric variables

employees = pd.read_csv('assets/datasets/employees.csv')
employees['hire_date'] = pd.to_datetime(employees['hire_date'])
# Years since hire date (tenure), used here as a proxy for experience
employees['years_exp'] = (pd.Timestamp.now() - employees['hire_date']).dt.days / 365.25

# Fill missing values for analysis
employees['bonus'] = employees['bonus'].fillna(0)
employees['performance_rating'] = employees['performance_rating'].fillna(employees['performance_rating'].median())

# Calculate correlations
numeric_cols = ['salary', 'bonus', 'years_exp', 'performance_rating']
correlations = employees[numeric_cols].corr()

print("Correlation Matrix:")
print(correlations.round(3))

In [None]:
# Example: Visualizing correlations

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Heatmap
sns.heatmap(correlations, annot=True, cmap='coolwarm', center=0, ax=axes[0], fmt='.2f')
axes[0].set_title('Correlation Heatmap')

# Scatter plot: Experience vs Salary
sns.scatterplot(data=employees, x='years_exp', y='salary', alpha=0.6, ax=axes[1])
axes[1].set_title(f"Experience vs Salary (r = {correlations.loc['years_exp', 'salary']:.3f})")
axes[1].set_xlabel('Years of Experience')
axes[1].set_ylabel('Salary ($)')

plt.tight_layout()
plt.show()

In [None]:
# Example: Scatter with regression line

fig, ax = plt.subplots(figsize=(10, 6))

sns.regplot(data=employees, x='years_exp', y='salary', scatter_kws={'alpha': 0.5}, ax=ax)
ax.set_title('Experience vs Salary with Trend Line')
ax.set_xlabel('Years of Experience')
ax.set_ylabel('Salary ($)')

plt.show()

## Categorical vs Numeric

For comparing a numeric variable across categories:

In [None]:
# Example: Salary by department

dept_salary = employees.groupby('department')['salary'].agg(['mean', 'median', 'std', 'count'])
dept_salary = dept_salary.sort_values('mean', ascending=False)

print("Salary Statistics by Department:")
print(dept_salary.round(2))

In [None]:
# Example: Visualizing categorical vs numeric

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Box plot
sns.boxplot(data=employees, x='department', y='salary', ax=axes[0])
axes[0].set_title('Salary by Department')
axes[0].tick_params(axis='x', rotation=45)

# Violin plot
sns.violinplot(data=employees, x='department', y='salary', ax=axes[1])
axes[1].set_title('Salary Distribution by Department')
axes[1].tick_params(axis='x', rotation=45)

# Bar plot with error bars
sns.barplot(data=employees, x='department', y='salary', ax=axes[2], errorbar='sd')
axes[2].set_title('Average Salary by Department')
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Example: Sales amount by category

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
sns.boxplot(data=sales, x='category', y='total_amount', ax=axes[0])
axes[0].set_title('Transaction Amount by Category')

# Compare regions
sns.boxplot(data=sales, x='region', y='total_amount', ax=axes[1])
axes[1].set_title('Transaction Amount by Region')

plt.tight_layout()
plt.show()

## Categorical vs Categorical

For comparing two categorical variables:

In [None]:
# Example: Cross-tabulation

crosstab = pd.crosstab(sales['category'], sales['region'])
print("Transaction Count: Category vs Region")
print(crosstab)
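Raw counts are hard to compare when categories differ in size. A sketch of converting a cross-tabulation to row proportions with `normalize='index'`, on a small made-up frame (the labels are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: category A has four rows, category B has two.
toy = pd.DataFrame({
    "category": ["A", "A", "A", "A", "B", "B"],
    "region":   ["North", "North", "North", "South", "North", "South"],
})

# normalize='index' divides each row by its total, so every row sums to 1.
row_shares = pd.crosstab(toy["category"], toy["region"], normalize="index")

print(row_shares.round(2))
```

Reading across a row then answers "of this category's transactions, what fraction came from each region?" regardless of how many transactions the category has.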

In [None]:
# Example: Visualizing category vs category

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Heatmap of counts
sns.heatmap(crosstab, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title('Transaction Count Heatmap')

# Stacked bar
crosstab.plot(kind='bar', stacked=True, ax=axes[1])
axes[1].set_title('Transactions by Category and Region')
axes[1].set_xlabel('Category')
axes[1].set_ylabel('Count')
axes[1].legend(title='Region')

plt.tight_layout()
plt.show()

## Practice Exercise 4.1

**Task:** Analyze the relationship between employee performance rating and salary:
1. Calculate the correlation
2. Create a scatter plot with regression line
3. Compare average salary across performance rating levels

In [None]:
# Your code here


In [None]:
# Solution 4.1

employees = pd.read_csv('assets/datasets/employees.csv')
employees['performance_rating'] = employees['performance_rating'].fillna(employees['performance_rating'].median())

# 1. Correlation
corr = employees['performance_rating'].corr(employees['salary'])
print(f"1. Correlation between Performance and Salary: {corr:.3f}")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 2. Scatter with regression
sns.regplot(data=employees, x='performance_rating', y='salary', 
            scatter_kws={'alpha': 0.5}, ax=axes[0])
axes[0].set_title(f'Performance vs Salary (r = {corr:.3f})')

# 3. Average salary by rating
rating_salary = employees.groupby('performance_rating')['salary'].agg(['mean', 'count'])
print("\n3. Average Salary by Performance Rating:")
print(rating_salary.round(2))

sns.barplot(data=employees, x='performance_rating', y='salary', ax=axes[1])
axes[1].set_title('Average Salary by Performance Rating')

plt.tight_layout()
plt.show()

---
# Section 5: Multivariate Analysis
---

Multivariate analysis examines relationships among three or more variables simultaneously.
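One reason to bring in a third variable is that a pairwise relationship can change, or even reverse, once the data are split by group (often called Simpson's paradox). The sketch below uses made-up numbers chosen to show the effect; it is an illustration, not a finding from the course datasets:

```python
import pandas as pd

# Hypothetical data: within each group y falls as x rises,
# but group B sits higher on both axes than group A.
toy = pd.DataFrame({
    "group": ["A"] * 3 + ["B"] * 3,
    "x": [1, 2, 3, 6, 7, 8],
    "y": [3, 2, 1, 8, 7, 6],
})

# Correlation ignoring the grouping variable.
overall_r = toy["x"].corr(toy["y"])

# Correlation computed separately inside each group.
within_r = {name: g["x"].corr(g["y"]) for name, g in toy.groupby("group")}

print(f"Overall r: {overall_r:.3f}")
for name, r in within_r.items():
    print(f"Group {name} r: {r:.3f}")
```

Coloring or faceting a scatter plot by the grouping variable is the visual counterpart of this check.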

In [None]:
# Example: Pair plot for multiple variables

employees = pd.read_csv('assets/datasets/employees.csv')
employees['hire_date'] = pd.to_datetime(employees['hire_date'])
employees['years_exp'] = (pd.Timestamp.now() - employees['hire_date']).dt.days / 365.25
employees['performance_rating'] = employees['performance_rating'].fillna(3)

# Select columns for pair plot
plot_cols = ['salary', 'years_exp', 'performance_rating']

g = sns.pairplot(employees[plot_cols + ['department']], 
                 hue='department', 
                 height=2.5,
                 plot_kws={'alpha': 0.6})
g.figure.suptitle('Employee Metrics by Department', y=1.02)
plt.show()

In [None]:
# Example: Scatter with multiple dimensions (color and size)

fig, ax = plt.subplots(figsize=(12, 8))

scatter = ax.scatter(
    employees['years_exp'],
    employees['salary'],
    c=employees['performance_rating'],
    s=employees['performance_rating'] * 30,
    alpha=0.6,
    cmap='viridis'
)

plt.colorbar(scatter, label='Performance Rating')
ax.set_xlabel('Years of Experience')
ax.set_ylabel('Salary ($)')
ax.set_title('Salary vs Experience (color & size = Performance)')

plt.show()

In [None]:
# Example: Faceted plots (small multiples)

g = sns.FacetGrid(employees, col='department', col_wrap=3, height=4)
g.map_dataframe(sns.scatterplot, x='years_exp', y='salary', alpha=0.6)
g.set_axis_labels('Years of Experience', 'Salary ($)')
g.set_titles('{col_name}')
g.figure.suptitle('Experience vs Salary by Department', y=1.02)

plt.show()

In [None]:
# Example: Multi-dimensional pivot table

sales = pd.read_csv('assets/datasets/sales_data.csv', parse_dates=['date'])
sales['quarter'] = sales['date'].dt.quarter
sales['year'] = sales['date'].dt.year

# Pivot by multiple dimensions
pivot = sales.pivot_table(
    values='total_amount',
    index=['category', 'region'],
    columns='year',
    aggfunc='sum'
)

print("Sales by Category, Region, and Year:")
print(pivot.round(2))

In [None]:
# Example: Grouped bar chart with multiple categories

# Average transaction by category and region
fig, ax = plt.subplots(figsize=(12, 6))

sns.barplot(data=sales, x='category', y='total_amount', hue='region', ax=ax)

ax.set_title('Average Transaction by Category and Region')
ax.set_ylabel('Average Amount ($)')
ax.legend(title='Region', bbox_to_anchor=(1.02, 1))

plt.tight_layout()
plt.show()

---
# Section 6: Identifying Patterns and Anomalies
---

A key goal of EDA is to find patterns, trends, and anomalies in the data.

## Time-Based Patterns

In [None]:
# Example: Temporal patterns in sales

sales = pd.read_csv('assets/datasets/sales_data.csv', parse_dates=['date'])

# Add time components
sales['year'] = sales['date'].dt.year
sales['month'] = sales['date'].dt.month
sales['day_of_week'] = sales['date'].dt.day_name()
sales['quarter'] = sales['date'].dt.quarter

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Monthly pattern
monthly = sales.groupby('month')['total_amount'].mean()
axes[0, 0].bar(monthly.index, monthly.values, color='steelblue')
axes[0, 0].set_title('Average Sales by Month')
axes[0, 0].set_xlabel('Month')
axes[0, 0].set_ylabel('Average Sales ($)')

# Day of week pattern
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
daily = sales.groupby('day_of_week')['total_amount'].mean().reindex(day_order)
axes[0, 1].bar(range(7), daily.values, color='coral')
axes[0, 1].set_title('Average Sales by Day of Week')
axes[0, 1].set_xticks(range(7))
axes[0, 1].set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

# Quarterly trend
quarterly = sales.groupby(['year', 'quarter'])['total_amount'].sum().reset_index()
quarterly['period'] = quarterly['year'].astype(str) + '-Q' + quarterly['quarter'].astype(str)
axes[1, 0].plot(quarterly['period'], quarterly['total_amount'], 'o-')
axes[1, 0].set_title('Quarterly Sales Trend')
axes[1, 0].tick_params(axis='x', rotation=45)

# Year over year comparison
yearly = sales.groupby('year')['total_amount'].sum()
axes[1, 1].bar(yearly.index.astype(str), yearly.values, color='seagreen')
axes[1, 1].set_title('Total Sales by Year')

plt.tight_layout()
plt.show()
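Grouped averages collapse the timeline into calendar buckets. A complementary view keeps the timeline and smooths it with a rolling mean, which can make a trend easier to see through day-to-day noise. A minimal sketch on a made-up daily series (the values and the 3-day window are arbitrary choices for illustration):

```python
import pandas as pd

# Hypothetical daily totals with one spike on the fourth day.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
daily = pd.Series([5, 7, 6, 20, 6, 5, 7, 6, 5, 8], index=dates, dtype=float)

# Mean of each day and the two days before it; the first two days
# have no full window and are left as NaN.
rolling_mean = daily.rolling(window=3).mean()

print(pd.DataFrame({"daily": daily, "rolling_3d": rolling_mean}))
```

A wider window gives a smoother line but reacts more slowly to real changes, so it is worth trying a few window sizes.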

## Outlier Detection

In [None]:
# Example: Outlier detection methods

def detect_outliers(df, column):
    """Detect outliers using IQR and Z-score methods."""
    data = df[column].dropna()
    
    # IQR method
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    iqr_outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    
    # Z-score method (select by index label so rows dropped as NaN cannot misalign the mask)
    z_scores = np.abs((data - data.mean()) / data.std())
    zscore_outliers = df.loc[z_scores[z_scores > 3].index]
    
    print(f"Outlier Detection for: {column}")
    print("="*50)
    print(f"\nIQR Method:")
    print(f"  Bounds: {lower_bound:.2f} - {upper_bound:.2f}")
    print(f"  Outliers: {len(iqr_outliers)} ({len(iqr_outliers)/len(df)*100:.1f}%)")
    print(f"\nZ-Score Method (|z| > 3):")
    print(f"  Outliers: {len(zscore_outliers)} ({len(zscore_outliers)/len(df)*100:.1f}%)")
    
    return iqr_outliers, zscore_outliers

iqr_out, zscore_out = detect_outliers(sales, 'total_amount')
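The two methods can disagree, especially on small samples: an extreme value inflates the standard deviation, which shrinks its own z-score and can hide it from the |z| > 3 rule. A sketch on made-up numbers chosen to show this effect:

```python
import pandas as pd

# Hypothetical sample: seven typical values and one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 100], dtype=float)

# IQR rule: quartiles are barely affected by the extreme value.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_flags = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Z-score rule: the extreme value inflates the standard deviation it is judged by.
z = (values - values.mean()).abs() / values.std()
z_flags = values[z > 3]

print(f"IQR rule flags:     {iqr_flags.tolist()}")
print(f"Z-score rule flags: {z_flags.tolist()}")
print(f"Largest |z|:        {z.max():.2f}")
```

For skewed or small datasets it is worth running both rules and looking at the points where they disagree.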

In [None]:
# Example: Visualizing outliers

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Box plot shows outliers as points
sns.boxplot(x=sales['total_amount'], ax=axes[0])
axes[0].set_title('Box Plot (outliers as points)')

# Histogram with outlier threshold
Q1 = sales['total_amount'].quantile(0.25)
Q3 = sales['total_amount'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR

sns.histplot(sales['total_amount'], ax=axes[1], bins=50)
axes[1].axvline(upper_bound, color='red', linestyle='--', label=f'Upper Bound: ${upper_bound:.0f}')
axes[1].legend()
axes[1].set_title('Distribution with Outlier Threshold')

# Scatter plot highlighting outliers
is_outlier = sales['total_amount'] > upper_bound
axes[2].scatter(range(len(sales)), sales['total_amount'], 
                c=is_outlier, cmap='coolwarm', alpha=0.5)
axes[2].axhline(upper_bound, color='red', linestyle='--')
axes[2].set_title('Transactions (red = outliers)')
axes[2].set_xlabel('Transaction Index')
axes[2].set_ylabel('Amount ($)')

plt.tight_layout()
plt.show()

In [None]:
# Example: Investigating outliers

print("High-value transactions (outliers):")
print("="*60)

outliers = sales[sales['total_amount'] > upper_bound].sort_values('total_amount', ascending=False)
print(f"Count: {len(outliers)}")
print(f"\nBy Category:")
print(outliers['category'].value_counts())
print(f"\nBy Product:")
print(outliers['product'].value_counts())
print(f"\nSample outliers:")
print(outliers[['date', 'product', 'quantity', 'total_amount']].head(10))

---
# Section 7: EDA Case Study
---

Let's perform a complete EDA on the employees dataset as a comprehensive example.

In [None]:
# Load and prepare data
employees = pd.read_csv('assets/datasets/employees.csv')
employees['hire_date'] = pd.to_datetime(employees['hire_date'])
employees['years_exp'] = (pd.Timestamp.now() - employees['hire_date']).dt.days / 365.25

print("EMPLOYEE DATA - EXPLORATORY DATA ANALYSIS")
print("="*60)

In [None]:
# Step 1: Initial Overview
print("\n1. DATA OVERVIEW")
print("-"*40)
print(f"Shape: {employees.shape}")
print(f"\nColumn Types:")
print(employees.dtypes)
print(f"\nFirst few rows:")
employees.head()

In [None]:
# Step 2: Data Quality
print("\n2. DATA QUALITY")
print("-"*40)

# Missing values
print("Missing Values:")
missing = employees.isnull().sum()
print(missing[missing > 0])

# Duplicates
print(f"\nDuplicate employee IDs: {employees['employee_id'].duplicated().sum()}")

# Status distribution
print(f"\nEmployee Status:")
print(employees['status'].value_counts())

In [None]:
# Step 3: Univariate Analysis - Key Variables
print("\n3. KEY VARIABLE DISTRIBUTIONS")
print("-"*40)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Salary distribution
sns.histplot(employees['salary'], kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Salary Distribution')
axes[0, 0].axvline(employees['salary'].median(), color='red', linestyle='--', label='Median')
axes[0, 0].legend()

# Department counts
dept_order = employees['department'].value_counts().index
sns.countplot(data=employees, y='department', order=dept_order, ax=axes[0, 1])
axes[0, 1].set_title('Employees by Department')

# Years of experience
sns.histplot(employees['years_exp'], kde=True, ax=axes[0, 2])
axes[0, 2].set_title('Years of Experience')

# Performance ratings
employees['performance_rating'].value_counts().sort_index().plot(kind='bar', ax=axes[1, 0])
axes[1, 0].set_title('Performance Rating Distribution')
axes[1, 0].set_xlabel('Rating')

# Status
employees['status'].value_counts().plot(kind='pie', autopct='%1.1f%%', ax=axes[1, 1])
axes[1, 1].set_title('Employee Status')

# Salary box plot
sns.boxplot(x=employees['salary'], ax=axes[1, 2])
axes[1, 2].set_title('Salary Box Plot')

plt.tight_layout()
plt.show()

In [None]:
# Step 4: Bivariate Analysis
print("\n4. RELATIONSHIPS BETWEEN VARIABLES")
print("-"*40)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Salary by department
dept_order = employees.groupby('department')['salary'].median().sort_values(ascending=False).index
sns.boxplot(data=employees, x='salary', y='department', order=dept_order, ax=axes[0, 0])
axes[0, 0].set_title('Salary by Department')

# Experience vs Salary
sns.scatterplot(data=employees, x='years_exp', y='salary', hue='department', alpha=0.6, ax=axes[0, 1])
axes[0, 1].set_title('Experience vs Salary')
axes[0, 1].legend(bbox_to_anchor=(1.02, 1), title='Department')

# Performance vs Salary
employees_clean = employees.dropna(subset=['performance_rating'])
sns.boxplot(data=employees_clean, x='performance_rating', y='salary', ax=axes[1, 0])
axes[1, 0].set_title('Salary by Performance Rating')

# Correlation heatmap
numeric_cols = ['salary', 'bonus', 'years_exp', 'performance_rating']
employees_for_corr = employees[numeric_cols].dropna()
sns.heatmap(employees_for_corr.corr(), annot=True, cmap='coolwarm', center=0, ax=axes[1, 1])
axes[1, 1].set_title('Correlation Matrix')

plt.tight_layout()
plt.show()

In [None]:
# Step 5: Key Insights Summary
print("\n5. KEY FINDINGS & INSIGHTS")
print("="*60)

# Salary insights
print("\nSALARY INSIGHTS:")
print(f"  - Average salary: ${employees['salary'].mean():,.0f}")
print(f"  - Median salary: ${employees['salary'].median():,.0f}")
print(f"  - Salary range: ${employees['salary'].min():,.0f} - ${employees['salary'].max():,.0f}")

# Top paying departments
print("\nTOP PAYING DEPARTMENTS (by median salary):")
dept_median = employees.groupby('department')['salary'].median().sort_values(ascending=False)
for i, (dept, sal) in enumerate(dept_median.head(3).items(), 1):
    print(f"  {i}. {dept}: ${sal:,.0f}")

# Experience correlation
exp_sal_corr = employees['years_exp'].corr(employees['salary'])
print(f"\nEXPERIENCE vs SALARY CORRELATION: {exp_sal_corr:.3f}")
if exp_sal_corr > 0.3:
    print("  -> Positive relationship: More experience tends to mean higher salary")

# Performance insights
perf_sal = employees.groupby('performance_rating')['salary'].mean()
print(f"\nSALARY BY PERFORMANCE RATING:")
for rating, sal in perf_sal.items():
    print(f"  Rating {rating:.0f}: ${sal:,.0f}")

# Data quality notes
print(f"\nDATA QUALITY NOTES:")
missing_pct = employees.isnull().sum() / len(employees) * 100
for col, pct in missing_pct[missing_pct > 0].items():
    print(f"  - {col}: {pct:.1f}% missing")

## Practice Exercise 7.1

**Task:** Perform a complete EDA on the sales data and create a summary report that includes:
1. Data overview and quality assessment
2. Top 5 products by total sales
3. Sales trends over time (monthly)
4. Regional performance comparison
5. At least 3 key insights or findings

In [None]:
# Your code here


In [None]:
# Solution 7.1

sales = pd.read_csv('assets/datasets/sales_data.csv', parse_dates=['date'])

print("SALES DATA - EDA SUMMARY REPORT")
print("="*60)

# 1. Data Overview
print("\n1. DATA OVERVIEW")
print("-"*40)
print(f"Total transactions: {len(sales):,}")
print(f"Date range: {sales['date'].min().date()} to {sales['date'].max().date()}")
print(f"Total revenue: ${sales['total_amount'].sum():,.2f}")
print(f"\nMissing values:")
missing = sales.isnull().sum()  # compute once, then filter to columns with gaps
print(missing[missing > 0])

# 2. Top 5 Products
print("\n2. TOP 5 PRODUCTS BY TOTAL SALES")
print("-"*40)
top_products = sales.groupby('product')['total_amount'].sum().nlargest(5)
for i, (product, amount) in enumerate(top_products.items(), 1):
    print(f"  {i}. {product}: ${amount:,.2f}")

# 3. Monthly Trend
print("\n3. MONTHLY SALES TREND")
print("-"*40)
monthly = sales.groupby(sales['date'].dt.to_period('M'))['total_amount'].sum()
print(f"Highest month: {monthly.idxmax()} (${monthly.max():,.2f})")
print(f"Lowest month: {monthly.idxmin()} (${monthly.min():,.2f})")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

monthly.plot(kind='line', marker='o', ax=axes[0])
axes[0].set_title('Monthly Sales Trend')
axes[0].set_ylabel('Sales ($)')

# 4. Regional Performance
print("\n4. REGIONAL PERFORMANCE")
print("-"*40)
region_stats = sales.groupby('region').agg(
    total_sales=('total_amount', 'sum'),
    avg_transaction=('total_amount', 'mean'),
    num_transactions=('transaction_id', 'count')
).sort_values('total_sales', ascending=False)
print(region_stats.round(2))

region_stats['total_sales'].plot(kind='bar', ax=axes[1], color='steelblue')
axes[1].set_title('Total Sales by Region')
axes[1].set_ylabel('Sales ($)')

plt.tight_layout()
plt.show()

# 5. Key Insights
print("\n5. KEY INSIGHTS")
print("="*60)
print("""  
  1. Electronics dominates sales, contributing the highest revenue
     among all categories.
  
  2. The Central and East regions show the strongest performance,
     suggesting potential focus areas for marketing efforts.
  
  3. Data quality is generally good with minimal missing values
     (mainly in customer ratings and sales rep assignments).
""")

---
# Module Summary

## Key Takeaways

1. **EDA is systematic**: Follow a structured approach from data quality to insights
2. **Start simple**: Begin with univariate analysis before exploring relationships
3. **Visualize everything**: Charts reveal patterns that statistics might miss
4. **Document findings**: Keep track of insights for stakeholders and next steps
5. **Iterate**: EDA is not linear - discoveries lead to new questions

## EDA Checklist

```
[ ] Data overview (shape, types, head/tail)
[ ] Data quality (missing values, duplicates)
[ ] Univariate analysis (distributions, statistics)
[ ] Bivariate analysis (correlations, comparisons)
[ ] Multivariate analysis (patterns, clusters)
[ ] Outlier detection
[ ] Time-based patterns (if applicable)
[ ] Summary of key findings
```

## Next Module

In the next module, we'll cover **Data Cleaning and Preprocessing** - the essential steps to prepare your data for analysis and modeling based on EDA findings.

## Additional Practice

For extra practice, try these challenges:

1. **Comprehensive EDA Report**: Create a complete EDA notebook for the sales data that could be shared with stakeholders, including executive summary and recommendations.

2. **Automated EDA Function**: Create a function that takes any DataFrame and generates a complete EDA report with visualizations.

3. **Comparative Analysis**: Compare two time periods in the sales data and identify significant changes in patterns.