# Day 5: Data Visualization - Telling Stories with Data

Welcome to the world of **Data Visualization**! Today we'll learn how to transform raw numbers into compelling visual stories using **Matplotlib** and **Seaborn**.

### Why Visualization?
- A well-designed chart lets readers spot a pattern at a glance that would take much longer to find in a table of numbers.
- It reveals patterns, trends, and outliers hidden in data.
- It's essential for **Exploratory Data Analysis (EDA)**.
- Great visualizations can make or break your data presentation.

### Topics Covered:
1. **Matplotlib Basics**
2. **Line Charts & Trends**
3. **Bar Charts & Comparisons**
4. **Scatter Plots & Correlations**
5. **Histograms & Distributions**
6. **Box Plots & Outliers**
7. **Seaborn Aesthetics**
8. **Mini Project: Housing Price Analysis**

In [None]:
# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set default style for better looking plots
# (the 'seaborn-v0_8-*' style names require Matplotlib 3.6 or newer)
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

# For inline display in Jupyter
%matplotlib inline

print("Libraries loaded successfully!")

## 1. Matplotlib Basics

Matplotlib is the foundation of Python visualization. Let's understand the basic structure.

In [None]:
# The simplest plot
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

plt.plot(x, y)
plt.title('My First Plot')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.show()

In [None]:
# Figure and Axes - The Professional Way
fig, ax = plt.subplots(figsize=(10, 6))

x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

ax.plot(x, y1, label='sin(x)', color='#3498db', linewidth=2)
ax.plot(x, y2, label='cos(x)', color='#e74c3c', linewidth=2, linestyle='--')

ax.set_title('Sine and Cosine Waves', fontsize=16, fontweight='bold')
ax.set_xlabel('X Values', fontsize=12)
ax.set_ylabel('Y Values', fontsize=12)
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
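
To make the Figure/Axes structure more concrete, here is a minimal sketch that inspects the objects returned by `plt.subplots()`. The names `owns_axes` and `owns_line` are illustrative only and do not appear elsewhere in this lesson.

```python
import matplotlib.pyplot as plt

# A Figure is the whole canvas; an Axes is one plotting area inside it.
fig, ax = plt.subplots()
(line,) = ax.plot([1, 2, 3], [2, 4, 6], label='demo')
ax.set_title('Hierarchy demo')

# The Figure keeps a list of its Axes, and each Axes keeps the artists drawn on it.
owns_axes = fig.axes[0] is ax
owns_line = ax.lines[0] is line

print("Figure owns the Axes:", owns_axes)
print("Axes owns the line:", owns_line)
print("Title stored on the Axes:", ax.get_title())
```

This is why the `ax.set_title(...)` / `ax.set_xlabel(...)` style scales well to multiple subplots: every setting is attached to one specific Axes rather than to whichever plot happens to be "current".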

In [None]:
# Multiple Subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

x = np.linspace(0, 10, 50)

# Top Left - Line Plot
axes[0, 0].plot(x, np.sin(x), 'b-', linewidth=2)
axes[0, 0].set_title('Line Plot')

# Top Right - Scatter Plot  
axes[0, 1].scatter(x, np.random.randn(50), c='#2ecc71', alpha=0.7)
axes[0, 1].set_title('Scatter Plot')

# Bottom Left - Bar Plot
categories = ['A', 'B', 'C', 'D', 'E']
values = [23, 45, 56, 78, 32]
axes[1, 0].bar(categories, values, color='#9b59b6')
axes[1, 0].set_title('Bar Plot')

# Bottom Right - Histogram
data = np.random.randn(1000)
axes[1, 1].hist(data, bins=30, color='#e67e22', edgecolor='white')
axes[1, 1].set_title('Histogram')

plt.suptitle('Four Basic Plot Types', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

## 2. Line Charts - Visualizing Trends Over Time

Perfect for time series and continuous data.

In [None]:
# Create sample stock price data
np.random.seed(42)
days = pd.date_range(start='2024-01-01', periods=100, freq='D')

# Simulate stock prices with random walk
stock_a = 100 + np.cumsum(np.random.randn(100) * 2)
stock_b = 100 + np.cumsum(np.random.randn(100) * 2.5)
stock_c = 100 + np.cumsum(np.random.randn(100) * 1.5)

# Create the plot
fig, ax = plt.subplots(figsize=(12, 6))

ax.plot(days, stock_a, label='Tech Corp', color='#3498db', linewidth=2)
ax.plot(days, stock_b, label='Finance Inc', color='#e74c3c', linewidth=2)
ax.plot(days, stock_c, label='Health Ltd', color='#2ecc71', linewidth=2)

ax.fill_between(days, stock_a, alpha=0.1, color='#3498db')

ax.set_title('Stock Price Comparison (2024)', fontsize=16, fontweight='bold')
ax.set_xlabel('Date', fontsize=12)
ax.set_ylabel('Price ($)', fontsize=12)
ax.legend(loc='upper left')

# Add annotation for max point
max_idx = np.argmax(stock_a)
ax.annotate(f'Peak: ${stock_a[max_idx]:.2f}', 
            xy=(days[max_idx], stock_a[max_idx]),
            xytext=(days[max_idx] + pd.Timedelta(days=10), stock_a[max_idx] + 5),
            arrowprops=dict(arrowstyle='->', color='gray'),
            fontsize=10)

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
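
Daily prices are noisy, so a trend is often easier to see after smoothing. The sketch below is one possible approach, assuming a 7-day rolling mean is a sensible window for the data; the series `prices` and the window size are made up for illustration.

```python
import numpy as np
import pandas as pd

# Simulated random-walk prices (illustrative data, not the lesson's stock series)
rng = np.random.default_rng(0)
dates = pd.date_range(start='2024-01-01', periods=100, freq='D')
prices = pd.Series(100 + np.cumsum(rng.normal(0, 2, 100)), index=dates)

# 7-day rolling mean: each value averages the current day and the 6 days before it.
# The first 6 entries are NaN because a full window is not yet available.
rolling_7d = prices.rolling(window=7).mean()

# Day-to-day changes are smaller in the smoothed series
raw_volatility = prices.diff().std()
smooth_volatility = rolling_7d.diff().std()
print(f"Std of daily change, raw: {raw_volatility:.2f}")
print(f"Std of daily change, smoothed: {smooth_volatility:.2f}")
```

Plotting `rolling_7d` on top of `prices` with `ax.plot(...)` would show the smoothed trend against the raw line.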

## 3. Bar Charts - Comparing Categories

Best for comparing discrete categories.

In [None]:
# Simple Bar Chart
departments = ['Engineering', 'Marketing', 'Sales', 'HR', 'Finance']
employees = [120, 45, 80, 25, 35]
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12', '#9b59b6']

fig, ax = plt.subplots(figsize=(10, 6))

bars = ax.bar(departments, employees, color=colors, edgecolor='white', linewidth=2)

# Add value labels on bars
for bar, emp in zip(bars, employees):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 2, 
            str(emp), ha='center', va='bottom', fontsize=12, fontweight='bold')

ax.set_title('Employees by Department', fontsize=16, fontweight='bold')
ax.set_xlabel('Department', fontsize=12)
ax.set_ylabel('Number of Employees', fontsize=12)
ax.set_ylim(0, max(employees) * 1.15)

plt.tight_layout()
plt.show()

In [None]:
# Grouped Bar Chart
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
product_a = [150, 180, 200, 220]
product_b = [120, 160, 190, 210]
product_c = [90, 110, 140, 170]

x = np.arange(len(quarters))
width = 0.25

fig, ax = plt.subplots(figsize=(10, 6))

bars1 = ax.bar(x - width, product_a, width, label='Product A', color='#3498db')
bars2 = ax.bar(x, product_b, width, label='Product B', color='#e74c3c')
bars3 = ax.bar(x + width, product_c, width, label='Product C', color='#2ecc71')

ax.set_title('Quarterly Sales by Product', fontsize=16, fontweight='bold')
ax.set_xlabel('Quarter', fontsize=12)
ax.set_ylabel('Sales (in thousands $)', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(quarters)
ax.legend()

plt.tight_layout()
plt.show()
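
The `x - width`, `x`, `x + width` positions in the grouped bar chart are a special case of a general pattern. As a sketch, assuming you want `n_groups` bars centered on each tick, the offsets can be computed like this (the helper name `offsets` is illustrative):

```python
import numpy as np

n_categories = 4      # e.g. Q1..Q4
n_groups = 3          # e.g. three products per quarter
width = 0.25          # width of a single bar

x = np.arange(n_categories)

# Center the group on each tick: offsets are symmetric around zero
offsets = (np.arange(n_groups) - (n_groups - 1) / 2) * width
positions = [x + off for off in offsets]

print("Offsets:", offsets)
for i, pos in enumerate(positions):
    print(f"Group {i} bar positions:", pos)
```

Keeping `n_groups * width` below 1 leaves a gap between neighboring categories, which makes the groups easier to tell apart.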

In [None]:
# Horizontal Bar Chart
programming_languages = ['Python', 'JavaScript', 'Java', 'C++', 'Go', 'Rust', 'TypeScript']
popularity = [92, 88, 75, 60, 55, 48, 82]

# Sort by popularity
sorted_indices = np.argsort(popularity)
programming_languages = [programming_languages[i] for i in sorted_indices]
popularity = [popularity[i] for i in sorted_indices]

fig, ax = plt.subplots(figsize=(10, 6))

# Create gradient colors
colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(programming_languages)))

bars = ax.barh(programming_languages, popularity, color=colors, edgecolor='white')

# Add value labels
for bar, pop in zip(bars, popularity):
    ax.text(bar.get_width() + 1, bar.get_y() + bar.get_height()/2,
            f'{pop}%', va='center', fontsize=11)

ax.set_title('Programming Language Popularity 2024', fontsize=16, fontweight='bold')
ax.set_xlabel('Popularity Score (%)', fontsize=12)
ax.set_xlim(0, 105)

plt.tight_layout()
plt.show()

## 4. Scatter Plots - Finding Correlations

Essential for understanding relationships between variables.

In [None]:
# Generate correlated data
np.random.seed(42)
n = 100

# Strong positive correlation
study_hours = np.random.uniform(1, 10, n)
exam_scores = 40 + 5 * study_hours + np.random.normal(0, 5, n)

fig, ax = plt.subplots(figsize=(10, 6))

scatter = ax.scatter(study_hours, exam_scores, 
                     c=exam_scores, cmap='RdYlGn',
                     s=100, alpha=0.7, edgecolors='white')

# Add trend line
z = np.polyfit(study_hours, exam_scores, 1)
p = np.poly1d(z)
ax.plot(np.sort(study_hours), p(np.sort(study_hours)), 
        '--', color='#e74c3c', linewidth=2, label=f'Trend Line (y = {z[0]:.2f}x + {z[1]:.2f})')

# Calculate correlation
correlation = np.corrcoef(study_hours, exam_scores)[0, 1]

ax.set_title(f'Study Hours vs Exam Scores (r = {correlation:.3f})', fontsize=16, fontweight='bold')
ax.set_xlabel('Study Hours per Day', fontsize=12)
ax.set_ylabel('Exam Score', fontsize=12)
ax.legend()

plt.colorbar(scatter, label='Exam Score')
plt.tight_layout()
plt.show()
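
The `r` value in the title comes from `np.corrcoef`. As a quick check of what that number means, the sketch below computes the Pearson correlation by hand and compares it with NumPy's result. The data is regenerated here with a separate random generator, so the exact value will differ from the plot above.

```python
import numpy as np

# Illustrative data with the same shape as the study-hours example
rng = np.random.default_rng(42)
x = rng.uniform(1, 10, 100)
y = 40 + 5 * x + rng.normal(0, 5, 100)

# Pearson r = sum((x - mean_x) * (y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

r_numpy = np.corrcoef(x, y)[0, 1]
print(f"Manual r: {r_manual:.4f}")
print(f"NumPy r:  {r_numpy:.4f}")
```

A value near +1 or -1 indicates a strong linear relationship; it says nothing about causation or about non-linear patterns.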

In [None]:
# Bubble Chart (Scatter with size dimension)
np.random.seed(42)

countries = ['USA', 'China', 'Japan', 'Germany', 'UK', 'India', 'France', 'Brazil', 'Canada', 'Australia']
gdp = [25.5, 18.3, 4.2, 4.1, 3.1, 3.4, 2.8, 1.6, 2.1, 1.7]  # Trillion USD
population = [331, 1412, 125, 83, 67, 1408, 67, 214, 38, 26]  # Millions
gdp_per_capita = [77000, 13000, 34000, 50000, 46000, 2400, 42000, 7500, 55000, 65000]

fig, ax = plt.subplots(figsize=(12, 8))

# Size based on population
sizes = [p * 0.5 for p in population]
colors = plt.cm.plasma(np.linspace(0.1, 0.9, len(countries)))

scatter = ax.scatter(gdp, gdp_per_capita, s=sizes, c=colors, alpha=0.7, edgecolors='white', linewidth=2)

# Add country labels
for i, country in enumerate(countries):
    ax.annotate(country, (gdp[i], gdp_per_capita[i]), 
                xytext=(5, 5), textcoords='offset points', fontsize=10)

ax.set_title('GDP vs GDP per Capita (Bubble size = Population)', fontsize=16, fontweight='bold')
ax.set_xlabel('GDP (Trillion USD)', fontsize=12)
ax.set_ylabel('GDP per Capita (USD)', fontsize=12)

# Add legend for bubble sizes
for pop in [100, 500, 1000]:
    ax.scatter([], [], s=pop * 0.5, c='gray', alpha=0.5, 
               label=f'{pop}M people', edgecolors='white')
ax.legend(title='Population', loc='upper right')

plt.tight_layout()
plt.show()

## 5. Histograms - Understanding Distributions

Reveal how data is distributed across different values.

In [None]:
# Single Histogram
np.random.seed(42)
ages = np.random.normal(35, 10, 1000)  # Mean=35, Std=10

fig, ax = plt.subplots(figsize=(10, 6))

n, bins, patches = ax.hist(ages, bins=30, color='#3498db', edgecolor='white', alpha=0.7)

# Color bins by height
for i, patch in enumerate(patches):
    patch.set_facecolor(plt.cm.viridis(n[i] / max(n)))

# Add mean and median lines
ax.axvline(ages.mean(), color='#e74c3c', linestyle='--', linewidth=2, label=f'Mean: {ages.mean():.1f}')
ax.axvline(np.median(ages), color='#2ecc71', linestyle='--', linewidth=2, label=f'Median: {np.median(ages):.1f}')

ax.set_title('Age Distribution of Survey Respondents', fontsize=16, fontweight='bold')
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.legend()

plt.tight_layout()
plt.show()

In [None]:
# Multiple Overlapping Histograms
np.random.seed(42)

male_heights = np.random.normal(175, 7, 500)    # cm
female_heights = np.random.normal(162, 6, 500)  # cm

fig, ax = plt.subplots(figsize=(10, 6))

ax.hist(male_heights, bins=30, alpha=0.6, color='#3498db', label=f'Male (μ={male_heights.mean():.1f})', edgecolor='white')
ax.hist(female_heights, bins=30, alpha=0.6, color='#e74c3c', label=f'Female (μ={female_heights.mean():.1f})', edgecolor='white')

ax.set_title('Height Distribution by Gender', fontsize=16, fontweight='bold')
ax.set_xlabel('Height (cm)', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.legend()

plt.tight_layout()
plt.show()
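
The histograms above all use `bins=30`, but the number of bins changes how a distribution looks. The sketch below uses `np.histogram` to compare a few bin counts without plotting; the specific values (5, 30, 200) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
ages = rng.normal(35, 10, 1000)

summary = {}
for bins in (5, 30, 200):
    counts, edges = np.histogram(ages, bins=bins)
    summary[bins] = (counts, edges)
    # Too few bins hide the shape; too many leave mostly small, noisy counts
    print(f"bins={bins:>3}: bin width={edges[1] - edges[0]:.2f}, "
          f"tallest bin={counts.max()}, empty bins={(counts == 0).sum()}")

# NumPy can also suggest bin edges, here with the Freedman-Diaconis rule
fd_edges = np.histogram_bin_edges(ages, bins='fd')
print("Freedman-Diaconis suggests", len(fd_edges) - 1, "bins")
```

Whatever the bin count, the counts always add up to the number of observations; only the way they are grouped changes.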

## 6. Box Plots - Detecting Outliers

Perfect for showing distribution summary and identifying outliers.

In [None]:
# Single Box Plot with explanation
np.random.seed(42)
salaries = np.concatenate([
    np.random.normal(60000, 15000, 200),  # Most employees
    np.random.normal(150000, 20000, 20),  # Senior positions
    [300000, 350000, 400000]               # Executives (outliers)
])

fig, ax = plt.subplots(figsize=(10, 6))

bp = ax.boxplot(salaries, vert=True, patch_artist=True)

# Style the box
bp['boxes'][0].set_facecolor('#3498db')
bp['boxes'][0].set_alpha(0.7)
bp['medians'][0].set_color('#e74c3c')
bp['medians'][0].set_linewidth(2)

# Add annotations
ax.annotate('Median', xy=(1, np.median(salaries)), xytext=(1.2, np.median(salaries)),
            fontsize=10, arrowprops=dict(arrowstyle='->', color='gray'))
ax.annotate('Q3 (75th percentile)', xy=(1, np.percentile(salaries, 75)), xytext=(1.2, np.percentile(salaries, 75)),
            fontsize=10, arrowprops=dict(arrowstyle='->', color='gray'))
ax.annotate('Outliers', xy=(1, 350000), xytext=(1.2, 350000),
            fontsize=10, arrowprops=dict(arrowstyle='->', color='gray'))

ax.set_title('Company Salary Distribution', fontsize=16, fontweight='bold')
ax.set_ylabel('Salary ($)', fontsize=12)
ax.set_xticklabels(['All Employees'])

plt.tight_layout()
plt.show()
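
With Matplotlib's default setting (`whis=1.5`), a box plot draws a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the box. The sketch below applies that rule by hand; the salary data is regenerated here for illustration and is not identical to the array plotted above.

```python
import numpy as np

rng = np.random.default_rng(42)
salaries = np.concatenate([
    rng.normal(60000, 15000, 200),     # typical employees (illustrative)
    [300000, 350000, 400000],          # extreme values we expect to be flagged
])

q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1

# Fences: anything outside these limits is drawn as an individual point
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = salaries[(salaries < lower_fence) | (salaries > upper_fence)]

print(f"Q1={q1:,.0f}  Q3={q3:,.0f}  IQR={iqr:,.0f}")
print(f"Fences: {lower_fence:,.0f} to {upper_fence:,.0f}")
print(f"Number of outliers: {len(outliers)}")
```

The whiskers themselves stop at the most extreme data points that are still inside the fences, which is why they are usually shorter than the fence values.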

In [None]:
# Multiple Box Plots - Compare Departments
np.random.seed(42)

departments = {
    'Engineering': np.random.normal(95000, 20000, 100),
    'Marketing': np.random.normal(70000, 15000, 100),
    'Sales': np.concatenate([np.random.normal(65000, 10000, 90), [150000, 180000]]),  # With outliers
    'HR': np.random.normal(55000, 10000, 100),
    'Finance': np.random.normal(85000, 18000, 100)
}

fig, ax = plt.subplots(figsize=(12, 6))

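# Pass plain lists; on Matplotlib 3.9+ the `labels` keyword is deprecated in favor of `tick_labels`
bp = ax.boxplot(list(departments.values()), labels=list(departments.keys()), patch_artist=True)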

colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12', '#9b59b6']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

for median in bp['medians']:
    median.set_color('black')
    median.set_linewidth(2)

ax.set_title('Salary Distribution by Department', fontsize=16, fontweight='bold')
ax.set_xlabel('Department', fontsize=12)
ax.set_ylabel('Salary ($)', fontsize=12)

# Add grid
ax.yaxis.grid(True, linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

## 7. Seaborn - Beautiful Statistical Visualizations

Seaborn builds on Matplotlib, adding a higher-level interface and polished default aesthetics.

In [None]:
# Create sample dataset
np.random.seed(42)
n = 200

df = pd.DataFrame({
    'Experience': np.random.uniform(0, 20, n),
    'Salary': np.random.normal(60000, 20000, n),
    'Department': np.random.choice(['Tech', 'Sales', 'Marketing', 'Finance'], n),
    'Education': np.random.choice(['Bachelor', 'Master', 'PhD'], n, p=[0.5, 0.35, 0.15]),
    'Satisfaction': np.random.uniform(1, 10, n)
})

# Adjust salary based on experience (create correlation)
df['Salary'] = df['Salary'] + df['Experience'] * 3000

print("Sample Dataset:")
print(df.head())
print(f"\nShape: {df.shape}")

In [None]:
# Seaborn Themes
# Axes pick up their style when they are created, so each subplot is
# created inside its own axes_style() context instead of being restyled later.
fig = plt.figure(figsize=(14, 10))

themes = ['darkgrid', 'whitegrid', 'dark', 'white']

for i, theme in enumerate(themes, start=1):
    with sns.axes_style(theme):
        ax = fig.add_subplot(2, 2, i)
    sns.scatterplot(data=df, x='Experience', y='Salary', hue='Department', ax=ax, alpha=0.7)
    ax.set_title(f"Theme: {theme}", fontsize=14)
    ax.legend(loc='upper left', fontsize=8)

plt.suptitle('Seaborn Theme Comparison', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

# Reset to default
sns.set_style('whitegrid')

In [None]:
# Seaborn Distribution Plots
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# KDE Plot
sns.kdeplot(data=df, x='Salary', hue='Education', fill=True, ax=axes[0], alpha=0.5)
axes[0].set_title('Salary Distribution by Education (KDE)', fontsize=12)

# Histogram with KDE
sns.histplot(data=df, x='Experience', kde=True, ax=axes[1], color='#3498db')
axes[1].set_title('Experience Distribution with KDE', fontsize=12)

# Violin Plot
sns.violinplot(data=df, x='Department', y='Salary', ax=axes[2], palette='Set2')
axes[2].set_title('Salary by Department (Violin)', fontsize=12)
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Regression Plot with Confidence Interval
fig, ax = plt.subplots(figsize=(10, 6))

sns.regplot(data=df, x='Experience', y='Salary', 
            scatter_kws={'alpha': 0.5, 'color': '#3498db'},
            line_kws={'color': '#e74c3c', 'linewidth': 2},
            ax=ax)

ax.set_title('Experience vs Salary with Regression Line', fontsize=16, fontweight='bold')
ax.set_xlabel('Years of Experience', fontsize=12)
ax.set_ylabel('Annual Salary ($)', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# Pair Plot - Multiple Variable Relationships
# Select numerical columns for pairplot
df_numeric = df[['Experience', 'Salary', 'Satisfaction']].copy()
df_numeric['Department'] = df['Department']

g = sns.pairplot(df_numeric, hue='Department', palette='husl', 
                 plot_kws={'alpha': 0.6}, diag_kind='kde')
g.figure.suptitle('Pair Plot: Variable Relationships by Department', y=1.02, fontsize=14)

plt.show()

In [None]:
# Heatmap - Correlation Matrix
# Create correlation matrix
corr_matrix = df[['Experience', 'Salary', 'Satisfaction']].corr()

fig, ax = plt.subplots(figsize=(8, 6))

sns.heatmap(corr_matrix, annot=True, cmap='RdYlBu_r', center=0, 
            square=True, linewidths=2, fmt='.3f',
            annot_kws={'size': 14, 'weight': 'bold'})

ax.set_title('Correlation Heatmap', fontsize=16, fontweight='bold')

plt.tight_layout()
plt.show()

---

## Mini Project: Housing Price Analysis

**Goal:** Explore a housing dataset to understand what factors affect house prices.

Let's put all our visualization skills together!

In [None]:
# Create a realistic housing dataset
np.random.seed(42)
n = 500

# Generate features
sqft = np.random.normal(2000, 500, n).clip(800, 5000)
bedrooms = np.random.choice([1, 2, 3, 4, 5], n, p=[0.05, 0.15, 0.40, 0.30, 0.10])
bathrooms = np.random.choice([1, 1.5, 2, 2.5, 3, 3.5], n, p=[0.1, 0.15, 0.35, 0.20, 0.15, 0.05])
age = np.random.uniform(0, 50, n)
neighborhood = np.random.choice(['Downtown', 'Suburbs', 'Countryside', 'Beachfront'], n, p=[0.25, 0.40, 0.20, 0.15])
has_garage = np.random.choice([0, 1], n, p=[0.3, 0.7])
has_pool = np.random.choice([0, 1], n, p=[0.8, 0.2])

# Calculate price based on features with some noise
base_price = 50000
price = (base_price + 
         sqft * 150 +                                           # $150 per sqft
         bedrooms * 15000 +                                      # $15k per bedroom
         bathrooms * 10000 +                                     # $10k per bathroom
         (50 - age) * 500 +                                      # Newer = higher price
         has_garage * 25000 +                                    # $25k for garage
         has_pool * 30000 +                                      # $30k for pool
         np.where(neighborhood == 'Beachfront', 150000, 0) +     # Beachfront premium
         np.where(neighborhood == 'Downtown', 75000, 0) +        # Downtown premium
         np.where(neighborhood == 'Suburbs', 25000, 0) +         # Suburbs slight premium
         np.random.normal(0, 30000, n))                          # Random noise

housing = pd.DataFrame({
    'Price': price.clip(100000, 1500000),
    'SqFt': sqft.astype(int),
    'Bedrooms': bedrooms,
    'Bathrooms': bathrooms,
    'Age': age.astype(int),
    'Neighborhood': neighborhood,
    'Has_Garage': has_garage.astype(bool),
    'Has_Pool': has_pool.astype(bool)
})

print(" Housing Dataset Ready!")
print(f"   Records: {len(housing)}")
print(f"   Price Range: ${housing['Price'].min():,.0f} - ${housing['Price'].max():,.0f}")
print(f"   Average Price: ${housing['Price'].mean():,.0f}")
print("\n")
print(housing.head(10))

In [None]:
# === VISUALIZATION 1: Price Distribution ===
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
sns.histplot(housing['Price'] / 1000, bins=30, kde=True, ax=axes[0], color='#3498db')
axes[0].axvline(housing['Price'].mean() / 1000, color='#e74c3c', linestyle='--', linewidth=2, label=f"Mean: ${housing['Price'].mean()/1000:.0f}K")
axes[0].axvline(housing['Price'].median() / 1000, color='#2ecc71', linestyle='--', linewidth=2, label=f"Median: ${housing['Price'].median()/1000:.0f}K")
axes[0].set_title('House Price Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Price (in $1000s)', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].legend()

# Box Plot by Neighborhood
neighborhood_order = housing.groupby('Neighborhood')['Price'].median().sort_values(ascending=False).index
sns.boxplot(data=housing, x='Neighborhood', y='Price', order=neighborhood_order, ax=axes[1], palette='viridis')
axes[1].set_title('Price Distribution by Neighborhood', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Neighborhood', fontsize=12)
axes[1].set_ylabel('Price ($)', fontsize=12)
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# === VISUALIZATION 2: SqFt vs Price (The Key Relationship) ===
fig, ax = plt.subplots(figsize=(12, 7))

# Scatter plot with neighborhood color coding
sns.scatterplot(data=housing, x='SqFt', y='Price', hue='Neighborhood', 
                size='Bedrooms', sizes=(50, 300), alpha=0.7, ax=ax, palette='Set1')

# Add regression line
z = np.polyfit(housing['SqFt'], housing['Price'], 1)
p = np.poly1d(z)
x_line = np.linspace(housing['SqFt'].min(), housing['SqFt'].max(), 100)
ax.plot(x_line, p(x_line), '--', color='black', linewidth=2, label=f'Trend: ${z[0]:.0f}/sqft')

# Calculate correlation
corr = housing['SqFt'].corr(housing['Price'])

ax.set_title(f'Square Feet vs Price (Correlation: {corr:.3f})', fontsize=16, fontweight='bold')
ax.set_xlabel('Square Feet', fontsize=12)
ax.set_ylabel('Price ($)', fontsize=12)
ax.legend(title='Neighborhood', bbox_to_anchor=(1.02, 1), loc='upper left')

plt.tight_layout()
plt.show()

In [None]:
# === VISUALIZATION 3: Feature Impact Analysis ===
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Bedrooms vs Price
sns.boxplot(data=housing, x='Bedrooms', y='Price', ax=axes[0, 0], palette='Blues_d')
axes[0, 0].set_title('Price by Number of Bedrooms', fontsize=12, fontweight='bold')

# Bathrooms vs Price
sns.violinplot(data=housing, x='Bathrooms', y='Price', ax=axes[0, 1], palette='Greens_d')
axes[0, 1].set_title('Price by Number of Bathrooms', fontsize=12, fontweight='bold')

# Garage Impact
garage_prices = housing.groupby('Has_Garage')['Price'].mean()
bars = axes[1, 0].bar(['No Garage', 'Has Garage'], garage_prices.values, color=['#e74c3c', '#2ecc71'])
for bar, val in zip(bars, garage_prices.values):
    axes[1, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5000, 
                    f'${val/1000:.0f}K', ha='center', fontsize=12, fontweight='bold')
axes[1, 0].set_title('Average Price: Garage vs No Garage', fontsize=12, fontweight='bold')
axes[1, 0].set_ylabel('Average Price ($)')

# Pool Impact
pool_prices = housing.groupby('Has_Pool')['Price'].mean()
bars = axes[1, 1].bar(['No Pool', 'Has Pool'], pool_prices.values, color=['#e74c3c', '#3498db'])
for bar, val in zip(bars, pool_prices.values):
    axes[1, 1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5000, 
                    f'${val/1000:.0f}K', ha='center', fontsize=12, fontweight='bold')
axes[1, 1].set_title('Average Price: Pool vs No Pool', fontsize=12, fontweight='bold')
axes[1, 1].set_ylabel('Average Price ($)')

plt.suptitle('Feature Impact on House Prices', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# === VISUALIZATION 4: Age vs Price ===
fig, ax = plt.subplots(figsize=(12, 6))

sns.scatterplot(data=housing, x='Age', y='Price', hue='Neighborhood', 
                alpha=0.6, ax=ax, palette='Set2', s=80)

# Add regression line
sns.regplot(data=housing, x='Age', y='Price', scatter=False, ax=ax, 
            color='black', line_kws={'linewidth': 2, 'linestyle': '--'})

corr = housing['Age'].corr(housing['Price'])
ax.set_title(f'House Age vs Price (Correlation: {corr:.3f})', fontsize=16, fontweight='bold')
ax.set_xlabel('House Age (Years)', fontsize=12)
ax.set_ylabel('Price ($)', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# === VISUALIZATION 5: Correlation Heatmap ===
fig, ax = plt.subplots(figsize=(10, 8))

# Create numerical version of housing data
housing_numeric = housing.copy()
housing_numeric['Has_Garage'] = housing_numeric['Has_Garage'].astype(int)
housing_numeric['Has_Pool'] = housing_numeric['Has_Pool'].astype(int)

# Correlation matrix
corr_matrix = housing_numeric[['Price', 'SqFt', 'Bedrooms', 'Bathrooms', 'Age', 'Has_Garage', 'Has_Pool']].corr()

# Create heatmap
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))  # Upper triangle mask
sns.heatmap(corr_matrix, annot=True, cmap='RdYlBu_r', center=0, 
            square=True, linewidths=2, fmt='.2f',
            annot_kws={'size': 12, 'weight': 'bold'}, mask=mask)

ax.set_title('Feature Correlation Matrix', fontsize=16, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# === VISUALIZATION 6: Neighborhood Deep Dive ===
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Average Price by Neighborhood
neighborhood_stats = housing.groupby('Neighborhood').agg({
    'Price': ['mean', 'count']
}).round(0)
neighborhood_stats.columns = ['Avg_Price', 'Count']
neighborhood_stats = neighborhood_stats.sort_values('Avg_Price', ascending=True)

bars = axes[0].barh(neighborhood_stats.index, neighborhood_stats['Avg_Price'] / 1000, color=plt.cm.viridis(np.linspace(0.2, 0.8, 4)))
for bar, (idx, row) in zip(bars, neighborhood_stats.iterrows()):
    axes[0].text(bar.get_width() + 5, bar.get_y() + bar.get_height()/2, 
                 f"${row['Avg_Price']/1000:.0f}K", va='center', fontsize=11, fontweight='bold')
axes[0].set_title('Average Price by Neighborhood', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Average Price ($1000s)')

# Number of Listings
colors = plt.cm.Set2(np.linspace(0, 1, 4))
axes[1].pie(neighborhood_stats['Count'], labels=neighborhood_stats.index, autopct='%1.1f%%', 
            colors=colors, explode=[0.05]*4)
axes[1].set_title('Listings by Neighborhood', fontsize=12, fontweight='bold')

# Average SqFt by Neighborhood
sqft_by_neighborhood = housing.groupby('Neighborhood')['SqFt'].mean().sort_values()
axes[2].barh(sqft_by_neighborhood.index, sqft_by_neighborhood.values, color=plt.cm.plasma(np.linspace(0.2, 0.8, 4)))
axes[2].set_title('Average SqFt by Neighborhood', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Average Square Feet')

plt.suptitle('Neighborhood Analysis', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# === FINAL SUMMARY DASHBOARD ===
print("=" * 70)
print(" HOUSING MARKET ANALYSIS - KEY FINDINGS")
print("=" * 70)

# Key Statistics
print(f"""
 MARKET OVERVIEW:
   Total Properties Analyzed: {len(housing)}
   Average Price: ${housing['Price'].mean():,.0f}
   Median Price: ${housing['Price'].median():,.0f}
   Price Range: ${housing['Price'].min():,.0f} - ${housing['Price'].max():,.0f}

 TOP CORRELATIONS WITH PRICE:
""")

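# Show correlations (numeric columns only: 'Neighborhood' is text and
# makes DataFrame.corr() raise an error on pandas 2.0+)
correlations = (housing_numeric.select_dtypes(include='number')
                .corr()['Price'].drop('Price').sort_values(ascending=False))
for feature, corr in correlations.items():
    direction = '↑' if corr > 0 else '↓'
    print(f"   {direction} {feature}: {corr:.3f}")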

print(f"""
 NEIGHBORHOOD RANKINGS (by avg price):
""")
neighborhood_ranking = housing.groupby('Neighborhood')['Price'].mean().sort_values(ascending=False)
for i, (neighborhood, price) in enumerate(neighborhood_ranking.items(), 1):
    print(f"   #{i} {neighborhood}: ${price:,.0f}")

print(f"""
 KEY INSIGHTS:
   - Square footage is the strongest predictor of price
   - Beachfront properties command {((neighborhood_ranking['Beachfront'] / neighborhood_ranking['Countryside']) - 1) * 100:.0f}% premium over Countryside
   - Houses with garages sell for ${housing[housing['Has_Garage']]['Price'].mean() - housing[~housing['Has_Garage']]['Price'].mean():,.0f} more on average
   - Each bedroom adds approximately ${(housing.groupby('Bedrooms')['Price'].mean().diff().mean()):,.0f} to home value
""")

print("=" * 70)
print(" Day 5 Complete! You've mastered Data Visualization!")
print("=" * 70)

---

## Practice Exercises

Try these on your own:

1. **Custom Color Palette**: Create a visualization using your own custom color scheme
2. **Stacked Bar Chart**: Show the composition of listings by neighborhood AND bedroom count
3. **Time Series**: Add a 'Date Listed' column and visualize price trends over time
4. **Export**: Use `plt.savefig('chart.png', dpi=300)` to save a high-resolution chart
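
As a possible starting point for the stacked bar chart exercise, the sketch below counts listings per neighborhood and bedroom count with `pd.crosstab` and plots the result stacked. It uses a small made-up DataFrame called `demo` rather than the `housing` data, so treat it as a pattern to adapt, not a finished solution.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up listings (stand-in for the housing DataFrame)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Neighborhood': rng.choice(['Downtown', 'Suburbs', 'Countryside', 'Beachfront'], 200),
    'Bedrooms': rng.choice([1, 2, 3, 4, 5], 200),
})

# Rows = neighborhoods, columns = bedroom counts, values = number of listings
counts = pd.crosstab(demo['Neighborhood'], demo['Bedrooms'])

# stacked=True places each bedroom count on top of the previous one
ax = counts.plot(kind='bar', stacked=True, figsize=(10, 6), colormap='viridis')
ax.set_title('Listings by Neighborhood and Bedroom Count', fontsize=16, fontweight='bold')
ax.set_xlabel('Neighborhood', fontsize=12)
ax.set_ylabel('Number of Listings', fontsize=12)
ax.legend(title='Bedrooms')
ax.tick_params(axis='x', rotation=45)

plt.tight_layout()
# In a script, call plt.show() here; in a notebook the figure renders automatically.
```

To try it on the project data, replace `demo` with `housing`; the column names are the same.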

---

## Key Takeaways

| Plot Type | Best For | Library |
|-----------|----------|----------|
| **Line Chart** | Trends over time | Matplotlib |
| **Bar Chart** | Comparing categories | Both |
| **Scatter Plot** | Correlations between variables | Both |
| **Histogram** | Distribution of single variable | Both |
| **Box Plot** | Distribution summary + outliers | Both |
| **Heatmap** | Correlation matrix | Seaborn |
| **Violin Plot** | Distribution by category | Seaborn |
| **Pair Plot** | Multiple relationships at once | Seaborn |

### Pro Tips:
- Always add **titles** and **labels**
- Use **color purposefully** (not just decoration)
- **Less is more** - don't clutter your visualizations
- Choose the **right chart type** for your data story

---

**Next Up:** Day 6 - Introduction to Machine Learning with Linear Regression!