# Module 13: Capstone Projects

## Projects Covered
1. Project 1: Sales Data Analysis
2. Project 2: Customer Segmentation
3. Project 3: Predictive Modeling

## Learning Objectives

By the end of this module, you will be able to:
- Apply the complete data analysis workflow to real-world problems
- Perform end-to-end exploratory data analysis
- Implement customer segmentation using clustering techniques
- Build and evaluate predictive models for business problems
- Communicate insights effectively through visualizations
- Create a portfolio-ready data science project

---

## About the Capstone Projects

These three projects are designed to integrate all the skills you've learned throughout this course:

- **Python fundamentals** (Modules 1-5)
- **Data manipulation with NumPy and Pandas** (Modules 6-7)
- **Data visualization** (Module 8)
- **Exploratory Data Analysis** (Module 9)
- **Data cleaning and preprocessing** (Module 10)
- **Statistics** (Module 11)
- **Machine Learning** (Module 12)

Each project follows a structured approach and can be added to your data science portfolio.

---

In [None]:
# Import all libraries we'll use across the capstone projects
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.cluster import KMeans
from sklearn.metrics import (mean_squared_error, mean_absolute_error, r2_score,
                             accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report,
                             silhouette_score)

# Configuration
np.random.seed(42)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')  # Matplotlib 3.6+; older versions name this style 'seaborn-whitegrid'

import warnings
warnings.filterwarnings('ignore')  # Hides all warnings, including deprecations; remove while debugging

print("All libraries imported successfully!")

---
# Project 1: Sales Data Analysis
---

## Project Overview

**Scenario:** You are a data analyst at a retail company. Management wants to understand sales performance across different dimensions and identify opportunities for growth.

**Objectives:**
1. Analyze overall sales trends over time
2. Identify top-performing products and categories
3. Analyze regional sales performance
4. Discover seasonal patterns
5. Provide actionable recommendations

**Skills Applied:**
- Data loading and cleaning (Pandas)
- Exploratory Data Analysis
- Data visualization (Matplotlib, Seaborn)
- Statistical analysis
- Business insights generation

In [None]:
# Generate synthetic sales data
np.random.seed(42)

# Parameters
n_transactions = 5000
start_date = '2022-01-01'
end_date = '2023-12-31'

# Generate dates
date_range = pd.date_range(start=start_date, end=end_date, freq='D')
dates = np.random.choice(date_range, n_transactions)

# Product categories and products
categories = {
    'Electronics': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Smart Watch'],
    'Clothing': ['T-Shirt', 'Jeans', 'Jacket', 'Dress', 'Sneakers'],
    'Home & Garden': ['Furniture', 'Kitchenware', 'Bedding', 'Decor', 'Tools'],
    'Books': ['Fiction', 'Non-Fiction', 'Educational', 'Comics', 'Magazines']
}

# Base prices per category
category_base_prices = {
    'Electronics': (100, 1500),
    'Clothing': (20, 200),
    'Home & Garden': (30, 500),
    'Books': (10, 50)
}

# Regions
regions = ['North', 'South', 'East', 'West', 'Central']
region_weights = [0.25, 0.20, 0.20, 0.20, 0.15]

# Generate transactions
sales_records = []

for i in range(n_transactions):
    category = np.random.choice(list(categories.keys()))
    product = np.random.choice(categories[category])
    region = np.random.choice(regions, p=region_weights)
    
    # Price with seasonal variation
    base_low, base_high = category_base_prices[category]
    base_price = np.random.uniform(base_low, base_high)
    
    # Add seasonal factor
    month = pd.Timestamp(dates[i]).month
    if month in [11, 12]:  # Holiday season
        seasonal_factor = 1.3
    elif month in [6, 7, 8]:  # Summer
        seasonal_factor = 1.1
    else:
        seasonal_factor = 1.0
    
    price = base_price * seasonal_factor
    quantity = np.random.randint(1, 5)  # 1 to 4 units (upper bound is exclusive)
    
    # Discount (occasionally)
    discount = np.random.choice([0, 0.05, 0.10, 0.15, 0.20], p=[0.6, 0.15, 0.12, 0.08, 0.05])
    
    total = price * quantity * (1 - discount)
    
    sales_records.append({
        'transaction_id': f'TXN{i+1:05d}',
        'date': dates[i],
        'category': category,
        'product': product,
        'region': region,
        'unit_price': round(price, 2),
        'quantity': quantity,
        'discount': discount,
        'total_amount': round(total, 2)
    })

# Create DataFrame
sales_df = pd.DataFrame(sales_records)
sales_df['date'] = pd.to_datetime(sales_df['date'])
sales_df = sales_df.sort_values('date').reset_index(drop=True)

print("Sales Dataset Created")
print("="*50)
print(f"Shape: {sales_df.shape}")
print(f"Date range: {sales_df['date'].min()} to {sales_df['date'].max()}")
print("\nFirst few rows:")
print(sales_df.head())
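
The seasonal multiplier in the generator above is assigned row by row inside the loop. As a minimal sketch, the same rule can be written as a vectorized lookup, which is convenient for checking the factor against any date column. The helper name `seasonal_factors` and the sample dates are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def seasonal_factors(dates: pd.Series) -> pd.Series:
    """Map each date to the seasonal price multiplier used in the generator."""
    months = dates.dt.month
    factors = np.select(
        [months.isin([11, 12]), months.isin([6, 7, 8])],  # holiday season, summer
        [1.3, 1.1],
        default=1.0,  # every other month
    )
    return pd.Series(factors, index=dates.index, name="seasonal_factor")

# Hypothetical sample dates: one regular month, one summer month, one holiday month
sample_dates = pd.Series(pd.to_datetime(["2022-01-15", "2022-07-04", "2022-12-24"]))
factors = seasonal_factors(sample_dates)
print(factors.tolist())
```

Because the conditions are evaluated in order, a month can only match one branch, mirroring the `if`/`elif`/`else` chain in the loop.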

In [None]:
# Step 1: Data Understanding
print("STEP 1: Data Understanding")
print("="*50)

print("\nDataset Info:")
sales_df.info()  # info() prints directly and returns None, so wrapping it in print() adds a stray "None"

print("\nStatistical Summary:")
print(sales_df.describe())

print("\nCategorical Variables:")
for col in ['category', 'product', 'region']:
    print(f"\n{col}: {sales_df[col].nunique()} unique values")
    print(sales_df[col].value_counts().head())

In [None]:
# Step 2: Sales Trends Over Time
print("STEP 2: Sales Trends Analysis")
print("="*50)

# Add time-based columns
sales_df['year'] = sales_df['date'].dt.year
sales_df['month'] = sales_df['date'].dt.month
sales_df['quarter'] = sales_df['date'].dt.quarter
sales_df['day_of_week'] = sales_df['date'].dt.day_name()
sales_df['year_month'] = sales_df['date'].dt.to_period('M')

# Monthly sales trend
monthly_sales = sales_df.groupby('year_month').agg({
    'total_amount': 'sum',
    'transaction_id': 'count',
    'quantity': 'sum'
}).rename(columns={'transaction_id': 'num_transactions'})

monthly_sales['avg_transaction_value'] = monthly_sales['total_amount'] / monthly_sales['num_transactions']

print("Monthly Sales Summary (last 6 months):")
print(monthly_sales.tail(6))
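
Monthly totals are easier to interpret alongside their rate of change. The sketch below, which uses made-up revenue figures rather than the notebook's data, shows one way to add month-over-month growth and a 3-month rolling average to a monthly series like `monthly_sales['total_amount']`.

```python
import pandas as pd

# Toy monthly revenue indexed by period (values are invented for illustration)
monthly_revenue = pd.Series(
    [100.0, 110.0, 99.0, 132.0],
    index=pd.period_range("2023-01", periods=4, freq="M"),
    name="total_amount",
)

# Percentage change relative to the previous month (first month has no predecessor)
mom_growth_pct = monthly_revenue.pct_change() * 100

# Rolling mean over 3 months smooths out single-month spikes
rolling_avg = monthly_revenue.rolling(window=3).mean()

trend_summary = pd.DataFrame({
    "total_amount": monthly_revenue,
    "mom_growth_pct": mom_growth_pct,
    "rolling_3m_avg": rolling_avg,
})
print(trend_summary.round(1))
```

The first row of `mom_growth_pct` and the first two rows of `rolling_3m_avg` are `NaN` by construction, since there is not enough history to compute them.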

In [None]:
# Visualize sales trends
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Monthly revenue trend
ax1 = axes[0, 0]
monthly_sales['total_amount'].plot(ax=ax1, marker='o', linewidth=2)
ax1.set_xlabel('Month')
ax1.set_ylabel('Total Revenue ($)')
ax1.set_title('Monthly Revenue Trend')
ax1.tick_params(axis='x', rotation=45)

# Quarterly comparison
ax2 = axes[0, 1]
quarterly_sales = sales_df.groupby(['year', 'quarter'])['total_amount'].sum().unstack(level=0)
quarterly_sales.plot(kind='bar', ax=ax2)
ax2.set_xlabel('Quarter')
ax2.set_ylabel('Total Revenue ($)')
ax2.set_title('Quarterly Sales by Year')
ax2.legend(title='Year')

# Sales by day of week
ax3 = axes[1, 0]
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
daily_sales = sales_df.groupby('day_of_week')['total_amount'].sum().reindex(day_order)
daily_sales.plot(kind='bar', ax=ax3, color='steelblue')
ax3.set_xlabel('Day of Week')
ax3.set_ylabel('Total Revenue ($)')
ax3.set_title('Sales by Day of Week')
ax3.tick_params(axis='x', rotation=45)

# Monthly pattern (seasonality)
ax4 = axes[1, 1]
monthly_pattern = sales_df.groupby('month')['total_amount'].sum()
monthly_pattern.plot(kind='bar', ax=ax4, color='coral')
ax4.set_xlabel('Month')
ax4.set_ylabel('Total Revenue ($)')
ax4.set_title('Sales by Month (Seasonality)')

plt.tight_layout()
plt.show()

In [None]:
# Step 3: Product and Category Analysis
print("STEP 3: Product and Category Analysis")
print("="*50)

# Category performance
category_performance = sales_df.groupby('category').agg({
    'total_amount': 'sum',
    'transaction_id': 'count',
    'quantity': 'sum'
}).rename(columns={'transaction_id': 'transactions'})

category_performance['avg_transaction'] = category_performance['total_amount'] / category_performance['transactions']
category_performance['revenue_share'] = category_performance['total_amount'] / category_performance['total_amount'].sum() * 100
category_performance = category_performance.sort_values('total_amount', ascending=False)

print("\nCategory Performance:")
print(category_performance.round(2))

# Top products by revenue
top_products = sales_df.groupby(['category', 'product'])['total_amount'].sum().sort_values(ascending=False).head(10)
print("\nTop 10 Products by Revenue:")
print(top_products)
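
The category summary above uses a dictionary aggregation followed by `rename`. As an alternative sketch, pandas named aggregation sets the output column names in a single step. The toy DataFrame below is an assumption made for illustration; it only reuses the column names from `sales_df`.

```python
import pandas as pd

# Toy transactions using the same column names as sales_df (values are invented)
toy_sales = pd.DataFrame({
    "transaction_id": ["T1", "T2", "T3", "T4"],
    "category": ["Books", "Books", "Electronics", "Electronics"],
    "quantity": [1, 2, 1, 3],
    "total_amount": [20.0, 30.0, 400.0, 550.0],
})

# Named aggregation: output_name=(input_column, aggregation)
category_summary = toy_sales.groupby("category").agg(
    total_amount=("total_amount", "sum"),
    transactions=("transaction_id", "count"),
    quantity=("quantity", "sum"),
)

# Share of total revenue contributed by each category, in percent
category_summary["revenue_share"] = (
    category_summary["total_amount"] / category_summary["total_amount"].sum() * 100
)
print(category_summary.round(2))
```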

In [None]:
# Visualize product performance
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Category revenue pie chart
ax1 = axes[0]
category_performance['total_amount'].plot(kind='pie', ax=ax1, autopct='%1.1f%%')
ax1.set_ylabel('')
ax1.set_title('Revenue by Category')

# Top products bar chart
ax2 = axes[1]
top_products.sort_values().plot(kind='barh', ax=ax2)  # ascending order puts the largest bar at the top
ax2.set_xlabel('Total Revenue ($)')
ax2.set_title('Top 10 Products by Revenue')

# Category trends over time
ax3 = axes[2]
category_monthly = sales_df.groupby(['year_month', 'category'])['total_amount'].sum().unstack()
category_monthly.plot(ax=ax3, marker='o')
ax3.set_xlabel('Month')
ax3.set_ylabel('Revenue ($)')
ax3.set_title('Category Revenue Trends')
ax3.legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')
ax3.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Step 4: Regional Analysis
print("STEP 4: Regional Analysis")
print("="*50)

# Regional performance
regional_performance = sales_df.groupby('region').agg({
    'total_amount': 'sum',
    'transaction_id': 'count',
    'quantity': 'sum'
}).rename(columns={'transaction_id': 'transactions'})

regional_performance['avg_transaction'] = regional_performance['total_amount'] / regional_performance['transactions']
regional_performance['market_share'] = regional_performance['total_amount'] / regional_performance['total_amount'].sum() * 100
regional_performance = regional_performance.sort_values('total_amount', ascending=False)

print("Regional Performance:")
print(regional_performance.round(2))

# Category preferences by region
region_category = sales_df.pivot_table(
    values='total_amount',
    index='region',
    columns='category',
    aggfunc='sum'
)

print("\nCategory Preferences by Region (Revenue):")
print(region_category.round(2))
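
Raw revenue in a region-by-category table is dominated by whichever category has the highest prices, which can hide differences in regional preference. One possible refinement, sketched here on invented numbers, is to convert each row to percentage shares so regions of different sizes become comparable.

```python
import pandas as pd

# Toy data with the same columns used by the pivot table (values are invented)
toy_sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "category": ["Books", "Electronics", "Books", "Electronics"],
    "total_amount": [100.0, 900.0, 300.0, 700.0],
})

revenue = toy_sales.pivot_table(
    values="total_amount", index="region", columns="category", aggfunc="sum"
)

# Divide each row by its own total so every region sums to 100%
category_mix = revenue.div(revenue.sum(axis=1), axis=0) * 100
print(category_mix.round(1))
```

Passing `category_mix` instead of the raw pivot table to a heatmap would highlight mix differences rather than absolute revenue.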

In [None]:
# Visualize regional analysis
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Regional revenue
ax1 = axes[0]
regional_performance['total_amount'].plot(kind='bar', ax=ax1, color='teal')
ax1.set_xlabel('Region')
ax1.set_ylabel('Total Revenue ($)')
ax1.set_title('Revenue by Region')
ax1.tick_params(axis='x', rotation=0)

# Heatmap of region-category performance
ax2 = axes[1]
sns.heatmap(region_category, annot=True, fmt=',.0f', cmap='YlGnBu', ax=ax2)
ax2.set_title('Revenue Heatmap: Region vs Category')

plt.tight_layout()
plt.show()

In [None]:
# Step 5: Key Insights and Recommendations
print("STEP 5: Key Insights and Recommendations")
print("="*60)

# Calculate key metrics
total_revenue = sales_df['total_amount'].sum()
total_transactions = len(sales_df)
avg_transaction = total_revenue / total_transactions
best_month = monthly_pattern.idxmax()
worst_month = monthly_pattern.idxmin()
top_category = category_performance['total_amount'].idxmax()
top_region = regional_performance['total_amount'].idxmax()

print("\nKEY METRICS:")
print(f"  Total Revenue: ${total_revenue:,.2f}")
print(f"  Total Transactions: {total_transactions:,}")
print(f"  Average Transaction Value: ${avg_transaction:.2f}")

print("\nINSIGHTS:")
print(f"  1. Best performing month: Month {best_month} (${monthly_pattern[best_month]:,.2f})")
print(f"  2. Slowest month: Month {worst_month} (${monthly_pattern[worst_month]:,.2f})")
print(f"  3. Top category: {top_category} ({category_performance.loc[top_category, 'revenue_share']:.1f}% of revenue)")
print(f"  4. Top region: {top_region} ({regional_performance.loc[top_region, 'market_share']:.1f}% of revenue)")

# Year-over-year growth
yearly_sales = sales_df.groupby('year')['total_amount'].sum()
if len(yearly_sales) > 1:
    yoy_growth = (yearly_sales.iloc[-1] - yearly_sales.iloc[-2]) / yearly_sales.iloc[-2] * 100
    print(f"  5. Year-over-Year Growth: {yoy_growth:.1f}%")

print("\nRECOMMENDATIONS:")
print("  1. Increase marketing spend in Q4 to capitalize on holiday season")
print(f"  2. Focus expansion efforts on {top_category} category")
print(f"  3. Investigate underperformance in {regional_performance['total_amount'].idxmin()} region")
print("  4. Consider promotional campaigns during slower months")
print("  5. Analyze high-discount transactions to optimize pricing strategy")
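
The final recommendation suggests analyzing high-discount transactions. A minimal sketch of what that analysis could look like follows, assuming the same `unit_price`, `quantity`, and `discount` columns as `sales_df`; the toy values and the `discount_cost` column are illustrative additions.

```python
import pandas as pd

# Toy transactions (values are invented for illustration)
toy_sales = pd.DataFrame({
    "unit_price": [100.0, 50.0, 200.0, 80.0],
    "quantity": [1, 2, 1, 1],
    "discount": [0.0, 0.10, 0.20, 0.0],
})

# Revenue before and after discount, and the revenue given up to discounting
toy_sales["list_value"] = toy_sales["unit_price"] * toy_sales["quantity"]
toy_sales["total_amount"] = toy_sales["list_value"] * (1 - toy_sales["discount"])
toy_sales["discount_cost"] = toy_sales["list_value"] - toy_sales["total_amount"]

# Summarize by discount tier
by_discount = toy_sales.groupby("discount").agg(
    transactions=("total_amount", "count"),
    revenue=("total_amount", "sum"),
    discount_cost=("discount_cost", "sum"),
)
print(by_discount.round(2))
```

Note that in the synthetic data discounts are assigned at random, so they cannot drive volume there; with real data, the same table would be a starting point for comparing discount cost against any uplift in quantity.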

---
# Project 2: Customer Segmentation
---

## Project Overview

**Scenario:** You are a data scientist at an e-commerce company. The marketing team wants to segment customers to create targeted marketing campaigns.

**Objectives:**
1. Analyze customer behavior patterns
2. Create meaningful customer segments using clustering
3. Profile each segment with descriptive characteristics
4. Provide marketing recommendations for each segment

**Skills Applied:**
- Feature engineering
- Data preprocessing and scaling
- K-Means clustering
- Cluster analysis and interpretation
- Business recommendations

In [None]:
# Generate synthetic customer data
np.random.seed(42)

n_customers = 1000

# Customer base data
customer_data = pd.DataFrame({
    'customer_id': [f'CUST{i:04d}' for i in range(1, n_customers + 1)],
    'age': np.random.randint(18, 70, n_customers),
    'gender': np.random.choice(['M', 'F'], n_customers),
    'tenure_months': np.random.randint(1, 60, n_customers),
})

# Generate correlated behavioral features
# Create customer archetypes
archetypes = np.random.choice(['budget', 'regular', 'premium', 'vip'], n_customers, 
                               p=[0.30, 0.40, 0.20, 0.10])

# Define archetype characteristics
archetype_params = {
    'budget': {'orders': (1, 5), 'avg_spend': (20, 50), 'frequency': (1, 3)},
    'regular': {'orders': (5, 15), 'avg_spend': (50, 100), 'frequency': (3, 6)},
    'premium': {'orders': (10, 30), 'avg_spend': (100, 300), 'frequency': (4, 8)},
    'vip': {'orders': (20, 50), 'avg_spend': (200, 500), 'frequency': (6, 12)}
}

# Generate features based on archetypes
total_orders = []
avg_order_value = []
purchase_frequency = []
days_since_last_purchase = []

for archetype in archetypes:
    params = archetype_params[archetype]
    total_orders.append(np.random.randint(params['orders'][0], params['orders'][1]))
    avg_order_value.append(np.random.uniform(params['avg_spend'][0], params['avg_spend'][1]))
    purchase_frequency.append(np.random.uniform(params['frequency'][0], params['frequency'][1]))
    # Recency: the upper bound shrinks as the archetype's minimum frequency grows
    days_since_last_purchase.append(np.random.randint(1, 180 // params['frequency'][0]))

customer_data['total_orders'] = total_orders
customer_data['avg_order_value'] = np.round(avg_order_value, 2)
customer_data['purchase_frequency_monthly'] = np.round(purchase_frequency, 2)
customer_data['days_since_last_purchase'] = days_since_last_purchase
customer_data['total_spend'] = np.round(customer_data['total_orders'] * customer_data['avg_order_value'], 2)

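# Add engagement features: web visits scale with purchase frequency;
# email open rate and support tickets are drawn independently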
customer_data['website_visits_monthly'] = np.round(
    customer_data['purchase_frequency_monthly'] * np.random.uniform(2, 5, n_customers), 1
)
customer_data['email_open_rate'] = np.round(
    np.random.beta(2, 5, n_customers) * 0.8 + 0.1, 2
)
customer_data['support_tickets'] = np.random.poisson(1, n_customers)

print("Customer Dataset Created")
print("="*50)
print(f"Shape: {customer_data.shape}")
print("\nFirst few rows:")
print(customer_data.head())

In [None]:
# Step 1: Exploratory Data Analysis
print("STEP 1: Customer Data EDA")
print("="*50)

print("\nDataset Summary:")
print(customer_data.describe())

# Key metrics
print("\nKey Customer Metrics:")
print(f"  Total Customers: {len(customer_data):,}")
print(f"  Total Revenue: ${customer_data['total_spend'].sum():,.2f}")
print(f"  Average Customer Value: ${customer_data['total_spend'].mean():.2f}")
print(f"  Average Order Value: ${customer_data['avg_order_value'].mean():.2f}")

In [None]:
# Visualize customer distributions
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# Total spend distribution
axes[0, 0].hist(customer_data['total_spend'], bins=30, edgecolor='black')
axes[0, 0].set_xlabel('Total Spend ($)')
axes[0, 0].set_ylabel('Count')
axes[0, 0].set_title('Distribution of Total Spend')

# Order frequency
axes[0, 1].hist(customer_data['purchase_frequency_monthly'], bins=20, edgecolor='black')
axes[0, 1].set_xlabel('Purchases per Month')
axes[0, 1].set_ylabel('Count')
axes[0, 1].set_title('Purchase Frequency Distribution')

# Avg order value
axes[0, 2].hist(customer_data['avg_order_value'], bins=30, edgecolor='black')
axes[0, 2].set_xlabel('Average Order Value ($)')
axes[0, 2].set_ylabel('Count')
axes[0, 2].set_title('Average Order Value Distribution')

# Recency (days since last purchase)
axes[1, 0].hist(customer_data['days_since_last_purchase'], bins=30, edgecolor='black')
axes[1, 0].set_xlabel('Days Since Last Purchase')
axes[1, 0].set_ylabel('Count')
axes[1, 0].set_title('Recency Distribution')

# Tenure
axes[1, 1].hist(customer_data['tenure_months'], bins=20, edgecolor='black')
axes[1, 1].set_xlabel('Tenure (Months)')
axes[1, 1].set_ylabel('Count')
axes[1, 1].set_title('Customer Tenure Distribution')

# Total spend vs Frequency scatter
axes[1, 2].scatter(customer_data['purchase_frequency_monthly'], 
                   customer_data['total_spend'], alpha=0.5)
axes[1, 2].set_xlabel('Purchase Frequency (Monthly)')
axes[1, 2].set_ylabel('Total Spend ($)')
axes[1, 2].set_title('Frequency vs Total Spend')

plt.tight_layout()
plt.show()

In [None]:
# Step 2: Feature Engineering for Segmentation
print("STEP 2: Feature Engineering")
print("="*50)

# RFM-style features
segmentation_features = customer_data[[
    'total_spend',              # Monetary
    'purchase_frequency_monthly',  # Frequency
    'days_since_last_purchase',    # Recency
    'avg_order_value',
    'total_orders',
    'website_visits_monthly'
]].copy()

print("Features for Segmentation:")
print(segmentation_features.describe())

# Scale features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(segmentation_features)

print(f"\nScaled features shape: {features_scaled.shape}")
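
K-Means relies on Euclidean distance, so a feature measured in thousands of dollars would otherwise outweigh one measured in purchases per month. The sketch below, on a small invented array, shows what `StandardScaler` computes: each column is centered on its mean and divided by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (values are invented):
# column 0 ~ total spend in dollars, column 1 ~ purchases per month
raw = np.array([
    [5000.0, 2.0],
    [200.0, 8.0],
    [1500.0, 5.0],
])

scaled = StandardScaler().fit_transform(raw)

# Manual equivalent: z = (x - mean) / std, with population std (ddof=0)
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so each feature contributes comparably to the distance calculation.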

In [None]:
# Step 3: Find Optimal Number of Clusters
print("STEP 3: Determine Optimal Clusters")
print("="*50)

# Elbow method and silhouette score
k_range = range(2, 11)
inertias = []
silhouette_scores = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(features_scaled)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(features_scaled, kmeans.labels_))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Elbow method
axes[0].plot(k_range, inertias, 'bo-', linewidth=2)
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Inertia')
axes[0].set_title('Elbow Method')

# Silhouette score
axes[1].plot(k_range, silhouette_scores, 'go-', linewidth=2)
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Score by Number of Clusters')

plt.tight_layout()
plt.show()

# Print silhouette scores
print("\nSilhouette Scores:")
for k, score in zip(k_range, silhouette_scores):
    print(f"  k={k}: {score:.4f}")
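
Reading the best `k` off a plot can be complemented by selecting it programmatically. This is a rough sketch on synthetic blobs (not the customer data), assuming that the highest silhouette score is an acceptable selection rule; in practice the elbow plot and business interpretability should also inform the choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters (illustrative data only)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

# Silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Pick the k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```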

In [None]:
# Step 4: Apply K-Means Clustering
print("STEP 4: Customer Segmentation")
print("="*50)

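# Use k=4, which matches the four archetypes in the synthetic data.
# Confirm this against the elbow and silhouette results before relying on it.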
optimal_k = 4

kmeans_final = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
customer_data['segment'] = kmeans_final.fit_predict(features_scaled)

print(f"\nNumber of clusters: {optimal_k}")
print(f"Silhouette Score: {silhouette_score(features_scaled, customer_data['segment']):.4f}")

# Segment distribution
print("\nSegment Distribution:")
print(customer_data['segment'].value_counts().sort_index())
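
Cluster centers from a model fitted on scaled data are expressed in standard-deviation units, which are hard to read. As a sketch under the assumption that the same fitted scaler is still available, `inverse_transform` maps the centers back to the original units. The toy data and the names `scaler_demo` and `kmeans_demo` are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two obvious groups of customers (values are invented)
toy_customers = pd.DataFrame({
    "total_spend": [100.0, 120.0, 110.0, 5000.0, 5200.0, 5100.0],
    "total_orders": [2, 3, 2, 40, 42, 41],
})

scaler_demo = StandardScaler()
scaled = scaler_demo.fit_transform(toy_customers)

kmeans_demo = KMeans(n_clusters=2, random_state=42, n_init=10).fit(scaled)

# Convert centroids from scaled units back to dollars and order counts
centroids = pd.DataFrame(
    scaler_demo.inverse_transform(kmeans_demo.cluster_centers_),
    columns=toy_customers.columns,
)
print(centroids.round(2))
```

For K-Means, these back-transformed centroids equal the per-cluster means of the original features, so they agree with a `groupby('segment').mean()` profile.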

In [None]:
# Step 5: Segment Profiling
print("STEP 5: Segment Profiling")
print("="*50)

# Calculate segment statistics
segment_profile = customer_data.groupby('segment').agg({
    'customer_id': 'count',
    'total_spend': 'mean',
    'avg_order_value': 'mean',
    'purchase_frequency_monthly': 'mean',
    'days_since_last_purchase': 'mean',
    'total_orders': 'mean',
    'website_visits_monthly': 'mean',
    'email_open_rate': 'mean',
    'tenure_months': 'mean'
}).rename(columns={'customer_id': 'count'})

# Add percentage
segment_profile['pct_customers'] = segment_profile['count'] / segment_profile['count'].sum() * 100

print("\nSegment Profiles:")
print(segment_profile.round(2))

In [None]:
# Name the segments based on characteristics
segment_names = {
    segment_profile['total_spend'].idxmax(): 'VIP Champions',
    segment_profile['total_spend'].idxmin(): 'Budget Shoppers',
}

# Assign remaining segments
remaining = [s for s in range(optimal_k) if s not in segment_names]
remaining_sorted = segment_profile.loc[remaining, 'total_spend'].sort_values(ascending=False)

if len(remaining) >= 1:
    segment_names[remaining_sorted.index[0]] = 'Loyal Regulars'
if len(remaining) >= 2:
    segment_names[remaining_sorted.index[1]] = 'Occasional Buyers'

customer_data['segment_name'] = customer_data['segment'].map(segment_names)

print("Segment Names:")
for seg, name in sorted(segment_names.items()):
    count = (customer_data['segment'] == seg).sum()
    spend = segment_profile.loc[seg, 'total_spend']
    print(f"  Segment {seg}: {name} ({count} customers, avg spend: ${spend:.2f})")
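
The naming logic above assigns the highest and lowest spenders first and then fills in the middle. An equivalent and somewhat more compact sketch ranks all segments by mean spend and zips them with an ordered list of names. The mean-spend values below are invented; with real output they would come from `segment_profile['total_spend']`.

```python
import pandas as pd

# Hypothetical mean spend per cluster label (invented values)
mean_spend = pd.Series({0: 850.0, 1: 9200.0, 2: 120.0, 3: 2400.0}, name="total_spend")

# Names ordered from highest to lowest spend
ordered_names = ["VIP Champions", "Loyal Regulars", "Occasional Buyers", "Budget Shoppers"]

# Rank cluster labels by spend, then pair each with the matching name
ranked_segments = mean_spend.sort_values(ascending=False).index
segment_names_demo = dict(zip(ranked_segments, ordered_names))

for seg, name in sorted(segment_names_demo.items()):
    print(f"Segment {seg}: {name}")
```

This version assumes the number of names matches the number of clusters; `zip` silently stops at the shorter input, so a different `k` would need a longer or shorter name list.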

In [None]:
# Visualize segments
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# Segment sizes
ax1 = axes[0, 0]
segment_counts = customer_data['segment_name'].value_counts()
segment_counts.plot(kind='pie', ax=ax1, autopct='%1.1f%%')
ax1.set_ylabel('')
ax1.set_title('Customer Segments Distribution')

# Spend by segment (box plot)
ax2 = axes[0, 1]
customer_data.boxplot(column='total_spend', by='segment_name', ax=ax2)
ax2.set_xlabel('Segment')
ax2.set_ylabel('Total Spend ($)')
ax2.set_title('Total Spend by Segment')
plt.suptitle('')

# Frequency vs Spend scatter by segment
ax3 = axes[1, 0]
for name in segment_names.values():
    subset = customer_data[customer_data['segment_name'] == name]
    ax3.scatter(subset['purchase_frequency_monthly'], subset['total_spend'], 
                label=name, alpha=0.6)
ax3.set_xlabel('Purchase Frequency (Monthly)')
ax3.set_ylabel('Total Spend ($)')
ax3.set_title('Segment Distribution: Frequency vs Spend')
ax3.legend()

# Grouped bar chart comparing normalized segment metrics
ax4 = axes[1, 1]
metrics = ['total_spend', 'avg_order_value', 'purchase_frequency_monthly', 'website_visits_monthly']
segment_means = customer_data.groupby('segment_name')[metrics].mean()

# Normalize for comparison
segment_means_normalized = (segment_means - segment_means.min()) / (segment_means.max() - segment_means.min())

x = np.arange(len(metrics))
width = 0.2

for i, segment in enumerate(segment_means_normalized.index):
    ax4.bar(x + i*width, segment_means_normalized.loc[segment], width, label=segment)

ax4.set_xticks(x + width * 1.5)
ax4.set_xticklabels(['Total Spend', 'Avg Order', 'Frequency', 'Web Visits'])
ax4.set_ylabel('Normalized Score')
ax4.set_title('Segment Comparison (Normalized)')
ax4.legend()

plt.tight_layout()
plt.show()

In [None]:
# Step 6: Marketing Recommendations
print("STEP 6: Marketing Recommendations")
print("="*60)

recommendations = {
    'VIP Champions': {
        'description': 'High-value, frequent buyers with strong engagement',
        'strategy': [
            'Exclusive early access to new products',
            'Personalized VIP loyalty rewards',
            'Dedicated customer success manager',
            'Premium shipping and returns',
            'Invite-only events and experiences'
        ]
    },
    'Loyal Regulars': {
        'description': 'Consistent buyers with moderate spend',
        'strategy': [
            'Upsell and cross-sell campaigns',
            'Tiered loyalty program to encourage upgrades',
            'Product bundles and volume discounts',
            'Referral bonuses',
            'Birthday/anniversary special offers'
        ]
    },
    'Occasional Buyers': {
        'description': 'Infrequent purchasers with growth potential',
        'strategy': [
            'Win-back email campaigns',
            'Time-limited discount offers',
            'Product recommendations based on past purchases',
            'Re-engagement through social media',
            'Simplified checkout experience'
        ]
    },
    'Budget Shoppers': {
        'description': 'Price-sensitive customers seeking deals',
        'strategy': [
            'Flash sales and clearance notifications',
            'Price drop alerts on wishlist items',
            'Budget-friendly product recommendations',
            'Free shipping thresholds to increase basket size',
            'Loyalty points for every purchase'
        ]
    }
}

for segment, info in recommendations.items():
    count = (customer_data['segment_name'] == segment).sum()
    revenue_share = customer_data[customer_data['segment_name'] == segment]['total_spend'].sum() / customer_data['total_spend'].sum() * 100
    
    print(f"\n{segment.upper()}")
    print(f"  Customers: {count} ({count/len(customer_data)*100:.1f}%)")
    print(f"  Revenue Share: {revenue_share:.1f}%")
    print(f"  Profile: {info['description']}")
    print("  Recommended Strategies:")
    for strategy in info['strategy']:
        print(f"    - {strategy}")

---
# Project 3: Predictive Modeling
---

## Project Overview

**Scenario:** You are a data scientist at a subscription-based company. The business wants to predict which customers are likely to churn so they can take proactive retention actions.

**Objectives:**
1. Build a predictive model for customer churn
2. Identify key factors driving churn
3. Evaluate model performance
4. Provide actionable insights for retention

**Skills Applied:**
- Complete ML workflow
- Feature engineering
- Model training and evaluation
- Cross-validation
- Feature importance analysis
- Business recommendations

In [None]:
# Generate synthetic churn dataset
np.random.seed(42)

n_customers = 2000

# Customer features
churn_data = pd.DataFrame({
    'customer_id': [f'SUB{i:05d}' for i in range(1, n_customers + 1)],
    'tenure_months': np.random.randint(1, 72, n_customers),
    'monthly_charges': np.random.uniform(20, 100, n_customers),
    'contract_type': np.random.choice(['Month-to-Month', 'One Year', 'Two Year'], 
                                       n_customers, p=[0.5, 0.3, 0.2]),
    'payment_method': np.random.choice(['Credit Card', 'Bank Transfer', 'Electronic Check'], 
                                        n_customers, p=[0.4, 0.35, 0.25]),
    'num_support_tickets': np.random.poisson(2, n_customers),
    'num_services': np.random.randint(1, 6, n_customers),
    'has_premium_support': np.random.choice([0, 1], n_customers, p=[0.7, 0.3]),
    'account_age_days': np.random.randint(30, 2000, n_customers)
})

# Calculate total charges
churn_data['total_charges'] = churn_data['tenure_months'] * churn_data['monthly_charges']

# Create churn based on features (simulating real patterns)
churn_prob = (
    0.35  # Base probability
    - 0.01 * churn_data['tenure_months']  # Longer tenure = less likely to churn
    + 0.002 * churn_data['monthly_charges']  # Higher charges = more likely to churn
    + 0.05 * (churn_data['contract_type'] == 'Month-to-Month').astype(int)  # Month-to-month = more churn
    + 0.03 * churn_data['num_support_tickets']  # More tickets = more churn
    - 0.03 * churn_data['num_services']  # More services = less churn
    - 0.1 * churn_data['has_premium_support']  # Premium support = less churn
    + 0.04 * (churn_data['payment_method'] == 'Electronic Check').astype(int)  # E-check = more churn
).clip(0.05, 0.8)

churn_data['churned'] = np.random.binomial(1, churn_prob)

print("Churn Dataset Created")
print("="*50)
print(f"Shape: {churn_data.shape}")
print(f"Churn Rate: {churn_data['churned'].mean():.2%}")
print("\nFirst few rows:")
print(churn_data.head())
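
To make the churn formula above concrete, here is a worked example for two hypothetical customers. The function `churn_probability` is an illustrative helper, not part of the notebook; it applies the same coefficients and the same clipping range of 0.05 to 0.8.

```python
def churn_probability(tenure_months, monthly_charges, month_to_month,
                      num_support_tickets, num_services,
                      has_premium_support, electronic_check):
    """Apply the linear churn formula from the generator, then clip to [0.05, 0.8]."""
    raw = (
        0.35
        - 0.01 * tenure_months
        + 0.002 * monthly_charges
        + 0.05 * month_to_month
        + 0.03 * num_support_tickets
        - 0.03 * num_services
        - 0.1 * has_premium_support
        + 0.04 * electronic_check
    )
    return min(max(raw, 0.05), 0.8)

# Hypothetical at-risk customer: short tenure, high charges, month-to-month, e-check
# 0.35 - 0.12 + 0.16 + 0.05 + 0.09 - 0.06 - 0 + 0.04 = 0.51
at_risk = churn_probability(12, 80, 1, 3, 2, 0, 1)

# Hypothetical loyal customer: the raw score is negative, so it is clipped up to 0.05
loyal = churn_probability(60, 30, 0, 0, 5, 1, 0)

print(f"At-risk customer: {at_risk:.2f}")
print(f"Loyal customer:   {loyal:.2f}")
```

The clipping step matters: without it, long-tenure customers would receive negative probabilities, which `np.random.binomial` cannot accept.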

In [None]:
# Step 1: Exploratory Data Analysis
print("STEP 1: Churn Data EDA")
print("="*50)

print("\nDataset Info:")
churn_data.info()  # info() prints directly and returns None, so wrapping it in print() adds a stray "None"

print("\nNumerical Summary:")
print(churn_data.describe())

print("\nCategorical Summary:")
for col in ['contract_type', 'payment_method']:
    print(f"\n{col}:")
    print(churn_data[col].value_counts())

In [None]:
# Analyze churn by different factors
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# Churn by contract type
ax1 = axes[0, 0]
churn_by_contract = churn_data.groupby('contract_type')['churned'].mean().sort_values(ascending=False)
churn_by_contract.plot(kind='bar', ax=ax1, color='coral')
ax1.set_ylabel('Churn Rate')
ax1.set_title('Churn Rate by Contract Type')
ax1.tick_params(axis='x', rotation=45)

# Churn by payment method
ax2 = axes[0, 1]
churn_by_payment = churn_data.groupby('payment_method')['churned'].mean().sort_values(ascending=False)
churn_by_payment.plot(kind='bar', ax=ax2, color='steelblue')
ax2.set_ylabel('Churn Rate')
ax2.set_title('Churn Rate by Payment Method')
ax2.tick_params(axis='x', rotation=45)

# Tenure distribution by churn
ax3 = axes[0, 2]
churn_data[churn_data['churned']==0]['tenure_months'].hist(ax=ax3, alpha=0.5, label='Stayed', bins=20)
churn_data[churn_data['churned']==1]['tenure_months'].hist(ax=ax3, alpha=0.5, label='Churned', bins=20)
ax3.set_xlabel('Tenure (Months)')
ax3.set_ylabel('Count')
ax3.set_title('Tenure Distribution by Churn Status')
ax3.legend()

# Monthly charges by churn
ax4 = axes[1, 0]
churn_data.boxplot(column='monthly_charges', by='churned', ax=ax4)
ax4.set_xlabel('Churned')
ax4.set_ylabel('Monthly Charges ($)')
ax4.set_title('Monthly Charges by Churn Status')
plt.suptitle('')

# Support tickets by churn
ax5 = axes[1, 1]
churn_by_tickets = churn_data.groupby('num_support_tickets')['churned'].mean()
churn_by_tickets.plot(kind='bar', ax=ax5, color='purple')
ax5.set_xlabel('Number of Support Tickets')
ax5.set_ylabel('Churn Rate')
ax5.set_title('Churn Rate by Support Tickets')

# Services by churn
ax6 = axes[1, 2]
churn_by_services = churn_data.groupby('num_services')['churned'].mean()
churn_by_services.plot(kind='bar', ax=ax6, color='green')
ax6.set_xlabel('Number of Services')
ax6.set_ylabel('Churn Rate')
ax6.set_title('Churn Rate by Number of Services')

plt.tight_layout()
plt.show()

In [None]:
# Step 2: Feature Engineering and Preprocessing
print("STEP 2: Feature Engineering")
print("="*50)

# Create a copy for modeling
model_data = churn_data.copy()

# Encode categorical variables
model_data['contract_monthly'] = (model_data['contract_type'] == 'Month-to-Month').astype(int)
model_data['contract_one_year'] = (model_data['contract_type'] == 'One Year').astype(int)
model_data['payment_echeck'] = (model_data['payment_method'] == 'Electronic Check').astype(int)
model_data['payment_credit'] = (model_data['payment_method'] == 'Credit Card').astype(int)

# Create derived features
model_data['avg_monthly_tickets'] = model_data['num_support_tickets'] / (model_data['tenure_months'] + 1)
model_data['charge_per_service'] = model_data['monthly_charges'] / model_data['num_services']

# Select features for modeling
feature_columns = [
    'tenure_months', 'monthly_charges', 'total_charges',
    'num_support_tickets', 'num_services', 'has_premium_support',
    'contract_monthly', 'contract_one_year',
    'payment_echeck', 'payment_credit',
    'avg_monthly_tickets', 'charge_per_service'
]

X = model_data[feature_columns]
y = model_data['churned']

print(f"Features: {len(feature_columns)}")
print(f"Samples: {len(X)}")
print(f"\nFeature columns:")
for col in feature_columns:
    print(f"  - {col}")
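
The manual indicator columns above can also be produced with `pd.get_dummies`. The sketch below is an alternative, not the approach used in this project. It runs on a made-up frame, and the `'Two Year'` and `'Bank Transfer'` labels are assumptions about the remaining category levels.

```python
import pandas as pd

# Hypothetical toy frame; 'Two Year' and 'Bank Transfer' are assumed levels
toy = pd.DataFrame({
    'contract_type': ['Month-to-Month', 'One Year', 'Two Year', 'Month-to-Month'],
    'payment_method': ['Electronic Check', 'Credit Card', 'Bank Transfer', 'Credit Card'],
})

# One integer indicator column per category level
encoded = pd.get_dummies(toy, columns=['contract_type', 'payment_method'], dtype=int)

print(encoded.columns.tolist())
```

The manual encoding above leaves one level of each variable without a column, which acts as a reference category. Passing `drop_first=True` to `pd.get_dummies` gives a similar result, although which level is dropped may differ.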

In [None]:
# Step 3: Train-Test Split and Scaling
print("STEP 3: Data Preparation")
print("="*50)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
print(f"\nChurn rate in training: {y_train.mean():.2%}")
print(f"Churn rate in testing: {y_test.mean():.2%}")

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
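
Fitting the scaler on the training split only, as above, keeps test-set statistics out of the model. A scikit-learn pipeline can bundle the scaler and the estimator so this happens automatically, including inside cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the churn features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features (not the course dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# fit() learns the scaling parameters from X_tr only, then trains the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# score() applies the stored scaling to X_te before predicting
print(f"Test accuracy: {pipe.score(X_te, y_te):.3f}")
```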

In [None]:
# Step 4: Model Training and Comparison
print("STEP 4: Model Training")
print("="*50)

# Define models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
}

# Train and evaluate each model
results = []

for name, model in models.items():
    # Use scaled data for logistic regression, unscaled for trees
    if name == 'Logistic Regression':
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        y_prob = model.predict_proba(X_test_scaled)[:, 1]
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1-Score': f1_score(y_test, y_pred)
    })

results_df = pd.DataFrame(results)
print("\nModel Comparison:")
print(results_df.round(4).to_string(index=False))
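
The loop above computes `y_prob` for each model but reports only metrics based on hard predictions. With an imbalanced target such as churn, a probability-based metric like ROC-AUC can be a useful complement. The sketch below illustrates the idea on synthetic data; the variable names are illustrative and not part of the project code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (roughly 20% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Accuracy is computed from hard labels; ROC-AUC uses the predicted probabilities
y_pred_demo = clf.predict(X_te)
y_prob_demo = clf.predict_proba(X_te)[:, 1]

print(f"Accuracy: {accuracy_score(y_te, y_pred_demo):.3f}")
print(f"ROC-AUC:  {roc_auc_score(y_te, y_prob_demo):.3f}")
```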

In [None]:
# Step 5: Cross-Validation for Best Model
print("STEP 5: Cross-Validation")
print("="*50)

# Assuming Random Forest scored best in Step 4, validate it with 5-fold CV.
# Only the training split is used so the test set stays untouched for Step 7.
best_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='f1')
cv_accuracy = cross_val_score(best_model, X_train, y_train, cv=5, scoring='accuracy')

print(f"\nRandom Forest - 5-Fold Cross-Validation:")
print(f"  F1-Score: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
print(f"  Accuracy: {cv_accuracy.mean():.4f} (+/- {cv_accuracy.std()*2:.4f})")
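
The cell above calls `cross_val_score` once per metric, which refits the model for every fold twice. As a possible alternative, `cross_validate` accepts several scorers and evaluates them in a single pass. The sketch below uses synthetic data and an explicit, shuffled `StratifiedKFold`; both are choices made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the churn features (not the course dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

# Shuffled, stratified folds with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(
    RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42),
    X_demo, y_demo, cv=cv, scoring=['f1', 'accuracy'],
)

# Results are keyed as 'test_<scorer name>'
for metric in ['f1', 'accuracy']:
    vals = scores[f'test_{metric}']
    print(f"{metric}: {vals.mean():.4f} (+/- {vals.std() * 2:.4f})")
```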

In [None]:
# Step 6: Feature Importance Analysis
print("STEP 6: Feature Importance")
print("="*50)

# Train final model
final_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
final_model.fit(X_train, y_train)

# Get feature importance
feature_importance = pd.DataFrame({
    'Feature': feature_columns,
    'Importance': final_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nFeature Importance Ranking:")
print(feature_importance.to_string(index=False))

# Visualize
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importance for Churn Prediction')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
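
The `feature_importances_` attribute reports impurity-based importance, which is computed on the training data and can favor features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below compares the two on synthetic data; the feature names are placeholders, not the churn features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

feature_names = [f'feature_{i}' for i in range(6)]  # placeholder names
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

rf = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test split and record the drop in F1
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42, scoring='f1')

comparison = pd.DataFrame({
    'Feature': feature_names,
    'Impurity': rf.feature_importances_,
    'Permutation': perm.importances_mean,
}).sort_values('Permutation', ascending=False)

print(comparison.to_string(index=False))
```

Features that rank high in both columns are more convincing candidates for "key drivers" than those that rank high in only one.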

In [None]:
# Step 7: Final Model Evaluation
print("STEP 7: Final Model Evaluation")
print("="*50)

# Predictions
y_pred_final = final_model.predict(X_test)
y_prob_final = final_model.predict_proba(X_test)[:, 1]

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_final)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion matrix heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['Stayed', 'Churned'],
            yticklabels=['Stayed', 'Churned'])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix')

# ROC Curve
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, y_prob_final)
auc = roc_auc_score(y_test, y_prob_final)

axes[1].plot(fpr, tpr, linewidth=2, label=f'Random Forest (AUC = {auc:.3f})')
axes[1].plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
axes[1].fill_between(fpr, tpr, alpha=0.3)
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curve')
axes[1].legend()

plt.tight_layout()
plt.show()

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_final, target_names=['Stayed', 'Churned']))
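
The confusion matrix and classification report above rely on the default 0.5 probability cutoff. For churn, where missing a churner may cost more than a false alarm, a different cutoff can be worth exploring. The sketch below picks the threshold that maximizes F1 on a validation split of synthetic data. It is one possible criterion and is not part of the project pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
val_prob = rf.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, val_prob)

# The final (precision=1, recall=0) point has no threshold, so drop it
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)
best_idx = int(np.argmax(f1))
best_threshold = thresholds[best_idx]

# Apply the chosen cutoff instead of the default 0.5
y_val_pred = (val_prob >= best_threshold).astype(int)

print(f"Chosen threshold: {best_threshold:.2f}")
print(f"Precision: {p[best_idx]:.3f}, Recall: {r[best_idx]:.3f}")
```

Choosing the threshold on a validation split rather than the test set keeps the final test metrics an honest estimate.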

In [None]:
# Step 8: Business Insights and Recommendations
print("STEP 8: Business Insights and Recommendations")
print("="*60)

print("\nMODEL PERFORMANCE SUMMARY:")
print(f"  - Accuracy: {accuracy_score(y_test, y_pred_final):.2%}")
print(f"  - Recall (Churn Detection): {recall_score(y_test, y_pred_final):.2%}")
print(f"  - Precision: {precision_score(y_test, y_pred_final):.2%}")
print(f"  - AUC-ROC: {auc:.4f}")

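print("\nKEY CHURN DRIVERS (Top 5):")
# enumerate gives the rank; the DataFrame index still holds the original feature positions
for rank, (_, row) in enumerate(feature_importance.head(5).iterrows(), 1):
    print(f"  {rank}. {row['Feature']}: {row['Importance']:.4f}")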

print("\nACTIONABLE RECOMMENDATIONS:")
recommendations_list = [
    ("Tenure & Engagement", 
     "Focus retention efforts on customers with < 12 months tenure. "
     "Implement onboarding programs and early engagement initiatives."),
    
    ("Contract Type", 
     "Incentivize month-to-month customers to switch to annual contracts. "
     "Offer discounts for longer commitments."),
    
    ("Support Experience", 
     "Customers with multiple support tickets are at higher risk. "
     "Proactively reach out after 2+ tickets to resolve issues."),
    
    ("Service Bundling", 
     "Customers with more services churn less. "
     "Create attractive bundles to increase service adoption."),
    
    ("Payment Method", 
     "Electronic check users have higher churn. "
     "Encourage credit card or bank transfer with small incentives."),
    
    ("Premium Support", 
     "Premium support customers churn significantly less. "
     "Promote premium support to high-risk customers.")
]

for i, (area, rec) in enumerate(recommendations_list, 1):
    print(f"\n  {i}. {area}:")
    print(f"     {rec}")

print("\nNEXT STEPS:")
print("  1. Deploy model to score all active customers")
print("  2. Create high-risk customer list for proactive outreach")
print("  3. A/B test retention strategies on different risk segments")
print("  4. Monitor model performance and retrain quarterly")
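
The first two next steps, scoring active customers and building a high-risk list, could look roughly like the sketch below. It uses synthetic data, and the `customer_id` column, the placeholder feature names, and the 0.7 cutoff are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: first 500 rows act as history, last 100 as "active" customers
X_all, y_all = make_classification(n_samples=600, n_features=6, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_all[:500], y_all[:500])

active = pd.DataFrame(X_all[500:], columns=[f'feature_{i}' for i in range(6)])
active.insert(0, 'customer_id', np.arange(1, len(active) + 1))  # hypothetical IDs

# Score every active customer with the predicted churn probability
feature_cols = [c for c in active.columns if c != 'customer_id']
active['churn_probability'] = model.predict_proba(active[feature_cols].to_numpy())[:, 1]

# Keep customers above a hypothetical risk cutoff, highest risk first
RISK_THRESHOLD = 0.7
high_risk = (
    active[active['churn_probability'] >= RISK_THRESHOLD]
    .sort_values('churn_probability', ascending=False)
    [['customer_id', 'churn_probability']]
)

print(f"High-risk customers: {len(high_risk)} of {len(active)}")
print(high_risk.head(10).to_string(index=False))
```

In practice the cutoff would depend on outreach capacity and on the relative cost of a missed churner versus an unnecessary retention offer.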

---
# Module Summary

## Congratulations!

You have completed the Python for Data Analysis and Data Science course! Through these 13 modules, you've built a strong foundation in:

### Skills Acquired

**Python Programming (Modules 1-5)**
- Python fundamentals and data structures
- Control flow and functions
- File handling and data I/O

**Data Manipulation (Modules 6-7)**
- NumPy for numerical computing
- Pandas for data analysis

**Data Visualization (Module 8)**
- Matplotlib for basic plots
- Seaborn for statistical visualizations

**Data Analysis (Modules 9-10)**
- Exploratory Data Analysis workflow
- Data cleaning and preprocessing

**Statistics & ML (Modules 11-12)**
- Statistical concepts and hypothesis testing
- Machine learning fundamentals

**Applied Projects (Module 13)**
- Sales data analysis
- Customer segmentation
- Predictive modeling

## Portfolio Projects

The three capstone projects can be added to your data science portfolio:

1. **Sales Data Analysis**: Demonstrates business analytics and visualization skills
2. **Customer Segmentation**: Shows unsupervised learning and customer insights
3. **Predictive Modeling (Churn Prediction)**: Exhibits an end-to-end machine learning workflow

## Next Steps

To continue your data science journey:

1. **Practice**: Apply these skills to real datasets (Kaggle, UCI ML Repository)
2. **Deep Dive**: Explore advanced topics (deep learning, NLP, time series)
3. **Tools**: Learn SQL, cloud platforms (AWS, GCP), and deployment
4. **Projects**: Build more portfolio projects in your domain of interest
5. **Community**: Join data science communities and participate in competitions

## Resources for Continued Learning

- **Kaggle**: Competitions and datasets
- **Towards Data Science**: Articles and tutorials
- **Scikit-learn Documentation**: ML algorithms reference
- **Stack Overflow**: Problem-solving community

---

Thank you for completing this course. Best of luck on your data science journey!