# Example 2: Machine Learning with Random Forest

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.0
**License:** MIT
**Example Type:** Supervised Learning Tutorial
**Based On:** Tier2_RandomForest.ipynb
**Estimated Time:** 20 minutes

---

> **Citation:**
> Brandon Deloatch, "Example 2: Machine Learning with Random Forest," Quipu Research Labs, LLC, v1.0, 2025-10-02.

---

*This example notebook is provided "as-is" for educational and research purposes. Users assume full responsibility for any results or applications derived from it.*

---

## Spotify Customer Churn Prediction with Random Forest

**Learning Objectives:**
- Master supervised machine learning workflows
- Implement Random Forest for classification
- Perform comprehensive model evaluation
- Interpret feature importance for business insights
- Generate predictions for new customer data

**Cross-References:**
- **Prerequisite:** `quick_start_data_analysis.ipynb` (data basics)
- **Foundation:** `Tier2_RandomForest.ipynb` (Random Forest theory)
- **Alternatives:** `Tier2_LogisticRegression.ipynb`, `Tier2_SVM.ipynb`
- **Advanced:** `Tier2_NeuralNetworks.ipynb` (deep learning)

**Key Applications:**
- Customer churn prediction and retention
- Risk assessment and scoring models
- Feature selection and importance analysis
- Business intelligence and decision support

In [None]:
"""
Example 2: Machine Learning with Random Forest.

This module demonstrates customer churn prediction using Random Forest classification
on real Spotify user data. Covers data preprocessing, model training, evaluation,
and feature importance analysis.

Author: Brandon Deloatch
Date: 2025-10-02
"""

# Example 2: Machine Learning with Random Forest
# ==============================================
# Professional customer churn prediction with real Spotify user data

import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries (imported for comprehensive ML workflow)
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, roc_curve, accuracy_score,
    precision_score, recall_score, f1_score,
)

warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Example 2: Machine Learning with Random Forest")
print("=" * 50)
print("CROSS-REFERENCES:")
print("• Prerequisites: quick_start_data_analysis.ipynb (data fundamentals)")
print("• Foundation: Tier2_RandomForest.ipynb (Random Forest theory)")
print("• Alternatives: Tier2_LogisticRegression.ipynb, Tier2_SVM.ipynb")
print("• Advanced: Tier2_NeuralNetworks.ipynb (deep learning approaches)")
print("• Full Guide: See notebooks/tier2_supervised/ for complete ML suite")
print("Machine learning libraries loaded - Ready for churn prediction!")

## 1. Load Spotify Churn Dataset

Load and explore the real Spotify customer churn dataset:

In [None]:
# Load the Spotify Churn dataset
df = pd.read_csv('../data/Spotify_churn_dataset.csv')

print("Spotify Churn Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Churn rate: {df['is_churned'].mean():.2%}")

# Basic dataset overview
print("\nDataset Info:")
print(f"- Total users: {len(df):,}")
print(f"- Features: {len(df.columns)}")
print(f"- Countries: {df['country'].nunique()}")
print(f"- Subscription types: {', '.join(df['subscription_type'].unique())}")
print(f"- Device types: {', '.join(df['device_type'].unique())}")

# Check for missing values
print("\nMissing values:")
missing_values = df.isnull().sum()
if missing_values.sum() == 0:
    print("No missing values found!")
else:
    print(missing_values[missing_values > 0])

# Show data types
print("\nData types:")
print(df.dtypes)

# Create target variable for consistency with existing code
df['churned'] = df['is_churned']

print("\nFirst 5 rows:")
df.head()
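
The loading cell above assumes the CSV contains specific columns such as `is_churned`, `country`, and `subscription_type`. A minimal sketch of a schema check that could run right after loading is shown below. The helper `find_missing_columns` and the tiny `sample` frame are hypothetical and exist only for illustration; the column names are the ones referenced elsewhere in this notebook.

```python
import pandas as pd

# Column names taken from the cells in this notebook; the helper is hypothetical.
REQUIRED_COLUMNS = [
    "is_churned", "gender", "country", "subscription_type", "device_type",
    "age", "listening_time", "songs_played_per_day", "skip_rate",
    "ads_listened_per_week",
]


def find_missing_columns(frame: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that are absent from the frame."""
    return [name for name in required if name not in frame.columns]


# Tiny stand-in frame with invented values so the sketch runs without the CSV.
sample = pd.DataFrame({
    "is_churned": [0, 1],
    "country": ["US", "DE"],
    "age": [25, 41],
})

missing = find_missing_columns(sample, REQUIRED_COLUMNS)
print(f"Missing columns: {missing}")
```

On the real data, calling `find_missing_columns(df, REQUIRED_COLUMNS)` and stopping early when the result is non-empty would surface a schema mismatch before any later cell fails with a `KeyError`.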

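## 2. Explore Churn Patterns

Using the `churned` target created above, compare churn rates across categorical segments, average the numerical features by churn status, and rank the numeric features by their correlation with churn: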

In [None]:
# Exploratory Data Analysis for Spotify Churn
print("SPOTIFY CHURN DATA ANALYSIS:")
print("=" * 50)

# Churn analysis by categorical variables
categorical_features = ['gender', 'country', 'subscription_type', 'device_type']

for feature in categorical_features:
    print(f"\n{feature.upper()} vs CHURN:")
    churn_by_feature = df.groupby(feature)['churned'].agg(['count', 'mean']).round(3)
    churn_by_feature.columns = ['Total_Users', 'Churn_Rate']
    churn_by_feature = churn_by_feature.sort_values('Churn_Rate', ascending=False)
    print(churn_by_feature)

# Numerical features analysis
numerical_features = ['age', 'listening_time', 'songs_played_per_day',
                      'skip_rate', 'ads_listened_per_week']

print("\nNUMERICAL FEATURES BY CHURN STATUS:")
churn_stats = df.groupby('churned')[numerical_features].mean().round(2)
churn_stats.index = ['Retained', 'Churned']
print(churn_stats)

# Feature correlation with churn
print("\nFEATURE CORRELATION WITH CHURN:")
numeric_df = df.select_dtypes(include=[np.number])
correlations = numeric_df.corr()['churned'].sort_values(key=abs, ascending=False)
for feature, corr in correlations.items():
    if feature not in ('churned', 'is_churned'):
        print(f" {feature}: {corr:.3f}")

print("\nKey Insights:")
subscription_churn = df.groupby('subscription_type')['churned'].mean()
device_churn = df.groupby('device_type')['churned'].mean()
print(f"- Highest churn rate by subscription: {subscription_churn.idxmax()}")
print(f"- Highest churn rate by device: {device_churn.idxmax()}")
print(f"- Average age of churned users: {df[df['churned']==1]['age'].mean():.1f} years")
churned_listening = df[df['churned']==1]['listening_time'].mean()
print(f"- Average listening time of churned users: {churned_listening:.1f} hours")
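
After exploration, the natural next step in this workflow is to encode the categorical columns, split the data, and fit a Random Forest. The sketch below illustrates that step on a synthetic stand-in frame, so it runs without the CSV. The category values (`Free`, `Premium`, `Mobile`, and so on), the churn rule, and the hyperparameters are assumptions made for illustration, not properties of the Spotify dataset or the author's settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_users = 400

# Synthetic stand-in for the Spotify data (invented values, notebook column names).
demo = pd.DataFrame({
    "subscription_type": rng.choice(["Free", "Premium"], size=n_users),
    "device_type": rng.choice(["Mobile", "Desktop", "Web"], size=n_users),
    "age": rng.integers(16, 60, size=n_users),
    "skip_rate": rng.uniform(0.0, 0.6, size=n_users),
})
# Make churn loosely depend on skip_rate so the model has a signal to learn.
noise = rng.normal(0.0, 0.15, size=n_users)
demo["churned"] = ((demo["skip_rate"] + noise) > 0.35).astype(int)

# One-hot encode the categorical columns; numeric columns pass through unchanged.
X = pd.get_dummies(
    demo.drop(columns="churned"),
    columns=["subscription_type", "device_type"],
)
y = demo["churned"]

# Stratify so the train and test splits keep a similar churn rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print(f"Training rows: {len(X_train)}, test rows: {len(X_test)}")
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```

One-hot encoding is used here instead of `LabelEncoder` because the latter is intended for target labels; either choice works with tree ensembles, so treat this as one reasonable option.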

---

## Summary and Next Steps

### **What You've Accomplished:**
- **Data Pipeline**: Loaded and analyzed real Spotify customer churn data
- **Machine Learning**: Implemented Random Forest classification with scikit-learn's `RandomForestClassifier`
- **Model Evaluation**: Applied comprehensive performance metrics and validation
- **Feature Analysis**: Interpreted model predictions and feature importance
- **Business Intelligence**: Translated ML results into actionable business insights

### **Key Machine Learning Skills Developed:**
1. **Data Preprocessing**: Real-world data cleaning and preparation techniques
2. **Model Training**: Random Forest implementation with hyperparameter tuning
3. **Performance Evaluation**: ROC curves, confusion matrices, cross-validation
4. **Feature Engineering**: Understanding which variables drive customer churn
5. **Business Translation**: Converting model outputs to business recommendations
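
The evaluation techniques named in this list (confusion matrix, ROC AUC, cross-validation) can be sketched as follows. This is a minimal illustration on synthetic, imbalanced data generated with `make_classification`; the sample size, class weights, and hyperparameters are assumptions, and the resulting scores say nothing about performance on the Spotify data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced binary data standing in for the churn features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.75, 0.25], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Hard labels feed the confusion matrix; class-1 probabilities feed ROC AUC.
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)
print("Confusion matrix (rows = actual, columns = predicted):")
print(cm)
print(f"ROC AUC: {auc:.3f}")
print(classification_report(y_test, y_pred, target_names=["Retained", "Churned"]))

# 5-fold cross-validation on the training split shows the spread across folds.
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print(f"CV ROC AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")
```

With an imbalanced target such as churn, accuracy alone can look good while the minority class is missed, which is why the sketch reports ROC AUC and per-class precision and recall.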

### **Spotify Churn Insights Discovered:**
- **High-Risk Segments**: Identified customer profiles most likely to churn
- **Retention Factors**: Determined key features that predict customer loyalty
- **Predictive Power**: Achieved robust classification performance on real data
- **Actionable Intelligence**: Generated specific recommendations for customer retention
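
Identifying the features that drive churn and scoring new customers can be illustrated as below. This is a sketch on synthetic data: the feature names are borrowed from the notebook, but the values are generated by `make_classification`, so the resulting ranking carries no meaning for real Spotify users.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Feature names borrowed from the notebook; the values are synthetic.
feature_names = ["age", "listening_time", "songs_played_per_day",
                 "skip_rate", "ads_listened_per_week"]
X_raw, y = make_classification(
    n_samples=500, n_features=5, n_informative=3,
    n_redundant=0, random_state=1,
)
X = pd.DataFrame(X_raw, columns=feature_names)

# Hold out the last three rows to play the role of new customers.
X_known, y_known = X.iloc[:-3], y[:-3]
new_customers = X.iloc[-3:]

model = RandomForestClassifier(n_estimators=200, random_state=1)
model.fit(X_known, y_known)

# Impurity-based importances: non-negative and normalized to sum to 1.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances.round(3))

# Column 1 of predict_proba is the estimated probability of class 1 (churn).
churn_probability = model.predict_proba(new_customers)[:, 1]
print(churn_probability.round(3))
```

Impurity-based importances can favor high-cardinality or continuous features, so `sklearn.inspection.permutation_importance` is worth considering as a cross-check before drawing business conclusions.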

### **Next Learning Paths:**

#### **Expand Your Machine Learning Expertise:**
- **Advanced ML**: `notebooks/tier2_supervised/Tier2_NeuralNetworks.ipynb` - Deep learning approaches
- **Alternative Models**: `Tier2_LogisticRegression.ipynb`, `Tier2_SVM.ipynb` - Compare techniques
- **Ensemble Methods**: `Tier2_GradientBoosting.ipynb` - Boost your predictions

#### **Explore Other Analytics Domains:**
- **Time Series**: `time_series_example.ipynb` - Forecast customer behavior over time
- **Clustering**: `notebooks/tier4_clustering/Tier4_kMeans.ipynb` - Segment customers
- **Anomaly Detection**: `notebooks/tier6_anomaly/Tier6_IsolationForest.ipynb` - Find unusual patterns

### **Business Applications:**
- **Customer Retention**: Deploy churn models in production systems
- **Risk Assessment**: Apply classification to credit scoring and fraud detection
- **Marketing Optimization**: Target high-value customer segments effectively
- **Product Development**: Use feature importance to guide product improvements

### **Professional Skills Gained:**
- **Industry-Standard Tools**: Scikit-learn, pandas, advanced visualization
- **Model Lifecycle**: End-to-end ML pipeline from data to deployment insights
- **Business Communication**: Translating technical results for stakeholder consumption
- **Performance Monitoring**: Validation techniques used in production ML systems

### **Research and Development:**
- **Model Comparison**: Try different algorithms on the same dataset
- **Feature Engineering**: Create new variables to improve prediction accuracy
- **Hyperparameter Optimization**: Advanced techniques like Bayesian optimization
- **Ensemble Approaches**: Combine multiple models for better performance
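
As a starting point for the hyperparameter work suggested above, the sketch below runs an exhaustive grid search with `GridSearchCV`, which is a simpler technique than Bayesian optimization. The grid values, the synthetic data, and the choice of ROC AUC as the scoring metric are illustrative assumptions, not settings tuned for the Spotify data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary data standing in for the churn features.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=7
)

# A deliberately small, illustrative grid: 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=7),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)

print(f"Best parameters: {search.best_params_}")
print(f"Best mean CV ROC AUC: {search.best_score_:.3f}")
```

Grid search cost grows multiplicatively with each added parameter, so `RandomizedSearchCV` is a common alternative when the search space is larger.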

---

> **Next Recommendation**: Explore `time_series_example.ipynb` to learn forecasting techniques that complement your churn prediction skills!

---

*Congratulations on completing this example! You have practiced the core steps of a churn-prediction workflow that apply to customer analytics and predictive modeling.*