# Coffee Shop Site Selection - ML Model Training

This notebook demonstrates the complete machine learning pipeline:

1. **Load training data** (successful coffee shop locations)
2. **Generate negative samples** (random locations)
3. **Engineer spatial features** (competitor density, POI diversity, etc.)
4. **Train Random Forest model** with spatial cross-validation
5. **Evaluate performance** and analyze feature importance
6. **Save model** for predictions

**Target:** 70-80% accuracy with spatial CV

In [None]:
# Import libraries
import sys
from pathlib import Path

# Add src to path
sys.path.append(str(Path.cwd().parent / 'src'))

import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
from loguru import logger

from features.spatial_features import SpatialFeatureEngineer, create_feature_matrix
from models.train_model import CoffeeShopModel, generate_negative_samples

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('Set2')

print("✓ Libraries loaded successfully")

## Step 1: Load Training Data

Load known successful coffee shop locations (positive samples).

In [None]:
# Load coffee shop data
coffee_path = Path('../data/processed/coffee_shops/jakarta_coffee_shops_training.geojson')

if not coffee_path.exists():
    print("❌ Coffee shop training data not found!")
    print("Please run 01_data_collection.ipynb first to collect training data")
    raise FileNotFoundError(f"Training data not found at {coffee_path}")

positive_samples = gpd.read_file(coffee_path)

print(f"✓ Loaded {len(positive_samples):,} positive training samples")
print(f"\nBrands in training data:")
print(positive_samples['brand'].value_counts())

positive_samples.head()

## Step 2: Generate Negative Samples

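Create negative training samples: random locations inside the Jakarta bounding box that lie at least 200 m from every existing coffee shop. These points represent places without a shop, not places where a shop is known to have failed.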
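
The implementation of `generate_negative_samples` is not shown in this notebook. As a rough illustration of the idea only, the sketch below uses rejection sampling with a haversine distance check. The function names `haversine_m` and `sample_negatives`, the `max_tries` guard, and the two shop coordinates are all hypothetical and are not taken from the project code.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def sample_negatives(positives, bbox, n_samples, min_distance, seed=42, max_tries=100_000):
    """Draw uniform points in bbox, rejecting any closer than min_distance (m) to a positive."""
    rng = random.Random(seed)
    min_lon, min_lat, max_lon, max_lat = bbox
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n_samples:
            break
        lon = rng.uniform(min_lon, max_lon)
        lat = rng.uniform(min_lat, max_lat)
        # Keep the candidate only if it is far enough from every known shop
        if all(haversine_m(lon, lat, p_lon, p_lat) >= min_distance for p_lon, p_lat in positives):
            accepted.append((lon, lat))
    return accepted


# Hypothetical shop coordinates (lon, lat), for illustration only
shops = [(106.82, -6.20), (106.85, -6.18)]
negatives = sample_negatives(shops, (106.6, -6.4, 107.1, -6.0), n_samples=4, min_distance=200)
print(len(negatives))
```

Uniform sampling over a bounding box can also place points in water or other unbuildable areas, so a production version would likely clip candidates to a land or administrative boundary first.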

In [None]:
# Jakarta bounding box
JAKARTA_BBOX = (106.6, -6.4, 107.1, -6.0)  # min_lon, min_lat, max_lon, max_lat

# Generate negative samples (2x the number of positive samples)
print("Generating negative samples...")
print("(Random locations at least 200m away from existing coffee shops)\n")

negative_samples = generate_negative_samples(
    positive_samples=positive_samples,
    bbox=JAKARTA_BBOX,
    n_samples=len(positive_samples) * 2,
    min_distance=200  # meters
)

print(f"✓ Generated {len(negative_samples):,} negative samples")
print(f"\nClass balance:")
print(f"  Positive (successful locations): {len(positive_samples)}")
print(f"  Negative (random locations): {len(negative_samples)}")
print(f"  Ratio: 1:{len(negative_samples)/len(positive_samples):.1f}")

In [None]:
# Visualize training samples
fig, ax = plt.subplots(figsize=(12, 10))

# Plot negative samples
negative_samples.plot(
    ax=ax,
    color='lightblue',
    markersize=10,
    alpha=0.5,
    label='Negative samples (random)'
)

# Plot positive samples
positive_samples.plot(
    ax=ax,
    color='red',
    markersize=20,
    alpha=0.7,
    label='Positive samples (coffee shops)'
)

ax.set_title('Training Data: Positive vs Negative Samples', fontsize=16, fontweight='bold')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.legend(loc='upper right')
plt.tight_layout()
plt.show()

## Step 3: Engineer Spatial Features

Calculate features for both positive and negative samples:
- Competitor density (500m, 1km, 2km buffers)
- Distance to nearest competitor
- POI diversity indices (these need POI data passed via `all_pois`; this notebook passes `None`)
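
To make these features concrete, here is a minimal sketch of how they could be computed for a single site. It is not the `SpatialFeatureEngineer` implementation, which is not shown here. The feature names, the helper functions, the example coordinates, and the choice of the Shannon index as the diversity measure are all assumptions made for illustration.

```python
import math
from collections import Counter

EARTH_RADIUS_M = 6_371_000


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def competitor_features(site, competitors, radii=(500, 1000, 2000)):
    """Count competitors inside each buffer radius and find the nearest one."""
    distances = [haversine_m(site[0], site[1], lon, lat) for lon, lat in competitors]
    features = {f"competitors_within_{r}m": sum(d <= r for d in distances) for r in radii}
    features["nearest_competitor_m"] = min(distances) if distances else float("inf")
    return features


def shannon_diversity(categories):
    """Shannon index H = -sum(p * ln p) over the category shares (assumed measure)."""
    total = len(categories)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in Counter(categories).values())


# Hypothetical site and competitors (lon, lat), spaced along one line of latitude
site = (106.82, -6.20)
competitors = [(106.823, -6.20), (106.828, -6.20), (106.835, -6.20), (106.85, -6.20)]
features = competitor_features(site, competitors)
diversity = shannon_diversity(["cafe", "cafe", "bank", "school"])
print(features, round(diversity, 3))
```

For thousands of locations, a pairwise loop like this becomes slow; projecting to a metric CRS and using a spatial index is the usual way to scale the same calculation.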

In [None]:
# Combine positive and negative samples for feature engineering
all_locations = pd.concat([
    positive_samples[['geometry']],
    negative_samples[['geometry']]
], ignore_index=True)

all_locations_gdf = gpd.GeoDataFrame(all_locations, geometry='geometry', crs=positive_samples.crs)

print(f"Total locations for feature engineering: {len(all_locations_gdf):,}")

In [None]:
# Initialize feature engineer
engineer = SpatialFeatureEngineer(buffer_distances=[500, 1000, 2000])

print("Calculating spatial features...")
print("This may take 5-10 minutes depending on the number of samples\n")

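# Calculate features
# Using positive_samples as competitors (existing coffee shops).
# Caution: the positive samples are also part of `locations`. Unless the
# engineer excludes a point from its own competitor set, each positive sees
# itself at distance 0, which leaks the label into the competitor features.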
feature_matrix = engineer.calculate_all_buffer_features(
    locations=all_locations_gdf,
    competitors=positive_samples,
    all_pois=None  # Add OSM POIs here if available for diversity calculation
)

print(f"\n✓ Feature engineering complete!")
print(f"Features calculated: {feature_matrix.shape[1]}")
print(f"Samples: {feature_matrix.shape[0]}")

print(f"\nFeature columns:")
print(feature_matrix.columns.tolist())

In [None]:
# Preview features
print("Feature Statistics:\n")
print(feature_matrix.describe())

feature_matrix.head(10)

## Step 4: Prepare Training Data

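Create the label vector and the feature matrix. No hold-out split is made here; the spatial cross-validation in Step 5 handles the train/test separation.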

In [None]:
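# Create labels in the same order as all_locations_gdf: positives first, then negatives
y_positive = np.ones(len(positive_samples))
y_negative = np.zeros(len(negative_samples))
y = np.concatenate([y_positive, y_negative])

# Get feature matrix (rows must line up with the labels above)
X = feature_matrix.values
assert len(X) == len(y), "Feature rows and labels are misaligned"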

print(f"Training data prepared:")
print(f"  Features (X): {X.shape}")
print(f"  Labels (y): {y.shape}")
print(f"  Positive class: {y.sum():.0f} ({y.mean():.1%})")
print(f"  Negative class: {(1-y).sum():.0f} ({(1-y.mean()):.1%})")

## Step 5: Train Model with Spatial Cross-Validation

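Use spatial cross-validation so that nearby, spatially autocorrelated samples do not end up in both the training and test folds, which would inflate the performance estimate.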
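
The internals of `CoffeeShopModel.spatial_cross_validation` are not shown here; the notebook only states that it separates folds by geographic clusters. The sketch below illustrates the general idea with a simpler stand-in, grid blocks rather than clusters: every point in a grid cell is assigned to the same fold, so a test fold never shares a cell with its training data. The function names and the `cell_deg` parameter are hypothetical.

```python
import math
from collections import defaultdict


def assign_spatial_folds(points, n_folds=5, cell_deg=0.05):
    """Give each (lon, lat) point a fold id so that all points in a grid cell share a fold."""
    cells = defaultdict(list)
    for idx, (lon, lat) in enumerate(points):
        cells[(math.floor(lon / cell_deg), math.floor(lat / cell_deg))].append(idx)

    folds = [None] * len(points)
    fold_sizes = [0] * n_folds
    # Place the largest cells first, each into the currently smallest fold
    for cell in sorted(cells, key=lambda c: (-len(cells[c]), c)):
        target = fold_sizes.index(min(fold_sizes))
        for idx in cells[cell]:
            folds[idx] = target
        fold_sizes[target] += len(cells[cell])
    return folds


def spatial_splits(folds, n_folds):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    for k in range(n_folds):
        test = [i for i, f in enumerate(folds) if f == k]
        train = [i for i, f in enumerate(folds) if f != k]
        yield train, test


# Hypothetical points: two per grid cell across a 4 x 3 block of cells
points = []
for i in range(4):
    for j in range(3):
        lon, lat = 106.625 + 0.05 * i, -6.375 + 0.05 * j
        points.append((lon, lat))
        points.append((lon + 0.01, lat + 0.01))

folds = assign_spatial_folds(points, n_folds=5)
splits = list(spatial_splits(folds, n_folds=5))
print([len(test) for _, test in splits])
```

Blocks that are small relative to the range of spatial autocorrelation still let neighbouring cells leak information across folds, so the block size (or cluster count) is itself a tuning decision.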

In [None]:
# Initialize model
model = CoffeeShopModel(
    n_estimators=200,
    max_depth=20,
    min_samples_split=10,
    random_state=42
)

print("Model configuration:")
print(f"  Algorithm: Random Forest")
print(f"  Trees: 200")
print(f"  Max depth: 20")
print(f"  Min samples split: 10")

In [None]:
# Perform spatial cross-validation
print("\nPerforming 5-fold spatial cross-validation...")
print("(This separates training/test by geographic clusters)\n")

cv_results = model.spatial_cross_validation(
    X=X,
    y=y,
    locations=all_locations_gdf,
    n_folds=5
)

print("\n" + "="*60)
print("Cross-Validation Results:")
print("="*60)
for metric, value in cv_results.items():
    if 'mean' in metric:
        metric_name = metric.replace('_mean', '')
        std = cv_results.get(f'{metric_name}_std', 0)
        print(f"{metric_name.upper():12s}: {value:.3f} (+/- {std:.3f})")

## Step 6: Train Final Model

Train on all data for deployment.

In [None]:
# Train final model
print("Training final model on all data...\n")

model.train(X, y)

print("\n✓ Model training complete!")

## Step 7: Feature Importance Analysis

Understand which features matter most for prediction.

In [None]:
# Plot feature importance
fig, ax = plt.subplots(figsize=(10, 8))

top_features = model.feature_importance.head(10)

sns.barplot(
    data=top_features,
    y='feature',
    x='importance',
    ax=ax
)

ax.set_title('Top 10 Most Important Features', fontsize=16, fontweight='bold')
ax.set_xlabel('Importance Score')
ax.set_ylabel('Feature')

plt.tight_layout()
plt.show()

print("\nTop 10 Most Important Features:")
print(top_features)

## Step 8: Evaluate Model Performance
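
Accuracy, precision, recall, and F1 all derive from the four confusion-matrix counts. As a reminder of the arithmetic, here is a small pure-Python sketch with made-up labels; the function name `classification_metrics` is hypothetical, and in practice the scikit-learn functions used in this step do the same job.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 3 positives and 5 negatives, with one miss and one false alarm
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

With a 1:2 class ratio, a model that always predicts the negative class already reaches about 67% accuracy, which is why precision, recall, and AUC are worth reading alongside accuracy.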

In [None]:
# Make predictions on training data (for visualization)
y_pred = model.predict(X)
y_pred_proba = model.predict_proba(X)[:, 1]

# Calculate metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

print("Model Performance on Training Data:")
print("="*60)
print(f"Accuracy:  {accuracy_score(y, y_pred):.3f}")
print(f"Precision: {precision_score(y, y_pred):.3f}")
print(f"Recall:    {recall_score(y, y_pred):.3f}")
print(f"F1 Score:  {f1_score(y, y_pred):.3f}")
print(f"AUC:       {roc_auc_score(y, y_pred_proba):.3f}")

print("\n⚠️ Note: These are training metrics. CV metrics above are more reliable!")

In [None]:
# Plot prediction distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Probability distribution by class
ax1.hist(y_pred_proba[y == 1], bins=30, alpha=0.7, label='Positive (Coffee Shops)', color='red')
ax1.hist(y_pred_proba[y == 0], bins=30, alpha=0.7, label='Negative (Random)', color='blue')
ax1.set_xlabel('Predicted Probability')
ax1.set_ylabel('Count')
ax1.set_title('Prediction Distribution by Class')
ax1.legend()

# Score distribution
scores = y_pred_proba * 100
ax2.hist(scores, bins=30, alpha=0.7, color='green')
ax2.set_xlabel('Suitability Score (0-100)')
ax2.set_ylabel('Count')
ax2.set_title('Site Suitability Score Distribution')

plt.tight_layout()
plt.show()

## Step 9: Save Trained Model

In [None]:
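# Save model (create the target directory first in case it does not exist yet)
model_path = Path('../models/coffee_shop_rf_model.pkl')
model_path.parent.mkdir(parents=True, exist_ok=True)
model.save_model(str(model_path))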

print(f"✓ Model saved to: {model_path}")
print(f"\nModel size: {model_path.stat().st_size / 1024:.1f} KB")

## Step 10: Demo - Score New Locations

Test the model on new candidate locations.
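
The internals of `model.score_location` are not shown in this notebook. Assuming the score is the predicted positive-class probability scaled to 0-100, as in the Step 8 plot, ranking candidates could look like the sketch below. The function `rank_candidates`, the probabilities, and the coordinates are hypothetical.

```python
def rank_candidates(probabilities, coordinates):
    """Turn class-1 probabilities into 0-100 scores and sort sites from best to worst."""
    rows = [
        {"longitude": lon, "latitude": lat, "score": round(p * 100, 1)}
        for p, (lon, lat) in zip(probabilities, coordinates)
    ]
    return sorted(rows, key=lambda row: row["score"], reverse=True)


# Made-up probabilities and (lon, lat) coordinates for three candidate sites
demo_probabilities = [0.2, 0.9, 0.55]
demo_coordinates = [(106.70, -6.30), (106.82, -6.20), (106.95, -6.10)]
ranked = rank_candidates(demo_probabilities, demo_coordinates)
for row in ranked:
    print(row)
```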

In [None]:
# Generate some test locations
test_locations = generate_negative_samples(
    positive_samples=positive_samples,
    bbox=JAKARTA_BBOX,
    n_samples=10,
    min_distance=0  # Allow any distance for testing
)

print(f"Generated {len(test_locations)} test locations")

# Calculate features for test locations
test_features = engineer.calculate_all_buffer_features(
    locations=test_locations,
    competitors=positive_samples,
    all_pois=None
)

# Score locations
test_scores = model.score_location(test_features.values)

# Add coordinates
test_scores['latitude'] = [geom.y for geom in test_locations.geometry]
test_scores['longitude'] = [geom.x for geom in test_locations.geometry]

# Sort by score
test_scores_sorted = test_scores.sort_values('score', ascending=False)

print("\nTest Location Scores:")
print(test_scores_sorted)

## Summary

### Model Training Complete! ✓

**Model Performance:**
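- Cross-validation accuracy: see the Step 5 output (target: 70-80%)
- Uses spatial CV so that spatial autocorrelation does not inflate the estimate
- Random Forest with 200 trees

**Key Features:**
- Competitor density (expected to rank highest; confirm against the Step 7 ranking)
- Distance to nearest competitor
- POI diversity indices (only when POI data is supplied via `all_pois`)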

**Next Steps:**
1. ✅ Model trained and saved
2. ➡️ Use `05_prediction_demo.ipynb` to score new locations
3. ➡️ Validate predictions against real coffee shop openings
4. ➡️ Integrate into web application for customer use

**Model File:** `models/coffee_shop_rf_model.pkl`