# Jakarta Coffee Shop Site Selection - Data Collection

This notebook demonstrates the complete data collection pipeline for the location intelligence MVP.

## Data Sources (free or covered by free tiers):
1. **OpenStreetMap** - POIs, roads, buildings
2. **BPS API** - Demographics and census data
3. **Foursquare Open Source** - 8M Indonesian POIs
4. **Google Places API** - Coffee shop locations (training data)

## Setup Requirements:
- Python 3.10+
- PostgreSQL + PostGIS
- API keys (BPS, optional: Foursquare, Google Places)
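
Before running the collectors, it can help to confirm which keys are actually present. The following is a minimal sketch, assuming the `.env` values have already been loaded into the process environment (for example by a loader such as python-dotenv, which is an assumption here and not confirmed by this notebook). The helper name `missing_keys` is hypothetical; the key names are the ones this notebook prints in its `.env` hints.

```python
import os

# Key names match the .env entries suggested later in this notebook.
REQUIRED_KEYS = ["BPS_API_KEY"]
OPTIONAL_KEYS = ["GOOGLE_PLACES_API_KEY", "FOURSQUARE_API_KEY"]


def missing_keys(names, env=None):
    """Return the names that are unset or blank in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


print("Missing required keys:", missing_keys(REQUIRED_KEYS))
print("Missing optional keys:", missing_keys(OPTIONAL_KEYS))
```

An empty list for the required keys means the BPS step should be able to run; missing optional keys only affect the coffee shop collection step.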

In [None]:
# Import libraries
import sys
from pathlib import Path

# Add src to path
sys.path.append(str(Path.cwd().parent / 'src'))

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import folium
from loguru import logger

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-darkgrid')

print("✓ Libraries loaded successfully")

## Step 1: Initialize Database

Create PostgreSQL database with PostGIS extension.
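
The internals of `DatabaseInitializer` are not shown in this notebook. As a rough sketch of what enabling PostGIS usually involves, the hypothetical helper `enable_postgis` below runs the standard `CREATE EXTENSION` statement on any open DB-API connection (for example one from psycopg2, which is an assumption) and returns the reported PostGIS version.

```python
# Standard PostgreSQL/PostGIS statements; nothing here is specific to this project.
POSTGIS_SETUP_SQL = ["CREATE EXTENSION IF NOT EXISTS postgis;"]
POSTGIS_CHECK_SQL = "SELECT PostGIS_Version();"


def enable_postgis(conn):
    """Enable PostGIS on an open DB-API connection and return its version string."""
    cur = conn.cursor()
    try:
        for statement in POSTGIS_SETUP_SQL:
            cur.execute(statement)
        # Querying the version confirms the extension is usable.
        cur.execute(POSTGIS_CHECK_SQL)
        version = cur.fetchone()[0]
    finally:
        cur.close()
    conn.commit()
    return version
```

Creating the extension normally requires a role with sufficient privileges on the target database, so a failure here often points to permissions and not to a missing PostgreSQL server.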

In [None]:
from data.init_db import DatabaseInitializer

# Initialize database
db_init = DatabaseInitializer()
success = db_init.initialize()

if success:
    print("\n✓ Database initialized successfully")
else:
    print("\n✗ Database initialization failed - check PostgreSQL is running")

## Step 2: Collect OpenStreetMap Data

Download POIs and road network for Jakarta using OSMnx.
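
`OSMCollector` wraps the download, and its implementation is not shown here. For orientation, a direct OSMnx call might look roughly like the sketch below. The function name `fetch_osm_layers`, the place string, and the tag selection are assumptions for illustration; `features_from_place` requires a reasonably recent OSMnx release (older versions used a different function name).

```python
# Tag filter assumed for illustration: every OSM feature carrying an "amenity" tag.
POI_TAGS = {"amenity": True}


def fetch_osm_layers(place="Jakarta, Indonesia", tags=None, network_type="drive"):
    """Download POI features and a road graph for a place name via OSMnx."""
    # Imported lazily: needs the osmnx package and network access to run.
    import osmnx as ox

    tags = POI_TAGS if tags is None else tags
    pois = ox.features_from_place(place, tags=tags)                 # GeoDataFrame of features
    roads = ox.graph_from_place(place, network_type=network_type)  # networkx MultiDiGraph
    return pois, roads
```

Features returned this way can include polygons (for example building footprints tagged as amenities) as well as points, which matters for any later code that reads `geometry.x` and `geometry.y`.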

In [None]:
from data.collect_osm import OSMCollector

# Initialize collector
osm_collector = OSMCollector(data_dir='../data')

# Collect data (this may take 5-10 minutes)
print("Collecting OSM data... (this may take several minutes)")
osm_results = osm_collector.collect_all(use_osmnx=True)

print("\n" + "="*60)
print("OSM Data Collection Results:")
print("="*60)
for key, path in osm_results.items():
    print(f"{key}: {path}")

In [None]:
# Load and preview OSM POIs
osm_pois_path = Path(osm_results['pois'])
osm_pois = gpd.read_file(osm_pois_path)

print(f"OSM POIs loaded: {len(osm_pois):,} points")
print("\nPOI Categories:")
print(osm_pois['amenity'].value_counts().head(10))

osm_pois.head()

## Step 3: Collect BPS Demographic Data

Fetch demographic data from Indonesian Statistics Bureau.

In [None]:
from data.collect_bps import BPSCollector

# Initialize collector
bps_collector = BPSCollector(data_dir='../data')

# Check if API key is configured
if bps_collector.api_key:
    print("Collecting BPS data...")
    bps_results = bps_collector.collect_all()
    
    print("\n" + "="*60)
    print("BPS Data Collection Results:")
    print("="*60)
    for key, path in bps_results.items():
        print(f"{key}: {path}")
else:
    print("⚠️ BPS API key not configured")
    print("Get free API key from: https://webapi.bps.go.id/developer/")
    print("Add to .env file: BPS_API_KEY=your_key")

## Step 4: Collect Coffee Shop Training Data

Collect locations of successful coffee shop chains in Jakarta.
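
How `CoffeeShopCollector` queries its sources is not shown in this notebook. As one possible illustration, the sketch below builds a request URL for the legacy Google Places Nearby Search endpoint. The helper name `nearby_search_url`, the `"coffee"` keyword, and the 2 km radius are assumptions, not the collector's actual settings.

```python
from urllib.parse import urlencode

# Legacy Places API Nearby Search endpoint (JSON output).
NEARBY_SEARCH_URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"


def nearby_search_url(lat, lon, api_key, keyword="coffee", radius_m=2000):
    """Build a Nearby Search URL for places matching a keyword around a point."""
    params = {
        "location": f"{lat},{lon}",  # latitude,longitude
        "radius": radius_m,          # search radius in meters
        "keyword": keyword,
        "key": api_key,
    }
    return f"{NEARBY_SEARCH_URL}?{urlencode(params)}"


# Example: a search centered on the Jakarta coordinates used for the map in Step 5.
print(nearby_search_url(-6.2088, 106.8456, api_key="YOUR_KEY"))
```

Because each request covers only one circle, covering a whole city generally means issuing many such requests over a grid of center points, which is one reason collection time depends on API limits.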

In [None]:
from data.collect_coffee_shops import CoffeeShopCollector

# Initialize collector
coffee_collector = CoffeeShopCollector(data_dir='../data')

# Check if API keys are configured
if coffee_collector.google_api_key or coffee_collector.foursquare_api_key:
    print("Collecting coffee shop locations...")
    print("This may take 5-10 minutes depending on API limits\n")
    
    coffee_results = coffee_collector.collect_all()
    
    if coffee_results:
        print("\n" + "="*60)
        print("Coffee Shop Collection Results:")
        print("="*60)
        for key, value in coffee_results.items():
            print(f"{key}: {value}")
else:
    print("⚠️ No API keys configured for coffee shop collection")
    print("\nOptions:")
    print("1. Google Places API: https://console.cloud.google.com/ ($200 free credit)")
    print("2. Foursquare API: https://foursquare.com/developers/apps (10k free calls)")
    print("\nAdd to .env file:")
    print("GOOGLE_PLACES_API_KEY=your_key")
    print("FOURSQUARE_API_KEY=your_key")

In [None]:
# Load and preview coffee shop data (if collected)
coffee_csv = Path('../data/processed/coffee_shops/jakarta_coffee_shops_training.csv')

if coffee_csv.exists():
    coffee_df = pd.read_csv(coffee_csv)
    
    print(f"Coffee shops collected: {len(coffee_df):,}")
    print("\nBy brand:")
    print(coffee_df['brand'].value_counts())
    
    print("\nBy source:")
    print(coffee_df['source'].value_counts())
    
    # A bare expression inside an if-block is not rendered, so display it explicitly
    display(coffee_df.head())
else:
    print("No coffee shop data collected yet")

## Step 5: Visualize Collected Data

Create interactive map showing all collected data.

In [None]:
# Create interactive map
jakarta_center = [-6.2088, 106.8456]
m = folium.Map(location=jakarta_center, zoom_start=11)

# Add coffee shops (if available); check the GeoJSON itself, since that is the file being read
coffee_geojson = coffee_csv.parent / 'jakarta_coffee_shops_training.geojson'
if coffee_geojson.exists():
    coffee_gdf = gpd.read_file(coffee_geojson)
    
    for idx, row in coffee_gdf.iterrows():
        folium.CircleMarker(
            location=[row.geometry.y, row.geometry.x],
            radius=3,
            color='red',
            fill=True,
            popup=f"{row['brand']}<br>{row['name']}"
        ).add_to(m)
    
    print(f"✓ Added {len(coffee_gdf)} coffee shops to map")

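# Add OSM POIs (sample); keep point geometries only, since .x/.y are undefined for polygons
if osm_pois_path.exists():
    osm_points = osm_pois[osm_pois.geom_type == 'Point']
    osm_sample = osm_points.sample(min(100, len(osm_points)))
    
    for idx, row in osm_sample.iterrows():
        name = row.get('name')
        folium.CircleMarker(
            location=[row.geometry.y, row.geometry.x],
            radius=2,
            color='blue',
            fill=True,
            # Missing names load as NaN, so fall back to a readable label
            popup=name if isinstance(name, str) else 'Unknown'
        ).add_to(m)
    
    print(f"✓ Added {len(osm_sample)} OSM POIs to map (sample)")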

print("\n📍 Interactive map created")
m

## Summary

### Data Collection Complete!

**Next Steps:**
1. ✅ Data collected and validated
2. ➡️ Move to `02_eda_jakarta.ipynb` for exploratory analysis
3. ➡️ Then `03_feature_engineering.ipynb` to build ML features
4. ➡️ Finally `04_model_training.ipynb` to train the prediction model
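
The cells above preview the collected data but do not run an explicit validation check. One simple sanity check, sketched here under assumptions, is to flag coordinates that fall outside a rough bounding box around Jakarta. The box values and the helper name `outside_bbox` are illustrative only and do not represent an official boundary.

```python
# Rough box around mainland Jakarta (an assumption for illustration, not an official boundary).
JAKARTA_BBOX = {"min_lat": -6.40, "max_lat": -6.05, "min_lon": 106.65, "max_lon": 107.00}


def outside_bbox(points, bbox=JAKARTA_BBOX):
    """Return the (lat, lon) pairs that fall outside the bounding box."""
    return [
        (lat, lon)
        for lat, lon in points
        if not (
            bbox["min_lat"] <= lat <= bbox["max_lat"]
            and bbox["min_lon"] <= lon <= bbox["max_lon"]
        )
    ]


# The map center used in Step 5 lies inside the box; (0.0, 0.0) does not.
print(outside_bbox([(-6.2088, 106.8456), (0.0, 0.0)]))
```

Points flagged this way are worth inspecting before the feature engineering notebook, since swapped latitude/longitude values and geocoding errors tend to show up as far-away outliers.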

### Cost Summary:
- OSM data: **Free**
- BPS API: **Free**
- Foursquare: **Free** (10k calls)
- Google Places: **~$0-20** (covered by $200 free credit)
- **Total: $0** for this data collection phase