## **Galactic Empire Cost of Living Prediction**

### **Your Mission**
As a newly recruited Rebel intelligence officer, you’ve been granted access to highly classified mobility data intercepted from the Empire’s network. Your task is to analyze this information, predicting the true cost of living for regions across the galaxy. Each prediction could be a turning point in the struggle against Imperial oppression.

The data is encrypted using the Galactic Council's powerful H3 geospatial grid—an advanced system known for its precision. With the clock ticking, you must decode the movement patterns, extract insights, and accurately estimate the cost of living in each sector before the Empire discovers your mission.

Only the most skilled and insightful intelligence officers will succeed. The competition is fierce, and only those who can wield the power of data with precision and creativity will prevail.

### **The Tools of Rebellion**
The Rebellion’s intelligence arm has provided you with a set of encrypted data files. These are the tools you’ll need to expose the Empire’s deception:

- **train.csv**: Data from sectors already liberated by the Rebellion. Each entry contains precise coordinates and the true cost of living, encrypted by the Galactic Council's H3 technology.
- **test.csv**: Sectors still under Imperial control, awaiting your predictions to reveal the hidden cost of living.
- **mobility.parquet**: A data set containing the secret movement patterns of operatives, smugglers, and civilians across the galaxy. These movements hold the clues to understanding the local economies of each region.

### **The Power of Prediction**
The Rebellion isn't just looking for predictions—they’re seeking insights. Your mission is not only to estimate the cost of living but to transform chaos into clarity. The Empire’s secrets are buried in raw data, waiting for you to uncover the truth. Use creativity, strategic feature engineering, and any additional data you can find to strengthen the Rebellion’s cause.


#### **Import Required Libraries**

In [34]:
# Import libraries for data processing, geospatial conversion, and machine learning
import pandas as pd
import numpy as np
import h3
import geopandas as gpd
import pyarrow.parquet as pq
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load and Explore Train and Test Data
# Define the dataset path
DATA_PATH = r"\alt-score-data-science-competition"

# Load training data (contains hex_id and cost_of_living values)
train = pd.read_csv(f"{DATA_PATH}\\train.csv")

# Load test data (contains hex_id values for prediction)
test = pd.read_csv(f"{DATA_PATH}\\test.csv")

In [35]:
# Check resolution of hex_id in train and test data to ensure consistency
example_hex = train["hex_id"].iloc[0]
desired_resolution = h3.get_resolution(example_hex)
print(f"Resolution of train/test hex_id: {desired_resolution}")

Resolution of train/test hex_id: 8


#### **Process Mobility Data in Chunks**

In [36]:
# Open the large mobility Parquet file; rows are read in batches below to save memory
mobility = pq.ParquetFile(f"{DATA_PATH}\\mobility_data.parquet")
chunk_size = 500_000    # Number of rows to process at a time

# Dictionary to store aggregated features for each hex_id
hex_features = {}

for batch in mobility.iter_batches(batch_size=chunk_size):
    # Convert batch to a pandas DataFrame
    chunk_df = batch.to_pandas()
    
    # Convert lat/lon to H3 hexagons with matching resolution
    chunk_df["hex_id"] = chunk_df.apply(lambda row: h3.latlng_to_cell(row["lat"], row["lon"], desired_resolution), axis=1)
    
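    # Aggregate features per hex_id within this chunk
    grouped = chunk_df.groupby("hex_id").agg(
        device_count=("device_id", "nunique"),  # Unique devices per hex in this chunk
        visit_count=("device_id", "count"),      # Rows (visits) per hex in this chunk
        avg_time=("timestamp", "mean")           # Mean timestamp per hex in this chunk
    ).reset_index()

    # Add this chunk's aggregates to the running totals in hex_features.
    # Caveat: because values are summed across chunks, device_count counts a device
    # once per chunk it appears in, and avg_time ends up as a sum of per-chunk means
    # rather than a true average.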
    for _, row in grouped.iterrows():
        hex_id = row["hex_id"]
        if hex_id not in hex_features:
            hex_features[hex_id] = row[1:].values
        else:
            hex_features[hex_id] += row[1:].values
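
Summing per-chunk aggregates is only exact for `visit_count`: a device that appears in two chunks is counted twice, and adding per-chunk means does not give an overall mean. The following is a minimal sketch of an exact alternative, not the notebook's method. It assumes `timestamp` is numeric (which the `+=` accumulation already requires), and `aggregate_chunks` and `toy_chunks` are hypothetical names with made-up data standing in for the Parquet batches.

```python
from collections import defaultdict

import pandas as pd


def aggregate_chunks(chunks):
    """Combine per-chunk data into exact per-hex features.

    `chunks` is any iterable of DataFrames with the columns
    hex_id, device_id and timestamp (timestamp assumed numeric).
    """
    devices = defaultdict(set)      # hex_id -> set of device_ids seen so far
    visits = defaultdict(int)       # hex_id -> running row count
    time_sum = defaultdict(float)   # hex_id -> running sum of timestamps

    for chunk_df in chunks:
        for hex_id, group in chunk_df.groupby("hex_id"):
            devices[hex_id].update(group["device_id"])
            visits[hex_id] += len(group)
            time_sum[hex_id] += float(group["timestamp"].sum())

    rows = [
        {
            "hex_id": hex_id,
            "device_count": len(devices[hex_id]),
            "visit_count": visits[hex_id],
            "avg_time": time_sum[hex_id] / visits[hex_id],
        }
        for hex_id in visits
    ]
    return pd.DataFrame(rows)


# Toy data (illustrative only): device "a" appears in both chunks for hex "h1".
toy_chunks = [
    pd.DataFrame({"hex_id": ["h1", "h1"], "device_id": ["a", "b"], "timestamp": [10, 20]}),
    pd.DataFrame({"hex_id": ["h1", "h2"], "device_id": ["a", "c"], "timestamp": [30, 40]}),
]
features = aggregate_chunks(toy_chunks)
print(features)
```

Keeping a set of device IDs per hex costs more memory than a running count, so on a very large file this trade-off would need checking before use.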

#### **Convert Aggregated Features to DataFrame**

In [55]:
# Convert feature dictionary to DataFrame
mobility_features_df = pd.DataFrame.from_dict(hex_features, orient='index', columns=["device_count", "visit_count", "avg_time"]).reset_index()
mobility_features_df.rename(columns={"index": "hex_id"}, inplace=True)

#### **Debugging Hex ID Mismatches**

In [56]:
# Compare unique hex_id values across train, test, and mobility datasets
train_hex_ids = set(train["hex_id"].unique())
test_hex_ids = set(test["hex_id"].unique())
mobility_hex_ids = set(mobility_features_df["hex_id"].unique())

# Identify missing hex_ids
missing_train_hex = train_hex_ids - mobility_hex_ids
missing_test_hex = test_hex_ids - mobility_hex_ids
print(f"Hex IDs missing in mobility features (train): {len(missing_train_hex)}")
print(f"Hex IDs missing in mobility features (test): {len(missing_test_hex)}")

Hex IDs missing in mobility features (train): 124
Hex IDs missing in mobility features (test): 155


#### **Merge Train Data with Mobility Features**

In [64]:
# Join train data with mobility features and fill missing values with 0
train_merged = train.merge(mobility_features_df, on="hex_id", how="left").fillna(0)
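
Filling with 0 makes a hex that has no mobility records look identical to one with genuinely zero activity. One possible variation, sketched here on made-up frames, records which rows had no match before filling. The `has_mobility` column and the `*_toy` frames are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for train and mobility_features_df (values are made up).
train_toy = pd.DataFrame(
    {"hex_id": ["h1", "h2", "h3"], "cost_of_living": [0.4, 0.7, 0.5]}
)
mobility_toy = pd.DataFrame(
    {
        "hex_id": ["h1", "h3"],
        "device_count": [12, 3],
        "visit_count": [40, 5],
        "avg_time": [1.5, 2.5],
    }
)

# indicator=True adds a "_merge" column saying whether each row found a match.
merged = train_toy.merge(mobility_toy, on="hex_id", how="left", indicator=True)

# Hypothetical extra feature: 1 if the hex had mobility data, 0 otherwise.
merged["has_mobility"] = (merged["_merge"] == "both").astype(int)
merged = merged.drop(columns="_merge")

# Fill only the mobility feature columns, leaving other columns untouched.
feature_cols = ["device_count", "visit_count", "avg_time"]
merged[feature_cols] = merged[feature_cols].fillna(0)
print(merged)
```

If a flag like this were used, the same column would have to be built for the test set so both frames keep identical features.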

#### **Train Model using Random Forest Regressor**

In [58]:
# Separate features (X) and target (y) variables
X = train_merged.drop(columns=["cost_of_living", "hex_id"])
y = train_merged["cost_of_living"]

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Regressor
model_rf = RandomForestRegressor(n_estimators=100, random_state=42)
model_rf.fit(X_train, y_train)
y_pred = model_rf.predict(X_val)

In [59]:
# Evaluate the model using Mean Squared Error
mse_rf = mean_squared_error(y_val, y_pred)
print(f"Validation MSE: {mse_rf}")

Validation MSE: 0.044339250462866006
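
A single 80/20 split gives one MSE value that depends on which rows happened to land in the validation set. As a rough sketch of a more stable estimate, k-fold cross-validation averages the error over several splits. The data below is synthetic (random numbers, not the competition data), and `X_demo`, `y_demo` and the fold count are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X and y (random numbers, not competition data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn maximizes scores, so MSE is exposed as its negative.
neg_mse = cross_val_score(
    model, X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
)
fold_mse = -neg_mse

print(f"MSE per fold: {fold_mse}")
print(f"Mean MSE: {fold_mse.mean():.4f} (std {fold_mse.std():.4f})")
```

The spread across folds gives a sense of how much a single validation number could move by chance.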


#### **Make Predictions on Test Data with RandomForestRegressor**

In [60]:
# Join test data with mobility features and fill missing values with 0
test_merged = test.merge(mobility_features_df, on="hex_id", how="left").fillna(0)

In [61]:
# Ensure test_X has the same features as X_train
test_X = test_merged.drop(columns=["hex_id"], errors='ignore')
test_X = test_X.reindex(columns=X_train.columns, fill_value=0)

In [62]:
# Generate predictions for the test data
test["cost_of_living"] = model_rf.predict(test_X)

#### **Save Predictions to CSV**

In [63]:
# Save the predictions to a submission file
test.to_csv("submission_rf.csv", index=False)
print("Predictions saved")

Predictions saved
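
Before uploading, it can help to sanity-check the file that was just written. The sketch below assumes the submission needs a `hex_id` column and a `cost_of_living` column, which is what the notebook saves. The competition's actual format is not stated here, and `check_submission` is a hypothetical helper.

```python
import pandas as pd


def check_submission(submission, expected_hex_ids, target="cost_of_living"):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    for col in ("hex_id", target):
        if col not in submission.columns:
            problems.append(f"missing column: {col}")
    if problems:
        return problems
    if submission[target].isna().any():
        problems.append("target contains NaN")
    if submission["hex_id"].duplicated().any():
        problems.append("duplicate hex_id values")
    if set(submission["hex_id"]) != set(expected_hex_ids):
        problems.append("hex_id values differ from the test set")
    return problems


# Toy example with made-up IDs and values.
good = pd.DataFrame({"hex_id": ["h1", "h2"], "cost_of_living": [0.4, 0.6]})
bad = pd.DataFrame({"hex_id": ["h1", "h1"], "cost_of_living": [0.4, None]})
print(check_submission(good, ["h1", "h2"]))
print(check_submission(bad, ["h1", "h2"]))
```

In the notebook this could be called on `pd.read_csv("submission_rf.csv")` with the `hex_id` values from `test`.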


#### **Train Model using XGBoost Regressor**

In [65]:
# Import corresponding library
from xgboost import XGBRegressor

# Separate features (X) and target (y) variables
X = train_merged.drop(columns=["cost_of_living", "hex_id"])
y = train_merged["cost_of_living"]

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an XGBoost Regressor
model_xg = XGBRegressor(n_estimators=200, learning_rate=0.1, random_state=42)
model_xg.fit(X_train, y_train)
y_pred = model_xg.predict(X_val)

In [66]:
# Evaluate the model using Mean Squared Error
mse_xg = mean_squared_error(y_val, y_pred)
print(f"Validation MSE: {mse_xg}")

Validation MSE: 0.05399483775948132
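
On this split, the XGBoost validation MSE is higher than the Random Forest value reported earlier. Taking the square root puts both errors back in the units of `cost_of_living`, which can be easier to interpret. This is a small arithmetic sketch using only the two MSE values printed in this notebook.

```python
import math

# Validation MSE values reported by the two models in this notebook.
mse_rf = 0.044339250462866006
mse_xg = 0.05399483775948132

# RMSE is the square root of MSE, expressed in the target's own units.
rmse_rf = math.sqrt(mse_rf)
rmse_xg = math.sqrt(mse_xg)

print(f"Random Forest RMSE: {rmse_rf:.4f}")
print(f"XGBoost RMSE: {rmse_xg:.4f}")
print(f"Relative MSE increase for XGBoost: {(mse_xg / mse_rf - 1):.1%}")
```

Both numbers come from the same single split with default-like settings, so the gap is a rough indication, not a definitive ranking of the two models.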


#### **Make Predictions on Test Data with XGBoost**

In [67]:
# Join test data with mobility features and fill missing values with 0
test_merged = test.merge(mobility_features_df, on="hex_id", how="left").fillna(0)

# Ensure test_X has the same features as X_train
test_X = test_merged.drop(columns=["hex_id"], errors='ignore')
test_X = test_X.reindex(columns=X_train.columns, fill_value=0)

# Generate predictions for the test data
test["cost_of_living"] = model_xg.predict(test_X)

#### **Save Predictions to CSV**

In [68]:
# Save the predictions to a submission file
test.to_csv("submission_xg.csv", index=False)
print("Predictions saved")

Predictions saved
