# Yango Accra Mobility Prediction - Starter Notebook

This notebook provides a baseline implementation for the Yango Accra Mobility Prediction challenge.

## Objective
Predict ride travel times in Accra, Ghana using trip data and weather conditions.

## Dataset
- Training data: 57,596 trips
- Test data: 24,686 trips  
- Weather data: Hourly weather information for May 2024
- Target variable: Trip duration in minutes

In [1]:
import sys
import os
sys.path.append(os.path.abspath('.'))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import config
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

# Check if data files exist
config.check_data_files()

Missing data files: ['Train.csv', 'Test.csv', 'Accra_weather.csv', 'SampleSubmission.csv', 'VariableDefinitions.csv']
Please place data files in: c:\Users\CALYX BLAY\OneDrive\Desktop\yango-accra-mobility-prediction\data


False

In [2]:
# Set seed for reproducibility
import random
SEED = config.RANDOM_SEED
random.seed(SEED)
np.random.seed(SEED)
print(f"Random seed set to: {SEED}")

Random seed set to: 42


## Data Loading and Exploration

In [3]:
# Load data files using config paths
train = pd.read_csv(config.TRAIN_FILE)
test = pd.read_csv(config.TEST_FILE)
samplesubmission = pd.read_csv(config.SAMPLE_SUBMISSION_FILE)
weather_df = pd.read_csv(config.WEATHER_FILE, index_col=0)
variable_def = pd.read_csv(config.VARIABLE_DEFINITIONS_FILE)

print("Data files loaded successfully!")
print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")
print(f"Weather shape: {weather_df.shape}")

FileNotFoundError: [Errno 2] No such file or directory: 'c:\\Users\\CALYX BLAY\\OneDrive\\Desktop\\yango-accra-mobility-prediction\\data\\Train.csv'

In [None]:
variable_def

In [None]:
train.head()

In [None]:
test.head()

In [None]:
samplesubmission.head()

In [None]:
weather_df.head()

In [None]:
def extract_datetime_features(data, cols: list):
    df = data.copy()
    for col in cols:
        df[col] = pd.to_datetime(df[col])
        df[f'{col}_hour'] = df[col].dt.hour
        df[f'{col}_day'] = df[col].dt.day
        df[f'{col}_month'] = df[col].dt.month

    return df

In [None]:
# extract datetime features
train = extract_datetime_features(train, ['lcl_start_transporting_dttm'])
test = extract_datetime_features(test, ['lcl_start_transporting_dttm'])
weather_df = extract_datetime_features(weather_df, ['lcl_datetime'])

# Preview train dataset
train.head()

In [None]:
# Preview test dataset
test.head()

In [None]:
# Preview sample submission file
samplesubmission.head()

In [None]:
# Preview weather data
weather_df.head()

In [None]:
# Check size and shape of datasets
train.shape, test.shape, samplesubmission.shape

In [None]:
# Fraction of all rows (train + test) that belong to the test set
(test.shape[0]) / (train.shape[0] + test.shape[0])

## Statistical Analysis

In [None]:
# Training data statistical summary
train.describe(include='number')

### Key Insights

- Training dataset contains 57,596 trip records
- Average trip duration: 10.08 minutes
- Trip duration range: 1.02 to 585.93 minutes (outliers present)
- Trip characteristics vary widely across records, so features such as distance should carry useful signal for prediction

In [None]:
# Target variable distribution
plt.figure(figsize=(12, 6))
sns.histplot(train.Target, bins=50)
plt.title('Target Variable Distribution', fontsize=14)
plt.xlabel('Trip Duration (minutes)')
plt.ylabel('Frequency')
plt.show()

print(f"Skewness: {train.Target.skew():.2f}")

**Observation**: The target variable is right-skewed, indicating most trips are short with some very long trips (outliers).
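
A log transform is one common way to tame a right-skewed target. The sketch below uses synthetic log-normal durations, not the competition data, to show how `np.log1p` reduces skewness and how `np.expm1` maps values back to minutes. The distribution parameters are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "durations" (illustrative only, not the competition data)
rng = np.random.default_rng(42)
durations = pd.Series(rng.lognormal(mean=2.0, sigma=0.8, size=5000))

# log1p compresses the long right tail; expm1 is its inverse
log_durations = np.log1p(durations)
recovered = np.expm1(log_durations)

print(f"Skewness before log1p: {durations.skew():.2f}")
print(f"Skewness after log1p:  {log_durations.skew():.2f}")
```

If a model is trained on the transformed target, its predictions would need `np.expm1` applied before computing RMSE in minutes or writing a submission.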

## Outlier Analysis

In [None]:
# Boxplot for outlier detection
plt.figure(figsize=(12, 6))
sns.boxplot(x=train.Target)
plt.title('Trip Duration Outliers', fontsize=14)
plt.xlabel('Trip Duration (minutes)')
plt.show()

### Outlier Handling Strategies

Outliers are data points that differ significantly from other observations.

**Potential approaches:**
- **Transformation**: Log transformation, Box-Cox transformation
- **Trimming**: Remove extreme outliers (e.g., >99th percentile)
- **Capping**: Cap extreme values at reasonable thresholds
- **Robust models**: Use models less sensitive to outliers
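
As a minimal sketch of the trimming and capping options above, the snippet below applies a 99th-percentile threshold to a synthetic `Target` column. The threshold and the synthetic data are assumptions for illustration; a suitable cutoff for the real data would need to be checked against the distribution.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed target (illustrative only, not the competition data)
rng = np.random.default_rng(42)
df = pd.DataFrame({"Target": rng.lognormal(mean=2.0, sigma=0.8, size=5000)})

# Assumed threshold: the 99th percentile of the target
upper = df["Target"].quantile(0.99)

# Trimming: drop rows above the threshold
trimmed = df[df["Target"] <= upper]

# Capping: keep every row but clip values at the threshold
capped = df.assign(Target=df["Target"].clip(upper=upper))

print(f"Rows: original={len(df)}, trimmed={len(trimmed)}, capped={len(capped)}")
print(f"Max:  original={df['Target'].max():.2f}, capped={capped['Target'].max():.2f}")
```

Either step would normally be applied to the training rows only, with the threshold computed from training data, so that validation scores still reflect the untouched target.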

## Weather Data Analysis

In [None]:
weather_df.lcl_datetime.min(), weather_df.lcl_datetime.max()

In [None]:
weather_df.describe(include='all')

### Weather Insights

- **Precipitation**: Average 0.151 mm, maximum 4.34 mm in May 2024
- **Temperature**: Average 28.30°C, range 25.92-30.49°C
- **Coverage**: Hourly data available for entire month
- **Impact**: Weather conditions likely affect trip duration

### Merging Trip and Weather Data

In [None]:
train.head()

In [None]:
# Create day_hour variable for merging
train['day_hour'] = (train['lcl_start_transporting_dttm_day'].astype(str).str.split('.').str[0] + 
                    '_' + train['lcl_start_transporting_dttm_hour'].astype(str).str.split('.').str[0])

test['day_hour'] = (test['lcl_start_transporting_dttm_day'].astype(str).str.split('.').str[0] + 
                   '_' + test['lcl_start_transporting_dttm_hour'].astype(str).str.split('.').str[0])

weather_df['day_hour'] = (weather_df['lcl_datetime_day'].astype(str).str.split('.').str[0] + 
                         '_' + weather_df['lcl_datetime_hour'].astype(str).str.split('.').str[0])

print("Merge keys created successfully")

In [None]:
# Merge weather data with trip data
train_before = train.shape[0]
test_before = test.shape[0]

train = train.merge(weather_df, on='day_hour', how='left')
test = test.merge(weather_df, on='day_hour', how='left')

print(f"Train records: {train_before} -> {train.shape[0]}")
print(f"Test records: {test_before} -> {test.shape[0]}")
print("Weather data merged successfully")
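
The `day_hour` key above works because the weather data covers a single month. A possible alternative, sketched here on tiny synthetic frames, is to floor each trip start time to the hour and merge on the full timestamp, which would stay unambiguous if the data ever spanned several months. The `weather_hour` column name is hypothetical, and the sketch assumes the weather rows are stamped on the hour.

```python
import pandas as pd

# Tiny synthetic frames; column names mirror the notebook's, values are made up
trips = pd.DataFrame({
    "trip_id": ["a", "b", "c"],
    "lcl_start_transporting_dttm": pd.to_datetime(
        ["2024-05-01 08:15:00", "2024-05-01 08:59:59", "2024-05-02 08:05:00"]
    ),
})
weather = pd.DataFrame({
    "lcl_datetime": pd.to_datetime(["2024-05-01 08:00:00", "2024-05-02 08:00:00"]),
    "temperature_C": [27.5, 29.0],
})

# Floor each trip start to the hour so it lines up with the hourly weather rows
trips["weather_hour"] = trips["lcl_start_transporting_dttm"].dt.floor("h")

merged = trips.merge(
    weather.rename(columns={"lcl_datetime": "weather_hour"}),
    on="weather_hour",
    how="left",
    validate="many_to_one",  # raises if the weather frame has duplicate hours
)
print(merged[["trip_id", "weather_hour", "temperature_C"]])
```

The `validate` argument also guards against the row count growing silently during the merge.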

In [None]:
train.head()

In [None]:
test.head()

## Data Quality Assessment

In [None]:
# Check for missing values
train_missing = train.isnull().sum()
test_missing = test.isnull().sum()

print("Missing values in train data:")
print(train_missing[train_missing > 0])
print("\nMissing values in test data:")
print(test_missing[test_missing > 0])

print(f"\nOverall missing data: Train={train.isnull().sum().any()}, Test={test.isnull().sum().any()}")

### Missing Value Handling Strategies

- **Numerical features**: Fill with median/mean values
- **Categorical features**: Fill with mode or create "unknown" category  
- **Weather data**: Forward/backward fill for temporal continuity
- **Complete case analysis**: Drop rows with missing values, in particular training rows with a missing target
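
The strategies above could look roughly like the following on a small synthetic frame. The `vehicle_class` column is hypothetical and only stands in for some categorical feature; the other column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic trip and weather frames with gaps (illustrative only)
df = pd.DataFrame({
    "str_distance_km": [1.2, np.nan, 3.4, 2.0],
    "vehicle_class": ["economy", None, "comfort", "economy"],  # hypothetical column
    "Target": [5.0, 7.5, np.nan, 6.0],
})
weather = pd.DataFrame({
    "lcl_datetime": pd.date_range("2024-05-01", periods=4, freq="h"),
    "temperature_C": [27.0, np.nan, np.nan, 28.5],
}).sort_values("lcl_datetime")

# Drop rows whose target is missing; they cannot be used for training
df = df.dropna(subset=["Target"]).copy()

# Numerical feature: fill with the median of the remaining rows
df["str_distance_km"] = df["str_distance_km"].fillna(df["str_distance_km"].median())

# Categorical feature: add an explicit "unknown" category
df["vehicle_class"] = df["vehicle_class"].fillna("unknown")

# Weather: forward fill in time order, then backward fill any leading gap
weather["temperature_C"] = weather["temperature_C"].ffill().bfill()

print(df)
print(weather)
```

Fill values such as the median would usually be computed on the training split and reused for validation and test data.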

In [None]:
# Check for duplicate records
train_duplicates = train.duplicated().sum()
test_duplicates = test.duplicated().sum()

print("Duplicate records:")
print(f"Train: {train_duplicates}")
print(f"Test: {test_duplicates}")

# Check for duplicate trip IDs
train_id_duplicates = train['trip_id'].duplicated().sum()
test_id_duplicates = test['trip_id'].duplicated().sum()

print("\nDuplicate trip IDs:")
print(f"Train: {train_id_duplicates}")
print(f"Test: {test_id_duplicates}")

## Feature Correlation Analysis

In [None]:
# Calculate absolute correlations with the target (excluding the target itself)
numeric_features = train.select_dtypes(include='number')
target_correlations = numeric_features.corr()['Target'].drop('Target').abs().sort_values(ascending=False)

print("Top 15 features correlated with target:")
print(target_correlations.head(15))

# Display as DataFrame for better formatting
correlation_df = pd.DataFrame({
    'Feature': target_correlations.head(15).index,
    'Correlation': target_correlations.head(15).values
})
correlation_df

In [None]:
# Select key features for correlation analysis
key_features = ['destination_lat', 'destination_lon', 'origin_lat', 'origin_lon',
               'str_distance_km', 'transporting_distance_fact_km', 
               'lcl_start_transporting_dttm_day', 'lcl_start_transporting_dttm_month', 
               'prev_hour_precipitation_mm', 'temperature_C', 'Target']

# Create correlation matrix
correlation_matrix = train[key_features].corr()

# Plot heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='RdYlBu_r', center=0, 
            square=True, fmt='.2f', cbar_kws={'shrink': 0.8})
plt.title('Feature Correlation Matrix', fontsize=16, pad=20)
plt.tight_layout()
plt.show()

## Baseline Model Training

In [None]:
# Select features for baseline model
feature_cols = ['str_distance_km', 'temperature_C', 'transporting_distance_fact_km',
               'lcl_start_transporting_dttm_day', 'prev_hour_precipitation_mm']

# Prepare feature matrix and target
X = train[feature_cols].fillna(0)
y = train[config.TARGET_COLUMN]

print(f"Features selected: {len(feature_cols)}")
print(f"Training samples: {len(X)}")
print(f"Features: {feature_cols}")

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=SEED)

# Train Random Forest model
model = RandomForestRegressor(
    n_estimators=100,
    random_state=SEED, 
    n_jobs=-1,
    max_depth=10
)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_val)

# Calculate RMSE
rmse_score = np.sqrt(mean_squared_error(y_val, y_pred))
print(f'\nBaseline RMSE: {rmse_score:.4f}')

In [None]:
# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=True)

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.title('Feature Importance (Random Forest)', fontsize=14)
plt.xlabel('Importance Score')
plt.tight_layout()
plt.show()

# Display importance values, most important first
print("Feature Importance Rankings:")
for _, row in feature_importance.sort_values('importance', ascending=False).iterrows():
    print(f"{row['feature']}: {row['importance']:.4f}")

## Generate Predictions and Submission File

In [None]:
# Prepare test data for prediction
test_features = test[feature_cols].fillna(0)

# Generate predictions
test_predictions = model.predict(test_features)

# Create submission dataframe
submission = pd.DataFrame({
    'trip_id': test['trip_id'],
    'Target': test_predictions
})

print(f"Predictions generated for {len(submission)} test samples")
print(f"Prediction range: {test_predictions.min():.2f} to {test_predictions.max():.2f}")
print("\nFirst 10 predictions:")
submission.head(10)

In [None]:
# Save submission file
submission_path = config.get_output_path('baseline_submission.csv')
submission.to_csv(submission_path, index=False)

print(f"Submission file saved to: {submission_path}")
print(f"File size: {len(submission)} rows")

# Verify submission format
print("\nSubmission file format verification:")
print(f"Columns: {list(submission.columns)}")
print(f"Sample verification passed: {list(submission.columns) == ['trip_id', 'Target']}")

## Summary and Next Steps

### Baseline Results
- **Model**: Random Forest with 5 features
- **Validation RMSE**: ~4.4 minutes
- **Key Features**: Distance measures and weather conditions

### Potential Improvements
1. **Feature Engineering**: Add time-based, geographical, and interaction features
2. **Advanced Models**: Try LightGBM, XGBoost, or ensemble methods
3. **Hyperparameter Tuning**: Optimize model parameters
4. **Data Preprocessing**: Handle outliers and missing values with more sophisticated methods
5. **Cross-Validation**: Use proper CV strategy for robust evaluation
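
As one way to approach the cross-validation item, the sketch below scores a Random Forest with 5-fold CV using scikit-learn's `cross_val_score`. It runs on synthetic data with a made-up linear relationship, so the numbers say nothing about the competition; the fold count and model settings are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic features and target (illustrative only, not the competition data)
rng = np.random.default_rng(42)
X = rng.uniform(0, 20, size=(500, 2))
y = 2.5 * X[:, 0] + rng.normal(0, 1.0, size=500)

model = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn returns negated RMSE so that higher is better; flip the sign back
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores

print(f"RMSE per fold: {np.round(rmse_per_fold, 3)}")
print(f"Mean RMSE: {rmse_per_fold.mean():.3f} (std {rmse_per_fold.std():.3f})")
```

The spread across folds gives a rough sense of how much a single 70/30 split score might vary.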

### Next Notebooks
- `01_eda_and_cleaning.ipynb`: Detailed exploratory analysis
- `02_train_model.ipynb`: Advanced modeling techniques

Good luck with the competition!