# Collaborative Hackathon: Building a Machine Learning Prototype 🤝

Welcome to **The Collaborative ML Hackathon!**  
In this session, you’ll walk through a full end-to-end machine learning workflow, the same structure you’ll use in your group hackathon project.

This *demo notebook* provides a worked example of how to approach each stage, from problem definition through to deployment via an API.

---
### Learning Objectives
By the end, you’ll understand how to:
- Define an ML problem from a business context.  
- Prepare data systematically.  
- Train and compare classical ML models.  
- Tune hyperparameters for performance.  
- Evaluate and deploy a prototype model with a simple API.

## 1️⃣ Define the Problem

For demonstration, we’ll use the **California Housing dataset** (a regression task): predicting median house prices based on location and socioeconomic variables.

**Business context:**  
A property analytics firm wants to estimate median housing prices for new areas based on demographic and environmental factors.  A reliable regression model would allow faster property valuation and planning decisions.

In [1]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load the dataset
data = fetch_california_housing(as_frame=True)
df = data.frame
df.head()

|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


## 2️⃣ Choose the Machine Learning Approach

We’ll formulate this as a **supervised regression problem**:
- **Target variable:** `MedHouseVal` (median house value, expressed in units of $100,000)
- **Features:** numerical attributes such as income, house age, and geographical coordinates.

We’ll explore two algorithms:
- **Linear Regression** – interpretable baseline.  
- **Random Forest Regressor** – ensemble model capturing non-linear relationships.

## 3️⃣ Data Preparation

Data preparation is essential for reliable performance. Here we’ll:
- Inspect and clean the data.
- Handle missing values and outliers.
- Standardise numeric features.
- Split into training, validation, and test sets.

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Check for missing values
df.isnull().sum()

| Column | Missing values |
|---|---|
| MedInc | 0 |
| HouseAge | 0 |
| AveRooms | 0 |
| AveBedrms | 0 |
| Population | 0 |
| AveOccup | 0 |
| Latitude | 0 |
| Longitude | 0 |
| MedHouseVal | 0 |


In [3]:
# No missing values in this dataset; if there were, median imputation is one option:
# df = df.fillna(df.median())

# Detect simple outliers by z-score: keep only rows where every column
# (features and target alike) lies within 3 standard deviations of its mean
z_scores = np.abs((df - df.mean()) / df.std())
df_clean = df[(z_scores < 3).all(axis=1)]
print(f"Removed {len(df) - len(df_clean)} potential outliers.")

Removed 846 potential outliers.
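
The cell above computes the means and standard deviations on the full dataset, before any split, so the held-out rows influence the filter. One possible alternative, sketched here under that assumption, estimates the statistics from a reference split (for example the training data) and reuses them elsewhere. The helper name `zscore_filter` and the toy frames are hypothetical, and using this variant would change the number of rows removed.

```python
import pandas as pd


def zscore_filter(reference, target, threshold=3.0):
    """Keep rows of `target` whose columns all lie within `threshold`
    standard deviations of the column means of `reference`."""
    mean = reference.mean()
    std = reference.std()
    z = ((target - mean) / std).abs()
    return target[(z < threshold).all(axis=1)]


# Toy data (hypothetical values, not taken from the housing dataset)
reference = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 2.0, 1.0, 3.0, 2.0, 2.0],
    "y": [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 10.0],
})
new_rows = pd.DataFrame({"x": [2.0, 50.0], "y": [10.0, 10.0]})

# The second row has an extreme x value and is dropped
kept = zscore_filter(reference, new_rows)
print(kept)
```

In the notebook this would be called after the split, for instance `zscore_filter(train, train)` to clean the training rows.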


In [4]:
# Split data: 70% train, 15% validation, 15% test
train_val, test = train_test_split(df_clean, test_size=0.15, random_state=42)
# 0.1765 of the remaining 85% is about 15% of the full dataset (0.15 / 0.85)
train, val = train_test_split(train_val, test_size=0.1765, random_state=42)

X_train, y_train = train.drop('MedHouseVal', axis=1), train['MedHouseVal']
X_val, y_val = val.drop('MedHouseVal', axis=1), val['MedHouseVal']
X_test, y_test = test.drop('MedHouseVal', axis=1), test['MedHouseVal']

# Standardise features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
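
The second `test_size` of `0.1765` follows from the target shares: once 15% has been set aside for testing, the validation set must take 0.15 / 0.85 of what remains. The small helper below (a hypothetical name, shown only to make the arithmetic explicit) reproduces that number.

```python
def second_split_fraction(val_share, test_share):
    """Fraction of the train+validation pool to hold out so that the
    validation set ends up as `val_share` of the full dataset."""
    remaining = 1.0 - test_share
    if not 0 < val_share < remaining:
        raise ValueError("val_share must be positive and smaller than 1 - test_share")
    return val_share / remaining


frac = second_split_fraction(val_share=0.15, test_share=0.15)
print(f"{frac:.4f}")  # 0.1765
```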

## 4️⃣ Model Selection and Training

We’ll compare **Linear Regression** and **Random Forest Regressor** on the validation set.

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Train models
lr = LinearRegression()
rf = RandomForestRegressor(random_state=42)

lr.fit(X_train_scaled, y_train)
rf.fit(X_train_scaled, y_train)

# Evaluate on validation set
for model, name in [(lr, 'Linear Regression'), (rf, 'Random Forest')]:
    preds = model.predict(X_val_scaled)
    rmse = np.sqrt(mean_squared_error(y_val, preds))
    print(f"{name}: RMSE = {rmse:.3f}")

Linear Regression: RMSE = 0.655
Random Forest: RMSE = 0.503


**Model choice:** Random Forest achieves the lower validation RMSE (0.503 vs 0.655 for Linear Regression), most likely because it captures non-linear relationships, though at the cost of interpretability.
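
For reference, the RMSE used in the comparison is the square root of the mean squared difference between true and predicted values, so it is expressed in the same units as the target. A minimal pure-Python sketch with toy numbers (not taken from the dataset) shows the calculation that `np.sqrt(mean_squared_error(...))` performs.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: errors are 1, 0 and -2, so the mean squared error is 5/3
y_true = [3.0, 2.0, 4.0]
y_pred = [2.0, 2.0, 6.0]
print(f"{rmse(y_true, y_pred):.3f}")
```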

## 5️⃣ Hyperparameter Tuning

We’ll perform a simple **Grid Search** to optimise the Random Forest model using cross-validation.

In [6]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring='neg_root_mean_squared_error',
    verbose=1
)

grid_search.fit(X_train_scaled, y_train)
print("Best parameters:", grid_search.best_params_)
# best_score_ is the mean cross-validated score over the training folds,
# not a score on the held-out validation set
print(f"Cross-validated RMSE: {-grid_search.best_score_:.3f}")

Fitting 3 folds for each of 12 candidates, totalling 36 fits
Best parameters: {'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 100}
Cross-validated RMSE: 0.524
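
The "12 candidates, totalling 36 fits" line in the log follows directly from the grid: 2 × 3 × 2 parameter combinations, each fitted on 3 folds. The sketch below enumerates the combinations with `itertools.product` purely to illustrate the count; it is not how `GridSearchCV` is implemented internally. By default, `GridSearchCV` also refits the best candidate once more on the full training data, which is not included in the 36.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5],
}
cv_folds = 3

# Every combination of one value per parameter is one candidate
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 12 36
```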


## 6️⃣ Model Evaluation

Finally, we evaluate the tuned model on the **test set**, which represents unseen data.

In [7]:
best_model = grid_search.best_estimator_
test_preds = best_model.predict(X_test_scaled)
test_rmse = np.sqrt(mean_squared_error(y_test, test_preds))
print(f"Test RMSE: {test_rmse:.3f}")

Test RMSE: 0.519
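
To put the test RMSE in business terms, it helps to convert it back to dollars. The sketch below assumes the target keeps the units documented for scikit-learn's California Housing dataset (hundreds of thousands of US dollars); the constant name is ours.

```python
# Assumption: MedHouseVal is expressed in units of $100,000
UNIT_IN_DOLLARS = 100_000

test_rmse = 0.519  # value printed by the evaluation cell
rmse_dollars = test_rmse * UNIT_IN_DOLLARS

print(f"Typical prediction error: about ${rmse_dollars:,.0f}")
```

On that assumption, the tuned model's predictions are off by roughly $52,000 in RMSE terms.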


## 7️⃣ Deployment via API

We’ll now demonstrate how to wrap the final model in a lightweight API using **Flask**.  
This lets other systems send data to the `/predict` endpoint and receive a price prediction.
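
The `app.py` script in this section loads `best_model.pkl` and `scaler.pkl`, but no earlier cell writes those files. A minimal sketch of the missing step, assuming `joblib` is used for serialisation as in `app.py`, is given here; the helper name `save_artifacts` is hypothetical.

```python
from pathlib import Path

import joblib


def save_artifacts(model, scaler, out_dir="."):
    """Write the fitted model and scaler under the file names app.py expects."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    model_path = out / "best_model.pkl"
    scaler_path = out / "scaler.pkl"
    joblib.dump(model, str(model_path))
    joblib.dump(scaler, str(scaler_path))
    return model_path, scaler_path
```

In the notebook this would be called as `save_artifacts(best_model, scaler)` before starting the API.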

In [8]:
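%%writefile app.py
from flask import Flask, request, jsonify
import joblib
import numpy as np

app = Flask(__name__)

# Load the fitted artefacts once at start-up; both files must already exist
model = joblib.load('best_model.pkl')
scaler = joblib.load('scaler.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON body such as {"features": [v1, ..., v8]},
    # with values in the same column order as X_train
    data = request.get_json(silent=True)
    if not data or 'features' not in data:
        return jsonify({'error': "JSON body with a 'features' list is required"}), 400
    X_input = np.array(data['features'], dtype=float).reshape(1, -1)
    X_scaled = scaler.transform(X_input)
    prediction = model.predict(X_scaled)[0]
    return jsonify({'predicted_value': float(prediction)})

if __name__ == '__main__':
    # debug=True is intended for local development only
    app.run(debug=True)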

Writing app.py
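
To try the endpoint, a client has to send the eight feature values in the same column order used for training. The sketch below uses only the standard library and assumes the app is running locally on Flask's default address (`http://127.0.0.1:5000`); `build_payload` and `predict_remote` are hypothetical helper names, and the sample row is the first row shown by `df.head()`.

```python
import json
from urllib import request as urlrequest

# Column order of X_train (all columns of the frame except MedHouseVal)
FEATURE_ORDER = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude",
]


def build_payload(row):
    """Turn a {column: value} mapping into the JSON body app.py expects."""
    missing = [name for name in FEATURE_ORDER if name not in row]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return {"features": [float(row[name]) for name in FEATURE_ORDER]}


def predict_remote(row, url="http://127.0.0.1:5000/predict"):
    """POST one row to the running API and return the predicted value."""
    body = json.dumps(build_payload(row)).encode("utf-8")
    req = urlrequest.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlrequest.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))["predicted_value"]


first_row = {
    "MedInc": 8.3252, "HouseAge": 41.0, "AveRooms": 6.984127,
    "AveBedrms": 1.02381, "Population": 322.0, "AveOccup": 2.555556,
    "Latitude": 37.88, "Longitude": -122.23,
}
payload = build_payload(first_row)
print(json.dumps(payload))
```

With `python app.py` running in another terminal, `predict_remote(first_row)` should return the model's prediction for that row as a float.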


## 🧭 Summary

In this demo, we covered the full workflow:
1. **Defined** a clear business problem.  
2. **Prepared** the data responsibly.  
3. **Trained** and compared models.  
4. **Tuned** hyperparameters systematically.  
5. **Evaluated** generalisation on a test set.  
6. **Deployed** via a simple Flask API.

Your hackathon task will be to follow a similar structure using one of the provided datasets, document your process, and present your model as a working prototype!