# MLOps - California Housing Price Prediction
## Introduction
This project builds an end-to-end machine learning workflow to predict housing prices using the California Housing dataset. The focus is on applying key MLOps concepts: preprocessing, cross-validation, hyperparameter tuning, pipeline construction, and model deployment.



## Task Completion
## 1. Loading and Exploring the Dataset
First, we load the California Housing dataset and perform basic exploration:

In [1]:
# Import required libraries
import pandas as pd
import pickle
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Load dataset
X, y = fetch_california_housing(return_X_y=True, as_frame=True)

# Display dataset information
print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("\nFirst 5 rows of features:")
print(X.head())
print("\nTarget description:")
print(y.describe())

Features shape: (20640, 8)
Target shape: (20640,)

First 5 rows of features:
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  
0    -122.23  
1    -122.22  
2    -122.24  
3    -122.25  
4    -122.25  

Target description:
count    20640.000000
mean         2.068558
std          1.153956
min          0.149990
25%          1.196000
50%          1.797000
75%          2.647250
max          5.000010
Name: MedHouseVal, dtype: float64
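
Before adding an imputer, it can help to check whether the features actually contain missing values. The following is a minimal sketch of such a check. It uses a small hypothetical frame (`toy`) and a hypothetical helper (`missing_report`) so that it runs without downloading the dataset; in the notebook, the real `X` would be passed instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X, reusing two of the dataset's column names.
# The NaN is planted on purpose to show what the report looks like.
toy = pd.DataFrame({
    "MedInc": [8.3252, 8.3014, np.nan, 5.6431],
    "HouseAge": [41.0, 21.0, 52.0, 52.0],
})


def missing_report(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column."""
    return df.isna().sum()


report = missing_report(toy)
print(report)
```

Calling `missing_report(X)` on the loaded features would show, column by column, whether imputation has anything to do.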


## 2. Data Preprocessing and Pipeline Construction
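We create a preprocessing pipeline that imputes missing values with the column mean and standardizes the features. The dataset as shipped by scikit-learn contains no missing values, so the imputer acts as a safeguard for future inputs. Standardization matters here because KNN is distance-based, and unscaled features with large ranges (such as `Population`) would otherwise dominate the distance computation: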

In [2]:
# Train-test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Preprocessing: Imputation + Scaling for numerical features
numeric_features = X.columns  # all are numerical
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Handle missing values
    ('scaler', StandardScaler())  # Standardize features
])

# Combine preprocessing using ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features)
])

# Build complete pipeline: preprocessing + KNN
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('knn', KNeighborsRegressor())
])
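
To see what the preprocessing steps learn during `fit`, one can inspect the fitted pipeline. The sketch below rebuilds the same structure on hypothetical toy data (`X_toy`, `y_toy`, and `toy_pipeline` are illustrative names, not part of the original notebook) and reads back the imputer's column means and the effect of scaling.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales, one NaN
X_toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
})
y_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Same structure as the notebook's pipeline: impute, scale, then KNN
numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
pre = ColumnTransformer(transformers=[("num", numeric, list(X_toy.columns))])
toy_pipeline = Pipeline(steps=[
    ("preprocessor", pre),
    ("knn", KNeighborsRegressor(n_neighbors=2)),
])
toy_pipeline.fit(X_toy, y_toy)

# Read back what the fitted preprocessing steps learned
fitted_numeric = toy_pipeline.named_steps["preprocessor"].named_transformers_["num"]
imputer_means = fitted_numeric.named_steps["imputer"].statistics_
scaled = toy_pipeline.named_steps["preprocessor"].transform(X_toy)

print("Imputer means:", imputer_means)
print("Column means after scaling:", scaled.mean(axis=0))
print("Column std after scaling:", scaled.std(axis=0))
```

Because the imputer and scaler live inside the pipeline, their statistics are learned only from the data passed to `fit`. During cross-validation this means each fold's validation split does not leak into the preprocessing.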

## 3. Hyperparameter Tuning with GridSearchCV
We perform hyperparameter tuning using 5-fold cross-validation:

In [3]:
# Define hyperparameter grid
param_grid = {
    'knn__n_neighbors': [3, 5, 7, 9],  # Number of neighbors
    'knn__weights': ['uniform', 'distance'],  # Weighting scheme
    'knn__p': [1, 2]  # Minkowski power parameter (1=Manhattan, 2=Euclidean)
}

# Apply GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring='r2',  # Using R² score as evaluation metric
    verbose=1,
    n_jobs=-1  # Use all available CPU cores
)

# Fit the model
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
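
The log line follows from the grid: 4 × 2 × 2 = 16 parameter combinations, each fitted on 5 folds, gives 80 fits. To compare all candidates rather than only the winner, the `cv_results_` attribute can be loaded into a DataFrame. The sketch below assumes synthetic data from `make_regression` and a reduced grid so that it runs quickly; `demo_search` and the other `demo_` names are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the housing dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=4, noise=10.0, random_state=0
)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])
demo_grid = {
    "knn__n_neighbors": [3, 5],
    "knn__weights": ["uniform", "distance"],
}
demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=5, scoring="r2")
demo_search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
results = (
    pd.DataFrame(demo_search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
```

In the notebook, the same DataFrame built from `grid_search.cv_results_` would show how close the runner-up settings are to the best one, which is useful when the top scores differ by less than their fold-to-fold standard deviation.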


## 4. Model Evaluation
We evaluate the best model on the test set:

In [4]:
# Evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Calculate metrics
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # RMSE is the square root of MSE; the 'squared' argument is no longer accepted by recent scikit-learn releases

# Print results
print("Best Parameters:", grid_search.best_params_)
print("Best CV R² Score:", grid_search.best_score_)
print("Test R² Score:", r2)
print("Test MSE:", mse)
print("Test RMSE:", rmse)

Best Parameters: {'knn__n_neighbors': 9, 'knn__p': 1, 'knn__weights': 'distance'}
Best CV R² Score: 0.731266870986164
Test R² Score: 0.72210916268423
Test MSE: 0.3641506481894662
Test RMSE: 0.6034489607162036
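
The three metrics are closely related: RMSE is the square root of MSE, and R² compares the model's squared error with that of always predicting the mean. Since the target is expressed in units of $100,000, an RMSE of about 0.60 corresponds to a typical error of roughly $60,000. The sketch below computes the metrics by hand on small hypothetical arrays (`y_true_demo`, `y_pred_demo`) and checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical values, chosen only to make the arithmetic easy to follow
y_true_demo = np.array([3.0, 1.5, 2.0, 4.5])
y_pred_demo = np.array([2.5, 1.0, 2.5, 4.0])

residuals = y_true_demo - y_pred_demo

# MSE: mean of squared residuals; RMSE: its square root
mse_manual = np.mean(residuals ** 2)
rmse_manual = np.sqrt(mse_manual)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE:", mse_manual, "sklearn:", mean_squared_error(y_true_demo, y_pred_demo))
print("RMSE:", rmse_manual)
print("R2:", r2_manual, "sklearn:", r2_score(y_true_demo, y_pred_demo))
```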


## 5. Saving the Model
We save the trained pipeline for future use:

In [5]:
# Save the pipeline
with open('california_knn_pipeline.pkl', 'wb') as f:
    pickle.dump(best_model, f)

print("📦 Final pipeline saved to 'california_knn_pipeline.pkl'")

📦 Final pipeline saved to 'california_knn_pipeline.pkl'
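
A saved pipeline is only useful if it can be reloaded and produces the same predictions. The sketch below shows the round trip under simplified assumptions: it fits a small hypothetical pipeline (`small_pipeline`) on random data and writes it to a temporary file, rather than reusing `california_knn_pipeline.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data for a small stand-in pipeline
rng = np.random.default_rng(0)
X_small = rng.normal(size=(50, 3))
y_small = X_small.sum(axis=1)

small_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=3)),
]).fit(X_small, y_small)

# Save to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    with open(path, "wb") as f:
        pickle.dump(small_pipeline, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline should reproduce the original predictions
same_predictions = np.allclose(
    restored.predict(X_small), small_pipeline.predict(X_small)
)
print("Predictions match after reload:", same_predictions)
```

In the notebook, the equivalent step would open `california_knn_pipeline.pkl` in `'rb'` mode and call `predict` on a DataFrame with the same eight feature columns. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.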


## Conclusion
Through this assignment, I learned:

### End-to-End ML Workflow: 
How to construct a complete machine learning pipeline from data loading to model deployment.

### Preprocessing Importance: 
The significance of proper data preprocessing (imputation and scaling) for model performance.

### Hyperparameter Tuning: 
How GridSearchCV with cross-validation helps find optimal model parameters systematically.

### Model Evaluation: 
The importance of using multiple metrics (R², MSE, RMSE) to assess model performance comprehensively.

### Model Deployment: 
How to persist the trained pipeline with `pickle` so that it can be reloaded to serve predictions, for example behind an API endpoint, which is crucial for real-world applications.

The KNN regressor achieved a reasonable R² score of 0.72 on the test set, meaning it explains about 72% of the variance in housing prices, close to the cross-validated score of 0.73. For production use, we might want to experiment with more sophisticated models and additional feature engineering to improve performance further.