# __Introduction to ML Pipelines__

A machine learning pipeline (ML pipeline) is the systematic process of designing, developing and deploying a machine learning model. ML pipelines, also called ML workflows, follow a defined series of steps that makes model development more efficient.

The end-to-end machine learning pipeline comprises three stages:

- __Data processing:__ Data scientists assemble and prepare the data that will be used to train the ML model. Phases in this stage include data collection, preprocessing, cleaning and exploration.
- __Model development:__ Data practitioners choose or create a machine learning algorithm that fits the needs of the project. The algorithm is trained on the data from the previous step, and the resulting model is tested and validated until it is ready for use.
- __Model deployment:__ Developers and software engineers deploy the model for real-world use, integrating it into a production environment and monitoring its performance.

Machine learning workflows are a core building block for the larger discipline of machine learning operations (MLOps). Much of the process can be automated through various automated machine learning (AutoML) techniques that manage dependencies between stages and endpoints.

Source: https://www.ibm.com/think/topics/machine-learning-pipeline

## __Why sklearn pipelines?__

Pipelines provide an organized approach to managing your data preprocessing and modeling code. They combine preprocessing and modeling steps into a single, streamlined process.

- **Cleaner Code**: Pipelines eliminate the need to manually manage training and validation data at each preprocessing step, reducing clutter and complexity.
- **Fewer Bugs**: By bundling steps together, pipelines minimize the risk of misapplying or forgetting a preprocessing step.
- **Easier to Productionize**: Pipelines simplify the transition from a prototype model to a scalable, deployable solution.

**Syntax**:
```python
class sklearn.pipeline.Pipeline(steps, *, memory=None, verbose=False)

```


- A pipeline is a sequence of data transformers that can include a final predictor.
- It lets you apply multiple preprocessing steps to your data in order, and optionally end with a predictor for modeling.
- Each intermediate step in the pipeline must implement `fit` and `transform`, while the final step only needs `fit`.
- You can cache fitted transformers using the `memory` argument.
- The pipeline's main goal is to combine multiple steps so that they can be cross-validated together while their parameters are tuned.
- You can set parameters for any step by using its name followed by a double underscore (`__`) and the parameter name, in the form `<step>__<parameter>`.
- You can replace any step's estimator with another estimator, or skip a transformer by setting it to `'passthrough'` or `None`.
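
The points above can be illustrated with a minimal sketch. The toy data and the step names `scaler` and `model` are assumptions made up for this example; they are not part of the housing workflow.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): 20 samples, 3 features, linear target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])

# Each step is a (name, estimator) pair; the last step is the predictor
demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# <step name>__<parameter> reaches into a step to set its parameter
demo_pipe.set_params(model__alpha=0.5)
demo_pipe.fit(X_demo, y_demo)
alpha_used = demo_pipe.named_steps["model"].alpha

# Setting a step to 'passthrough' skips it on the next fit
demo_pipe.set_params(scaler="passthrough")
demo_pipe.fit(X_demo, y_demo)
print(alpha_used, demo_pipe.named_steps["scaler"])
```

The same `<step>__<parameter>` naming is what `GridSearchCV` expects in its parameter grid when the estimator is a pipeline.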

## __Housing dataset (Regression)__

### __Dataset description__

The California housing dataset contains information on various socio-economic features of block groups in California. Each row in the dataset represents a single block group, and there are 20,640 observations, each with 10 attributes.

The attributes are as follows (Median House Value is the prediction target; the other nine are features):
1. Longitude: The longitude of the center of each block group in California.
2. Latitude: The latitude of the center of each block group in California.
3. Housing Median Age: The median age of the housing units in each block group.
4. Total Rooms: The total number of rooms in the housing units in each block group.
5. Total Bedrooms: The total number of bedrooms in the housing units in each block group.
6. Population: The total population of the block group.
7. Households: The total number of households in the block group.
8. Median Income: The median income of the block group.
9. Median House Value: The median value of the housing units in the block group.
10. Ocean Proximity: The proximity of the block group to the ocean or other bodies of water.

#### __Fetching data from Kaggle__

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/kaggle"
os.makedirs("/content/kaggle", exist_ok=True)
!mv kaggle.json /content/kaggle/

In [6]:
!chmod 600 /content/kaggle/kaggle.json

In [7]:
!kaggle datasets download -d hosammhmdali/house-price-dataset

Dataset URL: https://www.kaggle.com/datasets/hosammhmdali/house-price-dataset
License(s): apache-2.0
Downloading house-price-dataset.zip to /content
  0% 0.00/400k [00:00<?, ?B/s]
100% 400k/400k [00:00<00:00, 557MB/s]


In [8]:
!unzip house-price-dataset.zip

Archive:  house-price-dataset.zip
  inflating: housing.csv             


#### __Import the necessary libraries__

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

#### __Pre-processing the dataset__

In [11]:
housing_data = pd.read_csv("housing.csv")
housing_data.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [12]:
housing_data.tail()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7 | 92300.0 | INLAND |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
| 20639 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | INLAND |


#### __Inspecting the data__

In [13]:
housing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [14]:
housing_data.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


In [15]:
missing_values = housing_data.isna().sum()
missing_values

| Column | Missing values |
|---|---|
| longitude | 0 |
| latitude | 0 |
| housing_median_age | 0 |
| total_rooms | 0 |
| total_bedrooms | 207 |
| population | 0 |
| households | 0 |
| median_income | 0 |
| median_house_value | 0 |
| ocean_proximity | 0 |


In [16]:
duplicate_values = housing_data.duplicated().sum()
duplicate_values

np.int64(0)

#### __Perform train-test split__

In [17]:
# Separate features and target variable
# Assuming 'median_house_value' is the target variable in your dataset

X = housing_data.drop('median_house_value', axis=1)
y = housing_data['median_house_value']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### __Implementation for sklearn pipelines__

<img src="https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ML/Lesson_03/Sklearn_pipeline_1.png" width=600 />

<img src="https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ML/Lesson_03/Sklearn_pipeline_2.png" width=600 />

<img src="https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ML/Lesson_03/sklearn_pipeline_3.png" width=600 />

#### __Preprocessing steps before building the model__

1. Missing value treatment - `207` missing values in total_bedrooms (a numeric column)
2. Dummy variable creation for categorical data
3. Standardization of numeric variables

In [18]:
# Import StandardScaler for standardization and OneHotEncoder for creating dummy variables
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Import SimpleImputer for missing value treatment
from sklearn.impute import SimpleImputer

# Importing pipeline class. The Pipeline class is used to create a sequence of data processing steps.
from sklearn.pipeline import Pipeline

# Importing ColumnTransformer class to apply different preprocessing steps to different subsets of features in your dataset.
from sklearn.compose import ColumnTransformer

#### __A note about ColumnTransformer__

* ColumnTransformer class allows you to apply different preprocessing steps to different subsets of features in your dataset.
* This is particularly useful when you have a mix of numerical and categorical data that require different types of preprocessing.
* ColumnTransformer ensures that each column or group of columns gets the appropriate transformation before combining the results for further processing or modeling.
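
As a rough illustration of the idea, the sketch below routes two numeric columns and one categorical column through different transformers. The toy frame and its column names (`rooms`, `income`, `region`) are hypothetical and only stand in for a mixed-type dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: two numeric columns and one categorical column
toy = pd.DataFrame({
    "rooms": [3.0, 5.0, 4.0, 6.0],
    "income": [2.5, 8.0, 4.1, 6.3],
    "region": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

# Each entry is (name, transformer, columns the transformer applies to)
toy_ct = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["rooms", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

toy_out = toy_ct.fit_transform(toy)
# 2 scaled numeric columns + 3 one-hot columns (one per distinct region)
print(toy_out.shape)
```

The transformed blocks are concatenated side by side in the order the transformers are listed, so the numeric columns come first here.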

#### __Step 1: Data preprocessing for Numerical and Categorical columns__

In [19]:
# Set up preprocessing steps for Numeric and Categorical data

housing_cat = X_train.select_dtypes(include='object').columns
housing_num = X_train.select_dtypes(exclude='object').columns

In [20]:
housing_cat

Index(['ocean_proximity'], dtype='object')

In [21]:
housing_num

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income'],
      dtype='object')

#### __Step 2: Set up sklearn pipeline for numeric and categorical variables__

- For numerical columns, impute missing values and then standardize.
- For categorical columns, impute missing values where needed and then apply one-hot encoding (OHE) or ordinal encoding.

The `numerical_transformer` is a pipeline that preprocesses numerical data:

- Imputation: Fills missing values using the median value of each column (SimpleImputer(strategy='median')).
- Standardization: Scales the data to have a mean of 0 and a standard deviation of 1 (StandardScaler()).

This pipeline ensures consistent and streamlined preprocessing of numerical data.
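
To make the two numerical steps concrete, here is a small sketch on a single hypothetical column with one missing value. Standardization computes $z = (x - \mu) / \sigma$ per column, where $\mu$ and $\sigma$ are the column mean and (population) standard deviation learned during `fit`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single column with one missing value
col = np.array([[1.0], [2.0], [np.nan], [4.0]])

num_demo = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
scaled = num_demo.fit_transform(col)

# The median of the observed values (1, 2, 4) is 2, so NaN is replaced by 2
filled = num_demo.named_steps["imputer"].statistics_[0]
print(filled, scaled.mean(), scaled.std())
```

After the transform the column has mean 0 and standard deviation 1, up to floating-point error.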

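The `categorical_transformer` is a pipeline that preprocesses categorical data:

- Imputation: Where a categorical column has missing values, fills them with the mode or a custom value (`SimpleImputer(strategy='most_frequent')`). `ocean_proximity` has no missing values, so the `categorical_transformer` defined in this notebook omits this step.
- Encoding: `OneHotEncoder` for nominal categories or `OrdinalEncoder` for ordinal categories. `ocean_proximity` is nominal, so this notebook uses `OneHotEncoder`.

This pipeline ensures consistent and streamlined preprocessing of categorical data.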
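
For a categorical column that does contain missing values, the imputation and encoding steps could be chained roughly as follows. This is only a sketch on a hypothetical four-row column; the name `cat_pipe` is made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with one missing entry
cat_demo = pd.DataFrame({
    "ocean_proximity": pd.Series(["INLAND", "NEAR BAY", np.nan, "INLAND"], dtype=object)
})

cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore")),
])

# OneHotEncoder returns a sparse result by default; densify it for display
encoded = cat_pipe.fit_transform(cat_demo).toarray()
print(encoded)
```

The missing entry is filled with the most frequent category (`INLAND`) before encoding, so every row ends up with exactly one active dummy column.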

In [22]:
# Numeric variables pipeline: missing values are imputed with each column's median,
# then every column is standardized
numerical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())
])

In [23]:
# Categorical variables pipeline (Preprocessing for categorical features)
categorical_transformer = Pipeline([
    ('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'))
])

#### __Step 3: Unified Data Preprocessing with Pipelines and ColumnTransformer__

Scikit-learn pipelines can handle categorical and numerical features within a single workflow by using `ColumnTransformer`, which applies different preprocessing steps to different subsets of columns, selected by data type or by column name.

The preprocessing step uses `ColumnTransformer` to apply different preprocessing pipelines to different types of data in the dataset:

- Numerical Data: Applies `numerical_transformer` to columns in `housing_num`, performing imputation and standardization.
- Categorical Data: Applies `categorical_transformer` to columns in `housing_cat`, converting categorical variables into a one-hot encoded format, ignoring unknown categories.

This ensures that both numerical and categorical data are preprocessed appropriately within a single, unified framework.

In [24]:
# Unified preprocessing for both numeric and categorical data (Create a preprocessor using ColumnTransformer)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, housing_num),
        ('cat', categorical_transformer, housing_cat)],
    remainder='passthrough' # Keep other columns as they are
)

#### __Apply preprocessing pipeline to training data (only to verify the preprocessing output)__

In [25]:
# Apply preprocessing pipeline to train data
check_train = preprocessor.fit_transform(X_train)

In [26]:
# Converting array to dataframe to have a better look at it
check_train_df = pd.DataFrame(check_train)
check_train_df.head(5)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.780934 | -0.805682 | 0.509357 | -0.113242 | -0.33787 | -0.184117 | -0.243508 | 0.133506 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.24527 | -1.339473 | -0.679873 | -0.213566 | -0.013884 | -0.376191 | -0.013267 | -0.532218 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | -0.277552 | -0.496645 | -0.362745 | -0.482639 | -0.61421 | -0.61124 | -0.565322 | 0.17099 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | -0.706938 | 1.690024 | -1.155565 | -0.848339 | -0.926284 | -0.987495 | -0.949929 | -0.402916 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | -1.430902 | 0.99235 | 1.857152 | 0.251071 | 0.400626 | 0.086015 | 0.426285 | -0.299285 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |


In [27]:
# No NULL values found after pre-processing
check_train_df.isna().sum()

| Column | Missing values |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
| 7 | 0 |
| 8 | 0 |
| 9 | 0 |


##### __Observation:__

1. All missing values have been imputed.
2. Numeric variables are standardized.
3. `ocean_proximity` is converted to dummy variables.

#### __Step 4: Building a regression model (Linear Regression)__

In [28]:
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.metrics import mean_absolute_error

In [None]:
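# Note: the final step is labelled 'classifier', but LinearRegression is a regressor;
# the label is only the name used to look the step up later.
model_pipeline = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('classifier', LinearRegression())
])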

model_pipeline.fit(X_train, y_train)

In [30]:
# R-squared value
model_pipeline.score(X_train,y_train)

0.6470480227253683

In [31]:
# Coefficients: one slope per preprocessed feature (8 scaled numeric columns + 5 dummy columns)
model_pipeline.named_steps['classifier'].coef_

array([-53080.22484228, -53211.37198001,  13949.87818887, -12701.71831577,
        44378.88674127, -42559.67778653,  16172.44726364,  74802.55919045,
       -18383.89055005, -59459.94985371, 117504.65524069, -24449.93723871,
       -15210.87759821])

In [32]:
# Intercept
model_pipeline.named_steps['classifier'].intercept_

np.float64(238671.04806614283)

#### __Evaluating the model on the training and test sets__

In [33]:
y_train_pred = model_pipeline.predict(X_train)
mse_train = mean_squared_error(y_train, y_train_pred)
r2_train = r2_score(y_train, y_train_pred)

# Print the metrics
print("Training Set Mean Squared Error:", mse_train)
print("Training Set R² Score:", r2_train)

Training Set Mean Squared Error: 4728483441.955843
Training Set R² Score: 0.6470480227253683


In [34]:
y_test_pred = model_pipeline.predict(X_test)
mse_test = mean_squared_error(y_test, y_test_pred)
r2_test = r2_score(y_test, y_test_pred)

# Print the metrics
print("Test Set Mean Squared Error:", mse_test)
print("Test Set R² Score:", r2_test)

Test Set Mean Squared Error: 4733398424.247664
Test Set R² Score: 0.6393711402746005
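
MSE is expressed in squared dollars, which makes the values above hard to read at a glance. A common way to interpret them is to take the square root (RMSE), which is in the same units as `median_house_value`. A minimal sketch, using the two MSE values copied from the outputs above:

```python
import math

# MSE values copied from the notebook outputs above
mse_train = 4728483441.955843
mse_test = 4733398424.247664

# RMSE is the square root of MSE, expressed in the target's units (dollars)
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

print(f"Train RMSE: {rmse_train:,.0f}")
print(f"Test RMSE:  {rmse_test:,.0f}")
```

Both values come out at roughly 68,800, which can be compared with the mean `median_house_value` of about 206,856 reported by `describe()`. The similar training and test errors suggest the linear model is not overfitting.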


#### __Using cross_val_score with the pipeline and training data__

In [37]:
kf = KFold(n_splits=10, random_state=42, shuffle=True)
k_fold_scores_mse = cross_val_score(model_pipeline, X_train, y_train, scoring='neg_mean_squared_error', cv=kf, n_jobs=-1)
k_fold_scores_mae = cross_val_score(model_pipeline, X_train, y_train, scoring='neg_mean_absolute_error', cv=kf, n_jobs=-1)

print(f"Mean CV MSE (Linear Regression): {np.mean(np.abs(k_fold_scores_mse))}")
print(f"Mean CV MAE (Linear Regression): {np.mean(np.abs(k_fold_scores_mae))}")

Mean CV MSE (Linear Regression): 4758266610.329491
Mean CV MAE (Linear Regression): 49789.70356825567
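
Scikit-learn scorers follow a "higher is better" convention, so error metrics are reported as negative numbers (`neg_mean_squared_error`, `neg_mean_absolute_error`); that is why the cell above takes the absolute value. As an alternative to calling `cross_val_score` twice, `cross_validate` accepts several metrics in one pass. The sketch below uses synthetic data from `make_regression` as a stand-in for `X_train` and `y_train`, so its numbers are not comparable with the housing results.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption, not the housing data)
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

syn_pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# One call fits each fold once and evaluates both metrics on it
cv_out = cross_validate(
    syn_pipe, X_syn, y_syn, cv=kf_demo,
    scoring=["neg_mean_squared_error", "neg_mean_absolute_error"],
)

# Scores are negated errors, so flip the sign to report positive errors
cv_mse = -cv_out["test_neg_mean_squared_error"].mean()
cv_mae = -cv_out["test_neg_mean_absolute_error"].mean()
print(cv_mse, cv_mae)
```

Because the whole pipeline is passed to the cross-validator, the scaler is re-fitted on each training fold only, which avoids leaking statistics from the validation fold.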


#### __Building Ridge regression model with hyperparameter tuning__

In [48]:
model_pipeline_ridge = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model_ridge', Ridge())
])

param_grid = {'model_ridge__alpha': np.arange(0.1,2.1,0.1)}

# Create the GridSearchCV object
grid_search = GridSearchCV(
    estimator=model_pipeline_ridge,
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',
    cv=5,
    n_jobs=-1,
)

# Fit GridSearchCV object to the training data
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
# (scoring is 'neg_mean_absolute_error', so best_score_ is a negative MAE)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score (negative MAE): {grid_search.best_score_}")

# Evaluate the best model on the training and test sets
best_ridge_model = grid_search.best_estimator_

y_train_pred_ridge = best_ridge_model.predict(X_train)
r2_train_ridge = r2_score(y_train, y_train_pred_ridge)
y_test_pred_ridge = best_ridge_model.predict(X_test)
r2_test_ridge = r2_score(y_test, y_test_pred_ridge)

# Print the metrics
print("Training Set R² Score (Ridge Model):", r2_train_ridge)
print("Test Set R² Score (Ridge Model):", r2_test_ridge)

Best parameters: {'model_ridge__alpha': np.float64(0.1)}
Best cross-validation score (negative MAE): -49802.830202195575
Training Set R² Score (Ridge Model): 0.6470478508630848
Test Set R² Score (Ridge Model): 0.6393657722246677
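
The double-underscore syntax also reaches into nested steps, so a grid search can tune preprocessing choices alongside the model. The sketch below is an assumption-laden illustration: it uses synthetic data with made-up column names (`x1`, `x2`, `area`) and searches over the imputation strategy as well as `alpha`. It mirrors the step names used in this notebook but is not the housing experiment itself.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption): two numeric columns, one categorical
rng = np.random.default_rng(42)
n = 60
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "area": rng.choice(["INLAND", "NEAR BAY"], size=n),
})
target = 3.0 * df["x1"] - 2.0 * df["x2"] + rng.normal(scale=0.1, size=n)
df.loc[::10, "x2"] = np.nan  # inject a few missing values

num_steps = Pipeline([("imputer", SimpleImputer()), ("std_scaler", StandardScaler())])
pre = ColumnTransformer([
    ("num", num_steps, ["x1", "x2"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["area"]),
])
pipe = Pipeline([("preprocessor", pre), ("model_ridge", Ridge())])

# Names chain through the nesting: pipeline step -> transformer -> inner step -> parameter
grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "model_ridge__alpha": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, grid, scoring="neg_mean_absolute_error", cv=3)
search.fit(df, target)
print(search.best_params_, -search.best_score_)
```

Here the grid has 2 x 3 = 6 candidate combinations, each evaluated with 3-fold cross-validation.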
