# __Sklearn Pipelines__

## __Why sklearn pipelines?__

**Pipelines provide an organized approach to managing your data preprocessing and modeling code. They combine preprocessing and modeling steps into a single, streamlined process.**

 - **Cleaner Code**: Pipelines eliminate the need to manually manage training and validation data at each preprocessing step, reducing clutter and complexity.

 - **Fewer Bugs**: By bundling steps together, pipelines minimize the risk of misapplying or forgetting a preprocessing step.

 - **Easier to Productionize**: Pipelines simplify the transition from a prototype model to a scalable, deployable solution.

**Syntax**:
```python
class sklearn.pipeline.Pipeline(steps, *, memory=None, verbose=False)

```

### __Before we start the application, let's pay attention to the following important points:__

 - A pipeline is a sequence of data transformers that can include a final predictor.

 - It lets you apply multiple preprocessing steps to your data in order, and optionally end with a predictor for modeling.

 - Each intermediate step in the pipeline must have fit and transform methods, while the final step only needs fit.

 - You can cache these transformers using the memory argument.

 - The pipeline's main goal is to combine multiple steps that can be cross-validated together and have their parameters adjusted.

 - You can set parameters for any step by using its name followed by a double underscore(__) and the parameter name.

 - You can replace any step's estimator with another estimator or remove a transformer by setting it to 'passthrough' or None.

### Lets look at the housing data with `Ocean proximity` feature:

In [421]:
housing_data = pd.read_csv("housing_with_ocean_proximity.csv")

In [423]:
housing_data.head(5)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [425]:
# Separate features and target variable
# Assuming 'median_house_value' is the target variable in your dataset

X = housing_data.drop('median_house_value', axis=1)
y = housing_data['median_house_value']

In [427]:
# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=2024)

In [431]:
# Print info to check the structure of the training data

print(X_train.info(), "\n")
print(X_train.isna().sum())

<class 'pandas.core.frame.DataFrame'>
Index: 14448 entries, 4722 to 7816
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           14448 non-null  float64
 1   latitude            14448 non-null  float64
 2   housing_median_age  14448 non-null  float64
 3   total_rooms         14448 non-null  float64
 4   total_bedrooms      14286 non-null  float64
 5   population          14448 non-null  float64
 6   households          14448 non-null  float64
 7   median_income       14448 non-null  float64
 8   ocean_proximity     14448 non-null  object 
dtypes: float64(8), object(1)
memory usage: 1.1+ MB
None 

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        162
population              0
households              0
median_income           0
ocean_proximity         0
dtype: int64


#### __Implementation for sklearn pipelines__

![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ML/Lesson_03/Sklearn_pipeline_1.png)

![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ML/Lesson_03/Sklearn_pipeline_2.png)

#### __We need to perform following preprocessing steps before building the model__|

1. Missing value treatment - 162 missing values in total_bedrooms(a numeric column)
2. Dummy variable creation for categorical data
3. Standardization of numeric variables

In [452]:
# import StandardScaler for standardization and OneHotEncoder for creating dummy variables
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# import SimpleImputer for missing value treatment
from sklearn.impute import SimpleImputer

# importing pipeline class. The Pipeline class is used to create a sequence of data processing steps.
from sklearn.pipeline import Pipeline

# importing ColumnTransformer class to apply different preprocessing steps to different subsets of features in your dataset.
from sklearn.compose import ColumnTransformer

##### __A note about ColumnTransformer__

* ColumnTransformer class allows you to apply different preprocessing steps to different subsets of features in your dataset.
* This is particularly useful when you have a mix of numerical and categorical data that require different types of preprocessing.
* ColumnTransformer ensures that each column or group of columns gets the appropriate transformation before combining the results for further processing or modeling.

### __Data preprocessing starts here:__

#### Steps to perform:
    
#### Step 1: Let's store the names of numeric and object type variables separately.

In [543]:
# Set up preprocessing steps for numeric and categorical data

housing_cat = X_train.select_dtypes(include='object').columns
housing_num = X_train.select_dtypes(exclude='object').columns

In [545]:
housing_cat

Index(['ocean_proximity'], dtype='object')

In [547]:
housing_num

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income'],
      dtype='object')

------------------------------------

#### Step 2: Set-up sklearn pipeline for numeric variables. We need to perform missing value imputation and then standardization.

In [465]:
# Numeric variables pipeline
# Here, null values will be imputed with the median value of the dataset

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())
])

In [549]:
# Numeric variables pipeline
# If we want to remove the rows with the null values (NOT WORKING)

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value=np.nan)),
    ('std_scaler', StandardScaler())
])

In [551]:
# Categorical variables pipeline

cat_pipeline = Pipeline([
    ('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'))
])

The `num_pipeline` is a pipeline that preprocesses numerical data in two steps:

- Imputation: Fills missing values using the median value of each column (SimpleImputer(strategy='median')).
- Standardization: Scales the data to have a mean of 0 and a standard deviation of 1 (StandardScaler()).

This pipeline ensures consistent and streamlined preprocessing of numerical data.

------------------------------------

#### Step 3: Unified Data Preprocessing with Pipelines and ColumnTransformer

In [556]:
# Unified preprocessing for both numeric and categorical data

preprocessing = ColumnTransformer([
    ('num', num_pipeline, housing_num),
    ('cat', OneHotEncoder(handle_unknown='ignore'), housing_cat)
])

In [558]:
# Unified preprocessing for both numeric and categorical data

preprocessing = ColumnTransformer([
    ('num', num_pipeline, housing_num),
    ('cat', cat_pipeline, housing_cat)
])

The preprocessing step uses ColumnTransformer to apply different preprocessing pipelines to different types of data in the dataset:

 - Numerical Data: Applies num_pipeline to columns in housing_num, performing imputation and standardization.
 - Categorical Data: Applies OneHotEncoder to columns in housing_cat, converting categorical variables into a one-hot encoded format, ignoring unknown categories.

This ensures that both numerical and categorical data are preprocessed appropriately within a single, unified framework.

------------------------------------

#### Let's check how the pipeline we have created works with the data

In [563]:
# applying preprocessing pipeline to train data

check_train = preprocessing.fit_transform(X_train)

In [565]:
# converting array to dataframe to have a better look at it

check_train_df = pd.DataFrame(check_train)
check_train_df.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.604388,-0.741523,1.536729,-0.620001,-0.715386,-0.79088,-0.721514,0.072853,1.0,0.0,0.0,0.0,0.0
1,0.798753,-0.862998,0.501005,-0.11989,-0.133529,0.218924,-0.043175,0.129574,1.0,0.0,0.0,0.0,0.0
2,-0.845868,1.445014,-1.809456,0.736722,0.329095,0.298266,0.353837,0.825943,0.0,1.0,0.0,0.0,0.0
3,-1.27945,0.828299,-1.092417,-0.018491,-0.414919,-0.394171,-0.348165,3.519281,0.0,0.0,0.0,1.0,0.0
4,1.237318,-1.3816,1.457057,-0.847115,-0.739232,-0.702522,-0.747807,-0.986694,0.0,0.0,0.0,0.0,1.0


In [567]:
# No NULL values found after pre-processing

check_train_df.isna().sum()

0       0
1       0
2       0
3       0
4     162
5       0
6       0
7       0
8       0
9       0
10      0
11      0
12      0
dtype: int64

#### __Observation:__

1. Good to see all the missing values are treated
2. numeric variables are standardised
3. ocean_proximity is converted to dummy variables

------------------------------------

#### Step 4: Building Linear regression model and hyperparameter tuning

In [529]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error


In [531]:
# Create an instance of the Ridge regression model

model = Ridge()

In [533]:
# Build a Ridge regression model within a complete pipeline

final_pipeline = Pipeline([
    ('preprocessing', preprocessing),
    ('model_ridge', model)
])


In [535]:
# Define the grid of hyperparameters to search
# Note: You can set parameters for any step by using its name followed by a double underscore(__) and the parameter name.

grid = dict()

grid['model_ridge__alpha'] = np.arange(0.1,2.1,0.1)
grid

{'model_ridge__alpha': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3,
        1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. ])}

In [537]:
# Create the GridSearchCV object

search = GridSearchCV(estimator = final_pipeline, param_grid = grid, scoring = 'neg_mean_absolute_error',cv = 5, n_jobs= -1)

# Fit GridSearchCV object to the training data

results = search.fit(X_train, y_train)
print('MAE: %.3f' % -results.best_score_)
print('Config: %s' % results.best_params_)

MAE: 49789.360
Config: {'model_ridge__alpha': 0.1}


------------------------------------

#### Step 5: Using pipeline object to test the results

In [539]:
# Predict on the test set using the trained model

y_pred = search.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

Mean Absolute Error: 50091.26322372467


![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ML/Lesson_03/sklearn_pipeline_3.png)

### __Conclusion__

Regression analysis is an essential method for examining and predicting variable relationships. In this lesson, we've delved into core types of regression—simple linear, multiple linear, and polynomial regression.

You've acquired skills to assess model performance using metrics like MSE, RMSE, and R-squared. Furthermore, we explored regularization techniques such as Lasso, Ridge, and ElasticNet to mitigate overfitting, and learned the importance of hyperparameter tuning for optimizing model parameters to enhance results.

In conclusion, this lesson has provided a comprehensive exploration of regression through theoretical insights and hands-on applications.