# Pipeline in Machine Learning

A pipeline in machine learning simplifies your workflow by chaining multiple steps into a single object. It is especially useful when you have to apply several transformations to your data before it reaches the final estimator.

It reduces the chances of data leakage and makes the code more readable and maintainable. In this notebook, we will see how to build a pipeline using the scikit-learn library.

## Steps involved in the pipeline in machine learning


The pipeline in machine learning involves the following steps:
1. Data Preprocessing: This step involves handling missing values, encoding categorical variables, and scaling the features.
2. Feature Selection: This step involves selecting the most important features from the dataset.
3. Model Building: This step involves building a machine learning model using the selected features.
4. Prediction: This step involves making predictions using the model.
5. Model Evaluation: This step involves evaluating the performance of the model using different metrics.
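
The steps listed above can be sketched in a few lines. This is a minimal illustration, not the notebook's Titanic example: the synthetic dataset, the choice of `StandardScaler`, `SelectKBest`, and `LogisticRegression`, and all variable names are assumptions made for the demo. Note that only preprocessing, feature selection, and the model live inside the `Pipeline` object; prediction and evaluation are calls made on the fitted pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data, assumed here only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=4, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

steps_demo = Pipeline(steps=[
    ('scale', StandardScaler()),                   # 1. data preprocessing
    ('select', SelectKBest(f_classif, k=4)),       # 2. feature selection
    ('model', LogisticRegression(max_iter=1000)),  # 3. model building
])

# fitting runs every step in order on the training data
steps_demo.fit(Xd_train, yd_train)

yd_pred = steps_demo.predict(Xd_test)              # 4. prediction
demo_accuracy = accuracy_score(yd_test, yd_pred)   # 5. model evaluation
print('Accuracy:', demo_accuracy)
```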


## Advantages of using the pipeline in machine learning

The main advantages of using the pipeline in machine learning are:
1. Reduces the chances of data leakage: The pipeline fits each transformation on the training data only and then applies the already fitted transformation to the test data, so no information from the test set influences preprocessing.
2. Makes the code more readable and maintainable: The pipeline combines multiple steps into a single object that is fitted and used like any other estimator.
3. Automates the workflow: The pipeline applies the transformations in sequence every time you call `fit` or `predict`.
4. Supports reliable performance: The pipeline applies the same transformations consistently to the training and test data, which avoids train/test mismatches that can hurt the model or distort its evaluation.
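
The data leakage point can be checked directly. The sketch below uses made-up random data and hypothetical variable names (none of them come from the notebook) and assumes a `StandardScaler` as the only preprocessing step. After fitting, the mean stored in the scaler is compared with the mean of the training split and with the mean of the full dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# random data, assumed here only for illustration
rng = np.random.default_rng(0)
X_leak = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_leak = (X_leak[:, 0] > 5.0).astype(int)

Xl_train, Xl_test, yl_train, yl_test = train_test_split(
    X_leak, y_leak, test_size=0.3, random_state=0)

leak_demo = Pipeline(steps=[('scaler', StandardScaler()),
                            ('model', LogisticRegression())])

# fit() learns the scaling statistics from the training split only
leak_demo.fit(Xl_train, yl_train)
learned_mean = leak_demo.named_steps['scaler'].mean_

print('matches training mean:', np.allclose(learned_mean, Xl_train.mean(axis=0)))
print('matches full-data mean:', np.allclose(learned_mean, X_leak.mean(axis=0)))

# predict() reuses the fitted scaler on the test split; it is not refitted
yl_pred = leak_demo.predict(Xl_test)
```

The scaler's statistics come from the training split alone, which is the property that keeps test-set information out of preprocessing.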

### Summary

In summary, the pipeline in machine learning is a very useful tool that simplifies the workflow by combining multiple steps into one object. It reduces the chances of data leakage, makes the code more readable and maintainable, automates the workflow, and keeps preprocessing consistent between the training and test data.

In [20]:
import pandas as pd 
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [21]:
# load the dataset
titanic = sns.load_dataset('titanic')

# selecting features and target variable
X = titanic[['pclass', 'sex', 'age', 'fare', 'embarked']]
y = titanic['survived']

# split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# column transformer: impute missing values per column type and one-hot encode the categorical columns
numeric_features = ['age', 'fare']
category_features = ['pclass', 'sex', 'embarked']

numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean'))])

categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                      ('encoder', OneHotEncoder(handle_unknown='ignore'))])

processor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features),
                                            ('cat', categorical_transformer, category_features)])

# create a pipeline with a random forest classifier
pipeline = Pipeline(steps=[('preprocessor', processor), 
                           ('classifier', RandomForestClassifier(random_state=42))])

# fit the model
pipeline.fit(X_train, y_train)

# predict the target variable
y_pred = pipeline.predict(X_test)

# calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

Accuracy: 0.7821229050279329


# Hyperparameter Tuning in Pipeline

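Hyperparameter tuning in a pipeline optimizes the hyperparameters of the model, or of any other step, using grid search or random search. Because the search cross-validates the whole pipeline, the preprocessing steps are refitted on each training fold. A step's hyperparameter is addressed as `<step name>__<parameter name>`, for example `model__n_estimators`.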
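
As a small illustration of the random search option, the sketch below runs `RandomizedSearchCV` over a pipeline. The synthetic dataset, the step names `imputer` and `model`, and the candidate values are assumptions chosen to keep the demo fast; they are not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# synthetic data, assumed here only for illustration
X_rs, y_rs = make_classification(n_samples=150, n_features=6, random_state=0)

rs_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),
                              ('model', RandomForestClassifier(random_state=0))])

# every step parameter is exposed as '<step name>__<parameter name>'
print('model__n_estimators' in rs_pipeline.get_params())

# candidate values for the preprocessing step and for the model
param_distributions = {
    'imputer__strategy': ['mean', 'median'],
    'model__n_estimators': [10, 25, 50],
    'model__max_depth': [None, 3, 5],
}

# random search samples n_iter combinations instead of trying all of them
random_search = RandomizedSearchCV(rs_pipeline,
                                   param_distributions=param_distributions,
                                   n_iter=4, cv=3, random_state=0)
random_search.fit(X_rs, y_rs)
print('Best hyperparameters:', random_search.best_params_)
```

Random search evaluates only `n_iter` sampled combinations, so it is usually cheaper than an exhaustive grid when the search space is large.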

In [22]:
import pandas as pd 
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [23]:
# load the dataset
titanic = sns.load_dataset('titanic')

# selecting features and target variable
X = titanic[['pclass', 'sex', 'age', 'fare', 'embarked']]
y = titanic['survived']

# split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

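# create a pipeline with a random forest classifier
# note: this simplified pipeline imputes every column with its most frequent value
# and one-hot encodes all of them, including the numeric 'age' and 'fare' columns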
pipeline = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                     ('encoder', OneHotEncoder(handle_unknown='ignore')),
                     ('model', RandomForestClassifier(random_state=42))])

# hyperparameters to tune (keys use the '<step name>__<parameter name>' form)
param_grid = {
    'model__n_estimators': [100, 200, 300, 500],
    'model__max_depth': [None, 5, 10],
    'model__min_samples_split': [2, 5, 10]
}

# Grid search cross-validation
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# getting the best model
pipeline = grid_search.best_estimator_

# predict on the test set with the best model
y_pred = pipeline.predict(X_test)

# calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

# print the best hyperparameters
print('Best hyperparameters:', grid_search.best_params_)

Accuracy: 0.8212290502793296
Best hyperparameters: {'model__max_depth': None, 'model__min_samples_split': 5, 'model__n_estimators': 500}
