# Pipelines in Python

In Python, pipelines are most commonly used in machine learning and data preprocessing. A pipeline chains multiple data transformations and a final estimator into a single workflow, so the same processing is applied consistently and automatically. This is particularly useful in cross-validation.

In the scikit-learn library, pipelines are created with the Pipeline class, which takes a list of named steps. Worked examples on the Titanic dataset follow the overview of key components.
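
As a minimal sketch of what creating and using a Pipeline looks like, the snippet below chains a scaler and a classifier. The synthetic data from make_classification, the choice of StandardScaler and LogisticRegression, and the step names "scaler" and "model" are assumptions made for illustration; they are not part of the Titanic examples later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (illustrative assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each step is a (name, estimator) tuple. Every step except the last
# must be a transformer; the last step is the final estimator.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() fits the scaler on the training data, transforms it,
# and then trains the model on the scaled result.
pipe.fit(X_train, y_train)

# predict() and score() reuse the already-fitted scaler on new data.
y_pred = pipe.predict(X_test)
test_accuracy = pipe.score(X_test, y_test)
```

The fitted pipeline behaves like a single estimator, so there is no need to call the scaler separately on the test set.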

# Key components of a pipeline

## 1. Data Preprocessing Steps:
These are the initial steps in the pipeline, which transform and prepare the raw data for machine learning. Examples include data cleaning, data normalization, feature extraction, and feature engineering.
## 2. Machine Learning Model:
This is the final step in the pipeline, which uses the preprocessed data to make predictions or classifications. Common machine learning models include linear regression, logistic regression, decision trees, and neural networks.
## 3. Estimator Class:
In scikit-learn, pipelines are created using the Pipeline class, which is itself an estimator. It chains multiple data preprocessing steps and a final machine learning model into a single object that can be used wherever an estimator is accepted.
## 4. Fit Method:
The fit method trains the pipeline. It fits each preprocessing step on the training data in order, transforms the data, and then trains the final model on the result. In scikit-learn, this method is available on both individual estimators and pipelines.
## 5. Predict Method:
The predict method makes predictions on new data. It applies the already-fitted preprocessing steps to the new data and then calls predict on the trained model. In scikit-learn, this method is available on both individual estimators and pipelines.
## 6. Cross-Validation:
Pipelines are particularly useful in cross-validation because the preprocessing steps are re-fitted on the training portion of each fold and then applied to the held-out portion. This keeps information from the validation data from leaking into the preprocessing, which gives a more reliable estimate of model performance. In scikit-learn, cross-validation can be performed with the cross_val_score function, which accepts both individual estimators and pipelines.
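
To illustrate the cross-validation point, here is a small sketch that passes a whole pipeline to cross_val_score. The synthetic data and the scaler plus logistic regression steps are assumptions chosen to keep the example short; make_pipeline is a scikit-learn helper that builds a Pipeline and names the steps automatically.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (illustrative assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# make_pipeline builds a Pipeline and derives step names from the class names.
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# For each of the 5 folds, a fresh copy of the pipeline is fitted on the
# training portion only, so the scaler never sees the held-out portion.
scores = cross_val_score(pipe, X, y, cv=5)
mean_accuracy = scores.mean()
```

Scaling the full dataset first and cross-validating only the model would let statistics from the held-out rows influence the scaler, which is the leakage a pipeline avoids.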


In [14]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [13]:
# # load the titanic dataset from seaborn

# df = sns.load_dataset('titanic')
# df.head()

# # select features and target variable
# X = df[['pclass', 'age', 'sex','fare','embarked']]
# y = df['survived']

# # split the data into train and test sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# # define the column transformer for imputing missing values and encoding categorical variables

# numeric_features = ['age', 'fare']
# categorical_features = ['pclass', 'sex', 'embarked']

# numeric_transformer = Pipeline(steps=[
#     ('imputer', SimpleImputer(strategy='median'))
# ])

# categorical_transformer = Pipeline(steps=[
#     ('imputer', SimpleImputer(strategy='most_frequent')),
#     ('encoder', OneHotEncoder(handle_unknown='ignore')),
# ])

# preprocessor = ColumnTransformer(transformers=[
#     ('num', numeric_transformer, numeric_features),
#     ('cat', categorical_transformer, categorical_features)
# ])

# # create a pipeline with the preprocessor and RandomForestClassifier
# pipeline = Pipeline(steps=[
#     ('preprocessor', preprocessor),
#     ('classifier', RandomForestClassifier())
# ])

# # fit the pipeline on the training data
# pipeline.fit(X_train, y_train)

# # make predictions on the test data
# y_pred = pipeline.predict(X_test)

# # calculate the accuracy score
# accuracy = accuracy_score(y_test, y_pred)
# print("Accuracy:", accuracy)

# # print out the confusion matrix
# print("Confusion Matrix:")
# print(confusion_matrix(y_test, y_pred))

# # print classification report 

# print("Classification Report:")
# print(classification_report(y_test, y_pred))





--------
# Hyperparameter Tuning in a Pipeline

In [19]:
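import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# load the titanic dataset from seaborn
df = sns.load_dataset('titanic')
df.head()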

# select features and target variable
X = df[['pclass', 'age', 'sex','fare','embarked']]
y = df['survived']
# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

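# create a pipeline
# note: these steps act on every column, so the numeric columns
# (age, fare) are also imputed with the mode and one-hot encoded
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ('model', RandomForestClassifier(random_state=42))
])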
# define the hyperparameters to tune
# keys have the form <step name>__<parameter name>, with a double underscore
hyperparameters = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [None, 5, 10],
    'model__min_samples_split': [2, 5, 10]
}
# create a grid search with 5-fold cross-validation and fit it
grid_search = GridSearchCV(pipeline, hyperparameters, cv=5)
grid_search.fit(X_train, y_train)

# get the best model
best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)

# calculate accuracy score

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

# print the best hyperparameters

print('Best Hyperparameters:', grid_search.best_params_)
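
The keys of the hyperparameter grid must match parameter names that the pipeline actually exposes. The sketch below shows one way to check them with get_params, and how the naming extends to steps nested inside a ColumnTransformer. The toy DataFrame and its values are hypothetical and exist only to keep the example self-contained; the reduced n_estimators and cv=3 are likewise assumptions to keep it fast.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny hand-made frame (hypothetical values, not the real Titanic data).
toy = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0, 54.0, 2.0,
            27.0, 14.0, 58.0, 20.0, 39.0, 31.0],
    "sex": ["male", "female", "female", "female", "male", "male",
            "female", "female", "female", "male", "male", "female"],
    "survived": [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1],
})
X_toy, y_toy = toy[["age", "sex"]], toy["survived"]

# Numeric and categorical columns get different preprocessing.
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])
pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(n_estimators=10, random_state=42)),
])

# get_params() lists every tunable name; nesting is joined with "__".
valid_names = pipe.get_params().keys()
has_double = "model__min_samples_split" in valid_names   # double underscore
has_single = "model_min_samples_split" in valid_names    # single underscore

# Nested names reach through the ColumnTransformer into its transformers.
param_grid = {
    "model__max_depth": [None, 3],
    "preprocessor__num__strategy": ["mean", "median"],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_toy, y_toy)
best = search.best_params_
```

Checking a key against get_params before running the search catches a misspelled name early, since GridSearchCV raises an error for a parameter the pipeline does not have.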



