# Chapter 2 keypoint EX

## Pipeline

As demand for machine learning and data science grows in businesses, an **upgraded data strategy** needs a **better workflow** to **ensure robust data modelling**.

A machine learning project follows a fixed sequence of steps: data collection, data preprocessing (cleaning and feature engineering), model training, validation, and prediction on test data (which the model has not seen before).

The test data must go through the same preprocessing as the training data. **For this repetitive process**, **pipelines** automate the whole sequence for both training and test data. A pipeline makes the model reusable by removing redundant code, which speeds up the process. This is particularly useful in a production workflow.

**Advantages of using Pipeline:**

- Automates an iterative workflow
- Easier to fix bugs
- Production ready
- Encourages clean coding standards
- Helpful for iterative hyperparameter tuning and cross-validated evaluation
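
The last advantage can be illustrated with a minimal sketch. It assumes the built-in Iris data and an arbitrarily chosen scaler and classifier (neither comes from the text above); the point is only that `cross_val_score` re-fits the whole pipeline on each training fold, so the scaler never sees the held-out fold.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Illustrative pipeline: the step names and estimators are assumptions.
pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Each of the 5 folds fits scaler + classifier on the training part only.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```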

**Challenges in using Pipeline:**

- Proper data cleaning
- Data Exploration and Analysis
- Efficient feature engineering

**Scikit-Learn Pipeline**

The `sklearn.pipeline` module implements utilities to build a composite estimator, as a chain of transforms and estimators.

With the **scikit-learn pipeline**, we can easily systematise the process and make it highly reproducible.
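
A minimal sketch of the idea, assuming synthetic data and a scaler followed by linear regression (both chosen only for illustration): `fit` learns the scaling statistics from the training data, and `predict` reuses them on the test data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5])
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X_train, y_train)

# The scaler's mean comes from X_train only; X_test is scaled with it.
learned_mean = pipe.named_steps["scaler"].mean_
predictions = pipe.predict(X_test)
print(learned_mean, predictions.shape)
```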




### Pipeline 1: [Daily bike share](https://towardsdatascience.com/step-by-step-tutorial-of-sci-kit-learn-pipeline-62402d5629b6)

In [31]:
import numpy as np
import pandas as pd

In [32]:
data = pd.read_csv("../data/daily-bike-share.csv")
data.dtypes

instant         int64
dteday         object
season          int64
yr              int64
mnth            int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
rentals         int64
dtype: object

In [33]:
data.isnull().sum()  # count missing values per column (none here)

instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
rentals       0
dtype: int64

In [34]:
# Keep only the feature columns and the target ("rentals")
data = data[
    [
        "season",
        "mnth",
        "holiday",
        "weekday",
        "workingday",
        "weathersit",
        "temp",
        "atemp",
        "hum",
        "windspeed",
        "rentals",
    ]
]

In [35]:
# split the data
from sklearn.model_selection import train_test_split

X = data.drop("rentals", axis=1)
y = data["rentals"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

#### Create the pipeline

The main parameter of a pipeline we’ll be working with is `steps`.

From the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#examples-using-sklearn-pipeline-pipeline), it is a *‘list of (name, transform) tuples (implementing fit/transform) that are chained, in the order in which they are chained, with the last object an estimator.’*

It’s easier to just have a glance at what the pipeline should look like:

        Pipeline(steps=[('name_of_preprocessor', preprocessor),
                        ('name_of_ml_model', ml_model())])
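
To make the `steps` structure concrete, here is a small sketch with an arbitrary scaler and regressor standing in for `preprocessor` and `ml_model` (both are placeholders in the outline above). Each step can be looked up by the name given in its tuple.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    steps=[
        ("name_of_preprocessor", StandardScaler()),
        ("name_of_ml_model", Ridge()),
    ]
)

# `steps` is the list of (name, estimator) tuples, in execution order.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# `named_steps` gives dictionary-style access to a single step.
print(type(pipe.named_steps["name_of_ml_model"]).__name__)
```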

In [36]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

In [37]:
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant")),
        ("encoder", OrdinalEncoder()),
    ]
)

####  Apply ColumnTransformer

Specify which columns are numeric and which are categorical, so we can apply the transformers accordingly.

We apply the transformers to the features by using `ColumnTransformer`; the resulting object is our preprocessor.

Similar to a pipeline, we pass **a list of tuples**, each composed of *(‘name’, ‘transformer’, ‘features’)*, to the parameter **‘transformers’**.



In [38]:
numeric_features = ["temp", "atemp", "hum", "windspeed"]

categorical_features = [
    "season",
    "mnth",
    "holiday",
    "weekday",
    "workingday",
    "weathersit",
]

preprocessor = ColumnTransformer(
    transformers=[
        (
            "numeric",
            numeric_transformer,
            numeric_features,
        ),  ## Specify, or "map" the num pipeline with the num features
        (
            "categorical",
            categorical_transformer,
            categorical_features,
        ),  ## Specify, or "map" the cat pipeline with the cat features
    ]
)
## Not preferred:
# numeric_features = data.select_dtypes(include=['int64', 'float64']).columns
# categorical_features = data.select_dtypes(include=['object']).drop(['Loan_Status'], axis=1).columns
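
To see what such a preprocessor produces, the sketch below applies the same kind of `ColumnTransformer` to a tiny hand-made DataFrame. The column values are invented for illustration, and only three columns are used to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame(
    {
        "temp": [0.2, 0.4, np.nan, 0.8],
        "hum": [0.5, np.nan, 0.7, 0.9],
        "season": [1.0, 2.0, np.nan, 2.0],
    }
)

numeric = Pipeline([("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())])
categorical = Pipeline(
    [("imputer", SimpleImputer(strategy="constant")), ("encoder", OrdinalEncoder())]
)

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric, ["temp", "hum"]),
        ("categorical", categorical, ["season"]),
    ]
)

# Output columns follow the order of the transformers list.
transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)
```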

In [39]:
# Estimator with Random Forest Regression model
from sklearn.ensemble import RandomForestRegressor

pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),  # preprocessor from the ColumnTransformer
        ("regressor", RandomForestRegressor()),
    ]
)

Here we make another **pipeline** from the preprocessor defined above, which is itself built from two **pipelines**: `numeric_transformer` and `categorical_transformer`.

A `RandomForestRegressor` is added as the final step of the new pipeline.
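
Because the pipelines are nested, every parameter is reachable through a path of step names joined by double underscores. The sketch below rebuilds a reduced version of the pipeline (two feature columns only, an assumption made to keep it short) and changes the numeric imputer's strategy without reconstructing anything.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
)
categorical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="constant")), ("encoder", OrdinalEncoder())]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer, ["temp"]),
        ("categorical", categorical_transformer, ["season"]),
    ]
)
pipeline = Pipeline(
    steps=[("preprocessor", preprocessor), ("regressor", RandomForestRegressor())]
)

# Path: pipeline step -> ColumnTransformer entry -> inner step -> parameter.
param = "preprocessor__numeric__imputer__strategy"
before = pipeline.get_params()[param]
pipeline.set_params(**{param: "median"})
after = pipeline.get_params()[param]
print(before, "->", after)
```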

In [40]:
# Train the model:
rf_model = pipeline.fit(X_train, y_train)
print(rf_model)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numeric',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['temp', 'atemp', 'hum',
                                                   'windspeed']),
                                                 ('categorical',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='constant')),
                                                                  ('encoder',
                                                                   OrdinalEncoder())]),
                        

Amazing and beautiful, isn't it?

The printed representation makes the pipeline's working process clear.

In [41]:
from sklearn.metrics import r2_score

In [42]:
predictions = rf_model.predict(X_test)
r2_pred = r2_score(y_test, predictions)
print(r2_pred)

0.7681713038706116


#### Use the model

To maximise reproducibility, we’d like to use this model repeatedly on new incoming data. Let’s save the model to a pickle file with the `joblib` package.

In [43]:
import joblib

In [44]:
joblib.dump(rf_model, "./rf_model.pkl")

['./rf_model.pkl']

Now we can call this pipeline, which includes all the data preprocessing we need as well as the trained model, whenever we need it:

        # In other notebooks 
        rf_model = joblib.load('PATH/TO/rf_model.pkl')
        
        new_prediction = rf_model.predict(new_data)
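
A self-contained sketch of the save/load round trip, assuming a small synthetic dataset and a temporary directory in place of `PATH/TO`: the reloaded pipeline should give the same predictions as the original.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=60)

model = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("regressor", RandomForestRegressor(n_estimators=20, random_state=0)),
    ]
).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_model.pkl")
    joblib.dump(model, path)  # preprocessing and model are stored together
    restored = joblib.load(path)

new_data = rng.normal(size=(5, 4))
same = np.allclose(model.predict(new_data), restored.predict(new_data))
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.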



### Conclusion

The scikit-learn pipeline makes workflows smoother and more flexible.

For example, you can easily compare the performance of a number of algorithms like:

        # regressor_1, regressor_2, ... are placeholders for scikit-learn regressor classes
        regressors = [
            regressor_1(),
            regressor_2(),
            regressor_3(),
            # ....
        ]

        for regressor in regressors:
            pipeline = Pipeline(steps=[
                ('preprocessor', preprocessor),
                ('regressor', regressor),
            ])
            model = pipeline.fit(X_train, y_train)
            predictions = model.predict(X_test)
            print(regressor)
            # r2_score expects (y_true, y_pred), in that order
            print(f'Model r2 score: {r2_score(y_test, predictions)}')
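
A runnable variant of the loop above, with three regressors picked arbitrarily to fill the placeholders and synthetic data from `make_regression` standing in for the bike-share data (both are assumptions, as is the single `StandardScaler` used as the preprocessor):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

regressors = [
    LinearRegression(),
    DecisionTreeRegressor(random_state=0),
    RandomForestRegressor(n_estimators=50, random_state=0),
]

results = {}
for regressor in regressors:
    pipeline = Pipeline(steps=[("preprocessor", StandardScaler()), ("regressor", regressor)])
    model = pipeline.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # r2_score takes the true values first, then the predictions.
    results[type(regressor).__name__] = r2_score(y_test, predictions)

for name, score in results.items():
    print(f"{name}: r2 = {score:.3f}")
```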

or adjust the preprocessing/transforming methods. For instance, use the ‘median’ value to fill missing values, use a different scaler for numeric features, switch from ordinal encoding to one-hot encoding for categorical features, tune hyperparameters, etc.

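        from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

        numeric_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', MinMaxScaler()),
        ])

        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant')),
            ('encoder', OneHotEncoder()),
        ])

        # Rebuild `preprocessor` (the ColumnTransformer) from the two new
        # transformers before using it here; otherwise the old ones are used.
        pipeline = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('regressor', RandomForestRegressor(n_estimators=300, max_depth=10)),
        ])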
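
One practical detail when switching to one-hot encoding: a category that appears only in new data makes `OneHotEncoder` raise an error by default. A common remedy, shown here as a sketch on invented toy data, is `handle_unknown="ignore"`, which encodes unseen categories as all zeros.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: weathersit 4 never occurs in the training rows.
train = pd.DataFrame({"weathersit": [1, 2, 3, 1]})
new = pd.DataFrame({"weathersit": [2, 4]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# One output column per category seen during fit (1, 2, 3).
encoded = encoder.transform(new).toarray()
print(encoded)
```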

### Pipeline 2: [Iris dataset](https://analyticsindiamag.com/hands-on-tutorial-on-machine-learning-pipelines-with-scikit-learn/) 

In [45]:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [46]:
iris_df = load_iris()

In [47]:
# iris_df.data
# iris_df.feature_names
# iris_df.target.shape
# iris_df.data.shape

In [48]:
# Split the dataset

X_train, X_test, y_train, y_test = train_test_split(
    iris_df.data, iris_df.target, test_size=0.3, random_state=0
)

#### Create the pipeline

In [49]:
pipeline_lr = Pipeline(
    [  # the "steps=" keyword is omitted; the list is passed positionally
        ("scalar_1", StandardScaler()),
        ("pca_1", PCA(n_components=2)),
        ("lr_classifier", LogisticRegression(random_state=0)),
    ]
)

In [50]:
model = pipeline_lr.fit(X_train, y_train)
model.score(X_test, y_test)

0.8666666666666667
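
Individual fitted steps remain accessible after training. As a sketch that repeats the pipeline above in a self-contained form, the PCA step can be inspected to see how much variance the two retained components explain.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

pipeline_lr = Pipeline(
    [
        ("scalar_1", StandardScaler()),
        ("pca_1", PCA(n_components=2)),
        ("lr_classifier", LogisticRegression(random_state=0)),
    ]
)
pipeline_lr.fit(X_train, y_train)

# Fraction of the (scaled) training variance kept by each component.
ratios = pipeline_lr.named_steps["pca_1"].explained_variance_ratio_
print(ratios, ratios.sum())
```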

### Move on: Stacking Multiple Pipelines to Find the Model with the Best Accuracy

*With Iris dataset*

We build a separate pipeline for each algorithm and then fit them all to see which performs better.

In [51]:
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

In [52]:
pipeline_lr = Pipeline(
    [
        ("scalar_1", StandardScaler()),
        ("pca_1", PCA(n_components=2)),
        ("lr_classifier", LogisticRegression(random_state=0)),
    ]
)

pipeline_dt = Pipeline(
    [
        ("scalar_2", StandardScaler()),
        ("pca_2", PCA(n_components=2)),
        ("dt_classifier", DecisionTreeClassifier()),
    ]
)

pipeline_svm = Pipeline(
    [("scalar_3", StandardScaler()), ("pca_3", PCA(n_components=2)), ("clf", svm.SVC())]
)

pipeline_knn = Pipeline(
    [
        ("scalar_4", StandardScaler()),
        ("pca_4", PCA(n_components=2)),
        ("knn_classifier", KNeighborsClassifier()),
    ]
)

In [53]:
pipelines = [pipeline_lr, pipeline_dt, pipeline_svm, pipeline_knn]

pipeline_dict = {
    0: "Logistic Regression",
    1: "Decision Tree",
    2: "Support Vector Machine",
    3: "K Nearest Neighbor",
}
for pipe in pipelines:
    pipe.fit(X_train, y_train)

for i, model in enumerate(pipelines):
    print(f"{pipeline_dict[i]}  Test Accuracy: {model.score(X_test,y_test)}")

Logistic Regression  Test Accuracy: 0.8666666666666667
Decision Tree  Test Accuracy: 0.9111111111111111
Support Vector Machine  Test Accuracy: 0.9333333333333333
K Nearest Neighbor  Test Accuracy: 0.9111111111111111
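
With only 45 test samples, a single split can rank the models somewhat arbitrarily. A hedged alternative, sketched here with 5-fold cross-validation on the full Iris data (the fold count and the name-keyed dictionary are choices made for this example), gives a steadier comparison:

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

classifiers = {
    "Logistic Regression": LogisticRegression(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Support Vector Machine": svm.SVC(),
    "K Nearest Neighbor": KNeighborsClassifier(),
}

cv_means = {}
for name, classifier in classifiers.items():
    pipe = Pipeline(
        [("scaler", StandardScaler()), ("pca", PCA(n_components=2)), ("clf", classifier)]
    )
    # Scaler and PCA are re-fit inside every fold, together with the classifier.
    cv_means[name] = cross_val_score(pipe, X, y, cv=5).mean()

for name, mean_accuracy in cv_means.items():
    print(f"{name}  CV Accuracy: {mean_accuracy:.3f}")
```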


### Go deeper: Hyperparameter Tuning in Pipeline

With pipelines, you can easily perform a grid search over a set of parameters for each step of this meta-estimator to find the best-performing parameters.

To do this, you first need to create a parameter grid for your chosen model.

One important thing to note is that you need to prefix each parameter name with the name of the pipeline step it belongs to, followed by a double underscore. In the code below the pipeline is built with `make_pipeline`, which names each step after its lowercased class name, so the classifier step is called ‘randomforestclassifier’ and `randomforestclassifier__` is added to each parameter.

Next, I create a grid search object that wraps the original pipeline. When I then call fit, a cross-validated grid search is performed over the parameter grid, and the whole pipeline, including any transformations, is re-fit on the training part of each fold.
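
The available parameter names do not have to be guessed. As a small sketch, `get_params()` on a pipeline built with `make_pipeline` lists every `step__parameter` key that a grid may use:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(RandomForestClassifier())

# make_pipeline names the step after the lowercased class name.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Keys of the form "<step>__<parameter>" are valid in a parameter grid.
tunable = sorted(k for k in pipe.get_params() if k.startswith("randomforestclassifier__"))
print(tunable[:5])
```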

In [64]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

In [65]:
pipe = make_pipeline(RandomForestClassifier())
grid_param = [
    {
        "randomforestclassifier": [RandomForestClassifier()],
        "randomforestclassifier__n_estimators": [10, 100, 1000],
        "randomforestclassifier__max_depth": [5, 8, 15, 25, 30, None],
        "randomforestclassifier__min_samples_leaf": [1, 2, 5, 10, 15, 100],
        "randomforestclassifier__max_leaf_nodes": [2, 5, 10],
    }
]

In [66]:
gridsearch = GridSearchCV(pipe, grid_param, cv=5, verbose=0, n_jobs=-1)
best_model = gridsearch.fit(X_train, y_train)

In [67]:
best_model.score(X_test,y_test)

0.9777777777777777

In [68]:
best_model.best_estimator_

Pipeline(steps=[('randomforestclassifier',
                 RandomForestClassifier(max_depth=5, max_leaf_nodes=10,
                                        min_samples_leaf=5))])

In [71]:
best_model.best_params_

{'randomforestclassifier': RandomForestClassifier(max_depth=5, max_leaf_nodes=10, min_samples_leaf=5),
 'randomforestclassifier__max_depth': 5,
 'randomforestclassifier__max_leaf_nodes': 10,
 'randomforestclassifier__min_samples_leaf': 5,
 'randomforestclassifier__n_estimators': 100}

In [73]:
best_model.best_score_

0.9619047619047618
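
Beyond the single best score, `cv_results_` holds the mean cross-validated score of every parameter combination. The sketch below uses a deliberately small grid (an assumption to keep it fast, not the grid from above) and ranks the candidates.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

pipe = make_pipeline(RandomForestClassifier(random_state=0))
small_grid = {
    "randomforestclassifier__n_estimators": [10, 50],
    "randomforestclassifier__max_depth": [2, None],
}

search = GridSearchCV(pipe, small_grid, cv=3).fit(X_train, y_train)

# One row per parameter combination: 2 x 2 = 4 candidates.
results = pd.DataFrame(search.cv_results_)
ranked = results.sort_values("rank_test_score")[["params", "mean_test_score"]]
print(ranked)
```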

In [78]:
best_model

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('randomforestclassifier',
                                        RandomForestClassifier())]),
             n_jobs=-1,
             param_grid=[{'randomforestclassifier': [RandomForestClassifier(max_depth=5,
                                                                            max_leaf_nodes=10,
                                                                            min_samples_leaf=5)],
                          'randomforestclassifier__max_depth': [5, 8, 15, 25,
                                                                30, None],
                          'randomforestclassifier__max_leaf_nodes': [2, 5, 10],
                          'randomforestclassifier__min_samples_leaf': [1, 2, 5,
                                                                       10, 15,
                                                                       100],
                          'randomforestclassifier__n_estimators': [10, 

### Conclusion

This is a basic pipeline implementation. In a real-life data science scenario, the data would need to be prepared first, and the pipeline would then be applied for the remaining steps.

Pipelines are for building machine learning models quickly and efficiently. They are in high demand because they help with writing better code and with extending it in big data projects.

They automate the applied machine learning workflow and save the time otherwise invested in redundant preprocessing work.

## Cross validation application

## \__init__ and self(def and Class)

In [4]:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
sss.get_n_splits(X, y)

print(sss)

for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]


StratifiedShuffleSplit(n_splits=5, random_state=0, test_size=0.2,
            train_size=None)
TRAIN: [2 5 1 3] TEST: [0 4]
TRAIN: [0 4 3 1] TEST: [2 5]
TRAIN: [0 4 3 1] TEST: [2 5]
TRAIN: [5 4 1 2] TEST: [0 3]
TRAIN: [1 5 2 4] TEST: [0 3]
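
A splitter such as `StratifiedShuffleSplit` can also be handed directly to `cross_val_score` through its `cv` argument. A minimal sketch, assuming the Iris data and a decision tree (neither appears in the cell above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Every split keeps the class proportions of y in both train and test parts.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=sss)
print(scores)
```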


In [24]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [25]:
titanic = pd.read_csv("../data/titanic.csv")
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [26]:
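titanic.info  # without parentheses this shows the bound method's repr; call titanic.info() for the column summary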

<bound method DataFrame.info of      survived  pclass     sex   age  sibsp  parch     fare embarked   class  \
0           0       3    male  22.0      1      0   7.2500        S   Third   
1           1       1  female  38.0      1      0  71.2833        C   First   
2           1       3  female  26.0      0      0   7.9250        S   Third   
3           1       1  female  35.0      1      0  53.1000        S   First   
4           0       3    male  35.0      0      0   8.0500        S   Third   
..        ...     ...     ...   ...    ...    ...      ...      ...     ...   
886         0       2    male  27.0      0      0  13.0000        S  Second   
887         1       1  female  19.0      0      0  30.0000        S   First   
888         0       3  female   NaN      1      2  23.4500        S   Third   
889         1       1    male  26.0      0      0  30.0000        C   First   
890         0       3    male  32.0      0      0   7.7500        Q   Third   

       who  adult_m

In [27]:
X = titanic[["pclass", "age", "sex"]].copy()  # explicit copy, so the edit below does not act on a slice of `titanic`
y = titanic["survived"]

In [28]:
# Fill missing ages with the mean age; plain assignment avoids SettingWithCopyWarning
X["age"] = X["age"].fillna(X["age"].mean())
# X.info


In [29]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
X

Unnamed: 0,pclass,age,sex
0,3,22.000000,male
1,1,38.000000,female
2,3,26.000000,female
3,1,35.000000,female
4,3,35.000000,male
...,...,...,...
886,2,27.000000,male
887,1,19.000000,female
888,3,29.699118,female
889,1,26.000000,male


In [30]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33
)
# DictVectorizer expects a list of {column: value} dicts, one per row
X_train = X_train.to_dict(orient="records")
X_test = X_test.to_dict(orient="records")
# Convert non-numeric data to numeric data
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

clf = Pipeline(
    [("vecd", DictVectorizer(sparse=False)), ("dtc", DecisionTreeClassifier())]
)

clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)
from sklearn.metrics import classification_report

print(clf.score(X_test, y_test))
# classification_report is documented as (y_true, y_pred). The predictions are
# passed first here, so the precision and recall columns below are swapped
# and "support" counts predicted labels.
print(classification_report(y_predict, y_test, target_names=["died", "survived"]))

0.8340807174887892
              precision    recall  f1-score   support

        died       0.90      0.84      0.87       143
    survived       0.74      0.82      0.78        80

    accuracy                           0.83       223
   macro avg       0.82      0.83      0.82       223
weighted avg       0.84      0.83      0.84       223
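
An alternative sketch that keeps the data as a DataFrame: the imputation and the encoding both move inside the pipeline, so the mean age is learned from the rows passed to `fit` only. The rows below are invented stand-ins for `titanic.csv`, and `OneHotEncoder` replaces `DictVectorizer`; this is one possible design, not the approach used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical miniature dataset with the same three feature columns.
X = pd.DataFrame(
    {
        "pclass": [3, 1, 3, 1, 3, 2, 1, 3],
        "age": [22.0, 38.0, 26.0, 35.0, np.nan, 27.0, 19.0, np.nan],
        "sex": ["male", "female", "female", "female", "male", "male", "female", "female"],
    }
)
y = pd.Series([0, 1, 1, 1, 0, 0, 1, 0])

preprocessor = ColumnTransformer(
    transformers=[
        ("age", SimpleImputer(strategy="mean"), ["age"]),
        ("sex", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ],
    remainder="passthrough",  # pclass is passed through unchanged
)

clf = Pipeline([("prep", preprocessor), ("dtc", DecisionTreeClassifier(random_state=0))])
clf.fit(X, y)

predictions = clf.predict(X)
print(predictions)
```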



