### Why we need to serialize or save models
* Serialization is the process of converting an object into a stream of bytes so that it can be stored in memory, a database, or a file, or transmitted elsewhere. Its main purpose is to save the state of an object so that it can be recreated when needed.

* We need serialization because models take time to train. We may train many models with different seed values and different data, and saving them lets us reproduce the same results later without retraining.

* We need the saved state of a trained model in order to deploy it, so that a business or an end user/customer can use it to make predictions.

* Think of a weather prediction model you created. Once you have trained a model that predicts the weather accurately, its weights and hyperparameters should be saved in a format from which the model can be loaded and is ready to predict right away.

### Pickle

In [None]:
from IPython.display import Image
Image("./picklerick.jpg")

* Pickling is the process whereby a Python object hierarchy is converted into a byte stream (usually not human readable).

* Unpickling is the reverse operation, whereby a byte stream is converted back into a working Python object hierarchy.

* Pickle is operationally the simplest way to store an object.

* The Python Pickle module is an object-oriented way to store objects directly in a special storage format.

[source: 'https://www.tutorialspoint.com/object_oriented_python/object_oriented_python_serialization.htm' ]

### What can it do?
* Pickle can store and reproduce dictionaries and lists very easily.
* It stores an object's attributes and restores them to the same state.
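
As a minimal sketch of the second point, the cell below round-trips an object with attributes through pickle. `types.SimpleNamespace` is used here only as a stand-in for a user-defined class, and the attribute names and values are made up for illustration.

```python
import pickle
from types import SimpleNamespace

# Stand-in for an instance of a user-defined class (illustrative data)
student = SimpleNamespace(name="Alice", grades={"math": 89, "physics": 92})

# Serialize to bytes, then rebuild a new object from those bytes
restored = pickle.loads(pickle.dumps(student))

print(restored.name, restored.grades)
print(restored == student)   # same attribute values
print(restored is student)   # but a distinct object in memory
```

The restored object carries the same attribute values as the original, but it is a new object rather than a reference to the old one.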

### What can’t pickle do?
* It does not save an object's code, only its attribute values. The class definition must be importable when the object is unpickled.
* It cannot store file handles or connection sockets.
* In short, pickling is a way to store data variables in files and retrieve them later, where the variables can be lists, class instances, etc.
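
The file handle limitation can be seen directly. The sketch below, which uses a throwaway temporary file created only for this illustration, tries to pickle an open handle and records the error pickle raises.

```python
import os
import pickle
import tempfile

# Create a throwaway file so there is a real handle to work with
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

error_message = None
with open(path, "w") as handle:
    try:
        pickle.dumps(handle)        # an open file handle is not picklable
    except TypeError as exc:
        error_message = str(exc)

os.remove(path)
print(error_message)
```

A common workaround is to pickle the file path (a plain string) and reopen the file after loading.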

### To pickle something you must −

<code>import pickle
# Open a file in binary write mode, then write a variable to it, something like:
pickle.dump(mystring, outfile, protocol)</code>
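
A runnable version of the idea above might look like the following sketch. The file name `mystring.pkl` and the temporary directory are assumptions made for the example; any writable path works.

```python
import os
import pickle
import tempfile

mystring = "hello pickle"
path = os.path.join(tempfile.mkdtemp(), "mystring.pkl")  # illustrative location

# Pickle files must be opened in binary mode ("wb" / "rb")
with open(path, "wb") as outfile:
    pickle.dump(mystring, outfile, protocol=pickle.HIGHEST_PROTOCOL)

# Read the object back from the same file
with open(path, "rb") as infile:
    loaded = pickle.load(infile)

print(loaded)
```

The `protocol` argument is optional. `pickle.HIGHEST_PROTOCOL` selects the newest format supported by the running Python version, which older Python versions may not be able to read.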

#### Methods
The pickle interface provides four different methods.

dump() − Serializes an object to an open file (file-like object).

dumps() − Serializes an object to a bytes object (a string in Python 2).

load() − Deserializes from an open file (file-like object).

loads() − Deserializes from a bytes object.

[source: 'https://www.tutorialspoint.com/object_oriented_python/object_oriented_python_serialization.htm' ]

In [None]:
import pickle
#Here's an example dict
grades = { 'Alice': 89, 'Bob': 72, 'Charles': 87 }

#Use dumps to convert the object to a serialized bytes object
serial_grades = pickle.dumps( grades )

#Use loads to de-serialize an object
received_grades = pickle.loads( serial_grades )

In [None]:
received_grades

In [None]:
serial_grades

### Joblib

In [None]:
from sklearn.svm import SVC
from sklearn import datasets
clf = SVC()
X, y= datasets.load_iris(return_X_y=True)
clf.fit(X, y)

In [None]:
import pickle
import numpy as np
s = pickle.dumps(clf)

In [None]:
s

In [None]:
clf2 = pickle.loads(s)

In [None]:
clf2.support_vectors_

In [None]:
y[0]

In [None]:
X[0:1]

In [None]:
clf2.predict(X[0:1])
# np.array([0])
# y[0]

### Use of joblib over pickle

In [None]:
!pip install joblib

In [None]:
from joblib import dump, load

In [None]:
dump(clf2, 'clf2.can')  # `dump` was imported directly from joblib above

In [None]:
clf2 = load('clf2.can') 

In [None]:
clf2.support_vectors_
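
The notebook does not state why joblib is preferred. The usual motivation is that joblib's `dump` and `load` handle objects carrying large NumPy arrays more efficiently, which is typical of fitted models. The sketch below uses a plain dictionary as a hypothetical stand-in for such a model; the file name and the `compress=3` level are illustrative choices.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load

# Hypothetical stand-in for a fitted model that holds a large array
state = {"weights": np.arange(1_000_000, dtype=np.float64), "bias": 0.5}

path = os.path.join(tempfile.mkdtemp(), "state.joblib")  # illustrative location
dump(state, path, compress=3)   # compression is optional
restored = load(path)

print(np.array_equal(state["weights"], restored["weights"]))
print(restored["bias"])
```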

### Sklearn Pipeline - demo1
[source: 'https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976' ]

In [None]:
import pandas as pd 
winedf = pd.read_csv('./winequality-red.csv')
# print(winedf.isnull().sum()) # check for missing data
winedf.head(3)

In [None]:
X=winedf.drop(['quality'],axis=1)
Y=winedf['quality']

In [None]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

In [None]:
steps = [('scaler', StandardScaler()), ('SVM', SVC())]
from sklearn.pipeline import Pipeline
pipeline = Pipeline(steps) # define the pipeline object.

In [None]:
pipeline

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2, random_state=30, stratify=Y)

In [None]:
print(winedf['quality'].value_counts())

In [None]:
parameters = {'SVM__C':[0.001,0.1,10,100,10e5], 'SVM__gamma':[0.1,0.01]}

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)

In [None]:
grid.fit(X_train, y_train)
print("score = %3.2f" %(grid.score(X_test,y_test)))
print(grid.best_params_)

### Using a Pipeline
1. A pipeline chains all the steps, such as transformation and modelling, into a single object.
2. Because the pipeline behaves like one estimator, you can pass it to `GridSearchCV` and perform scaling, modelling, and parameter tuning with the same object.
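
Because a pipeline is a single object, it can also be serialized as a single object, which ties back to the pickle section. The sketch below is an illustration on the iris dataset rather than the wine data, so that it runs without a local CSV. It also shows the `<step name>__<parameter>` naming that the grid search parameters rely on.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()), ("SVM", SVC())])
pipe.fit(X_iris, y_iris)

# Step parameters are addressed as "<step name>__<parameter>"
print(pipe.get_params()["SVM__C"])

# The fitted scaler and the fitted SVM are saved and restored together
restored_pipe = pickle.loads(pickle.dumps(pipe))
same = bool((restored_pipe.predict(X_iris) == pipe.predict(X_iris)).all())
print(same)
```

Saving the whole pipeline, rather than only the final model, means the loaded object applies the same scaling that was learned during training before it predicts.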

### Sklearn Pipeline Activity/Demo -2 
[source: 'https://machinelearningmastery.com/automate-machine-learning-workflows-pipelines-python-scikit-learn/' ]

### Use the Pima Indians dataset to:
1. Create 3 PCA components

2. Select the 6 best features using k-best feature selection

3. Combine the 6 selected features with the 3 PCA components to create a new X

4. Use logistic regression to classify diabetics versus non-diabetics

5. Perform k-fold cross-validation

6. Print the cross-validation score

In [1]:
# Create a pipeline that extracts features from the data then creates a model
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# create feature union
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)

In [2]:
feature_union

FeatureUnion(transformer_list=[('pca', PCA(n_components=3)),
                               ('select_best', SelectKBest(k=6))])

In [3]:
# create pipeline
estimators = []
estimators.append(('feature_union', feature_union))
estimators.append(('logistic', LogisticRegression()))

In [5]:
estimators

[('feature_union',
  FeatureUnion(transformer_list=[('pca', PCA(n_components=3)),
                                 ('select_best', SelectKBest(k=6))])),
 ('logistic', LogisticRegression())]

In [6]:
model = Pipeline(estimators)
# evaluate pipeline
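# random_state is omitted: without shuffle=True it has no effect, and recent
# scikit-learn versions raise an error when it is set on an unshuffled KFold
kfold = KFold(n_splits=10)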
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

0.7773410799726589


### Use custom class in sklearn pipeline

[source: 'https://towardsdatascience.com/pipelines-custom-transformers-in-scikit-learn-the-step-by-step-guide-with-python-code-4a7d9b068156' ]

In [7]:
import pandas as pd
import warnings
import numpy as np

warnings.filterwarnings('ignore')
from sklearn.metrics import mean_squared_error

from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.linear_model import LinearRegression

In [8]:
df = pd.DataFrame(columns=['X1', 'X2', 'y'], data=[
                                                   [1,16,9],
                                                   [4,36,16],
                                                   [1,16,9],
                                                   [2,9,8],
                                                   [3,36,15],
                                                   [2,49,16],
                                                   [4,25,14],
                                                   [5,36,17]
])

### y = X1 + 2 * sqrt(X2)

train = df.iloc[:6]
test = df.iloc[6:]

train_X = train.drop('y', axis=1)
train_y = train.y

test_X = test.drop('y', axis=1)
test_y = test.y


In [9]:
m1 = LinearRegression()
fit1 = m1.fit(train_X, train_y)
preds = fit1.predict(test_X)
print(f"\n{preds}")
print(f"RMSE: {np.sqrt(mean_squared_error(test_y, preds))}\n")


[13.72113586 16.93334467]
RMSE: 0.20274138822160784



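* The predictions above are close to, but do not exactly match, the true values [14, 17]: y depends on sqrt(X2), which a linear model fitted on the raw X2 cannot capture
* Let us transform X2 and fit the model again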
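
Before building the transformer, it helps to confirm that the data really follows the relationship noted in the comment above, y = X1 + 2 * sqrt(X2). The short check below copies the rows from the DataFrame and tests the formula on each one.

```python
import math

# Rows copied from the DataFrame above: (X1, X2, y)
rows = [
    (1, 16, 9), (4, 36, 16), (1, 16, 9), (2, 9, 8),
    (3, 36, 15), (2, 49, 16), (4, 25, 14), (5, 36, 17),
]

# Check y == X1 + 2 * sqrt(X2) for every row
matches = [math.isclose(y, x1 + 2 * math.sqrt(x2)) for x1, x2, y in rows]
print(all(matches))
```

Since every row satisfies the formula, replacing X2 with 2 * sqrt(X2) makes y an exact linear function of the inputs, which is why a linear model can then fit it almost perfectly.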

In [16]:
class ExperimentalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        print('\n>>>>>>>init() called.\n')

    def fit(self, X, y = None):
        print('\n>>>>>>>fit() called.\n')
        return self

    def transform(self, X, y = None):
        print('\n>>>>>>>transform() called.\n')
        X_ = X.copy() # creating a copy to avoid changes to original dataset
        X_.X2 = 2 * np.sqrt(X_.X2)
        return X_

In [17]:
# without input transformation - to validate that we get the same results as before
print("create pipeline 1")
pipe1 = Pipeline(steps=[
                       ('linear_model', LinearRegression())
])
print("fit pipeline 1")
pipe1.fit(train_X, train_y)
print("predict via pipeline 1")
preds1 = pipe1.predict(test_X)
print(f"\n{preds1}")  # should be [13.72113586 16.93334467]
print(f"RMSE: {np.sqrt(mean_squared_error(test_y, preds1))}\n")

create pipeline 1
fit pipeline 1
predict via pipeline 1

[13.72113586 16.93334467]
RMSE: 0.20274138822160784



In [18]:
# with input transformation
print("create pipeline 2")
pipe2 = Pipeline(steps=[
                       ('experimental_trans', ExperimentalTransformer()),    # this will trigger a call to __init__
                       ('linear_model', LinearRegression())
])

# an alternate, shorter syntax to do the above, without naming each step, is:
#pipe2 = make_pipeline(ExperimentalTransformer(), LinearRegression())

print("fit pipeline 2")
pipe2.fit(train_X, train_y)
print("predict via pipeline 2")
preds2 = pipe2.predict(test_X)
print(f"\n{preds2}")  # should be [14. 17.]
print(f"RMSE: {np.sqrt(mean_squared_error(test_y, preds2))}\n")

create pipeline 2

>>>>>>>init() called.

fit pipeline 2

>>>>>>>fit() called.


>>>>>>>transform() called.

predict via pipeline 2

>>>>>>>transform() called.


[14. 17.]
RMSE: 5.17892563931115e-15

