## 5. Using Pipeline
If you have not used pipelines before, convert your data preparation, feature engineering, and modeling steps into a `Pipeline`. This will make deployment easier.

The goal here is to create a pipeline that takes one row of our dataset and predicts whether the loan will be granted. Note that `LinearSVC`, the model used below, returns a class label from `predict` rather than a probability.

`pipeline.predict(x)`
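
As a minimal sketch of what a single-row call looks like, the example below fits a small pipeline on made-up data and predicts for one row. The toy values and the names `toy_X`, `toy_y`, and `toy_pipeline` are assumptions for illustration; only the column names come from the loan dataset. Because `LinearSVC` has no `predict_proba`, the sketch wraps it in `CalibratedClassifierCV`, which is one possible way to obtain probabilities and is not what the cells below do.

```python
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

# Toy data (hypothetical values); only the column names mirror the loan dataset.
toy_X = pd.DataFrame({
    'Married': ['Yes', 'No'] * 10,
    'Property_Area': ['Urban', 'Rural', 'Rural', 'Urban'] * 5,
})
toy_y = pd.Series([1, 0] * 10)

toy_pipeline = Pipeline([
    ('preprocessing', ColumnTransformer([
        ('categorical', OneHotEncoder(handle_unknown='ignore'),
         ['Married', 'Property_Area']),
    ])),
    # LinearSVC alone has no predict_proba, so it is wrapped in a calibrator here.
    ('model', CalibratedClassifierCV(LinearSVC(random_state=0), cv=3)),
])
toy_pipeline.fit(toy_X, toy_y)

# Double brackets keep the row as a one-row DataFrame with its column names.
one_row = toy_X.iloc[[0]]
label = toy_pipeline.predict(one_row)        # class label, shape (1,)
proba = toy_pipeline.predict_proba(one_row)  # class probabilities, shape (1, 2)
print(label, proba)
```

Passing a one-row DataFrame rather than a Series or a bare list matters because the `ColumnTransformer` selects columns by name.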

In [1]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import pandas as pd
import numpy as np

from sklearn.svm import LinearSVC

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

In [37]:
# Load the cleaned loans data and drop the leftover index column from the CSV export
df = pd.read_csv(r'C:\Users\k_mah\Documents\miniproject4-master\data\cleanloans.csv')
df = df.drop(columns='Unnamed: 0')

In [38]:
df.head()

| | Gender | Married | Dependents | Education | Self_Employed | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | total_income | LoanRatio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | Yes | 1 | Graduate | No | 4.85203 | 360.0 | 1.0 | Rural | N | 8.714568 | 0.556772 |
| 1 | Male | Yes | 0 | Graduate | Yes | 4.189655 | 360.0 | 1.0 | Urban | Y | 8.006368 | 0.52329 |
| 2 | Male | Yes | 0 | Not Graduate | No | 4.787492 | 360.0 | 1.0 | Urban | Y | 8.505323 | 0.562882 |
| 3 | Male | No | 0 | Graduate | No | 4.94876 | 360.0 | 1.0 | Urban | Y | 8.699515 | 0.568855 |
| 4 | Male | Yes | 2 | Graduate | Yes | 5.587249 | 360.0 | 1.0 | Urban | Y | 9.170872 | 0.609239 |


In [39]:
# The models need Loan_Status as a binary target, so encode it first (N -> 0, Y -> 1)
df['Loan_Status'] = df['Loan_Status'].map({'N': 0, 'Y': 1})
y = df['Loan_Status']
df = df.drop(columns='Loan_Status')

In [40]:
#Now we can assign the rest of the dataframe as the training variables
X = df

#And split our test and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=27, stratify=y)

In [41]:
#numeric_transform = Pipeline([('scaling', StandardScaler())])
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4
categorical_transform = Pipeline([('one-hot-encode', OneHotEncoder(sparse_output=False))])

In [42]:
# (name, transformer, list of column names)
# Note: with the numeric transformer commented out, the numeric columns are dropped,
# because ColumnTransformer defaults to remainder='drop'.
preprocessing_loans = ColumnTransformer([
    # ('numeric', numeric_transform, ['LoanAmount', 'Loan_Amount_Term', 'total_income', 'LoanRatio']),
    ('categorical', categorical_transform, ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Credit_History', 'Property_Area'])])
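
To illustrate how `ColumnTransformer` treats columns that no transformer lists, here is a small sketch on made-up data. The frame `toy` and both transformer names are hypothetical; the point is only that columns not named in any transformer are discarded under the default `remainder='drop'`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (hypothetical values) with one categorical and two numeric columns.
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'LoanAmount': [4.85, 4.19, 4.79],
    'total_income': [8.71, 8.01, 8.51],
})

# Only the categorical column is listed, so the numeric columns are dropped.
categorical_only = ColumnTransformer([
    ('categorical', OneHotEncoder(), ['Gender']),
])

# Listing the numeric columns as well keeps them in the output.
with_numeric = ColumnTransformer([
    ('numeric', StandardScaler(), ['LoanAmount', 'total_income']),
    ('categorical', OneHotEncoder(), ['Gender']),
])

shape_categorical_only = categorical_only.fit_transform(toy).shape
shape_with_numeric = with_numeric.fit_transform(toy).shape
print(shape_categorical_only)  # (3, 2): two one-hot columns for Gender
print(shape_with_numeric)      # (3, 4): two scaled numeric + two one-hot columns
```

Setting `remainder='passthrough'` is another way to keep unlisted columns, at the cost of leaving them untransformed.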

In [43]:
pipeline_loans = Pipeline([('preprocessing', preprocessing_loans),
                            ('scale', StandardScaler()),
                            ('pca', PCA()),
                            ('linearsvc', LinearSVC(random_state=0, C=0.01, max_iter = 6000))])
pipeline_loans.fit(X_train, y_train)

In [44]:
# Find the best hyperparameters using GridSearchCV on the train set
param_grid = {
    'linearsvc__max_iter': [1000, 5000, 10000],
    'linearsvc__tol': [1e-15, 1e-14],
    'linearsvc__C': [0.1, 1, 10, 100, 1000],
    'pca__n_components': [3, 5],
}
grid = GridSearchCV(pipeline_loans, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

# After fitting, the best parameter combination is refit on the full training set
best_model = grid.best_estimator_
best_hyperparams = grid.best_params_
# grid.score evaluates that refit model on the held-out test set
best_acc = grid.score(X_test, y_test)
print(f'Best test set accuracy: {best_acc}\nAchieved with hyperparameters: {best_hyperparams}')



Best test set accuracy: 0.8319327731092437
Achieved with hyperparameters: {'linearsvc__C': 0.1, 'linearsvc__max_iter': 1000, 'linearsvc__tol': 1e-15, 'pca__n_components': 5}
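
The number printed above is the accuracy on the held-out test set, which is a different quantity from the cross-validated accuracy that `GridSearchCV` uses to pick the hyperparameters. The sketch below shows both on synthetic data from `make_classification`; the data, the reduced grid, and all `toy_` names are assumptions for illustration and do not reproduce the loan results.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data, not the loan dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=27, stratify=y_toy)

toy_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA()),
    ('linearsvc', LinearSVC(random_state=0, max_iter=5000)),
])
toy_grid = GridSearchCV(
    toy_pipe,
    param_grid={'pca__n_components': [3, 5], 'linearsvc__C': [0.1, 1]},
    cv=5,
)
toy_grid.fit(X_tr, y_tr)

cv_accuracy = toy_grid.best_score_          # mean accuracy over the 5 CV folds
test_accuracy = toy_grid.score(X_te, y_te)  # accuracy of the refit model on the test set
print(f'CV accuracy: {cv_accuracy:.3f}, test accuracy: {test_accuracy:.3f}')
```

Reporting both values makes it easier to spot a model that looks good in cross-validation but generalizes poorly.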




In [45]:
import pickle

# save the model to disk
with open('model.sav', 'wb') as f:
    pickle.dump(grid.best_estimator_, f)

# load saved model
with open('model.sav', 'rb') as f:
    loaded_model = pickle.load(f)
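
A quick way to check a saved pipeline is to confirm that the restored object gives the same prediction as the original for one row. The sketch below does this with a small stand-in pipeline on random data and an in-memory round trip (`pickle.dumps` / `pickle.loads`) instead of a file; `toy_model`, `restored`, and the generated values are assumptions for illustration.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Random stand-in data (hypothetical); column names borrowed from the loan dataset.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    'LoanAmount': rng.normal(5.0, 0.5, 40),
    'total_income': rng.normal(8.5, 0.4, 40),
})
toy_y = (toy_X['total_income'] > toy_X['total_income'].median()).astype(int)

toy_model = Pipeline([
    ('scale', StandardScaler()),
    ('linearsvc', LinearSVC(random_state=0)),
])
toy_model.fit(toy_X, toy_y)

# Round trip through pickle in memory, equivalent to dump/load with a file.
restored = pickle.loads(pickle.dumps(toy_model))

one_row = toy_X.iloc[[0]]  # one-row DataFrame
original_pred = toy_model.predict(one_row)
restored_pred = restored.predict(one_row)
# decision_function gives the signed distance to the separating hyperplane.
restored_score = restored.decision_function(one_row)
print(original_pred, restored_pred, restored_score)
```

Pickled scikit-learn models should be loaded with the same library version that saved them, and pickle files should only be loaded from trusted sources.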