I'm sure you have run the standard data analysis workflow many times: preprocessing, modeling, and validating, which can take quite a bit of time. What if I told you that you can automate the process and arrive at an accuracy value with a few lines of code? Introducing Pipeline from sklearn. This tool lets you chain all the steps into one block of code, a sort of auto analysis. It is a great way to pick out the top performing model before diving deeper into the analysis.

First, we'll import the basic libraries.

In [154]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

Import the data set and specify the dependent and independent variables.

In [155]:
#import dataset and shuffle the rows
df = pd.read_csv('C:/Users/dhuan/Desktop/Datasets/heart.csv')
df = df.sample(frac=1).reset_index(drop=True)
df.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,60,1,0,117,230,1,1,160,1,1.4,2,2,3,0
1,59,1,1,140,221,0,1,164,1,0.0,2,0,2,1
2,60,1,0,140,293,0,0,170,0,1.2,1,2,3,0
3,35,0,0,138,183,0,1,182,0,1.4,2,0,2,1
4,69,0,3,140,239,0,1,151,0,1.8,2,2,2,1


In [156]:
y = df['target']
X = df[['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang',
       'oldpeak', 'slope', 'ca', 'thal']]

We'll handle the preprocessing part first by specifying separate preprocessing steps for the numerical data and the categorical data. It's quite easy to add additional steps: just give the step a name and list the transformer.

In [157]:
#import the pipeline
from sklearn.pipeline import Pipeline

#select preprocessing methods
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

#numerical pipeline
num_trans = Pipeline(steps = [('imputer', KNNImputer(n_neighbors = 2, weights = 'uniform')),
                              ('scaler', StandardScaler(copy = True))])

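#categorical pipeline
#most-frequent imputation keeps every filled value a valid category (averaging 2 neighbors could give values such as 0.5)
cat_trans = Pipeline(steps = [('imputer', SimpleImputer(strategy = 'most_frequent')),
                              ('onehot', OneHotEncoder(drop='first'))])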
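
To illustrate how an extra step slots in, here is a minimal sketch that appends a third named step to the numerical pipeline. The toy array, the name `num_trans_extra`, and the choice of `PolynomialFeatures` are assumptions made for illustration; they are not part of the analysis in this post.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

# toy numerical data with one missing value (illustrative only)
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0],
                  [4.0, 40.0]])

# the two steps of the numerical pipeline, plus one extra named step
num_trans_extra = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=2, weights='uniform')),
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
])

X_out = num_trans_extra.fit_transform(X_toy)
print(X_out.shape)  # 2 input columns expand to 5 polynomial terms
```

Each step is just a `(name, transformer)` tuple, so adding or removing one is a one-line change.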

Specify the numerical and categorical columns and apply the transformation.

In [158]:
numeric_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'ca']
categorical_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']

from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_trans, numeric_features),
        ('cat', cat_trans, categorical_features)])
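
To get a feel for what a ColumnTransformer produces, the following sketch applies one to a tiny hand-made frame. The frame `toy`, the name `toy_preprocessor`, and the `sparse_threshold=0` setting (used here only to force a dense array for easy inspection) are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# toy frame: one numerical column and one categorical column
toy = pd.DataFrame({'age': [35, 59, 60, 69],
                    'cp':  [0, 1, 0, 3]})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age']),
        ('cat', OneHotEncoder(drop='first'), ['cp'])],
    sparse_threshold=0)  # always return a dense array

out = toy_preprocessor.fit_transform(toy)

# 1 scaled column + 2 dummy columns (cp has 3 levels, the first is dropped)
print(out.shape)
```

The transformed columns come out in the order the transformers are listed, numerical first and then the one-hot columns.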

Import all the models to be compared and also the validation method. 

In [159]:
#select models 
from sklearn.svm import SVC
from sklearn.svm import LinearSVC

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

#select validation method
from sklearn.model_selection import cross_val_score

Put all the models into a list so we can loop through them. Also list their names as strings in a second list, in the same order; we'll need it for the performance table later on.

In [160]:
#list all the models to be fitted
classifiers = [SVC(),
               LinearSVC(),
               LogisticRegression(),
               GaussianNB(),
               RandomForestClassifier(),
               GradientBoostingClassifier(),
               MLPClassifier(max_iter=2000),
               KNeighborsClassifier(n_neighbors = 6)]

#string values of the models
classifiers_names = ['SVC',
                     'LinearSVC',
                     'LogisticRegression', 
                     'GaussianNB', 
                     'RandomForestClassifier', 
                     'GradientBoostingClassifier',
                     'MLPClassifier',
                     'KNeighborsClassifier']
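
Keeping two lists in sync by hand is easy to get wrong. As a possible alternative, the names can be read off the model objects themselves. The sketch below uses a shortened list, and the names `demo_classifiers` and `demo_names` are made up for illustration.

```python
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# a shortened model list for illustration
demo_classifiers = [SVC(),
                    LogisticRegression(),
                    KNeighborsClassifier(n_neighbors=6)]

# read each display name off the object's class
demo_names = [type(clf).__name__ for clf in demo_classifiers]
print(demo_names)  # ['SVC', 'LogisticRegression', 'KNeighborsClassifier']
```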

Create an empty list to store the performance metrics.

In [161]:
#empty list to hold performance of all models
val_storage = []

Our workflow consists of three steps: preprocessing, modelling, and validation. The first two are chained in the Pipeline object, and cross_val_score validates the whole chain. As you can see, it's a relatively short block of code. The pipeline:

In [162]:
#loop through all models and run each one according to the pipeline steps
for model in classifiers:
    #list steps
    steps = [('preprocess', preprocessor),
             ('model', model)]

    #build the pipeline
    pipeline = Pipeline(steps)

    #performance metrics: mean accuracy over 4 folds; the +/- value is the best fold minus the mean
    scores = cross_val_score(pipeline, X, y, cv=4)
    avg_score = scores.mean()
    performance = str(round(avg_score*100, 2)) + ' % +/- ' + str(round((scores.max()-avg_score)*100, 2)) + ' %'
    val_storage.append(performance)
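
The loop above reports the spread as the best fold minus the mean. A common alternative is the standard deviation across folds. The sketch below shows that variant on synthetic data; `make_classification` stands in for the heart dataset, and the names `demo_pipeline`, `demo_scores`, and `summary` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the heart data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_pipeline = Pipeline([('scaler', StandardScaler()),
                          ('model', LogisticRegression())])

# one accuracy value per fold; cv=4 matches the loop above
demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=4)

# summarise the folds with mean and standard deviation
summary = f"{demo_scores.mean() * 100:.2f} % +/- {demo_scores.std() * 100:.2f} %"
print(summary)
```

Because the scaler sits inside the pipeline, it is refit on the training portion of every fold, so no information from the held-out fold leaks into preprocessing.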

Create and display the performance dataframe.

In [163]:
#display performance 
df_metric = pd.DataFrame(data = {'Models:' : classifiers_names, 
                                 'Accuracy:' : val_storage})
df_metric

,Models:,Accuracy:
0,SVC,83.48 % +/- 3.36 %
1,LinearSVC,83.82 % +/- 1.7 %
2,LogisticRegression,83.16 % +/- 2.36 %
3,GaussianNB,74.2 % +/- 12.64 %
4,RandomForestClassifier,80.51 % +/- 3.7 %
5,GradientBoostingClassifier,81.16 % +/- 8.31 %
6,MLPClassifier,77.21 % +/- 9.63 %
7,KNeighborsClassifier,79.86 % +/- 1.71 %


From here, we can choose the top performing model and dive deeper. This is also a quick way to check whether the dataset has any predictive signal at all.
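
Picking the top model is easier when the scores are kept as numbers rather than formatted strings. Here is a minimal sketch of that idea; the model names and accuracy values are hypothetical placeholders, not results from the run above.

```python
import pandas as pd

# hypothetical mean accuracies in percent (placeholders, not real results)
demo_metric = pd.DataFrame({'Model': ['ModelA', 'ModelB', 'ModelC'],
                            'MeanAccuracy': [81.2, 83.8, 79.9]})

# sort so the best model comes first
ranked = demo_metric.sort_values('MeanAccuracy', ascending=False).reset_index(drop=True)
best_model = ranked.loc[0, 'Model']
print(best_model)  # ModelB
```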