<h1 align="center"> 
Exercise_6
</h1>



## Note
- Complete the missing parts indicated by # Implement me


## Objective
- Learn how to use a `Pipeline` to sequentially apply a list of transforms followed by a final estimator

## Overview
The only difference between this exercise and exercise 5 is the following:
- In exercise 5, we manually standardized the data using `StandardScaler` right after splitting the data into training and testing sets.
- Here, we wrap `StandardScaler` in a pipeline, so we do not have to standardize the data manually (the pipeline takes care of it when fitting the data).
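
To make the comparison concrete, here is a minimal sketch of both approaches side by side. It uses synthetic data from `make_classification` as a stand-in (an assumption for illustration; the exercise itself uses the hepatitis data), and the step names `scaler` and `clf` are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the hepatitis data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Exercise 5 style: standardize manually, then fit the classifier
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # fit on training data only
X_test_std = scaler.transform(X_test)        # reuse the training statistics
manual_clf = SVC(kernel='linear', random_state=0).fit(X_train_std, y_train)
manual_pred = manual_clf.predict(X_test_std)

# Exercise 6 style: the pipeline standardizes internally
pipe_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC(kernel='linear', random_state=0)),
])
pipe_clf.fit(X_train, y_train)
pipe_pred = pipe_clf.predict(X_test)

# Both approaches perform the same computation, so the predictions match
same_predictions = bool(np.array_equal(manual_pred, pipe_pred))
print(same_predictions)
```

The pipeline calls `fit_transform` on each transform during `fit` and only `transform` during `predict` or `score`, which mirrors the manual steps.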

## Load the Hepatitis Data

In [1]:
import warnings
warnings.filterwarnings('ignore')
    
import pandas as pd

# Load the data
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data', header=None)

# Specify the name of the columns
df.columns = ['Target', 'AGE', 'SEX', 'STEROID', 'ANTIVIRALS', 'FATIGUE', 'MALAISE', 'ANOREXIA', 'LIVER BIG', 'LIVER FIRM', 'SPLEEN PALPABLE', 'SPIDERS', 'ASCITES', 'VARICES', 'BILIRUBIN', 'ALK PHOSPHATE', 'SGOT', 'ALBUMIN', 'PROTIME', 'HISTOLOGY']

# Show the header and the first five rows
df.head()

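|   | Target | AGE | SEX | STEROID | ANTIVIRALS | FATIGUE | MALAISE | ANOREXIA | LIVER BIG | LIVER FIRM | SPLEEN PALPABLE | SPIDERS | ASCITES | VARICES | BILIRUBIN | ALK PHOSPHATE | SGOT | ALBUMIN | PROTIME | HISTOLOGY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 30 | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 1.0 | 85 | 18 | 4.0 | ? | 1 |
| 1 | 2 | 50 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 0.9 | 135 | 42 | 3.5 | ? | 1 |
| 2 | 2 | 78 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0.7 | 96 | 32 | 4.0 | ? | 1 |
| 3 | 2 | 31 | 1 | ? | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0.7 | 46 | 52 | 4.0 | 80 | 1 |
| 4 | 2 | 34 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1.0 | ? | 200 | 4.0 | ? | 1 |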


## Remove rows with missing values

In [2]:
import numpy as np

print('Number of rows before removing rows with missing values: ' + str(df.shape[0]))

# Replace ? with np.nan (the alias np.NaN was removed in NumPy 2.0)
# Implement me
df = df.replace('?', np.nan)

# Remove rows with np.nan
# Implement me
df = df.dropna(how='any')

print('Number of rows after removing rows with missing values: ' + str(df.shape[0]))

Number of rows before removing rows with missing values: 155
Number of rows after removing rows with missing values: 80
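
As a small illustration of the cleaning step, the sketch below applies the same `replace` and `dropna` calls to a three-row toy table. The toy values and the reduced column list are assumptions made up for this example. It also shows a possible alternative, passing `na_values='?'` to `pd.read_csv`, which marks the placeholders as missing while the file is parsed.

```python
import io

import numpy as np
import pandas as pd

# Toy CSV text with '?' placeholders (assumption: illustrative values only)
csv_text = "2,30,1.0,?\n1,50,?,80\n2,34,0.7,75\n"
names = ['Target', 'AGE', 'BILIRUBIN', 'PROTIME']

# Approach used in this exercise: load, replace '?', then drop incomplete rows
toy = pd.read_csv(io.StringIO(csv_text), header=None)
toy.columns = names
cleaned = toy.replace('?', np.nan).dropna(how='any')

# Alternative: let read_csv treat '?' as missing while parsing
toy_na = pd.read_csv(io.StringIO(csv_text), header=None, names=names,
                     na_values='?')
cleaned_na = toy_na.dropna(how='any')

print(cleaned.shape, cleaned_na.shape)
print(cleaned_na.dtypes)
```

With `na_values='?'`, the affected columns are parsed as floats directly, whereas after `replace` they still hold the original strings until something converts them to numbers.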


## Get the feature and target vector

In [3]:
# Specify the name of the target
target = 'Target'

# Get the target vector
# Implement me
y = df[target].values

# Specify the name of the features
features = list(df.drop(target, axis=1).columns)

# Get the feature vector
# Implement me
X = df[features].values

## Divide the data into training and testing

In [4]:
from sklearn.model_selection import train_test_split

# Randomly choose 30% of the data for testing (set random_state as 0 and stratify as y)
# Implement me
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
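
The comment above asks for a stratified 30% test split. The sketch below shows what `test_size=0.3` and `stratify` do on a toy label vector. The 80 samples and the 20/80 class balance are assumptions chosen for illustration, not the class balance of the hepatitis data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumption): 16 samples of class 1, 64 of class 2
y_toy = np.array([1] * 16 + [2] * 64)
X_toy = np.arange(80).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

# 30% of 80 samples go to the test set
print(len(y_tr), len(y_te))

# With stratify, the share of class 1 is roughly the same in both splits
train_share = float(np.mean(y_tr == 1))
test_share = float(np.mean(y_te == 1))
print(round(train_share, 2), round(test_share, 2))
```

If `test_size` is omitted, `train_test_split` falls back to its default of 25% for the test set.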

## Standardize the features
This part is not necessary for this exercise (since pipeline will take care of it)

In [5]:
# from sklearn.preprocessing import StandardScaler

# # Declare the StandardScaler
# std_scaler = StandardScaler()

# # Standardize the features in the training data
# X_train = std_scaler.fit_transform(X_train)

# # Standardize the features in the testing data
# X_test = std_scaler.transform(X_test)

## Fit SVM using different settings for the hyperparameters
Here we wrap `StandardScaler` in the pipeline, so we do not have to standardize the data manually.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The list of values for hyperparameter C (penalty parameter)
Cs = [0.01, 0.1, 1]

# The list of choices for hyperparameter kernel
kernels = ['linear', 'rbf', 'sigmoid']

# The list of [score, setting], where score is the score of the classifier and setting is a pair of (C, kernel)
score_settings = []

# For each C
for C in Cs:
    # For each kernel
    for kernel in kernels:
        # Declare the classifier with hyperparameter C, kernel, class_weight, and random_state
        # Implement me
        clf = SVC(C=C, kernel=kernel, class_weight='balanced', random_state=0)
        
        # The pipeline, with StandardScaler and clf defined above
        # Implement me
        pipe_clf = Pipeline([('StandardScaler', StandardScaler()), ('clf', clf)])

        # Fit the pipeline
        # Implement me
        pipe_clf.fit(X_train, y_train)

        # Get the score (rounding to two decimal places)
        score = round(pipe_clf.score(X_test, y_test), 2)
        
        # Get the setting, which is a pair of (C, kernel)
        # Implement me
        setting = [C, kernel]

        # Append [score, setting] to score_settings
        # Implement me
        score_settings.append([score, setting])
        
# Sort score_settings in descending order of score
# Implement me
score_settings = sorted(score_settings, key=lambda x: x[0], reverse=True)

# Print score_settings
print('The list of [score, setting] is:')
for score_setting in score_settings:
    print(score_setting)
print()

# Print the best setting
print('The best setting is:')
print('C: ' + str(score_settings[0][1][0]))
print('kernel: ' + score_settings[0][1][1])

The list of [score, setting] is:
[0.83, [0.01, 'linear']]
[0.83, [1, 'rbf']]
[0.79, [0.1, 'linear']]
[0.75, [0.1, 'sigmoid']]
[0.75, [1, 'linear']]
[0.75, [1, 'sigmoid']]
[0.38, [0.1, 'rbf']]
[0.17, [0.01, 'rbf']]
[0.17, [0.01, 'sigmoid']]

The best setting is:
C: 0.01
kernel: linear
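
The loop above builds a new pipeline for every setting. As a possible variation, a single pipeline can be reused by updating the hyperparameters of a step through `set_params`, using the `<step name>__<parameter>` naming convention. The sketch below assumes synthetic data from `make_classification` instead of the hepatitis data, so its scores will differ from the ones listed above.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the hepatitis data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Build the pipeline once; the step names match those used in the exercise
pipe_clf = Pipeline([
    ('StandardScaler', StandardScaler()),
    ('clf', SVC(class_weight='balanced', random_state=0)),
])

score_settings = []
for C, kernel in product([0.01, 0.1, 1], ['linear', 'rbf', 'sigmoid']):
    # 'clf__C' addresses parameter C of the step named 'clf'
    pipe_clf.set_params(clf__C=C, clf__kernel=kernel)
    pipe_clf.fit(X_train, y_train)
    score = round(pipe_clf.score(X_test, y_test), 2)
    score_settings.append([score, [C, kernel]])

# Sort in descending order of score
score_settings.sort(key=lambda x: x[0], reverse=True)
for score_setting in score_settings:
    print(score_setting)
```

Calling `fit` again refits every step from scratch, so reusing the pipeline object does not leak state between settings.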


## Discussion

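The results above are exactly the same as those in exercise 5, because the pipeline fits `StandardScaler` on the training data and applies the same transformation to the testing data, which is what we did manually before. A pipeline is particularly convenient when several transforms have to be applied to the data.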
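
As a rough sketch of the several-transforms case, the pipeline below chains an imputer, a scaler, and the classifier. Everything here is an assumption for illustration: the data is synthetic, missing values are injected at random, and `SimpleImputer` with a median strategy is swapped in where this exercise simply drops incomplete rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with about 5% of the entries set to NaN (assumption)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Transforms run in the listed order; the last step is the estimator
pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('clf', SVC(C=1, kernel='rbf', class_weight='balanced', random_state=0)),
])

# A single fit call learns the medians, the scaling statistics, and the SVM
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
print(round(score, 2))
```

Each transform is fitted on the training data only and then reused unchanged on the testing data, which keeps information from the test set out of the preprocessing.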