# Model Performance Measures, ML Pipeline and Hyperparameter Tuning

## Can you correctly identify glass type?

## Context:
    
This is the Glass Identification Data Set from UCI. It contains 10 attributes, including the id. The response is the glass type (a discrete variable with 7 possible values).

## Content:

Attribute Information:

Id number: 1 to 214 

RI: refractive index

Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)

Mg: Magnesium

Al: Aluminum

Si: Silicon

K: Potassium

Ca: Calcium

Ba: Barium

Fe: Iron

Type of glass: (class attribute) 
    -- 1 building_windows_float_processed 
    -- 2 building_windows_non_float_processed 
    -- 3 vehicle_windows_float_processed 
    -- 4 vehicle_windows_non_float_processed (none in this database) 
    -- 5 containers 
    -- 6 tableware 
    -- 7 headlamps

## Source:
https://archive.ics.uci.edu/ml/datasets/Glass+Identification

# 1.  Import necessary libraries and load the data

In [13]:
import pandas as pd
import numpy as np
from scipy.stats import zscore
from sklearn.model_selection import train_test_split

df = pd.read_csv('glass.csv')

In [14]:
df.head()

Unnamed: 0,ID,refractive index,Sodium,Magnesium,Aluminum,Silicon,Potassium,Calcium,Barium,Iron,Type
0,1,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,building_windows_float_processed
1,2,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,building_windows_float_processed
2,3,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,building_windows_float_processed
3,4,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0,building_windows_float_processed
4,5,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0,building_windows_float_processed


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 11 columns):
ID                  214 non-null int64
refractive index    214 non-null float64
Sodium              214 non-null float64
Magnesium           214 non-null float64
Aluminum            214 non-null float64
Silicon             214 non-null float64
Potassium           214 non-null float64
Calcium             214 non-null float64
Barium              214 non-null float64
Iron                214 non-null float64
Type                214 non-null object
dtypes: float64(9), int64(1), object(1)
memory usage: 18.5+ KB


In [6]:
df.Type.unique()

array(['building_windows_float_processed',
       'building_windows_non_float_processed',
       'vehicle_windows_float_processed', 'containers', 'tableware',
       'headlamps'], dtype=object)


# 2. Split the data into dependent and independent variables. Also see what the data looks like

Hint: you can make use of any method (iloc or drop)

In [16]:
X = df.drop(['ID', 'Type'], axis = 1)
y = df['Type']
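The hint mentions both `drop` and `iloc`. A minimal sketch of the two approaches side by side, assuming the ID column comes first and the target column comes last, as in the `df.info()` listing above. The tiny `df_demo` frame and its values are illustrative stand-ins, not rows from `glass.csv`.

```python
import pandas as pd

# Hypothetical miniature frame with the same layout: ID first, target last.
df_demo = pd.DataFrame({
    'ID': [1, 2, 3],
    'Sodium': [13.64, 13.89, 13.53],
    'Magnesium': [4.49, 3.60, 3.55],
    'Type': ['a', 'b', 'a'],
})

# Option 1: drop the columns that are not features, by name.
X_drop = df_demo.drop(['ID', 'Type'], axis=1)

# Option 2: select the feature columns by position (everything between first and last).
X_iloc = df_demo.iloc[:, 1:-1]
y_iloc = df_demo.iloc[:, -1]

# Both selections contain the same columns and values.
print(X_drop.equals(X_iloc))
```

Dropping by name is more robust if the column order ever changes; positional selection is shorter when the layout is fixed.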

In [17]:
y.unique()

array(['building_windows_float_processed',
       'building_windows_non_float_processed',
       'vehicle_windows_float_processed', 'containers', 'tableware',
       'headlamps'], dtype=object)

# 3. Convert Target variable into numerical

In [18]:
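# Note: this converts only one of the six labels; the others remain strings, so y mixes int and str values
y = y.replace('building_windows_float_processed', 1)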
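The `replace` call above touches only one of the six labels returned by `y.unique()`. A minimal sketch of converting every label at once, assuming the integer codes from the attribute list at the top of this notebook (1, 2, 3, 5, 6, 7). The `type_codes` dictionary and the `y_demo` series are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical mapping from label strings to the class codes listed in the attribute information.
type_codes = {
    'building_windows_float_processed': 1,
    'building_windows_non_float_processed': 2,
    'vehicle_windows_float_processed': 3,
    'containers': 5,
    'tableware': 6,
    'headlamps': 7,
}

# Small stand-in for df['Type'] so the sketch runs without glass.csv.
y_demo = pd.Series([
    'building_windows_float_processed',
    'containers',
    'headlamps',
    'tableware',
])

# Series.map converts every label in one pass; a label missing from the dict would become NaN.
y_encoded = y_demo.map(type_codes)
print(y_encoded.tolist())
```

On the real data the same idea would be `y = df['Type'].map(type_codes)`, followed by a check that no value came back as NaN.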

# 4. Split the dataset into train, validation and test sets
It is always good practice to split the dataset into 3 sets

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)
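The cell above produces only a train set and a test set. One common way to get the third (validation) set is to call `train_test_split` twice. The sketch below assumes a roughly 60/20/20 split and uses synthetic data in place of the glass features; the `*_demo` names and the proportions are illustrative choices, not taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the glass data: 214 rows, 9 features, 3 classes.
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(214, 9))
y_demo = rng.randint(0, 3, size=214)

# Step 1: hold out 20% of the rows as the final test set.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)

# Step 2: split the remaining rows 75/25 into train and validation sets.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1, stratify=y_rest)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

When cross-validation is used for tuning, as in the grid searches later in this notebook, the folds play the role of the validation set and a separate one is often skipped.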

# 5. Build the pipeline
Steps:
Instantiate the pipeline: first standardize the data with a standard scaler, then run PCA on the scaled data, and finally feed the result to logistic regression (or any other algorithm)

Hint:

Import standard scaler to standardize the data

You can take an algorithm of choice and build a pipeline

In [20]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe_lr = Pipeline([('scl', StandardScaler()), ('pca', PCA()), ('lg', LogisticRegression())])
pipe_lr.fit(X_train, y_train)
pipe_lr.score(X_test, y_test)



TypeError: '<' not supported between instances of 'int' and 'str'
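The `TypeError` above is consistent with a target column that mixes integers and strings, which is what happens when only one label has been converted to a number. The sketch below shows the same three-step pipeline running on a target that is entirely integer-coded. It uses synthetic data from `make_classification` rather than `glass.csv`, and the `*_demo` names, `n_components=5` and `max_iter=1000` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 214 rows, 9 features, 3 integer-coded classes.
X_demo, y_demo = make_classification(
    n_samples=214, n_features=9, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)

# Scale, reduce dimensionality, then classify; each step is fitted on the training data only.
pipe_demo = Pipeline([
    ('scl', StandardScaler()),
    ('pca', PCA(n_components=5)),
    ('lg', LogisticRegression(max_iter=1000)),
])
pipe_demo.fit(X_tr, y_tr)

# score() applies the fitted scaler and PCA to the test rows before predicting.
test_accuracy = pipe_demo.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.3f}")
```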

# 6. Follow the above steps, tweak the logistic regression parameters and make use of grid search (can use any algorithm)

In [11]:
from sklearn.svm import SVC
pipe_svc = Pipeline([('scl', StandardScaler()), ('pca', PCA()), ('svc', SVC())])
param_grid = {'pca__n_components': [3, 4, 5], 'svc__C': [0.001, 0.01, 0.1, 1, 10, 100], 'svc__gamma': [0.001, 0.01, 0.1, 1, 10], 'svc__kernel': ['rbf', 'poly']}
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(pipe_svc, param_grid, cv = 4)
gs.fit(X_train, y_train)

TypeError: '<' not supported between instances of 'int' and 'str'

In [12]:
gs.best_score_

AttributeError: 'GridSearchCV' object has no attribute 'best_score_'
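`best_score_` exists only after `fit` has completed, so the `AttributeError` above follows directly from the failed `fit` call. A minimal sketch of the same pipeline grid search finishing successfully, assuming integer-coded labels and synthetic data in place of `glass.csv`. The grid is deliberately smaller than the one in the cell above to keep the run short; `small_grid` and the `*_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 214 rows, 9 features, 3 integer-coded classes.
X_demo, y_demo = make_classification(
    n_samples=214, n_features=9, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=1)

pipe_demo = Pipeline([('scl', StandardScaler()), ('pca', PCA()), ('svc', SVC())])

# Keys use the '<step name>__<parameter>' convention to reach into pipeline steps.
small_grid = {
    'pca__n_components': [3, 5],
    'svc__C': [0.1, 1, 10],
    'svc__gamma': [0.01, 0.1],
}

gs_demo = GridSearchCV(pipe_demo, small_grid, cv=4)
gs_demo.fit(X_demo, y_demo)

# These attributes are set by a successful fit.
print(gs_demo.best_params_)
print(round(gs_demo.best_score_, 3))
```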

# 7. Optimize the model parameters (can make use of any algorithm)

Make use of grid search for hyperparameter tuning

Steps:
Split the dataset into train and test sets

Choose any algorithm and build a parameter grid from its list of hyperparameters

Once the hyperparameter grid is defined, import GridSearchCV and fit it on X_train and y_train

Find the best params and the mean test score



In [4]:
#split the dataset into train and test set
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y,random_state = 7)

NameError: name 'X' is not defined

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# KNN classifier with default settings; the number of nearest neighbors is tuned by the grid search below
knn_clf = KNeighborsClassifier()

In [3]:
knn_clf.fit(X_train, y_train)

NameError: name 'knn_clf' is not defined

In [21]:
param_grid = {'n_neighbors': list(range(1,9)),
             'algorithm': ('auto', 'ball_tree', 'kd_tree' , 'brute') }

In [22]:
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(knn_clf,param_grid,cv=10)
# fit must run before best_params_ and cv_results_ become available
gs.fit(X_train, y_train)

NameError: name 'knn_clf' is not defined

In [23]:
gs.best_params_

AttributeError: 'GridSearchCV' object has no attribute 'best_params_'

In [None]:
gs.cv_results_['mean_test_score']
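The `NameError` and `AttributeError` outputs in this section suggest the cells were run out of order, so that `X` and `knn_clf` were undefined and the grid search was never fitted. A minimal end-to-end sketch of the steps listed at the start of this section, run top to bottom in one cell. It uses synthetic data instead of `glass.csv`, and the `*_demo` and `knn_grid` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 214 rows, 9 features, 3 integer-coded classes.
X_demo, y_demo = make_classification(
    n_samples=214, n_features=9, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=1)

# Step 1: stratified train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=7)

# Step 2: choose an algorithm and define its hyperparameter grid.
knn_demo = KNeighborsClassifier()
knn_grid = {
    'n_neighbors': list(range(1, 9)),
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
}

# Step 3: fit the grid search on the training data only.
gs_knn = GridSearchCV(knn_demo, knn_grid, cv=10)
gs_knn.fit(X_tr, y_tr)

# Step 4: inspect the best parameters and the mean cross-validated score per combination.
print(gs_knn.best_params_)
print(gs_knn.cv_results_['mean_test_score'])
```

The `algorithm` options change only how neighbors are searched, not which neighbors are found, so their scores are expected to be nearly identical; `n_neighbors` is the setting that affects accuracy.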