# Assignment 10

You are provided with a dataset from the USA Forensic Science Service that describes 6 types of glass in terms of their oxide content (Na, Fe, K, etc.). Your task is to use a K-Nearest Neighbors (KNN) classifier to classify the glasses.

The original dataset is available at (https://archive.ics.uci.edu/ml/datasets/glass+identification). For a detailed description of the attributes, please refer to the dataset's page in the UCI ML repository.

The shared drive folder also contains the dataset for your convenience. Perform exploratory data analysis on the dataset using Python Pandas, including dropping fields that are irrelevant to the predicted values and standardizing each attribute.

Following data cleaning, two Scikit-Learn KNN models should be created, one for each of two distance metrics: squared Euclidean and Manhattan distance. The two models should then be compared in terms of their accuracy on the test data and the Scikit-Learn classification report.
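To make the two metrics concrete, here is a minimal sketch in plain Python. The points `p` and `q` and the query are made up for illustration and are not taken from the glass dataset. Squared Euclidean distance ranks neighbours in the same order as ordinary Euclidean distance, but it can disagree with Manhattan distance about which point is nearest.

```python
def squared_euclidean(a, b):
    """Sum of squared coordinate differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical feature vectors, for illustration only.
query = [0.0, 0.0]
neighbours = {"p": [3.0, 0.0], "q": [2.0, 2.0]}

sq = {name: squared_euclidean(query, v) for name, v in neighbours.items()}
mh = {name: manhattan(query, v) for name, v in neighbours.items()}

# The nearest neighbour is the point with the smallest distance.
nearest_sq = min(sq, key=sq.get)
nearest_mh = min(mh, key=mh.get)

print("squared Euclidean:", sq, "nearest:", nearest_sq)
print("Manhattan:", mh, "nearest:", nearest_mh)
```

In this toy case the two metrics pick different nearest neighbours, which is why the two KNN models can give different predictions on the same data.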

# Importing Packages

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score  # cross-validation scoring
from sklearn.model_selection import train_test_split
# for scaling the data
from sklearn.preprocessing import StandardScaler
# for the classification report
from sklearn.metrics import classification_report
# for distance functions
from scipy.spatial import distance

In [None]:
df = pd.read_csv('/content/trainKNN.csv')
df.shape

# Attribute information from the dataset website

Attribute Information:

1. Id number: 1 to 214
2. RI: refractive index
3. Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
4. Mg: Magnesium
5. Al: Aluminum
6. Si: Silicon
7. K: Potassium
8. Ca: Calcium
9. Ba: Barium
10. Fe: Iron
11. Type of glass: (class attribute)
    - 1: building_windows_float_processed
    - 2: building_windows_non_float_processed
    - 3: vehicle_windows_float_processed
    - 4: vehicle_windows_non_float_processed (none in this database)
    - 5: containers
    - 6: tableware
    - 7: headlamps

In [None]:
attributes = ['Id','RI','Na','Mg','Al','Si','K','Ca','Ba','Fe','Type_Of_Glass']
df.columns = attributes
df.head()

Dropping the Id column

In [None]:
df=df.drop(['Id'], axis=1)
df.head()

Analyzing the data

In [None]:
df.isnull().sum()

In [None]:
df = df.drop_duplicates()

In [None]:
df.shape

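Comparing this shape with the original one shows that there was one duplicate row, which has now been dropped.

No encoding is needed either, since all columns are numeric.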

**Checking if Outliers exist or not**

In [None]:
df.describe().T

In [None]:
for k, v in df.items():
  q1 = v.quantile(0.25)
  q3 = v.quantile(0.75)
  irq = q3 - q1
  v_col = v[(v <= q1 - 1.5 * irq) | (v >= q3 + 1.5 * irq)]
  perc = np.shape(v_col)[0] * 100.0 / np.shape(df)[0]
  print("Column %s outliers = %.2f%%" % (k, perc))

In [None]:
plt.figure(figsize = (16, 12))
sns.heatmap(df.corr(), annot = True, fmt = '.2%')
# plt.savefig('../images/features_correlation.png')

**Type_Of_Glass correlates better with Al, Ba and Na than with the other features.**

# kNN model

In [None]:
b = list(df.columns)  # all column names, features and target
print(b)

In [None]:
b.remove('Type_Of_Glass')
print(b)

**The feature set is now the list of columns in b**

In [None]:
X = df[b].values  # array of features
y = df['Type_Of_Glass'].values  # array of class labels

**Splitting of data**

In [None]:
# random_state fixes the split so that the scores are reproducible between runs
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


**Scaling of data**

In [None]:
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
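The scaler is fitted on the training split only and then reused on the test split. As a rough sketch of what that means, the snippet below reproduces the standardization by hand with NumPy on made-up numbers (the arrays `train` and `test` are hypothetical). `StandardScaler` uses the population standard deviation, which is also NumPy's default.

```python
import numpy as np

# Hypothetical training and test features, for illustration only.
train = np.array([[1.0, 10.0],
                  [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics are computed from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# The same training statistics are applied to both splits.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled)
print(test_scaled)
```

Because the test data never contributes to `mean` and `std`, no information from the test split leaks into the model.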

# Initializing and fitting the k-NN model by splitting the train data

**BY EUCLIDEAN METRIC**

In [None]:
for i in [1,2,3,4,5,6,7,8,9,10,20,25,30,35,40,45,50]:
  knn = KNeighborsClassifier(i,metric=distance.sqeuclidean) # initialise the model with k = i neighbours
  knn.fit(x_train,y_train) # train the model
  print("K value  : " , i, " score : ", np.mean(cross_val_score(knn, x_train, y_train, cv=4))) # mean 4-fold cross-validation accuracy
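Instead of reading the best k off the printed scores, the scores can be collected and the maximum picked programmatically. The following is only a sketch under stated assumptions: it uses synthetic data from `make_classification` rather than the glass dataset, the built-in `"manhattan"` metric name, and a pipeline so that scaling is refitted inside each fold. The names `X_demo`, `y_demo`, `cv_scores` and `best_k` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the glass features (9 columns, 3 classes).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=5, n_classes=3, random_state=0
)

cv_scores = {}
for k in [1, 3, 5, 7, 9]:
    # Scaling lives inside the pipeline, so it is refitted on each training fold.
    model = make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=k, metric="manhattan")
    )
    cv_scores[k] = np.mean(cross_val_score(model, X_demo, y_demo, cv=4))

# Pick the k with the highest mean cross-validation accuracy.
best_k = max(cv_scores, key=cv_scores.get)
print("best k:", best_k, "mean CV accuracy:", cv_scores[best_k])
```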


In [None]:
knn = KNeighborsClassifier(n_neighbors=5,metric=distance.sqeuclidean) # initialise the model with k = 5 neighbours
knn.fit(x_train, y_train) # train the model
print("Train Accuracy : ", knn.score(x_train,y_train)) # accuracy on the training data
print("Val Accuracy : ", np.mean(cross_val_score(knn, x_train, y_train, cv=4))) # mean 4-fold cross-validation accuracy


**Test the model using testing data**

In [None]:
df1 = pd.read_csv('/content/testKNN.csv')


In [None]:
attributes = ['Id','RI','Na','Mg','Al','Si','K','Ca','Ba','Fe','Type_Of_Glass']
df1.columns = attributes
df1.head()

In [None]:
df2=df1.drop(['Id'], axis=1)
df2=df2.drop(['Type_Of_Glass'], axis=1)
df2.head()

In [None]:
type(df2)

In [None]:
x_test
type(x_test)

In [None]:
df2

In [None]:
df2 = df2.values

In [None]:
df2

In [None]:
df2_test = scaler.transform(df2)

In [None]:
results = knn.predict(df2_test)

In [None]:
print(results)

In [None]:
df1['Type_Of_Glass_pred'] = results

In [None]:
df1
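The cells above store the predictions next to the true labels but do not score them. The test accuracy and classification report that the assignment asks for could be computed roughly as follows. This is a sketch with made-up labels, not results from the glass data: `y_true` stands in for the `Type_Of_Glass` column and `y_pred` for the `Type_Of_Glass_pred` column.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true and predicted class labels, for illustration only.
y_true = [1, 1, 2, 2, 7, 7]
y_pred = [1, 2, 2, 2, 7, 1]

# Fraction of predictions that match the true label.
acc = accuracy_score(y_true, y_pred)

# Per-class precision, recall and F1; zero_division=0 avoids warnings
# for classes that are never predicted.
report = classification_report(y_true, y_pred, zero_division=0)

print("Test accuracy:", acc)
print(report)
```

With the notebook's variables, the same two calls would take `df1['Type_Of_Glass']` and `df1['Type_Of_Glass_pred']` as arguments.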

**BY MANHATTAN METRIC**

In [None]:
for i in [1,2,3,4,5,6,7,8,9,10,20,25,30,35,40,45,50]:
  knn = KNeighborsClassifier(i,metric=distance.cityblock) # initialise the model with k = i neighbours
  knn.fit(x_train,y_train) # train the model
  print("K value  : " , i, " score : ", np.mean(cross_val_score(knn, x_train, y_train, cv=4))) # mean 4-fold cross-validation accuracy


In [None]:
knn = KNeighborsClassifier(n_neighbors=10,metric=distance.cityblock) # initialise the model with k = 10 neighbours
knn.fit(x_train, y_train) # train the model
print("Train Accuracy : ", knn.score(x_train,y_train)) # accuracy on the training data
print("Val Accuracy : ", np.mean(cross_val_score(knn, x_train, y_train, cv=4))) # mean 4-fold cross-validation accuracy


The Manhattan metric gives 69% accuracy.

**Test the model using testing data**

In [None]:
df1 = pd.read_csv('/content/testKNN.csv')

In [None]:
attributes = ['Id','RI','Na','Mg','Al','Si','K','Ca','Ba','Fe','Type_Of_Glass']
df1.columns = attributes
df1.head()

In [None]:
df2=df1.drop(['Id'], axis=1)
df2=df2.drop(['Type_Of_Glass'], axis=1)
df2.head()

In [None]:
type(df2)

In [None]:
x_test
type(x_test)

In [None]:
df2

In [None]:
df2 = df2.values

In [None]:
df2

In [None]:
df2_test = scaler.transform(df2)

In [None]:
results = knn.predict(df2_test)

In [None]:
print(results)

In [None]:
df1['Type_Of_Glass_pred'] = results

In [None]:
df1

**Initializing and fitting the k-NN model without splitting the training data, after cleaning the outliers from the features.**

In [None]:
df.head()

In [None]:
for k, v in df.items():
  q1 = v.quantile(0.25)
  q3 = v.quantile(0.75)
  irq = q3 - q1
  v_col = v[(v <= q1 - 1.5 * irq) | (v >= q3 + 1.5 * irq)]
  perc = np.shape(v_col)[0] * 100.0 / np.shape(df)[0]
  print("Column %s outliers = %.2f%%" % (k, perc))

In [None]:
def floorcapping(df):
  # Ask for the name of the column to cap (for example 'Na')
  i = input()
  Q1 = df[i].quantile(0.25)
  Q3 = df[i].quantile(0.75)
  IQR = Q3 - Q1
  whisker_width = 1.5
  lower_whisker = Q1 - (whisker_width * IQR)
  upper_whisker = Q3 + (whisker_width * IQR)
  # Boolean mask of the values that fall outside the whiskers
  is_outlier = (df[i] < lower_whisker) | (df[i] > upper_whisker)
  if is_outlier.any():
    # Replace each outlier with the nearest whisker value
    df[i] = np.where(df[i] > upper_whisker, upper_whisker, np.where(df[i] < lower_whisker, lower_whisker, df[i]))
floorcapping(df)

In [None]:
floorcapping(df)

In [None]:
floorcapping(df)
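The same floor-capping idea can be written without the interactive prompt. The following is only a sketch on a made-up series, using pandas `Series.clip`; the helper name `cap_outliers` and the values are hypothetical.

```python
import pandas as pd


def cap_outliers(series, whisker_width=1.5):
    """Clip values outside the IQR whiskers to the whisker values."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whisker_width * iqr
    upper = q3 + whisker_width * iqr
    return series.clip(lower=lower, upper=upper)


# Hypothetical column with one extreme value, for illustration only.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
capped = cap_outliers(values)
print(capped.tolist())
```

Applied to the notebook's data, this could be called once per feature column in a loop instead of typing each column name by hand.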

In [None]:
def outlierpresence(df):
  # Compute the IQR bounds once for every column
  Q1 = df.quantile(0.25)
  Q3 = df.quantile(0.75)
  IQR = Q3 - Q1
  x = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)) # True where a value is an outlier
  for i in df.keys():
    if x[i].any(): # check whether this column contains any outlier
      print('Outliers', '\033[1m' + 'present' + '\033[0m', 'in the data of', '\033[1m' + i + '\033[0m')
    else:
      print('Outliers', '\033[1m' + 'not present' + '\033[0m', 'in the data of', '\033[1m' + i + '\033[0m')
    print('-------------------------------')
outlierpresence(df)

In [None]:
x_train = df.drop(['Type_Of_Glass'], axis=1)
x_train = x_train.values
x_train

In [None]:
y_train = df['Type_Of_Glass']
y_train = y_train.values
y_train

In [None]:
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)

**Initializing and fitting the k-NN model**

**BY EUCLIDEAN METRIC**


In [None]:
for i in [1,2,3,4,5,6,7,8,9,10,20,25,30,35,40,45,50]:
  knn = KNeighborsClassifier(i,metric=distance.sqeuclidean) # initialise the model with k = i neighbours
  knn.fit(x_train,y_train) # train the model
  print("K value  : " , i, " score : ", np.mean(cross_val_score(knn, x_train, y_train, cv=6))) # mean 6-fold cross-validation accuracy


In [None]:
knn = KNeighborsClassifier(n_neighbors=6,metric=distance.sqeuclidean) # initialise the model with k = 6 neighbours
knn.fit(x_train, y_train) # train the model
print("Train Accuracy : ", knn.score(x_train,y_train)) # accuracy on the training data
print("Val Accuracy : ", np.mean(cross_val_score(knn, x_train, y_train, cv=6))) # mean 6-fold cross-validation accuracy


The Euclidean metric gives 67% accuracy.

**Test the model using testing data**

In [None]:
df1 = pd.read_csv('/content/testKNN.csv')

In [None]:
attributes = ['Id','RI','Na','Mg','Al','Si','K','Ca','Ba','Fe','Type_Of_Glass']
df1.columns = attributes
df1.head()


In [None]:
df2=df1.drop(['Id'], axis=1)
df2=df2.drop(['Type_Of_Glass'], axis=1)
df2.head()

In [None]:
type(df2)

In [None]:
x_test
type(x_test)

In [None]:
df2

In [None]:
df2 = df2.values

In [None]:
df2

In [None]:
df2_test = scaler.transform(df2)

In [None]:
results = knn.predict(df2_test)

In [None]:
print(results)

In [None]:
df1['Type_Of_Glass_pred'] = results

In [None]:
df1

**BY MANHATTAN METRIC**

In [None]:
for i in [1,2,3,4,5,6,7,8,9,10,20,25,30,35,40,45,50]:
  knn = KNeighborsClassifier(i,metric=distance.cityblock) # initialise the model with k = i neighbours
  knn.fit(x_train,y_train) # train the model
  print("K value  : " , i, " score : ", np.mean(cross_val_score(knn, x_train, y_train, cv=6))) # mean 6-fold cross-validation accuracy


In [None]:
knn = KNeighborsClassifier(n_neighbors=8,metric=distance.cityblock) # initialise the model with k = 8 neighbours
knn.fit(x_train, y_train) # train the model
print("Train Accuracy : ", knn.score(x_train,y_train)) # accuracy on the training data
print("Val Accuracy : ", np.mean(cross_val_score(knn, x_train, y_train, cv=6))) # mean 6-fold cross-validation accuracy


The Manhattan metric gives 68% accuracy.

**Test the model using testing data**

In [None]:
df1 = pd.read_csv('/content/testKNN.csv')

In [None]:
attributes = ['Id','RI','Na','Mg','Al','Si','K','Ca','Ba','Fe','Type_Of_Glass']
df1.columns = attributes
df1.head()

In [None]:
df2=df1.drop(['Id'], axis=1)
df2=df2.drop(['Type_Of_Glass'], axis=1)
df2.head()

In [None]:
type(df2)

In [None]:
x_test
type(x_test)

In [None]:
df2

In [None]:
df2 = df2.values

In [None]:
df2

In [None]:
df2_test = scaler.transform(df2)

In [None]:
results = knn.predict(df2_test)

In [None]:
print(results)

In [None]:
df1['Type_Of_Glass_pred'] = results

In [None]:
df1

# Conclusions

I experimented with the given data in two ways.

1. Initializing and fitting the k-NN model by splitting the training data:
   - Euclidean metric: 68%
   - Manhattan metric: 69%
2. Initializing and fitting the k-NN model without splitting the training data, after cleaning the outliers from the features:
   - Euclidean metric: 67%
   - Manhattan metric: 68%

None of the above models predicted any glasses in the 3rd or 4th class (the dataset contains no samples of class 4).