
In [None]:

from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer  # sklearn.metrics.scorer was removed in newer scikit-learn releases
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

import seaborn as sns
import plotly as py
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import warnings
warnings.filterwarnings("ignore")

from sklearn.linear_model import LogisticRegression as LR
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier as KNC
from sklearn.ensemble import RandomForestClassifier as RF

from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Basic Data Analysis

**Load the dataset from a CSV file using pandas**

In [None]:
X = pd.read_csv('../input/faults.csv')

The output of `X.info()` shows 1941 entries and no null values, so comparatively little preprocessing (for example, filling in dummy entries) is needed.
There are 34 columns in total.

In [None]:
X.info()

We use the `head` method of the pandas DataFrame. It takes the number of rows to list as a parameter and returns rows starting from the first one. By default, it lists 5 rows.

In [None]:
X.head()

A statistical summary of the attributes in a dataset is called **descriptive statistics**. It condenses each column into a handful of computed measures so the data can be inspected at a glance.



The given statistical measures are:

1. count = number of non-null samples
1. mean = the mean of this attribute over all samples
1. std = the standard deviation of this attribute
1. min = the minimum value of this attribute
1. 25% = the first quartile (25th percentile)
1. 50% = the median
1. 75% = the third quartile (75th percentile)
1. max = the maximum value of this attribute


In [None]:
display(X.describe(include="all"))

**HEATMAP**

A heatmap is a graphical representation of data in which the values of a matrix are shown as colors.
The correlation heatmap of our data does not reveal any useful information.

In [None]:
sns.set(rc={'figure.figsize':(12,10)})
corr = X.corr()
sns.heatmap(corr, xticklabels=corr.columns.values,yticklabels=corr.columns.values)

**HISTOGRAM**

A histogram shows the distribution of numerical data by counting how many values fall into each bin.

In [None]:


X.hist(figsize=(20,20))
plt.show()



# Data Processing

We split the dataset into a **features** dataset and a **faults** dataset. The faults dataset contains
7 columns, while the features dataset contains 27 columns.

In [None]:
X1 = X.values
y_dataframe =X[["Pastry","Z_Scratch","K_Scatch","Stains","Dirtiness","Bumps","Other_Faults"]]
features = X1[:,0:27]
x = pd.DataFrame(features)





In [None]:
y_dataframe.info()

Converting the faults dataset (7 columns) into a single column

In [None]:


# Collapse the seven one-hot fault columns into a single label column
# for machine learning (the inverse of one-hot encoding).
y = []
for i in range(y_dataframe.shape[0]):
    if y_dataframe["Pastry"].values[i] == 1:
        y.append("Pastry")
    elif y_dataframe["Z_Scratch"].values[i] == 1:
        y.append("Z_Scratch")
    elif y_dataframe["K_Scatch"].values[i] == 1:
        y.append("K_Scatch")
    elif y_dataframe["Stains"].values[i] == 1:
        y.append("Stains")
    elif y_dataframe["Dirtiness"].values[i] == 1:
        y.append("Dirtiness")
    elif y_dataframe["Bumps"].values[i] == 1:
        y.append("Bumps")
    else:
        y.append("Other_Faults")
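
The loop above could also be written without an explicit Python loop. The following is a minimal sketch on a hypothetical toy frame (`toy` and `toy_labels` are made-up names, not part of the notebook); it assumes the fault columns only hold 0 or 1.

```python
import pandas as pd

# Hypothetical toy frame standing in for y_dataframe (one-hot fault columns).
fault_cols = ["Pastry", "Z_Scratch", "K_Scatch", "Stains",
              "Dirtiness", "Bumps", "Other_Faults"]
toy = pd.DataFrame(
    [[1, 0, 0, 0, 0, 0, 0],
     [0, 0, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 1],
     [0, 0, 0, 0, 0, 0, 0]],   # all-zero row: should fall back to "Other_Faults"
    columns=fault_cols,
)

# idxmax(axis=1) returns the first column holding the row maximum, which
# follows the same column order as the if/elif chain for 0/1 data.
toy_labels = toy.idxmax(axis=1)

# A row with no flag set would otherwise be labelled "Pastry";
# mirror the loop's else branch instead.
toy_labels[toy.sum(axis=1) == 0] = "Other_Faults"

y_toy = toy_labels.to_numpy()
print(y_toy)
```

With the real data, `toy` would be replaced by `y_dataframe`; whether the result matches the loop exactly depends on the 0/1 assumption holding.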

In [None]:
y=np.array(y)

In [None]:
y.shape

In [None]:

faultstype= pd.DataFrame({'faults':y})

Faults dataset description

In [None]:
faultstype.describe()

**Heatmap**

*The heatmap shows no meaningful correlation between the different fault types.*

In [None]:
sns.set(rc={'figure.figsize':(12,10)})
corr = y_dataframe.corr()
sns.heatmap(corr, xticklabels=corr.columns.values,yticklabels=corr.columns.values)

As is evident from the charts, the fault types in decreasing order of frequency are:

1. OTHER FAULTS
1. BUMPS
1. K SCRATCH
1. Z SCRATCH
1. PASTRY
1. STAINS
1. DIRTINESS

The Other_Faults category is by far the largest class compared with every other fault type. The dataset is not perfectly balanced, but it works fine.

In [None]:
fig, ax=plt.subplots(1,2,figsize=(20,8))

faultstype['faults'].value_counts().plot.pie(ax=ax[0])
sns.countplot(x='faults', data=faultstype, ax=ax[1])
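
Because the classes are unbalanced, one common precaution is to look at class proportions and to keep them equal across the train and test splits. The sketch below uses made-up labels (`toy_labels`, `toy_features` and the 60/30/10 class counts are assumptions for illustration, not the real counts) and the `stratify` argument of `train_test_split`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, unbalanced labels (illustrative counts only).
toy_labels = np.array(["Other_Faults"] * 60 + ["Bumps"] * 30 + ["Dirtiness"] * 10)
toy_features = np.arange(len(toy_labels)).reshape(-1, 1)

# Share of each class in the whole label vector.
print(pd.Series(toy_labels).value_counts(normalize=True))

# stratify keeps the class proportions the same in both halves.
_, _, toy_y_tr, toy_y_te = train_test_split(
    toy_features, toy_labels, test_size=0.5, stratify=toy_labels, random_state=42
)
print(pd.Series(toy_y_te).value_counts())
```

The notebook's own split does not stratify; this only shows what the option would look like.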






# Classification using Principal component analysis

PCA reduces a large number of variables to a smaller set of principal component (PC) axes. It attempts to capture the information in the input variables with only a few axes. How important each PC axis is depends on its eigenvalue. A common rule of thumb treats a PC as "important" when its eigenvalue is ≥ 1. PC1 always has the highest eigenvalue, followed by PC2, PC3, and so on.
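
As a rough illustration of the eigenvalue ≥ 1 rule, the sketch below counts the components that pass the cut-off. It runs on synthetic data (an assumption, not the faults dataset), and `latent`, `mixing` and `pca_demo` are names made up for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 6 observed features driven by 2 latent factors plus noise.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
data = latent @ mixing + 0.1 * rng.normal(size=(500, 6))

# Standardize first so that each feature contributes unit variance.
scaled = StandardScaler().fit_transform(data)
pca_demo = PCA().fit(scaled)

# explained_variance_ holds the eigenvalues, sorted in decreasing order.
eigenvalues = pca_demo.explained_variance_
n_important = int(np.sum(eigenvalues >= 1.0))
print(eigenvalues.round(3), n_important)
```

The cut-off is only a heuristic; the cumulative explained variance is another common way to pick the number of components.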

**Data Standardization**

Using variables without standardization in effect gives the variable with the larger range a larger weight in the analysis. Transforming the data to comparable scales can prevent this problem.

In [None]:
sc=StandardScaler()
X=sc.fit_transform(x)
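
The cell above fits the scaler on the whole dataset. A stricter variant, sketched here on synthetic data, fits the scaler on the training split only by wrapping it in a pipeline. `X_demo`, `y_demo` and `scaled_svc` are hypothetical names, and the data come from `make_classification`, not from the faults file.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; one feature is blown up to mimic a large range.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
X_demo[:, 0] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42
)

# The pipeline fits the scaler on the training data only and reuses
# those statistics when transforming the test data.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)

train_scaled = scaled_svc.named_steps["standardscaler"].transform(X_tr)
print(train_scaled.mean(axis=0).round(6))  # close to 0 for every feature
print(train_scaled.std(axis=0).round(6))   # close to 1 for every feature
print(scaled_svc.score(X_te, y_te))        # test accuracy
```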


In [None]:
faultstype.shape

**TRAIN TEST DATA SPLIT**

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.50, random_state = 42)

**We will keep 10 principal components. Please note that we started with 27 variables.**

In [None]:


pca=PCA(10)
pca.fit(x_train)  # fit on the training split only so the test split stays unseen
pca.explained_variance_ratio_
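
The number of components is fixed by hand here. An alternative, sketched below on synthetic data, is to pick the smallest number of components that reaches a target share of the variance. The 95% target, `data` and `pca_95` are assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 27 features with steadily shrinking spread.
data = rng.normal(size=(200, 27)) * np.linspace(5.0, 0.1, 27)

# Cumulative share of variance explained by the first k components.
pca_full = PCA().fit(data)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

# Passing a float between 0 and 1 asks PCA to make the same choice itself.
pca_95 = PCA(n_components=0.95).fit(data)
print(n_for_95, pca_95.n_components_)
```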



In [None]:
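pca_train=pca.transform(x_train)
pca_test=pca.transform(x_test)
# NOTE: the next two lines overwrite the PCA projections with the full
# standardized features, so every "PCA" result below is computed on all
# 27 features. Remove them to evaluate on the 10 principal components.
pca_train=x_train
pca_test=x_test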


In [None]:
pca_score= np.zeros(6)
pca_accuracy= np.zeros(6)





**Logistic regression**

In [None]:
Logistic_Regression = LR().fit(pca_train,y_train)
pca_score[0]=Logistic_Regression.score(pca_train,y_train)
predictions_LR = Logistic_Regression.predict(pca_test)
pca_accuracy[0]=accuracy_score(y_test, predictions_LR)


**Decision Tree**

**Decision tree builds classification or regression models in the form of a tree structure**

pseudo code of decision tree
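
The pseudocode is not reproduced here. As a stand-in, the core step of tree building, choosing the threshold that gives the purest children, might be sketched as follows. This is a simplified illustration using Gini impurity on one feature, not how scikit-learn implements it; `gini`, `best_threshold` and the toy values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # the largest value cannot split the data
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


toy_values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
toy_labels = ["Bumps", "Bumps", "Bumps", "Stains", "Stains", "Stains"]
print(best_threshold(toy_values, toy_labels))  # (3.0, 0.0)
```

A full tree repeats this search over every feature and then recurses into the two children until a stopping rule is met.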

In [None]:
Decision_Tree_Classifier = DTC().fit(pca_train,y_train)
pca_score[1]=Decision_Tree_Classifier.score(pca_train,y_train)
predictions_DTC = Decision_Tree_Classifier.predict(pca_test)
pca_accuracy[1]=accuracy_score(y_test, predictions_DTC)

**SVC**

In [None]:
from sklearn.svm import SVC
SVC = SVC().fit(pca_train,y_train)
pca_score[2]=SVC.score(pca_train,y_train)
predictions_SVC = SVC.predict(pca_test)
pca_accuracy[2]=accuracy_score(y_test, predictions_SVC)



**K-Nearest Neighbours**

In [None]:
K_Neighbors_Classifier = KNC(8).fit(pca_train,y_train)
pca_score[3]=K_Neighbors_Classifier.score(pca_train,y_train)
predictions_KNC = K_Neighbors_Classifier.predict(pca_test)
pca_accuracy[3]=accuracy_score(y_test, predictions_KNC)

**GAUSSIAN NAIVE BAYES**


In [None]:
from sklearn.naive_bayes import GaussianNB as GNB

GNB = GNB().fit(pca_train,y_train)
pca_score[4]=GNB.score(pca_train,y_train)
predictions_GNB= GNB.predict(pca_test)
pca_accuracy[4]=accuracy_score(y_test, predictions_GNB)



**Random Forest**

In [None]:
RF = RF().fit(pca_train,y_train)
pca_score[5]=RF.score(pca_train,y_train)
predictions_RF= RF.predict(pca_test)
pca_accuracy[5]=accuracy_score(y_test, predictions_RF)

# Classification and LDA

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda=LinearDiscriminantAnalysis()
lda.fit(x_train, y_train)  # fit on the training split only; fitting on the test labels would leak them


In [None]:
lda_train=lda.transform(x_train)
lda_test=lda.transform(x_test)
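
Unlike PCA, LDA uses the class labels, and its projection has at most (number of classes - 1) dimensions. A small sketch on synthetic four-class data (an assumption; `lda_demo`, `Z_tr` and `Z_te` are made-up names) shows the resulting shapes.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 4 classes and 12 features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42
)

# Fit on the training split, then project both splits.
lda_demo = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
Z_tr = lda_demo.transform(X_tr)
Z_te = lda_demo.transform(X_te)
print(Z_tr.shape, Z_te.shape)  # 4 classes give 3 discriminant axes
```

With the seven fault classes used in this notebook, the same rule would give at most six LDA axes.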

In [None]:
lda_score= np.zeros(6)
lda_accuracy= np.zeros(6)

**Logistic Regression**

In [None]:
Logistic_Regression = LR().fit(lda_train,y_train)
lda_score[0]=Logistic_Regression.score(lda_train,y_train)
predictions_LR = Logistic_Regression.predict(lda_test)
lda_accuracy[0]=accuracy_score(y_test, predictions_LR)

**Decision Tree**

In [None]:
Decision_Tree_Classifier = DTC().fit(lda_train,y_train)
lda_score[1]=Decision_Tree_Classifier.score(lda_train,y_train)
predictions_DTC = Decision_Tree_Classifier.predict(lda_test)
lda_accuracy[1]=accuracy_score(y_test, predictions_DTC)

**SVC**

In [None]:
from sklearn.svm import SVC
SVC = SVC().fit(lda_train,y_train)
lda_score[2]=SVC.score(lda_train,y_train)
predictions_SVC = SVC.predict(lda_test)
lda_accuracy[2]=accuracy_score(y_test, predictions_SVC)

**KNC**

In [None]:
K_Neighbors_Classifier = KNC(10).fit(lda_train,y_train)
lda_score[3]=K_Neighbors_Classifier.score(lda_train,y_train)
predictions_KNC = K_Neighbors_Classifier.predict(lda_test)
lda_accuracy[3]=accuracy_score(y_test, predictions_KNC)

**GNB**

In [None]:
from sklearn.naive_bayes import GaussianNB as GNB

GNB = GNB().fit(lda_train,y_train)
lda_score[4]=GNB.score(lda_train,y_train)
predictions_GNB= GNB.predict(lda_test)
lda_accuracy[4]=accuracy_score(y_test, predictions_GNB)

**Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier as RF
RF = RF().fit(lda_train,y_train)
lda_score[5]=RF.score(lda_train,y_train)
predictions_RF= RF.predict(lda_test)
lda_accuracy[5]=accuracy_score(y_test, predictions_RF)

# Simple Classification

In [None]:

simple_score= np.zeros(6)
simple_accuracy= np.zeros(6)

**LOGISTIC REGRESSION**

In [None]:
Logistic_Regression = LR().fit(x_train,y_train)
simple_score[0]=Logistic_Regression.score(x_train,y_train)
predictions_LR = Logistic_Regression.predict(x_test)
simple_accuracy[0]=accuracy_score(y_test, predictions_LR)

**Decision Tree Classifier**

In [None]:
Decision_Tree_Classifier = DTC().fit(x_train,y_train)
simple_score[1]=Decision_Tree_Classifier.score(x_train,y_train)
predictions_DTC = Decision_Tree_Classifier.predict(x_test)
simple_accuracy[1]=accuracy_score(y_test, predictions_DTC)

**SVC**

In [None]:
from sklearn.svm import SVC
SVC = SVC().fit(x_train,y_train)
simple_score[2]=SVC.score(x_train,y_train)
predictions_SVC = SVC.predict(x_test)
simple_accuracy[2]=accuracy_score(y_test, predictions_SVC)

**K NEIGHBOURS CLASSIFIER**

In [None]:
K_Neighbors_Classifier = KNC(10).fit(x_train,y_train)
simple_score[3]=K_Neighbors_Classifier.score(x_train,y_train)
predictions_KNC = K_Neighbors_Classifier.predict(x_test)
simple_accuracy[3]=accuracy_score(y_test, predictions_KNC)

**Gaussian Naive Bayes**

In [None]:
from sklearn.naive_bayes import GaussianNB as GNB

GNB = GNB().fit(x_train,y_train)
simple_score[4]=GNB.score(x_train,y_train)
predictions_GNB= GNB.predict(x_test)
simple_accuracy[4]=accuracy_score(y_test, predictions_GNB)

**Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier as RF
RF = RF(10).fit(x_train,y_train)
simple_score[5]=RF.score(x_train,y_train)
predictions_RF= RF.predict(x_test)
simple_accuracy[5]=accuracy_score(y_test, predictions_RF)
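
The six cells above repeat the same fit, score and predict steps. One possible way to condense them is a loop over a dictionary of models, sketched here on synthetic data. `evaluate_models`, `demo_models` and the data are hypothetical and only illustrate the pattern.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(models, X_tr, X_te, y_tr, y_te):
    """Fit each model and return {name: (train_score, test_accuracy)}."""
    results = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        # For classifiers, .score() is the mean accuracy on the given data.
        results[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    return results


# Synthetic stand-in data; swap in x_train, x_test, y_train, y_test.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42
)

demo_models = {
    "Logistic_reg": LogisticRegression(max_iter=1000),
    "Decision_Tree": DecisionTreeClassifier(random_state=0),
    "SVC": SVC(),
    "K_Neighbors": KNeighborsClassifier(n_neighbors=10),
    "GaussianNB": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=10, random_state=0),
}

demo_results = evaluate_models(demo_models, X_tr, X_te, y_tr, y_te)
for name, (train_score, test_acc) in demo_results.items():
    print(f"{name:15s} train={train_score:.3f} test={test_acc:.3f}")
```

Keeping the estimators under their own variable names also avoids rebinding class names such as `SVC` or `RF` to fitted instances.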

 # **Results**

*DECISION TREE AND RANDOM FOREST ARE GIVING UNREASONABLY HIGH TRAINING SCORES BECAUSE THEY OVERFIT THEIR TRAINING DATA.*

In [None]:


objects = ('Logistic_reg', 'Decision_Tree', 'SVC', 'K_Neighbors', ' GaussianNB','Random Forest')
y_pos = np.arange(len(objects))
SCORES = [simple_score[0]*100,simple_score[1]*100,simple_score[2]*100,simple_score[3]*100,simple_score[4]*100,simple_score[5]*100]

plt.bar(y_pos, SCORES,color='#7f6d5f', align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('scores without LDA & PCA')
plt.title('MACHINE LEARNING MODELS')

plt.show()
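
A training score close to 100% says little about how a model generalizes. One way to see the gap, sketched below on synthetic data with some label noise (an assumption; `tree_demo` and `cv_scores` are made-up names), is to compare the training score with a cross-validated score.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; flip_y adds label noise that a tree can memorize.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

# An unrestricted tree fits its own training data perfectly.
tree_demo = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
train_score = tree_demo.score(X_demo, y_demo)

# Cross-validation scores each fold with a model that never saw it.
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5
)
print(train_score, cv_scores.mean())
```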

**ACCURACY WITHOUT PCA AND LDA**

***SVC** HAS AN ACCURACY **ABOVE 75%** AND IS CLEARLY THE WINNER HERE.*
*LOGISTIC REGRESSION, K NEIGHBOURS AND RANDOM FOREST ALSO PERFORM WELL, **ABOVE 70%**.*

In [None]:


objects = ('Logistic_reg', 'Decision_Tree', 'SVC', 'K_Neighbors', ' GaussianNB','Random Forest')
obj = np.arange(len(objects))
SCORES = [simple_accuracy[0]*100,simple_accuracy[1]*100,simple_accuracy[2]*100,simple_accuracy[3]*100,simple_accuracy[4]*100,simple_accuracy[5]*100]

plt.bar(obj, SCORES,color='#7f6d5f', align='center', alpha=0.5)  # use the positions defined in this cell
plt.xticks(obj, objects)
plt.ylabel('Accuracy without LDA & PCA')
plt.title('MACHINE LEARNING MODELS')

plt.show()

**SCORE WITH PCA**

*AGAIN, RANDOM FOREST AND DECISION TREE SCORE UNREASONABLY WELL. THE MODELS ARE OVERFITTING THE TRAINING DATA.* *THIS IS NOT A MAJOR CONCERN BY ITSELF, AS MODELS OFTEN OVERFIT THEIR TRAINING DATA; THE TEST ACCURACY IS WHAT MATTERS.*

In [None]:


objects = ('Logistic_reg', 'Decision_Tree', 'SVC', 'K_Neighbors', ' GaussianNB','Random Forest')
y_pos = np.arange(len(objects))
SCORES = [pca_score[0]*100,pca_score[1]*100,pca_score[2]*100,pca_score[3]*100,pca_score[4]*100,pca_score[5]*100]

plt.bar(y_pos, SCORES,color='#557f2d' ,align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('scores with PCA')
plt.title('MACHINE LEARNING MODELS')

plt.show()

**ACCURACY WITH PCA**

***SVC** HAS AN ACCURACY **ABOVE 75%** AND IS CLEARLY THE WINNER HERE.*
*LOGISTIC REGRESSION, K NEIGHBOURS AND RANDOM FOREST ALSO PERFORM WELL, **ABOVE 70%**.*

In [None]:


objects = ('Logistic_reg', 'Decision_Tree', 'SVC', 'K_Neighbors', ' GaussianNB','Random Forest')
y_pos = np.arange(len(objects))
SCORES = [pca_accuracy[0]*100,pca_accuracy[1]*100,pca_accuracy[2]*100,pca_accuracy[3]*100,pca_accuracy[4]*100,pca_accuracy[5]*100]

plt.bar(y_pos, SCORES,color='#557f2d', align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('ACCURACY WITH PCA')
plt.title('MACHINE LEARNING MODELS')

plt.show()

**SCORE WITH LDA**

*AGAIN, RANDOM FOREST AND DECISION TREE SCORE UNREASONABLY WELL. THE MODELS ARE OVERFITTING THE TRAINING DATA.* *THIS IS NOT A MAJOR CONCERN BY ITSELF, AS MODELS OFTEN OVERFIT THEIR TRAINING DATA; THE TEST ACCURACY IS WHAT MATTERS.*

In [None]:


objects = ('Logistic_reg', 'Decision_Tree', 'SVC', 'K_Neighbors', ' GaussianNB','Random Forest')
y_pos = np.arange(len(objects))
SCORES = [lda_score[0]*100,lda_score[1]*100,lda_score[2]*100,lda_score[3]*100,lda_score[4]*100,lda_score[5]*100]

plt.bar(y_pos, SCORES,color='#2d7f5e',align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('scores with LDA ')
plt.title('MACHINE LEARNING MODELS')

plt.show()

**ACCURACY WITH LDA**

***SVC** HAS AN ACCURACY **ABOVE 75%** AND IS CLEARLY THE WINNER HERE.*
*LOGISTIC REGRESSION, K NEIGHBOURS AND RANDOM FOREST ALSO PERFORM WELL, **ABOVE 70%**.*

In [None]:

objects = ('Logistic_reg', 'Decision_Tree', 'SVC', 'K_Neighbors', ' GaussianNB','Random Forest')
y_pos = np.arange(len(objects))
SCORES = [lda_accuracy[0]*100,lda_accuracy[1]*100,lda_accuracy[2]*100,lda_accuracy[3]*100,lda_accuracy[4]*100,lda_accuracy[5]*100]

plt.bar(y_pos, SCORES,color='#2d7f5e', align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('Accuracy with LDA ')
plt.title('MACHINE LEARNING MODELS')

plt.show()

# COMPARISON

  **COMPARISON OF CLASSIFICATION ALGORITHMS SCORES W.R.T PCA,LDA AND SIMPLE TECHNIQUE**

**SCORES**

**The decision tree is overperforming while Gaussian NB underperforms. SVC has a reasonable score nearing 80%.**

In [None]:
# set width of bar
barWidth = 0.25

# set height of bar
bars1 = [lda_score[0]*100,lda_score[1]*100,lda_score[2]*100,lda_score[3]*100,lda_score[4]*100,lda_score[5]*100]
bars2 = [pca_score[0]*100,pca_score[1]*100,pca_score[2]*100,pca_score[3]*100,pca_score[4]*100,pca_score[5]*100]
bars3 = [simple_score[0]*100,simple_score[1]*100,simple_score[2]*100,simple_score[3]*100,simple_score[4]*100,simple_score[5]*100]

# Set position of bar on X axis
r1 = np.arange(len(bars1))
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]

# Make the plot
plt.bar(r1, bars1, color='#7f6d5f', width=barWidth, edgecolor='white', label='LDA')
plt.bar(r2, bars2, color='#557f2d', width=barWidth, edgecolor='white', label='PCA')
plt.bar(r3, bars3, color='#2d7f5e', width=barWidth, edgecolor='white', label='SIMPLE')

# Add xticks on the middle of the group bars
plt.xlabel('MODELS', fontweight='bold')
plt.ylabel('Score', fontweight='bold')
plt.xticks([r + barWidth for r in range(len(bars1))], ['Logistic_reg', 'Decision_Tree', 'SVC', 'K_Neighbors', ' GaussianNB','Random Forest'])

# Create legend & Show graphic
plt.legend()
plt.show()

 **COMPARISON OF CLASSIFICATION ALGORITHMS ACCURACIES W.R.T PCA,LDA AND SIMPLE TECHNIQUE**

***SVC** HAS AN ACCURACY **ABOVE 75%** AND IS CLEARLY THE WINNER HERE.*
*LOGISTIC REGRESSION, K NEIGHBOURS AND RANDOM FOREST ALSO PERFORM WELL, **ABOVE 70%**.*

**Gaussian NB** is underperforming.

In [None]:
# set width of bar
barWidth = 0.25

# set height of bar
bars1 = [lda_accuracy[0]*100,lda_accuracy[1]*100,lda_accuracy[2]*100,lda_accuracy[3]*100,lda_accuracy[4]*100,lda_accuracy[5]*100]
bars2 = [pca_accuracy[0]*100,pca_accuracy[1]*100,pca_accuracy[2]*100,pca_accuracy[3]*100,pca_accuracy[4]*100,pca_accuracy[5]*100]
bars3 = [simple_accuracy[0]*100,simple_accuracy[1]*100,simple_accuracy[2]*100,simple_accuracy[3]*100,simple_accuracy[4]*100,simple_accuracy[5]*100]

# Set position of bar on X axis
r1 = np.arange(len(bars1))
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]

# Make the plot
plt.bar(r1, bars1, color='#7f6d5f', width=barWidth, edgecolor='white', label='LDA')
plt.bar(r2, bars2, color='#557f2d', width=barWidth, edgecolor='white', label='PCA')
plt.bar(r3, bars3, color='#2d7f5e', width=barWidth, edgecolor='white', label='SIMPLE')

# Add xticks on the middle of the group bars
plt.xlabel('MODELS', fontweight='bold')
plt.ylabel('ACCURACY', fontweight='bold')
plt.xticks([r + barWidth for r in range(len(bars1))], ['Logistic_reg', 'Decision_Tree', 'SVC', 'K_Neighbors', ' GaussianNB','Random Forest'])

# Create legend & Show graphic
plt.legend()
plt.show()

# **SVC Optimization**

**For rbf kernel our best parameters are {'C': 10, 'gamma': 0.1}.**


As we saw in the analysis above, SVC outperformed all the other models. What can we do to increase its accuracy further? Let's see.

In [None]:
from sklearn.svm import SVC
param1 = [0.001, 0.01, 0.1, 1, 10]
param2 = [0.001, 0.01, 0.1, 1]
param_grid = {'C': param1, 'gamma' : param2}
grid_search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=10)
grid_search.fit(x_train, y_train)
print (grid_search.best_params_)
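
Besides `best_params_`, a fitted `GridSearchCV` also exposes `best_score_`, the mean cross-validated accuracy of the best setting, and can score held-out data directly. The sketch below shows this on synthetic data with a smaller grid; `pipe`, `grid` and `search` are hypothetical names, and putting the scaler inside the pipeline is a design choice of this example, not something the notebook does.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42
)

# With the scaler inside the pipeline, it is refit on each CV training fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}

search = GridSearchCV(pipe, grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)           # best combination found on the grid
print(round(search.best_score_, 3))  # its mean cross-validated accuracy
print(search.score(X_te, y_te))      # accuracy of the refit model on the test split
```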



**For the linear kernel our best parameters are {'C': 1, 'gamma': 0.001}.** Note that `gamma` is ignored by the linear kernel, so only `C` matters here.


In [None]:
from sklearn.svm import SVC
param1 = [0.001, 0.01, 0.1, 1, 10]
param2= [0.001, 0.01, 0.1, 1]
param_grid = {'C': param1, 'gamma' : param2}
grid_search = GridSearchCV(SVC(kernel='linear'), param_grid, cv=10)
grid_search.fit(x_train, y_train)
print (grid_search.best_params_)


**Our accuracy with kernel='rbf', C=10, gamma=0.1 is 76 percent.**

In [None]:
from sklearn.svm import SVC
SVC = SVC(kernel='rbf', C=10, gamma=0.1).fit(x_train,y_train)
predictions_SVC = SVC.predict(x_test)
accuracy_score(y_test, predictions_SVC)

**Our accuracy with kernel='linear', C=1, gamma=0.001 is 73.9 percent.**

In [None]:
from sklearn.svm import SVC
SVC = SVC(kernel='linear', C=1, gamma=0.001).fit(x_train,y_train)

predictions_SVC = SVC.predict(x_test)
accuracy_score(y_test, predictions_SVC)

**For the rbf kernel with LDA, our best parameters are {'C': 10, 'gamma': 0.1}.**

In [None]:
from sklearn.svm import SVC
param1 = [0.001, 0.01, 0.1, 1, 10]
param2 = [0.001, 0.01, 0.1, 1]
param_grid = {'C': param1, 'gamma' : param2}
grid_search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=10)
grid_search.fit(lda_train, y_train)
print (grid_search.best_params_)


**For the rbf kernel with LDA, our accuracy is 75.1 percent.**

In [None]:
from sklearn.svm import SVC
SVC = SVC(kernel='rbf', C=10, gamma=0.1).fit(lda_train,y_train)

predictions_SVC = SVC.predict(lda_test)
accuracy_score(y_test, predictions_SVC)

**For the linear kernel with LDA, our best parameters are {'C': 1, 'gamma': 0.001}.**

In [None]:
from sklearn.svm import SVC
param1 = [0.001, 0.01, 0.1, 1, 10]
param2 = [0.001, 0.01, 0.1, 1]
param_grid = {'C': param1, 'gamma' : param2}
grid_search = GridSearchCV(SVC(kernel='linear'), param_grid, cv=10)
grid_search.fit(lda_train, y_train)
print (grid_search.best_params_)

**For the linear kernel with LDA, our accuracy is 73.1 percent.**

In [None]:
from sklearn.svm import SVC
SVC = SVC(kernel='linear', C=1, gamma=0.001).fit(lda_train,y_train)

predictions_SVC = SVC.predict(lda_test)
accuracy_score(y_test, predictions_SVC)

**For the rbf kernel with PCA, our best parameters are {'C': 10, 'gamma': 0.1}.**

In [None]:
from sklearn.svm import SVC
param1 = [0.001, 0.01, 0.1, 1, 10]
param2 = [0.001, 0.01, 0.1, 1]
param_grid = {'C': param1, 'gamma' : param2}
grid_search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=10)
grid_search.fit(pca_train, y_train)
print (grid_search.best_params_)

**For the rbf kernel with PCA, our accuracy is 76 percent.**

In [None]:
from sklearn.svm import SVC
SVC = SVC(kernel='rbf', C=10, gamma=0.1).fit(pca_train,y_train)

predictions_SVC = SVC.predict(pca_test)
accuracy_score(y_test, predictions_SVC)

**For the linear kernel with PCA, our best parameters are {'C': 1, 'gamma': 0.001}.**

In [None]:
from sklearn.svm import SVC
param1 = [0.001, 0.01, 0.1, 1, 10]
param2 = [0.001, 0.01, 0.1, 1]
param_grid = {'C': param1, 'gamma' : param2}
grid_search = GridSearchCV(SVC(kernel='linear'), param_grid, cv=10)
grid_search.fit(pca_train, y_train)
print (grid_search.best_params_)

**For the linear kernel with PCA, our accuracy is 73.9 percent.**

In [None]:
from sklearn.svm import SVC
SVC = SVC(kernel='linear', C=1, gamma=0.001).fit(pca_train,y_train)

predictions_SVC = SVC.predict(pca_test)
accuracy_score(y_test, predictions_SVC)

# Random forest optimization

**This grid search for random forest consumes a lot of compute. I ran it on another machine and recorded the results here.**

Training models has a considerable carbon footprint.

For information on the training costs of a model, see: https://interestingengineering.com/training-ai-is-shockingly-costly-to-the-environment

In [None]:
# from sklearn.ensemble import RandomForestClassifier as RF
#
# param_grid = {
#     'n_estimators': [50,100,200, 500],
#     'max_features': ['auto', 'sqrt', 'log2'],
#     'max_depth' : [3,4,5,6,7,8],
#     'criterion' :['entropy', 'gini']
# }
#
# grid_search = GridSearchCV(RF(), param_grid, cv=10)
# grid_search.fit(x_train, y_train)
# print (grid_search.best_params_)

**Best parameters are {'criterion': 'entropy', 'max_depth': 7, 'max_features': 'sqrt', 'n_estimators': 200}**

**Our accuracy with these parameters is 78.4%.**

In [None]:
from sklearn.ensemble import RandomForestClassifier as RF
RF = RF(criterion='entropy', max_depth=7,max_features='sqrt',n_estimators= 200).fit(x_train,y_train)
predictions_RF= RF.predict(x_test)
accuracy_score(y_test, predictions_RF)


**For random forest with PCA, the best parameters are criterion='entropy', max_depth=8, max_features='sqrt', n_estimators=500.**

In [None]:
# from sklearn.ensemble import RandomForestClassifier as RF

# param_grid = {
#    'n_estimators': [50,100,200, 500],
#     'max_features': ['auto', 'sqrt', 'log2'],
#     'max_depth' : [3,4,5,6,7,8],
#     'criterion' :['entropy', 'gini']
# }
#
# grid_search = GridSearchCV(RF(), param_grid, cv=10)
# grid_search.fit(pca_train, y_train)
# print (grid_search.best_params_)

**Accuracy with criterion='entropy', max_depth=8, max_features='sqrt', n_estimators=500 and PCA is 78.5%.**

In [None]:
from sklearn.ensemble import RandomForestClassifier as RF
RF = RF(criterion='entropy', max_depth=8,max_features='sqrt',n_estimators= 500).fit(pca_train,y_train)
predictions_RF= RF.predict(pca_test)
accuracy_score(y_test, predictions_RF)


# Best Model

We had a classification problem at hand, in which each sample is classified into one of 7 different fault types. All the algorithms used (logistic regression, decision tree, SVC, k-nearest neighbours, Gaussian NB and random forest) are suitable for classification problems.

The algorithms that worked well with our dataset, in decreasing order, are:
* RANDOM FOREST with criterion='entropy', max_depth=8, max_features='sqrt', n_estimators=500 with PCA
* SVC with parameters kernel='rbf', C=10, gamma=0.1
* LOGISTIC REGRESSION
* K NEIGHBOURS
* DECISION TREE
* GAUSSIAN NB

Random forest and SVC performed reasonably well in both training score and test accuracy.