This tutorial is based on:

Comparing Machine Learning Algorithms on a single Dataset(Classification). 
https://medium.com/@vaibhavpaliwal/comparing-machine-learning-algorithms-on-a-single-dataset-classification-46ffc5d3f278

The article above contains more detailed background information; this notebook only runs the code.
**READ the steps and details on the website**

Step1-: The first step is to import the necessary libraries for the code.


In [None]:
import numpy as np
import sklearn
from sklearn import model_selection
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import pandas as pd
# import seaborn as sns



Custom functions


In [None]:

def plotMatrix(data):
  """Draw a matrix as a heatmap with each cell's value printed on top."""
  fig, ax = plt.subplots()
  # Using matshow here just because it sets the ticks up nicely. imshow is faster.
  ax.matshow(data, cmap='viridis')
  for (i, j), z in np.ndenumerate(data):
    ax.text(j, i, '{:0.1f}'.format(z), ha='center', va='center')
  plt.show()

Step2-: Now that we have imported the necessary libraries, let’s load the dataset from a CSV file using the pandas library.

In [None]:
# dataset fetched from kaggle: https://www.kaggle.com/roustekbio/breast-cancer-csv

# you should first fetch the dataset from github and upload the dataset to your jetson
data = pd.read_csv("./breastCancer.csv")


## sklearn has built in datasets
## Can you run this also on this wine dataset?

# from sklearn.datasets import load_wine
# data = load_wine()
# data = pd.DataFrame(data.data, columns=data.feature_names)
# data.head()

# import kaggle
# kaggle.api.authenticate()
# kaggle.api.dataset_download_files('breastCancer.csv', path='.', unzip=True)


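Step3-: Check for missing data and preprocess it. In this dataset, missing values are marked with '?'; the code replaces them with the sentinel value -99999. We will also look at the data axes and column names.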

In [None]:
data.replace('?',-99999, inplace=True)
print(data.axes)
print(data.columns)
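As an alternative to the sentinel replacement above, missing markers can be turned into NaN so that pandas treats them as missing. This is a different technique from the one used in this notebook, shown only as a minimal sketch. The toy frame and its column names are hypothetical and are not taken from breastCancer.csv.

```python
import pandas as pd

# Hypothetical toy frame; "bare_nuclei" stands in for a column that uses '?' for missing values.
toy = pd.DataFrame({
    "clump_thickness": [5, 3, 8, 1],
    "bare_nuclei": ["1", "?", "10", "2"],
})

# errors="coerce" turns anything that is not a number (here '?') into NaN
# and converts the column from strings to a numeric dtype.
toy["bare_nuclei"] = pd.to_numeric(toy["bare_nuclei"], errors="coerce")

# Count the missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
print(toy.dtypes)
```

With NaN in place, describe() and corr() skip the missing entries instead of treating -99999 as a real measurement. Rows with NaN would still have to be dropped or imputed before training a model.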

Step4-: In this step, we will pick one row (row 20) and print its values. We will also print the shape of the data, that is, the total number of instances (rows) and attributes (columns). The last line of the output is the shape of the dataset.

In [None]:
print (data.loc[20])
print (data.shape)

Step5-: Now we will describe our data, that is, look at summary statistics for each attribute. The pandas describe() function generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

In [None]:
print(data.describe())

Step6-: Now we will visualize the dataset with a histogram of each attribute. A histogram represents the distribution of the data. DataFrame.hist() calls matplotlib.pyplot.hist() on each series in the DataFrame, resulting in one histogram per column.

In [None]:
data.hist(figsize=(15,15))
plt.show()

Step7-: We will plot the scatter matrix for our dataset, which is widely used to understand the correlation between attributes. In a scatter plot matrix, each variable is plotted against every other variable.


In [None]:
scatter_matrix(data, figsize=(15,15))
plt.show()

In [None]:
# Optional: annotate the upper triangle of the scatter matrix with correlation values
# axes = scatter_matrix(data, alpha=0.5, diagonal='kde')
# corr = data.corr().values
# for i, j in zip(*np.triu_indices_from(axes, k=1)):
#     axes[i, j].annotate("%.3f" % corr[i, j], (0.8, 0.8), xycoords='axes fraction', ha='center', va='center')
# plt.show()

Step8-: In this step, we will plot the correlation matrix to see the correlation between attributes. This helps us identify which attributes are highly correlated, so that we can decide which attributes are important for us. Correlation values lie between -1 and 1.

There are two key components of a correlation value:
1. magnitude: the larger the magnitude (closer to 1 or -1), the stronger the correlation.
2. sign: if negative, there is an inverse correlation; if positive, there is a direct correlation.
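To make magnitude and sign concrete, here is a small dependency-free sketch of the Pearson correlation coefficient, which is the default method used by pandas' corr(). The helper name pearson and the toy lists are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
r_pos = pearson(x, [2, 4, 6, 8, 10])   # y grows with x: strong direct correlation
r_neg = pearson(x, [10, 8, 6, 4, 2])   # y shrinks as x grows: strong inverse correlation
r_weak = pearson(x, [3, 1, 4, 1, 5])   # noisy relationship: small magnitude

print(r_pos, r_neg, round(r_weak, 3))
```

The first two values sit at the ends of the range (1 and -1), while the third is much closer to 0, which is what a weak correlation looks like in the matrix.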

In [None]:
corrmat = data.corr()

# #using seaborn
# plt.figure(figsize=(10,10))
# sns.heatmap(corrmat, cmap='viridis', annot=True, linewidths=0.5,)

# plt.imshow(data, cmap='hot')
# plt.show()

# #using matplotlib
# plt.matshow(data.corr())
# plt.show()

#using pandas
# Styler.set_precision() was removed in pandas 2.0; format() gives the same 4-decimal display
corrmat.style.background_gradient(cmap='viridis').format("{:.4f}")



In [None]:
plotMatrix(corrmat)

Step9-: In this step, we will convert the column names to a list and then divide our data into two variables (X and y). X consists of all attributes except “class” and “id”. y holds the target value, which is our “class” attribute. We then look at the shape of both variables.

In [None]:
columns = data.columns.to_list()

columns = [c for c in columns if c not in ["class", "id"]]

target = "class"

X = data[columns]
y = data[target]

print(X.shape)
print(y.shape)



Step10-: Now look at one row of X and y (row 20 again) to check that the split into attributes and target looks right.


In [None]:
print(X.loc[20])
print(y.loc[20])

Step11-: This step is very important: we split our data into a training set and a test set so that we can measure accuracy on data the model has not seen during training. For this we use train_test_split from sklearn's model_selection module. The split must be done before you start training your model. Here 20% of the rows are held out for testing (test_size=0.2).

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X,y, test_size=0.2)


In [None]:
seed=5
scoring = 'accuracy'

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
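The idea behind the split can be sketched in a few lines of plain Python: shuffle the row indices, then hold out a fraction of them. This is only an illustration of the principle, not scikit-learn's actual implementation; the function name simple_train_test_split is hypothetical.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=5):
    """Shuffle row indices and hold out the first `test_size` fraction as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)          # a fixed seed makes the split repeatable
    n_test = int(round(len(rows) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(100))                            # stand-in for 100 data rows
train, test = simple_train_test_split(rows)
print(len(train), len(test))
```

Because the shuffle is seeded, running the sketch twice gives the same split. In scikit-learn the same effect is obtained by passing random_state to train_test_split; the call in this notebook does not pass it, so its split changes on every run.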

Step12 -: Our code sometimes emits FutureWarning messages. To ignore them, we use the command below.

In [None]:
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)
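The command above silences FutureWarning for the rest of the session. If that is broader than wanted, the standard library also allows the filter to be limited to one block of code. A minimal sketch, where legacy_call is a hypothetical function that emits such a warning:

```python
import warnings

def legacy_call():
    """Hypothetical function that emits a FutureWarning."""
    warnings.warn("this behaviour will change", FutureWarning)
    return 42

# The filter only applies inside the with-block; the previous filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=FutureWarning)
    value = legacy_call()

print(value)
```

This keeps warnings visible elsewhere in the notebook, which can be useful when a library upgrade is planned.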

Step13 -: This is the most important step of our code: we set up both algorithms (SVM and Random Forest), then train and evaluate each model using 10-fold cross-validation on the training set. First look at the code and output; then we will discuss the features and parameters.

In [None]:
from sklearn.ensemble import RandomForestClassifier

models = []

models.append(('SVM', SVC(gamma='auto')))
models.append(('RFC', RandomForestClassifier(max_depth=5, n_estimators=40)))

results = []
names = []

for name, model in models:
  kfold = model_selection.KFold(n_splits=10, random_state=seed, shuffle=True)
  cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
  print(cv_results)
  results.append(cv_results)
  names.append(name)
  msg = "%s Algorithm: Accuracy %f (%f)" % (name, cv_results.mean(), cv_results.std())
  print(msg)
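To show what 10-fold cross-validation does under the hood, here is a dependency-free sketch. It is an illustration under simplifying assumptions, not scikit-learn's implementation: kfold_indices and majority_class_accuracy are hypothetical helpers, the labels are toy data, and the "model" simply predicts the most common training label.

```python
import random

def kfold_indices(n_samples, n_splits=10, seed=5):
    """Yield (train_idx, val_idx) pairs; every sample is used for validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx

def majority_class_accuracy(labels, train_idx, val_idx):
    """Stand-in for fit + score: predict the most common training label for every sample."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in val_idx) / len(val_idx)

labels = [0] * 65 + [1] * 35                      # toy labels, 100 samples
scores = [majority_class_accuracy(labels, tr, va)
          for tr, va in kfold_indices(len(labels))]

mean = sum(scores) / len(scores)
std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
print("Accuracy %f (%f)" % (mean, std))
```

Each of the 10 scores comes from a model that never saw its validation fold, and the mean and standard deviation are the two numbers printed for each algorithm in the cell above.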

Step14 -: Now we will draw a box plot to compare the cross-validation accuracy of both algorithms. As the plot shows, the accuracy achieved by Random Forest is higher than that of SVM, so RF is the more accurate model on this dataset.

In [None]:
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

Step15 -: Let’s visualize the result of all 10 folds graphically and look at the mean of all the scores.

In [None]:
# Yellowbrick is not supported on Jetson (see next code block)

# from sklearn.model_selection import StratifiedKFold
# # from yellowbrick.model_selection import CVScores

# _, ax = plt.subplots()

# cv = StratifiedKFold(10)

# oz = CVScores(RandomForestClassifier(max_depth=5, n_estimators=40), ax=ax, cv=cv, scoring='accuracy')
# oz.fit(X,y)
# oz.poof()




In [None]:
from sklearn.model_selection import cross_val_score
CVSCORE = cross_val_score(RandomForestClassifier(max_depth=5, n_estimators=40), X, y, cv=10)
# print(CVSCORE)
df = pd.DataFrame(CVSCORE)

# cv = StratifiedKFold(10)
ax = df.plot.bar()
for p in ax.patches:
  ax.annotate(str(round(p.get_height(),4)),(p.get_x(), p.get_height())  )
print("Mean score: ", df.mean())


Step16 -: Now we will make predictions on the test set and look at the accuracy score and the classification report, which contains several important metrics (precision, recall, F1-score, and support).

In [None]:
for name, model in models:
  model.fit(X_train, y_train)
  predictions = model.predict(X_test)
  print(name)
  print(accuracy_score(y_test, predictions))
  print(classification_report(y_test, predictions))
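The main numbers in the report can be reproduced by hand. The sketch below computes accuracy, precision, recall, and F1 for one class in plain Python; binary_report is a hypothetical helper and the label lists are toy data, not output from the models above.

```python
def binary_report(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of all predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of all actual positives, how many were found
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
report = binary_report(y_true, y_pred)
print(report)
```

classification_report prints these per-class values for every class, together with the support, which is the number of true samples of that class.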

Step17 -: Now we will look at the confusion matrix to evaluate the accuracy of the classification. By definition, a confusion matrix C is such that C[i, j] is equal to the number of observations known to be in group i but predicted to be in group j. Thus, in binary classification, the count of true negatives is C[0, 0], false negatives is C[1, 0], true positives is C[1, 1], and false positives is C[0, 1]. For two classes the matrix is therefore laid out as [[TN, FP], [FN, TP]].

In [None]:
from sklearn.metrics import confusion_matrix
# `model` is the last model fitted in the loop above (the Random Forest)
predict = model.predict(X_test)
confusion = confusion_matrix(y_test,predict)
print("==== Confusion Matrix ===")
print(confusion)
print('\n')

#using matplotlib
# plt.matshow(confusion)
# plt.show()


# plotMatrix(confusion)

# #using seaborn
# from sklearn import metrics
# cnf_matrix = metrics.confusion_matrix(y_test, predict)
# p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True)



#using pandas
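# The matrix holds integer counts, so no precision setting is needed
# (Styler.set_precision() was removed in pandas 2.0)
pd.DataFrame(confusion).style.background_gradient(cmap='viridis')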
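The indexing convention described in Step17 (rows are actual groups, columns are predicted groups) can be checked with a short plain-Python sketch. confusion_counts is a hypothetical helper and the labels are toy data.

```python
def confusion_counts(y_true, y_pred, labels):
    """Build a matrix where entry [i][j] counts samples of actual class i predicted as class j."""
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0, 1, 1]

C = confusion_counts(y_true, y_pred, labels=[0, 1])
tn, fp = C[0]   # row 0: actual negatives
fn, tp = C[1]   # row 1: actual positives
print(C)
print("TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp))
```

Reading the matrix row by row like this avoids the common mistake of swapping false positives and false negatives.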


Step18 -: In this step we will calculate Cohen's kappa score and the Matthews correlation coefficient (MCC).

In [None]:
from sklearn.metrics import cohen_kappa_score
# `predictions` holds the output of the last model in the Step16 loop (the Random Forest)
cohen_score = cohen_kappa_score(y_test, predictions)
print("kappa score: ", cohen_score)

from sklearn.metrics import matthews_corrcoef

MCC = matthews_corrcoef(y_test, predictions)
print("MCC score: ", MCC)
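Both scores can be derived from the four cells of a binary confusion matrix. The sketch below uses the standard textbook formulas in plain Python; the helper names and the counts are illustrative and are not results from this notebook. Returning 0.0 when a denominator is zero is a convention chosen here for simplicity.

```python
import math

def kappa_from_counts(tn, fp, fn, tp):
    """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
    n = tn + fp + fn + tp
    p_observed = (tn + tp) / n
    p_expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / (n * n)
    if p_expected == 1.0:
        return 0.0
    return (p_observed - p_expected) / (1.0 - p_expected)

def mcc_from_counts(tn, fp, fn, tp):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Toy counts: 3 true negatives, 2 false positives, 1 false negative, 4 true positives.
kappa = kappa_from_counts(tn=3, fp=2, fn=1, tp=4)
mcc = mcc_from_counts(tn=3, fp=2, fn=1, tp=4)
print("kappa score: ", round(kappa, 4))
print("MCC score: ", round(mcc, 4))
```

Both scores range up to 1 for perfect agreement and are near 0 when the predictions are no better than chance, which makes them more informative than plain accuracy on imbalanced classes.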