<a href="https://colab.research.google.com/github/sjriek/AIS7/blob/main/Ais7test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Comparing Machine Learning Algorithms on a single Dataset(Classification). 
https://medium.com/@vaibhavpaliwal/comparing-machine-learning-algorithms-on-a-single-dataset-classification-46ffc5d3f278

Step1-: The first step is to import the necessary libraries for the code.


In [None]:
import numpy as np
import sklearn
from sklearn import model_selection
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns



Custom functions


In [None]:

def plotMatrix(data):
  fig, ax = plt.subplots()
  # Using matshow here just because it sets the ticks up nicely. imshow is faster.
  ax.matshow(data, cmap='viridis')
  for (i, j), z in np.ndenumerate(data):
     ax.text(j, i, '{:0.1f}'.format(z), ha='center', va='center')
  plt.show()

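Step2-: Now that we have imported the necessary libraries, let's load the dataset. The original article reads it from a CSV file with pandas; this notebook loads scikit-learn's built-in wine dataset instead and keeps the CSV route as a comment.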

In [None]:
# dataset fetched from kaggle: https://www.kaggle.com/roustekbio/breast-cancer-csv

# data = pd.read_csv("./breastCancer.csv")

from sklearn.datasets import load_wine
wine = load_wine()
data = pd.DataFrame(wine.data, columns=wine.feature_names)
# Keep the target as a "class" column so the later steps can select it
data["class"] = wine.target
data.head()

# import kaggle
# kaggle.api.authenticate()
# kaggle.api.dataset_download_files('breastCancer.csv', path='.', unzip=True)


Step3-: Check for the missing data and preprocess it, we will also look at the data axes and attributes.

In [None]:
data.replace('?',-99999, inplace=True)
print(data.axes)
print(data.columns)

Step4-: In this step, we will pick one row (index 20) and look at its values. We will also check the shape of the data, that is, the total number of instances and attributes.

In [None]:
print (data.loc[20])
print (data.shape)

Step5-: Now we will describe our data, which means looking at summary statistics for each attribute. The (describe) function of pandas generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.

In [None]:
print(data.describe())

Step6-: Now we will do a graphical representation of our dataset, in which we will use (histogram) feature to visualize the graph of each attribute. A histogram is a representation of the distribution of data. This function calls matplotlib.pyplot.hist(), on each series in the Data Frame, resulting in one histogram per column.

In [None]:
data.hist(figsize=(15,15))
plt.show()

Step7-: We will plot the scatter matrix for our dataset, which is widely used for understanding the correlation between attributes. A scatter plot matrix plots each variable in a collection against every other variable.


In [None]:
scatter_matrix(data, figsize=(15,15))
plt.show()

In [None]:
# axes = scatter_matrix(df, alpha=0.5, diagonal='kde')
# corr = data.values()
# for i, j in zip(*plt.np.triu_indices_from(axes, k=1)):
#     axes[i, j].annotate("%.3f" %corr[i,j], (0.8, 0.8), xycoords='axes fraction', ha='center', va='center')
# plt.show()

Step8-: In this step, we will plot the correlation matrix to see the correlation between attributes. This helps us find attributes that are highly correlated with each other, so we can decide which attributes matter for the model. Correlation values always lie between -1 and 1.

There are two key components of a correlation value: 1. magnitude: the larger the magnitude (closer to 1 or -1), the stronger the correlation. 2. sign: if negative, there is an inverse correlation; if positive, the two attributes rise and fall together.

In [None]:
corrmat = data.corr()

# #using seaborn
# plt.figure(figsize=(10,10))
# sns.heatmap(corrmat, cmap='viridis', annot=True, linewidths=0.5,)

# plt.imshow(data, cmap='hot')
# plt.show()

# #using matplotlib
# plt.matshow(data.corr())
# plt.show()

#using pandas
corrmat.style.background_gradient(cmap='viridis').format(precision=4)  # set_precision() is deprecated; format(precision=...) replaces it



In [None]:
plotMatrix(corrmat)

Step9-: In this step, we will convert the columns in a list and then divide our data into two variables (X and y), where X is consisting of all attributes except (class and ID). In y variable, we will put target value which is our “class” attribute and then look for the shape of both variables.

In [None]:
columns = data.columns.to_list()

columns = [c for c in columns if c not in ["class", "id"]]

target = "class"

X = data[columns]
y = data[target]

print(X.shape)
print(y.shape)



Step10-: Now look at any random row of X and y to check we are going well.


In [None]:
print(X.loc[20])
print(y.loc[20])

Step11-: This step is very important: we split our data into training and testing sets so that we can check the accuracy on data the model has not seen. For this, we use the (model selection) library. The dataset is split into two sets, one for training and the other for testing, and this is done before training the model.

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X,y, test_size=0.2)


In [None]:
seed=5
scoring = 'accuracy'

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

Step12 -: Sometimes we get the future warning in our code, so to ignore them we will use the below command.

In [None]:
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

Step13 -: This is the most important step of our code, where we will import both algorithms (SVM and Random Forest) and then we will train model and test it using 10-fold cross-validation. Firstly, look at the code and output then we will discuss the features and parameters.

In [None]:
from sklearn.ensemble import RandomForestClassifier

models = []

models.append(('SVM', SVC(gamma='auto')))
models.append(('RFC', RandomForestClassifier(max_depth=5, n_estimators=40)))

results = []
names = []

for name, model in models:
  kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)  # random_state requires shuffle=True
  cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
  print(cv_results)
  results.append(cv_results)
  names.append(name)
  msg = "%s Algorithm: Accuracy %f (%f)" % (name, cv_results.mean(), cv_results.std())
  print(msg)

Step14 -: Now we will plot an algorithm comparison box plot to compare the accuracy of both algorithms and as we can see the accuracy calculated by Random Forest is more than the accuracy of SVM. It means RF is more accurate than SVM.

In [None]:
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

Step15 -: Let’s visualize the result of all 10 folds graphically and look at the mean of all the scores.

In [None]:
from sklearn.model_selection import StratifiedKFold
from yellowbrick.model_selection import CVScores

_, ax = plt.subplots()

cv = StratifiedKFold(10)

oz = CVScores(RandomForestClassifier(max_depth=5, n_estimators=40), ax=ax, cv=cv, scoring='accuracy')
oz.fit(X,y)
oz.show()  # poof() was renamed to show() in Yellowbrick 1.0




In [None]:
from sklearn.model_selection import cross_val_score
CVSCORE = cross_val_score(RandomForestClassifier(max_depth=5, n_estimators=40), X, y, cv=10)
# print(CVSCORE)
df = pd.DataFrame(CVSCORE)

# cv = StratifiedKFold(10)
ax = df.plot.bar()
for p in ax.patches:
  ax.annotate(str(round(p.get_height(),4)),(p.get_x(), p.get_height())  )
print("Mean score: ", df.mean())


Step16 -: Now we will make predictions on the validation (test) set and look at the accuracy score and the classification report, which contains several important metrics.

In [None]:
for name, model in models:
  model.fit(X_train, y_train)
  predictions = model.predict(X_test)
  print(name)
  print(accuracy_score(y_test, predictions))
  print(classification_report(y_test, predictions))

Step17 -: Now we will look at the confusion matrix to evaluate the accuracy of classification. By definition, a confusion matrix C is such that C[i, j] is equal to the number of observations known to be in group i and predicted to be in group j. Thus, in binary classification, the count of true negatives is C[0, 0], false negatives is C[1, 0], true positives is C[1, 1] and false positives is C[0, 1]. The confusion matrix for our test set is computed below.

In [None]:
from sklearn.metrics import confusion_matrix
predict = model.predict(X_test)
confusion = confusion_matrix(y_test,predict)
print("==== Confusion Matrix ===")
print(confusion)
print('\n')

#using matplotlib
# plt.matshow(confusion)
# plt.show()


# def plotMatrix(data):
#   fig, ax = plt.subplots()
#   # Using matshow here just because it sets the ticks up nicely. imshow is faster.
#   ax.matshow(data, cmap='viridis')
#   for (i, j), z in np.ndenumerate(data):
#      ax.text(j, i, '{:0.1f}'.format(z), ha='center', va='center')
#   plt.show()

# plotMatrix(confusion)

#using seaborn
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, predict)
p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True)



#using pandas
pd.DataFrame(cnf_matrix).style.background_gradient(cmap='viridis').format(precision=4)  # set_precision() is deprecated


Step18 -: In this step we will calculate the Cohen Kappa score and Matthews Correlation Coefficient (MCC).

In [None]:
from sklearn.metrics import cohen_kappa_score
cohen_score = cohen_kappa_score(y_test, predictions)
print("kappa score: ", cohen_score)

from sklearn.metrics import matthews_corrcoef

MCC = matthews_corrcoef(y_test, predictions)
print("MCC score: ", MCC)

===============================






===================================

In [None]:
from sklearn.metrics import precision_recall_curve
from sklearn.linear_model import SGDClassifier

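# precision_recall_curve needs a binary target: treat the highest class label as positive
y_train_bin = (y_train == y_train.max())

sgd_clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)   #random_state=42
sgd_clf.fit(X_train, y_train_bin)               # fit on training features and labels
y_scores = sgd_clf.decision_function(X_train)   # one score per training sample
precisions, recalls, thresholds = precision_recall_curve(y_train_bin, y_scores)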

Comparing Machine Learning Algorithms on a single Dataset(Classification).
Vaibhav Paliwal

Mar 5, 2020·17 min read





Project Description-:
In this project, I used the Breast Cancer dataset from the UCI Repository, which is a classification dataset. I compared the performance of two well-known supervised machine learning algorithms on it: the Random Forest algorithm and the Support Vector Machine algorithm. I implemented the whole project in Python using Jupyter Notebook and several Python libraries.
About Dataset -:
I used the Breast Cancer Wisconsin (Original) dataset, which was donated on 15th July 1992. I took it from the UCI Machine Learning Repository, an open repository that hosts many datasets. The dataset consists of 699 instances and 10 attributes. It is a well-known dataset that has been used on many pages and in many publications. I converted the dataset to a CSV file and then used it in Python for the project. The link for the dataset is:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
Problem Description -:
The dataset is a classification dataset, so the problem is a classification problem: we need to determine whether a sample contains cancerous cells (Malignant) or non-cancerous cells (Benign). The final attribute of the dataset (Class) takes two values:
2 for Benign and 4 for Malignant.
Algorithm Used in Project:
Many algorithms, such as SVM, KNN, Naïve Bayes, Random Forest and Decision Trees, can be used to solve classification problems. In my project, I used two widely used supervised machine learning algorithms (Support Vector Machine and Random Forest); both usually reach good accuracy on classification problems.
About Support Vector Machine -:
It is a supervised machine learning algorithm that is mainly used to classify data into different classes. SVM uses a hyperplane as the decision boundary between the classes. With more than two classes, several separating hyperplanes can be combined so that the data is divided into segments and each segment contains only one kind of data. Example: if we only had two features like Height and Hair length of an individual, we'd first plot these two variables in two-dimensional space, where each point has two coordinates. The points that lie closest to the decision boundary are known as Support Vectors.
Features of SVM:
1. As mentioned before, it is a supervised learning algorithm, which means it trains itself on a set of labeled data. It learns from the labeled training data and then classifies new input data depending on what was learned in the training phase.
2. This algorithm can be used for both classification and regression problems, which is one of its main advantages. The SVM is mainly known for classification, and SVR (Support Vector Regressor) is used for regression problems.
3. It can also be used for classifying non-linear data using kernels. In SVM there are different kernels (Linear, RBF, Sigmoid, Poly). These kernels transform the non-linear data into a higher-dimensional space so that a clear hyperplane can be drawn between the classes of data.
How does SVM work?
To understand this, let's consider an example of a farm where we have rabbits and wolves. Our problem is that we need to set up a fence to protect the rabbits from the wolves. So, how do we build the fence?

One solution that comes to mind is to build a decision boundary based on the positions of the rabbits and wolves. So, it will look like:

Now you can see a clear fence along this line.
SVM works in the same way: it draws a decision boundary, known as a hyperplane, between any two classes to separate or classify them. The basic principle of SVM is to draw the hyperplane that best separates the two classes. Here in our example, the two classes are rabbits and wolves.
To understand SVM in more detail, let's consider the below picture:

So, the first task is to draw a candidate hyperplane, which can be random, and then check the distance between the hyperplane and the closest data points from each class. The data points closest to the hyperplane are known as Support Vectors, as you can see in the picture, and that is where the algorithm gets its name (Support Vector Machine).
The hyperplane is drawn based on the support vectors, and an optimum hyperplane is the one with the maximum distance from the support vectors of each class. This distance between the hyperplane and the support vectors is known as the Margin.
SVM classifies the data using the hyperplane for which the distance to the support vectors is maximum.
Let's consider a situation where you input a new data point and you want to draw the hyperplane that best separates the two classes.

The blue one is the new input data. Consider the two hyperplane pictures and see which one has the optimal hyperplane.


From the definition, we know an optimal hyperplane is the one with the maximum distance from the support vectors. Comparing both pictures, the Margin in the second picture is clearly larger than in the first one, so the second hyperplane is our optimal hyperplane.
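The margin idea described above can be sketched in a few lines. This is a minimal illustration rather than the article's own code: the six toy points are made up, and it relies on the fact that for a linear SVM with weight vector w the margin width equals 2 / ||w||. A very large C is assumed here to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical values chosen for illustration)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# For a linear SVM the margin width is 2 / ||w||
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:")
print(clf.support_vectors_)
print("margin width:", margin_width)
```

Only the points printed as support vectors determine the hyperplane; moving any other point (without crossing the margin) would leave the boundary unchanged.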
Non-Linear Support Vector Machine -:
Consider the below picture and try to draw the hyperplane.

As we can see, the above data is totally different from the previous data we considered. If you draw a straight hyperplane for such data, it will be like:

The above hyperplane is clearly incorrect, as it doesn't separate the two classes.
To solve such problems, Non-Linear SVM comes into the picture. As mentioned before, we can use a kernel to transform the data into another dimension that has a clear dividing margin between the classes of data.
A kernel function provides the option of transforming a non-linear space into a linear one. Until now, we plotted data based on two variables (x and y) in 2D space. We will transform the two-variable data into three-variable data with (z) as the third variable, which means we are visualizing the data in 3D space.
When you transform your data from 2D to 3D space you will be able to see a clear dividing margin between the two classes of data. And then you can separate the two classes by drawing an optimal hyperplane between them.
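As a rough sketch of this 2D-to-3D idea, the snippet below builds a made-up dataset (a disc of points surrounded by a ring), shows that a linear SVM struggles in 2D, and then adds a third coordinate z = x² + y². The dataset, the choice of z and all variable names are assumptions for illustration; they are not taken from the article's pictures.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Inner class: points inside a disc; outer class: points on a ring around it
n = 100
angles = rng.uniform(0, 2 * np.pi, size=2 * n)
radii = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(2.5, 3.5, n)])
X2d = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# A straight line in 2D cannot separate a disc from the ring around it
acc_2d = SVC(kernel="linear").fit(X2d, y).score(X2d, y)

# Add a third coordinate z = x^2 + y^2; a flat plane now splits the classes
z = (X2d ** 2).sum(axis=1)
X3d = np.column_stack([X2d, z])
acc_3d = SVC(kernel="linear").fit(X3d, y).score(X3d, y)

# An RBF kernel reaches a similar effect without building z explicitly
acc_rbf = SVC(kernel="rbf", gamma="scale").fit(X2d, y).score(X2d, y)

print("linear SVM, 2D features:", acc_2d)
print("linear SVM, 3D features:", acc_3d)
print("RBF SVM, 2D features:   ", acc_rbf)
```

In practice the kernel does this lifting implicitly, so you normally pass kernel="rbf" or kernel="poly" to SVC instead of engineering the extra coordinate by hand.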

END Notes:
Support Vector Machines are a very powerful classification algorithm. When combined with random forest and other machine learning tools in an ensemble, they add a different kind of model to the mix, which makes them valuable in cases where very high predictive power is required. The SVM classifier performed accurately even on a small data set; when its performance was compared with other classification algorithms like Naïve Bayes, the SVM outperformed Naïve Bayes in each case.
About Random Forest -:
Random Forest is a supervised machine learning algorithm which is used for both classification and regression problems. It is a trademarked term for an ensemble of decision trees. In Random Forest, we have a collection of decision trees (the so-called "Forest"). To classify a new object based on its attributes, each tree gives a classification and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
Generally, the more trees in the forest, the more robust the forest is. In the random forest classifier, a higher number of trees usually gives more stable and often more accurate results, although the gain levels off as more trees are added.

Random Forest basically compiles the outcome of all the decision trees and based on majority voting it gives the final result. To get a better idea, consider the below picture.

How does Random Forest work?
The first step in Random Forest is to divide the data into smaller subsets. The subsets need not be distinct; some of them may overlap.
Consider the picture to understand how Random Forest works; it consists of several steps, which are explained in the picture.

Here, T = Number of Features, D = number of trees to be constructed, V = output, the class with the highest vote.
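The steps above (overlapping subsets, one tree per subset, majority vote) can be sketched by hand. This is a simplified illustration under stated assumptions: it uses scikit-learn's wine data, 25 trees and bootstrap sampling, and the helper forest_predict is a hypothetical name. Note that scikit-learn's own RandomForestClassifier averages the trees' predicted probabilities rather than counting hard votes.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
rng = np.random.default_rng(0)

n_trees = 25  # hypothetical choice for this sketch
trees = []
for _ in range(n_trees):
    # Bootstrap sample: rows are drawn with replacement, so subsets overlap
    idx = rng.integers(0, len(X), size=len(X))
    # Each split considers only a random subset of the features
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X[idx], y[idx]))

def forest_predict(sample):
    """Return the class that receives the most votes across all trees."""
    votes = [int(t.predict(sample.reshape(1, -1))[0]) for t in trees]
    return Counter(votes).most_common(1)[0][0]

predictions = np.array([forest_predict(row) for row in X])
train_accuracy = (predictions == y).mean()
print("training accuracy of the hand-made forest:", train_accuracy)
```

The accuracy printed here is measured on the training rows, so it is optimistic; it only demonstrates that the voting mechanism works.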
Features of Random Forest -:
1. It is one of the most accurate general-purpose learning algorithms on many datasets.
2. It works well for both classification and regression problems.
3. It runs efficiently on large datasets.
4. It requires little input preparation; for example, the features do not need to be scaled.
5. Training does not take much time, because the trees can easily be grown in parallel.
End Notes -:
Random forest gives much more accurate predictions when compared to simple CART/CHAID or regression models in many scenarios. These cases generally have a high number of predictive variables and a huge sample size. This is because it captures the variance of several input variables at the same time and enables the high number of observations to participate in the prediction.
Project Code and Description:
I have completed this project in Python using Jupyter Notebook. Below, I show the code step by step, with a description of each step.
Python libraries used in the code-:
1. Pandas.
2. Scikit-learn (sklearn).
3. Numpy.
4. Matplotlib.
5. Seaborn.
Let’s look at the code stepwise:
Step1-: The first step is to import the necessary libraries for the code.

Step2-: Now as we imported the necessary libraries let’s import the dataset in the form of CSV file using the pandas library.

Step3-: Check for the missing data and preprocess it, we will also look at the data axes and attributes.

Step4-: In this step, we will randomly select one row and visualize its data, we will also look for the shape of data, which means the total number of instances and attributes. The highlighted output is the shape of the dataset.

Step5-: Now we will describe our data, it means we will look at the value of the statistics for each attribute. The (describe) function of pandas lib Generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

Step6-: Now we will do a graphical representation of our dataset, in which we will use (histogram) feature to visualize the graph of each attribute. A histogram is a representation of the distribution of data. This function calls matplotlib.pyplot.hist(), on each series in the Data Frame, resulting in one histogram per column.

Step7-: We will plot the scatter matrix for our dataset, which is broadly used for the understanding correlation between attributes. A scatter plot matrix can be formed for a collection of variables where each of the variables will be plotted against each other.

Step8-: In this step, we will plot the correlation matrix to see the correlation between attributes. This helps us find attributes that are highly correlated with each other, so we can decide which attributes matter for the model. Correlation values always lie between -1 and 1.
There are two key components of a correlation value: 1. magnitude: the larger the magnitude (closer to 1 or -1), the stronger the correlation. 2. sign: if negative, there is an inverse correlation; if positive, the two attributes rise and fall together.

As we can see, the (ID) column does not have any correlation with our main target (class), which is expected. If you look carefully, you will also notice a high correlation between uniformity of cell size and uniformity of cell shape (0.91). So we will check two situations: in one we consider all attributes for our model, and in the other we keep only one of these two attributes.
Step9-: In this step, we will convert the columns in a list and then divide our data into two variables (X and y), where X is consisting of all attributes except (class and ID). In y variable, we will put target value which is our “class” attribute and then look for the shape of both variables.

Step10-: Now look at any random row of X and y to check we are going well.

Step11-: This step is very important as we will split our data into the training and testing to check the accuracy and for this, we will use (model selection) library. When you’re working on a model and want to train it, you obviously have a dataset. But after training, we have to test the model on some test dataset. To do this we will split the dataset into two sets, one for training and the other for testing; and you do this before you start training your model.

Let's look at this in more detail. Here we split our data into (X_train, X_test and y_train, y_test). Setting (test_size = 0.2) means that 20% of the data is used for testing and 80% for training; an 80–20 split is a common practice in data science.
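A small, self-contained sketch of the 80–20 split is shown here. The 100 synthetic samples are made up, and random_state and stratify are optional extras assumed for reproducibility and balanced classes; the article's call uses only test_size.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical samples with 2 features and a balanced binary target
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# test_size=0.2 keeps 20% of the rows for testing and 80% for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5, stratify=y)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Fixing random_state makes the split repeatable between runs, which helps when comparing algorithms on exactly the same rows.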
Step12 -: Sometimes we get the future warning in our code, so to ignore them we will use the below command.

Step13 -: This is the most important step of our code, where we will import both algorithms (SVM and Random Forest) and then we will train model and test it using 10-fold cross-validation. Firstly, look at the code and output then we will discuss the features and parameters.

As we can see, we created an empty list named models and then appended both classifiers to it (SVC and Random Forest classifiers). In RF we have (n_estimators = 40), which is the number of trees we want to build.
Here we use KFold cross-validation. What is KFold cross-validation? It is a process where a given data set is split into a K number of sections/folds where each fold is used as a testing set at some point. Let’s take the scenario of 5-Fold cross-validation(K=5). Here, the data set is split into 5 folds. In the first iteration, the first fold is used to test the model and the rest are used to train the model. In the second iteration, 2nd fold is used as the testing set while the rest serve as the training set. This process is repeated until each fold of the 5 folds has been used as the testing set.

Now look at some parameters: (n_splits = 10) sets K=10, and (random_state = seed) fixes the random state, with (seed = 5) defined previously. Note that random_state only has an effect when the folds are shuffled (shuffle=True). In cross_val_score, the scoring is the accuracy score. At last, we collected the accuracy scores of all 10 folds and calculated their mean and standard deviation.
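To make the fold mechanics concrete, here is a minimal sketch of the 5-fold scenario described above, run on ten made-up samples. The tiny dataset and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(10, 1)  # ten hypothetical samples

# shuffle=True is needed for random_state to have any effect
kfold = KFold(n_splits=5, shuffle=True, random_state=5)

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    test_folds.append(test_idx)
    print(f"fold {fold}: train={train_idx} test={test_idx}")

# Every sample is used for testing exactly once across the 5 folds
all_test = np.sort(np.concatenate(test_folds))
print(all_test)
```

With 10 samples and K=5, each iteration trains on 8 samples and tests on the remaining 2.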
Step14 -: Now we will plot an algorithm comparison box plot to compare the accuracy of both algorithms and as we can see the accuracy calculated by Random Forest is more than the accuracy of SVM. It means RF is more accurate than SVM.

Step15 -: Let’s visualize the result of all 10 folds graphically and look at the mean of all the scores.

Step16 -: Now we will make predictions on the validation (test) set and look at the accuracy score and the classification report, which contains several important metrics.

Accuracy Score -: The fraction of predictions that match the true labels. In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
Classification Report -: In this report, we can see the Precision, Recall and F1-score for each class. The reported averages include macro average (the unweighted mean per label), weighted average (the support-weighted mean per label) and sample average (only for multilabel classification). Micro average (computed from the total true positives, false negatives and false positives) is only shown for multi-label or multi-class with a subset of classes, because it corresponds to accuracy otherwise.
We will discuss precision, recall and F1-score in more detail in the next step.
Step17 -: Now we will look at the confusion matrix to evaluate the accuracy of classification. By definition, a confusion matrix C is such that C[i, j] is equal to the number of observations known to be in group i and predicted to be in group j. Thus, in binary classification, the count of true negatives is C[0, 0], false negatives is C[1, 0], true positives is C[1, 1] and false positives is C[0, 1].

Let's talk about precision, recall, F1-score, accuracy and error rate.
Precision refers to the accuracy of positive predictions. Precision = TP / (TP + FP)
Recall (Sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual positive class, so it is lowered by false negatives.
Recall = TP / (TP + FN)
F1 score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. F1 = 2 × Precision × Recall / (Precision + Recall)
Error Rate = 1 - Accuracy Score.
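The definitions above can be checked on a small example. The ten labels below are made up for illustration; the snippet reads TN, FP, FN and TP from a binary confusion matrix and computes each metric by hand, then compares the values with scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, accuracy_score)

# Hypothetical true labels and predictions for a binary problem
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1])

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / (tp + tn + fp + fn)
error_rate = 1 - accuracy

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("precision:", precision, "recall:", recall)
print("F1:", f1, "accuracy:", accuracy, "error rate:", error_rate)

# The hand-computed values agree with scikit-learn
print(np.isclose(precision, precision_score(y_true, y_pred)),
      np.isclose(recall, recall_score(y_true, y_pred)),
      np.isclose(f1, f1_score(y_true, y_pred)))
```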
Now, look at our confusion matrix.

As we can see, for our 140 instances in the testing data, we got only (4) wrong predictions; the rest are predicted correctly.
Step18 -: In this step we will calculate the Cohen Kappa score and Matthews Correlation Coefficient (MCC).

Kappa Score -: A statistic that measures inter-annotator agreement. This function computes Cohen's kappa, a score that expresses the level of agreement between two annotators on a classification problem. It is defined as κ = (po − pe) / (1 − pe), where po is the empirical probability of agreement on the label assigned to any sample (the observed agreement ratio), and pe is the expected agreement when both annotators assign labels randomly. pe is estimated using a per-annotator empirical prior over the class labels. The kappa statistic is a number between -1 and 1. The maximum value means complete agreement; zero or lower means chance agreement.
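As a worked example of the kappa formula, the sketch below computes po and pe by hand on ten made-up labels and compares the result with cohen_kappa_score. Here the "two annotators" are assumed to be the true labels and the model's predictions.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical true labels and predictions
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1])

labels = np.unique(np.concatenate([y_true, y_pred]))

# Observed agreement: fraction of samples where both "annotators" agree
p_o = np.mean(y_true == y_pred)

# Expected agreement: chance that both pick the same label independently,
# using each annotator's own label frequencies
p_e = sum(np.mean(y_true == k) * np.mean(y_pred == k) for k in labels)

kappa = (p_o - p_e) / (1 - p_e)
print("p_o:", p_o, "p_e:", p_e, "kappa:", kappa)
print("sklearn:", cohen_kappa_score(y_true, y_pred))
```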
MCC Score -: The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary and multiclass classifications. It considers true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes. The MCC is, in essence, a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient.
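For the binary case, the MCC can be written directly in terms of the confusion matrix counts: MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The short sketch below applies this formula to ten made-up labels and checks it against matthews_corrcoef.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Hypothetical true labels and predictions for a binary problem
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1])

tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

# Binary MCC from the four confusion matrix cells
numerator = tp * tn - fp * fn
denominator = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = numerator / denominator

print("MCC by hand:", mcc)
print("MCC sklearn:", matthews_corrcoef(y_true, y_pred))
```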
Let’s play with the Dataset -:
1. Now we will play with the dataset to check different scenarios and compare the ML algorithms. Until now, we split our data in the 80–20 ratio and checked all the outputs.
Let's change the ratio to 70–30% and see the outputs.

The accuracy of RF is still higher than that of the SVM.
Then look at the 60–40% ratio -:
As you can see, the accuracy has increased and RF is again the more accurate algorithm.
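The ratio experiment can be sketched as a loop over test sizes. This is an assumption-laden illustration: it uses scikit-learn's built-in load_breast_cancer data (the Wisconsin Diagnostic set) as a stand-in for the article's CSV file, and fixed random_state values that the article does not specify, so the printed numbers will not match the article's results.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in dataset (assumption): not the same file the article used
X, y = load_breast_cancer(return_X_y=True)

results = {}
for test_size in (0.2, 0.3, 0.4):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=5, stratify=y)
    candidates = (
        ("SVM", SVC(gamma="auto")),
        ("RFC", RandomForestClassifier(max_depth=5, n_estimators=40,
                                       random_state=5)),
    )
    for name, model in candidates:
        model.fit(X_train, y_train)
        results[(test_size, name)] = accuracy_score(y_test,
                                                    model.predict(X_test))

for (test_size, name), acc in results.items():
    print(f"test_size={test_size:.1f} {name}: accuracy={acc:.3f}")
```

A single split per ratio is noisy; repeating each split with several seeds, or using cross-validation, gives a fairer comparison.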

2. As I mentioned in the correlation matrix step, we get a high correlation between two attributes, i.e. Uniformity of cell size and shape (0.91). So we try with only one of them: we remove "uniformity of cell size" and look at the result.


There is no big change in the results, which means we can keep just one of two highly correlated attributes and still get comparable output. For RF, this shows better results.
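A similar check can be sketched on scikit-learn's built-in breast cancer data, used here as a stand-in because the article's CSV is not available. The column pair ("mean radius" and "mean perimeter") is an assumption chosen because those two features are strongly correlated in that dataset; it is not the pair from the article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in dataset (assumption): the Wisconsin Diagnostic set from scikit-learn
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# Two features that carry nearly the same information
corr = X["mean radius"].corr(X["mean perimeter"])
print("correlation:", round(corr, 3))

# Drop one feature of the correlated pair and compare cross-validated accuracy
X_reduced = X.drop(columns=["mean perimeter"])
rf = RandomForestClassifier(max_depth=5, n_estimators=40, random_state=5)
score_full = cross_val_score(rf, X, y, cv=5).mean()
score_reduced = cross_val_score(rf, X_reduced, y, cv=5).mean()

print("all features:        ", round(score_full, 3))
print("one feature removed: ", round(score_reduced, 3))
```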
Additional Features:
I tried some additional features of Python libraries, and I found them interesting.
1. Voting Classifier -: The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well-performing models in order to balance out their individual weaknesses.
Firstly, let's compare different machine learning algorithms on our dataset.

Now, look at the result of the Voting Classifier.

As we can see we are getting the highest accuracy with the Voting Classifier.
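Since the article's code for this step is not shown, here is a minimal sketch of how a VotingClassifier could be set up. The choice of base classifiers, the stand-in dataset (scikit-learn's load_breast_cancer) and the 5-fold evaluation are all assumptions; the scores it prints are not the article's results.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-in dataset (assumption)
X, y = load_breast_cancer(return_X_y=True)

# An odd number of base classifiers avoids ties in hard (majority) voting
estimators = [
    ("svm", SVC(gamma="scale")),
    ("rfc", RandomForestClassifier(max_depth=5, n_estimators=40,
                                   random_state=5)),
    ("knn", KNeighborsClassifier()),
]
voting = VotingClassifier(estimators=estimators, voting="hard")

# Score each base classifier and the combined voter with the same folds
scores = {}
for name, model in estimators + [("voting", voting)]:
    scores[name] = cross_val_score(model, X, y, cv=5,
                                   scoring="accuracy").mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Switching to voting="soft" averages predicted probabilities instead; that would require SVC(probability=True).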
2. Visualize a decision tree from the Random Forest trees using Python export_graphviz.
The code for visualizing a tree is:

Here we used export_graphviz and this function generates a GraphViz representation of the decision tree, which is then written into out_file.
Syntax: sklearn.tree.export_graphviz(decision_tree, out_file=None, max_depth=None, feature_names=None, class_names=None, label='all', filled=False, leaves_parallel=False, impurity=True, node_ids=False, proportion=False, rotate=False, rounded=False, special_characters=False, precision=3).

Visualizing a single decision tree can help give us an idea of how an entire random forest makes predictions.
In Random Forest, every decision at a node is made by a split on a single feature. Plotting a decision tree shows the split value, the number of data points at every node, etc. Because of the majority voting concept in a random forest, data scientists usually prefer a larger number of trees (even up to 200) to build a random forest, hence it is almost impracticable to inspect all the decision trees. But visualizing any 2–3 trees picked randomly will give a fairly good intuition of what the model has learned.
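The article's visualization code is not reproduced here, so the following is a hedged sketch of one way to export a single tree from a forest. It assumes the wine dataset and small forest settings for brevity, and it keeps the result as a DOT string (out_file=None) instead of writing a file; rendering that string to an image requires the separate Graphviz tool.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

wine = load_wine()

# A small forest keeps the exported tree readable (assumed settings)
forest = RandomForestClassifier(max_depth=3, n_estimators=10, random_state=5)
forest.fit(wine.data, wine.target)

# Pick one tree out of the fitted forest
tree = forest.estimators_[0]

# With out_file=None the DOT description is returned as a string
dot_source = export_graphviz(
    tree,
    out_file=None,
    feature_names=wine.feature_names,
    class_names=list(wine.target_names),
    filled=True,
    rounded=True,
)

print(dot_source[:200])
```

Passing a filename as out_file writes the same DOT text to disk, which can then be converted with the Graphviz command-line tool.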
Conclusion:
In this project, we compared two machine learning algorithms in different scenarios on our classification dataset. In almost all cases, Random Forest gives us better results than SVM in terms of accuracy. We also did parameter tuning of both algorithms and found the best result for each of them. Many additional features available in Python can help us find the best result for our problem, and we looked at some of them to get an idea. Support Vector Machine is considered a good algorithm for both small datasets and huge datasets. Random Forest works very well on huge datasets and does not require much input preparation.