* [Introduction](#0)
* [Load and Check Dataset](#1)
    * [weka_3C file](#4)
* [Selected Algorithms](#2)
* [Conclusion](#3)

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
	

<a id = 0></a>
# Introduction

The data have been organized in two different but related classification tasks.

- column_3C_weka.csv (file with three class labels)

The first task consists of classifying patients into one of three categories: Normal (100 patients), Disk Hernia (60 patients), or Spondylolisthesis (150 patients).

- column_2C_weka.csv (file with two class labels)

For the second task, the categories Disk Hernia and Spondylolisthesis were merged into a single category labelled 'abnormal'. Thus, the second task consists of classifying patients into one of two categories: Normal (100 patients) or Abnormal (210 patients).

Only the column_3C_weka.csv file is considered in this notebook.
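
The relationship between the two label sets can be sketched in a few lines. This is a minimal illustration on a hypothetical toy series, not the dataset itself; it assumes the three-class labels are spelled "Hernia", "Spondylolisthesis", and "Normal", as in the encoding step later in this notebook, and the name `to_two_class` is introduced here only for the example.

```python
import pandas as pd

# Hypothetical toy labels; the real values live in the "class" column of the 3C file.
labels_3c = pd.Series(["Normal", "Hernia", "Spondylolisthesis", "Normal", "Hernia"])

# Collapse the two pathological categories into one and keep "Normal" unchanged.
to_two_class = {
    "Hernia": "Abnormal",
    "Spondylolisthesis": "Abnormal",
    "Normal": "Normal",
}
labels_2c = labels_3c.map(to_two_class)

print(labels_2c.value_counts())
```

Applied to the full three-class file, the same mapping should reproduce the 100 Normal / 210 Abnormal split described above.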


<a id=1> </a>
# Load and Check Dataset

In [2]:
weka_2C = pd.read_csv("../input/biomechanical-features-of-orthopedic-patients/column_2C_weka.csv")
weka_3C = pd.read_csv("../input/biomechanical-features-of-orthopedic-patients/column_3C_weka.csv")

FileNotFoundError: [Errno 2] No such file or directory: '../input/biomechanical-features-of-orthopedic-patients/column_2C_weka.csv'

<a id=4> </a>
## weka_3C file

In [None]:
weka_3C.info()

In [None]:
weka_3C.head(5)

In [None]:
# checking any null cells
weka_3C.isnull().sum()

In [None]:
weka_3C["class"].value_counts()

In [None]:
weka_3C.describe()

In [None]:
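# "class" is still a string column here, so restrict the correlation to numeric features
weka_3C.corr(numeric_only=True)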

In [None]:
# some relations between features
sns.scatterplot(data = weka_3C, x = "pelvic_incidence", y = "sacral_slope", hue = "class", palette = "bright")
plt.xlabel("Pelvic Incidence")
plt.ylabel("Sacral Slope")
plt.legend()
plt.show()

In [None]:
# some relations between features
sns.scatterplot(data = weka_3C, x = "pelvic_incidence", y = "lumbar_lordosis_angle", hue = "class", palette = "bright")
plt.xlabel("Pelvic Incidence")
plt.ylabel("Lumbar Lordosis Angle")
plt.legend()
plt.show()

In [None]:
# Let's convert these categorical values into numeric
# assign the result back instead of calling replace(..., inplace=True) on a column selection
weka_3C["class"] = weka_3C["class"].map({"Hernia": 0, "Spondylolisthesis": 1, "Normal": 2})

In [None]:
# split dataset into x and y
y = weka_3C["class"].values # classes
x_data = weka_3C.drop(["class"], axis=1) # features

In [None]:
# Relations between features
sns.pairplot(x_data)
plt.show()

In [None]:
# min-max normalization: scale each feature column to the range [0, 1]
x = (x_data - x_data.min()) / (x_data.max() - x_data.min())
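
The normalization step rescales every feature column with x' = (x - min) / (max - min). As a small sketch on a hypothetical toy table (the column names `a` and `b` are made up for illustration), the pandas expression can be checked against scikit-learn's `MinMaxScaler`, which applies the same per-column formula:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy feature table standing in for x_data.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 30.0, 20.0]})

# Per-column min-max scaling written out with pandas.
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation using scikit-learn.
scaled = MinMaxScaler().fit_transform(toy)

print(manual)
print(np.allclose(manual.values, scaled))
```

One design note: fitting the scaler on the training split only and then applying it to the test split keeps test-set statistics out of the preprocessing. This notebook scales the whole table before splitting, which is simpler but slightly optimistic.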

In [None]:
# train-test split: 15% for test, 85% for train
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15, random_state=42)

<a id=2></a>
# Selected Algorithms

K-Nearest Neighbor, Support Vector Machines, Random Forests, and Decision Tree Algorithms are used for this project.


In [None]:
# KNN algorithm

# looking for best k value
score_list = []
for each in range(1,50):
    knn2 = KNeighborsClassifier(n_neighbors = each)
    knn2.fit(x_train,y_train)
    score_list.append(knn2.score(x_test,y_test))

plt.plot(range(1,50),score_list)
plt.xlabel("k values")
plt.ylabel("Accuracy")
plt.title("Finding the Best k Value")
plt.show()

According to this plot, the best k value appears to be 31 or 32. Let's try both of them.
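
Choosing k from test-set accuracy, as done above, can make the reported score look better than it would on new data. An alternative, sketched here under the assumption that synthetic data from `make_classification` is an acceptable stand-in for the real features, is to pick k by cross-validation on the training set and touch the test set only once. The names `k_values`, `mean_scores`, `best_k`, and `final_knn` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption): 310 samples, 6 features, 3 classes.
X, y = make_classification(n_samples=310, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# Score each candidate k by 5-fold cross-validation on the training set only.
k_values = list(range(1, 50))
mean_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5).mean()
    for k in k_values
]

# Pick the k with the highest mean validation accuracy, then evaluate once on the test set.
best_k = k_values[int(np.argmax(mean_scores))]
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("best k:", best_k, "test accuracy:", final_knn.score(X_test, y_test))
```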


In [None]:
neighbor = 31
knn = KNeighborsClassifier(n_neighbors = neighbor)
knn.fit(x_train,y_train)
prediction = knn.predict(x_test)
print("{} KNN Model Test Accuracy: {}".format(neighbor, knn.score(x_test, y_test)))

In [None]:
neighbor = 32
knn = KNeighborsClassifier(n_neighbors = neighbor)
knn.fit(x_train,y_train)
prediction = knn.predict(x_test)
print("{} KNN Model Test Accuracy: {}".format(neighbor, knn.score(x_test, y_test)))

Setting k to 31 gives a better result than setting k to 32.


In [None]:
# Support Vector Machine Algorithm

# train
svm = SVC(random_state=1)
svm.fit(x_train,y_train)

# test
print("SVM Model Accuracy: {}".format(svm.score(x_test,y_test)))

Interestingly, the SVM accuracy is exactly equal to that of the KNN model with k=32.

In [None]:
# Decision Tree Algorithm

# train
dt = DecisionTreeClassifier(random_state=1) # fixed seed so the result is reproducible
dt.fit(x_train,y_train)

# test
print("DT Model Accuracy: {}".format(dt.score(x_test,y_test)))

In [None]:
# Random Forests Algorithm

# train
rf = RandomForestClassifier(n_estimators = 100, random_state = 1) # for 100 trees
rf.fit(x_train,y_train)
# test
print("RF Model Accuracy: {}".format(rf.score(x_test,y_test)))


<a id=3></a>
# Conclusion
Comparing these four algorithms on the test set, the Random Forest algorithm gives the best result.
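
A side-by-side comparison like the one summarized here can be sketched as a single loop. This is an illustration on synthetic data from `make_classification` (an assumption, used so the block runs without the Kaggle files), with the same model settings as the cells above; the dictionary names `models` and `results` are introduced only for this example, and the accuracies it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption): 310 samples, 6 features, 3 classes.
X, y = make_classification(n_samples=310, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# The four classifiers used in this notebook, with fixed seeds where available.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=31),
    "SVM": SVC(random_state=1),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

# Fit each model on the training split and record its test accuracy.
results = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}

# Print the models from highest to lowest test accuracy.
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

With a test split this small (15% of 310 rows), differences of a few percentage points between models can come down to a handful of patients, so repeating the comparison with cross-validation would give a steadier ranking.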
