<h1>Data Spaces' Tesina</h1>

<p>Imports section</p>

In [2]:
import numpy as np
from sklearn import neighbors, model_selection, metrics, preprocessing
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

<p>Read .csv file</p>

In [2]:
filename = "dataR2.csv"

# data_complete = np.loadtxt(filename, delimiter=",", dtype=None, encoding=None, usecols=(0,1,2,3,4,5,6,7,8,9), skiprows=1)

# According to the relevant papers listed on the website this dataset comes from, using only
# four of the features (Age, BMI, Glucose and Resistin) instead of all of them can give a greater
# accuracy, at least with the SVM or Random Forest algorithms; with KNN or Logistic Regression
# the results are not as good.

# Passing the file name lets np.loadtxt open and close the file itself.
data = np.loadtxt(filename, delimiter=",", dtype=None, encoding=None, usecols=(0,1,2,7,9), skiprows=1)

# Records classified as "1" are healthy controls; "2" are patients.
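
<p>The column indices passed to <code>usecols</code> can also be derived from the header row instead of being hard-coded. The sketch below is only an illustration: the header names and the two data rows are assumptions made up for the example, not values read from dataR2.csv.</p>

```python
import io
import numpy as np

# Hypothetical stand-in for dataR2.csv: the header names are assumed and
# the two rows are made-up values, used only to keep the example runnable.
csv_text = (
    "Age,BMI,Glucose,Insulin,HOMA,Leptin,Adiponectin,Resistin,MCP.1,Classification\n"
    "50,22.5,90,3.0,0.7,9.0,9.5,8.0,400.0,1\n"
    "60,30.0,110,6.0,1.6,30.0,7.0,15.0,550.0,2\n"
)

# Features to keep, followed by the target column.
wanted = ["Age", "BMI", "Glucose", "Resistin", "Classification"]

# Look up each wanted name in the header to get its column index.
header = csv_text.splitlines()[0].split(",")
cols = tuple(header.index(name) for name in wanted)

data_demo = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, usecols=cols)
print(cols)
print(data_demo.shape)
```

<p>With this assumed header, the lookup yields the same indices as the hard-coded tuple used above.</p>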

<p>Preprocessing</p>

In [3]:
# Split the data into features (X) and target (Y); the target is the last column
X = data[:, :-1]
Y = data[:, -1]

# The seed is drawn at random, so the split (and every score below) changes from run to run.
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, random_state=np.random.randint(0,100), test_size=0.27)

# candidate values of K for the K-NN classifier
ks = [3,5,7,9]


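# Standardize the features (zero mean, unit variance): distance-based models such as
# K-NN and RBF-kernel SVM are sensitive to feature scale. This step is currently disabled.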
# scaler = preprocessing.StandardScaler()
# scaler.fit(X_train)
# X_train = scaler.transform(X_train)
# X_test = scaler.transform(X_test)
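
<p>If the scaling step were enabled, one way to keep it fitted on the training data only is to wrap it in a pipeline together with the classifier. This is a minimal sketch on synthetic stand-in data (not the dataset used here); the names <code>X_demo</code>, <code>y_demo</code> and <code>scaled_knn</code> are illustrative.</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the dataset used in this notebook).
X_demo, y_demo = make_classification(
    n_samples=116, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
# Put the four features on very different scales.
X_demo = X_demo * [1.0, 10.0, 100.0, 1000.0]

# Fixed seed and stratified split so the example is reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.27, random_state=0, stratify=y_demo
)

# The scaler is fitted on the training fold only, then reused on the test fold.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)
demo_score = scaled_knn.score(X_te, y_te)
print("Accuracy with scaling inside the pipeline: %.3f" % demo_score)
```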

<p><u>K-NN</u></p>

In [4]:
for k in ks:
    # Create a K-NN classifier that uses k neighbors
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)

    # Train the classifier on the training set
    clf.fit(X_train, Y_train)

    print(f"Accuracy score on the test set with K = {k} is {clf.score(X_test, Y_test):.3f}")

Accuracy score on the test set with K = 3 is 0.688
Accuracy score on the test set with K = 5 is 0.750
Accuracy score on the test set with K = 7 is 0.719
Accuracy score on the test set with K = 9 is 0.719
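
<p>The loop above compares the values of K directly on the test set. A common alternative, sketched here on synthetic stand-in data rather than the dataset used in this notebook, is to pick K by cross-validation and keep the test set for the final estimate only. The names <code>candidate_ks</code>, <code>cv_means</code> and <code>best_k</code> are illustrative.</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (NOT the dataset used in this notebook).
X_demo, y_demo = make_classification(
    n_samples=116, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

candidate_ks = [3, 5, 7, 9]
cv_means = {}
for k in candidate_ks:
    # Mean accuracy over 5 cross-validation folds for this K.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    cv_means[k] = scores.mean()

# Keep the K with the highest mean cross-validated accuracy.
best_k = max(cv_means, key=cv_means.get)
print("Best K by 5-fold cross-validation:", best_k)
```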


<p><u>Logistic Regression</u></p>

In [6]:
logReg = LogisticRegression(solver="lbfgs") # instance of the model

logReg.fit(X_train, Y_train)

# res = logReg.predict(X_test)

print("Accuracy: ", logReg.score(X_test, Y_test))
print(classification_report(Y_test, logReg.predict(X_test)))

Accuracy:  0.6875
              precision    recall  f1-score   support

         1.0       0.71      0.71      0.71        17
         2.0       0.67      0.67      0.67        15

    accuracy                           0.69        32
   macro avg       0.69      0.69      0.69        32
weighted avg       0.69      0.69      0.69        32



<p><u>Random Forest</u></p>

In [8]:
rndFor = RandomForestClassifier(n_estimators = 200, criterion="entropy")

rndFor.fit(X_train, Y_train)

y_pred = rndFor.predict(X_test)

print("Accuracy: ", metrics.accuracy_score(Y_test, y_pred))

Accuracy:  0.78125
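
<p>A fitted random forest also exposes <code>feature_importances_</code>, which can hint at how much each selected feature contributes. The sketch below uses synthetic stand-in data, so the numbers it prints say nothing about the real features; the labels in <code>feature_names</code> are attached only for illustration.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the dataset used in this notebook).
X_demo, y_demo = make_classification(
    n_samples=116, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
# Labels borrowed from the notebook for readability; the columns here are synthetic.
feature_names = ["Age", "BMI", "Glucose", "Resistin"]

forest = RandomForestClassifier(n_estimators=200, criterion="entropy", random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances: one value per feature, summing to 1.
importances = forest.feature_importances_
for name, value in zip(feature_names, importances):
    print("%-10s %.3f" % (name, value))
```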


<p><u>SVM</u></p>

In [9]:
svc = SVC(gamma='auto', kernel='rbf')
svc.fit(X_train, Y_train)
print("Accuracy: ", svc.score(X_test, Y_test))

Accuracy:  0.8125


<p>This is the accuracy with the default parameters. With the SVM approach, however, some parameter tuning is necessary to achieve a better result. To do so I relied on GridSearchCV.</p>
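
<p>As a rough illustration of what a grid search returns, the sketch below runs a small search on synthetic stand-in data (not the dataset used here) and reads back the selected parameters. The grid values and the names <code>param_grid</code> and <code>search</code> are assumptions chosen for the example.</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the dataset used in this notebook).
X_demo, y_demo = make_classification(
    n_samples=116, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Small illustrative grid; every combination is scored with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_demo, y_demo)

# best_params_ is the winning combination, best_score_ its mean cross-validated accuracy.
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy: %.3f" % search.best_score_)
```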
<p><u> SVM with GridSearch </u></p>

In [10]:
parameters = [{'C': [0.1, 0.2, 1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [0.1, 0.2, 1, 10, 100, 1000], 'gamma': [0.0001, 0.001, 0.01, 0.1, 0.25, 1], 'kernel': ['rbf']}]
svcGS = GridSearchCV(svc, parameters, n_jobs=-1, cv=5, refit=True)
svcGS.fit(X_train, Y_train)
# print("Accuracy: ", svcGS.score(X_test, Y_test))
print(classification_report(Y_test, svcGS.predict(X_test)),"\n")
cm = metrics.confusion_matrix(Y_test, svcGS.predict(X_test))

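print("Confusion Matrix:\n",cm)

# roc_curve receives hard class predictions here, not continuous scores,
# so the resulting curve has a single operating point.
fpr, tpr, thresholds = metrics.roc_curve(Y_test, svcGS.predict(X_test), pos_label=2)
print("Area Under the ROC curve: ",metrics.auc(fpr, tpr))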

              precision    recall  f1-score   support

         1.0       0.88      0.82      0.85        17
         2.0       0.81      0.87      0.84        15

    accuracy                           0.84        32
   macro avg       0.84      0.85      0.84        32
weighted avg       0.85      0.84      0.84        32
 

Confusion Matrix:
 [[14  3]
 [ 2 13]]
Area Under the ROC curve:  0.8450980392156862
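
<p>The AUC above is computed from hard class predictions, which gives the ROC curve a single operating point; in that case the area equals the mean of the two per-class recalls. The sketch below, on synthetic stand-in data rather than the dataset used here, shows how continuous scores from <code>decision_function</code> could be passed to <code>roc_curve</code> instead. The variable names are illustrative.</p>

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the dataset used in this notebook).
X_demo, y_demo = make_classification(
    n_samples=116, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
y_demo = y_demo + 1  # relabel the classes as 1 and 2, as in the notebook

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.27, random_state=0, stratify=y_demo
)
svm_demo = SVC(kernel="rbf", gamma="auto").fit(X_tr, y_tr)

# AUC from hard labels: a single operating point on the curve.
fpr_h, tpr_h, _ = metrics.roc_curve(y_te, svm_demo.predict(X_te), pos_label=2)
auc_hard = metrics.auc(fpr_h, tpr_h)

# AUC from continuous scores: larger values mean "more likely class 2".
scores = svm_demo.decision_function(X_te)
fpr_s, tpr_s, _ = metrics.roc_curve(y_te, scores, pos_label=2)
auc_scores = metrics.auc(fpr_s, tpr_s)

balanced = metrics.balanced_accuracy_score(y_te, svm_demo.predict(X_te))
print("AUC from hard labels: %.3f (balanced accuracy: %.3f)" % (auc_hard, balanced))
print("AUC from decision_function scores: %.3f" % auc_scores)
```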


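<p>The result above gives a fuller report: per-class precision, recall and F1-score, the confusion matrix and the area under the ROC curve. After the grid search the SVM reaches an accuracy of 0.84 on the 32 test samples, compared with 0.8125 with the default parameters.</p>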
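
<p>The figures in the report can be checked by hand from the confusion matrix printed above (rows are true classes, columns are predicted classes). The short sketch below only redoes that arithmetic; the variable names are illustrative.</p>

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true class, columns = predicted class.
cm_report = np.array([[14, 3],
                      [2, 13]])

correct = cm_report.diagonal()

# Recall: correct predictions divided by the number of true samples in each class.
recall = correct / cm_report.sum(axis=1)       # 14/17 and 13/15
# Precision: correct predictions divided by the number of predictions for each class.
precision = correct / cm_report.sum(axis=0)    # 14/16 and 13/16
# Accuracy: all correct predictions divided by all samples.
accuracy = correct.sum() / cm_report.sum()     # 27/32
# Mean of the two recalls; matches the AUC printed above (0.8451).
mean_recall = recall.mean()

print("recall:", recall)
print("precision:", precision)
print("accuracy:", accuracy)
print("mean recall:", mean_recall)
```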