<h1> Data Spaces' Tesina </h1>

<p>Imports section</p>

In [24]:
import numpy as np
from sklearn import neighbors, model_selection, metrics, preprocessing
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

<p> Read the .csv file </p>

In [25]:
filename = "dataR2.csv"

# To load every column instead, use:
# data = np.loadtxt(filename, delimiter=",", dtype=None, encoding=None, usecols=(0,1,2,3,4,5,6,7,8,9), skiprows=1)

# According to the papers referenced on the website this dataset was taken from, using only
# 4 of the features (Age, BMI, Glucose and Resistin) instead of all of them can give a greater
# accuracy. This holds for SVM, and to some extent for Random Forest, whereas the results
# with KNN or Logistic Regression are not as good.

# Passing the filename lets NumPy open and close the file itself.
# Columns 0, 1, 2 and 7 are the four features; column 9 is the classification label.
data = np.loadtxt(filename, delimiter=",", dtype=None, encoding=None, usecols=(0,1,2,7,9), skiprows=1)

# Records classified as 1 are healthy controls; 2 means patients.

<p>Preprocessing</p>

In [26]:
# Let's divide the data into features and target (respectively X and Y)
X = data[:, :-1]
Y = data[:, -1]  # the last column holds the classification label

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, random_state=np.random.randint(0,100), test_size=0.3)

# array of possible K to apply KNN neighbors
ks = [3,5,7,9]


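# standardize the features (zero mean, unit variance): distance-based algorithms such as KNN
# are sensitive to feature scale. The scaler is fit on the training set only.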
scaler = preprocessing.StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
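
<p> The split-then-scale steps above can also be bundled into a single scikit-learn pipeline. What follows is only a minimal sketch: it runs on synthetic data from <code>make_classification</code> as a stand-in for dataR2.csv, and the names <code>X_demo</code>, <code>knn_pipe</code> and the fixed <code>random_state=0</code> are illustrative assumptions, not part of the original notebook. </p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 4 features (assumption: NOT the real dataR2.csv).
X_demo, y_demo = make_classification(
    n_samples=116, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# A fixed random_state makes the split reproducible between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The pipeline fits the scaler on the training data only, then applies
# the same transformation whenever it predicts or scores.
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipe.fit(X_tr, y_tr)

test_accuracy = knn_pipe.score(X_te, y_te)
print("Accuracy on the test set: %.3f" % test_accuracy)
```

<p> With a pipeline, the scaler cannot accidentally be re-fit on test or validation data, because scaling and classification are always applied together. </p>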

<p> I try applying PCA to the full dataset, since it has 9 features (which is a lot), and then re-apply KNN </p>

In [27]:
# pca = PCA(n_components=4)
# pca.fit(X)
# Xp = pca.transform(X)
# print(Xp)

# X_train, X_test, Y_train, Y_test = model_selection.train_test_split(Xp, Y, random_state=np.random.randint(0,100), test_size=0.3)

# # array of possible K to apply KNN neighbors
# ks = [3,5,7,9]


# normalize data because algorithms work better with normalized data
# scaler = preprocessing.StandardScaler()
# scaler.fit(X_train)
# X_train = scaler.transform(X_train)
# X_test = scaler.transform(X_test)
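
<p> One possible way to run the PCA experiment is sketched below, under stated assumptions: the data is synthetic (9 features, generated with <code>make_classification</code>) rather than the real dataset, and the features are standardized before PCA, with both steps fit on the training split only. This ordering is a swapped-in variant, not what the commented-out cell above does. </p>

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 9 features (assumption: NOT the real dataR2.csv).
X_demo, y_demo = make_classification(
    n_samples=116, n_features=9, n_informative=5, n_redundant=2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Scale first (PCA is sensitive to feature scale), reduce to 4 components,
# then classify. Every step is fit on the training split only.
pca_pipe = make_pipeline(
    StandardScaler(), PCA(n_components=4), KNeighborsClassifier(n_neighbors=5)
)
pca_pipe.fit(X_tr, y_tr)

explained = pca_pipe.named_steps["pca"].explained_variance_ratio_
pca_accuracy = pca_pipe.score(X_te, y_te)
print("Variance explained by the 4 components: %.3f" % explained.sum())
print("Accuracy on the test set: %.3f" % pca_accuracy)
```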

<p>Apply K-NN</p>

In [28]:
for k in ks:
    n_neighbors = k
    
    # Create an instance of neighbors classifier (clf) and fit the data
    clf = neighbors.KNeighborsClassifier(n_neighbors)
    
    # train the classifier on the training set
    clf.fit(X_train, Y_train)
    
    print("Accuracy score on the test set with K =",n_neighbors,"is %.3f" %(clf.score(X_test, Y_test)))


    

Accuracy score on the test set with K = 3 is 0.800
Accuracy score on the test set with K = 5 is 0.771
Accuracy score on the test set with K = 7 is 0.829
Accuracy score on the test set with K = 9 is 0.714


<p>Apply KNN this time with a validation set </p>

In [29]:
# Let's create the validation set
X_train_t, X_valid, Y_train_t, Y_valid = model_selection.train_test_split(X_train, Y_train, random_state=np.random.randint(0,100), test_size=0.30)

# array of possible K to apply KNN neighbors
ks = [3,5,7,9]

# X_train and X_test were already standardized in the preprocessing cell, and X_train_t and
# X_valid are subsets of the scaled X_train, so no further scaling is needed here.

# array to write accuracy values for each K 
acc_arr = []

for k in ks:
    n_neighbors = k
    
    # Create an instance of neighbors classifier (clf) and fit the data
    clf = neighbors.KNeighborsClassifier(n_neighbors)
    
    # train the classifier on the training set
    clf.fit(X_train_t, Y_train_t)
    
    acc_arr.append(clf.score(X_valid, Y_valid))
    
# I choose the best K based on the results on the validation set and apply KNN with that
k_best_index = acc_arr.index(max(acc_arr))
k_best = ks[k_best_index]

clf2 = neighbors.KNeighborsClassifier(k_best)
clf2.fit(X_train_t, Y_train_t)

print("The best K, based on the validation set results, is",k_best,"and the accuracy on the test set is %.3f"%(clf2.score(X_test, Y_test)))

The best K, based on the validation set results, is 3 and the accuracy on the test set is 0.686
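
<p> With a dataset this small, a single validation split can give a noisy choice of K. A possible alternative is k-fold cross-validation, sketched here with <code>GridSearchCV</code> on synthetic stand-in data. The names <code>X_demo</code>, <code>search</code> and <code>best_k</code> are illustrative assumptions, and this is not the procedure used in the cell above. </p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 4 features (assumption: NOT the real dataR2.csv).
X_demo, y_demo = make_classification(
    n_samples=116, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# make_pipeline names each step after its lowercased class name, hence the
# "kneighborsclassifier__" prefix in the parameter grid.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9]}

# 5-fold cross-validation on the training set; the best K is refit on all of it.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = search.score(X_te, y_te)
print("Best K from cross-validation:", best_k)
print("Accuracy on the test set: %.3f" % test_accuracy)
```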


<p> I'll try Logistic Regression on the dataset </p>

In [35]:
logReg = LogisticRegression(solver="lbfgs") # instance of the model

logReg.fit(X_train, Y_train)

# res = logReg.predict(X_test)

print("Accuracy: ", logReg.score(X_test, Y_test))
print(classification_report(Y_test, logReg.predict(X_test)))

Accuracy:  0.8571428571428571
              precision    recall  f1-score   support

         1.0       0.76      0.93      0.84        14
         2.0       0.94      0.81      0.87        21

    accuracy                           0.86        35
   macro avg       0.85      0.87      0.86        35
weighted avg       0.87      0.86      0.86        35



<p><u> Naive Bayes classifier </u></p>

In [31]:
gnb = GaussianNB()
gnb.fit(X_train, Y_train)
print("Accuracy: ", gnb.score(X_test, Y_test))

Accuracy:  0.6


<p><u> Random forest</u> </p>

In [32]:
rndFor = RandomForestClassifier(n_estimators = 200, criterion="entropy")

rndFor.fit(X_train, Y_train)

y_pred = rndFor.predict(X_test)

print("Accuracy: ", metrics.accuracy_score(Y_test, y_pred))

Accuracy:  0.8


<p><u> SVM</u> </p>

In [33]:
svc = SVC(gamma='auto', kernel='rbf')
svc.fit(X_train, Y_train)
print("Accuracy: ", svc.score(X_test, Y_test))

Accuracy:  0.8


<p> This is the accuracy with the default parameters, but with the SVM approach some parameter tuning is necessary to achieve a better result. To do so I relied on GridSearchCV. </p>
<p><u> SVM with GridSearch </u></p>

In [34]:
parameters = [{'C': [0.1, 0.2, 1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [0.1, 0.2, 1, 10, 100, 1000], 'gamma': [0.0001, 0.001, 0.01, 0.1, 0.25, 1], 'kernel': ['rbf']}]
svcGS = GridSearchCV(svc, parameters, n_jobs=-1, cv=5, refit=True)
svcGS.fit(X_train, Y_train)
# print("Accuracy: ", svcGS.score(X_test, Y_test))
print(classification_report(Y_test, svcGS.predict(X_test)))

              precision    recall  f1-score   support

         1.0       0.71      0.86      0.77        14
         2.0       0.89      0.76      0.82        21

    accuracy                           0.80        35
   macro avg       0.80      0.81      0.80        35
weighted avg       0.82      0.80      0.80        35



<p> The result above gives us a fuller report. Even though the overall accuracy is 0.80, what we are most interested in (in this specific case) is the precision for class 2.0, the class that indicates the patients who tested positive for breast cancer.
The precision of 0.89 means that when this final model predicts that a patient has breast cancer, it is correct in almost 90% of the cases. The recall of 0.76 for the same class means that the model identifies about 76% of the patients who actually have breast cancer. </p>
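
<p> To make the precision and recall figures in the SVM report concrete, the sketch below recomputes them from a confusion matrix. The four counts are not printed by the notebook: they are inferred from the supports (14 and 21) and the rounded recall values in the report, so treat them as a reconstruction. </p>

```python
# Counts inferred from the classification report (assumption: reconstructed,
# not printed by the notebook). Class 1 = healthy controls, class 2 = patients.
true1_pred1 = 12  # controls correctly predicted as controls
true1_pred2 = 2   # controls wrongly predicted as patients
true2_pred1 = 5   # patients wrongly predicted as controls
true2_pred2 = 16  # patients correctly predicted as patients

total = true1_pred1 + true1_pred2 + true2_pred1 + true2_pred2

# Precision for class 2: of all "patient" predictions, how many were right.
precision_2 = true2_pred2 / (true2_pred2 + true1_pred2)

# Recall for class 2: of all actual patients, how many were found.
recall_2 = true2_pred2 / (true2_pred2 + true2_pred1)

# F1 is the harmonic mean of precision and recall; it is not the accuracy.
f1_2 = 2 * precision_2 * recall_2 / (precision_2 + recall_2)

accuracy = (true1_pred1 + true2_pred2) / total

print("precision (class 2): %.2f" % precision_2)
print("recall    (class 2): %.2f" % recall_2)
print("f1-score  (class 2): %.2f" % f1_2)
print("accuracy           : %.2f" % accuracy)
```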