# Scikit Learn Example

First, we import the libraries we need. Several classifiers are imported below, but we will not use all of them at once. Later steps swap one classifier for another, which is why they are all imported now.

In [None]:
import pandas as pd
import numpy as np
from sklearn import tree, metrics, svm
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import StratifiedShuffleSplit

Then, we can import the original dataset. We will also visualize the first samples of our data:

In [None]:
original = pd.read_csv('disease.csv', header=0)
original.head()

Before training, the first step is to separate the samples into two collections: one with all the features and the other with the class of each sample:

In [None]:
snps = original.drop(columns='diagnosis')
diag = original.loc[:,['diagnosis']]

Now we have our features in a DataFrame:

In [None]:
snps.head()

And the classes of each sample in another DataFrame:

> Notice that both DataFrames share the same index, so we have a common key to use with them.

In [None]:
diag.head()

As you may have noticed, our data isn't numeric, which means it isn't suitable for most of the techniques discussed here. Mapping each category to an integer code gets around this:

In [None]:
for feature in snps.columns:
    # Replace each categorical column with its integer category codes
    snps[feature] = snps[feature].astype("category").cat.codes
snps.head()
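
To see what this encoding does, here is a minimal sketch on a made-up column. The genotype values are hypothetical and are not taken from `disease.csv`. pandas sorts the categories and numbers them in that order:

```python
import pandas as pd

# Hypothetical genotype column, for illustration only.
toy = pd.Series(["AG", "AA", "GG", "AG", "AA"])

as_category = toy.astype("category")
codes = as_category.cat.codes

# Categories are sorted, so each one gets a code by its sorted position.
mapping = dict(enumerate(as_category.cat.categories))
print(mapping)         # {0: 'AA', 1: 'AG', 2: 'GG'}
print(codes.tolist())  # [1, 0, 2, 1, 0]
```

Keeping a `mapping` like this around is useful if you later need to translate the codes back to the original values.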


> Notice: We just mapped the original categories to a numeric representation. This is fine for some techniques, but not for all: some algorithms treat the codes as ordered quantities, which can change their behavior. If you want to use one of those algorithms, you can try [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), which encodes each category as a separate binary column.
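
As a rough sketch of that alternative, the snippet below one-hot encodes a small made-up table. The column names and values are hypothetical. In the workshop you would apply the encoder to the original, not yet coded, feature columns:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical SNP columns, for illustration only.
toy = pd.DataFrame({
    "snp_1": ["AA", "AG", "GG", "AG"],
    "snp_2": ["CC", "CT", "CC", "TT"],
})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy).toarray()

print(encoder.categories_)  # one array of categories per input column
print(encoded.shape)        # (4, 6): 3 categories for each of the 2 columns
```

Each row now has exactly one `1` per original column, so no artificial ordering is implied between categories.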

In this example, we will use cross-validation to check the performance of our classifiers. It is one of the most widely used validation approaches in Machine Learning, but it has some limitations, especially with small samples. As validation strategy isn't the focus of this workshop, we will use a built-in splitter from Scikit Learn. Note that with `n_splits=1` it produces a single stratified 80/20 train/test split:

In [None]:
cv = StratifiedShuffleSplit(n_splits=1, test_size=0.20, random_state=2**10)
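
The "stratified" part means that each split keeps roughly the same class proportions as the full dataset. The sketch below checks this on made-up labels. The class names and counts are assumptions for illustration, not values from `disease.csv`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical labels: 70 "healthy" and 30 "sick" samples.
labels = pd.Series(["healthy"] * 70 + ["sick"] * 30)
features = pd.DataFrame({"x": np.arange(100)})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.20, random_state=2**10)
train_idx, test_idx = next(splitter.split(features, labels))

# Both subsets keep the 70/30 ratio of the full dataset.
print(labels.iloc[train_idx].value_counts(normalize=True))
print(labels.iloc[test_idx].value_counts(normalize=True))
```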

And now we have to create independent datasets for training and testing our model:

In [None]:
for train, test in cv.split(snps, diag):
    # Setting training data (split() yields positional indices, hence .iloc)
    df_train = snps.iloc[train]
    diag_train = diag.iloc[train]
    # Setting test data
    df_test = snps.iloc[test]
    diag_test = diag.iloc[test]

At this point we have two sets of data: one to train our classifier and another to test it. Let's take a look at them:

In [None]:
df_train.head()

In [None]:
df_train.shape

In [None]:
diag_train.head()

In [None]:
diag_train.shape

At this point, we can create and train our classifier. We will use a Multilayer Perceptron Neural Network now, but some steps ahead you will see how easy it will be to change the classifier:

In [None]:
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(8, 5))
# Index the column instead of pop() so diag_train stays intact if the cell is re-run
clf.fit(df_train, diag_train['diagnosis'].tolist())

And then we can predict our test dataset:

In [None]:
predicted = clf.predict(df_test)
#   Formatting before comparing
diag_test_ = diag_test['diagnosis'].tolist()

In [None]:
print("Accuracy: ", metrics.accuracy_score(diag_test_, predicted))
print("Confusion Matrix: \n", metrics.confusion_matrix(diag_test_, predicted))
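
If the confusion matrix is hard to read, the small sketch below may help. It uses made-up labels (the class names are hypothetical) and passes `labels=` explicitly so the row and column order is known:

```python
from sklearn import metrics

# Hypothetical true and predicted labels, for illustration only.
y_true = ["sick", "sick", "healthy", "healthy", "healthy", "sick"]
y_pred = ["sick", "healthy", "healthy", "healthy", "sick", "sick"]

labels = ["healthy", "sick"]
matrix = metrics.confusion_matrix(y_true, y_pred, labels=labels)

# Rows are true classes, columns are predicted classes, in the order of `labels`.
print(matrix)  # [[2 1]
               #  [1 2]]

# Per-class precision, recall and F1 in one table.
print(metrics.classification_report(y_true, y_pred, labels=labels))
```

The diagonal holds the correct predictions, so accuracy here is (2 + 2) / 6.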

Now you can go back and try different classifiers and parameters on your data. Run the next code snippet, uncommenting one classifier at a time and changing its parameters. [Here](https://scikit-learn.org/stable/modules/classes.html) you can find the documentation to help you with this task.

In [None]:
for train, test in cv.split(snps, diag):
    # split() yields positional indices, hence .iloc
    df_train = snps.iloc[train]
    diag_train = diag.iloc[train]
    df_test = snps.iloc[test]
    diag_test = diag.iloc[test]

# clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
# clf = svm.SVC(gamma='scale', decision_function_shape='ovo')
# clf = tree.DecisionTreeClassifier(criterion='entropy')
# clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(8, 5))
# clf = RandomForestClassifier(n_estimators=100)
# clf = AdaBoostClassifier(n_estimators=100)
# clf = NearestCentroid()

clf.fit(df_train, diag_train['diagnosis'].tolist())
predicted = clf.predict(df_test)
diag_test_ = diag_test['diagnosis'].tolist()

print("Accuracy: ", metrics.accuracy_score(diag_test_, predicted))
print("Confusion Matrix: \n", metrics.confusion_matrix(diag_test_, predicted))
print("Classifier: ", clf)
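
Instead of uncommenting one line at a time, you could also loop over several classifiers and compare them in one run. This is only a sketch: it uses synthetic data from `make_classification` in place of our dataset, and the selection of classifiers and their parameters is an arbitrary assumption:

```python
from sklearn import metrics, svm, tree
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

# Synthetic data standing in for df_train / df_test and their labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Hypothetical selection of classifiers; add or remove entries as needed.
classifiers = {
    "SVC": svm.SVC(gamma="scale"),
    "DecisionTree": tree.DecisionTreeClassifier(criterion="entropy", random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "NearestCentroid": NearestCentroid(),
}

# Train each classifier on the same split and record its test accuracy.
results = {}
for name, model in classifiers.items():
    model.fit(X_train, y_train)
    results[name] = metrics.accuracy_score(y_test, model.predict(X_test))

# Print the classifiers from best to worst accuracy.
for name, accuracy in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {accuracy:.3f}")
```

Because every classifier sees the same train/test split, the accuracies are directly comparable. With a single split, though, small differences should not be over-interpreted.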