# Machine Learning for targeted metabolomics

## Prepare images for classification

In [1]:
#https://www.codespeedy.com/prepare-your-own-data-set-for-image-classification-python/
## Prepare your own data set for image classification in Machine learning Python

Many open-source data sets for machine learning are available on the Internet, but for your own project you may need your own data set. This section discusses how to prepare a data set for image classification.

### Collect image data

The first task is to collect data (images). You can take the images with a camera or download them from Google Images (copyrighted images require permission). Many browser plugins can download images in bulk from Google Images. Suppose you want to distinguish cars from bikes: download the car images into one folder and the bike images into another.

### Process the data

The downloaded images may vary in pixel size, but training the model requires images of the same size. So let’s resize the images with some simple Python code, using the PIL module from the third-party Pillow library.

In [2]:
from PIL import Image
import os

def resize_multiple_images(src_path, dst_path):
    # src_path is the folder where the original images are saved.
    os.makedirs(dst_path, exist_ok=True)  # create the destination folder once
    for filename in os.listdir(src_path):
        try:
            img = Image.open(os.path.join(src_path, filename))
            new_img = img.resize((64, 64))
            new_img.save(os.path.join(dst_path, filename))
            print('Resized and saved {} successfully.'.format(filename))
        except (OSError, ValueError):
            # Silently skip files that cannot be read or saved as images.
            continue



In [8]:
src_path = "C:/Users/Marilyn/Desktop/imagesNF/" #<Enter the source path>
dst_path = "C:/Users/Marilyn/Desktop/image_resizeNF/" #<Enter the destination path>

In [9]:
resize_multiple_images(src_path, dst_path)

Resized and saved contour plot 200929s001_STD-STD073.png successfully.
Resized and saved contour plot 200929s003_STD-STD073.png successfully.
Resized and saved contour plot 200929s007_STD-STD073.png successfully.
Resized and saved contour plot 200929s017_STD-STD073.png successfully.
Resized and saved contour plot 200929s037_STD-STD073.png successfully.
Resized and saved contour plot 200929s047_STD-STD073.png successfully.
Resized and saved contour plot 200929s049_STD-STD073.png successfully.
Resized and saved contour plot 200929s050_STD-STD073.png successfully.
Resized and saved contour plot 200929s051_STD-STD073.png successfully.
Resized and saved contour plot 200929s052_STD-STD073.png successfully.


In [149]:
#img=Image.open("C:/Users/Marilyn/Desktop/imagesF/contour plot 200929s001_STD-STD073.pdf")

The images should be small so that the number of input features does not become too large when they are fed into a neural network. For example, a color image of 600x800 pixels gives the network 600*800*3 = 1,440,000 input values, which is quite large. A color image of 64x64 pixels, on the other hand, has only 64*64*3 = 12,288 input values, which is far cheaper to compute with. Now that the images are resized, we need to rename the files so that the data set is properly labelled.
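
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper name `n_input_features` is hypothetical and not part of the original notebook.

```python
def n_input_features(height, width, channels=3):
    """Number of values in a flattened image of the given shape."""
    return height * width * channels

large = n_input_features(600, 800)  # 600 x 800 RGB image
small = n_input_features(64, 64)    # 64 x 64 RGB image
print(large, small)                 # 1440000 12288
print(large / small)                # 117.1875
```

Resizing to 64x64 therefore cuts the input size by a factor of about 117.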

In [10]:
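import os

def rename_multiple_files(path, obj):
    # Rename every file in path to <obj><index><extension>, e.g. FOUND0.png.
    i = 0
    for filename in os.listdir(path):
        try:
            f, extension = os.path.splitext(filename)
            src = os.path.join(path, filename)
            dst = os.path.join(path, obj + str(i) + extension)
            os.rename(src, dst)
            print('Rename successful.')
        except OSError:
            pass  # skip files that cannot be renamed
        i += 1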



In [12]:
path= "C:/Users/Marilyn/Desktop/image_resizeF/" #<Enter the path of objects to be renamed>
obj= "FOUND"#<Enter the prefix to be added to each file. For ex. car, bike, cat, dog, etc.>
rename_multiple_files(path,obj)

Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.
Rename successful.


Now that the data is processed, merge the contents of the ‘car’ and ‘bikes’ folders and name the result ‘train set’. Move some images of cars and some of bikes from the ‘train set’ folder into a new folder, ‘test set’. The next step is to import the images into Python so that each color image is represented as numbers and image classification algorithms can be applied.
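
Moving some images from the train folder to a test folder by hand can also be scripted. The following is a minimal sketch using only the standard library, assuming all labelled images already sit in one folder. The function name `move_random_test_files`, the default `test_fraction`, and the fixed seed are illustrative assumptions, not part of the original workflow.

```python
import os
import random
import shutil

def move_random_test_files(train_dir, test_dir, test_fraction=0.25, seed=0):
    """Move a random fraction of the files in train_dir into test_dir."""
    os.makedirs(test_dir, exist_ok=True)
    # Sort first so that the shuffle is reproducible for a given seed.
    files = [f for f in sorted(os.listdir(train_dir))
             if os.path.isfile(os.path.join(train_dir, f))]
    random.Random(seed).shuffle(files)
    n_test = int(len(files) * test_fraction)
    moved = files[:n_test]
    for filename in moved:
        shutil.move(os.path.join(train_dir, filename),
                    os.path.join(test_dir, filename))
    return moved
```

A purely random draw does not guarantee that both classes end up in the test folder, so with a small data set it is worth checking the moved file names.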

### Import images as arrays

## Create Train and test data

In [58]:
from PIL import Image
import os
import numpy as np
import re

def get_data(path):
    all_images_as_array = []
    label = []
    for filename in os.listdir(path):
        try:
            img = Image.open(os.path.join(path, filename))
            np_array = np.asarray(img)
            l, b, c = np_array.shape  # fails for 2-D (grayscale) images
            np_array = np_array.reshape(l * b * c,)
        except (OSError, ValueError):
            print(filename)  # print files that fail, e.g. 2-D images
            continue
        # Append the label only after the image loaded, so X and y stay aligned.
        if re.match(r'FOUND', filename):  # <Edit obj here>
            label.append(1)
        else:
            label.append(0)
        all_images_as_array.append(np_array)

    return np.array(all_images_as_array), np.array(label)



In [59]:
path_to_train_set = "C:/Users/Marilyn/Desktop/train/" #<Enter the location of train set>
path_to_test_set = "C:/Users/Marilyn/Desktop/test/" #<Enter the location of test set>
X_train,y_train = get_data(path_to_train_set)
X_test, y_test = get_data(path_to_test_set)

The image classification data set is now ready to be fed to a classification model.

In [60]:
print('X_train set : ',X_train) # training data
print('y_train set : ',y_train) # label vector: 1=FOUND, 0=NF
print('X_test set : ',X_test)
print('y_test set : ',y_test)

X_train set :  [[255 255 255 ... 255 255 255]
 [255 255 255 ... 255 255 255]
 [255 255 255 ... 255 255 255]
 ...
 [255 255 255 ... 255 255 255]
 [255 255 255 ... 255 255 255]
 [255 255 255 ... 255 255 255]]
y_train set :  [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
X_test set :  [[255 255 255 ... 255 255 255]
 [255 255 255 ... 255 255 255]
 [255 255 255 ... 255 255 255]
 ...
 [255 255 255 ... 255 255 255]
 [255 255 255 ... 255 255 255]
 [255 255 255 ... 255 255 255]]
y_test set :  [1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0]


In [61]:
#img=Image.open("C:/Users/Marilyn/Desktop/test/FOUND0.png")
#np_array = np.asarray(img)
#print(np_array)
#np_array.shape

In [62]:
print(X_train.shape)
print(y_train.shape)

(59, 12288)
(59,)


In [63]:
print(X_test.shape)
print(y_test.shape)

(20, 12288)
(20,)


## SVM

In [64]:
## model SVM
#https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html#sphx-glr-download-auto-examples-classification-plot-digits-classification-py
# Try the rbf kernel (polar coordinates instead of the ordinary kernel that tries to draw a line), since I am looking for a sphere
#Radial Basis Function (RBF) kernel SVM.

In [65]:
from sklearn import datasets, svm, metrics
# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001, kernel='rbf')

In [66]:
# Fit the classifier on the training set
classifier.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [67]:
# Now predict the labels of the test set:
predicted = classifier.predict(X_test)
print(predicted)
# Not OK: everything is predicted as FOUND

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


In [68]:
#real answer test label
print('y_test set : ',y_test)

y_test set :  [1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0]


In [50]:
#print(metrics.classification_report(y_test, predicted))

In [69]:
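# accuracy_score is only imported in a later cell, so use the metrics module imported above
metrics.accuracy_score(y_test, predicted)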

0.55
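
A classifier that predicts FOUND for every image reaches 11/20 = 0.55 here only because 11 of the 20 test images are FOUND. One plausible cause is the scale of the inputs: with raw pixel values from 0 to 255 and `gamma=0.001`, the RBF kernel value `exp(-gamma * ||x - x'||^2)` between two different images is practically zero, so the decision function falls back to a constant. The sketch below rescales the pixels to [0, 1] and lets scikit-learn pick `gamma='scale'`. It runs on synthetic stand-in data (the arrays built by the hypothetical helper `make_fake_images` are invented for illustration, not the contour plots of this notebook), so it shows the technique and says nothing about the result on the real data.

```python
import numpy as np
from sklearn import svm

def make_fake_images(n_per_class, n_pixels, rng):
    """Synthetic stand-in data: class 1 has a dark first half, class 0 is bright."""
    images = 255 - rng.integers(0, 11, size=(2 * n_per_class, n_pixels))
    labels = np.array([1] * n_per_class + [0] * n_per_class)
    images[:n_per_class, : n_pixels // 2] = rng.integers(
        0, 11, size=(n_per_class, n_pixels // 2))
    return images.astype(np.uint8), labels

rng = np.random.default_rng(0)
X_tr, y_tr = make_fake_images(20, 48, rng)
X_te, y_te = make_fake_images(10, 48, rng)

# Rescale the 0..255 pixel values to 0..1 before fitting the RBF SVM.
X_tr_scaled = X_tr.astype(np.float64) / 255.0
X_te_scaled = X_te.astype(np.float64) / 255.0

scaled_clf = svm.SVC(kernel='rbf', gamma='scale')
scaled_clf.fit(X_tr_scaled, y_tr)
pred = scaled_clf.predict(X_te_scaled)
print(pred)
print('accuracy:', (pred == y_te).mean())
```

On the real images the same two scaling lines would be applied to `X_train` and `X_test` before calling `fit` and `predict`.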

## NAIVE BAYES

In [70]:
#https://www.codespeedy.com/naive-bayes-algorithm-in-python/
# model Naive Bayes

In [71]:
from sklearn.naive_bayes import GaussianNB
nv = GaussianNB() # create a classifier
nv.fit(X_train,y_train) # fitting the data

GaussianNB(priors=None, var_smoothing=1e-09)

In [72]:
from sklearn.metrics import accuracy_score
y_pred = nv.predict(X_test) # store the prediction data
accuracy_score(y_test,y_pred) # calculate the accuracy

0.85

In [73]:
print(y_pred)
# OK: the only errors are 3 NF images predicted as FOUND

[1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 0 0 0]


In [74]:
#real answer test label
print('y_test set : ',y_test)

y_test set :  [1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0]


In [75]:
## Read with pandas (read_csv with a tab separator) in a loop, flatten each matrix into 1 array, collect all arrays in a list, then np.array(list)
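
The note above could be sketched as follows: read each tab-separated matrix with pandas, flatten it, collect the rows in a list, and stack them into one array. The function name `load_matrices`, the `.txt` extension filter, and the assumption that every file holds a header-less numeric matrix of the same shape are illustrative guesses, since the file layout is not described in this notebook.

```python
import os
import numpy as np
import pandas as pd

def load_matrices(folder, extension='.txt'):
    """Read every tab-separated matrix in folder and return one row per file."""
    rows = []
    for filename in sorted(os.listdir(folder)):
        if not filename.endswith(extension):
            continue
        table = pd.read_csv(os.path.join(folder, filename), sep='\t', header=None)
        rows.append(table.to_numpy().flatten())  # 2-D matrix -> 1-D array
    return np.array(rows)
```

The returned array has one row per file, so it can be used in place of the `X_train` and `X_test` arrays built from images.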

## NN

In [76]:
# MLP classifier
# hidden_layer_sizes: (100,) = 1 layer of 100 hidden units, (100, 100) = 2 layers of 100

In [81]:
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

clf = MLPClassifier(hidden_layer_sizes=(100, ), activation='relu').fit(X_train, y_train)

In [85]:
clf.predict_proba(X_test)

array([[0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395],
       [0.51567605, 0.48432395]])

In [86]:
clf.predict(X_test)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [87]:
clf.score(X_test, y_test)

0.45

In [None]:
# From SoloLearn, possibly to add:

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [None]:
print("accuracy:", accuracy_score(y_test, predicted))
print("precision:", precision_score(y_test, predicted))
print("recall:", recall_score(y_test, predicted))
print("f1 score:", f1_score(y_test, predicted))

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
# scikit-learn puts the negative class first in the confusion matrix.
# Output (actual classes as rows, predicted classes as columns):
#  [[TN FP]
#   [FN TP]]
# The order follows the sorted labels: 0 = negative, 1 = positive.
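
As a worked example, the sketch below applies `confusion_matrix` to the Naive Bayes predictions printed earlier in this notebook. The two label vectors are copied from those outputs instead of being recomputed, and the variable names `y_true_nb` and `y_pred_nb` are introduced here for illustration. Specificity is derived from the matrix as TN / (TN + FP).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Labels copied from the printed y_test and Naive Bayes y_pred outputs.
y_true_nb = np.array([1] * 11 + [0] * 9)
y_pred_nb = np.array([1] * 11 + [0, 0, 0, 1, 1, 1, 0, 0, 0])

cm = confusion_matrix(y_true_nb, y_pred_nb)
tn, fp, fn, tp = cm.ravel()
print(cm)
print('accuracy:', (tp + tn) / cm.sum())
print('sensitivity (recall):', tp / (tp + fn))
print('specificity:', tn / (tn + fp))
```

The matrix holds 6 true negatives, 3 false positives, 0 false negatives and 11 true positives, which agrees with the accuracy of 0.85 reported for Naive Bayes.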

In [None]:
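from sklearn.model_selection import train_test_split
# Assumption: X and y hold the full data set; here they are rebuilt by pooling the existing splits.
X = np.concatenate([X_train, X_test])
y = np.concatenate([y_train, y_test])
X_train, X_test, y_train, y_test = train_test_split(X, y)
# By default the training set is 75% of the data and the test set is the remaining 25%.
print("whole dataset:", X.shape, y.shape)
print("training set:", X_train.shape, y_train.shape)
print("test set:", X_test.shape, y_test.shape)
# train_size changes the proportions, e.g. 60% training data:
train_test_split(X, y, train_size=0.6)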

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
print("f1 score:", f1_score(y_test, y_pred))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=27)
# random_state acts as a seed: always the same split instead of a random one

In [None]:
from sklearn.metrics import precision_recall_fscore_support

# Sensitivity is another name for recall.
sensitivity_score = recall_score
print(sensitivity_score(y_test, y_pred))
print(precision_recall_fscore_support(y_test, y_pred))

def specificity_score(y_true, y_pred):
    # Specificity is the recall of the negative class (label 0).
    p, r, f, s = precision_recall_fscore_support(y_true, y_pred)
    return r[0]

print(specificity_score(y_test, y_pred))

In [None]:
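model.predict_proba(X_test)        # probabilities for both classes
model.predict_proba(X_test)[:, 1]  # probability of the positive class
y_pred = model.predict_proba(X_test)[:, 1] > 0.75  # positive only above a 0.75 threshold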
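
Raising the decision threshold above 0.5 makes the model more reluctant to predict the positive class. A small sketch with made-up probabilities (the values in `proba_pos` are invented for illustration and do not come from any model in this notebook) shows how the predicted labels change:

```python
import numpy as np

# Hypothetical probabilities of the positive class for five samples.
proba_pos = np.array([0.10, 0.55, 0.70, 0.80, 0.95])

default_pred = proba_pos > 0.5   # comparable to the usual two-class cut-off
strict_pred = proba_pos > 0.75   # stricter threshold: fewer positives
print(default_pred.astype(int))  # [0 1 1 1 1]
print(strict_pred.astype(int))   # [0 0 0 1 1]
```

With the stricter threshold, two borderline samples switch from positive to negative, which typically raises precision and lowers recall.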

In [None]:
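import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

model = LogisticRegression()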
model.fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:,1])

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('1 - specificity')
plt.ylabel('sensitivity')
plt.show()

In [None]:
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_pred_proba[:, 1]))