<h2>File Information</h2>

10k images from the challenge website: https://challenge2018.isic-archive.com/task3/.<br>
All the images are 600x450 and are somewhat centered on the skin lesion.<br>

Folder structure:
<ul>
    <li>../Skin_Image_Recognition
    <ul>
        <li>(this notebook).ipynb</li>
        <li>target.csv</li>
        <li>images
        <ul>
            <li>img1.jpeg</li>
            <li>img2.jpeg</li>
            <li>...</li>
        </ul>
        </li>
    </ul>
    </li>
</ul>

<h2>Define all functions at top, execute code at the bottom</h2>

<h2>Loading Data</h2>

`import_images(path)` loads all the images inside the specified folder and returns them as an array of grayscale 2d matrices.

`import_target(path)` loads a local csv that has classifications listed in the same order as the files in images. It returns an array of 0s and 1s, indicating "not nevus" and "nevus".

`drop_even(X, Y, M, N)` keeps at most the first `M` nevus and the first `N` non-nevus instances, which gives a desired positive:negative classification ratio.

To use `M` nevus and `N` non-nevus data points: `X_all, Y_all = drop_even(*import_data(), M, N)` <br>
To use all data points: `X_all, Y_all = import_data()`
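
As a rough illustration of the balancing step, the sketch below reimplements the same selection logic on a toy label vector. The helper name `take_balanced` and the toy arrays are assumptions made for this example; they are not part of the notebook.

```python
import numpy as np

def take_balanced(X, Y, n_pos, n_neg):
    """Keep the first n_pos positives and the first n_neg negatives, in order."""
    keep = []
    for i, label in enumerate(Y):
        if label == 1 and n_pos > 0:
            n_pos -= 1
            keep.append(i)
        elif label == 0 and n_neg > 0:
            n_neg -= 1
            keep.append(i)
    return X[keep], Y[keep]

# Toy data: six "images" (here just integers) with an unbalanced label vector.
X_toy = np.arange(6)
Y_toy = np.array([1, 1, 1, 0, 1, 0])

X_bal, Y_bal = take_balanced(X_toy, Y_toy, n_pos=2, n_neg=2)
print(X_bal)  # entries 0 and 1 (first two positives), 3 and 5 (first two negatives)
print(Y_bal)
```

Because instances are taken in file order, the kept subset is not a random sample; any randomization has to come from a later shuffle.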

In [1]:
import numpy as np
import os
import cv2
import math

def import_images(path):
    return np.array([cv2.imread(os.path.join(path, img_name), 0) for img_name in os.listdir(path)])

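def import_target(path):
    # Read target.csv, closing the file when done
    with open(os.path.join(path, "target.csv")) as f:
        lines = f.read().split('\n')
    # Skip the header row and any blank lines (e.g. a trailing newline)
    toks = [line.split(',') for line in lines[1:] if line.strip()]
    # Column index 2 is read as the 0/1 label (stored as a float string)
    return np.array([int(float(tok[2])) for tok in toks])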

def import_data():
    X_all = import_images(os.path.join("images"))
    Y_all = import_target("")
    Y_all = Y_all[:len(X_all)]
    return X_all, Y_all

def drop_even(X, Y, M, N):
    # Keep the first M positives (Y == 1) and the first N negatives (Y == 0)
    indices = []
    for i in range(len(X)):
        if (Y[i] == 1 and M > 0):
            M -= 1
            indices.append(i)
        elif (Y[i] == 0 and N > 0):
            N -= 1
            indices.append(i)
    return X[indices], Y[indices]

In [2]:
#import matplotlib.pyplot as plt
#X, Y = import_data()

In [3]:
#n = 4
#plt.imshow(X[n], cmap=("gray"))
#print(Y[n])

<h2>Testing/Training Dataset Generation</h2>

`linearize(X)` accepts X, a list of 2d matrices. The function flattens each 2d image matrix into a 1d vector that can be fed into a classifier.

`shuffle(X,Y)` randomly permutes the order of images.

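`split(X,Y,r)` separates the training/testing data: the first fraction `r` of the instances becomes the training set and the rest becomes the testing set.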

`gen_train_test(X,Y,seed,r,verbose)` uses `shuffle` and `split` to generate a testing/training dataset.
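
The flatten, shuffle and split steps can be sketched on toy data as follows. This is only an illustration under assumptions: the toy arrays are made up, and a local `np.random.RandomState` is used here instead of the global `np.random.seed` call that the notebook relies on.

```python
import numpy as np

# Toy stand-ins: ten 2x3 "images" and alternating labels.
X_toy = np.arange(10 * 2 * 3).reshape(10, 2, 3)
Y_toy = np.array([0, 1] * 5)

# Flatten each image into a 1d feature vector (what linearize does).
X_flat = X_toy.reshape(len(X_toy), -1)

# Shuffle images and labels with the same permutation so pairs stay aligned.
rng = np.random.RandomState(0)
perm = rng.permutation(len(X_flat))
X_s, Y_s = X_flat[perm], Y_toy[perm]

# The first r fraction becomes training data, the rest testing data.
r = 0.8
c = int(len(X_s) * r)
X_train, Y_train, X_test, Y_test = X_s[:c], Y_s[:c], X_s[c:], Y_s[c:]

print(X_flat.shape, X_train.shape, X_test.shape)  # (10, 6) (8, 6) (2, 6)
```

Indexing both arrays with one permutation is what keeps each image paired with its label; shuffling them separately would silently scramble the targets.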

In [4]:
def linearize(X):
    return np.array([x.flatten() for x in X])

def shuffle(X,Y):
    perm = np.random.permutation(len(X)) # same permutation for X and Y
    X_s, Y_s = X[perm], Y[perm]
    return X_s, Y_s

def split(X,Y,r):
    c = int(len(X)*r) #cutoff for test/train data
    return X[:c], Y[:c], X[c:], Y[c:]

def gen_train_test(X,Y,seed,r,verbose):
    np.random.seed(seed)
    X_train, Y_train, X_test, Y_test = split(*shuffle(X,Y), r)
    if verbose:
        print("Train Non-Nevi:", len(Y_train[Y_train == 0]), "Train Nevi:", len(Y_train[Y_train == 1]))
        print("Test  Non-Nevi:", len(Y_test[Y_test == 0]),   "Test  Nevi:", len(Y_test[Y_test == 1]))
    return X_train, Y_train, X_test, Y_test

In [5]:
#print(X.shape)
#print(linearize(X).shape)

# change the seed/split around
#X_train, Y_train, X_test, Y_test = gen_train_test(X,Y,0,0.8,True)

<h2>SGDClassifier</h2>

`val_test(clf, X_test, Y_test)` takes a classifier and the testing data, and returns the accuracy from evaluating the testing datapoints.

`train_sgd_clf` also prints the results of 4-fold cross-validation on the training split and the resulting confusion matrix.

In [6]:
from sklearn.linear_model import SGDClassifier
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, precision_recall_curve, confusion_matrix

def val_test(clf, X_test, Y_test):
    Y_pred = clf.predict(X_test)
    return sum(Y_pred == Y_test) / len(Y_pred)

# misc code
def train_sgd_clf(clf,X,Y,seed):
    X_train, Y_train, X_test, Y_test = gen_train_test(X,Y,seed,0.8,verbose=True)
    X_train = linearize(X_train)

    clf.fit(X_train, Y_train)
    
    cvs = cross_val_score(clf, X_train, Y_train, cv=4, scoring="accuracy")
    print("Cross validation scores:", cvs)
    
    Y_train_pred = cross_val_predict(clf, X_train, Y_train, cv=4)
    cm = confusion_matrix(Y_train, Y_train_pred)
    print("Confusion Matrix:")
    print(cm)
    
    acc = val_test(clf, linearize(X_test), Y_test)
    print("Testing Accuracy:", acc)

In [7]:
#clf = SGDClassifier(loss='hinge', max_iter=4, tol=-np.infty, random_state=13)
#train_sgd_clf(clf, X, Y, 9)

In [8]:
#X_d, Y_d = drop_even(X, Y, 103, 103)

In [9]:
# show seed 1, seed 2
#clf = SGDClassifier(loss='hinge', max_iter=4, tol=-np.infty, random_state=13)
#train_sgd_clf(clf, X_d, Y_d, 0)

Confusion matrix:<br>
[TN FP]<br>[FN TP]
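
To make the layout concrete, here is a small sketch that counts the four cells by hand and derives accuracy from them. The label and prediction vectors are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical true labels and predictions (0 = not nevus, 1 = nevus).
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
tp = int(np.sum((y_true == 1) & (y_pred == 1)))

# Rows are the true class, columns are the predicted class.
cm = np.array([[tn, fp], [fn, tp]])
accuracy = (tn + tp) / cm.sum()

print(cm)
print(accuracy)  # 0.75
```

Accuracy is the sum of the diagonal divided by the total count, so the same number can be read off any confusion matrix printed in this notebook.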

<h2>Downscaling</h2>

All the images here are given as 600x450. Let's try resizing to 200x150 with different interpolation algorithms and seeing how performance changes.<br>
Runtime is better, and accuracy stays similar for multiple interpolation methods (for the base images, at least).

`resize_all` downscales `X`, the list of 2d matrices, using `inter_type`, an OpenCV interpolation flag such as `cv2.INTER_LINEAR`.

`test_downscaling` takes in `funcs`, a list of OpenCV interpolation flags, and trains a model on each downscaled dataset.<br>
e.g. `test_downscaling(13, [cv2.INTER_LINEAR, cv2.INTER_NEAREST, cv2.INTER_CUBIC])`
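
One detail worth keeping in mind is that `cv2.resize` takes `dsize` as (width, height), while NumPy array shapes are (rows, columns), i.e. (height, width). The sketch below does not call OpenCV at all. It uses plain stride slicing as a crude stand-in for a factor-3 downscale (not equivalent to any OpenCV interpolation flag), only to show the shape bookkeeping. The array `img` is a synthetic placeholder.

```python
import numpy as np

# A 600x450 image (width x height) is stored as an array of shape (450, 600).
img = np.zeros((450, 600), dtype=np.uint8)

# Keep every third row and every third column.
small = img[::3, ::3]

print(small.shape)  # (150, 200): 200 wide by 150 tall
```

So `dsize=(200,150)` in `resize_all` corresponds to arrays of shape `(150, 200)`, and each flattened image has 30000 features instead of 270000.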

In [10]:
def resize_all(X, inter_type):
    return np.array([cv2.resize(x,dsize=(200,150), interpolation=inter_type) for x in X])

def test_downscaling(seed, funcs, r=0.8): # r = train fraction (0.8 matches the other cells)
    # Relies on the globals X_all (images) and Y_all (labels)
    for func in funcs:
        np.random.seed(seed)
        X_down = resize_all(X_all, func)
        X_train, Y_train, X_test, Y_test = split(*shuffle(linearize(X_down), Y_all), r)
        sgd_clf = SGDClassifier(loss='hinge', max_iter=4, tol=-np.infty, random_state=13) # hinge = SVM
        sgd_clf.fit(X_train, Y_train)
        print(func, val_test(sgd_clf, X_test, Y_test))

<h2>Image Shifting</h2>

We can artificially create more data locally by shifting images around (the same as cropping out x pixels on a particular side) and resizing, which gives us a new image that has the same classification as the original.

`shift_img` crops out `n` pixels off a certain side given by `shift_dir`.
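
The cropping itself is plain NumPy slicing, where the first axis is rows (height) and the second is columns (width). The sketch below uses a small made-up array and leaves out the resize step, so it only shows how each slice changes the shape.

```python
import numpy as np

img = np.arange(6 * 8).reshape(6, 8)  # 6 rows (height), 8 columns (width)
n = 2

crop_bottom = img[:-n, :]     # drops the last n rows
crop_left = img[:, n:]        # drops the first n columns
crop_top_left = img[n:, n:]   # drops the first n rows and the first n columns

print(crop_bottom.shape, crop_left.shape, crop_top_left.shape)
# (4, 8) (6, 6) (4, 6)
```

Note that `n` has to be positive: `img[:-0, :]` is an empty array, because `-0` is just `0`.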

In [11]:
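def shift_img(img, n, shift_dir): # n = shift magnitude in pixels (must be > 0)
    m = img
    if shift_dir == 0:   # crop n rows off the bottom
        m = img[:-n, :]
    elif shift_dir == 1: # crop bottom rows and left columns
        m = img[:-n, n:]
    elif shift_dir == 2: # crop n columns off the left
        m = img[:, n:]
    elif shift_dir == 3: # crop top rows and left columns
        m = img[n:, n:]
    elif shift_dir == 4: # crop n rows off the top
        m = img[n:, :]
    elif shift_dir == 5: # crop top rows and right columns
        m = img[n:, :-n]
    elif shift_dir == 6: # crop n columns off the right
        m = img[:, :-n]
    elif shift_dir == 7: # crop bottom rows and right columns
        m = img[:-n, :-n]
    return cv2.resize(m,dsize=(200,150))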

In [12]:
#img = X_d[0]
#print(img)
#plt.imshow(img, cmap="gray")

In [13]:
#new_img = shift_img(img, 12, 0)
#print(new_img)
#plt.imshow(new_img, cmap="gray")

<h2>Batch Generation</h2>

When too many images are generated (using the shifting transformation), the computer does not have enough RAM to store them. To prevent this, we generate one 'batch' of images at a time and apply `partial_fit` to our classifier repeatedly.

`gen_batches` creates a random permutation of the integers from 0 to (total images - 1), where the total counts every shifted variant of every image, and splits it into batches of `batch_size`; the last batch may be smaller.

`gen_img` creates the image mapped to an integer id from the batch.

`gen_XY_batch` calls `gen_img` for every id in a batch and looks up the matching labels in Y (the observations).
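
The id-to-image mapping is integer division and remainder. The sketch below mirrors that arithmetic without touching any image data; `decode_id` is a hypothetical helper written for this illustration, not a function from the notebook.

```python
def decode_id(img_id, shifts):
    """Map an augmented-image id to (source index, shift magnitude, direction).

    Returns (source index, None, None) for the unshifted original.
    """
    s = len(shifts) * 8 + 1            # variants per source image
    src, variant = divmod(img_id, s)
    if variant == s - 1:               # the last slot is the original image
        return src, None, None
    return src, shifts[variant // 8], variant % 8

shifts = [3, 6]                        # two magnitudes -> 17 variants per image
print(decode_id(0, shifts))            # (0, 3, 0)
print(decode_id(9, shifts))            # (0, 6, 1)
print(decode_id(16, shifts))           # (0, None, None)
print(decode_id(17, shifts))           # (1, 3, 0)
```

Because every id decodes to exactly one (image, shift) pair, a batch of ids is enough to regenerate its images on demand, which is what keeps memory use bounded.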

In [14]:
def gen_batches(n, batch_size, m): # m = multiplier, number of shift magnitudes * 8 + 1
    ids = np.random.permutation(n*m)
    num_batches = n*m//batch_size
    full_batches = [ids[i*batch_size:(i+1)*batch_size] for i in range(num_batches)]
    last_batch = ids[batch_size*num_batches:]
    if len(last_batch) > 0:
        full_batches.append(last_batch)
    return full_batches

def gen_img(X, img_id, shifts):
    s = len(shifts)*8 + 1
    img = X[img_id // s]
    shift_type = img_id % s
    if shift_type == s-1:
        return img
    shift_mag, shift_dir = shifts[shift_type // 8], shift_type % 8
    return shift_img(img, shift_mag, shift_dir)

def gen_XY_batch(X,Y,batch,shifts):
    return np.array([gen_img(X, img_id, shifts) for img_id in batch]), Y[batch // (len(shifts)*8+1)]

In [15]:
#gen_batches(20, 5, 49)
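
The batching scheme in this section can be sketched in a few lines. The sizes below are arbitrary, and a local `np.random.RandomState` is an assumption of this example (the notebook seeds the global generator instead).

```python
import numpy as np

n_ids, batch_size = 23, 5
rng = np.random.RandomState(0)
ids = rng.permutation(n_ids)           # every id from 0 to 22 exactly once

# Consecutive slices of the permutation; the final slice holds the remainder.
batches = [ids[i:i + batch_size] for i in range(0, n_ids, batch_size)]

print([len(b) for b in batches])       # [5, 5, 5, 5, 3]
```

Each id appears in exactly one batch, so one pass over `batches` visits every augmented image once per epoch.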

<h2>Partial Fitting</h2>

In [16]:
def batch_train_aug(clf, X_train, Y_train, seed, epochs, batch_size, shifts):
    np.random.seed(seed)
    classes = np.unique(Y_train)
    for n in range(epochs):
        batches = gen_batches(len(X_train), batch_size, len(shifts)*8 + 1)
        for batch in batches:
            X_train_b, Y_train_b = gen_XY_batch(X_train, Y_train, batch, shifts)
            clf.partial_fit(linearize(X_train_b), Y_train_b, classes)
    return clf

In [17]:
#clf = SGDClassifier(loss='hinge', max_iter=4, tol=-np.infty, random_state=13)

<h2>Running Trials</h2>


In [18]:
from sklearn.base import clone
def grid_search_aug(clf, X, Y, seed_range, epoch_range, batch_size_range, shift_sets):
    for seed in seed_range:
        X_train, Y_train, X_test, Y_test = gen_train_test(X, Y, seed, 0.8, verbose=True)
        X_train_l = linearize(X_train)
        for shift_set in shift_sets:
            for epochs in epoch_range:
                for batch_size in batch_size_range:
                    clf_clone = clone(clf)
                    batch_train_aug(clf_clone, X_train, Y_train, seed, epochs, batch_size, shift_set)
                    
                    acc = val_test(clf_clone, linearize(X_test), Y_test)
                    cvs = 0
                    #cvs = cross_val_score(clf_clone, X_train_l, Y_train, cv=4, scoring="accuracy")
                    
                    s = "{:<.5f}\t{:<6d}\t{:<6d}\t{:<6d}\t{}\t{}"
                    print(s.format(acc, seed, epochs, batch_size, shift_set, cvs))

                    
# X holds 2d images; the training split is linearized inside the loop
def grid_search(clf, X, Y, seed_range):
    for seed in seed_range:
        X_train, Y_train, X_test, Y_test = gen_train_test(X, Y, seed, 0.8, verbose=False)
        X_train_l = linearize(X_train)
        
        clf_clone = clone(clf)
        clf_clone.fit(X_train_l, Y_train)

        acc = val_test(clf_clone, linearize(X_test), Y_test)
#        cvs = 0
        cvs = cross_val_score(clf_clone, X_train_l, Y_train, cv=4, scoring="accuracy")

        s = "{:<.5f}\t{:<6d}\t{}"
        print(s.format(acc, seed, cvs))

<h2>Misc</h2>

In [None]:
def pca(X, n_comp):
    # Fit PCA on X and return X projected onto the first n_comp components
    return PCA(n_components = n_comp).fit_transform(X)

def test_pca(r=0.8): # r = train fraction (0.8 matches the other cells)
    # Relies on the globals X_all (images) and Y_all (labels)
    X_lin = linearize(X_all)
    for n_comp in range(5,100,5):
        np.random.seed(13)
        X_pca = pca(X_lin, n_comp) # project onto n_comp components
        X_train, Y_train, X_test, Y_test = split(*shuffle(X_pca, Y_all), r)
        print("Training, total:", len(Y_train), "negative:", len(Y_train[Y_train == 0]))
        print("Testing, total:", len(Y_test), "negative:", len(Y_test[Y_test == 0]))

        sgd_clf = SGDClassifier(loss='hinge', max_iter=4, tol=-np.infty, random_state=13) # hinge = SVM
        sgd_clf.fit(X_train, Y_train)
        print("Accuracy:", val_test(sgd_clf, X_test, Y_test), "components:", n_comp)

<h2>Code goes here</h2>

In [19]:
X_all, Y_all = drop_even(*import_data(), 103, 103)
X_all = resize_all(X_all, cv2.INTER_LINEAR)
print("CSV Lines:", len(Y_all), "Non-Nevi:", len(Y_all[Y_all == 0]), "Nevi:", len(Y_all[Y_all == 1]))
print("Images:", len(X_all))

CSV Lines: 206 Non-Nevi: 103 Nevi: 103
Images: 206


In [40]:
X_train, Y_train, X_test, Y_test = gen_train_test(linearize(X_all),Y_all,17,0.8,True)

Train Non-Nevi: 82 Train Nevi: 82
Test  Non-Nevi: 21 Test  Nevi: 21


In [45]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

tuned_parameters = [{'loss': ['hinge', 'log'], 
                     'penalty': ['none', 'l2', 'l1', 'elasticnet']}]

scores = ['precision', 'recall']

for score in scores:
    clf = GridSearchCV(SGDClassifier(max_iter=5, tol=-np.infty, random_state=0), tuned_parameters, cv=5,
                           scoring='%s_macro' % score)
    clf.fit(X_train, Y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    Y_true, Y_pred = Y_test, clf.predict(X_test)
    print(classification_report(Y_true, Y_pred))
    print()



Best parameters set found on development set:

{'loss': 'hinge', 'penalty': 'l2'}

Grid scores on development set:

0.744 (+/-0.163) for {'loss': 'hinge', 'penalty': 'none'}
0.749 (+/-0.135) for {'loss': 'hinge', 'penalty': 'l2'}
0.744 (+/-0.163) for {'loss': 'hinge', 'penalty': 'l1'}
0.702 (+/-0.307) for {'loss': 'hinge', 'penalty': 'elasticnet'}
0.724 (+/-0.132) for {'loss': 'log', 'penalty': 'none'}
0.650 (+/-0.414) for {'loss': 'log', 'penalty': 'l2'}
0.724 (+/-0.132) for {'loss': 'log', 'penalty': 'l1'}
0.741 (+/-0.137) for {'loss': 'log', 'penalty': 'elasticnet'}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

              precision    recall  f1-score   support

           0       1.00      0.19      0.32        21
           1       0.55      1.00      0.71        21

   micro avg       0.60      0.60      0.60        42
   macro avg       0.78      0.60      0.52        42
weighted avg   



SGD on regular data

In [40]:
strs = ["Acc", "Seed", "Crossval"]
print("{:<7}\t{:<6}\t{}".format(*strs))
grid_search(
    clf = SGDClassifier(loss="hinge", max_iter=4, tol=-np.infty, random_state=13), 
    X = X_all,
    Y = Y_all,
    seed_range = [x for x in range(20)])

Acc    	Seed  	Crossval
0.54762	0     	[0.64285714 0.47619048 0.625      0.6       ]
0.47619	1     	[0.5        0.73170732 0.65853659 0.525     ]
0.73810	2     	[0.57142857 0.63414634 0.51219512 0.5       ]
0.57143	3     	[0.57142857 0.6097561  0.51219512 0.475     ]
0.57143	4     	[0.7804878  0.6097561  0.63414634 0.56097561]
0.57143	5     	[0.64285714 0.68292683 0.53658537 0.65      ]
0.52381	6     	[0.5        0.48780488 0.48780488 0.575     ]
0.57143	7     	[0.73809524 0.51219512 0.58536585 0.6       ]
0.52381	8     	[0.57142857 0.53658537 0.6097561  0.7       ]
0.73810	9     	[0.47619048 0.6097561  0.6097561  0.6       ]
0.73810	10    	[0.48780488 0.53658537 0.48780488 0.48780488]
0.52381	11    	[0.52380952 0.58536585 0.63414634 0.5       ]
0.69048	12    	[0.80952381 0.6097561  0.70731707 0.525     ]
0.54762	13    	[0.7804878  0.51219512 0.68292683 0.48780488]
0.52381	14    	[0.5952381  0.6097561  0.51219512 0.575     ]
0.69048	15    	[0.69047619 0.51219512 0.51219512 0.625     ]


In [41]:
# Showing how to retrieve a test case
#X_train, Y_train, X_test, Y_test = gen_train_test(X_all, Y_all, 18, 0.8, True)

Train Non-Nevi: 86 Train Nevi: 78
Test  Non-Nevi: 17 Test  Nevi: 25


In [42]:
#clf = SGDClassifier(loss="hinge", max_iter=4, tol=-np.infty, random_state=13)
#train_sgd_clf(clf,X_all,Y_all,18)

Train Non-Nevi: 86 Train Nevi: 78
Test  Non-Nevi: 17 Test  Nevi: 25
Cross validation scores: [0.52380952 0.66666667 0.725      0.475     ]
Confusion Matrix:
[[45 41]
 [25 53]]
Testing Accuracy: 0.40476190476190477


MLP on regular data

In [None]:
from sklearn.neural_network import MLPClassifier
strs = ["Acc", "Seed", "Crossval"]
print("{:<7}\t{:<6}\t{}".format(*strs))
grid_search(
    clf = MLPClassifier(random_state=13), 
    X = X_all,
    Y = Y_all,
    seed_range = [x for x in range(0)])

SGD on shifted data

In [None]:
strs = ["Acc", "Seed", "Epochs", "Batch"]
print("{:<7}\t{:<6}\t{:<6}\t{:<6}".format(*strs))
shift_sets = [[3], [3,6], [3,6,9], [3,6,9,12], [3,6,9,12,15]]
grid_search_aug(
    clf = SGDClassifier(loss="hinge", max_iter=4, tol=-np.infty, random_state=13), 
    X = X_all,
    Y = Y_all,
    seed_range = [x for x in range(0)], 
    epoch_range = [1,2,3,4], 
    batch_size_range = [50, 100], 
    shift_sets = shift_sets)