# Dataset formatting

## Training set
---
In this section we explain the given training set format.

In [1]:
# import useful packages
import pandas as pd
from sklearn.utils import shuffle
from math import floor

# path template for every file in the data folder
data_filename = './../../data/{}'
train_set_filename = 'train.csv'

print('Importing training set....')
train_set = pd.read_csv(data_filename.format(train_set_filename))
# rows without a defect have a missing EncodedPixels value: store it as an empty string
train_set = train_set.fillna('')
print('Training set correctly imported. Displaying first rows....')
train_set.head()

Importing training set....
Training set correctly imported. Displaying first rows....


Unnamed: 0,ImageId_ClassId,EncodedPixels
0,0002cc93b.jpg_1,29102 12 29346 24 29602 24 29858 24 30114 24 3...
1,0002cc93b.jpg_2,
2,0002cc93b.jpg_3,
3,0002cc93b.jpg_4,
4,00031f466.jpg_1,


### train.csv explanation

The training set associates each image with one entry per defect class. If there are n defect classes, the image "ImageId.jpg" is associated with n entries (four in this data set), each ending with "_ClassId", where *"ClassId"* is the defect class identifier.

An image has a flaw of the i-th class if the corresponding row contains, in the *EncodedPixels* column, the sequence of pixels where the defect is present. If the sequence is empty for every *ClassId*, then the steel in the image is flawless.
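
As a small illustration of this layout, the sketch below splits an `ImageId_ClassId` value into its two parts and checks whether a row marks a defect. The helper names `split_image_class` and `row_has_defect` are hypothetical, and the sketch assumes that the class identifier is always the integer after the last underscore.

```python
def split_image_class(image_id_class_id):
    """Split 'ImageId.jpg_ClassId' into ('ImageId.jpg', ClassId)."""
    # Split on the last underscore only, so the file name is kept whole.
    image_id, class_id = image_id_class_id.rsplit('_', 1)
    return image_id, int(class_id)


def row_has_defect(encoded_pixels):
    """A row marks a defect when its EncodedPixels value is not empty."""
    return encoded_pixels.strip() != ''


# Two rows in the style of the file: one with a pixel sequence, one empty.
rows = [
    ('0002cc93b.jpg_1', '17 6 24 4 33 2'),
    ('0002cc93b.jpg_2', ''),
]
for key, pixels in rows:
    print(split_image_class(key), row_has_defect(pixels))
```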

### EncodedPixels explanation
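Each defect is stored as a run-length encoding. Pixels are numbered from 1, starting at (1,1) and going down each column before moving one column to the right. *EncodedPixels* is then a sequence of "start length" pairs, where each pair marks "length" consecutive pixels beginning at pixel number "start".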

<table>
<tr>
    <th>Image with two classes of defects </th>
    <th>First column pixels enumeration</th>
    <th>Second column pixels enumeration</th>
</tr>
<tr>
<td>
    
|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| 1 |   |   |   | X |   |   |   |
| 2 |   |   |   |   |   |   |   |
| 3 |   |   | X | X |   | O |   |
| 4 |   |   | X | X | O | O | O |
| 5 |   |   | X | X | X | O |   |
| 6 |   |   | X | X | X |   |   |
| 7 |   |   | X |   |   |   |   |

</td>
<td>

| row | column | pixel number |
|:---:|:------:|:------------:|
|  1  |    1   |       1      |
|  2  |    1   |       2      |
|  3  |    1   |       3      |
|  4  |    1   |       4      |
|  5  |    1   |       5      |
|  6  |    1   |       6      |
|  7  |    1   |       7      |

</td>
<td>

| row | column | pixel number |
|:---:|:------:|:------------:|
|  1  |    2   |       8      |
|  2  |    2   |       9      |
|  3  |    2   |      10      |
|  4  |    2   |       11      |
|  5  |    2   |       12      |
|  6  |    2   |       13      |
|  7  |    2   |       14      |

</td>
</tr>
</table>

The first table above shows an example image with two defects: class 1 (the crosses "X") and class 2 (the circles "O"). The second and third tables show the pixel enumeration for the first two columns.

In this case, the corresponding entries in the *"train.csv"* data file would be (with 4 defect classes):

|   | ImageId_ClassId |  EncodedPixels |
|:-:|:---------------:|:--------------:|
| 1 | 0002cc93b.jpg_1 | 17 6 24 4 33 2 |
| 2 | 0002cc93b.jpg_2 | 32 1 38 3 46 1 |
| 3 | 0002cc93b.jpg_3 |                |
| 4 | 0002cc93b.jpg_4 |                |

This means that the image has a class 1 defect on 6 consecutive pixels starting from pixel 17 (that is, pixels 17 to 22: rows 3 to 7 of column 3 and row 1 of column 4), and so on.
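
To make the encoding concrete, here is a minimal sketch of a decoder for the 7x7 example image. The function name `decode_pixels` and its `n_rows` parameter are hypothetical; the sketch assumes 1-based pixel numbers counted down each column, as in the enumeration tables.

```python
def decode_pixels(encoded_pixels, n_rows):
    """Return the 1-based (row, column) pairs of a defect, in pixel order."""
    numbers = [int(token) for token in encoded_pixels.split()]
    positions = []
    # The string is a flat sequence of (start, length) pairs.
    for start, length in zip(numbers[0::2], numbers[1::2]):
        for pixel in range(start, start + length):
            # Pixels are numbered down each column, starting from 1.
            column, row = divmod(pixel - 1, n_rows)
            positions.append((row + 1, column + 1))
    return positions


# Class 1 of the example image (the crosses "X").
class_1 = decode_pixels('17 6 24 4 33 2', n_rows=7)
print(class_1)
```

The first run, "17 6", decodes to rows 3 to 7 of column 3 followed by row 1 of column 4, which shows that a run may continue across a column boundary.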

## Input formatting
---
As explained in the [problem statement section](./Problem%20statement.ipynb), our goal is to identify defective steel surfaces and to locate their flaws. The output has the same format as the training set.
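
Since the output must use the same format, predicted defect pixels have to be turned back into a "start length" sequence. The sketch below shows one possible way to do it for the 7x7 example image; `encode_pixels` and `n_rows` are hypothetical names, and positions are assumed to be 1-based (row, column) pairs.

```python
def encode_pixels(positions, n_rows):
    """Encode 1-based (row, column) pairs as a 'start length ...' string."""
    # Convert each position to its column-major pixel number (duplicates removed).
    pixels = sorted({(column - 1) * n_rows + row for row, column in positions})
    runs = []  # each item is [start, length]
    for pixel in pixels:
        if runs and pixel == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1  # the pixel extends the current run
        else:
            runs.append([pixel, 1])  # the pixel starts a new run
    return ' '.join('{} {}'.format(start, length) for start, length in runs)


# Class 2 of the example image (the circles "O").
class_2 = [(4, 5), (3, 6), (4, 6), (5, 6), (4, 7)]
print(encode_pixels(class_2, n_rows=7))  # 32 1 38 3 46 1
```

An empty list of positions yields an empty string, matching the empty *EncodedPixels* value of a class without defects.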

To do this, we will use one-vs-all classification with a [CNN (convolutional neural network)](./Convolutional%20Neural%20Network%20(CNN).ipynb). For this reason, we now format the training set in a way that better suits the problem.

The following code groups the rows relative to each image into a single row, adding one boolean column *HasDefect_ClassId* per defect class.

In [2]:
def format_train_set(train_set, n_defect_classes):
    # Assumes a default 0..n-1 index and rows grouped by image, ordered by class id.
    n_rows = len(train_set.index)
    m = n_rows // n_defect_classes  # number of images
    assert m * n_defect_classes == n_rows, 'Number of rows should be multiple of number of defect classes.'
    images_id = []
    encoded_pixels = {}
    has_defect = {}
    for class_id in range(1, n_defect_classes + 1):
        encoded_pixels['EncodedPixels_{}'.format(class_id)] = []
        has_defect['HasDefect_{}'.format(class_id)] = []
    for i in range(0, m):
        image_index = i * n_defect_classes  # first row of the i-th image
        image_df = train_set.loc[image_index: (i + 1) * n_defect_classes - 1]
        # Drop only the trailing '_ClassId' suffix to get the image id.
        images_id.append(image_df['ImageId_ClassId'][image_index].rsplit('_', 1)[0])
        for class_id in range(1, n_defect_classes + 1):
            pixels = image_df['EncodedPixels'][image_index + class_id - 1]
            encoded_pixels['EncodedPixels_{}'.format(class_id)].append(pixels)
            # An empty string means no defect of this class.
            has_defect['HasDefect_{}'.format(class_id)].append(pixels != '')
    data_set = pd.DataFrame({
        'ImageId': images_id,
        **encoded_pixels,
        **has_defect
    })
    return data_set

The function takes the training set and the number of defect classes. This number is given by the contest, but it can also be recovered by analyzing the data set.
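
As a sketch of how the number of defect classes could be recovered from the data, the hypothetical helper `count_defect_classes` below counts the distinct suffixes of the `ImageId_ClassId` values. It assumes the class identifier is whatever follows the last underscore.

```python
def count_defect_classes(image_class_ids):
    """Count the distinct ClassId suffixes in 'ImageId_ClassId' values."""
    return len({value.rsplit('_', 1)[1] for value in image_class_ids})


# First rows of the training set shown earlier.
sample = [
    '0002cc93b.jpg_1',
    '0002cc93b.jpg_2',
    '0002cc93b.jpg_3',
    '0002cc93b.jpg_4',
    '00031f466.jpg_1',
]
print(count_defect_classes(sample))  # 4
```

The helper accepts any iterable of strings, so it should also work when given the `ImageId_ClassId` column of the imported training set.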

**Remark:** The above code exploits patterns in the input data to run faster (e.g. the fixed number of defect classes and the order of the rows).

In [3]:
n_defect_classes = 4
data_set = format_train_set(train_set, n_defect_classes)
data_set.head()

Unnamed: 0,ImageId,EncodedPixels_1,EncodedPixels_2,EncodedPixels_3,EncodedPixels_4,HasDefect_1,HasDefect_2,HasDefect_3,HasDefect_4
0,0002cc93b.jpg,29102 12 29346 24 29602 24 29858 24 30114 24 3...,,,,True,False,False,False
1,00031f466.jpg,,,,,False,False,False,False
2,000418bfc.jpg,,,,,False,False,False,False
3,000789191.jpg,,,,,False,False,False,False
4,0007a71bf.jpg,,,18661 28 18863 82 19091 110 19347 110 19603 11...,,False,False,True,False


## Cross validation and test set
---
To allow hyperparameter tuning and to test the model, we randomly divide the data set into the actual training set, a cross validation set and a test set, using a 70/20/10 split.

In [4]:
def build_train_cv_test_set(data_set, train_set_fraction=0.7, test_set_fraction=0.1):
    assert train_set_fraction > 0, 'Training set shouldn\'t be null'
    assert test_set_fraction >= 0, 'Test set can\'t have a negative number of rows'
    assert train_set_fraction + test_set_fraction <= 1, 'Fraction sum is higher than 1'
    n_rows = len(data_set.index)
    n_rows_train_set = floor(n_rows * train_set_fraction)
    n_rows_test_set = floor(n_rows * test_set_fraction)
    # The cross validation set takes all remaining rows, including rounding remainders.
    n_rows_cv_set = n_rows - (n_rows_train_set + n_rows_test_set)
    print('The data set has {} rows'.format(n_rows))
    print('Randomly shuffling the data set....')
    shuffled_data_set = shuffle(data_set)
    print('Picking the first {} rows to build the train set....'.format(n_rows_train_set))
    train_set = shuffled_data_set.iloc[0:n_rows_train_set]
    # Empty sets stay DataFrames (not dicts), so .head() and .to_csv() still work.
    cv_set = shuffled_data_set.iloc[0:0]
    if n_rows_cv_set > 0:
        # These rows come right after the training rows.
        print('Picking the first {} rows to build the cross validation set....'.format(n_rows_cv_set))
        cv_set = shuffled_data_set.iloc[n_rows_train_set:
                                        n_rows_train_set + n_rows_cv_set]
    test_set = shuffled_data_set.iloc[0:0]
    if n_rows_test_set > 0:
        # These rows come right after the cross validation rows.
        print('Picking the first {} rows to build the test set....'.format(n_rows_test_set))
        test_set = shuffled_data_set.iloc[n_rows_train_set + n_rows_cv_set:
                                          n_rows_train_set + n_rows_cv_set + n_rows_test_set]
    return {
        'TRAIN': train_set,
        'CV': cv_set,
        'TEST': test_set
    }

In [5]:
sets = build_train_cv_test_set(data_set)
training_set = sets['TRAIN']
cv_set = sets['CV']
test_set = sets['TEST']

The data set has 12568 rows
Randomly shuffling the data set....
Picking the first 8797 rows to build the train set....
Picking the first 2515 rows to build the cross validation set....
Picking the first 1256 rows to build the test set....
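
The sizes printed above follow directly from the split arithmetic. The sketch below reproduces only the size computation; the function name `split_sizes` is hypothetical.

```python
from math import floor


def split_sizes(n_rows, train_set_fraction=0.7, test_set_fraction=0.1):
    """Return the (train, cross validation, test) sizes of the split."""
    n_train = floor(n_rows * train_set_fraction)
    n_test = floor(n_rows * test_set_fraction)
    # The cross validation set receives every remaining row.
    n_cv = n_rows - (n_train + n_test)
    return n_train, n_cv, n_test


print(split_sizes(12568))  # (8797, 2515, 1256)
```

Because the training and test sizes are rounded down, the cross validation set absorbs the remainder, so it is slightly larger than 20% of the rows.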


In [6]:
print('First rows of TRAINING SET:')
training_set.head()

First rows of TRAINING SET:


Unnamed: 0,ImageId,EncodedPixels_1,EncodedPixels_2,EncodedPixels_3,EncodedPixels_4,HasDefect_1,HasDefect_2,HasDefect_3,HasDefect_4
12486,fe335d9b8.jpg,,,303847 26 304053 76 304258 127 304464 177 3046...,,False,False,True,False
11568,ead7d8e70.jpg,,,,,False,False,False,False
1948,2745ecf6b.jpg,,,113050 11 113289 32 113528 53 113767 72 114007...,,False,False,True,False
6621,86752d23a.jpg,,,35039 14 35286 30 35542 38 35799 39 36055 39 3...,,False,False,True,False
8751,b17fcb164.jpg,,,615 2 867 6 1120 10 1372 14 1625 18 1877 22 21...,,False,False,True,False


In [7]:
print('First rows of CROSS VALIDATION SET:')
cv_set.head()

First rows of CROSS VALIDATION SET:


Unnamed: 0,ImageId,EncodedPixels_1,EncodedPixels_2,EncodedPixels_3,EncodedPixels_4,HasDefect_1,HasDefect_2,HasDefect_3,HasDefect_4
3749,4c2bd254d.jpg,,,221441 11 221697 32 221953 53 222209 74 222465...,,False,False,True,False
4377,594e0207a.jpg,,,,,False,False,False,False
10058,cc3a294d4.jpg,,,,,False,False,False,False
7849,9ffe5113f.jpg,,,,,False,False,False,False
6277,7f10fcbfe.jpg,,,,,False,False,False,False


In [10]:
print('First rows of TEST SET:')
test_set.head()

First rows of TEST SET:


Unnamed: 0,ImageId,EncodedPixels_1,EncodedPixels_2,EncodedPixels_3,EncodedPixels_4,HasDefect_1,HasDefect_2,HasDefect_3,HasDefect_4
758,0fa3f0dab.jpg,,,214020 32 214276 95 214532 158 214788 221 2150...,,False,False,True,False
5825,7581cba15.jpg,,,,323840 1 324094 3 324348 5 324603 6 324857 8 3...,False,False,False,True
11898,f22da130b.jpg,,,,996 26 1200 78 1287 247 1545 245 1803 243 2061...,False,False,False,True
7562,99d47e124.jpg,,,47803 15 48031 43 48258 72 48485 101 48713 129...,,False,False,True,False
4940,64361cd8b.jpg,,,582 6 830 17 1078 27 1329 31 1583 33 1837 34 2...,,False,False,True,False


Finally, we save these sets in three CSV files.

In [11]:
datasets_filenames = {
    'TRAIN': 'training_set.csv',
    'CV': 'cross_validation_set.csv',
    'TEST': 'test_set.csv'
}

print('Saving TRAINING SET to \'{}\'...'.format(data_filename.format(datasets_filenames['TRAIN'])))
training_set.to_csv(data_filename.format(datasets_filenames['TRAIN']))
print('Saving CROSS VALIDATION SET to \'{}\'...'.format(data_filename.format(datasets_filenames['CV'])))
cv_set.to_csv(data_filename.format(datasets_filenames['CV']))
print('Saving TEST SET to \'{}\'...'.format(data_filename.format(datasets_filenames['TEST'])))
test_set.to_csv(data_filename.format(datasets_filenames['TEST']))
print('Data sets successfully saved.')

Saving TRAINING SET to './../../data/training_set.csv'...
Saving CROSS VALIDATION SET to './../../data/cross_validation_set.csv'...
Saving TEST SET to './../../data/test_set.csv'...
Data sets successfully saved.


**Remark:** randomly dividing the data set into training, cross validation and test sets is reasonable, because both flawed and flawless steel surfaces should be present in every step of model definition and evaluation. Indeed, we are interested not only in detecting an anomaly, but also in identifying and locating it.
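
One way to check this assumption is to compare the share of defective images in each set after the split. The sketch below does so on synthetic data: the rows, the 15% defect probability and the helper `defect_rate` are all hypothetical and only stand in for the four *HasDefect_ClassId* flags of each image.

```python
import random


def defect_rate(rows):
    """Fraction of rows with at least one HasDefect flag set."""
    if not rows:
        return 0.0
    return sum(any(flags) for flags in rows) / len(rows)


# Hypothetical toy data: each row holds four HasDefect_ClassId flags.
random.seed(0)
rows = [[random.random() < 0.15 for _ in range(4)] for _ in range(1000)]

# Shuffle, then cut with the same 70/20/10 proportions.
random.shuffle(rows)
train, cv, test = rows[:700], rows[700:900], rows[900:]
for name, subset in (('TRAIN', train), ('CV', cv), ('TEST', test)):
    print(name, round(defect_rate(subset), 3))
```

If the three rates are close to each other, the random split has kept flawed and flawless surfaces in similar proportions across the sets.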

Later we may apply other criteria to choose how the data set samples are distributed, and compare the results. A priori, the random split is a good choice. In the next section we will analyze the data sets.