The aim of this notebook is to load the original data from the ICPR2010 contest (specifically the data inside the S1 directory, since it is the only one that is labeled), preprocess the datasets, split each of them into 5 folds, and save the folds as CSV files.

# Importing required libs

In [62]:
import pandas as pd
import numpy as np
from scipy.io.arff import loadarff
from sklearn import preprocessing
from sklearn.model_selection import KFold, StratifiedKFold


# Loading data

In [12]:
datasets = [] 
for i in range(1,302):
    raw_data = loadarff(f'../../data/S1/D{i}-trn.arff')
    df_data = pd.DataFrame(raw_data[0])
    datasets.append(df_data)
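
To illustrate what `loadarff` hands back, here is a minimal sketch on a made-up, in-memory ARFF file. The relation name, attribute names and values are hypothetical and are not taken from the contest data. `loadarff` returns a `(data, meta)` tuple, and nominal attributes arrive as byte strings, which matters for the label encoding step later in this notebook.

```python
from io import StringIO

import pandas as pd
from scipy.io.arff import loadarff

# Hypothetical ARFF content, used only to show the shape of the result.
arff_text = """@relation toy
@attribute x1 numeric
@attribute x2 numeric
@attribute class {a,b}
@data
0.1,1.0,a
0.2,2.0,b
0.3,3.0,a
"""

# loadarff accepts a file-like object and returns (data, meta).
data, meta = loadarff(StringIO(arff_text))
toy_arff_df = pd.DataFrame(data)

print(toy_arff_df.shape)
print(toy_arff_df['class'].tolist())  # nominal values are bytes, not str
```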

Listing datasets properties (number of rows, columns and classes).

In [13]:
ds_desc_dict = {'dataset':[], 'rows':[], 'columns':[], 'classes':[]}
for i in range(1,302):
    rows = datasets[i-1].shape[0]
    columns = datasets[i-1].shape[1]
    classes = len(datasets[i-1]['class'].unique())
    ds_desc_dict['dataset'].append(i)
    ds_desc_dict['rows'].append(rows)
    ds_desc_dict['columns'].append(columns)
    ds_desc_dict['classes'].append(classes)
    
ds_desc = pd.DataFrame(ds_desc_dict)

In [14]:
ds_desc

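| | dataset | rows | columns | classes |
| --- | --- | --- | --- | --- |
| 0 | 1 | 301 | 21 | 2 |
| 1 | 2 | 231 | 9 | 2 |
| 2 | 3 | 319 | 21 | 2 |
| 3 | 4 | 301 | 21 | 2 |
| 4 | 5 | 300 | 21 | 2 |
| ... | ... | ... | ... | ... |
| 296 | 297 | 231 | 9 | 2 |
| 297 | 298 | 301 | 21 | 2 |
| 298 | 299 | 300 | 21 | 2 |
| 299 | 300 | 302 | 21 | 2 |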


In [15]:
# How many occurrences of each possible number of classes?
ds_desc['classes'].value_counts()

2     300
20      1
Name: classes, dtype: int64

Only one dataset is not binary (the last one).

In [16]:
# Describe it excluding the last dataset (outlier).
ds_desc.iloc[:-1,:].describe()

| | dataset | rows | columns | classes |
| --- | --- | --- | --- | --- |
| count | 300.0 | 300.0 | 300.0 | 300.0 |
| mean | 150.5 | 389.846667 | 16.52 | 2.0 |
| std | 86.746758 | 134.684953 | 5.813972 | 0.0 |
| min | 1.0 | 230.0 | 9.0 | 2.0 |
| 25% | 75.75 | 302.0 | 9.0 | 2.0 |
| 50% | 150.5 | 354.0 | 21.0 | 2.0 |
| 75% | 225.25 | 466.5 | 21.0 | 2.0 |
| max | 300.0 | 950.0 | 21.0 | 2.0 |


**Conclusion:** 
- All datasets are binary, except the last one, which has 20 class labels. 
- Excluding the last dataset, the mean numbers of rows and columns are 389.8 and 16.5, respectively.

In [17]:
# To simplify our analysis, we are going to exclude dataset 301 from our experiments.
datasets = datasets[:-1]

# Preprocessing

## Label encoding

First, let's deal with the class values, transforming them into 0 and 1.

In [18]:
for ds in datasets:
    ds['class'] = preprocessing.LabelEncoder().fit_transform(ds['class'])
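
As a small illustration of what `LabelEncoder` does to a binary class column, the sketch below uses made-up byte-string labels (the actual label values in the contest files are not shown in this notebook, so `b'1'` and `b'2'` are assumptions). The encoder sorts the distinct labels and replaces each one with its position in that sorted list.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical byte-string labels, similar to what loadarff yields
# for nominal attributes.
toy_labels = pd.DataFrame({'class': [b'2', b'1', b'2', b'1']})

encoder = LabelEncoder()
toy_labels['class'] = encoder.fit_transform(toy_labels['class'])

# classes_ holds the original labels in sorted order;
# the index of a label in classes_ is its encoded value.
print(list(encoder.classes_))
print(toy_labels['class'].tolist())
```

Keeping a reference to the fitted encoder, as done here, makes it possible to map the integers back to the original labels with `inverse_transform` if that is ever needed.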

## Missing values?

Are there any missing values?

In [21]:
found_missing = False
# enumerate(..., start=1) keeps the dataset numbers aligned with the file names (D1, D2, ...).
for i, ds in enumerate(datasets, start=1):
    if ds.isnull().values.any():
        print(f'There is(are) missing value(s) in dataset {i}.')
        found_missing = True
if not found_missing:
    print('There is no missing value.')

There is no missing value.


## Non-numeric attributes?

Are there non-numeric attributes?

In [29]:
found_non_numeric = False
for i, ds in enumerate(datasets, start=1):
    # Check the current dataset (ds), not only the first one.
    if len(ds.select_dtypes(exclude=["number", "bool"]).columns) > 0:
        print(f'There is a non-numeric attribute in dataset {i}.')
        found_non_numeric = True
if not found_non_numeric:
    print('Datasets are composed of numeric attributes only.')

Datasets are composed of numeric attributes only.
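
For reference, this is roughly how the `select_dtypes` check behaves when a non-numeric column is present. The frame below is a made-up example with hypothetical column names, not contest data.

```python
import pandas as pd

# Hypothetical frame with one float, one boolean and one text column.
toy_mixed = pd.DataFrame({
    'x1': [0.5, 1.5],
    'flag': [True, False],
    'label': ['a', 'b'],
})

# Everything that is neither numeric nor boolean is reported.
non_numeric = toy_mixed.select_dtypes(exclude=['number', 'bool']).columns.tolist()
print(non_numeric)
```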


## Splitting datasets into 5 folds

In [None]:
for ds_number, ds in enumerate(datasets, start=1):
    # The folds are made by preserving the percentage of samples for each class.
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # The class label is assumed to be the last column.
    X = ds.iloc[:, :-1]
    y = ds.iloc[:, -1]

    # The split() method generates positional indices for the training and test sets.
    folds = []
    for train_index, test_index in kf.split(X, y):
        folds.append({'train': ds.iloc[train_index],
                      'test': ds.iloc[test_index]})

    # Saving the folds in CSV files so that they can be reused to reproduce the results.
    for i, fold in enumerate(folds, start=1):
        fold['train'].to_csv(f'../../data/5-fold/D{ds_number}-fold{i}-train.csv', index=False, encoding='utf8')
        fold['test'].to_csv(f'../../data/5-fold/D{ds_number}-fold{i}-test.csv', index=False, encoding='utf8')
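
To make the effect of stratification concrete, the sketch below applies the same `StratifiedKFold` settings to a made-up imbalanced dataset (the column names and class sizes are assumptions chosen for illustration). With 15 samples of class 0 and 5 of class 1, every test fold should receive 4 samples, exactly one of which belongs to class 1.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced dataset: 15 samples of class 0 and 5 of class 1.
toy_folds_df = pd.DataFrame({'x': range(20), 'class': [0] * 15 + [1] * 5})

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# For each fold, record (test size, number of class-1 samples in the test set).
fold_summaries = []
for train_index, test_index in kf.split(toy_folds_df[['x']], toy_folds_df['class']):
    test = toy_folds_df.iloc[test_index]
    fold_summaries.append((len(test), int(test['class'].sum())))

print(fold_summaries)
```

A plain `KFold` gives no such guarantee, so some test folds could end up with no minority-class samples at all.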

Now that the data has been preprocessed, split and saved into CSVs, the experiment can continue in another notebook.