# Data Description and Preprocessing 

## Modular Test Rig

The data were collected using the modular test rig shown in figure (a). The rig consists of five components: (1) an electric motor; (2) a torque-measurement shaft; (3) a rolling bearing test module; (4) a flywheel; and (5) a load motor. A detailed explanation of the modular test rig can be found in [KAt].


## Description 

This dataset covers experiments on 32 rolling bearings, grouped as follows:
- 6 undamaged (healthy) bearings
- 12 bearings with artificial damage
- 14 bearings with real (accelerated) damage

The artificial and real damages cover two fault types: inner-ring and outer-ring faults. Twenty measurements were recorded for each bearing file; each measurement was sampled at 64 kHz and lasts 4 seconds, so a measurement contains about 256,000 data points. The experiments were carried out under four working conditions that differ in rotational speed, load torque, and radial force. Rotational speed varied between 900 rpm and 1500 rpm, load torque between 0.1 Nm and 0.7 Nm, and the applied radial force between 400 N and 1000 N. Table_x shows the setup of the four operating conditions.

Summary: 
- The dataset has 26 damaged bearing states (12 with artificial damage and 14 with real damage) and 6 healthy (undamaged) bearing states.
- There are four different operating conditions.
- Each bearing file has 20 measurements of 4 seconds each, and each file name carries a code that identifies its operating condition.

##  Loading and Preprocessing 

##### Loading of 20 measurements for each real damage file 

In this work we used the files with real damage, since they represent a more practical scenario. The experiments were conducted under four different working conditions. The following table shows the real-damage files selected to train our model for each working condition. Each of these files (e.g., KA01, KA02, KI01, ...) has 20 measurements (e.g., KA01_1, ..., KA01_20). A single measurement has 256,000 data points (i.e., 64 kHz for 4 seconds). To load these files automatically, we implemented a MATLAB script, "automatic file loading". Its outputs are A_5120L, B_5120L, C_5120L, and D_5120L, one per working condition.
In our experiment we used a single healthy file, 5 files with outer-ring faults, and 5 files with inner-ring faults.
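
The measurement naming described above can be sketched in a few lines of Python. This is only an illustration: `measurement_names` is a hypothetical helper, not part of the MATLAB loader, and the files on disk may carry an additional prefix that encodes the operating condition.

```python
def measurement_names(bearing_code, n_measurements=20):
    """Return the per-measurement names for one bearing file.

    Assumes the '<bearing>_<index>' pattern with indices starting at 1,
    as in the KA01_1 ... KA01_20 example from the text.
    """
    return [f"{bearing_code}_{i}" for i in range(1, n_measurements + 1)]


names = measurement_names("KA01")
print(names[0], names[-1], len(names))
```
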


##### Data augmentation using sliding window 

To handle the long measurements of 256,000 data points, we used a sliding window to extract short samples, with overlap between consecutive windows to increase the number of samples. For each measurement we used a window length of 5120 and a shift size of 4096. Note that for the healthy samples the shift size was reduced to 1024 to balance the number of healthy samples. In the end we have 4900 healthy samples, 6200 inner-fault samples, and 6200 outer-fault samples. The following figure shows the sliding window for healthy and faulty data.
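
As a rough illustration of the windowing step, the NumPy sketch below cuts one measurement into overlapping windows. The original MATLAB code is not shown here, so this is an assumption about how it behaves: `sliding_windows` is a hypothetical helper, the signal is random stand-in data, and the number of windows follows the ceil((T - L) / S) count used in this section.

```python
import math

import numpy as np


def sliding_windows(signal, window_length=5120, shift=4096):
    """Cut a 1-D signal into overlapping windows of fixed length."""
    signal = np.asarray(signal)
    # Window count as in this section: ceil((T - L) / S).
    n_windows = math.ceil((len(signal) - window_length) / shift)
    return np.stack(
        [signal[i * shift : i * shift + window_length] for i in range(n_windows)]
    )


# Random stand-in for one 4-second measurement sampled at 64 kHz.
measurement = np.random.default_rng(0).standard_normal(256_000)

faulty_windows = sliding_windows(measurement, shift=4096)
healthy_windows = sliding_windows(measurement, shift=1024)
print(faulty_windows.shape, healthy_windows.shape)
```

With a window of 5120 and a shift of 4096, consecutive windows share 1024 points; the smaller shift of 1024 produces roughly four times as many windows from the same measurement.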

The following equations show how the number of generated samples is calculated:

$$\begin{align}
n &= \left\lceil \frac{T - L}{S} \right\rceil \times \text{number of measurements} \\
N &= n \times \text{number of class files} \\
K &= N \times \text{number of classes}
\end{align}$$


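where $n$ is the number of samples per file, $N$ is the number of samples for each class, and $K$ is the total number of samples for each working condition. Here $T$ is the length of one measurement (256,000 points), $L$ is the window length (5120), and $S$ is the shift size (4096 for faulty data, 1024 for healthy data).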
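
As a quick check, the sample counts quoted earlier can be reproduced from these formulas. The sketch below is plain arithmetic; the variable names are illustrative. Because the healthy class uses a different shift size and a single file, the total is computed here as the sum over classes rather than as $N$ times the number of classes.

```python
import math

T = 256_000          # points per measurement (64 kHz for 4 seconds)
L = 5120             # window length
N_MEASUREMENTS = 20  # measurements per bearing file


def samples_per_file(shift):
    """n = ceil((T - L) / S) * number of measurements."""
    return math.ceil((T - L) / shift) * N_MEASUREMENTS


n_faulty = samples_per_file(4096)   # per faulty file
n_healthy = samples_per_file(1024)  # per healthy file

N_healthy = n_healthy * 1  # one healthy file
N_inner = n_faulty * 5     # five inner-fault files
N_outer = n_faulty * 5     # five outer-fault files

K = N_healthy + N_inner + N_outer
print(N_healthy, N_inner, N_outer, K)
```
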

# Import Libraries 

In [24]:
import torch 
import numpy as np 
import mat4py as sp
from sklearn.model_selection import train_test_split
import os 

# Data loading 

In [2]:
data_path = 'data_files'
output_dir = '../../data/pFD'
# Loading data 
data_a = sp.loadmat(os.path.join(data_path, 'A_5120L.mat'))
data_b = sp.loadmat(os.path.join(data_path, 'B_5120L.mat'))
data_c = sp.loadmat(os.path.join(data_path, 'C_5120L.mat'))
data_d = sp.loadmat(os.path.join(data_path, 'D_5120L.mat'))

##  Constructing samples for different working conditions 

#### Construct data and labels for each working condition

In [17]:
healthy_files = ['K001']
inner_fault_files = ['KI14', 'KI21', 'KI17', 'KI18', 'KI16']
out_fault_files = ['KA22', 'KA04', 'KA15', 'KA30', 'KA16']

Kat_data = {'a':{'samples':[], 'labels':[]}, 
            'b':{'samples':[], 'labels':[]},
            'c':{'samples':[], 'labels':[]},
            'd':{'samples':[], 'labels':[]}}

# Loop through healthy data (class label 0)
for data_item in healthy_files:
    wk_a_samples = torch.from_numpy(np.asarray(data_a['A_5120L']['real'][data_item]))
    wk_b_samples = torch.from_numpy(np.asarray(data_b['B_5120L']['real'][data_item]))
    wk_c_samples = torch.from_numpy(np.asarray(data_c['C_5120L']['real'][data_item]))
    wk_d_samples = torch.from_numpy(np.asarray(data_d['D_5120L']['real'][data_item]))

    wk_a_labels  = torch.LongTensor(wk_a_samples.size(0)).fill_(0)                                
    wk_b_labels  = torch.LongTensor(wk_b_samples.size(0)).fill_(0)                                       
    wk_c_labels  = torch.LongTensor(wk_c_samples.size(0)).fill_(0)                                      
    wk_d_labels  = torch.LongTensor(wk_d_samples.size(0)).fill_(0)

    Kat_data['a']['samples'].append(wk_a_samples)
    Kat_data['b']['samples'].append(wk_b_samples)
    Kat_data['c']['samples'].append(wk_c_samples)
    Kat_data['d']['samples'].append(wk_d_samples)
                                               
    Kat_data['a']['labels'].append(wk_a_labels)
    Kat_data['b']['labels'].append(wk_b_labels)
    Kat_data['c']['labels'].append(wk_c_labels)
    Kat_data['d']['labels'].append(wk_d_labels)                                            

# Loop through outer faults (class label 1)
for data_item in out_fault_files:
    wk_a_samples = torch.from_numpy(np.asarray(data_a['A_5120L']['real'][data_item]))
    wk_b_samples = torch.from_numpy(np.asarray(data_b['B_5120L']['real'][data_item]))
    wk_c_samples = torch.from_numpy(np.asarray(data_c['C_5120L']['real'][data_item]))
    wk_d_samples = torch.from_numpy(np.asarray(data_d['D_5120L']['real'][data_item]))

    wk_a_labels  = torch.LongTensor(wk_a_samples.size(0)).fill_(1)                                
    wk_b_labels  = torch.LongTensor(wk_b_samples.size(0)).fill_(1)                                       
    wk_c_labels  = torch.LongTensor(wk_c_samples.size(0)).fill_(1)                                      
    wk_d_labels  = torch.LongTensor(wk_d_samples.size(0)).fill_(1)
                                               
    Kat_data['a']['samples'].append(wk_a_samples)
    Kat_data['b']['samples'].append(wk_b_samples)
    Kat_data['c']['samples'].append(wk_c_samples)
    Kat_data['d']['samples'].append(wk_d_samples)
                                               
    Kat_data['a']['labels'].append(wk_a_labels)
    Kat_data['b']['labels'].append(wk_b_labels)
    Kat_data['c']['labels'].append(wk_c_labels)
    Kat_data['d']['labels'].append(wk_d_labels)        
    
# Loop through inner faults (class label 2)
for data_item in inner_fault_files:
    wk_a_samples = torch.from_numpy(np.asarray(data_a['A_5120L']['real'][data_item]))
    wk_b_samples = torch.from_numpy(np.asarray(data_b['B_5120L']['real'][data_item]))
    wk_c_samples = torch.from_numpy(np.asarray(data_c['C_5120L']['real'][data_item]))
    wk_d_samples = torch.from_numpy(np.asarray(data_d['D_5120L']['real'][data_item]))

    wk_a_labels  = torch.LongTensor(wk_a_samples.size(0)).fill_(2)                                
    wk_b_labels  = torch.LongTensor(wk_b_samples.size(0)).fill_(2)                                       
    wk_c_labels  = torch.LongTensor(wk_c_samples.size(0)).fill_(2)                                      
    wk_d_labels  = torch.LongTensor(wk_d_samples.size(0)).fill_(2)
                                               
    Kat_data['a']['samples'].append(wk_a_samples)
    Kat_data['b']['samples'].append(wk_b_samples)
    Kat_data['c']['samples'].append(wk_c_samples)
    Kat_data['d']['samples'].append(wk_d_samples)
                                               
    Kat_data['a']['labels'].append(wk_a_labels)
    Kat_data['b']['labels'].append(wk_b_labels)
    Kat_data['c']['labels'].append(wk_c_labels)
    Kat_data['d']['labels'].append(wk_d_labels)       
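
The three loops above differ only in the file list and the class label, so the same logic could be folded into one helper. The sketch below is a NumPy-only illustration under stated assumptions: `build_condition` and `CLASS_FILES` are hypothetical names, and `fake_real` is small synthetic data standing in for a loaded structure such as `data_a['A_5120L']['real']`.

```python
import numpy as np

# Class label -> bearing files, in the same order as the loops above.
CLASS_FILES = {
    0: ['K001'],                                  # healthy
    1: ['KA22', 'KA04', 'KA15', 'KA30', 'KA16'],  # outer faults
    2: ['KI14', 'KI21', 'KI17', 'KI18', 'KI16'],  # inner faults
}


def build_condition(real_data, class_files=CLASS_FILES):
    """Stack the windows of every listed file and build matching labels."""
    samples, labels = [], []
    for label, file_codes in class_files.items():
        for code in file_codes:
            windows = np.asarray(real_data[code])
            samples.append(windows)
            labels.append(np.full(len(windows), label, dtype=np.int64))
    return np.concatenate(samples), np.concatenate(labels)


# Synthetic stand-in: 3 windows of 8 points for each bearing file.
rng = np.random.default_rng(0)
fake_real = {
    code: rng.standard_normal((3, 8)).tolist()
    for codes in CLASS_FILES.values()
    for code in codes
}

X, y = build_condition(fake_real)
print(X.shape, np.bincount(y))
```

Calling such a helper once per working condition would replace the repeated blocks; the arrays can then be converted with `torch.from_numpy` as in the original cell.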

In [19]:
# Data splitting: stratified train/val/test split for each working condition
full_data_kat={}
for work_env in ['a', 'b', 'c', 'd']:
    data = torch.cat(Kat_data[work_env]['samples']).numpy()
    labels =  torch.cat(Kat_data[work_env]['labels']).numpy()
    
    X_train, X_test, y_train, y_test = train_test_split(data,  labels,  stratify=labels,  test_size=0.2, random_state=1)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, stratify=y_train, random_state=1)
    
    full_data_kat[work_env]= {'train':{'samples': torch.from_numpy(X_train), 'labels':torch.LongTensor(y_train)},
                           'val':{'samples':torch.from_numpy(X_val), 'labels':torch.LongTensor(y_val)},
                           'test':{'samples':torch.from_numpy(X_test), 'labels':torch.LongTensor(y_test)}}

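# Save the splits; create the output directory first if it does not exist
os.makedirs(output_dir, exist_ok=True)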
for data_idx in ['a', 'b', 'c', 'd']:
    torch.save(full_data_kat[data_idx]['train'], os.path.join(output_dir ,f'train_{data_idx}.pt'))
    torch.save(full_data_kat[data_idx]['val'],  os.path.join(output_dir , f'val_{data_idx}.pt' ))
    torch.save(full_data_kat[data_idx]['test'] , os.path.join(output_dir , f'test_{data_idx}.pt'))
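
The two-stage split in the cell above first holds out 20% of the data for testing and then takes 25% of the remainder for validation. The short calculation below, using exact fractions, shows the resulting proportions of the full dataset; the variable names are illustrative only.

```python
from fractions import Fraction

test_size = Fraction(1, 5)  # test_size=0.2 in the first split
val_size = Fraction(1, 4)   # test_size=0.25 in the second split

remainder = 1 - test_size           # share left after removing the test set
val_share = remainder * val_size    # validation share of the full dataset
train_share = remainder - val_share

print(train_share, val_share, test_size)
```

Under these settings the splits come out to roughly 60% training, 20% validation, and 20% test, with class proportions preserved by the `stratify` argument.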
