<span style="font-size:10pt">AI-ML @ ENSPIMA / v1.3 september 2024 / Jean-Luc CHARLES (Jean-Luc.charles@mailo.com) / CC BY-SA 4.0 /</span>

<div style="color:brown;font-family:arial;font-size:26pt;font-weight:bold;text-align:center"> 
Machine Learning $-$ MiniProject</div><br>
<hr>
<div style="color:blue;font-family:arial;font-size:22pt;font-weight:bold;text-align:center">
Training a neural network to diagnose bearing faults<br><br>
Part 1/3: Load the CWRU (Case Western Reserve University)<br><br>bearing dataset</div>
<hr>
Expected duration: 60 minutes (may vary with the speed of your Internet connection if you choose to download the dataset).

## Part-1 targeted learning objectives
Know how to:
- load files in *Matlab MAT-file* format with *Python*.
- dimension and fill numpy ndarrays with the data of the `.mat` files
- display a grid of data plots
- store the numpy ndarrays in a `.npz` file

<div class="alert alert-block alert-danger">
<span style="color:brown;font-family:arial;font-size:12pt"> 
It is important to use a <span style="font-weight:bold;">Python Virtual Environment</span> (PVE) for your Python projects: a PVE makes it possible to control for each project the versions of the Python interpreter and the "sensitive" modules (like tensorflow).
    
All the notebooks must be loaded in a `jupyter notebook` or `jupyter lab` launched within the <b><span style="color: rgb(100, 151, 202);" >pyml</span></b> PVE specially created for the session.    
</span></div>

In [None]:
import os, sys
import scipy.io
import numpy as np
import matplotlib.pyplot as plt

# 1 $-$ The *Case Western Reserve University* bearing dataset

The bearing data used in this notebook are provided by the **Case Western Reserve University (CWRU)** on the page [engineering.case.edu/bearingdatacenter/48k-drive-end-bearing-fault-data](https://engineering.case.edu/bearingdatacenter/48k-drive-end-bearing-fault-data) . <br>

The data were collected on a motor-driven mechanical test bench under four different loads, at a sampling frequency of 48 kHz:<br>
![CWRU test bench](./img/CWRU-TestBench.png)<br>
(source: https://engineering.case.edu/bearingdatacenter/apparatus-and-procedures)

The bearing dataset was obtained under four experimental conditions:
- Normal condition (N)
- with Outer race Fault (OF)
- with Inner race Fault (IF)
- with Roller Fault (RF).

Faulted bearings were installed into the test motor and vibration data was recorded for motor loads of 0 to 3 horsepower (motor speeds of 1797 to 1720 RPM).<br>
The faults were introduced into the drive-end bearing of the motor with fault diameters of 0.18, 0.36 and 0.54 mm.

The fault classification table is as follows:

|class label|Fault type|Fault diameter|
|:---------:|:--------:|-------------:|
| 1         | N        | 0            |
| 2         | RF       | 0.18         |
| 3         | RF       | 0.36         |
| 4         | RF       | 0.54         |
| 5         | IF       | 0.18         |
| 6         | IF       | 0.36         |
| 7         | IF       | 0.54         |
| 8         | OF       | 0.18         |
| 9         | OF       | 0.36         |
| 10        | OF       | 0.54         |
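
The table can also be held in a small Python structure, which is convenient when class labels must be turned into readable names. The sketch below is only an illustration: the names `FAULT_CLASSES` and `describe_class` are hypothetical and are not used elsewhere in this notebook, and it assumes that each fault type comes in the three diameters listed above.

```python
# Hypothetical lookup table mirroring the classification table:
# class label -> (fault type, fault diameter in mm)
FAULT_CLASSES = {1: ("N", 0.0)}
label = 2
for fault_type in ("RF", "IF", "OF"):
    for diameter in (0.18, 0.36, 0.54):
        FAULT_CLASSES[label] = (fault_type, diameter)
        label += 1

def describe_class(label):
    """Return a short readable name for a class label (1 to 10)."""
    fault_type, diameter = FAULT_CLASSES[label]
    if fault_type == "N":
        return "N"
    return f"{fault_type} {diameter:.2f} mm"

print(len(FAULT_CLASSES), describe_class(5))
```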
 


## 1.1 $-$ Download the **CWRU** dataset

The **CWRU** dataset consists of about fifty [Matlab MAT-file](https://www.mathworks.com/help/matlab/import_export/mat-file-versions.html) files that can be downloaded:

- either **manually**, by clicking on the hyperlinks in the page https://engineering.case.edu/bearingdatacenter/48k-drive-end-bearing-fault-data
- or with **Python instructions**, for example with the *wget* module, to get the `.mat` files from the directory https://engineering.case.edu/sites/default/files.

By exploring the hyperlinks of the page https://engineering.case.edu/bearingdatacenter/48k-drive-end-bearing-fault-data one can build the list of the `.mat` files that correspond to the fault classification table above:

    ['98.mat', '99.mat', '100.mat', '110.mat', '111.mat', '112.mat', '123.mat', '124.mat', '125.mat', '136.mat', '137.mat', '138.mat', '175.mat', '176.mat', '177.mat', '190.mat', '191.mat', '192.mat', '202.mat', '203.mat', '204.mat', '214.mat', '215.mat', '217.mat', '227.mat', '228.mat', '229.mat', '239.mat', '240.mat', '241.mat']

In [None]:
# define the list of the wanted '.mat' files:
CWRU_data_file = ['98.mat', '99.mat', '100.mat', 
                  '110.mat', '111.mat', '112.mat', 
                  '123.mat', '124.mat', '125.mat', 
                  '136.mat', '137.mat', '138.mat', 
                  '175.mat', '176.mat', '177.mat', 
                  '190.mat', '191.mat', '192.mat', 
                  '202.mat', '203.mat', '204.mat', 
                  '214.mat', '215.mat', '217.mat', 
                  '227.mat', '228.mat', '229.mat', 
                  '239.mat', '240.mat', '241.mat']

#### The following cell lets you download all the required `.mat` files from _engineering.case.edu_ with some Python instructions.

#### $\leadsto$ If the download of the `mat` files is too slow, you can use the `mat` files already downloaded in the `pre_loaded_dataset` directory.

In [None]:
import wget
from time import sleep

# the URL where to find the .mat files:
url = 'https://engineering.case.edu/sites/default/files'

# the directory where to store the downloaded files:
data_dir = "./CWRU_dataset/"
os.makedirs(data_dir, exist_ok=True)

# download the files and store them:
for file in CWRU_data_file:
    file_url = url + "/" + file
    target   = os.path.join(data_dir, file)
    if not os.path.exists(target):
        print(f"downloading file <{file_url}> as <{target}>")
        try:
            wget.download(file_url, target) 
        except Exception as err:
            # report the failure and carry on with the next file
            print(f"a problem occurred when loading <{file_url}>: {err}")
        print("")
    else:
        print(f"file <{target}> already exists")
    sleep(1)
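
The `wget` module is a third-party package. If it is not installed in your PVE, the same "download only if missing" logic can be sketched with the standard library. This is an alternative under assumptions, not the method used by this notebook: the helper name `download_if_missing` is hypothetical, and `urllib.request.urlretrieve` stands in for `wget.download`.

```python
import os
import urllib.request

def download_if_missing(file_url, target):
    """Download 'file_url' to 'target' unless 'target' already exists.
    Return True if a download took place, False otherwise."""
    if os.path.exists(target):
        return False
    target_dir = os.path.dirname(target)
    if target_dir:
        # create the destination directory when needed
        os.makedirs(target_dir, exist_ok=True)
    urllib.request.urlretrieve(file_url, target)
    return True
```

It could be called in the loop above as `download_if_missing(file_url, target)`; network errors would still have to be caught by the caller.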

#### $\leadsto$ Now choose your working data directory: `pre_loaded_dataset`, or `CWRU_dataset` if you downloaded the files yourself:

In [None]:
####
#### Uncomment one of the two lines:
####

#data_dir = "./pre_loaded_dataset"    # uncomment to use the pre-loaded CWRU dataset

#data_dir = "./CWRU_dataset"         # uncomment to use the CWRU dataset you have downloaded

Let's check the list of the `.mat` data files that are in your `data_dir` directory:

In [None]:
list_mat_file = [ f for f in os.listdir(data_dir) if f.endswith(".mat")]
list_mat_file.sort()
print(f"List of .mat files in <{data_dir}>:\n{list_mat_file}")
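
Reading the printed list by eye is error-prone with thirty files, so a set difference can report what is missing. The sketch below is an illustration only: `missing_files` is a hypothetical helper, and the numeric sort assumes file names of the form `<number>.mat`.

```python
def missing_files(expected, found):
    """Return the expected file names that are absent from 'found',
    sorted by their numeric prefix."""
    missing = set(expected) - set(found)
    return sorted(missing, key=lambda name: int(name.split(".")[0]))

# small made-up example:
expected = ["98.mat", "99.mat", "100.mat"]
found    = ["100.mat", "98.mat", "notes.txt"]
print(missing_files(expected, found))
```

In this notebook the call would be `missing_files(CWRU_data_file, list_mat_file)`; an empty list means that all the required files are present.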

## 1.2 $-$ Handling of the **CWRU** dataset

The `scipy.io.loadmat` function loads a `.mat` file (*MAT-file* format version < 7.3) and returns the data as a Python `dict` object:

In [None]:
data_file = os.path.join(data_dir, "98.mat")

mat98 = scipy.io.loadmat(data_file)  
mat98

You can see in the above cell that the return of `loadmat` is a Python dictionary, so let's look at its **keys**:

In [None]:
mat98.keys()
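
Besides the recorded variables, `loadmat` adds a few metadata entries whose names start and end with double underscores (`__header__`, `__version__`, `__globals__`). The self-contained sketch below writes a tiny MAT-file and filters those entries out; the variable name `X000_DE_time` and the file `demo.mat` are made up for the demonstration and are not part of the CWRU dataset.

```python
import os
import tempfile
import numpy as np
import scipy.io

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, "demo.mat")
    # write a tiny MAT-file holding one column vector (made-up variable name):
    scipy.io.savemat(demo_file, {"X000_DE_time": np.arange(5.0).reshape(-1, 1)})
    demo = scipy.io.loadmat(demo_file)

# keep only the keys that are not loadmat metadata:
data_keys = [k for k in demo if not k.startswith("__")]
print(data_keys, demo["X000_DE_time"].shape)
```

The same filter applied to `mat98` would leave only the variables stored in `98.mat`.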

The accelerometer data are associated with the keys:<br>
- `X098_DE_time`: temporal data of the accelerometer at Drive End (DE) of the test bench, sampled at 48 kHz<br>
- `X098_FE_time`: temporal data of the accelerometer at Fan End (FE) of the test bench, sampled at 48 kHz.<br><br>

The accelerometer data in the dictionary are `numpy.ndarray` objects:

In [None]:
type(mat98['X098_DE_time']), type(mat98['X098_FE_time'])

The arrays are _single-column matrices_ of accelerometer output sampled at 48 kHz:

In [None]:
mat98['X098_DE_time'].shape, mat98['X098_FE_time'].shape
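
A single-column matrix has shape `(n, 1)`, whereas most processing steps expect a 1-D array of shape `(n,)`. The sketch below shows two common ways to drop the extra dimension, using a small stand-in array rather than the real data.

```python
import numpy as np

X = np.arange(6.0).reshape(-1, 1)   # stand-in for a (n, 1) accelerometer array

x_col  = X[:, 0]      # select column 0 -> shape (n,)
x_flat = X.ravel()    # flatten the array -> shape (n,)

print(X.shape, x_col.shape, x_flat.shape)
```

The indexing form `X[:, 0]` is the one used later in this notebook when the arrays `A`, `B` and `C` are filled.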

## 1.3 $-$ Minimalist plot of data (`pyplot` style)

For simplicity, let's name the accelerometer data `X_DE` and `X_FE`, then plot them in 2 subplots:

In [None]:
X_DE, X_FE = mat98['X098_DE_time'], mat98['X098_FE_time']

The cell below draws simple plots of the last 2000 data points:

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12,5))

plt.subplot(2,1,1)
plt.plot(X_DE[-2000:], '.-b', markersize=0.6)
plt.grid()

plt.subplot(2,1,2)
plt.plot(X_FE[-2000:], '.-m', markersize=0.6)
plt.grid()

## 1.4 $-$ More elaborate plot of data (`Axes` object style)

In the cell below, we use the Matplotlib `Axes` syntax to draw more elaborate plots of the data:

In [None]:
import matplotlib.pyplot as plt

# Let's compute a time vector for abscissa:
N = 4000                 # we take only the last 4000 temporal data points
T = np.arange(N)/48e3
T *= 1e3                 # convert T to milliseconds

key_DE, key_FE = 'X098_DE_time', 'X098_FE_time'
X_DE, X_FE = mat98[key_DE], mat98[key_FE]

# min and max values for plotting:
max_value = max(X_DE[-N:].max(), X_FE[-N:].max())
min_value = min(X_DE[-N:].min(), X_FE[-N:].min())

# subplots returns a figure and an array of Axes:
fig, axes = plt.subplots(2,1, sharex=True) 

fig.suptitle(f"Accelerometers output from file <{os.path.basename(data_file)}>")
fig.set_size_inches((12,5))

axe = axes[0]
axe.plot(T, X_DE[-N:], '-b', markersize=0.6, linewidth=0.5, label=key_DE)
axe.set_ylabel("Arbitrary unit")
axe.set_ylim(min_value, max_value)
axe.legend(loc='upper right', framealpha=0.5)
axe.grid()

axe = axes[1]
axe.plot(T, X_FE[-N:], '-m', markersize=0.6, linewidth=0.4, label=key_FE)
axe.set_ylabel("Arbitrary unit")
axe.set_xlabel("Time [ms]")
axe.set_ylim(min_value, max_value)
axe.legend(loc='upper right', framealpha=0.5)
axe.grid()

plt.savefig("CWRU_data.png")
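
The time axis used above follows from the sampling frequency: two consecutive points are separated by $1/f_s$ seconds, so $N$ points span $N/f_s$ seconds. The short sketch below repeats that arithmetic on its own; the names `fs`, `step_ms` and `duration_ms` are introduced here for illustration only.

```python
import numpy as np

fs = 48e3                         # sampling frequency [Hz], as stated for this dataset
N  = 4000                         # number of plotted points
T_ms = np.arange(N) / fs * 1e3    # time of each point [ms], starting at 0

step_ms     = 1e3 / fs            # time between two consecutive points [ms]
duration_ms = N / fs * 1e3        # time span covered by N points [ms]
print(f"step: {step_ms:.4f} ms, duration: {duration_ms:.2f} ms, last tick: {T_ms[-1]:.2f} ms")
```

Note that the axis starts at 0 even though the plotted points are the last `N` of the record: it shows elapsed time within the plotted window, not time since the start of the acquisition.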

## 1.5 $-$ Creating the numpy dataset from CWRU MAT-files

Now let's define 3 datasets `A`, `B` and `C` by grouping the data for motor loads 1, 2 and 3 horsepower: the figures below show how the data of the `.mat` files are grouped, and the structure of the `A`, `B` and `C` arrays.

![img/A-B-C.png](img/A-B-C.png)

![img/array_A.png](img/array_A.png)

The cell below creates the 3 ndarrays `A`, `B` and `C` and fills them with the data from the `.mat` files, as shown in the two figures:

In [None]:
# group the CWRU file numbers in 3 datasets for the motor loads 1, 2 and 3 horsepower:
num_load_1 = ( 98, 123, 190, 227, 110, 175, 214, 136, 202, 239)
num_load_2 = ( 99, 124, 191, 228, 111, 176, 215, 137, 203, 240)
num_load_3 = (100, 125, 192, 229, 112, 177, 217, 138, 204, 241)

# We will define 3 arrays A, B and C for the 3 datasets above: 
# for each of the 10 health conditions, the data of each dataset are split into 200 samples of 1900 points.
# >>> So the shape of each array is: (10, 200, 1900):

nb_HC       = 10        # number of health conditions
nb_sample   = 200       # number of samples per health condition
sample_size = 1900      # number of data points in a sample

A = np.zeros((nb_HC, nb_sample, sample_size), dtype=float)
B = np.zeros((nb_HC, nb_sample, sample_size), dtype=float)
C = np.zeros((nb_HC, nb_sample, sample_size), dtype=float)

# Now we loop simultaneously across the file numbers and the dataset arrays to fill the arrays
# with the file data:
for num_load, target_array in zip((num_load_1, num_load_2, num_load_3), (A, B, C)):
    
    for hc, file_num in enumerate(num_load):
        # 'hc' is the health condition rank in [0,9]
            
        # build the 'mat' file path with 'file_num':
        mat_file = os.path.join(data_dir,f"{file_num}.mat")
        print(f"Loading file <{mat_file:8s}>")
        
        # load the data of the file in the dict 'data':
        data = scipy.io.loadmat(mat_file) 
        
        # build the key and get the data we want from the dictionary:
        key = f"X{file_num:03d}_DE_time"
        X = data[key]
        print(f'\t got values for key {key}, X.shape:{X.shape}')
        
        # Try to split the data across the array dimensions:
        try:
            for s in range(nb_sample):
                # s is the sample number, hc is the health condition rank in [0,9]
                target_array[hc, s] = X[s*sample_size:(s+1)*sample_size, 0]
            print(f'\t target array filled with data')
        except Exception as err:
            # typically raised when the file holds too few points to fill all the samples
            print(f"\t Error with file <{file_num}.mat>: {err}")
    print('-'*80)
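
Filling one health condition requires `nb_sample * sample_size` = 200 × 1900 = 380 000 points per file; any points beyond that are left unused. The inner loop over `s` can also be written as a single `reshape`. The sketch below checks on made-up data that both forms give the same result; it uses small sizes and random values instead of the CWRU files, and is an alternative formulation rather than the code of this notebook.

```python
import numpy as np

nb_sample, sample_size = 4, 5            # small demo values (200 and 1900 in the notebook)
rng = np.random.default_rng(0)
X = rng.normal(size=(23, 1))             # stand-in for a (n, 1) column loaded from a .mat file

# explicit loop, as in the cell above:
by_loop = np.zeros((nb_sample, sample_size))
for s in range(nb_sample):
    by_loop[s] = X[s*sample_size:(s+1)*sample_size, 0]

# same result with one reshape of the first nb_sample*sample_size points:
nb_points  = nb_sample * sample_size
by_reshape = X[:nb_points, 0].reshape(nb_sample, sample_size)

print(by_loop.shape, np.array_equal(by_loop, by_reshape))
```

With the reshape form, a file that is too short fails immediately with a `ValueError`, since the truncated slice cannot be reshaped to `(nb_sample, sample_size)`.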

## 1.6 $-$ Plot the data

Here we will plot the data for sample #0 of the 3 arrays A, B & C:

In [None]:
# create the list of the health condition labels:
health_cond = ['N']
for def_type in 'RF', 'IF', 'OF':
    for size in '18', '36', '54':
        health_cond.append(f"{def_type}.{size}")
print(f"list of {len(health_cond)} health conditions:", health_cond)

In [None]:
# define 'nb_HC', the number of health conditions:
nb_HC = len(health_cond)

# define 'nb_L', the number of load cases:
full_dataset = (A, B, C)
nb_L = len(full_dataset)

s_num = 0  # the sample number

plt.rcParams['font.size'] = 6   # change the pyplot default font size
fig, axes = plt.subplots(nb_HC, nb_L, sharex=True)
fig.set_size_inches((8,12))
plt.subplots_adjust(top=.95, wspace=0.25, hspace=0.5)
plt.suptitle(f"Plots for the sample #{s_num}", fontsize=10)

for n, dataset in enumerate(full_dataset):
    for hc in range(nb_HC):
        axe = axes[hc, n]
        axe.set_title(f"Load_{n+1} / health cond {health_cond[hc]}", fontsize=8)
        axe.plot(dataset[hc, s_num], linewidth=0.4)
        if hc == nb_HC-1: axe.set_xlabel("Rank")

plt.rcParams['font.size'] = 10  # restore the pyplot default font size

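## 1.7 $-$ Export the numpy datasets to a `.npz` file

The function `savez` of `numpy` takes `ndarray` objects as arguments and writes them to a single binary file, adding the `.npz` extension to the given name. Note that `savez` does not compress the data; `savez_compressed` has the same interface and does: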

In [None]:
np.savez('CWRU_dadaset', A, B, C)
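
Arrays passed to `savez` as positional arguments are stored under the names `arr_0`, `arr_1`, and so on, in the order given. The sketch below saves two small made-up arrays to a temporary directory and reads them back; `A_demo`, `B_demo` and the file name `demo_dataset` are placeholders used only for this illustration.

```python
import os
import tempfile
import numpy as np

A_demo = np.zeros((2, 3))
B_demo = np.ones((2, 3))

with tempfile.TemporaryDirectory() as tmp_dir:
    npz_file = os.path.join(tmp_dir, "demo_dataset")   # savez appends '.npz'
    np.savez(npz_file, A_demo, B_demo)                 # positional -> 'arr_0', 'arr_1'
    with np.load(npz_file + ".npz") as npz:
        names  = list(npz.files)                       # names of the stored arrays
        A_back = npz["arr_0"]
        B_back = npz["arr_1"]

print(names, np.array_equal(A_back, A_demo), np.array_equal(B_back, B_demo))
```

Passing the arrays as keyword arguments instead, for example `np.savez(npz_file, A=A_demo, B=B_demo)`, stores them under the names `A` and `B`, which can make the loading code easier to read.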

### Further work:
In the next notebook (`2-process_CWRU_data.ipynb`), you will load the `CWRU_dadaset.npz` file written by the cell above and learn how to pre-process the dataset to prepare the training of the neural network.