# 2.11 ML-ready data


Preparing and pre-processing data so that it integrates into a machine learning workflow is fundamental to a good machine learning project.

- Organize the data in machine-readable formats and data structures that can be manipulated automatically in the ML workflow:
    * arrange data in NumPy arrays, xarray objects, or pandas DataFrames.
    * save data and its attributes in Zarr, HDF5, or CSV files.
- Extract features from the data as a first step toward dimensionality reduction:
    * extract statistical, temporal, or spectral features (use tsfresh, tsfel, ...)
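    * transform the data into Fourier or wavelet space (use the `scipy.fft` module, or a continuous wavelet transform (CWT) such as `pywt.cwt` from PyWavelets, since `scipy.signal.cwt` was deprecated in SciPy 1.12)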
    * reduce the dimension by applying PCA or ICA to the data (use the scikit-learn `PCA` or `FastICA` classes). Save these features to a file or as metadata.
    * explore the dimensionality of the remaining feature space and find correlations among features (use plotly interactive plots, seaborn scatter plots, or the correlation matrix from `pandas.DataFrame.corr`)
    * Further reduce the dimension using:
        + Feature *selection* keeps a subset of the original dimensions that explains the data with little loss of information, which reduces the dimensionality of the input data. A *forward selection* approach starts with the single variable that decreases the error the most and adds further variables one by one. A *backward selection* approach starts with all variables and removes them one by one.
        + Feature *extraction* finds a new set of dimensions as combinations of the original dimensions. These methods can be supervised or unsupervised, depending on whether they use the output information.
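
The feature steps above can be sketched on synthetic data. This is a minimal illustration, not a recommended recipe: the data set (noisy sine waves), the sampling rate, the five chosen features, and all variable names are assumptions made for the example. It computes a few statistical and spectral features with NumPy and `scipy.fft`, stores them in a pandas DataFrame, inspects their correlation matrix, and reduces them with scikit-learn `PCA`.

```python
import numpy as np
import pandas as pd
from scipy.fft import rfft, rfftfreq
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data set: 200 noisy sine waves, 512 samples each, sampled at 100 Hz.
n_series, n_samples, fs = 200, 512, 100.0
t = np.arange(n_samples) / fs
freqs_true = rng.uniform(2.0, 20.0, size=n_series)
X_raw = np.sin(2 * np.pi * freqs_true[:, None] * t)
X_raw = X_raw + 0.3 * rng.standard_normal((n_series, n_samples))

# Move to Fourier space: amplitude spectrum of each series (one row per series).
spectrum = np.abs(rfft(X_raw, axis=1))
freqs = rfftfreq(n_samples, d=1.0 / fs)
power = spectrum**2

# Statistical and spectral features, one row per time series.
features = pd.DataFrame(
    {
        "mean": X_raw.mean(axis=1),
        "std": X_raw.std(axis=1),
        "max_abs": np.abs(X_raw).max(axis=1),
        # Skip the zero-frequency bin when searching for the spectral peak.
        "dominant_freq": freqs[np.argmax(spectrum[:, 1:], axis=1) + 1],
        "spectral_centroid": (power * freqs).sum(axis=1) / power.sum(axis=1),
    }
)

# Correlations among features hint at redundant dimensions.
corr = features.corr()
print(corr.round(2))

# Standardize, then project onto the first two principal components.
X_scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)
```

Standardizing before PCA is a design choice here: the features have very different units (amplitude versus Hz), and without scaling the components would be dominated by the feature with the largest variance.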



- Save the data processing workflow from raw data to feature data. 
    * Use the scikit-learn [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class.
    * Write a Python script to reproduce the pre-processing.
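
As a rough sketch of how a pre-processing workflow might be saved and reproduced, the script below chains a scaler and a PCA step in a scikit-learn `Pipeline`, stores the fitted pipeline with `joblib`, and reloads it. The random input table, the step names, the number of components, and the file name `preprocess.joblib` are all placeholders chosen for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical raw data: 100 samples with 10 variables each.
X_raw = rng.standard_normal((100, 10))

# Chain the pre-processing steps so they are always applied in the same order.
preprocess = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=3)),
    ]
)
X_features = preprocess.fit_transform(X_raw)

# Persist the fitted pipeline, then reload it as a later script would.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "preprocess.joblib"  # placeholder file name
    joblib.dump(preprocess, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the same features on new data.
X_new = rng.standard_normal((5, 10))
same = np.allclose(preprocess.transform(X_new), restored.transform(X_new))
print(X_features.shape, same)
```

In a real project the pipeline file would be written next to the feature data rather than to a temporary directory, and it is generally safest to reload it with the same scikit-learn version that created it.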
