# Data cleaning & preprocessing

This notebook contains all the steps we performed during data cleaning and preprocessing. Before running it, make sure you have read the [README.md](../README.md) file.

In [None]:
%load_ext autoreload
%autoreload 2

from assaiku.data import DataConfig, DataPipe
from assaiku.data.validation import load_and_validate
from assaiku.data.processing import remove_group_duplicates, filter_outliers

import pandas as pd

pd.set_option("display.max_columns", 50)

data_config = DataConfig(perform_exploration=False)

## Loading and validating data

First we load the data and validate it. For validation we use the `pandera` library, which checks several things:
- the data type of each column
- numerical constraints (for instance `age >= 0`)
- categorical constraints (for instance ``sex is in [Male, Female]``)

In [None]:
train_df, test_df = load_and_validate(data_config=data_config)

## Removing duplicates

Here are the operations we are performing:
- We remove some duplicates (including the instance group)
- Then we group identical instances together and sum their ``instance_weights``
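
The grouping step above can be sketched in plain pandas as follows. This is a minimal sketch and not the implementation of `remove_group_duplicates`: the helper name `sum_weights_of_duplicates` and the toy columns are hypothetical, and it assumes that "the same instances" means rows that agree on every column except the weight.

```python
import pandas as pd


def sum_weights_of_duplicates(df: pd.DataFrame, weight_col: str) -> pd.DataFrame:
    """Collapse rows that share all non-weight columns, summing their weights."""
    feature_cols = [c for c in df.columns if c != weight_col]
    return df.groupby(
        feature_cols,
        as_index=False,  # keep the grouping columns as regular columns
        sort=False,      # preserve the order of first appearance
        dropna=False,    # treat missing values as a valid group key
    )[weight_col].sum()


# Toy data: the first two rows describe the same instance.
toy = pd.DataFrame(
    {
        "age": [30, 30, 45],
        "sex": ["Male", "Male", "Female"],
        "instance_weights": [1.5, 2.0, 1.0],
    }
)

deduplicated = sum_weights_of_duplicates(toy, weight_col="instance_weights")
print(deduplicated)
```

Summing the weights rather than dropping the duplicates keeps the total weight of the dataset unchanged.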

In [None]:
train_df = remove_group_duplicates(train_df, weight_col=data_config.weight_col)
test_df = remove_group_duplicates(test_df, weight_col=data_config.weight_col)

## Checking for outliers

We check for outliers using the *Empirical Cumulative Distribution-based Outlier Detection (ECOD)* method on the continuous features:
- The distribution of outlier scores is displayed
- An example of an outlier is also shown
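
To give an intuition of how ECOD scores are obtained, here is a simplified NumPy sketch of the idea: each feature gets a left-tail and a right-tail empirical CDF, and a sample is scored by how far into a tail it sits across all features. The function name `ecod_scores` is hypothetical, the aggregation follows our reading of the method's description, and this is not the code run by `filter_outliers`, whose internals are not shown in this notebook.

```python
import numpy as np


def ecod_scores(X) -> np.ndarray:
    """Simplified ECOD-style outlier scores (higher means more anomalous)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    left = np.empty_like(X)   # fraction of samples <= x, per feature
    right = np.empty_like(X)  # fraction of samples >= x, per feature
    for j in range(d):
        col = X[:, j]
        sorted_col = np.sort(col)
        left[:, j] = np.searchsorted(sorted_col, col, side="right") / n
        right[:, j] = (n - np.searchsorted(sorted_col, col, side="left")) / n

    # Small tail probabilities become large positive values.
    neg_log_left = -np.log(left)
    neg_log_right = -np.log(right)

    # The sign of the skewness picks the tail to use for each feature:
    # negative skew -> left tail, otherwise right tail.
    centered = X - X.mean(axis=0)
    skew_sign = np.sign((centered ** 3).mean(axis=0))
    neg_log_auto = np.where(skew_sign < 0, neg_log_left, neg_log_right)

    # Sum over features, then keep the largest of the three aggregates.
    return np.maximum.reduce(
        [
            neg_log_left.sum(axis=1),
            neg_log_right.sum(axis=1),
            neg_log_auto.sum(axis=1),
        ]
    )


# Toy data: 200 standard-normal points plus one obvious outlier appended last.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[10.0, -10.0]]])
scores = ecod_scores(X)
print(int(np.argmax(scores)))  # index of the most anomalous sample
```

Because the method only relies on empirical CDFs, it has no hyperparameters to tune; the only choice left is the score threshold above which a sample is treated as an outlier.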

In [None]:
clean_train, clean_test = filter_outliers(
    train_data=train_df,
    test_data=test_df,
    numerical_cols=data_config.numerical_cols,
    threshold=data_config.threshold_outlier,
    folder_path="./preprocessing",
)

## Saving clean data

In [None]:
clean_train.to_parquet(data_config.train_data_out)
clean_test.to_parquet(data_config.test_data_out)

## Running all the previous steps in one line

The data cleaning and preprocessing steps are part of the data pipeline. You can run all the previous steps at once by running the next cell.

In [None]:
data_config = DataConfig(perform_exploration=False)
data_pipeline = DataPipe(data_config=data_config)
data_pipeline.run()