# Data Exploration

This notebook contains all the steps we performed during the data exploration phase. Before running it, make sure you have read the [README.md](../README.md) file.

In [None]:
%load_ext autoreload
%autoreload 2

from assaiku.data import DataConfig, DataPipe
from assaiku.data.validation import load_and_validate
from assaiku.data.exploration import (
    visualize_categorical_dist, 
    visualize_continuous_dist, 
    visualize_correlation,
    analyze_nans, 
    analyze_label_dist,
    visualize_distance
)
from assaiku.data.processing import remove_group_duplicates, filter_on_age
import pandas as pd

pd.set_option('display.max_columns', 50)

data_config = DataConfig(perform_exploration=True)

## Loading and validating data

First we load the data and validate it. For validation we use the `pandera` library, which checks several things:
- the data type of each column
- numerical constraints (for instance `age >= 0`)
- categorical constraints (for instance ``sex is in [Male, Female]``)
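
To make the three kinds of checks concrete, here is a minimal hand-rolled sketch in plain `pandas`. It is not the `pandera` schema used by `load_and_validate`; the function name `validate_frame` and the toy frames are hypothetical, and only the `age` and `sex` rules quoted above are covered.

```python
import pandas as pd


def validate_frame(df: pd.DataFrame) -> list:
    """Return a list of validation error messages (empty if the frame is valid).

    Hypothetical helper: it mimics the three kinds of checks listed above.
    """
    errors = []
    # 1. Data type check: age is expected to be an integer column.
    if not pd.api.types.is_integer_dtype(df["age"]):
        errors.append("age must be an integer column")
    # 2. Numerical constraint: age >= 0 (only meaningful if the dtype is right).
    elif (df["age"] < 0).any():
        errors.append("age must be >= 0")
    # 3. Categorical constraint: sex must be one of the allowed values.
    if not df["sex"].isin(["Male", "Female"]).all():
        errors.append("sex must be one of Male, Female")
    return errors


good = pd.DataFrame({"age": [25, 40], "sex": ["Male", "Female"]})
bad = pd.DataFrame({"age": [25, -3], "sex": ["Male", "Unknown"]})

print(validate_frame(good))  # no errors expected
print(validate_frame(bad))   # one error per violated rule
```

A schema library expresses the same rules declaratively, which keeps them in one place and gives consistent error reports.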

In [None]:
train_df, test_df = load_and_validate(data_config=data_config)

## Some data cleaning

Before the analysis we will check several things:
- The missing values in the dataset
- The duplicates
- Filtering out some rows that may bias the analysis

### Missing values

Do we have any missing values in our datasets?
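
As a rough idea of what such a check involves, the sketch below counts missing values per column with plain `pandas`. The helper `nan_summary` and the toy data are hypothetical; the actual `analyze_nans` implementation may report things differently.

```python
import numpy as np
import pandas as pd


def nan_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column count and percentage of missing values (hypothetical helper)."""
    counts = df.isna().sum()
    summary = pd.DataFrame(
        {"n_missing": counts, "pct_missing": 100 * counts / len(df)}
    )
    # Keep only the columns that actually contain missing values.
    return summary[summary["n_missing"] > 0].sort_values(
        "n_missing", ascending=False
    )


toy = pd.DataFrame(
    {
        "age": [25, np.nan, 40, 31],
        "sex": ["Male", "Female", None, None],
        "weeks": [52, 0, 12, 30],
    }
)
print(nan_summary(toy))
```

Note that `isna()` only detects real `NaN`/`None` values. Placeholder strings that encode "unknown" would need to be converted first.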

In [None]:
print("Analyze nans in train dataframe")
analyze_nans(train_df)
print("Analyze nans in test dataframe")
analyze_nans(test_df)

### Removing duplicates

Here are the operations we are performing:
- We remove some duplicates (including the instance group)
- Then we group the same instances together and sum their ``instance_weights``
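
One possible reading of these two steps, sketched with plain `pandas`: first drop rows that are identical in every column, then merge rows that agree on every column except the weight and sum their weights. The helper `merge_duplicates` and the toy data are hypothetical; `remove_group_duplicates` may differ in its details.

```python
import pandas as pd


def merge_duplicates(df: pd.DataFrame, weight_col: str) -> pd.DataFrame:
    """Hypothetical sketch of duplicate removal followed by weight aggregation."""
    # Step 1: drop rows that are identical in every column, weight included.
    deduped = df.drop_duplicates()
    # Step 2: group rows sharing all non-weight columns and sum their weights.
    feature_cols = [c for c in deduped.columns if c != weight_col]
    return deduped.groupby(feature_cols, as_index=False, dropna=False).agg(
        {weight_col: "sum"}
    )


toy = pd.DataFrame(
    {
        "age": [30, 30, 30, 45],
        "sex": ["Male", "Male", "Male", "Female"],
        "instance_weight": [100.0, 100.0, 250.0, 80.0],
    }
)
merged = merge_duplicates(toy, weight_col="instance_weight")
print(merged)
```

Passing `dropna=False` keeps rows whose grouping columns contain missing values, which `groupby` would otherwise drop silently.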

In [None]:
clean_train_df = remove_group_duplicates(train_df, weight_col=data_config.weight_col)
clean_test_df = remove_group_duplicates(test_df, weight_col=data_config.weight_col)

## Analysis

Let's first look at the distribution of labels in each dataset.

In [None]:
# Let's look at the distribution of labels
print("Train")
analyze_label_dist(data=clean_train_df, data_config=data_config)
print("Test")
analyze_label_dist(data=clean_test_df, data_config=data_config)

### Filtering out some rows

To avoid biasing the analysis, we filter out some rows based on age. Children (``age < 16``) are removed because they would skew the statistics (we checked that all of those children fall in the 50k- group).
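
The filter and the sanity check mentioned above can be sketched as follows. The helper `filter_adults`, the column names `age` and `income`, and the toy data are assumptions for illustration; `filter_on_age` may be implemented differently.

```python
import pandas as pd


def filter_adults(df: pd.DataFrame, age_col: str = "age", min_age: int = 16) -> pd.DataFrame:
    """Keep only rows with age >= min_age (hypothetical helper)."""
    return df[df[age_col] >= min_age].reset_index(drop=True)


toy = pd.DataFrame(
    {
        "age": [8, 15, 16, 42],
        "income": ["50k-", "50k-", "50k-", "50k+"],
    }
)

# Sanity check before filtering: every child is in the 50k- group.
children = toy[toy["age"] < 16]
assert (children["income"] == "50k-").all()

adults = filter_adults(toy)
print(len(toy), "->", len(adults), "rows after filtering")
```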

In [None]:
clean_train_df = filter_on_age(clean_train_df)

### 1D analysis of continuous features

To find out whether any continuous feature correlates with income, we ran the following analysis:
- Computation of the correlation coefficient between the feature and the income
- Visualization of the distribution for each group (50k+, 50k-).
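
Since income is a two-valued label, one simple way to obtain such a coefficient is to encode it as 0/1 and compute the Pearson correlation with each feature. The sketch below assumes this approach and uses hypothetical column names and toy data; `visualize_correlation` may use a different coefficient.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "age": [22, 35, 47, 51, 29, 60],
        "weeks_worked": [10, 52, 52, 48, 20, 52],
        "income": ["50k-", "50k-", "50k+", "50k+", "50k-", "50k+"],
    }
)

# Encode the label as 1 for the 50k+ group and 0 for the 50k- group.
is_high_income = (toy["income"] == "50k+").astype(int)

# Pearson correlation of every continuous feature with the encoded label.
correlations = toy[["age", "weeks_worked"]].corrwith(is_high_income)
print(correlations.sort_values(ascending=False))
```

A coefficient close to zero only rules out a linear relationship, which is why the per-group distribution plots are a useful complement.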

In [None]:
visualize_correlation(data=clean_train_df,
                      data_config=data_config,)

In [None]:
visualize_continuous_dist(data=clean_train_df,
                          data_config=data_config,
                          folder_path="results/exploration/continuous",
                          filter_cols=None,  # alternatively: ["age", "wage_per_hour"]
                          close_figs=False,
                          )

### 1D analysis of categorical features

To find out whether the distribution of values within each categorical feature differs between the two groups, we looked at the following:
- For each categorical feature, the distance between the two distributions (we used the Wasserstein distance)
- Visualization of the distribution of values for each group (50k+, 50k-) and for each categorical feature.
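
The distance computation can be sketched with `scipy.stats.wasserstein_distance`, using the per-group proportions of each category value as weights. The Wasserstein distance needs positions on a line, so this sketch places the category values at integer positions `0..k-1` in sorted order. That placement is an assumption (the result depends on it), and `visualize_distance` may handle it differently. The helper `category_distance`, the `income` column name, and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance


def category_distance(df: pd.DataFrame, col: str, label_col: str = "income") -> float:
    """Wasserstein distance between the 50k+ and 50k- distributions of `col`."""
    # Proportion of each category value within each income group.
    props = pd.crosstab(df[col], df[label_col], normalize="columns")
    # Assumption: category values sit at integer positions 0..k-1.
    positions = np.arange(len(props))
    return wasserstein_distance(
        positions,
        positions,
        u_weights=props["50k+"].to_numpy(),
        v_weights=props["50k-"].to_numpy(),
    )


toy = pd.DataFrame(
    {
        "sex": ["Male", "Male", "Male", "Female", "Male", "Female", "Female", "Female"],
        "region": ["North", "South", "North", "South", "North", "South", "North", "South"],
        "income": ["50k+", "50k+", "50k+", "50k+", "50k-", "50k-", "50k-", "50k-"],
    }
)

sex_distance = category_distance(toy, "sex")
region_distance = category_distance(toy, "region")
print("sex:", sex_distance, "region:", region_distance)
```

In this toy data, `region` is distributed identically in both groups, so its distance is zero, while `sex` differs between the groups and gets a positive distance.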

In [None]:
visualize_distance(data=clean_train_df, data_config=data_config, truncate=15)

In [None]:
visualize_categorical_dist(data=clean_train_df,
                           data_config=data_config,
                           folder_path="results/exploration/categorical",
                           close_figs=False,
                           filter_cols=["detailed_industry_recode",
                                        "sex",
                                        "detailed_household_summary_in_household",
                                        "major_occupation_code"])

## Running all the previous steps in one line

The data exploration pipeline is part of the data pipeline, so you can run all the previous steps by running the next cell.

In [None]:
data_config = DataConfig(perform_exploration=True)
data_pipeline = DataPipe(data_config=data_config)
data_pipeline.run()