# Dataset Cleaning

The main objectives for this notebook are:
* Get familiar with the dataset
* Discover the data quality issues
* Fix the data quality issues


The skills that you need to showcase:
* Your ability to load, wrangle and analyse data
* Your knowledge of data pre-processing steps

## How to stand out?
1. Use a non-Pandas DataFrame library such as Polars or PySpark
2. Build a cleaning pipeline at the end of the notebook

In [21]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [22]:
import sys, os

# Path needs to be added manually to read from another folder
path2add = os.path.normpath(os.path.abspath(os.path.join(os.path.dirname('__file__'), os.path.pardir, 'utils')))
if path2add not in sys.path:
    sys.path.append(path2add)
    
import polars as pl
import plotly.express as px
from cleaning import count_missing

## Data Ingestion

In [25]:
data = pl.read_csv("../data/supervised_dataset.csv")
print(data.shape)
data.head()

(1699, 12)


Unnamed: 0_level_0,_id,inter_api_access_duration(sec),api_access_uniqueness,sequence_length(count),vsession_duration(min),ip_type,num_sessions,num_users,num_unique_apis,source,classification
i64,str,f64,f64,f64,i64,str,f64,f64,f64,str,str
0,"""1f2c32d8-2d6e-…",0.000812,0.004066,85.643243,5405,"""default""",1460.0,1295.0,451.0,"""E""","""normal"""
1,"""4c486414-d4f5-…",6.3e-05,0.002211,16.166805,519,"""default""",9299.0,8447.0,302.0,"""E""","""normal"""
2,"""7e5838fc-bce1-…",0.004481,0.015324,99.573276,6211,"""default""",255.0,232.0,354.0,"""E""","""normal"""
3,"""82661ecd-d87f-…",0.017837,0.014974,69.792793,8292,"""default""",195.0,111.0,116.0,"""E""","""normal"""
4,"""d62d56ea-775e-…",0.000797,0.006056,14.952756,182,"""default""",272.0,254.0,23.0,"""E""","""normal"""


## Data Profiling

In [26]:
print("Original shape:", data.shape)
print("Columns:", data.columns)

Original shape: (1699, 12)
Columns: ['', '_id', 'inter_api_access_duration(sec)', 'api_access_uniqueness', 'sequence_length(count)', 'vsession_duration(min)', 'ip_type', 'num_sessions', 'num_users', 'num_unique_apis', 'source', 'classification']


In [27]:
data.describe()

describe,Unnamed: 1_level_0,_id,inter_api_access_duration(sec),api_access_uniqueness,sequence_length(count),vsession_duration(min),ip_type,num_sessions,num_users,num_unique_apis,source,classification
str,f64,str,f64,f64,f64,f64,str,f64,f64,f64,str,str
"""count""",1699.0,"""1699""",1699.0,1699.0,1699.0,1699.0,"""1699""",1699.0,1699.0,1699.0,"""1699""","""1699"""
"""null_count""",0.0,"""0""",4.0,4.0,0.0,0.0,"""0""",0.0,0.0,0.0,"""0""","""0"""
"""mean""",849.0,,1.501123,0.173226,61.648982,6028.340789,,564.726898,406.263685,67.246616,,
"""std""",490.60337,,21.697558,0.283641,205.803273,46650.419622,,1179.9312,960.71858,82.189214,,
"""min""",0.0,"""00041830-3168-…",3e-06,0.0012,0.0,1.0,"""datacenter""",2.0,1.0,0.0,"""E""","""normal"""
"""25%""",424.0,,0.000707,0.009192,9.969512,63.0,,5.0,1.0,14.0,,
"""50%""",849.0,,0.002574,0.018717,17.095238,195.0,,164.0,141.0,37.0,,
"""75%""",1274.0,,0.024822,0.230769,41.446352,3714.0,,447.0,309.0,90.0,,
"""max""",1698.0,"""ffbf4937-68e6-…",852.92925,1.0,3303.0,1352948.0,"""default""",9299.0,8447.0,524.0,"""F""","""outlier"""


**Observations**
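* The dataset seems to be very clean, with just 4 rows with missing values (in `inter_api_access_duration(sec)` and `api_access_uniqueness`)
* The unnamed first column is a numerical row index (0 to 1698), separate from the string `_id` column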
* There are some extreme outliers in `inter_api_access_duration(sec)` and `vsession_duration(min)` columns

**Impact**
* The numerical index column needs to be dropped since it carries no information

### Missing Data

In [30]:
data.filter(pl.col("inter_api_access_duration(sec)").is_null())

Unnamed: 0_level_0,_id,inter_api_access_duration(sec),api_access_uniqueness,sequence_length(count),vsession_duration(min),ip_type,num_sessions,num_users,num_unique_apis,source,classification
i64,str,f64,f64,f64,i64,str,f64,f64,f64,str,str
1556,"""8e8b99bb-7b6d-…",,,0.0,3,"""datacenter""",2.0,1.0,0.0,"""E""","""outlier"""
1567,"""bedfd600-80ef-…",,,0.0,3,"""datacenter""",4.0,1.0,0.0,"""E""","""outlier"""
1619,"""60a25ad0-add8-…",,,0.0,3,"""datacenter""",2.0,1.0,0.0,"""E""","""outlier"""
1647,"""70b6a9dd-e4c6-…",,,0.0,3,"""datacenter""",4.0,1.0,0.0,"""E""","""outlier"""


**Observations**
* The 4 rows with missing data all come from datacenter IPs and are classed as outliers
* `sequence_length(count)` is equal to zero in all of them, which means that no API calls were made

**Impact**
* These rows can be dropped: no API calls were made, so they are not supposed to be in this dataset

### Outliers

In [33]:
px.box(
    x=data["inter_api_access_duration(sec)"].to_list(),
    title="Inter API Duration Boxplot",
)

In [39]:
data.filter(pl.col("inter_api_access_duration(sec)") > 100 )

Unnamed: 0_level_0,_id,inter_api_access_duration(sec),api_access_uniqueness,sequence_length(count),vsession_duration(min),ip_type,num_sessions,num_users,num_unique_apis,source,classification
i64,str,f64,f64,f64,i64,str,f64,f64,f64,str,str
1693,"""d8ac0f74-473a-…",159.783857,0.357143,14.0,134219,"""datacenter""",2.0,1.0,5.0,"""F""","""outlier"""
1695,"""44356d09-52e9-…",852.92925,0.5,2.0,102352,"""datacenter""",2.0,1.0,1.0,"""F""","""outlier"""


**Observations**
* The 2 most extreme outliers (above 100 seconds) both come from datacenter IPs
* These seem to be valid data points; they're just anomalous

**Impact**
* These outliers won't be dropped since they represent data that was seen

## Data Pre-processing Pipeline

Based on the analysis above, the following data cleaning and pre-processing steps will be taken:
* Rows with null values will be removed
* The `classification` variable will be turned into a boolean (`is_anomaly`) for ease of analysis
* Data will be written out as a parquet file without the numerical index column

In [40]:
def cleaning_pipeline(data: pl.DataFrame, output_path: str) -> None:
    """Data cleaning and processing pipeline

    Args:
        data (pl.DataFrame): input dataset that needs to be pre-processed
        output_path (str): location of where to save the parquet file
    """
    (
        data
        # Remove the rows with missing values (no API calls were made)
        .filter(pl.col("inter_api_access_duration(sec)").is_not_null())
        # Drop the unnamed numerical index column (read in as '')
        .drop("")
        # Encode the target as a boolean: True for outliers
        .with_columns(is_anomaly=pl.col("classification") == "outlier")
        .write_parquet(output_path)
    )

In [41]:
output = "../data/supervised_clean_data.parquet"
cleaning_pipeline(data, output)