# Exploratory Data Analysis

The main objectives for this notebook are:
* Explore the clean dataset by performing univariate analysis
* Investigate the relationships between your features and your target by performing bivariate and multivariate analyses
* Extract relevant insights to share with business stakeholders
* Understand steps that will be required for ML pre-processing

The skills that you need to showcase:
* Your ability to raise hypotheses, answer them, and interpret the results
* Your data wrangling and visualisation skills

## How to stand out?
1. Use a non-Pandas DataFrame library such as Polars or PySpark
2. Use interactive plots (e.g. Plotly) for visualisations (don't forget to render your notebooks as HTML)
    - Your visualisations should have some additional formatting
3. Write clear insights after every section of the analysis
4. Use well-written and documented utility functions

# Imports

In [109]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [110]:
import sys, os

# The utils folder must be added to sys.path manually so its modules can be imported
path2add = os.path.normpath(os.path.abspath(os.path.join(os.path.dirname('__file__'), os.path.pardir, 'utils')))
if path2add not in sys.path:
    sys.path.append(path2add)

import polars as pl
import plotly.express as px
from visualisations import bar_plot, proportion_plot, boxplot_by_bin_with_target
# etc

# Data

In [111]:
data = pl.read_parquet("../data/supervised_clean_data.parquet")
print(data.shape)
data.head()

(1695, 13)


| `Unnamed: 0_level_0` | `_id` | `inter_api_access_duration(sec)` | `api_access_uniqueness` | `sequence_length(count)` | `vsession_duration(min)` | `ip_type` | `num_sessions` | `num_users` | `num_unique_apis` | `source` | `classification` | `is_anomaly` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | str | f64 | f64 | f64 | i64 | str | f64 | f64 | f64 | str | str | bool |
| 0 | "1f2c32d8-2d6e-…" | 0.000812 | 0.004066 | 85.643243 | 5405 | "default" | 1460.0 | 1295.0 | 451.0 | "E" | "normal" | False |
| 1 | "4c486414-d4f5-…" | 6.3e-05 | 0.002211 | 16.166805 | 519 | "default" | 9299.0 | 8447.0 | 302.0 | "E" | "normal" | False |
| 2 | "7e5838fc-bce1-…" | 0.004481 | 0.015324 | 99.573276 | 6211 | "default" | 255.0 | 232.0 | 354.0 | "E" | "normal" | False |
| 3 | "82661ecd-d87f-…" | 0.017837 | 0.014974 | 69.792793 | 8292 | "default" | 195.0 | 111.0 | 116.0 | "E" | "normal" | False |
| 4 | "d62d56ea-775e-…" | 0.000797 | 0.006056 | 14.952756 | 182 | "default" | 272.0 | 254.0 | 23.0 | "E" | "normal" | False |


## Univariate Analysis
This section goes through the available columns and plots them to inspect their distributions, outliers, etc. This introduces the dataset and helps you get familiar with it.

In [112]:
bar_plot(data, "ip_type", "IP Type Counts")

**Observations:**
* There are just two IP types, `default` and `datacenter`, with `default` being the most frequent one

## Features vs Target

This section performs a bivariate analysis by comparing the feature distributions of normal and anomalous traffic. This helps determine which data filtering and feature selection steps to perform.

In [113]:
proportion_plot(data, "ip_type", "is_anomaly", "Behaviour Type by IP Type")

**Observations:**
* If the activity comes from a `datacenter` IP, it is always labelled as an anomaly in this dataset

**Impact:**
* The dataset needs to be filtered to include only `default` traffic, since a simple rule already flags all `datacenter` traffic and no model is needed to classify it

## Hypotheses
This section is for you to showcase your analytical skill and to ask interesting questions.

### Are longer sessions with faster inter-API calls more anomalous?
Many events happening in a short period of time often signal bot or other malicious activity. Let's see if that is the case for this dataset.

In [114]:
boxplot_by_bin_with_target(
    data = data,
    column_to_bin = "sequence_length(count)",
    numeric_column = "inter_api_access_duration(sec)",
    target = "is_anomaly"
)

**Observations:**
* Anomalous traffic has shorter inter-API access durations than normal traffic
* For the shortest sequences, the difference in inter-API access duration is less pronounced than for longer sequences

**Insights:**
* Longer sequences with shorter inter-API access durations are more likely to be anomalous


## Summary

### Main Insights
* Most of the traffic comes from the `default` IP type; only 9% comes from datacenters
* All `datacenter` traffic is labelled as anomalous
* Longer sequences with shorter inter-API access durations are more likely to be anomalous

### Implications for Modelling
* The dataset needs to be filtered to include only the `default` IP type
* The interaction between `sequence_length(count)` and `inter_api_access_duration(sec)` needs to be either manually encoded, or a tree-based model needs to be used