# Feature Engineering

## Objectives:
- Develop a set of features that have the potential to improve your model's performance
- Investigate the relationships between your new features and your target

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Identify and remove highly correlated features (to avoid multicollinearity --> avoid overfitting)
from feature_engine.selection import SmartCorrelatedSelection

## Data ingestion

In [4]:
dataframe = pd.read_csv("/mnt/c/Users/haanh/api-behavior-anomaly/data/supervised_clean_data.csv")
api_call_graph = pd.read_json("/mnt/d/Opportunities/Jobs/datasets/api_anomaly/supervised_call_graphs.json")

In [5]:
# Checking the graph dataset
api_call_graph.head(3)

Unnamed: 0,_id,call_graph
0,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,[{'toId': '1f873432-6944-3df9-8300-8a3cf9f95b3...
1,4c486414-d4f5-33f6-b485-24a8ed2925e8,[{'toId': '016099ea-6f20-3fec-94cf-f7afa239f39...
2,7e5838fc-bce1-371f-a3ac-d8a0b2a05d9a,[{'toId': '1f873432-6944-3df9-8300-8a3cf9f95b3...


### Some notes about the graph data
- The call_graph column in the graph dataset (JSON file) holds, for each _id, a list of dictionaries that together describe one graph.
- Each dictionary describes a single edge: fromId is the source node and toId is the destination node.
- The source and destination nodes correspond to the available API calls.
- The call_graph column will be analyzed further to engineer features from the graph.

### Some notes about the relationship between the tabular data and the graph data
- The cleaned CSV file consists mainly of already-engineered features, with few null values and no significant outliers, so further feature engineering on it is unlikely to add much. Additional features will instead be derived from the graph dataset.
- These graph-derived features will eventually be joined to the cleaned CSV data on the _id column.
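
As a rough illustration of the join described above, the sketch below merges a toy graph-feature table into a toy tabular table on _id. The frame names and the `some_tabular_feature` column are placeholders invented for this example; they are not taken from the real datasets.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned tabular data and the engineered graph features.
tabular = pd.DataFrame({
    "_id": ["a", "b", "c"],
    "some_tabular_feature": [0.1, 0.5, 0.9],  # placeholder column name
})
graph_level = pd.DataFrame({
    "_id": ["a", "b"],
    "n_connections": [28, 57],
})

# A left join keeps every tabular row; ids without graph features get NaN.
# validate="one_to_one" raises if either side has duplicate _id values.
combined = tabular.merge(graph_level, on="_id", how="left", validate="one_to_one")
print(combined)
```

The left join is only one possible choice here; an inner join would instead drop tabular rows that have no graph.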

## Processing the graph data

In [6]:
# Explode the list of dictionaries in call_graph into multiple rows
# In simpler terms, each row contains 1 dictionary from the list of dictionaries in call_graph
# Note: _id value from the original row is repeated for each element in the exploded list
calls_exploded = api_call_graph.explode("call_graph").reset_index(drop=True)

# Normalize the exploded dictionaries into separate columns
# Each dictionary looks like {"toId": "B", "fromId": "A"} and becomes one row, for example:
#     toId   fromId
# 0   B      A
# 1   C      B
calls_normalized = pd.json_normalize(calls_exploded['call_graph'])

# Concatenate the original DataFrame (without 'call_graph') with the normalized DataFrame
# In simpler terms, joining _id column with the normalized DataFrame
calls_processed = pd.concat(
    [calls_exploded.drop(columns=['call_graph']).reset_index(drop=True),
     calls_normalized.reset_index(drop=True)],
    axis=1
)  # axis=1 as I want to concatenate the DataFrames side by side (or horizontally)

# Rename "toId" to "to" and "fromId" to "from" (shorter, and "toId" is easy to misread as "told")
calls_processed.rename(columns={'fromId': 'from', 'toId': 'to'}, inplace=True)

# Display the first 3 rows
calls_processed.head(3)

Unnamed: 0,_id,to,from
0,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,1f873432-6944-3df9-8300-8a3cf9f95b35,5862055b-35a6-316a-8e20-3ae20c1763c2
1,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,8955faa9-0e33-37ad-a1dc-f0e640a114c2,a4fd6415-1fd4-303e-aa33-bb1830b5d9d4
2,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,85754db8-6a55-30b7-8558-dec75ff89abd,85754db8-6a55-30b7-8558-dec75ff89abd


### Checking for null values in the graph data

In [29]:
missing_calls = calls_processed[calls_processed.isnull().any(axis=1)]

missing_calls.head()

Unnamed: 0,_id,to,from
408116,286cf539-f6df-3af7-a0d9-fd9c0149b5d5,,
414714,8e8b99bb-7b6d-3437-9abc-1d884fe023d0,,
414860,bedfd600-80ef-3e95-8667-fa0d9a905d72,,
416318,60a25ad0-add8-3976-bc51-a7ff6779d5dc,,
417300,70b6a9dd-e4c6-36a0-908d-2311c277b5e8,,


There are 5 ids with no edges: their to and from values are null. All of these ids, except 286cf539-f6df-3af7-a0d9-fd9c0149b5d5, have missing values in the tabular data (CSV file).
Hence, all 5 rows with these ids will be deleted from the graph data, because a graph cannot be built from null nodes.
Additionally, the row with the id 286cf539-f6df-3af7-a0d9-fd9c0149b5d5 will be deleted from the tabular data, because it has no graph information to link to.

In [40]:
# Drop rows with null value from the graph data
calls_processed = calls_processed.dropna()
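
The matching cleanup on the tabular side is not shown in this notebook. A minimal sketch of both steps on toy data follows; the frames `edges` and `tabular` and the `label` column are made up for illustration.

```python
import pandas as pd

# Toy edge list: graph "g3" has no edges, so its 'to'/'from' are null after exploding.
edges = pd.DataFrame({
    "_id": ["g1", "g1", "g2", "g3"],
    "to": ["B", "C", "B", None],
    "from": ["A", "B", "A", None],
})
tabular = pd.DataFrame({"_id": ["g1", "g2", "g3"], "label": [0, 1, 0]})  # hypothetical

# Ids that have at least one row with a null endpoint.
null_rows = edges[["to", "from"]].isnull().any(axis=1)
ids_without_edges = set(edges.loc[null_rows, "_id"])

# Drop the null rows from the graph data and the same ids from the tabular data.
edges_clean = edges.dropna(subset=["to", "from"])
tabular_clean = tabular[~tabular["_id"].isin(ids_without_edges)]
```

This assumes, as in the output above, that an id with a null row has no valid edges at all; an id mixing valid and null rows would need a closer look before being removed from the tabular data.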

## Conducting feature engineering
### Basic graph level features
The most basic graph-level features that we can engineer are:
- Number of edges (connections)
- Number of nodes (APIs)

These features can be useful since most behaviours are going to have a "normal" range of APIs that they contact. If this number is too large or too small, this might be an indication of anomalous activity.
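
One possible way to make "too large or too small" concrete is an interquartile-range check. The sketch below uses toy counts and the conventional 1.5 * IQR fences; both the data and the threshold are assumptions for illustration, not something this notebook establishes.

```python
import pandas as pd

# Toy per-graph edge counts; the last value is deliberately extreme.
counts = pd.Series([28, 57, 57, 49, 90, 35, 1165], name="n_connections")

q1, q3 = counts.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # conventional 1.5 * IQR fences

# True where a graph's edge count falls outside the "normal" range.
outside_range = (counts < lower) | (counts > upper)
print(counts[outside_range])
```

A flag like this is only a candidate signal; whether it relates to the target still has to be checked against the labels.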

In [41]:
# Group by '_id' and aggregate the number of connections and unique from/to values
graph_features = calls_processed.groupby('_id').agg(
    n_connections=('from', 'count'),  # Count number of 'from' entries
    from_nodes=('from', lambda x: list(set(x))),  # Gather all unique 'from' values
    to_nodes=('to', lambda x: list(set(x)))         # Gather all unique 'to' values
).reset_index()

# Calculate the number of unique nodes (both to_nodes and from_nodes) concerning each _id
graph_features['n_unique_nodes'] = graph_features.apply(
    lambda row: len(set(row['from_nodes'] + row['to_nodes'])),
    axis=1
)
# Display the features in graph_features so far
graph_features.head(3)

Unnamed: 0,_id,n_connections,from_nodes,to_nodes,n_unique_nodes
0,00041830-3168-3731-8bbc-c6838311da58,28,"[257b9618-6c20-3fd2-89a2-c53961eab4ef, 7c4ed48...","[257b9618-6c20-3fd2-89a2-c53961eab4ef, 7c4ed48...",15
1,007fa202-d51f-3619-8104-779c31c30138,57,"[66837568-9caf-35e7-b7bc-f6d907b9dfac, d8ab934...","[552c5e80-557f-3108-a126-27aa1371a5f1, d8ab934...",30
2,00b35506-6ac8-3726-a9f6-4adbedd5bc0c,57,"[257b9618-6c20-3fd2-89a2-c53961eab4ef, 47a9d8f...","[257b9618-6c20-3fd2-89a2-c53961eab4ef, 2e8ac1d...",35
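
The same two features can also be computed without building intermediate lists or a row-wise apply. The following is a sketch on a toy edge list with the same column names as calls_processed; it is an alternative formulation, not the code used above.

```python
import pandas as pd

# Toy edge list with the same columns as calls_processed.
edges = pd.DataFrame({
    "_id": ["g1", "g1", "g1", "g2"],
    "from": ["A", "B", "A", "X"],
    "to": ["B", "C", "C", "X"],
})

# Stack 'from' and 'to' into one long column of node ids per graph.
nodes_long = edges.melt(id_vars="_id", value_vars=["from", "to"], value_name="node")

features = pd.DataFrame({
    "n_connections": edges.groupby("_id").size(),                    # rows (edges) per graph
    "n_unique_nodes": nodes_long.groupby("_id")["node"].nunique(),   # distinct nodes per graph
}).rename_axis("_id").reset_index()
```

`size()` counts rows, while `count` on the from column counts non-null values; the two agree once null rows have been dropped.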


In [21]:
graph_features.to_csv("/mnt/c/Users/haanh/api-behavior-anomaly/data/graph_features.csv", index=False)

### Note
From this point on, from_nodes and to_nodes are no longer needed, so they will be dropped.

In [32]:
# Drop 'from_nodes' and 'to_nodes'
graph_features = graph_features.drop(['from_nodes', 'to_nodes'], axis=1)

# Check the dataframe
graph_features.head(3)

Unnamed: 0,_id,n_connections,n_unique_nodes
0,00041830-3168-3731-8bbc-c6838311da58,28,15
1,007fa202-d51f-3619-8104-779c31c30138,57,30
2,00b35506-6ac8-3726-a9f6-4adbedd5bc0c,57,35


### Further graph level features
Based on the data in the JSON file about the graph, we can explore several other features of the graph at the node and edge level, namely:
- Degree Centrality: A measure of the number of connections a node has. This can be broken down into 2 features: in-degree and out-degree. In-degree is the number of incoming connections (edges) to each node, representing how many times the node is referenced as a "to" node. Out-degree is the number of outgoing connections from each node, representing how many times the node is referenced as a "from" node. Unlike n_connections, which counts edges per graph, both are measured per node.
- Clustering Coefficient: A measure of the degree to which nodes in the graph tend to cluster together. A higher coefficient indicates that more of a node's neighbors are also connected to each other.
- Node Importance Measures: Measures such as PageRank or betweenness centrality. Higher values indicate more important nodes.
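
To make these measures concrete, here is a small sketch on a made-up four-edge graph using networkx. The node names are invented; PageRank is left out because the notebook computes it on the real data.

```python
import networkx as nx

# Toy directed call graph: A calls B and C, B calls C, C calls A.
G_toy = nx.DiGraph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")])

in_degree = dict(G_toy.in_degree())      # times each node is a "to" node
out_degree = dict(G_toy.out_degree())    # times each node is a "from" node

# Clustering is computed on the undirected version, as in the notebook.
clustering = nx.clustering(G_toy.to_undirected())

# Betweenness: share of shortest paths that pass through each node.
betweenness = nx.betweenness_centrality(G_toy)

print(in_degree, out_degree, clustering, betweenness)
```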

These features can be broken down into:
- global features - measure node attributes on the single graph built from all the _id graphs combined
- local features - measure node attributes within one specific _id graph
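
The distinction can be sketched on toy data as follows: the global graph uses every edge, while each local graph uses only the edges of one _id. This is an illustrative reconstruction under that reading of "local", not necessarily the exact computation used in this notebook.

```python
import networkx as nx
import pandas as pd

# Toy edge list with two graphs that share some nodes.
edges = pd.DataFrame({
    "_id": ["g1", "g1", "g2", "g2"],
    "from": ["A", "B", "A", "C"],
    "to": ["B", "C", "C", "A"],
})

# Global: one graph built from every edge, regardless of _id.
G_global = nx.from_pandas_edgelist(
    edges, source="from", target="to", create_using=nx.DiGraph()
)
global_in = dict(G_global.in_degree())

# Local: one graph per _id, built only from that _id's edges.
rows = []
for graph_id, group in edges.groupby("_id"):
    G_local = nx.from_pandas_edgelist(
        group, source="from", target="to", create_using=nx.DiGraph()
    )
    for node, local_in in G_local.in_degree():
        rows.append({
            "_id": graph_id,
            "node": node,
            "local_in_degree": local_in,
            "global_in_degree": global_in[node],
        })

local_vs_global = pd.DataFrame(rows)
print(local_vs_global)
```

Note that a DiGraph stores each (from, to) pair once, so repeated calls between the same two nodes collapse into a single edge in both the global and the local graphs.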

In [None]:
import networkx as nx

# Create a directed graph from the processed DataFrame
G = nx.from_pandas_edgelist(
    calls_processed,
    source='from',
    target='to',
    edge_attr='_id',  # Include '_id' as an edge attribute
    create_using=nx.DiGraph()
)

# Get the list of nodes in the global graph
nodes_in_graph = list(G.nodes)

# Initialize the DataFrame for node features
node_edge_features = pd.DataFrame({'node': nodes_in_graph})

# Compute global features
in_degree_global = dict(G.in_degree())
out_degree_global = dict(G.out_degree())
page_rank_global = nx.pagerank(G)
clustering_global = nx.clustering(G.to_undirected())  # Clustering coefficient (undirected)

# Map global features to nodes
node_edge_features['global_in_degree'] = node_edge_features['node'].map(in_degree_global)
node_edge_features['global_out_degree'] = node_edge_features['node'].map(out_degree_global)
node_edge_features['global_page_rank'] = node_edge_features['node'].map(page_rank_global)
node_edge_features['global_clustering_coefficient'] = node_edge_features['node'].map(clustering_global)


Unnamed: 0,_id,node,global_in_degree,global_out_degree,global_page_rank,global_clustering_coefficient,local_in_degree,local_out_degree,local_page_rank,local_clustering_coefficient
0,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,5862055b-35a6-316a-8e20-3ae20c1763c2,35,49,0.000962,0.676395,35,49,1.312371,0.333333
1,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,1f873432-6944-3df9-8300-8a3cf9f95b35,51,40,0.001379,0.711538,51,40,1.47635,0.333333
2,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,a4fd6415-1fd4-303e-aa33-bb1830b5d9d4,3,17,0.000453,0.320261,3,17,0.285384,0.0
3,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,8955faa9-0e33-37ad-a1dc-f0e640a114c2,1,1,0.000164,1.0,1,1,0.01168,0.0
4,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,85754db8-6a55-30b7-8558-dec75ff89abd,13,44,0.000495,0.520773,13,44,0.563051,0.0
5,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,876b4958-7df1-3b2b-9def-1a22f1d444e3,60,117,0.00327,0.327035,60,117,8.853034,2.233333
6,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,9f08fee1-953c-3801-b254-c0256f276bc2,300,96,0.012977,0.149934,300,96,13.472717,4.7
7,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,857c4b20-3057-30e0-9ca3-d6f5c3dbe4a6,24,35,0.000664,0.881506,24,35,1.631157,0.8
8,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,756ab2fe-a386-32dd-9a4e-18785c38a414,143,489,0.005992,0.084547,143,489,15.287922,5.652564
9,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,4f1415e9-90dd-3a15-85a0-fdf074ddf6a0,22,4,0.000886,0.766234,22,4,0.607437,0.0


### Observations:
- global_in_degree and local_in_degree are the same, as are global_out_degree and local_out_degree. This indicates that the _id subgraphs cover all edges for the nodes being analyzed. To keep later computation efficient, local_in_degree and local_out_degree will be dropped from the dataframe.
- Some nodes have a low global clustering coefficient and a local clustering coefficient of 0. This happens when a node's neighbors have few or no connections among themselves, so no triangles are formed. It is typical of sparsely connected graphs.
- Some nodes have a low global PageRank and a high local PageRank. Such a node may be well connected to highly ranked neighbors within its local context, while having few connections to the broader network or being linked only to lower-ranked nodes globally, which reduces its global PageRank.

In [34]:
# Drop 'local_in_degree' and 'local_out_degree' from the dataframe
node_edge_features = node_edge_features.drop(['local_in_degree', 'local_out_degree'], axis=1)

# Check the result dataframe
node_edge_features.head(3)

Unnamed: 0,_id,node,global_in_degree,global_out_degree,global_page_rank,global_clustering_coefficient,local_page_rank,local_clustering_coefficient
0,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,5862055b-35a6-316a-8e20-3ae20c1763c2,35,49,0.000962,0.676395,1.312371,0.333333
1,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,1f873432-6944-3df9-8300-8a3cf9f95b35,51,40,0.001379,0.711538,1.47635,0.333333
2,1f2c32d8-2d6e-3b68-bc46-789469f2b71e,a4fd6415-1fd4-303e-aa33-bb1830b5d9d4,3,17,0.000453,0.320261,0.285384,0.0


In [35]:
# Left-merge the node-level features onto graph_features using the _id column
graph_features_merged = pd.merge(graph_features, node_edge_features, how='left', on='_id')

# Display the final dataframe
graph_features_merged.head(3)

Unnamed: 0,_id,n_connections,n_unique_nodes,node,global_in_degree,global_out_degree,global_page_rank,global_clustering_coefficient,local_page_rank,local_clustering_coefficient
0,00041830-3168-3731-8bbc-c6838311da58,28,15,,,,,,,
1,007fa202-d51f-3619-8104-779c31c30138,57,30,,,,,,,
2,00b35506-6ac8-3726-a9f6-4adbedd5bc0c,57,35,,,,,,,
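
When a left merge returns only NaN on the right-hand side, the `indicator=True` option of `merge` gives a quick count of which keys matched. A small sketch on made-up frames:

```python
import pandas as pd

# Toy frames: only "g3" exists on both sides.
left = pd.DataFrame({"_id": ["g1", "g2", "g3"], "n_connections": [28, 57, 57]})
right = pd.DataFrame({"_id": ["g3", "g4"], "node": ["A", "B"]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
checked = left.merge(right, on="_id", how="left", indicator=True)
match_counts = checked["_merge"].value_counts()
unmatched_ids = checked.loc[checked["_merge"] == "left_only", "_id"].tolist()

print(match_counts)
print(unmatched_ids)
```

A large `left_only` count points to keys that are absent from the right-hand frame, rather than to a problem in the merge call itself.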


In [37]:
missing_ids = graph_features[~graph_features['_id'].isin(node_edge_features['_id'])]
print("Missing _id values in node_edge_features:")
print(missing_ids)

Missing _id values in node_edge_features:
                                       _id  n_connections  n_unique_nodes
0     00041830-3168-3731-8bbc-c6838311da58             28              15
1     007fa202-d51f-3619-8104-779c31c30138             57              30
2     00b35506-6ac8-3726-a9f6-4adbedd5bc0c             57              35
3     00bcf38a-a97e-3a9c-8ba6-1383596682c2              1               2
4     00dd0f82-0bc2-35dc-8e6a-79f952c70d94             49              27
...                                    ...            ...             ...
1666  fefcc98c-5b68-3c2e-87c8-8dfb16e8d782              3               3
1669  ff62d81d-4b83-3292-8a45-b902265e1a97            238              94
1670  ff6fea70-ca97-34fe-9a44-b46afcfaae0f             90              34
1671  ffa46dee-4173-310d-b671-8cc4bb091b86              2               2
1672  ffbf4937-68e6-3d12-b817-dd3c5c14a782           1165             238

[1498 rows x 3 columns]


In [16]:
# Print unique _id values in both DataFrames
unique_ids_graph = set(graph_features['_id'])
unique_ids_node_edge = set(node_edge_features['_id'])

print("Unique _id values in graph_features:")
print(unique_ids_graph)

print("\nUnique _id values in node_edge_features:")
print(unique_ids_node_edge)

# Check for ids in graph_features that are not in node_edge_features
not_in_node_edge = unique_ids_graph - unique_ids_node_edge
print("\n_ids in graph_features but not in node_edge_features:")
print(not_in_node_edge)

Unique _id values in graph_features:
{'5fec83a8-02a3-3b24-8217-344d5667d5a7', '87305252-f656-3a48-a0ae-9c21fa19b7b6', '5bb277c2-c7f3-3cae-8d21-646bbc65de6e', '4a1de205-9cf8-38e3-95ab-a857abc4be9c', '1a7f669e-7130-303e-8c0b-193e94404bb9', '8879d1b2-00be-34dd-addf-a1ad22cc80a6', '303b1501-d47a-3e3e-9444-45f29d15026a', '84b91298-0b39-360a-8f9c-88f4fb871df8', '28660e09-7e96-3a2d-89d7-a90b36e6e9b0', 'c6d5c783-6b35-396c-8d78-68408658118e', '9674f6c1-b236-35eb-9b44-3698d85e85f2', '3e7861cd-520a-33e6-bacf-43fdc1358744', '366f3b54-dde7-3436-82b7-5d0f70ac0bae', '8356d8cc-943c-31fa-abb8-d4a6653f2b1b', 'ffbf4937-68e6-3d12-b817-dd3c5c14a782', 'f849bc05-cbb6-3b03-b216-dad7fd5bd92a', '1f6f2720-ac86-333d-a1f0-38d7e6075ead', '3f7f33a8-8995-3de8-af02-348b0ba422fc', 'ce308907-776a-377a-934e-a05118a05af1', '1282c1a6-3d6c-32a0-b759-f89804bb63cf', '1f2f1f05-8bc8-3a78-b804-b7383471015c', '4029c31c-2c15-394f-b84c-157c02384eab', '4ee267ef-5183-326c-9285-6d587ea82e9b', '5f25eb9a-b5ec-3465-8f6c-7414caec5b8d', '0

In [22]:
missing_nodes = set(node_edge_features['node']) - set(G.nodes)
print("Nodes in node_edge_features but not in the graph:", missing_nodes)

Nodes in node_edge_features but not in the graph: set()


In [38]:
missing_ids_count = node_edge_features['_id'].isnull().sum()
print("Number of nodes with missing _id values:", missing_ids_count)


Number of nodes with missing _id values: 31


In [39]:
missing_id_nodes = node_edge_features[node_edge_features['_id'].isnull()]['node']
print("Nodes with missing _id values:")
print(missing_id_nodes)

Nodes with missing _id values:
566     cda83484-5167-31c7-9d38-8117ea951650
588     2880b797-00b8-36b6-b4a9-ec66942871b2
697     6a3d9f07-0b10-3655-82e0-92c47bcc0c5b
764     69240e4d-4d01-3076-b188-e175d1fd49df
777     ea3a02b1-c7f7-3269-952c-62d0abe6304d
798     8a62406b-03df-3f3f-aec7-abbe42a2eb4e
802     a64ba409-a01f-3f0d-8aa9-d70bd9937bfc
804     c96278e9-c6b5-3ffd-a57f-909e2b72beca
807     e138fe3f-e38f-3c0f-9c08-015b6521b585
809     dca692fb-1ff5-30c2-8263-eee76e133661
816     dbe93344-63a4-317d-af61-f3cb855a14ce
836     48b5b51b-bfff-3d7e-8e4d-784040cca9a6
847     f3ec46fa-6a53-365a-920f-1d3ce30c9e5d
850     0edceeff-b55a-3ece-b159-305f7a1c8cfd
863     c462300d-5c07-3c21-aeba-7ff929e61f7c
887     ae35cf32-9017-330c-aa87-b02206279eef
896     c8042e79-e0f3-3ffd-99f6-49ab9ec992cd
917     0dc702dd-aea2-3853-a76f-28085011a350
1023    793121c5-818c-3c5d-a9ec-555c1db5beb3
1026    bee05cb6-1508-3c68-9dd3-82e4bab39946
1038    14284ecf-8604-33bf-8369-35a50793b6ef
1042    04077f15-97f5-3f

In [42]:
node_edge_features.to_csv("/mnt/c/Users/haanh/api-behavior-anomaly/data/node_edge.csv", index=False)

In [25]:
for node in missing_id_nodes:
    node_edges = calls_processed[calls_processed['from'] == node]
    print(f"Edges for node {node} in calls_processed:")
    print(node_edges)


Edges for node cda83484-5167-31c7-9d38-8117ea951650 in calls_processed:
Empty DataFrame
Columns: [_id, to, from]
Index: []
Edges for node 2880b797-00b8-36b6-b4a9-ec66942871b2 in calls_processed:
Empty DataFrame
Columns: [_id, to, from]
Index: []
Edges for node 6a3d9f07-0b10-3655-82e0-92c47bcc0c5b in calls_processed:
Empty DataFrame
Columns: [_id, to, from]
Index: []
Edges for node 69240e4d-4d01-3076-b188-e175d1fd49df in calls_processed:
Empty DataFrame
Columns: [_id, to, from]
Index: []
Edges for node ea3a02b1-c7f7-3269-952c-62d0abe6304d in calls_processed:
Empty DataFrame
Columns: [_id, to, from]
Index: []
Edges for node 8a62406b-03df-3f3f-aec7-abbe42a2eb4e in calls_processed:
Empty DataFrame
Columns: [_id, to, from]
Index: []
Edges for node a64ba409-a01f-3f0d-8aa9-d70bd9937bfc in calls_processed:
Empty DataFrame
Columns: [_id, to, from]
Index: []
Edges for node c96278e9-c6b5-3ffd-a57f-909e2b72beca in calls_processed:
Empty DataFrame
Columns: [_id, to, from]
Index: []
Edges for node e
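
The loop above filters only on the from column, so an empty result does not rule out the node appearing as a to node. One possibility, not confirmed by the output shown, is that these nodes occur only as call targets. A sketch that counts both roles on toy data (the node ids here are invented):

```python
import pandas as pd

# Toy edge list with the same columns as calls_processed.
edges = pd.DataFrame({
    "_id": ["g1", "g1", "g2"],
    "from": ["A", "B", "A"],
    "to": ["B", "C", "D"],
})
nodes_to_check = ["C", "D", "A"]  # hypothetical node ids

# Count how often each node appears as a source and as a target.
summary = pd.DataFrame({
    "node": nodes_to_check,
    "as_from": [int((edges["from"] == n).sum()) for n in nodes_to_check],
    "as_to": [int((edges["to"] == n).sum()) for n in nodes_to_check],
})

# Nodes that never call anything but are called by others.
only_targets = summary[(summary["as_from"] == 0) & (summary["as_to"] > 0)]["node"].tolist()
print(summary)
```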