# HDFS Log Anomaly Detection - Part 2: Feature Engineering

## Objective
Create meaningful features from the preprocessed HDFS logs for use in anomaly detection models.

## Features to be created:
1. **Statistical features**: Message length, word count, character frequency
2. **Pattern-based features**: Template occurrence, keyword presence
3. **Temporal features**: Time-based patterns and sequences
4. **Categorical encodings**: Log level, session ID encoding
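As a rough illustration of items 1 and 2, here is a minimal sketch of per-message statistical and keyword features aggregated per block. The toy log lines and the feature names (`msg_length`, `word_count`, `digit_ratio`, `has_exception`) are assumptions made for this example; only the `BlockId` and `Content` columns come from the notebook.

```python
import pandas as pd

# Toy log lines standing in for the 'Content' column (illustrative data only).
toy_logs = pd.DataFrame({
    "BlockId": ["blk_1", "blk_1", "blk_2"],
    "Content": [
        "Receiving block blk_1 src: /10.0.0.1:50010",
        "PacketResponder 1 for block blk_1 terminating",
        "Exception in receiveBlock for block blk_2",
    ],
})


def add_statistical_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add simple per-message features (hypothetical feature names)."""
    out = df.copy()
    out["msg_length"] = out["Content"].str.len()                 # characters per message
    out["word_count"] = out["Content"].str.split().str.len()     # whitespace-separated tokens
    out["digit_ratio"] = out["Content"].str.count(r"\d") / out["msg_length"]
    out["has_exception"] = (
        out["Content"].str.contains("exception", case=False).astype(int)
    )                                                            # keyword presence flag
    return out


toy_features = add_statistical_features(toy_logs)

# Aggregate the per-message features to one row per block (session).
session_stats = toy_features.groupby("BlockId").agg(
    n_messages=("Content", "size"),
    mean_length=("msg_length", "mean"),
    total_words=("word_count", "sum"),
    any_exception=("has_exception", "max"),
).reset_index()

print(session_stats)
```

Features like these could be concatenated with a text-based matrix on `BlockId`; whether they help depends on the model and would need to be checked empirically.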

In [2]:
from google.colab import drive
drive.mount('/content/drive')
saved_path = '/content/drive/MyDrive/HDFS_Project/'

Mounted at /content/drive


## 1. Grouping by BlockId (Session Creation)

In [3]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Load cleaned data from Notebook 01
df_logs = pd.read_csv(saved_path+'hdfs_cleaned_sample.csv')

# Group messages by BlockId (Session creation)
# Convert the list of messages into one string per block
df_sessions = df_logs.groupby('BlockId')['Content'].apply(lambda x: ' '.join(x)).reset_index()

print(f"Number of unique sessions (blocks): {len(df_sessions)}")
df_sessions.head()

Number of unique sessions (blocks): 66167


| | BlockId | Content |
|---|---|---|
| 0 | blk_-1000083860370843431 | BLOCK* NameSystem.allocateBlock: /user/root/ra... |
| 1 | blk_-1000639647761179183 | Receiving block blk_-1000639647761179183 src: ... |
| 2 | blk_-1000927344760357676 | Receiving block blk_-1000927344760357676 src: ... |
| 3 | blk_-1000974751560899194 | Receiving block blk_-1000974751560899194 src: ... |
| 4 | blk_-1001052357459993016 | Receiving block blk_-1001052357459993016 src: ... |
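One caveat with the grouping step: `' '.join` raises a `TypeError` if a group contains a non-string value such as `NaN`. The following is a minimal sketch of a more defensive variant on toy data; the helper name `build_sessions` and the extra `n_messages` column are assumptions introduced here, not part of the notebook.

```python
import pandas as pd

# Toy data with one missing message (illustrative only).
toy_logs = pd.DataFrame({
    "BlockId": ["blk_2", "blk_1", "blk_2", "blk_1"],
    "Content": [
        "Receiving block blk_2",
        "Receiving block blk_1",
        None,
        "Deleting block blk_1",
    ],
})


def build_sessions(df: pd.DataFrame) -> pd.DataFrame:
    """Concatenate messages per BlockId, skipping rows with missing Content."""
    clean = df.dropna(subset=["Content"])
    grouped = clean.groupby("BlockId")["Content"]
    # Row order inside each group is preserved, so messages stay in file order.
    sessions = grouped.apply(lambda msgs: " ".join(msgs)).reset_index()
    # Both results are ordered by the sorted group keys, so they align row by row.
    sessions["n_messages"] = grouped.size().to_numpy()
    return sessions


toy_sessions = build_sessions(toy_logs)
print(toy_sessions)
```

Keeping a message count per session is optional, but it gives a quick sanity check that no block ended up with an empty string.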


## 2. Vectorization (Bag-of-Words / TF-IDF)

In [4]:
# This is the key step of feature engineering: we turn the textual log data
# into a numerical matrix that machine learning models can process.
# Two common approaches:
# - CountVectorizer: counts how often each token occurs in a session
# - TF-IDF: down-weights tokens that appear in most sessions, so rarer
#   (often more suspicious) tokens carry more weight
# Here we use TF-IDF, keeping the 500 most frequent tokens after removing
# English stop words.
tfidf = TfidfVectorizer(max_features=500, stop_words='english')
X = tfidf.fit_transform(df_sessions['Content'])

# Convert to DataFrame for better readability
X_df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())
X_df['BlockId'] = df_sessions['BlockId']

print(f"Shape of the feature matrix: {X_df.shape}")

Shape of the feature matrix: (66167, 501)


## 3. Merging with Labels (Ground Truth)

In [5]:
# We retrieve the labels (“Normal” or “Anomaly”) for each block in order to evaluate the performance of our anomaly detection models.

# Load labels
df_labels = pd.read_csv(saved_path + 'labels_cleaned.csv')

# Merge feature matrix with labels using BlockId
df_final = pd.merge(X_df, df_labels, on='BlockId')

# Encode labels: Normal -> 0, Anomaly -> 1
# (scikit-learn outlier detectors such as OneClassSVM predict +1 for inliers
# and -1 for outliers, so convert between the two conventions when evaluating)
df_final['Label'] = df_final['Label'].map({'Normal': 0, 'Anomaly': 1})

print("Final class distribution in the dataset:")
print(df_final['Label'].value_counts())

Final class distribution in the dataset:
Label
0    63215
1     2952
Name: count, dtype: int64
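Because `pd.merge` defaults to an inner join, blocks without a label are dropped silently, and labels outside the mapping become `NaN`. Below is a minimal sketch of a merge that surfaces both cases, using toy frames; the column `f0` and the `Label_sklearn` name are assumptions made for this example.

```python
import pandas as pd

# Toy feature matrix and label table; blk_3 has no label (illustrative data).
toy_features = pd.DataFrame({
    "BlockId": ["blk_1", "blk_2", "blk_3"],
    "f0": [0.1, 0.0, 0.7],
})
toy_labels = pd.DataFrame({
    "BlockId": ["blk_1", "blk_2"],
    "Label": ["Normal", "Anomaly"],
})

# validate= raises if BlockId is duplicated on either side;
# indicator= adds a '_merge' column showing where each row came from.
merged = pd.merge(
    toy_features, toy_labels,
    on="BlockId", how="left",
    validate="one_to_one", indicator=True,
)

unlabeled = merged.loc[merged["_merge"] == "left_only", "BlockId"].tolist()
print("Blocks without a label:", unlabeled)

labeled = merged[merged["_merge"] == "both"].drop(columns="_merge").copy()
labeled["Label"] = labeled["Label"].map({"Normal": 0, "Anomaly": 1})

# Fail loudly if an unexpected label string slipped through the mapping.
if labeled["Label"].isna().any():
    raise ValueError("Unmapped label values found")
labeled["Label"] = labeled["Label"].astype(int)

# Optional: the +1 (inlier) / -1 (outlier) convention used by scikit-learn
# outlier detectors, for direct comparison with their predict() output.
labeled["Label_sklearn"] = labeled["Label"].map({0: 1, 1: -1})
print(labeled)
```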


## 4. Normalization (Feature Scaling)

In [7]:
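# Distance- and gradient-based models such as One-Class SVM and neural networks are sensitive to feature scale.
# Tree-based models such as Isolation Forest are not affected by per-feature scaling, but it does not harm them.
# We therefore standardize the features once, before training.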

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Exclude BlockId and Label from scaling
features = df_final.drop(['BlockId', 'Label'], axis=1)
X_scaled = scaler.fit_transform(features)

# Save processed data for Notebook 03 (Model Training)
import joblib
joblib.dump((X_scaled, df_final['Label']), saved_path + 'processed_data.pkl')

print("Feature engineering completed. Data saved for Notebook 03.")

Feature engineering completed. Data saved for Notebook 03.
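The cell above fits the scaler on every row. If the next notebook evaluates models on a held-out split, one possible alternative is to fit the scaler on the training portion only and to save it alongside the data. The sketch below uses random toy data and a temporary directory; the file name `processed_data_split.pkl`, the dictionary keys, and the 80/20 split are assumptions, not the notebook's actual setup.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the feature matrix and labels (illustrative data only).
rng = np.random.default_rng(0)
X_toy = rng.random((100, 5))
y_toy = np.array([0] * 90 + [1] * 10)

# Stratify so the rare anomaly class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Fit on the training rows only, then apply the same transform to the test rows.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Persist the data together with the fitted scaler (hypothetical file name).
out_dir = tempfile.mkdtemp()
bundle_path = os.path.join(out_dir, "processed_data_split.pkl")
joblib.dump(
    {
        "X_train": X_train_scaled, "y_train": y_train,
        "X_test": X_test_scaled, "y_test": y_test,
        "scaler": scaler,
    },
    bundle_path,
)

# Reload to confirm the bundle round-trips.
bundle = joblib.load(bundle_path)
print(bundle["X_train"].shape, bundle["X_test"].shape)
```

Saving the fitted scaler also makes it possible to transform new log sessions later with the same statistics.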
