<a href="https://colab.research.google.com/github/aniray2908/silent-attrition-detector/blob/main/notebooks/enron_behavioral_drift.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Behavioral Drift Engine – Enron Communication Analysis

### Objective

This notebook builds a behavioral disengagement detection engine
using internal email communication patterns.

The goal is to:

- Model communication volume over time
- Detect significant deviations from baseline behavior
- Convert behavioral drift into a standardized risk score
- Produce a signal compatible with structured HR risk models

This module serves as the behavioral signal layer
in the Silent Attrition Detection system.


## 1. Data Acquisition

We use the Enron Email Dataset (Kaggle version),
which contains raw email messages including headers and body.

Each row contains:
- File path
- Full email text

We must extract structured fields manually from the raw message.


In [None]:
import kagglehub
path = kagglehub.dataset_download("wcukierski/enron-email-dataset")

Downloading from https://www.kaggle.com/api/v1/datasets/download/wcukierski/enron-email-dataset?dataset_version_number=2...


100%|██████████| 358M/358M [00:02<00:00, 172MB/s]

Extracting files...





In [None]:
import os

print(path)
print(os.listdir(path))

/root/.cache/kagglehub/datasets/wcukierski/enron-email-dataset/versions/2
['emails.csv']


In [None]:
import pandas as pd

# Build the path from the download directory instead of hard-coding the cache location
enron_path = os.path.join(path, "emails.csv")

df_enron = pd.read_csv(enron_path)

df_enron.head()

Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...


In [None]:
df_enron.shape
df_enron.columns

Index(['file', 'message'], dtype='object')

## 2. Email Header Parsing

The dataset stores full email text in a single column.
We extract key structured fields:

- Sender ("From")
- Timestamp ("Date")

These fields are required for behavioral time-series modeling.


In [None]:
import re

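def extract_field(text, field_name):
    # Anchor to the start of a line so "From:" cannot match "X-From:" or text mid-line
    pattern = rf"^{re.escape(field_name)}: (.*)"
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None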

df_enron["From"] = df_enron["message"].apply(lambda x: extract_field(x, "From"))
df_enron["Date"] = df_enron["message"].apply(lambda x: extract_field(x, "Date"))

df_enron[["From", "Date"]].head()


Unnamed: 0,From,Date
0,phillip.allen@enron.com,"Mon, 14 May 2001 16:39:00 -0700 (PDT)"
1,phillip.allen@enron.com,"Fri, 4 May 2001 13:51:00 -0700 (PDT)"
2,phillip.allen@enron.com,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)"
3,phillip.allen@enron.com,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)"
4,phillip.allen@enron.com,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)"


### Dataset Structure

The dataset contains raw email records where each row includes:

- File path
- Full email text (headers + body)

Since headers are embedded inside the message text,
we must extract structured fields manually.
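
As an alternative to regular expressions, Python's standard `email` package can parse the header block directly. The snippet below is a minimal sketch rather than the method used in this notebook. The `raw` message is a made-up sample that reuses the sender and date shown in the output above; the `To`, `Subject`, and body values are illustrative assumptions.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical raw message in the same layout as the "message" column
raw = (
    "Message-ID: <example@thyme>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
    "From: phillip.allen@enron.com\n"
    "To: someone@enron.com\n"
    "Subject: example\n"
    "\n"
    "From: this line is body text, not a header\n"
)

msg = message_from_string(raw)

# Header lookup only sees the header block, never the body
sender = msg["From"]

# Parse the RFC 2822 date, then normalize to UTC
sent_at = parsedate_to_datetime(msg["Date"])
sent_at_utc = sent_at.astimezone(timezone.utc)

print(sender, sent_at_utc.isoformat())
```

Because the parser stops reading headers at the first blank line, a `From:` string inside the body cannot be mistaken for the sender.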


In [None]:
df_enron["Date"] = pd.to_datetime(df_enron["Date"], errors="coerce")

df_enron = df_enron.dropna(subset=["From", "Date"])



### Timestamp Cleaning

Email timestamps are converted to datetime format.
Invalid or malformed dates are removed.

This ensures temporal aggregation is reliable.


In [None]:
df_enron["employee"] = df_enron["From"].str.extract(r'([^@]+)@')

In [None]:
df_enron["employee"].nunique()

18992

## 3. Internal Communication Filtering

The dataset contains both internal and external email addresses.

To model organizational behavior, we retain only
emails sent from "@enron.com" addresses.

This ensures we analyze employee communication patterns
rather than vendor or external traffic.
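
One caveat on the domain filter: `str.contains` treats its pattern as a regular expression by default, so the `.` in `"@enron.com"` matches any character and the match is not anchored to the end of the address. The sketch below compares that loose match with a stricter suffix check. The addresses are made-up examples, not values from the dataset.

```python
import pandas as pd

# Hypothetical sender addresses
senders = pd.Series([
    "phillip.allen@enron.com",
    "vendor@enron.com.example.org",
    "someone@enronXcom.net",
    None,
])

# Loose: regex match anywhere in the string ('.' matches any character)
loose = senders.str.contains("@enron.com", na=False)

# Strict: literal, case-insensitive suffix match
strict = senders.str.lower().str.endswith("@enron.com", na=False)

print(pd.DataFrame({"sender": senders, "loose": loose, "strict": strict}))
```

Whether the stricter filter changes the counts on the real data is not established here; it only removes a class of possible false positives.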


In [None]:
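# Match the domain literally (regex=False) and copy to avoid chained-assignment warnings
df_enron = df_enron[df_enron["From"].str.contains("@enron.com", regex=False, na=False)].copy()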

df_enron["employee"] = df_enron["From"].str.extract(r'([^@]+)@')

df_enron.shape, df_enron["employee"].nunique()

((427785, 5), 6460)

In [None]:
# Re-parse with utc=True so timestamps with different UTC offsets
# are normalized to a single timezone-aware (UTC) datetime column
df_enron["Date"] = pd.to_datetime(
    df_enron["Date"],
    errors="coerce",
    utc=True
)

# Drop rows where conversion failed
df_enron = df_enron.dropna(subset=["Date"])

df_enron["Date"].head()

Unnamed: 0,Date
0,2001-05-14 23:39:00+00:00
1,2001-05-04 20:51:00+00:00
2,2000-10-18 10:00:00+00:00
3,2000-10-23 13:13:00+00:00
4,2000-08-31 12:07:00+00:00


### Internal Filtering Impact

After filtering to "@enron.com" addresses:

- External senders are removed
- Employee count reduces significantly
- Behavioral modeling becomes organization-focused

This ensures we analyze internal communication patterns only.


## 4. Monthly Communication Aggregation

We aggregate emails by:
- Employee
- Year-Month

This produces a time series of communication volume
for each employee.
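
Note that grouping and counting only produces rows for months in which an employee sent at least one email; silent months are absent rather than recorded as zero. Since silence is exactly the behavior of interest, one option is to fill those gaps explicitly. The following is a minimal sketch on a toy frame (the employee `"a"` and the counts are made up, and `fill_missing_months` is a hypothetical helper, not part of this notebook).

```python
import pandas as pd

# Toy monthly counts with a two-month gap (2001-03 and 2001-04 are missing)
toy = pd.DataFrame({
    "employee": ["a", "a", "a"],
    "year_month": pd.PeriodIndex(["2001-01", "2001-02", "2001-05"], freq="M"),
    "email_count": [10, 12, 3],
})

def fill_missing_months(group):
    """Reindex one employee's counts over every month between first and last activity."""
    full_range = pd.period_range(
        group["year_month"].min(), group["year_month"].max(), freq="M"
    )
    return group.set_index("year_month")["email_count"].reindex(full_range, fill_value=0)

# Apply per employee and stack the results back into a flat frame
filled = (
    pd.concat(
        {emp: fill_missing_months(g) for emp, g in toy.groupby("employee")},
        names=["employee", "year_month"],
    )
    .rename("email_count")
    .reset_index()
)

print(filled)
```

Filling only between an employee's first and last active month avoids inventing zeros before they joined or after they left.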


In [None]:
df_enron["year_month"] = df_enron["Date"].dt.to_period("M")



In [None]:
monthly_volume = (
    df_enron
    .groupby(["employee", "year_month"])
    .size()
    .reset_index(name="email_count")
)

monthly_volume.head()

Unnamed: 0,employee,year_month,email_count
0,'todd'.delahoussaye,2001-10,1
1,'todd'.delahoussaye,2001-11,4
2,'todd'.delahoussaye,2002-02,1
3,2.ews,2001-09,1
4,3e,2001-05,1


In [None]:
print("Earliest:", df_enron["year_month"].min())
print("Latest:", df_enron["year_month"].max())

Earliest: 1980-01
Latest: 2002-09


### Active Employee Filtering

Employees with fewer than 6 active months are removed.

Reason:
Drift detection requires sufficient historical baseline.
Short-lived activity introduces noise.

### Time Span Overview

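The dataset spans multiple years of internal communication, ending in 2002-09.
The earliest month reported above (1980-01) predates Enron's founding in 1985,
so it almost certainly reflects malformed timestamps rather than real traffic.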

This allows:

- Establishing historical behavioral baselines
- Detecting sustained communication drops
- Modeling medium-term behavioral drift


In [None]:
employee_months = (
    monthly_volume
    .groupby("employee")["year_month"]
    .nunique()
    .reset_index(name="active_months")
)

active_employees = employee_months[employee_months["active_months"] >= 6]["employee"]

monthly_volume = monthly_volume[monthly_volume["employee"].isin(active_employees)]

monthly_volume.shape, len(active_employees)

((15746, 3), 1430)

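### Active Employee Summary

After removing employees with fewer than 6 active months,
1,430 employees and 15,746 employee-month rows remain.

An "active month" here is any distinct month with at least one sent email;
the months do not have to be consecutive.

Reason:
Reliable drift detection requires sufficient historical activity.
Short-lived accounts introduce noise and unstable baselines.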


## 5. Baseline Communication Modeling

We compute a rolling 6-month average email volume
for each employee.

This serves as the behavioral baseline.

Drift is measured relative to this rolling baseline.
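
The rolling window used in this notebook includes the current month, and it counts observed rows rather than calendar months. Including the current month pulls the baseline toward the value being tested, which dampens the measured drop. A possible variant, sketched below on made-up counts, shifts the series by one month so the baseline uses only prior months. This is an illustration of the trade-off, not the notebook's method.

```python
import pandas as pd

# Hypothetical employee: six steady months, then a sharp drop
counts = pd.Series([100, 100, 100, 100, 100, 100, 10])

# Baseline as computed in this notebook: window includes the current month
inclusive_avg = counts.rolling(window=6, min_periods=3).mean()

# Variant: baseline from the previous months only
trailing_avg = counts.shift(1).rolling(window=6, min_periods=3).mean()

drift_inclusive = (counts - inclusive_avg) / inclusive_avg
drift_trailing = (counts - trailing_avg) / trailing_avg

print(drift_inclusive.iloc[-1], drift_trailing.iloc[-1])
```

For the final month the inclusive baseline is 85 and the trailing baseline is 100, so the same drop scores about -0.88 in one case and -0.90 in the other.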


In [None]:
monthly_volume = monthly_volume.sort_values(["employee", "year_month"])

monthly_volume["rolling_avg"] = (
    monthly_volume
    .groupby("employee")["email_count"]
    .transform(lambda x: x.rolling(window=6, min_periods=3).mean())
)

monthly_volume.head(10)

Unnamed: 0,employee,year_month,email_count,rolling_avg
11,40enron,2001-02,5,
12,40enron,2001-03,26,
13,40enron,2001-04,55,28.666667
14,40enron,2001-05,496,145.5
15,40enron,2001-06,552,226.8
16,40enron,2001-07,206,223.333333
17,40enron,2001-08,304,273.166667
18,40enron,2001-09,616,371.5
19,40enron,2001-10,175,391.5
52,a..howard,2001-09,1,


## 6. Drift Score Calculation

Drift is defined as:

$$
\text{drift} = \frac{\text{current month volume} - \text{rolling average}}{\text{rolling average}}
$$

Interpretation:
- Negative drift → reduced communication
- Large negative drift → potential disengagement
- Positive drift → increased activity


In [None]:
monthly_volume["drift_score"] = (
    (monthly_volume["email_count"] - monthly_volume["rolling_avg"])
    / monthly_volume["rolling_avg"]
)

monthly_volume.head()

Unnamed: 0,employee,year_month,email_count,rolling_avg,drift_score
11,40enron,2001-02,5,,
12,40enron,2001-03,26,,
13,40enron,2001-04,55,28.666667,0.918605
14,40enron,2001-05,496,145.5,2.408935
15,40enron,2001-06,552,226.8,1.433862
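
To make the calculation concrete, the first five `40enron` rows from the table above can be reproduced in isolation. This is a small sketch that copies the `email_count` values from that output and applies the same rolling mean and drift formula.

```python
import pandas as pd

# email_count values copied from the 40enron rows shown above
counts = pd.Series(
    [5, 26, 55, 496, 552],
    index=["2001-02", "2001-03", "2001-04", "2001-05", "2001-06"],
)

# Same baseline as the notebook: up to 6 months, at least 3 observations
rolling_avg = counts.rolling(window=6, min_periods=3).mean()

# Relative deviation of each month from its baseline
drift = (counts - rolling_avg) / rolling_avg

print(pd.DataFrame({"email_count": counts, "rolling_avg": rolling_avg, "drift": drift}))
```

The first two months have no baseline because fewer than three observations are available, which is why their drift is missing.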


### Why 6-Month Rolling Baseline?

A 6-month window balances:

- Responsiveness to change
- Stability of baseline estimation

Shorter windows may overreact to noise.
Longer windows may hide gradual disengagement.


In [None]:
monthly_volume["behavioral_flag"] = monthly_volume["drift_score"] < -0.5

In [None]:
monthly_volume["drift_score"].describe()

Unnamed: 0,drift_score
count,12886.0
mean,0.045583
std,0.793724
min,-0.997444
25%,-0.547894
50%,-0.142857
75%,0.44
max,4.626401


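### Drift Statistics Interpretation

Across the 12,886 scored employee-months, the drift distribution shows:

- A slightly negative median (-0.14) and a 25th percentile of -0.55,
  so more than a quarter of scored months fall below the -0.5 flag threshold
- Some extreme drops nearing -1 (minimum -0.997, near-total communication collapse)
- Occasional positive spikes up to 4.63 (likely temporary workload bursts)

This is consistent with realistic behavioral variability.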


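In [None]:
# Standardize drift first; behavioral_risk below depends on drift_z
monthly_volume["drift_z"] = (
    (monthly_volume["drift_score"] - monthly_volume["drift_score"].mean())
    / monthly_volume["drift_score"].std()
)

In [None]:
# Invert the sign so that larger values mean larger communication drops
monthly_volume["behavioral_risk"] = -monthly_volume["drift_z"]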

### Drift Distribution Analysis

Z-scoring is a linear rescaling, so the shape of the drift distribution is unchanged:

- Median slightly negative → moderate communication decline is common
- Extreme negative values → potential disengagement
- Extreme positive values → temporary communication bursts

This is consistent with realistic behavioral variability.


## 7. Behavioral Risk Standardization

Raw drift values are standardized using Z-score normalization,
with a single mean and standard deviation computed across all employee-months
(not per employee).

We invert the standardized value so that:

Higher values = Higher behavioral risk


In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
monthly_volume["behavioral_risk_scaled"] = scaler.fit_transform(
    monthly_volume[["behavioral_risk"]]
)

In [None]:
monthly_volume["behavioral_risk_scaled"].min(), monthly_volume["behavioral_risk_scaled"].max()

(0.0, 1.0)

### Why Normalize to 0–1?

To integrate behavioral risk with structured HR risk,
both signals must be on comparable scales.

Min–Max scaling ensures:

0 = Lowest behavioral risk  
1 = Highest behavioral risk  

This prepares the signal for multi-model fusion.
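
One property worth knowing: Min–Max scaling a negated Z-score gives the same result as Min–Max scaling the negated raw drift, because the mean and standard deviation cancel out. The scaled score reduces to `(max drift - drift) / (max drift - min drift)`. The sketch below checks this on four illustrative drift values: the minimum and maximum from the `describe()` output above, zero, and the final `40enron` month computed as (175 - 391.5) / 391.5. The four-value series is an assumption for demonstration, not the real distribution.

```python
import pandas as pd

# Illustrative drift values: observed min, 40enron's last month, zero, observed max
drift = pd.Series([-0.997444, (175 - 391.5) / 391.5, 0.0, 4.626401])

# Route used in the notebook: z-score, negate, then min-max scale
z = (drift - drift.mean()) / drift.std()
risk = -z
scaled_via_z = (risk - risk.min()) / (risk.max() - risk.min())

# Direct route: the z-score step cancels out algebraically
scaled_direct = (drift.max() - drift) / (drift.max() - drift.min())

print(pd.DataFrame({"drift": drift, "via_z": scaled_via_z, "direct": scaled_direct}))
```

With the observed extremes, a drift of exactly zero already maps to roughly 0.82, so a single large positive spike compresses most scores into the upper part of the range. Clipping or a rank-based scaling could reduce that sensitivity, though neither is evaluated here.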


In [None]:
latest_behavioral = (
    monthly_volume
    .sort_values(["employee", "year_month"])
    .groupby("employee")
    .tail(1)
    [["employee", "behavioral_risk_scaled"]]
)

latest_behavioral.head()

Unnamed: 0,employee,behavioral_risk_scaled
19,40enron,0.920972
58,a..howard,0.932835
86,a..martin,0.949038
98,a..roberts,0.913005
113,a..shankman,0.931421


## Final Behavioral Output

For each employee, we extract the most recent
behavioral risk score.

Output format:

employee | behavioral_risk_scaled

This signal represents each employee's behavioral disengagement
as of their most recent active month (which can differ between employees)
and will be fused with HR-based attrition risk
in the next module.


## Limitations and Future Improvements

Current behavioral modeling uses:

- Email volume only
- Rolling average drift detection

Limitations:

- Does not capture sentiment
- Does not analyze communication network structure
- Assumes reduced communication correlates with disengagement

Future improvements may include:

- Network centrality metrics
- Interaction diversity indices
- Anomaly detection models
- Sentiment analysis
