# LANL Authentication Data Exploration

## Purpose
This notebook examines LANL authentication records to derive features that describe user access patterns.
I document the steps I used, the reasoning behind key choices, and the resulting feature set so the analysis can be reproduced and defended in an oral exam.


# Phase 0 – Environment and Dataset Setup

I confirm the interpreter, dependencies, and file paths, then read a small sample before working with the full LANL files. This avoids unexpected failures when processing the large compressed logs.

In [1]:
import sys
print(sys.executable)

/Users/akeshchandrasiri/spear-phishing-research/.venv/bin/python


In [2]:
import pandas as pd
pd.__version__

'2.3.3'

In [3]:
from pathlib import Path

LANL_DIR = Path(
    "/Users/akeshchandrasiri/Library/CloudStorage/GoogleDrive-akeshchandrasiri@gmail.com/My Drive/LANL"
)

LANL_DIR

PosixPath('/Users/akeshchandrasiri/Library/CloudStorage/GoogleDrive-akeshchandrasiri@gmail.com/My Drive/LANL')

With the interpreter and pandas version confirmed, I list the dataset directory and check that the authentication log exists before reading any data from it.

In [4]:
list(LANL_DIR.iterdir())

[PosixPath('/Users/akeshchandrasiri/Library/CloudStorage/GoogleDrive-akeshchandrasiri@gmail.com/My Drive/LANL/redteam.txt.gz'),
 PosixPath('/Users/akeshchandrasiri/Library/CloudStorage/GoogleDrive-akeshchandrasiri@gmail.com/My Drive/LANL/auth.txt.gz'),
 PosixPath('/Users/akeshchandrasiri/Library/CloudStorage/GoogleDrive-akeshchandrasiri@gmail.com/My Drive/LANL/flows.txt.gz'),
 PosixPath('/Users/akeshchandrasiri/Library/CloudStorage/GoogleDrive-akeshchandrasiri@gmail.com/My Drive/LANL/dns.txt.gz')]

In [5]:
AUTH_LOG = LANL_DIR / "auth.txt.gz"

AUTH_LOG.exists(), AUTH_LOG

(True,
 PosixPath('/Users/akeshchandrasiri/Library/CloudStorage/GoogleDrive-akeshchandrasiri@gmail.com/My Drive/LANL/auth.txt.gz'))

# Phase 1 – LANL Authentication Data Ingestion & Feature Engineering

I load a sample of the LANL authentication log to identify column meanings and extract basic features. These features - event counts, distinct hosts, and inter-event gaps - form the basis for later anomaly scoring.

In [6]:
import pandas as pd

sample_auth = pd.read_csv(
    AUTH_LOG,
    compression="gzip",
    sep=",",
    header=None,
    nrows=50
)

sample_auth.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,1,ANONYMOUS LOGON@C586,ANONYMOUS LOGON@C586,C1250,C586,NTLM,Network,LogOn,Success
1,1,ANONYMOUS LOGON@C586,ANONYMOUS LOGON@C586,C586,C586,?,Network,LogOff,Success
2,1,C101$@DOM1,C101$@DOM1,C988,C988,?,Network,LogOff,Success
3,1,C1020$@DOM1,SYSTEM@C1020,C1020,C1020,Negotiate,Service,LogOn,Success
4,1,C1021$@DOM1,C1021$@DOM1,C1021,C625,Kerberos,Network,LogOn,Success


Schema discovery - inspect a small sample to determine column order and types. I prefer confirming the schema empirically rather than assuming a header layout.

In [7]:
sample_auth.shape
sample_auth.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,1,ANONYMOUS LOGON@C586,ANONYMOUS LOGON@C586,C1250,C586,NTLM,Network,LogOn,Success
1,1,ANONYMOUS LOGON@C586,ANONYMOUS LOGON@C586,C586,C586,?,Network,LogOff,Success
2,1,C101$@DOM1,C101$@DOM1,C988,C988,?,Network,LogOff,Success
3,1,C1020$@DOM1,SYSTEM@C1020,C1020,C1020,Negotiate,Service,LogOn,Success
4,1,C1021$@DOM1,C1021$@DOM1,C1021,C625,Kerberos,Network,LogOn,Success
5,1,C1035$@DOM1,C1035$@DOM1,C1035,C586,Kerberos,Network,LogOn,Success
6,1,C1035$@DOM1,C1035$@DOM1,C586,C586,?,Network,LogOff,Success
7,1,C1069$@DOM1,SYSTEM@C1069,C1069,C1069,Negotiate,Service,LogOn,Success
8,1,C1085$@DOM1,C1085$@DOM1,C1085,C612,Kerberos,Network,LogOn,Success
9,1,C1085$@DOM1,C1085$@DOM1,C612,C612,?,Network,LogOff,Success


Observed schema - each row represents one authentication event with nine fields: timestamp, source user, destination user, source host, destination host, authentication type, logon type, event type, and result.
These fields are sufficient to characterise who accessed what, when, and whether the attempt succeeded - information I use to build temporal and frequency features.

In [8]:
sample_auth.columns = [
    "timestamp",
    "src_user",
    "dst_user",
    "src_host",
    "dst_host",
    "auth_type",
    "logon_type",
    "event_type",
    "result"
]

sample_auth.head()

Unnamed: 0,timestamp,src_user,dst_user,src_host,dst_host,auth_type,logon_type,event_type,result
0,1,ANONYMOUS LOGON@C586,ANONYMOUS LOGON@C586,C1250,C586,NTLM,Network,LogOn,Success
1,1,ANONYMOUS LOGON@C586,ANONYMOUS LOGON@C586,C586,C586,?,Network,LogOff,Success
2,1,C101$@DOM1,C101$@DOM1,C988,C988,?,Network,LogOff,Success
3,1,C1020$@DOM1,SYSTEM@C1020,C1020,C1020,Negotiate,Service,LogOn,Success
4,1,C1021$@DOM1,C1021$@DOM1,C1021,C625,Kerberos,Network,LogOn,Success


Event distribution - examine counts of results, event types, logon types, and authentication types to characterise common vs. rare behaviours in the sample.

In [9]:
# Only the last expression in a cell is displayed; event_type counts are shown in the next cell.
sample_auth["result"].value_counts()

result
Success    50
Name: count, dtype: int64

In [10]:
sample_auth["event_type"].value_counts()

event_type
LogOn     30
LogOff    19
TGS        1
Name: count, dtype: int64

In [11]:
sample_auth["logon_type"].value_counts()


logon_type
Network    35
Service    13
Batch       1
?           1
Name: count, dtype: int64

In [12]:
sample_auth["auth_type"].value_counts()

auth_type
?            20
Kerberos     15
Negotiate    14
NTLM          1
Name: count, dtype: int64

In [13]:
user_activity = sample_auth.groupby("src_user").agg(
    total_events=("event_type", "count"),
    unique_hosts=("dst_host", "nunique"),
    logon_events=("event_type", lambda x: (x == "LogOn").sum()),
    logoff_events=("event_type", lambda x: (x == "LogOff").sum())
)

user_activity

Unnamed: 0_level_0,total_events,unique_hosts,logon_events,logoff_events
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ANONYMOUS LOGON@C586,2,1,1,1
C101$@DOM1,1,1,0,1
C1020$@DOM1,1,1,1,0
C1021$@DOM1,1,1,1,0
C1035$@DOM1,2,1,1,1
C1069$@DOM1,1,1,1,0
C1085$@DOM1,2,1,1,1
C1151$@DOM1,1,1,1,0
C1154$@DOM1,1,1,1,0
C1164$@DOM1,1,1,0,1


In [14]:
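# LANL timestamps are seconds relative to the start of the capture, so the 1970 dates
# produced here are placeholders: "hour" and "day" are relative to an arbitrary origin,
# not wall-clock values.
sample_auth["timestamp"] = pd.to_datetime(sample_auth["timestamp"], unit="s")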

sample_auth["hour"] = sample_auth["timestamp"].dt.hour
sample_auth["day"] = sample_auth["timestamp"].dt.dayofweek

sample_auth[["timestamp", "hour", "day"]].head()

Unnamed: 0,timestamp,hour,day
0,1970-01-01 00:00:01,0,3
1,1970-01-01 00:00:01,0,3
2,1970-01-01 00:00:01,0,3
3,1970-01-01 00:00:01,0,3
4,1970-01-01 00:00:01,0,3


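Temporal profiling - compute the mean and standard deviation of the event hour per user, plus the number of distinct days on which each user is active, to capture typical activity windows and departures from them. In the 50-row sample the output is uninformative: the hour is 0 for the users shown, and the standard deviation is NaN for users with a single event.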

In [15]:
user_time_profile = sample_auth.groupby("src_user").agg(
    avg_login_hour=("hour", "mean"),
    login_hour_std=("hour", "std"),
    active_days=("day", "nunique")
)

user_time_profile

Unnamed: 0_level_0,avg_login_hour,login_hour_std,active_days
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ANONYMOUS LOGON@C586,0.0,0.0,1
C101$@DOM1,0.0,,1
C1020$@DOM1,0.0,,1
C1021$@DOM1,0.0,,1
C1035$@DOM1,0.0,0.0,1
C1069$@DOM1,0.0,,1
C1085$@DOM1,0.0,0.0,1
C1151$@DOM1,0.0,,1
C1154$@DOM1,0.0,,1
C1164$@DOM1,0.0,,1
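Because the timestamps are relative seconds rather than calendar times, the same hour and day features can be derived with integer arithmetic, without passing through 1970 datetimes. The sketch below is an illustration on made-up rows, not part of the original analysis; the column names `rel_hour` and `rel_day` and the toy values are assumptions.

```python
import pandas as pd

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400

# Toy events: integer seconds since the start of the capture (hypothetical values).
events = pd.DataFrame(
    {
        "src_user": ["U1", "U1", "U2"],
        "timestamp": [1, 3601, 90000],
    }
)

# Hour within the (relative) day and index of the (relative) day.
events["rel_hour"] = (events["timestamp"] % SECONDS_PER_DAY) // SECONDS_PER_HOUR
events["rel_day"] = events["timestamp"] // SECONDS_PER_DAY

print(events)
```

The resulting hours are offsets from the capture start, so they describe periodicity but cannot be read as local working hours unless the start time is known.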


In [16]:
# Treat timestamps as relative time (not absolute)
sample_auth["relative_time"] = sample_auth["timestamp"].astype("int64") // 10**9

# Session gap per user
sample_auth["prev_time"] = sample_auth.groupby("src_user")["relative_time"].shift(1)
sample_auth["time_gap"] = sample_auth["relative_time"] - sample_auth["prev_time"]

sample_auth[["src_user", "relative_time", "time_gap"]].head(10)

Unnamed: 0,src_user,relative_time,time_gap
0,ANONYMOUS LOGON@C586,1,
1,ANONYMOUS LOGON@C586,1,0.0
2,C101$@DOM1,1,
3,C1020$@DOM1,1,
4,C1021$@DOM1,1,
5,C1035$@DOM1,1,
6,C1035$@DOM1,1,0.0
7,C1069$@DOM1,1,
8,C1085$@DOM1,1,
9,C1085$@DOM1,1,0.0


In [17]:
user_anomaly_profile = sample_auth.groupby("src_user").agg(
    avg_gap=("time_gap", "mean"),
    max_gap=("time_gap", "max"),
    failed_logins=("result", lambda x: (x != "Success").sum()),
    unique_dst_hosts=("dst_host", "nunique")
)

user_anomaly_profile

Unnamed: 0_level_0,avg_gap,max_gap,failed_logins,unique_dst_hosts
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ANONYMOUS LOGON@C586,0.0,0.0,0,1
C101$@DOM1,,,0,1
C1020$@DOM1,,,0,1
C1021$@DOM1,,,0,1
C1035$@DOM1,0.0,0.0,0,1
C1069$@DOM1,,,0,1
C1085$@DOM1,0.0,0.0,0,1
C1151$@DOM1,,,0,1
C1154$@DOM1,,,0,1
C1164$@DOM1,,,0,1


In [18]:
import os

auth_file_path = "/Users/akeshchandrasiri/Library/CloudStorage/GoogleDrive-akeshchandrasiri@gmail.com/My Drive/LANL/auth.txt.gz"

os.path.exists(auth_file_path)

True

In [19]:
auth_columns = [
    "timestamp",
    "src_user",
    "dst_user",
    "src_host",
    "dst_host",
    "auth_type",
    "logon_type",
    "event_type",
    "result"
]

In [20]:
sample_auth = pd.read_csv(
    auth_file_path,
    sep=",",
    header=None,
    names=auth_columns,
    nrows=5000   # NOT full dataset yet
)

In [21]:
sample_auth.head()
sample_auth.shape

(5000, 9)

In [22]:
sample_auth.columns

Index(['timestamp', 'src_user', 'dst_user', 'src_host', 'dst_host',
       'auth_type', 'logon_type', 'event_type', 'result'],
      dtype='object')
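The `nrows=5000` read keeps the exploration light, but the same per-user aggregates will eventually be needed over the whole log. The sketch below shows one possible way to stream a gzip-compressed file in chunks; it is not part of the original notebook. The helper name `count_events_per_user`, the `chunksize` values, and the in-memory demo file are assumptions for illustration, and the three demo rows are copied from the sample output above.

```python
import gzip
import io
from collections import Counter

import pandas as pd

AUTH_COLUMNS = [
    "timestamp", "src_user", "dst_user", "src_host", "dst_host",
    "auth_type", "logon_type", "event_type", "result",
]


def count_events_per_user(source, chunksize=100_000):
    """Stream a gzip-compressed auth log and count events per src_user."""
    counts = Counter()
    reader = pd.read_csv(
        source,
        compression="gzip",
        header=None,
        names=AUTH_COLUMNS,
        usecols=["src_user"],   # read only the column needed for this aggregate
        chunksize=chunksize,    # returns an iterator of DataFrames
    )
    for chunk in reader:
        counts.update(chunk["src_user"].value_counts().to_dict())
    return counts


# Tiny in-memory stand-in for auth.txt.gz.
demo_rows = (
    "1,C101$@DOM1,C101$@DOM1,C988,C988,?,Network,LogOff,Success\n"
    "1,C1035$@DOM1,C1035$@DOM1,C1035,C586,Kerberos,Network,LogOn,Success\n"
    "1,C1035$@DOM1,C1035$@DOM1,C586,C586,?,Network,LogOff,Success\n"
)
demo_file = io.BytesIO(gzip.compress(demo_rows.encode("utf-8")))
demo_counts = count_events_per_user(demo_file, chunksize=2)
print(demo_counts)
```

Counts can be accumulated across chunks because they are additive; features such as distinct hosts or time gaps would need per-user state carried between chunks.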

Session gap analysis - measure time between consecutive events per user to identify session lengths and unusually short or long gaps that may indicate automated activity or infrequent usage.

In [23]:
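# NOTE: after the reload in cell 20, "timestamp" is already integer seconds, not a
# datetime, so the nanosecond-to-second division below floors every value to 0
# (see the output). Cells 36-38 recompute the gap from the raw timestamp instead.
sample_auth["relative_time"] = sample_auth["timestamp"].astype("int64") // 10**9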

# Previous event time per user
sample_auth["prev_time"] = sample_auth.groupby("src_user")["relative_time"].shift(1)

# Time gap between consecutive events
sample_auth["time_gap"] = sample_auth["relative_time"] - sample_auth["prev_time"]

# Verify
sample_auth[["src_user", "relative_time", "time_gap"]].head(10)

Unnamed: 0,src_user,relative_time,time_gap
0,ANONYMOUS LOGON@C586,0,
1,ANONYMOUS LOGON@C586,0,0.0
2,C101$@DOM1,0,
3,C1020$@DOM1,0,
4,C1021$@DOM1,0,
5,C1035$@DOM1,0,
6,C1035$@DOM1,0,0.0
7,C1069$@DOM1,0,
8,C1085$@DOM1,0,
9,C1085$@DOM1,0,0.0
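The all-zero `relative_time` column above is what integer division by `10**9` produces when the input is already in seconds rather than nanoseconds. A minimal sketch on toy data, assuming integer-second timestamps, reproduces the effect and shows a gap computation on the raw values; the user names and timestamps are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "src_user": ["U1", "U2", "U1", "U1", "U2"],
        "timestamp": [3, 5, 29, 39, 41],  # integer seconds (hypothetical)
    }
)

# Pitfall: small integer seconds divided by 10**9 all floor to 0.
collapsed = toy["timestamp"].astype("int64") // 10**9

# Gap between consecutive events of the same user, computed on the raw seconds.
toy = toy.sort_values(["src_user", "timestamp"])
toy["time_gap"] = toy.groupby("src_user")["timestamp"].diff()

print(collapsed.tolist())
print(toy)
```

The first event of each user has no predecessor, so its gap is NaN rather than 0.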


In [24]:
"time_gap" in sample_auth.columns

True

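Baseline behavior calculation - compute population averages for event frequency, unique hosts and session gaps.
I use population means because the dataset is unlabeled; these aggregate values give a practical reference for measuring individual deviations. Note that `avg_session_gap` comes out as 0.0 in this first pass because every `relative_time` value from cell 23 is 0; the baseline is recomputed from corrected gaps in cell 43.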

In [25]:
baseline = {
    "avg_events_per_user": sample_auth.groupby("src_user").size().mean(),
    "avg_unique_hosts": sample_auth.groupby("src_user")["dst_host"].nunique().mean(),
    "avg_session_gap": sample_auth["time_gap"].dropna().mean()
}

baseline

{'avg_events_per_user': np.float64(2.765486725663717),
 'avg_unique_hosts': np.float64(1.3766592920353982),
 'avg_session_gap': np.float64(0.0)}

In [26]:
user_features = sample_auth.groupby("src_user").agg(
    total_events=("event_type", "count"),
    unique_hosts=("dst_host", "nunique"),
    avg_time_gap=("time_gap", "mean")
)

user_features.head()

Unnamed: 0_level_0,total_events,unique_hosts,avg_time_gap
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ANONYMOUS LOGON@C1065,3,1,0.0
ANONYMOUS LOGON@C1529,1,1,
ANONYMOUS LOGON@C1715,1,1,
ANONYMOUS LOGON@C1719,1,1,
ANONYMOUS LOGON@C1909,10,1,0.0


# Phase 2 - Rule-Based Behavioral Anomaly Detection

This phase applies interpretable heuristics to aggregated user behavior to quantify deviations from baseline activity.
Baseline statistics and deviation metrics produce a rule-based anomaly score that highlights suspicious patterns before ML is applied.

Deviation metrics - convert raw counts into differences from the population baseline.
Deviations are more meaningful than raw counts because they reveal how a single account differs from typical behaviour in the same dataset.

In [27]:
user_features["event_deviation"] = (
    user_features["total_events"] - baseline["avg_events_per_user"]
)

user_features["host_deviation"] = (
    user_features["unique_hosts"] - baseline["avg_unique_hosts"]
)

user_features["gap_deviation"] = (
    user_features["avg_time_gap"] - baseline["avg_session_gap"]
)

user_features.head()

Unnamed: 0_level_0,total_events,unique_hosts,avg_time_gap,event_deviation,host_deviation,gap_deviation
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ANONYMOUS LOGON@C1065,3,1,0.0,0.234513,-0.376659,0.0
ANONYMOUS LOGON@C1529,1,1,,-1.765487,-0.376659,
ANONYMOUS LOGON@C1715,1,1,,-1.765487,-0.376659,
ANONYMOUS LOGON@C1719,1,1,,-1.765487,-0.376659,
ANONYMOUS LOGON@C1909,10,1,0.0,7.234513,-0.376659,0.0
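Raw deviations stay in each feature's own units, so a feature with a wide range (event counts here) can dominate any sum built from them. One possible alternative, shown only as a sketch and not used in this notebook, is to divide each deviation by the feature's standard deviation (a z-score). The toy values and the `z_score_sum` column name are assumptions.

```python
import pandas as pd

# Toy per-user features (hypothetical values).
toy_features = pd.DataFrame(
    {
        "total_events": [1, 1, 2, 3, 84],
        "unique_hosts": [1, 1, 1, 2, 15],
    },
    index=["a", "b", "c", "d", "e"],
)

# Standardise: subtract the population mean, divide by the population std.
z = (toy_features - toy_features.mean()) / toy_features.std(ddof=0)

# Sum of absolute z-scores: each feature now contributes on a comparable scale.
toy_features["z_score_sum"] = z.abs().sum(axis=1)

print(toy_features.sort_values("z_score_sum", ascending=False))
```

With heavily skewed counts like these, the mean and standard deviation are themselves pulled by the extreme users, so a median-based scale might be worth comparing.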


In [28]:
features = user_features.copy()

# Single-event users have no time gap, so their NaN gap values are scored as 0.
features = features.fillna(0)

features["anomaly_score"] = (
    abs(features["event_deviation"]) +
    abs(features["host_deviation"]) +
    abs(features["gap_deviation"])
)

features.sort_values("anomaly_score", ascending=False).head(10)

Unnamed: 0_level_0,total_events,unique_hosts,avg_time_gap,event_deviation,host_deviation,gap_deviation,anomaly_score
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
U6@DOM1,84,15,0.0,81.234513,13.623341,0.0,94.857854
U78@DOM1,83,8,0.0,80.234513,6.623341,0.0,86.857854
U22@DOM1,78,11,0.0,75.234513,9.623341,0.0,84.857854
C599$@DOM1,78,7,0.0,75.234513,5.623341,0.0,80.857854
U7@DOM1,68,8,0.0,65.234513,6.623341,0.0,71.857854
U66@DOM1,53,21,0.0,50.234513,19.623341,0.0,69.857854
U3@DOM1,60,12,0.0,57.234513,10.623341,0.0,67.857854
U4@DOM1,59,11,0.0,56.234513,9.623341,0.0,65.857854
ANONYMOUS LOGON@C586,63,1,0.0,60.234513,-0.376659,0.0,60.611173
C104$@DOM1,54,9,0.0,51.234513,7.623341,0.0,58.857854


In [29]:
user_features.columns

Index(['total_events', 'unique_hosts', 'avg_time_gap', 'event_deviation',
       'host_deviation', 'gap_deviation'],
      dtype='object')

Rule-based anomaly score - sum absolute deviations across selected features to produce a single interpretable score.
I use absolute values so that increases and decreases do not cancel out; the sum reflects total deviation magnitude and is easy to explain.

In [30]:
user_features["anomaly_score"] = (
    user_features["event_deviation"].abs()
    + user_features["host_deviation"].abs()
    + user_features["gap_deviation"].fillna(0).abs()
)

user_features.head()

Unnamed: 0_level_0,total_events,unique_hosts,avg_time_gap,event_deviation,host_deviation,gap_deviation,anomaly_score
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
ANONYMOUS LOGON@C1065,3,1,0.0,0.234513,-0.376659,0.0,0.611173
ANONYMOUS LOGON@C1529,1,1,,-1.765487,-0.376659,,2.142146
ANONYMOUS LOGON@C1715,1,1,,-1.765487,-0.376659,,2.142146
ANONYMOUS LOGON@C1719,1,1,,-1.765487,-0.376659,,2.142146
ANONYMOUS LOGON@C1909,10,1,0.0,7.234513,-0.376659,0.0,7.611173


In [31]:
"user_features columns:", list(user_features.columns)

('user_features columns:',
 ['total_events',
  'unique_hosts',
  'avg_time_gap',
  'event_deviation',
  'host_deviation',
  'gap_deviation',
  'anomaly_score'])

In [32]:
# Flag users whose anomaly score is at or above the 95th percentile.
threshold = user_features["anomaly_score"].quantile(0.95)

suspicious_users = user_features[
    user_features["anomaly_score"] >= threshold
]

suspicious_users

Unnamed: 0_level_0,total_events,unique_hosts,avg_time_gap,event_deviation,host_deviation,gap_deviation,anomaly_score
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
ANONYMOUS LOGON@C586,63,1,0.0,60.234513,-0.376659,0.0,60.611173
C1035$@DOM1,14,1,0.0,11.234513,-0.376659,0.0,11.611173
C104$@DOM1,54,9,0.0,51.234513,7.623341,0.0,58.857854
C1065$@DOM1,14,3,0.0,11.234513,1.623341,0.0,12.857854
C1114$@DOM1,24,7,0.0,21.234513,5.623341,0.0,26.857854
...,...,...,...,...,...,...,...
U88@DOM1,8,4,0.0,5.234513,2.623341,0.0,7.857854
U8@DOM1,38,6,0.0,35.234513,4.623341,0.0,39.857854
U90@DOM1,20,4,0.0,17.234513,2.623341,0.0,19.857854
U94@DOM1,21,3,0.0,18.234513,1.623341,0.0,19.857854


Behavior classification rules - apply simple heuristics to map deviation patterns to human-interpretable behaviour types.
Heuristics are acceptable here because they are transparent and link directly to observable attacker-like actions; they also provide realistic scenarios for simulated phishing.

In [33]:
def classify_behavior(row):
    if row["event_deviation"] > 30 and row["host_deviation"] < 0:
        return "High-volume automated activity"
    elif row["event_deviation"] > 20 and row["host_deviation"] > 5:
        return "Lateral movement pattern"
    elif row["host_deviation"] > 10:
        return "Unusual multi-host access"
    else:
        return "Normal variation"

user_features["behavior_type"] = user_features.apply(classify_behavior, axis=1)

user_features[["anomaly_score", "behavior_type"]].sort_values(
    by="anomaly_score", ascending=False
).head(10)

Unnamed: 0_level_0,anomaly_score,behavior_type
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1
U6@DOM1,94.857854,Lateral movement pattern
U78@DOM1,86.857854,Lateral movement pattern
U22@DOM1,84.857854,Lateral movement pattern
C599$@DOM1,80.857854,Lateral movement pattern
U7@DOM1,71.857854,Lateral movement pattern
U66@DOM1,69.857854,Lateral movement pattern
U3@DOM1,67.857854,Lateral movement pattern
U4@DOM1,65.857854,Lateral movement pattern
ANONYMOUS LOGON@C586,60.611173,High-volume automated activity
C104$@DOM1,58.857854,Lateral movement pattern
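The row-wise `apply` above is easy to read but evaluates a Python function once per user. The same rules can be sketched in vectorised form with `numpy.select`, which picks the first matching condition, mirroring the `if`/`elif` order. The toy rows below are made up to hit each branch; the thresholds and labels are copied from `classify_behavior`.

```python
import numpy as np
import pandas as pd

# Toy deviations, one row per rule branch (hypothetical values).
toy = pd.DataFrame(
    {
        "event_deviation": [60.2, 81.2, 5.0, -1.8],
        "host_deviation": [-0.4, 13.6, 12.0, -0.4],
    },
    index=["auto", "lateral", "multi_host", "typical"],
)

# Conditions are checked in order; the first one that holds decides the label.
conditions = [
    (toy["event_deviation"] > 30) & (toy["host_deviation"] < 0),
    (toy["event_deviation"] > 20) & (toy["host_deviation"] > 5),
    toy["host_deviation"] > 10,
]
labels = [
    "High-volume automated activity",
    "Lateral movement pattern",
    "Unusual multi-host access",
]
toy["behavior_type"] = np.select(conditions, labels, default="Normal variation")

print(toy)
```

Note that the "lateral" row also satisfies the third condition; it is labelled by the second rule because that rule comes first, exactly as in the `elif` chain.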


In [34]:
phishing_scenarios = {
    "High-volume automated activity": {
        "theme": "Security Alert",
        "trigger": "Unusual automated login activity detected",
        "goal": "Prompt user to verify activity"
    },
    "Lateral movement pattern": {
        "theme": "Internal IT Request",
        "trigger": "New internal access request detected",
        "goal": "Elicit credential or approval action"
    },
    "Unusual multi-host access": {
        "theme": "VPN / Remote Access Warning",
        "trigger": "Multiple device access detected",
        "goal": "Prompt security confirmation"
    },
    "Normal variation": {
        "theme": "None",
        "trigger": "No action",
        "goal": "No simulation"
    }
}

user_features["phishing_theme"] = user_features["behavior_type"].map(
    lambda x: phishing_scenarios[x]["theme"]
)

user_features[["anomaly_score", "behavior_type", "phishing_theme"]].head(10)

Unnamed: 0_level_0,anomaly_score,behavior_type,phishing_theme
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ANONYMOUS LOGON@C1065,0.611173,Normal variation,
ANONYMOUS LOGON@C1529,2.142146,Normal variation,
ANONYMOUS LOGON@C1715,2.142146,Normal variation,
ANONYMOUS LOGON@C1719,2.142146,Normal variation,
ANONYMOUS LOGON@C1909,7.611173,Normal variation,
ANONYMOUS LOGON@C1972,2.142146,Normal variation,
ANONYMOUS LOGON@C2021,2.142146,Normal variation,
ANONYMOUS LOGON@C2235,2.142146,Normal variation,
ANONYMOUS LOGON@C2626,2.142146,Normal variation,
ANONYMOUS LOGON@C457,5.611173,Normal variation,


In [35]:
# Inspect top anomalous users regardless of behavior label
user_features.sort_values(
    by="anomaly_score",
    ascending=False
).head(15)

Unnamed: 0_level_0,total_events,unique_hosts,avg_time_gap,event_deviation,host_deviation,gap_deviation,anomaly_score,behavior_type,phishing_theme
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
U6@DOM1,84,15,0.0,81.234513,13.623341,0.0,94.857854,Lateral movement pattern,Internal IT Request
U78@DOM1,83,8,0.0,80.234513,6.623341,0.0,86.857854,Lateral movement pattern,Internal IT Request
U22@DOM1,78,11,0.0,75.234513,9.623341,0.0,84.857854,Lateral movement pattern,Internal IT Request
C599$@DOM1,78,7,0.0,75.234513,5.623341,0.0,80.857854,Lateral movement pattern,Internal IT Request
U7@DOM1,68,8,0.0,65.234513,6.623341,0.0,71.857854,Lateral movement pattern,Internal IT Request
U66@DOM1,53,21,0.0,50.234513,19.623341,0.0,69.857854,Lateral movement pattern,Internal IT Request
U3@DOM1,60,12,0.0,57.234513,10.623341,0.0,67.857854,Lateral movement pattern,Internal IT Request
U4@DOM1,59,11,0.0,56.234513,9.623341,0.0,65.857854,Lateral movement pattern,Internal IT Request
ANONYMOUS LOGON@C586,63,1,0.0,60.234513,-0.376659,0.0,60.611173,High-volume automated activity,Security Alert
C104$@DOM1,54,9,0.0,51.234513,7.623341,0.0,58.857854,Lateral movement pattern,Internal IT Request


In [36]:
# Convert timestamp to numeric (relative time)
sample_auth["timestamp"] = pd.to_numeric(sample_auth["timestamp"], errors="coerce")

# Sort by user and time (critical)
sample_auth = sample_auth.sort_values(by=["src_user", "timestamp"])

In [37]:
sample_auth[["src_user", "timestamp"]].head()

Unnamed: 0,src_user,timestamp
842,ANONYMOUS LOGON@C1065,3
4021,ANONYMOUS LOGON@C1065,29
4848,ANONYMOUS LOGON@C1065,39
244,ANONYMOUS LOGON@C1529,2
2516,ANONYMOUS LOGON@C1715,8


In [38]:
sample_auth["time_gap"] = (
    sample_auth
    .groupby("src_user")["timestamp"]
    .diff()
)

In [39]:
sample_auth[["src_user", "timestamp", "time_gap"]].head(10)

Unnamed: 0,src_user,timestamp,time_gap
842,ANONYMOUS LOGON@C1065,3,
4021,ANONYMOUS LOGON@C1065,29,26.0
4848,ANONYMOUS LOGON@C1065,39,10.0
244,ANONYMOUS LOGON@C1529,2,
2516,ANONYMOUS LOGON@C1715,8,
4614,ANONYMOUS LOGON@C1719,36,
2310,ANONYMOUS LOGON@C1909,6,
2431,ANONYMOUS LOGON@C1909,7,1.0
3363,ANONYMOUS LOGON@C1909,19,12.0
3406,ANONYMOUS LOGON@C1909,20,1.0


In [41]:
user_features = sample_auth.groupby("src_user").agg(
    total_events=("event_type", "count"),
    unique_hosts=("dst_host", "nunique"),
    avg_time_gap=("time_gap", "mean")
)

user_features.head()

Unnamed: 0_level_0,total_events,unique_hosts,avg_time_gap
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ANONYMOUS LOGON@C1065,3,1,18.0
ANONYMOUS LOGON@C1529,1,1,
ANONYMOUS LOGON@C1715,1,1,
ANONYMOUS LOGON@C1719,1,1,
ANONYMOUS LOGON@C1909,10,1,3.666667


In [42]:
user_features.describe()

Unnamed: 0,total_events,unique_hosts,avg_time_gap
count,1808.0,1808.0,749.0
mean,2.765487,1.376659,5.836228
std,6.370613,1.248019,7.265567
min,1.0,1.0,0.0
25%,1.0,1.0,2.0
50%,1.0,1.0,2.0
75%,2.0,1.0,9.25
max,84.0,21.0,40.0


In [43]:
baseline = {
    "avg_events_per_user": user_features["total_events"].mean(),
    "avg_unique_hosts": user_features["unique_hosts"].mean(),
    "avg_time_gap": user_features["avg_time_gap"].dropna().mean()
}

baseline

{'avg_events_per_user': np.float64(2.765486725663717),
 'avg_unique_hosts': np.float64(1.3766592920353982),
 'avg_time_gap': np.float64(5.836228487478603)}

In [44]:
user_features["event_deviation"] = (
    user_features["total_events"] - baseline["avg_events_per_user"]
)

user_features["host_deviation"] = (
    user_features["unique_hosts"] - baseline["avg_unique_hosts"]
)

user_features["gap_deviation"] = (
    user_features["avg_time_gap"] - baseline["avg_time_gap"]
)

user_features.head()

Unnamed: 0_level_0,total_events,unique_hosts,avg_time_gap,event_deviation,host_deviation,gap_deviation
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ANONYMOUS LOGON@C1065,3,1,18.0,0.234513,-0.376659,12.163772
ANONYMOUS LOGON@C1529,1,1,,-1.765487,-0.376659,
ANONYMOUS LOGON@C1715,1,1,,-1.765487,-0.376659,
ANONYMOUS LOGON@C1719,1,1,,-1.765487,-0.376659,
ANONYMOUS LOGON@C1909,10,1,3.666667,7.234513,-0.376659,-2.169562


In [45]:
user_features.columns

Index(['total_events', 'unique_hosts', 'avg_time_gap', 'event_deviation',
       'host_deviation', 'gap_deviation'],
      dtype='object')

In [46]:
def classify_behavior(row):
    if row["event_deviation"] > 30 and row["host_deviation"] < 0:
        return "High-volume automated activity"
    elif row["event_deviation"] > 20 and row["host_deviation"] > 5:
        return "Lateral movement pattern"
    elif row["host_deviation"] > 10:
        return "Unusual multi-host access"
    else:
        return "Normal variation"

user_features["behavior_type"] = user_features.apply(classify_behavior, axis=1)

user_features[["event_deviation", "host_deviation", "behavior_type"]].head()

Unnamed: 0_level_0,event_deviation,host_deviation,behavior_type
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ANONYMOUS LOGON@C1065,0.234513,-0.376659,Normal variation
ANONYMOUS LOGON@C1529,-1.765487,-0.376659,Normal variation
ANONYMOUS LOGON@C1715,-1.765487,-0.376659,Normal variation
ANONYMOUS LOGON@C1719,-1.765487,-0.376659,Normal variation
ANONYMOUS LOGON@C1909,7.234513,-0.376659,Normal variation


In [47]:
def map_phishing_strategy(behavior):
    if behavior == "High-volume automated activity":
        return "Credential harvesting with urgency"
    elif behavior == "Lateral movement pattern":
        return "Internal trust exploitation"
    elif behavior == "Unusual multi-host access":
        return "Security alert impersonation"
    else:
        return "Generic low-risk phishing"

user_features["phishing_strategy"] = user_features["behavior_type"].apply(map_phishing_strategy)

user_features[["behavior_type", "phishing_strategy"]].head(10)

Unnamed: 0_level_0,behavior_type,phishing_strategy
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1
ANONYMOUS LOGON@C1065,Normal variation,Generic low-risk phishing
ANONYMOUS LOGON@C1529,Normal variation,Generic low-risk phishing
ANONYMOUS LOGON@C1715,Normal variation,Generic low-risk phishing
ANONYMOUS LOGON@C1719,Normal variation,Generic low-risk phishing
ANONYMOUS LOGON@C1909,Normal variation,Generic low-risk phishing
ANONYMOUS LOGON@C1972,Normal variation,Generic low-risk phishing
ANONYMOUS LOGON@C2021,Normal variation,Generic low-risk phishing
ANONYMOUS LOGON@C2235,Normal variation,Generic low-risk phishing
ANONYMOUS LOGON@C2626,Normal variation,Generic low-risk phishing
ANONYMOUS LOGON@C457,Normal variation,Generic low-risk phishing


LLM prompt generation - the model is only used to convert an assigned strategy into realistic message text for a controlled study.
I do not use the LLM for detection; prompts are for simulation and training exercises under ethical oversight.

In [48]:
def generate_phishing_prompt(row):
    if row["phishing_strategy"] == "Credential harvesting with urgency":
        return (
            "Write a high-urgency internal email requesting immediate credential verification. "
            "Tone: authoritative, time-sensitive. Context: system maintenance."
        )

    elif row["phishing_strategy"] == "Internal trust exploitation":
        return (
            "Write an internal email impersonating a trusted colleague requesting access "
            "to shared internal resources. Tone: casual but credible."
        )

    elif row["phishing_strategy"] == "Security alert impersonation":
        return (
            "Write a security alert email warning of suspicious activity and asking the user "
            "to click a link to secure their account. Tone: official and urgent."
        )

    else:
        return (
            "Write a low-risk generic phishing awareness test email with minimal urgency."
        )

user_features["llm_prompt"] = user_features.apply(generate_phishing_prompt, axis=1)

user_features[["behavior_type", "phishing_strategy", "llm_prompt"]].head(5)

Unnamed: 0_level_0,behavior_type,phishing_strategy,llm_prompt
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ANONYMOUS LOGON@C1065,Normal variation,Generic low-risk phishing,Write a low-risk generic phishing awareness te...
ANONYMOUS LOGON@C1529,Normal variation,Generic low-risk phishing,Write a low-risk generic phishing awareness te...
ANONYMOUS LOGON@C1715,Normal variation,Generic low-risk phishing,Write a low-risk generic phishing awareness te...
ANONYMOUS LOGON@C1719,Normal variation,Generic low-risk phishing,Write a low-risk generic phishing awareness te...
ANONYMOUS LOGON@C1909,Normal variation,Generic low-risk phishing,Write a low-risk generic phishing awareness te...


# Phase 3 – Lightweight Unsupervised Machine Learning (Isolation Forest)

This phase scales selected user features and trains an Isolation Forest to detect multivariate outliers in an unsupervised manner.
ML-derived anomaly scores complement the rule-based indicators by revealing complex deviations not captured by simple heuristics.

In [49]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest

ml_features = user_features[
    ["total_events", "unique_hosts", "avg_time_gap"]
].fillna(0)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(ml_features)

## Isolation Forest Training

Isolation Forest - I select an unsupervised isolation-based detector because the data are unlabeled and the method is efficient for small feature sets.
This is an outlier detector, not a predictive classifier; its purpose here is to highlight multivariate points that differ from the bulk of the population.

In [50]:
iso_forest = IsolationForest(
    n_estimators=200,
    contamination=0.05,   # top 5% anomalies
    random_state=42
)

iso_forest.fit(X_scaled)

IsolationForest(contamination=0.05, n_estimators=200, random_state=42)


# Phase 4 – Hybrid Risk Scoring & User Prioritization

This phase normalizes and fuses rule-based and ML anomaly scores into a single, interpretable risk metric.
Users are ranked and thresholded to identify a high-risk subset; results are prepared for controlled simulation and export.

ML anomaly score interpretation - the Isolation Forest `decision_function` returns higher values for points the model considers normal and negative values for points it flags as outliers; negating that output produces a score where larger means more anomalous.
This inversion aligns the direction of the ML score with the rule-based anomaly score computed next.

In [51]:
# Predict anomalies (-1 = anomaly, 1 = normal)
ml_labels = iso_forest.predict(X_scaled)

# Convert to continuous anomaly score (higher = more anomalous)
ml_scores = -iso_forest.decision_function(X_scaled)

# Store in user_features
user_features["ml_anomaly_score"] = ml_scores

In [52]:
user_features[["ml_anomaly_score"]].describe()

Unnamed: 0,ml_anomaly_score
count,1808.0
mean,-0.224588
std,0.098825
min,-0.285055
25%,-0.285055
50%,-0.285055
75%,-0.165446
max,0.188739


In [53]:
# Create a rule-based anomaly score from deviations
user_features["rule_anomaly_score"] = (
    user_features["event_deviation"].abs() +
    user_features["host_deviation"].abs() +
    user_features["gap_deviation"].fillna(0).abs()
)

user_features[["rule_anomaly_score", "ml_anomaly_score"]].describe()

Unnamed: 0,rule_anomaly_score,ml_anomaly_score
count,1808.0,1808.0
mean,5.294987,-0.224588
std,8.098394,0.098825
min,0.774944,-0.285055
25%,2.142146,-0.285055
50%,2.142146,-0.285055
75%,4.978375,-0.165446
max,100.224203,0.188739
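
As a quick sanity check, the rule-based score can be recomputed by hand for a single user. The sketch below uses plain Python rather than the pandas expression in the cell above; `rule_score` is a hypothetical helper introduced only for illustration, and the three deviation values are the ones shown for U6@DOM1 in the ranked table later in this phase.

```python
import math

def rule_score(event_dev, host_dev, gap_dev):
    """Sum of absolute deviations; a missing gap deviation counts as 0."""
    # Mirrors fillna(0) on gap_deviation: treat None/NaN as no deviation
    if gap_dev is None or math.isnan(gap_dev):
        gap_dev = 0.0
    return abs(event_dev) + abs(host_dev) + abs(gap_dev)

# Deviations for U6@DOM1 (event, host, gap)
u6 = rule_score(81.234513, 13.623341, -5.366349)
print(round(u6, 6))
```

Up to floating-point rounding, the result agrees with the maximum `rule_anomaly_score` of 100.224203 reported by `describe()`, which suggests that U6@DOM1 is the user driving the upper end of the scale.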


Score normalization and weighting - both scores are min-max scaled to [0, 1] before they are combined, because their raw ranges differ widely (rule-based: 0.77 to 100.22; ML: -0.29 to 0.19).
I weight the rule-based score higher (0.6 versus 0.4) because it directly encodes interpretable behaviours; the ML score serves as supporting evidence.

In [54]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

user_features[["rule_score_norm", "ml_score_norm"]] = scaler.fit_transform(
    user_features[["rule_anomaly_score", "ml_anomaly_score"]]
)

user_features["final_risk_score"] = (
    0.6 * user_features["rule_score_norm"] +
    0.4 * user_features["ml_score_norm"]
)

user_features[["rule_score_norm", "ml_score_norm", "final_risk_score"]].describe()

Unnamed: 0,rule_score_norm,ml_score_norm,final_risk_score
count,1808.0,1808.0,1808.0
mean,0.045451,0.127624,0.07832
std,0.081432,0.208583,0.12516
min,0.0,0.0,0.008249
25%,0.013748,0.0,0.008249
50%,0.013748,0.0,0.008249
75%,0.042267,0.252449,0.128316
max,1.0,1.0,1.0
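
To make the scaling and weighting concrete, here is a minimal sketch that recomputes the fused score for one user without scikit-learn. `min_max` and `fuse` are hypothetical helper names; the raw scores are those of U22@DOM1 from the ranked table in this phase, and the minima and maxima are taken from the `describe()` output of the raw scores.

```python
def min_max(value, lo, hi):
    """Rescale value from the range [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)

def fuse(rule_norm, ml_norm, w_rule=0.6, w_ml=0.4):
    """Weighted sum of the two normalized scores."""
    return w_rule * rule_norm + w_ml * ml_norm

# Raw scores for U22@DOM1, with min/max of each raw score column
rule_norm = min_max(90.187589, 0.774944, 100.224203)
ml_norm = min_max(0.174179, -0.285055, 0.188739)
final = fuse(rule_norm, ml_norm)

print(round(rule_norm, 4), round(ml_norm, 4), round(final, 4))
```

The three values agree with the table entries for U22@DOM1 (0.899078, 0.96927, 0.927155) to within the rounding of the displayed inputs. Because min-max scaling is anchored on the extremes, the single maximum of 100.22 compresses most users close to zero: the median `rule_score_norm` is only 0.013748.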


In [55]:
user_features.sort_values(
    by="final_risk_score",
    ascending=False
).head(10)

Unnamed: 0_level_0,total_events,unique_hosts,avg_time_gap,event_deviation,host_deviation,gap_deviation,behavior_type,phishing_strategy,llm_prompt,ml_anomaly_score,rule_anomaly_score,rule_score_norm,ml_score_norm,final_risk_score
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
U6@DOM1,84,15,0.46988,81.234513,13.623341,-5.366349,Lateral movement pattern,Internal trust exploitation,Write an internal email impersonating a truste...,0.188739,100.224203,1.0,1.0,1.0
U22@DOM1,78,11,0.506494,75.234513,9.623341,-5.329735,Lateral movement pattern,Internal trust exploitation,Write an internal email impersonating a truste...,0.174179,90.187589,0.899078,0.96927,0.927155
U78@DOM1,83,8,0.47561,80.234513,6.623341,-5.360619,Lateral movement pattern,Internal trust exploitation,Write an internal email impersonating a truste...,0.154053,92.218473,0.919499,0.926791,0.922416
C599$@DOM1,78,7,0.480519,75.234513,5.623341,-5.355709,Lateral movement pattern,Internal trust exploitation,Write an internal email impersonating a truste...,0.145183,86.213563,0.859118,0.908069,0.878698
U66@DOM1,53,21,0.711538,50.234513,19.623341,-5.12469,Lateral movement pattern,Internal trust exploitation,Write an internal email impersonating a truste...,0.17553,74.982544,0.746186,0.972121,0.83656
U7@DOM1,68,8,0.58209,65.234513,6.623341,-5.254139,Lateral movement pattern,Internal trust exploitation,Write an internal email impersonating a truste...,0.139857,77.111993,0.767598,0.896828,0.81929
U3@DOM1,60,12,0.457627,57.234513,10.623341,-5.378601,Lateral movement pattern,Internal trust exploitation,Write an internal email impersonating a truste...,0.164914,73.236455,0.728628,0.949714,0.817062
U4@DOM1,59,11,0.655172,56.234513,9.623341,-5.181056,Lateral movement pattern,Internal trust exploitation,Write an internal email impersonating a truste...,0.159739,71.03891,0.706531,0.938792,0.799435
C104$@DOM1,54,9,0.735849,51.234513,7.623341,-5.100379,Lateral movement pattern,Internal trust exploitation,Write an internal email impersonating a truste...,0.139857,63.958233,0.635332,0.896828,0.73993
ANONYMOUS LOGON@C586,63,1,0.645161,60.234513,-0.376659,-5.191067,High-volume automated activity,Credential harvesting with urgency,Write a high-urgency internal email requesting...,0.105701,65.80224,0.653874,0.824738,0.72222


In [56]:
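# Select users at or above the 95th percentile of final_risk_score (top 5%).
# Note: head(10) below shows the first ten in index order, not the ten highest scores.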
threshold = user_features["final_risk_score"].quantile(0.95)

top_risk_users = user_features[
    user_features["final_risk_score"] >= threshold
]

top_risk_users[[
    "final_risk_score",
    "behavior_type",
    "phishing_strategy"
]].head(10)

Unnamed: 0_level_0,final_risk_score,behavior_type,phishing_strategy
src_user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ANONYMOUS LOGON@C586,0.72222,High-volume automated activity,Credential harvesting with urgency
C104$@DOM1,0.73993,Lateral movement pattern,Internal trust exploitation
C1073$@DOM1,0.458709,Normal variation,Generic low-risk phishing
C1114$@DOM1,0.483126,Lateral movement pattern,Internal trust exploitation
C1164$@DOM1,0.513463,Normal variation,Generic low-risk phishing
C1167$@DOM1,0.411794,Normal variation,Generic low-risk phishing
C123$@DOM1,0.687777,Lateral movement pattern,Internal trust exploitation
C1282$@DOM1,0.444519,Normal variation,Generic low-risk phishing
C13$@DOM1,0.408559,Normal variation,Generic low-risk phishing
C1349$@DOM1,0.367842,Normal variation,Generic low-risk phishing
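
The percentile cut can also be illustrated in plain Python. This is a sketch under the assumption that the threshold uses linear interpolation between order statistics, which is the default for pandas `quantile`; `quantile_linear` is a hypothetical helper, and the scores are made-up toy values, not LANL results.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)          # fractional index into the sorted list
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

# Toy risk scores keyed by a placeholder user id (illustrative only)
toy_scores = {f"u{i}": s for i, s in enumerate(
    [0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9])}

toy_threshold = quantile_linear(toy_scores.values(), 0.95)
toy_high_risk = {u: s for u, s in toy_scores.items() if s >= toy_threshold}
print(toy_threshold, toy_high_risk)
```

With ten toy users the 95th percentile falls between the two largest scores, so only the highest-scoring user passes the filter. Sorting the selected users by score before displaying them would show the highest risks first rather than the first rows in index order.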


In [57]:
user_features.groupby("behavior_type")["final_risk_score"].mean().sort_values(ascending=False)

behavior_type
Lateral movement pattern          0.728728
High-volume automated activity    0.615619
Normal variation                  0.071171
Name: final_risk_score, dtype: float64

In [58]:
user_features.sort_values(
    by="final_risk_score",
    ascending=False
).to_csv("../docs/final_risk_scores.csv")

Phase 4 is complete: hybrid risk scores are computed for all 1,808 users and exported to ../docs/final_risk_scores.csv.
The next step is LLM-based simulation and prompt generation (Phase 5).

# Phase 5 – LLM-Based Behavior-Adaptive Spear Phishing Simulation

This phase maps detected behavior types to targeted phishing strategies and generates LLM prompts accordingly.
It demonstrates integration with LLMs to synthesize behavior-adaptive phishing content for academic simulation under controlled, ethical constraints.

In [59]:
import os
from openai import OpenAI

# Read the key from the environment; credentials are never stored in the notebook
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

In [60]:
import os
from openai import OpenAI

# OPENAI_API_KEY must be set in the shell (or a local, untracked .env file)
# before the kernel starts; OpenAI() picks it up from the environment.
assert "OPENAI_API_KEY" in os.environ, "Set OPENAI_API_KEY before running this cell"

client = OpenAI()

Note:
Live execution of the LLM requires API credits.
For this progress phase, prompt generation logic and integration
are validated without consuming tokens.
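
One way to keep these cells runnable when no credentials or credits are available is to check for the key first and fall back to a dry run. The sketch below is an assumption about how that gate could look, not the notebook's actual mechanism; `api_key_or_none` and `live_mode` are hypothetical names.

```python
import os

def api_key_or_none(name, env=None):
    """Return the named key from the environment, or None if unset or blank."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    return value or None

# Only attempt live LLM calls when a key is actually present
key = api_key_or_none("OPENAI_API_KEY")
live_mode = key is not None
print("live mode" if live_mode else "dry run: prompts are built but not sent")
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real environment.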

In [61]:
pip install google-generativeai


Note: you may need to restart the kernel to use updated packages.


In [62]:
import os
import google.generativeai as genai

# Read the key from the environment instead of hardcoding it in the notebook
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

  from .autonotebook import tqdm as notebook_tqdm

All support for the `google.generativeai` package has ended. It will no longer be receiving 
updates or bug fixes. Please switch to the `google.genai` package as soon as possible.
See README for more details:

https://github.com/google-gemini/deprecated-generative-ai-python/blob/main/README.md

  import google.generativeai as genai


In [63]:
model = genai.GenerativeModel("gemini-pro")

In [64]:
from google import genai
import os

# GEMINI_API_KEY must already be set in the environment; it is not stored in the notebook
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

models = client.models.list()
for m in models:
    print(m.name)

models/embedding-gecko-001
models/gemini-2.5-flash
models/gemini-2.5-pro
models/gemini-2.0-flash-exp
models/gemini-2.0-flash
models/gemini-2.0-flash-001
models/gemini-2.0-flash-exp-image-generation
models/gemini-2.0-flash-lite-001
models/gemini-2.0-flash-lite
models/gemini-2.0-flash-lite-preview-02-05
models/gemini-2.0-flash-lite-preview
models/gemini-exp-1206
models/gemini-2.5-flash-preview-tts
models/gemini-2.5-pro-preview-tts
models/gemma-3-1b-it
models/gemma-3-4b-it
models/gemma-3-12b-it
models/gemma-3-27b-it
models/gemma-3n-e4b-it
models/gemma-3n-e2b-it
models/gemini-flash-latest
models/gemini-flash-lite-latest
models/gemini-pro-latest
models/gemini-2.5-flash-lite
models/gemini-2.5-flash-image-preview
models/gemini-2.5-flash-image
models/gemini-2.5-flash-preview-09-2025
models/gemini-2.5-flash-lite-preview-09-2025
models/gemini-3-pro-preview
models/gemini-3-flash-preview
models/gemini-3-pro-image-preview
models/nano-banana-pro-preview
models/gemini-robotics-er-1.5-preview
models/g

In [65]:
from google import genai
import os

# Recreate the client; the key is read from the environment, not hardcoded
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

In [66]:
response = client.models.generate_content(
    model="models/gemini-flash-latest",
    contents="Write a short internal cybersecurity awareness email about phishing."
)

print(response.text)

Subject: 📧 Quick Tip: Stop Phishing Before it Starts

Hi Team,

As threat actors continue to launch sophisticated attacks, vigilance against phishing remains our top defense. Phishing emails often look incredibly real and aim to steal credentials or deploy malware.

**Remember the Golden Rule: Always Inspect, Never Assume.**

Before clicking a link or downloading an attachment, take a moment to confirm authenticity:

1.  **Verify Sender:** Hover your mouse over the sender’s address. Does the domain exactly match what you expect?
2.  **Inspect Links:** Hover your mouse over any hyperlink. Does the destination URL look legitimate? Be suspicious of slight misspellings or unexpected redirects.
3.  **Urgency/Pressure:** Be cautious of emails demanding immediate action (e.g., "Account suspended," "Immediate password reset required").

**What to do if you are suspicious:**

Do NOT reply or click. Report the email immediately using the **[Insert Reporting Button/Method, e.g., "Report Phish" bu