# Looking for Lateral Movement

----

*Lateral movement* is a cyberattack pattern that describes how an adversary leverages a single foothold to compromise other systems within a network.
Identifying and stopping lateral movement is an important step in controlling the damage from a breach, and also plays a role in forensic analysis of a cyberattack, helping to identify its source and reconstruct what happened.
In this notebook, we show how xGT can be used to find evidence of these types of patterns hiding in large data.

This notebook uses the large collection of attack patterns described in the [MITRE ATT&CK Catalog](https://attack.mitre.org/) as a guide to search for evidence of lateral movement within an enterprise network.

For data, we'll be using the [LANL Unified Host and Network Dataset](https://datasets.trovares.com/cyber/LANL/index.html), a set of netflow and host event data collected on an internal Los Alamos National Lab network.

----
## RDP Hijacking

There are 17 *lateral movement* techniques presented in the MITRE ATT&CK Catalog.
We will consider the *RDP Hijacking* technique, catalogued as [T1076](https://attack.mitre.org/techniques/T1076/).

RDP hijacking is actually a family of attacks that differ in how the adversary attains the
privileges required to hijack a session.
The attack broadly looks like this:

1. Lateral movement starts from a foothold where an adversary already has gained access. We'll call this host `A`.

1. The attacker uses some *privilege escalation* technique to attain SYSTEM privilege.

1. The attacker then leverages their SYSTEM privilege to *hijack* an RDP session to
[move through a network](https://doublepulsar.com/rdp-hijacking-how-to-hijack-rds-and-remoteapp-sessions-transparently-to-move-through-an-da2a1e73a5f6).
The result is to become logged in to another system where the RDP session had been.  We'll call this host `B`.

This hijacking action can be repeated to form longer chains of lateral movement, and these chains
can be represented as graph patterns:

![rdp_hijack](images/lateral-movement.png)


----
## Privilege Escalation

The MITRE ATT&CK Catalog contains 28 different techniques for performing privilege escalation.
For our example, we will look for evidence of RDP Hijacking where privilege escalation was carried out using 
a technique called *Accessibility Features* described as [T1015](https://attack.mitre.org/techniques/T1015/).

The astute reader will note that we are looking for only one of 476 (17 × 28) or more possible combinations of lateral movement and privilege escalation techniques.
Each of the others might result in different graph patterns and different queries, but can all be addressed
using the same approach described here.

----
## Mapping to a cyber dataset

In order to formulate a query, we need to understand the content and structure of our
graph.
We will work under the assumption that we have both *netflow* and *windows server log* event information.

Mapping each of the adversary steps (the number before each edge label in the diagram) to our dataset:

1. "Accessibility Features (*privilege escalation*)": An adversary modifies the way programs are launched 
to get a back door into a system.  The following programs can be used for this purpose:
    1. `sethc.exe`
    1. `utilman.exe`

1. "RDP Session Hijack":  Once an adversary finds a session to hijack, they can run this command:  `c:\windows\system32\tscon.exe [session number to be stolen]`.  We look in our graph for Windows log events showing the running of the `tscon.exe` program.

1. "RDP/RDS Netflow": Logging in to system `B` will leave one or more netflow packets from system `A` to `B` that use the RDP port.


## Mapping to the LANL dataset

Once we understand the pattern we want to find, we need to determine what specifically to look for in the dataset.

We first need to understand that the LANL dataset has been modified from its raw form.
For example, the anonymization process replaced many of the program names with arbitrary strings such as `Proc123456.exe`.  Also, the program arguments (such as a `/network` option) are not recorded.

Given this lack of information, we will emulate a search for the RDP Hijacking lateral movement behavior by picking some actual values present in the LANL data as a proxy to desired programs such as `sethc.exe`.  Here are the mappings:

 - In steps 1 and 4, we will use the string `Proc336322.exe` as a proxy for the `sethc.exe` program and the string `Proc695356.exe` as a proxy for the `utilman.exe` program.
 - In steps 2 and 5, we will use the string `Proc249569.exe` as a proxy for the `tscon.exe` program.
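
The proxy mapping above can be captured in a small lookup table. The sketch below is purely illustrative: the names `PROXY_PROGRAMS` and `role_of` are hypothetical helpers introduced here, not part of xGT or the LANL dataset, and the step labels are our own shorthand.

```python
# Hypothetical lookup table: anonymized LANL process name -> (program it
# stands in for in this notebook, attack step that program represents).
PROXY_PROGRAMS = {
    "Proc336322.exe": ("sethc.exe", "privilege_escalation"),
    "Proc695356.exe": ("utilman.exe", "privilege_escalation"),
    "Proc249569.exe": ("tscon.exe", "rdp_hijack"),
}

def role_of(process_name):
    """Return the attack step a process name is a proxy for, or None."""
    entry = PROXY_PROGRAMS.get(process_name)
    return entry[1] if entry else None

print(role_of("Proc249569.exe"))
print(role_of("Proc336322.exe"))
```

Keeping the mapping in one place makes it easy to swap in the real program names when working with data that has not been anonymized.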


----
## Using xGT to perform this search

The rest of this notebook demonstrates how to use the LANL data and the search pattern description to perform these steps:
  1. Ingest the cyber data into xGT.
  2. Search for all occurrences of this pattern.

In [None]:
import xgt
conn = xgt.Connection()
conn

## Establish Graph Component Schemas

We first try to retrieve the graph component schemas from the xGT server.
If that fails, we create an empty frame (vertex or edge) for the missing component.

In [None]:
try:
  devices = conn.get_vertex_frame('Devices')
except xgt.XgtNameError:
  devices = conn.create_vertex_frame(
      name='Devices',
      schema=[['device', xgt.TEXT]],
      key='device')
devices

In [None]:
try:
  netflow = conn.get_edge_frame('Netflow')
except xgt.XgtNameError:
  netflow = conn.create_edge_frame(
      name='Netflow',
      schema=[['epoch_time', xgt.INT],
              ['duration', xgt.INT],
              ['src_device', xgt.TEXT],
              ['dst_device', xgt.TEXT],
              ['protocol', xgt.INT],
              ['src_port', xgt.INT],
              ['dst_port', xgt.INT],
              ['src_packets', xgt.INT],
              ['dst_packets', xgt.INT],
              ['src_bytes', xgt.INT],
              ['dst_bytes', xgt.INT]],
      source=devices,
      target=devices,
      source_key='src_device',
      target_key='dst_device')
netflow

**Edges:** The LANL dataset contains two types of data: netflow and windows log events. Of the log events recorded, some describe events within a host/device (e.g., reboots), and some describe authentication events that may be between devices (e.g., login from device A to B). We'll call the authentication events *AuthEvents* and the others we'll call *HostEvents*.

In [None]:
try:
  host_events = conn.get_edge_frame('HostEvents')
except xgt.XgtNameError:
  host_events = conn.create_edge_frame(
      name='HostEvents',
      schema=[['epoch_time', xgt.INT],
              ['event_id', xgt.INT],
              ['log_host', xgt.TEXT],
              ['user_name', xgt.TEXT],
              ['domain_name', xgt.TEXT],
              ['logon_id', xgt.INT],
              ['process_name', xgt.TEXT],
              ['process_id', xgt.INT],
              ['parent_process_name', xgt.TEXT],
              ['parent_process_id', xgt.INT]],
           source=devices,
           target=devices,
           source_key='log_host',
           target_key='log_host')
host_events

In [None]:
try:
  auth_events = conn.get_edge_frame('AuthEvents')
except xgt.XgtNameError:
  auth_events = conn.create_edge_frame(
           name='AuthEvents',
           schema = [['epoch_time',xgt.INT],
                     ['event_id',xgt.INT],
                     ['log_host',xgt.TEXT],
                     ['logon_type',xgt.INT],
                     ['logon_type_description',xgt.TEXT],
                     ['user_name',xgt.TEXT],
                     ['domain_name',xgt.TEXT],
                     ['logon_id',xgt.INT],
                     ['subject_user_name',xgt.TEXT],
                     ['subject_domain_name',xgt.TEXT],
                     ['subject_logon_id',xgt.TEXT],
                     ['status',xgt.TEXT],
                     ['src',xgt.TEXT],
                     ['service_name',xgt.TEXT],
                     ['destination',xgt.TEXT],
                     ['authentication_package',xgt.TEXT],
                     ['failure_reason',xgt.TEXT],
                     ['process_name',xgt.TEXT],
                     ['process_id',xgt.INT],
                     ['parent_process_name',xgt.TEXT],
                     ['parent_process_id',xgt.INT]],
            source = 'Devices',
            target = 'Devices',
            source_key = 'src',
            target_key = 'log_host')
auth_events

In [None]:
# Utility to print the sizes of data currently in xGT
def print_data_summary():
  print('Devices (vertices): {:,}'.format(devices.num_vertices))
  print('Netflow (edges): {:,}'.format(netflow.num_edges))
  print('Host events (edges): {:,}'.format(host_events.num_edges))
  print('Authentication events (edges): {:,}'.format(auth_events.num_edges))
  print('Total (edges): {:,}'.format(
      netflow.num_edges + host_events.num_edges + auth_events.num_edges))
    
print_data_summary()

## Load the data

If you are already connected to an xGT server with data loaded, this section may be skipped.
You may skip ahead to the "**Utility python functions for interacting with xGT**" section.

**Load the HostEvents event data:**

In [None]:
%%time
if host_events.num_edges == 0:
    urls = ["https://datasets.trovares.com/LANL/xgt/wls_day-85_1v.csv"]
    # urls = ["xgtd://wls_day-{:02d}_1v.csv".format(_) for _ in range(2,91)]
    host_events.load(urls)
    print_data_summary()

**Load the AuthEvents event data:**

In [None]:
%%time
if auth_events.num_edges == 0:
    urls = ["https://datasets.trovares.com/LANL/xgt/wls_day-85_2v.csv"]
    # urls = ["xgtd://wls_day-{:02d}_2v.csv".format(_) for _ in range(2,91)]
    auth_events.load(urls)
    print_data_summary()

**Load the netflow data:**

In [None]:
%%time
if netflow.num_edges == 0:
    urls = ["https://datasets.trovares.com/LANL/xgt/nf_day-85.csv"]
    #urls = ["xgtd://nf_day-{:02d}.csv".format(_) for _ in range(2,91)]
    netflow.load(urls)
    print_data_summary()

## Utility python functions for interacting with xGT

----

Now define some useful functions and get on with the querying ...

In [None]:
# Utility function to launch queries and show job number:
#   The job number may be useful if a long-running job needs
#   to be canceled.

def run_query(query, table_name = "answers", drop_answer_table=True, show_query=False):
    if drop_answer_table:
        conn.drop_frame(table_name)
    if query[-1] != '\n':
        query += '\n'
    query += 'INTO {}'.format(table_name)
    if show_query:
        print("Query:\n" + query)
    job = conn.schedule_job(query)
    print("Launched job {}".format(job.id))
    conn.wait_for_job(job)
    table = conn.get_table_frame(table_name)
    return table

## Pulling out only RDP netflow edges

Because of the way LANL has chosen to represent the netflow data, there may be some netflow edges in the *forward* direction where the `dst_port` field indicates RDP (`dst_port = 3389`), and other edges in the *reverse* direction where the `src_port` field contains 3389.

The following section of code pulls out all forward RDP edges and drops them into a new edge frame.
It then pulls out all reverse RDP edges, reverses the appropriate fields (i.e., swapping `dst` and `src` versions of the attribute values), and adds these reversed RDP edges to the new edge frame.

Note that the edges in this new edge frame connect up with the same set of vertices as the netflow edges.
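
To make the forward/reverse normalization concrete, here is a minimal plain-Python sketch of the same idea applied to a single record. It does not use xGT; the function name `to_forward_rdp`, the sample record, and the device names `Comp1` and `Comp2` are made up for illustration. The field names follow the `Netflow` schema defined earlier.

```python
RDP_PORT = 3389

def to_forward_rdp(flow):
    """Return a copy of `flow` oriented so the RDP port is the destination
    port, or None if neither port is the RDP port."""
    if flow["dst_port"] == RDP_PORT:
        # Already a forward edge: copy verbatim.
        # Simplification: a flow with 3389 on both ends is treated as forward
        # only here, whereas the two notebook queries would each match it.
        return dict(flow)
    if flow["src_port"] == RDP_PORT:
        # Reverse edge: swap every src_/dst_ pair.
        swapped = dict(flow)
        for field in ("device", "port", "packets", "bytes"):
            swapped["src_" + field] = flow["dst_" + field]
            swapped["dst_" + field] = flow["src_" + field]
        return swapped
    return None

# Made-up sample record in the reverse direction.
sample_flow = {
    "epoch_time": 100, "duration": 5, "protocol": 6,
    "src_device": "Comp1", "dst_device": "Comp2",
    "src_port": 3389, "dst_port": 50000,
    "src_packets": 10, "dst_packets": 4,
    "src_bytes": 900, "dst_bytes": 300,
}
normalized = to_forward_rdp(sample_flow)
print(normalized["src_device"], "->", normalized["dst_device"])
```

The fields that have no direction (`epoch_time`, `duration`, `protocol`) are carried over unchanged.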

We first generate a new edge frame we call `RDPFlow` that has the exact same schema as the netflow edge frame.

In [None]:
# Generate a new edge frame for holding only the RDP edges
conn.drop_frame('RDPFlow')
rdp_flow = conn.create_edge_frame(
            name='RDPFlow',
            schema=netflow.schema,
            source=devices,
            target=devices,
            source_key='src_device',
            target_key='dst_device')
rdp_flow

### Extract forward RDP edges

A "forward" edge is one where `dst_port = 3389`.
This edge is copied verbatim to the `RDPFlow` edge frame.

In [None]:
%%time
q = """
MATCH (v0)-[edge:Netflow]->(v1)
WHERE edge.dst_port=3389
CREATE (v0)-[e:RDPFlow {epoch_time : edge.epoch_time,
  duration : edge.duration, protocol : edge.protocol,
  src_port : edge.src_port, dst_port : edge.dst_port,
  src_packets : edge.src_packets, dst_packets : edge.dst_packets,
  src_bytes : edge.src_bytes, dst_bytes : edge.dst_bytes}]->(v1)
RETURN count(*)
"""
data = run_query(q)
print('Number of answers: {:,}'.format(data.get_data()[0][0]))

### Extract reverse RDP edges

A "reverse" edge is one where `src_port = 3389`.
These edges are copied to the `RDPFlow` edge frame but **reversed** in transit.
The reversal process involves swapping `src_device` and `dst_device`;
`src_port` and `dst_port`; `src_packets` and `dst_packets`; and `src_bytes` and `dst_bytes`.

In [None]:
%%time
q = """
MATCH (v0)-[edge:Netflow]->(v1)
WHERE edge.src_port=3389
CREATE (v1)-[e:RDPFlow {epoch_time : edge.epoch_time,
  duration : edge.duration, protocol : edge.protocol,
  src_port : edge.dst_port, dst_port : edge.src_port,
  src_packets : edge.dst_packets, dst_packets : edge.src_packets,
  src_bytes : edge.dst_bytes, dst_bytes : edge.src_bytes}]->(v0)
RETURN count(*)
"""
data = run_query(q)
print('Number of answers: {:,}'.format(data.get_data()[0][0]))

### Resulting RDPFlow

The result of combining these two "edge-create" queries is the `RDPFlow` edge frame containing only "forward" RDP edges.
This alternate edge frame holding only RDP edges can be used instead of the generic
`Netflow` edge frame where an RDP edge is required in a query.

In [None]:
data=None
if rdp_flow.num_edges == 0:
    print("RDPFlow is empty")
elif rdp_flow.num_edges <= 1000:
    data = rdp_flow.get_data_pandas()
else:
    data = 'RDPFlow (edges): {:,}'.format(rdp_flow.num_edges)
data

In [None]:
# Utility to print the data sizes currently in xGT
def print_netflow_data_summary():
  print_data_summary()
  print('RDPFlow (edges): {:,}'.format(rdp_flow.num_edges))

print_netflow_data_summary()

### Building a better query: adding temporal constraints 

Being more specific about what you're looking for is a good way to both improve performance and cut down on false positives in your results.
In our example, there is a causal dependence between the attacker's steps, which means that they must be temporally ordered.
For convenience, we again show the RDP Hijack graph pattern here:

![rdp_hijack](images/lateral-movement.png)

So if *t<sub>1</sub>* represents the time at which event 1 takes place, we know that:

*t<sub>1</sub>* &le; *t<sub>2</sub>* &le; *t<sub>3</sub>* &le; *t<sub>4</sub>* &le; *t<sub>5</sub>* &le; *t<sub>6</sub>*

In addition, since this pattern models intentional lateral movement, we suspect that some of these events will be close together in time.
We can narrow the results by setting maximum time thresholds between specific groups of events:

 - The time between an RDP hijack (`tscon.exe`) and the subsequent RDP netflow is bounded by the *hijack threshold*.
 - The time from the initial *privilege escalation* event to the RDP netflow is bounded by the *one_step threshold*.
 - The time allowed between steps (e.g., the time between RDP1 and RDP2) is bounded by the *between_step threshold*.

Given some fixed constants for these thresholds, we can impose the following additional constraints:

 - *t<sub>3</sub>* - *t<sub>2</sub>* &le; *hijack threshold*
 - *t<sub>3</sub>* - *t<sub>1</sub>* &le; *one_step threshold*
 - *t<sub>6</sub>* - *t<sub>5</sub>* &le; *hijack threshold*
 - *t<sub>6</sub>* - *t<sub>4</sub>* &le; *one_step threshold*
 - *t<sub>6</sub>* - *t<sub>3</sub>* &le; *between_step threshold*

We will add all of these constraints to our query to help focus on just the results we want.
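
As a rough illustration, the ordering and threshold constraints can be written as a single predicate over the six event times. This is a plain-Python sketch, not xGT code; the function name `satisfies_time_constraints` is hypothetical, and the default thresholds are the values (in seconds) used in the query cell of this notebook. The between_step check is written as *t<sub>6</sub>* - *t<sub>3</sub>*, following the description of that threshold as the time between the two RDP flows.

```python
def satisfies_time_constraints(times, hijack=180, one_step=480, between_step=3600):
    """Check the temporal constraints for one candidate match.

    `times` is a sequence (t1, ..., t6) of epoch times in seconds, one per
    numbered event in the pattern.
    """
    t1, t2, t3, t4, t5, t6 = times
    ordered = t1 <= t2 <= t3 <= t4 <= t5 <= t6
    return (ordered
            and t3 - t2 <= hijack          # hijack 1 -> RDP flow 1
            and t3 - t1 <= one_step        # priv. escalation 1 -> RDP flow 1
            and t6 - t5 <= hijack          # hijack 2 -> RDP flow 2
            and t6 - t4 <= one_step        # priv. escalation 2 -> RDP flow 2
            and t6 - t3 <= between_step)   # RDP flow 1 -> RDP flow 2

print(satisfies_time_constraints((0, 100, 200, 1000, 1100, 1200)))
print(satisfies_time_constraints((0, 10, 300, 1000, 1100, 1200)))
```

The first candidate passes every check; the second fails because the gap between the hijack and the RDP flow (290 seconds) exceeds the hijack threshold.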

### Lateral Movement query

This query leverages the new `RDPFlow` edge frame (and data) to find the proper RDP edges for steps #3 and #6.

In [None]:
%%time
time_threshold_between_step = 3600   # one hour
time_threshold_hijack = 180          # three minutes
time_threshold_one_step = 480        # eight minutes
q = """
MATCH (A)-[rdp1:RDPFlow]->(B)-[rdp2:RDPFlow]->(C),
      (A)-[hijack1:HostEvents]->(A)-[privEsc1:HostEvents]->(A),
      (B)-[hijack2:HostEvents]->(B)-[privEsc2:HostEvents]->(B)
WHERE A <> B AND B <> C AND A <> C 
  AND privEsc1.event_id = 4688 
  AND (privEsc1.process_name = "Proc336322.exe" OR privEsc1.process_name = "Proc695356.exe")
  AND hijack1.event_id = 4688 AND hijack1.process_name = "Proc249569.exe"
  AND privEsc2.event_id = 4688 
  AND (privEsc2.process_name = "Proc336322.exe" OR privEsc2.process_name = "Proc695356.exe")
  AND hijack2.event_id = 4688 AND hijack2.process_name = "Proc249569.exe"

  // Check time constraints on the overall pattern
  AND rdp1.epoch_time <= rdp2.epoch_time
  AND rdp2.epoch_time - rdp1.epoch_time < {0}

  // Check time constraints on step from A to B
  AND privEsc1.epoch_time <= hijack1.epoch_time
  AND hijack1.epoch_time <= rdp1.epoch_time
  AND rdp1.epoch_time - hijack1.epoch_time < {1}
  AND rdp1.epoch_time - privEsc1.epoch_time < {2}

  // Check time constraints on step from B to C
  AND privEsc2.epoch_time <= hijack2.epoch_time
  AND hijack2.epoch_time <= rdp2.epoch_time
  AND rdp2.epoch_time - hijack2.epoch_time < {1}
  AND rdp2.epoch_time - privEsc2.epoch_time < {2}
RETURN rdp1.src_device, rdp1.dst_device, rdp1.epoch_time, rdp2.dst_device, rdp2.epoch_time
""".format(time_threshold_between_step, time_threshold_hijack, time_threshold_one_step)
answer_table = run_query(q)
print('Number of answers: {:,}'.format(answer_table.num_rows))

## A faster Lateral Movement query

This query builds something comparable to an SQL index by observing that Hijack Events and Privilege Escalation Events occur multiple times within the query.  By building one table that holds just the Hijack Events and another that holds just the Privilege Escalation Events, we drastically reduce the search space as we progress through the partial matches.

The next several cells create the two tables, populate them with queries, and then run a lateral movement query that leverages these "index" tables.  The time for this entire process will be captured, which makes sense only when the entire sequence is run together rather than interactively.
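
The pre-filtering idea can be sketched in plain Python, independent of xGT: scan the host events once, keep only the process-start events of interest, and group them by host so later matching touches far fewer records. The function `build_index` and the sample events are hypothetical and exist only for illustration; `Proc111111.exe` is a made-up name for an unrelated process.

```python
from collections import defaultdict

def build_index(events, wanted_names, event_id=4688):
    """Map each host to the sorted times of its matching events."""
    index = defaultdict(list)
    for ev in events:
        if ev["event_id"] == event_id and ev["process_name"] in wanted_names:
            index[ev["log_host"]].append(ev["epoch_time"])
    for times in index.values():
        times.sort()
    return dict(index)

# Made-up sample events; only two of them are relevant to the pattern.
sample_events = [
    {"log_host": "A", "event_id": 4688, "process_name": "Proc249569.exe", "epoch_time": 50},
    {"log_host": "A", "event_id": 4688, "process_name": "Proc111111.exe", "epoch_time": 60},
    # Not a process-start event, so it is filtered out despite the name.
    {"log_host": "B", "event_id": 4624, "process_name": "Proc249569.exe", "epoch_time": 70},
    {"log_host": "A", "event_id": 4688, "process_name": "Proc336322.exe", "epoch_time": 10},
]

hijack_index = build_index(sample_events, {"Proc249569.exe"})
priv_esc_index = build_index(sample_events, {"Proc336322.exe", "Proc695356.exe"})
print(hijack_index)    # {'A': [50]}
print(priv_esc_index)  # {'A': [10]}
```

The cells that follow apply the same reduction inside xGT by creating dedicated edge frames, so the filtering cost is paid once rather than for every partial match.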

In [None]:
%%time
# Build HijackEvents table

import time
start_optimized_query_time = time.time()

conn.drop_frame('HijackEvents')
hijack_events = conn.create_edge_frame(
    name   ='HijackEvents',
    schema = [['epoch_time', xgt.INT],
              ['log_host', xgt.TEXT]],
    source = devices,
    target = devices,
    source_key = 'log_host',
    target_key = 'log_host'
)

query = """
MATCH (v0)-[edge:HostEvents]->(v0)
WHERE edge.process_name = "Proc249569.exe"
  AND edge.event_id = 4688
CREATE (v0)-[e:HijackEvents { epoch_time : edge.epoch_time }]->(v0)
RETURN count(*)
"""
run_query(query)
print('HijackEvents (edges): {:,}'.format(hijack_events.num_edges))

In [None]:
%%time
# Build a PrivEscEvents table

conn.drop_frame('PrivEscEvents')
priv_esc_events = conn.create_edge_frame(
    name   ='PrivEscEvents',
    schema = [['epoch_time', xgt.INT],
              ['log_host', xgt.TEXT]],
    source = devices,
    target = devices,
    source_key = 'log_host',
    target_key = 'log_host')

query = """
MATCH (v0)-[edge:HostEvents]->(v0)
WHERE (edge.process_name = "Proc336322.exe" OR
       edge.process_name = "Proc695356.exe")
  AND edge.event_id = 4688
CREATE (v0)-[e:PrivEscEvents { epoch_time : edge.epoch_time }]->(v0)
RETURN count(*)
"""
run_query(query)
print('PrivEscEvents (edges): {:,}'.format(priv_esc_events.num_edges))

In [None]:
%%time
# Now run the lateral movement query using these new index tables

q = """
MATCH (A)-[rdp1:RDPFlow]->(B)-[rdp2:RDPFlow]->(C),
      (A)-[hijack1:HijackEvents]->(A)-[priv_esc1:PrivEscEvents]->(A),
      (B)-[hijack2:HijackEvents]->(B)-[priv_esc2:PrivEscEvents]->(B)
WHERE A <> B AND B <> C AND A <> C 
  // Check time constraints on the overall pattern
  AND rdp1.epoch_time <= rdp2.epoch_time
  AND rdp2.epoch_time - rdp1.epoch_time < {0}

  // Check time constraints on step from A to B
  AND priv_esc1.epoch_time <= hijack1.epoch_time
  AND hijack1.epoch_time <= rdp1.epoch_time
  AND rdp1.epoch_time - hijack1.epoch_time < {1}
  AND rdp1.epoch_time - priv_esc1.epoch_time < {2}

  // Check time constraints on step from B to C
  AND priv_esc2.epoch_time <= hijack2.epoch_time
  AND hijack2.epoch_time <= rdp2.epoch_time
  AND rdp2.epoch_time - hijack2.epoch_time < {1}
  AND rdp2.epoch_time - priv_esc2.epoch_time < {2}
RETURN rdp1.src_device, rdp1.dst_device, rdp1.epoch_time, rdp2.dst_device, rdp2.epoch_time
""".format(time_threshold_between_step, time_threshold_hijack, time_threshold_one_step)
answer_table = run_query(q)
end_optimized_query_time = time.time()
print('Number of answers: {:,}'.format(answer_table.num_rows))

print('Total time to build index and run query: {:,.2f}'.format(
    end_optimized_query_time - start_optimized_query_time))

In [None]:
# retrieve the answer rows to the client in a pandas frame
data = answer_table.get_data_pandas()
data[0:10]