
GetMatch query

Hassaan edited this page Jun 24, 2021 · 17 revisions

Overview

SPADE's query client can be used to retrieve stored provenance for subsequent analysis. The complete list of queries that the QuickGrail query surface supports is documented here.

This page documents usage of the query getMatch. It illustrates this by identifying cross-namespace data flows in a CamFlow provenance graph.

Partial Matching

When querying provenance graphs, an analyst may be interested in finding common vertices (or edges) in two different graphs -- that is, their intersection. SPADE query support provides the operator & for this. For example, $common = $set1 & $set2 will extract elements present in both $set1 and $set2 and store them in $common. The intersection operator works by comparing all annotations of graph (vertex or edge) elements in the given sets. It only returns graph elements for which all the annotations match.
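The full-annotation comparison described above can be illustrated with a small Python model. This is only a sketch of the stated behaviour, not SPADE's implementation: graph elements are modelled as plain dictionaries, the `intersect` helper is a made-up name, and the `uid` annotation is a hypothetical extra key added for illustration.

```python
# Toy model (assumption): each graph element is a dict of annotation key -> value.
def intersect(set1, set2):
    """Keep the elements of set1 whose complete annotation set also occurs in set2."""
    frozen2 = {frozenset(e.items()) for e in set2}
    return [e for e in set1 if frozenset(e.items()) in frozen2]


set1 = [
    {"object_type": "task", "pid": "2223"},
    {"object_type": "task", "pid": "2226"},
]
set2 = [
    {"object_type": "task", "pid": "2223"},
    # Same pid, but one extra (hypothetical) annotation, so not a full match.
    {"object_type": "task", "pid": "2226", "uid": "0"},
]

common = intersect(set1, set2)
print(common)
```

Only the element whose annotations agree on every key survives. The element with pid 2226 is dropped even though its pid matches, which is the case that partial matching addresses.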

In some contexts, even a partial match of annotations suffices. The getMatch query provides support for this. More specifically, getMatch operates on a subject graph ($set1 in the example below), an object graph ($set2 below), and a list of annotation keys ('pid' below). It works by comparing only the specified annotations of graph elements in the subject and object graphs. The result of getMatch contains the elements for which the specified annotation keys had matching values in both graphs. Annotations with keys that were not specified are ignored.

An example query:

-> $matching_pids = $set1.getMatch($set2, 'pid')

Above, each vertex in $set1 will only be added to $matching_pids if there is a vertex in $set2 that has the same value for the pid annotation key. Similarly, the only vertices in $set2 that will appear in $matching_pids are those with a value for the pid annotation key that is also present in some vertex in $set1.
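The behaviour described above can be sketched in Python as a toy model. The sketch rests on assumptions that the text does not confirm: elements are plain dictionaries, all listed keys must agree for a match, and an element that lacks one of the keys never matches. The `get_match` helper and the `version` annotation are hypothetical names used only for illustration.

```python
# Toy model (assumption): each graph element is a dict of annotation key -> value.
def get_match(subject, obj, *keys):
    """Return elements of both sets whose values for `keys` also occur in the other set.

    Assumption: an element missing any of the keys never matches.
    """
    def signature(element):
        if all(k in element for k in keys):
            return tuple(element[k] for k in keys)
        return None

    subject_sigs = {signature(e) for e in subject} - {None}
    object_sigs = {signature(e) for e in obj} - {None}
    shared = subject_sigs & object_sigs

    matched = [e for e in subject if signature(e) in shared]
    matched += [e for e in obj if signature(e) in shared and e not in matched]
    return matched


# 'version' is a hypothetical annotation that differs between the two sets.
set1 = [{"pid": "2223", "version": "1"}, {"pid": "2226", "version": "1"}]
set2 = [{"pid": "2223", "version": "2"}, {"pid": "3000", "version": "1"}]

matching_pids = get_match(set1, set2, "pid")
print(matching_pids)
```

Both elements with pid 2223 are returned, one from each set, although their other annotations differ. A comparison on all annotations would return nothing for the same input.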

CamFlow Cross-IPC Namespace Example

A scenario is described below where data flows from one process to another while the two are in different IPC namespaces. After that, the use of getMatch is described to illustrate how the resulting cross-namespace provenance can be identified.

Scenario

CamFlow is used to collect provenance for system activity where:

  • A process (called writer) in IPC namespace X writes to a file.
  • A process (called reader) in IPC namespace Y reads the same file that the writer process wrote to.

Provenance Elements

In the above scenario, there is a data flow to the reader process from the writer process. Since the two processes exist in different IPC namespaces, cross-namespace provenance arises. It can be seen in this CamFlow graph (figure: getMatch-complete).

The figure above depicts the following vertices and edges:

  1. writer process (pid:2223) and its IPC namespace (ipcns:4026531839): The writer process is represented by a vertex with the annotation object_type:task. Its IPC namespace is in the vertex with annotation object_type:process_memory (and is connected to the process vertex).

  2. reader process (pid:2226) and its IPC namespace (ipcns:4026532201): The reader process is represented by a vertex with the annotation object_type:task. Its IPC namespace is in the vertex with annotation object_type:process_memory (and is connected to the process vertex).

  3. file at path /tmp/testfile: This is written to and read from. It is represented by the vertex with the annotation object_type:path.

  4. write to path /tmp/testfile: The write is represented by an edge incident on an inode vertex with the annotation object_type:file. The edge itself has the annotation relation_type:write.

  5. read of path /tmp/testfile: It is represented similarly to the write, except that the edge has the annotation relation_type:read.

The CamFlow provenance log for the scenario above can be found here.

Query Approach

Given the complete provenance graph, the goal is to construct a series of queries that can be used to extract all cross-namespace provenance that arises because of writes to and reads from the same path. This can be broken down into the following steps:

  1. Find the inodes that were written to.
  2. Find the tasks (processes) that wrote to the inodes.
  3. Find the process memory vertices (that contain namespace identifiers) of the tasks found in (2).
  4. Find the inodes that were read from.
  5. Find the common inodes between those found in (1) and (4).
  6. Find the tasks that read from the common inodes, found in (5).
  7. Find the process memory vertices (that contain namespace identifiers) of the tasks found in (6).
  8. Find the process memory vertices that match on namespace identifiers, considering the sets found in (3) and (7).
  9. Find the process memory vertices that were in either of the sets found in (3) or (7) but not in both. Note that this is the set of process memory vertices where the namespace identifiers did not match.
  10. Construct a graph using only the vertices and edges found above. This will yield a subgraph with cross-namespace provenance.
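The steps above can be traced on a toy model of the scenario in Python. This is a simplified sketch, not the SPADE queries themselves: the graph is reduced to one inode vertex without versioning, only the `ipcns` identifier is compared, and the vertex ids, the `task_memory` edge label, and the helper functions are all made up for illustration.

```python
# Toy graph for the scenario: vertex id -> annotations.
vertices = {
    "w_task": {"object_type": "task", "pid": "2223"},
    "w_mem": {"object_type": "process_memory", "ipcns": "4026531839"},
    "r_task": {"object_type": "task", "pid": "2226"},
    "r_mem": {"object_type": "process_memory", "ipcns": "4026532201"},
    "inode": {"object_type": "file"},
}
# Edges as (source, destination, relation_type). 'task_memory' is a made-up label.
edges = [
    ("inode", "w_task", "write"),
    ("r_task", "inode", "read"),
    ("w_task", "w_mem", "task_memory"),
    ("r_mem", "r_task", "task_memory"),
]


def neighbours(ids):
    """Vertices one edge away from any vertex in `ids`, in either direction."""
    found = set()
    for src, dst, _ in edges:
        if src in ids:
            found.add(dst)
        if dst in ids:
            found.add(src)
    return found


def of_type(ids, object_type):
    return {v for v in ids if vertices[v]["object_type"] == object_type}


# Steps 1-3: written inodes, writing tasks, and their process memory vertices.
written = {src for src, dst, rel in edges if rel == "write"}
writers = {dst for src, dst, rel in edges if rel == "write"}
writing_mems = of_type(neighbours(writers), "process_memory")

# Steps 4-5: read inodes, and inodes that were both written and read.
read = {dst for src, dst, rel in edges if rel == "read"}
common_files = written & read

# Steps 6-7: tasks that read the common inodes, and their process memory vertices.
readers = {src for src, dst, rel in edges if rel == "read" and dst in common_files}
reading_mems = of_type(neighbours(readers), "process_memory")

# Step 8: partial match on the namespace identifier only.
shared = {vertices[m]["ipcns"] for m in writing_mems} & {
    vertices[m]["ipcns"] for m in reading_mems
}
matched = {m for m in writing_mems | reading_mems if vertices[m]["ipcns"] in shared}

# Step 9: memory vertices whose namespace identifiers did not match.
unmatched = (writing_mems | reading_mems) - matched

# Step 10: vertices of the cross-namespace subgraph.
subgraph = common_files | writers | readers | matched | unmatched
print(sorted(unmatched))
```

In this toy input the writer and the reader have different `ipcns` values, so the matched set is empty and both process memory vertices end up in the unmatched set.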

Sample Queries

Specific queries are listed below, grouped by the step that they perform in the approach described above. (The step number and descriptive comments are provided on lines that start with #.)

Note: The queries below can be copied to a file. They can then be executed in the SPADE query client using the command: load <path to file with queries>

# Group all types of relevant vertices into respective variables for convenience.
# This comes in handy when only a particular type of vertex is required from another variable.

# Group all process memory vertices which contain the namespace identifiers.
$memorys = $base.getVertex(object_type = 'process_memory')
# Group all task vertices which contain the process identifiers. Tasks are connected to process memory vertices.
$tasks = $base.getVertex(object_type = 'task')
# Group all file vertices, each of which represents an inode. Tasks are connected to files by 'relation_type'='read' or 'relation_type'='write'.
$files = $base.getVertex(object_type = 'file')
# Group all path vertices which contain the path of an inode in the filesystem. Files are connected to paths.
$paths = $base.getVertex(object_type = 'path')

# 1. Find the subgraph representing writes of files.
#
# Get all edges that represent a write
$write_edges = $base.getEdge(relation_type = 'write')
# Get all the written files.
# Note: the '&' operation with $files is a convenient way of getting only files from $write_edges.getEdgeSource().
$written_files = $write_edges.getEdgeSource() & $files
# Get all the paths for the files found.
$written_files_to_paths = $base.getPath($written_files, $files, 10, $paths, 1)
# Note: The path vertex is connected to the first version of the file vertex. Each time a write occurs to the path, a new file vertex is created to represent the new version and connected to the previous version. If a path is written N times, there will be N+1 file vertices representing the original and successive versions. However, only the first file vertex is connected to the path vertex. Consequently, in the query above, the number 10 indicates the maximum number of file writes to search for when traversing the sequence of _inode_ artifacts that may have resulted from versioning. If $written_files_to_paths is empty, this may mean that the file was written to more than 10 times. To cover cases with up to N writes, replace 10 with N.

# 2. Find the tasks that wrote to the files.
#
# Get the tasks that wrote to a file.
$writing_tasks = $write_edges.getEdgeDestination() & $tasks
# Get the process memory vertices (that contain the namespace identifiers) for all the writing tasks.
$writing_tasks_to_memorys = $base.getPath($writing_tasks, $memorys, 1)

# 3. Find the process memory vertices of the writing tasks.
#
# Get only the process memory vertices.
$writing_memorys = $writing_tasks_to_memorys.getEdgeDestination() & $memorys

# 4. Find the files that were read from.
#
# Get the read edges to find files that were read.
$read_edges = $base.getEdge(relation_type = 'read')
$read_files = $read_edges.getEdgeDestination() & $files

# 5. Find the files that were both written to and read from.
#
# Find common files -- i.e. ones that were written to and read from. If this is empty, there was no data flow through files.
$common_files = $written_files & $read_files
# Find the paths for the files that were common (above).
$common_files_to_paths = $base.getPath($common_files, $files, 10, $paths, 1)
$common_paths = $common_files_to_paths & $paths

# 6. Find the tasks that read from the common files.
#
# Find the subset of tasks and their process memory vertices from the set of tasks that read the common files.
$reading_tasks_to_common_files = $base.getPath($tasks, $common_files, 1)
$reading_tasks = $reading_tasks_to_common_files & $tasks
$memorys_to_reading_tasks = $base.getPath($memorys, $reading_tasks, 1)

# 7. Find the process memory vertices of the reading tasks.
#
$reading_memorys = $memorys_to_reading_tasks.getEdgeSource() & $memorys

# 8. Match process memory vertices on namespace identifiers.
#
# Find a match between the memory vertices of the reading and writing tasks. This should be limited to finding ones that have the same values for container-related annotation keys. More specifically, these are the namespace identifiers: 'cgroupns', 'ipcns', 'mntns', 'netns', 'pidns', 'utsns'.
$common_memorys = $reading_memorys.getMatch($writing_memorys, 'cgroupns', 'ipcns', 'mntns', 'netns', 'pidns', 'utsns')

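# 9. Separate the process memory vertices with matching and non-matching namespaces.
#
# Divide the process memory vertices into two groups. The first group is the getMatch result, i.e. the ones with matching namespaces. The second group contains the ones whose namespaces did not match.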
$group1_memorys = $common_memorys
# If the following query returns an empty set, that means there were no process memory vertices with differing namespaces.
$group2_memorys = $reading_memorys + $writing_memorys - $common_memorys

# 10. Cross-namespace provenance subgraph construction.
#
# Get the tasks of the process memorys in both groups (in both directions).
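$group1_tasks = $base.getPath($tasks, $group1_memorys, 1) & $tasks
$group2_tasks = $base.getPath($tasks, $group2_memorys, 1) & $tasks
# Add the tasks reached in the opposite direction instead of overwriting the sets above.
$group1_tasks_rev = $base.getPath($group1_memorys, $tasks, 1) & $tasks
$group2_tasks_rev = $base.getPath($group2_memorys, $tasks, 1) & $tasks
$group1_tasks = $group1_tasks + $group1_tasks_rev
$group2_tasks = $group2_tasks + $group2_tasks_rev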
# Use the files that were read from and written to as the starting point of the construction.
$subgraph = $common_files
# Find the paths of the files involved.
$subgraph = $subgraph + $base.getPath($common_files, $files, 10, $common_paths, 1)
# Find the edges from tasks to files in both groups.
$subgraph = $subgraph + $base.getPath($group1_tasks, $common_files, 1)
$subgraph = $subgraph + $base.getPath($group2_tasks, $common_files, 1)
# Find the edges from files to tasks in both groups.
$subgraph = $subgraph + $base.getPath($common_files, $group1_tasks, 1)
$subgraph = $subgraph + $base.getPath($common_files, $group2_tasks, 1)
# Find the edges from tasks to memory vertices in both groups (in both directions).
$subgraph = $subgraph + $base.getPath($group1_tasks, $group1_memorys, 1)
$subgraph = $subgraph + $base.getPath($group2_tasks, $group2_memorys, 1)
$subgraph = $subgraph + $base.getPath($group1_memorys, $group1_tasks, 1)
$subgraph = $subgraph + $base.getPath($group2_memorys, $group2_tasks, 1)

# Discard intermediate graph variables.
erase $memorys $tasks $files $paths $write_edges $written_files $written_files_to_paths
erase $writing_tasks_to_memorys $read_edges $read_files $common_files_to_paths
erase $reading_tasks_to_common_files $memorys_to_reading_tasks $common_memorys

The variable $subgraph contains the resulting graph for the scenario described above. It is depicted in the figure getMatch-result.
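The note in step 1 of the queries, on the hop limit of 10 when following file versions, can be illustrated with a small Python model. This is a sketch under simplifying assumptions, not how SPADE evaluates getPath: it follows a single chain of versions from one starting vertex, and the function names and `file_k` ids are made up for illustration.

```python
def build_chain(n_writes):
    """Map each file version to its previous version.

    file_0 is the original version; file_k is the version created by the k-th write.
    """
    return {f"file_{k}": f"file_{k - 1}" for k in range(1, n_writes + 1)}


def first_version_within(start, previous, max_hops):
    """Follow version -> previous-version links for at most max_hops edges.

    Returns the first version if it is reachable within the bound, else None.
    """
    current = start
    for _ in range(max_hops):
        if current not in previous:
            return current
        current = previous[current]
    return current if current not in previous else None


# Three writes: the first version is 3 hops from the latest one.
print(first_version_within("file_3", build_chain(3), 10))
# Eleven writes: the first version is 11 hops away, beyond the bound of 10.
print(first_version_within("file_11", build_chain(11), 10))
```

Since only the first version is connected to the path vertex, a search that cannot reach the first version within the bound cannot reach the path vertex either. Raising the bound to the expected maximum number of writes avoids this.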
