# Connecting to Elasticsearch directly

It is useful to explore the back-end Elasticsearch data directly before using the Moloch API. For example, exporting all available field names helps when writing API queries.

Start by importing the Elasticsearch Python module.

In [18]:
from elasticsearch import Elasticsearch

Then establish a connection. Multiple hosts can be specified as a Python list. Verify connectivity with the `ping()` method.

In [19]:
es = Elasticsearch(hosts=["192.168.10.13:9200"])
es.ping()

True

## _cat API and exploring indices

We can list the available indices via the `cat` API.

In [20]:
indices = es.cat.indices()
print(indices)

green open files_v5         k8W5NFoCTo6LY-FocOZRIQ 2 0   12 0  49.7kb  49.7kb
green open queries_v2       LgHRb7qJR5mq7gqyNcwyiQ 1 0    0 0    261b    261b
green open stats_v3         w0odpndxR_e-q2drnQx-lA 1 0    1 0  12.8kb  12.8kb
green open hunts_v1         rn2XmMD_SuCQcgFvPPx8nw 1 0    0 0    261b    261b
green open users_v6         pnagVE3KTHu58x1oX6Vm7Q 1 0    1 0   6.5kb   6.5kb
green open dstats_v3        3FcI3RT4RGKSgrnWrBB1Iw 2 0 1330 0 441.1kb 441.1kb
green open fields_v2        pK6ODUppSnuSlr1KZfK0Pg 1 0  307 2  97.4kb  97.4kb
green open sequence_v2      GSKzXikFTNGCEDhOS2luYQ 1 0    1 0   2.6kb   2.6kb
green open sessions2-190522 bO8ol02FSVGR75G-Z0_3RA 1 0 1609 0   2.2mb   2.2mb



By default, it returns a text string. We can split it by newline, and each row by whitespace, to make it usable in a script.

In [22]:
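data = indices.split("\n")
newIdxList = []
for idx in data:
    # skip empty lines, such as the one after the trailing newline
    if len(idx) == 0:
        continue
    print("-"*10)
    # split the row into columns on whitespace
    idx = idx.split()
    # the index name is the third column
    newIdxList.append(idx[2])
    print(idx)
print("*"*20)
print(newIdxList)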

----------
['green', 'open', 'files_v5', 'k8W5NFoCTo6LY-FocOZRIQ', '2', '0', '12', '0', '49.7kb', '49.7kb']
----------
['green', 'open', 'queries_v2', 'LgHRb7qJR5mq7gqyNcwyiQ', '1', '0', '0', '0', '261b', '261b']
----------
['green', 'open', 'stats_v3', 'w0odpndxR_e-q2drnQx-lA', '1', '0', '1', '0', '12.8kb', '12.8kb']
----------
['green', 'open', 'hunts_v1', 'rn2XmMD_SuCQcgFvPPx8nw', '1', '0', '0', '0', '261b', '261b']
----------
['green', 'open', 'users_v6', 'pnagVE3KTHu58x1oX6Vm7Q', '1', '0', '1', '0', '6.5kb', '6.5kb']
----------
['green', 'open', 'dstats_v3', '3FcI3RT4RGKSgrnWrBB1Iw', '2', '0', '1330', '0', '441.1kb', '441.1kb']
----------
['green', 'open', 'fields_v2', 'pK6ODUppSnuSlr1KZfK0Pg', '1', '0', '307', '2', '97.4kb', '97.4kb']
----------
['green', 'open', 'sequence_v2', 'GSKzXikFTNGCEDhOS2luYQ', '1', '0', '1', '0', '2.6kb', '2.6kb']
----------
['green', 'open', 'sessions2-190522', 'bO8ol02FSVGR75G-Z0_3RA', '1', '0', '1609', '0', '2.2mb', '2.2mb']
********************
['fi
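
Positional access such as `idx[2]` is easy to get wrong, so it may help to attach column names to each row. This is a rough sketch, assuming every row carries all ten columns in the default order shown above. `CAT_COLUMNS` and `parse_cat_indices` are hypothetical names, and the sample rows are copied from the output of `es.cat.indices()`.

```python
# Column names for the default cat indices output, in the order the
# columns appear in the text listing.
CAT_COLUMNS = [
    "health", "status", "index", "uuid", "pri", "rep",
    "docs.count", "docs.deleted", "store.size", "pri.store.size",
]


def parse_cat_indices(text, columns=CAT_COLUMNS):
    """Turn the plain-text cat output into a list of dicts, one per index."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(columns):
            # Skip empty lines and anything that does not match the layout.
            continue
        rows.append(dict(zip(columns, parts)))
    return rows


sample = (
    "green open files_v5         k8W5NFoCTo6LY-FocOZRIQ 2 0   12 0  49.7kb  49.7kb\n"
    "green open sessions2-190522 bO8ol02FSVGR75G-Z0_3RA 1 0 1609 0   2.2mb   2.2mb\n"
)
for row in parse_cat_indices(sample):
    print(row["index"], row["docs.count"])
```

Rows with a different number of columns are silently skipped here, which is a simplification; a real script might want to report them instead.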

Alternatively, we can get structured data by specifying the `format` query parameter.

In [24]:
indices2 = es.cat.indices(format="json")
indices2

[{'health': 'green',
  'status': 'open',
  'index': 'files_v5',
  'uuid': 'k8W5NFoCTo6LY-FocOZRIQ',
  'pri': '2',
  'rep': '0',
  'docs.count': '12',
  'docs.deleted': '0',
  'store.size': '49.7kb',
  'pri.store.size': '49.7kb'},
 {'health': 'green',
  'status': 'open',
  'index': 'queries_v2',
  'uuid': 'LgHRb7qJR5mq7gqyNcwyiQ',
  'pri': '1',
  'rep': '0',
  'docs.count': '0',
  'docs.deleted': '0',
  'store.size': '261b',
  'pri.store.size': '261b'},
 {'health': 'green',
  'status': 'open',
  'index': 'stats_v3',
  'uuid': 'w0odpndxR_e-q2drnQx-lA',
  'pri': '1',
  'rep': '0',
  'docs.count': '1',
  'docs.deleted': '0',
  'store.size': '12.8kb',
  'pri.store.size': '12.8kb'},
 {'health': 'green',
  'status': 'open',
  'index': 'hunts_v1',
  'uuid': 'rn2XmMD_SuCQcgFvPPx8nw',
  'pri': '1',
  'rep': '0',
  'docs.count': '0',
  'docs.deleted': '0',
  'store.size': '261b',
  'pri.store.size': '261b'},
 {'health': 'green',
  'status': 'open',
  'index': 'users_v6',
  'uuid': 'pnagVE3KTHu58x

Now we can do a lot of cool things by simply looping over the data structure. For example, list the indices whose health is not green. Nothing is printed here, because all of them are green.

In [28]:
for idx in indices2:
    if idx["health"] != "green":
        print(idx["index"])
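
An empty result from the health check can be hard to tell apart from a loop that never ran. One option is to count the indices per health value. The sketch below uses a hand-made sample in the same shape as the JSON cat output. `health_summary` is a hypothetical helper, and the `yellow` entry is invented to show a non-green case.

```python
from collections import Counter


def health_summary(indices):
    """Count how many indices report each health colour."""
    return Counter(idx["health"] for idx in indices)


# Hand-made sample; "example_idx" and its yellow state are invented.
sample = [
    {"index": "files_v5", "health": "green"},
    {"index": "stats_v3", "health": "green"},
    {"index": "example_idx", "health": "yellow"},
]
print(health_summary(sample))
```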

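Or list the indices that hold more than 100 documents. Note that the counts are returned as strings, so they have to be converted to integers first.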

In [29]:
for idx in indices2:
    if int(idx["docs.count"]) > 100:
        print(idx)

{'health': 'green', 'status': 'open', 'index': 'dstats_v3', 'uuid': '3FcI3RT4RGKSgrnWrBB1Iw', 'pri': '2', 'rep': '0', 'docs.count': '1348', 'docs.deleted': '0', 'store.size': '435.8kb', 'pri.store.size': '435.8kb'}
{'health': 'green', 'status': 'open', 'index': 'fields_v2', 'uuid': 'pK6ODUppSnuSlr1KZfK0Pg', 'pri': '1', 'rep': '0', 'docs.count': '307', 'docs.deleted': '2', 'store.size': '97.4kb', 'pri.store.size': '97.4kb'}
{'health': 'green', 'status': 'open', 'index': 'sessions2-190522', 'uuid': 'bO8ol02FSVGR75G-Z0_3RA', 'pri': '1', 'rep': '0', 'docs.count': '1623', 'docs.deleted': '0', 'store.size': '2.3mb', 'pri.store.size': '2.3mb'}
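
Like the document counts, the sizes arrive as strings such as `2.2mb`, so they cannot be compared directly. Here is a minimal sketch of a converter, assuming the suffixes are binary multiples (1kb = 1024 bytes). `UNITS` and `size_to_bytes` are hypothetical names, and the sample values are taken from the listing above.

```python
# Binary multipliers for the size suffixes seen in the cat output.
UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3, "tb": 1024 ** 4}


def size_to_bytes(size):
    """Convert a string such as '49.7kb' into an approximate byte count."""
    # Try the longer suffixes first so that 'kb' is not mistaken for 'b'.
    for suffix in sorted(UNITS, key=len, reverse=True):
        if size.endswith(suffix):
            return int(float(size[: -len(suffix)]) * UNITS[suffix])
    raise ValueError("unknown size format: {}".format(size))


sample = [
    {"index": "files_v5", "store.size": "49.7kb"},
    {"index": "queries_v2", "store.size": "261b"},
    {"index": "sessions2-190522", "store.size": "2.2mb"},
]
# Sort the indices from largest to smallest store size.
largest_first = sorted(
    sample, key=lambda i: size_to_bytes(i["store.size"]), reverse=True
)
print([i["index"] for i in largest_first])
```

The result is only approximate, because the cat output has already rounded the sizes to one decimal place.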


List comprehensions allow awesome one-line conditional data collection. For example, extract all Moloch session indices.

In [30]:
idxList = [i["index"] for i in indices2 if "sessions2" in i["index"]]
idxList

['sessions2-190522']
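
The session index name appears to carry a date. The sketch below assumes the suffix is a two-digit year, month and day (`YYMMDD`), as in `sessions2-190522`; other rotation schemes would need a different pattern. `session_index_date` is a hypothetical helper that returns `None` for names that do not fit.

```python
from datetime import datetime


def session_index_date(name, prefix="sessions2-"):
    """Return the date encoded in a daily session index name, or None."""
    if not name.startswith(prefix):
        return None
    try:
        # Assumption: the suffix is year, month and day as YYMMDD.
        return datetime.strptime(name[len(prefix):], "%y%m%d").date()
    except ValueError:
        # The suffix does not match the assumed daily pattern.
        return None


print(session_index_date("sessions2-190522"))
```

This could be combined with the list comprehension above to pick only the session indices from a given time range.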

## Exploring Moloch data

We cannot write API (or regular) queries without knowing which field names are available. At the time of writing this notebook, the best way to extract all fields is to run an Elasticsearch query against the `fields` index. Note that this is an empty query, so it matches every document. The `size` parameter limits the maximum number of returned items; `3000` is simply a large enough number to ensure we get all fields.

In [33]:
fields = es.search(index="fields", size=3000)
#print(fields)

In [36]:
raw = [f["_source"] for f in fields["hits"]["hits"]]
#print(raw)

In [45]:
justnames = [f["dbField2"] for f in raw if f["dbField2"]]
justnames = sorted(justnames)

In [46]:
justnames[:10]

['_id',
 'asnall',
 'asset',
 'assetCnt',
 'cert.alt',
 'cert.altCnt',
 'cert.hash',
 'cert.issuerCN',
 'cert.issuerON',
 'cert.notAfter']

In [47]:
print("there are {} fields in database".format(len(justnames)))

there are 307 fields in database
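
The steps above can be wrapped into one function that also checks whether `size` was really large enough. This is a sketch under assumptions: `field_names` is a hypothetical helper, the sample response is hand-made, and the total hit count is treated as either an integer or a dict with a `value` key, since its shape depends on the Elasticsearch version.

```python
def field_names(response, key="dbField2"):
    """Return sorted, non-empty values of `key` from a search response."""
    hits = response["hits"]["hits"]
    total = response["hits"]["total"]
    # Older Elasticsearch versions report the total as an integer, newer
    # ones as a dict such as {"value": 307, "relation": "eq"}.
    if isinstance(total, dict):
        total = total["value"]
    if total > len(hits):
        print("warning: only {} of {} documents returned".format(len(hits), total))
    # .get() skips documents that lack the key instead of raising KeyError.
    names = [h["_source"].get(key) for h in hits]
    return sorted(n for n in names if n)


# Hand-made response for illustration; the "other" key is invented.
sample = {
    "hits": {
        "total": 3,
        "hits": [
            {"_source": {"dbField2": "cert.hash"}},
            {"_source": {"dbField2": "asset"}},
            {"_source": {"other": "no dbField2 here"}},
        ],
    }
}
print(field_names(sample))
```

Against the live cluster this would be called as `field_names(fields)` with the search result from above.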


# Tasks

* Extract all `dns` and `http` fields into dedicated Python lists;
    * How many fields are in each list?
* **advanced** Explore the pcap file list in the Moloch database;
    * Print all file names with the first packet timestamp;
    * Run this query against singlehost or buildbox if this vagrant environment does not have enough pcap files;
    * Load additional pcap files from [here](https://malware-traffic-analysis.net/2019/index.html) (the zip file password is always INFECTED) and get the first packet timestamp from each file;