# Exploring VXDB

Starter notebook for exploring [virus.exchange](https://virus.exchange), which is [VX Underground](https://www.vx-underground.org/)'s malware repository based on [mwdb.cert.pl](https://mwdb.cert.pl).

# Getting started

Let's start by installing the [`vxdb`](https://github.com/backchannelinc/vxdb) wrapper.

In [None]:
%pip install git+https://github.com/backchannelinc/vxdb

We then import the library.

In [4]:
from vxdb import VXDB

Now let's get authenticated. We can do that from the CLI with `vxdb login`, or by using the `login()` method.

Alternatively, if the `VX_API` environment variable is set, you could pass its value to the `api_key` argument of the `VXDB` class. For now, we stick with the `login()` method.
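A minimal sketch of the environment-variable route, in case you want to try it. The helper name `api_key_from_env` is made up for illustration, and the `VXDB(api_key=...)` call is only shown as a comment because it is based solely on the description above.

```python
import os


def api_key_from_env(var_name: str = "VX_API"):
    """Return the API key stored in the environment, or None if it is unset or empty."""
    # `or None` turns an empty string into None so both cases are handled the same way.
    return os.environ.get(var_name) or None


api_key = api_key_from_env()
if api_key is None:
    print("VX_API is not set; falling back to vx.login()")
# Otherwise (assumed usage, argument name taken from the text above):
# vx = VXDB(api_key=api_key)
```

Returning `None` rather than raising keeps the choice between the API key and `login()` in the notebook cell.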

In [6]:
vx = VXDB()
vx.login()

We are now logged in! We can confirm that by doing a count of the repository.

In [32]:
vx.count_files()

10286481

# Putting data into tabular format with `pandas`

Many data scientists use [`pandas`](https://pandas.pydata.org/) for manipulating and analyzing data in Python. We will use it here for a similar purpose: putting VXDB file metadata into tables that are easier to inspect.

First let's install `pandas`.

In [None]:
%pip install pandas

We then import the library, along with the standard-library `json` module.

In [95]:
import pandas as pd
import json

We need some idea of the data and what it looks like so we know how it should be tabulated by `pandas`.

Since we are just exploring, let's start by looking at some of the most recent files.

In [71]:
recent_files = vx.recent_files()
most_recent_file = next(recent_files)
most_recent_file.name

'Trojan.Win32.MicroFake.ba-1fc1cd22918ef9b0d19165edb903dc709f1cc84d356e91d3056f681a7b5ca884'
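Since `recent_files` is consumed with `next()`, it behaves like an iterator, so one way to grab a small batch without walking the whole repository is `itertools.islice`. The sketch below uses a stand-in generator (`fake_recent_files`, purely hypothetical) so it runs without a VXDB login. With the real object you would pass `vx.recent_files()` instead.

```python
from itertools import islice


def take(iterable, n):
    """Return the first n items of an iterable as a list, consuming only those items."""
    return list(islice(iterable, n))


def fake_recent_files():
    """Stand-in for vx.recent_files(): an endless generator of placeholder names."""
    i = 0
    while True:
        yield f"sample-{i}"
        i += 1


batch = take(fake_recent_files(), 3)
print(batch)
```

Keep in mind that an iterator remembers its position: after the `next(recent_files)` call above, a later `take(recent_files, 3)` would start from the second file.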

So, I kind of want a method to just dump all the available metadata about the file as a Python `dict`.

In [111]:
def sample_info(sample: "vxdb.file.MWDBFile"):
    # The annotation is quoted so it is not evaluated: only `VXDB` was imported above.
    return {
        'id': sample.id,
        'upload_time': sample.upload_time,
        'name': sample.name,
        'file_name': sample.file_name,
        'file_type': sample.file_type,
        'size': sample.size,
        'type': sample.type,
        'tags': sample.tags,
        # NOTE: analysis.last_update is left out because reading it raised
        # "TypeError: fromisoformat: argument must be str" in mwdb.
        'analyses': [
            {
                'id': analysis.id,
                'status': analysis.status,
                'is_running': analysis.is_running,
                'arguments': analysis.arguments,
                'processing_in': analysis.processing_in,
            }
            for analysis in sample.analyses
        ],
        'parents': sample.parents,
        'children': sample.children,
        'comments': sample.comments,
        'object_type': sample.object_type,
        'shares': sample.shares
    }

sample_info(most_recent_file)

TypeError: fromisoformat: argument must be str
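One way to keep a single misbehaving attribute from taking down the whole dump is to read attributes through a small guard that swallows the `TypeError`. This is only a sketch: `safe_attr` and the `BrokenAnalysis` stand-in class are made up for illustration and are not part of `vxdb` or mwdb.

```python
def safe_attr(obj, name, default=None):
    """Read obj.<name>, returning `default` if the attribute's getter raises TypeError."""
    try:
        return getattr(obj, name)
    except TypeError:
        return default


class BrokenAnalysis:
    """Hypothetical stand-in whose `last_update` property fails like the error above."""

    id = "a1"

    @property
    def last_update(self):
        raise TypeError("fromisoformat: argument must be str")


analysis = BrokenAnalysis()
print(safe_attr(analysis, "id"))           # normal attribute, returned as-is
print(safe_attr(analysis, "last_update"))  # getter raises TypeError, so we get the default
```

Only `TypeError` is caught on purpose. A misspelled attribute name still raises `AttributeError`, so typos do not get silently turned into `None`.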

In [102]:
sample_info(most_recent_file)['analyses'][0]['id']

'a0e63a4b-926a-4e4f-95ad-082fd4f797d4'
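To get from a nested `dict` like this to something `pandas` can tabulate, one option is to flatten each sample into a single row: keep scalar fields as they are and serialize nested lists and dicts to JSON strings. The sketch below assumes a dict shaped like the output of `sample_info()`. The `to_row` helper is hypothetical and the values in `example` are made-up placeholders, not real VXDB data.

```python
import json
from datetime import datetime


def to_row(info: dict) -> dict:
    """Flatten one metadata dict into a single table row."""
    row = {}
    for key, value in info.items():
        if isinstance(value, (list, dict)):
            # Nested data becomes a JSON string; default=str covers objects
            # (e.g. datetimes or library objects) that json cannot encode itself.
            row[key] = json.dumps(value, default=str)
        elif isinstance(value, datetime):
            row[key] = value.isoformat()
        else:
            row[key] = value
    return row


# Placeholder values for illustration only.
example = {
    "id": "abc123",
    "upload_time": datetime(2024, 1, 2, 3, 4, 5),
    "size": 1024,
    "tags": ["tag-a", "tag-b"],
    "analyses": [{"id": "a1", "status": "done"}],
}

row = to_row(example)
print(row)
```

With a list of such rows, `pd.DataFrame(rows)` gives one line per sample, and the JSON columns can be decoded again with `json.loads` when the nested detail is needed.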