# Log Analysis Scenario

Suppose you were manually reviewing some logs and found a suspicious call to a malicious domain (in this case, we'll use "pudim.com.br" as the example).
The code below shows how you could continue the investigation to find the source of the communication.

### Import used libs and network data

In [1]:
import pyspark.sql.functions as f
import pyspark.sql.types as t
import json

Below we load the network data, which contains all the communication. One important thing to note is that
these logs follow an uncommon format. The most valuable field is not well structured, so we build a schema
for this field on the fly in order to analyze it.
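The notebook does not show how that schema is derived, so the following is only a minimal sketch of the general idea in plain Python: scan a few raw JSON messages and record which value types appear under each key. The helper name `infer_flat_schema` and the sample messages are assumptions made for illustration (the field names mirror columns used later in this notebook); the real raw message layout may differ.

```python
import json


def infer_flat_schema(messages):
    """Map each top-level JSON key to the set of Python type names seen for it."""
    schema = {}
    for raw in messages:
        record = json.loads(raw)
        for key, value in record.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema


# Hypothetical raw messages, for illustration only.
sample_messages = [
    '{"ts": 1635634832.222794, "uid": "CTXG322HuLUPytKwc", "query": "pudim.com.br", "proto": "udp"}',
    '{"ts": 1635634832.293271, "uid": "CuJjgb3uRqWrdtGGJj", "method": "GET", "status_code": 200}',
]

schema = infer_flat_schema(sample_messages)
for field in sorted(schema):
    print(field, sorted(schema[field]))
```

A field that shows up with more than one type name in the result is a hint that it needs special handling before it can be given a fixed column type.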

In [2]:
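# Drop the top-level 'host' and 'tags' columns, then expand the fields of the
# 'message_json' struct into top-level columns alongside the remaining ones.
df_network_raw = (
    spark.read.parquet('datalake/network_logs/')
    .drop('host', 'tags')
    .select('message_json.*', '*')
)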


### Start the log analysis with the DNS protocol

In [3]:
df_network_dns = df_network_raw\
.filter(f.col('query').isNotNull())\
.select(
        "AA", "RA", "RD", "TC", "TTLs", "Z", "answers", "`id.orig_h`", "`id.orig_p`", "`id.resp_h`", "`id.resp_p`",
        "proto", "query", "rcode", "rcode_name", "rejected", "trans_id", "ts", "uid"
)

In [4]:
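# Raw string keeps the backslashes intact; rlike matches anywhere in the value,
# so no leading or trailing '.*' is needed.
df_network_dns.filter(f.col('query').rlike(r'pudim\.com\.br')).show(20, False)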

+-----+-----+----+-----+-----+---+---------------+------------+---------+----------+---------+-----+------------+-----+----------+--------+--------+-------------------+------------------+
|AA   |RA   |RD  |TC   |TTLs |Z  |answers        |id.orig_h   |id.orig_p|id.resp_h |id.resp_p|proto|query       |rcode|rcode_name|rejected|trans_id|ts                 |uid               |
+-----+-----+----+-----+-----+---+---------------+------------+---------+----------+---------+-----+------------+-----+----------+--------+--------+-------------------+------------------+
|false|false|true|false|null |0  |null           |172.16.0.134|50861    |172.16.0.2|53       |udp  |pudim.com.br|0    |NOERROR   |false   |60417   |1.635634832222794E9|CTXG322HuLUPytKwc |
|false|true |true|false|[5.0]|0  |[54.207.20.104]|172.16.0.134|50861    |172.16.0.2|53       |udp  |pudim.com.br|0    |NOERROR   |false   |18691   |1.635634832222791E9|CTXG322HuLUPytKwc |
|false|false|true|false|null |0  |null           |172.16.0.1
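The filter above relies on two regex details: the pattern only has to match somewhere inside the `query` value, and the dots must be escaped to match literal dots. The sketch below illustrates both with Python's `re` module, used here as a stand-in for the Java regex engine behind Spark's `rlike` (for a pattern this simple the two should behave the same). The third query string is a made-up near miss, not a value from the logs.

```python
import re

# The last entry is a hypothetical near miss, for illustration only.
queries = ["pudim.com.br", "www.pudim.com.br", "pudimXcomXbr"]

escaped = re.compile(r"pudim\.com\.br")   # '\.' matches only a literal dot
unescaped = re.compile(r"pudim.com.br")   # '.' matches any single character

# search() looks for a match anywhere in the string, like rlike does.
escaped_hits = [q for q in queries if escaped.search(q)]
unescaped_hits = [q for q in queries if unescaped.search(q)]

print("escaped:  ", escaped_hits)
print("unescaped:", unescaped_hits)
```

The unescaped pattern also accepts the near miss, which is why the dots are escaped when hunting for one specific domain.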

### After identifying the IP returned by the DNS query (from the answers column), we can look for connections in which this IP is the responder

In [5]:
df_network_raw.filter(f.col('`id.resp_h`') == '54.207.20.104')\
.select(
        "host", "`id.orig_h`", "`id.orig_p`", "`id.resp_h`", "`id.resp_p`", "method", "request_body_len", "resp_fuids", "resp_mime_types", "response_body_len", "status_code",
        "status_msg", "tags", "trans_depth", "ts", "uid", "uri", "user_agent", "version"
)\
.show(20, False)

+------------+------------+---------+-------------+---------+------+----------------+--------------------+---------------+-----------------+-----------+----------+----+-----------+-------------------+------------------+----+-----------+-------+
|host        |id.orig_h   |id.orig_p|id.resp_h    |id.resp_p|method|request_body_len|resp_fuids          |resp_mime_types|response_body_len|status_code|status_msg|tags|trans_depth|ts                 |uid               |uri |user_agent |version|
+------------+------------+---------+-------------+---------+------+----------------+--------------------+---------------+-----------------+-----------+----------+----+-----------+-------------------+------------------+----+-----------+-------+
|pudim.com.br|172.16.0.134|55188    |54.207.20.104|80       |GET   |0               |[Fp2TGz3Q2LjDRBVrai]|[text/html]    |851              |200        |OK        |[]  |1          |1.635634832293271E9|CuJjgb3uRqWrdtGGJj|/   |curl/7.74.0|1.1    |
|null        |172.16

### The user_agent column already gives a clue as to which application made the request, but to make sure, we need to use the host logs

In [6]:
df_host = spark.read.parquet('datalake/host_logs')

In [7]:
df_host\
.filter(f.col('event.action') == 'process_started')\
.select('process.*')\
.select('*', f.concat_ws(',', 'args').alias('args_str'))\
.filter(f.col('args_str').rlike(r'pudim') | f.col('args_str').rlike(r'54\.207\.20\.104'))\
.drop('args')\
.show(100, False)

+-------+----------------+-------------+------------------------------------------+----+-----+-----+------------------------+-----------------+--------------------+
|created|entity_id       |executable   |hash                                      |name|pid  |ppid |start                   |working_directory|args_str            |
+-------+----------------+-------------+------------------------------------------+----+-----+-----+------------------------+-----------------+--------------------+
|null   |o3Znp1u/2VGxbpi4|/usr/bin/curl|{ef1137f1880a99cb150c5cdc9b5032bfc56713c5}|curl|31626|28809|2021-10-30T23:00:32.060Z|/                |curl,-L,pudim.com.br|
+-------+----------------+-------------+------------------------------------------+----+-----+-----+------------------------+-----------------+--------------------+
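One way to gain confidence that this process is the one behind the network traffic is to compare timestamps. The sketch below assumes that the network `ts` values are Unix epoch seconds in UTC and that the host `start` field is an ISO 8601 UTC timestamp; the helper names are made up for illustration. The three input values are copied from the outputs shown in this notebook.

```python
from datetime import datetime, timezone


def network_ts_to_utc(ts):
    """Convert an epoch-seconds value (possibly in scientific notation) to a UTC datetime."""
    return datetime.fromtimestamp(float(ts), tz=timezone.utc)


def host_start_to_utc(start):
    """Parse a host-log timestamp such as '2021-10-30T23:00:32.060Z'."""
    return datetime.strptime(start, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


process_start = host_start_to_utc("2021-10-30T23:00:32.060Z")   # curl process start
dns_query = network_ts_to_utc("1.635634832222794E9")            # first DNS row
http_request = network_ts_to_utc("1.635634832293271E9")         # HTTP GET row

for label, moment in [("dns query", dns_query), ("http request", http_request)]:
    delta_ms = (moment - process_start).total_seconds() * 1000
    print(f"{label}: {moment.isoformat()} ({delta_ms:.0f} ms after process start)")
```

Under those assumptions, both network events fall within the same second as the process start and after it, which is consistent with `curl` being the source.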



### Conclusion
And that's it: we found the culprit, `curl`. If we wanted to go even further, we could look up the `ppid` (parent process ID) to check what exactly launched this curl and understand why, but for the purposes of this demo, we'll stop here.
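As a rough illustration of that follow-up, the sketch below walks a process's ancestry by following `ppid` links through a list of in-memory records. Only the `curl` record comes from the host logs above; the parent and grandparent records, and the helper name `ancestry`, are hypothetical and exist only to make the example runnable.

```python
def ancestry(processes, pid):
    """Return the chain of process records from `pid` up to the oldest known ancestor."""
    by_pid = {p["pid"]: p for p in processes}
    chain = []
    seen = set()  # guards against cycles in malformed data
    while pid in by_pid and pid not in seen:
        seen.add(pid)
        chain.append(by_pid[pid])
        pid = by_pid[pid]["ppid"]
    return chain


processes = [
    {"pid": 31626, "ppid": 28809, "name": "curl"},  # taken from the host log output
    {"pid": 28809, "ppid": 1200, "name": "bash"},   # hypothetical parent, illustration only
    {"pid": 1200, "ppid": 1, "name": "sshd"},       # hypothetical grandparent, illustration only
]

print(" <- ".join(p["name"] for p in ancestry(processes, 31626)))
```

On real data, PIDs get reused over time, so the lookup would also need to be restricted to the same machine and to a time window around the child's start time.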

**Steps we took:**
1. Starting from a given domain, we found the IP it resolved to at the time of the query
2. Using the resolved IP, we found the HTTP request sent to it and the internal host that sent it
3. In the logs from the host machine, we found the executable that made the request to this malicious domain
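The three steps can be sketched end to end over small in-memory records, without Spark. This is a simplified illustration, not the notebook's actual pipeline: the function name `investigate` is made up, and the records contain only the few fields needed here, with values copied from the outputs above.

```python
dns_rows = [
    {"query": "pudim.com.br", "answers": None, "id.orig_h": "172.16.0.134"},
    {"query": "pudim.com.br", "answers": ["54.207.20.104"], "id.orig_h": "172.16.0.134"},
]
http_rows = [
    {"id.orig_h": "172.16.0.134", "id.resp_h": "54.207.20.104", "user_agent": "curl/7.74.0"},
]
process_rows = [
    {"executable": "/usr/bin/curl", "pid": 31626, "ppid": 28809,
     "args": ["curl", "-L", "pudim.com.br"]},
]


def investigate(domain):
    # Step 1: collect every IP the domain resolved to (rows without answers are skipped).
    ips = {ip for row in dns_rows if domain in row["query"]
           for ip in (row["answers"] or [])}
    # Step 2: find the internal hosts that connected to any of those IPs.
    clients = {row["id.orig_h"] for row in http_rows if row["id.resp_h"] in ips}
    # Step 3: find processes whose arguments mention the domain or one of the IPs.
    needles = {domain} | ips
    suspects = [proc["executable"] for proc in process_rows
                if any(needle in arg for arg in proc["args"] for needle in needles)]
    return ips, clients, suspects


ips, clients, suspects = investigate("pudim.com.br")
print(ips, clients, suspects)
```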