# Election Scan Data

This project contains example data from a web scanning tool.  The tool records potentially fake sites related to the 2016 election.  The CSV files are a subset of the IP addresses recorded when the tool registered a 'Hit'.  The goal is to dig deeper into how some of these 'Hits' may be related by looking at the IP paths, or 'Hops', taken to reach a given URL.

We plan to achieve this by loading our sample data into a graph database, which is a natural fit for this kind of connected data.  In the end, we want a graph schema resembling the following illustration:

<div align="left">
    <img src="images/graph_model.png" alt="Graph Model" width="500px" align="center"/>
</div>

Each scan generates multiple streams of 'Hits' and 'Hops'.  Each 'Hit' represents a URL, or website, that responded with an HTTP 200 status.  A redirect scenario appears when a 'Hit' results in further 'Hops'.

To begin, we need to set up a Python code context to read our CSV files, connect to our graph database (we will be using Neo4J), and import our data.

In [1]:
# Import os utilities
import os

# Import py2neo, a Python client library for Neo4J
from py2neo import Graph

# Connect to our graph database and store the connection in a variable.
graph = Graph("bolt://localhost", auth=("neo4j", "neo4j"))

# Set up a local path reference
rel_path = os.getcwd()

# If you want to clear your database and start fresh, uncomment the line below.
# BE SURE TO CHECK WHAT DATABASE YOU ARE RUNNING THIS AGAINST.
# MORE CAPITAL LETTERS TO EMPHASIZE THE POINT ABOVE.
# graph.run("MATCH (d) DETACH DELETE (d)").summary().counters

## Data CSV Files

Here we are setting up variables pointing to the CSV files we have stored on our machine.

In [53]:
# This file contains scans our tool ran on presidential candidates.
scans_file = "file:" + os.path.join(rel_path, "Scans.csv")

# This file contains the logs of the IP addresses the tool encountered while getting to each URL.
hops_file = "file:" + os.path.join(rel_path, "Hops.csv")
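
As an alternative to concatenating `"file:"` with a path, the standard library can build the URL for us. The following is a minimal sketch, not part of the original notebook: `csv_file_url` is a hypothetical helper, and it assumes the CSV files sit in the current working directory as they do above. `Path.as_uri()` produces a `file:///...` URL and percent-encodes characters such as spaces.

```python
import os
from pathlib import Path


def csv_file_url(file_name, base_dir=None):
    """Return a file:// URL for file_name under base_dir (defaults to the cwd).

    Hypothetical helper for illustration; the notebook itself builds the
    URL by string concatenation.
    """
    base = Path(base_dir) if base_dir is not None else Path(os.getcwd())
    # resolve() makes the path absolute, which as_uri() requires;
    # as_uri() adds the scheme and percent-encodes special characters.
    return (base / file_name).resolve().as_uri()


scans_url = csv_file_url("Scans.csv")
hops_url = csv_file_url("Hops.csv")
print(scans_url)
print(hops_url)
```

Either form can be passed as the `$scansFile` or `$hopsFile` parameter; which paths Neo4J is allowed to read still depends on its import directory configuration.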

Let's see what the structure of the data looks like by using the `LOAD CSV` command in Neo4J.  We will load each file and show its first row as an example.

NOTE: If you get an error running the commands below, try commenting out the `dbms.directories.import` line in your graph database's configuration file so that Neo4J can read files outside its import directory.  Leaving the line commented out is not secure, so make sure to uncomment it when you are done.

In [54]:
scans_csv_query = """LOAD CSV WITH HEADERS FROM $scansFile AS scan RETURN scan LIMIT 1 """
print("Example scan:")
graph.run(scans_csv_query, { "scansFile": scans_file }).data()

Example scan:


[{'scan': {'include_typos': '1',
   'combination_level': '3',
   'id': '1',
   'type': 'election',
   'terms': 'Barak,Obama'}}]
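
The example output above shows that `LOAD CSV` returns every field as a string, so numeric columns need an explicit conversion before they can be used as numbers. The sketch below shows the same idea on the Python side. `parse_scan_row` is a hypothetical helper, and treating `terms` as a comma-separated list is an assumption based on the sample value.

```python
def parse_scan_row(row):
    """Convert the string fields of one scan row into typed Python values.

    Hypothetical helper for illustration only.
    """
    return {
        "id": row["id"],
        "type": row["type"],
        # Assumption: terms is a comma-separated list of search terms.
        "terms": [term.strip() for term in row["terms"].split(",") if term.strip()],
        "include_typos": int(row["include_typos"]),
        "combination_level": int(row["combination_level"]),
    }


# The row returned by the LOAD CSV example above.
example_row = {
    "include_typos": "1",
    "combination_level": "3",
    "id": "1",
    "type": "election",
    "terms": "Barak,Obama",
}

parsed = parse_scan_row(example_row)
print(parsed)
```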

In [55]:
hops_csv_query = """LOAD CSV WITH HEADERS FROM $hopsFile AS hop RETURN hop LIMIT 1 """
print("Example hop:")
graph.run(hops_csv_query, { "hopsFile": hops_file }).data()

Example hop:


[{'hop': {'last_ip': None,
   'ip': '254.247.245.171',
   'scan_id': '1',
   'id': '1',
   'url': None,
   'order': '1',
   'test_id': 'dcc90b36-7ea3-4450-a7c7-0775fffaaaad'}}]

## Importing Scans

Next, we want to import the scans our tool ran.  These will be our initial nodes identifying the information that generated any hits we may have encountered.

In [60]:
constraint_query = """
CREATE CONSTRAINT ON (s:Scan)
ASSERT s.id IS UNIQUE
"""

import_query = """
LOAD CSV WITH HEADERS FROM $scansFile AS scan
WITH scan WHERE scan.id IS NOT NULL
MERGE (s:Scan { id: scan.id })
SET s.terms = scan.terms,
    s.include_typos = toInt(scan.include_typos),
    s.combination_level = toInt(scan.combination_level),
    s.type = scan.type
"""

display(graph.run(constraint_query).summary().counters)
display(graph.run(import_query, { "scansFile": scans_file }).summary().counters)

{}

{'properties_set': 8}

## Importing Hops

Next, we want to import the 'Hops' our tool recorded.  These represent the paths taken when the scan tool requested a URL.  Note that the hop sequence may have gaps, which show up in the `order` column of the CSV file.  A gap means the tool was able to reach a 'Hit' at that point.  These gaps will be filled in and the 'Hops' will be connected once we import the 'Hit' data later.

In [57]:
constraint_query = """
CREATE CONSTRAINT ON (h:Hop)
ASSERT h.ip IS UNIQUE;
"""

import_query = """
USING PERIODIC COMMIT 1
LOAD CSV WITH HEADERS FROM $hopsFile AS hop
WITH hop WHERE hop.id IS NOT NULL
MATCH (s:Scan { id: hop.scan_id })
WITH hop, s
FOREACH (ift in CASE WHEN hop.url IS NULL OR hop.url = '' THEN [1] ELSE [] END |
    MERGE (h:Hop { ip: hop.ip })
)
FOREACH (ift in CASE WHEN hop.url <> '' THEN [1] ELSE [] END |
    MERGE (h:Hit { ip: hop.ip, url: hop.url })
)
WITH hop, s
MATCH (h { ip: hop.ip })
WITH hop, s, h
FOREACH (ift in CASE WHEN hop.order = '1' THEN [1] ELSE [] END |
    MERGE (s)-[:RESULTED_IN{ test_id: hop.test_id }]->(h)
)
WITH hop, s, h
MATCH (hh { ip: hop.last_ip })
WITH hop, s, h, hh
FOREACH (ift in CASE WHEN hop.last_ip <> '' THEN [1] ELSE [] END |
    MERGE (hh)-[:RESULTED_IN{ test_id: hop.test_id }]->(h)
)
"""

display(graph.run(constraint_query).summary().counters)
display(graph.run(import_query, { "hopsFile": hops_file }).summary().counters)

{}

{}
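
The gaps in the `order` column described above can be spotted before or after the import with a few lines of plain Python. This is a minimal sketch rather than part of the scanning tool: `find_order_gaps` is a hypothetical helper, the sample rows are made up for illustration, and it assumes `order` starts at 1 within each `test_id`, as in the example hop shown earlier.

```python
from collections import defaultdict


def find_order_gaps(hop_rows):
    """Map each test_id to the sorted list of `order` values it is missing.

    Hypothetical helper; assumes `order` starts at 1 within each test_id.
    """
    orders = defaultdict(set)
    for row in hop_rows:
        # CSV values arrive as strings, so convert before comparing.
        orders[row["test_id"]].add(int(row["order"]))

    gaps = {}
    for test_id, seen in orders.items():
        expected = set(range(1, max(seen) + 1))
        missing = sorted(expected - seen)
        if missing:
            gaps[test_id] = missing
    return gaps


# Made-up rows for illustration; real rows would come from Hops.csv.
sample_rows = [
    {"test_id": "test-a", "order": "1"},
    {"test_id": "test-a", "order": "2"},
    {"test_id": "test-a", "order": "4"},
    {"test_id": "test-b", "order": "1"},
    {"test_id": "test-b", "order": "2"},
]

gaps = find_order_gaps(sample_rows)
print(gaps)  # {'test-a': [3]}
```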

## All Done With Data Import!

Now we should have all the example data represented in our graph schema.  Try looking at one of the test paths by running the query below.  The query will display the entire sequence of events from our 'Scan' to a 'Hit'.

In [50]:
query = """
MATCH (s:Scan)-[r:RESULTED_IN* { test_id: '3c70533f-78c6-4d63-80b8-763d9d3b4b18' }]->(h)
RETURN s.id AS Scan_ID, s.terms AS Scan_Terms, h.ip AS IP, h.url AS URL
"""

graph.run(query).to_data_frame()

|   | IP | Scan_ID | Scan_Terms | URL |
|---|----|---------|------------|-----|
| 0 | 56.232.166.204 | 2 | Donald,Trump | |
| 1 | 217.160.146.153 | 2 | Donald,Trump | |
| 2 | 84.215.192.77 | 2 | Donald,Trump | |
| 3 | 59.214.2.129 | 2 | Donald,Trump | |
| 4 | 211.9.200.184 | 2 | Donald,Trump | |
| 5 | 140.221.166.199 | 2 | Donald,Trump | |
| 6 | 77.23.45.16 | 2 | Donald,Trump | http://www.preztrump2016.com/ |


If you run the query within the Neo4J browser, you will get something resembling the following visualization:
<div align="left">
    <img src="images/sample_path_data_import.png" alt="Graph Model" width="700px" align="center"/>
</div>
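
As a further sanity check after the import, it can help to count how many nodes ended up under each label. The sketch below is one possible way to do that. `label_counts` is a hypothetical helper that only relies on the `graph.run(...).data()` call already used in this notebook, and it expects the `graph` connection created in the first cell.

```python
def label_counts(graph):
    """Return a dict mapping each node label to the number of nodes carrying it.

    Hypothetical helper; `graph` is expected to be the py2neo connection
    created at the top of the notebook.
    """
    query = """
    MATCH (n)
    UNWIND labels(n) AS label
    RETURN label, count(*) AS total
    ORDER BY label
    """
    # data() returns one dict per result row, keyed by the RETURN aliases.
    return {row["label"]: row["total"] for row in graph.run(query).data()}


# Usage inside the notebook, with the connection from the first cell:
# label_counts(graph)
```

With the sample data loaded, the result should contain the `Scan`, `Hop`, and `Hit` labels. A missing label suggests that the corresponding import step did not create any nodes.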