Merge pull request #258 from censoredplanet/dns-docs
Add docs for DNS data
ohnorobo committed Sep 27, 2023
2 parents a5e129c + 7ffd048 commit 0a6f973
Showing 5 changed files with 325 additions and 34 deletions.
130 changes: 121 additions & 9 deletions docs/base_tables.md
@@ -7,18 +7,19 @@ These tables are created using the original censored planet json data, plus some

There is one table for each scan type.

- `firehook-censoredplanet:base.echo_scan`
- `firehook-censoredplanet:base.discard_scan`
- `firehook-censoredplanet:base.http_scan`
- `firehook-censoredplanet:base.https_scan`
- `censoredplanet-analysisv1:base.echo_scan`
- `censoredplanet-analysisv1:base.discard_scan`
- `censoredplanet-analysisv1:base.http_scan`
- `censoredplanet-analysisv1:base.https_scan`
- `censoredplanet-analysisv1:base.satellite_scan`

## Partitioning and Clustering

The tables are time-partitioned along the `date` field.

The tables are clustered along the `country` and then `asn` fields.
The tables are clustered along the `[server|resolver]_country` and then `[server|resolver]_asn` fields.
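
Because the tables are partitioned on `date` and clustered on country and then ASN, queries that filter on those fields first should scan less data. The sketch below only builds a query string; it assumes the standard-SQL dotted form of the table name, and the helper name `build_scan_query` is hypothetical rather than part of the pipeline.

```python
from datetime import date

# Table name from the list above, written in standard-SQL dotted form (an assumption).
SATELLITE_TABLE = "censoredplanet-analysisv1.base.satellite_scan"


def build_scan_query(table, day, country, country_field="resolver_country"):
    """Return a query string that filters on the partition and cluster fields.

    Hypothetical helper for illustration only.
    """
    if country_field not in ("resolver_country", "server_country"):
        raise ValueError(f"unexpected cluster field: {country_field}")
    if not (len(country) == 2 and country.isalpha()):
        raise ValueError(f"expected a two-letter country code, got {country!r}")
    return (
        f"SELECT domain, {country_field}, COUNT(*) AS measurements\n"
        f"FROM `{table}`\n"
        # Filtering on the partition column limits the partitions scanned.
        f"WHERE date = DATE '{day.isoformat()}'\n"
        # Filtering on the first cluster column narrows the blocks read.
        f"  AND {country_field} = '{country.upper()}'\n"
        f"GROUP BY domain, {country_field}"
    )


print(build_scan_query(SATELLITE_TABLE, date(2023, 9, 1), "US"))
```

For the Hyperquack tables the same helper would be called with `country_field="server_country"`.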

## Table Format
## Hyperquack Table Format

The json data is processed into a flat table format which looks like this.

@@ -47,7 +48,7 @@ The json data is processed into a flat table format which looks like this.
| server_as_full_name | STRING | Autonomous system long name, eg. `Cloudflare, Inc.` |
| server_as_class | STRING | The type of AS eg. `Transit/Access`, `Content` (for CDNs) or `Enterprise` |
| server_country | STRING | Autonomous system country, eg. `US` |
| server_organization | STRING | The IP organization, eg. `US` |
| server_organization | STRING | The IP organization, eg. `United Technical Services` |
| |
| **Received Fields** | | :warning: These fields differ between scan types |
| |
@@ -82,7 +83,7 @@ The json data is processed into a flat table format which looks like this.

We intend to add more columns in the future.

## Original Data Format
## Original Hyperquack Data Format

The Censored Planet data is stored in .json files with one measurement per line.
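
Since each line holds one measurement, a file can be read line by line without loading it whole. The following is a minimal sketch; the function name `iter_measurements` is hypothetical, and the sample lines are made up using only the `stateful_block` and `tag` fields that appear in this document's example.

```python
import json


def iter_measurements(lines):
    """Yield one parsed measurement per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up sample lines for illustration; real files contain many more fields.
sample = [
    '{"stateful_block": false, "tag": "2021-05-30T01:01:01"}',
    '',
    '{"stateful_block": true, "tag": "2021-05-30T01:01:01"}',
]
blocked = [m for m in iter_measurements(sample) if m["stateful_block"]]
print(len(blocked))
```

An open file object can be passed in place of `sample`, since iterating over a text file also yields one line at a time.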

@@ -139,4 +140,115 @@ Data from before 2021-04-25 is parsed from the [Hyperquack V1 format](https://gi
"stateful_block": false,
"tag": "2021-05-30T01:01:01"
}
```

## DNS Table Format

The DNS (Satellite) data includes the following alternative set of columns. (Many are identical to Hyperquack.)

| Field Name | Type | Contains |
| --------------------------- | ------------ | -------- |
| |
| **Measured Domain** |
| |
| domain | STRING | The domain being tested, eg. `example.com` |
| domain_is_control | STRING | Is the measured domain a control domain? |
| domain_category | STRING | The [category](domain_categories.md) of the domain being tested, eg. `Social Networking`, `None` if unknown |
| domain_controls_failed | BOOLEAN | Did the other control tests for this domain fail? |
| |
| **Time** |
| |
| date | DATE | Date that an individual measurement was taken |
| start_time | TIMESTAMP | Start time of the individual measurement |
| end_time | TIMESTAMP | End time of the individual measurement |
| retry | INTEGER | Number 0-N (usually not > 4) indicating which retry in the measurement_id set this roundtrip was. Domain control measurements have a retry index of `None`. |
| |
| **DNS Resolver** |
| |
| resolver_ip | STRING | The IP address of the resolver being tested, eg. `1.1.1.1` |
| resolver_netblock | STRING | Netblock of the IP, eg. `1.1.1.0/24` |
| resolver_name | STRING | The domain name of the resolver. ex: `ns2.tower.com.ar.` |
| resolver_is_trusted | BOOLEAN | Whether the resolver is considered a 'trusted' resolver, ie `1.1.1.1`, `8.8.8.8`, `9.9.9.9` |
| resolver_asn | INTEGER | Autonomous system number, eg. `13335` |
| resolver_as_name | STRING | Autonomous system short name, eg. `CLOUDFLARENET` |
| resolver_as_full_name | STRING | Autonomous system long name, eg. `Cloudflare, Inc.` |
| resolver_as_class | STRING | The type of AS eg. `Transit/Access`, `Content` (for CDNs) or `Enterprise` |
| resolver_country | STRING | Autonomous system country, eg. `US` |
| resolver_organization | STRING | The IP organization, eg. `United Technical Services` |
| |
| **DNS Resolver Properties** |
| |
| resolver_non_zero_rcode_rate | FLOAT | The rate of rcode errors returned by this resolver |
| resolver_private_ip_rate | FLOAT | The rate of private-use IPs (eg. `10.10.1.1`) returned by this resolver |
| resolver_zero_ip_rate | FLOAT | The rate of IP `0.0.0.0` returned by this resolver |
| resolver_connect_error_rate | FLOAT | The rate of connection errors returned by this resolver |
| resolver_invalid_cert_rate | FLOAT | The rate of invalid certificates returned by IP answers given by this resolver |
| |
| **DNS Responses** |
| |
| received_error | STRING | Any error received from the resolver |
| received_rcode | INTEGER | Any [RCode](https://datatracker.ietf.org/doc/html/rfc5395#section-2.3) response received from the resolver. In the case of an error this is `-1`. ex: `2` representing `SERVFAIL` |
| |
| **Analysis** | | These analysis fields are generally obsolete |
| |
| success | BOOLEAN | Did the overall test succeed? |
| anomaly | BOOLEAN | Was there an anomaly in the test responses? |
| average_confidence | FLOAT | The calculated average confidence in this resolver |
| untagged_controls | BOOLEAN | Are the IP controls for this test missing ASN metadata? |
| untagged_response | BOOLEAN | Are the responses for this test missing ASN metadata? |
| excluded | BOOLEAN | Should the test be excluded from analysis? |
| exclude_reason | STRING | The reason for the exclusion from analysis |
| has_type_a | BOOLEAN | Does the response contain a Type-A DNS record? |
| |
| **Internal** |
| |
| measurement_id | STRING | A uuid which is the same for observations which are part of the same measurement. </br> If there are 5 retries of a scan they will all have the same id. </br> eg. `a08df2fe70d54092916b8df87e330f47` |
| source | STRING | The name of the .tar.gz scan file this row came from. </br> eg. `CP_Satellite-2020-08-20-05-58-35` </br> Used internally and for debugging |
| |
| **Answers** | | Each answer represents an IP address answer received from the resolver, and subsequent metadata for that IP. |
| |
| answers | REPEATED STRUCT | |
| answers.ip | STRING | IP address received from the resolver eg. `1.2.3.4` |
| answers.asn | INTEGER | Autonomous system number for the received IP address eg. `13335` |
| answers.as_name | STRING | Name of the autonomous system eg. `CLOUDFLARENET` |
| answers.ip_organization | STRING | IP organization of the IP address eg. `United Technical Services` |
| answers.censys_http_body_hash | STRING | The hash of the HTTP body taken from Censys |
| answers.censys_ip_cert | STRING | The IP cert taken from Censys |
| |
| **Matches Control** | | Whether the metadata of the returned IP matches the expected metadata of a control measurement |
| |
| answers.matches_control | REPEATED RECORD | |
| answers.matches_control.ip | BOOLEAN | Whether the IP matches an expected control IP |
| answers.matches_control.censys_http_body_hash | BOOLEAN | Whether the HTTP body hash matches an expected control |
| answers.matches_control.censys_ip_cert | BOOLEAN | Whether the Censys IP cert matches a control |
| answers.matches_control.asn | BOOLEAN | Whether the ASN matches an expected control ASN |
| answers.matches_control.as_name | BOOLEAN | Whether the AS name matches an expected control AS name |
| answers.match_confidence | FLOAT | Value from 0-1. Confidence that this IP response matches a control measurement |
| |
| **HTTP Request** | | Metadata from the HTTP request made to the returned IP |
| |
| answers.http_error | STRING | Any received error, eg. `Network Timeout` |
| answers.http_response_status | STRING | The HTTP response status, eg. `301 Moved Permanently` |
| answers.http_response_headers | REPEATED STRING | Each HTTP header in the response eg. `Content-Type: text/html` |
| answers.http_response_body | STRING | The HTTP response body </br> eg. `<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD></HTML>` </br> Truncated to 64k. |
| answers.http_analysis_is_known_blockpage | BOOLEAN | True if the received page matches a blockpage, False if it matches a known false positive blockpage, None otherwise. |
| answers.http_analysis_page_signature | STRING | A string describing the matched page </br> ex: `a_prod_cisco` (a known blockpage) or `x_document_moved` (a known false positive). </br> To see the pattern a signature matches check [blockpage signatures](https://github.com/censoredplanet/censoredplanet-analysis/blob/master/pipeline/metadata/data/blockpage_signatures.json) or [false positive signatures](https://github.com/censoredplanet/censoredplanet-analysis/blob/master/pipeline/metadata/data/false_positive_signatures.json) |
| |
| **HTTPS Request** | | Metadata from the HTTPS request made to the returned IP |
| |
| answers.https_error | STRING | Any received error, eg. `TLS error` |
| answers.https_tls_version | INTEGER | The TLS version number eg. `771` (meaning TLS 1.2) |
| answers.https_tls_cipher_suite | STRING | The TLS cipher suite number </br> eg. `49199` (meaning TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) |
| answers.https_tls_cert | BYTES | The TLS certificate eg. `MIIG1DCCBb...` |
| answers.https_tls_cert_common_name | STRING | Common name of the TLS certificate eg. `example.com` |
| answers.https_tls_cert_issuer | STRING | Issuer of the TLS certificate eg. `Verisign` |
| answers.https_tls_cert_start_date | TIMESTAMP | The issue date of the certificate |
| answers.https_tls_cert_end_date | TIMESTAMP | The expiration date of the certificate |
| answers.https_tls_cert_alternative_names | REPEATED STRING | Alternative names from the TLS certificate eg. `www.example.com` |
| answers.https_tls_cert_has_trusted_ca | BOOLEAN | Whether the issuing CA was trusted by the [Mozilla root CA list](https://wiki.mozilla.org/CA/Included_Certificates) when the request was made |
| answers.https_tls_cert_matches_domain | BOOLEAN | Whether the certificate is valid for the test domain |
| answers.https_response_status | STRING | The HTTP response status, eg. `301 Moved Permanently` |
| answers.https_response_headers | REPEATED STRING | Each HTTP header in the response eg. `Content-Type: text/html` |
| answers.https_response_body | STRING | The HTTP response body </br> eg. `<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD></HTML>` </br> Truncated to 64k. |
| answers.https_analysis_is_known_blockpage | BOOLEAN | True if the received page matches a blockpage, False if it matches a known false positive blockpage, None otherwise. |
| answers.https_analysis_page_signature | STRING | A string describing the matched page </br> ex: `a_prod_cisco` (a known blockpage) or `x_document_moved` (a known false positive). </br> To see the pattern a signature matches check [blockpage signatures](https://github.com/censoredplanet/censoredplanet-analysis/blob/master/pipeline/metadata/data/blockpage_signatures.json) or [false positive signatures](https://github.com/censoredplanet/censoredplanet-analysis/blob/master/pipeline/metadata/data/false_positive_signatures.json) |
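
As a rough illustration of how the nested `answers` field might be consumed, the sketch below walks a single row represented as a plain Python dict. The helper names and the sample row are hypothetical; the field names come from the table above, and the RCODE names are the standard DNS values.

```python
# Standard DNS RCODE names for the most common values.
RCODE_NAMES = {
    0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
    3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED",
}


def describe_rcode(rcode):
    """Map received_rcode to a readable name; -1 marks an error per the table."""
    if rcode == -1:
        return "ERROR"
    return RCODE_NAMES.get(rcode, f"RCODE{rcode}")


def blockpage_signatures(row):
    """Return (ip, protocol, signature) for answers flagged as known blockpages."""
    found = []
    for answer in row.get("answers") or []:
        for proto in ("http", "https"):
            # The flag is True, False or None; only True means a known blockpage.
            if answer.get(f"{proto}_analysis_is_known_blockpage") is True:
                signature = answer.get(f"{proto}_analysis_page_signature")
                found.append((answer["ip"], proto, signature))
    return found


# Made-up row for illustration only.
row = {
    "domain": "example.com",
    "received_rcode": 0,
    "answers": [
        {"ip": "1.2.3.4",
         "http_analysis_is_known_blockpage": True,
         "http_analysis_page_signature": "a_prod_cisco",
         "https_analysis_is_known_blockpage": None},
    ],
}
print(describe_rcode(row["received_rcode"]), blockpage_signatures(row))
```

Checking `is True` rather than truthiness keeps the three-valued flag explicit: `False` (a known false positive) and `None` (no match) are both skipped.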
15 changes: 15 additions & 0 deletions docs/diagrams/dns.msc
@@ -0,0 +1,15 @@
msc {
probe,remote;

probe box probe [label="Initial Test Setup"];
probe=>remote [ label = "Write DNS Query over UDP" ];
probe<=remote [ label = "Read Response" ];
probe box probe [label="Validate IP Match"];
probe=>remote [ label = "Query Non-matching IPs for Domain over HTTPS" ];
probe<=remote [ label = "Read Responses" ];
probe box probe [label="Validate response certificate for domain"];
probe=>remote [ label = "Query Non-matching IPs for Domain over HTTP" ];
probe<=remote [ label = "Read Responses" ];
probe box probe [label="Check response content"];
probe box probe [label="Test Complete"];
}
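
The "Validate IP Match" step in the diagram decides which returned IPs get the follow-up HTTPS and HTTP requests. A minimal sketch of that selection, assuming a match means the answer IP appears in a set of control IPs; the function name `ips_needing_followup` is hypothetical and this is not the probe's actual implementation.

```python
import ipaddress


def ips_needing_followup(answer_ips, control_ips):
    """Return answer IPs that match no control IP, de-duplicated, in order."""
    control = {ipaddress.ip_address(ip) for ip in control_ips}
    seen = set()
    result = []
    for ip in answer_ips:
        parsed = ipaddress.ip_address(ip)  # normalizes textual variants
        if parsed not in control and parsed not in seen:
            seen.add(parsed)
            result.append(str(parsed))
    return result


# Made-up addresses for illustration.
print(ips_needing_followup(["1.2.3.4", "10.10.1.1", "1.2.3.4"], ["10.10.1.1"]))
```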
