diff --git a/docs/base_tables.md b/docs/base_tables.md
index 5df31f9b..3fe697f8 100644
--- a/docs/base_tables.md
+++ b/docs/base_tables.md
@@ -11,14 +11,15 @@ There is one table for each scan type.
 - `firehook-censoredplanet:base.discard_scan`
 - `firehook-censoredplanet:base.http_scan`
 - `firehook-censoredplanet:base.https_scan`
+- `firehook-censoredplanet:base.satellite_scan`
 
 ## Partitioning and Clustering
 
 The tables are time-partitioned along the `date` field.
 
-The tables are clustered along the `country` and then `asn` fields.
+The tables are clustered along the `[server|resolver]_country` and then `[server|resolver]_asn` fields.
 
-## Table Format
+## Hyperquack Table Format
 
 The json data is processed into a flat table format which looks like this.
 
@@ -47,7 +48,7 @@ The json data is processed into a flat table format which looks like this.
 | server_as_full_name | STRING | Autonomous system long name, eg. `Cloudflare, Inc.` |
 | server_as_class | STRING | The type of AS eg. `Transit/Access`, `Content` (for CDNs) or `Enterprise` |
 | server_country | STRING | Autonomous system country, eg. `US` |
-| server_organization | STRING | The IP organization, eg. `US` |
+| server_organization | STRING | The IP organization, eg. `United Technical Services` |
 | |
 | **Received Fields** | | :warning: These fields differ between scan types |
 | |
@@ -82,7 +83,7 @@ The json data is processed into a flat table format which looks like this.
 
 We intend to add more columns in the future.
 
-## Original Data Format
+## Original Hyperquack Data Format
 
 The Censored Planet data is stored in .json files with one measurement per line.
 
@@ -139,4 +140,115 @@ Data from before 2021-04-25 is parsed from the [Hyperquack V1 format](https://gi
 "stateful_block": false,
 "tag": "2021-05-30T01:01:01"
 }
-```
\ No newline at end of file
+```
+
+## DNS Table Format
+
+The DNS (Satellite) data includes the following alternative set of columns. (Many are identical to Hyperquack.)
+
+| Field Name | Type | Contains |
+| --------------------------- | ------------ | -------- |
+| |
+| **Measured Domain** |
+| |
+| domain | STRING | The domain being tested, eg. `example.com` |
+| domain_is_control | STRING | Is the measured domain a control domain? |
+| domain_category | STRING | The [category](domain_categories.md) of the domain being tested, eg. `Social Networking`, `None` if unknown |
+| domain_controls_failed | BOOLEAN | Did the other control tests for this domain fail? |
+| |
+| **Time** |
+| |
+| date | DATE | Date that an individual measurement was taken |
+| start_time | TIMESTAMP | Start time of the individual measurement |
+| end_time | TIMESTAMP | End time of the individual measurement |
+| retry | INTEGER | Number 0-N (usually not > 4) indicating which retry in the measurement_id set this roundtrip was. Domain control measurements have a retry index of `None`. |
+| |
+| **DNS Resolver** |
+| |
+| resolver_ip | STRING | The IP address of the resolver being tested, eg. `1.1.1.1` |
+| resolver_netblock | STRING | Netblock of the IP, eg. `1.1.1.0/24` |
+| resolver_name | STRING | The domain name of the resolver. ex: `ns2.tower.com.ar.` |
+| resolver_is_trusted | BOOLEAN | Whether the resolver is considered a 'trusted' resolver, ie '1.1.1.1', '8.8.8.8', '9.9.9.9' |
+| resolver_asn | INTEGER | Autonomous system number, eg. `13335` |
+| resolver_as_name | STRING | Autonomous system short name, eg. `CLOUDFLARENET` |
+| resolver_as_full_name | STRING | Autonomous system long name, eg. `Cloudflare, Inc.` |
+| resolver_as_class | STRING | The type of AS eg. `Transit/Access`, `Content` (for CDNs) or `Enterprise` |
+| resolver_country | STRING | Autonomous system country, eg. `US` |
+| resolver_organization | STRING | The IP organization, eg. `United Technical Services` |
+| |
+| **DNS Resolver Properties** |
+| |
+| resolver_non_zero_rcode_rate | FLOAT | The rate of rcode errors returned by this resolver |
+| resolver_private_ip_rate | FLOAT | The rate of private-use IPs (eg. `10.10.1.1`) returned by this resolver |
+| resolver_zero_ip_rate | FLOAT | The rate of IP `0.0.0.0` returned by this resolver |
+| resolver_connect_error_rate | FLOAT | The rate of connection errors returned by this resolver |
+| resolver_invalid_cert_rate | FLOAT | The rate of invalid certificates returned by IP answers given by this resolver |
+| |
+| **DNS Responses** |
+| |
+| received_error | STRING | Any error received from the resolver |
+| received_rcode | INTEGER | Any [RCode](https://datatracker.ietf.org/doc/html/rfc5395#section-2.3) response received from the resolver. In the case of an error this is `-1`. ex: `2` representing `SERVFAIL` |
+| |
+| **Analysis** | These analysis fields are generally obsolete |
+| |
+| success | BOOLEAN | Did the overall test succeed? |
+| anomaly | BOOLEAN | Was there an anomaly in the test responses? |
+| average_confidence | FLOAT | The calculated average confidence in this resolver |
+| untagged_controls | BOOLEAN | Are the IP controls for this test missing ASN metadata? |
+| untagged_response | BOOLEAN | Are the responses for this test missing ASN metadata? |
+| excluded | BOOLEAN | Should the test be excluded from analysis? |
+| exclude_reason | STRING | The reason for the exclusion from analysis |
+| has_type_a | BOOLEAN | Does the response contain a Type-A DNS record? |
+| |
+| **Internal** |
+| |
+| measurement_id | STRING | A uuid which is the same for observations which are part of the same measurement.
If there are 5 retries of a scan they will all have the same id.
eg. `a08df2fe70d54092916b8df87e330f47` |
+| source | STRING | The name of the .tar.gz scan file this row came from.
eg. `CP_Satellite-2020-08-20-05-58-35`
Used internally and for debugging |
+| |
+| **Answers** | Each answer represents an IP address answer received from the resolver, and subsequent metadata for that IP. |
+| |
+| answers | REPEATED STRUCT | |
+| answers.ip | STRING | IP address received from the resolver eg. `1.2.3.4` |
+| answers.asn | INTEGER | Autonomous system number for the received IP address eg. `13335` |
+| answers.as_name | STRING | Name of the autonomous system eg. `CLOUDFLARENET` |
+| answers.ip_organization | STRING | IP organization of the IP address eg. `United Technical Services` |
+| answers.censys_http_body_hash | STRING | The hash of the HTTP body taken from Censys |
+| answers.censys_ip_cert | STRING | The IP cert taken from Censys |
+| |
+| **Matches Control** | Whether the metadata of the returned IP matches the expected metadata of a control measurement |
+| |
+| answers.matches_control | REPEATED RECORD | |
+| answers.matches_control.ip | BOOLEAN | Whether the IP matches an expected control IP |
+| answers.matches_control.censys_http_body_hash | BOOLEAN | Whether the HTTP body hash matches an expected control |
+| answers.matches_control.censys_ip_cert | BOOLEAN | Whether the Censys IP cert matches a control |
+| answers.matches_control.asn | BOOLEAN | Whether the ASN matches an expected control ASN |
+| answers.matches_control.as_name | BOOLEAN | Whether the AS name matches an expected control AS name |
+| answers.match_confidence | FLOAT | Value from 0-1. Confidence that this IP response matches a control measurement |
+| |
+| **HTTP Request** | Metadata from the HTTP request made to the returned IP |
+| |
+| answers.http_error | STRING | Any received error, eg. `Network Timeout` |
+| answers.http_response_status | STRING | The HTTP response status, eg. `301 Moved Permanently` |
+| answers.http_response_headers | REPEATED STRING | Each HTTP header in the response eg. `Content-Type: text/html` |
+| answers.http_response_body | STRING | The HTTP response body
eg. `\nAccess Denied\n`
Truncated to 64k. |
+| answers.http_analysis_is_known_blockpage | BOOLEAN | True if the received page matches a blockpage, False if it matches a known false positive blockpage, None otherwise. |
+| answers.http_analysis_page_signature | STRING | A string describing the matched page
ex: `a_prod_cisco` (a known blockpage) or `x_document_moved` (a known false positive).
To see the pattern a signature matches check [blockpage signatures](https://github.com/censoredplanet/censoredplanet-analysis/blob/master/pipeline/metadata/data/blockpage_signatures.json) or [false positive signatures](https://github.com/censoredplanet/censoredplanet-analysis/blob/master/pipeline/metadata/data/false_positive_signatures.json) |
+| |
+| **HTTPS Request** | Metadata from the HTTPS request made to the returned IP |
+| |
+| answers.https_error | STRING | Any received error, eg. `TLS error` |
+| answers.https_tls_version | INTEGER | The TLS version number eg. `771` (meaning TLS 1.2) |
+| answers.https_tls_cipher_suite | STRING | The TLS cipher suite number
eg. `49199` (meaning TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) |
+| answers.https_tls_cert | BYTES | The TLS certificate eg. `MIIG1DCCBb...` |
+| answers.https_tls_cert_common_name | STRING | Common name of the TLS certificate eg. `example.com` |
+| answers.https_tls_cert_issuer | STRING | Issuer of the TLS certificate eg. `Verisign` |
+| answers.https_tls_cert_start_date | TIMESTAMP | The issue date of the certificate |
+| answers.https_tls_cert_end_date | TIMESTAMP | The expiration date of the certificate |
+| answers.https_tls_cert_alternative_names | REPEATED STRING | Alternative names from the TLS certificate eg. `www.example.com` |
+| answers.https_tls_cert_has_trusted_ca | BOOLEAN | Whether the issuing CA was trusted by the [Mozilla root CA list](https://wiki.mozilla.org/CA/Included_Certificates) when the request was made |
+| answers.https_tls_cert_matches_domain | BOOLEAN | Whether the certificate is valid for the test domain |
+| answers.https_response_status | STRING | The HTTP response status, eg. `301 Moved Permanently` |
+| answers.https_response_headers | REPEATED STRING | Each HTTP header in the response eg. `Content-Type: text/html` |
+| answers.https_response_body | STRING | The HTTP response body
eg. `\nAccess Denied\n`
Truncated to 64k. |
+| answers.https_analysis_is_known_blockpage | BOOLEAN | True if the received page matches a blockpage, False if it matches a known false positive blockpage, None otherwise. |
+| answers.https_analysis_page_signature | STRING | A string describing the matched page
ex: `a_prod_cisco` (a known blockpage) or `x_document_moved` (a known false positive).
To see the pattern a signature matches check [blockpage signatures](https://github.com/censoredplanet/censoredplanet-analysis/blob/master/pipeline/metadata/data/blockpage_signatures.json) or [false positive signatures](https://github.com/censoredplanet/censoredplanet-analysis/blob/master/pipeline/metadata/data/false_positive_signatures.json) |
\ No newline at end of file
diff --git a/docs/diagrams/dns.msc b/docs/diagrams/dns.msc
new file mode 100644
index 00000000..72cf55e5
--- /dev/null
+++ b/docs/diagrams/dns.msc
@@ -0,0 +1,15 @@
+msc {
+  probe,remote;
+
+  probe box probe [label="Initial Test Setup"];
+  probe=>remote [ label = "Write DNS Query over UDP" ];
+  probe<=remote [ label = "Read Response" ];
+  probe box probe [label="Validate IP Match"];
+  probe=>remote [ label = "Query Non-matching IPs for Domain over HTTPS" ];
+  probe<=remote [ label = "Read Responses" ];
+  probe box probe [label="Validate response certificate for domain"];
+  probe=>remote [ label = "Query Non-matching IPs for Domain over HTTP" ];
+  probe<=remote [ label = "Read Responses" ];
+  probe box probe [label="Check response content"];
+  probe box probe [label="Test Complete"];
+}
\ No newline at end of file
diff --git a/docs/diagrams/dns.svg b/docs/diagrams/dns.svg
new file mode 100644
index 00000000..1a79c4ce
--- /dev/null
+++ b/docs/diagrams/dns.svg
@@ -0,0 +1,123 @@
[Figure: dns.svg, sequence diagram between `probe` and `remote`. Labels: Initial Test Setup; Write DNS Query over UDP; Read Response; Validate IP Match; Query Non-matching IPs for Domain over HTTPS; Read Responses; Validate response certificate for domain; Query Non-matching IPs for Domain over HTTP; Read Responses; Check response content; Test Complete]
diff --git a/docs/merged_reduced_scans_table.md b/docs/merged_reduced_scans_table.md
index 9aa456e5..df626fcc 100644
--- a/docs/merged_reduced_scans_table.md
+++ b/docs/merged_reduced_scans_table.md
@@ -38,4 +38,5 @@ Reduced Scans
 | outcome | STRING | What was the [outcome](outcome.md) of the individual measurement eg `read/timeout` |
 | count | INTEGER | How many measurements fit the exact pattern of this row? |
 | unexpected_count | INTEGER | Count of measurements with an unexpected outcome |
-
+| hostname | STRING | The domain name of the DNS resolver. (Only used in DNS) eg. `ns1.uts.ae` |
+| reg_hostname | STRING | The domain name of the DNS resolver without subdomains. (Only used in DNS) eg. `uts.ae` |
diff --git a/docs/outcome.md b/docs/outcome.md
index fed39cc6..ec67c359 100644
--- a/docs/outcome.md
+++ b/docs/outcome.md
@@ -40,6 +40,10 @@ Not all tests include every stage depending on the type of test. For example sin
 
 ![https connection diagram](diagrams/https.svg)
 
+##### DNS
+
+![dns connection diagram](diagrams/dns.svg)
+
 ### Outcome Classes
 
 Basic outcomes represent simplest types of errors, as well as the `match` case (no error detected).
@@ -48,32 +52,68 @@ Protocol errors are similar but not identical to the Network Error Logging stand
 
 Mismatch Errors are used when the connection is successful, but the content received does not match the content expected. This can happen in the case of blockpages, remote servers with unusual behavior, or complicated CDN networks serving many sites.
 
-| Outcome Class | Explanation |
-| ----------------------- | ----------- |
+| Outcome Class | Additional Outcome Information Included | Explanation |
+| ----------------------- | --------------------------------------- | ----------- |
 | |
 | **Basic Outcomes** |
 | |
-| match | The test completed successfully and no interference was detected |
-| system_failure | There was a test system failure, rendering the test invalid |
-| unknown | The class of the outcome was not known. Usually these are new errors which should be investigated and classified |
+| match | | The test completed successfully and no interference was detected |
+| system_failure | | There was a test system failure, rendering the test invalid |
+| unknown | | The class of the outcome was not known. Usually these are new errors which should be investigated and classified |
 | |
-| **Protocol Errors** | There were errors in the connection protocol |
+| **Protocol Errors** | | There were errors in the connection protocol |
 | |
-| ip.network_unreachable | The network was unreachable |
-| ip.host_no_route | No route to the host could be found |
-| timeout | The connection timed out. Could indicate packets being dropped by a middlebox |
-| tcp.refused | The TCP connection was refused by the server |
-| tcp.reset | The TCP connection was reset. Could indicate an inserted `RST` packet |
-| tls.failed | The TLS connection failed, usually due to a TLS protocol error |
-| http.invalid | The HTTP response could not be parsed. Possibly because the response has a content-length mismatch, has improper encoding, or other conditions |
-| http.empty | Received no content when HTTP content was expected |
-| http.truncated_response | The HTTP response content was unexpectedly truncated |
+| ip.network_unreachable | | The network was unreachable |
+| ip.host_no_route | | No route to the host could be found |
+| timeout | | The connection timed out. Could indicate packets being dropped by a middlebox |
+| tcp.refused | | The TCP connection was refused by the server |
+| tcp.reset | | The TCP connection was reset. Could indicate an inserted `RST` packet |
+| tls.failed | | The TLS connection failed, usually due to a TLS protocol error |
+| http.invalid | | The HTTP response could not be parsed. Possibly because the response has a content-length mismatch, has improper encoding, or other conditions |
+| http.empty | | Received no content when HTTP content was expected |
+| http.truncated_response | | The HTTP response content was unexpectedly truncated |
 | |
-| **Mismatched Content** | The connection completed successfully, but the content returned didn't match the content expected for the domain. |
+| **Mismatched Content** | | The connection completed successfully, but the content returned didn't match the content expected for the domain. |
 | |
-| mismatch | Received a different response from the one expected.
For Discard no response is expected and any response is a mismatch,
for Echo a mirrored response is expected and anything else is a mismatch.
For HTTP/S the expected response is determined by sending multiple control domains to the server and building an expected template. This response is returned when more detail about the exact part of the template mismatched (eg. status, body) is not available. |
-| status_mismatch | The HTTP status code didn't match, eg. `403` instead of `200` |
-| body_mismatch | The HTTP body didn't match, potentially a blockpage |
-| tls_mismatch | An element of the TLS connection (certificate, cipher suite, or TLS version) didn't match |
-| blockpage | The response was unexpected and matched a [known blockpage]((https://github.com/censoredplanet/censoredplanet-analysis/blob/master/pipeline/metadata/data/blockpage_signatures.json)) |
-| trusted_host | The response didn't match the expected response for the template. But it did match a common known server pattern, and is likely not censorship. This outcome is used for CDNs that respond in network-specific ways to domains they host. |
\ No newline at end of file
+| mismatch | | Received a different response from the one expected.
For Discard no response is expected and any response is a mismatch,
for Echo a mirrored response is expected and anything else is a mismatch.
For HTTP/S the expected response is determined by sending multiple control domains to the server and building an expected template. This response is returned when more detail about the exact part of the template mismatched (eg. status, body) is not available. |
+| status_mismatch | `:http_status_code` eg. `:301` | The HTTP status code didn't match, eg. `403` instead of `200` |
+| body_mismatch | | The HTTP body didn't match, potentially a blockpage |
+| tls_mismatch | | An element of the TLS connection (certificate, cipher suite, or TLS version) didn't match |
+| blockpage | `` eg. `:b_nat_ir_national_1` | The response was unexpected and matched a [known blockpage](https://github.com/censoredplanet/censoredplanet-analysis/blob/master/pipeline/metadata/data/blockpage_signatures.json) |
+| trusted_host | `:` eg. `:akamai` | The response didn't match the expected response for the template. But it did match a common known server pattern, and is likely not censorship. This outcome is used for CDNs that respond in network-specific ways to domains they host. |
+
+## DNS Outcomes
+
+The Satellite data uses its own unique set of outcomes, and does not use stages. The outcomes are based on DNS errors and POSIX TCP/IP socket return codes.
+
+| Outcome | Additional Outcome Information Included | Explanation |
+| ---------------------- | --------------------------------------- | ----------- |
+| |
+| **DNS Failures** | | The DNS request failed |
+| |
+| ❗️dns.error | `:` eg. `:NXDomain` | The DNS request returned an [RCode](https://datatracker.ietf.org/doc/html/rfc5395#section-2.3) error |
+| ❗️dns.connrefused | | The DNS request was refused |
+| ❗️dns.timedout | | The DNS request timed out |
+| ❗️dns.msgsize | | The DNS request returned a message size error |
+| ❗️dns.protocol_error | | The DNS request failed with a protocol error |
+| ❔dns.hostunreach | | The DNS resolver was unreachable |
+| |
+| **DNS Response Analysis** | | The DNS request responded successfully. Outcome is based on the returned content/IP addresses |
+| |
+| ✅ip.matchip | | The DNS request returned an expected (matching) IP address for the domain |
+| ✅ip.matchasn | | The DNS request returned an IP address matching the ASN of an expected IP address |
+| ❗️ip.invalid | One of `:zero`, `:local_host`, `:local_net` | The DNS request returned an IP that could never be valid. eg. `0.0.0.0`, `127.0.0.1`, `10.10.0.0`, `172.16.0.0` |
+| ❗️ip.empty | | The DNS request returned an empty response |
+| |
+| **HTTP/S Response Analysis** | | The DNS request successfully returned IP addresses. Analysis is based on follow-up HTTP/S requests sent to those IP addresses requesting the test domain |
+| |
+| ❗️tls.connerror | `:` eg. `:ERTELECOM_DS_AS`
or `:AS` eg. `:AS15169`
or `:missing_as_info` | Attempting to connect to all returned IP addresses over HTTPS failed |
+| ✅tls.validcert | | An HTTPS connection to a returned IP address returned a valid certificate which matched the expected test domain |
+| ❗️tls.baddomain | `:` eg. `:dnsfilter.net` | An HTTPS connection to a returned IP returned a certificate for an unexpected domain. Could indicate a MITM attempt |
+| ❗️tls.badca | `:` eg. `:Fortiguard SDNS Blocked Page` | An HTTPS connection to a returned IP returned an invalid certificate. Could indicate a MITM attempt |
+| ❗️http.blockpage | `:` eg. `:f_gen_id_1_satellite` | An HTTPS request to the returned IP address failed, but an HTTP request returned a [known blockpage](https://github.com/censoredplanet/censoredplanet-analysis/blob/master/pipeline/metadata/data/blockpage_signatures.json) |
+| |
+| **Other Errors** |
+| |
+| ❔setup.system_failure | | There was a test system failure, rendering the test invalid |
+| ❗️unknown_error | | An unknown error occurred |
\ No newline at end of file
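
Reviewer note: the outcome tables in this change describe an outcome class with optional additional information, and `merged_reduced_scans_table.md` shows a staged example, `read/timeout`. A minimal sketch of how a consumer might split such a string is given here. It assumes the stored value has the shape `[stage/]class[:info]`, which is inferred from the examples in the tables (eg. `:NXDomain`, `:301`) and is not confirmed by the pipeline code. `ParsedOutcome` and `parse_outcome` are hypothetical names, not part of the repository.

```python
from typing import NamedTuple, Optional


class ParsedOutcome(NamedTuple):
    """Hypothetical container for the parts of an outcome string."""
    stage: Optional[str]   # eg. "read" for Hyperquack outcomes, None for DNS outcomes
    outcome_class: str     # eg. "dns.error" or "status_mismatch"
    info: Optional[str]    # eg. "NXDomain" or "301", None when no info is attached


def parse_outcome(outcome: str) -> ParsedOutcome:
    """Split an outcome string, assuming the shape [stage/]class[:info]."""
    # Split on the first ":" only, since the info part may itself contain
    # spaces or punctuation (eg. "Fortiguard SDNS Blocked Page").
    head, sep, tail = outcome.partition(":")
    stage: Optional[str] = None
    # Only look for a stage separator before the ":" so that a "/" inside
    # the info part is never mistaken for one.
    if "/" in head:
        stage, _, head = head.partition("/")
    return ParsedOutcome(stage, head, tail if sep else None)


if __name__ == "__main__":
    for example in ("read/timeout", "dns.error:NXDomain", "ip.matchip"):
        print(example, "->", parse_outcome(example))
```

Splitting on the first `:` keeps values such as `tls.badca:Fortiguard SDNS Blocked Page` intact. If the real pipeline encodes outcomes differently, only the two `partition` calls would need to change.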