GitHub - e2ecensor/Disguiser_public: Disguiser measures and investigates global censorship activities and its deployment through an end-to-end framework that enables ground truth for automatic and accurate censorship detection.

Disguiser: End-to-End Framework for Measuring Censorship with Ground Truth

High-level Ideas

Detecting Internet censorship usually requires heavy manual inspection because ground truth is lacking, which makes it difficult to identify false positives (i.e., misclassified censorship) and false negatives (i.e., undetected censorship). Without ground truth, it is often impossible to automatically distinguish legitimate responses from responses manipulated by censorship. Existing studies tackled this issue by retrieving and comparing responses collected from distributed locations, but that approach usually requires manual inspection, which makes the analysis unscalable and inefficient.

The project aims to explore, develop, and deploy a framework that enables end-to-end measurement for accurately and automatically investigating global Internet censorship practices. The key idea is to provide a static payload as ground truth, which can be used to indicate the occurrence of censorship when the static payload has been altered by network devices. Moreover, the deployed end-to-end framework can facilitate extended measurements for investigating more aspects of Internet censorship, for example, pinpointing censor devices’ locations and exploring their policies and deployment.
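The key idea above can be illustrated with a minimal sketch: because the expected payload is fixed and known in advance, a response can be checked against it automatically. The page content, the function name `classify_response`, and the three labels are assumptions made for illustration only; they are not taken from the repository.

```python
import hashlib
from typing import Optional

# Hypothetical ground-truth page; the real page served by the control
# server is not reproduced in this README.
GROUND_TRUTH = b"<html><body>This page is part of a measurement experiment.</body></html>"


def classify_response(received: Optional[bytes], expected: bytes = GROUND_TRUTH) -> str:
    """Compare a received payload with the known static payload.

    Returns "match" when the payload arrived unmodified, "altered" when the
    content differs from the ground truth, and "no_response" when nothing
    was received (which may be censorship or an ordinary network failure).
    """
    if received is None:
        return "no_response"
    if hashlib.sha256(received).digest() == hashlib.sha256(expected).digest():
        return "match"
    return "altered"
```

A "no_response" result is kept separate from "altered" in this sketch because a timeout alone does not prove that a network device modified the payload.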

Details of the framework and a comprehensive measurement study of global censorship can be found in our ACM SIGMETRICS’22 paper.

Notes for Data and Code

Relevant Datasets:

Alexa List: Amazon's Alexa Top-site list. In our experiments, we use Alexa’s top 1,000 domains as the popular domain list.
Citizen Lab List: We also test sensitive domains using the test lists provided by Citizen Lab. The Citizen Lab offers two types of test lists: a global test list and country-specific test lists for certain countries. For each country, we combine the country-specific test list with the popular list and the global test list to form that country's domain list. The up-to-date lists can be accessed at https://github.com/citizenlab/test-lists/.
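As a rough sketch of how the per-country domain list described above could be assembled, the following merges the three sources and removes duplicates while preserving order. It assumes the domains have already been extracted from the list files into plain strings; the function name `build_country_list` and the example domains are hypothetical.

```python
from typing import Iterable, List


def build_country_list(
    popular: Iterable[str],
    global_list: Iterable[str],
    country_specific: Iterable[str],
) -> List[str]:
    """Merge the popular, global, and country-specific lists without duplicates."""
    seen = set()
    merged = []
    for source in (popular, global_list, country_specific):
        for domain in source:
            # Normalize case, whitespace, and a trailing dot before comparing.
            normalized = domain.strip().lower().rstrip(".")
            if normalized and normalized not in seen:
                seen.add(normalized)
                merged.append(normalized)
    return merged
```

For example, `build_country_list(["Example.com"], ["example.com", "news.example"], ["local.example"])` yields a three-entry list in which `example.com` appears only once.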

Experiments Data:

The datasets that are collected by our framework (and those used in the aforementioned paper) can be obtained here.

Vantage Points:

SOCKS proxies: We use residential proxies to issue TCP-based DNS queries and HTTP/HTTPS queries through the SOCKS proxies. In our study, we signed up for ProxyRack.
RIPE Atlas: We use RIPE Atlas to conduct UDP-based DNS tests to complement the results of TCP-based measurement from SOCKS proxies.
VPN: We use VPN vantage points to conduct application traceroute and investigate the deployment of censors. A VPN server must meet two additional requirements to carry out such an experiment: (1) the VPN server and its default gateway must not alter the TTL values of our packets, so that intermediate routers process the packets according to the TTL values we set, and (2) the VPN server must be physically located in the country it advertises. The pinpoint_censor results show examples from some countries where we identified VPN servers satisfying the above requirements.
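The reasoning behind application traceroute can be sketched as follows: the same request is sent with increasing TTL values, and the smallest TTL at which a censored response appears indicates how many hops away the censoring device sits. The sketch below covers only this decision step and assumes the per-TTL outcomes have already been collected; the outcome labels and the function name `locate_censor_hop` are hypothetical and may differ from what pinpoint_censor.py actually does.

```python
from typing import Dict, Optional


def locate_censor_hop(observations: Dict[int, str]) -> Optional[int]:
    """Return the smallest TTL at which a censored response was observed.

    `observations` maps each probed TTL to an outcome label. Only the
    (assumed) label "censored" marks a manipulated response; any other label,
    such as "ttl_exceeded" or "ground_truth", is treated as not censored.
    Returns None when no probe triggered censorship.
    """
    censored_ttls = [ttl for ttl, outcome in observations.items() if outcome == "censored"]
    return min(censored_ttls) if censored_ttls else None
```

This is also why the two VPN requirements matter: if the VPN server or its gateway rewrites TTL values, the hop count returned here no longer corresponds to a real router on the path.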

Backend Server Setup:

The backend control server can be any ordinary web server that accepts HTTP(S) requests. In our experiment, we configure a static HTML page as ground truth that only states the purpose of our experiment, and return it for all incoming requests.
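For illustration, a control server of this kind could be approximated with Python's standard library alone. This is a minimal sketch, not the deployment used in the study: the page text, the handler name `GroundTruthHandler`, and the helper `make_server` are assumptions, it handles only GET and HEAD, and it serves plain HTTP without TLS.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical static page; the real ground-truth page is not shown here.
GROUND_TRUTH_PAGE = (
    b"<!DOCTYPE html><html><body>"
    b"This server is part of a network measurement experiment."
    b"</body></html>"
)


class GroundTruthHandler(BaseHTTPRequestHandler):
    """Return the same static page for every requested path."""

    def _send_page(self, include_body: bool = True) -> None:
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(GROUND_TRUTH_PAGE)))
        self.end_headers()
        if include_body:
            self.wfile.write(GROUND_TRUTH_PAGE)

    def do_GET(self) -> None:
        self._send_page()

    def do_HEAD(self) -> None:
        self._send_page(include_body=False)


def make_server(host: str = "127.0.0.1", port: int = 8080) -> ThreadingHTTPServer:
    """Create (but do not start) a server bound to the given address."""
    return ThreadingHTTPServer((host, port), GroundTruthHandler)
```

Calling `make_server().serve_forever()` would start serving; because the handler ignores the request path and the Host header, any domain under test that is directed to this server receives the identical payload.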

Code repository: (The files described below implement the core functions needed to run the experiments. The others handle specific processing or analysis.)

build_domain_webpage.py: extracts the title and landing page of the sensitive domains in the test list and stores the output in two separate files.
pinpoint_censor.py: performs application traceroute over the HTTP protocol to pinpoint the specific router where the censor is located.
proxy_request.py: defines the data format of the responses collected from censorship measurements over the DNS, HTTP, and HTTPS protocols.
proxyrack.py: conducts HTTP experiments through the distributed residential proxies provided by the ProxyRack platform, receiving either the static payload from the control server or a censored response.
proxyrack_client.py: defines the rules/thresholds used to obtain as many vantage points as possible from the proxy platforms used in this experiment.
setup.py: stores confidential information used by the other files.

