🎈 Privacy Bot
Please join our Gitter channel to start discussing the project! For more information on how to contribute, we encourage you to have a look at this README, and to check the issues.
Privacy policies are a legal requirement for websites handling users' data. Anyone should therefore be able to access them, read them, and understand what using a given service costs in terms of privacy. Except no one reads them:
- A lot of people don't even know what they are.
- Privacy policies are "legal documents", and it takes a specific set of skills to comprehend them.
- They are long, and reading them is time-consuming (the median length is ~2500 words).
In short, they usually aren't designed for people to read and understand. But still, the content of these policies is very important to anyone's privacy, for this is where you should learn what private data you agree to give away.
Privacy Bot is a project that aims to address these issues. If privacy policies are not meant for humans, then perhaps we can design a bot to do the heavy lifting for us automatically. The high-level goals of the project are to:
- 💾 Automatically fetch and store them in a central place (e.g. in a GitHub repository, which will give us diffs on updates for free).
- 🔍 Analyze them to extract a summary of what private data is shared, and with whom.
- 👀 Stay up-to-date by monitoring updates.
- 📦 Make all the policies available in a central repository, in a usable data format that people can build upon (e.g. building a browser extension to show the summary on any visited website, creating a Twitter bot to communicate facts and updates about policies, etc.).
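As an illustration of the "monitoring updates" goal, here is a minimal sketch of how two stored versions of a policy could be compared using Python's standard `difflib` module. The function names `policy_diff` and `has_changed` are hypothetical and are not part of the Privacy Bot code base; storing policies in a git repository would give similar diffs without any extra code.

```python
import difflib
from typing import List


def policy_diff(old_text: str, new_text: str) -> List[str]:
    """Return unified-diff lines between two versions of a policy.

    Hypothetical helper for illustration only, not Privacy Bot's API.
    """
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


def has_changed(old_text: str, new_text: str) -> bool:
    """A policy counts as updated if the diff is not empty."""
    return bool(policy_diff(old_text, new_text))


if __name__ == "__main__":
    old = "We collect your email.\nWe never share it."
    new = "We collect your email.\nWe share it with partners."
    for line in policy_diff(old, new):
        print(line)
```

A line-based diff like this is sensitive to whitespace and markup changes, so a real monitor would probably want to normalize the text first.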
You can find the current privacy policies in the
privacy_policies folder. In
the future, we should probably host them on a separate branch to not mix the
code and the data.
To get going with the project as a contributor, it is recommended to install the
package in 'developer mode' using
pip, in a virtual environment. You also need
$ pip install -e .
To analyze privacy policies, make sure you install the analysis requirements:
$ pip install -r requirements-analysis.txt
For another example analysis, have a look at the word relevance analysis.
There are two entry points, used respectively for:
- Automagically discovering privacy policies given a list of domains
- Fetching privacy policies given the output of the first entry point (a list of privacy policies for each domain).
$ find_policies --urls domains.txt  # Outputs: policy_url_candidates.json
$ fetch_policies policy_url_candidates.json  # Outputs: index.json and privacy_policies/
Keep in mind that the file formats are still a work in progress, and will likely evolve in the near future. Feel free to contribute with ideas and improvements!
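To give an idea of what "discovering" a policy can involve, here is a rough sketch that scans the HTML of a page for links whose label or target mentions "privacy". It is an illustration under assumptions, not the actual logic behind `find_policies`: the keyword list, the class `PolicyLinkFinder`, and the function `find_candidates` are all hypothetical, and only the standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: a link is a candidate if its text or href contains one of these.
KEYWORDS = ("privacy",)


class PolicyLinkFinder(HTMLParser):
    """Collect absolute URLs of <a> tags that look like privacy policy links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.candidates = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            haystack = ("".join(self._text) + " " + self._href).lower()
            if any(keyword in haystack for keyword in KEYWORDS):
                url = urljoin(self.base_url, self._href)
                if url not in self.candidates:
                    self.candidates.append(url)
            self._href = None


def find_candidates(html, base_url):
    """Return candidate policy URLs found in `html`, in document order."""
    finder = PolicyLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.candidates
```

For example, `find_candidates(page_html, "https://example.com")` would return the absolute URLs of the matching links; fetching `page_html` itself is left out of the sketch.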
Thanks for your interest in contributing to Privacy Bot! There are many ways to contribute. To get started, take a look at CONTRIBUTING.md.
TODO - move to issues tracker.
- Some domains seem to have several pages related to privacy; we could collect all of them.
- Some domains have URLs with randomly generated parts inside, which makes the policy appear as if it was updated. We could strip these random parts before saving the policy.
- Add even more domains.
- Make use of proxies to have an IP in the country we want.
- Improve parallelism (for a given domain, requests are sequential).
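For the item about randomly generated URL parts, one possible approach is sketched below. It assumes that the random parts look like long hexadecimal tokens (16 characters or more), which is only a guess; the actual patterns would have to be checked against the collected policies. The name `strip_random_tokens` is hypothetical.

```python
import re

# Assumption: "random parts" are runs of 16 or more hexadecimal characters.
# Real policies may need a different or broader heuristic.
RANDOM_TOKEN = re.compile(r"\b[0-9a-fA-F]{16,}\b")


def strip_random_tokens(text):
    """Replace token-like substrings with a fixed placeholder.

    Two downloads of the same policy that differ only in such tokens
    then compare equal, so they are not reported as an update.
    """
    return RANDOM_TOKEN.sub("<token>", text)


if __name__ == "__main__":
    first = "See https://example.com/policy?sid=9f86d081884c7d659a2feae0c55ad015"
    second = "See https://example.com/policy?sid=a3f5c2d1e4b6978812345678deadbeef"
    print(strip_random_tokens(first) == strip_random_tokens(second))
```

The placeholder keeps the rest of the URL intact, so the saved policy still shows where the link pointed.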
Join us at the Mozilla Global Sprint June 1-2, 2017! We'll be gathering in-person at sites around the world and online to collaborate on this project and learn from each other. Get your #mozsprint tickets now!