Reimplementing TOSBack with Ruby and using git to see TOS changes!
ToSBack!

This is a Ruby implementation of TOSBack! It is designed to scrape the Privacy Policies and Terms of Service agreements from sites defined in the rules folder.

The log files in "logs" should tell you when the script was last run and whether one of the rule's URLs needs to be updated. Typically, tosback.rb will grab the body of a URL and try to strip away the HTML before storing the policy. If a site comes back as modified every time the script runs (thanks to ads or related links changing), you can add an xpath attribute to the url element in the rule's XML data to pinpoint the TOS data on the page.

Here's an example:

<docname name="Privacy Policy">
  <url name="http://www.500px.com/privacy" xpath="//div[@id='terms']">
    <norecurse name="arbitrary"/>
  </url>
</docname>

Now, tosback.rb should only grab the content we want from that URL! Hooray!
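
To make the idea concrete, here is a minimal sketch of how an xpath attribute can narrow a page down to just the policy text. It is not the code in tosback.rb. The helper names (read_rule, collect_text, extract_policy) and the sample page are made up for illustration, and it uses REXML, which only handles well-formed markup. Real pages usually need a more forgiving HTML parser.

```ruby
require 'rexml/document'

# A rule in the format shown above.
RULE_XML = <<~XML
  <docname name="Privacy Policy">
    <url name="http://www.500px.com/privacy" xpath="//div[@id='terms']">
      <norecurse name="arbitrary"/>
    </url>
  </docname>
XML

# Made-up stand-in for a fetched page (assumption: well-formed markup).
PAGE = <<~HTML
  <html>
    <body>
      <div id="ads">Rotating ad text</div>
      <div id="terms"><p>We collect your email address.</p></div>
    </body>
  </html>
HTML

# Hypothetical helper: read the URL and the optional xpath from a rule.
def read_rule(rule_xml)
  url = REXML::Document.new(rule_xml).elements['//url']
  { url: url.attributes['name'], xpath: url.attributes['xpath'] }
end

# Hypothetical helper: gather the text of a node and all of its descendants.
def collect_text(node)
  node.children.map do |child|
    case child
    when REXML::Text then child.value
    when REXML::Element then collect_text(child)
    else ''
    end
  end.join(' ')
end

# Hypothetical helper: return the text under the xpath, falling back to the
# whole body when the rule has no xpath attribute.
def extract_policy(page, xpath)
  node = REXML::XPath.first(REXML::Document.new(page), xpath || '//body')
  return nil if node.nil?
  collect_text(node).split.join(' ') # collapse runs of whitespace
end

rule = read_rule(RULE_XML)
puts extract_policy(PAGE, rule[:xpath])
```

With the xpath in place, the changing ad text in the other div never reaches the stored policy, so it can no longer make the page look modified.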

You can also pass a rule file as an argument to the script to get a preview of the results! For example:

rubycode$ ruby main.rb ../rules/abercrombie.com.xml

This scrapes and writes only the rule you pass, so you can add xpath data to a rule and quickly test that it's correct.

Running with the "-empty" argument will scan the crawl directory and update empty.log! Example:

rubycode$ ruby main.rb -empty
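
The two ways of calling the script described above can be sketched roughly like this. Everything here is an assumption for illustration, not the actual main.rb: the helper names (find_empty_files, write_empty_log, run), the idea that "empty" means a zero-byte file, the log format of one path per line, and the guess that running with no arguments processes every rule.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: list files under crawl_dir that are zero bytes long.
# Assumption: "empty" means a zero-byte file.
def find_empty_files(crawl_dir)
  Dir.glob(File.join(crawl_dir, '**', '*'))
     .select { |path| File.file?(path) && File.zero?(path) }
     .sort
end

# Hypothetical helper: overwrite log_path with one empty file per line.
def write_empty_log(crawl_dir, log_path)
  empties = find_empty_files(crawl_dir)
  File.write(log_path, empties.map { |path| "#{path}\n" }.join)
  empties
end

# Hypothetical dispatch: decide what to do from the command-line arguments.
def run(args, crawl_dir:, log_path:)
  case args.first
  when '-empty' then [:empty, write_empty_log(crawl_dir, log_path)]
  when nil      then [:all_rules, nil]           # assumption: no argument means every rule
  else               [:single_rule, args.first]  # a rule file to preview
  end
end

# Small demonstration in a throwaway directory.
Dir.mktmpdir do |dir|
  site = File.join(dir, 'crawl', 'example.com')
  FileUtils.mkdir_p(site)
  File.write(File.join(site, 'Privacy Policy.txt'), 'some policy text')
  File.write(File.join(site, 'Terms of Service.txt'), '')

  mode, empties = run(['-empty'],
                      crawl_dir: File.join(dir, 'crawl'),
                      log_path: File.join(dir, 'empty.log'))
  puts mode
  puts empties.map { |path| File.basename(path) }
end
```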
