# Web Data Science
## Week 01 - Law and ethics

INFO 4871/5871  
[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT)

## Acknowledgements

This course will draw on resources built by myself and [Allison Morgan](https://allisonmorgan.github.io/) for the [2018 Summer Institute for Computational Social Science](https://github.com/allisonmorgan/sicss_boulder), which were in turn derived from [other resources](https://github.com/simonmunzert/web-scraping-with-r-extended-edition) developed by [Simon Munzert](http://simonmunzert.github.io/) and [Chris Bail](http://www.chrisbail.net/). 

## Introduction to Jupyter Notebooks

Jupyter Notebooks (previously called IPython Notebooks) are interactive programming environments that are great for writing, executing, and documenting code. The notebooks we'll be using launch in your web browser but are "talking" to a local background kernel (hence the "localhost" in the URL) rather than an external server. Put another way: the data you're analyzing in Jupyter Notebook isn't leaving your computer.

Jupyter Notebooks are composed of "cells" which you can create, move, and delete. These cells can be different formats like "Markdown" (what this pretty cell is), "code" (what you'll spend most of your time in), or "raw" (useful for preserving code but avoiding accidentally running it). All the cells up until now have been [Markdown](https://daringfireball.net/projects/markdown/syntax), which is a lightweight standard for formatting text. You can double click on these Markdown cells to edit them.

Below is an example of a code cell. Type code into the cell and run it with the "Run" button in the toolbar or by pressing Shift+Enter.

In [1]:
name = 'Brian Keegan'
print(name)

Brian Keegan


In [2]:
1+1 

2

Jupyter Notebooks are great because they can keep your code, documentation, and results all in one file. Here I'll import a helper library that lets Jupyter Notebook embed images from the web (from [this tweet](https://twitter.com/barackobama/status/831527113211645959), which we'll return to in a bit). When you save this notebook, this image will be saved with it.

In [3]:
from IPython.display import Image

Image('https://pbs.twimg.com/media/C4otUykWcAIbSy1.jpg')

<IPython.core.display.Image object>

### Exercises

1. Create at least two new cells below (before the **Additional resources** cell) and write your name and 1+1 in them.
2. Experiment with running a code cell versus a markdown cell. How does the cell change formatting for each?
3. Move cells around (look at the arrows in the toolbar or use the keyboard shortcuts under Help).
4. Create a third cell and add another image from the web (use the "Copy Image Address" on a right click) into the notebook using the `Image` function.

In [4]:
'I love notebooks'

'I love notebooks'

In [5]:
2*2

4

### Additional resources

Jupyter Notebooks are very powerful tools that are increasingly pervasive throughout the computer, information, and data science communities. I am an [unapologetic evangelist](https://github.com/brianckeegan/Bechdel/blob/master/Bechdel_test.ipynb) for their use by researchers, journalists, and activists to improve documentation practices and openness in data analysis. But notebooks also have their critics: see [Joel Grus](https://twitter.com/joelgrus?lang=en)'s hilarious but important presentation, [I Don't Like Notebooks](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit).

Here is some helpful documentation, along with tutorials and examples; there are many others out there on the web!

* [DataCamp — The Definitive Jupyter Notebook Tutorial](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)
* [Jupyter Notebook documentation](https://jupyter.readthedocs.io/en/latest/)
* [Gallery of interesting Jupyter Notebooks](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks)

## Ethical web scraping

The phrase "data scraping" is colloquial and popular but has pejorative connotations. Data is valuable: other people invested time in collecting, organizing, and sharing it. When you show up demanding data with a scraper you built in maybe a dozen hours, you rarely pay the costs of labor, hosting, *etc*. that went into making the data available. There are *very* good rationales for making many kinds of data more available: reproducibility of scientific results, sharing publicly-funded and/or close-to-zero marginal cost resources, transparency and accountability in democratic institutions, remixing for innovative new analyses, *etc*.

But data breaches have become notorious (Target in 2013, Equifax in 2017, Facebook in 2018, *etc*.) because they violate other values like privacy. These values are articulated most clearly in the principles outlined in the 1978 [Belmont Report](https://en.wikipedia.org/wiki/Belmont_Report):
* **Respect for persons**: protecting the autonomy of all people and treating them with courtesy and respect and allowing for informed consent. Researchers must be truthful and conduct no deception;
* **Beneficence**: The philosophy of "Do no harm" while maximizing benefits for the research project and minimizing risks to the research subjects; and
* **Justice**: ensuring that reasonable, non-exploitative, and well-considered procedures are administered fairly and equally, with a fair distribution of costs and benefits among potential research participants.

(A fourth principle, "Respect for Law and Public Interest," added by the later Menlo Report, emphasizes compliance, accountability, and transparency in the conduct of research.)

In the context of data scraping, there are four "areas of difficulty":

* **Informed consent**: does the data scraper obtain consent from every person whose data is being retrieved?
* **Informational risk**: can the data scraper inflict economic, social, *etc*. harm on individuals by disclosing data?
* **Privacy**: does the data scraper know which information a person intended to be private or public? 
* **Decision-making under uncertainty**: does the data scraper know all the ways the data could be (mis)used? 

Ethical and legal risks involved with scraping:

* **[Copyright infringement](https://en.wikipedia.org/wiki/Copyright_infringement)**: compiling data that someone else can claim ownership over
* **[Trespass](https://en.wikipedia.org/wiki/Trespass_to_chattels#In_the_electronic_age)**: over-aggressive scraping shuts down someone else's property
* **[Computer Fraud & Abuse Act](https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act)**: misrepresenting yourself to access a system is "hacking"

While I cannot provide legal advice, we will revisit these concerns throughout the course through best practices for avoiding infringement, staggering data collection, simulating human requests, securing data, and protecting privacy.

James Densmore has a nice summary of [practices for ethical web scraping](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01):

> * If you have a public API that provides the data I’m looking for, I’ll use it and avoid scraping all together.
> * I will always provide a User Agent string that makes my intentions clear and provides a way for you to contact me with questions or concerns.
> * I will request data at a reasonable rate. I will strive to never be confused for a DDoS attack.
> * I will only save the data I absolutely need from your page. If all I need it OpenGraph meta-data, that’s all I’ll keep.
> * I will respect any content I do keep. I’ll never pass it off as my own.
> * I will look for ways to return value to you. Maybe I can drive some (real) traffic to your site or credit you in an article or post.
> * I will respond in a timely fashion to your outreach and work with you towards a resolution.
> * I will scrape for the purpose of creating new value from the data, not to duplicate it.

Some other important components of ethical web scraping practices [include](http://robertorocha.info/on-the-ethics-of-web-scraping/):

* Reading the Terms of Service and Privacy Policies for the site's rules on scraping.
* Inspecting the robots.txt file for rules about what pages can be scraped, indexed, *etc*.
* Being gentle on smaller websites by running during off-peak hours and spacing out requests.
* Identifying yourself by name and email in your User-Agent strings.
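
The "spacing out requests" practice above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the `Throttle` class and the `fetch_politely` function are hypothetical names introduced here, and `fetch` stands in for whatever function actually retrieves a page.

```python
import time


class Throttle:
    """Enforce a minimum pause between successive requests (hypothetical helper)."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay  # seconds to leave between requests
        self._clock = clock         # injectable so the logic can be checked offline
        self._sleep = sleep
        self._last = None           # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to honor min_delay, then record the time."""
        if self._last is not None:
            remaining = self.min_delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def fetch_politely(urls, fetch, throttle):
    """Call fetch(url) for each URL, pausing between calls."""
    pages = {}
    for url in urls:
        throttle.wait()
        pages[url] = fetch(url)
    return pages
```

In a notebook, `fetch` could be a small function that wraps `requests.get` with a `headers` dictionary whose User-Agent carries your name and email, and `Throttle(2.0)` would leave at least two seconds between requests. The right delay is a judgment call that depends on the size of the site.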

What does a robots.txt file look like? Here is the beginning of Amazon's. The `User-agent: *` line applies the rules that follow to all kinds of User-agents, and each `Disallow` line excludes pages under a specific path (sign-in, cart, voting, *etc*.) from crawling.

In [6]:
import requests

In [8]:
print(requests.get('https://www.amazon.com/robots.txt').text)

User-agent: *
Disallow: /exec/obidos/account-access-login
Disallow: /exec/obidos/change-style
Disallow: /exec/obidos/flex-sign-in
Disallow: /exec/obidos/handle-buy-box
Disallow: /exec/obidos/tg/cm/member/
Disallow: /gp/aw/help/id=sss
Disallow: /gp/cart
Disallow: /gp/flex
Disallow: /gp/product/e-mail-friend
Disallow: /gp/product/product-availability
Disallow: /gp/product/rate-this-item
Disallow: /gp/sign-in
Disallow: /gp/reader
Disallow: /gp/sitbv3/reader
Disallow: /gp/richpub/syltguides/create
Disallow: /gp/gfix
Disallow: /gp/associations/wizard.html
Disallow: /gp/dmusic/order
Disallow: /gp/legacy-handle-buy-box.html
Disallow: /gp/aws/ssop
Disallow: /gp/yourstore
Disallow: /gp/gift-central/organizer/add-wishlist
Disallow: /gp/vote
Disallow: /gp/voting/
Disallow: /gp/music/wma-pop-up
Disallow: /gp/customer-images
Disallow: /gp/richpub/listmania/createpipeline
Disallow: /gp/content-form
Disallow: /gp/pdp/invitation/invite
Disallow: /gp/customer-reviews/common/du
Disallow: /gp/customer-re
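
Rather than reading these rules by eye, Python's standard library can check a URL against them. The sketch below parses a short excerpt of the output above (not the full file), so its answers only reflect those few rules; the `allowed` helper and the second path are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A short excerpt of the rules printed above; the real file is much longer.
robots_excerpt = """\
User-agent: *
Disallow: /gp/cart
Disallow: /gp/sign-in
Disallow: /gp/vote
"""

parser = RobotFileParser()
parser.parse(robots_excerpt.splitlines())


def allowed(path, user_agent='*'):
    """Return True if the excerpted rules permit crawling this path."""
    return parser.can_fetch(user_agent, 'https://www.amazon.com' + path)


print(allowed('/gp/cart'))           # matches a Disallow rule in the excerpt
print(allowed('/gp/help/customer'))  # hypothetical path with no matching rule
```

For a live site, `parser.set_url(...)` followed by `parser.read()` retrieves and parses the file directly instead of using a pasted excerpt.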

When you are scraping websites, it is a good idea to include your contact information in a custom User-Agent string so that the webmaster can get in contact with you.

In [9]:
# A browser-style User-Agent, kept here for comparison:
# contact_header = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 13_2_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.3 Safari/605.1.15'}
# Identify yourself by name and email so the webmaster can reach you
contact_header = {'User-Agent': 'Brian Keegan brian.keegan@colorado.edu'}

request = requests.get('https://www.cnn.com', headers=contact_header)

In [10]:
request.text



Adverse consequences of web scraping include:
* Compromising the privacy and integrity of individual users' data
* Damaging a web server with too many requests
* Denying access to the web service to other authorized users
* Infringing on copyrighted material
* Damaging the business value of a web site

[Amanda Bee](http://velociraptor.info/) compiled [a nice set of examples](https://github.com/amandabee/scraping-for-journalists/wiki/Reporting-Examples) of data journalists using web scraping for their reporting. There are some ethical justifications for violating a site's terms of service to scrape data:
* Obtaining data for the public interest from official statements, government reports, *etc*.
* Conducting audit studies (as long as these are responsibly designed and pre-cleared)
* Collecting data that is unavailable from APIs, FOIA requests, and other reports

[Sophie Chou](http://sophiechou.com/) made this nice [decision flow-chart](http://www.storybench.org/to-scrape-or-not-to-scrape-the-technical-and-ethical-challenges-of-collecting-data-off-the-web/) of whether to build a scraper or not from a NICAR panel in 2016:

![Should you build a scraper flowchart](http://www.storybench.org/wp-content/uploads/2016/04/flowchart_final.jpeg)

Why is there a "Talk to a lawyer?" outcome at the bottom?

### Computer Fraud and Abuse Act

The [Computer Fraud and Abuse Act](https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act) was enacted in 1986 as an amendment to a computer fraud statute passed in 1984, [in large part due to](https://www.cnet.com/news/from-wargames-to-aaron-swartz-how-u-s-anti-hacking-law-went-astray/) the 1983 film [WarGames](https://en.wikipedia.org/wiki/WarGames) starring Matthew Broderick. On a plain reading, the text of the law ([18 U.S.C. § 1030](https://www.law.cornell.edu/uscode/text/18/1030)) criminalizes just about any form of web scraping:

> * Whoever intentionally accesses a computer without authorization or exceeds authorized access, and thereby obtains… information from any protected computer;
> * knowingly causes the transmission of a program, information, code, or command, and as a result of such conduct, intentionally causes damage without authorization, to a protected computer;
> * the term “exceeds authorized access” means to access a computer with authorization and to use such access to obtain or alter information in the computer that the accesser is not entitled so to obtain or alter;
> * the term “damage” means any impairment to the integrity or availability of data, a program, a system, or information;
> * the term “protected computer” means a computer which is used in or affecting interstate or foreign commerce or communication, including a computer located outside the United States that is used in a manner that affects interstate or foreign commerce or communication of the United States;

Violators can be fined and jailed under a misdemeanor charge for up to 1 year for the first violation and jailed up to 10 years under a felony charge for repeated violations.

This law has a [chilling effect](https://en.wikipedia.org/wiki/Chilling_effect) on research, journalism, and other forms of protected speech. Federal prosecutors have used the CFAA to bring felony charges against programmers, journalists, and activists. In 2011, programmer and hacktivist [Aaron Swartz](https://en.wikipedia.org/wiki/Aaron_Swartz) (who contributed to the development of RSS, Markdown, Creative Commons, and Reddit) was [arrested and charged](https://en.wikipedia.org/wiki/United_States_v._Swartz) with violating the CFAA for downloading several million PDFs from JSTOR over MIT's network. The [decision to prosecute was unusual](https://www.huffingtonpost.com/2013/03/13/aaron-swartz-prosecutorial-misconduct_n_2867529.html). Facing a cumulative maximum of 35 years of imprisonment and $1 million in fines, Swartz died by suicide on January 11, 2013.

In 2016, four computer science researchers and the publisher of *The Intercept* who all use scraping techniques to run experiments to measure bias and discrimination in web content [filed suit with the ACLU](https://www.aclu.org/cases/sandvig-v-sessions-challenge-cfaa-prohibition-uncovering-racial-discrimination-online) against the U.S. Government: *Sandvig v. Barr*. Their research involves creating multiple fake accounts, providing inaccurate information to websites, using automated tools to record publicly-available data, and other scraping techniques. In March 2020 (was something else also happening?), a [federal court ruled](https://www.aclu.org/press-releases/federal-court-rules-big-data-discrimination-studies-do-not-violate-federal-anti) that anti-discrimination research using web scraping methods does not violate the CFAA.

There have been other major court cases navigating the boundaries between the CFAA and legitimate uses of web scraping for research and accountability:
  * ***[hiQ Labs v. LinkedIn](https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn)*** (2019) involved a startup company, hiQ Labs, that scraped information from public LinkedIn profiles. After LinkedIn sent a cease-and-desist letter, hiQ sued to keep scraping, and the 9th Circuit Court of Appeals ruled that the CFAA likely does not prevent [scraping of publicly accessible data](https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data). A district court later found in 2022 that hiQ had nonetheless violated LinkedIn's terms preventing scraping.
  * ***[Van Buren v. United States](https://en.wikipedia.org/wiki/Van_Buren_v._United_States)*** (2021) was a wacky Supreme Court case in which a (corrupt) police officer used his access to criminal databases to provide information to an FBI informant in a sting operation. The primary goal of the case was to resolve a circuit split in which lower courts had issued conflicting opinions about the scope of the CFAA's definition of "exceeds authorized access." The Court importantly [limited the scope](https://www.eff.org/deeplinks/2021/06/van-buren-victory-against-overbroad-interpretations-cfaa-protects-security) of the law, reasoning that if "the 'exceeds authorized access' clause criminalizes every violation of a computer-use policy," then millions of otherwise law-abiding citizens would be criminals.

## Warning!

The code we will write and execute will violate the Terms of Service for platforms like [Twitter](https://twitter.com/en/tos) ("You may not...  access or search or attempt to access or search the Services by any means (automated or otherwise) other than through our currently available, published interfaces that are provided by Twitter") and [YouTube](https://www.youtube.com/static?template=terms) ("you are not allowed to... access the Service using any automated means (such as robots, botnets or scrapers)...") for retrieving information from the platform. 

In effect, we will transmit code in excess of our authorized access and potentially cause damage, in order to obtain information from a protected computer. 

We will do this in order to obtain public statements made by government officials acting in their official capacity because this data is otherwise unavailable for retrieval from YouTube. There is an interesting body of emerging legal precedent treating elected officials' use of Twitter as a public forum: [*Knight First Amendment Institute v. Trump*](https://en.wikipedia.org/wiki/Knight_First_Amendment_Institute_v._Trump) established that [the President may not block other Twitter users](https://www.courtlistener.com/docket/6087955/72/knight-first-amendment-institute-at-columbia-university-v-trump/):

> * "We hold that portions of the @realDonaldTrump account -- the “interactive space” where Twitter users may directly engage with the content of the President’s tweets -- are properly analyzed under the “public forum” doctrines set forth by the Supreme Court, that such space is a designated public forum..."
> * "we nonetheless conclude that the extent to which the President and Scavino can, and do, exercise control over aspects of the @realDonaldTrump account are sufficient to establish the government-control element as to the content of the tweets sent by the @realDonaldTrump account, the timeline compiling those tweets, and the interactive space associated with each of those tweets."
> * "Because a Twitter user lacks control over the comment thread beyond the control exercised over first-order replies through blocking, the comment threads -- as distinguished from the content of tweets sent by @realDonaldTrump, the @realDonaldTrump timeline, and the interactive space associated with each tweet -- do not meet the threshold criterion for being a forum."
> * "the account’s timeline, which “displays all tweets generated by the [account]”... all of which is government speech."

On this basis, I believe the White House's videos posted to Twitter or YouTube are government speech, and that our automated retrieval of this content and associated metadata, in violation of YouTube's Terms of Service, is justifiable as a way of studying speech made in a public forum.

I would advise you against using these tools and approaches without a similarly clear public interest rationale and jurisprudence linking behavior to public forum doctrines.

## Exercises

These exercises do not need to be submitted and will not be graded. But please come to class on Thursday with some examples to discuss.

### Exercise 0: Explore leg.colorado.gov

Click through the information about bills, legislators, committees, histories, *etc.* on https://leg.colorado.gov/bills-by-bill-number 

What are different ways for getting a list of all the bills in a single session? What could we do if we had more accessible data?

### Exercise 1: Changing User-Agents
Find a website that returns an error when you use `requests` with a responsible User-Agent disclosure. Change the User-Agent string to a web browser string and see if it returns something else when you pretend to be a browser.

In [52]:
contact_header = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 13_2_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.3 Safari/605.1.15'}
# contact_header = {'User-Agent':'Brian Keegan brian.keegan@colorado.edu'}
# contact_header = {'User-Agent':'ChatGPT-User'}

request = requests.get('https://www.x.com/',headers=contact_header)

request.text

'\n    <!DOCTYPE html>\n    <head>\n      <title>x.com</title>\n      <meta http-equiv="refresh" content="0; url = https://twitter.com/x/migrate?tok=7b2265223a222f222c2274223a313732343934383131347d84eb9d2adcef0df01faafd91813efcc3" />\n      <meta charset="utf-8">\n      <meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=0,viewport-fit=cover">\n\n      <link rel="preconnect" href="//abs.twimg.com">\n      <link rel="dns-prefetch" href="//abs.twimg.com">\n      <link rel="preconnect" href="//api.twitter.com">\n      <link rel="dns-prefetch" href="//api.twitter.com">\n      <link rel="preconnect" href="//api.x.com">\n      <link rel="dns-prefetch" href="//api.x.com">\n      <link rel="preconnect" href="//pbs.twimg.com">\n      <link rel="dns-prefetch" href="//pbs.twimg.com">\n      <link rel="preconnect" href="//t.co">\n      <link rel="dns-prefetch" href="//t.co">\n      <meta http-equiv="onion-location" content="https://twitter3e4tixl4xyajtrz
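
To make the comparison in this exercise systematic, one option is to request the same URL under several User-Agent strings and tabulate the status codes. The sketch below is only an illustration: `compare_user_agents` is a hypothetical helper, and `fetch` is assumed to accept a URL and a `headers` keyword and return an object with a `status_code` attribute, as `requests.get` does.

```python
def compare_user_agents(url, user_agents, fetch):
    """Return {label: HTTP status code} for one URL under several User-Agents.

    `user_agents` maps a short label to a User-Agent string.
    """
    statuses = {}
    for label, user_agent in user_agents.items():
        response = fetch(url, headers={'User-Agent': user_agent})
        statuses[label] = response.status_code
    return statuses


# In the notebook, pass requests.get as `fetch`, for example:
# agents = {'contact': 'Brian Keegan brian.keegan@colorado.edu',
#           'bot': 'ChatGPT-User'}
# compare_user_agents('https://www.x.com/', agents, requests.get)
```

A status code alone does not tell the whole story (a site can return 200 with a redirect or a block page in the body, as in the output above), so it is still worth inspecting the response text.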

### Exercise 2: Interpreting robots.txt

Use `requests` to retrieve a website's robots.txt file. Interpret what is excluded and permitted. Compare to another website.

In [53]:
print(requests.get('https://www.x.com/robots.txt').text)

# Google Search Engine Robot
User-agent: Googlebot

Allow: /*?lang=
Allow: /hashtag/*?src=
Allow: /search?q=%23
Allow: /i/api/
Disallow: /search/realtime
Disallow: /search/users
Disallow: /search/*/grid

Disallow: /*?
Disallow: /*/followers
Disallow: /*/following

Disallow: /account/deactivated
Disallow: /settings/deactivated

Disallow: /[_0-9a-zA-Z]+/status/[0-9]+/likes
Disallow: /[_0-9a-zA-Z]+/status/[0-9]+/retweets
Disallow: /[_0-9a-zA-Z]+/likes
Disallow: /[_0-9a-zA-Z]+/media 
Disallow: /[_0-9a-zA-Z]+/photo

User-Agent: Google-Extended
Disallow: *

User-Agent: FacebookBot
Disallow: *

User-agent: facebookexternalhit
Disallow: *

User-agent: Discordbot
Disallow: *

User-agent: Bingbot
Disallow: *

# Every bot that might possibly read and respect this file
User-agent: *
Disallow: /


# WHAT-4882 - Block indexing of links in notification emails. This applies to all bots.
Disallow: /i/u
Noindex: /i/u

# Wait 1 second between successive requests. See ONBOARD-2698 for details.
Crawl-delay
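
When a robots.txt file has many groups like this one, grouping the rules by User-agent can make it easier to interpret and to compare with another site's file. The following is a rough sketch, assuming the simple `field: value` layout seen above; `summarize_robots` is a hypothetical helper and it ignores fields such as `Noindex` and `Crawl-delay`.

```python
from collections import defaultdict


def summarize_robots(text):
    """Group Allow/Disallow paths by the User-agent they apply to."""
    rules = defaultdict(lambda: {'allow': [], 'disallow': []})
    agents = []        # user-agents named by the current group
    in_rules = False   # True once the current group has listed a rule
    for raw in text.splitlines():
        line = raw.split('#', 1)[0].strip()  # drop comments and whitespace
        if ':' not in line:
            continue
        field, value = (part.strip() for part in line.split(':', 1))
        field = field.lower()
        if field == 'user-agent':
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
            rules[value]  # make sure the agent appears even with no rules
        elif field in ('allow', 'disallow'):
            in_rules = True
            for agent in agents:
                rules[agent][field].append(value)
    return dict(rules)
```

Applied to the output above, a summary like this shows that Googlebot gets a list of path-specific rules, while the catch-all `*` group is told `Disallow: /`.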