# Commmunity

In the discussion this week we read about [The FAIR Principles](https://www.go-fair.org/fair-principles/) and how digital curation practices around *findability*, *accessability*, *interoperability* and *reusabilty* are important for the scientific community. For science to work in today's networked environment it's important for data to be shared in order for research to progress. Sometimes it can be a a challenge to make research data available. For example in situations where the privacy of individuals represented in the data are at play. In [some situations](https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/) the accessabilty or findabilty of data could harm those that are represented by the data.

In Module 5 we read [The Numbers Don't Speak for Themselves](https://data-feminism.mitpress.mit.edu/pub/czq9dfs5/release/2) where D'Ignazio and Klein describe how data and its description are an instrument of power. Power issues can manifest in a multitude of ways, as extraction or oppression, but also as resistance to power, for example in cases where [human rights violations](https://archiving.witness.org/) are documnted. The ways in which data is produced and described can work to obscure or expose the biases and power dynamics that are always present in data curation activities. One size does not fit all when it comes to understanding the ethics of data curation practices. This can make principles like FAIR difficult to uniformly apply in practice, even while they are good frameworks for thinking with.

In Module 6 we looked at the effects of platforms, and how YouTube's API allows users to upload and describe video content, but only allows the retrieval of the metadata descriptions, and not the video that was uploaded. This asymmetry is an example of how power dynamics get encoded into platform protocols like APIs. these asymmetries highlight why tools like youtube-dl are often needed, which sometimes operate in opposition to powerful commercial interests, in an ethical gray area with respect to platform's terms of service.

In [The Politics of Platforms](https://journals.sagepub.com/doi/abs/10.1177/1461444809342738) Gillepsie argues that the term "platform" was developed by companies like YouTube because of the way it allows them to connect advertisers and content creators by seeming to provide a *neutral* platform. The appearance of neutrality is extremely important for YouTube and its parent company Google because the [Digital Millenium Copyright Act](https://en.wikipedia.org/wiki/Digital_Millennium_Copyright_Act) (DMCA) makes their users liable for copyright infringement rather than the companies themselves. Google has built tools like [ContentID](https://en.wikipedia.org/wiki/Content_ID_(system)) that allow publishers to easily police content on the platform. Rather than being regulated, Google in effect becomes the regulator, and assumes a role that would be typically reserved for the government. Since these are global corporations their effects extend far outside the jurisdiction of the United States.

## Chilling Effects

<img style="width: 300px; float: right" src="https://raw.githubusercontent.com/edsu/inst341/master/modules/module-07/images/lumen.jpg">

In 2001 a group of law researchers and the [Electronic Freedom Foundation](https://eff.org) established the *Chilling Effects Database* in order to document the growing number of DMCA cease and desist letters that were being sent to remove information from the web. The database got its name because of the potential [chilling effect](https://en.wikipedia.org/wiki/Chilling_effect) that these requests had on legitimate publishing on the web. Google was one of the early supporters of the database and they routinely submit the requests they receive from parties that want remove content from their search index. The database has since been renamed [Lumen](https://lumendatabase.org), and is now run out of the [Berkman-Klein Center](https://en.wikipedia.org/wiki/Berkman_Klein_Center_for_Internet_%26_Society) at Harvard University, where it continues to collect information about takedown notices.

Every record in the Lumen Database provides information about three actors or entities:

* **Principal**: the entity that believes their copyright was infringed upon 
* **Sender**: the entity that sent the notification
* **Recipient**: the entity that received the notification
* **Submitter**: the entity that submitted the request to Lumen

The Principal and the Sender are often different, because the entity making the copyright claim can use a service to streamline the notification process. But the *Recipient* and *Submitter* are often the same, because the recipient (e.g. Google) submits it directly to Lumen. Here is a partial view of a notice:

<a href="https://www.lumendatabase.org/notices/15234516"><img style="max-width: 700px;" src="https://raw.githubusercontent.com/edsu/inst341/master/modules/module-07/images/lumen-record.png"></a>

In this case the *Principal* is [Warner Music Group](https://www.wmg.com/) (Germany), the *Sender* is the advertising firm [proMedia](https://www.promedia.com), and the *Recipient* and *Submitter* are both Google. Some information in the notice such as URLs for the content being removed, are not made available unless you request it. The website includes a [search interface](https://www.lumendatabase.org/) which allows you to browse notices, and they also has an [API](https://github.com/berkmancenter/lumendatabase/wiki/Lumen-API-Documentation) to support [research](shttps://lumendatabase.org/pages/research). Participation in Lumen is voluntary. In addition to Google Lumen has received notices from WordPress, DuckDuckGo, The Internet Archive, Kickstarter, Medium, Twitter, Vimeo and Wikipedia.

In this notebook we are going to use the Lumen Database to explore what kinds of content are being removed from the web using the DMCA. The purpose of doing this is to demonstrate how Lumen can be seen as an implementation of the FAIR Principles, where each of the principles applies to the way this data is being made available. In building Lumen they have chosen to implement findability and accessibility in particular ways. But Lumen is also a prime example of a tool that lets us examine how the web is being actively curated, using infratructures that have been developed by corporations in collaboration with law makers. Lumen is a piece of community infrastructure.

## Lumen API

The Lumen Database has a full featured [search interface](https://lumendatabase.org/notices/search) for eploring the notices. It provides a faceted search interface for exploring the notices by the principal, submitter, receiver, topics and more. They also have an [API](https://github.com/berkmancenter/lumendatabase/wiki/Lumen-API-Documentation) which allows you to perform simliar searches in an automated way, and to return structured metadata (JSON) for the notices. This is super useful for looking for patterns and summarizing the notices in various ways, which is important because as of 2018 it contained twelve million notices, referencing close to four billion URLs.

Testing on a small scale is possible without an API key. But most research use requires a key. I've requested a key for us to use in this class, which I have given to you in Canvas. Please use the key in the context of this notebook or your final project. If you would like to use it for other research please let me know or apply for your own key.

Go ahead and paste the key in here and execute the cell:

In [4]:
key = ''

Lets try a very simple API request using the Python [requests](https://requests.readthedocs.io/en/master/) library to fetch a record, in this case the same one pictured above `id=15234516` using the [Request a Notice](https://github.com/berkmancenter/lumendatabase/wiki/Lumen-API-Documentation#request-a-notice) endpoint. First the notebook needs to have requests installed:

In [19]:
! pip --quiet install requests

First it is helpful to set up a few Python dictionaries that we will use to make the HTTP requests to the Lumen API.

`headers` contains [HTTP headers](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers) to send with the request. The Lumen API requires that a `User-Agent` be supplied, which is an text string that identifies who you are submitting the request.

In addition we will create a dictionary `params` that will contain name/value pairs that represent [URL query string parameters]. These parameters are name/value pairs separated by an `=` that appear after a question mark `?` in a URL. If we put these parameters in a dictionary requests will create the query string for us, which is especially helpful when the values need to be [URL Encoded](https://en.wikipedia.org/wiki/Percent-encoding). 

Here is an HTTP GET request is issued to fetch the URL, which returns a response. The response objecthas a json() method to return the JSON data from the API parsed into a Python data structure so we can use it programatically.

In [84]:
import requests

headers = {'User-Agent': 'umd-inst341'}
params = {'authentication_token': key}

response = requests.get('https://lumendatabase.org/notices/15234516.json', 
                        headers=headers, params=params)
response.json()


{'dmca': {'id': 15234516,
  'type': 'DMCA',
  'title': 'DMCA (Copyright) Complaint to Google',
  'body': None,
  'date_sent': '2017-10-26T00:00:00.000Z',
  'date_received': '2017-10-26T00:00:00.000Z',
  'topics': ['DMCA Notices', 'Copyright'],
  'sender_name': 'proMedia',
  'principal_name': 'Warner Music Group Germany Holding GmbH',
  'recipient_name': 'Google Inc',
  'works': [{'description': 'Simon,Carly - Songs From The Trees',
    'infringing_urls': [{'url': 'http://intmusic.net/9832/carly-simon-songs-from-the-trees-a-musical-memoir-collection-2015'}],
    'copyrighted_urls': [{'url': 'No URL submitted'}]},
   {'description': 'Morissette,Alanis - Unplugged',
    'infringing_urls': [{'url': 'No URL submitted'}],
    'copyrighted_urls': [{'url': 'No URL submitted'}]},
   {'description': "Bowie,David - 'Hours...'",
    'infringing_urls': [{'url': 'No URL submitted'}],
    'copyrighted_urls': [{'url': 'No URL submitted'}]},
   {'description': 'Bowie,David - 1.Outside',
    'infringing

You can see a variety of information in this densely packed data structure, including the sender, principal and recipeint names. You can see the dates that the notice was sent and the date it was received, which appear to be the same here. There is also a list of objects that describe the works that were infringed on, which have a desciption and a domain for the infringing website.

In this notice Warner Music is using proMedia to get Google to delist these resources from their search index. But a critical piece of information included near the bottom of the record is `action_taken` which indicates that Google chose not to act on this notice.

## Search

So now for the fun part. We can also use the Lumen API to perform searches of *all* the notices. The [search endpoint](https://github.com/berkmancenter/lumendatabase/wiki/Lumen-API-Documentation#search-notices-via-fulltext) supports LOTS of options. We're just going to show a few of them here in this notebook, and what follows is not meant to be exhaustive but just to get you thinking.

In this example we are going to use the `term` query parameter to search for the word `Bowie` in the full text of the notices. Notice how we add `term` to our params dictionary? See if you can find where `term` is defined in the [Lumen API Documentation](https://github.com/berkmancenter/lumendatabase/wiki/Lumen-API-Documentation#search-notices-via-fulltext). Being able to read (and write) the various styles of API documentation is very important for data curation work.

In [95]:
params = {
    "term": "Bowie",
    "authentication_token": key
}

response = requests.get("https://lumendatabase.org/notices/search.json",
                        headers=headers, params=params)

results = response.json()
results

{'notices': [{'id': 178894,
   'type': 'DMCA',
   'title': 'BPI DMCA (Copyright) Complaint to Google',
   'body': None,
   'date_sent': '2012-04-29T04:00:00.000Z',
   'date_received': '2012-04-29T04:00:00.000Z',
   'topics': ['Copyright', 'DMCA Safe Harbor'],
   'sender_name': 'BPI (British Recorded Music Industry) Ltd',
   'principal_name': 'BPI (British Recorded Music Industry) Ltd',
   'recipient_name': 'Google, Inc.              ',
   'works': [{'description': 'DAVID BOWIE BEST OF BOWIE\nDAVID BOWIE BEST OF BOWIE\nDAVID BOWIE BEST OF BOWIE\nDAVID BOWIE BEST OF BOWIE\nDAVID BOWIE BEST OF BOWIE\nDAVID BOWIE BEST OF BOWIE\nDAVID BOWIE BEST OF BOWIE\nDAVID BOWIE BEST OF BOWIE\nDAVID BOWIE BEST OF BOWIE\nDAVID BOWIE BEST OF BOWIE\nDAVID BOWIE BEST OF BOWIE\nDAVID BOWIE BEST OF BOWIE\nDAVID BOWIE BEST OF BOWIE\nDAVID BOWIE BEST OF BOWIE\nDAVID BOWIE BEST OF BOWIE\nDAVID BOWIE BEST OF BOWIE\nDAVID BOWIE BEST OF BOWIE',
     'infringing_urls': [{'url': 'http://filetram.com/4shared/1-0/davi

The `results` variable here contains a dictionary that represents the response from the search API call. It is a dictionary with two keys `notices` which contains the notices and `meta` which contains some adminstrative information about our query. Let's create a variable called `notices` for the list of notice objects. This is a list of objects similar to the one we saw above.

In [97]:
notices = results['notices']
len(notices)

10

## Paging

As you can see the response only returned 10 results. But we can see from the `pages` property of the `results['meta']` dictionary that there are many more pages of results that matched. We only received the first 10, the first page.

In [92]:
results['meta']['total_pages']

1000

Fortunately the API lets us request the next page of results using the `page` parameter.

In [94]:
params = {
    "term": "Bowie",
    "page": 2,
    "authentication_token": key
}

response = requests.get("https://lumendatabase.org/notices/search.json",
                        headers=headers, params=params)

response.json()

{'notices': [{'id': 621271,
   'type': 'DMCA',
   'title': 'BPI DMCA (Copyright) Complaint to Google',
   'body': None,
   'date_sent': '2013-02-23T05:00:00.000Z',
   'date_received': '2013-02-23T05:00:00.000Z',
   'topics': ['Copyright', 'DMCA Safe Harbor'],
   'sender_name': 'BPI (British Recorded Music Industry) Ltd',
   'principal_name': 'BPI (British Recorded Music Industry) Ltd',
   'recipient_name': 'Google, Inc.              ',
   'works': [{'description': 'DAVID BOWIE,ASHES,,BPI Ltd Member Companies',
     'infringing_urls': [{'url': 'http://mp3take.com/mp3/05_david_bowie_ashes_to_ashes.html'},
      {'url': 'http://mp3take.com/mp3/other/1/david_bowie_ashes_to_ashes.html'},
      {'url': 'http://search.4shared.com/q/10/David+Bowie+Ashes+to+Ashes'},
      {'url': 'http://search.4shared.com/q/1/David+Bowie+++Ashes+To+Ashes'},
      {'url': 'http://searchmp3.mobi/04-ashes-to-ashes-mp3-david-bowie-feat-monsters?lang=ru'},
      {'url': 'http://searchmp3.mobi/05-ashes-to-ashes-mp3-

## Metadata

We can write a little loop to print out things from the notices. For example the `title` for each notice:

In [98]:
for notice in notices:
    print(notice['title'])

BPI DMCA (Copyright) Complaint to Google
DMCA notice to Google, Inc.
BPI DMCA (Copyright) Complaint to Google
BPI DMCA (Copyright) Complaint to Google
BPI DMCA (Copyright) Complaint to Google
BPI DMCA (Copyright) Complaint to Google
BPI DMCA (Copyright) Complaint to Google
BPI DMCA (Copyright) Complaint to Google
BPI DMCA (Copyright) Complaint to Google
BPI DMCA (Copyright) Complaint to Google


We can also look at the *principal*, or who felt their copyright was being infringed on.

In [99]:
for notice in notices:
    print(notice['principal_name'])

BPI (British Recorded Music Industry) Ltd
Self
BPI (British Recorded Music Industry) Ltd
BPI (British Recorded Music Industry) Ltd
BPI (British Recorded Music Industry) Ltd
BPI (British Recorded Music Industry) Ltd
BPI (British Recorded Music Industry) Ltd
BPI (British Recorded Music Industry) Ltd
BPI (British Recorded Music Industry) Ltd
BPI (British Recorded Music Industry) Ltd


In [133]:
for notice in notices:
    print(notice['recipient_name'])

Google, Inc.              
Google, Inc.
Google, Inc.              
Google, Inc.              
Google, Inc.              
Google, Inc.              
Google, Inc.              
Google, Inc.              
Google, Inc.              
Google, Inc.              


Lumen assign topics to the notices as well.

In [71]:
for notice in notices:
    print(notice['topics'])

['DMCA Notices', 'Copyright']
['DMCA Notices', 'Copyright']
['DMCA Notices', 'Copyright']
['DMCA Notices', 'Copyright']
['DMCA Notices', 'Copyright']
['DMCA Notices', 'Copyright']
['DMCA Notices', 'Copyright']
['DMCA Notices', 'Copyright']
['DMCA Notices', 'Copyright']
['DMCA Notices', 'Copyright']


There are a lot of DMCA and Copyright take down notices in these ten. We can also look at the works that have been infringed on, and the infringing urls. These are the URLs that the principal is requesting be taken down.

In [151]:
for notice in notices:
    for work in notice['works']:
        for url in work['infringing_urls']:
            print(url['url'])

http://filetram.com/4shared/1-0/david-bowie-best-of-bowie
http://filetram.com/david-bowie-best-of-bowie-1970
http://filetram.com/david-bowie-best-of-bowie-4shared
http://filetram.com/filesonic/gorillaz-plastic-beach-rar-8624386717
http://isohunt.com/torrent_details/143548761/?tab=summary
http://isohunt.com/torrent_details/148855225/?tab=summary
http://isohunt.com/torrent_details/77928369/?tab=summary
http://kat.ph/search/david%20bowie%20best%20of%20bowie%20best%20of%20bowie/
http://mediafiremp3.com/david+bowie+best+of+bowie+06+let+s+dance-mediafire-5.html
http://mp3skull.com/mp3/david_bowie_space_oddity_david_bowie_best_of_bowie.html
http://torrentz.eu/5823324df7112343c94e477b017c084420dc79b8
http://www.airmp3.me/download/David-Bowie--Best-of-Bowie-Fame/mp3/air-a1ia1j
http://www.downloads.nl/music/David+Bowie+Best+Of+Bowie
http://www.filestube.com/6OPRCAMoxhqAMflPAd9oar/DAVID-BOWIE-BEST-OF-BOWIE.html
http://www.filestube.com/8XhH9L0NI2WoyFVHnq68QJ/David-Bowie-Best-of-Bowie-2CD-2002-EMG

And we can look at when the notices were received.

In [100]:
for notice in notices:
    print(notice['date_received'])

2012-04-29T04:00:00.000Z
2012-04-27T04:00:00.000Z
2012-04-29T04:00:00.000Z
2012-04-24T04:00:00.000Z
2012-04-28T04:00:00.000Z
2013-08-11T04:00:00.000Z
2013-06-24T04:00:00.000Z
2013-04-03T04:00:00.000Z
2012-04-27T04:00:00.000Z
2013-02-23T05:00:00.000Z


Notice that they are not in order. In fact we can request that the API return results are ordered chronologically using the `sort_by` parameter to indicate we want results ordered by `date_received` in descending order (most recent first).

In [150]:
params = {
    "term": "Bowie",
    "sort_by": "date_received desc",
    "authentication_token": key
}

response = requests.get("https://lumendatabase.org/notices/search.json",
                        headers=headers, params=params)

for notice in response.json()['notices']:
    print(notice['date_received'])

2020-12-01T00:00:00.000Z
2020-12-01T00:00:00.000Z
2020-12-01T00:00:00.000Z
2020-12-01T00:00:00.000Z
2020-12-01T00:00:00.000Z
2020-12-01T00:00:00.000Z
2020-12-01T00:00:00.000Z
2020-12-01T00:00:00.000Z
2020-12-01T00:00:00.000Z
2020-11-30T00:00:00.000Z


## A Search Function


As you can see it's getting a little bit tedious having to repeat some of the logic around doing the API requests. So before we go further let's create a little function that takes the search parameters and the number of results that are to be returned. It will handle paging, and passing the authentication key and the headers for us. Also notice how the function uses `yield` to return a value. The function is a [generator](https://docs.python.org/3/c-api/gen.html?highlight=generator) that allows it to return results as they become available, which is ideal for use in a loop.

In [214]:
def search_notices(params, max_results=100):
    page = 1
    count = 0
    stop = False
    
    while not stop:
        params['authentication_token'] = key
        params['page'] = page
        params['per_page'] = 100
        headers = {'User-Agent': 'umd-inst341'}
                
        response = requests.get("https://lumendatabase.org/notices/search.json",
                                headers=headers, params=params)
        results = response.json()
        
        for notice in results['notices']:
            count += 1
            yield notice
            if count > max_results:
                stop = True
                break
                
        if results['meta']['next_page'] is None:
            stop = True
        else:
            page += 1

Now we can try it out:

In [190]:
for notice in search_notices({"token": "Bowie"}, max_results=20):
    print(notice['id'], notice['date_received'], notice['title'])

371546 2012-08-18T04:00:00.000Z BPI DMCA (Copyright) Complaint to Google
550240 2014-02-08T05:00:00.000Z DMCA (Copyright) Complaint to Google
420779 2012-11-29T05:00:00.000Z Legal Complaint to Google
634626 2013-06-14T04:00:00.000Z Legal Complaint to Google
371547 2012-05-21T04:00:00.000Z Blog DMCA (Copyright) Complaint to Google
550241 2013-06-07T04:00:00.000Z DMCA (Copyright) Complaint to Twitter
440138 2012-05-30T04:00:00.000Z BPI DMCA (Copyright) Complaint to Google
420782 2012-05-28T04:00:00.000Z BPI DMCA (Copyright) Complaint to Google
634605 2013-06-14T04:00:00.000Z Music DMCA (Copyright) Complaint to Google
371555 2012-08-18T04:00:00.000Z DMCA (Copyright) Complaint to Google
203966 2012-05-01T04:00:00.000Z Web DMCA (Copyright) Complaint to Google
117875 2011-08-12T04:00:00.000Z DMCA (Copyright) Complaint to Google
152344 2010-06-21T04:00:00.000Z Removeyourcontent DMCA (Copyright) Complaint to Google
324867 2012-07-05T04:00:00.000Z BPI DMCA (Copyright) Complaint to Google
382379

It's a bit of a complicated function, but it is worth it because it will make it *so much* easier to run searches. If we discover a problem with it we can fix it one place, and all the places that use it will benefit from the fix.

## Display Function

Similarly, it also might be nice to have a function that will create a readable string representation of a notice, so we don't have to try to read the JSON or constantly print the same things. Python's string [format](https://docs.python.org/3/library/stdtypes.html#str.format) function is useful for this since it lets us define a template.

In [262]:
def notice_summary(notice, urls=False):
    s = """
title: {0[title]}
url: https://www.lumendatabase.org/notices/{0[id]}
received: {0[date_received]}
action: {0[action_taken]}
principal: {0[principal_name]}
sender: {0[sender_name]}
recipient: {0[recipient_name]}
topics: {0[topics]}""".format(notice)
    
    if urls:
        s += "\nurls:\n"
        for work in notice['works']:
            for url in work.get('infringing_urls', []):
                s += '- ' + url['url'] + '\n' 
                
    return s

In [215]:
for notice in search_notices({"term": "bowie"}):
    print(notice_summary(notice))


title: BPI DMCA (Copyright) Complaint to Google
url: https://www.lumendatabase.org/notices/178894
received: 2012-04-29T04:00:00.000Z
action: 
principal: BPI (British Recorded Music Industry) Ltd
sender: BPI (British Recorded Music Industry) Ltd
recipient: Google, Inc.              
topics: ['Copyright', 'DMCA Safe Harbor']

title: DMCA notice to Google, Inc.
url: https://www.lumendatabase.org/notices/158803
received: 2012-04-27T04:00:00.000Z
action: 
principal: Self
sender: Anti-Piracy Unit
recipient: Google, Inc.
topics: ['Uncategorized', 'Copyright']

title: BPI DMCA (Copyright) Complaint to Google
url: https://www.lumendatabase.org/notices/182394
received: 2012-04-29T04:00:00.000Z
action: 
principal: BPI (British Recorded Music Industry) Ltd
sender: BPI (British Recorded Music Industry) Ltd
recipient: Google, Inc.              
topics: ['Copyright', 'DMCA Safe Harbor']

title: BPI DMCA (Copyright) Complaint to Google
url: https://www.lumendatabase.org/notices/106013
received: 2012-


title: DMCA (Copyright) Complaint to Google
url: https://www.lumendatabase.org/notices/12254380
received: 2016-05-15T00:00:00.000Z
action: None
principal: APCM Mexico Member Companies
sender: APDIF - Mexico
recipient: Google Inc.
topics: ['DMCA Notices', 'Copyright']


We can even search for words with the *works* that the claim is being made about. So we could see who is trying to take down content related to *Beyoncé*.

In [225]:
for notice in search_notices({"works": "Beyoncé"}):
    print(notice_summary(notice))


title: DMCA (Copyright) Complaint to Google
url: https://www.lumendatabase.org/notices/12181694
received: 2016-05-03T00:00:00.000Z
action: None
principal: TIDAL
sender: MUSO.com Anti-piracy
recipient: Google Inc.
topics: ['DMCA Notices', 'Copyright']

title: DMCA (Copyright) Complaint to Google
url: https://www.lumendatabase.org/notices/12159497
received: 2016-04-28T00:00:00.000Z
action: None
principal: TIDAL
sender: MUSO.com Anti-piracy
recipient: Google Inc.
topics: ['DMCA Notices', 'Copyright']

title: DMCA (Copyright) Complaint to Google
url: https://www.lumendatabase.org/notices/12181225
received: 2016-05-03T00:00:00.000Z
action: None
principal: TIDAL
sender: MUSO.com Anti-piracy
recipient: Google Inc.
topics: ['DMCA Notices', 'Copyright']

title: DMCA (Copyright) Complaint to Google
url: https://www.lumendatabase.org/notices/12137181
received: 2016-04-24T00:00:00.000Z
action: None
principal: TIDAL
sender: MUSO.com Anti-piracy
recipient: Google Inc.
topics: ['DMCA Notices', 'Copy


title: DMCA (Copyright) Complaint to Google
url: https://www.lumendatabase.org/notices/11345021
received: 2015-10-22T00:00:00.000Z
action: 
principal: Sony Music Entertainment Germany GmbH
sender: proMedia
recipient: Google Inc.
topics: ['DMCA Notices', 'Copyright']


## Search Facets

The really nice thing that the Lumen API lets you do is limit the search results in lots of useful ways. For example we can limit to notices that were received by the social media platform Twitter by using the `recipient_name` API option. We can use it with our new *search_notices()* function by including it in the parameters that we pass in.


In [217]:
for notice in search_notices({"recipient_name": "twitter", "sort_by": "date_received+desc"}):
    print(notice_summary(notice))


title: DMCA (Copyright) Complaint to Twitter
url: https://www.lumendatabase.org/notices/550241
received: 2013-06-07T04:00:00.000Z
action: Yes
principal: England and Wales Cricket Board
sender: Legal Counsel
recipient: Twitter
topics: ['Copyright', 'DMCA Safe Harbor']

title: DMCA (Copyright) Complaint to Twitter
url: https://www.lumendatabase.org/notices/634711
received: 2013-12-26T05:00:00.000Z
action: Yes
principal: MX International Inc
sender: Authorized agent
recipient: Twitter
topics: ['Copyright', 'DMCA Safe Harbor']

title: DMCA (Copyright) Complaint to Twitter
url: https://www.lumendatabase.org/notices/634666
received: 2013-12-26T05:00:00.000Z
action: Yes
principal: None
sender: None
recipient: Twitter
topics: ['Copyright', 'DMCA Safe Harbor']

title: DMCA notice to Twitter
url: https://www.lumendatabase.org/notices/634676
received: 2013-12-26T05:00:00.000Z
action: Yes
principal: None
sender: None
recipient: Twitter
topics: ['Copyright']

title: DMCA (Copyright) Complaint to T


title: DMCA (Copyright) Complaint to Twitter
url: https://www.lumendatabase.org/notices/565924
received: 2013-04-19T04:00:00.000Z
action: Yes
principal: DHX Media Cookie Jar Inc
sender: Head of Online
recipient: Twitter
topics: ['Copyright', 'DMCA Safe Harbor']


We can also limit to notices that were *acted on*. These are notices that resulted in an actual takedown.

In [216]:
query = {
    "recipient_name": "twitter", 
    "sort_by": "date_received+desc",
    "action_taken": "Yes"
}

for notice in search_notices(query):
    print(notice_summary(notice, urls=True))


title: DMCA (Copyright) Complaint to Twitter
url: https://www.lumendatabase.org/notices/550241
received: 2013-06-07T04:00:00.000Z
action: Yes
principal: England and Wales Cricket Board
sender: Legal Counsel
recipient: Twitter
topics: ['Copyright', 'DMCA Safe Harbor']
urls:
- https://twitter.com/ShaunaDaly1/status/340713467407380480 
- https://twitter.com/LalitKModi/status/194819587668324352 
- https://twitter.com/AnandHarasser/status/308523130182303744 
- https://twitter.com/joeyndlovu/status/332099903398498304
- https://twitter.com/joeyndlovu/status/332098182249066497 
- https://twitter.com/GCCKaitlyn/status/330671211489132544 
- https://twitter.com/ktbug0504/status/330670612265701377 
- https://twitter.com/katmaclyndonald/status/330670217325838336 
- https://twitter.com/Sufffs/status/309978014165368832 
- https://twitter.com/SaZza_Mufc/status/285344614897049600 
- https://twitter.com/Wajihaaxx/status/283560528201281536 


title: DMCA (Copyright) Complaint to Twitter
url: https://www


title: DMCA (Copyright) Complaint to Twitter
url: https://www.lumendatabase.org/notices/565924
received: 2013-04-19T04:00:00.000Z
action: Yes
principal: DHX Media Cookie Jar Inc
sender: Head of Online
recipient: Twitter
topics: ['Copyright', 'DMCA Safe Harbor']
urls:
- No URL submitted



This is interesting because we can see in the URLs what Twitter accounts have had content removed.

https://twitter.com/AnimeRatio/status/323177529898844160

Is from user *AnimeRatio*. We could search in the last 1,000 takedown notices and count the Twitter accounts that have had tweets removed by using a regular expression to extract the username from the URL.

In [223]:
import re
from collections import Counter

users = Counter()
for notice in search_notices(query, max_results=200):
    for work in notice['works']:
        for url in work['infringing_urls']:
            m = re.search(r'twitter.com/(.+?)/status/\d+$', url['url'])
            if m:
                users[m.group(1)] += 1
                
for user, count in users.most_common():
    print(user, count)

narutogetonline 40
90animax 40
hentaistreamcom 32
Roseboutique3 22
newtvworld 13
WrestlingCity 10
#!/Bookdlws 7
#!/TimandEricFeed 6
mcdownloads 6
#!/techtime2012 5
#!/_2311802752541 4
TruthHurtsCEO 4
#!/FeedPirate 4
#!/girlsgogameinfo 4
XXXLeechcom 3
LugeeRP 3
TonsMoviesFree 3
jamstuncom 3
Kh345Khaed 3
LanaMhd 3
afkar99 3
alaaal7awe12 3
alaaalnasr1 3
#!/Moovi3 3
#!/TAMOENCORO_NET 3
#!/sonu1211 2
#!/3rbfilez 2
#!/drtyhnds 2
#!/NewtraidRu 2
#!/VortexNews 2
#!/cool4ik 2
#!/huyase_alena 2
#!/kinowarorg 2
#!/livetorrent 2
#!/maligin26 2
#!/santikov_net 2
#!/tolyanidze 2
Polaco_ 2
SheLikesItMini 2
andrej077 2
Wario64 2
Videotoolz 2
Discotheque48 2
lucky7777777a 2
Tololliteras 2
LoboSolitario 2
tv_release 2
MoviesTVSeries1 2
shap_new 2
Itz_Swagger_On 2
AlbertDunbar 2
PrivOna_Herrera 2
31_luciano 2
fatemahalmdfa3 2
ByYasinMusic 2
dz4all 2
#!/ToutG_com 2
#!/_MrMario 2
#!/adelebestoffer 2
#!/baixetudocomple 2
#!/els7a 2
#!/waythenet 2
#!/Mexicoloko 2
zackusausa 2
#!/uakino_net 2
#!/arty_in_ua 2


## Exercise

Hopefully this brief exploration of the API gives you an idea of how Lumen Database provides a window into on data curation practices on the web, and how services like Lumen are used as community infrastructure.

Please use the discussion and examples above to answer the following questions.

1. Use the `search_notices()` and `notice_summary()` functions that were created above to search for 100 notices that match "wikileaks.org".


2. What URLs are these notices attempting to remove? Can you print them out? 

3. Retrieve the most recent 1000 claims mentioning "wikileaks.org". Print out the sender names and the number of notices they sent.

4. Try to formulate a research question you can answer with the Lumen API. How would you go about answering it? (Maybe you could do this in a notebook as your final project.)