# Basic API Usage

Note: the API is in a basic state and doesn't do much more than the website
does for now! The biggest (and hopefully easy) features still to implement
are:

- caching results locally (just need to dump the json in LumenAPIManager)
- making a dataframe/pandas/polars interface (use `dataclasses.asdict` to convert the result objects to dictionaries, then feed those to pandas)
- formally doing pagination (easy to do in your own code, but it could be brought into the library)
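
As a rough sketch of the dataframe idea, here is how `dataclasses.asdict` turns dataclass instances into plain dictionaries. `ExampleNotice` is a hypothetical stand-in, not the library's real notice class, which has more fields.

```python
from dataclasses import asdict, dataclass, field


# Hypothetical stand-in for the library's notice dataclass (assumption:
# the real class has more fields than these three).
@dataclass
class ExampleNotice:
    title: str
    date_received: str
    jurisdictions: list[str] = field(default_factory=list)


notices = [
    ExampleNotice("DMCA (Copyright) Complaint to Google", "2022-02-28T00:00:00.000Z", ["DE"]),
    ExampleNotice("DMCA (Copyright) Complaint to Google", "2023-01-04T00:00:00.000Z"),
]

# asdict recursively converts each dataclass instance into a plain dictionary.
rows = [asdict(notice) for notice in notices]
print(rows[0])

# With pandas installed, these rows could then be passed to pandas.DataFrame(rows).
```

Each dictionary becomes one row, with the dataclass field names as column names.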

## Prep

Make sure you've followed the README, most notably
- install the needed libraries from `requirements.txt`
    - if you want to write a notebook, also install `ipykernel`. I'd recommend VSCode's interface so you can actually use Python typechecking
- have the LUMEN_API key in a `.env` file or in your terminal env. **NEVER PUT IT IN SOURCE CODE. NEVER PUSH IT TO GITHUB.**

In [2]:
from dotenv import load_dotenv
from os import getenv

load_dotenv()
api_key = getenv("LUMEN_API")
if not api_key:
    raise Exception("A Lumen API key needs to be in a .env file!")

## Basic Querying

In [3]:
import logging

# If you don't want logging from the API, comment this out!
logging.basicConfig(level=logging.INFO)

First, you need to create an API session. It holds our API key and handles
the timing of our requests for us. The LumenAPIManager constructor also takes the timeout in seconds
and the cache location - if you don't want caching, set cache to None.

In [4]:
from lumen.LumenAPIManager import LumenAPIManager
s = LumenAPIManager(api_key)

Entity (the people who file requests) and notice lookups just dump the JSON for now, since I prioritized searching. This can/will be improved, as they're very simple objects. Note that we can make multiple API requests in a row: the manager will take care of sleeping between requests!

In [5]:
from pprint import pprint
pprint(s.search_entity("Youtube Inc", per_page=1))
pprint(s.get_notice(5))

INFO:root:Cache hit on /entities/search.json with {'term': 'Youtube Inc', 'per_page': '1'} at cache/434fc7cd28485349912cd4e615878ce21cc756cf3266e1ad279ccfff392de359.json
INFO:root:Cache hit on /notices/5.json with None at cache/425f22d76b99a1a27b301953ce13b0e8dec71822f2641a7fd045f89ce06dbef5.json


{'entities': [{'country_code': '',
               'id': 9159,
               'name': 'YouTube, Inc.',
               'parent_id': None,
               'url': ''}],
 'meta': {'current_page': 1,
          'facets': None,
          'next_page': 2,
          'offset': None,
          'per_page': None,
          'previous_page': None,
          'query': {'term': 'Youtube Inc'},
          'total_entries': None,
          'total_pages': 10000}}
{'dmca': {'action_taken': '',
          'body': None,
          'case_id_number': None,
          'date_received': '2012-04-13T04:00:00.000Z',
          'date_sent': '2012-04-13T04:00:00.000Z',
          'id': 5,
          'jurisdictions': [],
          'language': None,
          'principal_name': None,
          'recipient_name': 'Google LLC',
          'sender_name': 'BPI (British Recorded Music Industry) Ltd',
          'tags': [],
          'title': 'BPI DMCA (Copyright) Complaint to Google',
          'topics': ['Copyright', 'DMCA Safe Harbor'],


Searching is more in-depth. Create a new search query, add all the terms you want (use your IDE's autocomplete), and then search. You'll get a SearchResult back! Let's get the first 3 Star Wars results.

In [6]:
from lumen.SearchQuery import SearchQuery, Topic
result = SearchQuery(s).with_query("star wars").with_amount(3).with_topic(Topic.DMCANotice).search()

INFO:root:Requesting /notices/search.json with params {'term': 'star wars', 'per_page': '3', 'topics': <Topic.DMCANotice: 'DMCA Notices'>}
INFO:root:Caching at cache/334cdb621c7474875c4a40a39dfada619314c3ffd627014d74ebdde9e15323e9.json


First, we can look at the metadata. This is pretty powerful by itself - without having to get every single notice, we get plenty of numbers about *every* entry that matched our term. For instance, who are the top 10 principals for Star Wars content? (A principal is the one who owns the content, as far as I can tell)

In [7]:
result.metadata.principals

[NameCount(name='BPI LTD MEMBER COMPANIES', instances=223082),
 NameCount(name='Mauris Film', instances=93127),
 NameCount(name='StarMedia', instances=88735),
 NameCount(name='CM.', instances=73080),
 NameCount(name='R1', instances=67699),
 NameCount(name='R-1', instances=65736),
 NameCount(name='MG Premium Ltd', instances=58013),
 NameCount(name='VGT', instances=40359),
 NameCount(name='BPI (British Recorded Music Industry) Ltd', instances=35496),
 NameCount(name='sacem', instances=30403)]

BPI seems to be the biggest principal - the second-to-last entry lets us know that it's the British Recorded Music Industry, so these are likely based on the soundtrack, interesting. Who's filing these requests, if it isn't the principal?

In [8]:
result.metadata.senders  # Note: submitters are those who submitted it to Lumen, senders are those who sent the request

[NameCount(name='STAR MEDIA CONTENT PROTECTION', instances=294008),
 NameCount(name='BPI (British Recorded Music Industry) Ltd', instances=255142),
 NameCount(name='Star Media LLC.', instances=150161),
 NameCount(name='STAR MEDIA', instances=122036),
 NameCount(name='MG Premium Ltd.', instances=62081),
 NameCount(name='rivendell', instances=48450),
 NameCount(name='AudioLock.NET', instances=37934),
 NameCount(name='Link-Busters.com', instances=31346),
 NameCount(name='Remove Your Media LLC', instances=30942),
 NameCount(name='Recording Industry Association of America, Inc.', instances=29690)]

BPI is definitely submitting their own, but Star Media could be something to look into. Use your IDE to look into the other fields in metadata! We get all of this without getting all of those individual entries, pretty sweet.

Now we can take a look at some entries.

In [9]:
pprint(result.notices[0])

Notice(title='DMCA (Copyright) Complaint to Google',
       type=<NoticeType.Dmca: 'dmca'>,
       date_sent='2022-02-28T00:00:00.000Z',
       date_received='2022-02-28T00:00:00.000Z',
       topics=[<Topic.DMCANotice: 'DMCA Notices'>,
               <Topic.Copyright: 'Copyright'>],
       tags=[],
       jurisdictions=['DE'],
       infringing_urls=Counter({'canna-power.to': 14, 'rapidgator.net': 2}),
       works=['Meade,Austin - Black Sheep',
              'Star Wars - Episode 5 - Das Imperium Schlägt Zurück',
              'Star Wars - Episode 6 - Die Rückkehr Der Jedi Ritter',
              'Star Wars - Solo: A Star Wars Story',
              'Star Wars - Star Wars: Angriff Der Klonkrieger',
              'Star Wars - Star Wars: Die Dunkle Bedrohung',
              'Star Wars - Star Wars: Die Letzten Jedi',
              'Star Wars - Star Wars: Die Rache Der Sith',
              'Star Wars - Star Wars: Eine Neue Hoffnung',
              'Warbringer - Weapons Of Tomorrow'],
      

So we get one request here. We see that it's DMCA, that it was within the DE jurisdiction, and that the two sites it targeted were canna-power.to and rapidgator.net, along with a count of how many times each of those sites occurred. Let's join the URL counts together for all of the requests we got back and see which were the most common:

In [10]:
from functools import reduce
reduce(lambda a, b: a + b, (notice.infringing_urls for notice in result.notices)).most_common(10)

[('book4you.org', 210),
 ('nerdebooks.com', 37),
 ('gosemuts.space', 30),
 ('minhateca.com.br', 16),
 ('canna-power.to', 14),
 ('toutbox.fr', 14),
 ('chomikuj.pl', 12),
 ('185.231.223.131', 9),
 ('123moviesgo.ws', 8),
 ('103.194.171.185', 7)]

Haha, that's pretty cool, interesting that book websites were actually the highest number of requests here. book4you.org has been seized by the United States Government, and I'm probably on a list now for trying to visit it.
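
One caveat with the `reduce` call above: without an initial value, `reduce` raises a `TypeError` when the sequence is empty (for example, a search with no notices). A small alternative sketch is to start from an empty `Counter` and `update` it. The per-notice counts below are stand-in data for illustration, not real API output.

```python
from collections import Counter

# Stand-in per-notice URL counts (illustrative numbers, not real results).
per_notice = [
    Counter({"canna-power.to": 14, "rapidgator.net": 2}),
    Counter({"book4you.org": 5, "canna-power.to": 1}),
    Counter(),  # a notice with no URLs is fine too
]

# Start from an empty Counter so an empty list of notices also works.
total = Counter()
for counts in per_notice:
    total.update(counts)  # adds the counts in place

top = total.most_common(2)
print(top)
```

With real results, `per_notice` would be `(notice.infringing_urls for notice in result.notices)`, as in the cell above.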

You can also access the raw json with `result.raw`. I haven't mapped everything, if there's something in the JSON you want in the python object feel free to add it or let me know. The mapping from JSON to object is in `SearchResult.py`.

Anyways, that's generally where it's at right now: it returns Python objects that you can explore and compare. I definitely want to get it into a dataframe, which shouldn't be that bad.

## Basic Pagination Example

Haven't added this as a function, but pagination is pretty simple:

In [11]:
from lumen.SearchResult import Notice
# Build a query and set the number per page, but don't set the page or search it yet
basic_query = SearchQuery(s).with_amount(1).with_query("Skinamarink")
# Start building a list of notices
notices: list[Notice] = []

# Be sure to use range 1, since we don't want page 0.
for page in range(1, 4):
    # Set the page for the query each time and search it!
    data = basic_query.with_page(page).search()
    notices.extend(data.notices)

# notices now holds all the notices from all pages!
[(notice.title, notice.date_received) for notice in notices]

INFO:root:Sleeping for 2 seconds
INFO:root:Requesting /notices/search.json with params {'per_page': '1', 'term': 'Skinamarink', 'page': '1'}
INFO:root:Caching at cache/8d9d026a6e9e52dcf45fb8aeee277fe53087d287a73fa1aae916a4f1582861fa.json
INFO:root:Sleeping for 2 seconds
INFO:root:Requesting /notices/search.json with params {'per_page': '1', 'term': 'Skinamarink', 'page': '2'}
INFO:root:Caching at cache/502c81a7182dec1f2869e2382c76a9c00fe10e25e6d5f1c1ae0cd49f23e7f61f.json
INFO:root:Sleeping for 2 seconds
INFO:root:Requesting /notices/search.json with params {'per_page': '1', 'term': 'Skinamarink', 'page': '3'}
INFO:root:Caching at cache/9fc911d3f4d6a283086c8344f7fbf2938cab1c5b3cdc7c1ef38b48956ae3dd95.json


[('DMCA (Copyright) Complaint to Google', '2023-01-04T00:00:00.000Z'),
 ('DMCA (Copyright) Complaint to Google', '2023-01-18T00:00:00.000Z'),
 ('DMCA (Copyright) Complaint to Google', '2022-12-29T00:00:00.000Z')]
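
If you end up repeating this loop a lot, it could be wrapped in a small helper. The sketch below is one possible shape, not something the library provides: `collect_pages` and `fetch_page` are hypothetical names, and the fetcher here is a stand-in so the example runs without the API.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def collect_pages(fetch_page: Callable[[int], list[T]], first_page: int, last_page: int) -> list[T]:
    """Call fetch_page for each page in [first_page, last_page] and join the results."""
    items: list[T] = []
    for page in range(first_page, last_page + 1):
        batch = fetch_page(page)
        if not batch:
            break  # stop early once a page comes back empty
        items.extend(batch)
    return items


# Stand-in fetcher so the sketch runs without network access.
fake_pages = {1: ["notice A"], 2: ["notice B"], 3: ["notice C"]}
collected = collect_pages(lambda page: fake_pages.get(page, []), 1, 5)
print(collected)
```

With the real session, the fetcher could be `lambda page: basic_query.with_page(page).search().notices`, mirroring the loop above.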

## Date Ranges

Date ranges use `date` from Python's `datetime` module. The API lets us get down to millisecond precision - I assume we don't care about that, so I left it at day precision :)

In [46]:
from datetime import date
date_result = SearchQuery(s).with_works_desc(
    "Everything Everywhere All At Once", works_require_all=True
    ).with_date_range(date(2015, 1, 1), date(2020, 1, 1)).with_amount(100).search()
{work for notice in date_result.notices for work in notice.works if "all at once" in work.lower()}

INFO:root:Cache hit on /notices/search.json with {'works': 'Everything Everywhere All At Once', 'works-require-all': 'true', 'date_received_facet': '1420099200000..1577865600000', 'per_page': '100'} at cache/7006f69870c1cd8ce1efeb88a22a9c6d9699e6fbf6cfd1bdf3832e4188ee0513.json


{'"Blue Ocean Floor - Archetype", "Krewella - Human (Trapstep Remix)", "Akilla/Justin Timberlake - Amnesia", "SCORPIONS - Still loving you [Napisy PL]", "[Diabolik lovers] Yui Komori _ Human", "Mauricio Skate Edit", "Trade Wind", "Лучшие умирают молодыми. The Scorpions - The Good Die Young", "Christina Aguilera - All I Need (subtítulos español)", "CHRISTINA AGUILERA NOT MYSELF TONIGHT  ULTIMIX BY JULIO SKOV", "Christina Aguilera - Not Myself Tonight (JMBW & Joel Dickinson Video Mix)", "Annie & Albert "The Perfect Pair"", "[kst.vn]il divo mama[kste] 640x360", "Il Divo - Mama", "a day on the trail", "Cycling into the Continental U.S.", "Ozzy Osbourne - The Ultimate Sin", "Camp July 6th-8th 2012", "DANCEACTION | SUMMERCLASS | LUKE | Chris Brown - 4 Years old", "MGMT "Alien Days"", "The Schacht Of Flight", "Rob Webb riding DMR Bolt", "Hipnoxis TV present: AWF - Hall of Fame Expo @ La Respuesta", "SkiSki Alpe d\'Huez 2016", "Pnoise Memories", "Graduation Party", "Fanfiction Trailer "Trading

Huh? Where's our movie?

Ah duh, the award winning "Everything Everywhere All At Once" didn't even come out until 2022 :p

This is also a good lesson that "require_all" really doesn't seem to do a whole lot - maybe those 5 words are in these requests, but they certainly aren't all together :/ Be sure to filter what you get back.
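
A minimal sketch of that kind of filtering: keep only the notices whose works actually contain the whole phrase, ignoring case. `mentions_phrase` is a hypothetical helper, and the lists below are stand-ins for `notice.works`.

```python
def mentions_phrase(works: list[str], phrase: str) -> bool:
    """Return True if any work description contains the whole phrase, ignoring case."""
    needle = phrase.casefold()
    return any(needle in work.casefold() for work in works)


# Stand-ins for notice.works taken from two different notices.
example_works = [
    ["Il Divo - Mama", "Ozzy Osbourne - The Ultimate Sin"],
    ["Everything Everywhere All at Once"],
]

kept = [works for works in example_works
        if mentions_phrase(works, "Everything Everywhere All At Once")]
print(kept)
```

With real results this would become something like `[n for n in date_result.notices if mentions_phrase(n.works, "Everything Everywhere All At Once")]`.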

Let's look at some more recent requests then:

In [48]:
from collections import Counter
new_date_result = SearchQuery(s).with_works_desc(
    "Everything Everywhere All At Once", works_require_all=True
    ).with_date_range(date(2020, 1, 1), date(2023, 5, 23)).with_amount(100).search()

Counter(work for notice in new_date_result.notices for work in notice.works if "everywhere" in work.lower())

INFO:root:Cache hit on /notices/search.json with {'works': 'Everything Everywhere All At Once', 'works-require-all': 'true', 'date_received_facet': '1577865600000..1684825200000', 'per_page': '100'} at cache/1ba448b9e43bddfea30718564aaa90eaa3e316e98b0fb40f405ca6a1662cdf56.json


Counter({'Everything Everywhere All At Once': 96,
         'Everything Everywhere All at Once': 3,
         'Os direitos autorais do meu filme "Everything Everywhere All at Once", estão sendo violados pelo trecho publicado neste site, que inicia-se com: Everything Everywhere All at Once\r\n\r\nA violação leva a transmissão ilegal do conteúdo, não autorizamos esta reprodução.': 1})

There we go, that's more realistic for 2020-2023 :)
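
A side note on the `date_received_facet` values in the log lines above: they look like milliseconds since the Unix epoch. The sketch below shows how a `date` maps to such a value. `midnight_epoch_ms` is a hypothetical helper, not the library's function. The logged lower bound for 2015-01-01 matches midnight at UTC-8 rather than UTC, which suggests the conversion used the local timezone of the machine that ran the query; that is an inference from the log, not something confirmed by the code.

```python
from datetime import date, datetime, time, timedelta, timezone


def midnight_epoch_ms(day: date, tz: timezone = timezone.utc) -> int:
    """Milliseconds since the Unix epoch for midnight on `day` in timezone `tz`."""
    moment = datetime.combine(day, time.min, tzinfo=tz)
    return int(moment.timestamp() * 1000)


utc_start = midnight_epoch_ms(date(2015, 1, 1))
# Assumption for illustration: a fixed UTC-8 offset.
pst_start = midnight_epoch_ms(date(2015, 1, 1), timezone(timedelta(hours=-8)))
print(utc_start, pst_start)
```

If that inference holds, the same date range run in a different timezone would produce slightly different facet bounds, which is worth keeping in mind when comparing results with groupmates.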

Finally, remember to close your session :) Not a biggie but it's polite. If you're writing a script you can instead do this in a with block (see `basictest.py`)

In [49]:
s.close()

## Loading from a Cache

If you want to quickly load all data from a previous session but don't have the exact requests, or if a groupmate has shared their cache and you want to load it, you can use `load_all_cache_entries` from `SearchResult` and you'll get a list of all the entries in that cache.

This is easier if you name the cache something meaningful when you initialize a session. For instance, `s = LumenAPIManager(api_key, cache=Path("govtTakedowns2016-2022"))` will create a cache with that folder name.

In [2]:
from pathlib import Path
from lumen.SearchResult import load_all_cache_entries
entries = load_all_cache_entries(Path("cache"))

[entry.title for entry in entries][:10]

['SHESAID COPY',
 'DMCA ABUSE ',
 'http://www.webcamrecordings.com/category/all-sites/flirt4free/',
 'DMCA ',
 'DMCA notice to Google Inc',
 'A false report about a link to illegal download',
 'Project Free Tv',
 'Project Free Tv',
 'DMCA (Copyright) Complaint to Google',
 'LO MAAN LIYA Lyrical | Raaz Reboot | Arijit Singh | Emraan Hashmi, Kriti Kharbanda, Gaurav Arora']
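
If you only want the raw JSON from a shared cache folder, a plain-Python sketch might look like the one below. It is not the implementation of `load_all_cache_entries`; `load_raw_cache` is a hypothetical helper, and the `"notices"` key in the sample payloads is an assumption about the response shape. It writes two tiny files to a temporary folder so it runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_raw_cache(cache_dir: Path) -> list[dict]:
    """Parse every *.json file in cache_dir, in filename order."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(cache_dir.glob("*.json"))
    ]


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    # Stand-in cache files; the "notices" key is an assumed payload shape.
    (cache_dir / "a.json").write_text(
        json.dumps({"notices": [{"title": "DMCA "}]}), encoding="utf-8")
    (cache_dir / "b.json").write_text(
        json.dumps({"notices": [{"title": "Project Free Tv"}]}), encoding="utf-8")
    payloads = load_raw_cache(cache_dir)

titles = [notice["title"] for payload in payloads for notice in payload["notices"]]
print(titles)
```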