# PROGRES 2023 - Mini-Projet 2 
# API Web 

Fabien Mathieu - fabien.mathieu@normalesup.org

Sébastien Tixeuil - Sebastien.Tixeuil@lip6.fr

The purpose of this mini-project is to use (among other things) the Python library Bottle to provide, on one hand, a specific API for the website http://dblp.uni-trier.de/, which aggregates all scientific publications in computer science, and on the other hand, a website that allows the use of the aforementioned API.

You will need to download a copy of DBLP's data to carry out the project. The DBLP website provides all publications in the form of an XML file, but large XML files can be difficult to parse.

Therefore, you will use the following file instead: https://zenodo.org/records/7069915/files/2022-08-30-papers.jsonl.gz?download=1

This file is in compressed [json lines](https://jsonlines.org/) format:

- A [gzip compression](https://docs.python.org/3/library/gzip.html) is used to reduce the size of the data on the hard drive;
- The data itself is made of lines, and each line represents the json of a paper.

Typically, you will access the content of the file by doing something like this:

```python
import gzip
from json import loads

with gzip.open('2022-08-30-papers.jsonl.gz', 'rt', encoding='utf-8') as f:  # read-only text mode (the default mode is binary)
    for line in f:  # iterate through the lines
        paper = loads(line)  # load the paper as a dictionary
        # You can work on the current paper here
```

Hint: You should never uncompress the content of a gz file onto your hard drive. The Python [gzip module](https://docs.python.org/3/library/gzip.html) is designed so you can open a compressed file as if it were already uncompressed.
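The hint above applies to writing as well as reading. A minimal sketch, assuming two hypothetical helper names (`write_jsonl_gz` and `read_jsonl_gz`) that are not part of the assignment:

```python
import gzip
import json


def write_jsonl_gz(path, records):
    """Write an iterable of dictionaries as compressed JSON lines."""
    # 'wt' opens the compressed file for writing in text mode.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl_gz(path):
    """Yield one dictionary per line of a compressed JSON lines file."""
    # 'rt' opens the compressed file for reading in text mode.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```

Because `read_jsonl_gz` is a generator, only one line is decoded at a time, which matters for a file of this size.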

**REMINDER:** Stack Overflow is your friend... **But** if you use something found on the Internet you must **cite the source** and **explain what that something does**. Copy/paste with no evidence of understanding will be sanctioned. Copy/paste without citing your source will be heavily sanctioned.

# Exercise 1: Small is beautiful

Before working on your API, you need to understand and adapt your dataset.

- What is the size of the dataset (the file `2022-08-30-papers.jsonl.gz`)?
- How many papers are there?
- What is the typical structure of a paper?

OK, maybe there is too much information. Write a Python program to transform the dataset into something more compact. The new file, which you can call `papers.jsonl.gz`, should have the following spec:

- It should be in compressed json-lines format;
- Each line should have the following fields:
  - `title`
  - `venue`
  - `year` (as an integer)
  - `authors`, which should contain a list of `authorId` as integers
- You can discard the authors without `authorId`
- If an input paper is missing a field (e.g. it is empty or `None`), discard the paper
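The spec above can be sketched as a single function applied to each input paper. This is only one possible reading: the name `compact_paper` is hypothetical, the sketch assumes that each entry of the raw `authors` list is a dictionary carrying an `authorId` key (check this against the actual dataset), and it treats a paper whose authors all lack an `authorId` as missing the `authors` field.

```python
from typing import Optional


def compact_paper(paper: dict) -> Optional[dict]:
    """Return the compact form of a raw paper, or None if it must be discarded."""
    title = paper.get("title")
    venue = paper.get("venue")
    year = paper.get("year")
    raw_authors = paper.get("authors")
    # Discard the paper if any required field is absent, empty or None.
    if not title or not venue or not year or not raw_authors:
        return None
    try:
        year = int(year)
    except (TypeError, ValueError):
        return None
    author_ids = []
    for author in raw_authors:
        author_id = author.get("authorId")
        if author_id in (None, ""):
            continue  # authors without an authorId are dropped
        try:
            author_ids.append(int(author_id))
        except (TypeError, ValueError):
            continue  # assumption: non-numeric ids are treated as missing
    if not author_ids:
        return None  # assumption: no usable author means the field is missing
    return {"title": title, "venue": venue, "year": year, "authors": author_ids}
```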

While you build the new file, you should also construct a *dictionary of authors* that associates each `authorId` with the name of the author and the list of her publications. Each publication should be referenced by an integer that gives its position inside `papers.jsonl.gz`. How you encode the position is up to you. For example, it could be the line number of the publication or its actual position in the file (which you can obtain with the Python `tell` method).

Depending on how you decided to reference publications, you may build additional structures to facilitate the access to publications.
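As an illustration of the dictionary of authors, here is a minimal sketch that uses the line number as the position. The function name `index_authors`, the layout of each entry (`name` and `papers` keys), and the assumption that raw authors are dictionaries with `authorId` and `name` keys are all choices made for this example, not requirements.

```python
def index_authors(index: dict, position: int, raw_authors: list) -> None:
    """Record that the paper stored at `position` was written by `raw_authors`."""
    for author in raw_authors:
        author_id = author.get("authorId")
        if author_id in (None, ""):
            continue  # authors without an authorId are not indexed
        # Create the entry on first sight, then append the paper position.
        entry = index.setdefault(
            int(author_id), {"name": author.get("name"), "papers": []}
        )
        entry["papers"].append(position)
```

Calling this function once per retained paper, with a counter incremented each time a line is written to the output file, yields the line-number variant of the index.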

- What is the size of `papers.jsonl.gz`?
- How many publications are there?

To learn how to use your new file(s) and to test them, write the following methods:
- A `search_author(name: str) -> list` method that returns the list of `authorId` of authors whose name contains `name`.
- A `get_paper(position: int) -> dict` method that returns the json of a publication.
- A `get_author_papers(author_id:int) -> list` that returns the list of publications of one author.
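One way these three methods might look, assuming positions are line numbers and the dictionary of authors maps each `authorId` to a dictionary with `name` and `papers` keys. The extra parameters (`authors`, `path`) are additions made here so that the sketch is self-contained; they are not part of the required signatures.

```python
import gzip
import json
from itertools import islice


def search_author(name, authors):
    """Return the ids of authors whose name contains `name` (case-insensitive)."""
    needle = name.lower()
    return [aid for aid, info in authors.items() if needle in info["name"].lower()]


def get_paper(position, path="papers.jsonl.gz"):
    """Return the publication stored on line `position` (0-based)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # Skip `position` lines, then take the next one if it exists.
        line = next(islice(f, position, position + 1), None)
    if line is None:
        raise IndexError(f"no publication at position {position}")
    return json.loads(line)


def get_author_papers(author_id, authors, path="papers.jsonl.gz"):
    """Return the publications of one author as a list of dictionaries."""
    return [get_paper(p, path) for p in authors[author_id]["papers"]]
```

This version of `get_paper` rereads the file from the start on every call, which is simple but slow for large positions; storing file offsets or keeping the compact papers in memory are possible alternatives.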

# Exercise 2. Providing a Web API 

Using Python and the Bottle package, build a web server that implements the following API:
- `/publications/{id}` : where `id` is the position of a publication, returns the corresponding publication.
- `/publications` : returns by default the first 100 publications. The value 100 can be modified by sending a URL parameter `limit`.
- `/authors/{author_id}` : where `author_id` is the `authorId` of an author, returns the following information about an author: name, number of publications that she co-authored, number of co-authors.
- `/authors/{author_id}/publications` : returns the publications of an author (list of dictionaries).
- `/authors/{author_id}/coauthors` : returns the co-authors of one author (name and authorId).
- `/search/authors/{searchString}` : where `searchString` is a string used to look up an author. This route should return the authors whose name contains `searchString` (for example, `/search/authors/w` returns the authors whose name contains `w` or `W`).
- `/search/publications/{searchString}` : where `searchString` is a string, returns the list of publications whose title contains `searchString`. The route should accept a URL parameter `filter` formatted like `key1:value1,key2:value2,...` to restrict the search to the publications where key `keyi` contains `valuei`. For example, `/search/publications/robots?filter=author:Jean,venue:acm` should return the list of publications whose title contains `robots`, where the name of one of the authors contains `Jean`, and whose venue contains `acm`.
- `/authors/{id_origin}/distance/{id_destination}` : where `id_origin` and `id_destination` are two `authorId`, returns the collaboration distance between the two authors. That distance is the length of the shortest path `(id_origin, auth1, auth2, ..., authX, id_destination)`, where `id_origin` and `auth1` are co-authors, `auth1` and `auth2` are co-authors, ... and `authX` and `id_destination` are co-authors. In particular, an author is at distance 0 from herself and two co-authors are at distance 1. In addition to the distance, the response should include one shortest path between the two authors.
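The collaboration distance defined in the last route is a shortest path in an unweighted graph, so a breadth-first search is one natural way to compute it. The sketch below assumes a hypothetical `coauthors` dictionary mapping each `authorId` to the set of its co-authors' ids; the function name and return convention are also choices made for this example.

```python
from collections import deque


def collaboration_path(origin, destination, coauthors):
    """Return one shortest path from origin to destination, or None if none exists.

    The collaboration distance is len(path) - 1.
    """
    if origin == destination:
        return [origin]
    parents = {origin: None}  # also serves as the set of visited authors
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for neighbor in coauthors.get(current, ()):
            if neighbor in parents:
                continue
            parents[neighbor] = current
            if neighbor == destination:
                # Walk back through the parents to rebuild the path.
                path = [neighbor]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None
```

Returning `None` for unconnected authors leaves it to the route to decide how to report that case in the API's error format.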

The developed API should have the following characteristics:

- All errors should have the same format.
- Each route must be documented with the return format, possible errors, and an explanation of parameters.
- Each route that returns a list should return a maximum of 100 elements and should accept URL parameters `start` and `count` to display `count` elements starting from the `start`-th element. For example: `/search/authors/*` should return the first 100 authors, `/search/authors/*?start=100` displays the next 100, and `/search/authors/*?start=200&count=2` displays the next 2 elements.
- For each route that returns a list, the returned elements should be sortable based on a given field using a URL parameter `order`. For example: `/search/publications/*?order=venue` displays the first 100 publications sorted in alphabetical order by the name of the venue in which they appear.
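The `start`, `count` and `order` behaviour can be factored into one helper shared by every list route. This is a sketch under assumptions: the name `paginate` is hypothetical, `start` is taken to be 0-based (consistent with `start=100` showing the elements after the first 100), and strings are compared case-insensitively.

```python
def paginate(items, start=0, count=100, order=None, max_count=100):
    """Sort `items` by the `order` field if given, then return one page."""
    if start < 0 or count < 0:
        raise ValueError("start and count must be non-negative")
    count = min(count, max_count)  # never return more than max_count elements
    if order is not None:

        def sort_key(item):
            value = item[order]  # raises KeyError if the field does not exist
            return value.casefold() if isinstance(value, str) else value

        items = sorted(items, key=sort_key)
    return items[start:start + count]
```

Sorting before slicing matters: the page must be taken from the fully ordered list, not from the first 100 unsorted elements.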

# Exercise 3. Testing a Web API

Using `pytest`, write a program that checks that the API made in the previous exercise works as expected.

# Exercise 4. Website using a Web API

Create a Python web server using the Bottle library that utilizes the Web API developed in exercise 2 to offer the user a graphical Web interface. This interface allows the user to obtain, by entering relevant information into a Web form:

- The complete list of publications and the complete list of co-authors of an author, possibly sorted alphabetically. This author can be searched beforehand using a substring of characters appearing in her name.
- The distance between two authors. As above, the authors can be searched beforehand using a substring of characters appearing in their names.