## API Scraping

### A simple API query
You will start with the basics: how to make a simple request to an [API endpoint](../../2.python/2.python_advanced/05.Scraping/5.apis.ipynb).

You will use the external [requests](https://requests.readthedocs.io/en/latest/) library, which you load with the `import` keyword.

Note that external libraries must be installed before they can be imported. Check each library's documentation for installation instructions.

Check the [quickstart](https://requests.readthedocs.io/en/latest/user/quickstart/) section of the `requests` library's documentation to:
1. use the `get()` method to connect to this endpoint: https://country-leaders.onrender.com/status
2. check if the `status_code` is equal to 200, which means OK
    * if OK, `print()` the `text` of the response
    * if not, `print()` the `status_code`

Here is an overview of the essential [HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).
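
To make these codes concrete, here is a minimal sketch that translates a numeric status code into its standard reason phrase with Python's built-in `http.HTTPStatus` enum. The helper name `describe_status` is hypothetical and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return e.g. '200 OK' for a known code, or '<code> (unknown)' otherwise."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define
        return f"{code} (unknown)"


for code in (200, 403, 404):
    print(describe_status(code))
```

A 2xx code means the request succeeded, while a 4xx code such as 403 means the server refused or could not handle the request as sent.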

In [9]:
# import the requests library (1 line)
import requests
# assign the root url (without /status) to the root_url variable for ease of reference (1 line)
root_url = "https://country-leaders.onrender.com/"
# assign the /status endpoint to another variable called status_url (1 line)
status_url = "status"
# query the /status endpoint using the get() method and store it in the req variable (1 line)
req = requests.get(root_url+status_url)
# check the status_code using a condition and print appropriate messages (4 lines)
if req.status_code == 200:
    # req.text is the decoded body; req.content would be the raw bytes
    print("Response content:\n", req.text)
else:
    print("Error response code:", req.status_code)

# assign the output to the leaders variable (1 line)
leaders = requests.get(root_url + "leaders?country=fr")
# display the leaders variable (1 line)
print(leaders)
# does it work?
# No: without a cookie the API answers <Response [403]> (Forbidden).

Response content:
 "Alive"
<Response [403]>

### Cookies anyone?

It looks like the access to this API is restricted...
Query the `/cookie` endpoint and extract the appropriate field to access your cookie.

You will need to use this cookie in each of the following API requests: `/countries`, `/leaders`, `/leader`.

Try to query the countries endpoint using the cookie, save the output and print it.
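
One way to reuse the cookie across several requests is to let `requests.Session` store it: a session keeps the cookies set by a response and sends them back on later requests. The sketch below assumes that the cookie returned by `/cookie` stays valid long enough for the follow-up request; the names `ROOT_URL`, `get_json` and `fetch_countries` are illustrative.

```python
import requests

ROOT_URL = "https://country-leaders.onrender.com/"


def get_json(session, endpoint, **params):
    """GET ROOT_URL + endpoint through the given session and decode the JSON body."""
    response = session.get(ROOT_URL + endpoint, params=params)
    response.raise_for_status()  # raise an exception on 4xx/5xx answers
    return response.json()


def fetch_countries():
    """Ask for a cookie, then query /countries with the same session."""
    with requests.Session() as session:
        session.get(ROOT_URL + "cookie")  # the session stores the cookie it receives
        return get_json(session, "countries")
```

Calling `fetch_countries()` performs both requests. If the cookie expires between calls, the second request fails and a fresh cookie has to be requested.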

In [10]:
# endpoints, relative to root_url
country_url = "countries"
cookie_url = "cookie"

# get a fresh cookie from the API and wrap it in request headers
def get_cookie(root_url, cookie_url):
    req = requests.get(root_url+cookie_url)
    # Set-Cookie has the form "<name>=<value>; <attributes>"; keep only the value
    cookie = "user_cookie=" + req.headers["Set-Cookie"].split("=")[1].split(";")[0]
    headers = {
        'accept': 'application/json',
        'Cookie': cookie
        }
    return headers

# get the list of countries, passing the cookie in the request headers
def list_countries(root_url, cookie_url):
    response_list_countries = requests.get(root_url+country_url, headers=get_cookie(root_url=root_url,cookie_url=cookie_url))
    if response_list_countries.status_code == 200:
        return response_list_countries.json()
    print("Request failed:", response_list_countries.status_code)
    return None

# save the output and print it
countries = list_countries(root_url=root_url, cookie_url=cookie_url)
print(countries)

['ma', 'us', 'fr', 'be', 'ru']
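
Splitting the `Set-Cookie` header by hand, as `get_cookie` does, works for a simple value but truncates it if the value itself contains an `=`. As a hedged alternative, the standard library's `http.cookies.SimpleCookie` can parse the header. The function name `cookie_header` and the sample header string are made up for illustration; the header actually sent by the API may carry different attributes.

```python
from http.cookies import SimpleCookie


def cookie_header(set_cookie: str) -> str:
    """Turn a Set-Cookie header value into a value for the Cookie request header."""
    jar = SimpleCookie()
    # attributes such as Path or Max-Age are attached to the cookie, not stored as cookies
    jar.load(set_cookie)
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


# made-up header value, for illustration only
sample = "user_cookie=abc123; Path=/; Max-Age=60"
print(cookie_header(sample))
```

With `requests` itself, the `cookies` attribute of a response can also be passed directly as the `cookies=` argument of the next request, which avoids building the header manually.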


## Extracting data from Wikipedia

Query one of the leaders' Wikipedia URLs (from the `/leader` API endpoint) and display its `text` (not JSON).

In [11]:
# 3 lines
import random
import requests

leaders_url = "leaders"

# fetch the Wikipedia page of a random leader of the given country
def get_leader(root_url, cookie_url, country="fr"):
    response = requests.get(root_url+leaders_url, params={"country": country}, headers=get_cookie(root_url=root_url, cookie_url=cookie_url))
    if response.status_code != 200:
        print(f'Error: {response.status_code}')
        return None
    # random.choice never picks an out-of-range index, unlike round(random.random()*len(...))
    leader_dict = random.choice(response.json())
    return requests.get(leader_dict['wikipedia_url'])

get_leader(root_url=root_url, cookie_url=cookie_url)

<Response [200]>

Ouch! You get the raw HTML code of the webpage. If you try to deal with it without tools, you will be there all night. Instead, use the [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) *external* library.

As shown in the **quickstart** section, start by importing the library and loading the HTML returned by your request function (`get_leader()` in the cell above).

Use the `prettify()` function and print it to take a look. You will start the actual parsing in the next step.

In [12]:
# 3 lines
from bs4 import BeautifulSoup
 
# Parse HTML Code
soup = BeautifulSoup(get_leader(root_url=root_url, cookie_url=cookie_url).content, 'html.parser')
print(soup.prettify())


That looks better but you need to extract the right part of the webpage: the text of the first paragraph.

It is a bit tricky because Wikipedia pages slightly differ in structure from one language to the next. We cannot simply get the text for the first HTML paragraph.

You will start by getting all the HTML paragraphs from the HTML source and saving them in the `paragraphs` variable.

Use the documentation or google the appropriate keywords.

In [None]:
# 2 lines
all_paragraphs = [paragraph.text for paragraph in soup.find_all('p')]
print(all_paragraphs)

If you try different urls, you might find that the paragraph you want may be at a different index each time.

That is where you need to be clever and ask yourself what would be a reliable way to identify the right index, i.e., which string matches only the first paragraph whatever the language...

Spend a good 30 minutes on the problem and brainstorm with your fellow learners. If you come out empty handed, ask your coach.

1. Loop over the HTML paragraphs
2. When you have identified the correct one
    * store the [text](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#output) inside the `first_paragraph` variable
    * exit the loop
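
One heuristic that is often suggested for this exercise, offered here as an assumption rather than a guaranteed rule: the introductory paragraph of a Wikipedia article usually shows the subject's name in bold, while the stray paragraphs before it usually do not. The sketch below applies that idea to a small made-up HTML snippet; `find_first_paragraph` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def find_first_paragraph(html: str):
    """Return the text of the first <p> that contains a <b> tag, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for paragraph in soup.find_all("p"):
        if paragraph.find("b") is not None:
            return paragraph.get_text()
    return None


# made-up page fragment, for illustration only
sample_html = """
<p class="mw-empty-elt"></p>
<p>Coordinates and other clutter</p>
<p><b>Jane Doe</b> is a fictional leader used for this example.</p>
<p>Second paragraph.</p>
"""
print(find_first_paragraph(sample_html))
```

Because the check relies on markup rather than on a name or a date, it does not depend on the language of the page, but it should still be verified on real URLs in several languages.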

In [None]:
# pick one leader, then fetch the Wikipedia page of that same leader
response = requests.get(root_url + 'leaders', params={'country': 'fr'}, headers=get_cookie(root_url=root_url, cookie_url=cookie_url))
leader_dict = random.choice(response.json())
first_name = leader_dict['first_name']
print(first_name)
soup = BeautifulSoup(requests.get(leader_dict['wikipedia_url']).content, 'html.parser')
all_paragraphs = [paragraph.text for paragraph in soup.find_all('p')]
print(all_paragraphs)
# < 10 lines
first_paragraph = None  # stays None if no paragraph matches
for paragraph in all_paragraphs:
    # heuristic: the introduction starts with the leader's first name
    if paragraph.startswith(first_name):
        first_paragraph = paragraph
        print(first_paragraph)
        break

# Abandoned attempt: match the birth date instead of the name.
# Does not work for <û>, impossible to retrieve the date in "août" for instance...
# import re
# birth_date = leader_dict['birth_date']
# birth_day = birth_date[-2:]
# print(birth_day)
# birth_year = birth_date[:4]
# pattern = rf"({birth_day})\s+(\w*)\s+({birth_year})"
# print(pattern)
# match = re.search(pattern, paragraph)
# print(match.group(1))
# print(match.group(2))
# first_paragraph = paragraph

# get_first_paragraph(wikipedia_url: str) -> str returns the first paragraph (defined by the HTML tag <p>) with details about the leader
# Not working on names because of different reading directions
# def get_first_paragraph(wikipedia_url : str) -> str:
#     first_name = wikipedia_url[wikipedia_url.rfind("/") + 1:].split("_")[0]
#     r = requests.get(wikipedia_url)
#     soup = BeautifulSoup(r.text, 'html.parser')
#     all_paragraphs = [paragraph.text for paragraph in soup.find_all('p')]
#     for paragraph in all_paragraphs:
#         if paragraph[:len(first_name)] == first_name:
#             first_paragraph = paragraph
#             print(first_paragraph)
#             break
#     return first_paragraph

At this stage, you can create a function to maintain consistency in your code. We give you its *skeleton*; copy the code you wrote and make it work inside the function.

Don't forget to test your function.

In [None]:
# 10 lines
import random
import requests
from bs4 import BeautifulSoup

# pick a random leader of the given country and return (wikipedia_url, first_name)
def get_wikipedia_url(country='fr'):
    # get a fresh cookie from the API
    headers = get_cookie(root_url=root_url, cookie_url=cookie_url)
    response = requests.get(root_url + 'leaders', params={'country': country}, headers=headers)
    if response.status_code != 200:
        print(f'Error: {response.status_code}')
        return None
    leader_dict = random.choice(response.json())
    return leader_dict['wikipedia_url'], leader_dict['first_name']

# first_name is an extra argument needed by the name-matching heuristic
def get_first_paragraph(wikipedia_url, first_name):
    print(wikipedia_url) # keep this for the rest of the notebook
    soup = BeautifulSoup(requests.get(wikipedia_url).text, 'html.parser')
    for paragraph in soup.find_all('p'):
        if paragraph.text.startswith(first_name):
            return paragraph.text
    return None  # no paragraph starts with the first name

# get a random country from the list
countries = requests.get(root_url + country_url, headers=get_cookie(root_url=root_url, cookie_url=cookie_url)).json()
random_country = random.choice(countries)
print(random_country)

# test the functions on a random leader of that country
result = get_wikipedia_url(random_country)
if result is not None:
    print(get_first_paragraph(*result))

### Regular expressions to the rescue

Now that you have extracted the content of the first paragraph, the only thing that remains to finish your Wikipedia scraper is to sanitize the output.

Some Wikipedia references, HTML code, phonetic pronunciation, ... still linger across your paragraphs. You might find *regular expressions* handy to get rid of them and obtain pristine text.

Once you have one of your regexes working [online](https://regexr.com/), try it in the cell below. 

Hints: 
* Check the `sub()` method documentation.
* Make sure to test urls in different languages. Some may look good but others won't.
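
As a starting point, and not a complete solution, the sketch below uses `re.sub()` to strip bracketed reference markers such as `[1]` or `[a]` and to collapse runs of whitespace. The function name `sanitize` and the sample sentence are illustrative; phonetic transcriptions and language-specific residue need additional patterns that should be tested on real pages.

```python
import re


def sanitize(text: str) -> str:
    """Remove bracketed markers like [1] or [a], then normalize whitespace."""
    text = re.sub(r"\[[^\]]*\]", "", text)  # drop anything between square brackets
    text = re.sub(r"\s+", " ", text)        # collapse newlines, tabs and repeated spaces
    return text.strip()


# made-up sentence, for illustration only
sample = "Jane Doe[1] is  a fictional[a] leader.\n"
print(sanitize(sample))
```

Applying one substitution per kind of residue keeps each pattern short and easy to check in an online regex tester.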

In [None]:
# 3 lines



## Tidy things up in a stand-alone Python script

Congratulations! You now have a working scraper! However, your code is scattered throughout this notebook alongside the tutorial text. Hardly ready for production or for your GitHub portfolio...

Gather your code into a module, add some functionality for saving, and call it all from a single script!
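
For the saving step, here is a minimal sketch, assuming the scraped data ends up in a plain dictionary that maps each country code to a list of leader records. The names `save`, `load` and `leaders.json` are placeholders, and the sample record is made up.

```python
import json


def save(leaders_per_country: dict, path: str = "leaders.json") -> None:
    """Write the scraped data to a JSON file."""
    # ensure_ascii=False keeps accented names readable in the file
    with open(path, "w", encoding="utf-8") as f:
        json.dump(leaders_per_country, f, ensure_ascii=False, indent=2)


def load(path: str = "leaders.json") -> dict:
    """Read the data back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # made-up record, for illustration only
    data = {"fr": [{"first_name": "François", "first_paragraph": "Made-up paragraph."}]}
    save(data)
    print(load() == data)
```

Reading the file back and comparing it with the original dictionary is a quick way to confirm that nothing was lost when saving.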