# Exercise Sheet \#3


## Exercise 1. Quotes: manual scraping

In this exercise, you will compile a dataset of biographies taken from http://quotes.toscrape.com.
Recall that this website displays 10 quotes per page, each with a link to its author's biography. The exercise is organised as a step-by-step guide.

#### 1.1 Getting URLs of authors' pages

To get a list of URLs pointing at author pages, you will process quotes' pages. 

To do so, first complete the function `get_links` below, which expects as a parameter:

* `url` the URL of a page from quotes.toscrape.com

and returns:

* `authors` the list of links to author pages contained in the given quotes' page (beware of duplicates!)

In [38]:
import requests, re
from bs4 import BeautifulSoup

BASE_URL = 'http://quotes.toscrape.com'

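def get_links(url):
    # Fetch the quotes page and parse its HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Author links are the tags whose href contains '/author/'
    anchors = soup.find_all(href=re.compile('/author/.*'))

    # Build absolute URLs; the site's hrefs have no trailing slash, so one is appended
    authors_links = [f'{BASE_URL}{a["href"]}/' for a in anchors]

    # A set removes duplicates (an author can be quoted several times on one page)
    return list(set(authors_links))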

#Test:
authors = get_links(BASE_URL)
print(authors)

['http://quotes.toscrape.com/author/Steve-Martin/', 'http://quotes.toscrape.com/author/Jane-Austen/', 'http://quotes.toscrape.com/author/Thomas-A-Edison/', 'http://quotes.toscrape.com/author/Albert-Einstein/', 'http://quotes.toscrape.com/author/Eleanor-Roosevelt/', 'http://quotes.toscrape.com/author/Andre-Gide/', 'http://quotes.toscrape.com/author/J-K-Rowling/', 'http://quotes.toscrape.com/author/Marilyn-Monroe/']
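
Because `set` does not preserve order, the order of the links above can differ between Python sessions. The following is a minimal sketch, using only the standard library, of one way to drop duplicates while keeping first-seen order and to build absolute URLs with `urljoin`. The helper name `absolute_unique` and the sample `hrefs` list are assumptions made for illustration, not part of the exercise.

```python
from urllib.parse import urljoin

def absolute_unique(base_url, hrefs):
    """Resolve relative hrefs against base_url, dropping duplicates but keeping first-seen order."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(absolute))

# Illustrative input: relative links as they might appear on a quotes page
hrefs = ['/author/Albert-Einstein', '/author/J-K-Rowling', '/author/Albert-Einstein']
links = absolute_unique('http://quotes.toscrape.com', hrefs)
print(links)
```

Unlike the cell above, this sketch does not append a trailing slash; whether one is needed depends on how the site handles redirects.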


#### 1.2 Iterating over pages of quotes

In a second step, fill in the `collect` function below, which will iteratively collect author links. This function takes the following input parameters:
- `url`: the starting URL from which to collect links,
- `authors`: the list of links to be updated,
- `limit`: the number of pages to visit (the default, `None`, means visit all pages).
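
The traversal described by these parameters can also be written as a loop rather than a recursion. The sketch below is one possible shape, under stated assumptions: `fetch_page` is a hypothetical callable that returns the author links of a page together with the URL of the next page (or `None` on the last page), and `FAKE_SITE` is made-up data standing in for network requests. Here `limit` counts pages visited.

```python
def collect_iter(start_url, fetch_page, limit=None):
    """Follow 'next' links from start_url, visiting at most `limit` pages (all pages if None)."""
    authors, url, visited = [], start_url, 0
    while url is not None and (limit is None or visited < limit):
        links, url = fetch_page(url)  # url becomes the next page, or None at the end
        visited += 1
        for link in links:
            if link not in authors:  # skip authors already collected
                authors.append(link)
    return authors

# Made-up three-page site: url -> (author links, next page url)
FAKE_SITE = {
    '/page/1/': (['/author/A', '/author/B'], '/page/2/'),
    '/page/2/': (['/author/B', '/author/C'], '/page/3/'),
    '/page/3/': (['/author/D'], None),
}

def fake_fetch(url):
    return FAKE_SITE[url]

all_authors = collect_iter('/page/1/', fake_fetch)           # no limit: every page
first_two = collect_iter('/page/1/', fake_fetch, limit=2)    # stop after two pages
print(all_authors)
print(first_two)
```

Passing the fetcher as an argument keeps the pagination logic testable without touching the network.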

In [51]:
soup.select('.next a')[0]['href'].split('/')[-2]

'2'

In [54]:
def collect(url, authors, limit=None):
    # Add links contained in the page located at url, skipping those already collected
    authors.extend([x for x in get_links(url) if x not in authors])

    # get_links already downloaded this page; it is fetched again here to find the "Next" link
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    next_page = soup.select('.next a')

    # Stop on the last page, which has no "Next" link
    if not next_page:
        return

    # The href looks like '/page/2/', so the page number is the second-to-last segment
    next_page_num = int(next_page[0]['href'].split('/')[-2])

    # Recurse if no limit is given (visit all pages) or the next page is within the limit
    if limit is None or next_page_num <= limit:
        next_page_url = f'{BASE_URL}{next_page[0]["href"]}'
        collect(next_page_url, authors, limit=limit)

# Test
authors = []
collect(BASE_URL, authors, limit=10)
print(authors)

['http://quotes.toscrape.com/author/Steve-Martin/', 'http://quotes.toscrape.com/author/Jane-Austen/', 'http://quotes.toscrape.com/author/Thomas-A-Edison/', 'http://quotes.toscrape.com/author/Albert-Einstein/', 'http://quotes.toscrape.com/author/Eleanor-Roosevelt/', 'http://quotes.toscrape.com/author/Andre-Gide/', 'http://quotes.toscrape.com/author/J-K-Rowling/', 'http://quotes.toscrape.com/author/Marilyn-Monroe/', 'http://quotes.toscrape.com/author/Friedrich-Nietzsche/', 'http://quotes.toscrape.com/author/Dr-Seuss/', 'http://quotes.toscrape.com/author/Douglas-Adams/', 'http://quotes.toscrape.com/author/Mark-Twain/', 'http://quotes.toscrape.com/author/Allen-Saunders/', 'http://quotes.toscrape.com/author/Bob-Marley/', 'http://quotes.toscrape.com/author/Elie-Wiesel/', 'http://quotes.toscrape.com/author/Garrison-Keillor/', 'http://quotes.toscrape.com/author/Jim-Henson/', 'http://quotes.toscrape.com/author/Mother-Teresa/', 'http://quotes.toscrape.com/author/Ralph-Waldo-Emerson/', 'http://qu

#### Question 1.3: get actual biographies

For each of the links computed in the previous question, retrieve the corresponding webpage and extract the biography it contains. To do so, fill in the `get_biography` function below. Its results will populate a list of dictionaries of the following form:
```python
bios = [{'name': '...', 'birth_date': '...', 'birth_place': '...', 'bio': '...'}, ...]
```
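
Before storing a record of this form, the raw strings taken from a page usually need some light cleaning. The sketch below is one possible approach, not the expected solution: the helper name `make_bio` is hypothetical, and it assumes the birth place text starts with the word "in " and that the biography may contain stray line breaks and padding. The extra whitespace in the sample strings is added for illustration.

```python
def make_bio(name, birth_date, birth_place, bio):
    """Build one record in the shape described above from raw page strings."""
    place = birth_place.strip()
    # Assumption: the location is rendered as "in <place>"; drop the prefix only if present
    if place.startswith('in '):
        place = place[len('in '):]
    return {
        'name': name.strip(),
        'birth_date': birth_date.strip(),
        'birth_place': place,
        # Collapse runs of whitespace (including newlines) into single spaces
        'bio': ' '.join(bio.split()),
    }

record = make_bio(
    'Steve Martin\n    ',
    'August 14, 1945',
    'in Waco, Texas, The United States',
    '  Stephen Glenn "Steve" Martin is an American actor,\n    comedian, writer, playwright, producer, musician, and composer.  ',
)
print(record)
```

Checking for the prefix explicitly is a little safer than slicing off a fixed number of characters, since a page without the prefix is then left intact.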

In [72]:
def get_biography(url):
    # Get page located at URL and parse it
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    
    # Get name with BeautifulSoup
    name = soup.find('h3', {'class':'author-title'}).text.split('\n')[0]
    # Get birth date
    birth_date = soup.find('span', {'class':'author-born-date'}).text
    # Get birth place (the text starts with 'in ', which the slice drops)
    birth_place = soup.find('span', {'class':'author-born-location'}).text[3:]
    # Get bio
    bio = soup.find('div', {'class':'author-description'}).text.strip()
    return {'name':name, 'birth_date': birth_date, 'birth_place': birth_place, 'bio': bio}

def get_bios(urls):
    bios = []
    for u in urls:
        bios.append(get_biography(u))
    return bios

#Test
bios=get_bios(authors)
print(bios)

[{'name': 'Steve Martin', 'birth_date': 'August 14, 1945', 'birth_place': 'Waco, Texas, The United States', 'bio': 'Stephen Glenn "Steve" Martin is an American actor, comedian, writer, playwright, producer, musician, and composer. He was raised in Southern California in a Baptist family, where his early influences were working at Disneyland and Knott\'s Berry Farm and working magic and comedy acts at these and other smaller venues in the area. His ascent to fame picked up when he became a writer for the Smothers Brothers Comedy Hour, and later became a frequent guest on the Tonight Show.In the 1970s, Martin performed his offbeat, absurdist comedy routines before packed houses on national tours. In the 1980s, having branched away from stand-up comedy, he became a successful actor, playwright, and juggler, and eventually earned Emmy, Grammy, and American Comedy awards.'}, {'name': 'Jane Austen', 'birth_date': 'December 16, 1775', 'birth_place': 'Steventon Rectory, Hampshire, The United K

In [69]:
response = requests.get('http://quotes.toscrape.com/author/C-S-Lewis/')
soup = BeautifulSoup(response.text, 'html.parser')
soup.find('div', {'class':'author-description'}).text.strip()

'CLIVE STAPLES LEWIS (1898–1963) was one of the intellectual giants of the twentieth century and arguably one of the most influential writers of his day. He was a Fellow and Tutor in English Literature at Oxford University until 1954. He was unanimously elected to the Chair of Medieval and Renaissance Literature at Cambridge University, a position he held until his retirement. He wrote more than thirty books, allowing him to reach a vast audience, and his works continue to attract thousands of new readers every year. His most distinguished and popular accomplishments include Mere Christianity, Out of the Silent Planet, The Great Divorce, The Screwtape Letters, and the universally acknowledged classics The Chronicles of Narnia. To date, the Narnia books have sold over 100 million copies and been transformed into three major motion pictures.'

#### Question 1.4: save your dataset

Finally, write a `save` function which takes as input a list of biographies as computed above and saves them to disk in JSON format (the filename being an input parameter).

In [4]:
import json

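def save(filename, dataset):
    # Open output file
    with open(filename, 'w', encoding='utf-8') as f:
        # Write data in JSON format (ensure_ascii=False keeps accented characters readable)
        json.dump(dataset, f, ensure_ascii=False, indent=2)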

save('bios.json', bios)
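
A quick way to check a saved dataset is to read the file back and compare it with the original list. The following sketch does this in a temporary directory so that nothing is left on disk. The `load` helper is a hypothetical counterpart to `save`, and the one-record `sample` list is a shortened illustration of the real dataset.

```python
import json
import os
import tempfile

def load(filename):
    """Read a JSON file back into Python objects (hypothetical counterpart to `save`)."""
    with open(filename, encoding='utf-8') as f:
        return json.load(f)

# Shortened illustrative record in the same shape as the scraped biographies
sample = [{'name': 'Jane Austen', 'birth_date': 'December 16, 1775'}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bios.json')
    # Write the sample as JSON, then read it back
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sample, f, ensure_ascii=False, indent=2)
    restored = load(path)

print(restored == sample)
```

Lists of dictionaries with string values survive this round trip unchanged; other Python types (for example tuples or dates) would need extra handling.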

## Exercise 2. Let's use Scrapy now!

The goal here is to experiment with Scrapy. Consider the Wikipedia article https://en.wikipedia.org/wiki/List_of_French_artists. Suppose we want to extract the names of all artists listed there, together with the links to their Wikipedia pages and the first paragraph about each of them.

You will find a file called `Exercise_sheet_3_scrapy.py`. Can you fill in the gaps in this script?


In addition to the Scrapy documentation, I highly recommend looking at the list of possible selectors: https://www.w3schools.com/cssref/css_selectors.php
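
A broad selector such as `ul li` matches navigation and account links as well as artists, so the matched links usually need filtering. The sketch below illustrates one such filter with the standard library's `HTMLParser` rather than Scrapy or CSS selectors, purely so that it runs without extra dependencies. The class name `ArticleLinkParser`, the `SAMPLE` markup, and the heuristic (article links start with `/wiki/` and contain no namespace colon) are all assumptions made for illustration.

```python
from html.parser import HTMLParser

class ArticleLinkParser(HTMLParser):
    """Collect hrefs of <a> tags inside <li> elements that look like article links."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0   # how many <li> elements are currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.li_depth += 1
        elif tag == 'a' and self.li_depth > 0:
            href = dict(attrs).get('href') or ''
            # Assumed heuristic: articles live under /wiki/ and have no namespace colon
            if href.startswith('/wiki/') and ':' not in href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'li' and self.li_depth > 0:
            self.li_depth -= 1

# Made-up markup mixing interface links with one article link
SAMPLE = '''
<ul>
  <li><a href="/w/index.php?title=Special:CreateAccount">Create account</a></li>
  <li><a href="/wiki/Special:MyTalk">Talk</a></li>
  <li><a href="/wiki/Claude_Monet">Claude Monet</a></li>
</ul>
'''

parser = ArticleLinkParser()
parser.feed(SAMPLE)
print(parser.links)
```

With Scrapy or BeautifulSoup, the same idea is usually expressed by narrowing the selector to the article's content area and then filtering on the `href` attribute.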

In [76]:
response = requests.get('https://en.wikipedia.org/wiki/List_of_French_artists')
soup = BeautifulSoup(response.text, 'html.parser')
soup.select('ul li')

[<li class="user-links-collapsible-item mw-list-item" id="pt-createaccount-2"><a href="/w/index.php?title=Special:CreateAccount&amp;returnto=List+of+French+artists" title="You are encouraged to create an account and log in; however, it is not mandatory"><span>Create account</span></a></li>,
 <li class="vector-user-menu-create-account user-links-collapsible-item"><a data-mw="interface" href="/w/index.php?title=Special:CreateAccount&amp;returnto=List+of+French+artists"><span class="mw-ui-icon mw-ui-icon-userAdd mw-ui-icon-wikimedia-userAdd"></span><span>Create account</span></a>
 </li>,
 <li class="vector-user-menu-login"><a accesskey="o" data-mw="interface" href="/w/index.php?title=Special:UserLogin&amp;returnto=List+of+French+artists" title="[o]"><span class="mw-ui-icon mw-ui-icon-logIn mw-ui-icon-wikimedia-logIn"></span><span>Log in</span></a>
 </li>,
 <li class="mw-list-item" id="pt-anontalk"><a accesskey="n" href="/wiki/Special:MyTalk" title="Discussion about edits from this IP addr