# Case study: a book database

+ Alejandro E. Martínez Castro (amcastro@ugr.es)

Example taken from "Practical Web Scraping for Data Science. Best practices and examples with Python", S. vanden Broucke and B. Baesens. Apress.



This example requires some specific packages. If you have already installed BeautifulSoup, the only package you probably still need to install is **dataset**

    dataset
    
In my case, I installed it following the suggestion I found by searching for "dataset install Anaconda".

    conda install -c coecms dataset 
    

## Introduction to the case

In this case study we access a web page that simulates an online bookshop. The page is linked below; it is worth visiting and browsing it to see how it is structured. Scrapinghub provides it as a more realistic scraping example:

http://books.toscrape.com/

The tools _requests_ and _BeautifulSoup_ will be used to build an SQLite database (which can be explored afterwards with tools such as [_DB Browser for SQLite_](https://sqlitebrowser.org/dl/)). The data to be extracted for each book are:

- The title.
- The image.
- Price and availability.
- Its user rating.
- The product description.
- Other available data.

We start by importing some packages: *requests, dataset, datetime, BeautifulSoup*, the regular expression module *re*, and some functions from *urllib.parse*.

With *dataset* we connect to the database "books.db", which may already exist on disk (in which case it is updated) or be created from scratch.

The base URL will be http://books.toscrape.com/




In [1]:
import requests
import dataset
import re
from datetime import datetime
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

db = dataset.connect('sqlite:///books.db')

base_url = 'http://books.toscrape.com/'
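
The functions defined next rely on "upsert" semantics: update the row if its key already exists, otherwise insert it. This is what lets the same notebook either create "books.db" or refresh an existing copy. As a minimal sketch of that idea, assuming only the standard-library `sqlite3` module and an in-memory database (the `upsert` helper and the sample IDs are hypothetical, not part of *dataset*):

```python
import sqlite3
from datetime import datetime


def upsert(conn, book_id, last_seen):
    """Update the row for book_id; insert it if no row was updated."""
    cur = conn.execute(
        "UPDATE books SET last_seen = ? WHERE book_id = ?",
        (last_seen, book_id),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO books (book_id, last_seen) VALUES (?, ?)",
            (book_id, last_seen),
        )


# Throwaway in-memory database (an assumption for the demo, not books.db)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (book_id TEXT, last_seen TEXT)")

upsert(conn, "example-book_1", datetime(2020, 1, 1).isoformat())  # inserts
upsert(conn, "example-book_1", datetime(2020, 1, 2).isoformat())  # same key: updates
upsert(conn, "another-book_2", datetime(2020, 1, 3).isoformat())  # new key: inserts

rows = conn.execute(
    "SELECT book_id, last_seen FROM books ORDER BY book_id"
).fetchall()
print(rows)
```

Running the crawl twice therefore should not duplicate books: the second pass only refreshes the stored values.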

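Next, two functions are defined: *scrape_books* collects the book identifiers found on a catalogue page, and *scrape_book* extracts the details from a single book page.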

In [2]:
def scrape_books(html_soup, url):
    for book in html_soup.select('article.product_pod'):
        # For now, store only the book's identifier, taken from its URL
        book_url = book.find('h3').find('a').get('href')
        book_url = urljoin(url, book_url)
        path = urlparse(book_url).path
        book_id = path.split('/')[2]
        # Upsert tries to update the row and, if it does not exist, inserts it
        db['books'].upsert({'book_id' : book_id,
                            'last_seen' : datetime.now()
                            }, ['book_id'])

def scrape_book(html_soup, book_id):
    main = html_soup.find(class_='product_main')
    book = {}
    book['book_id'] = book_id
    book['title'] = main.find('h1').get_text(strip=True)
    book['price'] = main.find(class_='price_color').get_text(strip=True)
    book['stock'] = main.find(class_='availability').get_text(strip=True)
    book['rating'] = ' '.join(main.find(class_='star-rating') \
                        .get('class')).replace('star-rating', '').strip()
    book['img'] = html_soup.find(class_='thumbnail').find('img').get('src')
    desc = html_soup.find(id='product_description')
    book['description'] = ''
    if desc:
        book['description'] = desc.find_next_sibling('p') \
                                  .get_text(strip=True)
    # 'string' is the current name of the argument formerly called 'text'
    book_product_table = html_soup.find(string='Product Information').find_next('table')
    for row in book_product_table.find_all('tr'):
        header = row.find('th').get_text(strip=True)
        # Since we'll use the header as a column, clean it a bit
        # to make sure SQLite will accept it
        header = re.sub('[^a-zA-Z]+', '_', header)
        value = row.find('td').get_text(strip=True)
        book[header] = value
    db['book_info'].upsert(book, ['book_id'])
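
The string handling inside both functions can be tried without any network access. The sketch below uses only the standard library; the helper names (`book_id_from_href`, `clean_header`, `rating_from_classes`) and the sample hrefs, headers and class list are illustrative assumptions, chosen to resemble what the site appears to serve.

```python
import re
from urllib.parse import urljoin, urlparse


def book_id_from_href(page_url, href):
    """Resolve a relative href against the page URL and take the
    third path segment, as scrape_books does."""
    absolute = urljoin(page_url, href)
    path = urlparse(absolute).path  # e.g. /catalogue/<book_id>/index.html
    return path.split('/')[2]


def clean_header(header):
    """Replace every run of non-letters with '_', as scrape_book does
    before using the header as a column name."""
    return re.sub('[^a-zA-Z]+', '_', header)


def rating_from_classes(classes):
    """Drop the 'star-rating' class and keep the rest (the rating word)."""
    return ' '.join(classes).replace('star-rating', '').strip()


# The same book linked from the home page and from a catalogue page
# (sample hrefs, assumed for illustration)
id_from_home = book_id_from_href(
    'http://books.toscrape.com/',
    'catalogue/a-light-in-the-attic_1000/index.html')
id_from_page = book_id_from_href(
    'http://books.toscrape.com/catalogue/page-2.html',
    'a-light-in-the-attic_1000/index.html')

headers = [clean_header(h) for h in ('UPC', 'Price (excl. tax)', 'Number of reviews')]
rating = rating_from_classes(['star-rating', 'Three'])

print(id_from_home, id_from_page)
print(headers)
print(rating)
```

Resolving the link with `urljoin` before splitting the path is what makes the index `[2]` valid on every page: the relative href differs between the home page and the catalogue pages, but the absolute path does not.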

## Scraping the catalogue pages

Next, the pages of the catalogue are crawled one after another, and each page visited is printed to the screen. This step uses the function *scrape_books* defined above.

In [3]:
url = base_url
inp = input('Do you want to re-crawl the whole catalogue (y/n)? ')
while inp == 'y':
    print('Now scraping page:', url)
    r = requests.get(url)
    html_soup = BeautifulSoup(r.text, 'html.parser')  # Parse the HTML into a BeautifulSoup object
    scrape_books(html_soup, url)
    # Is there a next page?
    next_a = html_soup.select('li.next > a')
    if not next_a or not next_a[0].get('href'):
        break
    url = urljoin(url, next_a[0].get('href'))

Do you want to re-crawl the whole catalogue (y/n)? y
Now scraping page: http://books.toscrape.com/
Now scraping page: http://books.toscrape.com/catalogue/page-2.html
Now scraping page: http://books.toscrape.com/catalogue/page-3.html
Now scraping page: http://books.toscrape.com/catalogue/page-4.html
Now scraping page: http://books.toscrape.com/catalogue/page-5.html
Now scraping page: http://books.toscrape.com/catalogue/page-6.html
Now scraping page: http://books.toscrape.com/catalogue/page-7.html
Now scraping page: http://books.toscrape.com/catalogue/page-8.html
Now scraping page: http://books.toscrape.com/catalogue/page-9.html
Now scraping page: http://books.toscrape.com/catalogue/page-10.html
Now scraping page: http://books.toscrape.com/catalogue/page-11.html
Now scraping page: http://books.toscrape.com/catalogue/page-12.html
Now scraping page: http://books.toscrape.com/catalogue/page-13.html
Now scraping page: http://books.toscrape.com/catalogue/page-14.html
Now scraping page: ht
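
The loop above follows the "next" link until a page has none. The sketch below reproduces only that control flow, with a hypothetical dictionary (`FAKE_NEXT_LINKS`) standing in for the real pages so that no request is made; the three-page site and its hrefs are assumptions for illustration.

```python
from urllib.parse import urljoin

# Hypothetical stand-in for the site: each page URL maps to the relative
# href of its "next" link, or None on the last page.
FAKE_NEXT_LINKS = {
    'http://books.toscrape.com/': 'catalogue/page-2.html',
    'http://books.toscrape.com/catalogue/page-2.html': 'page-3.html',
    'http://books.toscrape.com/catalogue/page-3.html': None,
}


def crawl(start_url, next_links):
    """Visit pages in order by following relative 'next' links."""
    visited = []
    url = start_url
    while True:
        visited.append(url)
        next_href = next_links.get(url)
        if not next_href:
            break  # no "next" link: this was the last page
        # Resolve the relative href against the page it was found on
        url = urljoin(url, next_href)
    return visited


pages = crawl('http://books.toscrape.com/', FAKE_NEXT_LINKS)
for page in pages:
    print('Now scraping page:', page)
```

Note that the href is resolved against the current page, not against the base URL: `page-3.html` is relative to `/catalogue/`, so joining it with the site root would produce a wrong address.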

## Scraping book by book

Next, the *book_info* table of the SQLite database *books.db* is populated. Its structure is built from the catalogue crawl, visiting the books one by one and starting with those whose `last_seen` timestamp is oldest. This step uses the function *scrape_book*.
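
To make the "oldest first" order concrete, here is a small sketch using the standard-library `sqlite3` module on an in-memory table. The three book IDs and timestamps are made up for the demo; only the `books` table layout and the URL pattern mirror the notebook.

```python
import sqlite3

# Throwaway in-memory table with made-up rows (assumption for the demo)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE books (book_id TEXT, last_seen TEXT)')
conn.executemany('INSERT INTO books VALUES (?, ?)', [
    ('book-b_2', '2020-01-02T10:00:00'),
    ('book-a_1', '2020-01-01T10:00:00'),
    ('book-c_3', '2020-01-03T10:00:00'),
])


def crawl_order(conn):
    """Return book IDs sorted by last_seen, least recently seen first."""
    query = 'SELECT book_id FROM books ORDER BY last_seen'
    return [row[0] for row in conn.execute(query)]


base_url = 'http://books.toscrape.com/'
queue = crawl_order(conn)
urls = [base_url + 'catalogue/{}'.format(book_id) for book_id in queue]

# After a book is scraped its timestamp is refreshed, which moves it
# to the end of the order on the next run.
conn.execute('UPDATE books SET last_seen = ? WHERE book_id = ?',
             ('2020-01-04T10:00:00', queue[0]))
new_queue = crawl_order(conn)

print(queue)
print(new_queue)
```

Because every scraped book gets a fresh timestamp, an interrupted run can be restarted and will pick up the books that have waited longest.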

In [4]:
books = db['books'].find(order_by=['last_seen'])
for book in books:
    book_id = book['book_id']
    book_url = base_url + 'catalogue/{}'.format(book_id)
    print('Now scraping book:', book_url)
    r = requests.get(book_url)
    r.encoding = 'utf-8'
    html_soup = BeautifulSoup(r.text, 'html.parser')
    scrape_book(html_soup, book_id)
    # Update the last seen timestamp
    db['books'].upsert({'book_id' : book_id,
                        'last_seen' : datetime.now()
                        }, ['book_id'])

Now scraping book: http://books.toscrape.com/catalogue/a-light-in-the-attic_1000
Now scraping book: http://books.toscrape.com/catalogue/tipping-the-velvet_999
Now scraping book: http://books.toscrape.com/catalogue/soumission_998
Now scraping book: http://books.toscrape.com/catalogue/sharp-objects_997
Now scraping book: http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996
Now scraping book: http://books.toscrape.com/catalogue/the-requiem-red_995
Now scraping book: http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994
Now scraping book: http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993
Now scraping book: http://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992
Now scraping book: http://books.toscrape.com/catalogue/the-black-maria_991
Now scraping book: http://books.toscrape.com

Now scraping book: http://books.toscrape.com/catalogue/princess-jellyfish-2-in-1-omnibus-vol-01-princess-jellyfish-2-in-1-omnibus-1_920
Now scraping book: http://books.toscrape.com/catalogue/princess-between-worlds-wide-awake-princess-5_919
Now scraping book: http://books.toscrape.com/catalogue/pop-gun-war-volume-1-gift_918
Now scraping book: http://books.toscrape.com/catalogue/political-suicide-missteps-peccadilloes-bad-calls-backroom-hijinx-sordid-pasts-rotten-breaks-and-just-plain-dumb-mistakes-in-the-annals-of-american-politics_917
Now scraping book: http://books.toscrape.com/catalogue/patience_916
Now scraping book: http://books.toscrape.com/catalogue/outcast-vol-1-a-darkness-surrounds-him-outcast-1_915
Now scraping book: http://books.toscrape.com/catalogue/orange-the-complete-collection-1-orange-the-complete-collection-1_914
Now scraping book: http://books.toscrape.com/catalogue/online-marketing-for-busy-authors-a-step-by-step-guide_913
Now scraping book: http://books.toscrape.co

Now scraping book: http://books.toscrape.com/catalogue/the-bridge-to-consciousness-im-writing-the-bridge-between-science-and-our-old-and-new-beliefs_840
Now scraping book: http://books.toscrape.com/catalogue/the-artists-way-a-spiritual-path-to-higher-creativity_839
Now scraping book: http://books.toscrape.com/catalogue/the-art-of-war_838
Now scraping book: http://books.toscrape.com/catalogue/the-argonauts_837
Now scraping book: http://books.toscrape.com/catalogue/the-10-entrepreneur-live-your-startup-dream-without-quitting-your-day-job_836
Now scraping book: http://books.toscrape.com/catalogue/suddenly-in-love-lake-haven-1_835
Now scraping book: http://books.toscrape.com/catalogue/something-more-than-this_834
Now scraping book: http://books.toscrape.com/catalogue/soft-apocalypse_833
Now scraping book: http://books.toscrape.com/catalogue/so-youve-been-publicly-shamed_832
Now scraping book: http://books.toscrape.com/catalogue/shoe-dog-a-memoir-by-the-creator-of-nike_831
Now scraping book

Now scraping book: http://books.toscrape.com/catalogue/the-last-mile-amos-decker-2_754
Now scraping book: http://books.toscrape.com/catalogue/the-immortal-life-of-henrietta-lacks_753
Now scraping book: http://books.toscrape.com/catalogue/the-hidden-oracle-the-trials-of-apollo-1_752
Now scraping book: http://books.toscrape.com/catalogue/the-help-yourself-cookbook-for-kids-60-easy-plant-based-recipes-kids-can-make-to-stay-healthy-and-save-the-earth_751
Now scraping book: http://books.toscrape.com/catalogue/the-guilty-will-robie-4_750
Now scraping book: http://books.toscrape.com/catalogue/the-first-hostage-jb-collins-2_749
Now scraping book: http://books.toscrape.com/catalogue/the-dovekeepers_748
Now scraping book: http://books.toscrape.com/catalogue/the-darkest-lie_747
Now scraping book: http://books.toscrape.com/catalogue/the-bane-chronicles-the-bane-chronicles-1-11_746
Now scraping book: http://books.toscrape.com/catalogue/the-bad-ass-librarians-of-timbuktu-and-their-race-to-save-the-w

Now scraping book: http://books.toscrape.com/catalogue/buying-in-the-secret-dialogue-between-what-we-buy-and-who-we-are_670
Now scraping book: http://books.toscrape.com/catalogue/brain-on-fire-my-month-of-madness_669
Now scraping book: http://books.toscrape.com/catalogue/batman-europa_668
Now scraping book: http://books.toscrape.com/catalogue/barefoot-contessa-back-to-basics_667
Now scraping book: http://books.toscrape.com/catalogue/barefoot-contessa-at-home-everyday-recipes-youll-make-over-and-over-again_666
Now scraping book: http://books.toscrape.com/catalogue/balloon-animals_665
Now scraping book: http://books.toscrape.com/catalogue/art-ops-vol-1_664
Now scraping book: http://books.toscrape.com/catalogue/aristotle-and-dante-discover-the-secrets-of-the-universe-aristotle-and-dante-discover-the-secrets-of-the-universe-1_663
Now scraping book: http://books.toscrape.com/catalogue/angels-walking-angels-walking-1_662
Now scraping book: http://books.toscrape.com/catalogue/angels-demons-ro

Now scraping book: http://books.toscrape.com/catalogue/a-mothers-reckoning-living-in-the-aftermath-of-tragedy_585
Now scraping book: http://books.toscrape.com/catalogue/a-gentlemans-position-society-of-gentlemen-3_584
Now scraping book: http://books.toscrape.com/catalogue/112263_583
Now scraping book: http://books.toscrape.com/catalogue/10-happier-how-i-tamed-the-voice-in-my-head-reduced-stress-without-losing-my-edge-and-found-self-help-that-actually-works_582
Now scraping book: http://books.toscrape.com/catalogue/10-day-green-smoothie-cleanse-lose-up-to-15-pounds-in-10-days_581
Now scraping book: http://books.toscrape.com/catalogue/without-shame_580
Now scraping book: http://books.toscrape.com/catalogue/watchmen_579
Now scraping book: http://books.toscrape.com/catalogue/unlimited-intuition-now_578
Now scraping book: http://books.toscrape.com/catalogue/underlying-notes_577
Now scraping book: http://books.toscrape.com/catalogue/the-shack_576
Now scraping book: http://books.toscrape.com/

Now scraping book: http://books.toscrape.com/catalogue/unreasonable-hope-finding-faith-in-the-god-who-brings-purpose-to-your-pain_505
Now scraping book: http://books.toscrape.com/catalogue/under-the-tuscan-sun_504
Now scraping book: http://books.toscrape.com/catalogue/toddlers-are-aholes-its-not-your-fault_503
Now scraping book: http://books.toscrape.com/catalogue/the-year-of-living-biblically-one-mans-humble-quest-to-follow-the-bible-as-literally-as-possible_502
Now scraping book: http://books.toscrape.com/catalogue/the-whale_501
Now scraping book: http://books.toscrape.com/catalogue/the-story-of-art_500
Now scraping book: http://books.toscrape.com/catalogue/the-origin-of-species_499
Now scraping book: http://books.toscrape.com/catalogue/the-great-gatsby_498
Now scraping book: http://books.toscrape.com/catalogue/the-good-girl_497
Now scraping book: http://books.toscrape.com/catalogue/the-glass-castle_496
Now scraping book: http://books.toscrape.com/catalogue/the-faith-of-christopher-h

Now scraping book: http://books.toscrape.com/catalogue/a-year-in-provence-provence-1_421
Now scraping book: http://books.toscrape.com/catalogue/world-without-end-the-pillars-of-the-earth-2_420
Now scraping book: http://books.toscrape.com/catalogue/will-grayson-will-grayson-will-grayson-will-grayson_419
Now scraping book: http://books.toscrape.com/catalogue/why-save-the-bankers-and-other-essays-on-our-economic-and-political-crisis_418
Now scraping book: http://books.toscrape.com/catalogue/where-she-went-if-i-stay-2_417
Now scraping book: http://books.toscrape.com/catalogue/what-if-serious-scientific-answers-to-absurd-hypothetical-questions_416
Now scraping book: http://books.toscrape.com/catalogue/two-summers_415
Now scraping book: http://books.toscrape.com/catalogue/this-is-your-brain-on-music-the-science-of-a-human-obsession_414
Now scraping book: http://books.toscrape.com/catalogue/the-secret-garden_413
Now scraping book: http://books.toscrape.com/catalogue/the-raven-king-the-raven-c

Now scraping book: http://books.toscrape.com/catalogue/no-one-here-gets-out-alive_336
Now scraping book: http://books.toscrape.com/catalogue/night-shift-night-shift-1-20_335
Now scraping book: http://books.toscrape.com/catalogue/needful-things_334
Now scraping book: http://books.toscrape.com/catalogue/mockingjay-the-hunger-games-3_333
Now scraping book: http://books.toscrape.com/catalogue/misery_332
Now scraping book: http://books.toscrape.com/catalogue/little-women-little-women-1_331
Now scraping book: http://books.toscrape.com/catalogue/it_330
Now scraping book: http://books.toscrape.com/catalogue/harry-potter-and-the-sorcerers-stone-harry-potter-1_329
Now scraping book: http://books.toscrape.com/catalogue/harry-potter-and-the-prisoner-of-azkaban-harry-potter-3_328
Now scraping book: http://books.toscrape.com/catalogue/harry-potter-and-the-order-of-the-phoenix-harry-potter-5_327
Now scraping book: http://books.toscrape.com/catalogue/harry-potter-and-the-half-blood-prince-harry-potter

Now scraping book: http://books.toscrape.com/catalogue/the-girl-who-played-with-fire-millennium-trilogy-2_249
Now scraping book: http://books.toscrape.com/catalogue/the-girl-who-kicked-the-hornets-nest-millennium-trilogy-3_248
Now scraping book: http://books.toscrape.com/catalogue/the-exiled_247
Now scraping book: http://books.toscrape.com/catalogue/the-end-of-faith-religion-terror-and-the-future-of-reason_246
Now scraping book: http://books.toscrape.com/catalogue/the-elegant-universe-superstrings-hidden-dimensions-and-the-quest-for-the-ultimate-theory_245
Now scraping book: http://books.toscrape.com/catalogue/the-disappearing-spoon-and-other-true-tales-of-madness-love-and-the-history-of-the-world-from-the-periodic-table-of-the-elements_244
Now scraping book: http://books.toscrape.com/catalogue/the-devil-wears-prada-the-devil-wears-prada-1_243
Now scraping book: http://books.toscrape.com/catalogue/the-demon-haunted-world-science-as-a-candle-in-the-dark_242
Now scraping book: http://boo

Now scraping book: http://books.toscrape.com/catalogue/green-eggs-and-ham-beginner-books-b-16_165
Now scraping book: http://books.toscrape.com/catalogue/grayson-vol-3-nemesis-grayson-3_164
Now scraping book: http://books.toscrape.com/catalogue/gratitude_163
Now scraping book: http://books.toscrape.com/catalogue/gone-girl_162
Now scraping book: http://books.toscrape.com/catalogue/golden-heart-of-dread-3_161
Now scraping book: http://books.toscrape.com/catalogue/girl-in-the-blue-coat_160
Now scraping book: http://books.toscrape.com/catalogue/fruits-basket-vol-3-fruits-basket-3_159
Now scraping book: http://books.toscrape.com/catalogue/friday-night-lights-a-town-a-team-and-a-dream_158
Now scraping book: http://books.toscrape.com/catalogue/fire-bound-sea-havensisters-of-the-heart-5_157
Now scraping book: http://books.toscrape.com/catalogue/fifty-shades-freed-fifty-shades-3_156
Now scraping book: http://books.toscrape.com/catalogue/fellside_155
Now scraping book: http://books.toscrape.com/c

Now scraping book: http://books.toscrape.com/catalogue/the-mirror-the-maze-the-wrath-and-the-dawn-15_73
Now scraping book: http://books.toscrape.com/catalogue/the-little-prince_72
Now scraping book: http://books.toscrape.com/catalogue/the-light-of-the-fireflies_71
Now scraping book: http://books.toscrape.com/catalogue/the-last-girl-the-dominion-trilogy-1_70
Now scraping book: http://books.toscrape.com/catalogue/the-iliad_69
Now scraping book: http://books.toscrape.com/catalogue/the-hook-up-game-on-1_68
Now scraping book: http://books.toscrape.com/catalogue/the-haters_67
Now scraping book: http://books.toscrape.com/catalogue/the-girl-you-lost_66
Now scraping book: http://books.toscrape.com/catalogue/the-girl-in-the-ice-dci-erika-foster-1_65
Now scraping book: http://books.toscrape.com/catalogue/the-end-of-the-jesus-era-an-investigation-1_64
Now scraping book: http://books.toscrape.com/catalogue/the-edge-of-reason-bridget-jones-2_63
Now scraping book: http://books.toscrape.com/catalogue/