## Phase 1.10

# Web Scraping

## DOM & HTML

> *Web pages can be represented by the objects that comprise their structure and content. This representation is known as the Document Object Model (DOM). The purpose of the DOM is to provide an interface for programs to change the structure, style, and content of web pages. The DOM represents the document as nodes and objects. Amongst other things, this allows programming languages to interactively change the page and HTML!*
>
> *As you'll see, the DOM and HTML form a hierarchy of elements. This structure can be navigated much like a family tree, which is one of Beautiful Soup's main navigation mechanisms. Once you select a specific element on a page, you can move to related elements using methods that retrieve a tag's siblings, parent, or descendants.*
>
> *To learn more about the DOM see:*
> *https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction*

## Beautiful Soup

> *https://www.crummy.com/software/BeautifulSoup/bs4/doc/*
>
> *Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.*
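
The family-tree navigation described in the DOM section can be sketched on a tiny document. The HTML below is made up purely for illustration; only the Beautiful Soup calls (`find`, `.parent`, `find_next_sibling`, `find_all`) are the point.

```python
from bs4 import BeautifulSoup

# A tiny hypothetical document, used only to illustrate tree navigation.
html = """
<html><body>
  <ul id="shelf">
    <li class="book">Dune</li>
    <li class="book">Emma</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

first = soup.find("li", class_="book")       # first matching element
parent = first.parent                        # the enclosing <ul>
sibling = first.find_next_sibling("li")      # the next <li> at the same level
descendants = [li.get_text() for li in parent.find_all("li")]

print(parent["id"])        # shelf
print(sibling.get_text())  # Emma
print(descendants)         # ['Dune', 'Emma']
```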


### ***Precaution***

> *While web scraping is a powerful tool, it can also lead you into ethical and legal gray areas.*
> - *To start, it is possible to make hundreds of requests a second to a website.*
>     - *Browsing at superhuman speeds like this is apt to get noticed. Request volumes that large can bog down a website's servers and, in extreme cases, could be considered a denial-of-service attack.*
> - *Similarly, a website that requires a login may contain information that is not considered public, and scraping such a site could put you in legal jeopardy.*
>
> *Use your best judgment when scraping and exercise precautions. Having your IP address blocked from your favorite website, for example, could prove to be quite an annoyance.*

In [None]:
import requests
import os
import time

from bs4 import BeautifulSoup
import pandas as pd

### *Pandas Hack - `pd.read_html()`*
*Occasionally, the information you want on a webpage is already formatted in a `<table>` in the HTML. In those cases, Pandas can retrieve the data directly from the URL.*

*When used in this way, Pandas returns a list of dataframes, one for each table it extracts from the webpage.*

In [None]:
# SUMMER_OLYMPICS = 'https://en.wikipedia.org/wiki/Athletics_at_the_1924_Summer_Olympics_%E2%80%93_Men%27s_javelin_throw'
# df_lst = pd.read_html(SUMMER_OLYMPICS)
# df_lst[1]
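
To see the "list of dataframes" behaviour without hitting a live site, here is a minimal offline sketch. The table markup and its values are invented for illustration. It assumes an HTML parser that Pandas can use (such as `lxml`) is installed, and it wraps the string in `StringIO` because recent Pandas versions expect a file-like object rather than a literal HTML string.

```python
from io import StringIO

import pandas as pd

# Hypothetical table markup, standing in for a real page.
html = """
<table>
  <tr><th>Athlete</th><th>Distance (m)</th></tr>
  <tr><td>A</td><td>61.5</td></tr>
  <tr><td>B</td><td>60.25</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
df = tables[0]

print(len(tables))
print(df)
```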

# Using Beautiful Soup

*It's always a good idea to explore the website with:*
- *`Right-Click > Inspect` ...*

> <a href='https://books.toscrape.com/index.html'>*books.toscrape.com*</a>

In [None]:
URL = 'https://books.toscrape.com/catalogue/page-1.html'

In [None]:
# Demo of `os.path.split()`. This may come in handy later...
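
One possible demo, as a sketch: `os.path.split()` cuts a path at its final slash, which also works on a URL string. The last two lines show two ways to build a sibling URL; `urljoin` from the standard library is included as an alternative that was not part of the original notebook.

```python
import os
from urllib.parse import urljoin

URL = "https://books.toscrape.com/catalogue/page-1.html"

# Split off everything after the final slash.
head, tail = os.path.split(URL)
print(head)  # https://books.toscrape.com/catalogue
print(tail)  # page-1.html

# Rebuild a sibling URL by hand ...
print(f"{head}/page-2.html")
# ... or let urljoin resolve a relative link against the current URL.
print(urljoin(URL, "page-2.html"))
```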


In [None]:
# Demo of `time.sleep()`. This may come in handy later...
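
A minimal sketch of pausing inside a loop. The delay value and the loop body are placeholders; a real scraper would make a request where the `print` call is.

```python
import time

DELAY_SECONDS = 0.2  # illustrative value; a real scraper might wait longer

start = time.perf_counter()
for page in range(1, 4):
    # A real loop would request and parse the page here.
    print(f"pretending to fetch page {page}")
    time.sleep(DELAY_SECONDS)  # pause before the next request
elapsed = time.perf_counter() - start

print(f"elapsed: {elapsed:.2f}s")
```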


In [None]:
# Use requests to get the page.
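
One way this step might look, wrapped in a small helper. The name `fetch_html` and the 10-second timeout are assumptions made for this sketch, not part of the original notebook.

```python
import requests

URL = "https://books.toscrape.com/catalogue/page-1.html"


def fetch_html(url, timeout=10):
    """Return the HTML text for `url`, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

Calling `fetch_html(URL)` performs the actual request. If characters such as `£` come out garbled in `response.text`, passing `response.content` (raw bytes) to Beautiful Soup instead lets it detect the encoding itself.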


In [None]:
# Make soup.
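
A sketch of turning HTML text into a soup object. To keep it runnable offline, a short made-up string stands in for the fetched page; the `article.product_pod` structure is an assumption about how the site marks up each book and should be confirmed with `Inspect`.

```python
from bs4 import BeautifulSoup

# Stand-in for the text of a fetched page (hypothetical, heavily trimmed).
html = """
<html>
  <head><title>Example shelf</title></head>
  <body>
    <article class="product_pod"><h3><a title="Book One">Book One</a></h3></article>
    <article class="product_pod"><h3><a title="Book Two">Book Two</a></h3></article>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")  # parser that ships with Python

print(soup.title.get_text())  # Example shelf
pods = soup.find_all("article", class_="product_pod")
print(len(pods))              # 2
print(pods[0].h3.a["title"])  # Book One
```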


# Practice
Our goal is to extract data from this site. We want **a dataframe containing columns: `Title, Stars, Price, In Stock`**.

---

To do this successfully, we break the process into steps and put them together at the end.

***Steps***
1. ***Scrape a Single Page***
    1. **Capture a single book as a data point.**
        - Find how book entries are represented on the site.
            - *Using `Inspect` in the browser.*
        - Using an example entry, create a function that returns the data in a formatted way for us to use.
            - *The book should be thought of as an entry in the dataframe.* 
            - *How should we encode a row?*
        - **Write this in a function.**
    2. **Capture all books on a given page.**
        - Use the above function to extract *all* the data from a given page.
        - **Write this in a function.**


2. ***Scrape Multiple Pages***
    1. **Find a way to traverse the pages.**
        - Rather than hard-coding the URL *(which is a reasonable option, but not best practice)*, we should use the webpage itself to find the URL we want to travel to next.
        - *What is the "gotcha" we have to avoid so that our code doesn't break?*
        > ***Write a practice script that goes through each page and prints the url or title.***
        > 
        > *Add a short pause between requests using `time.sleep()`.*
    
3. ***Scrape All Data from All Pages***
    > **Write a function that takes a URL and returns a dataframe.**
    >
    > ```python
    > def scrape_books_toscrape(
    >         url='https://books.toscrape.com/catalogue/page-1.html', 
    >         verbose=True):
    >     """
    >     Returns a pandas dataframe with the scraped data from books.toscrape.com
    >     """
    >     
    >     return
    > ```

## Step 1

### 1a. Capture a single book as a data point.

In [None]:
# Title


In [None]:
# Stars


In [None]:
# Price


In [None]:
# In Stock
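
The four cells above could be filled in along these lines. This is a sketch under assumptions: the tag and class names (`product_pod`, `star-rating`, `price_color`, `availability`) reflect what the site's markup appears to look like and should be verified in the browser, and the book itself is invented.

```python
import re

from bs4 import BeautifulSoup

# Assumed shape of one book entry; the values are hypothetical.
pod_html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#" title="Example Book">Example ...</a></h3>
  <div class="product_price">
    <p class="price_color">£12.34</p>
    <p class="instock availability">
      In stock
    </p>
  </div>
</article>
"""
pod = BeautifulSoup(pod_html, "html.parser").find("article", class_="product_pod")

WORD_TO_NUMBER = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# Title: the attribute holds the full title; the link text may be truncated.
title = pod.h3.a["title"]

# Stars: the rating is assumed to be encoded as a class name such as "Three".
rating_classes = pod.find("p", class_="star-rating")["class"]
stars = next(WORD_TO_NUMBER[c] for c in rating_classes if c in WORD_TO_NUMBER)

# Price: keep only the numeric part so the currency symbol does not matter.
price_text = pod.find("p", class_="price_color").get_text()
price = float(re.search(r"\d+(?:\.\d+)?", price_text).group())

# In Stock: a simple substring check on the availability text.
in_stock = "In stock" in pod.find("p", class_="availability").get_text()

row = {"Title": title, "Stars": stars, "Price": price, "In Stock": in_stock}
print(row)
```

A dictionary keyed by column name is one reasonable way to encode a row, since a list of such dictionaries converts directly into a dataframe.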


### 1b. Capture all books on a given page.

In [None]:
# Define functions to extract each element from the pods.
# - Title
# - Stars
# - Price
# - In Stock
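
A possible shape for these functions, one per column plus a page-level function that applies them to every book. The function names and the class names they search for are assumptions for this sketch, and the two-book page is made up so the code runs offline.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

WORD_TO_NUMBER = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def get_title(pod):
    # The title attribute holds the full title; the link text may be truncated.
    return pod.h3.a["title"]


def get_stars(pod):
    # The rating is assumed to be encoded as a class name such as "Three".
    classes = pod.find("p", class_="star-rating")["class"]
    return next(WORD_TO_NUMBER[c] for c in classes if c in WORD_TO_NUMBER)


def get_price(pod):
    # Keep only the numeric part so the currency symbol does not matter.
    text = pod.find("p", class_="price_color").get_text()
    return float(re.search(r"\d+(?:\.\d+)?", text).group())


def get_in_stock(pod):
    return "In stock" in pod.find("p", class_="availability").get_text()


def parse_page(soup):
    """Return one dict (one future dataframe row) per book on the page."""
    return [
        {
            "Title": get_title(pod),
            "Stars": get_stars(pod),
            "Price": get_price(pod),
            "In Stock": get_in_stock(pod),
        }
        for pod in soup.find_all("article", class_="product_pod")
    ]


# Hypothetical two-book page used to exercise the functions offline.
POD = """
<article class="product_pod">
  <p class="star-rating {rating}"></p>
  <h3><a href="#" title="{title}">{title}</a></h3>
  <p class="price_color">£{price}</p>
  <p class="instock availability">In stock</p>
</article>
"""
html = (POD.format(rating="Three", title="Book One", price="12.34")
        + POD.format(rating="Five", title="Book Two", price="5.00"))

df = pd.DataFrame(parse_page(BeautifulSoup(html, "html.parser")))
print(df)
```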

## Step 2

### 2a. Find a way to traverse the pages.
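
A sketch of following the "next" link instead of hard-coding URLs. It assumes the pager marks the link with `<li class="next">`, which should be confirmed with `Inspect`. Two things commonly trip up a loop like this: the `href` is usually relative, so it must be resolved against the current URL, and the last page has no "next" link at all. The helper name `next_page_url` and the fake `PAGES` dictionary are made up so the loop can run offline.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(soup, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    item = soup.find("li", class_="next")
    if item is None:  # last page: there is no "next" button
        return None
    # Resolve the (possibly relative) href against the page we are on.
    return urljoin(current_url, item.a["href"])


# Hypothetical two-page site, keyed by URL, standing in for real requests.
PAGES = {
    "https://example.com/catalogue/page-1.html":
        '<ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>',
    "https://example.com/catalogue/page-2.html":
        '<ul class="pager"><li class="previous"><a href="page-1.html">previous</a></li></ul>',
}

visited = []
url = "https://example.com/catalogue/page-1.html"
while url is not None:
    soup = BeautifulSoup(PAGES[url], "html.parser")
    visited.append(url)
    print(url)
    url = next_page_url(soup, url)
    # With real requests, a time.sleep(...) pause would go here.
```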

## Step 3

### 3a. Write a function that takes a URL and returns a dataframe.

In [None]:
def scrape_books_toscrape(
        url='https://books.toscrape.com/catalogue/page-1.html', 
        verbose=True):
    """
    Returns a pandas dataframe with the scraped data from books.toscrape.com
    """

    return
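
One way the pieces might fit together, offered as a sketch rather than the reference solution. The `delay` and `fetch` parameters and the underscore-prefixed helpers are additions made for this example (`fetch` exists so the function can be exercised without network access), and the class names are assumptions about the site's markup.

```python
import re
import time
from urllib.parse import urljoin

import pandas as pd
import requests
from bs4 import BeautifulSoup

WORD_TO_NUMBER = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def _fetch(url):
    """Download a page; bytes are returned so Beautiful Soup detects the encoding."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.content


def _parse_book(pod):
    """Turn one assumed <article class="product_pod"> into a row dict."""
    classes = pod.find("p", class_="star-rating")["class"]
    price_text = pod.find("p", class_="price_color").get_text()
    return {
        "Title": pod.h3.a["title"],
        "Stars": next(WORD_TO_NUMBER[c] for c in classes if c in WORD_TO_NUMBER),
        "Price": float(re.search(r"\d+(?:\.\d+)?", price_text).group()),
        "In Stock": "In stock" in pod.find("p", class_="availability").get_text(),
    }


def scrape_books_toscrape(
        url='https://books.toscrape.com/catalogue/page-1.html',
        verbose=True,
        delay=0.5,      # hypothetical parameter: seconds to pause between pages
        fetch=_fetch):  # hypothetical parameter: swap in a fake for offline tests
    """
    Returns a pandas dataframe with the scraped data from books.toscrape.com
    """
    rows = []
    while url is not None:
        if verbose:
            print(f'Scraping {url}')
        soup = BeautifulSoup(fetch(url), 'html.parser')
        rows.extend(_parse_book(pod)
                    for pod in soup.find_all('article', class_='product_pod'))

        # Follow the "next" link if there is one; stop on the last page.
        next_item = soup.find('li', class_='next')
        url = urljoin(url, next_item.a['href']) if next_item is not None else None
        if url is not None:
            time.sleep(delay)  # be polite between requests

    return pd.DataFrame(rows, columns=['Title', 'Stars', 'Price', 'In Stock'])
```

Calling `scrape_books_toscrape()` with the defaults would walk the live site one page at a time, so expect it to take a while with the pause in place.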

# Further Scraping Tools
- <a href='https://www.selenium.dev/'>*Selenium*</a>
- <a href='https://scrapy.org/'>*Scrapy*</a>