# Web Scraping Tutorial

This tutorial teaches you how to use Python to scrape and extract data from a web page. We will use two packages: `requests` to download the web page and `BeautifulSoup` to extract the data.

Many good references on web scraping are available online. I recommend the following resources:
1. Automate the Boring Stuff with Python by Al Sweigart (2020) has a chapter on web scraping, which can be read [online](https://automatetheboringstuff.com/2e/chapter12/).
2. Web Scraping with Python by Ryan Mitchell (2018) is a somewhat older book but provides a comprehensive guide to the topic.

**Goal:** We will extract cryptocurrency market prices from the Etherscan website: https://etherscan.io/tokens

Your first step should always be to familiarize yourself with the website you want to scrape. Take a look at the website and try to inspect the HTML elements on the webpage.

## Step 1: Scrape a web page

Now we are ready to download the web page we want to get the data from, using the `requests` package. We will use the following function and response attributes:

* `requests.get('URL')` - make a request to the specified URL
* `r.status_code` - get the status code of the response
* `r.content` - get the binary content of the page

More functions in the `requests` package are available in [its documentation](https://requests.readthedocs.io/en/latest/).
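The calls listed above can be sketched as follows, assuming the `requests` package is installed. So that the example runs without depending on a live site, it serves a tiny made-up page from a local standard-library server and requests that instead. The page body, the short `User-Agent` value, and the output file name `demo_page.html` are illustrative assumptions, not part of the tutorial's data.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

# Hypothetical page body, used only so the example runs offline.
PAGE = b"<html><head><title>Demo</title></head><body><h1>Hello</h1></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/"

# Some sites reject requests that do not look like they come from a browser.
headers = {"User-Agent": "Mozilla/5.0"}
r = requests.get(url, headers=headers, timeout=10)

print(type(r))                    # the Response class
print(r.status_code)              # 200 means the request succeeded
print(r.headers["Content-Type"])  # response headers behave like a dict
print(r.content[:20])             # raw bytes
print(r.text[:20])                # decoded text

# Save the raw bytes so the page can be parsed later without re-downloading.
with open("demo_page.html", "wb") as f:
    f.write(r.content)

server.shutdown()
server.server_close()
```

For the tutorial's target, the same `requests.get` call would take `https://etherscan.io/tokens` as the URL; whether the site accepts the request depends on its current anti-bot rules.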

In [None]:
# First, we will import the requests package
import requests

In [None]:
# Request the webpage
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}


In [None]:
# Type of the request object we've got


In [None]:
# Check if the request was successful


In [None]:
# Get the header of the web page


In [None]:
# Get the content of the web page


In [None]:
# Get the text in the web page


In [None]:
# Save the content of web page


## Step 2: Load the web page as BeautifulSoup object

After downloading the web page and saving it to the local disk, we will use the `BeautifulSoup` package to parse the HTML file and access its content. We will use the following functions:

**1. Load the web page into BeautifulSoup**
* `soup = BeautifulSoup(html_doc, 'html.parser')` - parse the HTML content into a BeautifulSoup object
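As a rough sketch of this step, the snippet below parses a small inline HTML string. The string is a hypothetical stand-in for the page saved in Step 1; with a real file you would read its contents and pass them to `BeautifulSoup` in the same way.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML downloaded in Step 1.
html_doc = """
<html>
  <head><title>Token Tracker</title></head>
  <body>
    <h1 class="page-title">Tokens</h1>
    <p>Top tokens by market cap.</p>
  </body>
</html>
"""

# 'html.parser' is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html_doc, "html.parser")

print(type(soup))       # the BeautifulSoup class
print(soup.prettify())  # indented view of the parsed tree

# All text in the page, with whitespace-only pieces dropped.
page_text = soup.get_text(" ", strip=True)
print(page_text)
```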

In [None]:
# First, we will import BeautifulSoup from the bs4 package
from bs4 import BeautifulSoup

In [None]:
# Load the web page and parse it to BeautifulSoup


In [None]:
# Check the type of our soup object


In [None]:
# Print the content of the web page


In [None]:
# Print all the text in the webpage


**2. Get the content of the element**
* `soup.title` - get the title of the page
* `soup.title.string` - get the string in the title element
* `soup.h1` - get the H1 element in the web page
* `soup.h1.attrs` - get all attributes in the H1 element
* `soup.h1['class']` - get the class attribute in the H1 element
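A minimal illustration of these accessors on a made-up page (the markup and class names are assumptions chosen for the example). Note that `class` can hold several values, so BeautifulSoup returns it as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html_doc = (
    "<html><head><title>Token Tracker</title></head>"
    '<body><h1 class="page-title h5" id="top">Tokens</h1></body></html>'
)
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title)         # the whole <title> element
print(soup.title.string)  # only the text inside it
print(soup.h1)            # the first <h1> element
print(soup.h1.attrs)      # all attributes as a dict
print(soup.h1["class"])   # the class attribute, as a list of class names
```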

In [None]:
# Get the title of the page


In [None]:
# Other HTML elements work too


In [None]:
# Get the class attribute of an element


**3. Look for the element in the web page**
* `soup.find('HTML_tag')` - get the first element with the specified HTML tag
* `soup.find_all('HTML_tag')` - get the list of elements that have the specified HTML tag
* `soup.select('CSS_selector')` - get the list of elements with the specified [CSS selector](https://www.w3schools.com/cssref/css_selectors.asp)
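The three search functions can be compared on a small made-up page. The `token-name` class and the link targets below are hypothetical; the real Etherscan page uses its own markup, which you should check with the browser's inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names on the real page will differ.
html_doc = """
<html><head><title>Token Tracker</title></head>
<body>
  <img src="/img/a.png" alt="Alpha">
  <img src="/img/b.png" alt="Beta">
  <a class="token-name" href="/token/0x1">Alpha (ALP)</a>
  <a class="token-name" href="/token/0x2">Beta (BET)</a>
  <a href="/about">About</a>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

title = soup.find("title")                # first matching element, or None
images = soup.find_all("img")             # list of every <img> element
name_links = soup.select("a.token-name")  # <a> elements with class "token-name"

image_sources = [img["src"] for img in images]
token_names = [a.get_text(strip=True) for a in name_links]

print(title.string)
print(image_sources)
print(token_names)
```

A CSS selector is usually the most compact choice when the target elements share a class, while `find_all` is enough when the tag name alone identifies them.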

In [None]:
# We can also get the page title using soup.find() function


In [None]:
# Get all the elements with image tag


In [None]:
# Get all the token names on the web page


## Step 3: Extract the data from the table

Now, we will extract the cryptocurrency market prices from the table.

In [None]:
# Get the table element in the web page


In [None]:
# Get the table headers


Loop over each row in the table and extract the data from each column in the row.
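One way to write this loop is sketched below on a small hypothetical table. The column names and values are invented for the example, and the real table may nest its cells differently. Passing `"\n"` to `get_text` keeps the pieces of a multi-part cell separated by newlines, which makes them easy to split later.

```python
from bs4 import BeautifulSoup

# Hypothetical table; values and structure are made up for illustration.
html_doc = """
<table>
  <thead><tr><th>#</th><th>Token</th><th>Price</th></tr></thead>
  <tbody>
    <tr>
      <td>1</td>
      <td><div>Alpha</div><span>(ALP)</span></td>
      <td><div>$1.00</div><div>0.0005 ETH</div></td>
    </tr>
    <tr>
      <td>2</td>
      <td><div>Beta</div><span>(BET)</span></td>
      <td><div>$250.10</div><div>0.1251 ETH</div></td>
    </tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")
table = soup.find("table")

# Header cells give the column names.
column_names = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]

data = []
for row in table.find("tbody").find_all("tr"):
    # Get all the columns in the row
    cols = row.find_all("td")
    # Extract the text of each column, joining nested pieces with a newline
    data.append([col.get_text("\n", strip=True) for col in cols])

print(column_names)
print(data)
```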

In [None]:
# For loop over each row in the table


    # Get all the columns in the row
    
    
    # For loop over each column and extract the string
    
    

## Step 4: Create a DataFrame table and write to a CSV file

In [None]:
import pandas as pd

In [None]:
# How many rows in the extracted data


In [None]:
# Convert the data list to DataFrame object


Split the columns on the newline character ("\n")
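A possible sketch of building the DataFrame and splitting its combined columns with `Series.str.split(..., expand=True)`. The rows and all column names here are hypothetical placeholders for the data extracted in Step 3.

```python
import pandas as pd

# Hypothetical extracted rows; each cell packs two values separated by a newline.
column_names = ["Token", "Price", "Holders"]
data = [
    ["Alpha\n(ALP)", "$1.00\n0.0005 ETH", "1,234\n0.5%"],
    ["Beta\n(BET)", "$250.10\n0.1251 ETH", "56,789\n-1.2%"],
]

df = pd.DataFrame(data, columns=column_names)
print(len(df))  # number of rows in the extracted data

# expand=True returns one new column per piece of the split string.
df[["Name", "Symbol"]] = df["Token"].str.split("\n", expand=True)
df[["Price (USD)", "Price (ETH)"]] = df["Price"].str.split("\n", expand=True)
df[["Holder Count", "Holders Change"]] = df["Holders"].str.split("\n", expand=True)

# The original combined columns are no longer needed.
df = df.drop(columns=["Token", "Price", "Holders"])
print(df)
```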

In [None]:
# Split between token name and token symbol


In [None]:
# Split between the USD and ETH prices


In [None]:
# Split the number of holders and percent changes


Convert the string columns into numerical columns
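The conversion could look roughly like this, assuming the numbers may carry a currency symbol, a unit, a percent sign, or thousands separators. The pattern and the column names are illustrative assumptions; adjust them to the strings actually present in your table.

```python
import pandas as pd

# Hypothetical string columns as they might look after splitting.
df = pd.DataFrame({
    "Price (USD)": ["$1.00", "$2,500.10"],
    "Price (ETH)": ["0.0005 ETH", "1.2510 ETH"],
    "Holders Change": ["0.5%", "-1.2%"],
})

# Optional minus sign, digits with optional commas, optional decimal part.
number_pattern = r"(-?[\d,]*\.?\d+)"

for col_name in df.columns:
    # Pull out the first number found in each string.
    extracted = df[col_name].str.extract(number_pattern, expand=False)
    # Drop thousands separators, then convert the string to a float.
    df[col_name] = extracted.str.replace(",", "", regex=False).astype(float)

print(df.dtypes)
print(df)
```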

In [None]:
# Regular expression pattern to match numbers


In [None]:
# For each numerical column, convert the string to float numbers


    # Use df[col_name].str.extract() to extract the numbers and
    # .astype(float) to convert the string to float numbers
    
    
    

Last but not least, remove the brackets from the token symbol column
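A short sketch of this cleanup, assuming the symbols are wrapped in round brackets such as `(ALP)` and stored in a column hypothetically named `Symbol`:

```python
import pandas as pd

# Hypothetical symbol column with surrounding brackets.
df = pd.DataFrame({"Symbol": ["(ALP)", "(BET)"]})

# str.strip removes the listed characters from both ends of each string.
df["Symbol"] = df["Symbol"].str.strip("()")
print(df)
```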

Write the DataFrame to a CSV file
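Writing the result could be done as sketched below. The data, the file name `tokens_demo.csv`, and the use of the system temporary directory are assumptions made so the example does not clutter your working folder. `index=False` leaves the row index out of the file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned table.
df = pd.DataFrame({
    "Name": ["Alpha", "Beta"],
    "Symbol": ["ALP", "BET"],
    "Price (USD)": [1.0, 2500.1],
})

# Hypothetical output location; any writable path works.
out_path = Path(tempfile.gettempdir()) / "tokens_demo.csv"
df.to_csv(out_path, index=False)

# Read the file back to confirm what was written.
check = pd.read_csv(out_path)
print(check)
```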