# How to Scrape HTML and Read Tables in Web Pages
See [the inspiration](https://scrapfly.io/blog/answers/how-to-scrape-tables-with-beautifulsoup) for this notebook.
We will be using the same source of data.

Read more about BeautifulSoup and HTML parsing [here](https://scrapfly.io/blog/posts/web-scraping-with-python-beautifulsoup).

In [None]:
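# requests and pandas are imported below; lxml is the parser pd.read_html tries first
!pip install beautifulsoup4 requests pandas lxml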

In [None]:
from bs4 import BeautifulSoup
import requests 
import pandas as pd
import re  # standard-library module; the third-party "regex" package is not installed above
from io import StringIO

## Get the Page

In [None]:
# uri = 'https://scrapfly.io/blog/answers/how-to-scrape-tables-with-beautifulsoup'
uri = 'https://web-scraping.dev/product/1'

In [None]:
# get the whole page
response = requests.get(uri)

In [None]:
# get the raw bytes of the response body
html = response.content
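
The request above does not check whether it succeeded, so an error page could be parsed by mistake. A minimal sketch of a more defensive fetch follows, assuming a 10-second timeout is acceptable. `fetch_html` is a hypothetical helper name, not part of `requests`.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Return the decoded body of `url`, raising on HTTP errors.

    Illustrative helper only; the name and defaults are assumptions.
    """
    # Reuse a caller-supplied session if given, otherwise the module-level API.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(uri)` could stand in for the two cells above. It returns a `str` rather than the bytes in `response.content`; `BeautifulSoup` accepts either.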

## Parse HTML

In [None]:
# create soup object
soup = BeautifulSoup(html, "html.parser")

In [None]:
# Extract what you want, e.g. the title
soup.title.text

### Search for HTML Tags

In [None]:
# extract links to other pages; they can be used to keep scraping related pages
links = soup.find_all('a')
for link in links:
    print(link.get('href'))

In [None]:
len(links)
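
Some `href` values may be relative paths or non-web schemes, and `link.get('href')` returns `None` when the attribute is missing. A sketch of normalising them with the standard library before following them is shown here. The `sample` list and the `absolute_links` helper are made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, keeping unique http(s) URLs in order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # skip None and empty strings
            continue
        url = urljoin(base_url, href)
        if urlparse(url).scheme not in ("http", "https"):
            continue  # drop mailto:, javascript: and similar
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Hypothetical hrefs, standing in for [link.get('href') for link in links]
sample = ["/product/2", "https://web-scraping.dev/docs", None, "mailto:a@b.c", "/product/2"]
print(absolute_links("https://web-scraping.dev/product/1", sample))
```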

In [None]:
# search for sections or headings
headings = soup.find_all(re.compile("^h[1-6]$"))

In [None]:
headings
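
Passing a compiled pattern to `find_all` matches it against tag names, so `^h[1-6]$` selects `h1` through `h6` and nothing else. A small sketch on an inline snippet (invented for illustration) that also records each heading's level:

```python
import re
from bs4 import BeautifulSoup

# Invented snippet; the headings on the real page will differ.
snippet = "<h1>Shop</h1><p>intro</p><h3> Reviews </h3><header>not a heading</header>"
demo = BeautifulSoup(snippet, "html.parser")

# (level, text) pairs; tag.name is e.g. "h3", so the level is its second character
outline = [
    (int(tag.name[1]), tag.get_text(strip=True))
    for tag in demo.find_all(re.compile(r"^h[1-6]$"))
]
print(outline)
```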

In [None]:
# text of the first h4 heading, wrapped in a list
sections = [soup.find('h4').text]

In [None]:
sections

## Search for Tables

### Option 1: From HTML to DataFrame

In [None]:
# search for tag table
table = soup.find_all('table')

In [None]:
# see how many have been found
len(table)

In [None]:
# get one table
tab_html = table[1]
tab_html

In [None]:
# initialise lists to store table headers and rows in
headers = []
rows = []

In [None]:
# find rows (search tag tr)
for i, row in enumerate(tab_html.find_all('tr')):
    if i == 0:
        # the first row is the header row (tag th)
        headers = [el.text.strip() for el in row.find_all('th')]      # first list
    else:
        # the rest of the rows contain cells (tag td)
        rows.append([el.text.strip() for el in row.find_all('td')])   # second list
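
The loop above assumes the header cells sit in the first `<tr>`. A sketch of the same idea wrapped in a function, which instead treats any row made only of `<th>` cells as the header, is run here on a small invented table. `table_to_records` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def table_to_records(table_tag):
    """Split a <table> tag into (headers, rows) lists of stripped cell text."""
    headers, rows = [], []
    for tr in table_tag.find_all("tr"):
        th_cells = tr.find_all("th")
        td_cells = tr.find_all("td")
        if th_cells and not td_cells:
            # a row made only of <th> cells is treated as the header row
            headers = [th.get_text(strip=True) for th in th_cells]
        elif td_cells:
            rows.append([td.get_text(strip=True) for td in td_cells])
    return headers, rows


# Invented two-row table, not taken from the scraped page
html_demo = """
<table>
  <tr><th>Version</th><th>Package Weight</th></tr>
  <tr><td> Pack 1 </td><td>1,00 kg</td></tr>
  <tr><td>Pack 2</td><td>2,11 kg</td></tr>
</table>
"""
demo_table = BeautifulSoup(html_demo, "html.parser").find("table")
print(table_to_records(demo_table))
```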

In [148]:
# print headers
headers

['Version', 'Package Weight', 'Package Dimension', 'Variants', 'Delivery Type']

In [149]:
# print rows
for row in rows:
    print(row)

['Pack 1', '1,00 kg', '100x230 cm', '6 available', '1 Day shipping']
['Pack 2', '2,11 kg', '200x460 cm', '6 available', '1 Day shipping']
['Pack 3', '3,22 kg', '300x690 cm', '6 available', '1 Day shipping']
['Pack 4', '4,33 kg', '400x920 cm', '6 available', '1 Day shipping']
['Pack 5', '5,44 kg', '500x1150 cm', '6 available', '1 Day shipping']


In [150]:
# store in pandas
# pass the flat list of names; columns=[headers] would create a MultiIndex
df1 = pd.DataFrame(rows, columns=headers)

In [151]:
df1

|   | Version | Package Weight | Package Dimension | Variants | Delivery Type |
|---|---------|----------------|-------------------|----------|---------------|
| 0 | Pack 1 | 1,00 kg | 100x230 cm | 6 available | 1 Day shipping |
| 1 | Pack 2 | 2,11 kg | 200x460 cm | 6 available | 1 Day shipping |
| 2 | Pack 3 | 3,22 kg | 300x690 cm | 6 available | 1 Day shipping |
| 3 | Pack 4 | 4,33 kg | 400x920 cm | 6 available | 1 Day shipping |
| 4 | Pack 5 | 5,44 kg | 500x1150 cm | 6 available | 1 Day shipping |


### Option 2: From String to DataFrame

In [145]:
# parse every table in the web page into a list of DataFrames
tables = pd.read_html(StringIO(response.text)) 

In [146]:
len(tables)

2

In [147]:
df2 = tables[1]
df2

|   | Version | Package Weight | Package Dimension | Variants | Delivery Type |
|---|---------|----------------|-------------------|----------|---------------|
| 0 | Pack 1 | 1,00 kg | 100x230 cm | 6 available | 1 Day shipping |
| 1 | Pack 2 | 2,11 kg | 200x460 cm | 6 available | 1 Day shipping |
| 2 | Pack 3 | 3,22 kg | 300x690 cm | 6 available | 1 Day shipping |
| 3 | Pack 4 | 4,33 kg | 400x920 cm | 6 available | 1 Day shipping |
| 4 | Pack 5 | 5,44 kg | 500x1150 cm | 6 available | 1 Day shipping |


Both DataFrames are now ready to use as a source for data analysis.
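
For numeric analysis, columns such as `Package Weight` still hold strings with a decimal comma and a unit. A minimal sketch of converting them, assuming every value looks like `'1,00 kg'`, is given below. `parse_weight_kg` is a hypothetical helper.

```python
def parse_weight_kg(text):
    """Convert a string such as '2,11 kg' to a float number of kilograms."""
    number, unit = text.strip().split()
    if unit != "kg":
        raise ValueError(f"unexpected unit: {unit!r}")
    # the table uses a decimal comma, which float() does not accept
    return float(number.replace(",", "."))


weights = ["1,00 kg", "2,11 kg", "3,22 kg"]  # values copied from the table above
print([parse_weight_kg(w) for w in weights])
```

With pandas, the helper could be applied to a whole column, for example `df2['Package Weight'].map(parse_weight_kg)`.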