# Scraping basics for Playwright

This notebook combines a handful of small scraping techniques with an introduction to Playwright. Along with the class notes, the [scraping section](https://jonathansoma.com/everything/scraping/) on my Everything I Know site might be helpful.

## Imports

Import what you need to use Playwright, and start up a new browser to use for scraping. 

> If you end up opening a lot of Chromes/Chromiums, shutting down the Python kernel with the stop button is an easy way to make them go away! You'll have to re-run your notebook, but at least you won't have sixty icons in your dock.

In [2]:
from playwright.async_api import async_playwright
from io import StringIO
from bs4 import BeautifulSoup

import pandas as pd

In [79]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)

In [80]:
# open a new tab; use `page` to navigate to URLs and interact with them
page = await browser.new_page()
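
If you'd rather not restart the kernel to get rid of open browsers, closing things explicitly is another option. This is only a minimal sketch: it assumes the `browser` and `playwright` objects created above, and the helper name `shut_down` is made up for this example.

```python
async def shut_down(browser, playwright):
    """Close the browser window, then stop the Playwright driver."""
    await browser.close()    # closes the browser and every page opened from it
    await playwright.stop()  # stops the background Playwright process


# In a notebook cell, once you're done scraping:
# await shut_down(browser, playwright)
```

After this runs you would need to re-run the launch cells before scraping again.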

## Scraping by class

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-class.html using their **class name**, printing out the title, subhead, and byline.

In [6]:
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/by-class.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/by-class.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/by-class.html' method='GET'>>

In [9]:
html = await page.content()
html

'<!DOCTYPE html><html><head><script>\n    const html = `\n<h1 class="title">How to Scrape Things</h1>\n<h3 class="subhead">Probably using Playwright</h3>\n<p class="byline">By Jonathan Soma</p>\n`\n\nsetTimeout(() => {\n    console.log(html)\n    document.querySelector(\'body\').innerHTML = html\n}, 250)</script>\n</head><body>\n<h1 class="title">How to Scrape Things</h1>\n<h3 class="subhead">Probably using Playwright</h3>\n<p class="byline">By Jonathan Soma</p>\n</body></html>'

In [10]:
# name the parser explicitly so results don't depend on which parsers are installed
soup_doc = BeautifulSoup(html, "html.parser")

In [18]:
title = soup_doc.find(class_="title").string
print(title)

How to Scrape Things


In [17]:
subhead = soup_doc.find(class_="subhead").string
print(subhead)

Probably using Playwright


In [19]:
byline = soup_doc.find(class_="byline").string
print(byline)

By Jonathan Soma
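
The three lookups above follow the same pattern, so they could also be written as one loop using CSS selectors. This is a sketch rather than the assigned approach: the `sample_html` string is a trimmed copy of the page's body so the cell runs without a browser, and the names `sample_html`, `sample_soup` and `story` are made up for the example.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the page's body, so this runs without a browser
sample_html = """
<h1 class="title">How to Scrape Things</h1>
<h3 class="subhead">Probably using Playwright</h3>
<p class="byline">By Jonathan Soma</p>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# select_one takes a CSS selector; ".title" matches class="title"
story = {
    name: sample_soup.select_one(f".{name}").get_text(strip=True)
    for name in ["title", "subhead", "byline"]
}
print(story)
```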


## Scraping using a single tag

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-list.html, creating a dictionary out of the title, subhead, and byline.

In [25]:
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/by-list.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/by-list.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/by-list.html' method='GET'>>

In [36]:
htmltag = await page.content()
htmltag

"<!DOCTYPE html><html><head><script>\n    const html = `<p>How to Scrape Things</p>\n<p>Probably using Playwright</p>\n<p>By Jonathan Soma</p>\n`\n\nsetTimeout(() => {\n    console.log(html)\n    document.querySelector('body').innerHTML = html\n}, 250)</script>\n</head><body><p>How to Scrape Things</p>\n<p>Probably using Playwright</p>\n<p>By Jonathan Soma</p>\n</body></html>"

In [37]:
soup_doc_tags = BeautifulSoup(htmltag, "html.parser")

In [38]:
all_data = {}
data = soup_doc_tags.find_all("p")

In [39]:
all_data["title"] = data[0].string
all_data["subhead"] = data[1].string
all_data["byline"] = data[2].string

In [40]:
all_data

{'title': 'How to Scrape Things',
 'subhead': 'Probably using Playwright',
 'byline': 'By Jonathan Soma'}

## Waiting

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html just like you did above, but use **wait_for** to wait for the text "Everything has shown up" to appear.

In [47]:
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html' method='GET'>>

In [49]:
await page.get_by_text("Everything has shown up").wait_for()

html3 = await page.content()
html3

'<!DOCTYPE html><html><head><script>\n    const html = `<p>How to Scrape Things</p>\n<p>Probably using Playwright</p>\n<p>By Jonathan Soma</p>\n<p>Everything has shown up</p> \n`\n\nlet pieces = html.split("\\n")\n\nfunction addPiece() {\n    document.querySelector(\'body\').innerHTML = document.querySelector(\'body\').innerHTML + pieces.shift()\n    if(pieces.length > 0) {\n        setTimeout(addPiece, 250)\n    } else {\n        setTimeout(() => {\n            document.querySelector(\'body\').innerHTML = ""\n            pieces = html.split("\\n")\n            setTimeout(addPiece, 1000)\n        }, 2000)\n    }\n}\n\nsetTimeout(addPiece, 250)\n</script>\n</head><body><p>How to Scrape Things</p><p>Probably using Playwright</p><p>By Jonathan Soma</p><p>Everything has shown up</p> </body></html>'

In [50]:
soup_doc_wait = BeautifulSoup(html3, "html.parser")

In [57]:
all_data_wait = {}
data_wait = soup_doc_wait.find_all("p")

In [59]:
data_wait

[<p>How to Scrape Things</p>,
 <p>Probably using Playwright</p>,
 <p>By Jonathan Soma</p>,
 <p>Everything has shown up</p>]

In [60]:
all_data_wait["title"] = data_wait[0].string
all_data_wait["subhead"] = data_wait[1].string
all_data_wait["byline"] = data_wait[2].string
all_data_wait["last"] = data_wait[3].string

In [61]:
all_data_wait

{'title': 'How to Scrape Things',
 'subhead': 'Probably using Playwright',
 'byline': 'By Jonathan Soma',
 'last': 'Everything has shown up'}
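
The wait-then-grab step could be wrapped in a small helper with an explicit timeout. This is a sketch, not part of the assignment: the function name `html_after_text` and the 10-second default are assumptions chosen for illustration, and `page` is assumed to be the Playwright page created earlier.

```python
async def html_after_text(page, text, timeout_ms=10_000):
    """Wait until `text` is visible on the page, then return the page's HTML."""
    # If the text never shows up in time, Playwright raises its TimeoutError
    await page.get_by_text(text).wait_for(state="visible", timeout=timeout_ms)
    return await page.content()


# In a notebook cell:
# html3 = await html_after_text(page, "Everything has shown up")
```

Grabbing the HTML right after the wait matters on this page, since its script clears the body and starts over a couple of seconds after the last line appears.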

## Forms

Display the content of the `h1` tag on http://jonathansoma.com/columbia/interactive-scrape/inputs.html. You'll need to follow the instructions to complete the form first.

In [81]:
# go to page
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/inputs.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/inputs.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/inputs.html' method='GET'>>

In [82]:
# find the dropdown menu and select "Open"
await page.locator("select").select_option('Open')

['Open']

In [83]:
# find the text box and fill in the answer
await page.get_by_placeholder("write cat in here").fill("cat")

In [84]:
# click the submit button
await page.get_by_role("button", name="Click Me").click()

In [86]:
# get the HTML after submitting the form
html = await page.content()
html

'<!DOCTYPE html><html><head><script>\n    const html = `<h1>You did it</h1>`\n</script>\n</head><body><h1>You did it</h1></body></html>'

In [87]:
soup_doc = BeautifulSoup(html, "html.parser")

In [89]:
print(soup_doc.h1.string)

You did it


## Scraping a single table row

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html, creating a dictionary out of the title, subhead, and byline.

In [90]:
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html' method='GET'>>

In [92]:
html = await page.content()
html

"<!DOCTYPE html><html><head><script>\n    const html = `<table>\n  <tr>\n    <td>How to Scrape Things</td>\n    <td>Probably using Playwright</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n</table>\n`\n\nsetTimeout(() => {\n    console.log(html)\n    document.querySelector('body').innerHTML = html\n}, 250)</script>\n</head><body><table>\n  <tbody><tr>\n    <td>How to Scrape Things</td>\n    <td>Probably using Playwright</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n</tbody></table>\n</body></html>"

In [93]:
soup_doc = BeautifulSoup(html, "html.parser")

In [99]:
one_row = soup_doc.find("tr")

In [103]:
table_data = one_row.find_all("td")
table_data[0]

<td>How to Scrape Things</td>

In [104]:
all_data_table = {}

all_data_table["title"] = table_data[0].string
all_data_table["subhead"] = table_data[1].string
all_data_table["byline"] = table_data[2].string

all_data_table

{'title': 'How to Scrape Things',
 'subhead': 'Probably using Playwright',
 'byline': 'By Jonathan Soma'}

## Saving into a dictionary

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [106]:
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html' method='GET'>>

In [107]:
html = await page.content()
soup_doc = BeautifulSoup(html, "html.parser")

In [108]:
one_row = soup_doc.find("tr")

table_data = one_row.find_all("td")
table_data[0]

<td>How to Scrape Things</td>

In [109]:
book = {}

book["title"] = table_data[0].string
book["subhead"] = table_data[1].string
book["byline"] = table_data[2].string

book

{'title': 'How to Scrape Things',
 'subhead': 'Probably using Playwright',
 'byline': 'By Jonathan Soma'}

## Scraping multiple table rows

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html, creating a list of dictionaries. Convert to a pandas dataframe with `pd.json_normalize`. Save it as `output.csv`.

In [110]:
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html' method='GET'>>

In [111]:
table_html = await page.content()
table_html

"<!DOCTYPE html><html><head><script>\n    const html = `<table>\n  <tr>\n    <td>How to Scrape Things</td>\n    <td>Probably using Playwright</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n  <tr>\n    <td>How to Scrape Many Things</td>\n    <td>But, Is It Even Possible?</td>\n    <td>By Sonathan Joma</td>\n  </tr>\n  <tr>\n    <td>The End of Scraping</td>\n    <td>Let's All Use CSV Files</td>\n    <td>By Amos Nathanos</td>\n  </tr>\n</table>\n`\n\nsetTimeout(() => {\n    document.querySelector('body').innerHTML = html\n}, 250)</script>\n</head><body><table>\n  <tbody><tr>\n    <td>How to Scrape Things</td>\n    <td>Probably using Playwright</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n  <tr>\n    <td>How to Scrape Many Things</td>\n    <td>But, Is It Even Possible?</td>\n    <td>By Sonathan Joma</td>\n  </tr>\n  <tr>\n    <td>The End of Scraping</td>\n    <td>Let's All Use CSV Files</td>\n    <td>By Amos Nathanos</td>\n  </tr>\n</tbody></table>\n</body></html>"

In [120]:
soup_doc_table = BeautifulSoup(table_html, "html.parser")

In [121]:
rows = soup_doc_table.find_all("tr")
rows

[<tr>
 <td>How to Scrape Things</td>
 <td>Probably using Playwright</td>
 <td>By Jonathan Soma</td>
 </tr>,
 <tr>
 <td>How to Scrape Many Things</td>
 <td>But, Is It Even Possible?</td>
 <td>By Sonathan Joma</td>
 </tr>,
 <tr>
 <td>The End of Scraping</td>
 <td>Let's All Use CSV Files</td>
 <td>By Amos Nathanos</td>
 </tr>]

In [122]:
full_table = []

In [123]:
for row in rows:
    one_row = {}
    cells = row.find_all("td")
    one_row["title"] = cells[0].string
    one_row["subhead"] = cells[1].string
    one_row["byline"] = cells[2].string
    
    full_table.append(one_row)
    

In [124]:
full_table

[{'title': 'How to Scrape Things',
  'subhead': 'Probably using Playwright',
  'byline': 'By Jonathan Soma'},
 {'title': 'How to Scrape Many Things',
  'subhead': 'But, Is It Even Possible?',
  'byline': 'By Sonathan Joma'},
 {'title': 'The End of Scraping',
  'subhead': "Let's All Use CSV Files",
  'byline': 'By Amos Nathanos'}]

In [None]:
df = pd.json_normalize(full_table)
df.to_csv("output.csv", index = False)
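
The loop above assigns each key by hand. When the cells always come in the same order, pairing a list of column names with the cells using `zip` is a slightly more compact alternative. This is only a sketch: `sample_html` holds two rows copied from the page so it runs without a browser, and the names `columns`, `sample_soup` and `books` are made up for the example.

```python
from bs4 import BeautifulSoup

# Two rows copied from the page, so this runs without a browser
sample_html = """
<table>
  <tr>
    <td>How to Scrape Things</td>
    <td>Probably using Playwright</td>
    <td>By Jonathan Soma</td>
  </tr>
  <tr>
    <td>The End of Scraping</td>
    <td>Let's All Use CSV Files</td>
    <td>By Amos Nathanos</td>
  </tr>
</table>
"""

columns = ["title", "subhead", "byline"]
sample_soup = BeautifulSoup(sample_html, "html.parser")

books = []
for row in sample_soup.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    # zip pairs each column name with the cell in the same position
    books.append(dict(zip(columns, cells)))

print(books)
```

The resulting list of dictionaries has the same shape as `full_table`, so it can be passed to `pd.json_normalize` in the same way.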

## Scraping an actual table

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/the-actual-table.html using pandas' HTML reading function. Save it as `output.csv`.

In [125]:
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/the-actual-table.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/the-actual-table.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/the-actual-table.html' method='GET'>>

In [126]:
html = await page.content()
html

'<!DOCTYPE html><html><head><script>\n    const html = `<table id="booklist">\n  <tr>\n    <td>How to Scrape Things</td>\n    <td>Probably using Playwright</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n  <tr>\n    <td>How to Scrape Many Things</td>\n    <td>But, Is It Even Possible?</td>\n    <td>By Sonathan Joma</td>\n  </tr>\n  <tr>\n    <td>The End of Scraping</td>\n    <td>Let\'s All Use CSV Files</td>\n    <td>By Amos Nathanos</td>\n  </tr>\n</table>\n`\n\nsetTimeout(() => {\n    console.log(html)\n    document.querySelector(\'body\').innerHTML = html\n}, 250)</script>\n</head><body><table id="booklist">\n  <tbody><tr>\n    <td>How to Scrape Things</td>\n    <td>Probably using Playwright</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n  <tr>\n    <td>How to Scrape Many Things</td>\n    <td>But, Is It Even Possible?</td>\n    <td>By Sonathan Joma</td>\n  </tr>\n  <tr>\n    <td>The End of Scraping</td>\n    <td>Let\'s All Use CSV Files</td>\n    <td>By Amos Nathanos</td>\n  </tr>\n</t

In [129]:
tables = pd.read_html(StringIO(html))
tables[0]

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | How to Scrape Things | Probably using Playwright | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |


In [130]:
filename = "output.csv"

tables[0].to_csv(filename, index = False)
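
Because the table has no header row, pandas numbers the columns 0, 1 and 2. A possible follow-up, sketched here on a two-row copy of the table, is to pick the table by its `id` with the `attrs` argument and then name the columns before saving. The column names and the variable names `sample_html` and `books_df` are assumptions for this example, and `pd.read_html` needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Two rows copied from the page, so this runs without a browser
sample_html = """
<table id="booklist">
  <tr>
    <td>How to Scrape Things</td>
    <td>Probably using Playwright</td>
    <td>By Jonathan Soma</td>
  </tr>
  <tr>
    <td>How to Scrape Many Things</td>
    <td>But, Is It Even Possible?</td>
    <td>By Sonathan Joma</td>
  </tr>
</table>
"""

# attrs narrows the search to tables whose attributes match
tables = pd.read_html(StringIO(sample_html), attrs={"id": "booklist"})
books_df = tables[0]

# No header row in the HTML, so give the numbered columns real names
books_df.columns = ["title", "subhead", "byline"]
print(books_df)
```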

## `html.parser` vs `html5lib`

Here is some good HTML:

```python
html_good = """
<h1>This is a title</h1>
<h2>This is a subhead</h2>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
"""
```

Here is some bad HTML:

```python
html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""
```

When you're using BeautifulSoup, you can use different parsers, including `html.parser`, `html5lib` and `lxml`. Try both the good HTML and bad HTML with each parser and use `print(soup_doc.prettify())` to view the difference.

What is different about each one?

> You'll need to `pip install` both html5lib and lxml. Since you aren't importing them yourself (BeautifulSoup loads them for you), you'll need to do **Kernel > Restart** and run from the top after installing for them to work.

In [4]:
#pip install html5lib

In [5]:
#pip install lxml

In [3]:
html_good = """
<h1>This is a title</h1>
<h2>This is a subhead</h2>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
"""


html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""

### Good HTML, two parsers

In [6]:
soup_doc_good_html5lib = BeautifulSoup(html_good, "html5lib")
soup_doc_good_lxml = BeautifulSoup(html_good, "lxml")

In [15]:
print(soup_doc_good_html5lib.prettify())

<html>
 <head>
 </head>
 <body>
  <h1>
   This is a title
  </h1>
  <h2>
   This is a subhead
  </h2>
  <p>
   This is a paragraph
  </p>
  <p>
   This is another paragraph
  </p>
 </body>
</html>



In [14]:
print(soup_doc_good_lxml.prettify())

<html>
 <body>
  <h1>
   This is a title
  </h1>
  <h2>
   This is a subhead
  </h2>
  <p>
   This is a paragraph
  </p>
  <p>
   This is another paragraph
  </p>
 </body>
</html>



### Bad HTML, two parsers

In [9]:
soup_doc_bad_html5lib = BeautifulSoup(html_bad, "html5lib")
soup_doc_bad_lxml = BeautifulSoup(html_bad, "lxml")

In [13]:
print(soup_doc_bad_html5lib.prettify())

<html>
 <head>
 </head>
 <body>
  <h1>
   This is a title
  </h1>
  <h2>
   This is a subhead
   <p>
    This is a paragraph
   </p>
   <p>
    This is another paragraph
   </p>
  </h2>
 </body>
</html>



In [12]:
print(soup_doc_bad_lxml.prettify())

<html>
 <body>
  <h1>
   This is a title
   <h2>
    This is a subhead
   </h2>
  </h1>
  <p>
   This is a paragraph
  </p>
  <p>
   This is another paragraph
  </p>
 </body>
</html>



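It seems to me that when the HTML is well formed, the parsers give nearly identical results: here the only difference is that `html5lib` adds an empty `<head>`. When the HTML itself is bad, they repair it differently. `html5lib` closed the `<h1>` and nested both paragraphs inside the `<h2>`, while `lxml` nested the `<h2>` inside the `<h1>` and left the paragraphs as siblings of the `<h1>`. So with messy HTML you might have to check how the parsed tree looks before you can reach the right tags. To me, `html5lib` seems to give the more intuitive structure here.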
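
The cells above compare `html5lib` and `lxml`; the third parser mentioned earlier, `html.parser`, can be tried the same way. This is a sketch of that extra comparison, with the variable name `soup_doc_bad_builtin` made up for the example. The output is left for you to inspect rather than shown here.

```python
from bs4 import BeautifulSoup

html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""

# html.parser ships with Python, so there is nothing extra to install
soup_doc_bad_builtin = BeautifulSoup(html_bad, "html.parser")
print(soup_doc_bad_builtin.prettify())

# find_all searches the whole tree, so it finds both paragraphs
# no matter how the parser chose to nest them
paragraphs = soup_doc_bad_builtin.find_all("p")
print(len(paragraphs))
```

Searching with `find_all` instead of relying on a tag's exact position is one way to stay less dependent on how a particular parser repairs bad HTML.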