# Scraping basics for Playwright

This notebook combines a handful of small scraping techniques with an introduction to Playwright. Along with the class notes, the [scraping section](https://jonathansoma.com/everything/scraping/) on my Everything I Know site might be helpful.

## Imports

Import what you need to use Playwright, and start up a new browser to use for scraping. 

> If you end up opening a lot of Chromes/Chromiums, shutting down the Python kernel with the stop button is an easy way to make them go away! You'll have to re-run your notebook, but at least you won't have sixty icons in your dock.

In [1]:
from playwright.async_api import async_playwright

In [2]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)

In [3]:
page = await browser.new_page()

## Scraping by class

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-class.html using each element's **class name**, printing out the title, subhead, and byline.

In [58]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/by-class.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/by-class.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/by-class.html' method='GET'>>

In [59]:
html = await page.content()

In [60]:
from bs4 import BeautifulSoup

In [61]:
soup_doc = BeautifulSoup(html)
html

'<!DOCTYPE html><html><head><script>\n    const html = `\n<h1 class="title">How to Scrape Things</h1>\n<h3 class="subhead">Probably using Playwright</h3>\n<p class="byline">By Jonathan Soma</p>\n`\n\nsetTimeout(() => {\n    console.log(html)\n    document.querySelector(\'body\').innerHTML = html\n}, 250)</script>\n</head><body>\n<h1 class="title">How to Scrape Things</h1>\n<h3 class="subhead">Probably using Playwright</h3>\n<p class="byline">By Jonathan Soma</p>\n</body></html>'
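The cells above stop once the page's HTML is in hand. As a minimal sketch of the class-based lookup itself, the snippet below runs against the same three tags pasted in as an inline string (the `sample_html` name is made up for illustration), so it works without a browser. It uses `select_one` with CSS class selectors, which is one option alongside `find(..., class_=...)`.

```python
from bs4 import BeautifulSoup

# Inline copy of the tags the page renders (illustrative stand-in for `html`)
sample_html = """
<h1 class="title">How to Scrape Things</h1>
<h3 class="subhead">Probably using Playwright</h3>
<p class="byline">By Jonathan Soma</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# ".title" is a CSS selector meaning "an element whose class is title"
title = soup.select_one(".title").get_text(strip=True)
subhead = soup.select_one(".subhead").get_text(strip=True)
byline = soup.select_one(".byline").get_text(strip=True)

print(title)
print(subhead)
print(byline)
```

Selecting by class alone means the lookup keeps working even if the page swaps an `h3` for an `h2`, as long as the class names stay the same.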

## Scraping using a single tag

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-list.html, creating a dictionary out of the title, subhead, and byline.

In [8]:
print(soup_doc.prettify())

<!DOCTYPE html>
<html>
 <head>
  <script>
   const html = `
<h1 class="title">How to Scrape Things</h1>
<h3 class="subhead">Probably using Playwright</h3>
<p class="byline">By Jonathan Soma</p>
`

setTimeout(() => {
    console.log(html)
    document.querySelector('body').innerHTML = html
}, 250)
  </script>
 </head>
 <body>
  <h1 class="title">
   How to Scrape Things
  </h1>
  <h3 class="subhead">
   Probably using Playwright
  </h3>
  <p class="byline">
   By Jonathan Soma
  </p>
 </body>
</html>



In [9]:
basic_scrape = {}

title = soup_doc.find('h1', class_='title').string
subhead = soup_doc.find('h3', class_='subhead').string
byline = soup_doc.find('p', class_='byline').string

basic_scrape['title'] = title
basic_scrape['subhead'] = subhead
basic_scrape['byline'] = byline

print(basic_scrape)

{'title': 'How to Scrape Things', 'subhead': 'Probably using Playwright', 'byline': 'By Jonathan Soma'}


## Waiting

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html just like you did above, but use **wait_for** to wait for the text "Everything has shown up" to appear.

In [10]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html' method='GET'>>

In [11]:
await page.wait_for_selector("text='Everything has shown up'")

<JSHandle preview=JSHandle@node>

In [12]:
html = await page.content()

In [13]:
soup_doc = BeautifulSoup(html)
html

'<!DOCTYPE html><html><head><script>\n    const html = `<p>How to Scrape Things</p>\n<p>Probably using Playwright</p>\n<p>By Jonathan Soma</p>\n<p>Everything has shown up</p> \n`\n\nlet pieces = html.split("\\n")\n\nfunction addPiece() {\n    document.querySelector(\'body\').innerHTML = document.querySelector(\'body\').innerHTML + pieces.shift()\n    if(pieces.length > 0) {\n        setTimeout(addPiece, 250)\n    } else {\n        setTimeout(() => {\n            document.querySelector(\'body\').innerHTML = ""\n            pieces = html.split("\\n")\n            setTimeout(addPiece, 1000)\n        }, 2000)\n    }\n}\n\nsetTimeout(addPiece, 250)\n</script>\n</head><body>\n\n<p>How to Scrape Things</p><p>Probably using Playwright</p><p>By Jonathan Soma</p><p>Everything has shown up</p> </body></html>'

In [14]:
print(soup_doc.prettify())

<!DOCTYPE html>
<html>
 <head>
  <script>
   const html = `<p>How to Scrape Things</p>
<p>Probably using Playwright</p>
<p>By Jonathan Soma</p>
<p>Everything has shown up</p> 
`

let pieces = html.split("\n")

function addPiece() {
    document.querySelector('body').innerHTML = document.querySelector('body').innerHTML + pieces.shift()
    if(pieces.length > 0) {
        setTimeout(addPiece, 250)
    } else {
        setTimeout(() => {
            document.querySelector('body').innerHTML = ""
            pieces = html.split("\n")
            setTimeout(addPiece, 1000)
        }, 2000)
    }
}

setTimeout(addPiece, 250)
  </script>
 </head>
 <body>
  <p>
   How to Scrape Things
  </p>
  <p>
   Probably using Playwright
  </p>
  <p>
   By Jonathan Soma
  </p>
  <p>
   Everything has shown up
  </p>
 </body>
</html>



In [15]:
soup_doc.text

'\n\nHow to Scrape ThingsProbably using PlaywrightBy Jonathan SomaEverything has shown up '
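The text above runs together because `.text` joins every string in the document with no separator. Here is a small sketch, assuming the same four paragraphs as an inline string (`sample_html` is a stand-in for the page's HTML), of how `get_text` with a `separator` keeps the pieces apart:

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the rendered page body
sample_html = (
    "<p>How to Scrape Things</p>"
    "<p>Probably using Playwright</p>"
    "<p>By Jonathan Soma</p>"
    "<p>Everything has shown up</p>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# separator goes between strings; strip=True trims whitespace around each one
joined = soup.get_text(separator="\n", strip=True)
lines = joined.split("\n")

print(joined)
```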

In [16]:
text = soup_doc.find_all('p')
for line in text:
    print(line.string)

How to Scrape Things
Probably using Playwright
By Jonathan Soma
Everything has shown up


## Forms

Display the content of the `h1` tag on http://jonathansoma.com/columbia/interactive-scrape/inputs.html. You'll need to follow the instructions to complete the form first.

In [17]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/inputs.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/inputs.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/inputs.html' method='GET'>>

In [18]:
await page.locator("select").select_option('Open')

['Open']

In [19]:
await page.get_by_placeholder("write cat in here").fill("cat")

In [20]:
await page.get_by_role("button", name="Click me").click()

## Scraping a single table row

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html, creating a dictionary out of the title, subhead, and byline.

In [21]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html' method='GET'>>

In [22]:
html = await page.content()
soup_doc = BeautifulSoup(html)
print(soup_doc.prettify())

<!DOCTYPE html>
<html>
 <head>
  <script>
   const html = `<table>
  <tr>
    <td>How to Scrape Things</td>
    <td>Probably using Playwright</td>
    <td>By Jonathan Soma</td>
  </tr>
</table>
`

setTimeout(() => {
    console.log(html)
    document.querySelector('body').innerHTML = html
}, 250)
  </script>
 </head>
 <body>
  <table>
   <tbody>
    <tr>
     <td>
      How to Scrape Things
     </td>
     <td>
      Probably using Playwright
     </td>
     <td>
      By Jonathan Soma
     </td>
    </tr>
   </tbody>
  </table>
 </body>
</html>



In [23]:
text = soup_doc.find_all('td')

soma_scrape = {}
soma_scrape['title'] = text[0].text
soma_scrape['subhead'] = text[1].text
soma_scrape['byline'] = text[2].text
soma_scrape

{'title': 'How to Scrape Things',
 'subhead': 'Probably using Playwright',
 'byline': 'By Jonathan Soma'}
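The same dictionary can be built without indexing each cell by hand. This is a sketch rather than the notebook's own approach: it pairs a list of field names with the cells using `zip`, against an inline copy of the table row (`sample_html` and `fields` are illustrative names).

```python
from bs4 import BeautifulSoup

# Inline copy of the single-row table (illustrative stand-in for `html`)
sample_html = """
<table>
  <tr>
    <td>How to Scrape Things</td>
    <td>Probably using Playwright</td>
    <td>By Jonathan Soma</td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Field names in the same order as the cells in the row
fields = ["title", "subhead", "byline"]
cells = soup.find_all("td")

# zip pairs the first name with the first cell, and so on
row = {field: cell.get_text(strip=True) for field, cell in zip(fields, cells)}
print(row)
```

If the table later gains a column, only the `fields` list needs to change.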

## Saving into a dictionary

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [24]:
book = {}
book['title'] = text[0].text
book['subhead'] = text[1].text
book['byline'] = text[2].text
book

{'title': 'How to Scrape Things',
 'subhead': 'Probably using Playwright',
 'byline': 'By Jonathan Soma'}

## Scraping multiple table rows

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html, creating a list of dictionaries. Convert to a pandas dataframe with `pd.json_normalize`. Save it as `output.csv`.

In [25]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html' method='GET'>>

In [26]:
html = await page.content()
soup_doc = BeautifulSoup(html)
print(soup_doc.prettify())

<!DOCTYPE html>
<html>
 <head>
  <script>
   const html = `<table>
  <tr>
    <td>How to Scrape Things</td>
    <td>Probably using Playwright</td>
    <td>By Jonathan Soma</td>
  </tr>
  <tr>
    <td>How to Scrape Many Things</td>
    <td>But, Is It Even Possible?</td>
    <td>By Sonathan Joma</td>
  </tr>
  <tr>
    <td>The End of Scraping</td>
    <td>Let's All Use CSV Files</td>
    <td>By Amos Nathanos</td>
  </tr>
</table>
`

setTimeout(() => {
    document.querySelector('body').innerHTML = html
}, 250)
  </script>
 </head>
 <body>
 </body>
</html>



In [27]:
rows = soup_doc.find_all('tr')

all_rows = []

for row in rows:
    # Grab the cells once per row instead of searching again for each field
    cells = row.find_all('td')

    scraping_info = {}
    scraping_info['title'] = cells[0].string
    scraping_info['musing'] = cells[1].string
    scraping_info['byline'] = cells[2].string

    all_rows.append(scraping_info)

all_rows

[]
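The empty list is most likely a timing issue: the printed HTML shows an empty `<body>`, which suggests `page.content()` ran before the page's 250 ms `setTimeout` had filled in the table. Waiting for the rows first, for example with `await page.wait_for_selector("tr")` before calling `page.content()`, should avoid that. The sketch below runs the same kind of loop against the table markup from the page's script, pasted in as an inline string (`sample_html` is an illustrative name), to show what the result looks like once the rows are present.

```python
from bs4 import BeautifulSoup

# Table markup copied from the page's script (illustrative stand-in for `html`)
sample_html = """
<table>
  <tr>
    <td>How to Scrape Things</td>
    <td>Probably using Playwright</td>
    <td>By Jonathan Soma</td>
  </tr>
  <tr>
    <td>How to Scrape Many Things</td>
    <td>But, Is It Even Possible?</td>
    <td>By Sonathan Joma</td>
  </tr>
  <tr>
    <td>The End of Scraping</td>
    <td>Let's All Use CSV Files</td>
    <td>By Amos Nathanos</td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

all_rows = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    # One dictionary per table row
    all_rows.append({
        "title": cells[0].string,
        "musing": cells[1].string,
        "byline": cells[2].string,
    })

print(all_rows)
```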

In [28]:
import pandas as pd

df = pd.json_normalize(all_rows)
df.head()

In [29]:
df.to_csv("scraping.csv", index=False)
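As a sketch of the list-of-dictionaries to CSV step with data in it, the snippet below hard-codes the three rows from the page and writes the CSV into an in-memory buffer instead of a file, so the result can be inspected directly. The `buffer` and `csv_text` names are illustrative.

```python
import io

import pandas as pd

# Rows as they would look after a successful scrape (hard-coded for illustration)
all_rows = [
    {"title": "How to Scrape Things", "musing": "Probably using Playwright", "byline": "By Jonathan Soma"},
    {"title": "How to Scrape Many Things", "musing": "But, Is It Even Possible?", "byline": "By Sonathan Joma"},
    {"title": "The End of Scraping", "musing": "Let's All Use CSV Files", "byline": "By Amos Nathanos"},
]

# Each dictionary becomes a row; the keys become the column names
df = pd.json_normalize(all_rows)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Passing `index=False` leaves out the row numbers, so the file holds only the scraped columns.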

## Scraping an actual table

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/the-actual-table.html using pandas' HTML reading function. Save it as `output.csv`.

In [41]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/the-actual-table.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/the-actual-table.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/the-actual-table.html' method='GET'>>

In [42]:
html = await page.content()
soup_doc = BeautifulSoup(html)
print(soup_doc.prettify())

<!DOCTYPE html>
<html>
 <head>
  <script>
   const html = `<table id="booklist">
  <tr>
    <td>How to Scrape Things</td>
    <td>Probably using Playwright</td>
    <td>By Jonathan Soma</td>
  </tr>
  <tr>
    <td>How to Scrape Many Things</td>
    <td>But, Is It Even Possible?</td>
    <td>By Sonathan Joma</td>
  </tr>
  <tr>
    <td>The End of Scraping</td>
    <td>Let's All Use CSV Files</td>
    <td>By Amos Nathanos</td>
  </tr>
</table>
`

setTimeout(() => {
    console.log(html)
    document.querySelector('body').innerHTML = html
}, 250)
  </script>
 </head>
 <body>
  <table id="booklist">
   <tbody>
    <tr>
     <td>
      How to Scrape Things
     </td>
     <td>
      Probably using Playwright
     </td>
     <td>
      By Jonathan Soma
     </td>
    </tr>
    <tr>
     <td>
      How to Scrape Many Things
     </td>
     <td>
      But, Is It Even Possible?
     </td>
     <td>
      By Sonathan Joma
     </td>
    </tr>
    <tr>
     <td>
      The End of Scraping
     </t

In [44]:
from io import StringIO

tables = pd.read_html(StringIO(html))
tables[0]

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | How to Scrape Things | Probably using Playwright | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |


In [46]:
tables[0].to_csv("scraping2.csv", index=False)
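Because the table on this page has `id="booklist"`, `pd.read_html` can be pointed at it directly, and the numeric column names can be swapped for readable ones. This is a sketch under a few assumptions: the table markup is pasted in as an inline string (`sample_html`), the column names are my own choice, and a parser that `read_html` can use (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Table markup copied from the page's script (illustrative stand-in for `html`)
sample_html = """
<table id="booklist">
  <tr><td>How to Scrape Things</td><td>Probably using Playwright</td><td>By Jonathan Soma</td></tr>
  <tr><td>How to Scrape Many Things</td><td>But, Is It Even Possible?</td><td>By Sonathan Joma</td></tr>
  <tr><td>The End of Scraping</td><td>Let's All Use CSV Files</td><td>By Amos Nathanos</td></tr>
</table>
"""

# attrs narrows the search to tables whose attributes match
tables = pd.read_html(StringIO(sample_html), attrs={"id": "booklist"})
books = tables[0]

# The table has no header row, so pandas numbers the columns; rename them
books.columns = ["title", "subhead", "byline"]
print(books)
```

Filtering with `attrs` matters more on pages with several tables, where `tables[0]` might not be the one you want.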

## `html.parser` vs `html5lib`

Here is some good HTML:

```python
html_good = """
<h1>This is a title</h1>
<h2>This is a subhead</h2>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
"""
```

Here is some bad HTML:

```python
html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""
```

When you're using BeautifulSoup, you can use different parsers, including `html.parser`, `html5lib` and `lxml`. Try both the good HTML and bad HTML with each parser and use `print(soup_doc.prettify())` to view the difference.

What is different about each one?

> You'll need to `pip install` both html5lib and lxml. Since you aren't importing them yourself (BeautifulSoup loads them behind the scenes), you'll need to do **Kernel > Restart** and run from the top after installing for them to work.

In [34]:
!pip install html5lib


In [235]:
!pip install lxml


In [49]:
html_good = """
<h1>This is a title</h1>
<h2>This is a subhead</h2>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
"""

In [50]:
html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""

In [52]:
from bs4 import BeautifulSoup

soup_doc = BeautifulSoup(html_good, "html5lib")
print(soup_doc.prettify())

<html>
 <head>
 </head>
 <body>
  <h1>
   This is a title
  </h1>
  <h2>
   This is a subhead
  </h2>
  <p>
   This is a paragraph
  </p>
  <p>
   This is another paragraph
  </p>
 </body>
</html>



In [53]:
soup_doc = BeautifulSoup(html_good, "lxml")
print(soup_doc.prettify())

<html>
 <body>
  <h1>
   This is a title
  </h1>
  <h2>
   This is a subhead
  </h2>
  <p>
   This is a paragraph
  </p>
  <p>
   This is another paragraph
  </p>
 </body>
</html>



In [54]:
soup_doc = BeautifulSoup(html_good, "html.parser")
print(soup_doc.prettify())

<h1>
 This is a title
</h1>
<h2>
 This is a subhead
</h2>
<p>
 This is a paragraph
</p>
<p>
 This is another paragraph
</p>



In [55]:
soup_doc = BeautifulSoup(html_bad, "html5lib")
print(soup_doc.prettify())

<html>
 <head>
 </head>
 <body>
  <h1>
   This is a title
  </h1>
  <h2>
   This is a subhead
   <p>
    This is a paragraph
   </p>
   <p>
    This is another paragraph
   </p>
  </h2>
 </body>
</html>



In [56]:
soup_doc = BeautifulSoup(html_bad, "lxml")
print(soup_doc.prettify())

<html>
 <body>
  <h1>
   This is a title
   <h2>
    This is a subhead
   </h2>
  </h1>
  <p>
   This is a paragraph
  </p>
  <p>
   This is another paragraph
  </p>
 </body>
</html>



In [57]:
soup_doc = BeautifulSoup(html_bad, "html.parser")
print(soup_doc.prettify())

<h1>
 This is a title
 <h2>
  This is a subhead
  <p>
   This is a paragraph
   <p>
    This is another paragraph
   </p>
  </p>
 </h2>
</h1>
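One way to compare the parsers without reading the printouts by eye is to ask each soup where the tags ended up. The sketch below is an illustration, not part of the original exercise: `describe_nesting` is a made-up helper, and any parser that isn't installed is skipped by catching `FeatureNotFound`.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""


def describe_nesting(html, parser):
    """Report how a parser nested the unclosed tags (hypothetical helper)."""
    soup = BeautifulSoup(html, parser)
    first_p, second_p = soup.find_all("p")
    return {
        "h2_inside_h1": soup.h2.find_parent("h1") is not None,
        "p_inside_h2": first_p.find_parent("h2") is not None,
        "p_inside_p": second_p.find_parent("p") is not None,
    }


results = {}
for parser in ["html.parser", "lxml", "html5lib"]:
    try:
        results[parser] = describe_nesting(html_bad, parser)
    except FeatureNotFound:
        # Parser isn't installed; record that instead of failing
        results[parser] = None

for parser, summary in results.items():
    print(parser, summary)
```

Going by the printed output above, `html.parser` nests every unclosed tag inside the previous one, `lxml` puts the `h2` inside the `h1` but keeps the paragraphs as siblings, and `html5lib` keeps the `h1` and `h2` as siblings but places both paragraphs inside the `h2`.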

