# Scraping basics for Playwright

This notebook combines a handful of small scraping techniques with an introduction to Playwright. Along with the class notes, the [scraping section](https://jonathansoma.com/everything/scraping/) on my Everything I Know site might be helpful.

## Imports

Import what you need to use Playwright, and start up a new browser to use for scraping. 

> If you end up opening a lot of Chromes/Chromiums, shutting down the Python kernel with the stop button is an easy way to make them go away! You'll have to re-run your notebook, but at least you won't have sixty icons in your dock.

In [6]:
from playwright.async_api import async_playwright  

In [7]:
playwright = await async_playwright().start()


In [13]:
browser = await playwright.chromium.launch(headless=False)

In [14]:
page = await browser.new_page()

In [15]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/by-class.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/by-class.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/by-class.html' method='GET'>>

In [16]:
html = await page.content()
html

'<!DOCTYPE html><html><head><script>\n    const html = `\n<h1 class="title">How to Scrape Things</h1>\n<h3 class="subhead">Probably using Playwright</h3>\n<p class="byline">By Jonathan Soma</p>\n`\n\nsetTimeout(() => {\n    console.log(html)\n    document.querySelector(\'body\').innerHTML = html\n}, 250)</script>\n</head><body>\n<h1 class="title">How to Scrape Things</h1>\n<h3 class="subhead">Probably using Playwright</h3>\n<p class="byline">By Jonathan Soma</p>\n</body></html>'

In [17]:
from bs4 import BeautifulSoup

In [18]:
# Feed the HTML to BeautifulSoup, naming the parser explicitly

soup_doc = BeautifulSoup(html, 'html.parser')
print(soup_doc.prettify())

<!DOCTYPE html>
<html>
 <head>
  <script>
   const html = `
<h1 class="title">How to Scrape Things</h1>
<h3 class="subhead">Probably using Playwright</h3>
<p class="byline">By Jonathan Soma</p>
`

setTimeout(() => {
    console.log(html)
    document.querySelector('body').innerHTML = html
}, 250)
  </script>
 </head>
 <body>
  <h1 class="title">
   How to Scrape Things
  </h1>
  <h3 class="subhead">
   Probably using Playwright
  </h3>
  <p class="byline">
   By Jonathan Soma
  </p>
 </body>
</html>



## Scraping by class

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-class.html by selecting each element by its **class name**, printing out the title, subhead, and byline.

In [19]:
title=soup_doc.find(class_='title').text
title

'How to Scrape Things'

In [20]:
print('subhead:',soup_doc.find(class_='subhead').text)

subhead: Probably using Playwright


In [21]:
print('byline:',soup_doc.find(class_='byline').text)

byline: By Jonathan Soma


## Scraping using a single tag

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-list.html, creating a dictionary out of the title, subhead, and byline.

In [23]:
await page.goto('http://jonathansoma.com/columbia/interactive-scrape/by-list.html')

<Response url='https://jonathansoma.com/columbia/interactive-scrape/by-list.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/by-list.html' method='GET'>>

In [24]:
html_2 = await page.content()
html_2

"<!DOCTYPE html><html><head><script>\n    const html = `<p>How to Scrape Things</p>\n<p>Probably using Playwright</p>\n<p>By Jonathan Soma</p>\n`\n\nsetTimeout(() => {\n    console.log(html)\n    document.querySelector('body').innerHTML = html\n}, 250)</script>\n</head><body><p>How to Scrape Things</p>\n<p>Probably using Playwright</p>\n<p>By Jonathan Soma</p>\n</body></html>"

In [29]:
soup_doc = BeautifulSoup(html_2, 'html.parser')
print(soup_doc.prettify())

<!DOCTYPE html>
<html>
 <head>
  <script>
   const html = `<p>How to Scrape Things</p>
<p>Probably using Playwright</p>
<p>By Jonathan Soma</p>
`

setTimeout(() => {
    console.log(html)
    document.querySelector('body').innerHTML = html
}, 250)
  </script>
 </head>
 <body>
  <p>
   How to Scrape Things
  </p>
  <p>
   Probably using Playwright
  </p>
  <p>
   By Jonathan Soma
  </p>
 </body>
</html>



In [30]:
soup_doc

<!DOCTYPE html>
<html><head><script>
    const html = `<p>How to Scrape Things</p>
<p>Probably using Playwright</p>
<p>By Jonathan Soma</p>
`

setTimeout(() => {
    console.log(html)
    document.querySelector('body').innerHTML = html
}, 250)</script>
</head><body><p>How to Scrape Things</p>
<p>Probably using Playwright</p>
<p>By Jonathan Soma</p>
</body></html>

In [31]:
soup_doc.find_all('p')

[<p>How to Scrape Things</p>,
 <p>Probably using Playwright</p>,
 <p>By Jonathan Soma</p>]
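
The exercise asks for a dictionary, and `find_all('p')` only gets as far as a list of tags. One way to finish it is sketched below on a copy of the page's body, so it runs without a browser. The names `sample_html`, `keys`, and `article` are illustrative and not from the original notebook.

```python
from bs4 import BeautifulSoup

# A copy of the rendered <body> shown above, so this runs without Playwright
sample_html = """
<p>How to Scrape Things</p>
<p>Probably using Playwright</p>
<p>By Jonathan Soma</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")
paragraphs = soup.find_all("p")

# The page has no class names, so rely on the order of the <p> tags
keys = ["title", "subhead", "byline"]
article = {key: p.text.strip() for key, p in zip(keys, paragraphs)}
print(article)
```

Relying on position is fragile if the page ever adds or reorders paragraphs, which is why class names are the better hook when they exist.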

## Waiting

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html just like you did above, but use **wait_for** to wait for the text "Everything has shown up" to appear.

In [None]:
from playwright.async_api import async_playwright

In [None]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)

In [None]:
page = await browser.new_page()

In [None]:
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html")

In [None]:
# Wait for the last piece of text to appear *before* grabbing the HTML
await page.get_by_text("Everything has shown up").wait_for()
html = await page.content()
soup_doc = BeautifulSoup(html, 'html.parser')

In [None]:
paras = soup_doc.find_all('p')

# One page means one dictionary, so no loop is needed
my_dict = {}
my_dict['title'] = paras[0].text
my_dict['subhead'] = paras[1].text
my_dict['byline'] = paras[2].text
my_dict['last_key'] = paras[3].text

my_dict

## Forms

Display the content of the `h1` tag on http://jonathansoma.com/columbia/interactive-scrape/inputs.html. You'll need to follow the instructions to complete the form first.

In [37]:
from playwright.async_api import async_playwright

In [38]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)

In [39]:
page = await browser.new_page()

In [40]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/inputs.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/inputs.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/inputs.html' method='GET'>>

In [43]:
html_4 = await page.content()
html_4

'<!DOCTYPE html><html><head><script>\n    const html = `<h1>You did it</h1>`\n</script>\n</head><body>\n    <div id="things"></div>\n    <p>The secret is\n    <select>\n        <option selected="">Closed</option>\n        <option>Open</option>\n    </select>\n</p>\n<p>\n    <input type="text" placeholder="write cat in here" id="best-animal">\n</p>\n<p>\n    <input type="button" id="submit" value="Click me">\n</p>\n    <script>\n        document.querySelector("#submit").addEventListener(\'click\', function() {\n            if(document.querySelector(\'#best-animal\').value == \'cat\') {\n                if(document.querySelector("select").value == \'Open\') {\n                    document.querySelector(\'body\').innerHTML = html\n                } else {\n                    alert(\'fix the dropdown!!!\')\n                }\n            } else {\n                alert(\'write cat in there!!!\')\n            }\n        })\n    </script>\n\n</body></html>'

In [44]:
from bs4 import BeautifulSoup

In [45]:
soup_doc_four = BeautifulSoup(html_4, 'html.parser')
print(soup_doc_four.prettify())

<!DOCTYPE html>
<html>
 <head>
  <script>
   const html = `<h1>You did it</h1>`
  </script>
 </head>
 <body>
  <div id="things">
  </div>
  <p>
   The secret is
   <select>
    <option selected="">
     Closed
    </option>
    <option>
     Open
    </option>
   </select>
  </p>
  <p>
   <input id="best-animal" placeholder="write cat in here" type="text"/>
  </p>
  <p>
   <input id="submit" type="button" value="Click me"/>
  </p>
  <script>
   document.querySelector("#submit").addEventListener('click', function() {
            if(document.querySelector('#best-animal').value == 'cat') {
                if(document.querySelector("select").value == 'Open') {
                    document.querySelector('body').innerHTML = html
                } else {
                    alert('fix the dropdown!!!')
                }
            } else {
                alert('write cat in there!!!')
            }
        })
  </script>
 </body>
</html>



In [51]:
# Fill in the text box, pick "Open" in the dropdown, then click the button.
# The <select> has no label, so target it by tag name instead of get_by_label.
await page.get_by_placeholder("write cat in here").fill("cat")
await page.locator("select").select_option("Open")
await page.get_by_role("button", name="Click me").click()

In [52]:
h1_content = await page.inner_text('h1')
h1_content       

'You did it'

## Scraping a single table row

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html, creating a dictionary out of the title, subhead, and byline.

In [182]:
from playwright.async_api import async_playwright

async def scrape_table_row():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html")

        # The cells have no class names, so selectors like 'td.title' never match.
        # Wait for the table to be injected, then read the <td> tags in order.
        await page.locator('td').first.wait_for()
        cells = await page.locator('td').all_inner_texts()
        await browser.close()

    # Return the dictionary so it is available outside the function
    return {
        "title": cells[0],
        "subhead": cells[1],
        "byline": cells[2]
    }

In [183]:
row_data = await scrape_table_row()

print(row_data)

## Saving into a dictionary

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [59]:
# "Saving into a dictionary" just means storing the values in a dict in memory; pickle is not needed

playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html' method='GET'>>

In [60]:
html_6 = await page.content()

In [61]:
soup_doc_six = BeautifulSoup(html_6, 'html.parser')
soup_doc_six

<!DOCTYPE html>
<html><head><script>
    const html = `<table>
  <tr>
    <td>How to Scrape Things</td>
    <td>Probably using Playwright</td>
    <td>By Jonathan Soma</td>
  </tr>
</table>
`

setTimeout(() => {
    console.log(html)
    document.querySelector('body').innerHTML = html
}, 250)</script>
</head><body><table>
<tbody><tr>
<td>How to Scrape Things</td>
<td>Probably using Playwright</td>
<td>By Jonathan Soma</td>
</tr>
</tbody></table>
</body></html>

In [62]:
td_tags=soup_doc_six.find_all('td')
print(td_tags)

[<td>How to Scrape Things</td>, <td>Probably using Playwright</td>, <td>By Jonathan Soma</td>]


In [64]:
book = {
    "title": td_tags[0].text.strip(),
    "subhead": td_tags[1].text.strip(),
    "byline": td_tags[2].text.strip()
}

# Print the dictionary
print(book)

#next step is to save the dictionary 

{'title': 'How to Scrape Things', 'subhead': 'Probably using Playwright', 'byline': 'By Jonathan Soma'}


In [65]:
for td in td_tags:
    print(td.text)

How to Scrape Things
Probably using Playwright
By Jonathan Soma
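
The exercise only asks for the in-memory `book` dictionary, so nothing has to be written to disk. If you did want to persist it, JSON is usually simpler than pickle for plain dictionaries of strings. A minimal sketch, assuming a hypothetical filename `book.json` in the system temp folder:

```python
import json
import tempfile
from pathlib import Path

book = {
    "title": "How to Scrape Things",
    "subhead": "Probably using Playwright",
    "byline": "By Jonathan Soma",
}

# Hypothetical location; any writable path would do
path = Path(tempfile.gettempdir()) / "book.json"

# Write the dictionary out as JSON text...
path.write_text(json.dumps(book, indent=2), encoding="utf-8")

# ...and read it back to confirm nothing was lost
restored = json.loads(path.read_text(encoding="utf-8"))
print(restored == book)
```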


## Scraping multiple table rows

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html, creating a list of dictionaries. Convert to a pandas dataframe with `pd.json_normalize`. Save it as `output.csv`.

In [73]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html' method='GET'>>

In [79]:
html = await page.content()

In [80]:
soup_doc = BeautifulSoup(html, 'html.parser')
soup_doc

<!DOCTYPE html>
<html><head><script>
    const html = `<table>
  <tr>
    <td>How to Scrape Things</td>
    <td>Probably using Playwright</td>
    <td>By Jonathan Soma</td>
  </tr>
  <tr>
    <td>How to Scrape Many Things</td>
    <td>But, Is It Even Possible?</td>
    <td>By Sonathan Joma</td>
  </tr>
  <tr>
    <td>The End of Scraping</td>
    <td>Let's All Use CSV Files</td>
    <td>By Amos Nathanos</td>
  </tr>
</table>
`

setTimeout(() => {
    document.querySelector('body').innerHTML = html
}, 250)</script>
</head><body><table>
<tbody><tr>
<td>How to Scrape Things</td>
<td>Probably using Playwright</td>
<td>By Jonathan Soma</td>
</tr>
<tr>
<td>How to Scrape Many Things</td>
<td>But, Is It Even Possible?</td>
<td>By Sonathan Joma</td>
</tr>
<tr>
<td>The End of Scraping</td>
<td>Let's All Use CSV Files</td>
<td>By Amos Nathanos</td>
</tr>
</tbody></table>
</body></html>

In [128]:
full_tr=soup_doc.find_all('tr')
full_tr

[<tr>
 <td>How to Scrape Things</td>
 <td>Probably using Playwright</td>
 <td>By Jonathan Soma</td>
 </tr>,
 <tr>
 <td>How to Scrape Many Things</td>
 <td>But, Is It Even Possible?</td>
 <td>By Sonathan Joma</td>
 </tr>,
 <tr>
 <td>The End of Scraping</td>
 <td>Let's All Use CSV Files</td>
 <td>By Amos Nathanos</td>
 </tr>]

In [130]:
# full_tr.find_all('tr')[0].find_all('td')[0].text.strip()

In [133]:
info = full_tr[0]
dictionary = {}
dictionary['title'] = info.find_all('td')[0].text
dictionary['subhead'] = info.find_all('td')[1].text
dictionary['byline'] = info.find_all('td')[2].text
dictionary


# for row in full_tr:
#     item = {
#         "field1":row.find_all('tr')[0].find_all('td')[0].text.strip()
#         "field2":row.find_all('tr')[1].find_all('td')[1].text.strip() 
#         "field3":row.find_all('tr')[0].find_all('td')[2].text.strip() 
#             }
#         data_list.append(item)

# data_list

{'title': 'How to Scrape Things',
 'subhead': 'Probably using Playwright',
 'byline': 'By Jonathan Soma'}

In [136]:
info1 = full_tr[1]
dictionary1 = {}
dictionary1['title'] = info1.find_all('td')[0].text
dictionary1['subhead'] = info1.find_all('td')[1].text
dictionary1['byline'] = info1.find_all('td')[2].text
dictionary1

{'title': 'How to Scrape Many Things',
 'subhead': 'But, Is It Even Possible?',
 'byline': 'By Sonathan Joma'}

In [137]:
info2 = full_tr[2]
dictionary2 = {}
dictionary2['title'] = info2.find_all('td')[0].text
dictionary2['subhead'] = info2.find_all('td')[1].text
dictionary2['byline'] = info2.find_all('td')[2].text
dictionary2

{'title': 'The End of Scraping',
 'subhead': "Let's All Use CSV Files",
 'byline': 'By Amos Nathanos'}

In [143]:
new_list=[dictionary, dictionary1, dictionary2]
new_list

[{'title': 'How to Scrape Things',
  'subhead': 'Probably using Playwright',
  'byline': 'By Jonathan Soma'},
 {'title': 'How to Scrape Many Things',
  'subhead': 'But, Is It Even Possible?',
  'byline': 'By Sonathan Joma'},
 {'title': 'The End of Scraping',
  'subhead': "Let's All Use CSV Files",
  'byline': 'By Amos Nathanos'}]
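
The three near-identical cells above can be collapsed into a single loop over the rows. A minimal sketch, using a trimmed copy of the table so it runs on its own; `sample_html`, `rows`, and `row_to_dict` are illustrative names, not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the rendered table, so this runs without Playwright
sample_html = """
<table>
  <tr><td>How to Scrape Things</td><td>Probably using Playwright</td><td>By Jonathan Soma</td></tr>
  <tr><td>How to Scrape Many Things</td><td>But, Is It Even Possible?</td><td>By Sonathan Joma</td></tr>
  <tr><td>The End of Scraping</td><td>Let's All Use CSV Files</td><td>By Amos Nathanos</td></tr>
</table>
"""

def row_to_dict(row):
    """Turn one <tr> into a dict, assuming exactly three <td> cells in this order."""
    cells = row.find_all("td")
    return {
        "title": cells[0].text.strip(),
        "subhead": cells[1].text.strip(),
        "byline": cells[2].text.strip(),
    }

soup = BeautifulSoup(sample_html, "html.parser")

# One dictionary per row, collected into a list
rows = [row_to_dict(tr) for tr in soup.find_all("tr")]
print(rows)
```

Building each dictionary inside the loop also sidesteps the stale-variable problem that can creep in when separate cells are re-run out of order.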

In [145]:
import pandas as pd

In [148]:
df = pd.json_normalize(new_list)
df.head()

Unnamed: 0,title,subhead,byline
0,How to Scrape Things,Probably using Playwright,By Jonathan Soma
1,How to Scrape Many Things,"But, Is It Even Possible?",By Sonathan Joma
2,The End of Scraping,Let's All Use CSV Files,By Amos Nathanos


In [None]:
# Save it to csv for non-data people
df.to_csv("output.csv", index=False)

## Scraping an actual table

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/the-actual-table.html using pandas' HTML reading function. Save it as `output.csv`.

In [177]:
page = await browser.new_page()
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html' method='GET'>>

In [173]:
html = await page.content()
html

"<!DOCTYPE html><html><head><script>\n    const html = `<table>\n  <tr>\n    <td>How to Scrape Things</td>\n    <td>Probably using Playwright</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n  <tr>\n    <td>How to Scrape Many Things</td>\n    <td>But, Is It Even Possible?</td>\n    <td>By Sonathan Joma</td>\n  </tr>\n  <tr>\n    <td>The End of Scraping</td>\n    <td>Let's All Use CSV Files</td>\n    <td>By Amos Nathanos</td>\n  </tr>\n</table>\n`\n\nsetTimeout(() => {\n    document.querySelector('body').innerHTML = html\n}, 250)</script>\n</head><body><table>\n  <tbody><tr>\n    <td>How to Scrape Things</td>\n    <td>Probably using Playwright</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n  <tr>\n    <td>How to Scrape Many Things</td>\n    <td>But, Is It Even Possible?</td>\n    <td>By Sonathan Joma</td>\n  </tr>\n  <tr>\n    <td>The End of Scraping</td>\n    <td>Let's All Use CSV Files</td>\n    <td>By Amos Nathanos</td>\n  </tr>\n</tbody></table>\n</body></html>"

In [174]:
soup_doc = BeautifulSoup(html, 'html.parser')
soup_doc

<!DOCTYPE html>
<html><head><script>
    const html = `<table>
  <tr>
    <td>How to Scrape Things</td>
    <td>Probably using Playwright</td>
    <td>By Jonathan Soma</td>
  </tr>
  <tr>
    <td>How to Scrape Many Things</td>
    <td>But, Is It Even Possible?</td>
    <td>By Sonathan Joma</td>
  </tr>
  <tr>
    <td>The End of Scraping</td>
    <td>Let's All Use CSV Files</td>
    <td>By Amos Nathanos</td>
  </tr>
</table>
`

setTimeout(() => {
    document.querySelector('body').innerHTML = html
}, 250)</script>
</head><body><table>
<tbody><tr>
<td>How to Scrape Things</td>
<td>Probably using Playwright</td>
<td>By Jonathan Soma</td>
</tr>
<tr>
<td>How to Scrape Many Things</td>
<td>But, Is It Even Possible?</td>
<td>By Sonathan Joma</td>
</tr>
<tr>
<td>The End of Scraping</td>
<td>Let's All Use CSV Files</td>
<td>By Amos Nathanos</td>
</tr>
</tbody></table>
</body></html>

In [175]:
!pip install --quiet html5lib lxml


In [176]:
import io

tables = pd.read_html(io.StringIO(html))
df = tables[0]
df.head()


Unnamed: 0,0,1,2
0,How to Scrape Things,Probably using Playwright,By Jonathan Soma
1,How to Scrape Many Things,"But, Is It Even Possible?",By Sonathan Joma
2,The End of Scraping,Let's All Use CSV Files,By Amos Nathanos
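
Because this table has no header row, `read_html` labels the columns `0`, `1`, `2`. If you want readable column names in the CSV, you can assign them yourself. A small sketch: the frame is rebuilt by hand from the output above so it runs without a browser, and the column names are my assumption based on the earlier exercises.

```python
import pandas as pd

# Rebuilt by hand from the output above; in the notebook this would be tables[0]
df = pd.DataFrame([
    ["How to Scrape Things", "Probably using Playwright", "By Jonathan Soma"],
    ["How to Scrape Many Things", "But, Is It Even Possible?", "By Sonathan Joma"],
    ["The End of Scraping", "Let's All Use CSV Files", "By Amos Nathanos"],
])

# With no header row the columns are just 0, 1, 2, so name them explicitly
df.columns = ["title", "subhead", "byline"]
print(df.columns.tolist())
```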


In [178]:
# Save it to csv for non-data people
df.to_csv("output.csv", index=False)

## `html.parser` vs `html5lib`

Here is some good HTML:

```python
html_good = """
<h1>This is a title</h1>
<h2>This is a subhead</h2>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
"""
```

Here is some bad HTML:

```python
html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""
```

When you're using BeautifulSoup, you can use different parsers, including `html.parser`, `html5lib` and `lxml`. Try both the good HTML and bad HTML with each parser and use `print(soup_doc.prettify())` to view the difference.

What is different about each one?

> You'll need to `pip install` both html5lib and lxml. Since you aren't importing them yourself (BeautifulSoup loads them), you'll need to do **Kernel > Restart** and run from the top after installing for them to work.

In [179]:
!pip install lxml


In [180]:
!pip install html5lib


In [None]:
soup_doc = BeautifulSoup(html, 'html.parser')

In [181]:
# Good HTML
html_good = """
<h1>This is a title</h1>
<h2>This is a subhead</h2>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
"""
# Bad HTML
html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""
parsers = ['html.parser', 'html5lib', 'lxml']
for parser in parsers:
    print("Using parser:", parser)
    soup_good = BeautifulSoup(html_good, parser)
    print("Good HTML:")
    print(soup_good.prettify())
    soup_bad = BeautifulSoup(html_bad, parser)
    print("Bad HTML:")
    print(soup_bad.prettify())


Using parser: html.parser
Good HTML:
<h1>
 This is a title
</h1>
<h2>
 This is a subhead
</h2>
<p>
 This is a paragraph
</p>
<p>
 This is another paragraph
</p>

Bad HTML:
<h1>
 This is a title
 <h2>
  This is a subhead
  <p>
   This is a paragraph
   <p>
    This is another paragraph
   </p>
  </p>
 </h2>
</h1>

Using parser: html5lib


FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?
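
The `FeatureNotFound` error above means BeautifulSoup could not see `html5lib` in the running kernel, typically because it was installed after `bs4` had already been imported. Restarting the kernel is the real fix. In the meantime, a sketch like the one below skips any parser that is unavailable instead of stopping at the first failure; the `results` dictionary is an illustrative name.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""

results = {}
for parser in ["html.parser", "html5lib", "lxml"]:
    try:
        results[parser] = BeautifulSoup(html_bad, parser).prettify()
    except FeatureNotFound:
        # Parser library is missing, or the kernel has not been restarted since installing it
        results[parser] = None
        print(f"{parser}: not available, skipping")

# Print whatever each available parser made of the unclosed tags
for parser, output in results.items():
    if output is not None:
        print(f"--- {parser} ---")
        print(output)
```

Comparing the outputs side by side makes the differences easier to spot: as the `html.parser` result above shows, it nests the unclosed tags inside each other rather than repairing them.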