# Scraping basics for Playwright

If you're comfortable with scraping in general, feel free to skip this notebook and go straight to the next one. The same goes if you get bored partway through.

> The [scraping section](https://jonathansoma.com/everything/scraping/) on my Everything I Know site might be helpful.
>
> I know I love them, but **you don't have to use CSS selectors!**
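
To make that note concrete, here is a minimal sketch that fetches the same element two ways: once with BeautifulSoup's keyword-style `find`, and once with a CSS selector. The HTML string is made up for illustration and is not the markup of any page in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, standing in for a real page
html = """
<h1 class="title">An Example Title</h1>
<p class="byline">By Some Author</p>
"""
doc = BeautifulSoup(html, "html.parser")

# Keyword-style lookup: no CSS selector involved
via_find = doc.find(class_="title").text

# The same lookup written as a CSS selector
via_select = doc.select_one(".title").text

print(via_find)
print(via_select)
```

Both lines print the same text, so pick whichever style reads better to you.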

## Part 0: Imports

Import what you need to use Playwright, and start up a new browser to use for scraping. 

> If you end up opening a lot of Chromes/Chromiums, shutting down the Python kernel with the stop button is an easy way to make them go away! You'll have to re-run your notebook, but at least you won't have sixty icons in your dock.

In [1]:
!pip install playwright
# First run only: download the Chromium build that Playwright drives
!playwright install chromium

In [194]:
from playwright.async_api import async_playwright

In [195]:
# "Hey, open up a browser"
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()
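
If you'd rather not restart the kernel to get rid of stray browser windows, closing things explicitly should also work. The helper below is a small sketch, and `shut_down` is a name made up for this example. It assumes you pass in the `browser` and `playwright` objects created in the cell above.

```python
async def shut_down(browser, playwright):
    """Close the browser window, then stop the Playwright driver."""
    await browser.close()    # closes the Chromium window and its pages
    await playwright.stop()  # stops the background Playwright process
```

In a notebook cell you would call it with `await shut_down(browser, playwright)` once you're done scraping.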

## Part 1: Scraping by class

Scrape the content at http://jonathansoma.com/lede/static/by-class.html, printing out the title, subhead, and byline.

In [196]:
await page.goto("http://jonathansoma.com/lede/static/by-class.html")

<Response url='https://jonathansoma.com/lede/static/by-class.html' request=<Request url='https://jonathansoma.com/lede/static/by-class.html' method='GET'>>

In [197]:
from bs4 import BeautifulSoup

html = await page.content()
doc = BeautifulSoup(html, "html.parser")  # name the parser so results don't depend on what's installed

In [198]:
doc.find(class_="title").text

'How to Scrape Things'

In [199]:
doc.find(class_="subhead").text

'Some Supplemental Materials'

In [200]:
doc.find(class_="byline").text

'By Jonathan Soma'
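
One thing to watch for with this pattern: `find` returns `None` when nothing matches, so calling `.text` on the result raises an `AttributeError`. The sketch below shows one way to guard against that. The helper name `text_of` and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<h1 class="title">An Example Title</h1>'  # hypothetical markup
doc = BeautifulSoup(html, "html.parser")

def text_of(soup, class_name, default=None):
    """Return the text of the first element with this class, or a default."""
    element = soup.find(class_=class_name)
    if element is None:
        return default
    return element.get_text(strip=True)

print(text_of(doc, "title"))
print(text_of(doc, "subhead", default="(no subhead)"))
```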

## Part 2: Scraping using tags

Scrape the content at http://jonathansoma.com/lede/static/by-tag.html, printing out the title, subhead, and byline.

In [201]:
await page.goto("http://jonathansoma.com/lede/static/by-tag.html")

<Response url='https://jonathansoma.com/lede/static/by-tag.html' request=<Request url='https://jonathansoma.com/lede/static/by-tag.html' method='GET'>>

In [202]:
html = await page.content()
doc2 = BeautifulSoup(html, "html.parser")

In [203]:
doc2.select_one("h1").text

'How to Scrape Things'

In [204]:
doc2.select_one("h3").text

'Some Supplemental Materials'

In [205]:
doc2.select_one("p").text

'By Jonathan Soma'

## Part 3: Scraping using a single tag

Scrape the content at http://jonathansoma.com/lede/static/by-list.html, creating a dictionary out of the title, subhead, and byline in sentences, e.g. "the title is `______`"

> **This will be important for the next few:** you can use `.get_by_text` but it seems kind of silly since maybe the text would change. I think getting them all, then using list indexes like `[0]`, etc, would be better! If I sold you on CSS selectors, you can also look up `nth-of-type` and use it with `.select_one`.
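
Here is a rough sketch of the two approaches from the note, run against a made-up snippet rather than the real page. Mind the off-by-one difference: list indexes start at 0, while `nth-of-type` starts counting at 1.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three sibling <p> tags with nothing to tell them apart
html = """
<body>
  <p>An Example Title</p>
  <p>An Example Subhead</p>
  <p>By Some Author</p>
</body>
"""
doc = BeautifulSoup(html, "html.parser")

# Option 1: grab every <p>, then pick by position (indexes start at 0)
paragraphs = doc.select("p")
subhead_by_index = paragraphs[1].text

# Option 2: let the selector do the counting (nth-of-type starts at 1)
subhead_by_selector = doc.select_one("p:nth-of-type(2)").text

print(f'the subhead is "{subhead_by_index}"')
print(f'the subhead is "{subhead_by_selector}"')
```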

In [206]:
await page.goto("http://jonathansoma.com/lede/static/by-list.html")

<Response url='https://jonathansoma.com/lede/static/by-list.html' request=<Request url='https://jonathansoma.com/lede/static/by-list.html' method='GET'>>

In [207]:
html = await page.content()
doc3 = BeautifulSoup(html, "html.parser")

In [208]:
# Avoid naming the loop variable `list`, which would shadow Python's built-in
paragraphs = doc3.select("body p")
for paragraph in paragraphs:
    print(paragraph.text)

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma


In [211]:
!pip install selenium
from selenium import webdriver


## Part 4: Scraping a single table row

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, printing out the title, subhead, and byline in sentences, e.g. "the title is `______`."

In [99]:
await page.goto("http://jonathansoma.com/lede/static/single-table-row.html")
html = await page.content()
doc4 = BeautifulSoup(html, "html.parser")

In [114]:
# Select the cells once, then pick each one by position
cells = doc4.select("tbody td")
title = cells[0]
subtitle = cells[1]
byline = cells[2]
print(f'the title is "{title.text}"')
print(f'the subtitle is "{subtitle.text}"')
print(f'the byline is "{byline.text}"')

the title is "How to Scrape Things"
the subtitle is "Some Supplemental Materials"
the byline is "By Jonathan Soma"


## Part 5: Saving into a dictionary

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [216]:
cells = doc4.select("tbody td")
book = {
    'title': cells[0].text,
    'subtitle': cells[1].text,
    'byline': cells[2].text
}
book

{'title': 'How to Scrape Things',
 'subtitle': 'Some Supplemental Materials',
 'byline': 'By Jonathan Soma'}

In [123]:
# Same values, this time collected in a list of dictionaries
book = []

cells = doc4.select("tbody td")
data = {
    'title': cells[0].text,
    'subtitle': cells[1].text,
    'byline': cells[2].text
}

book.append(data)
book

[{'title': 'How to Scrape Things',
  'subtitle': 'Some Supplemental Materials',
  'byline': 'By Jonathan Soma'}]

## Part 6: Scraping multiple table rows

Scrape the content at http://jonathansoma.com/lede/static/multiple-table-rows.html, printing out each title, subhead, and byline.

> You won't use pandas for this one, either!

In [124]:
await page.goto("http://jonathansoma.com/lede/static/multiple-table-rows.html")
html = await page.content()
doc6 = BeautifulSoup(html, "html.parser")

In [219]:
# Extract table rows
rows = doc6.find_all('tr')

for row in rows:
    # Extract table cells for each row
    cells = row.find_all('td')

    for cell in cells:
        print(cell.text)  # Print cell content

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma
How to Scrape Many Things
But, Is It Even Possible?
By Sonathan Joma
The End of Scraping
Let's All Use CSV Files
By Amos Nathanos
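
The nested loop above prints every cell on its own line, so you can't tell where one row ends and the next begins. A possible alternative, sketched here against a made-up table, keeps each row's cells together in their own list.

```python
from bs4 import BeautifulSoup

# Hypothetical table, standing in for the real page
html = """
<table>
  <tr><td>Title A</td><td>Subhead A</td><td>By Author A</td></tr>
  <tr><td>Title B</td><td>Subhead B</td><td>By Author B</td></tr>
</table>
"""
doc = BeautifulSoup(html, "html.parser")

# One inner list per <tr>, so cells from different rows never mix
rows = [[cell.text for cell in row.find_all("td")] for row in doc.find_all("tr")]

# Unpack each row into its three values (assumes exactly three cells per row)
for title, subhead, byline in rows:
    print(f"{title} | {subhead} | {byline}")
```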


## Part 7: Scraping an actual table

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a list of dictionaries.

> Don't use pandas here, either, even though that's exactly what we did in class.

In [222]:
await page.goto("http://jonathansoma.com/lede/static/the-actual-table.html")
html = await page.content()
doc7 = BeautifulSoup(html, "html.parser")

In [223]:
all_data = []
table = doc7.select_one('tbody')
table_rows = table.find_all('tr')

for row in table_rows:
    # Look up the row's cells once instead of once per key
    cells = row.find_all('td')
    # Each value is a column of the row, so name the keys after what they hold
    data = {
        'title': cells[0].get_text(),
        'subtitle': cells[1].get_text(),
        'byline': cells[2].get_text()
    }
    all_data.append(data)
all_data

[{'title': 'How to Scrape Things',
  'subtitle': 'Some Supplemental Materials',
  'byline': 'By Jonathan Soma'},
 {'title': 'How to Scrape Many Things',
  'subtitle': 'But, Is It Even Possible?',
  'byline': 'By Sonathan Joma'},
 {'title': 'The End of Scraping',
  'subtitle': "Let's All Use CSV Files",
  'byline': 'By Amos Nathanos'}]
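
Instead of indexing each cell by hand, you could pair a list of key names with a row's cells using `zip`. This is only a sketch: the markup is made up, and the key names in `KEYS` are an assumption about what each column holds.

```python
from bs4 import BeautifulSoup

# Hypothetical table, standing in for the real page
html = """
<table>
  <tbody>
    <tr><td>Title A</td><td>Subhead A</td><td>By Author A</td></tr>
    <tr><td>Title B</td><td>Subhead B</td><td>By Author B</td></tr>
  </tbody>
</table>
"""
doc = BeautifulSoup(html, "html.parser")

KEYS = ["title", "subhead", "byline"]  # assumed column order

records = []
for row in doc.select("tbody tr"):
    values = [cell.get_text(strip=True) for cell in row.find_all("td")]
    # zip pairs the first key with the first cell, the second with the second, etc.
    records.append(dict(zip(KEYS, values)))

print(records)
```

If the table ever gains a column, you only need to extend `KEYS`.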

## Part 8: Scraping multiple table rows into a list of dictionaries

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a pandas DataFrame.

> There are two ways to do this one! One uses just pandas, the other one uses the result from Part 7.

In [232]:
from io import StringIO

import pandas as pd

# Wrap the HTML in StringIO: newer pandas versions warn when read_html is given a raw string
tables = pd.read_html(StringIO(html))
df = tables[0]
df


|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |
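
The other route mentioned above starts from a list of dictionaries like the one Part 7 asks for. A minimal sketch, using made-up records so it runs on its own:

```python
import pandas as pd

# Hypothetical records in the list-of-dictionaries shape from Part 7
records = [
    {"title": "Title A", "subhead": "Subhead A", "byline": "By Author A"},
    {"title": "Title B", "subhead": "Subhead B", "byline": "By Author B"},
]

# Each dictionary becomes a row, and the keys become the column names
df = pd.DataFrame(records)
print(df)
```

Unlike `read_html`, which fell back to numbered columns for this table, this version gets readable column names from the dictionary keys.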


## Part 9: Scraping into a file

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html and save it as `output.csv`

In [234]:
from io import StringIO

html = await page.content()
# Wrap the HTML in StringIO: newer pandas versions warn when read_html is given a raw string
tables = pd.read_html(StringIO(html))
filename = "output.csv"
print("Saving as", filename)
tables[0].to_csv(filename, index=False)

Saving as output.csv
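
If you already have a list of dictionaries and would rather skip pandas, Python's built-in `csv` module can write the file too. The sketch below uses made-up records, a made-up helper name (`save_records`), and a made-up filename so it doesn't overwrite `output.csv`.

```python
import csv

# Hypothetical records, standing in for scraped rows
records = [
    {"title": "Title A", "subhead": "Subhead A", "byline": "By Author A"},
    {"title": "Title B", "subhead": "Subhead B", "byline": "By Author B"},
]

def save_records(rows, filename):
    """Write a list of dictionaries to a CSV file, one row per dictionary."""
    # newline="" lets the csv module control line endings itself
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

save_records(records, "output_from_dicts.csv")

# Read the file back to check what was written
with open("output_from_dicts.csv", encoding="utf-8") as f:
    saved_text = f.read()
print(saved_text)
```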
