
Increase scraping performance and reliability #9

Open · stevenleeg opened this issue Jan 16, 2022 · 3 comments

stevenleeg commented Jan 16, 2022

Hi there,

First off, I just want to say a huge thanks for building this: it helped me build a nice scraper to track and store my electricity usage, which is something I've been putting off for a very long time.

While trying to get this library to work I noticed that I would successfully fetch reading data on only about 25% of my attempts. The other 75% would get caught up on authentication issues or some element not showing up as it should. I had a feeling these issues were due to various race conditions associated with scraping a JS-heavy webpage, and I wanted to do some refactoring, taking some lessons from how Cypress approaches this kind of work. Namely, I restructured the scraper to watch for and respond to elements appearing or disappearing on the page rather than waiting arbitrary amounts of time and hoping requests had finished.

Here's the code, which is built pretty specifically for the context where I'm using it, but you can see the general ideas:

import asyncio
import json
import logging
import os

import pyotp
import pyppeteer

async def fetch_element(page, selector, max_tries=10):
    # Poll for the element rather than sleeping for one fixed, arbitrary
    # interval; give up after max_tries attempts spaced ~1s apart.
    for _ in range(max_tries):
        el = await page.querySelector(selector)
        if el is not None:
            return el
        await page.waitFor(1000)
    return None

async def fetch_readings():
    browser_launch_config = {
        'defaultViewport': {'width': 1920, 'height': 1080},
        'dumpio': False,
        'args': ['--no-sandbox'],
    }

    browser = await pyppeteer.launch(browser_launch_config)
    page = await browser.newPage()
    await page.goto('https://coned.com/en/login', {'waitUntil': 'domcontentloaded'})

    # Wait for the login form to render before typing into it.
    email_input = await fetch_element(page, '#form-login-email')
    if email_input is None:
        logging.error('Login form never appeared. Aborting!')
        return

    logging.info('Authenticating...')
    await page.type('#form-login-email', os.environ.get('CONED_EMAIL'))
    await page.type('#form-login-password', os.environ.get('CONED_PASSWORD'))
    await page.click('.submit-button')

    mfa_form = await fetch_element(page, '.js-login-new-device-form-selector:not(.hidden)')
    if mfa_form is None:
        logging.error('Never got MFA prompt. Aborting!')
        return

    logging.info('Entering MFA code...')
    mfa = await fetch_element(page, '#form-login-mfa-code')
    if mfa is None:
        logging.error('MFA input never appeared. Aborting!')
        return

    # Generate the current TOTP code from the shared secret.
    mfa_code = pyotp.TOTP(os.environ.get('CONED_MFA')).now()
    await mfa.type(mfa_code)
    await asyncio.gather(
        page.waitForNavigation(),
        page.click('.js-login-new-device-form .button'),
    )

    logging.info('Pausing for auth...')
    # Give the post-MFA session setup a moment to settle before hitting
    # the usage endpoint.
    await page.waitFor(5000)
    logging.info('Fetching readings JSON...')

    # A new tab in the same browser shares the authenticated session.
    api_page = await browser.newPage()
    account_id = os.environ.get('CONED_ACCOUNT_ID')
    meter_no = os.environ.get('CONED_METER_NUMBER')
    url = f"https://cned.opower.com/ei/edge/apis/cws-real-time-ami-v1/cws/cned/accounts/{account_id}/meters/{meter_no}/usage"
    await api_page.goto(url)

    # The endpoint returns raw JSON, which Chromium renders inside a <pre> tag.
    data_elem = await api_page.querySelector('pre')
    raw_data = await api_page.evaluate('(el) => el.textContent', data_elem)

    data = json.loads(raw_data)

    await browser.close()
    logging.info('Done!')

    return data
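
As an aside, if I'm reading the pyppeteer API right, page.waitForSelector does roughly what fetch_element hand-rolls here: it polls for the selector and raises a TimeoutError if it never shows up. A minimal sketch:

# Built-in alternative to the polling helper above; the timeout is in
# milliseconds, and pyppeteer raises a TimeoutError on failure.
mfa_form = await page.waitForSelector(
    '.js-login-new-device-form-selector:not(.hidden)',
    timeout=10000,
)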

The results have been promising so far: in my testing this method has been both more reliable and faster than the current implementation, since it doesn't have to wait as long for data. I figured I'd share it here in case you'd like to incorporate the changes into your library (or, if you're open to a PR, I can see if I can make the changes myself), or for others to use as a reference.

One thing I'd also like to add: I would recommend returning all of the reading results rather than just the latest one. AMI data can lag or be revised after the fact (utilities are bad at computers), so if you're trying to scrape and store your meter's data, you'll likely want to fetch the whole set of readings and insert or replace each interval in the database you're storing them in. Since running this scraper I've noticed that the latest reading is usually null for an hour or so before it gets populated with a kWh value.
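
To illustrate the insert-or-replace idea, here's a minimal sketch using sqlite3 from the standard library. The table layout and the 'reads'/'startTime'/'value' field names are assumptions for illustration, not the actual shape of the usage payload:

import sqlite3

def store_readings(data):
    # Upsert every interval so that late-arriving kWh values overwrite
    # the nulls scraped earlier. Field names here are hypothetical;
    # adjust them to match the real usage JSON.
    conn = sqlite3.connect('readings.db')
    conn.execute(
        'CREATE TABLE IF NOT EXISTS readings ('
        'start_time TEXT PRIMARY KEY, kwh REAL)'
    )
    conn.executemany(
        'INSERT OR REPLACE INTO readings (start_time, kwh) VALUES (?, ?)',
        [(r['startTime'], r['value']) for r in data['reads']],
    )
    conn.commit()
    conn.close()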

Hope this is helpful!

bvlaicu (Owner) commented Jan 16, 2022

@stevenleeg Thanks for the feedback and suggestions. The library could definitely use these improvements. I'll try to find some time to incorporate them, unless you'd like to open a PR yourself.

tmb5cg commented Mar 29, 2022

🐐

tmb5cg commented Mar 29, 2022

> (quoting @stevenleeg's comment above in full)

You're the bomb.com, this works great. And of course thanks to @bvlaicu. I'll upload my edits to my fork, but I have no idea how to merge/however it works.

One thing I'm confused about is all the event_loop stuff. I think I understand at a super high level why asyncio is used (if you were running multiple 'apis' on the same server, to avoid race conditions?), and await is great (not sure if that's built into Python or part of asyncio?), but yeah, I basically removed some of it to avoid the double Chromium.
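
(For reference: async/await is core Python syntax as of 3.5, while asyncio supplies the event loop that actually runs the coroutines. A minimal sketch of driving the scraper from synchronous code, assuming Python 3.7+:)

import asyncio

# asyncio.run() creates an event loop, runs the coroutine to completion,
# and tears the loop down, so no manual event_loop plumbing is needed.
data = asyncio.run(fetch_readings())
print(data)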

Anyways, thanks all. I'm big into Selenium and hadn't played with pyppeteer till now; this solves so many JS headaches.
