
Increase scraping performance and reliability #9

Open · stevenleeg opened this issue Jan 16, 2022 · 3 comments

stevenleeg commented Jan 16, 2022

Hi there,

First off, I just want to say a huge thanks for building this: it helped me build a nice scraper to track and store my electricity usage, which is something I've been putting off for a very long time.

While trying to get this library to work I noticed that I would successfully fetch reading data on only about 25% of my attempts. The other 75% would get caught up on authentication issues or some element not showing up as it should. I had a feeling these issues were due to various race conditions associated with scraping a JS-heavy webpage, and I wanted to do some refactoring, taking some lessons from how Cypress approaches this kind of work. Namely, I restructured the scraper to watch for and respond to elements appearing or disappearing on the page rather than waiting arbitrary amounts of time and hoping requests had finished.

Here's the code, which is built pretty specifically for the context where I'm using it, but you can see the general ideas:

import asyncio
import json
import logging
import os

import pyotp
import pyppeteer

async def fetch_element(page, selector, max_tries=10):
    # Poll for the element rather than sleeping for one fixed, arbitrary
    # interval; give up after max_tries attempts spaced ~1s apart.
    for _ in range(max_tries):
        el = await page.querySelector(selector)
        if el is not None:
            return el
        await page.waitFor(1000)
    return None

async def fetch_readings():
    browser_launch_config = {
        'defaultViewport': {'width': 1920, 'height': 1080},
        'dumpio': False,
        'args': ['--no-sandbox'],
    }

    browser = await pyppeteer.launch(browser_launch_config)
    page = await browser.newPage()
    await page.goto('https://coned.com/en/login', {'waitUntil': 'domcontentloaded'})

    # Wait for the login form to render before typing into it.
    email_input = await fetch_element(page, '#form-login-email')
    if email_input is None:
        logging.error('Login form never appeared. Aborting!')
        return

    logging.info('Authenticating...')
    await page.type('#form-login-email', os.environ.get('CONED_EMAIL'))
    await page.type('#form-login-password', os.environ.get('CONED_PASSWORD'))
    await page.click('.submit-button')

    mfa_form = await fetch_element(page, '.js-login-new-device-form-selector:not(.hidden)')
    if mfa_form is None:
        logging.error('Never got MFA prompt. Aborting!')
        return

    logging.info('Entering MFA code...')
    mfa = await fetch_element(page, '#form-login-mfa-code')
    if mfa is None:
        logging.error('MFA input never appeared. Aborting!')
        return

    # Generate the current TOTP code from the shared secret.
    mfa_code = pyotp.TOTP(os.environ.get('CONED_MFA')).now()
    await mfa.type(mfa_code)
    await asyncio.gather(
        page.waitForNavigation(),
        page.click('.js-login-new-device-form .button'),
    )

    logging.info('Pausing for auth...')
    # Give the post-MFA session setup a moment to settle before hitting
    # the usage endpoint.
    await page.waitFor(5000)
    logging.info('Fetching readings JSON...')

    # A new tab in the same browser shares the authenticated session.
    api_page = await browser.newPage()
    account_id = os.environ.get('CONED_ACCOUNT_ID')
    meter_no = os.environ.get('CONED_METER_NUMBER')
    url = f"https://cned.opower.com/ei/edge/apis/cws-real-time-ami-v1/cws/cned/accounts/{account_id}/meters/{meter_no}/usage"
    await api_page.goto(url)

    # The endpoint returns raw JSON, which Chromium renders inside a <pre> tag.
    data_elem = await api_page.querySelector('pre')
    raw_data = await api_page.evaluate('(el) => el.textContent', data_elem)

    data = json.loads(raw_data)

    await browser.close()
    logging.info('Done!')

    return data
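
As an aside, if I'm reading the pyppeteer API right, page.waitForSelector does roughly what fetch_element hand-rolls here: it polls for the selector and raises a TimeoutError if it never shows up. A minimal sketch:

# Built-in alternative to the polling helper above; the timeout is in
# milliseconds, and pyppeteer raises a TimeoutError on failure.
mfa_form = await page.waitForSelector(
    '.js-login-new-device-form-selector:not(.hidden)',
    timeout=10000,
)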

The results have been promising so far: in my testing this method has been both more reliable and faster than the current implementation, since it doesn't have to wait as long for data. I figured I'd share it here in case you'd like to incorporate the changes into your library (or, if you're open to a PR, I can see if I can make the changes myself), or for others to use as a reference.

One thing I'd also like to add: I would recommend returning all of the reading results rather than just the latest one. AMI data can lag or be revised after the fact (utilities are bad at computers), so if you're trying to scrape and store your meter's data, you'll likely want to fetch the whole set of readings and insert or replace each interval in the database you're storing them in. Since running this scraper I've noticed that the latest reading is usually null for an hour or so before it gets populated with a kWh value.
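
To illustrate the insert-or-replace idea, here's a minimal sketch using sqlite3 from the standard library. The table layout and the 'reads'/'startTime'/'value' field names are assumptions for illustration, not the actual shape of the usage payload:

import sqlite3

def store_readings(data):
    # Upsert every interval so that late-arriving kWh values overwrite
    # the nulls scraped earlier. Field names here are hypothetical;
    # adjust them to match the real usage JSON.
    conn = sqlite3.connect('readings.db')
    conn.execute(
        'CREATE TABLE IF NOT EXISTS readings ('
        'start_time TEXT PRIMARY KEY, kwh REAL)'
    )
    conn.executemany(
        'INSERT OR REPLACE INTO readings (start_time, kwh) VALUES (?, ?)',
        [(r['startTime'], r['value']) for r in data['reads']],
    )
    conn.commit()
    conn.close()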

Hope this is helpful!

bvlaicu (Owner) commented Jan 16, 2022

@stevenleeg Thanks for the feedback and suggestions. The library could definitely use these improvements. I'll try to find some time to incorporate them, unless you'd like to open a PR yourself.

tmb5cg commented Mar 29, 2022

🐐

tmb5cg commented Mar 29, 2022

> (quoting @stevenleeg's comment above in full)

You're the bomb.com, this works great. And of course thanks to @bvlaicu. I'll upload my edits to my fork, but I have no idea how to merge/however it works.

One thing I'm confused about is all the event_loop stuff. I think I understand at a super high level why asyncio is used (if you were running multiple 'apis' on the same server, to avoid race conditions?), and await is great (not sure if that's built into Python or part of asyncio?), but yeah, I basically removed some of it to avoid the double Chromium.
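
(For reference: async/await is core Python syntax as of 3.5, while asyncio supplies the event loop that actually runs the coroutines. A minimal sketch of driving the scraper from synchronous code, assuming Python 3.7+:)

import asyncio

# asyncio.run() creates an event loop, runs the coroutine to completion,
# and tears the loop down, so no manual event_loop plumbing is needed.
data = asyncio.run(fetch_readings())
print(data)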

Anyways, thanks all. I'm big into Selenium and hadn't played with pyppeteer till now; this solves so many JS headaches.
