Nonbreaking spaces lead to surprising behavior #23

steve-bate · 2020-09-13T12:51:17Z

I tried using autoscraper to scrape items from the Hacker News home page. The scraper had issues with the nonbreaking space in the comments link on each list item. I was eventually able to work around the issue by using '\xa0' in the wanted_list string. That matched the comments field, but the scraper then returned incorrect results anyway. My guess is that something is not matching the nonbreaking space in the "stack" analysis, but I didn't invest the time to find the root cause.

This project is an interesting idea, but I recommend unit tests and some documentation about the matching algorithm to help users help you with diagnosing bugs.

alirezamika · 2020-09-13T13:27:33Z

Can you please share more information, such as the URL and the wanted list or your code, so we can check?

steve-bate · 2020-09-13T14:49:07Z

The training data (wanted_list) will be different when you try to scrape the page, since the content is dynamically updated. To test this specific issue, I commented out most of the entries in the wanted_list I was extracting and left only the comment count uncommented.

Note that without the '\xa0' in the wanted_list string, no matches were found (e.g., neither a regular space nor ' ' worked).

url = 'https://news.ycombinator.com/'

wanted_list = [
    # 'Ancient Earth Globe',
    # '23 points',
    # 'BerislavLopac',
    # '1 hour ago',
    '7\xa0comments'  # '7 comments'
]
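
For illustration, here is a minimal sketch of why the plain-space string finds no match. It assumes the page markup writes the comment count with an `&nbsp;` entity (the `raw` string below is a made-up stand-in, not copied from the live page) and uses only the standard library's `html.unescape` to show what a parser would hand back.

```python
import html

# Assumed markup for the comments link text; the real page may differ.
raw = "7&nbsp;comments"

# Decoding the entity yields U+00A0 (NO-BREAK SPACE), not U+0020.
decoded = html.unescape(raw)

print(repr(decoded))                # shows the \xa0 escape explicitly
print(decoded == "7 comments")      # plain space: not equal
print(decoded == "7\xa0comments")   # nonbreaking space: equal
```

Printing with `repr` is a quick way to confirm which kind of space a scraped string actually contains, since both render identically on screen.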

alirezamika · 2020-09-13T15:04:44Z

Got it. We should deal with non-breaking spaces. Thank you.
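
One possible way to deal with them, sketched under assumptions: normalize both the wanted string and the scraped text before comparing. The helper names `normalize_text` and `text_matches` are hypothetical and are not taken from autoscraper or from the change referenced in this thread.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical helper: fold compatibility characters and trim whitespace.

    NFKC normalization maps U+00A0 (NO-BREAK SPACE) to a regular space.
    """
    return unicodedata.normalize("NFKC", text).strip()


def text_matches(wanted: str, found: str) -> bool:
    """Hypothetical comparison that ignores the space/nonbreaking-space difference."""
    return normalize_text(wanted) == normalize_text(found)


print(text_matches("7 comments", "7\xa0comments"))
print(text_matches("7 comments", "8\xa0comments"))
```

NFKC also folds other compatibility characters (ligatures, full-width forms), which may be broader than wanted; a narrower alternative is `text.replace("\xa0", " ")`.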

alirezamika · 2020-09-13T16:49:30Z

#24

alirezamika closed this as completed Sep 13, 2020
