Nonbreaking spaces lead to surprising behavior #23

Closed
steve-bate opened this issue Sep 13, 2020 · 4 comments

Comments

@steve-bate

I tried using autoscraper to scrape items from the Hacker News home page. The scraper had issues with the nonbreaking space in the comments link on each list item. I was eventually able to work around the issue by using '\xa0' in the wanted_list string. That matched the comments field but still returned incorrect results. My guess is that something is not matching the nonbreaking space in the "stack" analysis (but I didn't invest the time to find the root cause).

This project is an interesting idea, but I recommend unit tests and some documentation about the matching algorithm to help users help you with diagnosing bugs.

@alirezamika
Owner

Can you please share more information, such as the URL and the wanted list or your code, so we can check?

@steve-bate
Author

steve-bate commented Sep 13, 2020

The training data (wanted_list) will be different when you try to scrape the page, since the content is dynamically updated. To test this specific issue, I commented out most of the entries in the wanted_list I was extracting and left only the comment count uncommented.

Note that without the '\xa0' in the wanted_list string, no matches were found (e.g., neither a regular space nor ' ' worked).

url = 'https://news.ycombinator.com/'

wanted_list = [
    # 'Ancient Earth Globe',
    # '23 points',
    # 'BerislavLopac',
    # '1 hour ago',
    '7\xa0comments'  # '7 comments'
]
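
For anyone reproducing this, here is a minimal sketch of why the plain-space version finds nothing. It does not call autoscraper or fetch the page; `scraped_text` and `typed_text` are illustrative names, and the sample string is simply the one from the wanted_list above.

```python
# Text as it appears once the HTML entity &nbsp; is decoded (U+00A0).
scraped_text = "7\xa0comments"
# What one would naturally type into wanted_list.
typed_text = "7 comments"

# The strings look the same when printed, but an exact comparison fails.
print(scraped_text == typed_text)  # False

# repr() makes the hidden character visible.
print(repr(scraped_text))

# Positions where the two strings differ.
diff_positions = [
    i for i, (a, b) in enumerate(zip(scraped_text, typed_text)) if a != b
]
print(diff_positions)  # [1]
```

The only difference is the single character between the number and the word, which is why an exact string match on the typed version fails.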

@alirezamika
Owner

Got it. We should deal with non-breaking spaces. Thank you.
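
One possible way to deal with them, sketched here as an assumption rather than as what autoscraper actually does, is to normalize nonbreaking spaces on both sides before comparing. The names `normalize_spaces` and `matches` are hypothetical and are not part of the autoscraper API.

```python
NBSP = "\xa0"  # U+00A0, the character produced by &nbsp;


def normalize_spaces(text: str) -> str:
    """Hypothetical helper: replace nonbreaking spaces with regular spaces."""
    return text.replace(NBSP, " ")


def matches(wanted: str, candidate: str) -> bool:
    """Compare two strings after normalizing spaces on both sides."""
    return normalize_spaces(wanted) == normalize_spaces(candidate)


# The user-typed string now matches the scraped one.
print(matches("7 comments", "7\xa0comments"))  # True
# Text that genuinely differs still does not match.
print(matches("7 comments", "8\xa0comments"))  # False
```

Normalizing both the wanted string and the page text means that either spelling in wanted_list would match. A broader alternative is `unicodedata.normalize("NFKC", text)`, which also maps U+00A0 to a regular space but rewrites other compatibility characters as well.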

@alirezamika
Owner

#24
