Nonbreaking spaces lead to surprising behavior #23
Comments
Can you please share more information, such as the URL and the wanted list or your code, so we can check?
The training data (wanted_list) will be different when you try to scrape the page, since the content is dynamically updated. To test this specific issue, I commented out most of the entries in the wanted_list I was extracting and left only the comment count uncommented. Note that without the '\xa0' in the wanted_list string, no matches were found (e.g., neither a regular space nor ' ' worked).
url = 'https://news.ycombinator.com/'
wanted_list = [
# 'Ancient Earth Globe',
# '23 points',
# 'BerislavLopac',
# '1 hour ago',
'7\xa0comments' # '7 comments'
]
Got it. We should deal with non-breaking spaces. Thank you.
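One possible way to deal with them, sketched here as an illustration only: fold non-breaking spaces into regular spaces on both the scraped text and the wanted_list entries before comparing. The helper name `normalize_spaces` is hypothetical and is not part of autoscraper; where such a step would belong in the matching code is an assumption.

```python
def normalize_spaces(text: str) -> str:
    """Hypothetical helper: replace non-breaking spaces (U+00A0) with regular spaces."""
    return text.replace("\xa0", " ")


# Text as it appears in the page's comments link, and as a user would type it.
scraped = "7\xa0comments"
wanted = "7 comments"

# A plain comparison fails because the two space characters differ.
print(scraped == wanted)

# Comparing the normalized forms succeeds.
print(normalize_spaces(scraped) == normalize_spaces(wanted))
```

A broader alternative would be `unicodedata.normalize("NFKC", text)`, which also maps U+00A0 to a regular space but rewrites other compatibility characters as well, so it may change more text than intended.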
I tried using autoscraper to scrape items from the Hacker News home page. The scraper had issues with the nonbreaking space in the comments link on each list item. I was eventually able to work around the issue by using '\xa0' in the wanted_list string. That matched the comments field but still returned incorrect results. My guess is that something is not matching the nonbreaking space in the "stack" analysis (but I didn't invest the time to find the root cause).
This project is an interesting idea, but I recommend unit tests and some documentation about the matching algorithm to help users help you with diagnosing bugs.