Fixed _extract_post_id returning null values caused by multiple items in page with class _5pcq #21

Open: ongzexuan wants to merge 3 commits into master

ongzexuan commented Jun 10, 2020

_extract_post_id uses the class _5pcq to find the post ID. This sometimes fails when an item contains several elements with that class, some of which have an effectively empty href attribute. This PR proposes a fix: check that the href attribute contains a URL (beginning with '/') before extracting the ID.

Sample output from printing item.find_all(class_="_5pcq") is shown below. The last element in the list has '#' as its href, which causes the resulting post_id to be '#'.

To replicate:
python scraper.py -p TheStraitsTimes -l 1

<a class="_5pcq" href="/TheStraitsTimes/posts/10157114673327115?__xts__%5B0%5D=68.ARAQBh0K_NMj_mQAANUH_3XvHEDd3zLc83FLEcu4VcfDAdkM6z1PAP4Izat-cL4tQmNTMr_W875cfYO3vqYneCqXcjuRt9Q1tiYK64NKoaEUtHoyIyAjcZi6jHtUrCB60YZfPvwidqL6Aw6Vm7yIdE7amIjP-yTjI25iMi-EH7xYHzCLxG1U83eUuG-L4xX73BaqcA8MtjD6aeI-EFfelvwRVHDV5GlwwgN2cGDrcv5_--KTGPV8mNO9UFtcj4BdxBG45bb4QZrpTE-PxmdnjHAIjbauy89o3zXPRG8t5LsfThBfy5UYs0M3PcVsiJi8UJswS-_QJDDTwFMnozEp&amp;__tn__=-R" target=""><abbr class="_5ptz timestamp livetimestamp" data-shorten="1" data-utime="1591800911" title="Wednesday, June 10, 2020 at 7:55 AM"><span class="timestampContent" id="js_6">1 hr</span></abbr></a>

<a aria-label="Public" class="uiStreamPrivacy inlineBlock fbStreamPrivacy fbPrivacyAudienceIndicator _5pcq" data-hover="tooltip" data-tooltip-content="Public" href="#" role="img"><i class="lock img sp_sG2S1OTONin sx_b0665c"></i></a>

<a class="_5pcq" href="/TheStraitsTimes/posts/10157114743227115?__xts__%5B0%5D=68.ARCAUPaCRHFNlhpvP2W3jDKjTebqzmTZplSSOjw7Q6sLY5VjEDPitgFQ1kYPbbGkhEiMNdN4ZLR2BjaCFdWQe5V3pDbTZ73LXDRBsFjuGX_WX2BFnx0r1xDjP2OXiYNx9B1YJOEmnVbPJg6M817WmCRTUmSjsCECgHKDAaLin8z7bP3s0XjTqaEXxtmINF7Beqwi4lqMhx8D8HQG5rgZqFzCjMOpo8s_glZV36SHwX1z2fFLpF4iudosAK-005XvhBBIfs66n5UZe9AsQmvd0QsMbjfVQIN_JqGY4-mn8VjW8XjZRzBKFEUCir2efcX5bAitc0MVnQ2Fdn0Pdzyj&amp;__tn__=-R" target=""><abbr class="_5ptz timestamp livetimestamp" data-shorten="1" data-utime="1591803018" title="Wednesday, June 10, 2020 at 8:30 AM"><span class="timestampContent" id="js_9">1 hr</span></abbr></a>

<a aria-label="Public" class="uiStreamPrivacy inlineBlock fbStreamPrivacy fbPrivacyAudienceIndicator _5pcq" data-hover="tooltip" data-tooltip-content="Public" href="#" role="img"><i class="lock img sp_sG2S1OTONin sx_b0665c"></i></a>
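
The proposed check could look roughly like the sketch below. It is only an illustration under stated assumptions, not the code in this PR: the names AnchorHrefCollector and pick_post_href are hypothetical, the standard-library HTMLParser stands in for the find_all(class_="_5pcq") call so the example runs on its own, and the sample hrefs are shortened versions of the ones printed above.

```python
from html.parser import HTMLParser


class AnchorHrefCollector(HTMLParser):
    """Collects href values of <a> tags that carry a given CSS class.

    Hypothetical stand-in for item.find_all(class_="_5pcq").
    """

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes:
            self.hrefs.append(attr_map.get("href"))


def pick_post_href(hrefs):
    """Return the first href that begins with '/', or None if there is none."""
    for href in hrefs:
        # Skip placeholder links such as '#', empty strings, and missing hrefs.
        if href and href.startswith("/"):
            return href
    return None


# Abbreviated sample: a timestamp link followed by a privacy-indicator link.
sample_html = (
    '<a class="_5pcq" href="/TheStraitsTimes/posts/10157114673327115">1 hr</a>'
    '<a class="uiStreamPrivacy _5pcq" href="#" role="img"></a>'
)

collector = AnchorHrefCollector("_5pcq")
collector.feed(sample_html)
post_href = pick_post_href(collector.hrefs)
print(post_href)  # /TheStraitsTimes/posts/10157114673327115
```

Returning None when no href starts with '/' keeps the "no usable link" case distinguishable from a bogus ID such as '#', so the caller can decide how to handle it.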
