Douban page structure changes #6

Closed
WindStill opened this issue Feb 26, 2022 · 2 comments

WindStill commented Feb 26, 2022

1. The author link has changed from /author to /search.

def author_filter(self, a_element):
    # Treat an <a> element as an author link when its href points to /search.
    a_href = a_element.attrib['href']
    return '/search' in a_href

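2. The div wrapping the description is incomplete, so the extracted description also picks up content that is not part of the description. Trim everything after the first </div>.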

book['description'] = ''
if len(summary_element):
    summary = etree.tostring(summary_element[-1], encoding="utf8").decode("utf8").strip()
    # Keep everything up to and including the first closing </div> (6 characters long).
    book['description'] = summary[0:(summary.index("</div>") + 6)]
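
The trimming step can also be written so that it does not raise when the serialized markup contains no `</div>` at all. The following is only a sketch under that assumption: `trim_after_first_div_close` is a hypothetical helper name, not part of the plugin, and it shares the limitation of the snippet above, namely that a description containing its own nested div would be cut at that inner `</div>`.

```python
def trim_after_first_div_close(summary):
    """Cut the markup just after the first closing </div>.

    Hypothetical helper for illustration only. If no </div> is present,
    the input is returned unchanged instead of raising ValueError.
    """
    marker = "</div>"
    end = summary.find(marker)  # str.find returns -1 when the marker is missing
    if end == -1:
        return summary
    return summary[:end + len(marker)]


# Made-up markup: the trailing <span> stands in for unrelated page content.
sample = '<div class="intro"><p>text</p></div><span>unrelated</span>'
trimmed = trim_after_first_div_close(sample)
```

Using `str.find` instead of `str.index` is the only behavioral difference from the original two lines.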
Owner

fugary commented Feb 26, 2022

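The first issue is real, but it is not a change in Douban's page structure. Most likely some authors have no mapping in Douban's author database, so their names are rendered as search links instead.

I won't change the second one for now. div tags are common inside descriptions, so simply cutting them off will not work (it would be best to look at more examples first).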

Author

WindStill commented Feb 26, 2022

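> The first issue is real, but it is not a change in Douban's page structure. Most likely some authors have no mapping in Douban's author database, so their names are rendered as search links instead.
>
> I won't change the second one for now. div tags are common inside descriptions, so simply cutting them off will not work (it would be best to look at more examples first).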

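The first issue is indeed what you describe: authors without a mapping. It is quite common, though, especially for foreign authors, so it may be worth supporting both cases. Other links, such as series links starting with /series and producer links starting with /producer, should not be affected.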
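
A minimal sketch of what supporting both cases could look like, written as a plain function on the href string so it runs on its own. The name `is_author_link`, the marker tuple, and the sample hrefs are all made up for illustration; how the plugin wires the check into `author_filter` is not shown in this thread.

```python
# Hypothetical marker list: mapped authors link to /author,
# unmapped authors fall back to a /search link.
AUTHOR_LINK_MARKERS = ("/author", "/search")


def is_author_link(href):
    """Return True when href looks like an author link of either kind."""
    href = href or ""  # tolerate a missing href attribute
    return any(marker in href for marker in AUTHOR_LINK_MARKERS)


# Made-up hrefs, only to exercise the four link kinds mentioned above.
samples = ["/author/123/", "/search/Some%20Name", "/series/45", "/producer/6"]
matches = [is_author_link(h) for h in samples]
```

Series and producer links stay excluded because neither `/author` nor `/search` occurs in them.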

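As for the second issue, I looked at about twenty books. The children of the description's <div class="intro"> are all p tags; so far I have not seen any div tags there.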
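
To repeat this kind of check over more pages, something like the sketch below could tally which tags appear directly under each `<div class="intro">`. It uses only the standard library rather than lxml, the class and function names are hypothetical, and it assumes non-void elements are explicitly closed, so a page with unclosed p tags would be miscounted.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class IntroChildCounter(HTMLParser):
    """Count tag names of direct children of every <div class="intro">."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # 0 = outside an intro div, 1 = directly inside one
        self.children = Counter()

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag == "div" and "intro" in classes:
                self.depth = 1
            return
        if self.depth == 1:
            self.children[tag] += 1  # direct child of the intro div
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def intro_child_tags(html):
    """Return a Counter of direct child tags under div.intro elements."""
    parser = IntroChildCounter()
    parser.feed(html)
    parser.close()
    return parser.children


# Made-up page fragment for illustration.
page = ('<div id="info"><div class="intro"><p>a<br>b</p><p>c</p></div>'
        '<div class="other"><div>z</div></div></div>')
counts = intro_child_tags(page)
```

Any key other than `p` in the result would be a counterexample worth inspecting by hand.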

fugary closed this as completed Mar 11, 2022