javbus: handle being redirected to the login page when scraping data
Yuukiy committed Oct 2, 2023
1 parent bb57163 commit a11f65e
Showing 1 changed file with 10 additions and 2 deletions.
12 changes: 10 additions & 2 deletions web/javbus.py
@@ -31,9 +31,17 @@ def parse_data(movie: MovieInfo):
     if resp.status_code == 404:
         raise MovieNotFoundError(__name__, movie.dvdid)
     resp.raise_for_status()
-    html = resp2html(resp)
+    # JavBus appears to require login when it detects crawler-like behavior; for now, however, the data can still be extracted from the pre-redirect page without logging in
+    if resp.history and resp.history[0].status_code == 302:
+        html = resp2html(resp.history[0])
+    else:
+        html = resp2html(resp)
+    # Since login verification was introduced, the status code is no longer reliable, so also check the page title to confirm whether a 404 occurred
+    page_title = html.xpath('/html/head/title/text()')
+    if page_title and page_title[0].startswith('404 Page Not Found!'):
+        raise MovieNotFoundError(__name__, movie.dvdid)

-    container = html.xpath("/html/body/div[@class='container']")[0]
+    container = html.xpath("//div[@class='container']")[0]
     title = container.xpath("h3/text()")[0]
     cover = container.xpath("//a[@class='bigImage']/img/@src")[0]
     preview_pics = container.xpath("//div[@id='sample-waterfall']/a/@href")
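The fallback in this hunk can be tried in isolation. The following is a minimal sketch, not the project's code: `select_parsable_response` and `is_not_found_title` are hypothetical helper names, and the `SimpleNamespace` objects are stand-ins for `requests.Response` instances. It assumes, as the commit does, that a 302 as the first entry in `resp.history` is the redirect to the login page.

```python
from types import SimpleNamespace


def select_parsable_response(resp):
    """Return the response whose body should be parsed.

    If the request chain started with a 302 (assumed here to be the
    redirect to the login page), fall back to that first response;
    otherwise use the final response.
    """
    if resp.history and resp.history[0].status_code == 302:
        return resp.history[0]
    return resp


def is_not_found_title(page_title):
    """Check a list of <title> text nodes, as returned by an XPath query."""
    return bool(page_title) and page_title[0].startswith('404 Page Not Found!')


# Hypothetical stand-ins for requests.Response objects, for illustration only.
first = SimpleNamespace(status_code=302, history=[])
final = SimpleNamespace(status_code=200, history=[first])
direct = SimpleNamespace(status_code=200, history=[])

chosen = select_parsable_response(final)    # the pre-redirect response
unchanged = select_parsable_response(direct)  # no redirect: the final response
```

Keeping the title check separate from the status-code check mirrors the commit's reasoning that the status code alone can no longer be trusted once a login redirect is involved.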

2 comments on commit a11f65e

@dangjingtao

The author has researched this thoroughly.

@dangjingtao

I just tried it with a headless browser and it still works for now, but it is hard to say what will happen in the future.
