# API Test

This notebook probes the endpoints exposed by `movie.douban.com` and their sample responses.

The endpoints were discovered with Chrome DevTools.

In [1]:
import requests
import json
from common import headers
from bs4 import BeautifulSoup

The following API is found on `https://movie.douban.com/explore`. However, it returns at most the 500 most recent results, which makes it unsuitable for scraping the complete list.

In [2]:
r = requests.get("https://movie.douban.com/j/search_subjects?type=movie&tag=%E5%8D%8E%E8%AF%AD&sort=time&page_limit=20&page_start=0", headers=headers)
r.text

'{"subjects":[{"episodes_info":"","rate":"7.0","cover_x":992,"title":"目中无人","url":"https:\\/\\/movie.douban.com\\/subject\\/35295405\\/","playable":true,"cover":"https://img9.doubanio.com\\/view\\/photo\\/s_ratio_poster\\/public\\/p2873818227.jpg","id":"35295405","cover_y":1389,"is_new":true},{"episodes_info":"","rate":"6.5","cover_x":2200,"title":"山村狐妻","url":"https:\\/\\/movie.douban.com\\/subject\\/35914264\\/","playable":false,"cover":"https://img9.doubanio.com\\/view\\/photo\\/s_ratio_poster\\/public\\/p2874192380.jpg","id":"35914264","cover_y":3911,"is_new":true},{"episodes_info":"","rate":"5.5","cover_x":1080,"title":"盲战","url":"https:\\/\\/movie.douban.com\\/subject\\/35604619\\/","playable":true,"cover":"https://img9.doubanio.com\\/view\\/photo\\/s_ratio_poster\\/public\\/p2872304316.jpg","id":"35604619","cover_y":1921,"is_new":false},{"episodes_info":"","rate":"4.6","cover_x":3000,"title":"我是真的讨厌异地恋","url":"https:\\/\\/movie.douban.com\\/subject\\/35057107\\/","playable":true

In [3]:
movies = json.loads(r.text)["subjects"]
movies

[{'episodes_info': '',
  'rate': '7.0',
  'cover_x': 992,
  'title': '目中无人',
  'url': 'https://movie.douban.com/subject/35295405/',
  'playable': True,
  'cover': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2873818227.jpg',
  'id': '35295405',
  'cover_y': 1389,
  'is_new': True},
 {'episodes_info': '',
  'rate': '6.5',
  'cover_x': 2200,
  'title': '山村狐妻',
  'url': 'https://movie.douban.com/subject/35914264/',
  'playable': False,
  'cover': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2874192380.jpg',
  'id': '35914264',
  'cover_y': 3911,
  'is_new': True},
 {'episodes_info': '',
  'rate': '5.5',
  'cover_x': 1080,
  'title': '盲战',
  'url': 'https://movie.douban.com/subject/35604619/',
  'playable': True,
  'cover': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2872304316.jpg',
  'id': '35604619',
  'cover_y': 1921,
  'is_new': False},
 {'episodes_info': '',
  'rate': '4.6',
  'cover_x': 3000,
  'title': '我是真的讨厌异地恋',
  'url': 'https://
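
The query string in the request above is easier to read and change when it is built from a dictionary. The sketch below rebuilds the same URL with the standard library. The helper name `search_subjects_url` is hypothetical, and the meaning of each parameter is inferred from its name rather than from any documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/j/search_subjects"


def search_subjects_url(tag, page_start=0, page_limit=20, sort="time"):
    """Build the search_subjects URL (hypothetical helper for this notebook)."""
    params = {
        "type": "movie",
        "tag": tag,  # urlencode percent-encodes this as UTF-8
        "sort": sort,
        "page_limit": page_limit,
        "page_start": page_start,
    }
    return BASE_URL + "?" + urlencode(params)


print(search_subjects_url("华语"))
```

Passing the same dictionary through the `params` argument of `requests.get` should produce an equivalent request.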

The following API is found on `https://movie.douban.com/tag/#/`. This is the one we will use to scrape the complete list of movies.

In [2]:
r = requests.get("https://movie.douban.com/j/new_search_subjects?sort=R&range=0,10&tags=%E7%94%B5%E5%BD%B1&start=0&genres=%E5%89%A7%E6%83%85&countries=%E4%B8%AD%E5%9B%BD%E5%A4%A7%E9%99%86&year_range=2022,2022", headers=headers)
r.text

'{"data":[{"directors":["郑保瑞"],"rate":"","cover_x":1428,"star":"0","title":"澎湖海战","url":"https:\\/\\/movie.douban.com\\/subject\\/35558234\\/","casts":["吴京"],"cover":"https://img9.doubanio.com\\/view\\/photo\\/s_ratio_poster\\/public\\/p2692973555.jpg","id":"35558234","cover_y":2000},{"directors":["汪迪"],"rate":"","cover_x":1299,"star":"0","title":"不游海水的鲸","url":"https:\\/\\/movie.douban.com\\/subject\\/35609495\\/","casts":["朱丛冉","野兆月","韩三明","姬云潇","李丽鲜"],"cover":"https://img9.doubanio.com\\/view\\/photo\\/s_ratio_poster\\/public\\/p2753248131.jpg","id":"35609495","cover_y":1949},{"directors":["韩涛"],"rate":"","cover_x":1080,"star":"0","title":"岁月如织","url":"https:\\/\\/movie.douban.com\\/subject\\/35093566\\/","casts":[],"cover":"https://img9.doubanio.com\\/view\\/photo\\/s_ratio_poster\\/public\\/p2608002554.jpg","id":"35093566","cover_y":1500},{"directors":["徐昂"],"rate":"","cover_x":4429,"star":"0","title":"忠犬八公","url":"https:\\/\\/movie.douban.com\\/subject\\/26999802\\/","casts":["冯小

In [7]:
movies = json.loads(r.text)["data"]
movies

[{'directors': ['郑保瑞'],
  'rate': '',
  'cover_x': 1428,
  'star': '0',
  'title': '澎湖海战',
  'url': 'https://movie.douban.com/subject/35558234/',
  'casts': ['吴京'],
  'cover': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2692973555.jpg',
  'id': '35558234',
  'cover_y': 2000},
 {'directors': ['汪迪'],
  'rate': '',
  'cover_x': 1299,
  'star': '0',
  'title': '不游海水的鲸',
  'url': 'https://movie.douban.com/subject/35609495/',
  'casts': ['朱丛冉', '野兆月', '韩三明', '姬云潇', '李丽鲜'],
  'cover': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2753248131.jpg',
  'id': '35609495',
  'cover_y': 1949},
 {'directors': ['韩涛'],
  'rate': '',
  'cover_x': 1080,
  'star': '0',
  'title': '岁月如织',
  'url': 'https://movie.douban.com/subject/35093566/',
  'casts': [],
  'cover': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2608002554.jpg',
  'id': '35093566',
  'cover_y': 1500},
 {'directors': ['徐昂'],
  'rate': '',
  'cover_x': 4429,
  'star': '0',
  'title': '忠犬八公',
  '
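
Scraping the complete list means repeating this request with an increasing `start` offset. The sketch below shows one way to do that. It assumes the endpoint returns an empty `data` list once the results run out, which this notebook has not verified. The names `iter_movies`, `fetch_page`, and `max_pages` are hypothetical.

```python
def iter_movies(fetch_page, max_pages=1000):
    """Yield movies page by page until a page comes back empty.

    fetch_page(start) is a caller-supplied function (hypothetical) that
    returns the "data" list for the given start offset.
    """
    start = 0
    for _ in range(max_pages):  # hard cap so a misbehaving endpoint cannot loop forever
        page = fetch_page(start)
        if not page:  # assumption: an empty list marks the end of the results
            break
        yield from page
        start += len(page)  # advance by the number of items actually returned


# Offline demonstration with a fake fetcher standing in for the HTTP call.
fake_pages = {0: [{"id": "a"}, {"id": "b"}], 2: [{"id": "c"}]}
ids = [m["id"] for m in iter_movies(lambda start: fake_pages.get(start, []))]
print(ids)
```

In the notebook, `fetch_page` could wrap the `requests.get` call above, substitute the offset into the `start` parameter, and return `json.loads(r.text)["data"]`. Pausing between requests is probably prudent.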

The following request fetches the detail page of an individual movie. Unfortunately, the response is HTML rather than JSON, so we have to parse it.

In [2]:
r = requests.get("https://movie.douban.com/subject/26925317/", headers=headers)
r.text

'<!DOCTYPE html>\n<html lang="zh-CN" class="ua-windows ua-webkit">\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n    <meta name="renderer" content="webkit">\n    <meta name="referrer" content="always">\n    <meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />\n    <title>\n        动物世界 (豆瓣)\n</title>\n    \n    <meta name="baidu-site-verification" content="cZdR4xxR7RxmM4zE" />\n    <meta http-equiv="Pragma" content="no-cache">\n    <meta http-equiv="Expires" content="Sun, 6 Mar 2005 01:00:00 GMT">\n    \n    <link rel="apple-touch-icon" href="https://img9.doubanio.com/f/movie/d59b2715fdea4968a450ee5f6c95c7d7a2030065/pics/movie/apple-touch-icon.png">\n    <link href="https://img9.doubanio.com/f/shire/204847ecc7d679de915c283531d14f16cfbee65e/css/douban.css" rel="stylesheet" type="text/css">\n    <link href="https://img9.doubanio.com/f/shire/0b4cdb02dd620693709d9314196b617f17c2f9ea/css/separation/_all.css" rel="

In [3]:
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="ua-windows ua-webkit" lang="zh-CN">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="webkit" name="renderer"/>
  <meta content="always" name="referrer"/>
  <meta content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" name="google-site-verification">
   <title>
    动物世界 (豆瓣)
   </title>
   <meta content="cZdR4xxR7RxmM4zE" name="baidu-site-verification">
    <meta content="no-cache" http-equiv="Pragma"/>
    <meta content="Sun, 6 Mar 2005 01:00:00 GMT" http-equiv="Expires"/>
    <link href="https://img9.doubanio.com/f/movie/d59b2715fdea4968a450ee5f6c95c7d7a2030065/pics/movie/apple-touch-icon.png" rel="apple-touch-icon"/>
    <link href="https://img9.doubanio.com/f/shire/204847ecc7d679de915c283531d14f16cfbee65e/css/douban.css" rel="stylesheet" type="text/css"/>
    <link href="https://img9.doubanio.com/f/shire/0b4cdb02dd620693709d9314196b617f17c2f9ea/css/separation/_all.css" rel="stylesheet" type="text/css"/>
    <

In [4]:
soup.find("script", type="application/ld+json")

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "name": "动物世界",
  "url": "/subject/26925317/",
  "image": "https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2525528688.webp",
  "director": 
  [
    {
      "@type": "Person",
      "url": "/celebrity/1314828/",
      "name": "韩延 Yan Han"
    }
  ]
,
  "author": 
  [
    {
      "@type": "Person",
      "url": "/celebrity/1314828/",
      "name": "韩延 Yan Han"
    }
    ,
    {
      "@type": "Person",
      "url": "/celebrity/1321974/",
      "name": "福本伸行 Nobuyuki Fukumoto"
    }
  ]
,
  "actor": 
  [
    {
      "@type": "Person",
      "url": "/celebrity/1314140/",
      "name": "李易峰 Yifeng Li"
    }
    ,
    {
      "@type": "Person",
      "url": "/celebrity/1053620/",
      "name": "迈克尔·道格拉斯 Michael Douglas"
    }
    ,
    {
      "@type": "Person",
      "url": "/celebrity/1274224/",
      "name": "周冬雨 Dongyu Zhou"
    }
    ,
    {
      "@type": "Person",
      "url": "/celebrity/1313383/",
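
Because the `application/ld+json` script holds plain JSON, its text can be loaded into a dictionary instead of being parsed as HTML. The sketch below runs on a shortened sample modelled on the output above. `parse_ld_json` is a hypothetical helper, and whether every movie page carries this tag is an assumption.

```python
import json
from bs4 import BeautifulSoup

# Shortened sample modelled on the output above (not the full page).
sample_html = """
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "name": "动物世界",
  "url": "/subject/26925317/",
  "director": [
    {"@type": "Person", "url": "/celebrity/1314828/", "name": "韩延 Yan Han"}
  ],
  "actor": [
    {"@type": "Person", "url": "/celebrity/1314140/", "name": "李易峰 Yifeng Li"},
    {"@type": "Person", "url": "/celebrity/1274224/", "name": "周冬雨 Dongyu Zhou"}
  ]
}
</script>
"""


def parse_ld_json(soup):
    """Return the JSON-LD metadata as a dict, or None if the tag is missing."""
    tag = soup.find("script", type="application/ld+json")
    if tag is None or tag.string is None:
        return None
    return json.loads(tag.string)


meta = parse_ld_json(BeautifulSoup(sample_html, "html.parser"))
actor_names = [person["name"] for person in meta["actor"]]
print(meta["name"], actor_names)
```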

In [5]:
soup.find("span", property="v:summary")

<span class="" property="v:summary">
                                　　在游戏机厅做着兼职“小丑”的郑开司（李易峰 饰），幼时父亲突然失踪，母亲重病住院，使得郑开司的生活非常拮据。发小“大虾米”（曹炳琨 饰）借口买房骗下了郑开司父亲留下的房产，还给他带来了巨额的欠债。神秘人物（迈克尔·道格拉斯 Michael Douglas 饰）出现，告诉郑开司，只要参加“命运号”游轮上的神秘游戏，就有机会偿还完所有欠款，一无所有的郑开司为了给青梅竹马的护士刘青（周冬雨 饰）和母亲更好的生活，只得登上游轮，开始了生存游戏，一场以“剪刀、石头、布”展开的生死较量即将登场……
                        </span>

In [9]:
soup.find("span", string="语言:").next_sibling

' 汉语普通话 / 英语'
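
The lookup above raises `AttributeError` when the label is absent, because `soup.find` returns `None`. A more defensive version might look like the sketch below. The sample markup is an assumption: only the label text and the trailing text node are confirmed by the cell above. `labelled_values` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Assumed markup: the surrounding tags are illustrative, not copied from the page.
sample_html = "<div><span>语言:</span> 汉语普通话 / 英语<br/></div>"


def labelled_values(soup, label):
    """Return the "/"-separated values that follow a label span, or [] if absent."""
    span = soup.find("span", string=label)
    # next_sibling may be None or a Tag; only a text node can be split.
    if span is None or not isinstance(span.next_sibling, str):
        return []
    return [part.strip() for part in span.next_sibling.split("/") if part.strip()]


sample_soup = BeautifulSoup(sample_html, "html.parser")
languages = labelled_values(sample_soup, "语言:")
print(languages)
```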

The following request fetches the page of an individual celebrity.

In [2]:
r = requests.get("https://movie.douban.com/celebrity/1274255/", headers=headers)
r.text

'<!DOCTYPE html>\n<html lang="zh-CN" class="ua-windows ua-webkit">\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n    <meta name="renderer" content="webkit">\n    <meta name="referrer" content="always">\n    <meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />\n    <title>\n  冯小刚 Xiaogang Feng\n</title>\n    \n    <meta name="baidu-site-verification" content="cZdR4xxR7RxmM4zE" />\n    <meta http-equiv="Pragma" content="no-cache">\n    <meta http-equiv="Expires" content="Sun, 6 Mar 2005 01:00:00 GMT">\n    \n  <meta name="keywords" content="冯小刚 Xiaogang Feng,简介,个人资料,图片,电影作品,获奖情况,合作影人"/>\n  <meta name="description" content="冯小刚简介、图片写真、获奖情况及电影作品一览"/>\n  \n<meta property="og:title" content="冯小刚" />\n<meta property="og:description" content="冯小刚，中国著名电影导演、编剧。冯小刚作品风格以北方京味儿喜剧著称，擅长商业片。是中国大陆最具有票房号召力的导演之一。\n\u3000\u3000冯小刚自幼喜爱美术、文学，高中毕业后进入北京军区文工团，担任舞美设计，先后在《大林莽》、《凯旋在子夜》、《便衣警察》、《好男好女》等几部当时很有影响的电视剧中任美术设计。1985年，他调入北京电视艺术中心成为

In [3]:
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="ua-windows ua-webkit" lang="zh-CN">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="webkit" name="renderer"/>
  <meta content="always" name="referrer"/>
  <meta content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" name="google-site-verification">
   <title>
    冯小刚 Xiaogang Feng
   </title>
   <meta content="cZdR4xxR7RxmM4zE" name="baidu-site-verification">
    <meta content="no-cache" http-equiv="Pragma"/>
    <meta content="Sun, 6 Mar 2005 01:00:00 GMT" http-equiv="Expires"/>
    <meta content="冯小刚 Xiaogang Feng,简介,个人资料,图片,电影作品,获奖情况,合作影人" name="keywords">
     <meta content="冯小刚简介、图片写真、获奖情况及电影作品一览" name="description">
      <meta content="冯小刚" property="og:title">
       <meta content="冯小刚，中国著名电影导演、编剧。冯小刚作品风格以北方京味儿喜剧著称，擅长商业片。是中国大陆最具有票房号召力的导演之一。
　　冯小刚自幼喜爱美术、文学，高中毕业后进入北京军区文工团，担任舞美设计，先后在《大林莽》、《凯旋在子夜》、《便衣警察》、《好男好女》等几部当时很有影响的电视剧中任美术设计。1985年，他调入北京电视艺术中心成为美工师。《遭遇激情》是他与郑晓龙联合编导的第一部作品，后被夏刚拍成电影，影片获中国电影金鸡奖最佳编剧等四项提名，他与王

In [22]:
content = soup.find("div", id="content")
print(content.prettify())

<div id="content">
 <h1>
  冯小刚 Xiaogang Feng
 </h1>
 <div class="grid-16-8 clearfix">
  <div class="article">
   <div class="item" id="headline">
    <div class="pic">
     <div class="nbg" title="冯小刚">
      <img alt="冯小刚" src="https://img9.doubanio.com/view/celebrity/raw/public/p45667.jpg" title="冯小刚"/>
     </div>
    </div>
    <div class="info">
     <ul class="">
      <li>
       <span>
        性别
       </span>
       : 
        男
      </li>
      <li>
       <span>
        星座
       </span>
       : 
        双鱼座
      </li>
      <li>
       <span>
        出生日期
       </span>
       : 
        1958年03月18日
      </li>
      <li>
       <span>
        出生地
       </span>
       : 
        中国,北京,大兴
      </li>
      <li>
       <span>
        职业
       </span>
       : 
        导演 / 制片人 / 演员 / 编剧 / 配音
      </li>
      <li>
       <span>
        家庭成员
       </span>
       : 
        徐帆(妻) / 冯孔修(父)
      </li>
      <li>
       <span>
        imdb编号
       </span>
       :
       

In [63]:
# extract number of fans
fans = content.find("div", id="fans").find("h2").contents[0]
fans

'\n        冯小刚的影迷（17105）\n            \xa0·\xa0·\xa0·\xa0·\xa0·\xa0·\n            '

In [66]:
start = fans.index("（")
end = fans.index("）")
fans[start+1:end]

'17105'
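
The two `index` calls raise `ValueError` if the heading has no full-width parentheses. As an alternative sketch, a regular expression can extract the number and return `None` when it is missing. `parse_fan_count` is a hypothetical helper name, and the sample string is copied from the output above.

```python
import re

fans_heading = "\n        冯小刚的影迷（17105）\n            \xa0·\xa0·\xa0·\xa0·\xa0·\xa0·\n            "


def parse_fan_count(text):
    """Return the integer inside full-width parentheses, or None if not found."""
    match = re.search(r"（(\d+)）", text)
    return int(match.group(1)) if match else None


print(parse_fan_count(fans_heading))
```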

In [67]:
# extract name
content.find("h1").string

'冯小刚 Xiaogang Feng'

In [68]:
# extract info
info = content.find("div", class_="info")
info

<div class="info">
<ul class="">
<li>
<span>性别</span>: 
        男
        </li>
<li>
<span>星座</span>: 
        双鱼座
        </li>
<li>
<span>出生日期</span>: 
        1958年03月18日
        </li>
<li>
<span>出生地</span>: 
        中国,北京,大兴
        </li>
<li>
<span>职业</span>: 
        导演 / 制片人 / 演员 / 编剧 / 配音
        </li>
<li>
<span>家庭成员</span>: 
        徐帆(妻) / 冯孔修(父)
        </li>
<li>
<span>imdb编号</span>: 
        <a href="https://www.imdb.com/name/nm0271815" target="_blank">nm0271815</a>
</li>
</ul>
</div>

In [118]:
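item = info.find("li")

# Walk the <li> siblings, printing each label and the text node that follows it.
# Note: the imdb编号 value sits inside an <a> tag, so the text node after its
# label is empty once stripped.
while item is not None:
    print(item.span.string)
    print(item.span.next_sibling.strip(":\n "))
    item = item.find_next_sibling()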

性别
男
星座
双鱼座
出生日期
1958年03月18日
出生地
中国,北京,大兴
职业
导演 / 制片人 / 演员 / 编剧 / 配音
家庭成员
徐帆(妻) / 冯孔修(父)
imdb编号
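
The loop above prints an empty value for `imdb编号` because that value is wrapped in an `<a>` tag rather than stored in the text node after the label. One possible fix, sketched below, takes the full text of each `<li>` and splits it at the first colon. `parse_info` is a hypothetical helper, and the sample is a shortened copy of the `info` block shown earlier.

```python
from bs4 import BeautifulSoup

# Shortened copy of the info block shown earlier: one plain value, one link.
sample_html = (
    '<div class="info"><ul class="">'
    "<li>\n<span>性别</span>: \n        男\n        </li>\n"
    "<li>\n<span>imdb编号</span>: \n        "
    '<a href="https://www.imdb.com/name/nm0271815" target="_blank">nm0271815</a>\n</li>'
    "</ul></div>"
)


def parse_info(info):
    """Map each label to its value, including values nested in child tags."""
    result = {}
    for li in info.find_all("li"):
        # get_text() flattens nested tags, so link text is included.
        label, _, value = li.get_text().partition(":")
        result[label.strip()] = value.strip()
    return result


sample_info = BeautifulSoup(sample_html, "html.parser").find("div", class_="info")
fields = parse_info(sample_info)
print(fields)
```

Returning a dictionary instead of printing also makes the fields easier to store alongside the name and fan count.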

