## How to fetch JD.com product data with Python: a beginner-friendly introduction to web scraping
https://zhuanlan.zhihu.com/p/56441988

## 1. Getting to know web crawlers

## 2. Training the crawler

## 3. Starting to crawl
This chapter is an introduction to web scraping, so we only need to install a few Python libraries:

- requests | pip install requests
- bs4 | pip install bs4
- lxml | pip install lxml
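
Once the installs finish, a quick import check can confirm that each library is available. The sketch below is an illustration rather than part of the original walkthrough: `check_libraries` is a hypothetical helper name, and it only reports whatever version string each module happens to expose.

```python
import importlib


def check_libraries(names=("requests", "bs4", "lxml")):
    """Return {module name: version string, or None if the import fails}."""
    status = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            status[name] = None  # not installed in this environment
        else:
            # Not every module defines __version__, so fall back to "unknown"
            status[name] = getattr(module, "__version__", "unknown")
    return status


print(check_libraries())
```

A `None` value in the printed dictionary means the corresponding `pip install` still needs to be run.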

### 3.1 The Python requests library

In [1]:
import requests # import the requests library
r = requests.get('https://search.jd.com/Search?keyword=Lenovo&enc=utf-8&wq=Lenovo&pvid=414c4f412d6b40908792068987460b43')

In [2]:
r.status_code # check the response status code; 200 means OK (success)

200

In [3]:
r.headers # view the response headers

{'Date': 'Wed, 12 Feb 2020 05:49:44 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'close', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Server': 'jfe', 'Strict-Transport-Security': 'max-age=7776000'}

In [4]:
r.request.headers # the request headers that were sent

{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
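
The request headers above show that requests identifies itself as `python-requests/...` by default, which is why the full example later in this article sends its own `User-Agent`. As a minimal sketch, a request can be prepared without being sent, to inspect the final URL and headers. The `params` dictionary and the shortened User-Agent string are illustrative assumptions, not values from a real browser session.

```python
import requests

# Illustrative values (assumptions), not taken from a real browser session
params = {"keyword": "Lenovo", "enc": "utf-8"}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Build the request object and prepare it; nothing is sent over the network
prepared = requests.Request(
    "GET", "https://search.jd.com/Search", params=params, headers=headers
).prepare()

print(prepared.url)                    # query string built from params
print(prepared.headers["User-Agent"])  # the header that would be sent
```

Calling `requests.get(url, params=params, headers=headers)` with the same arguments would send this request for real.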

### 3.2 Python scraping tools, part 2: using Beautiful Soup

In [5]:
# This section focuses on find and find_all; the sample document follows

html_doc = """

<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

In [6]:
# Find all title tags
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
title = soup.find_all('title')
print(title)

[<title>The Dormouse's story</title>]


In [7]:
# Find all p tags
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
p = soup.find_all('p')
print(p)

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>, <p class="story">...</p>]


In [8]:
# Find p tags whose CSS class is "title"
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
p = soup.find_all('p', 'title')
print(p)

[<p class="title"><b>The Dormouse's story</b></p>]


In [9]:
# Find all a tags
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
a = soup.find_all('a')
print(a)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
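
`find_all` returns tag objects, so the usual next step is to pull the text and attributes out of each one. The following is a small sketch of that step. It uses a shortened copy of the sample document and the built-in `html.parser` so that it runs on its own; `sample_html` and `links` are names chosen for this illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document, so the snippet is self-contained
sample_html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# get_text() returns the text inside the tag; tag["href"] reads an attribute
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]
print(links)
```

Indexing with `a["href"]` raises `KeyError` when the attribute is missing, while `a.get("href")` returns `None` instead.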


In [10]:
# Find the tag with id="link2"
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
link = soup.find_all(id="link2")
print(link)

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]


In [11]:
# Find every tag that has an id attribute
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
link = soup.find_all(id=True)
print(link)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


In [12]:
# Passing several keyword arguments filters on several attributes of a tag at once:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
link = soup.find_all(href=re.compile("lacie"), id="link2")
print(link)

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]


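#### Searching by CSS class

Searching for tags by CSS class name is very useful, but `class` is a reserved word in Python, so passing class as a keyword argument causes a syntax error. Starting with Beautiful Soup 4.1.2, you can search for tags with a given CSS class through the class_ argument.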

In [13]:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
CLS = soup.find('a', class_ = "sister")
print(CLS)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


In [14]:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
CLS = soup.find_all('a', class_ = "sister")
print(CLS)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


In [15]:
# Search with a regular expression
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
CLS = soup.find(class_ = re.compile("ti"))
print(CLS)

<p class="title"><b>The Dormouse's story</b></p>


In [16]:
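# An exact match on the full class string finds nothing if the class names are in a different order than in the document; the single class "title" below matches as expected: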
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
CLS = soup.find('p', attrs={'class':'title'})
print(CLS)

<p class="title"><b>The Dormouse's story</b></p>
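
The class-order caveat matters later in this article, where the JD code matches the exact string `p-name p-name-type-2`. As a minimal sketch on a one-line hypothetical snippet (`snippet` is not real JD markup), the following compares an exact string match, a reordered string, a single class name, and a CSS selector:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line snippet with two CSS classes on the same tag
snippet = '<div class="p-name p-name-type-2"><em>demo</em></div>'
soup = BeautifulSoup(snippet, "html.parser")

# Exact string, same order as in the document: matches
exact = soup.find("div", class_="p-name p-name-type-2")
# Same classes in a different order: no match
reordered = soup.find("div", class_="p-name-type-2 p-name")
# A single class name matches any tag that carries that class
single = soup.find("div", class_="p-name")
# A CSS selector matches regardless of class order
selected = soup.select_one("div.p-name-type-2.p-name")

print(exact is not None, reordered is None, single is not None, selected is not None)
```

Matching on a single class or using `select_one` is usually the more robust choice when a site adds or reorders class names.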


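### 3.3 The main task
With the groundwork above in place, we can start writing the real code. We want to collect computer data from JD.com, so first open the JD website and search for computers:
https://www.jd.com/

Right-click one of the computers and choose Inspect. On this page, the data for each computer sits inside an `<li data-sku>` tag.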
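
Because `data-sku` contains a hyphen, it cannot be passed to `find_all` as a Python keyword argument, so the attribute filter goes in the `attrs` dictionary. The sketch below assumes a heavily trimmed, hypothetical listing (`listing_html`) rather than the real page source, and collects the SKU of every `li` that carries the attribute.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical listing; the real page has far more markup per item
listing_html = """
<ul>
  <li data-sku="100005171461" class="gl-item">first item</li>
  <li data-sku="100004286189" class="gl-item">second item</li>
  <li class="banner">not a product</li>
</ul>
"""

soup = BeautifulSoup(listing_html, "html.parser")

# True means "the attribute must be present, whatever its value"
items = soup.find_all("li", attrs={"data-sku": True})
skus = [li["data-sku"] for li in items]
print(skus)
```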

In [17]:
import re
from bs4 import BeautifulSoup

jd_Computer_html = """
<li data-sku="100005171461" class="gl-item">
	<div class="gl-i-wrap">
					<div class="p-img">
						<a target="_blank" title="锐龙标压处理器性能更强劲，16G双通道内存响应更快速，超窄边框，“真”全面屏！更多好物" href="//item.jd.com/100005171461.html" onclick="searchlog(1,100005171461,0,2,'','flagsClk=1363153032')">
							<img width="220" height="220" class="" data-img="1" source-data-lazy-img="" data-lazy-img="done" src="//img11.360buyimg.com/n7/jfs/t1/99733/2/8261/174001/5e045083Ec79b2c6e/fac2d957b511c1da.jpg">
</a>						<div data-lease="" data-catid="672" data-venid="1000000157" data-presale="" data-done="1"></div>
					</div>
					<div class="p-price">
<strong class="J_100005171461" data-done="1"><em>￥</em><i>4499.00</i></strong>					</div>
					<div class="p-name p-name-type-2">
						<a target="_blank" title="锐龙标压处理器性能更强劲，16G双通道内存响应更快速，超窄边框，“真”全面屏！更多好物" href="//item.jd.com/100005171461.html" onclick="searchlog(1,100005171461,0,1,'','flagsClk=1363153032')">
							<em><img class="p-tag3" src="//img14.360buyimg.com/uba/jfs/t6919/268/501386350/1257/92e5fb39/5976fcf9Nd915775f.png">联想(<font class="skcolor_ljg">Lenovo</font>)小新Pro13.3英寸全面屏超轻薄笔记本电脑(标压锐龙R5-3550H 16G 512G 2.5K QHD 100%sRGB)银</em>
							<i class="promo-words" id="J_AD_100005171461">锐龙标压处理器性能更强劲，16G双通道内存响应更快速，超窄边框，“真”全面屏！更多好物</i>
						</a>
					</div>
					<div class="p-commit" data-done="1">
						<strong><a id="J_comment_100005171461" target="_blank" href="//item.jd.com/100005171461.html#comment" onclick="searchlog(1,100005171461,0,3,'','flagsClk=1363153032')">16万+</a>条评价</strong>
					</div>
			<div class="p-shop" data-dongdong="" data-selfware="1" data-score="5" data-reputation="93" data-done="1">
<span class="J_im_icon"><a target="_blank" class="curr-shop hd-shopname" onclick="searchlog(1,1000000157,0,58)" href="//mall.jd.com/index-1000000157.html" title="联想电脑京东自营旗舰店">联想电脑京东自营旗舰店</a><b class="im-02" style="background:url(//img14.360buyimg.com/uba/jfs/t26764/156/1205787445/713/9f715eaa/5bc4255bN0776eea6.png) no-repeat;" title="联系客服" onclick="searchlog(1,1000000157,0,61)"></b></span>					</div>	
			
		<div class="p-icons" id="J_pro_100005171461" data-done="1">
			<i class="goods-icons J-picon-tips J-picon-fix" data-idx="1" data-tips="京东自营，品质保障">自营</i>
						<i class="goods-icons4 J-picon-tips J-picon-fix" data-tips="天天低价，正品保证">秒杀</i>
					</div>
					<div class="p-operate">
						<a class="p-o-btn contrast J_contrast" data-sku="100005171461" href="javascript:;" onclick="searchlog(1,100005171461,0,6,'','flagsClk=1363153032')"><i></i>对比</a>
						<a class="p-o-btn focus J_focus" data-sku="100005171461" href="javascript:;" onclick="searchlog(1,100005171461,0,5,'','flagsClk=1363153032')"><i></i>关注</a>
						<a class="p-o-btn addcart" href="//cart.jd.com/gate.action?pid=100005171461&amp;pcount=1&amp;ptype=1" target="_blank" onclick="searchlog(1,100005171461,0,4,'','flagsClk=1363153032')" data-limit="0"><i></i>加入购物车</a>
					</div>
	</div>
</li>
"""

In [18]:
soup = BeautifulSoup(jd_Computer_html, "lxml")
Computer_price = soup.find('div', attrs={'class':'p-price'}).find('i').text
print(f"Computer price: {Computer_price} yuan")
Computer_name = soup.find('div', attrs={'class':'p-name p-name-type-2'}).find('em').text
print(f"Computer name: {Computer_name}")

Computer price: 4499.00 yuan
Computer name: 联想(Lenovo)小新Pro13.3英寸全面屏超轻薄笔记本电脑(标压锐龙R5-3550H 16G 512G 2.5K QHD 100%sRGB)银
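
The price comes back as text. If it will be sorted or summed later, converting it to a number is a natural follow-up. The following is a sketch under stated assumptions: `parse_price` is a hypothetical helper, and the currency sign, thousands separator, and non-numeric placeholder in the examples are assumptions about what a page might contain, not observed JD values.

```python
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Convert a price string such as '4499.00' to Decimal, or return None."""
    # Strip a leading currency sign and thousands separators if present
    cleaned = text.replace("￥", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Non-numeric text (for example a "no price" placeholder)
        return None


print(parse_price("4499.00"))
print(parse_price("￥4,499.00"))
print(parse_price("暂无报价"))
```

`Decimal` is used instead of `float` so that prices keep their exact two decimal places.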


#### 获取全部电脑信息

In [22]:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import csv
import requests
from requests.exceptions import RequestException
from bs4 import BeautifulSoup

In [23]:
# 1. To collect data for every computer, first fetch the page source for the search URL:
def download(url, headers, num_retries=3):
    print("download", url)
    try:
        # timeout keeps the call from hanging forever on a stalled connection
        response = requests.get(url, headers=headers, timeout=10)
    except RequestException as e:
        # Network-level failure (DNS, connection, timeout): there is no response to inspect
        print("request failed:", e)
        return None
    print(response.status_code)
    # Status code 200: return the page source
    if response.status_code == 200:
        return response.content
    # 5xx server errors are often temporary, so retry a limited number of times
    if num_retries > 0 and 500 <= response.status_code < 600:
        return download(url, headers, num_retries - 1)
    # Any other status code: give up
    return None
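
The `download` function above handles retries by hand. A different technique, not the one used in this article, is to let requests retry at the transport level by mounting an `HTTPAdapter` configured with urllib3's `Retry`. The sketch below assumes a hypothetical helper name `make_session` and illustrative retry settings; it builds the session without sending any request.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=3):
    """Return a Session that retries GET requests on common 5xx responses."""
    retry = Retry(
        total=total_retries,
        backoff_factor=0.5,                     # wait longer between successive retries
        status_forcelist=[500, 502, 503, 504],  # status codes that trigger a retry
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    # Use the retrying adapter for both schemes
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = make_session()
# Inspect the retry policy attached to URLs on this host
print(session.get_adapter("https://search.jd.com").max_retries.total)
```

With such a session, `session.get(url, headers=headers, timeout=10)` would replace the recursive call in `download`.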

In [None]:
def find_detail(url):
    headers = {
    'User-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36",
    "referer": "https://search.jd.com"
    }
    r = download(url, headers)
    page = BeautifulSoup(r, "lxml")
    all_items = page.find_all('li', attrs={'class':'gl-item'})

In [20]:
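def find_Computer(url, headers):
    r = download(url, headers=headers)
    # print(r)
    # download() returns nothing usable when the request fails, so stop early
    if not r:
        print("no page source returned")
        return
    # 3. A 200 status code means the request succeeded; parse the returned source with BeautifulSoup into the object page
    page = BeautifulSoup(r, "lxml")
    # print(page.prettify())

    # The page object can now be used to search the returned source:
    all_items = page.find_all('li', attrs={'class':'gl-item'})
    
    # Finally: save the data to CSV
    # Reference: 【Python】csv – 中文資料的讀取和寫入 (reading and writing Chinese data with csv), https://kirin.idv.tw/python-csv-chinese-utf8/
    # In Python 3.6 (and related versions such as 3.5), a CSV written with encoding 'utf-8' reads back correctly with 'utf-8', but Excel shows garbled Chinese characters when it opens the file. Writing with 'utf-8-sig' avoids this. Read the file back with 'utf-8-sig'; plain 'utf-8' also works but leaves a BOM character at the start of the first field.
    # Source: https://blog.csdn.net/qq_36512295/article/details/84889426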
    with open("Computer.csv", 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        fields = ('ID', 'NAME', 'PRICE')
        writer.writerow(fields)
        # 4. Loop over every computer entry that was found.
        # The loop variable is named item rather than all, to avoid shadowing the built-in all()
        for item in all_items:
            Computer_id = item["data-sku"]
            Computer_name = item.find('div', attrs={'class': 'p-name p-name-type-2'}).find('em').text
            _price = item.find('div', attrs={'class': 'p-price'}).find('strong')
            # Prefer the data-price attribute when present, otherwise use the text of the <i> tag
            Computer_price = _price.get('data-price', _price.find('i').text)
            commit_url = "https:" + item.find('div', attrs={'class': 'p-commit'}).find('a').get('href')
            print(f"ID：{Computer_id}")
            print(f"Name：{Computer_name}")
            print(f"href：{commit_url}")
            print(f"Price：{Computer_price}元\n")
            # Write one CSV row per computer
            writer.writerow([Computer_id, Computer_name, str(Computer_price) + "元"])
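
To illustrate the `utf-8-sig` choice in the comments above, the sketch below writes a one-row CSV to a temporary directory and reads it back with both encodings. The file name `Computer_demo.csv`, the shortened product name, and the helper names `write_rows` and `read_rows` are assumptions made for this example.

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    # utf-8-sig puts a byte order mark at the start, which Excel uses to detect UTF-8
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(("ID", "NAME", "PRICE"))
        writer.writerows(rows)


def read_rows(path, encoding):
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.DictReader(f))


path = os.path.join(tempfile.mkdtemp(), "Computer_demo.csv")
write_rows(path, [("100005171461", "联想小新Pro13", "4499.00元")])

with_sig = read_rows(path, "utf-8-sig")  # BOM is stripped while decoding
with_plain = read_rows(path, "utf-8")    # BOM stays attached to the first header

print(list(with_sig[0].keys()))
print(list(with_plain[0].keys()))
```

Reading with plain `utf-8` still returns the data, but the first column name carries the BOM character, so lookups by `"ID"` would fail.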

In [21]:
def main():
    headers = {
    'User-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36",
    "referer": "https://search.jd.com"
    }
    
    URL="https://search.jd.com/Search?keyword=lenovo&enc=utf-8&wq=lenovo&pvid=271f6c114446474f91249bbd788b7090"
    find_Computer(URL, headers=headers)

if __name__ == '__main__':
    main()

download https://search.jd.com/Search?keyword=lenovo&enc=utf-8&wq=lenovo&pvid=271f6c114446474f91249bbd788b7090
200
ID：100005171461
Name：联想(Lenovo)小新Pro13.3英寸全面屏超轻薄笔记本电脑(标压锐龙R5-3550H 16G 512G 2.5K QHD 100%sRGB)银
href：https://item.jd.com/100005171461.html#comment
Price：4499.00元

ID：100004286189
Name：京品电脑联想(Lenovo)小新Air14英寸 AMD锐龙版(全新12nm)轻薄笔记本电脑(R5-3500U 12G 512G PCIE IPS)轻奢灰
href：https://item.jd.com/100004286189.html#comment
Price：4299.00元

ID：100003688077
Name：京品电脑联想(Lenovo)拯救者Y7000P 2019英特尔酷睿i715.6英寸游戏笔记本电脑(i7-9750H 16G 1T SSD GTX1660Ti 144Hz)
href：https://item.jd.com/100003688077.html#comment
Price：9299.00元

ID：60220113212
Name：联想（Lenovo）小新Pro13.3英寸轻薄笔记本电脑锐龙R5-3550H 16G 512G固态 高效办公套装
href：https://item.jd.com/60220113212.html#comment
Price：4766.00元

ID：100010756268
Name：联想(Lenovo)YOGAS740英特尔酷睿i5 14.0英寸超轻薄笔记本电脑移动超能版(i5-1035G1 8G 512G MX250)深空灰
href：https://item.jd.com/100010756268.html#comment
Price：5499.00元

ID：100004498316
Name：联想(Lenovo)小新青春版  英特尔酷睿i7 14英寸 轻薄笔记本电脑(I7-8565U 8G 1T+128G