## Scraping news from Sina

+ Use requests_html (with JavaScript rendering) and requests to fetch content
+ Use bs4 to parse the HTML

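### Rough approach

1. Crawl the rolling news page.
2. Check whether each article was published within the last two weeks; if so, scrape it.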
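
The two-week check in step 2 can be sketched as a small helper. This is only an illustration: `is_recent` and its `max_age` parameter are hypothetical names, not part of the notebook's code.

```python
import datetime

def is_recent(publish_date, now=None, max_age=datetime.timedelta(weeks=2)):
    """Return True if publish_date falls within max_age before now."""
    if now is None:
        now = datetime.datetime.now()
    return datetime.timedelta(0) <= now - publish_date <= max_age

now = datetime.datetime(2020, 4, 10, 12, 0)
print(is_recent(datetime.datetime(2020, 4, 3, 13, 48), now))   # True
print(is_recent(datetime.datetime(2020, 3, 20, 9, 0), now))    # False
```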

Setup

In [1]:
import bs4
import requests_html
import asyncio
import requests



SINA_URL = 'https://news.sina.com.cn/roll/'

# browser.close()

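Find the links, and while we are at it write a function that fetches the latest news links. The `[6:]` slice skips the first six links on the page, which are not news articles.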

In [2]:


async def __get_latest_news_urls():
    s = requests_html.AsyncHTMLSession()
    r = await s.get(SINA_URL)
    await r.html.arender()
    bs = bs4.BeautifulSoup(r.html.html, 'html.parser')
    await s.close()
    return bs.find_all('a')[6:]

links = await(__get_latest_news_urls())
print(links)


[<a href="https://finance.sina.com.cn/hy/hyjz/2020-04-10/doc-iirczymi5512253.shtml" target="_blank">4月10日15:00天达共和公益讲堂：离婚与继承案例分析</a>, <a href="https://news.sina.com.cn/w/2020-04-10/doc-iircuyvh6981593.shtml" target="_blank">美国如何逐步解锁抗疫的“正确姿态”</a>, <a href="https://finance.sina.com.cn/roll/2020-04-10/doc-iirczymi5511874.shtml" target="_blank">商务部：缩小加工贸易禁止类商品种类 支持企业开展加工贸易</a>, <a href="https://finance.sina.com.cn/stock/marketresearch/2020-04-10/doc-iirczymi5510639.shtml" target="_blank">华创证券:都市圈战略明确 兼顾稳经济和稳房地产重任</a>, <a href="https://finance.sina.com.cn/china/gncj/2020-04-10/doc-iircuyvh6979969.shtml" target="_blank">商务部：暂免今年4月至年底的加工贸易内销缓税利息</a>, <a href="https://finance.sina.com.cn/china/gncj/2020-04-10/doc-iirczymi5511299.shtml" target="_blank">商务部：网上广交会将为每一家参展企业提供独立直播间</a>, <a href="https://finance.sina.com.cn/stock/jsy/2020-04-10/doc-iircuyvh6979538.shtml" target="_blank">RCS概念大幅分化 号百控股等多股跌停</a>, <a href="https://finance.sina.com.cn/money/future/fmnews/2020-04-10/doc-iirczymi5509976.s
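
The fixed offset `[6:]` depends on the page layout. As a possible alternative, the links could be filtered by URL shape, since every article link in the output above ends in a dated path followed by a `doc-` identifier and `.shtml`. The helper name `is_article_url` and the regular expression are assumptions inferred from those sample URLs, not a documented Sina format.

```python
import re

# Assumed pattern: .../YYYY-MM-DD/doc-<id>.shtml, inferred from the sample links
ARTICLE_URL = re.compile(r'/\d{4}-\d{2}-\d{2}/doc-[a-z0-9]+\.shtml$')

def is_article_url(href):
    """Return True if href looks like a Sina article page."""
    return bool(href) and ARTICLE_URL.search(href) is not None

hrefs = [
    'https://news.sina.com.cn/roll/',
    'https://news.sina.com.cn/w/2020-04-10/doc-iircuyvh6981593.shtml',
    'https://finance.sina.com.cn/roll/2020-04-10/doc-iirczymi5511874.shtml',
]
print([h for h in hrefs if is_article_url(h)])
```

With BeautifulSoup, the slice could then be replaced by something like `[a for a in bs.find_all('a') if is_article_url(a.get('href'))]`.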

Think about which fields to extract:

1. Title
2. Category
3. Time
4. Keywords
5. Editor in charge
6. Source
7. Related topics
8. Body text

Extracting this information is fairly tedious: the page format is not fixed, and not every article contains all of the fields above.
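
Because any field may be missing, one way to model an article is a record whose fields all default to `None`. The sketch below is only an illustration: `NewsRecord` is a hypothetical type, and its field names mirror the attribute names used by the `SinaNews` class later in this notebook.

```python
from dataclasses import dataclass, asdict
from typing import Optional
import datetime

@dataclass
class NewsRecord:
    # Every field defaults to None because any of them may be absent on a page
    main_title: Optional[str] = None
    category: Optional[str] = None
    publish_date: Optional[datetime.datetime] = None
    key_words: Optional[str] = None
    article_editor: Optional[str] = None
    source: Optional[str] = None
    relative_topics: Optional[str] = None
    content: Optional[str] = None

record = NewsRecord(main_title='example title', source='example source')
print(asdict(record))
```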

Here we pick an arbitrary article that has all of them as an example.

In [3]:
sample_url = "https://finance.sina.com.cn/roll/2020-04-03/doc-iimxxsth3469383.shtml"
r = requests.get(sample_url)
bs = bs4.BeautifulSoup(r.text.encode(r.encoding).decode("utf8","ignore"))

In [19]:
import re
import datetime
import requests
from requests.adapters import HTTPAdapter

class SinaNews:
    def __init__(self, url: str, s: requests.Session):
        self.url = url
        r = s.get(url)
#         print(r)
        self.bs = bs4.BeautifulSoup(r.text.encode(r.encoding).decode("utf8","ignore"))
    async def go(self):
#         print(self.bs.prettify())
        
        self.main_title = self.bs.find('title').text
        date_source = self.bs.find('div', 'date-source')
        if date_source is None:
            date_source = self.bs.find('div', 'artInfo')
        date_text = date_source.find('span', 'date')
        if date_text is None:
            date_text = date_source.find('span', id='pub_date')
#         print(date_source, date_text)
#         self.html = self.bs.prettify()

        if date_text is not None:
            date_text = date_text.text.strip()
#             print(date_text)
            date_text = date_text.replace(' ', '')
            
            if '年' in date_text:
                self.publish_date = datetime.datetime.strptime(date_text, '%Y年%m月%d日%H:%M')
            else:
                self.publish_date = datetime.datetime.strptime(date_text, '%Y-%m-%d%H:%M:%S')
        self.source = date_source.find_all()[1].text
        content_list = self.bs.find_all('p')
        self.content = "".join([ tag.text for tag in content_list if not 'class' in tag.attrs ])
        key_words_list = self.bs.find('div', 'article-bottom clearfix')
        if key_words_list is not None:
            key_words_list = key_words_list.find_all('a')
            self.key_words = " ".join([ tag.text for tag in key_words_list if not 'class' in tag.attrs])
        
        editor_source = self.bs.find(string=re.compile('责任编辑'))
        if editor_source is not None:
            editor_source = editor_source.split('责任编辑：')
            self.article_editor = editor_source[-1].strip()
        
        self.category = self.bs.find('div', 'channel-path')
        if self.category is not None:
            self.category = self.category.find('a')
        if self.category is not None:
            self.category = self.category.text
        
        relative_topics_tags = self.bs.find('div', attrs={
            'data-sudaclick': 'content_relativetopics_p'
        })
        if relative_topics_tags is not None:
            relative_topics_tags = relative_topics_tags.find_all('a')
#             print(relative_topics_tags)
            self.relative_topics = " ".join(tag.text for tag in relative_topics_tags)
        self.data = dict()
        for k in self.__dict__:
            if k != 'bs' and k != 'url' and k != 'data':
                self.data[k] = self.__dict__[k]
#                 if k != 'content':
#                     print(k, self.__dict__[k])
        
#         pass
with requests.Session() as session:
    session.mount('http://', HTTPAdapter(max_retries=5))
    session.mount('https://', HTTPAdapter(max_retries=5))
    r = SinaNews(sample_url, session)
    await r.go()
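
The date handling inside `go()` deals with two formats. Pulled out on its own, the same logic might look like the sketch below. `parse_publish_date` and `DATE_FORMATS` are hypothetical names; unlike the class, this version returns `None` instead of raising when neither format matches.

```python
import datetime

# The two formats handled by SinaNews.go(), after spaces are removed
DATE_FORMATS = ('%Y年%m月%d日%H:%M', '%Y-%m-%d%H:%M:%S')

def parse_publish_date(text):
    """Try each known format in turn; return None if none matches."""
    compact = text.strip().replace(' ', '')
    for fmt in DATE_FORMATS:
        try:
            return datetime.datetime.strptime(compact, fmt)
        except ValueError:
            continue
    return None

print(parse_publish_date('2020年04月03日 13:48'))   # 2020-04-03 13:48:00
print(parse_publish_date('2020-04-03 13:48:00'))  # 2020-04-03 13:48:00
print(parse_publish_date('not a date'))           # None
```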

Write to a CSV file


In [11]:
import csv
import pandas as pd
import numpy as np

sample_csv_file = "./sample.csv"

def dump(sampleNews: SinaNews, sinaNewsList: list):
    d = dict()
    attrs = sampleNews.data.keys()
    for k in attrs:
        d[k] = []
    for n in sinaNewsList:
        for k in attrs:
            if k in n.data:
                d[k].append(n.data[k])
            else:
                d[k].append(None)
    return pd.DataFrame(d, columns=attrs)


df = dump(r, [r])
print('done')
df.to_csv('sample.csv')


done
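
`dump` pads attributes that an article lacks with `None` so that every column has the same length. The same idea can be sketched with only the standard library, assuming each article is available as a plain dict (for example `n.data`). `write_rows` is a hypothetical helper, not part of the notebook.

```python
import csv
import io

def write_rows(rows, fieldnames, out):
    """Write dicts to CSV; keys missing from a row become empty cells."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval='', extrasaction='ignore')
    writer.writeheader()
    writer.writerows(rows)

rows = [
    {'main_title': 'A', 'source': 'X', 'relative_topics': 'T'},
    {'main_title': 'B', 'source': 'Y'},  # no relative_topics on this page
]
buf = io.StringIO()
write_rows(rows, ['main_title', 'source', 'relative_topics'], buf)
print(buf.getvalue())
```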


Try reading the contents back

In [12]:
df2 = pd.read_csv(sample_csv_file, index_col=0)
df2

Unnamed: 0,main_title,publish_date,source,content,key_words,article_editor,category,relative_topics
0,做空机构曾围绕瑞幸掐架：力挺瑞幸的香椽终于承认浑水是对的|做空机构_新浪财经_新浪网,2020-04-03 13:48:00,澎湃新闻,安装新浪财经客户端第一时间接收最全面的市场资讯→【下载地址】 瑞幸咖啡（Nasdaq：LK...,做空机构 瑞幸咖啡,孟然,美股,瑞幸咖啡伪造交易股价暴跌专题


Write a test function that scrapes the latest few news articles

In [14]:
url = "https://news.sina.com.cn/c/2020-04-09/doc-iircuyvh6810407.shtml"
with requests.Session() as s:
    n = SinaNews(url, s)
    await n.go()
    print(n.data)

<Response [200]>
{'main_title': '北京玉渊潭公园门票倒卖情况突出 将实名购票|玉渊潭公园|景山公园|北京_新浪新闻', 'publish_date': datetime.datetime(2020, 4, 9, 16, 39), 'source': '新京报', 'content': '\u3000\u3000原标题：玉渊潭公园门票倒卖情况突出，将实名购票\u3000\u3000景山公园也将实名购票。\u3000\u3000新京报快讯 4月9日，北京召开新冠疫情防控第76场发布会。\u3000\u3000北京市公园管理中心副主任、新闻发言人张亚红表示，市属11家公园在清明假期接待游客数量比去年同期减少70%，游客在游园时也配合戴口罩、测温。这是因为市属公园提前采取网上预约措施。\u3000\u3000倒卖玉渊潭公园门票情况比较突出，已向公安机关报警。相关人员禁止进入所有市属公园。\u3000\u30004月11日，玉渊潭将实行实名购票。景山公园也将实名购票。\u3000\u3000新京报记者 李玉坤更多猛料！欢迎扫描左方二维码关注新浪新闻官方微信（xinlang-xinwen）违法和不良信息举报电话：4000520066\n                    举报邮箱：jubao@vip.sina.comCopyright © 1996-2020 SINA CorporationAll Rights Reserved  新浪公司 版权所有 ', 'key_words': '玉渊潭公园 景山公园 北京', 'article_editor': '郑亚鹏', 'category': ' 社会万象', 'relative_topics': '聚焦新型冠状病毒肺炎疫情'}


In [15]:
import datetime

async def __test_latest_news(limit_pages = 1000):
    def get_page_url(pid):
        pid = int(pid)
        return "https://news.sina.com.cn/roll/#pageid=153&lid=2509&k=&num=50&page={}".format(pid)
    pid = 0
    
    
    async def next_page(pid):
        pid += 1
        s = requests_html.AsyncHTMLSession()
        r = await s.get(get_page_url(pid))
        await r.html.arender()
        bs = bs4.BeautifulSoup(r.html.html, 'html.parser')
        links = bs.find_all('a')[6:]
        await s.close()
        return pid, links
    
    
    cnt = 0
    res = []
    tasks = []
    # SinaNews takes a requests.Session as its second argument
    with requests.Session() as session:
        while True:
            pid, links = await next_page(pid)
            if not links:
                # If a page yields no links, stop instead of looping forever
                break
            for link in links:
                print(link.text, link['href'])
                n = SinaNews(link['href'], session)
                # go() returns None, so keep the SinaNews object itself
                res.append(n)
                tasks.append(asyncio.create_task(n.go()))
                cnt += 1
                if cnt % 50 == 0:
                    break
                if cnt == limit_pages:
                    break
            if cnt == limit_pages:
                break
            
            
    for task in tasks:
        await task
            
    return res



In [None]:
res = await __test_latest_news(20)

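Testing is basically done, but the rolling page only goes back 50 pages at most. After looking at how the page's "past days" review mode works, we switch to a different approach and build the URL for a specific date.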

In [16]:
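def getNewsFromDate(d: datetime.date):
    def date_to_datetime(d):
        return datetime.datetime(d.year, d.month, d.day)

    def timestamp(d):
        # Seconds since 1970-01-01, treating midnight of the date as UTC
        return (date_to_datetime(d) - datetime.datetime(1970, 1, 1)) / datetime.timedelta(seconds=1)
    
    dt = date_to_datetime(d)
    
    # d is a date, so adding hours=4 leaves it unchanged: stime is midnight of d
    stime = d + datetime.timedelta(hours=4)
    # etime is midnight of the previous day
    etime = d - datetime.timedelta(days=1)
    ctime = dt
    
    # The triple-quoted literal keeps a newline and four spaces before "ctime",
    # which is why the printed URLs below span two lines
    url = '''https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime={}&stime={}&
    ctime={}&date={}&k=&num=50'''.format(timestamp(etime), timestamp(stime), timestamp(ctime), d.strftime('%Y-%m-%d'))
    return url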
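
The time window that `getNewsFromDate` encodes can be checked in isolation. The sketch below recomputes `etime` and `stime` with an explicit UTC timezone; `day_window` is a hypothetical helper. For 2020-04-09 it gives the same values that appear in the URLs printed by the runs further down.

```python
import datetime

def day_window(d):
    """Return (etime, stime): UTC midnight of the previous day and of d, as epoch seconds."""
    midnight = datetime.datetime(d.year, d.month, d.day, tzinfo=datetime.timezone.utc)
    stime = midnight.timestamp()
    etime = (midnight - datetime.timedelta(days=1)).timestamp()
    return etime, stime

print(day_window(datetime.date(2020, 4, 9)))  # (1586304000.0, 1586390400.0)
```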


Test the last three days, 10 articles per day

In [17]:
from itertools import chain
import traceback

async def getNewsFromDaysDuration(daysl: int, daysr: int, limit_num_per_day=0, output_num=5):
    async def getNewsFromRollPage(url, limit_num=0, output_num=5):
        
        def get_page_url(pid):
            pid = int(pid)
            return url + "&page={}".format(pid)
        
        pid = 0
        async def next_page(pid):
            pid += 1
            print(pid, get_page_url(pid))
            s = requests_html.AsyncHTMLSession()
            r = await s.get(get_page_url(pid))

            await r.html.arender()
            bs = bs4.BeautifulSoup(r.html.html, 'html.parser')
            links = bs.find_all('a')[9:]
#             print(links)
            await s.close()
            return pid, links


        cnt = 0
        today = datetime.date.today()
        tasks = []
        res = []
        with requests.Session() as s:
            s.mount('http://', HTTPAdapter(max_retries=5))
            s.mount('https://', HTTPAdapter(max_retries=5))
            while True:
                pid, links = await next_page(pid)
                if not links:
                    # If a page yields no links, stop instead of looping forever
                    break
                for link in links:
                    if cnt % output_num == 0:
                        print(f"#{cnt}", link.text, link['href'])

                    try:
                        n = SinaNews(link['href'], s)
                        res.append(n)
                        tasks.append(asyncio.create_task(n.go()))
                    except Exception as err:
                        traceback.print_tb(err.__traceback__)
                        print("something wrong with news url: {}".format(link))

                    cnt += 1

                    if cnt == limit_num:
                        break
                    if cnt % 50 == 0:
                        break
                if cnt == limit_num:
                    break
        
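        for task in tasks:
            try:
                await task
            except Exception as err:
                # One malformed article should not discard the whole day's results
                traceback.print_tb(err.__traceback__)
        
        # Keep only the articles whose go() finished and populated .data
        return [n for n in res if hasattr(n, 'data')]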
    
    
    today = datetime.date.today()
    res = []
    for i in range(daysl, daysr+1):
        print("day {} from today".format(i))
        d = today - datetime.timedelta(days=i)
        url = getNewsFromDate(d)
        try:
            res += await getNewsFromRollPage(url, limit_num_per_day, output_num)
        except Exception as e:
            traceback.print_tb(e.__traceback__)
            print("something wrong when finding news from day {}".format(i))
    
    return res

In [20]:
news = await getNewsFromDaysDuration(1, 3, 10)

day 1 from today
1 https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime=1586304000.0&stime=1586390400.0&
    ctime=1586390400.0&date=2020-04-09&k=&num=50&page=1
#0 美移民律师：很多中国客户因疫情失业 也回不了中国 https://news.sina.com.cn/w/2020-04-10/doc-iirczymi5399549.shtml
#5 中国要“接管”全球？美前外交官：担忧被极大夸大 https://finance.sina.com.cn/world/gjcj/2020-04-10/doc-iirczymi5398964.shtml
day 2 from today
1 https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime=1586217600.0&stime=1586304000.0&
    ctime=1586304000.0&date=2020-04-08&k=&num=50&page=1
#0 充电桩：新基建 迈向新能源汽车时代 https://finance.sina.com.cn/stock/stockzmt/2020-04-09/doc-iirczymi5206529.shtml
#5 为遏制疫情蔓延 阿曼关闭首都所有陆路出入口 https://news.sina.com.cn/w/2020-04-09/doc-iirczymi5203902.shtml
day 3 from today
1 https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime=1586131200.0&stime=1586217600.0&
    ctime=1586217600.0&date=2020-04-07&k=&num=50&page=1
#0 任泽平分析全球疫情：欧美陆续现拐点 但有长尾特征 https://finance.sina.com.cn/review/mspl/2020-04-08/doc-iimxxsth4181599.shtml
#5 长城证券：市场驱动因素有

In [13]:
print(news)

[<__main__.SinaNews object at 0x7f321e0d0a00>, <__main__.SinaNews object at 0x7f32347f96d0>, <__main__.SinaNews object at 0x7f32160afa60>, <__main__.SinaNews object at 0x7f32160b5a60>, <__main__.SinaNews object at 0x7f321e0d0190>, <__main__.SinaNews object at 0x7f321cacce20>, <__main__.SinaNews object at 0x7f321c8e5100>, <__main__.SinaNews object at 0x7f321c8dad60>, <__main__.SinaNews object at 0x7f321609b370>, <__main__.SinaNews object at 0x7f321610b880>, <__main__.SinaNews object at 0x7f321c6a1520>, <__main__.SinaNews object at 0x7f321c6a17c0>, <__main__.SinaNews object at 0x7f321cd51a90>, <__main__.SinaNews object at 0x7f3216253700>, <__main__.SinaNews object at 0x7f321cc1e130>, <__main__.SinaNews object at 0x7f321e6a3f40>, <__main__.SinaNews object at 0x7f321e5b1df0>, <__main__.SinaNews object at 0x7f321e6a3eb0>, <__main__.SinaNews object at 0x7f321e2c53a0>, <__main__.SinaNews object at 0x7f321e192460>, <__main__.SinaNews object at 0x7f321c6c5e50>, <__main__.SinaNews object at 0x7f

In [14]:
df = dump(r, news)
df.to_csv('sample2.csv')
df.head()


Unnamed: 0,main_title,publish_date,source,content,key_words,article_editor,category,relative_topics
0,美移民律师：很多中国客户因疫情失业 也回不了中国|疫情|新冠肺炎|美国_新浪新闻,2020-04-10 00:00:00,环球网,原标题：美国移民律师：很多中国客户因为疫情失业，也回不了中国 美国CNN在一篇最新报道...,疫情 新冠肺炎 美国,吴金明,国内新闻,聚焦新型冠状病毒肺炎疫情 全球多国爆发新冠肺炎疫情 美国成疫情“震中”
1,林义相：“启动股市，接力房市”的政策取向不应该变|疫情_新浪财经_新浪网,2020-04-09 23:59:00,新浪财经-自媒体综合,来源：乐趣投资 关注乐趣投资的人，都能找到人生乐趣 “启动股市，接力房市”，是中国经...,疫情 股票市场,常福强,基金,
2,立陶宛新增43例新冠肺炎确诊病例 累计955例|立陶宛|新冠肺炎_新浪新闻,2020-04-09 23:59:00,央视,原标题：立陶宛新增43例新冠肺炎确诊病例 累计955例 当地时间4月9日，波罗的海新闻...,立陶宛 新冠肺炎,吴金明,国际新闻,聚焦新型冠状病毒肺炎疫情 全球多国爆发新冠肺炎疫情
3,警方通报鲍某某性侵事件：决定再次立案 目前仍在侦查_新浪财经_新浪网,2020-04-09 23:58:00,新浪财经,如何在结构性行情中开展投资布局？新浪财经《基金直播间》，邀请基金经理在线路演解读市场。 新...,,王帅,证券,
4,武汉“解封”首日到底多少人出城？流向何处？|航班|旅客列车|新冠肺炎_新浪新闻,2020-04-09 23:57:00,红星新闻,原标题：武汉“解封”首日到底多少人出城？流向何处？ 4月8日零时，全国新冠肺炎疫情防控...,航班 旅客列车 新冠肺炎 武汉 湖北,吴金明,国内新闻,聚焦新型冠状病毒肺炎疫情


Check for missing values

In [15]:
for attr in df:
    print(attr, df[attr].isnull().sum())

main_title 0
publish_date 0
source 0
content 0
key_words 0
article_editor 1
category 0
relative_topics 14
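
The same count can be taken before a DataFrame is built, directly on the per-article dicts. This is a minimal sketch with made-up records; `count_missing` is a hypothetical helper that could be applied to something like `[n.data for n in news]`.

```python
def count_missing(records, fields):
    """Count, per field, how many records lack it or hold None."""
    return {f: sum(1 for r in records if r.get(f) is None) for f in fields}

records = [
    {'main_title': 'A', 'article_editor': 'E', 'relative_topics': None},
    {'main_title': 'B', 'article_editor': 'E'},
    {'main_title': 'C', 'relative_topics': 'T'},
]
print(count_missing(records, ['main_title', 'article_editor', 'relative_topics']))
# {'main_title': 0, 'article_editor': 1, 'relative_topics': 2}
```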


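Test collecting all of one day's news. Sina's roll page keeps only 50 pages per day at 50 articles per page, so at most 2,500 articles per day can be retrieved.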
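
The arithmetic behind that cap, and the number of roll pages a given per-day limit would touch, can be sketched as follows. `pages_needed` and the two constants are hypothetical names; the values 50 and 50 come from the sentence above.

```python
import math

PAGE_SIZE = 50   # articles per page, as stated above
MAX_PAGES = 50   # pages kept per day, as stated above

def pages_needed(limit_num):
    """Number of roll pages to request for limit_num articles, capped at MAX_PAGES."""
    return min(math.ceil(limit_num / PAGE_SIZE), MAX_PAGES)

print(PAGE_SIZE * MAX_PAGES)   # 2500
print(pages_needed(500))       # 10
print(pages_needed(10000))     # 50
```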

In [None]:
news = await getNewsFromDaysDuration(1, 14, 500, 10)

day 1 from today
1 https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime=1586304000.0&stime=1586390400.0&
    ctime=1586390400.0&date=2020-04-09&k=&num=50&page=1
#0 美移民律师：很多中国客户因疫情失业 也回不了中国 https://news.sina.com.cn/w/2020-04-10/doc-iirczymi5399549.shtml
#10 爱奇艺股价攀升；瑞穗认为其收入确认方法与行业一致 https://finance.sina.com.cn/stock/usstock/c/2020-04-09/doc-iircuyvh6868545.shtml
#20 “华为汽车”被曝有望今年落地 华为回应：进行中 https://finance.sina.com.cn/chanjing/gsnews/2020-04-10/doc-iircuyvh6868873.shtml
#30 美媒：这场疫情物资争夺战，穷国被富国挤得靠边站 https://news.sina.com.cn/w/2020-04-09/doc-iirczymi5397094.shtml
#40 股票基金再现一日售罄 认购金额超上限80亿 https://finance.sina.com.cn/roll/2020-04-09/doc-iircuyvh6866605.shtml
2 https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime=1586304000.0&stime=1586390400.0&
    ctime=1586390400.0&date=2020-04-09&k=&num=50&page=2
#50 南非团结基金将为困难家庭提供粮食救济 https://finance.sina.com.cn/roll/2020-04-09/doc-iircuyvh6866616.shtml
#60 武汉“封城”之前，我们究竟做了什么？ https://news.sina.com.cn/c/2020-04-09/doc-iirczymi5395971.shtml
#70 拉卡拉2

#130 欧元区共同救助计划难产 英媒:疫情暴露欧盟内部严重分歧 https://finance.sina.com.cn/world/gjcj/2020-04-08/doc-iircuyvh6667759.shtml
#140 武磊：已康复但未核酸复测，隔离期第一次下厨做饭 https://news.sina.com.cn/c/2020-04-08/doc-iirczymi5195653.shtml
4 https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime=1586217600.0&stime=1586304000.0&
    ctime=1586304000.0&date=2020-04-08&k=&num=50&page=4
#150 券商代资管漩涡再起：招商资管提起4亿仲裁 https://finance.sina.com.cn/stock/quanshang/2020-04-08/doc-iircuyvh6666461.shtml
#160 企业抗疫复产典型事迹|中铁股份：战疫复工中彰显担当 https://finance.sina.com.cn/chanjing/gsnews/2020-04-08/doc-iircuyvh6665137.shtml
#170 早盘：美股涨幅收窄 道指上涨不足百点 https://finance.sina.com.cn/stock/usstock/c/2020-04-08/doc-iircuyvh6664528.shtml
#180 企业抗疫复产典型事迹|北汽集团：双线战“疫” https://finance.sina.com.cn/chanjing/gsnews/2020-04-08/doc-iircuyvh6664003.shtml
#190 围猎瑞幸、爱奇艺们的海外机构是群什么人？ https://finance.sina.com.cn/stock/stockptd/2020-04-08/doc-iircuyvh6663554.shtml
5 https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime=1586217600.0&stime=1586304000.0&
    ctime=158630400

#250 马耳他新增新冠肺炎确诊病例52例 累计确诊293例 https://news.sina.com.cn/w/2020-04-07/doc-iimxyqwa5602228.shtml
#260 华新水泥：预计一季度业绩减少6.1亿元-7.1亿元 https://finance.sina.com.cn/stock/s/2020-04-07/doc-iimxyqwa5601356.shtml
#270 土总统指示严密跟踪地区和世界新冠肺炎疫情形势 https://finance.sina.com.cn/roll/2020-04-08/doc-iimxyqwa5600689.shtml
#280 4月8日19:30武汉重启特别直播:阎志讲述大武汉的光荣与梦想 https://finance.sina.com.cn/hy/hyjz/2020-04-07/doc-iimxyqwa5600312.shtml
#290 37家标杆房企下调今年目标预期：平均增速仅为14% https://finance.sina.com.cn/stock/hyyj/2020-04-07/doc-iimxyqwa5602992.shtml
7 https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime=1586131200.0&stime=1586217600.0&
    ctime=1586217600.0&date=2020-04-07&k=&num=50&page=7
#300 研究显示欧美仍为彼此最重要的经贸伙伴 https://finance.sina.com.cn/roll/2020-04-08/doc-iimxyqwa5600710.shtml
#310 豫金刚石变脸巨亏后首日跌停 股民:亏损单位有没有搞错 https://finance.sina.com.cn/roll/2020-04-07/doc-iimxyqwa5599287.shtml
#320 吴晓波:瑞幸造假后大量订单涌入 消费者多抱着别浪费优惠券的心理 https://finance.sina.com.cn/chanjing/gsnews/2020-04-07/doc-iimxxsth4158358.shtml
#330 新开普：一季度净利润预增100%-13

In [None]:
len(news)

In [None]:
df = dump(r, news)
df.head()
# df.describe()
df.to_csv('0410.csv')

In [None]:
for i in range(2, 14):
    news = await getNewsFromDaysDuration(i, i, 2500, 20)
    df = dump(r, news)
    df.to_csv(f'{i}.csv')