## 爬取新浪网的新闻

+ 使用 selenium 请求内容
+ 使用 bs4 解析 html

### 大致思路

1. 爬取滚动
1. 判断时间是否在两周内，如果是则爬取

准备

In [3]:
import bs4
import requests_html
import asyncio
import requests



SINA_URL = 'https://news.sina.com.cn/roll/'

# browser.close()

找出链接, 顺便写一个获取最新新闻链接的函数，从第六个链接开始忽略前面的非新闻链接

In [4]:


async def __get_latest_news_urls():
    s = requests_html.AsyncHTMLSession()
    r = await s.get(SINA_URL)
    await r.html.arender()
    bs = bs4.BeautifulSoup(r.html.html, 'html.parser')
    await s.close()
    return bs.find_all('a')[6:]

links = await(__get_latest_news_urls())
print(links)


[<a href="https://finance.sina.com.cn/stock/marketresearch/2020-04-10/doc-iirczymi5564317.shtml" target="_blank">国盛策略：反弹后 全球及A股估值水平如何？</a>, <a href="https://finance.sina.com.cn/jryx/2020-04-10/doc-iircuyvh7033664.shtml" target="_blank">邮储银行嘉兴市分行被罚125万：因贷款发放不审慎等</a>, <a href="https://finance.sina.com.cn/money/future/fmnews/2020-04-10/doc-iircuyvh7033544.shtml" target="_blank">一德期货佘建跃：欧佩克+减产有助于油市长期供需平衡</a>, <a href="https://finance.sina.com.cn/stock/s/2020-04-10/doc-iirczymi5564061.shtml" target="_blank">方正证券：向法院起诉请求判令长乐汇等偿还融资本息</a>, <a href="https://finance.sina.com.cn/stock/s/2020-04-10/doc-iircuyvh7033450.shtml" target="_blank">方正证券：请求法院判令长乐汇等偿还融资本金及利息</a>, <a href="https://finance.sina.com.cn/jryx/2020-04-10/doc-iircuyvh7032940.shtml" target="_blank">厦门天地安保险代理被责令停止接受新业务3年:妨碍依法监督检查</a>, <a href="https://news.sina.com.cn/w/2020-04-10/doc-iircuyvh7032998.shtml" target="_blank">西班牙新增605例新冠肺炎死亡病例，系17天以来最低</a>, <a href="https://finance.sina.com.cn/money/insurance/bxdt/2020-04-10/doc-iirczy

思考一下数据的维度

1. 标题
1. 类别
2. 时间
4. 关键字
5. 责任编辑
6. 来源
7. 相关专题
8. 正文

总之提取信息相当麻烦，因为格式不是固定的，上述元素也不一定在所有新闻中都出现

这里随便选一个全部都有的新闻当例子

In [5]:
sample_url = "https://finance.sina.com.cn/roll/2020-04-03/doc-iimxxsth3469383.shtml"
r = requests.get(sample_url)
bs = bs4.BeautifulSoup(r.text.encode(r.encoding).decode("utf8","ignore"))

In [6]:
import re
import datetime
import requests
from requests.adapters import HTTPAdapter

class SinaNews:
    def __init__(self, url: str, s: requests.Session):
        self.url = url
        r = s.get(url)
#         print(r)
        self.bs = bs4.BeautifulSoup(r.text.encode(r.encoding).decode("utf8","ignore"))
    async def go(self):
#         print(self.bs.prettify())
        
        self.main_title = self.bs.find('title').text
        date_source = self.bs.find('div', 'date-source')
        if date_source is None:
            date_source = self.bs.find('div', 'artInfo')
        date_text = date_source.find('span', 'date')
        if date_text is None:
            date_text = date_source.find('span', id='pub_date')
#         print(date_source, date_text)
#         self.html = self.bs.prettify()

        if date_text is not None:
            date_text = date_text.text.strip()
#             print(date_text)
            date_text = date_text.replace(' ', '')
            
            if '年' in date_text:
                self.publish_date = datetime.datetime.strptime(date_text, '%Y年%m月%d日%H:%M')
            else:
                self.publish_date = datetime.datetime.strptime(date_text, '%Y-%m-%d%H:%M:%S')
        if date_source is not None:
            self.source = date_source.find_all()[1].text
        content_list = self.bs.find_all('p')
        self.content = "".join([ tag.text for tag in content_list if not 'class' in tag.attrs ])
        key_words_list = self.bs.find('div', 'article-bottom clearfix')
        if key_words_list is not None:
            key_words_list = key_words_list.find_all('a')
            self.key_words = " ".join([ tag.text for tag in key_words_list if not 'class' in tag.attrs])
        
        editor_source = self.bs.find(string=re.compile('责任编辑'))
        if editor_source is not None:
            editor_source = editor_source.split('责任编辑：')
            self.article_editor = editor_source[-1].strip()
        
        self.category = self.bs.find('div', 'channel-path')
        if self.category is not None:
            self.category = self.category.find('a')
        if self.category is not None:
            self.category = self.category.text
        
        relative_topics_tags = self.bs.find('div', attrs={
            'data-sudaclick': 'content_relativetopics_p'
        })
        if relative_topics_tags is not None:
            relative_topics_tags = relative_topics_tags.find_all('a')
#             print(relative_topics_tags)
            self.relative_topics = " ".join(tag.text for tag in relative_topics_tags)
        self.data = dict()
        for k in self.__dict__:
            if k != 'bs' and k != 'url' and k != 'data':
                self.data[k] = self.__dict__[k]
#                 if k != 'content':
#                     print(k, self.__dict__[k])
        
#         pass
with requests.Session() as session:
    session.mount('http://', HTTPAdapter(max_retries=5))
    session.mount('https://', HTTPAdapter(max_retries=5))
    r = SinaNews(sample_url, session)
    await r.go()

写入 csv 文件


In [7]:
import csv
import pandas as pd
import numpy as np

sample_csv_file = "./sample.csv"

def dump(sampleNews: SinaNews, sinaNewsList: list):
    d = dict()
    attrs = sampleNews.data.keys()
    for k in attrs:
        d[k] = []
    for n in sinaNewsList:
        for k in attrs:
            if k in n.data:
                d[k].append(n.data[k])
            else:
                d[k].append(None)
    return pd.DataFrame(d, columns=attrs)


df = dump(r, [r])
print('done')
df.to_csv('sample.csv')


done


尝试读取内容

In [8]:
df2 = pd.read_csv(sample_csv_file, index_col=0)
df2

Unnamed: 0,main_title,publish_date,source,content,key_words,article_editor,category,relative_topics
0,做空机构曾围绕瑞幸掐架：力挺瑞幸的香椽终于承认浑水是对的|做空机构_新浪财经_新浪网,2020-04-03 13:48:00,澎湃新闻,安装新浪财经客户端第一时间接收最全面的市场资讯→【下载地址】 瑞幸咖啡（Nasdaq：LK...,做空机构 瑞幸咖啡,孟然,美股,瑞幸咖啡伪造交易股价暴跌专题


写一个测试函数，爬取最新的几条新闻

In [9]:
url = "https://news.sina.com.cn/c/2020-04-09/doc-iircuyvh6810407.shtml"
with requests.Session() as s:
    n = SinaNews(url, s)
    await n.go()
    print(n.data)

{'main_title': '北京玉渊潭公园门票倒卖情况突出 将实名购票|玉渊潭公园|景山公园|北京_新浪新闻', 'publish_date': datetime.datetime(2020, 4, 9, 16, 39), 'source': '新京报', 'content': '\u3000\u3000原标题：玉渊潭公园门票倒卖情况突出，将实名购票\u3000\u3000景山公园也将实名购票。\u3000\u3000新京报快讯 4月9日，北京召开新冠疫情防控第76场发布会。\u3000\u3000北京市公园管理中心副主任、新闻发言人张亚红表示，市属11家公园在清明假期接待游客数量比去年同期减少70%，游客在游园时也配合戴口罩、测温。这是因为市属公园提前采取网上预约措施。\u3000\u3000倒卖玉渊潭公园门票情况比较突出，已向公安机关报警。相关人员禁止进入所有市属公园。\u3000\u30004月11日，玉渊潭将实行实名购票。景山公园也将实名购票。\u3000\u3000新京报记者 李玉坤更多猛料！欢迎扫描左方二维码关注新浪新闻官方微信（xinlang-xinwen）违法和不良信息举报电话：4000520066\n                    举报邮箱：jubao@vip.sina.comCopyright © 1996-2020 SINA CorporationAll Rights Reserved  新浪公司 版权所有 ', 'key_words': '玉渊潭公园 景山公园 北京', 'article_editor': '郑亚鹏', 'category': ' 社会万象', 'relative_topics': '聚焦新型冠状病毒肺炎疫情'}


In [10]:
def getNewsFromDate(d: datetime.date):
    def date_to_datetime(d):
        return datetime.datetime(d.year, d.month, d.day)

    def timestamp(d):
        return (date_to_datetime(d) - datetime.datetime(1970, 1, 1)) / datetime.timedelta(seconds=1)
    
    dt = date_to_datetime(d)
    
    
    stime = d + datetime.timedelta(hours=4)
    etime = d - datetime.timedelta(days=1)
    ctime = dt
    
    url = '''https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime={}&stime={}&
    ctime={}&date={}&k=&num=50'''.format(timestamp(etime), timestamp(stime), timestamp(ctime), d.strftime('%Y-%m-%d'))
    return url


测试一下最近三天，每天10篇新闻

In [11]:
from itertools import chain
import traceback

async def getNewsFromDaysDuration(daysl: int, daysr: int, limit_num_per_day=0, output_num=5):
    async def getNewsFromRollPage(url, limit_num=0, output_num=5):
        
        def get_page_url(pid):
            pid = int(pid)
            return url + "&page={}".format(pid)
        
        pid = 0
        async def next_page(pid):
            pid += 1
            print(pid, get_page_url(pid))
            try:
                s = requests_html.AsyncHTMLSession()
                r = await s.get(get_page_url(pid))
                await asyncio.sleep(0.1)
                await r.html.arender()
                bs = bs4.BeautifulSoup(r.html.html, 'html.parser')
                links = bs.find_all('a')[9:]
            except:
                traceback.print_exc()
                print("something went wrong when calling page {}".format(pid))
                pid += 1
                return next_page(pid+1)
            finally:
                await s.close()
            return pid, links


        cnt = 0
        today = datetime.date.today()
        tasks = []
        res = []
        with requests.Session() as s:
            s.mount('http://', HTTPAdapter(max_retries=5))
            s.mount('https://', HTTPAdapter(max_retries=5))
            while True:
                pid, links = await next_page(pid)
                for link in links:
                    if cnt % output_num == 0:
                        print(f"#{cnt}", link.text, link['href'])

                    try:
                        n = SinaNews(link['href'], s)
                        res.append(n)
                        tasks.append(asyncio.create_task(n.go()))
                    except Exception as err:
                        traceback.print_exc()
                        print("something wrong with news url: {}".format(link))
                    finally:
                        cnt += 1

                    if cnt == limit_num:
                        break
                    if cnt % 50 == 0:
                        break
                    urls = re.findall('http[s]?', link.text)
                    print(urls)
                    exit()
                    if len(urls) == 0:
                        break
                if cnt == limit_num:
                    break
        
        for task in tasks:
            await task
        
        return res
    
    
    today = datetime.date.today()
    res = []
    for i in range(daysl, daysr+1):
        print("day {} from today".format(i))
        d = today - datetime.timedelta(days=i)
        url = getNewsFromDate(d)
        try:
            res += await getNewsFromRollPage(url, limit_num_per_day, output_num)
        except Exception as e:
            traceback.print_exc()
            print("something wrong when finding news from day {}".format(i))
    
    return res

In [None]:
news = await getNewsFromDaysDuration(1, 3, 10)

day 1 from today
1 https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime=1586304000.0&stime=1586390400.0&
    ctime=1586390400.0&date=2020-04-09&k=&num=50&page=1
#0 美移民律师：很多中国客户因疫情失业 也回不了中国 https://news.sina.com.cn/w/2020-04-10/doc-iirczymi5399549.shtml
[]
2 https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime=1586304000.0&stime=1586390400.0&
    ctime=1586390400.0&date=2020-04-09&k=&num=50&page=2


In [None]:
print(news)

In [None]:
df = dump(r, news)
df.to_csv('sample2.csv')
df.head()


查看缺失情况

In [None]:
for attr in df:
    print(attr, df[attr].isnull().sum())

测试收集一天的所有新闻，新浪新闻每天只有50页保存，每页50个新闻，因此一天只有2500新闻最多可以保存

In [None]:
for i in range(3, 6):
    news = await getNewsFromDaysDuration(i, i, 1500, 20)
    df = dump(r, news)
    df.to_csv(f'{i}.csv')

day 3 from today
1 https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime=1586131200.0&stime=1586217600.0&
    ctime=1586217600.0&date=2020-04-07&k=&num=50&page=1
#0 任泽平分析全球疫情：欧美陆续现拐点 但有长尾特征 https://finance.sina.com.cn/review/mspl/2020-04-08/doc-iimxxsth4181599.shtml
2 https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime=1586131200.0&stime=1586217600.0&
    ctime=1586217600.0&date=2020-04-07&k=&num=50&page=2
3 https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime=1586131200.0&stime=1586217600.0&
    ctime=1586217600.0&date=2020-04-07&k=&num=50&page=3
4 https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime=1586131200.0&stime=1586217600.0&
    ctime=1586217600.0&date=2020-04-07&k=&num=50&page=4
5 https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime=1586131200.0&stime=1586217600.0&
    ctime=1586217600.0&date=2020-04-07&k=&num=50&page=5
6 https://news.sina.com.cn/roll/#pageid=153&lid=2509&etime=1586131200.0&stime=1586217600.0&
    ctime=1586217600.0&date=2020-04-07&k=&num=