## 爬取新浪网的新闻

+ 使用 requests_html 请求内容
+ 使用 bs4 解析 html

### 大致思路

1. 爬取滚动
1. 判断时间是否在两周内，如果是则爬取

准备

In [126]:
import bs4
import requests_html
import asyncio


SINA_URL = 'https://news.sina.com.cn/roll/'

检查编码

In [116]:
s = requests_html.AsyncHTMLSession()
r = await s.get(SINA_URL)
await r.html.arender()
r.html.encoding

'utf-8'

找出链接, 顺便写一个获取最新新闻链接的函数

In [128]:
bs = bs4.BeautifulSoup(r.html.html, 'html.parser')
links = bs.find_all('a')[6:]
links[0]['href'], links[0].text

async def __get_latest_news_url():
    s = requests_html.AsyncHTMLSession()
    r = await s.get(SINA_URL)
    await r.html.arender()
    bs = bs4.BeautifulSoup(r.html.html, 'html.parser')
    return bs.find_all['a'][6]
loop = asyncio.get_event_loop()

loop.run(__get_latest_news_url)


AttributeError: '_UnixSelectorEventLoop' object has no attribute 'run'

思考一下数据的维度

1. 标题
1. 类别
2. 时间
4. 关键字
5. 责任编辑
6. 来源
7. 相关专题
8. 正文

总之提取信息相当麻烦，因为格式不是固定的，上述元素也不一定在所有新闻中都出现

这里随便选一个全部都有的新闻当例子

In [102]:
sample_url = "https://finance.sina.com.cn/roll/2020-04-03/doc-iimxxsth3469383.shtml"
r = requests_html.requests.get(sample_url)
bs = bs4.BeautifulSoup(r.text.encode(r.encoding).decode())
print(bs.prettify())

<!DOCTYPE html>
<!-- [ published at 2020-04-03 13:55:29 ] -->
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="always" name="referrer"/>
  <meta content="urlpath:stock/usstock/; allCIDs:57045,257,51894,56416,57043,56673,51070,258,240795,40809,260314,146643,260313,76557,76556" name="sudameta"/>
  <title>
   做空机构曾围绕瑞幸掐架：力挺瑞幸的香椽终于承认浑水是对的|做空机构_新浪财经_新浪网
  </title>
  <meta content="做空机构,瑞幸咖啡" name="keywords"/>
  <meta content="做空机构,瑞幸咖啡" name="tags"/>
  <meta content="" name="description"/>
  <meta content="news" property="og:type"/>
  <meta content="做空机构曾围绕瑞幸掐架：力挺瑞幸的香椽终于承认浑水是对的" property="og:title"/>
  <meta content="做空机构曾围绕瑞幸掐架：力挺瑞幸的香椽终于承认浑水是对的" property="og:description"/>
  <meta content="https://finance.sina.com.cn/roll/2020-04-03/doc-iimxxsth3469383.shtml" property="og:url"/>
  <meta content="//n.sinaimg.cn/sinakd202043s/217/w1080h2337/20200403/9bb6-irtymmv6919426.jpg" property="og:image"/>
  <meta content="2020-04-03 13:

In [114]:
import re
import datetime

class SinaNews:
    def __init__(self, url: str):
        self.url = url
        r = requests_html.requests.get(url)
        self.bs = bs4.BeautifulSoup(r.text.encode(r.encoding).decode())
    def go(self):
        self.main_title = self.bs.find('title').text
        date_source = self.bs.find('div', 'date-source')
        date_text = date_source.find('span', 'date').text
#         self.publish_date = date_source.find('span', 'date').text

        if date_text is not None:
            self.publish_date = datetime.datetime.strptime(date_text, '%Y年%m月%d日 %H:%M')
        self.source = date_source.find_all()[1].text
        content_list = self.bs.find_all('p')
        self.content = "".join([ tag.text for tag in content_list if not 'class' in tag.attrs ])
        key_words_list = self.bs.find('div', 'article-bottom clearfix')
        if key_words_list is not None:
            key_words_list = key_words_list.find_all('a')
            self.key_words = " ".join([ tag.text for tag in key_words_list if not 'class' in tag.attrs])
        self.article_editor = self.bs.find(string=re.compile('责任编辑')).split('责任编辑：')[-1].strip()
        
        self.category = self.bs.find('div', 'channel-path').find('a').text
        relative_topics_tags = self.bs.find('div', attrs={
            'data-sudaclick': 'content_relativetopics_p'
        })
#         print(relative_topics_tags)
        if relative_topics_tags is not None:
            relative_topics_tags = relative_topics_tags.find_all('a')
            print(relative_topics_tags)
            self.relative_topics = " ".join(tag.text for tag in relative_topics_tags)
        self.data = dict()
        for k in self.__dict__:
            if k != 'bs' and k != 'url' and k != 'data':
                self.data[k] = self.__dict__[k]
                if k != 'content':
                    print(k, self.__dict__[k])
        
#         self.publish_date = bs.find_all('div', 'data-source')
#         print(self.main_title, self.publish_date)
        pass
r = SinaNews(sample_url)
r.go()

[<a href="https://finance.sina.com.cn/zt_d/lkcf" target="_blank">瑞幸咖啡伪造交易股价暴跌专题</a>]
main_title 做空机构曾围绕瑞幸掐架：力挺瑞幸的香椽终于承认浑水是对的|做空机构_新浪财经_新浪网
publish_date 2020-04-03 13:48:00
source 澎湃新闻
key_words 做空机构 瑞幸咖啡
article_editor 孟然
category 美股
relative_topics 瑞幸咖啡伪造交易股价暴跌专题


写入 csv 文件


In [104]:
import csv
import pandas as pd
import numpy as np

sample_csv_file = "./sample.csv"

def dump(sampleNews: SinaNews, sinaNewsList: list):
    d = dict()
    attrs = sampleNews.data.keys()
    for k in attrs:
        d[k] = []
    for n in sinaNewsList:
        for k in attrs:
            d[k].append(n.data[k])
    return pd.DataFrame(d, columns=attrs)


df = dump(r, [r])
print('done')
df


done


Unnamed: 0,main_title,publish_date,source,content,key_words,article_editor,category
0,做空机构曾围绕瑞幸掐架：力挺瑞幸的香椽终于承认浑水是对的|做空机构_新浪财经_新浪网,2020-04-03 13:48:00,澎湃新闻,安装新浪财经客户端第一时间接收最全面的市场资讯→【下载地址】 瑞幸咖啡（Nasdaq：LK...,做空机构 瑞幸咖啡,孟然,美股


尝试读取内容

In [100]:
df2 = pd.read_csv(sample_csv_file, index_col=0)
df2

Unnamed: 0,main_title,publish_date,source,content,key_words,article_editor,category
0,一文读懂董事责任险_新浪财经_新浪网,2020年04月02日 22:53,新浪财经-自媒体综合,【金融315，我们帮你维权】近来，ETC纠纷、信用卡盗刷、银行征信、保险理赔难等问题困扰着金...,,陈鑫,保险


写一个测试函数，爬取最新的一条新闻

In [None]:
def __test_latest_news():
    

接下来筛选最近两周的新闻