Add a feature for crawling topics #104

Closed
standard-outlier opened this issue Jan 14, 2020 · 61 comments

standard-outlier commented Jan 14, 2020

First of all, thank you for your hard work. I tried it out and it works very well; after a little post-processing the scraped data is ready to use.
Could you add a feature for crawling a specific topic? Crawling an individual user's Weibo is of limited use, whereas data collected by topic is far more valuable for research.


dataabc commented Jan 14, 2020

Thanks for the suggestion. This is essentially the Weibo search feature, just with a keyword that contains the # symbol.

An earlier program of mine already implemented it, but it was rather crude, so I never uploaded it.
The code is as follows:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import codecs
import csv
import os
import random
import re
import sys
import traceback
from collections import OrderedDict
from datetime import datetime, timedelta
from time import sleep

import requests
from lxml import etree
from tqdm import tqdm


class Weibo(object):
    cookie = {
        'Cookie':
        'your cookie'
    }  # 将your cookie替换成自己的cookie

    def __init__(self, keyword, filter=0, pic_download=0):
        """Weibo类初始化"""
        self.keyword = keyword  # 要搜索的关键词
        if filter != 0 and filter != 1:
            sys.exit(u'filter值应为0或1,请重新输入')
        if pic_download != 0 and pic_download != 1:
            sys.exit(u'pic_download值应为0或1,请重新输入')
        self.filter = filter  # 取值范围为0、1,程序默认值为0,代表要爬取用户的全部微博,1代表只爬取用户的原创微博
        self.pic_download = pic_download  # 取值范围为0、1,程序默认值为0,代表不下载微博原始图片,1代表下载
        self.nickname = ''  # 用户昵称,如“Dear-迪丽热巴”
        self.weibo_num = 0  # 用户全部微博数
        self.got_num = 0  # 爬取到的微博数
        self.following = 0  # 用户关注数
        self.followers = 0  # 用户粉丝数
        self.weibo = []  # 存储爬取到的所有微博信息

    def deal_html(self, url):
        """处理html"""
        try:
            html = requests.get(url, cookies=self.cookie).content
            selector = etree.HTML(html)
            return selector
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def deal_garbled(self, info):
        """处理乱码"""
        try:
            info = (info.xpath('string(.)').replace(u'\u200b', '').encode(
                sys.stdout.encoding, 'ignore').decode(sys.stdout.encoding))
            return info
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_nickname(self):
        """获取用户昵称"""
        try:
            url = 'https://weibo.cn/%d/info' % (self.user_id)
            selector = self.deal_html(url)
            nickname = selector.xpath('//title/text()')[0]
            self.nickname = nickname[:-3]
            if self.nickname == u'登录 - 新' or self.nickname == u'新浪':
                sys.exit(u'cookie错误或已过期,请按照README中方法重新获取')
            print(u'用户昵称: ' + self.nickname)
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_user_info(self, selector):
        """获取用户昵称、微博数、关注数、粉丝数"""
        try:
            self.get_nickname()  # 获取用户昵称
            user_info = selector.xpath("//div[@class='tip2']/*/text()")

            self.weibo_num = int(user_info[0][3:-1])
            print(u'微博数: ' + str(self.weibo_num))

            self.following = int(user_info[1][3:-1])
            print(u'关注数: ' + str(self.following))

            self.followers = int(user_info[2][3:-1])
            print(u'粉丝数: ' + str(self.followers))
            print('*' * 100)
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_page_num(self, selector):
        """获取微博总页数"""
        try:
            if selector.xpath("//input[@name='mp']") == []:
                page_num = 1
            else:
                page_num = int(
                    selector.xpath("//input[@name='mp']")[0].attrib['value'])
            return page_num
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_long_weibo(self, weibo_link):
        """获取长原创微博"""
        try:
            selector = self.deal_html(weibo_link)
            info = selector.xpath("//div[@class='c']")[1]
            wb_content = self.deal_garbled(info)
            wb_time = info.xpath("//span[@class='ct']/text()")[0]
            weibo_content = wb_content[wb_content.find(':') +
                                       1:wb_content.rfind(wb_time)]
            return weibo_content
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_original_weibo(self, info, weibo_id):
        """获取原创微博"""
        try:
            weibo_content = self.deal_garbled(info)
            weibo_content = weibo_content[:weibo_content.rfind(u'赞')]
            a_text = info.xpath('div//a/text()')
            if u'全文' in a_text:
                weibo_link = 'https://weibo.cn/comment/' + weibo_id
                wb_content = self.get_long_weibo(weibo_link)
                if wb_content:
                    weibo_content = wb_content
            return weibo_content
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_long_retweet(self, weibo_link):
        """获取长转发微博"""
        try:
            wb_content = self.get_long_weibo(weibo_link)
            weibo_content = wb_content[:wb_content.rfind(u'原文转发')]
            return weibo_content
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_retweet(self, info, weibo_id):
        """获取转发微博"""
        try:
            original_user = info.xpath("div/span[@class='cmt']/a/text()")
            if not original_user:
                wb_content = u'转发微博已被删除'
                return wb_content
            else:
                original_user = original_user[0]
            wb_content = self.deal_garbled(info)
            wb_content = wb_content[wb_content.find(':') +
                                    1:wb_content.rfind(u'赞')]
            wb_content = wb_content[:wb_content.rfind(u'赞')]
            a_text = info.xpath('div//a/text()')
            if u'全文' in a_text:
                weibo_link = 'https://weibo.cn/comment/' + weibo_id
                weibo_content = self.get_long_retweet(weibo_link)
                if weibo_content:
                    wb_content = weibo_content
            retweet_reason = self.deal_garbled(info.xpath('div')[-1])
            retweet_reason = retweet_reason[:retweet_reason.rindex(u'赞')]
            wb_content = (retweet_reason + '\n' + u'原始用户: ' + original_user +
                          '\n' + u'转发内容: ' + wb_content)
            return wb_content
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def is_original(self, info):
        """判断微博是否为原创微博"""
        is_original = info.xpath("div/span[@class='cmt']")
        if len(is_original) > 3:
            return False
        else:
            return True

    def get_weibo_content(self, info, is_original):
        """获取微博内容"""
        try:
            weibo_id = info.xpath('@id')[0][2:]
            if is_original:
                weibo_content = self.get_original_weibo(info, weibo_id)
            else:
                weibo_content = self.get_retweet(info, weibo_id)
            print(weibo_content)
            return weibo_content
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_publish_place(self, info):
        """获取微博发布位置"""
        try:
            div_first = info.xpath('div')[0]
            a_list = div_first.xpath('a')
            publish_place = u'无'
            for a in a_list:
                if ('place.weibo.com' in a.xpath('@href')[0]
                        and a.xpath('text()')[0] == u'显示地图'):
                    weibo_a = div_first.xpath("span[@class='ctt']/a")
                    if len(weibo_a) >= 1:
                        publish_place = weibo_a[-1]
                        if (u'视频' == div_first.xpath(
                                "span[@class='ctt']/a/text()")[-1][-2:]):
                            if len(weibo_a) >= 2:
                                publish_place = weibo_a[-2]
                            else:
                                publish_place = u'无'
                        publish_place = self.deal_garbled(publish_place)
                        break
            print(u'微博发布位置: ' + publish_place)
            return publish_place
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_publish_time(self, info):
        """获取微博发布时间"""
        try:
            str_time = info.xpath("div/span[@class='ct']")
            str_time = self.deal_garbled(str_time[0])
            publish_time = str_time.split(u'来自')[0]
            if u'刚刚' in publish_time:
                publish_time = datetime.now().strftime('%Y-%m-%d %H:%M')
            elif u'分钟' in publish_time:
                minute = publish_time[:publish_time.find(u'分钟')]
                minute = timedelta(minutes=int(minute))
                publish_time = (datetime.now() -
                                minute).strftime('%Y-%m-%d %H:%M')
            elif u'今天' in publish_time:
                today = datetime.now().strftime('%Y-%m-%d')
                time = publish_time[3:]
                publish_time = today + ' ' + time
            elif u'月' in publish_time:
                year = datetime.now().strftime('%Y')
                month = publish_time[0:2]
                day = publish_time[3:5]
                time = publish_time[7:12]
                publish_time = year + '-' + month + '-' + day + ' ' + time
            else:
                publish_time = publish_time[:16]
            print(u'微博发布时间: ' + publish_time)
            return publish_time
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_publish_tool(self, info):
        """获取微博发布工具"""
        try:
            str_time = info.xpath("div/span[@class='ct']")
            str_time = self.deal_garbled(str_time[0])
            if len(str_time.split(u'来自')) > 1:
                publish_tool = str_time.split(u'来自')[1]
            else:
                publish_tool = u'无'
            print(u'微博发布工具: ' + publish_tool)
            return publish_tool
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_weibo_footer(self, info):
        """获取微博点赞数、转发数、评论数"""
        try:
            footer = {}
            pattern = r'\d+'
            str_footer = info.xpath('div')[-1]
            str_footer = self.deal_garbled(str_footer)
            str_footer = str_footer[str_footer.rfind(u'赞'):]
            weibo_footer = re.findall(pattern, str_footer, re.M)

            up_num = int(weibo_footer[0])
            print(u'点赞数: ' + str(up_num))
            footer['up_num'] = up_num

            retweet_num = int(weibo_footer[1])
            print(u'转发数: ' + str(retweet_num))
            footer['retweet_num'] = retweet_num

            comment_num = int(weibo_footer[2])
            print(u'评论数: ' + str(comment_num))
            footer['comment_num'] = comment_num
            return footer
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def extract_picture_urls(self, info, weibo_id):
        """提取微博原始图片url"""
        try:
            a_list = info.xpath('div/a/@href')
            first_pic = 'https://weibo.cn/mblog/pic/' + weibo_id + '?rl=1'
            all_pic = 'https://weibo.cn/mblog/picAll/' + weibo_id + '?rl=2'
            if first_pic in a_list:
                if all_pic in a_list:
                    selector = self.deal_html(all_pic)
                    preview_picture_list = selector.xpath('//img/@src')
                    picture_list = [
                        p.replace('/thumb180/', '/large/')
                        for p in preview_picture_list
                    ]
                    picture_urls = ','.join(picture_list)
                else:
                    if info.xpath('.//img/@src'):
                        preview_picture = info.xpath('.//img/@src')[-1]
                        picture_urls = preview_picture.replace(
                            '/wap180/', '/large/')
                    else:
                        sys.exit(
                            u"爬虫微博可能被设置成了'不显示图片',请前往"
                            u"'https://weibo.cn/account/customize/pic',修改为'显示'"
                        )
            else:
                picture_urls = '无'
            return picture_urls
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_picture_urls(self, info, is_original):
        """获取微博原始图片url"""
        try:
            weibo_id = info.xpath('@id')[0][2:]
            picture_urls = {}
            if is_original:
                original_pictures = self.extract_picture_urls(info, weibo_id)
                picture_urls['original_pictures'] = original_pictures
                if not self.filter:
                    picture_urls['retweet_pictures'] = '无'
            else:
                retweet_url = info.xpath("div/a[@class='cc']/@href")[0]
                retweet_id = retweet_url.split('/')[-1].split('?')[0]
                retweet_pictures = self.extract_picture_urls(info, retweet_id)
                picture_urls['retweet_pictures'] = retweet_pictures
                a_list = info.xpath('div[last()]/a/@href')
                original_picture = '无'
                for a in a_list:
                    if a.endswith(('.gif', '.jpeg', '.jpg', '.png')):
                        original_picture = a
                        break
                picture_urls['original_pictures'] = original_picture
            return picture_urls
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def download_pic(self, url, pic_path):
        """下载单张图片"""
        try:
            p = requests.get(url)
            with open(pic_path, 'wb') as f:
                f.write(p.content)
        except Exception as e:
            error_file = self.get_filepath(
                'img') + os.sep + 'not_downloaded_pictures.txt'
            with open(error_file, 'ab') as f:
                url = url + '\n'
                f.write(url.encode(sys.stdout.encoding))
            print('Error: ', e)
            traceback.print_exc()

    def download_pictures(self):
        """下载微博图片"""
        try:
            print(u'即将进行图片下载')
            img_dir = self.get_filepath('img')
            for w in tqdm(self.weibo, desc=u'图片下载进度'):
                if w['original_pictures'] != '无':
                    pic_prefix = w['publish_time'][:11].replace(
                        '-', '') + '_' + w['id']
                    if ',' in w['original_pictures']:
                        w['original_pictures'] = w['original_pictures'].split(
                            ',')
                        for j, url in enumerate(w['original_pictures']):
                            pic_suffix = url[url.rfind('.'):]
                            pic_name = pic_prefix + '_' + str(j +
                                                              1) + pic_suffix
                            pic_path = img_dir + os.sep + pic_name
                            self.download_pic(url, pic_path)
                    else:
                        pic_suffix = w['original_pictures'][
                            w['original_pictures'].rfind('.'):]
                        pic_name = pic_prefix + pic_suffix
                        pic_path = img_dir + os.sep + pic_name
                        self.download_pic(w['original_pictures'], pic_path)
            print(u'图片下载完毕,保存路径:')
            print(img_dir)
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_one_weibo(self, info):
        """获取一条微博的全部信息"""
        try:
            weibo = OrderedDict()
            is_original = self.is_original(info)
            if (not self.filter) or is_original:
                weibo_id = info.xpath('@id')
                if len(weibo_id) > 0:
                    weibo['id'] = weibo_id[0][2:]
                    weibo['content'] = self.get_weibo_content(
                        info, is_original)  # 微博内容
                    picture_urls = self.get_picture_urls(info, is_original)
                    weibo['original_pictures'] = picture_urls[
                        'original_pictures']  # 原创图片url
                    if not self.filter:
                        weibo['retweet_pictures'] = picture_urls[
                            'retweet_pictures']  # 转发图片url
                        weibo['original'] = is_original  # 是否原创微博
                    weibo['publish_place'] = self.get_publish_place(
                        info)  # 微博发布位置
                    weibo['publish_time'] = self.get_publish_time(
                        info)  # 微博发布时间
                    weibo['publish_tool'] = self.get_publish_tool(
                        info)  # 微博发布工具
                    footer = self.get_weibo_footer(info)
                    weibo['up_num'] = footer['up_num']  # 微博点赞数
                    weibo['retweet_num'] = footer['retweet_num']  # 转发数
                    weibo['comment_num'] = footer['comment_num']  # 评论数
                else:
                    weibo = None
            else:
                weibo = None
            return weibo
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_one_page(self, page):
        """获取第page页的全部微博"""
        try:
            url = 'https://weibo.cn/search/mblog?keyword=%s&page=%d' % (
                self.keyword, page)
            selector = self.deal_html(url)
            info = selector.xpath("//div[@class='c']")
            is_exist = info[0].xpath("//div/span[@class='ctt']")
            if is_exist:
                for i in range(len(info) - 2):
                    weibo = self.get_one_weibo(info[i])
                    if weibo:
                        self.weibo.append(weibo)
                        self.got_num += 1
                        print('-' * 100)
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_filepath(self, type):
        """获取结果文件路径"""
        try:
            keyword = self.keyword.replace('%23', '#')
            file_dir = os.path.split(os.path.realpath(
                __file__))[0] + os.sep + 'weibo' + os.sep + keyword
            if type == 'img':
                file_dir = file_dir + os.sep + 'img'
            if not os.path.isdir(file_dir):
                os.makedirs(file_dir)
            if type == 'img':
                return file_dir
            file_path = file_dir + os.sep + '%s' % keyword + '.' + type
            return file_path
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def write_csv(self, wrote_num):
        """将爬取的信息写入csv文件"""
        try:
            result_headers = [
                '微博id',
                '微博正文',
                '原始图片url',
                '发布位置',
                '发布时间',
                '发布工具',
                '点赞数',
                '转发数',
                '评论数',
            ]
            if not self.filter:
                result_headers.insert(3, '被转发微博原始图片url')
                result_headers.insert(4, '是否为原创微博')
            result_data = [w.values() for w in self.weibo][wrote_num:]
            if sys.version < '3':  # python2.x
                reload(sys)
                sys.setdefaultencoding('utf-8')
                with open(self.get_filepath('csv'), 'ab') as f:
                    f.write(codecs.BOM_UTF8)
                    writer = csv.writer(f)
                    if wrote_num == 0:
                        writer.writerows([result_headers])
                    writer.writerows(result_data)
            else:  # python3.x
                with open(self.get_filepath('csv'),
                          'a',
                          encoding='utf-8-sig',
                          newline='') as f:
                    writer = csv.writer(f)
                    if wrote_num == 0:
                        writer.writerows([result_headers])
                    writer.writerows(result_data)
            print(u'%d条微博写入csv文件完毕,保存路径:' % self.got_num)
            print(self.get_filepath('csv'))
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def write_txt(self, wrote_num):
        """将爬取的信息写入txt文件"""
        try:
            temp_result = []
            if wrote_num == 0:
                if self.filter:
                    result_header = u'\n\n原创微博内容: \n'
                else:
                    result_header = u'\n\n微博内容: \n'
                temp_result.append(result_header)
            for i, w in enumerate(self.weibo[wrote_num:]):
                temp_result.append(
                    str(wrote_num + i + 1) + ':' + w['content'] + '\n' +
                    u'微博位置: ' + w['publish_place'] + '\n' + u'发布时间: ' +
                    w['publish_time'] + '\n' + u'点赞数: ' + str(w['up_num']) +
                    u'   转发数: ' + str(w['retweet_num']) + u'   评论数: ' +
                    str(w['comment_num']) + '\n' + u'发布工具: ' +
                    w['publish_tool'] + '\n\n')
            result = ''.join(temp_result)
            with open(self.get_filepath('txt'), 'ab') as f:
                f.write(result.encode(sys.stdout.encoding))
            print(u'%d条微博写入txt文件完毕,保存路径:' % self.got_num)
            print(self.get_filepath('txt'))
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def write_file(self, wrote_num):
        """写文件"""
        if self.got_num > wrote_num:
            self.write_csv(wrote_num)
            self.write_txt(wrote_num)

    def get_weibo_info(self):
        """获取微博信息"""
        try:
            url = 'https://weibo.cn/search/mblog?keyword=%s' % (self.keyword)
            selector = self.deal_html(url)
            page_num = self.get_page_num(selector)  # 获取微博总页数
            wrote_num = 0
            page1 = 0
            random_pages = random.randint(1, 5)
            for page in tqdm(range(1, page_num + 1), desc=u'进度'):
                self.get_one_page(page)  # 获取第page页的全部微博

                if page % 20 == 0:  # 每爬20页写入一次文件
                    self.write_file(wrote_num)
                    wrote_num = self.got_num

                # 通过加入随机等待避免被限制。爬虫速度过快容易被系统限制(一段时间后限
                # 制会自动解除),加入随机等待模拟人的操作,可降低被系统限制的风险。默
                # 认是每爬取1到5页随机等待6到10秒,如果仍然被限,可适当增加sleep时间
                if page - page1 == random_pages and page < page_num:
                    sleep(random.randint(6, 10))
                    page1 = page
                    random_pages = random.randint(1, 5)

            self.write_file(wrote_num)  # 将剩余不足20页的微博写入文件
            if not self.filter:
                print(u'共爬取' + str(self.got_num) + u'条微博')
            else:
                print(u'共爬取' + str(self.got_num) + u'条原创微博')
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def start(self):
        """运行爬虫"""
        try:
            self.get_weibo_info()
            print(u'信息抓取完毕')
            print('*' * 100)
            if self.pic_download == 1:
                self.download_pictures()
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()


def main():
    try:
        # 使用实例,输入搜索关键词,所有信息都会存储在wb实例中
        keyword = u'%23test%23'  # 要搜索的关键词,如果是话题,两边的#要替换成%23
        filter = 1  # 值为0表示爬取全部微博(原创微博+转发微博),值为1表示只爬取原创微博
        pic_download = 1  # 值为0代表不下载微博原始图片,1代表下载微博原始图片
        wb = Weibo(keyword, filter, pic_download)  # 调用Weibo类,创建微博实例wb
        wb.start()  # 爬取微博信息
        if wb.weibo:
            print(u'最新/置顶 微博为: ' + wb.weibo[0]['content'])
            print(u'最新/置顶 微博位置: ' + wb.weibo[0]['publish_place'])
            print(u'最新/置顶 微博发布时间: ' + wb.weibo[0]['publish_time'])
            print(u'最新/置顶 微博获得赞数: ' + str(wb.weibo[0]['up_num']))
            print(u'最新/置顶 微博获得转发数: ' + str(wb.weibo[0]['retweet_num']))
            print(u'最新/置顶 微博获得评论数: ' + str(wb.weibo[0]['comment_num']))
            print(u'最新/置顶 微博发布工具: ' + wb.weibo[0]['publish_tool'])
    except Exception as e:
        print('Error: ', e)
        traceback.print_exc()


if __name__ == '__main__':
    main()

This is adapted from this project; it has fewer features, but the main ones are all there. To use it, first replace 'your cookie' near the top of the script with your own cookie, then set keyword in the main function to the term you want to search. For an ordinary keyword such as test, the code is:

        keyword = u'test'

To search the topic #test#:

        keyword = u'%23test%23'

Note that the # on either side of the keyword must be replaced with %23 (the URL-encoded form of #).
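If you don't want to encode the # by hand, a minimal sketch using Python 3's standard urllib (not part of the script above) would be:

        from urllib.parse import quote

        topic = u'#test#'                  # the human-readable topic
        keyword = quote(topic, safe='')    # -> u'%23test%23', ready for the search url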

standard-outlier (Author) commented:

I didn't expect such a quick reply, awesome!
I ran it and currently get an index-out-of-range error in the get_one_page function, at these two lines:

info = selector.xpath("//div[@class='c']")
is_exist = info[0].xpath("//div/span[@class='ctt']")

I suspect the first XPath expression is the problem, but entering $x("//div[@class='c']") in the Chrome console does return data. I haven't looked into it further yet; I'll take a closer look later.


dataabc commented Jan 16, 2020

Thanks for the report. I don't have my computer with me right now, so I can't debug. At a glance, the first line returns no data, while the second line assumes it did and takes the first element of the list, hence the out-of-range error. You could check the length of info before using it: only run the second line if the list is non-empty, otherwise set is_exist to False directly. Roughly that.
I specifically tested the code before posting it and it ran without errors. If it fails now, my guess is that the account is temporarily rate-limited (the limit lifts on its own after a while), or the search returned no results, or, most likely, the cookie is set incorrectly. Those are the cases I can think of for now; I'm not sure which applies.
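A minimal version of that guard (my sketch of the suggestion above, not code from the repo) could look like:

            info = selector.xpath("//div[@class='c']")
            is_exist = info[0].xpath("//div/span[@class='ctt']") if len(info) > 0 else False
            if is_exist:
                pass  # the existing per-weibo loop goes here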


Cherno6 commented Mar 3, 2020

Hello, thank you very much. I ran into a problem while crawling the original posts in a topic: each run only gets about 74 original posts. Is Weibo detecting and limiting the crawler, or does the topic simply only display that many posts? (I looked around and it seems topics aren't paginated, so perhaps only the default first page can be crawled?)


dataabc commented Mar 3, 2020

@Cherno6
Topic crawling works through Weibo search; the keyword is simply the topic, and the search results are paginated, 10 posts per page.
I just ran a test and was able to get more than 74 posts. If you can only get 74 original posts, there are two possibilities:
1. the search results really only contain that many posts;
2. you are being rate-limited. You can lower the risk by adjusting the random-wait code in the get_weibo_info method:

                if page - page1 == random_pages and page < page_num:
                    sleep(random.randint(6, 10))
                    page1 = page
                    random_pages = random.randint(1, 5)

This code waits a random 6-10 seconds every 1-5 pages. You can adjust it to your needs: the more frequently you wait and the longer you wait, the lower the risk of being limited.
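For example, an illustrative (not prescribed) adjustment that waits 10-20 seconds after every 1-3 pages:

                if page - page1 == random_pages and page < page_num:
                    sleep(random.randint(10, 20))
                    page1 = page
                    random_pages = random.randint(1, 3)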


Cherno6 commented Mar 3, 2020

Thank you, I'll try increasing the wait time 😄

barnett2010 commented:

https://weibo.cn/search/mblog/?keyword=电视剧&advanced=mblog&rl=1&starttime=20140901&endtime=20151001&sort=time

In the search options you can also set a time range to narrow the results.
How should the script be modified to do this?


dataabc commented Mar 5, 2020

@barnett2010
https://weibo.cn/search/mblog?hideSearchFrame=&keyword=电视剧&advancedfilter=1&starttime=20140901&endtime=20151001&sort=time&page=1

Just replace the part after the url in the get_one_page and get_weibo_info methods with the query string above, with a few small adjustments such as turning keyword and page into variables; the existing code can serve as a reference.

Also, many searches return a lot of results even for a single day, and Weibo shows at most 100 pages. To collect more data, turn starttime and endtime into variables: split the range from 20140901 to 20151001 into individual days and put them in a list, ['20140901', '20140902', ......, '20151001']. Then loop over that list, passing the same date as both starttime and endtime each time (see the sketch below); that should retrieve a lot more data.
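A minimal sketch of building that per-day list with the standard library (the helper name is mine, not from the repo):

from datetime import datetime, timedelta


def date_list(start, end):
    """Return ['20140901', '20140902', ..., end] for 'YYYYMMDD' inputs."""
    begin = datetime.strptime(start, '%Y%m%d')
    finish = datetime.strptime(end, '%Y%m%d')
    return [(begin + timedelta(days=i)).strftime('%Y%m%d')
            for i in range((finish - begin).days + 1)]


for day in date_list('20140901', '20151001'):
    # use the same day as both starttime and endtime in the search url
    pass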

Is-Sakura-i commented:

Hello, could you explain in a bit more detail how to modify the time range to get the desired results? So far I can only get more posts by changing the page value at the end of the URL. Also, is it possible to crawl by region? For example, I'd like the last two months of data for a certain keyword in Beijing.


dataabc commented Mar 7, 2020

@Is-Sakura-i
Weibo search results can be filtered by date. For example, https://weibo.cn/search/mblog?hideSearchFrame=&keyword=测试&advancedfilter=1&starttime=20200307&endtime=20200307&sort=time&page=1 is the first page of posts containing the keyword 测试 published on 2020-03-07, https://weibo.cn/search/mblog?hideSearchFrame=&keyword=测试&advancedfilter=1&starttime=20200306&endtime=20200306&sort=time&page=1 is the first page for 2020-03-06, and so on.

Because Weibo search shows at most 100 pages, adding a date lets you fetch up to 100 pages per day, which greatly increases how much you can collect.

Crawling by region isn't supported for now, but you can include the region name in the keyword, so that the results match both the region and the keyword.

Is-Sakura-i commented:


Thanks a lot for the reply!!! One question: in the get_one_page method the url is url = 'https://weibo.cn/search/mblog?keyword=%s&page=%d' % (self.keyword, page), without the time parameters, so should I add them to the end of it?


dataabc commented Mar 7, 2020

@Is-Sakura-i
Yes, replace the url in the get_one_page method with the one above that includes starttime, add a date variable, and set both starttime and endtime to that variable. The url in get_weibo_info also has to change, since that one is used to extract the page count. Then write a loop over dates in the start method that runs get_weibo_info once per day.

Is-Sakura-i commented:

@dataabc Hello! I changed the code as you suggested (only the modified parts are shown below). Could you take a look for me? Thank you very much! (I'm a complete beginner.)

def get_one_page(self,starttime,endtime,page):
    """获取第page页的全部微博"""
    try:
        url = 'https://weibo.cn/search/mblog?hideSearchFrame=&keyword=%s&advancedfilter=1&starttime=%d&endtime=%d&sort=time&page=%d' % ( self.keyword, starttime, endtime, page)

def get_weibo_info(self,starttime,endtime):
    """获取微博信息"""
    try:
        url = 'https://weibo.cn/search/mblog?keyword=%s&advancedfilter=1&starttime=%d&endtime=%d&sort=time' % (self.keyword)
        selector = self.deal_html(url)

def start(self):
    """运行爬虫"""
    try:
        day = 20200107
        if(day <= 20200307):
            self.get_weibo_info(starttime=day,endtime=day)
            day = day + 1
            print(u'信息抓取完毕')
            print('*' * 100)
            if self.pic_download == 1:
                self.download_pictures()

The error reported is:
File "D:/pyCharm/weiboSpider-master/weiboSpider.py", line 556, in get_weibo_info
url = 'https://weibo.cn/search/mblog?keyword=%s&advancedfilter=1&starttime=%d&endtime=%d&sort=time' % (self.keyword)
TypeError: not enough arguments for format string


dataabc commented Mar 7, 2020

@Is-Sakura-i

def get_weibo_info(self,starttime,endtime):
    """获取微博信息"""
    try:
        url = 'https://weibo.cn/search/mblog?keyword=%s&advancedfilter=1&starttime=%d&endtime=%d&sort=time' % (self.keyword)
        selector = self.deal_html(url)

The url format string has three placeholders (keyword, starttime and endtime), but you only passed one value.
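For reference, a corrected sketch of the two methods — illustrative only, assuming the dates are passed as 'YYYYMMDD' strings (so the %d placeholders become %s, in get_one_page as well) and relying on the datetime/timedelta imports already at the top of the script:

def get_weibo_info(self, starttime, endtime):
    """获取微博信息"""
    try:
        url = ('https://weibo.cn/search/mblog?keyword=%s&advancedfilter=1'
               '&starttime=%s&endtime=%s&sort=time' %
               (self.keyword, starttime, endtime))
        selector = self.deal_html(url)
        # ...the page counting and per-page loop stay as in the original...
    except Exception as e:
        print('Error: ', e)
        traceback.print_exc()


def start(self):
    """运行爬虫"""
    try:
        day = datetime(2020, 1, 7)
        end = datetime(2020, 3, 7)
        while day <= end:
            date_str = day.strftime('%Y%m%d')
            self.get_weibo_info(starttime=date_str, endtime=date_str)
            day += timedelta(days=1)
        print(u'信息抓取完毕')
        print('*' * 100)
        if self.pic_download == 1:
            self.download_pictures()
    except Exception as e:
        print('Error: ', e)
        traceback.print_exc()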

Is-Sakura-i commented:

@dataabc Great, it runs now! Thank you! 👍


puppyK commented Mar 11, 2020

Hello, I'm a complete beginner too, and I also want to crawl a topic by date for my thesis data. Could you share your code?


dataabc commented Mar 11, 2020

@Is-Sakura-i @puppyK

I've modified the current code so that it can search plain keywords as well as #-delimited topics.
It consists of two files, config_search.json and search.py: config_search.json configures the program and search.py runs it.
config_search.json looks like this:

{
    "keyword_list": [ "keyword"],
    "start_time": "2020-03-01",
    "end_time": "2020-03-01",
    "filter": 0,
    "write_mode": [
        "csv",
        "txt",
        "json"
    ],
    "pic_download": 1,
    "video_download": 0,
    "cookie": "your cookie",
    "mysql_config": {
        "host": "localhost",
        "port": 3306,
        "user": "root",
        "password": "123456",
        "charset": "utf8mb4"
    }
}

Compared with the main program, config_search.json adds the keyword_list, start_time and end_time parameters and removes since_date.
keyword_list holds the list of keywords to crawl; it can contain one or more entries. To search several keywords:

 "keyword_list": [ "keyword1", "keyword2", "keyword3"],

To search a #topic#:

 "keyword_list": [ "%23keyword%23"],

To search for posts containing several keywords at once:

 "keyword_list": [ "keyword1+keyword2"],

start_time and end_time define the publication time range: the resulting posts were published between start_time and end_time, given in "yyyy-mm-dd" form. The other parameters mean the same as in the main program; see its README.
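For reference, a minimal sketch of how a config file like this is typically loaded and passed to the class (this is an assumption about usage; the search.py excerpt below is cut off before its main function):

import codecs
import json
import os


def main():
    config_path = os.path.split(
        os.path.realpath(__file__))[0] + os.sep + 'config_search.json'
    with codecs.open(config_path, 'r', encoding='utf-8') as f:
        config = json.loads(f.read())
    wb = Weibo(config)  # the Weibo class defined in search.py below
    wb.start()          # assumed entry point, mirroring the first script


if __name__ == '__main__':
    main()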

The code of search.py is as follows:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import codecs
import copy
import csv
import json
import os
import random
import re
import sys
import traceback
from collections import OrderedDict
from datetime import datetime, timedelta
from time import sleep

import requests
from lxml import etree
from requests.adapters import HTTPAdapter
from tqdm import tqdm


class Weibo(object):
    def __init__(self, config):
        """Weibo类初始化"""
        self.validate_config(config)
        self.filter = config[
            'filter']  # 取值范围为0、1,程序默认值为0,代表要爬取用户的全部微博,1代表只爬取用户的原创微博
        self.write_mode = config[
            'write_mode']  # 结果信息保存类型,为list形式,可包含txt、csv、json、mongo和mysql五种类型
        self.pic_download = config[
            'pic_download']  # 取值范围为0、1,程序默认值为0,代表不下载微博原始图片,1代表下载
        self.video_download = config[
            'video_download']  # 取值范围为0、1,程序默认为0,代表不下载微博视频,1代表下载
        self.cookie = {'Cookie': config['cookie']}
        self.mysql_config = config.get('mysql_config')  # MySQL数据库连接配置,可以不填
        keyword_list = config['keyword_list']
        if not isinstance(keyword_list, list):
            if not os.path.isabs(keyword_list):
                keyword_list = os.path.split(
                    os.path.realpath(__file__))[0] + os.sep + keyword_list
            keyword_list = self.get_keyword_list(keyword_list)
        self.keyword_list = keyword_list  # 要爬取的微博关键字列表
        self.keyword = ''
        self.start_time = config['start_time']
        self.end_time = config['end_time']
        self.got_num = 0  # 存储爬取到的微博数
        self.weibo = []  # 存储爬取到的所有微博信息
        self.weibo_id_list = []  # 存储爬取到的所有微博id

    def validate_config(self, config):
        """验证配置是否正确"""

        # 验证filter、pic_download、video_download
        argument_list = ['filter', 'pic_download', 'video_download']
        for argument in argument_list:
            if config[argument] != 0 and config[argument] != 1:
                sys.exit(u'%s值应为0或1,请重新输入' % argument)

        # 验证write_mode
        write_mode = ['txt', 'csv', 'json', 'mongo', 'mysql']
        if not isinstance(config['write_mode'], list):
            sys.exit(u'write_mode值应为list类型')
        for mode in config['write_mode']:
            if mode not in write_mode:
                sys.exit(
                    u'%s为无效模式,请从txt、csv、json、mongo和mysql中挑选一个或多个作为write_mode' %
                    mode)

        # 验证keyword_list
        keyword_list = config['keyword_list']
        if (not isinstance(keyword_list,
                           list)) and (not keyword_list.endswith('.txt')):
            sys.exit(u'keyword_list值应为list类型或txt文件路径')
        if not isinstance(keyword_list, list):
            if not os.path.isabs(keyword_list):
                keyword_list = os.path.split(
                    os.path.realpath(__file__))[0] + os.sep + keyword_list
            if not os.path.isfile(keyword_list):
                sys.exit(u'不存在%s文件' % keyword_list)

    def is_date(self, since_date):
        """判断日期格式是否正确"""
        try:
            if ':' in since_date:
                datetime.strptime(since_date, '%Y-%m-%d %H:%M')
            else:
                datetime.strptime(since_date, '%Y-%m-%d')
            return True
        except ValueError:
            return False

    def str_to_time(self, text):
        """将字符串转换成时间类型"""
        if ':' in text:
            result = datetime.strptime(text, '%Y-%m-%d %H:%M')
        else:
            result = datetime.strptime(text, '%Y-%m-%d')
        return result

    def handle_html(self, url):
        """处理html"""
        try:
            html = requests.get(url, cookies=self.cookie).content
            selector = etree.HTML(html)
            return selector
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def handle_garbled(self, info):
        """处理乱码"""
        try:
            info = (info.xpath('string(.)').replace(u'\u200b', '').encode(
                sys.stdout.encoding, 'ignore').decode(sys.stdout.encoding))
            return info
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_page_num(self, selector):
        """获取微博总页数"""
        try:
            if selector.xpath("//input[@name='mp']") == []:
                page_num = 1
            else:
                page_num = int(
                    selector.xpath("//input[@name='mp']")[0].attrib['value'])
            return page_num
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_long_weibo(self, weibo_link):
        """获取长原创微博"""
        try:
            for i in range(5):
                selector = self.handle_html(weibo_link)
                if selector is not None:
                    info = selector.xpath("//div[@class='c']")[1]
                    wb_content = self.handle_garbled(info)
                    wb_time = info.xpath("//span[@class='ct']/text()")[0]
                    weibo_content = wb_content[wb_content.find(':') +
                                               1:wb_content.rfind(wb_time)]
                    if weibo_content is not None:
                        return weibo_content
                sleep(random.randint(6, 10))
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()
            return u'网络出错'

    def get_original_weibo(self, info, weibo_id):
        """获取原创微博"""
        try:
            weibo_content = self.handle_garbled(info)
            weibo_content = weibo_content[:weibo_content.rfind(u'赞')]
            weibo_content = weibo_content[weibo_content.find(':') + 1:]
            a_text = info.xpath('div//a/text()')
            if u'全文' in a_text:
                weibo_link = 'https://weibo.cn/comment/' + weibo_id
                wb_content = self.get_long_weibo(weibo_link)
                if wb_content:
                    weibo_content = wb_content
            return weibo_content
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_long_retweet(self, weibo_link):
        """获取长转发微博"""
        try:
            wb_content = self.get_long_weibo(weibo_link)
            weibo_content = wb_content[:wb_content.rfind(u'原文转发')]
            return weibo_content
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_retweet(self, info, weibo_id):
        """获取转发微博"""
        try:
            weibo_content = self.handle_garbled(info)
            weibo_content = weibo_content[weibo_content.find(':') +
                                          1:weibo_content.rfind(u'赞')]
            weibo_content = weibo_content[:weibo_content.rfind(u'赞')]
            a_text = info.xpath('div//a/text()')
            if u'全文' in a_text:
                weibo_link = 'https://weibo.cn/comment/' + weibo_id
                wb_content = self.get_long_retweet(weibo_link)
                if wb_content:
                    weibo_content = wb_content
            retweet_reason = self.handle_garbled(info.xpath('div')[-1])
            retweet_reason = retweet_reason[:retweet_reason.rindex(u'赞')]
            original_user = info.xpath("div/span[@class='cmt']/a/text()")
            if original_user:
                original_user = original_user[0]
                weibo_content = (retweet_reason + '\n' + u'原始用户: ' +
                                 original_user + '\n' + u'转发内容: ' +
                                 weibo_content)
            else:
                weibo_content = (retweet_reason + '\n' + u'转发内容: ' +
                                 weibo_content)
            return weibo_content
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def is_original(self, info):
        """判断微博是否为原创微博"""
        is_original = info.xpath("div/span[@class='cmt']")
        if len(is_original) > 3:
            return False
        else:
            return True

    def get_weibo_content(self, info, is_original):
        """获取微博内容"""
        try:
            weibo_id = info.xpath('@id')[0][2:]
            if is_original:
                weibo_content = self.get_original_weibo(info, weibo_id)
            else:
                weibo_content = self.get_retweet(info, weibo_id)
            return weibo_content
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_user_id(self, info):
        """获取用户id"""
        user_id = info.xpath("div[1]/a[@class='nk']/@href")[0].split('/')[-1]
        return user_id

    def get_nickname(self, info):
        """获取用户昵称"""
        nickname = info.xpath("div[1]/a[@class='nk']/text()")[0]
        return nickname

    def get_publish_place(self, info):
        """获取微博发布位置"""
        try:
            div_first = info.xpath('div')[0]
            a_list = div_first.xpath('a')
            publish_place = u'无'
            for a in a_list:
                if ('place.weibo.com' in a.xpath('@href')[0]
                        and a.xpath('text()')[0] == u'显示地图'):
                    weibo_a = div_first.xpath("span[@class='ctt']/a")
                    if len(weibo_a) >= 1:
                        publish_place = weibo_a[-1]
                        if (u'视频' == div_first.xpath(
                                "span[@class='ctt']/a/text()")[-1][-2:]):
                            if len(weibo_a) >= 2:
                                publish_place = weibo_a[-2]
                            else:
                                publish_place = u'无'
                        publish_place = self.handle_garbled(publish_place)
                        break
            return publish_place
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_publish_time(self, info):
        """获取微博发布时间"""
        try:
            str_time = info.xpath("div/span[@class='ct']")
            str_time = self.handle_garbled(str_time[0])
            publish_time = str_time.split(u'来自')[0]
            if u'刚刚' in publish_time:
                publish_time = datetime.now().strftime('%Y-%m-%d %H:%M')
            elif u'分钟' in publish_time:
                minute = publish_time[:publish_time.find(u'分钟')]
                minute = timedelta(minutes=int(minute))
                publish_time = (datetime.now() -
                                minute).strftime('%Y-%m-%d %H:%M')
            elif u'今天' in publish_time:
                today = datetime.now().strftime('%Y-%m-%d')
                time = publish_time[3:]
                publish_time = today + ' ' + time
                if len(publish_time) > 16:
                    publish_time = publish_time[:16]
            elif u'月' in publish_time:
                year = datetime.now().strftime('%Y')
                month = publish_time[0:2]
                day = publish_time[3:5]
                time = publish_time[7:12]
                publish_time = year + '-' + month + '-' + day + ' ' + time
            else:
                publish_time = publish_time[:16]
            return publish_time
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_publish_tool(self, info):
        """获取微博发布工具"""
        try:
            str_time = info.xpath("div/span[@class='ct']")
            str_time = self.handle_garbled(str_time[0])
            if len(str_time.split(u'来自')) > 1:
                publish_tool = str_time.split(u'来自')[1]
            else:
                publish_tool = u'无'
            return publish_tool
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_weibo_footer(self, info):
        """获取微博点赞数、转发数、评论数"""
        try:
            footer = {}
            pattern = r'\d+'
            str_footer = info.xpath('div')[-1]
            str_footer = self.handle_garbled(str_footer)
            str_footer = str_footer[str_footer.rfind(u'赞'):]
            weibo_footer = re.findall(pattern, str_footer, re.M)

            up_num = int(weibo_footer[0])
            footer['up_num'] = up_num

            retweet_num = int(weibo_footer[1])
            footer['retweet_num'] = retweet_num

            comment_num = int(weibo_footer[2])
            footer['comment_num'] = comment_num
            return footer
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def extract_picture_urls(self, info, weibo_id):
        """提取微博原始图片url"""
        try:
            a_list = info.xpath('div/a/@href')
            first_pic = 'https://weibo.cn/mblog/pic/' + weibo_id + '?rl=1'
            all_pic = 'https://weibo.cn/mblog/picAll/' + weibo_id + '?rl=2'
            picture_urls = u'无'
            if first_pic in a_list:
                if all_pic in a_list:
                    selector = self.handle_html(all_pic)
                    preview_picture_list = selector.xpath('//img/@src')
                    picture_list = [
                        p.replace('/thumb180/', '/large/')
                        for p in preview_picture_list
                    ]
                    picture_urls = ','.join(picture_list)
                else:
                    if info.xpath('.//img/@src'):
                        for link in info.xpath('div/a'):
                            if len(link.xpath('@href')) > 0:
                                if first_pic == link.xpath('@href')[0]:
                                    if len(link.xpath('img/@src')) > 0:
                                        preview_picture = link.xpath(
                                            'img/@src')[0]
                                        picture_urls = preview_picture.replace(
                                            '/wap180/', '/large/')
                                        break
                    else:
                        sys.exit(
                            u"爬虫微博可能被设置成了'不显示图片',请前往"
                            u"'https://weibo.cn/account/customize/pic',修改为'显示'"
                        )
            return picture_urls
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()
            return u'无'

    def get_picture_urls(self, info, is_original):
        """获取微博原始图片url"""
        try:
            weibo_id = info.xpath('@id')[0][2:]
            picture_urls = {}
            if is_original:
                original_pictures = self.extract_picture_urls(info, weibo_id)
                picture_urls['original_pictures'] = original_pictures
                if not self.filter:
                    picture_urls['retweet_pictures'] = u'无'
            else:
                retweet_url = info.xpath("div/a[@class='cc']/@href")[0]
                retweet_id = retweet_url.split('/')[-1].split('?')[0]
                retweet_pictures = self.extract_picture_urls(info, retweet_id)
                picture_urls['retweet_pictures'] = retweet_pictures
                a_list = info.xpath('div[last()]/a/@href')
                original_picture = u'无'
                for a in a_list:
                    if a.endswith(('.gif', '.jpeg', '.jpg', '.png')):
                        original_picture = a
                        break
                picture_urls['original_pictures'] = original_picture
            return picture_urls
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_video_url(self, info, is_original):
        """获取微博视频url"""
        try:
            if is_original:
                div_first = info.xpath('div')[0]
                a_list = div_first.xpath('.//a')
                video_link = u'无'
                for a in a_list:
                    if 'm.weibo.cn/s/video/show?object_id=' in a.xpath(
                            '@href')[0]:
                        video_link = a.xpath('@href')[0]
                        break
                if video_link != u'无':
                    video_link = video_link.replace(
                        'm.weibo.cn/s/video/show', 'm.weibo.cn/s/video/object')
                    wb_info = requests.get(video_link,
                                           cookies=self.cookie).json()
                    video_url = wb_info['data']['object']['stream'].get(
                        'hd_url')
                    if not video_url:
                        video_url = wb_info['data']['object']['stream']['url']
                        if not video_url:  # 说明该视频为直播
                            video_url = u'无'
            else:
                video_url = u'无'
            return video_url
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()
            return u'无'

    def download_one_file(self, url, file_path, type, weibo_id):
        """下载单个文件(图片/视频)"""
        try:
            if not os.path.isfile(file_path):
                s = requests.Session()
                s.mount(url, HTTPAdapter(max_retries=5))
                downloaded = s.get(url, timeout=(5, 10))
                with open(file_path, 'wb') as f:
                    f.write(downloaded.content)
        except Exception as e:
            error_file = self.get_filepath(
                type) + os.sep + 'not_downloaded.txt'
            with open(error_file, 'ab') as f:
                url = weibo_id + ':' + url + '\n'
                f.write(url.encode(sys.stdout.encoding))
            print('Error: ', e)
            traceback.print_exc()

    def handle_download(self, file_type, file_dir, urls, w):
        """处理下载相关操作"""
        file_prefix = w['publish_time'][:11].replace('-', '') + '_' + w['id']
        if file_type == 'img':
            if ',' in urls:
                url_list = urls.split(',')
                for i, url in enumerate(url_list):
                    file_suffix = url[url.rfind('.'):]
                    file_name = file_prefix + '_' + str(i + 1) + file_suffix
                    file_path = file_dir + os.sep + file_name
                    self.download_one_file(url, file_path, file_type, w['id'])
            else:
                file_suffix = urls[urls.rfind('.'):]
                file_name = file_prefix + file_suffix
                file_path = file_dir + os.sep + file_name
                self.download_one_file(urls, file_path, file_type, w['id'])
        else:
            file_suffix = '.mp4'
            file_name = file_prefix + file_suffix
            file_path = file_dir + os.sep + file_name
            self.download_one_file(urls, file_path, file_type, w['id'])

    def download_files(self, file_type, wrote_num):
        """下载文件(图片/视频)"""
        try:
            if file_type == 'img':
                describe = u'图片'
                key = 'original_pictures'
            else:
                describe = u'视频'
                key = 'video_url'
            print(u'即将进行%s下载' % describe)
            file_dir = self.get_filepath(file_type)
            for w in tqdm(self.weibo[wrote_num:], desc='Download progress'):
                if w[key] != u'无':
                    self.handle_download(file_type, file_dir, w[key], w)
            print(u'%s下载完毕,保存路径:' % describe)
            print(file_dir)
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_one_weibo(self, info):
        """获取一条微博的全部信息"""
        try:
            weibo = OrderedDict()
            is_original = self.is_original(info)
            if (not self.filter) or is_original:
                weibo['id'] = info.xpath('@id')[0][2:]
                weibo['user_id'] = self.get_user_id(info)
                weibo['nickname'] = self.get_nickname(info)
                weibo['content'] = self.get_weibo_content(info,
                                                          is_original)  # 微博内容
                picture_urls = self.get_picture_urls(info, is_original)
                weibo['original_pictures'] = picture_urls[
                    'original_pictures']  # 原创图片url
                if not self.filter:
                    weibo['retweet_pictures'] = picture_urls[
                        'retweet_pictures']  # 转发图片url
                    weibo['original'] = is_original  # 是否原创微博
                weibo['video_url'] = self.get_video_url(info,
                                                        is_original)  # 微博视频url
                weibo['publish_place'] = self.get_publish_place(info)  # 微博发布位置
                weibo['publish_time'] = self.get_publish_time(info)  # 微博发布时间
                weibo['publish_tool'] = self.get_publish_tool(info)  # 微博发布工具
                footer = self.get_weibo_footer(info)
                weibo['up_num'] = footer['up_num']  # 微博点赞数
                weibo['retweet_num'] = footer['retweet_num']  # 转发数
                weibo['comment_num'] = footer['comment_num']  # 评论数
            else:
                weibo = None
            return weibo
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def print_one_weibo(self, weibo):
        """打印一条微博"""
        print(weibo['nickname'] + ':' + weibo['content'])
        print(u'微博发布位置:%s' % weibo['publish_place'])
        print(u'微博发布时间:%s' % weibo['publish_time'])
        print(u'微博发布工具:%s' % weibo['publish_tool'])
        print(u'点赞数:%d' % weibo['up_num'])
        print(u'转发数:%d' % weibo['retweet_num'])
        print(u'评论数:%d' % weibo['comment_num'])
        print(u'url:https://weibo.cn/comment/%s' % weibo['id'])

    def is_pinned_weibo(self, info):
        """判断微博是否为置顶微博"""
        kt = info.xpath(".//span[@class='kt']/text()")
        if kt and kt[0] == u'置顶':
            return True
        else:
            return False

    def get_one_page(self, page):
        """获取第page页的全部微博"""
        try:
            start_time = self.start_time.replace('-', '')
            url = 'https://weibo.cn/search/mblog?hideSearchFrame=&keyword=%s&advancedfilter=1&starttime=%s&endtime=%s&sort=time&page=%d' % (
                self.keyword, start_time, start_time, page)
            selector = self.handle_html(url)
            info = selector.xpath("//div[@class='c']")[3:-2]
            if info:
                is_exist = info[0].xpath("//div/span[@class='ctt']")
                if is_exist:
                    for i in range(len(info)):
                        weibo = self.get_one_weibo(info[i])
                        if weibo:
                            if weibo['id'] in self.weibo_id_list:
                                continue
                            self.print_one_weibo(weibo)
                            self.weibo.append(weibo)
                            self.weibo_id_list.append(weibo['id'])
                            self.got_num += 1
                            print('-' * 100)
                print(u'{}已获取{}的第{}页微博{}'.format('-' * 30, self.keyword, page,
                                                 '-' * 30))
            else:
                print(u'\n%s 该关键字没有搜索结果或cookie无效' % self.start_time)
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_filepath(self, type):
        """获取结果文件路径"""
        try:
            file_dir = os.path.split(os.path.realpath(
                __file__))[0] + os.sep + 'weibo' + os.sep + self.keyword
            if type == 'img' or type == 'video':
                file_dir = file_dir + os.sep + type
            if not os.path.isdir(file_dir):
                os.makedirs(file_dir)
            if type == 'img' or type == 'video':
                return file_dir
            file_path = file_dir + os.sep + self.keyword + '.' + type
            return file_path
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def write_log(self):
        """当程序因cookie过期停止运行时,将相关信息写入log.txt"""
        file_dir = os.path.split(
            os.path.realpath(__file__))[0] + os.sep + 'weibo' + os.sep
        if not os.path.isdir(file_dir):
            os.makedirs(file_dir)
        file_path = file_dir + 'log.txt'
        content = u'cookie已过期,从%s到今天的微博获取失败,请重新设置cookie\n' % self.start_time
        with open(file_path, 'ab') as f:
            f.write(content.encode(sys.stdout.encoding))

    def write_csv(self, wrote_num):
        """将爬取的信息写入csv文件"""
        try:
            result_headers = [
                '微博id',
                '用户id',
                '用户昵称',
                '微博正文',
                '原始图片url',
                '微博视频url',
                '发布位置',
                '发布时间',
                '发布工具',
                '点赞数',
                '转发数',
                '评论数',
            ]
            if not self.filter:
                result_headers.insert(5, '被转发微博原始图片url')
                result_headers.insert(6, '是否为原创微博')
            result_data = [w.values() for w in self.weibo[wrote_num:]]
            if sys.version < '3':  # python2.x
                reload(sys)
                sys.setdefaultencoding('utf-8')
                with open(self.get_filepath('csv'), 'ab') as f:
                    f.write(codecs.BOM_UTF8)
                    writer = csv.writer(f)
                    if wrote_num == 0:
                        writer.writerows([result_headers])
                    writer.writerows(result_data)
            else:  # python3.x
                with open(self.get_filepath('csv'),
                          'a',
                          encoding='utf-8-sig',
                          newline='') as f:
                    writer = csv.writer(f)
                    if wrote_num == 0:
                        writer.writerows([result_headers])
                    writer.writerows(result_data)
            print(u'%d条微博写入csv文件完毕,保存路径:' % self.got_num)
            print(self.get_filepath('csv'))
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def write_user_id(self, wrote_num):
        """将user_id写入user_id_list.txt文件"""
        file_dir = os.path.split(os.path.realpath(
            __file__))[0] + os.sep + 'weibo' + os.sep + self.keyword
        if not os.path.isdir(file_dir):
            os.makedirs(file_dir)
        file_path = file_dir + os.sep + 'user_id_list.txt'
        with open(file_path, 'ab') as f:
            for w in self.weibo[wrote_num:]:
                f.write((w['user_id'] + ' ' + w['nickname'] + '\n').encode(
                    sys.stdout.encoding))
        print(u'user_id_list.txt文件写入完毕,保存路径:')
        print(file_path)

    def write_txt(self, wrote_num):
        """将爬取的信息写入txt文件"""
        try:
            temp_result = []
            if wrote_num == 0:
                if self.filter:
                    result_header = u'\n\n原创微博内容: \n'
                else:
                    result_header = u'\n\n微博内容: \n'
                temp_result.append(result_header)
            for i, w in enumerate(self.weibo[wrote_num:]):
                temp_result.append(
                    str(wrote_num + i + 1) + ':' + w['content'] + '\n' +
                    u'微博位置: ' + w['publish_place'] + '\n' + u'发布时间: ' +
                    w['publish_time'] + '\n' + u'点赞数: ' + str(w['up_num']) +
                    u'   转发数: ' + str(w['retweet_num']) + u'   评论数: ' +
                    str(w['comment_num']) + '\n' + u'发布工具: ' +
                    w['publish_tool'] + '\n\n')
            result = ''.join(temp_result)
            with open(self.get_filepath('txt'), 'ab') as f:
                f.write(result.encode(sys.stdout.encoding))
            print(u'%d条微博写入txt文件完毕,保存路径:' % self.got_num)
            print(self.get_filepath('txt'))
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def update_json_data(self, data, weibo_info):
        """更新要写入json结果文件中的数据,已经存在于json中的信息更新为最新值,不存在的信息添加到data中"""
        # data['user'] = self.user
        if data.get('weibo'):
            is_new = 1  # 待写入微博是否全部为新微博,即待写入微博与json中的数据不重复
            for old in data['weibo']:
                if weibo_info[-1]['id'] == old['id']:
                    is_new = 0
                    break
            if is_new == 0:
                for new in weibo_info:
                    flag = 1
                    for i, old in enumerate(data['weibo']):
                        if new['id'] == old['id']:
                            data['weibo'][i] = new
                            flag = 0
                            break
                    if flag:
                        data['weibo'].append(new)
            else:
                data['weibo'] += weibo_info
        else:
            data['weibo'] = weibo_info
        return data

    def write_json(self, wrote_num):
        """将爬到的信息写入json文件"""
        data = {}
        path = self.get_filepath('json')
        if os.path.isfile(path):
            with codecs.open(path, 'r', encoding="utf-8") as f:
                data = json.load(f)
        weibo_info = self.weibo[wrote_num:]
        data = self.update_json_data(data, weibo_info)
        with codecs.open(path, 'w', encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False)
        print(u'%d条微博写入json文件完毕,保存路径:' % self.got_num)
        print(path)

    def info_to_mongodb(self, collection, info_list):
        """将爬取的信息写入MongoDB数据库"""
        try:
            import pymongo
        except ImportError:
            sys.exit(u'系统中可能没有安装pymongo库,请先运行 pip install pymongo ,再运行程序')
        try:
            from pymongo import MongoClient
            client = MongoClient()
            db = client['weibo']
            collection = db[collection]
            if len(self.write_mode) > 1:
                new_info_list = copy.deepcopy(info_list)
            else:
                new_info_list = info_list
            for info in new_info_list:
                if not collection.find_one({'id': info['id']}):
                    collection.insert_one(info)
                else:
                    collection.update_one({'id': info['id']}, {'$set': info})
        except pymongo.errors.ServerSelectionTimeoutError:
            sys.exit(u'系统中可能没有安装或启动MongoDB数据库,请先根据系统环境安装或启动MongoDB,再运行程序')

    def weibo_to_mongodb(self, wrote_num):
        """将爬取的微博信息写入MongoDB数据库"""
        weibo_list = []
        for w in self.weibo[wrote_num:]:
            # w['user_id'] = self.user_config['user_id']
            weibo_list.append(w)
        self.info_to_mongodb('weibo', weibo_list)
        print(u'%d条微博写入MongoDB数据库完毕' % self.got_num)

    def mysql_create(self, connection, sql):
        """创建MySQL数据库或表"""
        try:
            with connection.cursor() as cursor:
                cursor.execute(sql)
        finally:
            connection.close()

    def mysql_create_database(self, mysql_config, sql):
        """创建MySQL数据库"""
        try:
            import pymysql
        except ImportError:
            sys.exit(u'系统中可能没有安装pymysql库,请先运行 pip install pymysql ,再运行程序')
        try:
            if self.mysql_config:
                mysql_config = self.mysql_config
            connection = pymysql.connect(**mysql_config)
            self.mysql_create(connection, sql)
        except pymysql.OperationalError:
            sys.exit(u'系统中可能没有安装或正确配置MySQL数据库,请先根据系统环境安装或配置MySQL,再运行程序')

    def mysql_create_table(self, mysql_config, sql):
        """创建MySQL表"""
        import pymysql

        if self.mysql_config:
            mysql_config = self.mysql_config
        mysql_config['db'] = 'weibo'
        connection = pymysql.connect(**mysql_config)
        self.mysql_create(connection, sql)

    def mysql_insert(self, mysql_config, table, data_list):
        """向MySQL表插入或更新数据"""
        import pymysql

        if len(data_list) > 0:
            keys = ', '.join(data_list[0].keys())
            values = ', '.join(['%s'] * len(data_list[0]))
            if self.mysql_config:
                mysql_config = self.mysql_config
            mysql_config['db'] = 'weibo'
            connection = pymysql.connect(**mysql_config)
            cursor = connection.cursor()
            sql = """INSERT INTO {table}({keys}) VALUES ({values}) ON
                     DUPLICATE KEY UPDATE""".format(table=table,
                                                    keys=keys,
                                                    values=values)
            update = ','.join([
                " {key} = values({key})".format(key=key)
                for key in data_list[0]
            ])
            sql += update
            try:
                cursor.executemany(
                    sql, [tuple(data.values()) for data in data_list])
                connection.commit()
            except Exception as e:
                connection.rollback()
                print('Error: ', e)
                traceback.print_exc()
            finally:
                connection.close()

    def weibo_to_mysql(self, wrote_num):
        """将爬取的微博信息写入MySQL数据库"""
        mysql_config = {
            'host': 'localhost',
            'port': 3306,
            'user': 'root',
            'password': '123456',
            'charset': 'utf8mb4'
        }
        # 创建'weibo'数据库
        create_database = """CREATE DATABASE IF NOT EXISTS weibo DEFAULT
                         CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci"""
        self.mysql_create_database(mysql_config, create_database)

        # 创建'weibo'表
        create_table = """
                CREATE TABLE IF NOT EXISTS weibo (
                id varchar(10) NOT NULL,
                user_id varchar(20),
                nickname varchar(100),
                content varchar(2000),
                original_pictures varchar(3000),
                retweet_pictures varchar(3000),
                original BOOLEAN NOT NULL DEFAULT 1,
                video_url varchar(300),
                publish_place varchar(100),
                publish_time DATETIME NOT NULL,
                publish_tool varchar(30),
                up_num INT NOT NULL,
                retweet_num INT NOT NULL,
                comment_num INT NOT NULL,
                PRIMARY KEY (id)
                ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4"""
        self.mysql_create_table(mysql_config, create_table)
        # 在'weibo'表中插入或更新微博数据
        weibo_list = []
        if len(self.write_mode) > 1:
            info_list = copy.deepcopy(self.weibo[wrote_num:])
        else:
            info_list = self.weibo[wrote_num:]
        for weibo in info_list:
            # weibo['user_id'] = self.user_config['user_id']
            weibo_list.append(weibo)
        self.mysql_insert(mysql_config, 'weibo', weibo_list)
        print(u'%d条微博写入MySQL数据库完毕' % self.got_num)

    def update_user_config_file(self, user_config_file_path):
        """更新用户配置文件"""
        with open(user_config_file_path, 'rb') as f:
            lines = f.read().splitlines()
            lines = [line.decode('utf-8-sig') for line in lines]
            for i, line in enumerate(lines):
                info = line.split(' ')
                if len(info) > 0 and info[0].isdigit():
                    if self.user_config['user_uri'] == info[0]:
                        if len(info) == 1:
                            info.append(self.user['nickname'])
                            info.append(self.start_time)
                        if len(info) == 2:
                            info.append(self.start_time)
                        if len(info) > 3 and self.is_date(info[2] + ' ' +
                                                          info[3]):
                            del info[3]
                        if len(info) > 2:
                            info[2] = self.start_time
                        lines[i] = ' '.join(info)
                        break
        with codecs.open(user_config_file_path, 'w', encoding='utf-8') as f:
            f.write('\n'.join(lines))

    def write_data(self, wrote_num):
        """将爬取到的信息写入文件或数据库"""
        if self.got_num > wrote_num:
            self.write_user_id(wrote_num)
            if 'csv' in self.write_mode:
                self.write_csv(wrote_num)
            if 'txt' in self.write_mode:
                self.write_txt(wrote_num)
            if 'json' in self.write_mode:
                self.write_json(wrote_num)
            if 'mysql' in self.write_mode:
                self.weibo_to_mysql(wrote_num)
            if 'mongo' in self.write_mode:
                self.weibo_to_mongodb(wrote_num)
            if self.pic_download == 1:
                self.download_files('img', wrote_num)
            if self.video_download == 1:
                self.download_files('video', wrote_num)

    def get_weibo_info(self):
        """获取微博信息"""
        try:
            start_time = self.start_time.replace('-', '')
            url = 'https://weibo.cn/search/mblog?hideSearchFrame=&keyword=%s&advancedfilter=1&starttime=%s&endtime=%s&sort=time' % (
                self.keyword, start_time, start_time)
            selector = self.handle_html(url)
            page_num = self.get_page_num(selector)  # 获取微博总页数
            wrote_num = 0
            page1 = 0
            random_pages = random.randint(1, 5)
            for page in tqdm(range(1, page_num + 1), desc='Progress'):
                is_end = self.get_one_page(page)  # 获取第page页的全部微博
                if is_end:
                    break

                if page % 20 == 0:  # 每爬20页写入一次文件
                    self.write_data(wrote_num)
                    wrote_num = self.got_num

                # 通过加入随机等待避免被限制。爬虫速度过快容易被系统限制(一段时间后限
                # 制会自动解除),加入随机等待模拟人的操作,可降低被系统限制的风险。默
                # 认是每爬取1到5页随机等待6到10秒,如果仍然被限,可适当增加sleep时间
                if (page - page1) % random_pages == 0 and page < page_num:
                    sleep(random.randint(6, 10))
                    page1 = page
                    random_pages = random.randint(1, 5)

            self.write_data(wrote_num)  # 将剩余不足20页的微博写入文件
            if not self.filter:
                print(u'共爬取' + str(self.got_num) + u'条微博')
            else:
                print(u'共爬取' + str(self.got_num) + u'条原创微博')
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()

    def get_keyword_list(self, file_name):
        """获取文件中的keyword"""
        with open(file_name, 'rb') as f:
            try:
                lines = f.read().splitlines()
                lines = [line.decode('utf-8-sig') for line in lines]
            except UnicodeDecodeError:
                sys.exit(u'%s文件应为utf-8编码,请先将文件编码转为utf-8再运行程序' % file_name)
            keyword_list = []
            for line in lines:
                info = line.split(' ')
                if len(info) > 0:
                    keyword_list.append(info[0])
        return keyword_list

    def initialize_info(self, keyword):
        """初始化爬虫信息"""
        self.got_num = 0
        self.weibo = []
        self.keyword = keyword
        self.weibo_id_list = []

    def start(self):
        """运行爬虫"""
        try:
            first_start_time = self.start_time  # 记录配置中的起始日期,保证每个keyword都从同一天开始爬取
            for keyword in self.keyword_list:
                self.start_time = first_start_time
                start_time = datetime.strptime(self.start_time, '%Y-%m-%d')
                end_time = datetime.strptime(self.end_time, '%Y-%m-%d')
                while (start_time <= end_time):
                    self.initialize_info(keyword)
                    print('*' * 100)
                    self.get_weibo_info()
                    print(u'信息抓取完毕')
                    print('*' * 100)
                    start_time = start_time + timedelta(days=1)
                    self.start_time = start_time.strftime('%Y-%m-%d')
        except Exception as e:
            print('Error: ', e)
            traceback.print_exc()


def main():
    try:
        config_path = os.path.split(
            os.path.realpath(__file__))[0] + os.sep + 'config_search.json'
        if not os.path.isfile(config_path):
            sys.exit(u'当前路径:%s 不存在配置文件config_search.json' %
                     (os.path.split(os.path.realpath(__file__))[0] + os.sep))
        with codecs.open(config_path, 'r', encoding="utf-8") as f:
            try:
                config = json.loads(f.read())
            except ValueError:
                sys.exit(u'config_search.json 格式不正确,请参考 '
                         u'https://github.com/dataabc/weiboSpider#3程序设置')
        wb = Weibo(config)
        wb.start()  # 爬取微博信息
    except Exception as e:
        print('Error: ', e)
        traceback.print_exc()


if __name__ == '__main__':
    main()

@Is-Sakura-i

@dataabc Thanks. The original code works now. I then ran the new version you wrote and got an index-out-of-range error. Since my computer doesn't have MongoDB or MySQL, I deleted the related code; in config_search.json I only changed the cookie and the keyword and left everything else unchanged.
File "D:/pyCharm/weiboSpider-master/search.py", line 539, in get_one_page
is_exist = info[0].xpath("//div/span[@Class='ctt']")
IndexError: list index out of range

@puppyK

puppyK commented Mar 11, 2020

Thank you so much! You even prepared the configurable parameters! That's really friendly for a complete beginner like me.

@dataabc
Owner

dataabc commented Mar 11, 2020

@Is-Sakura-i
I tested it and it runs fine. The error has two possible causes:
1. The search keyword returns no results; try a different keyword.
2. The cookie is invalid; updating the cookie fixes it.

@dataabc
Owner

dataabc commented Mar 11, 2020

@puppyK
Glad it helps. If you run into any problems with the program, feedback is welcome.

@puppyK

puppyK commented Mar 11, 2020

@puppyK
Glad it helps. If you run into any problems with the program, feedback is welcome.

The program runs without problems! It's just that the user name and the post text end up in the same Excel cell. It would also be great if the uid could be crawled!
For example: (沧海一粟nhs:运动不是一蹴而就的比拼,而是执着于每一瞬间的坚持!日行15000步的我已激活了一枚暴走勋章! #运动就是坚持# 你也来试试吧微博运动) — here the user name and the post text share one Excel cell.

@dataabc
Owner

dataabc commented Mar 11, 2020

@puppyK
I've updated the code above; it can now get the user name as well.

@barnett2010

Thanks! Does the latest search.py now advance the date by one day automatically while searching?
That is, it finishes crawling one day and then moves on to the next. Is this the line that controls it?
start_time = start_time + timedelta(days=1)

@dataabc
Owner

dataabc commented Mar 11, 2020

@barnett2010
config_search.json has two parameters, start_time and end_time, in yyyy-mm-dd format. The program crawls one day, then moves on to the next. For example, if start_time is 2020-01-30, the first pass crawls the weibo of 2020-01-30; the second pass adds one day and gets 2020-01-31; the third pass adds another day, which is converted to the valid date 2020-02-01, crawls that, and so on. The line you mentioned adds the day; the line below it turns the result back into a valid date string.
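
For reference, a config_search.json along these lines should work. The start_time, end_time, filter, write_mode, pic_download and video_download fields all appear in this thread; the exact names of the keyword and cookie fields (written here as keyword_list and cookie) are assumptions based on the script above, so check them against the config file shipped with it:

{
    "keyword_list": ["迪丽热巴"],
    "cookie": "your cookie",
    "filter": 1,
    "start_time": "2020-03-01",
    "end_time": "2020-03-05",
    "write_mode": ["csv", "txt"],
    "pic_download": 0,
    "video_download": 0
}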

@puppyK

puppyK commented Mar 12, 2020

Thanks, the new code runs successfully, and I see you also changed the default from crawling 10 pages to crawling all of that day's posts! My friend and I had already tried splitting the user name from the post text. One more question: if I have the 1000 users who took part in a topic and I want to crawl all of those 1000 users' weibo as well, how can I do that?

@dataabc
Owner

dataabc commented Mar 12, 2020

@puppyK
That works now.
The user_ids are collected and saved to user_id_list.txt. With that file, this project can fetch those users' profiles and weibo, or fetch only their profiles without the weibo.
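
Judging from write_user_id in the script above, each line of the generated user_id_list.txt is a user_id, a space, and the nickname, e.g. (illustrative values):

1669879400 Dear-迪丽热巴
1234567890 某用户昵称

That file can then be used as weiboSpider's user_id_list to crawl those users.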

@Is-Sakura-i

@dataabc Hello. The error I reported yesterday was indeed caused by there being no search results. Thank you very much! One small suggestion: when a day has no search results, it might be nice to notify the user... Really, thank you so much. You are the most helpful person I've met since I started learning to program, highly skilled and very patient. Thanks!

@dataabc
Owner

dataabc commented Mar 12, 2020

@Is-Sakura-i
I haven't tested that, but the limits differ for different kinds of weibo; entertainment content, for instance, is limited more leniently. If you do get limited, you can modify the code to slow the crawler down.
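
If you do get limited, one possible tweak (just a sketch against the random-wait block in get_weibo_info above; the numbers are arbitrary) is to replace the every-few-pages wait with a longer wait after every page:

                # wait 10-15 seconds after every page; tune the numbers as needed
                if page < page_num:
                    sleep(random.randint(10, 15))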

@Is-Sakura-i

@dataabc Got it! Thanks! Get some rest!

@barnett2010

"write_mode": ["csv","txt","json","mysql"],
I want the results saved into the database.
I changed only this line, but it didn't take effect. Do I need to change anything else?
The port and password are correct.
The weibo database already exists in MySQL; it was created by an earlier weiboSpider run.

------- Error message

Error: (1060, "Duplicate column name 'user_id'")
Traceback (most recent call last):
File "search4.py", line 923, in get_weibo_info
self.write_data(wrote_num)
File "search4.py", line 898, in write_data
self.weibo_to_mysql(wrote_num)
File "search4.py", line 850, in weibo_to_mysql
self.mysql_create_table(mysql_config, create_table)
File "search4.py", line 787, in mysql_create_table
self.mysql_create(connection, sql)
File "search4.py", line 761, in mysql_create
cursor.execute(sql)
File "C:\Python37\lib\site-packages\pymysql\cursors.py", line 170, in execute
result = self._query(query)
File "C:\Python37\lib\site-packages\pymysql\cursors.py", line 328, in _query
conn.query(q)
File "C:\Python37\lib\site-packages\pymysql\connections.py", line 517, in query
self._affected_rows = self._read_query_result(unbuffered=unbuffered)
File "C:\Python37\lib\site-packages\pymysql\connections.py", line 732, in _read_query_result
result.read()
File "C:\Python37\lib\site-packages\pymysql\connections.py", line 1075, in read
first_packet = self.connection._read_packet()
File "C:\Python37\lib\site-packages\pymysql\connections.py", line 684, in _read_packet
packet.check_error()
File "C:\Python37\lib\site-packages\pymysql\protocol.py", line 220, in check_error
err.raise_mysql_exception(self.data)
File "C:\Python37\lib\site-packages\pymysql\err.py", line 109, in raise_mysql_exception
raise errorclass(errno, errval)
pymysql.err.InternalError: (1060, "Duplicate column name 'user_id'")
信息抓取完毕



@barnett2010

Error: (1049, "Unknown database 'weibo'")
Traceback (most recent call last):
File "search4.py", line 923, in get_weibo_info
self.write_data(wrote_num)
File "search4.py", line 898, in write_data
self.weibo_to_mysql(wrote_num)
File "search4.py", line 850, in weibo_to_mysql
self.mysql_create_table(mysql_config, create_table)
File "search4.py", line 786, in mysql_create_table
connection = pymysql.connect(**mysql_config)
File "C:\Python37\lib\site-packages\pymysql\__init__.py", line 94, in Connect
return Connection(*args, **kwargs)
File "C:\Python37\lib\site-packages\pymysql\connections.py", line 325, in __init__
self.connect()
File "C:\Python37\lib\site-packages\pymysql\connections.py", line 599, in connect
self._request_authentication()
File "C:\Python37\lib\site-packages\pymysql\connections.py", line 861, in _request_authentication
auth_packet = self._read_packet()
File "C:\Python37\lib\site-packages\pymysql\connections.py", line 684, in _read_packet
packet.check_error()
File "C:\Python37\lib\site-packages\pymysql\protocol.py", line 220, in check_error
err.raise_mysql_exception(self.data)
File "C:\Python37\lib\site-packages\pymysql\err.py", line 109, in raise_mysql_exception
raise errorclass(errno, errval)
pymysql.err.InternalError: (1049, "Unknown database 'weibo'")
信息抓取完毕


The error above occurs when the weibo database does not already exist in MySQL.

@barnett2010

I'm using wampserver314, which sets up a PHP/MySQL/Apache environment on Windows in one click.
Writing to MySQL with weiboSpider works fine. I then made the same changes to the search config, but it keeps failing.

@dataabc
Owner

dataabc commented Mar 13, 2020

@barnett2010
I've fixed this problem in the code above; it should work now.

@barnett2010

@dataabc Thanks for the reply, and thanks for the effort.
I tried the new script right away:
# 创建'weibo'数据库
create_database = """CREATE DATABASE IF NOT EXISTS weibo DEFAULT
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci"""
self.mysql_create_database(mysql_config, create_database)

After the crawl finishes, the csv is generated correctly.
A weibo database now appears in MySQL, but it is empty.

============================

"filter": 1,
"write_mode": [
"csv",
"txt",
"json",
"mysql"
],
"pic_download": 0,
"video_download": 0,

E:\e\weibohuati\new\weibo\迪丽热巴\迪丽热巴.json
Error: (1060, "Duplicate column name 'user_id'")
Traceback (most recent call last):
File "search5.py", line 928, in get_weibo_info
self.write_data(wrote_num)
File "search5.py", line 903, in write_data
self.weibo_to_mysql(wrote_num)
File "search5.py", line 855, in weibo_to_mysql
self.mysql_create_table(mysql_config, create_table)
File "search5.py", line 787, in mysql_create_table
self.mysql_create(connection, sql)
File "search5.py", line 761, in mysql_create
cursor.execute(sql)
File "C:\Python37\lib\site-packages\pymysql\cursors.py", line 170, in execute
result = self._query(query)
File "C:\Python37\lib\site-packages\pymysql\cursors.py", line 328, in _query
conn.query(q)
File "C:\Python37\lib\site-packages\pymysql\connections.py", line 517, in query
self._affected_rows = self._read_query_result(unbuffered=unbuffered)
File "C:\Python37\lib\site-packages\pymysql\connections.py", line 732, in _read_query_result
result.read()
File "C:\Python37\lib\site-packages\pymysql\connections.py", line 1075, in read
first_packet = self.connection._read_packet()
File "C:\Python37\lib\site-packages\pymysql\connections.py", line 684, in _read_packet
packet.check_error()
File "C:\Python37\lib\site-packages\pymysql\protocol.py", line 220, in check_error
err.raise_mysql_exception(self.data)
File "C:\Python37\lib\site-packages\pymysql\err.py", line 109, in raise_mysql_exception
raise errorclass(errno, errval)
pymysql.err.InternalError: (1060, "Duplicate column name 'user_id'")
信息抓取完毕



Progress: 19%|█████▉ | 19/100 [02:00<08:35, 6.36s/it]

@dataabc
Owner

dataabc commented Mar 13, 2020

@barnett2010
My mistake. I've removed the redundant column; it should work now.

@modun

modun commented Mar 13, 2020

/usr/lib/python2.7/site-packages/pymysql/cursors.py:170: Warning: (1007, u"Can't create database 'weibo'; database exists")
result = self._query(query)
/usr/lib/python2.7/site-packages/pymysql/cursors.py:170: Warning: (1050, u"Table 'user' already exists")
result = self._query(query)

When running it, it reports that the database already exists. Has anyone else run into this?

@dataabc
Owner

dataabc commented Mar 13, 2020

@modun
That's not an error; just ignore it.

@modun

modun commented Mar 13, 2020

@modun
That's not an error; just ignore it.

I see it has been updated now, thanks!

@barnett2010

barnett2010 commented Mar 13, 2020

@dataabc
Thanks! It works now.

Sure enough, removing this line fixed it: user_id varchar(12)

已获取迪丽热巴的第100页微博---------------------------

619条微博写入json文件完毕,保存路径:
E:\e\weibohuati\new\weibo\迪丽热巴\迪丽热巴.json
619条微博写入MySQL数据库完毕
Progress: 100%|██████████████████████████████| 100/100 [11:58<00:00, 7.18s/it]
共爬取619条原创微博
信息抓取完毕



@modun

modun commented Mar 13, 2020

After this update, do I need to redeploy, or is replacing the py file enough?

@barnett2010

barnett2010 commented Mar 13, 2020

==================
"start_time": "2010-01-01",
"end_time": "2010-12-21",

Thanks for all the work.
This may not count as a bug: I've found that if the dates are set to something recent, the crawl goes smoothly.
If the dates are set to several years ago, the message below appears and the crawler stops moving.
Is older content restricted more tightly?

====
Error: 'NoneType' object has no attribute 'xpath'
Traceback (most recent call last):
File "search6.py", line 124, in get_page_num
if selector.xpath("//input[@name='mp']") == []:
AttributeError: 'NoneType' object has no attribute 'xpath'
Error: unsupported operand type(s) for +: 'NoneType' and 'int'
Traceback (most recent call last):
File "search6.py", line 921, in get_weibo_info
for page in tqdm(range(1, page_num + 1), desc='Progress'):
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
信息抓取完毕

@dataabc
Owner

dataabc commented Mar 13, 2020

@modun
Just replacing the py file is enough.

@dataabc
Owner

dataabc commented Mar 13, 2020

@barnett2010
I'm not sure about that either; if it happens, try lowering the crawl speed and see.

@dataabc
Owner

dataabc commented Mar 15, 2020

@Is-Sakura-i
The cookie-free version can now write user information to a csv file. The generated file is in the weibo folder and is named users.csv.

@Is-Sakura-i

@dataabc Thank you very much for the heads-up! Sorry I'm only seeing it now. I just ran it and it works. Thanks!

@Amamiyatsuki

Amamiyatsuki commented Apr 6, 2020

(Here @Amamiyatsuki quoted dataabc's earlier reply in full: the standalone keyword-search version of the script, along with its usage note that your cookie and the keyword are set in the code, and that the # on both sides of a topic keyword must be written as %23, e.g. keyword = u'%23test%23'.)

@dataabc Hello, I'd like to ask: if I want to get the user id corresponding to each weibo, how should I modify this code?

@dataabc
Owner

dataabc commented Apr 6, 2020

@Amamiyatsuki
I've written a new project dedicated to keyword search, at https://github.com/dataabc/weibo-search; it includes the corresponding user id and every other field this program can get.
Advantages:
1. The information is fairly complete: the weibo's id, bid, the author's user_id and nickname, publish location, @-mentioned users, topics, comment count, repost count and retweet_id; if a weibo is a repost, both the repost and the reposted weibo are fetched, and downloading images and videos is supported.
2. It gets far more data, in theory about 10,000 times as much as the script above. The script above can only get about 1,000 weibo per date range, while with a hot keyword the new project can in theory get more than ten million per date range (one day).
Disadvantages:
1. It requires installing scrapy.
2. The README is still thin, so getting started may take a little time.
3. The project is not yet complete and is being updated.

@Amamiyatsuki

Thank you very much for your answer; these projects of yours are a great help to me.

@stale

stale bot commented Jun 14, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jun 14, 2020
@stale

stale bot commented Jun 21, 2020

Closing as stale, please reopen if you'd like to work on this further.

@stale stale bot closed this as completed Jun 21, 2020
@Ya-Hou

Ya-Hou commented May 18, 2023

Why can't the images crawled with requests in PyCharm be opened? They show "image not loaded, try opening it externally to fix the format problem!"

@dataabc
Owner

dataabc commented May 18, 2023

@Ya-Hou Is there any error message?

@Ya-Hou

Ya-Hou commented May 19, 2023

IMG20230519165738.jpg

That's all I have; the code doesn't report any errors. I can't open them with selenium either.

@dataabc
Owner

dataabc commented May 19, 2023

@Ya-Hou Try the cookie-free version and see.
