issues_bug_574: long Weibo posts could not be matched and fetched, attempted fix #575

Merged
merged 2 commits into from
Apr 27, 2024


@myshero myshero commented Apr 27, 2024

Fixed: long Weibo posts that have to be "expanded" to show their full text were not fetched correctly.

Improved: line breaks in a long Weibo post are now preserved. Each <br> tag in the long post's text is replaced with \n.
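To illustrate the idea in isolation, here is a minimal sketch that turns <br> tags into newlines while stripping every other tag. It deliberately uses the standard library's html.parser instead of the lxml calls used in this PR, so it runs without extra dependencies. The names BrToNewlineParser and html_to_text are hypothetical and do not exist in the project.

```python
import re
from html.parser import HTMLParser


class BrToNewlineParser(HTMLParser):
    """Collect text nodes and emit a newline for every <br> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <br/>, because the default
        # handle_startendtag delegates to handle_starttag.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Text inside any tag (links, spans, ...) is kept; the tags are dropped.
        self.parts.append(data)


def html_to_text(html_string):
    """Return the visible text of html_string with <br> rendered as a newline."""
    parser = BrToNewlineParser()
    parser.feed(html_string)
    parser.close()  # flush any buffered text
    text = "".join(parser.parts)
    # Collapse runs of consecutive newlines into a single newline.
    return re.sub(r"\n+", "\n", text)


if __name__ == "__main__":
    sample = '<span class="ctt">line one<br><br/>line <a href="#">two</a></span>'
    print(html_to_text(sample))
```

The sample markup is made up for the demonstration; real Weibo pages may contain other elements that need their own handling.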

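# Imports the method relies on (assumed to live at module level).
# handle_html, handle_garbled and logger are the project's own helpers.
import random
import re
from time import sleep

from lxml import etree
from lxml.html import fromstring


def get_long_weibo(self):
    """Fetch the full text of a long original Weibo post."""
    try:
        for i in range(5):
            self.selector = handle_html(self.cookie, self.url)
            if self.selector is not None:
                info_div = self.selector.xpath("//div[@class='c' and @id='M_']")[0]
                # The leading "." keeps the search inside info_div; a bare "//"
                # would search the whole document again.
                info_span = info_div.xpath(".//span[@class='ctt']")[0]
                # 1. Serialize info_span, including its inner HTML, to a string
                html_string = etree.tostring(info_span, encoding='unicode', method='html')
                # 2. Replace <br> with \n
                html_string = html_string.replace('<br>', '\n')
                # 3. Strip all HTML tags but keep the text inside them
                new_content = fromstring(html_string).text_content()
                # 4. Collapse consecutive \n into a single \n
                new_content = re.sub(r'\n+', '\n', new_content)
                weibo_content = handle_garbled(new_content)
                if weibo_content is not None:
                    return weibo_content
            sleep(random.randint(6, 10))
    except Exception:
        # Log message means "network error"
        logger.exception(u'网络出错')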
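The retry pattern in the method above (up to five attempts, with a random pause of 6 to 10 seconds between them) can be sketched on its own as follows. This is an illustration under assumptions, not code from the PR: retry_with_random_delay and its parameters are hypothetical names, and unlike the original loop it skips the pause after the final failed attempt.

```python
import random
import time


def retry_with_random_delay(fetch, attempts=5, min_delay=6, max_delay=10,
                            sleep=time.sleep):
    """Call fetch() until it returns a non-None value or the attempts run out.

    Hypothetical helper: `sleep` is injectable so the delay can be
    replaced in tests or demos.
    """
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        # Pause a random whole number of seconds before the next attempt,
        # but not after the last one.
        if attempt < attempts - 1:
            sleep(random.randint(min_delay, max_delay))
    return None


if __name__ == "__main__":
    # Simulated fetcher that fails twice, then succeeds.
    responses = iter([None, None, "long post text"])
    text = retry_with_random_delay(lambda: next(responses), sleep=lambda s: None)
    print(text)
```

A randomized delay spreads requests out instead of retrying at a fixed rhythm; whether that is sufficient for a given site is outside the scope of this sketch.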

Example result:

        {
            "id": "Obuk4oIaU",
            "user_id": "",
            "content": ":2024年04月26日,星期五\n今天证实了我们所说,情绪仍处于上升期,即使也要多看多了解。",
            "article_url": "",
            "original_pictures":"无",
            "retweet_pictures": null,
            "original": true,
            "video_url": "无",
            "publish_place": "无",
            "publish_time": "2024-04-26 12:11",
            "publish_tool": "微博网页版",
            "up_num": 3,
            "retweet_num": 0,
            "comment_num": 0
        }
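In a record like the one above, the \n inside "content" is a JSON escape sequence, so it becomes a real line break once the record is parsed. A small sketch of reading the lines back, assuming the record is available as a JSON string; content_lines is a hypothetical helper and the sample text is a placeholder, not data from the project:

```python
import json


def content_lines(record_json):
    """Parse one record and return its "content" field split into lines."""
    record = json.loads(record_json)
    return record["content"].split("\n")


if __name__ == "__main__":
    # The doubled backslash keeps \n as a JSON escape inside the Python literal.
    sample = '{"id": "Obuk4oIaU", "content": "first line\\nsecond line"}'
    for line in content_lines(sample):
        print(line)
```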

@dataabc dataabc merged commit d7de931 into dataabc:master Apr 27, 2024
dataabc (Owner) commented Apr 27, 2024

Thanks for contributing this code. It is a very nice improvement that makes long Weibo posts cleaner. Merged.

@myshero myshero deleted the issues_bug_574 branch April 29, 2024 10:56