A web-scraping exercise: finding the most popular discussions on the R mailing list #5

Closed
XiangyunHuang opened this issue Jun 15, 2019 · 1 comment

@XiangyunHuang

commented Jun 15, 2019

Source: a discussion thread on the Capital of Statistics (统计之都) forum https://d.cosx.org/d/420739

| discuss_theme | full_url | count |
|:---|:---|---:|
| Runnable R packages | https://stat.ethz.ch/pipermail/r-devel/2019-February/077225.html | 16 |
| nlminb with constraints failing on some platforms | https://stat.ethz.ch/pipermail/r-devel/2019-February/077226.html | 14 |
| code for sum function | https://stat.ethz.ch/pipermail/r-devel/2019-February/077287.html | 14 |
| Intermittent crashes with inset [<- command | https://stat.ethz.ch/pipermail/r-devel/2019-February/077367.html | 11 |
| Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8 | https://stat.ethz.ch/pipermail/r-devel/2019-February/077261.html | 10 |
| Return/print standard error in t.test() | https://stat.ethz.ch/pipermail/r-devel/2019-February/077335.html | 8 |
| Inefficiency in df$col | https://stat.ethz.ch/pipermail/r-devel/2019-February/077248.html | 8 |
| model.matrix.default() silently ignores bad contrasts.arg | https://stat.ethz.ch/pipermail/r-devel/2019-February/077321.html | 7 |
| Improved Data Aggregation and Summary Statistics in R | https://stat.ethz.ch/pipermail/r-devel/2019-February/077368.html | 6 |
| bias issue in sample() (PR 17494) | https://stat.ethz.ch/pipermail/r-devel/2019-February/077305.html | 6 |
| Encoding issues | https://stat.ethz.ch/pipermail/r-devel/2019-February/077289.html | 5 |
| Documentation for sd (stats) + suggestion | https://stat.ethz.ch/pipermail/r-devel/2019-February/077300.html | 5 |
| Trying to compile R 3.5.2 - 32 bit R - on Windows 10 64 bit - with ICU support | https://stat.ethz.ch/pipermail/r-devel/2019-February/077283.html | 4 |
| Set the number of threads using openmp with .Fortran? | https://stat.ethz.ch/pipermail/r-devel/2019-February/077235.html | 4 |
| Problem with compiling OpenBLAS to work with R | https://stat.ethz.ch/pipermail/r-devel/2019-February/077380.html | 4 |
| Extract.data.frame.Rd about $.data.frame | https://stat.ethz.ch/pipermail/r-devel/2019-February/077278.html | 4 |
| Compile R to WebAssembly / Emscripten? | https://stat.ethz.ch/pipermail/r-devel/2019-February/077311.html | 4 |
| Bugzilla down? | https://stat.ethz.ch/pipermail/r-devel/2019-February/077357.html | 4 |
| Bug: time complexity of substring is quadratic as string size and number of substrings increases | https://stat.ethz.ch/pipermail/r-devel/2019-February/077318.html | 4 |
| stopifnot | https://stat.ethz.ch/pipermail/r-devel/2019-February/077350.html | 3 |
| pcre problems | https://stat.ethz.ch/pipermail/r-devel/2019-February/077351.html | 3 |
| Exit status of Rscript | https://stat.ethz.ch/pipermail/r-devel/2019-February/077391.html | 3 |
| Error in rbind(info, getNamespaceInfo(env, "S3methods")) | https://stat.ethz.ch/pipermail/r-devel/2019-February/077297.html | 3 |
| Bug in print.default: dispatches to global show instead of methods::show | https://stat.ethz.ch/pipermail/r-devel/2019-February/077323.html | 3 |
| Proposed patch for ?Extract | https://stat.ethz.ch/pipermail/r-devel/2019-February/077330.html | 2 |
| Proposed function file.backup | https://stat.ethz.ch/pipermail/r-devel/2019-February/077280.html | 2 |
| patch for gregexpr(perl=TRUE) | https://stat.ethz.ch/pipermail/r-devel/2019-February/077306.html | 2 |
| mle (stat4) crashing due to singular Hessian in covariance matrix calculation | https://stat.ethz.ch/pipermail/r-devel/2019-February/077302.html | 2 |
| make.unique rbind examples | https://stat.ethz.ch/pipermail/r-devel/2019-February/077279.html | 2 |
| Is libtiff >= 4.0.0 now required by R for TIFF support? | https://stat.ethz.ch/pipermail/r-devel/2019-February/077353.html | 2 |
| R 3.5.3 scheduled for March 11 | https://stat.ethz.ch/pipermail/r-devel/2019-February/077356.html | 1 |
| Proposed speedup of spec.pgram from spectrum.R | https://stat.ethz.ch/pipermail/r-devel/2019-February/077288.html | 1 |
| Possible Update to R-internals Manual | https://stat.ethz.ch/pipermail/r-devel/2019-February/077365.html | 1 |
| PATCH: Asserting that 'connection' used has not changed + R_GetConnection2() | https://stat.ethz.ch/pipermail/r-devel/2019-February/077275.html | 1 |
| Package inclusion in R core implementation | https://stat.ethz.ch/pipermail/r-devel/2019-February/077401.html | 1 |
| Bug: time complexity of substring is quadratic | https://stat.ethz.ch/pipermail/r-devel/2019-February/077349.html | 1 |
@XiangyunHuang


commented Jun 15, 2019

Adapted from @jienagu https://github.com/jienagu/tidyverse_examples/blob/master/web_scraping_r_devel.R

# Install the required packages if they are missing
packages <- c("rvest", "knitr")
invisible(lapply(packages, function(pkg) {
  if (system.file(package = pkg) == "") install.packages(pkg)
}))

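# Use the C locale for dates so month names are formatted in English,
# even on Windows with a Chinese locale
Sys.setlocale("LC_TIME", "C")
# Build month labels such as "1997-April", from 1997-04 up to the current month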
all_months <- format(
  seq(
    from = as.Date("1997-04-01"),
    to = Sys.Date(), by = "1 month"
  ),
  "%Y-%B"
)

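# Clean up post subjects
clean_discuss_topic <- function(x) {
  # Remove square brackets and their contents
  x <- gsub("\\[.*?\\]", "", x)
  # Remove the trailing newline \n
  x <- gsub("\n$", "", x)
  # Collapse runs of two or more spaces into a single space
  x <- gsub(" {2,}", " ", x)
  # Strip the leading/trailing whitespace left behind by the removals above
  trimws(x)
}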
library(magrittr)
x <- "2019-February"
base_url <- "https://stat.ethz.ch/pipermail/r-devel"

# The part below could be wrapped into a function:
# the input is the month x, the output is a Markdown table

scrap_webpage <- xml2::read_html(paste(base_url, x, "subject.html", sep = "/"))

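# Extract the href of every link on the page
tail_url <- scrap_webpage %>%
  rvest::html_nodes("a") %>%
  rvest::html_attr("href")
# Extract the discussion topic shown for each link
discuss_topic <- scrap_webpage %>%
  rvest::html_nodes("a") %>%
  rvest::html_text()

# Combine URLs and topics into a data frame
# (stringsAsFactors = FALSE keeps both columns as character on R < 4.0.0)
discuss_df <- data.frame(
  discuss_topic = discuss_topic, tail_url = tail_url,
  stringsAsFactors = FALSE
)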

# Drop rows that are not posts (keep only links ending in .html)
discuss_df <- discuss_df[grepl(pattern = "\\.html$", x = discuss_df$tail_url), ]
# Clean up the topic text
discuss_df$discuss_topic <- clean_discuss_topic(discuss_df$discuss_topic)

# Deduplicate: keep only the first post of each topic
discuss_uni_df <- discuss_df[!duplicated(discuss_df$discuss_topic), ]
# Count posts per topic
discuss_count_df <- as.data.frame(table(discuss_df$discuss_topic), stringsAsFactors = FALSE)
# Rename the columns of discuss_count_df
colnames(discuss_count_df) <- c("discuss_topic", "count")
# Merge the two data frames by topic
discuss <- merge(discuss_uni_df, discuss_count_df, by = "discuss_topic")

# Add the full URL of each thread
discuss <- transform(discuss, full_url = paste(base_url, x, tail_url, sep = "/"))
# Keep the topic, its link, and the number of posts in the thread
discuss <- discuss[, c("discuss_topic", "full_url", "count")]

# Sort by post count and write the result out as a Markdown table
discuss[order(discuss$count, decreasing = TRUE), ] %>%
  knitr::kable(format = "markdown", row.names = FALSE) %>%
  cat(file = paste0(x, "-discuss.md"), sep = "\n")
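
The comment in the script notes that everything after fetching the page could be wrapped into a function. Here is a minimal sketch of that idea, split so that the counting logic can be tried without network access. The names `count_discussions()` and `scrape_month()` are hypothetical and not part of the original script, the `clean` argument is an assumption added for illustration, and the toy subjects and `example.org` URL are made up.

```r
# Hypothetical helper (not in the original script): count posts per topic.
# `topic` and `tail_url` are parallel character vectors, one entry per link.
count_discussions <- function(topic, tail_url, base_url, month) {
  posts <- data.frame(
    discuss_topic = topic, tail_url = tail_url,
    stringsAsFactors = FALSE
  )
  # Keep only links that end in .html, as the original script does
  posts <- posts[grepl("\\.html$", posts$tail_url), ]
  if (nrow(posts) == 0) {
    return(data.frame(
      discuss_topic = character(0), full_url = character(0),
      count = integer(0), stringsAsFactors = FALSE
    ))
  }
  # The first post of each topic supplies the link
  first_post <- posts[!duplicated(posts$discuss_topic), ]
  # Number of posts per topic
  counts <- as.data.frame(table(posts$discuss_topic), stringsAsFactors = FALSE)
  colnames(counts) <- c("discuss_topic", "count")
  out <- merge(first_post, counts, by = "discuss_topic")
  out$full_url <- paste(base_url, month, out$tail_url, sep = "/")
  # Most-discussed topics first
  out <- out[order(out$count, decreasing = TRUE),
             c("discuss_topic", "full_url", "count")]
  rownames(out) <- NULL
  out
}

# Hypothetical wrapper: fetch one month of the archive and count its topics.
# `clean` is an assumed argument; pass clean_discuss_topic from the script above.
scrape_month <- function(month,
                         base_url = "https://stat.ethz.ch/pipermail/r-devel",
                         clean = trimws) {
  page <- xml2::read_html(paste(base_url, month, "subject.html", sep = "/"))
  links <- rvest::html_nodes(page, "a")
  count_discussions(
    topic = clean(rvest::html_text(links)),
    tail_url = rvest::html_attr(links, "href"),
    base_url = base_url, month = month
  )
}

# Offline demo with made-up data: three posts on "A", one on "B",
# and one navigation link that should be dropped
demo <- count_discussions(
  topic = c("A", "B", "A", "nav", "A"),
  tail_url = c("001.html", "002.html", "003.html", "thread.html#start", "004.html"),
  base_url = "https://example.org/list", month = "2019-February"
)
print(demo)
```

With this split, something like `lapply(all_months, scrape_month, clean = clean_discuss_topic)` would cover the whole archive, though it makes one request per month, so it is worth pausing between calls.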

XiangyunHuang added a commit that referenced this issue Jul 11, 2019

# This is a combination of 8 commits.
# This is the 1st commit message:

add warning block

# The commit message #2 will be skipped:

# Add a density plot example

# The commit message #3 will be skipped:

# Make a GIF animation of gapminder with magick

# The commit message #4 will be skipped:

# Add an example showing how tweenr handles transitions

# The commit message #5 will be skipped:

# Reformat code and adjust the aspect ratio of the density plot

# The commit message #6 will be skipped:

# Wind rose chart, also known as a Nightingale chart

# The commit message #7 will be skipped:

# install av and gifski to generate mp4 or gif

# The commit message #8 will be skipped:

# Install dependencies for plotly

XiangyunHuang added a commit that referenced this issue Jul 11, 2019

# This is a combination of 11 commits.
# This is the 1st commit message:

add warning block

# The commit message #2 will be skipped:

# Try the development version of ggplot2

# The commit message #3 will be skipped:

# add SQLite

# The commit message #4 will be skipped:

# update readme

# The commit message #5 will be skipped:

# Close the database connection

# The commit message #6 will be skipped:

# call dbDisconnect() when finished working with a connection
#
# If the database already contains a table with the same name, overwrite it

# The commit message #7 will be skipped:

# Histogram: Base R vs ggplot2

# The commit message #8 will be skipped:

# Base R natively supports Unicode math in plots

# The commit message #9 will be skipped:

# update histogram

# The commit message #10 will be skipped:

# fira fonts in ggplot2

# The commit message #11 will be skipped:

# Plot outline