Paper Information Spider

Main Features

This spider scrapes paper information from Google Scholar and dblp, including:

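Authors
Publication year
Conference or journal, with page numbers
Citation count
Citations by others (excluding self-citations)
Citation formats: GB/T 7714, MLA, and APA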

When scraping finishes, a CSV file is generated for later review.
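For later review, the CSV can be read back with Python's standard `csv` module. The following is only a minimal sketch: the helper `read_results` and the column names `title` and `citations` are hypothetical and stand in for whatever header row the generated file actually has.

```python
import csv
import io


def read_results(csv_file):
    """Return the rows of a results CSV as a list of dicts keyed by header."""
    return list(csv.DictReader(csv_file))


# Hypothetical two-column sample standing in for the generated file;
# the real column names come from the file's own header row.
sample = io.StringIO("title,citations\nExample Paper,12\n")
rows = read_results(sample)
print(rows[0]["title"], rows[0]["citations"])
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to the same function.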

Usage

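Note: Google Scholar must be reachable from your network, which in some regions requires a proxy or VPN. The spider currently sends one request every 10 to 15 seconds. More frequent requests can get the cookie blocked, so keeping the current interval is recommended.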
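A randomized pause of 10 to 15 seconds between requests could look roughly like the sketch below. It is an illustration of the idea, not the code in spider.py: the function name `polite_delay` and its parameters are assumptions.

```python
import random
import time


def polite_delay(min_seconds=10.0, max_seconds=15.0, sleep=time.sleep):
    """Pause for a random interval between min_seconds and max_seconds.

    Returns the number of seconds waited. `sleep` is injectable so the
    function can be exercised without actually waiting.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling `polite_delay()` before each request keeps the request rate within the interval described above. Randomizing the pause, rather than sleeping a fixed time, makes the traffic look less mechanical.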

Before use, replace the cookie in the headers in spider.py with a valid Google Scholar cookie. See the section on getting the cookie for how to obtain one.

from spider import PaperSpider

# Titles of all the papers to look up
paper_title_list = ['paper_title1', 'paper_title2']
spider = PaperSpider(paper_title_list, need_other_cited=True, need_cite_format=True)
spider.run()

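Here paper_title_list holds the titles of all the papers to scrape. need_other_cited controls whether citations by others are counted, which can be slow for papers with many citations. need_cite_format controls whether citation formats such as APA are scraped.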
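To make the idea of citations by others concrete, here is a minimal sketch of one way to count them. How spider.py actually decides this is not shown here; the sketch assumes a citation counts as "by others" when the citing paper shares no author name with the cited paper, and the function name `count_other_citations` is hypothetical.

```python
def count_other_citations(paper_authors, citing_author_lists):
    """Count citing papers that share no author with the cited paper.

    Names are compared after trimming whitespace and lowercasing, which is
    a simplification: real author matching may need to handle initials
    and name variants.
    """
    own = {name.strip().lower() for name in paper_authors}
    count = 0
    for authors in citing_author_lists:
        citing = {name.strip().lower() for name in authors}
        if own.isdisjoint(citing):
            count += 1
    return count
```

Because every citing paper's author list has to be inspected, the work grows with the number of citations, which is consistent with this option being slow for heavily cited papers.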

How to Get the Cookie

Open Chrome, go to Google Scholar, and search for any paper.

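Click the "Cited by" link on one of the results. On the new page, open the developer tools, select the Network tab, and refresh the page. Click the first request in the list, find cookie under Request Headers, and copy its value.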
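The copied value can then be placed into a request headers dictionary. The sketch below shows one way to tidy the pasted string first. The helper `build_headers`, the default `User-Agent` value, and the placeholder cookie names are all assumptions for illustration; the actual layout of the headers in spider.py may differ.

```python
def build_headers(cookie_string, user_agent="Mozilla/5.0"):
    """Build a headers dict from a cookie string pasted from the browser.

    Strips surrounding whitespace and a leading "cookie:" label, which is
    easy to copy by accident from the developer tools.
    """
    cookie = cookie_string.strip()
    if cookie.lower().startswith("cookie:"):
        cookie = cookie[len("cookie:"):].strip()
    if not cookie:
        raise ValueError("cookie string is empty")
    return {"User-Agent": user_agent, "Cookie": cookie}


# Placeholder value; paste the real cookie copied from the browser here.
headers = build_headers("cookie: NAME1=value1; NAME2=value2\n")
print(headers["Cookie"])
```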
