201906. Jiayuan (世纪佳缘) crawler #3

Open
cheenwe opened this issue Jun 23, 2019 · 0 comments

cheenwe (Owner) commented Jun 23, 2019

A brief review of the development of this crawler during work in the first half of 2019, with suggestions for optimization.

1. Crawler development process

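a. Working out how to obtain the links to crawl

Opening any user's profile page shows a URL of the form http://www.jiayuan.com/xxx, where xxx is a sequential ID, so the crawler can pick an ID as its starting point and iterate from there.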
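The ID enumeration described above could be sketched roughly as follows. This is only an illustration: the function name `profile_urls` and the starting ID `100000` are hypothetical, and only the base URL comes from the issue text.

```python
from typing import Iterator

BASE_URL = "http://www.jiayuan.com"  # URL pattern quoted in the issue


def profile_urls(start_id: int, count: int) -> Iterator[str]:
    """Yield profile URLs for `count` consecutive IDs starting at `start_id`."""
    for user_id in range(start_id, start_id + count):
        yield f"{BASE_URL}/{user_id}"


# Hypothetical starting ID, chosen only for illustration.
for url in profile_urls(100000, 3):
    print(url)
```

A generator keeps memory use flat even when the ID range is very large, since URLs are produced one at a time.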

b. IPs get banned after large volumes of requests

Use proxy IPs. The project has a concrete implementation, which needs optimization: #4
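One possible shape for proxy rotation is shown below, using only the Python standard library. It is not the project's implementation (that lives in the repository and is tracked in #4); the helper names and the proxy addresses are placeholders assumed for the example.

```python
import itertools
import urllib.request
from typing import Iterable, Iterator


def proxy_cycle(proxies: Iterable[str]) -> Iterator[str]:
    """Return an endless round-robin iterator over the given proxy URLs."""
    return itertools.cycle(list(proxies))


def build_opener_for(proxy: str) -> urllib.request.OpenerDirector:
    """Build an opener that routes HTTP and HTTPS traffic through `proxy`."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


def fetch(url: str, proxy: str, timeout: float = 10.0) -> bytes:
    """Fetch `url` through `proxy` and return the raw response body."""
    with build_opener_for(proxy).open(url, timeout=timeout) as resp:
        return resp.read()


# Placeholder proxy addresses, not real servers.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]
rotation = proxy_cycle(PROXIES)
# Each request would take the next proxy: fetch(url, next(rotation))
```

Round-robin is the simplest policy; a real pool would probably also drop proxies that fail repeatedly.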

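c. Increasing crawl speed and image download speed

Run multiple script instances, use multithreading, and use asynchronous operations (via tomorrow).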
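As a rough illustration of the multithreading idea, the sketch below uses the standard library's `concurrent.futures` rather than the tomorrow package named above. The function `crawl_concurrently` and the injected `fetch` callable are assumptions for the example, not code from the project.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def crawl_concurrently(
    ids: Iterable[int],
    fetch: Callable[[int], object],
    max_workers: int = 8,
) -> Dict[int, object]:
    """Run `fetch` for every ID on a thread pool.

    Returns a dict mapping each ID to its result, or to the exception
    raised while fetching it, so one failure does not stop the batch.
    """
    results: Dict[int, object] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, user_id): user_id for user_id in ids}
        for future in as_completed(futures):
            user_id = futures[future]
            try:
                results[user_id] = future.result()
            except Exception as exc:  # keep the error for later inspection
                results[user_id] = exc
    return results
```

Threads suit this workload because page and image downloads spend most of their time waiting on the network.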

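d. How to control crawl progress

Use a REST API to control multiple crawlers: assign the IDs to crawl, obtain cookies automatically, and so on. Needs optimization: #5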
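The ID assignment behind such an API might look something like the in-memory sketch below. The class `IdAllocator`, its batch scheme, and the crawler names are all hypothetical; the actual service (tracked in #5) may work quite differently.

```python
import threading
from typing import Dict, Tuple


class IdAllocator:
    """Hand out non-overlapping, consecutive ID ranges to crawlers."""

    def __init__(self, start_id: int, batch_size: int) -> None:
        self._next = start_id
        self._batch = batch_size
        self._lock = threading.Lock()
        # Last range given to each crawler, as (first_id, last_id).
        self.assigned: Dict[str, Tuple[int, int]] = {}

    def next_batch(self, crawler_name: str) -> Tuple[int, int]:
        """Reserve the next range and return it as (first_id, last_id)."""
        with self._lock:  # several crawlers may ask at the same time
            first = self._next
            last = first + self._batch - 1
            self._next = last + 1
            self.assigned[crawler_name] = (first, last)
            return first, last


allocator = IdAllocator(start_id=100000, batch_size=1000)
print(allocator.next_batch("crawler-1"))
print(allocator.next_batch("crawler-2"))
```

A REST endpoint would wrap `next_batch` so each crawler requests its next range over HTTP; persisting `assigned` would let progress survive a restart.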

e. How to save crawled data

Send it to the server through the REST API so that all data is stored in one place.
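Submitting a crawled record to the server could be sketched as below. The endpoint URL and the record fields are invented for illustration, since the issue does not describe the API's routes or payload format.

```python
import json
import urllib.request


def build_save_request(endpoint: str, record: dict) -> urllib.request.Request:
    """Build a JSON POST request carrying one crawled record."""
    body = json.dumps(record, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status


# Hypothetical endpoint and fields; nothing is sent until send() is called.
request = build_save_request(
    "http://localhost:8000/api/profiles",
    {"id": 100000, "nickname": "example"},
)
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.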

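f. How to handle verification challenges

When a verification challenge appears, the crawler pauses and sends a notification so that someone can complete it manually. Once verification passes, the record is updated and the crawler resumes automatically.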
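The pause, notify, and resume cycle can be modelled with a `threading.Event`, as in the sketch below. The class `VerificationGate` and the `notify` callback are assumptions; how the project actually detects a challenge and delivers the notification is not stated in the issue.

```python
import threading
from typing import Callable, Optional


class VerificationGate:
    """Block crawler threads while a manual verification is pending."""

    def __init__(self, notify: Callable[[str], None]) -> None:
        self._clear_to_run = threading.Event()
        self._clear_to_run.set()  # crawling is allowed at start
        self._notify = notify

    def challenge_detected(self, detail: str) -> None:
        """Pause crawling and tell a human that verification is needed."""
        self._clear_to_run.clear()
        self._notify(detail)

    def verified(self) -> None:
        """Resume crawling after the manual verification has passed."""
        self._clear_to_run.set()

    def wait_until_clear(self, timeout: Optional[float] = None) -> bool:
        """Block until crawling is allowed; return False on timeout."""
        return self._clear_to_run.wait(timeout)
```

Each worker would call `wait_until_clear()` before every request, so all of them stop at the gate until `verified()` is called.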

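g. How to obtain the cookie

Simulating the login HTTP request and reading the cookie directly: no longer works.

Use selenium to simulate login and obtain the cookie, and automatically obtain a new cookie when it expires. Still to be developed: #6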
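The refresh-on-expiry logic might be organised along these lines. The browser login itself is left as an injected callable, because the login page's structure is not described here; `CookieStore` and `cookie_header` are hypothetical names. The cookie dictionaries follow the `name`/`value` shape that selenium's `driver.get_cookies()` returns.

```python
from typing import Callable, Dict, List, Optional


def cookie_header(cookies: List[Dict[str, str]]) -> str:
    """Turn selenium-style cookie dicts into a Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


class CookieStore:
    """Cache the login cookie and fetch a new one after invalidation."""

    def __init__(self, login: Callable[[], List[Dict[str, str]]]) -> None:
        # `login` is assumed to drive the browser and return its cookies,
        # for example by calling driver.get_cookies() after signing in.
        self._login = login
        self._header: Optional[str] = None

    def header(self) -> str:
        """Return the cached Cookie header, logging in first if needed."""
        if self._header is None:
            self._header = cookie_header(self._login())
        return self._header

    def invalidate(self) -> None:
        """Discard the cached cookie, e.g. after the site rejects it."""
        self._header = None
```

A crawler would call `invalidate()` when a response indicates the session has expired, so the next `header()` call triggers a fresh login.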

@cheenwe cheenwe pinned this issue Jun 23, 2019
@cheenwe cheenwe unpinned this issue Jun 24, 2019