newscrawler

A news website crawler. It can currently crawl news pages from NetEase (163), Sina, QQ, and Sohu.

##Usage:

python runspiders.py

##JSON file

Each news item is saved as a JSON file with the following fields:

newsId: the ID of the news item

source: the source of the news, such as news.163.com, news.sina.com.cn or news.qq.com

date: the creation date of the news, e.g. 20150529

contents:

link: the link of news

title: the title of news

passage: the content of news

The title and passage are encoded as Unicode, so you need to decode them when you load the file.
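As a minimal sketch of loading one saved file, the snippet below reads it with Python's standard `json` module, which turns `\uXXXX` escapes into ordinary Unicode strings. The function name `load_news` is hypothetical, and whether `link`, `title` and `passage` are nested under `contents` or stored at the top level is an assumption, so the code accepts both layouts.

```python
import io
import json


def load_news(path):
    """Load one news record (hypothetical helper) and return its fields."""
    # io.open with an explicit encoding behaves the same on Python 2 and 3;
    # json.load decodes \uXXXX escapes into Unicode text.
    with io.open(path, encoding="utf-8") as f:
        record = json.load(f)

    # Assumption: link/title/passage may be nested under "contents"
    # or stored next to newsId/source/date. Handle both layouts.
    contents = record.get("contents")
    if not isinstance(contents, dict):
        contents = record

    return {
        "newsId": record.get("newsId"),
        "source": record.get("source"),
        "date": record.get("date"),
        "link": contents.get("link"),
        "title": contents.get("title"),
        "passage": contents.get("passage"),
    }
```

Calling `load_news("some_file.json")` returns a flat dictionary, so `news["title"]` can be printed or written out in any encoding you need.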

##Other:

save2xml.py converts the JSON files to XML.

The XML file can be tagged by TemporaliaChTagger.
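To illustrate the kind of conversion involved, here is a rough Python 3 sketch that turns one news record into an XML string with the standard library. It is not the code in save2xml.py: the function name `news_to_xml`, the `doc` root element and the child element names are all assumptions, and the schema TemporaliaChTagger actually expects may differ.

```python
import xml.etree.ElementTree as ET


def news_to_xml(news):
    """Serialize a flat news dict to an XML string (hypothetical layout)."""
    # Assumption: one <doc> element per news item, keyed by newsId.
    doc = ET.Element("doc", {"id": str(news["newsId"])})
    for field in ("source", "date", "link", "title", "passage"):
        child = ET.SubElement(doc, field)
        # Missing fields become empty elements instead of raising.
        child.text = news.get(field) or ""
    # encoding="unicode" returns a str rather than bytes (Python 3).
    return ET.tostring(doc, encoding="unicode")
```

ElementTree escapes characters such as `&` and `<` in the text automatically, which is the main reason to prefer it over building the XML by string concatenation.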

###Reference news-combinator
