Scraping financial news using Scrapy

The scraper consists of predefined spiders that collect news from sources such as Reuters and BusinessStandard.

There are two types of spiders: one gathers fresh news, the other follows a manually prepared list of URLs.

The spider that collects fresh news follows these steps:

Open the front page and gather the links in the news section
Filter out already parsed articles using an MD5 hash of each link and a hash log
Parse new articles: title, date, body
Perform basic cleaning of the data: remove all characters except alphanumeric characters, punctuation and currency symbols; trim every field; parse date strings and convert them to datetime type
Save to CSV, semicolon-delimited, with fields enclosed in double quotes
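
The steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the project's actual code: the function names, the hash log layout (one hex digest per line), the extra currency symbols and the `%Y-%m-%d %H:%M` date format are all hypothetical.

```python
import csv
import hashlib
import re
import string
from datetime import datetime
from pathlib import Path

# Assumption: keep ASCII letters and digits, whitespace, ASCII punctuation
# (which already includes "$") and a few extra currency symbols.
DISALLOWED = re.compile(
    r"[^A-Za-z0-9\s" + re.escape(string.punctuation) + "€£¥₹]"
)


def link_hash(url):
    """Return the MD5 hex digest of a link."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def filter_new_links(links, hash_log):
    """Return links whose hash is not in the log, then record them in the log."""
    hash_log = Path(hash_log)
    seen = set(hash_log.read_text().split()) if hash_log.exists() else set()
    fresh = []
    for url in links:
        digest = link_hash(url)
        if digest in seen:
            continue
        seen.add(digest)
        fresh.append(url)
    with hash_log.open("a") as fh:
        for url in fresh:
            fh.write(link_hash(url) + "\n")
    return fresh


def clean_text(text):
    """Drop disallowed characters and trim surrounding whitespace."""
    return DISALLOWED.sub("", text).strip()


def clean_article(raw, date_format="%Y-%m-%d %H:%M"):
    """Clean title and body, and parse the date string into a datetime."""
    return {
        "title": clean_text(raw["title"]),
        "date": datetime.strptime(raw["date"].strip(), date_format),
        "body": clean_text(raw["body"]),
    }


def save_articles(articles, path):
    """Append articles to a semicolon-delimited CSV with every field quoted."""
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(
            fh, delimiter=";", quotechar='"', quoting=csv.QUOTE_ALL
        )
        for item in articles:
            writer.writerow(
                [item["title"], item["date"].isoformat(), item["body"]]
            )
```

In this sketch the hash log is updated as soon as links are filtered, so a link whose parsing later fails would not be retried; recording the hash only after a successful parse is an equally reasonable choice.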

Scraping can be run periodically on a Windows machine using scrapyd and the regular Windows Task Scheduler. To set this up, follow these steps:

install scrapyd
install scrapyd-client (when I used plain pip to install scrapyd-client, it didn't install correctly; it worked when I installed it directly from the repo: pip install git+https://github.com/scrapy/scrapyd-client)
run scrapyd
deploy the project to scrapyd:
cd into the scrapy project
prepare scrapy.cfg
run: scrapyd-client deploy
check if it works by running: curl http://local-url-to-scrapyd/schedule.json -d project=myscrapyproject -d spider=spidername. The result should look something like this: {"node_name": "FACA", "status": "ok", "jobid": "811557ee4bf411e982c19cb6d0d30648"}
Add a Windows Scheduled Task that triggers the spider every 5 minutes by running this command: schtasks /create /sc minute /mo 5 /tn TaskName /tr "cmd.exe /c curl --silent http://local-url-to-scrapyd/schedule.json -d project=myprojectname -d spider=myspidername"
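
The same schedule.json call can also be made from Python, for example on a machine without curl. The sketch below uses only the standard library; the function names are hypothetical, and the base URL is the same placeholder used in the steps above.

```python
import json
from urllib import parse, request


def build_schedule_request(base_url, project, spider):
    """Build the POST request that the curl command above sends."""
    data = parse.urlencode({"project": project, "spider": spider}).encode("ascii")
    url = base_url.rstrip("/") + "/schedule.json"
    return request.Request(url, data=data, method="POST")


def schedule_spider(base_url, project, spider, timeout=10):
    """Send the request to scrapyd and return the decoded JSON reply."""
    req = build_schedule_request(base_url, project, spider)
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling schedule_spider("http://local-url-to-scrapyd", "myscrapyproject", "spidername") against a running scrapyd instance should return a dictionary with the same keys as the curl output: node_name, status and jobid.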
