desiquant/news_scraper

DesiQuant News Scraper

A Scrapy crawler that scrapes market news from Indian financial news outlets.

⚠️ WARNING: Work in progress. Expect frequent breaking changes.

Usage

Run a spider. The output is saved to outputs/moneycontrol.jl in JSON Lines format:

# scrape all market articles from "2010-01-01" till today
scrapy crawl moneycontrol

# trial run: stops after scraping 10 items. useful for testing purposes
TRIAL_RUN=1 scrapy crawl moneycontrol
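
Since the output is in JSON Lines format, each line of the file is a standalone JSON object. The following is a minimal sketch of how such a file could be read back in Python; the helper name `read_jsonl` is hypothetical and not part of this repository, and no assumption is made about which fields each record contains.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

For example, `for item in read_jsonl("outputs/moneycontrol.jl"): print(item)` would print each scraped record in turn without loading the whole file into memory.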

To view a list of all available spiders:

scrapy list

# businessstandard
# businesstoday
# economictimes
# financialexpress
# firstpost
# freepressjournal
# indianexpress
# ipfy
# moneycontrol
# ndtvprofit
# news18
# outlookindia
# thehindu
# thehindubusinessline
# zeenews

To run all the spiders in production:

# optional: run scrapy's built-in benchmark
scrapy bench

# run all the spiders
python run.py
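
The contents of run.py are not shown here. As a rough illustration only, one way to run every spider in sequence is to parse the output of `scrapy list` and invoke `scrapy crawl` for each name. The function names below (`parse_spider_list`, `build_command`, `run_all`) are hypothetical, and this sketch should not be read as the project's actual implementation.

```python
import subprocess
from typing import Dict, List


def parse_spider_list(output: str) -> List[str]:
    """Turn the text printed by `scrapy list` into a list of spider names."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def build_command(spider: str) -> List[str]:
    """Build the command line that crawls a single spider."""
    return ["scrapy", "crawl", spider]


def run_all() -> Dict[str, int]:
    """Run every spider one after another and return its exit code by name."""
    listing = subprocess.run(
        ["scrapy", "list"], capture_output=True, text=True, check=True
    )
    results = {}
    for spider in parse_spider_list(listing.stdout):
        # check=False so one failing spider does not stop the rest
        results[spider] = subprocess.run(build_command(spider), check=False).returncode
    return results
```

Running each spider in its own process keeps a crash in one spider from affecting the others, at the cost of paying Scrapy's start-up time once per spider.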

Run tests to check if spiders are still working.

# view the parsed article
scrapy parse https://www.businesstoday.in/markets/stocks/story/upward-revision-in-eps-estimates-what-analysts-say-on-tcs-q1-results-stock-trading-strategy-436794-2024-07-11

# integration test all spiders
pytest

Sitemaps

The sitemaps for each website are not always listed in robots.txt. Googling for keywords like "ndtvprofit.com daily sitemap xml" seems to retrieve the ones that are not mentioned there.

| Publisher | Sitemap Type | Sitemap Link |
| --- | --- | --- |
| News 18 | Daily Sitemap | Link |
| The Hindu | Daily Sitemap | Link |
| The Hindu Business Line | Daily Sitemap | Link |
| Business Today | Daily Sitemap | Link |
| Money Control | Daily Sitemap | Link |
| Business Standard | Sitemap Index | Link |
| Economic Times | Monthly Sitemaps | Link |
| Firstpost | Daily Sitemap | Link |
| NDTV Profit | Daily Sitemap | Link |
| Free Press Journal | Daily Sitemap | Link |
| Outlook India | Daily Sitemap | Link |
| Zee News | Monthly Sitemap | Link |
| Financial Express | Daily Sitemap | Link |
| Indian Express | Daily Sitemap | Link |
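
Before resorting to a web search, it can be worth checking which sitemaps a site does declare in its robots.txt. The sketch below extracts the `Sitemap:` entries from the text of a robots.txt file that has already been downloaded; the function name `sitemaps_from_robots` and the example.com URLs are illustrative assumptions, not part of this repository.

```python
from typing import List


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs declared on `Sitemap:` lines of a robots.txt body."""
    urls = []
    for raw in robots_txt.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        # Split on the first colon only, so the URL's own "https:" stays intact
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


example = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap-index.xml
sitemap: https://example.com/news-sitemap.xml
"""
print(sitemaps_from_robots(example))
```

An empty result suggests the site is one of those whose sitemaps have to be found by other means, such as the search approach described above. Python's standard library also offers `urllib.robotparser.RobotFileParser.site_maps()` (Python 3.8+) for the same purpose.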

Notes:

TODO

  • Do not cache recent sitemaps
  • Run the scraper as prefect flow
  • Scraping mode - Update/dump
  • While running the test, if it fails, prevent scrapy from showing the entire output
  • export PYTHONDONTWRITEBYTECODE=1
  • pytest fails on a few spiders on the remote server
  • moneycontrol and indianexpress have very aggressive protection. They don't seem to allow even floating IPs from Hetzner, but IPs from Bright Data seem to work
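
The "Do not cache recent sitemaps" item could be approached with a simple age check: a daily sitemap for today or yesterday may still gain articles, so only older ones are safe to cache. The sketch below is one possible reading of that TODO, not the project's implementation; the function name `is_cacheable` and the two-day `settle_days` threshold are assumptions chosen for illustration.

```python
from datetime import date, timedelta


def is_cacheable(sitemap_day: date, today: date, settle_days: int = 2) -> bool:
    """Return True only if the sitemap's date is at least `settle_days` old.

    Newer sitemaps are assumed to still be changing and should be refetched.
    """
    return sitemap_day <= today - timedelta(days=settle_days)
```

With the default threshold, `is_cacheable(date(2024, 7, 10), date(2024, 7, 11))` is false, so that sitemap would be refetched on every run, while a sitemap dated a week earlier would be served from the cache.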

Server Checklist

  • Attach floating IPs
  • Prevent pycache
  • Mount volume

More Sources

The following news websites were considered, but no daily sitemaps were found for them. Some strategies that may be effective (more research required) for iteratively retrieving a list of all articles are mentioned below.