A Scrapy crawler that scrapes market news from Indian financial news outlets.
⚠️ WARNING: Work in progress. Breaking changes will be introduced frequently.
Run a spider. The output is saved to outputs/moneycontrol.jl
in JSON Lines format.
# scrape all market articles from "2010-01-01" till today
scrapy crawl moneycontrol
# trial run: stops after scraping 10 items. useful for testing purposes
TRIAL_RUN=1 scrapy crawl moneycontrol
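
Since the output is JSON Lines, every non-empty line of the file is one standalone JSON object and can be read back with the standard library. The sketch below is only an illustration: the helper name `read_items` is made up here, and nothing is assumed about which fields each item carries.

```python
import json
from pathlib import Path
from typing import Iterator


def read_items(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


# Example usage (run a spider first so that the file exists):
#   items = list(read_items("outputs/moneycontrol.jl"))
#   print(len(items), "items scraped")
```

Reading lazily with a generator keeps memory use flat even when a full crawl from 2010 produces a large file.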
To view a list of all available spiders:
scrapy list
# businessstandard
# businesstoday
# economictimes
# financialexpress
# firstpost
# freepressjournal
# indianexpress
# ipfy
# moneycontrol
# ndtvprofit
# news18
# outlookindia
# thehindu
# thehindubusinessline
# zeenews
To run all the spiders in production:
python run.py
# view scraping benchmark tests performed by scrapy
scrapy bench
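
The contents of run.py are not shown in this README. As a rough idea of what a "run every spider" script can look like, the sketch below shells out to the same `scrapy list` and `scrapy crawl` commands used above. The function names are hypothetical and this is not necessarily how the project's run.py works.

```python
import subprocess
from typing import Dict, List


def parse_spider_names(listing: str) -> List[str]:
    """Turn the text printed by `scrapy list` into a list of spider names."""
    return [line.strip() for line in listing.splitlines() if line.strip()]


def list_spiders() -> List[str]:
    """Ask the Scrapy CLI for the spiders of the current project."""
    result = subprocess.run(
        ["scrapy", "list"], capture_output=True, text=True, check=True
    )
    return parse_spider_names(result.stdout)


def crawl_all() -> Dict[str, int]:
    """Run every spider one after another; map spider name to exit code."""
    exit_codes = {}
    for name in list_spiders():
        # check=False so that one failing spider does not stop the rest
        completed = subprocess.run(["scrapy", "crawl", name], check=False)
        exit_codes[name] = completed.returncode
    return exit_codes


# Example usage, from the project root:
#   print(crawl_all())
```

Running the spiders as separate processes keeps them isolated from each other, at the cost of starting Scrapy once per spider.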
Run tests to check if spiders are still working.
# view the parsed article
scrapy parse https://www.businesstoday.in/markets/stocks/story/upward-revision-in-eps-estimates-what-analysts-say-on-tcs-q1-results-stock-trading-strategy-436794-2024-07-11
# integration test all spiders
pytest
The sitemaps for each website are not always listed in robots.txt.
Googling for keywords like "ndtvprofit.com daily sitemap xml"
seems to turn up the ones that are not mentioned.
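
Before resorting to a search engine, the `Sitemap:` entries that a site does declare can be read with the standard library's `urllib.robotparser` (the `site_maps()` method needs Python 3.8 or newer). A minimal sketch; the two helper names are illustrative only.

```python
from typing import List
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the Sitemap URLs declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present
    return parser.site_maps() or []


def sitemaps_for_site(base_url: str) -> List[str]:
    """Download <base_url>/robots.txt and return its declared sitemaps."""
    parser = RobotFileParser(base_url.rstrip("/") + "/robots.txt")
    parser.read()  # performs the HTTP request
    return parser.site_maps() or []


# Example usage:
#   print(sitemaps_for_site("https://www.ndtvprofit.com"))
```

An empty list here means robots.txt does not advertise a sitemap, which is the case where the keyword search described above becomes necessary.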
Publisher | Sitemap Type | Sitemap Link |
---|---|---|
News 18 | Daily Sitemap | Link |
The Hindu | Daily Sitemap | Link |
The Hindu Business Line | Daily Sitemap | Link |
Business Today | Daily Sitemap | Link |
Money Control | Daily Sitemap | Link |
Business Standard | Sitemap Index | Link |
Economic Times | Monthly Sitemaps | Link |
Firstpost | Daily Sitemap | Link |
NDTV Profit | Daily Sitemap | Link |
Free Press Journal | Daily Sitemap | Link |
Outlook India | Daily Sitemap | Link |
Zee News | Monthly Sitemap | Link |
Financial Express | Daily Sitemap | Link |
Indian Express | Daily Sitemap | Link |
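
For publishers with a daily sitemap, covering the period from 2010-01-01 onwards amounts to requesting one sitemap per day. The sketch below shows that enumeration. The URL template is purely hypothetical, since every publisher uses its own scheme, and the function name is made up for this example.

```python
from datetime import date, timedelta
from typing import Iterator

# Hypothetical pattern; replace it with the real scheme of a given publisher.
TEMPLATE = "https://example.com/sitemap/{d:%Y-%m-%d}.xml"


def daily_sitemap_urls(
    start: date, end: date, template: str = TEMPLATE
) -> Iterator[str]:
    """Yield one sitemap URL per day from start to end, both inclusive."""
    day = start
    while day <= end:
        yield template.format(d=day)
        day += timedelta(days=1)


# Example usage:
#   for url in daily_sitemap_urls(date(2010, 1, 1), date.today()):
#       print(url)
```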
- Do not cache recent sitemaps
- Run the scraper as prefect flow
- Scraping mode - Update/dump
- While running the test, if it fails, prevent scrapy from showing the entire output
- export PYTHONDONTWRITEBYTECODE=1
- pytest fails on a few spiders on the remote server
- moneycontrol and indianexpress have very aggressive protection. They don't seem to allow even floating IPs from Hetzner, but IPs from Bright Data seem to work
- Attach floating IPs
- Prevent pycache
- Mount volume
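
One possible way to approach the "Do not cache recent sitemaps" item is an age cutoff: a sitemap is only treated as cacheable once its day is far enough in the past, presumably because recent daily sitemaps can still change. The function name and the two-day default are assumptions made for this sketch, not project settings.

```python
from datetime import date, timedelta


def is_cacheable(sitemap_day: date, today: date, min_age_days: int = 2) -> bool:
    """Return True only for sitemaps at least min_age_days old."""
    return sitemap_day <= today - timedelta(days=min_age_days)


# Example usage:
#   is_cacheable(date(2024, 7, 1), date.today())
```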
The following news websites were considered, but no daily sitemaps were found for them. Some candidate strategies for iteratively retrieving a list of all articles are noted below; they still need more research.
- https://www.livemint.com/api/cms/story/v2/11720327511606 - Check Content Length in Head. TODO: Check for market slug with a smaller query
- https://timesofindia.indiatimes.com/articleshow/81896735.cms - The redirect does not show up in the HEAD response, and there is no sitemap either.
- https://www.indiainfoline.com/news/top-share-market-news/page/14072 - New articles have no ID in the url. Seems to allow old articles to redirect
- https://in.investing.com/news/a/a-4293269 - Doesn't redirect to actual url
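
The "Check Content Length in Head" idea from the livemint entry can be sketched with the standard library as follows. The URL pattern is taken from the example link above. The function names are hypothetical, and how the returned length should be interpreted is left open, since that part still needs research.

```python
import urllib.request
from typing import Optional

STORY_API = "https://www.livemint.com/api/cms/story/v2/{story_id}"


def build_head_request(story_id: int) -> urllib.request.Request:
    """Build a HEAD request for one story ID (headers only, no body)."""
    url = STORY_API.format(story_id=story_id)
    return urllib.request.Request(url, method="HEAD")


def content_length(story_id: int, timeout: float = 10.0) -> Optional[int]:
    """Return the Content-Length of the HEAD response, or None."""
    try:
        request = build_head_request(story_id)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            value = response.headers.get("Content-Length")
    except OSError:
        # covers URLError, HTTPError (4xx/5xx) and timeouts
        return None
    return int(value) if value is not None else None


# Example usage:
#   print(content_length(11720327511606))
```

A HEAD request avoids downloading the article body, which matters when many consecutive IDs have to be probed.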