DesiQuant News Scraper

A Scrapy crawler that scrapes market news from Indian financial news outlets.

⚠️ WARNING: Work in progress. Expect frequent breaking changes.

Usage

Run a spider. The output is saved to outputs/moneycontrol.jl in JSON Lines format.

# scrape all market articles from "2010-01-01" till today
scrapy crawl moneycontrol

# trial run: stops after scraping 10 items. useful for testing purposes
TRIAL_RUN=1 scrapy crawl moneycontrol
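
Since the output is JSON Lines (one JSON object per line), it can be loaded back with the standard library alone. The following is a minimal sketch, not part of the project: the helper name read_items is hypothetical, and no assumption is made about which fields each item contains.

```python
import json
from pathlib import Path
from typing import Iterator


def read_items(path: str) -> Iterator[dict]:
    """Yield one parsed item per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


if __name__ == "__main__":
    # Path taken from the usage section above; adjust for other spiders.
    items = list(read_items("outputs/moneycontrol.jl"))
    print(f"loaded {len(items)} items")
```

Reading line by line keeps memory use flat even for large dumps, which is the main reason to prefer JSON Lines over a single JSON array here.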

To view a list of all available spiders:

scrapy list

# businessstandard
# businesstoday
# economictimes
# financialexpress
# firstpost
# freepressjournal
# indianexpress
# ipfy
# moneycontrol
# ndtvprofit
# news18
# outlookindia
# thehindu
# thehindubusinessline
# zeenews

To run all the spiders in production:

# run Scrapy's built-in benchmark
scrapy bench

# run all spiders
python run.py

Run tests to check if spiders are still working.

# view the parsed article
scrapy parse https://www.businesstoday.in/markets/stocks/story/upward-revision-in-eps-estimates-what-analysts-say-on-tcs-q1-results-stock-trading-strategy-436794-2024-07-11

# integration test all spiders
pytest

Sitemaps

The sitemaps for each website are not always listed in robots.txt. Googling for keywords like "ndtvprofit.com daily sitemap xml" seems to retrieve the ones that are not mentioned there.

Publisher	Sitemap Type	Sitemap Link
News 18	Daily Sitemap	Link
The Hindu	Daily Sitemap	Link
The Hindu Business Line	Daily Sitemap	Link
Business Today	Daily Sitemap	Link
Money Control	Daily Sitemap	Link
Business Standard	Sitemap Index	Link
Economic Times	Monthly Sitemaps	Link
Firstpost	Daily Sitemap	Link
NDTV Profit	Daily Sitemap	Link
Free Press Journal	Daily Sitemap	Link
Outlook India	Daily Sitemap	Link
Zee News	Monthly Sitemap	Link
Financial Express	Daily Sitemap	Link
Indian Express	Daily Sitemap	Link
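
Checking whether a site lists its sitemaps in robots.txt, as mentioned at the start of this section, can be sketched as below. This is an illustrative helper and not code from this repository: the function name sitemaps_from_robots is hypothetical, and it works on the robots.txt text that you have already downloaded.

```python
from typing import List


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs from every 'Sitemap:' directive in a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments, then split "Key: value" on the first colon.
        line = raw_line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


if __name__ == "__main__":
    sample = "User-agent: *\nSitemap: https://example.com/sitemap.xml\n"
    print(sitemaps_from_robots(sample))
```

An empty result means the sitemap has to be found another way, such as the search approach described above. On Python 3.8+, urllib.robotparser.RobotFileParser.site_maps() offers similar functionality if fetching and parsing in one step is preferred.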

Notes:

TODO

Do not cache recent sitemaps
Run the scraper as prefect flow
Scraping mode - Update/dump
While running the test, if it fails, prevent scrapy from showing the entire output
export PYTHONDONTWRITEBYTECODE=1
pytest fails on a few spiders on the remote server
moneycontrol and indianexpress have very aggressive bot protection. They don't seem to allow even floating IPs from Hetzner, but IPs from Bright Data seem to work

Server Checklist

Attach floating IPs
Prevent pycache
Mount volume

More Sources

The following news websites were considered, but no daily sitemaps were found for them. Some potentially effective strategies (which require more research) for iteratively retrieving a list of all articles are noted below.

https://www.livemint.com/api/cms/story/v2/11720327511606 - Check Content Length in Head. TODO: Check for market slug with a smaller query
https://timesofindia.indiatimes.com/articleshow/81896735.cms - Redirect not showing in head, No sitemap as well.
https://www.indiainfoline.com/news/top-share-market-news/page/14072 - New articles have no ID in the url. Seems to allow old articles to redirect
https://in.investing.com/news/a/a-4293269 - Doesn't redirect to actual url
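
The notes above rely on inspecting HEAD responses (Content-Length, redirects) for ID-based URLs. A minimal sketch of such a probe is shown below, using only the standard library. The names head_probe and ProbeResult are hypothetical, the opener parameter exists only to make the function testable offline, and the sketch has not been validated against any of the sites listed.

```python
import urllib.error
import urllib.request
from typing import Callable, NamedTuple, Optional


class ProbeResult(NamedTuple):
    status: int                    # HTTP status code
    final_url: str                 # URL after any redirects were followed
    content_length: Optional[int]  # Content-Length header, if present


def head_probe(
    url: str,
    opener: Callable = urllib.request.urlopen,
    timeout: float = 10.0,
) -> ProbeResult:
    """Send a HEAD request and report status, final URL and Content-Length."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            length = response.headers.get("Content-Length")
            return ProbeResult(
                response.status,
                response.geturl(),
                int(length) if length is not None else None,
            )
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions by urllib.
        return ProbeResult(error.code, url, None)
```

Comparing final_url with the requested URL shows whether an ID-based URL redirects to the canonical article, and content_length gives a cheap signal for whether an ID resolves to real content. Sites with aggressive bot protection may respond to HEAD differently than to GET, so results should be spot-checked.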
