Skip to content

EleutherAI/hn-scraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 

Repository files navigation

Hacker News Scraper

A scraper for Hacker News (duh)

Usage:

run hn_scraper.py (modify the main function call to change the end_id to parse to)

Filtering Details

Parallelism details

Generating comment trees can be expensive. We use two levels of parallelism to solve this. Items can either be stories or comments, we are given the total number of items on Hacker News (~25,000,000). One pool of workers check if items are stories, and if so initiate another pool of workers that traverse the n comment trees (n is the numebr of good top-level comments) in order to build a single comment chain for each top level comment.

Example:

TODO:

[] Add in flags for better data management [] Fix requests issue

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages