Database of expressions used for extracting content from blogs and articles.
The main database is in JSON5 format, a strict subset of JavaScript; a plain JSON version is also available for convenience.
The extraction expressions are Cheerio selectors, similar to jQuery.
The targeted information is:
- the author
- the date when the article was written
- and of course, the article text, as clean as possible
This project is designed to be used with Clean-Mark, but you can use it however you want.
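As a rough illustration of how a consumer might use such a database, here is a minimal sketch that looks up the expressions for a given article URL. The entry shape (`author`, `date`, `text`), the selector strings, the `example.com` host, and the `findExpressions` helper are all hypothetical and chosen only for this example; check the actual database file for the real field names.

```javascript
// Hypothetical database shape: hostname -> selector strings.
// Field names and selectors are illustrative, not the project's real schema.
const database = {
  'example.com': {
    author: '.byline a',
    date: 'time[datetime]',
    text: 'article .content',
  },
};

// Return the entry matching the URL's hostname, ignoring a leading "www.",
// or null when the site is not in the database.
function findExpressions(db, url) {
  const host = new URL(url).hostname.replace(/^www\./, '');
  return Object.prototype.hasOwnProperty.call(db, host) ? db[host] : null;
}

const entry = findExpressions(database, 'https://www.example.com/2017/some-article');
console.log(entry ? entry.text : 'no entry for this site');
```

The selector strings returned this way could then be passed to Cheerio (or any jQuery-like library) to pull the author, date, and text out of the loaded page.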
- abcnews.go.com
- aeon.co
- agroinfo.ro
- arenait.net
- arstechnica.com
- articles.latimes.com
- artsy.net
- bbc.com
- beta.theglobeandmail.com
- bigthink.com
- bindiribli.ro
- bossfeed.net
- businessinsider.com
- collectivelyconscious.net
- curentul.info
- dailymail.co.uk
- deepdotweb.com
- digi24.ro
- earthsky.org
- edition.cnn.com
- engadget.com
- express.co.uk
- farnamstreetblog.com
- fastcompany.com
- finesociety.ro
- firstpost.com
- foxnews.com
- galacticconnection.com
- gandeste.org
- gazetadambovitei.ro
- gnosticwarrior.com
- hackread.com
- hbr.org
- hotnews.ro
- howtogeek.com
- huffingtonpost.com
- info.localytics.com
- infoalert.ro
- irishmirror.ie
- isgp-studies.com
- jamesclear.com
- jurnalul.ro
- latimes.com
- life.ro
- mashable.com
- merckmanuals.com
- money.cnn.com
- nautil.us
- nbcnews.com
- ncbi.nlm.nih.gov
- neonnettles.com
- news.com.au
- newscientist.com
- newyorker.com
- nytimes.com
- nzherald.co.nz
- observator.tv
- pri.org
- qz.com
- romaniaa.ro
- rt.com
- rts.earth
- smh.com.au
- start-up.ro
- stiri.tvr.ro
- stirileprotv.ro
- techcrunch.com
- techradar.com
- telegraph.co.uk
- theatlantic.com
- theguardian.com
- theliberal.ie
- thenextweb.com
- theverge.com
- thrillist.com
- torrentfreak.com
- usatoday.com
- usnews.com
- vox.com
- wakingtimes.com
- wall-street.ro
- washingtonpost.com
- weforum.org
- wsj.com
- yahoo.com
- ziare.com
Clean-Mark already has algorithms to extract most of the info if the website is SEO-friendly, e.g. if it follows schema.org/Article, Microformats, or the Open Graph protocol.
But it's not a perfect tool 🤖 and it needs help from us humans 🙄
We ❤️ contributions !!!
Want to report a bug, request a feature, or contribute? Contributions are accepted only through the A-Extractor GitHub repository.
The "fork-and-pull" Git workflow:
- Fork the repo on GitHub
- Clone the project to your own machine
- Work on your fork
  - Make your changes and additions
  - Change or add tests if needed
  - Run tests and make sure they pass
  - Add changes to README.md if needed
  - Commit changes to your own branch
- Make sure you merge the latest from "upstream" and resolve any conflicts
- Push your work back up to your fork
- Submit a pull request so that we can review your changes
MIT © Cristi Constantin.