alanalvarado/a-extractor

📃 Article extractor

Database of expressions used for extracting content from blogs and articles.

The main database is in JSON5 format, a strict subset of JavaScript. It is also available as plain JSON, for convenience.

The extraction expressions are written for Cheerio, whose API is similar to jQuery's.

The targeted information is:

  • the author
  • the date when the article was written
  • and of course, the article text, as clean as possible
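
As a rough illustration of how the three targeted fields could map onto a database entry, here is a minimal sketch. The entry shape, the field names (`author`, `date`, `text`), the `example.com` domain, and the selector strings are all assumptions made for this example; check the real database files for the actual layout. The helper `missingFields` is hypothetical too.

```javascript
// Hypothetical entry: field names and selector strings are illustrative
// assumptions, not copied from the real database.
const sampleDb = JSON.parse(`{
  "example.com": {
    "author": ".byline a",
    "date": "time.published",
    "text": "article .content"
  }
}`)

// The three pieces of information the database targets.
const FIELDS = ['author', 'date', 'text']

// Return the names of targeted fields that an entry does not cover.
function missingFields (entry) {
  return FIELDS.filter(
    field => typeof entry[field] !== 'string' || entry[field].trim() === ''
  )
}

console.log(missingFields(sampleDb['example.com'])) // []
console.log(missingFields({ author: '.byline a' })) // [ 'date', 'text' ]
```

A check like this could be handy when adding a new domain, to spot an entry that leaves one of the fields empty.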

This project is designed to be used with Clean-Mark, but you can use it however you want.

86 domains available

  • abcnews.go.com
  • aeon.co
  • agroinfo.ro
  • arenait.net
  • arstechnica.com
  • articles.latimes.com
  • artsy.net
  • bbc.com
  • beta.theglobeandmail.com
  • bigthink.com
  • bindiribli.ro
  • bossfeed.net
  • businessinsider.com
  • collectivelyconscious.net
  • curentul.info
  • dailymail.co.uk
  • deepdotweb.com
  • digi24.ro
  • earthsky.org
  • edition.cnn.com
  • engadget.com
  • express.co.uk
  • farnamstreetblog.com
  • fastcompany.com
  • finesociety.ro
  • firstpost.com
  • foxnews.com
  • galacticconnection.com
  • gandeste.org
  • gazetadambovitei.ro
  • gnosticwarrior.com
  • hackread.com
  • hbr.org
  • hotnews.ro
  • howtogeek.com
  • huffingtonpost.com
  • info.localytics.com
  • infoalert.ro
  • irishmirror.ie
  • isgp-studies.com
  • jamesclear.com
  • jurnalul.ro
  • latimes.com
  • life.ro
  • mashable.com
  • merckmanuals.com
  • money.cnn.com
  • nautil.us
  • nbcnews.com
  • ncbi.nlm.nih.gov
  • neonnettles.com
  • news.com.au
  • newscientist.com
  • newyorker.com
  • nytimes.com
  • nzherald.co.nz
  • observator.tv
  • pri.org
  • qz.com
  • romaniaa.ro
  • rt.com
  • rts.earth
  • smh.com.au
  • start-up.ro
  • stiri.tvr.ro
  • stirileprotv.ro
  • techcrunch.com
  • techradar.com
  • telegraph.co.uk
  • theatlantic.com
  • theguardian.com
  • theliberal.ie
  • thenextweb.com
  • theverge.com
  • thrillist.com
  • torrentfreak.com
  • usatoday.com
  • usnews.com
  • vox.com
  • wakingtimes.com
  • wall-street.ro
  • washingtonpost.com
  • weforum.org
  • wsj.com
  • yahoo.com
  • ziare.com
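
Because the list contains both bare domains and subdomains (for example `latimes.com` and `articles.latimes.com`), a consumer needs some way to pick the right entry for a given URL. The sketch below shows one possible approach: try the full hostname first, then drop leading labels until a listed domain matches. This matching strategy and the `findDomain` helper are assumptions for illustration, not necessarily how Clean-Mark resolves domains.

```javascript
// A few domains taken from the list above; the real database has more.
const domains = new Set([
  'latimes.com',
  'articles.latimes.com',
  'edition.cnn.com',
  'bbc.com'
])

// Return the most specific listed domain matching the URL's hostname,
// or null when none of the listed domains match.
function findDomain (url, known) {
  const labels = new URL(url).hostname.toLowerCase().split('.')
  // Try the full hostname first, then drop leading labels one at a time.
  // Stopping before the last label avoids testing a bare TLD such as "com".
  for (let i = 0; i < labels.length - 1; i++) {
    const candidate = labels.slice(i).join('.')
    if (known.has(candidate)) return candidate
  }
  return null
}

console.log(findDomain('https://www.bbc.com/news/world-1', domains)) // bbc.com
console.log(findDomain('http://articles.latimes.com/2017/x', domains)) // articles.latimes.com
console.log(findDomain('https://cnn.com/', domains)) // null
```

Trying the longest hostname first means a subdomain entry wins over its parent domain when both are listed.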

Important

Clean-Mark already has algorithms that extract most of this information when a website is SEO friendly, e.g. when it follows schema.org/Article, Microformats, or the Open Graph protocol.
But it's not a perfect tool 🤖 and it needs help from us humans 🙄
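
To give an idea of the kind of metadata an SEO-friendly page exposes, here is a small sketch that reads Open Graph style `<meta>` tags from an HTML string. It is not Clean-Mark's actual algorithm: the `readMetaProperties` helper is hypothetical, and the regex assumes double-quoted attributes with `property` before `content`, which real pages do not always follow. When tags like these are missing or wrong, a site-specific expression from this database is what fills the gap.

```javascript
// Collect <meta property="..." content="..."> pairs from an HTML string.
// Regex parsing is a simplification for this sketch; a real tool would
// use an HTML parser instead.
function readMetaProperties (html) {
  const found = {}
  const pattern = /<meta\s+property="([^"]+)"\s+content="([^"]*)"\s*\/?>/gi
  for (const match of html.matchAll(pattern)) {
    found[match[1]] = match[2]
  }
  return found
}

// Made-up sample page, for illustration only.
const page = `
<head>
  <meta property="og:title" content="Sample headline">
  <meta property="article:author" content="Jane Doe">
  <meta property="article:published_time" content="2017-05-01T10:00:00Z">
</head>`

const meta = readMetaProperties(page)
console.log(meta['article:author']) // Jane Doe
console.log(meta['article:published_time']) // 2017-05-01T10:00:00Z
```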

Contributions

We ❤️ contributions !!!

Want to report a bug, request a feature, or contribute? Contributions are accepted only through the A-Extractor GitHub repository.

The "fork-and-pull" Git workflow:

  1. Fork the repo on GitHub
  2. Clone the project to your own machine
  3. Work on your fork
    1. Make your changes and additions
    2. Change or add tests if needed
    3. Run tests and make sure they pass
    4. Add changes to README.md if needed
  4. Commit changes to your own branch
  5. Make sure you merge the latest from "upstream" and resolve conflicts if there are any
  6. Push your work back up to your fork
  7. Submit a pull request so that we can review your changes

License

MIT © Cristi Constantin.
