Extract an article from its HTML into a text file without needing pre-defined rules in your scraper. This tool is an experiment in using CRF for article content extraction.

feryandi/CRF-Article-Extractor

Tool: CRF Article Extractor

This is an experiment in using a CRF (conditional random field) for article content extraction. When you are trying to get clean data from a website, the extraction step usually gets in the way. For example, I want to extract news data from certain online media outlets, and it has reached the point where I need automatic content extraction instead of defining an XPath for every website out there. The goal of this tool is to extract article content easily and with minimal errors.

How to Use

Install all the required dependencies first.

To create a model, run:

python generate_model.py

This generates the model in the model folder, as well as pickled, training-ready data (built from the dataset) in the pickle folder.

To extract an article, run:

python extract.py --url some.url.com

Note that this is not production-ready and needs more work before it is ready for real use.

Experiment

I use CRFsuite through its Python binding (python-crfsuite) as the CRF implementation, with L-BFGS as the training algorithm. The training set has only 25 samples, the validation set 10, and the test set 5; the test samples come from websites never seen in the training data. Although the dataset is really small, overall performance is decent.
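As a rough illustration of the setup described above, here is a minimal sketch of training and applying a CRF with python-crfsuite and L-BFGS. It is not the code in generate_model.py: the function names, the hyperparameter values in LBFGS_PARAMS, and the model path are all assumptions chosen for the example.

```python
# Hypothetical hyperparameters; the values actually used by this project are not documented.
LBFGS_PARAMS = {
    "c1": 0.1,                             # L1 regularization coefficient
    "c2": 0.01,                            # L2 regularization coefficient
    "max_iterations": 100,                 # stop L-BFGS after this many iterations
    "feature.possible_transitions": True,  # also learn transitions unseen in training
}


def count_tokens(sequences, label_sequences):
    """Check that features and labels line up, and return the total token count."""
    if len(sequences) != len(label_sequences):
        raise ValueError("need exactly one label sequence per feature sequence")
    total = 0
    for xseq, yseq in zip(sequences, label_sequences):
        if len(xseq) != len(yseq):
            raise ValueError("each token needs exactly one label")
        total += len(xseq)
    return total


def train_crf(sequences, label_sequences, model_path, params=LBFGS_PARAMS):
    """Train a CRF with L-BFGS and write the model to model_path."""
    import pycrfsuite  # third-party: pip install python-crfsuite

    count_tokens(sequences, label_sequences)
    trainer = pycrfsuite.Trainer(algorithm="lbfgs", verbose=False)
    trainer.set_params(params)
    for xseq, yseq in zip(sequences, label_sequences):
        # xseq: one feature dict per text block of a page; yseq: "content" or "ignore"
        trainer.append(xseq, yseq)
    trainer.train(model_path)


def tag_sequence(xseq, model_path):
    """Label one page's feature sequence with a previously trained model."""
    import pycrfsuite

    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    return tagger.tag(xseq)
```

Each page is treated as one sequence, so the CRF can use the labels of neighbouring blocks when deciding whether a block is content. Calling train_crf(X, Y, "model/article.crfsuite") followed by tag_sequence(X[0], "model/article.crfsuite") would return one label per block; the path here is only a placeholder.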

The features are: tag, parent tag, tag chain (tag and parent tag), length text before, length text after, length text content, and word count.
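To make the feature list concrete, the sketch below computes the seven features for every text block of a page using only the standard library. It is an interpretation, not the project's extractor: the class and function names are made up, and "length text before/after" is assumed to mean the number of text characters that precede or follow the block in the whole document.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Tags whose text is never article content.
SKIP_TAGS = {"script", "style"}


class TextBlockCollector(HTMLParser):
    """Collect each non-empty text node with its enclosing tag and parent tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.blocks = []  # (tag, parent_tag, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or not self.stack or self.stack[-1] in SKIP_TAGS:
            return
        tag = self.stack[-1]
        parent = self.stack[-2] if len(self.stack) > 1 else "ROOT"
        self.blocks.append((tag, parent, text))


def extract_features(html):
    """Return one feature dict per text block, in document order."""
    collector = TextBlockCollector()
    collector.feed(html)
    collector.close()
    lengths = [len(text) for _, _, text in collector.blocks]
    total = sum(lengths)
    features, seen = [], 0
    for (tag, parent, text), length in zip(collector.blocks, lengths):
        features.append({
            "tag": tag,
            "parent_tag": parent,
            "tag_chain": parent + ">" + tag,
            "len_before": seen,                  # text characters before this block
            "len_after": total - seen - length,  # text characters after this block
            "len_content": length,
            "word_count": len(text.split()),
        })
        seen += length
    return features
```

None of these features looks at the words themselves, only at markup and text lengths, which fits the claim that the approach is not tied to a particular language.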

Compared to a similar CRF experiment, Victor: the Web-Page Cleaning Tool, this one has better performance (based on precision and recall) and uses fewer features, which makes it more general (the test data contains 4 different languages). However, since the dataset here is really small, I can't guarantee that this holds.

Validation Data Result

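Evaluation

Metric      %
Precision   96
Recall      86
F1          91

Confusion Matrix (rows: actual, columns: predicted, as implied by the precision and recall above)

actual \ predicted   content   ignore
content              97        16
ignore               4         9419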

Test Data Result

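Evaluation

Metric      %
Precision   91
Recall      93
F1          92

Confusion Matrix (rows: actual, columns: predicted, as implied by the precision and recall above)

actual \ predicted   content   ignore
content              71        5
ignore               7         2640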
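The reported percentages can be re-derived from the confusion matrices. The sketch below assumes that each row is an actual label and each column a predicted label, with "content" as the positive class; that reading is an inference, but it is the one under which both matrices reproduce the reported numbers. The function name is made up for the example.

```python
def precision_recall_f1(tp, fn, fp):
    """Compute precision, recall and F1 for the positive ("content") class."""
    precision = tp / (tp + fp)  # share of predicted content that really is content
    recall = tp / (tp + fn)     # share of real content that was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# tp = content predicted as content, fn = content predicted as ignore,
# fp = ignore predicted as content (values copied from the matrices above).
validation = precision_recall_f1(tp=97, fn=16, fp=4)
test = precision_recall_f1(tp=71, fn=5, fp=7)

for name, scores in (("validation", validation), ("test", test)):
    print(name, [round(100 * s) for s in scores])
# validation [96, 86, 91]
# test [91, 93, 92]
```

The large "ignore"/"ignore" cell is not used by any of the three metrics, which is why they are more informative here than plain accuracy would be on such an imbalanced label set.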
