Tool: CRF Article Extractor

This is an experiment in using a conditional random field (CRF) for article content extraction. When you are trying to get clean data from websites, the extraction step usually gets in the way. For example, I want to extract news data from certain online media outlets, and it has reached the point where I need automatic content extraction instead of defining an XPath for every website out there. The goal of this tool is to extract article content easily and with minimal errors.

How to Use

Install the requirements first (they are listed in requirements.txt).

To create a model, run:

    python generate_model.py

It writes the model to the model folder and saves pickled, training-ready data, built from the dataset, to the pickle folder.

To extract an article with the model, run:

    python extract.py --url some.url.com

Note that this is not production-ready; it needs more work before it is fit for real use.

Experiment

I use CRFsuite through its Python binding (python-crfsuite) as the CRF implementation, with L-BFGS as the training algorithm. The training set has only 25 samples, the validation set 10, and the test set 5; the test samples come from websites that never appear in the training data. Although the dataset is very small, overall performance is decent.
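One way to keep test websites out of the training data is to split by site rather than by page. The snippet below is a minimal sketch of that idea, not the script used in this repository: the function name `hold_out_sites`, the seed, and the example URLs are all hypothetical.

```python
import random
from collections import defaultdict
from urllib.parse import urlparse


def hold_out_sites(urls, n_test_sites, seed=0):
    """Split URLs so that every page of a held-out site lands in the test set."""
    by_site = defaultdict(list)
    for url in urls:
        by_site[urlparse(url).netloc].append(url)

    # Shuffle the sorted site names so the split is reproducible for a given seed.
    sites = sorted(by_site)
    random.Random(seed).shuffle(sites)
    held_out = set(sites[:n_test_sites])

    test = [u for u in urls if urlparse(u).netloc in held_out]
    rest = [u for u in urls if urlparse(u).netloc not in held_out]
    return rest, test


# Hypothetical URLs, for illustration only.
urls = [
    "https://news-a.example/story-1",
    "https://news-a.example/story-2",
    "https://news-b.example/story-1",
    "https://news-c.example/story-1",
]
rest, test = hold_out_sites(urls, n_test_sites=1)
```

Because whole sites are held out, no page in `test` shares a domain with a page in `rest`, which is what makes the test score a measure of generalization to unseen layouts.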

The features are: tag, parent tag, tag chain (tag and parent tag), length of the text before, length of the text after, length of the text content, and word count.
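To make the feature list concrete, here is a minimal sketch that computes the seven features for each text node of a page using only the standard library. It is not the code in src. It assumes that "text before" and "text after" mean the previous and next text node in document order. The names `TextBlockCollector` and `extract_features` are hypothetical, as are the feature key names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}  # their text is code, not page content


class TextBlockCollector(HTMLParser):
    """Collect every non-empty text node together with its tag and parent tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.blocks = []  # (tag, parent_tag, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text or (self.stack and self.stack[-1] in SKIP_TAGS):
            return
        tag = self.stack[-1] if self.stack else "root"
        parent = self.stack[-2] if len(self.stack) > 1 else "root"
        self.blocks.append((tag, parent, text))


def extract_features(html):
    """Return one feature dict per text node, in document order."""
    parser = TextBlockCollector()
    parser.feed(html)
    parser.close()
    blocks = parser.blocks
    features = []
    for i, (tag, parent, text) in enumerate(blocks):
        features.append({
            "tag": tag,
            "parent_tag": parent,
            "tag_chain": f"{parent}>{tag}",
            # Assumption: neighbours are the adjacent text nodes.
            "len_before": len(blocks[i - 1][2]) if i > 0 else 0,
            "len_after": len(blocks[i + 1][2]) if i + 1 < len(blocks) else 0,
            "len_content": len(text),
            "word_count": len(text.split()),
        })
    return features


page = ("<html><body><div><h1>Big news</h1>"
        "<p>The quick brown fox jumps.</p></div>"
        "<footer><a href='#'>Contact</a></footer></body></html>")
features = extract_features(page)
```

None of the features look at the words themselves, only at markup and lengths, which fits the claim that a small feature set can carry across languages.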

Compared with the similar CRF experiment in "Victor: the Web-Page Cleaning Tool", this one achieves higher precision and recall while using fewer features, which should make it more general (the test data covers 4 different languages). Because the dataset here is very small, though, I can't guarantee that the comparison holds.

Validation Data Result

Evaluation

Metric	%
Precision	96
Recall	86
F1	91

Confusion Matrix

	Predicted content	Predicted ignore
Actual content	97	16
Actual ignore	4	9419

Test Data Result

Evaluation

Metric	%
Precision	91
Recall	93
F1	92

Confusion Matrix

	Predicted content	Predicted ignore
Actual content	71	5
Actual ignore	7	2640
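The reported scores can be recomputed from the two confusion matrices. The sketch below assumes that rows are actual labels, columns are predicted labels, and "content" is the positive class; that reading is an assumption, but it is the one that reproduces the percentages given for both the validation and the test data. The function name `scores` is hypothetical.

```python
def scores(tp, fn, fp):
    """Precision, recall, and F1 for the positive ("content") class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed layout: first row = actual content (TP, FN),
# first column of second row = actual ignore predicted as content (FP).
results = {
    "validation": scores(tp=97, fn=16, fp=4),
    "test": scores(tp=71, fn=5, fp=7),
}

for name, (p, r, f1) in results.items():
    print(f"{name}: precision={p:.0%} recall={r:.0%} F1={f1:.0%}")
# validation: precision=96% recall=86% F1=91%
# test: precision=91% recall=93% F1=92%
```

The large "ignore" counts (9419 and 2640) do not enter any of these formulas, which is why precision and recall are more informative here than plain accuracy.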

Repository: feryandi/CRF-Article-Extractor