Web Data Extraction In Python

This project examines web content extraction libraries for Python, including BeautifulSoup, lxml, and regular expressions (regex). You supply only the element information; the extraction pattern is prepared by the functions.

lxml

Some websites that introduce BeautifulSoup recommend installing lxml and using it as the parser for speed. lxml can also be used stand-alone, where it is more efficient.

import lxml_ext

result_list = lxml_ext.extract_all(data, pattern)
result = lxml_ext.extract(data, pattern)

data: the web page

pattern: an HTML element, for example:

<h1 class="test">

extract_all returns all extractions for a pattern in a web page as a list of strings. extract returns the first extraction result in a web page as a string.
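One way the pattern could be prepared for lxml is to turn the element into an XPath expression. The sketch below is an assumption about how that might work, not the code in lxml_ext: parse_pattern and pattern_to_xpath are hypothetical helper names, and only the standard library is used to build the expression.

```python
from html.parser import HTMLParser


class _PatternParser(HTMLParser):
    """Records the first start tag seen, together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tag = None
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        if self.tag is None:  # keep only the first start tag
            self.tag = tag
            self.attrs = dict(attrs)


def parse_pattern(pattern):
    """Split a pattern such as '<h1 class="test">' into (tag, attrs)."""
    parser = _PatternParser()
    parser.feed(pattern)
    parser.close()
    if parser.tag is None:
        raise ValueError(f"no start tag found in pattern: {pattern!r}")
    return parser.tag, parser.attrs


def pattern_to_xpath(pattern):
    """Build an XPath expression that matches the tag and exact attribute values.

    Assumes attribute values contain no double quotes.
    """
    tag, attrs = parse_pattern(pattern)
    conditions = "".join(
        f"[@{name}]" if value is None else f'[@{name}="{value}"]'
        for name, value in attrs.items()
    )
    return f"//{tag}{conditions}"


print(pattern_to_xpath('<h1 class="test">'))  # //h1[@class="test"]
```

The resulting string can be passed to the xpath() method of a tree parsed with lxml.html.fromstring(data).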

BeautifulSoup

BeautifulSoup can use different parsers, including html.parser, lxml, and html5lib. For example:

import bsoup
parser = "html.parser"  # parser name, given as a string
result_list = bsoup.extract_all(data, pattern, parser)
result = bsoup.extract(data, pattern, parser)
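For readers who want to see the underlying BeautifulSoup calls, here is a minimal sketch of what such functions might do. It is an assumption, not the code in bsoup: extract_all_sketch and extract_sketch are hypothetical names, they take the tag and attributes separately rather than a pattern string, and returning the inner HTML is an illustrative choice.

```python
from bs4 import BeautifulSoup


def extract_all_sketch(data, tag, attrs, parser="html.parser"):
    """Return the inner HTML of every element matching tag and attrs."""
    soup = BeautifulSoup(data, parser)
    return [element.decode_contents() for element in soup.find_all(tag, attrs=attrs)]


def extract_sketch(data, tag, attrs, parser="html.parser"):
    """Return the inner HTML of the first match, or None if nothing matches."""
    soup = BeautifulSoup(data, parser)
    element = soup.find(tag, attrs=attrs)
    return element.decode_contents() if element is not None else None


page = '<h1 class="test">First</h1><p>skip</p><h1 class="test">Second</h1>'
print(extract_all_sketch(page, "h1", {"class": "test"}))  # ['First', 'Second']
print(extract_sketch(page, "h1", {"class": "test"}))      # First
```

Passing "lxml" or "html5lib" as the parser argument works the same way, provided the corresponding package is installed.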

Regex

Regular expressions are a well-known and efficient technique for extraction. However, they can cause problems when the number of inner tags is ambiguous.

import regex
result_list = regex.extract_all(data, pattern, parser)
result = regex.extract(data, pattern, parser)
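The inner-tag problem can be illustrated with Python's standard re module. This is a minimal sketch under stated assumptions, not the code in regex.py: build_regex is a hypothetical helper, and it matches a single attribute written with double quotes.

```python
import re


def build_regex(tag, attr, value):
    """Compile a regex capturing the text between <tag attr="value"> and the next </tag>."""
    name = re.escape(tag)
    opening = rf'<{name}\b[^>]*\s{re.escape(attr)}="{re.escape(value)}"[^>]*>'
    # The lazy group stops at the first closing tag, whatever its nesting depth.
    return re.compile(rf"{opening}(.*?)</{name}>", re.DOTALL | re.IGNORECASE)


pattern = build_regex("div", "class", "test")

flat = '<div class="test">Hello <b>world</b></div>'
print(pattern.findall(flat))    # ['Hello <b>world</b>']

nested = '<div class="test">outer <div>inner</div> tail</div>'
print(pattern.findall(nested))  # ['outer <div>inner']
```

In the nested case the match is cut short at the first closing div, so the text after the inner element is lost. A regular expression of this form cannot tell how many inner tags of the same name it has to skip.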

Publications

Comparison of Python Libraries used for Web Data Extraction. Uzun, E.; Yerlikaya, T.; and Kırat, O. In 7th International Scientific Conference “TechSys 2018” – Engineering, Technologies and Systems, Technical University of Sofia, Plovdiv Branch May 17-19, pages 108-113, 2018.

