Web Data Extraction In Python

This project examines web content extraction libraries, including BeautifulSoup, lxml, and regex. You supply only the element information; the functions prepare the extraction pattern.

lxml

Some websites that introduce BeautifulSoup recommend installing lxml and using it as the parser for speed. lxml can also be used stand-alone, which is more efficient.

import lxml_ext

result_list = lxml_ext.extract_all(data, pattern)
result = lxml_ext.extract(data, pattern)

data: the web page
pattern: an HTML element, given as its opening tag, for example:

<h1 class="test">

The extract_all method returns all extractions for a pattern in a web page as a list of strings. The extract method returns the first extraction result in a web page as a string.
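As a rough illustration of how an element pattern could be prepared for lxml, the sketch below turns an opening tag into an XPath expression using only the standard library. The helper name `pattern_to_xpath` is hypothetical and is not taken from lxml_ext.py; whether the project builds its patterns this way is an assumption.

```python
import re


def pattern_to_xpath(pattern: str) -> str:
    """Turn an opening tag such as '<h1 class="test">' into an XPath expression.

    Hypothetical helper for illustration; not part of lxml_ext.py.
    """
    # The tag name is the first identifier after '<'.
    tag_match = re.match(r"\s*<\s*([A-Za-z][A-Za-z0-9]*)", pattern)
    if tag_match is None:
        raise ValueError(f"not an opening tag: {pattern!r}")
    tag = tag_match.group(1).lower()

    # Collect every name="value" pair (double-quoted values only).
    attrs = re.findall(r'([A-Za-z_:][-A-Za-z0-9_:.]*)\s*=\s*"([^"]*)"', pattern)

    # Each attribute becomes an exact-match XPath predicate.
    predicates = "".join(f'[@{name}="{value}"]' for name, value in attrs)
    return f"//{tag}{predicates}"


xpath = pattern_to_xpath('<h1 class="test">')
print(xpath)  # //h1[@class="test"]
```

An expression like this could then be evaluated with lxml, for instance `lxml.html.fromstring(data).xpath(xpath)`, taking every match for an extract_all-style result or only the first match for an extract-style result.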

BeautifulSoup

BeautifulSoup can use different parsers, including html.parser, lxml, and html5lib. For example:

import bsoup
parser = "html.parser"  # parser name as a string; "lxml" and "html5lib" are the other options
result_list = bsoup.extract_all(data, pattern, parser)
result = bsoup.extract(data, pattern, parser)
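For BeautifulSoup, an element pattern would most naturally be split into a tag name and an attribute dictionary. The sketch below does that with the standard library's `html.parser`; the class `_OpeningTag` and the function `split_pattern` are hypothetical names introduced here, and it is an assumption that bsoup.py works along these lines.

```python
from html.parser import HTMLParser


class _OpeningTag(HTMLParser):
    """Records the first opening tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.name = None
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        # Keep only the first tag; later tags in the pattern are ignored.
        if self.name is None:
            self.name = tag
            self.attrs = dict(attrs)


def split_pattern(pattern):
    """Split '<h1 class="test">' into ('h1', {'class': 'test'})."""
    parser = _OpeningTag()
    parser.feed(pattern)
    parser.close()
    if parser.name is None:
        raise ValueError(f"no opening tag found in {pattern!r}")
    return parser.name, parser.attrs


name, attrs = split_pattern('<h1 class="test">')
print(name, attrs)  # h1 {'class': 'test'}
```

The resulting pair fits BeautifulSoup's search methods: `soup.find_all(name, attrs=attrs)` returns every matching element and `soup.find(name, attrs=attrs)` returns the first one.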

Regex

Regular expressions are a well-known and efficient technique that can be used in the extraction process. However, they can cause problems when the number of inner tags is ambiguous, for example when the target element contains nested tags of the same name.

import regex
result_list = regex.extract_all(data, pattern, parser)
result = regex.extract(data, pattern, parser)
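The inner-tag problem can be reproduced with the standard library's `re` module alone. The sketch below is independent of regex.py, and the sample HTML strings are made up for illustration: a lazy match stops at the first closing tag, while a greedy match runs to the last one.

```python
import re

# A target element that contains a nested tag of the same name.
nested = '<div class="test">outer <div>inner</div> tail</div>'

# Lazy match: stops at the first </div>, cutting the outer element short.
lazy = re.search(r'<div class="test">(.*?)</div>', nested, re.DOTALL).group(1)
print(lazy)  # outer <div>inner

# Greedy match: runs to the last </div>, which happens to work here...
greedy = re.search(r'<div class="test">(.*)</div>', nested, re.DOTALL).group(1)
print(greedy)  # outer <div>inner</div> tail

# ...but overshoots as soon as two matching elements sit side by side.
siblings = '<div class="test">a</div><div class="test">b</div>'
overshoot = re.search(r'<div class="test">(.*)</div>', siblings, re.DOTALL).group(1)
print(overshoot)  # a</div><div class="test">b
```

Neither quantifier can count how many closing tags belong to the element, which is why a parser-based approach is safer when the nesting depth is not known in advance.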

Publications

Comparison of Python Libraries used for Web Data Extraction. Uzun, E.; Yerlikaya, T.; and Kırat, O. In 7th International Scientific Conference “TechSys 2018” – Engineering, Technologies and Systems, Technical University of Sofia, Plovdiv Branch May 17-19, pages 108-113, 2018.
