Combine XPath, CSS selectors, and JSONPath for web data extraction.
Install the stable version from PyPI.
pip install "data-extractor[jsonpath-extractor]" # for extracting JSON data
pip install "data-extractor[lxml]" # for extracting HTML data
Or install the latest version from GitHub.
pip install "data-extractor[jsonpath-extractor] @ git+https://github.com/linw1995/data_extractor.git@master"
Extracting JSON data currently requires one of the supported optional dependencies, for example jsonpath-extractor.
Install one of them to extract JSON data.
Extracting HTML (XML) data currently requires the optional dependencies below:
- lxml for using XPath
- cssselect for using CSS selectors
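To illustrate the kind of query an XPath backend evaluates, here is a minimal sketch that uses only the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. It is not lxml and not data_extractor's own API; the sample markup and the `extract_users` helper are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, used only for this illustration.
DOC = """
<html>
  <body>
    <div class="user"><span class="name">john</span><span class="age">19</span></div>
    <div class="user"><span class="name">jack</span></div>
  </body>
</html>
"""


def extract_users(root):
    """Collect one dict per <div class="user"> element."""
    users = []
    # ElementTree supports simple attribute predicates such as [@class='user'].
    for node in root.findall(".//div[@class='user']"):
        name = node.findtext("span[@class='name']")
        # Fall back to "17" when the age element is missing.
        age = node.findtext("span[@class='age']", default="17")
        users.append({"name": name, "age": int(age)})
    return users


root = ET.fromstring(DOC)
users = extract_users(root)
print(users)
```

lxml accepts full XPath 1.0 expressions, so real-world queries can be considerably richer than the subset shown here.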
from data_extractor import Field, Item, JSONExtractor


class Count(Item):
    followings = Field(JSONExtractor("countFollowings"))
    fans = Field(JSONExtractor("countFans"))


class User(Item):
    name_ = Field(JSONExtractor("name"), name="name")
    age = Field(JSONExtractor("age"), default=17)
    count = Count()


assert User(JSONExtractor("data.users[*]"), is_many=True).extract(
    {
        "data": {
            "users": [
                {
                    "name": "john",
                    "age": 19,
                    "countFollowings": 14,
                    "countFans": 212,
                },
                {
                    "name": "jack",
                    "description": "",
                    "countFollowings": 54,
                    "countFans": 312,
                },
            ]
        }
    }
) == [
    {"name": "john", "age": 19, "count": {"followings": 14, "fans": 212}},
    {"name": "jack", "age": 17, "count": {"followings": 54, "fans": 312}},
]
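For comparison, the same result can be produced by hand in plain Python. The sketch below is not part of data_extractor; the function names `extract_user` and `extract_users` are hypothetical. It only spells out what the declarative example expresses: the `age` field falls back to a default of 17, the output key is `name`, and the two counters are nested under `count`.

```python
def extract_user(raw, default_age=17):
    """Map one raw user record to the flattened output shape."""
    return {
        "name": raw["name"],
        # Missing "age" falls back to the default, like default=17 above.
        "age": raw.get("age", default_age),
        # Nested item built from two top-level keys of the raw record.
        "count": {
            "followings": raw["countFollowings"],
            "fans": raw["countFans"],
        },
    }


def extract_users(document):
    """Apply extract_user to every entry under data.users."""
    return [extract_user(raw) for raw in document["data"]["users"]]


sample = {
    "data": {
        "users": [
            {"name": "john", "age": 19, "countFollowings": 14, "countFans": 212},
            {"name": "jack", "description": "", "countFollowings": 54, "countFans": 312},
        ]
    }
}
result = extract_users(sample)
print(result)
```

The declarative version keeps the field paths, defaults, and nesting in one place, which tends to be easier to maintain as the number of fields grows.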
Feature
- Generic extractor with convertor (#83)
- mypy plugin for type annotation of extracting result (#83)
Build
- upgrade jsonpath-extractor to v0.8.0
Clone the source code from GitHub.
git clone https://github.com/linw1995/data_extractor.git
cd data_extractor
Set up the development environment. Make sure the pdm, pre-commit, and nox CLIs are installed in your environment.
make init
make PYTHON=3.7 init # for a specific Python version
Use pre-commit to install linters that enforce a consistent code style.
make pre-commit
Run linters. Some linters run via the nox CLI, so make sure it is installed.
make check-all
Run quick tests.
make
Run quick tests with verbose output.
make vtest
Run tests with coverage. Testing in multiple Python environments is powered by the nox CLI.
make cov