Data Extractor

Combine XPath, CSS selectors, and JSONPath for web data extraction.

Quickstarts

Installation

Install the stable version from PyPI.

pip install "data-extractor[jsonpath-extractor]"  # for extracting JSON data
pip install "data-extractor[lxml]"  # for extracting HTML data

Or install the latest version from GitHub.

pip install "data-extractor[jsonpath-extractor] @ git+https://github.com/linw1995/data_extractor.git@master"
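After installing, it can be useful to confirm which optional dependencies are importable in the current environment. The snippet below is a minimal sketch using only the standard library. The import names are assumptions: jsonpath-extractor is assumed to provide a module named `jsonpath`, while lxml and cssselect are assumed to be importable under their package names.

```python
from importlib.util import find_spec

# Import names are assumptions: jsonpath-extractor is assumed to install
# the "jsonpath" module; lxml and cssselect use their package names.
OPTIONAL_MODULES = ("jsonpath", "lxml", "cssselect")

# find_spec returns None for a top-level module that cannot be found.
available = {name: find_spec(name) is not None for name in OPTIONAL_MODULES}

for name, found in available.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```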

Extract JSON data

Extracting JSON data relies on an optional dependency such as jsonpath-extractor. Install one of the supported dependencies to enable it.

Extract HTML(XML) data

Extracting HTML (XML) data is currently supported through the following optional dependencies:

lxml for using XPath
cssselect for using CSS-Selectors
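To illustrate what an XPath query selects, here is a small sketch that uses the standard library's `xml.etree.ElementTree` as a stand-in. It is not lxml and not this library's API, and ElementTree supports only a subset of XPath. The sample document is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration.
DOCUMENT = """
<html>
  <body>
    <ul>
      <li class="title">First</li>
      <li class="title">Second</li>
      <li class="note">Skipped</li>
    </ul>
  </body>
</html>
""".strip()

root = ET.fromstring(DOCUMENT)

# XPath: every <li> whose class attribute equals "title".
# For this document, the CSS selector li.title matches the same elements.
titles = [li.text for li in root.findall(".//li[@class='title']")]
print(titles)
```

With lxml and cssselect installed, the same elements could be addressed by either the XPath expression or the CSS selector.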

Usage

from data_extractor import Field, Item, JSONExtractor


class Count(Item):
    followings = Field(JSONExtractor("countFollowings"))
    fans = Field(JSONExtractor("countFans"))


class User(Item):
    name_ = Field(JSONExtractor("name"), name="name")
    age = Field(JSONExtractor("age"), default=17)
    count = Count()


assert User(JSONExtractor("data.users[*]"), is_many=True).extract(
    {
        "data": {
            "users": [
                {
                    "name": "john",
                    "age": 19,
                    "countFollowings": 14,
                    "countFans": 212,
                },
                {
                    "name": "jack",
                    "description": "",
                    "countFollowings": 54,
                    "countFans": 312,
                },
            ]
        }
    }
) == [
    {"name": "john", "age": 19, "count": {"followings": 14, "fans": 212}},
    {"name": "jack", "age": 17, "count": {"followings": 54, "fans": 312}},
]
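For comparison, the same result can be computed by hand in plain Python. This sketch is not how the library is implemented; the function name `extract_users` and the constant `DEFAULT_AGE` are hypothetical and exist only to show what the declarative `User` item above expresses: iterate over `data.users[*]`, rename fields, fall back to a default age, and build the nested `count` item from the same element.

```python
from typing import Any, Dict, List

DEFAULT_AGE = 17  # mirrors default=17 on the age field above


def extract_users(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Hand-written equivalent of the User item (illustration only)."""
    results = []
    for user in payload["data"]["users"]:  # corresponds to data.users[*]
        results.append(
            {
                "name": user["name"],
                # Fall back to the default when "age" is absent.
                "age": user.get("age", DEFAULT_AGE),
                # The nested Count item reads from the same user element.
                "count": {
                    "followings": user["countFollowings"],
                    "fans": user["countFans"],
                },
            }
        )
    return results
```

The declarative version keeps the field mapping in one place and avoids repeating this loop for every payload shape.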

Changelog

Unreleased

Feature

Generic extractor with convertor (#83)
mypy plugin for type annotation of extracting result (#83)

v0.10.2

Build

upgrade jsonpath-extractor to v0.8.0

Contributing

Environment Setup

Clone the source code from GitHub.

git clone https://github.com/linw1995/data_extractor.git
cd data_extractor

Set up the development environment. Make sure the pdm, pre-commit, and nox CLIs are installed in your environment.

make init
make PYTHON=3.7 init  # for specific python version

Linting

Use pre-commit to install the linters that enforce a consistent code style.

make pre-commit

Run the linters. Some linters run through the nox CLI, so make sure it is installed.

make check-all

Testing

Run quick tests.

make

Run quick tests with verbose output.

make vtest

Run tests with coverage. Testing in multiple Python environments is powered by the nox CLI.

make cov
