Frictionless Data Implementations
General documentation container and issue tracker for the Frictionless Data stack.
- row number - row number starting from 1
- col number - column number starting from 1
- headers - array of column names
- row - array of column values
- keyed row - dictionary of column values
- extended row - array containing row number, headers and row
- descriptor - dictionary describing something
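The terminology above can be illustrated with a small Python sketch (the variable names and sample values are ours, for illustration only):

```python
# Illustrative only: the same physical table row in each of the forms
# described in the glossary above.

headers = ['id', 'name']                  # array of column names
row = [1, 'english']                      # array of column values
keyed_row = {'id': 1, 'name': 'english'}  # dictionary of column values
extended_row = (1, ['id', 'name'], [1, 'english'])  # row number, headers, row

# A keyed row is just headers zipped with values;
# an extended row additionally carries the row number.
assert keyed_row == dict(zip(headers, row))
assert extended_row == (1, headers, row)
```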
There are two levels of implementation: Python libraries are intended to be at the extended level, while implementations in all other languages should focus on the basic level (which includes tableschema without the API marked as extended).
This interface is described in a pseudo-declarative language to provide general functional requirements and naming conventions for implementations. The language is based on Python idioms; implementations in other languages should follow their own language idioms (e.g. exception handling, asynchronous functions, etc.).
[extended implementation]

```
validate(source,
         preset='table',
         checks='all',
         table_limit=10,
         row_limit=1000,
         error_limit=1000,
         order_fields=False,
         infer_fields=False,
         custom_presets=[],
         custom_checks=[],
         **options) -> report
~@check(code, type, context, position)
~@preset(name)
exceptions
spec
~cli
```
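A minimal sketch of the `validate(source, ...) -> report` contract above, assuming the source is given as a list of rows with the first row as headers. The check names (`blank-row`, `extra-value`) and the exact report keys are our assumptions for illustration, not a normative API:

```python
def validate(source, row_limit=1000, error_limit=1000):
    """Validate a table given as a list of rows (first row = headers).

    Implements just two illustrative checks: blank rows and rows with
    more values than there are headers.
    """
    headers, *rows = source
    errors = []
    for row_number, row in enumerate(rows[:row_limit], start=2):
        if not any(cell not in (None, '') for cell in row):
            errors.append({'code': 'blank-row', 'row-number': row_number})
        elif len(row) > len(headers):
            errors.append({'code': 'extra-value', 'row-number': row_number})
        if len(errors) >= error_limit:
            break
    return {'valid': not errors, 'error-count': len(errors), 'errors': errors}

report = validate([['id', 'name'], [1, 'english'], ['', '']])
```

A real implementation would dispatch on `preset`, load the source via the tabular stack, and expose checks through the `~@check` decorator contract.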
[basic implementation]

```
Package(descriptor, base_path=None, strict=False)
    valid -> bool
    errors -> list
    profile -> Profile
    descriptor -> dict
    resources -> Resource
    resource_names -> str
    get_resource(name) -> Resource/None
    add_resource(descriptor) -> Resource/None
    remove_resource(name) -> Resource/None
    infer(pattern=False) -> descriptor
    commit(strict=None) -> bool
    save(target) -> bool
Resource(descriptor, base_path=None, strict=False)
    valid -> bool
    errors -> list
    profile -> Profile
    descriptor -> dict
    name -> str
    inline -> bool
    local -> bool
    remote -> bool
    multipart -> bool
    tabular -> bool
    source -> data/path
    headers -> str
    schema -> tableschema.Schema
    iter(keyed/extended=False, cast=True, relations=False) -> (generator) (keyed/extended)row
    read(keyed/extended=False, cast=True, relations=False, limit=None) -> (keyed/extended)row
    check_relations() -> True/raise
    raw_iter() -> bytes
    raw_read() -> bytes
    infer() -> descriptor
    commit(strict=None) -> bool
    save(target) -> bool
Profile(profile)
    name -> str
    jsonschema -> dict
    validate(descriptor) -> True/raise
validate(descriptor) -> True/raise
infer(pattern, basePath) -> descriptor
exceptions
```

[extended implementation]

```
DataPackage(descriptor, base_path=None, strict=True, storage=None, **options)
    save(target, storage=None, **options)
Resource(descriptor, base_path=None)
config(**options)
~cli
```
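A minimal sketch of the `Package`/`Resource` part of the interface above, assuming the simplest possible descriptor handling. A real implementation adds profile validation, inference, saving, and all the flags shown in the contract:

```python
class Resource:
    """Bare-bones resource: wraps a descriptor dictionary."""

    def __init__(self, descriptor, base_path=None):
        self.descriptor = descriptor
        self.base_path = base_path

    @property
    def name(self):
        return self.descriptor.get('name')


class Package:
    """Bare-bones package: a descriptor with a list of resource descriptors."""

    def __init__(self, descriptor=None, base_path=None, strict=False):
        self.descriptor = descriptor or {'resources': []}
        self.base_path = base_path
        self.strict = strict

    @property
    def resources(self):
        return [Resource(d, self.base_path)
                for d in self.descriptor.get('resources', [])]

    @property
    def resource_names(self):
        return [resource.name for resource in self.resources]

    def get_resource(self, name):
        # Returns Resource or None, matching the contract above
        for resource in self.resources:
            if resource.name == name:
                return resource
        return None

    def add_resource(self, descriptor):
        self.descriptor.setdefault('resources', []).append(descriptor)
        return self.get_resource(descriptor.get('name'))


package = Package({'resources': [{'name': 'data', 'path': 'data.csv'}]})
```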
[basic implementation]

```
Table(source, schema=None, strict=False, **options)
    headers -> str
    schema -> Schema
    iter(keyed/extended=False, cast=True, relations=False) -> (generator) (keyed/extended)row
    read(keyed/extended=False, cast=True, relations=False, limit=None) -> (keyed/extended)row
    infer(limit=100) -> descriptor
    save(target) -> bool
Schema(descriptor, strict=False)
    valid -> bool
    errors -> list
    descriptor -> dict
    primary_key -> str
    foreign_keys -> str
    fields -> Field
    field_names -> str
    get_field(name) -> Field/None
    add_field(descriptor) -> Field/None
    remove_field(name) -> Field/None
    cast_row(row) -> row
    infer(rows, headers=1) -> descriptor
    commit(strict=None) -> bool
    save(target) -> bool
Field(descriptor, missing_values=[''])
    name -> str
    type -> str
    format -> str
    required -> bool
    constraints -> dict
    descriptor -> dict
    cast_value(value, constraints=True) -> value
    test_value(value, constraints=True) -> bool
validate(descriptor) -> True/raise
infer(source, **options) -> descriptor
exceptions
```

[extended implementation]

```
Table(source, schema=None, post_cast=None, storage=None, **options)
    save(target, storage=None, **options)
Storage(**options)
    buckets -> str
    create(bucket, descriptor, force=False)
    delete(bucket=None, ignore=False)
    describe(bucket, descriptor=None) -> descriptor
    iter(bucket) -> (generator) row
    read(bucket) -> row
    write(bucket, rows)
plugins
~cli
```
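A minimal sketch of `Field.cast_value` and `Schema.cast_row` from the interface above, supporting only the `integer` and `string` types. A real implementation covers all Table Schema types, formats, and constraints (the tuple default for `missing_values` is ours, to avoid a mutable default; the contract shows `['']`):

```python
class Field:
    """Casts single values according to a field descriptor."""

    def __init__(self, descriptor, missing_values=('',)):
        self.descriptor = descriptor
        self.missing_values = missing_values

    @property
    def name(self):
        return self.descriptor['name']

    @property
    def type(self):
        return self.descriptor.get('type', 'string')

    def cast_value(self, value, constraints=True):
        # Missing values cast to None; only two types are handled here
        if value in self.missing_values:
            return None
        if self.type == 'integer':
            return int(value)
        return str(value)


class Schema:
    """Casts whole rows by delegating to each field in order."""

    def __init__(self, descriptor, strict=False):
        self.descriptor = descriptor

    @property
    def fields(self):
        return [Field(d) for d in self.descriptor.get('fields', [])]

    def cast_row(self, row):
        return [field.cast_value(value)
                for field, value in zip(self.fields, row)]


schema = Schema({'fields': [{'name': 'id', 'type': 'integer'},
                            {'name': 'name', 'type': 'string'}]})
```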
[extended implementation]

```
Stream(source,
       headers=None,
       scheme=None,
       format=None,
       encoding=None,
       sample_size=None,
       post_parse=None,
       **options)
    closed/open/close/reset
    sample -> rows
    headers -> list
    iter(keyed/extended=False) -> (generator) (keyed/extended)row
    read(keyed/extended=False, limit=None) -> (keyed/extended)row
    save(target, format=None, encoding=None, **options)
validate(source, scheme=None, format=None) -> True/raise
exceptions
~cli
```
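A minimal sketch of the `Stream` contract above that handles only in-memory CSV text; a real implementation dispatches on scheme/format/encoding and supports many sources. Treating `headers=1` as "first row is the header" is our assumption for illustration:

```python
import csv
import io


class Stream:
    """Tiny tabular stream over an in-memory CSV string."""

    def __init__(self, source, headers=1, sample_size=100):
        self._source = source
        self._headers_row = headers
        self._sample_size = sample_size
        self._rows = None
        self.headers = None

    def open(self):
        reader = csv.reader(io.StringIO(self._source))
        self._rows = list(reader)
        if self._headers_row:
            self.headers = self._rows.pop(0)
        return self

    @property
    def sample(self):
        return self._rows[:self._sample_size]

    def iter(self, keyed=False, extended=False):
        # Row numbers start after the header row, per the glossary above
        start = 2 if self.headers else 1
        for number, row in enumerate(self._rows, start=start):
            if keyed:
                yield dict(zip(self.headers, row))
            elif extended:
                yield (number, self.headers, row)
            else:
                yield row

    def read(self, keyed=False, extended=False, limit=None):
        rows = list(self.iter(keyed=keyed, extended=extended))
        return rows[:limit] if limit is not None else rows


stream = Stream('id,name\n1,english\n2,中国人\n').open()
```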
This document proposes a process for working on the technical side of the Frictionless Data project. The goal is to keep things manageable at a minimal cost.
A defining characteristic of the project is its huge number of components and actors (repositories, issues, contributors, etc.). The process should handle this effectively.
The main idea is to focus on getting things done and to reduce the cost of maintaining the process, instead of trying to fully mimic some popular methodology. We borrow different ideas from different methodologies.
- Product Owner (PO)
- Product Manager (PM)
- Developer Advocate (DA)
- Technical Lead (TL)
- Senior Developer (SD)
- Junior Developer (JD)
We use a kanban board located at https://waffle.io/frictionlessdata/project to work on the project. The board has the following columns (ordered by issue stage):
- Backlog - unprocessed issues without labels and processed issues with labels
- Priority - prioritized issues planned for the next iterations (estimated and assigned)
- Current - current iteration issues promoted on iteration planning (estimated and assigned)
- Review - issues under review process
- Done - completed issues
The work on the project is a live process split into two-week iterations between iteration plannings (including retrospectives):
- Inside an iteration, assigned persons work on their current issues while a subset of roles processes and prioritizes issues
- During the iteration planning, the team moves issues from the Priority column to the Current column and assigns persons. Instead of per-issue estimations, each assigned person approves the amount of work for the current iteration as a high-level estimation.
As milestones we use concrete achievements, e.g. from our roadmap. These could be tools or spec versions like “spec-v1”. We don’t use workflow-related milestones like “current” or “backlog”, managing that via the board’s labeling system.
Aside from internal Waffle labels and helper labels like “question”, we use core color-coded labels based on SemVer. The main point of processing issues from Inbox to Backlog is to add one of these labels, because we need to plan releases, breaking-change announcements, etc.
Every issue in the Current column should be assigned to some person, meaning “this person should do some work on this issue to unblock it”. The assigned person should re-assign the issue to whoever is the current blocker. This provides a good real-time overview of the project.
After planning, it’s highly recommended that the assigned person write a short plan for how to solve the issue (it could be a list of steps) and ask someone to check it. This work could also be done at earlier stages by a subset of roles.
We use Git Flow with some simplifications (see the OKI coding standards). The master branch should always be “green” on tests, and new features/fixes should come in via pull requests. Direct committing to master may be allowed for a subset of roles in some cases.
A pull request should be visually merged on the board into the corresponding issue by using an “It fixes #issue-number” sentence in the pull request description (initial comment). If there is no corresponding issue for the pull request, it should be handled as an issue, with labeling etc.
After sending a pull request, the author should assign it to another person, “asking” for a code review. After the review, the code should be merged into the codebase by the pull request author (or a person with sufficient rights).
By default, documentation for a tool should be written in README.md, without additional files and folders. It should be clean and well-structured. The API should be documented in the code as docstrings. We compile project-level docs automatically.
Tests should be written following the OKI coding standards. Start writing tests from the top (matching high-level requirements) and work down (as needed). The highest-level tests are implemented as test suites at the project level (integration tests between different tools).
We use SemVer for versioning and Travis for testing and releasing/deployments. We prefer a short release cycle (features and fixes can be released immediately). Releases should be configured using tags, based on the package-example workflows provided by OKI.