Payload Redesign (#859)
* payload factory and its test cases

* add audio and image payload meta ontology

* add meta ontology

* reconstruct the payload ontology

* add some docstring

* correct importing path

* move payload to a separate ontology file

* move Modality Payload: base_ontology -> payload_ontology

* move Modality Payload: base_ontology -> payload_ontology

* move Modality Payload: base_ontology -> payload_ontology

* payload ontology

* add DataPack.grids back as it's used in some test cases

* correct modality error message

* change to a hashable meta info for registering in payload factory

* payload ontology: ft/onto/base_ontology.py -> ft/onto/payload_ontology.py

* remove used import

* correct import path

* correct import path

* add audio encoding

* rm pdb

* pylint

* pylint

* recover init methods to debug ontology generation

* minor changes

* rewrite grids -> grid

* remove grid from adding entry

* rewrite the docstring

* import grid

* import grid

* pylint

* pylint

* pylint

* remove the requirement to initialize bounding box

* fix pylint and docstring

* DataPack.grids -> DataPack.grid

* revert base ontology

* fix docstring

* remove docstring

* simplify test

* rename: grids -> grid

* correct typing for grid

* remove meta

* remove meta

* move payload with modalities to top.py

* correct the docstring

* JpegPayload and SoundFilePayload

* new payload factory and its test

* fix pylint issue

* remove grids

* correct payloads importing paths

* update dependent packages for payload

* install payload module

* update package requirement

* fix main.yml format

* fix black

* remove used import

* move payloads from base_ontology to top

* Fix issues

* correct payload ontology

* DefaultAudioPayload -> AudioPayload

* test

* Fix setup.py

* pylint fixes

* formatting

* Fixing payload test.

* Adding tutorial

* Change payload interface.

* Fix interface bugs.

* Fix pylint.

* Fix pylint.

* Remove probably unused function branch.

* fix mypy

* fix image annotation bug

* Try to debug the tid.

* add prints to debug

* Fix the parent/child payload problem.

* Remove a debug workflow step.

* typos in doc

Co-authored-by: Hector <hunterhector@gmail.com>
hepengfe and hunterhector committed Jan 10, 2023
1 parent fd717ff commit 5f43e72
Showing 31 changed files with 1,025 additions and 243 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/main.yml
@@ -80,7 +80,7 @@ jobs:
rm -rf stave
- name: Install Forte
run: |
- pip install --progress-bar off .[data_aug,ir,remote,audio_ext,stave,models,test,wikipedia,nlp,extractor]
+ pip install --progress-bar off .[data_aug,ir,remote,audio_ext,stave,models,test,wikipedia,nlp,extractor,payload]
- name: Install deep learning frameworks
run: |
pip install --progress-bar off torch==${{ matrix.torch-version }}
@@ -203,6 +203,7 @@ jobs:
- { module: "wikipedia", test_file: "tests/forte/datasets/wikipedia"}
- { module: "nlp",test_file: "tests/forte/processors/subword_tokenizer_test.py tests/forte/processors/pretrained_encoder_processors_test.py"}
- { module: "extractor",test_file: "tests/forte/train_preprocessor_test.py forte/data/extractors tests/forte/data/data_pack_dataset_test.py tests/forte/data/converter/converter_test.py"}
- { module: "payload", test_file: "tests/forte/utils/payload_decorator_test.py"}
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
2 changes: 1 addition & 1 deletion README.md
@@ -77,7 +77,7 @@ Some components or modules in forte may require some [extra requirements](https:
* `pip install forte[wikipedia]`: Install packages required for reading [wikipedia datasets](https://github.com/asyml/forte/tree/master/forte/datasets/wikipedia).
* `pip install forte[nlp]`: Install packages required for additional NLP supports, such as [subword_tokenizer](https://github.com/asyml/forte/tree/master/forte/processors/nlp/subword_tokenizer.py) and [texar encoder](https://github.com/asyml/forte/tree/master/forte/processors/third_party/pretrained_encoder_processors.py)
* `pip install forte[extractor]`: Install packages required for extractor-based training system, [extractor](https://github.com/asyml/forte/blob/master/forte/data/extractors), [train_preprocessor](https://github.com/asyml/forte/tree/master/forte/train_preprocessor.py), [tagging trainer](https://github.com/asyml/forte/tree/master/examples/tagging/tagging_trainer.py), [DataPack dataset](https://github.com/asyml/forte/blob/master/forte/data/data_pack_dataset.py), [types](https://github.com/asyml/forte/blob/master/forte/data/types.py), and [converter](https://github.com/asyml/forte/blob/master/forte/data/converter).

* `pip install forte[payload]`: Install packages required for payload support.
## Quick Start Guide
Writing NLP pipelines with Forte is easy. The following example creates a simple pipeline that analyzes the sentences, tokens, and named entities from a piece of text.

6 changes: 4 additions & 2 deletions docs/ch1.rst
@@ -1,7 +1,9 @@
- Chapter 1. Handling Structured Data
+ Chapter 1. Handling Data
====================================
.. toctree::
:maxdepth: 2

toc/ontology_generation.md
- notebook_tutorial/handling_structued_data
+ notebook_tutorial/handling_structured_data
+ notebook_tutorial/lazy_loading

316 changes: 316 additions & 0 deletions docs/notebook_tutorial/lazy_loading.ipynb
@@ -0,0 +1,316 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "ac6c4e77",
"metadata": {},
"source": [
"# Loading Data As Needed"
]
},
{
"cell_type": "markdown",
"id": "a297c739",
"metadata": {},
"source": [
"Sometimes it is preferable to load data only when it is required. For example, when creating a pipeline that handles a large amount of image data, a naive approach would be to load all the data at the beginning (i.e. through a reader) and pass it along the pipeline.\n",
"\n",
"Yet this approach can be inefficient, since the actual images are passed along the pipeline, potentially over a network. If not all the processors in the pipeline need to access the image data, a better alternative is to load the data lazily when needed, while the data itself stays at an online location (such as an NFS location or a hyperlink).\n",
"\n",
"Forte's `Payload` classes provide options for you to do exactly that."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "3e236fbd",
"metadata": {},
"outputs": [],
"source": [
"from typing import Optional\n",
"from dataclasses import dataclass\n",
"\n",
"from forte.data.data_pack import DataPack\n",
"from forte.data.ontology.top import ImagePayload\n",
"\n",
"@dataclass\n",
"class JpegPayload(ImagePayload):\n",
"    \"\"\"\n",
"    Attributes:\n",
"        extensions (Optional[str]):\n",
"        mime (Optional[str]):\n",
"        type_code (Optional[str]):\n",
"        version (Optional[int]):\n",
"        source_type (Optional[int]):\n",
"    \"\"\"\n",
"\n",
"    extensions: Optional[str]\n",
"    mime: Optional[str]\n",
"    type_code: Optional[str]\n",
"    version: Optional[int]\n",
"    source_type: Optional[int]\n",
"\n",
"    def __init__(self, pack: DataPack):\n",
"        super().__init__(pack)\n",
"        self.extensions: Optional[str] = None\n",
"        self.mime: Optional[str] = None\n",
"        self.type_code: Optional[str] = None\n",
"        self.version: Optional[int] = None\n",
"        self.source_type: Optional[int] = None"
]
},
{
"cell_type": "markdown",
"id": "e9e2352f",
"metadata": {},
"source": [
"The class above is an example `Payload` class that inherits from the Forte built-in `ImagePayload` class. Note that this class is generated through the ontology generator; you should be able to find the definitions [here](https://github.com/asyml/forte/blob/master/forte/ontology_specs/payload_ontology.json).\n",
"\n",
"The `Payload` classes, as their name suggests, are used to store data. A `Payload` class has certain default members, such as a `uri` and a `cache`, and one can also enrich the class by extending it, as above.\n",
"\n",
"The simplest usage of a `Payload` class is to access its `uri` and `cache`. The `uri` is defined by you; it could be a URL or a remote file path. The `cache` is used to store the actual data. In a regular Forte reader implementation, one might want to specify the `uri` and populate the `cache` with actual data. Let's see a quick example."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "73b4d106",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"http://some/path/\n"
]
}
],
"source": [
"# A Payload is just another regular entry object, \n",
"# so we can handle this in the same way.\n",
"datapack = DataPack()\n",
"sp = JpegPayload(datapack)\n",
"sp.uri = \"http://some/path/\"\n",
"\n",
"print(datapack.get_single(JpegPayload).uri)"
]
},
{
"cell_type": "markdown",
"id": "880d50c5",
"metadata": {},
"source": [
"We have set the `uri` for this particular payload, which is lightweight since we only added a string to it. While one could load the actual data into `sp.cache` by reading the `uri` right away, let's study the \"lazy loading\" option instead.\n",
"\n",
"Forte allows one to do this by associating a `load` function with the `Payload` class using a simple decorator, as shown below:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "c25e9738",
"metadata": {},
"outputs": [],
"source": [
"from forte.data.ontology.top import load_func\n",
"\n",
"\n",
"@load_func(JpegPayload)\n",
"def load(payload: JpegPayload):\n",
"    def read_uri(input_uri):  # The function to read the URI.\n",
"        # to be implemented\n",
"        pass\n",
"    return read_uri(payload.uri)  # Returns the payload content."
]
},
{
"cell_type": "markdown",
"id": "d3fcab84",
"metadata": {},
"source": [
"What happens here is that we decorate the `load` function with the Forte built-in `load_func` decorator, which associates the `JpegPayload` type with the `load` function. Note that `load` takes the payload itself as input, while the inner `read_uri` helper takes an `input_uri`, to which we pass `payload.uri`.\n",
"\n",
"Now when you call the `load` method of a `JpegPayload` instance, it will try to populate the `cache` with the return value of the registered `load` function.\n",
"\n",
"Let's see a full implementation of this function."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "4c0ff966",
"metadata": {},
"outputs": [],
"source": [
"@load_func(JpegPayload)\n",
"def load(payload: JpegPayload):\n",
"    \"\"\"\n",
"    A function that reads the image data located at the payload's `uri`.\n",
"\n",
"    This function is not stored in the data store but will be used\n",
"    for registering in PayloadFactory.\n",
"\n",
"    Returns:\n",
"        The image data read from the payload's `uri`, as a numpy array.\n",
"    \"\"\"\n",
"    try:\n",
"        from PIL import Image\n",
"        import requests\n",
"        import numpy as np\n",
"    except ModuleNotFoundError as e:\n",
"        raise ModuleNotFoundError(\n",
"            \"ImagePayload reading web file requires `PIL` and \"\n",
"            \"`requests` packages to be installed.\"\n",
"        ) from e\n",
"\n",
"    def read_uri(input_uri):\n",
"        # customize this function to read data from uri\n",
"        uri_obj = requests.get(input_uri, stream=True)\n",
"        pil_image = Image.open(uri_obj.raw)\n",
"        return np.asarray(pil_image)\n",
"\n",
"    return read_uri(payload.uri)"
]
},
{
"cell_type": "markdown",
"id": "906b886d",
"metadata": {},
"source": [
"This `load` implementation uses the `PIL` library to read images, which supports JPEG.\n",
"\n",
"Now we have registered the `load` function to the `JpegPayload` class. Let's give it a try."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "1b2a4434",
"metadata": {},
"outputs": [],
"source": [
"datapack = DataPack(\"image\")\n",
"payload = JpegPayload(datapack)\n",
"datapack.add_entry(payload)\n",
"payload.uri = \"https://raw.githubusercontent.com/asyml/forte/assets/ocr_tutorial/ocr.jpg\""
]
},
{
"cell_type": "markdown",
"id": "0a11aa9d",
"metadata": {},
"source": [
"We have set the data URL, so we can now load the payload content at any time."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "f589c4b6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(539, 810, 3)\n"
]
}
],
"source": [
"payload.load()\n",
"print(payload.cache.shape)"
]
},
{
"cell_type": "markdown",
"id": "76af822f",
"metadata": {},
"source": [
"Note that here we explicitly called the `load` function for illustration purposes. Forte actually allows you to directly access the `cache`, and it will attempt to `load` the data without the explicit `load` call."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "2bc0dcde",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(539, 810, 3)\n"
]
}
],
"source": [
"datapack_lazy = DataPack(\"image\")\n",
"pl = JpegPayload(datapack_lazy)\n",
"datapack_lazy.add_entry(pl)\n",
"pl.uri = \"https://raw.githubusercontent.com/asyml/forte/assets/ocr_tutorial/ocr.jpg\"\n",
"print(pl.cache.shape)"
]
},
{
"cell_type": "markdown",
"id": "27647429",
"metadata": {},
"source": [
"In this way, we achieve the \"lazy loading\" idea with a registered function, without requiring users to worry about when to load the content.\n",
"\n",
"Finally, there are a few usage tips:\n",
"1. Once the data is loaded into `cache`, it will stay with the data pack (which means it will be transferred through the pipeline). Currently Forte does not have a mechanism to automatically clean the `cache`, but one can call the `clear_cache` function manually.\n",
"2. To use the lazy loading mechanism in `Payload`, it is preferable to register a function for a dedicated type. This will help you organize the loading methods of different types of data. Under the hood, Forte simply assigns the loading method to the corresponding `Payload` class. This means method overriding will work as expected: if a different `load` function is assigned to a child class, then the `load` function registered to the child class will be used."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "84a2f89b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"None\n"
]
}
],
"source": [
"pl.clear_cache()\n",
"print(pl._cache)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "283f373b",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
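The lazy-loading behaviour described in the notebook can be sketched in plain Python, without importing Forte. The snippet below is only an illustration under stated assumptions: `SketchPayload`, `register_loader`, `load_count`, and the two subclasses are hypothetical names invented here, and the code is not Forte's actual implementation of `Payload` or `load_func`. It shows a decorator assigning a loader to a class, a `cache` property that loads on first access, manual cache clearing, and a child class overriding the loader registered on its parent.

```python
from typing import Any, Callable, Optional, Type


class SketchPayload:
    """Hypothetical stand-in for a payload: a URI plus a lazily filled cache."""

    def __init__(self, uri: Optional[str] = None):
        self.uri = uri
        self._cache: Any = None
        # Counter kept only so the sketch can show when loading happens.
        self.load_count = 0

    def load(self) -> None:
        raise NotImplementedError("No loader registered for this payload type.")

    @property
    def cache(self) -> Any:
        # Load on first access only; later accesses reuse the stored value.
        if self._cache is None:
            self.load()
        return self._cache

    def clear_cache(self) -> None:
        self._cache = None


def register_loader(payload_cls: Type[SketchPayload]) -> Callable:
    """Hypothetical decorator: install the decorated function as `payload_cls.load`."""

    def decorator(fn: Callable[[SketchPayload], Any]) -> Callable:
        def load(self: SketchPayload) -> None:
            self._cache = fn(self)
            self.load_count += 1

        # Assigning on the class means normal method resolution applies,
        # so a loader registered on a child class overrides the parent's.
        setattr(payload_cls, "load", load)
        return fn

    return decorator


class SketchImagePayload(SketchPayload):
    pass


class SketchJpegPayload(SketchImagePayload):
    pass


@register_loader(SketchImagePayload)
def load_image(payload: SketchPayload) -> str:
    # A real loader would fetch and decode the data behind payload.uri.
    return f"image bytes from {payload.uri}"


@register_loader(SketchJpegPayload)
def load_jpeg(payload: SketchPayload) -> str:
    return f"jpeg bytes from {payload.uri}"


jpeg = SketchJpegPayload("http://some/path/")
first = jpeg.cache       # triggers the loader registered on the child class
second = jpeg.cache      # served from the cache, no second load
jpeg.clear_cache()       # the next access to `cache` will load again
```

One consequence of this design, at least in the sketch, is that a loader returning `None` would be re-run on every `cache` access, because `None` doubles as the "not loaded" marker.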
17 changes: 17 additions & 0 deletions forte/common/aliases.py
@@ -0,0 +1,17 @@
# Copyright 2019-2022 The Forte Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from urllib import parse

URL = parse.ParseResult
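
`parse.ParseResult` is the named tuple that `urllib.parse.urlparse` returns for string input, so the `URL` alias can serve as a type annotation for parsed URIs. A minimal sketch follows; the `describe` helper is hypothetical and only illustrates the idea, and how Forte itself uses the alias is not shown in this diff.

```python
from urllib import parse

URL = parse.ParseResult  # Same alias as defined in forte/common/aliases.py.


def describe(url: URL) -> str:
    """Hypothetical helper: rebuild the scheme, host, and path of a parsed URL."""
    return f"{url.scheme}://{url.netloc}{url.path}"


# urlparse returns a ParseResult when given a str.
parsed: URL = parse.urlparse(
    "https://raw.githubusercontent.com/asyml/forte/assets/ocr_tutorial/ocr.jpg"
)
print(describe(parsed))
```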
