* payload factory and its test cases
* add audio and image payload meta ontology
* add meta ontology
* reconstruct the payload ontology
* add some docstring
* correct importing path
* move payload to a separate ontology file
* move Modality Payload: base_ontology -> payload_ontology
* move Modality Payload: base_ontology -> payload_ontology
* move Modality Payload: base_ontology -> payload_ontology
* payload ontology
* add DataPack.grids back as it's used in some test cases
* correct modality error message
* change to a hashable meta info for registering in payload factory
* payload ontology: ft/onto/base_ontology.py -> ft/onto/payload_ontology.py
* remove used import
* correct import path
* correct import path
* add audio encoding
* rm pdb
* pylint
* pylint
* recover init methods to debug ontology generation
* minor changes
* rewrite grids -> grid
* remove grid from adding entry
* rewrite the docstring
* import grid
* import grid
* pylint
* pylint
* pylint
* remove the requirement to initialize bounding box
* fix pylint and docstring
* DataPack.grids -> DataPack.grid
* revert base ontology
* fix docstring
* remove docstring
* simplify test
* rename: grids -> grid
* correct typing for grid
* remove meta
* remove meta
* move payload with modalities to top.py
* correct the docstring
* JpegPayload and SoundFilePayload
* new payload factory and its test
* fix pylint issue
* remove grids
* correct payloads importing paths
* update dependent packages for payload
* install payload module
* update package requirement
* fix main.yml format
* fix black
* remove used import
* move payloads from base_ontology to top
* Fix issues
* correct payload ontology
* DefaultAudioPayload -> AudioPayload
* test
* Fix setup.py
* pylint fixes
* formatting
* Fixing payload test.
* Adding tutorial
* Change payload interface.
* Fix interface bugs.
* Fix pylint.
* Fix pylint.
* Remove probably unused function branch.
* fix mypy
* fix image annotation bug
* Try to debug the tid.
* add prints to debug
* Fix the parent/child payload problem.
* Remove a debug workflow step.
* typos in doc

Co-authored-by: Hector <hunterhector@gmail.com>
1 parent
fd717ff
commit 5f43e72
Showing 31 changed files with 1,025 additions and 243 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,9 @@ | ||
Chapter 1. Handling Structured Data | ||
Chapter 1. Handling Data | ||
==================================== | ||
.. toctree:: | ||
:maxdepth: 2 | ||
|
||
toc/ontology_generation.md | ||
notebook_tutorial/handling_structued_data | ||
notebook_tutorial/handling_structured_data | ||
notebook_tutorial/lazy_loading | ||
|
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,316 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "ac6c4e77", | ||
"metadata": {}, | ||
"source": [ | ||
"# Loading Data As Needed" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "a297c739", | ||
"metadata": {}, | ||
"source": [ | ||
"Sometimes it is preferable to load data only when it is required. For example, when creating a pipeline that handles a large amount of image data, a naive way would be to load the data at the beginning (i.e. through a reader), and pass all the data along the pipeline.\n", | ||
"\n", | ||
"Yet this approach could be inefficient since the actual images are passing along the pipeline, potentially through a network. If not all the processors in the pipeline need to access the image data, a better alternative would be to lazy load the data when needed, while all the data stays at an online location (such as an NSF location or a hyperlink).\n", | ||
"\n", | ||
"Forte's `Payload` classes provides options for you to do exactly that." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"id": "3e236fbd", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from typing import Optional\n", | ||
"from dataclasses import dataclass\n", | ||
"\n", | ||
"from forte.data.data_pack import DataPack\n", | ||
"from forte.data.ontology.top import ImagePayload\n", | ||
"\n", | ||
"@dataclass\n", | ||
"class JpegPayload(ImagePayload):\n", | ||
"    \"\"\"\n", | ||
"    Attributes:\n", | ||
"        extensions (Optional[str]):\n", | ||
"        mime (Optional[str]):\n", | ||
"        type_code (Optional[str]):\n", | ||
"        version (Optional[int]):\n", | ||
"        source_type (Optional[int]):\n", | ||
"    \"\"\"\n", | ||
"\n", | ||
"    extensions: Optional[str]\n", | ||
"    mime: Optional[str]\n", | ||
"    type_code: Optional[str]\n", | ||
"    version: Optional[int]\n", | ||
"    source_type: Optional[int]\n", | ||
"\n", | ||
"    def __init__(self, pack: DataPack):\n", | ||
"        super().__init__(pack)\n", | ||
"        self.extensions: Optional[str] = None\n", | ||
"        self.mime: Optional[str] = None\n", | ||
"        self.type_code: Optional[str] = None\n", | ||
"        self.version: Optional[int] = None\n", | ||
"        self.source_type: Optional[int] = None" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e9e2352f", | ||
"metadata": {}, | ||
"source": [ | ||
"The class above is an example `Payload` class inheriting the Forte built-in `ImagePayload` class (note that this class is generated through the ontology generator, you should be able to find the definitions [here](https://github.com/asyml/forte/blob/master/forte/ontology_specs/payload_ontology.json)). \n", | ||
"\n", | ||
"The `Payload` classes, as their name suggest, are used to store data. A `Payload` class has certain default members, such as a `uri` and a `cache`, and one can also enrich the class by extending it, like above. \n", | ||
"\n", | ||
"The simple usage of a `Payload` class is to access its `uri` and `cache`. The `uri` is defined by you, it could be a URL or a remote file path. And the `cache` is used to store the actual data. In a regular Forte reader implementation, one might want to specify the `uri` and populate the `cache` with actual data. Let's see a quick example." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 10, | ||
"id": "73b4d106", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"http://some/path/\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# A Payload is just another regular entry object, \n", | ||
"# so we can handle this in the same way.\n", | ||
"datapack = DataPack()\n", | ||
"sp = JpegPayload(datapack)\n", | ||
"sp.uri = \"http://some/path/\"\n", | ||
"\n", | ||
"print(datapack.get_single(JpegPayload).uri)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "880d50c5", | ||
"metadata": {}, | ||
"source": [ | ||
"We have set the `uri` for this particular payload, which is lightweight since we only added a string to it. While one can load the actual data into `sp.cache` by reading the `uri` now, let's study the \"lazy loading\" option.\n", | ||
"\n", | ||
"Forte allows one to do this by associating a `load` function to the `Payload` class using a simple decorator like below:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 11, | ||
"id": "c25e9738", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from forte.data.ontology.top import load_func\n", | ||
"\n", | ||
"\n", | ||
"@load_func(JpegPayload)\n", | ||
"def load(payload: JpegPayload):\n", | ||
"    def read_uri(input_uri):  # The function to read the URI.\n", | ||
"        # to be implemented\n", | ||
"        pass\n", | ||
"    return read_uri(payload.uri)  # Returns the payload content." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "d3fcab84", | ||
"metadata": {}, | ||
"source": [ | ||
"What happens here is that we decorate the `load` function with the Forte built-in `load_func` decorator, which associates the `JpegPayload` type with the `load` function. Note that this function takes an `input_uri` as input, internally, Forte will pass `JpegPayload.uri` to it.\n", | ||
"\n", | ||
"Now when you call the `load` function in the `JpegPayload` class, it will try to populate the `cache` with the return value of the `load` function, by providing the `uri`. \n", | ||
"\n", | ||
"Let's see a full implementation of this function." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 12, | ||
"id": "4c0ff966", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"@load_func(JpegPayload)\n", | ||
"def load(payload: JpegPayload):\n", | ||
"    \"\"\"\n", | ||
"    Load the image data that the payload's `uri` points to.\n", | ||
"\n", | ||
"    This function is not stored in data store but will be used\n", | ||
"    for registering in PayloadFactory.\n", | ||
"\n", | ||
"    Returns:\n", | ||
"        the image data read from the URL, as a numpy array.\n", | ||
"    \"\"\"\n", | ||
"    try:\n", | ||
"        from PIL import Image\n", | ||
"        import requests\n", | ||
"        import numpy as np\n", | ||
"    except ModuleNotFoundError as e:\n", | ||
"        raise ModuleNotFoundError(\n", | ||
"            \"ImagePayload reading web file requires the `PIL`, \"\n", | ||
"            \"`requests` and `numpy` packages to be installed.\"\n", | ||
"        ) from e\n", | ||
"\n", | ||
"    def read_uri(input_uri):\n", | ||
"        # customize this function to read data from uri\n", | ||
"        uri_obj = requests.get(input_uri, stream=True)\n", | ||
"        pil_image = Image.open(uri_obj.raw)\n", | ||
"        return np.asarray(pil_image)\n", | ||
"\n", | ||
"    return read_uri(payload.uri)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "906b886d", | ||
"metadata": {}, | ||
"source": [ | ||
"This `load` implementation uses the `PIL` library to read images, which supports JPEG.\n", | ||
"\n", | ||
"Now we have registered the `load` function to the `SoundFilePayload` class. Let's have a try." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 13, | ||
"id": "1b2a4434", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"datapack = DataPack(\"image\")\n", | ||
"payload = JpegPayload(datapack)\n", | ||
"datapack.add_entry(payload)\n", | ||
"payload.uri = \"https://raw.githubusercontent.com/asyml/forte/assets/ocr_tutorial/ocr.jpg\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "0a11aa9d", | ||
"metadata": {}, | ||
"source": [ | ||
"We have successfully read the data URL, now we can load the payload content at any time." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 14, | ||
"id": "f589c4b6", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"(539, 810, 3)\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"payload.load()\n", | ||
"print(payload.cache.shape)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "76af822f", | ||
"metadata": {}, | ||
"source": [ | ||
"Note that here we explicitly called the `load` function for illustration purposes. Forte actually allows you to directly access the `cache`, and it will attempt to `load` the data without the explicit `load` call." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 15, | ||
"id": "2bc0dcde", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"(539, 810, 3)\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"datapack_lazy = DataPack(\"image\")\n", | ||
"pl = JpegPayload(datapack_lazy)\n", | ||
"datapack_lazy.add_entry(pl)\n", | ||
"pl.uri = \"https://raw.githubusercontent.com/asyml/forte/assets/ocr_tutorial/ocr.jpg\"\n", | ||
"print(pl.cache.shape)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "27647429", | ||
"metadata": {}, | ||
"source": [ | ||
"In this way, we achieve the \"lazy loading\" idea, with a registered function, and without having users to manually worry about when to load the content.\n", | ||
"\n", | ||
"Finally, there are a few usage tips:\n", | ||
"1. Once the data is loaded into `cache`, it will stay with the data pack (which means it will be transferred through the pipeline). Currently Forte does not have a mechanism to automatically clean the `cache`. One can call the `clear_cache` function manually.\n", | ||
"2. To use the lazy loading mechanism in `Payload`, it is preferable to register a function for a dedicated type. This will help you organize the loading methods of different types of data. Under the hood. Forte simply assign the loading method into the corresponding `Payload` class. This means method overriding will work as expected: if a different `load` function is assigned to a child class, then the `load` function registered to the child class will be used." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 16, | ||
"id": "84a2f89b", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"None\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"pl.clear_cache()\n", | ||
"print(pl._cache)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "283f373b", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.9.13" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
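The lazy-loading behaviour that the notebook above describes can be sketched in plain Python, without Forte. This is only an illustration under stated assumptions: `SketchPayload`, `register_loader`, `TextPayload`, and `ChildPayload` are hypothetical names and are not part of Forte's API, and Forte's actual implementation may differ. The sketch attaches the registered function to the payload class, so a child class without its own loader falls back to its parent's.

```python
from typing import Any, Callable, List, Optional


class SketchPayload:
    """Hypothetical stand-in for a payload entry (not Forte's class)."""

    # Loader attached by `register_loader`; None means nothing is registered.
    _loader: Optional[Callable[["SketchPayload"], Any]] = None

    def __init__(self, uri: Optional[str] = None):
        self.uri = uri
        self._cache: Any = None

    def load(self) -> None:
        # Populate the cache with whatever the registered loader returns.
        if self._loader is None:
            raise NotImplementedError(
                f"No loader registered for {type(self).__name__}"
            )
        self._cache = self._loader(self)

    @property
    def cache(self) -> Any:
        # Load on first access only; later accesses reuse the stored value.
        if self._cache is None:
            self.load()
        return self._cache

    def clear_cache(self) -> None:
        self._cache = None


def register_loader(payload_cls):
    """Hypothetical decorator: attach `fn` to `payload_cls` as its loader."""

    def decorator(fn):
        # staticmethod keeps `fn` from being bound as an instance method.
        payload_cls._loader = staticmethod(fn)
        return fn

    return decorator


class TextPayload(SketchPayload):
    pass


class ChildPayload(TextPayload):
    pass  # No loader of its own, so it inherits TextPayload's.


calls: List[str] = []  # Records each time the loader actually runs.


@register_loader(TextPayload)
def load_text(payload):
    calls.append(payload.uri)
    return f"content of {payload.uri}"


p = TextPayload("memory://demo")
first = p.cache   # Triggers the loader.
second = p.cache  # Served from the cache; the loader is not called again.
child_content = ChildPayload("memory://child").cache
```

Because the loader is stored as a class attribute, ordinary attribute lookup provides the parent/child fallback; registering a different loader on `ChildPayload` would override it for that class only.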
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
# Copyright 2019-2022 The Forte Authors. All Rights Reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
from urllib import parse | ||
|
||
URL = parse.ParseResult |
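The `URL` alias above points at the standard library's `urllib.parse.ParseResult`, which is the type returned by `urllib.parse.urlparse`. A minimal sketch of how such an alias could be used as a type annotation follows; the `to_url` helper is a hypothetical name introduced only for this illustration and is not taken from the commit.

```python
from urllib import parse

URL = parse.ParseResult


def to_url(text: str) -> URL:
    """Hypothetical helper: parse a string into a `URL` (a ParseResult)."""
    return parse.urlparse(text)


url = to_url(
    "https://raw.githubusercontent.com/asyml/forte/assets/ocr_tutorial/ocr.jpg"
)
# The parsed components are available as named attributes.
print(url.scheme, url.netloc, url.path)
```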