Atomizer - recognize file types and convert to JSON-AD #434

joepio · 2022-06-13T15:17:53Z

Atomizing is about turning non-atomic data into atomic data. In practice, this usually means converting an existing file into JSON-AD and then publishing it to an Atomic Server Importer #390.

Ideally, we'd have one application (the Atomizer) that can:

run as a CLI. Pipe to your locally running Atomic Server to have a highly performant importer.
run on the server. Upload a file / link to a file and the server will automatically perform conversions / atomizations. Or atomize already uploaded files automagically.
run in the browser. Let the JS client perform imports. Highly flexible, can even ask user for extra input when needed.

That will be able to:

Recognize files and file types
Convert them to JSON-AD
Send them to an Atomic Server

Considerations:

How do we deal with changes to files? I suppose we'd create new commits that typically never remove any properties, but do overwrite them.
Should users be able to manually re-trigger extracting data from a source file? Does this need to be an endpoint?
The function signature seems to be very simple: File + Parent in, JSON-AD out.
I suppose the Atomizer needs to know the parent - where the resource needs to go. It also needs to know how to upload the file. This can be extracted from the parent URL, e.g. example.com/upload
How do we deal with changes to JSON-AD? Let's say a user edits the location on an image, which originated from the EXIF data. The user might expect this would update the values on the image file. However, it does not do this - it only updates the AD resource. This could definitely be confusing. We could solve this by adding write capabilities, but that would definitely make things far more complicated. Another solution is to just not allow updates to metadata.
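The first consideration above (new commits that overwrite properties but never remove them) could look roughly like this. This is only a sketch: `PropVals` and `merge_extracted` are hypothetical names, and values are plain strings instead of real Atomic Data values.

```rust
use std::collections::BTreeMap;

/// Property URL (or shortname) -> value. Hypothetical, simplified type.
type PropVals = BTreeMap<String, String>;

/// Merge freshly extracted values into an existing resource.
/// Extracted values overwrite existing ones for the same property,
/// but properties that are missing from the new extraction are kept.
fn merge_extracted(existing: &PropVals, extracted: &PropVals) -> PropVals {
    let mut merged = existing.clone();
    for (prop, val) in extracted {
        merged.insert(prop.clone(), val.clone());
    }
    merged
}
```

Note that with this policy, a value the user edited by hand is also overwritten when the source file is re-extracted, which ties into the last consideration.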

Implementation

I think a sensible technological approach is to write all of this in Rust, as a new crate inside this repo. If it's Rust, we can easily embed it in Atomic Server. Also, we can still (later) compile it to WASM and run it in the browser.

I'd like users to register handlers for various file types as plugins. Each handler can be registered for a specific MIME type, and has a handler function that reads a bunch of bytes and creates one (or more?) resources.

Therefore it might make sense to have a bunch of plugins that do this.
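A minimal sketch of what such a handler registry might look like, assuming handlers are keyed by MIME type string and unknown types fall back to a plain file resource. All names here (`Resource`, `Handler`, `Atomizer`, `PlaintextHandler`, `FileFallbackHandler`) and the `text` / `filesize` keys are made up for illustration and are not an existing API.

```rust
use std::collections::HashMap;

/// A resource as a flat list of (property, value) pairs. Hypothetical type.
#[derive(Debug, Clone, PartialEq)]
struct Resource {
    parent: String,
    propvals: Vec<(String, String)>,
}

/// A handler reads a bunch of bytes and creates one or more resources.
trait Handler {
    fn atomize(&self, bytes: &[u8], parent: &str) -> Vec<Resource>;
}

/// Example handler: stores the file contents as text.
struct PlaintextHandler;

impl Handler for PlaintextHandler {
    fn atomize(&self, bytes: &[u8], parent: &str) -> Vec<Resource> {
        let text = String::from_utf8_lossy(bytes).into_owned();
        vec![Resource {
            parent: parent.to_string(),
            propvals: vec![("text".to_string(), text)],
        }]
    }
}

/// Fallback handler: only records the size, i.e. "just a file".
struct FileFallbackHandler;

impl Handler for FileFallbackHandler {
    fn atomize(&self, bytes: &[u8], parent: &str) -> Vec<Resource> {
        vec![Resource {
            parent: parent.to_string(),
            propvals: vec![("filesize".to_string(), bytes.len().to_string())],
        }]
    }
}

/// Registry that maps MIME types to handlers.
struct Atomizer {
    handlers: HashMap<String, Box<dyn Handler>>,
    fallback: Box<dyn Handler>,
}

impl Atomizer {
    fn new() -> Self {
        Atomizer {
            handlers: HashMap::new(),
            fallback: Box::new(FileFallbackHandler),
        }
    }

    /// Register (or replace) the handler for one MIME type.
    fn register(&mut self, mime: &str, handler: Box<dyn Handler>) {
        self.handlers.insert(mime.to_string(), handler);
    }

    /// Dispatch to the registered handler, or to the fallback if none matches.
    fn atomize(&self, mime: &str, bytes: &[u8], parent: &str) -> Vec<Resource> {
        self.handlers
            .get(mime)
            .unwrap_or(&self.fallback)
            .atomize(bytes, parent)
    }
}
```

Returning a `Vec<Resource>` leaves the "one (or more?)" question open: a handler that only ever produces one resource simply returns a single-element vector.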

MIME type recognition. There are tools that help to identify the file type. A notable one is libmagic, and its Rust wrapper magic. A lightweight alternative that only uses file extensions is mime_guess.
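To illustrate the two approaches (content sniffing vs. extension lookup), here is a std-only sketch that checks a few well-known magic bytes first and then falls back to the file extension. It is a hand-rolled stand-in, not the API of libmagic, magic or mime_guess; `guess_mime` is a hypothetical helper and the table is deliberately tiny.

```rust
use std::path::Path;

/// Guess a MIME type: check a few well-known magic bytes first,
/// then fall back to the (case-insensitive) file extension.
fn guess_mime(filename: &str, bytes: &[u8]) -> &'static str {
    // Content sniffing: file signatures at the start of the data.
    if bytes.starts_with(b"%PDF-") {
        return "application/pdf";
    }
    if bytes.starts_with(&[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A]) {
        return "image/png";
    }
    if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        return "image/jpeg";
    }

    // Extension lookup: cheap, but trusts the filename.
    let ext = Path::new(filename)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    match ext.as_deref() {
        Some("md") | Some("markdown") => "text/markdown",
        Some("csv") => "text/csv",
        Some("html") | Some("htm") => "text/html",
        Some("txt") => "text/plain",
        // Unknown: treat as an opaque file.
        _ => "application/octet-stream",
    }
}
```

Sniffing the content first means a mislabeled file (say, a PDF named `report.bin`) is still recognized, while text formats without a signature rely on the extension.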

Filetypes / data types to atomize:

Files. We already have the File class, which describes a file with some size and some filetype. If the Atomizer does not know what to do with a file, it will simply upload it as-is: as a file.
Plaintext files. Anything that's a programming language file, or other text file, can be converted into this.
Markdown. These can be converted to Articles. We still need a proper model for this.
Documents. PDF, Word, etc. can be converted to plaintext, which makes them searchable. Crates: pdf-extract
Image. Similar to Files, but with extra data, such as EXIF location / camera / aperture / ISO, etc. These can be transformed to smaller images using an endpoint. Image endpoint (resize, crop) #257
CSV. Parse the headers and convert it to a Table, with Class + Properties.
HTML. Import HTML pages #432
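For the CSV item above, deriving Properties from the header row could start out like this. It is a naive sketch: `slugify` and `properties_from_csv` are hypothetical helpers, the splitting ignores quoted fields, and the slug format is only an assumption about what a property shortname should look like.

```rust
/// Turn a CSV header cell into a shortname-like slug: lowercase ASCII,
/// with runs of other characters collapsed into a single dash.
fn slugify(header: &str) -> String {
    let mut slug = String::new();
    for ch in header.trim().chars() {
        if ch.is_ascii_alphanumeric() {
            slug.push(ch.to_ascii_lowercase());
        } else if !slug.is_empty() && !slug.ends_with('-') {
            slug.push('-');
        }
    }
    // Drop a dash left over from trailing punctuation.
    slug.trim_end_matches('-').to_string()
}

/// Read the first line of a CSV and derive one property name per column.
/// Naive: splits on commas and does not handle quoted fields.
fn properties_from_csv(csv: &str) -> Vec<String> {
    csv.lines()
        .next()
        .map(|line| line.split(',').map(slugify).collect())
        .unwrap_or_default()
}
```

Each derived name would then become a Property on a new Class for the Table, and every following row would become one resource of that Class.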

Inspiration:

See https://github.com/tauri-apps/tauri-plugin-log
@jonassmedegaard had some really cool ideas about this.


joepio mentioned this issue Jun 13, 2022

Article data model #435

Closed

joepio assigned AlexMikhalev Jun 13, 2022

joepio changed the title ~~Atomizer~~ Atomizer - recognize file types and convert to JSON-AD Jun 14, 2022

AlexMikhalev mentioned this issue Jul 4, 2022

Fluvio - secret sauce for scalability #461

Closed

3 tasks

joepio mentioned this issue Jul 5, 2022

Fluvio - secret sauce for scalability #463

Closed

joepio mentioned this issue Feb 11, 2023

Atomizer + PDF extractor #591

Draft

6 tasks
