Atomizer - recognize file types and convert to JSON-AD #434

Open
joepio opened this issue Jun 13, 2022 · 0 comments
joepio commented Jun 13, 2022

Atomizing is about turning non-atomic data into atomic data. This often means converting an existing file into JSON-AD and then sending or publishing it to an Atomic Server (see Importer #390).

Ideally, we'd have one application (the Atomizer) that can:

  • run as a CLI. Pipe to your locally running Atomic Server to have a highly performant importer.
  • run on the server. Upload a file / link to a file and the server will automatically perform conversions / atomizations. Or atomize already uploaded files automagically.
  • run in the browser. Let the JS client perform imports. Highly flexible, can even ask user for extra input when needed.

That will be able to:

  • Recognize files and file types
  • Convert them to JSON-AD
  • Send them to an Atomic Server

Considerations:

  • How do we deal with changes to files? I suppose we'd create new commits that typically never remove any properties, but do overwrite them.
  • Should users be able to manually re-trigger extracting data from a source file? Does this need to be an endpoint?
  • The function signature seems to be very simple: File + Parent in, JSON-AD out.
  • I suppose the Atomizer needs to know the parent, i.e. where the resource needs to go. It also needs to know how to upload the file. This can be extracted from the parent URL, e.g. example.com/upload
  • How do we deal with changes to JSON-AD? Let's say a user edits the location on an image, which originated from the EXIF data. The user might expect this would update the values on the image file. However, it does not do this - it only updates the AD resource. This could definitely be confusing. We could solve this by adding write capabilities, but that would definitely make things far more complicated. Another solution is to just not allow updates to metadata.
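The "File + Parent in, JSON-AD out" signature could be sketched roughly as below. This is only an illustration of the fallback case (describe the file as-is), not a settled design: `SourceFile`, `atomize`, and `json_escape` are hypothetical names, and the three property URLs are assumptions that would need to be checked against the real File class.

```rust
/// Hypothetical input type: the raw bytes of a file plus its name.
pub struct SourceFile {
    pub name: String,
    pub bytes: Vec<u8>,
}

// Assumed property URLs, used for illustration only.
const PARENT: &str = "https://atomicdata.dev/properties/parent";
const FILENAME: &str = "https://atomicdata.dev/properties/filename";
const FILESIZE: &str = "https://atomicdata.dev/properties/filesize";

/// Escape a string so it can be placed inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// File + Parent in, JSON-AD out (as a serialized string).
/// Fallback behaviour only: the file is described as a plain file resource.
pub fn atomize(file: &SourceFile, parent: &str) -> String {
    format!(
        "{{\"{}\":\"{}\",\"{}\":\"{}\",\"{}\":{}}}",
        PARENT,
        json_escape(parent),
        FILENAME,
        json_escape(&file.name),
        FILESIZE,
        file.bytes.len()
    )
}
```

A real implementation would probably build the JSON with a serializer rather than `format!`, and would return structured resources instead of a string. The string keeps this sketch dependency-free.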

Implementation

I think a sensible technological approach is to write all of this in Rust, as a new crate inside this repo. If it's Rust, we can easily embed it in Atomic Server. We can also compile it to WASM later and run it in the browser.

I'd like users to register handlers for various file types as plugins. Each handler can be registered for a specific MIME type, and has a handler function that reads a bunch of bytes and creates one (or more?) resources.

Therefore it might make sense to have a bunch of plugins that do this.
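A minimal sketch of what such a handler registry might look like, assuming a trait-object design. The names `Handler`, `Registry`, `Resource`, and `PlaintextHandler` are all hypothetical, and the property URL is a placeholder rather than an existing property.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a resource: a list of (property URL, value) pairs.
pub type Resource = Vec<(String, String)>;

/// A plugin that turns raw bytes into one or more resources.
pub trait Handler {
    fn handle(&self, bytes: &[u8]) -> Vec<Resource>;
}

/// Example plugin: stores the file content as a single text value.
pub struct PlaintextHandler;

impl Handler for PlaintextHandler {
    fn handle(&self, bytes: &[u8]) -> Vec<Resource> {
        let text = String::from_utf8_lossy(bytes).into_owned();
        // Placeholder property URL, not a real Atomic Data property.
        vec![vec![("https://example.com/properties/text".to_string(), text)]]
    }
}

/// Maps a MIME type to the handler registered for it.
#[derive(Default)]
pub struct Registry {
    handlers: HashMap<String, Box<dyn Handler>>,
}

impl Registry {
    /// Register a handler for one MIME type (stored lower-cased).
    pub fn register(&mut self, mime: &str, handler: Box<dyn Handler>) {
        self.handlers.insert(mime.to_ascii_lowercase(), handler);
    }

    /// Run the handler for `mime`, or return `None` if nothing is registered,
    /// so the caller can fall back to uploading the file as-is.
    pub fn atomize(&self, mime: &str, bytes: &[u8]) -> Option<Vec<Resource>> {
        self.handlers
            .get(&mime.to_ascii_lowercase())
            .map(|h| h.handle(bytes))
    }
}
```

Returning `Vec<Resource>` leaves the "one (or more?)" question open: a handler that only ever produces one resource just returns a single-element vector.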

MIME type recognition. There are tools that help to identify the file type. A notable one is libmagic, and its Rust wrapper magic. A lightweight alternative that only uses file extensions is mime_guess.
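To make the two approaches concrete, here is a hand-rolled sketch that first checks a few well-known magic bytes and then falls back to the file extension. It does not use libmagic or mime_guess and is not their API. The function names are hypothetical and the tables are deliberately tiny.

```rust
use std::path::Path;

/// Content-based detection: look at the first bytes of the file.
pub fn sniff_magic(bytes: &[u8]) -> Option<&'static str> {
    const PNG: &[u8] = &[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];
    if bytes.starts_with(PNG) {
        Some("image/png")
    } else if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some("image/jpeg")
    } else if bytes.starts_with(b"%PDF-") {
        Some("application/pdf")
    } else {
        None
    }
}

/// Extension-based detection: cheap, but trusts the file name.
pub fn guess_from_extension(filename: &str) -> Option<&'static str> {
    let ext = Path::new(filename).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "md" => Some("text/markdown"),
        "csv" => Some("text/csv"),
        "html" | "htm" => Some("text/html"),
        "txt" => Some("text/plain"),
        "pdf" => Some("application/pdf"),
        "png" => Some("image/png"),
        "jpg" | "jpeg" => Some("image/jpeg"),
        _ => None,
    }
}

/// Prefer content sniffing, then the extension, then a generic binary type.
pub fn detect_mime(filename: &str, bytes: &[u8]) -> &'static str {
    sniff_magic(bytes)
        .or_else(|| guess_from_extension(filename))
        .unwrap_or("application/octet-stream")
}
```

Text formats such as Markdown and CSV have no magic bytes, so the extension fallback is still needed even when content sniffing is available.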

Filetypes / data types to atomize:

  • Files. We already have the File class, which describes a file with some size and some filetype. If the Atomizer does not know what to do with a file, it will simply upload it as-is: as a file.
  • Plaintext files. Anything that's a programming language file, or other text file, can be converted into this.
  • Markdown. These can be converted to Articles. We still need a proper model for this.
  • Documents. PDF, Word, etc. can be converted to plaintext, which makes them searchable. Crates: pdf-extract
  • Image. Similar to Files, but with extra data, such as EXIF location / camera / aperture / ISO, etc. These can be transformed to smaller images using an endpoint (see Image endpoint (resize, crop) #257).
  • CSV. Parse the headers and convert it to a Table, with Class + Properties.
  • HTML. See Import HTML pages #432.
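As an illustration of the CSV item above, the header row could be turned into property shortnames and each following row into a set of property/value pairs. This is a naive sketch: it splits on commas without handling quoted fields, and `Table`, `shortname`, and `csv_to_table` are hypothetical names, not existing types in this repo.

```rust
/// Hypothetical result type: property shortnames plus one entry per CSV row.
pub struct Table {
    pub properties: Vec<String>,
    pub rows: Vec<Vec<(String, String)>>,
}

/// Turn a header cell into a lower-case, dash-separated shortname.
fn shortname(header: &str) -> String {
    header
        .trim()
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c.to_ascii_lowercase() } else { '-' })
        .collect()
}

/// Parse the header line into properties and pair every row's cells with them.
/// Returns `None` when the input has no header line.
pub fn csv_to_table(csv: &str) -> Option<Table> {
    let mut lines = csv.lines().filter(|l| !l.trim().is_empty());
    let headers: Vec<String> = lines.next()?.split(',').map(shortname).collect();
    let rows: Vec<Vec<(String, String)>> = lines
        .map(|line| {
            headers
                .iter()
                .cloned()
                .zip(line.split(',').map(|cell| cell.trim().to_string()))
                .collect()
        })
        .collect();
    Some(Table { properties: headers, rows })
}
```

A real handler would likely rely on a proper CSV parser and would also need to decide on datatypes per column, which this sketch ignores by keeping every value as a string.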

Inspiration:

@joepio joepio changed the title Atomizer Atomizer - recognize file types and convert to JSON-AD Jun 14, 2022
@joepio joepio mentioned this issue Feb 11, 2023