Marketplace #1750

Closed
11 of 28 tasks
wdanilo opened this issue May 22, 2021 · 32 comments
Labels
--breaking (Important: a change that will break a public API or user-facing behaviour), -tooling (Category: tooling), p-medium (Should be completed in the next few sprints)

Comments

@wdanilo (Member)

wdanilo commented May 22, 2021

Specification

Package Manager:

Engine:

Cloud:

IDE:

Requirements


GUI related description

  • Managing Libraries
    There needs to be a way to create and delete libraries via the GUI.

    • Future task: there needs to be a way to rename libraries.
  • Assigning nodes to libraries
    There should be an option in the node's context menu (RMB) to assign this node to a library. After doing so:

    • The node's source code should be automatically refactored to the library's source code.
    • In case the refactoring could not happen, an error with a clear explanation should appear.
    • All imports in this project should be auto-refactored too. For now, we do not care about imports in other projects.
    • Future task: there should also be an option to refactor the conflicting dependencies too (it should be part of a bigger code-refactoring utility).
  • Navigating the nodes
    If a user double-clicks a node that was refactored to a library, they should properly enter its definition in the graph editor. Breadcrumbs should reflect the user's logical path, not the path on the disk (as currently). If they then open the text editor, it should show the code of that component.

    • If the node is not located in a user library, the node should not be entered; instead, an info message should be displayed indicating that the node cannot be edited.
  • Users should be able to select the Edition for the project.

    • There should be a global setting (drop-down menu) in the app allowing the user to select one of the predefined Editions.
    • After selecting it, the Edition field in the project's configuration file should be updated and should be properly handled by the engine.
    • The drop-down menu should have icons for library versions that are installed and for those that need to be downloaded. For those installed, clicking the entry should switch the project to that library version; for those that are not installed, clicking it should show an installation progress bar next to the drop-down entry. After closing the drop-down menu, if at least one installation is in progress, a cumulative progress bar showing the overall installation progress should be displayed.
    • Hovering over it should show the progress of each ongoing installation.
    • After switching to another library version we are OK with losing all cache, and we are also OK that some workflows could break due to incompatible API changes. For now, we do not want to build any tools to help with the transition between versions. But there should be a confirmation dialog prompting the user to confirm the change and explaining that data will be lost.
  • User accounts

    • In order to share libraries, users need to be logged in to the application with their GitHub account.
    • Users should not need to sign in if they do not want to share libraries with other users (not everyone has a GitHub account).
    • In the app, there should be a "log in" button in the upper right corner, or a user name with a "log out" option displayed.
    • Log-ins should be shared with the Enso cloud account system.
  • Marketplace panel

    • There should be a marketplace panel that should show all libraries available nicely laid out as tiles.
    • The marketplace view should be a web-view, as it will be shared with the website (https://enso.org/marketplace).
    • The website version should have "install" buttons that should open the app, while the buttons in the app should behave slightly differently - they should convert to progress bars while the library is being downloaded / installed.
    • There should also be a progress bar next to the "marketplace" icon / menu, showing cumulative progress for all installs happening at the given time.
    • The panel should display categories (names like "data science") and tiles inside. Each tile should display a library-defined image (or a default one) and a synopsis taken from the library's Main.enso file doc.
    • Installing a library should also add an import to the current project's main file.
    • The interface should also allow removing a library from the current project, which removes the import.

Engine related description

  • Libraries on disk
    All user-defined libraries should be placed next to downloaded libraries in ~/enso/libraries, next to ~/enso/projects. The idea is that when a user defines a library (a set of nodes), it should be out-of-the-box accessible in all of their projects. Also, the engine should use the ENSO_LIBRARY_PATH env variable to discover other folders where libraries can be placed. The above locations should be used by default. There should also be a parameter in project.yaml and a way to pass command-line parameter-based overrides for this.

  • Library naming
    All user-defined libraries should have the name starting with their Enso account username. For example, a shapes library could be named (and imported) as import wdanilo.Shapes. The only exception is the Standard library, which is provided by the core team.

  • Library metadata

    • In each library, there should be a folder meta which should contain icon.png and preview.png (of predefined sizes). These images will be used by the library searcher. In case of missing files, a default will be used.
    • Also, there should be a file LICENSE.md in the top folder of a library. By default, the LICENSE.md should be populated with the MIT license. In our terms of use, there should also be a sentence stating that in case of a missing LICENSE.md file, the license defaults to MIT.
    • There needs to be a project configuration file with metadata like "name", "tag-line", "description", as needed for the Marketplace to show information.
  • Library discovery
    There should be no special configuration file to describe library dependencies used by the project. All libraries should be auto-discovered based on imports.

  • Library versioning
    The library versioning should be based on a simple resolver configuration. A resolver can be one of nightly-<VERSION>, unstable-<VERSION>, or lts-<VERSION>, where VERSION uses the semantic versioning schema. A resolver is just a file containing all package versions available in the marketplace (see the sketch below).
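
For illustration, a minimal sketch of how the package manager could model and parse such a resolver/edition file. The YAML layout, field names, and the use of the serde / serde_yaml crates are assumptions here, not part of the spec:

```rust
// Hypothetical model of an edition / resolver file: a single file mapping
// every library name available in the marketplace to its pinned version.
// Field names and the YAML format are assumptions for illustration.
use serde::Deserialize;
use std::collections::HashMap;

#[derive(Debug, Deserialize)]
struct Edition {
    /// E.g. "nightly-2021-04-23", "unstable-2021.5" or "lts-2021.5".
    resolver: String,
    /// Maps a fully qualified library name (e.g. "wdanilo.Shapes")
    /// to the single version pinned by this edition.
    libraries: HashMap<String, String>,
}

fn parse_edition(contents: &str) -> Result<Edition, serde_yaml::Error> {
    serde_yaml::from_str(contents)
}

fn main() {
    let example = "\
resolver: lts-2021.5
libraries:
  wdanilo.Shapes: 1.2.0
  Standard.Table: 2021.5.0
";
    let edition = parse_edition(example).expect("invalid edition file");
    // Look up the version pinned for a library in this edition.
    println!("{} -> {:?}", edition.resolver, edition.libraries.get("wdanilo.Shapes"));
}
```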

Package-manager related description

  • Storage

    • The packages will be stored in a repository that consists of plain static files and is accessed through HTTP(S).
    • The repository should have the following structure
      ├─ libraries
      │  └─ <username>
      │     └─ <libname>
      │        └─ <version>
      │           ├─ src.tgz
      │           ├─ test.tgz
      │           ├─ visualizations.tgz
      │           ├─ config.yaml
      │           ├─ manifest.yaml
      │           └─ LICENSE.md
      ├─ editions
      │   ├─ nightly
      │   │   └─ 2021-04-23
      │   ├─ unstable
      │   └─ stable
      │        └─ 2021.5
      └─ ban.list  // Contains banned user list.
      
  • Management

    • There should be an app written in Rust as part of this task which should allow managing the repository.
      • There also needs to be a server side component that serves as the backend of this app. It needs to handle authentication, and access to the git repo containing all the data. See below for some additional details.
    • The app should implement the following commands:
      • update, which should reset the local git repo and pull the libraries-version folder only.
      • install [--libraries-version=...] <NAME>, which should look up the name in the appropriate libraries version file, pull it, and repeat that for all of its dependencies. All pulled source code should be located in ~/enso/libraries/<libname>/<version>. Also, it is important to note here that only the needed things should be pulled - only the sources of the library without the test folder.
      • push <PATH>, which should upload the library of the provided path or the library located in the closest parent to CWD if the path is missing. This should use SSH-based authentication on GitHub. Of course, users would not have access to it directly, see "publishing libraries" section below to learn more.
      • search, which should search package names by part of the name.
      • info, which should provide the name, version, and synopsis of a given package.
      • publish, which should publish the package just like the pull command, but should not require SSH authentication to GitHub. Instead, it should utilize the server-app described in the "Publishing libraries" section.
      • unpublish <LIBRARY> <VERSION>, which should unpublish the provided version of the given library. Unpublished libraries can still be downloaded and used, but are not visible for people who never used them before.
    • The app should come with aliases for every action. The aliases might be created as separate files or in another way. For example, there should be a command-line alias enso-install for enso-marketplace install. This way, the enso command should be able to allow users to write enso install (searching for enso-install in the environment, just like Git does; see the sketch after this list).
    • The app should allow providing other, custom repositories too. In case multiple repositories are provided, the library-versions files should be merged. For example, in case we are using repository A and repository B and we are using library-version=stable-2.0, then the files A/libraries-version/stable and B/libraries-version/stable should be considered and the list of libraries for 2.0 should be merged.
    • The repository should use Git LFS. See GPM docs to learn more about this idea.
    • There should be a limit on library size that is checked before uploading a library version. For now, let's keep it small, e.g. max 2 MB. In the future, we might want to increase this number per library or per library author.
    • Initially, we were thinking of reusing the Git Package Manager written in Rust, however, it is probably a better idea to write this app from scratch because GPM:
      • Uses different assumptions than we use in many places, including package resolution, package names, etc.
      • Is pretty small (total line count = 1500).
      • Does NOT have any docs in the code.
      • Doesn't seem to be popular / used beyond a single company.
    • As the repository will grow drastically, it is important to always clone only the needed portion of it. This includes its booting - only package lists should be cloned / synchronised during update. See the following links to learn more on how to do it:
    • In the future, we might also consider allowing libraries to specify OS-dependencies (like os-based binaries, e.g. in OpenCV bindings). These dependencies should then also be pulled lazily and only for the target OS.
  • Publishing libraries

    • There should be a server-app created that would use the "marketplace" program described earlier. This app should have SSH key authentication enabled, so it will have rights to upload packages to GitHub.
    • The "marketplace" app should be able to communicate with this server-app to upload user packages without giving users full access to the repository. Users that are authenticated, should be able to run the command publish command of the "marketplace" app.
    • Certain users might be banned, and their GitHub username list should be stored in the ban.list file.
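
A rough sketch of the git-style alias dispatch mentioned in the Management section: the enso-<command> naming comes from the spec above, while the lookup and delegation logic here is an assumption (and Windows would additionally need an .exe suffix):

```rust
// Sketch: `enso install ...` looks up an `enso-install` executable on PATH
// and delegates to it, similarly to how Git discovers external subcommands.
use std::env;
use std::path::PathBuf;
use std::process::Command;

/// Finds `enso-<subcommand>` in the directories listed in PATH.
fn find_external(subcommand: &str) -> Option<PathBuf> {
    let exe = format!("enso-{subcommand}");
    let paths = env::var_os("PATH")?;
    env::split_paths(&paths)
        .map(|dir| dir.join(&exe))
        .find(|candidate| candidate.is_file())
}

fn main() {
    let mut args = env::args().skip(1);
    let Some(subcommand) = args.next() else {
        eprintln!("usage: enso <command> [args...]");
        return;
    };
    match find_external(&subcommand) {
        Some(path) => {
            // Forward the remaining arguments to the external command.
            let status = Command::new(path)
                .args(args)
                .status()
                .expect("failed to run subcommand");
            std::process::exit(status.code().unwrap_or(1));
        }
        None => eprintln!("unknown command: enso-{subcommand} not found on PATH"),
    }
}
```
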
@MichaelMauderer (Contributor)

* **Assigning nodes to libraries**

This also implies that there is a way to manage libraries, or at the very least to create a new library. This is also relevant for the marketplace: the information we want to show in the marketplace needs to be configured when creating or editing a library.

* **Navigating the nodes**

Should this be possible for any node from a different library or only locally defined ones? If so, should there be some way to show "read only" nodes, for example from the standard library.

* **User should be able to select the `libraries version` for the project**.

Name idea: instead of libraries version we could call it "edition". Format could be 2021.3 or along those lines. It would give a sense of how current an edition is.

* **User accounts**
  
  * In order to share libraries, users need to be logged in to the application with their GitHub account. Users should not need to log in.
  * Users should not need to sign in if they do not want to share libraries with other users (not everyone has to have GitHub account).
  * In the app, there should be a "log in" button in the right upper corner or a user name with the "log out" option displayed.

Note: this should be a proxy for an "Enso account" and could be expanded to other authentication services. Thought: do we have account management for the cloud? This maybe should be the same account from there (at some point).

* **Marketplace panel**
  
  * There should be a marketplace panel that should show all libraries available nicely laid out as tiles.

Needs a design mock-up.

  * The panel should display categories (names like "data science") and tiles inside. Each tile should display library-defined image (or default one) and a synopsis taken from the lib `Main.enso` file doc.

The preview for libraries without an image could be an abstract graph representation of the nodes. Or some fancy cool graphic derived from it. That makes it unique and recognizable even without a custom image.

* **Library naming**
  All user-defined libraries should have the name starting with the user github name. For example, a shapes library could be named (and imported) as `import wdanilo.Shapes`. The only exception is the `Standard` library, which is provided by the core team.

How would this be disambiguated later on if there are multiple possible repositories for libraries and there are name clashes? Are we going to ensure unique "Enso" usernames? What happens on name change?
Libraries from ENSO_LIBRARY_PATH are presumably imported without username prefix and just "found". So in this scenario there would also be the unpublished "Shapes" lib locally that could be used.

    └─ ban.list  // Contains banned user list.

This should probably not be clear text names but hashes or something along those lines.

* **Management**
  
  * There should be an app written in Rust as part of this task which should allow managing the repository.

Is this going to be Enso's "cargo"? Or is there another tool that is going to be used to invoke, e.g., tests?

    * `push <PATH>`, which should upload the library of the provided path or the library located in the closest parent to CWD if the path is missing. This should use SSH-based authentication on GitHub. Of course, users would not have access to it directly, see "publishing libraries" section below to learn more.

Is push going to do validation of the package? E.g. checking it is valid code, has valid config, running tests or things like that? We, of course, need to do this, server side, too, but it might be beneficial for the user trying to publish and spotting issues early.

    * `search`, which should search package names by part of the name.
    * `info`, which should provide the name, version, and synopsis of a given package.
    * `publish`, which should publish the package just like the pull command, but should not require SSH authentication to GitHub. Instead, it should utilize the server-app described in the "Publishing libraries" section.
    * `unpublish <PATH>`, which should unpublish the library of the provided path or the library located in the closest parent to CWD if the path is missing. Unpublished libraries can still be downloaded and used, but are not visible for people who never used them before.

Should this be based on a version? It might be beneficial to unpublish specific buggy versions of a library, but not the whole libraries. See Cargo's yank.

  * Certain users might be banned, and their GitHub username list should be stored in the `ban.list` file.

See above: this should not be clear text names.
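
For illustration, a hash-based ban.list could look roughly like the sketch below. The sha2 crate, the lower-casing, and the one-hash-per-line format are assumptions; note that hashing low-entropy usernames only obscures them rather than hiding them completely:

```rust
// Sketch: store SHA-256 hashes of banned usernames in ban.list instead of
// clear-text names.
use sha2::{Digest, Sha256};

/// Hex-encoded SHA-256 of a (lower-cased) username.
fn hash_username(username: &str) -> String {
    let digest = Sha256::digest(username.to_lowercase().as_bytes());
    digest.iter().map(|b| format!("{:02x}", b)).collect()
}

/// `ban_list` is the contents of ban.list: one hash per line.
fn is_banned(ban_list: &str, username: &str) -> bool {
    let needle = hash_username(username);
    ban_list.lines().any(|line| line.trim() == needle)
}

fn main() {
    let ban_list = hash_username("some-banned-user") + "\n";
    assert!(is_banned(&ban_list, "Some-Banned-User"));
    assert!(!is_banned(&ban_list, "wdanilo"));
}
```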

@iamrecursion (Contributor)

iamrecursion commented May 24, 2021

There should be an option in the node's context menu (RMB) to assign this node to a library. After doing so:

How are new libraries created? What does a user need to do when creating a new library?

  • The node's source code should be automatically refactored to the library's source code.
  • In case the refactoring could not happen (it would cause circular dependencies between libraries), an error with a clear explanation should appear.

Given that we currently support circular dependencies between modules (it's very useful), this may not be an actual problem. While we can't deal with circular re-exports, a module B can import a module A that can import a module B due to how we elaborate names. I can, however, foresee future circumstances where this would be an issue, so keeping this as a feature point is still important.

  • All imports in this project should be auto-refactored too. For now, we do not care about imports in other projects.

What do you mean by this? Adding and removing imports as necessary based on the extraction to the library?

If a user double-clicks a node that was refactored to a library, they should properly enter its definition in the graph editor. Breadcrumbs should reflect the user's logical path, not the path on the disk (as currently). If they then open the text editor, it should show the code of that component.

I'm not sure what you mean by "logical path". Do you mean the chain of call-sites that got the user to this point?

Along similar lines, what should happen when a user tries to edit a library that they do not own? We probably want some concept of a "protected" library that users have to accept a warning to edit, and we should track the edit status of these libraries in our logging. We don't want users to accidentally modify the standard library and then report bugs, for example.

By the same token, though, we do want it to be easy for users to make modifications to functionality in other libraries as part of their own libraries. We probably want some ability to "extract from $foo into $bar" to enable this.

  • User should be able to select the libraries version for the project.

What does this mean? A dropdown containing the various lts, nightly, etc. versions? I feel like that's a very technical question to ask a potentially non-technical user and wonder if we can make it friendlier.

  • After switching to another library we are OK with losing all cache and we are also OK that some workflows could break due to incompatible API changes. For now, we do not want to do any tools to help with the transition between versions.

We should definitely print a warning about this.

  • In order to share libraries, users need to be logged in to the application with their GitHub account. Users should not need to log in.

We need to have an exceedingly clear terms of service for this kind of thing. It's very important that users agree to it before getting access to any of this functionality.

  • There should be a marketplace panel that should show all libraries available nicely laid out as tiles.

Pretty layouts are all well and good, but this sounds like a nightmare for discovery. We really need to think about how this will enable users to find the library that they want.

  • There should also be a progress bar next to the "marketplace" icon / menu, showing cumulative progress for all installs happening at the given time.

Cumulative progress is all well and good for an overview, but I think it important that we allow users to hover over it and see detailed progress for each download without having to actually re-enter the marketplace.

All user-defined libraries should be placed next to downloaded libraries in ~/enso/libraries, next to ~/enso/projects. The idea is that when user defines a library (a set of nodes), it should be out-of-the-box accessible in all of their projects. Also, the engine should use the ENSO_LIBRARY_PATH env variable to discover other folders where libraries can be placed. The above locations should be used by default.

ENSO_LIBRARY_PATH is a good starting point, but we want to be able to override it on a per-project basis and a per-run basis. We should think about both project.yaml and command-line parameter-based overrides for this.

All user-defined libraries should have the name starting with the user github name. For example, a shapes library could be named (and imported) as import wdanilo.Shapes. The only exception is the Standard library, which is provided by the core team.

This is a bad idea due to the simple fact that a GitHub username is mutable. People can change their usernames which means that the import path of the library would either change or become vulnerable to a supply-chain attack by someone taking over the old username.

If we want to go with user-based prefixes we need to maintain the global uniqueness and immutability of these prefixes ourselves, not rely on GitHub.

  • In each library, there should be a folder meta which should contain icon.png and preview.png (of predefined sizes). These images will be used by the library searcher. In case of missing files, a default will be used.

I don't know what you want in preview.png, but supporting user-generated content opens us up to some very questionable legal grey areas. I would suggest we allow users to select from a predefined list of icons.

Thinking more on this we're still open to the same legal grey areas as users can presumably upload arbitrary contents of the data/ directory with their library. We definitely need to get some insight from the lawyers as to what safeguards we need to have in place for user-generated content, moderation, and so on.

There should be no special configuration file to describe library dependencies used by the project. All libraries should be auto-discovered based on imports.

This is all well and good but seems like a bit of a usability problem to me if we don't do it well:

  • We can't arbitrarily download all libraries into the resolver onto a user's disk as this may be massive.
  • We can't provide suggestions for all libraries in a resolver without either pre-generating a potentially-massive suggestions database or downloading all of the libraries to generate the suggestions locally.
  • I'm not sure what we want to do for suggestions as not suggesting available libraries limits discovery (perhaps we could take a keyword-based approach) but also having suggestions from the entire resolver sounds like a discoverability nightmare.

The library versioning should be based on a simple resolver configuration. A resolver can be one of nightly-<VERSION>, unstable-<VERSION>, or lts-<VERSION>, where VERSION uses semantic versioning schema. A resolver is just a file containing all packages versions available in the marketplace.

What is the distinction between nightly and unstable?

  • The packages should be stored using Git in a Git LFS enabled repository.

Please make sure that you understand the potentially astronomical costs of this. If we're expecting users to be fetching LFS data (which we are, given that the data folder can contain arbitrary binary data) we need to be prepared for the costs to get very high. GitHub's LFS pricing is not very cost-effective when compared to simple bandwidth egress costs on an AWS bucket (or Azure equivalent). This would only get worse in the future as we want to start vendoring precompiled library versions, native dependencies, and so on.

Furthermore, if we use LFS we need to pre-populate an LFS configuration and reject pushes that don't use it. Otherwise we will have users making the mistakes of storing binary files outside of LFS, which will severely impact repository performance. It's also a consideration that such a repository will be subject to a heavy amount of churn, which pollutes the repository history with lots of small commits. This is likely to run up against multiple performance edge-cases in git as GitHub sees with large repos.

  • The GitHub repository should be located at http://github.com/enso-org/marketplace and should be a public repository.
  • The repository should have the following structure
├─ libraries
│  └─ <username>
│     └─ <libname>
│        └─ <version>
│           ├─ src             // Contains Main.enso with documentation used in the marketplace.
│           ├─ test
│           ├─ visualizations
│           ├─ config.yaml     // Contains libraries-version field.
│           └─ LICENSE.md
├─ libraries-version
│   ├─ nightly
│   ├─ unstable
│   └─ stable
└─ ban.list  // Contains banned user list.

We need to be particularly careful with this as git is fairly unfriendly and it's very easy for users to accidentally upload PII or protected classes of information with their library. Git is a poor tool for retracting such things, which makes me think that we probably don't want to use it. We need to provide users the means to remove such mistakes, even if that removal is incomplete due to having been public (as is always the case). I don't think that carefully warning the user before upload is sufficient when the mistakes become effectively permanent (as we can't force-push the repo due to lots of people depending on it).

  • The app should come with aliases for every action. The aliases might be created as separate files or another way around. For example, there should be command-line alias enso-install for enso-marketplace install. This way, the enso command should be able to allow users to write enso install (searching for enso-install in env, just like Git does).

We already have support for this, but install is taken by the existing installation functionality. We probably want to either add a sub-command to install that is library, or have the command as enso-library and let people write enso library install (or similar). @radeusgd will have more insight on this.

  • The app should allow providing other, custom repositories too. In case multiple repositories are provided, the library-versions files should be merged. For example, in case we are using repository A and repository B and we are using library-version=stable-2.0, then the files A/libraries-version/stable and B/libraries-version/stable should be considered and the list of libraries for 2.0 should be merged.

This merging needs to be specified very carefully as we need to have a deterministic and clear mechanism for resolving conflicts.

  • We should have a limit of library size that should be checked before uploading library version. For now, lets keep it small, like max 2Mb. In the future, we might want to increase this number per library or per library author.

This size limit helps somewhat with egress costs, but it's worth keeping in mind that the number of downloads will usually be >>> than the number of uploads, so we still get stung with large egress costs.

@MichaelMauderer (Contributor)

What do you mean by this? Adding and removing imports as necessary based on the extraction to the library?

Yes, this means if a node is extracted from the current file, the imports should be adjusted, so everything still works.

What is the distinction between nightly and unstable?

My understanding was that Nightly is supposed to be the most up-to-date daily version, while unstable is "what becomes stable in the next release". The idea is similar to what Rust does with its train release model.

LFS

I think the idea is that the user does not really interact with the repo itself, but this is only there to back our library management. So, user side configuration should not be an issue, since this is all abstracted away and only the commands from our tool / the IDE are invoked.

@iamrecursion (Contributor)

Makes sense, though my concerns about LFS and bandwidth still stand.

@radeusgd (Member)

If the node is not located in a user library, the node should not be entered, instead an info message should be displayed indicating that the node cannot be edited.

Shouldn't it be possible to enter it but for example in some read-only mode? (If that cannot be done now, maybe we could at least note it for sometime in the future?) I think it may be quite useful to see the implementation of not-editable nodes, as sometimes this may be very helpful for debugging or just understanding some implementation issues.

@MichaelMauderer (Contributor)

Shouldn't it be possible to enter it but for example in some read-only mode? (If that cannot be done now, maybe we could at least note it for sometime in the future?) I think it may be quite useful to see the implementation of not-editable nodes, as sometimes this may be very helpful for debugging or just understanding some implementation issues.

Yes, that is definitely something that we want to tackle at some point. But to limit the scope of the initial implementation we went with the simplest solution.

@radeusgd (Member)

radeusgd commented May 25, 2021

As discussed I'm comparing GitHub (with LFS) vs S3 as backends for storage.

Git

  • Downloading libraries is seemingly easier (the git protocol handles that), but it’s not actually that simple, as we need to use the sparse-checkout functionality to skip directories, which is currently experimental and may change in the future; since we are using LFS we actually need to manually handle downloading files anyway (similarly to how this is done in GPM).
  • Performance may start being a concern when we have lots of packages being updated (each update is a new commit, so the repository grows, and even if we care only about different packages, we need to process the deltas for the whole repository to get to the latest state).
  • We are not really using Git's versioning
    • We are still keeping library versions in separate directories and not in commit history (because each edition maps package name to its version, not to a commit)
    • The only place where we may be using history is the edition files, but here it is also disputable:
      • The stable edition files should be immutable anyway
      • If we want to be able to access older nightly versions, searching through commit history may be non-trivial, if we need that feature it may be better to just keep each nightly edition file as a separate file (since they are not very large)
  • We are using a facade service for uploading packages anyway (due to authentication requirements), so the fact that we are using git is only an internal implementation detail.
  • It may be easier for external vendors to set-up their own external package repositories, just by creating a public Git repository with the required structure.

S3

  • There are libraries for S3 both for Scala and Rust, but we may need some plumbing to implement downloading whole directories (usually they just expose some kind of a streaming API that has to be connected). Still it seems to be much simpler to implement than using Git.
  • More stable and predictable performance, regardless of repository size.
  • Vendor lock-in, but support for other file storages is relatively simple to add if we just use a HTTP API - the only thing that may differ is how to list files in each directory to download them.
  • May be a bit harder for others to set up external repos than just creating a Git repo, but setting up an S3 bucket is not much harder. If we also allowed some kind of simpler file storage through HTTP, maybe it would be possible to use GitHub Pages for this too.

If our current approach is that we keep the directory structure as described above and download only parts of it (for example skipping the test directory), I think (but I may very well be wrong here) both solutions may suffer from a slight performance bottleneck of downloading lots of small files, which is usually slower than downloading a single archive. The Git-backed implementation is not free from that if implemented in the way GPM does it, because as it relies on LFS, downloading each file is a separate HTTP request.

Pricing comparison

S3 - Storage: $0.023/GB/month, Transfer: $0.09/GB (first 1GB/month free), requests: $0.0004/1k requests (negligible)

Git LFS - first 1GB of transfer/storage free, then $5 per 50GB of storage and transfer per month

For the sake of an example, let’s assume that we have 1k users each downloading 1GB of libraries per month and that we store 5GB of data for library versions:

  • Storage costs are negligible here (it will be <$1/month for S3 and will fit in a single data pack for GitHub).
  • Transfer - that would amount to 1TB of transfer monthly:
    • S3 - $90/month
    • GitHub - 20 datapacks needed, which amounts to $100/month

So for such amounts the difference in pricing is indeed not significant. S3 is priced in a regressive manner, so for significantly bigger transfer capacities the pricing difference will be larger (S3 will be more affordable).

@radeusgd (Member)

While analysing possible designs I have encountered a quite important question, the answer to which may affect what designs are possible: do we plan to have a mechanism for overriding a dependency version over what is defined in an edition? If yes, how should that work?

Motivation: Let’s say we have library A-1.0.0 that depends on B-1.0.0, which are both included in the latest edition 2021.4. Now, library B has been upgraded to B-2.0.0 (assume there are breaking changes) and it is scheduled to be bumped in edition 2021.5. If library A does not update its dependencies, it will not be able to be included in edition 2021.5 anymore, because it would cause a dependency conflict (B cannot co-exist in a single edition in two versions at once). To fix that, maintainers of A want to perform an upgrade which will involve some code modifications (as there were the mentioned API changes). To do so, they need to be able to use B-2.0.0, but it is not yet part of any edition. They cannot wait for 2021.5 to be released because that will be too late - A will need to be pulled out of this edition along with any other libraries that depend on A, which shows a need for some overriding mechanism.

@iamrecursion (Contributor)

Just a note that it probably makes sense to look at Azure Blob Storage as we have lots of credits there and are slowly moving our infra to Azure.

@radeusgd (Member)

radeusgd commented May 26, 2021

  • Libgit2 does not support sparse-checkout nor partial-clone, so the only way to use these features is to wrap the CLI
  • We are not using Git’s incremental updates anyway if we are storing library versions in separate directories. If we want to be able to use the incremental mechanisms, we should reconsider other designs, like updating the libraries in-place and keeping older versions as part of the git history; but that has other implications - for example it complicates co-existence of projects using different editions which point to different versions of the same library, and also complicates creating custom editions, which AFAIK we want to support (see the above about overriding versions).
  • It seems that the suggested way to go is to use both partial clone (to enable ‘lazy downloading’) and sparse-checkout (to only checkout and thus download the necessary files); technically one can use just the partial clone and specify sparse filters for it but there are hints that it may hurt performance significantly.
  • sparse-checkout may not scale very well. If the goal is to support hundreds of thousands of libraries, sparse-checkout may start becoming a bottleneck - it seems that even when using the more efficient but restricted cone filters, the complexity of checking out files is roughly linear in the number of all (not only checked-out) files in a given directory; that is because the cone filter must check each subdirectory to see if it should be checked out or not. We could try overcoming this problem by structuring the library index in some weird nested way:
    • Instead of having library folders abc, abd and def, we could do a/b/c, a/b/d and d/e/f, limiting the number of entries in a single parent directory to check.
  • However, it seems that even with the most restricted settings (a shallow blobless partial clone with sparse checkout of only the necessary files), git will still at least contain references to all files in the repository at the current point in history (i.e. the full tree of HEAD). It will not download the blobs for these files (so they will not take as much space as full files), but we will have objects that say that such a file exists. Thus if we have hundreds of thousands of libraries, each of which may contain tens or hundreds of files, we will still need to download A LOT of data. For clarity, I could not find a definitive answer within the documentation that would confirm that this is necessarily the case, but after reading multiple sources, including https://github.blog/2020-12-21-get-up-to-speed-with-partial-clone-and-shallow-clone/, that seems to be the most likely scenario. We may try running some kind of benchmarks to confirm that, but that will take some time to implement.
    • We could try overcoming this by for example keeping each library in a separate branch (and then shallow cloning only the desired branch), but then we would need to copy the libraries to some other place to check them out (as only one branch in the repo can be checked out at a time) and we remove most of the advantages of using git. Moreover, having so many branches may also cause performance problems.
  • Partial clone seems to be implemented on GitHub and GitLab, but if someone links to a repository hosted on some older version then, if my understanding is correct, such a repository would always be downloaded in whole (since partial clone is an optional protocol feature), which may cause unexpected performance issues for users setting up custom repositories.

As a summary - I agree that git is a great abstraction of file storage, but it is from its inception tailored for use-cases where we download the whole repository. There are mechanisms that were developed as an answer to monorepos which now allow it to avoid downloading all the data, but they are still mostly experimental and not mature yet - they may work ok, but we do not really have any performance guarantees.

@radeusgd (Member)

The main issue, besides the fact that the features we want to use are experimental, is that we are not really using many of the features git provides, while having to ‘fight’ with git to get what we need.

For example, we are not using the versioning history and incremental updates for the library files, because we are storing library versions in separate directories, so a specific library version is actually immutable and does not have ‘history’ (apart from not existing and then existing). We could try to leverage git’s versioning but that would require overwriting the library files for new versions. However, such a solution would then conflict with using different editions across projects - we would need to somehow check out different versions at the same time - so we would need to actually move stuff out of the repository to be used (to allow using different versions of the library at the same time) - but then we start treating git just as a content delivery system, which it is not. The only upside is that once the user installed library A-1.0.0 from edition X and they create a new project with edition Y which requires library A-1.1.0, the incremental update may avoid re-downloading some artifacts. But this would require modifying the design of the directory structure and poses other issues.

@radeusgd (Member)

radeusgd commented May 26, 2021

Suggested designs:

Blob storages

Loose blob storage

We can use the already suggested storage layout, just storing the library data in separate directories, each file as a separate blob.

libraries
└─ <username>
   └─ <libname>
      └─ <version>
         ├─ src
         ├─ test
         ├─ visualizations
         ├─ config.yaml
         └─ LICENSE.md

Uploading would be done through the service but conceptually it just consists of putting the library files in its directory. Very easy to implement for custom repositories.
Downloading would require knowing which files to download (knowing the structure of the src directory and its subdirectories etc., see the next section for details). The downside of this approach is that downloading each file separately may be inefficient. We could try using some optimizations using HTTP/2 multiplexing, but that is getting complex and less portable.

Vendor-lock or adding manifests

As the above approach needs to know the structure of the directories to download, we need some way to convey this information.
One way is to use the API of the blob storage that we use under the hood and just query for the directory contents/index. This solution is nice because it requires no action from the user setting up the repository (apart from setting the permissions for listing directory contents), but it ties us to a specific vendor - S3 and Azure have different APIs for that, and custom solutions would not be applicable.

An alternative solution is to generate a manifest file that would reside at the root of the directory for each library which would contain a list of all files in all directories that are part of this library. This is a completely platform-agnostic approach and it would allow us to easily swap backends for the package repository. In particular, any kind of hosting service would work. Moreover it can be a bit faster, because we can download a single manifest instead of having to recursively traverse all subdirectories (each subdirectory incurring an additional request). The only downside is that the manifests need to be generated somehow - but that is not really a problem as our tool can generate them when the library is being uploaded; the logic for generating the manifests is also extremely simple, but it affords us portability.
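
As a rough illustration, generating such a manifest can amount to walking the library directory and writing out every relative file path. The manifest file name and its one-path-per-line format below are assumptions:

```rust
// Sketch: emit a manifest listing every file of a library, so that a client
// can download it from any "dumb" HTTP storage without a vendor-specific
// directory-listing API.
use std::fs;
use std::io;
use std::path::Path;

/// Recursively collects the relative paths of all files under `root`.
fn collect_files(root: &Path, dir: &Path, out: &mut Vec<String>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            collect_files(root, &path, out)?;
        } else {
            let relative = path.strip_prefix(root).expect("path is under root");
            out.push(relative.to_string_lossy().replace('\\', "/"));
        }
    }
    Ok(())
}

/// Writes a very simple manifest: one relative file path per line.
fn write_manifest(library_root: &Path) -> io::Result<()> {
    let mut files = Vec::new();
    collect_files(library_root, library_root, &mut files)?;
    files.sort();
    fs::write(library_root.join("manifest"), files.join("\n"))
}

fn main() -> io::Result<()> {
    write_manifest(Path::new("./my-library"))
}
```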

Blob storage with subarchives

A completely alternative solution is to not store the files in a ‘loose’ manner but instead create packages for logical components of the library. We can have a separate package for each logical component - for example for the sources, for the tests, for the binary data. Some simple files that are always present, like configuration files / metadata, or the license file, could still be stored loosely for simplicity.

The proposed directory structure would be:

libraries
└─ <username>
   └─ <libname>
      └─ <version>
         ├─ src.zip
         ├─ test.zip
         ├─ visualizations.zip
         ├─ config.yaml
         └─ LICENSE.md

Now, as each logical component is packaged separately, we can easily download only the sources, ignoring for example the test files. We gain better storage and transfer efficiency as the data is stored compressed, and the download operation just downloads a constant number of packages for each library. The package manager can then unpack the packages locally, which is a simple operation.
This requires a bit more complexity on the upload side, as the packages have to be prepared, but since the layout is predefined, it is quite simple to implement, and it is again easier to upload a few packages rather than lots of loose files. We can do the packaging in the client and upload each package separately, or alternatively the client could upload the library as a single package and then the upload service could repackage it into the components. But I think doing it on the client side is the best solution.
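
A minimal sketch of that client-side packaging step, assuming tar.gz archives and the tar / flate2 crates (the concrete archive format - zip as above or tar.gz - is still an open detail):

```rust
// Sketch: pack each logical component of a library into its own compressed
// archive before upload; loose files like config.yaml stay as they are.
use flate2::{write::GzEncoder, Compression};
use std::fs::File;
use std::io;
use std::path::Path;

/// Packs `<library_root>/<component>` into `<library_root>/<component>.tgz`.
fn pack_component(library_root: &Path, component: &str) -> io::Result<()> {
    let archive = File::create(library_root.join(format!("{component}.tgz")))?;
    let encoder = GzEncoder::new(archive, Compression::default());
    let mut builder = tar::Builder::new(encoder);
    // Store the files under the component name so unpacking recreates the layout.
    builder.append_dir_all(component, library_root.join(component))?;
    builder.into_inner()?.finish()?;
    Ok(())
}

fn main() -> io::Result<()> {
    let root = Path::new("./my-library");
    for component in ["src", "test", "visualizations"] {
        pack_component(root, component)?;
    }
    Ok(())
}
```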

Storing edition metadata

Each separate edition can be stored as a separate text file, for example:

libraries-version
├─ nightly
│  ├─ 2021-05-21
│  └─ 2021-05-22
├─ unstable
│  ├─ 1.2.3-rc.1
│  └─ 1.2.4
└─ stable
   ├─ 1.2.3
   └─ 1.2.4

Alternatively, to avoid having too many files next to each other in the most populated directory - nightly, we could have subdirectories for each year.

Git-based

As originally suggested

├─ libraries
│  └─ <username>
│     └─ <libname>
│        └─ <version>
│           ├─ src
│           ├─ test
│           ├─ visualizations
│           ├─ config.yaml
│           └─ LICENSE.md
└─ libraries-version
    ├─ nightly
    ├─ unstable
    └─ stable

The original idea was to store the above structure in a git repository.

Then we could use git clone --sparse --depth=1 --filter=blob:none --filter=tree:0; git sparse-checkout init --cone to clone only the latest revision without checking out any files, and initialize the sparse checkout mechanism in the better performing mode (cone filtering instead of arbitrary filters).
Then to initialize the editions, we can do git sparse-checkout add libraries-version/stable etc. to actually force checkout of the edition metadata files.
Later, when installing a library we would need to do:
git sparse-checkout add libraries/username/libname/version/src
git sparse-checkout add libraries/username/libname/version/config.yaml etc. for every entry in libraries root.
Why can’t we just git sparse-checkout add libraries/username/libname/version/? Because we wanted to specifically not download the test files and as far as I understood the documentation we cannot exclude it first and then include all the other files, because of the precedence settings. Moreover such filter pattern would not fit into the ‘cone’ filter shape and the arbitrary filtering has much worse performance than cone-based (the cone here essentially means that for each node we can either include its whole subtree in checked out files or all of its immediate children without their subtrees, and any other filtering is not allowed).

However, with this approach we don’t really get much from git itself - we are not using any tagging or version-control capability, because we put new library versions in separate directories anyway - so in terms of git history they are not connected in any way. So we do not have any incremental updates or other niceties of git.
We do, however, heavily rely on the partial clone and sparse checkout, which are experimental features whose support can change over time (we can ship older versions of git, but if repository remotes change the protocol, we may be forced to upgrade). As for custom repositories, we have no guarantee that the hosting site the users will use will support these features (although it is likely, as the most popular ones, like GitHub and GitLab, seem to support them).

There are important performance concerns - after reading through the documentation, to my best understanding, even if we do the most restricted kind of clone (shallow+sparse+partial clone), we still download the whole tree - that is, we have references to every object that is in the current snapshot of the repository. That means that we have a reference to every file of every library, even the not-downloaded ones. Of course it does not take a lot of space, because we fetch the blobs lazily, so we have refs to files but no contents of these files. But still, if we want to support huge amounts of big libraries, this may be a significant bottleneck, because the initial download of the repository must still download the refs to all available libraries (and the more files a library has, the more downloading there is to be done). I’m unfortunately not 100% sure that this is the case, but based on multiple sources that seems to be the most likely scenario. To verify it with absolute certainty I would need to read very deeply into git’s source code or perform some tests - this will take more time (I’d guess around half to one day to verify this thoroughly) so I didn’t want to do it unless absolutely necessary. Based on current data (see for example this blog post), I’d say I have 85% certainty that this is the case.

There is also an issue that any operation like checking out a library has to check all libraries against the sparse checkout mask, so our performance is linear in the number of libraries available in the repository. One way to alleviate this would be to store libraries in a directory structure like us/er/na/me/li/br/ar/yn/am/e - thus making the tree deeper but less wide.
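
For illustration, such a sharding scheme could look roughly like this (the two-character segment size and the exact layout are arbitrary assumptions, and library names are assumed to be ASCII):

```rust
// Sketch: shard a library's author and name into short nested path segments
// so that no single directory in the repository grows too wide.
fn sharded_path(author: &str, name: &str, version: &str) -> String {
    let shard = |s: &str| -> String {
        s.as_bytes()
            .chunks(2)
            .map(|pair| String::from_utf8_lossy(pair).to_string())
            .collect::<Vec<_>>()
            .join("/")
    };
    format!("libraries/{}/{}/{}", shard(author), shard(name), version)
}

fn main() {
    // Prints "libraries/wd/an/il/o/Sh/ap/es/1.0.0".
    println!("{}", sharded_path("wdanilo", "Shapes", "1.0.0"));
}
```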

There are also analyses that a subsequent shallow fetch after a shallow clone may have less-than-ideal performance - but it is hard to say if this would translate to our use-case exactly, it would require performing our own benchmarks as the analyses had slightly different circumstances.

Actually using the version control capabilities

As mentioned above, the original design requires a lot of hassle to use git but it does not actually use its version control abilities at all and we cannot easily extend that approach to use them - since we keep the library versions in separate directories they are completely separate entities to git and so there is no history-connection.

We can modify the design to take advantage of git’s history and the incremental updates.

That would require keeping each library version in the same directory, so different versions would not be in different directories but under different commits. The edition files could reference a library version by the commit in which it was added. Thus newer versions would be connected through git’s history, and when the user downloads updates to a newer edition, they could download only the deltas.

├─ libraries
│  └─ <username>
│     └─ <libname> // This entry is then versioned in git and different versions of the library lie in different revisions of the repository.
│              ├─ src
│              ├─ test
│              ├─ visualizations
│              ├─ config.yaml
│              └─ LICENSE.md

However this gives rise to other complications - we can have different projects that use different editions that reference different versions of the same library. But we can have only one ‘point of history’ of git checked out at the same time. And we don’t want to limit to having only one project at a given point in time. So we actually need to copy each library to some separate directory so that multiple library versions can coexist. But then git turns into just a download manager.

There are some advantages to that (we have the incremental updates), but the complexity is quite high, and also all the performance disadvantages listed in the previous section still stand (e.g. we still may need to download the whole directory tree (albeit without contents), and the issues with operations being linear in the number of libraries).

I don't think that the feature of incremental updates (because I think this is the main selling point here, or am I missing something?) is worth this additional complexity. Additionally for some libraries it is quite possible that a significant part of library's size will be some native components that will be stored on LFS and not processed incrementally anyway. Or put another way - a library that is mostly text files can be downloaded in full very quickly (regardless of being incremental or not) and a library is usually big in size due to big binary files - in both cases the incremental updates do not give too much of a benefit.

@radeusgd (Member)

A short summary after this long analysis:

It seems possible to implement the repository using git, but we don't get many advantages at the cost of quite high complexity (also good to remember that as noted earlier, we need to use git CLI instead of bindings) and having to rely on experimental features whose behaviour may change in time and whose performance is not really well understood nor predictable.

But we have a simple abstraction for file storage - any kind of storage system that exposes access to these files via HTTP(S) - and I think using that will be easier to implement, more stable, and more predictable in terms of performance (for example, we don't have to be afraid of scaling the S3 storage, or any other good alternative, with a growing number of libraries).

By using a standardised format of archives (as described in Blob storage with subarchives) or manifest files (which can be easily generated) we can have a standardised format of storing the libraries repository which can be used on any kind of storage system (be it S3, Azure, or even FTP access to an HTTP server).

@wdanilo (Member, Author)

wdanilo commented May 27, 2021

Yes, that is definitely something that we want to tackle at some point. But to limit the scope of the initial implementation we went with the simplest solution.

@MichaelMauderer I believe that allowing people to see how libs are implemented under the hood in read-only mode is crucial for learning the application, so this solution may be a little bit too limited even for the first shot. If we allowed people to browse libs in read-only mode, as @radeusgd suggested, that would let them debug / understand / learn much faster and better. I feel that displaying errors instead is a little too limiting here.


Also, I think the stable editions should be named 2022.2 (where the number after the dot is the number of the edition in the given year). Unstable can be named 2022-01-24.unstable (which contains the exact date).


Regarding the rest of the things, we will have a call tomorrow.

@radeusgd (Member)

One more thing that came up when I was thinking on refining the tasks is that we will need some mechanism for creating new editions. Shall we also have a task for a tool that will create a new edition semi-automatically?
In that case we'd need a policy for resolving which version should be present in the edition - by default it probably should be the latest version of each library, but what if library A is updated to a newer version but library B still depends on the older one? There are (at least) two ways to resolve this - ignore library B in the new edition altogether (and all others that depend on it) or use the older version of A to be able to keep B. I'm not sure how much of these decisions should be automated versus how much we want to have manual control over this, especially for more important libraries.

Depending on what we choose I guess there should be a task to create a tool for generating new (and probably also nightly?) editions or at least documentation explaining how to do this.

@iamrecursion (Contributor)

For dependency resolution you can consider using Z3.

@wdanilo (Member, Author)

wdanilo commented May 27, 2021

@iamrecursion that would be total overkill. We want it to be delivered in 8-12 weeks. The dependency mechanism described here (editions without per-package dependencies) would not need Z3 at all.

@radeusgd (Member)

Meeting summary

We will use a static-file based approach mostly similar to what was described in the Blob storage with subarchives section.

For downloading, we will rely only on the HTTP protocol (if possible, see possible exception below) so that various storages may be used under the hood.

Each library will have a path determined by its name and version: $ROOT/<lib-author>/<lib-name>/<lib-version>. Inside of it there will be a manifest listing available components, an always-present config.yaml/package.yaml, and the components (representing subdirectories of the package) packed as separate tar.gz archives.

For now we will download all of these components except for the tests component. In the future we want to extend this filtering logic to allow for os-specific components so that any binary dependencies that are needed are only downloaded for the needed OS+arch combination instead of downloading all of them.

(Not directly discussed but seems like a logical conclusion) Another exception would be the meta folder which will also not be packaged so that the package browser can easily download the icon or image for the preview when displaying the library.
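
A sketch of the resulting install flow under this layout. The URL scheme, the plain-text manifest format, and the use of the reqwest / flate2 / tar crates are assumptions for illustration:

```rust
// Sketch: fetch the library manifest, then download and unpack every
// component except `test` into the local libraries directory.
use flate2::read::GzDecoder;
use std::path::Path;
use tar::Archive;

type Error = Box<dyn std::error::Error>;

fn install_library(
    root: &str,
    author: &str,
    name: &str,
    version: &str,
    dest: &Path,
) -> Result<(), Error> {
    let base = format!("{root}/{author}/{name}/{version}");
    // Assume the manifest is a plain text file listing one component per line.
    let manifest = reqwest::blocking::get(format!("{base}/manifest"))?
        .error_for_status()?
        .text()?;
    for component in manifest.lines().filter(|c| !c.is_empty() && *c != "test") {
        let response = reqwest::blocking::get(format!("{base}/{component}.tar.gz"))?
            .error_for_status()?;
        // Stream-decompress and unpack the component into the target directory.
        Archive::new(GzDecoder::new(response)).unpack(dest)?;
    }
    Ok(())
}

fn main() -> Result<(), Error> {
    install_library(
        "https://example.com/libraries", // assumed repository root
        "wdanilo",
        "Shapes",
        "1.0.0",
        Path::new("./enso/libraries/wdanilo/Shapes/1.0.0"),
    )
}
```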

Edition files will be stored as plain files, each edition being uniquely identified (nightly with its date or stable with its version number).

We need to create the following tools:

  • The package-manager CLI, used for installing packages and as a client for publishing
  • A server for serving the packages (to be used by clients wanting to set up a local repository; we will use S3) - likely it can just be a simple HTTP server with nothing else (but see the section below)
  • A server handling package publishing - one that will be run in our AWS cloud and can also be run by clients setting up local repos. It can either use no authentication or use an identification service provided as part of Enso Cloud.

The identification service connected with Enso Cloud will need to provide:

  • A way to log in from the IDE or CLI and prove to the publishing server that the user owns the username under which they are publishing a library.
  • A way to create a new account and assign a fresh username when the user does not yet have an account in Enso Cloud (so that one can register from the IDE).

When a new library is created (mainly in the IDE, for example when extracting a piece of code), the user should already be logged in to their Enso Cloud account. If they are not, this is the moment when the log-in should take place. That is because the newly created library will contain the username as part of its import path, so we need to know it in advance to avoid renaming.

For now the edition files will be created manually.

We want to be secure against at least basic DoS attacks (like a malevolent actor repeatedly downloading lots of libraries to use up our bandwidth). One way to do that would be to require being logged in to download libraries, keep a log of how much bandwidth a single user has used, and set a daily limit to some reasonable value. TODO: this requires research into how we could integrate such a solution (authentication + logging used bandwidth) with the S3 backend.


One thing I slightly missed, and which we may need to discuss further at some point, is how the static storage will be integrated with the marketplace website. The marketplace can definitely load the edition files to know which packages are available in a given edition, but how should searchability work? Should it somehow index the metadata from each package's config.yaml?
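
One possible shape for that indexing, purely as a sketch: the website backend walks the edition file, pulls name/description/tags out of each package's config.yaml (the PackageMetadata fields below are an assumption about what that config contains), and builds a naive inverted index for search.

// Hypothetical metadata assumed to be extracted from a package's config.yaml.
case class PackageMetadata(name: String, description: String, tags: Seq[String])

object MarketplaceIndex {
  /** Builds a naive inverted index: lower-cased token -> packages mentioning it. */
  def build(packages: Seq[PackageMetadata]): Map[String, Seq[PackageMetadata]] = {
    val entries = for {
      pkg   <- packages
      token <- (pkg.name +: pkg.tags) ++ pkg.description.split("\\s+")
    } yield token.toLowerCase -> pkg
    entries.groupBy(_._1).map { case (token, hits) => token -> hits.map(_._2).distinct }
  }

  def search(index: Map[String, Seq[PackageMetadata]], query: String): Seq[PackageMetadata] =
    index.getOrElse(query.toLowerCase, Seq.empty)
}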

@iamrecursion
Contributor

Has there been any thought given to using TLS on the connection? We don't want to allow people to MITM the users of the marketplace.

@wdanilo
Member Author

wdanilo commented May 27, 2021

@iamrecursion no thoughts on that. But if this is a real threat, we should consider it.

@iamrecursion
Contributor

We want to be secure against at least basic DoS attacks (like a malevolent actor repeatedly downloading lots of libraries to use up our bandwidth). One way to do that would be to require being logged in to download libraries, keep a log of how much bandwidth a single user has used, and set a daily limit to some reasonable value.

Just a thought: I don't know of a single package repository for another language that requires users to be authenticated to download packages. Most don't rate-limit at all, but those that do seem to do it based on IP.

Uploading packages is a different matter and does, of course, require users to be authenticated.

@radeusgd
Member

DoS Protection

S3 Authentication

This SO answer explains the landscape for S3 access control quite well. S3 access can be granted to IAM Roles/Users, but that is not suitable for our use case, as IAM is intended for developers/staff, not for end users.

There is also another mechanism - pre-signed URLs: an application that has permission to access the bucket can generate a one-time URL, valid for a limited time, which can then be used by a user to access S3. However, for this to work we would need to implement some kind of proxy service that checks the auth status, updates the used bandwidth, and, if allowed, redirects to such a pre-signed URL. This application would need to be a Lambda or some service running on EC2 (but the latter may be worse in terms of scalability).

There may be some issues with tracking the real bandwidth usage: once we give the user the pre-signed URL, they can use it multiple times within the expiry window (using more bandwidth than we expect), or the download may be interrupted (so the user used less than expected, possibly leading to false positives if the bandwidth threshold is not set high enough). It may be possible to do this better by inspecting S3 logs, but that increases the complexity of the solution significantly.
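
For reference, a rough sketch of how such a proxy could mint a pre-signed URL using the AWS SDK for Java (v1) from Scala. The per-user quota check wrapped around it is entirely hypothetical; only the pre-signing call reflects the actual SDK.

import java.util.Date
import com.amazonaws.HttpMethod
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest

object DownloadProxy {
  private val s3 = AmazonS3ClientBuilder.defaultClient()

  /** Returns a short-lived download URL if the (hypothetical) per-user quota allows it. */
  def authorizeDownload(user: String, bucket: String, key: String,
                        quotaAllows: String => Boolean): Option[java.net.URL] =
    if (!quotaAllows(user)) None
    else {
      val expiration = new Date(System.currentTimeMillis() + 5 * 60 * 1000) // valid for 5 minutes
      val request = new GeneratePresignedUrlRequest(bucket, key)
        .withMethod(HttpMethod.GET)
        .withExpiration(expiration)
      Some(s3.generatePresignedUrl(request))
    }
}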

Other approaches

As noted by @iamrecursion, it is uncommon for public package repositories to require authentication to download packages, which may make our ecosystem look more closed than it really is.

Most systems seem to tackle this issue by using CDNs (which we should likely use anyway for better performance) and some kind of rate-limiting protection.

AWS has the Shield Advanced service, which also includes the Web Application Firewall (WAF) and allows for some rate limiting. It will, however, not protect against someone performing a slightly slower attack distributed over a longer timeframe.

I don't have enough expertise in this area to evaluate this properly, so I think it would be good to discuss it with someone more knowledgeable about cloud deployments.

@radeusgd
Member

After a discussion with @wdanilo we decided that it actually makes sense not to ship the package manager as a separate binary. Instead it will be a Scala library which can then be used by the launcher (which will handle its CLI interface) and by the project-manager/language-server. This will greatly simplify the integration: instead of having to wrap a CLI command and parse JSON responses, we will be able to just use a native API.

The motivations for changing that decision were the following:

  • having a separate package-manager binary would complicate handling versions, as we would need a mechanism for updating this binary (now we can just reuse the launcher's update system, whereas previously we would have had to extend it in a non-trivial way to handle such plugins);
  • we realised that this component will not necessarily need to be ported to Rust - the move of parts of the backend to Rust is being made mostly for performance reasons, and the package manager is not a bottleneck (downloading packages dominates its operations' time anyway). When we move parts of the backend to Rust, we can just wrap the package-manager library with a machine-friendly CLI and call it as an external program (much in the same way as we were planning to do with the previous solution, except that instead of a Scala application calling a Rust binary, it will be a Rust app calling the Scala package manager).
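
To illustrate what "just use a native API" could look like, here is a sketch of the kind of Scala interface the launcher and language server might share. The names and signatures are illustrative, not the final API.

import java.nio.file.Path
import scala.concurrent.Future

// A sketch of a shared package-manager interface; names are invented for illustration.
trait PackageManager {
  /** Ensures the given library version is present in the local repository,
    * downloading it (and its dependencies) if needed; returns the installation path.
    */
  def install(author: String, name: String, version: String): Future[Path]

  /** Lists (name, version) pairs of libraries available in the local, editable repository. */
  def listLocalLibraries(): Seq[(String, String)]

  /** Packages a local library and uploads it using the provided auth token. */
  def publish(libraryRoot: Path, authToken: String): Future[Unit]
}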

@radeusgd
Member

I would also like to clarify the design decision to not include an explicit list of dependencies but instead rely on imports only.

From the UX perspective it seems simpler, but it can complicate situations where the user wants to learn the dependencies of a project - they cannot just check a single file but need to go through the imports. We can alleviate this by providing a CLI helper command that lists the dependencies, although it may be slow for big projects as it needs to parse every file.

Another, albeit less problematic, issue is that when installing a library A we also need to install its transitive dependencies. Without any metadata this involves the following process: download and extract library A, parse all of its files to learn that it depends on B and C, download B and C, parse their files to learn their dependencies, and so on. This is slower (we probably don't want to optimize this now, but in the future we may want to parallelize downloading of dependencies, and not knowing about all the dependencies up-front will hinder that parallelization: further downloads must wait while a dependency is being extracted and parsed, because before that we do not know about them, so the internet connection sits unused during extraction). It also makes it harder to estimate download progress - if we do not know how many dependencies there are to download, we cannot display a good progress bar.

This, however, can be fixed without changing the design: on the user level we keep inferring the dependencies from imports alone, but when creating the package for upload (which will be done by the package-manager component), we gather all the dependencies and list them inside the metadata/manifest file. This way, when downloading, we can rely on this data to provide better progress estimations and, in the future, optimize download speed.
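
A minimal sketch of that publish-time step in Scala. The import-extraction function is assumed to exist elsewhere (it would use the real parser) and is just passed in as a parameter here; the .enso extension is assumed to identify source files.

import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object ManifestDependencies {
  /** Walks all .enso sources of a library and collects the libraries they import.
    * `importedLibraries` stands in for the real parser-based import extraction.
    */
  def gather(libraryRoot: Path, importedLibraries: String => Set[String]): Set[String] = {
    val sources = Files.walk(libraryRoot).iterator().asScala
      .filter(path => path.toString.endsWith(".enso"))
    sources.flatMap(path => importedLibraries(Files.readString(path))).toSet
  }
}

The resulting set would then be written into the manifest next to the component list, so that a downloader never has to extract and parse a package just to learn what to fetch next.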

@radeusgd
Member

radeusgd commented Jun 4, 2021

To make sure that our designs are complete, I'm posting a few example workflows that show how all parts of the system may be interacting.

Extracting nodes to a library

  1. The user selects a set of nodes to extract and chooses the ‘extract to library’ option
    1. Question: Does the extraction work on a set of nodes and create a new function, or is it possible to extract an existing function (as these are two different kinds of logical operations: ‘extract/abstract’ vs ‘move’)?
    2. The user chooses ‘Extract to a new library’
      1. The IDE ensures that the user is logged in (if not, a sign-in view should open, also allowing them to sign up) - this ensures that we know the prefix under which to create the library
      2. The IDE allows the user to type a name for the library
        1. It must not allow creating a library with a name that already exists among local libraries (this requires the capability to check whether a library exists, or to list existing/taken names).
      3. IDE sends a request to the language-server to create a new library with a given name
      4. The Language Server creates a new project/package in the local library directory with the given name
    3. The user chooses ‘Extract to an existing library’
      1. The IDE asks the LS for a list of existing local libraries and displays it for the user to choose from
      2. The user selects one of these
    4. (in practice, from the UX perspective, these two options may be consolidated; they are split here just because they differ in logic)
  2. At this point we have a selected library to which the nodes should be extracted (it may have just been created or existed earlier).
  3. The IDE opens (text/openFile) the Main file of that library (Assumption: all extractions are added to the main file; if not, we need some dialog to choose where the extraction should be placed).
  4. Question: How should the name for the extracted function be chosen? Should this be a dialog? Or auto-generated?
    1. If it is chosen by the user, the IDE needs to verify that the name is unique within the target file.
  5. The IDE prepares a piece of code based on the selected nodes and sends the patch (text/applyEdit).
  6. (If the extracted piece of code was already a function and was being used in the current file or other project files) The IDE needs to find references to this existing function.
    1. Initially as we discussed this can be done by finding direct references to this function (e.g. here.foo, ModuleName.foo etc.)
    2. Question: Does the IDE need some additional capabilities from the Language Server to be able to provide this?
  7. The IDE should prepare and apply the following set of edits (text/applyEdit):
    1. Add an import of the extraction target module (unless it was already present)
    2. (if applicable) replace references to the extracted function with references to the function in the extraction target module.
    3. (if applicable) Repeat this for all other files to change the references to the extracted function.
    4. Remove the extracted function from the main file.
    5. Ideally, all of these edits should be applied atomically (in each file, not across files though)
    6. The currently edited file should be updated last, to ensure that any intermediate state is consistent - modifying it earlier would break references to the removed/extracted function before they were updated to point to the new place (a small ordering sketch follows after this list).
  8. If the extraction target module was not yet imported, the edits above will trigger the ImportResolver to import the module; if this is the first import from that library, the PackageRepository will load that package.
    1. The package is loaded and compiled.
    2. The PackageRepository should also notify the Language Server that a new library is being added; the Language Server will then forward this notification to the IDE in the form of a file/rootAdded notification (extended with additional metadata like a human-readable name and a read-only flag).
    3. This will make the IDE add the library to the project tree (under something like ‘External libraries’).
  9. If the extraction target was already loaded, the text/applyEdit sent above should trigger a recompilation of that module and any dependencies (this is already implemented and should not require additional adaptation).
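
Regarding the ordering in steps 7.5-7.6 above, a tiny Scala sketch of how the IDE could sequence the per-file edit batches so that the currently edited file is patched last. The types are invented for illustration; the real IDE has its own edit representation.

// Illustrative edit batch: one entry per file, applied atomically within that file.
case class FileEdit(file: String, edits: Seq[String])

object ExtractionEdits {
  /** Orders per-file edit batches so the currently open file is patched last,
    * keeping every intermediate state consistent.
    */
  def ordered(all: Seq[FileEdit], currentlyEditedFile: String): Seq[FileEdit] = {
    val (current, others) = all.partition(_.file == currentlyEditedFile)
    others ++ current
  }
}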

@radeusgd
Member

radeusgd commented Jun 4, 2021

Signing-in

  1. The user clicks the sign-in button, or the sign-in is requested by another process (see, for example, creating a library).
  2. A webview is opened that allows the user to sign in using the Enso Cloud Identity Provider, which currently uses GitHub for authentication.
    1. The webview should also allow signing up if the user does not have an account yet.
  3. Once the user is signed in, an authorization token and username are passed to the IDE and saved somewhere. The IDE then displays the username somewhere and allows signing out.

Adding a library from the marketplace to the project

  1. The user opens the marketplace webview which presents our marketplace website.
    1. TODO: we should probably add some detail on what this website should provide
    2. IIRC it should allow searching by package name
  2. The user selects a package, which displays its full description
  3. The user clicks ‘Add to the project’ which notifies the IDE
    1. The IDE needs to be able to communicate with the webview somehow.
  4. The IDE sends a request to the Language Server to pre-install the selected library.
    1. If it was already installed, it succeeds immediately.
    2. If it was not installed, the downloader downloads the manifest to find the list of dependencies and recursively discovers all transitive dependencies (see the sketch after this list).
    3. Knowing how many dependencies there are to install, we can report installation progress: both the download progress of each individual dependency and the overall installation progress (we can re-use the current taskProgress API and simply provide separate progress updates for these processes; the IDE then needs to correctly detect ‘which is which’ and display the progress bars).
    4. Once the download is complete, all libraries are extracted to the installed-libraries-repository and success is reported.
  5. Once the IDE gets a response from the LS that the library is installed, it sends a text/applyEdit message adding an import of the main module of the installed library: import Library_Author.Library_Name.
    1. This is done only after confirming that the download has finished, because otherwise the import itself would start the download, which would freeze the compilation process, so any other modifications would have to wait for it to finish. Instead, since the pre-download happens in the background, the program can be edited and recomputed normally while the download is in progress.
  6. Adding the new import causes the ImportResolver to ask the PackageRepository to load the new library. Since all dependencies have already been downloaded, it loads quickly.
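
A sketch of the transitive-dependency discovery from step 4, in Scala. Fetching and parsing a manifest is represented by a placeholder function; the point is that the full set of downloads is known before any package is extracted, which is what makes an accurate progress bar possible.

import scala.annotation.tailrec

object DependencyDiscovery {
  /** Breadth-first discovery of all transitive dependencies using only manifests.
    * `manifestDependencies` stands in for downloading and parsing a manifest.
    */
  def transitiveDependencies(root: String, manifestDependencies: String => Set[String]): Set[String] = {
    @tailrec
    def go(toVisit: List[String], seen: Set[String]): Set[String] = toVisit match {
      case Nil => seen
      case lib :: rest =>
        val newDeps = manifestDependencies(lib).diff(seen)
        go(rest ++ newDeps, seen ++ newDeps)
    }
    go(List(root), Set(root))
  }
}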

Adding a local library to the project

If I understand correctly, this is currently done by manually adding the import. Or do we want some kind of dialog box that displays the available local libraries? This would not require additional support from the backend, as we will already provide an endpoint to list local libraries, and adding imports can use the same logic as adding the import for remote libraries.

Publishing a library

  1. The user selects a library (probably in the directory tree under External dependencies or some other list of local libraries) and clicks ‘Publish’.
  2. The IDE saves all open files to include any changes.
  3. If the user is not signed-in, a sign-in process should be initiated first.
  4. Once the user is signed in, the IDE verifies that the username matches the username within the library.
    1. This catches the edge case that the library was created under a different account and later the user switched accounts before publishing.
    2. If the usernames do not match, the user is asked to re-log into the correct account.
    3. In the future we could ask the user whether they want to refactor the library, changing its name, but I think this shouldn’t be done right now (Question: what do you think?)
  5. The IDE sends a request to the Language Server which uses the Package Manager component to create a compressed package of all the files in the selected library.
  6. The Language Server asks the IDE for the authentication token.
  7. The Package Manager authenticates with the upload service and uploads the file.
    1. This may fail if the library was already uploaded with the given version.
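
As a sketch of step 5, assuming something like Apache Commons Compress is used for the archive (the real Package Manager component may of course do this differently): pack every file of the library into a single tar.gz for upload.

import java.io.BufferedOutputStream
import java.nio.file.{Files, Path}
import org.apache.commons.compress.archivers.tar.{TarArchiveEntry, TarArchiveOutputStream}
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream
import scala.jdk.CollectionConverters._
import scala.util.Using

object LibraryPackager {
  /** Packs all regular files of the library into a single tar.gz archive for upload. */
  def pack(libraryRoot: Path, target: Path): Unit =
    Using.resource(
      new TarArchiveOutputStream(
        new GzipCompressorOutputStream(new BufferedOutputStream(Files.newOutputStream(target))))) { tar =>
      Files.walk(libraryRoot).iterator().asScala
        .filter(path => Files.isRegularFile(path))
        .foreach { file =>
          val entry = new TarArchiveEntry(file.toFile, libraryRoot.relativize(file).toString)
          tar.putArchiveEntry(entry)
          Files.copy(file, tar)
          tar.closeArchiveEntry()
        }
    }
}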

@radeusgd
Member

radeusgd commented Jun 4, 2021

I would like to ask @wdanilo and @MichaelMauderer to have a look at these example workflows described above to see if they match the expectations and are clearly described.

Also while describing them I've noticed an issue with local library versioning that we need to clarify.

Essentially when a local library is created it will get some initial version, probably 0.1.0.

Afterwards the user can work on it, use it in some of their projects, and publish it.

Later, they may want to improve it or fix bugs and will want to publish an update, so we need to support 'bumping' the version number. One part of that is some kind of UI which allows selecting a library and changing its version number before publishing (as publishing with the same version number will simply fail).

However, this opens another question: what should happen when the library version is changed? Should this change the version of the currently opened library, or create a copy with a bumped version? If we were to choose the latter, we would need to allow the local libraries repository to support multiple versions of the same library (note that I'm not speaking about the repository containing locally downloaded copies of published libraries, but the location where locally created, editable libraries are gathered). This complicates the situation, because the 'override local libraries' switch that we wanted would no longer be enough; we would also need a way to select versions of the local libraries. In my opinion this is an unnecessary complication, so I would suggest that this just edits the library version in-place.

If any projects depended on the older version of the library, they can just switch to the older, already published version via a usual override.

@radeusgd
Member

radeusgd commented Jun 4, 2021

One more feature I'd like to describe in more detail is library resolution settings.

The main setting the user can change is the base edition that the project is based on.

Moreover, we have the toggle 'use local libraries over published ones'.

We also have a list of version overrides, which can be added or modified; each consists of a library name and either a version override and repository override, or just the 'use local version' override (since the local libraries repository can contain only a single version of a given library, there is no sense in specifying a version there).

I also suggest that the project should not specify its edition at all within package.yaml. Instead, every project should be required to also have a file edition.yaml next to package.yaml which contains the edition configuration. By default this file will consist of just a single line which says: derive: 2021.4 (or some other edition). Having the file separate will allow for much easier editing of edition settings (as this local file can simply be modified).

An alternative is to have an edition: 2021.4 field inside package.yaml, and if any custom modifications are made, the edition.yaml file should be created and the field inside package.yaml should be changed to edition: "./edition.yaml".

The latter solution, however, leads to some edge cases - for example, what if package.yaml is set to some 'global' edition, so edition.yaml should be created when overriding custom settings, but an edition.yaml file already exists? Should it be used as-is and just modified (thus possibly also enabling some other settings unexpectedly?) or should it be overwritten?

Because of these edge cases, I think just requiring the edition.yaml file for each project will be much simpler and more consistent. What do you think about this @MichaelMauderer @wdanilo?

@MichaelMauderer
Contributor

Question: Does the extraction work on a set of nodes and create a new function, or is it possible to extract an existing function (as these are two different kinds of logical operations: ‘extract/abstract’ vs ‘move’)?

I would say it should be "move". If you want to create a function, you can collapse the nodes first and then move the resulting node.
Maybe later we could add a "collapse+move" action.

Question: How should the name for the extracted function be chosen? Should this be a dialog? Or auto-generated?

Do we need a new name or can we take the source name 1:1 here?

(If the extracted piece of code was already a function and was being used in the current file or other project files) The IDE needs to find references to this existing function.

This is in the "for later" category. Right now, we will not do this. Later on this will probably need refactoring support from the language server.

In the future we could ask the user whether they want to refactor the library, changing its name, but I think this shouldn’t be done right now (Question: what do you think?)

Agree, this is for later.

@iamrecursion
Contributor

Because of these edge cases, I think just requiring the edition.yaml file for each project will be much simpler and more consistent. What do you think about this @MichaelMauderer @wdanilo?

You can also do a "section" in the package.yaml:

edition: 2021.4

Could easily become:

edition:
  version: 2021.4
  extra-deps:
    ...

@radeusgd
Member

radeusgd commented Jun 8, 2021

Question: How should the name for the extracted function be chosen? Should this be a dialog? Or auto-generated?

Do we need a new name or can we take the source name 1:1 here?

I think taking the source name by default sounds sensible, but it can happen that a function with that name is already present in the target library, so we need some way to resolve conflicting names, to avoid adding code that causes a redefinition (which IIRC is a compile error).
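
A trivial Scala sketch of one way to resolve such a conflict, by appending a numeric suffix (purely illustrative):

object NameConflicts {
  /** Returns `base` if it is free, otherwise base_1, base_2, ... until a free name is found. */
  def freshName(base: String, existing: Set[String]): String =
    if (!existing.contains(base)) base
    else Iterator.from(1).map(i => s"${base}_$i").find(n => !existing.contains(n)).get
}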

You can also do a "section" in the package.yaml:

I think this sounds good, although then I'd suggest always making it a section, i.e. it should always have a format like:

edition:
   extending: 2021.4

Otherwise we would need extra logic to handle the edge case of 'collapsing' the section when the version is the only setting, which in my opinion adds unnecessary complexity. Or do you think it is worth it because the shorter version is significantly more readable/understandable?

@radeusgd
Member

radeusgd commented Jun 8, 2021

(If the extracted piece of code was already a function and was being used in the current file or other project files) The IDE needs to find references to this existing function.

This is in the "for later" category. Right now, we will not do this. Later on this will probably need refactoring support from the language server.

Understandable, as that is rather complex. But then we need to remember that we probably shouldn't remove the 'original' function after extraction. If we were able to do the proper refactoring, it would make sense to delete the original function (as it was moved, not copied, to the library). But if we are not doing the refactoring, removing it would break existing code, and we probably don't want that - if the user wants to change the references manually, they can also manually delete the old function.

@wdanilo wdanilo closed this as completed Apr 14, 2022