Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Setup monorepo with lerna, account library integration #561

Closed
holgerd77 opened this issue Jul 25, 2019 · 35 comments · Fixed by #666
Closed

Setup monorepo with lerna, account library integration #561

holgerd77 opened this issue Jul 25, 2019 · 35 comments · Fixed by #666

Comments

@holgerd77
Copy link
Member

Being in discussion for some time there is now an agreement to integrate the following libraries here by setting up a monorepo with lerna similar to the setup in the ethereumjs-config:

We'll start with integrating the account library due to its small size, low direct usage level and proximity to the VM logic, integration of the other libraries should be handled as a separate issue.

This has to go along with an update on the ethereumjs-config library (libraries) to be able to distribute the respective configurations to a monorepo-style consuming repository. This effort shouldn't be underestimated and will likely be as much work as the direct integration here.

@s1na
Copy link
Contributor

s1na commented Nov 20, 2019

With the stable Istanbul being released, I think it's time to re-start this conversation. I had two specific discussion-points in mind, namely:

  1. Do we keep the library names as is, or rename and use the scoped naming, e.g. @ethereumjs/account
  2. Do we use different or the same version for all modules in the monorepo?

Re using the same version: I think this has some advantages. To achieve it we have 2 options as I see it. If we rename libraries we can safely reset the version. If we don't rename, we can upgrade all libraries to the biggest version among them (e.g. if VM has v5.x and all other v2.x, we upgrade all to v6)

@holgerd77
Copy link
Member Author

Yes, I also thought that it would be very much the time for the mono repo implementation. We can also very much do this step-by-step and on a first round e.g. keep everything "as is" so the VM integrating all the modules it currently does and just setup the outer mono-repo structure and maybe do a first bug fix release for testing the new structure (just a first thought, not sure if exactly this way makes sense, but generally: we can take on this one step after the other).

@s1na on your questions:

  1. I would have very much the tendency to take on the occasion and switch to a scoped naming here, this would generally make very much sense for the whole EthereumJS org libraries (over time)

  2. I lack very much the "feeling from experience" here. I would assume the biggest disadvantage here would be that one is enforcing e.g. major versions on all mono repo-associated libraries, even if there is no or just a subtle change in the library itself. This would very much affect single users of the respective library, since they eventually won't get a bug fix which would be otherwise be distributed by a separate bug fix release. Now people would be enforced to manually do the major version upgrade. How would you respond to this criticism?

@s1na
Copy link
Contributor

s1na commented Nov 25, 2019

I would assume the biggest disadvantage here would be that one is enforcing e.g. major versions on all mono repo-associated libraries

This would definitely be the downside. The only approach by which we can avoid it is to both keep the names (i.e. no scoped renaming) and the versions. I guess it comes down to a pro/con discussion. To bring a pro for using the same version I'd refer to the discussion in ethereumjs/ethereumjs-block#74.

@alcuadrado curious to hear your thoughts as well.

@evertonfraga
Copy link
Contributor

evertonfraga commented Nov 26, 2019

For information and reference, I made a file structure (+ simplified dependency tree) of the monorepo proposal, as I get it:

image

Edit: Added ethereumjs-common to the diagram

@evertonfraga
Copy link
Contributor

evertonfraga commented Nov 27, 2019

I'd like to add one more topic for discussion: what should be the ideal test structure for the monorepo?

Here's a brief analysis of the current situation:

platform # of jobs latest job duration
ethereumjs-vm circleci 11 13m 13s
ethereumjs-blockchain travis 5 3m 6s
ethereumjs-block travis 9 4m 12s
ethereumjs-tx travis 9 3m 46s
ethereumjs-account travis 5 3m 31s
ethereumjs-common travis 4 1m 45s

Both CI platforms that we use still don't have a tailored support for monorepos. I see some approaches:

1. Build all 43 jobs, for every commit

This is the most straightforward way to go, but it is also resource-heavy. It strictly depends on a robust parallelization plan on the CI side, otherwise, we'd have an unpleasant job queue to cope with.

  • find out what's our current plan capacity.

2. Only trigger jobs for packages that had changes (👎)

In this scenario it would detect which package got changes and run only their jobs. That can be accomplished with a single test runner that uses a combo of lerna diff and CircleCI API to trigger a job.

The big no for me is that the tests would still be as siloed as they are today, where you need to release one package to see it failing on a downstream test.

3. Run tests by affected package

We'd use their explicit dependency relation to determine which jobs need to run. I still don't know how elegant this solution would be on the CI side, but I'd be happy to explore this path more.

How to read the image:

  1. Changes to vm should run their own 11 jobs
  2. Changes to block should run block (9), blockchain (5) and vm (11) = 25 total
  3. Changes to both blockchain and accounts should run their own tests + vm

In any case, it seems primordial to migrate them over to circleci.

Edit: Added ethereumjs-common to the table and diagram

@s1na
Copy link
Contributor

s1na commented Nov 27, 2019

Great analysis @evertonfraga, and a good point!

I'd also go either with 1 or 3. But yeah as you mentioned as it stands 1 doesn't seem very feasible. I do think however it might be possible to reduce the runtime of some of the tests, making option 1 more feasible. E.g. #536 was a low-hanging fruit I found a while ago which improved runtime quite a bit.

I think it'd also make things a lot simpler if the projects were using a similar toolchain (and CI). This should be a good preparatory first step. ethereumjs-account can already started to be integrated here though I think.

@ryanio
Copy link
Contributor

ryanio commented Dec 3, 2019

I like the idea of doing lerna in independent mode and discovered semantic-release.

@holgerd77
Copy link
Member Author

holgerd77 commented Dec 3, 2019

Ryan brought this up and so I just started thinking about further library integration here (also inspired by the visualization done above by @evertonfraga): in past discussions we had some broader consensus to not include the merkle-patricia-library here, I would also be skeptical about the ethereumjs-util library. However I think the Common and Testing libraries might be relatively indisputable candidates (Update: Testing shouldn't be included due to repo size, see comment here)?

Another topic: without further reflection on pros and cons on this I would also have a tendency to independent mode as suggested by @ryanio.

@ryanio
Copy link
Contributor

ryanio commented Dec 3, 2019

In a monorepo world it might make sense to include util so everything remains under the same umbrella and we receive the full benefits of monorepo-ing.

Would it be too much to nest MPT under util?

@alcuadrado
Copy link
Member

I'm really excited that this migration is happening 🤩

  1. Do we keep the library names as is, or rename and use the scoped naming, e.g. @ethereumjs/account

Renaming the packages will prevent users to get updates. If I'm using, for example, ethereumjs-tx and bug fixes start to be published as @ethereumjs/tx, I won't get them.

This would only happen once, so it may not be that important.

  1. Do we use different or the same version for all modules in the monorepo?

I think I'd use different versions with independent mode, as @ryanio suggested.

@holgerd77's argument about unnecessary major versions and bug fixes is similar to what I just wrote about renaming the packages, but it would be much more frequent so way more important. That alone is enough to use independent mode IMO.

I understand your motivation and share your concern, @s1na . Keeping separate versions will probably lead to situations like ethereumjs/ethereumjs-block#74 again.

I believe the root of the problem is that npm dependencies are handled incorrectly in most of the libraries. Things like -account, -block, -blockchain, -tx, bn.js, and possibly others should be peer dependencies most of the time. I really like this article that explains when peer dependencies are needed. I think we all had a short conversation about this in Osaka.

Changing the libs to peer dependencies would be a big breaking change, so I'd keep it outside of the scope of the migration.

Great job @evertonfraga ! I think option 3 is the most interesting, but it may be hard to implement. @ryanio's PR using Github Actions on the -blockchain repo shows some very promising performance improvements, so I wonder if using actions to run all jobs in the monorepo on every PR won't be performant enough.

@alcuadrado
Copy link
Member

We can also very much do this step-by-step and on a first round e.g. keep everything "as is" so the VM integrating all the modules it currently does and just setup the outer mono-repo structure and maybe do a first bug fix release for testing the new structure (just a first thought, not sure if exactly this way makes sense, but generally: we can take on this one step after the other).

Also, I think this is a very good way to approach this. There are lots of things to migrate and config, so doing it gradually makes sense to me.

@holgerd77
Copy link
Member Author

holgerd77 commented Dec 4, 2019

Ok just a summary on how the consensus seem to form on how we proceed here (correct me or continue the discussion if you have strong opinions against any of the points):

  1. We use independent mode as Lerna setup and keep on with current version numbers
  2. We do the renaming to scoped packages like @ethereumjs/vm
  3. We use GitHub actions as a unified CI tool (is this already consensus?)
  4. We do a gradual step-by-step integration, first PR on this would be as described above (only VM, outer structure)
  5. For testing in doubt we can start with 1. (build all jobs) to keep things simple and expand on this later on further package inclusion
  6. Rename all git tags to avoid naming conflicts

Anything I forgot (feel free to directly edit the post)? 😀

@evertonfraga
Copy link
Contributor

evertonfraga commented Dec 4, 2019

My assessment to Github Actions: 👍

Pros

  1. Performance really seems to be superior.
  2. Actions (or jobs) in separate files is beneficial to a monorepo structure.
  3. It has a marketplace with a growing community, so reusing bits is super easy.
  4. It goes beyond the traditional workflow of (git push, run job). It can interact with PRs, and do a lot of things.
  5. last, but not least, it is integrated to Github and the UI is more responsive than Travis and CircleCI.

Concerns and "non-concerns"

  1. The ethereum org is still on the legacy billing structure, so they currently face a challenge there to use Actions. As our repos mostly live under EthereumJS, can we take advantage of the free builds for GH Actions?
  2. I'm not too concerned with vendor lock-in, as there's a low cost to move over to any other CI tool.

@alcuadrado
Copy link
Member

I wonder if the cache limits of github actions won't be a problem. They cache up to 2gb per repo, and node_modules can be huge. Mor info here

@ryanio
Copy link
Contributor

ryanio commented Dec 4, 2019

Thanks @alcuadrado, that is indeed an important aspect to keep in mind for the evaluation.

I found this thread: actions/cache#6

It has some interesting discussion, including one user seeing success with running a 700mb node_modules and the cache compressing the archive. I’ll have to give the thread another closer look in relation to our specific needs to extract more conclusions.

I can try to investigate these limits further and if we may find them constraining.

2GB per repo sounds extremely low for a monorepo use case like ours, especially with [vm, account, block, blockchain, tx] all running their own checks in matrix across 4 node versions on every PR push.

From the top of your head, which ci steps / tests / builds / outputs / logs do you find being on the larger side?

I also found Persisting workflow data using artifacts.

@ryanio
Copy link
Contributor

ryanio commented Dec 4, 2019

As a separate point to add to the list of considerations, how do we feel about semantic-release? I think it would be nice to have a fully automated release process.

https://github.com/atlassian/lerna-semantic-release

@holgerd77
Copy link
Member Author

@ryanio Can we have this as a separate topic, eventually just open a new issue on that? This can be discussed relatively separated from this issue (and also might need a bit longer and thoughtful discussion) and there is a bit the danger to overload the discussion here.

@ryanio
Copy link
Contributor

ryanio commented Dec 4, 2019

@holgerd77 Yes great point, have started it here: #625

@evertonfraga
Copy link
Contributor

evertonfraga commented Dec 4, 2019

As a hands-on way to understand the requirements to the migration, I started to make this shell script monorepify.sh.

This can be useful to rehearse the migration and get valuable information ahead of time (eg: would GitHub actions work with X jobs? What about their cache limit?). Also, this automation tool could be in fact used to perform the changes we need 😛.

The migration can be run for vm repo only (just comment line 11, might have some quirks), with any combination of the other candidates to the monorepo (edit line 11), or all of them (just run).

Roadmap

Edit: see this project for related tasks: https://github.com/orgs/ethereumjs/projects/1

I'll use my own fork as a playground, and you can inspect it here: https://github.com/evertonfraga/ethereumjs-vm/

@evertonfraga
Copy link
Contributor

Good news: GitHub actions make it trivial to implement my proposed solution (3). We’ll have to declare some exclusion paths to each workflow file, so the irrelevant builds are not ran for each push.

https://help.github.com/en/actions/automating-your-workflow-with-github-actions/workflow-syntax-for-github-actions#example-ignoring-paths

@alcuadrado
Copy link
Member

Nice script @evertonfraga! You may want to install lerna in one of its first steps, as it has this very handy command https://github.com/lerna/lerna/tree/master/commands/import

Should we create a new repo for the monorepo? This way we can leave all the different library-specific repos as they are now and archive them.

@evertonfraga
Copy link
Contributor

Thanks, @alcuadrado! I was uncertain at first to use an importer like lerna's, as it rewrites all commits as if the files were always under the new file structure. But maybe it's the best way forward.

Re: new repo for it — I lean towards to keep ethereumjs-vm reputation (stars) and network (forks). It can be renamed to ethereumjs-monorepo in the end.

@holgerd77
Copy link
Member Author

holgerd77 commented Dec 5, 2019

On the caching question:

Just had a look at the VM node_modules folder, this is roughly 1 GB uncompressed and 95 MB compressed (which counts). This should be one of the significantly larger ones having ethereumjs-testing with its huge submodule folder as a dependency.

@holgerd77
Copy link
Member Author

What is with open PRs on the other repos (but eventually also on the VM?) containing a significant amount of work? These should likely (optimally) be merged before a transition (hope this is realistic)?

@evertonfraga
Copy link
Contributor

PRs on the other repos
Sure, good point. This is something we should perform some manual tests and create a playbook of.

@holgerd77
Copy link
Member Author

Monorepo Transition PR Analysis

Some analysis of the currently open PRs on the repos.

Feel free to edit the post. I would suggest not to include the monorepo working PRs (e.g. on the GitHub actions like ethereumjs/ethereumjs-account#59) to not clutter the list too much.

Account

No significant open PRs.

Block

Blockchain

Transaction

Common

VM

  • All recent open PRs (there are some which just need to be closed) are from @s1na or @alcuadrado → Can be asked for resubmission if picked up again

Summary: Ok, PR situation is actually easier to handle than I initially thought. Basically for the moment we are fine and there is some defined path to go for all open PRs. 😄

@evertonfraga
Copy link
Contributor

evertonfraga commented Dec 10, 2019

Our goal is to have a smooth migration, without surprises or rushed changes because something was not considered first place. My approach is mostly programmatic, as IMO doing every step manually can be error-prone.

I am using a project board to keep track of all requirements to accomplish that. Please take a look and see if I am forgetting anything.

https://github.com/orgs/ethereumjs/projects/1

Feel free to add new cards, and if any of those are worth discussing, feel free to convert to an issue (Click on card … > Convert to Issue).

@holgerd77
Copy link
Member Author

@evertonfraga That's really cool, thanks for setting this up! 😄

@evertonfraga
Copy link
Contributor

Hi all,

I need more context regarding the use of ts-node in our projects in order to proceed with the concerted test efforts using Lerna. I see there are some different ways to spawn the tests across the projects.

1. account, block, common and tx

They all pass the main test files to ts-node, naturally don't requiring dist/ files to be present

ts-node node_modules/tape/bin/tape ./test/index.ts

2. blockchain

It first builds dist/ files, then runs ts-node requiring modules from dist/.

npm run build && ts-node node_modules/tape/bin/tape ./test/index.ts

3. vm

I assume this is outdated code, since that currently the tests don't work without the --dist flag.

npm run build:dist && node --stack-size=1500 ./tests/tester --blockchain --dist

Now I wanted some feedback: should we normalize the libs to run with ts-node, without the need to build the files first?
I have to say that this would help quite a lot on the CI side.

@alcuadrado
Copy link
Member

alcuadrado commented Jan 22, 2020

I've always found the vm setup somewhat frustrating, as I tend to forget to run build:dist, because I normally run all my tests with ts-node.

@evertonfraga
Copy link
Contributor

Thanks for the feedback @alcuadrado.

The VM npm scripts used to let you test both the code and dist, with the —dist flag, but it lost that capability with the TS migration.

I’m taking the ts-node route that will restore that capability, and will help a lot with CI, so I shouldn’t need to move dist files across jobs for integration tests.

@holgerd77
Copy link
Member Author

Some update on Greenkeeper, this is activated for many repositories (see ethereumjs/ethereumjs-account#55 for an exemplary PR) to notify about dependency updates (just had a look at the setup, not in a consistent state though, some repos missing, some not activated):

First, monorepos seems to be supported.

To ease the migration I've now disabled Greenkeeper to stop opening new PRs, I will also close existing open PRs on the respective libraries. We can then re-activate once the transition is done (or eventually also considering switching to another tool, seems you also have some alternative solution here on the Grid side, don't see any pressure for a switch though in general and have no comparison overview. Very much a side topic I would say, but anyhow).

@evertonfraga
Copy link
Contributor

evertonfraga commented Feb 3, 2020

We started using RenovateBot for Grid repositories. It had lots of granularity for configurations, but in the end I had no special need with the array of settings for it.

I had planned to implement Greenkeeper in the monorepo already, since it's already used by the target repository (ethereumjs-vm) and it supports multiple package.json. PRs can be grouped by every new dependency, which makes everything cleaner.

I have activated Greenkeeper today in our monorepo playground: https://github.com/evertonfraga/ethereumjs-vm/

Mind that Readme linked above is not yet complete, but it's already a good opportunity for feedback.

@holgerd77
Copy link
Member Author

@evertonfraga whuah, have to admit that I didn't look at this monorepo playground yet you created. This already looks fantastic! 😄

@holgerd77
Copy link
Member Author

"This branch is 1305 commits ahead, 31 commits behind ethereumjs:master."

Lol. 🤪

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants