migrating from SVN to GitHub #40

Open
goodmami opened this issue Feb 20, 2023 · 9 comments

@goodmami
Member

This issue concerns difficulties in importing the ERG from SVN to GitHub. See comments on this gist for some context. One question from that thread:

[...] is GitHub's SVN importer not sufficient?

I can now answer that:

This repository is too large.

This might also be an issue with a manually converted repository. If so, we might need to consider storing big things like profiles and compiled .grm or .dat files in a separate repo.

@arademaker
Member

Yes, see my comments in gist.

@oepen
Contributor

oepen commented Feb 23, 2023

hiya!

importing the ERG from SVN into GitHub is no small project, i imagine. ERG history goes back to around 1994, and there has been a long tradition for storing large binary files interspersed with the source files (owing in part to its centralized design, SVN works fairly well on binary files).

i imagine some repository surgery and retroactive refactoring may be called for. if it helps, i could probably make available an SVN dump file (filtered to just include everything below the ERG directory). but for that to make sense, i think we would first have to declare the ERG in SVN read-only, i.e. establish agreement with dan that there are no pending commits and that all future development will be against GitHub.

@arademaker
Member

arademaker commented Mar 2, 2023

I have finished the first part of the migration.

  1. Using git svn, I cloned the SVN repo.
  2. I updated the tags and branches; the branches were later deleted (after @danflick confirmed they are not needed).
  3. I updated this repo with git push --all --force.
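
The attached script is not reproduced in this thread, so the following is only a rough sketch of what steps 1 and 3 could look like, not the commands that were actually run. It assumes a standard trunk/branches/tags layout and assumes the ignore pattern quoted later in this comment was passed through git svn's `--ignore-paths` option. The SVN URL, the directory name, and the function name are all hypothetical. The function only builds the command strings; it does not execute anything.

```python
import shlex

# Hypothetical values: neither the SVN URL nor the local directory
# name is given in this thread.
SVN_URL = "https://svn.example.org/erg"
WORKDIR = "erg-git"
# Pattern quoted later in this comment; passing it via --ignore-paths
# is an assumption.
IGNORE = r"\.mem$|\.grm$|edge$|result$|\.gz$|\.dat$"


def migration_commands(svn_url, workdir, ignore_regex):
    """Return shell command strings for a dry run; nothing is executed."""
    steps = [
        # Step 1: clone the SVN history, skipping paths that match the regex.
        ["git", "svn", "clone", "--stdlayout",
         f"--ignore-paths={ignore_regex}", svn_url, workdir],
        # Step 2 (tidying tags and branches) is omitted because the
        # thread does not say how it was done.
        # Step 3: overwrite the branches on the configured remote.
        ["git", "-C", workdir, "push", "--all", "--force"],
    ]
    return [shlex.join(cmd) for cmd in steps]


for command in migration_commands(SVN_URL, WORKDIR, IGNORE):
    print(command)
```

Printing the commands first makes it possible to review them before running anything against the real repository, which matters because `--force` rewrites the remote branches.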

I am attaching the script I used, the logs with the steps, commands and outputs, the old README file and the references I followed.

transfer.zip

In the README.org I enumerated the next steps.

Keeping in mind that we ignored the files:

"\.mem$|\.grm$|edge$|result$|\.gz$|\.dat$"
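
To make the effect of this pattern concrete, here is a minimal sketch. It assumes the pattern is applied as an unanchored regular-expression search against each path; the example paths are hypothetical and chosen only for illustration.

```python
import re

# Pattern quoted above; assumed to be used as an unanchored search
# against each repository path.
IGNORED = re.compile(r"\.mem$|\.grm$|edge$|result$|\.gz$|\.dat$")


def is_ignored(path):
    """Return True if the path would be skipped under the pattern."""
    return IGNORED.search(path) is not None


# Hypothetical paths for illustration only.
paths = [
    "tsdb/gold/mrs/edge",
    "tsdb/gold/mrs/result",
    "tsdb/gold/mrs/item",
    "english.grm",
    "lexicon.tdl",
]
kept = [p for p in paths if not is_ignored(p)]
print(kept)
```

Two things follow from the pattern as written. Profile directories keep their smaller files (such as `item`) while losing `edge` and `result`. And because `edge$` and `result$` are not anchored to a path separator, any path that merely ends in those letters (for example a file named `knowledge`) would be skipped as well.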

Next we need:

1. The profiles will be moved to a separate repository. Dan agrees
   that it would be better to go over the SVN commits manually, take the
   specific versions of each release, and construct a git repository
   by recreating the important snapshots of the history.

2. We need to revise the tags and make releases in the repo to reflect
   the ERG's history. Tags are now pointing to commits disconnected
   from the branches. See
   https://github.com/delph-in/erg/releases/tag/2018 for example
   (click on the commit hash).

3. We need to attach to each release the big files that we didn't want
   to keep under version control (mem files, the maxent models)
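
On item 2, "disconnected" can be read as: the commit a tag points to is not an ancestor of any branch tip. The sketch below illustrates that check on a toy, hand-written parent map. The commit ids, the branch name, and the function names are hypothetical; this is not the ERG history and not a replacement for inspecting the real repository.

```python
def ancestors(commit, parents):
    """Return the set holding `commit` and every commit reachable from it."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


def disconnected_tags(tags, branches, parents):
    """Return names of tags whose commit is not reachable from any branch tip."""
    reachable = set()
    for tip in branches.values():
        reachable |= ancestors(tip, parents)
    return sorted(name for name, commit in tags.items() if commit not in reachable)


# Toy history: c1 <- c2 <- c3 on the branch, plus a side commit t2018
# whose parent is c2 but which no branch contains.
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"], "t2018": ["c2"]}
branches = {"main": "c3"}
tags = {"2018": "t2018", "fixed": "c2"}

print(disconnected_tags(tags, branches, parents))
```

In this toy graph the tag `2018` sits on a side commit, so it is reported, while `fixed` points at a commit on the branch and is not. Moving a tag to the nearest branch commit with the same tree would be one way to reconnect it, if that matches what the release actually contained.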

@arademaker changed the title from "SVN import issues" to "migrating from SVN to GitHub" on Mar 2, 2023

@goodmami
Member Author

goodmami commented Mar 5, 2023

Thanks, @arademaker. Rather than pruning out just the larger edge and result files from the profiles, which leaves unusable profile artifacts in the repo, I assumed you might prune out the entire tsdb/ subdirectory and make a separate repo for it. Pruning these out from the beginning of the history should make this repo significantly smaller. I understand that this task may be easier said than done, however.

@goodmami
Member Author

Next we need:

  1. The profiles will be moved to a separate repository. Dan agrees
    that it would be better to go over the SVN commits manually, take the
    specific versions of each release, and construct a git repository
    by recreating the important snapshots of the history.

@arademaker, do we have the [incr tsdb()] profiles available anywhere?

@arademaker
Member

At the beginning of the year, @danflick and I discussed the issue with the profiles. I do not remember now what his final decision was, but one approach I suggested was to have a separate repo for them.

@arademaker
Member

Sorry, what I wrote above is precisely what I reported in the previous comment. I don't know the current status; @danflick left Brazil with a complete step-by-step guide to finish the migration, but he needs time to review the data before the final migration.

@danflick
Collaborator

I have the [incr tsdb()] profiles for each of the releases for the past 15 years, and would appreciate guidance on how best to organize those files on GitHub to enable convenient packaging of the releases, including the most recent 2023 version.

@goodmami
Member Author

@danflick, some suggestions:

  • Store gold profiles in a separate repo
  • Store gold files as plaintext, not with .gz or other compression. Git will manage the compression, and it will be able to reason about lines of text (e.g., # lines changed, computing commit deltas, etc.)
  • Package releases (e.g., a redwoods-2024.tar.gz file or similar) as assets on GitHub releases. See how Francis and I did this for the OMW wordnet data as an example: https://github.com/omwn/omw-data/releases
  • Use CI scripts to package and attach assets to releases (I can help with this)
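
As a rough idea of what the packaging step in such a CI script could do, here is a minimal sketch using Python's standard `tarfile` module. The function name, the `gold` directory layout, and the archive name are assumptions for illustration; attaching the archive to a GitHub release would be a separate step that is not shown.

```python
import tarfile
from pathlib import Path


def package_profiles(gold_dir, archive_path):
    """Bundle every file under gold_dir into a gzip-compressed tarball.

    Files stay as plaintext in the repository; compression happens only
    when the release archive is built.
    """
    gold_dir = Path(gold_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(gold_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the parent so the archive
                # unpacks into a single top-level directory.
                arcname = path.relative_to(gold_dir.parent).as_posix()
                tar.add(path, arcname=arcname)
    return archive_path
```

A CI job could call something like `package_profiles("gold", "redwoods-2024.tar.gz")` when a release is tagged and then upload the resulting file as a release asset. The archive should be written outside the directory being packaged so that it does not end up inside itself.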

For this erg repository, I would also just remove the whole subtree under tsdb/gold/. Currently the profiles are all there except for the large files, which is confusing. And, as with the last two points above, ERG releases on GitHub can use CI scripts to build and attach the .grm, .dat, and any other large files as assets.
