[NEEDS IP CLEARANCE] ARROW-10228: Contribute Julia implementation #8448

Closed
wants to merge 8 commits into from

StefanKarpinski

This pull request merges a synthetic history of the Arrow.jl Julia package into the main arrow monorepo under the julia top-level directory. The history of Arrow.jl has been rewritten so that it appears that all development was done in this directory, retaining only a commit for each published version of the Arrow.jl package. Preserving this history (specifically the git tree objects associated with each commit) allows Julia's package manager to continue to install historical versions of Arrow.jl while having the arrow monorepo as the git repository of record going forward.
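The reason the move into a subdirectory keeps old versions installable can be checked with git plumbing. The following is only a minimal sketch, assuming a throwaway repository and a hypothetical one-file package (`Demo.jl`); it wraps an existing tree under the name `julia` with `git mktree` as a stand-in for what the subdirectory rewrite does to each commit's root tree, and is not the procedure used for this PR.

```shell
set -eu

# Work in a throwaway repository (hypothetical stand-in for Arrow.jl).
repo=$(mktemp -d)
cd "$repo"
git init --quiet .

# Stand-in for a package release: one file, staged and written out as a tree.
printf 'module Demo end\n' > Demo.jl
git add Demo.jl
pkg_tree=$(git write-tree)

# Wrap that tree in a new root tree under the name "julia". This mimics the
# effect of moving the whole package into a julia/ subdirectory.
root_tree=$(printf '040000 tree %s\tjulia\n' "$pkg_tree" | git mktree)

# Record the wrapped tree as a commit; the identity here is a placeholder.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
commit=$(git commit-tree -m 'wrap package tree under julia/' "$root_tree")

# The julia/ subtree of the new commit is the same object as the package tree.
sub_tree=$(git rev-parse "$commit:julia")
echo "package tree: $pkg_tree"
echo "julia/ tree:  $sub_tree"
```

Both lines should print the same hash, since a tree object's identity depends only on its contents and not on where it is mounted in a larger tree.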

I'm making this pull request on behalf of the Arrow.jl project (cc @quinnj, @ExpandingMan) as the resident git mage. Let me know if there's anything I should change about this PR to integrate better into the arrow project.


For my own record (in case I need to do this again), here's the code I used to generate the synthetic history:

using TOML

data = TOML.parse("""
["0.1.2"]
git-tree-sha1 = "5cab061e3fcf0d78291f9c4b3db1f58c8f5e1bc5"

["0.2.0"]
git-tree-sha1 = "5081382c0e5c78c1849b9841b9d8941437060b48"

["0.2.1"]
git-tree-sha1 = "ecfe11bd0874ab41b78be0ca8d0f680ba37978dc"

["0.2.2"]
git-tree-sha1 = "c66fc3e71747c99a3e3940ade685c0d8ea66c0ae"

["0.2.3"]
git-tree-sha1 = "d3c36842140057276f6f8348afa08f0f7dae2d1e"

["0.2.4"]
git-tree-sha1 = "c86df6ed41b3bd192d663e5e0e7cac0d11fd4375"

["0.3.0"]
git-tree-sha1 = "76641f71ac332cd4d3cf54b98234a0f597bd7a2f"
""")

trees = Dict(VersionNumber(k) => v["git-tree-sha1"] for (k, v) in data)

ENV["GIT_AUTHOR_NAME"] = "Jacob Quinn"
ENV["GIT_AUTHOR_EMAIL"] = "quinn.jacobd@gmail.com"

let commit = "16b729db74d78ecb010efab855c9e46c8052f59e"
    for (ver, tree) in sort!(collect(trees), by=first)
        message = """
        ARROW-10228: [Julia] Arrow.jl v$ver

        Co-authored-by: Michael Savastio <savastio@gmail.com>
        """
        ENV["GIT_AUTHOR_DATE"] = readchomp(`git show -s --format=%ai v$ver`)
        commit = readchomp(`git commit-tree -p $commit -m $message $tree`)
    end
    run(`git branch -f sk/synthetic $commit`)
end
run(`git filter-repo --force --to-subdirectory-filter julia`)

Then I used the following .git/config in a clone of arrow:

[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
	ignorecase = true
	precomposeunicode = true
[remote "origin"]
	url = https://github.com/apache/arrow.git
	fetch = +refs/heads/*:refs/remotes/origin/*
[remote "StefanKarpinski"]
	url = https://github.com/StefanKarpinski/arrow.git
	fetch = +refs/heads/*:refs/remotes/StefanKarpinski/*
[remote "Arrow.jl"]
	url = ../Arrow.jl
	fetch = +refs/heads/*:refs/remotes/Arrow.jl/*

[branch "master"]
	remote = origin
	merge = refs/heads/master

With that setup, you just do this in the arrow clone:

git fetch Arrow.jl --no-tags
git merge Arrow.jl/sk/synthetic

Enter the merge commit message when prompted.

@StefanKarpinski
Author

Supersedes #8393 but still needs IP clearance.

@StefanKarpinski
Author

Note: I think this should probably be merged rather than rebase merged, but let me know how y'all want the history to look. I can probably accommodate anything.

@nealrichardson changed the title from "ARROW-10228: [Julia] merge Arrow.jl history into main arrow monorepo" to "[NEEDS IP CLEARANCE] ARROW-10228: Contribute Julia implementation" on Oct 12, 2020
@nealrichardson
Member

Looks like this PR needs the license headers prepended everywhere: maybe that can be done and then squashed into that last commit 0632ecf?

@quinnj
Member

quinnj commented Oct 12, 2020

@StefanKarpinski, I have a few things I want to clean up and improve, so I can work on that over the next few days. Then we can do a new "release" in JuliaData/Arrow.jl and push that release commit here using the same script you have. Does that sound reasonable? That release would then include the license header changes.

@StefanKarpinski
Author

Additional changes can be made before or after the merge, but the content of files in historical commits cannot be modified, since that would change the tree hashes. Changed tree hashes would make it impossible to install the previous versions of Arrow.jl from this repo, which is the whole purpose of having this history in the repo. IMO, it would be easier and clearer to just make additional changes in this repo after the merge.
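To illustrate the point about tree hashes, here is a small sketch, assuming a throwaway repository and a hypothetical file `Demo.jl` with a placeholder header line (not the actual Apache license header). It hashes the tree before and after a header is prepended.

```shell
set -eu

# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init --quiet .

# Tree hash of the original "release" contents.
printf 'module Demo end\n' > Demo.jl
git add Demo.jl
tree_before=$(git write-tree)

# Prepend a placeholder header line and hash the tree again.
printf '# license header placeholder\nmodule Demo end\n' > Demo.jl
git add Demo.jl
tree_after=$(git write-tree)

echo "before: $tree_before"
echo "after:  $tree_after"
```

The two hashes differ, so a version registered under the first hash could no longer be located in a history that only contains the second.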

@wesm
Member

wesm commented Oct 16, 2020

We can have a bigger discussion (e.g. on the mailing list), but in other instances we've done either a rebase-merge or a squash-merge for these. It's our preference to maintain a linear commit history in the main branch. How important is it to be able to install the old releases using the exact tree hash they had at the time they were released? Since this code is still pre-"production" (I think? I haven't looked at the status of the integration tests), I'm not sure how valuable it is to be able to install the old releases.

@StefanKarpinski
Author

We guarantee that published Julia package versions remain installable forever, and versions are immutably identified by tree hashes, so it's quite important. If one couldn't install old versions from this repo, then this would have to contain a new, different Julia package so that the old one could remain installable. Fortunately, a rebase should not affect the necessary subtrees, so rebasing should be fine.
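Why a rebase is harmless here can be sketched as follows, assuming a throwaway repository and hypothetical names (`Demo.jl`, "parent A", "parent B"). Committing the same tree on top of two different parents yields two different commit hashes, but the tree each commit points to is unchanged.

```shell
set -eu

# Work in a throwaway repository; the identity below is a placeholder.
repo=$(mktemp -d)
cd "$repo"
git init --quiet .
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# One tree standing in for a released package version.
printf 'module Demo end\n' > Demo.jl
git add Demo.jl
tree=$(git write-tree)

# Two distinct parent commits, standing in for "before" and "after" a rebase.
parent_a=$(git commit-tree -m 'parent A' "$tree")
parent_b=$(git commit-tree -m 'parent B' "$tree")

# The same release tree committed on top of each parent.
c_a=$(git commit-tree -p "$parent_a" -m 'release' "$tree")
c_b=$(git commit-tree -p "$parent_b" -m 'release' "$tree")

echo "commit on A: $c_a -> tree $(git rev-parse "${c_a}^{tree}")"
echo "commit on B: $c_b -> tree $(git rev-parse "${c_b}^{tree}")"
```

The commit hashes differ because the parents differ, while both commits resolve to the same tree hash, which is the only identifier that matters for installing a version.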

@StefanKarpinski
Author

However, those rebased commits cannot modify any of the files in any way, e.g. by putting headers in them. The headers can be added in a newer commit for the first version that is published as an official part of arrow.

ExpandingMan and others added 8 commits October 20, 2020 00:37
Co-authored-by: Michael Savastio <savastio@gmail.com>
@quinnj
Member

quinnj commented Oct 28, 2020

Can be closed in favor of #8547
