# Usage examples for gitpythonfs - an fsspec implementation.

Includes comparison of behaviour with other fsspec implementations, `git` and `file`.

To get a persistent `repo_fixture` folder as used in these examples:
- Debug-run a test that uses `tests.test_core.repo_fixture`, pausing before `shutil.rmtree(d)` deletes the temp dir.
- Get the temp directory location `d`.
- In your OS, rename the directory, e.g. `mv /tmp/tmptdfvh5pe /tmp/repo_fixture`.
- Finish or cancel the debug run.

ToDo: add a persist option to `tests.test_core.repo_fixture`  

In [None]:
import fsspec

# gitpythonfs

In [None]:
import gitpythonfs

In [None]:
dlt_repo_path = "~/dlt"
wine_repo_path = "~/wine"
test_fixture_repo_path = "/tmp/repo_fixture"

In [None]:
fs_gitpythonfs = fsspec.filesystem("gitpythonfs", repo_path=wine_repo_path)

In [None]:
# fs_at_current_directory = fsspec.filesystem("gitpythonfs")

In [None]:
fs_gitpythonfs.repo

In [None]:
fs_gitpythonfs.repo.head.ref.name

Now let's look at some contents of our repo:

In [None]:
fs_gitpythonfs.ls("")

Note that we got the folder `inner` because our implementation uses HEAD, which is currently at branch `abranch`. This differs from the `git` fsspec implementation, which defaults to the `master` branch.

In [None]:
fs_gitpythonfs.ls("", detail=True)


`walk()`, and therefore `find()` and `glob()`, are very slow. That's because `walk()` calls `ls(detail=True)`, which by default also fetches the expensive `committed_date`. `walk()` uses `detail=True` because it needs to distinguish files from directories. You can pass `include_committed_date=False` to either `walk()` or `find()` to speed them up.

In [None]:
for thing in fs_gitpythonfs.walk(""): # , include_committed_date=False):
    print(thing)

In [None]:
fs_gitpythonfs.find("") # , include_committed_date=True)

In [None]:
for thing in fs_gitpythonfs.glob("docs/**.md", ref="HEAD"): # , include_committed_date=False):
    print(thing)

Git References (refs):

You can specify a commit sha, tag or branch.  Default is HEAD.

In [None]:
fs_gitpythonfs.find(
    "",
    ref="thetag" # comment out to use HEAD
    )

Multiple files:

In [None]:
files = fsspec.open_files("gitpythonfs:///tmp/repo_fixture:abranch@**/file*")
len(files)

Some git contents are cached in the filesystem instance:

In [None]:
fs_gitpythonfs._get_tree.cache_info()

In [None]:
fs_gitpythonfs.clear_git_caches()
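
The `cache_info()` call above is the interface that Python's `functools.lru_cache` adds to a wrapped function, which suggests (an assumption, not confirmed in this notebook) that `_get_tree` is memoised that way. Below is a minimal, self-contained sketch of a per-instance cache with a clearing method. `TinyTreeCache`, `_load_tree` and `clear_caches` are hypothetical names for illustration, not part of gitpythonfs.

```python
from functools import lru_cache


class TinyTreeCache:
    """Toy stand-in for a filesystem object that caches expensive lookups."""

    def __init__(self):
        self.loads = 0  # counts how often the expensive work really runs
        # Wrap the bound method so each instance owns its own cache.
        self._get_tree = lru_cache(maxsize=128)(self._load_tree)

    def _load_tree(self, ref):
        # Placeholder for an expensive git lookup (hypothetical).
        self.loads += 1
        return f"tree-for-{ref}"

    def clear_caches(self):
        # lru_cache exposes cache_clear() alongside cache_info().
        self._get_tree.cache_clear()


cache = TinyTreeCache()
cache._get_tree("HEAD")  # miss: runs _load_tree
cache._get_tree("HEAD")  # hit: served from the cache
info_before = cache._get_tree.cache_info()

cache.clear_caches()
info_after = cache._get_tree.cache_info()
print(info_before, info_after)
```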

Note that fsspec itself caches entire filesystem instances, which can be overridden with `skip_instance_cache=True`

In [None]:
path_2 = "/tmp/repo_fixture"
fs_gitpythonfs_2 = fsspec.filesystem(
    "gitpythonfs",
    repo_path=path_2,
    # ref="thetag",
    # skip_instance_cache=True
    )

fs_gitpythonfs_2._get_tree.cache_info()

git_cmd extensions

In [None]:
from gitpythonfs import git_cmd
import git

dlt_repo_path = "~/dlt"
wine_repo_path = "~/wine"
test_fixture_repo_path = "/tmp/repo_fixture"

repo = git.Repo(dlt_repo_path)

revs_dict = git_cmd.get_revisions_all(repo, "HEAD")

# repo: dlt
# time: 0.2s
# files: 1771

# repo: wine
# time: 48.1s
# files: 18,669  (len(revs_dict))
# revs_dict size: 805991 bytes


In [None]:
from sys import getsizeof
getsizeof(f"{revs_dict}")
# revs_dict
len(revs_dict)

Using GitPython directly

In [None]:
import git
repo = git.Repo("/tmp/repo_fixture")


In [None]:
repo.tree

In [None]:
repo.commit("abranch").hexsha

Direct git cmd usage via GitPython
See https://gitpython.readthedocs.io/en/stable/tutorial.html#using-git-directly

In [None]:
git_exec = repo.git
git_exec.log()

In [None]:
git_exec.whatchanged()

Git formatting
https://git-scm.com/docs/pretty-formats

In [None]:
git_exec.whatchanged(pretty="%at%n%H")

Getting revisions/log based on a reference

In [None]:
import git

def get_revisions_at_path():

    repo = git.Repo("/tmp/repo_fixture")
    git_exec = repo.git

    # hexsha = repo_4d5t6.commit("abranch").hexsha
    # ref = "b618cd9aad9074b981178ca2f8c41b1f3d95c52f"
    ref = "HEAD"

    # cases: root, directory, file
    #  if directory add "/*"
    #  assume/enforce path doesn't start with "/" as that causes error in git
    path = ""
    # git uses fnmatch(3) style matching
    path_spec = ":(top)" + path + "*"

    out = git_exec.log(
        ref,
        path_spec,
        raw=True,
        no_merges=True,
        pretty="%at",
    )
    return out

get_revisions_at_path()
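
The function above returns the text of `git log --raw --no-merges --pretty=%at`, i.e. a commit timestamp followed by raw diff lines for each commit, newest first. One possible way to turn that text into a `{path: latest commit timestamp}` mapping is sketched below. `latest_commit_times` is a hypothetical helper, and `SAMPLE_LOG` is a hand-written approximation of the output shape rather than captured output, so treat both as assumptions.

```python
# Hand-written sample in the rough shape of `git log --raw --pretty=%at`.
# Timestamps and blob ids are made up for illustration only.
SAMPLE_LOG = (
    "1700000200\n"
    "\n"
    ":100644 100644 1111111 2222222 M\tfile1\n"
    "1700000100\n"
    "\n"
    ":000000 100644 0000000 1111111 A\tfile1\n"
    ":000000 100644 0000000 3333333 A\tinner/file3\n"
)


def latest_commit_times(raw_log):
    """Map each path to the timestamp of the newest commit that touched it."""
    latest = {}
    current = None
    for line in raw_log.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            # Raw diff line: metadata, a tab, then the path (last field).
            path = line.split("\t")[-1]
            # The log is newest first, so keep the first timestamp seen.
            latest.setdefault(path, current)
        elif line.isdigit():
            # A line of digits is the %at (author time) of the next commit.
            current = int(line)
    return latest


times = latest_commit_times(SAMPLE_LOG)
print(times)
```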

# git

An existing fsspec implementation called [git](https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/git.py)

In [None]:
fs_git = fsspec.filesystem("git", path="/tmp/repo_fixture")
# "git:///tmp/test_repo:head@inner"


In [None]:
fs_git.ls("")

A more complex URL does not work for instantiation. It should only be the path to the repo.

In [None]:
fs_git_at_branch = fsspec.filesystem("git:///tmp/repo_fixture:abranch@inner/file3")


But it does work for "direct" methods like `open()`:

In [None]:
with fsspec.open("git:///tmp/repo_fixture:file1") as f:
    data = f.read()  # named `data` to avoid shadowing the builtin `bytes`
    assert data == b"data00"

data

These methods also support git refs:

In [None]:
with fsspec.open("git:///tmp/repo_fixture:master@file1") as f:
    data = f.read()  # named `data` to avoid shadowing the builtin `bytes`
    assert data == b"data00"

data

In [None]:
with fsspec.open("git:///tmp/repo_fixture:thetag@file1") as f:
    data = f.read()  # named `data` to avoid shadowing the builtin `bytes`
    assert data == b"data00"

data

In [None]:
with fsspec.open("git:///tmp/repo_fixture:abranch@inner/file3") as f:
    data = f.read()  # named `data` to avoid shadowing the builtin `bytes`
    assert data == b"data3"

data

Using the sha of the first commit.

Note: The sha will be different every time repo_fixture is recreated.

In [None]:
with fsspec.open("git:///tmp/repo_fixture:9bfaaaf97aab0493e0d369cae65580d3c6d95060@file1") as f:
    data = f.read()  # named `data` to avoid shadowing the builtin `bytes`
    assert data == b"data0"

data

Opening multiple files:

In [None]:
files = fsspec.open_files("git:///tmp/repo_fixture:abranch@**/file*")
files
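
The URLs in the cells above share the shape `protocol://repo_path:ref@path`, where the `ref@` part is optional. As a rough illustration of how such a URL could be split into its parts, here is a simplified sketch. `split_git_url` is a hypothetical helper written for this notebook, not the parser used by fsspec or gitpythonfs, and it assumes that neither the repo path nor the ref contains `:` or `@`.

```python
def split_git_url(url, protocol="git"):
    """Split 'protocol://repo_path:ref@path' into (repo_path, ref, path).

    Simplified sketch: parts that are absent come back as None.
    """
    prefix = protocol + "://"
    rest = url[len(prefix):] if url.startswith(prefix) else url

    repo_path, ref, path = None, None, rest
    if ":" in path:
        # Everything before the first ':' is the repo location.
        repo_path, path = path.split(":", 1)
    if "@" in path:
        # Everything before the first '@' is the git ref (branch, tag or sha).
        ref, path = path.split("@", 1)
    return repo_path, ref, path


with_ref = split_git_url("git:///tmp/repo_fixture:abranch@inner/file3")
without_ref = split_git_url("git:///tmp/repo_fixture:file1")
print(with_ref)
print(without_ref)
```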

# file (local)

Note that these calls return the `.git` folder and its contents, which the `file://` protocol sees as regular folders and files.

In [None]:
fs_file = fsspec.filesystem("file")
fs_file.ls("/tmp/repo_fixture")

find() gives a list of files. It recurses into subfolders.

In [None]:
fs_file.find("/tmp/repo_fixture")

find() uses walk()

walk() gives folders, each with:
- list of subfolders
- list of files

It recurses into subfolders.

In [None]:
for thing in fs_file.walk("/tmp/repo_fixture"):
    print(thing)
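
fsspec describes its `walk()` as behaving like `os.walk()`, so the shape of each yielded item can be explored without fsspec or a git repo. The sketch below uses only the standard library and a throwaway directory. The layout (`file1`, `inner/file3`) loosely imitates the fixture and is an assumption for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a tiny tree: root/file1 and root/inner/file3.
    os.makedirs(os.path.join(root, "inner"))
    for rel in ("file1", os.path.join("inner", "file3")):
        with open(os.path.join(root, rel), "w") as f:
            f.write("data")

    seen = []
    # Each item is (folder, list of subfolders, list of files), top-down.
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        seen.append((rel_dir, sorted(dirnames), sorted(filenames)))
        print(rel_dir, dirnames, filenames)
```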

And walk() uses ls()

ls() lists files and directories at the specified level. It does not recurse.

In [None]:
fs_file.ls("/tmp/repo_fixture")

ls() can also give object details

In [None]:
fs_file.ls("/tmp/repo_fixture", detail=True)