# Usage examples for gitpythonfs - an fsspec implementation.

Includes a comparison of behaviour with two other fsspec implementations, `git` and `file`.

To get a persistent `repo_fixture` folder as used in these examples:
- Run a test that uses `tests.test_core.repo_fixture` in the debugger, pausing before `shutil.rmtree(d)` deletes the temp dir.
- Note the temp directory location `d`.
- In your OS, rename the directory, e.g. `mv /tmp/tmptdfvh5pe /tmp/repo_fixture`.
- Finish or cancel the debug run.

ToDo: add a persist option to `tests.test_core.repo_fixture`  
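
One possible shape for the persist option mentioned in the ToDo, sketched with the standard library only. The names `temp_repo_dir` and `persist` are assumptions for illustration; the real `tests.test_core.repo_fixture` may be organised differently.

```python
import contextlib
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def temp_repo_dir(persist=False):
    """Yield a fresh temp directory; delete it on exit unless persist is True.

    Hypothetical helper, not part of tests.test_core.
    """
    d = Path(tempfile.mkdtemp())
    try:
        yield d
    finally:
        if not persist:
            shutil.rmtree(d)


# Default behaviour: the directory is removed when the block exits.
with temp_repo_dir() as d:
    (d / "file1").write_bytes(b"data00")
removed_exists = d.exists()

# With persist=True the directory survives and can be renamed or inspected.
with temp_repo_dir(persist=True) as d:
    (d / "file1").write_bytes(b"data00")
kept_exists = d.exists()

print(removed_exists, kept_exists)
shutil.rmtree(d)  # manual clean-up, only for this demo
```

With something like this, the debugger steps above would no longer be needed: the fixture would simply skip its clean-up when asked to.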

In [1]:
import fsspec

# gitpythonfs

In [5]:
import gitpythonfs

In [12]:
path = "/tmp/repo_fixture"
# path = "~/dlt"
fs_gitpythonfs = fsspec.filesystem("gitpythonfs", path=path)

In [13]:
fs_gitpythonfs.ls("")

['.dockerignore',
 '.editorconfig',
 '.github',
 '.gitignore',
 'CONTRIBUTING.md',
 'LICENSE.txt',
 'Makefile',
 'README.md',
 'check-package.sh',
 'compiled_packages.txt',
 'deploy',
 'dlt',
 'docs',
 'mypy.ini',
 'poetry-deps.sh',
 'poetry.lock',
 'pyproject.toml',
 'pytest.ini',
 'tests',
 'tox.ini']

Note that we got the folder `inner` because our implementation uses HEAD, which is currently at branch `abranch`. This differs from the `git` fsspec implementation, which defaults to the `master` branch.

In [14]:
fs_gitpythonfs.ls("", detail=True)

[{'name': '.dockerignore',
  'type': 'file',
  'mime_type': 'text/plain',
  'size': 168,
  'hexsha': '0c3244a411fe1751d82029c92dd8ea68f1cc4b46',
  'committed_date': 1685729443},
 {'name': '.editorconfig',
  'type': 'file',
  'mime_type': 'text/plain',
  'size': 319,
  'hexsha': '82a6b37b3c9f057da069ee0f3bdb7415fe9715e9',
  'committed_date': 1670792044},
 {'name': '.github',
  'type': 'directory',
  'mime_type': None,
  'size': 177,
  'hexsha': '3967419f325f89ea9990877681da2c2a1547b785',
  'committed_date': 1699459203},
 {'name': '.gitignore',
  'type': 'file',
  'mime_type': 'text/plain',
  'size': 1869,
  'hexsha': 'f26ea23d9157be88d5ba5d0822b5df94c3b3362f',
  'committed_date': 1699374841},
 {'name': 'CONTRIBUTING.md',
  'type': 'file',
  'mime_type': 'text/markdown',
  'size': 5370,
  'hexsha': 'c5fb6f9658a0abb1e26f551e9a5b70e142cbdc15',
  'committed_date': 1692369585},
 {'name': 'LICENSE.txt',
  'type': 'file',
  'mime_type': 'text/plain',
  'size': 11343,
  'hexsha': 'fa1b4ed2de7e8
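
The `committed_date` values in the detailed listing read as Unix timestamps (seconds since the epoch). Assuming that interpretation, a minimal sketch for turning one into a readable UTC datetime:

```python
from datetime import datetime, timezone

# Value copied from the '.dockerignore' entry in the listing above.
committed_date = 1685729443

# Interpret the integer as seconds since the Unix epoch, in UTC.
committed_at = datetime.fromtimestamp(committed_date, tz=timezone.utc)
print(committed_at.isoformat())
```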


`walk()`, and therefore `find()`, is very slow. That is because `walk()` invokes `ls(detail=True)`, which by default performs an expensive lookup of `committed_date` for every entry. `walk()` uses `detail=True` because it needs to tell files from directories. You can pass `include_committed_date=False` to either `walk()` or `find()` to speed them up.
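
To make the cost argument concrete, here is a toy model over an in-memory tree. It is not gitpythonfs code: `TREE`, `slow_committed_date` and the counter are invented for illustration, and only the `include_committed_date` keyword mirrors the option described above.

```python
# Toy tree: dict values are directories, bytes values are files.
TREE = {"file1": b"", "file2": b"", "inner": {"file3": b""}}

stats = {"lookups": 0}  # counts simulated expensive date lookups


def slow_committed_date(path):
    """Stand-in for an expensive per-entry history lookup."""
    stats["lookups"] += 1
    return 0


def ls(path, include_committed_date=True):
    """List one level with details, in the spirit of ls(detail=True)."""
    node = TREE
    for part in filter(None, path.split("/")):
        node = node[part]
    entries = []
    for name, child in node.items():
        full = f"{path}/{name}" if path else name
        kind = "directory" if isinstance(child, dict) else "file"
        info = {"name": full, "type": kind}
        if include_committed_date:
            info["committed_date"] = slow_committed_date(full)
        entries.append(info)
    return entries


def walk(path, **kwargs):
    """Yield (path, dirs, files); the 'type' field is what splits the two."""
    entries = ls(path, **kwargs)
    dirs = [e["name"].rsplit("/", 1)[-1] for e in entries if e["type"] == "directory"]
    files = [e["name"].rsplit("/", 1)[-1] for e in entries if e["type"] == "file"]
    yield path, dirs, files
    for d in dirs:
        yield from walk(f"{path}/{d}" if path else d, **kwargs)


list(walk(""))
with_dates = stats["lookups"]

stats["lookups"] = 0
list(walk("", include_committed_date=False))
without_dates = stats["lookups"]

print(with_dates, without_dates)
```

In this model the walk still learns each entry's type, so skipping the date lookup loses nothing that `walk()` itself needs.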

In [24]:
for thing in fs_gitpythonfs.walk(""): # , include_committed_date=False):
    print(thing)

('', ['.github', 'deploy', 'dlt', 'docs', 'tests'], ['.dockerignore', '.editorconfig', '.gitignore', 'CONTRIBUTING.md', 'LICENSE.txt', 'Makefile', 'README.md', 'check-package.sh', 'compiled_packages.txt', 'mypy.ini', 'poetry-deps.sh', 'poetry.lock', 'pyproject.toml', 'pytest.ini', 'tox.ini'])
('.github', ['ISSUE_TEMPLATE', 'workflows'], ['PULL_REQUEST_TEMPLATE.md', 'weaviate-compose.yml'])
('.github/ISSUE_TEMPLATE', [], ['bug_report.yml', 'config.yml', 'feature_request.yml'])
('.github/workflows', [], ['get_docs_changes.yml', 'lint.yml', 'test_airflow.yml', 'test_build_images.yml', 'test_common.yml', 'test_dbt_cloud.yml', 'test_dbt_runner.yml', 'test_destination_athena.yml', 'test_destination_athena_iceberg.yml', 'test_destination_bigquery.yml', 'test_destination_mssql.yml', 'test_destination_qdrant.yml', 'test_destination_snowflake.yml', 'test_destination_synapse.yml', 'test_destination_weaviate.yml', 'test_destinations.yml', 'test_doc_snippets.yml', 'test_local_destinations.yml'])
('

In [21]:
fs_gitpythonfs.find("")  # , include_committed_date=False)

['.dockerignore',
 '.editorconfig',
 '.github/ISSUE_TEMPLATE/bug_report.yml',
 '.github/ISSUE_TEMPLATE/config.yml',
 '.github/ISSUE_TEMPLATE/feature_request.yml',
 '.github/PULL_REQUEST_TEMPLATE.md',
 '.github/weaviate-compose.yml',
 '.github/workflows/get_docs_changes.yml',
 '.github/workflows/lint.yml',
 '.github/workflows/test_airflow.yml',
 '.github/workflows/test_build_images.yml',
 '.github/workflows/test_common.yml',
 '.github/workflows/test_dbt_cloud.yml',
 '.github/workflows/test_dbt_runner.yml',
 '.github/workflows/test_destination_athena.yml',
 '.github/workflows/test_destination_athena_iceberg.yml',
 '.github/workflows/test_destination_bigquery.yml',
 '.github/workflows/test_destination_mssql.yml',
 '.github/workflows/test_destination_qdrant.yml',
 '.github/workflows/test_destination_snowflake.yml',
 '.github/workflows/test_destination_synapse.yml',
 '.github/workflows/test_destination_weaviate.yml',
 '.github/workflows/test_destinations.yml',
 '.github/workflows/test_doc_s

Multiple files:

In [11]:
files = fsspec.open_files("gitpythonfs:///tmp/repo_fixture:abranch@**/file*")
len(files)

4

# git

The existing fsspec implementation is called [git](https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/git.py).

In [None]:
fs_git = fsspec.filesystem("git", path="/tmp/repo_fixture")
# "git:///tmp/test_repo:head@inner"


In [None]:
fs_git.ls("")

[{'type': 'file',
  'name': 'file1',
  'hex': 'a906852f929f123eedf16e685caafcf5179a572c',
  'mode': '100644',
  'size': 6},
 {'type': 'file',
  'name': 'file2',
  'hex': 'aff88cec66c6b9b9629ad0a288500b36c7e8479a',
  'mode': '100644',
  'size': 7}]

A more complex URL does not work for instantiation:

In [None]:
fs_git_at_branch = fsspec.filesystem("git:///tmp/repo_fixture:abranch@inner/file3")


ValueError: Protocol not known: git:///tmp/repo_fixture:abranch@inner/file3

But it does work for "direct" methods like `open()`:

In [None]:
with fsspec.open("git:///tmp/repo_fixture:file1") as f:
    data = f.read()  # named `data` to avoid shadowing the built-in `bytes`
    assert data == b"data00"

data

b'data00'

which also supports git refs:

In [None]:
with fsspec.open("git:///tmp/repo_fixture:master@file1") as f:
    data = f.read()
    assert data == b"data00"

data

b'data00'

In [None]:
with fsspec.open("git:///tmp/repo_fixture:thetag@file1") as f:
    data = f.read()
    assert data == b"data00"

data

b'data00'

In [None]:
with fsspec.open("git:///tmp/repo_fixture:abranch@inner/file3") as f:
    data = f.read()
    assert data == b"data3"

data

b'data3'
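
The URLs used above follow the shape `git://<repo path>:<ref>@<path in repo>`, with the `<ref>@` part optional. The sketch below splits such a URL into its parts to make the shape explicit. `split_git_url` is a hypothetical helper written for this notebook, not fsspec's own parser, and it ignores edge cases such as repo paths that contain a colon.

```python
def split_git_url(url):
    """Split 'git://<repo>:<ref>@<path>' into repo, ref and path.

    Hypothetical helper for illustration; ref is None when no '@' is present.
    """
    prefix = "git://"
    if not url.startswith(prefix):
        raise ValueError(f"not a git URL: {url}")
    rest = url[len(prefix):]
    # The first ':' separates the repo location from the ref/path part.
    repo, _, target = rest.partition(":")
    if "@" in target:
        ref, _, path = target.partition("@")
    else:
        ref, path = None, target
    return {"repo": repo, "ref": ref, "path": path}


print(split_git_url("git:///tmp/repo_fixture:file1"))
print(split_git_url("git:///tmp/repo_fixture:abranch@inner/file3"))
```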

With the SHA of the first commit as the ref.

Note: the SHA will be different every time repo_fixture is recreated.

In [None]:
with fsspec.open("git:///tmp/repo_fixture:9bfaaaf97aab0493e0d369cae65580d3c6d95060@file1") as f:
    data = f.read()
    assert data == b"data0"

data

b'data0'

Opening multiple files:

In [None]:
files = fsspec.open_files("git:///tmp/repo_fixture:abranch@**/file*")
files

<List of 4 OpenFile instances>

# file (local)

Note that these calls return the `.git` folder and its contents, which the `file://` protocol just sees as regular folders and files.

In [31]:
fs_file = fsspec.filesystem("file")
fs_file.ls("/tmp/repo_fixture")

['/tmp/repo_fixture/file2',
 '/tmp/repo_fixture/.git',
 '/tmp/repo_fixture/inner',
 '/tmp/repo_fixture/file1']

find() gives a list of files. It recurses into subfolders.

In [32]:
fs_file.find("/tmp/repo_fixture")

['/tmp/repo_fixture/.git/COMMIT_EDITMSG',
 '/tmp/repo_fixture/.git/HEAD',
 '/tmp/repo_fixture/.git/config',
 '/tmp/repo_fixture/.git/description',
 '/tmp/repo_fixture/.git/hooks/applypatch-msg.sample',
 '/tmp/repo_fixture/.git/hooks/commit-msg.sample',
 '/tmp/repo_fixture/.git/hooks/fsmonitor-watchman.sample',
 '/tmp/repo_fixture/.git/hooks/post-update.sample',
 '/tmp/repo_fixture/.git/hooks/pre-applypatch.sample',
 '/tmp/repo_fixture/.git/hooks/pre-commit.sample',
 '/tmp/repo_fixture/.git/hooks/pre-merge-commit.sample',
 '/tmp/repo_fixture/.git/hooks/pre-push.sample',
 '/tmp/repo_fixture/.git/hooks/pre-rebase.sample',
 '/tmp/repo_fixture/.git/hooks/pre-receive.sample',
 '/tmp/repo_fixture/.git/hooks/prepare-commit-msg.sample',
 '/tmp/repo_fixture/.git/hooks/push-to-checkout.sample',
 '/tmp/repo_fixture/.git/hooks/sendemail-validate.sample',
 '/tmp/repo_fixture/.git/hooks/update.sample',
 '/tmp/repo_fixture/.git/index',
 '/tmp/repo_fixture/.git/info/exclude',
 '/tmp/repo_fixture/.git/l

find() uses walk().

walk() yields folders, each with:
- a list of subfolders
- a list of files

It recurses into subfolders.

In [34]:
for thing in fs_file.walk("/tmp/repo_fixture"):
    print(thing)

('/tmp/repo_fixture', ['.git', 'inner'], ['file2', 'file1'])
('/tmp/repo_fixture/.git', ['branches', 'hooks', 'objects', 'info', 'refs', 'logs'], ['index', 'HEAD', 'config', 'description', 'COMMIT_EDITMSG'])
('/tmp/repo_fixture/.git/branches', [], [])
('/tmp/repo_fixture/.git/hooks', [], ['fsmonitor-watchman.sample', 'push-to-checkout.sample', 'update.sample', 'pre-applypatch.sample', 'pre-push.sample', 'pre-receive.sample', 'sendemail-validate.sample', 'pre-merge-commit.sample', 'applypatch-msg.sample', 'pre-commit.sample', 'prepare-commit-msg.sample', 'commit-msg.sample', 'post-update.sample', 'pre-rebase.sample'])
('/tmp/repo_fixture/.git/objects', ['1c', '51', '48', 'f2', 'pack', '2a', 'a9', '9b', 'cd', 'af', 'f9', '74', '09', '94', 'info', 'b6', 'dd'], [])
('/tmp/repo_fixture/.git/objects/1c', [], ['37a156abf51aae05b7b970514e19b9689f42ac'])
('/tmp/repo_fixture/.git/objects/51', [], ['39238ec364232bcb79141a5fb7f276b4f4001a'])
('/tmp/repo_fixture/.git/objects/48', [], ['8d13daaaf465

walk() in turn uses ls().

ls() lists files and directories at the specified level. It does not recurse.
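
The layering described above (find() flattens walk(), and ls() covers a single level) can be mimicked with the standard library alone. This is only an analogy using `os.walk` and `os.listdir` on a throwaway directory; the layout is a hypothetical stand-in, not the real repo_fixture, and fsspec's own implementation is not shown.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    # Hypothetical stand-in layout.
    Path(root, "inner").mkdir()
    for rel in ("file1", "file2", "inner/file3"):
        Path(root, rel).write_bytes(b"")

    # walk-like: one (folder, subfolders, files) tuple per folder.
    walked = [
        (os.path.relpath(folder, root), sorted(dirs), sorted(files))
        for folder, dirs, files in os.walk(root)
    ]

    # find-like: flatten the walk into one sorted list of file paths.
    found = sorted(
        os.path.relpath(os.path.join(folder, name), root).replace(os.sep, "/")
        for folder, _, files in os.walk(root)
        for name in files
    )

    # ls-like: a single level, no recursion.
    listed = sorted(os.listdir(root))

print(walked)
print(found)
print(listed)
```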

In [35]:
fs_file.ls("/tmp/repo_fixture")

['/tmp/repo_fixture/file2',
 '/tmp/repo_fixture/.git',
 '/tmp/repo_fixture/inner',
 '/tmp/repo_fixture/file1']

ls() can also give object details

In [36]:
fs_file.ls("/tmp/repo_fixture", detail=True)

[{'name': '/tmp/repo_fixture/file2',
  'size': 7,
  'type': 'file',
  'created': 1703550235.1300154,
  'islink': False,
  'mode': 33188,
  'uid': 1000,
  'gid': 1000,
  'mtime': 1703550235.1300154,
  'ino': 386288,
  'nlink': 1},
 {'name': '/tmp/repo_fixture/.git',
  'size': 4096,
  'type': 'directory',
  'created': 1703550235.1600153,
  'islink': False,
  'mode': 16877,
  'uid': 1000,
  'gid': 1000,
  'mtime': 1703550235.1600153,
  'ino': 386236,
  'nlink': 8},
 {'name': '/tmp/repo_fixture/inner',
  'size': 4096,
  'type': 'directory',
  'created': 1703550235.1500154,
  'islink': False,
  'mode': 16877,
  'uid': 1000,
  'gid': 1000,
  'mtime': 1703550235.1500154,
  'ino': 386259,
  'nlink': 2},
 {'name': '/tmp/repo_fixture/file1',
  'size': 6,
  'type': 'file',
  'created': 1703550235.1300154,
  'islink': False,
  'mode': 33188,
  'uid': 1000,
  'gid': 1000,
  'mtime': 1703550235.1300154,
  'ino': 386264,
  'nlink': 1}]
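
The integer `mode` values in the detailed listing are standard stat mode bits. A short sketch using the standard library `stat` module to decode them; the two values are copied from the `file1` and `.git` entries above.

```python
import stat

# Values copied from the listing above.
file_mode = 33188  # 'file1'
dir_mode = 16877   # '.git'

# Octal form: the file's value has the same digits as the '100644'
# mode string reported by the git implementation's ls().
print(oct(file_mode), stat.filemode(file_mode))
print(oct(dir_mode), stat.filemode(dir_mode))

# The same bits also say whether an entry is a file or a directory.
print(stat.S_ISREG(file_mode), stat.S_ISDIR(dir_mode))
```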