# git-lfs: Large File Storage

If you want to quickstart a new **git-lfs** project, go [here](./lfs_template.ipynb).

If you are ready to enable **git-lfs** for your repository, go [here](./lfs_tool.ipynb).

### git-lfs is a git extension that enables version control for large, typically binary data assets, such as:

* Audio files

* Video files

* Model files for machine learning frameworks such as PyTorch and TensorFlow

* Any file type that is typically large (greater than a few megabytes) and is not human-readable

Git was not designed or optimized for these sorts of assets. While you can technically check in binary files, once individual files exceed a few megabytes or total repository storage grows past roughly 1 gigabyte, pushes, pulls, and other operations become slow and cumbersome. A git clone also downloads the entire history of each file by default, which is impractical for files with dozens or hundreds of revisions. Git's delta compression is rarely effective on binary files, especially ones that are already compressed, so each version of every binary file is in effect stored and downloaded in full for every clone. For these reasons, the leading cloud git providers limit both the size of individual files and the total size of a repository.

Rather than burden git servers such as GitHub, GitLab, or self-hosted setups with this data, an lfs-enabled repository stores ***pointers*** to the large binary files. A pointer is a small text file that records the pointer spec version, the size of the original file, and its hash (a SHA-256 digest): a 32-byte value, written as 64 hexadecimal characters, that serves as a practically unique identifier of the original file's content.
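As a rough illustration of what such a pointer looks like, the Python sketch below builds one by hand. It assumes the three-line layout from the git-lfs pointer spec (`version`, `oid`, `size`); `make_pointer` is a hypothetical helper written for this example and is not part of git-lfs itself.

```python
import hashlib

# Version line used by git-lfs pointer files.
POINTER_VERSION = "https://git-lfs.github.com/spec/v1"


def make_pointer(content: bytes) -> str:
    """Return pointer text for `content` (hypothetical helper, for illustration only)."""
    oid = hashlib.sha256(content).hexdigest()  # 32-byte digest as 64 hex characters
    return f"version {POINTER_VERSION}\noid sha256:{oid}\nsize {len(content)}\n"


# Stand-in for a large binary asset: 5 MB of zero bytes.
content = b"\x00" * 5_000_000
pointer = make_pointer(content)

print(pointer)
print(len(pointer.encode()), "bytes in the pointer,", len(content), "bytes in the file")
```

However large the original file is, the pointer stays at three short lines, which is why committing it to git is cheap.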

When a file added to the repository is identified as being "large" (what that means precisely is discussed below), the actual file's content is uploaded to an lfs server that is specifically designed to store large data files. Note that an lfs server is **not** a git server: it doesn't handle versions, branches, merges, etc. It is just a key/value store, similar to data storage services such as Amazon S3. If the data upload is successful, the **pointer** file (typically about 130 bytes long) is committed to your git repository as a short, human-readable text file.

When cloning, pulling, or updating large files in an lfs-enabled repository to your working directory, the process is reversed: git sees that the file is an lfs **pointer**, and (if the file is not already cached locally) git then downloads the actual file from the lfs server. If all goes well, the user will never need to be aware of the pointer files, because any version of any file the user wants in their working directory is replaced with the real data file when it is actually needed due to a checkout or a pull.
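The upload and download flow described above can be sketched with a toy key/value store. This is only a mental model under simplifying assumptions: `ToyLfsStore` and `checkout` are hypothetical names, everything lives in memory, and a real lfs server is reached over HTTP rather than through a Python dictionary.

```python
import hashlib


class ToyLfsStore:
    """In-memory stand-in for an lfs server: maps a sha256 oid to file bytes."""

    def __init__(self):
        self._objects = {}
        self.downloads = 0  # counts how often content was fetched from the "server"

    def put(self, content: bytes) -> str:
        """Store content under its sha256 digest and return that key."""
        oid = hashlib.sha256(content).hexdigest()
        self._objects[oid] = content
        return oid

    def get(self, oid: str) -> bytes:
        self.downloads += 1
        return self._objects[oid]


def checkout(oid: str, cache: dict, store: ToyLfsStore) -> bytes:
    """Resolve a pointer's oid to real content, using the local cache first."""
    if oid not in cache:
        cache[oid] = store.get(oid)  # download only when not cached locally
    return cache[oid]


store = ToyLfsStore()
local_cache = {}

oid = store.put(b"fake model weights")      # upload on add/push
first = checkout(oid, local_cache, store)   # first checkout downloads the content
second = checkout(oid, local_cache, store)  # second checkout is served from the cache

print(first == second, "downloads:", store.downloads)
```

Because the key is derived from the content itself, the store needs no notion of branches or history; git keeps track of which pointer belongs to which revision.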

### Run the next cell to check that git-lfs is installed (it is installed by default on Cloudburst):

    !git lfs version

The output should look something like this (the version numbers may differ):

    git-lfs/2.9.2 (GitHub; linux amd64; go 1.13.5)
    
If instead you see an error such as `git: 'lfs' is not a git command`, you will need to install git-lfs on your system.
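If you would rather perform this check from Python, a minimal sketch follows. It assumes the version string has the `git-lfs/X.Y.Z ...` shape shown above; `parse_lfs_version` and `installed_lfs_version` are hypothetical helper names introduced for this example.

```python
import re
import shutil
import subprocess


def parse_lfs_version(output: str):
    """Extract (major, minor, patch) from `git lfs version` output, or None."""
    match = re.match(r"git-lfs/(\d+)\.(\d+)\.(\d+)", output.strip())
    return tuple(int(part) for part in match.groups()) if match else None


def installed_lfs_version():
    """Return the installed git-lfs version tuple, or None if it is unavailable."""
    if shutil.which("git") is None:
        return None  # git itself is not on the PATH
    try:
        result = subprocess.run(
            ["git", "lfs", "version"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    if result.returncode != 0:
        return None  # e.g. git does not recognise the lfs subcommand
    return parse_lfs_version(result.stdout)


version = installed_lfs_version()
print("git-lfs version:", version if version else "not installed")
```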

### What determines that a file is "big"?

git-lfs decides which files are "big" using a mechanism similar to `.gitignore`. A file named `.gitattributes` contains a list of patterns; any file that matches a pattern marked for lfs is treated as "big". The precise pattern matching format is discussed [here](https://git-scm.com/docs/gitattributes). Note the use of `**` to signify that all files recursively inside a directory are to be considered "big".

There is a command line tool to add files and folders to be tracked by lfs:

    git lfs track "pattern"
    
This will add the specified pattern to `.gitattributes`, along with the attributes that route matching files through lfs (`filter=lfs diff=lfs merge=lfs -text`).
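To make the resulting `.gitattributes` entries concrete, here is a small Python sketch that formats such a line and lists the lfs-tracked patterns in a file's text. It assumes the attribute set `filter=lfs diff=lfs merge=lfs -text` that `git lfs track` normally writes; `track_line` and `lfs_patterns` are hypothetical helpers, the sample patterns are made up, and the parser ignores edge cases such as patterns containing spaces.

```python
# Attributes that mark a pattern as handled by lfs.
LFS_ATTRS = "filter=lfs diff=lfs merge=lfs -text"


def track_line(pattern: str) -> str:
    """Format the .gitattributes line for one tracked pattern."""
    return f"{pattern} {LFS_ATTRS}"


def lfs_patterns(gitattributes_text: str) -> list:
    """Return the patterns in .gitattributes text that carry filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        if "filter=lfs" in attrs:
            patterns.append(pattern)
    return patterns


# Made-up example content for illustration.
sample = "\n".join([
    "# large assets go through lfs",
    track_line("*.psd"),
    track_line("models/**"),
    "*.txt text",
])

print(sample)
print(lfs_patterns(sample))
```

For an authoritative answer on a specific path, `git check-attr filter -- <path>` asks git itself which filter applies.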

### Where are my lfs files stored?

git-lfs looks in yet another hidden file, `.lfsconfig`, to discover where to store the actual "big" data. If this file is missing, git-lfs attempts to derive the lfs server URL from the URL of the repository's `origin` remote. This will typically be an lfs server maintained by the same organization that hosts your repository, for example **GitHub** or **GitLab**. Note that these services may charge your account for lfs storage and bandwidth.
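The lookup order can be sketched in Python as follows. Two assumptions are baked in: `.lfsconfig` uses git's config syntax with the server under `lfs.url`, and the fallback follows the git-lfs server discovery convention of appending `/info/lfs` to an HTTPS remote URL. The URLs and helper names are hypothetical, and `configparser` only copes with simple files, so treat this as an approximation of what git-lfs does.

```python
import configparser

# Hypothetical .lfsconfig content; the URL is a placeholder.
LFSCONFIG_TEXT = """\
[lfs]
url = https://lfs.example.com/my-project
"""


def configured_lfs_url(lfsconfig_text: str):
    """Return lfs.url from .lfsconfig text, or None if it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(lfsconfig_text)
    return parser.get("lfs", "url", fallback=None)


def guess_lfs_url(remote_url: str) -> str:
    """Derive a default lfs endpoint from an HTTPS git remote URL."""
    base = remote_url.rstrip("/")
    if not base.endswith(".git"):
        base += ".git"
    return base + "/info/lfs"


def resolve_lfs_url(lfsconfig_text: str, remote_url: str) -> str:
    """Prefer the explicit .lfsconfig setting, else fall back to the guess."""
    return configured_lfs_url(lfsconfig_text) or guess_lfs_url(remote_url)


remote = "https://git-server.example.com/foo/bar"
print(resolve_lfs_url(LFSCONFIG_TEXT, remote))  # explicit setting
print(resolve_lfs_url("", remote))              # no .lfsconfig: derived from the remote
```

Inside a real repository, `git lfs env` prints the endpoint that git-lfs has actually resolved.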

[Cloudburst](https://cloudburst.host) provides free lfs hosting for public (open-source) projects. Contact us at beta@cloudburst.host if you are interested in pay-as-you-go private lfs storage.

Try our **git-lfs** tool to help you migrate your project to use lfs: [lfs tool](./lfs_tool.ipynb).