Transferring data
Getting access to relatively large datasets in Colab is surprisingly difficult.
The following notebook illustrates ways to access data, including mounting Google Drive and copying files in and out of Google Cloud Storage.
https://colab.research.google.com/notebooks/io.ipynb
Mounting Google Drive:
from google.colab import drive
drive.mount('/content/gdrive')
An example showing a FUSE mount of Google Cloud Storage isn't present in that notebook, but it can be done using gcsfuse:
from google.colab import auth
auth.authenticate_user()
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse
!mkdir -p gcs
!gcsfuse --implicit-dirs ngos-gsoc-axiom ./gcs
However, trying to ls a large directory on a gcs mount seems to hang forever, while an ls on the same directory in a Drive mount completes almost instantly.
Also, gcsfuse is just a wrapper around the Google Cloud Storage API, which means that it generates API calls that are not free.
AFAIK there are no costs to access data in Drive, other than paying for the storage itself.
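To put numbers on the ls comparison above, a small timing helper can be run in a Colab cell against both mounts. This is only a sketch: time_listing and max_entries are names made up here, and the cap limits how many entries are read but cannot stop the first directory read from blocking on a slow FUSE mount.

```python
import os
import time
from itertools import islice


def time_listing(path, max_entries=1000):
    """Return (entries_seen, seconds) for reading up to max_entries names in path."""
    start = time.perf_counter()
    with os.scandir(path) as entries:
        # islice caps the work on huge directories; the first read of a
        # FUSE mount may still block before any entry is returned.
        seen = sum(1 for _ in islice(entries, max_entries))
    return seen, time.perf_counter() - start


if __name__ == "__main__":
    # Mount points used earlier on this page; skipped if they are not mounted.
    for mount in ("/content/gdrive", "./gcs"):
        if os.path.isdir(mount):
            count, seconds = time_listing(mount)
            print(f"{mount}: {count} entries in {seconds:.2f}s")
```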
In this Kaggle thread users discuss data access, and note that accessing data in a mounted Drive is very slow. One user states that they moved to Google Cloud Storage (presumably downloading each file to the Colab instance as needed?) and sped up their workflow. Some experimentation is needed here.
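One way to try the "download each file as needed" idea is a tiny on-disk cache in front of gsutil cp. The sketch below is a guess at what such a workflow could look like, not what the Kaggle user actually did. The helper names and the /content/cache directory are hypothetical, and it assumes gsutil is installed and authenticated.

```python
import os
import subprocess


def local_path_for(gcs_uri, cache_dir):
    """Map gs://bucket/a/b.txt to <cache_dir>/bucket/a/b.txt."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {gcs_uri}")
    return os.path.join(cache_dir, *gcs_uri[len("gs://"):].split("/"))


def gsutil_copy(gcs_uri, dest):
    """Copy one object with gsutil (assumed to be installed and authenticated)."""
    subprocess.run(["gsutil", "cp", gcs_uri, dest], check=True)


def fetch(gcs_uri, cache_dir="/content/cache", copy=gsutil_copy):
    """Return a local path for gcs_uri, downloading only on a cache miss."""
    dest = local_path_for(gcs_uri, cache_dir)
    if not os.path.exists(dest):
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        copy(gcs_uri, dest)  # hypothetical cache dir; adjust to available disk
    return dest
```

Repeated reads of the same object then hit local disk instead of generating another API call. The copy parameter exists only so the download step can be swapped out, for example when testing without network access.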
Uploading local data to Google Cloud Storage is fairly simple and well supported by the gsutil tool, which is widely available on Google Cloud VMs, Colab, and the Cloud SDK Docker image.
Once gsutil is configured and a GCS bucket created, it can be run much like rsync:
gsutil rsync -r /dir_to_upload gs://gcs-bucket/
The -m option can also be specified to run parallel uploads, but it can quite easily saturate network bandwidth and degrade performance of other applications on the network.
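If the upload is scripted, it may help to build the command in one place so that -m is an explicit opt-in. The following is a minimal sketch; gsutil_rsync_cmd and its parameters are names invented for illustration. It relies on -m being a top-level gsutil option that goes before the rsync subcommand, and on rsync's -n flag for a dry run.

```python
import subprocess


def gsutil_rsync_cmd(src, dest, parallel=False, dry_run=False):
    """Build a gsutil rsync command as an argument list."""
    cmd = ["gsutil"]
    if parallel:
        cmd.append("-m")  # top-level option: goes before the subcommand
    cmd += ["rsync", "-r"]
    if dry_run:
        cmd.append("-n")  # report what would be copied without copying
    cmd += [src, dest]
    return cmd


def run_rsync(src, dest, **options):
    """Run the command; raises CalledProcessError if gsutil exits non-zero."""
    subprocess.run(gsutil_rsync_cmd(src, dest, **options), check=True)
```

For example, run_rsync("/dir_to_upload", "gs://gcs-bucket/", dry_run=True) previews the transfer, and passing parallel=True can be kept for times when saturating the network is acceptable.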
Uploading to Google Drive is much more painful.
IMPORTANT NOTE: In Drive, any file owned by a user counts toward that user's space quota. That means that the user who creates/uploads files must have purchased enough space to hold them. It's not sufficient to upload to a shared folder owned by a user with a storage subscription; you must upload files as that user.
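Because the uploading user's quota has to hold everything, it can be worth totalling the local data before starting. This is a rough sketch; tree_size_bytes and fits_in_quota are hypothetical helpers, and free_bytes is a number you would read off the uploading account's storage page yourself.

```python
import os


def tree_size_bytes(root):
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_in_quota(root, free_bytes):
    """True if everything under root fits in free_bytes (supplied by the caller)."""
    return tree_size_bytes(root) <= free_bytes
```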
Google doesn't seem to have an officially maintained command line tool. I had some initial success with skicka, but once a large amount of data had already been transferred into Drive the sync initialization would just hang (YMMV). skicka is useful for viewing the total size of a folder in Drive using the du command.
rclone seemed to work best for uploading to Drive. After installing, run rclone config to set up a config for Drive. You'll need to create a set of Google Drive OAuth credentials (see https://rclone.org/drive/). Make sure you do this as the user with a sufficient storage subscription. Once your Drive remote is configured it should look something like:
$ rclone config show
[gdrive]
type = drive
client_id = <id>
client_secret = <secret>
scope = drive
token = {"access_token":"<token>","token_type":"Bearer","refresh_token":"<token>","expiry":"2019-06-06T14:32:40.671138592-07:00"}
Then you can explore your drive, etc:
$ rclone lsd gdrive:/
-1 2019-06-03 15:22:29 -1 ngos
And upload/sync data:
rclone sync -v . gdrive:ngos/
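Note that rclone sync makes the destination match the source, which includes deleting destination files that are missing locally, so a dry run first is a cheap safeguard. A minimal sketch follows; rclone_sync_cmd is a name made up here, and it uses rclone's --dry-run flag.

```python
import subprocess


def rclone_sync_cmd(src, dest, dry_run=True):
    """Build an rclone sync command; dry_run defaults to True on purpose."""
    cmd = ["rclone", "sync", "-v"]
    if dry_run:
        cmd.append("--dry-run")  # show planned copies and deletes, change nothing
    cmd += [src, dest]
    return cmd


if __name__ == "__main__":
    # Preview only; rerun with dry_run=False once the plan looks right.
    print(" ".join(rclone_sync_cmd(".", "gdrive:ngos/")))
    # subprocess.run(rclone_sync_cmd(".", "gdrive:ngos/", dry_run=False), check=True)
```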
I didn't have success transferring data from Google Cloud Storage to Drive in the cloud, which I expected to be much faster than uploading the local data to Drive. Every approach I tried was either abysmally slow, never initiated transfers, or failed for the reason noted next to it:
- colab: mounting Drive in Colab and transferring to it using gsutil (fails because of disk buffering/out of space)
- cloud vm: mounting Drive using google-drive-ocamlfuse and transferring data from GCS using gsutil
- cloud vm: mounting GCS using gcsfuse and transferring data to Drive using skicka
- cloud vm: rsyncing between FUSE mounts (google-drive-ocamlfuse and gcsfuse)
- cloud vm: using rclone to transfer from a gcsfuse mount to Drive