# <center>Using Colaboratory With Google Drive</center>


Google Colaboratory is a great research tool, but small research projects often need to read data from, and write data to, the owner's Google Drive.  Larger projects will use other types of Google Disk Storage, but smaller projects work very well with Google Drive.  An owner's Google Drive is not "automatically mounted" on a Colaboratory instance; therefore, this note shows how to move files to and from a Google Drive.

A bit of background helps in understanding how Colaboratory works.  When a user starts a Colaboratory notebook, it creates a Docker instance that is a complete Linux environment.  This environment has the standard Linux development tools; however, it has no connection to the owner's Google Drive.



The reference for PyDrive is [PyDrive](http://pythonhosted.org/PyDrive/index.html)

The reference for the Google Drive API (REST v2) is [GoogleDrive API](https://developers.google.com/drive/v2/reference/parents)

The reference for the Google Cloud Storage Fuse is [gcsfuse](https://cloud.google.com/storage/docs/gcs-fuse)

The reference for using Google Drive with Colaboratory is [GoogleDrive Fuse Implementation Reference](https://colab.research.google.com/notebook#fileId=1srw_HFWQ2SMgmWIawucXfusGzrj1_U0q)

##  Basic Linux Environment

This section demonstrates the basic Linux environment under the Colaboratory hood.  *(Note:  You can add a GPU as part of the startup, but it isn't necessary for this project.)*

In [1]:
import os, platform, subprocess
print("Home Directory: {:s}".format(os.environ["HOME"]))
print("Current Working Directory: {:s}".format(os.getcwd()))
print("Operating System Name: {:s}".format(platform.system()))
print("OS Release: {:s}".format(platform.release()))
print("Processor Type: {:s}".format(platform.processor()))
mem_bytes = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')
mem_gib = mem_bytes/(1024**3)
print("Amount of memory (GB) {:.2f}".format(mem_gib))
print("\nFile System Information")
p1 = subprocess.Popen(['df','-h'],stdout=subprocess.PIPE)
output=p1.communicate()
out_lines = output[0].decode("utf-8").split("\n")
for ol in out_lines:
  print(ol)
print("Current Directory Size")
p1 = subprocess.Popen(['df',os.getcwd()],stdout=subprocess.PIPE)
output=p1.communicate()
out_lines = output[0].decode("utf-8").split("\n")
for ol in out_lines:
  print(ol)
fileList = os.listdir("/")
print("Now we see the typical Linux root directory")
for file in fileList:
  print("FILE: {:s}".format(file))

Home Directory: /root
Current Working Directory: /content
Operating System Name: Linux
OS Release: 4.14.33+
Processor Type: x86_64
Amount of memory (GB) 12.72

File System Information
Filesystem      Size  Used Avail Use% Mounted on
overlay          40G  5.6G   32G  15% /
tmpfs           6.4G     0  6.4G   0% /dev
tmpfs           6.4G     0  6.4G   0% /sys/fs/cgroup
tmpfs           6.4G     0  6.4G   0% /opt/bin
/dev/sda1        46G  6.4G   39G  15% /etc/hosts
shm              64M     0   64M   0% /dev/shm
tmpfs           6.4G     0  6.4G   0% /sys/firmware

Current Directory Size
Filesystem     1K-blocks    Used Available Use% Mounted on
overlay         41022688 5790256  33118896  15% /

Now we see the typical Linux root directory
FILE: media
FILE: usr
FILE: sys
FILE: run
FILE: lib64
FILE: mnt
FILE: tmp
FILE: dev
FILE: proc
FILE: boot
FILE: opt
FILE: bin
FILE: root
FILE: lib
FILE: etc
FILE: sbin
FILE: home
FILE: srv
FILE: var
FILE: .dockerenv
FILE: tf_deps
FILE: tensorflow-1.10.0-cp27

As you can see from this output, this is a standard Linux environment with a fairly good amount of resources.  Now we want to access our files using PyDrive, which reaches the Google Drive through its API rather than mounting it into our existing file system.

##  Google Drive Storage Using PyDrive

###  Default Code That Eventually Becomes a Library

This code currently resides at [ColabGDrive](https://github.com/drdavidrace/ColabGdrive.git)

It is not fully developed, but is okay for personal use.

This is probably the preferred method for the future since we don't have to do a FUSE mount, but it does require additional development tools since Google's cloud storage behaves a little differently from a standard Linux file system.  Many of the differences are handled in the ColabGDrive library I am developing, but these may be a while in development.

The following code block loads the basic libraries.


In [1]:
%%bash
cur_version=$(pip --version)
echo $cur_version

check_gdrive=`pip list | grep colab-gdrive`
if [[ ! -z "$check_gdrive" ]]; then
  pip uninstall --yes colab_gdrive
fi
pip install -U -q git+https://github.com/drdavidrace/colab_gdrive.git
pip list | grep -i colab
pip install -U -q PyDrive
pip list | grep -i pydrive


pip 18.0 from /usr/local/lib/python3.6/dist-packages/pip (python 3.6)
Uninstalling colab-gdrive-0.0.1a0:
  Successfully uninstalled colab-gdrive-0.0.1a0
colab-gdrive             0.0.1a0  
google-colab             0.0.1a1  
PyDrive                  1.3.1    


Set up a ColabGDrive so we can test/run code.

In [2]:
from pprint import pprint
import logging
#
#  Set up a ColabGDrive
from colab_gdrive import colab_gdrive

myGdrive = colab_gdrive.ColabGDrive(logging_level=logging.ERROR)
pprint(myGdrive.is_connected())
pprint(myGdrive.getcwd())

True
'root'


###  Test General File Listings

In [4]:
import os
from pprint import pprint


#
#  Check testing information
#
print("=======Directory Information=======")
pprint(myGdrive.ls('TempTest'))
print("=====1 Directory Content Information 1=====")
pprint(myGdrive.ls('TempTest/*'))
print("=====2 Directory Content Information 2=====")
pprint(myGdrive.ls('TempTest/.'))
print("=====3 Directory Content Information 3=====")
pprint(myGdrive.ls('TempTest/..'))
print("=====4 Directory Content Information 4=====")
pprint(myGdrive.ls('.'))
print("=====5 Directory Content Information 5=====")
pprint(myGdrive.ls('*'))
pprint(myGdrive.ls(myGdrive.getcwd()))


{'file_result': [{'id': '1kD_q3vGcPMrZQIFn40nsLCzSLeaAOdxL',
                  'mimeType': 'application/vnd.google-apps.folder',
                  'title': 'TempTest'}],
 'full_name': 'root/TempTest'}
=====1 Directory Content Information 1=====
{'file_result': [{'id': '1lOA6-Rzz9NCWg9V7BcZ4228K7Jo8L0Rd',
                  'mimeType': 'application/vnd.google-apps.folder',
                  'title': 'Test2'},
                 {'fileSize': '418',
                  'id': '1nIffURLCAgERzPvHMvaQlpkGNEtMsROt',
                  'mimeType': 'text/plain',
                  'title': 'colors.txt'},
                 {'id': '14M8BRFtzYnNwFGsAVFWCZ_xUKlK44xodn8jWECmsDxE',
                  'mimeType': 'application/vnd.google-apps.document',
                  'title': 'colors'},
                 {'fileSize': '12106397',
                  'id': '10brWcukQYAXcivUEFgm2s3njngRadoo7',
                  'mimeType': 'text/csv',
                  'title': 'train2.csv'}],
 'full_name': 'root/TempTest'}
=====2
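
The listings above are dictionaries with a `file_result` list and a `full_name` string.  As a minimal sketch (the helper name `summarize_listing` is hypothetical and not part of ColabGDrive), the following turns such a dictionary into compact rows.  It assumes only the keys visible in the output above; in that output, folders and native Google documents have no `fileSize` entry.

```python
FOLDER_MIME = "application/vnd.google-apps.folder"
GOOGLE_NATIVE_PREFIX = "application/vnd.google-apps."


def summarize_listing(listing):
    """Return (title, kind, size) tuples for an ls()-style result dictionary."""
    rows = []
    for entry in listing.get("file_result", []):
        mime = entry.get("mimeType", "")
        if mime == FOLDER_MIME:
            kind = "folder"
        elif mime.startswith(GOOGLE_NATIVE_PREFIX):
            kind = "google-doc"
        else:
            kind = "file"
        # Sizes arrive as strings; entries without 'fileSize' get None.
        size = int(entry["fileSize"]) if "fileSize" in entry else None
        rows.append((entry["title"], kind, size))
    return rows


# Sample data copied from the listing output shown above.
sample = {
    "file_result": [
        {"id": "1lOA6-Rzz9NCWg9V7BcZ4228K7Jo8L0Rd",
         "mimeType": "application/vnd.google-apps.folder",
         "title": "Test2"},
        {"fileSize": "418",
         "id": "1nIffURLCAgERzPvHMvaQlpkGNEtMsROt",
         "mimeType": "text/plain",
         "title": "colors.txt"},
        {"id": "14M8BRFtzYnNwFGsAVFWCZ_xUKlK44xodn8jWECmsDxE",
         "mimeType": "application/vnd.google-apps.document",
         "title": "colors"},
    ],
    "full_name": "root/TempTest",
}

for row in summarize_listing(sample):
    print(row)
```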

###  Test Directory Management

In [5]:
#This should have much different results from the previous cell
myGdrive.set_log_level(logging.ERROR)
pprint(myGdrive.chdir('TempTest'))
pprint(myGdrive.getcwd())
#
#  Check testing information
#
print("=======Directory Information=======")
pprint(myGdrive.ls('TempTest'))
print("=====1 Directory Content Information 1=====")
pprint(myGdrive.ls('TempTest/*'))
print("=====2 Directory Content Information 2=====")
pprint(myGdrive.ls('TempTest/.'))
print("=====3 Directory Content Information 3=====")
#Notice that the .. will move it backwards and throw away the added TempTest
pprint(myGdrive.ls('TempTest/..'))
print("=====4 Directory Content Information 4=====")
pprint(myGdrive.ls('.'))
print("=====5 Directory Content Information 5=====")
pprint(myGdrive.ls('*'))
#  Move the working directory back one layer
pprint(myGdrive.chdir('..'))
print("=====6 Directory Content Information -  Should be root6=====")
pprint(myGdrive.ls('*'))

'root/TempTest'
GoogleDriveFile({'id': '1kD_q3vGcPMrZQIFn40nsLCzSLeaAOdxL'})
'root/TempTest'
'root/TempTest'
{'file_result': [], 'full_name': 'root/TempTest'}
=====1 Directory Content Information 1=====
{'file_result': [], 'full_name': 'root/TempTest'}
=====2 Directory Content Information 2=====
{'file_result': [], 'full_name': 'root/TempTest'}
=====3 Directory Content Information 3=====
{'file_result': [{'id': '1kD_q3vGcPMrZQIFn40nsLCzSLeaAOdxL',
                  'mimeType': 'application/vnd.google-apps.folder',
                  'title': 'TempTest'}],
 'full_name': 'root/TempTest'}
=====4 Directory Content Information 4=====
{'file_result': [{'id': '1kD_q3vGcPMrZQIFn40nsLCzSLeaAOdxL',
                  'mimeType': 'application/vnd.google-apps.folder',
                  'title': 'TempTest'}],
 'full_name': 'root/TempTest'}
=====5 Directory Content Information 5=====
{'file_result': [{'id': '1lOA6-Rzz9NCWg9V7BcZ4228K7Jo8L0Rd',
                  'mimeType': 'application/vnd.google-apps

###  Download from Google Drive to an Instance File System

For this part of the examples, we have a file that is logically "root/BigDataTraining/UsingGoogleColaboratoryShortcuts/train.csv".  We will work with this file and perform various operations moving it back and forth between the Google Drive and the instance environment.

This example copies one of the files from Google Drive to the instance's /content directory for processing.  As a rule, we need to go through the following steps:

*  Locate the file id of the file we want to download
  *  Since we don't usually store file id's, we generically want to find the file id based upon the parents
  *  NOTE:  The use of parents vs directories is necessary to support the efficiency of the Google Drive
  *  We use the parents like directory names in a local file system
*  Create a tie between PyDrive and the Google Drive with the file id
*  Download the file using the colab_gdrive copy_from function
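
The first step, finding a file id by walking the parents, can be sketched as follows.  This is an illustration rather than the ColabGDrive implementation: `find_file_id` and `FakeDrive` are hypothetical names, and the fake object exists only so the sketch runs without credentials.  With a real PyDrive `GoogleDrive` object the same `ListFile({'q': ...}).GetList()` call is used, and the file can then be fetched with `drive.CreateFile({'id': file_id}).GetContentFile('train.csv')`.

```python
import re


def find_file_id(drive, path):
    """Resolve a 'root/a/b.csv' style path to a file id, one parent at a time."""
    parts = [p for p in path.split("/") if p]
    if parts and parts[0] == "root":
        parts = parts[1:]
    parent_id = "root"  # Drive API v2 alias for the top-level folder
    for name in parts:
        # Escape backslashes and single quotes for the query string.
        safe = name.replace("\\", "\\\\").replace("'", "\\'")
        query = "'{}' in parents and title = '{}' and trashed = false".format(
            parent_id, safe)
        matches = drive.ListFile({"q": query}).GetList()
        if not matches:
            return None  # some component of the path does not exist
        parent_id = matches[0]["id"]  # titles are not unique; take the first
    return parent_id


class _FakeList:
    def __init__(self, items):
        self._items = items

    def GetList(self):
        return self._items


class FakeDrive:
    """In-memory stand-in that understands only the query built above."""

    def __init__(self, files):
        self._files = files  # dicts with 'id', 'title', 'parent'

    def ListFile(self, params):
        m = re.match(r"'(.+?)' in parents and title = '(.+?)' and trashed = false",
                     params["q"])
        parent, title = m.group(1), m.group(2)
        return _FakeList([f for f in self._files
                          if f["parent"] == parent and f["title"] == title])


demo = FakeDrive([
    {"id": "A", "title": "BigDataTraining", "parent": "root"},
    {"id": "B", "title": "train.csv", "parent": "A"},
])
print(find_file_id(demo, "root/BigDataTraining/train.csv"))
```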

The current copy_from function downloads to the current working directory, so we will first download to /content (the default directory for Colaboratory), then change the current working directory to /tmp for a second download.
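
Because the download target is whatever the local working directory happens to be, one option is to scope the directory change with a context manager so the notebook returns to its starting directory afterwards.  The sketch below uses only the standard library; `working_directory` is a hypothetical helper, and the place where a call such as `myGdrive.copy_from(gdrive_file)` would go is marked with a comment.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch the process working directory to `path`."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield path
    finally:
        # Always restore the original directory, even if the body raises.
        os.chdir(previous)


start = os.getcwd()
with tempfile.TemporaryDirectory() as scratch:
    with working_directory(scratch):
        # A download that writes to the current directory would land here,
        # e.g. myGdrive.copy_from(gdrive_file)
        inside = os.getcwd()
after = os.getcwd()
print(start == after)
```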

In [3]:
import os, platform, subprocess
print("Home Directory: {:s}".format(os.environ["HOME"]))
pprint("Current Working Directory: {:s}".format(os.getcwd()))
gdrive_file = 'root/BigDataTraining/UsingGoogleColaboratoryShortcuts/train.csv'
current_metadata = myGdrive.get_file_metadata(gdrive_file)
pprint(current_metadata)
test_file = myGdrive.isfile(gdrive_file)
print(test_file)
#
myGdrive.copy_from(gdrive_file)
fileList = os.listdir(os.getcwd())
print("Now we see the typical Linux root directory:")
for file in fileList:
  print("FILE: {:s}".format(file))
print("Check the size:")
current_size = os.path.getsize('train.csv')
print(current_size)
current_size = myGdrive.getsize(gdrive_file)
print(current_size)
#
#  Change the local chdir
#
cur_work_dir = os.getcwd()
os.chdir("/tmp")
myGdrive.copy_from(gdrive_file)
fileList = os.listdir(os.getcwd())
print("Now we see the typical Linux cwd directory:")
for file in fileList:
  print("FILE: {:s}".format(file))
print("Check the size:")
current_size = os.path.getsize('train.csv')
print(current_size)
current_size = myGdrive.getsize(gdrive_file)
print(current_size)


Home Directory: /root
'Current Working Directory: /content'
GoogleDriveFile({'id': '1CS9nhWGoj4ZXljnXupQ6qVHJhuq8z0XY'})
True
Now we see the typical Linux root directory:
FILE: .config
FILE: sample_data
FILE: train.csv
FILE: adc.json
Check the size:
97391029
97391029
Now we see the typical Linux cwd directory:
FILE: train.csv
FILE: test1.txt
Check the size:
97391029
97391029


### Upload from an Instance to Google Drive (Other Directory)

In this next exercise, we will:

*  Create a file with minimal content
*  Upload it to the directory root/BigDataTraining
*  Validate that the file is "correct"


In [4]:
import os, platform, subprocess
pprint("Current Working Directory: {:s}".format(os.getcwd()))
!echo 'This is a line for testing' > test1.txt
!cat test1.txt
print("Now we see the typical Linux cwd directory:")
for file in os.listdir(os.getcwd()):
  print("FILE: {:s}".format(file))
#
#  Set the upload information
#
local_file = 'test1.txt'
myGdrive.chdir('')
myGdrive.ls('test1.txt')


'Current Working Directory: /tmp'
This is a line for testing
Now we see the typical Linux cwd directory:
FILE: train.csv
FILE: test1.txt


{'file_result': [{'fileSize': '27',
   'id': '1LKC34Eo_9P0R5PM_Vy0pjvhG-Megs8OL',
   'mimeType': 'text/plain',
   'title': 'test1.txt'}],
 'full_name': 'root/test1.txt'}

The command for inserting a file into a directory may not be obvious: the 'parents' option used when creating the file must be an array.  Since Google Drive allows a file to have more than one parent folder, PyDrive assumes you are passing an array.

If you don't pass it an array, it puts it into the root directory.

(NOTE:  This operation is instantiated in the uploadFile command in ColabGDrive as demonstrated below.)
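
Independently of ColabGDrive, the 'parents' rule can be sketched with plain metadata dictionaries.  The helper name `build_upload_metadata` and the folder id used here are hypothetical; the PyDrive calls in the trailing comments (`CreateFile`, `SetContentFile`, `Upload`) need an authenticated `GoogleDrive` object, so they are left as comments.

```python
def build_upload_metadata(title, parent_id=None):
    """Build the metadata dictionary handed to PyDrive's CreateFile.

    'parents' must be a list of {'id': ...} dictionaries, even when there
    is only one parent.  Leaving the key out places the file in the root.
    """
    metadata = {"title": title}
    if parent_id is not None:
        metadata["parents"] = [{"id": parent_id}]
    return metadata


# 'FOLDER_ID_PLACEHOLDER' stands in for the real id of root/BigDataTraining.
meta = build_upload_metadata("test1.txt", parent_id="FOLDER_ID_PLACEHOLDER")
print(meta)

# With an authenticated PyDrive object named `drive` (an assumption here):
#   upload = drive.CreateFile(meta)
#   upload.SetContentFile("test1.txt")
#   upload.Upload()
```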

####  Google Docs Example

The following example shows the metadata for a Google docs file.

In [0]:
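# NOTE: `drive` and `findFileID` are not defined in the cells shown above.
# This cell assumes an authenticated PyDrive GoogleDrive object and a helper
# that maps a root-based path to a file id.
docID = findFileID('root/BigDataTraining/docexample.docx')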
docReader = drive.CreateFile({'id':docID})
docReader.FetchMetadata()
for k in docReader.keys():
  print("Key: {:s}".format(k))
  print(docReader[k])

Key: id
1kheJEAIyfLRFk-O8peQsEalADLpJ-AYYbpjULtIgW_k
Key: kind
drive#file
Key: etag
"hcahzZRGAO5dFBAGGfvDlnfbXEY/MTUyMTc2NTM0MTY2OA"
Key: selfLink
https://www.googleapis.com/drive/v2/files/1kheJEAIyfLRFk-O8peQsEalADLpJ-AYYbpjULtIgW_k
Key: webContentLink
https://drive.google.com/uc?id=1kheJEAIyfLRFk-O8peQsEalADLpJ-AYYbpjULtIgW_k&export=download
Key: alternateLink
https://docs.google.com/document/d/1kheJEAIyfLRFk-O8peQsEalADLpJ-AYYbpjULtIgW_k/edit?usp=drivesdk
Key: embedLink
https://docs.google.com/document/d/1kheJEAIyfLRFk-O8peQsEalADLpJ-AYYbpjULtIgW_k/preview?ouid=100296224498372692739
Key: iconLink
https://drive-thirdparty.googleusercontent.com/16/type/application/vnd.google-apps.document
Key: thumbnailLink
https://docs.google.com/feeds/vt?gd=true&id=1kheJEAIyfLRFk-O8peQsEalADLpJ-AYYbpjULtIgW_k&v=2&s=AMedNnoAAAAAWrRv2_84rbYHxcF-9NoOG27TICZ9V9w4&sz=s220
Key: title
docexample.docx
Key: mimeType
application/vnd.google-apps.document
Key: labels
{'starred': False, 'hidden': False, 'trashed

###  Important Summary of the Work Above

So we ask ourselves the important questions:

*  What are the "logical" differences between a Linux file system and the Google Drive file system?
*  How do the differences affect our management activities?

####  Logical Differences
For a Linux file system, the parents keep track of the children.  If you go back to ancient history (aka the 1970s and 1980s), the file system was essentially synonymous with the way the files were laid out on the disks.  The "directory file" containers held pointers to children containers, and file containers had pointers to the parts of the file.  Symbolic links allowed multiple directories to point to the same file (either directory or file), but everything started at the top and worked its way down.  This had four primary advantages:

*  It was "natural" for people to understand because it essentially mimicked a room full of file cabinets
*  It was "easy" to use names (rather than pointers) so people could find files
*  It was relatively easy to embed this concept on a tape and on a single disk
*  It was easy to continue the original "unix" file system, which defined a file as a linear stream of bytes

On the other hand, starting in the 1980s it had some serious disadvantages:

*  The tape/disks became the slow part of the system, so using individual disks for files was a problem (even after caching was used)
*  There were a lot of different ways that multiple disks could be used for file systems, for instance:
  *  The Thinking Machines file system automatically spread data across all of the drives and mapped physical drives to physical processors
  *  The Cray file system "striped" blocks across the drives and relied on the individual applications to "fix" the data
*  The HDFS file layout wanted to consider data as non-linear, but regular multi-dimensional, so there were some complex algorithms to get data to the right processor.  This file system allows one of the dimensions to be time, so it covers a large number of physical data models.

Google's approach may not have been "explicitly" defined using the following concepts, but the following appear to be consistent with their model.  In general, the Google model fits the following:

*  Users might want worldwide access to the data
*  Users should be concerned about data access and data access time rather than a physical layout of the data
*  There are so many different applications that Google can't provide the mapping from physical layout to data, so each application has to map from the linear stream of bytes to its own needs.
*  Users need a way to understand their data layout
*  Users need to share documents and/or folders

To meet these requirements, Google appears to have made the following decisions:

*  They supported the concept of folders rather than directories
*  They made each file in their system a stand-alone entity
*  They provide the ability to define the bandwidth requirements for some of their stores and they meet those requirements


1.  The implementation is rather elegant: instead of directories containing information about the files/directories under them, the files/folders track the information on their parents.  This makes sense from Google's perspective since they are so good at indexing.
2.  They track everything by a fileID, and the file carries its 'title' with it.  There is a great deal of information in the metadata for a file, so they use the metadata to keep all of the important information (see the example above).
3.  The metadata contains the mimeType so apps that support that mimeType know how to parse the data.  This is important: the data is considered a Unicode byte stream, and the mimeType tells other apps what they can expect.
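
To make the child-tracks-parent idea concrete, here is a toy in-memory model.  It is only an illustration of the concept, not how Google stores anything: the ids, the `files` table, and the helper names are all hypothetical.  Each entry knows its own title, mimeType, and parents; listing a folder becomes a search over the parents field, and a file with two parents has two valid paths.

```python
FOLDER = "application/vnd.google-apps.folder"

# Every entry is stand-alone and keyed by id; it records its parents,
# and no folder holds a list of its children.
files = {
    "f1": {"title": "TempTest", "mimeType": FOLDER, "parents": ["root"]},
    "f2": {"title": "colors.txt", "mimeType": "text/plain", "parents": ["f1"]},
    "f3": {"title": "shared.csv", "mimeType": "text/csv",
           "parents": ["f1", "root"]},
}


def children_of(files, parent_id):
    """'Listing a folder' is an index lookup on the parents field."""
    return sorted(fid for fid, meta in files.items()
                  if parent_id in meta["parents"])


def paths_of(files, file_id):
    """Every root-based path leading to file_id, one per chain of parents."""
    if file_id == "root":
        return ["root"]
    meta = files[file_id]
    paths = []
    for parent in meta["parents"]:
        for prefix in paths_of(files, parent):
            paths.append(prefix + "/" + meta["title"])
    return sorted(paths)


print(children_of(files, "root"))
print(paths_of(files, "f3"))
```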

There are considerable differences, but it is reasonable to build a small set of tools that wrap around PyDrive to provide a human-readable interface to a Google Drive.  I have instantiated the small set of tools in my Bitbucket account so I can use CodeAnywhere for editing.

NOTE:  The Cloud Storage capabilities (object-oriented buckets) will be discussed later in this notebook.

All that said, probably the biggest differences are:

*  There isn't a concept of the current working directory in Google Drive when working with a notebook.  If desired, this would have to be built with a stack that maintains the current working directory.  (This is coming in my ColabGDrive library.)
*  You can't "open" a file by giving it a path; you have to find the fileID first on your own (the first tool in the ColabGDrive library)
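
A stack-based working directory of the kind described in the first bullet can be sketched in a few lines.  This is not the ColabGDrive code: `DrivePathStack` is a hypothetical name, and the sketch only tracks names.  It does not check that the folders exist on the Drive, which a real implementation would do by resolving each name to a file id.

```python
class DrivePathStack:
    """Keep a 'current working directory' as a stack of folder names."""

    def __init__(self):
        self._parts = ["root"]

    def getcwd(self):
        return "/".join(self._parts)

    def resolve(self, path):
        """Turn a relative or root-based path into a full root-based path."""
        pieces = path.split("/")
        if pieces and pieces[0] == "root":
            parts = ["root"]          # root-based path: start from the top
            pieces = pieces[1:]
        else:
            parts = list(self._parts)  # relative path: start from the cwd
        for piece in pieces:
            if piece in ("", "."):
                continue
            if piece == "..":
                if len(parts) > 1:     # never pop above root
                    parts.pop()
            else:
                parts.append(piece)
        return "/".join(parts)

    def chdir(self, path):
        self._parts = self.resolve(path).split("/")
        return self.getcwd()


cwd = DrivePathStack()
print(cwd.getcwd())
print(cwd.chdir("TempTest"))
print(cwd.resolve("Test2/.."))
print(cwd.chdir(".."))
```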

Overall, I have been able to work with path names (as defined by root-oriented strings) fairly well.  ColabGDrive will eventually have the concept of a current working directory, so it will be relatively easy to map between our visual concept of folders and the GDrive environment.  No problems so far (22 Mar 2018).