# <center>Using Colaboratory With Google Drive</center>


Google Colaboratory is a great research tool, but small research projects often need to read data from, and write data to, the owner's Google Drive.  Larger projects will use other types of Google Disk Storage, but smaller projects work very well with Google Drive.  An owner's Google Drive is not "automatically mounted" on a Colaboratory project instance; therefore, this note shows how to move files to and from a Google Drive.

For clarity, a bit of background helps in understanding how Colaboratory works.  Namely, when a user starts a Colaboratory file, it creates a Docker instance that is a complete Linux environment.  This Linux environment has the standard Linux development tools; however, it has no connection to the owner's Google Drive.



The reference for PyDrive is [PyDrive](http://pythonhosted.org/PyDrive/index.html)

The reference for the GoogleDrive API (Rest v2) is [GoogleDrive API](https://developers.google.com/drive/v2/reference/parents)

The reference for the Google Cloud Storage Fuse is [gcsfuse](https://cloud.google.com/storage/docs/gcs-fuse)

The reference for using Google Drive with Colaboratory is [GoogleDrive Fuse Implementation Reference](https://colab.research.google.com/notebook#fileId=1srw_HFWQ2SMgmWIawucXfusGzrj1_U0q)

##  Basic Linux Environment

This paragraph demonstrates the basic Linux environment under the Colaboratory hood.  *(Note:  You can add a GPU as part of the startup, but it isn't necessary for this project)*.

In [1]:
import os, platform, subprocess
print("Home Directory: {:s}".format(os.environ["HOME"]))
print("Current Working Directory: {:s}".format(os.getcwd()))
print("Operating System Name: {:s}".format(platform.system()))
print("OS Release: {:s}".format(platform.release()))
print("Processor Type: {:s}".format(platform.processor()))
mem_bytes = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')
mem_gib = mem_bytes/(1024**3)
print("Amount of memory (GB) {:.2f}".format(mem_gib))
print("\nFile System Information")
p1 = subprocess.Popen(['df','-h'],stdout=subprocess.PIPE)
output=p1.communicate()
out_lines = output[0].decode("utf-8").split("\n")
for ol in out_lines:
  print(ol)
print("Current Directory Size")
p1 = subprocess.Popen(['df',os.getcwd()],stdout=subprocess.PIPE)
output=p1.communicate()
out_lines = output[0].decode("utf-8").split("\n")
for ol in out_lines:
  print(ol)
fileList = os.listdir("/")
print("Now we see the typical Linux root directory")
for file in fileList:
  print("FILE: {:s}".format(file))

Home Directory: /content
Current Working Directory: /content
Operating System Name: Linux
OS Release: 4.14.33+
Processor Type: x86_64
Amount of memory (GB) 12.72

File System Information
Filesystem      Size  Used Avail Use% Mounted on
overlay          40G  5.1G   33G  14% /
tmpfs           6.4G     0  6.4G   0% /dev
tmpfs           6.4G     0  6.4G   0% /sys/fs/cgroup
tmpfs           6.4G     0  6.4G   0% /opt/bin
/dev/sda1        46G  5.8G   40G  13% /etc/hosts
shm              64M     0   64M   0% /dev/shm
tmpfs           6.4G     0  6.4G   0% /sys/firmware

Current Directory Size
Filesystem     1K-blocks    Used Available Use% Mounted on
overlay         41022688 5293436  33615716  14% /

Now we see the typical Linux root directory
FILE: proc
FILE: mnt
FILE: usr
FILE: etc
FILE: sys
FILE: tmp
FILE: run
FILE: dev
FILE: boot
FILE: var
FILE: media
FILE: srv
FILE: home
FILE: lib64
FILE: lib
FILE: root
FILE: sbin
FILE: opt
FILE: bin
FILE: content
FILE: .dockerenv
FILE: gpu-tensorflow-1.9.

As you can see from this output, this is a standard Linux environment with a fairly good amount of resources.  Now we want to access our files using PyDrive, which reaches the Google Drive through its API rather than mounting it into our existing file system.

##  Google Drive Storage Using PyDrive

###  Default Code That Eventually Becomes a Library

This code currently resides at [ColabGDrive](https://github.com/drdavidrace/ColabGdrive.git)

It is not fully developed, but is okay for personal use.

This is probably the preferred method for the future since we don't have to do a FUSE mount, but it does require additional development tools since Google's cloud storage behaves a little differently than a standard Linux file system.  Many of the differences are handled in the ColabGDrive library I am developing, but that work may take a while.

The following code block loads the basic libraries.


In [1]:
%%bash
cur_version=$(pip --version)
echo $cur_version
pip uninstall --yes ColabGDrive
pip install -U -q git+https://github.com/drdavidrace/ColabGdrive.git
pip list | grep -i Colab

pip 18.0 from /usr/local/lib/python3.6/dist-packages/pip (python 3.6)
Uninstalling ColabGDrive-0.0.1a0:
  Successfully uninstalled ColabGDrive-0.0.1a0
ColabGDrive              0.0.1a0  
google-colab             0.0.1a1  


###  Download from Google Drive to an Instance File System

For this part of the examples, we have a file that is logically "root/BigDataTraining/UsingGoogleColaboratoryShortcuts/train.csv".  We will work with this file and perform various operations moving it back and forth between the Google Drive and the instance environment.

This example copies one of the files from Google Drive to the instance /content directory for processing.  As a rule, we need to go through the following steps:

*  Locate the file id of the file we want to download
  *  Since we don't usually store file ids, we generally want to find the file id based upon the parents
  *  NOTE:  The use of parents vs directories is necessary to support the efficiency of the Google Drive
  *  We use the parents like directory names in a local file system
*  Create a tie between PyDrive and the Google Drive with the file id
*  Download the file using GetContentFile; we can put the data anywhere, but it goes to the /content directory by default

The current example downloads it to both the /content directory and the /tmp directory; however, any writeable instance directory can be used for the download.
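The "locate the file id based upon the parents" step can be sketched as a walk from the root, one path component at a time.  This is only a minimal illustration, not the ColabGDrive implementation: `find_file_id`, `list_children`, and the in-memory `FAKE_DRIVE` table (with made-up ids) are hypothetical names used so the sketch runs without a Drive connection.

```python
def find_file_id(path, list_children, root_id="root"):
    """Resolve a '/'-separated Drive path to a file id, or None if absent.

    list_children(parent_id) must return an iterable of dicts with
    'title' and 'id' keys (hypothetical callback supplied by the caller).
    """
    parts = [p for p in path.split("/") if p]
    if parts and parts[0] == "root":  # allow an explicit leading "root"
        parts = parts[1:]
    current_id = root_id
    for name in parts:
        # Look for a child of the current folder whose title matches
        matches = [f for f in list_children(current_id) if f["title"] == name]
        if not matches:
            return None
        # Drive permits duplicate titles; this sketch just takes the first
        current_id = matches[0]["id"]
    return current_id


# Stand-in for Google Drive so the sketch runs anywhere (ids are made up)
FAKE_DRIVE = {
    "root": [{"title": "BigDataTraining", "id": "dir-1"}],
    "dir-1": [{"title": "UsingGoogleColaboratoryShortcuts", "id": "dir-2"}],
    "dir-2": [{"title": "train.csv", "id": "file-9"}],
}


def fake_list_children(parent_id):
    return FAKE_DRIVE.get(parent_id, [])


print(find_file_id(
    "root/BigDataTraining/UsingGoogleColaboratoryShortcuts/train.csv",
    fake_list_children))
```

With an authenticated PyDrive `drive` object, the callback could plausibly be `lambda pid: drive.ListFile({'q': "'{:s}' in parents and trashed=false".format(pid)}).GetList()`, which costs one query per path component.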

In [2]:
import ColabGDrive 

from ColabGDrive import ColabGDrive 

myGdrive = ColabGDrive.ColabGDrive()
myGdrive.connect_drive()


Entering Initialization


In [0]:
import re, sys
#  First install PyDrive
!pip install -U -q PyDrive
import pydrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
import os
# ColabGDrive source: https://github.com/drdavidrace/ColabGdrive.git
# WARNING:  The commented-out variants below push the output to /dev/null, so if there is an error in the output it is silent.  Not great, but sometimes better than watching unwanted output.
!pip uninstall --yes ColabGDrive
!pip install git+https://github.com/drdavidrace/ColabGdrive.git
# !pip uninstall --yes ColabGDrive > /dev/null 2>&1
# !pip install git+https://drdavidmrace@bitbucket.org/drdavidmrace/colabgdrive.git > /dev/null 2>&1
from ColabGDrive.gDriveMgmt import findFileID, listFileDict
#
#  The test, can we read from a file copied from a Google Drive
#
import pandas as pd
import numpy as np
#Steps for downloading file
trainID = findFileID('BigDataTraining/UsingGoogleColaboratoryShortcuts/train.csv') #Get the file id to download
toDownload = drive.CreateFile({'id': '{:s}'.format(trainID)})  #Identifies the file (by id) to download
toDownload.GetContentFile('train.csv')  #Set the content file name and downloads the file
trainPD = pd.read_csv('train.csv',encoding='cp1250',dtype={11:'str',12:'str',31:'str'})

print(trainPD.columns)
print(trainPD.shape)
#
#  Now we check that it exists
#
fileList = os.listdir("/content")
print("Now we see the content directory")
for file in fileList:
  print("FILE: {:s}".format(file))
print("This shows that the download works well for this environment")
#
#  Now download to a different directory
#
if(os.path.isfile("/tmp/train.csv")):
  os.remove("/tmp/train.csv")
fileList = os.listdir("/tmp")
print("Before download we see the /tmp directory")
for file in fileList:
  print("FILE: {:s}".format(file))
toDownload.GetContentFile('/tmp/train.csv')
print("After download we see the /tmp directory")
fileList = os.listdir("/tmp")
for file in fileList:
  print("FILE: {:s}".format(file))
print("This shows that the download works well for this environment")

Uninstalling ColabGDrive-0.0.1:
  Successfully uninstalled ColabGDrive-0.0.1
Collecting git+https://drdavidmrace@bitbucket.org/drdavidmrace/colabgdrive.git
  Cloning https://drdavidmrace@bitbucket.org/drdavidmrace/colabgdrive.git to /tmp/pip-jeaipdn7-build
Installing collected packages: ColabGDrive
  Running setup.py install for ColabGDrive ... done
Successfully installed ColabGDrive-0.0.1
Index(['ticket_id', 'agency_name', 'inspector_name', 'violator_name',
       'violation_street_number', 'violation_street_name',
       'violation_zip_code', 'mailing_address_str_number',
       'mailing_address_str_name', 'city', 'state', 'zip_code',
       'non_us_str_code', 'country', 'ticket_issued_date', 'hearing_date',
       'violation_code', 'violation_description', 'disposition', 'fine_amount',
       'admin_fee', 'state_fee', 'late_fee', 'discount_amount',
       'clean_up_cost', 'judgment_amount', 'payment_amount', 'balance_due',
       'payment_date', 'payment_status', 'c

###  Upload from an Instance Drive to Google Drive (Root Directory)

Conceptually this is similar to downloading files; namely, we:

*  Identify the location for the upload
  *  The default is the 'root' directory
  *  We can do a search and find the parent id for the target folder
*  Tie the new file to the correct location
*  Upload the file's contents

In [0]:
import pandas as pd
import os
#
#  Remove the old file
#
outFileName = "/tmp/train2.csv"
if(os.path.isfile(outFileName)):
  os.remove(outFileName)
trainPD.to_csv(outFileName,columns=['ticket_id','disposition','fine_amount','compliance'])
print("train.csv")
print(os.stat("/tmp/train.csv"))
print("train2.csv")
print(os.stat(outFileName))
fileList = os.listdir("/tmp")
for file in fileList:
  print("FILE: {:s}".format(file))
#
#  store a copy in the root directory of Google Drive
#
rootTrainID = findFileID('root/train2.csv')
if(rootTrainID):
  toDelete = drive.CreateFile({'id':rootTrainID})
  toDelete.Trash()
  toDelete.Delete()
toUpload = drive.CreateFile({'title': 'train2.csv'})  #Identifies the file name once uploaded to Google Drive
toUpload.SetContentFile(outFileName)  #Identify the file to upload
toUpload.Upload()  #Physically upload
#
#  Read file information
fileInfo = listFileDict("root/train2.csv")
print(fileInfo)
#
#  Read Metadata
#
print(fileInfo['id'])
reader = drive.CreateFile({'id': "{:s}".format(fileInfo['id'])})
reader.FetchMetadata()
print(reader['title'])
print(reader['fileSize'])
print(reader['fileExtension'])
print(reader['parents'])
print(len(reader['parents']))

train.csv
os.stat_result(st_mode=33188, st_ino=1704294, st_dev=1792, st_nlink=1, st_uid=0, st_gid=0, st_size=97391029, st_atime=1521767324, st_mtime=1521767324, st_ctime=1521767324)
train2.csv
os.stat_result(st_mode=33188, st_ino=1704318, st_dev=1792, st_nlink=1, st_uid=0, st_gid=0, st_size=12106397, st_atime=1521767334, st_mtime=1521767335, st_ctime=1521767335)
FILE: train.csv
FILE: train2.csv
{'title': 'train2.csv', 'id': '10brWcukQYAXcivUEFgm2s3njngRadoo7'}
10brWcukQYAXcivUEFgm2s3njngRadoo7
train2.csv
12106397
csv
[{'kind': 'drive#parentReference', 'id': '0AGG93sRDxiIuUk9PVA', 'selfLink': 'https://www.googleapis.com/drive/v2/files/10brWcukQYAXcivUEFgm2s3njngRadoo7/parents/0AGG93sRDxiIuUk9PVA', 'parentLink': 'https://www.googleapis.com/drive/v2/files/0AGG93sRDxiIuUk9PVA', 'isRoot': True}]
1


### Upload from an Instance to Google Drive (Other Directory)

In this next exercise, we will upload the file in /tmp/train2.csv to the directory root/BigDataTraining.  This will work essentially the same, but we will first find the id for the directory and then create the file with the appropriate id.

In [0]:
#  Check that it exists
rootTrainID = findFileID('root/BigDataTraining/train2.csv')
if(rootTrainID):
  toDelete = drive.CreateFile({'id':rootTrainID})
  toDelete.Trash()
  toDelete.Delete()
#  Find the directory id, then create the file
dirID = findFileID('BigDataTraining')
print(dirID)
#print out the current contents of the directory
print("The current directory contents:")
if(dirID is not None):
  listDir = drive.ListFile({'q': "'{:s}' in parents and trashed=false".format(dirID)}).GetList()
  for f in listDir:
    print("FILE: {:s}".format(f['title']))
if(dirID is not None):
  toUpload = drive.CreateFile({'parents':[{'id':"{:s}".format(dirID)}],'title': 'train2.csv'})  #Identifies the file name once uploaded to Google Drive
  toUpload.SetContentFile(outFileName)  #Identify the file to upload
  toUpload.Upload() 
else:
  print("Directory Not Found")
#print out the current contents of the directory
print("The updated directory contents")
if(dirID is not None):
  listDir = drive.ListFile({'q': "'{:s}' in parents and trashed=false".format(dirID)}).GetList()
  for f in listDir:
    print("FILE: {:s}".format(f['title']))

1u3ybivsD_iDiy7pZdpZVyyAjIbLgivoT
The current directory contents:
FILE: docexample.docx
FILE: UsingGoogleColaboratoryShortcuts
FILE: StudyOfMichiganVoting2016.ipynb
FILE: UMichCourse
FILE: GoogleDataLabSetup.ipynb
The updated directory contents
FILE: train2.csv
FILE: docexample.docx
FILE: UsingGoogleColaboratoryShortcuts
FILE: StudyOfMichiganVoting2016.ipynb
FILE: UMichCourse
FILE: GoogleDataLabSetup.ipynb


The above command for inserting into the directory may not be obvious: the 'parents' option on the create file must be an array.  Since Google Drive allows a file to have more than one parent folder, PyDrive assumes you are passing an array.

If you don't pass it an array, it puts the file into the root directory.

(NOTE:  This operation is instantiated in the uploadFile command in ColabGDrive as demonstrated below.)
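The shape of the 'parents' entry is easy to get wrong, so here is a small sketch of a helper that builds the metadata dictionary handed to `drive.CreateFile()`.  The helper name `build_upload_metadata` and the folder id are hypothetical; only the dictionary layout comes from the upload cell above.

```python
def build_upload_metadata(title, parent_id=None):
    """Build the metadata dict passed to drive.CreateFile() before an upload."""
    metadata = {"title": title}
    if parent_id is not None:
        # 'parents' must be a list of {'id': ...} dicts, even for one folder
        metadata["parents"] = [{"id": parent_id}]
    return metadata


# No parent given: only the title is set
print(build_upload_metadata("train2.csv"))
# Parent given: the single folder id is wrapped in a list
print(build_upload_metadata("train2.csv", "hypothetical-folder-id"))
```

The result would then be used as `toUpload = drive.CreateFile(build_upload_metadata('train2.csv', dirID))`, followed by `SetContentFile` and `Upload` as in the cell above.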

####  Google Docs Example

The following example shows the metadata for a Google docs file.

In [0]:
docID = findFileID('root/BigDataTraining/docexample.docx')
docReader = drive.CreateFile({'id':docID})
docReader.FetchMetadata()
for k in docReader.keys():
  print("Key: {:s}".format(k))
  print(docReader[k])

Key: id
1kheJEAIyfLRFk-O8peQsEalADLpJ-AYYbpjULtIgW_k
Key: kind
drive#file
Key: etag
"hcahzZRGAO5dFBAGGfvDlnfbXEY/MTUyMTc2NTM0MTY2OA"
Key: selfLink
https://www.googleapis.com/drive/v2/files/1kheJEAIyfLRFk-O8peQsEalADLpJ-AYYbpjULtIgW_k
Key: webContentLink
https://drive.google.com/uc?id=1kheJEAIyfLRFk-O8peQsEalADLpJ-AYYbpjULtIgW_k&export=download
Key: alternateLink
https://docs.google.com/document/d/1kheJEAIyfLRFk-O8peQsEalADLpJ-AYYbpjULtIgW_k/edit?usp=drivesdk
Key: embedLink
https://docs.google.com/document/d/1kheJEAIyfLRFk-O8peQsEalADLpJ-AYYbpjULtIgW_k/preview?ouid=100296224498372692739
Key: iconLink
https://drive-thirdparty.googleusercontent.com/16/type/application/vnd.google-apps.document
Key: thumbnailLink
https://docs.google.com/feeds/vt?gd=true&id=1kheJEAIyfLRFk-O8peQsEalADLpJ-AYYbpjULtIgW_k&v=2&s=AMedNnoAAAAAWrRv2_84rbYHxcF-9NoOG27TICZ9V9w4&sz=s220
Key: title
docexample.docx
Key: mimeType
application/vnd.google-apps.document
Key: labels
{'starred': False, 'hidden': False, 'trashed

###  Important Summary of the Work Above

So we ask ourselves the important questions:

*  What are the "logical" differences between a Linux file system and the Google Drive file system?
*  How do the differences affect our management activities?

####  Logical Differences
For a Linux file system, the parents keep track of the children.  If you go back to ancient history (aka the 1970s and 1980s), the file system was essentially synonymous with the way the files were laid out on the disks.  The "directory file" containers held pointers to children containers, and file containers had pointers to the parts of the file.  Symbolic links allowed multiple directories to point to the same file (either directory or file), but everything started at the top and worked its way down.  This had four primary advantages:

*  It was "natural" for people to understand because it essentially mimicked a room full of file cabinets
*  It was "easy" to use names (rather than pointers) so people could find files
*  It was relatively easy to embed this concept on a tape and on a single disk
*  It was easy to continue the original "unix" file system, which defined a file as a linear stream of bytes

On the other hand, starting in the 1980s it had some serious disadvantages:

*  The tape/disks became the slow part of the system, so using individual disks for files was a problem (even after caching was used)
*  There were a lot of different ways that multiple disks could be used for file systems, for instance:
  *  The Thinking Machines file system automatically spread data across all of the drives and mapped physical drives to physical processors
  *  The Cray file system "striped" blocks across the drives and relied on the individual applications to "fix" the data
*  The HDFS file layout wanted to consider data as non-linear, but regular and multi-dimensional, so there were some complex algorithms to get data to the right processor.  This file system allows one of the dimensions to be time, so it covers a large number of physical data models.

Google's approach may not have been "explicitly" defined using the following concepts, but the following appear to be consistent with their model.  In general, the Google model fits the following:

*  Users might want worldwide access to the data
*  Users should be concerned about data access and data access time rather than a physical layout of the data
*  There are so many different applications that Google can't provide the mapping from physical layout to data, so the application has to map from the linear stream of bytes to its own needs.
*  Users need a way to understand their data layout
*  Users need to share documents and/or folders

To meet these requirements, Google appears to have made the following decisions:

*  They supported the concept of folders rather than directories
*  They made each file in their system a stand-alone entity
*  They provide the ability to define the bandwidth requirements for some of their stores and they meet those requirements


1.  The implementation is rather elegant: instead of directories containing information about the files/directories under them, the files/folders track the information on their parents.  This makes sense from Google's perspective since they are so good at indexing.
2.  They track everything by a fileID, and the file carries its 'title' with it.  There is a great deal of information in the metadata for a file, so they use the metadata to keep all of the important information.  (See the example above.)
3.  The metadata contains the mimeType so apps that support that mimeType know how to parse the data.  This is important: the data is considered a unicode byte stream, and the mimeType tells other apps what they can expect.
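To make the "children track their parents" idea concrete, here is a small in-memory sketch.  It is not how Google stores anything internally; the records, ids, and the `build_children_index` helper are made up for illustration.  It shows that a folder listing is just an index built over the parent references carried by each file.

```python
from collections import defaultdict

# Each record mirrors a few Drive metadata fields; all ids are made up
FILES = [
    {"id": "d1", "title": "BigDataTraining", "parents": ["root"],
     "mimeType": "application/vnd.google-apps.folder"},
    {"id": "f1", "title": "train2.csv", "parents": ["root", "d1"],
     "mimeType": "text/csv"},
    {"id": "f2", "title": "docexample.docx", "parents": ["d1"],
     "mimeType": "application/vnd.google-apps.document"},
]


def build_children_index(files):
    """Invert the child -> parents references into parent -> child titles."""
    index = defaultdict(list)
    for f in files:
        for parent_id in f["parents"]:
            index[parent_id].append(f["title"])
    return index


index = build_children_index(FILES)
print(sorted(index["root"]))  # ['BigDataTraining', 'train2.csv']
print(sorted(index["d1"]))    # ['docexample.docx', 'train2.csv']
```

Note that `train2.csv` appears under both folders without being copied, which a strict top-down directory tree can only imitate with links.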

There are considerable differences, but it is reasonable to build a small set of tools that wrap around PyDrive to provide a human-readable interface to a Google Drive.  I have instantiated the small set of tools in my bitbucket account so I can use CodeAnywhere for editing.

NOTE:  The Cloud Storage capabilities (object-oriented buckets) will be discussed later in this notebook.

All that said, probably the biggest differences are:

*  There isn't a concept of the current working directory in Google Drive when working with a notebook.  This would have to be built with a stack that maintains the current working directory if desired.  (This is coming in my ColabGDrive library.)
*  You can't "open" a file by giving it a path; you have to find the fileID first on your own (the first tool in the ColabGDrive library).
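A stack-based current working directory could look roughly like the sketch below.  This is an assumption about one possible design, not the ColabGDrive implementation: the `DriveCwd` class, its method names, and the `FAKE_DRIVE` table with made-up ids are all hypothetical.

```python
class DriveCwd:
    """Tracks a notebook-side 'current folder' as a stack of (title, id) pairs."""

    def __init__(self, list_children, root_id="root"):
        # list_children(parent_id) returns dicts with 'title' and 'id' keys
        self._list_children = list_children
        self._stack = [("root", root_id)]

    def cwd(self):
        """Return the current folder as a root-oriented path string."""
        return "/".join(title for title, _ in self._stack)

    def cwd_id(self):
        """Return the id of the current folder."""
        return self._stack[-1][1]

    def chdir(self, name):
        """Move into a child folder by title, or up one level with '..'."""
        if name == "..":
            if len(self._stack) > 1:  # never pop the root entry
                self._stack.pop()
            return
        for child in self._list_children(self.cwd_id()):
            if child["title"] == name:
                self._stack.append((name, child["id"]))
                return
        raise FileNotFoundError(name)


# Stand-in for Google Drive so the sketch runs anywhere (ids are made up)
FAKE_DRIVE = {
    "root": [{"title": "BigDataTraining", "id": "dir-1"}],
    "dir-1": [{"title": "UMichCourse", "id": "dir-2"}],
}


def fake_list_children(parent_id):
    return FAKE_DRIVE.get(parent_id, [])


nav = DriveCwd(fake_list_children)
nav.chdir("BigDataTraining")
print(nav.cwd())  # root/BigDataTraining
nav.chdir("..")
print(nav.cwd())  # root
```

Keeping the ids on the stack means moving up a level needs no extra Drive query; only moving down has to list the children of the current folder.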

Overall, I have been able to work with path names (as defined by root-oriented strings) fairly well.  The ColabGDrive library will eventually have the concept of a current working directory, so it will be relatively easy to map between our visual concept of folders and the GDrive environment.  No problems so far (22 Mar 2018).