# Google Colaboratory – Storing and acquiring data for training

This notebook covers how to obtain data for use in Colab notebooks.  Storage and data are a “which comes first, the chicken or the egg?” problem:  do you find data and then worry about storage, or do you figure out storage and then worry about the data?  In practice the two go together and need to be considered jointly.  I use Google Drive for permanent storage and the local Colab VM for training data (images, pandas data, CSV, NumPy files, etc.).  Notebooks, models and config items are stored in Drive; training data is copied to the VM for local use.  This adds some setup time before you can start training, so compare both approaches for your situation.  I've found that taking 5 or so minutes to copy data to the VM is well worth the time it saves during training.

My personal guideline for Kaggle competitions: if the competition data is smaller than 10 GB, I use Colab; otherwise I use Kaggle.
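
A size rule like the one above can be checked in code before you start copying anything. The following is only a minimal sketch: the function name `pick_platform`, the `limit_gb` parameter and the extra free-disk check are my own assumptions for illustration, not part of Colab or Kaggle.

```python
import shutil

GB = 1024 ** 3  # bytes in one gigabyte (binary)

def pick_platform(dataset_bytes, limit_gb=10, path="."):
    """Return 'colab' if the data is under the limit and fits on disk, else 'kaggle'."""
    # Free space on the filesystem that holds `path` (the VM disk when run in Colab)
    free_bytes = shutil.disk_usage(path).free
    under_limit = dataset_bytes < limit_gb * GB
    fits_on_disk = dataset_bytes < free_bytes
    return "colab" if under_limit and fits_on_disk else "kaggle"

print(pick_platform(11 * GB))  # over the 10 GB guideline
```

The 10 GB default is just my personal threshold from above; adjust it to your own experience.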

**Storage**

Google has three main places you can store your information and each has pros/cons.

| Characteristics  | Colab VM | Drive | Cloud Storage |
| :----: | :----: | :----: | :----: |
| Cost | none | 15 GB free | Varies |
| Retrieval Speed | fastest | slowest | fast |
| Permanent | no | yes | yes |
| Created | when you start Colab | when you sign-up for Google | when purchased |

Link that covers the differences between Drive and Cloud Storage:

http://www.differencebetween.net/technology/difference-between-google-cloud-and-google-drive/


**Acquiring Data**

There are two main ways to acquire data for training:  use public data (Google, Kaggle, Microsoft, Facebook, etc.) or create your own.  This notebook focuses on public Google and Kaggle data.  When I say “Google data” I'm referring to the link below.  The data listed there comes from many sources, not just Google; it covers the data sets for which Google has TensorFlow examples.  There is also overlap between the companies' data sets.  For instance, the “Dogs vs. Cats” data can be found in multiple locations.

https://www.tensorflow.org/resources/models-datasets


**Basic Linux commands**

It is a good idea to have a basic understanding of some simple Linux commands.  This link is a good starting point and can explain any command you find in this notebook.

https://www.tjhsst.edu/~dhyatt/superap/unixcmd.html

**General Colab Information**

If you are new to Colab, this is a good starter.

https://medium.com/dair-ai/primer-for-learning-google-colab-bb4cabca5dd6





### Explore the Colab Local VM drive

When you launched this notebook, a VM was started with some pre-loaded local storage.  As you can see there were some sample files loaded for you.

In [None]:
# List the contents on your local VM drive, there is a sample_data folder
# If this is the first time you have executed code, it might take a few seconds to initialize and start the VM
!ls

In [None]:
# List the contents of the sample_data folder
! ls sample_data

In [None]:
# Show the contents of sample_data/README.md
!cat sample_data/README.md

### Obtaining Kaggle data sets

To use Kaggle data sets, you must have a Kaggle account and have accepted the data agreement.  You register for a Kaggle account once, but you have to accept the data agreement separately for each data set (or competition) you are interested in.  You can then create a personal API token and load the data into your environment.  I show the steps in Colab, but they should work in almost any environment.

Link to detailed steps (very good):

https://stackoverflow.com/questions/49310470/using-kaggle-datasets-in-google-colab

Here is a high-level summary:
1. In Kaggle: Register for a Kaggle account (this is free)
2. In Kaggle: From your Account tab, create & download your kaggle.json token
3. In Kaggle: For any data you want to use, agree to the usage terms (basically join the competition)
4. Start a Colab notebook (this assumes you already have a free Google account)
5. In Colab: Upload your kaggle.json file
6. In Colab: Install the Kaggle libs and modify access rights
7. In Colab: List or download Kaggle files
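
Steps 5 and 6 can also be done from Python instead of shell commands. This is a minimal sketch under my own assumptions: the helper name `install_kaggle_token` and its `home` parameter are hypothetical, and it only checks that the file contains the `username` and `key` fields that a kaggle.json token holds.

```python
import json
import shutil
from pathlib import Path

def install_kaggle_token(token_path, home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner."""
    token_path = Path(token_path)
    home = Path(home) if home is not None else Path.home()

    # Fail early if the uploaded file does not look like a Kaggle token
    token = json.loads(token_path.read_text())
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")

    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # like: mkdir -p ~/.kaggle
    target = kaggle_dir / "kaggle.json"
    shutil.copyfile(token_path, target)            # like: cp kaggle.json ~/.kaggle/
    target.chmod(0o600)                            # like: chmod 600 ~/.kaggle/kaggle.json
    return target
```

In Colab you would call `install_kaggle_token("kaggle.json")` after uploading the file; the `home` argument is only there so the helper can be tried against a scratch directory.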

Here are some other research links:

https://www.kaggle.com/suraj2596/download-datasets-to-your-google-drive
https://github.com/Kaggle/kaggle-api/issues/160
https://towardsdatascience.com/downloading-datasets-into-google-drive-via-google-colab-bcb1b30b0166


In [None]:
# To start, install kaggle libs, make sure you are using the latest version...
#!pip install -q kaggle

# WORK AROUND TO GET NEW VERSION: https://stackoverflow.com/questions/58643979/google-colaboratory-use-kaggle-server-version-1-5-6-client-version-1-5-4-fai
!pip install --upgrade --force-reinstall --no-deps kaggle

In [None]:
# Upload your "kaggle.json" file that you created from your Kaggle Account tab
# If you downloaded it, it would be in your "Downloads" directory

from google.colab import files
files.upload()

In [None]:
# On your VM, create kaggle directory and modify access rights

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
# To verify everything worked, execute commands to list data sets and competitions

!kaggle datasets list
!kaggle competitions list

### Copy Kaggle files

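**The next few cells are examples; do not run them unless you want the data**


Make sure you set the **-c** and **-p** arguments correctly:

 **-c** is the competition name and **-p** is the path where the files will be placed

 To copy to Google Drive (Drive must be mounted first; the mount cell is further down in this notebook):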

!kaggle competitions download -c titanic  -p /content/drive/My\ Drive/kaggle/titanic

To copy to local VM:

!kaggle competitions download -c titanic  -p kaggle/titanic
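
If you prefer to call the CLI from Python, building the command as a list of arguments avoids having to escape the space in “My Drive”. This is only a sketch: the helper names `kaggle_download_cmd` and `download` and the `dry_run` flag are hypothetical, and the only CLI flags used are the `-c` and `-p` shown above.

```python
import subprocess

def kaggle_download_cmd(competition, dest):
    """Build the argument list for `kaggle competitions download`."""
    return ["kaggle", "competitions", "download", "-c", competition, "-p", str(dest)]

def download(competition, dest, dry_run=True):
    """Print the command, and run it only when dry_run is False."""
    cmd = kaggle_download_cmd(competition, dest)
    if dry_run:
        print("Would run:", cmd)
    else:
        # Needs the kaggle CLI installed and a valid kaggle.json token
        subprocess.run(cmd, check=True)
    return cmd

# No backslash needed: the path is passed as one argument
download("titanic", "/content/drive/My Drive/kaggle/titanic")
```

Set `dry_run=False` once you are sure the competition name and destination are what you want.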




In [None]:
# Use the Titanic data set as an example and copy it to the local VM (it is small, so easy to start with...)
# I put Kaggle files in a kaggle subdirectory, but you can put them as needed

!kaggle competitions download -c titanic -p kaggle/titanic

In [None]:
# Show the downloaded files in kaggle directory
!ls kaggle/titanic

In [None]:
# Another example, dogs-vs-cats, this time downloading to Google Drive
# (Drive must already be mounted at /content/drive; the mount cell is further down)

!kaggle competitions download -c dogs-vs-cats -p /content/drive/My\ Drive/kaggle/dogscats

In [None]:
# If the data set was not listed, you can always use the API command in the Data section
# I found this very helpful!

# Go to the competition, select the "Data" tab, then select "API"
# The API entry is below the description and just above the list of data.
# This copies the API command to your clipboard; paste it into a notebook cell

# This is from the Airbus Ship Detection Challenge: https://www.kaggle.com/c/airbus-ship-detection/data

!kaggle competitions download -c airbus-ship-detection

In [None]:
# If you want to unzip the files; this example uses Google Drive paths

# The paths can be changed to use the local VM instead

# The u stands for update and the q stands for quiet - the latter is a good idea
# because massive output can sometimes cause your Google Colab notebook to crash

#!unzip -uq "drive/My Drive/PATH_TO_ZIP" -d "drive/My Drive/PATH_TO_OUTPUT"


#!unzip -uq "/content/drive/My Drive/kaggle/dogscats/train.zip" -d "/content/drive/My Drive/kaggle/dogscats/"

print("Done unzipping....")
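
The same idea can be sketched in Python with the standard `zipfile` module, which produces no per-file output at all. This is only an approximation of `unzip -uq`: the helper name `extract_missing` is hypothetical, and it skips files that already exist with the same size rather than comparing timestamps the way `-u` does.

```python
import zipfile
from pathlib import Path

def extract_missing(zip_path, out_dir):
    """Extract archive members that are absent or differ in size; return the count."""
    out_dir = Path(out_dir)
    extracted = 0
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directories are created as needed by extract()
            target = out_dir / info.filename
            if target.exists() and target.stat().st_size == info.file_size:
                continue  # already extracted on an earlier run
            zf.extract(info, out_dir)
            extracted += 1
    return extracted
```

Running it a second time on the same archive should extract nothing, which is handy when a Colab session is restarted halfway through.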

### Copy file(s) from Google Drive to Local VM

In [None]:
# To mount Google Drive, run this cell, click on the link, select your account and copy/paste your access token
from google.colab import drive
drive.mount('/content/drive')

# if successful, you will see "Mounted at /content/drive"

In [None]:
# List contents to verify it is mounted correctly
!ls "/content/drive/My Drive"

In [None]:
# verify name of file(s) to copy, change <your file> to actual file
!ls "/content/drive/My Drive/<your file>"

In [None]:
# To copy files from Google Drive to the VM local filesystem

# Create any directories as needed using mkdir

# Actual copy, replace the name and location with your files
!cp -r "/content/drive/My Drive/Colab Notebooks/<your file>" ./

In [None]:
# You should see your file(s) or directories on the local VM
!ls
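
Since the point of copying to the VM is to save time during training, it can be useful to time the copy itself. Here is a minimal Python sketch of the `cp -r` step; the helper name `copy_to_vm` is hypothetical and it assumes Python 3.8+ for `dirs_exist_ok`.

```python
import shutil
import time
from pathlib import Path

def copy_to_vm(src, dst_dir="."):
    """Copy a file or directory tree into dst_dir; return (target path, seconds)."""
    src, dst_dir = Path(src), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    target = dst_dir / src.name
    start = time.perf_counter()
    if src.is_dir():
        # Copies the whole tree, like cp -r (dirs_exist_ok needs Python 3.8+)
        shutil.copytree(src, target, dirs_exist_ok=True)
    else:
        shutil.copy2(src, target)
    return target, time.perf_counter() - start
```

In Colab the source would be a path under `/content/drive/My Drive/`, for example `copy_to_vm("/content/drive/My Drive/Colab Notebooks/<your file>")`.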

### Obtaining Google data sets via TensorFlow

The next section illustrates how to use the data sets provided by TensorFlow Datasets (the `tensorflow_datasets` package).  The steps are:
1. Find the data set you are interested in
2. Load the data along with the information (I've found showing the “info” is helpful)
3. Process the data based on the description

Link to the data set home page:

https://www.tensorflow.org/datasets



In [None]:
# Standard includes so everything works.....

import numpy as np 

import IPython.display as display
from PIL import Image

import matplotlib.pyplot as plt
%matplotlib inline

# Force TensorFlow 2.x
%tensorflow_version 2.x

import tensorflow as tf
print(tf.__version__)  # double check on tensorflow version

import tensorflow_datasets as tfds


In [None]:
# Simple helper to display the first images of a batch in an r x c grid
def show_batch(image_batch, label_batch, number_to_show=25, r=5, c=5):
    plt.figure(figsize=(10, 10))

    # Never index past the end of the batch or the grid
    for n in range(min(number_to_show, len(image_batch), r * c)):
        plt.subplot(r, c, n + 1)
        plt.imshow(tf.keras.preprocessing.image.array_to_img(image_batch[n]))
        plt.title(str(label_batch[n].numpy()))
        plt.axis('off')

In [None]:
# See available datasets or can look on the site
print(tfds.list_builders())

In [None]:
# As an example, construct a tf.data.Dataset using mnist
# Add "with_info" to get all of the dataset information

dataset, info = tfds.load(name="mnist", split="train", with_info=True)

In [None]:
# Show the files that were downloaded; the contents vary with the data set you loaded
!ls /root/tensorflow_datasets

In [None]:
# Show the information about the data set
print(info)

In [None]:
# Build your input pipeline with batch size 32

# tf.data.experimental.AUTOTUNE lets tf.data choose the prefetch buffer size dynamically at runtime

dataset = dataset.shuffle(1024).batch(32).prefetch(tf.data.experimental.AUTOTUNE)

for features in dataset.take(1):
    image_batch, label_batch = features["image"], features["label"]

# Verify it worked
print(image_batch.shape, label_batch)

In [None]:
# Show first 25 images of the batch
show_batch(image_batch, label_batch)
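
To make concrete what `.batch(32)` does in the pipeline above, here is a plain-Python sketch of batching (not tf.data itself; the helper name `batch` is my own). By default tf.data keeps a final partial batch, which the sketch mimics.

```python
import math

def batch(items, batch_size):
    """Yield consecutive lists of up to batch_size items; the last may be shorter."""
    items = list(items)
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

batches = list(batch(range(70), 32))
print([len(b) for b in batches])  # → [32, 32, 6]

# The MNIST train split has 60,000 examples, so batch size 32 gives:
print(math.ceil(60000 / 32))      # → 1875
```

This is also why `image_batch.shape` starts with 32 in the verification cell: each element of the batched data set holds up to 32 images.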