# Welcome to AUK Notebook for Google Colab

Google Colab is a free Jupyter Notebook environment that runs in the cloud and offers GPU access, which makes it a useful platform for Archives Unleashed work, particularly machine learning projects. In this notebook, we look at different ways to load Archives Unleashed data into Colab.

## Using Google Colab with data from GitHub

One simple way is to load the data from GitHub. This solution has some restrictions, since GitHub limits each file to 100 MB and each repo to 1 GB in size.
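Before pushing data to a repo, it can help to check which files would exceed the per-file limit. The following is a minimal sketch, assuming the 100 MB figure quoted above; the helper name `find_oversized_files` is hypothetical and not part of the auk-notebooks repo.

```python
import os

# Per-file limit as quoted in this notebook (an assumption, adjust as needed).
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def find_oversized_files(root, limit=GITHUB_FILE_LIMIT_BYTES):
    """Return sorted (path, size_in_bytes) pairs for files under root larger than limit."""
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit:
                oversized.append((path, size))
    return sorted(oversized)
```

For example, `find_oversized_files('auk-notebooks/data/')` would list any data files that are too large to store on GitHub.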

Set up Google Colab to have the same environment as our GitHub repo:

In [0]:
!git clone https://github.com/archivesunleashed/auk-notebooks.git
!pip install -r auk-notebooks/requirements.txt
!python -m nltk.downloader punkt vader_lexicon stopwords

The default directory is ~, which means you need to update AUK_PATH to 'auk-notebooks/data/'. If you also need Python to import modules from another directory, you can add it to the module search path:

```
import sys
sys.path.append('[desired path]')
```

In [0]:
COLLECTION_ID = '4867'  # Change to switch collections.
AUK_PATH = 'auk-notebooks/data/'  # Change value to full path to your data, including trailing slash.
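To illustrate how these two settings fit together, here is a minimal sketch that builds the path to one data file. It assumes the `<COLLECTION_ID>-<name>` file naming seen elsewhere in this notebook (for example `4867-fullurls.txt`); the helper `derivative_path` is hypothetical.

```python
import os

COLLECTION_ID = '4867'
AUK_PATH = 'auk-notebooks/data/'


def derivative_path(suffix, collection_id=COLLECTION_ID, auk_path=AUK_PATH):
    """Build the path to one data file, e.g. '<AUK_PATH>4867-fullurls.txt'."""
    return os.path.join(auk_path, '{}-{}'.format(collection_id, suffix))


full_urls = derivative_path('fullurls.txt')
# Check that the path resolves before running the analysis cells.
print(full_urls, os.path.exists(full_urls))
```

If the existence check prints `False`, AUK_PATH most likely does not match the directory the repo was cloned into.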

## Using Google Colab with data from Google Cloud Storage (GCS)

A personal account is allowed 5 GB of free storage per month (https://cloud.google.com/free/).

First, you need to create a bucket in Google Cloud Platform and store the data there. Here are the steps you need to follow:


1.   Create a Google Cloud Platform account (if you are already a Google user, you can simply use the same account).
2.   Create a bucket (https://cloud.google.com/storage/docs/creating-buckets).
3.   Upload the data to the bucket. I simply uploaded the files from https://github.com/archivesunleashed/auk-notebooks/tree/master/data
4.   Authenticate yourself in Google Colab using the following snippet:


In [0]:
from google.colab import auth
auth.authenticate_user()

You now have access to the data.

In [0]:
bucket_name = 'archive_unleashed'
file_path = '4867-fullurls.txt'

In [2]:
# Option 1: You can load the data from GCS into a file
!gsutil cp gs://{bucket_name}/{file_path} /tmp/example.txt

Copying gs://archive_unleashed/4867-fullurls.txt...
/ [1 files][   13.0 B/   13.0 B]
Operation completed over 1 objects/13.0 B.


In [0]:
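# Option 2: You can load the data from GCS into a blob
from google.cloud import storage

# Note: the first argument of storage.Client is the project ID, so passing
# the bucket name here only works if the project has the same name.
client = storage.Client(bucket_name)
bucket = client.get_bucket(bucket_name)
blob = bucket.get_blob(file_path)  # file_path is defined in the cell above
print(blob.download_as_string())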
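Note that `download_as_string()` in Option 2 returns raw bytes rather than text. Here is a minimal sketch of turning such a payload into a list of lines, assuming the file is UTF-8 encoded; the helper name `blob_bytes_to_lines` is hypothetical, and the sample payload is a stand-in rather than real collection data.

```python
def blob_bytes_to_lines(raw, encoding='utf-8'):
    """Decode a bytes payload and return its non-empty lines."""
    text = raw.decode(encoding)
    return [line for line in text.splitlines() if line.strip()]


# Stand-in payload; in the notebook this would be blob.download_as_string().
sample = b'http://example.com/a\nhttp://example.com/b\n'
print(blob_bytes_to_lines(sample))
```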

The two solutions above require logging into a Google account that already has access to the files. To provide access to a larger audience, it makes sense either to use an API key through Google IAM or to grant read permission to the general public.

1. Use an API key through IAM. Unfortunately, I haven't found a good tutorial for this.
2. Grant read access to all users from the admin account (using the snippet below). This is what I have done for my bucket.

In [5]:
# !gsutil acl ch -u AllUsers:R gs://{bucket_name}/**

Updated ACL on gs://archive_unleashed/4867-fulltext.txt
Updated ACL on gs://archive_unleashed/4867-fullurls.txt
Updated ACL on gs://archive_unleashed/4867-gephi.gexf
Updated ACL on gs://archive_unleashed/4867-gephi.graphml
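Once the objects are publicly readable, they can be fetched over plain HTTPS without any Google login. The following is a minimal sketch, assuming the standard `https://storage.googleapis.com/<bucket>/<object>` URL form for public objects; both function names are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen


def public_gcs_url(bucket_name, object_name):
    """Build the HTTPS URL of a publicly readable GCS object."""
    # quote() keeps '/' intact, so objects in "folders" still resolve.
    return 'https://storage.googleapis.com/{}/{}'.format(bucket_name, quote(object_name))


def read_public_object(bucket_name, object_name):
    """Download a public object as bytes, with no credentials required."""
    with urlopen(public_gcs_url(bucket_name, object_name)) as response:
        return response.read()
```

For example, `read_public_object('archive_unleashed', '4867-fullurls.txt')` should return the same bytes as Option 2 above, provided the bucket is still public.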


## Using Google Colab with data from AWS

It is also possible to store data on the AWS cloud platform. The AWS free tier includes 5 GB per month of S3 storage during the first 12 months after the account is created. Since S3 is a popular choice for cloud storage, we decided to try it out too.

AWS Glacier is intended as write-once, rarely retrieved cloud storage. It is an “extremely low-cost storage” option that may be interesting to consider.

### Creating an AWS account

I created a personal AWS account, added an S3 bucket "archiveunleashed", and uploaded the "data" folder of this GitHub repo to the bucket. I then created a user "archiveunleashed" and retrieved its AWS access keys. Finally, the two need to be connected by defining an inline policy that grants this user access to the bucket.

Creating an S3 bucket and granting access to it is more difficult than with Google Cloud, but there are many tutorials available.

Here is one:
https://docs.aws.amazon.com/AmazonS3/latest/dev/example-walkthroughs-managing-access-example1.html
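To make the policy step more concrete, here is a minimal sketch of what a read-only policy document for one bucket could look like. This is an illustration under assumptions, not the exact policy used for the "archiveunleashed" bucket; the function name `read_only_bucket_policy` is hypothetical.

```python
import json


def read_only_bucket_policy(bucket_name):
    """Return an IAM policy document allowing listing and reading one bucket."""
    bucket_arn = 'arn:aws:s3:::{}'.format(bucket_name)
    return {
        'Version': '2012-10-17',
        'Statement': [
            {
                # Listing applies to the bucket itself.
                'Effect': 'Allow',
                'Action': ['s3:ListBucket'],
                'Resource': [bucket_arn],
            },
            {
                # Reading applies to the objects inside the bucket.
                'Effect': 'Allow',
                'Action': ['s3:GetObject'],
                'Resource': [bucket_arn + '/*'],
            },
        ],
    }


print(json.dumps(read_only_bucket_policy('archiveunleashed'), indent=2))
```

The printed JSON can be pasted into the policy editor of the AWS console when attaching an inline policy to the user.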


In [1]:
import getpass

aws_access_key_id = getpass.getpass("Enter AWS access key id: ")
aws_secret_access_key = getpass.getpass("Enter AWS secret access key: ")

Enter AWS access key id: ··········
Enter AWS secret access key: ··········


In [0]:
bucket_name = 'archiveunleashed'
remote_fname = '4867-fulltext.txt'
local_fname = '/tmp/example.txt'

In [0]:
import boto3

s3r = boto3.resource(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)
buck = s3r.Bucket(bucket_name)
buck.download_file(remote_fname, local_fname)

In [0]:
!cat {local_fname}
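For large files such as the full-text derivative, printing the whole file with `cat` can flood the notebook output. Here is a minimal sketch of previewing only the first few lines from Python instead, assuming the file is UTF-8 text; the helper name `preview` is hypothetical.

```python
from itertools import islice


def preview(path, n=5, encoding='utf-8'):
    """Return the first n lines of a text file without reading the whole file."""
    with open(path, encoding=encoding, errors='replace') as handle:
        return [line.rstrip('\n') for line in islice(handle, n)]
```

In this notebook it could be called as `preview(local_fname)` once the download has finished.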

Additional links:

1. Official documentation from Google Colab: https://colab.research.google.com/notebooks/io.ipynb#scrollTo=L5cMl7XV65be
2. Additional API from Google Cloud: https://colab.research.google.com/drive/1hPH7skySCZR-ZMJ6TmYLN1ug6vbq2cpb
3. Free services from Google: https://cloud.google.com/free/
4. Free services from AWS: https://aws.amazon.com/free/