# Connecting to Google Cloud Storage

Before anything else, I had to `sftp` my Google Cloud credentials from my local machine to the Compute Engine VM I'm using for this module. Since I only have a remote terminal on that VM, I can't use the browser-based OAuth route to authenticate, so I activated the service account directly instead:
``` bash
gcloud auth activate-service-account --key-file $GC_CREDENTIALS
```
where I have set `GC_CREDENTIALS` to the full path to my credentials JSON file.
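
Since the same key file is handed to Spark later on, it can be worth checking that it really is a service account key before going further. The following is only a minimal sketch: the helper name `check_service_account_key` and the set of fields it checks are my own choices for illustration, not part of `gcloud` or any Google library.

```python
import json
from pathlib import Path

# Fields a service account key file is expected to contain.
# (Illustrative subset; real key files carry a few more entries.)
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email"}


def check_service_account_key(path):
    """Return the service account email if the file looks like a valid key.

    Raises ValueError if the file is missing, is not valid JSON,
    or is not a service account key.
    """
    key_path = Path(path).expanduser()
    if not key_path.is_file():
        raise ValueError(f"No key file found at {key_path}")

    with key_path.open() as f:
        info = json.load(f)  # json.JSONDecodeError is a subclass of ValueError

    if not isinstance(info, dict) or info.get("type") != "service_account":
        raise ValueError(f"{key_path} is not a service account key")

    missing = REQUIRED_FIELDS - info.keys()
    if missing:
        raise ValueError(f"{key_path} is missing fields: {sorted(missing)}")

    return info["client_email"]
```

Calling `check_service_account_key(os.environ["GC_CREDENTIALS"])` would then either return the account's email address or fail early with a readable error.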

Now we are ready.  
First we uploaded all the Parquet files to GCS with:
``` bash
gsutil -m cp -r data/pq/ gs://$GS_BUCKET_NAME/pq
```
where I have set `GS_BUCKET_NAME` to the name of my bucket on Google Cloud Storage. Note: because we have a large number of files, we used the `-m` option, which runs the copy in parallel (multi-threaded/multi-process).
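
To get a feel for how many files that upload involves, a quick local count can help. This is just a sketch under my own assumptions: `summarize_parquet_tree` is a hypothetical helper, and it only counts files ending in `.parquet`, so Spark's marker and checksum files are ignored.

```python
from pathlib import Path


def summarize_parquet_tree(root):
    """Return (file_count, total_bytes) for all *.parquet files under root."""
    files = [p for p in Path(root).rglob("*.parquet") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), total_bytes
```

For example, `summarize_parquet_tree("data/pq")` gives a local baseline that can be compared against a bucket listing such as `gsutil ls -r gs://$GS_BUCKET_NAME/pq` once the copy has finished.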

Next, we need the [Cloud Storage connector for Hadoop](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#non-clusters).
``` bash
mkdir lib && cd lib &&
gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar gcs-connector-hadoop3-latest.jar
```

Okay, I lied before. _Now_ we are ready...

In [1]:
import os
import pyspark
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

In [2]:
credentials_location = os.environ.get("GC_CREDENTIALS", "/home/freddie/.gc/google_credentials.json")

# Configuring Spark
conf = SparkConf() \
    .setMaster('local[*]') \
    .setAppName('NYTaxi') \
    .set("spark.jars", "./lib/gcs-connector-hadoop3-latest.jar") \
    .set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", credentials_location)

In [3]:
sc = SparkContext(conf=conf)

hadoop_conf = sc._jsc.hadoopConfiguration()

hadoop_conf.set("fs.AbstractFileSystem.gs.impl",  "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoop_conf.set("fs.gs.auth.service.account.json.keyfile", credentials_location)
hadoop_conf.set("fs.gs.auth.service.account.enable", "true")

23/10/29 00:10:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


All that is happening above is that we are telling Spark (via Hadoop) how to handle file system (`fs`) paths that start with `gs://`: which implementation (`impl`) to load for that scheme, and which service account credentials to use with it.  
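
To make that lookup concrete, here is a heavily simplified Python illustration of the idea: the scheme of a path is turned into a config key of the form `fs.<scheme>.impl`, and the value of that key names the class to load. The function `filesystem_class_for` and the `HADOOP_SETTINGS` dict are hypothetical names of mine, and the real resolution happens inside Hadoop's Java code, not in Python.

```python
from urllib.parse import urlparse

# Mirrors the two implementation settings from the cell above.
HADOOP_SETTINGS = {
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
}


def filesystem_class_for(uri, settings):
    """Return the configured implementation class name for a URI's scheme."""
    scheme = urlparse(uri).scheme   # "gs" for "gs://bucket/path"
    key = f"fs.{scheme}.impl"       # e.g. "fs.gs.impl"
    if key not in settings:
        raise KeyError(f"No file system configured for scheme '{scheme}'")
    return settings[key]
```

With these settings, a `gs://` path resolves to the connector class from the JAR we downloaded, while a scheme with no matching key (say `s3a://`) fails the lookup.
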
Now we can build a session as before:

In [4]:
spark = SparkSession.builder \
    .config(conf=sc.getConf()) \
    .getOrCreate()

In [5]:
# Read the bucket name from GS_BUCKET_NAME (set earlier), with a fallback
gcs_bucket_name = os.environ.get("GS_BUCKET_NAME", "dezc-data-lake_brilliant-vent-400717")
df_green = spark.read.parquet(f"gs://{gcs_bucket_name}/pq/green/*/*")

In [6]:
df_green.count()

2304517

It worked!