Data Engineer OnBoard - Week 2 
# GCS and BigQuery
Angus Tu
2025.09.17

As a data engineer, I have received a digital transformation request from a travel company and need to help them migrate their data to Google Cloud step by step.

First, I received a file related to travel products, and I need to upload the data to Google Cloud Storage.

----
# GCS: Google Cloud Storage

## Set Permission 
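You need the <u>Storage Object User</u> role (`roles/storage.objectUser`) to work with objects, plus the permissions <u>storage.buckets.create</u> and <u>storage.buckets.list</u> to create and list buckets. The commands below bundle those two permissions into a custom role and bind both roles to the service account.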

> As an admin

In [None]:
gcloud config configurations activate default

In [None]:
gcloud iam roles create bucketCreator \
    --project=tw-rd-data-angus-tu \
    --title="Storage Bucket Creator PoLP" \
    --description="Can only create and list buckets" \
    --permissions=storage.buckets.create,storage.buckets.list \
    --stage=GA

In [None]:
gcloud projects add-iam-policy-binding tw-rd-data-angus-tu \
    --member="serviceAccount:angus-personal@tw-rd-data-angus-tu.iam.gserviceaccount.com" \
    --role="projects/tw-rd-data-angus-tu/roles/bucketCreator"

In [None]:
gcloud projects add-iam-policy-binding tw-rd-data-angus-tu \
    --member="serviceAccount:angus-personal@tw-rd-data-angus-tu.iam.gserviceaccount.com" \
    --role="roles/storage.objectUser"

> As a service account

In [None]:
gcloud config configurations activate sa

## Create Buckets
You can create buckets with either the `gcloud storage` CLI or the older `gsutil` tool.

*P.S. The console GUI is more intuitive and convenient, but for documentation purposes I record the steps as code.*

In [None]:
gcloud storage buckets create gs://tw-rd-data-angus-tu-travel-demo1 \
    --location=asia-east1 \
    --project=tw-rd-data-angus-tu 

In [None]:
gsutil mb -p tw-rd-data-angus-tu  -c STANDARD -l asia-east1 gs://tw-rd-data-angus-tu-travel-demo2

----
## Use Python to Access GCS

### Environment 

In [14]:
from google.cloud import storage
GCS = storage.Client()

### Bucket
> Create a bucket

In [15]:
bucket_name = "tw-rd-data-angus-tu-travel-demo3"
bucket = GCS.bucket(bucket_name)
new_bucket = GCS.create_bucket(bucket, location="asia-east1")
print(f"Bucket {new_bucket.name} created in {new_bucket.location}")

Bucket tw-rd-data-angus-tu-travel-demo3 created in ASIA-EAST1
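
Running the cell above a second time raises a conflict error, because the bucket already exists. A minimal sketch of a re-runnable variant follows. `get_or_create_bucket` is a hypothetical helper name, not a library function, and it assumes the same client object is passed in. The existence check uses `list_buckets()` because the custom role defined earlier grants `storage.buckets.list`. It only sees buckets in the current project, so a name already taken in another project would still fail on creation.

```python
def get_or_create_bucket(client, bucket_name, location="asia-east1"):
    """Return (bucket, created). Hypothetical helper, not part of the library."""
    # Look for the bucket among those visible in the client's project.
    for existing in client.list_buckets():
        if existing.name == bucket_name:
            return existing, False  # already there; nothing to create
    # Not found in this project, so try to create it.
    new_bucket = client.create_bucket(bucket_name, location=location)
    return new_bucket, True
```

Usage would look like `bucket, created = get_or_create_bucket(GCS, "tw-rd-data-angus-tu-travel-demo3")`.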


> List buckets

In [16]:
buckets = GCS.list_buckets()
for bucket in buckets:
    print(bucket.name)

tw-rd-data-angus-tu-travel-demo1
tw-rd-data-angus-tu-travel-demo2
tw-rd-data-angus-tu-travel-demo3


### Folder
Create an (empty) folder by uploading a zero-byte object whose name ends with `/`.

In [None]:
bucket = GCS.bucket("tw-rd-data-angus-tu-travel-demo1")
blob = bucket.blob("data/")
blob.upload_from_string("") 
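
Cloud Storage has a flat namespace, so "folders" exist only as shared name prefixes. The sketch below models, on a plain list of object names, what a listing with `prefix` and `delimiter="/"` returns: the objects directly under the prefix and the set of sub-prefixes. `split_listing` is a hypothetical name, and the logic is a simplified model for illustration, not the service's implementation.

```python
def split_listing(names, prefix="", delimiter="/"):
    """Model a delimiter listing: return (items, sub_prefixes)."""
    items, sub_prefixes = [], set()
    for name in sorted(names):
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        cut = rest.find(delimiter)
        if cut == -1:
            items.append(name)  # direct child of the prefix
        else:
            # Everything up to and including the first delimiter acts as a "subfolder".
            sub_prefixes.add(prefix + rest[:cut + len(delimiter)])
    return items, sub_prefixes


names = ["data/", "data/a.csv", "data/folder1/b.csv", "top.csv"]
print(split_listing(names, prefix="data/"))
# → (['data/', 'data/a.csv'], {'data/folder1/'})
```

Note that the placeholder object `data/` shows up as an ordinary item, which is why a listing helper has to skip it explicitly.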

### Object/File
#### 1. Listing files
Develop a function that lists files in a given GCS bucket. Enhance the function to allow filtering of files based on a specified prefix and suffix.

In [47]:
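def list_file(client, bucket_name, folder="", prefix="", suffix=""):
    """List objects directly under `folder`, filtered by name prefix and suffix."""
    bucket = client.bucket(bucket_name)

    # Normalize the folder so it ends with exactly one "/".
    if folder:
        folder = folder.rstrip("/") + "/"
    search_prefix = folder + prefix
    # delimiter="/" keeps the listing from descending into subfolders.
    blobs = bucket.list_blobs(prefix=search_prefix, delimiter="/")

    results = []

    for blob in blobs:
        name = blob.name
        if name == folder:
            continue  # skip the folder placeholder object itself
        if folder and "/" in name[len(folder):]:
            continue  # skip anything nested below the folder
        if suffix and not name.endswith(suffix):
            continue
        results.append(name)

    return results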

In [None]:
list_file(GCS, "tw-rd-data-angus-tu-travel-demo1","data")

['data/.empty', 'data/null']
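
`list_file` returns only the objects directly under `folder`; the listing reports subfolders separately. A sketch of a companion helper (the name `list_subfolders` is an assumption) that reads them from the iterator's `prefixes` attribute, which is filled in only once the iterator has been consumed:

```python
def list_subfolders(client, bucket_name, folder=""):
    """Return the immediate sub-prefixes ("subfolders") of `folder`, sorted."""
    bucket = client.bucket(bucket_name)
    if folder:
        folder = folder.rstrip("/") + "/"
    blobs = bucket.list_blobs(prefix=folder, delimiter="/")
    # prefixes is populated page by page, so drain the iterator first.
    for _ in blobs:
        pass
    return sorted(blobs.prefixes)
```

For example, `list_subfolders(GCS, "tw-rd-data-angus-tu-travel-demo1", "data")` would report `data/folder1/` once that folder contains objects.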

#### 2. Upload files
Implement a function to upload a file from a local machine to a specified GCS bucket.

In [38]:
def upload_file(client, bucket_name, source_file_name, sink_file_name, sink_dir=""):
    """Upload a local file to `sink_dir/sink_file_name` and return its gs:// URI."""
    bucket = client.bucket(bucket_name)

    # Normalize the target folder so it ends with exactly one "/".
    if sink_dir:
        sink_dir = sink_dir.rstrip("/") + "/"
    blob_name = sink_dir + sink_file_name
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(source_file_name)

    return f"gs://{bucket_name}/{blob_name}"

In [44]:
upload_file(GCS, "tw-rd-data-angus-tu-travel-demo1", "test1.csv", "test1.csv")

'gs://tw-rd-data-angus-tu-travel-demo1/test1.csv'

In [48]:
list_file(GCS, "tw-rd-data-angus-tu-travel-demo1")

['test1.csv']

> Upload all files in a folder using `gsutil`

In [None]:
gsutil cp -r /Users/angus/Documents/file/folder1 gs://tw-rd-data-angus-tu-travel-demo1/data/folder1


In [None]:
list_file(GCS, "tw-rd-data-angus-tu-travel-demo1","data/folder1")

['data/folder1/test11.csv',
 'data/folder1/test12.csv',
 'data/folder1/test13.csv']
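
The same recursive upload can be sketched in Python by walking the local directory and reusing the blob-naming convention of `upload_file`. `plan_uploads` and `upload_dir` are hypothetical helpers, and the mapping from local paths to object names is an assumption chosen to mirror the `gsutil cp -r` result shown above.

```python
from pathlib import Path


def plan_uploads(local_dir, sink_dir=""):
    """Map every file under local_dir to an object name under sink_dir."""
    root = Path(local_dir)
    if sink_dir:
        sink_dir = sink_dir.rstrip("/") + "/"
    plan = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # as_posix() keeps "/" separators in object names on every OS.
            plan.append((str(path), sink_dir + path.relative_to(root).as_posix()))
    return plan


def upload_dir(client, bucket_name, local_dir, sink_dir=""):
    """Upload each planned file and return the resulting gs:// URIs."""
    bucket = client.bucket(bucket_name)
    uris = []
    for local_path, blob_name in plan_uploads(local_dir, sink_dir):
        bucket.blob(blob_name).upload_from_filename(local_path)
        uris.append(f"gs://{bucket_name}/{blob_name}")
    return uris
```

Separating the plan from the upload makes it possible to inspect the object names before anything is written to the bucket.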

#### 3. Download a file
Implement a function to download a file from a specified GCS bucket to a local machine.

In [None]:
def download_file(client, bucket_name, source_file_name, sink_file_name):
    """Download the object `source_file_name` to the local path `sink_file_name`."""
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(source_file_name)
    blob.download_to_filename(sink_file_name)

In [54]:
download_file(GCS, "tw-rd-data-angus-tu-travel-demo1", "data/folder1/test11.csv", "test11_new.csv")
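
`download_to_filename` opens the local path for writing and does not create missing directories, so a nested destination fails unless its parent directory already exists. A sketch of a variant that creates the parent first; `download_file_to` is a hypothetical name, and returning the local path is an added convenience rather than the original behavior.

```python
import os


def download_file_to(client, bucket_name, source_file_name, sink_file_name):
    """Download one object, creating the local parent directory if needed."""
    parent = os.path.dirname(sink_file_name)
    if parent:
        os.makedirs(parent, exist_ok=True)
    blob = client.bucket(bucket_name).blob(source_file_name)
    blob.download_to_filename(sink_file_name)
    return sink_file_name
```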

#### 4. Delete a file
Implement a function to delete a file from a specified GCS bucket.

In [55]:
def delete_file(client, bucket_name, file_name):
    """Delete the object `file_name` from the bucket."""
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(file_name)
    blob.delete()

In [56]:
delete_file(GCS, "tw-rd-data-angus-tu-travel-demo1", "data/folder1/test13.csv")
list_file(GCS, "tw-rd-data-angus-tu-travel-demo1","data/folder1")

['data/folder1/test11.csv', 'data/folder1/test12.csv']

> Delete specific files    

Implement a function to delete multiple files from a specified GCS bucket based on a combination of parameters such as source_dir, prefix, and suffix.

In [58]:
def delete_files(client, bucket_name, source_dir="", prefix="", suffix=""):
    """Delete every matching object under `source_dir`, including nested folders."""
    bucket = client.bucket(bucket_name)

    # Normalize the folder so it ends with exactly one "/".
    if source_dir:
        source_dir = source_dir.rstrip("/") + "/"

    search_prefix = source_dir + prefix
    # No delimiter here, so the listing is recursive.
    blobs = bucket.list_blobs(prefix=search_prefix)

    deleted_files = []

    for blob in blobs:
        name = blob.name
        if name.endswith("/"):
            continue  # keep folder placeholder objects
        if suffix and not name.endswith(suffix):
            continue
        blob.delete()
        deleted_files.append(name)

    return deleted_files


In [59]:
delete_files(GCS, "tw-rd-data-angus-tu-travel-demo1", suffix = "2.csv")

['data/folder1/test12.csv']
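
Because `delete_files` searches the whole bucket when `source_dir` is empty, the call above removes every object ending in `2.csv`, in any folder. A sketch of a preview-first variant follows; the `dry_run` parameter and the name `delete_files_checked` are assumptions, not part of the original function.

```python
def delete_files_checked(client, bucket_name, source_dir="", prefix="", suffix="", dry_run=True):
    """Return the names that match; delete them only when dry_run is False."""
    bucket = client.bucket(bucket_name)
    if source_dir:
        source_dir = source_dir.rstrip("/") + "/"
    matched = []
    for blob in bucket.list_blobs(prefix=source_dir + prefix):
        if blob.name.endswith("/"):
            continue  # skip folder placeholder objects
        if suffix and not blob.name.endswith(suffix):
            continue
        if not dry_run:
            blob.delete()
        matched.append(blob.name)
    return matched
```

Calling it once with the default `dry_run=True` shows what would be removed; repeating the call with `dry_run=False` performs the deletion.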

#### 5. Copy files
Implement a function to copy files from a source GCS bucket to a destination (sink) GCS bucket based on specific parameters, with an option to delete the source files after copying.

In [None]:
from google.cloud import storage

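def copy_files(client, source_bucket_name, sink_bucket_name, delete_source=False,
               source_dir="", sink_dir="", prefix="", suffix=""):
    """Copy matching objects between buckets; optionally delete the sources (a move).

    Objects are matched recursively under `source_dir`, and their path
    relative to `source_dir` is preserved under `sink_dir`.
    """
    source_bucket = client.bucket(source_bucket_name)
    sink_bucket = client.bucket(sink_bucket_name)

    # Normalize both directories so they end with exactly one "/".
    if source_dir:
        source_dir = source_dir.rstrip("/") + "/"
    if sink_dir:
        sink_dir = sink_dir.rstrip("/") + "/"

    search_prefix = source_dir + prefix
    blobs = source_bucket.list_blobs(prefix=search_prefix)

    copied_files = []

    for blob in blobs:
        name = blob.name
        if name.endswith("/"):
            continue  # skip folder placeholder objects
        if suffix and not name.endswith(suffix):
            continue
        # Keep the path relative to source_dir when building the new name.
        relative_name = name[len(source_dir):] if source_dir else name
        sink_name = sink_dir + relative_name
        source_bucket.copy_blob(blob, sink_bucket, new_name=sink_name)
        copied_files.append(sink_name)

        # Delete only after the copy call has returned without raising.
        if delete_source:
            blob.delete()

    return copied_files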


In [61]:
copy_files(GCS, "tw-rd-data-angus-tu-travel-demo1", "tw-rd-data-angus-tu-travel-demo2", source_dir = "data/folder1/", sink_dir = "data/folder2/")

['data/folder2/test11.csv']
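
The renaming rule inside `copy_files` can be isolated as a pure function, which makes it easy to check before touching any bucket. `sink_name_for` is a hypothetical helper that repeats the same slicing logic and, like the original, assumes the object name starts with `source_dir`.

```python
def sink_name_for(name, source_dir="", sink_dir=""):
    """Compute the destination object name the way copy_files does."""
    if source_dir:
        source_dir = source_dir.rstrip("/") + "/"
    if sink_dir:
        sink_dir = sink_dir.rstrip("/") + "/"
    # Strip the source folder, then prepend the destination folder.
    relative_name = name[len(source_dir):] if source_dir else name
    return sink_dir + relative_name


print(sink_name_for("data/folder1/test11.csv", "data/folder1/", "data/folder2/"))
# → data/folder2/test11.csv
```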