# GCS Inventory Loader Introduction
Load your GCS bucket inventory into BigQuery (or stdout) fast with this tool.

It can be very useful to have an inventory of your GCS objects and their metadata, particularly in a powerful database like BigQuery. The GCS listing API supports filtering by prefixes, but more complex queries can't be done via API. Using a database, you can find out lots of information about the data you have in GCS, such as finding very large objects, very old or stale objects, etc.

This utility will help you bulk load an object listing to stdout, or directly into BigQuery. It can also help you keep your inventory up-to-date with the listen command.

The implementation here takes the approach of listing buckets and sending each page to a worker in a thread pool for processing and streaming into BigQuery. Throughput rates of 15 seconds per 100,000 objects have been achieved with moderately sized (32 vCPU) virtual machines. This works out to 2 minutes and 30 seconds per million objects. Note that this throughput is per process, so you can shard the bucket namespace across multiple processes to increase it.
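As a rough planning aid, the quoted rate can be turned into a wall-clock estimate. This is a minimal sketch: the function name is hypothetical, and the assumption that throughput scales linearly with the number of processes is an illustration, not a measured result.

```python
SECONDS_PER_100K_OBJECTS = 15  # rate quoted above for a 32 vCPU VM


def estimate_listing_seconds(num_objects: int, processes: int = 1) -> float:
    """Estimate listing time, ASSUMING linear scaling across processes."""
    if processes < 1:
        raise ValueError("processes must be at least 1")
    return num_objects / 100_000 * SECONDS_PER_100K_OBJECTS / processes


print(estimate_listing_seconds(1_000_000))      # 150.0 seconds (2 min 30 s)
print(estimate_listing_seconds(10_000_000, 4))  # 375.0 seconds
```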

## Costs
Compute costs notwithstanding, the primary cost you'll incur for listing objects is Class A operations charges. Under most circumstances you'll get a listing with 1000 objects per page (exceptional circumstances might be... you just did a lot of deletes and the table is sparse). So cost is figured like so:

`(number of objects listed) / 1000 / 10,000 * (rate per 10,000 class A ops)`

For example, in a standard regional bucket, listing 100 million objects should cost about 0.50 USD:

`(100 million) / 1000 / 10,000 * $0.05 = $0.50`
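The same arithmetic can be written as a small helper. This is only a sketch of the formula above: the function name is hypothetical, and the default rate of 0.05 USD per 10,000 Class A operations is the figure used in the example, so check current pricing before relying on it.

```python
def listing_cost_usd(num_objects, rate_per_10k_ops=0.05, objects_per_page=1000):
    """Estimate the Class A operations cost of listing num_objects objects."""
    pages = num_objects / objects_per_page   # one list operation per page
    return pages / 10_000 * rate_per_10k_ops  # the rate is quoted per 10,000 ops


print(round(listing_cost_usd(100_000_000), 2))  # 0.5
```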


## To Learn More
To learn more go to https://github.com/domZippilli/gcs-inventory-loader

# Set up the Notebook

For this notebook to run you need to have a BigQuery project and dataset already created, so the code knows where to load the data.

## Clone gcs-inventory-loader Repository

In [None]:
!git clone https://github.com/domZippilli/gcs-inventory-loader.git

## Install dependencies

In [None]:
!pip install gcs-inventory-loader/.
# Note: `!cd` runs in a subshell, so the notebook's working directory does not change.
!cd gcs-inventory-loader

## Configure the default.cfg file
We need to configure the default.cfg file to point to the right BigQuery project for the resulting dataset, and to the project in which the buckets of interest live.

Change These Values Below:
- PROJECT=The_BigQuery_Project
- GCS_PROJECT=The_Bucket_Project
- DATASET_NAME=The_BigQuery_Dataset
- INVENTORY_TABLE=The_BigQuery_Table

Helpful tip: the values do not need quotes inside default.cfg itself. The quotes in the cell below are only Python string syntax.

In [None]:
# The project used for BigQuery, where the dataset and table holding the uploaded object metadata live
PROJECT="CONFIGURE_ME"

# The project used for Google Cloud Storage, where the notebook bucket lives
GCS_PROJECT="CONFIGURE_ME"

# The dataset in which gcs-inventory-loader stores the table of uploaded object metadata
DATASET_NAME="CONFIGURE_ME"

# The table into which gcs-inventory-loader uploads the object metadata
INVENTORY_TABLE="CONFIGURE_ME"
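Before moving on, it can help to confirm that no placeholder was left behind. The check below is a hypothetical helper, not part of gcs-inventory-loader, and the sample values are made up for illustration.

```python
PLACEHOLDER = "CONFIGURE_ME"


def find_unconfigured(settings: dict) -> list:
    """Return the names of settings that are empty or still hold the placeholder."""
    return [key for key, value in settings.items()
            if not value or value == PLACEHOLDER]


# Illustrative values only; in the notebook you would pass the real variables.
example_settings = {
    "PROJECT": "my-bq-project",
    "GCS_PROJECT": "CONFIGURE_ME",
    "DATASET_NAME": "my_dataset",
    "INVENTORY_TABLE": "",
}
print(find_unconfigured(example_settings))  # ['GCS_PROJECT', 'INVENTORY_TABLE']
```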

In [None]:
# Create a cell magic to edit the file
from IPython.core.magic import register_line_cell_magic

@register_line_cell_magic
def writetemplate(line, cell):
    with open(line, 'w') as f:
        f.write(cell.format(**globals()))
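The magic above relies on `str.format` to substitute `{NAME}` placeholders with variables of the same name. The sketch below shows that substitution in isolation, using a made-up template and made-up values instead of `globals()`.

```python
# A tiny template in the same style as default.cfg (values are illustrative).
template = "[GCP]\nPROJECT={PROJECT}\nGCS_PROJECT={GCS_PROJECT}\n"
values = {"PROJECT": "my-bq-project", "GCS_PROJECT": "my-gcs-project"}

# Each {NAME} placeholder is replaced by the matching keyword argument.
rendered = template.format(**values)
print(rendered)
```

One consequence of this approach is that any literal brace in the template would have to be doubled (`{{` or `}}`), because `str.format` treats single braces as placeholders.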

In [None]:
%%writetemplate default.cfg
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


[GCP]
# The project in which to scan for buckets and load object information into a table.
PROJECT={PROJECT}

# The project in which to scan for buckets / objects only. Use this (or BIGQUERY.JOB_PROJECT) if you want to span BQ and GCS across different projects.
GCS_PROJECT={GCS_PROJECT}

[RUNTIME]
# Number of worker threads. Two threads will be reserved for listing buckets, and the remaining threads will be used to send list pages into BigQuery. Even on a single core machine, this should be set to at least 4 to allow for context switches during IO waits.
WORKERS=64

# Amount of work items (page listings) to store. More items will use more memory, but a larger work queue can improve performance if you see throughput stuttering.
WORK_QUEUE_SIZE=1000

# Log level for the inventory loader. Default is INFO.
# LOG_LEVEL=DEBUG


[BIGQUERY]
# The dataset to use for inventory data.
DATASET_NAME={DATASET_NAME}

# A table in which to place the object inventory.
INVENTORY_TABLE={INVENTORY_TABLE}

# How many rows to stream into BigQuery before starting a new stream.
# Default is 100, which is conservative, but most configurations can run much larger. Higher numbers use more memory, and an excessively high number may hit BQ limits.
BATCH_WRITE_SIZE=500

# Project to use for running BQ jobs. This is useful if you want to run the job in one project but store the data in another.
# Default is GCP.PROJECT
# JOB_PROJECT=

#[PUBSUB]
# The topic to listen to for object updates. Just give the short name of the topic, not the fully qualified name.
#TOPIC_SHORT_NAME=gcs_updates

# The subscription to listen to for object updates (will be created if not found)
#SUBSCRIPTION_SHORT_NAME=gcs_updates_sub_01

# The message wait timeout in seconds. Defaults to 10.
# This value shouldn't need adjusting. During the 10 second wait, notifications that are enqueued to be written to BigQuery
# could be lost in the event of a KP/plug-pull. If you need to shorten this window, you probably should also shrink the batch write size.
# TIMEOUT=
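
One way to sanity-check a rendered file of this shape is to read it back with Python's standard `configparser`. This sketch parses an inline sample with made-up values rather than the real file, and it makes no claim about how gcs-inventory-loader itself reads its configuration.

```python
import configparser

# Inline sample mirroring the sections above; all values are illustrative.
sample_cfg = """
[GCP]
PROJECT=my-bq-project
GCS_PROJECT=my-gcs-project

[RUNTIME]
WORKERS=64

[BIGQUERY]
DATASET_NAME=my_dataset
INVENTORY_TABLE=my_table
BATCH_WRITE_SIZE=500
"""

parser = configparser.ConfigParser()
parser.read_string(sample_cfg)

workers = parser.getint("RUNTIME", "WORKERS")
table_path = "{}.{}.{}".format(
    parser.get("GCP", "PROJECT"),
    parser.get("BIGQUERY", "DATASET_NAME"),
    parser.get("BIGQUERY", "INVENTORY_TABLE"),
)
print(workers, table_path)  # 64 my-bq-project.my_dataset.my_table
```

To check the real file instead, `parser.read("default.cfg")` can replace the `read_string` call.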

# Load your workspace bucket to a BQ project
If you've configured the config file correctly, you should be able to get your bucket inventory loaded with a simple command. By default, this loads an inventory of all objects for all buckets in your project; the command below passes the workspace bucket name as an argument.

- Make sure you add your proxy email (found in your Terra profile) to the BigQuery project as an Editor.

In [None]:
import os

# Getting Workspace Bucket
bucket = os.environ["WORKSPACE_BUCKET"].split("/")[2]
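
The `split("/")[2]` expression works because a `gs://` URL splits into `"gs:"`, an empty string, and then the bucket name. The sketch below shows this on a made-up URL, alongside `urllib.parse.urlparse` as an alternative; the helper name is hypothetical.

```python
from urllib.parse import urlparse


def bucket_name_from_url(url: str) -> str:
    """Extract the bucket name from a gs:// URL."""
    # "gs://my-bucket/path".split("/") -> ["gs:", "", "my-bucket", "path"]
    return url.split("/")[2]


example_url = "gs://my-workspace-bucket/notebooks"  # illustrative value
print(bucket_name_from_url(example_url))  # my-workspace-bucket
print(urlparse(example_url).netloc)       # my-workspace-bucket
```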

In [None]:
! gcs_inventory load $bucket

# Visualize object metadata

Now that the data is in BigQuery, we can run analytics on the BigQuery table to help manage your storage data.


In [None]:
# This sets the path parameter for BigQuery
params= {"path": f"{PROJECT}.{DATASET_NAME}.{INVENTORY_TABLE}"}

## Data Size 

The bar graph shows the total size in GB of all files of each file type, so you can dig deeper into which file types account for most of your storage.

In [None]:
%load_ext google.cloud.bigquery

In [None]:
%%bigquery data_size_table --use_rest_api --params $params
DECLARE path STRING;
SET path = @path;
EXECUTE IMMEDIATE CONCAT('SELECT REGEXP_EXTRACT(name, r"\\.[0-9a-z]+$") AS file_extension, ROUND(SUM(size) / 1000 / 1000 / 1000, 1) AS sizeGB, COUNT(*) AS files, FROM `',path,'` GROUP BY file_extension ORDER BY sizeGB DESC');
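
To see what the extension pattern in the query captures, it can be tried locally. This sketch uses Python's `re` module as an approximation of BigQuery's `REGEXP_EXTRACT`; the object names are made up, and the helper name is hypothetical.

```python
import re

# Same pattern as in the query: a dot followed by lowercase letters or digits
# at the very end of the object name.
EXTENSION_PATTERN = re.compile(r"\.[0-9a-z]+$")


def file_extension(name: str):
    """Return the trailing extension (including the dot), or None if absent."""
    match = EXTENSION_PATTERN.search(name)
    return match.group(0) if match else None


for name in ["data/sample.bam", "archive/run.tar.gz", "logs/stderr", "README.TXT"]:
    print(name, "->", file_extension(name))
```

Names without a lowercase extension, such as `logs/stderr` or `README.TXT`, yield no match, so they are grouped together under a null `file_extension`.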

In [None]:
%matplotlib inline

In [None]:
data_size_table

In [None]:
data_size_table.plot(kind="bar", x="file_extension", y="sizeGB", figsize=(50,20), fontsize=30)

## Files per File Type

The pie graphs show how many files of each file type are in your bucket.

This section shows a general overview pie graph and a detailed pie graph.

### General Overview Pie Graph for Files per File Type

Includes standard file types 

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Pie chart of file counts per extension; slices are plotted counter-clockwise.
types = list(data_size_table["file_extension"])

# Only plot when the query returned at least one row
if len(types) > 0:
    # Use the per-extension file counts (the "files" column) so the chart
    # matches this section's "Files per File Type" heading
    sizes = np.array(list(data_size_table["files"]))
    percent = 100. * sizes / sizes.sum()

    labels = ['{0} - {1:1.2f} %'.format(i, j) for i, j in zip(types, percent)]

    fig, ax = plt.subplots()
    patches, text = plt.pie(sizes, radius=5, startangle=90)

    plt.legend(patches, labels, loc="right", bbox_to_anchor=(-0.1, 1.),
               fontsize=10)

    plt.show()
else:
    print("You Have No Files - Plots Won't Work")
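
The legend labels pair each type with its share of the total. The sketch below reproduces that calculation without NumPy or matplotlib, on made-up counts; the function name is hypothetical.

```python
def percent_labels(types, counts):
    """Build 'type - NN.NN %' labels from parallel lists of types and counts."""
    total = sum(counts)
    if total == 0:
        return []  # nothing to label; avoids dividing by zero
    return ["{0} - {1:1.2f} %".format(t, 100.0 * c / total)
            for t, c in zip(types, counts)]


print(percent_labels([".bam", ".vcf"], [3, 1]))
# ['.bam - 75.00 %', '.vcf - 25.00 %']
```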



### Detailed Pie Graph for Files per File Type

Includes logs, stderr, stdout, meta and more

In [None]:
#If you didn't run it before
%load_ext google.cloud.bigquery

In [None]:
%%bigquery file_type_data --use_rest_api --params $params
DECLARE path STRING;
SET path = @path;
EXECUTE IMMEDIATE CONCAT('SELECT IF( LENGTH(ifnull(ARRAY_REVERSE(SPLIT(ARRAY_REVERSE(SPLIT(name, "/"))[OFFSET(0)], "."))[SAFE_OFFSET(0)], ARRAY_REVERSE(SPLIT(name, "/"))[OFFSET(0)])) > 16 , ARRAY_REVERSE(SPLIT(name, "/"))[OFFSET(1)], ifnull(ARRAY_REVERSE(SPLIT(ARRAY_REVERSE(SPLIT(name, "/"))[OFFSET(0)], "."))[SAFE_OFFSET(0)], ARRAY_REVERSE(SPLIT(name, "/"))[OFFSET(0)])) as type, COUNT(name) as count FROM `',path,'` GROUP BY type HAVING COUNT(name) >1');
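
The type expression in the query is dense, so here is a plain-Python reading of it. This is a sketch of the logic as I read it, not the author's code: the type is the text after the last dot of the final path segment (or the whole segment if it has no dot), and if that text is longer than 16 characters, the parent directory name is used instead. The function name and sample object names are hypothetical.

```python
def classify_type(name: str) -> str:
    """Approximate the query's `type` column for one object name."""
    segments = name.split("/")
    last_segment = segments[-1]
    # Text after the last dot; equals the whole segment when there is no dot.
    candidate = last_segment.split(".")[-1]
    if len(candidate) > 16:
        # Long, extension-less names fall back to the parent directory name.
        # Like OFFSET(1) in the query, this fails if there is no parent.
        return segments[-2]
    return candidate


for name in ["runs/sample.vcf", "runs/abc/stderr",
             "runs/shard-0001/part-00000-of-00010-abcdef"]:
    print(name, "->", classify_type(name))
```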

In [None]:
file_type_data

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Pie chart of file counts per type; slices are plotted counter-clockwise.
types = list(file_type_data["type"])

# Only plot when the query returned at least one row
if len(types) > 0:
    sizes = np.array(list(file_type_data["count"]))
    percent = 100. * sizes / sizes.sum()

    labels = ['{0} - {1:1.2f} %'.format(i, j) for i, j in zip(types, percent)]

    fig, ax = plt.subplots()
    patches, text = plt.pie(sizes, radius=5, startangle=90)

    plt.legend(patches, labels, loc="right", bbox_to_anchor=(-0.1, 1.),
               fontsize=10)

    plt.show()
else:
    print("You Have No Files - Plots Won't Work")

## Duplicate Files

Using the file path and MD5 hash combined, the query finds duplicate files and returns them as a list. As written, it reports only groups with more than two matches (`Having Count(file_name) > 2`).

In [None]:
#If you didn't run it before
%load_ext google.cloud.bigquery

In [None]:
%%bigquery file_Duplicate --use_rest_api --params $params
DECLARE path STRING;
SET path = @path;
EXECUTE IMMEDIATE CONCAT('SELECT Concat(regexp_replace(regexp_replace(name, cast(ARRAY_REVERSE(SPLIT(name, r"/"))[SAFE_OFFSET(0)] as string), r""), cast(ARRAY_REVERSE(SPLIT(name, r"/"))[SAFE_OFFSET(1)] as string), r""), r"-", cast( md5Hash as string)) as file_name, Count(regexp_replace(regexp_replace(name, cast(ARRAY_REVERSE(SPLIT(name, r"/"))[OFFSET(0)] as string), r""), cast(ARRAY_REVERSE(SPLIT(name, r"/"))[SAFE_OFFSET(1)] as string), r"")) as count FROM `',path,'` Group by file_name Having Count(file_name) > 2 ORDER By COUNT DESC')                     
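
The idea behind duplicate detection can be shown on a handful of records. The sketch below is a simplification, not a translation of the query: it groups made-up records by MD5 hash alone and treats any hash seen more than once as a duplicate, whereas the query builds its key from part of the path plus the hash.

```python
from collections import defaultdict

# Made-up inventory rows; only the two fields used here are included.
records = [
    {"name": "runs/a/sample.bam", "md5Hash": "hash-1"},
    {"name": "runs/b/sample.bam", "md5Hash": "hash-1"},
    {"name": "runs/a/notes.txt", "md5Hash": "hash-2"},
]

# Collect the object names that share each hash.
names_by_hash = defaultdict(list)
for record in records:
    names_by_hash[record["md5Hash"]].append(record["name"])

# Keep only hashes that occur more than once.
duplicates = {h: names for h, names in names_by_hash.items() if len(names) > 1}
print(duplicates)
```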

In [None]:
import pandas as pd
pd.set_option("display.max_rows", None, "display.max_columns", None, 'display.max_colwidth', None)
file_Duplicate

## Data Timeline

This bar graph shows a timeline of when files were created.

In [None]:
#If you didn't run it before
%load_ext google.cloud.bigquery

In [None]:
%%bigquery data_timeline_table --use_rest_api --params $params
DECLARE path STRING;
SET path = @path;
EXECUTE IMMEDIATE CONCAT('SELECT DATE(timeCreated) AS date, COUNT(name) AS name_count FROM `',path,'` GROUP BY DATE(timeCreated) ORDER BY date');
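
The query groups objects by creation date and counts them. The same grouping can be sketched locally with the standard library; the timestamps below are made up, and sorting by date is done explicitly so the timeline reads left to right.

```python
from collections import Counter
from datetime import datetime

# Made-up creation timestamps in ISO 8601 format.
timestamps = [
    "2021-03-01T10:15:00+00:00",
    "2021-03-01T18:40:00+00:00",
    "2021-03-02T09:05:00+00:00",
]

# Count objects per calendar date, then sort chronologically.
counts = Counter(datetime.fromisoformat(t).date() for t in timestamps)
timeline = sorted(counts.items())

for day, name_count in timeline:
    print(day.isoformat(), name_count)
```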

In [None]:
data_timeline_table

In [None]:
data_timeline_table.plot(kind="bar", x="date", y="name_count", figsize=(50,20), fontsize=20)