# GCS Inventory Loader Introduction
Load your GCS bucket inventory into BigQuery (or stdout) fast with this tool.

It can be very useful to have an inventory of your GCS objects and their metadata, particularly in a powerful database like BigQuery. The GCS listing API supports filtering by prefixes, but more complex queries can't be done via API. Using a database, you can find out lots of information about the data you have in GCS, such as finding very large objects, very old or stale objects, etc.

This utility will help you bulk load an object listing to stdout, or directly into BigQuery. It can also help you keep your inventory up-to-date with the `listen` command.

The implementation lists buckets and sends each page of results to a worker in a thread pool, which processes the page and streams it into BigQuery. Throughput of about 15 seconds per 100,000 objects has been achieved on moderately sized (32 vCPU) virtual machines, which works out to 2 minutes and 30 seconds per million objects. Note that this throughput is per process, so you can shard the bucket namespace across multiple processes to increase it.
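The page-per-worker pattern described above can be sketched with the standard library alone. This is an illustrative sketch, not the tool's actual code: `pages`, `process_page`, and `load_inventory` are hypothetical names, the object listing is simulated with a generator, and the BigQuery streaming step is replaced by a stand-in that only counts rows.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def pages(items, page_size):
    """Yield successive lists of at most page_size items (a simulated listing)."""
    iterator = iter(items)
    while True:
        page = list(islice(iterator, page_size))
        if not page:
            return
        yield page


def process_page(page):
    """Stand-in for converting a page to rows and streaming it to BigQuery."""
    return len(page)


def load_inventory(object_names, page_size=1000, workers=4):
    """Hand each page to a thread pool worker and return the total rows handled."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_page, pages(object_names, page_size)))


# Simulate a bucket with 2,500 objects: three pages of 1000, 1000, and 500.
total = load_inventory(f"obj-{i}" for i in range(2500))
print(total)
```

Because the work is dominated by network waits rather than computation, threads are a reasonable fit here even with Python's global interpreter lock.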

## Costs
Compute costs aside, the primary cost you'll incur for listing objects is Class A operations charges. Under most circumstances each listing page contains 1000 objects. Pages can be smaller in exceptional circumstances, for example just after a large number of deletes, when the underlying table is sparse. The cost is therefore figured like so:

`(number of objects listed) / 1000 / 10,000 * (rate per 10,000 class A ops)`

For example, in a standard regional bucket, listing 100 million objects should cost about 0.50 USD:

`(100 million) / 1000 / 10,000 * $0.05 = $0.50`
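The same arithmetic can be written as a small helper for estimating other bucket sizes. This is a minimal sketch: the function name `listing_cost_usd` is hypothetical, and the default rate of $0.05 per 10,000 Class A operations is the figure from the example above, so check current pricing for your storage class and location before relying on it.

```python
import math


def listing_cost_usd(num_objects, rate_per_10k_ops=0.05, objects_per_page=1000):
    """Estimate Class A charges for listing num_objects objects.

    Assumes every page is full (objects_per_page objects per list call),
    so sparse pages would make the real cost somewhat higher.
    """
    list_calls = math.ceil(num_objects / objects_per_page)
    return list_calls / 10_000 * rate_per_10k_ops


print(f"${listing_cost_usd(100_000_000):.2f}")
```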


## To Learn More
To learn more go to https://github.com/domZippilli/gcs-inventory-loader

# Clone gcs-inventory-loader Repository

In [None]:
!git clone https://github.com/domZippilli/gcs-inventory-loader.git

# Install dependencies

In [None]:
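# Use the %cd magic: a "!cd" runs in a throwaway subshell and does not change the notebook's working directory.
%cd gcs-inventory-loader
!pip install .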

# Configure the default.cfg file
Configure the default.cfg file to point to the BigQuery project that will hold the resulting dataset, and to the project that contains the buckets of interest.

Change These Values:
- PROJECT=The_BigQuery_Project
- GCS_PROJECT=The_Bucket_Project
- DATASET_NAME=The_BigQuery_Dataset
- INVENTORY_TABLE=The_BigQuery_Table

Helpful Tip: You don't need quotes around the values.
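To see why quotes are unnecessary, and to catch values that were never filled in, a quick check along these lines may help. It is a sketch under assumptions: it parses an inline sample with Python's standard `configparser`, which keeps quote characters as part of the value, and the tool's own config loading may differ. The helper name `unresolved_settings` and the sample values are hypothetical.

```python
import configparser

# Inline sample in the same INI style as default.cfg (values are made up).
SAMPLE = """
[GCP]
PROJECT=my-bq-project
GCS_PROJECT="my-bucket-project"

[BIGQUERY]
DATASET_NAME=CONFIGURE_ME
INVENTORY_TABLE=object_metadata
"""


def unresolved_settings(text, placeholder="CONFIGURE_ME"):
    """Return SECTION.KEY names that still hold the placeholder or a quoted value."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    problems = []
    for section in parser.sections():
        for key, value in parser.items(section):
            # configparser lowercases keys, so upper-case them for reporting.
            if value == placeholder or value.startswith(('"', "'")):
                problems.append(f"{section}.{key.upper()}")
    return problems


print(unresolved_settings(SAMPLE))
```

To check a real file instead of the inline sample, pass the contents of default.cfg as `text`.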

In [None]:
%%writefile default.cfg
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


[GCP]
# The project in which to scan for buckets and load object information into a table.
PROJECT=CONFIGURE_ME

# The project in which to scan for buckets / objects only. Use this (or BIGQUERY.JOB_PROJECT) if you want to span BQ and GCS across different projects.
GCS_PROJECT=CONFIGURE_ME

[RUNTIME]
# Number of worker threads. Two threads will be reserved for listing buckets, and the remaining threads will be used to send list pages into BigQuery. Even on a single core machine, this should be set to at least 4 to allow for context switches during IO waits.
WORKERS=64

# Number of work items (page listings) to store. More items will use more memory, but a larger work queue can improve performance if you see throughput stuttering.
WORK_QUEUE_SIZE=1000

# Log level for the inventory loader. Default is INFO.
# LOG_LEVEL=DEBUG


[BIGQUERY]
# The dataset to use for inventory data.
DATASET_NAME=CONFIGURE_ME

# A table in which to place the object inventory.
INVENTORY_TABLE=object_metadata

# How many rows to stream into BigQuery before starting a new stream.
# Default is 100, which is conservative, but most configurations can run much larger. Higher numbers use more memory, and an excessively high number may hit BQ limits.
BATCH_WRITE_SIZE=500

# Project to use for running BQ jobs. This is useful if you want to run the job in one project but store the data in another.
# Default is GCP.PROJECT
# JOB_PROJECT=

#[PUBSUB]
# The topic to listen to for object updates. Just give the short name of the topic, not the fully qualified name.
#TOPIC_SHORT_NAME=gcs_updates

# The subscription to listen to for object updates (will be created if not found)
#SUBSCRIPTION_SHORT_NAME=gcs_updates_sub_01

# The message wait timeout in seconds. Defaults to 10.
# This value shouldn't need adjusting. During the 10 second wait, notifications that are enqueued to be written to BigQuery
# could be lost in the event of a kernel panic or plug-pull. If you need to shorten this window, you should probably also shrink the batch write size.
# TIMEOUT=

# Load Bucket Contents into BigQuery
If you've configured the config file correctly, you should be able to get your bucket inventory loaded with a simple command. Note that by default, this will load an inventory of all objects for all buckets in your project. 

- Make sure you add your proxy email (found in your Terra profile) to the BigQuery project as an Editor.

In [None]:
! gcs_inventory load <BUCKET: fc-****-****-****-****>