The fs-bq-schema-views
script is for use with the official Firebase Extension
Stream Firestore to BigQuery.
The fs-bq-schema-views
script (referred to as the "schema-views script")
generates richly-typed BigQuery views of your raw changelog.
The Stream Firestore to BigQuery extension only mirrors raw data, but it doesn't apply schemas or types. This decoupling makes schema validation less risky because no data can be lost due to schema mismatch or unknown fields.
The schema-views script creates a BigQuery view, based on a JSON schema configuration file, using BigQuery's built-in JSON functions. The Stream Firestore to BigQuery extension also provides some BigQuery user-defined functions that are helpful in converting Firestore document properties to richly-typed BigQuery cells.
The following steps are an example of how to use the schema-views script. In the sections at the end of this file, you can review detailed information about configuring a schema file and reviewing the resulting schema views.
The schema-views script runs against "schema files" which specify your schema configurations in a JSON format.
In any directory, create a schema file called test_schema.json
that contains the following:
{
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "age",
"type": "number"
}
]
}
SQL has a number of reserved keywords that can cause conflicts when creating a schema, timestamp
is one such example. To ensure your Firestore document field names do not conflict, use the column_name
option to override the field name.
Please see the example below...
{
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "age",
"type": "number",
"column_name": "new_column_name"
}
]
}
Learn How to configure schema files later in this guide.
The schema-views script uses Application Default Credentials to communicate with BigQuery.
One way to set up these credentials is to run the following command using the gcloud CLI:
$ gcloud auth application-default login
Alternatively, you can
create and use a service account.
This service account must be assigned a role that grants the permission of
bigquery.jobs.create
, like the "BigQuery Job User" role.
The schema-views script uses the following parameter values from your installation of the extension:
${param:PROJECT_ID}
: the project ID for the Firebase project in which you installed the extension${param:BIGQUERY_PROJECT_ID}
: the project ID for the GCP project that contains the BigQuery instance.${param:DATASET_ID}
: the ID that you specified for your dataset during extension installation${param:TABLE_ID}
: the common prefix of BigQuery views to generate
Run the schema-views script using
npx
(the Node Package Runner)
via npm
(the Node Package Manager).
-
Make sure that you've installed the required tools to run the schema-views script:
- To access the
npm
command tools, you need to install Node.js. - If you use npm v5.1 or earlier, you need to explicitly install
npx
. Runnpm install --global npx
.
- To access the
-
Run the schema-views script via
npx
by running the following command:$ npx @firebaseextensions/fs-bq-schema-views \ --non-interactive \ --project=${param:PROJECT_ID} \ --big-query-project=${param:BIGQUERY_PROJECT_ID} \ --dataset=${param:DATASET_ID} \ --table-name-prefix=${param:TABLE_ID} \ --schema-files=./test_schema.json
Note: You can run the schema-views script from any directory, but you need to specify the path to your schema file using the
--schema-files
flag. To run the schema-views script against multiple schema files, specify each file in a comma-separated list (for example:--schema-files=./test_schema.json,./test_schema2.json
).
-
In the BigQuery web UI, navigate to the generated schema changelog view:
https://console.cloud.google.com/bigquery?project=${param:PROJECT_ID}&p=${param:PROJECT_ID}&d=${param:DATASET_ID}&t=${param:TABLE_ID}_schema_test_schema_changelog&page=table
.This view allows you to query document change events by fields specified in the schema.
-
In the Firebase console, go to the Cloud Firestore section, then create a document called
test-schema-document
with two fields:- A field of type
string
called "name" - A field of type
number
called "age"
- A field of type
-
Back in BigQuery, run the following query in the schema changelog view (that is,
https://console.cloud.google.com/bigquery?project=${param:PROJECT_ID}&p=${param:PROJECT_ID}&d=${param:DATASET_ID}&t=${param:TABLE_ID}_schema_test_schema_changelog&page=table
):SELECT document_name, name, age FROM ${param:PROJECT_ID}.${param:DATASET_ID}.${param:TABLE_ID}_schema_test_schema_changelog WHERE document_name = "test-schema-document"
-
Go back to the Cloud Firestore section of the console, then change the type of the "age" field to be a string.
-
Back in BigQuery, re-run the following query:
SELECT document_name, name, age FROM ${param:PROJECT_ID}.${param:DATASET_ID}.${param:TABLE_ID}_schema_test_schema_changelog WHERE document_name = "test-schema-document"
You'll see a new change with a
null
age column. If you query documents that don't match the schema, then the view contains null values for the corresponding schema fields. -
Back in the Cloud Firestore section in the console, delete
test-schema-document
. -
(Optional) As with the raw views, you can also query events on the view of the documents currently in the collection by using the latest schema view (that is,
https://console.cloud.google.com/bigquery?project=${param:PROJECT_ID}&p=${param:PROJECT_ID}&d=${param:DATASET_ID}&t=${param:COLLECTION_PATH}_schema_test_schema_latest&page=table
.Back in BigQuery, if you run the following query, you'll receive no results because the document no longer exists in the Cloud Firestore collection.
SELECT document_name, name, age FROM ${param:PROJECT_ID}.${param:DATASET_ID}.${param:TABLE_ID}_schema_test_schema_latest WHERE document_name = "test-schema-document"
- Create your own schema files
- Troubleshoot common issues
- Learn about the columns in a schema view
- Take a look at more SQL examples
To generate schema views of your raw changelog, you must create at least one schema JSON file.
Here's an example of a configuration that a schema file might contain:
{
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "favorite_numbers",
"type": "array"
},
{
"name": "last_login",
"type": "timestamp"
},
{
"name": "last_location",
"type": "geopoint"
},
{
"name": "geo_point",
"type": "stringified_map"
},
{
"fields": [
{
"name": "name",
"type": "string"
}
],
"name": "friends",
"type": "map"
}
]
}
The root of the configuration must have a fields
array that contains objects
which describe the elements in the schema. If one of the objects is of type
map
, it must specify its own fields
array describing the members of that
map.
Each fields
array must contain at least one of the following types:
string
array
map
boolean
number
timestamp
geopoint
reference
null
stringified_map
All but stringified_map
correspond with Cloud Firestore's
supported data types.
Make sure that the other types specified match the types of the fields in your
Cloud Firestore collection.
The stringified_map
type will create a JSON string out of a map in your Firestore record. This
could be useful if you are not sure of the properties in the map ahead of time, and thus
cannot specify in your schema.
You may create any number of schema files to use with the schema-views script. The schema-views script generates the following views for each schema file:
${param:PROJECT_ID}.${param:DATASET_ID}.${param:TABLE_ID}_schema_${SCHEMA_FILE_NAME}_changelog
${param:PROJECT_ID}.${param:DATASET_ID}.${param:TABLE_ID}_schema_${SCHEMA_FILE_NAME}_latest
Here, ${SCHEMA_FILE_NAME}
is the name of each schema file that you provided as
an argument to run the schema-views script.
Be aware of the following common mistakes when configuring a schema file:
Mistake in schema file config | Outcome of mistake |
---|---|
Omitting a relevant field | The generated view will not contain a column for that field. |
Specifying the wrong type for a relevant field | Type conversion (see previous section) will fail and the resulting column
will contain a BigQuery null value in lieu of the desired
value. |
Specifying a schema field that doesn't exist in the underlying raw changelog | Querying the column for that field will return a BigQuery null
value instead of the desired value. |
Writing invalid JSON | The schema-views script cannot generate a view |
Since all document data is stored in the schemaless changelog, mistakes in schema configuration don't affect the underlying data and can be resolved by re-running the schema-views script against an updated schema file.
-
${param:PROJECT_ID}.${param:DATASET_ID}.${param:TABLE_ID}_schema_${SCHEMA_FILE_NAME}_changelog
This view is a table which contains all rows present in the raw changelog. This view is analogous to the raw change-log, only it has typed columns corresponding to fields of the schema.
-
${param:PROJECT_ID}.${param:DATASET_ID}.${param:TABLE_ID}_schema_${SCHEMA_FILE_NAME}_latest
This view stores typed rows for the documents currently in the collection. This view is analogous to the "latest" view on the raw changelog, only it includes the typed columns corresponding to fields in the corresponding schema file.
Since
GEOGRAPHY
fields are not groupable entities in BigQuery (and the query which builds the latest view of a collection of documents requires grouping on the schema columns), the latest schema omits anyGEOGRAPHY
columns and, instead, splits them out into twoNUMERIC
columns called${FIRESTORE_GEOPOINT}_latitude
and${FIRESTORE_GEOPOINT}_longitude
.
Each schema view carries with it the following fields from the raw changelog:
document_name STRING REQUIRED
timestamp TIMESTAMP REQUIRED
operation STRING REQUIRED
The remaining columns correspond to fields of the schema and are assigned types
based on the corresponding Cloud Firestore types those fields have. With the
exception of map
and array
, the type conversion scheme is as follows:
Cloud Firestore type | BigQuery type |
---|---|
string | STRING |
boolean | BOOLEAN |
number | NUMERIC |
timestamp | TIMESTAMP |
geopoint | GEOGRAPHY |
reference | STRING (containing the path to the referenced document) |
null | NULL |
Cloud Firestore maps are interpreted recursively. If you include a map in your
schema configuration, the resulting view will contain columns for whatever
fields that map contains. If the map doesn't contain any fields, the map is
ignored by the schema-views script. If using the stringified_map
type, the
map will be stringified.
Review these examples for querying an array.
Cloud Firestore arrays are
unnested,
so all array fields of the document are
cross joined
in the output table. The view retains the member and offset columns, which are
called ${FIRESTORE_ARRAY_NAME}_member
and ${FIRESTORE_ARRAY_NAME}_index
,
respectively. To make querying easier, the view includes these two columns
instead of the original ARRAY
value field.
If the array is empty, it will be ignored by the schema-views script.