Added info about how to reorder the columns to adjust a schema #1364

Merged
merged 6 commits into from
May 23, 2024
@@ -5,13 +5,13 @@
keywords: [destination, credentials, example, bigquery, custom destination]
---

In this example, you'll find a Python script that demonstrates how to load to bigquey with the custom destination.
In this example, you'll find a Python script that demonstrates how to load to BigQuery with the custom destination.

We'll learn how to:
- use [built-in credentials](../general-usage/credentials/config_specs#gcp-credentials)
- use the [custom destination](../dlt-ecosystem/destinations/destination.md)
- Use pyarrow tables to create complex column types on bigquery
- Use bigquery `autodetect=True` for schema inference from parquet files
- Use [built-in credentials.](../general-usage/credentials/config_specs#gcp-credentials)
- Use the [custom destination.](../dlt-ecosystem/destinations/destination.md)
- Use pyarrow tables to create complex column types on BigQuery.
- Use BigQuery `autodetect=True` for schema inference from parquet files.

"""

@@ -38,7 +38,7 @@
def resource(url: str):
# load pyarrow table with pandas
table = pa.Table.from_pandas(pd.read_csv(url))
# we add a list type column to demontrate bigquery lists
# we add a list type column to demonstrate bigquery lists
table = table.append_column(
"tags",
pa.array(
@@ -57,12 +57,12 @@ def resource(url: str):
yield table


# dlt biquery custom destination
# dlt bigquery custom destination
# we can use the dlt provided credentials class
# to retrieve the gcp credentials from the secrets
@dlt.destination(name="bigquery", loader_file_format="parquet", batch_size=0)
@dlt.destination(name="bigquery", loader_file_format="parquet", batch_size=0, naming_convention="snake_case")
def bigquery_insert(
items, table, credentials: GcpServiceAccountCredentials = dlt.secrets.value
items, table=BIGQUERY_TABLE_ID, credentials: GcpServiceAccountCredentials = dlt.secrets.value
) -> None:
client = bigquery.Client(
credentials.project_id, credentials.to_native_credentials(), location="US"
@@ -74,7 +74,7 @@ def bigquery_insert(
)
# since we have set the batch_size to 0, we get a filepath and can load the file directly
with open(items, "rb") as f:
load_job = client.load_table_from_file(f, BIGQUERY_TABLE_ID, job_config=job_config)
load_job = client.load_table_from_file(f, table, job_config=job_config)
load_job.result() # Waits for the job to complete.


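For orientation, here is a minimal sketch of how a custom destination like the one above might be wired into a pipeline. The pipeline name and CSV URL are illustrative placeholders, not part of this diff, and the sketch assumes the `resource` and `bigquery_insert` definitions shown above:

```py
# illustrative usage of the decorated custom destination
pipeline = dlt.pipeline("csv_to_bigquery_example", destination=bigquery_insert)

# run the resource through the custom destination; because batch_size is 0 and the
# loader file format is parquet, dlt hands each parquet file path to bigquery_insert
load_info = pipeline.run(resource("https://example.com/data.csv"))
print(load_info)
```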
26 changes: 26 additions & 0 deletions docs/website/docs/walkthroughs/adjust-a-schema.md
@@ -121,6 +121,32 @@ Do not rename the tables or columns in the yaml file. `dlt` infers those from th
You can [adjust the schema](../general-usage/resource.md#adjust-schema) in Python before the resource is loaded.
:::

### Reorder columns
To reorder the columns in your dataset, follow these steps:

1. Initial Run: Execute the pipeline to obtain the import and export schemas.
1. Modify Export Schema: Adjust the column order as desired in the export schema.
1. Sync Import Schema: Ensure that these changes are mirrored in the import schema to maintain consistency.
1. Delete Dataset: Remove the existing dataset to prepare for the reload.
1. Reload Data: Reload the data. The dataset should now reflect the new column order as specified in the import YAML.

These steps ensure that the column order in your dataset matches the order defined in your schema files; a minimal sketch of this setup is shown below.
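As an illustration, here is a minimal sketch of that flow. It assumes a local DuckDB destination, a toy resource, and `schemas/export` / `schemas/import` folders next to the script; all names are placeholders to adapt to your own pipeline.

```py
import dlt

@dlt.resource(name="items")
def my_resource():
    # toy data; replace with your real resource
    yield [{"column1": 1, "column2": "a", "column3": True}]

pipeline = dlt.pipeline(
    pipeline_name="reorder_columns_demo",  # placeholder name
    destination="duckdb",                  # assumed destination for this sketch
    dataset_name="my_dataset",
    export_schema_path="schemas/export",   # full schema is written here after each run
    import_schema_path="schemas/import",   # edited schema is picked up from here
)

# 1. initial run: creates the export and import schema files
pipeline.run(my_resource())

# 2.-4. reorder the columns in the exported yaml, mirror the change in the
#       import schema, and drop the existing dataset in the destination
# 5. reload: the dataset is recreated with the column order from the import schema
pipeline.run(my_resource())
```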

**Another approach** to reorder columns is to use the `add_map` function. For instance, to rearrange `column1`, `column2`, and `column3`, you can proceed as follows:

```py
# Define the data source and reorder columns using add_map
data_source = resource().add_map(lambda row: {
'column3': row['column3'],
'column1': row['column1'],
'column2': row['column2']
})

# Run the pipeline
load_info = pipeline.run(data_source)
```

In this example, the `add_map` function reorders columns by defining a new mapping. The lambda function specifies the desired order by rearranging the key-value pairs. When the pipeline runs, the data will load with the columns in the new order.

### Load data as json instead of generating child table or columns from flattened dicts
