## Advanced Read/Write IO

Pandas provides built-in methods to read from and write to most prominent Big Data file formats and storage systems, which makes it one of the standard tools for converting data formats and loading data.

<br/>

### Writing to Cloud BigQuery

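One of the key applications of pandas is transforming data files and loading them into Big Data / Cloud tools for analytics. Pandas provides the `DataFrame.to_gbq()` method, which relies on the `pandas-gbq` package, to load DataFrames into BigQuery. In pandas 2.2 and later this method is deprecated in favor of calling `pandas_gbq.to_gbq()` directly.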

The example below shows how you can use the `.to_gbq()` method to load data into BigQuery.

:::tip .to_gbq() Performance

Use `.to_gbq()` for smaller data loads, ideally in the MB range and typically under 1 GB. The upload mechanism behind this method is not designed for large data loads. For loads in the GB range, writing the data directly to GCS and using BigQuery external tables is the preferred approach.

:::
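For loads in the GB range, the tip above points to GCS plus a BigQuery external table. The sketch below only builds the DDL statement for such a table and does not contact GCP. It assumes the CSV files have already been uploaded to a bucket. The helper name `external_table_ddl`, the bucket `my-bucket`, and the table name `flights_ext` are hypothetical placeholders.

```python
def external_table_ddl(project, dataset, table, gcs_uri, skip_header_rows=1):
    """Build a BigQuery DDL statement for a CSV-backed external table.

    Hypothetical helper: it only formats a SQL string and never calls GCP.
    """
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{project}.{dataset}.{table}`\n"
        f"OPTIONS (\n"
        f"  format = 'CSV',\n"
        f"  uris = ['{gcs_uri}'],\n"
        f"  skip_leading_rows = {skip_header_rows}\n"
        f")"
    )


# Project and dataset reuse the names from this section;
# the bucket and table names are placeholders.
ddl = external_table_ddl(
    project="deb-airliner",
    dataset="airline_data",
    table="flights_ext",
    gcs_uri="gs://my-bucket/flights/*.csv",
)
print(ddl)
```

The resulting statement could then be run from the BigQuery console or a client library of your choice.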

<br/>


In [None]:
import pandas as pd
from google.oauth2 import service_account
import os

# read data from csv
flights = pd.read_csv('../data/flights.csv', header=0)

# check if the GCP credentials file is set
if os.getenv('GOOGLE_APPLICATION_CREDENTIALS', default=None) is None:
    raise RuntimeError("You forgot to set GOOGLE_APPLICATION_CREDENTIALS environment variable!")

# you can explicitly load Google credentials from a service account JSON file
# this is OPTIONAL if GOOGLE_APPLICATION_CREDENTIALS environment variable is set
credentials = service_account.Credentials.from_service_account_file(
                    os.getenv('GOOGLE_APPLICATION_CREDENTIALS'))

# schema is used to map dataframe fields to BigQuery data types
# field data types should be defined as: https://cloud.google.com/bigquery/docs/schemas
schema = [
    {'name': 'airline', 'type': 'STRING'},
    {'name': 'src', 'type': 'STRING'},
    {'name': 'dest', 'type': 'STRING'},
    {'name': 'flight_number', 'type': 'STRING'},
    {'name': 'departure_time', 'type': 'STRING'},
    {'name': 'arrival_time', 'type': 'STRING'},
]
# GCP project name, BigQuery dataset and table names
# EDIT values below based on your GCP environment
project = 'deb-airliner'
dataset = 'airline_data'
table = 'pandas_flights'
# filter output dataframe
gbq_df = flights[['airline', 'src', 'dest', 
                  'flight_number', 'departure_time', 'arrival_time']]
# write to bigquery using .to_gbq()
gbq_df.to_gbq(
    destination_table=f"{dataset}.{table}",
    project_id=project,
    chunksize=2000,
    if_exists='replace',
    table_schema=schema,
    progress_bar=False,
    credentials=credentials,
)
print('done')
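
Each entry in `schema` names a DataFrame field, so a misspelled name is easy to miss until the upload runs. The sketch below shows one possible pre-upload check. `schema_mismatches` is a hypothetical helper, not part of pandas, and it takes a plain list of column names so that it runs without pandas or a GCP connection.

```python
def schema_mismatches(columns, schema):
    """Compare column names against a to_gbq-style schema list.

    Returns (missing, extra): columns that have no schema entry, and
    schema entries that name no column. Input order is preserved.
    """
    schema_names = [field["name"] for field in schema]
    missing = [col for col in columns if col not in schema_names]
    extra = [name for name in schema_names if name not in columns]
    return missing, extra


# Same field names as the example above; with a DataFrame you would pass
# list(gbq_df.columns) instead of this literal list.
columns = ["airline", "src", "dest",
           "flight_number", "departure_time", "arrival_time"]
schema = [{"name": name, "type": "STRING"} for name in columns]

missing, extra = schema_mismatches(columns, schema)
print(missing, extra)

# A misspelled schema entry shows up on both sides of the comparison.
bad_schema = schema[:-1] + [{"name": "arival_time", "type": "STRING"}]
print(schema_mismatches(columns, bad_schema))
```

Raising an error when either list is non-empty, before calling `.to_gbq()`, would stop the load early instead of after data has been sent.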