### Load data
One-time load of the BlackCat 2022 NTD reports into BigQuery for bookkeeping purposes. This will not be automated because we only need to do it once.
The data is transferred from a GCS bucket to BQ: Python reads the Excel file from GCS into this notebook, then copies it worksheet by worksheet into BQ.
   
**First**, we must create a dataset in the BQ GUI. The tables created below are loaded into that dataset.
* Created `blackcat_raw` in `cal-itp-data-infra`
  
In this notebook we create empty tables, specifying their schemas based on their column names and data types. We will then load the data into them.

The code here was used to write the Python script `blackcat_2022_ntdreports_toBQ.py`.

In [1]:
import pandas as pd
from google.cloud import bigquery


In [2]:
GCS_FILE_PATH_RAW = "gs://calitp-ntd-report-validation/blackcat_ntd_reports_2022_raw"

def load_excel_data(sheetname):
    df = pd.read_excel(f"{GCS_FILE_PATH_RAW}/NTD_Annual_Report_Rural_2022.xlsx",
                        sheet_name=sheetname,
                        index_col=None)
    return df

In [3]:
# Get data from GCS
rr20_service =  load_excel_data(sheetname="Service Data")
rr20_exp_by_mode = load_excel_data(sheetname="Expenses By Mode")
rr20_rev_by_mode = load_excel_data(sheetname="Revenues By Mode")
rr20_fin = load_excel_data(sheetname="Financials - 2")
rr20_safety = load_excel_data(sheetname="Safety")
rr20_other = load_excel_data(sheetname="Other Resources")
rr20_contactinfo = load_excel_data(sheetname="Basics.Contacts")

In [74]:
pd.set_option('display.max_rows', None)
rr20_exp_by_mode

Unnamed: 0,OrganizationLegalName,CommonName_Acronym_DBA,FiscalYear,Operating_Capital,Mode,TotalAnnualExpensesByMode
0,Alpine County Community Development,,2022,Capital,Demand Response (DR) - (DO),0.0
1,Alpine County Community Development,,2022,Operating,Demand Response (DR) - (DO),75944.0
2,Amador Transit,,2022,Capital,Commuter Bus (CB) - (DO),0.0
3,Amador Transit,,2022,Capital,Demand Response (DR) - (DO),0.0
4,Amador Transit,,2022,Capital,Deviated Fixed Route (DF) - (DO),0.0
5,Amador Transit,,2022,Operating,Commuter Bus (CB) - (DO),208171.0
6,Amador Transit,,2022,Operating,Demand Response (DR) - (DO),208246.0
7,Amador Transit,,2022,Operating,Deviated Fixed Route (DF) - (DO),980081.0
8,Calaveras Transit Agency,CTA,2022,Capital,Demand Response (DR) - (PT),50189.0
9,Calaveras Transit Agency,CTA,2022,Capital,Deviated Fixed Route (DF) - (PT),21894.0


In [4]:
# Load into the "blackcat_raw" BQ tables. The data is *not* modified from the original here;
# when it is modified, the results are saved into "_parsed" BQ tables.
# Construct a BigQuery client object.
client = bigquery.Client()

In [43]:
# Set table_id to the ID of the table to create.
#----------- Service
table_id = "cal-itp-data-infra.blackcat_raw.2022_rr20_service"
# Remove spaces and replace slashes with underscores in column names - they are illegal in BQ.
rr20_service.columns = rr20_service.columns.str.replace(' ', '')
rr20_service.columns = rr20_service.columns.str.replace('/', '_')

columns = rr20_service.columns.values
columns

array(['OrganizationLegalName', 'CommonName_Acronym_DBA', 'FiscalYear',
       'Mode', 'AnnualVRM', 'AnnualVRH', 'AnnualUPT', 'SponsoredUPT',
       'VOMX'], dtype=object)
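
The two `str.replace` calls above cover the characters that occur in this workbook. A more general cleanup could look like the sketch below. `sanitize_column_name` is a hypothetical helper, not part of the original script, and it assumes the conventional BigQuery rule that column names use only letters, digits, and underscores and do not start with a digit. The sample headers are illustrative guesses at the spreadsheet's spelling.

```python
import re


def sanitize_column_name(name: str) -> str:
    """Hypothetical helper: make a spreadsheet header usable as a BQ column name."""
    # Drop whitespace entirely, matching the notebook's handling of spaces.
    cleaned = re.sub(r"\s+", "", name)
    # Replace every other character outside [A-Za-z0-9_] (e.g. "/") with "_".
    cleaned = re.sub(r"[^0-9A-Za-z_]", "_", cleaned)
    # Assumed rule: a column name should not begin with a digit.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


# Illustrative headers; the workbook's exact spelling is an assumption.
headers = ["Organization Legal Name", "Common Name/Acronym/DBA", "Annual VRM"]
clean_headers = [sanitize_column_name(h) for h in headers]
print(clean_headers)
```

With a DataFrame, the same helper could be applied as `df.columns = [sanitize_column_name(c) for c in df.columns]`.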

In [44]:
########### Make dict of colname: BQ type
# If a col's dtype is object, it's a string. If all col % 1 == 0, int;
# if col is numeric but not all % 1 == 0, float.

#------ testing
rr20_service.dtypes
# rr20_service['AnnualVRM'].dtypes
# all(x.is_integer() for x in rr20_service['AnnualVRM'])


OrganizationLegalName      object
CommonName_Acronym_DBA     object
FiscalYear                  int64
Mode                       object
AnnualVRM                 float64
AnnualVRH                 float64
AnnualUPT                 float64
SponsoredUPT              float64
VOMX                      float64
dtype: object

In [45]:
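# Map each column's pandas dtype to a BigQuery type.
# Columns with any other dtype are skipped and will not appear in the schema.
schema_dict = {}
for x in columns:
    if rr20_service[x].dtypes == 'float64':
        schema_dict[x] = "FLOAT64"
    elif rr20_service[x].dtypes == 'int64':
        schema_dict[x] = "INT64"
    elif rr20_service[x].dtypes == 'object':
        schema_dict[x] = "STRING"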

In [46]:
schema_dict

{'OrganizationLegalName': 'STRING',
 'CommonName_Acronym_DBA': 'STRING',
 'FiscalYear': 'INT64',
 'Mode': 'STRING',
 'AnnualVRM': 'FLOAT64',
 'AnnualVRH': 'FLOAT64',
 'AnnualUPT': 'FLOAT64',
 'SponsoredUPT': 'FLOAT64',
 'VOMX': 'FLOAT64'}
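
The `if`/`elif` chain used to build this dict only recognises `float64`, `int64`, and `object`, so a column with any other dtype would be left out of the schema without warning. A minimal sketch of a stricter mapping is shown below. `bq_type_for` and `DTYPE_TO_BQ` are hypothetical names, and the `bool` entry is an added assumption that does not come from the notebook.

```python
# Mapping from pandas dtype names to BigQuery types. The first three pairs
# come from the notebook; "bool" is an added assumption for illustration.
DTYPE_TO_BQ = {
    "float64": "FLOAT64",
    "int64": "INT64",
    "object": "STRING",
    "bool": "BOOL",
}


def bq_type_for(dtype_name: str) -> str:
    """Hypothetical helper: return the BQ type for a pandas dtype name."""
    try:
        return DTYPE_TO_BQ[dtype_name]
    except KeyError:
        # Fail loudly instead of silently dropping the column.
        raise ValueError(f"No BigQuery type mapped for dtype {dtype_name!r}") from None


# dtype names as reported by rr20_service.dtypes for three of the columns
dtypes = {
    "OrganizationLegalName": "object",
    "FiscalYear": "int64",
    "AnnualVRM": "float64",
}
schema_dict = {col: bq_type_for(dt) for col, dt in dtypes.items()}
print(schema_dict)
```

With a real DataFrame, the input could be built as `{col: str(dt) for col, dt in df.dtypes.items()}`.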

In [47]:
schema = []
for k, v in schema_dict.items():
    schema.append(bigquery.SchemaField(k, v, mode="REQUIRED")) 
    

In [48]:
schema

[SchemaField('OrganizationLegalName', 'STRING', 'REQUIRED', None, None, (), None),
 SchemaField('CommonName_Acronym_DBA', 'STRING', 'REQUIRED', None, None, (), None),
 SchemaField('FiscalYear', 'INT64', 'REQUIRED', None, None, (), None),
 SchemaField('Mode', 'STRING', 'REQUIRED', None, None, (), None),
 SchemaField('AnnualVRM', 'FLOAT64', 'REQUIRED', None, None, (), None),
 SchemaField('AnnualVRH', 'FLOAT64', 'REQUIRED', None, None, (), None),
 SchemaField('AnnualUPT', 'FLOAT64', 'REQUIRED', None, None, (), None),
 SchemaField('SponsoredUPT', 'FLOAT64', 'REQUIRED', None, None, (), None),
 SchemaField('VOMX', 'FLOAT64', 'REQUIRED', None, None, (), None)]
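
Every field above is marked `REQUIRED`, which does not allow NULL values. Some sheets in this workbook contain blanks (for example, `CommonName_Acronym_DBA` is empty for several agencies in the expenses output shown earlier), so choosing the mode per column may be safer. The sketch below uses a hypothetical `column_mode` helper on plain Python lists with illustrative values; with pandas, `df[col].isna().any()` would answer the same question.

```python
import math


def column_mode(values) -> str:
    """Hypothetical helper: 'NULLABLE' if any value is missing, else 'REQUIRED'."""
    for value in values:
        # Treat None and float NaN (how pandas represents empty cells) as missing.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return "NULLABLE"
    return "REQUIRED"


# Illustrative values modelled on the expenses sheet output
acronyms = [float("nan"), float("nan"), "CTA"]
fiscal_years = [2022, 2022, 2022]
print(column_mode(acronyms))
print(column_mode(fiscal_years))
```

The result could then be passed as `mode=column_mode(df[col])` when constructing each `bigquery.SchemaField`.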

In [None]:
# # from BQ example page:
# schema = [
#         bigquery.SchemaField(x, "STRING", mode="REQUIRED"),
#         bigquery.SchemaField("age", "INTEGER", mode="REQUIRED"),
#     ]

In [49]:
table = bigquery.Table(table_id, schema=schema)

In [50]:
table

Table(TableReference(DatasetReference('cal-itp-data-infra', 'blackcat_raw'), '2022_rr20_service'))

In [51]:
table = client.create_table(table)  # Make an API request.
# print(
#     "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
# )

## Loop over the remaining tables to repeat the above actions

In [62]:
# tables_to_load = [rr20_service, rr20_exp_by_mode, rr20_rev_by_mode,
#                  rr20_fin, rr20_safety, rr20_other, rr20_contactinfo]

dfdict = {"rr20_exp_by_mode": rr20_exp_by_mode, 
          "rr20_rev_by_mode": rr20_rev_by_mode,
          "rr20_fin": rr20_fin, 
          "rr20_safety": rr20_safety, 
          "rr20_other": rr20_other, 
          "rr20_contactinfo": rr20_contactinfo}

In [63]:
for name, df in dfdict.items():
    table_id = f"cal-itp-data-infra.blackcat_raw.2022_{name}"
    # Remove spaces and replace slashes with underscores in col names
    df.columns = df.columns.str.replace(' ', '')
    df.columns = df.columns.str.replace('/', '_')
    columns = df.columns.values

    # Map each column's pandas dtype to a BigQuery type
    schema_dict = {}
    for x in columns:
        if df[x].dtypes == 'float64':
            schema_dict[x] = "FLOAT64"
        elif df[x].dtypes == 'int64':
            schema_dict[x] = "INT64"
        elif df[x].dtypes == 'object':
            schema_dict[x] = "STRING"

    # Use distinct loop names so the outer loop variables are not shadowed
    schema = []
    for col, bq_type in schema_dict.items():
        schema.append(bigquery.SchemaField(col, bq_type))

    table = bigquery.Table(table_id, schema=schema)
    table = client.create_table(table)
    # table.table_id is the bare table name; table_id above is the full path
    print(f"Created table {table.project}.{table.dataset_id}.{table.table_id}")

Created table cal-itp-data-infra.blackcat_raw.2022_rr20_exp_by_mode
Created table cal-itp-data-infra.blackcat_raw.2022_rr20_rev_by_mode
Created table cal-itp-data-infra.blackcat_raw.2022_rr20_fin
Created table cal-itp-data-infra.blackcat_raw.2022_rr20_safety
Created table cal-itp-data-infra.blackcat_raw.2022_rr20_other
Created table cal-itp-data-infra.blackcat_raw.2022_rr20_contactinfo


Note: I had to delete these tables because I forgot this part, which sets `mode="REQUIRED"` on each field:

```
schema = []
for k, v in schema_dict.items():
    schema.append(bigquery.SchemaField(k, v, mode="REQUIRED"))
```

and they were made with identical schemas. So I did **not** rerun the loop. Instead I rolled it all into the file `blackcat_2022_ntdreports_toBQ.py`, where the tables are made **and loaded** in a for loop.

### Load data into newly created tables

In [70]:
# Try loading the RR-20 service table without specifying the schema in the load job

table_id = "cal-itp-data-infra.blackcat_raw.2022_rr20_service"

job_service = client.load_table_from_dataframe(
    rr20_service, table_id
)  
job_service.result()  # Wait for the job to complete.

LoadJob<project=cal-itp-data-infra, location=us-west2, id=9703c92b-bd3e-47b1-9e7c-2efbd8140f81>

In [71]:
table = client.get_table(table_id)  # Make an API request.

In [72]:
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id))

Loaded 89 rows and 9 columns to cal-itp-data-infra.blackcat_raw.2022_rr20_service
