# Field Decomposition

In this notebook, we decompose fields from the raw tables whose values contain more than one property. Specifically, two description fields, one in `air_carriers` and one in `bird_airports`, need to be split into their individual components.


In [None]:
%%bigquery
select * from airline_raw.air_carriers limit 5

Unnamed: 0,code,description,load_time
0,19031,Mackey International Inc.: MAC,2024-01-26 22:23:22.051288+00:00
1,19032,Munz Northern Airlines Inc.: XY,2024-01-26 22:23:22.051288+00:00
2,19033,Cochise Airlines Inc.: COC,2024-01-26 22:23:22.051288+00:00
3,19034,Golden Gate Airlines Inc.: GSA,2024-01-26 22:23:22.051288+00:00
4,19035,Aeromech Inc.: RZZ,2024-01-26 22:23:22.051288+00:00


In [None]:
%%bigquery
select * from airline_raw.bird_airports limit 5

Unnamed: 0,code,description,load_time
0,CBK,"Colby, KS: Murray",2024-01-22 00:47:19.324417+00:00
1,ANP,"Annapolis, MD: Lee",2024-01-22 00:47:19.324417+00:00
2,KCA,"Kuqa, China: Kuche",2024-01-22 00:47:19.324417+00:00
3,MOL,"Molde, Norway: Aro",2024-01-22 00:47:19.324417+00:00
4,OIC,"Norwich, NY: Eaton",2024-01-22 00:47:19.324417+00:00


# Air_Carrier

Split up the description field from the raw `air_carriers` table and create a staging table from the results:

In [None]:
%%bigquery
select description, description_array[0] as airline_name, description_array[1] as airline_code
from
(select description, split(description, ':') as description_array
from airline_raw.air_carriers
limit 5)

Unnamed: 0,description,airline_name,airline_code
0,Mackey International Inc.: MAC,Mackey International Inc.,MAC
1,Munz Northern Airlines Inc.: XY,Munz Northern Airlines Inc.,XY
2,Cochise Airlines Inc.: COC,Cochise Airlines Inc.,COC
3,Golden Gate Airlines Inc.: GSA,Golden Gate Airlines Inc.,GSA
4,Aeromech Inc.: RZZ,Aeromech Inc.,RZZ
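
As a rough illustration of the split above, the following pure-Python sketch shows what splitting on `:` produces for one of the sample rows. It suggests that the part after the colon keeps its leading space unless it is trimmed, so the `airline_code` values in SQL may need a `trim(...)` as well; whether that is the case in the staging table is not visible from the output shown here. The helper name `split_carrier_description` is hypothetical, and its handling of descriptions without a colon is an assumption.

```python
def split_carrier_description(description):
    """Return (airline_name, airline_code), both stripped of whitespace."""
    # rpartition splits on the LAST colon, so a colon inside the name survives.
    name, sep, code = description.rpartition(":")
    if not sep:
        # Assumption: with no colon, treat the whole string as the name.
        return description.strip(), None
    return name.strip(), code.strip()


raw = "Mackey International Inc.: MAC"
print(raw.split(":"))                  # ['Mackey International Inc.', ' MAC']
print(split_carrier_description(raw))  # ('Mackey International Inc.', 'MAC')
```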


In [None]:
%%bigquery
create or replace table airline_stg.Air_Carrier as
  select airline_id, description_array[0] as airline_name, description_array[1] as airline_code, 'bird' as data_source, load_time
  from
  (select code as airline_id, split(description, ':') as description_array, load_time
  from airline_raw.air_carriers)

Verify that the raw and staging tables have the same record count (i.e., no records should be lost between raw and staging):

In [None]:
%%bigquery
select (select count(*) from airline_raw.air_carriers) as raw_count,
(select count(*) from airline_stg.Air_Carrier) as staging_count

Unnamed: 0,raw_count,staging_count
0,1656,1656


# bird_airports

Split up the description from the `bird_airports` table. The description contains either a city, state, and airport name or a city, country, and airport name:

  Example: Austin, TX: Austin - Bergstrom International

In [None]:
%%bigquery
select code, city_state_array[0] as city, city_state_array[1] as state, airport_name
from
(select code, split(description_array[0], ',') as city_state_array, description_array[1] as airport_name
from
(select code, description, split(description, ':') as description_array
from airline_raw.bird_airports
limit 5))

Unnamed: 0,code,city,state,airport_name
0,CBK,Colby,KS,Murray
1,ANP,Annapolis,MD,Lee
2,KCA,Kuqa,China,Kuche
3,MOL,Molde,Norway,Aro
4,OIC,Norwich,NY,Eaton


The query works on this five-row sample, but the same logic doesn't get us very far on the full table. It fails with `400 Array index 1 is out of bounds` because the description doesn't always contain two elements:

In [None]:
%%bigquery
create or replace table airline_stg.bird_airport as
  select code, airport_name, city_state_array[0] as city, city_state_array[1] as state, 'bird' as data_source, load_time
  from
    (select code, split(description_array[0], ',') as city_state_array, description_array[1] as airport_name, load_time
    from
      (select code, split(description, ':') as description_array, load_time
      from airline_raw.bird_airports))

Executing query with job ID: 10d93347-72fb-4b3d-9649-749962f3b9b1
Query executing: 0.60s


ERROR:
 400 Array index 1 is out of bounds (overflow)

Location: US
Job ID: 10d93347-72fb-4b3d-9649-749962f3b9b1
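
The failure can be reproduced outside BigQuery. The sketch below splits a well-formed description and one with no colon (`Unknown Point in Alaska`, the value reported by the Python cell later in this notebook) and shows that the second yields a single-element list, so index 1 does not exist. The helper `safe_get` is hypothetical; it mirrors the idea behind BigQuery's `SAFE_OFFSET`, which returns NULL instead of raising an error for an out-of-range index.

```python
def safe_get(items, index, default=None):
    """Return items[index], or default when the index is out of range."""
    return items[index] if -len(items) <= index < len(items) else default


good = "Austin, TX: Austin - Bergstrom International".split(":")
bad = "Unknown Point in Alaska".split(":")

print(len(good), len(bad))  # 2 1

# Plain indexing fails on the short list, just like the SQL query does.
try:
    bad[1]
except IndexError as e:
    print("IndexError:", e)

print(safe_get(bad, 1))     # None
```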



Instead of transforming the data with SQL, we'll switch to Python so that we end up with simpler and more maintainable code. We'll write the results out as JSON records and load them into a staging table.
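
As a minimal sketch of the serialization step, assuming timestamps arrive as timezone-aware `datetime` objects, the snippet below turns one record into a single JSON line. The sample values are taken from the first `bird_airports` row shown earlier; the helper name `to_json_line` is hypothetical.

```python
import datetime
import json


def to_json_line(record):
    """Serialize one record to a JSON string, writing datetimes in ISO 8601."""
    def default(obj):
        if isinstance(obj, datetime.datetime):
            return obj.isoformat()
        raise TypeError("Type not serializable")
    return json.dumps(record, default=default)


load_time = datetime.datetime(2024, 1, 22, 0, 47, 19, 324417,
                              tzinfo=datetime.timezone.utc)
record = {"code": "CBK", "name": "Murray", "load_time": load_time}

print(to_json_line(record))
# {"code": "CBK", "name": "Murray", "load_time": "2024-01-22T00:47:19.324417+00:00"}
```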


In [None]:
import json, datetime
from google.cloud import bigquery

project_id = "cs329e-sp2024"
raw_dataset_name = "airline_raw"
raw_table_name = "bird_airports"
stg_dataset_name = "airline_stg"
stg_table_name = "bird_airports" # lowercase the name because it's an intermediate table

bird_airports = []
target_table_id = "{}.{}.{}".format(project_id, stg_dataset_name, stg_table_name)

def serialize_datetime(obj):
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    raise TypeError("Type not serializable")

schema = [bigquery.SchemaField("code", "STRING", mode="REQUIRED"),
          bigquery.SchemaField("name", "STRING", mode="NULLABLE"),
          bigquery.SchemaField("city", "STRING", mode="NULLABLE"),
          bigquery.SchemaField("state", "STRING", mode="NULLABLE"),
          bigquery.SchemaField("country", "STRING", mode="NULLABLE"),
          bigquery.SchemaField("data_source", "STRING", mode="REQUIRED"),
          bigquery.SchemaField("load_time", "TIMESTAMP", mode="REQUIRED")]

bq_client = bigquery.Client()
sql = "select code, description, load_time from {}.{}".format(raw_dataset_name, raw_table_name)
query_job = bq_client.query(sql)

for row in query_job:
    code = row["code"]
    description = row["description"]
    load_time = json.dumps(row["load_time"], default=serialize_datetime).replace('"', '')
    city = description.split(",")[0].strip()

    # reset per row so values from the previous row never leak into this one
    state = None
    country = None

    if len(description.split(",")) > 1:
        state_country = description.split(",")[1].split(":")[0].strip()

        # a two-letter uppercase value is treated as a US state code
        if state_country.isupper() and len(state_country) == 2:
            state = state_country
            country = 'US'
        else:
            country = state_country

    else:
        print('state_country is null: ' + description)

    if len(description.split(":")) > 1:
        name = description.split(":")[1].strip()
    else:
        name = None
        print('airport name is null: ' + description)

    record = {"code": code, "name": name, "city": city, "state": state, "country": country, "data_source": "bird", "load_time": load_time}
    bird_airports.append(record)

# load records into staging table
job_config = bigquery.LoadJobConfig(schema=schema, source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON, write_disposition='WRITE_TRUNCATE')
table_ref = bigquery.table.TableReference.from_string(target_table_id)

try:
    job = bq_client.load_table_from_json(bird_airports, table_ref, job_config=job_config)
    job.result()  # wait for the load job to finish before reporting success
    print('Inserted into', stg_table_name, ':', len(bird_airports), 'records')

    if job.errors:
        print('job errors:', job.errors)

except Exception as e:
    print("Error inserting into BQ: {}".format(e))


state_country is null: Unknown Point in Alaska
airport name is null: Unknown Point in Alaska
Inserted into bird_airports : 6510 records
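
The parsing rule used in the loop above can also be isolated into a pure function, which makes it testable without a BigQuery connection. This is only a sketch under stated assumptions: the function name `parse_airport_description` is hypothetical, and unlike the loop it splits on the first colon before splitting on the comma, so a comma inside the airport name cannot affect the city. Everything after the first comma in the location part is treated as the state or country.

```python
def parse_airport_description(description):
    """Split 'City, ST: Airport' or 'City, Country: Airport' into its parts."""
    # Separate the location from the airport name at the first colon.
    location, colon, name = description.partition(":")
    name = name.strip() if colon else None

    # Separate the city from the state/country at the first comma.
    city, comma, region = location.partition(",")
    city = city.strip()
    region = region.strip() if comma else None

    state = country = None
    if region:
        # A two-letter uppercase value is treated as a US state code.
        if len(region) == 2 and region.isupper():
            state, country = region, "US"
        else:
            country = region
    return {"city": city, "state": state, "country": country, "name": name}


print(parse_airport_description("Colby, KS: Murray"))
# {'city': 'Colby', 'state': 'KS', 'country': 'US', 'name': 'Murray'}
print(parse_airport_description("Kuqa, China: Kuche"))
# {'city': 'Kuqa', 'state': None, 'country': 'China', 'name': 'Kuche'}
print(parse_airport_description("Unknown Point in Alaska"))
# {'city': 'Unknown Point in Alaska', 'state': None, 'country': None, 'name': None}
```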


Verify that we ended up with the same record count in the staging table as in the raw table:

In [None]:
%%bigquery
select (select count(*) from airline_raw.bird_airports) as raw_count,
  (select count(*) from airline_stg.bird_airports) as intermediate_stg_count

Unnamed: 0,raw_count,intermediate_stg_count
0,6510,6510


# Primary Key

BigQuery does not enforce primary keys, so the following statement only documents the intended role of the `airline_id` field. We still need to check ourselves that the field behaves like a primary key, i.e., that it contains no duplicate values.

In [None]:
%%bigquery
alter table airline_stg.Air_Carrier
  add primary key (airline_id) not enforced;

In [None]:
%%bigquery
select airline_id, count(*) duplicate_records
from airline_stg.Air_Carrier
group by airline_id
having count(*) > 1
order by count(*) desc

Unnamed: 0,airline_id,duplicate_records


In [None]:
%%bigquery
alter table airline_stg.bird_airports
  add primary key (code) not enforced;

In [None]:
%%bigquery
select code, count(*) duplicate_records
from airline_stg.bird_airports
group by code
having count(*) > 1
order by count(*) desc

Unnamed: 0,code,duplicate_records
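
The duplicate queries above cover uniqueness. A primary key is also expected to be non-null, and a single NULL key would not show up in a `having count(*) > 1` check because it forms a group of one. The sketch below illustrates both checks on an in-memory list of key values; the helper name `primary_key_violations` is hypothetical and the sample keys are for illustration only.

```python
from collections import Counter


def primary_key_violations(keys):
    """Return (duplicates, null_count) for a list of candidate key values."""
    null_count = sum(1 for k in keys if k is None)
    counts = Counter(k for k in keys if k is not None)
    # Keep only the keys that occur more than once, with their counts.
    duplicates = {k: n for k, n in counts.items() if n > 1}
    return duplicates, null_count


print(primary_key_violations(["CBK", "ANP", "KCA"]))        # ({}, 0)
print(primary_key_violations(["CBK", "ANP", "CBK", None]))  # ({'CBK': 2}, 1)
```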
