# COVID-19 Modeling

### Step 1: Create the dataset, then copy the staging table into it with a new "id" column (which we will fill in the next step!)

In [1]:
!bq --location=US mk --dataset covid_19_model

Dataset 'arcane-footing-266618:covid_19_model' successfully created.


In [2]:
%%bigquery
create or replace table covid_19_model.Cases
-- explicit INT64 type so the column can hold the FARM_FINGERPRINT values
as select cast(null as int64) as id, * from covid_19_staging.Cases

### Step 2: Fill the id column using the hash function FARM_FINGERPRINT, which computes a 64-bit hash of the location string, so that each location has an identifier that can serve as a Primary Key

In [4]:
%%bigquery
update covid_19_model.Cases set id = FARM_FINGERPRINT(country) 
where state is null

In [5]:
%%bigquery
update covid_19_model.Cases set id = FARM_FINGERPRINT(concat(state, country)) 
where state is not null
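
The two updates above amount to one rule: hash the country alone when the state is missing, otherwise hash the state and country concatenated. A minimal Python sketch of that rule follows. Note that `hashlib.blake2b` is only a stand-in for FARM_FINGERPRINT (it produces different numbers), and `location_key` is a hypothetical helper name, not part of the notebook. The sketch only illustrates that the key is deterministic and fits in a signed 64-bit integer, like a BigQuery INT64.

```python
import hashlib
from typing import Optional


def location_key(state: Optional[str], country: str) -> int:
    """Deterministic signed 64-bit key for a location.

    blake2b is a stand-in for BigQuery's FARM_FINGERPRINT, so the
    numbers will not match the ids stored in the table.
    """
    # Same rule as the two UPDATE statements: country alone when the
    # state is missing, otherwise state followed directly by country.
    text = country if state is None else state + country
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


# Every daily row for the same location gets the same key.
assert location_key(None, "Afghanistan") == location_key(None, "Afghanistan")

# Plain concatenation has no separator, so two different (state, country)
# pairs can build the same input string and therefore share a key.
# The values "ab"/"c" and "a"/"bc" are made up to show the effect.
assert location_key("ab", "c") == location_key("a", "bc")
```

If that kind of collision were ever a concern, one option would be to place a separator between state and country before hashing; the notebook does not do this.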

Make sure the id field has no null values!

In [6]:
%%bigquery
select count(*) as count_null_ids
from covid_19_model.Cases
where id is null

Unnamed: 0,count_null_ids
0,0


And let's check out what our id field looks like:

In [7]:
%%bigquery
select id, state, country
from covid_19_model.Cases
order by state, country
limit 5

Unnamed: 0,id,state,country
0,8576431891811451300,,Azerbaijan
1,8778414404485170876,,Afghanistan
2,8778414404485170876,,Afghanistan
3,8778414404485170876,,Afghanistan
4,8778414404485170876,,Afghanistan


### Step 3: Split the table because location information (region, state, lat/long) is repeated in every record

We create a Location table because that information is repeated in every daily record. (We will manipulate this table later.)

In [8]:
%%bigquery
create or replace table covid_19_model.Location_SQL_1
as select distinct id, state, country, latitude, longitude, fips, admin2, combined_key
from covid_19_model.Cases

Let's create an Event table as well.

In [9]:
%%bigquery
create or replace table covid_19_model.Event_SQL_1
as select id as location_id, last_update, confirmed, deaths, recovered, active
from covid_19_model.Cases
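
The split performed by the two queries above can be sketched in plain Python, assuming rows are represented as dictionaries. The sample rows, the trimmed column lists, and the `split_cases` helper are all hypothetical and exist only to show the idea: location columns are deduplicated (like `select distinct`), while every row still yields one event that points back to its location through `location_id`.

```python
# Hypothetical rows shaped like covid_19_model.Cases (columns trimmed,
# values made up for illustration).
cases = [
    {"id": 1, "state": None, "country": "CountryA",
     "last_update": "4/5/20 9:00", "confirmed": 10},
    {"id": 1, "state": None, "country": "CountryA",
     "last_update": "4/6/20 9:00", "confirmed": 12},
    {"id": 2, "state": "StateB", "country": "CountryB",
     "last_update": "4/6/20 2:00", "confirmed": 7},
]

LOCATION_COLS = ("id", "state", "country")
EVENT_COLS = ("last_update", "confirmed")


def split_cases(rows):
    """Return (distinct locations, one event per input row)."""
    locations = {}  # dict keys drop duplicates and keep insertion order
    events = []
    for row in rows:
        locations[tuple(row[c] for c in LOCATION_COLS)] = None
        # The event keeps the key under the name location_id, as in
        # "select id as location_id, ..." above.
        event = {"location_id": row["id"]}
        event.update((c, row[c]) for c in EVENT_COLS)
        events.append(event)
    return list(locations), events


locations, events = split_cases(cases)
# Two distinct locations remain, while all three events are kept.
assert len(locations) == 2
assert len(events) == 3
```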

Below, we can see that the format of the `last_update` field changes between old and new records: the year goes from four digits (YYYY) to two digits (YY).

In [12]:
%%bigquery
select * from covid_19_model.Event_SQL_1
order by last_update
limit 5

Unnamed: 0,location_id,last_update,confirmed,deaths,recovered,active
0,3093811823925351433,1/22/2020 17:00,1.0,,,
1,2544652828731166483,1/22/2020 17:00,1.0,,,
2,400699263222839825,1/22/2020 17:00,,,,
3,-8459520092734636284,1/22/2020 17:00,5.0,,,
4,3061248092517028102,1/22/2020 17:00,2.0,,,


In [13]:
%%bigquery
select * from covid_19_model.Event_SQL_1
order by last_update desc
limit 5

Unnamed: 0,location_id,last_update,confirmed,deaths,recovered,active
0,6443493987885756991,4/6/20 9:37,914,4,216,694
1,9155895331965305746,4/6/20 6:20,373,5,57,311
2,-3396123447326000985,4/6/20 5:30,536,6,389,141
3,-6225137598979003815,4/6/20 2:36,139,2,132,5
4,-4927258461359090633,4/6/20 2:21,67803,3212,64014,577
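
Since `last_update` is stored as text at this point (the next step calls `strpos` on it), `order by last_update` compares strings rather than dates. The short sketch below, using two made-up timestamps in the newer slash format, shows why text order and chronological order can disagree; the format string passed to `strptime` is an assumption based on the sample rows above.

```python
from datetime import datetime

# Hypothetical timestamps in the newer slash format.
stamps = ["3/9/20 8:00", "3/10/20 8:00"]

# Compared as text, "3/1..." sorts before "3/9...", so March 10
# lands ahead of March 9.
as_text = sorted(stamps)

# Parsing first gives the chronological order instead.
chronological = sorted(
    stamps, key=lambda s: datetime.strptime(s, "%m/%d/%y %H:%M")
)

assert as_text == ["3/10/20 8:00", "3/9/20 8:00"]
assert chronological == ["3/9/20 8:00", "3/10/20 8:00"]
```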


### Step 4: Standardize the timestamps in the Event table using SQL

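Let's separate the records whose dates use '/' from those that use '-'. The '-' records can be cast to datetime directly, while the '/' records are set aside in their own table. We then count how many records fall into each group.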

In [14]:
%%bigquery
create or replace table covid_19_model.Event_SQL_2 as
select *
from covid_19_model.Event_SQL_1
where strpos(last_update, '/') > 0

In [15]:
%%bigquery
create or replace table covid_19_model.Event_SQL_3 as
select location_id, cast(last_update as datetime) as last_update, confirmed, deaths, recovered, active
from covid_19_model.Event_SQL_1
where strpos(last_update, '-') > 0

In [16]:
%%bigquery
select count(*) as count_timestamp_slash
from covid_19_model.Event_SQL_2

Unnamed: 0,count_timestamp_slash
0,12054


In [17]:
%%bigquery
select count(*) as count_timestamp_dash
from covid_19_model.Event_SQL_3

Unnamed: 0,count_timestamp_dash
0,90609
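
The slash-formatted records in Event_SQL_2 still hold text timestamps. As a rough sketch of what standardizing them involves, the Python function below tries the two slash layouts seen in the sample rows (four-digit and two-digit year) and falls back to ISO parsing for dash-formatted values. The helper name `parse_last_update`, the exact format strings, and the assumption that dash records are ISO-like are illustrative guesses; the real data may contain other variants.

```python
from datetime import datetime

# Layouts inferred from the sample rows: "1/22/2020 17:00" and "4/6/20 9:37".
SLASH_FORMATS = ("%m/%d/%Y %H:%M", "%m/%d/%y %H:%M")


def parse_last_update(value: str) -> datetime:
    """Parse a last_update string into a datetime."""
    if "/" in value:
        for fmt in SLASH_FORMATS:
            try:
                return datetime.strptime(value, fmt)
            except ValueError:
                continue  # try the next layout
        raise ValueError(f"unrecognized slash timestamp: {value!r}")
    # Assumption: dash records look like "2020-02-01T19:43:03" or
    # "2020-02-01 19:43:03", which fromisoformat accepts.
    return datetime.fromisoformat(value)


assert parse_last_update("1/22/2020 17:00") == datetime(2020, 1, 22, 17, 0)
assert parse_last_update("4/6/20 9:37") == datetime(2020, 4, 6, 9, 37)
```

Trying the four-digit layout first and the two-digit layout second keeps the function strict: a value that matches neither raises an error instead of being silently misread.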
