# **Modeling Health Nutrition and Population Statistics Data**
This notebook creates a new BigQuery dataset, kaggle2_modeled, to make the data easier to use. New tables are created in it, each holding one group of related metrics from the staging table.

In [1]:
#create the new dataset for modeled data
!bq --location=US mk --dataset kaggle2_modeled

Dataset 'electric-spark-266716:kaggle2_modeled' successfully created.


#### The following queries create the new tables, each containing the rows for one group of related metrics

In [23]:
%%bigquery
CREATE TABLE kaggle2_modeled.Urban_Growth_Statistics as
SELECT *
FROM kaggle2_staging.Health_Nutrition_Population_Statistics
WHERE metricCode IN ("SP.URB.GROW", "SP.URB.TOTL.IN.ZS", "SP.URB.TOTL")

In [24]:
%%bigquery
CREATE TABLE kaggle2_modeled.Life_Statistics as
SELECT *
FROM kaggle2_staging.Health_Nutrition_Population_Statistics
WHERE metricCode IN (
  "SP.DYN.TO65.MA.ZS", "SP.DYN.TO65.FE.ZS",
  "SP.DYN.IMRT.IN", "SP.DYN.AMRT.MA",
  "SP.DYN.AMRT.FE", "SP.DYN.LE00.IN",
  "SP.DYN.LE00.MA.IN", "SP.DYN.LE00.FE.IN",
  "SP.DYN.CDRT.IN"
)

In [25]:
%%bigquery
CREATE TABLE kaggle2_modeled.Population_Statistics as
SELECT *
FROM kaggle2_staging.Health_Nutrition_Population_Statistics
WHERE metricCode IN (
  "SP.POP.TOTL", "SP.POP.TOTL.MA.ZS",
  "SP.POP.TOTL.FE.ZS", "SP.POP.GROW",
  "SP.DYN.TFRT.IN", "SP.DYN.CBRT.IN"
)

In [26]:
%%bigquery
CREATE TABLE kaggle2_modeled.Health_Statistics as
SELECT *
FROM kaggle2_staging.Health_Nutrition_Population_Statistics
WHERE metricCode IN (
  "SH.XPD.PCAP", "SH.XPD.TOTL.CD",
  "SH.STA.OW15.ZS", "SN.ITK.DEFC.ZS",
  "SP.DYN.TO65.FE.ZS", "SP.DYN.TO65.MA.ZS"
)
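
The four CREATE TABLE statements above differ only in the target table name and the list of metric codes. As a minimal sketch, the statement text could be generated from a mapping instead of being typed out per table. The helper name `build_create_table` and the `METRIC_GROUPS` mapping are hypothetical and not part of the original notebook; only the dataset, table, and metric names are taken from the cells above. The sketch emits an IN list, which selects the same rows as comparing metricCode against each code in turn.

```python
# Hypothetical helper (not from the original notebook): builds the text of a
# CREATE TABLE ... AS SELECT statement for one group of metric codes.
STAGING_TABLE = "kaggle2_staging.Health_Nutrition_Population_Statistics"

# Only one group is listed here; the other tables would be added the same way.
METRIC_GROUPS = {
    "Urban_Growth_Statistics": ["SP.URB.GROW", "SP.URB.TOTL.IN.ZS", "SP.URB.TOTL"],
}


def build_create_table(table_name, metric_codes, dataset="kaggle2_modeled"):
    """Return the SQL text that filters the staging table to `metric_codes`."""
    quoted = ", ".join('"{}"'.format(code) for code in metric_codes)
    return (
        "CREATE TABLE {dataset}.{table} AS\n"
        "SELECT *\n"
        "FROM {staging}\n"
        "WHERE metricCode IN ({codes})"
    ).format(dataset=dataset, table=table_name, staging=STAGING_TABLE, codes=quoted)


sql = build_create_table("Urban_Growth_Statistics", METRIC_GROUPS["Urban_Growth_Statistics"])
print(sql)
```

The generated text can be pasted into a `%%bigquery` cell. The sketch only builds a string and does not contact BigQuery.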

## **Checking Primary Keys**

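In the Urban_Growth_Statistics table, countryName and metricCode together form a valid primary key: the total row count matches the number of distinct (countryName, metricCode) pairs.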

In [27]:
%%bigquery 
SELECT COUNT(*) as RecordCount
FROM kaggle2_modeled.Urban_Growth_Statistics

Unnamed: 0,RecordCount
0,774


In [31]:
%%bigquery 
SELECT COUNT(*) as UniqueCount
FROM (SELECT DISTINCT countryName,metricCode
FROM kaggle2_modeled.Urban_Growth_Statistics)

Unnamed: 0,UniqueCount
0,774


In the Life_Statistics table, countryName and metricCode together form a valid primary key.

In [32]:
%%bigquery 
SELECT COUNT(*) as RecordCount
FROM kaggle2_modeled.Life_Statistics

Unnamed: 0,RecordCount
0,2322


In [33]:
%%bigquery 
SELECT COUNT(*) as UniqueCount
FROM (SELECT DISTINCT countryName,metricCode
FROM kaggle2_modeled.Life_Statistics)

Unnamed: 0,UniqueCount
0,2322


In the Population_Statistics table, countryName and metricCode together form a valid primary key.

In [34]:
%%bigquery 
SELECT COUNT(*) as RecordCount
FROM kaggle2_modeled.Population_Statistics

Unnamed: 0,RecordCount
0,1548


In [35]:
%%bigquery 
SELECT COUNT(*) as UniqueCount
FROM (SELECT DISTINCT countryName,metricCode
FROM kaggle2_modeled.Population_Statistics)

Unnamed: 0,UniqueCount
0,1548


In the Health_Statistics table, countryName and metricCode together form a valid primary key.

In [36]:
%%bigquery 
SELECT COUNT(*) as RecordCount
FROM kaggle2_modeled.Health_Statistics

Unnamed: 0,RecordCount
0,1548


In [37]:
%%bigquery 
SELECT COUNT(*) as UniqueCount
FROM (SELECT DISTINCT countryName,metricCode
FROM kaggle2_modeled.Health_Statistics)

Unnamed: 0,UniqueCount
0,1548
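
The checks above compare a total row count with a distinct-key count, which shows whether duplicates exist but not which keys are duplicated. A single GROUP BY ... HAVING query can list the offending keys directly. The sketch below is an illustration only: it uses Python's built-in sqlite3 module and a made-up `demo_stats` table so that it runs without BigQuery credentials, and the helper name `find_duplicate_keys` is hypothetical. The same query shape should carry over to a `%%bigquery` cell.

```python
import sqlite3


def find_duplicate_keys(conn, table, key_columns):
    """Return (key..., count) rows for every key that occurs more than once."""
    cols = ", ".join(key_columns)
    query = (
        "SELECT {cols}, COUNT(*) AS n FROM {table} "
        "GROUP BY {cols} HAVING COUNT(*) > 1"
    ).format(cols=cols, table=table)
    return conn.execute(query).fetchall()


# Made-up demo data; the values are placeholders, not real statistics.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_stats (countryName TEXT, metricCode TEXT, value REAL)")
conn.executemany(
    "INSERT INTO demo_stats VALUES (?, ?, ?)",
    [
        ("Aruba", "SP.POP.TOTL", 1.0),
        ("Aruba", "SP.POP.GROW", 2.0),
        ("Chad", "SP.POP.TOTL", 3.0),
        ("Chad", "SP.POP.TOTL", 4.0),  # deliberate duplicate key
    ],
)

duplicates = find_duplicate_keys(conn, "demo_stats", ["countryName", "metricCode"])
print(duplicates)  # [('Chad', 'SP.POP.TOTL', 2)]
```

An empty result means the chosen columns are unique, which is the same conclusion the two count queries reach when their numbers match.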


## **Checking Foreign Keys**

In [39]:
%%bigquery 
SELECT COUNT(*) as Unmatched_Elements
FROM kaggle2_modeled.Health_Statistics
LEFT JOIN kaggle2_modeled.Population_Statistics
ON Health_Statistics.countryName = Population_Statistics.countryName
WHERE Population_Statistics.countryName IS NULL

Unnamed: 0,Unmatched_Elements
0,0


In [40]:
%%bigquery 
SELECT COUNT(*) as Unmatched_Elements
FROM kaggle2_modeled.Life_Statistics
LEFT JOIN kaggle2_modeled.Population_Statistics
ON Life_Statistics.countryName = Population_Statistics.countryName
WHERE Population_Statistics.countryName IS NULL

Unnamed: 0,Unmatched_Elements
0,0


In [41]:
%%bigquery 
SELECT COUNT(*) as Unmatched_Elements
FROM kaggle2_modeled.Urban_Growth_Statistics
LEFT JOIN kaggle2_modeled.Population_Statistics
ON Urban_Growth_Statistics.countryName = Population_Statistics.countryName
WHERE Population_Statistics.countryName IS NULL

Unnamed: 0,Unmatched_Elements
0,0


#### **NOTE:** There are no foreign key or primary key violations for any of the tables.
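
The foreign key checks above join each table to Population_Statistics on countryName and count the rows that find no match. An equivalent way to phrase the check is an anti-join with NOT EXISTS, which returns the unmatched values themselves rather than a count. The following is a sketch under assumptions: it runs against an in-memory sqlite3 database with made-up `demo_health` and `demo_population` tables, and `find_orphan_values` is a hypothetical helper name.

```python
import sqlite3


def find_orphan_values(conn, child, parent, column):
    """Return distinct values of `column` in `child` that are absent from `parent`."""
    query = (
        "SELECT DISTINCT c.{col} FROM {child} AS c "
        "WHERE NOT EXISTS (SELECT 1 FROM {parent} AS p WHERE p.{col} = c.{col}) "
        "ORDER BY c.{col}"
    ).format(col=column, child=child, parent=parent)
    return [row[0] for row in conn.execute(query)]


# Made-up demo data: 'Atlantis' appears in the child table only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_population (countryName TEXT)")
conn.execute("CREATE TABLE demo_health (countryName TEXT)")
conn.executemany("INSERT INTO demo_population VALUES (?)", [("Aruba",), ("Chad",)])
conn.executemany(
    "INSERT INTO demo_health VALUES (?)",
    [("Aruba",), ("Atlantis",), ("Atlantis",)],
)

orphans = find_orphan_values(conn, "demo_health", "demo_population", "countryName")
print(orphans)  # ['Atlantis']
```

An empty list corresponds to an Unmatched_Elements count of 0 in the queries above.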

## **Beam Pipelines**
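The following cells run the Apache Beam pipelines. For each table, the first script (ending in _beam.py) runs with the direct runner and the second (ending in _beam_dataflow.py) runs on Dataflow.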
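
The pipeline scripts themselves are not included in this notebook. The verification queries later on use a key of dt, countryName, and metricCode, which suggests that each pipeline reshapes a wide row (one column per date) into one record per date. Purely as an assumption about that step, the reshaping could look like the plain-Python generator below. The column prefix `yr_`, the integer `dt` values, the `value` field, and the function name `unpivot_row` are all hypothetical.

```python
def unpivot_row(row, key_fields=("countryName", "metricCode"), year_prefix="yr_"):
    """Yield one record per non-null year column of a wide input row.

    Assumes (hypothetically) that year columns are named like 'yr_1960'.
    """
    keys = {field: row[field] for field in key_fields}
    for column, value in sorted(row.items()):
        if not column.startswith(year_prefix) or value is None:
            continue  # skip key columns and years without a measurement
        record = dict(keys)
        record["dt"] = int(column[len(year_prefix):])
        record["value"] = value
        yield record


# Made-up input row; the numbers are placeholders, not real statistics.
wide_row = {
    "countryName": "Aruba",
    "metricCode": "SP.POP.TOTL",
    "yr_1960": 1.0,
    "yr_1961": None,
    "yr_1962": 2.0,
}

long_rows = list(unpivot_row(wide_row))
print(long_rows)
```

In a Beam pipeline, a generator of this kind would typically be applied with `beam.FlatMap`, since one input element produces zero or more output elements. Dropping null years, as this sketch does, is one design choice among several.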

#### **Pipelines for Health Statistics Table**

In [6]:
%run Health_Statistics_beam.py

  experiments = p.options.view_as(DebugOptions).experiments or []
ERROR:apache_beam.runners.direct.executor:Exception at bundle <apache_beam.runners.direct.bundle_factory._Bundle object at 0x7f3d059dbbc8>, due to an exception.
 Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 883, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 498, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/home/jupyter/venv/lib/python3.5/site-packages/apache_beam/io/gcp/bigquery.py", line 1032, in process
    return self._flush_batch(destination)
  File "/home/jupyter/venv/lib/python3.5/site-packages/apache_beam/io/gcp/bigquery.py", line 1070, in _flush_batch
    skip_invalid_rows=True)
  File "/home/jupyter/venv/lib/python3.5/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 837, in insert_rows
    json_row = self._convert_to_json_row(row)
  File "/home/jupyter/venv/lib/python3.5/site-packages/apache_beam/io

In [None]:
%run Health_Statistics_beam_dataflow.py

  kms_key=transform.kms_key))


#### **Pipelines for Urban Growth Statistics Table**

In [7]:
%run Urban_Growth_Statistics_beam.py

  experiments = p.options.view_as(DebugOptions).experiments or []


In [22]:
%run Urban_Growth_Statistics_beam_dataflow.py

  kms_key=transform.kms_key))


#### **Pipelines for Life Statistics Table**

In [9]:
%run Life_Statistics_beam.py

  experiments = p.options.view_as(DebugOptions).experiments or []


In [19]:
%run Life_Statistics_beam_dataflow.py

  kms_key=transform.kms_key))


#### **Pipelines for Population Statistics Table**

In [8]:
%run Population_Statistics_beam.py

  experiments = p.options.view_as(DebugOptions).experiments or []


In [24]:
%run Population_Statistics_beam_dataflow.py

  kms_key=transform.kms_key))


KeyboardInterrupt: 

## **Beam Verification**
The following queries verify that the tables produced by the Beam pipelines have valid primary and foreign keys.

#### **Health_Statistics_Beam_DF**
This table has a composite primary key of dt, countryName, and metricCode, and a foreign key, countryName, that references Population_Statistics_Beam_DF.

In [25]:
%%bigquery 
SELECT COUNT(*) as RecordCount
FROM kaggle2_modeled.Health_Statistics_Beam_DF

Unnamed: 0,RecordCount
0,38175


In [36]:
%%bigquery 
SELECT COUNT(*) as UniqueCount
FROM (SELECT DISTINCT dt, countryName, metricCode
FROM kaggle2_modeled.Health_Statistics_Beam_DF)

Unnamed: 0,UniqueCount
0,38175


In [39]:
%%bigquery 
SELECT COUNT(*) as Unmatched_Elements
FROM kaggle2_modeled.Health_Statistics_Beam_DF
LEFT JOIN kaggle2_modeled.Population_Statistics_Beam_DF
ON Health_Statistics_Beam_DF.countryName = Population_Statistics_Beam_DF.countryName
WHERE Population_Statistics_Beam_DF.countryName IS NULL

Unnamed: 0,Unmatched_Elements
0,0


#### **Life_Statistics_Beam_DF**
This table has a composite primary key of dt, countryName, and metricCode, and a foreign key, countryName, that references Population_Statistics_Beam_DF.

In [27]:
%%bigquery 
SELECT COUNT(*) as RecordCount
FROM kaggle2_modeled.Life_Statistics_Beam_DF

Unnamed: 0,RecordCount
0,115014


In [35]:
%%bigquery 
SELECT COUNT(*) as UniqueCount
FROM (SELECT DISTINCT dt, countryName, metricCode
FROM kaggle2_modeled.Life_Statistics_Beam_DF)

Unnamed: 0,UniqueCount
0,115014


In [40]:
%%bigquery 
SELECT COUNT(*) as Unmatched_Elements
FROM kaggle2_modeled.Life_Statistics_Beam_DF
LEFT JOIN kaggle2_modeled.Population_Statistics_Beam_DF
ON Life_Statistics_Beam_DF.countryName = Population_Statistics_Beam_DF.countryName
WHERE Population_Statistics_Beam_DF.countryName IS NULL

Unnamed: 0,Unmatched_Elements
0,0


#### **Population_Statistics_Beam_DF**
This table has a composite primary key of dt, countryName, and metricCode.

In [31]:
%%bigquery 
SELECT COUNT(*) as RecordCount
FROM kaggle2_modeled.Population_Statistics_Beam_DF

Unnamed: 0,RecordCount
0,81037


In [37]:
%%bigquery 
SELECT COUNT(*) as UniqueCount
FROM (SELECT DISTINCT dt, countryName, metricCode
FROM kaggle2_modeled.Population_Statistics_Beam_DF)

Unnamed: 0,UniqueCount
0,81037


#### **Urban_Growth_Statistics_Beam_DF**
This table has a composite primary key of dt, countryName, and metricCode, and a foreign key, countryName, that references Population_Statistics_Beam_DF.

In [33]:
%%bigquery 
SELECT COUNT(*) as RecordCount
FROM kaggle2_modeled.Urban_Growth_Statistics_Beam_DF

Unnamed: 0,RecordCount
0,42689


In [38]:
%%bigquery 
SELECT COUNT(*) as UniqueCount
FROM (SELECT DISTINCT dt, countryName, metricCode
FROM kaggle2_modeled.Urban_Growth_Statistics_Beam_DF)

Unnamed: 0,UniqueCount
0,42689


In [41]:
%%bigquery 
SELECT COUNT(*) as Unmatched_Elements
FROM kaggle2_modeled.Urban_Growth_Statistics_Beam_DF
LEFT JOIN kaggle2_modeled.Population_Statistics_Beam_DF
ON Urban_Growth_Statistics_Beam_DF.countryName = Population_Statistics_Beam_DF.countryName
WHERE Population_Statistics_Beam_DF.countryName IS NULL

Unnamed: 0,Unmatched_Elements
0,0
