# 4.2 Data Quality Checks

The data quality checks include:

1. No table is empty after running the ETL data pipeline.
2. The data schema of every dimensional table matches the data model.
3. All immigration files were added to the immigration dataframe.

## Import files to create model

In [1]:
import configparser
import datetime as dt
import glob
import os
from datetime import datetime

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, lit, month, monotonically_increasing_id, to_date, udf, upper, year,
)
from pyspark.sql.types import *
from pyspark.sql.types import (
    DateType as Date, DoubleType as Dbl, IntegerType as Int,
    StringType as Str, StructField as Fld, StructType as R,
)

## Load configuration data

In [2]:
config = configparser.ConfigParser()
config.read_file(open('dl.cfg'))

os.environ["AWS_ACCESS_KEY_ID"]= config['AWS']['AWS_ACCESS_KEY_ID']
os.environ["AWS_SECRET_ACCESS_KEY"]= config['AWS']['AWS_SECRET_ACCESS_KEY']
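
If `dl.cfg` lacks the `[AWS]` section or one of its two keys, the cell above fails with a bare `KeyError`. A minimal sketch of a more explicit check, using only the standard library, could look like this. The helper name `read_aws_settings` and the sample values are hypothetical; only the section and key names come from the cell above.

```python
import configparser

REQUIRED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def read_aws_settings(config_text):
    """Parse INI text and return the two AWS settings as a dict.

    Raises KeyError naming whatever is missing, so the cause is clear
    before any Spark job starts.
    """
    config = configparser.ConfigParser()
    config.read_string(config_text)
    if not config.has_section("AWS"):
        raise KeyError("missing [AWS] section")
    missing = [key for key in REQUIRED_KEYS if not config.has_option("AWS", key)]
    if missing:
        raise KeyError("missing keys in [AWS]: " + ", ".join(missing))
    return {key: config["AWS"][key] for key in REQUIRED_KEYS}


# Placeholder values for illustration, not real credentials.
sample = """
[AWS]
AWS_ACCESS_KEY_ID = example-id
AWS_SECRET_ACCESS_KEY = example-secret
"""
settings = read_aws_settings(sample)
print(sorted(settings))  # ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY']
```

The returned dict could then be copied into `os.environ` exactly as the cell above does.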

## Create Spark session

In [3]:
#creating the session
spark = SparkSession \
        .builder \
        .config("spark.jars.repositories", "https://repos.spark-packages.org/")\
        .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")\
        .enableHiveSupport().getOrCreate()

## Perform 1st Data Quality Check

In [4]:
# Check 1: every table must contain at least one record.
def quality_check1(output_data, table_name):
    """Check that no fact or dimension table is empty.
    :param output_data: root path of the parquet output, ending in '/'
    :param table_name: list of table folder names, each ending in '/'
    """
    for table in table_name:
        name = output_data + table
        df = spark.read.parquet(name)
        total_count = df.count()
        # Derive the label from the folder name so it works for any root path.
        label = table.rstrip('/')
        if total_count == 0:
            print(f"Data quality check failed for {label} Table with zero records!\n")
        else:
            print(f"Data quality check passed for {label} Table with {total_count:,} records.\n")
        print("*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*\n")
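
The function above only prints its verdict, so a failed check does not stop anything. A minimal sketch of a variant that returns the names of the empty tables, which a pipeline could act on, is shown below. The helper name `find_empty_tables` and the sample counts are hypothetical; in this notebook each count would come from `spark.read.parquet(name).count()`.

```python
def find_empty_tables(row_counts):
    """Return the names of tables whose row count is zero, sorted by name.

    row_counts: dict mapping table name -> row count (hypothetical input;
    in the notebook each count would come from df.count()).
    """
    return sorted(name for name, count in row_counts.items() if count == 0)


# Illustrative counts only, not results from the pipeline.
sample_counts = {"Immigrations": 1200, "Ports": 0, "Visas": 3}
empty = find_empty_tables(sample_counts)
print(empty)  # ['Ports']
```

A caller could raise an exception whenever the returned list is non-empty.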
    

## Perform 2nd Data Quality Check

In [4]:
# Check 2: print each table's schema for comparison with the data model.
def quality_check2(output_data, table_name):
    """Print the schema of every table so it can be compared with the data model.
    :param output_data: root path of the parquet output, ending in '/'
    :param table_name: list of table folder names, each ending in '/'
    """
    for table in table_name:
        name = output_data + table
        df = spark.read.parquet(name)
        print("Table: " + table.rstrip('/') + " Table")
        # printSchema() prints directly and returns None, so nothing is assigned.
        df.printSchema()
        print("*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*\n")
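
Because this check only prints the schemas, matching them against the data model is left to the reader. A minimal sketch of an automated comparison is shown below. It works on plain lists of `(column, type)` pairs, the shape that a Spark DataFrame's `dtypes` attribute returns. The helper name `diff_schema` and both sample schemas are hypothetical and are not taken from the project's data model.

```python
def diff_schema(actual, expected):
    """Compare two schemas given as lists of (column_name, type_name) pairs.

    Returns a dict with three sorted lists: columns missing from `actual`,
    columns not present in `expected`, and columns whose types differ.
    """
    actual_types = dict(actual)
    expected_types = dict(expected)
    missing = sorted(set(expected_types) - set(actual_types))
    unexpected = sorted(set(actual_types) - set(expected_types))
    mismatched = sorted(
        column
        for column in set(actual_types) & set(expected_types)
        if actual_types[column] != expected_types[column]
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}


# Hypothetical expected model (illustration only).
expected = [("cic_id", "int"), ("age", "int"), ("gender", "string")]
# Hypothetical actual schema, in the shape df.dtypes would return.
actual = [("cic_id", "int"), ("age", "string"), ("ins_num", "string")]

print(diff_schema(actual, expected))
# {'missing': ['gender'], 'unexpected': ['ins_num'], 'mismatched': ['age']}
```

The check would pass for a table only when all three lists are empty.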

## Perform 3rd Data Quality Check

In [6]:
# Check 3: every monthly immigration file must appear in the table.
def quality_check3(output_data, table, months):
    """Check that all monthly immigration files were loaded.
    Counts the distinct values in the 'month' column and compares the
    count with the expected number of months.
    :param output_data: root path of the parquet output, ending in '/'
    :param table: folder name of the immigration table, ending in '/'
    :param months: expected number of monthly files
    """
    name = output_data + table
    df = spark.read.parquet(name)
    total_count = df.select('month').distinct().count()
    label = table.rstrip('/')
    if total_count < months:
        print(f"Data quality check failed for {label} Table as only {total_count:,} files were processed!\n")
    else:
        print(f"Data quality check passed for {label} Table as all {total_count:,} files were processed!\n")
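
When the count falls short, the message above does not say which months are absent. A minimal sketch of a helper that lists them is shown below. The name `missing_months` and the sample input are hypothetical; in this notebook the observed months could be gathered with `[row['month'] for row in df.select('month').distinct().collect()]`.

```python
def missing_months(observed_months, expected_months=range(1, 13)):
    """Return the expected month numbers absent from observed_months, sorted."""
    return sorted(set(expected_months) - set(observed_months))


# Illustrative input only: months 1-12 with April and September absent.
observed = [1, 2, 3, 5, 6, 7, 8, 10, 11, 12]
print(missing_months(observed))  # [4, 9]
```

An empty result means every expected month is present.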
    


## Initialize variables to run Data Quality Checks

In [5]:
output_data = "Capstone_Project/"

table_name = ['Immigrations/','Immigrants/','Airports/','Populations/','Population_Statistics/','Temperatures/',\
              'Temperature_Statistics/','Countries/','States/','Ports/','Visas/']

table = 'Immigrations/'
months = 12

# Run 1st Data Quality Check: *No empty table after running ETL data pipeline.*

In [8]:
quality_check1(output_data, table_name)

Data quality check passed for Immigrations Table with 40,790,529 records.

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

Data quality check passed for Immigrants Table with 40,767,292 records.

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

Data quality check passed for Airports Table with 40,790,529 records.

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

Data quality check passed for Populations Table with 2,891 records.

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

Data quality check passed for Population_Statistics Table with 596 records.

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

Data quality check passed for Temperatures Table with 687,289 records.

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-

# Run 2nd Data Quality Check: *Data schema of every dimensional table matches data model.*

In [6]:
quality_check2(output_data, table_name)

Table: Immigrations Table
root
 |-- cic_id: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- port_code: string (nullable = true)
 |-- mode_code: integer (nullable = true)
 |-- visa_code: integer (nullable = true)
 |-- arrival_date: date (nullable = true)
 |-- departure_date: date (nullable = true)
 |-- match_flag: string (nullable = true)
 |-- immigration_id: long (nullable = true)
 |-- state_code: string (nullable = true)

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

Table: Immigrants Table
root
 |-- cic_id: integer (nullable = true)
 |-- citizen_country: integer (nullable = true)
 |-- residence_country: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- ins_num: string (nullable = true)
 |-- immigrants_id: long (nullable = true)

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

Table: Airports Table
root
 |-- cic_id: integer (nullable = true)
 |--

# Run 3rd Data Quality Check: *Ensure that all immigration files were added to the immigration dataframe.*

In [10]:
quality_check3(output_data, table, months)

Data quality check passed for Immigrations Table as all 12 files were processed!

