# **Validating a data pipeline at "checkpoints"**

- In this exercise, you'll work with a data pipeline that extracts tax data from a CSV file, creates a new column, filters out rows based on average taxable income, and persists the result to a Parquet file.

- pandas has been loaded as pd, and the extract(), transform(), and load() functions have already been defined. You'll use these functions to validate the data pipeline at various checkpoints throughout its execution.
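
The exercise does not show the bodies of extract(), transform(), and load(). The sketch below is one way they might look, not the course's actual definitions. It assumes the new column is total_taxable_income divided by number_of_firms, that rows are kept when this average exceeds a hypothetical threshold (min_average, set to 100 here), and that industry_name becomes the index. The sample rows are made up.

```python
import pandas as pd


def extract(file_path: str) -> pd.DataFrame:
    # Assumption: the raw file is a plain CSV readable with default settings
    return pd.read_csv(file_path)


def transform(raw_data: pd.DataFrame, min_average: float = 100.0) -> pd.DataFrame:
    # min_average is a hypothetical threshold; the exercise does not state the real one
    data = raw_data.copy()
    data["average_taxable_income"] = (
        data["total_taxable_income"] / data["number_of_firms"]
    )
    # Keep only rows whose average exceeds the threshold
    clean = data.loc[data["average_taxable_income"] > min_average]
    return clean.set_index("industry_name")


def load(clean_data: pd.DataFrame, file_path: str) -> None:
    # to_parquet requires a Parquet engine such as pyarrow to be installed
    clean_data.to_parquet(file_path)


# Made-up rows, used only to exercise transform() without touching the disk
sample = pd.DataFrame(
    {
        "industry_name": ["A", "B", "C"],
        "number_of_firms": [10, 20, 5],
        "total_taxable_income": [5000.0, 1000.0, 2500.0],
    }
)
clean_sample = transform(sample)
print(clean_sample)
```

Industry "B" has an average of 50 and is dropped, so only "A" and "C" remain.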

**Instructions**

- Print the shape of the raw_tax_data and clean_tax_data DataFrames and observe the difference in dimensions.

In [None]:
# Extract and transform tax_data
raw_tax_data = extract("raw_tax_data.csv")
clean_tax_data = transform(raw_tax_data)

# Check the shape of the raw_tax_data DataFrame, compare to the clean_tax_data DataFrame
print(f"Shape of raw_tax_data: {raw_tax_data.shape}")
print(f"Shape of clean_tax_data: {clean_tax_data.shape}")

In [None]:
Shape of raw_tax_data: (96, 6)
Shape of clean_tax_data: (82, 5)
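
Instead of reading the printed shapes by eye, the same checkpoint can be written as an automated check. This is a minimal sketch, assuming that a filtering step should never add rows and should leave at least one row. check_row_count is a hypothetical helper, and the two frames are invented stand-ins for raw_tax_data and clean_tax_data.

```python
import pandas as pd


def check_row_count(raw: pd.DataFrame, clean: pd.DataFrame) -> int:
    """Return the number of rows removed, or raise if the result looks wrong."""
    removed = raw.shape[0] - clean.shape[0]
    if removed < 0:
        raise ValueError("transform added rows; a filter should only remove them")
    if clean.empty:
        raise ValueError("transform removed every row")
    return removed


# Invented stand-ins for the raw and transformed data
raw = pd.DataFrame({"value": [1, 2, 3, 4]})
clean = raw[raw["value"] > 2]

print(f"Rows removed by the filter: {check_row_count(raw, clean)}")
```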

**Instructions**

- Read the Parquet file at the path "clean_tax_data.parquet" into a DataFrame called to_validate, then compare the .head() of to_validate and clean_tax_data.

In [None]:
raw_tax_data = extract("raw_tax_data.csv")
clean_tax_data = transform(raw_tax_data)
load(clean_tax_data, "clean_tax_data.parquet")

print(f"Shape of raw_tax_data: {raw_tax_data.shape}")
print(f"Shape of clean_tax_data: {clean_tax_data.shape}")

# Read in the loaded data, observe the head of each
to_validate = pd.read_parquet("clean_tax_data.parquet")
print(clean_tax_data.head(3))
print(to_validate.head(3))

In [None]:
Shape of raw_tax_data: (96, 6)
Shape of clean_tax_data: (82, 5)
                   number_of_firms  total_taxable_income  total_taxes_paid  total_cash_taxes_paid  average_taxable_income
industry_name                                                                                                            
Aerospace/Defense               77             30920.169          5106.376               7441.776                 401.561
Apparel                         39              5422.690          1112.113               1479.292                 139.043
Auto & Truck                    31             33358.200          3529.000               2446.896                1076.071
                   number_of_firms  total_taxable_income  total_taxes_paid  total_cash_taxes_paid  average_taxable_income
industry_name                                                                                                            
Aerospace/Defense               77             30920.169          5106.376               7441.776                 401.561
Apparel                         39              5422.690          1112.113               1479.292                 139.043
Auto & Truck                    31             33358.200          3529.000               2446.896                1076.071

**Instructions**

- Check that the to_validate and clean_tax_data DataFrames are equal.

In [None]:
raw_tax_data = extract("raw_tax_data.csv")
clean_tax_data = transform(raw_tax_data)
load(clean_tax_data, "clean_tax_data.parquet")

print(f"Shape of raw_tax_data: {raw_tax_data.shape}")
print(f"Shape of clean_tax_data: {clean_tax_data.shape}")

to_validate = pd.read_parquet("clean_tax_data.parquet")
print(clean_tax_data.head(3))
print(to_validate.head(3))

# Check that the DataFrames are equal
print(to_validate.equals(clean_tax_data))

In [None]:
Shape of raw_tax_data: (96, 6)
Shape of clean_tax_data: (82, 5)

                   number_of_firms  total_taxable_income  total_taxes_paid  total_cash_taxes_paid  average_taxable_income
industry_name                                                                                                            
Aerospace/Defense               77             30920.169          5106.376               7441.776                 401.561
Apparel                         39              5422.690          1112.113               1479.292                 139.043
Auto & Truck                    31             33358.200          3529.000               2446.896                1076.071
                   number_of_firms  total_taxable_income  total_taxes_paid  total_cash_taxes_paid  average_taxable_income
industry_name                                                                                                            
Aerospace/Defense               77             30920.169          5106.376               7441.776                 401.561
Apparel                         39              5422.690          1112.113               1479.292                 139.043
Auto & Truck                    31             33358.200          3529.000               2446.896                1076.071

True

In summary, the shapes of raw_tax_data and clean_tax_data differ because the transformation removes rows (96 to 82) and changes the column count (6 to 5). The .equals() check returns True, which confirms that the DataFrame loaded from the Parquet file (to_validate) matches clean_tax_data in shape, values, and column dtypes.
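
When .equals() returns False, it does not say what differs. pandas also provides pd.testing.assert_frame_equal, which raises an AssertionError describing the mismatch. The sketch below uses two small illustrative frames that hold the same numbers with different dtypes; the names expected and reloaded are placeholders, not part of the exercise.

```python
import pandas as pd

expected = pd.DataFrame(
    {"number_of_firms": [77, 39]},
    index=["Aerospace/Defense", "Apparel"],
)
# Same values, but stored as floats instead of integers
reloaded = expected.astype({"number_of_firms": "float64"})

# .equals() requires matching dtypes, so this prints False
print(expected.equals(reloaded))

# assert_frame_equal raises an AssertionError that explains the difference
try:
    pd.testing.assert_frame_equal(expected, reloaded)
    mismatch = None
except AssertionError as error:
    mismatch = str(error)
print(mismatch)

# If a dtype difference is acceptable, the check can be relaxed explicitly
pd.testing.assert_frame_equal(expected, reloaded, check_dtype=False)
```

Relaxing check_dtype is a deliberate choice: it suits cases where only the values matter, while the strict default is safer when downstream code depends on column types.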