# Data Quality Report

I started with a quick data quality report (DQR) so I know what I'm dealing with. I don't dive deep into the data; I just run some basic commands to see what's in these tables and how I should handle it. This isn't about the data being 'good' or 'bad', but about checking that it's accurate, complete, and ready for analysis.

## What I'm doing

### 1: Import Excel file to DataFrames
I'm pulling data directly from an Excel file. The read is slow on my machine, so I need to address that later. The file has several sheets (facilities, provider-facility claims, and provider demographics), and each one goes into its own DataFrame.
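
One way I could address the slow read is to parse the workbook once and cache each sheet in a faster format. This is only a sketch of that idea, not something the notebook does yet: `load_with_cache` and its `loader` argument are hypothetical names, and I picked pickle only because it needs no extra dependencies.

```python
import os
import tempfile

import pandas as pd


def load_with_cache(cache_path, loader):
    """Return the cached DataFrame if it exists; otherwise build and cache it.

    Hypothetical helper: `loader` is any zero-argument callable that returns
    a DataFrame, for example a lambda wrapping pd.read_excel for one sheet.
    """
    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)
    df = loader()
    df.to_pickle(cache_path)
    return df


# Stand-in loader so the example runs without the real workbook.
calls = []


def slow_loader():
    calls.append(1)  # record each time the slow path actually runs
    return pd.DataFrame({"npi": [1194763045, 1215989611], "state": ["OH", "OH"]})


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = os.path.join(tmp_dir, "facilities.pkl")
    first = load_with_cache(cache_file, slow_loader)   # runs the loader
    second = load_with_cache(cache_file, slow_loader)  # served from the cache
```

In this notebook the loader could be something like `lambda: pd.read_excel(excel_file_path, sheet_name='Facilities')`. The cache file has to be deleted by hand whenever the workbook changes.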

### 2: Laying Out the Ingredients
I take a quick peek at the top rows of each sheet to make sure I have everything I need.

### 3: The Quality Check
This is a 'quality check' on each set of data. Let's break down what I'm doing:

#### - Counting Everything
A simple row count to see how much of everything is there.

#### - Looking for Missing Pieces
I check for missing values. If data is missing, I note how much (count and percentage) and in which columns.

#### - Avoiding Repetition
I look for fully duplicated rows. The count just points me in the direction I need to look later.
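
When the count is not zero, the next step is usually to see which rows are involved. Here is a minimal sketch on made-up toy data (the values are not from the case study), using `keep=False` so every member of a duplicate group is shown:

```python
import pandas as pd

# Toy data, made up for illustration; rows 0 and 2 are identical.
df = pd.DataFrame({
    "enrollment_id": ["A1", "B2", "A1", "C3"],
    "enrollment_state": ["OH", "OH", "OH", "OH"],
})

# Number of rows that repeat an earlier row (what the report prints).
duplicate_count = df.duplicated().sum()

# Every row that belongs to a duplicate group, including the first occurrence.
duplicate_rows = df[df.duplicated(keep=False)]

print(duplicate_count)
print(duplicate_rows)
```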

#### - Understanding the Nature of Our Data
I check the data type of each column. I want to preserve type integrity throughout processing.
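
Identifier columns are where type integrity tends to break, because a code that looks numeric gets inferred as an integer and loses its leading zeros. The sketch below shows the effect on a made-up CSV snippet so it runs without the workbook; `bed_count` is a hypothetical column, and the values are illustrative only.

```python
import io

import pandas as pd

# Toy CSV, made up for illustration: an identifier with a leading zero.
raw = "ccn,bed_count\n012345,10\n360087,25\n"

# Default inference treats the identifier as an integer.
inferred = pd.read_csv(io.StringIO(raw))

# Declaring the identifier as a string keeps it exactly as written.
typed = pd.read_csv(io.StringIO(raw), dtype={"ccn": str})

print(inferred.dtypes)
print(typed["ccn"].tolist())
```

`pd.read_excel` accepts the same `dtype` argument, so the same idea should carry over to the sheets read in this notebook.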

#### - Getting a Feel for the Numbers
If there are numeric columns, I get some basic summary statistics on them.

## Output

I added a cell that writes the reports to text files. See docs/reports/dqr/
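
For reference, the standard library already has `contextlib.redirect_stdout`, which can do the same job as a hand-written redirect. This is a sketch of that alternative, not the cell I actually ran; `write_report` and `demo_report` are hypothetical names used only for the example.

```python
import os
import tempfile
from contextlib import redirect_stdout


def write_report(file_path, report_func, *args):
    """Run report_func(*args) with everything it prints sent to file_path."""
    with open(file_path, "w") as report_file, redirect_stdout(report_file):
        report_func(*args)


# Stand-in for a report function that prints its findings.
def demo_report(name):
    print(f"Data Quality Report for: {name}")


with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "dqr_demo.txt")
    write_report(out_path, demo_report, "Demo")
    with open(out_path) as f:
        written = f.read()
```

`redirect_stdout` restores the original stdout when the block exits, including when the report function raises.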

In [3]:
import pandas as pd
import os

# Reading data from Excel
excel_file_path = os.path.join(os.path.dirname(os.getcwd()), 'data', 'raw', 'case_study_data.xlsx')

# Reading each sheet into a separate DataFrame
df_facilities = pd.read_excel(excel_file_path, sheet_name='Facilities')
df_prov_fac_claims = pd.read_excel(excel_file_path, sheet_name='Provider-Facility Affiliated Cl')
df_prov_demog = pd.read_excel(excel_file_path, sheet_name='Provider Demographics')

# Displaying the first few rows of each DataFrame
print("Facilities:\n", df_facilities.head())
print("\nProvider-Facility Affiliated Claims:\n", df_prov_fac_claims.head())
print("\nProvider Demographics:\n", df_prov_demog.head())


Facilities:
      enrollment_id enrollment_state provider_type_code  \
0  O20051101000429               OH              00-09   
1  O20051115000044               OH              00-09   
2  O20070216000074               OH              00-09   
3  O20070806000006               OH              00-09   
4  O20071114000521               OH              00-09   

           provider_type_text         npi mulitple_npi     ccn  associate_id  \
0  PART A PROVIDER - HOSPITAL  1194763045            N  360087    2466358387   
1  PART A PROVIDER - HOSPITAL  1215989611            N  360077    4688571730   
2  PART A PROVIDER - HOSPITAL  1043397292            Y  369807    3274431879   
3  PART A PROVIDER - HOSPITAL  1700828852            Y  360059    8628982949   
4  PART A PROVIDER - HOSPITAL  1467489682            N  36T059    8628982949   

                               organization_name       doing_business_as_name  \
0                              LUTHERAN HOSPITAL                          Na

In [4]:
# data quality report
def check_data_quality(df, df_name):
    print(f"Data Quality Report for: {df_name}\n")

    # row count
    total_rows = df.shape[0]
    print(f"Total number of rows: {total_rows}\n")

    # missing values
    missing_values = df.isnull().sum()
    missing_percentage = (missing_values / total_rows) * 100
    missing_info = pd.DataFrame({'Missing Values': missing_values, 'Percentage (%)': missing_percentage})
    print("Missing Values Information:\n", missing_info[missing_info['Missing Values'] > 0], "\n")

    # duplicates
    duplicate_rows = df.duplicated().sum()
    print(f"Number of duplicate rows: {duplicate_rows}\n")

    # data types
    print("Data Types:\n", df.dtypes, "\n")

    # summary statistics for numeric columns (any int or float width)
    if df.select_dtypes(include="number").shape[1] > 0:
        print("Statistics for Numeric Columns:\n", df.describe(), "\n")

    print("-------------------------------------------")

# Check data quality for each DataFrame
check_data_quality(df_facilities, "Facilities")
check_data_quality(df_prov_fac_claims, "Provider-Facility Affiliated Claims")
check_data_quality(df_prov_demog, "Provider Demographics")


Data Quality Report for: Facilities

Total number of rows: 11

Missing Values Information:
                               Missing Values  Percentage (%)
doing_business_as_name                     7       63.636364
incorporate_date                           2       18.181818
incorporate_state                          2       18.181818
organization_other_type_text               8       72.727273
addressline2                              10       90.909091
location_other_type_text                  10       90.909091
subgroup_other_text                       10       90.909091
subgroup_reh_conversion_date              11      100.000000
subgroup_cah_ccn                          11      100.000000 

Number of duplicate rows: 0

Data Types:
 enrollment_id                         object
enrollment_state                      object
provider_type_code                    object
provider_type_text                    object
npi                                    int64
mulitple_npi                 

In [5]:
import sys
import os
from contextlib import contextmanager

# Redirect stdout to a file for the duration of the with-block
@contextmanager
def redirect_stdout_to_file(file_path):
    original_stdout = sys.stdout  # Save the original stdout
    with open(file_path, 'w') as file:
        sys.stdout = file  # Send print output to the file
        try:
            yield
        finally:
            sys.stdout = original_stdout  # Restore stdout even if the block raises

dqr_directory = os.path.join(os.getcwd(), '..', 'docs', 'reports', 'dqr')
os.makedirs(dqr_directory, exist_ok=True)

# file paths for the DQR
dqr_facilities_path = os.path.join(dqr_directory, 'dqr_facilities.txt')
dqr_prov_fac_claims_path = os.path.join(dqr_directory, 'dqr_claims.txt')
dqr_prov_demog_path = os.path.join(dqr_directory, 'dqr_demographics.txt')

# Writing the data quality report for each DataFrame to separate files
with redirect_stdout_to_file(dqr_facilities_path):
    check_data_quality(df_facilities, "Facilities")

with redirect_stdout_to_file(dqr_prov_fac_claims_path):
    check_data_quality(df_prov_fac_claims, "Provider-Facility Affiliated Claims")

with redirect_stdout_to_file(dqr_prov_demog_path):
    check_data_quality(df_prov_demog, "Provider Demographics")

print(f"Data quality reports written to:\n{dqr_facilities_path}\n{dqr_prov_fac_claims_path}\n{dqr_prov_demog_path}")


Data quality reports written to:
/Users/bbubnick/projects/claims_case_study/scripts/../docs/reports/dqr/dqr_facilities.txt
/Users/bbubnick/projects/claims_case_study/scripts/../docs/reports/dqr/dqr_claims.txt
/Users/bbubnick/projects/claims_case_study/scripts/../docs/reports/dqr/dqr_demographics.txt
