# Data Warehouse Iqvia QA - Member Enrollment Monthly

Performing QA on the member_enrollment_monthly table in dw_staging before moving it to the data_warehouse schema.

## Initialization

Load the packages used below and initialize the connection to the GP DB.

In [1]:
import pandas as pd
import sys
import psycopg2
sys.path.append('H:/uth_helpers')
from db_utils import get_dsn

In [2]:
connection = psycopg2.connect(get_dsn())
connection.autocommit = True

## Table Information

This table contains enrollment information at the monthly level. Depending on the data source, monthly-level information is either available directly or has to be derived from longer enrollment periods.

Data Sources:

* Optum Zip/Optum DoD: Enrollment information is not at the monthly level. Each enrollment record has a begin date and an end date, and the span may be longer than a month. Enrollment tables are **mbr_enroll** and **mbr_co_enroll**
* Truven: Enrollment table, **t**, contains monthly-level enrollment data
* Medicaid: Enrollment tables (**enrl**, **chip_uth**, **htw_enrl**) are at the monthly level, usually identified by the **elig_month/elig_date** column
* Medicare: Enrollment table (**mbsf_abcd_summary**) is at the yearly level; to get monthly enrollment, you need to look at the **mdcr_status_code_** columns
* Iqvia: **enroll2** is the monthly-level enrollment table, and **enroll_synth** contains demographic data for all enrolled members
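
For sources such as Optum, where an enrollment record covers a begin/end date span, monthly rows have to be derived. The following is a minimal pandas sketch of that expansion, not the ETL used for this warehouse; the column names (`member_id`, `begin_date`, `end_date`) and the helper `expand_to_months` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical span-level enrollment rows; column names are assumptions.
spans = pd.DataFrame({
    "member_id": ["A", "B"],
    "begin_date": pd.to_datetime(["2020-01-15", "2020-11-01"]),
    "end_date": pd.to_datetime(["2020-03-10", "2021-01-31"]),
})

def expand_to_months(df):
    """Return one row per member per calendar month touched by each span."""
    out = df.copy()
    # Build the list of monthly periods covered by each begin/end pair.
    out["month"] = [
        list(pd.period_range(b, e, freq="M"))
        for b, e in zip(out["begin_date"], out["end_date"])
    ]
    # One row per (span, month).
    out = out.explode("month", ignore_index=True)
    out["year"] = out["month"].map(lambda p: p.year)
    out["month_num"] = out["month"].map(lambda p: p.month)
    return out[["member_id", "year", "month_num"]]

monthly = expand_to_months(spans)
print(monthly)
```

A span that crosses a year boundary (member `B` here) produces rows in both calendar years, which matters because the QA counts in this notebook are grouped by year.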

Ideally we should have counts of enrollment tables from raw sources. These counts are included with the rest of the raw data tables counts for the given data sources.

* Optum Zip: **qa_reporting.optum_zip_counts**
* Optum DoD: **qa_reporting.optum_dod_counts**
* Medicaid: **qa_reporting.mdcd_enrollment_counts_[cy/fy]**
* Truven: **qa_reporting.truven_counts**
* Medicare: **qa_reporting.medicare_national_counts** and **qa_reporting.medicare_texas_counts**
* Iqvia: **qa_reporting.iqvia_counts**

## Row Counts and Enrollment Counts

In [3]:
query = ''' drop table if exists qa_reporting.dw_iqva_mbr_enrl_monthly;
create table qa_reporting.dw_iqva_mbr_enrl_monthly
(
    data_source text,
    calendar_year int,
    table_src text,
    dw_row_count int,
    src_row_count int,
    row_count_diff int,
    row_count_diff_percentage float,
    dw_uth_mbr_id_count int,
    dw_src_mbr_id_count int,
    src_mbr_count int,
    mbr_count_diff int,
    mbr_count_percentage float,
    date_generated date
);
'''

with connection.cursor() as cursor:
    cursor.execute(query)

In [4]:
with connection.cursor() as cursor:
      query = '''
insert into qa_reporting.dw_iqva_mbr_enrl_monthly
(data_source, calendar_year, table_src, dw_row_count, dw_uth_mbr_id_count, dw_src_mbr_id_count, date_generated)
select data_source, 
        year, 
        table_id_src, 
        count(*),
        count(distinct uth_member_id),
        count(distinct member_id_src),
        current_date
  from dw_staging.iqva_member_enrollment_monthly
 group by 1,2,3;
      '''

      cursor.execute(query)

In [5]:
with connection.cursor() as cursor:
      query = '''
update qa_reporting.dw_iqva_mbr_enrl_monthly a
set src_row_count = b.row_count,
      row_count_diff = a.dw_row_count - b.row_count,
      row_count_diff_percentage = 100. * abs(a.dw_row_count - b.row_count) / b.row_count,
      src_mbr_count = b.pat_count,
      mbr_count_diff = a.dw_src_mbr_id_count - b.pat_count,
      mbr_count_percentage = 100. * abs(a.dw_src_mbr_id_count - b.pat_count) / b.pat_count
from qa_reporting.iqvia_counts b
where data_source = 'iqva'
and a.calendar_year = b.year
and a.table_src ||'2' = b.table_name
;
      '''

      cursor.execute(query)
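
The difference and percentage columns set by the update above follow the formula 100 * |dw - src| / src. Below is a minimal pandas sketch of the same arithmetic on made-up counts. The helper name `add_diff_columns` and the guard against a zero source count are assumptions added for illustration; the SQL above does not include such a guard.

```python
import pandas as pd

def add_diff_columns(df, dw_col, src_col, prefix):
    """Add <prefix>_diff and <prefix>_diff_percentage columns to a copy of df."""
    out = df.copy()
    out[f"{prefix}_diff"] = out[dw_col] - out[src_col]
    # Replace a zero source count with NaN so the division yields NaN
    # instead of an infinite percentage (assumption, not in the SQL).
    denom = out[src_col].where(out[src_col] != 0)
    out[f"{prefix}_diff_percentage"] = 100.0 * out[f"{prefix}_diff"].abs() / denom
    return out

# Made-up counts, for illustration only.
toy = pd.DataFrame({
    "calendar_year": [2006, 2007, 2008],
    "dw_row_count": [100, 95, 5],
    "src_row_count": [100, 100, 0],
})
result = add_diff_columns(toy, "dw_row_count", "src_row_count", "row_count")
print(result)
```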

After inserting the counts from the dw_staging schema, let's see if there are any years where the counts do not match with the raw tables.

In [6]:
query = '''
select * 
from qa_reporting.dw_iqva_mbr_enrl_monthly
order by calendar_year
;'''
member_monthly_df = pd.read_sql(query, con=connection)
member_monthly_df




Unnamed: 0,data_source,calendar_year,table_src,dw_row_count,src_row_count,row_count_diff,row_count_diff_percentage,dw_uth_mbr_id_count,dw_src_mbr_id_count,src_mbr_count,mbr_count_diff,mbr_count_percentage,date_generated
0,iqva,2006,enroll,284739058,284739058,0,0.0,30675914,30675914,30675914,0,0.0,2024-01-18
1,iqva,2007,enroll,346468836,346468836,0,0.0,34820782,34820782,34820782,0,0.0,2024-01-18
2,iqva,2008,enroll,373065024,373065024,0,0.0,37240842,37240842,37240842,0,0.0,2024-01-18
3,iqva,2009,enroll,359279585,359279585,0,0.0,35208450,35208450,35208450,0,0.0,2024-01-18
4,iqva,2010,enroll,312564976,312564976,0,0.0,31386216,31386216,31386216,0,0.0,2024-01-18
5,iqva,2011,enroll,304518538,304518538,0,0.0,31088358,31088358,31088358,0,0.0,2024-01-18
6,iqva,2012,enroll,269545623,269545623,0,0.0,28221335,28221335,28221335,0,0.0,2024-01-18
7,iqva,2013,enroll,247873630,247873630,0,0.0,25579373,25579373,25579373,0,0.0,2024-01-18
8,iqva,2014,enroll,258722696,258722696,0,0.0,28543784,28543784,28543784,0,0.0,2024-01-18
9,iqva,2015,enroll,250504863,250504863,0,0.0,26380878,26380878,26380878,0,0.0,2024-01-18


In [7]:
member_monthly_df.to_clipboard(excel=True, index=False)

In [8]:
member_monthly_df[(member_monthly_df['row_count_diff_percentage'] > 1.) | (member_monthly_df['mbr_count_percentage'] > 1.)]

Unnamed: 0,data_source,calendar_year,table_src,dw_row_count,src_row_count,row_count_diff,row_count_diff_percentage,dw_uth_mbr_id_count,dw_src_mbr_id_count,src_mbr_count,mbr_count_diff,mbr_count_percentage,date_generated


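If the filtered result above does not have any rows, the row counts and member counts for every listed year are within 1% of the raw tables. Here all differences are exactly 0, so all of the rows from the raw tables are in this enrollment table at a monthly level.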
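
One caveat with a `> 1.` filter: if a year has no matching row in the source counts, the percentage columns stay empty (NaN), and a NaN never compares greater than 1, so that year would not be flagged. A small sketch of a filter that also surfaces such rows follows; the helper name `flag_mismatches` and the toy values are assumptions for illustration.

```python
import pandas as pd

def flag_mismatches(df, pct_cols, threshold=1.0):
    """Return rows whose percentage exceeds the threshold or is missing."""
    over = (df[pct_cols] > threshold).any(axis=1)
    missing = df[pct_cols].isna().any(axis=1)
    return df[over | missing]

# Made-up QA rows: 2007 is off by 2.5%, 2008 has no source count at all.
toy = pd.DataFrame({
    "calendar_year": [2006, 2007, 2008],
    "row_count_diff_percentage": [0.0, 2.5, float("nan")],
    "mbr_count_percentage": [0.0, 0.0, float("nan")],
})
flagged = flag_mismatches(toy, ["row_count_diff_percentage", "mbr_count_percentage"])
print(flagged)
```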

## Gender Count

Now that we have verified that the rows from the raw table, enroll2, have been added to the member_enrollment_monthly table, we will check that the counts for other columns, such as gender, have been carried over correctly to the DW table.

In this case we won't separate the counts by source table, just by calendar year.

In [9]:
query = '''with iqva_gen_cd as (
    select year, a.pat_id, der_sex
    from iqvia.enroll2 a
    join iqvia.enroll_synth b
    on a.pat_id = b.pat_id
),
iqva_gen as (
    select year, der_sex, count(*) gender_count
    from iqva_gen_cd
    group by 1,2
), dw_gen as (
    select year, gender_cd, count(*) gender_count
    from dw_staging.iqva_member_enrollment_monthly
    group by 1,2
)
select a.year, a.gender_cd, a.gender_count as dw_gender_count, b.gender_count as src_gender_count, 
        a.gender_count - b.gender_count as gender_count_diff, 
        100. * abs(a.gender_count - b.gender_count) / b.gender_count as gender_count_diff_percentage
from iqva_gen b
full outer join dw_gen a
on a.year = b.year
and a.gender_cd = b.der_sex;
'''
 
df = pd.read_sql(query,  con=connection)
df.sort_values(['year', 'gender_cd'])



Unnamed: 0,year,gender_cd,dw_gender_count,src_gender_count,gender_count_diff,gender_count_diff_percentage
10,2006,F,11761338,11761338.0,0.0,0.0
60,2006,M,10573018,10573018.0,0.0,0.0
3,2006,U,437,437.0,0.0,0.0
63,2006,,262404265,,,
11,2007,F,12770861,12770861.0,0.0,0.0
...,...,...,...,...,...,...
5,2022,M,48380270,48380270.0,0.0,0.0
15,2022,U,30033,30033.0,0.0,0.0
34,2023,F,12206640,12206640.0,0.0,0.0
14,2023,M,11445291,11445291.0,0.0,0.0
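
The 2006 row with an empty gender_cd has a DW count but no source count. One possible explanation, not verified here, is that NULL genders exist on both sides but never match in the join, because `NULL = NULL` is not true in SQL. The sketch below demonstrates that behaviour with an in-memory SQLite database and made-up counts; the table layout only mirrors the CTE names above. SQLite's null-safe comparison is `IS`, whereas in PostgreSQL the equivalent is `IS NOT DISTINCT FROM`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table dw_gen (year int, gender_cd text, gender_count int);
    create table src_gen (year int, der_sex text, gender_count int);
    -- Made-up counts: both sides hold a NULL-gender group for 2006.
    insert into dw_gen values (2006, 'F', 10), (2006, NULL, 7);
    insert into src_gen values (2006, 'F', 10), (2006, NULL, 7);
""")

# Plain equality: the NULL group finds no partner.
plain = con.execute("""
    select a.gender_cd, a.gender_count, b.gender_count
    from dw_gen a
    left join src_gen b
      on a.year = b.year and a.gender_cd = b.der_sex
    order by a.gender_cd
""").fetchall()

# Null-safe comparison: the NULL groups are matched to each other.
null_safe = con.execute("""
    select a.gender_cd, a.gender_count, b.gender_count
    from dw_gen a
    left join src_gen b
      on a.year = b.year and a.gender_cd is b.der_sex
    order by a.gender_cd
""").fetchall()

print(plain)
print(null_safe)
con.close()
```

If this is the cause, comparing on a coalesced value (for example mapping NULL to a placeholder on both sides) would let the NULL group be checked like the others.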


## Plan Type Counts

In [10]:
# Including enrollments where the mapped plan_type is NULL. Treating it as unknown ('U').
query = '''with iqva_enroll as (
    select year, pat_id, plan_type
    from iqvia.enroll2 a
    left join reference_tables.ref_plan_type c
    on c.data_source = 'iqva'
    and c.plan_type_src = a.prd_type
),
iqva_plans as (          
    select year, case when plan_type is null then 'U' else plan_type end as plan_type, count(*) plan_count
    from iqva_enroll a
    group by 1,2
),
dw_plans as (
    select year, case when plan_type is null then 'U' else plan_type end as plan_type,
            count(*) plan_count
    from dw_staging.iqva_member_enrollment_monthly
    group by 1,2
)
select a.year, a.plan_type, a.plan_count as dw_plan_count, b.plan_count as src_plan_count, 
        a.plan_count - b.plan_count as plan_count_diff, 
        100. * abs(a.plan_count - b.plan_count) / b.plan_count as plan_count_diff_percentage
from iqva_plans b
full outer join dw_plans a
on a.year = b.year
and a.plan_type = b.plan_type
order by year;
'''

plan_count_df = pd.read_sql(query,  con=connection)
plan_count_df.sort_values(['year', 'plan_type'])



Unnamed: 0,year,plan_type,dw_plan_count,src_plan_count,plan_count_diff,plan_count_diff_percentage
1,2006,CDHP,715632,715632,0,0.0
4,2006,FFS,15492469,15492469,0,0.0
0,2006,HMO,81070648,81070648,0,0.0
3,2006,POS,24654013,24654013,0,0.0
6,2006,PPO,160385054,160385054,0,0.0
...,...,...,...,...,...,...
109,2022,UNK,2178159,2178159,0,0.0
116,2023,CDHP,3443250,3443250,0,0.0
114,2023,HMO,6939852,6939852,0,0.0
117,2023,PPO,12651689,12651689,0,0.0
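
The source side of the query above maps `prd_type` to a standardized plan type through a left join and falls back to 'U' when no mapping is found. A minimal pandas sketch of that mapping step follows; the lookup values are made up for illustration and are not the contents of reference_tables.ref_plan_type.

```python
import pandas as pd

# Made-up lookup standing in for reference_tables.ref_plan_type.
ref_plan_type = {"H": "HMO", "P": "PPO"}

# Made-up source rows; None stands for a NULL prd_type.
src = pd.DataFrame({
    "year": [2006, 2006, 2006, 2006],
    "prd_type": ["H", "P", None, "H"],
})

# Unmapped or missing codes become NaN, then fall back to 'U',
# like the left join plus the CASE expression in the query.
src["plan_type"] = src["prd_type"].map(ref_plan_type).fillna("U")

plan_counts = (
    src.groupby(["year", "plan_type"])
       .size()
       .reset_index(name="plan_count")
)
print(plan_counts)
```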


In [11]:
plan_count_df[plan_count_df['plan_count_diff_percentage'] > 1.0]

Unnamed: 0,year,plan_type,dw_plan_count,src_plan_count,plan_count_diff,plan_count_diff_percentage


In [12]:
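# Select multiple columns with a list; indexing a groupby with a bare tuple of keys is not supported in recent pandas.
yearly_plan_count_df = plan_count_df.groupby('year')[['dw_plan_count', 'src_plan_count']].sum()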
yearly_plan_count_df['plan_count_diff'] = yearly_plan_count_df['dw_plan_count'] - yearly_plan_count_df['src_plan_count']
yearly_plan_count_df['plan_count_diff_percentage'] = 100.* abs(yearly_plan_count_df['plan_count_diff'] / yearly_plan_count_df['src_plan_count'])
yearly_plan_count_df



year,dw_plan_count,src_plan_count,plan_count_diff,plan_count_diff_percentage
2006,284739058,284739058,0,0.0
2007,346468836,346468836,0,0.0
2008,373065024,373065024,0,0.0
2009,359279585,359279585,0,0.0
2010,312564976,312564976,0,0.0
2011,304518538,304518538,0,0.0
2012,269545623,269545623,0,0.0
2013,247873630,247873630,0,0.0
2014,258722696,258722696,0,0.0
2015,250504863,250504863,0,0.0
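
The yearly plan totals above equal the dw_row_count values from the Row Counts section (for example 284739058 for 2006), which is a useful consistency check: every enrollment row should fall into exactly one plan type. A small sketch of automating that comparison follows, using two of the yearly values shown in this notebook. The variable names are assumptions, and in the real notebook the row counts would first need to be summed per year if more than one table_src is present.

```python
import pandas as pd

# Two years copied from the outputs above, for illustration.
row_counts = pd.DataFrame({
    "calendar_year": [2006, 2007],
    "dw_row_count": [284739058, 346468836],
})
plan_totals = pd.DataFrame(
    {"dw_plan_count": [284739058, 346468836]},
    index=pd.Index([2006, 2007], name="year"),
)

# Outer merge with an indicator so a year missing on either side is visible.
check = row_counts.merge(
    plan_totals.reset_index(),
    left_on="calendar_year",
    right_on="year",
    how="outer",
    indicator=True,
)
check["totals_agree"] = check["dw_row_count"] == check["dw_plan_count"]
print(check[["calendar_year", "dw_row_count", "dw_plan_count", "totals_agree", "_merge"]])
```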
