# Data Warehouse Truven QA - Member Enrollment Monthly

Performing QA on the member_enrollment_monthly tables in dw_staging before moving them to the data_warehouse schema.

## Initialization

Load the packages used below and initialize the connection to the GP DB.

In [1]:
import pandas as pd
import sys
import psycopg2
sys.path.append('H:/uth_helpers')
from db_utils import get_dsn

In [2]:
connection = psycopg2.connect(get_dsn()+' keepalives=1 keepalives_idle=30 keepalives_interval=10')
connection.autocommit = True

## Table Information

This table contains enrollment information at a monthly level. Depending on the data source, this information is either already stored at a monthly level or has to be derived from longer enrollment spans.

Data Sources:

* Optum Zip/Optum DoD: Enrollment information is not at a monthly level. Each enrollment record has a begin date and an end date, which may span more than a month. Enrollment tables are **mbr_enroll** and **mbr_co_enroll**
* Truven: The enrollment table, **t**, contains monthly-level enrollment data
* Medicaid: Enrollment tables (**enrl**, **chip_uth**, **htw_enrl**) are at the month level, usually identified by the **elig_month/elig_date** column
* Medicare: ?
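
For sources such as Optum, where one record covers a begin date through an end date, each span has to be expanded into one row per month before it fits a monthly table. The following is a minimal sketch of that expansion in plain Python. `months_in_span` is a hypothetical helper written for illustration; it is not the logic used to build the DW table.

```python
from datetime import date


def months_in_span(begin, end):
    """Yield the first day of every calendar month touched by [begin, end]."""
    year, month = begin.year, begin.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            # Roll over into January of the next year.
            year, month = year + 1, 1


# A made-up span that crosses a year boundary.
span_months = list(months_in_span(date(2019, 11, 15), date(2020, 2, 3)))
for month_start in span_months:
    print(month_start)
```

A span that starts and ends in the same month yields a single row, which is the shape Truven and Medicaid already provide.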


Ideally we should have counts for the enrollment tables from the raw sources. These counts are stored with the rest of the raw data table counts for each data source:

* Optum Zip: **qa_reporting.optum_zip_counts**
* Optum DoD: **qa_reporting.optum_dod_counts**
* Medicaid: **qa_reporting.mdcd_enrollment_counts_[cy/fy]**
* Truven: **qa_reporting.truven_counts**
* Medicare: Needs to be created

## Row Counts and Enrollment Counts

In [3]:
query = ''' drop table if exists qa_reporting.dw_truv_mbr_enrl_monthly;
create table qa_reporting.dw_truv_mbr_enrl_monthly
(
    data_source text,
    calendar_year int,
    table_src text,
    dw_row_count int,
    src_row_count int,
    row_count_diff int,
    row_count_diff_percentage float,
    dw_uth_mbr_id_count int,
    dw_src_mbr_id_count int,
    src_mbr_count int,
    mbr_count_diff int,
    mbr_count_percentage float,
    date_generated date
);
'''

with connection.cursor() as cursor:
    cursor.execute(query)

Truven is straightforward to QA for this table because the raw data is extracted and inserted into the enrollment table with very little transformation. We just need to confirm that all the rows in the ccaet and mdcrt tables are represented in this table. To do that, we compare the row counts of the raw tables against the member_enrollment_monthly table for each year. We also compare the number of distinct member ids in the raw tables against the member_enrollment_monthly table.
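
As a small worked example of the comparison, the sketch below computes the difference and the percentage difference with the same arithmetic as the update statements in the following cells. `count_diff` is a hypothetical helper, and returning `None` for a zero source count is a choice made only for this sketch.

```python
def count_diff(dw_count, src_count):
    """Return (diff, diff_percentage) for one DW count and one source count.

    diff is dw - src; the percentage is 100 * |diff| / src. Returning None
    for a zero source count is a choice made for this sketch only; the SQL
    would raise a division-by-zero error instead.
    """
    diff = dw_count - src_count
    if src_count == 0:
        return diff, None
    return diff, 100.0 * abs(diff) / src_count


# 2011 ccaet row counts taken from the results in this notebook: a perfect match.
print(count_diff(564688998, 564688998))
# A made-up mismatch: 990 rows loaded out of 1000 in the source.
print(count_diff(990, 1000))
```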

In [4]:
with connection.cursor() as cursor:
      query = '''
insert into qa_reporting.dw_truv_mbr_enrl_monthly
(data_source, calendar_year, table_src, dw_row_count, dw_uth_mbr_id_count, dw_src_mbr_id_count, date_generated)
select data_source, 
        year, 
        table_id_src, 
        count(*),
        count(distinct uth_member_id),
        count(distinct member_id_src),
        current_date
  from dw_staging.trum_member_enrollment_monthly
 group by 1,2,3;
      '''

      cursor.execute(query)

      query = '''
update qa_reporting.dw_truv_mbr_enrl_monthly a
set src_row_count = b.row_count,
    row_count_diff = a.dw_row_count - b.row_count,
    row_count_diff_percentage = 100. * abs(a.dw_row_count - b.row_count) / b.row_count,
    src_mbr_count = b.pat_count,
    mbr_count_diff = a.dw_src_mbr_id_count - b.pat_count,
    mbr_count_percentage = 100. * abs(a.dw_src_mbr_id_count - b.pat_count) / b.pat_count
from qa_reporting.truven_counts b
where a.data_source = 'trum'
and a.calendar_year = b.year
and a.table_src = b.table_name
;
      '''

      cursor.execute(query)
      

In [5]:
with connection.cursor() as cursor:
      query = '''
insert into qa_reporting.dw_truv_mbr_enrl_monthly
(data_source, calendar_year, table_src, dw_row_count, dw_uth_mbr_id_count, dw_src_mbr_id_count, date_generated)
select data_source, 
        year, 
        table_id_src, 
        count(*),
        count(distinct uth_member_id),
        count(distinct member_id_src),
        current_date
  from dw_staging.truc_member_enrollment_monthly
 group by 1,2,3;
      '''

      cursor.execute(query)

      query = '''
update qa_reporting.dw_truv_mbr_enrl_monthly a
set src_row_count = b.row_count,
    row_count_diff = a.dw_row_count - b.row_count,
    row_count_diff_percentage = 100. * abs(a.dw_row_count - b.row_count) / b.row_count,
    src_mbr_count = b.pat_count,
    mbr_count_diff = a.dw_src_mbr_id_count - b.pat_count,
    mbr_count_percentage = 100. * abs(a.dw_src_mbr_id_count - b.pat_count) / b.pat_count
from qa_reporting.truven_counts b
where a.data_source = 'truc'
and a.calendar_year = b.year
and a.table_src = b.table_name
;
      '''

      cursor.execute(query)
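
The two cells above differ only in the staging table name and the data source. One possible way to avoid the duplication is to build the insert statement from the data source code, as in the sketch below. `TRUVEN_SOURCES` and `build_count_insert` are hypothetical names, and the sketch assumes the staging tables keep the `dw_staging.<source>_member_enrollment_monthly` naming pattern seen above.

```python
# Data sources with a staging table named dw_staging.<source>_member_enrollment_monthly.
TRUVEN_SOURCES = ("trum", "truc")


def build_count_insert(data_source):
    """Build the count insert statement for one Truven data source.

    Table names cannot be passed as query parameters, so the value is checked
    against a fixed allow-list before it is formatted into the SQL text.
    """
    if data_source not in TRUVEN_SOURCES:
        raise ValueError(f"unexpected data source: {data_source!r}")
    return f"""
insert into qa_reporting.dw_truv_mbr_enrl_monthly
(data_source, calendar_year, table_src, dw_row_count, dw_uth_mbr_id_count, dw_src_mbr_id_count, date_generated)
select data_source, year, table_id_src,
       count(*), count(distinct uth_member_id), count(distinct member_id_src),
       current_date
  from dw_staging.{data_source}_member_enrollment_monthly
 group by 1,2,3;
"""


queries = [build_count_insert(source) for source in TRUVEN_SOURCES]
```

Each generated statement could then be passed to `cursor.execute` in a loop, in the same way as the cells above.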
      

After inserting the counts from the dw_staging schema, let's see if there are any years where the counts do not match with the raw tables.

In [6]:
query = '''
select * 
from qa_reporting.dw_truv_mbr_enrl_monthly
order by calendar_year
;'''
member_monthly_df = pd.read_sql(query, con=connection)
member_monthly_df




Unnamed: 0,data_source,calendar_year,table_src,dw_row_count,src_row_count,row_count_diff,row_count_diff_percentage,dw_uth_mbr_id_count,dw_src_mbr_id_count,src_mbr_count,mbr_count_diff,mbr_count_percentage,date_generated
0,truc,2011,ccaet,564688998,564688998,0,0.0,55559154,55559154,55559154,0,0.0,2023-10-11
1,trum,2011,mdcrt,56639475,56639475,0,0.0,5243029,5243029,5243029,0,0.0,2023-10-11
2,trum,2012,mdcrt,51474037,51474037,0,0.0,4874717,4874717,4874717,0,0.0,2023-10-11
3,truc,2012,ccaet,567019385,567019385,0,0.0,55975628,55975628,55975628,0,0.0,2023-10-11
4,truc,2013,ccaet,442453307,442453307,0,0.0,43737217,43737217,43737217,0,0.0,2023-10-11
5,trum,2013,mdcrt,45238684,45238684,0,0.0,4271755,4271755,4271755,0,0.0,2023-10-11
6,trum,2014,mdcrt,41158747,41158747,0,0.0,3868830,3868830,3868830,0,0.0,2023-10-11
7,truc,2014,ccaet,475186284,475186284,0,0.0,47258528,47258528,47258528,0,0.0,2023-10-11
8,truc,2015,ccaet,289793831,289793831,0,0.0,28348363,28348363,28348363,0,0.0,2023-10-11
9,trum,2015,mdcrt,24246912,24246912,0,0.0,2199633,2199633,2199633,0,0.0,2023-10-11


In [7]:
member_monthly_df[(member_monthly_df['row_count_diff'] != 0) | (member_monthly_df['mbr_count_diff'] != 0)]

Unnamed: 0,data_source,calendar_year,table_src,dw_row_count,src_row_count,row_count_diff,row_count_diff_percentage,dw_uth_mbr_id_count,dw_src_mbr_id_count,src_mbr_count,mbr_count_diff,mbr_count_percentage,date_generated


If the filter on **member_monthly_df** above returns no rows, it means that all of the rows from the raw tables are in this enrollment table at a monthly level and that the distinct member counts match as well.
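
The same check can be wrapped in a small function so that it can be reused or asserted on. The sketch below uses a made-up frame rather than the real results, and `mismatched_rows` is a hypothetical helper. Note that a row whose difference is missing (for example because the update found no matching source count) is also returned, since pandas evaluates `NaN != 0` as True.

```python
import pandas as pd


def mismatched_rows(df, diff_columns):
    """Return the rows where any of the given diff columns is non-zero or missing."""
    mask = df[diff_columns].ne(0).any(axis=1)
    return df[mask]


# Made-up data: 2011 matches, 2012 is short by 10 rows, 2013 has no source counts.
toy = pd.DataFrame({
    "calendar_year": [2011, 2012, 2013],
    "row_count_diff": [0, -10, None],
    "mbr_count_diff": [0, 0, None],
})

bad = mismatched_rows(toy, ["row_count_diff", "mbr_count_diff"])
print(bad)
```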

## Gender Count

Now that we have verified that most if not all of the rows from the raw tables, ccaet and mdcrt, have been added to the member_enrollment_monthly table, we will check that the counts for other columns such as gender have been correctly added to the DW table.

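In this case the counts are grouped by data source and calendar year. Because each data source maps to a single raw table (truc to ccaet, trum to mdcrt), there is no separate breakdown by source table.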

In [8]:
query = '''with truven_gen_cd as (
    select 'truc' as data_source, year, enrolid, sex
    from truven.ccaet
    union all
    select 'trum' as data_source, year, enrolid, sex
    from truven.mdcrt
),
truven_gen as (
    select m.data_source, year, c.gender_cd, count(*) gender_count
    from truven_gen_cd m
    left outer join reference_tables.ref_gender c
    on c.data_source = 'trv'
    and c.gender_cd_src = m.sex::text
    group by 1,2,3
), dw_gen as (
    select data_source, year, gender_cd, count(*) gender_count
    from dw_staging.trum_member_enrollment_monthly
    group by 1,2,3
    union all
    select data_source, year, gender_cd, count(*) gender_count
    from dw_staging.truc_member_enrollment_monthly
    group by 1,2,3

)
select coalesce(a.data_source, b.data_source) as data_source, coalesce(a.year, b.year) as year, a.gender_count as dw_gender_count, b.gender_count as src_gender_count,
        a.gender_count - b.gender_count as gender_count_diff, 
        100. * abs(a.gender_count - b.gender_count) / b.gender_count as gender_count_diff_percentage
from truven_gen b
full outer join dw_gen a
on a.year = b.year
and a.gender_cd = b.gender_cd
and a.data_source = b.data_source;
'''

pd.read_sql(query,  con=connection)



Unnamed: 0,data_source,year,dw_gender_count,src_gender_count,gender_count_diff,gender_count_diff_percentage
0,truc,2011,274522068,274522068,0,0.0
1,truc,2013,214678031,214678031,0,0.0
2,truc,2019,125478052,125478052,0,0.0
3,truc,2011,290166930,290166930,0,0.0
4,trum,2011,25252962,25252962,0,0.0
5,truc,2019,132371003,132371003,0,0.0
6,truc,2017,130923036,130923036,0,0.0
7,truc,2016,153748441,153748441,0,0.0
8,trum,2012,23145137,23145137,0,0.0
9,trum,2013,24599473,24599473,0,0.0
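
The query compares per-category counts with a full outer join, so a category that exists on only one side still produces a row. A minimal sketch of the same idea in plain Python is shown below. `compare_counts` is a hypothetical helper, and the gender codes and counts are made up for illustration.

```python
def compare_counts(dw_counts, src_counts):
    """Compare two {key: count} mappings with full-outer-join semantics."""
    rows = []
    for key in sorted(set(dw_counts) | set(src_counts)):
        dw = dw_counts.get(key)
        src = src_counts.get(key)
        # A key missing on either side has no defined difference.
        diff = dw - src if dw is not None and src is not None else None
        rows.append({"key": key, "dw": dw, "src": src, "diff": diff})
    return rows


# Keys are (data_source, year, gender_cd); all values here are made up.
dw = {("truc", 2011, "F"): 100, ("truc", 2011, "M"): 90}
src = {("truc", 2011, "F"): 100, ("truc", 2011, "M"): 90, ("truc", 2011, "U"): 5}

rows = compare_counts(dw, src)
# Keep rows that differ or that are missing on one side.
problems = [row for row in rows if row["diff"] != 0]
print(problems)
```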


## Plan Type Counts

In [9]:
# Including enrollments where the plantyp column is NULL. Treating it as if unknown.
query = '''with truven_enroll as (
    select 'truc' as data_source, year, enrolid, case when plantyp is null then 99 else plantyp end as plantyp
    from truven.ccaet
    union all
    select 'trum' as data_source, year, enrolid, case when plantyp is null then 99 else plantyp end as plantyp
    from truven.mdcrt
),
truven_plans as (
    select m.data_source, year, case when d.plan_type is null then 'UNK' else d.plan_type end as plan_type,
            count(*) plan_count
    from truven_enroll m
    left outer join reference_tables.ref_plan_type d
    on d.data_source = 'trv'
  and d.plan_type_src::int = m.plantyp 
    group by 1,2, 3
), dw_plans as (
    select data_source, year, case when plan_type is null then 'UNK' else plan_type end as plan_type,
            count(*) plan_count
    from dw_staging.truc_member_enrollment_monthly
    group by 1,2,3
    union all
    select data_source, year, case when plan_type is null then 'UNK' else plan_type end as plan_type,
            count(*) plan_count
    from dw_staging.trum_member_enrollment_monthly
    group by 1,2,3
)
select coalesce(a.year, b.year) as year, coalesce(a.plan_type, b.plan_type) as plan_type, a.plan_count as dw_plan_count, b.plan_count as src_plan_count,
        a.plan_count - b.plan_count as plan_count_diff, 
        100. * abs(a.plan_count - b.plan_count) / b.plan_count as plan_count_diff_percentage
from truven_plans b
full outer join dw_plans a
on a.year = b.year
and a.plan_type = b.plan_type
and a.data_source = b.data_source
order by year;
'''

plan_count_df = pd.read_sql(query,  con=connection)
plan_count_df



Unnamed: 0,year,plan_type,dw_plan_count,src_plan_count,plan_count_diff,plan_count_diff_percentage
0,2011,CDHP,22260917,22260917,0,0.0
1,2011,POS,1580048,1580048,0,0.0
2,2011,HMO,66342942,66342942,0,0.0
3,2011,UNK,2175078,2175078,0,0.0
4,2011,HDHP,32279,32279,0,0.0
...,...,...,...,...,...,...
192,2022,HDHP,205596,205596,0,0.0
193,2022,UNK,1136218,1136218,0,0.0
194,2022,EPO,85657,85657,0,0.0
195,2022,EPO,1580742,1580742,0,0.0
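
The query handles missing plan types in two steps: a NULL `plantyp` is replaced by 99, and any code without a match in the reference table is labeled `UNK`. The sketch below mirrors that logic in plain Python. `normalize_plan_type` is a hypothetical helper, and the two entries in `EXAMPLE_REF_PLAN_TYPE` are assumed values for illustration, not the contents of `reference_tables.ref_plan_type`.

```python
UNKNOWN_CODE = 99       # stand-in used by the query for a NULL plantyp
UNKNOWN_LABEL = "UNK"   # label used by the query when no reference row matches

# Illustrative subset only; the real mapping lives in reference_tables.ref_plan_type.
EXAMPLE_REF_PLAN_TYPE = {4: "HMO", 5: "POS"}


def normalize_plan_type(plantyp, ref=EXAMPLE_REF_PLAN_TYPE):
    """Map a raw plantyp code to a plan type label, defaulting to UNK."""
    code = UNKNOWN_CODE if plantyp is None else plantyp
    return ref.get(code, UNKNOWN_LABEL)


for raw in (4, None, 42):
    print(raw, normalize_plan_type(raw))
```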


In [10]:
plan_count_df[plan_count_df['plan_count_diff'] != 0]

Unnamed: 0,year,plan_type,dw_plan_count,src_plan_count,plan_count_diff,plan_count_diff_percentage


In [11]:
year_row_count = member_monthly_df.groupby(['calendar_year'])[['dw_row_count', 'dw_uth_mbr_id_count']].sum()
year_row_count

Unnamed: 0_level_0,dw_row_count,dw_uth_mbr_id_count
calendar_year,Unnamed: 1_level_1,Unnamed: 2_level_1
2011,621328473,60802183
2012,618493422,60850345
2013,487691991,48008972
2014,516345031,51127358
2015,314040743,30547996
2016,320175486,30856457
2017,287386861,27620062
2018,288516600,28218353
2019,276086833,27021218
2020,260249638,25009767


In [13]:
plan_count_df.groupby(['year'])[['dw_plan_count', 'src_plan_count']].sum()

Unnamed: 0_level_0,dw_plan_count,src_plan_count
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2011,621328473,621328473
2012,618493422,618493422
2013,487691991,487691991
2014,516345031,516345031
2015,314040743,314040743
2016,320175486,320175486
2017,287386861,287386861
2018,288516600,288516600
2019,276086833,276086833
2020,260249638,260249638


In [14]:
pd.DataFrame(year_row_count).merge(plan_count_df.groupby(['year'])[['dw_plan_count', 'src_plan_count']].sum(), how='inner', left_on='calendar_year', right_on='year')

Unnamed: 0,dw_row_count,dw_uth_mbr_id_count,dw_plan_count,src_plan_count
0,621328473,60802183,621328473,621328473
1,618493422,60850345,618493422,618493422
2,487691991,48008972,487691991,487691991
3,516345031,51127358,516345031,516345031
4,314040743,30547996,314040743,314040743
5,320175486,30856457,320175486,320175486
6,287386861,27620062,287386861,287386861
7,288516600,28218353,288516600,288516600
8,276086833,27021218,276086833,276086833
9,260249638,25009767,260249638,260249638
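
The merged totals above can also be checked programmatically: for every year, the DW row count, the DW plan count, and the source plan count should all be equal. The sketch below rebuilds the first three years from the output above and flags any year that does not reconcile. The variable names are chosen for this sketch only.

```python
import pandas as pd

# Per-year totals copied from the outputs above (2011 to 2013 only).
year_row_count = pd.DataFrame(
    {"dw_row_count": [621328473, 618493422, 487691991]},
    index=pd.Index([2011, 2012, 2013], name="calendar_year"),
)
plan_totals = pd.DataFrame(
    {
        "dw_plan_count": [621328473, 618493422, 487691991],
        "src_plan_count": [621328473, 618493422, 487691991],
    },
    index=pd.Index([2011, 2012, 2013], name="year"),
)

# An outer join keeps years that appear on only one side; their missing
# values compare as unequal and are therefore flagged below.
merged = year_row_count.join(plan_totals, how="outer")
consistent = (
    merged["dw_row_count"].eq(merged["dw_plan_count"])
    & merged["dw_plan_count"].eq(merged["src_plan_count"])
)
unreconciled_years = merged.index[~consistent].tolist()
print(unreconciled_years)
```

Joining on the index keeps the year as the row label, which makes any flagged year easy to identify.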


## Employee Status

In [15]:
query = '''with truven_enroll as (
    select 'truc' as data_source, year, enrolid, eestatu
    from truven.ccaet
    union all
    select 'trum' as data_source, year, enrolid, eestatu
    from truven.mdcrt
),
truven_plans as (
    select data_source, year, eestatu, count(*) employee_status_count
    from truven_enroll m
    group by 1,2,3
), dw_plans as (
    select data_source, year, employee_status, count(*) employee_status_count
    from dw_staging.truc_member_enrollment_monthly
    group by 1,2,3
    union all
    select data_source, year, employee_status, count(*) employee_status_count
    from dw_staging.trum_member_enrollment_monthly
    group by 1,2,3
)
select a.data_source, a.year, a.employee_status, a.employee_status_count as dw_employee_status_count, b.employee_status_count as src_employee_status_count, 
        a.employee_status_count - b.employee_status_count as employee_status_count_diff, 
        100. * abs(a.employee_status_count - b.employee_status_count) / b.employee_status_count as employee_status_count_diff_percentage
from truven_plans b
join dw_plans a
on a.year = b.year
and a.employee_status::int = b.eestatu
and a.data_source = b.data_source
order by year;
'''

employee_status_count_df = pd.read_sql(query,  con=connection)
employee_status_count_df



Unnamed: 0,data_source,year,employee_status,dw_employee_status_count,src_employee_status_count,employee_status_count_diff,employee_status_count_diff_percentage
0,trum,2011,6,13647,13647,0,0.0
1,trum,2011,2,37820,37820,0,0.0
2,trum,2011,1,1361670,1361670,0,0.0
3,truc,2011,6,1859798,1859798,0,0.0
4,truc,2011,5,3604269,3604269,0,0.0
...,...,...,...,...,...,...,...
211,trum,2022,5,1080336,1080336,0,0.0
212,truc,2022,3,5353083,5353083,0,0.0
213,trum,2022,3,48175,48175,0,0.0
214,trum,2022,4,10949178,10949178,0,0.0


In [16]:
employee_status_count_df[employee_status_count_df['employee_status_count_diff'] != 0]

Unnamed: 0,data_source,year,employee_status,dw_employee_status_count,src_employee_status_count,employee_status_count_diff,employee_status_count_diff_percentage
