# Revised case normalization for Hirslanden Beau Site 2018

This Jupyter notebook normalizes the revised cases from DtoD.

Before running the notebook, the raw_data folder needs to be added to the root directory.

The raw data folder can be found here: https://aimedic.sharepoint.com/:f:/s/dev/Ejx_A1dg8gtPumFknOWOh0oBi6ofx9hctYiq3c-0gH9vYA?e=UmcgrS

Normalization:

- Convert the column names to the names used in the database
- Delete cases with empty values in the following columns (VALIDATION_COLS): 'case_id', 'patient_id', 'gender', 'age_years', 'duration_of_stay', 'pccl', 'drg'
- Select the necessary columns (COLS_TO_SELECT): case_id, patient_id, gender, age_years, duration_of_stay, pccl, drg, pd, bfs_code, added_icds, removed_icds, added_chops, removed_chops
- Still to do (TODO):
    - Check CHOP upper/lowercase
    - Check whether the PD changed. If it did, the new and old PD are stored together with the added and removed ICDs, respectively
    - Pad case IDs with 0s
    - Write a function to validate cases
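
The normalization steps listed above can be sketched in pandas as follows. This is only an illustration, not the project's `normalize` function: the raw column names in `COLUMN_MAPPING`, the helper `normalize_sketch`, and the padding width of 10 are assumptions made for the example, and the second input row is made up to show an incomplete case being dropped.

```python
import pandas as pd

# Hypothetical raw-to-database column mapping; the real mapping lives in the
# project's configuration and is not shown in this notebook.
COLUMN_MAPPING = {
    "FID": "case_id",
    "PatID": "patient_id",
    "Sex": "gender",
    "Age": "age_years",
    "LOS": "duration_of_stay",
    "PCCL": "pccl",
    "DRG": "drg",
}
VALIDATION_COLS = ["case_id", "patient_id", "gender", "age_years",
                   "duration_of_stay", "pccl", "drg"]


def normalize_sketch(raw: pd.DataFrame, case_id_width: int = 10) -> pd.DataFrame:
    """Rename columns, drop incomplete cases, and zero-pad case IDs."""
    df = raw.rename(columns=COLUMN_MAPPING)
    # Drop cases with a missing value in any validation column.
    df = df.dropna(subset=VALIDATION_COLS)
    # Keep only the selected columns (here just the validation columns).
    df = df[VALIDATION_COLS].copy()
    # Pad case IDs with leading zeros; the width is an assumption.
    df["case_id"] = df["case_id"].astype(str).str.zfill(case_id_width)
    return df.reset_index(drop=True)


raw = pd.DataFrame({
    "FID": ["41635636", None],  # second row is made up and lacks a case ID
    "PatID": ["5050208", "5087263"],
    "Sex": ["W", "W"],
    "Age": [83, 74],
    "LOS": [5, 11],
    "PCCL": [3, 3],
    "DRG": ["E65C", "F03D"],
})
normalized = normalize_sketch(raw)
print(normalized)
```

Padding is done after the empty-value check so that a missing case ID is dropped rather than turned into a padded placeholder string.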


In [1]:
import pandas as pd
import os
from dataclasses import dataclass, field
import sys
sys.path.insert(0, '/home/jovyan/work')
sys.path.insert(1, '/home/jovyan/work/src')
sys.path.insert(2, '/home/jovyan/work/src/service')

from service import bfs_cases_db_service as bfs_db

from py.global_configs import *
from py.normalize import normalize



In [2]:
# List the names of all available files

FILES_TO_ANALYZE.keys()


dict_keys(['Hirslanden Salem 2017', 'Hirslanden Beau Site 2017', 'Hirslanden Linde 2017', 'Hirslanden Linde 2018', 'Hirslanden Salem 2018', 'Hirslanden Beau Site 2018'])

In [3]:
file = FILES_TO_ANALYZE['Hirslanden Beau Site 2018']
file

FileInfo(path='/home/jovyan/work/src/revised_case_normalization/raw_data/HI-Bern_Salem_Beau Site_Linde.xlsx', hospital_name_db='Hirslanden Beau Site', year='2018', sheets=['Änderungen Beau Site_ 2018'])

In [4]:

df_revised_case_d2d = normalize(file, 0)

Read 26 cases for Hirslanden Beau Site 2018
TYPES:
case_id             string
patient_id          string
gender              string
age_years            int64
duration_of_stay     int64
pccl                 int64
drg                 string
pd                  string
bfs_code            string
added_icds          string
removed_icds        string
added_chops         string
removed_chops       string
dtype: object


In [5]:
df_revised_case_d2d.head()

Unnamed: 0,case_id,patient_id,gender,age_years,duration_of_stay,pccl,drg,pd,bfs_code,added_icds,removed_icds,added_chops,removed_chops
0,41635636,5050208,W,83,5,3,E65C,J4419,M100,B965,,,
1,41647892,5087263,W,74,11,3,F03D,I350,M200,I490,,9962.0,
2,41656218,22162393,M,77,11,3,F03D,I350,M200,"J61,D689",,,
3,41660312,5287267,W,54,30,3,G02C,K5722,M200,E43,E46,,
4,41670261,5121840,M,90,6,3,F71B,I480,M100,N183,N184,,


# Match to the database


In [6]:
# Get the case IDs of the revised cases

revised_case_id = df_revised_case_d2d['case_id'].values
revised_case_id

array(['0041635636', '0041647892', '0041656218', '0041660312',
       '0041670261', '0041622665', '0041631826', '0041632678',
       '0041642176', '0041670429', '0041678445', '0041796989',
       '0041718006', '0041727313', '0041740292', '0041755915',
       '0041762114', '0041766569', '0041767906', '0041778442',
       '0041781983', '0041830088', '0041861755', '0041863191',
       '0041869562', '0041897876'], dtype=object)

In [7]:
# Look up the revised cases in the database by case ID
revised_case_db = bfs_db.get_bfs_cases_by_ids(revised_case_id)
revised_case_db.head()

Unnamed: 0,drg_cost_weight,aimedic_id,hospital_id,case_id,patient_id,age_years,age_days,gender,duration_of_stay,clinic_id,ventilation_hours,admission_weight,gestation_age,admission_date,admission_type,discharge_date,discharge_type,drg,adrg,pccl
0,2.228,125689,6,41622665,52E7EFDD3FDA7867,79,0,M,12,4,0,0,0,2018-01-08,1,2018-01-20,0,G17Z,G17,3
1,3.301,125812,6,41631826,87192FE13E531984,59,0,M,11,4,16,0,0,2018-01-07,1,2018-01-18,0,F06D,F06,3
2,1.781,125826,6,41632678,C8664A96222FC800,58,0,W,5,4,0,0,0,2018-01-17,1,2018-01-22,0,G18B,G18,3
3,0.671,125880,6,41635636,EDF81030E54CC0DD,83,0,W,5,3,0,0,0,2018-01-08,1,2018-01-13,0,E65C,E65,3
4,1.56,125947,6,41642176,77C1AAA07A0C492E,82,0,M,10,3,0,0,0,2018-01-12,11,2018-01-22,0,F13C,F13,3


In [8]:
# Report how many revised cases were found in the database
print('There are {} out of {} revised cases from DtoD that are matched with the database for {} {}'.format(len(revised_case_db), len(revised_case_id), file.hospital_name_db, file.year))

There are 26 out of 26 revised cases from DtoD that are matched with the database for Hirslanden Beau Site 2018
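
When the two counts differ, it helps to know which case IDs were not found. A minimal sketch in plain Python, assuming that the revised data and the database may format case IDs differently (for example with and without leading zeros); `find_unmatched_ids` is a hypothetical helper, and the last ID in the example is made up:

```python
def find_unmatched_ids(revised_ids, db_ids):
    """Return revised case IDs with no database counterpart, ignoring leading zeros."""
    db_keys = {str(case_id).lstrip("0") for case_id in db_ids}
    return [case_id for case_id in revised_ids
            if str(case_id).lstrip("0") not in db_keys]


revised_ids = ["0041635636", "0041647892", "0099999999"]  # last ID is made up
db_ids = [41635636, 41647892]
unmatched = find_unmatched_ids(revised_ids, db_ids)
print(unmatched)  # ['0099999999']
```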


In [9]:
# If matching cases are found, check that case_id, gender, age_years, ... also match

In [10]:
revised_case_db_subset = revised_case_db[['aimedic_id', 'case_id', 'gender', 'age_years']]
revised_case_db_subset.head()

Unnamed: 0,aimedic_id,case_id,gender,age_years
0,125689,41622665,M,79
1,125812,41631826,M,59
2,125826,41632678,W,58
3,125880,41635636,W,83
4,125947,41642176,M,82
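
The field check mentioned in cell 9 could be sketched as a merge followed by a column-wise comparison. This is one possible approach under stated assumptions, not the notebook's implementation: `find_mismatches` is a hypothetical helper, the join key strips leading zeros in case the two sources format case IDs differently, and the second database row below is made up to produce a mismatch.

```python
import pandas as pd


def find_mismatches(revised: pd.DataFrame, db: pd.DataFrame,
                    cols=("gender", "age_years")) -> pd.DataFrame:
    """Return the matched cases whose revised and database values differ in any of cols."""
    # Join on a key without leading zeros (assumption about ID formatting).
    left = revised.assign(_key=revised["case_id"].astype(str).str.lstrip("0"))
    right = db.assign(_key=db["case_id"].astype(str).str.lstrip("0"))
    merged = left.merge(right, on="_key", suffixes=("_revised", "_db"))
    # Flag a row as soon as one of the compared columns disagrees.
    differs = pd.Series(False, index=merged.index)
    for col in cols:
        differs |= merged[f"{col}_revised"] != merged[f"{col}_db"]
    return merged.loc[differs].drop(columns="_key")


revised = pd.DataFrame({
    "case_id": ["0041635636", "0041647892"],
    "gender": ["W", "W"],
    "age_years": [83, 74],
})
db = pd.DataFrame({
    "case_id": [41635636, 41647892],
    "gender": ["W", "M"],  # made-up gender mismatch for the second case
    "age_years": [83, 74],
})
mismatches = find_mismatches(revised, db)
print(mismatches)
```

An empty result means every matched case agrees on the compared columns; any rows returned can be inspected through their `_revised` and `_db` column pairs.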
