# 02 - Data Cleaning

## Objectives

* Check whether there is duplicate data and if so, address this as necessary
* Remove encounters in which the patient died, since by definition these patients cannot be readmitted
* Determine the extent of missing data
* Evaluate the most suitable approach to deal with missing data
* Clean data

## Inputs

* CSV file generated in the previous notebook: `outputs/datasets/collection/diabetic_data.csv`

## Outputs

* Cleaned data, to be stored in new folder outputs/datasets/cleaned
* Data cleaning pipeline

---

# Change working directory

* This notebook is stored in the `jupyter_notebooks` subfolder
* The current working directory therefore needs to be changed to the workspace, i.e., the working directory needs to be changed from the current folder to its parent folder

First, the current directory is retrieved with `os.getcwd()`:

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\diabetes-data-analysis\\jupyter_notebooks'

Next, the working directory is set as the parent of the current `jupyter_notebooks` directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory
* This allows access to all the files and folders within the workspace, rather than solely those within the `jupyter_notebooks` directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Finally, it is confirmed that the new current directory has been successfully set

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\diabetes-data-analysis'
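One caveat with the cells above: re-running them in order moves up one further level each time, which would leave the workspace. A minimal sketch of a guard follows, assuming the notebooks folder is named `jupyter_notebooks` as stated above. The helper name `move_to_workspace` and its parameter are hypothetical and not part of the original notebook.

```python
import os


def move_to_workspace(notebooks_dirname="jupyter_notebooks"):
    """Step up to the parent folder only when inside the notebooks folder.

    Hypothetical helper for illustration; returns the resulting working directory.
    """
    cwd = os.getcwd()
    # Only move up when the current folder is the notebooks subfolder,
    # so running this twice does not climb above the workspace.
    if os.path.basename(cwd) == notebooks_dirname:
        os.chdir(os.path.dirname(cwd))
    return os.getcwd()
```

Calling `move_to_workspace()` a second time leaves the working directory unchanged, because the workspace folder itself is not named `jupyter_notebooks`.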

# Load data

The data is loaded from the outputs/datasets/collection folder:

In [4]:
import pandas as pd
df = pd.read_csv('outputs/datasets/collection/diabetic_data.csv')
df.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO


---

# Examine data for duplication issues

The first field to check for duplication is `encounter_id`. If the data has been recorded correctly, there should be no duplicate encounters.

In [7]:
df[df.duplicated(subset='encounter_id')]

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted


It can be seen that there are no duplicate values recorded for `encounter_id`.
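The same check can be expressed as a single boolean or a count rather than an empty dataframe. The sketch below uses a small toy dataframe (the values are invented for illustration and are not taken from the dataset) with the pandas `Series.is_unique` attribute and `DataFrame.duplicated`:

```python
import pandas as pd

# Toy data for illustration only; these values are not taken from the dataset.
toy = pd.DataFrame({"encounter_id": [101, 102, 103]})

# True when every encounter_id appears exactly once
encounters_unique = toy["encounter_id"].is_unique

# Number of rows that repeat an earlier encounter_id
n_duplicate_encounters = int(toy.duplicated(subset="encounter_id").sum())

print(encounters_unique, n_duplicate_encounters)
```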

However, there could also be duplicate values for `patient_nbr`, the unique identifier of a patient.
* Since the target variable in the dataset relates to patient readmissions, it seems highly likely that there will be multiple instances of the same patient number within the database

In [8]:
df[df.duplicated(subset='patient_nbr')]

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
79,1070256,23043240,Caucasian,Female,[50-60),?,2,1,4,3,...,No,Steady,No,No,No,No,No,No,Yes,>30
81,1077924,21820806,AfricanAmerican,Male,[50-60),?,1,6,7,3,...,No,No,No,No,No,No,No,No,No,NO
143,2309376,41606064,Caucasian,Male,[20-30),?,2,1,2,2,...,No,Steady,No,No,No,No,No,No,Yes,>30
175,2552952,86240259,Caucasian,Female,[70-80),?,1,3,7,11,...,No,Up,No,No,No,No,No,Ch,Yes,>30
307,3174918,5332491,Other,Female,[60-70),?,6,25,7,5,...,No,Steady,No,No,No,No,No,No,Yes,NO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101760,443847176,50375628,AfricanAmerican,Female,[60-70),?,1,1,7,6,...,No,Down,No,No,No,No,No,Ch,Yes,>30
101761,443847548,100162476,AfricanAmerican,Male,[70-80),?,1,3,7,3,...,No,Down,No,No,No,No,No,Ch,Yes,>30
101762,443847782,74694222,AfricanAmerican,Female,[80-90),?,1,4,5,5,...,No,Steady,No,No,No,No,No,No,Yes,NO
101763,443854148,41088789,Caucasian,Male,[70-80),?,1,1,7,1,...,No,Down,No,No,No,No,No,Ch,Yes,NO


There is a relatively large number of patients who appear in the database multiple times.

The obvious question, then, is how these multiple encounters for a single patient should be dealt with.
* To address this question, consider the purpose of the data analysis
* The client wishes to understand the factors that are likely to contribute to readmission, and address these where possible
* Most variables can differ between a patient's encounters, such as time in hospital, medication, and over time even age and weight, and these differences may lead to different values for the target variable of readmission
* As such, multiple encounters of a single patient are left in the database for now. This could be revisited at a later stage if necessary
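To put a number on how many patients appear more than once, the encounters per patient can be counted. This is a minimal sketch on toy data (the values are invented for illustration); the variable names are assumptions, not part of the original notebook:

```python
import pandas as pd

# Toy data for illustration only; these values are not taken from the dataset.
toy = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4, 5],
    "patient_nbr": [10, 20, 10, 30, 10],
})

# Number of encounters recorded for each patient
encounters_per_patient = toy["patient_nbr"].value_counts()

# Patients with more than one encounter
n_repeat_patients = int((encounters_per_patient > 1).sum())

# Rows flagged by duplicated(): every encounter after a patient's first
n_later_encounters = int(toy.duplicated(subset="patient_nbr").sum())

print(n_repeat_patients, n_later_encounters)
```

Note that `duplicated(subset='patient_nbr')` flags only the second and later encounters of each patient, so the number of rows it returns is not the same as the number of patients with repeat visits.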

---

# Remove deaths from database

The location to which patients were discharged when they left hospital is included in the database as `discharge_disposition_id`
* A number of patients died
* By definition these patients cannot be readmitted
* While it could also be interesting to look at the pattern of variables relating to death as an outcome, this is not part of the scope of the current analysis
* It therefore makes sense to remove these patients from the database

In [21]:
death_codes = [11, 19, 20, 21]

df = df.loc[~(df['discharge_disposition_id'].isin(death_codes))]
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100114 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              100114 non-null  int64 
 1   patient_nbr               100114 non-null  int64 
 2   race                      100114 non-null  object
 3   gender                    100114 non-null  object
 4   age                       100114 non-null  object
 5   weight                    100114 non-null  object
 6   admission_type_id         100114 non-null  int64 
 7   discharge_disposition_id  100114 non-null  int64 
 8   admission_source_id       100114 non-null  int64 
 9   time_in_hospital          100114 non-null  int64 
 10  payer_code                100114 non-null  object
 11  medical_specialty         100114 non-null  object
 12  num_lab_procedures        100114 non-null  int64 
 13  num_procedures            100114 non-null  int64 
 14  num_medic
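
A quick sanity check on the filter is to compare row counts before and after, and to confirm that no death code remains. The sketch below applies the same `isin` filter to toy data (the rows are invented for illustration); `rows_before` and `rows_removed` are hypothetical names:

```python
import pandas as pd

death_codes = [11, 19, 20, 21]

# Toy data for illustration only; these rows are not taken from the dataset.
toy = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4],
    "discharge_disposition_id": [1, 11, 6, 20],
})

rows_before = len(toy)
toy = toy.loc[~toy["discharge_disposition_id"].isin(death_codes)]
rows_removed = rows_before - len(toy)

# After filtering, no remaining row should carry a death code
assert not toy["discharge_disposition_id"].isin(death_codes).any()
print(f"Removed {rows_removed} of {rows_before} encounters")
```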

---

# Determine the extent of missing data

As seen in the dataframes above and noted in the previous notebook, there is clearly some missing data in the `weight` column.
* The next step is to get all variables that have missing values
* An initial approach to this might be to use the `isna` method:

In [26]:
vars_with_missing_data = df.columns[df.isna().any()].tolist()
vars_with_missing_data

['max_glu_serum', 'A1Cresult']

However, it is already clear that `weight`, for example, has missing values, and that these are coded as a question mark `?` in the database.
* Additionally, the client has informed us that non-measurement of A1C is a potential issue that could have implications for readmission
* As such, a recording of 'None' in the current database does not mean that the data is missing per se, but rather signifies that the value was never recorded in the hospital, a fact that should be taken into account during analysis


Instead, the dataframe can be filtered to show columns that contain a question mark as follows:

In [38]:
vars_with_missing_data = df.columns[df.eq("?").any()].tolist()
vars_with_missing_data

['race',
 'weight',
 'payer_code',
 'medical_specialty',
 'diag_1',
 'diag_2',
 'diag_3']

For columns that contain a question mark, it is useful to know how much of the data is missing
* It is also useful to view this as a percentage, to help determine the most appropriate method of dealing with the missing data

In [52]:
print("Column name; number of datapoints missing; percentage of data missing")
for column_name in vars_with_missing_data:
    # Count the "?" placeholders in this column (0 if there are none)
    count = df[column_name].value_counts().get("?", 0)
    print(column_name, count, round(100 * count / len(df[column_name]), 1))

Column name; number of datapoints missing; percentage of data missing
race 2239 2.2
weight 96958 96.8
payer_code 39591 39.5
medical_specialty 49129 49.1
diag_1 21 0.0
diag_2 358 0.4
diag_3 1421 1.4
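
An alternative to matching the `?` string is to declare it as a missing-value marker when the CSV is loaded, so that the standard `isna` method works. The sketch below uses a toy CSV (the rows are invented for illustration). It also passes `keep_default_na=False` so that the literal string `None` is kept as a category; this assumes that the `NaN` values reported earlier for `max_glu_serum` and `A1Cresult` come from `None` strings that pandas treated as missing by default, which may depend on the pandas version in use.

```python
import io

import pandas as pd

# Toy CSV for illustration only; these rows are not taken from the dataset.
toy_csv = io.StringIO(
    "race,weight,A1Cresult\n"
    "Caucasian,?,None\n"
    "?,[75-100),>7\n"
    "AfricanAmerican,?,None\n"
    "Caucasian,?,Norm\n"
)

# keep_default_na=False stops pandas applying its built-in list of
# missing-value markers, so the string "None" is kept as-is;
# na_values marks "?" as missing instead.
toy = pd.read_csv(toy_csv, keep_default_na=False, na_values=["?"])

missing_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (100 * toy.isna().mean()).round(1),
})

# Keep only the columns that actually have missing values
missing_summary = missing_summary[missing_summary["n_missing"] > 0]
print(missing_summary)
```

With this approach, empty cells are no longer treated as missing unless an empty string is also added to `na_values`, so it is a trade-off rather than a drop-in replacement.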


---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
    # Create the folder for the cleaned data listed under Outputs
    os.makedirs(name='outputs/datasets/cleaned')
except Exception as e:
    print(e)
