Let's mount Google Drive to get easy access to the course data:

In [0]:
# Mount Google Drive at /content/gdrive
from google.colab import drive
drive.mount('/content/gdrive')

Import the `pandas` library; for convenience, call it `pd`.

In [0]:
import pandas as pd

Using `pd.read_csv`, load both files (`PATIENTS.csv` and `DIAGNOSES_ICD.csv`) from the <a href="https://alpha.physionet.org/content/mimiciii-demo/1.4/">MIMIC-III Clinical Database Demo</a>, which is licensed under the <a href="https://opendatacommons.org/licenses/odbl/index.html">ODC Open Database License (ODbL)</a>.

**References:**

Johnson, A., Pollard, T., Mark, R. (2019). MIMIC-III Clinical Database Demo. PhysioNet. doi:10.13026/C2HM2Q

Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 160035.

Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. Ch., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101(23), e215-e220.

In [0]:
patients = pd.read_csv('/content/gdrive/My Drive/[YCMI_CBDS Summer Course] Data/PATIENTS.csv')
diagnoses = pd.read_csv('/content/gdrive/My Drive/[YCMI_CBDS Summer Course] Data/DIAGNOSES_ICD.csv')
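
The outputs in this notebook show an `Unnamed: 0` column, which is the saved row index being read back in as data. The sketch below uses a hypothetical two-row CSV (not the course files) to show two `pd.read_csv` options that may be useful here: `index_col=0` to reuse that first column as the index, and `dtype` to keep diagnosis codes as strings so that leading zeros survive.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for DIAGNOSES_ICD.csv (not the course data).
csv_text = """,row_id,subject_id,hadm_id,seq_num,icd9_code
0,1,100,500,1,0389
1,2,100,500,2,5070
"""

# index_col=0 uses the first, unnamed column as the index instead of
# creating an "Unnamed: 0" column; dtype keeps the codes as strings.
demo = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype={'icd9_code': str})

print(demo.columns.tolist())
print(demo['icd9_code'].tolist())
```

The rest of the notebook keeps the default behavior, so its outputs still include `Unnamed: 0`.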

Use the `.head()` method to see what each of the two data frames looks like. This is a good first step whenever you start working with new data:

In [0]:
patients.head()

Unnamed: 0,row_id,subject_id,gender,dob,dod,dod_hosp,dod_ssn,expire_flag
0,9467,10006,F,2094-03-05 00:00:00,2165-08-12 00:00:00,2165-08-12 00:00:00,2165-08-12 00:00:00,1
1,9472,10011,F,2090-06-05 00:00:00,2126-08-28 00:00:00,2126-08-28 00:00:00,,1
2,9474,10013,F,2038-09-03 00:00:00,2125-10-07 00:00:00,2125-10-07 00:00:00,2125-10-07 00:00:00,1
3,9478,10017,F,2075-09-21 00:00:00,2152-09-12 00:00:00,,2152-09-12 00:00:00,1
4,9479,10019,M,2114-06-20 00:00:00,2163-05-15 00:00:00,2163-05-15 00:00:00,2163-05-15 00:00:00,1


In [0]:
diagnoses.head()

Unnamed: 0,row_id,subject_id,hadm_id,seq_num,icd9_code
0,112344,10006,142345,1,99591
1,112345,10006,142345,2,99662
2,112346,10006,142345,3,5672
3,112347,10006,142345,4,40391
4,112348,10006,142345,5,42731


You may have noticed that the patients data frame has a column `gender`. Run the `value_counts` method on the patients `gender` column to see how many males and females are in the data set.

In [0]:
patients['gender'].value_counts()

F    55
M    45
Name: gender, dtype: int64

You may have also noticed that there are several timestamp columns, including `dob` (date of birth) and `dod` (date of death). By default, `pd.read_csv` loads them as text (strings). Use `pd.to_datetime` to convert these two columns to timestamps, and assign the results back to the same columns.

In [0]:
patients['dob'] = pd.to_datetime(patients['dob'])
patients['dod'] = pd.to_datetime(patients['dod'])
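
To check that a conversion like this worked, you can inspect the column's `dtype` before and after. The following is a minimal sketch on a hypothetical two-row frame (`toy` is not part of the course data); it also shows that a missing date becomes `NaT` rather than raising an error.

```python
import pandas as pd

# Hypothetical toy frame; the second row has no recorded date of death.
toy = pd.DataFrame({'dob': ['2094-03-05 00:00:00', '2090-06-05 00:00:00'],
                    'dod': ['2165-08-12 00:00:00', None]})

print(toy['dob'].dtype)   # a text dtype before conversion

toy['dob'] = pd.to_datetime(toy['dob'])
toy['dod'] = pd.to_datetime(toy['dod'])

print(toy['dob'].dtype)            # a datetime64 dtype after conversion
print(toy['dod'].isna().tolist())  # the missing value is now NaT
```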

What is the earliest date of birth among the patients? Use `min`.

In [0]:
min(patients['dob'])

Timestamp('1844-07-18 00:00:00')

Which patient was born first?

In [0]:
patients[patients['dob'] == min(patients['dob'])]

Unnamed: 0,row_id,subject_id,gender,dob,dod,dod_hosp,dod_ssn,expire_flag
62,30962,40655,F,1844-07-18,2145-03-07,2145-03-07 00:00:00,,1
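
An alternative way to find this row is the Series method `idxmin`, which returns the index label of the minimum. The sketch below assumes a hypothetical three-row frame named `toy`, not the course data.

```python
import pandas as pd

# Hypothetical toy frame with three patients.
toy = pd.DataFrame({'subject_id': [1, 2, 3],
                    'dob': pd.to_datetime(['2094-03-05', '1844-07-18', '2038-09-03'])})

earliest = toy['dob'].min()                # Series method; same value as min(toy['dob'])
first_born = toy.loc[toy['dob'].idxmin()]  # the row that holds that minimum

print(earliest)
print(first_born['subject_id'])
```

One difference to keep in mind: `idxmin` returns only the first matching row, whereas the boolean comparison used above returns every row that shares the earliest date.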


Find all the patients who were born before January 1, 2020.

In [0]:
patients[patients['dob'] < pd.to_datetime('January 1, 2020')]

Unnamed: 0,row_id,subject_id,gender,dob,dod,dod_hosp,dod_ssn,expire_flag
5,9486,10026,F,1895-05-17,2195-11-24,,2195-11-24 00:00:00,1
11,9495,10036,F,1885-03-24,2185-03-26,2185-03-26 00:00:00,2185-03-26 00:00:00,1
33,9550,10094,M,1880-02-29,2180-03-20,2180-03-20 00:00:00,2180-03-20 00:00:00,1
62,30962,40655,F,1844-07-18,2145-03-07,2145-03-07 00:00:00,,1
67,31314,41983,F,1851-09-12,2151-09-15,2151-09-15 00:00:00,,1
73,31379,42231,F,2016-12-05,2105-05-18,2105-05-18 00:00:00,2103-05-18 00:00:00,1
83,31440,42458,M,1846-07-21,2147-09-08,2147-09-08 00:00:00,,1
89,31778,43827,F,1876-07-14,2178-12-07,2178-12-07 00:00:00,2178-12-07 00:00:00,1
96,31853,44154,M,1878-05-14,2178-05-15,2178-05-15 00:00:00,2178-05-15 00:00:00,1


How many men were born before January 1, 2020? How many women? (Use `.value_counts()`).

In [0]:
patients[patients['dob'] < pd.to_datetime('January 1, 2020')]['gender'].value_counts()

F    6
M    3
Name: gender, dtype: int64
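
The same filter-then-count pattern can be written in separate steps, which may be easier to read and debug. This is a sketch on a hypothetical four-row frame (`toy`, `cutoff`, and `born_before` are illustrative names), using `.loc` to select rows and a column in one step.

```python
import pandas as pd

# Hypothetical toy frame, not the course data.
toy = pd.DataFrame({'gender': ['F', 'M', 'F', 'M'],
                    'dob': pd.to_datetime(['1895-05-17', '2094-03-05',
                                           '2016-12-05', '1880-02-29'])})

cutoff = pd.Timestamp('2020-01-01')
born_before = toy['dob'] < cutoff   # boolean Series, one value per row

# Keep only the rows where the mask is True, then count each gender.
counts = toy.loc[born_before, 'gender'].value_counts()
print(counts)
```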

Using the data from `DIAGNOSES_ICD.csv`, how many patients (`subject_id`) received an `icd9_code` diagnosis of `'5070'` (the string)? How is this different from the number of times that diagnosis code appears in the dataset?

*Hint: you will want to use the `set` function to turn a pandas Series into a set.*

In [0]:
len(set(diagnoses[diagnoses['icd9_code'] == '5070']['subject_id']))

12
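
The two numbers in the question can differ because one patient may receive the same code more than once, for example on separate admissions. The sketch below illustrates the distinction on a hypothetical four-row frame (not the course data), using `nunique` as an alternative to `len(set(...))`.

```python
import pandas as pd

# Hypothetical toy frame: patient 1 has the code on two admissions.
toy = pd.DataFrame({'subject_id': [1, 1, 2, 3],
                    'hadm_id': [10, 11, 20, 30],
                    'icd9_code': ['5070', '5070', '5070', '4019']})

has_code = toy['icd9_code'] == '5070'

n_rows = int(has_code.sum())                             # times the code appears
n_patients = toy.loc[has_code, 'subject_id'].nunique()   # distinct patients with it

print(n_rows, n_patients)
```

In this toy data the code appears three times but belongs to only two patients.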

Replace the use of `M` and `F` for identifying gender with the terms `'male'` and `'female'` in the `gender` column of `patients`. Use the data frame's `head` method with the appropriate argument to show the first 10 rows to make sure you did it right.

In [0]:
patients = patients.replace({'gender': {'F': 'female', 'M': 'male'}})
patients.head(10)

Unnamed: 0,row_id,subject_id,gender,dob,dod,dod_hosp,dod_ssn,expire_flag
0,9467,10006,female,2094-03-05,2165-08-12,2165-08-12 00:00:00,2165-08-12 00:00:00,1
1,9472,10011,female,2090-06-05,2126-08-28,2126-08-28 00:00:00,,1
2,9474,10013,female,2038-09-03,2125-10-07,2125-10-07 00:00:00,2125-10-07 00:00:00,1
3,9478,10017,female,2075-09-21,2152-09-12,,2152-09-12 00:00:00,1
4,9479,10019,male,2114-06-20,2163-05-15,2163-05-15 00:00:00,2163-05-15 00:00:00,1
5,9486,10026,female,1895-05-17,2195-11-24,,2195-11-24 00:00:00,1
6,9487,10027,female,2108-01-15,2190-09-14,,2190-09-14 00:00:00,1
7,9489,10029,male,2061-04-10,2140-09-21,,2140-09-21 00:00:00,1
8,9491,10032,male,2050-03-29,2138-05-21,2138-05-21 00:00:00,2138-05-21 00:00:00,1
9,9492,10033,female,2051-04-21,2133-09-09,,2133-09-09 00:00:00,1
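
Recoding can also be done on a single column with `Series.replace` or `Series.map`. The sketch below uses a hypothetical frame that includes an unexpected value (`'U'`, which is an assumption for illustration) to show how the two methods differ.

```python
import pandas as pd

# Hypothetical toy frame; 'U' stands in for a value missing from the mapping.
toy = pd.DataFrame({'gender': ['F', 'M', 'U']})
labels = {'F': 'female', 'M': 'male'}

replaced = toy['gender'].replace(labels)  # unmapped values are left unchanged
mapped = toy['gender'].map(labels)        # unmapped values become NaN

print(replaced.tolist())
print(mapped.isna().tolist())
```

`replace` is the safer choice when the mapping might not cover every value; `map` makes uncovered values easy to spot.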


Finally, how many of the patients with a diagnosis code of `'5070'` are `male`? (Remember, the gender column should now have values of `male` and `female` not `M` and `F`.)

*Hint: You will probably want to use the `isin` method*

In [0]:
# subject_ids of every diagnosis row with code '5070'
ids_5070 = diagnoses[diagnoses['icd9_code'] == '5070']['subject_id']
patients[patients['subject_id'].isin(ids_5070)]['gender'].value_counts()['male']

9
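
The `isin` lookup can be broken into named steps to make each stage visible. This is a minimal sketch on two hypothetical frames (`patients_toy` and `diagnoses_toy` are illustrative, not the course data); it uses `Series.get` with a default so the count is 0 rather than a `KeyError` when no matching patient is male.

```python
import pandas as pd

# Hypothetical toy frames, not the course data.
patients_toy = pd.DataFrame({'subject_id': [1, 2, 3, 4],
                             'gender': ['male', 'female', 'male', 'female']})
diagnoses_toy = pd.DataFrame({'subject_id': [1, 1, 2, 4],
                              'icd9_code': ['5070', '5070', '5070', '4019']})

# Step 1: subject_ids that have the code (duplicates are fine for isin).
ids_with_code = diagnoses_toy.loc[diagnoses_toy['icd9_code'] == '5070', 'subject_id']

# Step 2: flag the patients whose subject_id is in that collection.
in_group = patients_toy['subject_id'].isin(ids_with_code)

# Step 3: count genders among the flagged patients.
counts = patients_toy.loc[in_group, 'gender'].value_counts()
n_male = counts.get('male', 0)

print(n_male)
```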