Run the cell below the references to download `PATIENTS.csv` and `DIAGNOSES_ICD.csv` from the <a href="https://alpha.physionet.org/content/mimiciii-demo/1.4/">MIMIC-III Clinical Database Demo</a>, which is made available under the <a href="https://opendatacommons.org/licenses/odbl/index.html">ODC Open Database License (ODbL)</a>.

**References:**

Johnson, A., Pollard, T., Mark, R. (2019). MIMIC-III Clinical Database Demo. PhysioNet. doi:10.13026/C2HM2Q

Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 160035.

Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals (2003). Circulation. 101(23):e215-e220.

In [0]:
!wget https://physionet.org/files/mimiciii-demo/1.4/PATIENTS.csv
!wget https://physionet.org/files/mimiciii-demo/1.4/DIAGNOSES_ICD.csv

Import the `pandas` library; for convenience, call it `pd`.

Using `pd.read_csv`, load both files (`PATIENTS.csv` and `DIAGNOSES_ICD.csv`).

Use the `.head()` method to get a sense of what each of the two data frames looks like. You will likely want to do this any time you start working with new data.
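As a rough illustration of the loading and previewing steps, the sketch below reads a small made-up CSV from memory instead of the downloaded files. The rows and the name `toy_patients` are hypothetical and are not taken from the MIMIC data.

```python
import io

import pandas as pd

# Toy CSV text standing in for a downloaded file (made-up rows, not MIMIC data).
toy_csv = io.StringIO(
    "subject_id,gender,dob\n"
    "1,F,2090-01-15 00:00:00\n"
    "2,M,2085-07-30 00:00:00\n"
    "3,F,2101-11-02 00:00:00\n"
)

# pd.read_csv accepts a file path or a file-like object.
toy_patients = pd.read_csv(toy_csv)

# .head() returns the first five rows by default.
print(toy_patients.head())
```

For the real files, the same call would take the file name as a string, for example `pd.read_csv('PATIENTS.csv')`.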

You may have noticed that the patients data frame has a column `gender`. Run the `value_counts` method on the patients `gender` column to see how many males and females are in the data set.
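A minimal sketch of `value_counts` on a toy data frame follows. The rows are made up for illustration, so the counts say nothing about the real data set.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the patients table.
toy_patients = pd.DataFrame(
    {
        "subject_id": [1, 2, 3, 4, 5],
        "gender": ["F", "M", "F", "F", "M"],
    }
)

# value_counts tallies how often each distinct value occurs in the column.
gender_counts = toy_patients["gender"].value_counts()
print(gender_counts)
```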

You may have also noticed that there are several timestamp columns, including `dob` (date of birth) and `dod` (date of death). By default, the computer is treating them as text. Replace the contents of these columns with the values converted into timestamps using `pd.to_datetime`.

Can you tell the difference by looking at the head of `patients`?
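The head of a data frame can look much the same before and after the conversion, so checking the column's `dtype` is one way to confirm the change. The sketch below uses made-up dates in a hypothetical `toy_patients` frame.

```python
import pandas as pd

# Hypothetical rows; dates are stored as text, as they are right after read_csv.
toy_patients = pd.DataFrame(
    {
        "subject_id": [1, 2],
        "dob": ["2090-01-15 00:00:00", "2085-07-30 00:00:00"],
    }
)
print(toy_patients["dob"].dtype)  # a text/object dtype before conversion

# Overwrite the column with parsed timestamps.
toy_patients["dob"] = pd.to_datetime(toy_patients["dob"])
print(toy_patients["dob"].dtype)  # a datetime64 dtype after conversion
```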

What's the earliest any of the patients were born? Use `min`.

Which patient was born first?
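One possible approach to both questions is sketched below: `min` gives the earliest value, and `idxmin` gives the index label of the row holding it. The toy rows are made up and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical rows with dob already converted to timestamps.
toy_patients = pd.DataFrame(
    {
        "subject_id": [11, 12, 13],
        "dob": pd.to_datetime(["2090-01-15", "2085-07-30", "2101-11-02"]),
    }
)

# Earliest date of birth in the column.
earliest_dob = toy_patients["dob"].min()

# idxmin returns the index label of the smallest value; .loc fetches that row.
earliest_row = toy_patients.loc[toy_patients["dob"].idxmin()]

print(earliest_dob)
print(earliest_row["subject_id"])
```

Comparing the column to its minimum, as in `toy_patients[toy_patients["dob"] == earliest_dob]`, is an alternative that also returns every row in case of ties.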

Find all the patients who were born before January 1, 2020.

How many men were born before January 1, 2020? How many women? (Use `.value_counts()`).
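Filtering by date can be sketched with a boolean mask, assuming the `dob` column already holds timestamps. The rows below are made up, and `cutoff` and `born_before` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows with dob already converted to timestamps.
toy_patients = pd.DataFrame(
    {
        "subject_id": [1, 2, 3, 4],
        "gender": ["F", "M", "F", "M"],
        "dob": pd.to_datetime(
            ["1990-01-15", "2085-07-30", "2010-11-02", "2001-03-09"]
        ),
    }
)

cutoff = pd.Timestamp("2020-01-01")

# Comparing a datetime column to a Timestamp yields a boolean mask;
# indexing with the mask keeps only the rows where it is True.
born_before = toy_patients[toy_patients["dob"] < cutoff]
print(born_before)

# Count genders within the filtered rows only.
print(born_before["gender"].value_counts())
```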

Using the data from `DIAGNOSES_ICD.csv`, how many patients (`subject_id`) received an `icd9_code` diagnosis of `'5070'` (the string)? How is this different from the number of times that diagnosis code appears in the dataset?

*Hint: you will want to use the `set` function to turn a pandas Series into a set.*
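The difference between rows and distinct patients can be seen in a small made-up diagnoses table, where one patient carries the same code twice. This is a sketch with hypothetical data, not the answer for the real file.

```python
import pandas as pd

# Hypothetical rows; patient 1 has the code '5070' recorded twice.
toy_diagnoses = pd.DataFrame(
    {
        "subject_id": [1, 1, 2, 3, 3],
        "icd9_code": ["5070", "5070", "4019", "5070", "4019"],
    }
)

# Keep only the rows carrying the code of interest.
matches = toy_diagnoses[toy_diagnoses["icd9_code"] == "5070"]

# Number of times the code appears (one per row).
n_rows = len(matches)

# A set drops duplicates, so its size is the number of distinct patients.
patients_with_code = set(matches["subject_id"])
n_patients = len(patients_with_code)

print(n_rows, n_patients)
```

In this toy frame the code appears on three rows but belongs to only two distinct patients.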

Replace the use of `M` and `F` for identifying gender with the terms `'male'` and `'female'` in the `gender` column of `patients`. Use the data frame's `head` method with the appropriate argument to show the first 10 rows to make sure you did it right.

In [22]:
patients = patients.replace({'gender': {'F': 'female', 'M': 'male'}})
patients.head(10)

| | row_id | subject_id | gender | dob | dod | dod_hosp | dod_ssn | expire_flag |
|---|---|---|---|---|---|---|---|---|
| 0 | 9467 | 10006 | female | 2094-03-05 00:00:00 | 2165-08-12 00:00:00 | 2165-08-12 00:00:00 | 2165-08-12 00:00:00 | 1 |
| 1 | 9472 | 10011 | female | 2090-06-05 00:00:00 | 2126-08-28 00:00:00 | 2126-08-28 00:00:00 | | 1 |
| 2 | 9474 | 10013 | female | 2038-09-03 00:00:00 | 2125-10-07 00:00:00 | 2125-10-07 00:00:00 | 2125-10-07 00:00:00 | 1 |
| 3 | 9478 | 10017 | female | 2075-09-21 00:00:00 | 2152-09-12 00:00:00 | | 2152-09-12 00:00:00 | 1 |
| 4 | 9479 | 10019 | male | 2114-06-20 00:00:00 | 2163-05-15 00:00:00 | 2163-05-15 00:00:00 | 2163-05-15 00:00:00 | 1 |
| 5 | 9486 | 10026 | female | 1895-05-17 00:00:00 | 2195-11-24 00:00:00 | | 2195-11-24 00:00:00 | 1 |
| 6 | 9487 | 10027 | female | 2108-01-15 00:00:00 | 2190-09-14 00:00:00 | | 2190-09-14 00:00:00 | 1 |
| 7 | 9489 | 10029 | male | 2061-04-10 00:00:00 | 2140-09-21 00:00:00 | | 2140-09-21 00:00:00 | 1 |
| 8 | 9491 | 10032 | male | 2050-03-29 00:00:00 | 2138-05-21 00:00:00 | 2138-05-21 00:00:00 | 2138-05-21 00:00:00 | 1 |
| 9 | 9492 | 10033 | female | 2051-04-21 00:00:00 | 2133-09-09 00:00:00 | | 2133-09-09 00:00:00 | 1 |


Finally, how many of the patients with a diagnosis code of `'5070'` are `male`? (Remember, the gender column should now have values of `male` and `female` not `M` and `F`.)

*Hint: You will probably want to use the `isin` method*
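One way the `isin` hint could be applied is sketched below on two made-up tables: collect the patient IDs that carry the code, then test each row of the patients table for membership. All rows and variable names here are hypothetical.

```python
import pandas as pd

# Hypothetical patients table with gender already spelled out.
toy_patients = pd.DataFrame(
    {
        "subject_id": [1, 2, 3, 4],
        "gender": ["male", "female", "male", "male"],
    }
)

# Hypothetical diagnoses table; patient 4 never receives '5070'.
toy_diagnoses = pd.DataFrame(
    {
        "subject_id": [1, 1, 2, 4, 3],
        "icd9_code": ["5070", "5070", "5070", "4019", "5070"],
    }
)

# Distinct patient IDs that have the code of interest.
ids_with_code = set(
    toy_diagnoses.loc[toy_diagnoses["icd9_code"] == "5070", "subject_id"]
)

# isin marks each patient row whose subject_id is in that collection.
has_code = toy_patients["subject_id"].isin(ids_with_code)
is_male = toy_patients["gender"] == "male"

# Combine both masks with & and count the True values.
n_male_with_code = int((has_code & is_male).sum())
print(n_male_with_code)
```

Working from the set of IDs rather than the raw diagnosis rows avoids counting a patient twice when the code is recorded more than once.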