# Working with [ICD9](https://www.cdc.gov/nchs/icd/icd9.htm) Data

In [None]:
%matplotlib inline

In [None]:
import pickle
from collections import defaultdict
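from unittest import TestCase

# nose no longer imports on recent Python versions; assertEqual is the standard-library equivalent.
assert_equals = TestCase().assertEqual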

#### Pickling (the `pickle` module) is a Python-specific way to write Python objects to disk and read them back

We will learn the details of this later.
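In the meantime, a minimal round trip can be sketched as follows. It uses `pickle.dumps` and `pickle.loads`, the in-memory counterparts of `pickle.dump` and `pickle.load`, so it does not depend on `icd9.pickle`. The second record's patient id is made up for illustration.

```python
import pickle

# Records in the same (patient id, code, label) shape as the notebook's data.
# The patient id 0 in the second record is a hypothetical placeholder.
records = [(124, "070.70", "UNSPECIFIED VIRAL HEPATI"), (0, "238.75", None)]

# Serialize to bytes, then restore the original objects.
payload = pickle.dumps(records)
restored = pickle.loads(payload)

print(type(payload))
print(restored == records)
```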

In [None]:
with open("icd9.pickle","rb") as f0:
    icd9_data = pickle.load(f0)

## What does our data look like?

In [None]:
type(icd9_data)

In [None]:
icd9_data[:5]

### Each element of `icd9_data` is a tuple with three elements
1. A patient id
1. An ICD9 code
1. The label for that ICD9 code

#### It would be nice if this were encoded in the data as metadata
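One way to attach names to the three positions is a `namedtuple`. This is only a sketch: the type name `Diagnosis` and its field names are illustrative choices, not something stored in the data file.

```python
from collections import namedtuple

# Hypothetical record type; the field names are illustrative assumptions.
Diagnosis = namedtuple("Diagnosis", ["patient_id", "code", "label"])

row = (124, "070.70", "UNSPECIFIED VIRAL HEPATI                \r")
record = Diagnosis(*row)

print(record.code)            # access by name instead of row[1]
print(record.label.strip())   # the label, without trailing whitespace
print(record[0])              # positional access still works
```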

### Let's look at a subset of the data by looking for lines that have `HEP` in the label

In [None]:
for d in icd9_data:
    try:
        if "HEP" in d[2]:
            print(d)
    except Exception as Error:
        pass

### Why did I do the try/except?

In [None]:
for d in icd9_data:
    if "HEP" in d[2]:
        print(d)


In [None]:
print(d[0])
print(d[1])
print(d[2])

### One of our lines has a code `"238.75"` that does not have a label (its label is `None`)
#### We cannot evaluate `"HEP" in None`
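The failure, and one way to guard against it without try/except, can be sketched on two toy rows. The patient id for the unlabeled row is made up for illustration.

```python
# Toy rows in the notebook's (patient, code, label) shape.
rows = [
    (117, "070.44", "CHR HEPAT C W/ HEP COMA"),
    (0, "238.75", None),  # hypothetical patient id; the label is missing
]

# The membership test raises TypeError when the right-hand side is None.
try:
    "HEP" in rows[1][2]
except TypeError as err:
    print("TypeError:", err)

# Checking for None first avoids the exception entirely.
matches = [row for row in rows if row[2] is not None and "HEP" in row[2]]
print(matches)
```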

In [None]:
"HEP" in None

#### Many of the labels have extraneous whitespace

```Python
(124, '070.70', 'UNSPECIFIED VIRAL HEPATI                \r')
(117, '070.44', 'CHR HEPAT C W/ HEP COMA                 \r')
```
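As a quick check on one of these labels, `str.strip()` with no arguments removes leading and trailing whitespace, which includes both the padding spaces and the trailing carriage return:

```python
label = 'UNSPECIFIED VIRAL HEPATI                \r'

# strip() returns a new string; the original is unchanged.
cleaned = label.strip()

print(repr(label))
print(repr(cleaned))
```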

#### Create a dictionary named `icd9_map` that maps each ICD9 code (key) to its ICD9 label (value)

Strip the extra whitespace from each label.

#### We use try/except to deal with the missing labels

In [None]:
icd9_map = {}
for p,c,l in icd9_data:
    try:
        icd9_map[c]=l.strip()
    except AttributeError:  # l is None when the label is missing
        pass

print(icd9_map)

### Tuple unpacking

Our line 

```Python
for p,c,l in icd9_data:
``` 

is an example of **tuple unpacking**.

Our original code was 

```Python
icd9_map = {}
for d in icd9_data:
    try:
        icd9_map[d[1]]=d[2].strip()
    except:
        pass

```

Tuple unpacking lets us split our three-tuple into three distinct variables:

* `p` would correspond to `d[0]`
* `c` would correspond to `d[1]`
* `l` would correspond to `d[2]`
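The correspondence can be checked on a single record outside the loop. The sketch below also shows that the number of names on the left must match the number of elements in the tuple.

```python
record = (117, "070.44", "CHR HEPAT C W/ HEP COMA")

# Three names on the left, three elements on the right.
p, c, l = record
print(p == record[0], c == record[1], l == record[2])

# Too few names raises ValueError.
try:
    a, b = record
except ValueError as err:
    print("ValueError:", err)
```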

In [None]:
p,c,l

### The dictionary looks like we would expect

A better solution might be to add a default label rather than doing a `pass`.

In [None]:
icd9_map = {}
for p,c,l in icd9_data:
    try:
        icd9_map[c]=l.strip()
    except AttributeError:  # l is None when the label is missing
        icd9_map[c]="NO LABEL PROVIDED"

print(icd9_map)
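The same default can also be assigned without exception handling by testing for `None` directly. This is a sketch on two toy rows; the name `label_map` and the patient id of the unlabeled row are made up for illustration.

```python
# Toy rows in the notebook's (patient, code, label) shape.
rows = [
    (117, "070.44", "CHR HEPAT C W/ HEP COMA                 \r"),
    (0, "238.75", None),  # hypothetical patient id; the label is missing
]

label_map = {}
for p, c, l in rows:
    # A conditional expression picks the stripped label or the default.
    label_map[c] = l.strip() if l is not None else "NO LABEL PROVIDED"

print(label_map)
```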

In [None]:
icd9_patients = defaultdict(list)
patient_diagnoses = defaultdict(list)

### How many patients are there for each diagnosis?

Create a list named `icd9_patients_list` sorted by the number of patients per diagnosis.

We are using a defaultdict with the default value being a list. For each `c` (key) that we encounter, we append to the list (value) the patient with that code.
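To see why the `defaultdict` helps, compare it with a plain dictionary on toy data. The codes `"A"` and `"B"` and the patient numbers are made up for illustration.

```python
from collections import defaultdict

# Toy (patient, code) pairs.
pairs = [(1, "A"), (2, "A"), (3, "B")]

# A plain dict raises KeyError when we append to a key that does not exist yet.
plain = {}
try:
    plain["A"].append(1)
except KeyError as err:
    print("KeyError:", err)

# A defaultdict(list) creates an empty list the first time a key is accessed.
by_code = defaultdict(list)
for patient, code in pairs:
    by_code[code].append(patient)

print(dict(by_code))
```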

In [None]:
icd9_patients = defaultdict(list)

for p,c,l in icd9_data:
    icd9_patients[c].append(p)  # appending to a defaultdict(list) cannot fail on a missing key
icd9_patients

#### We use the dictionary `items()` method to convert our dictionary into a list of key/value tuples

In [None]:
icd9_patients_list = list(icd9_patients.items())
icd9_patients_list[:5]

#### Now we want to sort our list

In [None]:
icd9_patients_list.sort()
icd9_patients_list[:5]

#### This just sorted the list by the first element of each tuple (the ICD9 code), compared as a string
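Because the codes are strings, they are compared character by character rather than numerically. A small sketch with made-up codes shows the effect:

```python
# Toy (code, patients) tuples; the codes and patient numbers are made up.
pairs = [("99.1", [1]), ("100.2", [2, 3]), ("V12", [4])]

# With no key, tuples are ordered by their first element (ties fall through to the second).
pairs.sort()

# "100.2" sorts before "99.1" because the character "1" comes before "9".
print([code for code, patients in pairs])
```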

#### If we use `help` to look at `sort` we see two keyword arguments

* `key` allows us to pass a function that will determine how the sort is done
* `reverse` is a boolean that determines whether to sort in ascending or descending order

### A function that returns the length of the second element of a tuple

In [None]:
def length_of_second_element(x):
    return len(x[1])

In [None]:
icd9_patients_list.sort(key = length_of_second_element, reverse=True)
icd9_patients_list[:5]

#### We can also use an **anonymous function** that we define on the spot with a `lambda` expression

In [None]:
icd9_patients_list.sort(key = lambda x: len(x[1]), reverse=True)
icd9_patients_list[:5]
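Note that `list.sort` reorders the list in place. When the original order should be preserved, the built-in `sorted` accepts the same `key` and `reverse` arguments and returns a new list. A sketch on toy data (codes and patient numbers are made up):

```python
# Toy (code, patients) tuples.
pairs = [("A", [1]), ("B", [1, 2, 3]), ("C", [1, 2])]

# sorted() leaves `pairs` untouched and returns a new, ordered list.
ranked = sorted(pairs, key=lambda x: len(x[1]), reverse=True)

print([code for code, patients in ranked])
print([code for code, patients in pairs])
```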

#### Finally, we can make our output more readable by using the `icd9_map` dictionary we defined earlier to get the label that corresponds to each code

In [None]:
icd9_patients = defaultdict(list)

for p,c,l in icd9_data:
    icd9_patients[c].append(p)

icd9_patients_list = list(icd9_patients.items())
icd9_patients_list.sort(key = lambda x: len(x[1]), reverse=True)

for code, patients in icd9_patients_list:
    print(icd9_map[code], len(patients))


## Getting the diagnoses for each patient is similar

In [None]:
patient_diagnoses = defaultdict(list)
for p,c,l in icd9_data:
    patient_diagnoses[p].append(c)

In [None]:
patient_diagnoses_list = list(patient_diagnoses.items())
patient_diagnoses_list.sort(key = lambda x: len(x[1]), reverse=True)

for patient, codes in patient_diagnoses_list:
    print(patient)
    for c in codes:
        print("\t",icd9_map[c])

In [None]:
help(icd9_patients_list.sort)

In [None]:
import random
d,p = random.choice(icd9_patients_list)
print(d,len(p),sep="\n")
print(icd9_patients["V12.59"])
for d,p in icd9_patients.items():
    # "HX" appears in labels, not in codes, so test the label for each code
    if "HX" in icd9_map.get(d, ""):
        print(d,len(p))

In [None]:
assert_equals(len(icd9_patients["V12.59"]),5)
assert_equals(len(icd9_patients["572.2"]),12)

#### Loop through `icd9_patients_list`
1. For each element in `icd9_patients_list`, print the ICD9 label corresponding to the code and the number of patients with that diagnosis.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### How many diagnoses does each patient have?

In [None]:
patient_diagnoses = defaultdict(list)
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
for p, d in patient_diagnoses_list:
    print(p)
    print(d) # replace with len(d)
    print("\n")

In [None]:
assert_equals(len(patient_diagnoses[2512]),49)
assert_equals(len(patient_diagnoses[353]),56)
assert_equals(len(patient_diagnoses_list[0][1]),125)
assert_equals(len(patient_diagnoses_list[45][1]),23)
