_HDS5210 Programming for Health Data Scientists_

# Week 7 - Reading / Writing files

This week, we're talking about reading and writing files on the Jupyter server's disk.  If you want to use Jupyter to process your own files, choose `File -> Open` and then click `Upload`.

For our exercises, we're going to use simple text file formats, but you should take a look at claim data file formats (https://www.ihs.gov/hipaa/835_837/newsletter4/) and HL7 clinical file formats (http://hl7api.sourceforge.net/hapi-testpanel/index.html) to see what kinds of files you may run across.

## 1 - Reading simple text files

For this section, we'll be using a file stored on the server in `/data/aco_year1.csv`.  If you want to see what the file looks like, you can see a layout on the CMS website: https://data.cms.gov/ACO/Medicare-Shared-Savings-Program-Accountable-Care-O/yuq5-65xt


There's a bash command called `head` that prints the first 10 lines of a file, and another called `wc` that counts its lines, words, and bytes.  Let's use them to take a peek at what this file looks like on disk.

In [1]:
%%bash
head /data/aco_year1.csv 
wc /data/aco_year1.csv

"ACO Name (LBN or DBA, if applicable) ",States Where Beneficiaries Reside ,Agreement Start Date,Track,Participate in Advance Payment Model ,Total Assigned Beneficiaries,Total Benchmark Expenditures,Total Expenditures,Total Benchmark Expenditures Minus Total Assigned Beneficiary Expenditures,Total Benchmark Expenditures Minus Total Assigned Beneficiary Expenditures as % of Total Benchmark,"Generated Savings/Losses1,2","Earned Shared Savings Payments/Owe Losses3,4",Successfully Reported Quality5,ACO-1,ACO-2,ACO-3,ACO-4,ACO-5,ACO-6,ACO-7,ACO-8^,ACO-9^,ACO-10^,ACO-11,ACO-12,ACO-13,ACO-14,ACO-15,ACO-16,ACO-17,ACO-18,ACO-19,ACO-20,ACO-21,DM Comp-osite,ACO-22,ACO-23,ACO-24,ACO-25,ACO-26,ACO-27^,ACO-28,ACO-29,ACO-30,ACO-31,CAD Comp-osite,ACO-32,ACO-33
"A.M. Beajow, M.D. Internal Medicine Associates ACO, P.C",Nevada,01/01/2013,Track1 ,No ,5921,$70912015,$67555873,$3356142,4.7%,$3356142,$1644510,Yes,75.6,93.09,92.18,82.91,58.06,76.36,71.33,14.88,0.67,1.14,75,72.5,1.24,25.83,22.4,31.19,64.08,0,39

Now, we'll use Python to print out the first two lines of the file.

In [2]:
with open('/data/aco_year1.csv') as aco:
    print(aco.readline())
    print(aco.readline())

"ACO Name (LBN or DBA, if applicable) ",States Where Beneficiaries Reside ,Agreement Start Date,Track,Participate in Advance Payment Model ,Total Assigned Beneficiaries,Total Benchmark Expenditures,Total Expenditures,Total Benchmark Expenditures Minus Total Assigned Beneficiary Expenditures,Total Benchmark Expenditures Minus Total Assigned Beneficiary Expenditures as % of Total Benchmark,"Generated Savings/Losses1,2","Earned Shared Savings Payments/Owe Losses3,4",Successfully Reported Quality5,ACO-1,ACO-2,ACO-3,ACO-4,ACO-5,ACO-6,ACO-7,ACO-8^,ACO-9^,ACO-10^,ACO-11,ACO-12,ACO-13,ACO-14,ACO-15,ACO-16,ACO-17,ACO-18,ACO-19,ACO-20,ACO-21,DM Comp-osite,ACO-22,ACO-23,ACO-24,ACO-25,ACO-26,ACO-27^,ACO-28,ACO-29,ACO-30,ACO-31,CAD Comp-osite,ACO-32,ACO-33

"A.M. Beajow, M.D. Internal Medicine Associates ACO, P.C",Nevada,01/01/2013,Track1 ,No ,5921,$70912015,$67555873,$3356142,4.7%,$3356142,$1644510,Yes,75.6,93.09,92.18,82.91,58.06,76.36,71.33,14.88,0.67,1.14,75,72.5,1.24,25.83,22.4,31.19,64.08,0,3

We can also use a loop in Python to print out the first ten lines of the file, like `head`.

In [3]:
with open('/data/aco_year1.csv') as aco:
    for i in range(10):
        print(aco.readline())

"ACO Name (LBN or DBA, if applicable) ",States Where Beneficiaries Reside ,Agreement Start Date,Track,Participate in Advance Payment Model ,Total Assigned Beneficiaries,Total Benchmark Expenditures,Total Expenditures,Total Benchmark Expenditures Minus Total Assigned Beneficiary Expenditures,Total Benchmark Expenditures Minus Total Assigned Beneficiary Expenditures as % of Total Benchmark,"Generated Savings/Losses1,2","Earned Shared Savings Payments/Owe Losses3,4",Successfully Reported Quality5,ACO-1,ACO-2,ACO-3,ACO-4,ACO-5,ACO-6,ACO-7,ACO-8^,ACO-9^,ACO-10^,ACO-11,ACO-12,ACO-13,ACO-14,ACO-15,ACO-16,ACO-17,ACO-18,ACO-19,ACO-20,ACO-21,DM Comp-osite,ACO-22,ACO-23,ACO-24,ACO-25,ACO-26,ACO-27^,ACO-28,ACO-29,ACO-30,ACO-31,CAD Comp-osite,ACO-32,ACO-33

"A.M. Beajow, M.D. Internal Medicine Associates ACO, P.C",Nevada,01/01/2013,Track1 ,No ,5921,$70912015,$67555873,$3356142,4.7%,$3356142,$1644510,Yes,75.6,93.09,92.18,82.91,58.06,76.36,71.33,14.88,0.67,1.14,75,72.5,1.24,25.83,22.4,31.19,64.08,0,3

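The blank line between records in the output above appears because `readline()` keeps the trailing newline and `print` then adds one of its own. Below is a minimal sketch of one way to avoid that. It uses a small temporary file as a stand-in for `/data/aco_year1.csv`; the sample contents and the helper name `first_lines` are made up for illustration.

```python
import itertools
import os
import tempfile

def first_lines(path, n=10):
    """Return up to the first n lines of a text file, without trailing newlines."""
    with open(path) as f:
        # islice stops after n lines, or earlier if the file is shorter
        return [line.rstrip('\n') for line in itertools.islice(f, n)]

# Hypothetical sample data, written to a temporary file so the sketch is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('name,state\nACO One,Nevada\nACO Two,Maryland\n')
    sample_path = tmp.name

lines = first_lines(sample_path, n=2)
for line in lines:
    print(line)   # no extra blank line, because the newline was stripped

os.remove(sample_path)
```

Iterating over the file object directly also avoids the string of empty results that repeated `readline()` calls return once the end of a short file is reached.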
## 2 - CSV Files with quoted field values

This isn't mentioned in the book yet, but Python has a special module for reading and writing CSV files that allows the delimiter to appear inside a quoted field.  Documentation on this module is here: https://docs.python.org/3/library/csv.html#

In [4]:
import csv
with open('/data/aco_year1.csv') as aco:
    r = csv.reader(aco)
    for row in r:
        print(row)
        break

['ACO Name (LBN or DBA, if applicable) ', 'States Where Beneficiaries Reside ', 'Agreement Start Date', 'Track', 'Participate in Advance Payment Model ', 'Total Assigned Beneficiaries', 'Total Benchmark Expenditures', 'Total Expenditures', 'Total Benchmark Expenditures Minus Total Assigned Beneficiary Expenditures', 'Total Benchmark Expenditures Minus Total Assigned Beneficiary Expenditures as % of Total Benchmark', 'Generated Savings/Losses1,2', 'Earned Shared Savings Payments/Owe Losses3,4', 'Successfully Reported Quality5', 'ACO-1', 'ACO-2', 'ACO-3', 'ACO-4', 'ACO-5', 'ACO-6', 'ACO-7', 'ACO-8^', 'ACO-9^', 'ACO-10^', 'ACO-11', 'ACO-12', 'ACO-13', 'ACO-14', 'ACO-15', 'ACO-16', 'ACO-17', 'ACO-18', 'ACO-19', 'ACO-20', 'ACO-21', 'DM Comp-osite', 'ACO-22', 'ACO-23', 'ACO-24', 'ACO-25', 'ACO-26', 'ACO-27^', 'ACO-28', 'ACO-29', 'ACO-30', 'ACO-31', 'CAD Comp-osite', 'ACO-32', 'ACO-33']
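To see why the `csv` module matters for this file, the sketch below compares a plain `split(',')` with `csv.reader` on a shortened copy of the header line. The header text is taken from the output above; reading it from an in-memory `io.StringIO` instead of the real file is an assumption made so the example runs anywhere.

```python
import csv
import io

# The first three header fields from the file, shortened for the example
sample = ('"ACO Name (LBN or DBA, if applicable) ",'
          'States Where Beneficiaries Reside ,Agreement Start Date\n')

# Naive approach: split on every comma, including the one inside the quotes
naive = sample.rstrip('\n').split(',')

# csv.reader understands the quotes and keeps the first field in one piece
parsed = next(csv.reader(io.StringIO(sample)))

print(len(naive), naive[0])
print(len(parsed), parsed[0])
```

The naive split produces four pieces because the quoted comma cuts the first field in two, while `csv.reader` returns the three fields that are actually there.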


In [5]:
import csv
with open('/data/aco_year1.csv') as aco:
    r = csv.reader(aco)
    count = 0
    for row in r:
        print(row)
        count += 1
        if count >= 5:
            break

['ACO Name (LBN or DBA, if applicable) ', 'States Where Beneficiaries Reside ', 'Agreement Start Date', 'Track', 'Participate in Advance Payment Model ', 'Total Assigned Beneficiaries', 'Total Benchmark Expenditures', 'Total Expenditures', 'Total Benchmark Expenditures Minus Total Assigned Beneficiary Expenditures', 'Total Benchmark Expenditures Minus Total Assigned Beneficiary Expenditures as % of Total Benchmark', 'Generated Savings/Losses1,2', 'Earned Shared Savings Payments/Owe Losses3,4', 'Successfully Reported Quality5', 'ACO-1', 'ACO-2', 'ACO-3', 'ACO-4', 'ACO-5', 'ACO-6', 'ACO-7', 'ACO-8^', 'ACO-9^', 'ACO-10^', 'ACO-11', 'ACO-12', 'ACO-13', 'ACO-14', 'ACO-15', 'ACO-16', 'ACO-17', 'ACO-18', 'ACO-19', 'ACO-20', 'ACO-21', 'DM Comp-osite', 'ACO-22', 'ACO-23', 'ACO-24', 'ACO-25', 'ACO-26', 'ACO-27^', 'ACO-28', 'ACO-29', 'ACO-30', 'ACO-31', 'CAD Comp-osite', 'ACO-32', 'ACO-33']
['A.M. Beajow, M.D. Internal Medicine Associates ACO, P.C', 'Nevada', '01/01/2013', 'Track1 ', 'No ', '5921

In [6]:
import csv
with open('/data/aco_year1.csv') as aco:
    reader = csv.reader(aco)
    for row in reader:
        print(row)
        break
        
    assigned_pos = row.index('Total Assigned Beneficiaries')
    print('Total assigned is in column {:d}'.format(assigned_pos+1))
    
    total_assigned = 0
    for row in reader:
        total_assigned = total_assigned + int(row[assigned_pos])
        
    print('Total assigned: {:,d}'.format(total_assigned))
        

['ACO Name (LBN or DBA, if applicable) ', 'States Where Beneficiaries Reside ', 'Agreement Start Date', 'Track', 'Participate in Advance Payment Model ', 'Total Assigned Beneficiaries', 'Total Benchmark Expenditures', 'Total Expenditures', 'Total Benchmark Expenditures Minus Total Assigned Beneficiary Expenditures', 'Total Benchmark Expenditures Minus Total Assigned Beneficiary Expenditures as % of Total Benchmark', 'Generated Savings/Losses1,2', 'Earned Shared Savings Payments/Owe Losses3,4', 'Successfully Reported Quality5', 'ACO-1', 'ACO-2', 'ACO-3', 'ACO-4', 'ACO-5', 'ACO-6', 'ACO-7', 'ACO-8^', 'ACO-9^', 'ACO-10^', 'ACO-11', 'ACO-12', 'ACO-13', 'ACO-14', 'ACO-15', 'ACO-16', 'ACO-17', 'ACO-18', 'ACO-19', 'ACO-20', 'ACO-21', 'DM Comp-osite', 'ACO-22', 'ACO-23', 'ACO-24', 'ACO-25', 'ACO-26', 'ACO-27^', 'ACO-28', 'ACO-29', 'ACO-30', 'ACO-31', 'CAD Comp-osite', 'ACO-32', 'ACO-33']
Total assigned is in column 6
Total assigned: 3,675,263
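The cell above finds the column position with `row.index(...)` and then indexes each row by number. As an alternative sketch, `csv.DictReader` reads the header row itself and lets each row be accessed by column name. The in-memory sample below is a stand-in for the real file: the column name matches the ACO file, but the ACO names are made up.

```python
import csv
import io

# Hypothetical stand-in for the ACO file: real column name, made-up ACO names
sample = io.StringIO(
    'ACO Name,Total Assigned Beneficiaries\n'
    '"Example ACO, LLC",5921\n'
    'Another ACO,10485\n'
)

reader = csv.DictReader(sample)   # uses the first row as the field names
total_assigned = sum(int(row['Total Assigned Beneficiaries']) for row in reader)

print('Total assigned: {:,d}'.format(total_assigned))
```

With a plain `csv.reader`, calling `next(reader)` once is another common way to pull off the header row before looping over the data rows.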


In [7]:
import csv

acos = []
with open('/data/aco_year1.csv') as aco:
    reader = csv.reader(aco)
    for row in reader:
#        print(row)
        break
        
    assigned_pos = row.index('Total Assigned Beneficiaries')
#    print('Total assigned is in column {:d}'.format(assigned_pos+1))

    for row in reader:
        acos.append([int(row[assigned_pos]), row[0]])

acos.sort()
print(acos[0:10])
        
        

[[3946, 'Akira Health, Inc.'], [4946, 'BAROMA Health Partners'], [5072, 'The Premier HealthCare Network LLC'], [5113, 'Yuma Connected Community'], [5286, 'MPS ACO Physicians, LLC'], [5338, 'Commonwealth Primary Care ACO'], [5507, 'National ACO, LLC'], [5548, 'Accountable Care Coalition of North Central Florida, LLC.'], [5568, 'Rio Grande Valley Health Alliance, LLC'], [5574, 'Physicians Accountable Care Organization LLC']]
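`acos.sort()` works here because Python compares the inner lists element by element, so the beneficiary count decides the order and the name only breaks ties. The sketch below shows how the sort order could be made explicit with a `key` function. It uses three entries copied from the output above rather than the full file.

```python
# Three [count, name] entries taken from the output above, deliberately out of order
acos = [
    [5072, 'The Premier HealthCare Network LLC'],
    [3946, 'Akira Health, Inc.'],
    [4946, 'BAROMA Health Partners'],
]

# Largest ACOs first: sort on the count (position 0), descending
largest_first = sorted(acos, key=lambda entry: entry[0], reverse=True)

# Alphabetical by ACO name (position 1)
by_name = sorted(acos, key=lambda entry: entry[1])

print(largest_first[0])
print(by_name[0])
```

Unlike `list.sort()`, `sorted()` returns a new list and leaves `acos` unchanged.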


## 3 - Reading multi-line records

Imagine, though, that you had a file whose records aren't as neatly rectangular as a database table or spreadsheet.  Imagine you had something with multiple lines per record (this file is in `/data/med_list.txt`):

```
PATIENT:Boal,Paul
MEDICATION:Ibuprofen,200mg
MEDICATION:Vallium,95mg
END
PATIENT:Westhus,Eric
MEDICATION:Acetominaphen,200mg
MEDICATION:Flintstones Chewable Morphine,100mg
MEDICATION:Zolpidem,10mg
END
```

You might want to read this into a list structure in Python that looks something like this:
```
[['Boal', 'Paul', ['Ibuprofen','Vallium']],
 ['Westhus', 'Eric', ['Acetominaphen','Flintstones Chewable Morphine','Zolpidem']]]
```

In this case, we'll want to read the file one row at a time and decide what to do based on the contents of each line.

In [8]:
med_list = []

with open('/data/med_list.txt') as meds:
    patient = []
    for row in meds:
        item = row.strip().split(':')     # We strip the newline and split on :
        if item[0] == 'PATIENT':
            patient = item[1].split(',')  # This just sets patient to be the last name, first name
            patient.append([])            # This adds an empty list on to the end so we can add meds
        elif item[0] == 'MEDICATION':
            med = item[1].split(',')[0]
            patient[2].append(med)        # Add the medication name to the end of our list
        elif item[0] == 'END':
            med_list.append(patient)      # Add that whole entry we have onto our main list
        else:
            pass
#        print(patient)


print(med_list)

[['Boal', 'Paul', ['Ibuprofen', 'Vallium']], ['Westhus', 'Eric', ['Acetominaphen', 'Flintstones Chewable Morphine', 'Zolpidem']]]
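The nested-list result relies on remembering that position 2 holds the medications, and it drops the dose. Here is a sketch of the same line-by-line approach that builds dictionaries instead and keeps the dose. The dictionary keys (`last`, `first`, `meds`, `name`, `dose`) are names chosen for this example, and the input is an in-memory copy of the first record shown earlier.

```python
import io

# In-memory copy of the first record from the example file
sample = io.StringIO(
    'PATIENT:Boal,Paul\n'
    'MEDICATION:Ibuprofen,200mg\n'
    'MEDICATION:Vallium,95mg\n'
    'END\n'
)

patients = []
current = None   # the patient record being built, if any

for line in sample:
    # partition splits on the first ':' only; 'END' yields an empty value
    tag, _, value = line.strip().partition(':')
    if tag == 'PATIENT':
        last, first = value.split(',')
        current = {'last': last, 'first': first, 'meds': []}
    elif tag == 'MEDICATION' and current is not None:
        name, dose = value.split(',')
        current['meds'].append({'name': name, 'dose': dose})
    elif tag == 'END' and current is not None:
        patients.append(current)
        current = None   # a stray MEDICATION line can't attach to the old patient

print(patients)
```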


## 4 - Writing to Files

In this last piece, we're going to talk through how to write data into an output file in a structured format.  For instance, looking at the previous example, let's try to put together a pipe-delimited output like this:

```
Boal|Paul|Ibuprofen,Vallium
Westhus|Eric|Acetominaphen,Flintstones Chewable Morphine,Zolpidem
```


In [9]:
with open('med_list.txt','w') as output:
    for items in med_list:
        out = "{:s}|{:s}|{:s}\n".format(items[0],items[1],",".join(items[2]))
        output.write(out)

In [10]:
%%bash
cat med_list.txt

Boal|Paul|Ibuprofen,Vallium
Westhus|Eric|Acetominaphen,Flintstones Chewable Morphine,Zolpidem
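Formatting each line by hand works as long as no value contains a pipe. As a sketch of an alternative, the `csv` module from section 2 can write pipe-delimited output and will quote any field that contains the delimiter. The example writes to an in-memory buffer instead of `med_list.txt`, which is an assumption made to keep it self-contained.

```python
import csv
import io

med_list = [['Boal', 'Paul', ['Ibuprofen', 'Vallium']],
            ['Westhus', 'Eric', ['Acetominaphen', 'Flintstones Chewable Morphine', 'Zolpidem']]]

buffer = io.StringIO()   # stands in for open('med_list.txt', 'w', newline='')
writer = csv.writer(buffer, delimiter='|', lineterminator='\n')
for last, first, meds in med_list:
    writer.writerow([last, first, ','.join(meds)])

text = buffer.getvalue()
print(text, end='')

# Reading it back with the same delimiter recovers the three fields per row
rows = list(csv.reader(io.StringIO(text), delimiter='|'))
```

When writing to a real file with `csv.writer`, the Python documentation recommends opening it with `newline=''` so that line endings are not translated twice.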


## 5 - Introducing Pandas

Pandas is a great all-purpose Python package for working with data, especially getting data out of and into files.  The Pandas methods for reading and writing files can be found here:

https://pandas.pydata.org/pandas-docs/stable/io.html

What we get back from Pandas is something called a `DataFrame`, which is somewhat similar to data frames in R.  We aren't going to delve into Pandas right now, but you can find more reference information about how to access parts of a data frame here:

https://pandas.pydata.org/pandas-docs/stable/indexing.html


In [11]:
import pandas as pd

In [12]:
# Named aco_df rather than csv so the csv module imported earlier is not shadowed
aco_df = pd.read_csv('/data/aco_year1.csv')

In [13]:
len(aco_df)

220

In [14]:
aco_df.to_excel('aco_year1.xlsx')

In [15]:
xls = pd.read_excel('aco_year1.xlsx')

In [16]:
xls.head()

Unnamed: 0,"ACO Name (LBN or DBA, if applicable)",States Where Beneficiaries Reside,Agreement Start Date,Track,Participate in Advance Payment Model,Total Assigned Beneficiaries,Total Benchmark Expenditures,Total Expenditures,Total Benchmark Expenditures Minus Total Assigned Beneficiary Expenditures,Total Benchmark Expenditures Minus Total Assigned Beneficiary Expenditures as % of Total Benchmark,...,ACO-25,ACO-26,ACO-27^,ACO-28,ACO-29,ACO-30,ACO-31,CAD Comp-osite,ACO-32,ACO-33
0,"A.M. Beajow, M.D. Internal Medicine Associates...",Nevada,01/01/2013,Track1,No,5921,$70912015,$67555873,$3356142,4.7%,...,53.04,28.21,32.94,61.25,13.08,27.36,,37.5,52.31,29.74
1,AAMC Collaborative Care Network,Maryland,01/01/2013,Track1,No,10485,$92961659,$96240231,$-3278573,-3.5%,...,87.5,65.35,17.22,68.61,74.52,93.33,88.89,67.31,90.31,40.66
2,"Accountable Care Clinical Services, PC","Iowa, Pennsylvania, Connecticut, Massachusetts...",01/01/2013,Track1,No,19637,$211247324,$200721155,$10526169,5%,...,73.13,81.1,18.75,66.45,55.14,82.72,75.0,69.78,77.44,77.59
3,"Accountable Care Coalition of Caldwell County,...",North Carolina,04/01/2012,Track1,No,5915,$70881173,$71400316,$-519143,-0.7%,...,13.51,72.54,35.32,50.61,32.11,71.32,75.0,48.17,54.09,66.31
4,"Accountable Care Coalition of Central Georgia,...",Georgia,01/01/2013,Track1,No,10589,$106323535,$107639420,$-1315885,-1.2%,...,10.79,73.38,57.11,55.88,20.22,53.01,79.55,28.12,34.84,57.56
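The `Unnamed: 0` column in the output above is the data frame's row index, which was written out as an ordinary column and then read back in without a header. Here is a small sketch of the effect and of the `index=False` option, which `to_excel` accepts as well. It round-trips through CSV in memory so it does not need an Excel engine installed, and the two-row data frame is a made-up stand-in for the ACO data.

```python
import io
import pandas as pd

# Tiny stand-in data frame; values borrowed from the first two ACO rows
df = pd.DataFrame({'Track': ['Track1 ', 'Track1 '],
                   'Total Assigned Beneficiaries': [5921, 10485]})

with_index = io.StringIO()
df.to_csv(with_index)                      # default: the row index is written too

without_index = io.StringIO()
df.to_csv(without_index, index=False)      # leave the row index out

cols_with = list(pd.read_csv(io.StringIO(with_index.getvalue())).columns)
cols_without = list(pd.read_csv(io.StringIO(without_index.getvalue())).columns)

print(cols_with)
print(cols_without)
```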


In [17]:
xls['Track'].unique()

array(['Track1 ', 'Track2 '], dtype=object)
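Note the trailing space in `'Track1 '` and `'Track2 '`: a comparison such as `xls['Track'] == 'Track1'` would match nothing. Here is a minimal sketch of cleaning that up with the `.str.strip()` accessor, using a made-up three-row data frame in place of the ACO data.

```python
import pandas as pd

# Made-up stand-in with the same trailing-space problem as the Track column
df = pd.DataFrame({'Track': ['Track1 ', 'Track2 ', 'Track1 ']})

matches_before = int((df['Track'] == 'Track1').sum())   # trailing space prevents a match

df['Track'] = df['Track'].str.strip()                   # remove leading/trailing whitespace
matches_after = int((df['Track'] == 'Track1').sum())

print(matches_before, matches_after)
print(list(df['Track'].unique()))
```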

In [18]:
xls.loc[0]

ACO Name (LBN or DBA, if applicable)                                                                  A.M. Beajow, M.D. Internal Medicine Associates...
States Where Beneficiaries Reside                                                                                                                Nevada
Agreement Start Date                                                                                                                         01/01/2013
Track                                                                                                                                           Track1 
Participate in Advance Payment Model                                                                                                                No 
Total Assigned Beneficiaries                                                                                                                       5921
Total Benchmark Expenditures                                                            

## 6 - Reading HTML w/ Pandas

Pandas has a `read_html()` function that parses HTML, finds the HTML tables in it, and returns them as a list of data frames, one per table.

https://pandas.pydata.org/pandas-docs/stable/io.html

In [19]:
import pandas as pd
tbls = pd.read_html('https://pandas.pydata.org/pandas-docs/stable/io.html')

In [20]:
len(tbls)

12

In [21]:
tbls[0]

Unnamed: 0,Format Type,Data Description,Reader,Writer
0,text,CSV,read_csv,to_csv
1,text,JSON,read_json,to_json
2,text,HTML,read_html,to_html
3,text,Local clipboard,read_clipboard,to_clipboard
4,binary,MS Excel,read_excel,to_excel
5,binary,HDF5 Format,read_hdf,to_hdf
6,binary,Feather Format,read_feather,to_feather
7,binary,Parquet Format,read_parquet,to_parquet
8,binary,Msgpack,read_msgpack,to_msgpack
9,binary,Stata,read_stata,to_stata


In [22]:
tbls[1]

Unnamed: 0,0,1
0,split,"dict like {index -> [index], columns -> [colum..."
1,records,"list like [{column -> value}, ... , {column ->..."
2,index,dict like {index -> {column -> value}}
3,columns,dict like {column -> {index -> value}}
4,values,just the values array
