# Parsing PDFs Homework

With the power of pdfminer, pytesseract, Camelot, and Tika, let's analyze some documents!

> If at any point you think, **"I'm close enough, I'd just edit the rest of it in Excel"**: that's fine! Just make a note of it.

## A trick to use again and again

### Approach 1

Before we get started: when you want to take the first row of your data and set it as the header, use this trick.

In [1]:
import pandas as pd
df = pd.DataFrame([
    [ 'fruit name', 'likes' ],
    [ 'apple', 15 ],
    [ 'carrot', 3 ],
    [ 'sweet potato', 45 ],
    [ 'peach', 12 ],
])
df

Unnamed: 0,0,1
0,fruit name,likes
1,apple,15
2,carrot,3
3,sweet potato,45
4,peach,12


In [2]:
# Set the first row as the column names
df.columns = df.loc[0]

# Drop the first row
df = df.drop(0)

df

Unnamed: 0,fruit name,likes
1,apple,15
2,carrot,3
3,sweet potato,45
4,peach,12


🚀 Done!

### Approach 2

Another option is to use `.rename` on your columns and then filter out the rows you aren't interested in. This can be useful if the header row shows up multiple times in your data, which happens a lot when a table runs across several pages.

In [40]:
# Starting with the same-ish data...
df = pd.DataFrame([
    [ 'fruit name', 'likes' ],
    [ 'apple', 15 ],
    [ 'carrot', 3 ],
    [ 'fruit name', 'likes' ],
    [ 'sweet potato', 45 ],
    [ 'peach', 12 ],
])
df

Unnamed: 0,0,1
0,fruit name,likes
1,apple,15
2,carrot,3
3,fruit name,likes
4,sweet potato,45
5,peach,12


In [41]:
df = df.rename(columns={
    0: 'fruit name',
    1: 'likes'
})
df = df[df['fruit name'] != 'fruit name']
df

Unnamed: 0,fruit name,likes
1,apple,15
2,carrot,3
4,sweet potato,45
5,peach,12


🚀 Done!

### Useful tips about coordinates

If you want to grab only a section of the page, [Kull](https://jsoma.github.io/kull/#/) might be helpful in finding the coordinates.

> **Alternatively** run `%matplotlib notebook` in a cell. Afterwards, every time you use something like `camelot.plot(tables[0]).show()` it will get you nice zoomable, hoverable versions that include `x` and `y` coordinates as you move your mouse.

Coordinates are given as `"left_x,top_y,right_x,bottom_y"` with `(0,0)` being in the bottom left-hand corner.

Note that the coordinates are passed as strings: Camelot's `table_areas` takes a list with one comma-separated string per area. It won't be `[1, 2, 3, 4]`, it will be `['1,2,3,4']`.
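To make the string format concrete, here is a minimal sketch of turning four numbers into the kind of value Camelot's `table_areas` argument takes. The `area_string` helper and the coordinates are made up for illustration, and the commented-out `read_pdf` call assumes a file called `example.pdf`.

```python
def area_string(left_x, top_y, right_x, bottom_y):
    """Format one region as 'left_x,top_y,right_x,bottom_y' (hypothetical helper)."""
    # (0,0) is the bottom-left corner, so the top edge must have the larger y
    if top_y <= bottom_y:
        raise ValueError("top_y should be larger than bottom_y")
    return f"{left_x},{top_y},{right_x},{bottom_y}"

# Made-up coordinates, e.g. read off a Kull selection or a hoverable plot
table_areas = [area_string(50, 750, 560, 100)]
print(table_areas)

# Assumed usage with Camelot's stream flavor:
# tables = camelot.read_pdf('example.pdf', flavor='stream', table_areas=table_areas)
```

The check on `top_y` is only there to catch the most common slip, which is entering the y values as if the origin were in the top-left corner.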

# The homework

This is **mostly Camelot work**, because I don't really have any good image-based PDFs to stretch your wings on tesseract. If you know of any, let me know and I can put together another couple exercises.

## Prison Inmates

Working from [InmateList.pdf](InmateList.pdf), save a CSV file that includes every inmate.

* Make sure your rows are *all data*, and you don't have any people named "Inmate Name."


In [5]:
import camelot

In [8]:
table = camelot.read_pdf('InmateList.pdf',flavor="stream")
table

<TableList n=1>

In [10]:
table[0].df

Unnamed: 0,0,1,2,3,4,5
0,,,,Erie County Sheriff's Office,,
1,,,,Inmate Roster,,
2,ICN #,Inmate Name,,Facility,Booking Date,
3,70693,"ABDALLAH, MICHAEL",,ECHC,04/30/2021,
4,152645,"ABDI, ABDI",,ECCF,06/20/2021,
5,144666,"ABDULLAH, DHAFIR",,ECCF,06/17/2021,
6,156374,"ACEVEDO, CARLOS",,ECHC,06/06/2021,
7,57243,"ACKER, RAYMOND P",,ECCF,11/02/2020,
8,68579,"ADAMS, JERMAIN C",,ECHC,09/19/2019,
9,45262,"ADAMS, MARQUIS",,ECHC,05/27/2021,


In [37]:
df = table[0].df

df = df[3:-1]

df = df.rename(columns={
    0: 'icn',
    1: 'inmate_name',
    2: 'empty',
    3: 'facility',
    4: 'booking',
    5: 'empty'
})

df[['icn','inmate_name','facility','booking']]

Unnamed: 0,icn,inmate_name,facility,booking
3,70693,"ABDALLAH, MICHAEL",ECHC,04/30/2021
4,152645,"ABDI, ABDI",ECCF,06/20/2021
5,144666,"ABDULLAH, DHAFIR",ECCF,06/17/2021
6,156374,"ACEVEDO, CARLOS",ECHC,06/06/2021
7,57243,"ACKER, RAYMOND P",ECCF,11/02/2020
8,68579,"ADAMS, JERMAIN C",ECHC,09/19/2019
9,45262,"ADAMS, MARQUIS",ECHC,05/27/2021
10,75738,"AKRIGHT, JOSEPH A",ECCF,05/29/2021
11,104048,"ALBERTSON, ANDREW",ECCF,12/04/2019
12,1577,"ALEXANDER, BRIAN",ECHC,07/13/2021


In [38]:
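# Save only the named columns, without the empty ones or the index
df[['icn','inmate_name','facility','booking']].to_csv('inmates.csv', index=False)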
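Camelot only reads the first page unless you ask for more, so to get every inmate you would likely pass `pages='all'` and stack the per-page tables. Below is a rough sketch of that stacking step. The `combine_pages` helper is hypothetical, the two small frames stand in for the `table.df` of each page, and it assumes every real record has an all-digit ICN while titles and repeated headers do not.

```python
import pandas as pd

def combine_pages(page_frames):
    """Stack per-page tables and keep only rows that look like inmate records."""
    df = pd.concat(page_frames, ignore_index=True)
    # Columns 2 and 5 are empty in the stream output, so skip them
    df = df[[0, 1, 3, 4]]
    df.columns = ['icn', 'inmate_name', 'facility', 'booking']
    # Titles and repeated 'ICN #' header rows fail the all-digits check
    return df[df['icn'].str.isdigit()].reset_index(drop=True)

# Stand-ins for [table.df for table in camelot.read_pdf(..., pages='all')]
page1 = pd.DataFrame([
    ['', '', '', "Erie County Sheriff's Office", '', ''],
    ['ICN #', 'Inmate Name', '', 'Facility', 'Booking Date', ''],
    ['70693', 'ABDALLAH, MICHAEL', '', 'ECHC', '04/30/2021', ''],
])
page2 = pd.DataFrame([
    ['ICN #', 'Inmate Name', '', 'Facility', 'Booking Date', ''],
    ['152645', 'ABDI, ABDI', '', 'ECCF', '06/20/2021', ''],
])

inmates = combine_pages([page1, page2])
print(inmates)
```

Filtering on the ICN instead of slicing by position means it doesn't matter how many header or footer rows each page has.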

## WHO resolutions

Using [A74_R13-en.pdf](A74_R13-en.pdf), what ten member countries are given the highest assessments?

* You might need to have two separate queries, and combine the results: that last page is pretty awful!
* Always rename your columns
* Double-check that your sorting looks right......
* You can still get the answer even without perfectly clean data
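Why double-check the sorting? Camelot hands everything back as text, and text sorts character by character. Here is a small sketch using four of the assessment values as strings (the exact string forms are assumed):

```python
# Assessment values as they come out of the PDF: strings, not numbers
values = ['0.7280', '12.0058', '8.5645', '22.0000']

# Sorted as text: '8...' beats '22...' because '8' > '2'
as_text = sorted(values, reverse=True)

# Sorted as numbers: convert first, then compare
as_numbers = sorted(values, key=float, reverse=True)

print(as_text)
print(as_numbers)
```

A pandas column of strings behaves the same way with `sort_values`, so convert it to numbers before ranking.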

In [152]:
tbl = camelot.read_pdf('A74_R13-en.pdf',flavor="stream",pages='0-5')

In [153]:
dfs = [table.df for table in tbl]
df = pd.concat(dfs,ignore_index=True)

In [90]:
# df.head(40)

In [154]:
df = df[[0,1]]

df = df.rename(columns={
    0: 'country',
    1: 'value'
})
df

Unnamed: 0,country,value
0,WHA74.13,
1,,Members and
2,,Associate Members
3,,
4,,Zambia
...,...,...
220,Uzbekistan,0.0320
221,Vanuatu,0.0010
222,Venezuela (Bolivarian Republic of),0.7280
223,Viet Nam,0.0770


In [155]:

countries = df[13:]


In [156]:
countries.tail(20)

Unnamed: 0,country,value
205,Nations),0.001
206,Tonga,0.001
207,Trinidad and Tobago,0.04
208,Tunisia,0.025
209,Turkey,1.3711
210,Turkmenistan,0.033
211,Tuvalu,0.001
212,Uganda,0.008
213,Ukraine,0.057
214,United Arab Emirates,0.616


In [157]:
tbl = camelot.read_pdf('A74_R13-en.pdf',flavor="stream",pages='6')
last_page = tbl[0].df
last_page

Unnamed: 0,0,1,2,3,4
0,WHA74.13,,,,
1,,Members and,,WHO scale,
2,,Associate Members,,for 2022–2023,
3,,,,%,
4,,Zambia,,0.0090,
5,,Zimbabwe,,0.0050,
6,,TOTAL,,100.000,
7,,,,,"Seventh plenary meeting, 31 May 2021"
8,,,,,A74/VR/7
9,6,,= = =,,


In [191]:
last_rows = last_page[4:6][[1,3]]

last_rows = last_rows.rename(columns={
    1: 'country',
    3: 'value'
})


df = pd.concat([countries,last_rows])

df


Unnamed: 0,country,value
13,Afghanistan,0.0070
14,Albania,0.0080
15,Algeria,0.1380
16,Andorra,0.0050
17,Angola,0.0100
...,...,...
222,Venezuela (Bolivarian Republic of),0.7280
223,Viet Nam,0.0770
224,Yemen,0.0100
4,Zambia,0.0090


In [192]:
df.country.unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia (Plurinational State of)', 'Members and',
       'Associate Members', '', 'Bosnia and Herzegovina', 'Botswana',
       'Brazil', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso',
       'Burundi', 'Cabo Verde', 'Cambodia', 'Cameroon', 'Canada',
       'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia',
       'Comoros', 'Congo', 'Cook Islands (not a member of the',
       'United Nations)', 'Costa Rica', 'Côte d’Ivoire', 'Croatia',
       'Cuba', 'Cyprus', 'Czech Republic',
       'Democratic People’s Republic of', 'Korea',
       'Democratic Republic of the Congo', 'Denmark', 'Djibouti',
       'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', '

In [193]:
df.value.unique()

array(['0.0070', '0.0080', '0.1380', '0.0050', '0.0100', '0.0020',
       '0.9151', '2.2101', '0.6770', '0.0490', '0.0180', '0.0500',
       '0.8211', '0.0010', '0.0030', '0.0160', 'WHO scale',
       'for 2022–2023', '%', '0.0120', '0.0140', '2.9482', '0.0250',
       '0.0460', '0.0060', '0.0130', '2.7342', '0.0040', '0.4070',
       '12.0058', '0.2880', '', '0.0620', '0.0770', '0.0800', '0.0360',
       '0.3110', '0.5540', '0.0530', '0.1860', '0.0390', '0.4210',
       '4.4273', '0.0150', '6.0904', '0.3660', '0.0090', '0.2060',
       '0.0280', '0.8341', '0.5430', '0.3980', '0.1290', '0.3710',
       '0.4900', '3.3072', '8.5645', '0.0210', '0.1780', '0.0240',
       '0.2520', '0.0470', '0.0300', '0.0710', '0.0670', '0.3410',
       '0.0170', '0.0110', '1.2921', '0.0550', '1.3561', '0.2910',
       '0.2500', '0.7540', '0.1150', '0.0450', '0.1520', '0.2050',
       '0.8021', '0.3500', '0.2820', '2.2671', '0.1980', '2.4052',
       '1.1721', '0.4850', '0.1530', '0.0760', '0.2720', '2.14

In [194]:
df[df['value'].str.contains("[a-z]")]

Unnamed: 0,country,value
34,Members and,WHO scale
35,Associate Members,for 2022–2023
82,Members and,WHO scale
83,Associate Members,for 2022–2023
129,Members and,WHO scale
130,Associate Members,for 2022–2023
177,Members and,WHO scale
178,Associate Members,for 2022–2023


In [195]:
filter_df = df[~df['value'].str.contains("[a-z]|%")]



In [196]:
filter_df.head(25)

Unnamed: 0,country,value
13,Afghanistan,0.007
14,Albania,0.008
15,Algeria,0.138
16,Andorra,0.005
17,Angola,0.01
18,Antigua and Barbuda,0.002
19,Argentina,0.9151
20,Armenia,0.007
21,Australia,2.2101
22,Austria,0.677


In [199]:
# Work on a copy so the assignment doesn't trigger SettingWithCopyWarning
filter_df = filter_df.copy()
filter_df['value'] = pd.to_numeric(filter_df["value"], downcast="float")


In [204]:
filter_df.sort_values('value',ascending=False).head(10)

Unnamed: 0,country,value
218,United States of America,22.0
51,China,12.0058
107,Japan,8.5645
87,Germany,6.0904
216,Northern Ireland,4.5673
80,France,4.4273
105,Italy,3.3072
39,Brazil,2.9482
47,Canada,2.7342
170,Russian Federation,2.4052
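The `Northern Ireland` row is the tail end of "United Kingdom of Great Britain and Northern Ireland": long names wrap onto two lines in the PDF, and only the second line carries the number. The ranking is still right, but if you want the full names, something like the sketch below could stitch them back together. It assumes the first fragment sits on its own row with an empty `value`, and that names never wrap across more than two lines. The sample rows are illustrative, not copied from the real table.

```python
import pandas as pd

rows = pd.DataFrame({
    'country': ['Uganda',
                'United Kingdom of Great Britain and',
                'Northern Ireland',
                'United States of America'],
    'value': ['0.0080', '', '4.5673', '22.0000'],
})

# Rows with no value are assumed to be the first half of a wrapped name
is_fragment = rows['value'].eq('')

# Move each fragment down one row, next to the row that holds the number
carried = rows['country'].where(is_fragment).shift()
rows['country'] = (carried.fillna('') + ' ' + rows['country']).str.strip()

# The fragment rows have done their job, so drop them
stitched = rows[~is_fragment].reset_index(drop=True)
print(stitched)
```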


## The Avengers

Using [THE_AVENGERS.pdf](THE_AVENGERS.pdf), approximately how many lines does Captain America have as compared to Thor and Iron Man?

* Character names only: we're only counting `IRON MAN` as Iron Man, not `TONY`.
* Your new best friend might be `\n`
* Look up `.count` for strings
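As a sketch of how the `\n` and `.count` hints fit together: wrapping a name in newlines only matches lines that contain the name and nothing else. The tiny script below is a made-up placeholder, not text from the real PDF.

```python
# Placeholder script: character names sit alone on their own line
script = (
    "INT. LAB - DAY\n\n"
    "IRON MAN\nPlaceholder dialogue.\n\n"
    "THOR\nPlaceholder dialogue.\n\n"
    "TONY\nPlaceholder dialogue.\n\n"
    "IRON MAN\nPlaceholder dialogue.\n\n"
    "CAPTAIN AMERICA\nPlaceholder dialogue.\n"
)

names = ['CAPTAIN AMERICA', 'THOR', 'IRON MAN']

# '\nNAME\n' matches the name only when it fills the whole line
counts = {name: script.count(f"\n{name}\n") for name in names}
print(counts)
```

One caveat: `.count` skips overlapping matches, so two name lines directly after each other would share a newline and the second one would be missed. That should be rare in a screenplay, since dialogue follows each name.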

In [216]:
from pdfminer.high_level import extract_text


In [217]:
script = extract_text('THE_AVENGERS.pdf')



In [221]:
script_df = pd.DataFrame(script.split('\n'))


script_df = script_df.rename(columns={
    0: 'line'
})

In [230]:
script_df.value_counts().head(20)

line                             
                                     3855
TONY                                  145
FURY                                  116
(CONTINUED)                            98
NATASHA                                97
STEVE                                  89
BANNER                                 85
LOKI                                   71

CONTINUED: (2)                        61
THOR                                   47
AGENT COULSON                          43
BARTON                                 43
CAPTAIN AMERICA                        35
MARIA HILL                             28
PEPPER                                 25
INSIDE IRON MAN HELMET:                22

CONTINUED: (3)                        20
IRON MAN                               19
SELVIG                                 17
INT. BRIDGE, CARRIER - CONTINUOUS      15
dtype: int64

In [None]:
# Character-name counts from the table above: CAPTAIN AMERICA 35, THOR 47, IRON MAN 19

## COVID data

Using [covidweekly2721.pdf](covidweekly2721.pdf), what's the total number of tests performed in Minnesota? Use the Laboratory Test Rates by County of Residence chart.

* You COULD pull both tables separately OR you could pull them both at once and split them in pandas.
* Remember you can do things like `df[['name','age']]` to ask for multiple columns

In [236]:
tbl = camelot.read_pdf('covidweekly2721.pdf',flavor="stream",pages='6')

In [273]:
counties = tbl[0].df

list_columns = ['counties','test','rates']

# The page holds two tables side by side: columns 1-3 and columns 4-6
tbl1 = counties[[1,2,3]].reset_index(drop=True)
tbl2 = counties[[4,5,6]].reset_index(drop=True)

# Give both halves the same column names so they stack cleanly
tbl1.columns = list_columns
tbl2.columns = list_columns

df = pd.concat([tbl1,tbl2])

In [276]:
df[5:-2].tail(4)

Unnamed: 0,counties,test,rates
48,Winona,106625,20970.0
49,Wright,194085,14621.0
50,Yellow Medicine,19972,20239.0
51,Unknown/missing,423136,


In [281]:
df[5:-2].head(5)

Unnamed: 0,counties,test,rates
5,Aitkin,19204.0,12128.0
6,,,
7,,,
8,Anoka,545958.0,15714.0
9,Becker,59238.0,17540.0


In [327]:
cleandf = df[5:-2]

In [328]:
cleandf.counties.unique()

array(['Aitkin', '', 'Anoka', 'Becker', 'Beltrami', 'Benton', 'Big Stone',
       'Blue Earth', 'Brown', 'Carlton', 'Carver', 'Cass', 'Chippewa',
       'Chisago', 'Clay', 'Clearwater', 'Cook', 'Cottonwood', 'Crow Wing',
       'Dakota', 'Dodge', 'Douglas', 'Faribault', 'Fillmore', 'Freeborn',
       'Goodhue', 'Grant', 'Hennepin', 'Houston', 'Hubbard', 'Isanti',
       'Itasca', 'Jackson', 'Kanabec', 'Kandiyohi', 'Kittson',
       'Koochiching', 'Lac qui Parle', 'Lake', 'Lake of the Woods',
       'Le Sueur', 'Lincoln', 'Lyon', 'Mahnomen', 'Marshall',
       'Minnesota Department of Health Weekly COVID-19 Report: Updated 7/8/2021 with data current as of 4 a.m. the previous day.',
       'County', 'Martin', 'McLeod', 'Meeker', 'Mille Lacs', 'Morrison',
       'Mower', 'Murray', 'Nicollet', 'Nobles', 'Norman', 'Olmsted',
       'Otter Tail', 'Pennington', 'Pine', 'Pipestone', 'Polk', 'Pope',
       'Ramsey', 'Red Lake', 'Redwood', 'Renville', 'Rice', 'Rock',
       'Roseau', 'Scott', 'S

In [329]:
cleandf = cleandf[~cleandf['counties'].str.contains("7/8/2021")]

cleandf = cleandf[cleandf["counties"] != ""]

cleandf = cleandf[cleandf["test"] != "Number of Tests"]

cleandf.test.unique()

array(['19,204', '545,958', '59,238', '60,345', '77,865', '13,220',
       '136,895', '55,709', '79,882', '154,305', '31,760', '27,089',
       '90,811', '108,867', '9,427', '8,966', '21,088', '85,522',
       '715,290', '37,729', '68,086', '31,294', '40,081', '63,008',
       '87,119', '10,310', '2,292,159', '21,417', '20,764', '56,067',
       '80,265', '14,075', '16,504', '85,305', '9,969', '21,647',
       '13,601', '19,664', '5,957', '43,972', '10,696', '46,116', '6,643',
       '10,279', '39,876', '61,300', '37,736', '44,688', '59,179',
       '79,446', '15,175', '66,163', '30,979', '11,471', '277,493',
       '110,163', '15,424', '35,859', '19,655', '42,729', '20,459',
       '987,792', '4,078', '27,476', '30,498', '177,409', '19,147',
       '23,169', '230,375', '184,952', '24,133', '383,063', '285,098',
       '64,840', '20,325', '18,288', '33,013', '7,041', '46,421',
       '33,447', '34,795', '460,537', '20,172', '7,973', '106,625',
       '194,085', '19,972', '423,136'], dt

In [330]:
cleandf.test = cleandf.test.str.replace(',', '').astype(int)

cleandf.test.sum()

10249823

## Theme Parks

Using [2019-Theme-Index-web-1.pdf](2019-Theme-Index-web-1.pdf), save a CSV of the top 10 theme park groups worldwide.

* You can clean the results or you can restrict the area the table is pulled from, up to you

In [435]:
tbl = camelot.read_pdf('2019-Theme-Index-web-1.pdf',pages='11',flavor='stream')

df = tbl[0].df
df

Unnamed: 0,0,1,2,3,4
0,RANK\n1,GROUP NAME\nWALT DISNEY ATTRACTIONS,% CHANGE\n-0.8%,"ATTENDANCE\n2019\n 155,991,000","ATTENDANCE \n2018\n 157,311,000"
1,2,MERLIN ENTERTAINMENTS GROUP,0.9%,67000000,"66,400,000*"
2,3,OCT PARKS CHINA,9.4%,53970000,49350000
3,4,UNIVERSAL PARKS AND RESORTS,2.3%,51243000,50068000
4,5,FANTAWILD GROUP,19.8%,50393000,42074000
5,6,CHIMELONG GROUP,8.9%,37018000,34007000
6,7,SIX FLAGS INC.,2.5%,32811000,32024000
7,8,CEDAR FAIR ENTERTAINMENT COMPANY,7.8%,27938000,25912000
8,9,SEAWORLD PARKS & ENTERTAINMENT,0.2%,22624000,22582000
9,10,PARQUES REUNIDOS,6.2%,22195000,20900000


In [442]:
df[0:1]

Unnamed: 0,0,1,2,3,4
0,RANK\n1,GROUP NAME\nWALT DISNEY ATTRACTIONS,% CHANGE\n-0.8%,"ATTENDANCE\n2019\n 155,991,000","ATTENDANCE \n2018\n 157,311,000"


In [422]:
df[0:1].values[0][3].split('\n')

['ATTENDANCE', '2019', ' 155,991,000']

In [488]:
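row = {}

# Each cell in the first row holds a header and a value glued together with '\n'
for item in df[0:1].values[0]:
    myvalues = item.split('\n')
    # Headers like 'ATTENDANCE\n2019' span two lines, so merge them back into one
    if len(myvalues) > 2:
        myvalues[0] = myvalues[0] + myvalues[1]
        myvalues[1] = myvalues[2]
    row[myvalues[0]] = myvalues[1]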

first_line = pd.DataFrame(row, index=[0])

first_line

Unnamed: 0,RANK,GROUP NAME,% CHANGE,ATTENDANCE2019,ATTENDANCE 2018
0,1,WALT DISNEY ATTRACTIONS,-0.8%,155991000,157311000


In [492]:

list_columns = ['rank','group','change','att19','att18']

first_line.columns=list_columns
df.columns=list_columns


# df[1:] keeps every rank after the first row (df[2:] would silently drop rank 2)
total = pd.concat([first_line,df[1:]])

finaldf = total[:-2]

finaldf

Unnamed: 0,rank,group,change,att19,att18
0,1,WALT DISNEY ATTRACTIONS,-0.8%,155991000,157311000
1,2,MERLIN ENTERTAINMENTS GROUP,0.9%,67000000,"66,400,000*"
2,3,OCT PARKS CHINA,9.4%,53970000,49350000
3,4,UNIVERSAL PARKS AND RESORTS,2.3%,51243000,50068000
4,5,FANTAWILD GROUP,19.8%,50393000,42074000
5,6,CHIMELONG GROUP,8.9%,37018000,34007000
6,7,SIX FLAGS INC.,2.5%,32811000,32024000
7,8,CEDAR FAIR ENTERTAINMENT COMPANY,7.8%,27938000,25912000
8,9,SEAWORLD PARKS & ENTERTAINMENT,0.2%,22624000,22582000
9,10,PARQUES REUNIDOS,6.2%,22195000,20900000


In [493]:
finaldf.to_csv('parks.csv', index=False)
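If you go the "clean the results" route, the attendance cells still carry stray spaces, commas and a footnote asterisk. A minimal sketch of turning them into integers, assuming these columns only ever hold whole, non-negative numbers (`to_int` is a hypothetical helper):

```python
import re

def to_int(text):
    """Drop everything except digits, then convert (hypothetical helper)."""
    return int(re.sub(r'\D', '', text))

# Cell values in the shapes they appear in the attendance columns
raw = [' 155,991,000', '67000000', '66,400,000*']
cleaned = [to_int(value) for value in raw]
print(cleaned)
```

Don't use this on the `% CHANGE` column: stripping non-digits would turn `-0.8%` into `8`.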

## Hunting licenses

Using [US_Fish_and_Wildlife_Service_2021.pdf](US_Fish_and_Wildlife_Service_2021.pdf) and [a CSV of state populations](http://goodcsv.com/geography/us-states-territories/), find the states with the highest per-capita hunting license holders.

In [538]:
tbl = camelot.read_pdf('US_Fish_and_Wildlife_Service_2021.pdf')

df = tbl[0].df

In [539]:
df.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7
0,State,Paid Hunting License \nHolders*,"Resident Hunting Licenses,\nTags, Permits and ...","Non-Resident Hunting \nLicenses,\nTags, Permit...","Total Hunting License, \nTags,Permits & Stamps**","Cost - Resident Hunting \nLicenses,\nTags, Per...","Cost - Non-Resident Hunting \nLicenses,\nTags,...",Gross Cost - Hunting \nLicenses
1,AK,93559,423501,59235,482736,"$4,859,356","$9,046,715","$13,906,071"


In [540]:

list_columns = ['state','paid','resident','nonresid','total','cost_resid','cost_nonresid','gross_cost']

df.columns=list_columns



In [541]:
df = df[1:-1]

df.head(5)

Unnamed: 0,state,paid,resident,nonresid,total,cost_resid,cost_nonresid,gross_cost
1,AK,93559,423501,59235,482736,"$4,859,356","$9,046,715","$13,906,071"
2,AL,452400,601683,45397,647080,"$9,700,295","$6,715,734","$16,416,029"
3,AR,343300,349098,150728,499826,"$7,851,601","$11,271,653","$19,123,254"
4,AS,0,0,0,0,$0,$0,$0
5,AZ,302383,464607,88708,553315,"$13,931,397","$5,968,169","$19,899,566"


In [542]:
population = pd.read_csv('http://goodcsv.com/wp-content/uploads/2020/08/us-states-territories.csv',encoding='latin1')

In [543]:
population.head(3)

Unnamed: 0,Type,Name,Abbreviation,Capital,Population (2015),Population (2019),area (square miles)
0,State,Alabama,AL,Montgomery,,4903185,52420
1,State,Alaska,AK,Juneau,,731545,665384
2,State,Arizona,AZ,Phoenix,,7278717,113990


In [570]:
population['Abbreviation'] = population['Abbreviation'].str.strip()

# df is a slice of the original table, so copy it before assigning to avoid SettingWithCopyWarning
df = df.copy()
df['state'] = df['state'].str.strip()


In [571]:
final = pd.merge(population,df,left_on='Abbreviation',right_on='state')
final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56 entries, 0 to 55
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Type                 56 non-null     object
 1   Name                 56 non-null     object
 2   Abbreviation         56 non-null     object
 3   Capital              55 non-null     object
 4   Population (2015)    4 non-null      object
 5   Population (2019)    52 non-null     object
 6   area (square miles)  56 non-null     object
 7   state                56 non-null     object
 8   paid                 56 non-null     object
 9   resident             56 non-null     object
 10  nonresid             56 non-null     object
 11  total                56 non-null     object
 12  cost_resid           56 non-null     object
 13  cost_nonresid        56 non-null     object
 14  gross_cost           56 non-null     object
dtypes: object(15)
memory usage: 7.0+ KB


In [572]:
final.total = final.total.str.replace(',', '').astype(int)
final['Population (2019)'] = pd.to_numeric(final['Population (2019)'].str.replace(',',''), errors='coerce')

In [575]:
final['per_capita'] = final['total'] / final['Population (2019)']

final.sort_values(by='per_capita',ascending=False).head(5)

Unnamed: 0,Type,Name,Abbreviation,Capital,Population (2015),Population (2019),area (square miles),state,paid,resident,nonresid,total,cost_resid,cost_nonresid,gross_cost,per_capita
25,State,Montana,MT,Helena,,1068778.0,147040,MT,222309,853341,186315,1039656,"$10,966,890","$26,951,488","$37,918,378",0.972752
11,State,Idaho,ID,Boise,,1787065.0,83569,ID,275244,1412039,248610,1660649,"$11,465,795","$18,704,191","$30,169,986",0.929261
48,State,Wisconsin,WI,Madison,,5822434.0,65496,WI,666670,3965367,236639,4202006,"$28,526,992","$7,884,672","$36,411,664",0.721692
1,State,Alaska,AK,Juneau,,731545.0,665384,AK,93559,423501,59235,482736,"$4,859,356","$9,046,715","$13,906,071",0.659886
33,State,North Dakota,ND,Bismarck,,762062.0,70698,ND,135724,375250,126916,502166,"$4,680,314","$6,094,905","$10,775,219",0.658957
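One thing worth checking: the ranking above divides `total` (licenses, tags, permits and stamps) by population, and a single hunter can buy several of those. The question asks about license *holders*, which is closer to the `paid` column. Here is a sketch of that variant, using only the five states shown above, so it says nothing about how the full ranking would come out. The `holders_per_capita` column name is made up.

```python
import pandas as pd

# 'paid' and 2019 population for the five states in the table above
sample = pd.DataFrame({
    'state': ['MT', 'ID', 'WI', 'AK', 'ND'],
    'paid': ['222309', '275244', '666670', '93559', '135724'],
    'population': [1068778, 1787065, 5822434, 731545, 762062],
})

# Strip any thousands separators, then convert the text to numbers
sample['paid'] = pd.to_numeric(sample['paid'].str.replace(',', '', regex=False))

# Paid license holders per resident
sample['holders_per_capita'] = sample['paid'] / sample['population']

ranked = sample.sort_values('holders_per_capita', ascending=False)
print(ranked)
```

Even within these five states the order changes compared with the `total`-based ranking, so it is worth deciding which column actually answers the question before reporting a result.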
