# The CSV File Format and Two Dimensional Lists
#### Introduction to Programming with Python

## A useful string method: `split()`

`split()` is a string method that breaks a string into a list of parts based on some *delimiter* - a character (or sequence of characters) that separates useful values. For example, you can express a time duration in a format like 4:52:20, which would mean 4 hours, 52 minutes, and 20 seconds. In this format, ":" is used as a delimiter.

In [1]:
marathon_time = "4:52:20"
marathon_time_split = marathon_time.split(":")
print(marathon_time_split)

['4', '52', '20']
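
The pieces that `split()` returns are still strings, so they need to be converted before any arithmetic. As a small sketch of that idea, the following turns the duration into a total number of seconds. The variable names `hours`, `minutes`, `seconds`, and `total_seconds` are only illustrative.

```python
marathon_time = "4:52:20"
marathon_time_split = marathon_time.split(":")

# each piece is a string, so convert it to an int before doing arithmetic
hours = int(marathon_time_split[0])
minutes = int(marathon_time_split[1])
seconds = int(marathon_time_split[2])

# 3600 seconds in an hour, 60 seconds in a minute
total_seconds = hours * 3600 + minutes * 60 + seconds
print(marathon_time, "is", total_seconds, "seconds")
```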


Commas are another common delimiter. Let's say we read the line `"0.0, 0.4, 1.3, 1.1, 2.5, 0.0, 0.6"` from a file. In this case, the values are delimited by two characters - a comma and a space. We can split on that two-character string too. That gives us a list of all the rainfall amounts, but each one is still a string. To do something useful with a value, like compare it to another number or add it on to an accumulator, we first convert it to a float.

In [2]:
rainfall_data = "0.0, 0.4, 1.3, 1.1, 2.5, 0.0, 0.6"
rainfall_list = rainfall_data.split(", ")
print(rainfall_list)
print(float(rainfall_list[4]))

['0.0', '0.4', '1.3', '1.1', '2.5', '0.0', '0.6']
2.5
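
To illustrate the accumulator idea mentioned above, here is one possible sketch that adds up every rainfall amount with a counter-controlled loop. The names `total_rainfall` and `day_counter` are made up for this example, and `round()` is used only to tidy the displayed total.

```python
rainfall_data = "0.0, 0.4, 1.3, 1.1, 2.5, 0.0, 0.6"
rainfall_list = rainfall_data.split(", ")

total_rainfall = 0.0  # accumulator
day_counter = 0

while day_counter < len(rainfall_list):
    # convert each string to a float before adding it on
    total_rainfall += float(rainfall_list[day_counter])
    day_counter += 1

# round() hides tiny floating-point errors in the displayed total
print("Total rainfall:", round(total_rainfall, 1))
```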


## CSV files - comma separated values

You may have noticed that data is very often presented in *tables* - like a table of statistics in a book or data from a spreadsheet.

**CSV** (comma separated values) is a common but simple file format for data stored in tables. CSV files typically end in the `.csv` file extension and can often be opened with either spreadsheet software or text editors.

Download the `nationalparks.csv` file here: https://raw.githubusercontent.com/ericmanley/IntroToProgrammingWithPython/refs/heads/main/nationalparks.csv

This data is information about national parks and is originally from https://en.wikipedia.org/wiki/List_of_areas_in_the_United_States_National_Park_System

Try opening it with a text editor as well as a spreadsheet application to see how each presents the data. Next, we'll look at different options for how you can get this table data into your program.


<center>
<div>
<img src="images/opencsv.png" width="800"/>
</div>
</center>


## Option 1: Read it into a list and then process each row as needed

We could just open this file and read in all of the lines using `readlines()` like we've already seen. 

In [4]:
with open("nationalparks.csv") as parkfile:
    parks = parkfile.readlines()

print("Here's what the whole list looks like:",parks)

print("Here's the line at index 3:",parks[3])

Here's what the whole list looks like: ['Name,Location,Year established,Area in acres\n', 'Acadia National Park,Maine,1919,49076.63\n', 'National Park of American Samoa,American Samoa,1988,8256.67\n', 'Arches National Park,Utah,1971,76678.98\n', 'Badlands National Park,South Dakota,1978,242755.94\n', 'Big Bend National Park,Texas,1944,801163.21\n', 'Biscayne National Park,Florida,1980,172971.11\n', 'Black Canyon of the Gunnison National Park,Colorado,1999,30779.83\n', 'Bryce Canyon National Park,Utah,1928,35835.08\n', 'Canyonlands National Park,Utah,1964,337597.83\n', 'Capitol Reef National Park,Utah,1971,241904.50\n', 'Carlsbad Caverns National Park,New Mexico,1930,46766.45\n', 'Channel Islands National Park,California,1980,249561.00\n', 'Congaree National Park,South Carolina,2003,26476.47\n', 'Crater Lake National Park,Oregon,1902,183224.05\n', 'Cuyahoga Valley National Park,Ohio,2000,32571.88\n', 'Death Valley National Park,"California, Nevada",1994,3408395.63\n', 'Denali National P

The item at index 3 appears to be the line from the file about Arches National Park. Here's an example of how a programmer could process data and do something useful with it.

In [5]:
# save the line at index 3 to a new variable and display it
arches = parks[3]
print(arches)

# get rid of that pesky newline at the end
arches = arches.rstrip()

# separate each of the values by using the comma as a delimiter with split()
arches_list = arches.split(',')
print(arches_list)

# save the number of acres as a float variable so we can do a conversion to square miles
arches_acres = float(arches_list[3])
print(arches_list[0],"is",(arches_acres/640),"square miles")

Arches National Park,Utah,1971,76678.98

['Arches National Park', 'Utah', '1971', '76678.98']
Arches National Park is 119.81090624999999 square miles


## Option 2: use the csv module

Because CSV files are so common, there are Python modules that will do a lot of this work for us.

In [6]:
import csv

with open("nationalparks.csv") as npfile:
    parks = csv.reader(npfile)
    
print(parks)

<_csv.reader object at 0x108fb0f20>


This reader object is kind of like a list, and you could work with it directly, but for now, let's instead convert it to an actual list so we can do things we already know how to do with lists. The conversion happens inside the `with` block because the reader pulls lines from the file as it goes, so the file must still be open. Notice that this gives us a **two-dimensional list** (or 2D list) - a list that has other lists as its items. The outer list contains a bunch of smaller lists, and each small list holds the pieces of data that appear on one line of the original file, with the newline and delimiters already removed.

In [7]:
import csv

with open("nationalparks.csv") as npfile:
    parks = csv.reader(npfile)
    parks = list(parks)
    
print(parks)

[['Name', 'Location', 'Year established', 'Area in acres'], ['Acadia National Park', 'Maine', '1919', '49076.63'], ['National Park of American Samoa', 'American Samoa', '1988', '8256.67'], ['Arches National Park', 'Utah', '1971', '76678.98'], ['Badlands National Park', 'South Dakota', '1978', '242755.94'], ['Big Bend National Park', 'Texas', '1944', '801163.21'], ['Biscayne National Park', 'Florida', '1980', '172971.11'], ['Black Canyon of the Gunnison National Park', 'Colorado', '1999', '30779.83'], ['Bryce Canyon National Park', 'Utah', '1928', '35835.08'], ['Canyonlands National Park', 'Utah', '1964', '337597.83'], ['Capitol Reef National Park', 'Utah', '1971', '241904.50'], ['Carlsbad Caverns National Park', 'New Mexico', '1930', '46766.45'], ['Channel Islands National Park', 'California', '1980', '249561.00'], ['Congaree National Park', 'South Carolina', '2003', '26476.47'], ['Crater Lake National Park', 'Oregon', '1902', '183224.05'], ['Cuyahoga Valley National Park', 'Ohio', '20

2D list data can be accessed with two indices, the first one for the row, and the second one for the column.

In [8]:
print(parks[3][0]) #row 3, column 0
print(parks[3][3]) #row 3, column 3

print(parks[3][0],"is",(float(parks[3][3])/640),"square miles")

Arches National Park
76678.98
Arches National Park is 119.81090624999999 square miles
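
One more reason to prefer the `csv` module: some values in the file contain commas themselves. The Death Valley line wraps its location in quotes for exactly that reason. The sketch below compares the two approaches on that single line. The variable names are illustrative, and the line is passed to `csv.reader` inside a one-item list instead of being read from the file.

```python
import csv

# one line copied from nationalparks.csv; the location itself contains a comma
death_valley_line = 'Death Valley National Park,"California, Nevada",1994,3408395.63\n'

# plain split() cuts at every comma, including the one inside the quotes
split_result = death_valley_line.rstrip().split(",")
print(len(split_result), split_result)

# csv.reader accepts any sequence of lines, so a one-item list works here
csv_result = list(csv.reader([death_valley_line]))[0]
print(len(csv_result), csv_result)
```

With `split(",")` the location is broken into two pieces and the acres end up at index 4 instead of index 3, while `csv.reader` keeps `'California, Nevada'` together as one value.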


## Processing all rows

If we want to do something with every row or every column (or both), we could iterate through the 2D list with a loop.

In [9]:
import csv

with open("nationalparks.csv") as npfile:
    parks = csv.reader(npfile)
    parks = list(parks)

# start at index 1 to skip the header row at index 0
park_counter = 1

while park_counter < len(parks):
    print(parks[park_counter][0],"is",(float(parks[park_counter][3])/640),"square miles")
    park_counter += 1

Acadia National Park is 76.682234375 square miles
National Park of American Samoa is 12.901046875 square miles
Arches National Park is 119.81090624999999 square miles
Badlands National Park is 379.30615625 square miles
Big Bend National Park is 1251.817515625 square miles
Biscayne National Park is 270.26735937499996 square miles
Black Canyon of the Gunnison National Park is 48.093484375 square miles
Bryce Canyon National Park is 55.992312500000004 square miles
Canyonlands National Park is 527.496609375 square miles
Capitol Reef National Park is 377.97578125 square miles
Carlsbad Caverns National Park is 73.07257812499999 square miles
Channel Islands National Park is 389.9390625 square miles
Congaree National Park is 41.369484375 square miles
Crater Lake National Park is 286.287578125 square miles
Cuyahoga Valley National Park is 50.8935625 square miles
Death Valley National Park is 5325.6181718749995 square miles
Denali National Park is 7407.673687500001 square miles
Dry Tortugas Natio
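
To visit every row *and* every column, one option is to nest a second loop inside the first. The sketch below uses a small hard-coded 2D list (`sample_parks`, a hypothetical stand-in holding three rows copied from the file) so that it runs without the CSV file.

```python
# a small 2D list with the header row and two data rows from nationalparks.csv
sample_parks = [
    ['Name', 'Location', 'Year established', 'Area in acres'],
    ['Acadia National Park', 'Maine', '1919', '49076.63'],
    ['Arches National Park', 'Utah', '1971', '76678.98'],
]

row = 1  # start at 1 to skip the header row

while row < len(sample_parks):
    col = 0
    while col < len(sample_parks[row]):
        # label each value with its column name from the header row
        print(sample_parks[0][col], ":", sample_parks[row][col])
        col += 1
    print()  # blank line between parks
    row += 1
```

The outer loop moves down the rows, and for each row the inner loop restarts `col` at 0 and moves across the columns.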

## Example: How many national parks are in a given state?

The following code reads the file into a 2D list stored in the `parks` variable as before. Then we ask the user to enter the name of a state and loop through the *outer* list, counting how many parks are in that state.

In [10]:
import csv

with open("nationalparks.csv") as npfile:
    parks = csv.reader(npfile)
    parks = list(parks)
    
state = input("Enter a state: ")

park_counter = 1
parks_in_state = 0

while park_counter < len(parks):
    
    if parks[park_counter][1] == state:
        parks_in_state += 1
    
    
    park_counter += 1
    
print("There are",parks_in_state,"national parks in",state)

Enter a state: Utah
There are 5 national parks in Utah
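
Note that `==` only counts a park when its location matches the state exactly. A park whose location lists more than one state, like Death Valley's `'California, Nevada'`, is not counted for either one. One possible workaround, sketched below on a hard-coded sample (`sample_parks` is a hypothetical stand-in for the full list), is to test with `in` instead. This is a rough check, not a complete fix: `in` matches any substring, so a name like "Virginia" would also match "West Virginia".

```python
# sample rows taken from nationalparks.csv (no header row here, so start at 0)
sample_parks = [
    ['Channel Islands National Park', 'California', '1980', '249561.00'],
    ['Death Valley National Park', 'California, Nevada', '1994', '3408395.63'],
    ['Arches National Park', 'Utah', '1971', '76678.98'],
]

state = "California"

park_counter = 0
exact_matches = 0     # counted with ==
contains_matches = 0  # counted with in

while park_counter < len(sample_parks):
    location = sample_parks[park_counter][1]

    if location == state:
        exact_matches += 1

    if state in location:
        contains_matches += 1

    park_counter += 1

print("== finds", exact_matches, "and in finds", contains_matches)
```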


**Reflection Questions:** To make sure you understand how this code works, answer the following questions in your notes:
1. What does `len(parks)` represent? Is it the number of parks? Is it the total number of values that appear in the whole file? Why do we use that in the condition for the counter-controlled loop?
2. What is `parks[park_counter][1]`? Why is `park_counter` used as the first of the two indices? What is the purpose of hard-coding a number like 1 in there?