# Parsing a CSV - Solution

In [19]:
data = """Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size\n90001,57110,26.6,28468,28642,12971,4.4\n90002,51223,25.5,24876,26347,11731,4.36\n90003,66266,26.3,32631,33635,15642,4.22""".strip()

The 'data' string holds data that represents a CSV. We can see that line breaks ('\n') separate the rows, and commas separate the values within each row.

In [20]:
data

'Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size\n90001,57110,26.6,28468,28642,12971,4.4\n90002,51223,25.5,24876,26347,11731,4.36\n90003,66266,26.3,32631,33635,15642,4.22'

If we want to separate (or 'split up') the rows into individual strings, we can use the .split() method.

In [21]:
data.split('\n')

['Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size',
 '90001,57110,26.6,28468,28642,12971,4.4',
 '90002,51223,25.5,24876,26347,11731,4.36',
 '90003,66266,26.3,32631,33635,15642,4.22']

This gives us a list of new, smaller strings. But the original 'data' variable is a string, and strings are immutable. That means the .split() method isn't going to modify the string. It will return the list of strings we see above, but we need to store that list in a new variable if we want to access it later:

In [22]:
rows = data.split('\n')
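As a quick check of the immutability point, here is a minimal sketch. It uses a shortened stand-in string (`sample` and `parts` are hypothetical names, not part of the notebook) to show that calling .split() leaves the original string untouched and hands back a new list:

```python
# 'sample' is a shortened stand-in for the notebook's 'data' string
sample = "Zip Code,Total Population\n90001,57110\n90002,51223"

# .split() returns a new list; it does not change 'sample'
parts = sample.split('\n')

print(type(sample))  # the original is still a str
print(type(parts))   # the return value is a list
print(sample)        # the original text, line breaks and all
print(parts)         # the three row strings
```

If we had called `sample.split('\n')` without assigning the result, the list would simply have been discarded.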

The 'rows' variable now holds what was returned when we called the .split() method. Let's take a look:

In [23]:
rows

['Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size',
 '90001,57110,26.6,28468,28642,12971,4.4',
 '90002,51223,25.5,24876,26347,11731,4.36',
 '90003,66266,26.3,32631,33635,15642,4.22']

In [24]:
type(rows)

list

In [25]:
len(rows)

4

If we want to access the individual objects in our newly created 'rows' list, we can use indexing:

In [26]:
rows[0]

'Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size'

In [27]:
type(rows[0])

str

And, once we access that first string object at [0], we can access its methods and attributes. If we want to split it up into its individual values (which are separated by commas), we can call the .split() method again:

In [28]:
rows[0].split(',')

['Zip Code',
 'Total Population',
 'Median Age',
 'Total Males',
 'Total Females',
 'Total Households',
 'Average Household Size']

We can do the same for all the rows. Let's save them to a new variable so we can access them later. We'll create an empty list, called 'data_array', and append to it:

In [29]:
data_array = []
data_array.append(rows[0].split(','))
data_array.append(rows[1].split(','))
data_array.append(rows[2].split(','))
data_array.append(rows[3].split(','))
data_array

[['Zip Code',
  'Total Population',
  'Median Age',
  'Total Males',
  'Total Females',
  'Total Households',
  'Average Household Size'],
 ['90001', '57110', '26.6', '28468', '28642', '12971', '4.4'],
 ['90002', '51223', '25.5', '24876', '26347', '11731', '4.36'],
 ['90003', '66266', '26.3', '32631', '33635', '15642', '4.22']]

This gives us a list of lists (of strings), which represents our original CSV data.

Of course, we could have just created the list outright, instead of by appending to an empty list. Both give us the same result:

In [32]:
data_array = [
    rows[0].split(','),
    rows[1].split(','),
    rows[2].split(','),
    rows[3].split(',')
]
data_array

[['Zip Code',
  'Total Population',
  'Median Age',
  'Total Males',
  'Total Females',
  'Total Households',
  'Average Household Size'],
 ['90001', '57110', '26.6', '28468', '28642', '12971', '4.4'],
 ['90002', '51223', '25.5', '24876', '26347', '11731', '4.36'],
 ['90003', '66266', '26.3', '32631', '33635', '15642', '4.22']]
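Both versions above name each row by its index, which only works when we know how many rows there are. As a sketch of a more general approach (not part of the original solution), we could loop over 'rows' instead, either with a for loop or with a list comprehension. The names `data_array_loop` and `data_array_comp` are made up for this example:

```python
data = (
    "Zip Code,Total Population,Median Age,Total Males,Total Females,"
    "Total Households,Average Household Size\n"
    "90001,57110,26.6,28468,28642,12971,4.4\n"
    "90002,51223,25.5,24876,26347,11731,4.36\n"
    "90003,66266,26.3,32631,33635,15642,4.22"
)
rows = data.split('\n')

# Option 1: a for loop that appends one split row at a time
data_array_loop = []
for row in rows:
    data_array_loop.append(row.split(','))

# Option 2: a list comprehension that builds the same list in one expression
data_array_comp = [row.split(',') for row in rows]

# Both approaches produce the same list of lists
print(data_array_loop == data_array_comp)
```

Either option keeps working if the CSV grows to hundreds of rows, since nothing depends on the row count.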

To access just the column names from the 'data_array' list, we can use indexing. Now that we have a list of lists of strings, we just want the list of strings that holds the column headers. This is the first element in the list of lists:

In [31]:
data_array[0]

['Zip Code',
 'Total Population',
 'Median Age',
 'Total Males',
 'Total Females',
 'Total Households',
 'Average Household Size']
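Because each element of 'data_array' is itself a list, we can chain a second index to reach a single value. The sketch below rebuilds 'data_array' from the output shown earlier so that it runs on its own; the variable names `first_zip`, `third_header`, and `last_median_age` are made up for illustration:

```python
data_array = [
    ['Zip Code', 'Total Population', 'Median Age', 'Total Males',
     'Total Females', 'Total Households', 'Average Household Size'],
    ['90001', '57110', '26.6', '28468', '28642', '12971', '4.4'],
    ['90002', '51223', '25.5', '24876', '26347', '11731', '4.36'],
    ['90003', '66266', '26.3', '32631', '33635', '15642', '4.22'],
]

# The first index picks the row, the second picks the value within that row
first_zip = data_array[1][0]        # first data row, first column
third_header = data_array[0][2]     # header row, third column
last_median_age = data_array[3][2]  # last data row, third column

print(first_zip)
print(third_header)
print(last_median_age)
```

Note that every value is still a string, including the ones that look like numbers.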

And if we want just the data (no headers), we can use slicing to skip the first item in the list (which holds the column names):

In [33]:
data_array[1:]

[['90001', '57110', '26.6', '28468', '28642', '12971', '4.4'],
 ['90002', '51223', '25.5', '24876', '26347', '11731', '4.36'],
 ['90003', '66266', '26.3', '32631', '33635', '15642', '4.22']]
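Splitting on commas by hand works for this data because no value contains a comma or a quoted field. As a hedged aside that goes beyond the original solution, Python's standard library has a csv module that handles those cases. The sketch below assumes the same 'data' string and checks that `csv.reader` gives the same list of lists as our manual approach; `parsed`, `manual`, `header`, and `records` are names chosen for this example:

```python
import csv
import io

data = (
    "Zip Code,Total Population,Median Age,Total Males,Total Females,"
    "Total Households,Average Household Size\n"
    "90001,57110,26.6,28468,28642,12971,4.4\n"
    "90002,51223,25.5,24876,26347,11731,4.36\n"
    "90003,66266,26.3,32631,33635,15642,4.22"
)

# csv.reader expects an iterable of lines, so we wrap the string in StringIO
parsed = list(csv.reader(io.StringIO(data)))

# The manual approach from this notebook, written as a list comprehension
manual = [row.split(',') for row in data.split('\n')]

# For this simple data, both give the same list of lists
print(parsed == manual)

# Indexing and slicing separate the header from the data rows
header = parsed[0]
records = parsed[1:]
print(header)
print(records)
```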