**Working With CSV Files**

CSV files store tabular data as plain text. Think of them as very simplified spreadsheets (like Excel), where each line of the file is one row and the values within a row are separated by commas.

The csv module is part of Python's standard library, and it allows Python to parse these types of files.

In [1]:
# To parse CSV files, we use the csv module. CSV stands for "comma-separated values,"
# where the comma is what is known as a "delimiter." The csv module provides a number of
# functions and classes that make it easier to parse and iterate through CSV files.
import csv
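
To make the delimiter idea concrete, here is a minimal sketch that parses the same small table written two ways, once with commas and once with semicolons. The sample strings and variable names are made up for illustration and are not part of the diabetes file. `csv.reader` and its `delimiter` argument come from the standard library.

```python
import csv
import io

# Hypothetical sample data: the same table written with two different delimiters
comma_text = "id,chol\n1000,203\n1001,165\n"
semicolon_text = "id;chol\n1000;203\n1001;165\n"

# io.StringIO wraps a string so it can be read like an open file
comma_rows = list(csv.reader(io.StringIO(comma_text)))

# The delimiter argument tells the reader which character separates the columns
semicolon_rows = list(csv.reader(io.StringIO(semicolon_text), delimiter=";"))

print(comma_rows)
print(semicolon_rows)
```

Both calls should produce the same list of rows, because only the separator character differs between the two inputs.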

In [9]:
# Open the diabetes file.  Note that when Python opens data files and stores them in variables,
# the variables DO NOT actually contain text.  In the example below, the diabetes_file 
# variable stores the file in a special format (one that Python can understand and interpret)
diabetes_file = open("diabetes.csv")

# See what happens when we try to print the variable where the data file is stored
# Essentially, the file is treated as an OBJECT - we'll learn about objects next week
print(diabetes_file)

# Now we need to tell Python that the file stored in the diabetes_file variable should be
# interpreted as a CSV file.  We do that by calling the reader() function of the csv module
diabetes_data = csv.reader(diabetes_file)

# At this point, we can treat the CSV file as a table - a collection of rows and columns.
# The reader hands us one row at a time, so we can iterate (loop) through it to access each individual row
for row in diabetes_data:
    print(row)
    
    


<_io.TextIOWrapper name='diabetes.csv' mode='r' encoding='UTF-8'>
['id', 'chol', 'stab.glu', 'hdl', 'ratio']
['1000', '203', '82', '56', '3.599999905']
['1001', '165', '97', '24', '6.900000095']
['1002', '228', '92', '37', '6.199999809']
['1003', '78', '93', '12', '6.5']
['1005', '249', '90', '28', '8.899999619']
['1008', '248', '94', '69', '3.599999905']
['1011', '195', '92', '41', '4.800000191']
['1015', '227', '75', '44', '5.199999809']
['1016', '177', '87', '49', '3.599999905']
['1022', '263', '89', '40', '6.599999905']
['1024', '242', '82', '54', '4.5']
['1029', '215', '128', '34', '6.300000191']
['1030', '238', '75', '36', '6.599999905']
['1031', '183', '79', '46', '4']
['1035', '191', '76', '30', '6.400000095']
['1036', '213', '83', '47', '4.5']
['1037', '255', '78', '38', '6.699999809']
['1041', '230', '112', '64', '3.599999905']
['1045', '194', '81', '36', '5.400000095']
['1250', '196', '206', '41', '4.800000191']
['1252', '186', '97', '50', '3.700000048']
['1253', '234', '65'

['21359', '219', '130', '44', '5']
['40251', '150', '80', '38', '3.900000095']
['40253', '185', '67', '59', '3.099999905']
['40500', '226', '100', '65', '3.5']
['40501', '206', '83', '68', '3']
['40502', '199', '81', '36', '5.5']
['40751', '239', '85', '63', '3.799999952']
['40754', '235', '106', '37', '6.400000095']
['40755', '184', '99', '36', '5.099999905']
['40762', '242', '297', '34', '7.099999905']
['40764', '307', '87', '58', '5.300000191']
['40772', '204', '94', '54', '3.799999952']
['40773', '212', '88', '36', '5.900000095']
['40774', '203', '90', '51', '4']
['40775', '219', '173', '31', '7.099999905']
['40784', '226', '279', '52', '4.300000191']
['40785', '217', '75', '54', '4']
['40786', '157', '92', '47', '3.299999952']
['40787', '235', '102', '42', '5.599999905']
['40789', '252', '161', '87', '2.900000095']
['40792', '204', '71', '55', '3.700000048']
['40797', '188', '84', '46', '4.099999905']
['40799', '194', '95', '36', '5.400000095']
['40803', '215', '64', '84', '2.5999
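
The cell above leaves the file open after reading it. A common alternative, sketched below, is to open the file in a `with` block so that Python closes it automatically. This is only an illustration: it writes its own tiny file (`sample.csv`, a hypothetical name, with one data row copied from the output above) so that it can run without `diabetes.csv`. The `newline=""` argument follows the recommendation in the csv module documentation.

```python
import csv
import os
import tempfile

# Write a tiny sample file so the sketch runs on its own (rows copied from the output above)
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "w", newline="") as sample_file:
    sample_file.write("id,chol,stab.glu,hdl,ratio\n1000,203,82,56,3.599999905\n")

# The with block closes the file automatically when the indented code finishes
with open(sample_path, newline="") as csv_file:
    rows = list(csv.reader(csv_file))

print(rows)
print(csv_file.closed)
```

Converting the reader to a list inside the `with` block keeps the rows available after the file has been closed.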

In [11]:
# You probably noticed that the row variable is just a list - a list of the values in each column of that row.
# You probably also noticed that the first row does not contain data - it's just the column headers.
# You can access individual columns exactly the same way you would access values in a list.
# For example, the value of cholesterol is in a column called 'chol', which is the second column and
# therefore has an index of 1 (and every value is a string, even when it looks like a number)

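# Since we already iterated through the CSV file once, we need to tell Python to start at the beginning again.
# The rewind happens on the file object (diabetes_file), because the csv reader itself has no seek() method.
diabetes_file.seek(0)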

for row in diabetes_data:
    print(row[1])
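
Looping over the cholesterol column this way also prints the header text `chol`, and every value comes back as a string. The sketch below shows one possible way to handle both: rewind the underlying file object, skip the header row with `next()`, and convert the values to integers. It is only an illustration. An in-memory `io.StringIO` with three rows copied from the output above stands in for `diabetes.csv`, and the names `sample_file`, `sample_data`, `chol_values`, and `average_chol` are hypothetical.

```python
import csv
import io

# A few rows copied from the output above, standing in for diabetes.csv
sample_file = io.StringIO(
    "id,chol,stab.glu,hdl,ratio\n"
    "1000,203,82,56,3.599999905\n"
    "1001,165,97,24,6.900000095\n"
    "1002,228,92,37,6.199999809\n"
)
sample_data = csv.reader(sample_file)

# First pass: reading every row uses up the reader
first_pass = list(sample_data)

# Rewind the underlying file object so the reader can start over
sample_file.seek(0)

# next() pulls one row off the reader, which here is the header row
header = next(sample_data)

# Every value arrives as a string, so convert before doing arithmetic
chol_values = [int(row[1]) for row in sample_data]
average_chol = sum(chol_values) / len(chol_values)

print(header)
print(chol_values)
print(average_chol)
```

Skipping the header first means the loop only ever sees data rows, so the `int()` conversion does not fail on the text `chol`.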