### Reading Text Files

In order to read a text file, we first have to **open** the file.

After a file has been opened, we can read from and/or write to that file.

When we are done using the file, we need to remember to **close** it - this releases the file or, stated more generally, releases the **resource** (a file in this case).

It is important to remember to close files - although the files will close automatically when your program terminates, there are other issues that can come up if you don't close your files explicitly.

Apart from a limit on the number of files that can be open at the same time, one main issue is that writes to files are often not written out to disk immediately - instead, the data sits in a buffer until that buffer is flushed or the file is closed. Far better to decide explicitly when you want that to happen, rather than hoping Python does it for you at some point... maybe... if nothing crashes in the meantime... so maybe never...
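To make the buffering point concrete, here is a small sketch. The scratch file `buffer_demo.txt` and the variable names are made up for this illustration, and the exact moment data reaches the disk depends on the buffer size and the platform, so treat the "before" result as typical rather than guaranteed.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'buffer_demo.txt')

writer = open(path, 'w')
writer.write('hello\n')  # a small write is usually held in an in-memory buffer

# Look at the file through a second, independent handle.
reader = open(path)
before_close = reader.read()
reader.close()

writer.close()  # closing the file flushes the buffer out to disk

reader = open(path)
after_close = reader.read()
reader.close()

print(repr(before_close))  # typically '' - the data had not reached the disk yet
print(repr(after_close))   # 'hello\n'

# Clean up the scratch file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

If the program had crashed before `writer.close()`, the text might never have made it to the file at all.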

You get the idea :-)

So, we have to open and close files.

When we open a file we specify our intentions - do we want to only read from the file, write to the file (either by replacing an existing file, or by appending to it), or do both?

These intentions are specified using a short mode string:

1. `r` - read-only
2. `w` - write-only, replace existing file (if any)
3. `a` - append (write-only) - appends to existing file (if any)
4. `r+` - both read and write

Be careful with using both read and write operations on a file at the same time - it can get quite tricky...
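As a quick illustration of the difference between `w` and `a`, here is a minimal sketch. It works on a throwaway file (`modes_demo.txt` is a made-up name, not part of the course materials) so nothing of yours gets overwritten.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'modes_demo.txt')

f = open(path, 'w')   # creates the file (or replaces an existing one)
f.write('first\n')
f.close()

f = open(path, 'a')   # adds to the end of the existing file
f.write('second\n')
f.close()

f = open(path, 'r')
after_append = f.read()
f.close()

f = open(path, 'w')   # opening with 'w' again discards the old contents
f.write('third\n')
f.close()

f = open(path, 'r')
after_rewrite = f.read()
f.close()

print(repr(after_append))   # 'first\nsecond\n'
print(repr(after_rewrite))  # 'third\n'

# Clean up the scratch file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```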

Let's look at opening, reading and closing files.

There is a file `DEXUSEU.csv` that is available in the course materials. Please make sure you copy that file to the same location as the Jupyter notebook you are using (if not, you'll have to tweak the code we are writing here to use the correct path to the file).

In this video, I'm going to assume that the file is located in the same directory as the Jupyter notebook.

In [1]:
file_name = 'DEXUSEU.csv'

First we are going to open the file for read-only:

In [2]:
file = open(file_name, 'r')

Now this file is open, and we can look at some of the properties of the file:

In [3]:
file.name

'DEXUSEU.csv'

In [4]:
file.readable()

True

In [5]:
file.writable()

False

In [6]:
file.mode

'r'

We can also find out if the file has been closed:

In [7]:
file.closed

False

And we can close the file:

In [8]:
file.close()

Now we can see that the file is closed:

In [9]:
file.closed

True

Now let's try reading some data.

We can read data from a text file in many different ways, the two most common being reading the entire file at once, or reading it line by line.

If you are dealing with massive data files, reading the entire file into memory and then processing it might not always be the best approach - always try to avoid calculating or creating objects until absolutely necessary (think of lazy iterators).

So, let's first look at reading the entire file at once:

In [10]:
f = open(file_name)  # r is the default
data = f.readlines()
f.close()

In [11]:
data

['DATE,DEXUSEU\n',
 '2015-04-03,1.0990\n',
 '2015-04-06,1.1008\n',
 '2015-04-07,1.0850\n',
 '2015-04-08,1.0818\n',
 '2015-04-09,1.0671\n',
 '2015-04-10,1.0598\n',
 '2015-04-13,1.0582\n',
 '2015-04-14,1.0672\n',
 '2015-04-15,1.0596\n',
 '2015-04-16,1.0742\n',
 '2015-04-17,1.0780\n',
 '2015-04-20,1.0763\n',
 '2015-04-21,1.0758\n',
 '2015-04-22,1.0729\n',
 '2015-04-23,1.0803\n',
 '2015-04-24,1.0876\n',
 '2015-04-27,1.0892\n',
 '2015-04-28,1.0979\n',
 '2015-04-29,1.1174\n',
 '2015-04-30,1.1162\n',
 '2015-05-01,1.1194\n',
 '2015-05-04,1.1145\n',
 '2015-05-05,1.1174\n',
 '2015-05-06,1.1345\n',
 '2015-05-07,1.1283\n',
 '2015-05-08,1.1241\n',
 '2015-05-11,1.1142\n',
 '2015-05-12,1.1240\n',
 '2015-05-13,1.1372\n',
 '2015-05-14,1.1368\n',
 '2015-05-15,1.1428\n',
 '2015-05-18,1.1354\n',
 '2015-05-19,1.1151\n',
 '2015-05-20,1.1079\n',
 '2015-05-21,1.1126\n',
 '2015-05-22,1.1033\n',
 '2015-05-25,.\n',
 '2015-05-26,1.0876\n',
 '2015-05-27,1.0888\n',
 '2015-05-28,1.0914\n',
 '2015-05-29,1.0994\n',
 ...

As you can see, `readlines()` will create a list of strings, one list item for each row in the file.
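Notice that every string returned by `readlines()` still carries its trailing `\n`. Here is a small sketch of one way to remove it. To keep the example self-contained it writes the first three rows shown above to a scratch file (`sample.csv` is a made-up name) instead of relying on `DEXUSEU.csv` being present.

```python
import os
import tempfile

# Scratch copy of the first few rows, so the example runs on its own.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.csv')

f = open(path, 'w')
f.write('DATE,DEXUSEU\n2015-04-03,1.0990\n2015-04-06,1.1008\n')
f.close()

f = open(path)
lines = f.readlines()
f.close()

# Remove the trailing newline from each row.
clean_lines = [line.rstrip('\n') for line in lines]

print(lines)        # ['DATE,DEXUSEU\n', '2015-04-03,1.0990\n', '2015-04-06,1.1008\n']
print(clean_lines)  # ['DATE,DEXUSEU', '2015-04-03,1.0990', '2015-04-06,1.1008']

# Clean up the scratch file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```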

If we do not want to read the entire file at once, and we are in a situation where we can just do what we need by reading the file one line at a time, we can do so by **iterating** over the file.

In other words, the object returned by `open` is an iterable:

In [12]:
f = open(file_name)
for line in f:
    print(line)

DATE,DEXUSEU

2015-04-03,1.0990

2015-04-06,1.1008

2015-04-07,1.0850

2015-04-08,1.0818

2015-04-09,1.0671

2015-04-10,1.0598

2015-04-13,1.0582

2015-04-14,1.0672

2015-04-15,1.0596

2015-04-16,1.0742

2015-04-17,1.0780

2015-04-20,1.0763

2015-04-21,1.0758

2015-04-22,1.0729

2015-04-23,1.0803

2015-04-24,1.0876

2015-04-27,1.0892

2015-04-28,1.0979

2015-04-29,1.1174

2015-04-30,1.1162

2015-05-01,1.1194

2015-05-04,1.1145

2015-05-05,1.1174

2015-05-06,1.1345

2015-05-07,1.1283

2015-05-08,1.1241

2015-05-11,1.1142

2015-05-12,1.1240

2015-05-13,1.1372

2015-05-14,1.1368

2015-05-15,1.1428

2015-05-18,1.1354

2015-05-19,1.1151

2015-05-20,1.1079

2015-05-21,1.1126

2015-05-22,1.1033

2015-05-25,.

2015-05-26,1.0876

2015-05-27,1.0888

2015-05-28,1.0914

2015-05-29,1.0994

2015-06-01,1.0913

2015-06-02,1.1130

2015-06-03,1.1285

2015-06-04,1.1271

2015-06-05,1.1108

2015-06-08,1.1232

2015-06-09,1.1284

2015-06-10,1.1307

2015-06-11,1.1236

2015-06-12,1.1278

2015-06-15,1.1266

...

Watch out: once we've read all the lines, we're at the end of the file, and there's nothing more to iterate over. Running the loop a second time prints nothing:

In [13]:
for line in f:
    print(line)

We can move back to an earlier position in the file by specifying a location to move to, but I'm not going to cover that in this course.

And we should not forget to close the file:

In [14]:
f.close()

I also want to point out that the object returned by `open()` is not just an iterable, it is an iterator - we can call `next()` on it, or iterate over it (as we just saw). That behavior is no different from other iterators we've worked with.

In [15]:
f = open(file_name)
print(next(f))
print(next(f))
print(next(f))
f.close()

DATE,DEXUSEU

2015-04-03,1.0990

2015-04-06,1.1008
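Here is a short sketch that checks the iterator behavior directly. The scratch file and the variable names are made up for the illustration; the two rows are the first two rows of the data file.

```python
import os
import tempfile

# Scratch file with two rows, so the example runs on its own.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.csv')

f = open(path, 'w')
f.write('DATE,DEXUSEU\n2015-04-03,1.0990\n')
f.close()

f = open(path)
is_own_iterator = iter(f) is f  # an iterator returns itself from iter()
first = next(f)
second = next(f)
exhausted = next(f, None)       # default value instead of a StopIteration
f.close()

print(is_own_iterator)  # True
print(repr(first))      # 'DATE,DEXUSEU\n'
print(repr(second))     # '2015-04-03,1.0990\n'
print(exhausted)        # None

# Clean up the scratch file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```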



Our current pattern has been:

1. open file
2. read data from file and perform some operations
3. close file

The problem with that approach is that if something goes wrong in step 2, we may never close the file explicitly.

So maybe we would want to do this:

In [16]:
f = open(file_name)
try:
    for row in f:
        print(row)
        raise ValueError('forcing an exception...')
finally:
    print('closing file...')
    f.close()
    

DATE,DEXUSEU

closing file...


ValueError: forcing an exception...

As you can see, even though we had an exception, we still closed the file by using the `finally` block.

There is a much cleaner way of doing this - using something called a **context manager**.

In [17]:
with open(file_name) as f:
    # while in this block, f remains open
    print(f.closed)
print(f.closed)

False
True


A context is **entered** using the `with` statement. Once the context is **exited**, some code that cleans up the context is executed (in this case, that would be closing the file, but other context managers may do other things upon entry/exit).

Using a context manager means we never have to remember to close the file ourselves - it is done automatically as soon as the context is exited, whether normally or because of an exception.
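To convince ourselves that the exception case really is handled, here is a minimal sketch that forces an exception inside the `with` block and then checks the file afterwards. The scratch file is made up for the illustration.

```python
import os
import tempfile

# Scratch file, so the example runs on its own.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.csv')
with open(path, 'w') as f:
    f.write('DATE,DEXUSEU\n2015-04-03,1.0990\n')

try:
    with open(path) as f:
        first_line = next(f)
        raise ValueError('forcing an exception...')
except ValueError as ex:
    caught = str(ex)

print(repr(first_line))  # 'DATE,DEXUSEU\n'
print(caught)            # forcing an exception...
print(f.closed)          # True - the context manager closed the file for us

# Clean up the scratch file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```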

Let's turn back to our sample data file, and see if we can parse the data out into a list of tuples.

First, observe that the first row in the file consists of headers - so we'll need to handle the first row differently from the others.

The other thing to observe is that some of the expected numerical data looks odd:

In [18]:
with open(file_name) as f:
    for line in f:
        print(line)

DATE,DEXUSEU

2015-04-03,1.0990

2015-04-06,1.1008

2015-04-07,1.0850

2015-04-08,1.0818

2015-04-09,1.0671

2015-04-10,1.0598

2015-04-13,1.0582

2015-04-14,1.0672

2015-04-15,1.0596

2015-04-16,1.0742

2015-04-17,1.0780

2015-04-20,1.0763

2015-04-21,1.0758

2015-04-22,1.0729

2015-04-23,1.0803

2015-04-24,1.0876

2015-04-27,1.0892

2015-04-28,1.0979

2015-04-29,1.1174

2015-04-30,1.1162

2015-05-01,1.1194

2015-05-04,1.1145

2015-05-05,1.1174

2015-05-06,1.1345

2015-05-07,1.1283

2015-05-08,1.1241

2015-05-11,1.1142

2015-05-12,1.1240

2015-05-13,1.1372

2015-05-14,1.1368

2015-05-15,1.1428

2015-05-18,1.1354

2015-05-19,1.1151

2015-05-20,1.1079

2015-05-21,1.1126

2015-05-22,1.1033

2015-05-25,.

2015-05-26,1.0876

2015-05-27,1.0888

2015-05-28,1.0914

2015-05-29,1.0994

2015-06-01,1.0913

2015-06-02,1.1130

2015-06-03,1.1285

2015-06-04,1.1271

2015-06-05,1.1108

2015-06-08,1.1232

2015-06-09,1.1284

2015-06-10,1.1307

2015-06-11,1.1236

2015-06-12,1.1278

2015-06-15,1.1266

...

In [19]:
with open(file_name) as f:
    print(f.readlines())

['DATE,DEXUSEU\n', '2015-04-03,1.0990\n', '2015-04-06,1.1008\n', '2015-04-07,1.0850\n', '2015-04-08,1.0818\n', '2015-04-09,1.0671\n', '2015-04-10,1.0598\n', '2015-04-13,1.0582\n', '2015-04-14,1.0672\n', '2015-04-15,1.0596\n', '2015-04-16,1.0742\n', '2015-04-17,1.0780\n', '2015-04-20,1.0763\n', '2015-04-21,1.0758\n', '2015-04-22,1.0729\n', '2015-04-23,1.0803\n', '2015-04-24,1.0876\n', '2015-04-27,1.0892\n', '2015-04-28,1.0979\n', '2015-04-29,1.1174\n', '2015-04-30,1.1162\n', '2015-05-01,1.1194\n', '2015-05-04,1.1145\n', '2015-05-05,1.1174\n', '2015-05-06,1.1345\n', '2015-05-07,1.1283\n', '2015-05-08,1.1241\n', '2015-05-11,1.1142\n', '2015-05-12,1.1240\n', '2015-05-13,1.1372\n', '2015-05-14,1.1368\n', '2015-05-15,1.1428\n', '2015-05-18,1.1354\n', '2015-05-19,1.1151\n', '2015-05-20,1.1079\n', '2015-05-21,1.1126\n', '2015-05-22,1.1033\n', '2015-05-25,.\n', '2015-05-26,1.0876\n', '2015-05-27,1.0888\n', '2015-05-28,1.0914\n', '2015-05-29,1.0994\n', '2015-06-01,1.0913\n', '2015-06-02,1.1130\n

So we can split each line on `,`, and we'll also strip each line to remove the trailing `\n`.

We also have to deal with rows where the numerical value is just `.` - what happens if we try to convert that to a float?
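On a single row taken from the output above, the strip and split steps look like this (the variable names are just for illustration):

```python
line = '2015-04-03,1.0990\n'

stripped = line.strip()       # removes the trailing newline
fields = stripped.split(',')  # splits into the two columns
date, value_str = fields      # unpack: we expect exactly two fields

print(repr(stripped))  # '2015-04-03,1.0990'
print(fields)          # ['2015-04-03', '1.0990']
print(date)            # 2015-04-03
print(value_str)       # 1.0990
```

Note that both fields are still strings at this point, so the exchange rate still has to be converted to a float.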

In [20]:
float('.')

ValueError: could not convert string to float: '.'

We get a `ValueError`, whereas this works just fine:

In [21]:
float('3.1415')

3.1415

Our goal will be to create a list of tuples of 2 elements each - date string as first element, and exchange rate as a float in the second position.

In [22]:
with open(file_name) as f:
    # first row is headers
    headers = next(f)
    
    # remaining rows are all data
    for row in f:
        row = row.strip()
        date, value_str = row.split(',')

        print(date, value_str)

2015-04-03 1.0990
2015-04-06 1.1008
2015-04-07 1.0850
2015-04-08 1.0818
2015-04-09 1.0671
2015-04-10 1.0598
2015-04-13 1.0582
2015-04-14 1.0672
2015-04-15 1.0596
2015-04-16 1.0742
2015-04-17 1.0780
2015-04-20 1.0763
2015-04-21 1.0758
2015-04-22 1.0729
2015-04-23 1.0803
2015-04-24 1.0876
2015-04-27 1.0892
2015-04-28 1.0979
2015-04-29 1.1174
2015-04-30 1.1162
2015-05-01 1.1194
2015-05-04 1.1145
2015-05-05 1.1174
2015-05-06 1.1345
2015-05-07 1.1283
2015-05-08 1.1241
2015-05-11 1.1142
2015-05-12 1.1240
2015-05-13 1.1372
2015-05-14 1.1368
2015-05-15 1.1428
2015-05-18 1.1354
2015-05-19 1.1151
2015-05-20 1.1079
2015-05-21 1.1126
2015-05-22 1.1033
2015-05-25 .
2015-05-26 1.0876
2015-05-27 1.0888
2015-05-28 1.0914
2015-05-29 1.0994
2015-06-01 1.0913
2015-06-02 1.1130
2015-06-03 1.1285
2015-06-04 1.1271
2015-06-05 1.1108
2015-06-08 1.1232
2015-06-09 1.1284
2015-06-10 1.1307
2015-06-11 1.1236
2015-06-12 1.1278
2015-06-15 1.1266
2015-06-16 1.1238
2015-06-17 1.1244
2015-06-18 1.1404
2015-06-19 1.13

2018-10-31 1.1332
2018-11-01 1.1396
2018-11-02 1.1378
2018-11-05 1.1394
2018-11-06 1.1412
2018-11-07 1.1459
2018-11-08 1.1416
2018-11-09 1.1325
2018-11-12 .
2018-11-13 1.1288
2018-11-14 1.1312
2018-11-15 1.1324
2018-11-16 1.1402
2018-11-19 1.1448
2018-11-20 1.1391
2018-11-21 1.1393
2018-11-22 .
2018-11-23 1.1332
2018-11-26 1.1336
2018-11-27 1.1281
2018-11-28 1.1286
2018-11-29 1.1382
2018-11-30 1.1323
2018-12-03 1.1356
2018-12-04 1.1345
2018-12-05 .
2018-12-06 1.1374
2018-12-07 1.139
2018-12-10 1.1368
2018-12-11 1.1314
2018-12-12 1.1362
2018-12-13 1.1358
2018-12-14 1.13
2018-12-17 1.1339
2018-12-18 1.1364
2018-12-19 1.1422
2018-12-20 1.1432
2018-12-21 1.1402
2018-12-24 .
2018-12-25 .
2018-12-26 1.1408
2018-12-27 1.1412
2018-12-28 1.1445
2018-12-31 1.1456
2019-01-01 .
2019-01-02 1.1357
2019-01-03 1.1399
2019-01-04 1.141
2019-01-07 1.1468
2019-01-08 1.1444
2019-01-09 1.1524
2019-01-10 1.1517
2019-01-11 1.1479
2019-01-14 .
2019-01-15 1.1392
2019-01-16 1.1408
2019-01-17 1.1386
2019-01-18 1.

Ok, almost there:

In [23]:
data = []

with open(file_name) as f:
    # first row is headers
    headers = next(f)  # or use f.readline() - same effect
    
    # remaining rows are all data
    for row in f:
        row = row.strip()
        date, value_str = row.split(',')
        try:
            value = float(value_str)
            data.append((date, value))
        except ValueError:
            # bad data, skip row
            pass

print(data)

[('2015-04-03', 1.099), ('2015-04-06', 1.1008), ('2015-04-07', 1.085), ('2015-04-08', 1.0818), ('2015-04-09', 1.0671), ('2015-04-10', 1.0598), ('2015-04-13', 1.0582), ('2015-04-14', 1.0672), ('2015-04-15', 1.0596), ('2015-04-16', 1.0742), ('2015-04-17', 1.078), ('2015-04-20', 1.0763), ('2015-04-21', 1.0758), ('2015-04-22', 1.0729), ('2015-04-23', 1.0803), ('2015-04-24', 1.0876), ('2015-04-27', 1.0892), ('2015-04-28', 1.0979), ('2015-04-29', 1.1174), ('2015-04-30', 1.1162), ('2015-05-01', 1.1194), ('2015-05-04', 1.1145), ('2015-05-05', 1.1174), ('2015-05-06', 1.1345), ('2015-05-07', 1.1283), ('2015-05-08', 1.1241), ('2015-05-11', 1.1142), ('2015-05-12', 1.124), ('2015-05-13', 1.1372), ('2015-05-14', 1.1368), ('2015-05-15', 1.1428), ('2015-05-18', 1.1354), ('2015-05-19', 1.1151), ('2015-05-20', 1.1079), ('2015-05-21', 1.1126), ('2015-05-22', 1.1033), ('2015-05-26', 1.0876), ('2015-05-27', 1.0888), ('2015-05-28', 1.0914), ('2015-05-29', 1.0994), ('2015-06-01', 1.0913), ('2015-06-02', 1.11
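If the file were very large, we might not want to build the whole list up front. As one possible variation (not the approach used above), the same parsing logic can be wrapped in a generator function so rows are produced lazily, one at a time. The function name `read_rates` and the scratch file are made up for this sketch; the sample rows are taken from the data shown earlier.

```python
import os
import tempfile


def read_rates(path):
    """Yield (date, rate) tuples one row at a time, skipping bad rows."""
    with open(path) as f:
        next(f, None)  # skip the header row (no error if the file is empty)
        for row in f:
            date, value_str = row.strip().split(',')
            try:
                value = float(value_str)
            except ValueError:
                # bad data (such as '.'), skip row
                continue
            yield date, value


# Scratch file with a header, two good rows and one bad row.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.csv')
with open(path, 'w') as f:
    f.write('DATE,DEXUSEU\n2015-05-22,1.1033\n2015-05-25,.\n2015-05-26,1.0876\n')

rates = list(read_rates(path))
print(rates)  # [('2015-05-22', 1.1033), ('2015-05-26', 1.0876)]

# Clean up the scratch file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

Because the `with` block lives inside the generator, the file stays open only while rows are being consumed, and it is closed once the generator is exhausted.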