### Writing Text Files

So far we've seen how to open and close text files, using a context manager, and how to read data from these files.

Now let's see how we can also write to files, again with or without a context manager.

Remember the modes we have for writing files:
- 'w' : create file if it does not exist, or overwrite if it does
- 'a' : create file if it does not exist, append writes to end of file if it does
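
As a quick illustration of how these two modes differ, here is a minimal sketch. The file name `modes_demo.txt` and the temporary directory are assumptions made only for this example, so that no existing file is touched.

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(path, 'w') as f:   # file does not exist yet, so 'w' creates it
    f.write('first\n')

with open(path, 'w') as f:   # file exists now, so 'w' discards the old contents
    f.write('second\n')

with open(path, 'a') as f:   # 'a' keeps the contents and writes at the end
    f.write('third\n')

with open(path) as f:
    contents = f.read()

print(repr(contents))  # 'second\nthird\n'
```

The first write is gone because the second `open` in `'w'` mode emptied the file before writing.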

In [1]:
f = open('test.csv', 'w')

In [2]:
f.write('abc')

3

In [3]:
f.write('123456')

6

The return value is the number of characters written to the file.
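
Note that the count is in characters, not bytes. The sketch below shows the difference for a non-ASCII character; the file name `count_demo.txt` and the explicit `utf-8` encoding are assumptions chosen for this illustration.

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'count_demo.txt')

with open(path, 'w', encoding='utf-8') as f:
    # 'caf\u00e9' is four characters; the last one needs two bytes in UTF-8
    chars_written = f.write('caf\u00e9')

size_in_bytes = os.path.getsize(path)

print(chars_written)   # 4
print(size_in_bytes)   # 5
```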

In [4]:
f.close()

In [5]:
with open('test.csv', 'r') as f:
    print(f.readlines())

['abc123456']


As we can see, the consecutive writes did not create two lines in the text file; `write` just keeps writing exactly what we tell it to.

To create a newline, we'll have to specifically write a `\n` character:

In [6]:
with open('test.csv', 'w') as f:
    f.write('abc\n')
    f.write('123456\n')

In [7]:
with open('test.csv', 'r') as f:
    print(f.readlines())

['abc\n', '123456\n']


If we have a list of strings we want to write, we can use the `writelines` method too:

In [8]:
data = ['line 1', 'line 2', 'line 3']

In [9]:
with open('test.csv', 'w') as f:
    f.writelines(data)

In [10]:
with open('test.csv', 'r') as f:
    print(f.readlines())

['line 1line 2line 3']


As you can see, we still only have one line - so we need to provide the newline characters ourselves as well:

In [11]:
data_n = ['line 1', '\n', 'line 2', '\n', 'line 3', '\n']

In [12]:
with open('test.csv', 'w') as f:
    f.writelines(data_n)

In [13]:
with open('test.csv', 'r') as f:
    print(f.readlines())

['line 1\n', 'line 2\n', 'line 3\n']
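
Since `writelines` accepts any iterable of strings, one alternative is to add the newline to each string on the fly with a generator expression, leaving the original list untouched. This is only a sketch; the file name `writelines_demo.txt` is an assumption for the example.

```python
import os
import tempfile

data = ['line 1', 'line 2', 'line 3']

# Hypothetical scratch file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'writelines_demo.txt')

with open(path, 'w') as f:
    # each string gets its own trailing newline as it is written
    f.writelines(line + '\n' for line in data)

with open(path) as f:
    lines = f.readlines()

print(lines)  # ['line 1\n', 'line 2\n', 'line 3\n']
```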


We could also have used the original list of strings, and joined them:

In [14]:
with open('test.csv', 'w') as f:
    f.write('\n'.join(data))

In [15]:
with open('test.csv', 'r') as f:
    print(f.readlines())

['line 1\n', 'line 2\n', 'line 3']


Let's also take a look at what happens if the code in the context manager encounters an unhandled exception:

In [16]:
with open('test.csv', 'r') as f:
    raise ValueError('bogus')

ValueError: bogus

As you can see, we have an unhandled exception, but what happened to the file we opened?

In [17]:
f.closed

True

The context manager closed the file for us. That's what's nice about a context manager: it cleans up when it exits, even if the exit was caused by an unhandled exception.
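
For a file object, this behaves roughly like a `try`/`finally` block that always calls `close()`. The sketch below is an approximation written for illustration, not how the `with` statement is implemented; the file name `cleanup_demo.txt` is an assumption, and the `except` clause is there only so the example runs to completion.

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'cleanup_demo.txt')
with open(path, 'w') as f:
    f.write('abc\n')

f = open(path)
try:
    raise ValueError('bogus')
except ValueError:
    # swallowed here only so that the example keeps running
    pass
finally:
    # runs whether or not an exception occurred
    f.close()

print(f.closed)  # True
```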

Let's look at the `a` mode for writing files.

We know we already have a file called `test.csv`:

In [18]:
with open('test.csv') as f:  # default mode is r
    for line in f:
        print(line.strip())

line 1
line 2
line 3


Now let's append some data to that file:

In [19]:
with open('test.csv', 'a') as f:
    f.write('line4\n')
    f.write('line5\n')

In [20]:
with open('test.csv') as f:
    for line in f:
        print(line.strip())

line 1
line 2
line 3line4
line5


Ah, our original file did not end with a newline, so the append just continued writing to the same line.

This is one reason why we usually include the `\n` character, even for the last line in our text file.
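
If we cannot be sure that an existing file ends with a newline, we could check before appending. The helper `append_line` below is a hypothetical function written for this illustration, and the file name `append_demo.txt` is likewise an assumption. It reads the whole file just to inspect the ending, so it is a sketch for small files only.

```python
import os
import tempfile

def append_line(path, text):
    """Hypothetical helper: append text so that it ends up on its own line."""
    needs_newline = False
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path) as f:
            # reads the entire file; acceptable for a small demo file
            needs_newline = not f.read().endswith('\n')
    with open(path, 'a') as f:
        if needs_newline:
            f.write('\n')
        f.write(text + '\n')

# Hypothetical scratch file whose last line has no trailing newline
path = os.path.join(tempfile.mkdtemp(), 'append_demo.txt')
with open(path, 'w') as f:
    f.write('line 1\nline 2\nline 3')

append_line(path, 'line4')
append_line(path, 'line5')

with open(path) as f:
    demo_lines = f.readlines()

print(demo_lines)
```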

What happens if we try to append to a non-existent file?

In [21]:
with open('does_not_exist.txt', 'a') as f:
    f.write('Line 1')

In [22]:
with open('does_not_exist.txt') as f:
    print(f.readlines())

['Line 1']


As you can see, the file is created automatically for us.

Let's work on a practical example.

Recall the file `DEXUSEU.csv` we used previously. This same file is available in the course downloads. Please make sure you save this file in the same directory as your Jupyter notebook (if not, you'll have to adjust the path when you specify the file to open).

Let's recall what that file looks like:

In [23]:
source_file = 'DEXUSEU.csv'
with open(source_file) as f:
    for _ in range(5):
        print(next(f).strip())

DATE,DEXUSEU
2015-04-03,1.0990
2015-04-06,1.1008
2015-04-07,1.0850
2015-04-08,1.0818


Our goal is to create a new csv file that will contain the following data (including the header names):

```
YEAR,MONTH,DAY,EXCH
2015,4,3,1.0990
2015,4,6,1.1008
etc
```

So, we will have to read from one file, modify the data as needed, and write it out to another file.

We'll look at two different approaches to do this.

The first approach will be to read the entire data file into memory, process the data, and then write everything out to the target file.

In [24]:
target_file = 'output.csv'

In [25]:
with open(source_file) as f:
    data = f.readlines()

In [26]:
data[0:5]

['DATE,DEXUSEU\n',
 '2015-04-03,1.0990\n',
 '2015-04-06,1.1008\n',
 '2015-04-07,1.0850\n',
 '2015-04-08,1.0818\n']

First we should remove the header row:

In [27]:
del data[0]

In [28]:
data[0:5]

['2015-04-03,1.0990\n',
 '2015-04-06,1.1008\n',
 '2015-04-07,1.0850\n',
 '2015-04-08,1.0818\n',
 '2015-04-09,1.0671\n']

Next we should strip the trailing `\n` character from each line in the data:

In [29]:
data = [line.strip() for line in data]

In [30]:
data[0:5]

['2015-04-03,1.0990',
 '2015-04-06,1.1008',
 '2015-04-07,1.0850',
 '2015-04-08,1.0818',
 '2015-04-09,1.0671']

Next we should split each line into a list containing the date and the exchange rate:

In [31]:
data = [line.split(',') for line in data]

In [32]:
data[0:5]

[['2015-04-03', '1.0990'],
 ['2015-04-06', '1.1008'],
 ['2015-04-07', '1.0850'],
 ['2015-04-08', '1.0818'],
 ['2015-04-09', '1.0671']]

Next we need to split the date strings into year, month and day.

Let's write a small utility function to do this:

In [33]:
def split_date(dt_str):
    return dt_str[:4], dt_str[5:7], dt_str[8:]

Let's make sure it works as intended:

In [34]:
split_date('2015-04-03')

('2015', '04', '03')
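
The slicing above relies on the fixed `YYYY-MM-DD` layout. An alternative sketch is to split on the `-` separator instead; `split_date_alt` is a hypothetical name used only for this comparison, not a function from the course.

```python
def split_date_alt(dt_str):
    # unpacking raises ValueError unless there are exactly three '-' separated parts
    year, month, day = dt_str.split('-')
    return year, month, day

print(split_date_alt('2015-04-03'))  # ('2015', '04', '03')
```

Both versions return the same tuple for well-formed dates; this one fails loudly on a string with the wrong number of parts, whereas slicing would silently return odd pieces.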

To make our life simpler and see all the code we wrote, let's write a function that takes a single (unprocessed) row from the source file and transforms it into something we can use to write to our target file:

In [35]:
def transform_row_for_output(row):
    row = row.strip()  # remove trailing \n
    dt_str, rate = row.split(',')  # split fields on ,
    year, month, day = split_date(dt_str)  # split date string into Y M D
    
    # join all the fields into a , separated string
    result = ','.join([year, month, day, rate])
    
    # finally add the newline character
    result += '\n'
    return result

Let's try it out for a single row and make sure it's doing what we want:

In [36]:
row = '2015-04-03,1.0990\n'

In [37]:
transform_row_for_output(row)

'2015,04,03,1.0990\n'

Looking good. We could even clean up those leading zeroes in the month and day:

In [38]:
def transform_row_for_output(row):
    row = row.strip()  # remove trailing \n
    dt_str, rate = row.split(',')  # split fields on ,
    year, month, day = split_date(dt_str)  # split date string into Y M D
    
    # clean up leading 0
    month = str(int(month))
    day = str(int(day))
    
    # join all the fields into a , separated string
    result = ','.join([year, month, day, rate])
    # finally add the newline character
    result += '\n'
    return result

In [39]:
transform_row_for_output(row)

'2015,4,3,1.0990\n'

But what about data that has a missing exchange rate?

In [40]:
row = '2015-04-03,.\n'

In [41]:
transform_row_for_output(row)

'2015,4,3,.\n'

This works, but we may not want it in our output file. We could have the transformation function return `None` in those cases, and later we can skip writing `None` return values.

In [42]:
def transform_row_for_output(row):
    row = row.strip()  # remove trailing \n
    dt_str, rate = row.split(',')  # split fields on ,
    
    try:
        float(rate)
    except ValueError:
        # not a float, so return None
        return None
    
    year, month, day = split_date(dt_str)  # split date string into Y M D
    
    # clean up leading 0
    month = str(int(month))
    day = str(int(day))
    
    # join all the fields into a , separated string
    result = ','.join([year, month, day, rate])
    # finally add the newline character
    result += '\n'
    return result

In [43]:
row = '2015-04-03,.\n'
print(transform_row_for_output(row))

None


In [44]:
row = '2015-04-03,1.0990\n'
print(transform_row_for_output(row))

2015,4,3,1.0990



But this approach means we'll need to test each transformed row to decide whether to write it or not.
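
Such a check is needed because passing `None` to `write` raises a `TypeError`. A minimal sketch of the write loop follows. It repeats a condensed copy of the transformation under the hypothetical name `transform_row_or_none`, uses `io.StringIO` as an in-memory stand-in for the target file, and the row with the missing rate is made-up sample data.

```python
import io

def split_date(dt_str):
    return dt_str[:4], dt_str[5:7], dt_str[8:]

def transform_row_or_none(row):
    # condensed copy of the None-returning transformation
    dt_str, rate = row.strip().split(',')
    try:
        float(rate)
    except ValueError:
        return None
    year, month, day = split_date(dt_str)
    return ','.join([year, str(int(month)), str(int(day)), rate]) + '\n'

# sample rows; the middle one has a missing exchange rate
rows = ['2015-04-03,1.0990\n', '2015-04-04,.\n', '2015-04-06,1.1008\n']

target = io.StringIO()  # in-memory stand-in for the target file
for row in rows:
    result = transform_row_or_none(row)
    if result is not None:  # skip rows that could not be transformed
        target.write(result)

output = target.getvalue()
print(repr(output))  # '2015,4,3,1.0990\n2015,4,6,1.1008\n'
```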

How about returning an empty string instead? We can write an empty string to a file and nothing will happen, so we can just write all transformed rows without checking whether the row is `None`.

In [45]:
def transform_row_for_output(row):
    row = row.strip()  # remove trailing \n
    dt_str, rate = row.split(',')  # split fields on ,
    
    try:
        float(rate)
    except ValueError:
        # not a float, so return empty string (no output)
        return ''
    
    year, month, day = split_date(dt_str)  # split date string into Y M D
    
    # clean up leading 0
    month = str(int(month))
    day = str(int(day))
    
    # join all the fields into a , separated string
    result = ','.join([year, month, day, rate])
    # finally add the newline character
    result += '\n'
    return result
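
We can check that writing an empty string really leaves the file unchanged. This is a small sketch; the file name `empty_write_demo.txt` is an assumption for the example.

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'empty_write_demo.txt')

with open(path, 'w') as f:
    f.write('abc\n')
    n_empty = f.write('')   # nothing is written, so the count is 0
    f.write('def\n')

with open(path) as f:
    lines = f.readlines()

print(n_empty)  # 0
print(lines)    # ['abc\n', 'def\n']
```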

OK, now let's go ahead and write our code to transform the source file and write it out to the target file.

In [46]:
with open(source_file) as f:
    data = f.readlines()

Delete first row (headers) from data:

In [47]:
del data[0]

In [48]:
with open(target_file, 'w') as f:
    f.write('YEAR,MONTH,DAY,EXCH\n')
    for row in data:
        f.write(transform_row_for_output(row))

Let's read back our file to see what we actually wrote (you could also just open it in a text editor):

In [49]:
with open(target_file) as f:
    for row in f:
        print(row.strip())

YEAR,MONTH,DAY,EXCH
2015,4,3,1.0990
2015,4,6,1.1008
2015,4,7,1.0850
2015,4,8,1.0818
2015,4,9,1.0671
2015,4,10,1.0598
2015,4,13,1.0582
2015,4,14,1.0672
2015,4,15,1.0596
2015,4,16,1.0742
2015,4,17,1.0780
2015,4,20,1.0763
2015,4,21,1.0758
2015,4,22,1.0729
2015,4,23,1.0803
2015,4,24,1.0876
2015,4,27,1.0892
2015,4,28,1.0979
2015,4,29,1.1174
2015,4,30,1.1162
2015,5,1,1.1194
2015,5,4,1.1145
2015,5,5,1.1174
2015,5,6,1.1345
2015,5,7,1.1283
2015,5,8,1.1241
2015,5,11,1.1142
2015,5,12,1.1240
2015,5,13,1.1372
2015,5,14,1.1368
2015,5,15,1.1428
2015,5,18,1.1354
2015,5,19,1.1151
2015,5,20,1.1079
2015,5,21,1.1126
2015,5,22,1.1033
2015,5,26,1.0876
2015,5,27,1.0888
2015,5,28,1.0914
2015,5,29,1.0994
2015,6,1,1.0913
2015,6,2,1.1130
2015,6,3,1.1285
2015,6,4,1.1271
2015,6,5,1.1108
2015,6,8,1.1232
2015,6,9,1.1284
2015,6,10,1.1307
2015,6,11,1.1236
2015,6,12,1.1278
2015,6,15,1.1266
2015,6,16,1.1238
2015,6,17,1.1244
2015,6,18,1.1404
2015,6,19,1.1335
2015,6,22,1.1378
2015,6,23,1.1190
2015,6,24,1.1178
2015,6,25,1.

Let's actually write a function to do all those steps for us:

In [50]:
def transform_file_batch(source_file, target_file):
    with open(source_file) as f:
        data = f.readlines()
        
    del data[0]
    
    with open(target_file, 'w') as f:
        f.write('YEAR,MONTH,DAY,EXCH\n')
        for row in data:
            f.write(transform_row_for_output(row))

In [51]:
transform_file_batch(source_file, target_file)

So this approach works just fine, but it has one real disadvantage: we are reading the entire file into memory (the `data` list), and then writing it back out.

But in reality, we don't need to load the entire file to process a single row - a better approach would be to read the source file one line at a time, and write out to the target file one line at a time.

Fortunately we have all the building blocks to do this very easily:

In [52]:
def transform_file(source_file, target_file):
    with open(source_file) as source:
        with open(target_file, 'w') as target:
            # need to skip first row in source file (headers)
            next(source)
            
            # write out header row
            target.write('YEAR,MONTH,DAY,EXCH\n')
            
            for row in source:
                target.write(transform_row_for_output(row))

Now let's run it:

In [53]:
transform_file(source_file, target_file)

And let's check the output results:

In [54]:
with open(target_file) as f:
    for row in f:
        print(row.strip())

YEAR,MONTH,DAY,EXCH
2015,4,3,1.0990
2015,4,6,1.1008
2015,4,7,1.0850
2015,4,8,1.0818
2015,4,9,1.0671
2015,4,10,1.0598
2015,4,13,1.0582
2015,4,14,1.0672
2015,4,15,1.0596
2015,4,16,1.0742
2015,4,17,1.0780
2015,4,20,1.0763
2015,4,21,1.0758
2015,4,22,1.0729
2015,4,23,1.0803
2015,4,24,1.0876
2015,4,27,1.0892
2015,4,28,1.0979
2015,4,29,1.1174
2015,4,30,1.1162
2015,5,1,1.1194
2015,5,4,1.1145
2015,5,5,1.1174
2015,5,6,1.1345
2015,5,7,1.1283
2015,5,8,1.1241
2015,5,11,1.1142
2015,5,12,1.1240
2015,5,13,1.1372
2015,5,14,1.1368
2015,5,15,1.1428
2015,5,18,1.1354
2015,5,19,1.1151
2015,5,20,1.1079
2015,5,21,1.1126
2015,5,22,1.1033
2015,5,26,1.0876
2015,5,27,1.0888
2015,5,28,1.0914
2015,5,29,1.0994
2015,6,1,1.0913
2015,6,2,1.1130
2015,6,3,1.1285
2015,6,4,1.1271
2015,6,5,1.1108
2015,6,8,1.1232
2015,6,9,1.1284
2015,6,10,1.1307
2015,6,11,1.1236
2015,6,12,1.1278
2015,6,15,1.1266
2015,6,16,1.1238
2015,6,17,1.1244
2015,6,18,1.1404
2015,6,19,1.1335
2015,6,22,1.1378
2015,6,23,1.1190
2015,6,24,1.1178
2015,6,25,1.
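
As a stylistic variation, the two nested `with` statements in `transform_file` can be combined into a single `with` statement that opens both files. The sketch below uses the hypothetical name `transform_file_flat`, repeats a condensed copy of the row transformation so that it runs on its own, and works on a tiny sample file created in a temporary directory rather than on `DEXUSEU.csv`.

```python
import os
import tempfile

def transform_row_for_output(row):
    # condensed copy of the final transformation (empty string for bad rates)
    dt_str, rate = row.strip().split(',')
    try:
        float(rate)
    except ValueError:
        return ''
    year, month, day = dt_str[:4], dt_str[5:7], dt_str[8:]
    return ','.join([year, str(int(month)), str(int(day)), rate]) + '\n'

def transform_file_flat(source_file, target_file):
    # both files are opened by one with statement and closed when it exits
    with open(source_file) as source, open(target_file, 'w') as target:
        next(source)  # skip the header row of the source file
        target.write('YEAR,MONTH,DAY,EXCH\n')
        for row in source:
            target.write(transform_row_for_output(row))

# Hypothetical sample files inside a fresh temporary directory
tmp_dir = tempfile.mkdtemp()
sample_source = os.path.join(tmp_dir, 'sample.csv')
sample_target = os.path.join(tmp_dir, 'sample_out.csv')

with open(sample_source, 'w') as f:
    f.write('DATE,DEXUSEU\n2015-04-03,1.0990\n2015-04-06,1.1008\n')

transform_file_flat(sample_source, sample_target)

with open(sample_target) as f:
    result = f.read()

print(result)
```

The behavior is the same as the nested version; the only difference is one less level of indentation.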

In future sections of this course we'll cover how to handle CSV files properly. Here we did not deal with quotes used to enclose text fields, or any of the other issues we may encounter when reading or writing CSV files. But this gives us a solid foundation for reading and writing text files in case we need to handle certain files specially (maybe badly formatted CSV data, which does happen, or some other proprietary data format).