# *DISCLAIMER*
## *Please, copy this notebook into your MA10276_workspace folder!*
*This is unfortunately necessary as the notebook will teach reading and writing from files. Since the course directory is read-only, this notebook will not work unless you move it to your MA10276_workspace folder.*


# Input and Output

So far, we have mainly been using the `print()` function to output text to the screen. However, in many applications it will be necessary to read data from input files and to write data to output files. In particular, plain text files can be used to store unformatted data in an operating system and platform independent manner. As such, many spreadsheets are stored in .csv (comma separated values) files. Furthermore, it is often useful to have more complicated data or even Python objects stored for common use between different programmes. This can be achieved using the JSON and Pickle formats.

### What you will learn

In this notebook we will cover the following topics:
* string handling
* linebreaks and tabulators
* input formatting
* reading and writing from/to files
* moving in files
* saving structured data with JSON and Pickle

*&#169; Tobias Hartung, University of Bath 2021-2022. This problem sheet is copyright of Tobias Hartung, University of Bath. It is provided exclusively for educational purposes at the University and is to be downloaded or copied for your private study only. Further distribution, e.g. by upload to external repositories, is prohibited.*

## Output formatting

Let us begin with output formatting. In most cases, you will be writing data to a file in a text format. That is, you will have to turn all your data into a string and then write it to the file. 

As an example, let us consider a small spreadsheet of the following form.

If we wish to save this data as a csv file, then each line in the file will be a row in the table and each column will be separated by a chosen delimiter. This is usually a comma, but could be any symbol. Other common symbols are semicolons, colons, and tabulators. The table saved in a .csv file would thus look like the text below.

In order to create very large strings containing a lot of data as well as linebreaks, you will often require more control over your output formatting than simply printing space-separated values like `print(x,y,z)` would do. Python provides you with multiple options to achieve your goals. 

### Manual string handling

The first option, which is the easiest to start with but also the most tedious, is manual string formatting by using string slicing and concatenation operations. Elements of a string can be addressed similarly to elements of a list. 

In [None]:
last_line = "5,25,125"
print(last_line[2:-4])

However, if we are thinking about reading from a .csv file, then we may not know exactly at which character index (here 2) the second entry (25) begins and ends. A common approach to process .csv files is to read them line by line and then split them into the constituents. Here we are interested in splitting the strings at each comma. We can use the `split()` method of a string to achieve this.

In [None]:
split_last_line = last_line.split(',')
print(split_last_line)

As you can see, the `split()` method here takes an argument, the string containing a comma `','`, and it returns a list of the substrings obtained by cutting the original string at every occurrence of that argument. Note also that the separating symbol no longer appears in the list of strings generated by `split()`. 
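
As a small sketch of the same idea with other inputs, the lines below split a semicolon-separated line and a line with an empty field. The variable names `semicolon_line` and `line_with_gap` are made up for this illustration.

```python
# A semicolon-separated line is split in exactly the same way
semicolon_line = "5;25;125"
semicolon_parts = semicolon_line.split(';')
print(semicolon_parts)

# An empty field between two delimiters is returned as an empty string
line_with_gap = "5,,125"
gap_parts = line_with_gap.split(',')
print(gap_parts)
```

The empty string in the second list is worth keeping in mind when converting the entries to numbers later on.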

If we wish to join strings together, then we can use the concatenation operation `+`. 

In [None]:
number1 = "5"
number2 = "25"
number3 = "125"
line = number1 + "," + number2 + "," + number3
print(line)

As you may imagine, this can become very labour intensive for many strings to be concatenated. For some automation, you can use the `join()` method of a string. It acts as the reverse operation to `split()` and concatenates a list of strings with a given separating symbol. For example, if we want to re-assemble the split list `['5', '25', '125']`, then we could write

In [None]:
joined_last_line = ",".join(split_last_line)
print(joined_last_line)

As you can see, if we had thousands of columns, this `join()` method can easily create an entire row of the spreadsheet from a list containing the data that should go into that row. 
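
The following is a minimal sketch of how `join()` might be used for a whole table rather than a single row. The data and the names `rows`, `csv_lines`, and `csv_text` are illustrative only. Note that `join()` only accepts strings, so the numbers are converted with `str()` first, and that the rows are joined with the linebreak symbol `\n` which is discussed later in this notebook.

```python
# Build a small table of numbers, squares and cubes (illustrative data)
rows = [[i, i**2, i**3] for i in range(1, 4)]

# join() needs strings, so convert every entry with str() first
csv_lines = [",".join(str(value) for value in row) for row in rows]

# Join the rows themselves with the linebreak symbol "\n"
csv_text = "\n".join(csv_lines)
print(csv_text)
```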

If you want to make sure that the .csv file (or even output printed on screen) is more readable for humans, a useful manual formatting option is justification: `rjust(n)` right-justifies in a block of size `n`, `ljust(n)` left-justifies, and `center(n)` centers text. 

In [None]:
for x in range(1, 11):
    print(repr(x).ljust(5), repr(x*x).center(5), end=' ')
    # Note use of 'end' on previous line
    print(repr(x*x*x).rjust(5))

Note the use of the `end` parameter in the first `print()` statement. Usually `print()` ends with a linebreak. If you don't want this, then you can set its behaviour in this way. Here it inserts a single space, just like the comma does between the first two arguments that are printed. As such, the second `print()` does not start a new line but prints the cubes just like they would have been printed, had we written 

`print(repr(x).ljust(5), repr(x*x).center(5),repr(x*x*x).rjust(5))`

The use of `repr()` is also interesting to note here. The justification methods are string methods, so they only work on printable (string) representations of Python objects. The `repr()` function returns exactly this. So if you are unsure how to turn a complicated object into something printable, use `repr()`.
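
As a short sketch of why the conversion is needed: integers do not have the justification methods, whereas their string representation does. The variable names below are illustrative. For strings, `repr()` additionally includes the quotes, which `str()` does not.

```python
x = 42

# Integers have no rjust/ljust/center methods, so we convert to a string first
has_rjust = hasattr(x, 'rjust')
print(has_rjust)

padded = repr(x).rjust(6)
print(repr(padded))   # repr() makes the padding spaces visible

# For strings, repr() adds quotes while str() does not
text = 'abc'
print(str(text), repr(text))
```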

For numeric strings, you can also use `zfill()` to pad with leading zeros. It does understand plus and minus.

In [None]:
print('42'.zfill(5))
print('-42'.zfill(5))

### Formatted string literals

While manual string formatting is powerful, it can make code difficult to read. Formatted string literals (also known as f-strings for short) often allow for greater format control while keeping the code understandable. You can define an f-string by writing `f` or `F` in front of your string definition. 

In [None]:
import math                                                 
# importing math library to get pi
# next: inserting pi with defined number of digits directly into the string
fstr = f'The value of pi is approximately {math.pi:.3f}.'   
print(fstr)

As you can see, an f-string allows you to write expressions (here `math.pi`) directly into the string and pass on formatting options as well (here `.3f` means three digits after the decimal point). Passing an integer after `:` forces the formatted expression to be at least that many characters wide. 

In [None]:
fstr2 = f'The value of pi is approximately {math.pi:5}.'
fstr3 = f'The value of pi is approximately {math.pi:50}.'
print(fstr2)
print(fstr3)

The example above shows that all 16 digits of pi were printed (17 characters including the decimal point), since `math.pi` has 16 digit precision and no precision was specified. This is independent of us passing the integer 5 or 50. But when passing 50, the missing 33 symbols to reach a length of 50 were added as empty spaces in front of the number. In general, if you pass a number such as `math.pi`, the format you want to use is `{value:width.precision}`, although many options exist here. This type of formatting is part of Python's lexical analysis. If you are interested in more detail, you can read all about it in the Python documentation.

https://docs.python.org/3/reference/lexical_analysis.html#f-strings
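
A brief sketch of combining width and precision is given below. It keeps the `f` (fixed-point) type from the earlier example; the variable names are illustrative. The second line shows that the width is only a minimum: a number that needs more characters is not cut off.

```python
import math

# width 10, precision 3: right-aligned in a block of 10 characters
padded_pi = f'{math.pi:10.3f}'
# width 2 is smaller than the formatted number, so no padding is added
unpadded_pi = f'{math.pi:2.5f}'

print(repr(padded_pi))    # repr() makes the padding spaces visible
print(repr(unpadded_pi))
```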

This type of formatting becomes important if you wish to create human-readable tables.

In [None]:
number = 'Number'
square = 'Square'
cube = 'Cube'
print(f'{number:6} ==> {square:6} ==> {cube:4}')
for i in range(1,11):
    number = i
    square = i**2
    cube = i**3
    print(f'{number:6d} ==> {square:6d} ==> {cube:4d}')

### The string `format()` method

The third common formatting option is to use the `format()` method of a string. Its basic use is

In [None]:
print('{} squared is {}.'.format(5,25))

However, for nicer formatting, you can also pass arguments to the braces in the string. 

In [None]:
print('{0} squared is {1}.'.format(5,25))
print('{1} squared is {0}.'.format(25,5))
print('{number} squared is {square}.'.format(number=5,square=25))

It should be noted that positional and keyword formatting can be combined. 

In [None]:
print('The square of {} is {}, its cube is {cube}, and if we take the fourth power it\'s {}.'.format(2,4,16,cube=8))

Note the use of `\'`. This is called "escaping a character". Here the `'` symbol would denote the end of the string, but we wanted it as the symbol itself inside the string. By adding the backslash, we are telling Python "I know you want to interpret the next character as something that has meaning to you. I don't want you to do this." This is the usual behaviour of an escape character. So if you wanted to have a backslash in your string, then you would need to escape the backslash symbol and write `\\`. 

In [None]:
print('\\')

It should also be noted that a few characters such as r, n, and t, change meaning when escaped. We will discuss them below. But the general rule of thumb is, don't escape a letter that would be interpreted correctly if not escaped. This way you will not accidentally change what you are printing because you accidentally escaped a letter that now has different behaviour than printing the letter itself.
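
To illustrate the difference, here is a small sketch comparing an escaped `n` with an escaped backslash followed by an ordinary `n`. The variable names are chosen for this example only.

```python
# Escaped n: a single linebreak character
with_linebreak = 'a\nb'
# Escaped backslash followed by n: two ordinary characters
with_backslash = 'a\\nb'

print(with_linebreak)
print(with_backslash)

# The lengths differ because '\n' counts as one character
print(len(with_linebreak), len(with_backslash))
```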

Returning to `.format()`, it is also sometimes necessary to reference variables by name rather than position. This can be done through dictionaries and using brackets `[]`.

In [None]:
table = {'number': 5, 'square': 25, 'cube': 125}
print('Number: {0[number]:d}; Square: {0[square]:d}; Cube: {0[cube]:d}'.format(table))

Note the `0` refers to the zero-indexed element in the list of things given to `format()`, i.e., `0[number]` calls `table['number']`.

You can also pass the entire dictionary as keyword arguments (which is often the neater alternative). 

In [None]:
print('Number: {number:d}; Square: {square:d}; Cube: {cube:d}'.format(**table))

This is particularly useful if you combine formatting with the built-in function `vars()` which returns a dictionary of all local variables. 
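
A minimal sketch of this combination could look as follows. The function `describe()` is hypothetical and only serves to create a few local variables; inside it, `vars()` returns those variables as a dictionary which is then unpacked into `format()`.

```python
def describe(number):
    '''Returns a formatted line for a number (illustrative helper)'''
    square = number**2
    cube = number**3
    # vars() without arguments returns the local variables as a dictionary
    return 'Number: {number:d}; Square: {square:d}; Cube: {cube:d}'.format(**vars())

print(describe(5))
```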

Finally, this can also be used to generate tidily aligned tables. 

In [None]:
for x in range(1, 11):
    print('{0:2d} {1:3d} {2:4d}'.format(x, x*x, x**3))

### Linebreaks and tabulators

There is a final important aspect of formatting (which we will need to create a .csv file), and that is linebreaks. End of Line, End of File, and similar markers are objects in files that are commonly not printed but are important for the correct displaying and processing of files. The two markers you will encounter most frequently are `\r` and `\n`. They denote "carriage return" `\r` and "new line" `\n`. Think about the operation of a typewriter: hit enter to move a line down `\n` and push the paper back to the beginning of the line `\r`. Since computers don't have carriages, `\r` is mainly redundant. However, `\n` is a very common new line symbol. You can see how important these symbols are by looking at ASCII. `\n` is symbol number 10 in ASCII and `\r` is symbol number 13. Unfortunately, because `\r` is a leftover from typewriters, its use is operating system dependent. 

- Unix and all Unix-like systems (including Mac OS X) only use `\n` for end-of-line; `\r` means nothing special. As a consequence, many programming languages copy this convention and will adjust to/from operating system specific end-of-line sequences when needed. 
- Older Mac systems (pre OS X) use `\r` as end-of-line, and `\n` means nothing special.
- Windows and many old operating systems use `\r\n`, in this order. This is because electromechanical teletype-like terminals use `\r` to command the carriage back to the leftmost stop (a slow operation) and `\n` to roll the roller up one line (a fast operation). By using `\r\n` the roller can move up while the carriage is still moving left. 
- On character-mode terminals (typically emulating printing terminals), `\r` and `\n` act as "move cursor all the way left" and "move cursor down" respectively, i.e., both are necessary even though there is no carriage or roller. 

In practice, this means you should use `\n` to force a new line and expect the underlying runtime to handle any weird operating systems (Windows). However, when it comes to reading from files, you will encounter `\r` depending on the operating system. 
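
The short sketch below checks the two ASCII codes and shows what a line ending in `\r\n` looks like as a Python string. The string `windows_line` is a made-up example, and `strip()` (discussed in the section on input formatting) is used to remove both markers.

```python
# ASCII codes of the two end-of-line characters
newline_code = ord('\n')
carriage_return_code = ord('\r')
print(newline_code, carriage_return_code)

# A line as it might look with a Windows-style line ending (illustrative)
windows_line = '5,25,125\r\n'
print(repr(windows_line))

# strip() removes both markers at the end of the line
cleaned_line = windows_line.strip()
print(repr(cleaned_line))
```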

Another common symbol of this type is `\t` which adds a tabulator. 

In [None]:
print('This is some text with linebreaks.\nThis is a new line.\n\nAnd above is an empty line.\tThis was a tabulator.')

## Input formatting

Technically, we have all we need (with the exception of actually handling a file) to store data in a .csv file. However, before we discuss file handling, let us quickly look at input formatting, i.e., how to turn the data received from a plain text file into the format that we need them in. 

Usually, we obtain input data in string format. This means, we want to process any incoming data into the data types we want them to be in. To this end, we can use the built-in functions `int()`, `float()`, `list()`, `set()`, and so on. 

In [None]:
a = int("5")
b = float("1e-3")
c = list("12345")
d = set("1223")
print(a)
print(b)
print(c)
print(d) # note the unordered nature of a set

However, empty spaces and end-of-line markers can cause problems during these conversions. To remove any of these, you can use the `strip()` method.

In [None]:
in_string = "   text \n"
print(in_string)
print(in_string)

stripped_string = in_string.strip()

print(stripped_string)
print(stripped_string)

Note how the leading and trailing empty spaces and the linebreak were removed. We can also strip other symbols by explicitly listing all symbols we want to strip. By default, `strip()` removes all whitespace characters, which includes empty spaces, tabulators, `\r`, and `\n`, as well as a few rarely used separator characters which we haven't discussed.

In [None]:
new_in_string = "..a.aa..rea...asd...42..43...adfaae.ad."
new_stripped_string = new_in_string.strip('.aersdf')
print(new_stripped_string)

Note that, although we said to strip dots, it only removed them from the beginning and end of the string. 

Furthermore, explicitly defining what should be stripped will overwrite the default behaviour. So, if we had a `\n` at the end, then `\n` and everything between it and the 3 would remain in the string. 

In [None]:
new_in_string = "..a.aa..rea...asd...42..43...adfaae.ad.\n"
new_stripped_string = new_in_string.strip('.aersdf')
print(new_stripped_string)
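
If both the default whitespace and the listed symbols should be removed, two possible approaches are sketched below: chaining two calls to `strip()`, or adding `\n` to the symbols to be stripped. The variable names `fully_stripped` and `also_stripped` are illustrative.

```python
new_in_string = "..a.aa..rea...asd...42..43...adfaae.ad.\n"

# First strip whitespace (including the trailing "\n"), then the listed symbols
fully_stripped = new_in_string.strip().strip('.aersdf')
print(fully_stripped)

# Alternatively, add "\n" to the set of symbols to strip
also_stripped = new_in_string.strip('.aersdf\n')
print(also_stripped)
```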

## Reading and Writing files

Files are handled as objects in Python. They can be opened with `open(filename, mode)` where the operating mode `mode` has various options. These are
- `"r"` for read-only mode
- `"w"` for write-only (this will replace the file if it has content)
- `"a"` for appending to the existing file
- `"r+"` and `"w+"` for read and write (note that `"w+"`, like `"w"`, empties an existing file when opening it)
- `"a+"` for read and append
- `"rb"`, `"rb+"`, `"wb"`, `"wb+"`, `"ab"`, `"ab+"` do exactly the same as their corresponding versions without the `b`, but with the `b` the file is treated in binary format.

We can then write into a file using the `write()` method and close the file with `close()`.

In [None]:
f = open("test_file","w")
f.write("Number,Square,Cube")
f.close()

This code has generated a new file called test_file and written the text `Number,Square,Cube` into it. It has then closed the file. 

##### WARNING: Calling `f.write()` without closing the file can result in some of the data not being written to disk! This can happen even if the entire program has exited successfully. 

It is therefore prudent to code such that the file will be closed automatically.

In [None]:
with open("test_file","a") as f2:
    f2.write("\n1,1,1")
# We can check that the file is indeed closed now
print(f2.closed)

Note that in Python, most objects that need to be closed have their own context manager. Context managers are important for dealing with large and complicated data or structures because they allow you to allocate and release resources precisely when you want to and they provide a lot of functionality regarding exception handling. Until you progress much further into your Python career, files are likely the only data structures you will encounter that have a context manager (or at least where you want to make use of the context manager). As such, we will not discuss them in detail here, but more information can be found in the Python documentation.

https://docs.python.org/3/library/contextlib.html

The key message here is that the syntax `with ... as ...:` allows you to make use of that context manager without having to think about the specifics. Use it! 
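
Roughly speaking, for files the `with` statement behaves like the `try`/`finally` pattern sketched below, which closes the file even if writing fails. This is a simplified illustration rather than the exact mechanism. The file name and the use of a temporary directory are assumptions made so that the sketch can run anywhere.

```python
import os
import tempfile

# Hypothetical file location in a temporary directory (assumption for this sketch)
path = os.path.join(tempfile.mkdtemp(), "try_finally_test_file")

f = open(path, "w")
try:
    f.write("Number,Square,Cube")
finally:
    # This line runs even if write() raises an exception
    f.close()

print(f.closed)
```

The `with` version is shorter and harder to get wrong, which is why it is preferred.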

In order to read a file, we have multiple options. We may read the entire file at once with `read()`, read a single line (until the next `\n`) with `readline()`, or read all lines with `readlines()`. 

In [None]:
with open("test_file","r") as f:
    print(f.read())
    
print("\n\n") # adding some space in the output

with open("test_file","r") as f:
    print(f.readline())
    print(f.readline())
    
print("\n\n") # adding some space in the output

with open("test_file","r") as f:
    print(f.readlines())


Note the additional empty line between the outputs of the two `print(f.readline())` statements. It is there because there is a `\n` at the end of the `Number,Square,Cube\n` line and `print()` inserts another `\n` unless you stop it from doing so, e.g. by writing `print(f.readline(),end='')`.

`read()` and `readline()` have an optional argument `size` which is used like `f.read(size)`. For `read()`, if this argument is omitted (like in the example above), negative, or `None`, then the entire document will be read; for `readline()`, omitting it reads up to and including the next `\n`. If the file is larger than your computer's memory, that is your problem to deal with (after you have successfully restarted the machine). Passing a positive integer `n` will cause at most `n` characters (in "normal" mode) or `n` bytes (in binary mode) to be read. If the end of the file is reached, then `f.read()` will return the empty string `''`.

In [None]:
with open("test_file","r") as f:
    print(f.read(5))
    print(f.read(5))
    print(f.read(5))
    print(f.read(5))
    print(f.read(5))
    print(f.read(5))
    print(f.read(5))
    print(f.read(5))
    print(f.read(5))
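
Since `read()` returns the empty string at the end of the file, the repeated calls above can also be written as a loop. The following is a minimal sketch. It first writes its own small file into a temporary directory (an assumption made so the example is self-contained) and then reads it in chunks of five characters.

```python
import os
import tempfile

# Hypothetical file location in a temporary directory (assumption for this sketch)
path = os.path.join(tempfile.mkdtemp(), "chunk_test_file")
with open(path, "w") as f:
    f.write("Number,Square,Cube\n1,1,1")

chunks = []
with open(path, "r") as f:
    while True:
        chunk = f.read(5)      # read at most 5 characters
        if chunk == '':        # empty string: end of file reached
            break
        chunks.append(chunk)

print(chunks)
```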

When working with binary files, it is often prudent to enforce binary strings. This can be achieved by prefixing the string with a lower case `b`. 

In [None]:
with open("binary_test_file","wb") as f:
    f.write(b'0123456789abcdef')

Note that this will cause an error if you are trying to write a binary string into a file that is not binary. 
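
The sketch below provokes this error on purpose and catches it. It also tries the opposite combination, writing a normal string into a binary file, which fails in the same way. The file names and the temporary directory are assumptions made for this illustration.

```python
import os
import tempfile

# Hypothetical file locations in a temporary directory (assumption for this sketch)
folder = tempfile.mkdtemp()

# Writing a binary string into a file opened in text mode fails
with open(os.path.join(folder, "text_file"), "w") as f:
    try:
        f.write(b'0123456789abcdef')
        bytes_into_text_ok = True
    except TypeError as error:
        bytes_into_text_ok = False
        print("TypeError:", error)

# Writing a normal string into a file opened in binary mode fails as well
with open(os.path.join(folder, "binary_file"), "wb") as f:
    try:
        f.write('0123456789abcdef')
        text_into_binary_ok = True
    except TypeError as error:
        text_into_binary_ok = False
        print("TypeError:", error)
```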

### Writing and reading a .csv file

With everything we have, we can now write a .csv file containing squares and cubes. 

In [None]:
with open('squares_cubes.csv','w') as f: # open the file in write-mode with context manager
    f.write('Number,Square,Cube')        # insert header 
    for i in range(1,11):                # loop over numbers to be written into the .csv
        f.write(f'\n{i},{i**2},{i**3}')  # note the linebreak to indicate a new row, and the use of f-strings

We have now written a .csv file. Next, let us load the file, split the data, and convert all numeric strings to numbers. Finally, we will load them into a numpy array and print it.

In [None]:
import numpy as np                       # import numpy
numbers = []                             # initialise list with data to be added
with open('squares_cubes.csv','r') as f: # open the file in read-mode with context manager
    f.readline()                         # first line is the header, we can discard it
    for line in f.readlines():           # loop over the remaining lines in the file
        # append the three numbers i, i**2, i**3 as a list
        # use comprehension to strip and split the line at the commas, and turn each value into an integer
        numbers.append([int(x) for x in line.strip().split(',')])
        
numbers_array = np.array(numbers) # turn it into a numpy array
print(numbers_array)

## Saving structured data with JSON and Pickle

JSON (JavaScript Object Notation) is a commonly used format for modern data exchange. As such, it is a good starting point for interoperability.

The file reading and writing operations we have seen so far can only deal with string and binary data. If we want to read/write numbers, we need to format them as strings when writing and convert them from string to float/int when reading. If you want to store more complicated data such as nested lists and dictionaries, this becomes much more difficult very quickly. The JSON format automates this process. Writing with JSON is called `dump()`, reading is called `load()`.

In [None]:
import json

x = [1,"two",[3],{'4':5}]

with open("json_test_file","w") as f:
    json.dump(x,f)

with open("json_test_file","r") as f2:
    y = json.load(f2)
    
print(y)

JSON works by "serialising" the data, that is, by turning it into a string and writing it into a file. As such, JSON does exactly the same thing that we have discussed above, except JSON uses a very specific format. This format is useful when storing data that is to be used by another Python programme. In fact, the "O" in "JSON" stands for "Object" which indicates that a large class of objects can be stored using this method. 
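
To look at the string that JSON produces, one can use `json.dumps()` and `json.loads()`, which work like `dump()` and `load()` but with a string instead of a file. A minimal sketch, reusing the list from the example above with illustrative variable names:

```python
import json

x = [1, "two", [3], {'4': 5}]

# dumps() returns the serialised data as a string instead of writing to a file
serialised = json.dumps(x)
print(type(serialised))
print(serialised)

# loads() turns the string back into Python objects
restored = json.loads(serialised)
print(restored == x)
```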

It should be noted that within Python (and many other programming languages) *everything* is an object. With that in mind, it is reasonable to assume that everything can be stored using JSON. However, this is unfortunately not true because there are some limitations based on how JSON "serialises" data. As such, not all data (e.g., functions, classes, sets, ...) can be serialised by JSON. To handle these structures, pickle is a good alternative.
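
The following sketch illustrates two of these limitations. Serialising a set raises a `TypeError`, and some data changes its type on the way through JSON: tuples come back as lists and integer dictionary keys come back as strings. The variable names are chosen for this example only.

```python
import json

# Sets cannot be serialised by JSON
try:
    json.dumps({1, 2, 3})
    set_serialised = True
except TypeError as error:
    set_serialised = False
    print("TypeError:", error)

# Tuples become lists, and non-string keys become strings
original = {'point': (1, 2), 3: 'three'}
round_trip = json.loads(json.dumps(original))
print(round_trip)
print(round_trip == original)
```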

##### WARNING: Pickle can store executable code. As such, a virus or other malicious software can be pickled and distributed. Only unpickle data you trust! If you use pickle and must ensure that the data is not tampered with, consider signing your data with the keyed-hashing for message authentication library hmac. 
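
One possible way of signing pickled data with `hmac` is sketched below. It uses `pickle.dumps()` and `pickle.loads()`, which work on bytes instead of files. The secret key, the data, and the helper `load_if_untampered()` are hypothetical and only serve as an illustration; this is a sketch under these assumptions, not a complete security solution.

```python
import hashlib
import hmac
import pickle

# Hypothetical shared secret; in practice it must be kept private
secret_key = b'replace-with-a-real-secret'

data = {'numbers': {1, 2, 3}, 'label': 'squares'}
payload = pickle.dumps(data)

# Compute a signature over the pickled bytes
signature = hmac.new(secret_key, payload, hashlib.sha256).digest()

def load_if_untampered(payload, signature):
    '''Unpickles payload only if its signature matches (illustrative helper)'''
    expected = hmac.new(secret_key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        raise ValueError('Signature mismatch: data may have been tampered with')
    return pickle.loads(payload)

print(load_if_untampered(payload, signature))
```

The important design choice is that the signature is checked *before* `pickle.loads()` is called, so that data with a wrong signature is never unpickled.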

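The basic operation of pickle is identical to JSON. The main difference is that JSON uses text files whereas pickle uses binary files. Note that pickle does not store a copy of the code of a function or class. Instead, it stores a reference (the module and the name under which the definition can be found) and looks the definition up again when loading. The definition must therefore be available in the programme that loads the file. 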

In [None]:
import pickle

def square(x):
    return x*x

with open("pickle_test_file","wb") as f:
    pickle.dump(square,f)

with open("pickle_test_file","rb") as f2:
    func = pickle.load(f2)
   
print(func)
print(func(2))


Of course, saving and loading just a single function is generally not what we want to use pickle for. Instead, we tend to use it on larger structures such as classes. Since classes contain methods, they cannot be serialised via JSON. If we thus consider the `LinearFunction` class from week 5, we could not save it with JSON, but we can use pickle. 

In [None]:
# copy class definition from week 5
class LinearFunction(object):
    '''Linear functions on R'''
    
    def to_str(g):
        '''Returns a string representation of a LinearFunction'''
        return str(g.a) + 'x + ' + str(g.b)
    
    def evaluate(g, x):
        '''Evaluates the linear function g at the point x and returns g(x)'''
        return g.a*x + g.b
    
    def add(f, g):
        '''Returns the sum of two LinearFunctions'''
        r = LinearFunction()
        r.a = f.a + g.a
        r.b = f.b + g.b
        return r

    def random():
        '''Returns a random LinearFunction'''
        g = LinearFunction()
        g.a = np.random.random()
        g.b = np.random.random()
        return g    
    
    def scale(f, c):
        '''Scales the LinearFunction by a factor c'''
        f.a = c*f.a
        f.b = c*f.b
    
    def random_list(n):
        '''Returns a list of n random LinearFunctions'''
        lst = []
        for i in range(n):
            lst.append(LinearFunction.random())    
        return lst
    
# save class with pickle
with open("pickle_class_file","wb") as f:
    pickle.dump(LinearFunction,f)

# load it under a different name; pickle finds a class by its name, so this is the same class as defined above
with open("pickle_class_file","rb") as f2:
    LF_class = pickle.load(f2)
   

print(LF_class)                              # show the loaded class
lin_func = LF_class()                        # define an element of the loaded class
print(isinstance(lin_func,LinearFunction))   # show the element is instance of initially defined class
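
The `True` printed above suggests that the loaded class is the very class defined in this notebook rather than an independent copy. To see what ends up in a pickle for a function, one can inspect the pickled bytes. The sketch below uses `math.sqrt` purely for illustration: the bytes contain the module name and the function name, and loading them returns the original function object.

```python
import math
import pickle

# Pickle a function from the standard library (chosen only for illustration)
pickled_sqrt = pickle.dumps(math.sqrt)
print(pickled_sqrt)   # the bytes contain the names 'math' and 'sqrt'

# Loading looks the function up again by module and name
restored_sqrt = pickle.loads(pickled_sqrt)
print(restored_sqrt is math.sqrt)
print(restored_sqrt(16))
```

In practice this means that a pickled function or class can only be loaded by a programme in which its definition can be found under the same name.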


# Check your understanding

##### Question 1
In separated value files such as .csv, what does the first row in the file typically contain?
```
A The author of the table data
B The column names of the data
C The source of the data
D Notes about the table data
```

##### Question 2
In a .csv file, how do you separate columns? How do you separate rows?

##### Question 3
Given the file `numbers.txt`, which of the following is the correct way to open the file exclusively for reading as a text file? 
```
A open('numbers.txt')
B open('numbers.txt', 'w')
C open('numbers.txt', 'r')
D open('numbers.txt', 'rb')
E open('numbers.txt', 'wb')
```

##### Question 4
Whenever possible, what is the recommended way to ensure that a file object is properly closed after usage?
```
A By using the with statement
B By using the try/finally block
C Making sure that you use the .close() method before the end of the script
D It doesn’t matter
```

##### Question 5
When reading a file using the file object, what method is best for reading the entire file into a single string?
```
A .readline()
B .readlines()
C .read_file_to_str()
D .read()
```

```























```

# Answers
Q1: B
Q2: columns with a comma `,` and rows with a newline symbol `\n`
Q3: C
Q4: A
Q5: D