# Reading and Writing files

We have already talked about Python's built-in types, but there are more that we have not covered. One of these is the file object, which can be used to read and write files.

## Reading (text) files

Let's try and get the contents of the file into IPython. We start off by creating a file object:

In [None]:
f = open('data/data.txt', 'r')

The ``open`` function is taking the [data/data.txt](data/data.txt) file, opening it, and returning an object (which we call ``f``) that can then be used to access the data.

Note that ``f`` is not the data in the file; it is what is called a *file handle*, which points to the file:

In [None]:
type(f)

Now, simply type:

In [None]:
f.read()

The ``read()`` method reads the whole file and returns its contents as a single string.

Let's try this again:

In [None]:
f.read()

What's happened? We read the file, and the file 'pointer' is now sitting at the end of the file, and there is nothing left to read.
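The file pointer behaviour described above can be sketched in a self-contained way. The file name `pointer_demo.txt` and its contents are made up for this illustration; `seek(0)` is the standard file method that moves the pointer back to the start.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on data/data.txt
# (the name and contents are invented for illustration).
path = os.path.join(tempfile.mkdtemp(), 'pointer_demo.txt')
f = open(path, 'w')
f.write('first line\nsecond line\n')
f.close()

f = open(path, 'r')
first_read = f.read()    # reads everything; the pointer is now at the end
second_read = f.read()   # nothing left to read, so this is an empty string
f.seek(0)                # move the pointer back to the start of the file
third_read = f.read()    # the full contents are available again
f.close()

print(repr(first_read))
print(repr(second_read))
print(repr(third_read))
```

In practice it is usually simpler to store the result of the first `read()` in a variable than to rewind the file.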

To close the file handle, you can do:

In [None]:
f.close()

Let's now try and do something more useful, and capture the contents of the file in a string:

In [None]:
f = open('data/data.txt', 'r')
data = f.read()
f.close()

Now ``data`` should contain a string with the contents of the file:

In [None]:
data

But what we'd really like to do is read the file line by line. There are **several ways** to do this, the simplest of which is to use a ``for`` loop in the following way:

In [None]:
f = open('data/data.txt', 'r')
for line in f:
    print(repr(line))

f.close()

or you can use `readlines()`: it simply reads the entire file, one line at a time, and returns a list, where each item is a single line.

In [None]:
f = open('data/data.txt', 'r')

lines = f.readlines()
f.close()

In [None]:
len(lines)


In [None]:
lines

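The same kind of code is often written as ``with open('data/data.txt', 'r') as f:``, a form covered in the section on the with-statement below. The syntax works as follows: the `open()` function is run, and its return value is assigned to the variable `f`. The indented code block that follows can then use this `f` file object. As soon as the indented code is finished, the file is automatically closed.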


If you are curious about the meaning of the values: the first two columns are astronomical coordinates that identify the objects' positions on the sky. More here: http://curious.astro.cornell.edu/about-us/112-observational-astronomy/stargazing/technical-questions/699-what-are-ra-and-dec-intermediate

![alt text](astronomy.png) 


The third column is an identifier for the object (star, galaxy) and the fourth is the object's "magnitude", which in astronomy is a way to describe luminosity. More here: https://en.wikipedia.org/wiki/Magnitude_(astronomy)

![alt text](magnitude.jpg)

As shown earlier, a very Pythonic way to read the lines of a file is to loop over them:

In [None]:
f = open('data/data.txt', 'r')
for line in f:
    print(repr(line))

f.close()

In [None]:
repr?

Note that we are using ``repr()`` to show any invisible characters (this will be useful in a minute). Also note that we are now looping over a file rather than a list, and this automatically reads in the next line at each iteration. Each line is returned as a string. Notice the ``\n`` at the end of each line: this is the newline character, which indicates the end of a line.

Now that we are reading the file line by line, it would be nice to get some values out of it. Let's examine the last line in detail. If we just type ``line`` we should see the last line that was printed in the loop:

In [None]:
line

We can first get rid of the ``\n`` character with:

In [None]:
line = line.strip()

In [None]:
line

Next, we can use what we learned about strings and lists to do:

In [None]:
columns = line.split()

In [None]:
columns

Finally, let's say we care about the object name (the 2MASS column) and the J band magnitude (the Jmag column):

In [None]:
name = columns[2]
jmag = columns[3]

In [None]:
name

In [None]:
jmag

Note that ``jmag`` is a string, but if we want a floating point number, we can instead do:

In [None]:
jmag = float(columns[3])

In [None]:
jmag
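The strip, split and convert steps above can be collected into one small function. This is only a sketch: the helper name `parse_line` and the sample line are invented for illustration and are not taken from `data/data.txt`; the column positions follow the text above.

```python
def parse_line(line):
    """Return (name, jmag) from one whitespace-separated data line."""
    columns = line.strip().split()   # drop the trailing newline, then split on whitespace
    name = columns[2]                # third column: object identifier (kept as a string)
    jmag = float(columns[3])         # fourth column: magnitude, converted to a float
    return name, jmag


# A made-up line with the same four-column layout as described above
sample = '10.68 41.27 00424433+4116085 9.45\n'
name, jmag = parse_line(sample)
print(name, jmag)
```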

One last piece of information we need about files is how we can read a single line. This is done using:

    line = f.readline()
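As a small sketch of how ``readline()`` behaves, the example below uses `io.StringIO` from the standard library as a stand-in for a real file (an assumption made so the example is self-contained; the contents are made up). Each call returns the next line, and an empty string signals the end of the file.

```python
import io

# io.StringIO behaves like an open text file, which avoids needing a file on disk
f = io.StringIO('# header\n1 2\n3 4\n')

header = f.readline()   # first call returns the first line
first = f.readline()    # each further call returns the next line
second = f.readline()
end = f.readline()      # at the end of the file, readline() returns ''

print(repr(header), repr(first), repr(second), repr(end))
```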

We can put all this together to write a little script to read the data from the file and display the columns we care about on the screen! Here it is:

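In [None]:
# Open file
f = open('data/data.txt', 'r')

# Read and ignore header lines
# (assumption: the file starts with three header lines; adjust to match your file)
for _ in range(3):
    f.readline()

# Loop over lines and extract variables of interest
for line in f:
    columns = line.strip().split()
    name = columns[2]
    jmag = float(columns[3])
    print(name, jmag)

f.close()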

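Python also provides a way to read each line in a text file as a separate string: ``readlines()`` returns a list of lines. It can be called in the same statement, immediately after opening the file, for example ``lines = open('data/data.txt', 'r').readlines()``. Note that in this form there is no variable left to call ``close()`` on.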

## Writing files

To open a file for writing, use:

In [None]:
f = open('data_new.txt', 'w')

Then simply use ``f.write()`` to write any content to the file, for example:

In [None]:
f.write("Hello, World! - second try\n")
#f.close()

If you want to write multiple lines, you can either give a list of strings to the ``writelines()`` method:

In [None]:
f.writelines(['roof\n', 'tile\n', 'roof\n'])

or you can write them as a single string:

In [None]:
f.write('roof\ntile\nroof\n')

Once you have finished writing data to a file, you need to close it:

In [None]:
f.close()

(this also applies to reading files)
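``f.write()`` only accepts strings, so numbers have to be formatted first. The sketch below writes two made-up columns to a throwaway file and reads them back; the file name, the values and the two-decimal format are assumptions chosen for illustration.

```python
import os
import tempfile

# Made-up numeric data to write as two columns
times = [0.0, 0.5, 1.0]
values = [1.2, 3.4, 5.6]

path = os.path.join(tempfile.mkdtemp(), 'columns_demo.txt')

f = open(path, 'w')
f.write('# time value\n')              # header line
for t, v in zip(times, values):
    f.write(f'{t:.2f} {v:.2f}\n')      # format each number as text, one row per line
f.close()

# Read the file back to check what was written
f = open(path, 'r')
written = f.read()
f.close()

print(written)
```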

## The with-statement

As we have seen above, files must not only be opened but should also be properly closed afterwards, to make sure their contents are actually written before they are used somewhere else. Python buffers writes to files to minimize actual writing to disk, which is comparatively slow. Closing a file ensures that these changes are actually written.

To avoid forgetting to close a file there is the with-statement.

In [None]:
with open('data/data_new.txt', 'w') as f:
    f.write('spam\n')


This opens the specified file, binds the file object to ``f``, and closes the file when the with-block ends. Afterwards, the file is properly closed and ``f`` can no longer be used to read or write.
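A quick way to check this behaviour is the ``closed`` attribute of a file object. The sketch below uses a throwaway file (the name is made up) and also shows that the file is closed when the block is left because of an error; the `ValueError` is raised deliberately for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')

with open(path, 'w') as f:
    f.write('spam\n')
    closed_inside = f.closed     # False: the file is still open inside the block

closed_after = f.closed          # True: the block has ended, so the file was closed

# The file is also closed if the block is left because of an exception
try:
    with open(path, 'r') as g:
        raise ValueError('simulated problem while reading')
except ValueError:
    pass

closed_after_error = g.closed    # True as well

print(closed_inside, closed_after, closed_after_error)
```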

What's behind the ``with`` is called a “context manager”. This is a far more general concept, which tends to be useful whenever you need to maintain “external invariants”, which is jargon for “cleaning up after yourself”. For instance, you can use context managers to reliably remove temporary files, log out from services that have some session management, stop background jobs you started, and even reset things within your program. If you come back here later, you'll understand this pointer: check out the ``contextlib`` module.
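To give a flavour of the idea, here is a minimal sketch of a home-made context manager built with `contextlib.contextmanager` from the standard library. The function `logged_session` and the `events` list are hypothetical names invented for this illustration; no real service is contacted.

```python
from contextlib import contextmanager

events = []   # records what happened, in order


@contextmanager
def logged_session(name):
    """Pretend to log in, hand control to the with-block, then always log out."""
    events.append(f'login {name}')
    try:
        yield name                          # the with-block runs here
    finally:
        events.append(f'logout {name}')     # runs even if the block raised an error


with logged_session('alice') as user:
    events.append(f'working as {user}')

print(events)
```

The `try`/`finally` is what guarantees the clean-up step, in the same way that a file opened in a ``with`` block is always closed.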

## Exercise 2

Work with the file data/autofahrt.txt

Continuing from the example in the 'Reading files' section, read in columns one (=time $t$) and one of columns two to four (=acceleration $\ddot x(t)$, $\ddot y(t)$, $\ddot z(t)$). You choose which one.

Then, write out the two columns to a new file.

In [None]:
# EDIT THE CODE BELOW

# Open file
f = open('data/autofahrt.txt', 'r')
f2 = open('data/autofahrt_new.txt','w')

# Read and ignore header lines


# Loop over lines and extract variables of interest
    
    f2.write(...)
    
f.close()
f2.close()

## Notes

The above shows you how you can read and write any data file. Of course, there are also functions that exist to help you read in data in certain formats (for example ``numpy`` contains a function ``numpy.loadtxt`` to read in arrays from files) but the key is that with the above, you can read in any file.

## Read/write text files with numpy

numpy is a powerful Python module for working with n-dimensional arrays. In our case, it is also useful for reading text files and storing their contents directly as arrays or matrices.

In [None]:
# import numpy in the notebook by typing
import numpy as np

In [None]:
# load txt file with the command np.loadtxt
data = np.loadtxt('data/autofahrt.txt',skiprows=2,unpack=True)

In [None]:
print(data)

In [None]:
time = np.loadtxt('data/autofahrt.txt',skiprows=2,unpack=True,usecols=[0])
ax = np.loadtxt('data/autofahrt.txt',skiprows=2,unpack=True,usecols=[1])

#print(time)
#print(ax)

In [None]:
np.savetxt('data/autofahrt_new_numpy.txt',np.c_[time,ax])
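As a variation on the cells above, both columns can be loaded in a single call by passing several indices to `usecols`. The sketch below uses `io.StringIO` with made-up contents in place of `data/autofahrt.txt` (an assumption, so that it runs without the data file); the `fmt` and `header` arguments of `np.savetxt` are optional.

```python
import io

import numpy as np

# Made-up text with two header lines and three columns, standing in for a data file
text = (
    'made-up header line 1\n'
    'made-up header line 2\n'
    '0.0 1.5 2.5\n'
    '0.1 1.6 2.6\n'
    '0.2 1.7 2.7\n'
)

# Load columns 0 and 1 in one call; unpack=True returns one array per column
time, ax = np.loadtxt(io.StringIO(text), skiprows=2, usecols=(0, 1), unpack=True)

# Write the two columns side by side, with three decimals and a header line
out = io.StringIO()
np.savetxt(out, np.column_stack([time, ax]), fmt='%.3f', header='time ax')

print(out.getvalue())
```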

## Large binary files

For large numeric datasets, you rarely will want to read or write these as text (strings): the resulting files are larger to store, and precision may be lost if not enough digits are stored, for example.

The alternative is to write "binary" data (meaning it is just a series of bytes representing numbers, rather than representing strings).

In Python you can read binary by changing the mode from `'r'` to `'rb'`:

In [None]:
with open('data/day2_numbers.bin','rb') as f:
    data_bytes = f.read()

In [None]:
print(data_bytes)

The contents of this file are just a series of bytes, which are impossible to interpret without some prior knowledge. If we know that each byte encodes the value of a single integer, we can actually use the data:

In [None]:
for byte in data_bytes:
    print(int(byte))

In [None]:
data_numbers = [int(byte) for byte in data_bytes]
print(data_numbers)
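To illustrate why prior knowledge matters, the sketch below writes a few made-up numbers to a throwaway binary file and then interprets the same bytes in two different ways. The file name and values are invented; `struct` is the standard-library module for decoding packed binary values.

```python
import os
import struct
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'bytes_demo.bin')
numbers = [3, 1, 4, 1, 5]            # made-up values, each small enough for one byte

with open(path, 'wb') as f:          # 'wb' = write binary
    f.write(bytes(numbers))

with open(path, 'rb') as f:          # 'rb' = read binary
    raw = f.read()

# Interpretation 1: every byte is one integer
as_ints = list(raw)

# Interpretation 2: the first four bytes form one little-endian 32-bit unsigned integer
(as_uint32,) = struct.unpack('<I', raw[:4])

print(as_ints)
print(as_uint32)
```

Both results come from the same bytes; only the assumed layout differs.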

In science and scientific computing, there are a number of binary data formats in common use - this depends on the particular field. Some common formats are:
* [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) in general, especially for simulations.
* [FITS](https://en.wikipedia.org/wiki/FITS) in astrophysics.
* [CDF](https://cdf.gsfc.nasa.gov/) in space sciences, earth sciences.
* many others...

These are all "self-describing" binary formats, meaning that they adhere to a well-known standard, so that you can understand the structure of a given file without any prior knowledge.

In a single file they can store multiple datasets e.g. 1D arrays, 2D arrays, 3D arrays, etc, together with metadata (number of elements in an array, its physical units, and so on).

## HDF5


Let's look at a quick example with **HDF5**. First, we import the [h5py Python library](https://docs.h5py.org/en/stable/index.html) used to read and write HDF5 files:

In [None]:
import h5py
import numpy as np

First, let's **write** (create) a new HDF5 file:

In [None]:
squares = [i**2 for i in range(10)]
print(squares)
with h5py.File('test.hdf5','w') as f:
    f['dataset1'] = squares
    f['more_data'] = [float(sq) for sq in squares]
    f['a_third_dataset'] = np.array([33,22,11])

We have just created a new binary file named `test.hdf5` which contains three one-dimensional datasets (i.e. arrays).

Now let's **read** an existing HDF5 file:

### Exercise

Open a terminal, and type `h5ls test.hdf5` to list the contents of this file. Try `h5ls -r test.hdf5` and `h5ls -rv test.hdf5` as well, for "recursive" and "verbose".
How many datasets are there? How many entries in each? What is the data type of each?

In the notebook, use `h5py` to read the first number from the `more_data` dataset in our newly created file.

In [None]:
# The leading ! tells IPython to run the line as a shell command
!h5ls test.hdf5

What if we don't know what datasets are in the file, or what their names are? We can use `.keys()`, just as with a dictionary.

In [None]:
with h5py.File('test.hdf5','r') as f:
    dset_names = list(f.keys())

print(dset_names)

In general, we can read an entire dataset using the following syntax:

In [None]:
with h5py.File('test.hdf5','r') as f:
    entire_dataset = f['more_data'][()]

But one of the powerful aspects of HDF5 and similar binary data formats is that you can load specific subsets of data.

Imagine that you had an enormous 100GB data file, too large to load into memory all at once. To load only the second through fifth entries:

In [None]:
with h5py.File('test.hdf5','r') as f:
    data_subset = f['more_data'][1:5]
    
print(data_subset)

Notice how we are using the same indexing and slicing syntax that we have seen already, to tell h5py what subset of the dataset to load.

> HDF5 is a rich format with many other features - take a look at the [h5py quickstart guide](https://docs.h5py.org/en/stable/quick.html#quick).