# CSV File Handling

In this tutorial, we'll cover how to bring data in and out of a `python` program.

Many scientific instruments and software packages output what are called "comma-separated values" files, or `.csv` files.
These `.csv` files are usually structured like a table, with the name of each quantity in the first row (the header of the file) and the values for each measurement in subsequent rows.

Have a look at the provided file, `example_data.csv`, and you should see something like this:

|t, |x, |y, |z  |
|:-:|:-:|:-:|:-:|
|0, |0, |0, | 0 |
|1, |1, |2, | 1 |
|2, |2, |4, | 4 |
| ⋮ | ⋮ | ⋮ | ⋮ |

Usually you'll know what kind of data the columns are supposed to represent according to their names (e.g. `t`, `x`, `y`, `z`).
In this case, we'll say that the `t` column corresponds to the time of each measurement of the (`x`, `y`, `z`) position of some particle.

Of course, we'll want to bring this data into `python` to be able to work with it a bit more.
To do so, we will use some functions provided by the `numpy` package.

In [1]:
import numpy as np

`numpy` has several options for reading in `.csv` files, but we'll focus on `genfromtxt()` because its syntax is easy to understand and it is very flexible.

In the following line, the first argument corresponds to the _name_ of the file we want to read (note: this must be a string, so it's enclosed in quotes), and the second argument tells `numpy` that the values in our files are separated by the character `','` (i.e. it's a `.csv`).

In [2]:
np.genfromtxt('example_data.csv', delimiter=',')

array([[nan, nan, nan, nan],
       [ 0.,  0.,  0.,  0.],
       [ 1.,  1.,  2.,  1.],
       [ 2.,  2.,  4.,  4.],
       [ 3.,  3.,  6.,  9.],
       [ 4.,  4.,  8., 16.],
       [ 5.,  5., 10., 25.],
       [ 6.,  6., 12., 36.],
       [ 7.,  7., 14., 49.],
       [ 8.,  8., 16., 64.],
       [ 9.,  9., 18., 81.]])

Now there are a few things we can observe here.
1. The data is read into an array, which is a very powerful object in `python` and one which we will use frequently.
1. The first row is all `nan` values, which means that `numpy` couldn't read them as numbers. This is because the first row of `example_data.csv` is `t, x, y, z`, which are all strings. To fix this, we want to tell `numpy` that the first row corresponds to the names of the columns; we do this with the argument `names=True`.
1. While this showed us what was in our data file, we can't do anything with the printed output, so we need to save the result to some variable; we'll call this `data`.

Implementing these changes, we get the following:

In [3]:
data = np.genfromtxt('example_data.csv', delimiter=',', names=True)

Now if we ask `python` what `data` is, we should get the right output...

In [4]:
data

array([(0., 0.,  0.,  0.), (1., 1.,  2.,  1.), (2., 2.,  4.,  4.),
       (3., 3.,  6.,  9.), (4., 4.,  8., 16.), (5., 5., 10., 25.),
       (6., 6., 12., 36.), (7., 7., 14., 49.), (8., 8., 16., 64.),
       (9., 9., 18., 81.)],
      dtype=[('t', '<f8'), ('x', '<f8'), ('y', '<f8'), ('z', '<f8')])

But now the format is different!
We can see that each row of our data file now corresponds to a tuple (i.e. an object of the form `(...)`), and there is now something called `dtype`, which has stored the names of our columns.

Don't worry too much about the syntax of `dtype` for now, but note that it can be useful for selecting individual columns:

In [5]:
data['t']

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [6]:
data['x']

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [7]:
data['y']

array([ 0.,  2.,  4.,  6.,  8., 10., 12., 14., 16., 18.])

In [8]:
data['z']

array([ 0.,  1.,  4.,  9., 16., 25., 36., 49., 64., 81.])
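If you want to check which column names are available before selecting one, the names stored in `dtype` can be listed directly. The following is a minimal sketch, assuming a small in-memory stand-in for the first few rows of `example_data.csv` (the `csv_text` string and the `sample` variable are made up for illustration):

```python
import io
import numpy as np

# Hypothetical stand-in for the first few rows of example_data.csv
csv_text = "t, x, y, z\n0,0,0,0\n1,1,2,1\n2,2,4,4\n"

# genfromtxt also accepts a file-like object, so no file on disk is needed here
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=',', names=True)

print(sample.dtype.names)  # the column names taken from the header row
print(sample[1])           # a single row, holding one value per column
print(sample['z'])         # a single column, selected by name
```

Indexing with an integer gives a row, while indexing with a column name gives that whole column as an ordinary array.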

Now, suppose we want to find the distance of our particle from the origin at each time `t`.
We have the `x`, `y`, and `z` coordinates, so the distance is simply $$d = \sqrt{x^2 + y^2 + z^2}.$$

I don't want to focus too much on how to compute this at the moment, but I'll do so using a list comprehension:

In [9]:
d = [np.sqrt(x**2 + y**2 + z**2) for (x, y, z) in zip(data['x'], data['y'], data['z'])]
d

[0.0,
 2.449489742783178,
 6.0,
 11.224972160321824,
 18.33030277982336,
 27.386127875258307,
 38.41874542459709,
 51.43928459844674,
 66.4529909033446,
 83.46256645946133]
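Because the columns are `numpy` arrays, the same distances can also be computed without a loop, since arithmetic on arrays is applied element by element. Here is a minimal sketch, assuming short stand-in arrays for the first three rows of the `x`, `y`, and `z` columns (the names `d_vec` and `d_list` are only for illustration):

```python
import numpy as np

# Hypothetical stand-in columns matching the first three rows of the example data
x = np.array([0., 1., 2.])
y = np.array([0., 2., 4.])
z = np.array([0., 1., 4.])

# Element-wise version: squares, sums, and square roots are applied per entry
d_vec = np.sqrt(x**2 + y**2 + z**2)

# List-comprehension version, for comparison
d_list = [np.sqrt(a**2 + b**2 + c**2) for (a, b, c) in zip(x, y, z)]

print(d_vec)
print(np.allclose(d_vec, d_list))
```

The element-wise version returns an array rather than a list, which can be convenient for further calculations.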

Finally, now that we've done some processing of our data, let's say we want to save our new information to a new file in case we want to share our results with someone else or work with it in another program.

We'll again use `numpy`, this time with `savetxt()`.

In the following line, the first argument corresponds to the name of the file that we want to save our data to; this should be a `.csv`, but it should _not_ have the same name as the file we imported the data from (otherwise, we would overwrite the original data and could lose our original measurements).

Then we need to give `numpy` an array of the data we want it to save.
Since `np.array()` likes to make rows but we're interested in saving our data as columns, we'll have to transpose this array, which is simply done with `np.array().T`.
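To see what the transpose does, it can help to look at the shape of the array before and after. This is a minimal sketch, assuming three stand-in values for each of `t` and `d` (the names `rows` and `cols` are only for illustration):

```python
import numpy as np

# Hypothetical stand-in values for the t and d columns
t = np.array([0., 1., 2.])
d = [0.0, 2.449489742783178, 6.0]

rows = np.array((t, d))  # each input becomes one row
cols = rows.T            # transposing turns those rows into columns

print(rows.shape)
print(cols.shape)
print(cols)
```

After the transpose, each row of the array holds one `t` value and its matching `d` value, which is the layout we want in the saved file.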

We'll specify the delimiter to be a `','` so we get the same `.csv` format as before.

The last two arguments, `header` and `comments`, tell `numpy` what to write for the first row (i.e. for `example_data.csv` this is `t, x, y, z`): `header` is the text of that row, and `comments` is a prefix that `numpy` puts in front of it.
We have to specify `comments=''`; otherwise it defaults to `'# '` and the first row would be `# t, d`. Try running this code without the `comments` argument and you'll see what I mean.

Let's say we just want to save the `t` and `d` columns for simplicity:

In [10]:
np.savetxt('output.csv', np.array((data['t'], d)).T, delimiter=',', header='t, d', comments='')

If you execute that line, you should now see a file called `output.csv` in your project folder with two columns, one corresponding to `t` and the other to `d` as expected.
There is a file in this directory called `example_output.csv` to which you can compare your results.
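Another way to check the result is to read the saved file back in with `genfromtxt()`. The sketch below does this in a temporary folder so that it doesn't touch any of your files. It assumes stand-in values for `t` and `d`, and the file names `clean.csv` and `default.csv` are made up for illustration. It also writes a second file without the `comments` argument so the two header rows can be compared:

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-in values for the t and d columns
t = np.array([0., 1., 2.])
d = np.array([0., np.sqrt(6.), 6.])
table = np.array((t, d)).T

with tempfile.TemporaryDirectory() as tmp:
    clean_path = os.path.join(tmp, 'clean.csv')      # hypothetical file name
    default_path = os.path.join(tmp, 'default.csv')  # hypothetical file name

    # Same call as in the tutorial, and a second one without comments=''
    np.savetxt(clean_path, table, delimiter=',', header='t, d', comments='')
    np.savetxt(default_path, table, delimiter=',', header='t, d')

    # Read only the first row of each file to compare the headers
    with open(clean_path) as f:
        clean_header = f.readline().strip()
    with open(default_path) as f:
        default_header = f.readline().strip()

    # Read the clean file back in, using its first row as column names
    check = np.genfromtxt(clean_path, delimiter=',', names=True)

print(repr(clean_header))
print(repr(default_header))
print(check.dtype.names)
print(check['d'])
```

If the values read back match the ones that were saved, the file was written in the format we intended.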