# Worksheet 7 - File I/O, Modules & Pandas

There is a lot of material in this Workshop, so just chill out and go through the problems. Don't worry if you don't finish it in class.
___

## Working with Files
The common pattern for working with external files in Python is to open the file, perform some actions, and then close the file. To open a file in Python, we use the `open()` function, which returns a _file object_. This file object is an identifier that will be used to interact with the file (reading, writing, closing, etc.). To begin, let's start by opening a file, doing nothing to it, and closing it.

```python
myf = open("MyDataFile.txt", "w")
myf.close()
```

There are a few things to notice. The open function takes two arguments: the first is the name of the file you want to open. The second is the mode. To get more information, type open? in the console, but in brief the most commonly-used modes work like this:
- `'w'`: write: opens the file and lets you write to it. Erases anything currently in the file.
- `'r'`: read: opens the file and lets you read data in, but not modify anything.
- `'a'`: append: opens the file, keeping what's in it. Lets you write onto the end of the existing information.
- `'r+'`: readwrite: lets you read and write. Doesn't delete anything.

The second line of code closes the file. Notice that we close the file by using the file identifier and a little dot, with the close() function added to the end. This style of working with files makes it simpler to keep track of what we're doing if we have several files open at once. 

Run the code above from your working directory to create a file called `MyDataFile.txt`. Then open the file in an editor and check that it's empty. Opening a file in Python to write will create it if it doesn't exist. Opening a file to read will raise an error if it doesn't exist.
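
To see the modes side by side, here is a minimal sketch. It works in a temporary directory, and the file names `modes_demo.txt` and `does_not_exist.txt` are made up for illustration. It also uses the `read()` method, which returns the whole file as one string.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched (an assumption of this sketch)
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "modes_demo.txt")  # hypothetical file name

f = open(path, "w")   # 'w' creates the file (or erases it if it already exists)
f.write("first\n")
f.close()

f = open(path, "a")   # 'a' keeps the contents and adds to the end
f.write("second\n")
f.close()

f = open(path, "r")   # 'r' only lets us read
contents = f.read()
f.close()
print(contents)

# Opening a missing file for reading raises FileNotFoundError
missing = os.path.join(workdir, "does_not_exist.txt")
try:
    open(missing, "r")
    opened_missing = True
except FileNotFoundError:
    opened_missing = False
print(opened_missing)
```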

### Writing
We will also want to write new files. We can write to a file with the `write()` method. This works similarly to the read methods, which we'll get to in the next section: you use the file identifier with the dot, and pass whatever you want written as the argument, like this:

```python
myf = open("MyDataFile.txt", "w")
myf.write('Hello, World!')
myf.close()
```

Now try opening your file and look at what's in it. Let's write a program to write numbers to a file:

```python
myf = open("MyDataFile.txt", "w")

for i in range(10):
    myf.write("{}".format(i))
myf.close()  # close once, after the loop has finished
```

Note that we needed to convert our variable `i` to a `str` using the `format()` method. This is because _file I/O only works on strings_; trying to write anything else will give you an error. Open your file and look at it. Notice how everything is all bunched together? That's because we didn't tell the computer to put any new lines in and it did exactly as we told it. Just like the numbers or letters you want in the file, the formatting has to be there as well. Computers store the point where you start a new line using a special character called a newline character. In Python we write it as `\n`. When you open your file in an editor, it will use these `\n`'s to know when to start new lines, and it doesn't print them otherwise. Let's modify our code to print everything onto a different line.

```python
myf = open("MyDataFile.txt", "w")

for i in range(10):
    myf.write("{}\n".format(i))
myf.close()  # close once, after the loop has finished
```

The inclusion of `\n`'s wherever you want a new line to start is just another pattern that you get used to. There are other special characters used for formatting as well, but we won't talk about them further in this workshop.
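
One way to convince yourself that the newline characters really are in the file is to read the text back and print it with `repr()`, which shows special characters instead of acting on them. This is only a sketch: it writes to a temporary file (the name `newline_demo.txt` is made up) and uses `read()`, which returns the whole file as a single string.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "newline_demo.txt")  # hypothetical file

myf = open(path, "w")
for i in range(3):
    myf.write("{}\n".format(i))
myf.close()

myf = open(path, "r")
text = myf.read()
myf.close()

print(repr(text))  # shows '0\n1\n2\n': the \n characters are visible
print(text)        # here each \n starts a new line instead
```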

### Reading
Reading from files can get complicated. Let's consider the simple case with data written in lines. You access this data with the `readline()` method. For example, let's re-use our `MyDataFile.txt` file:

```python
myf = open('MyDataFile.txt', 'r')
line = myf.readline()
print(line)
myf.close()
```

See how it gave us back the first line? You might also note that we got an extra blank line. This is because we read in the entire line, newline character and all, and Python's `print()` function by default tacks its own newline onto the end. So, we printed with two newlines. Each call to `readline()` reads the next line in the file. If there are no more lines, it returns an empty string (`''`), which counts as false in a test like `if not line`. So let's use a loop to print each line (and not worry about the extra blank lines for now).

```python
myf = open('MyDataFile.txt', 'r')
while True:
    line = myf.readline()
    if not line:
        break
    else:
        print(line)
myf.close()
```
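
If the extra blank lines bother you, two common options are to remove the trailing newline with the string method `rstrip()` before printing, or to tell `print()` not to add its own with `end=""`. A small sketch, using a temporary file with a made-up name:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "strip_demo.txt")  # hypothetical file
myf = open(path, "w")
myf.write("0\n1\n2\n")
myf.close()

cleaned = []
myf = open(path, "r")
while True:
    line = myf.readline()
    if not line:            # readline() gives '' at the end of the file
        break
    cleaned.append(line.rstrip("\n"))  # drop the newline read from the file
    print(line, end="")     # or: keep the newline, but stop print() adding one
myf.close()

print(cleaned)
```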

There's another method called `readlines()`. This reads every line and puts them all into a list.

```python
myf = open('MyDataFile.txt', 'r')

data = myf.readlines()
print(data)
myf.close()
```

Sometimes this is the best choice. But if you have a big file, or you don't need every single thing in the file, it might not be. It can be more efficient to loop over the file and do things line by line, or to create a big list and operate on it. Knowing which option to take comes with practice!
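
As a rough illustration of the two approaches, the sketch below adds up the numbers in a file twice: once by looping over the file object one line at a time (a file object can be used directly in a `for` loop), and once by reading everything into a list first. The file name is made up and the file is created in a temporary directory.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "numbers_demo.txt")  # hypothetical file
myf = open(path, "w")
for i in range(10):
    myf.write("{}\n".format(i))
myf.close()

# Option 1: one line at a time; only the current line is held in memory
total_by_line = 0
myf = open(path, "r")
for line in myf:
    total_by_line += int(line)
myf.close()

# Option 2: read every line into a list, then work on the list
myf = open(path, "r")
lines = myf.readlines()
myf.close()
total_from_list = 0
for line in lines:
    total_from_list += int(line)

print(total_by_line, total_from_list)
```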

The big thing to take away here is how to open and close files. A lot of the time you will be using a special module, like _csv_ for comma-separated data or _re_ for regular expressions, which will have its own way of reading and writing.
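
For instance, the standard-library `csv` module still relies on `open()` and `close()`, but takes care of splitting each line at the commas. The sketch below uses a temporary file and made-up values purely for illustration.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "csv_demo.csv")  # hypothetical file

# newline="" is what the csv documentation recommends when opening files for it
myf = open(path, "w", newline="")
writer = csv.writer(myf)
writer.writerow(["name", "value"])   # made-up header
writer.writerow(["a", 1])            # made-up data
writer.writerow(["b", 2])
myf.close()

myf = open(path, "r", newline="")
rows = []
for row in csv.reader(myf):
    rows.append(row)                 # each row is a list of strings
myf.close()

print(rows)
```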

## Modules
We've already briefly discussed and used modules, but we revisit the idea here because we'll be working with a new module below. Familiar modules are Numpy and Matplotlib. Practically speaking, a module is a way of grouping a block of code together for easy reuse elsewhere. Consider that when you started out you wrote simple scripts of a few lines. Once these got too big, you started organising them into functions to make things clearer and easier to work with. As your programs get bigger, it becomes a good idea to organise your code into different modules. Another big advantage of modules is that they're portable if well-designed enough. For example, the module Numpy, which we have already been using, includes many data types and functions that scientists and engineers want to use. As natural scientists, you need to work with modules, so you import the Numpy module and make use of high-quality code that someone else has already written and tested, without having to write it all from scratch yourself.

Modules make the Python language easy to extend to different applications. As a reminder, to get access to a module, you use the key word import. There are a couple of options for how you use it. Let's use the example of the Numpy module.
- `import numpy`: Gives you access to everything in Numpy, prefixed with the name numpy.
- `import numpy as np`: Does the same as the above, but uses the name 'np' instead of 'numpy'. This is very common and saves typing.
- `from numpy import *`: the * means 'all'. Gives you access to everything in Numpy, but without a name prefix like 'np'. This is dangerous and almost never a good idea.
- `from numpy import sin`: Imports only the sin() function.

It's a good idea to keep the name of the module there. This prevents different functions or constants which may be defined in different modules from clashing.
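
A short sketch of why the prefix helps: both the standard `math` module and Numpy define a function called `sqrt`, and they behave differently. Keeping the prefixes makes it obvious which one is being called.

```python
import math
import numpy as np

x = math.sqrt(16.0)                 # math's version works on single numbers
y = np.sqrt(np.array([4.0, 9.0]))   # Numpy's version works on whole arrays

print(x)
print(y)

# After 'from math import sqrt' and 'from numpy import sqrt' (or the * forms),
# the bare name sqrt would refer to whichever was imported last.
```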

### Useful Modules
There are a lot of useful modules in Python. You're familiar with Numpy and (hopefully!) Matplotlib. We'll learn Pandas today. A few others worth being aware of are:
- time: routines for timing and optimising algorithms
- logging: logging, status, and error handling
- pickle: serialising Python data
- os: interfaces with the operating system
- sys: system-specific parameters and functions
- csv: working with comma-separated (e.g. like Excel) files
- tkinter: used for making GUIs
- beautifulsoup: HTML/XML parser
- pygame: used for making 2D games in Python
- xarray: used for manipulating climate model outputs

You can Google python modules to find out more, and how to use them. This is one of the advantages of being connected to the international Python community – shared software. There are many more modules, so it pays to have a look around to see what's out there before starting on your project.
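
As a small taste of two of these, the sketch below uses `os` to find the current working directory and `time` to measure how long a loop takes. The loop itself is just a placeholder for whatever you want to time.

```python
import os
import time

print(os.getcwd())                 # the directory Python is currently working in

start = time.perf_counter()        # a high-resolution timer
total = 0
for i in range(100000):
    total += i
elapsed = time.perf_counter() - start

print("Sum: {}".format(total))
print("Elapsed time: {:.6f} s".format(elapsed))
```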

## Matplotlib Quick Refresher
Matplotlib is probably the most used and the most useful module for plotting in Python. You learn it by doing, and by consulting the built-in documentation and online documentation. Here is an example. Make sure you follow all the lines in this code before moving on.

```python
import matplotlib.pyplot as plt
import numpy as np

# Typing plt? in IPython will give you some basic information.
a = np.arange(0, 10, 0.5) # What does this line do?
b = a**2
# now plot a against b:
plt.figure(1) # initialize the figure
plt.clf() # empty the contents of the figure
plt.plot(a, b, 'b-', label = 'a vs b')
#now plot something different
plt.plot(a, b**1.5, 'r--', label = 'a vs b^1.5')
# Type plt.legend? to give some information about legends
plt.legend(loc = 'best') # lets Python choose the best location for the legend
plt.show()
```
Much, much more is possible. In particular, you will find the use of subplots within a figure is very useful. But let's move on.

## The Pandas Module
Pandas is a very useful module for working with data. Today we'll
work with it briefly, but you may well use it in the years to come.
Numpy may still be your workhorse for this course, especially when
you have already got the data into your code, but Pandas will prove
useful to you in the future, for example if you do some serious experiments in a lab. The techniques we learned above for reading and
writing files work fine for simple files, but Pandas lets you work with
more complicated data sets, like those that lab instruments will give.

Pandas introduces a new data structure called a _DataFrame_ (just like how Numpy has its own specialised data structure called an array). A DataFrame is a grid-like structure where each row represents an item and each column is some attribute of that item. (Users of Microsoft Excel, SPSS, or R will have seen something similar before.) Pandas can read various file types like csv, Excel, SQL, HTML, etc. We'll work with csv, or comma-separated value, files today. To read these, use the `read_csv` command from the Pandas module: `pandas.read_csv(filename)`.
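
To get a feel for the row-and-column layout before loading a real file, here is a minimal sketch that reads a tiny csv from a string instead of from disk. The sample names and numbers are invented for illustration; `io.StringIO` simply makes the string behave like an open file.

```python
import io
import pandas as pd

# Made-up csv text: one header line, then one row per item
csv_text = "sample,mass,length\nA,1.5,10\nB,2.0,12\nC,2.5,15\n"

df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.shape)           # (number of rows, number of columns)
print(list(df.columns))   # the column names come from the header line
```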

Let's make a simple script to read in some csv data and figure out the type of the structure we create. We'll also print some information. This uses the census.csv file from ELE -- to run the following sections of code download it and save it to the same directory as this notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt
census = pd.read_csv('census.csv')
print(type(census))
display(census.head()) # This is an alternative to print() that works with pandas and
                       # displays data in a more attractive format
```

```
<class 'pandas.core.frame.DataFrame'>
```

|   | State | 2011 | 2012 | 2000 |
|---|---|---|---|---|
| 0 | California | 37691912 | 37253956 | 33871648 |
| 1 | Texas | 25674681 | 25145561 | 20851820 |
| 2 | New York | 19465197 | 19378102 | 18976457 |
| 3 | Florida | 19057542 | 18801310 | 15982378 |
| 4 | Illinois | 12869257 | 12830632 | 12419293 |


See how the type of this structure we've created is a `DataFrame`? Having good data structures is probably the most important thing for writing good programs (at least according to Linus Torvalds, the man who wrote Linux). This type is really well-suited to working with data, just like arrays are good for numerical computations and lists are good for times when you need a flexible collection of objects. The `.head()` method just returns the top few lines of the DataFrame object it belongs to (which we've called census here), which is useful for seeing what sort of stuff you're working with. There is also a `.tail()` method which gives you the last few lines. Accessing bits of a DataFrame is done the same way as with other data structures: with square brackets. In this case, however, you don't use an integer index. Rather, you use the string which corresponds to the name of the column. For example, let's print all the 2011 data:

```python
display(census['2011'])
```

```
0     37691912
1     25674681
2     19465197
3     19057542
4     12869257
5     12742886
6     11544951
7      9876187
8      9815210
9      9656401
10     8821155
11     8096604
12     6830038
13     6587536
14     6516922
15     6482505
16     6403353
17     6010688
18     5828289
19     5711767
20     5344861
21     5116769
22     4802740
23     4679230
24     4574836
25     4369356
26     3871859
27     3791508
28     3580709
29     3062309
30     2978512
31     2937979
32     2871238
33     2817222
34     2723322
35     2082224
36     1855364
37     1842641
38     1584985
39     1374810
40     1328188
41     1318194
42     1051302
43      998199
44      907135
45      824082
46      722718
47      683932
48      626431
49      568158
Name: 2011, dtype: int64
```
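
The same calls work on any DataFrame. Here is a sketch with a small made-up table, so you can run it without the census file; note that `head()` and `tail()` accept the number of rows you want.

```python
import pandas as pd

# Invented values, for illustration only
df = pd.DataFrame({
    "Town": ["P", "Q", "R", "S"],
    "2011": [400, 300, 200, 100],
})

print(df.head(2))      # first two rows
print(df.tail(1))      # last row
column = df["2011"]    # select a column by its name (a string, not an integer)
print(type(column))    # a single column is a pandas Series
print(column[0])       # its entries are then looked up by the row label
```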

You can call some useful methods on columns of data. You can also get a subset of rows. See the following examples:

```python
print('Minimum 2011 : {}'.format(census['2011'].min()))
print('Maximum 2011 : {}'.format(census['2011'].max()))
print('Average 2011 : {}'.format(census['2011'].mean()))
print('Standard Deviation 2011 : {}'.format(census['2011'].std()))
print('Total US Population 2011 : {}'.format(census['2011'].sum()))
NY = census[census['State'] == 'New York']
display(NY)
```

```
Minimum 2011 : 568158
Maximum 2011 : 37691912
Average 2011 : 6219477.88
Standard Deviation 2011 : 6932149.918159333
Total US Population 2011 : 310973894
```

|   | State | 2011 | 2012 | 2000 |
|---|---|---|---|---|
| 2 | New York | 19465197 | 19378102 | 18976457 |
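
It can help to look at the row selection in two steps. The comparison inside the square brackets produces a column of `True`/`False` values, and the DataFrame keeps only the rows marked `True`. A sketch with invented data:

```python
import pandas as pd

# Invented values, for illustration only
df = pd.DataFrame({
    "Town": ["P", "Q", "R", "S"],
    "2011": [400, 300, 200, 100],
})

mask = df["Town"] == "Q"     # a Series of True/False, one entry per row
print(mask)
print(df[mask])              # only the rows where mask is True

big = df[df["2011"] > 150]   # any comparison works, not just ==
print(big)
```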


You can also create new columns. For example, let's add a column with the change in population between 2011 and 2012 and print the head again:

```python
census['Growth'] = census['2012'] - census['2011']
display(census.head())
```

|   | State | 2011 | 2012 | 2000 | Growth |
|---|---|---|---|---|---|
| 0 | California | 37691912 | 37253956 | 33871648 | -437956 |
| 1 | Texas | 25674681 | 25145561 | 20851820 | -529120 |
| 2 | New York | 19465197 | 19378102 | 18976457 | -87095 |
| 3 | Florida | 19057542 | 18801310 | 15982378 | -256232 |
| 4 | Illinois | 12869257 | 12830632 | 12419293 | -38625 |


### Grouping and Plotting
It's useful to be able to group data together. Let's work now with the file `01_heights_weights_genders.csv` which contains heights and weights for individual people. In the last section, you had a different row corresponding to each American state. In this section, there are thousands of 'Male' and thousands of 'Female' rows. Let's load it up and print the head, so we can see what's going on a bit better.

```python
data = pd.read_csv('01_heights_weights_genders.csv')
display(data.head())
```

|   | Gender | Height | Weight |
|---|---|---|---|
| 0 | Male | 73.847017 | 241.893563 |
| 1 | Male | 68.781904 | 162.310473 |
| 2 | Male | 74.110105 | 212.740856 |
| 3 | Male | 71.730978 | 220.04247 |
| 4 | Male | 69.881796 | 206.349801 |


We can now use the `groupby` method to group the rows by gender, like so:

```python
byGender = data.groupby('Gender')
display(byGender.size())
display(byGender.mean())
```

```
Gender
Female    5000
Male      5000
dtype: int64
```

| Gender | Height | Weight |
|---|---|---|
| Female | 63.708774 | 135.860093 |
| Male | 69.026346 | 187.020621 |
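
Strictly speaking, `groupby` does not hand back the separate DataFrames straight away; it returns a grouping object that remembers which rows belong to which group. One way to see this is to loop over it, as in the sketch below (the data are invented):

```python
import pandas as pd

# Invented values, for illustration only
df = pd.DataFrame({
    "Group": ["x", "y", "x", "y"],
    "Value": [1.0, 2.0, 3.0, 6.0],
})

grouped = df.groupby("Group")

for name, rows in grouped:       # each step gives a group name and its rows
    print(name)
    print(rows)

means = grouped["Value"].mean()  # one mean per group
print(means)
```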


Finally, you can make a new DataFrame object with the data from a particular group in the old one using the get_group() method.

```python
boys = byGender.get_group('Male')
girls = byGender.get_group('Female')
display(boys.head())
display(girls.head())
```

|   | Gender | Height | Weight |
|---|---|---|---|
| 0 | Male | 73.847017 | 241.893563 |
| 1 | Male | 68.781904 | 162.310473 |
| 2 | Male | 74.110105 | 212.740856 |
| 3 | Male | 71.730978 | 220.04247 |
| 4 | Male | 69.881796 | 206.349801 |

|   | Gender | Height | Weight |
|---|---|---|---|
| 5000 | Female | 58.910732 | 102.088326 |
| 5001 | Female | 65.230013 | 141.305823 |
| 5002 | Female | 63.369004 | 131.041403 |
| 5003 | Female | 64.479997 | 128.171511 |
| 5004 | Female | 61.793096 | 129.781407 |


```python
# Space to import useful packages, e.g. numpy
```



___

## Exercise 1
Either modifying your code from earlier or starting over, write a script that writes the odd numbers from 1-100 line-by-line into a file called `odd.dat` and the even numbers from 1-100 into a file called `even.dat`.

## Exercise 2
Read the odd numbers back in and print them to the screen, formatted to have a comma and space between them, like: `1, 3, 5, 7, ...`

## Exercise 3

Download the file `data.txt` from ELE. Take a look at this file to see what's in it. Then write a script to read in the data one number at a time, divide them by two, and print them to the screen. Be careful here: reading from a file will always give you str variables but you need floats! 
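
As a reminder of what that conversion looks like, here is a tiny sketch using a made-up string in place of a line read from a file. `float()` ignores the surrounding whitespace, including the trailing newline.

```python
line = "3.5\n"          # made-up example of what readline() might return

print(type(line))       # a str, so it cannot be divided by a number
value = float(line)     # convert the text to a floating-point number
print(type(value))
print(value + 1)
```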

## Exercise 4
In this exercise you will read in two data files that have been posted on ELE: `rk4.x.txt` and `rk4.y.txt`. One contains the $x$ locations of a data particle moving through an idealised pattern of atmospheric weather near a boundary. The other contains its $y$ coordinate.

Look at both files. The file format for both files is the following: the first line is an integer, $N$, that indicates how many values are in the file. Each subsequent line contains an $x$, or a $y$, value of the particle location (depending on which file you're looking at). There are $N$ lines after the integer. Read in the $x$ values first. You may use either of the Python methods `readlines` or `readline`.

_Hint: You will need to convert the strings that Python reads in to floats and assign them to Numpy arrays._

Using your usual plotting techniques make a plot of $x$ versus $y$. Your plot should look something like Figure 1.

_Hint: If you can't remember how to make a plot, look at the examples in Workshop 2._

<img src="particletrajectory.png" alt="Figure 1: Particle trajectories" width="400"/>

## Exercise 5
Load the census data csv file and create a column for the growth (or re-use your code from the lecture). Make another column for the percent change in population from 2011 to 2012.

## Exercise 6
Find and print the states which have:
- The highest population in 2011
- The smallest population in 2011
- The largest absolute (total number) reduction in population from 2011 to 2012
- The largest relative (percent change) growth from 2011 to 2012

## Exercise 7
So far, you have used the `matplotlib` module to make line plots. You can make all sorts of different plots with it. For example, you can make a histogram by:
```python
plt.hist(data, N_bins)
```
where `N_bins` is the number of bars (called 'bins' by the data analysis crowd) that you want and `data` is the thing you want to plot. Make a figure that has three histograms in it as _subplots_: one with all of the heights of both genders combined (i.e. the 'Height' column in the main data set) and one each for boys and girls.

_Hint: Label them all appropriately. You should see sort of a bell-curve effect._

_Hint: Google matplotlib subplot to see how you can put 3 plots on the same page._

## Exercise 8
The `plt.plot` function you've used so far doesn't only make line graphs. You can use it to make a scatter plot as well. Just make sure that you tell it to use only symbols, and not lines! Make a scatter plot of Height vs. Weight for the entire population. It should look something like Figure 3.

This has been a very brief introduction to the Pandas module, which lets you do advanced statistical analysis on data. If you need to do this sort of thing in the future (and you will) you know where to start.

<img src="WholePop.png" alt="Figure 2: Histogram of entire population" width="400"/><img src="scatter.png" alt="Figure 3: Scatter plot of height vs. weight" width="400"/>