# 1 Managing your Data

Data is the big one - there aren't many geosciences that make sense without it. 

Increasingly, your data will exist on a computer. And as soon as it's there - text file, binary, Excel, some other bespoke format, doesn't matter - we can start to use Python on it.

***What on Earth would we want to do with our data though?***

Here are some of the things I am routinely doing with data (or, more often, the output of a computer simulation):

- Opening data files and computing summary statistics.
- Opening data files, grabbing a subset of the relevant data, and writing this out to a new file.
- Opening hundreds of separate data files, each in separate folders, and summarising their data in a single file.
- Creating new files and folders, deleting old ones.
- Zipping and unzipping files.
- Downloading files from the Internet (this one I don't do much, I can see how it could be quite useful though).

Basically, everything you can **already do** by clicking and dragging, interacting with the file system. However, once you need to do the same task more than 30 times or so, it might be quicker to write a Python program to do it for you.

***In the five tasks below, I rely on a series of tired and increasingly stretched real-life analogies, as a means of thinking about Data Management.***

## 1.1 Reading your books

Interacting with data on a computer is much like using a book. HOWEVER, the second-nature things you do without thinking have to be programmed explicitly.

***How do you read a book?*** (this is how I do it anyway)

1. Read the title and determine, yes, this is the book I want to read.
2. Open the front cover.
3. Read the first sentence.
4. Read the second sentence.
5. Finish reading and close the book. (short attention span)

We will write a short piece of computer code to do the same thing.

### Demo: Reading a book

Have a look in the folder this notebook is in. There is a file called 'book.txt'. Open it with a text editor. Looks legit right?

In the cell below, I have written a series of Python commands to open 'book.txt', read its contents, and then close it again. (Note, I have been rather lavish with the comments.) 

**Execute the cell below by clicking in it, then hitting Ctrl+Enter**

In [None]:
title = 'book.txt'          # name of the file we want to open

fp = open(title, 'r')       # 'open' is a command to open files 
                            # the 'r' means "just read it, don't start writing your own things"
                            # fp is called a "file pointer" - it's the computer equivalent of 
                            # "where is the book? oh, here it is, still open, in my hands"

ln = fp.readline()          # read the first line of the book - "readline" is a method of a file pointer
                            # ln is a "string" variable - the variable "title" above is a string as well
    
print(ln)                   # reads the line "out loud" (prints it to the screen)

ln2 = fp.readline()         # read the next line, assign it to a different variable

print(ln2)                  # read the second line "out loud"

fp.close()                  # all done, make sure to close the book 
                            # (what kind of savage leaves a book lying around open...) 
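
A slightly tidier way to do the same thing is Python's `with` statement, which closes the book for you even if you forget. The sketch below is only an illustration: it writes its own two-line stand-in for 'book.txt' into a temporary folder (my assumption, so that it runs anywhere) and then reads it back.

```python
import os
import tempfile

# hypothetical stand-in for 'book.txt', written to a temporary folder
folder = tempfile.mkdtemp()
title = os.path.join(folder, 'book.txt')
with open(title, 'w') as fp:
    fp.write('first line\nsecond line\n')

# 'with' opens the book and closes it again when the indented block ends
with open(title, 'r') as fp:
    ln = fp.readline()       # first line, including the '\n' at its end
    ln2 = fp.readline()      # second line

print(ln.strip())            # strip() drops the trailing newline before printing
print(ln2.strip())
print(fp.closed)             # True - the file was closed automatically
```

Whether you use `with` or an explicit `fp.close()` is mostly a matter of taste; the rest of this notebook sticks with `fp.close()`.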

### Modify to: Read another book

Not too bad, right? You could do that, right?

Okay, have a go changing the code below. Read a different book. And read the first three lines.

In [None]:
title = 'book.txt'           # <- change this to a different book (the file 'different_book.txt')
fp = open(title, 'r')
ln = fp.readline()   
print(ln)            
ln2 = fp.readline()          # <- copy-paste this line and the one below to read three lines
print(ln2)     
fp.close()           

***What happens if you try to read six lines? Why?***

### DIY: Read your favourite book, it's in the cupboard

When you're comfortable with the two tasks above, try writing your own commands below to read one last book.

It's in a subdirectory called 'cupboard' so you will need to use a file path separator (a '/' on Mac/Linux or '\\\\' on Windows).
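
If you'd rather not worry about which separator your operating system wants, one option is to let Python build the path for you. A minimal sketch, using the folder and file names from this exercise (it only builds the path, it doesn't open anything):

```python
import os

# join folder and file name with whatever separator this computer uses
path = os.path.join('cupboard', 'favourite_book.txt')
print(path)

# the same thing done by hand with os.sep (the separator stored as a string)
path2 = 'cupboard' + os.sep + 'favourite_book.txt'
print(path == path2)         # True - both approaches give the same path
```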

In [None]:
# write your code in here
# for extra credit, write your commands so that they will read ALL of the lines from 'favourite_book.txt'
# without knowing ahead of time how many there are

# hint - you can use either a while loop OR research the method 'fp.readlines()'

## 1.2 Moving your book from point A to point B

When you're working with thousands of files, it is important to organise them in a coherent structure - don't rely on your brain to remember where everything is!

Python can assist with this organisation by copying, moving, renaming and deleting files.

### Demo: Light copyright violation

We can create an exact duplicate of an existing file by using the `copyfile` command. It comes as part of the `shutil` Python module.

***Execute the cell below and verify that a new file has been created, with the same contents.***

In [None]:
import shutil                           # 'shutil' is a Python module - a collection of commands for a particular purpose,
                                        # in this case, high-level file operations. 'import' makes these commands available

paper = 'knopoff_gardner1974.txt'       # name of the file we wish to copy

paper2 = 'knopoff_gardner1974_v2.txt'   # name of the new copy we will create

shutil.copyfile(paper, paper2)          # copy the file FROM paper TO paper2

import os
os.listdir('.')                         # print the contents of the current directory

Don't get caught stealing files - delete the copy!

***Execute the cell below and verify that the file is gone.***

In [None]:
os.remove('knopoff_gardner1974_v2.txt') # the 'remove' function deletes a file
                                        # use 'shutil.rmtree' to delete a folder and all its contents (be careful!)

os.listdir('.')                         # print the directory contents again

***If you run the cell above twice in a row, it produces an error. Why?***

### Modify to: Flagrant theft

Once you understand how copying a file works, try *moving* it instead. You will need to use `shutil.move`. Can't figure out how to do it? Grab the top link [here](http://lmgtfy.com/?q=python+move+file).

In [None]:
import shutil                           # (you only have to import this module once, so technically this line is redundant)

paper = 'knopoff_gardner1974.txt'       # name of the original file

paper2 = 'knopoff_gardner1974.txt'      # <- this should be the same file name but modified to include a subdirectory 
                                        # <- (put the file in 'cupboard')

shutil.copyfile(paper, paper2)          # <- change this to shutil.move

import os
print(os.listdir('.'))                  # print the contents of the current directory
print(os.listdir('cupboard'))           # print the contents of cupboard

***"Uh oh, I lost/deleted the original file `knopoff_gardner1974.txt`"***

*There is a backup copy: *\**.bkp*

***Notes on StackOverflow***

This is a wonderful resource for learning Python, particularly when combined with some sharp Google keyword searching. The quality of this resource is due to the very large online community of Python helpers and their nifty upvote system that ensures the best answer rises to the top.

I often copy code from a StackOverflow solution verbatim and then modify it to suit my needs. Of course, you should be careful with this approach and always make sure you understand what the code is doing.

### DIY: Outright plagiarism

You can copy and move files. The last useful task is renaming (which is quite similar in concept to moving a file). Use `os.rename` to rename `knopoff_gardner1974.txt` to `your_surname2017.txt` and watch the citations roll in.

In [None]:
# your code here
# for extra credit, read the contents of your new paper and print it to the screen

## 1.3 Writing your own books

You can create your own text files and fill them with:

- text
- numbers

Doesn't sound that special, right? It's actually super handy and the possibilities are endless. Let's dive in!

### Demo: Creative writing 101

The cell below creates a new file, writes a few lines of text to it, then closes it. Verify that this file did not exist before you ran the cell and *does exist* after you have run it.

In [None]:
fp = open('my_first_novel.txt','w')                # open a file - the 'w' means "you can write stuff to this file"
                                                   # if the file does not exist, it will be created
                                                   # if the file DOES exist, it will be OVERWRITTEN (so, be careful)

fp.write('Python Odyssey')                         # write a line of text
fp.write('\n')                                     # write a special character that says "go to the next line"
fp.write('\t by David Dempsey')                    # write another line of text

fp.close()                                         # close the file

***What happens when you comment the line `fp.write('\n')` and run the cell? Therefore, what is the function of this line?***

***What does the escaped character `\t` do? (try removing it and see what happens)***
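
The warning about `'w'` overwriting existing files is worth seeing in action. Here is a rough sketch comparing it with append mode, `'a'`, which adds to the end of a file instead. The file name is hypothetical and it lives in a temporary folder so nothing of yours gets clobbered.

```python
import os
import tempfile

# hypothetical file name, created in a temporary folder
novel = os.path.join(tempfile.mkdtemp(), 'my_second_novel.txt')

fp = open(novel, 'w')        # 'w' creates the file (or wipes an existing one)
fp.write('Chapter 1\n')
fp.close()

fp = open(novel, 'a')        # 'a' appends - whatever is already there is kept
fp.write('Chapter 2\n')
fp.close()

fp = open(novel, 'r')
after_append = fp.read()     # read the whole file in one go
fp.close()

fp = open(novel, 'w')        # 'w' again - both chapters are thrown away
fp.write('Rewrite\n')
fp.close()

fp = open(novel, 'r')
after_overwrite = fp.read()
fp.close()

print(after_append)          # both chapters
print(after_overwrite)       # only the rewrite survives
```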

### Decipher: Marking someone else's work is not wildly fun

The code below creates a new file in the subdirectory `cupboard` and then uses `format` to write numbers to it.

In [None]:
import numpy as np                                # this module does a bunch of handy numerical stuff - more later! 

fp = open('cupboard'+os.sep+'my_first_sequel.txt','w')     # *your comment*

fp.write('{}\n'.format('a line of text'))                  # *your comment*

fp.write('{}: {:d}\n'.format('an integer', 10))            # *your comment*

fp.write('{}: {:3.2f} {:8.7e}\n'.format('the same float, two ways', np.pi, np.pi))          # *your comment*

fp.close()

***What do the following commands do in the cell above? (one way to figure it out is by removing them and rerunning the cell)***

- `os.sep`
- `'{}\n'.format`
- `{:d}`
- `{:3.2f}`
- `{:8.7e}`
- `np.pi`

***Change some of the numbers above and confirm corresponding changes in the file.***

***Change `:d` to `:04d` - what happens?***

***Change `:3.2f` to `:6.4f` - what happens?***
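
If you are on Python 3.6 or newer, the same format codes also work inside "f-strings", which some people find easier to read. A small sketch, using `math.pi` from the standard library in place of `np.pi` (it is the same number) and checking that both styles produce identical text:

```python
import math

n = 10

# the variable goes inside the braces, followed by the same ':' format code
same_d = '{:d}'.format(n) == f'{n:d}'
same_f = '{:3.2f}'.format(math.pi) == f'{math.pi:3.2f}'
same_e = '{:8.7e}'.format(math.pi) == f'{math.pi:8.7e}'
print(same_d, same_f, same_e)        # True True True

line = f'an integer: {n:d}\n'        # ready to hand to fp.write
print(line)
```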

### DIY: Data fabrication 101

The for loop below generates data that **looks like** a physical process.

***What is a "for loop"?***

An essential computing structure that basically automates completion of a similar task over and over and over again.
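
Before the real thing, here is about the smallest for loop I can think of, as a sketch only. The file names are made up and no files are touched; the loop just builds a backup name for each one.

```python
books = ['book1.txt', 'book2.txt', 'book3.txt']   # hypothetical file names

backups = []                         # an empty list to collect the results
for book in books:                   # 'book' takes each value in the list, one at a time
    backup = book + '.bkp'           # the same task, repeated for every book
    backups.append(backup)           # save the result so it isn't lost next time around
    print(book, '->', backup)

print(backups)
```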

In [None]:
# a function to represent the physical process
def physical_process(time):
    return 2*time**2 + 1             # * = multiply, ** = "to the power of"

ts = np.linspace(0,10,11)            # a vector of times, starting at 0, ending at 10, eleven values in total
print(ts)                            # print out the vector

for t in ts:                         # for each value in the vector time (temporarily assigned to the variable t)
    x = physical_process(t)          # get the physical process value by using the function above
    x = x + (np.random.rand()-0.5)*10      # add on some measurement noise (a random number between -5 and 5)
    print(x)                         # print out the "data"

# your code here
# (hint: the loop above 'deletes' the value of x each time it comes around - how can you save it? write it out?)

***Save the "data" above to a text file called "legit_data.txt". Save in CSV format (comma separated values) where the first line is a "header" that tells what is in the columns below.***

## 1.4 Tidying your room

I use Python to organise my life (and by "my life" I mean the files on my computer, which are my life so...)

It might seem trivial, but directory management (creation and deletion of folders) is central to automating the organisation of your data.

### Demo: Because your Mum told you to

The code below does several things:

- it creates a new folder
- it moves several files into the new folder

It also makes use of the "if" statement to control what is happening in the code.

***What is an "if" statement?***

Sometimes you want your computer code to exhibit a little common sense. Like only perform a task ***if*** it hasn't already been done. Only delete a file ***if*** it isn't in an off-limits location.
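
As a quick illustration of the "only if it hasn't already been done" idea, the sketch below tries to create the same folder twice. The folder name is hypothetical and sits inside a temporary directory; the `else` branch is an extra I've added to show what happens when the check fails.

```python
import os
import tempfile

folder = os.path.join(tempfile.mkdtemp(), 'pile_of_notes')   # hypothetical folder name

created = []
for attempt in range(2):             # try the same task twice
    if not os.path.isdir(folder):    # only make the folder if it isn't there yet
        os.makedirs(folder)
        created.append(True)
    else:                            # otherwise leave it alone
        created.append(False)

print(created)                       # [True, False] - the second attempt did nothing
```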

In [None]:
# create a folder
if not os.path.isdir('pile_of_books'):                    # check if there is a directory called "pile_of_books"
    os.makedirs('pile_of_books')                          # if not - make that directory

# move a file
if os.path.isfile('book1.txt'):                           # check if a file called 'book1.txt' exists
    shutil.move('book1.txt', 'pile_of_books'+os.sep+'book1.txt')   # if it does - move it to the new directory

# move another file
flname = 'book2.txt'                                      # same as above, only now the book title is stored as a string
if os.path.isfile(flname):                                # in the variable 'flname'
    shutil.move(flname, 'pile_of_books'+os.sep+flname)

***Run the cell below to undo the changes above.*** 

In [None]:
if os.path.isfile('pile_of_books'+os.sep+'book1.txt'):                # *your comment here*
    shutil.move('pile_of_books'+os.sep+'book1.txt','book1.txt')       # *your comment here*
    
flname = 'book2.txt'                                                  # *your comment here*
if os.path.isfile('pile_of_books'+os.sep+flname):                     # *your comment here*
    shutil.move('pile_of_books'+os.sep+flname,flname)                 # *your comment here*

if os.path.isdir('pile_of_books'):                                    # *your comment here*
    os.rmdir('pile_of_books')                                         # *your comment here*

How does that work? Complete the comments above.

### Modify to: Because you're an adult and adults have tidy rooms

What if you want to copy all the files ending in `.txt` to a special folder, but you don't know how many there are (if any) or their names?

Glob is a wonderful little tool that makes use of **wildcards** to catch all files conforming to a particular name pattern. For example, `*.txt` will grab all files ending in `.txt`, whereas `*data*.pdf` will grab all pdf files containing the word `data` anywhere in their name.
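
To see those two patterns at work without touching your own files, here is a sketch that makes a few empty files in a temporary folder (all of the names are invented) and then globs for them.

```python
import os
import tempfile
from glob import glob

# make a scratch folder containing four empty, made-up files
folder = tempfile.mkdtemp()
for name in ['book1.txt', 'notes.txt', 'rock_data_2017.pdf', 'map.pdf']:
    open(os.path.join(folder, name), 'w').close()

txt_files = []
for fl in sorted(glob(os.path.join(folder, '*.txt'))):        # everything ending in .txt
    txt_files.append(os.path.basename(fl))                    # keep the name, drop the folder

data_pdfs = []
for fl in sorted(glob(os.path.join(folder, '*data*.pdf'))):   # pdfs with 'data' in the name
    data_pdfs.append(os.path.basename(fl))

print(txt_files)       # ['book1.txt', 'notes.txt']
print(data_pdfs)       # ['rock_data_2017.pdf']
```

Note that `glob` returns the matching paths in no particular order, which is why the sketch sorts them.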

In [None]:
from glob import glob

if not os.path.isdir('txt_files'):
    os.makedirs('txt_files')
    
fls = glob('*.txt')
for fl in fls:
    shutil.copyfile(fl, 'txt_files'+os.sep+fl)
    
# what does this code do?
# modify it so that a new subdirectory 'book_files' is created - copy into it files containing the word 'book' in their title

### Decipher: Professional room-tidying consultant (it's a thing, look it up)

Sometimes the data files are all bundled together in a single zipped archive. How to get them out? Or, how to read files while they're in there?

The code below is complete - execute it and try to determine which lines are achieving which outcomes (i.e., add your own comments).

In [None]:
from zipfile import ZipFile

zfl = ZipFile('archive.zip')
print(zfl.namelist())
fp = zfl.open(zfl.namelist()[0])
ln = fp.readline()
print(ln)
zfl.extract(zfl.namelist()[1])
fp = open(zfl.namelist()[1])
ln = fp.readline()
print(ln)
fp.close()

zfl.close()

***Which line(s)...***

- opens the zip file
- gets a list of the files contained in the zip file
- opens and reads text from the first file
- extracts the second file

## 1.5 The latest E-readers

Opening files and reading data from them line by line is a little cumbersome. Fortunately, Python packages some handy tools to automate this process for some common file types. We will look at two: comma-separated value (CSV, this format was touched on above), and Excel spreadsheet (this format can be opened by Pandas).

### 1.5.1 Demo: Kindle 1.0

I use the function `genfromtxt` *a lot*. It's great for data in simple formats sitting in text files. The example below reads comma separated values from the file `temperature_data.csv` and prints it to the screen.

In [None]:
# read the data
data = np.genfromtxt('temperature_data.csv', delimiter = ',', skip_header = 1)
     # the first input is the name of the text file
     # the second input says "columns of data are separated by a comma"
     # the third input says "ignore the first row, it is just information about the columns, not actually data"

print(data)

# read the data again, but separate the columns into separate variables
z,T = np.genfromtxt('temperature_data.csv', delimiter = ',', skip_header = 1, unpack=True)
     # the first three inputs are the same as above
     # the fourth input transposes the output matrix so that Python can unpack it to separate variables
print(z)
print(T)

***Verify for yourself that the numbers Python is printing out here match up with those contained in the file `temperature_data.csv`*** 
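
To appreciate what `genfromtxt` is saving you from, here is roughly the same job done by hand with the tools from section 1.1. This is a sketch of the idea, not how `genfromtxt` works internally. The file is a hypothetical stand-in for `temperature_data.csv`, and the column names and numbers are made up.

```python
import os
import tempfile

# hypothetical stand-in for 'temperature_data.csv' (made-up values)
fl = os.path.join(tempfile.mkdtemp(), 'temperature_demo.csv')
fp = open(fl, 'w')
fp.write('depth,temperature\n')      # header row
fp.write('0.0,15.0\n')
fp.write('100.0,18.5\n')
fp.write('200.0,22.0\n')
fp.close()

z = []
T = []
fp = open(fl, 'r')
fp.readline()                        # read and discard the header (like skip_header = 1)
for ln in fp:                        # loop over the remaining lines
    zi, Ti = ln.strip().split(',')   # split at the comma (like delimiter = ',')
    z.append(float(zi))              # convert text to numbers, one list per column
    T.append(float(Ti))              # (like unpack = True)
fp.close()

print(z)                             # [0.0, 100.0, 200.0]
print(T)                             # [15.0, 18.5, 22.0]
```

One difference worth knowing: `genfromtxt` hands back numpy arrays, whereas this sketch produces plain Python lists.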

### 1.5.2 For experts 1 - download from the web

Often data are being constantly updated and posted to the web by government agencies. Instead of manually downloading these data, we can write a few lines of code to pull them down. The example below captures induced earthquakes recorded by KNMI in the Netherlands.

First, grab the data and save it as a `.txt` file.

In [None]:
import urllib.request                       # the module we'll need
import shutil
url = 'http://cdn.knmi.nl/knmi/map/page/seismologie/all_induced.csv'        # the url where the data is located (look it up!)
file_name = 'nd_eqs.txt'                                                    # name of the file to save the data into
with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file: # not super important you understand this line
    shutil.copyfileobj(response, out_file)               # pull down and save the data

Verify for yourself that the text file created - `nd_eqs.txt` - matches up with the data at the url http://cdn.knmi.nl/knmi/map/page/seismologie/all_induced.csv

### 1.5.3 For experts 2 - Pandas

Pandas is a popular package for interacting with *time series data*. Rather than demonstrating it myself, I'll instead direct you to Nikolay Koldunov's excellent [Python for Geosciences](http://earthpy.org/tag/python-for-geosciences.html) course, which I have included in the zip file for this course.  

In particular, look at this [notebook](../koldunovn/python_for_geosciences-master/06%20-%20Time%20series%20analysis%20&#40;Pandas&#41;.ipynb).

# What now?

That's a crash course in automating some of the more mundane file system tasks you'd otherwise have to do by hand.

Time to get onto the more exciting stuff - making sparkly pictures of your data!

**Open the [Visualisation](../module 2 - visualisation/Visualisation.ipynb) notebook in the module 2 folder.** 