# Loading data from files

This course is about using Python to analyse and visualise data.

Most data, of course, is supplied to us in one of many formats: spreadsheets, database dumps, or text and binary files (csv, tsv, json, yaml, hdf5, netcdf).

It is also stored in some medium: on a local disk, a network drive, or on the internet in various ways.

It is important to distinguish the data's format (how the data is structured within a file) from its storage (where the file is kept).
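To make the distinction concrete, here is a minimal sketch in which the same csv-formatted text is parsed from two different storage locations: an in-memory buffer and a file in a temporary folder. The sample data and the file name `sample.csv` are made up for illustration.

```python
import csv
import io
import os
import tempfile

# The same csv-formatted text (hypothetical sample data).
csv_text = "name,value\nalpha,1\nbeta,2\n"

# Storage 1: an in-memory buffer.
rows_from_memory = list(csv.reader(io.StringIO(csv_text)))

# Storage 2: a file on the local disk, in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.csv")
    with open(path, "w", newline="") as target:
        target.write(csv_text)
    with open(path, newline="") as source:
        rows_from_disk = list(csv.reader(source))

# The format is the same, so the parsed rows are the same.
print(rows_from_memory == rows_from_disk)
```

The parsing code does not care where the text came from, only how it is laid out.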

We'll look first at loading data from a disk and then at how we understand various file formats. Then we'll look at downloading data from the internet.

### An example datafile

Let's write an example datafile to disk so we can investigate it. The IPython notebook provides a way to do this: if we put
`%%writefile` followed by a filename at the top of a cell, then instead of being interpreted as Python, the cell contents are saved to that file on disk.

In [1]:
%%writefile mydata.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. 
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. 
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Overwriting mydata.txt


**Supplementary material** What the heck was [that](https://en.wikipedia.org/wiki/Lorem_ipsum)?

Where did that go? It went to the current folder, which for a notebook, by default, is where the notebook is on disk.

In [2]:
import os # The 'os' module gives us all the tools we need to search in the file system
os.getcwd() # Use the 'getcwd' function from the 'os' module to find where we are on disk.

'/Users/jamespjh/devel/rsdt/training/pyintro/notebooks'

Can we see it is there?

In [3]:
[x for x in os.listdir(os.getcwd()) if '.txt' in x ]

['mydata.txt']

Yep! Note how we used a list comprehension to filter out all the extraneous files.
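One caveat with the filter above: `'.txt' in x` is a substring test, so it would also match a name such as `notes.txt.bak`. A minimal sketch of the difference, using a temporary folder and made-up file names so that it does not depend on what is in your current folder:

```python
import os
import tempfile

# Build a scratch folder with a few hypothetical files to list.
with tempfile.TemporaryDirectory() as folder:
    for name in ["mydata.txt", "notes.txt.bak", "plot.png"]:
        with open(os.path.join(folder, name), "w") as target:
            target.write("example")

    names = sorted(os.listdir(folder))
    # Substring test: also matches 'notes.txt.bak'.
    contains_txt = [x for x in names if ".txt" in x]
    # Suffix test: matches only names that end in '.txt'.
    ends_with_txt = [x for x in names if x.endswith(".txt")]

print(contains_txt)
print(ends_with_txt)
```

If you only want files whose names end in `.txt`, `endswith` is the safer test.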

#### Path independence and `os`

We can use `dirname` to get the parent folder of a folder, in a platform-independent way.

In [4]:
os.path.dirname(os.getcwd())

'/Users/jamespjh/devel/rsdt/training/pyintro'

We could do this manually using `split`:

In [5]:
"/".join(os.getcwd().split("/")[:-1])

'/Users/jamespjh/devel/rsdt/training/pyintro'

But this would not work on Windows, where path elements are separated with a `\` instead of a `/`. So it's important
to use `os.path` for this kind of path manipulation.
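We can check this claim on any machine, because the standard library ships both flavours of `os.path` under the names `posixpath` and `ntpath`. The sketch below uses made-up example paths; `os.path` itself is simply whichever of the two matches the platform you are running on.

```python
import ntpath      # the Windows flavour of os.path
import posixpath   # the Linux/macOS flavour of os.path

# Hypothetical example paths, one per platform style.
posix_parent = posixpath.dirname("/Users/me/notebooks")
windows_parent = ntpath.dirname(r"C:\Users\me\notebooks")

print(posix_parent)
print(windows_parent)

# Splitting on "/" by hand finds no separators in the Windows path,
# so dropping the last element leaves nothing at all.
manual = "/".join(r"C:\Users\me\notebooks".split("/")[:-1])
print(repr(manual))
```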

**Supplementary Materials**: If you're not already comfortable with how files fit into folders, and folders form a tree,
with folders containing subfolders, then look at http://swcarpentry.github.io/shell-novice/01-filedir.html.

Satisfy yourself that after using `%%writefile`, you can then find the file on disk with Windows Explorer, OSX Finder, or the Linux Shell.

As we have seen, we can investigate the file system from Python with functions in the `os` module, using just the same programming approaches as for anything else.

We'll gradually learn more features of the `os` module as we go, allowing us to move around the disk, `walk` around the
disk looking for relevant files, and so on. These will be important to master for automating our data analyses.
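As a taste of what `walk` does, here is a minimal sketch that collects every `.txt` file below a starting folder. The helper name `find_text_files` and the folder layout are invented for this example; `os.walk` itself yields one `(folder, subfolders, filenames)` triple for each folder it visits.

```python
import os
import tempfile


def find_text_files(top):
    """Return the paths of all '.txt' files below the folder `top`."""
    found = []
    for folder, subfolders, filenames in os.walk(top):
        for name in filenames:
            if name.endswith(".txt"):
                found.append(os.path.join(folder, name))
    return sorted(found)


# Build a small hypothetical folder tree to search.
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "raw", "2015"))
    examples = [
        "readme.txt",
        os.path.join("raw", "2015", "mydata.txt"),
        os.path.join("raw", "plot.png"),
    ]
    for relative in examples:
        with open(os.path.join(top, relative), "w") as target:
            target.write("example")

    # Report the matches relative to the starting folder.
    relative_paths = [os.path.relpath(p, top) for p in find_text_files(top)]

print(relative_paths)
```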

## Reading the file

So, let's read our file:

In [6]:
myfile=open('mydata.txt')

In [7]:
type(myfile)

file

We can go line-by-line, by treating the file as an iterable:

In [8]:
[x for x in myfile]

['Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \n',
 'Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. \n',
 'Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. \n',
 'Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.']

If we do that again, we get nothing: the file has already been read to the end, so there is no more data.

In [9]:
[x for x in myfile]

[]

We need to 'rewind' it!

In [10]:
myfile.seek(0)
[len(x) for x in myfile if 'ut' in x]

[125, 109, 104]
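The whole exhaust-and-rewind sequence can be reproduced in one self-contained sketch. It is written for Python 3 and uses a small temporary file (a made-up `example.txt`) rather than `mydata.txt`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")
    with open(path, "w") as target:
        target.write("first line\nsecond line\n")

    with open(path) as source:
        first_pass = [line for line in source]    # reads to the end
        second_pass = [line for line in source]   # nothing left to read
        source.seek(0)                            # rewind to the start
        third_pass = [line for line in source]    # reads everything again

print(first_pass)
print(second_pass)
print(third_pass)
```

The file object remembers its position between reads, which is why the second pass is empty until we `seek` back to position 0.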

It's really important to remember that a file is a *different* built-in type from a string.

We can read one line at a time with `readline`: 

In [11]:
myfile.seek(0)
first = myfile.readline()

In [12]:
first

'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \n'

In [13]:
second=myfile.readline()

In [14]:
second

'Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. \n'
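When the file runs out, `readline` returns an empty string, whereas a blank line inside the file comes back as `'\n'`. That makes it possible to loop with `readline` by hand, as in this minimal Python 3 sketch (the file name and contents are invented for the example):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")
    with open(path, "w") as target:
        # Note the blank line in the middle and no newline at the very end.
        target.write("first line\n\nlast line")

    lines = []
    with open(path) as source:
        while True:
            line = source.readline()
            if line == "":          # empty string means end of file
                break
            lines.append(line.rstrip("\n"))   # drop the trailing newline

print(lines)
```

In practice, iterating over the file directly is usually simpler; this loop is only meant to show what `readline` returns at each step.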

We can read the whole remaining file with `read`:

In [15]:
rest=myfile.read()

In [16]:
rest

'Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. \nExcepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'

This means that, when a file has just been opened, `read` is a convenient way to get its whole contents as a single string:

In [17]:
open('mydata.txt').read()

'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nUt enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. \nDuis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. \nExcepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'

You can also read just a few characters:

In [18]:
myfile.seek(0)

In [19]:
myfile.read(5)

'Lorem'
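Each call to `read(n)` carries on from where the previous one stopped, and returns an empty string once the file is exhausted. A minimal Python 3 sketch of reading a file in fixed-size pieces, using a short made-up file:

```python
import os
import tempfile

text = "Lorem ipsum dolor"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")
    with open(path, "w") as target:
        target.write(text)

    chunks = []
    with open(path) as source:
        while True:
            chunk = source.read(5)   # at most 5 characters per call
            if not chunk:            # empty string: nothing left
                break
            chunks.append(chunk)

print(chunks)
```

Reading in chunks like this can be useful when a file is too large to hold in memory all at once.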

### Strings are not files

Because files and strings are different types, we CANNOT just treat strings as if they were files:

In [20]:
mystring= "Hello World\n My name is James"

In [21]:
mystring

'Hello World\n My name is James'

In [22]:
mystring.readline()

AttributeError: 'str' object has no attribute 'readline'

This is important, because some file format parsers expect input from a **file** and not a string.
We can wrap a string in a file-like object using the StringIO module in the standard library (in Python 3, the same class is found in the `io` module, as `io.StringIO`):

In [23]:
from StringIO import StringIO

In [24]:
mystringasafile=StringIO(mystring)

In [25]:
mystringasafile.readline()

'Hello World\n'

In [26]:
mystringasafile.readline()

' My name is James'

Note that in a string, "\n" is used to represent a newline.
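As an example of a parser that insists on a file, the standard library's `json.load` reads from a file-like object, while its sibling `json.loads` accepts a string. This sketch is written for Python 3, where `StringIO` lives in the `io` module; the JSON text is made up for the example.

```python
import json
from io import StringIO

text = '{"name": "James", "values": [1, 2, 3]}'

# json.load wants something with a .read() method, so wrap the string.
from_file_like = json.load(StringIO(text))

# json.loads parses a string directly.
from_string = json.loads(text)

# Handing the bare string to json.load fails, just like mystring.readline().
try:
    json.load(text)
except AttributeError as error:
    message = str(error)

print(from_file_like == from_string)
print(message)
```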

### Closing files

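We really ought to close files when we've finished with them: each open file holds an operating-system resource, and closing it
releases that resource and makes sure any pending writes reach the disk. (On a shared computer, this is particularly important.)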

In [27]:
myfile.close()

Because it's so easy to forget this, Python provides a **context manager** to open a file, then close it automatically at
the end of an indented block:

In [28]:
with open('mydata.txt') as somefile:
    content = somefile.read()
print(content)

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. 
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. 
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.


The code to be done while the file is open is indented, just like for an `if` statement.

In [29]:
print(somefile)

<closed file 'mydata.txt', mode 'r' at 0x107b041e0>


You should pretty much **always** use this syntax for working with files.
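One reason to prefer `with` is that the file is closed even if something goes wrong inside the block. A minimal Python 3 sketch, in which the `ValueError` stands in for any error raised while processing a (made-up) file:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")
    with open(path, "w") as target:
        target.write("some data\n")

    try:
        with open(path) as source:
            first = source.readline()
            # Simulate a failure part-way through processing.
            raise ValueError("something went wrong while processing")
    except ValueError:
        pass

    # The with block closed the file on the way out, despite the error.
    closed_after_error = source.closed

print(closed_after_error)
```

With a bare `open` and a later `close()`, the error would have skipped the `close()` call entirely.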

### Writing files

We might want to create a file from a string in memory. We can't do this with the notebook's `%%writefile`, which only saves the literal text typed into a cell.

When we open a file, we can specify a 'mode', in this case, 'w' for writing. ('r' for reading is the default.)

In [32]:
with open('mywrittenfile', 'w') as target:
    target.write('Hello')
    target.write('World')

In [33]:
with open('mywrittenfile','r') as source:
    print(source.read())

HelloWorld


And we can "append" to a file:

In [34]:
with open('mywrittenfile', 'a') as target:
    target.write('Hello')
    target.write('James')

In [35]:
with open('mywrittenfile','r') as source:
    print(source.read())

HelloWorldHelloJames
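As the output shows, `write` does not add newlines for us: if we want separate lines we must include `\n` ourselves. It is also worth knowing that mode `'w'` discards whatever the file already contained, whereas `'a'` keeps it. A minimal Python 3 sketch of both points, using a temporary file with a made-up name:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")

    # 'w' creates the file; we supply the newlines ourselves.
    with open(path, "w") as target:
        target.write("Hello\n")
        target.write("World\n")

    # 'a' adds to the end of the existing contents.
    with open(path, "a") as target:
        target.write("Hello James\n")
    with open(path) as source:
        after_append = source.read()

    # Opening with 'w' again throws the old contents away.
    with open(path, "w") as target:
        target.write("Starting again\n")
    with open(path) as source:
        after_rewrite = source.read()

print(repr(after_append))
print(repr(after_rewrite))
```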
