<img alt='UCL' src="images/ucl_logo.png" align='center'>


[<img src="images/noun_post_2109127.svg" width="50" align='right'>](016_Python_for.ipynb)
[<img src="images/noun_pre_2109128.svg" width="50" align='right'>](018_Python_xxx.ipynb)



# 021 Read Files


## Introduction


### Purpose

In this session, we will learn how to read files and similar resources. We will mainly use [`pathlib`](https://docs.python.org/3/library/pathlib.html) and the local package [gurlpath](geog0111/gurlpath) derived from [`urlpath`](https://github.com/chrono-meter/urlpath). We will also cover opening and closing files, and some simple read and write operations.


### Prerequisites

You will need some understanding of the following:


* [001 Using Notebooks](001_Notebook_use.ipynb)
* [002 Unix](002_Unix.ipynb) with a good familiarity with the UNIX commands we have been through.
* [003 Getting help](003_Help.ipynb)
* [010 Variables, comments and print()](010_Python_Introduction.ipynb)
* [011 Data types](011_Python_data_types.ipynb) 
* [012 String formatting](012_Python_strings.ipynb)
* [013_Python_string_methods](013_Python_string_methods.ipynb)
* [020_Python_files](020_Python_files.ipynb)

You will need to recall details from [020_Python_files](020_Python_files.ipynb) on using the two packages.


### Test
You will need a web login to NASA Earthdata, stored using `cylog` as described in [004_Accounts](004_Accounts.ipynb) for the site `https://e4ftl01.cr.usgs.gov`. We can test this with the following code:

In [20]:
from geog0111.gurlpath import URL
# ping small (1.3 M) test file
site='https://e4ftl01.cr.usgs.gov/'
test_dir='MOLA/MYD11_L2.006/2002.07.04'
test_file='MYD11_L2*0325*.hdf'
# this glob interprets the wildcards to get at a suitable test file
url = URL(site,test_dir).glob(test_file,verbose=False)[0]
# test ping returns True
assert url.ping(verbose=False) == True

If this fails, set `verbose` to `True` to see what is going on. If you can't work it out from there, go back to [004_Accounts](004_Accounts.ipynb) and sort out the NASA Earthdata login for the site `https://e4ftl01.cr.usgs.gov`.

In [None]:
# ANSWER
from pathlib import Path

# print out the absolute pathname of the 
# directory that images/ucl.png is in
ucl = Path('images','ucl.png')

# use absolute and parent
# Use name to show how that is helpful
print(f'The directory {ucl.name} is in is: {ucl.absolute().parent}')

# check that the file exists
# if it does ...
if ucl.exists():
    # print the size of the file in KB to two decimal places

    # from above, use stat().st_size
    size_in_bytes = ucl.stat().st_size
    # 1024 Bytes -> 1 KB
    size_in_KB = size_in_bytes/1024
    # 2 dp -> : .2f
    print(f'file size {size_in_bytes} Bytes -> {size_in_KB : .2f} KB')
else:
    print('file does not exist')

## Reading and writing

We can conveniently use `pathlib` to deal with file input and output. The main methods to be aware of are:


|command|  purpose|
|---|---|
|`Path.open()`| open a file and return a file descriptor|
|`Path.read_text()`|  read text|
|`Path.write_text()`| write text|
|`Path.read_bytes()`| read byte data|
|`Path.write_bytes()`| write byte data|
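
As a minimal sketch of how these methods fit together, the code below writes a short string to a temporary file and reads it back as text and as bytes. The file name `demo.txt` and the variable names are only for illustration. It also shows a detail worth knowing: `Path.write_text()` returns the number of characters written, which matches the file size in bytes only when every character is encoded as a single byte.

```python
from pathlib import Path
import tempfile

# work in a throwaway directory so nothing in the notebooks is touched
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp, 'demo.txt')   # hypothetical file name

    # write_text() returns the number of characters written
    # ('caf\u00e9' is 'cafe' with an accented e: 4 characters)
    nchars = demo.write_text('caf\u00e9', encoding='utf-8')

    # read_text() returns the contents as a str
    text = demo.read_text(encoding='utf-8')

    # read_bytes() returns the raw, encoded bytes
    raw = demo.read_bytes()

    # the size on disk, found without reading the file
    size_on_disk = demo.stat().st_size

print(f'{nchars} characters, {len(raw)} bytes read, {size_on_disk} bytes on disk')
```

The accented character takes two bytes in UTF-8, so the character count and the byte count differ here. For plain ASCII text the two numbers are the same.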


For `gurlpath` we have the following equivalent functions:





|command|  purpose|
|---|---|
|`URL.open()`| open a file descriptor with data from a URL|
|`URL.read_text()`|  read text from URL|
|`URL.write_text()`| write text to file|
|`URL.read_bytes()`| read byte data from URL|
|`URL.write_bytes()`| write byte data to file|

Notice that the `write` functions (and `open` when used for write) write to local files, not to the URL. 

They have a keyword argument `local_file` to set the location to write the file to. If this is not given, the directory structure of the URL is used (relative to the current directory). Alternatively, you can set the keyword `local_dir`, or set `URL.local_file` or `URL.local_dir` as appropriate. 

Note that `URL` is tolerant of calling with a `Path`: if we call `URL` with a local file, most operations will continue and apply the appropriate `Path` function.
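
To make the idea of a default local file name concrete, here is a rough sketch using only the standard library. The helper `local_path_for` is hypothetical and is not how `gurlpath` is implemented. In particular, we assume that the host name is left out of the mirrored directory structure, which may not match what `gurlpath` does.

```python
from pathlib import Path
from urllib.parse import urlparse

def local_path_for(url, local_dir=None):
    '''
    Hypothetical helper (not part of gurlpath): choose a
    local file name for the data at url.

    Positional Arguments:
    url  : str, the URL

    Keyword Arguments:
    local_dir : str or None -> None
    '''
    # the path part of the URL, e.g. '/a/b/file.txt'
    url_path = urlparse(url).path
    if local_dir is not None:
        # keep only the file name, and put it in local_dir
        return Path(local_dir, Path(url_path).name)
    # otherwise mirror the URL directory structure,
    # relative to the current directory (assumption: no host name)
    return Path(url_path.lstrip('/'))

u = 'https://www.json.org/json-en.html'
print(local_path_for(u, local_dir='data'))
print(local_path_for(u))
```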

### `with ... as ...`, `Path.open`, `URL.open`, `yaml`, `json`

Quite often, we will use specific packages for reading particular file formats. But often we just need to be able to open a file (to get a file descriptor), or to read binary or text data from a file, or to write binary or text data straight to a file. We use the suite of functions given above for such tasks. 

The first of these, `Path.open` provides a file descriptor for the open file. This is used to interface to other input/output functions in Python. A typical example of this is reading a configuration file in [`yaml` format](http://zetcode.com/python/yaml/).

The usual way of opening a file to get the file descriptor is:

    with Path(filename).open('r') as f:
       # do some reading with f
       pass
       

We use the form `with ... as ...` here, so that the file is automatically closed when we finish, and the file descriptor `f` is only usable within this construct. Code is indented inside the construct, as we have seen in `if ...` or `for ... in ...` constructs.

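Here, we have set the flag `r` within the `open()` statement (this is the default mode). This means that the file will be opened for *reading* only. Alternatives include `w` for writing (which overwrites any existing content) and `a` for appending. The mode `w+` opens the file for both writing and reading, again overwriting any existing content.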
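
A small sketch of the most common modes, using a temporary file so that nothing in the notebooks directory is changed. The file name `log.txt` is only for illustration.

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp, 'log.txt')     # hypothetical file name

    # 'w' creates the file, or empties it if it already exists
    with log.open('w') as f:
        f.write('first line\n')

    # 'a' keeps the existing content and adds to the end
    with log.open('a') as f:
        f.write('second line\n')

    # 'r' (the default) opens the file for reading only
    with log.open('r') as f:
        lines = f.readlines()

    # leaving the with block has closed the file for us
    is_closed = f.closed

print(lines)
print(f'file closed: {is_closed}')
```

Opening the same file again with `w` would throw away both lines, which is why `a` is the safer choice when adding to a file.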

In the following example, we use `Path` to open the file [`bin/copy/environment.yml`](bin/copy/environment.yml) and read it using the `yaml` library. This file specifies which packages are loaded in our Python environment. It has a simple ascii format, but since it is a `yaml` file, we should read it with code that interprets the format correctly and safely into a dictionary. This is done using `yaml.safe_load(f)` with `f` an open file descriptor.

In [6]:
from pathlib import Path
import yaml

# form the file name
yaml_file = Path('bin','copy','environment.yml')

with yaml_file.open('r') as f:
    env = yaml.safe_load(f)

print(f'env is type {type(env)}')
print(f'env keys: {env.keys()}')

env is type <class 'dict'>
env keys: dict_keys(['name', 'channels', 'dependencies'])


The equivalent, reading the data from a URL is:

In [7]:
from geog0111.gurlpath import URL
import yaml

# form the file name
site = 'https://raw.githubusercontent.com'
site_dir = '/UCL-EO/geog0111/master'
site_file = 'copy/environment.yml'
yaml_file = URL(site,site_dir,site_file)

# notice that we can use verbose=True for URL open
with yaml_file.open('r',verbose=True) as f:
    env = yaml.safe_load(f)

print(f'env is type {type(env)}')
print(f'env keys: {env.keys()}')

env is type <class 'dict'>
env keys: dict_keys(['name', 'channels', 'dependencies'])


--> reading data from https://raw.githubusercontent.com/UCL-EO/geog0111/master/copy/environment.yml
--> open() text stream


Another common file format for configuration information is [`json`](https://www.json.org/json-en.html). We can use the same form of code as above to write the information in `env` into a `json` format file:

In [8]:
from pathlib import Path
import json

# form the file name
json_file = Path('bin','copy','environment.json')

with json_file.open('w') as f:
    json.dump(env, f)
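
To read a `json` file back into a dictionary, we can use `json.load(f)` in the same way as `yaml.safe_load(f)`. The sketch below is self-contained, so it uses a small stand-in dictionary `env_demo` with made-up values (only the three keys are taken from the example above) and writes to a temporary directory.

```python
import json
import tempfile
from pathlib import Path

# a small stand-in for env: the values here are made up
env_demo = {
    'name': 'demo',
    'channels': ['conda-forge'],
    'dependencies': ['python', 'numpy'],
}

with tempfile.TemporaryDirectory() as tmp:
    json_demo = Path(tmp, 'environment.json')

    # write the dictionary to the json file
    with json_demo.open('w') as f:
        json.dump(env_demo, f, indent=2)

    # read it back into a new dictionary
    with json_demo.open('r') as f:
        env_back = json.load(f)

print(f'keys: {list(env_back.keys())}')
print(f'same as original: {env_back == env_demo}')
```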

## Read and write text

We can read text from a file with `Path.read_text()` or from a URL with `URL.read_text()`, then either `Path.write_text()` or  `URL.write_text()` to write text to a file:

In [14]:
# from https://www.json.org
some_text = '''
It is easy for humans to read and write.
It is easy for machines to parse and generate. 
'''

# set up the filename
outfile = Path('work/easy.txt')
# write the text
nbytes = outfile.write_text(some_text)
# print what we did
print(f'wrote {nbytes} bytes to {outfile}')

wrote 90 bytes to work/easy.txt


#### Exercise 1

* Using `Path.read_text()` read the text from the file `work/easy.txt` and print the text returned.
* split the text into lines of text using `str.split()` at each newline, and print out the resulting list

You learned how to split strings in [013_Python_string_methods](013_Python_string_methods.ipynb#split()-and-join())

In [17]:
# ANSWER
# Using `Path.read_text()` read the text from the 
# file `work/easy.txt` and print the text returned.

text = Path('work/easy.txt').read_text()
print(f'I have read:\n{text}')

# split the text into lines of text using `str.split()` 
# at each newline, and print out the resulting list
text_list = text.split('\n')
print(f'lines list:\n{text_list}')

I have read:

It is easy for humans to read and write.
It is easy for machines to parse and generate. 

lines list:
['', 'It is easy for humans to read and write.', 'It is easy for machines to parse and generate. ', '']
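
Notice the empty strings at the start and end of the list: the text begins and ends with a newline, and `split('\n')` keeps the empty pieces on either side. As a sketch of two alternatives, `str.splitlines()` does not produce the trailing empty string, and a list comprehension can drop blank lines altogether. The text is repeated here so that the example stands alone.

```python
text = ('\nIt is easy for humans to read and write.'
        '\nIt is easy for machines to parse and generate. \n')

# split at every newline: keeps empty strings at both ends
by_newline = text.split('\n')

# splitlines: no empty string after the final newline
by_lines = text.splitlines()

# keep only lines that contain something, with whitespace stripped
non_empty = [line.strip() for line in text.splitlines() if line.strip()]

print(len(by_newline), len(by_lines), len(non_empty))
print(non_empty)
```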


We can show that we get the same result writing the data we read from the web to a local file with either `URL` or `Path`:

In [11]:
from geog0111.gurlpath import URL
from pathlib import Path

# first read the data
u = 'https://www.json.org/json-en.html'
url = URL(u)
# set the output dir
url.local_dir='data'

data = url.read_text(verbose=False)

# write to 'data/json-en.html' with URL
osize = url.write_text(data)
# test the correct number of bytes
assert osize == 26718
print('passed URL')

# write to 'data/json-en.html' with Path
osize = Path('data/json-en.html').write_text(data)
# test the correct number of bytes
assert osize == 26718
print('passed Path')


passed URL
passed Path


The `URL` class has a few advantages over using `Path` in this way:

* if the output directory doesn't already exist, it will be created
* if we set a `noclobber=True` flag, then we will not try to write the file if it already exists.

For example:

In [13]:
from geog0111.gurlpath import URL
from pathlib import Path

# first read the data
u = 'https://www.json.org/json-en.html'
url = URL(u)
url.local_dir='data'

data = url.read_text()

# write to 'data/json-en.html' with URL
osize = url.write_text(data,verbose=True,noclobber=True)

--> local file data/json-en.html
--> existing file data/json-en.html
--> noclobber: True
--> opening local file data/json-en.html
--> mkdir local dir data
--> file exists so not writing
--> done : 26880
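
If we only have `Path` available, we can get similar behaviour with a few extra lines. The helper `write_text_safely` below is a hypothetical sketch, not the `gurlpath` implementation: it creates the output directory if needed, and skips the write when `noclobber=True` and the file already exists. Unlike the `URL` example above, it returns `None` when nothing is written.

```python
from pathlib import Path
import tempfile

def write_text_safely(ofile, text, noclobber=False):
    '''
    Hypothetical helper: write text to ofile, creating
    the directory if it does not exist.

    Returns the number of characters written, or None
    if noclobber is set and the file already exists.
    '''
    ofile = Path(ofile)
    if noclobber and ofile.exists():
        # file exists so not writing
        return None
    # make the output directory (and any parents) if needed
    ofile.parent.mkdir(parents=True, exist_ok=True)
    return ofile.write_text(text)

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp, 'data', 'sub', 'hello.txt')
    # the directories data/sub do not exist yet
    first = write_text_safely(target, 'hello', noclobber=True)
    # the second call does not overwrite the file
    second = write_text_safely(target, 'changed', noclobber=True)
    content = target.read_text()

print(first, second, content)
```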


#### Exercise 2

XXX TODO XXX

In [None]:
# ANSWER
# Using Path.read_text() read the text from the file work/easy.txt 
# and print the text returned.

# set up the filename
infile = Path('work','easy.txt')
# read the text
read_text = infile.read_text()

# split the text into lines of 
# text using str.split() at each newline, 
# and print out the resulting list
lines = read_text.split('\n')
print(lines)

## Read and write binary data

We can read binary data from a file with `Path.read_bytes()` or from a URL with `URL.read_bytes()`, then either `Path.write_bytes()` or  `URL.write_bytes()` to write the binary data to a file.
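
Before using a real dataset, here is a minimal sketch of the same idea with a few arbitrary bytes and a temporary file. The file name `demo.bin` is only for illustration. Binary data are held in a `bytes` object, and `Path.write_bytes()` returns the number of bytes written.

```python
from pathlib import Path
import tempfile

# six arbitrary byte values, including some that are not valid text
payload = bytes([0, 1, 2, 253, 254, 255])

with tempfile.TemporaryDirectory() as tmp:
    bfile = Path(tmp, 'demo.bin')   # hypothetical file name
    # write the bytes: returns the number of bytes written
    nbytes = bfile.write_bytes(payload)
    # read them back
    data = bfile.read_bytes()

print(f'wrote {nbytes} bytes, read {len(data)} bytes of type {type(data).__name__}')
print(f'same data: {data == payload}')
```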

Let's first access a MODIS file from the web, as we did in [020_Python_files](020_Python_files.ipynb):

In [24]:
from  geog0111.modis import Modis

modis = Modis('MCD15A3H',verbose=True)
url = modis.get_url("2020","01","01")[0]

--> wildcards in: ['*.h08v06*.hdf']
--> level 0/1 : *.h08v06*.hdf
--> trying https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2020.01.01
--> discovered 1 files with pattern *.h08v06*.hdf in https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2020.01.01


Now, pull the dataset and save it to a local file:

In [25]:
# set the output directory
url.local_dir = 'work'
# read the dataset
hdf_data = url.read_bytes()
# and save to a file
obytes = url.write_bytes(hdf_data,verbose=True)

--> local file work/MCD15A3H.A2020001.h08v06.006.2020006032951.hdf
--> opening local file work/MCD15A3H.A2020001.h08v06.006.2020006032951.hdf
--> mkdir local dir work
--> writing data ...
--> done : 9067184
--> done : 9067184


#### Exercise 3

Using the code:
    
    from  geog0111.modis import Modis

    # get URL
    modis = Modis('MCD15A3H',verbose=True)
    url = modis.get_url("2020","01","01")[0]
    # set the output directory
    url.local_dir = 'work'
    
    # read the dataset
    hdf_data = url.read_bytes()
    # and save to a file
    obytes = url.write_bytes(hdf_data,verbose=True)    

* write a function that only calls `url.read_bytes()` if the file doesn't already exist
* If it already exists, just read the data from that file
* test your code with the url generated above and show that the file size is 9067184 bytes

You will need to remember how to get the filename from the URL object, and also to test if a file exists. We learned all of these in [020_Python_files](020_Python_files.ipynb).

Note that `len(data)` gives the size in bytes of the binary data in `data`.

In [28]:
# ANSWER

# write a function that only calls url.read_bytes() 
# if the file doesn't already exist
def get_data(url,verbose=False,local_dir='work'):
    '''
    Get the binary data from url if the 
    output file doesn't exist
    
    Positional Arguments:
    url  : a URL object
    
    Keyword Arguments:
    verbose  : Bool -> False
    local_dir : str -> work
    '''
    # get the output file name
    # url.name gives the file name from the URL
    ofile = Path(local_dir,url.name)
    
    # test exists
    if ofile.exists():
        # If it already exists, 
        # just read the data from that file
        return ofile.read_bytes()
    
    # otherwise read data from url:
    # set output dir
    url.local_dir = local_dir
    # pass on verbose flag
    hdf_data = url.read_bytes(verbose=verbose)
    # save to a local file, passing on the verbose flag
    url.write_bytes(hdf_data,verbose=verbose)
    return hdf_data

In [35]:
# ANSWER

from geog0111.modis import Modis
modis = Modis('MCD15A3H',verbose=False)
url = modis.get_url("2020","01","01")[0]

hdf_data = get_data(url,verbose=True,local_dir='work')
assert len(hdf_data) == 9067184
print('passed')

passed
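
The same pattern can be tried without a network connection or a login. In the sketch below, `cached_bytes` and `fake_fetch` are hypothetical names: `fake_fetch` stands in for the call to `url.read_bytes()`, and counts how often it is called so that we can check the data are only fetched once.

```python
from pathlib import Path
import tempfile

def cached_bytes(ofile, fetch):
    '''
    Hypothetical helper: return the bytes in ofile if it
    exists, otherwise call fetch() and save the result.
    '''
    ofile = Path(ofile)
    if ofile.exists():
        # already have a local copy: just read it
        return ofile.read_bytes()
    # no local copy: get the data and save it
    data = fetch()
    ofile.parent.mkdir(parents=True, exist_ok=True)
    ofile.write_bytes(data)
    return data

calls = []
def fake_fetch():
    '''stand-in for url.read_bytes(): records each call'''
    calls.append(1)
    return b'pretend this came from the web'

with tempfile.TemporaryDirectory() as tmp:
    ofile = Path(tmp, 'work', 'demo.hdf')
    first = cached_bytes(ofile, fake_fetch)    # fetches and saves
    second = cached_bytes(ofile, fake_fetch)   # reads the saved file

print(f'fetch called {len(calls)} time(s), same data: {first == second}')
```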


#### Exercise 4

* print out the absolute pathname of the directory that the binary file [`images/ucl.png`](images/ucl.png) is in
* print the size of the file in kilobytes (KB) to two decimal places without reading the datafile. 
* read the datafile, and check you get the same data size

You will need to recall how to find a file size in bytes using `Path`. This was covered in [020_Python_files](020_Python_files.ipynb). You will need to know how many bytes are in a KB. To print to two decimal places, you need to recall the string formatting we did in [012_Python_strings](012_Python_strings.ipynb#String-formating).

In [38]:
# ANSWER 

# print out the absolute pathname of the 
# directory that images/ucl.png is in
abs_name = Path('images/ucl.png').absolute()
print(abs_name)

# we want the parent!
print(f'the file {abs_name.name} is in {abs_name.parent}')

# print the size of the file in bytes without reading the datafile. 
print(f'{abs_name.name} has size {abs_name.stat().st_size} bytes')

# 1 KB is 1024 Bytes
# .2f is 2 d.p. format
print(f'{abs_name.name} has size '
      f'{abs_name.stat().st_size/1024:.2f} KB')

# read the datafile, and check you get the same data size
dataset = abs_name.read_bytes()
# size
s = len(dataset)
print(f'the size of data read is {s} bytes -> {s/1024 : .2f} KB')

/Users/plewis/Documents/GitHub/geog0111/notebooks/images/ucl.png
the file ucl.png is in /Users/plewis/Documents/GitHub/geog0111/notebooks/images
ucl.png has size 1956 bytes
ucl.png has size 1.91 KB
the size of data read is 1956 bytes ->  1.91 KB
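
You may have noticed the extra space before `1.91` in the last line of output. This comes from the space in the format `: .2f`, which asks Python to leave a space in front of positive numbers where a minus sign would otherwise go. A short sketch of the difference, using the file size found above:

```python
size_in_bytes = 1956
# 1 KB is 1024 Bytes
size_in_kb = size_in_bytes / 1024

# .2f : two decimal places, no padding
plain = f'{size_in_kb:.2f}'
# a space before .2f : leave a space where a minus sign would go
spaced = f'{size_in_kb: .2f}'

# repr() shows the quotes, so the leading space is visible
print(repr(plain), repr(spaced))
```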


## Summary

In this section, we have used `Path` and `URL` classes to read and write text and binary files. We have combined these ideas with earlier work to download and save a MODIS datafile and other text and binary datasets. We have refreshed our memory of some of the earlier material, especially string formatting.

You should now have some confidence in these matters, so that if you were set a task of downloading and saving datasets, or related tasks such as finding their size or whether they exist, you could do this. 


[<img src="images/noun_post_2109127.svg" width="50" align='right'>](016_Python_for.ipynb)
[<img src="images/noun_pre_2109128.svg" width="50" align='right'>](014_Python_groups.ipynb)
