<img alt='UCL' src="images/ucl_logo.png" align='center'>


[<img src="images/noun_post_2109127.svg" width="50" align='right'>](016_Python_for.ipynb)
[<img src="images/noun_pre_2109128.svg" width="50" align='right'>](018_Python_xxx.ipynb)



# 022 Read and Write: URLs and files


## Introduction


### Purpose

In the previous session, we used [`pathlib`](https://docs.python.org/3/library/pathlib.html) and the local package [gurlpath](geog0111/gurlpath) derived from [`urlpath`](https://github.com/chrono-meter/urlpath) to open object streams from URLs and files. 

In this session, we will extend this to reading and writing text and binary data, using both files and URLs.

### Prerequisites

You will need some understanding of the following:


* [001 Using Notebooks](001_Notebook_use.ipynb)
* [002 Unix](002_Unix.ipynb) with a good familiarity with the UNIX commands we have been through.
* [003 Getting help](003_Help.ipynb)
* [010 Variables, comments and print()](010_Python_Introduction.ipynb)
* [011 Data types](011_Python_data_types.ipynb) 
* [012 String formatting](012_Python_strings.ipynb)
* [013_Python_string_methods](013_Python_string_methods.ipynb)
* [020_Python_files](020_Python_files.ipynb)

You will need to recall details from [020_Python_files](020_Python_files.ipynb) on using the two packages.

### Test

You should run a [NASA account test](notebooks/004_Accounts.ipynb#Test) if you have not already done so.

## Reading and writing

As before, we note that we can conveniently use `pathlib` to deal with file input and output. The main methods we have seen are:


|command|  purpose|
|---|---|
|`Path.open()`| open a file and return a file object|
|`Path.read_text()`|  read text|
|`Path.write_text()`| write text|
|`Path.read_bytes()`| read byte data|
|`Path.write_bytes()`| write byte data|
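
As a quick refresher, the following minimal sketch exercises each of the `Path` methods in the table. The filenames `demo.txt` and `copy.bin` are just illustrative names, and a temporary directory is used so that nothing in your own directories is changed.

```python
import tempfile
from pathlib import Path

# work in a temporary directory so no existing files are touched
tmp_dir = Path(tempfile.mkdtemp())
demo_file = tmp_dir / 'demo.txt'

# write_text() returns the number of characters written
n_chars = demo_file.write_text('hello\nworld\n')

# read_text() returns the whole file as a str
text = demo_file.read_text()

# read_bytes() returns the same content as raw bytes
raw = demo_file.read_bytes()

# write_bytes() writes raw bytes and returns the number of bytes written
copy_file = tmp_dir / 'copy.bin'
n_bytes = copy_file.write_bytes(raw)

# open() returns a file object; the with block closes it for us
with demo_file.open('r') as f:
    first_line = f.readline()

print(n_chars, n_bytes, repr(first_line))
```

The `read_*` and `write_*` methods open and close the file for you, so `open()` is only needed when you want to process a file piece by piece.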


For `gurlpath` we have the following equivalent functions:





|command|  purpose|
|---|---|
|`URL.open()`| open a file object with data from a URL|
|`URL.read_text()`|  read text from URL|
|`URL.write_text()`| write text to file|
|`URL.read_bytes()`| read byte data from URL|
|`URL.write_bytes()`| write byte data to file|

Recall that the `write` functions (and `open` when used for writing) write to local files, not to the URL. They have a keyword argument `local_file` to set the location of the file to write. If this is not given, the directory structure of the URL is used (relative to the current directory). Alternatively, you can set the keyword `local_dir`, or set `URL.local_file` or `URL.local_dir` as appropriate.
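
To make the idea of deriving a local filename from a URL concrete, here is a small sketch using only the standard library. The function `guess_local_file` is hypothetical and is not part of `gurlpath`. In particular, placing just the filename inside `local_dir` is an assumption made for illustration, so check `url.local()` to see what the package actually does.

```python
from pathlib import Path
from urllib.parse import urlparse

def guess_local_file(url, local_dir=None):
    '''
    Hypothetical helper (not part of gurlpath): suggest a
    local filename for a URL
    '''
    # the path part of the URL, without the leading '/'
    url_path = Path(urlparse(url).path.lstrip('/'))
    if local_dir is not None:
        # assumption: keep only the filename, placed in local_dir
        return Path(local_dir) / url_path.name
    # otherwise mirror the URL directory structure,
    # relative to the current directory
    return url_path

u = 'https://www.json.org/json-en.html'
print(guess_local_file(u))
print(guess_local_file(u, local_dir='work'))
```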

Note that `URL` is tolerant of calling with a `Path`: if we call `URL` with a local file, most operations will continue and apply the appropriate `Path` function.

## Read and write text

We can read text from a file with `Path.read_text()` or from a URL with `URL.read_text()`, and write text to a file with either `Path.write_text()` or `URL.write_text()`:

In [1]:
from pathlib import Path
# from https://www.json.org
some_text = '''
It is easy for humans to read and write.
It is easy for machines to parse and generate. 
'''

# set up the filename
outfile = Path('work/easy.txt')
# write the text
nbytes = outfile.write_text(some_text)
# print what we did
print(f'wrote {nbytes} bytes to {outfile}')

wrote 90 bytes to work/easy.txt


#### Exercise 1

* Using `Path.read_text()`, read the text from the file `work/easy.txt` and print the text returned.
* Split the text into lines using `str.split()` at each newline, and print out the resulting list.

You learned how to split strings in [013_Python_string_methods](013_Python_string_methods.ipynb#split()-and-join()).

In [2]:
# ANSWER
# Using `Path.read_text()` read the text from the 
# file `work/easy.txt` and print the text returned.

text = Path('work/easy.txt').read_text()
print(f'I have read:\n{text}')

# split the text into lines of text using `str.split()` 
# at each newline, and print out the resulting list
text_list = text.split('\n')
print(f'lines list:\n{text_list}')

I have read:

It is easy for humans to read and write.
It is easy for machines to parse and generate. 

lines list:
['', 'It is easy for humans to read and write.', 'It is easy for machines to parse and generate. ', '']
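
Notice the empty strings at the start and end of the list, which come from the leading and trailing newlines. As a small aside, this sketch compares `str.split('\n')` with `str.splitlines()` on a short made-up string, and shows how `str.strip()` removes the empty entries.

```python
# a short example string with leading and trailing newlines
text = '\nfirst line\nsecond line\n'

# split at each newline: empty strings appear at both ends
by_split = text.split('\n')

# splitlines() does not produce the trailing empty string
by_splitlines = text.splitlines()

# strip() the surrounding whitespace first to drop both
stripped = text.strip().splitlines()

print(by_split)
print(by_splitlines)
print(stripped)
```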


We can show that we get the same result whether we read the file locally from [`data/json-en.html`](data/json-en.html) or over the web from [`https://www.json.org/json-en.html`](https://www.json.org/json-en.html):

In [21]:
from geog0111.gurlpath import URL
from pathlib import Path

# first read the data from the URL, ignoring any cached
# copy (noclobber=False), with local files in directory work
u = 'https://www.json.org/json-en.html'
url = URL(u,local_dir='work',verbose=True,noclobber=False)
data_url = url.read_text()

# then from file in directory data
data_file = Path('data/json-en.html').read_text()

assert data_url == data_file
print('files are the same')

--> reading db file /Users/plewis/.url_db/.db.yml
--> updated cache database in /Users/plewis/.url_db/.db.yml
--> db file [PosixPath('/Users/plewis/.url_db/.db.yml')]
--> trying https://www.json.org/json-en.html


files are the same


## Read and write binary data

We can read binary data from a file with `Path.read_bytes()` or from a URL with `URL.read_bytes()`, and write binary data to a file with either `Path.write_bytes()` or `URL.write_bytes()`. Other than that, and the fact that we cannot directly visualise the contents of binary files without code to interpret them, there is no real difference in how we treat text and binary files.
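
To illustrate the relationship between the text and binary views of a file, the sketch below writes a short string and reads it back both ways. The filename `cafe.txt` is just an example, and the UTF-8 encoding is stated explicitly so that the result does not depend on your system settings.

```python
import tempfile
from pathlib import Path

tmp_file = Path(tempfile.mkdtemp()) / 'cafe.txt'

# the last character (e-acute) is one character,
# but it is stored as two bytes in UTF-8
tmp_file.write_text('caf\u00e9', encoding='utf-8')

# the same file, read as text and as bytes
as_text = tmp_file.read_text(encoding='utf-8')
as_bytes = tmp_file.read_bytes()

# 4 characters, but 5 bytes
print(len(as_text), len(as_bytes))

# decoding the bytes recovers the text
print(as_bytes.decode('utf-8') == as_text)
```

So text is simply bytes with an encoding applied, which is why the number of characters and the number of bytes can differ.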

Let's first access a MODIS file from the web, as we did in [020_Python_files](020_Python_files.ipynb). Here, the `kwargs` are passed on to `URL`:

In [27]:
from geog0111.modis import Modis

kwargs = {
    'verbose'    : True,
    'db_dir'     : 'work',
    'local_dir'  : 'work',
}

modis = Modis('MCD15A3H',**kwargs)
url = modis.get_url("2020","01","01")[0]

--> reading db file /Users/plewis/Documents/GitHub/geog0111/notebooks/work/.db.yml
--> updated cache database in /Users/plewis/Documents/GitHub/geog0111/notebooks/work/.db.yml
--> db file [PosixPath('/Users/plewis/Documents/GitHub/geog0111/notebooks/work/.db.yml')]
--> retrieving glob https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2020.01.01/*.h08v06*.hdf from database


Then read the data. Cached data will be used where available unless we set `noclobber=False`.

In [30]:
b  = url.read_bytes()
print(f'data for {url} cached in {url.local()}')

data for https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2020.01.01/MCD15A3H.A2020001.h08v06.006.2020006032951.hdf cached in /Users/plewis/Documents/GitHub/geog0111/notebooks/work/MCD15A3H.A2020001.h08v06.006.2020006032951.hdf


We could explicitly write the data to a file, but since we are using a cache, there is no real point. This means that we can just use the URL to access the dataset. If we do need to give the filename explicitly to any other code, we can use `url.local()`.
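
The caching behaviour described here can be sketched with the standard library. The function `read_bytes_cached` is a hypothetical illustration of the `noclobber` idea, and is not how `gurlpath` is implemented. A `file://` URL is used in the demonstration so that it runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

def read_bytes_cached(url, local_file, noclobber=True):
    '''
    Hypothetical helper (not the gurlpath implementation):
    return the bytes at url, keeping a copy in local_file
    '''
    local_file = Path(local_file)
    if noclobber and local_file.exists():
        # use the cached copy rather than reading the URL again
        return local_file.read_bytes()
    # read from the URL, then store a copy for next time
    with urlopen(url) as response:
        data = response.read()
    local_file.parent.mkdir(parents=True, exist_ok=True)
    local_file.write_bytes(data)
    return data

# set up a small binary file to act as the remote dataset
tmp_dir = Path(tempfile.mkdtemp())
source = tmp_dir / 'source.bin'
source.write_bytes(b'\x00\x01\x02')
cache = tmp_dir / 'cache' / 'source.bin'

# the first read fills the cache
first = read_bytes_cached(source.as_uri(), cache)
# change the source: the cached copy is still returned
source.write_bytes(b'new data')
second = read_bytes_cached(source.as_uri(), cache)
# noclobber=False forces a fresh read and overwrites the cache
third = read_bytes_cached(source.as_uri(), cache, noclobber=False)
print(first, second, third)
```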

#### Exercise 2

Using the code:
    
    from  geog0111.modis import Modis
    
    kwargs = {
        'verbose'    : True,
        'db_dir'     : 'work',
        'local_dir'  : 'work',
    }

    modis = Modis('MCD15A3H',**kwargs)
    # get URLs
    hdf_urls = modis.get_url("2020","01","01")

* Write a function called `get_locals` that loops over each entry in the list `hdf_urls` and returns a list of the local filenames
* Write code to test the function and print results using data from `modis.get_url("2020","01","*")`

In [31]:
from geog0111.gurlpath import URL

# ANSWER
# write a function called `get_locals` that loops 
# over each entry in the list `hdf_urls` and returns the local filename 
def get_locals(hdf_urls):
    '''
    get the cached filenames for the URL list
    '''
    olist = []
    for f in hdf_urls:
        olist.append(f.local())
    return olist

In [32]:
from geog0111.gurlpath import URL

# BETTER ANSWER
# write a function called `get_locals` that loops 
# over each entry in the list `hdf_urls` and returns the local filename 
def get_locals(hdf_urls):
    '''
    get the cached filenames for the URL list
    '''
    return [f.local() for f in hdf_urls]

In [37]:
# write code to test the function and print results 
# using data from modis.get_url("2020","01","*")
kwargs = {
    'db_dir'     : 'work',
    'local_dir'  : 'work',
}

modis = Modis('MCD15A3H',**kwargs)
# get URLs
hdf_urls = modis.get_url("2020","01","*")
# test
print(get_locals(hdf_urls))

[PosixPath('/Users/plewis/Documents/GitHub/geog0111/notebooks/work/MCD15A3H.A2020001.h08v06.006.2020006032951.hdf'), PosixPath('/Users/plewis/Documents/GitHub/geog0111/notebooks/work/MCD15A3H.A2020005.h08v06.006.2020010210940.hdf'), PosixPath('/Users/plewis/Documents/GitHub/geog0111/notebooks/work/MCD15A3H.A2020009.h08v06.006.2020014204616.hdf'), PosixPath('/Users/plewis/Documents/GitHub/geog0111/notebooks/work/MCD15A3H.A2020013.h08v06.006.2020018030252.hdf'), PosixPath('/Users/plewis/Documents/GitHub/geog0111/notebooks/work/MCD15A3H.A2020017.h08v06.006.2020022034013.hdf'), PosixPath('/Users/plewis/Documents/GitHub/geog0111/notebooks/work/MCD15A3H.A2020021.h08v06.006.2020026032135.hdf'), PosixPath('/Users/plewis/Documents/GitHub/geog0111/notebooks/work/MCD15A3H.A2020025.h08v06.006.2020030025757.hdf'), PosixPath('/Users/plewis/Documents/GitHub/geog0111/notebooks/work/MCD15A3H.A2020029.h08v06.006.2020034165001.hdf')]
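
Since the local filenames are `Path` objects, we can apply the usual `Path` methods to them. As a sketch, the hypothetical function `report_files` below takes a list of filenames and returns the name and size in KB of each one that exists. It is tested here on two made-up files in a temporary directory rather than on the MODIS data.

```python
import tempfile
from pathlib import Path

def report_files(filenames):
    '''
    Hypothetical helper: return a list of (name, size in KB)
    for those files in the list that exist
    '''
    report = []
    for f in filenames:
        f = Path(f)
        if f.exists():
            # stat() gives the size in bytes without reading the file
            report.append((f.name, f.stat().st_size / 1024))
    return report

# one file of 2048 bytes, and one that does not exist
tmp_dir = Path(tempfile.mkdtemp())
present = tmp_dir / 'present.bin'
present.write_bytes(bytes(2048))
missing = tmp_dir / 'missing.bin'

for name, size in report_files([present, missing]):
    print(f'{name}: {size:.2f} KB')
```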


#### Exercise 3

* Print out the absolute pathname of the directory that the binary file [`images/ucl.png`](images/ucl.png) is in
* Print the size of the file in kilobytes (KB) to two decimal places without reading the datafile
* Read the datafile, and check you get the same data size

You will need to recall how to find a file size in bytes using `Path`. This was covered in [020_Python_files](020_Python_files.ipynb). You will need to know how many bytes are in a KB. To print to two decimal places, you need to recall the string formatting we did in [012_Python_strings](012_Python_strings.ipynb#String-formating).

In [11]:
# ANSWER 

# print out the absolute pathname of the 
# directory that images/ucl.png is in
abs_name = Path('images/ucl.png').absolute()
print(abs_name)

# we want the parent!
print(f'the file {abs_name.name} is in {abs_name.parent}')

# print the size of the file in bytes without reading the datafile. 
print(f'{abs_name.name} has size {abs_name.stat().st_size} bytes')

# 1 KB is 1024 Bytes
# .2f is 2 d.p. format
print(f'{abs_name.name} has size '
      f'{abs_name.stat().st_size/1024:.2f} KB')

# read the datafile, and check you get the same data size
dataset = abs_name.read_bytes()
# size
s = len(dataset)
print(f'the size of data read is {s} bytes -> {s/1024:.2f} KB')

/Users/plewis/Documents/GitHub/geog0111/notebooks/images/ucl.png
the file ucl.png is in /Users/plewis/Documents/GitHub/geog0111/notebooks/images
ucl.png has size 1956 bytes
ucl.png has size 1.91 KB
the size of data read is 1956 bytes -> 1.91 KB


## Summary

In this section, we have used the `Path` and `URL` classes to read and write text and binary files. We have combined these ideas with earlier work to access MODIS datafiles and other text and binary datasets. For data we access through a URL, we can do file operations on a cached version of the file. We have refreshed our memory of some of the earlier material, especially string formatting.

You should now have some confidence in these matters, so that if you were set a task of downloading and saving datasets, or related tasks such as finding their size or whether they exist, you could do this.


[<img src="images/noun_post_2109127.svg" width="50" align='right'>](016_Python_for.ipynb)
[<img src="images/noun_pre_2109128.svg" width="50" align='right'>](014_Python_groups.ipynb)
