# 5 Useful Libraries
##### **Author: Adam Gatt**

## `os`

https://docs.python.org/3/library/os.html

Provides operating system tools for:

* Showing and changing the working directory
* Listing directory contents
* Reading/setting environment variables
* `os.path` library for file path operations, including:
 * Checking if a file/path exists
 * Building/splitting paths in an OS-independent way
* and more.
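
The path and environment helpers listed above can be sketched as follows. The file names and the `HYPOTHETICAL_EDITOR_VAR` variable are made up for illustration and are not part of the original notebook.

```python
import os

# Build a path from parts; the separator is chosen for the current OS
path = os.path.join('data', 'reports', 'summary.csv')

# Split the path into its directory and file name, then the name into stem and extension
directory, filename = os.path.split(path)
stem, extension = os.path.splitext(filename)
print(directory, filename, stem, extension)

# os.environ.get returns a default instead of raising KeyError when a variable is unset
# (HYPOTHETICAL_EDITOR_VAR is an illustrative name assumed not to exist)
editor = os.environ.get('HYPOTHETICAL_EDITOR_VAR', 'not set')
print(editor)
```

Using `os.environ.get()` with a default is a reasonable choice when a variable may be missing, since indexing `os.environ` directly raises `KeyError` in that case.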

In [6]:
import os

# Print the current working directory
print(f"Current dir: {os.getcwd()}")

# Change the current working directory
os.chdir('/')

# Print the contents of a specified directory
print(f"\nContents of bin: \n{os.listdir('bin')}")

# Fetch environment variables
print(f"\nValue of CONDA_EXE env var: \n{os.environ['CONDA_EXE']}")

filepath = os.path.join('/home', 'adamgatt', 'coding')
print(f"\nDoes '{filepath}' exist? {os.path.exists(filepath)}")

Current dir: /

Contents of bin: 
['bash', 'bunzip2', 'bzcat', 'bzcmp', 'bzdiff', 'bzegrep', 'bzexe', 'bzfgrep', 'bzgrep', 'bzip2', 'bzip2recover', 'bzless', 'bzmore', 'cat', 'chgrp', 'chmod', 'chown', 'cp', 'dash', 'date', 'dd', 'df', 'dir', 'dmesg', 'dnsdomainname', 'domainname', 'echo', 'egrep', 'false', 'fgrep', 'findmnt', 'fusermount', 'grep', 'gunzip', 'gzexe', 'gzip', 'hostname', 'kill', 'less', 'lessecho', 'lessfile', 'lesskey', 'lesspipe', 'ln', 'login', 'ls', 'lsblk', 'mkdir', 'mknod', 'mktemp', 'more', 'mount', 'mountpoint', 'mv', 'netstat', 'nisdomainname', 'pidof', 'ps', 'pwd', 'rbash', 'readlink', 'rm', 'rmdir', 'run-parts', 'sed', 'sh', 'sleep', 'stty', 'su', 'sync', 'tar', 'tempfile', 'touch', 'true', 'ulockmgr_server', 'umount', 'uname', 'uncompress', 'vdir', 'wdctl', 'which', 'ypdomainname', 'zcat', 'zcmp', 'zdiff', 'zegrep', 'zfgrep', 'zforce', 'zgrep', 'zless', 'zmore', 'znew', 'journalctl', 'loginctl', 'networkctl', 'systemctl', 'systemd-ask-password', 'systemd-esc

## `datetime`
https://docs.python.org/3/library/datetime.html

You can represent datetimes using the `datetime` object. The library also provides functionality for comparing and parsing dates, fetching the current datetime and performing calculations with the `timedelta` object.

In [0]:
from datetime import datetime, timedelta

# Get the current date
cur_date = datetime.now()
print(f"Current datetime is {cur_date}")

# Get only the day from the date
print(cur_date.day)
# Print in standard ISO 8601 format
print(cur_date.isoformat())
# Print current day of the week (where 0 = Monday)
print(cur_date.weekday())

# Can perform math on dates using a timedelta object
last_fortnight = cur_date - timedelta(weeks=2)

# Comparison of dates
print(cur_date > last_fortnight)

# Can use strftime to print in custom format, or strptime to read from a custom format
print(last_fortnight.strftime('%d/%m/%Y'))

Current datetime is 2020-03-19 02:09:53.400973
19
2020-03-19T02:09:53.400973
3
True
05/03/2020
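
The cell above mentions `strptime` without using it. A minimal sketch of parsing a date from a custom format and doing arithmetic on the result could look like this (the date string is simply the one printed above):

```python
from datetime import datetime, timedelta

# Parse a string in day/month/year format into a datetime object
parsed = datetime.strptime('05/03/2020', '%d/%m/%Y')
print(parsed.isoformat())  # 2020-03-05T00:00:00

# Adding a timedelta gives a new datetime
two_weeks_later = parsed + timedelta(weeks=2)
print(two_weeks_later.strftime('%d/%m/%Y'))  # 19/03/2020

# Subtracting two datetimes gives a timedelta
gap = two_weeks_later - parsed
print(gap.days)  # 14
```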


## `json`
https://docs.python.org/3/library/json.html

Convert to and from the JSON format.
* A Python `list` is analogous to a JSON array. A Python `dict` is analogous to a JSON object.
* `load()` and `dump()` read from and write to a file or other file-like object (one that supports `.read()` or `.write()` respectively)
* `loads()` and `dumps()` read from and write to a string
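
As a rough sketch of the file-based pair, `dump()` and `load()` can be tried out against an in-memory `io.StringIO` buffer, which stands in for a real file here. The `record` structure is an illustrative example.

```python
import io
import json

record = {'name': 'Adam', 'year_enrolled': 2010, 'courses': ['Programming']}

# dump() writes JSON text to any object with a .write() method
buffer = io.StringIO()
json.dump(record, buffer)

# Rewind the buffer, then load() reads JSON text from any object with a .read() method
buffer.seek(0)
restored = json.load(buffer)
print(restored == record)

# dumps() returns a string instead; indent makes it human-readable
text = json.dumps(record, indent=2)
print(text)
```

With a real file, the same calls would be wrapped in `with open(...) as f:` blocks.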

In [0]:
import json

# Define a structure we want to serialise to JSON
default_courses = ['Programming', 'Databases', 'Networking']
student_list = [{
    'name': 'Adam',
    'year_enrolled': 2010,
    'courses': default_courses
}, {
    'name': 'Brett',
    'year_enrolled': 2012,
    'courses': []
}, {
    'name': 'Ben',
    'year_enrolled': 2014,
    'courses': default_courses
}]

print(student_list)

# Create a JSON-formatted string from our structure
student_json = json.dumps(student_list)
print()
print(student_json)
print(type(student_json))
print(len(student_json))
print(student_json[35:55])

# Parse the JSON string into a Python structure we can interact with
list_again = json.loads(student_json)
print()
print(list_again)
print(type(list_again))
print(list_again[2]['name'])

[{'name': 'Adam', 'year_enrolled': 2010, 'courses': ['Programming', 'Databases', 'Networking']}, {'name': 'Brett', 'year_enrolled': 2012, 'courses': []}, {'name': 'Ben', 'year_enrolled': 2014, 'courses': ['Programming', 'Databases', 'Networking']}]

[{"name": "Adam", "year_enrolled": 2010, "courses": ["Programming", "Databases", "Networking"]}, {"name": "Brett", "year_enrolled": 2012, "courses": []}, {"name": "Ben", "year_enrolled": 2014, "courses": ["Programming", "Databases", "Networking"]}]
<class 'str'>
248
2010, "courses": ["P

[{'name': 'Adam', 'year_enrolled': 2010, 'courses': ['Programming', 'Databases', 'Networking']}, {'name': 'Brett', 'year_enrolled': 2012, 'courses': []}, {'name': 'Ben', 'year_enrolled': 2014, 'courses': ['Programming', 'Databases', 'Networking']}]
<class 'list'>
Ben


In [0]:
# 'courses' refers to the same actual list for items 0 and 2, so changing one affects the other
student_list[0]['courses'][0] = 'Cooking'
print(student_list[2])

# When parsing a JSON string all structures are created anew, so all 'courses' refer to separate lists in memory
print()
list_again[0]['courses'][0] = 'Cooking'
print(list_again[2])

{'name': 'Ben', 'year_enrolled': 2014, 'courses': ['Cooking', 'Databases', 'Networking']}

{'name': 'Ben', 'year_enrolled': 2014, 'courses': ['Programming', 'Databases', 'Networking']}
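
The sharing behaviour described in the comments above can be checked directly with the `is` operator. This is a small sketch with a made-up two-student list rather than the notebook's `student_list`.

```python
import json

shared = ['Programming', 'Databases']
students = [{'name': 'Adam', 'courses': shared},
            {'name': 'Ben', 'courses': shared}]

# Both dicts point at the very same list object
same_before = students[0]['courses'] is students[1]['courses']

# After a JSON round trip, each 'courses' value is a separate list
round_tripped = json.loads(json.dumps(students))
same_after = round_tripped[0]['courses'] is round_tripped[1]['courses']
equal_after = round_tripped[0]['courses'] == round_tripped[1]['courses']

print(same_before, same_after, equal_after)  # True False True
```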


## `pyyaml`
https://pyyaml.org/wiki/PyYAMLDocumentation

The `pyyaml` library allows you to serialise to and from the YAML format.

In [0]:
!pip install pyyaml

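# safe_load is used because recent PyYAML versions require an explicit Loader for load()
from yaml import dump, safe_load

student_yaml = dump(student_list)
print(student_yaml) # Anchors (&id001) and aliases (*id001) allow the shared list to remain a single object in memory
list_from_yaml = safe_load(student_yaml)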
print(list_from_yaml)

- courses: &id001 [Programming, Databases, Networking]
  name: Adam
  year_enrolled: 2010
- courses: []
  name: Brett
  year_enrolled: 2012
- courses: *id001
  name: Ben
  year_enrolled: 2014

[{'courses': ['Programming', 'Databases', 'Networking'], 'name': 'Adam', 'year_enrolled': 2010}, {'courses': [], 'name': 'Brett', 'year_enrolled': 2012}, {'courses': ['Programming', 'Databases', 'Networking'], 'name': 'Ben', 'year_enrolled': 2014}]


## `pprint`
_(pretty print)_

https://docs.python.org/3/library/pprint.html

Provides the `pprint()` function that prints out arbitrary structures with a layout that makes them easier to understand. You can also configure and create a `PrettyPrinter` object for repeated use, specifying:
* `indent` indentation size (in spaces)
* `width` maximum character length for each line
* `compact` whether to fit as many items as possible on each line (`True`) or show one item per line when a sequence does not fit within `width` (`False`, the default)
* `depth` maximum nesting depth to display. Elements nested deeper than this are displayed as `...`
* `sort_dicts` (Python 3.8+) whether keys in a dict should be displayed in sorted order (the default) or in insertion order
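
The effect of these options can be sketched with `pformat()`, which takes the same parameters as `pprint()` but returns the formatted string instead of printing it. The sample structures are illustrative.

```python
from pprint import pformat

numbers = list(range(10))

# With a narrow width the list no longer fits on one line.
# compact=False puts one item per line; compact=True packs several per line.
one_per_line = pformat(numbers, width=20, compact=False)
packed = pformat(numbers, width=20, compact=True)
print(one_per_line)
print(packed)

# depth=2 replaces anything nested deeper than two levels with ...
nested = (1, (2, (3, 4)))
truncated = pformat(nested, depth=2)
print(truncated)  # (1, (2, (...)))

# sort_dicts (Python 3.8+) controls whether keys are sorted or kept in insertion order
unsorted_keys = {'b': 1, 'a': 2}
sorted_view = pformat(unsorted_keys)
insertion_view = pformat(unsorted_keys, sort_dicts=False)
print(sorted_view)     # {'a': 2, 'b': 1}
print(insertion_view)  # {'b': 1, 'a': 2}
```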

In [0]:
from pprint import pprint, PrettyPrinter
pprint(student_list)


my_structure = (1, 2, (3, 4), (5, 6, (7, 8)))
printer = PrettyPrinter(indent=4, width=15, depth=2, compact=False)
print()
printer.pprint(my_structure)

[{'courses': ['Programming',
              'Databases',
              'Networking'],
  'name': 'Adam',
  'year_enrolled': 2010},
 {'courses': [],
  'name': 'Brett',
  'year_enrolled': 2012},
 {'courses': ['Programming',
              'Databases',
              'Networking'],
  'name': 'Ben',
  'year_enrolled': 2014}]

(   1,
    2,
    (3, 4),
    (   5,
        6,
        (...)))


## `tqdm`
_(progress bar)_

https://github.com/tqdm/tqdm _(install first with `pip install tqdm`)_

Allows you to display and update progress bars for long operations. The progress bar will update in-place.
* Can easily be used by wrapping `tqdm()` around the list/generator/iterable that we are progressing through.
* When using in a notebook, use `tqdm_notebook()` instead for more consistent behaviour (and better appearance). Newer `tqdm` releases expose this as `tqdm.notebook.tqdm` and deprecate `tqdm_notebook`.
* If using `tqdm` (non-notebook), avoid `print`ing or it may interfere with the update. Instead you can use `tqdm.write()` to print, with the progress bar shifting below all messages printed this way.

In [0]:
from tqdm import tqdm, tqdm_notebook
from time import sleep

for x in tqdm_notebook(range(12)):
  sleep(1)
  print(f"Printing {x}")

HBox(children=(IntProgress(value=0, max=12), HTML(value='')))

Printing 0
Printing 1
Printing 2
Printing 3
Printing 4
Printing 5
Printing 6
Printing 7
Printing 8
Printing 9
Printing 10
Printing 11



In [0]:
from functools import reduce
from time import sleep
from tqdm import tqdm

def create_white_and_black_piece(piece):
  return ['white ' + piece, 'black ' + piece]

def add_pieces_to_board(current_board, pieces):
  sleep(0.15)
  return current_board + pieces

piece_list = ['pawn']*8 + ['knight']*2 + ['bishop']*2 + ['rook']*2 + ['king', 'queen']

# tqdm can be used in a functional manner simply by wrapping it around the iterable
board = reduce(add_pieces_to_board,
               map(create_white_and_black_piece,
                   tqdm(piece_list, desc='Assembling chess board', unit='piece', leave=True)),
               []) # Begin with an empty board

print(board)

Assembling chess board: 100%|██████████| 16/16 [00:02<00:00,  6.62piece/s]

['white pawn', 'black pawn', 'white pawn', 'black pawn', 'white pawn', 'black pawn', 'white pawn', 'black pawn', 'white pawn', 'black pawn', 'white pawn', 'black pawn', 'white pawn', 'black pawn', 'white pawn', 'black pawn', 'white knight', 'black knight', 'white knight', 'black knight', 'white bishop', 'black bishop', 'white bishop', 'black bishop', 'white rook', 'black rook', 'white rook', 'black rook', 'white king', 'black king', 'white queen', 'black queen']





## `multiprocessing`
_(parallel processing)_

Python offers the `threading` library for creating threads, which gives fine-grained control over concurrent execution. The `multiprocessing` library also offers parallelism, but through an extremely easy, higher-level interface. If you have a lot of data that needs to be processed in an identical manner, there are often performance improvements to be made simply by plugging a multiprocessing `Pool` object into the relevant processing loop.

On a technical level, the `threading` library spawns threads within the same process, allowing for shared memory use. In CPython, however, the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time, so threads mainly help with I/O-bound work, and shared state still needs careful management. `multiprocessing` avoids the issue by spawning entirely new subprocesses for the parallel executions. This comes at a cost of making communication between the parallel executions more restrictive, with special `Queue` and `Pipe` objects provided for this purpose. But this cost is moot for the common case of processing multiple pieces of data independently and then simply combining their results at the very end. An additional cost is that spawning a new process is slower than spawning a new thread.

In [0]:
# Sequential processing of data

from time import sleep
from random import uniform

# We take our slow code and spin it out into a function definition
# Fetches a record with a 3-6 second delay before returning
def fetch_from_slow_database(record_id):
  print(f"Starting to fetch record {record_id}")
  sleep(uniform(3.0, 6.0))
  print(f"Record {record_id} retrieved!")
  return {
      'id': record_id,
      'balance': 3000 + record_id*500
  }

# The data to process is stored in something we can iterate over (e.g. a list, tuple, generator)
id_list = range(15)

# Named record_id rather than id to avoid shadowing the built-in id() function
fetched_records = [fetch_from_slow_database(record_id) for record_id in id_list]

# Can see the records have been collated into the correct order
print(fetched_records)

Starting to fetch record 0
Record 0 retrieved!
Starting to fetch record 1
Record 1 retrieved!
Starting to fetch record 2
Record 2 retrieved!
Starting to fetch record 3
Record 3 retrieved!
Starting to fetch record 4
Record 4 retrieved!
Starting to fetch record 5
Record 5 retrieved!
Starting to fetch record 6
Record 6 retrieved!
Starting to fetch record 7
Record 7 retrieved!
Starting to fetch record 8
Record 8 retrieved!
Starting to fetch record 9
Record 9 retrieved!
Starting to fetch record 10
Record 10 retrieved!
Starting to fetch record 11
Record 11 retrieved!
Starting to fetch record 12
Record 12 retrieved!
Starting to fetch record 13
Record 13 retrieved!
Starting to fetch record 14
Record 14 retrieved!
[{'id': 0, 'balance': 3000}, {'id': 1, 'balance': 3500}, {'id': 2, 'balance': 4000}, {'id': 3, 'balance': 4500}, {'id': 4, 'balance': 5000}, {'id': 5, 'balance': 5500}, {'id': 6, 'balance': 6000}, {'id': 7, 'balance': 6500}, {'id': 8, 'balance': 7000}, {'id': 9, 'balance': 7500}, {'id

In [0]:
# Parallel processing, with results collated in order

from multiprocessing import Pool
from time import sleep
from random import uniform

# We take our slow code and spin it out into a function definition
# Fetches a record with a 3-6 second delay before returning
def fetch_from_slow_database(record_id):
  print(f"Starting to fetch record {record_id}")
  sleep(uniform(3.0, 6.0))
  print(f"Record {record_id} retrieved!")
  return {
      'id': record_id,
      'balance': 3000 + record_id*500
  }

# The data to process is stored in something we can iterate over (e.g. a list, tuple, generator)
id_list = range(15)

# Create the pool with a maximum of 6 concurrent workers at a time
pool = Pool(6)

# This is all it takes to perform the parallel processing
# The first argument is the function to feed each input to
# The second argument is the list of inputs we need to iterate through
fetched_records = pool.map(fetch_from_slow_database, id_list)

# Can see the records have been collated into the correct order
print(fetched_records)

Starting to fetch record 0
Starting to fetch record 1
Starting to fetch record 5
Starting to fetch record 4
Starting to fetch record 2
Starting to fetch record 3
Record 1 retrieved!
Starting to fetch record 6
Record 5 retrieved!
Starting to fetch record 7
Record 4 retrieved!
Starting to fetch record 8
Record 2 retrieved!
Starting to fetch record 9
Record 3 retrieved!
Starting to fetch record 10
Record 0 retrieved!
Starting to fetch record 11
Record 6 retrieved!
Starting to fetch record 12
Record 11 retrieved!
Starting to fetch record 13
Record 7 retrieved!
Starting to fetch record 14
Record 10 retrieved!
Record 8 retrieved!
Record 9 retrieved!
Record 13 retrieved!
Record 12 retrieved!
Record 14 retrieved!
[{'id': 0, 'balance': 3000}, {'id': 1, 'balance': 3500}, {'id': 2, 'balance': 4000}, {'id': 3, 'balance': 4500}, {'id': 4, 'balance': 5000}, {'id': 5, 'balance': 5500}, {'id': 6, 'balance': 6000}, {'id': 7, 'balance': 6500}, {'id': 8, 'balance': 7000}, {'id': 9, 'balance': 7500}, {'id

We can use `imap` instead of `map` to treat the parallel processing as an iterator, so that each result is returned and usable individually. A result is yielded once it is ready and once the results for all earlier inputs have been yielded, so the output still follows the order of the input data.

In the run below, the `print` statements inside our processing function produced no visible output in the notebook; only the results printed in the loop appear.

In [0]:
# Used as an iterator, we have access to each individual "result" instead
# of a completed list of "fetched_records".
for result in pool.imap(fetch_from_slow_database, id_list):
  print(result)

{'id': 0, 'balance': 3000}
{'id': 1, 'balance': 3500}
{'id': 2, 'balance': 4000}
{'id': 3, 'balance': 4500}
{'id': 4, 'balance': 5000}
{'id': 5, 'balance': 5500}
{'id': 6, 'balance': 6000}
{'id': 7, 'balance': 6500}
{'id': 8, 'balance': 7000}
{'id': 9, 'balance': 7500}
{'id': 10, 'balance': 8000}
{'id': 11, 'balance': 8500}
{'id': 12, 'balance': 9000}
{'id': 13, 'balance': 9500}
{'id': 14, 'balance': 10000}


Sometimes we don't care about the results being collated in the same order as the inputs. This might be because:

* The ordering of the input data wasn't hugely important to begin with, for example training data in an ML application
* We want to act on processed results as soon as they are ready, such as responding to HTTP requests
* We don't need to gather the results together at all but instead simply dump them into a data store such as files in an S3 bucket

The `imap_unordered` function allows us to use our Pool object more efficiently by ignoring the ordering requirement. Instead of gathering the results together into a list, we instead use `imap_unordered` as an iterator in a `for` loop. 

Like `imap`, we are provided with each result individually. However, results will not wait for their "turn" based on the order of the input data. Instead we receive them as soon as they become available, allowing us to act on each immediately.
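
The difference between the two iterators can be sketched with a small, self-contained example. Note the substitution: this sketch uses `multiprocessing.pool.ThreadPool`, which exposes the same `map`/`imap`/`imap_unordered` interface as `Pool` but runs its workers as threads, so it can be pasted and run anywhere without process start-up concerns. The `DELAYS` table and `slow_identity` function are hypothetical stand-ins for the slow database fetch.

```python
from multiprocessing.pool import ThreadPool
from time import sleep

# Hypothetical per-item delays in seconds, chosen so that later items can finish first
DELAYS = {0: 0.6, 1: 0.2, 2: 0.4}

def slow_identity(item):
    # Pretend to do slow work, then hand the item back
    sleep(DELAYS[item])
    return item

with ThreadPool(3) as workers:
    # imap yields results in input order, waiting for item 0 even though it is slowest
    in_input_order = list(workers.imap(slow_identity, DELAYS))
    # imap_unordered yields each result as soon as it is ready,
    # so the order may differ from the input order
    in_completion_order = list(workers.imap_unordered(slow_identity, DELAYS))

print(in_input_order)
print(in_completion_order)
```

Both lists contain the same results; only the order in which they arrive can differ.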

In [0]:
from tqdm import tqdm

record_list = []
# We can wrap tqdm around imap or imap_unordered to show and update progress
# tqdm needs to be around the imap_unordered, not around id_list or else it
# will jump to 100% progress immediately
for record in tqdm(pool.imap_unordered(fetch_from_slow_database, id_list), total=len(id_list), unit='record', desc='Fetching records'):
    record_list.append(record)

print(record_list)

Fetching records: 100%|██████████| 15/15 [00:12<00:00,  1.23record/s]

[{'id': 1, 'balance': 3500}, {'id': 2, 'balance': 4000}, {'id': 5, 'balance': 5500}, {'id': 3, 'balance': 4500}, {'id': 0, 'balance': 3000}, {'id': 4, 'balance': 5000}, {'id': 6, 'balance': 6000}, {'id': 7, 'balance': 6500}, {'id': 9, 'balance': 7500}, {'id': 8, 'balance': 7000}, {'id': 10, 'balance': 8000}, {'id': 11, 'balance': 8500}, {'id': 14, 'balance': 10000}, {'id': 13, 'balance': 9500}, {'id': 12, 'balance': 9000}]





In [0]:
from tqdm import tqdm_notebook

# If we do want to gather the results together but just want the speed
# improvement from removing the ordering requirement, then we can just
# evaluate the imap_unordered iterator by casting to a list
unordered_results = list(tqdm_notebook(pool.imap_unordered(fetch_from_slow_database, id_list), total=len(id_list), unit='record', desc='Fetching records'))

print(unordered_results)

HBox(children=(IntProgress(value=0, description='Fetching records', max=15, style=ProgressStyle(description_wi…


[{'id': 6, 'balance': 6000}, {'id': 3, 'balance': 4500}, {'id': 7, 'balance': 6500}, {'id': 9, 'balance': 7500}, {'id': 2, 'balance': 4000}, {'id': 8, 'balance': 7000}, {'id': 5, 'balance': 5500}, {'id': 0, 'balance': 3000}, {'id': 1, 'balance': 3500}, {'id': 4, 'balance': 5000}, {'id': 12, 'balance': 9000}, {'id': 14, 'balance': 10000}, {'id': 13, 'balance': 9500}, {'id': 11, 'balance': 8500}, {'id': 10, 'balance': 8000}]


If we want to avoid the overhead of multiprocessing or use shared memory we can perform parallel execution at the thread level. One way to achieve this is to use the `Thread` class in the `threading` library. We then need to wait for each thread's completion with `join()` and collate results manually. A higher-level interface is provided by the `ThreadPoolExecutor` class in the `concurrent.futures` library.
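
The manual `Thread` approach mentioned above can be sketched as follows. The `fetch_record` function and the shared `results` dict are illustrative assumptions, with a short sleep standing in for the slow database.

```python
import threading
from time import sleep

def fetch_record(record_id, results):
    # Simulate a slow fetch, then store the record in the shared dict
    sleep(0.1)
    results[record_id] = {'id': record_id, 'balance': 3000 + record_id * 500}

# Threads share memory, so every thread can write into the same dict
results = {}
threads = [threading.Thread(target=fetch_record, args=(record_id, results))
           for record_id in range(5)]

for thread in threads:
    thread.start()

# Wait for every thread to finish before reading the results
for thread in threads:
    thread.join()

# Collate manually, restoring the input order
collated = [results[record_id] for record_id in range(5)]
print(collated)
```

Each thread writes to its own key, which keeps this sketch simple. Code where several threads update the same value would need a `threading.Lock`.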

In [0]:
from concurrent.futures import ThreadPoolExecutor, as_completed

# Provides a context manager to supply to a 'with' statement to ensure that
# cleanup is correctly performed for us
# (named 'executor' so it does not overwrite the multiprocessing 'pool' above)
with ThreadPoolExecutor(max_workers=10) as executor:

    # Submit each execution to the executor. The output is a Future object we
    # can gather into a list to check up on the work later. We could also
    # use the Future to check if a task is still running, or cancel it if it
    # has not started yet.
    futures = [executor.submit(fetch_from_slow_database, input_id) for input_id in id_list]

    # as_completed will yield futures as they complete
    for future in as_completed(futures):
        # We call result() to fetch the result of the completed future
        print(f"Thread {future} has result {future.result()}")



Starting to fetch record 0
Starting to fetch record 1
Starting to fetch record 2
Starting to fetch record 3
Starting to fetch record 4
Starting to fetch record 5
Starting to fetch record 6
Starting to fetch record 7
Starting to fetch record 8Starting to fetch record 9

Record 6 retrieved!Thread <Future at 0x7f2bb1c22f98 state=finished returned dict> has result {'id': 6, 'balance': 6000}

Starting to fetch record 10
Record 9 retrieved!Thread <Future at 0x7f2bb1c224a8 state=finished returned dict> has result {'id': 9, 'balance': 7500}

Starting to fetch record 11
Record 2 retrieved!
Starting to fetch record 12Thread <Future at 0x7f2bb55037f0 state=finished returned dict> has result {'id': 2, 'balance': 4000}

Record 0 retrieved!Thread <Future at 0x7f2bb1c47240 state=finished returned dict> has result {'id': 0, 'balance': 3000}

Starting to fetch record 13
Record 8 retrieved!Thread <Future at 0x7f2bb1c22518 state=finished returned dict> has result {'id': 8, 'balance': 7000}

Starting to f