# Methods I: Programming and Data Analysis

## Session 08: Advanced Formatting; File Formats; Finishing the Basics

### Gerhard Jäger

#### (based on Johannes Dellert's slides)

December 14, 2021

Advanced formatting using `format()`:

-   **formatting templates** are stored in a specialized string format

-   basic principles:

    -   every object to be rendered is represented by a block in curly
        brackets

    -   symbols outside curly brackets are copied into the output

-   useful basic features:

    -   **padding** to a desired length $k$ using
        `:k` (left alignment), `:^k` (center alignment), or `:>k` (right
        alignment)

    -   **truncating** to a desired length $k$ using `.k` (combines with
        padding)

    -   **integer format** `:d` which allows padding

    -   **float format** `:f` which allows padding and specifying a
        precision,
        e.g. `{:06.2f}` to output 3.14159 as `"003.14"`

-   more information on `https://pyformat.info`

### Advanced Formatting

Example of `format()` usage:

- to format a sequence of objects, apply the `format()` method on the
template string with the objects as arguments:

## base case

In [1]:
'{} {}'.format('one', 'two')


'one two'

In [2]:
'{} {}'.format(1, 2)


'1 2'

## numbered placeholders

In [3]:
'{1} {0}'.format('one', 'two')

'two one'

## padding

In [4]:
'{:10}'.format('test')


'test      '

In [5]:
'{:^10}'.format('test')


'   test   '

In [6]:
'{:>10}'.format('test')


'      test'

In [7]:
'{:_<10}'.format('test')


'test______'

## specifying data type

In [8]:
'{:d}'.format(42)

'42'

In [9]:
'{:f}'.format(42)

'42.000000'

In [10]:
'{:d}'.format("test")

ValueError: Unknown format code 'd' for object of type 'str'

In [11]:
'{:s}'.format("test")

'test'

## precision for floats

In [12]:
'{:6.2f}'.format(3.141592653589793)


'  3.14'

In [13]:
'{:06.2f}'.format(3.141592653589793)


'003.14'

- templates can be stored in a variable


In [14]:
result_template = "{:20s}: {:6.2f} {:6.2f}"

In [15]:
result_template.format("Pruta Govvom", 2.5, 4.7)

'Pruta Govvom        :   2.50   4.70'

- `format()` expects its parts as separate arguments, which requires **unpacking** using `*` when applied to sequence objects:

In [16]:
results = [("Pruta Govvom", 2.5, 4.7), ("Prokanayardan Tum", 0.4, 3), ("Mara Tsirpalandani", 1.15, 0.01)]

In [17]:
for result in results:
    print(result_template.format(*result))

Pruta Govvom        :   2.50   4.70
Prokanayardan Tum   :   0.40   3.00
Mara Tsirpalandani  :   1.15   0.01
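
Truncation with `.k` is listed among the basic features but does not appear in the examples. The following minimal sketch combines it with padding so that every string ends up with exactly the same width. The helper name `fit_to_width` is made up for illustration, and the width is passed into the template through a nested `{w}` field.

```python
def fit_to_width(text, width):
    """Pad or truncate text to exactly `width` characters (left-aligned)."""
    # '{:{w}.{w}}' means: pad to w characters, truncate to w characters
    return "{:{w}.{w}}".format(text, w=width)

names = ["Pruta Govvom", "Prokanayardan Tum", "Mara Tsirpalandani"]
for name in names:
    # the brackets make the padding and truncation visible
    print("[" + fit_to_width(name, 12) + "]")
```

Short strings are filled up with spaces, long ones are cut off after the twelfth character.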


### Exception Handling

File I/O is where **exception handling** becomes important:

-   files might not exist (`FileNotFoundError`)

-   files might not have the correct permissions (`PermissionError`)

-   the storage device might run out of space (`OSError`, of which
    `IOError` is an alias in Python 3)

A good program will catch all of these **exceptions**, and fail
gracefully!

### Exception Handling: The `try` construct

Syntax of the `try` construct:

-   all the statements which could raise exceptions are wrapped in a
    `try` block, which prevents crashes when an exception occurs

-   following `except` blocks can catch different pre-defined error
    types, and define how each type of error is handled

-   a `finally` block contains the statements which will be executed in
    any case, whether the `try` block was exited via an exception or not


-   Example: (from https://www.geeksforgeeks.org/try-except-else-and-finally-in-python/)

In [18]:
def divide(x, y):  
    try:  
        # Floor division: gives only the integer part
        # (rounded down) as answer
        result = x // y  
    except ZeroDivisionError:  
        print("Sorry! You are dividing by zero")  
    else: 
        print("Yeah! Your answer is:", result)  
    finally:   
        # this block is always executed    
        # regardless of exception generation.   
        print('This is always executed')    

In [19]:
divide(1,10)

Yeah! Your answer is: 0
This is always executed


In [20]:
divide(1, 0)

Sorry! You are dividing by zero
This is always executed


- Exception handling is important in connection with file operations. 
- Whenever you open a file, you need to close it no matter what!
- a `try`-`except`-`finally` construction with unspecified exception type can guarantee that

In [24]:
try:
    f = open("some_text.txt")  # default mode "r": the file is read-only
    f.write("whatever")        # fails, because the file is not writable
except:
    print("something bad happened")
finally:
    print("this is executed regardless")
    f.close()

something bad happened
this is executed regardless


### Exception Handling: The `with` construct

Another option is to open files in a `with` block:

-   the clean-up procedure of the object (e.g. a file handle) will be
    executed whether an exception is raised or not

-   in the case of a file handle, the clean-up procedure includes
    `close()`

-   exceptions will not be handled, but percolated up to any surrounding
    `try` blocks (terminating the program in absence of matching
    `except`)

-   example:

In [25]:
with open("some_text.txt") as f:
    f.write("more text")  # fails: the file was opened read-only

UnsupportedOperation: not writable
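
Since `with` only guarantees the clean-up, it is usually combined with a surrounding `try` when errors should also be handled. A minimal sketch, assuming a hypothetical helper `read_text_or_none` that reports two of the file errors listed earlier and returns `None` instead of crashing:

```python
def read_text_or_none(path):
    """Return the contents of a text file, or None if it cannot be read."""
    try:
        # the with block closes the file even if read() raises an exception
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        print("file does not exist:", path)
    except PermissionError:
        print("no permission to read:", path)
    return None

print(read_text_or_none("this_file_does_not_exist.txt"))
```

Other exception types are deliberately not caught here, so they still percolate up to the caller.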

### File Formats: Introduction

A range of file formats is commonly used for data exchange:

-   Tab-Separated Values (TSV) or Comma-Separated Values (CSV) for
    tabular data (i.e. for datapoints consisting of values for
    predefined fields)

-   Extensible Markup Language (XML) for hierarchically structured data
    where format checking is necessary

-   JavaScript Object Notation (JSON) for uncomplicated data exchange
    between programs

-   various binary formats (database files, matrices)


-----------------

### File Formats: CSV and TSV

Basics of CSV and TSV formats:

-   first line (the **header**): field names

-   one entry per line

-   field values separated by one designated delimiter, common choices:

    -   the tab character (TSV format, ending .tsv)

    -   the comma (CSV format, ending .csv)

    -   the space (many formats)

-   if the delimiter occurs inside a field value, the value needs to be
    surrounded by double quotes; tools automatically handle this

-   parsing can be done via `split()`, but more safely via the `csv`
    module

### File Formats: CSV Example

    Country,Demonym,Internet TLD,Capital
    Afghanistan,Afghan,.af,Kabul
    Albania,Albanian,.al,Tirana
    Algeria,Algerian,.dz,Algiers
    Andorra,Andorran,.ad,Andorra la Vella
    Angola,Angolan,.ao,Luanda
    Antigua and Barbuda,"Antiguan, Barbudan",.ag,St John's
    Argentina,Argentine,.ar,Buenos Aires
    Armenia,Armenian,.am,Yerevan
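
The quoted demonym in the Antigua and Barbuda line is exactly the case where `split()` goes wrong. The sketch below contrasts naive splitting with the `csv` module. To stay self-contained it reads three lines of the example from a string via `io.StringIO` rather than from a real file, which is an assumption made for illustration only.

```python
import csv
import io

# three lines of the example file, held in a string
csv_text = '''Country,Demonym,Internet TLD,Capital
Albania,Albanian,.al,Tirana
Antigua and Barbuda,"Antiguan, Barbudan",.ag,St John's
'''

# naive parsing: the comma inside the quoted demonym splits the field in two
naive_fields = csv_text.splitlines()[2].split(",")
print(len(naive_fields), naive_fields)

# csv.DictReader uses the header line as keys and respects the quotes
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    print(row["Country"], "->", row["Demonym"])
```

When reading from an actual file, the `csv` documentation recommends opening it with `newline=""`.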

### File Formats: XML

Basics of the **Extensible Markup Language (XML)**:

-   data is structured by user-defined **tags** like `<country>`

-   textual data is stored in **elements**: `<tag>`*text*`</tag>`

-   elements can have **attributes**: `<country id="AF">`

-   elements can contain other elements (nested structure)

XML processing:

-   very complex, do not try this on your own!

-   `xml.dom` module for representing the entire document in memory
    (good for manipulating and saving again as XML)

-   `xml.sax` for sequential processing (good for summarizing huge
    files)

### File Formats: XML Example

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <countryDatabase>
      <country id="AF">
        <name>Afghanistan</name>
        <province id="BDS">
          <name>Badakhshan</name>
          <capital>Fayzabad</capital>
        </province>
        <province id="BDG">
          <name>Badghis</name>
          <capital>Qala i Naw</capital>
        </province>
        ...
      </country>
      ...
    </countryDatabase>
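
A minimal sketch of how such a document could be read with `xml.dom.minidom`, the lightweight implementation of the `xml.dom` interface in the standard library. The string is a trimmed copy of the example (without the XML declaration and the elided `...` parts), and the variable names are chosen for illustration.

```python
from xml.dom import minidom

# trimmed copy of the example document (the "..." parts are left out)
xml_text = """<countryDatabase>
  <country id="AF">
    <name>Afghanistan</name>
    <province id="BDS">
      <name>Badakhshan</name>
      <capital>Fayzabad</capital>
    </province>
    <province id="BDG">
      <name>Badghis</name>
      <capital>Qala i Naw</capital>
    </province>
  </country>
</countryDatabase>"""

# the entire document is parsed into a tree held in memory
document = minidom.parseString(xml_text)

capitals_by_id = {}
for province in document.getElementsByTagName("province"):
    # attributes are read with getAttribute(), element text via the text node's .data
    name = province.getElementsByTagName("name")[0].firstChild.data
    capital = province.getElementsByTagName("capital")[0].firstChild.data
    capitals_by_id[province.getAttribute("id")] = (name, capital)

print(capitals_by_id)
```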


----

### File Formats: JSON

Basics of the **JavaScript Object Notation (JSON)**:

-   structured data represented using arrays and key-value pairs

-   syntax is virtually identical to Python literals!

-   whitespace is irrelevant

JSON processing:

-   some subtle differences mean that not every Python literal can
    directly be interpreted as JSON, and vice versa\
    (e.g. JSON has no tuples: Python lists and tuples both correspond to
    JSON arrays)

-   `json` module can dump any nested data structure to a JSON file, and
    load such a structure from a file

### File Formats: JSON Example

    [{"ID": "AF",
      "Name": "Afghanistan",
      "Provinces":
      [{"ID": "BDS", "Name": "Badakhshan", "Capital": "Fayzabad"},
       {"ID": "BDG", "Name": "Badghis", "Capital": "Qala i Naw"},
       ...
      ]
     },
     {"ID": "AL",
      "Name": "Albania",
       ...
     },
     ...
    ]
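
The following sketch shows a round trip through the `json` module and one of the subtle differences mentioned above: a tuple is written as a JSON array and comes back as a list. The `"Capitals"` key is a made-up field for illustration, and `dumps()`/`loads()` work on strings where `dump()`/`load()` would work on files.

```python
import json

# a nested Python structure containing a tuple
record = {"ID": "AF", "Name": "Afghanistan", "Capitals": ("Fayzabad", "Qala i Naw")}

json_text = json.dumps(record)      # Python object -> JSON string
restored = json.loads(json_text)    # JSON string -> Python object

print(json_text)
print(type(record["Capitals"]).__name__, "->", type(restored["Capitals"]).__name__)
```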


### File Formats: Binary Formats

Many proprietary programs use **binary formats**:

-   free-form byte sequences of any shape

-   more efficient, especially for non-string data

-   difficult to process

-   example 1: encrypted files

-   example 2: compressed files (e.g. DOCX and ODT formats)

-   example 3: media files (PDF, MP3, ...)

### Encodings: Overview

What is an **encoding**?

-   intuitively: a mapping from characters to bit sequences

-   example: `"A"` is `01000001`, `"B"` is `01000010`

-   every text file encodes characters according to an encoding
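
The bit patterns in the example can be reproduced in Python. This is a small sketch with a made-up helper name `to_bits`; it relies on `ord()`, which returns the Unicode code point of a character, and for these letters that number coincides with the ASCII code.

```python
def to_bits(character):
    """Return the 8-bit pattern of a single ASCII character as a string."""
    code = ord(character)          # the character's number, e.g. 65 for "A"
    return format(code, "08b")     # binary, zero-padded to 8 digits

for character in "AB":
    print(character, ord(character), to_bits(character))
```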

Most relevant encodings today:

-   ASCII 7-Bit encoding (English only)

-   ISO 8-Bit encodings (alphabets only)

-   UTF-8 Unicode (all writing systems)



Encodings in Python 3:

-   strings in Python 3 are sequences of Unicode characters, and source
    files are read as UTF-8 by default

-   if you work with files in UTF-8 and pass `encoding="utf-8"` to
    `open()`, you should not run into problems

ASCII Encoding
--------------

Basic facts about **ASCII** encoding:

-   arose from need for standardized encoding in early 1960s

-   defined by ASA (American Standards Association) in multiple revisions

-   ASCII: **American Standard Code for Information Interchange**

-   wide range of control characters

-   limited to uppercase and lowercase English alphabet

-   basis for most of today's encodings

### ASCII Basics 


7-bit encoding:

-   decimal: 0 to 127

-   binary: 0000000 to 1111111

-   hexadecimal: 0x00 to 0x7F

-   maximum capacity of 128 characters, all of which are used



### ASCII Line breaks 

Line breaks:

-   LF: **line feed** (`\n`)\
    0001010 = 0x0A (decimal: 10)\
    "move to next line"

-   CR: **carriage return** (`\r`)\
    0001101 = 0x0D (decimal: 13)\
    "move back to beginning of line, allow overprinting"

Line breaks differ between operating systems:

-   Unix-based systems: `\n`

-   Windows: `\r\n`
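
A short sketch of what the two conventions mean in practice when text is split into lines. The two sample strings are made up for illustration.

```python
unix_text = "first line\nsecond line\n"
windows_text = "first line\r\nsecond line\r\n"

# the two control characters and their ASCII codes
print(ord("\n"), ord("\r"))   # 10 13

# splitting on "\n" alone leaves a stray "\r" at the end of Windows lines
print(windows_text.split("\n"))

# splitlines() recognizes both conventions
print(unix_text.splitlines() == windows_text.splitlines())
```

This is one reason why stripping each line, as in `line.strip()`, is a common habit when reading text files.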

### ASCII Limitations
ASCII has severe limitations for non-US usage:

-   no umlauts

-   no accents or other diacritics

-   no symbols for currencies other than dollar

Result:

-   different modifications were established in different countries,
    replacing different characters by symbols of local significance

-   examples: ISCII (Indian scripts), TSCII (Tamil), VSCII (Vietnamese)

-   sometimes changes are small, e.g. JIS C-6220 in Japan, replacing the
    backslash with a Yen sign

ISO 8-Bit Encodings
-------------------

Step forward taken by **ISO/IEC 8859**:

-   standardized extensions of ASCII

-   first two variants defined in 1987\
    by ISO (International Organization for Standardization)

-   additional variants added until 2001

-   8-bit system: leading 0 bit for 7-bit ASCII characters

-   e.g. 1010101 (ASCII) becomes **0**1010101 (ISO-8859)

-   upper range of 128 characters used for localized graphical
    characters

-   different ISO-8859 standards tailored for different regions

### ISO/IEC 8859

ISO 8859-1

<img src="iso-8859-1.png" width=250/>

### ISO/IEC 8859: Overview

List of all ISO-8859 standards:

-   ISO-8859-1: Western European

-   ISO-8859-2: Central European

-   ISO-8859-3: South European and Esperanto

-   ISO-8859-4: Baltic, old

-   ISO-8859-5: Cyrillic

-   ISO-8859-6: Arabic

-   ISO-8859-7: Greek



-   ISO-8859-8: Hebrew

-   ISO-8859-9: Turkish

-   ISO-8859-10: Nordic

-   ISO-8859-11: Thai (unofficial)

-   ISO-8859-12: (does not exist)

-   ISO-8859-13: Baltic, new

-   ISO-8859-14: Celtic

-   ISO-8859-15: Revised Western European

-   ISO-8859-16: South-Eastern European

**Unicode** is the predominant encoding standard in modern computing

-   idea: provide one **universal encoding** for all languages,\
    both current and historic writing systems, symbols, ...

-   first version of Unicode released in 1991 (2019: release of v. 12.0)

-   Unicode code range can be encoded by **21 bits**

-   multi-byte encoding with enough room for all current and many future
    symbols ($1\ 114\ 112$ code points, of which $136\ 690$ were assigned
    as of v. 10.0)

-   virtually universal coverage (139 scripts as of v. 10.0) with a
    single standard

-   fonts are large and require advanced rendering techniques

-   different fonts will be used for different parts of the range

### Unicode

In Python 3, strings are sequences of Unicode code points; the default encoding for converting them to bytes is **UTF-8**:

-   if the Unicode codepoint is $< 128$, a single byte is used (ASCII)

-   otherwise, the codepoint is turned into a sequence of 2-4 bytes

-   all characters used in official languages fit in 3 bytes or less
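
The byte counts can be checked directly by encoding single characters. A minimal sketch; the helper name `utf8_length` and the choice of sample characters are illustrative.

```python
def utf8_length(character):
    """Number of bytes needed to store one character in UTF-8."""
    return len(character.encode("utf-8"))

# ASCII letter, umlaut, euro sign, and an emoji from outside the 3-byte range
examples = ["A", "ä", "€", "\U0001F600"]
for character in examples:
    print(hex(ord(character)), utf8_length(character))
```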

Specifying non-unicode encodings in Python (will be converted):

``` {language="python"}
ascii_file = open(file_name, "r", encoding="ascii")
latin1_file = open(file_name, "w", encoding="latin-1")
```
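
Why the encoding has to be specified correctly can be seen without any files, using `str.encode()` and `bytes.decode()`. The sketch below is an illustration with a made-up sample string; it shows the two typical symptoms of a mismatch.

```python
name = "Jäger"

utf8_bytes = name.encode("utf-8")       # ä takes two bytes here
latin1_bytes = name.encode("latin-1")   # ä takes one byte here
print(utf8_bytes, latin1_bytes)

# decoding with the wrong encoding either garbles the text ...
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# ... or fails outright
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as error:
    print("cannot decode:", error.reason)
```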



# More on Functions

### Using Return values (Recap)

- a function call is an expression, i.e. it will evaluate to an object

- ``` {language="python"}
  tuple(sorted(some_list)[0:3])
  ```

- if you use the result of a function call only once, it is not
  necessary to assign it to a variable before using it!

  ``` {language="python"}
  sorted_list = sorted(some_list) #this is unnecessary!
  for item in sorted_list:
    print(item)
  ```

---------------------------


- some functions and methods are **fruitless** (return `None`), but
  modify one of the objects handed over as arguments

  ``` {language="python"}
  some_list.sort()
  ```

- some functions only have side effects, i.e. they only return `None`,
  and do not modify any of the arguments, but still do have an effect

  ``` {language="python"}
  print("This string will not be changed.")
  ```
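
The difference between a function that returns a new object and a method that modifies its object can be made visible by printing the return values. A small sketch with a made-up list:

```python
some_list = [3, 1, 2]

new_list = sorted(some_list)     # returns a new sorted list, argument unchanged
print(new_list, some_list)       # [1, 2, 3] [3, 1, 2]

result = some_list.sort()        # sorts in place and returns None
print(result, some_list)         # None [1, 2, 3]

# a common slip: this would replace the list by None
# some_list = some_list.sort()
```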



### More on Functions: Functions Calling Functions

Reminder: functions can call each other!

-   Example of a program consisting of functions which call each other:

- 

  ``` {language="python"}
  def dict_lookup(eng_word, tokens):
    return inflect(eng_deu_dict[eng_word], tokens)
  
  def translate(sentence):
    tokens = tokenize(sentence)
    translated_tokens = list()
    for token in tokens:
      translated_tokens.append(dict_lookup(token, tokens))
    return " ".join(translated_tokens)
  
  if __name__ == "__main__":
    example_sentence = "I know him"
    result = translate(example_sentence)
    print(result)
  ```



### More on Functions: The Call Stack

Helpful concept: the **call stack**

- each point during the program's execution can be defined by the
  positions in which the function calls that have not terminated yet
  have occurred; for our program, one state of the stack looks like
  this:

      2: inflect("kennen", ["I", "know", "him"])
      8: dict_lookup("know", ["I", "know", "him"])
      8: translated_tokens.append(...)
      13: translate("I know him")
      __main__  

- after termination of a function call, it is removed from the stack,
  and control flow **reverts back to the new highest element on the
  stack**
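
The call stack can also be inspected from inside a running program with the `traceback` module. The sketch below is a runnable toy version of the translation program: the three-word dictionary, the whitespace tokenizer, and the `inflect` stub (which returns the word unchanged) are assumptions made for illustration, not the real implementation.

```python
import traceback

eng_deu_dict = {"I": "ich", "know": "kennen", "him": "ihn"}  # toy dictionary
recorded_stack = []

def inflect(word, tokens):
    # stub: record the names of all active function calls, outermost first
    recorded_stack[:] = [frame.name for frame in traceback.extract_stack()]
    return word

def dict_lookup(eng_word, tokens):
    return inflect(eng_deu_dict[eng_word], tokens)

def translate(sentence):
    tokens = sentence.split()
    translated_tokens = list()
    for token in tokens:
        translated_tokens.append(dict_lookup(token, tokens))
    return " ".join(translated_tokens)

print(translate("I know him"))
print(recorded_stack[-3:])
```

The last three recorded names are the calls that had not yet terminated when `inflect` was running, in the order in which they were made.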

# More on Loops

### Looping over a Range

Recap: one of the most basic iterables is the **range**!

-   very often, we want our variable in a for loop to take every value
    from a certain range of integers, e.g. all house numbers in a street

-   example: modifying the values in part of a list

In [26]:
price_per_month = [60,60,70,80,80,90,100,170,150,70,60,80]
for month in range(3,6):
    old_price = price_per_month[month]
    new_price = int(old_price * 1.2)
    price_per_month[month] = new_price
print(price_per_month)

[60, 60, 70, 96, 96, 108, 100, 170, 150, 70, 60, 80]


### More on Loops: Looping over Lines in a File

Addition to file I/O: opened **file handles are iterables**!

- example file `"lines.txt"`, and a loop over it:

  ``` {language="python"}
  First line here.
  Second line there.
  ```


In [28]:
for line in open("lines.txt", 'r'):
    print(line.strip())

First line here.
Second line there.


- but a file handle has state ("remembers where it is"):

In [29]:
lines = open("lines.txt", 'r')
for line in lines:
    print("1st loop: " + line.strip())
for line in lines:
    print("2nd loop: " + line.strip())

1st loop: First line here.
1st loop: Second line there.


- solution: reset file handle via `lines.seek(0)`
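
A minimal sketch of the reset, assuming the two-line example file. To keep it runnable anywhere, the file is first created in a temporary directory.

```python
import os
import tempfile

# create the two-line example file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as output:
    output.write("First line here.\nSecond line there.\n")

with open(path, "r") as lines:
    first_pass = [line.strip() for line in lines]
    second_pass = [line.strip() for line in lines]   # handle is at the end: nothing left
    lines.seek(0)                                    # jump back to the start of the file
    third_pass = [line.strip() for line in lines]

print(len(first_pass), len(second_pass), len(third_pass))   # 2 0 2
```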

### Usage of `while` Loops

Question 1: What might this while loop be doing?

``` {language="python"}
while command != "quit":
  if command == "left":
    player.direction.turn(-5)
  elif command == "right":
    player.direction.turn(+5)
  elif command == "accelerate":
    player.speed += 3
  elif command == "stop":
    player.speed -= 5
    if player.speed < 0:
      player.speed = 0
  for obj in objects:
    obj.continue_movement(1)
  graphics.draw_scene()
  command = get_next_command()
```

### Usage of `while` Loops

Question 2: What might this while loop be doing?

``` {language="python"}
agenda = get_new_tasks()
done = set()
while len(agenda) > 0:
  task = agenda.pop(0)
  if is_simple(task):
    task.carry_out()
    done.add(task)
  else:
    subtasks = break_into_subtasks(task)
    agenda.extend(subtasks)
  new_tasks = get_new_tasks()
  for task in new_tasks:
    if task not in done:
      agenda.append(task)
```


### More on Sorting: Reverse Order

Default sorting order can be reversed:

-   sorting is in ascending order by default (1 before 10, A before B)

-   what if we want to compute e.g. a ranking,\
    where the highest score is best, and should be listed first?

-   both `sort()` and `sorted()` support the named argument `reverse`
    which takes a boolean value:

In [30]:
test_list = [5,3,14,5,1,2,3,7,8,9,12,-3,-4]
print(sorted(test_list))

[-4, -3, 1, 2, 3, 3, 5, 5, 7, 8, 9, 12, 14]


In [33]:
print(sorted(test_list, reverse=True))

[14, 12, 9, 8, 7, 5, 5, 3, 3, 2, 1, -3, -4]


In [34]:
test_list

[5, 3, 14, 5, 1, 2, 3, 7, 8, 9, 12, -3, -4]

In [35]:
test_list.sort(reverse=True)

In [36]:
test_list

[14, 12, 9, 8, 7, 5, 5, 3, 3, 2, 1, -3, -4]

### Sorting a Dictionary by Value

Sorting entries in a dictionary by their values:

-   by default, sorting a dictionary will sort it by key

-   a function value for the named argument `key` can override this

-   for sorting a dict by value, one commonly used option is to use the
    `itemgetter()` function from the `operator` module:

In [37]:
import operator
f = operator.itemgetter(1)
grades = {1717345 : "D", 1456345 : "A", 1334521: "C"}
for (student_ID, grade) in sorted(grades.items(), key=f):
    print(str(student_ID) + ": " + grade)

1456345: A
1334521: C
1717345: D


- also works with the in-place `sort()` method, e.g. on a list of
tuples!
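
A sketch of the in-place variant on a list of tuples, combined with `reverse=True` to produce a ranking. The pairs reuse the names and first score column from the formatting example; the variable names are illustrative.

```python
import operator

# (name, score) pairs
scores = [("Pruta Govvom", 2.5), ("Prokanayardan Tum", 0.4), ("Mara Tsirpalandani", 1.15)]

# sort in place by the second tuple element, highest score first
scores.sort(key=operator.itemgetter(1), reverse=True)

for rank, (name, score) in enumerate(scores, start=1):
    print("{:d}. {:20s} {:6.2f}".format(rank, name, score))
```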

### More on Optimization

Some basic facts about **optimization**:

-   for every computable problem or task, there are infinitely many
    programs which solve it; they will differ in memory and time usage!

-   differences in speed can be huge, even if you might not notice it on
    toy data (computing prefixes from a full frequency list takes minutes!)

-   some **principles of optimization**:

    -   avoid running costly computations (File I/O, search) multiple
        times

    -   avoid computing the same thing twice

    -   avoid creating unnecessary objects

-   get used to writing reasonably efficient code now,\
    it will be much harder later when you have formed habits!

-   if you want to be extremely efficient,\
    learn more programming languages (C, Java)
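
As an illustration of the principle "avoid computing the same thing twice", here is a sketch with two made-up functions that return the same result. The first one recomputes a sum for every element, the second one computes it once and reuses it.

```python
def shares_wasteful(values):
    # sum(values) is recomputed once per element
    return [value / sum(values) for value in values]

def shares_efficient(values):
    total = sum(values)          # computed once, then reused
    return [value / total for value in values]

prices = [1, 2, 5]
print(shares_efficient(prices))  # [0.125, 0.25, 0.625]
```

On three values the difference is not measurable, but the wasteful version does an amount of work that grows with the square of the list length.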

### Quiz: Efficient or Not? (1)

- Is there any obvious way to optimize this code?

  ``` {language="python"}
  older_student_ids = set()
  newer_student_ids = set()
  for line in open("participants.tsv","r"):
    student_id = int(line.split("\t")[0])
    if student_id < 3000000:
      older_student_ids.add(student_id)
    else:
      newer_student_ids.add(student_id)
  with open("older_student_ids.txt","w") as output:
    for student_id in sorted(older_student_ids):
      output.write(str(student_id) + "\n")
  with open("newer_student_ids.txt","w") as output:
    for student_id in sorted(newer_student_ids):
      output.write(str(student_id) + "\n")
  ```


### Quiz: Efficient or Not? (2)

- Is there any obvious way to optimize this code?

  ``` {language="python"}
  students_by_id = dict()
  for line in open("participants.tsv","r"):
    student_id = line.split("\t")[0]
    first_name = line.split("\t")[1]
    last_name = line.split("\t")[2]
    student_name = (first_name, last_name)
    students_by_id[student_id] = student_name
  for student_id in students_by_id.keys():
    first_name, last_name = students_by_id[student_id]
    print(last_name + ", " + first_name + ": " + student_id)
  ```


### Quiz: Efficient or Not? (3)

- Is there any obvious way to optimize this code?

  ``` {language="python"}
  student_names = list()
  for line in open("participants.tsv","r"):
    first_name, last_name = line.split("\t")[1:3]
    student_names.append((last_name, first_name))
  for i in range(len(student_names)):
    print(sorted(student_names)[i])
  ```


### Hints on Maintainability

What is **maintainability** about?

-   being able to quickly change the behavior of a program\
    by editing the code in as few places as possible

-   being able to revisit a program years from now, and still understand
    it

-   being able to quickly add or change some functionality as
    requirements change

How can one ensure maintainability?

-   if you use identical literals everywhere, store them in a global
    variable!

-   do not copy-and-paste code! (that's what functions and loops are
    for!)

-   comment your code!

    -   informative variable names

    -   block comments describing function behavior


### Maintainability: Example

How can the maintainability of this function be increased?

``` {language="python"}
def print_matrix(mtx):
  for row in mtx:
    if row == mtx[0]:
      continue
    entry1 = str(row[1])[:6]
    if len(entry1) < 6:
      entry1 += (6 - len(entry1)) * " "
    entry2 = str(row[2])[:6]
    if len(entry2) < 6:
      entry2 += (6 - len(entry2)) * " "
    entry3 = str(row[3])[:6]
    if len(entry3) < 6:
      entry3 += (6 - len(entry3)) * " "
    print(entry1 + " " + entry2 + " " + entry3)
```
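
One possible direction, sketched under the assumption that the intended behavior is "skip the header row, print columns 1 to 3 in fixed-width cells": the repeated literal `6` becomes a named constant, the copy-and-pasted padding logic becomes a `format()` template, and the three entries are handled by a loop. The names `COLUMN_WIDTH`, `ENTRY_TEMPLATE` and `format_row` are made up for this sketch.

```python
COLUMN_WIDTH = 6
# pad and truncate to COLUMN_WIDTH characters, e.g. "{:6.6}"
ENTRY_TEMPLATE = "{:" + str(COLUMN_WIDTH) + "." + str(COLUMN_WIDTH) + "}"

def format_row(row):
    """Render columns 1 to 3 of a row as fixed-width cells separated by spaces."""
    entries = [ENTRY_TEMPLATE.format(str(value)) for value in row[1:4]]
    return " ".join(entries)

def print_matrix(mtx):
    """Print all rows of the matrix except the first one (the header)."""
    for row in mtx[1:]:
        print(format_row(row))

print_matrix([["id", "a", "b", "c"], ["r1", 1.23456789, "ab", 10]])
```

Unlike the original, this version skips only the first row, not every row that happens to be equal to it. Changing the column width now requires editing a single line.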



