# [Dive into Python 3](https://diveintopython3.problemsolving.io/)

## [Chapter 1: Python Overview](https://diveintopython3.problemsolving.io/your-first-python-program.html)

In [1]:
SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}

def approximate_size(size, a_kilobyte_is_1024_bytes=True):
  '''Convert a file size to human-readable format.

  Keyword arguments:
  size -- file size in bytes
  a_kilobyte_is_1024_bytes -- if True (default), use multiples of 1024
                              if False, use multiples of 1000

  Returns: string'''
  if size < 0:
    raise ValueError('number must be non-negative')
  
  multiple = 1024 if a_kilobyte_is_1024_bytes else 1000
  for suffix in SUFFIXES[multiple]:
    size /= multiple
    if size < multiple:
      return '{0:.1f} {1}'.format(size, suffix)

  raise ValueError('number too large')

print(approximate_size(1000000000000, False))
print(approximate_size(1000000000000))

1.0 TB
931.3 GiB


### General basic points about Python:

#### Whitespace is important

- Python uses white space rather than delimiters to indicate code structure, so indentation is really important. A nice side-effect of this is it enforces readability.
- An indent can be any number of spaces, but indentation must be consistent. Blank lines are ignored.

#### Python is dynamically typed

- Python is dynamically typed: variables aren't bound to a declared type, and function definitions don't declare argument or return types, though they may give non-binding and unenforced type hints.

#### Variables are initialized when you assign to them

- Unlike some languages, Python doesn't make you declare a variable before assigning to it. It does this automatically upon assignment.

#### Every function returns a value

- Some languages distinguish functions (which return a value) from subroutines (which do not). This distinction doesn't exist in Python; all functions start with `def`.
- Every Python function returns a value. If there's no `return` statement, it returns `None`.
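
As a small illustration of the point above, here is a minimal sketch using two hypothetical functions (`greet` and `add`, which are not from the book). It shows that a function with no `return` statement still hands back `None`.

```python
def greet(name):
  # No return statement, so Python implicitly returns None
  print('Hello, ' + name)

def add(a, b):
  # An explicit return statement hands back the computed value
  return a + b

result = greet('world')
print(result is None)

total = add(2, 3)
print(total)
```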

#### Arguments can be required or optional

- To make a function argument optional, you assign a default value. Required arguments must be declared before optional ones.

#### Docstrings are non-binding but very helpful

- Docstrings, which explain function usage, are non-binding but made available by Python at runtime and are typically displayed by IDEs as a tooltip.
- Triple quotes signify a multi-line string. Everything in the quotes is part of the string, including white space and carriage returns. They are commonly used for docstrings because they allow use of unescaped single and double quotes.

#### Call functions with keyword or non-keyword args

- You can call a function with non-keyword arguments (in sequence), or with keyword arguments (which are sequence-agnostic). You can also mix and match, but non-keyword arguments must precede keyword arguments in the function call.
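
The calling styles described above can be sketched with a hypothetical helper, `describe_size` (invented here for illustration), which has one required argument and two optional ones.

```python
def describe_size(size, unit='bytes', precision=1):
  '''Return a short description of a size (hypothetical helper).'''
  return '{0:.{1}f} {2}'.format(size, precision, unit)

# Non-keyword (positional) arguments are matched in sequence
positional = describe_size(1.5, 'KiB', 2)

# Keyword arguments can appear in any order
keywords = describe_size(precision=2, unit='KiB', size=1.5)

# Mixed: positional arguments must come before keyword arguments
mixed = describe_size(1.5, precision=2, unit='KiB')

# Optional arguments fall back to their defaults
default = describe_size(1.5)

print(positional, keywords, mixed, default, sep=' | ')
```

Writing `describe_size(unit='KiB', 1.5)` would be rejected with a `SyntaxError`, because a positional argument follows a keyword argument.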

#### Everything in Python is a first-class object (can have attributes or methods and be assigned to a variable)

- Everything in Python, including a function, is an object. All objects have attributes which are available at runtime. For instance, a function's docstring can be accessed like this: `approximate_size.__doc__`. (All functions have this built-in attribute.)
- Once you import a module, like `import module`, you can access any of its *public* functions, classes, or attributes, with a period and a name: `module.function`.
- Python defines objects loosely. Objects don't *have* to have attributes or methods, and not all objects are subclassable. 
- All Python objects, including modules, functions, classes, and class instances, can be assigned to a variable or passed as an argument. (In programming parlance, all Python objects are "first-class objects.")

#### Python 'raises' 'exceptions' that must be 'handled'

- Errors in Python are called 'exceptions' and triggered with the `raise` keyword (rather than 'throw' as in other languages). If a raised exception is 'unhandled', the program will stop.
- Unfortunately, Python functions don't declare what exceptions they might raise, so you have to figure this out yourself.
- Exception handling is done with `try...except` rather than 'catch' as in other languages.
- Exceptions are implemented as classes, and raising an exception creates an instance of that class.
- Exceptions can be handled at any level of the 'stack' of nested functions or classes in which they occur.
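
A minimal sketch of raising and handling exceptions, assuming two hypothetical helpers (`parse_age` and `safe_parse_age`, not from the book). The exception is raised in the inner function and handled one level up the call stack.

```python
def parse_age(text):
  '''Convert text to a non-negative int (hypothetical helper).'''
  age = int(text)  # int() raises ValueError for non-numeric text
  if age < 0:
    # Raising creates an instance of the ValueError class
    raise ValueError('age must be non-negative')
  return age

def safe_parse_age(text):
  # Handle the exception in the caller rather than where it was raised
  try:
    return parse_age(text)
  except ValueError as e:
    print('Could not parse {0!r}: {1}'.format(text, e))
    return None

print(safe_parse_age('42'))
print(safe_parse_age('-1'))
print(safe_parse_age('abc'))
```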

### Import search in Python:

- When you import a module, Python searches the directories listed in `sys.path`, in order. By default, this typically includes the directory of the script being run (or the current working directory in an interactive session), the standard library folders of your Python installation, and the `site-packages` folder of any active virtual environment.
- Import search will find `.py` files on the search path as well as built-in modules, which are written in C, compiled into the interpreter, and don't have corresponding `.py` files.
- **You can easily insert a new folder into sys.path with `sys.path.insert(0, '/dir/to/add')`.** This persists only until you quit Python (or stop the kernel). You typically want to insert a new path into first position in the list, so your modules will override any modules of the same name that turn up further down the list. **This trick is useful for testing code with older versions of dependency libraries.**

In [2]:
import sys
sys.path
# sys.path.insert(0, 'dir/to/add')

['c:\\Users\\chris\\OneDrive\\Documents\\Python\\python-practice',
 'C:\\Users\\chris\\AppData\\Local\\Programs\\Python\\Python312\\python312.zip',
 'C:\\Users\\chris\\AppData\\Local\\Programs\\Python\\Python312\\DLLs',
 'C:\\Users\\chris\\AppData\\Local\\Programs\\Python\\Python312\\Lib',
 'C:\\Users\\chris\\AppData\\Local\\Programs\\Python\\Python312',
 'c:\\Users\\chris\\.virtualenvs\\practice-mD3pbIW8',
 '',
 'c:\\Users\\chris\\.virtualenvs\\practice-mD3pbIW8\\Lib\\site-packages',
 'c:\\Users\\chris\\.virtualenvs\\practice-mD3pbIW8\\Lib\\site-packages\\win32',
 'c:\\Users\\chris\\.virtualenvs\\practice-mD3pbIW8\\Lib\\site-packages\\win32\\lib',
 'c:\\Users\\chris\\.virtualenvs\\practice-mD3pbIW8\\Lib\\site-packages\\Pythonwin']

#### ImportErrors and NameErrors

- If you import a dependency that isn't installed, an ImportError exception is raised. Catching this error lets you run optional logic using the dependency—only if it's installed—without crashing the program.
- Alternatively, you can revert to an alternative fallback dependency (and perhaps alias it with the same name).
- If you try to access an uninitialized variable, Python raises a `NameError`. (Note that Python is case-sensitive, so accessing a variable with the wrong casing will also raise a `NameError`.)

In [3]:
try:
  import chardet
except ImportError:
  chardet = None

if chardet:
  print("do something")
else:
  print("continue anyway")

try:
  from lxml import etree
except ImportError:
  import xml.etree.ElementTree as etree

continue anyway


#### Add a special conditional block for testing code

- All modules have a built-in attribute `__name__`, whose value depends on how the module is used. An imported module's `__name__` is its import name, while the top-level module being run is assigned the name `'__main__'`.
- Add a conditional `if __name__ == '__main__':` block at the bottom of a `.py` file to execute code only when it is run as a standalone top-level module, and not when it is imported. This can be used for quick-and-dirty code testing, among other things.
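
A minimal sketch of the pattern, assuming a hypothetical module that defines a single function, `double`. The block at the bottom runs only when the file is executed directly.

```python
def double(x):
  '''Return twice the input (hypothetical function for illustration).'''
  return x * 2

if __name__ == '__main__':
  # Skipped on import; runs only when this file is the top-level module
  assert double(2) == 4
  print('self-test passed')
```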

## [Chapter 2: Native Datatypes](https://diveintopython3.problemsolving.io/native-datatypes.html)

Python's main native data types:

- `bool`: True/False
- `int`: Integer, including 0, natural numbers N, and additive inverse -N
- `float`: Floating-point approximation of a real number
- `str`: Immutable sequence of Unicode characters enclosed in quote marks
- `bytes` and `bytearray`: Immutable and mutable sequences of raw bytes, such as the contents of a JPEG image
- `list`: Ordered, mutable sequence of values enclosed in square brackets
- `tuple`: Ordered, immutable sequence of values enclosed in parentheses
- `set`: Unordered bag of unique values enclosed in curly braces
- `dict`: Collection of key: value pairs enclosed in curly braces, with each key joined to its value by a colon (insertion-ordered since Python 3.7)

There are many other native data types, encompassing all the types of objects found in base Python, including `module`, `function`, `class`, `method`, `file`, and `compiled code`. (In fact, the other data types are basically all instances of `class`. When you use `set()` to instantiate a set, for instance, you technically aren't calling a function; you're instantiating a class.)

### Booleans

- Booleans can take either of the constants True or False
- Places where Python expects a boolean, such as the condition of an `if` statement, are known as 'boolean contexts'
- Booleans can be treated as numbers, with `True == 1` and `False == 0` (which makes it easy to count true values in an iterable by taking the `sum`)
- In a boolean context, **anything** other than `0`, `0.0`, `False`, `None`, or an empty container (such as `''`, `[]`, `()`, `{}`, or `set()`) will evaluate as True, unless its class defines otherwise

In [4]:
from typing import Any

def is_it_true(anything: Any) -> None:
  if anything:
    print("yes, it's true")
  else:
    print("no, it's false")

is_it_true(None)
is_it_true([])
is_it_true(())
is_it_true(0)
is_it_true([False])
is_it_true(0.1)
is_it_true("hello")

no, it's false
no, it's false
no, it's false
no, it's false
yes, it's true
yes, it's true
yes, it's true


### Floats and Ints

- Python distinguishes ints from floats by the absence or presence of a decimal point
- If you perform mathematical operations with a combination of ints and floats, Python will coerce them all to floats
- Coercing a float to an int with `int()` will truncate the number, not round
- Integers can be arbitrarily large, and Python 3 will dynamically decide how many storage bytes to use (unlike some other languages where you have to declare the storage size)
- Floats are accurate to roughly 15 to 17 significant digits (not 15 decimal places), so results can carry small rounding errors

### Mathematical operators

- `/`: Floating point division, always returns a float
- `//`: Floor division, returns an int or float (depending on the inputs) rounded down (e.g., -5.5 gets rounded to -6, *not* truncated to -5)
- `**`: Raise to the power of
- `%`: Modulo, returns the remainder after floor division (the result takes the sign of the divisor)
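
The operators above can be compared side by side. This is a small sketch with arbitrary example values; the variable names are only for illustration.

```python
true_division = 7 / 2        # 3.5
always_float = 6 / 3         # 2.0, a float even though the result is whole
floor_positive = 7 // 2      # 3
floor_negative = -11 // 2    # -6, rounded down rather than toward zero
floor_float = -5.5 // 1      # -6.0, a float because one operand is a float
truncated = int(-5.5)        # -5, int() truncates toward zero
power = 2 ** 10              # 1024
remainder = 7 % 3            # 1
negative_remainder = -7 % 3  # 2, the result takes the sign of the divisor

print(true_division, always_float, floor_positive, floor_negative)
print(floor_float, truncated, power, remainder, negative_remainder)
```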

### Fractions

- To do fractions math, import `fractions` and instantiate a `Fraction` object with `fractions.Fraction(numerator, denominator)`.
- You can use all the usual mathematical operators with fractions
- Fractions are automatically reduced/simplified
- You can't create a fraction with zero denominator

In [5]:
import fractions

# Just for fun, define method to output string and add to the Fraction class
fractions.Fraction.as_string = lambda self: str(self)

# Subtract 3/4 from 4/3 (2/3 * 2 == 4/3)
difference = fractions.Fraction(2, 3) * 2 - fractions.Fraction(3, 4)

# Print the result
difference.as_string()

'7/12'

### Lists

- Lists are like arrays in other languages, except that length and contents are mutable and don't have to be declared in advance
- List items can be any data type, and a list can contain any combination of data types
- There's no size limit other than available memory

#### Subsetting lists

- You can access list items using a numerical index counted either from the beginning, starting from 0, or from the end, starting from -1 (e.g., `some_list[3]` will get the fourth item)
- You can slice a list using three integers separated by colons, like this: `some_list[start:end:step]`
- The `start` index is inclusive, but the `end` index is exclusive (e.g., `some_list[0:2]` will get the items at indexes 0 and 1, but not the item at index 2)
- The second colon and third value can be omitted to use a default step of 1 (e.g., `some_list[start:end]`)
- You can omit the start value to slice from the beginning or the end value to slice through the end (e.g., `some_list[:]` will get the whole list)
- Assigning a list to a new variable creates a reference to the original list, whereas slicing always creates a new (shallow) copy (so `some_list[:]` is a common shorthand for making a copy of a list)
- Use a negative step value to reverse the list (e.g., `some_list[::-1]`)
- With a negative step, the start index must be greater than the end index (e.g., `some_list[3:0:-1]` will get the items at indexes 3, 2, and 1, in that order)

In [6]:
# Create a list
some_list = ['a', 'b', 'james', 'z', 'example']

# Subset list from the beginning
print('First item: ' + some_list[0])

# Subset list from the end
print('last_item: ' + some_list[-1])

# some_list[:n] always returns the first n items
print(some_list[:3])

# Slice the fourth, third, and second items, in that order
print(some_list[3:0:-1])

# Create a reference to some_list
list_ref = some_list

# Create a copy of some_list
list_copy = some_list[:]

# Modify some_list
some_list[0] = 'replacement_value'

# The reference has changed, but copy has not
assert list_ref[0] != list_copy[0]
print(list_ref[0] + ' != ' + list_copy[0])

First item: a
last_item: example
['a', 'b', 'james']
['z', 'james', 'b']
replacement_value != a


#### Modifying lists

- You can change the value at an existing list index like `some_list[0] = some_value` (this modifies the list in place rather than creating one, so the list must already exist and the index must be within its current length)
- You can add to a list with the `+` operator, although this uses extra memory because it builds a new list before assignment
- The `append` method of the `list` class adds a single item to the end of the list, which is done in-place (i.e. we can append to `some_list` without assigning the result back to `some_list`)
- `append` is preferred for modifying lists, whereas `+` is good for creating a copy
- The `insert` method adds a new item at a numerical index and bumps everything else down one position (e.g., `some_list.insert(0, 'hello')` inserts 'hello' at index zero and increments all other items' index by +1)
- The `extend` method takes an iterable (typically a list) and appends each of its items to the list
- The `del` keyword applied to a list item will delete the item and shift all subsequent items' indexes down by 1
- The `remove` method deletes an item by value rather than by index (e.g., `some_list.remove('hi')` deletes the **first** value matching 'hi' from the list)
- The `pop` method deletes an item by index (by default the last item) **and returns the removed value** (useful for treating list items as consumables to be crossed off the list after use)

In [7]:
some_list = ['hello', 'world,']
items_to_add = ['I', 'am', 'Chris']

# Returns None; modifies list in-place
some_list.append(items_to_add)
print('Result with append:')
print(some_list)

del some_list[-1]

# Returns None; modifies list in-place
some_list.extend(items_to_add)
print('Result with extend:')
print(some_list)

print('Pop result: ' + some_list.pop())

Result with append:
['hello', 'world,', ['I', 'am', 'Chris']]
Result with extend:
['hello', 'world,', 'I', 'am', 'Chris']
Pop result: Chris


#### Searching over lists

- The `count` method counts how many times a value appears in a list (e.g., `some_list.count('hello')` counts list values matching 'hello')
- The `in` keyword can be used to check for any matching value (e.g., `'hello' in some_list` returns True or False)
- The `index` method returns the numerical index of the first occurrence of the value, and raises a `ValueError` if there is no matching value in the list (can be called with optional inclusive start and exclusive stop indices, e.g. `some_list.index('hello', 0, 4)`)

In [8]:
some_list = ['a', 'b', 'new', 'mpilgrim', 'new']

# We could check for the presence of 'new' before calling `index`...
print('new' in some_list[0:2])

# But it's Easier to Ask Forgiveness than Permission (EAFP)
try:
  some_list.index('new', 0, 2)
except ValueError:
  print('Value "new" not found in slice.')

False
Value "new" not found in slice.


### Tuples

- Tuples are immutable lists, represented as comma-separated items enclosed in parentheses
- You can index or slice or copy a tuple just like a list, but there are no methods for altering a tuple in-place, because a tuple can't be modified
- Tuples can be slightly faster than lists and use less memory, so they're preferable when you don't need to modify contents
- Use a tuple to 'write protect' data to make your code safer
- To create a tuple of one item, you need a comma after the value, or Python will assume you just have extra parentheses
- You can 'unpack' a tuple by assigning it to an equally-sized tuple of the names of the variables you want to assign to

In [9]:
# This creates a tuple
a_tuple = ('hello',)
print(type(a_tuple))

# This does not
a_tuple = ('hello')
print(type(a_tuple))

# This unpacks a tuple into the variables val_1 and val_2
(val_1, val_2) = ('first value', 'second_value')
print(val_1)
print(val_2)

# Use with range to create constants with ordinal values
(MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY) = range(7)

# Note that rebinding val_1 doesn't change the containers: they hold
# references to the original string object, not to the name val_1
a_tuple = (val_1, val_2)
a_list = [val_1, val_2]
val_1 = 'adjusted_value'
print(a_tuple[0])
print(a_list[0])

<class 'tuple'>
<class 'str'>
first value
second_value
first value
first value


### Sets

- Sets are **unordered** bags of **unique** values
- If you use `print` on an empty set, the output is represented as `set()` rather than `{}`, because the latter would represent a dictionary
- Similarly, if you assign `{}` to a variable, the variable will be a dict, not a set
- Use `set()` to create an empty set

In [10]:
empty_set = set()
print(empty_set)

a_set = set([1,2,3])
print(a_set)

set()
{1, 2, 3}


#### Modifying a set

- Use the `add` method to add a single value to a set
- If you add a value that already exists in the set, nothing happens (no error is thrown, but no value is added)
- Call the `update` method with any number of sets, lists, or other iterables as arguments to add multiple values in place (e.g., after `a_set = {1,2,3}` and `a_set.update({3,4,5})`, `a_set` is `{1,2,3,4,5}`; the method itself returns `None`)
- The `discard` and `remove` methods both delete a single value from the set, but `remove` throws a `KeyError` if the value is missing, while `discard` will not
- `pop()` removes an arbitrary value from the set and returns it, throwing a `KeyError` if the set is empty
- Unlike with lists, you can't use the `+` operator to modify sets (raises `TypeError`)
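
The set-modifying methods above can be sketched as follows, using arbitrary example values.

```python
a_set = {1, 2, 3}

a_set.add(4)
a_set.add(4)                     # Already present, so nothing happens
a_set.update({3, 4, 5}, [6, 7])  # Accepts several iterables; returns None

a_set.discard(99)                # Missing value: no error
try:
  a_set.remove(99)               # Missing value: raises KeyError
except KeyError as e:
  print('remove raised {0!r}'.format(e))

try:
  a_set + {8}                    # Sets don't support the + operator
except TypeError as e:
  print('+ raised {0!r}'.format(e))

print(a_set)
```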

#### Comparing sets

- The `union` method returns a new set containing all values that appear in either set (e.g., `{1,2}.union({2,3})` returns `{1,2,3}`), while `intersection` returns only the values that appear in both (here, `{2}`)
- The `difference` method returns all values in the `self` set that are not in the argument set (i.e., asymmetric difference)
- The `symmetric_difference` method returns the set of values that appear in exactly one of the sets being compared (i.e., equivalent to `set_a.union(set_b).difference(set_a.intersection(set_b))`)
- Since sets are unordered, `{1,2,3} == {3,2,1}`
- The `issubset` method asks if all members of `self` are members of the argument set
- The `issuperset` method asks if all members of the argument set are members of `self`
- You can use the `in` keyword to test for a value's membership in a set
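
A short sketch of the comparison methods, using two arbitrary example sets. It also uses the built-in `intersection` method to check the symmetric difference identity.

```python
set_a = {1, 2, 3}
set_b = {3, 4}

union = set_a.union(set_b)                        # Values in either set
intersection = set_a.intersection(set_b)          # Values in both sets
difference = set_a.difference(set_b)              # Values only in set_a
symmetric = set_a.symmetric_difference(set_b)     # Values in exactly one set

print(union, intersection, difference, symmetric)

# Symmetric difference is the union minus the intersection
print(symmetric == union.difference(intersection))

print({1, 2}.issubset(set_a))    # Every member of {1, 2} is in set_a
print(set_a.issuperset({1, 2}))  # set_a contains every member of {1, 2}
print(3 in set_a)
```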

### Dictionaries

- Dicts are mutable collections of key-value pairs (e.g., `{'a_key': 'a_value'}`); since Python 3.7 they preserve insertion order
- They are optimized for retrieving the value using the key as an index (e.g., `{'a_key': 'a_value'}['a_key']` outputs `'a_value'`)
- They're not optimized for getting the key from the corresponding value
- Any hashable datatype (such as strings, numbers, and tuples of hashable items) is allowed as a dictionary key, and you can mix and match in the same dictionary
- Values can be of any datatype
- Keys must be unique, and string keys are case-sensitive

#### Modifying dictionaries

- You can add a new key to an existing dictionary simply by assigning a value to it (`existing_dict['new_key'] = 'some_value'`)
- Modifying the value for an existing key uses exactly the same syntax
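
A minimal sketch of adding and overwriting dictionary entries. The dictionary name and its contents are made up for illustration.

```python
suffix_by_multiple = {1000: 'KB'}

# Assigning to a new key adds an entry
suffix_by_multiple[1024] = 'KiB'

# Assigning to an existing key overwrites its value
suffix_by_multiple[1000] = 'kB'

# String keys are case-sensitive, so these are two separate entries
suffix_by_multiple['Key'] = 'first'
suffix_by_multiple['key'] = 'second'

print(suffix_by_multiple)
```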

### NoneType

- `None` is Python's null value, comparable to `null` or `undefined` in other languages
- `None` is its own thing, and comparing it to anything else (e.g., `False` or `0`) with `==` will return `False`
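
The behavior of `None` can be checked directly. This sketch uses a hypothetical variable, `maybe_value`, and the identity test `is None`, which is the conventional way to check for `None`.

```python
equals_false = (None == False)  # None is not equal to False
equals_zero = (None == 0)       # None is not equal to 0
falsy = bool(None)              # But None is false in a boolean context

maybe_value = None
if maybe_value is None:
  print('maybe_value has not been set')

print(equals_false, equals_zero, falsy, type(None).__name__)
```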

## [Chapter 3: Paths and Comprehensions](https://diveintopython3.problemsolving.io/comprehensions.html)

### Working with paths

#### Setting/getting the working directory

- To make a module on your hard drive available for import, you need to do one of two things: add its folder to the search path with `sys.path.insert` or make the folder the working directory with `os.chdir`
- You can get the current working directory with `os.getcwd()`
- By default, the working directory is either the folder where you installed Python or the folder from which you started Python from the system shell

#### The `os.path` module

- Python tries to maintain a unified API for different file systems (e.g., Windows vs. Linux), but for true cross-compatibility, you should use the utilities in `os.path`
- The `os.path.join` function combines a base file path with an arbitrary number of directories or a file name to produce a combined path
- `os.path.expanduser('~')` gets the user's home directory
- `os.path.split` will split a path into the parent path and the final folder or file name
- `os.path.splitext` will split a file name from its extension (including period)
- The `os.path` functions take and return plain strings, formatted with the separators of the current operating system (there is no `os.path.Path` class; path objects come from the separate `pathlib` module)
- You can use `os.path.realpath` to convert a relative path to an absolute path (also resolving symbolic links), or `os.path.relpath` to do the opposite
- These utilities handle all the annoying little stuff like slash direction and trailing slashes that would otherwise break your code

In [11]:
import os

path = os.path.join(os.path.expanduser('~'), 'Documents', 'Python', 'python-practice')

print('Path: ' + str(path))

(pathname, dirname) = os.path.split(path)

print('Path: ' + str(pathname))
print('Dir: ' + str(dirname))

(filename, extension) = os.path.splitext("test.py")

print("Filename: " + filename)
print("Extension: " + extension)

Path: C:\Users\chris\Documents\Python\python-practice
Path: C:\Users\chris\Documents\Python
Dir: python-practice
Filename: test
Extension: .py


#### Finding files or mapping a folder

- The `glob.glob` function (part of Python's standard library) returns a list of paths matching the `pathname` argument, where the argument is a string with shell-style wildcards; relative patterns are resolved against the current working directory
- A single asterisk (`"*"`) matches any number of characters, but not across directory boundaries, while a double asterisk (`"**"`) crosses directory boundaries when `recursive=True`
- The `recursive` argument tells `glob.glob` whether `**` should descend into subdirectories, and `include_hidden` (added in Python 3.11) determines whether wildcards match hidden files and folders, i.e. names beginning with a period
- You can use multiple wildcards in the same path (e.g., `glob.glob("*/*.py")` will return a list of `.py` files one level below the working directory)

In [12]:
import glob

# Recursively map the entire working directory, but don't include hidden files
glob.glob('**', recursive=True)

['algorithms.ipynb',
 'dive_into_python.ipynb',
 'Pipfile',
 'Pipfile.lock',
 'project_euler.ipynb',
 'README.md',
 'the_python_challenge.ipynb',
 'think_python.ipynb']

#### Accessing file metadata

- `os.stat` returns metadata about a file, such as when it was last modified
- The last modified timestamp is represented in Unix style as the number of seconds since January 1, 1970
- You can use `time.localtime()` or `time.ctime()` to convert this object to a more useful class object or timestamp string

In [13]:
metadata = os.stat("README.md")

# Use inspect module to get public attributes of metadata
import inspect
print([member[0] for member in inspect.getmembers(metadata) if not member[0].startswith('_')])

import time
print("Last modified: " + time.ctime(metadata.st_mtime))

['count', 'index', 'n_fields', 'n_sequence_fields', 'n_unnamed_fields', 'st_atime', 'st_atime_ns', 'st_birthtime', 'st_birthtime_ns', 'st_ctime', 'st_ctime_ns', 'st_dev', 'st_file_attributes', 'st_gid', 'st_ino', 'st_mode', 'st_mtime', 'st_mtime_ns', 'st_nlink', 'st_reparse_tag', 'st_size', 'st_uid']
Last modified: Sun Mar 10 13:04:59 2024


### Comprehensions

- Comprehensions provide a fast and compact way to loop over iterables such as lists, dictionaries, and sets
- They're one of the cooler features of Python, but also somewhat unintuitive for beginners
- A comprehension transforms an iterable into another iterable
- The output of a comprehension is a new iterable of the same size as, or smaller than, the original iterable
- Comprehensions can be used to either apply an arbitrary filter to an iterable, or to transform its elements, or both

#### List and set comprehensions

- A list comprehension takes the form of `[do_something_with(item) if conditional_test(item) else do_something_else_with(item) for item in iterable]`, where the `do_something` transformations and `if` and `else` clauses are optional
- A confusing feature of comprehensions is that there are two different kinds of `if`: an `if ... else` before the `for` is a conditional expression that picks the output value for every item, whereas a bare `if` after the `for` is a filter that drops items: `[do_something_with(item) for item in iterable if conditional_test(item)]`
- A set comprehension takes the same form as a list comprehension, but enclosed in curly braces

In [14]:
# Filter the list to keep only numbers over 2
print([num for num in [1,2,3,4] if num > 2])

# Apply a different transformation depending on whether num > 2
print({num * 2 if num > 2 else num * 4 for num in [1,2,3,4]})

[3, 4]
{8, 4, 6}


(Note that our set comprehension outputs only the **unique** results of the transformation, so that the output set is smaller than the input list because 2 \* 4 == 4 \* 2.)

#### Dictionary comprehensions

- In a dictionary comprehension, you must return a key: value pair for each loop over an item, rather than just a value
- When constructing a dictionary from some other iterable, you must separately derive a key and value from the same item: `{do_something_with(item): do_something_else_with(item) for item in non_dictionary}`
- When constructing a dictionary from another dictionary, you can unpack each item as a key, value pair with the `items()` method (which returns a view of key, value tuples) before doing any filtering or transformation: `{key: do_something_with(value) for key, value in dictionary.items()}`
- To swap the keys and values in a dictionary, you can just use `{value: key for key, value in dictionary.items()}` (although this will raise a `TypeError` if any value is a list or other unhashable type, because dictionary keys must be hashable)
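
The dictionary comprehension forms above can be sketched with small made-up dictionaries. The names here are only for illustration.

```python
# Build a dictionary from a non-dictionary iterable
squares = {n: n * n for n in range(4)}

# Transform the values of an existing dictionary
suffix_by_multiple = {1000: 'KB', 1024: 'KiB'}
lowercase = {key: value.lower() for key, value in suffix_by_multiple.items()}

# Swap keys and values
multiple_by_suffix = {value: key for key, value in suffix_by_multiple.items()}

# Swapping fails when a value is unhashable, such as a list
try:
  {value: key for key, value in {'a': [1, 2]}.items()}
except TypeError as e:
  print('Swap raised {0!r}'.format(e))

print(squares, lowercase, multiple_by_suffix)
```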

In [15]:
metadata_dict = {f: os.stat(f) for f in glob.glob("*")}

filesize_dict = {os.path.splitext(f)[0]: approximate_size(meta.st_size) for f, meta in metadata_dict.items()}

print("README size: " + filesize_dict["README"])

README size: 1.5 KiB


## [Chapter 4: Strings](https://diveintopython3.problemsolving.io/strings.html)

### Text encoding

- Every character you've ever seen on your screen is stored in an "encoding" that provides a mapping between the displayed character and the pattern of bits that represent it in memory
- There are many different encoding schemes, such as different schemes for different languages
- If you look at a page that combines two different encoding schemes, you might see two characters that look identical but are encoded differently in memory
- To properly display text information encoded in a file, you need to know which encoding was used; the encoding acts like a decoding key
- If you've ever visited a web page and seen question marks where apostrophes should be, it's usually because the page didn't declare its encoding properly

### A brief history of encoding schemes

- Historically, each language had its own encoding scheme, typically using the numbers 0-255 (because one byte, i.e. eight bits, can store numbers up to `2**8-1 == 255`)
- English text historically used ASCII, which stores English characters as numbers from 0-127
- Western European languages with diacritical marks used CP-1252 (a.k.a. windows-1252), which overlaps with ASCII for the numbers 0-127 but differs at higher numbers
- Japanese, Chinese, and Korean have so many characters that they had to use two bytes, permitting the use of numbers ranging from 0-65535 (because `2**16-1 == 65535`)
- Word processors developed their own encoding schemes to carry "rich text" information that allowed for a wider range of characters
- The advent of email and the Internet necessitated protocols for transmitting encoding information along with text

### Unicode: a universal standard

#### UTF-32

- Unicode was an attempt to introduce a universal encoding standard that could work across all languages
- The first 128 Unicode code points (0-127) match ASCII, and therefore also the ASCII range of CP-1252
- The most direct Unicode encoding uses 4 bytes for every character, and thus is known as UTF-32 (because there are 32 bits in 4 bytes)
- The advantage of UTF-32 is that you can find the nth character in constant time, because the nth character (counting from zero) starts at byte n*4
- The disadvantage is that this is a pretty inefficient use of disk space, since most people will almost never need the upper range of that encoding

#### UTF-16

- UTF-16 uses two bytes for the most common characters and a pair of two-byte units (a "surrogate pair") for the rest, basically trading off some CPU cycles for memory efficiency
- With UTF-16, you can no longer really find the nth character in constant time unless you maintain a separate index
- Different computer systems store bytes in different order, so multi-byte encodings like UTF-16 and UTF-32 use a "Byte Order Mark" at the beginning of a document to indicate what order to use
- UTF-16 is still pretty memory inefficient, because most text on the Internet (even the HTML tags on Chinese web pages) is in English characters and doesn't need two bytes

#### UTF-8: Variable-length encoding

- UTF-8 uses variable byte length, so that English characters take one byte, extended Latin characters two bytes, and Chinese characters three bytes
- The disadvantage is that finding the nth character is an O(n) (linear time) rather than O(1) (constant time) operation
- The advantages are that UTF-8 is memory efficient and resolves the byte-ordering issue, so we no longer need a Byte Order Mark
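
To make the size trade-offs concrete, the sketch below encodes a few sample characters (chosen here for illustration) and counts the resulting bytes. The `-le` codec variants fix the byte order explicitly, so no Byte Order Mark is added to the counts.

```python
samples = {
  'Latin letter a': 'a',
  'e with acute accent': '\u00e9',
  'Chinese character zhong': '\u4e2d',
  'grinning face emoji': '\U0001F600',
}

byte_lengths = {}
for label, char in samples.items():
  # Record the encoded size as (UTF-8, UTF-16, UTF-32)
  byte_lengths[label] = (
    len(char.encode('utf-8')),
    len(char.encode('utf-16-le')),
    len(char.encode('utf-32-le')),
  )
  print('{0}: UTF-8={1[0]}, UTF-16={1[1]}, UTF-32={1[2]}'.format(label, byte_lengths[label]))
```

The emoji lies outside the range that fits in two bytes, so UTF-16 needs four bytes for it, the same as UTF-8 and UTF-32.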

### Working with strings in Python

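- Python 3 strings are sequences of Unicode characters, and how they are stored in memory is an implementation detail (CPython uses 1, 2, or 4 bytes per character depending on the string's contents, not UTF-8); UTF-8 is the default encoding for source files, and you can convert a string to bytes in any supported encoding using the string class's `encode` method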
- Strings can be created with single quotes, double quotes or three-in-a-row of either type

#### Placeholders

- Using the string class's `format` method, you can pass an arbitrary number of values to replace positional integer placeholders
- The `{0}` placeholder will be replaced by the first argument, `{1}` by the second, and so on, for an arbitrary number of arguments
- You can use empty curly braces (`{}`) and let Python infer the integer indices from the sequence
- It's possible to subset the placeholder with an attribute, an integer index, or an unquoted dictionary key (e.g., `0.st_size`, `0[0]`, `0[some_key]`), but not with a quoted key, a method call like `pop()`, or an index stored in a variable

#### Conversion flags and format specifiers

- You can use a format specifier to right- or center-align text, add leading zeros, sign a number, pad with spaces, control decimal precision, convert to hexadecimal, etc.
- Within a placeholder replacement field, an exclamation point marks the beginning of a conversion flag, and a colon marks the beginning of a format specifier
- Conversion flags and format specifiers are their own mini-language, but I will give a few highlights:
  - `!s` converts an object to a string, whereas `!r` returns a string that would initialize the object if evaluated as code
  - `:+` will sign both positive and negative numbers, while `:-` will sign only negative numbers 
  - `:,` will add commas to a number as thousands separators
  - `:f` formats a number as a decimal fixed point number, while `:g` will use either a fixed-point decimal or scientific notation where appropriate
  - `:%` formats the number as a percentage
  - `:.nf` or `:.ng`, where `n` is a digit, controls precision: `n` digits after the decimal point for `f` and `n` significant digits for `g` (e.g., `{:.2f}` rounds to two decimal places)
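
A few of the flags and specifiers above in action. This is a small sketch with arbitrary sample values, and it also includes alignment, zero-padding, and hexadecimal specifiers (`:>8`, `:05d`, `:x`) as examples of the other capabilities mentioned.

```python
examples = [
  ('{:+}', 5),          # Sign positive numbers too
  ('{:,}', 1234567),    # Thousands separators
  ('{:.2f}', 3.14159),  # Two digits after the decimal point
  ('{:.2g}', 3.14159),  # Two significant digits
  ('{:.1%}', 0.256),    # Percentage with one decimal place
  ('{:>8}', 'hi'),      # Right-align in a field 8 characters wide
  ('{:05d}', 42),       # Pad with leading zeros to 5 digits
  ('{:x}', 255),        # Hexadecimal
  ('{!r}', 'hi'),       # repr() conversion keeps the quotes
  ('{!s}', 42),         # str() conversion
]

formatted = [template.format(value) for template, value in examples]

for (template, value), text in zip(examples, formatted):
  print('{0!r}.format({1!r}) -> {2!r}'.format(template, value, text))
```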

In [16]:
username = "chris"
password = "some_password"

print("User {0}'s password is: {1}".format(username, password))
print("User {1}'s password is: {0}".format(password, username))

# Indexing a list by integer from a placeholder works
size_suffixes = ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']
print("1000 {0[2]} = 1 {0[3]}".format(size_suffixes))

# Indexing a list by an integer stored in a variable doesn't work
index = 2
try:
  "1000 {0[index]} = 1 {0[index]}".format(size_suffixes)
except Exception as e:
  print("Indexing list by integer variable raises {0!r}".format(e))

# Indexing a dict by a quoted string key doesn't work, because the quotes are treated as part of the key
user_dict = {username: password}
try:
  print("User chris's password is: {0['chris']}".format(user_dict))
except Exception as e:
  print("Indexing dict by quoted string raises {0!r}".format(e))

# An unquoted string key does work
print("User chris's password is: {0[chris]}".format(user_dict))

# We can also index a dict by an integer key
size_dict = {1000: 'KB'}
print("1000 bytes in a {0[1000]}".format(size_dict))

# We can access attributes, but not call methods
print("Reference for pop method: {0.pop}".format(size_suffixes))
try:
  print("Last size suffix is: {0.pop()}".format(size_suffixes))
except Exception as e:
  print("Calling pop method raises {0!r}".format(e))

fp_num = 0.00124
print("{:.3f}".format(fp_num))

User chris's password is: some_password
User chris's password is: some_password
1000 GB = 1 TB
Indexing list by integer variable raises TypeError('list indices must be integers or slices, not str')
Indexing dict by quoted string raises KeyError("'chris'")
User chris's password is: some_password
1000 bytes in a KB
Reference for pop method: <built-in method pop of list object at 0x000001BE07D56540>
Calling pop method raises AttributeError("'list' object has no attribute 'pop()'")
0.001


#### Other common string methods

- Using the `len` function on a string returns its length in characters
- The `split` method will split the string by some delimiter and return a list of substrings (excluding the delimiter)
- The `splitlines` method will return a list containing the lines of a multi-line string (i.e. split the string at line boundaries such as `\n` or `\r\n`)
- The `lower` method converts to lower case, and `upper` converts to upper case
- The `count` method counts the number of occurrences of a substring in a string
- You can slice a string exactly like a list (e.g., `some_string[0:12]` would get the first through twelfth characters)
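
The methods in this list can be tried out on a small sample. The following is only a sketch; the `text` value is an arbitrary string chosen for illustration.

```python
text = "Dive Into Python\nby Mark Pilgrim"   # arbitrary two-line sample string

length = len(text)               # number of characters, including the newline
lines = text.splitlines()        # one list item per line, newline removed
words = lines[0].split(" ")      # split the first line on single spaces
lowered = lines[0].lower()
uppered = lines[0].upper()
i_count = text.count("i")        # counting is case-sensitive, so 'I' is not included
first_four = text[0:4]           # slice works like a list slice

print(length, lines, words, lowered, uppered, i_count, first_four, sep="\n")
```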

In [17]:
query = 'user=pilgrim&database=master&password=PapayaWhip'

# Split by ampersands and then by equal signs to get a nested list
queries = query.split("&")
list_of_lists = [q.split("=") for q in queries]

# Coerce the nested list to a dictionary (only works if sub-lists all have len == 2)
a_dict = dict(list_of_lists)
print(a_dict)

{'user': 'pilgrim', 'database': 'master', 'password': 'PapayaWhip'}


#### Bytes objects

- Strings are always immutable sequences of Unicode characters
- You can create a sequence of numeric byte values called a 'bytes' object by prefixing a string literal with `b` (e.g., `bytes_obj = b'abcd'`); only ASCII characters and escape sequences like `\x00` are allowed in such a literal
- Alternatively, you can call the `encode` string method, which allows you to pass an arbitrary encoding scheme as an argument
- Accessing a bytes object by index will return an integer
- You can manipulate bytes objects much like strings, but you can never mix strings and bytes objects
- Bytes objects are immutable, so you can access a byte by index but never assign to it by index; this will raise a `TypeError`
- To work with a mutable data type that will let you modify individual bytes, you can convert the bytes object to a bytearray with `bytearray()`
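
A minimal sketch of these points follows. The sample string and variable names are arbitrary, and UTF-8 is just one possible encoding to pass to `encode`.

```python
text = "café"                        # arbitrary sample with one non-ASCII character
encoded = text.encode("utf-8")       # str -> bytes
decoded = encoded.decode("utf-8")    # bytes -> str

# Mixing str and bytes raises a TypeError
try:
  text + encoded
  mixing_error = None
except TypeError as e:
  mixing_error = e

# Assigning to an index of an immutable bytes object also raises a TypeError
try:
  encoded[0] = 67
  assign_error = None
except TypeError as e:
  assign_error = e

# A bytearray holds the same values but can be modified in place
mutable = bytearray(encoded)
mutable[0] = 67                      # 67 is the code for 'C'

print(len(text), len(encoded))       # character count vs. byte count
print(encoded, mutable)
print(repr(mixing_error), repr(assign_error), sep="\n")
```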

In [20]:
bytes_obj = b'ABC'
print(bytes_obj[0])

bytes_obj += b'\x00'
print(bytes_obj)

65
b'ABC\x00'


## [Chapter 5: Regular Expressions](https://diveintopython3.problemsolving.io/regular-expressions.html)

The Python string methods, such as `index`, `find`, `split`, `count`, and `replace`, are limited to searching for static substrings. For more complex search cases, there's regex. All functionality related to regex is located in the `re` module.

### Regex basics

Important special match characters in regex include:

- `^`: beginning of string
- `$`: end of string
- `\b`: word boundary
- `\s`: whitespace character (space, tab, newline, etc.)
- `\t`: tab
- `\d`: digit
- `\D`: non-digit character
- `.`: any character except a newline
- `\w`: word character (letter, digit, or underscore)

There are also quantifiers that adjust the preceding character, plus an alternation operator:

- `x*`: x zero or more times
- `x+`: x one or more times
- `x{n,m}`: x between n and m times
- `x|y`: either x or y
- `x?`: optionally x

And there are "lookahead" and "lookbehind" methods for finding or excluding patterns that occur together:

- `x(?=y)`: x followed by y (positive lookahead)
- `x(?!y)`: x not followed by y (negative lookahead)
- `(?<=y)x`: x preceded by y (positive lookbehind)
- `(?<!y)x`: x not preceded by y (negative lookbehind)
- Note that a limitation of lookbehinds in Python (positive or negative) is that they must be of fixed character length, so `(?<!ab|c)x` won't work because the lookbehind has variable length of either 1 or 2. But you can combine negative lookbehinds of different lengths like `(?:(?<!ab)(?<!c))x`
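
The four lookaround forms above can be checked on short sample strings. This is a sketch only: the `starts` helper and the sample strings are hypothetical, introduced here for illustration.

```python
import re

def starts(pattern, text):
  # Hypothetical helper: start index of every match of pattern in text
  return [m.start() for m in re.finditer(pattern, text)]

ahead_pos = starts(r'x(?=y)', 'xy xz x')     # x followed by y
ahead_neg = starts(r'x(?!y)', 'xy xz x')     # x not followed by y
behind_pos = starts(r'(?<=y)x', 'yx zx x')   # x preceded by y
behind_neg = starts(r'(?<!y)x', 'yx zx x')   # x not preceded by y

# A lookbehind whose alternatives differ in length is expected to be rejected
try:
  re.compile(r'(?<!ab|c)x')
  variable_width_ok = True
except re.error:
  variable_width_ok = False

# Two fixed-width lookbehinds in a row express the same idea
combined = starts(r'(?:(?<!ab)(?<!c))x', 'abx cx dx')

print(ahead_pos, ahead_neg, behind_pos, behind_neg)
print(variable_width_ok, combined)
```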

A backslash in Python must be escaped with another backslash (e.g., `\\b`) *unless* you precede the string with the letter `r` to tell Python to treat it as a raw string (e.g., `r"\b"` is equivalent to `"\\b"`). It's good practice to *always* use raw strings when doing regex.

Note that regex matching is "greedy", meaning it will try to return the largest possible match for a given pattern. Thus, `x+` will return a single match for the string 'xxx', even though several smaller matches are possible (e.g., the first 'x' would match the pattern all by itself).
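
To make the raw-string and greedy-matching points concrete, here is a small sketch. The sample strings are arbitrary.

```python
import re

# A raw string and its escaped equivalent are the same string
same_pattern = r"\bcat\b" == "\\bcat\\b"

# Greedy matching: x+ consumes all three characters as a single match
greedy = re.findall(r'x+', 'xxx')

# A few of the special characters in action
words = re.findall(r'\w+', 'tab\tseparated words')
numbers = re.findall(r'\d+', 'call 555 or 1212')

print(same_pattern, greedy, words, numbers)
```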

### Working with `re` methods

- The `compile()` method compiles your regex pattern so you can use it multiple times throughout your code without having it recompiled under the hood on every use
- The `sub()` method takes three arguments: the pattern, a replacement for the pattern, and the string in which to perform the replacement
- `sub()` will replace *every* occurrence of the pattern in the string
- The `search()` method takes a pattern to match and a string to match against, and returns either `None` or an object of the `Match` class
- `search()` returns only the **first** match in the string

Note that you can add the `re.IGNORECASE` flag as an extra argument in any of these methods for case-insensitive matching.
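
A short sketch of `compile()`, `search()`, `sub()`, and `re.IGNORECASE` working together follows. The sample sentence is made up for illustration.

```python
import re

text = "Road work ahead on the road to Roadtown"   # arbitrary sample text

# Compile once with the case-insensitive flag, then reuse the pattern
pattern = re.compile(r'\broad\b', re.IGNORECASE)

first = pattern.search(text)         # Match object for the first occurrence only
replaced = pattern.sub('RD', text)   # every whole-word occurrence is replaced

# Without the flag, matching is case-sensitive
case_sensitive = re.search(r'\broad\b', "ROAD CLOSED")

print(first.group(), first.span())
print(replaced)
print(case_sensitive)
```

Because of the `\b` word boundaries, the "Road" inside "Roadtown" is left alone.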

### Working with `Match` object methods

The `Match` class has methods that describe the matches found:

- `span()` returns a tuple containing the start and end positions of the match
- `group()` returns the part of the string where there was a match
- If `search()` found nothing, the result is `None`, so calling `group()` (or any other `Match` method) on it will raise an `AttributeError` you'll need to catch or guard against
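
The following sketch shows these `Match` methods on a sample address, along with one way to guard against a `None` result. The strings and variable names are arbitrary.

```python
import re

match = re.search(r'(\d+) (\w+)', "Lives at 100 BROAD ROAD")

matched_text = match.group()     # the whole matched substring
positions = match.span()         # (start, end) indices of the match
subgroups = match.groups()       # tuple of the parenthesized subgroups

# search() returns None when nothing matches, so check before calling methods
no_match = re.search(r'\d+', "no digits here")
try:
  no_match.group()
  error = None
except AttributeError as e:
  error = e

print(matched_text, positions, subgroups)
print(repr(error))
```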

In [None]:
import re

STRINGS = ['100 BROAD', '100 BROAD ROAD', '100 BROAD ROAD APT. 3']

# The book gives this example for replacing "ROAD" with "RD." in an address
# However, this will still miss some edge cases
for s in STRINGS:
  print(re.sub(r'\bROAD\b', 'RD.', s))

# We can use negative lookahead to avoid abbreviating ROAD before a common street suffix
# The suffixes are wrapped in a non-capturing group so that `.*` applies to every alternative
# Note: For more street suffixes, see: https://en.wikipedia.org/wiki/Street_suffix
pattern = re.compile(r'\bROAD\b(?!.*\b(?:STREET|LANE|ROAD|AVENUE|BOULEVARD|CIRCLE|COURT|HIGHWAY)\b)')
STRINGS.append("100 OLD TOWN ROAD STREET")
STRINGS.append("100 ROAD RUNNER STREET")
for s in STRINGS:
  print(pattern.sub('RD.', s))

### Isolating groups

- To extract subgroups that match your pattern from a string, you can enclose them in parentheses.
- The `groups()` method of the `Match` object then returns a tuple of the matching subgroups.
- If you've used a lookahead or lookbehind pattern that already has parentheses, this doesn't count as a group unless you enclose it in a second set of parentheses.
- For a ten-digit phone number with optional extension, you might use `r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$'`. This always returns four subgroups; if there is no extension, the fourth is an empty string.
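
Here is a brief sketch of what `groups()` returns for the phone number pattern above. The sample numbers are placeholders chosen for illustration.

```python
import re

phone_pattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

# Sample numbers are placeholders, not real contacts
with_ext = phone_pattern.search('800-555-1212 ext. 1234').groups()
without_ext = phone_pattern.search('800-555-1212').groups()

print(with_ext)
print(without_ext)   # the extension group is present but empty
```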

### Verbose regular expressions

- It's possible to define multi-line regular expressions with explanatory comments by using triple quotes, `#` comments, and the `re.VERBOSE` flag.
- In verbose regex, comments are ignored, and so is whitespace unless it is escaped or inside a character class.

In [39]:
# Let's define a verbose pattern to return the street name and street suffix as separate groups
multiline_pattern = r'''
# Any leading numeric digits are ignored
# We exclude leading white space before the first non-numeric character
\s+
# We capture any non-numeric characters, comprising the street name
(\D+)
# We exclude separating white space that we don't want to include in either group
\s+
# We capture the street suffix
(STREET|LANE|ROAD|AVENUE|BOULEVARD|CIRCLE|COURT|HIGHWAY)
# Any characters after the suffix are ignored
'''

# We call `re.search` with the `re.VERBOSE` flag to ignore whitespace and comments
print(re.search(multiline_pattern, "100 WAGON ROAD STREET, APT 2", re.VERBOSE).groups())

# Because of greedy matching, this returns ('WAGON ROAD', 'STREET') and not ('WAGON', 'ROAD')

('WAGON ROAD', 'STREET')


### Example: Roman numerals

The book gives a really smart example case for validating Roman numerals from 1 to 3999 (the largest value that the thousands pattern `M{0,3}` allows). The approach is basically to enumerate the possible patterns for the thousands place, the hundreds place, the tens place, and the ones place, consuming each detected match, and check that the string is empty when we're done.

- For the thousands place, we can use `r'^M{0,3}'`.
- For the hundreds place, we can use `r'(CM|CD|D?C{0,3})'`
- For the tens place, we can use `r'(XC|XL|L?X{0,3})'`
- For the ones place, we can use `r'(IX|IV|V?I{0,3})'`

Instead of looping through and consuming the digits, we can also just put it all together: `r'^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'`. If `search` returns `None`, the numeral is invalid. If it's valid, it returns a `Match`. Note that every part of this pattern is optional, so it also matches the empty string.
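
For comparison, the step-by-step "consume each place" approach described above could be sketched roughly as follows. The function name `is_roman_numeral`, the `PLACE_PATTERNS` list, the leading `^` added to each pattern, and the explicit empty-string check are assumptions of this sketch rather than the book's code.

```python
import re

# One pattern per place value, from thousands down to ones.
# Each is anchored with ^ so that it only consumes from the front (an assumption here).
PLACE_PATTERNS = [
  r'^M{0,3}',
  r'^(CM|CD|D?C{0,3})',
  r'^(XC|XL|L?X{0,3})',
  r'^(IX|IV|V?I{0,3})',
]

def is_roman_numeral(numeral):
  remaining = numeral
  for pattern in PLACE_PATTERNS:
    # Every pattern can match the empty string, so this strips zero or more
    # leading characters; count=1 limits it to the leading match
    remaining = re.sub(pattern, '', remaining, count=1)
  # Valid only if something was given and everything was consumed
  return numeral != '' and remaining == ''

for num in ["XVIV", "MMMM", "MCD", "MMMDCCCLXXXVIII"]:
  print(num, is_roman_numeral(num))
```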

In [37]:
numerals = ["XVIV", "XXXX", "MMMM", "MMM", "XXX", "III", "IV", "XIX", "MCD", "MMMDCCCLXXXVIII"]

for num in numerals:
  print(re.search(r'^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$', num) is not None)

False
False
False
True
True
True
True
True
True
True


## [Chapter 6: Closures and Generators](https://diveintopython3.problemsolving.io/generators.html)

### Closures

One nice feature of a language like Python that treats functions as first-class citizens is that it allows us to dynamically create functions at run-time from variables found in the environment.

For instance, consider the following code that creates a function to pluralize English nouns. The function declaration depends upon a constant found in the environment: a data structure containing patterns and replacements.

In [14]:
import re

plural_rules = [
  (re.compile(r'(?<=[sxz])$'), "es"),
  (re.compile(r'(?<=[^aeioudgkprt]h)$'), "es"),
  (re.compile(r'(?<=[aeioudgkprt]h)$'), "s"),
  (re.compile(r'(?<=[aeiou]y)$'), "s"),
  (re.compile(r'(?<![aeiou])y$'), "ies"),
  (re.compile(r'$'), "s")
]

def pluralize(noun: str) -> str:
  for rule in plural_rules:
    if re.search(rule[0], noun):
      return re.sub(rule[0], rule[1], noun)

  raise Exception("Something went wrong")

nouns_to_pluralize = ['mess', 'vacancy', 'day', 'rough', 'sketch', 'skid']
for noun in nouns_to_pluralize:
  print(pluralize(noun))

messes
vacancies
days
roughs
sketches
skids


An advantage of closures is that they allow us to separate constants like `plural_rules` from function logic like `pluralize`, so that code and constants can be maintained separately. We could even store `plural_rules` in an importable file, library, or database, and then the `pluralize` function would effectively update its behavior whenever we changed the rules between runs.

Another advantage of closures is that we can pass our dynamically generated functions around the program as function arguments, assign them to variables, alias them, or create lists or other data structures populated with anonymous closures to be used as iterables.

For instance, the following code creates a list of tuples, each tuple containing a closure that searches for a match and a closure that replaces the match for a given pluralization rule. These anonymous closures are then retrieved and invoked in the `pluralize` function, where we temporarily assign them to the variables `search` and `sub`.

In [16]:
plural_rules = [
  (re.compile(r'(?<=[sxz])$'), "es"),
  (re.compile(r'(?<=[^aeioudgkprt]h)$'), "es"),
  (re.compile(r'(?<=[aeioudgkprt]h)$'), "s"),
  (re.compile(r'(?<=[aeiou]y)$'), "s"),
  (re.compile(r'(?<![aeiou])y$'), "ies"),
  (re.compile(r'$'), "s")
]

def build_search_and_sub_functions(pattern, replacement):
  search = lambda word : re.search(pattern, word)
  sub = lambda word : re.sub(pattern, replacement, word)
  return search, sub

search_and_sub_functions = [
  build_search_and_sub_functions(pattern, replacement)
  for pattern, replacement in plural_rules
]

def pluralize(word):
  for search, sub in search_and_sub_functions:
    if search(word): return sub(word)
  raise Exception("Something went wrong")

nouns_to_pluralize = ['mess', 'vacancy', 'day', 'rough', 'sketch', 'skid']
for noun in nouns_to_pluralize:
  print(pluralize(noun))

messes
vacancies
days
roughs
sketches
skids
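
To see what a closure actually "encloses", here is a stripped-down sketch that is separate from the pluralization code. The names `make_suffixer` and `add_suffix` are hypothetical; `__closure__` and `cell_contents` are standard attributes of Python function objects.

```python
def make_suffixer(suffix):
  # `suffix` is local to make_suffixer, but the inner function keeps it alive
  def add_suffix(word):
    return word + suffix
  return add_suffix

# Each call builds a new function with its own captured value
add_s = make_suffixer("s")
add_es = make_suffixer("es")

plural_day = add_s("day")
plural_mess = add_es("mess")

# The captured value is stored in a closure cell on the function object
captured = add_es.__closure__[0].cell_contents

print(plural_day, plural_mess, captured)
```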


### Generators

Another special type of function is called a "generator". Calling a generator function returns an iterator (a generator object) that uses a `yield` statement, instead of a `return` statement, to return a value on each iteration. After yielding each value, the generator pauses (saves its local state) and waits to be prompted for the next value. To request the next value, you use the `next()` function on the generator. The generator will pick up exactly where it left off and continue running until it reaches the next `yield` statement.
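
The pause-and-resume behavior can be observed with a tiny generator. This is a sketch only: the `countdown` function and the `log` list are hypothetical names used to record when the body runs.

```python
log = []

def countdown(start):
  # Hypothetical generator that records when its body starts and ends
  log.append("body started")
  while start > 0:
    yield start
    start -= 1
  log.append("body finished")

gen = countdown(2)           # creates the generator; the body has not run yet
log_before = list(log)

first = next(gen)            # runs until the first yield
second = next(gen)           # resumes after the yield, runs until the next one

# When the body ends without another yield, next() raises StopIteration
try:
  next(gen)
  exhausted = False
except StopIteration:
  exhausted = True

print(log_before, first, second, exhausted, log)
```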

There are a couple of different ways to use a generator. First off, we can assign the generator object returned by the function call to a variable, like this, and request values from it repeatedly with `next`:

In [33]:
def make_fib_generator():
  a, b = 0, 1
  while True:
    yield a
    a, b = b, a + b

fib = make_fib_generator()

n = next(fib)
while n <= 1000:
  print(n, end=' ')
  n = next(fib)


0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 

That function will loop infinitely if we allow it to, but more commonly we want to build in a stopping condition. One nice thing about a built-in stopping condition is that it allows us to use the generator to populate a list or gracefully exit a `for` loop without knowing in advance how many values there will be:

In [2]:
def fib(max):
  a, b = 0, 1
  while a <= max:
    yield a
    a, b = b, a + b

print(list(fib(1000)))

for n in fib(1000):
  print(n, end=' ')

[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987]
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 

Combining closures with generators, we could use a generator to get the rules for our `pluralize` function one at a time on each run:

In [34]:
def search_and_sub_functions(plural_rules):
  for pattern, replacement in plural_rules:
    search = lambda word : re.search(pattern, word)
    sub = lambda word : re.sub(pattern, replacement, word)
    yield search, sub

def pluralize(word):
  for search, sub in search_and_sub_functions(plural_rules):
    if search(word): return sub(word)
  raise Exception("Something went wrong")

nouns_to_pluralize = ['mess', 'vacancy', 'day', 'rough', 'sketch', 'skid']
for noun in nouns_to_pluralize:
  print(pluralize(noun))

messes
vacancies
days
roughs
sketches
skids


Advantages of using a "lazy-loading" generator function here:

- Faster startup time, because we're not initializing the `search` and `sub` functions at startup
- `search` and `sub` functions at the bottom of the list, which perhaps handle rarely seen edge cases, need never be initialized until and unless we need them
- We're always using the most up-to-date pluralization rules, because we're not "enclosing" the `plural_rules` until immediately before they're used

Disadvantages of using a generator here:

- We re-initialize the `search` and `sub` functions on every call to `pluralize` rather than just doing it once at startup, which could be a significant performance drag if we use `pluralize` a lot
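
One further caveat with the generator version: the lambdas look up `pattern` and `replacement` when they are called, not when they are created. That is harmless as long as each pair is used before the generator advances, but it matters if the pairs are collected first. The sketch below uses a simplified, hypothetical rule list and yields only the `sub` function to keep things short.

```python
import re

rules = [(r'y$', 'ies'), (r'$', 's')]   # simplified hypothetical rules

def sub_functions(rules):
  for pattern, replacement in rules:
    # The lambda reads pattern and replacement at call time
    yield lambda word: re.sub(pattern, replacement, word)

# Used one at a time, each function sees its own rule
lazy = [sub("fly") for sub in sub_functions(rules)]

# Collected first, every function sees the values from the last iteration
eager = [sub("fly") for sub in list(sub_functions(rules))]

def bound_sub_functions(rules):
  for pattern, replacement in rules:
    # Default arguments capture the current values when the lambda is created
    yield lambda word, p=pattern, r=replacement: re.sub(p, r, word)

bound = [sub("fly") for sub in list(bound_sub_functions(rules))]

print(lazy, eager, bound)
```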

## [Chapter 7: Classes and Iterators](https://diveintopython3.problemsolving.io/iterators.html)

Python is a fully object-oriented language. Everything in Python is an object. And every object in Python is an instance of a class. The class optionally defines properties and methods that are shared across all instances of that class (as well as some "class methods" that belong directly to the class).

### Declaring and inspecting classes

You declare a new class by using the reserved `class` keyword. By convention, class names in Python use PascalCase (capitalize first letter of each word in the class name). Unlike some other languages, Python classes don't need to have explicit constructors and destructors. You could even define an empty class by using the reserved `pass` keyword:

In [9]:
class EmptyObject:
    pass

If you ever encounter a class you haven't worked with before, you can use the `inspect` module to learn about it. For instance, `inspect.getsource` will return the class's source code, and `inspect.getmembers` will return a list of its methods and properties. Note that all objects in Python inherit certain default methods and properties, whose names are wrapped in double underscores. So for instance, if we create an instance of the `EmptyObject` class defined above (which we do by calling it like a function, e.g., `EmptyObject()`), we find it isn't really empty:

In [10]:
import inspect

empty_object = EmptyObject()

print(inspect.getmembers(empty_object))

[('__class__', <class '__main__.EmptyObject'>), ('__delattr__', <method-wrapper '__delattr__' of EmptyObject object at 0x000002C0B0D059A0>), ('__dict__', {}), ('__dir__', <built-in method __dir__ of EmptyObject object at 0x000002C0B0D059A0>), ('__doc__', None), ('__eq__', <method-wrapper '__eq__' of EmptyObject object at 0x000002C0B0D059A0>), ('__format__', <built-in method __format__ of EmptyObject object at 0x000002C0B0D059A0>), ('__ge__', <method-wrapper '__ge__' of EmptyObject object at 0x000002C0B0D059A0>), ('__getattribute__', <method-wrapper '__getattribute__' of EmptyObject object at 0x000002C0B0D059A0>), ('__getstate__', <built-in method __getstate__ of EmptyObject object at 0x000002C0B0D059A0>), ('__gt__', <method-wrapper '__gt__' of EmptyObject object at 0x000002C0B0D059A0>), ('__hash__', <method-wrapper '__hash__' of EmptyObject object at 0x000002C0B0D059A0>), ('__init__', <method-wrapper '__init__' of EmptyObject object at 0x000002C0B0D059A0>), ('__init_subclass__', <built

In many cases, you will want to customize some of these default methods in your class definition. For instance, the `__init__` method is run immediately after a new instance is created (not quite a constructor, because the object has already been constructed by the time this code runs), and the `__repr__` method controls the string representation returned by `repr()`, which `print` also uses when the class doesn't define a separate `__str__` method. Many custom classes in Python define their own logic for these methods.

For instance, we might declare a `Person` class that allows the `firstname` and `lastname` properties to be set by passing string arguments when an instance is created. And when `print` is used on an instance, we might return a printable representation of the person's name. Here's how we could do it:

In [19]:
class Person():
  firstname = ''
  lastname = ''
  
  def __init__(self, firstname, lastname):
    self.firstname = firstname
    self.lastname = lastname
  
  def __repr__(self):
    return f'{self.lastname}, {self.firstname}'

person = Person("Bob", "Smith")
print(person)

Smith, Bob
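
A related detail: `print` calls `str()` on its argument, which uses `__str__` if the class defines one and falls back to `__repr__` otherwise. The sketch below illustrates this with two hypothetical classes, `Point` and `ReprOnly`, which are not part of the book's examples.

```python
class Point:
  # Hypothetical class that defines both representations
  def __init__(self, x, y):
    self.x = x
    self.y = y

  def __repr__(self):
    return f"Point({self.x}, {self.y})"

  def __str__(self):
    return f"({self.x}, {self.y})"

class ReprOnly:
  # Hypothetical class that defines only __repr__
  def __repr__(self):
    return "ReprOnly()"

p = Point(1, 2)
as_str = str(p)              # uses __str__
as_repr = repr(p)            # uses __repr__
fallback = str(ReprOnly())   # no __str__, so __repr__ is used
in_list = str([p])           # containers show their elements with repr()

print(p)
print(as_repr, fallback, in_list)
```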


Note that each method receives `self` as its first argument. The name `self` is a strong convention rather than a reserved keyword, and it gives the method access to the object's "state", so that the method can read and write the object's properties or call its methods. For instance, the `__init__` method defined above accesses the `lastname` and `firstname` properties of the object in order to assign values to them, and the `__repr__` method accesses these properties in order to read their values and use them in creating the desired string representation of the object.

By the way, the instance is always passed as the *first* argument, so `self` must come first in the method definition. Python supplies it automatically whenever the method is called on an instance; we never explicitly pass `self` as an argument in a method *call* such as `person.__repr__()`.
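
A small sketch of how the instance is passed follows. The `Greeter` class is hypothetical, and the parameter name `this` is used only to show that `self` is a convention; idiomatic code should stick with `self`.

```python
class Greeter:
  # Hypothetical class for illustration
  def __init__(this, name):    # any name works for the first parameter, though self is the convention
    this.name = name

  def greet(self):
    return "Hello, " + self.name

g = Greeter("Bob")

bound_call = g.greet()             # Python passes g as the first argument automatically
explicit_call = Greeter.greet(g)   # equivalent call through the class, passing the instance by hand

print(bound_call, explicit_call)
```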

One last note: when inspecting the members of a class, we often may want to filter out all the default Python object methods. We can generally do that by keeping only members whose names don't start with a double underscore. In Python, the double underscore wrapper ("dunder") usually denotes special utility methods that are not intended to be used directly via a public API.

In [23]:
print([
    member for member in inspect.getmembers(person)
    if not member[0].startswith("__")
  ]
)

[('firstname', 'Bob'), ('lastname', 'Smith')]


### Class inheritance

Classes can "inherit" methods and properties from each other, so often you create a new class as a "child" of some "parent". For instance, if we have a `Person` class with `firstname` and `lastname` attributes, we can create a `Student` class that inherits from `Person`, and the new class will have `firstname` and `lastname`, plus any additional attributes we add to it (e.g., `grade`).

Note that when we define an `__init__` or `__repr__` method for a child class, the child's method will run *instead of* the parent's method, not in addition to it. However, we can explicitly call the parent's method within the child's method by using `super()` to access the parent class. For instance, `super().__init__` allows us to run the parent's initialization logic cooperatively with whatever additional logic we want to run in the child:

In [27]:
class Student(Person):
  grade: int = 0

  def __init__(self, firstname: str, lastname: str, grade: int):
    super().__init__(firstname, lastname)
    self.grade = grade

  def __repr__(self):
    return super().__repr__() + f", Grade {self.grade}"


student = Student("Ben", "Zhao", 5)
print(student)

Zhao, Ben, Grade 5


### Iterators