from __future__ import annotations
from typing import Optional

# Lesson 6: Designing Data



A really important aspect of program design is designing the right "data type" for your data. Depending on the data types you select for your program, your program can either become much easier to write or much more difficult to write.

### Three quotes from fancy people on this topic:

#### From [Linus Torvalds](https://en.wikipedia.org/wiki/Linus_Torvalds) (creator of the Linux operating system and the git version control application)
> "Bad programmers worry about the code. Good programmers worry about data structures and their relationships."

#### From [Rob Pike](https://users.ece.utexas.edu/~adnan/pike.html) in his "Rules of Programming" (Rule 5, from his essay "Notes on Programming in C")
> "Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming."

#### From [Fred Brooks](https://en.wikipedia.org/wiki/The_Mythical_Man-Month) in his book, "The Mythical Man-Month"
> "Representation is the essence of computer programming.
...
Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious."



# Designing data is about modelling and about communication

The process of "designing data" is _not required_ in Python. It is, however, a common practice. 

Python is a "dynamic language". When you create a variable, say `a = 4.3`, you do not have to state in advance that the variable `a` is going to be a `float`. You just assign the value and away you go. Any explanation you give about the _type_ of data in a variable is purely for expression and explanation.
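
A minimal sketch of what this means in practice: Python accepts a type annotation but does not enforce it when the code runs. The variable names here are only illustrative.

```python
a = 4.3                 # no declaration needed; 'a' now holds a float
print(type(a))          # <class 'float'>

b: float = 4.3          # same value, with an annotation for the reader
b = "four point three"  # Python does not stop this reassignment at run time
print(type(b))          # <class 'str'>
```

The annotation on `b` tells the reader what was intended, but nothing more. External tools (type checkers) can flag the mismatch, while Python itself runs the code as written.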

Designing a **data definition** does not change how your code runs. It is a way of building a mental model and expressing it in your code. It communicates the intention behind your program design to other programmers (including "future you"), and it helps you understand what you are doing and why you are doing it.

# Data Definitions

In real life, data often has certain limits and cannot be "any value". 

Examples:
* The battery power in your phone is reported in percent and percentages (in the context of your phone battery) are between 0 and 100. So, the limit on this data is that it must be a value between 0 and 100. A battery percentage can be expressed as an `int` but it cannot be "any value" of `int`.
* Frequencies on the FM radio dial go between 87.9 MHz and 107.9 MHz and they increment by 0.2 MHz on the odd tenth (e.g. 87.9, 88.1, 88.3, etc.). A radio frequency could be represented as a `float` or a `str` but they cannot be "any value" of those types.
* Steel W-Sections in Canada range from "W1100x499" down to "W100x19" in particular increments. This can be represented by a `str` or perhaps a `tuple` of `int` (like `(1100, 499)`) but it cannot take "any value" of `str` or `tuple`.

A formal **data definition** is a way of communicating to the reader of the code the parameters of your data, how it works, and how you have decided to model it in Python.

### "Reader" of the code? I thought that was the computer...

It is said that "Code is read more often than it is written." Code is written once but, even though we are the writers of the code, we still often go back and re-read our code to troubleshoot it or to extend, enhance, or optimize it. 

Anyone interacting with your program's code will have to read it for themselves to understand how it works before they can write any new code.

## Three conceptual "families" of data: *Atomic*, *Collection*, -and- *Compound*

So far, we have worked primarily with numbers (`int` and `float`), strings of text (`str`), and lists of things (`list`), with a little bit of `tuple`, which is like a `list` that we cannot change. If we were to split those data types, along with a few new ones, into these three categories, they would look like this:

 Atomic Types | Collection Types | Compound Types
--------------|------------------|------
 `int`        |     `list`      | `dataclass`
 `float`      |       `tuple`    | `NamedTuple`
 `str`        |    **`dict`**        | **`dict`**
 `bool`       |    `set`         |
 `None`       |                  |
 `Optional`   |                  |
 

## Atomic

**Atomic types** are, as the name suggests, data types that we cannot break down any further. An integer is an integer, a string is a string, and a bool is a bool. Even though Atomic types appear simple and obvious to design with, it is important to still follow a data design process because the Atomic data types you choose will affect how you write your program.

**Atomic types in Python**
* `bool`: Can be used to represent binary choices ("in"/"out", "up"/"down", "on"/"off") which can be represented as `True`/`False`
* `int`: Can be used to represent _discrete_ quantities (steps of whole numbers)
* `float`: Can be used to represent _continuous_ quantities
* `str`: Can be used to represent any data that can be described in text (but also a _collection_ of characters)
* `None`: Can be used to represent "null" data or "no value". 
* `Optional`: Can be used to represent a quantity that could have a value or could be `None`, e.g. the level of a battery, where `None` represents no battery present and a `float` represents the level of power.

## Designing Atomic data

The design of atomic data is primarily for communication. It does not have any particular effect on your program at run time, but it can be useful for showing the intention and thinking behind your program.

### Design Template for Atomic Data

```python
DataType: <type> # Interp. represents something about a thing...explain what the values in datatype could be or what their range might be - type names should be TitleCase (convention for python custom types and classes)
    
# Examples - variable names for examples should be all caps (convention for python constants)
DT0 = <an example value>
DT1 = <an example value>
```

**An example:**
```python
AMFreq: int # A Canadian AM radio frequency in kHz. Valid AM frequencies are between 540 and 1170 kHz and step in increments of 10.

#Examples
AMF0 = 1060
AMF1 = 540
AMF2 = 1170
```

**Using the datatype in a function**

```python
def test_valid_freq(freq: AMFreq) -> bool:
    """
    Returns True if 'freq' is a valid AM radio frequency. False, otherwise.
    """
    range_ok = (540 <= freq <= 1170)
    freq_step_ok = (freq % 10 == 0)
    return range_ok and freq_step_ok
```

#### Using `Optional`
```python
from typing import Optional # First, import it from 'typing' module in the standard library

BatteryLevel: Optional[float] # Interp. represents the power of a battery in percent. None represents no battery present.

BL_1 = 12.3
BL_2 = 100.0
BL_3 = None

def check_battery_present(bl: BatteryLevel) -> bool:
    """
    Returns True if the battery is present.
    Returns False otherwise.
    """
    return bl is not None
```
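
A short sketch of why the `None` check matters: any function that receives an `Optional[float]` has to deal with the "no value" case before it treats the value as a number. The function name `describe_battery` is hypothetical and is not part of the lesson's code.

```python
from typing import Optional

def describe_battery(bl: Optional[float]) -> str:
    """
    Returns a short description of the battery level 'bl'.
    """
    if bl is None:
        # Handle the "no battery" case first; formatting None as a number would fail
        return "No battery present"
    return f"Battery at {bl:.1f}%"

print(describe_battery(12.3))  # Battery at 12.3%
print(describe_battery(None))  # No battery present
```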

# Compound

**Compound types** are used to combine different kinds of data to represent something that may have multiple attributes. 

For example, a car which might have attributes of "Make", "Model", "Year", and "Colour". 

There are many kinds of compound types in Python but we will start with two of the most common:

1. `dict` - A built-in, _ad hoc_ compound type useful for when you do not know what the names of the attributes might be.
2. `dataclass` - A simple compound type that requires the user to pre-define the names of the attributes.

> **Note:** The `dict` type is incredibly useful for a wide range of purposes. Its key characteristic is that it stores _pairs_ of data, each a key and a value, in a collection, much like a list stores items. It is suitable for representing data that may come in odd shapes, such as hierarchical or "tree-like" data.

# `dict`

## Basics of `dict` (both a **compound** and a **collection** type)

A `dict` is a dictionary. Just like a printed dictionary where you look up words to find their meanings, you can use a `dict` to look up keys to find their values. These key/value pairs are stored in the dictionary like a list.

You create a dictionary by using _braces_ `{...}`. You separate the keys and values with a _colon_, `:`. You separate key/value pairs with a _comma_ `,`.

An example:
```python
my_dict = {"concrete": "35 MPa", "steel": "400 MPa", "wood": "19.2 MPa"}
```

Use indexing notation to get the value you want. With a list, you use the position of the list item as the index. With a dictionary, you use the key of the item you want.

If a key is _not_ in the `dict`, you get a `KeyError`.

```python
my_dict["concrete"] # Returns "35 MPa"
my_dict["steel"] # Returns "400 MPa"
my_dict["wood"] # Returns "19.2 MPa"
my_dict["glass"] # Raises KeyError because "glass" is not in the dict
```
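
Looking up a missing key raises a `KeyError`, which stops the program unless you plan for it. Two common ways of handling this are sketched below; which one fits depends on your program.

```python
my_dict = {"concrete": "35 MPa", "steel": "400 MPa", "wood": "19.2 MPa"}

# Option 1: check that the key exists before indexing
if "glass" in my_dict:
    print(my_dict["glass"])
else:
    print("No entry for glass")

# Option 2: attempt the lookup and handle the KeyError if it happens
try:
    strength = my_dict["glass"]
except KeyError:
    strength = None

print(strength)  # None
```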

#### `dict` methods

Because `dict` is a built-in type (like `list` and `str`), it also has some built-in methods:

Below are some of the most commonly used methods:
* `.update(new_dict_values: dict)` - Use .update to add new key/values to the dict or to change the value of an existing key. 
* `.get(key, [default = None])` - Use .get to get the value associated with the key. If the key is not found then return the "default" value.
* `.keys()` - Use .keys in a for loop to iterate over the keys of the dictionary
* `.values()` - Use .values in a for loop to iterate over the values of the dictionary
* `.items()` - Use .items in a for loop to iterate over each key, value pair (i.e. loops over both keys/values at the same time)

See the [Lesson 06 Reference](Lesson_06_Reference.ipynb) for examples of how each of these methods is used.
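
As a quick taste, here is a minimal sketch of the methods listed above, using a small dictionary of material strengths. The values are illustrative only.

```python
materials = {"concrete": "35 MPa", "steel": "400 MPa"}

materials.update({"wood": "19.2 MPa"})    # adds a new key/value pair
materials.update({"steel": "350 MPa"})    # changes the value of an existing key

print(materials.get("steel"))             # 350 MPa
print(materials.get("glass"))             # None (key not found, no default given)
print(materials.get("glass", "unknown"))  # unknown (key not found, default returned)

for name in materials.keys():             # loops over the keys only
    print(name)

for strength in materials.values():       # loops over the values only
    print(strength)

for name, strength in materials.items():  # loops over key/value pairs together
    print(f"{name}: {strength}")
```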

# 1. Data definition recipe for `dict` as a _compound_ type

Choose a `dict` for a compound data definition under the following conditions:

1. You do not know _what the attributes are_ that you will need to store as keys
2. You do not know _how many attributes_ you will need to store as keys
3. You do not know the _shape_ of your data (e.g. nested/tree/hierarchical data)


```python
from typing import Dict
MyCompoundType: Dict

# Interp. Represents a ... 
# keys represent ...
# values represent ...

# Examples
MCT1 = {<key1a>: <value1a>, <key1b>: <value1b>, ...}
MCT2 = {<key2a>: <value2a>, <key2b>: <value2b>, ...}
```

### An example

```python
from typing import Dict
WoodMaterial: Dict

# Interp. A dict to represent a structural wood material as defined in CSA O86-14
# species is a str which can be one of "D. Fir", "SPF", "Hem-Fir", or "Northern"
# grade is a str which can be one of "SS", "No. 1/2", "No. 3/Stud"
# E is the elastic modulus in MPa
# Refer to Table 6.3.1A for more information

WMD1 = {"species": "D. Fir", "grade": "SS", "E": 12500}
WMD2 = {"species": "Northern", "grade": "No. 3/Stud", "E": 6500}
WMD3 = {"species": "SPF", "grade": "No. 1/2", "E": 9500}
```
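
Condition 3 in the list above (data whose shape is not known in advance) might look something like the following sketch. The building, storey, and member names are all hypothetical. The point is only that a `dict` can nest to whatever depth the data requires, and that different branches do not need the same keys.

```python
# Hypothetical nested data: a building broken down into storeys and members
building = {
    "name": "Warehouse A",
    "storeys": {
        "L1": {"columns": ["C1", "C2"], "beams": ["B1"]},
        "L2": {"columns": ["C3"]},  # this storey has no "beams" key at all
    },
}

# Chain the keys to walk down the "tree"
print(building["storeys"]["L1"]["columns"])        # ['C1', 'C2']

# .get with a default copes with branches that are missing a key
print(building["storeys"]["L2"].get("beams", []))  # []
```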

# `dataclass`

## Basics of `dataclass`

A dataclass is a special "template maker" for when you need to store structured data of different types. You can think of it as a way of making "custom types".

An example for creating a custom type called `WoodMaterial`:

```python
from dataclasses import dataclass # included in the standard library

@dataclass # this thing is called a "decorator". Just go with it for now.
class WoodMaterial:
    species: str
    grade: str
    E: int

d_fir_ss = WoodMaterial("D. Fir", "SS", 12500)
```

The pieces of data you store in your class are called _attributes_. You access the value stored in each attribute by using "dot notation".

If you try to access an attribute that does not exist in your class, you will get an `AttributeError`.

```python
d_fir_ss.species # Returns 'D. Fir'
d_fir_ss.grade # Returns 'SS'
d_fir_ss.E # Returns 12500
d_fir_ss.fb # Raises AttributeError
```

# `dataclass` methods

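The custom types you create with `dataclass` do not come with general-purpose methods the way `dict`, `str`, and `list` do. The `dataclass` decorator only generates a few special methods for you, such as `__init__`, `__repr__`, and `__eq__`.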

However, you can _create your own_ methods to operate on the data within your custom type.
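
A minimal sketch of what a custom method could look like. The method name `describe` is hypothetical and is not part of the lesson's `WoodMaterial` definition.

```python
from dataclasses import dataclass

@dataclass
class WoodMaterial:
    species: str
    grade: str
    E: int  # elastic modulus in MPa

    def describe(self) -> str:
        """
        Returns a one-line description of the material.
        """
        # 'self' refers to the particular WoodMaterial the method is called on
        return f"{self.species} {self.grade} (E = {self.E} MPa)"

d_fir_ss = WoodMaterial("D. Fir", "SS", 12500)
print(d_fir_ss.describe())  # D. Fir SS (E = 12500 MPa)
```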

# 2. Data definition recipe for `dataclass` (_compound_ type)

```python
from dataclasses import dataclass

@dataclass
class ClassName:
    attribute_a: <type>
    attribute_b: <type>
    attribute_c: <type>
    ...
```

### An example

```python
from dataclasses import dataclass

@dataclass # this thing is called a "decorator". Just go with it for now.
class WoodMaterial:
    species: str
    grade: str
    E: int
    
# Interp. A class to represent a structural wood material as defined in CSA O86-14
# species is a str which can be one of "D. Fir", "SPF", "Hem-Fir", or "Northern"
# grade is a str which can be one of "SS", "No. 1/2", "No. 3/Stud"
# E is the elastic modulus in MPa
# Refer to Table 6.3.1A for more information

# Examples
WM1 = WoodMaterial("D. Fir", "SS", 12500)
WM2 = WoodMaterial("Northern", "No. 3/Stud", 6500)
WM3 = WoodMaterial("SPF", "No. 1/2", 9500)
```

## Differences between our `dict` and `dataclass` compound types

Examples for `dict` and `dataclass` were demonstrated for describing the same data, `WoodMaterial`.

Which of the two data definitions is more appropriate to represent a `WoodMaterial`? Why?

# Collections

**Collection types** can represent groups of *Atomic* types, *Compound* types, or even other *Collection* types.

* `list`
* `tuple`
* `dict`
* `set`

## Designing Collections data

Like Atomic data definitions, defining a Collections data definition with Python's built-in types does not have any effect on your program but can be useful for communication.

You would commonly define a collection data definition if you have already defined an atomic or compound data definition and you also want to describe, say, a list of that data type.

### Design Template for Collection Data

```python
CollectionType: <type>[<subtype>] # Interp. represents some collection of types. Explain if there are any special rules or parameters for your collection (e.g. cannot be empty, or can be up to a certain length) - type names should be TitleCase (convention for Python custom types and classes)
    
# Examples - variable names for examples should be ALL CAPS (convention for python constants)
CT0 = <an example value>
CT1 = <an example value>
```

**An example:**

```python
from typing import List

AMFreqList: List[AMFreq] # Interp. A list of AM radio frequencies. Can be empty.
    
# Examples
AMFL0 = []
AMFL1 = [AMF0, AMF1, AMF2]
```

**Another example, with `dict`:**

```python
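from typing import Dict

AMStation: str # Interp. An AM radio station name (assumed definition; it is not given earlier in this lesson)
AMS0, AMS2 = "STN0", "STN2" # placeholder station names so that the example runs
AMFreqDict: Dict[AMStation, AMFreq] # Interp. A dictionary of AM radio station names and their frequencies
    
# Examples
AMFD0 = {}
AMFD1 = {AMS0: AMF0, "CFRZ": AMF1, AMS2: AMF2}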
```

# Converting Raw Data into Structured Data

The primary purpose of designing a data type is to populate it with data. Where do we get the data?

* Surveys or other hand-recorded information entered into a text file (e.g. csv files)
* Software programs that produce text data outputs (e.g. spColumn, ETABS)
* Electronic sensors or other instruments (e.g. temperature sensors, light sensors)
* Internet "API"s or other data sources (e.g. Twitter, NASA)

Generally speaking, Python can read *text data* in whatever format it comes in. However, you may have to write your own functions to convert the raw text data into your data type.

# An Example

## Start with data definition

```python
import csv
import pathlib
from typing import NamedTuple, List

class BiaxialInteraction(NamedTuple):
    axial: float
    mx: float
    my: float
    
# Interp. Represents a data point within a 3D interaction surface
# axial: a float representing the axial force, compression is +ve
# mx: a float representing the moment about the x-axis, +ve is right-hand rotation
# my: a float representing the moment about the y-axis, +ve is right-hand rotation
```

```python
# Then write the functions to parse your data 

def load_3d_interaction(interaction_file_path: pathlib.Path) -> List[BiaxialInteraction]:
    """
    Returns a list of BiaxialInteraction data points extracted from the raw data at
    'interaction_file_path' where the data was generated by spColumn 3D interaction
    failure surface .csv export.
    """
    raw_data = read_interaction_csv(interaction_file_path)
    raw_data = raw_data[1:] # The first line of the file is the column headings
    biaxial_data = raw_data_to_list_biaxial(raw_data)
    return biaxial_data


def read_interaction_csv(file_path: pathlib.Path) -> List[List[str]]:
    """
    Returns raw data read from the csv file at file_path.
    """
    raw_data = []
    with open(file_path, 'r', newline='') as csv_file: # newline='' is recommended by the csv module docs
        csv_reader = csv.reader(csv_file)
        for line in csv_reader:
            raw_data.append(line)
    return raw_data


def raw_data_to_list_biaxial(raw_data: List[List[str]]) -> List[BiaxialInteraction]:
    """
    Returns a list of biaxial interaction data points from a list of raw string data
    generated from an spColumn 3D interaction surface .csv file.
    """
    biaxial_data = []
    for row in raw_data:
        biaxial_data.append(list_to_biaxial_data(row))
    return biaxial_data


def list_to_biaxial_data(raw_data_row: List[str]) -> BiaxialInteraction:
    """
    Returns a BiaxialInteraction object from data in 'raw_data_row' where 'raw_data_row' is a list
    of str with the following fields:
        - Axial Force
        - Moment X
        - Moment Y
        - N.A. Depth
        - N.A. Angle
        - D_t
        - eps_s
    """
    axial_force = float(raw_data_row[0])
    moment_x = float(raw_data_row[1])
    moment_y = float(raw_data_row[2])
    return BiaxialInteraction(axial=axial_force, mx=moment_x, my=moment_y)

```
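
To see the whole pipeline run without a real spColumn export, here is a condensed, self-contained sketch. The function `load_rows` is a hypothetical stand-in that collapses the four functions above into one, and the sample file content is made up. A real export will have its own headings and values.

```python
import csv
import pathlib
import tempfile
from typing import List, NamedTuple

class BiaxialInteraction(NamedTuple):
    axial: float
    mx: float
    my: float

def load_rows(file_path: pathlib.Path) -> List[BiaxialInteraction]:
    """
    Returns the data rows of the csv file at 'file_path' as BiaxialInteraction
    data points. The first row is assumed to be column headings and is skipped.
    """
    with open(file_path, "r", newline="") as csv_file:
        rows = list(csv.reader(csv_file))
    # Only the first three columns are kept; each str is converted to float
    return [BiaxialInteraction(float(r[0]), float(r[1]), float(r[2])) for r in rows[1:]]

# Made-up sample data: one heading row followed by two data rows
sample = (
    "Axial,Mx,My,NA Depth,NA Angle,D_t,eps_s\n"
    "1000.0,50.0,25.0,300,0,0,0\n"
    "-200.0,0.0,0.0,0,0,0,0\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = pathlib.Path(tmp_dir) / "interaction.csv"
    path.write_text(sample)
    points = load_rows(path)

print(points[0])        # BiaxialInteraction(axial=1000.0, mx=50.0, my=25.0)
print(points[1].axial)  # -200.0
```

Splitting the work into small functions, as the lesson's version does, makes each step easier to test on its own. The condensed form here is only meant to show the raw text going in and the structured data coming out.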


## "Designing data" is the act of modelling things that occur in real life

I find the act of designing data to be one of the most interesting and creative parts of programming. If your data is designed well, then writing the functions can become much easier. 

When designing data, look for opportunities to "capture" the qualities of the real-life phenomenon you are trying to represent and put those same qualities in your data design. For example, we captured the phenomenon that information about an artwork does not change in the design of our `ArtWork` datatype by using `NamedTuple` instead of `dataclass`. By using `NamedTuple` we also have access to the quality that an artwork is unique, because `NamedTuple` is _hashable_.
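
The difference between the two choices can be sketched as follows. The fields of `ArtWork` and the comparison class `ArtWorkDC` are assumptions made for illustration only; they may not match the actual `ArtWork` definition.

```python
from dataclasses import dataclass
from typing import NamedTuple

class ArtWork(NamedTuple):  # fields assumed for illustration
    title: str
    artist: str
    year: int

@dataclass
class ArtWorkDC:  # the same data as a default (mutable) dataclass
    title: str
    artist: str
    year: int

nt = ArtWork("Untitled", "Unknown", 1900)
dc = ArtWorkDC("Untitled", "Unknown", 1900)

dc.year = 1901  # allowed: a default dataclass can be changed after creation

try:
    nt.year = 1901
except AttributeError:
    print("NamedTuple fields cannot be reassigned")

seen = {nt}  # allowed: a NamedTuple of hashable fields can go in a set
try:
    also_seen = {dc}
except TypeError:
    print("A default dataclass instance is not hashable")
```

Because the `NamedTuple` version is hashable, it can be stored in a `set` or used as a `dict` key, which is one way to keep a collection of artworks free of duplicates.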

However, you can also model too much. It is important to balance the amount of thoughtful modelling you put into your datatype with the actual functionality that you need to get out of your datatype: e.g. if I really wanted to capture the qualities of an artwork, should my `ArtWork` datatype also capture the medium? Maybe it should also store a URL to an image of the artwork! That would be cool.

But hold on: does your application _need_ to have this much data captured right now? If you don't actually need this information yet, then do not capture it. If in the future you do, then it is easy to add this to your application _because_ you have taken the time to thoughtfully create a datatype.

## Let's design some data!

For the following scenarios, think about what kinds of data types you could use to represent these phenomena in Python. Think about it on your own and come up with some ideas and then we will discuss in class.

| | | |
|-|-|-|
| The level of gasoline in a vehicle| An inventory of all of the plants in a greenhouse | The colour of a star
| <a href="https://www.bankofcanada.ca/rates/exchange/background-information-on-foreign-exchange-rates/?page_moved=1#Data-format">The US/CAD exchange rate</a> | The weight of a salmon | <a href="https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/">A person's name</a>
| The melody of a song | Your monthly expenses |
| Birthdays you want to remember every year | Cards played in a poker game |
| Loading on a simply supported beam |  | 

## This Week's Workbook

In this week's tasks, you will be asked to do two things:

1. Design a data type to represent the information presented in the task (many "right answers")
2. Design a function to convert the provided raw data into your new data type