# Lesson 6: Designing Data



A really important aspect of program design is designing the right "data type" for your data. Depending on the data types you select for your program, your program can either become much easier to write or much more difficult to write.

There are many different kinds of data types available to us in Python. The most basic and most useful ones are the "built-ins". Many more are at your fingertips as part of the Python standard library (which comes pre-installed with Python). Even more data types are available in 3rd party data manipulation libraries (many of which are pre-installed as part of the "Anaconda" Python distribution).

On top of all of that, you can design your own custom data types to suit your specific needs. This will be taught in this course, starting with this lesson.

## A Word about "Duck Typing"

One of the features of Python is that it uses what is called "duck typing". The idea is that "if it sounds like a duck, walks like a duck, and looks like a duck, then it must be a duck".

Unlike other programming languages such as C, C++, C#, Visual Basic, and Java, Python does not require us to _declare_ the data type of our variable before we assign a value to it. For example:

Python:
```python
a = "cat"
b = [23, 54.3, 42]
```

VB:
```vb
Dim a As String
a = "cat"

Dim b As Double()
b = {23, 54.3, 42}
```

The idea with Python is that something can be, say, "list-like" and work just like a list even if it is not actually a list. This makes the language very flexible and fun to use. 

```python
def example_func(list_like_data):  # This will work with lists, tuples, strings, dicts, sets (and more) because they are all iterable.
    acc = []
    for item in list_like_data:
        acc.append(item)
    return acc
```
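As a quick check of the claim in the comment above, this sketch repeats `example_func` and calls it with several different iterables. The variable names are only for illustration.

```python
def example_func(list_like_data):
    """Copies the items of any iterable into a new list."""
    acc = []
    for item in list_like_data:
        acc.append(item)
    return acc


from_list = example_func([1, 2, 3])         # [1, 2, 3]
from_tuple = example_func((1, 2, 3))        # [1, 2, 3]
from_str = example_func("cat")              # ['c', 'a', 't']
from_dict = example_func({"a": 1, "b": 2})  # ['a', 'b'] (iterating a dict gives its keys)
from_range = example_func(range(3))         # [0, 1, 2]
```

The function never asks what type it was given. It only needs the data to work in a `for` loop.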

Similarly, when we write a function, we do not _need_ to declare the types of our function arguments. For example:

```python
def my_func(name, age):
    """
    Returns a string greeting someone by their 'name' and 'age'.
    """
    return f"Hi, {name.capitalize()}! I hear you are {age} years old."
```
**vs.**

```python
def my_func(name: str, age: int) -> str:
    """
    Returns a string greeting someone by their 'name' and 'age'.
    """
    return f"Hi, {name.capitalize()}! I hear you are {age} years old."
```

These Python type "declarations" are what we call "type annotations". They do not affect how the function runs in any way. Python does not check or enforce them, so as far as running the function is concerned, they are passed over much like comments are. (The compound types later in this lesson are a special case: `dataclass` and `NamedTuple` read the annotations to learn the names of their fields.)

The main reason to include type annotations in Python is for people (like yourself) who may be reading your code in the future and trying to figure out what is going on.
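To see that annotations are not enforced, here is a small sketch using the annotated `my_func` from above. The argument values are made up for illustration.

```python
def my_func(name: str, age: int) -> str:
    """
    Returns a string greeting someone by their 'name' and 'age'.
    """
    return f"Hi, {name.capitalize()}! I hear you are {age} years old."


# The annotation says 'age' is an int, but Python does not check it.
# Passing a str works without any error or warning.
greeting = my_func("connor", "forty")

# The annotations are stored on the function so that tools (and people) can read them.
hints = my_func.__annotations__
```

Here `greeting` is `"Hi, Connor! I hear you are forty years old."`, even though `"forty"` is not an `int`.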

I always use type annotations in my functions and I find them to be very useful. However, I do not use them on _every variable_. There is a balance that can be struck.

As we talk about creating data definitions, we will be using some type annotations. I recommend going through the process that I am showing you and getting into the habit of describing your data with data definitions and type annotations.

## Three conceptual "families" of data types: *Atomic*, *Collection*, -and- *Compound*

So far, we have worked primarily with numbers (`int` and `float`), strings of text (`str`), and lists of things (`list`), along with a little bit of work with `tuple`, which is like a `list` that we cannot change. If we were to split Python's data types into these three categories, they would look like this:

 Atomic Types | Collection Types | Compound Types
--------------|------------------|------
 `int`        | `list`           | `dataclass`
 `float`      | `tuple`          | `NamedTuple`
 `str`        | `dict`           | Other "classes"
 `bool`       | `set`            |
 `None`       |                  |
 `Optional`   |                  |
 
## Atomic

**Atomic types** are, as the name suggests, data types that we cannot break down any further. An integer is an integer and a string is a string. Even though Atomic types appear simple and obvious to design with, it is important to still follow a data design process because the Atomic data types you choose will affect how you write your program.

**Atomic types in Python**
* `bool`: Can be used to represent binary choices ("in"/"out", "up"/"down", `True`/`False`)
* `int`: Can be used to represent _discrete_ quantities (steps of whole numbers)
* `float`: Can be used to represent _continuous_ quantities
* `str`: Can be used to represent any data that can be described in text
* `None`: Can be used to represent "null" data or "no value". 
* `Optional`: Can be used to represent a quantity that could either have a value or be `None`, e.g. the level of a battery, where `None` represents no battery present and a `float` represents the level of power.

#### Using `Optional`
```python
from typing import Optional # First, import it from 'typing' module in the standard library

BatteryLevel: Optional[float] # Interp. represents the power of a battery in percent. None represents no battery present.

BL_1 = 12.3
BL_2 = 100.0
BL_3 = None
```
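A function that receives an `Optional` value usually needs to handle the `None` case first. The sketch below shows one way to do that. The function `describe_battery` is a hypothetical helper, not something defined elsewhere in this lesson.

```python
from typing import Optional


def describe_battery(level: Optional[float]) -> str:
    """
    Returns a short description of a battery level in percent.
    'level' is None when there is no battery present.
    (Hypothetical helper, for illustration only.)
    """
    # Compare with 'is None' rather than 'if not level':
    # a level of 0.0 is a real value (an empty battery), not a missing one.
    if level is None:
        return "No battery present"
    return f"Battery at {level}%"


msg_1 = describe_battery(12.3)  # "Battery at 12.3%"
msg_2 = describe_battery(0.0)   # "Battery at 0.0%"
msg_3 = describe_battery(None)  # "No battery present"
```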

## Collections

**Collection types** can represent groups of *Atomic* types, *Compound* types, or even other *Collection* types. We have seen all of the built-in collection types in Python and they each lend themselves to particular applications:

**Collection types in Python**
* `list`: A generic collection type. Useful when you need the order of items maintained, you only need to look up items by their position in the list (the item "index"), and you are generally adding or removing items at the _end_ of the collection.
* `tuple`: An immutable collection type. Very similar to `list` except you cannot add or remove items. Useful when representing groups of data that are not expected to change, e.g. x, y, z coordinates. Because tuples are immutable, they can be _hashed_ (as long as everything inside them can also be hashed), which allows tuples to be used as keys in a `dict` and to be put into a `set`.
* `dict`: A collection type to represent _mappings_, when you need to correlate one piece of information with another, e.g. name and age, or file names and file contents, or beam names and beam lengths. As of Python 3.7, dictionaries are also guaranteed to maintain the order in which items were added (a sometimes useful feature; CPython 3.6 already behaved this way). Dictionary keys must be _hashable_, which in practice usually means immutable.
* `set`: An _unordered_ collection to represent _unique items_ in a collection. Useful when working with groups of data that are somehow mutually exclusive and when you might need to quickly _compare_ the members of two or more groups. e.g. collections of `(x, y, z)` coordinates (what are the coordinates that are in "group A" but not in "group B"?)
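
The short sketch below shows each of these collection types doing the job described above. All of the names and values are made up for illustration.

```python
# list: ordered, looked up by index, items added at the end
beam_lengths = [4.5, 6.0, 3.2]
beam_lengths.append(7.5)
last_length = beam_lengths[-1]  # 7.5

# tuple: immutable, so it can be used as a dict key
# dict: maps one piece of information (a coordinate) to another (a load)
node_loads = {(0, 0, 0): 12.5, (0, 0, 3): 8.0}
load_at_origin = node_loads[(0, 0, 0)]  # 12.5

# set: unique items, with quick comparisons between groups
group_a = {(0, 0, 0), (1, 0, 0), (1, 1, 0)}
group_b = {(1, 0, 0), (2, 2, 0)}
only_in_a = group_a - group_b  # coordinates in "group A" but not in "group B"
in_both = group_a & group_b    # coordinates in both groups
```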

### Defining Data Types with Atomic and Collection built-in types

Typically, you do not need to set up a special definition for an atomic or collection built-in type. However, if the built-in data type is going to represent something with specific properties, then it might make sense to create a definition.

**How to Define a Data Type**

1. Any applicable imports
2. Type creation
3. An "interp." comment to describe what the type is and how it works. Give an "operating range", if applicable.
4. Examples

**Example:**

```python
from __future__ import annotations # Stops Python from evaluating annotations, so 'StarColour' below can be used in function annotations

StarColour: int
# interp. The colour of a star in surface-temperature Kelvin. Range from 2000 and up (hottest known star: WR102 @ 210000)

# Examples
SC1 = 6000
SC2 = 20000
```

Now that you have written your data definition, you can use it in your function's type annotations without causing an error. Note that there is no real "enforcement" mechanism for the range; the definition just communicates your intention.

```python
def check_star_blue(sc: StarColour) -> bool:
    """
    Returns True if the star colour, 'sc' is in the blue range.
    """
    if sc > 8000:
        return True
    else:
        return False
```
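
The sketch below runs `check_star_blue` on the two examples and on a value outside the stated range, to show that nothing enforces it. As an assumption made so the snippet runs on its own, `StarColour` is written here as a plain alias (`StarColour = int`) instead of the definition style used above.

```python
StarColour = int  # Assumption: a plain alias, so this snippet runs without other setup
# interp. The colour of a star in surface-temperature Kelvin. Range from 2000 and up.

SC1 = 6000
SC2 = 20000


def check_star_blue(sc: StarColour) -> bool:
    """
    Returns True if the star colour, 'sc', is in the blue range.
    """
    return sc > 8000


is_blue_1 = check_star_blue(SC1)   # False
is_blue_2 = check_star_blue(SC2)   # True
is_blue_3 = check_star_blue(-500)  # False: out of range, but no error is raised
```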

## Compound

**Compound types** are types that are made up of atomic types, collection types, and perhaps other compound types in a kind of *container*. They are useful when you are trying to describe a thing that has either multiple _dimensions_ of data (e.g. x, y, colour) or has multiple _fields_ of data (e.g. an artwork that has a title, creator, date created, medium).

**We will start by using two types of compound datatypes: `dataclass` and `NamedTuple`**
* `dataclass`: A compound type that can be given its own name and as many kinds of fields as a user needs. A _mutable_ datatype that, by default, cannot be _hashed_.
* `NamedTuple`: A compound type with all of the same qualities as `dataclass` but is _immutable_ and can be _hashed_ (used as `dict` keys or used in `set`).

### Defining **Compound** types

1. Any applicable imports
2. Type creation
3. An "interp." comment to describe what the type is and how it works
4. Examples

### Compound type: `dataclass`

Make a `dataclass` like this:

```python
from dataclasses import dataclass # First import dataclass from the dataclasses module in the standard library
from typing import Optional # Needed for the 'colour' field below

@dataclass # This "@" symbol is called a decorator. Don't worry about it for now. I will talk about what this does at the end of the course
class LEDPixel:
    x: int
    y: int
    colour: Optional[str]
       
        
# Interp. Represents a single LED light in a 20x20 grid of LED lights.
# x: represents the light's location horizontally in the grid 0 <= x <= 19
# y: represents the light's location vertically in the grid 0 <= y <= 19
# colour: represents the current colour of the LED, can be either "red", "green", "blue", "white", or None (if LED is off)
```

To create new `LEDPixel` objects (instances of your dataclass):

```python
LEDPx_1 = LEDPixel(x=0, y=0, colour="white")
LEDPx_2 = LEDPixel(8, 12, None)
```

We can now access the _attributes_ of an `LEDPixel` object through "dot notation":

```python
LEDPx_1.x
LEDPx_1.y
LEDPx_1.colour
```

**Notice: To represent an LED pixel, using a `dataclass` would be preferred over a `NamedTuple` because we would expect the colour of an individual pixel to change over time.**
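
A minimal sketch of what "mutable" means here, repeating the `LEDPixel` definition so the snippet runs on its own. The variable names `px` and `hashable` are only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LEDPixel:
    x: int
    y: int
    colour: Optional[str]


px = LEDPixel(x=3, y=7, colour="red")
px.colour = "blue"  # Attributes can be reassigned after creation
px.colour = None    # The LED is now off

# A default dataclass cannot be hashed, so it cannot be a dict key or a set member.
try:
    hash(px)
    hashable = True
except TypeError:
    hashable = False
```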

### Compound type: `NamedTuple`

Make a `NamedTuple` like this:

```python
from typing import NamedTuple # Import NamedTuple from the typing module in the standard library

class ArtWork(NamedTuple):
    title: str
    creator: str
    year_created: int
       
        
# Interp. Represents a single piece of artwork
# title: a str representing the artwork's title
# creator: a str representing the artwork's creator
# year_created: an int representing the date created in YYYY format
```

To create new `ArtWork` objects with the `ArtWork` NamedTuple:

```python
AW_1 = ArtWork(title="Nude Descending a Staircase, No. 2", creator="Marcel Duchamp", year_created=1912)
AW_2 = ArtWork("Ignorance = Fear", "Keith Haring", 1989)
```

We also use "dot notation" to access the _attributes_ of our `ArtWork` objects:

```python
AW_2.title
AW_2.creator
AW_2.year_created
```


**Notice: To represent a piece of artwork, a `NamedTuple` could be more useful than a `dataclass` since the information about a piece of artwork does not change.**

Since we do not expect the data to change, we can take advantage of `NamedTuple`'s immutability. Because each piece of artwork is _unique_, we also gain the ability to use an `ArtWork` as a dictionary key or as a member of a set.

While we _could_ just use a `dataclass` to represent an `ArtWork` and not worry about it, by choosing `NamedTuple` instead our code becomes more _expressive_. "Expressive code" means that we are able to communicate our creative intentions through our _choices_ of code instead of writing out long comments explaining ourselves. i.e. it's the idea of "show, don't tell".
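
The sketch below shows both qualities of `NamedTuple` in action, repeating the `ArtWork` definition so it runs on its own. The `gallery_locations` dictionary and the room name are made up for illustration.

```python
from typing import NamedTuple


class ArtWork(NamedTuple):
    title: str
    creator: str
    year_created: int


aw = ArtWork("Ignorance = Fear", "Keith Haring", 1989)

# Immutable: trying to change an attribute raises an AttributeError.
try:
    aw.year_created = 1990
    changed = True
except AttributeError:
    changed = False

# Hashable: an ArtWork can be used as a dict key...
gallery_locations = {aw: "Room 4"}

# ...and can be put in a set, where equal artworks are stored only once.
collection = {aw, ArtWork("Ignorance = Fear", "Keith Haring", 1989)}
```

After this runs, `changed` is `False` and `collection` holds a single `ArtWork`.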

## Writing functions with compound data types

The main features of compound data types are their _attributes_. When writing functions with compound data types, you are often accessing the attributes and making a comparison or changing their attributes based on other data.

**Example with `ArtWork`**:

```python
from typing import List # Needed for the List[...] annotations below

def artwork_by_year_created(loaw: List[ArtWork], year_created: int) -> List[ArtWork]:
    """
    Returns a list of ArtWork representing the artworks in 'loaw' that were created
    in the year, 'year_created'.
    """
    acc = []
    for aw in loaw:
        if aw.year_created == year_created:
            acc.append(aw)
    return acc
```
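
Here is a short usage sketch for the function above, using the two example artworks from earlier in the lesson. The definitions are repeated so that the snippet runs on its own.

```python
from typing import List, NamedTuple


class ArtWork(NamedTuple):
    title: str
    creator: str
    year_created: int


def artwork_by_year_created(loaw: List[ArtWork], year_created: int) -> List[ArtWork]:
    """
    Returns a list of ArtWork representing the artworks in 'loaw' that were created
    in the year, 'year_created'.
    """
    acc = []
    for aw in loaw:
        if aw.year_created == year_created:
            acc.append(aw)
    return acc


AW_1 = ArtWork("Nude Descending a Staircase, No. 2", "Marcel Duchamp", 1912)
AW_2 = ArtWork("Ignorance = Fear", "Keith Haring", 1989)

from_1912 = artwork_by_year_created([AW_1, AW_2], 1912)  # [AW_1]
from_2000 = artwork_by_year_created([AW_1, AW_2], 2000)  # [] (no matches)
```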

**Example with `LEDPixel`**:

```python
def change_pixel_colour(led_px: LEDPixel, new_colour: str, x: int) -> LEDPixel:
    """
    Returns 'led_px' with its .colour attribute updated to 'new_colour' if
    its x coordinate is a match for 'x'. Otherwise, 'led_px' is returned unchanged.
    """
    if led_px.x == x:
        led_px.colour = new_colour  # Changes the pixel in place
    return led_px
```
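
One thing worth noticing about this function is that it changes the pixel it was given, rather than building a new one. The sketch below repeats the definitions so it runs on its own and shows that the returned pixel is the very same object that was passed in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LEDPixel:
    x: int
    y: int
    colour: Optional[str]


def change_pixel_colour(led_px: LEDPixel, new_colour: str, x: int) -> LEDPixel:
    """
    Returns 'led_px' with its .colour attribute updated to 'new_colour' if
    its x coordinate is a match for 'x'.
    """
    if led_px.x == x:
        led_px.colour = new_colour
    return led_px


px = LEDPixel(8, 12, None)

updated = change_pixel_colour(px, "green", x=8)  # x matches: colour is changed
same_object = updated is px                      # True: 'px' itself was changed

change_pixel_colour(px, "red", x=5)              # x does not match: nothing happens
final_colour = px.colour                         # "green"
```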

## "Designing data" is the act of modelling things that occur in real life

I find the act of designing data to be one of the most interesting and creative parts of programming. If your data is designed well, then writing the functions can become much easier. 

When designing data, look for opportunities to "capture" the qualities of the real-life phenomenon you are trying to represent and put those same qualities in your data design. For example, we captured the phenomenon that information about an artwork does not change in the design of our `ArtWork` datatype by using `NamedTuple` instead of `dataclass`. By using `NamedTuple` we also have access to the quality that an artwork is unique, because `NamedTuple` is _hashable_.

However, you can also model too much. It is important to balance the amount of thoughtful modelling you put into your datatype with the actual functionality that you need to get out of your datatype: e.g. if I really wanted to capture the qualities of an artwork, should my `ArtWork` datatype also capture the medium? Maybe it should also store a URL to an image of the artwork! That would be cool.

But hold on: does your application _need_ to have this much data captured right now? If you don't actually need this information yet, then do not capture it. If in the future you do, then it is easy to add this to your application _because_ you have taken the time to thoughtfully create a datatype.
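
As a rough sketch of what "easy to add" could look like, suppose we later decide we do need the medium. The `medium` field below is hypothetical. Giving it a default value of `None` is one possible approach, and it means any existing code that creates an `ArtWork` with three values keeps working.

```python
from typing import NamedTuple, Optional


class ArtWork(NamedTuple):
    title: str
    creator: str
    year_created: int
    medium: Optional[str] = None  # Hypothetical new field; None means "not recorded"


# Existing code that passes three values still works.
AW_2 = ArtWork("Ignorance = Fear", "Keith Haring", 1989)

# New code can supply the extra field.
AW_1 = ArtWork(
    title="Nude Descending a Staircase, No. 2",
    creator="Marcel Duchamp",
    year_created=1912,
    medium="Oil on canvas",
)
```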

## Let's design some data!

For the following scenarios, think about what kinds of data types you could use to represent these phenomena in Python. Think about it on your own and come up with some ideas and then we will discuss in class.

| | | |
|-|-|-|
| The level of gasoline in a vehicle| An inventory of all of the plants in a greenhouse | The colour of a star
| <a href="https://www.bankofcanada.ca/rates/exchange/background-information-on-foreign-exchange-rates/?page_moved=1#Data-format">The US/CAD exchange rate</a> | The weight of a salmon | <a href="https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/">A person's name</a>
| The melody of a song | Your monthly expenses | Your daily to-do list
| Birthdays you want to remember every year | Cards played in a poker game | All of the steel W-sections in the handbook
| Loading on a simply supported beam |  | 