from __future__ import annotations

# Lesson 6: Designing Data



A really important aspect of program design is designing the right "data type" for your data. Depending on the data types you select for your program, your program can either become much easier to write or much more difficult to write.

There are many different kinds of data types available to us in Python. The most basic and most useful ones are the "built-ins". Many more are at your fingertips as part of the Python standard library (which comes pre-installed with Python). There are EVEN MORE data types available in third-party data manipulation libraries (many of which are pre-installed as part of the "Anaconda" Python distribution).

On top of all of that, you can design your own custom data types to suit your specific needs. This will be taught in this course, starting with this lesson.

## Three conceptual "families" of data types: *Atomic*, *Collection*, -and- *Compound*

So far, we have worked primarily with numbers (`int` and `float`), strings of text (`str`), and lists of things (`list`), plus a little bit of work with `tuple`, which is like a `list` that we cannot change. If we sort those data types, along with the new ones introduced in this lesson, into the three families, they look like this:

 Atomic Types | Collection Types | Compound Types
--------------|------------------|------
 `int`        |     `list`      | `dataclass`
 `float`      |       `tuple`    | `NamedTuple`
 `str`        |    `dict`        | custom _class_
 `bool`       |    `set`         |
 `None`       |                  |
 `Optional`   |                  |
 

## Atomic

**Atomic types** are, as the name suggests, data types that we cannot break down any further. An integer is an integer and a bool is a bool. Even though Atomic types appear simple and obvious to design with, it is important to still follow a data design process because the Atomic data types you choose will affect how you write your program.

**Atomic types in Python**
* `bool`: Can be used to represent binary choices ("in"/"out", "up"/"down", `True`/`False`)
* `int`: Can be used to represent _discrete_ quantities (steps of whole numbers)
* `float`: Can be used to represent _continuous_ quantities
* `str`: Can be used to represent any data that can be described in text (but also a _collection_ of characters)
* `None`: Can be used to represent "null" data or "no value". 
* `Optional`: Can be used to represent a quantity that could have a value or could be `None`, e.g. the level of a battery, where a `float` represents the level of power and `None` represents no battery present.

## Designing Atomic data

The design of atomic data is primarily for communication. It does not have any particular effect on how your program runs, but it is useful for showing the intention and thinking behind your program.



### Design Template for Atomic Data

```python
DataType: <type> # Interp. represents something about a thing...explain what the values in datatype could be or what their range might be - type names should be TitleCase (convention for python custom types and classes)
    
# Examples - variable names for examples should be all caps (convention for python constants)
DT0 = <an example value>
DT1 = <an example value>
```

**An example:**

```python
AMFreq: int # An AM radio frequency. Values are in kHz and range from 540 to 1170 in 10 kHz increments
    
# Examples
AMF0 = 540
AMF1 = 1030
AMF2 = 1170
```

**Another example:**

```python
AMStation: str # A Canadian AM radio call sign. In Canada, radio call signs are three or four characters long, all caps, and range from "CFA" to "CKZ". Cannot be an empty str.
    
# Examples
AMS0 = "CFAS"
AMS1 = "CKA"
AMS2 = "CGRT"
```
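To see why these data definitions are mainly communication, here is a minimal sketch (not part of the original lesson code) showing that a bare annotation such as `AMFreq: int` records an intention without creating a variable or enforcing anything. The name `NOT_REALLY_AM` is made up for illustration.

```python
AMFreq: int  # bare annotation: states the intent, assigns nothing

# No variable called AMFreq exists after the line above.
amfreq_is_bound = "AMFreq" in globals()
print(amfreq_is_bound)  # False

# The examples are ordinary ints. Python does not check the 540 to 1170 range.
AMF0 = 540
NOT_REALLY_AM = 99999  # hypothetical bad value, accepted without complaint
print(type(AMF0) is int, type(NOT_REALLY_AM) is int)  # True True
```

Because `AMFreq` is never bound to a value, using it in a function signature only works when annotations are not evaluated, which is presumably why the notebook starts with `from __future__ import annotations`.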


**Using the datatype in a function**

```python
def test_valid_freq(freq: AMFreq) -> bool:
    """
    Returns True if 'freq' is a valid AM radio frequency. False, otherwise.
    """
    range_ok = (540 <= freq <= 1170)
    freq_step_ok = (freq % 10 == 0)
    if range_ok and freq_step_ok:
        return True
    return False
```
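As a rough usage sketch, the function can be tried against values inside and outside the interp. Here `is_valid_freq` mirrors the logic of `test_valid_freq` under a different name, and `is_valid_station` is a hypothetical helper that applies the same idea to the `AMStation` interp. Neither is part of the course library.

```python
def is_valid_freq(freq: int) -> bool:
    """Returns True if 'freq' fits the AMFreq interp. False, otherwise."""
    return 540 <= freq <= 1170 and freq % 10 == 0


def is_valid_station(station: str) -> bool:
    """
    Hypothetical helper: returns True if 'station' fits the AMStation interp
    (3 or 4 capital letters whose first three fall between "CFA" and "CKZ").
    """
    right_length = len(station) in (3, 4)
    capital_letters = station.isascii() and station.isalpha() and station.isupper()
    in_range = "CFA" <= station[:3] <= "CKZ"
    return right_length and capital_letters and in_range


print(is_valid_freq(540))        # True
print(is_valid_freq(545))        # False: not a 10 kHz step
print(is_valid_freq(1180))       # False: above the range
print(is_valid_station("CFAS"))  # True
print(is_valid_station("cka"))   # False: not all caps
print(is_valid_station(""))      # False: empty str
```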

#### Using `Optional`
```python
from typing import Optional # First, import it from 'typing' module in the standard library

BatteryLevel: Optional[float] # Interp. represents the power of a battery in percent. None represents no battery present.

BL_1 = 12.3
BL_2 = 100.0
BL_3 = None

def check_battery_power(bl: BatteryLevel) -> bool:
    """
    Returns True if the battery with battery level 'bl' has any power.
    Returns False otherwise (no battery present, or a level of 0).
    """
    if bl is None:
        return False
    return bl > 0
```
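One detail worth seeing in isolation is why the check is written with `is None` / `is not None` rather than plain truthiness: a level of `0.0` is "falsy" in Python, so `if bl:` would treat an empty battery the same as a missing one. The sketch below uses a hypothetical helper, `describe_battery`, to show the three cases.

```python
from typing import Optional


def describe_battery(bl: Optional[float]) -> str:
    """Hypothetical helper: describes a battery level in words."""
    if bl is None:   # check for None first...
        return "no battery"
    if bl == 0.0:    # ...so that 0.0 is not mistaken for "missing"
        return "empty"
    return f"{bl:.1f}% charged"


print(describe_battery(None))  # no battery
print(describe_battery(0.0))   # empty
print(describe_battery(12.3))  # 12.3% charged
print(bool(0.0), bool(None))   # False False: truthiness cannot tell them apart
```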

## Collections

**Collection types** can represent groups of *Atomic* types, *Compound* types, or even other *Collection* types.

* `list`
* `tuple`
* `dict`
* `set`

## Designing Collections data

Like atomic data definitions, a collection data definition built on Python's built-in types does not have any effect on how your program runs, but it can be useful for communication.

You would commonly define a collection data definition if you have already defined an atomic or compound data definition and you also want to describe, say, a list of that data type.

### Design Template for Collection Data

```python
CollectionType: <type>[<subtype>] # Interp. represents some collection of types. Explain if there are any special rules or parameters for your collection (e.g. cannot be empty, or can be up to a certain length) - type names should be TitleCase (convention for Python custom types and classes)
    
# Examples - variable names for examples should be ALL CAPS (convention for python constants)
CT0 = <an example value>
CT1 = <an example value>
```

**An example:**

```python
from typing import List

AMFreqList: List[AMFreq] # Interp. A list of AM radio frequencies. Can be empty.
    
# Examples
AMFL0 = []
AMFL1 = [AMF0, AMF1, AMF2]
```

**Another example, with `dict`:**

```python
from typing import Dict

AMFreqDict: Dict[AMStation, AMFreq] # Interp. A dictionary of AM radio station names and their frequencies. Can be empty.
    
# Examples
AMFD0 = {}
AMFD1 = {AMS0: AMF0, "CFRZ": AMF1, AMS2: AMF2}
```
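A short sketch of what functions over these collection definitions might look like. The function names `valid_freqs` and `stations_at` are made up for illustration, and the plain `int` and `str` annotations stand in for `AMFreq` and `AMStation`.

```python
from typing import Dict, List


def valid_freqs(freqs: List[int]) -> List[int]:
    """Returns only the frequencies in 'freqs' that fit the AMFreq interp."""
    return [f for f in freqs if 540 <= f <= 1170 and f % 10 == 0]


def stations_at(stations: Dict[str, int], freq: int) -> List[str]:
    """Returns the call signs in 'stations' that broadcast at 'freq', sorted."""
    return sorted(name for name, f in stations.items() if f == freq)


AMFD1 = {"CFAS": 540, "CFRZ": 1030, "CGRT": 1170}

print(valid_freqs([540, 545, 1030, 2000]))  # [540, 1030]
print(stations_at(AMFD1, 1030))             # ['CFRZ']
print(stations_at({}, 1030))                # []
```

Note that both functions handle the empty collection without any special case, which is one reason to state "can be empty" in the interp.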

## Compound

**Compound types** are used to combine different kinds of data to represent something that may have multiple attributes. For example, a car which might have attributes of "Make", "Model", "Year", and "Colour". 

When we define a compound type, we are actually creating a _new Python data type in our code_ (unlike the atomic and collection examples shown above, which were just annotated versions of built-in data types).

In tabular data, where rows are records and columns are fields, a compound data type might be used to represent one record.

There are many kinds of compound types in Python but we will start with two of the most common:

1. `dataclass` - A simple compound type that has as many fields as the user needs. _Mutable_.
2. `NamedTuple` - A type similar to `dataclass` but _immutable_. Because it cannot change, it is also _hashable_ (as long as its fields are hashable), meaning it can be a key in a dictionary or used in a set.
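
The mutable/immutable distinction in the list above can be sketched with two tiny types. `MutablePoint` and `FrozenPoint` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass
from typing import NamedTuple


@dataclass
class MutablePoint:  # hypothetical example type
    x: int
    y: int


class FrozenPoint(NamedTuple):  # hypothetical example type
    x: int
    y: int


mp = MutablePoint(1, 2)
mp.x = 10  # allowed: dataclass fields can be reassigned
print(mp)  # MutablePoint(x=10, y=2)

fp = FrozenPoint(1, 2)
try:
    fp.x = 10  # not allowed: NamedTuple fields are read-only
except AttributeError:
    print("cannot change a NamedTuple field")

moved = fp._replace(x=10)  # builds a new object; fp itself is untouched
print(fp, moved)           # FrozenPoint(x=1, y=2) FrozenPoint(x=10, y=2)
```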

### Defining **Compound** types

1. Any applicable imports
2. Type creation
3. An "interp." comment to describe what the type is and how it works
4. Examples

### Compound type: `dataclass`

Make a `dataclass` like this:

```python
from dataclasses import dataclass
from typing import Optional # Needed for the 'colour' field below

@dataclass # This "@" symbol is called a "decorator". 
class LEDPixel:
    x: int
    y: int
    colour: Optional[str]
          
# Interp. Represents a single LED light in a 20x20 grid of LED lights.
# x: represents the light's location horizontally in the grid 0 <= x <= 19
# y: represents the light's location vertically in the grid 0 <= y <= 19
# colour: represents the current colour of the LED, can be either "red", "green", "blue", "white", or None (if LED is off)
```

To create new `LEDPixel` objects from your dataclass:

```python
LEDPX_1 = LEDPixel(x=0, y=0, colour="white")
LEDPX_2 = LEDPixel(8, 12, None)
```

We can now access the _attributes_ of an `LEDPixel` object through "dot notation":

```python
LEDPX_1.x
LEDPX_1.y
LEDPX_1.colour
```

**Notice: To represent an LED pixel, a `dataclass` is preferred over a `NamedTuple` because we expect the colour of an individual pixel to change over time.**

### Compound type: `NamedTuple`

Make a `NamedTuple` like this:

```python
from typing import NamedTuple # Import NamedTuple from the typing module in the standard library

class ArtWork(NamedTuple):
    title: str
    creator: str
    medium: str
    year_created: int
       
        
# Interp. Represents a single piece of artwork
# title: a str representing the artwork's title
# creator: a str representing the artwork's creator
# medium: a str representing what the artwork is made of, e.g. 'painting', 'photograph', etc.
# year_created: an int representing the date created in YYYY format
```

To create new `ArtWork` objects from the NamedTuple:

```python
AW_1 = ArtWork(title="Nude Descending a Staircase, No. 2", creator="Marcel Duchamp", medium="painting", year_created=1912)
AW_2 = ArtWork("Ignorance = Fear", "Keith Haring", "painting", 1989)
```

We also use "dot notation" to access the _attributes_ of our `ArtWork` objects:

```python
AW_2.title
AW_2.creator
AW_2.year_created
```

## Why `NamedTuple` for `ArtWork`? Why not `dataclass`?

**To represent a piece of artwork, a `NamedTuple` could be more useful than a `dataclass` since the information about a piece of artwork does not change.**

Since we do not expect the data to change, we can take advantage of `NamedTuple`'s immutability. Because a `NamedTuple` is also hashable, an `ArtWork` can be used as a dictionary key or stored in a set, which fits the idea that each piece of artwork is _unique_.

While we _could_ just use a `dataclass` to represent an `ArtWork` and not worry about it, by choosing `NamedTuple` instead our code becomes more _expressive_. "Expressive code" means that we are able to communicate our creative intentions through our _choices_ of code instead of writing out long comments explaining ourselves. It is the idea of "show, don't tell".
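
Here is a small sketch of what hashability buys us in practice. The `locations` dictionary and its gallery string are made up, and `MutableArtWork` is a hypothetical `dataclass` version included only for contrast.

```python
from dataclasses import dataclass
from typing import NamedTuple


class ArtWork(NamedTuple):
    title: str
    creator: str
    medium: str
    year_created: int


AW_2 = ArtWork("Ignorance = Fear", "Keith Haring", "painting", 1989)
AW_2_COPY = ArtWork("Ignorance = Fear", "Keith Haring", "painting", 1989)

# Equal field values give equal hashes, so a set keeps only one of them.
unique_works = {AW_2, AW_2_COPY}
print(len(unique_works))  # 1

# An ArtWork can be a dictionary key (the location string is made up).
locations = {AW_2: "Gallery 3"}
print(locations[AW_2_COPY])  # Gallery 3


@dataclass
class MutableArtWork:  # hypothetical dataclass version, for contrast
    title: str


try:
    hash(MutableArtWork("Ignorance = Fear"))
except TypeError:
    print("a default dataclass is not hashable")
```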

# Converting Raw Data into Structured Data

The primary purpose of designing a data type is to populate it with data. Where do we get the data?

* Surveys or other hand-recorded information into a text file (e.g. csv files)
* Software programs that produce text data outputs (e.g. spColumn, ETABS)
* Electronic sensors or other instruments (e.g. temperature sensors, light sensors)
* Internet "API"s or other data sources (e.g. Twitter, NASA)

Generally speaking, Python can read *text data* in whatever format it is in. How do you know if your data is text data? If you can open it in a text editor (e.g. Notepad or Jupyter) and read it, then it is text data.

There are many text data file formats (e.g. CSV, JSON, XML, INI, YAML, TOML, and other custom formats). For this exercise, we will be working with only **CSV** files.

We will discuss other formats towards the end of the course.

### An example using `ArtWork`

Let's say we have some data from a CSV file published in [The Museum of Modern Art's GitHub repository](https://github.com/MuseumofModernArt/collection).

Because the full data set is huge, there is a sample of the data set in this week's lesson, called `Artworks-sample.csv`.

Here is what the first few lines might look like after reading them with Python's built-in `csv` module:

```python

[['Title', 'Artist', 'Gender', 'Date', 'Medium', 'Dimensions', 'CreditLine', 'AccessionNumber'],
 ['City of Music, National Superior Conservatory of Music and Dance, Paris, France, View from interior courtyard', 'Christian de Portzamparc', '(Male)', '1987', 'Paint and colored pencil on print', '16 x 11 3/4" (40.6 x 29.8 cm)', 'Gift of the architect in honor of Lily Auchincloss', '1.1995'],
 ['Villa near Vienna Project, Outside Vienna, Austria, Elevation', 'Emil Hoppe', '(Male)', '1903', 'Graphite, pen, color pencil, ink, and gouache on tracing paper', '13 1/2 x 12 1/2" (34.3 x 31.8 cm)', 'Gift of Jo Carole and Ronald S. Lauder', '1.1997']]
```
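
For reference, here is one way rows like these could be produced with `csv.reader`. This is a sketch under assumptions: the text below is a stand-in trimmed to five columns so the example runs without the file, and with the real file you would pass an opened file object instead (for example `open("Artworks-sample.csv", newline="", encoding="utf-8")`, where the encoding is a guess).

```python
import csv
import io

# Stand-in for the file contents, trimmed to five columns for brevity.
raw_text = '''Title,Artist,Gender,Date,Medium
"Villa near Vienna Project, Outside Vienna, Austria, Elevation",Emil Hoppe,(Male),1903,"Graphite, pen, color pencil, ink, and gouache on tracing paper"
'''

with io.StringIO(raw_text) as csv_file:
    rows = list(csv.reader(csv_file))  # each row becomes a list of strs

header, records = rows[0], rows[1:]
print(header)           # ['Title', 'Artist', 'Gender', 'Date', 'Medium']
print(records[0][0])    # commas inside quotes stay within one field
print(len(records[0]))  # 5
```

Notice that every value, including the year, arrives as a `str`. Converting it to the right type is the job of the function we write next.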

#### Write a function to transform a `List[str]` into an `ArtWork`

The first row of data is the header and it will be useful for interpreting the raw data we have. Our `ArtWork` data type takes four pieces of information:
* Title
* Creator
* Medium
* Year created

Where can we find that information in our raw `List[str]` data?

```python
from typing import List
from cs103 import summary, start_testing, expect

def create_artwork(raw_data: List[str]) -> ArtWork:
    """
    Returns an ArtWork object that comes from the information in 'raw_data', a list of strs
    read from the Artworks.csv data set from the Museum of Modern Art's Github repo.
    """
    title = raw_data[0]
    artist = raw_data[1]
    year = int(raw_data[3])
    medium = raw_data[4]
    return ArtWork(title=title, creator=artist, year_created=year, medium=medium)


record_1 = ['City of Music, National Superior Conservatory of Music and Dance, Paris, France, View from interior courtyard', 'Christian de Portzamparc', '(Male)', '1987', 'Paint and colored pencil on print', '16 x 11 3/4" (40.6 x 29.8 cm)', 'Gift of the architect in honor of Lily Auchincloss', '1.1995']
record_2 = ['Villa near Vienna Project, Outside Vienna, Austria, Elevation', 'Emil Hoppe', '(Male)', '1903', 'Graphite, pen, color pencil, ink, and gouache on tracing paper', '13 1/2 x 12 1/2" (34.3 x 31.8 cm)', 'Gift of Jo Carole and Ronald S. Lauder', '1.1997']

start_testing()
expect(
    create_artwork(record_1),
    ArtWork(
        title='City of Music, National Superior Conservatory of Music and Dance, Paris, France, View from interior courtyard',
        creator='Christian de Portzamparc',
        year_created=1987,
        medium='Paint and colored pencil on print',
    )
)

expect(
    create_artwork(record_2),
    ArtWork(
        title='Villa near Vienna Project, Outside Vienna, Austria, Elevation',
        creator='Emil Hoppe',
        year_created=1903,
        medium='Graphite, pen, color pencil, ink, and gouache on tracing paper',
    )
)
summary()
```
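
Once a single row can be converted, converting the whole file is a loop over the rows after the header. The sketch below uses a hypothetical helper, `create_artworks`, and makes one assumption worth stating loudly: rows whose date field is not a plain whole number are skipped, because `int()` would raise a `ValueError` on them. The last row in `ROWS` is made up to show that case.

```python
from typing import List, NamedTuple


class ArtWork(NamedTuple):
    title: str
    creator: str
    medium: str
    year_created: int


def create_artworks(rows: List[List[str]]) -> List[ArtWork]:
    """
    Hypothetical helper: converts every row after the header into an ArtWork.
    Rows whose date is not a plain whole number are skipped (an assumption).
    """
    acc = []
    for raw_data in rows[1:]:  # rows[0] is the header
        date = raw_data[3].strip()
        if not date.isdecimal():  # e.g. a range of years or an empty field
            continue
        acc.append(ArtWork(title=raw_data[0], creator=raw_data[1],
                           medium=raw_data[4], year_created=int(date)))
    return acc


ROWS = [
    ['Title', 'Artist', 'Gender', 'Date', 'Medium'],
    ['Villa near Vienna Project, Outside Vienna, Austria, Elevation', 'Emil Hoppe',
     '(Male)', '1903', 'Graphite, pen, color pencil, ink, and gouache on tracing paper'],
    ['Untitled', 'Unknown', '', '1976-77', 'Ink on paper'],  # made-up row
]

artworks = create_artworks(ROWS)
print(len(artworks))             # 1
print(artworks[0].year_created)  # 1903
```

Whether to skip such rows, or to redesign `year_created` (for instance as an `Optional[int]`), is a data design decision in its own right.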

## Writing functions with compound data types

The main features of compound data types are their _attributes_. When writing functions with compound data types, you are often accessing the attributes and making a comparison or changing their attributes based on other data.

**Example with `ArtWork`**:

```python
def artwork_by_year_created(loaw: List[ArtWork], year_created: int) -> List[ArtWork]:
    """
    Returns a list of ArtWork representing the artworks in 'loaw' that were created
    in the year, 'year_created'.
    """
    acc = []
    for aw in loaw:
        if aw.year_created == year_created:
            acc.append(aw)
    return acc
```
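
A quick usage sketch, using plain `print` calls rather than the course's `expect` so that it runs on its own. The type and function are repeated here only to make the block self-contained.

```python
from typing import List, NamedTuple


class ArtWork(NamedTuple):
    title: str
    creator: str
    medium: str
    year_created: int


def artwork_by_year_created(loaw: List[ArtWork], year_created: int) -> List[ArtWork]:
    """Returns the artworks in 'loaw' that were created in 'year_created'."""
    acc = []
    for aw in loaw:
        if aw.year_created == year_created:
            acc.append(aw)
    return acc


AW_1 = ArtWork("Nude Descending a Staircase, No. 2", "Marcel Duchamp", "painting", 1912)
AW_2 = ArtWork("Ignorance = Fear", "Keith Haring", "painting", 1989)

print(artwork_by_year_created([AW_1, AW_2], 1989) == [AW_2])  # True
print(artwork_by_year_created([AW_1, AW_2], 2000))            # []
print(artwork_by_year_created([], 1912))                      # []
```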

**Example with `dataclass`**:

```python
def change_color_by_x(led_px: LEDPixel, new_colour: str, x: int) -> LEDPixel:
    """
    Returns 'led_px' with its .colour attribute updated to 'new_colour' if
    its x coordinate is a match for 'x'. Otherwise, returns 'led_px' unchanged.
    """
    if led_px.x == x:
        led_px.colour = new_colour  # mutates the LEDPixel that was passed in
    return led_px
```
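
Because a `dataclass` is mutable, this function changes the very object it was given rather than building a new one. The sketch below (with the type and function repeated so it runs on its own) shows that the returned pixel and the original are the same object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LEDPixel:
    x: int
    y: int
    colour: Optional[str]


def change_color_by_x(led_px: LEDPixel, new_colour: str, x: int) -> LEDPixel:
    """Updates the colour of 'led_px' if its x coordinate matches 'x'."""
    if led_px.x == x:
        led_px.colour = new_colour
    return led_px


px = LEDPixel(x=3, y=0, colour=None)
result = change_color_by_x(px, "red", 3)
print(result is px)  # True: the same object came back
print(px.colour)     # red: the original was changed in place

other = LEDPixel(x=5, y=0, colour="blue")
change_color_by_x(other, "red", 3)
print(other.colour)  # blue: x did not match, so nothing changed
```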

## "Designing data" is the act of modelling things that occur in real life

I find the act of designing data to be one of the most interesting and creative parts of programming. If your data is designed well, then writing the functions can become much easier. 

When designing data, look for opportunities to "capture" the qualities of the real-life phenomenon you are trying to represent and put those same qualities in your data design. For example, we captured the phenomenon that information about an artwork does not change in the design of our `ArtWork` datatype by using `NamedTuple` instead of `dataclass`. By using `NamedTuple` we also captured the quality that an artwork is unique, because a `NamedTuple` is _hashable_ and can therefore be stored in a set or used as a dictionary key.

However, you can also model too much. It is important to balance the amount of thoughtful modelling you put into your datatype with the actual functionality that you need to get out of your datatype: e.g. if I really wanted to capture the qualities of an artwork, should my `ArtWork` datatype also capture the dimensions? Maybe it should also store a URL to an image of the artwork! That would be cool.

But hold on: does your application _need_ to have this much data captured right now? If you don't actually need this information yet, then do not capture it. If in the future you do, then it is easy to add this to your application _because_ you have taken the time to thoughtfully create a datatype.
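
As a sketch of how small such a later addition can be, here is `ArtWork` with a hypothetical `image_url` field added at the end with a default value. Existing code that builds an `ArtWork` from four values keeps working unchanged. The URL is a placeholder.

```python
from typing import NamedTuple, Optional


class ArtWork(NamedTuple):
    title: str
    creator: str
    medium: str
    year_created: int
    image_url: Optional[str] = None  # hypothetical field, added later


# Code written before the new field existed still works.
AW_2 = ArtWork("Ignorance = Fear", "Keith Haring", "painting", 1989)
print(AW_2.image_url)  # None

# New code can supply the extra information when it has it.
AW_2_WITH_IMAGE = AW_2._replace(image_url="https://example.com/placeholder.jpg")
print(AW_2_WITH_IMAGE.image_url)  # https://example.com/placeholder.jpg
print(AW_2.image_url)             # None: the original is unchanged
```

Fields with defaults have to come after the fields without them, which is why the new field goes last.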

## Let's design some data!

For the following scenarios, think about what kinds of data types you could use to represent these phenomena in Python. Think about it on your own and come up with some ideas and then we will discuss in class.

| | | |
|-|-|-|
| The level of gasoline in a vehicle| An inventory of all of the plants in a greenhouse | The colour of a star
| <a href="https://www.bankofcanada.ca/rates/exchange/background-information-on-foreign-exchange-rates/?page_moved=1#Data-format">The US/CAD exchange rate</a> | The weight of a salmon | <a href="https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/">A person's name</a>
| The melody of a song | Your monthly expenses | Your daily to-do list
| Birthdays you want to remember every year | Cards played in a poker game | All of the steel W-sections in the handbook
| Loading on a simply supported beam |  | 

## This Week's Workbook

In this week's tasks, you will be asked to do two things:

1. Design a data type to represent the information presented in the task (many "right answers")
2. Design a function to convert the provided raw data into your new data type