In [None]:
from cs103 import *


# CPSC 103 - Systematic Program Design
# Module 07a Day 2 -- Supplemental
Ian Mitchell, with thanks to Rik Blok, Jessica Wong and Giulia Toti

Note that the discussion below is specific to the 2024W1 offering.

---

# Reminders
- Wed-Fri: Module 7 (HtDAP) tutorial.  Attendance will be taken.
- Friday: Project milestone due.
- next Monday: Module 8 (Visualization): Pre-lecture assignment.
- next Monday: Module 5 (Arbitrary-Sized): Tutorial Resubmission.
- next Wednesday: Complete [final exam conflict form](https://ubc.ca1.qualtrics.com/jfe/form/SV_3w0efA3WAvy1pMW) if you have a conflict with our final exam slot.

See your Canvas calendar (https://canvas.ubc.ca/calendar) for details.

### A Few Useful Things

- **If you are feeling distressed or overwhelmed then reach out and talk to somebody!** UBC has extensive resources listed on the student health page https://students.ubc.ca/health.  Use them!
- See [Piazza post @513](https://piazza.com/class/m0li3cza8an2th/post/513) for the full tutorial schedule for the rest of the term.  (Some, but not all, are optional.)

<div class="alert alert-warning">

### ⚠️ [Project Milestone](https://canvas.ubc.ca/courses/147818/assignments/1966018)
   
**DUE:** Friday Nov 22.

You should already be working on it.
- Watch for your [mentor assignment](https://canvas.ubc.ca/courses/147818/pages/project-mentor-assignments)
- Get help:
  - [Office hours](https://canvas.ubc.ca/courses/147818/pages/schedule-tutorials-and-office-hours).
  - Optional tutorial sessions on Thursday Nov 14 and Friday Nov 15 (today and tomorrow).
  - Optional lectures on Tuesday Nov 19 (next Tuesday).
    
You do not need to design `main` and `analyze` for the milestone.

Review the [Project Milestone grading rubric](https://canvas.ubc.ca/courses/147818/assignments/1966018) carefully so that your submission fulfills our requirements!
    
**Syzygy may crash occasionally or become very slow as the deadline approaches, due to the volume of activity.  Be sure to start early.**

</div>

<div class="alert alert-danger">
    
### ⛔ Don't use built-in functions!
    
- For the project, you are **NOT** allowed to use built-in functions like `len`, `sum`, `min`, `max`, etc.
- Demonstrate how you would use good programming practices to implement the correct code yourself; in other words, write helper function(s).
    
</div>

---


# How to Design Analysis Programs (HtDAP)

The steps in the HtDAP recipe are: 
1. Planning
<ol style="list-style-type:lower-alpha">
    <li>Plan input: Identify the information in the file your program will read.</li>
    <li>Plan output: Write a description of what your program will produce.</li>
    <li>Plan examples: Write or draw examples of what your program will produce.</li>
</ol>
2. Designing the program
<ol style="list-style-type:lower-alpha">
    <li>Design data definitions.</li>
    <li>Design read function: Design a function to read the information and store it as data in your program.</li>
    <li>Design analyze function: Design functions to analyze the data.</li>
</ol>

We have already completed all of the planning steps and designed the data definitions.  See the `module07a-day2` notebook for details.  We include the data definitions from that notebook in the cell below so that we can use them during step 2b.

---

In [None]:
# Slightly simpler than what we designed last week
# (in alignment with our answer just above).

from enum import Enum
# Shorthand way of including multiple definitions (NamedTuple and List) from a single library (typing).
from typing import NamedTuple, List
# The read function below uses csv.reader.
import csv

##################
# Data Definitions


CrimeType = Enum('CrimeType', ['BEC', 'BER', 'TV', 'TB'])
# interp. A type of crime in our file.  One of:
#   Break-and-enter Commercial (BEC),
#   Break-and-enter Residential (BER),
#   Theft of vehicle (TV), or
#   Theft of bicycle (TB).

# examples redundant for enumeration

# template based on Enumeration (4 cases)
@typecheck
def fn_for_crime_type(ct: CrimeType) -> ...:
    if ct == CrimeType.BEC:
        return ...
    elif ct == CrimeType.BER:
        return ...
    elif ct == CrimeType.TV:
        return ...
    elif ct == CrimeType.TB:
        return ...
    
CrimeData = NamedTuple('CrimeData', [('type', CrimeType),
                                     ('hour', int)]) # in range [0,23]
# interp. Information about a crime, including the type of crime and hour of the day
# in the range [0,23] that the crime occurred

CD0 = CrimeData(CrimeType.BEC, 0)
CD1 = CrimeData(CrimeType.BER, 1)
CD2 = CrimeData(CrimeType.TV, 13)
CD3 = CrimeData(CrimeType.TB, 23)

# template based on Compound (2 fields) and reference rule
@typecheck
def fn_for_crime_data(cd: CrimeData) -> ...:
    return ...(fn_for_crime_type(cd.type), cd.hour)

# List[CrimeData]
# interp. a list of CrimeData

LOCD0 = []
LOCD1 = [CD0]
LOCD2 = [CD0, CD1]
LOCD3 = [CD0, CD1, CD2, CD3]

# template based on Arbitrary-sized and reference rule
@typecheck
def fn_for_locd(locd: List[CrimeData]) -> ...:
    # description of the accumulator
    acc = ...      # type: ...
    for cd in locd:
        acc = ...(fn_for_crime_data(cd), acc)

    return ...(acc)


# Step 2b: Design `read` function

Design a function to read the information and store it as data in your program:
- You should complete the `read` function from its template. 
- Change the `Consumed` type name.
- Check what columns from the file you need.
- Check if the types need *parsing*; in other words, changing the data representation from the type that was read into the type we will store as data.
  - Remember: All values in the CSV file are represented as strings when they are read. 
- You can also add any other helper function you need (e.g. to remove rows with missing/invalid data).
- You should create at least two small CSV files for testing, so you can be sure your function is working before using it on full datasets. 
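
To see why parsing is needed, here is a small sketch that reads CSV text from an in-memory string instead of a file.  It is not part of the course template: the column names and values are invented for illustration, and the built-in `int` stands in for `parse_int` so that the sketch runs without the `cs103` library.

```python
import csv
import io

# Hypothetical CSV content; a real program would open a file instead.
SAMPLE_TEXT = "TYPE,HOUR\nTheft of Bicycle,17\n"

reader = csv.reader(io.StringIO(SAMPLE_TEXT))
next(reader)  # skip header line

first_row = next(reader)
hour_as_read = first_row[1]      # the string "17", not the integer 17
hour_parsed = int(hour_as_read)  # parsing changes the representation to int
```

Every item in `first_row` is a `str`, so any field of another type in your data definition needs a parsing step like the last line.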

<div class="alert alert-warning">
    
⚠️ You may skip irrelevant columns and rows with missing/invalid data when reading.  But **don't remove any other rows** (for example, don't filter for a specific `CrimeType` here).  That's part of the chosen analysis and should be handled later.  At this point we want to read **all** valid, relevant data.
    
</div>

## Filtering rows while reading

How do we know which rows to filter at this `read` stage, and which we should keep (even if they will be filtered later at the `analyze` phase)?  You should keep every row *unless it cannot be correctly represented in the data structure that you designed* (typically because of missing or invalid data).  We need to throw away the rows which cannot be correctly represented because otherwise we would be using incorrect data during the `analyze` phase.  Of course, an alternative to throwing away rows which cannot be correctly represented is to go back and modify your data structure so that it can correctly represent these rows.

Consider this example: What if the information in a particular column of your file is usually an integer but is sometimes missing?  If you choose to represent that column's information with a field of type `int` in your definition of `Consumed`, then you cannot correctly represent rows which do not have an integer in that column.  You then have to choose between throwing away those rows at the `read` phase or modifying your data structure so that you can correctly represent them.  An example of the latter would be to go back and modify the data definition for `Consumed` so that the type of the field corresponding to this column is `Optional[int]`.  Which approach you choose usually depends on whether you will need those rows with missing information during the `analyze` phase.
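
As a rough sketch of the `Optional[int]` approach, assume a hypothetical compound called `Reading` and a hypothetical helper `parse_count` (neither appears in this notebook), and assume that missing data shows up in the file as an empty string.  The `@typecheck` decorator is omitted so that the sketch runs without the `cs103` library.

```python
from typing import NamedTuple, Optional

# Hypothetical compound; the field names are for illustration only.
Reading = NamedTuple('Reading', [('label', str),
                                 ('count', Optional[int])])
# interp. a row from a file, where count is None if the value was missing

def parse_count(s: str) -> Optional[int]:
    """
    Returns s as an int, or None if s is empty (missing data).

    ASSUMES: s is either empty or contains only digits.
    """
    if s == "":
        return None
    else:
        return int(s)

R_PRESENT = Reading("a", parse_count("12"))  # count is 12
R_MISSING = Reading("b", parse_count(""))    # count is None
```

With this definition, rows with a missing value can be kept during `read`, and the `analyze` phase decides what to do when `count` is `None`.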

In contrast, consider the case where we are only going to analyze information from the rows in which that column has a value which is strictly positive.  While it might look appealing to throw away those rows which have a negative or zero value in that column during the `read` phase, you *should not*.  You can correctly represent those values (negative or zero) just fine with the specified field type -- whether that type is `int` or `Optional[int]`; in other words, regardless of the choice you made above -- so you should keep those rows during the `read` phase.  Then during `analyze` you will filter out the data instances with negative or zero values in this field.
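
A minimal sketch of that later filtering step might look like the following.  To keep it short, it assumes the data is a plain `List[int]` rather than a list of compounds, and the name `keep_positive` is hypothetical.

```python
from typing import List

def keep_positive(loi: List[int]) -> List[int]:
    """
    Returns only the strictly positive values from loi.
    """
    # positives contains the strictly positive values seen so far
    positives = []  # type: List[int]
    for i in loi:
        if i > 0:
            positives.append(i)
    return positives

# read keeps every value that an int can represent...
ALL_READ = [3, 0, -2, 7]
# ...and the analysis step filters later.
FOR_ANALYSIS = keep_positive(ALL_READ)
```

The zero and negative values survive `read` because `int` represents them correctly; they are only dropped once the analysis asks for strictly positive values.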

<details class="alert alert-info"><summary style="cursor:pointer; display:list-item">ℹ️ But wouldn't it be more efficient to filter while reading?</summary>

You might object that if we filter a row during the `read` phase, then we don't have to use any memory to represent that row in our list of consumed data, and the `analyze` phase won't ever need to look at that data.  That is true: it is more efficient to filter early.  But remember the goal when designing programs: *design and implement for correctness.*  Optimize only after you have correctness, and only if absolutely necessary to do the job.  

If you are reading a data set with *billions* of rows then you *might* need to optimize your reading and analysis code *if* you need your answers quickly.  But for the kind of data and analysis that you are likely to work on, optimizing by hand is not worth the effort or the risk of introducing a bug.  (I will note that the computer can often do a better job of optimizing your code with much less risk of creating a bug, so it is generally a great idea to use automated optimizations *after* you confirm that your unoptimized code is correct.  But that is a story for more advanced courses.)

</details>

Why do we keep this data at the `read` phase even if we know we are going to throw it away later?  We want to give the `analyze` phase correct data for as much information from our file as possible.  Therefore, keep any row that can be *correctly* represented even if it might be thrown away later.

---

### Step 2b: Design `read` function - Example

#### Space reserved for helper function(s)

Continue on to the next cell to design `read`.  We'll come back here later when we need to design a helper function.

In [None]:
# Design your helper function(s) here.

# Helper that we designed last Thursday.
@typecheck
def parse_crime_type(s: str) -> CrimeType:
    """
    Returns the string s as a CrimeType.
    
    Assumes s is one of the following:
        "Break and Enter Commercial"
        "Break and Enter Residential/Other"
        "Theft of Bicycle"
        "Theft of Vehicle"
    """
    # return CrimeType.BEC # stub
    # template from atomic non-distinct
    # return ...(s)
    if s == "Break and Enter Commercial":
        return CrimeType.BEC
    elif s == "Break and Enter Residential/Other":
        return CrimeType.BER
    elif s == "Theft of Bicycle":
        return CrimeType.TB
    elif s == "Theft of Vehicle":
        return CrimeType.TV
    
    
start_testing()

expect(parse_crime_type("Break and Enter Commercial"), CrimeType.BEC)
expect(parse_crime_type("Break and Enter Residential/Other"), CrimeType.BER)
expect(parse_crime_type("Theft of Bicycle"), CrimeType.TB)
expect(parse_crime_type("Theft of Vehicle"), CrimeType.TV)

summary()

# Helper that will check whether a row contains valid data.
# Uncomment the lines below when you are ready to create the helper.
#@typecheck
#def ...


#start_testing()

#expect(..., ...)

#summary()

<details class="alert alert-info"><summary style="cursor:pointer; display:list-item">ℹ️ Sample solution (For later.  Don't peek if you want to learn 🙂)</summary>

```python
@typecheck
def is_reliable(row_data: List[str]) -> bool:
    """
    Returns True if none of the pertinent data in row_data is missing,
    otherwise returns False.
    
    Missing data is indicated by a "0" in both items 4 and 5
    of the list.
    
    ASSUMES: row_data is a full row of values.  Specifically, row_data[4]
    and row_data[5] must exist.
    """
    # return True # stub
    
    # no template used
    return row_data[4] != "0" or row_data[5] != "0"


start_testing()

# Examples and tests for is_reliable
expect(is_reliable(["0", "0", "0", "0", "1", "0"]), True)
expect(is_reliable(["0", "0", "0", "0", "0", "1"]), True)
expect(is_reliable(["1", "1", "1", "1", "0", "0"]), False)

summary()
```
    
</details>

---

#### Design a function to read the information and store it as data in your program

Here is a partially designed `read` function from last Thursday, including some test cases.  Let's give it a try!

BTW: We have created some example files (see the current directory) and defined examples containing the expected results in the appendix below.  When we need it we'll [jump down](#Appendix:-Parsed-test-data) to copy that example data.

In [None]:
@typecheck
def read(filename: str) -> List[CrimeData]:
    """    
    Reads information from the specified file and returns a list of CrimeData compound instances.

    Assume that there is at least a header row.
    """
    #return []  #stub
    # Template from HtDAP
    # locd contains the data that has been read in so far
    locd = [] # type: List[CrimeData]

    with open(filename) as csvfile:
        
        reader = csv.reader(csvfile)
        next(reader) # skip header line

        for row_data in reader:
            cd = CrimeData(parse_crime_type(row_data[0]), parse_int(row_data[4]))
            locd.append(cd)
    
    return locd


start_testing()

# Examples and tests for read.

# Uses the test files in the current directory
# and the test data in the appendix below.
expect(read('testfile_empty.csv'), [])
# Tests that rows with unreliable data are filtered.
expect(read('testfile_all_missing.csv'), [])
# Tests BER crime type plus whether a row with zero value for hour
# but nonzero value for minute is *not* filtered.
expect(read('testfile_all_ber.csv'), TEST_ALL_BER)
# Tests BEC crime data plus whether a row with unreliable data is filtered.
# Rather than use the examples defined below, we could also copy the values
# directly into this cell.
expect(read('testfile_all_bec.csv'), [CrimeData(CrimeType.BEC, 6),
                                  CrimeData(CrimeType.BEC, 18)])
# Two more tests to check the other values of the enumeration.
# Copied the values in the first one, used the variable in the second
# for no particular reason.
expect(read('testfile_all_tb.csv'), [CrimeData(CrimeType.TB, 1),
                                 CrimeData(CrimeType.TB, 23),
                                 CrimeData(CrimeType.TB, 17)])
expect(read('testfile_all_tv.csv'), TEST_ALL_TV)

summary()


<details class="alert alert-info"><summary style="cursor:pointer; display:list-item">ℹ️ Sample solution (For later.  Don't peek if you want to learn 🙂)</summary>

```python
@typecheck
def read(filename: str) -> List[CrimeData]:
    """    
    Reads information from the specified file and returns a list 
    of crime data.
    
    Ignores lines where both the hour and the minute are zero.
    """
    # return []  #stub
    # Template from HtDAP
    # locd contains the result so far
    locd = [] # type: List[CrimeData]

    with open(filename) as csvfile:
        
        reader = csv.reader(csvfile)
        next(reader) # skip header line

        for row_data in reader:
            # you may not need to store all the strings in the 
            # current row, and you may need to convert some
            # of the strings to other types
            if is_reliable(row_data):
                cd = CrimeData(parse_crime_type(row_data[0]), 
                               parse_int(row_data[4]))
                locd.append(cd)
    
    return locd


start_testing()

# Examples and tests for read
expect(read('testfile_empty.csv'), [])
expect(read('testfile_all_missing.csv'), [])
expect(read('testfile_all_bec.csv'), TEST_ALL_BEC)
expect(read('testfile_all_ber.csv'), TEST_ALL_BER)

summary()
```
    
</details>

<details class="alert alert-info"><summary style="cursor:pointer; display:list-item">ℹ️ Why are we filtering at the read phase in this example?</summary>

Because we cannot correctly represent "no reliable timestamp" using the two fields (`type` and `hour`) in our `CrimeData` compound.  If we just stored the value `0` in the `hour` field, then it would be *incorrectly* interpreted later as a timestamp between midnight and 1am.  There are several possible solutions:
- Discard any row that has no reliable timestamp.  That is suitable if we will never use the information in those rows.
- Modify our definition of `CrimeData` so that we can correctly represent missing data.  Some possibilities:
  - Change the type of `hour` to be `Optional[int]`, where the value `None` represents an unreliable timestamp.
  - Add a field `minute` of type `int` to our `CrimeData` compound, and store the minute column as well.  Then the `analyze` function could check whether the data is reliable.
  - Add another field `reliable` of type `bool` to our `CrimeData` compound and create appropriate parsing logic in `read` (and likely a helper) to set this field correctly when reading a row.  Then the `analyze` function could check whether the data is reliable *without* needing to know about the unusual encoding in the hour and minute columns that the file uses.

Clearly we have chosen the first option in this case.  It is the easiest since we don't need the information in those rows.
</details>
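
For comparison, here is a rough sketch of the alternative that adds a `reliable` field.  The names `TimedCrime` and `parse_timed_crime` are hypothetical, the crime type is left out to keep the sketch short, and the built-in `int` stands in for `parse_int` so that it runs without the `cs103` library.  This is not the design used in this notebook.

```python
from typing import List, NamedTuple

# Hypothetical variant of the compound; not the definition used in this notebook.
TimedCrime = NamedTuple('TimedCrime', [('hour', int),       # in range [0,23]
                                       ('reliable', bool)])
# interp. the hour of a crime and whether its timestamp is reliable

def parse_timed_crime(row_data: List[str]) -> TimedCrime:
    """
    Returns the hour and reliability information in row_data as a TimedCrime.

    ASSUMES: row_data[4] (hour) and row_data[5] (minute) exist and contain
    only digits.
    """
    # A "0" in both the hour and minute columns marks an unreliable timestamp.
    reliable = row_data[4] != "0" or row_data[5] != "0"
    return TimedCrime(int(row_data[4]), reliable)
```

With a design like this, `read` would keep every row and the `analyze` phase would check the `reliable` field before using `hour`.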

---

# Tests

Don't use the original CSV file to test your program.  Otherwise you have to know the expected output before you've finished designing the program.  (Isn't this what the program is for? 😉)

Instead, create special test files that contain a subset of the data.  Choose the subsets so that:
1. They cover a wide range of data,
2. They cover special cases and edge cases,
3. They deliberately include some missing/invalid data that you think the program should be able to handle,
4. It's easy to determine what the expected output should be.

Small files are fine!  You might also want one or two tests with larger subsets.

Give the test files names that help you remember what they're testing.  Feel free to construct artificial data or sample from the original CSV file.
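
You can create test files by hand in an editor, but if you prefer to construct artificial data in code, a sketch along these lines may help.  The header names, the row values, and the file name are all assumptions made for illustration; the only thing taken from this notebook is that `read` uses columns 0 (type), 4 (hour) and 5 (minute).  The file is written to a temporary directory so that the sketch does not clutter your project folder.

```python
import csv
import os
import tempfile

# Hypothetical header and rows; only columns 0, 4 and 5 matter to read.
HEADER = ["TYPE", "YEAR", "MONTH", "DAY", "HOUR", "MINUTE"]
ROWS = [["Theft of Bicycle", "2003", "5", "12", "17", "30"],
        ["Theft of Bicycle", "2003", "6", "1", "0", "0"]]  # unreliable timestamp

def write_test_file(filename: str) -> None:
    """
    Writes HEADER followed by ROWS to filename in CSV format.
    """
    with open(filename, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(HEADER)
        writer.writerows(ROWS)

TEST_PATH = os.path.join(tempfile.gettempdir(), "testfile_tb_example.csv")
write_test_file(TEST_PATH)
```

Because the second row has `"0"` in both the hour and minute columns, the expected result for this file is easy to work out by hand: a single bicycle theft at hour 17.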

<div class="alert alert-info">
   
ℹ️ Don't forget to submit all your test files with your program!
    
</div>



## Using real data for testing

Another common source of examples / test cases (in addition to creating dummy data files) is to extract a few rows from your real data set into a test data set.

There are a number of ways to perform that extraction.  Here is a procedure that you can do in Jupyter / Syzygy:
1. Make a copy of your original data set file.  You will only modify the copy so that (a) you still have your original data set and (b) if something goes wrong, you can always start over.
2. Rename that copy to make it clear it is a test data set; for example, `testfile_real_rows.csv`.
3. Open the copy in Jupyter's editor: Right click the file and select "Open with > editor".
4. Now you can delete rows by selecting the row(s) and pressing the backspace or delete keys.  Make sure that you delete entire rows. If you accidentally delete part of a row, delete the rest of it.
5. Keep only a few of the rows.  For convenience that is often the first and/or last few rows.  Delete the rest.  Save the test file and close the editor.
6. Reopen the test file in the CSV viewer, and then construct the correct output data from those rows by hand.  The effort required for that process will dictate how many rows you keep in your test data set.

Just to be clear: You are not *required* to include a subset of the real data among your test cases.  But it is a good way to make sure that you haven't accidentally designed a `read` function which handles dummy data but not real data.
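
If you would rather not delete rows by hand, the extraction can also be sketched in code.  This is an alternative to the procedure above, not a course requirement, and the function name `copy_first_rows` is hypothetical.

```python
import csv

def copy_first_rows(source: str, destination: str, n: int) -> None:
    """
    Copies the header row and up to the first n data rows of the CSV file
    source into the CSV file destination.

    ASSUMES: source has at least a header row and n >= 0.
    """
    with open(source, newline="") as infile:
        with open(destination, "w", newline="") as outfile:
            reader = csv.reader(infile)
            writer = csv.writer(outfile)
            writer.writerow(next(reader))  # copy header line
            # copied is the number of data rows written so far
            copied = 0
            for row_data in reader:
                if copied < n:
                    writer.writerow(row_data)
                    copied = copied + 1
```

The original file is only opened for reading, so it is never modified.  You still need to work out the expected output for the copied rows by hand.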

---

# All Done (?)

Have we handled all of the possible weirdness in our data set, `read` and helper functions?

Hint: What type does `parse_int` return?

<details class="alert alert-info"><summary style="cursor:pointer; display:list-item">ℹ️ Sample solution (For later.  Don't peek if you want to learn 🙂)</summary>

If the string loaded from column 4 of a row cannot be converted into an integer, then `parse_int` will return `None`.  What happens then?

<details class="alert alert-info"><summary style="cursor:pointer; display:list-item">ℹ️ How do we fix that? (For later.  Don't peek if you want to learn 🙂)</summary>

There are several potential approaches to resolving this issue, and the right one will depend on the data set and how you plan to interpret it.
- If you are confident that every string loaded from column 4 of your data set will correctly convert into an integer, you could make that part of your assumptions.  (Which you then need to document.)
- If you want to treat rows with a missing or invalid string in column 4 as bad or missing data, then you should write a helper to identify those rows and include that helper in your `read` function's logic, in the same way that you used `is_reliable`.  (What template would you use?)
- If you want to treat rows with a missing or invalid string in column 4 as if they had a value of `0` for the `hour` field (it is not unusual for people to omit 0 values to make a table more readable), then you should write a helper to parse a `None` value into a `0` value. (What template would you use?) (How would this choice interact with your logic for `is_reliable`?)
- If you want to treat rows with a missing or invalid string in column 4 as a different category of crimes from those with a numeric value in this column, then you should modify your definition of `CrimeData` to use something other than `int` (restricted to the range [0,23]) as the type for the `hour` field.  (What type(s) could you use?)

</details>

</details>
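
As one possible sketch of the helper-based approach (treating a missing or invalid hour as bad data), consider the code below.  The names `try_parse_int` and `has_valid_hour` are hypothetical.  `try_parse_int` is only a stand-in for `parse_int` so that the sketch runs without the `cs103` library; it assumes the same behaviour described above, namely returning `None` when the string cannot be converted.

```python
from typing import List, Optional

def try_parse_int(s: str) -> Optional[int]:
    """
    Stand-in for parse_int so that this sketch runs on its own.
    Returns s as an int, or None if s cannot be converted to an integer.
    """
    try:
        return int(s)
    except ValueError:
        return None

def has_valid_hour(row_data: List[str]) -> bool:
    """
    Returns True if row_data[4] is an integer in the range [0,23],
    otherwise returns False.

    ASSUMES: row_data[4] exists.
    """
    hour = try_parse_int(row_data[4])
    return hour is not None and 0 <= hour <= 23
```

In `read`, a check like this could be combined with `is_reliable` so that only rows passing both are turned into `CrimeData`.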

---

# Appendix: Parsed test data

Here are the test files parsed into `List[CrimeData]`.  I've included this info here so we can quickly add it as needed to our examples.


In [None]:
# from 'testfile_all_bec.csv'
TEST_ALL_BEC = [CrimeData(CrimeType.BEC, 6),
                CrimeData(CrimeType.BEC, 18)] # missing data removed

# from 'testfile_all_ber.csv'
TEST_ALL_BER = [CrimeData(CrimeType.BER, 21),
                CrimeData(CrimeType.BER, 17),
                CrimeData(CrimeType.BER, 0)] # reliable because nonzero minute field

# from 'testfile_all_tb.csv'
TEST_ALL_TB = [CrimeData(CrimeType.TB, 1),
               CrimeData(CrimeType.TB, 23),
               CrimeData(CrimeType.TB, 17)]

# from 'testfile_all_tv.csv'
TEST_ALL_TV = [CrimeData(CrimeType.TV, 23),
               CrimeData(CrimeType.TV, 14),
               CrimeData(CrimeType.TV, 21)]

# from 'testfile_all_missing.csv'.
# (But none of these rows should have been kept because
#  the minute column in each row was also 0, indicating
#  an unreliable timestamp.)
TEST_ALL_MISSING = [CrimeData(CrimeType.BEC, 0),
                    CrimeData(CrimeType.BER, 0),
                    CrimeData(CrimeType.TB, 0),
                    CrimeData(CrimeType.TV, 0)]

# from 'testfile_empty.csv'
TEST_EMPTY = []


We'll run the cell here to create these instances and/or copy the data to help prepare [our examples above](#Step-2b:-Design-read-function---Example).

---