# Working Notebook

Welcome to the _Programming with Python_ course! We will be using this notebook to go through the lecture materials, as well as to work _together_ on practical examples and exercises.

## First thing: let's familiarise ourselves with the environment

Let's talk about **Jupyter Notebooks** for a second.

In [1]:
# code cell

Text Cell

In [2]:
# third

In [3]:
# second

---

$\rightarrow$ _Adapted from_ : [**Software Carpentries: Programming with Python**]()

## Arthritis Inflammation
We are studying **inflammation in patients** who have been given a new treatment for arthritis.

There are `60` patients, who had their inflammation levels recorded for `40` days.
We want to analyze these recordings to study the effect of the new arthritis treatment.

To see how the treatment is affecting the patients in general, we would like to:

1. Process the file to extract data for each patient;
2. Calculate some statistics on each patient;
    - e.g. average inflammation over the `40` days (or `min`, `max` .. and so on)
    - e.g. average statistics per week (we will assume `40` days account for `5` weeks)
    - `...` (open to ideas)
3. Calculate some statistics on the dataset.
    - e.g. min and max inflammation registered overall in the clinical study;
    - e.g. the average inflammation per day across all patients.
    - `...` (open to ideas)
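
As a preview of step 2, here is a minimal sketch of the per-patient statistics. It assumes a "week" is `8` days long (so that `40` days split evenly into `5` weeks), and it uses the first row of our data file as the example patient. The name `weekly_averages` is only an illustration: feel free to pick your own.

```python
# Recordings for one patient: the first row of the first data file.
patient = [0, 0, 1, 3, 1, 2, 4, 7, 8, 3, 3, 3, 10, 5, 7, 4, 7, 7, 12, 18,
           6, 13, 11, 11, 7, 7, 4, 6, 8, 8, 4, 4, 5, 7, 3, 4, 2, 3, 0, 0]

DAYS_PER_WEEK = 8  # assumption: 40 days / 5 weeks


def weekly_averages(values, days_per_week=DAYS_PER_WEEK):
    """Return the average of each consecutive block of `days_per_week` values."""
    averages = []
    for start in range(0, len(values), days_per_week):
        week = values[start:start + days_per_week]
        averages.append(sum(week) / len(week))
    return averages


overall_average = sum(patient) / len(patient)
per_week = weekly_averages(patient)

print("average over all days:", overall_average)
print("lowest and highest value:", min(patient), max(patient))
print("average per week:", per_week)
```

Do not worry if some of this syntax looks unfamiliar: we will build up to it step by step.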


![3-step flowchart shows inflammation data records for patients moving to the Analysis step
where a heat map of provided data is generated moving to the Conclusion step that asks the
question, How does the medication affect patients?](
https://raw.githubusercontent.com/swcarpentry/python-novice-inflammation/gh-pages/fig/lesson-overview.svg "Lesson Overview")


### Data Format

The data sets are stored in comma-separated values (CSV) format:

- each row holds information for a single patient,
- columns represent successive days.

The first three rows of our first file look like this:
~~~
0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
~~~

Each number represents the number of inflammation bouts that a particular patient experienced on a
given day.

For example, value "6" at row 3 column 7 of the data set above means that the third
patient was experiencing inflammation six times on the seventh day of the clinical study.
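
The same lookup can be sketched in Python. Keep in mind that Python counts from `0`, so "column 7" is found at index `6`. The variable names below are just for illustration.

```python
# The third row of the data file, copied as text.
row_3 = "0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1"

values = row_3.split(",")  # list of strings, one per day
day_7 = int(values[6])     # seventh day -> index 6, converted to an integer

print(len(values), "days recorded")
print("inflammation bouts on day 7:", day_7)
```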

Our **task** is to gather as much information as possible from the dataset, and to report back to colleagues to foster future discussions.

### Let's make a plan

- Problem description (step by step) in NATURAL LANGUAGE (**strict rule**) - imagine you're explaining this to someone who doesn't know **anything** about programming.
- What do we need to start?
- Where do we start?

In [4]:
# I'll go first - let's create a dummy file to practice named dummy, two rows, ten values
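
If you would rather create the dummy file from code than by hand, a possible sketch is the following. It writes the two rows used in the next cells, with no newline after the last row.

```python
# Two rows, ten values each: one row per (dummy) patient.
rows = [
    "1,2,3,4,5,6,7,8,9,10",
    "10,9,8,7,6,5,4,3,2,1",
]

# "w" opens the file for writing; the `with` block closes it automatically.
with open("dummy.csv", "w") as dummy_file:
    dummy_file.write("\n".join(rows))
```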

In [5]:
file = open("dummy.csv")

# Read: going line by line
# read the first and save the data for patient 1
# read line 2 and save the data for patient 2

lines = file.readlines()
patient1 = lines[0]
patient2 = lines[1]

In [6]:
patient1

'1,2,3,4,5,6,7,8,9,10\n'

In [7]:
patient2

'10,9,8,7,6,5,4,3,2,1'

In [8]:
file = open("dummy.csv")
lines = file.readlines()

for i in (0, 1):
    if i == 0:
        patient1 = lines[i]
    else:
        patient2 = lines[i]

In [9]:
patient1

'1,2,3,4,5,6,7,8,9,10\n'

In [10]:
patient2

'10,9,8,7,6,5,4,3,2,1'

How to collect information from the data file

In [11]:
# read the file, line by line
# in each line, splitting the values on "comma"
# save all the values in a line into a "list" (or a Collection of values) 
# save the collection as "patient data"


In [12]:
file = open("dummy.csv")

patient_counter = 1
for line in file:
    values = line.split(",")
    if patient_counter == 1:
        patient1 = values
    elif patient_counter == 2:
        patient2 = values
    patient_counter = patient_counter + 1

In [13]:
patient2

['10', '9', '8', '7', '6', '5', '4', '3', '2', '1']

In [14]:
patient1

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10\n']

In [15]:
patient_counter

3

In [16]:
file = open("dummy.csv")

patient_counter = 1
for line in file:
    line = line.rstrip()  # removes the trailing \n (and any other whitespace) at the end of the line
    # see also: lstrip (left end only) and strip (both ends)
    values = line.split(",")
    
    numbers = []  #creating new list
    for value in values:
        number = int(value)  # casting the type of each value from string to integer
        numbers.append(number)  # adding (appending) the integer value to numbers
        
    if patient_counter == 1:
        patient1 = numbers
    elif patient_counter == 2:
        patient2 = numbers
    patient_counter = patient_counter + 1

Play with what we have so far: indexing

In [17]:
patient1

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [18]:
patient2

[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

In [19]:
# get the 2nd item in patient2
patient2[1]

9

(_fancy word_) **Slicing**

![slicing example](https://swcarpentry.github.io/python-novice-inflammation/fig/python-zero-index.svg)

Source: [Software Carpentries](https://swcarpentry.github.io/python-novice-inflammation/02-numpy/index.html)

In [20]:
patient2

[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

In [21]:
p_slice = patient2[1:len(patient2)]

In [22]:
p_slice

[9, 8, 7, 6, 5, 4, 3, 2, 1]

In [23]:
patient2[1:]

[9, 8, 7, 6, 5, 4, 3, 2, 1]

In [24]:
patient2[1:-1]

[9, 8, 7, 6, 5, 4, 3, 2]
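
A few more slicing patterns, as a small sketch on the same `patient2` list. The general form is `sequence[start:stop:step]`, where `stop` is excluded and any of the three parts can be omitted.

```python
patient2 = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

first_three = patient2[:3]    # from the beginning up to (not including) index 3
last_three = patient2[-3:]    # negative indices count from the end
middle = patient2[3:7]        # indices 3, 4, 5 and 6
every_other = patient2[::2]   # a step of 2 skips every second item

print(first_three)
print(last_three)
print(middle)
print(every_other)
```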

Now let's move to the _real_ data file: **how can we re-use the same algorithm?**

In [25]:
# INTRODUCING FUNCTIONS

def process_datafile(filename):
    # patients = [ (1, 2, 3..), (2, 3, 4,..), (), (), (), () ]
    file = open(filename)
    patients = []  # list of patient data
    for line in file:
        line = line.strip()  # remove whitespace (spaces, tabs, newline) at both ends
        values = line.split(",")  # list of strings
        # converting
        numbers = []
        for value in values:
            numbers.append(int(value))
        # we have now the list of numbers we wanted
        patients.append(tuple(numbers))
    
    return patients

_now we have 60 patients to deal with_ - how can we do that?

In [26]:
process_datafile("dummy.csv")

[(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), (10, 9, 8, 7, 6, 5, 4, 3, 2, 1)]

In [27]:
dummy_patients = process_datafile("dummy.csv")
dummy_patients

[(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), (10, 9, 8, 7, 6, 5, 4, 3, 2, 1)]

In [28]:
dummy_patients[0]

(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

In [29]:
inflammation_data = process_datafile("data/inflammation-01.csv")

In [30]:
len(inflammation_data)

60

Let's make an **assertion** about our data

In [31]:
for patient in inflammation_data:
    assert len(patient) == 40, "This patient does not have 40 values"
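
When an assertion holds, nothing happens; when it fails, Python raises an `AssertionError` carrying the message. The sketch below uses a tiny made-up list of patients (not our real data) to show what a failure looks like, and catches the error so that all problems are collected instead of stopping at the first one.

```python
# Hypothetical data: the second patient is missing a value.
patients = [(1, 2, 3), (4, 5)]
expected_values = 3

problems = []
for index, patient in enumerate(patients):
    try:
        assert len(patient) == expected_values, f"Patient {index} has {len(patient)} values"
    except AssertionError as error:
        # str(error) is the message given after the comma in the assert
        problems.append(str(error))

print(problems)
```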

---

### Here is what we've done together in live coding! well done everybody! 🙌

# To Complete

From this point on, completing the rest of the notebook is up to you! _I know you can do it!_

Here is the deal: you have the opportunity to enjoy completing the notebook, by solving the following exercises and challenges. 

I will provide a few _hints_ and _suggestions_ to help you along the way, but remember: there is **never** a single solution to a coding exercise. 

Therefore, feel free to be creative, and go for the solution you think would be the most appropriate.

The most important bit is to **enjoy** and have fun. If you're not enjoying it, you're probably doing it wrong!

If you get stuck on an exercise, don't worry: go ahead and jump to the next one. 

## On your marks, get set, gooo! 

## Adding `PatientID` field to the data file

What if we also add in a reference ID for each patient? 

$\rightarrow$ For this section of exercises, please consider this new `datafile`: `data/inflammation-02.csv`!

_Hint_: modify the `process_datafile` function to return a **Python Dictionary** this time, rather than a list!

For more, please have a look at [Dictionaries](programming_with_python/dictionaries.ipynb)

In [2]:
def process_datafile_to_dictionary(filename):
    file = open(filename)
    patients = {}  
    for line in file:
        line = line.strip()  
        values = line.split(",")  
        key = values[0]
        numbers = []
        for value in values[1:]:
            numbers.append(int(value))
        patients[key] = tuple(numbers)
    
    return patients

**Let's practice with our new data structure**

**Ex 1.2:** Print the values for the Patients with the following IDs: `2d58`, `5b04`, `c736`

_hint_: Retrieve Patient Values by Keys in the Dictionary!

In [5]:
dct = process_datafile_to_dictionary("data/inflammation-02.csv")
dct['2d58']

(0,
 1,
 0,
 0,
 4,
 3,
 3,
 5,
 5,
 4,
 5,
 8,
 7,
 10,
 13,
 3,
 7,
 13,
 15,
 18,
 8,
 15,
 15,
 16,
 11,
 14,
 12,
 4,
 10,
 10,
 4,
 3,
 4,
 5,
 5,
 3,
 3,
 2,
 2,
 1)

In [8]:
dct['5b04']

(0,
 0,
 2,
 3,
 2,
 3,
 2,
 6,
 3,
 8,
 7,
 4,
 6,
 6,
 9,
 5,
 12,
 12,
 8,
 5,
 12,
 10,
 16,
 7,
 14,
 12,
 5,
 4,
 6,
 9,
 8,
 5,
 6,
 6,
 1,
 4,
 3,
 0,
 2,
 0)

In [7]:
dct['c736']

(0,
 1,
 1,
 2,
 2,
 5,
 1,
 7,
 4,
 2,
 5,
 5,
 4,
 6,
 6,
 4,
 16,
 11,
 14,
 16,
 14,
 14,
 8,
 17,
 4,
 14,
 13,
 7,
 6,
 3,
 7,
 7,
 5,
 6,
 3,
 4,
 2,
 2,
 1,
 1)
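
Instead of one cell per ID, the lookups can also be done in a loop. The sketch below uses a small made-up dictionary (the IDs `p1`, `p2`, `p3` are hypothetical) and `dict.get`, which returns `None` for a missing key instead of raising a `KeyError`.

```python
# Hypothetical patients, in the same shape returned by our function:
# patient ID -> tuple of values.
patients = {
    "p1": (0, 1, 2),
    "p2": (3, 4, 5),
    "p3": (6, 7, 8),
}

wanted_ids = ["p1", "p3", "zz"]  # "zz" is not in the dictionary

found = {}
for patient_id in wanted_ids:
    values = patients.get(patient_id)
    if values is None:
        print(f"{patient_id}: no such patient")
    else:
        found[patient_id] = values
        print(f"{patient_id}: {values}")
```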

**Ex 1.3:** Write a function to calculate the **average** inflammation value for a given patient

_hint_: Think carefully about which values you would pass as arguments to the function!

In [9]:
def average_patient(dct, PatientID):
    tp = dct[PatientID]
    av = 0
    ln = len(tp)
    sm = 0
    for t in tp:
        sm = sm + t
    av = sm/ln
    return av

In [11]:
av = average_patient(dct, 'c736')
av

6.25
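
Remember that there is never a single solution. As one alternative sketch, the function could take the values directly (rather than the dictionary and the ID) and rely on the built-in `sum` and `len`. The standard library also offers `statistics.mean`. The function name and the sample values here are just for illustration.

```python
from statistics import mean


def average(values):
    """Average of a non-empty sequence of numbers."""
    return sum(values) / len(values)


sample_values = (2, 4, 6, 8)  # made-up values

own_average = average(sample_values)
built_in_average = mean(sample_values)

print(own_average, built_in_average)
```

Passing only the values keeps the function independent of how patients are stored, so it works equally well with a list of tuples or a dictionary.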

**Ex 1.4:** Write a function to calculate the **min** and **max** inflammation value for a given patient

_hint_: in Python, a function **can** return _any_ number of values you want! So you can `return min, max`

In [18]:
def min_max_patient(dct, PatientID):
    tps = dct[PatientID]
    # start from the first value, rather than from arbitrary limits
    mn = tps[0]
    mx = tps[0]
    for tp in tps:
        if tp > mx:
            mx = tp
        if tp < mn:
            mn = tp

    return mn, mx

In [19]:
mn_mx = min_max_patient(dct, 'c736')
mn_mx

(0, 17)
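
The pair returned by the function is a tuple, so it can be _unpacked_ into two variables in one go. Here is a short sketch that also uses the built-in `min` and `max` functions; the function name and the sample values are made up for the example.

```python
def min_max(values):
    """Return the smallest and the largest value as a pair."""
    return min(values), max(values)


# Tuple unpacking: the first returned value goes to `lowest`, the second to `highest`.
lowest, highest = min_max((3, 0, 17, 5))

print("min:", lowest, "max:", highest)
```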

## 2. Dealing with more _realistic cases_ ❌

**Ex 2.1:** Considering the data at hand, now let's try to think about _what can go wrong with data acquisition_!

The first exercise in this section is a NO-CODING exercise. 

What I'd like you to do is to think about the data that we have (**and their corresponding specification!**) and list all the things that **CAN POSSIBLY GO WRONG** while reading and storing the data. 

Some of those can be _pre-conditions_, i.e. "conditions that should hold **before** reading the data from the file"; whereas others can be _post-conditions_, that is "conditions that should hold **after** data acquisition has apparently gone well". 

The goal is to elicit all the possible conditions of errors we should account for in our code. 

You _could also_ mark pre and post conditions in your list with the following tags: `[PRE]`; `[POST]`  

Your list here: (*I'll go first*)

0. [PRE] The file name should be checked (e.g. the file should exist and be readable);
1. [PRE] Tabulation characters at both ends of the lines should be removed;
2. [PRE] The file should not contain any empty line. If so, those should be discarded.
3. [POST] There should be `60` patients.
4. [POST] Each patient should have `40` values
5. [PRE/POST] Every patient should have a unique ID (in other words, IDs should not be repeated)
6. [PRE] Inflammation data should be non-negative integer values
7. [PRE/POST] We must set min and max limits for inflammation (min: see item 6; max `< 21`) 
...
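
To make item 5 more concrete: a dictionary silently overwrites the old value when the same key is assigned twice, so a repeated ID would make one patient disappear without any error. A minimal sketch of a check, assuming the ID is the first field of each line (the function name and the sample lines are made up):

```python
def find_duplicate_ids(lines):
    """Return the IDs that appear more than once, in order of appearance."""
    seen = set()
    duplicates = []
    for line in lines:
        line = line.strip()
        if not line:  # item 2: skip empty lines
            continue
        patient_id = line.split(",")[0]
        if patient_id in seen:
            duplicates.append(patient_id)
        else:
            seen.add(patient_id)
    return duplicates


# Hypothetical lines: the ID "a1" is repeated, and one line is empty.
sample_lines = ["a1,0,1,2\n", "\n", "b2,3,4,5\n", "a1,6,7,8\n"]
duplicates = find_duplicate_ids(sample_lines)
print(duplicates)
```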

**Ex 2.2:** Please refer to `datafile`: `data/inflammation-03.csv`

I have intentionally modified the original data in order to introduce some mistakes. Some are quite sneaky! 

Try to modify the `process_datafile` function in order to correctly load patient data included in `data/inflammation-03.csv`.

The new code in the `process_datafile` function should deal with all the **\[PRE\]** conditions identified in the previous list!

_Hint_: Before doing this exercise, have a quick look at the `try/except` section in [Exceptions](programming_with_python/exceptions.ipynb).
Then, try to identify which exceptions correspond to the different error conditions, and _control_ the error cases for the different pre-conditions.
**Please** have a look at the example reported below :)

In [32]:
def process_file(filepath):
    file = open(filepath)
    patients = {}
    patient_counter = 60  # expected patients: set once, before the loop
    for line in file:
        values = line.strip().split(",")
        key = values[0]
        numbers = []
        value_counter = 40  # expected values: reset for every patient
        for value in values[1:]:
            try:
                number = int(value)
                if number > 20:
                    number = 20  # PRE#7
            except ValueError:  # if trying to cast to integer a NON-NUMERICAL Value, ValueError is raised as exception
                # PRE#6 NON-Numerical value
                number = 0  # setting value to default min to handle the case
            finally:
                number = max(0, number)  # PRE#6 accounting for Negative values
                value_counter -= 1
                numbers.append(number)

        if value_counter > 0:
            # Exception: fewer than 40 values for this patient (handling left to you)
            ...

        patients[key] = tuple(numbers)
        patient_counter -= 1

    file.close()
    if patient_counter > 0:
        # Exception: fewer than 60 patients in the file (handling left to you)
        ...
    return patients
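
For item 0 of the list (the file name), one option is to catch the `FileNotFoundError` that `open` raises when the file does not exist. This is only a sketch: the function name `read_lines` is made up, and returning an empty list is just one possible way to handle the case (re-raising with a clearer message would be another).

```python
def read_lines(filepath):
    """Return the lines of the file, or an empty list if the file is missing."""
    try:
        with open(filepath) as file:
            return file.readlines()
    except FileNotFoundError:
        print(f"File not found: {filepath}")
        return []


lines = read_lines("this-file-does-not-exist.csv")
print(len(lines), "lines read")
```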
        

Putting our helmets on (_with some testing_) ⛑

Great! So far we've accounted for _pre-conditions_ in the data, by changing our function implementation to incorporate the code to deal with error cases. Now it's time to verify **POST** conditions on the data.

Post-conditions can only be verified **after** the dataset has been loaded (which means that the whole execution of `process_datafile` terminates with NO-ERROR).

**Ex 2.3:** Implement a series of Testing functions to verify post conditions. Note: It's a good practice to have each Test covering one single post-condition at a time.

_hint_: I will show you an example of what I am expecting you to do:

In [35]:
def test_post_3_total_no_patients_is_60(filename):
    # testing function to verify the post/condition #3
    patients = process_datafile(filename)
    assert len(patients) == 60, f"ERROR: Patients were expected to be 60 but {len(patients)} were found!"
    print(f"✅ Test 'test_post_3_total_no_patients_is_60' passed on {filename}!")
    

#invoke to execute: 
test_post_3_total_no_patients_is_60("data/inflammation-01.csv")

# complete: invoke on inflammation-02 and inflammation-03 :)

✅ Test 'test_post_3_total_no_patients_is_60' passed on data/inflammation-01.csv!


In [None]:
def test_post_4_all_patients_have_40days_of_data(filename):
    #YOUR CODE HERE
    print(f"✅ Test 'test_post_4_all_patients_have_40days_of_data' passed on {filename}!")
   

In [None]:
# write other tests for any other conditions you think would be appropriate
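
As one more idea, here is a sketch of a check for the limits in item 7 (values between `0` and `20`). To keep the example self-contained it works on a dictionary already in memory rather than on a file, and both the function name and the sample data are made up.

```python
def check_values_in_range(patients, low=0, high=20):
    """Assert that every value of every patient lies between low and high."""
    for patient_id, values in patients.items():
        for value in values:
            assert low <= value <= high, (
                f"ERROR: patient {patient_id} has value {value}, "
                f"expected between {low} and {high}!"
            )
    print("✅ All values are within the expected range!")


# Hypothetical data that satisfies the condition.
sample_patients = {"a1": (0, 5, 20), "b2": (3, 3, 7)}
check_values_in_range(sample_patients)
```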

--- 

Well done for reaching this point! 🎉

**GREAT TIME FOR A BREAK NOW!** ☕️🧁🍪

---


## 3. Rethinking our Data (Abstractions): let's define our own **new type**!

Before giving a go to the next couple of Exercises, please consider taking a look at the [Classes](programming_with_python/classes.ipynb) Notebook.

The `Patient` is such an important entity in our use case scenario that it probably deserves its own **type** (or data abstraction). Therefore, when creating a variable, we want to be able to **assign** it a value of `type = Patient`. 

To do so, we'd need to **define** a _new data type_ (`Patient` type, indeed), and the way we will do this is by creating a new `Patient` class.
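
To see the mechanics without spoiling the exercise, here is a minimal sketch with a different, made-up class called `Measurement`. The constructor `__init__` stores the attributes on `self`, and a method is simply a function that lives inside the class.

```python
class Measurement:
    """A single recording: a day and the value recorded on that day."""

    # constructor: called when we write Measurement(...)
    def __init__(self, day, value):
        self.day = day
        self.value = value

    # method: can use the attributes stored on self
    def is_high(self, threshold=10):
        return self.value > threshold


reading = Measurement(day=7, value=6)

print(type(reading).__name__)    # the name of our new type
print(reading.day, reading.value)
print(reading.is_high())
```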

**Ex 3.1:** Complete the Class Stub below (or even modify it, if you think it is needed) so that the following cell executes with **no error**.


_Hint_: At this stage, identify which attributes should be included.

In [36]:
# Exemplar Code Stub

class Patient:
    # constructor
    def __init__(self, patientID, values):
        self.patient_id = ...
        self.inflammation_values = ...
        ...
        
    #method
    def convert_to_numbers(self, values):
        numerical_values = []
        for value in values:
            try:
                number = int(value)
            except ValueError:  # if trying to cast to integer a NON-NUMERICAL Value, ValueError is raised as exception
                # PRE#6 NON-Numerical value
                number = 0  # setting value to default min to handle the case
            finally:
                number = max(0, number)  # PRE#6 accounting for Negative values
                numerical_values.append(number)
        return numerical_values
    
    ...

**Ex 3.2**: Modify the `process_datafile` function so that a **list** of `Patient` objects is returned, rather than a list of tuples (or a dictionary of tuples).