![alt text](./pageheader_rose2_babies.jpg)

# Data Science in Medicine using Python

### Author: Dr Gusztav Belteki

## 1. Review of homework: slicing and dicing in Python

In [None]:
# List of strings

lst = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October',
       'November', 'December']

lst

In [None]:
# Indexing is zero based, starts with zero

lst[0:4]

In [None]:
# Zero can be omitted

lst[:4]

##### So - write the input to generate the output

`['March', 'April', 'May']`

In [None]:
# Indexing is half open: the beginning is included, the end is not

lst[2:5]

In [None]:
# This works well with continued indexing - no duplication

lst[2:5], lst[5:8]

`['May', 'June', 'July', 'August']`

In [None]:
lst[4:8]

In [None]:
# Negative indexing also works

lst[-8:-4]

`['July', 'August', 'September', 'October', 'November', 'December']`

In [None]:
# This needs to be 12, not 11, despite the zero base

lst[6:12]

In [None]:
# Again, if it goes to the end, it can be omitted

lst[6:]

`['January', 'March', 'May', 'July', 'September']`

In [None]:
# The third value is the stride (step), here every second item

lst[0:9:2]

In [None]:
lst[:9:2]

`['January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December']`

In [None]:
# If it goes from beginning to end, both numbers can be omitted

lst[:]

`['April', 'June', 'August', 'October']`

In [None]:
lst[3:11:2]

`['December', 'November', 'October', 'September', 'August']`

In [None]:
# Indexing backwards with negative strides
# The first index is now the last item: 11, NOT 12, because of zero-based indexing

lst[11:6:-1]

In [None]:
# The first index can be omitted

lst[:6:-1]

`['November', 'September', 'July', 'May']`

In [None]:
lst[10:3:-2]

`['December',
 'November',
 'October',
 'September',
 'August',
 'July',
 'June',
 'May',
 'April',
 'March',
 'February',
 'January']`

In [None]:
# This stops before index 0, so 'January' is missing from the result

lst[12:0:-1]

In [None]:
# Omitting both indices reverses the whole list

lst[::-1]

#### We will come back to lists later today

## 2. Once more about modules and data structures

![alt text](./importing.pdf)

Basic data structures in the global workspace:
- numbers (float, int, complex), text, etc
- Protected keywords (35) such as `and`, `or`, `in` etc.
- basic ("built-in") functions, e.g. `print()`, `len()`, `dir()` etc.

Everything else needs to be imported

Modules can be imported from:

- Standard library modules
- Third party modules
- Own modules
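
The common forms of the `import` statement can be sketched as follows. The modules used here all come from the standard library, and the alias `dt` is only an illustrative choice.

```python
# Import a whole module and access its contents with dot notation
import math

# Import a module under a shorter alias (the alias `dt` is an arbitrary choice)
import datetime as dt

# Import a single name from a module
from collections import Counter

root = math.sqrt(16)          # square root as a float
day = dt.date(2020, 11, 2)    # a date object
counts = Counter('hello')     # counts each character in the string

print(root, day.year, counts['l'])
```

Third party and own modules are imported with exactly the same syntax; the only difference is where Python finds them.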

#### Standard library modules

More than 200 modules, the list is [here](https://docs.python.org/3/py-modindex.html)

We will only use a few during this course:

- `collections`: advanced data structures
- `copy`: advanced copy operations
- `datetime`: date and time conversions and operations
- `math`: mathematical functions
- `os`: operating system handling
- `pickle`: exporting large datasets as binary files
- `re`: regular expressions
- `sys`: system specific operations and variables


#### Third party modules, packages (groups of modules) and libraries (groups of packages)

There are thousands, see [PyPI](https://pypi.org)

We have also created a package, [ventiliser](https://pypi.org/project/ventiliser/)

We will only use a few "famous" packages generally used in data science:

- `numpy`: multidimensional arrays ("lists"), numerical computation, linear algebra
- `pandas`: analysis of tabular (~Excel) data, high level
- `matplotlib`: plotting
- `scipy`: stats and mathematics
- `statsmodels`: advanced stats
- `nltk`: natural language processing
- `scikit-learn`: machine learning


We will not use these, but they are exciting deep learning frameworks:

- `tensorflow`
- `pytorch`

## 3. Types of medical data

##### 1.  Tabular data

- Obtained from sensors of medical devices, monitors etc.
- Typically retrieved as csv or other text format
- Frequently (but not always) time series data
- Usually two-dimensional but can be higher dimensional
    
    
Frequently `csv` (comma separated values) or `tab-delimited` files

##### csv

`time,HR,RR,T
 16:25:42,156,45,37.6
 16:25:43,152,47,37.6
 16:25:44,149,47,37.5`

Typically imported into Excel
    
![alt text](./data/tabular_data.jpg)    
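
Besides Excel, csv text can also be read directly in Python. The sketch below uses the standard library `csv` module on a few lines adapted from the example above; the four-column header (time, heart rate, respiratory rate, temperature) is an assumption made for illustration.

```python
import csv
import io

# Example monitor readings in csv format (column layout is an assumption)
text = """time,HR,RR,T
16:25:42,156,45,37.6
16:25:43,152,47,37.6
16:25:44,149,47,37.5
"""

# DictReader uses the first line as column names
rows = list(csv.DictReader(io.StringIO(text)))

# Every value is read as text, so numbers must be converted explicitly
heart_rates = [int(row['HR']) for row in rows]

print(rows[0])
print(heart_rates)
```

Later in this notebook `pandas.read_csv` does the same job and converts the data types automatically.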

##### 2.  Image data

- Obtained from imaging medical devices
- Usually as raster images (jpg, png, tiff etc. file formats, different compression methods)
- At least 3-dimensional, but frequently 4 or 5 dimensional
   
   
  
![alt text](./data/newborn_brain_image.jpg)
 

`124,156,182,...,56
 127,186,12,....,93
 ......
 12,18,72,.....,222`

- Pixels, dots per inch (dpi)
- Red-green-blue layers, [RGB](https://www.codementor.io/@innat_2k14/image-data-analysis-using-numpy-opencv-part-1-kfadbafx6)
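
To see why a colour image is at least three-dimensional, here is a minimal sketch that stores a tiny image as nested Python lists. The pixel values are invented for illustration; real images are normally held in `numpy` arrays rather than lists.

```python
# A hypothetical 2 x 3 pixel colour image: rows -> pixels -> [red, green, blue]
# Each intensity is an integer between 0 and 255
image = [
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],
    [[0, 0, 0], [128, 128, 128], [255, 255, 255]],
]

height = len(image)          # number of rows
width = len(image[0])        # number of pixels per row
channels = len(image[0][0])  # red, green, blue

# The green intensity of the second pixel in the first row
green = image[0][1][1]

print(height, width, channels, green)
```

A stack of such images (e.g. slices of a scan) adds a fourth dimension, and a series of stacks over time adds a fifth.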


##### 3. Medical free text

- From electronic medical notes or from medical knowledge databases (e.g., PubMed)
- Typically retrieved as text (.txt) files
- Unstructured
- Can be time series
    
![alt text](./data/medical_freetext.jpg)

Text needs to be **represented** differently for analysis

#### Bag of words

`datetime,            hypotension, had, baby, saturation, antibiotics, intravenous, ...
 18/02/20 13:45,      5,           10,  3,    12,         0,            0
 18/02/20 14:22,      0,           7,   4,    0,          3,            2`
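
A bag-of-words table like the one above can be built with the standard library alone. This is a minimal sketch: the two notes are invented for illustration, and real text would need more careful cleaning (punctuation, spelling variants) than a simple `split()`.

```python
from collections import Counter

# Hypothetical free-text notes, invented for illustration
notes = [
    'baby had hypotension and low saturation',
    'baby had intravenous antibiotics',
]

# Count how often each word occurs in each note
bags = [Counter(note.lower().split()) for note in notes]

# The vocabulary is the sorted set of all words seen in any note
vocabulary = sorted(set(word for bag in bags for word in bag))

# One row of counts per note, one column per vocabulary word
# (a Counter returns 0 for words it has not seen)
matrix = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for row in matrix:
    print(row)
```

Each note becomes a row of numbers, so free text turns into tabular data that can be analysed like any other table. Word order is lost, which is why it is called a "bag".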

## 4. More about Python lists

So far we have only seen primitive data structures (objects): numbers and text strings

In [None]:
A = 42
A

In [None]:
B = 'Hello world'
B

#### A list is a collection of any other data structures (= objects)

In [None]:
# It can be text

lst_1 = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October',
         'November', 'December']

In [None]:
# It can be numbers

lst_2 = [1, 2, 3, 4, 5, 6]

In [None]:
# It can be a lot of numbers (about 20 million here)

a = range(100, 100000000, 5)
lst_3 = list(a)

In [None]:
len(lst_3)

In [None]:
lst_3[:10]

In [None]:
lst_3[-10:]

In [None]:
# Text strings can be converted to lists

lst_4 = list('Hello world')
lst_4

In [None]:
# A list can contain mixed data types

lst_5 = ['H', 42, 55.555]
lst_5

In [None]:
# It can be composed of other lists
# This is a two-dimensional data structure (like a DataFrame or an Excel table)

lst_6 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
lst_6

1,  2,  3

4,  5,  6

7,  8,  9

In [None]:
# Indexing is "row first" and "zero based"

lst_6[1]

In [None]:
# Identifying single values by consecutive indexing

lst_6[1][2]

In [None]:
# slicing

lst_6[0:2]

In [None]:
# Double slicing does not work the way we would like it to

lst_6[0:2][0:2]
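
The reason is that the second slice is applied to the list of rows returned by the first slice, so it selects rows again, not columns. One possible workaround, sketched below, is to slice each selected row separately with a list comprehension.

```python
lst_6 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# The second slice is applied to the list of rows returned by the first one,
# so it selects rows again rather than columns
rows_only = lst_6[0:2][0:2]

# Slicing each selected row separately gives the top-left 2 x 2 block
block = [row[0:2] for row in lst_6[0:2]]

print(rows_only)
print(block)
```

`numpy` arrays and pandas DataFrames support this kind of two-dimensional selection directly, which is one reason to prefer them for tabular data.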

**Self-learning**: learn about `list` methods:

- append,
- clear,
- copy,
- count,
- extend,
- index,
- insert,
- pop,
- remove,
- reverse,
- sort

You can learn more about lists and their methods [here](https://developers.google.com/edu/python/lists)
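
As a starting point, here is a short sketch of some of these methods on a small example list. Most of them change the list in place and return `None`; `pop` and `index` are the exceptions shown here.

```python
months = ['March', 'January']

months.append('February')            # add one item to the end
months.extend(['May', 'April'])      # add all items of another list
months.insert(0, 'December')         # insert before position 0
last = months.pop()                  # remove and return the last item
months.remove('December')            # remove the first matching item
position = months.index('January')   # position of the first match
months.sort()                        # sort in place (alphabetically here)

print(months, last, position)
```

A common mistake is writing `months = months.sort()`, which replaces the list with `None`.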

Any decent data science course would start to discuss `numpy` at this stage...

... but we will leave it for later

In the meantime, for ambitious geeks, [this](https://github.com/ageron/handson-ml2/blob/master/tools_numpy.ipynb) is an excellent Jupyter notebook about numpy

## 5. Analysis of two-dimensional tabular data with pandas

In [None]:
import os
import pandas as pd

path = os.path.join('data', 'CsvLogBase_2020-11-02_134238.904_slow_Measurement.csv.zip')
data = pd.read_csv(path)
data

In [None]:
len(data)

In [None]:
data.shape

In [None]:
data.ndim

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.head(10)

##### What is the problem with these data?

- Indexed by numbers only (uninformative, it should be indexed by date and time)
- Date and time are in separate columns
- Date and time formats are not appropriate
- Column names are too long and difficult to read
- Lots of `na` values
- Half of every row is empty
- Some columns have barely any informative values
- Some values are not meaningful (e.g. tidal volume should be in mL/kg, not mL)

We will deal with all these issues

In [None]:
data.info()

In [None]:
%%time

path = os.path.join('data', 'CsvLogBase_2020-11-02_134238.904_slow_Measurement.csv.zip')
data = pd.read_csv(path)
data

In [None]:
%%time

path = os.path.join('data', 'CsvLogBase_2020-11-02_134238.904_slow_Measurement.csv.zip')
# Try to parse 'Date' and 'Time' as two separate datetime columns
data = pd.read_csv(path, parse_dates = ['Date', 'Time'])
data

In [None]:
data.info()

In [None]:
%%time

path = os.path.join('data', 'CsvLogBase_2020-11-02_134238.904_slow_Measurement.csv.zip')
# A nested list combines 'Date' and 'Time' into a single 'Date_Time' column
data = pd.read_csv(path, parse_dates = [['Date', 'Time']])
data

In [None]:
data.info()

In [None]:
data = data.set_index('Date_Time')
data
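
To my knowledge, recent pandas versions (2.2 and later) warn that the nested-list form of `parse_dates` is deprecated. An alternative that should work across versions is to combine the columns after reading, with `pd.to_datetime`. The sketch below uses a small invented table; the date and time formats are assumptions and may differ from those in the real file.

```python
import pandas as pd

# Hypothetical table with date and time stored as text in separate columns
df = pd.DataFrame({
    'Date': ['2020-11-02', '2020-11-02'],
    'Time': ['13:42:38', '13:42:39'],
    'HR': [156, 152],
})

# Join the two text columns and convert the result to datetime values
df['Date_Time'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])

# Drop the originals and use the combined column as the index
df = df.drop(columns=['Date', 'Time']).set_index('Date_Time')

print(df)
```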

In [None]:
data.columns

We could just replace the column names with

`data.columns = ['...', '...', '...']` 

but that is error-prone

In [None]:
# Welcome to list comprehensions

new_columns_1 = [item for item in data.columns]
print(new_columns_1)

In [None]:
new_columns_2 = [item[5:] for item in data.columns]
print(new_columns_2)

In [None]:
new_columns_3 = [item[5:] for item in data.columns if item.startswith('5001')]
print(new_columns_3)
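
The three comprehensions above copy, transform, and filter-then-transform a list. The sketch below repeats them on a short invented list so that the effect of each is easy to see; the column names are hypothetical and only mimic the pattern of a five-character prefix.

```python
# Hypothetical column names; the real names in the file may differ
columns = ['Time [ms]', 'Rel.Time [s]', '5001|MVe [L/min]', '5001|MVi [L/min]']

# 1. Copy every item unchanged
copied = [item for item in columns]

# 2. Transform every item: drop the first five characters
trimmed = [item[5:] for item in columns]

# 3. Filter first, then transform only the items that pass the test
selected = [item[5:] for item in columns if item.startswith('5001')]

print(trimmed)
print(selected)
```

Version 2 also damages the names that have no prefix, and version 3 leaves them out altogether, which is why they have to be added back by hand afterwards.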

In [None]:
# The expression to the right of `=` is evaluated first (before assignment)

new_columns_3 = ['Time [ms]', 'Rel.Time [s]'] + new_columns_3
new_columns_3

In [None]:
data.columns = new_columns_3
data.head(10)

In [None]:
data.info()

In [None]:
# This is called a `hack`
# `mean()` excludes na values by default, so averaging over 1-second bins
# merges the half-empty rows into complete ones

data = data.resample('1S').mean()
data.head(10)
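
To see how this works, here is a minimal sketch on an invented table in which every second is spread over two half-empty rows. The parameter names and values are hypothetical. The lowercase alias `'1s'` is used because, as far as I know, newer pandas versions warn about the uppercase `'1S'`.

```python
import numpy as np
import pandas as pd

# Hypothetical recording: each second is split over two rows,
# and each row holds only some of the parameters (the rest is NaN)
index = pd.to_datetime([
    '2020-11-02 13:42:38.100', '2020-11-02 13:42:38.600',
    '2020-11-02 13:42:39.100', '2020-11-02 13:42:39.600',
])
df = pd.DataFrame({
    'MVe': [0.30, np.nan, 0.32, np.nan],
    'PIP': [np.nan, 18.0, np.nan, 19.0],
}, index=index)

# Group the rows into 1-second bins and average each column within a bin;
# NaN values are skipped, so the two half-empty rows merge into one full row
merged = df.resample('1s').mean()

print(merged)
```

Four half-empty rows become two complete ones, indexed by the start of each second.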

In [None]:
data.info()

In [None]:
# Some columns are almost completely empty and hopeless - drop them

data = data.drop(['Tispon [s]', 'I:Espon (I-Part) [no unit]', 
                  'I:Espon (E-Part) [no unit]'], axis = 1)
data

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
# A lot of things are happening here, for example vectorized computation, broadcasting
# We will speak about them during the next session

data.isnull().sum() / len(data) * 100
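
As a brief preview, the same calculation is broken into steps below on a small invented table. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical table with some missing values
df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, 4.0],
    'B': [np.nan, np.nan, 3.0, 4.0],
})

missing = df.isnull()               # table of True/False, same shape as df
counts = missing.sum()              # True counts as 1, summed per column
percent = counts / len(df) * 100    # every column count divided by the same number

print(percent)
```

No loop is written anywhere: each operation is applied to a whole column or a whole table at once.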

Now let us save the modified data

We will export them as serialised binary data (`pickle`)

In [None]:
import pickle

with open(os.path.join('data', 'data.pickle'), 'wb') as handle:
    pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
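
Reading a pickle file back mirrors writing it. The sketch below does a full round trip with a small stand-in object and a temporary file (both chosen only for illustration), so it does not touch the data file saved above.

```python
import os
import pickle
import tempfile

# A small stand-in object; any picklable Python object works the same way
original = {'HR': [156, 152, 149], 'unit': '1/min'}

# Write to a temporary folder so that no existing file is overwritten
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'example.pickle')

with open(path, 'wb') as handle:
    pickle.dump(original, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Reading back: open in binary mode ('rb') and use pickle.load
with open(path, 'rb') as handle:
    restored = pickle.load(handle)

print(restored == original)
```

Only load pickle files from sources you trust, because unpickling can execute arbitrary code.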

We will continue from here

## 6. Homework

Subsetting and indexing pandas DataFrames

In [None]:
data

In [None]:
# Select the third row only

selection = data
selection

In [None]:
# Select the "MVe [L/min]" column only

selection = data
selection

In [None]:
# Select the "MVe [L/min]" and "MVi [L/min]" columns only

selection = data
selection

In [None]:
# Select the 'MVe [L/min]' value from the third row

selection = data
selection

In [None]:
# Select all data during the 1 minute period at 2020-11-03 13:00 

selection = data
selection

In [None]:
# Select all data between 2020-11-03 13:00 and 15:00

selection = data.loc
selection

In [None]:
# Select all data between 2020-11-03 13:00 and 15:00 and limit it to 
# "MVe [L/min]" and "MVi [L/min]" columns only

selection = data