![alt text](./image_files/pageheader_rose2_babies.jpg)

# Data Science in Medicine using Python

### Author: Dr Gusztav Belteki

## 1. Review of homework: slicing and dicing in Python

In [1]:
a = 42
a

42

In [2]:
b = 'Hello'
b

'Hello'

In [3]:
c = [3, 4, 5]
c

[3, 4, 5]

In [4]:
# List of strings

lst = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October',
       'November', 'December']

lst

['January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December']

In [5]:
# Indexing is zero based, starts with zero

lst[0:4]

['January', 'February', 'March', 'April']

In [6]:
# Zero can be omitted

lst[:6]

['January', 'February', 'March', 'April', 'May', 'June']

##### So - write the input to generate the output

`['March', 'April', 'May']`

In [7]:
# Indexing is half open: the beginning is included, the end is not

lst[2:5]

['March', 'April', 'May']

In [8]:
# This works well with continued indexing - no duplication

lst[2:5], lst[5:8]

(['March', 'April', 'May'], ['June', 'July', 'August'])
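The half-open convention has two practical consequences, sketched here with a shortened list (the name `months` is only for illustration): the length of a slice is `stop - start`, and splitting a list at any index neither loses nor duplicates an item.

```python
# A shortened month list, used only for illustration
months = ['January', 'February', 'March', 'April', 'May', 'June']

# The length of a slice is simply stop - start
first_part = months[2:5]
print(len(first_part))            # 3

# Splitting at any index k gives two pieces that rebuild the original list
k = 4
rebuilt = months[:k] + months[k:]
print(rebuilt == months)          # True
```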

`['May', 'June', 'July', 'August']`

In [9]:
lst[4:8]

['May', 'June', 'July', 'August']

In [10]:
# Negative indexing also works

lst[-8:-4]

['May', 'June', 'July', 'August']
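A negative index counts from the end of the list, so `-8` points to position `len(lst) - 8`. The sketch below checks this equivalence; the variable names are illustrative only.

```python
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
n = len(months)

# -8 is the same position as n - 8 = 4, and -4 the same as n - 4 = 8
negative = months[-8:-4]
positive = months[n - 8:n - 4]
print(negative == positive)       # True

# -1 is a convenient way to reach the last item
last = months[-1]
print(last)                       # December
```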

`['July', 'August', 'September', 'October', 'November', 'December']`

In [11]:
# This needs to be 12, not 11, despite the zero base

lst[6:12]

['July', 'August', 'September', 'October', 'November', 'December']

In [12]:
# Again, if it goes to the end, it can be omitted

lst[6:]

['July', 'August', 'September', 'October', 'November', 'December']

`['January', 'March', 'May', 'July', 'September']`

In [13]:
# The third number is the stride (step); here every second item is taken

lst[0:9:2]

['January', 'March', 'May', 'July', 'September']

In [14]:
lst[:9:2]

['January', 'March', 'May', 'July', 'September']
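The three numbers in the square brackets follow the same start, stop, step logic as the built-in `range()`. As a small sketch (names are illustrative), the slice can also be stored in a `slice` object and reused:

```python
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

# A slice can be stored in a variable and reused
every_second = slice(0, 9, 2)
sliced = months[every_second]

# range() with the same three numbers visits the same positions
positions = list(range(0, 9, 2))
picked = [months[i] for i in positions]

print(positions)                  # [0, 2, 4, 6, 8]
print(sliced == picked)           # True
```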

`['January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December']`

In [15]:
# If it goes from beginning to end, both numbers can be omitted

lst[:]

['January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December']

`['April', 'June', 'August', 'October']`

In [16]:
lst[3:11:2]

['April', 'June', 'August', 'October']

`['December', 'November', 'October', 'September', 'August']`

In [17]:
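# Indexing backwards with negative strides
# The first index is now the last item: it is 11, NOT 12, because of zero based indexing
# Note: a stride of -2 skips every second month; the target above needs a stride of -1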

lst[11:6:-2]

['December', 'October', 'August']

In [18]:
# It can be omitted

lst[:6:-1]

['December', 'November', 'October', 'September', 'August']

`['November', 'September', 'July', 'May']`

In [19]:
lst[10:3:-2]

['November', 'September', 'July', 'May']

`['December',
 'November',
 'October',
 'September',
 'August',
 'July',
 'June',
 'May',
 'April',
 'March',
 'February',
 'January']`

In [20]:
lst[11:0:-1]

['December',
 'November',
 'October',
 'September',
 'August',
 'July',
 'June',
 'May',
 'April',
 'March',
 'February']
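`lst[11:0:-1]` misses `'January'` because the stop index (0) is excluded. Writing `-1` as the stop does not help, since `-1` means the last item. A minimal sketch of the usual fix, which is to omit the stop index altogether:

```python
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

# The stop index 0 is excluded, so January is missing
without_january = months[11:0:-1]

# -1 as stop means "the last item", so nothing is selected
empty = months[11:-1:-1]

# Omitting start and stop reverses the whole list
full_reverse = months[::-1]

print(len(without_january))       # 11
print(empty)                      # []
print(full_reverse[-1])           # January
```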

In [21]:
lst[2:10:1]

['March', 'April', 'May', 'June', 'July', 'August', 'September', 'October']

#### We will come back to lists later today

## 2. Once more about modules and data structures

![alt text](./image_files/importing.pdf)

Basic data structures in the global workspace:
- Numbers (float, int, complex), text, etc.
- Protected keywords (35) such as `and`, `or`, `in` etc.
- Basic ("built-in") functions, e.g. `print()`, `len()`, `dir()` etc.

Everything else needs to be imported

Imports can come from:

- Standard library modules
- Third party modules
- Own modules

#### Standard library modules

More than 200 modules, the list is [here](https://docs.python.org/3/py-modindex.html)

We will only use a few during this course:

- `collections`: advanced data structures
- `copy`: advanced copy operations
- `datetime`: date and time conversions and operations
- `math`: mathematical functions
- `os`: operating system handling
- `pickle`: exporting large datasets as binary files
- `re`: regular expressions
- `sys`: system specific operations and variables
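A short sketch of how three of these modules are imported and used. The file name `example.csv` and the variable names are made up for illustration.

```python
import math
import os
from datetime import datetime, timedelta

# math: mathematical functions
root = math.sqrt(16)
print(root)                       # 4.0

# os: build a file path that works on any operating system
# ('example.csv' is a hypothetical file name)
path = os.path.join('data', 'example.csv')

# datetime: date and time arithmetic
start = datetime(2020, 11, 2, 13, 42, 39)
later = start + timedelta(hours=1)
print(later)                      # 2020-11-02 14:42:39
```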


In [24]:
import pandas as pd

In [25]:
pd.read_csv

<function pandas.io.parsers.read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal: str = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)>

#### Third party modules, packages (groups of modules) and libraries (groups of packages)

There are thousands, see [PyPI](https://pypi.org)

We have also created a package, [ventiliser](https://pypi.org/project/ventiliser/)

We will only use a few "famous" packages generally used in data science:

- `numpy`: multidimensional arrays ("lists"), numerical computation, linear algebra
- `pandas`: analysis of tabular (~Excel) data, high level
- `matplotlib`: plotting
- `scipy`: stats and mathematics
- `statsmodels`: advanced stats
- `nltk`: natural language processing
- `scikit-learn`: machine learning


We will not use these, but they are exciting deep learning frameworks:

- `tensorflow`
- `pytorch`

## 3. Types of medical data

##### 1. Tabular data

- Obtained from sensors of medical devices, monitors etc.
- Typically retrieved as csv or other text format
- Frequently (but not always) time series data
- Usually two-dimensional but can be higher dimensional
    
    
Frequently `csv` (comma separated values) or `tab-delimited` files

##### csv

`time,HR,Sat,RR,T
 16:25:42,156,45,37.6
 16:25:43,152,47,37.6
 16:25:44,149,47,37.5`

Typically imported into Excel
    
![alt text](./image_files/tabular_data.jpg)    
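Instead of Excel, a csv file can also be read into Python as a table. The sketch below reads csv text from memory rather than from a file; the column names (`time`, `HR`, `RR`, `T`) are assumptions made for illustration and do not come from a real device export.

```python
import io
import pandas as pd

# Illustrative vital-sign values in csv form; the column names are assumed
csv_text = """time,HR,RR,T
16:25:42,156,45,37.6
16:25:43,152,47,37.6
16:25:44,149,47,37.5
"""

# io.StringIO lets read_csv treat the text as if it were a file
vitals = pd.read_csv(io.StringIO(csv_text))

print(vitals.shape)               # (3, 4)
print(vitals['HR'].max())         # 156
```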

##### 2.  Image data

- Obtained from imaging medical devices
- Usually as raster images (jpg, png, tiff etc. file formats, with different compression methods)
- At least 3-dimensional, but frequently 4 or 5 dimensional
   
   
  
![alt text](./image_files/newborn_brain_image.jpg)
 

`124,156,182,...,56
 127,186,12,....,93
 ......
 12,18,72,.....,222`

- pixel, dot-per-inch (dpi)
- Red-green-blue layers, [RGB](https://www.codementor.io/@innat_2k14/image-data-analysis-using-numpy-opencv-part-1-kfadbafx6)
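To see why a colour image is at least 3-dimensional, here is a minimal sketch using `numpy` (introduced properly later in the course). The tiny 4 x 6 pixel image and all names are made up for illustration.

```python
import numpy as np

# A tiny hypothetical colour image: height x width x 3 colour layers (R, G, B)
image = np.zeros((4, 6, 3), dtype=np.uint8)
image[:, :, 0] = 255              # fill the red layer with the maximum value

print(image.ndim)                 # 3
print(image.shape)                # (4, 6, 3)

# One colour layer on its own is a two-dimensional table of numbers
red_layer = image[:, :, 0]
print(red_layer.shape)            # (4, 6)

# Stacking 10 such images (e.g. slices or time points) adds a fourth dimension
stack = np.stack([image] * 10)
print(stack.shape)                # (10, 4, 6, 3)
```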


##### 3. Medical free text

- From electronic medical notes or from medical knowledge databases (e.g., PubMed)
- Typically retrieved as text (.txt) files
- Unstructured
- Can be time series
    
![alt text](./data/medical_freetext.jpg)

Text needs to be **represented** differently for analysis

#### Bag of words

`datetime,            hypotension, had, baby, saturation, antibiotics, intravenous, ...
 18/02/20 13:45,      5,           10,  3,    12,         0,            0
 18/02/20 14:22,      0,           7,   4,    0,          3,            2`
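A bag-of-words table like the one above can be sketched with `Counter` from the standard library `collections` module. The two note fragments and the vocabulary are invented for illustration; real notes would need more careful cleaning.

```python
from collections import Counter

# Two invented note fragments, for illustration only
notes = [
    "baby had hypotension baby had antibiotics",
    "saturation low baby had intravenous antibiotics",
]

# The words we want to count (the columns of the table)
vocabulary = ['hypotension', 'had', 'baby', 'saturation',
              'antibiotics', 'intravenous']

rows = []
for note in notes:
    counts = Counter(note.lower().split())   # word -> number of occurrences
    rows.append([counts[word] for word in vocabulary])

print(rows)   # [[1, 2, 2, 0, 1, 0], [0, 1, 1, 1, 1, 1]]
```

A `Counter` returns 0 for words it has not seen, so no special handling is needed for missing words.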

## 4. More about python lists

So far we have only seen primitive data structures (objects): numbers and text strings

In [26]:
A = 42
A

42

In [27]:
B = 'Hello world'
B

'Hello world'

#### A list is a collection of any other data structures (= objects)

In [28]:
# It can be text

lst_1 = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October',
         'November', 'December']
lst_1

['January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December']

In [29]:
# It can be numbers

lst_2 = [1, 2, 3, 4, 5, 6, ]
lst_2

[1, 2, 3, 4, 5, 6]

In [30]:
# It can be a lot of numbers

a = range(100, 100000000, 5)
lst_3 = list(a)

In [31]:
len(lst_3)

19999980

In [32]:
lst_3[:10]

[100, 105, 110, 115, 120, 125, 130, 135, 140, 145]

In [33]:
lst_3[-10:]

[99999950,
 99999955,
 99999960,
 99999965,
 99999970,
 99999975,
 99999980,
 99999985,
 99999990,
 99999995]

In [34]:
# Text strings can be converted to lists

lst_4 = list('Hello world')
lst_4

['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']

In [35]:
# Lists can contain mixed data types

lst_5 = ['H', 42, 55.555]
lst_5

['H', 42, 55.555]

In [40]:
len(lst_5)

3

In [41]:
# It can be composed of other lists
# This is a two dimensional data structure (like DataFrame or Excel table)

lst_6 = [  [1, 2, 3], [4, 5, 6], [7, 8, 9]   ]
lst_6

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

1,  2,  3

4,  5,  6

7,  8,  9

In [42]:
# Indexing is "row first" and "zero based"

lst_6[1]

[4, 5, 6]

In [43]:
# Identifying single values by consecutive indexing

lst_6[1][2]

6

In [44]:
# slicing

lst_6[0:2]

[[1, 2, 3], [4, 5, 6]]

In [45]:
# Double slicing does not work the way we would like: the second slice acts on the rows again, not on the columns

lst_6[0:2][0:2]

[[1, 2, 3], [4, 5, 6]]
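To slice both rows and columns of a list of lists, each selected row has to be sliced separately. A minimal sketch using a list comprehension (the variable names are illustrative):

```python
table = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# The second slice is applied to the list of rows again, not to the columns
same_rows = table[0:2][0:2]
print(same_rows)                  # [[1, 2, 3], [4, 5, 6]]

# Slice the rows first, then slice every selected row
top_left = [row[0:2] for row in table[0:2]]
print(top_left)                   # [[1, 2], [4, 5]]
```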

In [46]:
lst_6.sort(reverse = True)
lst_6

[[7, 8, 9], [4, 5, 6], [1, 2, 3]]

**Self-learning**: learn about `list` methods:

- append,
- clear,
- copy,
- count,
- extend,
- index,
- insert,
- pop,
- remove,
- reverse,
- sort
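As a starting point for self-learning, this sketch calls each of these methods once on a small made-up list. The comments show the state of the list after each step.

```python
numbers = [3, 1, 2]

numbers.append(5)            # add one item to the end: [3, 1, 2, 5]
numbers.extend([4, 1])       # add several items: [3, 1, 2, 5, 4, 1]
numbers.insert(0, 9)         # insert 9 at position 0: [9, 3, 1, 2, 5, 4, 1]

ones = numbers.count(1)      # how many times 1 occurs: 2
where = numbers.index(5)     # position of the first 5: 4

last = numbers.pop()         # remove and return the last item: 1
numbers.remove(9)            # remove the first 9: [3, 1, 2, 5, 4]

numbers.sort()               # sort in place: [1, 2, 3, 4, 5]
numbers.reverse()            # reverse in place: [5, 4, 3, 2, 1]

backup = numbers.copy()      # an independent (shallow) copy
numbers.clear()              # empty the original list: []

print(backup, numbers)       # [5, 4, 3, 2, 1] []
```

Most of these methods change the list in place and return `None`; only `count`, `index`, `pop` and `copy` return a value.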

You can learn more about lists and their methods [here](https://developers.google.com/edu/python/lists)

Any decent data science course would start to discuss `numpy` at this stage...

... but we will leave it for later

In the meantime for ambitious geeks [this](https://github.com/ageron/handson-ml2/blob/master/tools_numpy.ipynb) is an excellent Jupyter notebook about numpy

## 5. Analysis of two-dimensional tabular data with pandas

In [47]:
pd.read_csv?

In [48]:
import os
import pandas as pd

path = os.path.join('data', 'CsvLogBase_2020-11-02_134238.904_slow_Measurement.csv.zip')
data = pd.read_csv(path)
data

Unnamed: 0,Time [ms],Date,Time,Rel.Time [s],5001|MVe [L/min],5001|MVi [L/min],5001|Cdyn [L/bar],5001|R [mbar/L/s],5001|MVespon [L/min],5001|Rpat [mbar/L/s],...,5001|TC [s],5001|TCe [s],5001|C20/Cdyn [no unit],5001|VTe [mL],5001|VTi [mL],5001|EIP [mbar],5001|MVleak [L/min],5001|Tispon [s],5001|I:Espon (I-Part) [no unit],5001|I:Espon (E-Part) [no unit]
0,1604324559904,2020-11-02,13:42:39.904,0,0.20,0.21,0.21,146.0,0.0,128.0,...,,,,,,,,,,
1,1604324560029,2020-11-02,13:42:40.029,0,,,,,,,...,0.03,0.11,0.55,3.4,3.4,21.0,0.00,,,
2,1604324560951,2020-11-02,13:42:40.951,1,0.20,0.21,0.22,146.0,0.0,128.0,...,,,,,,,,,,
3,1604324561060,2020-11-02,13:42:41.060,1,,,,,,,...,0.03,0.21,0.55,8.2,3.6,22.0,0.00,,,
4,1604324561935,2020-11-02,13:42:41.935,2,0.20,0.21,0.26,154.0,0.0,136.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
689431,1604669233906,2020-11-06,13:27:13.906,344674,0.22,0.22,0.21,164.0,0.0,150.0,...,,,,,,,,,,
689432,1604669234031,2020-11-06,13:27:14.031,344674,,,,,,,...,0.04,0.11,0.75,3.1,2.6,18.0,0.02,,,
689433,1604669234906,2020-11-06,13:27:14.906,344675,0.22,0.22,0.21,164.0,0.0,150.0,...,,,,,,,,,,
689434,1604669235046,2020-11-06,13:27:15.046,344675,,,,,,,...,0.01,0.15,0.75,4.1,2.6,19.0,0.01,,,


In [49]:
len(data)

689436

In [50]:
data.shape

(689436, 47)

In [51]:
data.ndim

2

In [52]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 689436 entries, 0 to 689435
Data columns (total 47 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   Time [ms]                        689436 non-null  int64  
 1   Date                             689436 non-null  object 
 2   Time                             689436 non-null  object 
 3   Rel.Time [s]                     689436 non-null  int64  
 4   5001|MVe [L/min]                 344633 non-null  float64
 5   5001|MVi [L/min]                 344633 non-null  float64
 6   5001|Cdyn [L/bar]                344073 non-null  float64
 7   5001|R [mbar/L/s]                342099 non-null  float64
 8   5001|MVespon [L/min]             344633 non-null  float64
 9   5001|Rpat [mbar/L/s]             341398 non-null  float64
 10  5001|MVemand [L/min]             344633 non-null  float64
 11  5001|FlowDev [L/min]             344677 non-null  float64
 12  50

In [53]:
data.describe()

Unnamed: 0,Time [ms],Rel.Time [s],5001|MVe [L/min],5001|MVi [L/min],5001|Cdyn [L/bar],5001|R [mbar/L/s],5001|MVespon [L/min],5001|Rpat [mbar/L/s],5001|MVemand [L/min],5001|FlowDev [L/min],...,5001|TC [s],5001|TCe [s],5001|C20/Cdyn [no unit],5001|VTe [mL],5001|VTi [mL],5001|EIP [mbar],5001|MVleak [L/min],5001|Tispon [s],5001|I:Espon (I-Part) [no unit],5001|I:Espon (E-Part) [no unit]
count,689436.0,689436.0,344633.0,344633.0,344073.0,342099.0,344633.0,341398.0,344633.0,344677.0,...,344051.0,344566.0,339993.0,344566.0,344570.0,344676.0,344632.0,10015.0,9071.0,9071.0
mean,1604497000000.0,172337.580989,0.223555,0.264366,0.245819,274.040235,7.8e-05,257.37503,0.223284,7.125048,...,0.06751,0.144449,0.717213,3.71869,4.341298,23.860487,0.045116,0.047129,1.018212,7.191357
std,99499490.0,99499.484827,0.029445,0.142057,0.171054,165.355264,0.00093,167.911146,0.029531,0.722707,...,0.105278,0.049657,0.359153,0.779251,1.049905,4.350903,0.318332,0.008108,0.224628,1.889433
min,1604325000000.0,0.0,0.0,0.0,0.0,29.4,0.0,15.8,0.0,6.4,...,0.0,0.01,0.23,0.6,0.6,1.8,0.0,0.04,1.0,1.0
25%,1604411000000.0,86168.0,0.21,0.23,0.21,162.0,0.0,143.0,0.21,6.9,...,0.04,0.11,0.57,3.3,3.8,21.0,0.01,0.04,1.0,7.4
50%,1604497000000.0,172337.0,0.23,0.26,0.24,213.0,0.0,195.0,0.23,7.0,...,0.05,0.12,0.64,3.8,4.3,22.0,0.02,0.05,1.0,8.0
75%,1604583000000.0,258506.25,0.24,0.28,0.27,328.0,0.0,311.0,0.24,7.2,...,0.08,0.16,0.74,4.0,4.8,25.0,0.04,0.05,1.0,8.1
max,1604669000000.0,344676.0,0.49,18.6,57.4,1000.0,0.05,1000.0,0.49,31.2,...,28.9,0.77,4.97,13.8,143.0,53.0,27.6,0.09,10.4,9.5


In [54]:
data.head(10)

Unnamed: 0,Time [ms],Date,Time,Rel.Time [s],5001|MVe [L/min],5001|MVi [L/min],5001|Cdyn [L/bar],5001|R [mbar/L/s],5001|MVespon [L/min],5001|Rpat [mbar/L/s],...,5001|TC [s],5001|TCe [s],5001|C20/Cdyn [no unit],5001|VTe [mL],5001|VTi [mL],5001|EIP [mbar],5001|MVleak [L/min],5001|Tispon [s],5001|I:Espon (I-Part) [no unit],5001|I:Espon (E-Part) [no unit]
0,1604324559904,2020-11-02,13:42:39.904,0,0.2,0.21,0.21,146.0,0.0,128.0,...,,,,,,,,,,
1,1604324560029,2020-11-02,13:42:40.029,0,,,,,,,...,0.03,0.11,0.55,3.4,3.4,21.0,0.0,,,
2,1604324560951,2020-11-02,13:42:40.951,1,0.2,0.21,0.22,146.0,0.0,128.0,...,,,,,,,,,,
3,1604324561060,2020-11-02,13:42:41.060,1,,,,,,,...,0.03,0.21,0.55,8.2,3.6,22.0,0.0,,,
4,1604324561935,2020-11-02,13:42:41.935,2,0.2,0.21,0.26,154.0,0.0,136.0,...,,,,,,,,,,
5,1604324562060,2020-11-02,13:42:42.060,2,,,,,,,...,0.04,0.12,0.66,4.1,3.5,19.0,0.0,,,
6,1604324562888,2020-11-02,13:42:42.888,3,0.2,0.21,0.29,154.0,0.0,136.0,...,,,,,,,,,,
7,1604324563028,2020-11-02,13:42:43.028,3,,,,,,,...,0.05,0.19,0.66,5.5,3.8,18.0,0.0,,,
8,1604324563919,2020-11-02,13:42:43.919,4,0.2,0.2,0.4,193.0,0.0,177.0,...,,,,,,,,,,
9,1604324564044,2020-11-02,13:42:44.044,4,,,,,,,...,0.06,0.1,0.69,2.6,4.3,16.0,0.0,,,


##### What is the problem with these data?

- Indexed by numbers only (uninformative, it should be indexed by date and time)
- Date and time are in separate columns
- Date and time formats are not appropriate
- Column names are too long and difficult to read
- Lots of `na` values
- Half of every row is empty
- Some columns have barely any informative values
- Some values are not meaningful (e.g. tidal volume should be mL/kg, not mL)

We will deal with all these issues
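Before fixing anything, it helps to quantify how much is missing. The sketch below uses a small made-up table that mimics the alternating pattern of empty cells; the shortened column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# A small made-up table mimicking the alternating pattern of empty cells
small = pd.DataFrame({
    'MVe': [0.20, np.nan, 0.20, np.nan],
    'VTe': [np.nan, 3.4, np.nan, 8.2],
    'Tispon': [np.nan, np.nan, np.nan, np.nan],
})

# isna() marks missing cells as True; the mean of True/False values
# is the fraction of missing cells in each column
missing_fraction = small.isna().mean()
print(missing_fraction)
```

Applied to the real `data`, the same expression would show which columns are about half empty and which carry almost no values.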

In [55]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 689436 entries, 0 to 689435
Data columns (total 47 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   Time [ms]                        689436 non-null  int64  
 1   Date                             689436 non-null  object 
 2   Time                             689436 non-null  object 
 3   Rel.Time [s]                     689436 non-null  int64  
 4   5001|MVe [L/min]                 344633 non-null  float64
 5   5001|MVi [L/min]                 344633 non-null  float64
 6   5001|Cdyn [L/bar]                344073 non-null  float64
 7   5001|R [mbar/L/s]                342099 non-null  float64
 8   5001|MVespon [L/min]             344633 non-null  float64
 9   5001|Rpat [mbar/L/s]             341398 non-null  float64
 10  5001|MVemand [L/min]             344633 non-null  float64
 11  5001|FlowDev [L/min]             344677 non-null  float64
 12  50

In [56]:
%%time

path = os.path.join('data', 'CsvLogBase_2020-11-02_134238.904_slow_Measurement.csv.zip')
data = pd.read_csv(path)
data

CPU times: user 2.49 s, sys: 373 ms, total: 2.86 s
Wall time: 2.95 s


Unnamed: 0,Time [ms],Date,Time,Rel.Time [s],5001|MVe [L/min],5001|MVi [L/min],5001|Cdyn [L/bar],5001|R [mbar/L/s],5001|MVespon [L/min],5001|Rpat [mbar/L/s],...,5001|TC [s],5001|TCe [s],5001|C20/Cdyn [no unit],5001|VTe [mL],5001|VTi [mL],5001|EIP [mbar],5001|MVleak [L/min],5001|Tispon [s],5001|I:Espon (I-Part) [no unit],5001|I:Espon (E-Part) [no unit]
0,1604324559904,2020-11-02,13:42:39.904,0,0.20,0.21,0.21,146.0,0.0,128.0,...,,,,,,,,,,
1,1604324560029,2020-11-02,13:42:40.029,0,,,,,,,...,0.03,0.11,0.55,3.4,3.4,21.0,0.00,,,
2,1604324560951,2020-11-02,13:42:40.951,1,0.20,0.21,0.22,146.0,0.0,128.0,...,,,,,,,,,,
3,1604324561060,2020-11-02,13:42:41.060,1,,,,,,,...,0.03,0.21,0.55,8.2,3.6,22.0,0.00,,,
4,1604324561935,2020-11-02,13:42:41.935,2,0.20,0.21,0.26,154.0,0.0,136.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
689431,1604669233906,2020-11-06,13:27:13.906,344674,0.22,0.22,0.21,164.0,0.0,150.0,...,,,,,,,,,,
689432,1604669234031,2020-11-06,13:27:14.031,344674,,,,,,,...,0.04,0.11,0.75,3.1,2.6,18.0,0.02,,,
689433,1604669234906,2020-11-06,13:27:14.906,344675,0.22,0.22,0.21,164.0,0.0,150.0,...,,,,,,,,,,
689434,1604669235046,2020-11-06,13:27:15.046,344675,,,,,,,...,0.01,0.15,0.75,4.1,2.6,19.0,0.01,,,


In [57]:
%%time

path = os.path.join('data', 'CsvLogBase_2020-11-02_134238.904_slow_Measurement.csv.zip')
data = pd.read_csv(path, parse_dates = ['Date', 'Time'])
data

CPU times: user 31.1 s, sys: 504 ms, total: 31.6 s
Wall time: 31.7 s


Unnamed: 0,Time [ms],Date,Time,Rel.Time [s],5001|MVe [L/min],5001|MVi [L/min],5001|Cdyn [L/bar],5001|R [mbar/L/s],5001|MVespon [L/min],5001|Rpat [mbar/L/s],...,5001|TC [s],5001|TCe [s],5001|C20/Cdyn [no unit],5001|VTe [mL],5001|VTi [mL],5001|EIP [mbar],5001|MVleak [L/min],5001|Tispon [s],5001|I:Espon (I-Part) [no unit],5001|I:Espon (E-Part) [no unit]
0,1604324559904,2020-11-02,2021-04-16 13:42:39.904,0,0.20,0.21,0.21,146.0,0.0,128.0,...,,,,,,,,,,
1,1604324560029,2020-11-02,2021-04-16 13:42:40.029,0,,,,,,,...,0.03,0.11,0.55,3.4,3.4,21.0,0.00,,,
2,1604324560951,2020-11-02,2021-04-16 13:42:40.951,1,0.20,0.21,0.22,146.0,0.0,128.0,...,,,,,,,,,,
3,1604324561060,2020-11-02,2021-04-16 13:42:41.060,1,,,,,,,...,0.03,0.21,0.55,8.2,3.6,22.0,0.00,,,
4,1604324561935,2020-11-02,2021-04-16 13:42:41.935,2,0.20,0.21,0.26,154.0,0.0,136.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
689431,1604669233906,2020-11-06,2021-04-16 13:27:13.906,344674,0.22,0.22,0.21,164.0,0.0,150.0,...,,,,,,,,,,
689432,1604669234031,2020-11-06,2021-04-16 13:27:14.031,344674,,,,,,,...,0.04,0.11,0.75,3.1,2.6,18.0,0.02,,,
689433,1604669234906,2020-11-06,2021-04-16 13:27:14.906,344675,0.22,0.22,0.21,164.0,0.0,150.0,...,,,,,,,,,,
689434,1604669235046,2020-11-06,2021-04-16 13:27:15.046,344675,,,,,,,...,0.01,0.15,0.75,4.1,2.6,19.0,0.01,,,


In [58]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 689436 entries, 0 to 689435
Data columns (total 47 columns):
 #   Column                           Non-Null Count   Dtype         
---  ------                           --------------   -----         
 0   Time [ms]                        689436 non-null  int64         
 1   Date                             689436 non-null  datetime64[ns]
 2   Time                             689436 non-null  datetime64[ns]
 3   Rel.Time [s]                     689436 non-null  int64         
 4   5001|MVe [L/min]                 344633 non-null  float64       
 5   5001|MVi [L/min]                 344633 non-null  float64       
 6   5001|Cdyn [L/bar]                344073 non-null  float64       
 7   5001|R [mbar/L/s]                342099 non-null  float64       
 8   5001|MVespon [L/min]             344633 non-null  float64       
 9   5001|Rpat [mbar/L/s]             341398 non-null  float64       
 10  5001|MVemand [L/min]             344633 non-

In [59]:
%%time

path = os.path.join('data', 'CsvLogBase_2020-11-02_134238.904_slow_Measurement.csv.zip')
data = pd.read_csv(path, parse_dates = [['Date', 'Time']])
data

CPU times: user 3.06 s, sys: 423 ms, total: 3.48 s
Wall time: 3.58 s


Unnamed: 0,Date_Time,Time [ms],Rel.Time [s],5001|MVe [L/min],5001|MVi [L/min],5001|Cdyn [L/bar],5001|R [mbar/L/s],5001|MVespon [L/min],5001|Rpat [mbar/L/s],5001|MVemand [L/min],...,5001|TC [s],5001|TCe [s],5001|C20/Cdyn [no unit],5001|VTe [mL],5001|VTi [mL],5001|EIP [mbar],5001|MVleak [L/min],5001|Tispon [s],5001|I:Espon (I-Part) [no unit],5001|I:Espon (E-Part) [no unit]
0,2020-11-02 13:42:39.904,1604324559904,0,0.20,0.21,0.21,146.0,0.0,128.0,0.20,...,,,,,,,,,,
1,2020-11-02 13:42:40.029,1604324560029,0,,,,,,,,...,0.03,0.11,0.55,3.4,3.4,21.0,0.00,,,
2,2020-11-02 13:42:40.951,1604324560951,1,0.20,0.21,0.22,146.0,0.0,128.0,0.20,...,,,,,,,,,,
3,2020-11-02 13:42:41.060,1604324561060,1,,,,,,,,...,0.03,0.21,0.55,8.2,3.6,22.0,0.00,,,
4,2020-11-02 13:42:41.935,1604324561935,2,0.20,0.21,0.26,154.0,0.0,136.0,0.20,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
689431,2020-11-06 13:27:13.906,1604669233906,344674,0.22,0.22,0.21,164.0,0.0,150.0,0.22,...,,,,,,,,,,
689432,2020-11-06 13:27:14.031,1604669234031,344674,,,,,,,,...,0.04,0.11,0.75,3.1,2.6,18.0,0.02,,,
689433,2020-11-06 13:27:14.906,1604669234906,344675,0.22,0.22,0.21,164.0,0.0,150.0,0.22,...,,,,,,,,,,
689434,2020-11-06 13:27:15.046,1604669235046,344675,,,,,,,,...,0.01,0.15,0.75,4.1,2.6,19.0,0.01,,,


In [60]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 689436 entries, 0 to 689435
Data columns (total 46 columns):
 #   Column                           Non-Null Count   Dtype         
---  ------                           --------------   -----         
 0   Date_Time                        689436 non-null  datetime64[ns]
 1   Time [ms]                        689436 non-null  int64         
 2   Rel.Time [s]                     689436 non-null  int64         
 3   5001|MVe [L/min]                 344633 non-null  float64       
 4   5001|MVi [L/min]                 344633 non-null  float64       
 5   5001|Cdyn [L/bar]                344073 non-null  float64       
 6   5001|R [mbar/L/s]                342099 non-null  float64       
 7   5001|MVespon [L/min]             344633 non-null  float64       
 8   5001|Rpat [mbar/L/s]             341398 non-null  float64       
 9   5001|MVemand [L/min]             344633 non-null  float64       
 10  5001|FlowDev [L/min]             344677 non-
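Parsing `Date` and `Time` separately (In [57]) was slow and attached the current date to the `Time` column, while the nested list (In [59]) combined the two columns before parsing. An alternative sketch is to read both columns as text, join them and parse the result with an explicit format. This may also be useful because, as far as can be told, recent pandas releases deprecate the nested-list form of `parse_dates` (an assumption worth checking against your installed version). The two-row csv text below is made up for illustration.

```python
import io
import pandas as pd

# Two made-up rows in the same Date / Time layout as the ventilator export
csv_text = """Date,Time,value
2020-11-02,13:42:39.904,0.20
2020-11-02,13:42:40.029,0.21
"""

df = pd.read_csv(io.StringIO(csv_text))

# Join the two text columns and parse them once, with an explicit format
df['Date_Time'] = pd.to_datetime(df['Date'] + ' ' + df['Time'],
                                 format='%Y-%m-%d %H:%M:%S.%f')

# Drop the original text columns and use the timestamps as the index
df = df.drop(columns=['Date', 'Time']).set_index('Date_Time')
print(df)
```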

In [61]:
data = data.set_index('Date_Time')
data

Date_Time,Time [ms],Rel.Time [s],5001|MVe [L/min],5001|MVi [L/min],5001|Cdyn [L/bar],5001|R [mbar/L/s],5001|MVespon [L/min],5001|Rpat [mbar/L/s],5001|MVemand [L/min],5001|FlowDev [L/min],...,5001|TC [s],5001|TCe [s],5001|C20/Cdyn [no unit],5001|VTe [mL],5001|VTi [mL],5001|EIP [mbar],5001|MVleak [L/min],5001|Tispon [s],5001|I:Espon (I-Part) [no unit],5001|I:Espon (E-Part) [no unit]
2020-11-02 13:42:39.904,1604324559904,0,0.20,0.21,0.21,146.0,0.0,128.0,0.20,6.9,...,,,,,,,,,,
2020-11-02 13:42:40.029,1604324560029,0,,,,,,,,,...,0.03,0.11,0.55,3.4,3.4,21.0,0.00,,,
2020-11-02 13:42:40.951,1604324560951,1,0.20,0.21,0.22,146.0,0.0,128.0,0.20,6.9,...,,,,,,,,,,
2020-11-02 13:42:41.060,1604324561060,1,,,,,,,,,...,0.03,0.21,0.55,8.2,3.6,22.0,0.00,,,
2020-11-02 13:42:41.935,1604324561935,2,0.20,0.21,0.26,154.0,0.0,136.0,0.20,6.9,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-11-06 13:27:13.906,1604669233906,344674,0.22,0.22,0.21,164.0,0.0,150.0,0.22,6.9,...,,,,,,,,,,
2020-11-06 13:27:14.031,1604669234031,344674,,,,,,,,,...,0.04,0.11,0.75,3.1,2.6,18.0,0.02,,,
2020-11-06 13:27:14.906,1604669234906,344675,0.22,0.22,0.21,164.0,0.0,150.0,0.22,6.9,...,,,,,,,,,,
2020-11-06 13:27:15.046,1604669235046,344675,,,,,,,,,...,0.01,0.15,0.75,4.1,2.6,19.0,0.01,,,


In [62]:
data.columns

Index(['Time [ms]', 'Rel.Time [s]', '5001|MVe [L/min]', '5001|MVi [L/min]',
       '5001|Cdyn [L/bar]', '5001|R [mbar/L/s]', '5001|MVespon [L/min]',
       '5001|Rpat [mbar/L/s]', '5001|MVemand [L/min]', '5001|FlowDev [L/min]',
       '5001|VTmand [mL]', '5001|r2 [no unit]', '5001|VTispon [mL]',
       '5001|Pmin [mbar]', '5001|Pmean [mbar]', '5001|PEEP [mbar]',
       '5001|RRmand [1/min]', '5001|PIP [mbar]', '5001|VTmand [L]',
       '5001|VTspon [L]', '5001|VTemand [mL]', '5001|VTespon [mL]',
       '5001|VTimand [mL]', '5001|VT [mL]', '5001|% leak [%]',
       '5001|RRspon [1/min]', '5001|% MVspon [%]', '5001|MV [L/min]',
       '5001|RRtrig [1/min]', '5001|RR [1/min]', '5001|I (I:E) [no unit]',
       '5001|E (I:E) [no unit]', '5001|FiO2 [%]', '5001|VTspon [mL]',
       '5001|E [mbar/L]', '5001|TC [s]', '5001|TCe [s]',
       '5001|C20/Cdyn [no unit]', '5001|VTe [mL]', '5001|VTi [mL]',
       '5001|EIP [mbar]', '5001|MVleak [L/min]', '5001|Tispon [s]',
       '5001|I:Espon (I-Part

We could just replace the column names by hand with

`data.columns = ['...', '...', '...']` 

but typing out 45 names is error-prone

In [63]:
# Welcome to list comprehensions

new_columns_1 = [item for item in data.columns]
print(new_columns_1)

['Time [ms]', 'Rel.Time [s]', '5001|MVe [L/min]', '5001|MVi [L/min]', '5001|Cdyn [L/bar]', '5001|R [mbar/L/s]', '5001|MVespon [L/min]', '5001|Rpat [mbar/L/s]', '5001|MVemand [L/min]', '5001|FlowDev [L/min]', '5001|VTmand [mL]', '5001|r2 [no unit]', '5001|VTispon [mL]', '5001|Pmin [mbar]', '5001|Pmean [mbar]', '5001|PEEP [mbar]', '5001|RRmand [1/min]', '5001|PIP [mbar]', '5001|VTmand [L]', '5001|VTspon [L]', '5001|VTemand [mL]', '5001|VTespon [mL]', '5001|VTimand [mL]', '5001|VT [mL]', '5001|% leak [%]', '5001|RRspon [1/min]', '5001|% MVspon [%]', '5001|MV [L/min]', '5001|RRtrig [1/min]', '5001|RR [1/min]', '5001|I (I:E) [no unit]', '5001|E (I:E) [no unit]', '5001|FiO2 [%]', '5001|VTspon [mL]', '5001|E [mbar/L]', '5001|TC [s]', '5001|TCe [s]', '5001|C20/Cdyn [no unit]', '5001|VTe [mL]', '5001|VTi [mL]', '5001|EIP [mbar]', '5001|MVleak [L/min]', '5001|Tispon [s]', '5001|I:Espon (I-Part) [no unit]', '5001|I:Espon (E-Part) [no unit]']


In [64]:
new_columns_2 = [item[5:] for item in data.columns]
print(new_columns_2)

['[ms]', 'ime [s]', 'MVe [L/min]', 'MVi [L/min]', 'Cdyn [L/bar]', 'R [mbar/L/s]', 'MVespon [L/min]', 'Rpat [mbar/L/s]', 'MVemand [L/min]', 'FlowDev [L/min]', 'VTmand [mL]', 'r2 [no unit]', 'VTispon [mL]', 'Pmin [mbar]', 'Pmean [mbar]', 'PEEP [mbar]', 'RRmand [1/min]', 'PIP [mbar]', 'VTmand [L]', 'VTspon [L]', 'VTemand [mL]', 'VTespon [mL]', 'VTimand [mL]', 'VT [mL]', '% leak [%]', 'RRspon [1/min]', '% MVspon [%]', 'MV [L/min]', 'RRtrig [1/min]', 'RR [1/min]', 'I (I:E) [no unit]', 'E (I:E) [no unit]', 'FiO2 [%]', 'VTspon [mL]', 'E [mbar/L]', 'TC [s]', 'TCe [s]', 'C20/Cdyn [no unit]', 'VTe [mL]', 'VTi [mL]', 'EIP [mbar]', 'MVleak [L/min]', 'Tispon [s]', 'I:Espon (I-Part) [no unit]', 'I:Espon (E-Part) [no unit]']


In [65]:
new_columns_3 = [item[5:] for item in data.columns if item.startswith('5001')]
print(new_columns_3)

['MVe [L/min]', 'MVi [L/min]', 'Cdyn [L/bar]', 'R [mbar/L/s]', 'MVespon [L/min]', 'Rpat [mbar/L/s]', 'MVemand [L/min]', 'FlowDev [L/min]', 'VTmand [mL]', 'r2 [no unit]', 'VTispon [mL]', 'Pmin [mbar]', 'Pmean [mbar]', 'PEEP [mbar]', 'RRmand [1/min]', 'PIP [mbar]', 'VTmand [L]', 'VTspon [L]', 'VTemand [mL]', 'VTespon [mL]', 'VTimand [mL]', 'VT [mL]', '% leak [%]', 'RRspon [1/min]', '% MVspon [%]', 'MV [L/min]', 'RRtrig [1/min]', 'RR [1/min]', 'I (I:E) [no unit]', 'E (I:E) [no unit]', 'FiO2 [%]', 'VTspon [mL]', 'E [mbar/L]', 'TC [s]', 'TCe [s]', 'C20/Cdyn [no unit]', 'VTe [mL]', 'VTi [mL]', 'EIP [mbar]', 'MVleak [L/min]', 'Tispon [s]', 'I:Espon (I-Part) [no unit]', 'I:Espon (E-Part) [no unit]']
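A list comprehension is a compact way of writing a `for` loop that builds a list. The sketch below shows both forms side by side on a shortened, illustrative list of column names.

```python
# A shortened list of column names, for illustration
columns = ['Time [ms]', 'Rel.Time [s]', '5001|MVe [L/min]', '5001|MVi [L/min]']

# The long way: an explicit loop with a condition
shortened_loop = []
for item in columns:
    if item.startswith('5001'):
        shortened_loop.append(item[5:])   # drop the first 5 characters ('5001|')

# The same logic as a list comprehension
shortened_comp = [item[5:] for item in columns if item.startswith('5001')]

print(shortened_loop == shortened_comp)   # True
print(shortened_comp)                     # ['MVe [L/min]', 'MVi [L/min]']
```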


In [66]:
# The expression to the right of `=` is evaluated first (before assignment)

new_columns_3 = ['Time [ms]', 'Rel.Time [s]'] + new_columns_3
new_columns_3

['Time [ms]',
 'Rel.Time [s]',
 'MVe [L/min]',
 'MVi [L/min]',
 'Cdyn [L/bar]',
 'R [mbar/L/s]',
 'MVespon [L/min]',
 'Rpat [mbar/L/s]',
 'MVemand [L/min]',
 'FlowDev [L/min]',
 'VTmand [mL]',
 'r2 [no unit]',
 'VTispon [mL]',
 'Pmin [mbar]',
 'Pmean [mbar]',
 'PEEP [mbar]',
 'RRmand [1/min]',
 'PIP [mbar]',
 'VTmand [L]',
 'VTspon [L]',
 'VTemand [mL]',
 'VTespon [mL]',
 'VTimand [mL]',
 'VT [mL]',
 '% leak [%]',
 'RRspon [1/min]',
 '% MVspon [%]',
 'MV [L/min]',
 'RRtrig [1/min]',
 'RR [1/min]',
 'I (I:E) [no unit]',
 'E (I:E) [no unit]',
 'FiO2 [%]',
 'VTspon [mL]',
 'E [mbar/L]',
 'TC [s]',
 'TCe [s]',
 'C20/Cdyn [no unit]',
 'VTe [mL]',
 'VTi [mL]',
 'EIP [mbar]',
 'MVleak [L/min]',
 'Tispon [s]',
 'I:Espon (I-Part) [no unit]',
 'I:Espon (E-Part) [no unit]']
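Filtering and then adding the two time columns back works, but it relies on the column order staying the same. A possible alternative, sketched here on a shortened list, is to split each name at the `|` character and keep the last part. Names without `|` are left unchanged.

```python
# A shortened list of column names, for illustration
columns = ['Time [ms]', 'Rel.Time [s]', '5001|MVe [L/min]', '5001|MVi [L/min]']

# split('|', 1) cuts at the first '|' only; [-1] keeps the last piece.
# A name without '|' is returned whole, so no column is lost or reordered.
cleaned = [name.split('|', 1)[-1] for name in columns]

print(cleaned)
# ['Time [ms]', 'Rel.Time [s]', 'MVe [L/min]', 'MVi [L/min]']
```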

In [67]:
data.columns = new_columns_3
data.head(10)

Date_Time,Time [ms],Rel.Time [s],MVe [L/min],MVi [L/min],Cdyn [L/bar],R [mbar/L/s],MVespon [L/min],Rpat [mbar/L/s],MVemand [L/min],FlowDev [L/min],...,TC [s],TCe [s],C20/Cdyn [no unit],VTe [mL],VTi [mL],EIP [mbar],MVleak [L/min],Tispon [s],I:Espon (I-Part) [no unit],I:Espon (E-Part) [no unit]
2020-11-02 13:42:39.904,1604324559904,0,0.2,0.21,0.21,146.0,0.0,128.0,0.2,6.9,...,,,,,,,,,,
2020-11-02 13:42:40.029,1604324560029,0,,,,,,,,,...,0.03,0.11,0.55,3.4,3.4,21.0,0.0,,,
2020-11-02 13:42:40.951,1604324560951,1,0.2,0.21,0.22,146.0,0.0,128.0,0.2,6.9,...,,,,,,,,,,
2020-11-02 13:42:41.060,1604324561060,1,,,,,,,,,...,0.03,0.21,0.55,8.2,3.6,22.0,0.0,,,
2020-11-02 13:42:41.935,1604324561935,2,0.2,0.21,0.26,154.0,0.0,136.0,0.2,6.9,...,,,,,,,,,,
2020-11-02 13:42:42.060,1604324562060,2,,,,,,,,,...,0.04,0.12,0.66,4.1,3.5,19.0,0.0,,,
2020-11-02 13:42:42.888,1604324562888,3,0.2,0.21,0.29,154.0,0.0,136.0,0.2,6.9,...,,,,,,,,,,
2020-11-02 13:42:43.028,1604324563028,3,,,,,,,,,...,0.05,0.19,0.66,5.5,3.8,18.0,0.0,,,
2020-11-02 13:42:43.919,1604324563919,4,0.2,0.2,0.4,193.0,0.0,177.0,0.2,6.9,...,,,,,,,,,,
2020-11-02 13:42:44.044,1604324564044,4,,,,,,,,,...,0.06,0.1,0.69,2.6,4.3,16.0,0.0,,,


In [68]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 689436 entries, 2020-11-02 13:42:39.904000 to 2020-11-06 13:27:15.937000
Data columns (total 45 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Time [ms]                   689436 non-null  int64  
 1   Rel.Time [s]                689436 non-null  int64  
 2   MVe [L/min]                 344633 non-null  float64
 3   MVi [L/min]                 344633 non-null  float64
 4   Cdyn [L/bar]                344073 non-null  float64
 5   R [mbar/L/s]                342099 non-null  float64
 6   MVespon [L/min]             344633 non-null  float64
 7   Rpat [mbar/L/s]             341398 non-null  float64
 8   MVemand [L/min]             344633 non-null  float64
 9   FlowDev [L/min]             344677 non-null  float64
 10  VTmand [mL]                 344567 non-null  float64
 11  r2 [no unit]                344538 non-null  float64
 12  VTispon [mL]            

In [69]:
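# This is a `hack`: each second has two rows, each holding a different subset of the columns
# Resampling to 1 second and taking the mean merges them, because `mean()` skips NA values by default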

data = data.resample('1S').mean()
data.head(10)

Date_Time,Time [ms],Rel.Time [s],MVe [L/min],MVi [L/min],Cdyn [L/bar],R [mbar/L/s],MVespon [L/min],Rpat [mbar/L/s],MVemand [L/min],FlowDev [L/min],...,TC [s],TCe [s],C20/Cdyn [no unit],VTe [mL],VTi [mL],EIP [mbar],MVleak [L/min],Tispon [s],I:Espon (I-Part) [no unit],I:Espon (E-Part) [no unit]
2020-11-02 13:42:39,1604325000000.0,0.0,0.2,0.21,0.21,146.0,0.0,128.0,0.2,6.9,...,,,,,,,,,,
2020-11-02 13:42:40,1604325000000.0,0.5,0.2,0.21,0.22,146.0,0.0,128.0,0.2,6.9,...,0.03,0.11,0.55,3.4,3.4,21.0,0.0,,,
2020-11-02 13:42:41,1604325000000.0,1.5,0.2,0.21,0.26,154.0,0.0,136.0,0.2,6.9,...,0.03,0.21,0.55,8.2,3.6,22.0,0.0,,,
2020-11-02 13:42:42,1604325000000.0,2.5,0.2,0.21,0.29,154.0,0.0,136.0,0.2,6.9,...,0.04,0.12,0.66,4.1,3.5,19.0,0.0,,,
2020-11-02 13:42:43,1604325000000.0,3.5,0.2,0.2,0.4,193.0,0.0,177.0,0.2,6.9,...,0.05,0.19,0.66,5.5,3.8,18.0,0.0,,,
2020-11-02 13:42:44,1604325000000.0,4.5,0.21,0.2,0.39,193.0,0.0,177.0,0.21,6.9,...,0.06,0.1,0.69,2.6,4.3,16.0,0.0,,,
2020-11-02 13:42:45,1604325000000.0,5.5,0.21,0.2,0.42,193.0,0.0,177.0,0.21,6.8,...,0.08,0.2,0.69,5.2,4.0,18.0,0.0,,,
2020-11-02 13:42:46,1604325000000.0,6.5,0.22,0.2,0.3,195.0,0.0,177.0,0.22,6.8,...,0.08,0.12,0.69,2.2,4.3,17.0,0.0,,,
2020-11-02 13:42:47,1604325000000.0,7.5,0.22,0.21,0.34,195.0,0.0,177.0,0.22,6.8,...,0.06,0.12,0.59,3.5,3.6,19.0,0.0,,,
2020-11-02 13:42:48,1604325000000.0,8.5,0.23,0.21,0.23,195.0,0.0,177.0,0.23,6.8,...,0.06,0.09,0.59,2.9,4.0,20.0,0.0,,,
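
Why the resampling merges the rows can be seen on a small example. The sketch below is only an illustration under assumptions: the toy frame and its shortened column names (`MVe`, `VTe`) are invented and merely mimic the alternating pattern of the recording, where each row fills a different subset of the columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the recording: rows alternate between two groups of columns
# (the frame and the short column names are illustrative, not the real data)
index = pd.to_datetime([
    "2020-11-02 13:42:39.904",
    "2020-11-02 13:42:40.029",
    "2020-11-02 13:42:40.951",
    "2020-11-02 13:42:41.060",
])
toy = pd.DataFrame(
    {"MVe": [0.20, np.nan, 0.22, np.nan],
     "VTe": [np.nan, 3.4, np.nan, 8.2]},
    index=index,
)

# Rows are grouped into 1-second bins; mean() ignores NaN within each bin,
# so a bin keeps whatever value is available for each column
merged = toy.resample("1s").mean()
print(merged)
```

A side effect is that every column is averaged, including the time columns, which is why `Rel.Time [s]` shows values such as 0.5 and 1.5 after resampling.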


In [70]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 344677 entries, 2020-11-02 13:42:39 to 2020-11-06 13:27:15
Freq: S
Data columns (total 45 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Time [ms]                   344677 non-null  float64
 1   Rel.Time [s]                344677 non-null  float64
 2   MVe [L/min]                 344519 non-null  float64
 3   MVi [L/min]                 344519 non-null  float64
 4   Cdyn [L/bar]                343959 non-null  float64
 5   R [mbar/L/s]                341986 non-null  float64
 6   MVespon [L/min]             344519 non-null  float64
 7   Rpat [mbar/L/s]             341285 non-null  float64
 8   MVemand [L/min]             344519 non-null  float64
 9   FlowDev [L/min]             344563 non-null  float64
 10  VTmand [mL]                 344453 non-null  float64
 11  r2 [no unit]                344424 non-null  float64
 12  VTispon [mL]                34

In [71]:
# Some columns are almost completely empty and hopeless - drop them

data = data.drop(['Tispon [s]', 'I:Espon (I-Part) [no unit]', 
                  'I:Espon (E-Part) [no unit]'], axis = 1)
data

Date_Time,Time [ms],Rel.Time [s],MVe [L/min],MVi [L/min],Cdyn [L/bar],R [mbar/L/s],MVespon [L/min],Rpat [mbar/L/s],MVemand [L/min],FlowDev [L/min],...,FiO2 [%],VTspon [mL],E [mbar/L],TC [s],TCe [s],C20/Cdyn [no unit],VTe [mL],VTi [mL],EIP [mbar],MVleak [L/min]
2020-11-02 13:42:39,1.604325e+12,0.0,0.20,0.21,0.21,146.0,0.0,128.0,0.20,6.9,...,25.0,,,,,,,,,
2020-11-02 13:42:40,1.604325e+12,0.5,0.20,0.21,0.22,146.0,0.0,128.0,0.20,6.9,...,25.0,0.0,4774.0,0.03,0.11,0.55,3.4,3.4,21.0,0.00
2020-11-02 13:42:41,1.604325e+12,1.5,0.20,0.21,0.26,154.0,0.0,136.0,0.20,6.9,...,25.0,0.0,4597.0,0.03,0.21,0.55,8.2,3.6,22.0,0.00
2020-11-02 13:42:42,1.604325e+12,2.5,0.20,0.21,0.29,154.0,0.0,136.0,0.20,6.9,...,25.0,0.0,3798.0,0.04,0.12,0.66,4.1,3.5,19.0,0.00
2020-11-02 13:42:43,1.604325e+12,3.5,0.20,0.20,0.40,193.0,0.0,177.0,0.20,6.9,...,25.0,0.0,3408.0,0.05,0.19,0.66,5.5,3.8,18.0,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-11-06 13:27:11,1.604669e+12,344671.5,0.22,0.21,0.22,211.0,0.0,193.0,0.22,6.8,...,26.0,0.0,4502.0,0.07,0.12,0.61,3.5,3.7,22.0,0.00
2020-11-06 13:27:12,1.604669e+12,344672.5,0.23,0.22,0.26,164.0,0.0,150.0,0.22,6.9,...,26.0,0.0,3879.0,0.05,0.16,0.61,4.7,3.7,22.0,0.01
2020-11-06 13:27:13,1.604669e+12,344673.5,0.22,0.22,0.21,164.0,0.0,150.0,0.22,6.9,...,26.0,0.0,4695.0,0.05,0.21,0.75,6.3,4.0,21.0,0.02
2020-11-06 13:27:14,1.604669e+12,344674.5,0.22,0.22,0.21,164.0,0.0,150.0,0.22,6.9,...,26.0,0.0,4726.0,0.04,0.11,0.75,3.1,2.6,18.0,0.02
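
The columns above were dropped by name. As a hedged alternative, sparse columns can also be found programmatically from the fraction of missing values. In the sketch below the toy frame, the short column names and the 0.5 cut-off are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one well-populated column and one that is mostly empty
toy = pd.DataFrame({
    "PIP": [20.0, 21.0, 22.0, 23.0],
    "Tispon": [np.nan, np.nan, np.nan, 0.3],
})

# Fraction of missing values per column (the mean of True/False values)
missing_fraction = toy.isnull().mean()

# Hypothetical cut-off: drop columns with more than half of the values missing
sparse_columns = missing_fraction[missing_fraction > 0.5].index
cleaned = toy.drop(columns=sparse_columns)
print(list(cleaned.columns))
```

`drop(columns=...)` does the same as passing the labels with `axis=1`.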


In [72]:
data.head()

Date_Time,Time [ms],Rel.Time [s],MVe [L/min],MVi [L/min],Cdyn [L/bar],R [mbar/L/s],MVespon [L/min],Rpat [mbar/L/s],MVemand [L/min],FlowDev [L/min],...,FiO2 [%],VTspon [mL],E [mbar/L],TC [s],TCe [s],C20/Cdyn [no unit],VTe [mL],VTi [mL],EIP [mbar],MVleak [L/min]
2020-11-02 13:42:39,1604325000000.0,0.0,0.2,0.21,0.21,146.0,0.0,128.0,0.2,6.9,...,25.0,,,,,,,,,
2020-11-02 13:42:40,1604325000000.0,0.5,0.2,0.21,0.22,146.0,0.0,128.0,0.2,6.9,...,25.0,0.0,4774.0,0.03,0.11,0.55,3.4,3.4,21.0,0.0
2020-11-02 13:42:41,1604325000000.0,1.5,0.2,0.21,0.26,154.0,0.0,136.0,0.2,6.9,...,25.0,0.0,4597.0,0.03,0.21,0.55,8.2,3.6,22.0,0.0
2020-11-02 13:42:42,1604325000000.0,2.5,0.2,0.21,0.29,154.0,0.0,136.0,0.2,6.9,...,25.0,0.0,3798.0,0.04,0.12,0.66,4.1,3.5,19.0,0.0
2020-11-02 13:42:43,1604325000000.0,3.5,0.2,0.2,0.4,193.0,0.0,177.0,0.2,6.9,...,25.0,0.0,3408.0,0.05,0.19,0.66,5.5,3.8,18.0,0.0


In [73]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 344677 entries, 2020-11-02 13:42:39 to 2020-11-06 13:27:15
Freq: S
Data columns (total 42 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Time [ms]           344677 non-null  float64
 1   Rel.Time [s]        344677 non-null  float64
 2   MVe [L/min]         344519 non-null  float64
 3   MVi [L/min]         344519 non-null  float64
 4   Cdyn [L/bar]        343959 non-null  float64
 5   R [mbar/L/s]        341986 non-null  float64
 6   MVespon [L/min]     344519 non-null  float64
 7   Rpat [mbar/L/s]     341285 non-null  float64
 8   MVemand [L/min]     344519 non-null  float64
 9   FlowDev [L/min]     344563 non-null  float64
 10  VTmand [mL]         344453 non-null  float64
 11  r2 [no unit]        344424 non-null  float64
 12  VTispon [mL]        344520 non-null  float64
 13  Pmin [mbar]         344563 non-null  float64
 14  Pmean [mbar]        344563 non-null  float

In [74]:
data.isnull().sum()

Time [ms]                 0
Rel.Time [s]              0
MVe [L/min]             158
MVi [L/min]             158
Cdyn [L/bar]            718
R [mbar/L/s]           2691
MVespon [L/min]         158
Rpat [mbar/L/s]        3392
MVemand [L/min]         158
FlowDev [L/min]         114
VTmand [mL]             224
r2 [no unit]            253
VTispon [mL]            157
Pmin [mbar]             114
Pmean [mbar]            114
PEEP [mbar]             114
RRmand [1/min]          114
PIP [mbar]              114
VTmand [L]              224
VTspon [L]              157
VTemand [mL]            224
VTespon [mL]            157
VTimand [mL]            220
VT [mL]                 224
% leak [%]              158
RRspon [1/min]        88501
% MVspon [%]            158
MV [L/min]              158
RRtrig [1/min]          158
RR [1/min]              158
I (I:E) [no unit]       119
E (I:E) [no unit]       119
FiO2 [%]                114
VTspon [mL]            1359
E [mbar/L]             5704
TC [s]              

In [75]:
# A lot is happening here, for example vectorized computation and broadcasting
# We will discuss them during the next session

data.isnull().sum() / len(data) * 100

Time [ms]              0.000000
Rel.Time [s]           0.000000
MVe [L/min]            0.045840
MVi [L/min]            0.045840
Cdyn [L/bar]           0.208311
R [mbar/L/s]           0.780731
MVespon [L/min]        0.045840
Rpat [mbar/L/s]        0.984110
MVemand [L/min]        0.045840
FlowDev [L/min]        0.033074
VTmand [mL]            0.064988
r2 [no unit]           0.073402
VTispon [mL]           0.045550
Pmin [mbar]            0.033074
Pmean [mbar]           0.033074
PEEP [mbar]            0.033074
RRmand [1/min]         0.033074
PIP [mbar]             0.033074
VTmand [L]             0.064988
VTspon [L]             0.045550
VTemand [mL]           0.064988
VTespon [mL]           0.045550
VTimand [mL]           0.063828
VT [mL]                0.064988
% leak [%]             0.045840
RRspon [1/min]        25.676503
% MVspon [%]           0.045840
MV [L/min]             0.045840
RRtrig [1/min]         0.045840
RR [1/min]             0.045840
I (I:E) [no unit]      0.034525
E (I:E) 
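
The percentage calculation above can be followed step by step on a toy frame. This is a minimal sketch with invented data and shortened column names: `isnull().sum()` returns one count per column, and the division and multiplication by single numbers are then applied to every element of that result.

```python
import numpy as np
import pandas as pd

# Toy frame with a known number of missing values per column
toy = pd.DataFrame({
    "PIP": [20.0, 21.0, np.nan, 22.0],
    "RRspon": [np.nan, np.nan, 30.0, np.nan],
})

# One count per column, returned as a Series
missing_count = toy.isnull().sum()

# The scalars len(toy) and 100 are applied to every element of the Series
missing_pct = missing_count / len(toy) * 100

# Taking the mean of the True/False values gives the same fractions directly
alt_pct = toy.isnull().mean() * 100
print(missing_pct)
```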

Now let us save the modified data.

We will export it as serialised binary data using `pickle`.

In [76]:
import os
import pickle

with open(os.path.join('results', 'data.pickle'), 'wb') as handle:
    pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)

We will continue from here
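
Continuing later means reading the pickle back. The sketch below shows the full round trip on a toy frame. It is an illustration under assumptions: a temporary folder stands in for the `results` folder, and the toy data are invented.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy frame standing in for the processed recording
toy = pd.DataFrame(
    {"PIP": [20.0, 21.0]},
    index=pd.to_datetime(["2020-11-02 13:42:39", "2020-11-02 13:42:40"]),
)

# A temporary folder is used here instead of 'results'
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "data.pickle")

    # Write the DataFrame as binary data
    with open(path, "wb") as handle:
        pickle.dump(toy, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back; the index, dtypes and values are preserved
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored.equals(toy))
```

Unlike a CSV export, the pickle keeps the `DatetimeIndex` and the column types, so no parsing is needed on loading. Pickle files can execute code when loaded, so only open files from a trusted source.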

### 6. Homework

Subsetting and indexing pandas DataFrames
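
As a reminder of the tools available, here is a minimal sketch on a toy frame. The frame and the column names `A` and `B` are invented for illustration. `iloc` selects by position, `loc` selects by label, and a `DatetimeIndex` also accepts date strings as labels.

```python
import pandas as pd

# Toy frame with one row per second around 13:00
index = pd.date_range("2020-11-03 12:59:58", periods=5, freq="1s", name="Date_Time")
toy = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [10, 20, 30, 40, 50]}, index=index)

second_row = toy.iloc[1]            # one row by position (a Series)
one_column = toy["A"]               # one column (a Series)
two_columns = toy[["A", "B"]]       # a list of labels gives a DataFrame
one_value = toy["B"].iloc[1]        # one value: column first, then position

# A partial date string selects every row within that minute
one_minute = toy.loc["2020-11-03 13:00"]

# Label slices include both end points; the second argument limits the columns
time_range = toy.loc["2020-11-03 12:59:59":"2020-11-03 13:00:01", ["A"]]
print(time_range)
```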

In [88]:
data

Unnamed: 0,Time [ms],Date,Time,Rel.Time [s],5001|MVe [L/min],5001|MVi [L/min],5001|Cdyn [L/bar],5001|R [mbar/L/s],5001|MVespon [L/min],5001|Rpat [mbar/L/s],...,5001|TC [s],5001|TCe [s],5001|C20/Cdyn [no unit],5001|VTe [mL],5001|VTi [mL],5001|EIP [mbar],5001|MVleak [L/min],5001|Tispon [s],5001|I:Espon (I-Part) [no unit],5001|I:Espon (E-Part) [no unit]
0,1604324559904,2020-11-02,13:42:39.904,0,0.20,0.21,0.21,146.0,0.0,128.0,...,,,,,,,,,,
1,1604324560029,2020-11-02,13:42:40.029,0,,,,,,,...,0.03,0.11,0.55,3.4,3.4,21.0,0.00,,,
2,1604324560951,2020-11-02,13:42:40.951,1,0.20,0.21,0.22,146.0,0.0,128.0,...,,,,,,,,,,
3,1604324561060,2020-11-02,13:42:41.060,1,,,,,,,...,0.03,0.21,0.55,8.2,3.6,22.0,0.00,,,
4,1604324561935,2020-11-02,13:42:41.935,2,0.20,0.21,0.26,154.0,0.0,136.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
689431,1604669233906,2020-11-06,13:27:13.906,344674,0.22,0.22,0.21,164.0,0.0,150.0,...,,,,,,,,,,
689432,1604669234031,2020-11-06,13:27:14.031,344674,,,,,,,...,0.04,0.11,0.75,3.1,2.6,18.0,0.02,,,
689433,1604669234906,2020-11-06,13:27:14.906,344675,0.22,0.22,0.21,164.0,0.0,150.0,...,,,,,,,,,,
689434,1604669235046,2020-11-06,13:27:15.046,344675,,,,,,,...,0.01,0.15,0.75,4.1,2.6,19.0,0.01,,,


In [None]:
# Select the third row only

selection = data
selection

In [None]:
# Select the "MVe [L/min]" column only

selection = data
selection

In [None]:
# Select the "MVe [L/min]" and "MVi [L/min]" columns only

selection = data
selection

In [None]:
# Select the 'MVe [L/min]' value from the third row

selection = data
selection

In [None]:
# Select all data during the 1-minute period at 2020-11-03 13:00

selection = data
selection

In [None]:
# Select all data between 2020-11-03 13:00 and 15:00

selection = data.loc
selection

In [None]:
# Select all data between 2020-11-03 13:00 and 15:00 and limit it to
# the "MVe [L/min]" and "MVi [L/min]" columns only

selection = data