![alt text](./pageheader_rose2_babies.jpg)

# Data Science in Medicine using Python

### Author: Dr Gusztav Belteki

## 1. Types of medical data

1.  Tabular data
    - Obtained from sensors of medical devices, monitors etc.
    - Typically retrieved as CSV or another text format
    - Usually two-dimensional
    - Usually time series data
    
    ![alt text](./data/tabular_data.jpg)
    
_____


2.  Image data
    - Obtained from imaging medical devices
    - Usually raster images (JPG, PNG, TIFF and other file formats, with different compression methods)
    - At least 3-dimensional, but frequently 4- or 5-dimensional
   
   
  
  ![alt text](./data/newborn_brain_image.jpg)
  
_____  


3.  Medical free text
    - From electronic medical notes or from medical knowledge databases (e.g., PubMed)
    - Typically retrieved as text (.txt) files
    - Unstructured
    - Can be time series
    
    
   ![alt text](./data/medical_freetext.jpg)

## 4. Data structures in Python

Data structures store some data (from nothing to the whole universe).

All data structures are `objects`, but not all objects are data structures.
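
To illustrate the second half of that statement, here is a minimal sketch. Strings and integers hold data, whereas a built-in function such as `len` or an imported module such as `math` is also an object but is not something we would call a data structure. The variable names are chosen for illustration only.

```python
import math

text = 'Hello World'   # a string: an object that stores data
number = 42            # an integer: an object that stores data

# Functions and modules are objects as well, although they are not data structures
things = [text, number, len, math]

for thing in things:
    # Every one of them is an instance of `object` and has its own type
    print(type(thing).__name__, isinstance(thing, object))

type_names = [type(thing).__name__ for thing in things]
```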

Evaluating them returns the data in some format:

In [None]:
a = 'Hello World'
a

In [None]:
b = 42
b

In [None]:
data

`print()` usually, but not always, results in a nicer format

In [None]:
print(a)

In [None]:
print(b)

In [None]:
print(data)
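
For plain Python objects such as strings, the difference between evaluating a name and printing it comes down to `repr()` versus `str()`. The sketch below uses a made-up two-line string to make the difference visible. Notebooks may use richer display formats for objects such as DataFrames, so treat this as the simple case only.

```python
lines = 'first line\nsecond line'   # illustrative string, not from the course data

# Evaluating a name at the end of a cell shows its repr():
# the quotes and the escape code \n stay visible
evaluated = repr(lines)

# print() uses str(): the quotes are dropped and \n becomes a real line break
printed = str(lines)

print(evaluated)
print(printed)
```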

They have different types

In [None]:
type(a)

In [None]:
type(b)

In [None]:
type(data)

Built-in functions work differently on different data structures

In [None]:
# This is obvious

len(a)

In [None]:
# This will produce an error which is perhaps unexpected

len(b)

In [None]:
# This is by no means obvious

len(data)
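
The three cases above can be summarised in a small sketch. A string has a length (its number of characters) and an integer has none, so `len()` raises a `TypeError`. For a pandas DataFrame, `len()` returns the number of rows. The `try`/`except` block is only there so that the example keeps running after the error.

```python
a = 'Hello World'
b = 42

# Length of a string: the number of characters, including the space
print(len(a))

# An integer has no length, so len() raises a TypeError
try:
    len(b)
    message = None
except TypeError as err:
    message = str(err)

print(message)
```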

They also have different methods associated with them

#### Methods for text strings

In [None]:
dir(a)

In [None]:
a.upper()

In [None]:
# Counting and indexing in Python starts from zero

a.find('o')

In [None]:
a.index('o')
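
`find()` and `index()` return the same position when the character is present. They differ when it is missing, as this short sketch shows: `find()` returns -1, while `index()` raises a `ValueError`.

```python
a = 'Hello World'

# Both return the zero-based position of the first 'o'
found = a.find('o')
indexed = a.index('o')

# 'z' does not occur in the string
missing = a.find('z')           # find() signals "not found" with -1

try:
    a.index('z')                # index() raises an error instead
    raised = False
except ValueError:
    raised = True

print(found, indexed, missing, raised)
```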

In [None]:
a.count('o')

In [None]:
a.startswith('H'), a.startswith('h')

In [None]:
a.isnumeric()

In [None]:
a.rjust(30)

#### Methods for integer numbers

In [None]:
dir(b)

In [None]:
b.bit_length()

In [None]:
b, b.__add__(2)

In [None]:
c = -42
c.__abs__()

In [None]:
abs(c)
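
The cells above hint at how Python works under the hood: operators and built-in functions are translated into calls to double-underscore methods. The sketch below checks this for `+` and `abs()`, and then defines a small class, `Recording`, which is hypothetical and not part of the course material, to show that `len()` works on any object that provides `__len__`.

```python
b = 42
c = -42

# Operators and built-in functions are translated into method calls
same_sum = (b + 2 == b.__add__(2))
same_abs = (abs(c) == c.__abs__())


class Recording:
    """Hypothetical container for a list of measurements."""

    def __init__(self, samples):
        self.samples = list(samples)

    def __len__(self):
        # len(recording) ends up calling this method
        return len(self.samples)


recording = Recording([0.20, 0.21, 0.22])
print(same_sum, same_abs, len(recording))
```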

#### Methods for complex data structures (pandas DataFrames)

In [None]:
dir(data)

In [None]:
len(dir(data))

In [None]:
data

In [None]:
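# numeric_only=True skips non-numeric columns, which otherwise raise an error in pandas 2.0+
data.mean(numeric_only=True)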

In [None]:
data.isnull()

In [None]:
data.isnull().sum()
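
Why does `data.isnull().sum()` count missing values? `isnull()` returns a table of `True`/`False` values, and `sum()` treats `True` as 1 and adds up each column. Here is a minimal sketch on a toy DataFrame. The column names and numbers are made up for illustration and are not the course data.

```python
import numpy as np
import pandas as pd

# Toy table with some missing values (NaN)
toy = pd.DataFrame({
    'MVe': [0.20, np.nan, 0.22],
    'VTe': [np.nan, 3.4, np.nan],
})

mask = toy.isnull()               # same shape as toy, but True where a value is missing
missing_per_column = mask.sum()   # True counts as 1, so this is the number of NaNs per column

print(mask)
print(missing_per_column)
```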

In [None]:
%%time
data.plot()

#### `HOMEWORK` : Exploratory data analysis

## 5. Reading in images

In [None]:
# importing matplotlib module 
import matplotlib.image as mpimg 

img = mpimg.imread('data/newborn_heart_image.jpg') 

In [None]:
img.ndim

In [None]:
img.shape

In [None]:
369 * 636 * 3

In [None]:
img.flatten()

In [None]:
len(img.flatten())

In [None]:
print(set(img.flatten()))

In [None]:
img[:, :, 1]

In [None]:
img[:, :, 2]

In [None]:
img[1, :, :]
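
The indexing above is easier to follow on a tiny array. The sketch below builds a made-up 2 x 4 "image" with 3 colour channels, assuming the same height x width x channel layout that `mpimg.imread` returns for a colour JPG. Fixing the last index selects one colour channel, and fixing the first index selects one row of pixels.

```python
import numpy as np

# Toy "image": 2 rows, 4 columns, 3 colour channels (values are arbitrary)
toy = np.arange(2 * 4 * 3, dtype=np.uint8).reshape(2, 4, 3)

green = toy[:, :, 1]        # all rows, all columns, channel 1 -> a 2-D array
second_row = toy[1, :, :]   # row 1, all columns, all channels -> a 2-D array

print(toy.ndim, toy.shape, toy.size)
print(green.shape)
print(second_row.shape)
```

The total number of values (`toy.size`) is the product of the three dimensions, which is what the multiplication cell above checks for the real image.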

In [None]:
# Output Image

# importing matplotlib module 
import matplotlib.pyplot as plt 

plt.imshow(img) 

## 6. Reading in text data

In [None]:
# The `with` block closes the file automatically, even if reading fails
with open('data/karamazov_brothers.txt', 'r') as f_handle:
    text = f_handle.read()

In [None]:
len(text)

In [None]:
text

In [None]:
text[:10000]

In [None]:
print(text[:10000])

In [None]:
print(text[5016:10000])

## 7. Homework

##### Get tabular data, import it and play around with it

#### Absolute and relative file paths

In [74]:
# Relative path:

pd.read_csv('data/data_new/CsvLogBase_2020-11-02_134238.904_slow_Measurement.csv.zip')

Unnamed: 0,Time [ms],Date,Time,Rel.Time [s],5001|MVe [L/min],5001|MVi [L/min],5001|Cdyn [L/bar],5001|R [mbar/L/s],5001|MVespon [L/min],5001|Rpat [mbar/L/s],...,5001|TC [s],5001|TCe [s],5001|C20/Cdyn [no unit],5001|VTe [mL],5001|VTi [mL],5001|EIP [mbar],5001|MVleak [L/min],5001|Tispon [s],5001|I:Espon (I-Part) [no unit],5001|I:Espon (E-Part) [no unit]
0,1604324559904,2020-11-02,13:42:39.904,0,0.20,0.21,0.21,146.0,0.0,128.0,...,,,,,,,,,,
1,1604324560029,2020-11-02,13:42:40.029,0,,,,,,,...,0.03,0.11,0.55,3.4,3.4,21.0,0.00,,,
2,1604324560951,2020-11-02,13:42:40.951,1,0.20,0.21,0.22,146.0,0.0,128.0,...,,,,,,,,,,
3,1604324561060,2020-11-02,13:42:41.060,1,,,,,,,...,0.03,0.21,0.55,8.2,3.6,22.0,0.00,,,
4,1604324561935,2020-11-02,13:42:41.935,2,0.20,0.21,0.26,154.0,0.0,136.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
689431,1604669233906,2020-11-06,13:27:13.906,344674,0.22,0.22,0.21,164.0,0.0,150.0,...,,,,,,,,,,
689432,1604669234031,2020-11-06,13:27:14.031,344674,,,,,,,...,0.04,0.11,0.75,3.1,2.6,18.0,0.02,,,
689433,1604669234906,2020-11-06,13:27:14.906,344675,0.22,0.22,0.21,164.0,0.0,150.0,...,,,,,,,,,,
689434,1604669235046,2020-11-06,13:27:15.046,344675,,,,,,,...,0.01,0.15,0.75,4.1,2.6,19.0,0.01,,,


In [None]:
pd.read_csv(os.path.join('data', 'CsvLogBase_2020-11-02_134238.904_slow_Measurement.csv.zip'))

In [None]:
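# Windows-style path: backslashes must be doubled (or use a raw string) and the string cannot end in a single backslash
'D:\\dddldd\\ddddd\\vfvff\\'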
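
Backslashes in Windows-style paths need care because Python uses the backslash for escape codes inside ordinary strings. A minimal sketch, using a hypothetical folder name: doubling the backslashes and using a raw string (`r'...'`) give the same result, while a single backslash can silently turn into something else.

```python
from pathlib import PureWindowsPath

# Hypothetical folder, used only for illustration
escaped = 'D:\\recordings\\patient_01'   # every backslash doubled
raw = r'D:\recordings\patient_01'        # raw string: backslashes are taken literally
same = (escaped == raw)

# Without either precaution, \r is read as a carriage return character
broken = 'D:\recordings'

# pathlib can split a Windows path into its parts on any operating system
parts = PureWindowsPath(raw).parts

print(same, len(broken), parts)
```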

In [76]:
# My absolute path:

pd.read_csv('/Users/guszti/data_science_course/data/CsvLogBase_2020-11-02_134238.904_slow_Measurement.csv.zip')

Unnamed: 0,Time [ms],Date,Time,Rel.Time [s],5001|MVe [L/min],5001|MVi [L/min],5001|Cdyn [L/bar],5001|R [mbar/L/s],5001|MVespon [L/min],5001|Rpat [mbar/L/s],...,5001|TC [s],5001|TCe [s],5001|C20/Cdyn [no unit],5001|VTe [mL],5001|VTi [mL],5001|EIP [mbar],5001|MVleak [L/min],5001|Tispon [s],5001|I:Espon (I-Part) [no unit],5001|I:Espon (E-Part) [no unit]
0,1604324559904,2020-11-02,13:42:39.904,0,0.20,0.21,0.21,146.0,0.0,128.0,...,,,,,,,,,,
1,1604324560029,2020-11-02,13:42:40.029,0,,,,,,,...,0.03,0.11,0.55,3.4,3.4,21.0,0.00,,,
2,1604324560951,2020-11-02,13:42:40.951,1,0.20,0.21,0.22,146.0,0.0,128.0,...,,,,,,,,,,
3,1604324561060,2020-11-02,13:42:41.060,1,,,,,,,...,0.03,0.21,0.55,8.2,3.6,22.0,0.00,,,
4,1604324561935,2020-11-02,13:42:41.935,2,0.20,0.21,0.26,154.0,0.0,136.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
689431,1604669233906,2020-11-06,13:27:13.906,344674,0.22,0.22,0.21,164.0,0.0,150.0,...,,,,,,,,,,
689432,1604669234031,2020-11-06,13:27:14.031,344674,,,,,,,...,0.04,0.11,0.75,3.1,2.6,18.0,0.02,,,
689433,1604669234906,2020-11-06,13:27:14.906,344675,0.22,0.22,0.21,164.0,0.0,150.0,...,,,,,,,,,,
689434,1604669235046,2020-11-06,13:27:15.046,344675,,,,,,,...,0.01,0.15,0.75,4.1,2.6,19.0,0.01,,,


In [None]:
pd.read_csv(os.path.join('/Users', 'guszti', 'data_science_course', 'data',
                         'CsvLogBase_2020-11-02_134238.904_slow_Measurement.csv.zip'))

In [None]:
# Why does this work?

pd.read_csv(os.path.join('.','data', 'CsvLogBase_2020-11-02_134238.904_slow_Measurement.csv.zip'))

In [None]:
# And why does this work?

pd.read_csv(os.path.join('..', 'data_science_course', 'data', 'CsvLogBase_2020-11-02_134238.904_slow_Measurement.csv.zip'))
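
The answer to both questions is that `.` stands for the current working directory and `..` for its parent folder. The sketch below shows this with `os.path` only; it does not read any file, and `example.csv` is a hypothetical file name.

```python
import os

cwd = os.getcwd()                  # the folder the notebook is running in
folder_name = os.path.basename(cwd)

plain = os.path.join('data', 'example.csv')
with_dot = os.path.join('.', 'data', 'example.csv')
# Go up one level, then come back down into the current folder
with_dotdot = os.path.join('..', folder_name, 'data', 'example.csv')

# abspath() resolves '.' and '..' relative to the current working directory
resolved = [os.path.abspath(p) for p in (plain, with_dot, with_dotdot)]

for path in resolved:
    print(path)
```

All three relative paths resolve to the same absolute path, which is why the three `read_csv` calls find the same file.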

### Slicing and dicing in Python

In [118]:
# List of strings

lst = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October',
       'November', 'December']

lst

['January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December']

In [121]:
# Indexing is zero based

lst[0:4]

['January', 'February', 'March', 'April']
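
Before the exercises, here is a short sketch of the general slice syntax `lst[start:stop:step]`. It uses a different list (days of the week) so as not to give the answers away. `start` is included and `stop` is excluded. Leaving either out means "from the beginning" or "to the end", and negative numbers count from the end or step backwards.

```python
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

middle = days[1:3]          # items 1 and 2; item 3 is excluded
tail = days[4:]             # from item 4 to the end
every_other = days[::2]     # every second item, starting from item 0
last_two = days[-2:]        # the last two items
backwards = days[::-1]      # the whole list reversed
back_part = days[5:2:-1]    # from item 5 down to item 3; item 2 is excluded

for piece in (middle, tail, every_other, last_two, backwards, back_part):
    print(piece)
```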

##### So, write the input that generates each output

`['March', 'April', 'May']`

In [None]:
lst[]

`['May', 'June', 'July', 'August']`

In [None]:
lst[]

`['July', 'August', 'September', 'October', 'November', 'December']`

In [None]:
lst[]

`['January', 'March', 'May', 'July', 'September']`

In [None]:
lst[]

`['January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December']`

In [None]:
lst[]

`['April', 'June', 'August', 'October']`

In [None]:
lst[]

`['December', 'November', 'October', 'September', 'August']`

In [None]:
lst[]

`['November', 'September', 'July', 'May']`

In [None]:
lst[]

`['December',
 'November',
 'October',
 'September',
 'August',
 'July',
 'June',
 'May',
 'April',
 'March',
 'February',
 'January']`

In [None]:
lst[]

In [None]:
# I would like to show you a different way to produce histograms

In [None]:
VTemand_binned = pd.cut(data_dict['2019-01-14_124200.144']['VTemand [mL]'], bins = 10)
VTemand_binned.head(10)

In [None]:
VTemand_binned.value_counts()

In [None]:
# Sort according to the index, not the values
# Also, what one function returns can be passed on to the next function
VTemand_binned.value_counts().sort_index()
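
The same chain of steps can be followed on a toy example. The numbers below are made up and are not ventilator data. `pd.cut` assigns each value to one of the equal-width bins, `value_counts()` counts the values per bin and orders the result by count, and `sort_index()` puts the bins back into ascending order, which is what a histogram needs.

```python
import pandas as pd

# Made-up values, for illustration only
values = pd.Series([1, 2, 5, 8, 9, 10])

binned = pd.cut(values, bins=3)            # each value is replaced by the interval it falls into
by_count = binned.value_counts()           # most frequent bin first
by_interval = by_count.sort_index()        # bins in ascending order

print(binned)
print(by_count)
print(by_interval)
```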

In [None]:
# Better but still not what you want
VTemand_binned.value_counts().sort_index().plot()

In [None]:
# Better but still not what you want
VTemand_binned.value_counts().sort_index().plot(kind = 'bar')

In [None]:
plot = VTemand_binned.value_counts().sort_index().plot(kind = 'bar')

In [None]:
import matplotlib.pyplot as plt

plot = VTemand_binned.value_counts().sort_index().plot(kind = 'bar')
plt.savefig(fname = os.path.join('results', 'VTemand'))

In [None]:
import matplotlib.pyplot as plt

plot = VTemand_binned.value_counts().sort_index().plot(kind = 'bar', color = 'black', alpha = 0.7, 
            xlabel = 'VTemand_kg', ylabel = 'number of inflations',)
#plt.grid(True)
plt.savefig(fname = os.path.join('results', 'VTemand'))

In [None]:
import matplotlib.pyplot as plt

plot = VTemand_binned.value_counts().sort_index().plot(kind = 'bar', color = 'black', alpha = 0.7, 
            xlabel = 'VTemand_kg', ylabel = 'number of inflations', logy = True)
#plt.grid(True)
plt.savefig(fname = os.path.join('results', 'VTemand'))
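
One practical point about the `savefig` calls above: the `results` folder has to exist before the figure is saved, because, as far as I am aware, matplotlib does not create missing folders for you. A minimal sketch of how to make sure it exists is shown below. It uses a temporary folder as a stand-in for the project folder, and the `.png` extension is my choice for illustration.

```python
import os
import tempfile

# A temporary folder stands in for the project folder in this sketch
project_dir = tempfile.mkdtemp()
results_dir = os.path.join(project_dir, 'results')

# exist_ok=True makes the call safe to repeat when the folder is already there
os.makedirs(results_dir, exist_ok=True)
os.makedirs(results_dir, exist_ok=True)

# This is the kind of path one would then pass to plt.savefig()
figure_path = os.path.join(results_dir, 'VTemand.png')

print(os.path.isdir(results_dir), figure_path)
```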