# HEP Software Training: Learn Programming with Python
## Chapter 2: Analyzing Patient Data

Created by: [Hisyam Athaya](https://athayahisyam.github.io/)  
Learning portfolio based on [SWCarpentry Programming with Python: Python Fundamentals](https://swcarpentry.github.io/python-novice-inflammation/)  
Visit [HEP Software Training](https://hepsoftwarefoundation.org/training/curriculum.html) for more information.

### Importing a Useful Library

In [1]:
import numpy
numpy.__version__

'1.20.3'

Loading data from a CSV file. Here the result is only displayed: because it is not assigned to a variable, the array is not kept for later use.

In [2]:
numpy.loadtxt(fname='python-novice-inflammation-data/data/inflammation-01.csv', delimiter=',')

array([[0., 0., 1., ..., 3., 0., 0.],
       [0., 1., 2., ..., 1., 0., 1.],
       [0., 1., 1., ..., 2., 1., 1.],
       ...,
       [0., 1., 1., ..., 1., 1., 1.],
       [0., 0., 0., ..., 0., 2., 0.],
       [0., 0., 1., ..., 1., 1., 0.]])
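The difference between displaying a loaded array and keeping it can be sketched without the data file. This is only an illustration: the text in `csv_text` and the name `sample` are made up, and `io.StringIO` stands in for the real CSV file.

```python
import io
import numpy

# Hypothetical three-patient, four-day sample standing in for the real CSV file.
csv_text = "0,1,2,3\n0,2,4,6\n1,1,1,1\n"

# Without an assignment the array is created, shown, and then discarded.
numpy.loadtxt(io.StringIO(csv_text), delimiter=',')

# Assigning the result keeps the array available under a name.
sample = numpy.loadtxt(io.StringIO(csv_text), delimiter=',')
print(sample.shape)   # (3, 4)
print(sample.dtype)   # float64
```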

### Numpy Array Operations

Loading the data from the CSV file and saving it to the variable `data`.

In [3]:
data = numpy.loadtxt(fname='python-novice-inflammation-data/data/inflammation-01.csv', delimiter=',')

In [4]:
print(data)

[[0. 0. 1. ... 3. 0. 0.]
 [0. 1. 2. ... 1. 0. 1.]
 [0. 1. 1. ... 2. 1. 1.]
 ...
 [0. 1. 1. ... 1. 1. 1.]
 [0. 0. 0. ... 0. 2. 0.]
 [0. 0. 1. ... 1. 1. 0.]]


In [5]:
print(type(data))

<class 'numpy.ndarray'>


In [6]:
# dtype shows the type of the values contained in the ndarray

print(data.dtype)

float64


In [7]:
# shape describes the dimensions of the array as (rows, columns)

print(data.shape)

(60, 40)


The data contains `60 rows` and `40 columns`.

In [8]:
# to access a single value in the array, provide its index in square brackets

print('first value in the data:', data[0,0])

first value in the data: 0.0


In [9]:
print('middle value in the data:', data[30, 20])

middle value in the data: 13.0


Remember: a two-dimensional NumPy array is indexed as `[row, column]`, and indices start at 0.
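A small sketch of the `[row, column]` order, using a made-up 3 x 4 array named `grid` (not part of the lesson data) whose values encode their own position:

```python
import numpy

# Hypothetical 3 x 4 array: 3 rows (patients), 4 columns (days).
# The tens digit marks the row, the units digit marks the column.
grid = numpy.array([[10, 11, 12, 13],
                    [20, 21, 22, 23],
                    [30, 31, 32, 33]])

print(grid[0, 0])   # row 0, column 0 -> 10
print(grid[2, 1])   # row 2, column 1 -> 31
print(grid[1, 3])   # row 1, column 3 -> 23
```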

### Slicing Data (... : ...)

Select the first `ten columns of values` and the first `four rows of values`.  
In this dataset, columns represent days of observation and rows represent patients, so this selects the first ten days of values for the first four patients.

In [10]:
# remember! [row, column]

print(data[0:4, 0:10])

[[0. 0. 1. 3. 1. 2. 4. 7. 8. 3.]
 [0. 1. 2. 1. 2. 1. 3. 2. 2. 6.]
 [0. 1. 1. 3. 3. 2. 6. 2. 5. 9.]
 [0. 0. 2. 0. 4. 2. 2. 1. 6. 7.]]


The slice `[0:4]` means: `start at index 0 and go up-to-but-not-including 4`. Likewise, the slice `[0:10]` means `start at index 0 and go up-to-but-not-including 10`. A slice does not have to begin at 0; it can start at any index we need.

In [11]:
print(data[5:10, 0:10])

[[0. 0. 1. 2. 2. 4. 2. 1. 6. 4.]
 [0. 0. 2. 2. 4. 2. 2. 5. 5. 8.]
 [0. 0. 1. 2. 3. 1. 2. 3. 5. 3.]
 [0. 0. 0. 3. 1. 5. 6. 5. 5. 8.]
 [0. 1. 1. 2. 1. 3. 5. 3. 5. 8.]]


A slice does not have to include both the `lower bound` and the `upper bound` (`[lower bound : upper bound]`). If we omit the `lower bound`, Python uses `0` by default; if we omit the `upper bound`, the slice runs to the end of that axis (row or column). If we omit both, i.e. `[:]`, the slice includes everything.

In [12]:
# remember! [row, column]
# remember! [lower bound : upper bound]

small = data[:3, 36:] 
# rows from index 0 up to but not including 3; columns from index 36 to the end of the axis

print('small is: ')
print(small)

small is: 
[[2. 3. 0. 0.]
 [1. 1. 0. 1.]
 [2. 2. 1. 1.]]
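The default bounds can be checked by comparing a slice with omitted bounds against the same slice written out in full. A minimal sketch on a made-up 3 x 4 array (`grid` is an illustrative name, not part of the lesson):

```python
import numpy

grid = numpy.arange(12).reshape(3, 4)  # hypothetical 3 x 4 array holding 0..11

# An omitted lower bound behaves like 0.
print(numpy.array_equal(grid[:2, :], grid[0:2, :]))   # True
# An omitted upper bound runs to the end of that axis (4 columns here).
print(numpy.array_equal(grid[:, 2:], grid[:, 2:4]))   # True
# A bare colon on both axes selects everything.
print(numpy.array_equal(grid[:, :], grid))            # True
```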


## Data Analysis

### Mean

In [13]:
print(numpy.mean(data))

6.14875


Note that not every Python function needs an input, but the parentheses are still required to call it.

In [14]:
import time
print(time.ctime())

Sun Mar 13 12:17:03 2022


### Descriptive Values of the Dataset

In [15]:
maxval, minval, stdval = numpy.max(data), numpy.min(data), numpy.std(data)

print('maximum inflammation: ', maxval)
print('minimum inflammation: ', minval)
print('standard deviation: ', stdval)

maximum inflammation:  20.0
minimum inflammation:  0.0
standard deviation:  4.613833197118566


Important note: To list the available numpy operations, type `numpy` followed by a dot `.` and press `tab`; the complete list will appear. To show the description of a function, append `?` to its name and run the cell (press `shift`+`enter`).

e.g.

`numpy.clip?`

In analysing the data, we often look at variations in statistical values, such as the maximum inflammation per patient or the average inflammation per day. One way to do this is to *create a new temporary array of the data we want and then do operations on it*.

In [16]:
patient_0 = data[0, :]
# remember! [row, column]
# row index 0, all columns of that row

print('maximum inflammation for patient 0:', numpy.max(patient_0))

maximum inflammation for patient 0: 18.0


Or, something more straightforward, with no temporary variable.

In [17]:
print('maximum inflammation for patient 2:', numpy.max(data[2, :]))
# row index 2, all columns (no lower or upper bound) of that row

maximum inflammation for patient 2: 19.0


This image shows visually how an operation is applied across an axis. Notice the `axis` argument. In our two-dimensional array, `axis=0` runs down the rows and so yields one result per column (per day), while `axis=1` runs across the columns and yields one result per row (per patient).

![Notice the pattern of average in the array/table](img/python-operations-across-axes.png)
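The effect of `axis` is easier to follow on an array small enough to check by hand. A minimal sketch, assuming a made-up 2 x 3 array (`tiny`, `per_day` and `per_patient` are illustrative names only):

```python
import numpy

# Hypothetical data: 2 patients (rows) x 3 days (columns).
tiny = numpy.array([[1.0, 2.0, 3.0],
                    [3.0, 4.0, 8.0]])

per_day = numpy.mean(tiny, axis=0)      # collapses the rows: one value per column
per_patient = numpy.mean(tiny, axis=1)  # collapses the columns: one value per row

print(per_day.shape, per_day)           # shape (3,), values 2.0, 3.0, 5.5
print(per_patient.shape, per_patient)   # shape (2,), values 2.0, 5.0
```

The shape of each result matches the axis that was kept: three days for `axis=0`, two patients for `axis=1`.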

In [18]:
print(numpy.mean(data, axis=0))

[ 0.          0.45        1.11666667  1.75        2.43333333  3.15
  3.8         3.88333333  5.23333333  5.51666667  5.95        5.9
  8.35        7.73333333  8.36666667  9.5         9.58333333 10.63333333
 11.56666667 12.35       13.25       11.96666667 11.03333333 10.16666667
 10.          8.66666667  9.15        7.25        7.33333333  6.58333333
  6.06666667  5.95        5.11666667  3.6         3.3         3.56666667
  2.48333333  1.5         1.13333333  0.56666667]


In [19]:
print(numpy.mean(data, axis=0).shape)

(40,)


The shape `(40,)` tells us we have a one-dimensional array of 40 values, one per column. This shows that the values printed before are the average (`mean`) inflammation per day across all patients.

In [20]:
print(numpy.max(data, axis=1))

[18. 18. 19. 17. 17. 18. 17. 20. 17. 18. 18. 18. 17. 16. 17. 18. 19. 19.
 17. 19. 19. 16. 17. 15. 17. 17. 18. 17. 20. 17. 16. 19. 15. 15. 19. 17.
 16. 17. 19. 16. 18. 19. 16. 19. 18. 16. 19. 15. 16. 18. 14. 20. 17. 15.
 17. 16. 17. 19. 18. 18.]


In [21]:
print(numpy.max(data, axis=1).shape)

(60,)


The shape `(60,)` tells us we have a one-dimensional array of 60 values, one per row. This shows that the values printed by `numpy.max(data, axis=1)` above are the maximum (`max`) inflammation per patient.

### Playing with Slice

In [22]:
element = 'oxygen'

In [23]:
print('first three characters:', element[0:3])
print('last three characters:', element[3:6])

first three characters: oxy
last three characters: gen


In [24]:
print(element[-1])
print(element[-2])

# element[-1] is the first character counting from the end, in this case n
# element[-2] is the second character counting from the end, in this case e

n
e


In [25]:
# remember! [lower bound : upper bound]

print(element[1:-1])

# means get value of string from index 1 up-to-but-not-including -1
# index 1 is x, index -1 is n, which is excluded

xyge


In [26]:
# get the last three characters from the element variable

print(element[-3:])

# the lower bound -3 is the third character from the end; the omitted upper bound runs to the end of the string
# if the string has fewer than three characters, the whole string is returned (see 'hi' below)

element1 = 'carpentry'
element2 = 'clone'
element3 = 'hi'

print(element1[-3:])
print(element2[-3:])
print(element3[-3:])

gen
try
one
hi
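Negative indices can be read as counting back from the length of the string. A short sketch of that reading (the variable `n` is introduced here only for illustration):

```python
element = 'oxygen'
n = len(element)  # 6

# A negative index -k refers to the same position as n - k.
print(element[-1] == element[n - 1])     # True
print(element[-3:] == element[n - 3:])   # True

# A slice that asks for more characters than exist returns what is available.
print('hi'[-3:])   # hi
```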


In [27]:
# does a slice with equal bounds, such as [3:3], return an empty result?

print(element[3:3])
print(data[3:3, 4:4])
print(data[3:3, :])



[]
[]
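Both arrays print as `[]`, but they are not the same kind of empty. Looking at the shapes makes this visible. A minimal sketch on a made-up 3 x 4 array (all names here are illustrative):

```python
import numpy

grid = numpy.arange(12).reshape(3, 4)  # hypothetical 3 x 4 array

empty_text = 'oxygen'[3:3]   # empty string
empty_both = grid[1:1, 2:2]  # no rows and no columns
empty_rows = grid[1:1, :]    # no rows, but all 4 columns are still described

print(repr(empty_text))   # ''
print(empty_both.shape)   # (0, 0)
print(empty_rows.shape)   # (0, 4)
```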


### Array Operations

In [28]:
import numpy

A = numpy.array([[1,2,3], [4,5,6], [7,8,9]])

print('A = ')
print(A)

B = numpy.hstack([A, A]) # hstack: horizontal stack
print('B =')
print(B)

C = numpy.vstack([A, A]) # vstack: vertical stack
print('C = ')
print(C)

A = 
[[1 2 3]
 [4 5 6]
 [7 8 9]]
B =
[[1 2 3 1 2 3]
 [4 5 6 4 5 6]
 [7 8 9 7 8 9]]
C = 
[[1 2 3]
 [4 5 6]
 [7 8 9]
 [1 2 3]
 [4 5 6]
 [7 8 9]]


In [29]:
# make a new array containing only the FIRST and LAST columns of A, stacked into a 3 x 2 array
# this one was confusing, so I opened the solution tab; the approach there is quite clever

# Try 1

print(A[:, 0])

[1 4 7]


In [30]:
# not what we wanted: indexing with a single integer drops that dimension, so the result is 1-D and will not stack as a column
# the solution mentions that "the index itself can be a slice or array"

print(A[:, :1])

[[1]
 [4]
 [7]]


In [31]:
# that gives one column as a 3 x 1 array: the bare colon takes every row,
# and the column slice starts at 0 and goes up to but not including 1, i.e. column index 0 only.
# now we use hstack to join two such arrays horizontally: column index 0 and column index -1

D = numpy.hstack((A[:, :1], A[:, -1:]))
print(D)

[[1 3]
 [4 6]
 [7 9]]


In [32]:
# alternatively, numpy.delete(A, 1, 1) removes index 1 along axis 1, i.e. the middle column

E = numpy.delete(A, 1, 1)
print(E)

[[1 3]
 [4 6]
 [7 9]]


Personal comment: To be honest, I do not fully understand how this works yet. I will skip it for now and revisit it later.
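One way to see what happened in the cells above is to compare shapes. The sketch below reuses the same `A`; the names `first_flat`, `first_col` and `last_col` are introduced here only for illustration.

```python
import numpy

A = numpy.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

first_flat = A[:, 0]    # integer index: the column axis is dropped
first_col = A[:, :1]    # slice: the column axis is kept, with length 1
last_col = A[:, -1:]    # slice that starts at the last column

print(first_flat.shape)  # (3,)
print(first_col.shape)   # (3, 1)

# Two (3, 1) arrays placed side by side give a (3, 2) array.
D = numpy.hstack((first_col, last_col))
# numpy.delete removes index 1 along axis 1, which is the middle column.
E = numpy.delete(A, 1, axis=1)
print(numpy.array_equal(D, E))  # True
```

Both routes end at the same 3 x 2 array: one builds it from the columns to keep, the other removes the column that is not wanted.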

### Change in Inflammation

Patient data is *longitudinal*, which means *each row represents a series of observations relating to one individual*. **This means the change in inflammation over time is a meaningful concept**.  
  
The `numpy.diff()` function takes an array and returns the differences between each pair of successive values.

In [33]:
# taking patient number 3 (index 3) data and take 7 days of observation (7 column)
patient3_week1 = data[3, :7]
print(patient3_week1)
print(patient3_week1.shape)

[0. 0. 2. 0. 4. 2. 2.]
(7,)


In [34]:
# applying numpy.diff() to patient3_week1
# remember: successive values! each value is subtracted from the one after it (second minus first, third minus second, etc.)

numpy.diff(patient3_week1)

array([ 0.,  2., -2.,  4., -2.,  0.])

When using `numpy.diff`, the result is shorter than the input by one element, because n values have only n - 1 pairs of successive values (here, 7 days give 6 differences).  
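The same differences can be written by hand with two shifted slices, which also shows where the missing element goes. A minimal sketch using the seven values printed above for patient 3 (`week` and `by_hand` are illustrative names):

```python
import numpy

week = numpy.array([0., 0., 2., 0., 4., 2., 2.])  # values printed above for patient 3

# Each difference is "next value minus current value".
# week[1:] drops the first value, week[:-1] drops the last one.
by_hand = week[1:] - week[:-1]

print(by_hand)
print(numpy.array_equal(by_hand, numpy.diff(week)))  # True
print(len(week), len(by_hand))                       # 7 6
```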
  
In a multidimensional array, `numpy.diff` takes an `axis` argument. Because each row holds the meaningful, longitudinal data of *one patient*, we take differences along the columns of each row with `axis=1`. With `axis=0` we would instead get the differences between successive patients on the same day, which is not meaningful here.

In [35]:
F = numpy.diff(data, axis=1)
print(F)

[[ 0.  1.  2. ...  1. -3.  0.]
 [ 1.  1. -1. ...  0. -1.  1.]
 [ 1.  0.  2. ...  0. -1.  0.]
 ...
 [ 1.  0.  0. ... -1.  0.  0.]
 [ 0.  0.  1. ... -2.  2. -2.]
 [ 0.  1. -1. ... -2.  0. -1.]]


In [36]:
print(F.shape)

(60, 39)


The number of columns is reduced by one, since each column is subtracted from the one after it. In the words of the solution: `there is one fewer difference than there are columns in the data`

Question: how to find the **largest** change in inflammation **for each** patient? Does it matter if the change in inflammation is an increase or decrease?  
  
Largest: `numpy.max`  
Change: `numpy.diff`  
Each patient: `axis=1`  
  
`numpy.diff` returns, for each row, the differences from one day to the next; because we want them per patient, the axis is 1. For our data this gives a 60 x 39 array of differences. `numpy.max` with `axis=1` then returns the largest value in each row, i.e. the largest increase for each patient.

In [40]:
h = numpy.max(numpy.diff(data, axis=1), axis=1)
print(h)
print(h.shape)

[ 7. 12. 11. 10. 11. 13. 10.  8. 10. 10.  7.  7. 13.  7. 10. 10.  8. 10.
  9. 10. 13.  7. 12.  9. 12. 11. 10. 10.  7. 10. 11. 10.  8. 11. 12. 10.
  9. 10. 13. 10.  7.  7. 10. 13. 12.  8.  8. 10. 10.  9.  8. 13. 10.  7.
 10.  8. 12. 10.  7. 12.]
(60,)
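On the second part of the question: `numpy.max` of the differences only finds the largest increase, because a large drop is a negative number. If the direction should not matter, one option is to take the absolute value first with `numpy.absolute`. A minimal sketch on made-up readings (the array `readings` and the other names are hypothetical, not the lesson data):

```python
import numpy

# Hypothetical readings: 2 patients x 4 days.
readings = numpy.array([[1., 5., 2., 3.],
                        [9., 2., 3., 4.]])

changes = numpy.diff(readings, axis=1)   # day-to-day changes, shape (2, 3)

largest_rise = numpy.max(changes, axis=1)                   # increases only
largest_swing = numpy.max(numpy.absolute(changes), axis=1)  # increase or decrease

print(largest_rise)    # values 4.0 and 1.0
print(largest_swing)   # values 4.0 and 7.0
```

For the second patient the two answers differ: the biggest rise is 1, but the biggest change in either direction is the drop of 7 on the first day.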
