## Loading data into Python

To begin processing the clinical trial inflammation data, we need to load it into Python. We can do that using a library called "NumPy".

In [4]:
import numpy

data = numpy.loadtxt(fname='./data/inflammation-01.csv', delimiter=',')
print(data)

[[0. 0. 1. ... 3. 0. 0.]
 [0. 1. 2. ... 1. 0. 1.]
 [0. 1. 1. ... 2. 1. 1.]
 ...
 [0. 1. 1. ... 1. 1. 1.]
 [0. 0. 0. ... 0. 2. 0.]
 [0. 0. 1. ... 1. 1. 0.]]


Here we call `numpy.loadtxt` with two arguments: the name of the file we want to read and the delimiter that separates values on a line. Now that the data are in memory, we can manipulate them. First, let's ask what type of thing `data` refers to:

In [5]:
print(type(data))

<class 'numpy.ndarray'>


The output tells us that `data` currently refers to an N-dimensional array. We can find out the type of the data contained in the NumPy array:

In [6]:
print(data.dtype)

float64


We can see the array's shape with the following command:

In [7]:
print(data.shape)

(60, 40)


That is, the `data` array contains 60 rows and 40 columns.
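The loading and inspection steps above can be tried without the data file. The following is a minimal sketch that assumes a small, made-up CSV held in memory (`csv_text` and `sample` are hypothetical names, not part of the inflammation dataset); it relies on `numpy.loadtxt` accepting a file-like object in place of a file name.

```python
import io
import numpy

# Hypothetical sample: 3 patients (rows) by 4 days (columns).
# It stands in for the real inflammation CSV file.
csv_text = "0,1,2,3\n0,2,4,1\n1,1,3,2\n"

# loadtxt reads from a file name or from any file-like object.
sample = numpy.loadtxt(io.StringIO(csv_text), delimiter=',')

print(type(sample))   # the container type (a NumPy array)
print(sample.dtype)   # the type of the values stored in it
print(sample.shape)   # (number of rows, number of columns)
```

Even though the text contains only whole numbers, the values are stored as floating-point numbers, which matches the `float64` reported for `data`.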

If we want to get a single number from the array, we must provide an index in square brackets after the variable name. Our inflammation data has two dimensions, so we will need to use two indices to refer to one specific value:

In [8]:
print('first value in data:', data[0, 0])

print('middle value in data:', data[29, 19])

first value in data: 0.0
middle value in data: 16.0


## Slicing data

An index like `[29, 19]` selects a single element of an array, but we can select whole sections as well. For example, we can select the first ten days (columns) of values for the first four patients (rows) like this:

In [9]:
print(data[0:4, 0:10])

[[0. 0. 1. 3. 1. 2. 4. 7. 8. 3.]
 [0. 1. 2. 1. 2. 1. 3. 2. 2. 6.]
 [0. 1. 1. 3. 3. 2. 6. 2. 5. 9.]
 [0. 0. 2. 0. 4. 2. 2. 1. 6. 7.]]


The slice `0:4` means, "Start at index 0 and go up to, but not including, index 4." We don't have to start slices at 0:

In [10]:
print(data[5:10, 0:10])

[[0. 0. 1. 2. 2. 4. 2. 1. 6. 4.]
 [0. 0. 2. 2. 4. 2. 2. 5. 5. 8.]
 [0. 0. 1. 2. 3. 1. 2. 3. 5. 3.]
 [0. 0. 0. 3. 1. 5. 6. 5. 5. 8.]
 [0. 1. 1. 2. 1. 3. 5. 3. 5. 8.]]


The rule is that the difference between the upper and lower bounds is the number of values in the slice. If we don't include the lower bound, Python uses 0 by default; if we don't include the upper, the slice runs to the end of the axis, and if we don't include either (i.e., if we use ':' on its own), the slice includes everything:

In [11]:
small = data[:3, 36:]
print('small is:')
print(small)

small is:
[[2. 3. 0. 0.]
 [1. 1. 0. 1.]
 [2. 2. 1. 1.]]
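To check the slicing rule on something small enough to count by hand, here is a sketch using a made-up 4x6 array (`grid` is a hypothetical name; its values 0 to 23 only make it easy to see which cells were picked).

```python
import numpy

# Hypothetical 4x6 array filled with 0..23, row by row.
grid = numpy.arange(24).reshape(4, 6)

block = grid[1:3, 2:5]       # rows 1 and 2, columns 2, 3 and 4
print(block)
print(block.shape)           # (3 - 1, 5 - 2): upper bound minus lower bound

print(grid[:2, 4:].shape)    # omitted bounds default to the start and end of the axis
print(grid[:, :].shape)      # ':' on its own keeps the whole axis
```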


## Analyzing data

If we want to find the average inflammation for all patients on all days, for example, we can ask NumPy to compute `data`'s mean value:

In [12]:
print(numpy.mean(data))

6.14875


Let's use three other NumPy functions to get some descriptive values about the dataset.

In [13]:
maxval, minval, stdval = numpy.amax(data), numpy.amin(data), numpy.std(data)

print('maximum inflammation:', maxval)
print('minimum inflammation:', minval)
print('standard deviation:', stdval)

maximum inflammation: 20.0
minimum inflammation: 0.0
standard deviation: 4.613833197118566


When analyzing data, though, we often want to look at variations in statistical values, such as the maximum inflammation per patient or the average inflammation per day. One way to do this is to create a new temporary array of the data we want, then ask it to do the calculation:

In [14]:
patient_0 = data[0, :]
# 0 on the first axis (rows), everything on the second (columns)
print('maximum inflammation for patient 0:', numpy.amax(patient_0))

#print(numpy.mean(patient_0))

maximum inflammation for patient 0: 18.0


We can combine the selection and the function call:

In [15]:
print('maximum inflammation for patient 2:', numpy.amax(data[2, :]))

maximum inflammation for patient 2: 19.0


What if we need the maximum inflammation for each patient over all days, or the average for each day? Functions such as `numpy.mean` accept an `axis` argument for this:

- With `axis=0`, the operation runs down the rows of each column, so the mean gives the average inflammation for each day (one value per column);
- With `axis=1`, the operation runs across the columns of each row, so the mean gives the average inflammation for each patient (one value per row).

We show both below:

In [16]:
print('The average inflammation per day is:')
print(numpy.mean(data, axis=0))

The average inflammation per day is:
[ 0.          0.45        1.11666667  1.75        2.43333333  3.15
  3.8         3.88333333  5.23333333  5.51666667  5.95        5.9
  8.35        7.73333333  8.36666667  9.5         9.58333333 10.63333333
 11.56666667 12.35       13.25       11.96666667 11.03333333 10.16666667
 10.          8.66666667  9.15        7.25        7.33333333  6.58333333
  6.06666667  5.95        5.11666667  3.6         3.3         3.56666667
  2.48333333  1.5         1.13333333  0.56666667]


In [17]:
print('The average inflammation per patient is:')
print(numpy.mean(data, axis=1))

The average inflammation per patient is:
[5.45  5.425 6.1   5.9   5.55  6.225 5.975 6.65  6.625 6.525 6.775 5.8
 6.225 5.75  5.225 6.3   6.55  5.7   5.85  6.55  5.775 5.825 6.175 6.1
 5.8   6.425 6.05  6.025 6.175 6.55  6.175 6.35  6.725 6.125 7.075 5.725
 5.925 6.15  6.075 5.75  5.975 5.725 6.3   5.9   6.75  5.925 7.225 6.15
 5.95  6.275 5.7   6.1   6.825 5.975 6.725 5.7   6.25  6.4   7.05  5.9  ]
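The effect of `axis` is easier to verify on a tiny array. The sketch below assumes a made-up dataset of 2 patients by 3 days (`toy`, `per_day` and `per_patient` are hypothetical names); the shape of each result shows which dimension was collapsed.

```python
import numpy

# Hypothetical data: 2 patients (rows) by 3 days (columns).
toy = numpy.array([[1.0, 2.0, 3.0],
                   [3.0, 4.0, 5.0]])

# axis=0 collapses the rows: one mean per column (per day).
per_day = numpy.mean(toy, axis=0)

# axis=1 collapses the columns: one mean per row (per patient).
per_patient = numpy.mean(toy, axis=1)

print(per_day, per_day.shape)
print(per_patient, per_patient.shape)
```

For the real `data` array of shape `(60, 40)`, the same reasoning gives 40 values for `axis=0` and 60 values for `axis=1`, as in the two outputs above.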


## Slicing strings

We can take slices of character strings as well:

In [18]:
element = 'oxygen'
print('first three characters:', element[0:3])
print('last three characters:', element[3:6])

first three characters: oxy
last three characters: gen


What is the value of `element[:4]`? What about `element[4:]`? Or `element[:]`?

Answer: `element[:4]` -- output: oxyg

Answer: `element[4:]` -- output: en

Answer: `element[:]`  -- output: oxygen

In [19]:
print(element[:4])
print(element[4:])
print(element[:])

oxyg
en
oxygen


What is `element[-1]`?

In [20]:
print(element[-1])
print(element[-2])

n
e


Given those answers, explain what `element[1:-1]` does.

Answer: It creates a substring that starts at index 1 and runs up to, but does not include, the final index, effectively removing the first and last letters from 'oxygen'.

In [21]:
print(element[1:-1])

xyge


In [22]:
element = 'oxygen'
print(element[-3:])
element = 'carpentry'
print(element[-3:])
element = 'clone'
print(element[-3:])
element = 'hi'
print(element[-3:])

gen
try
one
hi


In [23]:
#data[3:3, 4:4]
data[3:3, :]


array([], shape=(0, 40), dtype=float64)
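A slice whose lower and upper bounds are equal selects nothing, because the number of values is the difference between the bounds. The sketch below illustrates this for a string and for a made-up 5x4 array (`toy` is a hypothetical stand-in for `data`).

```python
import numpy

element = 'oxygen'
print(repr(element[3:3]))    # an empty string: 3 - 3 = 0 characters

# Hypothetical 5x4 array standing in for the inflammation data.
toy = numpy.zeros((5, 4))

print(toy[3:3, :].shape)     # no rows, but all 4 columns are kept
print(toy[3:3, 2:2].shape)   # no rows and no columns
print(toy[3:3, :].size)      # total number of elements selected
```

The empty result is still a two-dimensional array; only the length of the sliced axis drops to zero.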

## Stacking arrays

Arrays can be concatenated and stacked on top of one another, using NumPy's `vstack` and `hstack` functions for vertical and horizontal stacking, respectively.

In [24]:
A = numpy.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print('A = ')
print(A)

B = numpy.hstack([A, A])
print('B = ')
print(B)

C = numpy.vstack([A, A])
print('C = ')
print(C)


A = 
[[1 2 3]
 [4 5 6]
 [7 8 9]]
B = 
[[1 2 3 1 2 3]
 [4 5 6 4 5 6]
 [7 8 9 7 8 9]]
C = 
[[1 2 3]
 [4 5 6]
 [7 8 9]
 [1 2 3]
 [4 5 6]
 [7 8 9]]


Exercise: Write some additional code that slices the first and last columns of `A` and stacks them into a 3x2 array. Make sure to `print` the results to verify your solution.

Answer:

Note: Indexing with a single integer removes that dimension from the result. If we use `A[:, 0]` to select the first column, we get the first element of each row as a one-dimensional array, which prints as a single row. If we use the slice `A[:, :1]` instead, we still select only the first column, but the result stays two-dimensional. We can compare the two using `.shape`:

In [25]:
Incorrect_form = A[:, 0]
Correct_form = A[:, :1]
print(Incorrect_form.shape)
print(Correct_form.shape) 

(3,)
(3, 1)


So,

In [26]:
first_column = A[:, :1]
last_column = A[:, -1:]

D = numpy.hstack((first_column, last_column))
print('D = ')
print(D)

D = 
[[1 3]
 [4 6]
 [7 9]]
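The same 3x2 result can be reached in other ways. The sketch below is not part of the exercise answer; it compares the `hstack` approach with two alternatives that NumPy also provides: indexing with a list of column positions, and `numpy.column_stack`, which treats one-dimensional arrays as columns.

```python
import numpy

A = numpy.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Approach used above: keep both slices two-dimensional, then stack side by side.
D_stacked = numpy.hstack((A[:, :1], A[:, -1:]))

# Alternative 1: select columns 0 and -1 in one step with a list of indices.
D_indexed = A[:, [0, -1]]

# Alternative 2: column_stack turns each 1-D array into a column.
D_columns = numpy.column_stack((A[:, 0], A[:, -1]))

print(D_indexed)
print(numpy.array_equal(D_stacked, D_indexed))
print(numpy.array_equal(D_stacked, D_columns))
```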


## Change in inflammation

The patient data is longitudinal in the sense that each row represents a series of observations relating to one individual. This means that the change in inflammation over time is a meaningful concept. Let's find out how to calculate changes in the data contained in an array with NumPy.

The `numpy.diff()` function takes an array and returns the difference between each pair of successive values. Let's use it to examine the day-to-day changes across the first week for patient 3 from our inflammation dataset.

In [27]:
patient3_week1 = data[3, :7]
print(patient3_week1)

[0. 0. 2. 0. 4. 2. 2.]


Calling `numpy.diff(patient3_week1)` would do the following calculations: `[ 0 - 0, 2 - 0, 0 - 2, 4 - 0, 2 - 4, 2 - 2 ]` and return the 6 values in a new array.

In [28]:
numpy.diff(patient3_week1)

array([ 0.,  2., -2.,  4., -2.,  0.])

In [29]:
DIFF = numpy.diff(data, axis=1)

print(DIFF.shape)

(60, 39)
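Taking differences along `axis=1` compares each day with the previous one within a row, so the result has one column fewer than the input: 39 differences for 40 days. As a sketch of what can be done next, the example below uses a made-up array of 2 patients by 5 days (`toy`, `changes` and the other names are hypothetical) to find the largest change per patient, with and without regard to sign.

```python
import numpy

# Hypothetical data: 2 patients (rows) by 5 days (columns).
toy = numpy.array([[0., 0., 2., 0., 4.],
                   [5., 1., 2., 2., 3.]])

# Day-to-day change within each row; one column fewer than the input.
changes = numpy.diff(toy, axis=1)
print(changes)
print(changes.shape)

# Largest increase for each patient.
largest_increase = numpy.amax(changes, axis=1)

# Largest change in either direction: take absolute values first.
largest_change = numpy.amax(numpy.absolute(changes), axis=1)

print(largest_increase)
print(largest_change)
```

For the second made-up patient the two results differ, because the biggest change is a drop from 5 to 1, which `numpy.amax` alone ignores.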
