# Worksheet 4 - Data analysis task 1

- This worksheet should be used in conjunction with the Intro to Python course notes [here](https://uniexeterrse.github.io/intro-to-python/). 
- All information contained in this worksheet can be found in the course notes. 
- This worksheet highlights tasks that can be completed during the sessions. 

## 1. Scenario: A miracle arthritis inflammation cure

Our imaginary colleague “Dr. Maverick” has invented a new miracle drug that promises to cure arthritis inflammation flare-ups within only 3 weeks of first taking the medication! Naturally, we wish to see the clinical trial data, and after months of asking, they have finally provided us with a CSV file containing it.

The CSV file contains the number of inflammation flare-ups per day for the 60 patients in the initial clinical trial, with the trial lasting 40 days. Each row corresponds to a patient, and each column corresponds to a day in the trial. Once a patient has their first inflammation flare-up, they take the medication and wait a few weeks for it to take effect and reduce flare-ups.

To see how effective the treatment is we would like to:

- Calculate the average inflammation per day across all patients.
- Plot the result to discuss and share with colleagues.

## 2. Set up project

* First we need to create a new project folder. You can do this in your normal file browser, such as Finder or Windows Explorer. Inside the project folder, create a folder called `data` (lowercase, to match the file path used when loading the data later in this worksheet).
* Ensure the project folder contains the Jupyter Notebook Worksheets.
* Download the data that we will use for this data analysis task from [here](https://swcarpentry.github.io/python-novice-inflammation/data/python-novice-inflammation-data.zip).
* Unzip the downloaded archive, and move the `.csv` files into the `data` folder that we have just created.
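
Before loading anything, it can help to confirm that the `.csv` files ended up where the notebook expects them. The following is only a minimal sketch: the helper name `find_csv_files` is hypothetical (it is not part of the course notes), and it assumes you pass it the path of the data folder you created above.

```python
from pathlib import Path


def find_csv_files(data_dir):
    """Return the sorted names of the .csv files inside data_dir.

    Hypothetical helper for illustration only.
    """
    folder = Path(data_dir)
    if not folder.is_dir():
        # Fail early with a clear message if the folder is missing or misnamed.
        raise FileNotFoundError(f"No such folder: {folder}")
    return sorted(path.name for path in folder.glob("*.csv"))
```

Calling `find_csv_files('data')` from a notebook in the project folder should list the inflammation files. A `FileNotFoundError` suggests the folder name or location does not match.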


## 3. Load data with NumPy

In [6]:
import numpy as np

In [7]:
filepath = 'data/inflammation-01.csv'
data = np.loadtxt(fname=filepath, delimiter=',')

In [8]:
print(type(data))

<class 'numpy.ndarray'>


In [9]:
print(data.dtype)

float64


In [10]:
print(data.shape)

(60, 40)
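
To experiment with `np.loadtxt` without touching the trial data, one option is to read a few lines of CSV text held in memory. This is a sketch on made-up numbers (the values in `csv_text` are illustrative, not taken from the trial), with `io.StringIO` standing in for a file on disk.

```python
import io

import numpy as np

# Three made-up "patients" (rows) observed over four "days" (columns).
csv_text = "0,1,2,3\n0,2,4,6\n0,0,1,1\n"

# np.loadtxt accepts a file-like object as well as a file path.
sample = np.loadtxt(io.StringIO(csv_text), delimiter=",")

print(type(sample))   # the array type
print(sample.dtype)   # loadtxt produces floats unless told otherwise
print(sample.shape)   # (number of rows, number of columns)
```

As with the real file, the shape is reported as rows first, then columns.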


## 4. Selecting and slicing data

In [11]:
print('first value in data:', data[0, 0])

first value in data: 0.0


In [12]:
print('middle value in data:', data[30, 20])

middle value in data: 13.0


print(data[0:4, 0:10])

In [13]:
print(data[5:10, 0:10])

[[0. 0. 1. 2. 2. 4. 2. 1. 6. 4.]
 [0. 0. 2. 2. 4. 2. 2. 5. 5. 8.]
 [0. 0. 1. 2. 3. 1. 2. 3. 5. 3.]
 [0. 0. 0. 3. 1. 5. 6. 5. 5. 8.]
 [0. 1. 1. 2. 1. 3. 5. 3. 5. 8.]]


In [14]:
small = data[:3, 36:]
print('small is:')
print(small)

small is:
[[2. 3. 0. 0.]
 [1. 1. 0. 1.]
 [2. 2. 1. 1.]]
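
The slicing rules used above are easier to see on an array small enough to check by eye. The sketch below uses a made-up 3 x 4 array called `grid` (not the trial data) to show that the end of a slice is excluded and that an omitted bound runs to the edge of the array.

```python
import numpy as np

# A small array whose values make positions easy to spot:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
grid = np.arange(12).reshape(3, 4)

top_left = grid[0:2, 0:2]   # rows 0 and 1, columns 0 and 1 (index 2 is excluded)
last_cols = grid[:, 2:]     # every row, from column 2 to the end
first_row = grid[0, :]      # row 0, every column

print(top_left)
print(last_cols)
print(first_row)
```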


## 5. Analyzing data

In [15]:
print(np.mean(data))

6.14875


In [16]:
maxval, minval, stdval = np.max(data), np.min(data), np.std(data)

print('maximum inflammation:', maxval)
print('minimum inflammation:', minval)
print('standard deviation:', stdval)

maximum inflammation: 20.0
minimum inflammation: 0.0
standard deviation: 4.613833197118566
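
For readers unsure what `np.std` computes, the sketch below reproduces it by hand on a short list of made-up values. It relies on the fact that `np.std` uses the population formula by default (`ddof=0`): the square root of the mean squared deviation from the mean.

```python
import numpy as np

# Made-up values chosen so that the arithmetic is easy to follow.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = np.mean(values)
# Population standard deviation: root of the mean squared deviation.
std_by_hand = np.sqrt(np.mean((values - mean) ** 2))

print(std_by_hand)
print(np.std(values))  # matches the hand calculation
```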


We often want to select a whole row or a whole column. We can assign the selection to an intermediate variable first, or pass it directly to the function:

In [17]:
patient_0 = data[0, :] # 0 on the first axis (rows), everything on the second (columns)
print('maximum inflammation for patient 0:', np.max(patient_0))

maximum inflammation for patient 0: 18.0


In [18]:
print('maximum inflammation for patient 2:', np.max(data[2, :]))

maximum inflammation for patient 2: 19.0


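Rather than selecting each row or column in turn, we can tell the function which axis to work along. With `axis=0` the rows (patients) are averaged together, giving one value per day. With `axis=1` the columns (days) are averaged together, giving one value per patient: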

In [19]:
print(np.mean(data, axis=0))

[ 0.          0.45        1.11666667  1.75        2.43333333  3.15
  3.8         3.88333333  5.23333333  5.51666667  5.95        5.9
  8.35        7.73333333  8.36666667  9.5         9.58333333 10.63333333
 11.56666667 12.35       13.25       11.96666667 11.03333333 10.16666667
 10.          8.66666667  9.15        7.25        7.33333333  6.58333333
  6.06666667  5.95        5.11666667  3.6         3.3         3.56666667
  2.48333333  1.5         1.13333333  0.56666667]


In [20]:
print(np.mean(data, axis=0).shape)

(40,)


In [21]:
print(np.mean(data, axis=1))

[5.45  5.425 6.1   5.9   5.55  6.225 5.975 6.65  6.625 6.525 6.775 5.8
 6.225 5.75  5.225 6.3   6.55  5.7   5.85  6.55  5.775 5.825 6.175 6.1
 5.8   6.425 6.05  6.025 6.175 6.55  6.175 6.35  6.725 6.125 7.075 5.725
 5.925 6.15  6.075 5.75  5.975 5.725 6.3   5.9   6.75  5.925 7.225 6.15
 5.95  6.275 5.7   6.1   6.825 5.975 6.725 5.7   6.25  6.4   7.05  5.9  ]
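
The effect of the `axis` argument can be checked by hand on a tiny array. The sketch below uses a made-up 2 x 3 array called `readings` (two "patients", three "days"), not the trial data.

```python
import numpy as np

# Two made-up patients (rows) measured on three days (columns).
readings = np.array([[0.0, 2.0, 4.0],
                     [2.0, 4.0, 6.0]])

per_day = np.mean(readings, axis=0)      # averages down the rows: one value per day
per_patient = np.mean(readings, axis=1)  # averages across the columns: one value per patient

print(per_day, per_day.shape)
print(per_patient, per_patient.shape)
```

The axis you name is the one that disappears from the shape of the result.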


## 6. Change in inflammation

In [22]:
patient3_week1 = data[3, :7]
print(patient3_week1)

[0. 0. 2. 0. 4. 2. 2.]


Calling `np.diff(patient3_week1)` subtracts each value from the one that follows it, so it would do the following calculations:

`[ 0 - 0, 2 - 0, 0 - 2, 4 - 0, 2 - 4, 2 - 2 ]`

In [23]:
np.diff(patient3_week1)

array([ 0.,  2., -2.,  4., -2.,  0.])
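
One way to convince yourself of what `np.diff` does is to reproduce it with slicing. The sketch below retypes the week of values printed above into an array called `week` (so it runs without the data file) and compares the two approaches.

```python
import numpy as np

# The seven values printed for patient3_week1 above, typed in by hand.
week = np.array([0.0, 0.0, 2.0, 0.0, 4.0, 2.0, 2.0])

# Each element minus the one before it: days 1..6 minus days 0..5.
by_hand = week[1:] - week[:-1]
with_diff = np.diff(week)

print(by_hand)
print(np.array_equal(by_hand, with_diff))
```

Seven readings give six differences, because each difference needs a pair of neighbouring days.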

## 7. Questions

**Q**: When calling `np.diff` with a multi-dimensional array, an `axis` argument may be passed to the function to specify which axis to process. When applying `np.diff` to our 2D inflammation array `data`, which axis would we specify?

In [24]:
# Answer here

np.diff(data, axis=???)

SyntaxError: invalid syntax (2961532995.py, line 3)

**Q**: If the shape of an individual data file is (60, 40) (60 rows and 40 columns), what would the shape of the array be after you run the `np.diff()` function, and why?

**A**: The shape will be (XXX, XXX) because there is one fewer difference between columns than there are columns in the data.

**Q**: How would you find the largest change in inflammation for each patient? Does it matter if the change in inflammation is an increase or a decrease?

**A**: By using the `np.max()` function after you apply the `np.diff()` function, you will get the largest difference between days.

In [25]:
np.max(np.diff(data, axis=1), axis=1)

array([ 7., 12., 11., 10., 11., 13., 10.,  8., 10., 10.,  7.,  7., 13.,
        7., 10., 10.,  8., 10.,  9., 10., 13.,  7., 12.,  9., 12., 11.,
       10., 10.,  7., 10., 11., 10.,  8., 11., 12., 10.,  9., 10., 13.,
       10.,  7.,  7., 10., 13., 12.,  8.,  8., 10., 10.,  9.,  8., 13.,
       10.,  7., 10.,  8., 12., 10.,  7., 12.])

If inflammation values decrease along an axis, then the difference from one element to the next will be negative. If you are interested in the magnitude of the change and not the direction, the `np.absolute()` function will provide that.

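Compare the result below with the previous one: taking the absolute value first means that large decreases are counted as well as large increases, so the value for some patients is larger.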

In [26]:
np.max(np.absolute(np.diff(data, axis=1)), axis=1)

array([12., 14., 11., 13., 11., 13., 10., 12., 10., 10., 10., 12., 13.,
       10., 11., 10., 12., 13.,  9., 10., 13.,  9., 12.,  9., 12., 11.,
       10., 13.,  9., 13., 11., 11.,  8., 11., 12., 13.,  9., 10., 13.,
       11., 11., 13., 11., 13., 13., 10.,  9., 10., 10.,  9.,  9., 13.,
       10.,  9., 10., 11., 13., 10., 10., 12.])
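
The difference between the two results is easier to follow on a small made-up example. In the sketch below (the array `readings` is illustrative, not trial data), the first patient's biggest swing is a drop, so it only shows up once the sign is discarded.

```python
import numpy as np

# Two made-up patients (rows) over four days (columns).
readings = np.array([[1.0, 3.0, 0.0, 1.0],
                     [0.0, 1.0, 3.0, 4.0]])

# Day-to-day changes for each patient:
# patient 0: [ 2, -3, 1]   patient 1: [1, 2, 1]
changes = np.diff(readings, axis=1)

largest_rise = np.max(changes, axis=1)                  # ignores the drop of 3
largest_change = np.max(np.absolute(changes), axis=1)   # counts rises and drops alike

print(largest_rise)
print(largest_change)
```

Which of the two you want depends on the question: the first answers "how fast did inflammation rise?", the second "how big was the largest swing in either direction?".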