# Selecting Data Subsets

We often end up with more data than we can reasonably explore, so it is important to be able to select subsets of our data. We will explore three approaches, which are not completely independent:

1. Slicing
1. Filtering
1. Random selection

In [None]:
%matplotlib inline

In [None]:
import os
import glob
import numpy as np
import random
import matplotlib.pyplot as plt
import utils
import numpy.random
import pandas as pd

In [None]:
DATADIR = os.path.join(os.path.expanduser("~"),"DATA")
HRDIR = os.path.join(DATADIR,"Numerics", "mimic2", "hr", "subjects")
hr_files = os.listdir(HRDIR)

## Slicing

Plotting all of our nearly 4000 patients at once is not very meaningful. Instead we can use **slicing**, a way of grabbing chunks of data from a sequence such as a string or a list. The slicing notation is

* **SEQUENCE[start:stop:increment]**
    * start is inclusive
    * **stop is exclusive**
* start, stop and increment all have default values
    * start: 0
    * stop: length of the sequence
    * increment: 1
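
The defaults listed above can be checked directly. This is a minimal sketch; the `letters` list is a hypothetical stand-in used only for illustration.

```python
# Hypothetical example sequence, used only to illustrate the defaults.
letters = list("abcdefgh")

# Omitting start, stop, or increment falls back to the defaults.
assert letters[:3] == letters[0:3:1]
assert letters[3:] == letters[3:len(letters):1]
assert letters[::2] == letters[0:len(letters):2]

# stop is exclusive, so the slice [2:5] holds 5 - 2 = 3 elements.
middle = letters[2:5]
print(middle)  # ['c', 'd', 'e']
```

Because stop is exclusive, `len(seq[a:b])` equals `b - a` whenever both indices fall inside the sequence.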

In [None]:
demo_string = "Brian E. Chapman, Ph.D."
print(demo_string)

# print just the first name
print(demo_string[:5])

# print just the last name
print(demo_string[9:16])

# print every other character
print(demo_string[::2])

#### We can use slicing to select subsets of our patients

In [None]:
print("Selecting first 5 files\n", hr_files[:5])
# Python allows us to index from the end of the sequence with a negative number
print("Selecting last 10 files\n", hr_files[-10:])
print("Selecting every other file between 20 and 40\n", hr_files[20:40:2])

## Filtering Data

In [None]:
# plot every patient's heart rate series on a single figure
for f in hr_files:
    plt.plot(utils.get_data(os.path.join(HRDIR,f)))

Looking at this initial plot, we can see many spikes of non-physiologic zero values for the heart rate as well as some suspiciously high values. While it is conceivable that someone might momentarily have a heart rate of zero, these look more like data entry errors, and we would probably like to drop them. We can do this easily with a list comprehension, which lets us filter the values we keep with `if` conditions.

In [None]:
long_data = utils.get_data(os.path.join(HRDIR,'21280.txt'))

plt.plot(long_data)
print(np.max(long_data))

#### Using list comprehensions to drop zero values

In [None]:
plt.plot([d for d in long_data if d > 0])

#### We can also use a list comprehension to keep only a range of values

In [None]:
plt.plot([d for d in long_data if 0 < d < 160])
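
If the data are held in a NumPy array, the same range filter can also be written with a boolean mask instead of a list comprehension. This is a sketch of that alternative technique, not what the cell above does; the `hr` array is made-up data standing in for `long_data`.

```python
import numpy as np

# Hypothetical heart-rate samples standing in for long_data.
hr = np.array([0.0, 72.0, 75.0, 0.0, 210.0, 80.0, 78.0])

# Each comparison yields a boolean array; & combines them element-wise.
mask = (hr > 0) & (hr < 160)

# Indexing with the mask keeps only the elements where it is True.
filtered = hr[mask]
print(filtered)

# The list comprehension selects the same values.
assert filtered.tolist() == [d for d in hr if 0 < d < 160]
```

The parentheses around each comparison are required because `&` binds more tightly than `<` and `>` in Python.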

#### We can also use the data itself to define bounds

* What would be a reasonable choice for b in our filtering below?

In [None]:
m = long_data.mean()
std = long_data.std()
b = 1.0  # number of standard deviations to keep on either side of the mean
plt.plot([d for d in long_data if m - b*std < d < m + b*std])
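
One way to explore the question about `b` is to count how much data survives for several candidate values. The sketch below uses synthetic data (an assumption, not the MIMIC heart rates): 1000 readings drawn around 80 plus 50 zero-valued dropouts.

```python
import numpy as np

# Synthetic data (an assumption): plausible readings plus zero-valued dropouts.
rng = np.random.RandomState(0)
hr = np.concatenate([rng.normal(80, 10, size=1000), np.zeros(50)])

m = hr.mean()
std = hr.std()

# Record the fraction of samples kept for each candidate b.
kept_fraction = {}
for b in (1.0, 2.0, 3.0):
    kept = [d for d in hr if m - b * std < d < m + b * std]
    kept_fraction[b] = len(kept) / len(hr)
    print(b, round(kept_fraction[b], 3))
```

Note that the mean and standard deviation are computed from the unfiltered data, so the outliers we want to remove also shift the bounds. A larger `b` never keeps fewer values than a smaller one.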

## Randomly Selecting Data

Python comes with the [``random``](https://docs.python.org/3/library/random.html) module, which contains functions for generating pseudo-random numbers, shuffling data, and randomly choosing a value from a collection. We can use the ``random.choice()`` function to choose one file, or use the ``random.shuffle()`` function to randomly shuffle our data and then use slicing to select some number of files.

In [None]:
for i in range(5):
    print(random.choice(hr_files))

In [None]:
hr_files_copy = hr_files[:]
random.shuffle(hr_files_copy) # modifies the list in place
print(hr_files_copy[:5])
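
The standard library also offers ``random.sample``, which combines the copy, shuffle, and slice steps above into one call. A minimal sketch, using hypothetical file names in place of `hr_files`:

```python
import random

# Hypothetical file names standing in for hr_files.
files = ["{}.txt".format(i) for i in range(100, 120)]

# random.sample returns a new list of k unique elements
# and leaves the original list unchanged.
picked = random.sample(files, 5)
print(picked)
```

Unlike repeated calls to ``random.choice``, this never selects the same position twice.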

#### NumPy also provides functionality for randomly selecting values from an array
* [``numpy.random.choice``](http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html)
* [``numpy.random.shuffle``](http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.shuffle.html#numpy.random.shuffle)
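
A short sketch of the two NumPy functions listed above, again using hypothetical file names rather than the real `hr_files`:

```python
import numpy as np

# Hypothetical file names standing in for hr_files.
files = np.array(["{}.txt".format(i) for i in range(100, 120)])

# replace=False draws without replacement, so no file is picked twice.
# (The default is replace=True, which allows repeats.)
subset = np.random.choice(files, size=5, replace=False)
print(subset)

# np.random.shuffle reorders the array in place, like random.shuffle,
# so we shuffle a copy to keep the original order available.
shuffled = files.copy()
np.random.shuffle(shuffled)
print(shuffled[:5])
```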

## Setting the Random Seed

If I am testing some code, I will have difficulty evaluating it if my data are different on every run. To address this, in development situations I can set the seed with the [``random.seed()``](https://docs.python.org/3/library/random.html) function so that I always pick the same values.

In [None]:
random.seed(0)
for i in range(5):
    print(random.choice(hr_files))
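
To see the effect of the seed, we can reset it and repeat the same draws. This is a minimal sketch with hypothetical file names in place of `hr_files`:

```python
import random

# Hypothetical file names standing in for hr_files.
files = ["{}.txt".format(i) for i in range(100, 120)]

random.seed(0)
first_run = [random.choice(files) for _ in range(5)]

# Resetting the seed restarts the same pseudo-random sequence.
random.seed(0)
second_run = [random.choice(files) for _ in range(5)]

print(first_run == second_run)  # True
```

``random.seed()`` only affects the standard library generator; the legacy NumPy functions such as ``numpy.random.choice`` are seeded separately with ``numpy.random.seed()``.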

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">University of Utah Data Science for Health</span> by <span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">Brian E. Chapman</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.