<html>
<table width="100%" cellspacing="2" cellpadding="2" border="1">
<tbody>
<tr>
<td valign="center" align="center" width="25%"><img src="../../../media/decartes.jpg"
alt="DeCART Icon" width="128" height="171"><br>
</td>
<td valign="center" align="center" width="75%">
<h1 align="center"><font size="+1">DeCART Summer School<br>
for<br>
Biomedical Data Science</font></h1></td>
<td valign="center" align="center" width="25%"><img
src="../../../media/U_Health_stacked_png_red.png" alt="Utah Health
Logo" width="128" height="134"><br>
</td>
</tr>
</tbody>
</table>
<br>
</html>

# Finding things

One of the recurring challenges we face is finding the files with content that we are interested in. In the simplest case, this search would be based on the filename. More complicated searches would be based on the content. Once we can find content in files, then we can extract the relevant data from the files.

In this notebook, we will explore the following Unix tools:

* [``find``](https://en.wikibooks.org/wiki/Guide_to_Unix/Commands/Finding_Files)
* [``grep``](https://en.wikibooks.org/wiki/Grep)

### Some highlights
* ``find``: find files with names matching some criteria
* ``grep``: look for patterns within files

I frequently use ``grep`` and ``find`` together to find where I've used particular code snippets.
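
As a minimal sketch of how the two tools can be combined (both are introduced below), the following builds a throwaway directory, lets ``find`` pick out the ``.py`` files, and hands them to ``grep -l`` to list the ones that contain a snippet. The file names, the snippet, and the use of ``-exec ... {} +`` are assumptions made for illustration; they are not part of the course data.

```bash
# Hypothetical sandbox so the example does not depend on your own files
sandbox=$(mktemp -d)
mkdir -p "$sandbox/project/src"
printf 'import pandas as pd\n' > "$sandbox/project/src/load.py"
printf 'print("hello")\n'      > "$sandbox/project/src/hello.py"
printf 'import pandas\n'       > "$sandbox/project/notes.txt"

# find selects the .py files; grep -l lists those that contain the snippet
matches=$(find "$sandbox" -name '*.py' -exec grep -l 'import pandas' {} +)
echo "$matches"

rm -rf "$sandbox"
```

Only ``load.py`` is listed: ``hello.py`` lacks the snippet, and ``notes.txt`` is never handed to ``grep`` because its name does not match ``*.py``.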


### Using Python for quizzes 

We're going to import some Python code here that we will use later to evaluate quizzes.

In [None]:
from finding_quizzes import *

## ``find``

``find``, as the name implies, finds files. The syntax for ``find`` is as follows:

```bash
find path_to_start_search [optional expression(s)]
```

#### Note: In newer versions of Unix/Linux the results are printed by default. In older versions, you need to provide a `-print` option.

If I do not provide any expressions, all files under the specified path will be found.

In [None]:
%%bash

find . 

### ``-name``

Finding all the files is probably not too helpful in general. Finding files with a certain name would be more useful.

#### Example: Finding all files named ``python`` in the file system

In [None]:
%%bash

find / -name 'python'

#### Why are we getting these ``Permission denied`` errors?

In [None]:
%%bash
find ~ -name 'README'
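
The errors appear because ``find`` tries to descend into directories that our account is not allowed to read. ``find`` writes these messages to standard error, so one common way to hide them is to redirect standard error to ``/dev/null``. The sketch below uses a hypothetical sandbox with one deliberately unreadable directory; the directory names are assumptions for illustration.

```bash
# Hypothetical sandbox with one readable and one unreadable directory
sandbox=$(mktemp -d)
mkdir "$sandbox/open" "$sandbox/locked"
touch "$sandbox/open/README"
chmod 000 "$sandbox/locked"     # nobody may read this directory (root ignores this)

# 2>/dev/null discards the error messages; the matches on standard output are kept
found=$(find "$sandbox" -name 'README' 2>/dev/null)
echo "$found"

chmod 755 "$sandbox/locked"     # restore permissions so cleanup succeeds
rm -rf "$sandbox"
```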

### ``-iname``

Linux/Unix is case sensitive, so if I want to find files matching a name, ignoring case, I use ``-iname`` rather than ``-name``.

In [None]:
%%bash
find / -iname 'README'

## [wildcards](https://en.wikibooks.org/wiki/A_Quick_Introduction_to_Unix/Wildcards)

Finding files becomes more powerful when we search for names matching a pattern. Linux provides **wildcards** that match variable characters.

### ``*`` 

The asterisk character stands for zero or more characters (any characters).

The filenames of these Jupyter notebooks have ``.ipynb`` as a suffix, so if I want to find all the Jupyter notebooks in a directory tree, I would do the following:

In [None]:
%%bash
find ~/work/decart_boot_camp_1_2018 -name '*.ipynb'
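
Note that the pattern is quoted. Without the quotes, the shell may expand ``*.ipynb`` against the current directory before ``find`` ever sees it. Here is a small sketch on a hypothetical sandbox (the file names are made up) showing that the quoted pattern matches notebooks at any depth:

```bash
# Hypothetical sandbox with notebooks at two levels and one unrelated file
sandbox=$(mktemp -d)
mkdir "$sandbox/week1"
touch "$sandbox/intro.ipynb" "$sandbox/week1/finding.ipynb" "$sandbox/week1/notes.txt"

# The quotes pass the literal pattern *.ipynb to find
notebooks=$(find "$sandbox" -name '*.ipynb' | sort)
echo "$notebooks"

# Count the matches (tr strips the padding some versions of wc add)
count=$(echo "$notebooks" | wc -l | tr -d ' ')
echo "$count"

rm -rf "$sandbox"
```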


### ``?`` 

The question mark character stands for exactly one character (any character).
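
As a quick sketch of the difference from ``*``, the following uses a hypothetical sandbox (the file names are assumptions) where only the names with a single character after ``a`` should match:

```bash
# Hypothetical sandbox: a1 and a2 have one character after "a", a10 has two
sandbox=$(mktemp -d)
touch "$sandbox/a1.csv" "$sandbox/a2.csv" "$sandbox/a10.csv" "$sandbox/b1.csv"

# ? matches exactly one character, so a10.csv and b1.csv are skipped
one_char=$(find "$sandbox" -name 'a?.csv' | sort)
echo "$one_char"

n_one_char=$(echo "$one_char" | wc -l | tr -d ' ')

rm -rf "$sandbox"
```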

## Exercise

In the ``~/DATA`` directory there are [WAVE files](https://en.wikipedia.org/wiki/WAV) (.wav) that are recordings of various heart anomalies. There are two types of mitral valve problems: mitral stenosis (ms) and mitral regurgitation (mr). Use ``find`` with wildcards to find all the "ms" and "mr" WAVE files. Make sure that you do not match other files, for example pulmonary stenosis (ps). How many mitral WAVE files are there? How many WAVE files are there in total?

In [None]:
%%bash
find ~/DATA -name 'm?.wav'

In [None]:
mitral_files(1)

In [None]:
wave_files()

### Multiple search conditions

#### AND

We can just concatenate multiple conditions to form logical AND. 

So if we want to find only directories named "Python", we can combine the ``-type d`` test (which matches only directories) with a ``-name`` or ``-iname`` condition:

In [None]:
%%bash
find / -iname "python" -type d
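
To see the AND at work without scanning the whole file system, here is a minimal sketch on a hypothetical sandbox that contains both a file and a directory whose names match. It uses ``-type d``, the ``find`` test that restricts matches to directories; the sandbox layout is an assumption for illustration.

```bash
# Hypothetical sandbox: one directory named Python and one regular file named python
sandbox=$(mktemp -d)
mkdir -p "$sandbox/tools/Python"
touch "$sandbox/python"

# Name condition alone: matches the file and the directory
both=$(find "$sandbox" -iname 'python' | wc -l | tr -d ' ')

# Name condition AND -type d: only the directory remains
dirs_only=$(find "$sandbox" -iname 'python' -type d)

echo "$both"
echo "$dirs_only"

rm -rf "$sandbox"
```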

## Exercise 

Use the find command to identify all the ``.txt`` files in ``~/DATA`` that are larger than 10 kBytes.

In [None]:
%%bash


In [None]:
find_10k_files()

#### OR

Find files matching name1 OR name2.

Doing an OR is more complicated.

* We separate our conditions with a ``-o`` flag
* The OR expression should be wrapped in parentheses () so that it is evaluated as a unit when combined with other conditions (AND binds more tightly than ``-o``)
* Because parentheses are special in the shell, they need to be **escaped**: ``\(`` and ``\)``



In [None]:
%%bash
find ~ \( -name '*.db' -o -name '*.sqlite' \)
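
The parentheses start to matter once the OR is combined with another condition. The sketch below, on a hypothetical sandbox (all names are assumptions), compares a grouped and an ungrouped version of the same search restricted to regular files with ``-type f``:

```bash
# Hypothetical sandbox: two data files, one text file, and a directory ending in .db
sandbox=$(mktemp -d)
mkdir "$sandbox/old.db"
touch "$sandbox/patients.db" "$sandbox/cache.sqlite" "$sandbox/notes.txt"

# Grouped: (name is *.db OR name is *.sqlite) AND it is a regular file
grouped=$(find "$sandbox" \( -name '*.db' -o -name '*.sqlite' \) -type f | sort)

# Ungrouped: read as  *.db  OR  (*.sqlite AND regular file),
# so the directory old.db is matched as well
ungrouped=$(find "$sandbox" -name '*.db' -o -name '*.sqlite' -type f | sort)

n_grouped=$(echo "$grouped" | wc -l | tr -d ' ')
n_ungrouped=$(echo "$ungrouped" | wc -l | tr -d ' ')
echo "$n_grouped $n_ungrouped"

rm -rf "$sandbox"
```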

## [grep](https://en.wikibooks.org/wiki/Grep)

grep is a Unix/Linux program that is very useful for identifying content within files. There are a lot of [options](http://linuxcommand.org/man_pages/grep1.html) with grep, most of which are beyond the scope of this class. In this section, we will review some of the basic yet very useful functionality of grep.

At its simplest, ``grep`` needs two things:

* A pattern we want to find (e.g. "chapel")
* A file or list of files within which to search for the pattern

In ``~/DATA/Misc/``, there is a file ``obits.txt``. We'll use grep to look for the word "chapel" in the file.

In [None]:
%%bash
grep chapel ~/DATA/Misc/obits.txt

#### We can use the ``-i`` flag to ignore case in both our pattern and the text being searched


In [None]:
%%bash
grep -i chapel ~/DATA/Misc/obits.txt

#### We can use the ``-c`` flag to just count the number of matching lines in a file

In [None]:
%%bash
grep -i -c chapel ~/DATA/Misc/obits.txt
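
Keep in mind that ``-c`` counts lines that contain a match, not individual occurrences. The following sketch on a hypothetical three-line file shows the difference; it assumes a ``grep`` that supports ``-o`` (print each match on its own line), which GNU and BSD ``grep`` both do.

```bash
# Hypothetical sample: line 1 contains the word twice, line 3 once
sample=$(mktemp)
printf 'Services at the chapel. The Chapel opens at noon.\nNo match here.\nchapel\n' > "$sample"

# -c counts matching lines
lines=$(grep -i -c chapel "$sample")

# -o prints every match on its own line, so wc -l counts occurrences
occurrences=$(grep -i -o chapel "$sample" | wc -l | tr -d ' ')

echo "$lines matching lines, $occurrences occurrences"

rm -f "$sample"
```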

#### We can use the ``-r`` flag to search all the files under a directory

In [None]:
%%bash
grep -i -c -r chapel ~/DATA/Misc

#### We use the ``-l`` flag to just list the files where matches were found

In [None]:
%%bash
grep -i -l -r chapel ~/DATA/Misc
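
A small sketch of how ``-c`` and ``-l`` differ when combined with ``-r``, using a hypothetical sandbox with one matching and one non-matching file (the names and contents are assumptions):

```bash
# Hypothetical sandbox: a.txt contains the word, sub/b.txt does not
sandbox=$(mktemp -d)
mkdir "$sandbox/sub"
printf 'Chapel service\n' > "$sandbox/a.txt"
printf 'nothing\n'        > "$sandbox/sub/b.txt"

# -c -r reports every file with its count, including files with zero matches
counts=$(grep -i -c -r chapel "$sandbox" | sort)
echo "$counts"

# -l -r lists only the files that contain at least one match
names=$(grep -i -l -r chapel "$sandbox")
echo "$names"

n_counts=$(echo "$counts" | wc -l | tr -d ' ')

rm -rf "$sandbox"
```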

#### We can provide a list of files 

In [None]:
%%bash
grep -i  chapel ~/DATA/Misc/obits.txt \
~/DATA/Misc/icd9-short.txt

In [None]:
%%bash
grep -i  chapel ~/DATA/Misc/*.txt

#### We can search for regular expressions

Regular expressions can be quite complicated and there are a variety of flavors of regular expressions.

The period (``.``) matches any single character. So if we want to find all the patients in their seventies, we could use a grep expression like the following:

In [None]:
%%bash
grep '7.-year-old' ~/DATA/Misc/*.txt
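
Because ``.`` matches any character, not just digits, a stricter alternative is a bracket expression such as ``[0-9]``. The sketch below compares the two on a hypothetical sample file; the sample lines are invented for illustration.

```bash
# Hypothetical sample: two real ages in the seventies, one typo, one age of 47
sample=$(mktemp)
printf 'A 72-year-old man\nA 7x-year-old typo\nA 47-year-old woman\nA 79-year-old woman\n' > "$sample"

# . accepts any character after the 7, so the typo line matches too
dot=$(grep -c '7.-year-old' "$sample")

# [0-9] accepts only a digit after the 7
digit=$(grep -c '7[0-9]-year-old' "$sample")

echo "$dot lines with '.', $digit lines with '[0-9]'"

rm -f "$sample"
```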

## Exercise
In the ``~/DATA/Numerics/mimic2`` directory are a series of directories with numeric physiological data measured from patients in an ICU: ``hr`` (heart rate), ``bp`` (blood pressure), ``uo`` (urine output), ``wbc`` (white blood cell count). Under each of these directories is another directory named ``subjects`` which contains text files containing the measured values. 

The naming convention for the filenames is the patient ID (e.g. ``9936``) followed by ``.txt``. Thus ``~/DATA/Numerics/mimic2/hr/subjects/9936.txt`` contains the heart rate measurements for patient \#9936 while ``~/DATA/Numerics/mimic2/bp/subjects/9936.txt`` contains the blood pressure measurements for the same patient. 

1. Use ``grep`` and the redirect operator (``>``) to create a file with all the patients that have a heart rate measurement greater than or equal to 200. 
1. Repeat this to create a file with all the patients that have a blood pressure measurement greater than or equal to 200.
1. Use the [``grep -Fx -f``](https://en.wikibooks.org/wiki/Grep) command to create a new file that contains the patients that are in both the heart rate greater than or equal to 200 file and the blood pressure greater than or equal to 200 file.
1. Finally, use the [``wc``](https://goo.gl/yhV8AJ) (word count) command with the ``-l`` option to count the number of patients matching both conditions.
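
To illustrate how ``grep -Fx -f`` behaves before applying it to the real data, here is a minimal sketch on two hypothetical ID lists (the file names and IDs are made up). ``-f`` reads the patterns from a file, ``-F`` treats them as fixed strings rather than regular expressions, and ``-x`` requires a pattern to match the whole line.

```bash
# Hypothetical ID lists, one ID per line
work=$(mktemp -d)
printf '101\n205\n307\n'       > "$work/hr_ids.txt"
printf '205\n307\n999\n1205\n' > "$work/bp_ids.txt"

# Keep the lines of bp_ids.txt that are exactly equal to a line of hr_ids.txt.
# Without -x, the ID 1205 would also match because it contains 205.
grep -Fx -f "$work/hr_ids.txt" "$work/bp_ids.txt" > "$work/both.txt"
cat "$work/both.txt"

# Count the IDs present in both lists
n_both=$(wc -l < "$work/both.txt" | tr -d ' ')
echo "$n_both"

rm -rf "$work"
```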

In [None]:
bp_hr_ge200()