All notes are based on notes produced by Mike White for WashU Bio 5075 Fundamentals of Biostatistics, a first year DBBS graduate student course.

# 1. More on for loops

1. Computers are good for performing the same operation over and over, really quickly. For loops are central to this.

2. For loops will be our main tool for working with data stored in lists. 

3. Example: You have a list of lncRNA gene lengths, and want to know how many are longer than 3 kb. With a for loop, you take each item in the list, one at a time, and count those that are longer than 3 kb.

4. Another example: You have a table of data (or two lists) with the stop and start coordinates for genes. To calculate all gene lengths, you use a for loop to iterate over these lists, subtracting start from stop.

5. Beware that for loops in Python work a little differently from some other programming languages. So pay attention to the syntax.

6. For loops ITERATE over ITERABLE OBJECTS (list, string, range, etc.)
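
As a minimal sketch of point 6, the same `for` syntax works on a string, a list, and a `range`. The variable names and values below are made up for illustration and are not part of the course data.

```python
# Illustrative values only; these names are not part of the course data.
bases_seen = []
for base in "ACGT":                 # a string yields one character at a time
    bases_seen.append(base)

total_length = 0
for length in [1200, 3400, 560]:    # a list yields one item at a time
    total_length = total_length + length

indices = []
for i in range(3):                  # range(3) yields 0, 1, 2
    indices.append(i)

print(bases_seen)
print(total_length)
print(indices)
```

In each case the loop variable (`base`, `length`, `i`) takes the next value from the iterable on every pass through the indented block.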

## For loop syntax

```python
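# my for loop
iterable_object = [1, 2, 3]  # any list, string, range, etc.

for x in iterable_object:
    # do something with x
    # notice the indentation
    print(x)

# next section of code: not indented, so it runs once, after the loop
print('done')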
```

In [1]:
# The Fibonacci Series is constructed by adding
# the previous two numbers to get the next number

# The ratio of fib[n]/fib[n-1] approaches phi, a.k.a.
# The Golden Ratio (1.6180339887498948482...)
# as n increases to infinity

n_fib = 20
fib = [1, 1]
phi = []

for n in range(2, n_fib):
    next_fib = fib[-2] + fib[-1]
    fib.append(next_fib)
    phi.append(fib[n]/fib[n-1])

print(fib)
phi[-1]

[1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765]


1.6180339631667064

## Looping over two lists

Often you'll find yourself needing to work with two lists to calculate a result. 

Suppose we measured the time 5 mice spent running on a wheel before and after the administration of a drug. We want the difference in time on the wheel for each mouse, so we subtract the before value from the after value. This is easy to do with a calculator for five mice, but the same principle applies to a dataset with 1 million points. Computers don't really care if it's five points or a million.

In [2]:
# Define lists
# mice running on a wheel before and after a drug

before = [0,23,34,15,21]
after = [12,43,50,29,67]

In [3]:
# for loop iterates over two lists with indexing

for i in range(len(before)):
    difference = after[i] - before[i]
    print(difference)

12
20
16
14
46


In [4]:
# A more pythonic way to iterate over two lists with zip()

for a, b in zip(after, before):
    difference = a - b
    print(difference)

12
20
16
14
46
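
To see what `zip()` hands to the loop, it can help to convert its output to a list. This sketch reuses the first three values from `after` and `before`; the names `after_demo`, `before_demo`, `pairs`, and `short` are made up for illustration.

```python
after_demo = [12, 43, 50]
before_demo = [0, 23, 34]

# zip() pairs up items by position: first with first, second with second, ...
pairs = list(zip(after_demo, before_demo))
print(pairs)

# If the lists differ in length, zip() stops at the end of the shorter one
short = list(zip(after_demo, [0, 23]))
print(short)
```

Because `zip()` silently stops at the shorter list, it is worth checking that both lists have the same length before looping over them.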


In [5]:
# Example of how to calculate the list of differences
# Notice indentation, which lines are not in the block

differences = []

for a, b in zip(after, before):
    difference = a - b
    differences.append(difference)

print(differences)

[12, 20, 16, 14, 46]


### Functions to convert data types

You'll see why in a minute, but what happens if the data in your lists are in the wrong format?

In [6]:
before = ['0', '23', '34']
after = ['12', '43', '50']

In [7]:
for a, b in zip(after, before):
    difference = float(a) - float(b)
    print(difference)

12.0
20.0
16.0


In [8]:
n = '1'
m = int(n)
print(type(n), type(m))

<class 'str'> <class 'int'>


## Recap: Fundamental Python Syntax So Far

1. Variables and data types
2. Lists - store multiple values
3. For loops

# 2. Conditional statements

Next piece of key Python syntax: conditional statements.

How to make decisions in code:

"If this is true, do this. Else, do this other thing."

If statements, like for loops, end with a colon and are followed by an indented block. 

## Major conditional statement options
### Anything that returns a boolean value
```python
x < y
x > y
x <= y
x >= y
x == y
x != y
```
```python
x in [a, b, c, d, x]
x not in [a, b, c, d, x]
```
```python
x < y and x > z
x < y or x > z
```
```python
x.startswith("#")
```
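
Each of the expressions above evaluates to `True` or `False`. The following sketch plugs in some made-up values (`x`, `y`, `z`, and `header_line` are chosen only for illustration) and collects the results.

```python
# Made-up values, chosen only to show what each expression evaluates to
x = 5
y = 10
z = 2
header_line = "#name\ttxStart"

results = [
    x < y,                        # True: 5 is less than 10
    x == y,                       # False: 5 is not equal to 10
    x != y,                       # True
    x in [1, 5, 9],               # True: 5 is in the list
    x not in [1, 5, 9],           # False
    x < y and x > z,              # True: both sides are true
    x > y or x > z,               # True: the right side is true
    header_line.startswith("#"),  # True: the string begins with '#'
]
print(results)
```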

In [9]:
x = 100
x > 10

True

In [10]:
if x > 10:
    print('My experiment worked')

My experiment worked


In [11]:
if x > 1000:
    print('Whoa too much')
else:
    print('Okay np')

# next part of code
print("hi")

Okay np
hi


In [12]:
enriched_classes = ['tRNA', 'mRNA', 'lncRNA']

if 'miRNA' in enriched_classes:
    print('Our results were enriched for miRNA!')
else:
    print('Our results were not enriched for miRNA...')

Our results were not enriched for miRNA...


In [13]:
my_values = [3.3, 5.8 , 7.6, 9.1, 11.3]

if max(my_values) > 15:
    print('Max is really high!')
elif max(my_values) < 0:
    print('Max is really low!')
else:
    print('Max is between 0 and 15.')

Max is between 0 and 15.
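
Conditionals become most useful inside a `for` loop. As a sketch of the lncRNA example from section 1, the code below counts how many lengths are longer than 3 kb. The lengths are hypothetical values made up for illustration.

```python
# Hypothetical lncRNA lengths in base pairs, made up for illustration
lncrna_lengths = [850, 3200, 12000, 2999, 3000, 4500]

n_long = 0
for length in lncrna_lengths:
    if length > 3000:      # 3 kb = 3000 bp; exactly 3000 is not "longer than"
        n_long = n_long + 1

print(n_long)
```

Note the two levels of indentation: the `if` belongs to the block of the `for` loop, and the counter update belongs to the block of the `if`.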


# 3. ACTIVITY 1

In Part One, we normalized red and green values by dividing red/green. What if green = 0? We need to check for this condition before dividing by 0 and causing an error!

**Red:** 23, 145, 203, 235, 354, 456, 17

**Green:** 5, 11, 6, 9, 8, 4, 0

## Write a for loop to normalize red values by green

Use a for loop to normalize the red values by dividing them by the green values. Try using `zip()` to zip together values from red and green, or use `range(len(red))` like before.

In the cell below, create an empty list called `normalized`. Then write a `for` loop that iterates through both the red and green lists simultaneously.

In the block of the `for` loop, write code to divide the red value by the green value, then append the result to `normalized`. But this time, check if the green value is zero before doing the division.

In [16]:
# Create the lists red and green below
red = [23, 145, 203, 235, 354, 456, 17]
green = [5, 11, 6, 9, 8, 4, 0]

# Create an empty list called normalized
normalized = []

# Write a for loop that iterates over the red and green lists
# Zip together red and green to be more "pythonic"
# Divide the red value by the green value and append it to normalized
# Use a conditional statement to test if the value of green is 0
# If the value of green is 0, the normalized result should be "NA"
# Otherwise, calculate the normalized values as red divided by green

for r, g in zip(red, green):
    if g == 0:
        normalized.append("NA")
    else:
        normalized.append(r/g)
        
print(normalized)

[4.6, 13.181818181818182, 33.833333333333336, 26.11111111111111, 44.25, 114.0, 'NA']


In [15]:
# End of Activity 1

# 4. Basic Python tools to read in genomic data

1. To work with real data, you need to load it into memory as a data object assigned to a Python variable.
2. We've covered the main Python syntax to do this - lists, variables, loops, conditionals.
3. Now we will load in a table of genomic data and write code to import that data as a set of lists.

## Python reads in your data as strings

When you look at a text file, you see:
  
`BRCA2   chr13   32314862    32400266    tumor_suppressor`

These values correspond to the

0. gene
1. chromosome
2. start
3. stop
4. type of gene

Notice that each piece of information is separated by a tab and that there is a new line at the end.

In [17]:
# Python sees whitespace characters for tab \t and new line \n
line = 'BRCA2\tchr13\t32314862\t32400266\ttumor_suppressor\n'
print(line)

BRCA2	chr13	32314862	32400266	tumor_suppressor



To work with each piece of information, we need to separate each part by removing the whitespace.

In [18]:
# Remove trailing newline, but does NOT change original variable!
line.strip()

'BRCA2\tchr13\t32314862\t32400266\ttumor_suppressor'

In [19]:
# Assign output of strip() to new variable
stripped = line.strip()
stripped

'BRCA2\tchr13\t32314862\t32400266\ttumor_suppressor'

In [20]:
# Output of split() is a list of strings

split = stripped.split('\t') # split string by tabs
split

['BRCA2', 'chr13', '32314862', '32400266', 'tumor_suppressor']

Common types of whitespace:
- Space ` `: `my_string.split(' ')`
- Tab `\t`: `my_string.split('\t')`
- Comma "," (okay, not whitespace): `my_string.split(',')`
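
A quick sketch of how the separator passed to `.split()` changes the result. The strings are made up from the gene name and chromosome in the example line above.

```python
# Made-up strings to show how the separator changes the result
by_space = 'BRCA2 chr13'.split(' ')
by_tab = 'BRCA2\tchr13'.split('\t')
by_comma = 'BRCA2,chr13'.split(',')

# With no argument, split() separates on any run of whitespace
by_any = 'BRCA2 \t chr13\n'.split()

print(by_space, by_tab, by_comma, by_any)
```

All four calls produce the same two-item list here. Passing the separator explicitly is usually safer for tables, because a value that itself contains a space will not be broken apart when you split on tabs.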

In [21]:
# After split(), assign data to the correct list
# Don't forget to convert to correct data types

starts = []

starts.append(int(split[2]))  # index 2 is the start coordinate
starts

[32314862]

# 5.  ACTIVITY 2: Read in data from a file

To work with our table of human gene annotations, we have to load it into memory where it can be accessed in our Python session. To load the file requires the following steps:

1. Open the file.
2. Read in each line of the data
3. Assign the data to a variable
4. Close the file.

In this activity, you will learn the commands for opening, reading, and closing files. You will use a **```for```** loop to read in the data one line at a time. Each line will be processed as demonstrated above, and the resulting values will be stored in lists.

You've practiced with simple lists and for loops – now you'll see how useful they are for larger datasets. 

Before starting this activity, make sure the files ```gene_table.txt``` and ```test_table.txt``` are in your `data/` folder in the Jupyter dashboard.

## Opening files
To open a file, we use the `open()` function. This function creates a *file object*, which is in essence a connection between where the data exists and your Python session.

The important idea is that *file objects, like strings, integers, lists, etc., can be assigned to a variable.* As you'll see, this lets you work with file objects much like you work with other types of variables.

### open()
In this activity, you will write code to read data from files using the small `test_table.txt` file as an easy example. There are fewer lines in this table, so that if something goes wrong, it won't take over your screen.

In the cell below, use `open()` to open the `data/test_table.txt` file and assign the resulting file object to the variable `test_file`. (We could use any valid variable name here - `test_file` is good because it's descriptive.)

Type the following into the code cell below and run the cell:
```python
test_file = open('data/test_table.txt')
```

In [22]:
test_file = open('data/test_table.txt')

Make sure you understand everything that happened in the previous line of code. The `open()` function took an argument – the file name – between the parentheses. Why do you think the argument, `'data/test_table.txt'`, is enclosed in quotes? What does this tell you about the *type* of valid values that can be used as arguments with the `open()` function?

## Reading file data

### .readline() and .read()
Once we have an open connection to the file, the next step is to read the data. There are three ways to read lines from a file:

1. Read one line: `test_file.readline()` (no argument between parentheses)
2. Read all lines at once: `test_file.read()`
3. Iterate over the file with a `for` loop

(Why are there empty parentheses after `.readline()` and `.read()`? Answer: `.readline()` and `.read()` are functions, which always require parentheses - even when there are no arguments to put between them.)
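
The difference between `.readline()` and `.read()` can be sketched without touching the real file. Here `io.StringIO` stands in for an open file (an assumption made so the example runs anywhere); the name `fake_file` is made up, and the lines are copied from `data/test_table.txt`.

```python
import io

# io.StringIO stands in for an open file here, so the sketch runs without
# data/test_table.txt. The lines are copied from that table.
fake_file = io.StringIO(
    '#name\ttxStart\ttxEnd\texonCount\n'
    'ENST00000371007.6\t67092164\t67231852\t8\n'
    'ENST00000371006.5\t67092175\t67127261\t6\n'
)

first = fake_file.readline()   # one line, including its trailing '\n'
rest = fake_file.read()        # everything that has not been read yet
at_end = fake_file.readline()  # nothing left, so this is an empty string

print(repr(first))
print(repr(rest))
print(repr(at_end))
```

Note that `.read()` returns one long string containing all the remaining lines, not a list of lines.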

Here is what the data in `data/test_table.txt` looks like:
```
#name	txStart	txEnd	exonCount
ENST00000371007.6	67092164	67231852	8
ENST00000371006.5	67092175	67127261	6
ENST00000475209.6	67092175	67127261	7
ENST00000621590.4	67092396	67127261	3
ENST00000263946.7	201283451	201332993	15
ENST00000367324.7	201283451	201332993	14
ENST00000622031.4	201283511	201330288	17
ENST00000352845.3	201283702	201328836	14
ENST00000337907.7	8352396	8817465	24
```

These values correspond to the

0. gene name
1. start position
2. stop position
3. number of exons

To read in *one* line from the file, type this in the cell below:
```python
print(test_file.readline())
```

In [23]:
print(test_file.readline())

#name	txStart	txEnd	exonCount



The previous code printed the first line of the file, which is the column header line. The header line sometimes begins with a hash symbol, sometimes not. Always check before you code to see if your file has a header or not.

To read the next line, run 

```python
print(test_file.readline())
``` 

again in the cell below. You should see the second line of the file, which is the first line of actual data. **This is an important feature of `.readline()`.** Python remembers which line was last read from the file, and automatically moves on to the next one each time `.readline()` is called. Once the last line of the file has been read, it stops there. It will not start over, even if you call `.readline()` again.

(So what if you decide you want to go back and read the first line again? The easiest way is to reset everything by rerunning the `open()` command, giving you a brand new file object.)

In [24]:
print(test_file.readline())

ENST00000371007.6	67092164	67231852	8



### Using a for loop to read lines

The functions `.readline()` and `.read()` are useful for some purposes, but the third way to read file data is the most generally useful:

Earlier, you used a **`for`** loop to *iterate* over individual elements in a list. A file object is also *iterable*: you can use a **`for`** loop to iterate over lines in a file.

The syntax is the same as that used to iterate over a list. Instead of a list variable, you use a file object variable. Rather than assigning list items to the temporary variable, the `for` loop assigns file lines to the temporary variable. *The `for` loop itself reads in each line – you do not need to call `.readline()` within the block of the `for` loop.*

Using the syntax you used to iterate over lists, write a `for` loop to iterate over `test_file` in the cell below. In the block of the loop, simply print the file line.

**Important:** The `for` loop picks up where your previous calls to `.readline()` or `.read()` left off. If you have already called `.readline()` or `.read()`, Python has already gone through some or all of the file, so the `for` loop will miss those lines. To start from the top, you have to `open()` the file again. This behavior can be useful: if you know the file has a header, call `.readline()` once before the `for` loop to skip it.
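
A small sketch of how a `for` loop shares its position in the file with `.readline()`. As an assumption for this example, `io.StringIO` stands in for an open file, and the names `ENST1` and `ENST2` are made-up placeholders rather than real identifiers.

```python
import io

# io.StringIO is a stand-in for an open file (an assumption for this sketch)
fake_file = io.StringIO('#name\ttxStart\nENST1\t100\nENST2\t200\n')

demo_header = fake_file.readline()   # consumes the header line

data_lines = []
for line in fake_file:               # starts at the first unread line
    data_lines.append(line)

second_pass = []
for line in fake_file:               # nothing is left, so the block never runs
    second_pass.append(line)

print(len(data_lines), len(second_pass))
```

The second loop collects nothing because the first loop already reached the end of the file.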

In the cell below, print the remaining lines of the file with a `for` loop, using this code:
```python
for line in test_file:
    print(line)
```

Note that `print` automatically adds a new line after each line, so you get two new lines in this case.

In [25]:
for line in test_file:
    print(line)

ENST00000371006.5	67092175	67127261	6

ENST00000475209.6	67092175	67127261	7

ENST00000621590.4	67092396	67127261	3

ENST00000263946.7	201283451	201332993	15

ENST00000367324.7	201283451	201332993	14

ENST00000622031.4	201283511	201330288	17

ENST00000352845.3	201283702	201328836	14

ENST00000337907.7	8352396	8817465	24
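
If the blank lines in the output above are distracting, there are two common ways to avoid them. This is only a sketch; `demo_line` is a made-up name holding a line copied from the output above.

```python
# A line copied from the output above, with its trailing newline
demo_line = 'ENST00000337907.7\t8352396\t8817465\t24\n'

print(demo_line)            # the line's own '\n' plus the one print() adds
print(demo_line, end='')    # tell print() not to add its own newline
print(demo_line.strip())    # or remove the line's newline before printing
```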



## Processing lines: Whitespace and splitting lines into data

Previously, we discussed the concept of whitespace, and the steps for processing each line from the file. To recap:

1. File lines are read in as strings that end in newlines (`\n`).
2. Strip trailing newlines from the file line with `.strip()`.
3. Split tab-delimited multi-column data into a *list* of values with `.split('\t')`.
4. Append individual values to the corresponding lists.

In the cells above, we've used `print()` to display file lines. This hides the whitespace characters `\t` and `\n`, because `print()` intentionally formats text to make it more readable.
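
One way to make the hidden characters visible inside `print()` is the built-in `repr()` function, which returns the string as you would type it in code. A minimal sketch, using a line from the test table and the made-up names `demo_line` and `shown`:

```python
demo_line = 'ENST00000371007.6\t67092164\t67231852\t8\n'

print(demo_line)         # tabs and the newline are rendered as spacing
print(repr(demo_line))   # tabs and the newline are shown as \t and \n

shown = repr(demo_line)  # the same text that the second print() displays
```

This is also what Jupyter does when a variable name is the last line of a cell: it displays the `repr()` of the value.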

To see that our lines come in as strings with whitespace characters, run the code cell below:

In [26]:
# open a fresh file object
test_file = open('data/test_table.txt')

# read in the header line and assign to variable 'header' 
header = test_file.readline()

# read in the first line
test_line = test_file.readline()

# display the line, showing tabs and newlines
test_line

'ENST00000371007.6\t67092164\t67231852\t8\n'

### Strip trailing newlines

The output above is a string (how can you tell?) with some whitespace. To convert this line into usable data, we first strip the trailing newline using `.strip()`. (If we fail to do this, the newline character itself will remain stuck to the last data value, causing problems later.)

As discussed in the lecture, **the output of `.strip()` is a copy of the original line**, with the newline removed. To save that copy, it needs to be assigned to a new variable. In the cell below, type the following code:

```python
stripped = test_line.strip()
stripped # to display the line as output
```

The `\n` should now be gone from the end of the string.

In [27]:
stripped = test_line.strip()
stripped

'ENST00000371007.6\t67092164\t67231852\t8'
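
The sketch below illustrates what happens without `.strip()` and how it differs from `.rstrip('\n')`. The names `demo_line`, `unstripped`, `both_ends`, and `right_only` are made up for this example.

```python
demo_line = 'ENST00000371007.6\t67092164\t67231852\t8\n'

# Splitting without stripping leaves the newline attached to the last value
unstripped = demo_line.split('\t')
print(unstripped[-1] == '8\n')

# strip() with no argument removes whitespace from BOTH ends of the string
both_ends = '  \t8\n'.strip()

# rstrip('\n') removes only newlines from the right-hand end
right_only = demo_line.rstrip('\n')

# The original string is unchanged, because strings are immutable
print(demo_line.endswith('\n'))
```

If a file could contain meaningful leading tabs (for example, an empty first column), `.rstrip('\n')` may be the safer choice, since `.strip()` would remove those tabs as well.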

### Split a line into data elements
Now we split the string into individual values at the tabs. To do this, we use `.split()`. **The output of `.split()` is a list.**

To see this, type the following code into the cell below:
```python
stripped.split('\t')
```

In [28]:
stripped.split('\t')

['ENST00000371007.6', '67092164', '67231852', '8']

Look at the output: What kind of data type is this? (Hint: What is enclosed in brackets, with values separated by commas?) 

Also, notice that the individual data elements are enclosed in quotes. What is the type of the individual elements?

To save the output, rerun `stripped.split('\t')` in the cell below, but this time assign the output to the variable `split`. Then, on a new line type `split` to display it.

In [29]:
split = stripped.split('\t')
split

['ENST00000371007.6', '67092164', '67231852', '8']

### A shortcut

The steps above can be combined into a single line of code, by stringing `strip()` and `split()` together. Run the cell below to see this.

In [30]:
split = test_line.strip().split('\t')
split

['ENST00000371007.6', '67092164', '67231852', '8']

### Assign data elements to variables

We're almost there! What began as a single file line has now been split into a list of individual values. The final step is to take the individual data elements from the list `split` and append them to separate lists that represent individual columns of data in the file.

To do this, we use list indexing. Remember, to access one element of a list, we write the variable name followed by the list position in square brackets: `list_name[0]`.

Since `split` is a list, you can use indexing to access the individual data values. In the cell below, use `.append()`, together with indexing of `split`, to append the individual data values to the empty lists.

In [33]:
# Some blank lists to hold our data

gene_names = []
starts = []
stops = []
exon_counts = []

# Now append elements of split to the appropriate lists.
# For example, since the gene name is in the first position of our list called split, we can use:
# gene_names.append(split[0])
# to add the gene name to the gene_names list
# Now do that for gene_names, starts, stops, and exon_counts
gene_names.append(split[0])
starts.append(split[1])
stops.append(split[2])
exon_counts.append(split[3])

### Convert data types

We have one last detail to take care of. Our numerical data (the values for `start`, `stop`, and `exon_count`) are still *strings*. (How can you check this?) Python will balk if you try to do math with them. 

In the cell below, try subtracting the first (and only) element from the list `starts` from the first element of `stops` to calculate the length of our example gene. You'll get an error. What does the error message say?

In [35]:
stops[0] - starts[0]

TypeError: unsupported operand type(s) for -: 'str' and 'str'

How do we convert our numerical data to actual numerical data types? Recall that there are functions that convert data types: `str()` converts items to strings, `float()` converts items to decimal numbers, and `int()` converts items to integers.
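
A short sketch of the three conversion functions in action. The names `exon_text`, `as_int`, `as_float`, `back_to_str`, `joined`, and `added` are made up for illustration.

```python
exon_text = '8'              # a value as it comes out of split()

as_int = int(exon_text)      # 8
as_float = float(exon_text)  # 8.0
back_to_str = str(as_int)    # '8'

print(type(as_int), type(as_float), type(back_to_str))

# '+' joins strings but adds numbers, so converting changes the result
joined = '8' + '8'           # '88'
added = int('8') + int('8')  # 16
print(joined, added)
```

The last two lines show why the type matters even when Python does not raise an error: with strings, `+` silently gives a different answer.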

We convert the string elements of the list `split` to integers using `int()`. You can do this by nesting the integer conversion within the `.append()` function:

```python
exon_counts.append(int(split[3]))
```
Go ahead and change the code above (two code cells up) to convert the numerical data to the integer type before appending to the appropriate lists. Then, run the next code cell. You should be able to subtract start from stop without error.



In [36]:
# Some blank lists to hold our data

gene_names = []
starts = []
stops = []
exon_counts = []

# Now append elements of split to the appropriate lists.
# For example, since the gene name is in the first position of our list called split, we can use:
# gene_names.append(split[0])
# to add the gene name to the gene_names list
# Now do that for gene_names, starts, stops, and exon_counts
gene_names.append(split[0])
starts.append(int(split[1]))
stops.append(int(split[2]))
exon_counts.append(int(split[3]))

stops[0] - starts[0]

139688

Finally, when you're finished reading in the data from a file, it's good practice to close the file as follows:

```python
test_file.close()
```
Run that command in the cell below to close the file.

In [37]:
test_file.close()
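
As an aside, Python also offers the `with` statement, which closes the file automatically when its indented block ends. The sketch below is an alternative to calling `.close()` yourself, not the approach used in this activity. So that it runs anywhere, it first writes a tiny stand-in table to a temporary folder; the path, the file name `demo_table.txt`, and all `demo_` variable names are made up, while the rows are copied from `test_table.txt`.

```python
import os
import tempfile

# Write a tiny stand-in table to a temporary folder so the sketch runs anywhere.
# The path and file name are made up; the rows are copied from test_table.txt.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_table.txt')
with open(demo_path, 'w') as out_file:
    out_file.write('#name\ttxStart\ttxEnd\texonCount\n')
    out_file.write('ENST00000371007.6\t67092164\t67231852\t8\n')

demo_names = []
with open(demo_path) as demo_file:   # the file is closed when this block ends
    demo_header = demo_file.readline()
    for line in demo_file:
        demo_names.append(line.strip().split('\t')[0])

print(demo_names)
print(demo_file.closed)   # True: no explicit .close() call was needed
```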

## Combine all these steps in a for loop, using test data

Now bring it all together, while still using the `test_file` data:

In [38]:
# Some blank lists to hold our data

gene_names = []
starts = []
stops = []
exon_counts = []

# open a fresh file object
test_file = open('data/test_table.txt')

# read in the header line and assign to variable 'header' 
header = test_file.readline()

# a for loop to iterate over all the lines
for line in test_file:
    
    split = line.strip().split('\t')

    gene_names.append(split[0])
    starts.append(int(split[1]))
    stops.append(int(split[2]))
    exon_counts.append(int(split[3]))
    
test_file.close() # close the connection

## Read data from the entire file gene_table.txt

Now put all of the concepts together to write code that reads in and processes data from an entire file.

You now have all of the Python syntax you need to load and process the data from `data/gene_table.txt`. It has the same format as `data/test_table.txt`.

Pulling everything discussed above together, do the following in the code cell below:

1. Define 4 empty lists to hold the data: `gene_names`, `starts`, `stops`, `exon_counts`.
2. Open the file `data/gene_table.txt`. This is no longer a `test_file`, it's the real file!
3. Use `.readline()` to read in *only* the header line, assign it to a variable named `header`.
4. Write a `for` loop to process the remaining file lines, appending individual data values to the appropriate list (after conversion to the correct data type).
5. Finish by closing the file.

In [39]:
# Type your code below

# Some blank lists to hold our data

gene_names = []
starts = []
stops = []
exon_counts = []

# open a fresh file object
file = open('data/gene_table.txt')

# read in the header line and assign to variable 'header' 
header = file.readline()

# a for loop to iterate over all the lines
for line in file:
    
    split = line.strip().split('\t')

    gene_names.append(split[0])
    starts.append(int(split[1]))
    stops.append(int(split[2]))
    exon_counts.append(int(split[3]))
    
file.close() # close the connection

In [40]:
# Check your answers using assert.
assert len(stops) == 197782
assert gene_names[100389] == 'uc058bab.1'
assert type(exon_counts[0]) == int
# If everything is correct, you get no output from assert.

### How long are human genes?

We opened `data/gene_table.txt`, read in the data, and closed the file. Our data is now stored in four lists. We can now calculate the lengths of all human genes. 

In the cell below, do the following:

1. Loop over the lists `stops` and `starts`, using `zip()`.

2. In the block of the `for` loop, calculate gene lengths by subtracting each start coordinate from its corresponding stop coordinate. Append the result to a list called `lengths`.

Remember to define `lengths` as an empty list before you start your **`for`** loop.

In [41]:
# Calculate gene lengths using a for loop here

lengths = []

for start, stop in zip(starts, stops):
    lengths.append(stop - start)
    
# DON'T TRY TO PRINT LENGTHS -- IT'S VERY LONG!

In [42]:
# Check your answers using assert
assert len(lengths) == 197782
assert lengths[1827] == 16858
assert lengths[53] == 104
assert lengths[75] - lengths[20938] == 3273
# If everything is correct, you get no output from assert.
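
Once the lengths are in a list, the built-in functions `min()`, `max()`, `sum()`, and `len()` give a quick summary without printing the whole list. The sketch below uses coordinates copied from four rows of `data/test_table.txt` rather than the full gene table; all `demo_` names and the 100 kb cutoff are made up for illustration.

```python
# Start and stop coordinates copied from four rows of data/test_table.txt
demo_starts = [67092164, 67092175, 67092396, 8352396]
demo_stops = [67231852, 67127261, 67127261, 8817465]

demo_lengths = []
for start, stop in zip(demo_starts, demo_stops):
    demo_lengths.append(stop - start)

shortest = min(demo_lengths)
longest = max(demo_lengths)
mean_length = sum(demo_lengths) / len(demo_lengths)

# Count entries longer than 100 kb with a for loop and a conditional
n_long = 0
for length in demo_lengths:
    if length > 100000:
        n_long = n_long + 1

print(shortest, longest, mean_length, n_long)
```

The same lines should work on the full `lengths` list by swapping in `starts`, `stops`, and `lengths` for the `demo_` variables.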

In [None]:
# End of Activity 2