<img src="lbnl_logo.jpg">

----




# Genomics Challenge Lab - Day 3



---

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

<img src="algae.png" align="left">

Load yesterday's normalized table into `rna_data`. Recall that we saved yesterday's work into a file called "rna_by_read_depth.csv". Double-check that `rna_data` looks the way it did yesterday: the labels should be completely updated, and the values in the data table should be proportions.

In [None]:
# EXERCISE

rna_data = pd.read_csv("rna_by_read_depth.csv", index_col=0) #SOLUTION
rna_data.head()

## Normalizing the Data Again

### 1. New Gene Information

Now you know how to apply normalization by row. Let's think about _what_ we want to normalize by. Each row represents a specific gene, so it might be worthwhile to consider the length of a gene. Different genes have different lengths, so let's check out the lengths of our genes.

**Question 1** Why would the length of a gene matter?

_Your answer here_

In the cell below, load the information we have on each gene via the "gene_info.info" file. Like the original `rna_data` table, this file is separated by tabs instead of commas. Indicate this in your function call with the argument `sep=...`.

_Hint_: if you forgot what to put in the ellipses (...), check what you did on day 1!

In [None]:
# EXERCISE

...
gene_info.head()

This data table gives us a lot of interesting information! First, our index is `geneID`, which seems similar to our `tracking_id` column in `rna_data`. The column `chrom` tells us what chromosome the gene can be found on, and the columns `start` and `stop` tell us where the gene begins and ends on that chromosome. Finally, `length` refers to how long that specific gene is.

The last column would be the most useful to us for row normalization. Genes that are of longer lengths may produce more reads than those of shorter lengths, so we need to take that into account. Let's normalize `rna_data` by each row's gene length.
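
To make the effect of gene length concrete, here is a minimal sketch with made-up numbers (none of these values come from the lab's data). It assumes two hypothetical genes that produce the same number of reads per base, and shows that their raw read counts differ only because of their lengths.

```python
# Made-up numbers: two hypothetical genes expressed at the same level per base.
reads_per_base = 0.5
short_gene_length = 1000
long_gene_length = 4000

# The longer gene collects more reads simply because it is longer.
short_gene_reads = reads_per_base * short_gene_length
long_gene_reads = reads_per_base * long_gene_length

# Dividing each count by its gene's length puts both genes on the same scale.
short_normalized = short_gene_reads / short_gene_length
long_normalized = long_gene_reads / long_gene_length

print(short_gene_reads, long_gene_reads)
print(short_normalized, long_normalized)
```

In this toy case the raw counts differ by a factor of four, while the length-normalized values are equal.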

First, we want to check whether or not the gene IDs of `gene_info` are in the same order as the gene IDs in `rna_data`. Do this in the next cell and keep in mind that the gene IDs of both data tables are the table indices.

1. Get the indices of both tables and assign them to `gene_info_ids` and `rna_data_ids`.
2. Compare the gene ID arrays using a boolean statement.
3. Count how many indices match up.
    - Recall that 'True' = 1 and 'False' = 0.
4. Verify that the number of matching gene IDs is the number of genes in the `rna_data` table.
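
The steps above can be sketched on two small made-up tables. The names `toy_info` and `toy_reads` and all of their values are hypothetical; the sketch only illustrates the comparison pattern and assumes both tables have the same number of rows.

```python
import pandas as pd

# Toy tables (made-up IDs and values), indexed by gene ID like the lab's tables.
toy_info = pd.DataFrame({"length": [900, 1500, 300]},
                        index=["geneA", "geneB", "geneC"])
toy_reads = pd.DataFrame({"sample1": [0.2, 0.5, 0.3]},
                         index=["geneA", "geneB", "geneC"])

# Comparing two same-length indices position by position gives a boolean array.
toy_matches = (toy_info.index == toy_reads.index)

# True counts as 1 and False as 0, so the sum is the number of matching positions.
toy_num_matches = sum(toy_matches)

# The orders agree only if every position matches.
toy_same_order = (toy_num_matches == toy_reads.shape[0])
print(toy_same_order)
```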

In [None]:
# EXERCISE

gene_info_ids = gene_info.index 
rna_data_ids = ...

matches = (... == ...) 
num_matches = sum(matches)

rna_data_num_rows = rna_data.shape[0]
same_order = (num_matches == rna_data_num_rows)

print("The gene IDs are in the same order as the tracking IDs: " + str(same_order))

Based on the above statement, we can see that the gene ID columns are in the same order for `gene_info` and `rna_data`. This means we can take the `length` information from `gene_info` with full confidence that it will be in the same order as the corresponding genes in `rna_data`.

Do that in the next cell.

In [None]:
# EXERCISE

gene_lengths = ...

Let's take a quick peek at the gene lengths of the first 100 genes!

In [None]:
x_axis = rna_data.index[0:100]
y_axis = gene_lengths[0:100]

plt.bar(x_axis, y_axis)
plt.title("Genes vs. Gene Length")
plt.xlabel("Genes")
plt.ylabel("Length")
plt.show()

Wow! Gene lengths vary a lot, and these are only the first 100 genes! This variability is exactly why we normalize our data by gene length. Let's finalize our normalization in the next section.

### 2. Normalization by Gene Length

In the previous sections, you have learned how to normalize data by rows (via transposing) and what we would like to normalize the rows of `rna_data` by. Let's combine what you learned and normalize `rna_data` by gene length!

In the next cell, create a function that normalizes `rna_data` by rows. Recall from the "Transposing Data" section that this requires two `np.transpose()` calls and should return the table in the same structure it was originally in.
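
As a reminder of how the double transpose behaves, here is a minimal sketch on a made-up table. The names `toy_table` and `toy_lengths` and their values are hypothetical and are not part of the lab's data.

```python
import numpy as np
import pandas as pd

# Toy table: 3 genes (rows) x 2 samples (columns); all numbers are made up.
toy_table = pd.DataFrame([[10.0, 20.0], [30.0, 60.0], [5.0, 10.0]],
                         index=["geneA", "geneB", "geneC"],
                         columns=["sample1", "sample2"])
toy_lengths = np.array([10.0, 30.0, 5.0])  # one value per gene (row)

# After the first transpose, genes are columns, so dividing by a length-3
# array divides each gene's column by its own length. The second transpose
# restores the original genes-as-rows layout.
toy_normalized = np.transpose(np.transpose(toy_table) / toy_lengths)
print(toy_normalized)
```

The result has the same shape and labels as `toy_table`, with each row divided by that row's length.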

In [None]:
# EXERCISE

def normalizeByGene(data, gene_length_list):
    return np.transpose(.../...) 

Well done! We now have a function that normalizes by row and an array of values that we would like to normalize by. In the next cell, we normalize `rna_data` by `gene_lengths`. 

You might notice that the numbers are incredibly small. Recall that each value has now been divided twice: once by its sample's total reads (read depth) and once by its gene's length.

In [None]:
rna_by_gene_len = normalizeByGene(rna_data, gene_lengths)
rna_by_gene_len.head()

### 3. What effect did normalization have?

Let's see how our data changed for gene `Cz01g00040` after normalizing the data by gene lengths.

Once again, let's create a plot that visualizes the normalized reads for `Cz01g00040` in `rna_data`. Remember that `rna_data` is our data normalized by read depths _only_. This visualization will be the same as what you saw yesterday, so we have provided the code for you.

In [None]:
x_axis = rna_data.columns
y_axis = rna_data.loc["Cz01g00040", :]

plt.scatter(x_axis, y_axis)
plt.title("Samples vs. Normalized Cz01g00040 (Read Depth)")
plt.xlabel("Samples")
plt.ylabel("Normalized Cz01g00040 Reads")
plt.xticks(rotation=90)
plt.show()

In this next cell, you should visualize the normalized reads for `Cz01g00040` in `rna_by_gene_len`. This is the data from the previous visualization normalized by gene lengths. 

In [None]:
# EXERCISE

x_axis = ...
y_axis = ... # "Cz01g00040" onwards

plt.scatter(x_axis, y_axis)
plt.title("Samples vs. Normalized Cz01g00040 (Read Depth, Gene Length)")
plt.xlabel("Samples")
plt.ylabel("Normalized Cz01g00040 Reads")
plt.xticks(rotation=90)
plt.show()

**Question 2** How did the visualization change after the data was normalized by gene length? Did the pattern change? Did the scale of normalized `Cz01g00040` reads change? How much larger is the highest read compared to the lowest read?

_Your answer here_

**Question 3** The changes that occurred may not seem too drastic in the visualization. However, they are still important. Why might that be? Recall the uses of this experiment.

_Your answer here_

## Visualizing Relationships

Now we are going to finalize our data analysis by creating visualizations of our data. In the past few days, we have made our data more readable, normalized by read depth (i.e., by column), and normalized by gene length (i.e., by row). The first of these steps made our data easier to read and therefore to interpret. The last two accounted for the different read depths of different conditions/samples and for the different lengths of different genes.

In order to create visualizations that represent our data well, it might be easier to separate our data into two tables -- one with the data under high light conditions and another with the data under medium light conditions.

Create separate tables for high light and medium light. To do this, type out the column labels that we would like to use in the ellipses (...). Also, be sure to put them in order of hours of light exposure and to keep only the conditions whose hours are found in both high light and medium light.
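
Selecting several columns by label uses a list inside the square brackets. The sketch below shows the pattern on a made-up table; the column labels `"1A-0"`, `"1B-0"`, and so on are hypothetical and are not the lab's real labels.

```python
import pandas as pd

# Toy table with hypothetical column labels, deliberately out of order.
toy = pd.DataFrame({"1B-0": [1, 2], "1A-0": [3, 4], "2A-0": [5, 6], "2B-0": [7, 8]},
                   index=["geneA", "geneB"])

# Double brackets: the inner list names the columns to keep, in the order wanted.
toy_A = toy[["1A-0", "2A-0"]]
toy_B = toy[["1B-0", "2B-0"]]

print(toy_A.columns.tolist())
print(toy_B.columns.tolist())
```

The columns of the new tables follow the order of the list, not the order in the original table.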

As a reference, we have printed out all the column labels.

In [1]:
rna_data.columns


In [None]:
# EXERCISE

rna_HL = rna_data[['0.5HL-0', '0.5HL-1', '0.5HL-2', '0.5HL-3', 
        ...]]

rna_ML = rna_data[['0.5ML-0', '0.5ML-1', '0.5ML-2','0.5ML-3',  
        ...]] 

Great! Now we can visualize our data more effectively. Let's first look at how reads change as light exposure times increase.

### 1. The Relationship between Light Exposure Time and Reads

Before we look at the relationship between light exposure time and reads, we need to consider what we should visualize! We could have light exposure times on the x-axis and reads on the y-axis, but what would our datapoints be?

We have reads for thousands of genes under various light exposure times, so let's stick with looking at just one. As in the previous two labs, we can look at gene `Cz01g00040`.

- Assign `light_times` to the light exposure times that are in both `rna_HL` and `rna_ML`.
    - There should be replications of the same time since we have 4 samples of each light exposure period.
- Assign `rna_HL_gene` to the data associated with `Cz01g00040` under high light exposures.
- Assign `rna_ML_gene` to the data associated with `Cz01g00040` under medium light exposures.

In [None]:
# EXERCISE

light_times = [0.5, 0.5, 0.5, 0.5, 1, 1, 1, 1, 3, 3, 3, 3, 6, 6, 6, 6, 12, 12, 12, 12]

rna_HL_gene = ...
rna_ML_gene = ...

The following cell generates a visualization that compares light exposure times to reads. It does not differentiate between the two different light intensities, but it is still a helpful intermediate step in analyzing our data.

In [None]:
# EXAMPLE

light_times_twice = light_times + light_times
# pd.concat stacks the two Series end to end (Series.append was removed in pandas 2.0)
rna_gene = pd.concat([rna_HL_gene, rna_ML_gene])

plt.scatter(light_times_twice, rna_gene)
plt.title("Light Times vs. Reads")
plt.xlabel("Light Times (hr)")
plt.ylabel("Reads")
plt.show()

**Question 1** What relationship or patterns do you see? What do you think this means?

_Your answer here_

Now that you have seen an example of how to create a scatter plot, create one yourself that visualizes the relationship between light exposure times and reads while separating the high light data from the medium light data.

To do so, use `light_times`, `rna_HL_gene`, and `rna_ML_gene`.

In [None]:
# EXERCISE

plt.scatter(..., ...) 
plt.scatter(..., ...) 
plt.title("Light Times vs. Reads")
plt.xlabel("Light Times (hr)") 
plt.ylabel("Reads") 
plt.show()

**Question 2** Here you can see the difference between high light datapoints and medium light datapoints. Do you notice a different pattern between the two? What is different?

_Your answer here_

Now that you have thought about this question, we will draw a line representing the relationship of the data points shown. The cell below creates a straight line using the mx + b formula you learned in math. The resulting plot might help support your answer in Question 2, but try to answer it before looking at the visualization.
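
To see what a straight-line fit returns, here is a minimal sketch using `np.polyfit` on made-up points that lie exactly on the line y = 2x + 1. The names `toy_x`, `toy_y`, `toy_m`, and `toy_b` are hypothetical and unrelated to the lab's data.

```python
import numpy as np

# Made-up points that lie exactly on the line y = 2x + 1.
toy_x = np.array([0.0, 1.0, 2.0, 3.0])
toy_y = 2 * toy_x + 1

# Degree 1 asks for a straight line; the result is the slope, then the intercept.
toy_m, toy_b = np.polyfit(toy_x, toy_y, 1)

# Evaluate the fitted line m*x + b at each x value.
toy_line = toy_m * toy_x + toy_b
print(round(toy_m, 6), round(toy_b, 6))
```

With real, noisy data the points will not fall exactly on the line; the fit instead picks the slope and intercept that minimize the squared vertical distances to the points.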

In [None]:
m_HL, b_HL = np.polyfit(light_times, rna_HL_gene, 1)
m_ML, b_ML = np.polyfit(light_times, rna_ML_gene, 1)

plt.scatter(light_times, rna_HL_gene)
plt.scatter(light_times, rna_ML_gene)

plt.plot(light_times, np.array(light_times) * m_HL + b_HL)
plt.plot(light_times, np.array(light_times) * m_ML + b_ML)

plt.title("Light Times vs. Reads")
plt.xlabel("Light Times (hr)")
plt.ylabel("Reads")
plt.show()

We can see that the reads for gene `Cz01g00040` increase as the number of hours under light exposure increases. This denotes a positive correlation between the two. However, we notice that reads for the gene under high light increase more rapidly with time than reads for the gene under medium light.

Now that we know about this relationship and can support it with a visualization, let's create another visualization.

### 2. The Relationship between High and Medium Light

As mentioned earlier, we now want to show the relationship between high light and medium light.

First, let's create an elementary plot with `rna_HL_gene` on the x-axis and `rna_ML_gene` on the y-axis.

In [None]:
# EXERCISE

plt.scatter(..., ...) 
plt.title("High Light Reads vs. Medium Light Reads for Cz01g00040") 
plt.xlabel("High Light Reads") 
plt.ylabel("Medium Light Reads") 
plt.show()

**Question 3** What relationship or patterns do you see? What does this mean?

_Your answer here_

This scatter plot gives us some new information, but let's divide the datapoints into different colors representing the different light exposure times. Follow the format below.

- `orange`: datapoints that were measured after **0.5 Hours** of light exposure
    - These should be the first 4 rows of `rna_HL_gene` and `rna_ML_gene` because there are 4 replications of 0.5HL and 4 replications of 0.5ML! How can you take the _first 4 rows of a list_?
- `yellow`: datapoints that were measured after **1 Hour** of light exposure
    - These should be the _next_ 4 rows of `rna_HL_gene` and `rna_ML_gene`.
- `green`: datapoints that were measured after **3 Hours** of light exposure
- `blue`: datapoints that were measured after **6 Hours** of light exposure
- `purple`: datapoints that were measured after **12 Hours** of light exposure
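
Taking consecutive groups of four values can be sketched on a made-up Series. The names and values below are hypothetical. The sketch uses `.iloc`, which always slices by position; this is an illustrative choice, not necessarily the form the exercise cell uses.

```python
import pandas as pd

# Toy Series with 8 made-up values: two groups of 4 replicates each.
toy_values = pd.Series([1, 2, 3, 4, 10, 20, 30, 40],
                       index=["s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7"])

# A slice start:stop includes position `start` and excludes position `stop`.
first_group = toy_values.iloc[0:4]   # positions 0, 1, 2, 3
second_group = toy_values.iloc[4:8]  # positions 4, 5, 6, 7

print(first_group.tolist())
print(second_group.tolist())
```

Each later group starts where the previous one stopped, so no value is skipped or counted twice.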

In [None]:
# EXERCISE

orange = plt.scatter(rna_HL_gene[0:4], rna_ML_gene[0:4], c="orange") 
yellow = plt.scatter(rna_HL_gene[4:8], ..., c="yellow")
green = plt.scatter(..., ..., c="green") 
blue = plt.scatter(..., ..., c="blue") 
purple = plt.scatter(..., ..., c="purple") 

dots = [orange, yellow, green, blue, purple]
labels = ["0.5 Hours", "1 Hour", "3 Hours", "6 Hours", "12 Hours"]

plt.title("High Light Reads vs. Medium Light Reads for Cz01g00040")
plt.xlabel("High Light Reads")
plt.ylabel("Medium Light Reads")
plt.legend(dots, labels, loc="center right", bbox_to_anchor=(1.5, 1))
plt.show()

**Question 4** What are some observations you can make about this visualization? Are you surprised about the results or have you seen this relationship before?

_Your answer here_

Below is a visualization found in the original scientific publication. You do not need to understand it completely, but we do want you to notice some interesting features of it. Although our axes are different, notice that there are similarities between our figure and the publication figure.

<img src="pca.png" width="450" align="left">

**Question 5** What similarities can you draw between our visualization and the original publication's? List at least two things.

_Your answer here_

## Conclusions

Refer back to the two figures we created. The first one compared light exposure times to read values under different light intensities for the gene `Cz01g00040`. The second visualization once again looked at the gene `Cz01g00040`, but this time, it looked at the relationship between the two light intensities grouped by different light exposure times.

**Question 1** Now that you have seen two visualizations of the same data, what relationship(s) are you confident in?

_Your answer here_

**Question 2** What other data would you have liked to include if you were conducting this experiment? Would you have taken into account oxygen levels, temperature, etc.?

_Your answer here_

**Question 3** Do you think you would have seen the same results if we had analyzed a gene other than `Cz01g00040`? Why or why not? If you have time, feel free to copy-and-paste the code with a different gene name to see whether the relationships stand for another gene.

_Your answer here_

Notebook developed by: Sharon Greenblum & Ciara Acosta <br/>
Edited by: Kseniya Usovich