<img src="lbnl_logo.jpg">

----




# Genomics Challenge Lab - Day 1



---

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

<img src="algae.png" align="left">

## Background Information

Welcome to the last week of BLDAP! In this challenge lab, we will be working with RNA sequencing data from an experiment measuring gene expression in the algae _Chromochloris zofingiensis_.
    
This algae is important for a couple of reasons:  
1. It stores large amounts of energy from the sun, which can then be turned into biofuel.
2. It produces molecules that are beneficial for human health, like antioxidants.

Recently, scientists performed an experiment to figure out which genes were most important for these functions. You can read more about the experiment [here](https://www.pnas.org/content/114/21/E4296). Pay particular attention to the section titled "High Light-Induced Gene Expression", where scientists looked at which genes were 'turned on' (i.e., increased their expression levels) when _C. zofingiensis_ samples were exposed to stronger light.

**Question 1** Why would it be helpful for scientists to know which genes were expressed when the algae was exposed to high light?

_Your answer here_

## Loading the Data

Now we want to load "rnaseq_raw_counts.txt" into a variable named `rna_data`. Before loading it, take a look at the file. Notice that the values are separated by tabs instead of commas. **This means we want to pass the argument `sep='\t'` to the `pd.read_csv()` function call.** The argument tells pandas that the file's values are separated by tabs instead of commas.
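As a minimal sketch of what `sep='\t'` changes, the example below reads a tiny tab-separated table from an in-memory string. The table contents and the names `toy_text`, `toy_table`, and `wrong_table` are made up for illustration and are not part of the lab's data.

```python
import io
import pandas as pd

# A tiny, made-up tab-separated table (not the real RNA-seq file).
toy_text = "gene\tsample_a\tsample_b\ng1\t5\t7\ng2\t0\t12\n"

# With sep='\t', pandas splits each line on tab characters.
toy_table = pd.read_csv(io.StringIO(toy_text), sep='\t')
print(toy_table.shape)  # (2, 3)

# Without it, pandas looks for commas, finds none,
# and puts each whole line into a single column.
wrong_table = pd.read_csv(io.StringIO(toy_text))
print(wrong_table.shape)  # (2, 1)
```

If a freshly loaded table has only one column, a wrong separator is a likely cause.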

In [None]:
# EXERCISE

rna_data = ...

### 1. What are the dimensions of our data?

Great! Now let's learn more about our data set.

Find out how many rows and columns there are in our `rna_data` table.

In [None]:
# EXAMPLE

num_rows = rna_data.shape[0] 
num_columns = rna_data.shape[1] 

print("# of rows: " + str(num_rows))
print("# of columns: " + str(num_columns))

That's a lot of data! Let's take a quick peek at the data table and see what we are working with. Notice that we cannot see every single column name; there is a "column" of ellipses (...) in place of the hidden ones.

In [None]:
rna_data.head()

### 2. What is `tracking_id`?

The column `tracking_id` holds the ID of the specific gene we are tracking in each row. Each one is in the form 'Cz##g#####'.

- 'Cz' means that the gene is from the algae species _Chromochloris zofingiensis_.
- '##g' (the next two digits + 'g') tell us which chromosome the gene is on.
- '#####' (the last few digits) are a randomly assigned ID number.

Let's check if each gene ID is unique.
1. Start by getting the `tracking_id` column as an array.
2. Use a function that puts all unique tracking IDs into an array.
3. Find the length of the array of unique IDs.
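The three steps above can be sketched on a toy table. This is only one possible approach, assuming `np.unique` is the "function that puts all unique tracking IDs into an array"; the gene IDs and the names `toy_genes`, `toy_ids`, and `toy_unique` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up gene IDs with one deliberate duplicate (hypothetical values).
toy_genes = pd.DataFrame(
    {"tracking_id": ["Cz01g00010", "Cz01g00020", "Cz01g00010"]}
)

toy_ids = toy_genes["tracking_id"].values  # step 1: the column as an array
toy_unique = np.unique(toy_ids)            # step 2: array of distinct IDs
print(len(toy_unique))                     # step 3: prints 2

# The count differs from the number of rows because of the duplicate.
print(len(toy_unique) == len(toy_genes))   # False
```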

In [None]:
# EXERCISE

tracking_ids = ...
unique_ids = ...
num_unique = len(unique_ids)

print("# of unique IDs: " + str(num_unique))

**Question 2** Compare the number of unique IDs to the number of rows. Notice that they are the same number. What does that mean?

_Your answer here_

### 3. What are our other columns?

Now that we have explored the first column, let's take a look at the remaining columns. In the cell below, print out the names of all of the columns.
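As a small sketch, the column labels of any pandas table are available through its `.columns` attribute. The toy table below is made up; the label `ML.0h0` in particular is a guess at the naming pattern, not a confirmed column name.

```python
import pandas as pd

# A made-up one-row table with hypothetical column labels.
toy_table = pd.DataFrame(
    {"tracking_id": ["Cz01g00010"], "HL.0.5h0": [3], "ML.0h0": [8]}
)

toy_columns = toy_table.columns  # an Index object holding the labels
print(list(toy_columns))         # ['tracking_id', 'HL.0.5h0', 'ML.0h0']
```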

In [None]:
# EXERCISE

rna_orig_columns = ...
print(rna_orig_columns)

**Question 3** Take a look at all of the column names other than the `tracking_id` column. What is similar about all of the names? What is different? Do you have any guesses about what these might mean?

_Your answer here_

## Understanding and Cleaning the Data

Now that we have an idea of what our columns look like, we can make our table easier to work with through data cleaning. Here are some of our main goals for cleaning the data:
- Creating a table index that is useful
- Changing column labels to be more readable
- Finding and addressing null values

### 1. Creating a New Table Index

Let's start with the first objective. Earlier, we found out that each `tracking_id` is unique, so we can make `tracking_id` our table index. This makes it easier to find data associated with a specific gene.
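To illustrate why a unique column makes a useful index, here is a minimal sketch on a made-up table (the IDs, the `sample_a` column, and the variable names are hypothetical). It assumes `set_index` is the method used for this step.

```python
import pandas as pd

toy_table = pd.DataFrame({
    "tracking_id": ["Cz01g00010", "Cz01g00020"],
    "sample_a": [5, 0],
})

# set_index returns a new table; the original is unchanged unless reassigned.
toy_indexed = toy_table.set_index("tracking_id")

# Rows can now be looked up directly by gene ID.
print(toy_indexed.loc["Cz01g00020", "sample_a"])  # 0
```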

In [None]:
# EXERCISE

rna_data = ...
rna_data.head()

### 2. Making More Readable Column Names

Well done! Now let's talk about the other columns.

You may have noticed earlier that they follow a structure.
- 'HL' or 'ML' refers to whether the algae grew in 'high light' or 'medium light' respectively.
- '##h' tells us how many hours the algae was exposed to the light before a sample was collected.
    - For HL, the time points are [0.5, 12, 1, 3, 6].
    - For ML, the time points are [0.5, 0, 12, 1, 3, 6].
- '#' (the last digit) tells us which replication of the sample it is.
    - Each experiment has 4 replications labeled 0, 1, 2, and 3.

The column `HL.0.5h0` can be read as "high light for 0.5 hours -- sample 0".

This format is hard to read with the period after 'HL'/'ML' and the 'h' denoting that the time is in hours. Let's change the column names so that it is easier for us to read. We have provided new column names for you to use in the following format.
- '##' (the first few digits) tell us the number of hours of light exposure.
- 'HL' or 'ML' denotes the light intensity.
- '-#' gives us the replication number.

The column `0.5HL-0` can be read as "0.5 hours of high light for sample 0".

_Quick Note:_ This format is better in terms of readability, but it might not be best for future coding uses and analyses! In data science, we sometimes have to compromise between readability and practicality. In this case, we want you to understand the data well, so we chose to emphasize readability over practicality.
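The relationship between the two naming formats can also be written as a small function. This is only a sketch: it assumes every original label looks like `HL.0.5h0` (light level, a period, the hours, 'h', one replication digit), and the function name `relabel` and the test label `ML.12h3` are hypothetical. The lab itself uses a ready-made list of names instead.

```python
import re

def relabel(old_name):
    """Convert a label such as 'HL.0.5h0' to '0.5HL-0' (assumed format)."""
    match = re.fullmatch(r"(HL|ML)\.(\d+(?:\.\d+)?)h(\d)", old_name)
    if match is None:
        raise ValueError("Unexpected column name: " + old_name)
    light, hours, replicate = match.groups()
    return hours + light + "-" + replicate

print(relabel("HL.0.5h0"))  # 0.5HL-0
print(relabel("ML.12h3"))   # 12ML-3
```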

In [None]:
rna_new_columns = ['0.5HL-0', '0.5HL-1', '0.5HL-2', '0.5HL-3',
        '12HL-0', '12HL-1', '12HL-2', '12HL-3', 
        '1HL-0', '1HL-1', '1HL-2', '1HL-3', 
        '3HL-0', '3HL-1', '3HL-2', '3HL-3', 
        '6HL-0', '6HL-1', '6HL-2', '6HL-3',

        '0.5ML-0', '0.5ML-1', '0.5ML-2','0.5ML-3', 
        '0ML-0', '0ML-1', '0ML-2', '0ML-3', 
        '12ML-0', '12ML-1', '12ML-2', '12ML-3', 
        '1ML-0', '1ML-1', '1ML-2', '1ML-3',
        '3ML-0', '3ML-1', '3ML-2', '3ML-3', 
        '6ML-0', '6ML-1', '6ML-2', '6ML-3']

Relabel the columns of `rna_data` to the given labels in `rna_new_columns`.

In [None]:
# EXERCISE

rna_data.columns = rna_new_columns # SOLUTION
rna_data.head()

### 3. Understanding the Data Values

To get a better understanding of the data we are working with, take a look at the data in the first 10 rows and first 10 columns. 
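One way to select a block of rows and columns by position is `iloc`. The sketch below uses a made-up 20 x 15 table of integers, not the lab's data, and the names `toy_table` and `toy_corner` are hypothetical.

```python
import numpy as np
import pandas as pd

# A made-up 20 x 15 table of integers standing in for real data.
toy_table = pd.DataFrame(np.arange(300).reshape(20, 15))

# iloc selects by position: rows first, then columns.
toy_corner = toy_table.iloc[:10, :10]
print(toy_corner.shape)  # (10, 10)
```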

In [None]:
# EXERCISE

rna_data ...

**Question 4** We have a full understanding of our data table's labels, so let's now consider what the values in the table represent. What data type(s) are the values in the table? What do you think they might represent?

_Your answer here_

Check if there is any missing data in the following cell. Based on your answer above, think about whether this would affect our data analysis later.
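As a sketch of one common way to count missing values, the example below uses `isnull()` on a made-up table that contains a single missing cell. The table and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

toy_table = pd.DataFrame({
    "sample_a": [5.0, np.nan, 2.0],
    "sample_b": [1.0, 4.0, 6.0],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = toy_table.isnull().sum()
print(missing_per_column)

# Summing again gives the total number of missing cells in the table.
total_missing = int(missing_per_column.sum())
print(total_missing)  # 1
```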

In [None]:
# EXERCISE

rna_data ...

It's a good thing we have no missing data! It seems like all of our data values are numbers, so let's see what range of values the `0.5HL-0` column covers. Find the minimum value, maximum value, and mean in the `0.5HL-0` column.
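Summary statistics like these are available as methods on a single column. A minimal sketch on a made-up column (the values and the name `toy_column` are hypothetical):

```python
import pandas as pd

toy_column = pd.Series([0, 12, 250, 98000], name="sample_a")

print(toy_column.min())   # 0
print(toy_column.max())   # 98000
print(toy_column.mean())  # 24565.5
```

Notice how one very large value pulls the mean far above most of the other values.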

In [None]:
# EXERCISE

min_val_1 = rna_data["0.5HL-0"] ...
max_val_1 = ...
mean_val_1 = ...

print("Minimum 0.5HL-0 Val: " + str(min_val_1))
print("Maximum 0.5HL-0 Val: " + str(max_val_1))
print("Mean 0.5HL-0 Val: " + str(mean_val_1))

It seems like there is a large range of numbers in this column. Choose another column and check if its range of values is just as large as that of `0.5HL-0`. Feel free to try multiple different columns.

In [None]:
# EXERCISE

min_val_2 = rna_data["0.5ML-0"] ...
max_val_2 = ...
mean_val_2 = ...

print("Minimum Val: " + str(min_val_2))
print("Maximum Val: " + str(max_val_2))
print("Mean Val: " + str(mean_val_2))

Why is the range of values so broad for most columns?

The values in our data table are raw counts: roughly, how many RNA sequencing reads were matched to each gene in each sample. The more a gene is "turned on" under the given light conditions, the higher its count. Some genes may turn on more under lower light conditions while others may turn on more under higher light conditions. We might also see that some genes turn on more after longer light exposure than after shorter light exposure.

To analyze this, however, we need to be able to work with numbers that range from 0 to the hundreds of thousands! Tomorrow we will address this issue, so let's save our progress, `rna_data`, as "rna_data_cleaned.csv".
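When saving a table whose index holds meaningful labels, it helps to know that `to_csv` writes the index as the first column by default. The sketch below saves a made-up table to a temporary folder and reads it back; the file name `toy_table.csv` and all values are hypothetical.

```python
import os
import tempfile
import pandas as pd

toy_table = pd.DataFrame(
    {"sample_a": [5, 0]},
    index=pd.Index(["Cz01g00010", "Cz01g00020"], name="tracking_id"),
)

# Write to a temporary folder so the example does not touch the lab's files.
toy_path = os.path.join(tempfile.mkdtemp(), "toy_table.csv")
toy_table.to_csv(toy_path)  # the index becomes the first column of the file

# index_col=0 restores the first column as the index when reading back.
toy_reloaded = pd.read_csv(toy_path, index_col=0)
print(list(toy_reloaded.index))
```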

In [None]:
# EXERCISE

...

Notebook developed by: Ciara Acosta & Sharon Greenblum