<div>
<img src="https://www.nasa.gov/wp-content/uploads/2024/07/osdr-gl4hs-logo.png" width="600"/>
</div>

# **NOTEBOOK 5: Quantifying gene expression**
In this notebook, you will assign each gene a number that represents how many RNA transcripts of that gene your sample expressed in the tissue at one point in time.

## **Objectives of this notebook**
The primary objective of this notebook is to quantify gene expression from your sample's reduced chromosome 17 alignment. You will then compare the gene expression counts for your sample's chromosome 17 with those obtained by the GeneLab processing team. Because your reads were reduced, expect your counts to be smaller than GeneLab's by a factor of roughly `REDUCTION_FACTOR`.
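
As a rough illustration of that expectation, the sketch below divides a GeneLab count by the reduction factor. The count (1109) and the factor (100) are taken from outputs shown later in this notebook; the function name `expected_reduced_count` is made up for this example.

```python
# Hypothetical helper: estimate the count you would expect after read reduction.
def expected_reduced_count(full_count: float, reduction_factor: int) -> float:
    """Return the count expected if only 1/reduction_factor of the reads were kept."""
    return full_count / reduction_factor

# GeneLab reports 1109 reads for ENSMUSG00000000127 in this sample.
estimate = expected_reduced_count(1109, 100)
print(estimate)
```

Individual genes will deviate from this estimate. In the run shown in this notebook, the same gene received 14 counts, so treat the factor as an order-of-magnitude guide.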

## **UNIX commands introduced in this notebook**

`grep` command to search for lines in files that have a matching pattern.

`htseq-count` command to quantify gene expression.



# Prepare your environment for this lab

In [1]:
# mount your google drive
from google.colab import drive
drive.flush_and_unmount()
drive.mount("mnt")


Drive not mounted, so nothing to flush and unmount.
Mounted at mnt


In [2]:
# time the notebook
import datetime
start_time=datetime.datetime.now()
print('notebook start time: ', start_time.strftime('%Y-%m-%d %H:%M:%S'))

notebook start time:  2025-07-15 02:58:59


In [3]:
# set variables for the OSD dataset used in this lab
OSD_DATASET='104'
GLDS_DATASET='104'

In [4]:
# set FASTQ_DIR directory location in google drive
import os
FASTQ_DIR="/content/mnt/MyDrive/NASA/GL4HS/FASTQ"
if not os.path.exists(FASTQ_DIR):
  raise Exception("STOP! You must finish the previous notebooks before running this one")

In [5]:
# read the sample name saved by the first notebook
import os
with open(f"{FASTQ_DIR}/SAMPLE_NAME.txt", "r") as f:
  OSD_SAMPLE=f.read().strip()
if not OSD_SAMPLE:
  raise Exception("STOP! You must finish the previous notebooks before running this one")
print(OSD_SAMPLE)

Mmus_C57-6J_SLS_GC_Rep1_M33


In [6]:
# read the reduction factor saved by the first notebook (stored as text)
import os
with open(f"{FASTQ_DIR}/REDUCTION_FACTOR.txt", "r") as f:
  REDUCTION_FACTOR=f.read().strip()
if not REDUCTION_FACTOR:
  raise Exception("STOP! You must finish the previous notebooks before running this one")
print(REDUCTION_FACTOR)

100
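
Note that `REDUCTION_FACTOR` is read from the file as text, so it has to be converted to a number before you can do arithmetic with it. The sketch below shows one way to do that. The helper name `read_int_setting` is hypothetical, and a temporary file stands in for `REDUCTION_FACTOR.txt` so the example runs anywhere.

```python
import os
import tempfile

def read_int_setting(path: str) -> int:
    """Read a text file that holds a single integer and return it as an int."""
    with open(path, "r") as f:
        text = f.read().strip()  # drop the trailing newline, if any
    if not text:
        raise ValueError(f"{path} is empty")
    return int(text)

# Demonstration: a temporary file stands in for REDUCTION_FACTOR.txt
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "REDUCTION_FACTOR.txt")
    with open(demo_path, "w") as f:
        f.write("100\n")
    reduction_factor = read_int_setting(demo_path)

# With an int you can compute the expected fraction of counts
print(reduction_factor, 1 / reduction_factor)
```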



In [7]:
# set REFERENCE_DIR directory location in google drive
import os
REFERENCE_DIR="/content/mnt/MyDrive/NASA/GL4HS/REFERENCE"
if not os.path.exists(REFERENCE_DIR):
  raise Exception("STOP! You must finish the previous notebooks before running this one")

In [8]:
# set ALIGNMENT_DIR directory location in google drive
import os
ALIGNMENT_DIR="/content/mnt/MyDrive/NASA/GL4HS/STAR/ALIGNMENT"
if not os.path.exists(ALIGNMENT_DIR):
  raise Exception("STOP! You must finish the previous notebooks before running this one")

In [9]:
# set COUNTS_DIR directory location in google drive
import os
COUNTS_DIR="/content/mnt/MyDrive/NASA/GL4HS/COUNTS"
if not os.path.exists(COUNTS_DIR):
  !mkdir -p {COUNTS_DIR}

In [10]:
# install htseq
!pip install HTSeq

Collecting HTSeq
  Downloading HTSeq-2.0.9-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (3.8 kB)
Collecting pysam (from HTSeq)
  Downloading pysam-0.23.3-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.7 kB)
Downloading HTSeq-2.0.9-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.2/2.2 MB 35.8 MB/s eta 0:00:00
Downloading pysam-0.23.3-cp311-cp311-manylinux_2_28_x86_64.whl (26.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26.5/26.5 MB 88.3 MB/s eta 0:00:00
Installing collected packages: pysam, HTSeq
Successfully installed HTSeq-2.0.9 pysam-0.23.3


In [11]:
# determine the version of htseq installed
!htseq-count --version

2.0.9


In [12]:
# download gene annotation for GRCm39 (mouse)
import os
if not os.path.exists(f"{REFERENCE_DIR}/gencode.vM36.primary_assembly.basic.annotation.gtf.gz"):
  !mkdir -p {REFERENCE_DIR}
  !wget -O {REFERENCE_DIR}/gencode.vM36.primary_assembly.basic.annotation.gtf.gz \
    https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M36/gencode.vM36.primary_assembly.basic.annotation.gtf.gz

In [13]:
# keep only the chr17 annotations
# rename "chr17" to "17" in the first column so the chromosome name matches what htseq-count sees in the alignment file
# gzip the filtered file
!gunzip -c {REFERENCE_DIR}/gencode.vM36.primary_assembly.basic.annotation.gtf.gz > {REFERENCE_DIR}/gencode.vM36.primary_assembly.basic.annotation.gtf
!grep ^chr17 {REFERENCE_DIR}/gencode.vM36.primary_assembly.basic.annotation.gtf | sed 's/^chr17/17/' > {REFERENCE_DIR}/chr17.gtf
!gzip -c {REFERENCE_DIR}/chr17.gtf > {REFERENCE_DIR}/chr17.gtf.gz
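
If you want to see what the `grep` and `sed` steps do, here is a rough Python equivalent. It is only a sketch: the function name `keep_chromosome` and the toy lines are made up for illustration, and it matches the first column exactly, which is slightly stricter than `grep ^chr17` (that pattern keeps any line that starts with `chr17`).

```python
def keep_chromosome(lines, chrom="chr17", new_name="17"):
    """Yield GTF lines located on `chrom`, with the first column renamed to `new_name`."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if fields[0] == chrom:
            fields[0] = new_name
            yield "\t".join(fields)

# Toy input: one header line, one chr17 gene, one gene on another chromosome
demo = [
    "##description: toy GTF header\n",
    'chr17\tHAVANA\tgene\t3091857\t3093685\t.\t+\t.\tgene_id "ENSMUSG00000122713.1";\n',
    'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "TOY_GENE_1";\n',
]
filtered = list(keep_chromosome(demo))
print(filtered)
```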


Read the [GTF documentation](https://www.gencodegenes.org/mouse/) and the [GTF Wikipedia page](https://en.wikipedia.org/wiki/Gene_transfer_format) for more information about basic mouse gene annotation.
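
Each GTF line has nine tab-separated columns, and the last column is a list of `key "value";` attributes. The sketch below splits one gene line (copied from the chr17 annotation used in this notebook) into a dictionary. The function name `parse_gtf_line` is hypothetical, and the attribute parsing is deliberately simple: it assumes no attribute value contains a semicolon.

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]

def parse_gtf_line(line: str) -> dict:
    """Split one GTF line into its nine columns and parse the attribute column."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attrs = {}
    for item in record["attributes"].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        # a key such as "tag" can appear more than once, so collect values in a list
        attrs.setdefault(key, []).append(value.strip('"'))
    record["attributes"] = attrs
    return record

demo_line = (
    '17\tHAVANA\tgene\t3091857\t3093685\t.\t+\t.\t'
    'gene_id "ENSMUSG00000122713.1"; gene_type "lncRNA"; '
    'gene_name "ENSMUSG00000122713"; level 2; '
    'tag "overlapping_locus"; tag "overlaps_pseudogene";'
)
gene = parse_gtf_line(demo_line)
gene_length = gene["end"] - gene["start"] + 1  # GTF coordinates are 1-based and inclusive
print(gene["feature"], gene["attributes"]["gene_type"], gene_length)
```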

In [14]:
# check the first 10 lines of the GTF annotation file
!head -10 {REFERENCE_DIR}/chr17.gtf

17	HAVANA	gene	3091857	3093685	.	+	.	gene_id "ENSMUSG00000122713.1"; gene_type "lncRNA"; gene_name "ENSMUSG00000122713"; level 2; tag "overlapping_locus"; tag "overlaps_pseudogene";
17	HAVANA	transcript	3091866	3093685	.	+	.	gene_id "ENSMUSG00000122713.1"; transcript_id "ENSMUST00000254456.1"; gene_type "lncRNA"; gene_name "ENSMUSG00000122713"; transcript_type "lncRNA"; transcript_name "ENSMUST00000254456"; level 2; tag "basic"; tag "Ensembl_canonical"; tag "TAGENE";
17	HAVANA	exon	3091866	3092128	.	+	.	gene_id "ENSMUSG00000122713.1"; transcript_id "ENSMUST00000254456.1"; gene_type "lncRNA"; gene_name "ENSMUSG00000122713"; transcript_type "lncRNA"; transcript_name "ENSMUST00000254456"; exon_number 1; exon_id "ENSMUSE00001505513.1"; level 2; tag "basic"; tag "Ensembl_canonical"; tag "TAGENE";
17	HAVANA	exon	3092284	3093685	.	+	.	gene_id "ENSMUSG00000122713.1"; transcript_id "ENSMUST00000254456.1"; gene_type "lncRNA"; gene_name "ENSMUSG00000122713"; transcript_type "lncRNA"; transcript_n

Question: What is lncRNA? Feel free to read more about that in this [Wikipedia article](https://en.wikipedia.org/wiki/Long_non-coding_RNA).

# Use HTSeq to quantify gene expression

Read the [htseq-count manual](https://htseq.readthedocs.io/en/master/htseqcount.html) for more information.
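
To build intuition for what counting reads per gene means, here is a toy version of the idea behind `htseq-count`'s default "union" mode. It is a simplified sketch, not the tool's implementation: it ignores strand, exons, paired-end mates, and alignment quality, and all names and coordinates are made up.

```python
from collections import Counter

def count_reads(genes: dict, reads: list) -> Counter:
    """Count reads per gene.

    genes: gene_id -> (start, end); reads: list of (start, end).
    Coordinates are 1-based and inclusive.
    """
    counts = Counter({gene_id: 0 for gene_id in genes})
    for r_start, r_end in reads:
        # collect every gene whose interval overlaps the read
        hits = [g for g, (g_start, g_end) in genes.items()
                if r_start <= g_end and g_start <= r_end]
        if len(hits) == 1:
            counts[hits[0]] += 1          # read belongs to exactly one gene
        elif not hits:
            counts["__no_feature"] += 1   # read overlaps no gene
        else:
            counts["__ambiguous"] += 1    # read overlaps more than one gene
    return counts

toy_genes = {"geneA": (100, 200), "geneB": (180, 300)}
toy_reads = [(110, 150), (120, 160), (190, 195), (250, 290), (400, 450)]
toy_counts = count_reads(toy_genes, toy_reads)
print(dict(toy_counts))
```

Reads that cannot be assigned to exactly one gene are tallied under special names that start with `__` rather than being added to any gene.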

In [15]:
# run htseq to quantify the gene expression

!htseq-count -n 2 \
  --format bam \
  --order pos \
  --stranded reverse \
  {ALIGNMENT_DIR}/chr17Aligned.out.bam \
  {REFERENCE_DIR}/chr17.gtf.gz \
  > {COUNTS_DIR}/chr17-counts.tsv

73633 GFF lines processed.
36965 alignment record pairs processed.


`htseq-count` may print a warning about "mate records missing". You can ignore it; this is a known bug in the `htseq-count` software. Read [this GitHub issue](https://github.com/simon-anders/htseq/issues/37) for more information if you're curious.

In [16]:
# look at the first 10 lines of the counts file
!head -10 {COUNTS_DIR}/chr17-counts.tsv

ENSMUSG00000000127.16	14
ENSMUSG00000000579.15	2
ENSMUSG00000000673.10	0
ENSMUSG00000000708.15	71
ENSMUSG00000001227.13	1
ENSMUSG00000001228.15	1
ENSMUSG00000001229.10	18
ENSMUSG00000001524.15	4
ENSMUSG00000001525.11	49
ENSMUSG00000001576.16	13


In [17]:
# read count data from file into dataframe
import pandas as pd
counts_df=pd.read_csv(f"{COUNTS_DIR}/chr17-counts.tsv", sep="\t", header=None)
counts_df.head()

Unnamed: 0,0,1
0,ENSMUSG00000000127.16,14
1,ENSMUSG00000000579.15,2
2,ENSMUSG00000000673.10,0
3,ENSMUSG00000000708.15,71
4,ENSMUSG00000001227.13,1


In [18]:
# remove any rows from the counts_df that do not begin with 'ENSMUSG'
print('length before filter: ', len(counts_df))
counts_df=counts_df[counts_df[0].str.startswith('ENSMUSG')]
print('length after filter: ', len(counts_df))
counts_df[0]

length before filter:  3329
length after filter:  3324


Unnamed: 0,0
0,ENSMUSG00000000127.16
1,ENSMUSG00000000579.15
2,ENSMUSG00000000673.10
3,ENSMUSG00000000708.15
4,ENSMUSG00000001227.13
...,...
3319,ENSMUSG00002076711.1
3320,ENSMUSG00002076730.1
3321,ENSMUSG00002076750.1
3322,ENSMUSG00002076816.1
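
The five rows removed by the filter (3329 minus 3324) are consistent with the special counters that `htseq-count` appends to its output, whose names begin with `__`. The sketch below separates the two kinds of rows using only the standard library. The two gene counts come from this notebook's output, but the values given to the special counters are invented for illustration.

```python
import csv
import io

# Toy counts file: two real gene rows plus the five special counters
# (the special counter VALUES below are made up)
demo_tsv = (
    "ENSMUSG00000000127.16\t14\n"
    "ENSMUSG00000000579.15\t2\n"
    "__no_feature\t5\n"
    "__ambiguous\t1\n"
    "__too_low_aQual\t0\n"
    "__not_aligned\t0\n"
    "__alignment_not_unique\t3\n"
)

gene_counts, summary = {}, {}
for name, value in csv.reader(io.StringIO(demo_tsv), delimiter="\t"):
    # rows whose name starts with "__" are summary counters, not genes
    target = summary if name.startswith("__") else gene_counts
    target[name] = int(value)

print(len(gene_counts), "gene rows;", len(summary), "summary rows")
```

Keeping the summary rows in their own dictionary lets you check how many reads were left unassigned before you discard them.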


# Compare your count data to the GeneLab-processed count data for the same sample

In [19]:
# open another tab in your web browser and navigate to the following site:
url=!echo https://visualization.osdr.nasa.gov/biodata/api/v2/dataset/OSD-{OSD_DATASET}/files/\?format=browser
print(url[0])

https://visualization.osdr.nasa.gov/biodata/api/v2/dataset/OSD-104/files/?format=browser


In [20]:
# download genelab-processed data for OSD dataset
import pandas as pd
url = 'https://osdr.nasa.gov/geode-py/ws/studies/OSD-' + OSD_DATASET + '/download?source=datamanager&file=GLDS-' + GLDS_DATASET + '_rna_seq_Unnormalized_Counts.csv'
osd_df = pd.read_csv(url)
osd_df.head()

Unnamed: 0.1,Unnamed: 0,Mmus_C57-6J_SLS_FLT_Rep1_M23,Mmus_C57-6J_SLS_FLT_Rep2_M24,Mmus_C57-6J_SLS_FLT_Rep3_M25,Mmus_C57-6J_SLS_FLT_Rep4_M26,Mmus_C57-6J_SLS_FLT_Rep5_M27,Mmus_C57-6J_SLS_FLT_Rep6_M28,Mmus_C57-6J_SLS_GC_Rep1_M33,Mmus_C57-6J_SLS_GC_Rep2_M34,Mmus_C57-6J_SLS_GC_Rep3_M35,Mmus_C57-6J_SLS_GC_Rep4_M36,Mmus_C57-6J_SLS_GC_Rep5_M37,Mmus_C57-6J_SLS_GC_Rep6_M38
0,ENSMUSG00000000001,869.0,706.0,835.0,871.0,1069.0,700.0,1205.0,908.0,1078.0,877.0,884.0,911.0
1,ENSMUSG00000000003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,ENSMUSG00000000028,91.0,121.0,87.0,139.0,193.0,104.0,145.0,117.0,135.0,114.0,128.0,130.0
3,ENSMUSG00000000031,68226.0,64898.0,85595.0,90775.0,71300.0,56376.0,143726.0,104717.0,128088.0,115071.0,108243.0,118878.0
4,ENSMUSG00000000037,7.0,8.0,9.0,11.0,11.0,5.0,21.0,10.0,10.0,5.0,3.0,16.0


In [21]:
# get the first 20 gene IDs from your counts to compare with the GeneLab counts for OSD_SAMPLE
# strip the version suffix (everything after the ".") from each Ensembl gene ID
sample_genes = list(counts_df[0].values)[:20]
sample_genes = [gene.split(".")[0] for gene in sample_genes]
sample_genes

['ENSMUSG00000000127',
 'ENSMUSG00000000579',
 'ENSMUSG00000000673',
 'ENSMUSG00000000708',
 'ENSMUSG00000001227',
 'ENSMUSG00000001228',
 'ENSMUSG00000001229',
 'ENSMUSG00000001524',
 'ENSMUSG00000001525',
 'ENSMUSG00000001576',
 'ENSMUSG00000001870',
 'ENSMUSG00000002017',
 'ENSMUSG00000002076',
 'ENSMUSG00000002249',
 'ENSMUSG00000002250',
 'ENSMUSG00000002257',
 'ENSMUSG00000002274',
 'ENSMUSG00000002279',
 'ENSMUSG00000002280',
 'ENSMUSG00000002289']
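
Your counts use versioned Ensembl IDs (for example `ENSMUSG00000000127.16`), while the GeneLab table uses unversioned IDs, so the version suffix has to be removed before the two can be matched. The sketch below shows one way to pair the counts gene by gene while keeping your gene order. The helper name `strip_version` is hypothetical, and one gene is deliberately left out of the toy GeneLab dictionary to show how a missing gene is handled.

```python
def strip_version(gene_id: str) -> str:
    """Drop the version suffix from an Ensembl gene ID (text after the first '.')."""
    return gene_id.split(".")[0]

# Your counts, keyed by versioned ID (values from this notebook's output)
your_counts = {
    "ENSMUSG00000000127.16": 14,
    "ENSMUSG00000000579.15": 2,
    "ENSMUSG00000000673.10": 0,
}
# Toy GeneLab counts: different order, and ENSMUSG00000000579 deliberately omitted
genelab_counts = {
    "ENSMUSG00000000673": 1.0,
    "ENSMUSG00000000127": 1109.0,
}

paired = []
for versioned_id, mine in your_counts.items():
    gene = strip_version(versioned_id)
    # .get() returns None when GeneLab has no row for this gene
    paired.append((gene, genelab_counts.get(gene), mine))

for row in paired:
    print(*row, sep="\t")
```

Looking up each gene by ID, instead of relying on both tables being in the same order, keeps every row aligned even if a gene is missing from one side.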

In [22]:
# look up the GeneLab counts for the first 20 genes in this sample's column
# use a dictionary keyed by gene ID so the values stay in the same order as sample_genes
osd_counts = dict(zip(osd_df['Unnamed: 0'], osd_df[OSD_SAMPLE]))
gene_counts_from_osd = [osd_counts.get(gene) for gene in sample_genes]

In [23]:
# capture the first 20 lines of the counts_df dataframe
gene_counts_from_you = list(counts_df[1].values)[:20]

In [24]:
# compare the counts side by side: gene ID, GeneLab count, your count
# (your count, the third column, should be roughly 1/REDUCTION_FACTOR of the GeneLab count)
for gene_count in zip(sample_genes, gene_counts_from_osd, gene_counts_from_you):
  print(gene_count[0], '\t', gene_count[1], '\t', gene_count[2])


ENSMUSG00000000127 	 1109.0 	 14
ENSMUSG00000000579 	 64.26 	 2
ENSMUSG00000000673 	 1.0 	 0
ENSMUSG00000000708 	 9792.0 	 71
ENSMUSG00000001227 	 257.0 	 1
ENSMUSG00000001228 	 17.0 	 1
ENSMUSG00000001229 	 2865.0 	 18
ENSMUSG00000001524 	 783.0 	 4
ENSMUSG00000001525 	 3951.0 	 49
ENSMUSG00000001576 	 2848.0 	 13
ENSMUSG00000001870 	 1164.0 	 7
ENSMUSG00000002017 	 2538.0 	 9
ENSMUSG00000002076 	 32.0 	 0
ENSMUSG00000002249 	 334.0 	 2
ENSMUSG00000002250 	 2114.0 	 19
ENSMUSG00000002257 	 64.0 	 0
ENSMUSG00000002274 	 760.0 	 4
ENSMUSG00000002279 	 500.0 	 1
ENSMUSG00000002280 	 1736.0 	 12
ENSMUSG00000002289 	 337.0 	 3


In [25]:
# look at the first 500 gene counts in both
# determine if fraction of abundance is approximately 1/REDUCTION_FACTOR
import numpy as np
count_fractions = list()
for gene in counts_df[0].values[:500]:
  _gene = gene.split(".")[0]
  if _gene in osd_df['Unnamed: 0'].values:
    genelab_val = osd_df[osd_df['Unnamed: 0'] == _gene][OSD_SAMPLE].values[0]
    your_val = int(counts_df[counts_df[0] == gene][1].values[0])
    if genelab_val != 0:
      frac = your_val/genelab_val
      count_fractions.append(frac)

print(np.mean(count_fractions))

0.018898268634986625
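
The mean of per-gene fractions gives every gene equal weight, so genes with very low counts (where a single read changes the fraction a lot) can pull the average around. One alternative summary, offered here only as a sketch and not as the method this notebook uses, is the pooled ratio: total of your counts divided by total of the GeneLab counts. The numbers below are the 20 genes printed in the side-by-side comparison above.

```python
# GeneLab counts and your counts for the first 20 genes (from the comparison above)
genelab = [1109.0, 64.26, 1.0, 9792.0, 257.0, 17.0, 2865.0, 783.0, 3951.0, 2848.0,
           1164.0, 2538.0, 32.0, 334.0, 2114.0, 64.0, 760.0, 500.0, 1736.0, 337.0]
yours = [14, 2, 0, 71, 1, 1, 18, 4, 49, 13, 7, 9, 0, 2, 19, 0, 4, 1, 12, 3]

# per-gene fractions, skipping genes with a GeneLab count of zero
ratios = [y / g for y, g in zip(yours, genelab) if g != 0]
mean_of_ratios = sum(ratios) / len(ratios)

# pooled ratio: weights each gene by how many reads it has
pooled_ratio = sum(yours) / sum(genelab)

print(f"mean of per-gene ratios: {mean_of_ratios:.4f}")
print(f"pooled ratio:            {pooled_ratio:.4f}")
```

For these 20 genes, both summaries are on the order of 1/`REDUCTION_FACTOR`, with the pooled ratio coming out lower than the mean of the per-gene ratios.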


# Check your work before moving on

In [26]:
# check disk space utilization in google drive (should be about 2.4GB)
!du -sh /content/mnt/MyDrive/NASA/GL4HS

2.4G	/content/mnt/MyDrive/NASA/GL4HS


In [27]:
# time the notebook
import datetime
end_time=datetime.datetime.now()
print('notebook end time: ', end_time.strftime('%Y-%m-%d %H:%M:%S'))

print('notebook runtime: ', end_time - start_time)

notebook end time:  2025-07-15 02:59:55
notebook runtime:  0:00:56.874533
