<div>
<img src="https://www.nasa.gov/wp-content/uploads/2024/07/osdr-gl4hs-logo.png" width="600"/>
</div>

# **NOTEBOOK 1: Downloading RNA-seq data from OSDR**

In this notebook, we will be downloading data from [OSD-104](https://osdr.nasa.gov/bio/repo/data/studies/OSD-104). This dataset includes [transcriptomic](https://en.wikipedia.org/wiki/Transcriptome) data that were collected during the NASA [Rodent Research 1 mission (RR1)](https://nlsp.nasa.gov/view/lsdapub/lsda_experiment/7633383e-fe88-5ada-92dd-cd31a50fda32).

## **OSD-104 dataset**

NASA's Rodent Research (RR) project advances biomedical research on spaceflight’s physiological effects. With limited ISS resources, efficient crew time use is vital. GeneLab enhances RR value by applying multi-omics approaches and sharing data openly.

The OSD-104 study was part of a [SpaceX-4 mission](https://osdr.nasa.gov/bio/repo/data/missions/SpaceX-4) that started on September 21, 2014 and ended on October 25, 2014. There were several biological payloads on this mission, including rodents, fruit flies, plants, and yeast. These data are stored in the [Ames Life Science Data Archive](https://techport.nasa.gov/projects/145035).

The OSD-104 study used rodents to capture transcriptomic and epigenomic data from mouse muscle.

The [RNA sequencing](https://en.wikipedia.org/wiki/RNA-Seq) was performed on the [Illumina HiSeq 4000](https://support.illumina.com/sequencing/sequencing_instruments/hiseq-4000/documentation.html) platform, using three high-output HiSeq lanes to generate 100 bp [paired-end](https://www.biostars.org/p/314258/) [stranded](https://www.biostars.org/p/9482535/) reads. The RNA-sequencing experiment was performed by the [UC Davis Genome Center](https://genomecenter.ucdavis.edu/).

## **Objectives of this notebook**
The primary objective of this notebook is to download two FASTQ files, one for each read of the paired-end reads (R1 and R2) from a single sample. In subsequent notebooks, we will use this downloaded data to perform quality checking, alignment, and quantification. You can read more about FASTQ files in this [Wikipedia page](https://en.wikipedia.org/wiki/FASTQ_format).

Because the free tier of Google Drive has a [15GB limit](https://support.google.com/googleone/answer/9312312?hl=en), we have to pay careful attention to how much space we use in the labs. We provide a Python function that reduces the number of records in the FASTQ files by a configurable factor. Removing good records from a FASTQ file is not something one would normally do; we do it in these labs so that you can work through every step and still stay under the 15GB storage limit.

## **UNIX commands introduced in this notebook**

[`curl`](https://www.man7.org/linux/man-pages/man1/curl.1.html) command to download files from the Internet.

[`sed`](https://www.man7.org/linux/man-pages/man1/sed.1p.html) command to search and replace content in files.

[`head`](https://www.man7.org/linux/man-pages/man1/head.1p.html) command to display the first n lines of a file.

[`ls`](https://www.man7.org/linux/man-pages/man1/ls.1p.html) command to list the contents of a directory.

[`rm`](https://www.man7.org/linux/man-pages/man1/rm.1p.html) command to remove files and directories.

[`gzip`](https://www.gnu.org/software/gzip/manual/gzip.html) command to compress files.

[`du`](https://man7.org/linux/man-pages/man1/du.1.html) command to determine how much space our files take up.



# Prepare runtime environment for this lab

In this first section, you will mount your Google Drive at the notebook's `/content/mnt` directory, set variables that will be used in the notebook, and create directories that will be used throughout the course.


In [2]:
# mount google drive
from google.colab import drive
drive.flush_and_unmount()
drive.mount("mnt")

Mounted at mnt


You can read more about mounting Google Drive to your Colab notebook in [this document](https://colab.research.google.com/notebooks/io.ipynb#scrollTo=u22w3BFiOveA).

In [3]:
# time the notebook
import datetime
start_time = datetime.datetime.now()
print('notebook start time: ', start_time.strftime('%Y-%m-%d %H:%M:%S'))

notebook start time:  2025-07-14 02:03:31


In [4]:
# create directory on google drive
import os
FASTQ_DIR="/content/mnt/MyDrive/NASA/GL4HS/FASTQ"
# exist_ok=True makes this a no-op when the directory is already there
os.makedirs(FASTQ_DIR, exist_ok=True)

In [5]:
# define OSD and GLDS dataset for notebook
OSD_DATASET='104'
GLDS_DATASET='104'

In [6]:
# define GLDS sample name
SAMPLE_NAME='Mmus_C57-6J_SLS_GC_Rep1_M33'
SAMPLE_FILE_NAME='GLDS-' + GLDS_DATASET + '_rna_seq_' + SAMPLE_NAME
!echo $SAMPLE_NAME > {FASTQ_DIR}/SAMPLE_NAME.txt

In [7]:
# define fastq file names
# these are paired end reads, so there are 2 fastq files per sample
FASTQ_FILE_NAME_1=SAMPLE_FILE_NAME + '_R1_raw.fastq.gz'
FASTQ_FILE_NAME_2=SAMPLE_FILE_NAME + '_R2_raw.fastq.gz'

In [8]:
# set the factor by which to reduce the number of records in each fastq file
# write it to a file in case it is needed in a subsequent notebook
# we reduce the files to save storage space and time, because this is a learning exercise
REDUCTION_FACTOR=100
!echo $REDUCTION_FACTOR > {FASTQ_DIR}/REDUCTION_FACTOR.txt

# Download FASTQ files from OSDR
In this section, you will download 2 FASTQ files from the [Open Science Data Repository](https://www.nasa.gov/osdr/). These 2 FASTQ files will be used for the first several notebooks so you can learn how to process RNA-seq data.
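
The `curl` commands in this section build each download URL by substituting notebook variables into a shell command. As a minimal sketch of the same construction in Python, the hypothetical helper `osdr_download_url` (not part of the notebook or of any OSDR library) assembles the URL pattern used by those commands. It only builds the string and does not contact OSDR.

```python
from urllib.parse import urlencode

# Base of the download endpoint, copied from the curl commands in this notebook
OSDR_STUDY_URL = "https://osdr.nasa.gov/geode-py/ws/studies"

def osdr_download_url(osd_dataset, file_name):
    """Hypothetical helper: build the download URL for one file of an OSD study."""
    # urlencode escapes any characters that are not safe inside a query string
    query = urlencode({"source": "datamanager", "file": file_name})
    return f"{OSDR_STUDY_URL}/OSD-{osd_dataset}/download?{query}"

sample_file = "GLDS-104_rna_seq_Mmus_C57-6J_SLS_GC_Rep1_M33_R1_raw.fastq.gz"
print(osdr_download_url("104", sample_file))
```

Building the query with `urlencode` avoids the backslash escaping of `?` and `&` that the shell version needs.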

In [9]:
# open another tab in your web browser and navigate to the following site
# you don't need to directly download anything from this Web site in your Web browser.
# we are showing this to you so that you know where the data is coming from.
# files in OSDR associated w/ OSD 104
url = !echo https://visualization.osdr.nasa.gov/biodata/api/v2/dataset/OSD-{OSD_DATASET}/files/\?format=browser
print(url[0])

https://visualization.osdr.nasa.gov/biodata/api/v2/dataset/OSD-104/files/?format=browser


In [10]:
# download fastq files for sample from OSD
# this could take up to 10 minutes per file
# a line starting with ! is run by the unix shell; all other lines are python

!curl -L --output {FASTQ_DIR}/{FASTQ_FILE_NAME_1} \
   https://osdr.nasa.gov/geode-py/ws/studies/OSD-{OSD_DATASET}/download\?source=datamanager\&file={FASTQ_FILE_NAME_1}


!curl -L --output {FASTQ_DIR}/{FASTQ_FILE_NAME_2} \
   https://osdr.nasa.gov/geode-py/ws/studies/OSD-{OSD_DATASET}/download\?source=datamanager\&file={FASTQ_FILE_NAME_2}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3876  100  3876    0     0   4247      0 --:--:-- --:--:-- --:--:--  4250
100 4437M  100 4437M    0     0  25.4M      0  0:02:54  0:02:54 --:--:-- 23.0M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3868  100  3868    0     0   5389      0 --:--:-- --:--:-- --:--:--  5387
100 4746M  100 4746M    0     0  30.9M      0  0:02:33  0:02:33 --:--:-- 44.3M


In [11]:
# check the sizes of the gzip-compressed fastq files
# (should be around 4.5GB each)
!ls -lh {FASTQ_DIR}/{FASTQ_FILE_NAME_1}
!ls -lh {FASTQ_DIR}/{FASTQ_FILE_NAME_2}


-rw------- 1 root root 4.4G Jul 14 02:09 /content/mnt/MyDrive/NASA/GL4HS/FASTQ/GLDS-104_rna_seq_Mmus_C57-6J_SLS_GC_Rep1_M33_R1_raw.fastq.gz
-rw------- 1 root root 4.7G Jul 14 02:11 /content/mnt/MyDrive/NASA/GL4HS/FASTQ/GLDS-104_rna_seq_Mmus_C57-6J_SLS_GC_Rep1_M33_R2_raw.fastq.gz


In [13]:
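# read the first 10 bases of the first record from r1
# read the last 10 bases of the first record in reverse from r2
# the raw files are gzip-compressed, so decompress them to a pipe first
# (running sed directly on a .gz file prints binary garbage)
# head stops after 2 lines, so the whole file does not have to be decompressed
!gzip -cd {FASTQ_DIR}/{FASTQ_FILE_NAME_1} | head -n 2 | sed -n '2,2 p' | cut -c 1-10
!gzip -cd {FASTQ_DIR}/{FASTQ_FILE_NAME_2} | head -n 2 | sed -n '2,2 p' | rev | cut -c 1-10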


# Reduce FASTQ file sizes

In this section, you will define and run a Python function that reduces the size of a FASTQ file. We do this because your free Google account has a limit of 15GB of storage in Google Drive, and the FASTQ files we are using take up almost 10GB per sample.

The function keeps only a fraction of the records in the FASTQ file: one record out of every `REDUCTION_FACTOR`, where `REDUCTION_FACTOR` is the Python variable you set earlier in this notebook.
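
The selection rule can be tried on a few in-memory lines before it is applied to a multi-gigabyte file. The sketch below is a simplified illustration, not the notebook's own code: `keep_every_nth_record` is a hypothetical helper, and the six toy records are made up.

```python
def keep_every_nth_record(lines, read_every):
    """Hypothetical helper: keep records 0, read_every, 2*read_every, ... from a list of FASTQ lines."""
    kept = []
    # A FASTQ record is 4 consecutive lines, so step through the list 4 lines at a time
    for record_index, start in enumerate(range(0, len(lines), 4)):
        if record_index % read_every == 0:
            kept.extend(lines[start:start + 4])
    return kept

# Six made-up records, for illustration only
toy_fastq = []
for i in range(6):
    toy_fastq += [f"@read{i}", "ACGT", "+", "FFFF"]

reduced = keep_every_nth_record(toy_fastq, 3)
print(reduced[0::4])  # header lines of the records that were kept
```

Assuming R1 and R2 list their reads in the same order, as paired FASTQ files normally do, applying the same factor to both files keeps the same read pairs in each.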

In [14]:
# define function to read a compressed FASTQ file and write a reduced, uncompressed copy
def reduce_fastq_gz(input_file, output_file, read_every):
  import datetime
  import gzip

  # Get current date and time
  start = datetime.datetime.now()
  print('starting reduce_fastq_gz at: ', start.strftime('%Y-%m-%d %H:%M:%S'))
  # 'rt' decompresses to text; 'w' overwrites any existing output file
  # the with statement closes both files, so no explicit close() is needed
  with gzip.open(input_file, 'rt') as f_in, open(output_file, 'w') as f_out:
    record_count = 0
    while True:
      # Read 4 lines (1 record)
      record = [f_in.readline() for _ in range(4)]
      if not record[0]:  # End of file
        break
      if record_count % read_every == 0:  # Keep every read_every'th record
        f_out.writelines(record)
      record_count += 1
  end = datetime.datetime.now()
  print('ending reduce_fastq_gz at: ', end.strftime('%Y-%m-%d %H:%M:%S'))

In [15]:
# call reduce_fastq_gz method to decompress and reduce size of R1 fastq file by factor of REDUCTION_FACTOR
# the file size should be about 200MB
# this can take up to 10 mins
import os
!ls -lh {FASTQ_DIR}/{FASTQ_FILE_NAME_1}
fastq_file_path = os.path.join(FASTQ_DIR, FASTQ_FILE_NAME_1)
reduced_fastq_file_path = os.path.join(FASTQ_DIR, 'reduced_r1.fastq')
if os.path.exists(reduced_fastq_file_path):
  os.remove(reduced_fastq_file_path)
reduce_fastq_gz(fastq_file_path, reduced_fastq_file_path, REDUCTION_FACTOR)
!ls -lh {FASTQ_DIR}/reduced_r1.fastq

-rw------- 1 root root 4.4G Jul 14 02:09 /content/mnt/MyDrive/NASA/GL4HS/FASTQ/GLDS-104_rna_seq_Mmus_C57-6J_SLS_GC_Rep1_M33_R1_raw.fastq.gz
starting reduce_fastq_gz at:  2025-07-14 02:14:09
ending reduce_fastq_gz at:  2025-07-14 02:20:00
-rw------- 1 root root 216M Jul 14 02:20 /content/mnt/MyDrive/NASA/GL4HS/FASTQ/reduced_r1.fastq


In [16]:
# remove the original FASTQ_FILE_NAME_1 to save space on google drive
!rm -f {FASTQ_DIR}/{FASTQ_FILE_NAME_1}

In [17]:
# call reduce_fastq_gz method to decompress and reduce size of R2 fastq file by factor of REDUCTION_FACTOR
# the file size should be about 200MB
# this can take up to 10 mins
!ls -lh {FASTQ_DIR}/{FASTQ_FILE_NAME_2}
fastq_file_path = os.path.join(FASTQ_DIR, FASTQ_FILE_NAME_2)
reduced_fastq_file_path = os.path.join(FASTQ_DIR, 'reduced_r2.fastq')
if os.path.exists(reduced_fastq_file_path):
  os.remove(reduced_fastq_file_path)
reduce_fastq_gz(fastq_file_path, reduced_fastq_file_path, REDUCTION_FACTOR)
!ls -lh {FASTQ_DIR}/reduced_r2.fastq

-rw------- 1 root root 4.7G Jul 14 02:11 /content/mnt/MyDrive/NASA/GL4HS/FASTQ/GLDS-104_rna_seq_Mmus_C57-6J_SLS_GC_Rep1_M33_R2_raw.fastq.gz
starting reduce_fastq_gz at:  2025-07-14 02:20:01
ending reduce_fastq_gz at:  2025-07-14 02:26:09
-rw------- 1 root root 216M Jul 14 02:26 /content/mnt/MyDrive/NASA/GL4HS/FASTQ/reduced_r2.fastq


In [18]:
# remove the original FASTQ_FILE_NAME_2 to save space on google drive
!rm -f {FASTQ_DIR}/{FASTQ_FILE_NAME_2}

# Examine and compress FASTQ files
In this section, you will use the `sed` command to read specific lines in each of the 2 FASTQ files.  You can learn more about quality scores on this [Wikipedia page](https://en.wikipedia.org/wiki/Phred_quality_score). You may notice quality scores decreasing as the read goes from the 5' end to the 3' end - a phenomenon discussed in [this article](https://scienceparkstudygroup.github.io/rna-seq-lesson/03-qc-of-sequencing-results/index.html#322-sequencing-error-profiles).

In [19]:
# read the first 10 bases of the first record from r1
# read the last 10 bases of the first record in reverse from r2
# the two outputs are usually not complementary: r2 is read from the opposite end
# of the cDNA fragment, so its last bases only pair with the first bases of r1
# when the fragment is exactly as long as the read
# the two records still come from the same fragment, read 5' to 3'
# from the 2 strands of cDNA
# fastq 2nd line: sequence data
!sed -n '2,2 p' {FASTQ_DIR}/reduced_r1.fastq | cut -c 1-10
!sed -n '2,2 p' {FASTQ_DIR}/reduced_r2.fastq | rev | cut -c 1-10

CTCCTTTCTC
TCACCGTGTG


In [20]:
# look at the quality scores of the first 10 bases of r1
# and of the last 10 bases (in reverse) of r2
# note that quality scores reflect the probability that the base was read incorrectly
# quality scores are represented by ASCII characters
# fastq 4th line: quality score
!sed -n '4,4 p' {FASTQ_DIR}/reduced_r1.fastq | cut -c 1-10
!sed -n '4,4 p' {FASTQ_DIR}/reduced_r2.fastq | rev | cut -c 1-10

AAAFFJJF-A
7-)7AA-F<F


Each FASTQ record spans four lines. The first line is the record identifier: it begins with `@`, and the text that follows has no required format. The second line is the sequence itself, written with the ASCII characters A, C, G, T, and N, where N is a nucleotide that could not be identified. You can read more about N's in FASTQ files in [this FAQ](https://dnatech.ucdavis.edu/faqs/my-fastq-file-contains-ns-is-there-a-problem-with-my-data). The third line begins with `+`. The fourth line holds the quality scores: one ASCII character per nucleotide, encoding the probability that the base was called incorrectly. Higher quality scores are better than lower ones, because a higher score corresponds to a lower error probability.
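
To make the quality line concrete, here is a minimal sketch that decodes it. It assumes the Phred+33 encoding used by current Illumina software; the notebook does not state the encoding, so treat the offset as an assumption. The helper names are hypothetical.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding

def phred_scores(quality_line):
    """Convert each ASCII quality character to its Phred score Q."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_line]

def error_probability(q):
    """Probability that a base with Phred score Q was called incorrectly: 10^(-Q/10)."""
    return 10 ** (-q / 10)

# First 10 quality characters of the R1 record printed in this notebook
scores = phred_scores("AAAFFJJF-A")
print(scores)
for q in sorted(set(scores)):
    print(f"Q{q}: error probability {error_probability(q):.5f}")
```

Under this assumption, `J` decodes to the highest score in the example and `-` to the lowest, so the `-` positions are the least reliable base calls.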

The relationship between the records in R1 and R2 is reverse-complementary. You can read more about that in [this biostars thread](https://www.biostars.org/p/145590/) and in this [youtube video](https://www.youtube.com/watch?v=WOrZMDibiSQ).
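
The relationship can be checked directly on the first record that the `head` cells in this notebook print. The sketch below is illustrative: `reverse_complement` is a hypothetical helper, and the two strings are copied from those outputs (the first 43 bases of R1 and the last 10 bases of R2).

```python
# Translation table mapping each base to its complement (N stays N)
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

r1_start = "CTCCTTTCTCTGTACCGTCTGGAGTGGCACACACATAAGTGGG"
r2_end = "GTGTGCCACT"

r2_end_rc = reverse_complement(r2_end)
print(r2_end_rc)                 # AGTGGCACAC
print(r1_start.find(r2_end_rc))  # 22
```

The end of R2 pairs with R1 starting 22 bases into the read rather than at its first base. This is why comparing the first 10 bases of R1 with the reversed last 10 bases of R2 rarely shows complementary bases, even though the two reads come from the same fragment.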

In [21]:
# take a look at the first 4 lines (1 record) of the first fastq file
#head: entire record (first file)
!head -n 4 {FASTQ_DIR}/reduced_r1.fastq

@J00113:162:H7W32BBXX:1:1101:4990:1947 1:N:0:TAGGAT
CTCCTTTCTCTGTACCGTCTGGAGTGGCACACACATAAGTGGGCAACATCTTCACAGGGGCGGCCTCATGCGTCTCCTTGCTCAGACCCTGCTCCATTTCCACCTTCATCCTTCTCTTAACCTCCAGCAGCTGTTCATGGCTCAGCTTCA
+
AAAFFJJF-AFAFFJJ-FJF<AAJFJJF777F<FA-FF7FAJF7AAJ<<F7<FFF-<777FF<FJAJA7AFFFF7FFFJFF-FJJJJJJAFJ-FJ7<AAJJFJJFAFFFJJ7-A-FAJFJJF-AJA<JAAF-AFAJ<A-FF-F)FJ--F7


In [22]:
# take a look at the first 4 lines (1 record) of the second fastq file
#head: entire record (second file)
!head -n 4 {FASTQ_DIR}/reduced_r2.fastq

@J00113:162:H7W32BBXX:1:1101:4990:1947 2:N:0:TAGGAT
CGCCAGAAGACCCTGGAGTCGCTGAAGCTGAGCCATGAACAGCTGCTGGAGGTTAAGAGAAGGATGAAGGTGGAAATGGAGCAGGGTCTGAGCAAGGAGACGCATGAGGCCGCCCCTGTGAAGATGTTGCCCACTTATGTGTGTGCCACT
+
AA-AF7AJ<<<77-7A7F<--7777<<<-<<F77--7--77<-777-7F77F-7-7<<F77<F-<A7<<F-AJA7A-7FAF7AFJJ)7)AFFJAFAJAJAAFJ--7AFJA<<7--7-<<F<<F<FFFFJ<<<)-)-<FAAF<F-AA7)-7
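
The two header lines above can be used to confirm that the reduced files still hold matching read pairs. This is a minimal sketch assuming the header layout seen in these outputs, where the read identifier is the text between `@` and the first space. `read_id` is a hypothetical helper.

```python
def read_id(header_line):
    """Return the read identifier: the text after '@' and before the first space."""
    return header_line[1:].split(" ", 1)[0]

# Header lines of the first record in each reduced file, copied from the outputs above
r1_header = "@J00113:162:H7W32BBXX:1:1101:4990:1947 1:N:0:TAGGAT"
r2_header = "@J00113:162:H7W32BBXX:1:1101:4990:1947 2:N:0:TAGGAT"

same_pair = read_id(r1_header) == read_id(r2_header)
print(read_id(r1_header), same_pair)
```

Here the identifiers match and only the field after the space differs (`1` for R1, `2` for R2).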


In [23]:
# compress reduced fastq files (gzip unix command)
# reduced_r1.fastq --> reduced_r1.fastq.gz
# reduces size from about 216MB to approximately 42-45MB
!gzip -c {FASTQ_DIR}/reduced_r1.fastq > {FASTQ_DIR}/reduced_r1.fastq.gz
!gzip -c {FASTQ_DIR}/reduced_r2.fastq > {FASTQ_DIR}/reduced_r2.fastq.gz
# check the sizes of the new files
!ls -lh {FASTQ_DIR}/reduced_r1.fastq.gz
!ls -lh {FASTQ_DIR}/reduced_r2.fastq.gz

-rw------- 1 root root 42M Jul 14 02:26 /content/mnt/MyDrive/NASA/GL4HS/FASTQ/reduced_r1.fastq.gz
-rw------- 1 root root 45M Jul 14 02:27 /content/mnt/MyDrive/NASA/GL4HS/FASTQ/reduced_r2.fastq.gz
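
The same compression step can be done without leaving Python. The sketch below is one possible equivalent of `gzip -c file > file.gz`, using only the standard library. `gzip_file` is a hypothetical helper, and the demonstration runs on a small made-up file in a temporary directory rather than on the course data.

```python
import gzip
import os
import shutil
import tempfile

def gzip_file(src_path):
    """Hypothetical helper: write src_path + '.gz' and leave the original file in place."""
    dst_path = src_path + ".gz"
    with open(src_path, "rb") as f_in, gzip.open(dst_path, "wb") as f_out:
        # copyfileobj streams in chunks, so a large file never sits fully in memory
        shutil.copyfileobj(f_in, f_out)
    return dst_path

# Demonstrate on a small made-up FASTQ file
with tempfile.TemporaryDirectory() as tmp_dir:
    fastq_path = os.path.join(tmp_dir, "toy.fastq")
    with open(fastq_path, "w") as f:
        f.write("@read0\nACGT\n+\nFFFF\n" * 1000)
    gz_path = gzip_file(fastq_path)
    original_size = os.path.getsize(fastq_path)
    compressed_size = os.path.getsize(gz_path)
    # Read the first line back to confirm the compressed copy is intact
    with gzip.open(gz_path, "rt") as f:
        first_line = f.readline().rstrip("\n")
    print(original_size, compressed_size, first_line)
```

The toy file compresses far better than real reads do because it repeats a single record.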


In [28]:
# remove the reduced uncompressed files
# we won't be needing them anymore and they take up a lot of space
!rm -f {FASTQ_DIR}/reduced_r1.fastq
!rm -f {FASTQ_DIR}/reduced_r2.fastq

# Check your work before moving on

In this section, you will check how much disk space you're utilizing with the `du` command and `ls` command.
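
For comparison, a rough Python counterpart of `du -s` is sketched below. `directory_size_bytes` is a hypothetical helper. It sums file sizes as reported by the file system, whereas `du` reports the disk blocks in use, so the two numbers can differ slightly.

```python
import os
import tempfile

def directory_size_bytes(path):
    """Sum the sizes of all regular files under path, including subdirectories."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# Demonstrate on a temporary directory holding a 1 KiB file and a 2 KiB file
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "a.bin"), "wb") as f:
        f.write(b"x" * 1024)
    os.makedirs(os.path.join(tmp_dir, "sub"))
    with open(os.path.join(tmp_dir, "sub", "b.bin"), "wb") as f:
        f.write(b"y" * 2048)
    demo_total = directory_size_bytes(tmp_dir)
    print(f"{demo_total / 1024:.1f} KiB")
```

In the notebook you would call it as `directory_size_bytes(FASTQ_DIR)`.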

In [25]:
# check space used in Google drive (should have about 87MB used)
!du -sh {FASTQ_DIR}

87M	/content/mnt/MyDrive/NASA/GL4HS/FASTQ


In [26]:
# check before going on
# make sure you have reduced files called reduced_r1.fastq.gz and reduced_r2.fastq.gz, each about 42-45M in size, in your FASTQ_DIR
!ls -lh {FASTQ_DIR}/reduced_r1.fastq.gz
!ls -lh {FASTQ_DIR}/reduced_r2.fastq.gz

-rw------- 1 root root 42M Jul 14 02:26 /content/mnt/MyDrive/NASA/GL4HS/FASTQ/reduced_r1.fastq.gz
-rw------- 1 root root 45M Jul 14 02:27 /content/mnt/MyDrive/NASA/GL4HS/FASTQ/reduced_r2.fastq.gz


In [27]:
# time the notebook
import datetime
end_time = datetime.datetime.now()
print('notebook end time: ', end_time.strftime('%Y-%m-%d %H:%M:%S'))

total_notebook_time = end_time - start_time
print('total notebook time: ', total_notebook_time)

notebook end time:  2025-07-14 02:27:15
total notebook time:  0:23:43.479151
