### Reference Indexing for RNA-seq preprocessing

#### 1. Tool installation

In [None]:
# STAR (-y skips the confirmation prompt, which cannot be answered from a notebook cell)
!conda install -y -c "bioconda/label/main" star
# RSEM
!conda install -y -c "bioconda/label/main" rsem
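
After installation, it can help to confirm that both tools are visible on the `PATH` of the notebook kernel before starting the long indexing jobs. The following is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical and not part of STAR, RSEM, or conda.

```python
import shutil

def missing_tools(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]

# STAR and rsem-prepare-reference are the two executables called later in this notebook.
missing = missing_tools(["STAR", "rsem-prepare-reference"])
print("Missing tools:", missing if missing else "none")
```

An empty result suggests the conda environment is active for the kernel; a non-empty one usually means the kernel runs in a different environment than the one the tools were installed into.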

#### 2. Download reference genome

In [None]:
### Download the reference genome and annotation from the bucket to the VM instance ###
!mkdir -p ../tmp
!gsutil cp "gs://whitelabgx-references/hg38/gencode.v27.primary_assembly.annotation.gtf" ../tmp/
!gsutil cp "gs://whitelabgx-references/hg38/GRCh38.primary_assembly.genome.fa" ../tmp/

#### 3. Reference indexing

In [1]:
### Variable definitions ###
GENOME_FA="../tmp/GRCh38.primary_assembly.genome.fa"
GENOME_GTF="../tmp/gencode.v27.primary_assembly.annotation.gtf"
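
Because the indexing steps run for a long time, a quick check that both input files exist and are non-empty can save a failed run. This is a sketch under the assumption that the files live at the paths defined above; `missing_files` is a hypothetical helper name.

```python
from pathlib import Path

def missing_files(paths):
    """Return the paths that do not point to an existing, non-empty file."""
    missing = []
    for p in map(Path, paths):
        if not p.is_file() or p.stat().st_size == 0:
            missing.append(str(p))
    return missing

# Same paths as GENOME_FA and GENOME_GTF defined above.
inputs = [
    "../tmp/GRCh38.primary_assembly.genome.fa",
    "../tmp/gencode.v27.primary_assembly.annotation.gtf",
]
print("Missing inputs:", missing_files(inputs) or "none")
```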

##### 3.1. Build the STAR index

In [4]:
!mkdir -p ../tmp/star_index_101bp
!STAR --runThreadN 14 \
      --runMode genomeGenerate \
      --genomeDir ../tmp/star_index_101bp \
      --genomeFastaFiles $GENOME_FA \
      --sjdbGTFfile $GENOME_GTF \
      --sjdbOverhang 100

	/home/xiliu/.conda/envs/bulk/bin/STAR-avx2 --runThreadN 14 --runMode genomeGenerate --genomeDir ../tmp/star_index_101bp --genomeFastaFiles ../tmp/GRCh38.primary_assembly.genome.fa --sjdbGTFfile ../tmp/gencode.v27.primary_assembly.annotation.gtf --sjdbOverhang 100
	STAR version: 2.7.11a   compiled: 2023-09-15T02:58:53+0000 :/opt/conda/conda-bld/star_1694746407721/work/source
Nov 30 15:09:28 ..... started STAR run
Nov 30 15:09:28 ... starting to generate Genome files
Nov 30 15:10:24 ..... processing annotations GTF
Nov 30 15:10:55 ... starting to sort Suffix Array. This may take a long time...
Nov 30 15:11:10 ... sorting Suffix Array chunks and saving them to disk...
Nov 30 15:47:39 ... loading chunks from disk, packing SA...
Nov 30 15:49:41 ... finished generating suffix array
Nov 30 15:49:41 ... generating Suffix Array index
Nov 30 15:53:07 ... completed Suffix Array index
Nov 30 15:53:08 ..... inserting junctions into the genome indices
Nov 30 15:56:46 ... writing Genome to disk ...
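
The `--sjdbOverhang 100` value in the command above matches the `101bp` in the index directory name: the STAR manual recommends setting this option to the read length minus one (for paired-end data, the length of one mate). As a small illustration, with `sjdb_overhang` being a hypothetical helper name:

```python
def sjdb_overhang(read_length):
    """Return STAR's recommended --sjdbOverhang: read length minus one."""
    if read_length < 2:
        raise ValueError("read length must be at least 2")
    return read_length - 1

print(sjdb_overhang(101))  # 100, the value passed to --sjdbOverhang above
```

An index for a different read length would, under the same rule, be built with a different overhang and is best kept in a separately named directory.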


In [5]:
# TODO: check the archive contents before uploading
!tar -czf ../tmp/star_index_101bp.tar.gz -C ../tmp/star_index_101bp .
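
One way to check the archive is to list its members and look for the core index files. The sketch below uses Python's standard `tarfile` module; the helper name `archive_members` is hypothetical, and the `EXPECTED` set is an assumption that names only a few core STAR index files, not the full list.

```python
import tarfile
from pathlib import Path

def archive_members(archive_path):
    """Return the file names stored in a .tar.gz archive, without a leading './'."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(
            m.name[2:] if m.name.startswith("./") else m.name
            for m in tar.getmembers()
            if m.isfile()
        )

# Core STAR index files; this list is an assumption and is not exhaustive.
EXPECTED = {"Genome", "SA", "SAindex"}

archive = Path("../tmp/star_index_101bp.tar.gz")
if archive.exists():
    names = set(archive_members(archive))
    print("Missing from archive:", sorted(EXPECTED - names) or "none")
```

Because the archive was created with `-C ../tmp/star_index_101bp .`, members are stored relative to the index directory, which is why the leading `./` is stripped before comparison.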

##### 3.2. Build the RSEM index

In [2]:
!mkdir -p ../tmp/rsem_index
!rsem-prepare-reference --gtf $GENOME_GTF \
  --num-threads 16 \
  $GENOME_FA \
  ../tmp/rsem_index/rsem_reference

rsem-extract-reference-transcripts ../tmp/rsem_index/rsem_reference 0 ../tmp/gencode.v27.primary_assembly.annotation.gtf None 0 ../tmp/GRCh38.primary_assembly.genome.fa
Parsed 200000 lines
Parsed 400000 lines
Parsed 600000 lines
Parsed 800000 lines
Parsed 1000000 lines
Parsed 1200000 lines
Parsed 1400000 lines
Parsed 1600000 lines
Parsed 1800000 lines
Parsed 2000000 lines
Parsed 2200000 lines
Parsed 2400000 lines
Parsed 2600000 lines
Parsing gtf File is done!
../tmp/GRCh38.primary_assembly.genome.fa is processed!
200468 transcripts are extracted.
Extracting sequences is done!
Group File is generated!
Transcript Information File is generated!
Chromosome List File is generated!
Extracted Sequences File is generated!

rsem-preref ../tmp/rsem_index/rsem_reference.transcripts.fa 1 ../tmp/rsem_index/rsem_reference
Refs.makeRefs finished!
Refs.saveRefs finished!
../tmp/rsem_index/rsem_reference.idx.fa is generated!
../tmp/rsem_index/rsem_reference.n2g.idx.fa is generated!



In [3]:
!tar -czf ../tmp/rsem_index.tar.gz -C ../tmp/rsem_index .

#### 4. Upload the reference index to the cloud

In [6]:
!mv ../tmp/star_index_101bp.tar.gz ../tmp/GRCh38_gencodeV27_primaryAssembly_star_index_101bp.tar.gz
!mv ../tmp/rsem_index.tar.gz ../tmp/GRCh38_gencodeV27_primaryAssembly_rsem_index.tar.gz
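
Before uploading, it may be useful to record a checksum of each archive so the copy in the bucket can be verified later. The sketch below computes a base64-encoded MD5 digest with the standard library; `md5_base64` is a hypothetical helper name, and the choice of MD5 in base64 is an assumption made so that the value can be compared with the MD5 hash that `gsutil ls -L` reports for an uploaded object.

```python
import base64
import hashlib

def md5_base64(path, chunk_size=1024 * 1024):
    """Return the base64-encoded MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large archives are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")
```

For example, `md5_base64("../tmp/GRCh38_gencodeV27_primaryAssembly_rsem_index.tar.gz")` could be run here and its result stored alongside the notebook for later comparison.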

In [2]:
!gsutil cp ../tmp/GRCh38_gencodeV27_primaryAssembly_rsem_index.tar.gz gs://whitelabgx-references/hg38/
!gsutil cp ../tmp/GRCh38_gencodeV27_primaryAssembly_star_index_101bp.tar.gz gs://whitelabgx-references/hg38/

Copying file://../tmp/GRCh38_gencodeV27_primaryAssembly_rsem_index.tar.gz [Content-Type=application/x-tar]...
==> NOTE: You are uploading one or more large file(s), which would run          
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

- [1 files][190.7 MiB/190.7 MiB]                                                
Operation completed over 1 objects/190.7 MiB.                                    
Copying file://../tmp/GRCh38_gencodeV27_p