## Run a test alignment using the STAR aligner in a MyBinder session

Start with a session launched from [here](https://github.com/fomightez/millefy-binder). (This should work elsewhere. That is just what I happened to be using, because the package I plan to use afterward, SICILIAN, uses R, and sessions launched from there have both Python and R.)

#### Install the STAR Aligner to prepare

(The printing to stderr is just to make clear what is going on as the code runs. I plan to run this notebook as part of a demonstration of the SICILIAN software ([Dehghannasiri et al 2021 'Specific splice junction detection in single cells with SICILIAN'](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02434-8); code at https://github.com/salzman-lab/SICILIAN), and this markdown context won't be visible when I use `%run STAR_aligner_run_on_MyBinder.ipynb`. Writing to stderr supplies that context there.)

In [1]:
import sys
sys.stderr.write("Install STAR aligner to prepare\n")
sys.stderr.write("Install is done via the command `%conda install -y bioconda::star`\n")
%conda install -y bioconda::star

Retrieving notices: ...working... done
Channels:
 - conda-forge
 - defaults
 - bioconda
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done


    current version: 24.3.0
    latest version: 24.5.0

Please update conda by running

    $ conda update -n base -c conda-forge conda



## Package Plan ##

  environment location: /srv/conda/envs/notebook

  added / updated specs:
    - bioconda::star


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    htslib-1.20                |       h5efdd21_1         2.9 MB  bioconda
    openssl-3.3.1              |       h4ab18f5_0         2.8 MB  conda-forge
    star-2.7.11b               |       h43eeafb_1         8.1 MB  bioconda
    ------------------------------------------------------------
                                           Total:        13.7 MB

The following NEW packages will be INSTALLED:

  htslib

------

#### Run a test alignment to demonstrate it works

Run the demonstration with a very tiny dataset so it will work on MyBinder.

1. Create a small FASTA file with a short sequence by running the next two cells. (The first cell just writes the step description to stderr. As with the install step above, that keeps the context visible when this notebook is run with `%run STAR_aligner_run_on_MyBinder.ipynb` as part of the SICILIAN demonstration.)

In [None]:
sys.stderr.write("Run a test alignment to demonstrate it works\n")
sys.stderr.write("1. Create a small FASTA file with a short sequence:\n")

In [2]:
%%bash
echo '>test_seq
ACGTGGACGTACCCGTACGTACCCAAGTACGAGTACGTACGGGTACGTTCCACAAGTACGT' > test.fa

In [None]:
cat test.fa

2. Run `mkdir test_index` to make the directory to store the index.

In [6]:
sys.stderr.write("2. Run `mkdir test_index` to make the directory to store the index.\n")
!mkdir test_index

Confirm that the directory was created:

In [14]:
ls -d -lah test_index

drwxr-xr-x 2 jovyan jovyan 4.0K Jun  7 16:31 test_index/


In [None]:
#pause to let first steps catch up so stderr is not so out of sync
import time
time.sleep(1.0)

3. Generate a genome index using this small FASTA file:

In [None]:
sys.stderr.write("3. Generate a genome index using this small FASTA file:\n")
#pause to let first steps catch up so stderr is not so out of sync
import time
time.sleep(1.0)

In [4]:
%%bash
STAR --runMode genomeGenerate \
     --genomeFastaFiles test.fa \
     --genomeDir test_index \
     --genomeSAindexNbases 1

	/srv/conda/envs/notebook/bin/STAR-avx2 --runMode genomeGenerate --genomeFastaFiles test.fa --genomeDir test_index --genomeSAindexNbases 1
	STAR version: 2.7.11b   compiled: 2024-03-19T08:38:59+0000 :/opt/conda/conda-bld/star_1710837244939/work/source
Jun 06 18:04:43 ..... started STAR run
Jun 06 18:04:43 ... starting to generate Genome files
Jun 06 18:04:43 ... starting to sort Suffix Array. This may take a long time...
Jun 06 18:04:43 ... sorting Suffix Array chunks and saving them to disk...
Jun 06 18:04:43 ... loading chunks from disk, packing SA...
Jun 06 18:04:43 ... finished generating suffix array
Jun 06 18:04:43 ... generating Suffix Array index
Jun 06 18:04:43 ... completed Suffix Array index
Jun 06 18:04:43 ... writing Genome to disk ...
Jun 06 18:04:43 ... writing Suffix Array to disk ...
Jun 06 18:04:43 ... writing SAindex to disk
Jun 06 18:04:43 ..... finished successfully
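
The `--genomeSAindexNbases 1` setting above is needed because the test genome is so small. The STAR manual suggests scaling this parameter down for small genomes, with a typical value of min(14, log2(GenomeLength)/2 - 1). Below is a minimal sketch of that calculation. The helper name `suggested_sa_index_nbases` is hypothetical, and rounding down with a floor of 1 is my assumption about how to turn the formula into an integer.

```python
import math

def suggested_sa_index_nbases(genome_length: int) -> int:
    """Heuristic from the STAR manual: min(14, log2(GenomeLength)/2 - 1)."""
    raw = math.log2(genome_length) / 2 - 1
    # Assumption: round down to an integer and never return less than 1.
    return max(1, min(14, math.floor(raw)))

# The sequence written to test.fa in step 1.
test_seq = "ACGTGGACGTACCCGTACGTACCCAAGTACGAGTACGTACGGGTACGTTCCACAAGTACGT"

print(len(test_seq), suggested_sa_index_nbases(len(test_seq)))  # → 61 1
```

For a genome of a few billion bases the same formula gives the cap of 14, which is why the option only needs changing for small references like this one.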


4. Create a small FASTQ file with a few reads from the test sequence:

In [None]:
sys.stderr.write("4. Create a small FASTQ file with a few reads from the test sequence:\n")

In [6]:
%%bash
echo '@read1
GTACCCAAGTACGAGTACG
+
AAAAAAAAAAAAAAAAAAA' > test.fq
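
Before aligning, it can help to know where the read should land. The sketch below is plain Python, not part of STAR. It searches the test sequence for exact forward-strand matches of the read and reports 1-based start positions, the convention SAM uses. The variable names are only illustrative.

```python
# Sequence from test.fa and read from test.fq, copied from the cells above.
reference = "ACGTGGACGTACCCGTACGTACCCAAGTACGAGTACGTACGGGTACGTTCCACAAGTACGT"
read = "GTACCCAAGTACGAGTACG"

# Collect every 0-based offset where the read matches the reference exactly.
hits = [i for i in range(len(reference) - len(read) + 1)
        if reference[i:i + len(read)] == read]

# SAM positions are 1-based, so shift each offset by one.
positions_1based = [i + 1 for i in hits]
print(positions_1based)  # → [19]
```

A single hit means the read should be reported as uniquely mapped, starting at position 19 of `test_seq`.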

5. Run the alignment using the small test files:

In [None]:
sys.stderr.write("5. Run the alignment using the small test files:\n")

In [8]:
%%bash
STAR --runMode alignReads \
     --genomeDir test_index \
     --readFilesIn test.fq \
     --outFileNamePrefix test_output

	/srv/conda/envs/notebook/bin/STAR-avx2 --runMode alignReads --genomeDir test_index --readFilesIn test.fq --outFileNamePrefix test_output
	STAR version: 2.7.11b   compiled: 2024-03-19T08:38:59+0000 :/opt/conda/conda-bld/star_1710837244939/work/source
Jun 06 18:05:16 ..... started STAR run
Jun 06 18:05:16 ..... loading genome
Jun 06 18:05:16 ..... started mapping
Jun 06 18:05:16 ..... finished mapping
Jun 06 18:05:16 ..... finished successfully


In [None]:
sys.stderr.write("SHOW THE FILES GENERATED BY ALL THAT AND THE ALIGNING STEP.\n")

In [9]:
ls -lah

total 408K
drwxr-x--- 1 jovyan jovyan 4.0K Jun  6 18:05 ./
drwxr-xr-x 1 root   root     20 Jun  3 15:18 ../
-rw-r--r-- 1 jovyan jovyan  220 Jan  6  2022 .bash_logout
-rw-r--r-- 1 jovyan jovyan 3.7K Jan  6  2022 .bashrc
drwxr-xr-x 1 jovyan jovyan   23 Jun  3 15:14 binder/
drwxr-xr-x 1 jovyan jovyan   19 Jun  3 15:14 .cache/
drwxrwsr-x 1 jovyan jovyan   30 Jun  3 15:16 .conda/
drwxr-xr-x 8 jovyan jovyan  180 Jun  3 15:14 .git/
drwxr-xr-x 3 jovyan jovyan   23 Jun  3 15:14 .github/
drwxr-xr-x 2 jovyan jovyan   96 Jun  6 18:03 .ipynb_checkpoints/
drwxr-xr-x 1 jovyan jovyan   29 Jun  6 18:03 .ipython/
drwxr-xr-x 3 jovyan jovyan   33 Jun  6 18:02 .jupyter/
-rw-r--r-- 1 jovyan jovyan 7.9K Jun  6 18:05 .jupyter-server-log.txt
drwxr-xr-x 1 jovyan jovyan   19 Jun  3 15:20 .local/
drwxr-xr-x 3 jovyan jovyan   19 Jun  6 18:02 .npm/
-rw-r--r-- 1 jovyan j

---------

#### Examine results to see it worked

In [10]:
sys.stderr.write("Examine results to see it worked\n")
sys.stderr.write("Results in 'test_outputLog.final.out'\n")
!cat test_outputLog.final.out

                                 Started job on |	Jun 06 18:05:16
                             Started mapping on |	Jun 06 18:05:16
                                    Finished on |	Jun 06 18:05:16
       Mapping speed, Million of reads per hour |	inf

                          Number of input reads |	1
                      Average input read length |	19
                                    UNIQUE READS:
                   Uniquely mapped reads number |	1
                        Uniquely mapped reads % |	100.00%
                          Average mapped length |	19.00
                       Number of splices: Total |	0
            Number of splices: Annotated (sjdb) |	0
                       Number of splices: GT/AG |	0
                       Number of splices: GC/AG |	0
                       Number of splices: AT/AC |	0
               Number of splices: Non-canonical |	0
                      Mismatch rate per base, % |	0.00%
                         Deletion rate per base |	0.00%
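
To check these numbers programmatically rather than by eye, the log can be read into a dictionary. This is a minimal sketch that assumes every statistic line has the `name | value` layout shown above. The function name `parse_star_final_log` is hypothetical, not part of STAR, and the sample text is a few lines copied from the output above.

```python
def parse_star_final_log(text: str) -> dict:
    """Map each 'name | value' line of a Log.final.out file to a dict entry."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip blank lines and section headers such as "UNIQUE READS:"
        key, _, value = line.partition("|")
        stats[key.strip()] = value.strip()
    return stats

# A few lines copied from the log shown above (name, "|", tab, value).
sample = (
    "                          Number of input reads |\t1\n"
    "                      Average input read length |\t19\n"
    "                                    UNIQUE READS:\n"
    "                   Uniquely mapped reads number |\t1\n"
    "                        Uniquely mapped reads % |\t100.00%\n"
)

stats = parse_star_final_log(sample)
print(stats["Uniquely mapped reads %"])  # → 100.00%
```

In the notebook, the same function could be given the contents of `test_outputLog.final.out`, for example `parse_star_final_log(open("test_outputLog.final.out").read())`.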
  

In [11]:
sys.stderr.write("Results in 'test_outputAligned.out.sam'\n")
!cat test_outputAligned.out.sam

@HD	VN:1.4
@SQ	SN:test_seq	LN:61
@PG	ID:STAR	PN:STAR	VN:2.7.11b	CL:/srv/conda/envs/notebook/bin/STAR-avx2   --runMode alignReads      --genomeDir test_index   --readFilesIn test.fq      --outFileNamePrefix test_output
@CO	user command line: /srv/conda/envs/notebook/bin/STAR-avx2 --runMode alignReads --genomeDir test_index --readFilesIn test.fq --outFileNamePrefix test_output
read1	0	test_seq	19	255	19M	*	0	0	GTACCCAAGTACGAGTACG	AAAAAAAAAAAAAAAAAAA	NH:i:1	HI:i:1	AS:i:18	nM:i:0
