# Command Line Interface (CLI)
IsoTools is a Python library for the analysis of long-read data.
It starts from aligned reads (e.g. BAM files) and features transcriptome reconstruction, quantification, visualization, exploratory analysis and statistical tests.
IsoTools is designed to be run interactively (e.g. in a Python notebook or the REPL), as it offers rich functionality for exploring the data.
However, it can also be run as a command line tool with predefined parameters, which is convenient for processing different experiments in a standardized pipeline.
All command line parameters are described in the [CLI section](../isotoolsCLI.html).


To run the commands, you need the following **input files** from the [prepared demonstration dataset](https://nc.molgen.mpg.de/cloud/index.php/s/zYe7g6qnyxGDxRd):

* sample description file 'encode_samples.tsv'
* six .bam alignment files for the six samples
* six corresponding .bam.bai indices
* reference annotation file gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz
* corresponding .gff3.gz.tbi index file
* genomic reference file GRCh38.p13.genome_chr8.fa
* corresponding .fai index file
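
Before starting a run, it can help to confirm that the downloaded files are complete. The following is a minimal sketch, not part of IsoTools: the helper `missing_inputs` is hypothetical, and it assumes that all files sit in one directory and that each index file is named by appending `.bai`, `.tbi` or `.fai` to the file it belongs to, as the list above suggests.

```python
from pathlib import Path

# File names as given in the list above.
SAMPLES = "encode_samples.tsv"
ANNOTATION = "gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz"
GENOME = "GRCh38.p13.genome_chr8.fa"


def missing_inputs(directory):
    """Return the names of expected input files that are absent from `directory`."""
    directory = Path(directory)
    expected = [SAMPLES, ANNOTATION, ANNOTATION + ".tbi", GENOME, GENOME + ".fai"]
    # Assumption: every alignment file has its .bam.bai index next to it.
    expected += [bam.name + ".bai" for bam in sorted(directory.glob("*.bam"))]
    return [name for name in expected if not (directory / name).is_file()]
```

Calling `missing_inputs("demonstration_dataset")` returns an empty list when every expected file is present. Note that the sketch does not check how many .bam files were found.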

The sample description file is a tab-separated text file with at least the following columns:

* **sample_name**: A unique sample label
* **file_name**: The path to the alignment .bam file
* **group**: A group assignment of the sample
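
A sample table can be sanity-checked with the Python standard library before it is passed to the command line tool. This is a minimal sketch under the column description above; the function `check_sample_table` is a hypothetical helper and not part of the IsoTools API.

```python
import csv

REQUIRED_COLUMNS = ("sample_name", "file_name", "group")


def check_sample_table(path):
    """Read a tab-separated sample table and return its rows as dictionaries.

    Raises ValueError if a required column is missing or a sample name repeats.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        absent = [col for col in REQUIRED_COLUMNS if col not in header]
        if absent:
            raise ValueError(f"missing columns: {', '.join(absent)}")
        rows = list(reader)
    # sample_name must be a unique label.
    names = [row["sample_name"] for row in rows]
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError(f"duplicate sample names: {', '.join(duplicates)}")
    return rows
```

Additional columns are allowed and are kept in the returned rows, since the table only needs "at least" the three columns listed.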


The command below runs IsoTools and performs these analysis steps:

* transcriptome reconstruction from the alignments
* filtering transcripts according to the filter query (--filter_query)
    * see the [filter tutorial](06_filtering.html#Filtering-tags-and-queries) for a description of tags and queries.
* exporting transcripts as gtf (--gtf_out)
* exporting a transcript table, which includes the number of reads per transcript (--transcript_table)
* finally, by default, a pkl file containing the transcriptome object is written, which reduces the runtime of the next CLI call.
    * The resulting transcriptome can also be explored using the API.
    * creation of the pkl file can be skipped with --no-pkl
    * import of a previous pkl file can be bypassed with the --force-recreate option
    * if a pkl file is present and a sample table containing novel samples is provided, only the novel samples are added
* the two parameters "--log INFO" and "--progress_bar" enable readout of the progress.

```bash
mkdir -p demonstration_dataset
cd demonstration_dataset
samples='encode_samples.tsv'
anno='gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz'
genome='GRCh38.p13.genome_chr8.fa'

run_isotools \
    --anno "$anno" \
    --log INFO \
    --progress_bar \
    --genome "$genome" \
    --samples "$samples" \
    --file_prefix ./PacBio_isotools_substantial \
    --custom_filter_tag "COVERED=any(g.coverage[:,trid] > 2)" "HIGH_COVER=g.coverage.sum(0)[trid] >= 7" \
    --filter_query "(COVERED and FSM) or (HIGH_COVER and SUBSTANTIAL and not INTERNAL_PRIMING)" \
    --gtf_out --transcript_table
```

```text
2024-03-28 12:56:45 INFO: This is isotools version 0.3.4
2024-03-28 12:56:45 INFO: loading transcriptome from ./PacBio_isotools_substantial_isotools.pkl
2024-03-28 12:56:45 INFO: importing reference from gff3 file gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz
100%|█████████▉| 2.70M/2.70M [00:02<00:00, 1.29MB/s]
2024-03-28 12:56:48 INFO: skipped the following categories: {'three_prime_UTR', 'CDS', 'five_prime_UTR'}
2024-03-28 12:56:48 INFO: collapsed 0 immunoglobulin loci and 0 T-cell receptor loci
2024-03-28 12:56:48 INFO: adding sample GM12878_a from file ENCFF417VHJ_aligned_mm2_chr8.bam
100%|██████████| 53.0k/53.0k [00:10<00:00, 5.01kreads/s, chr=KI270757.1]
2024-03-28 12:56:58 INFO: skipped 113 reads aligned fraction of less than 0.75.
2024-03-28 12:56:58 INFO: skipped 10940 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)
2024-03-28 12:56:58 INFO: ignoring 2235 chimeric alignments with less than 2 reads
2024-03-28
```
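
To make the two custom filter tags in the command above more concrete, the sketch below re-expresses them in plain Python. It is an illustration only: it assumes that `g.coverage` is a read count matrix with one row per sample and one column per transcript, and that `trid` is the column index of a transcript. The count values are hypothetical. The tags FSM, SUBSTANTIAL and INTERNAL_PRIMING are passed in as booleans here, since their definitions are covered in the filter tutorial.

```python
# Hypothetical read counts: one row per sample, one column per transcript.
coverage = [
    [3, 1, 0],
    [0, 2, 0],
    [1, 4, 2],
]


def covered(coverage, trid):
    """Mirror of any(g.coverage[:,trid] > 2): some sample has more than 2 reads."""
    return any(row[trid] > 2 for row in coverage)


def high_cover(coverage, trid):
    """Mirror of g.coverage.sum(0)[trid] >= 7: at least 7 reads over all samples."""
    return sum(row[trid] for row in coverage) >= 7


def passes(coverage, trid, fsm, substantial, internal_priming):
    """Combine the tags as in the --filter_query expression of the command above."""
    return (covered(coverage, trid) and fsm) or (
        high_cover(coverage, trid) and substantial and not internal_priming
    )
```

With these counts, transcript 0 is COVERED but not HIGH_COVER, so it only passes if it is also FSM. Transcript 1 satisfies both tags, and transcript 2 satisfies neither and is always filtered out.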

Running the command again automatically loads the stored '.pkl' file.
Here, we use the command line interface to perform differential splicing analysis.
The resulting table with test statistics for all alternative splicing events is written to the file
PacBio_isotools_substantial_diff_K562_GM12878.csv.
In addition, the command creates sashimi coverage plots for the top 5 differentially spliced genes.

```bash
cd demonstration_dataset
run_isotools \
    --log INFO \
    --progress_bar \
    --file_prefix ./PacBio_isotools_substantial \
    --group_by group \
    --diff_plots 5 \
    --diff K562/GM12878
```

```text
2024-03-28 12:59:37 INFO: This is isotools version 0.3.4
2024-03-28 12:59:37 INFO: loading transcriptome from ./PacBio_isotools_substantial_isotools.pkl
2024-03-28 12:59:40 INFO: testing differential splicing for K562 (3) vs GM12878 (3) using betabinom_lr test
100%|██████████| 10805/10805 [00:23<00:00, 460.12genes/s]
2024-03-28 13:00:03 INFO: 175 differential splice sites in 111 genes for K562 vs GM12878
2024-03-28 13:00:03 INFO: sashimi plot for differentially spliced gene RIPK2
2024-03-28 13:00:05 INFO: sashimi plot for differentially spliced gene PVT1
2024-03-28 13:00:13 INFO: sashimi plot for differentially spliced gene TUSC3
2024-03-28 13:00:15 INFO: sashimi plot for differentially spliced gene NCALD
2024-03-28 13:00:17 INFO: saving transcripts as pickle file
2024-03-28 13:00:17 INFO: saving transcriptome to ./PacBio_isotools_substantial_isotools.pkl
```
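
In a pipeline, downstream steps need to know where the results end up. The sketch below derives the two file names that appear in this tutorial from the `--file_prefix` and `--diff` values. The naming pattern is inferred from this single run and is an assumption, not a documented guarantee, and `expected_outputs` is a hypothetical helper.

```python
def expected_outputs(file_prefix, diff):
    """Derive the pickle and differential splicing table names seen in this tutorial."""
    # --diff takes two group names separated by a slash, e.g. K562/GM12878.
    group_a, group_b = diff.split("/")
    return {
        "pickle": f"{file_prefix}_isotools.pkl",
        "diff_table": f"{file_prefix}_diff_{group_a}_{group_b}.csv",
    }
```

For the run above, `expected_outputs("./PacBio_isotools_substantial", "K562/GM12878")` yields the pickle path shown in the log and the table name mentioned in the text.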
