# CloudOS User Training - Interactive Analysis 

The structure of this notebook follows the structure of the training session:  
For every workflow or set of tasks, the trainers will first walk you through it and show you how it is carried out step by step. 
The code for the step-by-step explanation of every workflow is available in the corresponding step-by-step sections of this notebook. 

## Workflow I: I want to run a tool and get results in a Jupyter session 

### For Step-by-step I, see slide 11 of the training presentation

In this first workflow, the objective is for you to learn how to mount data, such as cohort data selected from the Cohort browser, into an interactive analysis session within JupyterLab.

#### **Exercises I:**

1. Go to the Cohort browser page;
2. Select your cohort;
3. Click "Run analysis";
4. Add a new data storage location and label;
5. Give the notebook a distinctive name;
6. Select a project (or create a new one);
7. Add a time limit of 12h;
8. Set a cost limit of $1;
9. Select an instance type with 4 CPUs and 8 GB RAM: *Instance type name: c5.xlarge, on-demand instance;*
10. Set session storage to 100 GB;
11. Click "Create" to start a session.
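
Once the session is running, you may want to confirm that it matches the configuration you requested. The following is a minimal sketch that uses standard Linux utilities (`nproc`, `free`, `df`) from a terminal in the session; `session_summary` is a hypothetical helper name, not a CloudOS command, and the reported values can differ slightly from the nominal instance size.

```shell
# session_summary is a hypothetical helper (not part of CloudOS):
# print the CPU count, total memory and free disk space of the current session.
session_summary() {
    # Fall back to getconf if nproc is not available in the image
    echo "CPUs: $(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)"
    if command -v free >/dev/null 2>&1; then
        # Second line of `free -h` is the "Mem:" row; column 2 is the total
        free -h | awk 'NR==2 {print "Memory: " $2}'
    fi
    if command -v df >/dev/null 2>&1; then
        # -P keeps each filesystem on one line; column 4 is the available space
        df -hP . | awk 'NR==2 {print "Disk available: " $4}'
    fi
}

session_summary
```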


## Workflow II: Run a tool and get results in a Jupyter Session 

In the following workflow, the objective is for you to get familiar with importing data for analysis and processing it using imported tools or packages, such as samtools.

### Step-by-step II: 

For step-by-step on data import, please see **Slide 14** in the training presentation. 
You need to upload the file wgs1.bam into the session data. You can find this file in the Pipelines folder in Example Data within the File explorer. 

You can run this code in the notebook or in a terminal, which you can open from the launcher. 

Index a bam file using samtools 

In [None]:
samtools

In [None]:
samtools index mounted-data/rna3.bam

Inspect mounted-data, to check that your bam file has been imported successfully 

In [None]:
ls mounted-data

Find the index file produced by the indexing step in mounted-data and copy it to the parent directory 

In [None]:
cd mounted-data 

In [None]:
cp rna3.bam.bai ../ 

In [None]:
cd ..

In [None]:
ls
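
The indexing steps above can also be collected into a small script. This is a sketch under assumptions: `index_bam` is a hypothetical helper name, the BAM path is the one used in the cells above, and the script relies on `samtools index` writing its output next to the input with a `.bai` suffix, which is its default behaviour.

```shell
# index_bam is a hypothetical helper: index a BAM file only if the index
# is missing, then print the path of the index file.
index_bam() {
    local bam="$1"
    if [ ! -f "$bam" ]; then
        echo "missing: $bam" >&2
        return 1
    fi
    if [ ! -f "$bam.bai" ]; then
        samtools index "$bam" || return 1
    fi
    echo "$bam.bai"
}

# Path as used in the cells above; adjust it to the file you mounted.
bam="mounted-data/rna3.bam"
if command -v samtools >/dev/null 2>&1 && [ -f "$bam" ]; then
    index_bam "$bam"
fi
```

Because the helper skips files that already have an index, it can be re-run safely after a session restart.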

Locate the results in the data & results section (See **Slide 17** for step-by-step) 

#### **Exercises II:**

**Exercise 1:** Mount data to the Jupyter session (a BAM file from Example data). Path - Example data > Pipelines > BamMetrics > ... > data > .bam

**Exercise 2:** Start a bash terminal 

**Exercise 3:** Check that samtools is available (simply by running samtools) 

In [None]:
samtools 

**Exercise 4:** Perform <> of the BAM file 

samtools <> wgs1.bam

**Exercise 4 (extra):** Perform a coverage calculation using samtools  

In [None]:
samtools coverage -r [chromosome positions] wgs1.bam
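
The `-r` option expects a region string of the form `reference:start-end`, and a region query needs the BAM file to be indexed. The sketch below only builds and prints the command; `make_region` is a hypothetical helper, and `chr1` with the coordinates shown are illustrative values, so substitute a reference name that appears in your own BAM header.

```shell
# make_region is a hypothetical helper that formats a samtools region string:
# reference name, colon, start position, hyphen, end position.
make_region() {
    printf '%s:%s-%s\n' "$1" "$2" "$3"
}

# Illustrative values only. `samtools view -H wgs1.bam` lists the reference
# names (@SQ lines) that are valid for your file.
region="$(make_region chr1 1000000 2000000)"

# Print the command rather than running it, so you can check it first.
echo "samtools coverage -r $region wgs1.bam"
```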

**Exercise 5:** Move the file you've created to the session_data folder 

In [None]:

# Option 1: move the file with mv
mv  wgs1.bam mounted-data

# Option 2: copy the file with cp
cp  wgs1.bam /home/jovyan/session_data


**Exercise 6:** Save your results 

**Exercise 7:** Browse Project results to find your session data 

## Workflow III: Create a conda environment & bring your own tools to the Jupyter session

### III a) Install packages using conda 

#### Step-by-step

Create a new, empty conda environment: 

In [None]:
conda create --name custom_1 

Activate the created environment: 

In [None]:
conda activate custom_1 

Install vcftools within your environment: 

In [None]:
conda install vcftools 
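
If you want to repeat these three steps from a script rather than an interactive terminal, note that `conda activate` usually needs the shell hook to be loaded first. The following is a sketch under assumptions: `env_in_list` and `setup_env` are hypothetical helper names, and the `conda-forge` and `bioconda` channels are added explicitly here although the cells above rely on the session's default channel configuration.

```shell
# env_in_list is a hypothetical helper: succeeds if the named environment
# appears in `conda env list` output supplied on standard input.
env_in_list() {
    awk -v name="$1" '$1 == name {found = 1} END {exit !found}'
}

# setup_env is a hypothetical helper: create the environment if it does not
# exist yet, activate it, then install the requested packages.
setup_env() {
    local name="$1"; shift
    # Make `conda activate` usable in a non-interactive shell
    eval "$(conda shell.bash hook)"
    if ! conda env list | env_in_list "$name"; then
        conda create --name "$name" -y
    fi
    conda activate "$name"
    # Channels are an assumption; drop them if your session is preconfigured
    conda install -c conda-forge -c bioconda -y "$@"
}
```

With these definitions loaded, `setup_env custom_1 vcftools` would reproduce the steps above without creating the environment twice.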

#### **Exercises III a):**

**Exercise 1:** Create a new conda environment 

In [None]:
conda create -n my_env -y 

**Exercise 2:** Activate the new environment 

In [None]:
conda activate my_env 

**Exercise 3:** Install dependencies/tools 

In [9]:
conda install -c bioconda gatk 


**Exercise 4:** Run tools 

In [None]:
gatk --java-options "-Xmx4g" HaplotypeCaller --help

### III b) Bring in a docker container as a session tool

#### Step-by-step 

Browse available tools by running: 

In [None]:
alias | grep "docker"

Pull a new container tool, using: 

In [None]:
docker pull <container name>

Create an alias for the tool:

In [None]:
alias vcftools='docker run --rm --init -u $(id -u):$(id -g) -v $PWD:$PWD -i quay.io/lifebitai/vcftools vcftools '

Run the tool using its alias: 

In [None]:
vcftools --help 
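
Aliases are normally expanded only in interactive shells, so inside a script a shell function is a more dependable way to wrap a container. The sketch below mirrors the options of the alias above: `--rm` removes the container when it exits, `--init` adds a minimal init process, `-u` runs as your user so output files belong to you, `-v` mounts the current directory and `-i` keeps standard input open. The names `run_in_container` and `vcftools_docker` are hypothetical, and `-w "$PWD"` is an addition that is not in the alias; it is assumed here so that relative paths resolve inside the container.

```shell
# run_in_container is a hypothetical wrapper: run a command from an image
# with the current directory mounted at the same path inside the container.
run_in_container() {
    local image="$1"; shift
    docker run --rm --init \
        -u "$(id -u):$(id -g)" \
        -v "$PWD:$PWD" -w "$PWD" \
        -i "$image" "$@"
}

# vcftools_docker is a hypothetical name chosen to avoid shadowing a
# locally installed vcftools binary.
vcftools_docker() {
    run_in_container quay.io/lifebitai/vcftools vcftools "$@"
}
```

After loading the functions, `vcftools_docker --help` plays the same role as the aliased `vcftools --help`.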

#### **Exercises III b):**

**Exercise 1:** Inspect available tools

In [None]:
alias | grep "docker"

**Exercise 2:** Pull a new container tool

In [None]:
docker pull quay.io/lifebitai/gatk 

**Exercise 3:** Create an alias for the tool

In [None]:
alias gatk='docker run …'

**Exercise 4:** Run the tool by its alias

In [None]:
<gatk alias name> --java-options "-Xmx4g" HaplotypeCaller --help

**Exercise 5:** Extract the frequencies from a VCF file 

In [None]:
vcftools --gzvcf <.vcf.gz file> --freq --out vcf_freq
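
With `--out vcf_freq`, vcftools writes the frequencies to `vcf_freq.frq`, a tab-separated table whose columns are CHROM, POS, N_ALLELES, N_CHR and then one `ALLELE:FREQ` entry per allele. As a sketch of how to inspect it, the hypothetical helper `minor_freq` below prints the smallest allele frequency per site; it assumes that column layout and does nothing if the file is not present.

```shell
# minor_freq is a hypothetical helper: for each site in a .frq table read
# from standard input, print CHROM, POS and the smallest allele frequency.
minor_freq() {
    awk -F'\t' 'NR > 1 {
        min = 1
        # Columns 5 onwards hold ALLELE:FREQ pairs
        for (i = 5; i <= NF; i++) {
            split($i, pair, ":")
            if (pair[2] + 0 < min) min = pair[2] + 0
        }
        print $1, $2, min
    }'
}

# Show the first few sites if the exercise output exists
if [ -f vcf_freq.frq ]; then
    minor_freq < vcf_freq.frq | head
fi
```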

## Workflow IV: Install python packages and run scripts for data manipulation

#### Step-by-step

Install BioPython using pip in the terminal:

In [4]:
pip install biopython


Navigate to the **python_exercise.ipynb** notebook to import the module  

To test out importing and installing tools and data from GitHub, import the Lifebit GWAS pipeline

In [7]:
git clone https://github.com/lifebit-ai/gwas


To continue the exercise, open the **python_exercise.ipynb** notebook

## Workflow V: Bring in public code from GitHub 

Once you have completed all of the exercises in the **python_exercise.ipynb** notebook, return here to complete the rest of the exercises 

#### Step-by-step

Clone a public repository: 

In [None]:
git clone https://github.com/lifebit-ai/rnatoy
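
Running `git clone` a second time stops with an error if the destination directory already exists and is not empty. A minimal sketch of a re-runnable alternative follows; `clone_once` is a hypothetical helper name, and it only checks for an existing `.git` directory rather than verifying which repository it belongs to.

```shell
# clone_once is a hypothetical helper: clone a repository unless a clone
# already exists in the target directory.
clone_once() {
    local url="$1" dir="${2:-}"
    if [ -z "$dir" ]; then
        # Derive the directory name from the URL, ignoring a trailing slash
        dir="$(basename "${url%/}" .git)"
    fi
    if [ -d "$dir/.git" ]; then
        echo "already cloned: $dir"
    else
        git clone "$url" "$dir"
    fi
}
```

For example, `clone_once https://github.com/lifebit-ai/rnatoy` clones the repository on the first call and reports `already cloned: rnatoy` on later calls.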

Install nextflow using conda, and run a public pipeline 

In [None]:
conda install nextflow=20.01 

In [None]:
cd rnatoy 

In [None]:
nextflow run . -profile test 

To bring code from a private GitHub repository, first follow the **Step-by-Step** instructions in **slide 35** to obtain your personal access token. Then clone the private repository with the access token as follows:  

In [None]:
git clone https://[insert access code here]

#### **Exercises V**

**Exercise 1:** In a Jupyter session, import a public repository

In [None]:
git clone https://github.com/lifebit-ai/rnatoy/

**Exercise 2:** Install java and nextflow 

In [None]:
conda install nextflow=21.04 -y

**Exercise 3:** Mount data for the exercise into the session. You can find the data in: Example data > Pipelines > rnatoy-data

**Exercise 4:** Run the nextflow pipeline locally

In [None]:
nextflow run rnatoy -profile test \
 --reads mounted-data/rnatoy-data \
 --annot mounted-data/rnatoy-data/ggal_1_48850000_49020000.bed.gff \
 --genome mounted-data/rnatoy-data/ggal_1_48850000_49020000.Ggal71.500bpflank.fa
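
The pipeline will fail part-way through if the data from Exercise 3 has not been mounted, so it can help to check the inputs first. This is a sketch only: `check_inputs` is a hypothetical helper, and the paths are the ones passed to `nextflow run` above.

```shell
# check_inputs is a hypothetical helper: report every path that does not
# exist and return non-zero if any of them is missing.
check_inputs() {
    local missing=0 path
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing input: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Paths as used in the nextflow command above
data=mounted-data/rnatoy-data
if check_inputs "$data" \
    "$data/ggal_1_48850000_49020000.bed.gff" \
    "$data/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"; then
    echo "inputs found; ready to run nextflow"
else
    echo "mount rnatoy-data before running the pipeline"
fi
```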

## Workflow VI: Save a snapshot of your session, following the instructions on **Slides 38-39**.

## Workflow VII: Querying a database programmatically from a notebook

In the following workflow, you will query data programmatically from a notebook in your interactive session. 

#### Step-by-step

In a bash terminal, create a new R conda environment and install the required packages: 

In [11]:
conda create -n omop-source r-glue r-tidyverse r-data.table r-dbi r-rpostgres r-irkernel -y


In a bash terminal, download the example notebooks from git 

In [None]:
git clone https://github.com/lifebit-ai/gel-data-queries-resources

Open the **running_queries_on_source_data.ipynb** notebook, and ensure that the kernel in the notebook is set to the correct environment: **omop-source**

Either from the [knowledge base page](https://genomics-england.gitbook.io/gel-cloudos/recipes-and-examples/cohort-building/querying-data-programmatically-omop-and-source-data), or from this notebook, copy and paste the following access credentials into the first cell (keep the indent): 

In [None]:
    DBNAME = "gel_clinical_cb_sql_pro"
    HOST = "clinical-cb-sql-pro.cfe5cdx3wlef.eu-west-2.rds.amazonaws.com"
    PORT = 5432
    PASSWORD = 'anXReTz36Q5r'
    USER = 'jupyter_notebook'

Once the authentication is complete, run the queries available in the notebook and inspect the results. 

#### **Exercises VII**

For any additional information on this workflow and exercises, visit the [knowledge base page](https://genomics-england.gitbook.io/gel-cloudos/recipes-and-examples/cohort-building/querying-data-programmatically-omop-and-source-data). 

**Exercise 1:** Create conda environment and install the required packages

In [12]:
conda create -n omop-source r-glue r-tidyverse r-data.table r-dbi r-rpostgres r-irkernel -y 



**Exercise 2:** Clone the notebook files

In [None]:
git clone https://github.com/lifebit-ai/gel-data-queries-resources

**Exercise 3:** Open the "omop data" notebook file and select "omop-source" R kernel

**Exercise 4:** From the [knowledge base page](https://genomics-england.gitbook.io/gel-cloudos/recipes-and-examples/cohort-building/querying-data-programmatically-omop-and-source-data) copy and paste the credentials, and run the first cell

**Exercise 5:** Run the subsequent SQL queries in the notebook