# Part 1.1: Virtual environments and installation of packages hosted on GitHub

## Sections:
   - 1.1.1 Short introduction to Python virtual environments: how a virtual environment works, installing and using a virtual environment.
   - 1.1.2 Installation of packages hosted on GitHub.
   - 1.1.3 Short introduction to Jupyter (Notebook) and the JupyterHub
   
## Questions & Objectives:
   - Why do we need a virtual environment?
   - Learn how to install and use a Python virtual environment.
   - Learn how to install a package hosted on GitHub, in particular how to install the rpbp package.
   - Jupyter Notebook basics: dashboard, user interface, navigation, running code, etc. How to use Jupyter with virtual environments.
   - How to use the JupyterHub.
   
### Afterwards, I will be able to:
   - create and use a Python virtual environment to install packages;
   - understand what a Jupyter Notebook is, and how to use it.

## 1.1.1 Short introduction to Python virtual environments

A *virtual environment* is a tool to help you keep dependencies and packages required by different projects separate and isolated from the system-wide installation.

### Why do we need a virtual environment? 
    
Imagine a scenario where you need different versions of the same package, say v1.8 and v2.5. These would normally reside in the same directory with the same name...

    Virtual environments enable you to isolate different versions of the same package (if they live in different environments, say myproject1.8 and myproject2.5).
    
Imagine a scenario where things go wrong (you break something, install many conflicting packages) ... 

    Virtual environments can be deleted and re-created easily without affecting your system or the other packages that are installed outside.
    
Imagine a scenario where you cannot install packages system-wide or do not want to do it...

    Virtual environments enable you to install everything that you want, without affecting the system or other packages that are installed outside.
    
    
### How does a virtual environment work?

We use a module named `venv` which is a tool to create isolated Python environments. A virtual environment is a directory tree (a folder structure) which contains Python executables and other necessary files. These are ideally isolated from system site directories. Each virtual environment has its own Python installation and can have its own independent set of installed Python packages.
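To see this isolation from within Python, the following minimal sketch compares `sys.prefix` (the environment in use) with `sys.base_prefix` (the Python installation the environment was created from). The function names are ours, chosen for illustration only.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True if the running interpreter belongs to a virtual environment.

    Inside a venv, sys.prefix points to the environment directory, while
    sys.base_prefix still points to the Python installation it was created from.
    """
    return sys.prefix != sys.base_prefix


def describe_interpreter() -> dict:
    """Collect the locations that show where this interpreter lives."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "base_prefix": sys.base_prefix,
        "in_venv": in_virtual_environment(),
    }


info = describe_interpreter()
for key, value in info.items():
    print(f"{key:12s} {value}")
```

Running this once with the system Python and once with an activated environment should show different `executable` and `prefix` values.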

When working in a command shell, create a virtual environment as follows (this assumes that Python 3 is installed):

`
python3 -m venv /path/to/new/virtual/environment
`
 
To activate the new virtual environment:

`
source /path/to/new/virtual/environment/bin/activate
`
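The same `venv` module can also be called from Python. The sketch below creates an environment in a temporary directory and lists its top-level content. The directory name `demo-env` is arbitrary, and `with_pip=False` is used only to keep the example fast and offline.

```python
import tempfile
import venv
from pathlib import Path


def create_environment(env_dir: Path) -> Path:
    """Create a virtual environment in env_dir and return that path.

    with_pip=False keeps this sketch fast and offline; the command-line
    call `python3 -m venv` installs pip by default.
    """
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_environment(Path(tmp) / "demo-env")
    # pyvenv.cfg marks the directory as a virtual environment
    has_config = (env_dir / "pyvenv.cfg").is_file()
    top_level = sorted(p.name for p in env_dir.iterdir())
    print(has_config, top_level)
```

The listing shows the directory tree described above: a folder with the Python executables, a folder for installed packages, and the `pyvenv.cfg` configuration file.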

### Let's try it!

We will create a fresh virtual environment and eventually install packages using the script `ribo-setup`, located
under *hbigs_course_2019/ribosome-profiling*.


### More

- For information about Python virtual environments, see the [venv](https://docs.python.org/3/library/venv.html#module-venv) documentation. See also [PEP 405](https://www.python.org/dev/peps/pep-0405/).


## 1.1.2 Installation of packages hosted on GitHub

### What is GitHub?

Git is an open-source version control system, *i.e.* a tool used to track changes to documents or code (for yourself or among collaborators), to release versions, *etc.* GitHub.com is a platform where developers (mostly) store their projects and make them accessible to the community. Anyone, even people who have nothing to do with the development of a project, can download the files and use them according to the license. For example, the `rpbp` package, as well as these notebooks/scripts and the original material for this part of the course, are under the MIT license. Project files are stored in a particular location, referred to as a repository (usually abbreviated to *repo*), which you can access via a unique URL.

### How to install a package hosted on GitHub?

Some packages hosted on GitHub can be installed via package-management systems (`pip` for Python, RStudio Package Manager, *etc.*). In these cases, you may not even know that the source code for these packages "lives" on GitHub. Sometimes, however, you need to install the package by first *cloning* the repository and following specific instructions.

![](img/git-1.png)
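As an illustration, `pip` can install directly from a Git repository using a `git+<URL>` target, optionally followed by `@<branch, tag or commit>`. The sketch below only builds the command. The repository URL and tag are placeholders, not the actual location of any package used in this course, and some packages need extra steps described in their own instructions.

```python
import sys
from typing import List, Optional


def pip_install_from_git(repo_url: str, ref: Optional[str] = None) -> List[str]:
    """Build the pip command that installs a package straight from a Git repository.

    repo_url: HTTPS URL of the repository (hypothetical in the example below).
    ref: optional branch, tag or commit to install instead of the default branch.
    """
    target = f"git+{repo_url}"
    if ref is not None:
        target += f"@{ref}"
    # sys.executable ensures pip runs for the current interpreter,
    # i.e. inside the active virtual environment.
    return [sys.executable, "-m", "pip", "install", target]


command = pip_install_from_git("https://github.com/some-user/some-package.git", ref="v1.0")
print(" ".join(command))
# To actually run it: subprocess.run(command, check=True)
```

Calling `pip` through `sys.executable` rather than a bare `pip` avoids installing into the wrong Python when several environments exist.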


### Let's try it!

After creating a new virtual environment for this part of the course, we will install the `rpbp` and `slurm-magic` packages using the script `ribo-setup`, located under *hbigs_course_2019/ribosome-profiling*.


### More

- In the last part of the course, we will go through a collection of explanations and short practical exercises about version control (Git and GitHub) and open source software. 


## 1.1.3 Short introduction to Jupyter (Notebook) and the JupyterHub

The Jupyter Notebook is an open-source web-based application for in-browser editing of notebook documents, which combine text, code, computations and rich media output. Typically, Jupyter runs on your own computer, *i.e.* your computer acts as the server. Jupyter supports many programming languages, including Python, R, and Julia. The JupyterHub is a multi-user version of the Notebook (this notebook is served via "our Hub" `https://jupyter.dieterichlab.org:49200`).

### Basic workflow

Typically, a notebook document is organised into cells, and one moves forward from one cell to the next, breaking the content or the computation into separate parts. This workflow lets you "validate" the output of one cell before moving to the next, and is also convenient for interactive exploration.

### Jupyter Notebook basics

The Notebook dashboard: when you first start the notebook server, your browser will open to the notebook dashboard. To create a new notebook, click on the "New" button and select a kernel from the dropdown menu. Running notebooks are shown with a green icon and text (they are also listed under the "Running" tab). Notebooks remain running until you explicitly shut them down; closing the notebook's page is not sufficient.
   
![](img/jupyter-1.png)
    
To shut down, delete, duplicate, or rename a notebook, check the checkbox next to it. You can also perform these operations, and more, directly on the running notebook by using the top menu and toolbar (see the top of this notebook!).

![](img/jupyter-2.png)

Modal editor: **edit mode**. Edit mode is indicated by a green cell border and a prompt showing in the editor area. When a cell is in edit mode, you can type into the cell, like a normal text editor. Enter edit mode by pressing `Enter` or using the mouse to click on a cell.

![](img/jupyter-3.png)

Modal editor: **command mode**. Command mode is indicated by a grey cell border with a blue left margin. When you are in command mode, you are able to edit the notebook as a whole, but not type into individual cells. Be careful not to type into a cell while in command mode: keystrokes are interpreted as commands! Enter command mode by pressing `Esc` or using the mouse to click outside a cell.
    
![](img/jupyter-4.png)

### Basic commands

Edit mode

| Command | Action |
|------|------|
|   `Ctrl-Enter`  | run selected cells |
|   `Shift-Enter`  | run cell, select below |
|   `Esc`  | enter command mode |

Command mode

| Command | Action |
|------|------|
|   `Enter`  | enter edit mode |
|   `a`  | insert cell above |
|   `b`  | insert cell below |


### More

- See [Project Jupyter](https://jupyter.org/) for installation instructions, detailed documentation, *etc.*
- After the course, explore the menu (Help) of this notebook, and experiment with basic commands.


### How to use Jupyter with virtual environments

We will now go back to your Desktop, open an `LXTerminal` and navigate to the directory for this part of the course by typing: 

`
cd hbigs_course_2019/ribosome-profiling
`

There you will see this notebook and others, as well as a file called `ribo-setup`. We will run this script to create a new virtual environment, clone the `rpbp` and `slurm-magic` repositories, install the packages and add a Jupyter kernel for the newly created environment. When the script has finished running, we will need to (1) refresh the page, (2) go to "Kernel" in the top menu bar, (3) choose "Change kernel" in the dropdown menu, and then select `hbigs19-ribo`, which is the name of our newly created environment.

![](img/jupyter-5.png)
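Behind the scenes, Jupyter finds a kernel through a small `kernel.json` file that tells it which Python executable to start. The sketch below builds such a file for a virtual environment. The environment path is a placeholder, and what `ribo-setup` does exactly is not shown here. A kernel is commonly registered with `python -m ipykernel install --user --name <name>`, run from inside the environment, which requires the `ipykernel` package.

```python
import json
from pathlib import Path


def kernel_spec(env_dir: Path, display_name: str) -> dict:
    """Return the contents of a kernel.json that starts the environment's Python."""
    return {
        # {connection_file} is filled in by Jupyter when the kernel starts
        "argv": [
            str(env_dir / "bin" / "python"),
            "-m",
            "ipykernel_launcher",
            "-f",
            "{connection_file}",
        ],
        "display_name": display_name,
        "language": "python",
    }


# Hypothetical environment location, chosen for illustration only
spec = kernel_spec(Path("/path/to/envs/hbigs19-ribo"), "hbigs19-ribo")
kernel_json = json.dumps(spec, indent=2)
print(kernel_json)
```

Because `argv` points to the Python executable of the environment, code run with this kernel has access to the packages installed there.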


We are now ready to go...


# Part 1.2: Introduction to ribosome-profiling (Ribo-seq) and the Rp-Bp workflow

For practical reasons, we will cover some technical aspects before introducing ribosome-profiling in more detail. In particular, we will first set up our notebook to run `rpbp` on the example Ribo-seq dataset, briefly introduce the SLURM workload manager (via `slurm-magic`), and actually run the `rpbp` pipeline. While our job is running, we will then go into more detail about Ribo-seq, the `rpbp` package and some of the methodological aspects behind it.



    - Sections:
        - High-level introduction to Ribo-seq, de novo ORF discovery (elements of annotation, 
        transcript isoforms, CDS, UTRs, etc.), biological relevance of 
        alternative translation events (including translation from non-coding transcripts), 
        and why we need "dedicated software" to analyse Ribo-seq data.
        - Very short introduction to the SLURM workload manager (slurm-magic), used
        to run the rpbp pipeline.
        - The rpbp pipeline step-by-step:
            - Creating reference genome indices;
            - Running the rpbp pipeline: creating ORF profiles, predicting
            translated ORFs.
    - Duration:
        - 1 hour 30 min. to 2 hours.
    - Questions & Objectives:
        - What is the translatome? What are the uses of Ribo-seq?
        - Why do we need dedicated software to analyse Ribo-seq data?
        - What software is available to analyse Ribo-seq data?
        - Understand how to use the rpbp package (on my laptop, on the cluster using
        the SLURM workload manager).
        - Run the complete rpbp pipeline on a selected Ribo-seq dataset.
    - Afterwards, I will be able to:
        - understand how to analyse Ribo-seq data for ORF discovery;
        - run the rpbp package (only ORF profiles, or full pipeline).

In [None]:
# open console, explain script, run script
# while running, open jupyter hub, login, and open notebook part 1

# part1: virtual environments (try also from console), jupyter notebooks (what is it, basic commands, etc.)
# and slurm (cluster): quick cell what to do if you want to do that from your PC at home (RTD, script, no need for
# notebooks in the end, etc.)
# now, add kernel and change: we have access to what we installed
# lastly, add what is needed to our path, we are now ready

# intro riboseq, rp-bp (practical, workflow), from RTD, etc. expand based on workflow demo for ribo-seq
# explain index creation, skip index, show where it is, and what it consists of
# prepare to run the example dataset, submit, a bit of slurm
# while running, extend on rp-bp, what it does in more details (RTD and paper, some background theory, etc.)
# in particular explain that we have changed params, but normally no, etc.
# when complete, run report (more is coming than pdf report...)

# part2: QC existing full results (prepare data before) (my git), maybe not all, but some plots...
# show some tables and plots
# exercise: use the example dataset and try the QC by repeating the steps

# part3: TE overview (need to run beforehand full dataset here they cannot run it, not 
# installed on their virtual environment), theoretical background, some code hints, etc.
# then load tables/results and explore (create MD plot, volcano plot, etc.)
# also have a look at Glimma (R), open plot and see.

# EXTRA:...


In [None]:

# TODO

# set license on all material, see license on git material, explain to them

# try full install, jupyter hub, switch and run example from course01
# run report
# run htseq, Deseq


# remove downsample analysis from ref folder, but run it on course01 and leave it there

In [None]:
# jupyter notebooks

In [180]:
# virtual environments

In [None]:
# slurm

In [None]:
# generic how-to if on your own laptop, etc.

In [None]:
# change kernel

In [None]:
# import modules that are needed to run this notebook


In [25]:
# some functions, definitions, etc. that are needed to run this notebook

from IPython.core.magic import register_line_cell_magic

@register_line_cell_magic
def writefile_globals(line, cell):
    with open(line, 'w') as f:
        f.write(cell.format(**globals()))
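The magic above relies on `str.format`: every `{NAME}` in the cell is replaced by the value of the variable `NAME`. The plain-Python sketch below shows the same mechanism outside IPython. The function name and the path are ours, for illustration only.

```python
def render_template(template: str, variables: dict) -> str:
    """Fill {NAME} placeholders in template with values from variables."""
    return template.format(**variables)


template = (
    "riboseq_data: {RES}\n"
    "literal_braces: {{kept}}\n"  # doubled braces survive as single braces
)
rendered = render_template(template, {"RES": "/path/to/riboseq-results"})
print(rendered)
```

One consequence of this design is that any literal curly brace in the cell has to be doubled, otherwise `str.format` treats it as a placeholder and raises an error if no matching variable exists.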

In [1]:
# $PATH is an environment variable that lists the directories in which the system looks for programs.
# When you type a command, the system searches for it in the directories specified by $PATH, in order.

# What is on your $PATH?
!echo $PATH 


/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/bin
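To make the lookup concrete, here is a minimal sketch of what the system does when resolving a command: it walks through the directories on `$PATH` in order and returns the first matching executable. The tool name `mytool` is made up for the example. In practice, the standard library function `shutil.which` performs this search for you.

```python
import os
import stat
import tempfile
from typing import Optional


def find_on_path(command: str, path: str) -> Optional[str]:
    """Return the first executable named command in the directories of path."""
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Create a dummy executable called "mytool" (hypothetical name)
    tool = os.path.join(tmp, "mytool")
    with open(tool, "w") as handle:
        handle.write("#!/bin/sh\necho hello\n")
    os.chmod(tool, os.stat(tool).st_mode | stat.S_IXUSR)

    found = find_on_path("mytool", os.pathsep.join(["/nonexistent", tmp]))
    missing = find_on_path("mytool", "/nonexistent")
    found_in_tmp = found == tool
    print(found_in_tmp, missing)
```

A command installed in a directory that is not on `$PATH` is simply not found, which is why the location of the required programs has to be added.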


In [2]:
# We will now add the location of certain programs that are required to our $PATH,
# including executable scripts that were installed with the rpbp package.

# Advanced note: sys.path does not contain path to virtual environment executables, and adding to sys.path
# does not solve the problem...

# first find where we are, we use the same structure as in "ribo-setup" 
HOME = !echo $HOME
HOME = HOME.n
PARENT = HOME + '/hbigs_course_2019/ribosome-profiling'
# this is the location where rpbp-related executables were installed
ENVLOC = PARENT + '/envs/hbigs19-ribo/bin'
# these are standard bioinformatics tools that you have already used, which are installed
# at these locations on the cluster, we use specific versions to be compatible with the rpbp package
add2path = ['/biosw/slurm/18.08.6.2/bin',
            '/biosw/bowtie2/2.3.0',
            '/biosw/star/2.6.1d',
            '/biosw/samtools/1.7/bin',
            '/biosw/flexbar/3.5.0']
add2path.append(ENVLOC)

PATH = !echo $PATH
PATH.extend(add2path)
PATH = ':'.join(PATH)
# our updated $PATH is...
%set_env PATH=$PATH


env: PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/bin:/biosw/slurm/18.08.6.2/bin:/biosw/bowtie2/2.3.0:/biosw/star/2.6.1d:/biosw/samtools/1.7/bin:/biosw/flexbar/3.5.0:/home/eboileau/hbigs_course_2019/ribosome-profiling/envs/hbigs19-ribo/bin
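Note that re-running the cell above appends the same directories again. A small helper that skips entries already present avoids this. This is a sketch in plain Python, with a function name of our choosing, not part of the course scripts.

```python
import os
from typing import Iterable


def extend_path(current: str, extra: Iterable[str]) -> str:
    """Append directories to a PATH-style string, skipping ones already present."""
    directories = [d for d in current.split(os.pathsep) if d]
    for directory in extra:
        if directory not in directories:
            directories.append(directory)
    return os.pathsep.join(directories)


current = os.pathsep.join(["/bin", "/usr/bin"])
updated = extend_path(current, ["/biosw/samtools/1.7/bin", "/usr/bin"])
print(updated)
# To apply it to the running process: os.environ["PATH"] = updated
```

Appending (rather than prepending) keeps the system directories first in the search order, as in the cell above.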


In [None]:
# intro riboseq, rp-bp (practical, workflow), from RTD, etc. expand based on workflow demo for ribo-seq
# explain index creation, skip index, show where it is, and what it consists of

# intro rpbp run-merge all, etc. i.e. what we do below

In [17]:
# Prepare to run the example (downsampled) dataset: 4 replicates, 2 PBS, 2 EGF.
# change directory
DIRLOC = '/riboSeqHBIGS19-downsampled-analysis'
CFG = HOME + DIRLOC + '/config'
RES = HOME + DIRLOC + '/riboseq-results'
%cd $CFG


[Errno 2] No such file or directory: '/home/eboileau/riboSeqHBIGS19-downsampled-analysis/config'
/beegfs/pub/hbigs_course_2019/ribosome-profiling


In [None]:
# We first need to prepare a YAML configuration file to run the rpbp package. 
# YAML ("YAML Ain't Markup Language") is a human-readable data-serialization language.

# We will do this below and explain the structure of the config file. Please check that
# the file has been correctly written to disk under /riboSeqHBIGS19-downsampled-analysis/config

In [26]:
%%writefile_globals hbigs19-downsampled.yaml

project_name: HBIGS19-downsampled

# Base location for the created index files.
genome_base_path: /beegfs/pub/hbigs_course_2019/ribosome-profiling/genomes/homo_sapiens/GRCh38_96
    
# An identifier which will be used in the filenames. This should not contain
# spaces, forward slashes, or other special characters.
genome_name: GRCh38.96
    
# The full path to the GTF file which contains the exon and CDS annotations.
gtf: /beegfs/pub/hbigs_course_2019/ribosome-profiling/genomes/homo_sapiens/GRCh38_96/GRCh38.96.gtf

# The fasta file which contains the genome. The genomic identifiers in the GTF and
# fasta files must match (e.g., "I" and "I", or "chrI" and "chrI", but not "I" and "chrI").
fasta: /beegfs/pub/hbigs_course_2019/ribosome-profiling/genomes/homo_sapiens/GRCh38_96/GRCh38_96.fa

# The base location for the STAR genome index.
star_index: /beegfs/pub/hbigs_course_2019/ribosome-profiling/genomes/homo_sapiens/GRCh38_96/star

# The base location for the Bowtie2 index for the ribosomal sequences.
ribosomal_index: /beegfs/pub/hbigs_course_2019/ribosome-profiling/genomes/homo_sapiens/rRNA_cluster_plus_mtRNA/rRNA_cluster_plus_mtRNA
# The fasta file containing the rRNA sequences. The file can also contain other
# sequences which should be filtered, such as tRNA or snoRNAs
ribosomal_fasta: /beegfs/pub/hbigs_course_2019/ribosome-profiling/genomes/homo_sapiens/rRNA_cluster_plus_mtRNA/rRNA_cluster_plus_mtRNA.fasta

# A file containing standard adapters.
adapter_file: /beegfs/pub/hbigs_course_2019/ribosome-profiling/genomes/riboseq-adapters.fa

# The base location for the created files
riboseq_data: {RES}

# A dictionary in which each entry specifies a sample. The key is an 
# informative name about the sample, and the value gives the complete path to 
# the sequencing file (a fastq(.gz) file). The names will be used to 
# construct filenames, so they should not contain spaces, forward slashes, or 
# other special characters.
riboseq_samples:
 dSRR7451194.EGF.rep-1: /beegfs/pub/hbigs_course_2019/raw-data/downsampled/dSRR7451194_1.fastq.gz
 dSRR7451184.EGF.rep-2: /beegfs/pub/hbigs_course_2019/raw-data/downsampled/dSRR7451184_1.fastq.gz
 dSRR7451191.PBS.rep-1: /beegfs/pub/hbigs_course_2019/raw-data/downsampled/dSRR7451191_1.fastq.gz
 dSRR7451197.PBS.rep-2: /beegfs/pub/hbigs_course_2019/raw-data/downsampled/dSRR7451197_1.fastq.gz

riboseq_biological_replicates:
 EGF:
  - dSRR7451194.EGF.rep-1
  - dSRR7451184.EGF.rep-2
 PBS:
  - dSRR7451191.PBS.rep-1
  - dSRR7451197.PBS.rep-2

riboseq_sample_name_map:
 dSRR7451194.EGF.rep-1: EGF1
 dSRR7451184.EGF.rep-2: EGF2
 dSRR7451191.PBS.rep-1: PBS1
 dSRR7451197.PBS.rep-2: PBS2

# Rp-Bp options: we need to change the default parameters to run the downsampled data.
# Generally, you do not need to change the default parameters!

# The number of bases upstream of the translation initiation site to begin 
# constructing the metagene profile.
metagene_start_upstream: 50
# The number of bases downstream of the translation initiation site to end 
# the metagene profile.
metagene_start_downstream: 50
# The number of bases upstream of the translation termination site to begin 
# constructing the metagene profile.
metagene_end_upstream: 50
# The number of bases downstream of the translation termination site to end 
# the metagene profile.
metagene_end_downstream: 50

# N.B. These values are set artificially low for the example to work!
min_metagene_profile_count: 50
min_metagene_image_count: 10

# N.B. These values are set low to reduce the running time, but will affect the results.
metagene_iterations: 100
translation_iterations: 100
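In the configuration above, the names listed under `riboseq_biological_replicates` appear to refer to the keys of `riboseq_samples`, so a typo in one place can go unnoticed. The sketch below checks this on a dictionary, assuming the YAML file has already been parsed (for example with `yaml.safe_load` from PyYAML). It is our own illustration, not a check provided by the `rpbp` package, and the paths are placeholders.

```python
def check_replicates(config: dict) -> list:
    """Return replicate sample names that are missing from riboseq_samples."""
    samples = set(config.get("riboseq_samples", {}))
    missing = []
    for condition, names in config.get("riboseq_biological_replicates", {}).items():
        for name in names:
            if name not in samples:
                missing.append((condition, name))
    return missing


# A cut-down stand-in for the parsed YAML; the paths are placeholders
config = {
    "riboseq_samples": {
        "dSRR7451194.EGF.rep-1": "/path/to/dSRR7451194_1.fastq.gz",
        "dSRR7451191.PBS.rep-1": "/path/to/dSRR7451191_1.fastq.gz",
    },
    "riboseq_biological_replicates": {
        "EGF": ["dSRR7451194.EGF.rep-1"],
        "PBS": ["dSRR7451191.PBS.rep-1", "dSRR7451197.PBS.rep-2"],
    },
}
problems = check_replicates(config)
print(problems)
```

Here the second PBS replicate is reported, because it was deliberately left out of `riboseq_samples` in this cut-down example.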


In [None]:
# We are now ready to submit our job. We use the SLURM workload manager.

In [31]:
%%sbatch
#!/bin/bash
#SBATCH -J "hbigs19"
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=20G

run-all-rpbp-instances hbigs19-downsampled.yaml \
    --merge-replicates \
    --run-replicates \
    --keep-intermediate-files \
    --num-cpus 12 \
    --use-slurm \
    --logging-level INFO \
    --log-file hbigs19-downsampled.log


'Submitted batch job 336179\n'
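The message returned on submission contains the job ID, which is handy for following or cancelling the job later. A minimal sketch to extract it from that message (the function name is ours):

```python
import re
from typing import Optional


def parse_job_id(sbatch_output: str) -> Optional[int]:
    """Extract the job ID from a 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return int(match.group(1)) if match else None


job_id = parse_job_id("Submitted batch job 336179\n")
print(job_id)
# The ID can then be passed to SLURM commands, e.g. `squeue -j <id>` or `scancel <id>`
```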

In [37]:
# We check that our job is actually running...
%squeue -u course01

TypeError: initial_value must be str or None, not tuple

In [None]:
# while running, extend on rp-bp, what it does in more details (RTD and paper, some background theory, etc.)

In [12]:
%%latex

\begin{align}
a && b && c \\
1 && 2 && 3
\end{align}

<IPython.core.display.Latex object>

In [10]:
# imports

import os
import sys
import pandas as pd
import numpy as np

import logging

import itertools

from collections import defaultdict

import Bio.Seq
from Bio import SeqIO

import pbio.utils.bed_utils as bed_utils
import pbio.misc.utils as utils

import pbio.misc.parallel as parallel
import pbio.misc.pandas_utils

import pbio.ribo.ribo_utils as ribo_utils
import pbio.ribo.ribo_filenames as filenames

import pbio.misc.logging_utils as logging_utils
logger = logging_utils.get_ipython_logger()

from argparse import Namespace
args = Namespace()


In [11]:
%load_ext slurm_magic

In [2]:
#################################
#################################
#################################

In [5]:
# graphics

%load_ext autoreload
%autoreload 2
%matplotlib inline

import matplotlib
import matplotlib.ticker as mtick
import matplotlib.patches as patches
from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

import seaborn as sns
# pass matplotlib settings explicitly via the rc argument
sns.set(style='ticks', rc={"ytick.direction": 'out'})

params = {
    'axes.labelsize': 26,
    'font.size': 26,
    'legend.fontsize': 26,
    'xtick.labelsize': 24,
    'ytick.labelsize': 24,
    'text.usetex': True,
    'figure.figsize': [12, 8],
    'font.family': 'sans-serif',
    'font.sans-serif': 'DejaVu Sans',
    'mathtext.fontset': 'dejavusans'
}
plt.rcParams.update(params)
font = FontProperties().copy()

mpl_logger = logging.getLogger('matplotlib')
mpl_logger.setLevel(logging.WARNING) 


DEBUG    : Loaded backend module://ipykernel.pylab.backend_inline version unknown.


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
######################################################
######################################################
######################################################
######################################################

MIT License

Copyright (c) 2019 Etienne Boileau