<a href="https://colab.research.google.com/github/farzinlize/AKAGI/blob/master/akagi_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AKAGI - Motif Finding and Analysis Application

AKAGI is an application that searches for repeated patterns whose instances are nearly identical across a set of sequences. Our goal is to find and study hidden DNA patterns that are highly related to a specific binding site.
Visit our GitHub page for more information: [AKAGI github](https://github.com/farzinlize/AKAGI)

Table 1. List of supported AKAGI commands


*   `SLD` : single level dataset (cache)
  *   options: `kmin`, `kmax` of background Gkmerhood, `level` of graph to cache, `dmax` of maximum search depth
* `FLD` : first-level last-level dataset (cache)
  * generate cache of all nodes between first-level and last-level
*   `MFC` : Motif finding & Chaining (described later in document)
*   `SDM` : sequence distance matrix
  *   generate a distance matrix for all instances of each pattern in fasta format
* `ARS` : analysis raw statistics (Tumpa article)
* `ALG` : multiple alignment using `muscle` application
  * align all instances of a group in fasta format
* `CNM` : coloring neighbourhood of motifs (study design)
* `2BT` : download genome reference 2bit file





In [None]:
# # # # # # # # # # # # # # # # # # # # # # # # # #
 
#               AKAGI INSTALLATION                #
 
# # # # # # # # # # # # # # # # # # # # # # # # # #
 
# cloning AKAGI source code hosted at github
!git clone https://github.com/farzinlize/AKAGI.git
%cd AKAGI
 
# installing AKAGI dependencies
''' 
  python libraries required by AKAGI:
  - biopython
  - twobitreader: utilize 2bit compressed genome file
  - memory_profiler
  - email-to: auto email reports and attachments
  - PyDrive: google drive authentication and operations
'''
!pip install biopython
!pip install twobitreader
!pip install -U memory_profiler
!pip install email-to
!pip install PyDrive
 
# downloading datasets from hmchip
!mkdir hmchipdata
!wget http://jilab.jhsph.edu/database/dataset/Human_hg18_peakcod.tar.gz
!tar -xf Human_hg18_peakcod.tar.gz -C ./hmchipdata
 
# downloading human genome references
!python app.py 2BT -r hg18

# making PFM directory
!mkdir pfms 

'''
  [WARNING] PLACE SECRET IN `secret.json` FOR EMAIL-REPORT HERE
'''
!echo {'"google_app_password"':'"SECRET_PASSWORD"'} > secret.json

Cloning into 'AKAGI'...
remote: Enumerating objects: 1288, done.
remote: Counting objects: 100% (1288/1288), done.
remote: Compressing objects: 100% (851/851), done.
remote: Total 1288 (delta 697), reused 1022 (delta 431), pack-reused 0
Receiving objects: 100% (1288/1288), 14.16 MiB | 7.35 MiB/s, done.
Resolving deltas: 100% (697/697), done.
/content/AKAGI
Collecting biopython
  Downloading https://files.pythonhosted.org/packages/3a/cd/0098eaff841850c01da928c7f509b72fd3e1f51d77b772e24de9e2312471/biopython-1.78-cp37-cp37m-manylinux1_x86_64.whl (2.3MB)
     |████████████████████████████████| 2.3MB 6.8MB/s 
Installing collected packages: biopython
Successfully installed biopython-1.78
Collecting twobitreader
  Downloading https://files.pythonhosted.org/packages/d5/2c/7278556581fd716eec5e83c095c279e4c9012723f423a527093d8b57f3b3/twobitreader-3.1.7.tar.gz
Building wheels for collected packages: twobitreader
  Building wheel for twobitreader (setup.py) ... don

In [None]:
# AKAGI uninstall:
%cd ..
!rm -rf AKAGI

In [None]:
# AKAGI update
!git reset --hard
!git pull

# Email and Cloud authentication

Sending reports and using cloud services to run experiments requires a few private tokens, as described here.




## `secret.json` file

AKAGI reports contain execution information and predictions and are sent by email. A JSON file named `secret.json`, which holds sensitive data, is required for this purpose. The best way to provide it is **to edit the file manually** using the *Files* tool in the left panel of Colab. Alternatively, run the `echo` cell in this section with your password in place of `SECRET_PASSWORD`, but remember to remove the password from the cell after running it.
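
As an alternative that avoids leaving the password in a cell's source, the file can be written from Python. This is only a minimal sketch: the key name `google_app_password` is taken from the `echo` command in this notebook, while `write_secret` and `prompt_and_write` are hypothetical helper names introduced here for illustration.

```python
import json
from getpass import getpass


def write_secret(password, path="secret.json"):
    """Write the password under the key used by the `echo` cell in this notebook."""
    with open(path, "w") as handle:
        json.dump({"google_app_password": password}, handle)


def prompt_and_write(path="secret.json"):
    """Ask for the password without echoing it, then write the file."""
    write_secret(getpass("google app password: "), path)
```

Calling `prompt_and_write()` in a cell asks for the password interactively, so nothing sensitive remains in the notebook source.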

In [None]:
################ WARNING IMPORTANT ########################
####  delete the password after running cell for safety ###
###########################################################

# place the password instead of "SECRET_PASSWORD"
!echo {'"google_app_password"':'"SECRET_PASSWORD"'} > secret.json

In [None]:
# cell for email report use
!python report_email.py -i app.py -a constants.py -t T

In [None]:
# changing the email address to akagi automail account
!python app.py NOP -C "EMAIL_ACCOUNT='akagi.automail@gmail.com'"

## `client_secrets.json`

This file is required by the `PyDrive` library to function.

### google drive authentication 

The `googledrive.py` module handles Google Drive operations using the `PyDrive` library. Once you run `python googledrive.py`, *AKAGI: motif finding* can work with checkpoints stored on Google Drive.

In [None]:
!touch client_secrets.json

In [None]:
# run this cell once for a colab session
!python googledrive.py

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?client_id=285671014364-v2c9m48chkqi23e33phpoodrjfugjvop.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&access_type=offline&response_type=code

Enter verification code: 4/1AY0e-g7f2Seh0Qa4PcKxRyf09QogpxMQO6o3czzvYyD0PB-UH2mbZ4KAltM
Authentication successful.


# Data preparation

Some AKAGI operations require reference or pre-processed data to function. The `2BT` and `FLD` commands provide the necessary data described here:



## human genome reference

The human genome reference hg18 is used to map peak annotations to their sequences. The `2BT` command downloads the available references.



In [None]:
# downloading human genome reference
!python app.py 2BT --reference=hg18

[2BIT] downloading reference: hg18[2BIT] completed


## BFS data Caches

The observation phase needs the d-neighbourhood of each kmer extracted from a sequence through a fixed-length window. First, a dataset is generated to cache the BFS information of gkmerhood nodes.
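
To make the idea of a d-neighbourhood concrete, here is a minimal sketch of a BFS over kmers. It assumes that two kmers are neighbours when they are one substitution, insertion or deletion apart, and that lengths are bounded by `kmin` and `kmax`. The function names are illustrative and are not AKAGI's API; the real gkmerhood graph and its cache format may differ.

```python
from collections import deque

ALPHABET = "ACGT"


def single_edits(kmer):
    """Yield every string one substitution, deletion or insertion away from kmer."""
    for i in range(len(kmer)):
        yield kmer[:i] + kmer[i + 1:]  # deletion
        for base in ALPHABET:
            if base != kmer[i]:
                yield kmer[:i] + base + kmer[i + 1:]  # substitution
    for i in range(len(kmer) + 1):
        for base in ALPHABET:
            yield kmer[:i] + base + kmer[i:]  # insertion


def d_neighbourhood(kmer, dmax, kmin, kmax):
    """BFS from kmer; return {neighbour: BFS level} for levels up to dmax."""
    distance = {kmer: 0}
    queue = deque([kmer])
    while queue:
        current = queue.popleft()
        if distance[current] == dmax:
            continue  # do not expand beyond the maximum search depth
        for neighbour in single_edits(current):
            if kmin <= len(neighbour) <= kmax and neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    return distance
```

Caching the result of such a search per node is what saves repeated BFS work later; the neighbourhood grows quickly with `dmax`, which is one reason to precompute it.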



In [None]:
'''
  levels: 7-8
  index: 2 (in `dataset_tree`)
  tree name: gkhood78.tree
  cache location: /cache78/
'''
!python app.py FLD -m5 -M10 -l 7-8 -d2

In [None]:
'''
  levels: 5-6
  index: 1 (in `dataset_tree`)
  tree name: gkhood56.tree
  cache location: /cache56/
'''
!python app.py FLD -m3 -M8 -l 5-6 -d2

operation FLD: generating First-level-Last-level dataset
        arguments -> kmin=3, kmax=8, first-level=5, last-level=6, distance=2
GKhood instance generated in 00:01:22
dataset generated in 00:00:24


## peak annotation to sequence

The `peakseq.py` module converts annotations to sequences for further study. The annotations are in ENCODE format, gathered from [hmChIP](http://jilab.biostat.jhsph.edu/database/cgi-bin/hmChIP.pl)
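
The conversion step can be pictured with the sketch below. It assumes the peaks are already parsed into `(chrom, start, end)` tuples with 0-based, half-open coordinates, and it uses a plain dictionary of strings as a stand-in for the 2bit genome reference. The layout of the `.cod` files is not shown in this notebook, so parsing them is left out, and `peaks_to_fasta` is a hypothetical name rather than a function of `peakseq.py`.

```python
def peaks_to_fasta(peaks, genome):
    """Return FASTA text with one record per peak.

    peaks  -- iterable of (chrom, start, end), 0-based half-open (assumption)
    genome -- mapping from chromosome name to its sequence (stand-in for 2bit)
    """
    lines = []
    for chrom, start, end in peaks:
        lines.append(f">{chrom}:{start}-{end}")
        lines.append(genome[chrom][start:end].upper())
    return "\n".join(lines) + "\n"


# toy data, not taken from hg18
toy_genome = {"chr1": "acgtacgtac"}
toy_fasta = peaks_to_fasta([("chr1", 2, 6)], toy_genome)
```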

In [None]:
!python peakseq.py -c ./hmchipdata/Human_hg18_peakcod/ENCODE_HAIB_GM12878_SRF_peak.cod

[peakseq] making fasta file from cod-annotation peaks (cod=./hmchipdata/Human_hg18_peakcod/ENCODE_HAIB_GM12878_SRF_peak.cod)


# Motif finding using chain algorithm

AKAGI uses the two-phase algorithm described below for the `MFC` command:

## 1) Observation phase

In this phase, AKAGI reads the input sequences in fixed-length frames and adds every kmer it sees to the WATCH tree. After a few steps of tree searching (BFS), the kmers with higher frequencies are extracted and stored in a list for the next phase.
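
A much simplified sketch of this phase is shown below. It only counts exact kmers with a sliding window and keeps those present in enough sequences; it ignores the d-neighbourhood search and the WATCH tree, and the `observe` function with its `min_sequences` threshold is a hypothetical illustration rather than AKAGI's code.

```python
from collections import defaultdict


def observe(sequences, frame, min_sequences):
    """Slide a window of length `frame` over each sequence and keep the kmers
    that occur in at least `min_sequences` different sequences."""
    seen = defaultdict(set)
    for seq_id, sequence in enumerate(sequences):
        for start in range(len(sequence) - frame + 1):
            seen[sequence[start:start + frame]].add(seq_id)
    return {kmer: ids for kmer, ids in seen.items() if len(ids) >= min_sequences}


# toy sequences, for illustration only
frequent = observe(["ACGTAC", "TTACGT", "GGGGGG"], frame=3, min_sequences=2)
```

Here `frequent` keeps `ACG`, `CGT` and `TAC`, each of which appears in the first two toy sequences.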

## 2) Chaining phase

The high-frequency kmers from the previous phase are called **motifs**. AKAGI searches for possible links between these words to reveal longer patterns. Two parameters, **overlap** and **gap**, are used in this phase to allow each instance to differ more from the original pattern at each level. *The more successful links, the farther from the pattern a chain can go.*
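
One possible reading of the link test is sketched below. It assumes that a second motif can be chained to a first one in a sequence when the second starts at most `overlap` positions before, or at most `gap` positions after, the end of the first. This interpretation of the two parameters is an assumption made for illustration, and `linked_sequences` is a hypothetical name.

```python
def linked_sequences(first, second, first_len, overlap, gap):
    """Return the ids of sequences in which `second` can follow `first`.

    first, second -- dict mapping sequence id to a list of start positions
    first_len     -- length of the first motif
    """
    linked = set()
    for seq_id, starts in first.items():
        for start in starts:
            end = start + first_len
            # offset < 0 means the motifs overlap, offset > 0 means a gap
            if any(-overlap <= other - end <= gap
                   for other in second.get(seq_id, [])):
                linked.add(seq_id)
                break
    return linked


# toy occurrences, for illustration only
toy_links = linked_sequences({0: [2], 1: [0]}, {0: [8], 1: [20]},
                             first_len=5, overlap=3, gap=4)
```

In the toy data only sequence `0` qualifies: the second motif starts one position after the first ends, while in sequence `1` it starts fifteen positions later.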


In [None]:
'''
  #### biology
  input sequences: peaks
  sample: A549 cell
  treatment: 1h with 500 pM Dexamethasone (Myers)
  Antibody Target: NR3C1
    The glucocorticoid receptor (GR, or GCR) also known as NR3C1
  -> GR binding in lung carcinoma tissue derived epithelial cell line A549 <-
  #### application
  file size = 13 KB
  frame sizes = (5-distance=1), (6-distance=1) - multilayer observation
  minimum lexicon limit = 1000 words
'''

# disk utilization
!python app.py NOP -C FOUNDMAP_MODE=FOUNDMAP_DISK -C BATCH_SIZE=10

!python app.py MFC -s peaks/ENCODE_HAIB_A549_Dex500pM_NR3C1_peak -x1000 -Q -t /A1000 -f 5-6 -d 1-1 -G 1-1 -u

# AKAGI Prediction

The operations described previously identify repeated motifs with an *edit-distance based* model. Finally, AKAGI ranks each motif found to determine the best one in terms of different observations.

- SSMART: a statistical score for each motif, measured from the peak sequence scores given by peak calling tools
- SUMMIT: an experimental score that reflects the distance of a motif from the peak summit
- JASPAR: a reference score that evaluates a motif found by AKAGI against the JASPAR database

In [None]:
# # # # # # # # # # # # # # # # # # # # # # # # # #
 
#              JUND EXPERIMENT                    #
 
# # # # # # # # # # # # # # # # # # # # # # # # # #
 
# [WARNING] -> `secret.json` is required for auto-email
# [WARNING] -> specify file types for each attachment -> T: textfile, I: image
 
# experiment description
'''
  input sequences: ChIP-Seq peaks
  sample: GM12878
  tissue: blood
  Antibody Target: JUND (Transcription Factor)
  jaspar index: MA0491.1
  chipXpress: -unmeasured-
'''

# ----------- data preparation ----------- #

# reading annotation and make fasta format sequence file
!python peakseq.py -c ./hmchipdata/Human_hg18_peakcod/ENCODE_Yale_GM12878_JUND_peak.cod > peakseq.out

# store BFS cache data for experiment
!python app.py FLD -m4 -M7 -l 5-6 -d1 > FLD56.out

# downloading jaspar motif reference
!wget -q -P ./pfms http://jaspar.genereg.net/api/v1/matrix/MA0491.1.pfm
 
# ----------- AKAGI configuration ----------- #
!python app.py NOP -C PARENT_WORK=True
 
# ----------- AKAGI execution ----------- #
!python app.py MFC -s hmchipdata/Human_hg18_peakcod/ENCODE_Yale_GM12878_JUND_peak \
  --megalexa 1000               \
  --find-max-q                  \
  --multi-layer                 \
  --frame 3-5-6                 \
  --distance 0-1-1              \
  --gkhood 0-1-1                \
  --gap 4                       \
  --overlap 3                   \
  --multicore                   \
  --ncores 2                    \
  --jaspar pfms/MA0491.1.pfm    \
  > MFC.out
 
# ----------- clear & report ----------- #
!python FoundMap.py > clear.out
!python report_email.py -i peakseq.out-FLD56.out-MFC.out-clear.out -a chaining_report.window -t T

# Experiments and Evaluation

Each experiment aims to find patterns in ChIP-seq data gathered from the [hmChip](http://jilab.biostat.jhsph.edu/database/cgi-bin/hmChIP.pl) database that can predict the target protein's binding sites. Sample references are provided below. For evaluation, each pattern is measured against known PWM-based motifs from the JASPAR database, in the versions listed below.
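
For readers unfamiliar with the reference matrices, the sketch below parses a raw PFM and derives its consensus. It assumes the file holds an optional `>` header followed by four rows of counts in A, C, G, T order. The matrix in the example is a toy, not a JASPAR entry, and the function names are illustrative only.

```python
def parse_pfm(text):
    """Parse a raw PFM: optional '>' header, then count rows for A, C, G, T."""
    rows = [
        [float(value) for value in line.split()]
        for line in text.splitlines()
        if line.strip() and not line.startswith(">")
    ]
    if len(rows) != 4 or len({len(row) for row in rows}) != 1:
        raise ValueError("expected four rows of equal length (A, C, G, T)")
    return dict(zip("ACGT", rows))


def consensus(pfm):
    """Pick the most frequent base at every position."""
    length = len(pfm["A"])
    return "".join(max("ACGT", key=lambda base: pfm[base][i])
                   for i in range(length))


# toy matrix, not a real JASPAR profile
TOY_PFM = """>TOY example
 8  0  1
 0  9  1
 1  0  7
 1  1  1
"""
toy_consensus = consensus(parse_pfm(TOY_PFM))
```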

## SRF Experiment

### Description

-  input sequences: ChIP-Seq peaks
-  sample: [GM12878](https://www.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=GM12878)
-  tissue: blood
-  Antibody Target: [SRF](https://www.uniprot.org/uniprot/P11831) (Transcription Factor)
-  jaspar index(s): MA0083 with 3 versions

  1.   [SELEX](http://jaspar.genereg.net/matrix/MA0083.1/) (*-low* number of sites)
  2.   [ChIP-seq](http://jaspar.genereg.net/matrix/MA0083.2/)
  3.   [HT-SELEX](http://jaspar.genereg.net/matrix/MA0083.3/)

-  chipXpress: (score=12.2, rank=1)

### AKAGI results

- not calculated yet


In [None]:
# # # # # # # # # # # # # # # # # # # # # # # # # #
 
#              SRF  EXPERIMENT                    #
 
# # # # # # # # # # # # # # # # # # # # # # # # # #
 
# [WARNING] -> `secret.json` is required for auto-email
# [WARNING] -> specify file types for each attachment -> T: textfile, I: image

# ----------- data preparation ----------- #

# reading annotation and make fasta format sequence file
# !python peakseq.py -c ./hmchipdata/Human_hg18_peakcod/ENCODE_HAIB_GM12878_SRF_peak.cod > peakseq.out

# store BFS cache data for experiment
# !python app.py FLD -m3 -M8 -l 5-6 -d2 > FLD56.out

# downloading jaspar motif reference
!wget -q -P ./pfms http://jaspar.genereg.net/api/v1/matrix/MA0083.2.pfm
 
# ----------- AKAGI configuration ----------- #
!python app.py NOP -C PARENT_WORK=True             # parent job
!python app.py NOP -C MAX_SEQUENCE_COUNT=100       # briefing sequences
!python app.py NOP -C TIMER_CHAINING_HOURS=4       # set up 4 hour timer
!python app.py NOP -C SAVE_OBSERVATION_CLOUD=True  # uploading observation result (ON)
!python app.py NOP -C SAVE_THE_REST_CLOUD=False    # rest of work checkpoints (LOCAL)
!python app.py NOP -C NEED_HELP=100000             # send help signal
!python app.py NOP -C ON_SEQUENCE_ANALYSIS=False   # on sequence analysis (OFF)

# ----------- AKAGI execution ----------- #
!mprof run python app.py MFC -s hmchipdata/Human_hg18_peakcod/ENCODE_HAIB_GM12878_SRF_peak \
  --megalexa 1000               \
  --find-max-q                  \
  --multi-layer                 \
  --frame 5-6                   \
  --distance 1-2                \
  --gkhood 1-1                  \
  --gap 4                       \
  --overlap 3                   \
  --multicore                   \
  --ncores 2                    \
  --jaspar pfms/MA0083.2.pfm
  
!mprof plot -o memory.png
 
# ----------- clear & report ----------- #
!python FoundMap.py > clear.out
!python report_email.py -i peakseq.out-MFC.out-clear.out --re-p -a chaining_report.window-memory.png -t T-I

mprof: Sampling memory every 0.1s
running new process
operation MFC: finding motif using chain algorithm (tree_index(s):[1, 1])
        arguments -> f(s)=[5, 6], q=-2, d(s)=[1, 2], gap=4, overlap=3, dataset=hmchipdata/Human_hg18_peakcod/ENCODE_HAIB_GM12878_SRF_peak
        operation mode: False; coloring_frame=-1; multi-layer=True; megalexa=1000
[FOUNDMAP] foundmap mode: disk
[BRIEFING] number of sequences = 100
(len:400,score:8.486294) (len:400,score:8.466699) (len:400,score:7.986474) (len:400,score:7.867773) (len:400,score:7.759249) (len:400,score:7.614328) (len:400,score:7.612467) (len:400,score:7.256706) (len:400,score:7.161123) (len:400,score:7.035571) (len:400,score:7.020070) (len:400,score:6.959688) (len:400,score:6.847917) (len:315,score:6.801447) (len:400,score:6.762850) (len:400,score:6.745471) (len:315,score:6.705765) (len:400,score:6.580359) (len:400,score:6.566410) (len:400,score:6.535247) (len:400,score:6.476416) (len:400,score:6.469121) (len:400,score:6.436924) (len:400,

## CEBPB Experiment

### Description

-  input sequences: ChIP-Seq peaks
-  sample: [HepG2](https://www.atcc.org/products/all/HB-8065.aspx)
-  tissue: liver
-  Antibody Target: [CEBPB](https://www.uniprot.org/uniprot/P17676) (Transcription Factor)
-  jaspar index(s): MA0466 with 2 versions
  -    1: [ChIP-seq](http://jaspar.genereg.net/matrix/MA0466.1/)
  -    2: [HT-SELEX](http://jaspar.genereg.net/matrix/MA0466.2/)
-  chipXpress: (score=4.7, rank=1)

### AKAGI results

- not calculated yet


In [None]:
# # # # # # # # # # # # # # # # # # # # # # # # # #
 
#              CEBPB  EXPERIMENT                  #
 
# # # # # # # # # # # # # # # # # # # # # # # # # #

# [WARNING] -> `secret.json` is required for auto-email
# [WARNING] -> specify file types for each attachment -> T: textfile, I: image

# ----------- data preparation ----------- #

# reading annotation and make fasta format sequence file
!python peakseq.py -c ./hmchipdata/Human_hg18_peakcod/ENCODE_Stanford_HepG2_CEBPB_peak.cod > peakseq.out

# store BFS cache data for experiment
!python app.py FLD -m3 -M8 -l 5-6 -d2 > FLD56.out

# downloading jaspar motif reference
!wget -q -P ./pfms http://jaspar.genereg.net/api/v1/matrix/MA0466.1.pfm
 
# ----------- AKAGI configuration ----------- #
!python app.py NOP -C PARENT_WORK=False -C MAX_SEQUENCE_COUNT=100
 
# ----------- AKAGI execution ----------- #
!timeout 8h mprof run python app.py MFC -s hmchipdata/Human_hg18_peakcod/ENCODE_Stanford_HepG2_CEBPB_peak \
  --megalexa 1000               \
  --find-max-q                  \
  --multi-layer                 \
  --frame 3-5                   \
  --distance 0-2                \
  --gkhood 0-1                  \
  --gap 4                       \
  --overlap 3                   \
  --multicore                   \
  --ncores 2                    \
  --jaspar pfms/MA0466.1.pfm    \
  > MFC.out
!mprof plot -o memory.png

# ----------- clear & report ----------- #
!python FoundMap.py > clear.out
!python report_email.py -i peakseq.out-FLD56.out-MFC.out-clear.out -a chaining_report.window-memory.png -t T-I

Using last profile data.


## SREBF1 Experiment

### Description

*   input sequences: ChIP-seq peaks
*   sample: [HepG2](https://www.atcc.org/products/all/HB-8065.aspx)
*   tissue: liver
*   Antibody Target: [SREBF1](https://www.uniprot.org/uniprot/P36956) Sterol regulatory element-binding protein 1
  *   Precursor of the transcription factor form (Processed sterol regulatory element-binding protein 1)
*   Jaspar index(es): MA0829 with 2 versions and MA0595.1
  *   1: [HT-SELEX](http://jaspar.genereg.net/matrix/MA0829.1/)
  *   2: [ChIP-seq](http://jaspar.genereg.net/matrix/MA0829.2/)
  *   [MA0595.1](http://jaspar2018.genereg.net/matrix/MA0595.1/) *ChIP-seq*
*   chipXpress: (score=7.2, rank=1)

### AKAGI results

*   not calculated yet





In [None]:
# # # # # # # # # # # # # # # # # # # # # # # # # #
 
#              SREBF1  EXPERIMENT                 #
 
# # # # # # # # # # # # # # # # # # # # # # # # # #
 
# [WARNING] -> `secret.json` is required for auto-email
# [WARNING] -> specify file types for each attachment -> T: textfile, I: image

# ----------- data preparation ----------- #

# reading annotation and make fasta format sequence file
!python peakseq.py -c ./hmchipdata/Human_hg18_peakcod/ENCODE_Yale_HepG2_SREBF1_peak.cod > peakseq.out

# store BFS cache data for experiment
!python app.py FLD -m3 -M8 -l 5-6 -d2 > FLD56.out

# downloading jaspar motif reference
!wget -q -P ./pfms http://jaspar.genereg.net/api/v1/matrix/MA0595.1.pfm
 
# ----------- AKAGI configuration ----------- #
!python app.py NOP -C PARENT_WORK=True -C MAX_SEQUENCE_COUNT=100
 
# ----------- AKAGI execution ----------- #
!timeout 8h mprof run python app.py MFC                          \
  -s hmchipdata/Human_hg18_peakcod/ENCODE_Yale_HepG2_SREBF1_peak \
  --megalexa 1000               \
  --find-max-q                  \
  --multi-layer                 \
  --frame 3-6                   \
  --distance 0-2                \
  --gkhood 0-1                  \
  --gap 4                       \
  --overlap 3                   \
  --multicore                   \
  --ncores 2                    \
  --jaspar pfms/MA0595.1.pfm    \
  > MFC.out
!mprof plot -o memory.png
 
# ----------- clear & report ----------- #
!python FoundMap.py > clear.out
!python report_email.py -i peakseq.out-FLD56.out-MFC.out-clear.out -a chaining_report.window-memory.png-parent.report -t T-I-T

# Check Point

AKAGI uses the `checkpoints.py` module to save and load unfinished jobs, or to store precalculated data in case other application instances need them.



## Resumable check-points

An application instance can resume remaining jobs given the `dataset_name`, the on-sequence distribution of first-generation motifs and the `q` value. The overlap and gap options can differ between application instances. The `RCH` command makes AKAGI search for resumable checkpoints, offline or in the cloud, and resume them.

When the parent is a working parent rather than a memory-balancing one, the parent process saves a portion of the working queue and sends it as resumable checkpoints to **call for help** from other available AKAGI instances. Each instance eventually reports its best ranking by email, and it is up to the user to choose between them.

In [None]:
# # # # # # # # # # # # # # # # # # # # # # # # # #
 
#              HELP and RESUME                    #
 
# # # # # # # # # # # # # # # # # # # # # # # # # #

# ----------- AKAGI configuration ----------- #
# !python app.py NOP -C PARENT_WORK=True        # parent job
# !python app.py NOP -C MAX_SEQUENCE_COUNT=100  # briefing sequences
# !python app.py NOP -C TIMER_CHAINING_HOURS=4  # set up 4 hour timer

!python app.py RCH -n2 -O3 -g4

# ----------- clear & report ----------- #
!python FoundMap.py > clear.out
!python report_email.py -i clear.out -a chaining_report.window -t T


[CLOUD] download is done in 0:00:07.631784 time
[peakseq] making fasta file from cod-annotation peaks (cod=hmchipdata/Human_hg18_peakcod/ENCODE_HAIB_GM12878_SRF_peak.cod)
[ERROR]	b'\x00\x00\x00d'	<class 'bytes'>
[ERROR] something went wrong when sending help


## Observation check-point

Exploring such hidden patterns may require many executions with different sets of parameters, either for better chaining or just for a better understanding of the study. Storing the observation-phase data in cloud services lets AKAGI reuse the first generation of motifs (also called *zero motifs*) found by an earlier execution. Raising the `-k` or `--check-point` flag runs AKAGI in check-point mode.

**making / loading**: whenever an execution of AKAGI reaches the observation point with the check-point flag on, it searches the disk for an existing check-point. If no check-point exists for the application goal, AKAGI computes the observation phase, then protects and stores the results as a check-point for further use.

**cloud**: check-points can be stored on Google Drive with the `UOC` command and downloaded later with the `DOC` command.
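
The making / loading behaviour follows a common load-or-compute pattern, sketched below under stated assumptions. The file name layout mirrors the check-point name that appears in this notebook's output (`ENCODE_HAIB_GM12878_SRF_peak_f5-6_d1-2.checkpoint`); storing the data with `pickle` and the names `checkpoint_name` and `load_or_compute` are assumptions for illustration, not the behaviour of `checkpoints.py`.

```python
import os
import pickle


def checkpoint_name(dataset, frames, distances):
    """Build a name such as <dataset>_f5-6_d1-2.checkpoint."""
    frame_part = "-".join(str(frame) for frame in frames)
    distance_part = "-".join(str(distance) for distance in distances)
    return f"{os.path.basename(dataset)}_f{frame_part}_d{distance_part}.checkpoint"


def load_or_compute(path, compute):
    """Load the check-point at `path` if it exists, otherwise compute and store it."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result
```

With this pattern a second run with the same dataset, frames and distances skips the observation phase entirely, because the stored result is found on disk.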

In [None]:
from datetime import datetime

!python app.py NOP -C SAVE_OBSERVATION_CLOUD=True # uploading observation result

last = datetime.now()
!mprof run python app.py MFC -s hmchipdata/Human_hg18_peakcod/ENCODE_HAIB_GM12878_SRF_peak \
  --megalexa 1000               \
  --find-max-q                  \
  --multi-layer                 \
  --frame 5-6                   \
  --distance 1-2                \
  --gkhood 1-1                  \
  --disable-chaining            \
  --jaspar pfms/MA0083.2.pfm
print(datetime.now() - last)

In [None]:
# <  SREBF1 EXPERIMENT  > 

# reading annotation and make fasta format sequence file
!python peakseq.py -c ./hmchipdata/Human_hg18_peakcod/ENCODE_Yale_HepG2_SREBF1_peak.cod > peakseq.out

# store BFS cache data for experiment
!python app.py FLD -m3 -M8 -l 5-6 -d2 > FLD56.out

!wget -q -P ./pfms http://jaspar.genereg.net/api/v1/matrix/MA0595.1.pfm

!python app.py MFC --disable-chaining \
  -s hmchipdata/Human_hg18_peakcod/ENCODE_Yale_HepG2_SREBF1_peak  \
  -f 3-5                                                          \
  -d 0-2                                                          \
  -u                                                              \
  --gkhood 0-1                                                    \
  --jaspar pfms/MA0595.1.pfm                                      \
  > MFC.out

!python app.py UCP                                                \
  -s hmchipdata/Human_hg18_peakcod/ENCODE_Yale_HepG2_SREBF1_peak  \
  -f 3-5 \
  -d 0-2 \
  -u     

# ----------- clear & report ----------- #
!python FoundMap.py > clear.out
!python report_email.py -i peakseq.out-FLD56.out-MFC.out-clear.out

|####################| 6233/6233
done uploading
[MAIL][ERROR] file doesn't exist (FLD.out)


# Memory Profiling 

Experiment goal: monitor memory usage across the different foundmap operational modes (disk vs. memory).

In [None]:
# # # # # # # # # # # # # # # # # # # # # # # # # #

#               EXPERIMENT .1                     #

# # # # # # # # # # # # # # # # # # # # # # # # # #

# [WARNING] -> `secret.json` is required for auto-email

# data description
'''
  #### biology
  input sequences: peaks
  sample: HUVEC (umbilical vein endothelial cells)
  tissue: blood vessel
  lab: Broad
  Antibody Target: H3K4me1
    description: Histone H3 (mono methyl K4). Is associated with enhancers, and downstream of transcription starts
  -> H3K4me1 in HUVEC umbilical vein endothelial cells <-
  #### application
  file size = 229 KB
  frame sizes = (7-distance=1)
  minimum lexicon limit = 1000 words
'''

# first, running BFS cache generator - profiling memory for estimating the size of gkmerhood
!mprof run python app.py FLD -m4 -M7 -l 5-6 -d1 > FLD56.out
!mprof plot -o FLD_mprof_56.png

# only memory configuration
!python app.py NOP -C FOUNDMAP_MODE=FOUNDMAP_MEMO

!mprof run python app.py MFC -s peaks/ENCODE_Broad_HUVEC_H3K9me1_peak -x1000 -Q -t /DISKSINGLE -f 5-6 -d 1-1 -G 1-1 -u --disable-chaining > HUVEC_H3_run.out
!mprof plot -o HUVEC_H3_run_foundmap_memory.png

# disk utilization
!python app.py NOP -C FOUNDMAP_MODE=FOUNDMAP_DISK -C BATCH_SIZE=10

!mprof run python app.py MFC -s peaks/ENCODE_Broad_HUVEC_H3K9me1_peak -x1000 -Q -t /DISKSINGLE -f 5-6 -d 1-1 -G 1-1 -u --disable-chaining > HUVEC_H3_run_2.out
!mprof plot -o HUVEC_H3_run_foundmap_disk10.png

# Email reports include memory usage of in-memory and hybrid disk-memory version of AKAGI and clean
!mprof clean
!python FoundMap.py > other.out # clean foundmap temp
!python report_email.py -i FLD56.out-HUVEC_H3_run.out-HUVEC_H3_run_2.out-other.out -a FLD_mprof_56.png-HUVEC_H3_run_foundmap_memory.png-HUVEC_H3_run_foundmap_disk10.png


In [None]:
# # # # # # # # # # # # # # # # # # # # # # # # # #

#               EXPERIMENT .2                     #

# # # # # # # # # # # # # # # # # # # # # # # # # #

# [WARNING] -> `secret.json` is required for auto-email

# data description
'''
  #### biology
  input sequences: peaks
  sample: HUVEC (umbilical vein endothelial cells)
  tissue: blood vessel
  lab: Broad
  Antibody Target: H3K4me1
    description: Histone H3 (mono methyl K4). Is associated with enhancers, and downstream of transcription starts
  -> H3K4me1 in HUVEC umbilical vein endothelial cells <-
  #### application
  file size = 229 KB
  frame sizes = (7-distance=1)
  minimum lexicon limit = 1000 words
'''

# first, running BFS cache generator
!python app.py FLD -m4 -M7 -l 5-6 -d1 > FLD56.out

# configuration (BATCH=1)
!python app.py NOP -C FOUNDMAP_MODE=FOUNDMAP_DISK -C BATCH_SIZE=1

!mprof run python app.py MFC -s peaks/ENCODE_Broad_HUVEC_H3K9me1_peak -Q -f 5-6 -d 1-1 -G 1-1 -u --disable-chaining > HUVEC_H3_run_1.out
!mprof plot -o HUVEC_H3_run_batch1.png

# configuration (BATCH=10)
!python app.py NOP -C FOUNDMAP_MODE=FOUNDMAP_DISK -C BATCH_SIZE=10

!mprof run python app.py MFC -s peaks/ENCODE_Broad_HUVEC_H3K9me1_peak -Q -f 5-6 -d 1-1 -G 1-1 -u --disable-chaining > HUVEC_H3_run_2.out
!mprof plot -o HUVEC_H3_run_batch10.png

# Email reports include same task with different batch size
!mprof clean
!python FoundMap.py > other.out # clean foundmap temp
!python report_email.py -i FLD56.out-HUVEC_H3_run_1.out-HUVEC_H3_run_2.out-other.out -a HUVEC_H3_run_batch1.png-HUVEC_H3_run_batch10.png


In [None]:
# profiling cache generator
!mprof run python app.py FLD -m5 -M10 -l 7-8 -d2 > output.out
!mprof plot -o FLD_mprof_78.png

!python report_email.py -i output.out -a FLD_mprof_78.png

In [None]:
# cleaning memory profile data
!mprof clean

# *Big data?* - Disk Utilization

## Observation phase

We observed that some entries of our dataset contain many sequences, which require a lot of memory to hold all observations. The `FoundMap.py` module is responsible for hybrid memory-disk utilization to handle big data. Using the `foundmap` class, AKAGI keeps observation data (sequence id, position and margin) in memory until it reaches the `BATCH_SIZE` limit. To integrate memory and disk, the `FileMap.FileHandler` class converts information to a byte stream and back, enabling the application to read, update and save data between disk and memory.
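
The batching idea can be sketched as follows. Records of (sequence id, position, margin) are buffered in memory and appended to a file as fixed-size binary records once the buffer reaches the batch size. The class name `BatchedStore`, the three 32-bit integer layout and the append-only file are assumptions for illustration; they are not the actual `foundmap` or `FileMap.FileHandler` implementation.

```python
import os
import struct

# one record = three little-endian signed 32-bit integers (assumed layout)
RECORD = struct.Struct("<iii")


class BatchedStore:
    def __init__(self, path, batch_size):
        self.path = path
        self.batch_size = batch_size
        self.buffer = []

    def add(self, seq_id, position, margin):
        """Buffer one observation and flush when the batch is full."""
        self.buffer.append((seq_id, position, margin))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Append buffered records to disk as bytes and empty the buffer."""
        with open(self.path, "ab") as handle:
            for record in self.buffer:
                handle.write(RECORD.pack(*record))
        self.buffer.clear()

    def read_all(self):
        """Return records on disk followed by those still in memory."""
        records = []
        if os.path.exists(self.path):
            with open(self.path, "rb") as handle:
                data = handle.read()
            records = [RECORD.unpack_from(data, offset)
                       for offset in range(0, len(data), RECORD.size)]
        return records + list(self.buffer)
```

A smaller batch size lowers peak memory at the cost of more frequent disk writes, which is the trade-off the memory profiling cells in this notebook compare.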

## Chaining phase

Even when `foundmap` is stored on disk, keeping many `ChainNode` objects in memory may result in overflow. Because new generations that need chaining are produced much faster than each chain is processed, those objects need to be stored on disk. `DiskQueue` is implemented for this purpose, and the parent process manages this queue to balance memory. On the other hand, with more cores available for multiprocessing, there is no need to balance memory when the cores can process chains at nearly the rate new generations are produced, so the parent process can join the other workers as well.
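
A first-in first-out queue that spills to disk can be sketched as below. Items beyond a memory limit are pickled to chunk files and read back in order when the in-memory part runs empty. `SpillQueue`, its chunk file naming and the use of `pickle` are assumptions for illustration and do not describe how AKAGI's `DiskQueue` is written.

```python
import os
import pickle
from collections import deque


class SpillQueue:
    def __init__(self, directory, memory_limit):
        self.directory = directory
        self.memory_limit = memory_limit
        self.head = deque()    # oldest items, served first
        self.chunks = deque()  # paths of spilled chunks, oldest first
        self.tail = []         # newest items, not yet spilled
        self.counter = 0

    def put(self, item):
        # items may join the head only while nothing newer is waiting behind it
        if not self.chunks and not self.tail and len(self.head) < self.memory_limit:
            self.head.append(item)
            return
        self.tail.append(item)
        if len(self.tail) >= self.memory_limit:
            self._spill()

    def _spill(self):
        """Write the tail to a new chunk file and empty it."""
        path = os.path.join(self.directory, f"chunk{self.counter}.pkl")
        self.counter += 1
        with open(path, "wb") as handle:
            pickle.dump(self.tail, handle)
        self.chunks.append(path)
        self.tail = []

    def get(self):
        """Return the oldest item; raises IndexError when the queue is empty."""
        if not self.head:
            if self.chunks:
                path = self.chunks.popleft()
                with open(path, "rb") as handle:
                    self.head.extend(pickle.load(handle))
                os.remove(path)
            elif self.tail:
                self.head.extend(self.tail)
                self.tail = []
        return self.head.popleft()
```

In this sketch at most `memory_limit` items sit in the head and fewer than `memory_limit` in the tail at any time, so memory use stays bounded however fast new items are produced.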

In [None]:
# AKAGI disk utilization config
!python app.py NOP -C FOUNDMAP_MODE=FOUNDMAP_DISK -C BATCH_SIZE=100 -C PARENT_WORK=False

In [None]:
# for clearing disk
!python FoundMap.py

# Useful cell commands for development

In [None]:
# downloading PFM from jaspar using link
!wget -q -P ./pfms http://jaspar.genereg.net/api/v1/matrix/MA0491.1.pfm

In [None]:
import os
print(os.cpu_count())

2


In [None]:
from urllib.request import urlopen
exec(urlopen("http://colab-monitor.smankusors.com/track.py").read())
_colabMonitor = ColabMonitor().start()

Now live at : http://colab-monitor.smankusors.com/6068c01af3294


In [None]:
!pip install wandb
import wandb
wandb.init()

In [None]:
!timeout 5s mprof run python t.py
!mprof plot -o a.png

In [None]:
from datetime import datetime
print(datetime.now())

2021-04-21 20:08:40.707233


In [None]:
%cd AKAGI

[Errno 2] No such file or directory: 'AKAGI'
/content


In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

gauth = GoogleAuth()
gauth.CommandLineAuth()

drive = GoogleDrive(gauth)

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?client_id=285671014364-v2c9m48chkqi23e33phpoodrjfugjvop.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&access_type=offline&response_type=code

Enter verification code: 4/1AY0e-g4pBJT-qUXKpefopOIU_ppnqX0KQ1Y6-6jqwxs_TYRcdeOGWwyzQ9g
Authentication successful.


In [None]:
!ps -aus

In [None]:
!python googledrive.py

[CLOUD] number of protected files to be compressed: 9410
[CLOUD] compressing directory - total bytes count: 110914156 (105.77598190307617 MB)
Traceback (most recent call last):
  File "googledrive.py", line 148, in <module>
    store_checkpoint_to_cloud('ENCODE_HAIB_GM12878_SRF_peak_f5-6_d1-2.checkpoint', 'appdata/ENCODE_HAIB_GM12878_SRF_peak_f5-6_d1-2/')
  File "googledrive.py", line 107, in store_checkpoint_to_cloud
    compressed_drive.Upload()
  File "/usr/local/lib/python3.7/dist-packages/pydrive/files.py", line 285, in Upload
    self._FilesInsert(param=param)
  File "/usr/local/lib/python3.7/dist-packages/pydrive/auth.py", line 75, in _decorated
    return decoratee(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pydrive/files.py", line 369, in _FilesInsert
    http=self.http)
  File "/usr/local/lib/python3.7/dist-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-