# Accessing and reviewing the bendit results

Running the demo pipeline can generate a lot of data, and steps were taken to make handling it as easy as possible. However, accessing the collected data may not be obvious to the uninitiated. This notebook covers what the 'realistic' pipeline [here](index.ipynb) produces and how to access it. It also touches on how to use Jupyter/Python to conveniently view and further process the data.

The hope is that this covers what is necessary to use the generated output. If you just plugged your sequences into the demo pipeline, you should be able to follow along to get the bendIt results for all your data. If you are looking to adapt the pipeline, this notebook will give you a flavor of the types of output you may wish to gather, or leave out, in your adapted bendIt pipeline.

## Accessing the bendIt results: Starting up an active session

[Optional]

Blah blah

## Accessing the bendIt results: Uncompress the Archive

This notebook assumes you just ran the pipeline and are trying to access the results. In that case, there are a number of files in the current directory besides the archive. To make things easier, I suggest making a new directory with a simple name, dragging the archive into that folder, and working there. To make things match with the demo, I am going to write out those steps as commands, too. Feel free to run the cells or to do it by hand. If you make a different 'unpack' directory, you'll need to adjust things below accordingly.

The exclamation mark in front of each shell command tells Jupyter to run it as a shell command and not as Python.

In [11]:
!mkdir unpack
!mv bendit_analysis*.tar.gz unpack/
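
If you would rather stay in Python for this step, the same two moves can be sketched with the standard library. This is only an illustration of the idea: the function name `stash_archives` and its parameters are made up here and are not part of the pipeline. One small convenience is that rerunning it does not complain if the directory already exists.

```python
import shutil
from pathlib import Path


def stash_archives(src=".", dest="unpack", pattern="bendit_analysis*.tar.gz"):
    """Move every archive matching `pattern` from `src` into `dest`.

    `stash_archives` is a hypothetical helper name used for illustration.
    Returns the list of new locations of the moved archives.
    """
    dest_dir = Path(dest)
    # Unlike a bare `mkdir`, this is fine with the directory already existing.
    dest_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for archive in sorted(Path(src).glob(pattern)):
        new_home = dest_dir / archive.name
        shutil.move(str(archive), str(new_home))
        moved.append(new_home)
    return moved
```

Calling `stash_archives()` with no arguments mirrors the two shell commands above; pass a different `dest` if you chose another directory name.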

To make things in this notebook work, you'll need to switch the current working directory over to where we are going to unpack the archive. We'll use the Jupyter magic command `%cd` to change the working directory for all subsequent cells in the notebook. If we just used `!cd`, the change would not carry over to later cells, because each `!` command runs in its own temporary shell.

In [12]:
%cd unpack

/home/jovyan/unpack


Note you can check the current working directory at any time with `pwd`. (Most shell commands need an exclamation point, but a few, `pwd` among them, were added to Jupyter so they work without it.) Also of note: this location is completely independent of where the file navigation pane on the left side of this window may be showing.

In [15]:
pwd

'/home/jovyan/unpack'
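
To see why a shell-level `cd` does not stick while `%cd` does, here is a small plain-Python sketch. It uses a throwaway temporary directory as a stand-in for 'unpack', and it treats `os.chdir` as a rough equivalent of what `%cd` does for the notebook process; both of those are choices made for this illustration.

```python
import os
import subprocess
import tempfile

start = os.getcwd()
target = tempfile.mkdtemp()  # hypothetical stand-in for the 'unpack' directory

# Like `!cd`: the change happens in a child shell that exits right away.
subprocess.run(f'cd "{target}"', shell=True, check=True)
stayed_put = os.getcwd() == start

# Roughly what `%cd` does: change the directory of the notebook's own process.
os.chdir(target)
moved = os.path.samefile(os.getcwd(), target)

# Put things back the way they were and remove the throwaway directory.
os.chdir(start)
os.rmdir(target)

print("after subshell cd, still in start directory:", stayed_put)
print("after os.chdir, now in target directory:", moved)
```

Both lines should report `True`: the child shell's directory change is lost as soon as that shell exits, while the in-process change persists until it is changed again.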

To **unpack the archive**, we'll run the command in the following cell.  
You need to edit the command so it will extract your file. In other words, the part after the `xzf` has to match the actual file name of the archive you are working with.

In [14]:
!tar xzf bendit_analysisFeb1420202102.tar.gz

tar (child): bendit_analysisFeb142202102.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now


If you ran the above cell and saw anything like `tar (child): bendit_analysisZZZZZZ.tar.gz: Cannot open: No such file or directory`, it simply means you had the file name wrong. You'll need to edit it and run the cell again.
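
One way to sidestep typing the file name at all is to let Python find the archive and unpack it with the standard `tarfile` module. The sketch below is an alternative to the `tar` command, not what the pipeline itself does; the helper name `unpack_latest` is invented here, and "latest" is judged by file modification time, which is an assumption about what you want when several archives are present.

```python
import tarfile
from pathlib import Path


def unpack_latest(directory=".", pattern="bendit_analysis*.tar.gz"):
    """Extract the most recently modified archive matching `pattern`.

    `unpack_latest` is a hypothetical helper name used for illustration.
    Returns the path of the archive that was extracted.
    """
    directory = Path(directory)
    archives = sorted(directory.glob(pattern), key=lambda p: p.stat().st_mtime)
    if not archives:
        raise FileNotFoundError(f"no file matching {pattern!r} in {directory}")
    newest = archives[-1]
    # "r:gz" reads a gzip-compressed tar file, like the `z` in `tar xzf`.
    with tarfile.open(newest, "r:gz") as tar:
        tar.extractall(path=directory)
    return newest
```

Running `unpack_latest()` from inside the 'unpack' directory extracts into that same directory. As with any archive, only extract files that come from a source you trust.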

If things worked, you should just see the asterisk to the left of the cell turn to a number. If you changed the file navigation pane on the left side of this browser window over to the 'unpack' directory, after a moment you should see more files show up in the file pane. Don't worry if you didn't switch. We are going to explore the contents using commands next anyway.

## Overview of the Unpacked Items

If all is correct and you previously used the `%cd` command to switch to the 'unpack' directory, running the next cell will show the contents of the current working directory so we can begin to explore the contents of the now-unpacked archive.

In [16]:
ls -lh

total 1.6M
-rw-r--r-- 1 jovyan root  11K Feb 14 21:02 A_output.png
-rw-r--r-- 1 jovyan root 361K Feb 14 20:57 bendit_analysisFeb1420202057.tar.gz
-rw-r--r-- 1 jovyan root 361K Feb 14 21:02 bendit_analysisFeb1420202102.tar.gz
-rw-r--r-- 1 jovyan root  10K Feb 14 21:02 B_output.png
-rw-r--r-- 1 jovyan root 3.1K Feb 14 21:02 demo_A.pkl
-rw-r--r-- 1 jovyan root  38K Feb 14 21:02 demo_A.png
-rw-r--r-- 1 jovyan root  43K Feb 14 21:02 demo_A.svg
-rw-r--r-- 1 jovyan root 1.6K Feb 14 21:02 demo_A.tsv
-rw-r--r-- 1 jovyan root 3.1K Feb 14 21:02 demo_B.pkl
-rw-r--r-- 1 jovyan root  38K Feb 14 21:02 demo_B.png
-rw-r--r-- 1 jovyan root  44K Feb 14 21:02 demo_B.svg
-rw-r--r-- 1 jovyan root 1.6K Feb 14 21:02 demo_B.tsv
-rw-r--r-- 1 jovyan root   65 Feb 14 20:52 demo_sample_set.fa
-rw-r--r-- 1 jovyan root 4.1K Feb 14 21:02 LOG_baFeb1420202102.txt
-rw-r--r-- 1 jovyan root 100K Feb 14 21:02 plots4review_from-baFeb1420202102.ipynb
-rw-r--r-- 1 jovyan root 512K Feb 14 21:02 seqs_dfs_and_plots_for_each_set.

That lists out the contents of the directory. The options added to the `ls` command make the output more readable. In particular, the `l` in `-lh` says to list the details in long form and not just the names, and the `h` means the file sizes are shown in human-readable units.

You'll see there is much more than the compressed archive that was originally added when we made the directory just a few cells back. These are the results of the bendIt run. The following section will go through what these are and how to use them.

## About the Contents

If you used your own data, your results will be different but the types will be the same. I am going to describe the contents as if you ran the demo sequences; however, mostly you just need to pay attention to the file extensions, or sometimes the start of the name, to tell which file types correspond.

After a brief overview, I am going to add some details about most of the types.

Typical contents overview:

- Log file
- Image files for the plots
- Notebook for reviewing all the plots en masse
- Pickled bendIt data
- bendIt data as tabular text
- Processing information and results on a per sample basis
- Nucleotide composition detail (TBD)
- input files
- gnuplot files
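
Since the file extension, or sometimes the start of the name, is what tells the types apart, a short sketch can sort a directory listing into these groups. The rules below are inferred from the demo listing shown earlier and are assumptions, not something the pipeline defines; in particular, treating `.fa` as input and every `.png` as a plot image is a guess, and gnuplot files are not singled out. The names `classify` and `group_names` are made up for this example.

```python
from collections import defaultdict
from pathlib import Path

# Assumed mapping from extension to content type, based on the demo listing.
BY_SUFFIX = {
    ".png": "plot image",
    ".svg": "plot image",
    ".pkl": "pickled dataframe",
    ".tsv": "tabular text",
    ".fa": "input sequences",
}


def classify(name):
    """Guess the content type of one file name (hypothetical helper)."""
    # Name prefixes take priority over extensions.
    if name.startswith("LOG_"):
        return "log"
    if name.startswith("plots4review") and name.endswith(".ipynb"):
        return "review notebook"
    if name.startswith("seqs_dfs_and_plots_for_each_set"):
        return "per-sample results"
    if name.endswith(".tar.gz"):
        return "archive"
    return BY_SUFFIX.get(Path(name).suffix, "other")


def group_names(names):
    """Group file names by their guessed content type."""
    groups = defaultdict(list)
    for name in sorted(names):
        groups[classify(name)].append(name)
    return dict(groups)
```

For the current directory, something like `group_names(p.name for p in Path(".").iterdir())` gives a dictionary you can print or loop over.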


### Log file

The log file name will look something like `LOG_ba<MONTH>ZZZZZZZZ.txt`, with `<MONTH>` being the abbreviation for the month and the `ZZZZZZZZ` portion being derived from a date and time stamp.

This contains much of what was shown as the demo pipeline ran, with some additional details from the actual bendIt analyses.

At the end is a summary.
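
Because the summary sits at the end, a quick way to glance at it is to print only the last few lines of the log. The following is a minimal sketch; the helper name `log_tail` and the default of 15 lines are arbitrary choices for illustration, since the actual length of the summary is not stated here.

```python
from collections import deque
from pathlib import Path


def log_tail(directory=".", pattern="LOG_ba*.txt", n_lines=15):
    """Return (path, last n_lines lines) for the newest matching log file.

    `log_tail` is a hypothetical helper; adjust `n_lines` to taste.
    """
    logs = sorted(Path(directory).glob(pattern), key=lambda p: p.stat().st_mtime)
    if not logs:
        raise FileNotFoundError(f"no file matching {pattern!r} in {directory}")
    newest = logs[-1]
    with open(newest) as handle:
        # A deque with maxlen keeps only the final n_lines while reading.
        tail = [line.rstrip("\n") for line in deque(handle, maxlen=n_lines)]
    return newest, tail
```

In a notebook cell, `path, lines = log_tail()` followed by `print("\n".join(lines))` shows the end of the most recent log.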

### Image files for the plots

The plots have been saved as two forms of images. The two extensions distinguish them:

- `.png`  
This is a raster/bitmap file format made of pixels that you are probably familiar with. It is great for easy viewing because a lot of software, including JupyterLab, can handle it. However, please see the note below suggesting `.svg` for scaling up, or the Python section that discusses making individual plots larger and making new image files in `.png` format from that.

- `.svg`  
This indicates the SVG (Scalable Vector Graphics) file format. SVG is really the best choice for scaling up or adapting further, as it offers the most control and no loss of resolution. I suggest using Adobe Illustrator or Inkscape for scaling and customizing. This is what you'll want to use if you don't want to remake the plot and are looking to customize it for publication. Any modern browser can view `.svg` files, and fortunately an SVG viewer is built right into JupyterLab.
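
Either image type can also be shown inline from a code cell using IPython's display tools. The sketch below assumes it is run inside a Jupyter session (where IPython is available) and that the images follow the `<name>.png` / `<name>.svg` naming seen in the demo listing; `image_pair` and `show_plot` are names invented for this example.

```python
from pathlib import Path


def image_pair(stem, directory="."):
    """Return the expected (png_path, svg_path) for one plot name."""
    directory = Path(directory)
    return directory / f"{stem}.png", directory / f"{stem}.svg"


def show_plot(stem, directory=".", width=600):
    """Display a saved plot inline, preferring the SVG version.

    Hypothetical helper; IPython is imported here so that the rest of
    the cell still works outside of a Jupyter session.
    """
    from IPython.display import SVG, Image, display

    png, svg = image_pair(stem, directory)
    if svg.exists():
        display(SVG(filename=str(svg)))
    elif png.exists():
        # `width` only changes the displayed size, not the file itself.
        display(Image(filename=str(png), width=width))
    else:
        raise FileNotFoundError(f"no .svg or .png found for {stem!r}")
```

For the demo data, `show_plot("demo_A")` would render `demo_A.svg` in the cell output.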

### Notebook for reviewing all the plots en masse

The name of this file will resemble `plots4review_from-ba<MONTH>ZZZZZZZZ.ipynb`, with the date and time stamp matching the archive file name and log file name.

The image files discussed above are nice but not that easy to view unless you bring them local and use your file browser. Alternatively, you can browse them right in the session by opening this notebook and then opening subsequent views of it. (**ADD MORE DETAILS ON HOW TO OPEN MULTIPLE VIEW AND ADD IMAGE EXAMPLE**)

At the top of this notebook, I emphasized you'll want to be in an active notebook session for working with the output. One of the main reasons is that it offers a standard environment where the archive can be unpacked easily, and Jupyter offers nice viewers for many of the data types. This is the case for the 'Review' notebook. The plots are already part of the notebook, so nothing has to be run again, but a Jupyter environment is useful for viewing it. Alternatively, nbviewer can be used to view a 'static' form of the notebook if you don't mind placing the notebook file somewhere [the online nbviewer](https://nbviewer.jupyter.org/) can be pointed at it. Note the static form will look much like it does in Jupyter, but you cannot run any cells or modify the code or text content.
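
If you have unpacked more than one run into the same directory, the shared stamp is what tells you which archive, log, and review notebook belong together. The sketch below pulls the stamp out of a file name with a regular expression. The pattern is inferred from the demo file names (three letters for the month followed by digits) and should be treated as an assumption; `run_stamp` and `files_for_run` are illustrative names.

```python
import re

# Assumed naming: a known prefix, a three-letter month, then digits.
STAMP = re.compile(
    r"^(?:bendit_analysis|LOG_ba|plots4review_from-ba)([A-Za-z]{3}\d+)\."
)


def run_stamp(name):
    """Return the stamp embedded in a file name, or None if there is none."""
    match = STAMP.match(name)
    return match.group(1) if match else None


def files_for_run(names, stamp):
    """Pick out the file names that carry the given stamp."""
    return [name for name in names if run_stamp(name) == stamp]


names = [
    "bendit_analysisFeb1420202057.tar.gz",
    "bendit_analysisFeb1420202102.tar.gz",
    "LOG_baFeb1420202102.txt",
    "plots4review_from-baFeb1420202102.ipynb",
    "demo_A.tsv",
]
print(files_for_run(names, "Feb1420202102"))
```

With the demo names above, only the second archive, the log, and the review notebook are selected; the earlier archive and the data file are left out.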



### Pickled bendIt data

These files will resemble `demo_A.pkl`.

The data plotted from the bendIt analysis is stored in a serialized (pickled) form that can easily be read back in as a Pandas dataframe for convenient use in the Jupyter environment or further analysis. Accessing these Pandas dataframes in the Jupyter environment will be discussed below.
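
As a quick preview, reading one of these files back takes a single call to `pandas.read_pickle`. The sketch below wraps that call to load every per-sample pickle in a directory; the helper name `load_pickled_frames` is made up, and skipping `seqs_dfs_and_plots_for_each_set.pkl` reflects the fact that that file holds something other than a single dataframe.

```python
from pathlib import Path

import pandas as pd


def load_pickled_frames(
    directory=".", skip=("seqs_dfs_and_plots_for_each_set.pkl",)
):
    """Read each `.pkl` file in `directory` into a dict keyed by file stem.

    Hypothetical helper for illustration. Only unpickle files you trust,
    since loading a pickle can run arbitrary code.
    """
    frames = {}
    for path in sorted(Path(directory).glob("*.pkl")):
        if path.name in skip:
            continue
        frames[path.stem] = pd.read_pickle(path)
    return frames
```

With the demo output, `frames = load_pickled_frames()` would give keys such as `demo_A` and `demo_B`, and `frames["demo_A"].head()` shows the first rows.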

### bendIt data as tabular text

These files will resemble `demo_A.tsv`.

The data plotted from the bendIt analysis is stored in a tab-delimited tabular text form that can easily be used anywhere, even in Excel. Jupyter allows easy viewing of these as well. You can click on the 'frame' symbol next to the file name in the file navigation panel on the left, and they'll open as full-featured spreadsheet-like views. If you right-click and select `Open with ...` > `editor`, you can see the text form that underlies it.
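
To illustrate the "usable anywhere" point, the tab-delimited files can be read with nothing but Python's standard `csv` module. This is a minimal sketch with an invented helper name, `read_tsv`; it makes no assumption about the column names and simply uses whatever is in the header row. All values come back as text, so convert them to numbers as needed.

```python
import csv


def read_tsv(path):
    """Read a tab-delimited file into a list of dicts, one per row.

    Hypothetical helper; keys are taken from the file's header row.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return list(reader)
```

If you prefer a dataframe, `pandas.read_csv(path, sep="\t")` reads the same file directly.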


### Processing information and results on a per sample basis

This file will be named `seqs_dfs_and_plots_for_each_set.pkl`.

This holds almost all the input sequences and output, stored on a per sample set and per sample basis using Python dictionaries. It is mainly meant for advanced use. It gets used at the top of the `Notebook for reviewing all the plots en masse` to render all the plots. It will also be used below as an example of how to access the dataframes. Beyond making it easy to render all the plots, it is really just there in case something not collected elsewhere is needed or has to be checked.

In the `seqs_dfs_and_plots_for_each_set.pkl`, there is a value for each sample set. That value is a list of dictionaries. The order of the dictionaries in the list for each sample set is as follows:

- cassette sequences processed keyed on name
- sequences of the cassette sequences merged to the defined flanking sequences processed keyed on name
- dataframes produced by bendIt analysis & used to make the plots keyed by sample
- plots produced from each dataframe keyed by sample
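
The ordering above can be turned into named fields so you do not have to remember which position holds what. The sketch below assumes the top level of the pickle is a dictionary keyed by sample set name and that each list has exactly the four dictionaries listed; neither detail is spelled out above, so check against your own file. The names `SampleSetResults` and `load_results` are invented for this example. Unpickling the real file needs the same libraries that created it to be importable, which is the case in the session the pipeline ran in.

```python
import pickle
from collections import namedtuple

# Field order follows the list above; the field names are illustrative.
SampleSetResults = namedtuple(
    "SampleSetResults",
    ["cassettes", "merged_sequences", "dataframes", "plots"],
)


def load_results(path="seqs_dfs_and_plots_for_each_set.pkl"):
    """Load the per-sample-set data and label each list position.

    Assumes a dict of {sample_set_name: [dict, dict, dict, dict]}.
    Only unpickle files you trust.
    """
    with open(path, "rb") as handle:
        raw = pickle.load(handle)
    return {
        set_name: SampleSetResults(*dicts) for set_name, dicts in raw.items()
    }
```

With that in place, `results = load_results()` followed by `results[set_name].dataframes[sample_name]` reaches one dataframe, where `set_name` and `sample_name` are whatever keys your own run produced.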

### Nucleotide composition detail (To Be Done still)

add dfs made to list under 'Processing information and results on a per sample basis' section

## Using Python to review the data or customize the plots

(Include in the Python section a discussion of making individual plots larger and making new image files in `.png` format from that. Though recommend `svg` as better.)

In addition to the direct pickled dataframes, show how to get to the ones keyed on sample set and samples, as well as a further illustration of the use of `seqs_dfs_and_plots_for_each_set.pkl`.

In [None]:
import time

def executeSomething():
    # Add any code to run on each pass here.
    print('.')
    time.sleep(480)  # 8 minutes x 60 seconds

while True:
    executeSomething()

.
