<img src="https://jupyter.org/assets/main-logo.svg" width="25%" height="25%" />

<br/>

# Interactive Coding With [Jupyter Lab](https://jupyterlab.readthedocs.io/en/stable/)

## What is interactive coding, and why is it useful?

Up to this point you've been executing your code as a program from the command line to achieve your final results. There are scenarios, particularly in data analysis, where you need to frequently adjust parameters to search for meaningful patterns. Rerunning a whole script to change a single parameter can be time-consuming, especially if you have commands that require a large amount of CPU time. Interactive coding is simply running your scripts line-by-line with the interpreter, but it's usually accomplished with the aid of a program that allows you to send text from your editor to the interpreter.

[From Wikipedia:](https://en.wikipedia.org/wiki/List_of_programming_languages_by_type#Interactive_mode_languages)
_Interactive mode languages act as a kind of shell: expressions or statements can be entered one at a time, and the result of their evaluation is seen immediately. The interactive mode is also known as a REPL (read–eval–print loop)._

## Good coding practice alert

You should **always** ensure that your code runs, and generates **reproducible results**, when run from source (i.e. executed from the command line). When you code interactively, it can be easy to assign variables out of order, so that the code works in your current session but fails when run from top to bottom.
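
A minimal sketch of the problem, using made-up variable names (`scale`, `total`). In a notebook you might have run the line that defines `scale` first, even though it sits lower on the page, so everything appears to work. Executed from top to bottom, as a script would be, the same lines fail.

```python
# Two lines written in the order they appear on the page.
# Interactively, the second line may have been run first, hiding the problem.
script_source = """
total = scale * 10   # uses `scale` before it exists
scale = 3
"""

try:
    # exec() with a fresh namespace mimics running the file from the command line
    exec(script_source, {})
    outcome = "ran cleanly"
except NameError as err:
    outcome = f"failed: {err}"

print(outcome)
```

In Jupyter Lab, restarting the kernel and running all cells in order is a quick way to perform the same check on a real notebook.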


In [30]:
import os
os.getcwd()
os.chdir('/Users/ddiaz/Documents/diazdc-pfb2019/')

##### Complex slicing patterns

In [31]:
# np.r_ builds a single index array from several ranges and single positions.
# (pd.np was removed in pandas 2.0, so we use NumPy directly.)
import numpy as np

# Let's see what it returns before we slice our data frame
np.r_[1:2, 4, 5:7]

# With column names
cell_attributes.columns.values[np.r_[1:2, 4, 5:7]]

# With the first 5 rows of the data frame
cell_attributes.iloc[:5, np.r_[1:3, 5:7]]

| index | n_genes | orig_ident | tree_ident | louvain |
|---|---|---|---|---|
| AAACCTGAGCACCGCT-1_1 | 928 | 1 | 7 | 0 |
| AAACCTGCACATTCGA-1_2 | 894 | 1 | 9 | 2 |
| AAACCTGCAGGGTTAG-1_3 | 1025 | 1 | 14 | 8 |
| AAACCTGCATGCCTTC-1_4 | 1438 | 1 | 12 | 1 |
| AAACCTGGTTGAACTC-1_5 | 1277 | 1 | 6 | 10 |
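
To see what `r_` is doing in isolation, here is a small sketch on a hypothetical toy data frame (the column names `a` to `h` are invented for illustration). Slices and single integers are concatenated into one flat array of positions, which `iloc` then uses to pick columns.

```python
import numpy as np
import pandas as pd

# Slices and single positions are joined into one index array
idx = np.r_[1:2, 4, 5:7]
print(idx)

# Hypothetical toy frame with 8 columns named a..h
toy = pd.DataFrame(np.arange(16).reshape(2, 8), columns=list("abcdefgh"))

# Select columns at positions 1, 4, 5 and 6
picked = toy.iloc[:, idx]
print(list(picked.columns))
```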


### Ordering dataframes by column values

Here we'll take a look at ordering our data by a particular column value, or multiple column values.

In [32]:
# Let's make a smaller dataset to work with
cell_df_sub = cell_attributes.iloc[:25,[0,1,3,5]]

# ascending=False sorts from largest to smallest; the default (ascending=True) sorts smallest first
cell_df_sub.sort_values('n_counts', ascending=False)

# Sort by multiple columns in different directions
cell_df_sub.sort_values(by=['tree_ident', 'n_counts'], ascending=[True, False])

| index | n_counts | n_genes | percent_mito | tree_ident |
|---|---|---|---|---|
| AAAGTAGTCACCACCT-1_23 | 6873 | 1876 | 0.018055 | 1 |
| AAAGTAGGTGCAGACA-1_22 | 3671 | 1326 | 0.045244 | 1 |
| AAATGCCCACTGCCAG-1_24 | 1478 | 680 | 0.078751 | 1 |
| AAAGTAGAGGTCATCT-1_20 | 4106 | 1527 | 0.017317 | 2 |
| AAACGGGTCCACGTGG-1_8 | 5586 | 2022 | 0.006445 | 3 |
| AAAGATGCACGAAATA-1_10 | 1879 | 822 | 0.019702 | 3 |
| AAATGCCGTATTCGTG-1_25 | 3129 | 1107 | 0.017589 | 4 |
| AAAGCAACACAGATTC-1_15 | 3206 | 1263 | 0.009054 | 5 |
| AAACCTGGTTGAACTC-1_5 | 2765 | 1277 | 0.010138 | 6 |
| AAACCTGTCACCTTAT-1_6 | 2771 | 1144 | 0.01953 | 7 |
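
The multi-column sort can be easier to follow on a tiny example. The sketch below uses a hypothetical four-row frame (the cell labels `c1` to `c4` and the values are invented): rows are grouped by `tree_ident` in ascending order, and within each group `n_counts` runs from largest to smallest.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame(
    {"tree_ident": [2, 1, 1, 2], "n_counts": [10, 30, 20, 40]},
    index=["c1", "c2", "c3", "c4"],
)

# One ascending flag per column listed in `by`
ordered = toy.sort_values(by=["tree_ident", "n_counts"], ascending=[True, False])
print(ordered)

# sort_values returns a new frame; the original order is untouched
print(list(toy.index))
```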


### Subsetting data by condition

Understanding how to subset your data using conditional operations is *very*, _very_ useful. You'll often encounter situations where you want to filter your data on a certain set of parameters to reduce it to a more "meaningful" state.

In [33]:
# Subsetting on a single condition
cell_df_sub.loc[(cell_df_sub['tree_ident'] == 1),]

| index | n_counts | n_genes | percent_mito | tree_ident |
|---|---|---|---|---|
| AAAGTAGGTGCAGACA-1_22 | 3671 | 1326 | 0.045244 | 1 |
| AAAGTAGTCACCACCT-1_23 | 6873 | 1876 | 0.018055 | 1 |
| AAATGCCCACTGCCAG-1_24 | 1478 | 680 | 0.078751 | 1 |


In the example below we chain boolean operators together to achieve results that satisfy multiple conditions. You can make these statements as complex as you'd like.

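Note: Pandas uses the bitwise logical operators (see earlier lecture). A pipe symbol `|` represents `or`, and an ampersand symbol `&` represents `and`. Because `&` binds more tightly than `|`, and both bind more tightly than comparisons such as `==`, each comparison should be wrapped in its own parentheses. The backslashes in the code simply allow us to break up our statement at arbitrary points for readability.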

In [34]:
# Subsetting on multiple conditions.
cell_df_sub.loc[
    (cell_df_sub['tree_ident'] == 1) | \
    (cell_df_sub['tree_ident'] == 2) & \
    (cell_df_sub['n_genes'] > 1000),]

| index | n_counts | n_genes | percent_mito | tree_ident |
|---|---|---|---|---|
| AAAGTAGAGGTCATCT-1_20 | 4106 | 1527 | 0.017317 | 2 |
| AAAGTAGGTGCAGACA-1_22 | 3671 | 1326 | 0.045244 | 1 |
| AAAGTAGTCACCACCT-1_23 | 6873 | 1876 | 0.018055 | 1 |
| AAATGCCCACTGCCAG-1_24 | 1478 | 680 | 0.078751 | 1 |


What's actually going on here? The rows in the data frame are subsetted on a vector of True/False values. That is, every row for which the condition evaluates to True is returned. If we examine the boolean expression placed within `cell_df_sub.loc[]`, you can see why this is occurring.

In [35]:
(cell_df_sub['tree_ident'] == 1) | \
    (cell_df_sub['tree_ident'] == 2) & \
    (cell_df_sub['n_genes'] > 1000)

index
AAACCTGAGCACCGCT-1_1     False
AAACCTGCACATTCGA-1_2     False
AAACCTGCAGGGTTAG-1_3     False
AAACCTGCATGCCTTC-1_4     False
AAACCTGGTTGAACTC-1_5     False
AAACCTGTCACCTTAT-1_6     False
AAACGGGTCCACGTGG-1_8     False
AAAGATGCAAGGACTG-1_9     False
AAAGATGCACGAAATA-1_10    False
AAAGATGCATCGATTG-1_11    False
AAAGATGGTACACCGC-1_12    False
AAAGATGGTCTTTCAT-1_13    False
AAAGCAAAGATGCCAG-1_14    False
AAAGCAACACAGATTC-1_15    False
AAAGCAACACGGCGTT-1_16    False
AAAGCAATCGTAGATC-1_18    False
AAAGTAGAGAATAGGG-1_19    False
AAAGTAGAGGTCATCT-1_20     True
AAAGTAGGTAAGTGGC-1_21    False
AAAGTAGGTGCAGACA-1_22     True
AAAGTAGTCACCACCT-1_23     True
AAATGCCCACTGCCAG-1_24     True
AAATGCCGTATTCGTG-1_25    False
AACACGTCAAGCGATG-1_26    False
AACACGTGTATCAGTC-1_27    False
dtype: bool
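
The sketch below works through the same idea on a hypothetical five-row frame (the labels `c1` to `c5` and the values are invented). It builds each condition as a named boolean Series, then shows how operator precedence decides which rows survive: without extra parentheses, `&` is evaluated before `|`.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame(
    {"tree_ident": [1, 1, 2, 2, 3], "n_genes": [680, 1326, 1527, 900, 2000]},
    index=["c1", "c2", "c3", "c4", "c5"],
)

# Each comparison yields one True/False value per row
is_1 = toy["tree_ident"] == 1
is_2 = toy["tree_ident"] == 2
many_genes = toy["n_genes"] > 1000

# `&` is evaluated before `|`, so this means is_1 | (is_2 & many_genes)
mask_default = is_1 | is_2 & many_genes

# Parentheses change the grouping to (is_1 | is_2) & many_genes
mask_grouped = (is_1 | is_2) & many_genes

# True counts as 1, so the sum is the number of rows that will be returned
print(mask_default.sum())
print(list(toy.loc[mask_default].index))
print(list(toy.loc[mask_grouped].index))
```

Naming the individual masks first, as done here, is one way to keep long filters readable and to avoid precedence surprises.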

### Performing mathematical operations on vectors

Let's look at a couple of examples where we apply calculations to our data frame. First, let's calculate some summary statistics. This can be useful when viewing our results for the first time, to get a handle on how our data is distributed.

In [36]:
# Returning summary statistics for all columns
cell_df_sub.describe()

# Returning summary statistics for a single column
cell_df_sub.loc[:,'n_counts'].describe()

count      25.000000
mean     2644.920000
std      1420.687302
min      1135.000000
25%      1478.000000
50%      2540.000000
75%      3206.000000
max      6873.000000
Name: n_counts, dtype: float64
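
The result of `.describe()` on a single column is itself a Series, so individual statistics can be pulled out by label. A minimal sketch on a hypothetical five-value Series (the name `toy_counts` is invented):

```python
import pandas as pd

# Hypothetical toy values, not the course dataset
counts = pd.Series([1, 2, 3, 4, 5], name="toy_counts")

summary = counts.describe()

# Index the summary by the same labels shown in the printed output
print(summary["mean"])
print(summary["50%"])

# The 50% row is the median
print(counts.median())
```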

`n_counts` refers to the number of counts for "unique molecular identifiers", which are barcodes for individual transcripts within a single cell. Ideally, if `n_counts` is high, then the number of genes per cell should also be high. The number of genes per cell is in the `n_genes` column. Let's see if this observation holds true by calculating the pairwise correlation between these two variables.



In [37]:
# Simply add the .corr() method to your dataframe subset
cell_df_sub.loc[:,['n_counts','n_genes']].corr()

| | n_counts | n_genes |
|---|---|---|
| n_counts | 1.0 | 0.960371 |
| n_genes | 0.960371 | 1.0 |
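
By default `.corr()` computes the Pearson correlation coefficient. As a rough cross-check, the sketch below compares it with NumPy's `corrcoef` on a hypothetical four-row frame (the values are invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame(
    {"n_counts": [100, 200, 300, 400], "n_genes": [50, 90, 160, 190]}
)

# .corr() returns a symmetric matrix; pick out the off-diagonal entry
r_pandas = toy.corr().loc["n_counts", "n_genes"]

# np.corrcoef also returns a 2x2 matrix for two input vectors
r_numpy = np.corrcoef(toy["n_counts"], toy["n_genes"])[0, 1]

print(round(r_pandas, 4), round(r_numpy, 4))
```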


That summarizes our introduction to Pandas. As you can see, Pandas greatly simplifies the process of exploring and making calculations in data frames and matrices. Check out the link below for the official documentation.

[**Pandas Documentation**](https://pandas.pydata.org/pandas-docs/stable/index.html)

# Shell commands in IPython

[_Adapted from the Python Data Science Handbook_](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/01.05-IPython-And-Shell-Commands.ipynb)

Jupyter Lab uses a special console called [IPython](https://ipython.org/). IPython has a ton of interesting features (you can read about them [here]()), but one of the most useful is the ability to run bash commands in the IPython console without any extra modules. You simply need to prepend your bash command with a `!`.

In [38]:
!ls

!pwd

!echo "printing from the shell"

Jupyter_Lab_intro.ipynb Python                  gitouttahere.sh
Pandas.md               README.md               meta_data.csv
Pandas_example.py       Unix2                   sample_script.py
/Users/ddiaz/Documents/diazdc-pfb2019
printing from the shell


In [39]:
contents = !ls
print(contents)

directory = !pwd
print(directory)

type(directory)

['Jupyter_Lab_intro.ipynb', 'Pandas.md', 'Pandas_example.py', 'Python', 'README.md', 'Unix2', 'gitouttahere.sh', 'meta_data.csv', 'sample_script.py']
['/Users/ddiaz/Documents/diazdc-pfb2019']


IPython.utils.text.SList
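
The `!` prefix is IPython syntax, so a line such as `contents = !ls` is a syntax error when the same code is run as a plain `.py` script from the command line. A hedged sketch of one way to get similar results with the standard library follows; it assumes a Unix-like system where `echo` is available on the PATH, and it is only one of several possible approaches.

```python
import os
import subprocess

# Rough equivalent of `directory = !pwd`: ask Python for the working directory
directory = os.getcwd()

# Rough equivalent of `contents = !ls`: list the directory without a shell
contents = sorted(os.listdir(directory))

# Rough equivalent of `!echo ...`: run an external command and capture its output
result = subprocess.run(
    ["echo", "printing from the shell"],
    capture_output=True,  # collect stdout/stderr instead of printing them
    text=True,            # decode bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)

# Split into lines, similar to the list-like object IPython returns
lines = result.stdout.splitlines()
print(lines)
```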