# GSD: Explore table of raw expression from BY4741 yscRNA-seq libraries 

This notebook is meant to facilitate exploring the data table of raw expression from BY4741 yscRNA-seq libraries (Supplementary Table 3 of [Nadal-Ribelles et al., 2019](https://www.ncbi.nlm.nih.gov/pubmed/30718850)) interactively.

Reference for the data:  
- [Sensitive high-throughput single-cell RNA-seq reveals within-clonal transcript correlations in yeast populations. Nadal-Ribelles M, Islam S, Wei W, Latorre P, Nguyen M, de Nadal E, Posas F, Steinmetz LM. Nat Microbiol. 2019 Feb 4. doi: 10.1038/s41564-018-0346-9. Epub ahead of print. PMID: 30718850 ](https://www.ncbi.nlm.nih.gov/pubmed/30718850)


**Technical note: Presently this notebook will only fully work in the classic Jupyter notebook environment when used via MyBinder.org. I don't have the repo set up to use JupyterLab with qgrid as of yet.**

-----

<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them. When you hover over them an <i class="fa-step-forward fa"></i> icon appears.</li>
        <li>To run a code cell either click the <i class="fa-step-forward fa"></i> icon, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running, a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running, the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

-----

## Preparation



First, get the file associated with Supplementary Table 3.

In [1]:
!curl -OL https://static-content.springer.com/esm/art%3A10.1038%2Fs41564-018-0346-9/MediaObjects/41564_2018_346_MOESM3_ESM.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1964k  100 1964k    0     0  8847k      0 --:--:-- --:--:-- --:--:-- 8887k
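
For running outside a notebook, where the `!curl` shell escape isn't available, the download step could be sketched in plain Python. The helper name `fetch_if_missing` is hypothetical (it is not part of the original notebook); it only uses the standard library and skips the download when the file is already present.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return the path.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    dest = Path(dest)
    if not dest.exists():
        # Only touch the network when the file is not already on disk.
        urlretrieve(url, str(dest))
    return dest
```

Under these assumptions, calling `fetch_if_missing(url, "41564_2018_346_MOESM3_ESM.txt")` with the URL from the `curl` command above would stand in for that cell.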


## Read in table and begin to evaluate it

Under ['Supplementary information' here](https://www.nature.com/articles/s41564-018-0346-9#Sec22), it says:

>"Supplementary Table 3
Raw expression BY4741 yscRNA-seq libraries after applying the quality filter criteria (total of 127 cells). Table contains raw number of molecules for each gene (rows) for each cell (rows)."

(I suspect the latter `(rows)` should read `(columns)`.)

After reading in the data, we'll also **add columns** for the `Total` raw expression counts and the average (`AVG`).

In [2]:
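import pandas as pd
# tab-separated file; the column names are on the second line (header=1)
df = pd.read_csv('41564_2018_346_MOESM3_ESM.txt', sep='\t', header=1)
# row-wise sum over the numeric (per-cell) columns only
df['Total'] = df.sum(axis=1, numeric_only=True)
# note: `Total` already exists at this point, so it is included in this row-wise mean
df['AVG'] = df.mean(axis=1, numeric_only=True).round(2)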
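
Because `Total` is added before the mean is taken, the `AVG` column computed above averages over the cell columns plus `Total`, not over the cells alone. If a per-cell mean is what you want, one possible sketch is to select the `cell_` columns explicitly. The toy dataframe below uses made-up values purely for illustration; only the column-naming pattern is taken from the real table.

```python
import pandas as pd

# Toy stand-in for the real table; the counts are made up for illustration.
toy = pd.DataFrame({
    "geneName": ["YAL067C", "YBR257W"],
    "comGeneName": ["SEO1", "POP4"],
    "cell_1": [0, 3],
    "cell_2": [1, 1],
    "cell_3": [2, 2],
})

# Restrict the row-wise statistics to the per-cell count columns.
cell_cols = [c for c in toy.columns if c.startswith("cell_")]
toy["Total"] = toy[cell_cols].sum(axis=1)
toy["AVG"] = toy[cell_cols].mean(axis=1).round(2)
print(toy[["comGeneName", "Total", "AVG"]])
```

Note that the values shown in the rest of this notebook come from the original calculation, so they would differ from what this variant produces.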

In [3]:
print(f"There is information for {len(df)} genes.")

There is information for 7272 genes.
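
To check the suspicion that genes are rows and cells are columns, counting the columns whose names start with `cell_` is one option. This is a minimal sketch on a made-up toy dataframe; with the real `df`, the description quoted above suggests the count should come out at 127.

```python
import pandas as pd

# Toy stand-in with the same column-naming pattern as the real table.
toy = pd.DataFrame({
    "geneName": ["YAL067C", "YBR257W"],
    "comGeneName": ["SEO1", "POP4"],
    "cell_1": [0, 3],
    "cell_2": [1, 1],
    "cell_3": [2, 2],
})

# Columns named cell_<n> hold the per-cell counts; every row is one gene.
cell_cols = toy.columns[toy.columns.str.startswith("cell_")]
n_genes, n_cells = len(toy), len(cell_cols)
print(f"{n_genes} genes (rows) x {n_cells} cells (columns)")
```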


In [4]:
df.head()

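|  | geneName | comGeneName | cell_1 | cell_2 | cell_3 | cell_4 | cell_5 | cell_6 | cell_7 | cell_8 | ... | cell_120 | cell_121 | cell_122 | cell_123 | cell_124 | cell_125 | cell_126 | cell_127 | Total | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SUT432 | SUT432 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 4 | 0 | 115 | 1.8 |
| 1 | YAL067C | SEO1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 0.19 |
| 2 | CUT436 | CUT436 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 3 | SUT433 | SUT433 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 21 | 0.33 |
| 4 | CUT437 | CUT437 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 11 | 0.17 |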


## Using qgrid for interactive viewing

As the information above showed, there is a lot of data. To make it more intuitive to explore we'll use the `qgrid` package by [Quantopian, Inc.](https://github.com/quantopian).

The code in the cell below will bring up a view of the dataframe. Clicking on the column names will allow you to sort them by that parameter. For example, scroll to the right side in the view to show the `Total` column and then click on the column heading twice to sort it descending. You can click the expand button on the right above the column to have the dataframe view fill the current browser window.

In [5]:
import qgrid
# controlling individual column widths based on https://github.com/quantopian/qgrid/issues/119 ; however if you
# have too many columns, you still need `grid_options={'forceFitColumns': False}` from 'Example 2 - Render a DataFrame with 1 million rows'
col_options = {'width': 55,}
col_defs = {
    'geneName': {'width': 85,},
    'comGeneName':{'width': 100,},
    'Total':{'width': 65,},
    'AVG':{'width': 65,},
}
#qgrid_widget = qgrid.show_grid(df, show_toolbar=True, grid_options={'forceFitColumns': False, 'defaultColumnWidth': 65})
qgrid_widget = qgrid.show_grid(df, column_options=col_options,column_definitions=col_defs, grid_options={'forceFitColumns': False}, show_toolbar=True)
qgrid_widget

QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': False, 'defa…
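
Where the qgrid widget does not render (for example, outside the classic notebook environment), a static view of the highest-count genes can stand in for clicking the `Total` column header. A minimal sketch with pandas' `nlargest`, run on a made-up toy dataframe rather than the real `df`:

```python
import pandas as pd

# Toy stand-in; gene names and totals are made up for illustration.
toy = pd.DataFrame({
    "geneName": ["A", "B", "C", "D"],
    "Total": [5, 20, 0, 12],
})

# Keep the 2 rows with the largest `Total`, in descending order.
top = toy.nlargest(2, "Total")
print(top)
```

With the real data, something like `df.nlargest(10, 'Total')` would be the analogous call.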

However, that still isn't that easy to explore. Since we now have the `Total` and `AVG` columns, we probably don't need to show the results for every cell all the time.

In [6]:
import qgrid
just_sum_df = df[['geneName','comGeneName','Total','AVG']]
#qgrid_widget = qgrid.show_grid(df, show_toolbar=True, grid_options={'forceFitColumns': False, 'defaultColumnWidth': 65})
qgrid_widget = qgrid.show_grid(just_sum_df, show_toolbar=True)
qgrid_widget

QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': True, 'defau…

## Locating your favorite genes in the data

Unless we are going to use qgrid to view the results, we probably want to sort on the total first, so that the number on the left when we subset to show our favorite genes corresponds to the ranking by total counts. Otherwise, that number will just reflect the gene's position in the original table, which follows each chromosome in order, starting with chromosome 1. (That is why `YAL067C` appeared near the top when we first ran `df.head()` to look at the dataframe.)

The next cell will sort the two dataframes we have been using.

In [7]:
SORTED_df = df.sort_values('Total', ascending=False).reset_index(drop=True) # if we left off `drop=True`, we'd also keep
# the pre-sort row number on the left, and it would be under the heading 'index'
just_sum_SORTED_df = just_sum_df.sort_values('Total', ascending=False).reset_index(drop=True)
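
The effect of `drop=True` described in the comment can be seen on a small example. This sketch uses a three-row toy dataframe whose gene names and totals are copied from outputs elsewhere in this notebook; the variable names `ranked`, `kept`, and `dropped` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "comGeneName": ["SEO1", "POP4", "RPP1"],
    "Total": [12, 145, 142],
})

ranked = toy.sort_values("Total", ascending=False)

# Without drop=True, the pre-sort row numbers are kept in an 'index' column.
kept = ranked.reset_index()
# With drop=True, the old row numbers are discarded entirely.
dropped = ranked.reset_index(drop=True)

print(kept)
print(dropped)
```

Either way, the new row labels run from 0 upward in sorted order, which is what lets them be read as a ranking.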

Now to subset.

In [8]:
SORTED_df[SORTED_df.comGeneName == "POP4"]

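|  | geneName | comGeneName | cell_1 | cell_2 | cell_3 | cell_4 | cell_5 | cell_6 | cell_7 | cell_8 | ... | cell_120 | cell_121 | cell_122 | cell_123 | cell_124 | cell_125 | cell_126 | cell_127 | Total | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2541 | YBR257W | POP4 | 3 | 1 | 3 | 1 | 0 | 1 | 1 | 3 | ... | 0 | 2 | 0 | 3 | 2 | 0 | 1 | 1 | 145 | 2.27 |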


If we are just concerned with viewing the summary dataframe, we can specify that dataframe to subset.

In [9]:
just_sum_SORTED_df[just_sum_SORTED_df.comGeneName == "RPP1"]

|  | geneName | comGeneName | Total | AVG |
|---|---|---|---|---|
| 2576 | YHR062C | RPP1 | 142 | 2.22 |
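
To pull out several specific genes in one step rather than one `==` comparison at a time, pandas' `isin` is one option. A minimal sketch on a toy dataframe whose three rows are copied from outputs in this notebook; the list name `favorites` is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "geneName": ["YBR257W", "YHR062C", "YDL140C"],
    "comGeneName": ["POP4", "RPP1", "RPO21"],
    "Total": [145, 142, 749],
})

# Keep only the rows whose common name is in the list of interest.
favorites = ["POP4", "RPP1"]
subset = toy[toy["comGeneName"].isin(favorites)]
print(subset)
```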


Curious about the genes ranked in that range?

In [10]:
just_sum_SORTED_df.iloc[2531:2580]

|  | geneName | comGeneName | Total | AVG |
|---|---|---|---|---|
| 2531 | SUT572 | SUT572 | 146 | 2.28 |
| 2532 | YAL020C | ATS1 | 146 | 2.28 |
| 2533 | YGL242C | YGL242C | 146 | 2.28 |
| 2534 | YDR068W | DOS2 | 146 | 2.28 |
| 2535 | YML023C | NSE5 | 146 | 2.28 |
| 2536 | YEL007W | YEL007W | 145 | 2.27 |
| 2537 | YLR237W | THI7 | 145 | 2.27 |
| 2538 | YBR086C | IST2 | 145 | 2.27 |
| 2539 | YMR139W | RIM11 | 145 | 2.27 |
| 2540 | YCL052C | PBN1 | 145 | 2.27 |
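
Rather than typing the row range by hand, the window could be derived from a gene's position. The helper `neighbors` below is hypothetical, and the sketch assumes the dataframe has the default 0..n-1 index (as after `reset_index(drop=True)`), so that index labels and positions coincide. The toy data is made up.

```python
import pandas as pd

# Toy stand-in, already sorted by Total in descending order.
toy = pd.DataFrame({
    "comGeneName": [f"GENE{i}" for i in range(10)],
    "Total": list(range(100, 90, -1)),
})


def neighbors(sorted_df, gene, flank=2):
    """Return the rows within `flank` positions of `gene` (hypothetical helper)."""
    # Assumes a default RangeIndex, so the label doubles as the position.
    pos = sorted_df.index[sorted_df["comGeneName"] == gene][0]
    start = max(pos - flank, 0)  # avoid a negative start near the top
    return sorted_df.iloc[start:pos + flank + 1]


window = neighbors(toy, "GENE5")
print(window)
```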


Using the `.str.contains()` method, a group of (mostly) related genes can be examined at the same time.

In [11]:
just_sum_SORTED_df[just_sum_SORTED_df.comGeneName.str.contains("POP")] 

|  | geneName | comGeneName | Total | AVG |
|---|---|---|---|---|
| 1824 | YNR052C | POP2 | 227 | 3.55 |
| 1836 | YNL282W | POP3 | 224 | 3.5 |
| 1950 | YAL033W | POP5 | 209 | 3.27 |
| 2346 | YBL018C | POP8 | 162 | 2.53 |
| 2541 | YBR257W | POP4 | 145 | 2.27 |
| 3140 | YNL221C | POP1 | 108 | 1.69 |
| 3344 | YBR167C | POP7 | 97 | 1.52 |
| 4827 | YGR030C | POP6 | 41 | 0.64 |
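
A plain substring match picks up any name that merely contains the text, which is one reason a match can be only "mostly" related. `.str.contains()` treats its pattern as a regular expression by default, so anchoring the pattern is one way to tighten it. In this sketch the names `XPOP1` and `POP2A` are made up to show the difference, and `na=False` is included only as a guard in case a name is missing.

```python
import pandas as pd

# POP4 and POP8 appear in the table above; XPOP1, POP2A and the
# missing value are made up for illustration.
names = pd.Series(["POP4", "POP8", "XPOP1", "POP2A", None])

# Substring match: anything containing "POP" qualifies.
loose = names.str.contains("POP", na=False)

# Anchored regex: "POP" followed only by digits, nothing before or after.
strict = names.str.contains(r"^POP\d+$", na=False)

print(names[loose].tolist())
print(names[strict].tolist())
```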


We can go beyond visualization to make reports on the ranking of genes.  
In the cells below we define a gene of interest, view the pertinent row, and then use the information to make a summary about that gene.

In [12]:
gene = "RPO21"

In [13]:
just_sum_SORTED_df[just_sum_SORTED_df.comGeneName == gene]

|  | geneName | comGeneName | Total | AVG |
|---|---|---|---|---|
| 629 | YDL140C | RPO21 | 749 | 11.7 |


In [14]:
gene = "RPO21"
pos_gene = just_sum_SORTED_df[just_sum_SORTED_df.comGeneName == gene].index[0] # zero-based position in the SORTED dataframe
print(f"'{gene}' ranks {pos_gene} among {len(df)} genes when sorted based on the sum of molecules per gene for all 127 cells.")
print(f"'{gene}' ranks in the top {pos_gene/len(df):.2%} of the list of genes.")

'RPO21' ranks 629 among 7272 genes when sorted based on the sum of molecules per gene for all 127 cells.
'RPO21' ranks in the top 8.65% of the list of genes.
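
For checking several genes in a row, the report could be wrapped in a function. The name `rank_report` is hypothetical, and this sketch differs from the cell above in two ways that are design choices rather than the notebook's method: it converts the zero-based position into a 1-based rank, and it returns a message instead of raising an error when the gene is absent. The toy data is made up.

```python
import pandas as pd


def rank_report(sorted_df, gene):
    """Summarize where `gene` falls in a dataframe sorted by Total (hypothetical helper)."""
    hits = sorted_df.index[sorted_df["comGeneName"] == gene]
    if len(hits) == 0:
        return f"'{gene}' was not found."
    rank = int(hits[0]) + 1  # convert the zero-based position to a 1-based rank
    n = len(sorted_df)
    return f"'{gene}' ranks {rank} of {n} genes (top {rank / n:.2%})."


# Toy stand-in, already sorted by Total with a default 0..n-1 index.
toy = pd.DataFrame({
    "comGeneName": ["A", "B", "C", "D"],
    "Total": [40, 30, 20, 10],
})
print(rank_report(toy, "B"))
```

Genes with tied totals have no guaranteed relative order after sorting, so ranks within a tie should be read loosely.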


Just change the gene name to your gene of interest and re-run the cells as you wish.

----