This notebook explores the data as the initial exploration did, but focuses on producing the graphs an experimentalist would.

After talking with the experimentalist who provided the data, we have a few preprocessing steps and a few observed constants.

Gamma = 0.6

Leakthrough (Beta) = 7% or 0.07

He used the following smFRET efficiency equation:


![fret equation](../images/fret_equation.png)
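
For readers who cannot see the image, the same correction can be read off the awk command in the last code cell of this notebook. A minimal Python sketch of it follows. The function name `fret_efficiency` and its argument names are mine, not the experimentalist's, and the formula is taken from that awk command rather than from the image itself.

```python
def fret_efficiency(acceptor, donor, gamma=0.6, beta=0.07):
    """Leakthrough- and gamma-corrected FRET efficiency for one time bin.

    Mirrors the final awk step of this notebook:
    E = (A - beta*D) / ((A - beta*D) + gamma*D)
    where A and D are the acceptor and donor photon counts.
    """
    corrected_acceptor = acceptor - beta * donor
    denominator = corrected_acceptor + gamma * donor
    if denominator == 0:
        raise ValueError("denominator is zero; cannot compute an efficiency")
    return corrected_acceptor / denominator


# Hypothetical counts, chosen only to exercise the function
print(fret_efficiency(acceptor=30, donor=10))
```

With no donor photons the corrected acceptor count equals the denominator, so the efficiency is exactly 1.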

His first step was to bin the data at the millisecond timescale (0.001 s).
I've written an awk script that does exactly this; it should be run on a data file output by timesteps_extractor.py.
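
The conversion the awk script performs can be sketched in Python as below. It assumes, as the awk command does, that column one is a counter in units of 10^-7 s and column three a counter in units of 3.2 × 10^-11 s. The names `COARSE_TICK`, `FINE_TICK`, `photon_time` and `millisecond_bin` are hypothetical.

```python
COARSE_TICK = 1e-7     # seconds per count in column one
FINE_TICK = 3.2e-11    # seconds per count in column three


def photon_time(coarse, fine):
    """Arrival time in seconds built from the two raw counters."""
    return coarse * COARSE_TICK + fine * FINE_TICK


def millisecond_bin(seconds):
    """Label a time with its millisecond bin, rounding like awk's %.3f."""
    return f"{seconds:.3f}"


# Hypothetical counter values, not taken from the data file
print(millisecond_bin(photon_time(25000, 1000)))
```

Note that `%.3f` rounds to the nearest millisecond rather than truncating, so each bin is centered on its label instead of starting at it.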

In [3]:
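#first some constants
#set with %env so they are exported to the shell that runs the ! commands below;
#a "!name=value" line only sets the variable in a throwaway subshell, where it is lost immediately
#column one ticks are 10 ^ -7 s, column two ticks are 3.2 * 10 ^ -11 s
%env columnOneTimescale=1e-7
%env columnTwoTimescale=3.2e-11
%env gamma=0.6
%env beta=0.07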

In [22]:
#!awk -v ONE=$columnOneTimescale -v TWO=$columnTwoTimescale '{if(NR>2){x=($1 * ONE + $3 * TWO); printf "%.3f %f\n", x, $2}}' FILENAME
#awk will ignore the first two rows
#and then round to 3 decimals of precision - i.e. every remaining row will be converted to timestamp, channel
!awk -v ONE=$columnOneTimescale -v TWO=$columnTwoTimescale '{if(NR>2){x=($1 * ONE + $3 * TWO); printf "%.3f %f\n", x, $2}}' ../data/10LinesOfExampleData.csv
#we will also write a small intermediate file for quick access in the next step
!awk -v ONE=$columnOneTimescale -v TWO=$columnTwoTimescale '{if(NR>2){x=($1 * ONE + $3 * TWO); printf "%.3f %f\n", x, $2}}' ../data/10LinesOfExampleData.csv > ../data/intermediate1

0.000 3.000000
0.000 3.000000
0.000 3.000000
0.000 3.000000
0.000 4.000000
0.000 4.000000
0.000 4.000000
0.000 4.000000


In [24]:
##if the channel is 3, then it is acceptor.  If it is 4, it is donor
##This awk script checks the value of the 2nd column (the channel) and decides if we are looking at an acceptor or a donor
##Next it increments the acceptor or donor count for that row's timestamp by 1.
##Finally, we write one line per timestamp: timestamp, acceptorCount(y), donorCount(z)
!awk '{u[$1]++; if(int($2)==3)y[$1]++; if(int($2)==4)z[$1]++}; END{for (j in u) printf "%.3f %.0f %.0f\n", j, y[j], z[j]}' ../data/intermediate1 | sort -n 
#and once again we write an intermediate file for next step
!awk '{u[$1]++; if(int($2)==3)y[$1]++; if(int($2)==4)z[$1]++}; END{for (j in u) printf "%.3f %.0f %.0f\n", j, y[j], z[j]}' ../data/intermediate1 | sort -n > ../data/intermediate2

0.000 4 4
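
The counting step above can be sketched in Python as follows, under the same assumption that channel 3 is the acceptor and channel 4 the donor. The function name `count_per_bin` is hypothetical, and the sample rows simply restate the eight lines of `intermediate1` shown earlier.

```python
from collections import Counter


def count_per_bin(rows, acceptor_channel=3, donor_channel=4):
    """Count acceptor and donor photons in each time bin.

    rows: iterable of (time_bin, channel) pairs, as in intermediate1.
    Returns a list of (time_bin, acceptor_count, donor_count),
    sorted numerically by time bin.
    """
    acceptor, donor, seen = Counter(), Counter(), set()
    for time_bin, channel in rows:
        seen.add(time_bin)
        if int(channel) == acceptor_channel:
            acceptor[time_bin] += 1
        elif int(channel) == donor_channel:
            donor[time_bin] += 1
    return [(t, acceptor[t], donor[t]) for t in sorted(seen, key=float)]


rows = [("0.000", 3.0)] * 4 + [("0.000", 4.0)] * 4
print(count_per_bin(rows))  # [('0.000', 4, 4)]
```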


In [25]:
ls ../data

10LinesOfExampleData.csv  intermediate1
README                    intermediate2


Finally, the experimentalist decides qualitatively which time bins contain only noise and which contain valid data.  He stated that he picks a threshold of between 20 and 30 observations.  He compares his chosen number (20 in this example) with the value of the denominator of the FRET efficiency equation, and keeps only the bins whose denominator exceeds it.

*Note: for his paper he also performed a local subtraction step, but he was fuzzy on the details.  He called it minutiae, and it is ignored in today's example.

![fret equation](../images/fret_equation.png)

In [28]:
#finally, we apply the FRET efficiency equation to the timesteps which are above the threshold.
#the efficiencies are sorted into 50 bins so that a histogram can then be graphed.
#threshold is set with %env so that the shell running awk can see it
#j+0 forces a numeric comparison, since awk array subscripts are strings
%env threshold=20
!awk -v GAMMA=$gamma -v BETA=$beta -v FILTER=$threshold '{if(($2 + (GAMMA-BETA) * $3) > FILTER){x=($2 - BETA * $3); y=(x/(x + GAMMA * $3)); z[int(y*50)]++}}; END{for (j in z)if(j+0>-10 && j+0<50) print j/50, z[j]}' ../data/intermediate2 | sort -n
#This will produce no result as no value in the sample meets the criterion.
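
The thresholding and histogram step can also be sketched in Python. This is a sketch of the same logic as the awk command, not the experimentalist's own code. The name `efficiency_histogram` is hypothetical, and the bin handling (truncation toward zero, bins labeled by their lower edge, indices limited to between -10 and 50) follows the awk command.

```python
def efficiency_histogram(counts, threshold=20, gamma=0.6, beta=0.07, n_bins=50):
    """Histogram the FRET efficiencies of time bins that pass the threshold.

    counts: iterable of (acceptor_count, donor_count), one pair per time bin.
    Returns {bin_lower_edge: number_of_time_bins}, sorted by edge.
    """
    histogram = {}
    for acceptor, donor in counts:
        corrected = acceptor - beta * donor
        total = corrected + gamma * donor    # denominator of the FRET equation
        if total <= threshold:
            continue                         # treated as noise
        efficiency = corrected / total
        index = int(efficiency * n_bins)     # truncates toward zero, like awk's int()
        if -10 < index < n_bins:
            edge = index / n_bins
            histogram[edge] = histogram.get(edge, 0) + 1
    return dict(sorted(histogram.items()))


# The single sample bin (4 acceptor, 4 donor) has a denominator of 6.12,
# which is below the threshold, so nothing is histogrammed.
print(efficiency_histogram([(4, 4)]))  # {}
```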

For reference, I have included as-yet-unpublished figures from the experimentalist's paper.

Full datasets can be obtained from: https://my.pcloud.com/publink/show?code=kZT9cLkZrSyQXkE6PU7D4cslbNDv0L07jr6V

Describing and understanding these diagrams gives you a perspective on what you should expect to see from the data.
![C](../images/figure_c.png)

C:

The black line is fit to the histogram.  The other three are fit so that together they ‘re-build’ the black curve.

The purple line is a Gaussian.  Its sigma value is locked to the value obtained from 3M, because this is the best representation of completely denatured protein.

The blue line is fixed to a log-normal of the donor-only signal.  Only the amplitude (essentially the height) is allowed to vary.

The red line is a reversed log-normal with mu (not sigma? I am not sure why) fixed to the 0M signal; again, only the amplitude is allowed to vary.
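
To make the composition concrete, here is a rough Python sketch of how three such components could sum to the black curve. Everything specific in it is an assumption: the function names, the parameterization of the log-normal, and above all the reading of "reversed" as mirroring the log-normal along the efficiency axis (`pivot - x`). The experimentalist's actual fitting code was not available to me, and no fitted parameter values are implied.

```python
import math


def gaussian(x, amplitude, mu, sigma):
    """Purple component: sigma would be held fixed during the fit."""
    return amplitude * math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def lognormal(x, amplitude, mu, sigma):
    """Blue component: log-normal shape, zero for x <= 0."""
    if x <= 0:
        return 0.0
    return amplitude / x * math.exp(-0.5 * ((math.log(x) - mu) / sigma) ** 2)


def reversed_lognormal(x, amplitude, mu, sigma, pivot=1.0):
    """Red component. ASSUMPTION: 'reversed' means evaluated at pivot - x."""
    return lognormal(pivot - x, amplitude, mu, sigma)


def total_fit(x, purple, blue, red):
    """Black curve: the sum of the three components (each a dict of parameters)."""
    return (gaussian(x, **purple)
            + lognormal(x, **blue)
            + reversed_lognormal(x, **red))
```

In a fit along the lines described above, only the three amplitudes, plus whichever shape parameters are not locked, would be left free.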

![D](../images/figure_d.png)

D:

To generate D, first follow the procedure for C.  D is basically the red curve plus the purple curve, normalized to the number of molecules, which essentially deletes the donor-only signal.  I have not reproduced these steps here.