<a href="https://colab.research.google.com/github/mpfoster/Biochem5721/blob/master/02_distributions_5721.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Biochem 5721 -- Reading data and analyzing distributions

## Ensemble properties from averaging of large ensembles
<img  align="right" src="https://raw.githubusercontent.com/mpfoster/Biochem5721/master/images/kinesin-cartoon.png" alt="Kinesin" width="500"/>

* We saw in Video 1 that molecular motors walking on microtubules are responsible for morphological changes in a cell, and for enabling cell division. In the case of kinesin, we envision a dimeric molecule that makes stepping motions as ATP in the microtubule binding domains is hydrolyzed to ADP, inducing a conformational change.
* We also saw that single-molecule fluorescence methods could be used to visualize the movement of individual kinesin molecules on microtubules.
* Important questions about molecular motor function include:
  1. How fast can these motors travel?
  1. Do they move by a _stepping_ motion, or more like an _inch-worm_?
  1. How far do they move in a single stroke?
  1. How efficient is the motor (how many steps, or how many nm traveled, per ATP hydrolyzed)?
  1. How processive is it (how often does it fall off)?

* Biochemists have long studied these and related questions using traditional _ensemble_ approaches, which measure the **average** overall behavior of _many, many_ molecules (~10<sup>-12</sup> mol, i.e. ~6 × 10<sup>11</sup> molecules).
* The accuracy and precision of those measurements depend on the sensitivity of the available technology, but also on the intrinsic variability of the molecules themselves.


## Single molecule methods reveal variation within the ensemble
* Highly sensitive single molecule fluorescence techniques allow measurement of the position of individual molecules (esp. TIRF microscopy, Total Internal Reflection Fluorescence).

<img  alt="Hand-over-hand & Inchworm Models" src="https://raw.githubusercontent.com/mpfoster/Biochem5721/master/images/myosin-stepping.png" width="800"/>
* In single molecule experiments using another motor, myosin, Yildiz *et al.* (2003) *Science* 300(5628): 2061-2065, DOI: 10.1126/science.1084398 (https://science-sciencemag-org.proxy.lib.ohio-state.edu/content/300/5628/2061) achieved high spatial resolution by immobilizing actin substrates (for example, by stretching them between two small particles) and then observing the position of a fluorescent tag attached to one of the myosin "legs". A representative movie can be found at [this link](https://raw.githubusercontent.com/mpfoster/Biochem5721/master/images/myosin-time-course.mov). A video describing this and related research is here: https://www.youtube.com/watch?v=MKlyi4euq50



<img align="right" alt="Stepping Plot" src="https://raw.githubusercontent.com/mpfoster/Biochem5721/master/images/myosin-step-plot.png" width="600"/>

* By measuring the positions of individual molecules over time, the authors were able to follow the movements of individual myosin dimers (3 shown in the plot).
* Step sizes were measured for 32 molecules, for a total of 231 steps over ~100 s.
* Step sizes were binned and plotted as a histogram (number of steps of a given length).

### _Observations?_ 
### _Conclusions?_
<!-- 
Obs:
1. Average step size ~74 nm, not inching; stdev = 5 nm
2. range of step sizes was ~30 nm (60-90)
3. Distribution looks more or less Gaussian (normal distribution)

Conclusions:
1. Step-size is ~2x translocation, so stepping
2. Step size is not uniform, or constant!
3. Normal distribution suggests a relatively simple thermodynamic explanation for the variation in step size
 -->



## Statistical measures are important in experimental sciences
* Measurements of molecular properties are subject to variability due to limitations of measurements, and variation of the phenomenon (both real and artificial).
* This makes it critical to define the precision with which we _know_ something.
* This is most commonly done by computing the **mean**, $\large \bar{x} = \frac{\sum_{i=1}^{N}x_i}{N}$,
and the (sample) **standard deviation**, $\large \sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i-\bar{x})^2}{N-1}}$
<img align="right" alt="Bell Curve" src="https://raw.githubusercontent.com/mpfoster/Biochem5721/master/images/normal-distrib.png" width="400"/>
* These values are most commonly interpreted in the context of an *assumed* **Normal Distribution**
  * In a Normal Distribution, the mean $\bar{x}$ is the *most probable* value, and $\sigma$ describes the width of the distribution. 
  * 68% of the data are within $1\sigma$ of the mean, and 95% within $2\sigma$. So, differences $>2\sigma$ are often considered significant.

* Let's compute these by hand for a small sample: [76.1, 72.6, 73.2]

In [None]:
# mean: sum the values and divide by the number of values:
values = [76.1, 72.6, 73.2]
mean = sum(values) / len(values)
mean

In [None]:
'''
 stdev: for each value, subtract the mean, square the difference, sum the squares, divide by N-1 and take the square root
'''
stdev = (sum((x - mean) ** 2 for x in values) / (len(values) - 1)) ** 0.5    # compute it
stdev   # print it

In [None]:
# Let's use Python libraries
# make a list of our values:
a = [76.1, 72.6, 73.2]  # square brackets, separate values by ","
from scipy import stats
stats.describe(a)   # reports the mean and the variance (N-1 denominator); stdev = sqrt(variance)

## Let's simulate a normal distribution!
Yildiz *et al.* measured the step size of myosin motors from N = 231 steps. Assuming we know the true answer, we can simulate the effect of sample size on the accuracy of the measurement and analysis. For example, what if they had made only the minimum number of measurements (3)? Another 231? Ten times as many?

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
from scipy import stats # scipy statistical tools
import seaborn as sns   # this is a nice package for plotting and some stats analysis
plt.style.use('ggplot')   # nice style plots

In [None]:
true_mean = 74; true_stdev = 5.5; nsamples = 3
plt.rcParams["figure.figsize"] = (8,3)    # adjust fig size if desired (set before the figure is created)
samples = stats.norm.rvs(size=nsamples, scale=true_stdev, loc=true_mean)   # simulate
#print(samples)
(mu, sigma) = stats.norm.fit(samples)   # now, fit the simulated data (maximum likelihood: sigma uses N, not N-1)
print("Average = {0:.2g}, STDEV = {1:.2g}".format(mu, sigma))
# note: distplot is deprecated in seaborn >= 0.11; with fit= it draws a normalized (density) histogram
ax = sns.distplot(samples, bins=10, kde=False, fit=stats.norm)   # plot it
ax.set_xlabel("Step Size /nm"); ax.set_ylabel("Probability density")
plt.show()

* What can we conclude about the effect of sample size?
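
One way to explore this question without re-running the cell by hand is to loop over several sample sizes and compare the estimates. The sketch below is only an illustration: the seed, the dictionary name `results`, and the choice of N values (3, 231, 2310, following the sizes suggested above) are assumptions, and the "true" mean and standard deviation are the same ones used in the simulation cell.

```python
import numpy as np
from scipy import stats

true_mean, true_stdev = 74.0, 5.5     # same "true" values as the simulation cell
rng = np.random.default_rng(5721)     # arbitrary fixed seed so the run is repeatable

results = {}                          # N -> (mean, stdev, standard error of the mean)
for n in (3, 231, 2310):
    samples = stats.norm.rvs(loc=true_mean, scale=true_stdev, size=n, random_state=rng)
    mean = samples.mean()
    stdev = samples.std(ddof=1)       # sample standard deviation (N-1 denominator)
    sem = stdev / np.sqrt(n)          # standard error of the mean
    results[n] = (mean, stdev, sem)
    print("N = {0:5d}: mean = {1:.2f}, STDEV = {2:.2f}, SEM = {3:.2f}".format(n, mean, stdev, sem))
```

The standard error of the mean (STDEV divided by the square root of N) is a common measure of how precisely the mean itself is known. It should shrink as N grows, while the estimated STDEV should settle near the true width of the distribution.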

## Chemical shift data

<img  align="right" src="http://www.bmrb.wisc.edu/metabolomics/standards/5_Hydroxy_L_tryptophan/lit/439280.png" alt="Trp" width="200"/>
* NMR chemical shifts are sensitive to the local chemical environment. For a tryptophan (Trp) residue in a protein, the chemical shift of the indole proton H$\epsilon$<sup>1</sup> is generally well separated from other <sup>1</sup>H signals in the molecule and can serve as a useful probe of protein foldedness, flexibility and ligand binding.

* Chemical shift statistics in general are used to facilitate assignments and to determine whether the local environment is unusual. 

* Chemical shift data has been tabulated by the NMR community at the BMRB (_BioMagResBank_, Biological Magnetic Resonance Data Bank; http://www.bmrb.wisc.edu/). Because chemical shift depends on structure, and structure depends on energy (thermodynamics), one might expect the chemical shifts for a particular atom type to exhibit a Normal (Gaussian) distribution around some mean value. We will test that assumption in this Jupyter notebook.

* <img align="right" alt="BMRB data TrpHE1" src="https://raw.githubusercontent.com/mpfoster/Biochem5721/master/images/bmrb-trp-he1.png" width="400"/> The protein dataset at the BMRB consists of a series of 'csv' (comma-separated-value) plain text files. For the Trp indole <sup>1</sup>H the file can be found at this URL: http://www.bmrb.wisc.edu/ftp/pub/bmrb/statistics/chem_shifts/full/devise/TRP_HE1_sel.txt. 

* In each row, the first number is the BMRB entry ID; the next-to-last number is the chemical shift of the H$\epsilon$1 nucleus.

* We will proceed by:
  1. Importing the required tools/packages
  1. Downloading the data from the web into a Python *data frame* using Pandas
  1. Performing statistical analysis to determine Mean and STDEV

In [None]:
# Chemical shift histogram
# download chemical shift data from BMRB:
# http://www.bmrb.wisc.edu/ftp/pub/bmrb/statistics/chem_shifts/full/devise/TRP_HE1_sel.txt
# import requirements:
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
import requests

In [None]:
# load data into a dataframe:
data_url='http://www.bmrb.wisc.edu/ftp/pub/bmrb/statistics/chem_shifts/full/devise/TRP_HE1_sel.txt'
df = pd.read_csv(data_url,
    names=("id","mol","282","resi","resn","name","element","shift","n")
    )  #data
df.head()

In [None]:
df.describe()

In [None]:
df.hist('shift', bins=100)

In [None]:
df.hist('shift', range=(7,13), bins=100)
plt.show()

In [None]:
# keep only the shifts inside the 7-13 ppm window plotted above (drops far outliers)
df2 = df.loc[(df['shift'] > 7) & (df['shift'] < 13)]
samples = df2['shift'].tolist()   # plain Python list of the selected shifts

In [None]:
r = stats.describe(samples)
r

In [None]:
(mu, sigma) = stats.norm.fit(samples)
print("Chemical Shift Mean: {0:.2f} ± {1:.2f} ppm".format(mu, sigma))

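Standard "Normal" probability density function: $f(x)= \frac{\exp(-x^2/2)}{\sqrt{ 2 \pi}}$. For a distribution with mean $\mu$ and standard deviation $\sigma$, $x$ is replaced by $(x-\mu)/\sigma$ and $f$ is divided by $\sigma$.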
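
The density formula can be checked numerically against `scipy.stats.norm.pdf`. In the sketch below, `normal_pdf` is a hypothetical helper written for this illustration, and the mean (10.0 ppm) and standard deviation (0.5 ppm) are made-up values, not fitted ones. It uses the general form of the density, obtained from the standard form by substituting $z = (x-\mu)/\sigma$ and dividing by $\sigma$. It also checks the fractions of a Normal distribution lying within 1 and 2 standard deviations of the mean.

```python
import numpy as np
from scipy import stats

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal probability density with mean mu and standard deviation sigma."""
    z = (np.asarray(x, dtype=float) - mu) / sigma        # standardized variable
    return np.exp(-z**2 / 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(7, 13, 7)                          # a few chemical-shift-like values (ppm)
ours = normal_pdf(x, mu=10.0, sigma=0.5)           # mu and sigma are illustrative only
scipys = stats.norm.pdf(x, loc=10.0, scale=0.5)    # scipy's loc = mean, scale = stdev
print("matches scipy:", np.allclose(ours, scipys))

# fraction of a Normal distribution within 1 and 2 standard deviations of the mean
within_1 = stats.norm.cdf(1) - stats.norm.cdf(-1)
within_2 = stats.norm.cdf(2) - stats.norm.cdf(-2)
print("within 1 sigma: {0:.3f}, within 2 sigma: {1:.3f}".format(within_1, within_2))
```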

In [None]:
ax = sns.distplot(samples, bins=100, kde=False, fit=stats.norm)
ax.set_xlim([7,14])
#plt.savefig('figure', dpi=150)

In [None]:
ax = sns.distplot(samples, bins=100, kde=False, fit=stats.laplace)
ax.set_xlim([7,14])


## *So, what can we conclude about the distributions of Trp HE1 shifts?*
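
Beyond judging the two fitted curves by eye, one possible way to compare them is the total log-likelihood of the data under each fitted distribution (higher means a better description). Both distributions have two parameters, so the values can be compared directly. The sketch below is an assumption-laden illustration: `total_log_likelihood` is a hypothetical helper, and the data are synthetic stand-ins deliberately drawn from a Laplace distribution (not BMRB values), so the expected ordering is known in advance. In the notebook, the `samples` list built from the data frame would take their place.

```python
import numpy as np
from scipy import stats

# Stand-in data (NOT BMRB values): Laplace-distributed numbers near 10 ppm
rng = np.random.default_rng(0)
samples = stats.laplace.rvs(loc=10.1, scale=0.4, size=5000, random_state=rng)

def total_log_likelihood(dist, data):
    """Fit dist to data by maximum likelihood; return the summed log-density."""
    params = dist.fit(data)                      # (loc, scale) for norm and laplace
    return np.sum(dist.logpdf(data, *params))

ll_norm = total_log_likelihood(stats.norm, samples)
ll_laplace = total_log_likelihood(stats.laplace, samples)
print("log-likelihood, Normal fit:  {0:.1f}".format(ll_norm))
print("log-likelihood, Laplace fit: {0:.1f}".format(ll_laplace))
```

Applied to the real shifts, whichever fit gives the higher value is the better description of the two under this measure. That alone does not show that either distribution is the "true" one.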