<a href="https://colab.research.google.com/github/carlocamilloni/Structural-Bioinformatics/blob/main/Notebooks/t02_intro_stat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Statistical Analysis

The aim of these exercises is to familiarise you with Python notebooks and some Python coding for dealing with numbers, and to review some simple statistical analysis. For some theory refer to **Lecture 2**:

*   https://github.com/carlocamilloni/Structural-Bioinformatics/blob/main/Notes/02_StochasticMolecules.pdf


## Part I. Introduction



Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with
- Zero configuration required
- Free access to GPUs
- Easy sharing

You can watch the [Introduction to Colab](https://www.youtube.com/watch?v=inN8seMm7UI) video recommended by Google Colab to learn more, or just get started below!


Note that on the left you have three icons. The first one looks like this:

><img src="https://upload.wikimedia.org/wikipedia/commons/b/bb/Summary_icon.svg" width="100">

and corresponds to the Table of Contents of this tutorial.

**The last icon, a folder, opens the File Explorer of the virtual machine hosted in Google Cloud and assigned to your session.** This is very important because it allows you to move files, create folders, rename files and folders if needed, and copy a "path" by right-clicking on a file or folder.

### The central concept of Google Colab Notebook: Cells

A notebook is a list of cells. Cells contain either explanatory text or executable code and its output. Click a cell to select it.



**Adding and moving cells**

You can add new cells by using the **+ CODE** and **+ TEXT** buttons that show when you hover between cells. These buttons are also in the toolbar above the notebook, and they can be used to add a cell below the currently selected cell.

You can move a cell by selecting it and clicking **Cell Up** or **Cell Down** in the top toolbar.

Consecutive cells can be selected by "lasso selection", i.e. by dragging from outside one cell and through the group.  Non-adjacent cells can be selected concurrently by clicking one and then holding down **Ctrl** while clicking another.  Similarly, using **Shift** instead of Ctrl will select all cells between two non-adjacent selections.

### Text cells


Colaboratory has two types of cells: text and code. The text cells are formatted using a simple markup language called **markdown**, based on [the original](https://daringfireball.net/projects/markdown/syntax) markdown project.
This is a **text cell**. You can **double-click** to edit this cell. Text cells
use markdown syntax. To learn more, see the [markdown
guide](/notebooks/markdown_guide.ipynb) recommended by Google Colab.

### Markdown ⌨️




To see the markdown source, double-click a text cell: this shows both the markdown source (left) and the rendered version (right). Above the markdown source there is a toolbar to assist editing.



You can also use tags to format your text. The following are examples of markdown text formats. Each word/phrase is shown in the desired format, and the tags around it are those required to achieve each specific format.

**Text Formats:**

\**italics*\* or \__italics__

**\*\*bold\*\***

\~\~~~strikethrough~~\~\~

\``monospace`\`

**Indentations:**

No indent
>\>One level of indentation
>>\>\>Two levels of indentation

**An ordered list:**
1. 1\. One
1. 1\. Two
1. 1\. Three

**An unordered list:**
* \* One
* \* Two
* \* Three



**If you are interested in learning more about markdown in Google Colab, you can read this nice article, which includes a cheat sheet, [here](https://towardsdatascience.com/cheat-sheet-for-google-colab-63853778c093).**

### Math 🧮 & Equations ✏️



You can also add math to text cells using [$\LaTeX$](http://www.latex-project.org/)
to be rendered by [MathJax](https://www.mathjax.org). Just place the statement
within a pair of **`$`** signs. For example `$\sqrt{3x-1}+(1+x)^2$` becomes
$\sqrt{3x-1}+(1+x)^2.$

Also, if you double the **`$`** tags in your $\LaTeX$ equations, you can set the contents off on its own centered line. For example, `$$y = 0.1 x$$` renders the following equation: $$y = 0.1 x$$

### Tables 📍



Tables:
```
First column name | Second column name
--- | ---
Row 1, Col 1 | Row 1, Col 2
Row 2, Col 1 | Row 2, Col 2
```

becomes:

>First column name | Second column name
>--- | ---
>Row 1, Col 1 | Row 1, Col 2
>Row 2, Col 1 | Row 2, Col 2

Horizontal rule done with three dashes (\-\-\-):

---


### Gifs 😱

YES! you can add animated gifs

<img src='https://media.giphy.com/media/3o72F8t9TDi2xVnxOE/giphy.gif'/>


### Code cells


Below is a **code cell**. To execute the contents of a code cell, you first must connect to a hosted runtime by clicking on the **Connect** button located in the toolbar menu.

 <img src='https://media.giphy.com/media/lRLBURv0hpcHqiraBI/giphy.gif'/>

Once the toolbar button changes to **'Connected'**, click in the code cell below to select it and execute the contents in the following ways:




* Click the **Play icon** in the left gutter of the cell;
* Type **Cmd/Ctrl+Enter** to run the cell in place;
* Type **Shift+Enter** to run the cell and move focus to the next cell (adding one if none exists); or
* Type **Alt+Enter** to run the cell and insert a new code cell immediately below it.

There are additional options for running some or all cells in the **Runtime** menu.


In [None]:
W = 'Tryptophan'
C = 'Cysteine'
W,C

## Part II. Files and Folders

### Mount your google drive disk
By running the code in the next cell you will be able to access your Google Drive folders from Colab; they will appear in the side bar.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Now you can **make a folder Structural_Bioinformatics** in your Google Drive by using the menu on the left side of the browser. Inside this folder, make a folder **Task2**.

### Set the working folder

In general, one problem with Colab notebooks is that you do not know where they are run. If you open a notebook from one of your Google Drive folders, then this is the folder where everything happens. But if you open a notebook from the web, then it runs on the Google virtual machine. **To check the folder and change it you can use simple Linux shell commands**:

In [None]:
#check the local active folder:
%pwd

In [None]:
#change the local active folder (paste in the following line the path to your Task2 folder)
%cd "/content/drive/MyDrive/Structural_Bioinformatics/Task2"

In [None]:
#check the local active folder:
%pwd

In [None]:
#show the files in the local folder:
%ls
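
The shell-style commands above also have plain Python counterparts, which can be handy inside longer scripts. The following is a minimal sketch using only the standard library; the variable names `start` and `names` are chosen here for illustration, and the `os.chdir` call deliberately "moves" to the folder we are already in so that nothing changes.

```python
import os
from pathlib import Path

# same information as %pwd
start = Path.cwd()
print("current folder:", start)

# same information as %ls: the names of the entries in the current folder
names = sorted(entry.name for entry in start.iterdir())
print(names)

# os.chdir plays the role of %cd; here the target is the current folder itself
os.chdir(start)
print("still in:", Path.cwd())
```

To move to your own folder, pass its path (for example the one copied from the File Explorer) to `os.chdir` instead of `start`.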

### Simple data extraction from files

In [None]:
#Download a structure from the PDB:
import urllib.request
urllib.request.urlretrieve('http://files.rcsb.org/download/7tjh.pdb', '7tjh.txt')
# here we change the extension of the file from .pdb to .txt so that the computer immediately understands that it is a text file

In [None]:
#check again the files:
%ls

In [None]:
#show the file content here:
%cat 7tjh.txt

**To see a file you can actually double click on it from the menu on the left**



Now let's take some data from the PDB file using simple Linux commands. For example, we would like to count all the different amino acids in the protein:

In [None]:
%%bash
# %%bash must be the first line of the cell
# show the lines of the file including ALA (all ALANINE atoms)
grep ALA 7tjh.txt

In [None]:
%%bash
# now take one single atom per residue, the safest choice is the CA carbon
grep ALA 7tjh.txt | grep CA

In [None]:
%%bash
# let's count them:
grep ALA 7tjh.txt | grep CA | wc -l

In [None]:
# now we want to put this number in an array of size 20
import numpy as np
# we call this array 'aa_count', of size 20 initially set with all 0.
aa_count=np.full(20, 0.)
# we also set an array with amino acids names
aa_names=np.empty(20, dtype=object)
print(aa_names)
print(aa_count)

In [None]:
tmp=!grep ALA 7tjh.txt | grep CA | wc -l
aa_count[0]=tmp[0]
aa_names[0]="ALA"
print(aa_names)
print(aa_count)

In [None]:
#we can iterate this operation on some amino acids:
k=0
for i in 'ALA', 'CYS', 'GLY':
  print(i)
  !grep $i 7tjh.txt | grep CA | wc -l
  tmp=!grep $i 7tjh.txt | grep CA | wc -l
  aa_names[k]=i
  aa_count[k]=tmp[0]
  k+=1

print(aa_names)
print(aa_count)
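
The `grep` pipeline matches the text "ALA" or "CA" anywhere in a line, so it can also pick up lines that are not alpha carbons of the residue of interest (for example a calcium ion, whose atom name is also CA). As a hedged alternative, the sketch below counts residues in pure Python by reading the fixed columns of the PDB format (record type in columns 1-6, atom name in columns 13-16, residue name in columns 18-20). The function name `count_residues` and the five sample records are hypothetical and hand-written for illustration (truncated after the residue number); they are not taken from 7tjh.

```python
from collections import Counter

def count_residues(pdb_lines):
    """Count residues by name, using one CA atom per residue in ATOM records."""
    counts = Counter()
    for line in pdb_lines:
        record = line[0:6].strip()       # columns 1-6: record type
        atom_name = line[12:16].strip()  # columns 13-16: atom name
        res_name = line[17:20].strip()   # columns 18-20: residue name
        if record == "ATOM" and atom_name == "CA":
            counts[res_name] += 1
    return counts

# hypothetical, truncated records: three residues plus one calcium ion
sample = [
    "ATOM      1  N   ALA A   1",
    "ATOM      2  CA  ALA A   1",
    "ATOM      3  CA  GLY A   2",
    "ATOM      4  CA  ALA A   3",
    "HETATM  100 CA    CA A 201",
]
residue_counts = count_residues(sample)
print(residue_counts)
```

On a real file the same function could be called as `count_residues(open('7tjh.txt'))`. Note that this simple version does not treat alternate locations specially, so a residue modelled in two conformations would be counted twice.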

Now add a bar plot to show the amino acid content of your complex. To get the numbers you need to complete the loop below:

In [None]:
import numpy as np
# 'aa_count' is now a 20x2 array of zeros: column 0 holds the bar position, column 1 the count
aa_count=np.full((20,2), 0.)
print(aa_count)
# we also set an array with amino acids names
aa_names=np.empty(20, dtype=object)

In [None]:
k=0
# complete the following line: replace the ... with the remaining amino acid names
for i in 'ALA', 'CYS', 'GLY', ...:
  tmp=!grep $i 7tjh.txt | grep CA | wc -l
  aa_names[k]=i
  aa_count[k]=(k,tmp[0])
  k+=1

print(aa_names)
print(aa_count)

In [None]:
import matplotlib.pyplot as plt
plt.bar(aa_count[:,0],aa_count[:,1], width=0.8, tick_label=aa_names)

And now do it again for DNA bases; in the PDB file these are identified as DA, DT, DC, DG:

In [None]:
!grep DA 7tjh.txt | grep OP1

In [None]:
# we call this array 'dna_count', of size 4 with all 0.
dna_count=np.full((4,2), 0.)
# we also set an array with the DNA base names
dna_names=np.empty(4, dtype=object)

In [None]:
# go on...

## Part III. Examples of Statistical Analysis

### 1D discrete data

In [None]:
# Let's begin by throwing a die
# This means generating a random integer between 1 and 6 drawn from a uniform probability distribution
import random
random.randint(1, 6)
# by re-executing this cell you get a new random number each time (it can occasionally repeat)

In [None]:
# now we can decide how many times to throw it and accumulate the statistics
import numpy as np
# default_rng() returns a Generator based on the PCG64 bit generator
rng = np.random.default_rng()
num_throw=10000
# here the range is from low to high-1
data = rng.integers(low=1, high=7, size=num_throw)
# print the data and the number of data
print(data, len(data))

In [None]:
# we can make a histogram of the data
# it is not easy to get a perfectly flat histogram:
# with a finite number of throws the count in each bin fluctuates around its expected value
import matplotlib.pyplot as plt
counts, bin=np.histogram(data, bins=[1, 2, 3, 4, 5, 6, 7])
print(counts, bin)
plt.hist(bin[:-1], bin, weights=counts, align='left', rwidth=0.9)
plt.title("Histogram of a dice throwing experiment")
plt.show()
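
To see how large the deviations from a flat histogram are expected to be, one can compare each count with its expected value. The sketch below is an illustration under stated assumptions: it uses a fixed seed for reproducibility and its own variable names (`rolls`, `face_counts`, `sigma`), and it treats each bin count as binomial with probability 1/6, so its standard deviation is $\sqrt{N \cdot \frac{1}{6} \cdot \frac{5}{6}}$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption made for reproducibility
num_throw = 10000
rolls = rng.integers(low=1, high=7, size=num_throw)

# number of times each face 1..6 appears (index 0 is dropped because 0 never occurs)
face_counts = np.bincount(rolls, minlength=7)[1:]

# expected count per face and binomial standard deviation of a single count
expected = num_throw / 6
sigma = np.sqrt(num_throw * (1 / 6) * (5 / 6))

for face, n in enumerate(face_counts, start=1):
    print(face, n, f"{(n - expected) / sigma:+.2f} sigma")
```

With 10000 throws the expected count is about 1667 per face with a standard deviation of about 37, so differences of a few tens between bars are normal.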

In [None]:
# if we are estimating a probability then we can also normalize them:
counts, bin=np.histogram(data, bins=[1, 2, 3, 4, 5, 6, 7], density=True)
plt.hist(bin[:-1], bin, weights=counts, align='left', rwidth=0.9)
plt.title("Probability density estimate of a dice throwing experiment")
plt.show()
# the above is equivalent to
bin_size=1
counts, bin=np.histogram(data, bins=[1, 2, 3, 4, 5, 6, 7])
plt.hist(bin[:-1], bin, weights=counts/(len(data)*bin_size), align='left', rwidth=0.9)
plt.title("Probability density estimate of a dice throwing experiment")
plt.show()
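
The normalisation used in the cell above can be written compactly. With $n_i$ the count in bin $i$, $N$ the total number of data points and $\Delta$ the bin width (symbols introduced here for illustration), the density estimate is

$$p_i = \frac{n_i}{N\,\Delta}, \qquad \sum_i p_i\,\Delta = 1$$

For the die $\Delta = 1$, so $p_i$ is simply the fraction of throws that gave face $i$, and for a fair die it should be close to $1/6$.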

In [None]:
#calculate the average
#this is the sum of all the elements divided by the number of elements
average_def = data.sum()/len(data)
#this is a NumPy function that does it for you
average = np.average(data)
print(average_def, average)

In [None]:
#calculate the standard deviation, that is the width of the distribution of the data
#first take the squared difference between each element and the average
data2 = np.power(data-average_def, 2.)
#this is then summed, divided by the number of elements, and put under a square root
deviation_def = np.sqrt((data2.sum()/len(data)))
#this is a NumPy function that does it for you
deviation = np.std(data)
print(deviation_def, deviation)

In [None]:
#calculate the standard error, that is the width of the distribution of the average,
#i.e. the uncertainty of the estimate of the average
from scipy.stats import sem
# this is the standard deviation divided by the square root of (the number of data - 1)
error_again = deviation_def/np.sqrt(len(data)-1)
# this is a SciPy function that does it for you
error = sem(data)
print(error, error_again)
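
For reference, the three quantities computed in the last cells can be summarised as follows, with $x_i$ the data points, $N$ their number, and the symbols $\bar{x}$, $\sigma$ and $\epsilon$ introduced here for the average, the standard deviation and the standard error:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i-\bar{x}\right)^2} \qquad \epsilon = \frac{\sigma}{\sqrt{N-1}}$$

Note that `np.std` divides by $N$ by default, while `sem` divides by $N-1$ inside the square root and then by $\sqrt{N}$; this is why the manual calculation divides $\sigma$ by $\sqrt{N-1}$ to obtain the same number.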

In [None]:
# the error defines the number of meaningful digits
# as a rule of thumb, do not report more than two digits of the error, counting from its first non-zero one
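
One possible way to apply this rule of thumb in code is sketched below. The helper `round_to_error` is hypothetical (it is not part of NumPy or SciPy), and the input numbers are made up for illustration: the error is rounded to two significant digits and the value is rounded to the same decimal place.

```python
import math

def round_to_error(value, error, digits=2):
    """Round value and error so that the error keeps `digits` significant digits."""
    if error <= 0:
        raise ValueError("error must be positive")
    # position of the first non-zero digit of the error (e.g. -2 for 0.017)
    exponent = math.floor(math.log10(error))
    decimals = digits - 1 - exponent
    return round(value, decimals), round(error, decimals)

# made-up numbers, for illustration only
rounded_value, rounded_error = round_to_error(3.48712, 0.01709)
print(rounded_value, "+/-", rounded_error)
```

Here the error 0.01709 becomes 0.017, so the value is reported as 3.487 rather than with all its digits.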

In [None]:
# what often happens in an experiment is that we observe a number that is already an average quantity
# for example we can consider an experiment where multiple copies (10) of the same protein each give
# a fluorescent signal between 1 and 6 ;-)
# this sets how many times we throw a die (how many molecules we have in the sample) to get one average quantity
num_throw=10
# this is the number of replicated experiments
num_rep=1000
# here we take the average of num_throw rolls of a six-faced die and store it as the first element of adata_10_1000
adata_10_1000 = np.array([np.average(rng.integers(low=1, high=7, size=num_throw))])
# this loop runs num_rep-1 times; each time an additional average is calculated and appended to our dataset
for i in range(0,num_rep-1):
  adata_10_1000 = np.append(adata_10_1000, np.array([np.average(rng.integers(low=1, high=7, size=num_throw))]), axis=0)

In [None]:
# this is the outcome of our num_rep experiments
print(len(adata_10_1000))
ave_adata = np.average(adata_10_1000)
std_adata = np.std(adata_10_1000)
print(ave_adata, std_adata)

In [None]:
# what is the probability distribution of these average quantities?
# in this case the numbers are not integers and are the result of
# many random processes (multiple dice throws)
# this puts the data into a histogram using automatic binning
counts_10_1000, bin_10_1000=np.histogram(adata_10_1000, bins='auto')
# the counts and the bin edges describe the actual result of the histogram
print(counts_10_1000, bin_10_1000)
# then we can plot the histogram
plt.hist(bin_10_1000[:-1], bin_10_1000, weights=counts_10_1000, align='mid', rwidth=0.9)
plt.title("Histogram of the averages over multiple dice throwing experiments")
plt.show()
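
The width of this histogram can be compared with what is expected for an average of independent throws: the standard deviation of a single fair die is $\sqrt{35/12} \approx 1.71$, and the average of $n$ throws has a standard deviation $\sqrt{n}$ times smaller. The sketch below checks this numerically; it generates all the throws at once as a 2D array instead of using a loop, and it uses a fixed seed and its own variable names (`rolls`, `averages`), which are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed: an assumption made for reproducibility
num_throw, num_rep = 10, 1000

# one row per replicated experiment, one column per throw
rolls = rng.integers(low=1, high=7, size=(num_rep, num_throw))
averages = rolls.mean(axis=1)

# standard deviation of one fair die and of the average of num_throw dice
sigma_die = np.sqrt(35 / 12)
expected_std = sigma_die / np.sqrt(num_throw)

print("mean of the averages:", averages.mean(), "(expected 3.5)")
print("std of the averages: ", averages.std(), "(expected", expected_std, ")")
```

Increasing `num_throw` makes the distribution of the averages narrower and closer to a Gaussian.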

In [None]:
#

### 1D continuous data

As an example of 1D continuous data we can use data generated from a one-dimensional Gaussian distribution. This is characterised by its average value and its standard deviation.
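
For reference, the probability density of a Gaussian with average $\mu$ and standard deviation $\sigma$ (symbols introduced here) is

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

About 68% of the values drawn from it fall within one standard deviation of the average.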

In [None]:
import numpy as np
# default_rng() returns a Generator based on the PCG64 bit generator
rng = np.random.default_rng()
num_throw=1000
# Gaussian centred at 70 with a standard deviation of 5, generate num_throw data points
data_g = rng.normal(70, 5, size=num_throw)
print(len(data_g))

In [None]:
import matplotlib.pyplot as plt
counts, bin=np.histogram(data_g, bins='auto', density=True)
plt.hist(bin[:-1], bin, weights=counts, align='mid', rwidth=0.9)
plt.title("Probability density estimate of a Gaussian dataset")
plt.show()

In [None]:
from scipy.stats import sem
aver=np.average(data_g)
dev=np.std(data_g)
error=sem(data_g)
print(aver, dev, error)

You should see that, even with fewer data points (1000 instead of 10000), the properties of the Gaussian are estimated more easily. This is because most of the extracted values are close to the average value.

In [None]:
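#empirical cumulative distribution function
#estimates the probability of observing a value between -infinity and x
X2 = np.sort(data_g)
#fraction of data points smaller than or equal to each sorted value
F2 = np.arange(1, len(data_g)+1)/float(len(data_g))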
plt.plot(X2, F2)
plt.show()
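
The empirical curve can be compared with the exact cumulative distribution of a Gaussian, which can be written with the error function as $F(x) = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right]$. The sketch below is a self-contained illustration: it regenerates a sample with a fixed seed rather than reusing `data_g`, and the names `sample`, `ecdf`, `cdf` and `max_gap` are chosen here.

```python
import math
import numpy as np

rng = np.random.default_rng(1)  # fixed seed: an assumption made for reproducibility
mu, sigma = 70.0, 5.0
sample = np.sort(rng.normal(mu, sigma, size=1000))

# empirical CDF: fraction of points smaller than or equal to each sorted value
ecdf = np.arange(1, len(sample) + 1) / len(sample)

# exact Gaussian CDF evaluated at the same points
cdf = np.array([0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) for x in sample])

# largest vertical distance between the two curves at the sampled points
max_gap = np.max(np.abs(ecdf - cdf))
print("largest difference:", max_gap)
```

The largest difference shrinks as the number of data points grows, which is a simple check that the sample follows the intended distribution.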

### 2D continuous data

Again we can use a Gaussian, this time in 2D.

In [None]:
# this is the average value over X and Y
mean = [-2, 10]
# this is the covariance matrix: the diagonal holds the variances (squared standard deviations) of X and Y, the off-diagonal terms their covariance
cov = [[1, 3], [3, 20]]
num_throw=10000
x, y = np.random.default_rng().multivariate_normal(mean, cov, num_throw).T
print(y.shape)

In [None]:
plt.plot(x, y, 'o')
plt.axis('equal')
plt.show()

In [None]:
# Big bins
plt.hist2d(x, y, bins=(25, 25), cmap=plt.cm.jet)
plt.axis('equal')
plt.show()

In [None]:
from scipy.stats import sem
aver_x=np.average(x)
dev_x=np.std(x)
error_x=sem(x)
print(aver_x, dev_x, error_x)
aver_y=np.average(y)
dev_y=np.std(y)
error_y=sem(y)
print(aver_y, dev_y, error_y)


In [None]:
# covariance matrix (x with x, x with y, y with x, y with y)
print(np.cov(x,y))
# correlation matrix (x with x, x with y, y with x, y with y)
print(np.corrcoef(x,y))
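
The two matrices are directly related: each correlation coefficient is the covariance divided by the product of the two standard deviations. As a small sketch, the code below applies this to the covariance matrix used to generate the data (the variable names `std` and `corr` are chosen here), which gives the value that `np.corrcoef(x, y)` should approach for large samples.

```python
import numpy as np

# covariance matrix used above to generate the 2D Gaussian data
cov = np.array([[1.0, 3.0], [3.0, 20.0]])

# standard deviations are the square roots of the diagonal elements
std = np.sqrt(np.diag(cov))

# divide each covariance by the product of the corresponding standard deviations
corr = cov / np.outer(std, std)
print(corr)
```

The off-diagonal element is $3/\sqrt{1 \cdot 20} \approx 0.67$, so X and Y are positively but not perfectly correlated.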