In [108]:
import time
start_time = time.time()

In [109]:
import numpy as np

In [110]:
# Gene expression matrix: rows are cells, columns are genes
G = np.array([[1,2,3,3],[3,1,9,4],[1,4,3,5]])
G

array([[1, 2, 3, 3],
       [3, 1, 9, 4],
       [1, 4, 3, 5]])

# Part a
Cell 3 has the highest expression.

# Part b
Gene 3 is the most highly expressed gene in cell 2.

# Part c
In reduced row echelon form, we have 
```python
[[1 0 3 1],
 [0 1 0 1],
 [0 0 0 0]]
```
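This gives 2 non-zero rows, so G has rank 2.

Because the rank (2) is smaller than the number of cells (3), the expression profiles of the cells are linearly dependent: one cell's profile is a linear combination of the other two. Equivalently, only two genes vary independently. The reduced row echelon form shows that gene 3 = 3 × gene 1 and gene 4 = gene 1 + gene 2.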
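As a quick cross-check of the hand-computed row reduction, the rank can also be computed numerically. This is a minimal sketch using `np.linalg.matrix_rank`; the two column relations tested below are the ones read off the reduced row echelon form above.

```python
import numpy as np

G = np.array([[1, 2, 3, 3], [3, 1, 9, 4], [1, 4, 3, 5]])

# Numerical rank (computed from the singular values of G)
rank = np.linalg.matrix_rank(G)
print(rank)

# Column relations implied by the reduced row echelon form
gene3_is_3x_gene1 = np.array_equal(G[:, 2], 3 * G[:, 0])
gene4_is_gene1_plus_gene2 = np.array_equal(G[:, 3], G[:, 0] + G[:, 1])
print(gene3_is_3x_gene1, gene4_is_gene1_plus_gene2)
```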

# Part d
## i.
Setting v to be a column vector of length n (the number of rows, i.e. cells) in which every element is 1/n
makes v^T * G a row vector containing the mean expression of each gene.

In [111]:
v = np.array([1/3, 1/3, 1/3])
np.dot(v.T, G)

array([1.66666667, 2.33333333, 5.        , 4.        ])

## ii.
The mean expression levels are [5/3, 7/3, 5, 4] for the genes.
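The averaging-vector product can be compared against NumPy's built-in column mean. The sketch below builds v from the shape of G rather than hard-coding 1/3; the variable names are illustrative only.

```python
import numpy as np

G = np.array([[1, 2, 3, 3], [3, 1, 9, 4], [1, 4, 3, 5]])

n = G.shape[0]                 # number of cells (rows)
v = np.full(n, 1 / n)          # averaging vector, every entry 1/n

means_by_product = v @ G       # v^T G
means_builtin = G.mean(axis=0) # mean over cells, one value per gene

print(np.allclose(means_by_product, means_builtin))
```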


## iii.
Matrix P that accomplishes this is:
```python
[[0, 0, 0, 0], 
 [0, 0, 0, 0], 
 [0, 0, 1, 0], 
 [0, 0, 0, 1]]
```


In [112]:
P = np.array([[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])
np.dot(G, P)

array([[0, 0, 3, 3],
       [0, 0, 9, 4],
       [0, 0, 3, 5]])
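Judging from the output above, P keeps genes 3 and 4 and zeroes out genes 1 and 2. Under that assumption, a matrix of this kind can be built from a list of columns to keep. The `keep` list below is a hypothetical helper, not part of the original answer.

```python
import numpy as np

G = np.array([[1, 2, 3, 3], [3, 1, 9, 4], [1, 4, 3, 5]])

keep = [2, 3]                       # zero-based indices of genes 3 and 4
m = G.shape[1]                      # number of genes (columns)

P = np.zeros((m, m), dtype=int)
P[keep, keep] = 1                   # ones on the diagonal for kept genes only

masked = G @ P                      # right-multiplication acts on columns
print(masked)
```

Right-multiplying by a diagonal 0/1 matrix selects columns; left-multiplying by one would select rows (cells) instead.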

# Part e
## i.

In [113]:
nrows, _ = G.shape

# Pairwise L1 (Manhattan) distance between cells (rows of G)
l1dist = np.zeros((nrows, nrows))
for i in range(nrows):
    for j in range(nrows):
        l1dist[i, j] = np.sum(np.abs(G[i] - G[j]))

# Pairwise L2 (Euclidean) distance between cells
l2dist = np.zeros((nrows, nrows))
for i in range(nrows):
    for j in range(nrows):
        l2dist[i, j] = np.sqrt(np.sum((G[i] - G[j])**2))

# Pairwise cosine similarity between cells
cossim = np.zeros((nrows, nrows))
for i in range(nrows):
    for j in range(nrows):
        cossim[i, j] = np.dot(G[i], G[j]) / (np.linalg.norm(G[i]) * np.linalg.norm(G[j]))



In [114]:
l1dist

array([[ 0., 10.,  4.],
       [10.,  0., 12.],
       [ 4., 12.,  0.]])

In [115]:
l2dist

array([[0.        , 6.4807407 , 2.82842712],
       [6.4807407 , 0.        , 7.07106781],
       [2.82842712, 7.07106781, 0.        ]])

In [116]:
cossim

array([[1.        , 0.88694537, 0.96352932],
       [0.88694537, 1.        , 0.730999  ],
       [0.96352932, 0.730999  , 1.        ]])

## ii.
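For all three measures, cells 1 and 3 are the most similar: they have the smallest L1 distance (4), the smallest L2 distance (about 2.83), and the largest cosine similarity (about 0.96).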
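The three pairwise matrices can also be computed without explicit loops. The following is a sketch using NumPy broadcasting; it is an alternative formulation that should give the same values, not a replacement for the loops above.

```python
import numpy as np

G = np.array([[1, 2, 3, 3], [3, 1, 9, 4], [1, 4, 3, 5]])

# diff[i, j, :] = G[i] - G[j], shape (cells, cells, genes)
diff = G[:, None, :] - G[None, :, :]

l1 = np.abs(diff).sum(axis=-1)            # pairwise L1 distances
l2 = np.sqrt((diff ** 2).sum(axis=-1))    # pairwise L2 distances

norms = np.linalg.norm(G, axis=1)         # Euclidean norm of each cell
cos = (G @ G.T) / np.outer(norms, norms)  # pairwise cosine similarities

print(l1)
print(l2)
print(cos)
```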


In [117]:
# Simulate contamination by adding a constant c = 1 to every expression value
G_contam = G + 1

l1dist_contam = np.zeros((nrows, nrows))
for i in range(nrows):
    for j in range(nrows):
        l1dist_contam[i, j] = sum(abs(G_contam[i] - G_contam[j]))

l2dist_contam = np.zeros((nrows, nrows))
for i in range(nrows):
    for j in range(nrows):
        l2dist_contam[i, j] = np.sqrt(sum((G_contam[i] - G_contam[j])**2))

cossim_contam = np.zeros((nrows, nrows))
for i in range(nrows):
    for j in range(nrows):
        cossim_contam[i, j] = sum(G_contam[i] * G_contam[j])/ (np.sqrt(sum(G_contam[i]**2)) * np.sqrt(sum(G_contam[j]**2)))


In [118]:
l1dist_contam

array([[ 0., 10.,  4.],
       [10.,  0., 12.],
       [ 4., 12.,  0.]])

In [119]:
l2dist_contam

array([[0.        , 6.4807407 , 2.82842712],
       [6.4807407 , 0.        , 7.07106781],
       [2.82842712, 7.07106781, 0.        ]])

In [121]:
cossim_contam

array([[1.        , 0.916097  , 0.97724452],
       [0.916097  , 1.        , 0.81200025],
       [0.97724452, 0.81200025, 1.        ]])

Empirically, the only measure that changes is the cosine similarity.

Both the L1 and L2 distances depend only on the differences $x_i - y_i$. If both cells are contaminated by the same constant c, each difference becomes $(x_i + c) - (y_i + c) = x_i - y_i$, so the distances are unchanged. Cosine similarity instead multiplies the entries, so c does not cancel. Each term in the numerator becomes $(x_i+c)(y_i+c) = x_iy_i + cx_i + cy_i + c^2$, and the denominator contains $(x_i + c)^2 = x_i^2 + 2cx_i + c^2$ and $(y_i + c)^2 = y_i^2 + 2cy_i + c^2$ terms. The extra terms in c change the numerator and denominator by different amounts, so the similarity score changes.
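To see how this plays out for other contamination levels, here is a small sketch that repeats the comparison for cells 1 and 2 with several values of c. The list of c values is arbitrary and chosen only for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 3], dtype=float)  # cell 1
y = np.array([3, 1, 9, 4], dtype=float)  # cell 2

def cosine(u, w):
    """Cosine similarity between two 1-D vectors."""
    return u @ w / (np.linalg.norm(u) * np.linalg.norm(w))

for c in [0, 1, 10, 1000]:               # hypothetical contamination levels
    l1 = np.abs((x + c) - (y + c)).sum()
    l2 = np.linalg.norm((x + c) - (y + c))
    print(c, l1, round(l2, 4), round(cosine(x + c, y + c), 4))
```

The L1 and L2 columns stay fixed, while the cosine similarity moves toward 1 for large c, because both shifted vectors become dominated by the same constant vector.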

# Part f
For L1 distance, we have $\sum_{i=1}^n|x_i-y_i|$. If we scale x by a and y by b, we instead have $\sum_{i=1}^n|ax_i-by_i|$. The factors a and b stay inside the absolute value and do not cancel, so the L1 distance changes with the scaling.

For L2 distance, we have $\sqrt{\sum_{i=1}^n(x_i-y_i)^2}$. After scaling, we have $\sqrt{\sum_{i=1}^n(ax_i-by_i)^2}$, or $\sqrt{\sum_{i=1}^n(a^2x_i^2 - 2abx_iy_i + b^2y_i^2)}$. Again the a and b terms do not cancel, so L2 distance does not meet the requirement either.

For cossim, the scaled version is $\frac{\sum_{i=1}^nax_iby_i}{\sqrt{\sum_{i=1}^na^2x_i^2}\sqrt{\sum_{i=1}^nb^2y_i^2}}$. Assuming a, b > 0, we can factor the constants out to get $\frac{ab\sum_{i=1}^nx_iy_i}{ab\sqrt{\sum_{i=1}^nx_i^2}\sqrt{\sum_{i=1}^ny_i^2}}$. The factor ab cancels, giving the original expression $\frac{\sum_{i=1}^nx_iy_i}{\sqrt{\sum_{i=1}^nx_i^2}\sqrt{\sum_{i=1}^ny_i^2}}$. Cosine similarity is therefore the only one of the three measures that is unaffected by scaling.
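The scaling argument can be checked numerically in the same way as the contamination case. This is a minimal sketch for cells 1 and 2; the scale factors `a = 2` and `b = 3` are arbitrary positive values picked for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 3], dtype=float)  # cell 1
y = np.array([3, 1, 9, 4], dtype=float)  # cell 2
a, b = 2.0, 3.0                          # hypothetical positive scale factors

def cosine(u, w):
    """Cosine similarity between two 1-D vectors."""
    return u @ w / (np.linalg.norm(u) * np.linalg.norm(w))

l1_orig, l1_scaled = np.abs(x - y).sum(), np.abs(a * x - b * y).sum()
l2_orig, l2_scaled = np.linalg.norm(x - y), np.linalg.norm(a * x - b * y)
cos_orig, cos_scaled = cosine(x, y), cosine(a * x, b * y)

print("L1: ", l1_orig, l1_scaled)
print("L2: ", l2_orig, l2_scaled)
print("cos:", cos_orig, cos_scaled)
```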

In [120]:
l2dist
# Running time of the notebook
print("{:.2f} minutes".format((time.time()-start_time)/60))

0.03 minutes
