# Leafletfinder to Edge List

Use the MDAnalysis LeafletFinder code to create a graph based on Oliver's MD trajectory.

Info from Mail:

We uploaded a test system for you via our file share service and you should have received a barrage of emails from the service (apologies!).

It is a system with 1.7 M particles and almost 150,000 lipids (i.e. 150,000 nodes in the network). There's also a mini python script that shows how to run it. Please note that we just fixed a bug in the MDAnalysis leaflet finder code that crept in one release ago. It is fixed in the 0.12.1 release that will be out tomorrow (or get the develop branch from github).

We tried running the basic version of leaflet finder on it but got a MemoryError; apparently, it tries to allocate 2 TiB of RAM. The "sparse" option works:

        L = LeafletFinder(u, "name P*", sparse=True, pbc=True)

but took over 4 min for a single frame (and a bit over 1 min with pbc=False), so this is too slow (and pbc=True is typically necessary).

The trajectory itself only contains 15 frames (1/1000th of the original one) but our file sharing service does not like 80 GiB files...

The trajectory is not the nicest example yet in terms of the kind of fusion between vesicles that we want to observe but it should give you something to play with.
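The mail notes that `pbc=True` is typically necessary. With periodic boundary conditions, two particles near opposite faces of the simulation box can be neighbours through the boundary, so a plain Euclidean distance overestimates their separation. The following is a minimal NumPy sketch of the minimum-image distance for an orthorhombic box. The function name `minimum_image_distances` and the toy box are illustrative assumptions; this is not the MDAnalysis implementation.

```python
import numpy as np

def minimum_image_distances(a, b, box):
    """Pairwise distances between a (n, 3) and b (m, 3) in an orthorhombic box.

    `box` holds the three box edge lengths. Hypothetical helper for illustration.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    box = np.asarray(box, dtype=float)
    delta = a[:, None, :] - b[None, :, :]
    # Shift each component by whole box lengths so it lies within half a box length
    delta -= box * np.round(delta / box)
    return np.sqrt((delta ** 2).sum(axis=-1))

box = [100.0, 100.0, 100.0]
p = np.array([[1.0, 50.0, 50.0]])
q = np.array([[99.0, 50.0, 50.0]])
d_pbc = minimum_image_distances(p, q, box)[0, 0]
d_plain = np.linalg.norm(p[0] - q[0])
print("with PBC: %.1f, without: %.1f" % (d_pbc, d_plain))
```

The dense `(n, m, 3)` intermediate array makes this sketch suitable only for small inputs; it illustrates the convention, not a scalable method.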

**Data:**

In [1]:
import numpy as np
import time, os, sys
#np.__config__.show() 

The first step in any MDAnalysis script is to load a topology file (which contains a list of particles and possibly additional static data such as bonds or partial charges) and a trajectory file. The trajectory contains the coordinates, which change for each time step. In MDAnalysis, the `Universe` object ties the topology and the trajectory together; parsing these files is part of instantiating `Universe(topology, trajectory)`.

Source: <http://dx.doi.org/10.6084/m9.figshare.1588804>

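File Types:
* `.xtc` compressed trajectory file from Gromacs
* `.tpr` Gromacs run input file, used here as the topology
* `.gro` Gromacs coordinate files (here apparently the first and last frame, judging by the file names)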

In [3]:
!ls -lh /data/leafletfinder/large/vesicle_1_5M_373*

-rw-r--r-- 1 iparask iparask 71M Nov  4 20:22 /data/leafletfinder/large/vesicle_1_5M_373_first.gro
-rw-r--r-- 1 iparask iparask 71M Nov  4 20:20 /data/leafletfinder/large/vesicle_1_5M_373_last.gro
-rw-r--r-- 1 iparask iparask 90M Nov  4 20:21 /data/leafletfinder/large/vesicle_1_5M_373_stride1000.xtc
-rw-r--r-- 1 iparask iparask 41M Nov  4 20:22 /data/leafletfinder/large/vesicle_1_5M_373.tpr


In [25]:
import MDAnalysis, time
topology = "/data/leafletfinder/large/vesicle_1_5M_373.tpr"
trajectory = "/data/leafletfinder/large/vesicle_1_5M_373_stride1000.xtc"

start = time.time()
u = MDAnalysis.Universe(topology, trajectory)
print("Loading Time: %.2f" % (time.time() - start))

Loading Time: 14.98


In [26]:
start = time.time()
selection = u.select_atoms("name P*")
print("Selection Time: %.2f" % (time.time() - start))

Selection Time: 1.69


In [22]:
u

<Universe with 1748952 atoms and 1603206 bonds>

In [29]:
count = 0
# Iterate over all frames; `ts` is the current time step
for ts in u.trajectory:
    print("Frame: %5d, Time: %8.3f ps" % (ts.frame, u.trajectory.time))
    print("Rgyr: %g A" % (u.atoms.radius_of_gyration(), ))
    count += 1
print("Number of frames: %d" % count)

Frame:     0, Time:    0.000 ps
Rgyr: 652.801 A
Frame:     1, Time: 50000.000 ps
Rgyr: 650.131 A
Frame:     2, Time: 100000.000 ps
Rgyr: 637.096 A
Frame:     3, Time: 150000.000 ps
Rgyr: 627.282 A
Frame:     4, Time: 200000.000 ps
Rgyr: 618.614 A
Frame:     5, Time: 250000.000 ps
Rgyr: 609.713 A
Frame:     6, Time: 300000.000 ps
Rgyr: 599.82 A
Frame:     7, Time: 350000.000 ps
Rgyr: 588.657 A
Frame:     8, Time: 400000.000 ps
Rgyr: 578.532 A
Frame:     9, Time: 450000.000 ps
Rgyr: 564.654 A
Frame:    10, Time: 500000.000 ps
Rgyr: 550.324 A
Frame:    11, Time: 550000.000 ps
Rgyr: 533.978 A
Frame:    12, Time: 600000.000 ps
Rgyr: 516.298 A
Frame:    13, Time: 650000.000 ps
Rgyr: 499.393 A
Frame:    14, Time: 700000.000 ps
Rgyr: 483.763 A
Number of frames: 15
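The radius of gyration printed per frame drops from about 653 Å to about 484 Å over the 15 frames, so the system becomes more compact. For reference, the quantity is the mass-weighted root-mean-square distance of the atoms from their centre of mass. A minimal NumPy sketch under that definition follows; `radius_of_gyration` here is a stand-alone illustrative function with made-up toy coordinates, not the MDAnalysis method itself.

```python
import numpy as np

def radius_of_gyration(positions, masses):
    """Mass-weighted RMS distance from the centre of mass (illustrative helper)."""
    positions = np.asarray(positions, dtype=float)
    masses = np.asarray(masses, dtype=float)
    # Centre of mass: mass-weighted mean position
    com = np.average(positions, axis=0, weights=masses)
    # Squared distance of every atom from the centre of mass
    sq_dist = ((positions - com) ** 2).sum(axis=1)
    return np.sqrt(np.average(sq_dist, weights=masses))

# Toy system: two equal masses, one unit either side of the origin
toy_positions = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
toy_masses = np.array([1.0, 1.0])
rgyr = radius_of_gyration(toy_positions, toy_masses)
print("Rgyr: %g A" % rgyr)
```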


In [10]:
selection

<AtomGroup with 145746 atoms>

## Write filtered coordinates to Disk for later use

In [47]:
import os
import numpy as np
import MDAnalysis

# BIG
topology1 = "/data/leafletfinder/large/vesicle_1_5M_373.tpr"
trajectory1 = "/data/leafletfinder/large/vesicle_1_5M_373_stride1000.xtc"

# SMALL
trajectory2 = "/home/luckow/notebooks/Pilot-Memory/data/mdanalysis/md_prod_12x12_everymicroS_pbcmolcenter.xtc"
topology2 = "/home/luckow/notebooks/Pilot-Memory/data/mdanalysis/md_prod_12x12_lastframe.pdb"


trajectory3 = "/home/luckow/notebooks/Pilot-Memory/data/mdanalysis/md_centered.xtc"
topology3 = "/home/luckow/notebooks/Pilot-Memory/data/mdanalysis/md.pdb"

# MEDIUM
trajectory4 = "/home/luckow/notebooks/Pilot-Memory/data/mdanalysis/63342lip_576TMprotein_nowat_10us_1us_timestep_fixed.xtc"
topology4 = "/home/luckow/notebooks/Pilot-Memory/data/mdanalysis/63342lip_576TMprotein_nowat_start.pdb"

# All available systems
topologies = [topology1, topology2, topology3, topology4]
trajectories = [trajectory1, trajectory2, trajectory3, trajectory4]

# Restrict this run to a single system (overrides the lists above)
topologies = [topology3]
trajectories = [trajectory3]

for topology, trajectory in zip(topologies, trajectories):
    u = MDAnalysis.Universe(topology, trajectory)
    selection = u.select_atoms("name P*")
    # Coordinates of the selected atoms in the current (first) frame
    coord = selection.positions
    num_atoms = len(coord)
    print("Topology: %s, Traj: %s, NumAtoms: %d" % (topology, trajectory, num_atoms))
    np.savetxt(os.path.basename(trajectory) + "_" + str(num_atoms) + "Atoms.np_txt", coord)

Topology: /home/luckow/notebooks/Pilot-Memory/data/mdanalysis/md.pdb, Traj: /home/luckow/notebooks/Pilot-Memory/data/mdanalysis/md_centered.xtc, NumAtoms: 95


## Benchmark of pairwise distance computation

In [11]:
coord

array([[  458.09997559,   510.39996338,    59.09999847],
       [  453.69998169,   525.39996338,    53.5       ],
       [  448.5       ,   524.39996338,    49.5       ],
       ..., 
       [ 1803.69995117,   503.79998779,   142.3999939 ],
       [ 1816.90002441,   499.69998169,   147.29998779],
       [ 1814.5       ,   508.        ,   142.1000061 ]], dtype=float32)

In [3]:
coord=np.loadtxt("vesicle_1_5M_373_P_145746.np_txt")

### MDAnalysis Implementation

Dense

In [2]:
files=!ls *.np_txt

In [7]:
files

['63342lip_576TMprotein_nowat_10us_1us_timestep_fixed.xtc_50652Atoms.np_txt',
 'md_centered.xtc_95Atoms.np_txt',
 'md_prod_12x12_everymicroS_pbcmolcenter.xtc_44784Atoms.np_txt',
 'vesicle_1_5M_373_stride1000.xtc_145746Atoms.np_txt']

In [8]:
from MDAnalysis.core.distances import distance_array, self_distance_array
from MDAnalysis.analysis.distances import contact_matrix
import numpy as np
import time, os, sys, gc

files=["md_prod_12x12_everymicroS_pbcmolcenter.xtc_44784Atoms.np_txt"]

results=[]
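for f in files:
    print(f)
    coord = np.loadtxt(f, dtype='float32')
    start = time.time()
    # Dense all-pairs distance matrix; the result is discarded, only the time matters
    distance_array(coord, coord, box=None)
    print("ComputeDistanceMDAnalysisDense, %d, %.2f" % (len(coord), (time.time() - start)))
    # Release the coordinates before loading the next file
    del coord
    gc.collect()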

md_prod_12x12_everymicroS_pbcmolcenter.xtc_44784Atoms.np_txt
ComputeDistanceMDAnalysis, 44784, 11.98




Sparse

In [1]:
import numpy as np
coord = np.loadtxt("vesicle_1_5M_373_stride1000.xtc_145746Atoms.np_txt", dtype='float32')

In [6]:
start = time.time()
distances_mdasparse=contact_matrix(coord, returntype="sparse")
print("ComputeDistanceMDAnalysisSparse, %.2f" % (time.time() - start))

NameError: name 'contact_matrix' is not defined
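The `NameError` presumably means that `contact_matrix` had not been imported in this kernel session, so no sparse timing was recorded here. Independently of MDAnalysis, the idea of a sparse contact matrix can be sketched with a k-d tree from SciPy: only pairs closer than a cutoff are stored, so memory grows with the number of contacts rather than with N². The helper name `sparse_contacts`, the 15 Å cutoff, and the random toy coordinates are assumptions for illustration; this is a stand-in technique, not the MDAnalysis `contact_matrix` implementation.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix

def sparse_contacts(coords, cutoff=15.0):
    """Symmetric boolean CSR matrix of pairs within `cutoff` (no PBC, no diagonal)."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    tree = cKDTree(coords)
    # Array of shape (k, 2) holding index pairs (i, j) with i < j
    pairs = tree.query_pairs(r=cutoff, output_type="ndarray")
    # Store both (i, j) and (j, i) so the matrix is symmetric
    rows = np.concatenate([pairs[:, 0], pairs[:, 1]])
    cols = np.concatenate([pairs[:, 1], pairs[:, 0]])
    data = np.ones(len(rows), dtype=bool)
    return coo_matrix((data, (rows, cols)), shape=(n, n)).tocsr()

# Toy data: 500 random points in a 100 x 100 x 100 box
rng = np.random.RandomState(0)
toy = rng.uniform(0.0, 100.0, size=(500, 3))
contacts = sparse_contacts(toy, cutoff=15.0)
print("stored contacts: %d of %d possible entries" % (contacts.nnz, 500 * 500))
```

This sketch ignores periodic boundary conditions. `cKDTree` accepts a `boxsize` argument for periodic boxes, but it expects coordinates wrapped into the box, which is not checked here.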

### SciPy and Scikit-Learn Methods

In [7]:
import scipy.spatial.distance
import sklearn.metrics
start = time.time()
distances=scipy.spatial.distance.cdist(coord, coord)
distances.shape
print "ComputeDistanceScikit, %.2f"%(time.time()-start)

MemoryError: 
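The `MemoryError` is expected for a dense matrix of this size: an N × N array needs N² entries, and `cdist` returns `float64` values. A back-of-the-envelope estimate for the 145,746 selected atoms, as a small sketch (the helper name `dense_matrix_gib` is hypothetical):

```python
import numpy as np

def dense_matrix_gib(n_points, dtype=np.float64):
    """Memory in GiB needed for an n_points x n_points matrix of `dtype`."""
    return n_points ** 2 * np.dtype(dtype).itemsize / 2.0 ** 30

n = 145746  # number of atoms selected by "name P*" in the large system
gib64 = dense_matrix_gib(n, np.float64)
gib32 = dense_matrix_gib(n, np.float32)
print("float64: %.1f GiB" % gib64)
print("float32: %.1f GiB" % gib32)
```

This comes to roughly 158 GiB in double precision and half that in single precision, before counting any temporary arrays. It does not by itself reproduce the 2 TiB figure quoted in the mail.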

In [14]:
distances

array([[    0.        ,    77.27516937,    52.5474205 , ...,
         1782.68078613,  1784.84057617,  1786.69934082],
       [   77.27516937,     0.        ,    68.39854431, ...,
         1742.26013184,  1744.61645508,  1746.34912109],
       [   52.5474205 ,    68.39854431,     0.        , ...,
         1742.17370605,  1744.35229492,  1746.32629395],
       ..., 
       [ 1782.68078613,  1742.26013184,  1742.17370605, ...,
            0.        ,     5.72276163,     8.17006779],
       [ 1784.84057617,  1744.61645508,  1744.35229492, ...,
            5.72276163,     0.        ,     7.71362448],
       [ 1786.69921875,  1746.34899902,  1746.32629395, ...,
            8.17006779,     7.71362448,     0.        ]], dtype=float32)

In [4]:
import sklearn.metrics
start = time.time()
# Dense all-pairs Euclidean distances with scikit-learn
distances = sklearn.metrics.pairwise.euclidean_distances(coord, coord)
print("ComputeDistanceSklearnMetrics, %.2f" % (time.time() - start))

ComputeDistanceSklearnMetrics, 95.33


## Synthetic Data Generation

In [39]:
coord

array([[  458.09997559,   510.39996338,    59.09999847],
       [  453.69998169,   525.39996338,    53.5       ],
       [  448.5       ,   524.39996338,    49.5       ],
       ..., 
       [ 1803.69995117,   503.79998779,   142.3999939 ],
       [ 1816.90002441,   499.69998169,   147.29998779],
       [ 1814.5       ,   508.        ,   142.1000061 ]], dtype=float32)

In [40]:
mean=np.mean(coord, axis=0)
stddev=np.std(coord, axis=0)
print("Mean: %s Stddev: %s" % (str(mean), str(stddev)))

Mean: [ 1134.39465332   424.06729126   418.1980896 ] Stddev: [ 586.35089111  202.54875183  204.76069641]


In [3]:
mean=[1134.39465332, 424.06729126, 418.1980896]
stddev=[586.35089111, 202.54875183,  204.76069641]

In [10]:
!ls ../data/mdanalysis/synthetic/traj/

In [11]:
OUTPUT_DIR = "../data/mdanalysis/synthetic/traj/"
for i in [10000, 20000, 40000, 80000, 160000]:
    # Draw each coordinate independently from a normal distribution that
    # matches the per-axis mean and standard deviation of the real data
    x = np.random.normal(mean[0], stddev[0], i)
    y = np.random.normal(mean[1], stddev[1], i)
    z = np.random.normal(mean[2], stddev[2], i)
    # Stack the three columns into an (i, 3) coordinate array
    synthetic = np.column_stack((x, y, z))
    np.savetxt("%s%d.np_txt" % (OUTPUT_DIR, i), synthetic)

In [5]:
!ls -lth {OUTPUT_DIR}

total 456200
-rw-r--r--  1 luckow  staff    73K Dec 20 21:54 1000.np_txt
-rw-r--r--  1 luckow  staff   733K Dec 20 21:54 10000.np_txt
-rw-r--r--  1 luckow  staff   115M Dec 19 20:34 1600000.np_txt
-rw-r--r--  1 luckow  staff    14M Dec 19 20:34 200000.np_txt
-rw-r--r--  1 luckow  staff    29M Dec 19 20:34 400000.np_txt
-rw-r--r--  1 luckow  staff    57M Dec 19 20:34 800000.np_txt
-rw-r--r--  1 luckow  staff   750B Dec 19 20:34 10.np_txt
-rw-r--r--  1 luckow  staff   7.3K Dec 19 20:34 100.np_txt
-rw-r--r--  1 luckow  staff   7.2M Dec 19 20:34 100000.np_txt


In [44]:
mean=np.mean(synthetic, axis=0)
stddev=np.std(synthetic, axis=0)
print("Mean: %s Stddev: %s" % (str(mean), str(stddev)))

Mean: [ 1380.07811707   397.62971063   384.52701735] Stddev: [ 663.1345921   135.62573427  233.03352284]


## Leaflet Finder

In [8]:
import MDAnalysis.analysis.leaflet
start=time.time()
L = MDAnalysis.analysis.leaflet.LeafletFinder(u, "name P*", pbc=False, sparse=True)
print("Create Graph Time: %.2f" % (time.time() - start))

Create Graph Time: 73.10


In [9]:
import networkx as NX
start = time.time()
graph = L.graph
# Count the connected components of the neighbour graph
count = sum(1 for _ in NX.connected_components(graph))
print(count)
print("CC Time: %.2f" % (time.time() - start))

19
CC Time: 1.09
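The count alone (19 components) does not show how large each component is. A small, self-contained sketch of listing component sizes with NetworkX on a toy graph; the graph and variable names are made up for illustration:

```python
import networkx as nx

# Toy graph: a chain of three nodes, a separate pair, and one isolated node
toy_graph = nx.Graph()
toy_graph.add_edges_from([(0, 1), (1, 2), (3, 4)])
toy_graph.add_node(5)

# connected_components yields one set of nodes per component
components = sorted(nx.connected_components(toy_graph), key=len, reverse=True)
sizes = [len(c) for c in components]
print("components: %d, sizes: %s" % (len(components), sizes))
```

Sorting by size puts the largest components first, which makes it easy to tell large connected groups from small fragments.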


In [7]:
NX.write_edgelist(graph,
                  "graph_edges_%d_%d.csv"%(NX.number_of_nodes(graph), NX.number_of_edges(graph)),
                  delimiter=",")
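With the default `data=True`, `write_edgelist` appends the edge attribute dictionary as a third column (`{}` for edges without attributes), which may need to be handled by whatever reads the CSV later. The following sketch, on a toy graph and a temporary file chosen only for illustration, writes plain `source,target` rows with `data=False` and reads them back:

```python
import os
import tempfile
import networkx as nx

toy_graph = nx.Graph([(0, 1), (1, 2), (3, 4)])
path = os.path.join(tempfile.mkdtemp(), "toy_edges.csv")

# data=False writes only "source,target" without the attribute dictionary
nx.write_edgelist(toy_graph, path, delimiter=",", data=False)

with open(path) as handle:
    rows = handle.read().splitlines()

# nodetype=int converts the node labels back from strings to integers
restored = nx.read_edgelist(path, delimiter=",", nodetype=int, data=False)
print(rows)
print(restored.number_of_edges())
```

Isolated nodes do not appear in an edge list, so a graph with such nodes will not round-trip exactly through this format.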