semi-supervised deep learning for classification of molecular structures

Semi-Supervised Deep Learning for Molecular Structures

By Alvin Wan and Allen Guo

During clathrin-mediated endocytosis (CME), clathrin surrounds molecules awaiting transport, forming a spherical coat. Our goal was to pick out clathrin undergoing this process. This repository uses semi-supervised learning methods to classify "cup-like" clathrin structures in STORM microscopy images of proteins of interest. For the problem formulation and details of the approach, see our presentation slides or full report.

The clathrin data was provided by the Ke Xu lab in UC Berkeley's College of Chemistry, whose research work we are supporting. If you find this work useful for your research, please consider citing:

    Author = {Alvin Wan and Allen Guo},
    Title = {Semi-Supervised Deep Learning for Molecular Structures},
    Year = {2017}


This project requires Python 3. Begin by navigating to the root of the repository, which we will call $STORM.


(optional) We recommend setting up a virtual environment first.

virtualenv ../env --python=python3
source ../env/bin/activate

Install all Python requirements.

pip install -r requirements.txt


You can also toy with various hyperparameters and attempt training on your own. We approached the problem using a two-step pipeline. First, find a latent representation of the data in a lower-dimensional space. Then, run a simple classifier on the encoded data.
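The two-step idea can be illustrated with a minimal sketch. This is not the code in this repository: it assumes scikit-learn and NumPy are installed, uses PCA as the encoder, and runs on synthetic data standing in for flattened images. The array shapes, component count, and kernel are illustrative choices only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for flattened images: two classes, 64 features each.
# (Assumption: real inputs would come from the .mat files instead.)
X_train = np.vstack([rng.normal(0.0, 1.0, (100, 64)),
                     rng.normal(3.0, 1.0, (100, 64))])
y_train = np.array([0] * 100 + [1] * 100)
X_test = np.vstack([rng.normal(0.0, 1.0, (20, 64)),
                    rng.normal(3.0, 1.0, (20, 64))])
y_test = np.array([0] * 20 + [1] * 20)

# Step 1: encode into a lower-dimensional space (here, 8 PCA components).
# Step 2: run a simple classifier (here, an SVM) on the encoded data.
model = make_pipeline(PCA(n_components=8), SVC(kernel="rbf"))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Chaining the encoder and classifier in one pipeline ensures the encoder is fit on the training data only and then reused unchanged on the test data.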

If your data is located at data/train_molecules.mat and data/test_molecules.mat, the <data_class> mentioned below would be molecules.
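The naming convention above can be sketched as a small helper. The function name `dataset_paths` and the `data_dir` parameter are hypothetical and are not part of this repository; the sketch only restates the `train_<data_class>.mat` / `test_<data_class>.mat` pattern.

```python
from pathlib import Path


def dataset_paths(data_class, data_dir="data"):
    """Return the (train, test) .mat paths implied by a data class name."""
    root = Path(data_dir)
    train = root / f"train_{data_class}.mat"
    test = root / f"test_{data_class}.mat"
    return train, test


train_path, test_path = dataset_paths("molecules")
print(train_path.as_posix())  # data/train_molecules.mat
print(test_path.as_posix())   # data/test_molecules.mat
```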


Start by picking a featurization technique.

bash encode_(ae|kmeans|pca) <data_class>
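As an illustration of what a featurization step can look like, here is a minimal sketch of one common way to use k-means as an encoder: each sample is represented by its distances to the learned centroids. This is an assumption for illustration, not a description of what the k-means script in this repository does; the cluster count and the synthetic data are arbitrary, and scikit-learn is assumed to be installed.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for 200 flattened samples with 64 features each.
X = rng.normal(size=(200, 64))

# Learn 10 centroids, then re-express each sample as its distance
# to every centroid, giving a 10-dimensional encoding.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
features = kmeans.transform(X)

print(features.shape)  # (200, 10)
```

The encoded array has one column per centroid, so the number of clusters directly sets the dimensionality seen by the downstream classifier.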


We then train a support vector machine (SVM) on the featurizations. Before running the command below, make sure you have featurized both the train and test datasets specified above (for example, data/train_molecules.mat and data/test_molecules.mat).

bash svm <data_class>