
Comments on the boundaries #2

Open
zyxue opened this issue Mar 12, 2018 · 7 comments

zyxue commented Mar 12, 2018

The boundaries on your front page look quite artificial to me; e.g., H is much closer to the uncharged AAs than to the charged AAs. Could you comment on why? I also have a few technical questions:

  1. What's the dimension of the learned vectors?
  2. Did you try visualization with a more straightforward dimensionality reduction technique, e.g. PCA?
  3. Do you have any thoughts on the result in the context of AA neighbors? I realize a major difference between language and AAs is that the meaning of a word is well characterized by its neighbors, so an arbitrary concatenation of words won't make sense. But for AAs, any combination seems possible, even if the resulting peptide may not be particularly useful.

Thank you.

WesleyyC (Owner) commented

Thank you for reaching out!

The boundaries are indeed a little bit artificial. However, many residues sit far from their supposed group for a reason. Take the histidine you mention: compared to the other charged AAs, it is in fact quite different:

Histidine's pKa can easily be perturbed by its surroundings, e.g. by the neighboring residues in an enzyme active site, which makes it highly versatile. One manifestation of this functional and chemical versatility is its ability to behave both as a polar/charged amino acid and as a hydrophobic residue.

That might be why we see H closer to the uncharged residues. It's also why having a representation in a continuous space could be useful.

For your technical questions:

  1. I don't remember exactly what I used to generate the graph, but this is the line of code shown in the data folder (see also the sketch after this list):
    model = gensim.models.Word2Vec(sentences, size=1500, window=10, workers=8, sg=1)
  2. Yes, but I don't think I still have the result. I remember it wasn't impressive, which is why we moved on to t-SNE. I am happy to include your PCA result if you end up making one.
  3. Yes, but by the same token, you could put any words together and they just wouldn't make sense (i.e., be useful). That said, I agree that an AA's context doesn't really define the AA; rather, it defines a context in which the AA you want to predict sits at the center.
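
A minimal, self-contained sketch of how that call trains per-AA embeddings (not the repo's exact script; the toy sequences and the min_count setting are placeholders for illustration, and the size keyword is the pre-4.0 gensim API, renamed vector_size in gensim >= 4.0):

    # Hedged sketch: train single-AA embeddings with gensim's skip-gram model.
    import numpy as np
    import gensim

    sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",  # toy stand-ins for real
                 "GAVLIPFMWSTCYNQDEKRH"]               # protein sequence data
    sentences = [list(seq) for seq in sequences]       # one "word" per AA

    # size=20 for single-AA embeddings per the repo comment; sg=1 = skip-gram.
    model = gensim.models.Word2Vec(sentences, size=20, window=10,
                                   workers=8, sg=1, min_count=1)
    print(model.wv['H'])  # the learned 20-d vector for histidine

    # For question 2: project the 20 vectors to 2-D with PCA (or swap in t-SNE).
    from sklearn.decomposition import PCA
    aas = "ACDEFGHIKLMNPQRSTVWY"
    coords = PCA(n_components=2).fit_transform(np.array([model.wv[a] for a in aas]))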

Hope this answers your questions.

zyxue (Author) commented Mar 13, 2018

Thanks for your reply!

I am confused by the size parameter. In your code, the comment says

# size: layer of neural net/ dimension, we set it to 20 because we only have 20 voca

but you call it with size=1500:

model = gensim.models.Word2Vec(sentences, size=1500, window=10, workers=8, sg=1)

I understand it as the dimension of the learned vector per amino acid, in which case it shouldn't be larger than 20, since there are only 20 AAs. Is that correct?

WesleyyC (Owner) commented

Ah, now I remember. The size of 1500 is from when I was looking at building embeddings for k-mers instead of single AAs. When we generate the graph for single AAs, the model should output an embedding of size 20, per the comment.
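
The repo's exact k-mer scheme isn't shown in this thread; as a plausible sketch, assume overlapping k-mers as the tokens, which grows the vocabulary toward 20^k and makes a much larger embedding size such as 1500 reasonable:

    # Hedged sketch of k-mer tokenization (the repo's actual scheme may differ).
    # With k=3 the vocabulary can reach 20**3 = 8000 tokens, so size=1500 is no
    # longer bounded by the 20-letter alphabet.
    def kmers(seq, k=3):
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    kmer_sentences = [kmers(seq) for seq in sequences]  # `sequences` as in the sketch above
    kmer_model = gensim.models.Word2Vec(kmer_sentences, size=1500, window=10,
                                        workers=8, sg=1, min_count=1)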

zyxue (Author) commented Mar 13, 2018

Since you mentioned it: I am also very interested in k-mers. How did the k-mer experiments go?

WesleyyC (Owner) commented

Some experiment figures are here:

https://github.com/WesleyyC/Amino-Acid-Embedding/tree/master/Figure

You should be able to open them in MATLAB, but we didn't continue this project, so that part of the repo is not well maintained.

zyxue (Author) commented Mar 14, 2018

Thanks! What's your main conclusion, then? I am doing something similar with the nr database (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz), and I don't see any particular pattern among the AAs. My thought is that if there is no strong pattern (if any at all), it suggests that in nature any AA is likely to be a neighbour of any other AA, so at the character level there isn't much difference among them. Would you agree with that?

WesleyyC (Owner) commented

Sorry for the late response. I missed the notification for some reason.

For single-AA embeddings, we do see a strong pattern that tracks the residues' biochemical properties. In addition, we computed a distance matrix from the embeddings and compared it to the BLOSUM matrix; the two seem highly correlated (one way to run such a comparison is sketched below).
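
Not the authors' actual analysis, but a hedged sketch of one such comparison, assuming a trained model as in the earlier sketch and Biopython's bundled BLOSUM62 table:

    # Hedged sketch (not the authors' pipeline): correlate pairwise embedding
    # distances with BLOSUM62 scores. Residues that are close in embedding
    # space should score high in BLOSUM62, i.e. we expect a negative rho.
    import itertools
    import numpy as np
    from scipy.stats import spearmanr
    from Bio.Align import substitution_matrices  # Biopython >= 1.75

    blosum62 = substitution_matrices.load("BLOSUM62")
    dists, scores = [], []
    for a, b in itertools.combinations("ACDEFGHIKLMNPQRSTVWY", 2):
        dists.append(np.linalg.norm(model.wv[a] - model.wv[b]))  # `model` from above
        scores.append(blosum62[a, b])

    rho, _ = spearmanr(dists, scores)
    print(rho)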

For the k-mer embeddings, we do see a pattern in our graph, but we are not sure whether it's an artifact of the way we generate the k-mers or a true pattern.
