Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chain id parameter for structure.io.pdbx.get_sequence #600

Closed
wants to merge 6 commits into from

Conversation

tjmier
Copy link
Contributor

@tjmier tjmier commented Jun 16, 2024

Added a new parameter 'chain_id' to the function 'get_sequence' in the module 'biotite.structure.io.pdbx' to return a dictionary mapping chain_id to the sequence

'get_sequence' in the module 'biotite.structure.io.pdb'
to return a dictionary mapping chain_id to the sequence
Copy link
Member

@padix-key padix-key left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few further small comments here. If we always return a dictionary, I think the function would be cleaner, if it iterates directly over strand IDs, sequence strings and sequence types instead of splitting it into two for-loops.

Comment on lines 141 to 145
sequences : list of Sequence or dict
If `chain_ids` is False, returns a list of protein and nucleotide
sequences for each entity.
If `chain_ids` is True, returns a dictionary where each key is a
chain ID and each value is the corresponding sequence.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should note here that the chain IDs correspond to atom_site.auth_asym_id.

Comment on lines 163 to 165
for entity, strand_ids in enumerate(strand_ids):
for strand_id in strand_ids:
strand_ids_to_seq_dict[strand_id] = sequences[entity-1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lines enumerates, so the first tuple value is not the entity (ID) but the index (always starting at zero).

Suggested change
for entity, strand_ids in enumerate(strand_ids):
for strand_id in strand_ids:
strand_ids_to_seq_dict[strand_id] = sequences[entity-1]
for i, strand_ids in enumerate(strand_ids):
for strand_id in strand_ids:
strand_ids_to_seq_dict[strand_id] = sequences[i]

@padix-key
Copy link
Member

Thanks for the PR. I would be in favor of dropping the chain_id parameter entirely, and returning always a dictionary. Since Biotite has not reached 1.0, yet, this backwards-incompatible changes would be OK in my opinion.

Note that this would require a few small adjustments of that function call in the tests and documentation (searching for pdbx.get_sequence in the code base should find all occurences)

@t0mdavid-m
Copy link
Member

I agree with @padix-key. I think we should drop the chain_id parameter.

with entity_poly.pdbx_strand_id as keys.
Updated test_pdx.py test_get_sequence to reflect
returning a dict instead of a list
@tjmier
Copy link
Contributor Author

tjmier commented Jun 17, 2024

I sent an additional commit with the suggested changes as I understand them. It is worth noting that there may be instances where entity_poly.pdbx_strand_id is not equivalent to atom_site.label_asym_id. The structure used in the test function (PDB:5UGO) is an example of this where the strand ID is "T" and the asym_id is "A". This may be an edge case and I am unsure if it will cause issues but its worth mentioning.

@padix-key
Copy link
Member

Thanks for the changes. There are still remaining get_sequence() calls in

  • doc/examples/scripts/sequence/residue_coevolution.py
  • tests/structure/test_rcsb.py
  • tests/structure/test_sequence.py

The strand ID is not matching atom_site.label_asym_id, but atom_site.auth_asym_id is. I am not sure what the rationale behind this is though, as the label_xxx fields should be the source of truth in the PDB.

Comment on lines 165 to 169
sequence_dict = {
strand_id: sequence
for sequence, strand_ids in zip(sequences, strand_ids)
for strand_id in strand_ids
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case some converted sequence is None in the list comprehension above, sequences would have a different length than strand_ids.

@tjmier
Copy link
Contributor Author

tjmier commented Jun 19, 2024

Please ignore those commits. I am still learning git and did not think those commits would be sent to this pull request.
Again, apologies for silly mistakes. A lot of this is new to me.

@padix-key
Copy link
Member

No problem. You may also convert this PR to a Draft PR to further indicate that this is work in progress

@tjmier
Copy link
Contributor Author

tjmier commented Jun 23, 2024

The last commit is the one I would like to merge if possible.

@padix-key
Copy link
Member

Looks good to me, however, one small adjustment is necessary: With the merge of #552 residue_coevolution.py is moved to doc/examples/scripts/sequence/homology/residue_coevolution.py. This merge conflict needs to be solved first. Furthermore, you may add yourself to CONTRIB.rst, if you like.

@tjmier tjmier closed this Jun 24, 2024
@tjmier tjmier deleted the get_sequence_by_chainid branch June 24, 2024 17:05
@tjmier tjmier restored the get_sequence_by_chainid branch June 24, 2024 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants