chain id parameter for structure.io.pdbx.get_sequence #600

tjmier · 2024-06-16T16:44:41Z

Added a new parameter 'chain_id' to the function 'get_sequence' in the module 'biotite.structure.io.pdbx' to return a dictionary mapping each chain ID to its sequence.


padix-key

I added a few further small comments here. If we always return a dictionary, I think the function would be cleaner if it iterated directly over strand IDs, sequence strings and sequence types instead of splitting the work into two for-loops.

padix-key · 2024-06-17T08:52:36Z

src/biotite/structure/io/pdbx/convert.py

+    sequences : list of Sequence or dict
+        If `chain_ids` is False, returns a list of protein and nucleotide 
+        sequences for each entity.
+        If `chain_ids` is True, returns a dictionary where each key is a 
+        chain ID and each value is the corresponding sequence.


I think we should note here that the chain IDs correspond to atom_site.auth_asym_id.

padix-key · 2024-06-17T08:54:50Z

src/biotite/structure/io/pdbx/convert.py

+        for entity, strand_ids in enumerate(strand_ids):
+            for strand_id in strand_ids:
+                strand_ids_to_seq_dict[strand_id] = sequences[entity-1]


This line enumerates, so the first tuple value is not the entity (ID) but the index (always starting at zero).

Suggested change

-        for entity, strand_ids in enumerate(strand_ids):
-            for strand_id in strand_ids:
-                strand_ids_to_seq_dict[strand_id] = sequences[entity-1]
+        for i, strand_ids in enumerate(strand_ids):
+            for strand_id in strand_ids:
+                strand_ids_to_seq_dict[strand_id] = sequences[i]
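The off-by-one can be reproduced with plain lists. This is only a sketch: the sequences and strand IDs are made-up placeholders rather than Biotite output, `build_mapping` is a hypothetical helper, and the outer list gets its own name (`strand_ids_per_entity`) so that the loop variable does not rebind it.

```python
# Hypothetical per-entity data, in the same order as the entities.
sequences = ["MKV", "ACGU", "GGS"]
strand_ids_per_entity = [["A", "B"], ["C"], ["D"]]


def build_mapping(index_offset):
    """Map each strand ID to a sequence, using `i + index_offset` as list index."""
    mapping = {}
    for i, strand_ids in enumerate(strand_ids_per_entity):
        for strand_id in strand_ids:
            mapping[strand_id] = sequences[i + index_offset]
    return mapping


# enumerate() starts at 0, so subtracting 1 makes the first entity
# read sequences[-1], i.e. the last sequence in the list.
buggy = build_mapping(-1)
# Using the index directly keeps entities and sequences aligned.
fixed = build_mapping(0)
```

With the `-1` offset, chains "A" and "B" silently receive the last sequence instead of raising an error, which is why the bug is easy to miss.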

padix-key · 2024-06-17T08:58:16Z

Thanks for the PR. I would be in favor of dropping the chain_id parameter entirely and always returning a dictionary. Since Biotite has not reached 1.0 yet, this backwards-incompatible change would be OK in my opinion.

Note that this would require a few small adjustments to that function call in the tests and documentation (searching for pdbx.get_sequence in the code base should find all occurrences).

t0mdavid-m · 2024-06-17T09:24:11Z

I agree with @padix-key. I think we should drop the chain_id parameter.


tjmier · 2024-06-17T19:17:27Z

I sent an additional commit with the suggested changes as I understand them. It is worth noting that there may be instances where entity_poly.pdbx_strand_id is not equivalent to atom_site.label_asym_id. The structure used in the test function (PDB:5UGO) is an example of this, where the strand ID is "T" and the asym_id is "A". This may be an edge case and I am unsure if it will cause issues, but it's worth mentioning.

padix-key · 2024-06-18T06:13:32Z

Thanks for the changes. There are still remaining get_sequence() calls in

doc/examples/scripts/sequence/residue_coevolution.py
tests/structure/test_rcsb.py
tests/structure/test_sequence.py

The strand ID does not match atom_site.label_asym_id, but it does match atom_site.auth_asym_id. I am not sure what the rationale behind this is, though, as the label_xxx fields should be the source of truth in the PDB.
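A small sketch of what this means for a lookup, assuming a dictionary keyed by strand ID as discussed in this thread. The atom row mirrors the 5UGO case mentioned above ("A" vs. "T"); the sequence string is a placeholder, not the real one, and the plain dictionaries stand in for the actual file categories.

```python
# Hypothetical atom_site row: label_asym_id is "A", while auth_asym_id
# (which matches the strand ID) is "T".
atom_site_row = {"label_asym_id": "A", "auth_asym_id": "T"}

# Placeholder for a sequence dictionary keyed by entity_poly.pdbx_strand_id.
sequences_by_strand = {"T": "ACGT"}

# Looking up by the author chain ID finds the sequence ...
by_auth = sequences_by_strand.get(atom_site_row["auth_asym_id"])
# ... while looking up by the label chain ID finds nothing.
by_label = sequences_by_strand.get(atom_site_row["label_asym_id"])
```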

padix-key · 2024-06-18T06:17:42Z

src/biotite/structure/io/pdbx/convert.py

+    sequence_dict = {
+        strand_id: sequence
+        for sequence, strand_ids in zip(sequences, strand_ids)
+        for strand_id in strand_ids
+    }


In case some converted sequence is None in the list comprehension above, sequences would have a different length than strand_ids.
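A minimal sketch of the failure mode, using made-up data rather than Biotite objects: if `None` entries are dropped before zipping, every later sequence shifts to the wrong strand IDs, and `zip()` silently truncates to the shorter input. One possible fix (an assumption, not necessarily what the PR ended up doing) is to zip first and filter afterwards.

```python
# Hypothetical conversion results: the second entity could not be converted.
converted = ["MKV", None, "GGS"]
strand_ids = [["A"], ["B"], ["C"]]

# Filtering before zipping shortens the list, so "GGS" ends up on chain "B"
# and chain "C" is dropped without any error.
filtered = [seq for seq in converted if seq is not None]
misaligned = {
    strand_id: seq
    for seq, ids in zip(filtered, strand_ids)
    for strand_id in ids
}

# Zipping first and filtering afterwards keeps each sequence paired
# with the strand IDs of its own entity.
aligned = {
    strand_id: seq
    for seq, ids in zip(converted, strand_ids)
    if seq is not None
    for strand_id in ids
}
```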

tjmier · 2024-06-19T00:25:00Z

Please ignore those commits. I am still learning git and did not think those commits would be sent to this pull request.
Again, apologies for silly mistakes. A lot of this is new to me.

padix-key · 2024-06-19T12:53:14Z

No problem. You may also convert this PR to a Draft PR to further indicate that this is work in progress.

tjmier · 2024-06-23T04:23:50Z

The last commit is the one I would like to merge if possible.

padix-key · 2024-06-24T08:36:12Z

Looks good to me. However, one small adjustment is necessary: with the merge of #552, residue_coevolution.py has moved to doc/examples/scripts/sequence/homology/residue_coevolution.py. This merge conflict needs to be resolved first. Furthermore, you may add yourself to CONTRIB.rst if you like.

Added a new parameter 'chain_id' to the function

07c1253

'get_sequence' in the module 'biotite.structure.io.pdb' to return a dictionary mapping chain_id to the sequence

padix-key requested changes Jun 17, 2024


updated get_sequence to return dict of sequences

d5eb033

with entity_poly.pdbx_strand_id as keys. Updated test_pdx.py test_get_sequence to reflect returning a dict instead of a list

padix-key reviewed Jun 18, 2024


tjmier added 3 commits June 18, 2024 19:05

update local branch with remote branch

e79237e

Merge branch 'master' into get_sequence_by_chainid

90effb8

fix merge mistake

9974699

updated get_sequence, dependent tests, and example

89297ac

tjmier closed this Jun 24, 2024

tjmier deleted the get_sequence_by_chainid branch June 24, 2024 17:05

tjmier restored the get_sequence_by_chainid branch June 24, 2024 17:12

padix-key mentioned this pull request Jun 25, 2024

pdbx get_sequence return dict of seq #611

Merged
