feat: remove non-GISAID strainnames in deduplicate scripts
Thanks @AngieHinrichs for the suggestion in 10404d2#commitcomment-105766629
corneliusroemer committed Mar 23, 2023
1 parent 034686c commit 367c6fb
Showing 2 changed files with 14 additions and 0 deletions.
7 changes: 7 additions & 0 deletions deduplicate_keeping_first.py
@@ -27,6 +27,13 @@
        d.remove(current)
        print(f"Removed duplicate strain: {line}")
        continue
    # Exclude strain names that are not GISAIDy
    # strain names must have at least 2 slashes
    # This gets rid of things like `strain` or `OX0123456`
    if split[0].count('/') < 2 and line != 'taxon,lineage\n':
        d.remove(current)
        print(f"Removed non-GISAID strain: {line}")
        continue
    hashset.add(split[0])


7 changes: 7 additions & 0 deletions deduplicate_keeping_last.py
@@ -27,6 +27,13 @@
        d.remove(current)
        print(f"Removed duplicate strain: {line}")
        continue
    # Exclude strain names that are not GISAIDy
    # strain names must have at least 2 slashes
    # This gets rid of things like `strain` or `OX0123456`
    if split[0].count('/') < 2 and line != 'taxon,lineage\n':
        d.remove(current)
        print(f"Removed non-GISAID strain: {line}")
        continue
    hashset.add(split[0])
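
For reference, the slash-count check added to both scripts can be exercised in isolation. The following is a minimal sketch, not the scripts' actual code: the names `is_gisaid_like` and `filter_lines` are hypothetical, splitting each row on a comma is an assumption (the diff only shows `split[0]`), and the sample rows are made up. Only the header string and the "at least 2 slashes" rule come from the diff.

```python
HEADER = "taxon,lineage\n"  # header line, as compared against in the diff


def is_gisaid_like(strain: str) -> bool:
    """Heuristic from the commit: GISAID-style names contain at least two slashes."""
    return strain.count("/") >= 2


def filter_lines(lines):
    """Keep the header plus rows whose first field looks like a GISAID strain name.

    Assumption: rows are comma-separated with the strain name in the first field.
    """
    kept = []
    for line in lines:
        strain = line.split(",")[0]
        if line == HEADER or is_gisaid_like(strain):
            kept.append(line)
    return kept


# Made-up sample rows for illustration only.
sample = [
    "taxon,lineage\n",
    "England/EXAMPLE-1/2021,B.1.1.7\n",
    "OX0123456,B.1.1.7\n",
    "strain,B\n",
]
result = filter_lines(sample)
```

Note that the header comparison is an exact string match, so a header with different line endings (for example `\r\n`) would be treated as a non-GISAID row under this sketch, just as in the condition shown in the diff.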



1 comment on commit 367c6fb

@AngieHinrichs
Member

Great, thanks Cornelius!
