Commit

Add script to clean up GISAID sequence file for direct use
Brian Pardy committed Feb 19, 2020
1 parent 66a160c commit b401051
Showing 1 changed file with 20 additions and 0 deletions.
20 changes: 20 additions & 0 deletions scripts/normalize_gisaid_fasta.sh
@@ -0,0 +1,20 @@
#!/usr/bin/env bash
set -e
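GISAID_SARSCOV2_IN=${1:?usage: normalize_gisaid_fasta.sh INPUT_FASTA OUTPUT_FASTA}
GISAID_SARSCOV2_OUT=${2:?usage: normalize_gisaid_fasta.sh INPUT_FASTA OUTPUT_FASTA}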

echo "Normalizing GISAID file $GISAID_SARSCOV2_IN to $GISAID_SARSCOV2_OUT"

# Remove leading 'BetaCoV/' (any capitalization, e.g. 'BetaCov/') from sequence names
# Convert embedded spaces in sequence names to underscores (Hong Kong sequences)
# Remove trailing |EPI_ISL_id|datestamp from sequence names
# Eliminate records with duplicate sequence names (keep only the first seen)

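cat "$GISAID_SARSCOV2_IN" |
sed 's/^>BetaCoV\//>/gi' | # remove leading BetaCoV/ (case-insensitive flag is a GNU sed extension)
sed 's/ /_/g' | # convert embedded spaces to underscores
sed 's/|.*$//' | # remove trailing metadata
awk 'BEGIN{RS=">";FS="\n"}!x[$1]++{print ">"$0}' | # keep the first record for each sequence name
grep -v '^>*$' > "$GISAID_SARSCOV2_OUT" # drop the blank and bare '>' lines left by awk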
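
To see what the four steps do, here is a minimal sketch that wraps the same pipeline in a function and runs it on three made-up records. The function name `normalize_fasta`, the strain names, and the EPI_ISL ids are all hypothetical, invented for illustration rather than taken from GISAID. The case-insensitive sed flag is replaced here with bracket expressions so the sketch does not depend on GNU sed.

```shell
#!/usr/bin/env bash
# Hypothetical demo of the normalization steps; not part of the commit.

# Filter FASTA on stdin to normalized FASTA on stdout.
normalize_fasta() {
  sed -e 's/^>[Bb][Ee][Tt][Aa][Cc][Oo][Vv]\//>/' \
      -e 's/ /_/g' \
      -e 's/|.*$//' |
  awk 'BEGIN{RS=">";FS="\n"} !x[$1]++ {print ">"$0}' |  # first record per name wins
  grep -v '^>*$'                                        # drop blank and bare ">" lines
}

# Invented sample input: the third record repeats the first record's name.
printf '%s\n' \
  '>BetaCoV/Example/A1/2020|EPI_ISL_000001|2020-01-01' \
  'ACGT' \
  '>BetaCov/Hong Kong/X1/2020|EPI_ISL_000002|2020-01-02' \
  'GGCC' \
  '>BetaCoV/Example/A1/2020|EPI_ISL_000003|2020-01-03' \
  'TTAA' |
  normalize_fasta
```

Deduplication is keyed on the normalized header line, not on the sequence itself, so the third record is dropped even though its bases differ from the first.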
