- Day 1 Get the sequences length distribution from a fastq file using awk:
zcat example.fastq.gz
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=72
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTTACCCTTAACAACTTAAGGGTTTTCAAATAGA
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=72
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIIIIIDIIIIIII>IIIIII/
@SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=72
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGAAGCAGAAGTCGATGATAATACGCGTCGTTTTATCAT
+SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=72
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBIIIIIIIIIIIIIIIIIIIIIIIGII>IIIII-I)8I
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
@SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=36
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
+SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
@071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+071112_SLXA-EAS1_s_7:5:1:817:345
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
@071112_SLXA-EAS1_s_7:5:1:801:338
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
+071112_SLXA-EAS1_s_7:5:1:801:338
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
zless example.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'
initiate a awk arrary named lengths
, save all the record length to it and increment the frequency by array[]++
A fastq record contains 4 lines.
when line number can be divided by 4 and leave 2: NR%4 == 2
to get the sequence line.
length($0)
gives the length of the full line denoted by $0
.
when the awk
finishes reading all lines: END
print out the lengths and its frequency
- Day 2 Reverse complement a sequence:
echo 'ATTGCTATGCTNNNT' | rev | tr 'ACTG' 'TGAC'
ANNNAGCATAGCAAT
rev
: reverse the sequence
tr
: translate the ACTG
to its complement TGAC
- Day 3 split a multi-fasta to multiple single fasta file
cat multi_fasta.fa
>seq1
ATCTGCGATCCCC
>seq2
GGGTCGTCTTCAAAAAT
>seq3
TCGATTTTGATTTTAAAAA
cat multi_fasta.fa | awk '{
if (substr($0, 1, 1)==">") {filename=(substr($0,2) ".fasta")}
print $0 >> filename
close(filename)
}'
or use csplit
csplit -z multi_fasta.fa
- Day 4 turn a fastq to a fasta file:
cat example.fastq| paste - - - - | perl -F"\t" -ane 'print ">$F[0]_$F[3]$F[1]\n";'
>@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=72_IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIIIIIDIIIIIII>IIIIII/
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTTACCCTTAACAACTTAAGGGTTTTCAAATAGA
>@SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=72_IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBIIIIIIIIIIIIIIIIIIIIIIIGII>IIIII-I)8I
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGAAGCAGAAGTCGATGATAATACGCGTCGTTTTATCAT
>@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36_IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
>@SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=36_IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
>@071112_SLXA-EAS1_s_7:5:1:817:345_IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
>@071112_SLXA-EAS1_s_7:5:1:801:338_IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
less example.fastq | paste - - - - | awk 'BEGIN { FS = "\t" };{printf(">%s_%s\n%s\n",$1,$4,$2);}'
>@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=72_IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIIIIIDIIIIIII>IIIIII/
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTTACCCTTAACAACTTAAGGGTTTTCAAATAGA
>@SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=72_IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBIIIIIIIIIIIIIIIIIIIIIIIGII>IIIII-I)8I
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGAAGCAGAAGTCGATGATAATACGCGTCGTTTTATCAT
>@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36_IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
>@SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=36_IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
>@071112_SLXA-EAS1_s_7:5:1:817:345_IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
>@071112_SLXA-EAS1_s_7:5:1:801:338_IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
cat example.fastq | sed -n '1~4s/^@/>/p;2~4p'
>SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=72
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTTACCCTTAACAACTTAAGGGTTTTCAAATAGA
>SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=72
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGAAGCAGAAGTCGATGATAATACGCGTCGTTTTATCAT
>SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
>SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=36
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
>071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
>071112_SLXA-EAS1_s_7:5:1:801:338
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
split fastq to a set of smaller fastqs with a fixed number of reads (SEQNUM)
SEQNUM=10; split -l $((4*$SEQNUM)) --filter='gzip > $FILE.gz' --additional-suffix=.fq -d <(zcat aaa.fq.gz)
- Day 5 Reproducible subsampling of a FASTQ file.
0.01
is the % of reads to output.
cat file.fq | paste - - - - | awk 'BEGIN{srand(1234)}{if(rand() < 0.01) print $0}' | tr '\t' '\n' > out.fq
cat file.fq | paste - - - - | shuf -n 1000 | tr '\t' '\n' > out.fq
seqkit sample -p 0.01 file.fq
- Day 6 Get the header and the column number:
cat data.tsv
ID head1 head2 head3 head4
1 25.5 1364.0 22.5 13.2
2 10.1 215.56 1.15 22.2
cat data.tsv | head -1 | tr "\t" "\n" | cat -n
1 ID
2 head1
3 head2
4 head3
5 head4
head -1
: print the first line
tr
translate tab to newline
cat -n
print the line number. You can use nl
too
cat data.tsv | nl
1 ID head1 head2 head3 head4
2 1 25.5 1364.0 22.5 13.2
3 2 10.1 215.56 1.15 22.2
cat data.tsv | head -1 | tr "\t" "\n" | nl
1 ID
2 head1
3 head2
4 head3
5 head4
# use sed, you can change to any line number
cat data.tsv | sed -n '1s/\t/\n/gp'
ID
head1
head2
head3
head4
use csvkit
:
csvcut -nt -l data.tsv
1: ID
2: head1
3: head2
4: head3
5: head4
- Day 7 Show hidden characters
cat -A data.tsv
ID^Ihead1^Ihead2^Ihead3^Ihead4$
1^I25.5^I1364.0^I22.5^I13.2$
2^I10.1^I215.56^I1.15^I22.2$
^I
means tab, $
means the end of the line.
This is useful when you get a file and see if you do see a tab between columns or you may have 2 tabs between the columns
Note on mac you will need to use gnu utilities and use gcat -A
sed -n l data.tsv
ID\thead1\thead2\thead3\thead4$
1\t25.5\t1364.0\t22.5\t13.2$
2\t10.1\t215.56\t1.15\t22.2$
now, tab is denoted as \t
and $
means the end of the line.
- Day 8 print out unique rows based on the first and second column
awk '!a[$1,$2]++' input_file
sort based on unique (first and second)column
sort -u -k1,2 input_file
In R
df %>% dplyr::distinct(column1, column2, .keep_all =TRUE)
- Day 9 split a bed file by chromosome
cat nexterarapidcapture_exome_targetedregions_v1.2.bed | sort -k1,1 -k2,2n | sed 's/^chr//' | awk '{close(f);f=$1}{print > f".bed"}'
#or
awk '{print $0 >> $1".bed"}' example.bed
sed
to remove the chr
and awk
split the files to 1.bed
, 2.bed
etc.
split large file by id/label/column. you can change $1 to $2 etc depends on which column you want to use
awk '{print >> $1; close($1)}' input_file
cat example.bed
chr1 12 14 sample1
chr1 10 15 sample2
chr2 10 20 sample1
chr2 22 33 sample2
awk '{print >> $1".bed"; close($1".bed")}' example.bed
it gives you chr1.bed
, chr2.bed
awk '{print >> $4".bed"; close($4".bed")}' example.bed
it gives you sample1.bed
, sample2.bed
- Day 10 sort VCF with header
cat my.vcf | awk '$0~"^#" { print $0; next } { print $0 | "sort -k1,1V -k2,2n" }'
- Day 11 rename file with bash string manipulation
for file in *gz
do zcat $file > ${file/bed.gz/bed}
done
- Day 12 exit a dead
ssh
session
press:
~.
- Day 13 copy large files with rsync
copy the from_dir directory to the to_dir directory
rsync -av from_dir to_dir
copy every file inside the frm_dir to to_dir. Note the trailing slash
rsync -av from_dir/ to_dir
re-copy the files avoiding completed ones
rsync -avhP from_dir to_dir
- Day 14 make directory using the current date
mkdir $(date +%F)
see A Quick Guide to Organizing Computational Biology Projects
- Day 15 get all the folders' size in the current folder
du -h --max-depth=1
the total size of current directory
du -sh .
disk usage
df -h
check GNU ncdu
.
open top -M
with human readable size in Mb, Gb. install htop for better visualization.
- Day 16 pretty output
less -S
fold -w 60
cat file.tsv| column -t | less -S
csvtk pretty names.csv #from csvtk
csvlook names.csv #from csvkit
- Day 17 keep header filter
always print the first line, and filter by column 10, 11 and 18
awk ' NR ==1 || ($10 > 1 && $11 > 0 && $18 > 0.001)' input_file
In R, header is always maintained in the dataframe.
df %>% filter(column10 >1, column11 >0, column18 > 0.001)
- Day 18 delete lines by sed
delete the blank lines
sed /^$/d'
delete the last line
sed $d
sed '1d' to remove the header.
ls *csv | parallel 'cut -f, -d 2 | sed '1d' > {/.}.list'
print the second line of a LARGE file and quit:
sed -n '2{p;q}'
- Day 19 create a tx2gene mapping file from ensemble gtf retaining the version number of genes and transcripts.
awk -F "\t" '$3 == "transcript" { print $9 }' myensembl.gtf| tr -s ";" " " | cut -d " " -f2,4| sed 's/\"//g' | awk '{print $1"."$2}' > genes.txt
awk -F "\t" '$3 == "transcript" { print $9 }' myensembl.gtf| tr -s ";" " " | cut -d " " -f6,8| sed 's/\"//g' | awk '{print $1"."$2}' > transcripts.txt
paste transcripts.txt genes.txt > tx2genes.txt
- Day 20 print every 2nd line of every 4 lines in a fastq file, get the barcode frequency table
zcat mysample_L001_I1_001.fastq.gz | sed -n '2~4p' | sort | uniq -c | sort -k1,1 -nr > barcode_freq.txt
- Day 21
ever using sed to replace paths in your string, use , as a separator!
sed 's,/my/path/,,'
- Day 22
sanity check if the cell barcode is v2 or v3
comm -12 <(zcat 3M-february-2018.txt.gz |sort) <(zcat Sample1/outs/filtered_feature_bc_matrix/barcodes.tsv.gz | sed 's/-1//' | sort) | wc -l
M-february-2018.txt.gz
is the 10x genomics whitelist file.
- Day 23 select lines from a file based on columns in another file
awk -F"\t" 'NR==FNR{a[$1$2$3]++;next};a[$1$2$3] > 0' file2 file1
NR==FNR : NR is the current input line number and FNR the current file's line number. The two will be equal only while the 1st file is being read.
a[$1$2$3]++; next : if this is the 1st file, save the 1st three fields in the a
array. Then, skip to the next line so that this is only applied on the 1st file.
a[$1$2]>0 : the else block will only be executed if this is the second file so we check whether fields 1, 2 and 3of this file have already been seen (a[$1$2$3]>0) and if they have been, we print the line. In awk, the default action is to print the line so if a[$1$2$3]>0 is true, the line will be printed.
- Day 24 find bam in the current folder (search recursively) and copy it to a new directory using 5 CPUs
find . -name "*bam" | xargs -P5 -I{} rsync -av {} dest_dir
or
find . -name "*bam" | parallel -j 5 'rsync -avhP {} dest_dir'
- Day 25
ls -X
will group files by extension.
- Day 26 loop through all chromosomes
for i in {1..22} X Y
do
echo $i
done
for i in {01..22}
do
echo $i
done
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22
- Day 27
paste
is used to concatenate corresponding lines from files: paste file1 file2 file3 .... If one of the "file" arguments is "-", then lines are read from standard input. If there are 2 "-" arguments, then paste takes 2 lines from stdin. And so on.
cat test.txt
0 ATTTTATTNGAAATAGTAGTGGG
0 CTCCCAAAATACTAAAATTATAA
1 TTTTAGTTATTTANGAGGTTGAG
1 CNTAATCTTAACTCACTACAACC
2 TTATAATTTTAGTATTTTGGGAG
2 CATATTAACCAAACTAATCTTAA
3 GGTTAATATGGTGAAATTTAAT
3 ACCTCAACCTCNTAAATAACTAA
cat test.txt| paste - -
0 ATTTTATTNGAAATAGTAGTGGG 0 CTCCCAAAATACTAAAATTATAA
1 TTTTAGTTATTTANGAGGTTGAG 1 CNTAATCTTAACTCACTACAACC
2 TTATAATTTTAGTATTTTGGGAG 2 CATATTAACCAAACTAATCTTAA
3 GGTTAATATGGTGAAATTTAAT 3 ACCTCAACCTCNTAAATAACTAA
or use awk
cat test.txt| awk 'ORS=NR%2?"\t":"\n"'
0 ATTTTATTNGAAATAGTAGTGGG 0 CTCCCAAAATACTAAAATTATAA
1 TTTTAGTTATTTANGAGGTTGAG 1 CNTAATCTTAACTCACTACAACC
2 TTATAATTTTAGTATTTTGGGAG 2 CATATTAACCAAACTAATCTTAA
3 GGTTAATATGGTGAAATTTAAT 3 ACCTCAACCTCNTAAATAACTAA
ORS: output record separator in awk. var=condition?condition_if_true:condition_if_false
is the ternary operator.
- Day 28 grep fastq reads containing a pattern but maintain the fastq format
grep -A 2 -B 1 TGG example.fastq | sed '/^--$/d'
# or
zcat reads.fq.gz \
| paste - - - - \
| awk -v FS="\t" -v OFS="\n" '$2 ~ "TGG" {print $1, $2, $3, $4}' \
| gzip > filtered.fq.gz
- Day 29 count how many columns of a tsv files
cat file.tsv | head -1 | tr "\t" "\n" | wc -l
csvcut -n -t file.tsv (from csvkit)
awk '{print NF; exit}' file.tsv
awk -F "\t" 'NR == 1 {print NF}' file.tsv
- Day 30 combine info to the fasta header
cat myfasta.txt
>Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>Blap_contig7988
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>Bluc_contig1223663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
cat my_info.txt
info1
info2
info3
info4
paste <(cat my_info.txt) <(cat myfasta.txt| paste - - | cut -c2-) | awk '{printf(">%s_%s\n%s\n",$1,$2,$3);}'
>info1_Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info2_Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info3_Blap_contig7988
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info4_Bluc_contig1223663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
- Day 31 change fasta header
The fasta header is like >7 dna:chromosome chromosome:GRCh37:7:1:159138663:1 convert to >7:
cat Homo_sapiens_assembly19.fasta | gawk '/^>/ { b=gensub(" dna:.+", "", "g", $0); print b; next} {print}' > Homo_sapiens_assembly19_reheader.fasta
- Day 32 make dir and cd into that dir
mkdir blah && cd $_
cd # go to your home folder
cd - # go back to your previous folder
- Day 33
Finally learned about the !$ in unix: take the last thing (word) from the previous command.
echo hello, world
echo !$
world
Create a script of the last executed command:
echo "!!" > foo.sh
Reuse all parameter of the previous command line:
!*
- Day 34 merge all bed files and add a column for the filename.
awk '{print $0 "\t" FILENAME}' *bed
- Day 35 add or remove chr from the start of each line
# add chr
sed 's/^/chr/' my.bed
# or
awk 'BEGIN {OFS = "\t"} {$1="chr"$1; print}'
# remove chr
sed 's/^chr//' my.bed
- Day 36 check if a tsv files have the same number of columns for all rows
awk '{print NF}' test.tsv | sort -nu | head -n 1
- Day 37 Parallelized samtools mpileup
BAM="yourFile.bam"
REF="reference.fasta"
samtools view -H $BAM | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 \
| xargs -I {} -n 1 -P 24 sh -c "samtools mpileup -BQ0 -d 100000 -uf $REF -r \"{}\" $BAM \
| bcftools call -cv > \"{}\".vcf"
- Day 38 convert multiple lines to a single line
This is better than tr "\n" "\t" because somtimes I do not want to convert the last newline to tab.
cat myfile.txt | paste -s
- Day 39 merge multiple files with the same header by keeping the header of the first file
I usually do it in R
, but like the quick solution.
awk 'FNR==1 && NR!=1{next;}{print}' *.csv
awk 'NR == 1 || FNR > 1' *.csv
# or
awk '
FNR==1 && NR!=1 { while (/^<header>/) getline; }
1 {print}
' file*.txt >all.txt
- Day 40 insert a field into the first line
cut -f1-4 F5.hg38.enhancers.expression.usage.matrix | head
CNhs11844 CNhs11251 CNhs11282 CNhs10746
chr10:100006233-100006603 1 0 0
chr10:100008181-100008444 0 0 0
chr10:100014348-100014634 0 0 0
chr10:100020065-100020562 0 0 0
chr10:100043485-100043744 0 0 0
chr10:100114218-100114567 0 0 0
chr10:100148595-100148922 0 0 0
chr10:100182422-100182522 0 0 0
chr10:100184498-100184704 0 0 0
sed '1 s/^/enhancer\t/' F5.hg38.enhancers.expression.usage.matrix | cut -f1-4 | head
enhancer CNhs11844 CNhs11251 CNhs11282
chr10:100006233-100006603 1 0 0
chr10:100008181-100008444 0 0 0
chr10:100014348-100014634 0 0 0
chr10:100020065-100020562 0 0 0
chr10:100043485-100043744 0 0 0
chr10:100114218-100114567 0 0 0
chr10:100148595-100148922 0 0 0
chr10:100182422-100182522 0 0 0
chr10:100184498-100184704 0 0 0
- Day 41 extract PASS calls from vcf file
cat my.vcf | awk -F '\t' '{if($0 ~ /\#/) print; else if($7 == "PASS") print}' > my_PASS.vcf
- Day 42 replace a pattern in a specific column
## column5
awk '{gsub(pattern,replace,$5)}1' in.file
## http://bioinf.shenwei.me/csvtk/usage/#replace
csvtk replace -f 5 -p pattern -r replacement
- Day 43 move a process to a screen session
1. Suspend: Ctrl+z
2. Resume: bg
3. Disown: disown %1
4. Launch screen
5. Find pid: prep BLAH
6. Reparent process: reptyr ###
- Day 44 count uinque values in a column and put in a new column
# input
blabla_1 A,B,C,C
blabla_2 A,E,G
blabla_3 R,Q,A,B,C,R,Q
# output
blabla_1 3
blabla_2 3
blabla_3 5
awk '{split(x,C); n=split($2,F,/,/); for(i in F) if(C[F[i]]++) n--; print $1, n}' file
- Day 45 Create TSS bed from GTF in one line:
zcat gencode.v29lift37.annotation.gtf.gz | awk '$3=="gene" {print $0}' | grep protein_coding | awk -v OFS="\t" '{if ($7=="+") {print $1, $4, $4+1} else {print $1, $5-1, $5}}' > tss.bed
or 5kb flanking tss
zcat gencode.v29lift37.annotation.gtf.gz | awk '$3=="gene" {print $0}' | grep protein_coding | awk -v OFS="\t" '{if ($7=="+") {print $1, $4, $4+5000} else {print $1, $5-5000, $5}}' > promoters.bed
caveat: some genes are at the end of the chromosomes, add or minus 5000 may go beyond the point, use bedtools slop with a genome size file to avoid that.
download fetchChromSizes from http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
fetchChromSizes hg19 > chrom_size.txt
zcat gencode.v29lift37.annotation.gtf.gz | awk '$3=="gene" {print $0}' | awk -v OFS="\t" '{if ($7=="+") {print $1, $4, $4+1} else {print $1, $5-1, $5}}' | bedtools slop -i - -g chrom_size.txt -b 5000 > promoter_5kb.bed
- Day 46 reverse one column of a txt file
reverse column 3 and put it to column5
awk -v OFS="\t" '{"echo "$3 "| rev" | getline $5}{print $0}'
#or use perl reverse second column
perl -lane 'BEGIN{$,="\t"}{$rev=reverse $F[2];print $F[0],$F[1],$rev,$F[3]}
- Day 47 Get the full path of a file
realpath file.txt
readlink -f file.txt
- Day 48 unizp in parallel
https://github.com/Piezoid/pugz
Contrary to the pigz program which does single-threaded decompression (see https://github.com/madler/pigz/blob/master/pigz.c#L232), pugz found a way to do truly parallel decompression.
parallel gzip ::: *.fastq
pigz *.fastq
pugz *.fastq
- Day 49 Generate a sequence from 0 to 100 in steps of 10
for i in `seq 0 10 100`; do echo $i; done
- Day 50 kill all ssh sessions
pgrep ssh | xargs kill -9
- Day 51 Convert BAM file back to fastq
samtools view input.bam | \
awk 'BEGIN {FS="\t"} {print "@" $1 "\n" $10 "\n+\n" $11}' > output.bam.fastq
#newer samtools
samtools fastq input.bam > output.bam.fastq
You can also do this with bamtools and picard.
- Day 52 Untangle shuffled FASTQ file
If a FASTQ file has paired-end reads intermingled, and you want to separate them into separate /1 and /2 files. This script assumes the /1 reads precede the /2 reads.
seqtk seq -l0 Shuffled.fq | gawk '{if ((NR-1) % 8 < 4) print >> "Separate_1.fq"; else print >> "Separate_2.fq"}'
- Day 53 Histogram of allele frequencies in VCF files
cat *.vcf | awk 'BEGIN {FS=";"} {for (i=1;i<=NF;i++) {if (index($i,"AF1")!=0) {print $i} }}' | \
awk 'BEGIN {FS="="} {print int($2*10)/10}' | sort | uniq -c | sort -n -r | head
- Day 54 Count all the variants in at least two vcf files
cat *.vcf | grep -v '^#' | awk '{print $1 "\t" $2 "\t" $5}' | sort | uniq -d | wc -l
- Day 55 Sum and Mean of column 1
awk '{sum+=$1} END {print sum}'
awk '{x+=$1}END{print x/NR}'
- Day 56 Trim leading and trailing whitespaces
Trim leading whitespace in file.txt:
sed ‘s/^[ \t]*//’ file.txt
Trim trailing whitespace in file.txt:
sed ‘s/[ \t]*$//’ file.txt
Trim leading and trailing whitespace in file.txt:
sed ‘s/^[ \t]//;s/[ \t]$//’ file.txt
- Day 57 count all sequences in a fasta file:
grep "^>" my.fasta -c
Note, do not omite the quote on >
- Day 58 To keep only one copy of duplicated lines:
awk '!seen[$0]++'
- Day 59 To replace/squeeze multiple adjacent spaces by only one space:
tr -s " " my.file
- Day 60 Print all possible 3mer DNA sequence combinations:
echo {A,C,T,G}{A,C,T,G}{A,C,T,G}
https://github.com/stephenturner/oneliners from Stephen Turner