
Give within-host-variants its proper .tsv file extension.

Use .tsv to signal that a text file is tab-delimited. This is nicely
self-documenting and also leads to improved display on GitHub.
trvrb committed Sep 28, 2019
1 parent 535c936 commit a3715ff6c2ad32da169099c9c6f39c5b42089cb1
@@ -1,19 +1,19 @@
# Data

## Trees
Tree files shown in Figure 1 are available in json format [here](https://github.com/blab/h5n1-cambodia/tree/master/data/tree-jsons). These jsons were generated using the [Nextstrain avian-flu](https://github.com/nextstrain/avian-flu) pipeline with no geographic or regional subsampling.

## Consensus genomes
All consensus sequences are available [here](https://github.com/blab/h5n1-cambodia/tree/master/data/consensus-genomes). The fasta header contains the following information: strain name | sample collection date | country of sampling | host species.


## Within-host data
-Human reads were removed from all raw fastq files by mapping to the human reference genome GRCh38 with bowtie2. Only unmapped reads were further processed and used for data analysis. The raw fastq files with human reads filtered out are all publicly available in the Sequence Read Archive under the accession number [PRJNA547644](https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA547644), accession numbers SRX5984186-SRX5984198. All within-host variants reported in the manuscript and analyzed are available [here](https://github.com/blab/h5n1-cambodia/blob/master/data/within-host-variants-1%25.txt). This data file includes all variants present at a frequency of at least 1% in all human and duck samples. Fastq files were processed and variants called using [this pipeline](https://github.com/lmoncla/illumina_pipeline), briefly outlined below:
+Human reads were removed from all raw fastq files by mapping to the human reference genome GRCh38 with bowtie2. Only unmapped reads were further processed and used for data analysis. The raw fastq files with human reads filtered out are all publicly available in the Sequence Read Archive under the accession number [PRJNA547644](https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA547644), accession numbers SRX5984186-SRX5984198. All within-host variants reported in the manuscript and analyzed are available [here](https://github.com/blab/h5n1-cambodia/blob/master/data/within-host-variants-1%25.tsv). This data file includes all variants present at a frequency of at least 1% in all human and duck samples. FASTQ files were processed and variants called using [this pipeline](https://github.com/lmoncla/illumina_pipeline), briefly outlined below:

1. Adapter and quality trimming with [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic )
2. Mapping with [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) version 3.2.2.
3. Manual inspection of mapping and consensus genome calling with [Geneious](https://www.geneious.com/)
4. Re-mapping fastq files to the called consensus with [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) version 3.2.2.
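
For the human-read filtering step described above, reads that fail to align to GRCh38 can be kept with bowtie2's `--un` option. A minimal sketch, assuming an index named `GRCh38_index` and placeholder fastq names rather than the exact files used in this study:

```sh
# Map reads against a GRCh38 bowtie2 index and write reads that do NOT
# align to a separate file; only those unmapped reads go on to analysis.
# Index basename and fastq names are illustrative placeholders.
bowtie2 -x GRCh38_index -U sample_R1.fastq,sample_R2.fastq \
  -S human_hits.sam --un sample_human_depleted.fastq
```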


**Trimming**
@@ -22,20 +22,16 @@ Trimming was performed with [Trimmomatic](http://www.usadellab.org/cms/?page=tri
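
The trimming parameters themselves are elided in this hunk. A generic Trimmomatic invocation of the kind referenced here might look like the following; the jar version, adapter file, and thresholds are placeholders, not the settings used for these samples:

```sh
# Single-end adapter and quality trimming; all parameters are illustrative.
java -jar trimmomatic-0.36.jar SE sample.fastq sample.trimmed.fastq \
  ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:5:30 MINLEN:100
```
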
**Mapping**
We performed a local mapping of our trimmed reads to reference sequences previously released by [Rith et al.](https://jvi.asm.org/content/88/23/13897.long) using [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml), with the following command: `bowtie2 -x reference_sequence.fasta -U read1.trimmed.fastq,read2.trimmed.fastq -S output.sam --local`
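
Since `-x` names a bowtie2 index rather than a raw fasta, an index sharing the reference's name would typically be built beforehand, for example:

```sh
# Build a bowtie2 index whose basename matches the -x argument used above.
bowtie2-build reference_sequence.fasta reference_sequence.fasta
```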

The mapping (bam) file was manually inspected in [Geneious](https://www.geneious.com/).
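
The command above writes SAM, so a conversion and sort step along these lines (a samtools sketch; file names are placeholders) would typically produce the bam that gets inspected:

```sh
# Convert bowtie2's SAM output into a coordinate-sorted, indexed BAM.
samtools view -bS output.sam | samtools sort -o output.sorted.bam -
samtools index output.sorted.bam
```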

**Consensus sequence calling**
Consensus sequences were called in Geneious, with nucleotide sites below 100x coverage called as Ns. Consensus genomes were exported in fasta format and are available [here](https://github.com/blab/h5n1-cambodia/tree/master/data/h5n1-consensus-genomes.fasta).

**Remapping**
To avoid issues with mapping to improper reference sequences, we then remapped each sample's fastq files to its own consensus sequence. These bam files were again manually inspected in Geneious, and a final consensus sequence was called. Consensus genomes are available [here](https://github.com/blab/h5n1-cambodia/tree/master/data/consensus-genomes) as fasta files.
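
Scripted per sample, the remapping might look roughly like this; sample names and file layout are hypothetical:

```sh
# Index each sample's own first-pass consensus and remap its trimmed reads
# to that consensus; names and paths are illustrative placeholders.
for sample in sample1 sample2; do
  bowtie2-build ${sample}_consensus.fasta ${sample}_consensus
  bowtie2 -x ${sample}_consensus -U ${sample}.trimmed.fastq \
    -S ${sample}_remapped.sam --local
done
```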

**Variant calling**
Variants were called using [Varscan](http://varscan.sourceforge.net/), requiring minimum coverage of 100x at the polymorphic site, a minimum quality of Q30, and a minimum SNP frequency of 1% with the following command: `java -jar VarScan.v2.3.9.jar mpileup2snp input.pileup --min-coverage 100 --min-avg-qual 30 --min-var-freq 0.01 --strand-filter 1 --output-vcf 1 > output.vcf`
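
The `input.pileup` consumed by this command would typically be generated beforehand with samtools mpileup against the sample's own consensus; a sketch with placeholder file names:

```sh
# Build the pileup that VarScan's mpileup2snp reads, using the sample's
# consensus as the reference; file names are illustrative placeholders.
samtools mpileup -f sample_consensus.fasta sample_remapped.sorted.bam > input.pileup
```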

**Amino acid annotation**
Coding region changes were annotated using [this jupyter notebook](https://github.com/blab/h5n1-cambodia/blob/master/scripts/VCF%20annotater.ipynb).




File renamed without changes.
@@ -57,7 +57,7 @@
"outputs": [],
"source": [
"# variant calls file to load in\n",
"variant_calls = \"/Users/lmoncla/src/h5n1-cambodia/data/within-host-variants-1%.txt\""
"variant_calls = \"/Users/lmoncla/src/h5n1-cambodia/data/within-host-variants-1%.tsv\""
]
},
{
@@ -62,7 +62,7 @@
"outputs": [],
"source": [
"# variant calls file to load in\n",
"variant_calls = \"/Users/lmoncla/src/h5n1-cambodia/data/within-host-variants-1%.txt\""
"variant_calls = \"/Users/lmoncla/src/h5n1-cambodia/data/within-host-variants-1%.tsv\""
]
},
{
@@ -3496,7 +3496,7 @@
],
"source": [
"# variant calls file to load in\n",
"variant_calls = \"/Users/lmoncla/src/h5n1-cambodia/data/within-host-variants-1%.txt\"\n",
"variant_calls = \"/Users/lmoncla/src/h5n1-cambodia/data/within-host-variants-1%.tsv\"\n",
"\n",
"snps_df = pd.read_csv(variant_calls, sep='\\t', header='infer')\n",
"\n",
@@ -35,7 +35,7 @@
"outputs": [],
"source": [
"directory = \"/Users/lmoncla/Documents/H5N1_Cambodian_outbreak_study/comparison-to-known-sites/\"\n",
"SNP_calls = \"/Users/lmoncla/src/h5n1-cambodia/data/within-host-variants-1%.txt\"\n",
"SNP_calls = \"/Users/lmoncla/src/h5n1-cambodia/data/within-host-variants-1%.tsv\"\n",
"flugenes = [\"PB2\",\"PB1\",\"PA\",\"HA\",\"NP\",\"NA\",\"M1\",\"M2\",\"NS1\",\"NEP\"]\n",
"genes = []\n",
"for f in flugenes: \n",
@@ -53,7 +53,7 @@
"outputs": [],
"source": [
"# variant calls file to load in\n",
"variant_calls = \"/Users/lmoncla/src/h5n1-cambodia/data/within-host-variants-1%.txt\""
"variant_calls = \"/Users/lmoncla/src/h5n1-cambodia/data/within-host-variants-1%.tsv\""
]
},
{
@@ -827,7 +827,7 @@
}
],
"source": [
"within_host= \"/Users/lmoncla/src/h5n1-cambodia/data/within-host-variants-1%.txt\"\n",
"within_host= \"/Users/lmoncla/src/h5n1-cambodia/data/within-host-variants-1%.tsv\"\n",
"wh = pd.DataFrame.from_csv(within_host, sep=\"\\t\")\n",
"wh.head()"
]
