evolbioinfo/HIV1-UK


Drug resistance emergence and transmission in HIV-1 in the UK

This repository contains the pipelines and resulting data from the analyses of drug resistance patterns in the UK. The analyses are described in the article below:

Article

Zhukova A., Dunn D., Tostevin A., Gascuel O., on behalf of the UK HIV Drug Resistance Database & the Collaborative HIV, Anti-HIV Drug Resistance Network "Modelling drug resistance emergence and transmission in HIV-1 in the UK" Viruses 2023, 15(6), 1244 doi:10.3390/v15061244

Input data

We used the data from the UK HIV Drug Resistance Database [Dunn and Pillay 2007]: protease (PR) and reverse transcriptase (RT) sequences extracted during the resistance tests and the corresponding metadata (e.g., treatment status of the patient before the test: treatment-experienced, -naive, or unknown; and the date of the test). Such data can be requested from the Database.

Pipelines

The Snakemake [Köster et al., 2012] pipelines for data analyses can be found in the snakemake folder.

They include:

1. Subtyping pipeline Snakefile_subtype

This pipeline first reads and reformats the input data provided by the UK HIV Drug Resistance database (sequences and metadata). It then subtypes and aligns the sequences with jpHMM [Schultz et al. 2006] against the reference pol alignment from the Los Alamos HIV database [Kuiken et al. 2003].

Input data:

The input data is assumed to be placed in the data/input folder and contain:

  • a fasta file sequences.fasta.xz with the input sequences;
  • a comma-separated metadata file with the following columns:
    • patientindex -- id of the patient, used to identify sequence samples from the same patient;
    • sequence_id -- id of the sequence, corresponding to the ids in the fasta file (assumed to be the second column);
    • sampledate_my -- sequence sampling date in MM/YYYY format;
    • treatmentstatus -- treatment status before the sample: 1 (naive), 2 (experienced), or 3 (unknown).

We do not provide the above two files. They should be requested from the UK HIV Drug Resistance Database.

The pipeline subtypes and aligns the sequences with jpHMM [Schultz et al. 2006, doi:10.1186/1471-2105-7-265] against the reference pol alignment from the Los Alamos HIV database [Kuiken et al. 2003] (provided in HIV1_REF_2010_pol_DNA.fa.xz).

Output data:

This pipeline outputs:

  • an alignment file aln.uk.fa.xz (in the data/aln folder, not provided)
  • a tab-delimited metadata file data/metadata/metadata.uk.tab with the following columns:
    • id -- sequence id (the first column);
    • patientindex -- patient id;
    • sampledate_my -- sequence sampling date in MM/YYYY format;
    • treatmentstatus -- treatment status before the sample: 'naive' or 'experienced'; if the value is left blank it means that the status is unknown;
    • subtype_jpHMM -- sequence subtype.
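
The metadata reformatting described above (comma-separated input with numeric treatment codes, tab-delimited output with the sequence id first) can be illustrated with a minimal sketch. This is not the code used in Snakefile_subtype: the function name reformat_metadata and the STATUS_LABELS mapping are hypothetical, and the subtype_jpHMM column is omitted because it comes from jpHMM rather than from the input metadata.

```python
import csv
import io

# Assumed mapping from the numeric input codes to the output labels;
# code 3 (unknown) becomes a blank value, as described for the output file.
STATUS_LABELS = {"1": "naive", "2": "experienced", "3": ""}


def reformat_metadata(csv_text):
    """Convert comma-separated input metadata into tab-delimited text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # The sequence id becomes the first column, renamed to 'id'.
    writer.writerow(["id", "patientindex", "sampledate_my", "treatmentstatus"])
    for row in reader:
        writer.writerow([
            row["sequence_id"],
            row["patientindex"],
            row["sampledate_my"],
            STATUS_LABELS.get(row["treatmentstatus"].strip(), ""),
        ])
    return out.getvalue()


# Made-up example rows, for illustration only.
example = (
    "patientindex,sequence_id,sampledate_my,treatmentstatus\n"
    "p1,s1,03/2005,1\n"
    "p1,s2,11/2007,2\n"
    "p2,s3,06/2010,3\n"
)
print(reformat_metadata(example))
```
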
DIY
snakemake --snakefile Snakefile_subtype --keep-going --cores 4 --use-singularity --singularity-args "--home ~"

2. B and C dataset preparation pipeline Snakefile_aln

This pipeline extracts the alignments corresponding to subtypes B and C (adding 5 sequences of other pure subtypes as outgroups), keeping one sequence per patient (the first sample); extracts the drug resistance mutation (DRM) information for these sequences with sierrapy; and removes the DRM positions from the alignments.

Input data:
  • an alignment file aln.uk.fa.xz (in the data/aln folder, not provided);
  • a tab-delimited metadata file data/metadata/metadata.uk.tab with (at least) the following columns:
    • id -- sequence id (the first column);
    • patientindex -- patient id;
    • sampledate_my -- sequence sampling date in MM/YYYY format;
    • subtype_jpHMM -- sequence subtype.
Output data:
DIY
snakemake --snakefile Snakefile_aln --keep-going --cores 4 --use-singularity --singularity-args "--home ~"
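
The "one sequence per patient (the first sample)" step can be sketched as follows, assuming metadata rows with the id, patientindex and sampledate_my columns listed above. The function names are hypothetical, and the tie-breaking rule (keep the row seen first when two samples share a month) is an assumption, not something stated for the pipeline.

```python
from datetime import date


def parse_month_year(value):
    """Parse an 'MM/YYYY' string into a date (first day of that month)."""
    month, year = value.split("/")
    return date(int(year), int(month), 1)


def first_sample_per_patient(rows):
    """Return {patientindex: id} with the earliest-dated sequence per patient.

    Samples from the same month are tied; the row seen first is kept
    (an assumption made for this sketch).
    """
    earliest = {}
    for row in rows:
        sampled = parse_month_year(row["sampledate_my"])
        best = earliest.get(row["patientindex"])
        if best is None or sampled < best[0]:
            earliest[row["patientindex"]] = (sampled, row["id"])
    return {patient: seq_id for patient, (_, seq_id) in earliest.items()}


# Made-up example rows, for illustration only.
rows = [
    {"id": "s2", "patientindex": "p1", "sampledate_my": "11/2007"},
    {"id": "s1", "patientindex": "p1", "sampledate_my": "03/2005"},
    {"id": "s3", "patientindex": "p2", "sampledate_my": "06/2010"},
]
print(first_sample_per_patient(rows))
```
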

3. Time-scaled tree (for B and C) reconstruction pipeline Snakefile_trees

The pipeline reconstructs phylogenetic trees for B and C with RAxML-NG [Stamatakis, 2014] (model: GTR+G4+FO+IO), roots them with an outgroup (which is removed after rooting) and dates them with LSD2 [To et al., 2015] (with outlier removal).

Input data:
Output data:
DIY
snakemake --snakefile Snakefile_trees --keep-going --cores 4 --use-singularity --singularity-args "--home ~"

4. Data set statistics pipeline Snakefile_data_stats

The pipeline calculates various statistics for the B and C datasets (Tables 1 and A1 of the article), produces lists of DRMs with prevalence > 0.5% in B and C (only the first sequence is considered for each patient), reformats the DRM metadata by position, and extracts information on the DRMs' polymorphic status and associated ARVs.

Input data:
Output data:
DIY
snakemake --snakefile Snakefile_data_stats --keep-going --cores 4 --use-singularity --singularity-args "--home ~"
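
The prevalence filter mentioned above (DRMs with prevalence > 0.5%) can be sketched as below. The function name common_drms and its input layout (a mapping from sequence id to the DRMs found in that sequence) are hypothetical, and the sketch assumes that prevalence is the fraction of sequences carrying a given DRM, with one sequence per patient already selected.

```python
def common_drms(drms_by_sequence, threshold=0.005):
    """Return {DRM: prevalence} for DRMs whose prevalence exceeds threshold.

    drms_by_sequence maps a sequence id to an iterable of DRM names found
    in that sequence. Prevalence is the fraction of sequences with the DRM.
    """
    total = len(drms_by_sequence)
    if total == 0:
        return {}
    counts = {}
    for drms in drms_by_sequence.values():
        # A DRM is counted at most once per sequence.
        for drm in set(drms):
            counts[drm] = counts.get(drm, 0) + 1
    return {
        drm: n / total
        for drm, n in counts.items()
        if n / total > threshold
    }


# Made-up data: 1000 sequences, 6 with RT_K103N and 5 with RT_M184V.
data = {f"s{i}": [] for i in range(1000)}
for i in range(6):
    data[f"s{i}"].append("RT_K103N")
for i in range(5):
    data[f"s{i}"].append("RT_M184V")
print(sorted(common_drms(data)))
```

With the strict inequality used here, a DRM found in exactly 0.5% of the sequences is not retained.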

5. DRM pipeline Snakefile_drm

The pipeline reconstructs and visualizes ancestral characters for common DRMs with PastML [Ishikawa, Zhukova et al. 2019].

Input data:
Output data:
DIY
snakemake --snakefile Snakefile_drm --keep-going --cores 4 --use-singularity --singularity-args "--home ~"

6. DRM statistics pipeline Snakefile_drm_stats

The pipeline calculates various statistics for common DRMs.

Input data:
  • time-scaled trees for B and C in newick format (with named nodes and associated date and CI metadata, without outliers): data/timetrees/B/raxml.lsd2.nwk and data/timetrees/C/raxml.lsd2.nwk;
  • lists of DRM positions with DRMs with prevalence >0.5% for B and C: data/metadata/B/common_drms.txt and data/metadata/C/common_drms.txt;
  • tab-delimited DRM metadata files for B and C, containing information on DRMs with prevalence > 0.5%: data/metadata/B/metadata.drm.tab and data/metadata/C/metadata.drm.tab, with the following columns:
    • inputSequence -- sequence id (the first column);
    • DRM position columns, e.g. RT_K219ENQ -- the value is either 'sensitive' (if the amino acid at position RT:219 is known and is not E, N or Q), RT_K219E (if it is E), RT_K219N (if it is N), RT_K219Q (if it is Q) or blank (if the amino acid is unknown);
  • tab-delimited table containing ARV metadata for DRMs with prevalence > 0.5% data/metadata/arv_metadata.tab, with (at least) the following columns:
    • mutation -- DRM (if several DRMs are present on the same position, the name is aggregated, e.g. RT_K219ENQ);
    • drug class -- NNRTI, NRTI or PI;
    • drug abbreviation
    • year -- year of drug acceptance;
  • tab-delimited table containing metadata on polymorphicity for DRMs with prevalence > 0.5%: data/metadata/drm_types.tab, with the following columns:
    • DRM -- DRM;
    • type -- polymorphic or non-polymorphic;
  • a tab-delimited metadata file data/metadata/metadata.uk.tab with the following columns:
    • id -- sequence id (the first column);
    • patientindex -- patient id;
    • sampledate_my -- sequence sampling date in MM/YYYY format;
    • treatmentstatus -- treatment status before the sample: 'naive' or 'experienced'; if the value is left blank it means that the status is unknown;
    • subtype_jpHMM -- sequence subtype;
  • PastML tables containing marginal probabilities of ACR for different common DRMs: data/acr//pastml/<DRM_position>.raxml.lsd2/marginal_probabilities.character_<DRM_position>.model_F81.tab;
  • ACR visualizations for common B and C DRMs: data/acr/B/map.consensus.raxml.lsd2.html and data/acr/C/map.consensus.raxml.lsd2.html;
  • ACR visualizations for RT:T215DFSY in B: data/acr/B/map.RT_T215DFSY.raxml.lsd2.html;
  • tab-delimited table with DRM loss statistics from [Castro et al. 2013]: data/input/Castro.tab.
Output data:
DIY
snakemake --snakefile Snakefile_drm_stats --keep-going --cores 4 --use-singularity --singularity-args "--home ~"
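
The encoding of the DRM position columns in metadata.drm.tab, described in the input list above, can be illustrated with a short sketch. The function names and signatures are hypothetical and only mirror the RT_K219ENQ example; they are not taken from the pipeline code.

```python
def drm_column_name(gene, reference, position, resistant):
    """Build an aggregated column name, e.g. ('RT', 'K', 219, 'ENQ') -> 'RT_K219ENQ'."""
    return f"{gene}_{reference}{position}{resistant}"


def encode_drm_value(gene, reference, position, resistant, observed):
    """Encode one sequence's state at a DRM position.

    observed is the amino acid seen at the position, or None/'' if unknown.
    Returns '' (unknown), the specific DRM name (resistant amino acid),
    or 'sensitive' (any other known amino acid).
    """
    if not observed:
        return ""
    if observed in set(resistant):
        return f"{gene}_{reference}{position}{observed}"
    return "sensitive"


print(drm_column_name("RT", "K", 219, "ENQ"))
for amino_acid in ("E", "N", "K", None):
    print(repr(encode_drm_value("RT", "K", 219, "ENQ", amino_acid)))
```
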
