# Custom Metadata Formatting

This notebook performs custom metadata formatting and checks beyond `extract-metadata.py`. Add cells to this notebook to perform your custom metadata formatting.

In [29]:
import pandas as pd

In [30]:
if 'snakemake' in globals():
    metadata_file = snakemake.input.metadata
    lineage_metadata = snakemake.input.lineages
    metadata_formatted_file = snakemake.output.metadata
else:
    metadata_file = "../../results/metadata/metadata.tsv"
    lineage_metadata = "../../data/ViennaRNA_CHIKV_metadata.tsv"
    metadata_formatted_file = "../../results/metadata/metadata_formatted.tsv"

In [31]:
metadata_df = pd.read_csv(metadata_file, sep="\t")
metadata_df.head()

Unnamed: 0,strain,accession,url,authors,title,journal,paper_link,submission,isolate,organism,host,date,location,ambiguous,length,country,local,region,subregion
0,S27-African-prototype_NC_004162,NC_004162.2,https://www.ncbi.nlm.nih.gov/nucleotide/NC_004...,"Khan,A.H., Morita,K., del Carmen Parquet,M., H...",Complete nucleotide sequence of chikungunya vi...,"J. Gen. Virol. 83 (Pt 12), 3075-3084 (2002)",https://pubmed.ncbi.nlm.nih.gov/12466484,Submitted (10-JAN-2003) Department of Virology...,S27-African-prototype,Chikungunya virus,?,?,,0,11826,?,?,?,?
1,SIMI-057_PV066168,PV066168.1,https://www.ncbi.nlm.nih.gov/nucleotide/PV0661...,"Horthongkham,N. and Athipanyasilp,N.",Genetic characterization of chikunugunya virus...,Unpublished,,"Submitted (04-FEB-2025) Microbiology, Mahidol ...",SIMI-057,Chikungunya virus,Homo sapiens,2019-12-18,Thailand,0,11647,Thailand,?,Asia,South-Eastern Asia
2,SIMI-058_PV066169,PV066169.1,https://www.ncbi.nlm.nih.gov/nucleotide/PV0661...,"Horthongkham,N. and Athipanyasilp,N.",Genetic characterization of chikunugunya virus...,Unpublished,,"Submitted (04-FEB-2025) Microbiology, Mahidol ...",SIMI-058,Chikungunya virus,Homo sapiens,2019-12-10,Thailand,0,11647,Thailand,?,Asia,South-Eastern Asia
3,CHIKV-NIHPAK-02-2024_PV054360,PV054360.1,https://www.ncbi.nlm.nih.gov/nucleotide/PV0543...,"Umair,M., Hakim,R., Jamal,Z. and Salman,M.",Direct Submission,Unpublished,,Submitted (31-JAN-2025) Department of Virology...,CHIKV-NIHPAK-02-2024,Chikungunya virus,Homo sapiens,2024-06-09,Pakistan,3,11795,Pakistan,?,Asia,Southern Asia
4,CHIKV-NIHPAK-03-2024_PV054361,PV054361.1,https://www.ncbi.nlm.nih.gov/nucleotide/PV0543...,"Umair,M., Hakim,R., Jamal,Z. and Salman,M.",Direct Submission,Unpublished,,Submitted (31-JAN-2025) Department of Virology...,CHIKV-NIHPAK-03-2024,Chikungunya virus,Homo sapiens,2024-08-09,Pakistan,34,11465,Pakistan,?,Asia,Southern Asia


## Add lineage information

There are specific lineages of CHIKV in [this Nextstrain Build](https://nextstrain.org/groups/ViennaRNA/CHIKVnext). I'll use the metadata from this build to label shared accessions by lineage. The remaining lineages will be inferred by `augur`.

In [32]:
lineage_df = pd.read_csv(lineage_metadata, sep="\t")
lineage_df.head()

Unnamed: 0,strain,date,country,region,lineage,year,GRI Lineage Level 0,GRI Lineage Level 1,GRI Lineage Level 2,GRI Lineage Level 3,GRI Lineage Level 4,GRI Lineage Level 5,GRI Lineage Level 6,GRI Lineage Level 7,GRI Lineage Level 8,author,accession
0,HM045816.1,1966-11-23,Senegal,WAf,WA,1966.0,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,"Volk,S.M., Chen,R., Tsetsarkin,K.A., Adams,A.P...",
1,HM045785.1,1966-11-01 (1966-11-01 - 1966-11-27),Senegal,WAf,WA,1966.0,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,"Volk,S.M., Chen,R., Tsetsarkin,K.A., Adams,A.P...",
2,HM045815.1,1979-02-01 (1979-02-01 - 1979-02-25),Senegal,WAf,WA,1979.0,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,"Volk,S.M., Chen,R., Tsetsarkin,K.A., Adams,A.P...",
3,HM045786.1,1964-07-07,Nigeria,WAf,WA,1964.0,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,"Volk,S.M., Chen,R., Tsetsarkin,K.A., Adams,A.P...",
4,MK028837.1,1983-07-11 (1983-01-01 - 1983-12-31),Senegal,WAf,WA,1983.0,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,not assigned,"Morrison,T., Hawman,D., Powers,A., Agnihothram...",


In [33]:
# Join the lineage information to the metadata
lineages = lineage_df[["strain", "lineage"]]
lineages = lineages.rename(columns={"strain": "accession"})
metadata_df = metadata_df.merge(lineages, on="accession", how="left")

# Replace NaN values in the 'lineage' column with '?' for trait inference
metadata_df["lineage"] = metadata_df["lineage"].fillna("?")
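
The left join above assumes that each accession appears at most once in the lineage table; a duplicated accession would silently duplicate metadata rows. Below is a minimal sketch of one way to check this, using pandas' `validate` and `indicator` merge options. The toy frames and accession values are made up for illustration and are not taken from the real inputs.

```python
import pandas as pd

# Toy stand-ins for metadata_df and lineage_df (illustrative values only)
metadata_toy = pd.DataFrame({"accession": ["A1.1", "B2.1", "C3.1"]})
lineage_toy = pd.DataFrame({"strain": ["A1.1", "B2.1"], "lineage": ["WA", "IOL"]})

lineages_toy = lineage_toy[["strain", "lineage"]].rename(columns={"strain": "accession"})

# validate="many_to_one" raises pandas.errors.MergeError if an accession is
# duplicated in the lineage table; indicator=True adds a '_merge' column.
merged_toy = metadata_toy.merge(
    lineages_toy, on="accession", how="left", validate="many_to_one", indicator=True
)

# Rows flagged 'left_only' had no lineage match and will receive '?'
n_unmatched = int((merged_toy["_merge"] == "left_only").sum())

merged_toy = merged_toy.drop(columns="_merge")
merged_toy["lineage"] = merged_toy["lineage"].fillna("?")
print(n_unmatched)
print(merged_toy["lineage"].tolist())
```

If the check is useful, the same two keyword arguments could be added to the real merge; the row count of `metadata_df` should then be identical before and after the join.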

In [34]:
# Count the samples assigned to each lineage, including unassigned samples ('?')
metadata_df.groupby("lineage").size().reset_index(name="count").sort_values(by="count", ascending=False)

Unnamed: 0,lineage,count
0,?,1481
5,IOL,551
2,AUL-Am,399
7,SAL,231
1,AUL,111
6,MAL,36
4,EAL,21
9,WA,12
3,African/Asian Lineages,8
8,Sister Taxa to ECSA,1


This lineage information is a little too detailed. We're interested in the four main CHIKV lineages in the literature: *West African (WA); East, Central, and South African (ECSA); Indian Ocean (IOL); and Asian*.

In [35]:
# Rename 'lineage' to 'sublineage' for clarity
metadata_df = metadata_df.rename(columns={"lineage": "sublineage"})

# Make a new column called 'lineage' that contains only the main 4 lineages
lineage_mapping = {
    "African/Asian Lineages": "Asian",
    "Sister Taxa to ECSA": "Asian",
    "AUL-Am": "Asian",
    "AUL": "Asian",
    "EAL": "IOL",
    "IOL": "IOL",
    "MAL": "ECSA",
    "SAL": "ECSA",
    "WA": "WA",
    "?": "?",
}
metadata_df["lineage"] = metadata_df["sublineage"].map(lineage_mapping).fillna(metadata_df['sublineage'])
metadata_df.groupby("lineage").size().reset_index(name="count").sort_values(by="count", ascending=False)

Unnamed: 0,lineage,count
0,?,1481
3,IOL,572
1,Asian,519
2,ECSA,267
4,WA,12
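
Because `.map(...).fillna(...)` keeps any sublineage that is missing from `lineage_mapping` unchanged, a new label in the upstream build would pass through into `lineage` unnoticed. The sketch below shows one possible way to surface such labels before filling. It uses toy data, and the label `NewClade` is hypothetical.

```python
import pandas as pd

# Shortened, illustrative version of the mapping used above
lineage_mapping_toy = {"AUL": "Asian", "IOL": "IOL", "WA": "WA", "?": "?"}

# 'NewClade' is a made-up label that the mapping does not cover
sub = pd.Series(["AUL", "IOL", "NewClade", "?", "WA"], name="sublineage")

mapped = sub.map(lineage_mapping_toy)

# Labels that map() could not translate come back as NaN
unmapped_labels = sorted(sub[mapped.isna()].unique())

# Same fallback as in the notebook: keep the original label when unmapped
lineage_toy = mapped.fillna(sub)
print(unmapped_labels)
```

Listing `unmapped_labels` (or asserting that it is empty) makes it clear when the mapping needs a new entry.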


## Format `host` information

The host-level metadata is too granular. We mostly care whether the source of the sample was a human or a mosquito. So we'll rename the existing `organism` column (which holds the virus name) to `virus`, rename the `host` column to `organism`, and group the various organisms into broader `host` categories.

In [36]:
metadata_df.groupby("host").size().reset_index(name="count").sort_values(by="count", ascending=False)

Unnamed: 0,host,count
11,Homo sapiens,2671
0,?,121
1,Aedes aegypti,25
3,Aedes albopictus,12
5,Aedes furcifer,5
12,Macaca fascicularis,4
10,Culex quinquefasciatus,3
6,Aedes luteocephalus,2
13,Mosquito,2
4,Aedes dalzieli,1


In [37]:
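# The existing 'organism' column holds the virus name, so rename it to 'virus' first
metadata_df.rename(columns={"organism": "virus"}, inplace=True)
# Then rename 'host' to 'organism' so that 'host' can hold the broader categories
metadata_df.rename(columns={"host": "organism"}, inplace=True)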

# Broader host category for each organism; organisms not listed here keep their original value
host_mapping = {
    'Homo sapiens': 'Human',
    'Mouse': 'Other Mammals',
    'Chiroptera': 'Other Mammals',
    'Macaca fascicularis': 'Other Mammals',
    'Aedes aegypti': 'Mosquito',
    'Mosquito': 'Mosquito',
    'Culex quinquefasciatus': 'Mosquito',
    'Aedes albopictus': 'Mosquito',
    'Aedes furcifer': 'Mosquito',
    'Aedes opok': 'Mosquito',
    'Anopheles funestus': 'Mosquito',
    'Aedes luteocephalus': 'Mosquito',
    'Aedes dalzieli': 'Mosquito',
    'Aedes africanus': 'Mosquito',
    '?': '?'
}

# Map each organism to its host category; fillna preserves values not in the mapping
metadata_df['host'] = metadata_df['organism'].map(host_mapping).fillna(metadata_df['organism'])
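
The mapping above lists each mosquito species explicitly, so a species that is not listed keeps its raw name in `host`. One possible fallback is sketched here, under the assumption that any organism whose genus is *Aedes*, *Culex*, or *Anopheles* should count as a mosquito. The helper name `categorize_host`, the genus set, and the toy values are illustrative and not part of the pipeline.

```python
import pandas as pd

# Assumed list of mosquito genera; extend as needed
MOSQUITO_GENERA = {"Aedes", "Culex", "Anopheles"}

def categorize_host(organism: pd.Series, mapping: dict) -> pd.Series:
    """Apply the explicit mapping first, then fall back to a genus check."""
    host = organism.map(mapping)
    # The genus is taken to be the first whitespace-separated word
    genus = organism.str.split().str[0]
    is_mosquito = host.isna() & genus.isin(MOSQUITO_GENERA)
    host = host.mask(is_mosquito, "Mosquito")
    # Anything still unmapped keeps its original value
    return host.fillna(organism)

organism_toy = pd.Series(
    ["Homo sapiens", "Aedes vittatus", "Culex pipiens", "Sus scrofa", "?"]
)
host_toy = categorize_host(organism_toy, {"Homo sapiens": "Human", "?": "?"})
print(host_toy.tolist())
```

Explicit entries still take precedence, so the genus rule only applies to organisms the mapping does not mention.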


## Write formatted metadata

End of custom formatting. Write the `metadata_df` to a file.

In [38]:
metadata_df.to_csv(metadata_formatted_file, sep="\t", index=False)