# Pangolin lineages

We will now assign a pangolin lineage to each genome and add the results to the metadata.

Let's first update pangolin to the latest version.

In [1]:
!pangolin --update

pangolin already latest release (v.2.4.2)
pangoLEARN already latest release (2021-04-28)


## Run pangolin

In [2]:
import time

input_sequences = 'data/all.fasta'
output_lineages = 'data/lineage_report.csv'
output_log_out = 'data/pangolin.out.log'
output_log_err = 'data/pangolin.err.log'

begin_time = time.time()
# -t sets the number of threads; stderr and stdout are redirected to separate log files
!pangolin -t 48 --outfile {output_lineages} {input_sequences} 2>{output_log_err} 1>{output_log_out}
end_time = time.time()
print(f'Took {(end_time - begin_time)/60:0.1f} minutes')
print(f'Output in: {output_lineages}')

Took 0.3 minutes
Output in: data/lineage_report.csv
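The `!pangolin` shell escape above only works inside IPython/Jupyter, and it does not raise an error if the command fails. As a rough sketch of how the same call could be made from a plain Python script, the following uses `subprocess.run` with the same `-t` and `--outfile` options. The helper names `build_pangolin_command` and `run_pangolin` are hypothetical and not part of pangolin or this notebook.

```python
import subprocess
import time


def build_pangolin_command(input_fasta, output_csv, threads=48):
    """Build the argument list for the same pangolin call used in the notebook cell."""
    return ['pangolin', '-t', str(threads), '--outfile', output_csv, input_fasta]


def run_pangolin(input_fasta, output_csv, log_out, log_err, threads=48):
    """Run pangolin, write stdout/stderr to log files, and return elapsed minutes."""
    command = build_pangolin_command(input_fasta, output_csv, threads)
    begin_time = time.time()
    with open(log_out, 'w') as out, open(log_err, 'w') as err:
        # check=True raises CalledProcessError if pangolin exits with a non-zero status
        subprocess.run(command, stdout=out, stderr=err, check=True)
    return (time.time() - begin_time) / 60
```

Calling `run_pangolin('data/all.fasta', 'data/lineage_report.csv', 'data/pangolin.out.log', 'data/pangolin.err.log')` would then mirror the cell above, assuming `pangolin` is on the `PATH`.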


In [3]:
!tail {output_log_err}

Job counts:
	count	jobs
	1	add_failed_seqs
	1
Job counts:
	count	jobs
	1	overwrite
	1
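The "Job counts" lines appear to come from Snakemake, which colours its log output with ANSI escape sequences, so the raw log file may contain sequences such as `\x1b[33m`. A minimal sketch of stripping them before inspecting or searching the log is shown below. The helper `strip_ansi` is a hypothetical name, and the sample text is modelled on the log tail rather than read from the actual file.

```python
import re

# Matches ANSI colour/style sequences such as ESC[33m and ESC[0m
ANSI_COLOUR = re.compile(r'\x1b\[[0-9;]*m')


def strip_ansi(text):
    """Remove ANSI colour escape sequences from a string."""
    return ANSI_COLOUR.sub('', text)


# Sample modelled on the log tail; not read from data/pangolin.err.log
sample = '\x1b[33mJob counts:\n\tcount\tjobs\n\t1\toverwrite\n\t1\x1b[0m'
cleaned = strip_ansi(sample)
print(cleaned)
```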


## Merge pangolin lineages into the existing metadata

In [8]:
import pandas as pd

pangolin_df = pd.read_csv(output_lineages)
print(f'length pangolin_df: {len(pangolin_df)}')
pangolin_df.head(2)

length pangolin_df: 395


Unnamed: 0,taxon,lineage,conflict,pangolin_version,pangoLEARN_version,pango_version,status,note
0,NC_045512,B,0.0,2.4.2,2021-04-28,v1.1.23,passed_qc,
1,MN908947,B,0.0,2.4.2,2021-04-28,v1.1.23,passed_qc,
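Before merging, it can be useful to check how many sequences passed pangolin's quality control, since the report includes a `status` column. The sketch below uses a small hand-made table rather than the real report. The third row and its `fail` status value are assumptions added for illustration.

```python
import pandas as pd

# Toy stand-in for pangolin_df; the last row is hypothetical
example_df = pd.DataFrame({
    'taxon': ['NC_045512', 'MN908947', 'hypothetical_1'],
    'lineage': ['B', 'B', 'None'],
    'status': ['passed_qc', 'passed_qc', 'fail'],
})

# Number of sequences per QC status
status_counts = example_df['status'].value_counts()
print(status_counts)

# Rows that did not pass QC, for manual inspection
not_passed = example_df[example_df['status'] != 'passed_qc']
print(not_passed)
```

On the real data, the same two expressions could be applied to `pangolin_df` directly.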


In [10]:
metadata_genbank_df = pd.read_csv('data/metadata-genbank.tsv', sep='\t')
print(f'length metadata_genbank_df: {len(metadata_genbank_df)}')
metadata_genbank_df.head(2)

length metadata_genbank_df: 395


Unnamed: 0,genbank_accession,genbank_accession.1,strain,region,location,collection_date,submitted_date,host,isolation_source,biosample_accession,length,count_ns,percent_ns
0,NC_045512,NC_045512,NC_045512,Asia,China,2019-12,2020-01-13T00:00:00Z,Homo sapiens,,,29903,0,0.0
1,MN908947,MN908947,MN908947,Asia,China,2019-12,2020-01-12T00:00:00Z,Homo sapiens,,,29903,0,0.0


In [14]:
# Default inner join: rows without a matching accession in both tables are dropped
metadata_df = metadata_genbank_df.merge(pangolin_df, left_on='genbank_accession', right_on='taxon').set_index('genbank_accession')
metadata_df.head(2)

genbank_accession,genbank_accession.1,strain,region,location,collection_date,submitted_date,host,isolation_source,biosample_accession,length,count_ns,percent_ns,taxon,lineage,conflict,pangolin_version,pangoLEARN_version,pango_version,status,note
NC_045512,NC_045512,NC_045512,Asia,China,2019-12,2020-01-13T00:00:00Z,Homo sapiens,,,29903,0,0.0,NC_045512,B,0.0,2.4.2,2021-04-28,v1.1.23,passed_qc,
MN908947,MN908947,MN908947,Asia,China,2019-12,2020-01-12T00:00:00Z,Homo sapiens,,,29903,0,0.0,MN908947,B,0.0,2.4.2,2021-04-28,v1.1.23,passed_qc,
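Because `merge` performs an inner join by default, any genome missing from either table would silently disappear from the merged metadata. One way to guard against this, sketched here on toy data, is to use the `validate` and `indicator` parameters of `DataFrame.merge`. The `hypothetical_A` accession is made up to show an unmatched row.

```python
import pandas as pd

# Toy stand-ins for metadata_genbank_df and pangolin_df
left = pd.DataFrame({
    'genbank_accession': ['NC_045512', 'MN908947', 'hypothetical_A'],
    'region': ['Asia', 'Asia', 'Europe'],
})
right = pd.DataFrame({
    'taxon': ['NC_045512', 'MN908947'],
    'lineage': ['B', 'B'],
})

# how='left' keeps every metadata row; validate raises MergeError on duplicate keys;
# indicator adds a '_merge' column recording where each row was found
checked = left.merge(right, left_on='genbank_accession', right_on='taxon',
                     how='left', validate='one_to_one', indicator=True)

# Accessions with no pangolin result
unmatched = checked.loc[checked['_merge'] == 'left_only', 'genbank_accession'].tolist()
print(unmatched)
```

In this notebook both tables have 395 rows, so an empty `unmatched` list would confirm that no genome was lost in the merge.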


In [15]:
output_metadata = 'data/metadata.tsv'
metadata_df.to_csv(output_metadata, sep='\t')
print(f'Final metadata: {output_metadata}')

Final metadata: data/metadata.tsv
