# Generating a data collection

In this notebook, the main focus is collecting and wrangling all the data needed for our series of processes. The data are gathered from two main databases: [DBatVir](http://www.mgc.ac.cn/DBatVir/) and [NCBI Genome](https://www.ncbi.nlm.nih.gov/genome/viruses/).

- DBatVir -> database that contains the known viral families associated with each bat
- NCBI -> used to obtain genome and gene information

## Step 1: Collect
The first step is to gather all the data from the databases and save it to the `./db` folder. The `./db` folder serves as our local database whenever we need to access data, which avoids repeatedly sending requests to the remote databases.
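
The caching idea described here can be sketched as a small helper. This is a minimal illustration, not the way the `vseek` functions are implemented: `load_or_fetch` and its `fetch` argument are hypothetical names introduced for this example, and the cache is assumed to be a plain CSV file.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_fetch(cache_file: Path, fetch: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the cached table if it exists; otherwise fetch it once and cache it.

    `fetch` is a hypothetical zero-argument function that performs the remote request.
    """
    if cache_file.is_file():
        # cache hit: read the local copy, no request is sent
        return pd.read_csv(cache_file)

    # cache miss: run the (possibly slow) remote request and store the result
    df = fetch()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(cache_file, index=False)
    return df
```

With a helper of this shape, re-running a cell only touches the network the first time; every later call reads the file under `./db`.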


## Step 2: Filter

## Step 3: Profile

In [1]:
import sys
import os
from typing import Union
from pathlib import Path
import gzip

import numpy as np
import pandas as pd
import requests

# vseek imports
sys.path.append("../")
import vseek.common.vseek_paths as vsp
from vseek.apis.dbat_vir_db import collect_dbatvir_data
from vseek.apis.ncbi import get_all_viral_accessions, get_viral_genomes
from vseek.common.loader import load_bat_virus_data, load_geolocations, load_human_ppi, load_species_atlas
from vseek.apis.ncbi import get_viral_genes, get_taxa_id


In [2]:
# ADD INPUTS HERE 
email = "erikishere3@gmail.com"

In [3]:
# NOTE: the data has already been downloaded
# collect the data from DBatVir
batvir_df = collect_dbatvir_data()
batvir_df

Unnamed: 0,Viruses,Viral family,From Bat,Bat diet type,Bat family,Sample type,Collection year,Sampling country,Sequence coding,References
0,Bat adeno-associated virus 07YN,Parvoviridae,unclassified Chiroptera,,,Feces,2007,China,Cap,Unpublished
1,Bat adeno-associated virus 09YN,Parvoviridae,unclassified Chiroptera,,,Feces,2009,China,Cap,Unpublished
2,Bat adeno-associated virus 1003-HB-Mr,Parvoviridae,Myotis ricketti,Insectivore and piscivore,Vespertilionidae,Feces,2007,China,Cap,"J Gen Virol 2010, 91(Pt 10):2601-9"
3,Bat adeno-associated virus 1008-HB-Mr,Parvoviridae,Myotis ricketti,Insectivore and piscivore,Vespertilionidae,Feces,2007,China,Cap,"J Gen Virol 2010, 91(Pt 10):2601-9"
4,Bat adeno-associated virus 1019-HB-Rs,Parvoviridae,Rhinolophus sinicus,Insectivore,Rhinolophidae,Feces,2007,China,Cap,"J Gen Virol 2010, 91(Pt 10):2601-9"
...,...,...,...,...,...,...,...,...,...,...
13957,Rhinolophus pusillus coronavirus ZSR42,Coronaviridae,Rhinolophus pusillus,Insectivore,Rhinolophidae,Mix,2015,China,RdRp,"Sci Rep 2017, 7(1):10917"
13958,Rhinolophus pusillus norovirus ZSR43,Caliciviridae,Rhinolophus pusillus,Insectivore,Rhinolophidae,Mix,2015,China,RdRp,"Sci Rep 2017, 7(1):10917"
13959,Rhinolophus pusillus norovirus ZSR45,Caliciviridae,Rhinolophus pusillus,Insectivore,Rhinolophidae,Mix,2015,China,RdRp,"Sci Rep 2017, 7(1):10917"
13960,Rhinolophus pusillus coronavirus ZSR6,Coronaviridae,Rhinolophus pusillus,Insectivore,Rhinolophidae,Mix,2015,China,RdRp,"Sci Rep 2017, 7(1):10917"


In [4]:
# NOTE: the data has already been downloaded
# download the NCBI viral accession list
ncbi_acc_df = get_all_viral_accessions()
ncbi_acc_df

Unnamed: 0,Representative,Neighbor,Host,Selected lineage,Taxonomy name,Segment name
0,NC_003663,HQ420896,"human,vertebrates","Poxviridae,Orthopoxvirus,Cowpox virus",Cowpox virus,segment
1,NC_003663,KY463519,"human,vertebrates","Poxviridae,Orthopoxvirus,Cowpox virus",Cowpox virus,segment
2,NC_003663,HQ420897,"human,vertebrates","Poxviridae,Orthopoxvirus,Cowpox virus",Cowpox virus,segment
3,NC_003663,MK035759,"human,vertebrates","Poxviridae,Orthopoxvirus,Cowpox virus",Cowpox virus,segment
4,NC_003663,KY569019,"human,vertebrates","Poxviridae,Orthopoxvirus,Cowpox virus",Cowpox virus,segment
...,...,...,...,...,...,...
259836,NC_062761,MZ334528,,Halorubrum virus HRTV-28,Halorubrum virus HRTV-28,segment
259837,NC_062762,MZ334526,,Halorubrum virus HRTV-29,Halorubrum virus HRTV-29,segment
259838,NC_062763,MZ334501,,"Myoviridae,Haloferacalesvirus,Halorubrum virus...",Halorubrum virus HSTV-4,segment
259839,NC_060136,OK040786,,"Siphoviridae,,Gordonia phage Kudefre",Gordonia phage Kudefre,segment


In [5]:
# Formatting functions
def split_lineages(lineage: str) -> pd.Series:
    """Split a "family,genus,..." lineage string into [family, genus]; empty parts become NaN."""
    parts = lineage.split(",")
    # pad so that lineages with fewer than two fields do not raise an IndexError
    family, genus = (parts + ["", ""])[:2]
    # an empty string is falsy, so `or` swaps it for NaN (this also covers the both-empty case)
    return pd.Series([family or np.nan, genus or np.nan])


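def clean_accession(accession: str) -> Union[str, float]:
    # missing values are passed through unchanged (pd.isna catches any NaN, not only the np.nan object)
    if pd.isna(accession):
        return np.nan

    # keep only the first accession when several are comma-separated
    return accession.split(",")[0]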
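
The `Selected lineage` strings in the accession table appear to follow a `family,genus,species` layout, with an empty field when a rank is missing (for example `Siphoviridae,,Gordonia phage Kudefre`). As an alternative to a row-wise `apply`, the same split can be sketched in a vectorised way. The two example strings are copied from the output above; the rest is an illustration under that layout assumption, not part of `vseek`.

```python
import pandas as pd

# two lineage strings taken from the NCBI accession table shown earlier
lineages = pd.Series([
    "Poxviridae,Orthopoxvirus,Cowpox virus",
    "Siphoviridae,,Gordonia phage Kudefre",
])

# split on commas; reindex guarantees that columns 0 and 1 exist even for short lineages
parts = lineages.str.split(",", expand=True).reindex(columns=[0, 1])

# blank fields become missing values
ranks = parts.mask(parts == "")
ranks.columns = ["family", "genus"]
```

Here `ranks` has one row per lineage, with a missing `genus` for the Gordonia phage entry.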

In [6]:
# save path
save_path = Path(vsp.db_path()) / "filtered_bat_virus.csv.gz"

# viral families that have been reported in bats (from DBatVir)
viral_fam = batvir_df["Viral family"].dropna().unique().tolist()


if not save_path.is_file():
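    # keep only the NCBI entries whose lineage mentions a viral family reported in bats
    sel_dfs = []
    for v_fam in viral_fam:
        # na=False treats missing lineages as non-matches; regex=False matches the family name literally
        sel_df = ncbi_acc_df.loc[ncbi_acc_df["Selected lineage"].str.contains(v_fam, na=False, regex=False)]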
        if len(sel_df) == 0:
            continue
        sel_dfs.append(sel_df)


    # selecting genomes from bat-associated viral families whose listed host is exactly "human"
    sel_dfs = pd.concat(sel_dfs, axis=0)
    sel_dfs = sel_dfs.loc[sel_dfs["Host"].notnull()]
    sel_dfs = sel_dfs.loc[sel_dfs["Host"].isin(["human"])]

    # splitting the selected lineage into family and genus
    sel_dfs[["family", "genus"]] = sel_dfs["Selected lineage"].apply(split_lineages)
    sel_dfs["Representative"] = sel_dfs["Representative"].apply(clean_accession)
    sel_dfs = sel_dfs.drop(["Selected lineage"], axis="columns")

    # removing entries that are missing either a family or a genus (the result must be assigned back)
    sel_dfs = sel_dfs.loc[sel_dfs["family"].notnull() & sel_dfs["genus"].notnull()]

    # grouping data frames based on accession and removing duplicates
    groups = sel_dfs.groupby("Representative")
    cleaned_dfs = []
    for name, df in groups:
        df = df.drop("Neighbor", axis="columns")
        df = df.drop_duplicates(subset=["Representative"])
        cleaned_dfs.append(df)

    # generating bat-human viral genome csv
    init_db = vsp.init_db_path()
    cleaned_dfs = pd.concat(cleaned_dfs, axis=0)
    cleaned_dfs.to_csv(save_path, compression="gzip", index=False)
else:
    print("filtered_bat_virus already exists. Skipping") 
    cleaned_dfs = pd.read_csv(save_path)
cleaned_dfs

filtered_bat_virus already exists. Skipping


Unnamed: 0,Representative,Host,Taxonomy name,Segment name,family,genus
0,NC_001348,human,Human alphaherpesvirus 3,segment,Herpesviridae,Varicellovirus
1,NC_001352,human,Human papillomavirus type 2,segment,Papillomaviridae,Alphapapillomavirus
2,NC_001354,human,Human papillomavirus type 41,segment,Papillomaviridae,Nupapillomavirus
3,NC_001356,human,Human papillomavirus type 1a,segment,Papillomaviridae,Mupapillomavirus
4,NC_001430,human,Enterovirus D68,segment,Picornaviridae,Enterovirus
...,...,...,...,...,...,...
218,NC_055340,human,Echarate virus,segment,Phenuiviridae,Phlebovirus
219,NC_055341,human,Echarate virus,segment,Phenuiviridae,Phlebovirus
220,NC_055342,human,Maldonado virus,segment,Phenuiviridae,Phlebovirus
221,NC_055343,human,Maldonado virus,segment,Phenuiviridae,Phlebovirus
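
The filter above matches each family name as a substring of the whole lineage string. A stricter variant, sketched here on toy rows, compares only the first lineage field with the bat-associated families and builds one boolean mask instead of looping. The rows and the `bat_families` list are illustrative values loosely modelled on the tables in this notebook, not real query results.

```python
import pandas as pd

# toy stand-in for the NCBI accession table (illustrative values only)
acc = pd.DataFrame({
    "Representative": ["NC_003663", "NC_062761", "NC_001430"],
    "Host": ["human,vertebrates", None, "human"],
    "Selected lineage": [
        "Poxviridae,Orthopoxvirus,Cowpox virus",
        None,
        "Picornaviridae,Enterovirus,Enterovirus D",
    ],
})

# toy stand-in for the families listed in DBatVir
bat_families = ["Poxviridae", "Picornaviridae", "Coronaviridae"]

# first comma-separated field of the lineage; missing lineages stay missing
first_rank = acc["Selected lineage"].str.split(",").str[0]

# rows whose family is bat-associated and whose host is exactly "human"
in_bat_family = first_rank.isin(bat_families)
human_only = acc["Host"].eq("human")
selected = acc.loc[in_bat_family & human_only]
```

Only the `NC_001430` row survives: the cowpox row lists `human,vertebrates` rather than exactly `human`, and the row with a missing lineage never matches.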


In [7]:
taxon_save_path = Path(vsp.db_path()) / "taxon_id.csv"
if not taxon_save_path.is_file():
    # generating a taxon id profile 
    # NOTE: This will take a while to complete because there is a 1 second pause after each request
    print("collecting taxon ids")
    c_accessions = cleaned_dfs["Representative"].tolist()
    collected_taxon_ids = []
    for c_acc in c_accessions:
        taxa_id = get_taxa_id(email=email, accession=c_acc, buffer=1.0)
        result = [c_acc, taxa_id]
        collected_taxon_ids.append(result)
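    # name the result taxon_df so the display below and the merge in the next cell work on a first run
    taxon_df = pd.DataFrame(data=collected_taxon_ids, columns=["Representative", "taxon_id"])
    taxon_df.to_csv(taxon_save_path, index=False)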
else:
    print("Taxon id csv exists already. Skipping ...")
    taxon_df = pd.read_csv(taxon_save_path)
    taxon_df.columns = ["Representative", "taxon_id"]
taxon_df

Taxon id csv exists already. Skipping ...


Unnamed: 0,Representative,taxon_id
0,NC_001348,10335
1,NC_001352,337043
2,NC_001354,10589
3,NC_001356,334203
4,NC_001430,138951
...,...,...
218,NC_055340,1000646
219,NC_055341,1000646
220,NC_055342,1004889
221,NC_055343,1004889
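
The cell above pauses after each request so the remote service is not flooded. A minimal sketch of that pattern follows, assuming a generic lookup function; `collect_with_pause`, `lookup`, and `sleep` are hypothetical names for this example and say nothing about how `get_taxa_id` works internally. The sketch also remembers accessions it has already resolved, which is an addition to what the notebook does.

```python
import time
from typing import Callable, Iterable

import pandas as pd


def collect_with_pause(
    accessions: Iterable[str],
    lookup: Callable[[str], int],
    pause: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> pd.DataFrame:
    """Resolve each accession with `lookup`, pausing after every real request."""
    seen = {}
    rows = []
    for acc in accessions:
        if acc not in seen:
            # only unseen accessions trigger a request and a pause
            seen[acc] = lookup(acc)
            sleep(pause)
        rows.append([acc, seen[acc]])
    return pd.DataFrame(rows, columns=["Representative", "taxon_id"])
```

Passing `sleep` as a parameter keeps the helper easy to try out without actually waiting, for example by handing it a function that only records the requested delays.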


In [8]:
final_cleaned_df_path = Path(vsp.db_path()) / "final_filtered_bat_virus.csv.gz"
if not final_cleaned_df_path.is_file():
    # merge cleaned_df and taxon
    final_clean_df = cleaned_dfs.merge(taxon_df, on="Representative")
    final_clean_df.to_csv(final_cleaned_df_path, index=False)
else:
    print("final_filtered_bat_virus already exists. Skipping ...")
    final_clean_df = pd.read_csv(final_cleaned_df_path)
final_clean_df

final_filtered_bat_virus already exists. Skipping ...


Unnamed: 0,Representative,Host,Taxonomy name,Segment name,family,genus,taxon_id
0,NC_001348,human,Human alphaherpesvirus 3,segment,Herpesviridae,Varicellovirus,10335
1,NC_001352,human,Human papillomavirus type 2,segment,Papillomaviridae,Alphapapillomavirus,337043
2,NC_001354,human,Human papillomavirus type 41,segment,Papillomaviridae,Nupapillomavirus,10589
3,NC_001356,human,Human papillomavirus type 1a,segment,Papillomaviridae,Mupapillomavirus,334203
4,NC_001430,human,Enterovirus D68,segment,Picornaviridae,Enterovirus,138951
...,...,...,...,...,...,...,...
218,NC_055340,human,Echarate virus,segment,Phenuiviridae,Phlebovirus,1000646
219,NC_055341,human,Echarate virus,segment,Phenuiviridae,Phlebovirus,1000646
220,NC_055342,human,Maldonado virus,segment,Phenuiviridae,Phlebovirus,1004889
221,NC_055343,human,Maldonado virus,segment,Phenuiviridae,Phlebovirus,1004889
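
A plain inner merge silently drops any accession that has no taxon id. One way to make such gaps visible is sketched below on toy frames, using the `indicator` and `validate` options of `DataFrame.merge`. The accessions and taxon ids are taken from the tables above, but the third accession is deliberately left without a taxon id for illustration.

```python
import pandas as pd

# toy stand-ins for cleaned_dfs and taxon_df
genomes = pd.DataFrame({"Representative": ["NC_001348", "NC_001352", "NC_001354"]})
taxa = pd.DataFrame({
    "Representative": ["NC_001348", "NC_001352"],
    "taxon_id": [10335, 337043],
})

# a left merge keeps every genome row; validate raises if an accession maps to several taxon ids
checked = genomes.merge(
    taxa,
    on="Representative",
    how="left",
    indicator=True,
    validate="many_to_one",
)

# accessions that did not receive a taxon id
missing = checked.loc[checked["_merge"] == "left_only", "Representative"].tolist()
```

In this toy case `missing` contains `NC_001354`, the accession that would have disappeared from an inner merge.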


In [9]:
# download genomes for the first five accessions only
accession_list = cleaned_dfs["Representative"].tolist()[:5]
get_viral_genomes(email=email, accessions=accession_list)


Building genome database ...
Checking if genome database exists ...
Genome database already exists. Checking for missing files
No missing genomes


'/Users/erikserrano/Development/prelim/prelim3/VSeek/db/genome'

In [10]:
# collecting viral gene metadata
# loading bat-virus data
bat_vir_df = load_bat_virus_data()
all_accessions = bat_vir_df["Representative"].tolist()

bat_vir_df.head()

get_viral_genes(email=email, accession=all_accessions)


Downloading viral genes metadata
Checking if genome database exists ...
Database exists! Checking for missing gene data files


In [11]:
load_geolocations()

Unnamed: 0,country,latitude,longitude,name
0,AD,42.546245,1.601554,Andorra
1,AE,23.424076,53.847818,United Arab Emirates
2,AF,33.939110,67.709953,Afghanistan
3,AG,17.060816,-61.796428,Antigua and Barbuda
4,AI,18.220554,-63.068615,Anguilla
...,...,...,...,...
240,YE,15.552727,48.516388,Yemen
241,YT,-12.827500,45.166244,Mayotte
242,ZA,-30.559482,22.937506,South Africa
243,ZM,-13.133897,27.849332,Zambia


In [13]:
data = load_human_ppi()
data


Downloading all human protein interactions with all species
Constructing ppi DataFrame
Filtering to only viral-human protein interactions
Annotating viral information


Unnamed: 0,species_1,protein_1,species_2,protein_2,score,annotation
251,9606,ENSP00000000233,10335,NP04_VZVD,437,Varicella-zoster virus (strain Dumas)
256,9606,ENSP00000000233,10255,F11_VAR67,188,Variola virus (isolate Human/India/Ind3/1967)
261,9606,ENSP00000000233,10335,GM_VZVD,336,Varicella-zoster virus (strain Dumas)
268,9606,ENSP00000000233,32603,GM_HHV6U,336,Human herpesvirus 6A (strain Uganda-1102)
269,9606,ENSP00000000233,10359,GM_HCMVM,336,Human cytomegalovirus (strain Merlin)
...,...,...,...,...,...,...
11431391,9606,ENSP00000473166,10335,GB_VZVD,227,Varicella-zoster virus (strain Dumas)
11432750,9606,ENSP00000473172,10359,UL27_HCMVM,435,Human cytomegalovirus (strain Merlin)
11435910,9606,ENSP00000473243,32603,GB_HHV6U,227,Human herpesvirus 6A (strain Uganda-1102)
11435913,9606,ENSP00000473243,10335,GB_VZVD,227,Varicella-zoster virus (strain Dumas)


In [4]:
species_atlas = load_species_atlas()
species_atlas



Unnamed: 0,taxon_id,string_type,string_name_compact,official_name_ncbi
1,3702,core,Arabidopsis thaliana,Arabidopsis thaliana
2,3711,core,Brassica rapa,Brassica rapa
3,3847,core,Glycine max,Glycine max
4,4081,core,Solanum lycopersicum,Solanum lycopersicum
5,4113,core,Solanum tuberosum,Solanum tuberosum
...,...,...,...,...
228,1335626,core,Human coronavirus EMC (isolate United Kingdom/...,Human coronavirus EMC (isolate United Kingdom/...
229,1511779,core,Mischivirus A (isolate Miniopterus schreibersi...,Mischivirus A (isolate Miniopterus schreibersi...
230,1511900,core,Human parvovirus B19 (strain HV),Human parvovirus B19 (strain HV)
231,1511908,core,Murine minute virus (strain MVM prototype),Murine minute virus (strain MVM prototype)
