# AgamP4 resources in GCS

A quick guide to accessing the AgamP4 reference genome and genome annotations on Google Cloud Storage (GCS).

## Reference genome - read from GCS

The reference genome is stored in Zarr format and can be read directly from GCS (opening `gs://` paths through fsspec requires the gcsfs package to be installed)...

In [1]:
import zarr
import fsspec

In [2]:
genome_path_gcs = 'gs://vo_agam_release/reference/genome/agamp4/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.zarr'
genome_store = fsspec.get_mapper(genome_path_gcs)
genome = zarr.open_consolidated(genome_store)
genome

<zarr.hierarchy.Group '/'>

In [3]:
print(genome.tree())

/
 ├── 2L (49364325,) |S1
 ├── 2R (61545105,) |S1
 ├── 3L (41963435,) |S1
 ├── 3R (53200684,) |S1
 ├── Mt (15363,) |S1
 ├── UNKN (42389979,) |S1
 ├── X (24393108,) |S1
 └── Y_unplaced (237045,) |S1


In [4]:
len(genome['2L'])

49364325

In [5]:
genome['2L']

<zarr.core.Array '/2L' (49364325,) |S1>

In [7]:
# [:] reads the whole chromosome into memory as a NumPy array
seq = genome['2L'][:]
seq

array([b'a', b'a', b'c', ..., b'a', b'a', b'a'], dtype='|S1')
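
Loading a whole chromosome is often unnecessary: slicing the zarr array reads only the chunks that overlap the slice. Annotation coordinates such as those in GFF3 are 1-based and inclusive, whereas array slices are 0-based and half-open, so a small conversion helps. The sketch below is illustrative only; `region_to_slice` and `to_str` are hypothetical helper names, and a plain list of bytes stands in for `genome['2L']` so the example runs without network access.

```python
def region_to_slice(start, end):
    """Convert a 1-based, inclusive region to a 0-based, half-open slice."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return slice(start - 1, end)


def to_str(seq):
    """Join single-byte items (an '|S1' array or a list of bytes) into a str."""
    return b"".join(seq).decode("ascii")


# Stand-in for a chromosome array (assumption for this sketch);
# with the zarr group opened above this would be genome['2L'].
contig = [b"a", b"a", b"c", b"G", b"T", b"n"]

region = contig[region_to_slice(2, 4)]  # bases 2..4, 1-based inclusive
print(to_str(region))  # acG
```

For a real NumPy array, `seq.tobytes().decode('ascii')` should give the same string faster than joining item by item.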

## Reference genome - local download

You can also download the reference genome locally if you prefer...

In [8]:
!wget --no-clobber https://storage.googleapis.com/vo_agam_release/reference/genome/agamp4/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa.gz

--2020-11-09 10:57:01--  https://storage.googleapis.com/vo_agam_release/reference/genome/agamp4/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 64.233.191.128, 173.194.192.128, 209.85.146.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|64.233.191.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 80872688 (77M) [application/gzip]
Saving to: ‘Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa.gz’


2020-11-09 10:57:02 (111 MB/s) - ‘Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa.gz’ saved [80872688/80872688]



In [9]:
!gunzip Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa.gz

In [10]:
import pyfasta

In [11]:
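# don't forget the key_fn argument: it keeps only the first token of each
# FASTA header, so sequences are keyed by contig name (e.g. '2L')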
genome = pyfasta.Fasta('Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa', key_fn=lambda x: x.split()[0])
genome

<pyfasta.fasta.Fasta at 0x7f269a72d750>
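
To see what the `key_fn` above does, here is the same lambda written as a named function and applied to a header line. The header text is hypothetical, made up for illustration; the description in the real FASTA file may differ. The point is only that everything after the first whitespace is dropped, so the record can be looked up by contig name.

```python
def first_token(header):
    """Return the first whitespace-delimited token of a FASTA header line."""
    return header.split()[0]


# Hypothetical header, for illustration only.
header = "2L chromosome:AgamP4:2L:1:49364325:1"
print(first_token(header))  # 2L
```

With this key function, `genome['2L']` should work as a lookup on the pyfasta object, instead of needing the full header line as the key.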

## Genome annotations - read from GCS

Genome annotations (the last version produced before VectorBase migrated to EuPathDB) can be read directly from GCS via petl...

In [12]:
import petl as etl
import petlx

In [13]:
geneset_path_gcs = 'gs://vo_agam_release/reference/genome/agamp4/Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.12.gff3.gz'

In [14]:
tbl_geneset = etl.fromgff3(geneset_path_gcs)
tbl_geneset

seqid,source,type,start,end,score,strand,phase,attributes
2L,VectorBase,chromosome,1,49364325,.,.,.,"{'ID': '2L', 'Alias': 'CM000356.1'}"
2L,VectorBase,gene,157348,186936,.,-,.,"{'ID': 'AGAP004677', 'biotype': 'protein_coding', 'description': 'methylenetetrahydrofolate dehydrogenase(NAD ) / 5,10-methenyltetrahydrofolate [Source:VB Community Annotation]', 'version': '1'}"
2L,VectorBase,mRNA,157348,181305,.,-,.,"{'ID': 'AGAP004677-RA', 'Parent': 'AGAP004677', 'Dbxref': 'Celera_Pep:agCP1943,KEGG_Enzyme:00670 1.5.1.5 3.5.4.9,KEGG_Enzyme:00720 1.5.1.5 3.5.4.9,RefSeq:XM_001687731.1,RefSeq:XP_001687783.1,STRING:7165.AGAP004677-PA,UniParc:UPI0000020060,UniProtKB:A7UTF7,NCBI_GP:EDO64016.1', 'Ontology_term': 'GO:0003824,GO:0004477,GO:0004487,GO:0004488,GO:0035999,GO:0046653,GO:0055114', 'biotype': 'protein_coding', 'version': '1'}"
2L,VectorBase,three_prime_UTR,157348,157495,.,-,.,{'Parent': 'AGAP004677-RA'}
2L,VectorBase,exon,157348,157623,.,-,.,"{'Parent': 'AGAP004677-RA', 'Name': 'AGAP004677-RB-E4', 'constitutive': '1', 'rank': '4'}"
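
The `attributes` column above holds dicts because the ninth GFF3 column is a semicolon-separated list of `tag=value` pairs with URL-style percent-encoding. The following is a minimal sketch of that parsing step, not petl's actual implementation; `parse_gff3_attributes` is a hypothetical name, and the input string is an abbreviated reconstruction of the mRNA row shown above.

```python
from urllib.parse import unquote


def parse_gff3_attributes(field):
    """Parse a GFF3 attributes field into a dict of tag -> value."""
    attrs = {}
    if field in ("", "."):
        return attrs  # GFF3 uses '.' for an empty column
    for pair in field.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing ';'
        tag, _, value = pair.partition("=")
        attrs[unquote(tag.strip())] = unquote(value.strip())
    return attrs


# Abbreviated, illustrative attributes string.
raw = "ID=AGAP004677-RA;Parent=AGAP004677;biotype=protein_coding;version=1"
print(parse_gff3_attributes(raw))
```

All values stay as strings, which matches the quoted `'1'` values in the table above.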


If you prefer to use pandas, here's how to make a dataframe...

In [16]:
df_geneset = (
    tbl_geneset
    # choose the attributes you want to include as columns
    .unpackdict('attributes', ['ID', 'Name', 'Parent', 'biotype'])
    .todataframe()
)
df_geneset.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Name,Parent,biotype
0,2L,VectorBase,chromosome,1,49364325,.,.,.,2L,,,
1,2L,VectorBase,gene,157348,186936,.,-,.,AGAP004677,,,protein_coding
2,2L,VectorBase,mRNA,157348,181305,.,-,.,AGAP004677-RA,,AGAP004677,protein_coding
3,2L,VectorBase,three_prime_UTR,157348,157495,.,-,.,,,AGAP004677-RA,
4,2L,VectorBase,exon,157348,157623,.,-,.,,AGAP004677-RB-E4,AGAP004677-RA,


Note that scikit-allel has a `gff3_to_dataframe` function, but it does not currently support reading directly from cloud storage (PRs welcome).

## Genome annotations - local download

You can also download the annotations locally and load them into a dataframe with scikit-allel...

In [17]:
!wget --no-clobber https://storage.googleapis.com/vo_agam_release/reference/genome/agamp4/Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.12.gff3.gz

--2020-11-09 10:57:42--  https://storage.googleapis.com/vo_agam_release/reference/genome/agamp4/Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.12.gff3.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.194.128, 64.233.191.128, 142.250.125.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.194.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2724130 (2.6M) [application/gzip]
Saving to: ‘Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.12.gff3.gz’


2020-11-09 10:57:42 (158 MB/s) - ‘Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.12.gff3.gz’ saved [2724130/2724130]



In [19]:
import allel

In [20]:
df_geneset = allel.gff3_to_dataframe('Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.12.gff3.gz',
                                     attributes=['ID', 'Name', 'Parent', 'biotype'])
df_geneset.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Name,Parent,biotype
0,2L,VectorBase,chromosome,1,49364325,-1,.,-1,2L,.,.,.
1,2L,VectorBase,gene,157348,186936,-1,-,-1,AGAP004677,.,.,protein_coding
2,2L,VectorBase,mRNA,157348,181305,-1,-,-1,AGAP004677-RA,.,AGAP004677,protein_coding
3,2L,VectorBase,three_prime_UTR,157348,157495,-1,-,-1,.,.,AGAP004677-RA,.
4,2L,VectorBase,exon,157348,157623,-1,-,-1,.,AGAP004677-RB-E4,AGAP004677-RA,.
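
Several of the features above are on the `-` strand, so a sequence taken from the reference for such a feature usually needs to be reverse-complemented before it reads in the direction of transcription. This is a small sketch under that assumption; `feature_sequence` is a hypothetical helper, and it preserves case because the reference contains lowercase bases.

```python
# Complement table covering upper- and lowercase bases, plus N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def feature_sequence(seq, strand):
    """Return seq unchanged for '+' or '.', reverse-complemented for '-'."""
    if strand == "-":
        return seq.translate(_COMPLEMENT)[::-1]
    return seq


print(feature_sequence("aacGT", "-"))  # ACgtt
print(feature_sequence("aacGT", "+"))  # aacGT
```

Here `seq` is a plain string for the feature's region and `strand` is the value from the dataframe's `strand` column.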
