# UniProt Data Load to Amazon Neptune

This notebook loads a subset of UniProt RDF data into Amazon Neptune and shows UniProt queries that use an explicit schema.

## Setup

- Create a new Neptune cluster. Use the defaults, except for the template: choose "Development and Testing".
- Upload this ipynb file to the cluster's Neptune notebook instance.
- Create an S3 bucket in the same region and account. Ensure the Neptune cluster has an IAM role that allows sufficient access to bulk load from this bucket.


## Load UniProt data

Copy a subset of the UniProt data to your own S3 bucket.

See https://aws.amazon.com/blogs/industries/exploring-the-uniprot-protein-knowledgebase-with-aws-open-data-and-amazon-neptune/

First, set the name of your bucket.

In [None]:
STAGING_BUCKET='<your bucket name>'

Next, copy a subset of the UniProt data to your bucket.

In [None]:
%%bash -s "$STAGING_BUCKET"

echo Move data to a bucket in this region
aws s3 cp s3://aws-open-data-uniprot-rdf/2021-01/supporting/go.rdf.gz s3://$1/uniprot/go.rdf.gz
aws s3 cp s3://aws-open-data-uniprot-rdf/2021-01/uniprot/uniprotkb_eukaryota_opisthokonta_metazoa_33208_0.rdf.gz s3://$1/uniprot/uniprotkb_eukaryota_opisthokonta_metazoa_33208_0.rdf.gz
aws s3 cp s3://aws-open-data-uniprot-rdf/2021-01/supporting/taxonomy.rdf.gz  s3://$1/uniprot/taxonomy.rdf.gz 
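
The copy commands above follow one pattern: each object from the public `aws-open-data-uniprot-rdf` bucket lands under the `uniprot/` prefix of the staging bucket, keeping its file name. The sketch below reproduces that mapping in Python so the destination paths can be checked before running the copy. The function name `build_copy_plan` and the bucket name `my-staging-bucket` are hypothetical; the code only builds strings and does not contact S3.

```python
# Hypothetical helper: lists the (source, destination) S3 URIs used by the
# copy cell above. It only builds strings and does not contact S3.
SOURCE_BUCKET = "aws-open-data-uniprot-rdf"
SOURCE_KEYS = [
    "2021-01/supporting/go.rdf.gz",
    "2021-01/uniprot/uniprotkb_eukaryota_opisthokonta_metazoa_33208_0.rdf.gz",
    "2021-01/supporting/taxonomy.rdf.gz",
]


def build_copy_plan(staging_bucket):
    """Return a list of (source_uri, destination_uri) pairs."""
    plan = []
    for key in SOURCE_KEYS:
        file_name = key.rsplit("/", 1)[-1]  # keep only the file name
        source = f"s3://{SOURCE_BUCKET}/{key}"
        destination = f"s3://{staging_bucket}/uniprot/{file_name}"
        plan.append((source, destination))
    return plan


# Placeholder bucket name; substitute the value of STAGING_BUCKET.
for src, dst in build_copy_plan("my-staging-bucket"):
    print(f"aws s3 cp {src} {dst}")
```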


Next, bulk-load that data into your Neptune database. The bulk loader needs the ARN of an IAM role that gives the Neptune cluster permission to access S3. For development work, this role could use the AmazonS3FullAccess policy; for production use, we recommend following the principle of least privilege. The role should use a trust relationship like this:
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "rds.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```
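
A quick way to sanity-check a trust policy document before attaching the role is to load it in Python and confirm that it lets `rds.amazonaws.com` assume the role. This is a minimal sketch under that assumption; the helper name `trusts_service` is hypothetical, and the check runs locally on the JSON without calling IAM.

```python
import json

# The trust policy shown above, as a Python dict.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "rds.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def _as_list(value):
    """IAM allows a single string or a list; normalize to a list."""
    return [value] if isinstance(value, str) else list(value)


def trusts_service(policy, service="rds.amazonaws.com"):
    """Return True if an Allow statement lets `service` call sts:AssumeRole."""
    for statement in policy.get("Statement", []):
        principal = statement.get("Principal")
        if not isinstance(principal, dict):
            continue  # skip statements without a service principal mapping
        services = _as_list(principal.get("Service", []))
        actions = _as_list(statement.get("Action", []))
        if (
            statement.get("Effect") == "Allow"
            and service in services
            and "sts:AssumeRole" in actions
        ):
            return True
    return False


print(json.dumps(TRUST_POLICY, indent=4))
print(trusts_service(TRUST_POLICY))
```
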
Finally, this IAM role must be both (a) attached to the Neptune cluster (for example, via the Neptune web console) and (b) set in the following variable:

In [None]:
LOADER_ARN="arn:<...>"

In [None]:
%load -s s3://{STAGING_BUCKET}/uniprot -l $LOADER_ARN -f rdfxml --store-to loadres --run
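
For orientation, the `%load` magic above submits a job to the Neptune bulk loader. The sketch below builds a request body carrying the same source, format, and role. The field names (`source`, `format`, `iamRoleArn`, `region`) are what we understand the Neptune loader HTTP API to expect, so treat them as an assumption and check the loader reference before relying on them. The function name `build_loader_request`, the example ARN, and the region are hypothetical placeholders, and nothing is sent to Neptune.

```python
import json


def build_loader_request(staging_bucket, loader_arn, region):
    """Build a bulk-loader request body (assumed field names; not sent)."""
    return {
        "source": f"s3://{staging_bucket}/uniprot",  # same prefix as the %load cell
        "format": "rdfxml",                          # same format as the %load cell
        "iamRoleArn": loader_arn,
        "region": region,
    }


# Placeholder values for illustration only.
body = build_loader_request(
    "my-staging-bucket",
    "arn:aws:iam::123456789012:role/ExampleNeptuneLoadRole",
    "us-east-1",
)
print(json.dumps(body, indent=2))
```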

Finally, check the load status.

In [None]:
%load_status {loadres['payload']['loadId']} --errors --details
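
The status cell reads the load ID from `loadres['payload']['loadId']`, which fails with a bare `KeyError` if the load request did not return an ID. A minimal sketch of a more explicit lookup follows; the helper name `get_load_id` is hypothetical, and the sample dict is a placeholder that mirrors only the two keys used above.

```python
def get_load_id(load_result):
    """Return load_result['payload']['loadId'] or raise a clear ValueError."""
    try:
        return load_result["payload"]["loadId"]
    except (KeyError, TypeError):
        # Missing keys or a None result both end up here.
        raise ValueError(f"no loadId in load result: {load_result!r}") from None


# Placeholder result for illustration; a real one comes from --store-to loadres.
sample = {"payload": {"loadId": "00000000-0000-0000-0000-000000000000"}}
print(get_load_id(sample))
```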

## Verify and Explore UniProt

For more examples, see https://aws.amazon.com/blogs/industries/exploring-the-uniprot-protein-knowledgebase-with-aws-open-data-and-amazon-neptune/

### Subclass records under Homo sapiens

In [None]:
%%sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?taxonomy ?scientific_name
WHERE {
    ?taxonomy a up:Taxon ;
             up:scientificName ?scientific_name ;
             rdfs:subClassOf taxon:9606 .
} 

### Query proteins and their related Gene Ontology (GO) codes

In [None]:
%%sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?protein ?proteinMnemonic ?go 
WHERE {
    ?protein a up:Protein ;       
             up:mnemonic ?proteinMnemonic ;
             up:organism taxon:9606 ;
             up:classifiedWith ?go .
    ?go a owl:Class .
}
LIMIT 10

### Filter proteins by GO description

In [None]:
%%sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX go: <http://purl.obolibrary.org/obo/>
SELECT ?proteinMnemonic ?goCode ?label
WHERE {
    ?protein a up:Protein ;  
             up:mnemonic ?proteinMnemonic ;
             up:organism taxon:9606 ;
             up:classifiedWith ?go .                           
    ?go a owl:Class ;
        rdfs:label ?label .
    
    BIND(STRAFTER(STR(?go), "obo/") AS ?goCode)
    FILTER (REGEX(?label, "^cholesterol biosynthetic", "i"))
}
ORDER BY ?proteinMnemonic ?go
LIMIT 50
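
The `BIND` and `FILTER` lines in this query do two string operations: `STRAFTER` keeps the part of the GO IRI after `obo/`, and `REGEX` with the `"i"` flag keeps labels that start with "cholesterol biosynthetic" regardless of case. The Python sketch below mirrors both on plain strings to make the behavior concrete. The function names are hypothetical and the sample IRI and labels are illustrative inputs, not query results.

```python
import re


def go_code(go_iri):
    """Mirror STRAFTER(STR(?go), "obo/"): text after the first "obo/", else ""."""
    return go_iri.partition("obo/")[2]


def label_matches(label):
    """Mirror REGEX(?label, "^cholesterol biosynthetic", "i")."""
    return re.search(r"^cholesterol biosynthetic", label, re.IGNORECASE) is not None


# Illustrative inputs only.
print(go_code("http://purl.obolibrary.org/obo/GO_0006695"))
print(label_matches("Cholesterol biosynthetic process"))
print(label_matches("regulation of cholesterol biosynthetic process"))
```

Because the pattern is anchored with `^`, a label that merely contains the phrase, such as the third input, does not pass the filter.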

### Visualize a protein's Gene Ontology (GO)

In [None]:
%%sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX go: <http://purl.obolibrary.org/obo/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX sc: <http://example.org/shortcuts/>

CONSTRUCT {
    ?protein rdfs:label ?proteinMnemonic ;
        up:classifiedWith ?go .
    
    ?go rdfs:label ?label ;
        rdfs:subClassOf ?ancestorGo .
    
    ?ancestorGo rdfs:label ?ancestorLabel .
} WHERE {
    BIND(<http://purl.uniprot.org/uniprot/Q9UBM7> AS ?protein)
    
    ?protein up:mnemonic ?proteinMnemonic ;
        up:classifiedWith ?go .
    
    ?go a owl:Class ;
        rdfs:label ?label ;
        rdfs:subClassOf ?ancestorGo .
    
    ?ancestorGo a owl:Class ;
        rdfs:label ?ancestorLabel .
    
    MINUS {
        ?protein up:classifiedWith ?ancestorGo .
    }
}
ORDER BY ?go
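
The `MINUS` clause in this query drops any parent GO term that the protein is already classified with directly, so the graph shows only parents that add new information. A minimal Python sketch of that filtering step follows. The function name `parent_edges_to_show` and the term names are hypothetical, and the sketch ignores the `owl:Class` and `rdfs:label` requirements of the real query.

```python
def parent_edges_to_show(direct_terms, parents):
    """Return (term, parent) edges whose parent is not itself a direct term."""
    edges = []
    for term in sorted(direct_terms):
        for parent in sorted(parents.get(term, ())):
            if parent not in direct_terms:  # the MINUS step
                edges.append((term, parent))
    return edges


# Hypothetical data: the protein is classified with GO_A and GO_B.
direct = {"GO_A", "GO_B"}
parents = {"GO_A": {"GO_B", "GO_C"}, "GO_B": {"GO_D"}}
print(parent_edges_to_show(direct, parents))
```

Here the edge from `GO_A` to `GO_B` is dropped because `GO_B` is already a direct classification of the protein.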