# UniProt Data Load to Amazon Neptune

This notebook loads the UniProt dataset into your Neptune cluster.

This step is optional. Skip this notebook if you do not want to load UniProt data into your Neptune database.

See README.md for instructions to set up the Neptune cluster, this notebook instance, an S3 bucket to stage the UniProt data for bulk load to Neptune, and Bedrock for the LLM tests in a subsequent notebook.

## Load UniProt data

See https://aws.amazon.com/blogs/industries/exploring-the-uniprot-protein-knowledgebase-with-aws-open-data-and-amazon-neptune/

### Set the name of your bucket

In [None]:
STAGING_BUCKET='<your bucket>'

### Copy the UniProt files to your S3 bucket

The copy may take several hours.

In [None]:
%%bash -s "$STAGING_BUCKET"

# $1 is the bucket name passed to this cell with -s
aws s3 sync s3://aws-open-data-uniprot-rdf/2021-01 s3://$1/up-stage

### Check that the number of files in your bucket matches the source

In [None]:
%%bash -s "$STAGING_BUCKET"

echo Source
aws s3 ls s3://aws-open-data-uniprot-rdf/2021-01/supporting/ | wc -l
aws s3 ls s3://aws-open-data-uniprot-rdf/2021-01/uniparc/ | wc -l
aws s3 ls s3://aws-open-data-uniprot-rdf/2021-01/uniprot/ | wc -l
aws s3 ls s3://aws-open-data-uniprot-rdf/2021-01/uniref/ | wc -l

echo your bucket
aws s3 ls s3://$1/up-stage/supporting/ | wc -l
aws s3 ls s3://$1/up-stage/uniparc/ | wc -l
aws s3 ls s3://$1/up-stage/uniprot/ | wc -l
aws s3 ls s3://$1/up-stage/uniref/ | wc -l
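
Comparing eight numbers by eye is error-prone, so the check can also be sketched in Python. This is a minimal sketch, not part of the original notebook: `find_count_mismatches` is a hypothetical helper, and the counts below are made-up placeholders to be replaced with the numbers printed by the cell above.

```python
def find_count_mismatches(source_counts, staged_counts):
    """Return {prefix: (source, staged)} for every prefix whose counts differ."""
    mismatches = {}
    for prefix, expected in source_counts.items():
        # A prefix that is missing from the staging bucket counts as zero files.
        actual = staged_counts.get(prefix, 0)
        if actual != expected:
            mismatches[prefix] = (expected, actual)
    return mismatches


# Hypothetical counts, typed in from the `wc -l` output of the cell above.
source = {"supporting": 12, "uniparc": 200, "uniprot": 300, "uniref": 150}
staged = {"supporting": 12, "uniparc": 200, "uniprot": 298, "uniref": 150}

print(find_count_mismatches(source, staged))  # {'uniprot': (300, 298)}
```

An empty dictionary means every prefix matches. Any other result suggests rerunning the `aws s3 sync` cell, which only copies files that are missing or changed.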


### Bulk-load from your S3 staging bucket to Neptune

Because the dataset is very large, we recommend an r5.12xlarge instance type for the primary (writer) instance of the cluster. Change to this instance type before running the load. When the load completes, scale back down to a smaller instance type.

In [None]:
%load -s s3://{STAGING_BUCKET}/up-stage/supporting -f rdfxml -p OVERSUBSCRIBE --store-to loadres1 --no-fail-on-error --run

In [None]:
%load -s s3://{STAGING_BUCKET}/up-stage/uniprot -f rdfxml -p OVERSUBSCRIBE --store-to loadres2 --no-fail-on-error --run

### Finally, check the load status

In [None]:
%load_status {loadres1['payload']['loadId']} --errors --details

In [None]:
%load_status {loadres2['payload']['loadId']} --errors --details
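
Because the loads run with `--no-fail-on-error`, it is worth reading the error counters as well as the overall status. The sketch below shows one way to condense a status document into a few fields. It assumes the response has a `payload.overallStatus` object with `status`, `totalRecords`, `parsingErrors`, `datatypeMismatchErrors` and `insertErrors`, as described for the Neptune loader status API. Those field names are an assumption, not something this notebook confirms, and `summarize_load` and the sample values are hypothetical.

```python
def summarize_load(status):
    """Pull the headline fields out of a loader status response (assumed shape)."""
    overall = status["payload"]["overallStatus"]
    error_keys = ("parsingErrors", "datatypeMismatchErrors", "insertErrors")
    errors = sum(overall.get(key, 0) for key in error_keys)
    return {
        "status": overall["status"],
        "records": overall.get("totalRecords", 0),
        "errors": errors,
        # "clean" means the load finished and no error counter is non-zero.
        "clean": overall["status"] == "LOAD_COMPLETED" and errors == 0,
    }


# Hypothetical status document; real values come from the loader.
sample = {
    "status": "200 OK",
    "payload": {
        "overallStatus": {
            "status": "LOAD_COMPLETED",
            "totalRecords": 1000,
            "parsingErrors": 0,
            "datatypeMismatchErrors": 0,
            "insertErrors": 2,
        }
    },
}

print(summarize_load(sample))
```

In this sample the load completed but two insert errors were counted, so `clean` is `False` and the `--errors --details` output deserves a closer look.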

## Verify and Explore UniProt

For more examples, see https://aws.amazon.com/blogs/industries/exploring-the-uniprot-protein-knowledgebase-with-aws-open-data-and-amazon-neptune/

### Subclass records under Homo sapiens

In [None]:
%%sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?taxonomy ?scientific_name
WHERE {
    ?taxonomy a up:Taxon ;
             up:scientificName ?scientific_name ;
             rdfs:subClassOf taxon:9606 .
} 

### Query proteins and their related Gene Ontology (GO) codes

In [None]:
%%sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?protein ?proteinMnemonic ?go 
WHERE {
    ?protein a up:Protein ;       
             up:mnemonic ?proteinMnemonic ;
             up:organism taxon:9606 ;
             up:classifiedWith ?go .
    ?go a owl:Class .
}
LIMIT 10

### Filter proteins by GO description

In [None]:
%%sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX go: <http://purl.obolibrary.org/obo/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?proteinMnemonic ?goCode ?label
WHERE {
    ?protein a up:Protein ;  
             up:mnemonic ?proteinMnemonic ;
             up:organism taxon:9606 ;
             up:classifiedWith ?go .                           
    ?go a owl:Class ;
        rdfs:label ?label .
    
    BIND(STRAFTER(STR(?go), "obo/") AS ?goCode)
    FILTER (REGEX(?label, "^cholesterol biosynthetic", "i"))
}
ORDER BY ?proteinMnemonic ?go
LIMIT 50
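
The `BIND` and `FILTER` lines in the query above do two small string operations: `STRAFTER` keeps the part of the GO IRI after `obo/`, and `REGEX` with the `i` flag keeps labels that start with "cholesterol biosynthetic" regardless of case. The Python sketch below mimics both, only to make the behavior concrete. The helper names are hypothetical and the IRI and labels are illustrative inputs, not query results.

```python
import re


def str_after(text, marker):
    """Mimic SPARQL STRAFTER for a non-empty marker.

    Returns the text after the first occurrence of marker, or '' if absent.
    """
    return text.partition(marker)[2]


def label_matches(label):
    """Mimic FILTER (REGEX(?label, "^cholesterol biosynthetic", "i"))."""
    return re.search(r"^cholesterol biosynthetic", label, re.IGNORECASE) is not None


go_iri = "http://purl.obolibrary.org/obo/GO_0006695"  # illustrative GO IRI

print(str_after(go_iri, "obo/"))                          # GO_0006695
print(label_matches("Cholesterol biosynthetic process"))  # True
print(label_matches("regulation of cholesterol biosynthetic process"))  # False
```

The last call returns `False` because `^` anchors the pattern to the start of the label, so GO terms that merely mention the phrase later in their label are excluded by the query as well.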

### Visualize a protein's Gene Ontology (GO)

In [None]:
%%sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX go: <http://purl.obolibrary.org/obo/>
PREFIX sc: <http://example.org/shortcuts/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

CONSTRUCT {
    ?protein rdfs:label ?proteinMnemonic ;
        up:classifiedWith ?go .
    
    ?go rdfs:label ?label ;
        rdfs:subClassOf ?ancestorGo .
    
    ?ancestorGo rdfs:label ?ancestorLabel .
} WHERE {
    BIND(<http://purl.uniprot.org/uniprot/Q9UBM7> AS ?protein)
    
    ?protein up:mnemonic ?proteinMnemonic ;
        up:classifiedWith ?go .
    
    ?go a owl:Class ;
        rdfs:label ?label ;
        rdfs:subClassOf ?ancestorGo .
    
    ?ancestorGo a owl:Class ;
        rdfs:label ?ancestorLabel .
    
    MINUS {
        ?protein up:classifiedWith ?ancestorGo .
    }
}
ORDER BY ?go
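
The `MINUS` clause in the query above drops every superclass that the protein is already classified with directly, so the graph does not show the same GO term as both a direct classification and a parent. In set terms this is a difference. The sketch below illustrates the idea on hypothetical toy data; the term names are placeholders, and the type and label checks of the real query are left out.

```python
# Hypothetical toy data: GO terms a protein is classified with,
# and the direct superclasses (rdfs:subClassOf targets) of each term.
direct_terms = {"GO_A", "GO_B"}
parents = {"GO_A": {"GO_B", "GO_C"}, "GO_B": {"GO_D"}}


def ancestor_edges(direct_terms, parents):
    """Return (term, parent) pairs, skipping parents that are direct classifications."""
    edges = []
    for term in sorted(direct_terms):
        # Set difference plays the role of MINUS { ?protein up:classifiedWith ?ancestorGo }.
        for parent in sorted(parents.get(term, set()) - direct_terms):
            edges.append((term, parent))
    return edges


print(ancestor_edges(direct_terms, parents))  # [('GO_A', 'GO_C'), ('GO_B', 'GO_D')]
```

Here `GO_B` is a parent of `GO_A` but also a direct classification, so the pair `('GO_A', 'GO_B')` is left out, just as the corresponding solution is removed by `MINUS`.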