# Natural Language Questions 
Demonstrates natural-language queries against UniProt using an explicit schema.

## Setup

### Notebook Pre-Req

Upload this notebook into a Jupyter environment configured to use Neptune Workbench.

### Python Pre-Req

You need Python 3.9 or higher. I tested this on an Amazon SageMaker notebook running Python 3.10.x.

### Neptune Pre-Req

Your Neptune cluster must run engine version 1.2.x or higher.

### Bedrock

In the Bedrock console, allow access to the Anthropic Claude v2 model in this region.

In the notebook's IAM role, add the following statement:

    {
        "Action": [
            "bedrock:ListFoundationModels",
            "bedrock:InvokeModel"
        ],
        "Resource": "*",
        "Effect": "Allow"
    }
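
The statement above is only a fragment; IAM expects it inside a policy document. A minimal sketch of that wrapper follows, assuming the standard `Version`/`Statement` layout. The function name `build_bedrock_policy` is hypothetical and only serves the illustration.

```python
import json


def build_bedrock_policy() -> dict:
    """Wrap the Bedrock statement from this section in a full policy document."""
    statement = {
        "Action": ["bedrock:ListFoundationModels", "bedrock:InvokeModel"],
        "Resource": "*",
        "Effect": "Allow",
    }
    # "2012-10-17" is the current IAM policy language version.
    return {"Version": "2012-10-17", "Statement": [statement]}


policy = build_bedrock_policy()
print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the IAM console as an inline policy on the notebook role.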

## Install LangChain
You need Python 3.9 or higher and LangChain at roughly version 0.0.341.

In [None]:
%%bash 
# 3.9 or higher?
python --version

In [None]:
!pip install --upgrade --force-reinstall langchain

In [None]:
!pip install langchain-community

In [None]:
!pip show langchain
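
The same two checks (Python version and installed LangChain version) can be done from Python, which is handy after a kernel restart. This is a small sketch; `python_is_supported` and `installed_version` are hypothetical helper names, and the 3.9 minimum is the one stated in this notebook.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 9)  # minimum stated in this notebook


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Return True when the major.minor version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("Python", sys.version.split()[0], "supported:", python_is_supported())
print("langchain:", installed_version("langchain"))
```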

### Now restart the kernel

### Temporary patch
Needed only until the next LangChain release is available. The cell below copies a patched `neptune_rdf_graph.py` over the installed copy.

In [None]:
!cp neptune_rdf_graph.py ~/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/langchain_community/graphs/neptune_rdf_graph.py 


### Now restart the kernel

## Setup Chain

Points to consider:
- Choice of model: this affects both accuracy and performance.
- The chain introspects the Neptune schema and adds it to the prompt.
- You can pass in EXAMPLES. In the extreme case, the examples can be the entire contents of resources/prompt.txt.
- If you do that, though, the EXAMPLES alone likely drive accuracy, and the introspected schema has less influence.
- So in this test, strike a balance: a helpful schema plus VERY FEW examples, and only if needed.
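
One way to keep the example count small is sketched below. The format of resources/prompt.txt is not shown in this notebook, so the sketch assumes examples are separated by blank lines; `take_examples` is a hypothetical helper, and the sample text is a stand-in.

```python
def take_examples(text: str, limit: int = 2) -> str:
    """Return at most `limit` blank-line-separated blocks from `text`."""
    blocks = [block.strip() for block in text.split("\n\n") if block.strip()]
    return "\n\n".join(blocks[:limit])


# Stand-in text; in the notebook this might be read from resources/prompt.txt.
sample = (
    "Q: first question\nA: first query\n\n"
    "Q: second question\nA: second query\n\n"
    "Q: third question\nA: third query"
)
EXAMPLES = take_examples(sample, limit=2)
print(EXAMPLES)
```

Passing a short `EXAMPLES` string like this leaves more of the prompt's weight on the introspected schema.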


In [None]:
import os

# Grab Neptune cluster host/port from notebook instance environment variables
GRAPH_NOTEBOOK_HOST = os.popen("source ~/.bashrc ; echo $GRAPH_NOTEBOOK_HOST").read().split("\n")[0]
GRAPH_NOTEBOOK_PORT = os.popen("source ~/.bashrc ; echo $GRAPH_NOTEBOOK_PORT").read().split("\n")[0]
[GRAPH_NOTEBOOK_HOST, GRAPH_NOTEBOOK_PORT]
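
If either variable comes back empty, the later `int(GRAPH_NOTEBOOK_PORT)` call fails with a less obvious error. A small sketch of an early check follows; `sparql_endpoint` is a hypothetical helper and the host shown is a placeholder, not a real cluster.

```python
def sparql_endpoint(host: str, port: str) -> str:
    """Validate host and port, then build the Neptune SPARQL endpoint URL."""
    host = host.strip()
    if not host:
        raise ValueError("GRAPH_NOTEBOOK_HOST is empty; is ~/.bashrc configured?")
    port_number = int(port.strip())  # raises ValueError if not numeric
    return f"https://{host}:{port_number}/sparql"


# Placeholder values for illustration only.
print(sparql_endpoint("my-cluster.example.com", "8182"))
```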

In [None]:
import boto3
from langchain.chains.graph_qa.neptune_sparql import NeptuneSparqlQAChain
from langchain_community.graphs import NeptuneRdfGraph
from langchain.chat_models import BedrockChat

print("Creating graph")
graph = None
graph = NeptuneRdfGraph(
    host=GRAPH_NOTEBOOK_HOST,
    port=int(GRAPH_NOTEBOOK_PORT),
    use_iam_auth=True,
    region_name='us-east-1'
)

# Optional: adjust the introspected schema before it is added to the prompt
# elems = graph.get_schema_elements
# ... change elems ...
# graph.load_schema(elems)

print("Creating model client")
# Alternative model:
# MODEL_ID = 'anthropic.claude-3-sonnet-20240229-v1:0'
MODEL_ID = 'anthropic.claude-v2'
bedrock_client = boto3.client('bedrock-runtime')
llm = BedrockChat(
    model_id=MODEL_ID,
    client=bedrock_client
)

EXAMPLES = ""
print("Creating chain")
chain = NeptuneSparqlQAChain.from_llm(
    llm=llm,
    graph=graph,
    examples=EXAMPLES,
    verbose=True,
    top_k=10,  # lowercase, assuming it matches the chain's `top_k` field name
    return_intermediate_steps=True,
    return_direct=False,
)

print("Complete")


In [None]:
print(graph.get_schema)

In [None]:
graph.get_schema_elements
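
The commented-out lines in the setup cell hint at editing the introspected schema before it reaches the prompt. A rough sketch of that idea follows. It assumes `graph.get_schema_elements` returns a dict whose list values hold dicts with a `uri` key, which is not verified here; `keep_namespace` is a hypothetical helper and the sample data is a stand-in, not real Neptune output.

```python
def keep_namespace(elems: dict, namespace: str) -> dict:
    """Keep only list entries whose 'uri' starts with `namespace`.

    Non-list values, and list items that are not dicts, are kept unchanged.
    """
    trimmed = {}
    for key, value in elems.items():
        if isinstance(value, list):
            trimmed[key] = [
                item
                for item in value
                if not isinstance(item, dict)
                or str(item.get("uri", "")).startswith(namespace)
            ]
        else:
            trimmed[key] = value
    return trimmed


# Stand-in data shaped like the assumption described above.
sample = {
    "classes": [
        {"uri": "http://purl.uniprot.org/core/Protein", "local": "Protein"},
        {"uri": "http://www.w3.org/2002/07/owl#Class", "local": "Class"},
    ],
    "distinct_prefixes": {"up": "http://purl.uniprot.org/core/"},
}
trimmed = keep_namespace(sample, "http://purl.uniprot.org/core/")
print(trimmed["classes"])
```

In the notebook, the trimmed dict would then be handed to `graph.load_schema(...)`, as in the commented-out lines.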

## Ask questions


### Training questions
See resources/prompt.txt

In [None]:
chain.invoke('''Select the UniProt entry with the mnemonic "A4_HUMAN"''')
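
Because the chain is created with `return_intermediate_steps=True`, the value returned by `chain.invoke` should carry the generated SPARQL alongside the answer. The sketch below pulls it out defensively. The key names `intermediate_steps` and `query` are assumptions about the result layout, `generated_queries` is a hypothetical helper, and the sample dict is a stand-in rather than real chain output.

```python
def generated_queries(result: dict) -> list:
    """Collect any 'query' entries from a chain result's intermediate steps."""
    steps = result.get("intermediate_steps", [])
    return [
        step["query"]
        for step in steps
        if isinstance(step, dict) and "query" in step
    ]


# Stand-in result for illustration only.
sample_result = {
    "result": "One entry matches.",
    "intermediate_steps": [
        {"query": "SELECT ?protein WHERE { ?protein up:mnemonic 'A4_HUMAN' }"},
        {"context": []},
    ],
}
for sparql in generated_queries(sample_result):
    print(sparql)
```

Reading the generated SPARQL is often the quickest way to see whether a wrong answer came from the query or from the final summarization step.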

### Advanced questions
```
- What GO terms are associated with human proteins?
- What GO terms are associated with human proteins? Show me their names also.
- How many citations are there for papers by A. Bairoch?
- Show me all citations by A. Bairoch
- Show me all proteins that are located in the mitochondrion
- I'd like to see the entries for all proteins encoded by the gene FNDC3A
- Select all taxa from the UniProt taxonomy
- Select all taxa from the UniProt taxonomy; show me at most 7
- Show me at most 5 taxa from the UniProt taxonomy
- Select all bacterial taxa and their scientific names from the UniProt taxonomy
- Show me up to 10 human taxa and their scientific names from the UniProt taxonomy
- Select up to 10 bacterial taxa and their scientific names from the UniProt taxonomy
- Tell me all the different categories of databases
- Tell me all the different databases you know about
- Select all UniProt entries, and their organism and amino acid sequences (including isoforms), for _E. coli K12_ and all its strains
- Select the UniProt entry with the mnemonic 'A4_HUMAN'
- Select a mapping of UniProt to PDB entries using the UniProt cross-references to the PDB database
- Select all cross-references to external databases of the category '3D structure databases' of UniProt entries that are classified with the keyword 'Acetoin biosynthesis (KW-0005)'
- Select reviewed UniProt entries (Swiss-Prot), and their recommended protein name, that have a preferred gene name that contains the text 'DNA'
- Select reviewed UniProt entries (Swiss-Prot), and their recommended protein name, that have a preferred gene name that contains the word DNA. Show me the gene name too
- Show me the preferred gene name and disease annotation of all human UniProt entries that are known to be involved in a disease
- Select all human UniProt entries with a sequence variant that leads to a 'loss of function'
- Select all distinct human UniProt entries with a sequence variant that leads to a 'loss of function', show me the text of the annotation also
- Show me all human UniProt entries with a sequence variant that leads to a tyrosine to phenylalanine substitution
- Show me all human UniProt entries with a sequence variant that leads to a Tyr to phenylalanine substitution
- Select all UniProt entries with annotated transmembrane regions and the regions' begin and end coordinates on the canonical sequence
- Select all UniProt entries that were integrated on the 30th of November 2010
- Select all UniProt entries that were integrated on or before the 30th of November 2010
- Select all UniProt entries that were integrated on the month of November 2010
- Show me all UniProt entries that were added to the database on the month of November 2010
#- Was any UniProt entry integrated on the 9th of January 2013?
#- Select all triples that relate to the EMBL CDS entry AA089367.1
#- Select all triples that relate to the taxon that describes Homo sapiens in the named graph for taxonomy
- Select the average number of cross-references to the PDB database of UniProt entries that have at least one cross-reference to the PDB database
- Select the number of UniProt entries for each of the EC (Enzyme Commission) second level categories
- Find all Natural Variant Annotations if associated via an evidence tag to an article with a pubmed identifier.
#- Find how often an article in pubmed was used in an evidence tag in a human protein (ordered by most used to least) # times out
- Find where disease related proteins are known to be located in the cell
#- For two accessions find the GO term labels and group them into GO process,function and component
- How many reviewed entries (Swiss-Prot) are related to kinase activity?
#- Find the release number of the uniprot data that is currently being queried
#- Find any uniprot entry which has a name 'HLA class I histocompatibility antigen, B-73 alpha chain'
#- Find any uniprot entry, or an uniprot entries domain or component which has a name 'HLA class I histocompatibility antigen, B-73 alpha chain'
#- Construct new triples of the type 'HumanProtein' from all human UniProt entries
- list all the Homo Sapiens proteins classified with "cholesterol biosynthetic process"
- list all the Homo Sapiens proteins classified with "cholesterol biosynthetic process". Include their names.
- list all the Homo Sapiens proteins classified with "cholesterol biosynthetic process". Don't include their names.
- find all the Homo Sapiens related proteins that have a Gene Ontology (GO) code
- look within the taxonomy tree, to see if there are any subclass records under Homo Sapiens. Return the scientific name also
```
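
To work through the list above in one pass, the questions can be parsed and sent to the chain in a loop. This is a sketch under assumptions: lines starting with `- ` are treated as active and lines starting with `#` as disabled, matching the list's own convention. `active_questions` and `ask_all` are hypothetical helpers, and `str.upper` stands in for `chain.invoke` so the block runs without Neptune or Bedrock.

```python
def active_questions(text: str) -> list:
    """Return questions from lines starting with '- '; '#' lines are skipped."""
    questions = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("- "):
            questions.append(line[2:].strip())
    return questions


def ask_all(ask, questions) -> dict:
    """Call `ask(question)` for each question, recording failures instead of stopping."""
    outcomes = {}
    for question in questions:
        try:
            outcomes[question] = ask(question)
        except Exception as err:  # e.g. a timeout or an invalid generated query
            outcomes[question] = f"ERROR: {err}"
    return outcomes


listing = """
- Show me at most 5 taxa from the UniProt taxonomy
#- Find the release number of the uniprot data that is currently being queried
- Tell me all the different databases you know about
"""
questions = active_questions(listing)
# Replace str.upper with chain.invoke when running inside the notebook.
outcomes = ask_all(str.upper, questions)
for question, answer in outcomes.items():
    print(question, "->", answer)
```

Catching exceptions per question keeps one slow or failing query from ending the whole run.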

In [None]:
chain.invoke('''Select the number of UniProt entries for each of the EC (Enzyme Commission) second level categories''')