## SPARQL query examples loader


In [1]:
from sparql_llm import SparqlExamplesLoader

loader = SparqlExamplesLoader("https://sparql.uniprot.org/sparql/")
docs = loader.load()
print(len(docs))
print(docs[0].metadata)

124
{'question': 'Select all taxa from the UniProt taxonomy', 'answer': 'PREFIX up: <http://purl.uniprot.org/core/>\n\nSELECT ?taxon\nFROM <http://sparql.uniprot.org/taxonomy>\nWHERE\n{\n    ?taxon a up:Taxon .\n}', 'endpoint_url': 'https://sparql.uniprot.org/sparql/', 'query_type': 'SelectQuery', 'doc_type': 'sparql_query'}
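
Each loaded document carries the natural-language question and its SPARQL query in `metadata`. As a minimal sketch, the helper `format_example` below (a hypothetical name, not part of `sparql_llm`) renders one such metadata dictionary as a text block that could, for instance, be placed in a prompt. The sample dictionary is copied from the output above.

```python
# Metadata copied from the first document printed above
metadata = {
    "question": "Select all taxa from the UniProt taxonomy",
    "answer": (
        "PREFIX up: <http://purl.uniprot.org/core/>\n\n"
        "SELECT ?taxon\n"
        "FROM <http://sparql.uniprot.org/taxonomy>\n"
        "WHERE\n{\n    ?taxon a up:Taxon .\n}"
    ),
    "endpoint_url": "https://sparql.uniprot.org/sparql/",
    "query_type": "SelectQuery",
    "doc_type": "sparql_query",
}


def format_example(meta: dict) -> str:
    """Hypothetical helper (not part of sparql_llm): render one example as plain text."""
    return (
        f"Question: {meta['question']}\n"
        f"Endpoint: {meta['endpoint_url']}\n"
        f"Query:\n{meta['answer']}"
    )


print(format_example(metadata))
```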


## SPARQL endpoint schema loader
Generate a human-readable schema using the ShEx format to describe all classes of a SPARQL endpoint based on the VoID description present in the endpoint. Ideally the endpoint should also contain the ontology describing the classes, so the rdfs:label and rdfs:comment of the classes can be used to generate embeddings and improve semantic matching.

In [2]:
from sparql_llm import SparqlVoidShapesLoader

loader = SparqlVoidShapesLoader("https://sparql.uniprot.org/sparql/")
docs = loader.load()
print(len(docs))
print(docs[0].metadata)

243
{'question': 'Citation Statement', 'answer': 'up:Citation_Statement {\n  a [ up:Citation_Statement ] ;\n  up:mappedAnnotation [ up:Annotation ] ;\n  up:scope xsd:string ;\n  up:attribution IRI ;\n  rdf:object [ up:Book_Citation up:Electronic_Citation up:Journal_Citation up:Observation_Citation up:Patent_Citation up:Submission_Citation up:Thesis_Citation ] ;\n  rdf:predicate IRI ;\n  rdf:subject [ up:Protein ] ;\n  up:context [ up:Plasmid up:Strain up:Tissue up:Transposon ]\n}', 'endpoint_url': 'https://sparql.uniprot.org/sparql/', 'class_uri': 'http://purl.uniprot.org/core/Citation_Statement', 'doc_type': 'shex'}
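
The shape stored in the `answer` field can be read as a mapping from each predicate to the classes or datatype it points to. The sketch below illustrates this with a hypothetical helper, `parse_shape`, that is not part of `sparql_llm` and only handles the simple one-constraint-per-line layout shown in the output above. The shape string is a shortened copy of that output.

```python
# Shortened copy of the shape printed above
shape = (
    "up:Citation_Statement {\n"
    "  a [ up:Citation_Statement ] ;\n"
    "  up:mappedAnnotation [ up:Annotation ] ;\n"
    "  up:scope xsd:string ;\n"
    "  up:attribution IRI ;\n"
    "  rdf:subject [ up:Protein ] ;\n"
    "  up:context [ up:Plasmid up:Strain up:Tissue up:Transposon ]\n"
    "}"
)


def parse_shape(shex: str) -> dict:
    """Hypothetical helper: map each predicate to the list of classes or datatype it targets."""
    # Keep only the text between the outer braces
    body = shex[shex.index("{") + 1 : shex.rindex("}")]
    constraints = {}
    for part in body.split(";"):
        part = part.strip()
        if not part:
            continue
        predicate, _, target = part.partition(" ")
        target = target.strip()
        if target.startswith("["):
            # "[ A B ]" is a value set listing the allowed classes
            values = target.strip("[] ").split()
        else:
            # A single datatype (xsd:string) or node kind (IRI)
            values = [target]
        constraints[predicate] = values
    return constraints


for predicate, targets in parse_shape(shape).items():
    print(predicate, "->", targets)
```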


The generated shapes are well suited for use by an LLM or a human, as they state which predicates are available for a class and which classes or datatypes those predicates point to. Each object property references a list of classes rather than another shape, making each shape self-contained and interpretable on its own. For example, the following query retrieves Disease Annotation entries in UniProt together with their comment and disease:

In [3]:
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

# Set up the UniProt SPARQL endpoint
sparql = SPARQLWrapper("https://sparql.uniprot.org/sparql")

# Define a query to fetch available Disease Annotation data
query_disease_annotations_simple = """
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?disease_annotation ?comment ?disease
WHERE {
  ?disease_annotation a up:Disease_Annotation ;
                      rdfs:comment ?comment ;
                      up:disease ?disease .
}
LIMIT 10
"""

# Execute the query and format the output in a DataFrame
sparql.setQuery(query_disease_annotations_simple)
sparql.setReturnFormat(JSON)

try:
    # Execute query and retrieve results
    results_disease_simple = sparql.query().convert()
    
    # Parse the results
    disease_data_simple = [
        {
            "Disease Annotation": result["disease_annotation"]["value"],
            "Comment": result["comment"]["value"],
            "Disease": result["disease"]["value"]
        }
        for result in results_disease_simple["results"]["bindings"]
    ]
    
    # Create a DataFrame to display the results
    df_disease_simple = pd.DataFrame(disease_data_simple)
    
    # Wrap text for 'Comment' column in Jupyter display
    df_disease_simple_styled = df_disease_simple.style.set_properties(
        **{'white-space': 'pre-wrap', 'text-align': 'left'}
    )
    
    display(df_disease_simple_styled)
except Exception as e:
    print(f"Error occurred: {e}")


| | Disease Annotation | Comment | Disease |
|---|---|---|---|
| 0 | http://purl.uniprot.org/uniprot/Q9UDR5#SIP17D85FE178BE13B6 | The disease is caused by variants affecting the gene represented in this entry. In hyperlysinemia 1, both enzymatic functions of AASS are defective and patients have increased serum lysine and possibly increased saccharopine. Some individuals, however, retain significant amounts of lysine-ketoglutarate reductase and present with saccharopinuria, a metabolic condition with few, if any, clinical manifestations. | http://purl.uniprot.org/diseases/1773 |
| 1 | http://purl.uniprot.org/uniprot/Q9UDR5#SIP77BA87EDDA8559D2 | The protein represented in this entry is involved in disease pathogenesis. A selective decrease in mitochondrial NADP(H) levels due to NADK2 mutations causes a deficiency of NADPH-dependent mitochondrial enzymes, such as DECR1 and AASS. | http://purl.uniprot.org/diseases/4240 |
| 2 | http://purl.uniprot.org/uniprot/Q9UGJ0#SIP473418E25D4D3A3B | The disease is caused by variants affecting the gene represented in this entry. | http://purl.uniprot.org/diseases/1676 |
| 3 | http://purl.uniprot.org/uniprot/Q9UGJ0#SIPBA4A3C214C09B2B7 | The disease is caused by variants affecting the gene represented in this entry. | http://purl.uniprot.org/diseases/245 |
| 4 | http://purl.uniprot.org/uniprot/Q9UGJ0#SIPF5992DDE995A022F | The disease is caused by variants affecting the gene represented in this entry. | http://purl.uniprot.org/diseases/1150 |
| 5 | http://purl.uniprot.org/uniprot/P00519#SIP961ECAA35D2F0134 | The disease is caused by variants affecting the gene represented in this entry. | http://purl.uniprot.org/diseases/5064 |
| 6 | http://purl.uniprot.org/uniprot/P00519#SIPDFB66D0B5174D549 | The gene represented in this entry is involved in disease pathogenesis. | http://purl.uniprot.org/diseases/3735 |
| 7 | http://purl.uniprot.org/uniprot/Q13085#SIPE73D1EB0068562AA | The disease is caused by variants affecting the gene represented in this entry. | http://purl.uniprot.org/diseases/1164 |
| 8 | http://purl.uniprot.org/uniprot/Q6UWZ7#SIP86B515DA1B7AD8CF | Disease susceptibility is associated with variants affecting the gene represented in this entry. | http://purl.uniprot.org/diseases/2602 |
| 9 | http://purl.uniprot.org/uniprot/A8K2U0#SIPCE73AF232236B8B1 | Disease susceptibility is associated with variants affecting the gene represented in this entry. | http://purl.uniprot.org/diseases/5294 |


In [6]:
import pandas as pd
import re

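# Assumes df_disease_simple from the previous cell already exists

# Define regex patterns for variants and genes
# Note: as written, the variant pattern does not match plurals such as "variants" or "mutations"
variant_pattern = r"\bvariant\s\w+\b|\bmutation\b|\bpolymorphism\b"  # Adjust patterns as needed
# Matches any run of 2+ uppercase letters or digits, so it also picks up non-gene tokens such as NADPH
gene_pattern = r"\b[A-Z0-9]{2,}\b"  # Basic pattern for gene identifiers, e.g., BRCA1, TP53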

# Extract details for each disease annotation
extracted_info = []
for _, row in df_disease_simple.iterrows():
    disease_id = row["Disease"]
    comment = row["Comment"]
    
    # Find all variants and gene mentions
    variants = re.findall(variant_pattern, comment, flags=re.IGNORECASE)
    genes = re.findall(gene_pattern, comment)
    
    # Store results in a structured format
    extracted_info.append({
        "Disease": disease_id,
        "Comment": comment,
        "Variants": variants,
        "Genes": genes
    })

# Convert to DataFrame
df_extracted_info = pd.DataFrame(extracted_info)

# Apply wrapping style to comment for readability
df_extracted_info_styled = df_extracted_info.style.set_properties(
    **{'white-space': 'pre-wrap', 'text-align': 'left'}
)

# Display the wrapped DataFrame in Jupyter
display(df_extracted_info_styled)



| | Disease | Comment | Variants | Genes |
|---|---|---|---|---|
| 0 | http://purl.uniprot.org/diseases/1773 | The disease is caused by variants affecting the gene represented in this entry. In hyperlysinemia 1, both enzymatic functions of AASS are defective and patients have increased serum lysine and possibly increased saccharopine. Some individuals, however, retain significant amounts of lysine-ketoglutarate reductase and present with saccharopinuria, a metabolic condition with few, if any, clinical manifestations. | [] | ['AASS'] |
| 1 | http://purl.uniprot.org/diseases/4240 | The protein represented in this entry is involved in disease pathogenesis. A selective decrease in mitochondrial NADP(H) levels due to NADK2 mutations causes a deficiency of NADPH-dependent mitochondrial enzymes, such as DECR1 and AASS. | [] | ['NADP', 'NADK2', 'NADPH', 'DECR1', 'AASS'] |
| 2 | http://purl.uniprot.org/diseases/1676 | The disease is caused by variants affecting the gene represented in this entry. | [] | [] |
| 3 | http://purl.uniprot.org/diseases/245 | The disease is caused by variants affecting the gene represented in this entry. | [] | [] |
| 4 | http://purl.uniprot.org/diseases/1150 | The disease is caused by variants affecting the gene represented in this entry. | [] | [] |
| 5 | http://purl.uniprot.org/diseases/5064 | The disease is caused by variants affecting the gene represented in this entry. | [] | [] |
| 6 | http://purl.uniprot.org/diseases/3735 | The gene represented in this entry is involved in disease pathogenesis. | [] | [] |
| 7 | http://purl.uniprot.org/diseases/1164 | The disease is caused by variants affecting the gene represented in this entry. | [] | [] |
| 8 | http://purl.uniprot.org/diseases/2602 | Disease susceptibility is associated with variants affecting the gene represented in this entry. | [] | [] |
| 9 | http://purl.uniprot.org/diseases/5294 | Disease susceptibility is associated with variants affecting the gene represented in this entry. | [] | [] |
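
The `Variants` column is empty in every row because the variant pattern requires the exact singular words, while the comments use plurals ("variants", "mutations"). The `Genes` column also picks up cofactor names such as NADP and NADPH. The following is a minimal sketch of one possible refinement, not a validated extraction method: the plural-aware pattern and the `NON_GENE_TOKENS` exclusion set are assumptions introduced here for illustration, and the sample comments are copied from the table above.

```python
import re

# Assumed refinement: an optional trailing "s" so plural forms are matched too
variant_pattern = r"\b(?:variant|mutation|polymorphism)s?\b"
# Same basic gene pattern as in the cell above
gene_pattern = r"\b[A-Z0-9]{2,}\b"
# Hypothetical exclusion list for uppercase tokens that are not gene symbols
NON_GENE_TOKENS = {"NADP", "NADPH"}

# Comments copied from the results above
comments = [
    "The disease is caused by variants affecting the gene represented in this entry.",
    "The protein represented in this entry is involved in disease pathogenesis. "
    "A selective decrease in mitochondrial NADP(H) levels due to NADK2 mutations "
    "causes a deficiency of NADPH-dependent mitochondrial enzymes, such as DECR1 and AASS.",
    "The gene represented in this entry is involved in disease pathogenesis.",
]


def extract_mentions(comment: str) -> dict:
    """Return variant keywords and candidate gene symbols found in one comment."""
    variants = re.findall(variant_pattern, comment, flags=re.IGNORECASE)
    genes = [g for g in re.findall(gene_pattern, comment) if g not in NON_GENE_TOKENS]
    return {"Variants": variants, "Genes": genes}


for comment in comments:
    print(extract_mentions(comment))
```

An exclusion list is only a stopgap; checking candidate tokens against a curated list of gene symbols would be more reliable than pattern matching alone.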


## Generate complete ShEx shapes from VoID description

You can also generate the complete ShEx shapes for a SPARQL endpoint with:

In [7]:
from sparql_llm import get_shex_from_void

shex_str = get_shex_from_void("https://sparql.uniprot.org/sparql/")
print(shex_str)

ImportError: cannot import name 'get_shex_from_void' from 'sparql_llm' (/Users/annedeslattesmays/miniconda3/envs/sparqllm/lib/python3.9/site-packages/sparql_llm/__init__.py)
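
The `ImportError` above indicates that the installed `sparql_llm` package does not expose `get_shex_from_void` at its top level in this environment. As a minimal sketch, the hypothetical helper `module_exports` below uses only the standard library to check whether a module provides a given name before importing it. It does not assume anything about where the function lives in other versions of the package.

```python
import importlib


def module_exports(module_name: str, attr: str) -> bool:
    """Hypothetical helper: True if the module can be imported and defines `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself is not installed in this environment
        return False
    return hasattr(module, attr)


# Check the name from the failing cell before trying to import it
if module_exports("sparql_llm", "get_shex_from_void"):
    print("get_shex_from_void is available")
else:
    print("get_shex_from_void is not exported by the installed sparql_llm")
```

If the check fails, comparing the installed version (`pip show sparql-llm`) with the version the documentation targets is a reasonable next step.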