# Retrieval Augmented Generation (RAG) with Knowledge Graph using SPARQL

# Retrieval Augmented Generation (RAG)

**Retrieval Augmented Generation (RAG)** combines large language models (LLMs) with external knowledge sources, such as vector data stores or knowledge graphs, to improve response quality and informativeness. Although LLMs are pretrained on large amounts of data, they are built to generalize and are not trained on extensive domain-specific knowledge, so their training data can limit them. RAG addresses this by retrieving relevant information (passages, facts) from external knowledge sources and adding it to the LLM's input so that the model can return domain-specific responses. This allows LLMs to generate more comprehensive and contextually aware responses in tasks like question answering, summarization, and text generation.
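
To make the retrieve-then-augment flow concrete, here is a minimal sketch in plain Python. The in-memory `FACTS` list, the keyword-overlap `retrieve` function, and the prompt wording are all illustrative assumptions. A real system would query a vector store or a knowledge graph and send the resulting prompt to an LLM.

```python
# Toy knowledge source (a stand-in for a vector store or knowledge graph).
FACTS = [
    "Inception is a movie directed by Christopher Nolan.",
    "Saving Private Ryan is a movie directed by Steven Spielberg.",
    "Tom Hanks is an actor in Saving Private Ryan.",
]


def retrieve(question: str, facts: list[str], top_k: int = 2) -> list[str]:
    """Rank facts by how many words they share with the question."""
    question_words = set(question.lower().replace("?", "").split())
    scored = [
        (len(question_words & set(fact.lower().rstrip(".").split())), fact)
        for fact in facts
    ]
    # Keep only facts that share at least one word, best matches first.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [fact for _, fact in ranked[:top_k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Prepend the retrieved facts to the question before calling an LLM."""
    context_block = "\n".join(f"- {fact}" for fact in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"


question = "Who directed Saving Private Ryan?"
prompt = build_prompt(question, retrieve(question, FACTS))
print(prompt)
```

The LLM never sees the whole knowledge source, only the few facts that `retrieve` selects, which is why retrieval quality matters as much as the model itself.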

## Why RAG with Knowledge Graph?

While vector databases are commonly used as external knowledge sources for RAG architectures, knowledge graphs offer distinct advantages in specific scenarios. Let's delve into when a knowledge graph might be a better fit. 

- **Structured Knowledge Representation**: Knowledge graphs represent information in a structured way, using entities, relationships, and properties. This structure enables more precise and interpretable representation of knowledge, which is particularly useful for domains with complex and interconnected concepts, such as healthcare, finance, or scientific domains.
- **Reasoning and Inference**: Knowledge graphs allow for reasoning and inference over the represented knowledge because they capture the intricacies of the relationships between entities in the data, e.g. between *Product* and *Customer*. This capability enables systems to draw logical conclusions, uncover implicit relationships, and answer complex queries that require combining multiple pieces of information. This paves the way for powerful recommendation systems that can truly personalize user experiences. Vector databases, on the other hand, are primarily designed for similarity-based retrieval and lack this reasoning capability.
- **Multi-hop Question Answering**: Knowledge graphs excel at answering multi-hop questions, where the answer requires traversing multiple relationships or combining information from different parts of the knowledge base. This is particularly useful in domains with intricate relationships where understanding the connections between entities is crucial. For instance, in the healthcare domain, responding to a query like "What medications should be avoided for patients with a specific genetic mutation and a certain comorbidity?" would require integrating insights from different sections of the knowledge graph. 
- **Explainability and Provenance**: Knowledge graphs provide explainability by revealing the reasoning path and provenance of the information used to generate an answer. This transparency is valuable in applications where interpretability and traceability are essential, such as decision support systems or regulatory compliance scenarios.
- **Domain-specific Knowledge Integration**: Knowledge graphs can effectively integrate and represent domain-specific knowledge from various sources, such as ontologies, taxonomies, or expert-curated knowledge bases. This capability is advantageous in domains with well-established domain models or when incorporating specialized knowledge is necessary.

While vector databases are efficient for similarity-based retrieval and can work well in certain domains, knowledge graphs offer advantages in representing complex, structured knowledge, enabling reasoning and inference, and providing explainability and provenance. Therefore, RAG with knowledge graphs is particularly suitable for domains with intricate relationships, multi-hop question answering requirements, or where interpretability and integration of domain-specific knowledge are crucial.

# Architecture

We will build the following RAG with Knowledge Graph architecture in this notebook. It leverages a Large Language Model (LLM) from Amazon Bedrock and a knowledge graph stored in Amazon Neptune containing data from the Internet Movie Database (IMDb). IMDb is a user-contributed online database that provides comprehensive information about movies, TV shows, and other entertainment media. 

While LLMs are trained on massive datasets, they often struggle with specific industry knowledge. This RAG architecture addresses this by allowing our LLM to access and leverage data from IMDb, enabling it to provide more informative and entertainment-focused responses.

The architecture has the following components:
- **Amazon S3**: Object store for the Resource Description Framework (RDF) formatted IMDb dataset.
- **Amazon Neptune**: Graph database service that ingests the IMDb dataset from the S3 bucket to create a knowledge graph.
- **Amazon Bedrock**: LLM hosting service. The hosted LLM is used to query the knowledge graph and retrieve additional context for response augmentation. 

![Knowledge Graph RAG Architecture](static/knowledge-graph-rag-arch.png)


# Fundamental Concepts

To ensure a deeper understanding and maximize your learning from this exercise, let's review key foundational concepts explored in this notebook. Feel free to skip this section if you already have a solid understanding of Graph Database and Knowledge Graph concepts.

## Graph Database

A **graph database** is a systematic collection of data that emphasizes the relationships between the different data entities. This type of NoSQL database uses mathematical graph theory to represent data connections, which makes it easier to model and manage highly connected data. Unlike relational databases, which store data in rigid table structures, graph databases store data as a network of entities and relationships. Graph databases are purpose-built to store and navigate relationships. As a result, they often provide better performance and flexibility for highly connected data and are well suited to modeling real-world scenarios.

Graph databases are a good fit for the following **use cases**:
- **Fraud detection**
- **Social Media Applications**
- **Recommendation Engines**
- **Route Optimization**
- **Knowledge Management**
- **Pattern Discovery**

## Knowledge Graph

A **knowledge graph** is built on a graph database by linking entities and their relationships, often through data extraction from various sources and manual curation by domain experts. A knowledge graph provides a flexible way to structure and connect information, making it easier to understand and access for everyone in an organization. Compared to traditional relational databases, graph databases are better suited for modeling complex real-world data with its inherent diversity. A traditional relational database focuses on storing data points, while a knowledge graph captures the relationships and meaning between them. This "semantic" approach allows us to model real-world complexities and unlock hidden connections. Furthermore, a knowledge graph can integrate information from various sources, structured or unstructured, creating a holistic view. This empowers organizations not only to access information easily and build powerful applications, but also to leverage automated reasoning to uncover valuable insights they might otherwise have missed.

There are two main types of knowledge graph models which are **Property Graph** and **Resource Description Framework (RDF) Graph**. Here is a brief summary describing each of those graph models.

- **Property Graph**
    - Focuses on **entities (nodes)** and the **relationships (edges)** between them. Nodes and edges can have properties associated with them, allowing for rich descriptions.
    - Queried with languages like **openCypher** or **Gremlin**
    - Known for querying simplicity within a single knowledge source
    
- **RDF Graph**
    - Represents information as **entities (resources)**, **properties (attributes)**, and **relationships**. Entities and properties are identified using **URIs (Uniform Resource Identifiers)**, ensuring a standardized format.
    - Queried with the **SPARQL** language. 
    - Known for standardization across global knowledge sources. 

For the purposes of this notebook, we will focus on RDF graph as it provides the standardization required for enterprise scalability where data integration from multiple sources is necessary.    
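
As a rough illustration of how the two models differ, the sketch below takes one property-graph style node and flattens it into RDF-style triples. The node layout, the `http://example.org/` namespace, and the `node_to_triples` helper are assumptions made for this example and do not reflect how Neptune stores either model internally.

```python
# A property-graph style record: the node carries its own key/value properties.
# The identifier "m1" and the dictionary layout are hypothetical.
node = {"id": "m1", "label": "Movie", "properties": {"title": "Inception", "year": 2010}}

BASE = "http://example.org/"  # assumed namespace for this illustration
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def node_to_triples(node: dict) -> list[tuple[str, str, object]]:
    """Flatten one property-graph node into (subject, predicate, object) triples."""
    subject = BASE + node["id"]
    # The node label becomes an rdf:type statement.
    triples = [(subject, RDF_TYPE, BASE + node["label"])]
    # Every property becomes its own statement with a URI predicate.
    for key, value in node["properties"].items():
        triples.append((subject, BASE + key, value))
    return triples


for triple in node_to_triples(node):
    print(triple)
```

In the RDF form every identifier and property name is a URI, which is what lets data from independent sources be merged without name clashes.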

## Resource Description Framework (RDF) Data Model

![RDF Data Model](static/rdf-data-model.png)

The **Resource Description Framework (RDF)** is a standard data model defined by the **World Wide Web Consortium (W3C)** for representing information on the web. RDF graphs are created from this data model. It is based on the idea of making statements about resources in the form of subject-predicate-object expressions, known as triples. Here's a brief explanation:

- **Subject**: Represents the entity or resource you're describing. This entity is identified using a Uniform Resource Identifier (URI), which acts like a unique web address for the resource. Examples of URIs include web page URLs or identifiers created specifically for entities within an RDF graph.
- **Predicate**: Defines the relationship between the subject and the object. It is a property or concept that is also identified using a URI. These URIs act like a vocabulary of terms to describe the relationships precisely.
- **Object**: Provides the value of the property/attribute or relationship defined by the predicate. This can be:
    - Another resource identified by a URI (e.g., connecting to another entity in the graph).
    - A literal value (like a number or text string) enclosed in quotes (e.g., "Paris").
    - Another RDF statement (for very complex relationships).

For example, consider the statement: *"Brussels is a city in Belgium"*. In RDF, this would be represented as:

```text
  <http://example.org/brussels> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/City> .
  <http://example.org/brussels> <http://dbpedia.org/property/location> <http://example.org/belgium> .
```

In this example:
- The first line states that the resource **(subject)** **<<http://example.org/brussels>>** (identified by a URI) has the type **(predicate)** City **(object)** **<<http://schema.org/City>>** (another URI defining the type).
- The second line states that Brussels **(subject)** **<<http://example.org/brussels>>** has a location **(predicate)** of Belgium **(object)** **<<http://example.org/belgium>>** (another resource with a URI).
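
The subject-predicate-object structure is simple enough to pull apart with a few lines of Python. The sketch below parses the two statements above with a regular expression. It is a deliberately limited illustration that handles only triples made of three URIs; the `parse_uri_triples` helper is hypothetical, and a dedicated RDF library is the better choice for real data with literals and other syntax.

```python
import re

# The two statements from the example above.
NTRIPLES = """
<http://example.org/brussels> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/City> .
<http://example.org/brussels> <http://dbpedia.org/property/location> <http://example.org/belgium> .
"""

# Matches lines whose subject, predicate, and object are all URIs in angle brackets.
# Literals and blank nodes are out of scope for this sketch.
TRIPLE_PATTERN = re.compile(r"^\s*<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")


def parse_uri_triples(text: str) -> list[tuple[str, ...]]:
    """Return one (subject, predicate, object) tuple per matching line."""
    triples = []
    for line in text.splitlines():
        match = TRIPLE_PATTERN.match(line)
        if match:
            triples.append(match.groups())
    return triples


for subject, predicate, obj in parse_uri_triples(NTRIPLES):
    print(f"subject:   {subject}\npredicate: {predicate}\nobject:    {obj}\n")
```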



The URI format provides the following benefits:
- **Standardization**: Ensures everyone uses the same identifiers for resources, promoting interoperability and data exchange across different systems.
- **Disambiguation**: Uniquely identifies resources, avoiding confusion between entities with the same name.
- **Machine-Readability**: Allows machines to understand the meaning and context of the data by linking to standard vocabularies or external information sources.

In essence, RDF uses **URIs** within its **Subject-Predicate-Object** structure to create a **standardized** and **machine-understandable** way to represent information. This lets RDF provide a flexible and extensible way to represent and link data on the web, enabling the creation of the **Semantic Web** and facilitating **data integration** and **interoperability**.

## SPARQL

**SPARQL (SPARQL Protocol and RDF Query Language)** is a language specifically designed to interact with data stored in the **RDF** format.  RDF graphs are giant webs of information where entities are connected by relationships. SPARQL acts like a powerful search engine for these webs, allowing you to ask questions and retrieve specific information. Unlike SQL used for relational databases, SPARQL focuses on the relationships between entities.  You can use it to find things like all cities within a country, people with connections to each other, all movies starring a certain actor, or specific details about an entity within the RDF graph. This makes SPARQL a valuable tool for tasks like information retrieval, data exploration, and even uncovering hidden connections within the linked data of an RDF graph.
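
The core idea behind a SPARQL `WHERE` clause is matching triple patterns that contain variables and joining the matches on shared variables. The sketch below imitates that idea over a handful of in-memory triples. The data, the shortened identifiers, and the `match_pattern` and `query` helpers are assumptions for illustration only; a real SPARQL engine does far more (full URIs, filters, optimization).

```python
# Toy triples with shortened identifiers (not the IMDb dataset).
TRIPLES = [
    ("brussels", "type", "City"),
    ("brussels", "location", "belgium"),
    ("antwerp", "type", "City"),
    ("antwerp", "location", "belgium"),
    ("paris", "type", "City"),
    ("paris", "location", "france"),
]


def match_pattern(pattern, triple, bindings):
    """Try to extend `bindings` so that `pattern` matches `triple`; return None on failure."""
    new_bindings = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):  # variable: bind it, or check the existing binding
            if new_bindings.setdefault(term, value) != value:
                return None
        elif term != value:  # constant: must match exactly
            return None
    return new_bindings


def query(patterns, triples):
    """Evaluate a list of triple patterns as a join over shared variables."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in triples
            if (extended := match_pattern(pattern, triple, bindings)) is not None
        ]
    return solutions


# Roughly: SELECT ?city WHERE { ?city type City . ?city location belgium }
results = query([("?city", "type", "City"), ("?city", "location", "belgium")], TRIPLES)
print([row["?city"] for row in results])
```

Because `?city` appears in both patterns, only cities that satisfy both statements survive the join, which is how a query narrows down to "all cities within a country".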



# Environment Setup

## Upload to S3 Bucket
1. Follow these instructions to **upload** this RDF formatted [IMDb Dataset](/imdb.ttl.gz) file to an **S3 Bucket** - https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html  

## Request Claude Sonnet Model Access
1. Follow these instructions to **request access** to **Claude 3 Sonnet** - https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html  

## Deploy CloudFormation Stack

1. In a local terminal, ensure **awscli** is configured with credentials and navigate to the **rag-with-knowledge-graph** directory.  


2. Enter parameter values in the provided **parameters.json** file.


3. Run the following command to deploy the CloudFormation stack, which creates the Neptune notebook instance from the provided **neptune.yaml** file and the **parameters.json** file edited in the previous step:

```aws cloudformation create-stack --stack-name neptune-stack --template-body file://neptune.yaml --capabilities CAPABILITY_IAM --parameters file://parameters.json``` 


4. Search for **Neptune** in the AWS Console and click on it  
<img src="static/neptune_console.png" width="600">


5. Click **Notebooks**  
<img src="static/neptune_notebook.png" width="600"> 


6. Click the radio button of the notebook, which should have the name **aws-neptune-{SageMakerNotebookName}**. **SageMakerNotebookName** is a parameter populated in the parameters.json file and used in the CloudFormation template.  
Example: <img src="static/click_notebook.png" width="600">  

7. Click **Actions** > 'Open JupyterLab'  
<img src="static/open_jupyter_lab.png" width="600">  

8. Click the **Upload** icon, and upload this **rag-with-knwoledge-graph.ipynb** notebook file, or drag and drop it in Jupyter Lab  
Example to upload file: <img src="static/jupyterlab_upload.png" width="600">  

9. Click the **Upload** icon, and upload the image files in the **images/** directory. **Do not upload the folder**. Jupyter Lab does not support uploading folders. 

10. Skip to **Load Data Into Neptune** section. 

## Load Data Into Neptune

We will now load the RDF-formatted IMDb dataset, which we uploaded to the S3 bucket in the previous section, into our Neptune database.

In [None]:
# Load dataset from S3 into Neptune

%load -s s3://{S3_BUCKET}/ -f turtle -p OVERSUBSCRIBE -l {IAM_ROLE} --store-to loadres --run
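
For readers who want to see what such a load request looks like outside the notebook magic, the sketch below builds a JSON body for Neptune's bulk loader HTTP endpoint (`/loader`). The mapping of the magic's options to the request fields is our reading of the options shown above, and the bucket name, role ARN, and `build_loader_request` helper are placeholders. The sketch only builds and prints the body; it does not send anything.

```python
import json


def build_loader_request(bucket: str, iam_role_arn: str, region: str) -> dict:
    """Build the JSON body for a POST to https://<neptune-endpoint>:8182/loader."""
    return {
        "source": f"s3://{bucket}/",     # value passed with -s
        "format": "turtle",              # value passed with -f
        "iamRoleArn": iam_role_arn,      # value passed with -l
        "region": region,                # region of the S3 bucket and the cluster
        "parallelism": "OVERSUBSCRIBE",  # value passed with -p
    }


# Placeholder values (hypothetical); substitute your own bucket and role.
payload = build_loader_request(
    bucket="my-imdb-bucket",
    iam_role_arn="arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    region="us-east-1",
)
print(json.dumps(payload, indent=2))
```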

# Querying with SPARQL

In this section, we will run SPARQL queries to get visibility into our knowledge graph created from IMDb data and to get an overall sense of how to query RDF data.

### Let's look at the structure of our Knowledge Graph

From the query below, we can see that our knowledge graph is structured so that resources, attributes, and relationships are mapped to IMDb website URIs. If you click on the graph tab of the results, you can see multiple graphs that show the relationships related to **productions** in the data such as *cast-director*, *cast-writer*, *cast-actor*, *cast-actress*, *cast-editor*, etc. 

If you'd like to explore further, you can:
1. Pick a **graph**
2. Click the subject **production** node
3. Click the **hamburger menu** icon on the menu bar located at the top right corner

You will now see all the details tied to the subject **production** resource, such as its name (defined as *primaryTitle* in the data), *averageRating*, *runTimeMinutes*, *startYear*, *numberOfVotes*, etc. 

In [None]:
%%sparql

SELECT ?subject ?predicate ?object
WHERE {
    ?subject ?predicate ?object .
}
LIMIT 1000
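
Outside the notebook, the same kind of query can be sent over HTTP using the SPARQL 1.1 Protocol. The sketch below only builds the request object and does not send it. The endpoint host is a placeholder and `build_sparql_request` is a hypothetical helper. When IAM authentication is enabled on the cluster, requests also need to be signed, which this sketch leaves out.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint: str, sparql: str) -> Request:
    """Build (but do not send) a SPARQL 1.1 Protocol query request."""
    body = urlencode({"query": sparql}).encode("utf-8")
    return Request(
        url=endpoint,
        data=body,  # supplying a body makes this a POST request
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
    )


# Placeholder endpoint; Neptune serves SPARQL at the /sparql path of the cluster endpoint.
request = build_sparql_request(
    "https://your-neptune-endpoint:8182/sparql",
    "SELECT ?subject ?predicate ?object WHERE { ?subject ?predicate ?object } LIMIT 10",
)
print(request.get_method(), request.full_url)
```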

### Let's look at the different Resource Types

The World Wide Web Consortium (W3C) defines a core RDF vocabulary under the namespace URI *<http://www.w3.org/1999/02/22-rdf-syntax-ns#>*. Within that namespace, the property *<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>* states which type (class) a resource belongs to. We will declare the namespace as the prefix `rdf:` so that a simple query using `rdf:type` can list the different resource types in the data.

In [None]:
%%sparql

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT DISTINCT ?type

WHERE 
{ 
    ?s rdf:type ?type 
}

### Let's look at the different relationships in the data

In [None]:
%%sparql

SELECT DISTINCT ?predicate

WHERE 
{ 
    ?subject ?predicate ?object 
}

### Using prefixes.

As you might have noticed when we queried for the different resource types, we used a prefix to keep the query simple. Going forward, we will use prefixes in all of our queries so we don't have to write out, e.g., *https://www.imdb.com/* every time we want to reference a URI under the IMDb namespace. 
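
A prefix is purely a textual shortcut: the query engine replaces it with the full namespace before matching. The sketch below mimics that expansion in Python. The `PREFIXES` table and the `expand` helper are illustrative and not part of any SPARQL library.

```python
PREFIXES = {
    "imdb": "https://www.imdb.com/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(prefixed_name: str, prefixes: dict[str, str]) -> str:
    """Expand a prefixed name such as 'imdb:movie' into a full URI."""
    prefix, _, local_name = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"Unknown prefix: {prefix!r}")
    return prefixes[prefix] + local_name


print(expand("imdb:primaryTitle", PREFIXES))
print(expand("rdf:type", PREFIXES))
```

The keyword `a`, which appears in the queries that follow, is SPARQL shorthand for `rdf:type`.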

### Let's query for movies that have "Inception" as their primary title.

In [None]:
%%sparql

PREFIX imdb: <https://www.imdb.com/>

SELECT ?movie
WHERE 
{
    ?movie a imdb:movie ;
           imdb:primaryTitle "Inception" .
} 

### Let's query for movies that have "Inception" as their primary title and retrieve all its relationships (predicates) and respective attribute values.

In [None]:
%%sparql

PREFIX imdb: <https://www.imdb.com/>

SELECT ?movie ?predicate ?value 

WHERE 
{
    ?movie a imdb:movie ; 
           imdb:primaryTitle "Inception" ;
           ?predicate ?value .
} 

### Graph traversal queries

We want to find the names of all the movies starring Tom Hanks. 

How are we going to do that in one query? We will have to traverse through resources and their relationships in the graph.

First, let's retrieve the resource URI of Tom Hanks in the database. This information can be used in the next query to retrieve the resource URIs of movies Tom Hanks has starred in. 

In [None]:
%%sparql

PREFIX imdb: <https://www.imdb.com/>

SELECT ?artist
WHERE 
{
    ?artist imdb:primaryName "Tom Hanks"
}

Next, let's retrieve the resource URIs of the movies Tom Hanks has starred in. This information will be used in the next query to retrieve the *Primary Title* attribute values for each movie. 

In [None]:
%%sparql

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX imdb: <https://www.imdb.com/>

SELECT ?movie
WHERE 
{
    ?movie a imdb:movie ;
           imdb:cast-actor imdb:nm0000158 .
}

Now, let's retrieve the *Primary Title* attribute values for each movie. This information gives us the names of all the movies starring Tom Hanks.

In [None]:
%%sparql

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX imdb: <https://www.imdb.com/>

SELECT ?title
WHERE 
{
    ?movie a imdb:movie ;
           imdb:cast-actor imdb:nm0000158 ;
           imdb:primaryTitle ?title .
}

Finally, let's put all of this together in a single query to retrieve the names of all the movies starring Tom Hanks. This example shows graph traversal through relationships between resources.

In [None]:
%%sparql

PREFIX imdb: <https://www.imdb.com/>

SELECT ?title
WHERE 
{
    ?artist imdb:primaryName "Tom Hanks" .
    ?movie a imdb:movie ;
           imdb:cast-actor ?artist ;
           imdb:primaryTitle ?title .
}

### Add some more complexity

In this query, let's retrieve the names of movies where Tom Hanks and Steven Spielberg have collaborated as actor and director respectively.

In [None]:
%%sparql

PREFIX imdb: <https://www.imdb.com/>

SELECT DISTINCT ?movieTitle 
WHERE {
    ?movie a imdb:movie ;
           imdb:directedBy ?steveSpiDir ;
           imdb:cast-actor ?tomHanksAct .
    
    ?steveSpiDir imdb:primaryName "Steven Spielberg" .
    ?tomHanksAct imdb:primaryName "Tom Hanks" .
    ?movie imdb:primaryTitle ?movieTitle .
}

### Construct new triples from data to show relationships between directors and the movies they directed

The **CONSTRUCT** query form is particularly useful when you need to generate or transform RDF data programmatically or when integrating data from multiple sources or schemas. It provides a flexible and powerful way to manipulate and restructure knowledge graph data based on your specific requirements.

This query returns a graph for each director linking them to all the movies they've directed. 

If you look at the graph tab of the results and observe each graph, you can see the following:
- **Director node**, which contains the **resource URI** of the **director**. By clicking the **hamburger icon**, you can also see the details of the node, which display the **director's name**. 
- **Movie node**, which contains the **resource URI** of the **movie**. By clicking the **hamburger icon**, you can also see the details of the node, which display the **movie title**.
- **Predicate** arrows with the value **directed by**, pointing in one direction from the movie node (the subject of the triple) to the director node (the object), which defines the **relationship** between them as *movie is directed by director*. 

In [None]:
%%sparql

PREFIX imdb: <https://www.imdb.com/>

CONSTRUCT {
  ?movie imdb:primaryTitle ?title .
  ?director imdb:primaryName ?directorName .
  ?movie imdb:directedBy ?director .
} 
WHERE {
  ?movie a imdb:movie ;
         imdb:primaryTitle ?title .
  ?director a imdb:director ;
            imdb:primaryName ?directorName .
  ?movie imdb:directedBy ?director .
}
LIMIT 100
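
Conceptually, `CONSTRUCT` fills in its template once for every solution of the `WHERE` clause and merges the resulting triples into one graph. The sketch below imitates that step in Python. The solution rows, the identifiers, and the `construct` helper are made up for illustration and do not come from the IMDb dataset.

```python
# Hypothetical WHERE-clause solutions, one dict per matched movie/director pair.
solutions = [
    {"movie": "imdb:ttEXAMPLE1", "title": "Example Movie A",
     "director": "imdb:nmEXAMPLE1", "directorName": "Example Director"},
    {"movie": "imdb:ttEXAMPLE2", "title": "Example Movie B",
     "director": "imdb:nmEXAMPLE1", "directorName": "Example Director"},
]

# The CONSTRUCT template from the query above; variables are marked with '?'.
TEMPLATE = [
    ("?movie", "imdb:primaryTitle", "?title"),
    ("?director", "imdb:primaryName", "?directorName"),
    ("?movie", "imdb:directedBy", "?director"),
]


def construct(template, solutions):
    """Instantiate the template once per solution and merge everything into one set of triples."""
    graph = set()
    for solution in solutions:
        for triple in template:
            # Replace each variable with its bound value; keep constants unchanged.
            graph.add(tuple(solution[t[1:]] if t.startswith("?") else t for t in triple))
    return graph


for triple in sorted(construct(TEMPLATE, solutions)):
    print(triple)
```

Both solutions produce the same director-name triple, so it appears only once in the merged result. This is why one director node can link to many movie nodes in the graph view.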

### Recommendation Queries

Let's use the graph to give us some recommendations. We will start by asking the knowledge graph to recommend **5 movies** belonging to the **Drama** genre. 

In [None]:
%%sparql

PREFIX imdb: <https://www.imdb.com/>
SELECT DISTINCT ?movie ?title
WHERE {
    ?movie a imdb:movie ;
           a imdb:Drama ;
           imdb:primaryTitle ?title .
}
LIMIT 5

Let's now ask the knowledge graph to recommend **5 movies** with a rating **greater than 9.0**.

In [None]:
%%sparql

PREFIX : <https://www.imdb.com/>

SELECT ?movie ?primaryTitle ?averageRating
WHERE {
    ?movie a :movie ;
           :averageRating ?averageRating ;
           :primaryTitle ?primaryTitle .
    FILTER(?averageRating > 9.0)
}
LIMIT 5

## Complex Recommendation Query

Let's now add some complexity and ask our knowledge graph to recommend **5 movies** belonging to the **drama** genre with a rating **greater than 9.0**, highest rated first. 

In [None]:
%%sparql

PREFIX imdb: <https://www.imdb.com/>

SELECT DISTINCT ?movie ?title ?rating
WHERE {
    ?movie a imdb:movie;
           a imdb:Drama;
           imdb:averageRating ?rating;
           imdb:primaryTitle ?title.
    FILTER(?rating > 9.0)
}
ORDER BY DESC(?rating)
LIMIT 5
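
The three solution modifiers in this query behave like a familiar filter, sort, and slice pipeline. The sketch below reproduces them over a few made-up rows (the titles and ratings are hypothetical) and uses a limit of 2 to keep the output short.

```python
# Hypothetical query solutions; titles and ratings are made up for illustration.
rows = [
    {"title": "Example A", "rating": 9.1},
    {"title": "Example B", "rating": 8.7},
    {"title": "Example C", "rating": 9.6},
    {"title": "Example D", "rating": 9.0},
    {"title": "Example E", "rating": 9.3},
]

# FILTER(?rating > 9.0): a strict comparison, so a rating of exactly 9.0 is excluded.
filtered = [row for row in rows if row["rating"] > 9.0]
# ORDER BY DESC(?rating): highest rating first.
ordered = sorted(filtered, key=lambda row: row["rating"], reverse=True)
# LIMIT 2: keep only the first two rows of the ordered result.
top = ordered[:2]

print([row["title"] for row in top])
```

Without the ordering step, `LIMIT` simply returns whichever matching rows the engine produces first, so adding `ORDER BY` is what turns "five matching movies" into "the five best-rated matching movies".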

## Ask Questions to Bedrock LLM via LangChain to query Knowledge Graph

LangChain is an open-source framework designed to make building generative AI applications powered by large language models (LLMs) easier. It facilitates building a program that can answer your questions or complete tasks based on its understanding of natural language. LangChain simplifies the process of creating these applications by providing building blocks and tools for developers. 

Manually crafting intricate SPARQL queries for knowledge graph exploration can be a tedious task. Ideally, our LLM should automatically generate these queries when presented with natural language questions seeking contextually enriched responses from the IMDb database. This is where LangChain steps in, providing an abstraction layer that eliminates the need for manual SPARQL query construction. By integrating with our Amazon Bedrock-hosted LLM, LangChain enables the automatic generation of SPARQL queries tailored to our knowledge graph stored within the Amazon Neptune database.
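
The general pattern can be sketched in a few lines: translate the question into SPARQL, run the query, then answer from the results. The code below illustrates that pattern only and is not LangChain's implementation. The function names and the three stand-in callables are assumptions, chosen so that the example runs without an LLM or a database.

```python
from typing import Callable


def answer_with_graph(
    question: str,
    generate_sparql: Callable[[str], str],
    run_query: Callable[[str], list],
    summarize: Callable[[str, list], str],
) -> dict:
    """Question -> SPARQL -> graph results -> natural-language answer."""
    sparql = generate_sparql(question)     # LLM step: translate the question into a query
    context = run_query(sparql)            # graph step: execute the query
    answer = summarize(question, context)  # LLM step: answer using the query results
    return {"query": question, "sparql": sparql, "context": context, "result": answer}


# Stand-ins for the LLM and the graph so the sketch runs offline (all hypothetical).
def fake_generate(question: str) -> str:
    return "SELECT (COUNT(?m) AS ?n) WHERE { ?m a <https://www.imdb.com/movie> }"


def fake_run(sparql: str) -> list:
    return [{"n": "3"}]


def fake_summarize(question: str, context: list) -> str:
    return f"There are {context[0]['n']} movies in the graph."


output = answer_with_graph(
    "How many movies are in the graph?", fake_generate, fake_run, fake_summarize
)
print(output["result"])
```

Keeping the generated query and the raw results alongside the answer makes it possible to check where an answer came from.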

In [None]:
# Install dependencies
!pip install --upgrade --quiet langchain langchain-community langchain-aws

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 2.12.2 requires botocore<1.34.52,>=1.34.41, but you have botocore 1.34.85 which is incompatible.
gremlinpython 3.6.2 requires aiohttp<=3.8.1,>=3.8.0, but you have aiohttp 3.9.5 which is incompatible.

In [None]:
import boto3
from langchain.chains.graph_qa.neptune_sparql import NeptuneSparqlQAChain
from langchain_aws import ChatBedrock
from langchain_community.graphs import NeptuneRdfGraph

# Neptune connection settings: replace the host with your own cluster endpoint
host = "db-neptune-1-instance-1.ci5xvnarpspo.us-east-1.neptune.amazonaws.com"
port = 8182  # change if different
region = "us-east-1"  # change if different
graph = NeptuneRdfGraph(host=host, port=port, use_iam_auth=True, region_name=region)

# Claude 3 Sonnet served through Amazon Bedrock in the same region
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"
bedrock_client = boto3.client("bedrock-runtime", region_name=region)
llm = ChatBedrock(model_id=MODEL_ID, client=bedrock_client)

# Chain that generates SPARQL from a question, runs it on Neptune, and answers from the results
chain = NeptuneSparqlQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True,
    top_k=10,
    return_intermediate_steps=True,  # include the generated SPARQL and raw results in the output
    return_direct=False,
)

## Invoke LangChain to ask questions to our knowledge graph 

Let's start off with a simple question.

In [None]:
chain.invoke("""How many movies are in the graph?""")



> Entering new NeptuneSparqlQAChain chain...
Generated SPARQL:

PREFIX : <https://www.imdb.com/>
SELECT (COUNT(?movie) AS ?numMovies)
WHERE {
    ?movie a :movie .
}

Full Context:
{'head': {'vars': ['numMovies']}, 'results': {'bindings': [{'numMovies': {'datatype': 'http://www.w3.org/2001/XMLSchema#integer', 'type': 'literal', 'value': '678069'}}]}}

> Finished chain.


{'query': 'How many movies are in the graph?',
 'result': 'Based on the information provided from the SPARQL query results, there are 678,069 movies in the graph.',
 'intermediate_steps': [{'query': '\nPREFIX : <https://www.imdb.com/>\nSELECT (COUNT(?movie) AS ?numMovies)\nWHERE {\n    ?movie a :movie .\n}\n'},
  {'context': {'head': {'vars': ['numMovies']},
    'results': {'bindings': [{'numMovies': {'datatype': 'http://www.w3.org/2001/XMLSchema#integer',
        'type': 'literal',
        'value': '678069'}}]}}}]}
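
Because the chain was created with `return_intermediate_steps=True`, the returned dictionary carries both the generated SPARQL and the raw query results. The sketch below shows one way to pull them out. The `result` dictionary is copied from the output above, and `flatten_bindings` is a hypothetical helper.

```python
# A result shaped like the one printed above (values copied from that output).
result = {
    "query": "How many movies are in the graph?",
    "result": "Based on the information provided from the SPARQL query results, "
              "there are 678,069 movies in the graph.",
    "intermediate_steps": [
        {"query": "\nPREFIX : <https://www.imdb.com/>\nSELECT (COUNT(?movie) AS ?numMovies)\n"
                  "WHERE {\n    ?movie a :movie .\n}\n"},
        {"context": {
            "head": {"vars": ["numMovies"]},
            "results": {"bindings": [{"numMovies": {
                "datatype": "http://www.w3.org/2001/XMLSchema#integer",
                "type": "literal",
                "value": "678069",
            }}]},
        }},
    ],
}


def flatten_bindings(context: dict) -> list[dict]:
    """Turn SPARQL JSON results into plain {variable: value} rows."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in context["results"]["bindings"]
    ]


generated_sparql = result["intermediate_steps"][0]["query"]
rows = flatten_bindings(result["intermediate_steps"][1]["context"])
print(generated_sparql.strip())
print(rows)
```

Comparing `rows` with the text in `result["result"]` is a quick way to confirm that the answer is grounded in what the graph actually returned.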

Now that our question has successfully been answered, let's ask a question that would require additional effort with graph traversal. 

In [None]:
chain.invoke("""List the names of the movies directed by Ridley Scott""")



> Entering new NeptuneSparqlQAChain chain...
Generated SPARQL:

PREFIX imdb: <https://www.imdb.com/>

SELECT ?title
WHERE {
    ?movie imdb:directedBy ?director .
    ?director imdb:primaryName "Ridley Scott" .
    ?movie imdb:primaryTitle ?title .
}

Full Context:
{'head': {'vars': ['title']}, 'results': {'bindings': [{'title': {'type': 'literal', 'value': 'The League of Uncharitable Ladies'}}, {'title': {'type': 'literal', 'value': 'The Resurrectionists'}}, {'title': {'type': 'literal', 'value': 'Death Begins at Seventy'}}, {'title': {'type': 'literal', 'value': 'American Gangster'}}, {'title': {'type': 'literal', 'value': 'Gladiator 2'}}, {'title': {'type': 'literal', 'value': "It's What I Do"}}, {'title': {'type': 'literal', 'value': 'White Squall'}}, {'title': {'type': 'literal', 'value': 'Exodus: Gods and Kings'}}, {'title': {'type': 'literal', 'value': 'Body of Lies'}}, {'title': {'type': 'literal', 'value': 'Behold'}}, {'title': {'type': 

{'query': 'List the names of the movies directed by Ridley Scott',
 'result': 'Here are the movie titles directed by Ridley Scott from the information provided:\n\n- The Counselor\n- Exodus: Gods and Kings\n- Prometheus\n- Alien: Covenant\n- Alien: Deleted Scenes \n- The Martian\n- Alien: Covenant - Prologue: The Crossing\n- Blade Runner\n- Alien\n- The Last Duel\n- The Duellists\n- Gladiator\n- Kingdom of Heaven\n- Robin Hood\n- Body of Lies\n- Black Hawk Down\n- Hannibal\n- G.I. Jane\n- Thelma & Louise\n- Black Rain\n- Someone to Watch Over Me\n- Legend\n- Alien: Outtakes',
 'intermediate_steps': [{'query': '\nPREFIX imdb: <https://www.imdb.com/>\n\nSELECT ?title\nWHERE {\n    ?movie imdb:directedBy ?director .\n    ?director imdb:primaryName "Ridley Scott" .\n    ?movie imdb:primaryTitle ?title .\n}\n'},
  {'context': {'head': {'vars': ['title']},
    'results': {'bindings': [{'title': {'type': 'literal',
        'value': 'The League of Uncharitable Ladies'}},
      {'title': {'type

Now that it has answered the question effortlessly, let's add some complexity in our next question. 

In [None]:
chain.invoke("""What are the names of the movies directed by Steven Spielberg where Tom Hanks is an actor?""")



> Entering new NeptuneSparqlQAChain chain...
Generated SPARQL:

PREFIX imdb: <https://www.imdb.com/>

SELECT DISTINCT ?movieTitle
WHERE {
    ?movie imdb:directedBy ?director .
    ?director imdb:primaryName "Steven Spielberg" .
    ?movie imdb:cast-actor ?actor .
    ?actor imdb:primaryName "Tom Hanks" .
    ?movie imdb:primaryTitle ?movieTitle .
}

Full Context:
{'head': {'vars': ['movieTitle']}, 'results': {'bindings': [{'movieTitle': {'type': 'literal', 'value': 'The Post'}}, {'movieTitle': {'type': 'literal', 'value': 'Bridge of Spies'}}, {'movieTitle': {'type': 'literal', 'value': 'Catch Me If You Can'}}, {'movieTitle': {'type': 'literal', 'value': 'The Terminal'}}, {'movieTitle': {'type': 'literal', 'value': 'Saving Private Ryan'}}]}}

> Finished chain.


{'query': 'What are the names of the movies directed by Steven Spielberg where Tom Hanks is an actor?',
 'result': 'Here are the movies directed by Steven Spielberg where Tom Hanks is an actor, based on the information provided:\n\n- The Post\n- Bridge of Spies \n- Catch Me If You Can\n- The Terminal\n- Saving Private Ryan',
 'intermediate_steps': [{'query': '\nPREFIX imdb: <https://www.imdb.com/>\n\nSELECT DISTINCT ?movieTitle\nWHERE {\n    ?movie imdb:directedBy ?director .\n    ?director imdb:primaryName "Steven Spielberg" .\n    ?movie imdb:cast-actor ?actor .\n    ?actor imdb:primaryName "Tom Hanks" .\n    ?movie imdb:primaryTitle ?movieTitle .\n}\n'},
  {'context': {'head': {'vars': ['movieTitle']},
    'results': {'bindings': [{'movieTitle': {'type': 'literal',
        'value': 'The Post'}},
      {'movieTitle': {'type': 'literal', 'value': 'Bridge of Spies'}},
      {'movieTitle': {'type': 'literal', 'value': 'Catch Me If You Can'}},
      {'movieTitle': {'type': 'literal', 'va

It automatically wrote a SPARQL query just like the one we wrote by hand in one of our previous examples, and it answered our question accurately. Now let's attempt to use it as a recommender system. 

In [None]:
chain.invoke("""Recommend me 5 dramas which are strictly movies that have a rating greater than 9.0""")



> Entering new NeptuneSparqlQAChain chain...
Generated SPARQL:

PREFIX : <https://www.imdb.com/>

SELECT ?movie ?title ?rating 
WHERE {
    ?movie a :movie ;
           a :Drama ;
           :averageRating ?rating ;
           :primaryTitle ?title .
    FILTER (?rating > 9.0)
}
LIMIT 5

Full Context:
{'head': {'vars': ['movie', 'title', 'rating']}, 'results': {'bindings': [{'movie': {'type': 'uri', 'value': 'https://www.imdb.com/tt0222023'}, 'rating': {'datatype': 'http://www.w3.org/2001/XMLSchema#float', 'type': 'literal', 'value': '9.1'}, 'title': {'type': 'literal', 'value': 'Long Sleepless Nights'}}, {'movie': {'type': 'uri', 'value': 'https://www.imdb.com/tt0053834'}, 'rating': {'datatype': 'http://www.w3.org/2001/XMLSchema#float', 'type': 'literal', 'value': '9.1'}, 'title': {'type': 'literal', 'value': 'Frau Irene Besser'}}, {'movie': {'type': 'uri', 'value': 'https://www.imdb.com/tt0240805'}, 'rating': {'datatype': 'http://www.w3.org/2001

{'query': 'Recommend me 5 dramas which are strictly movies that have a rating greater than 9.0',
 'result': "Here are 5 highly rated dramas with a rating above 9.0:\n\n1. Long Sleepless Nights (1992) - Rating: 9.1\n2. Frau Irene Besser (1957) - Rating: 9.1  \n3. Pierre qui brûle (1987) - Rating: 9.1\n4. Beyond the Horizon (2005) - Rating: 9.1\n5. Malammana Pavada (1994) - Rating: 9.1\n\nAll of these are critically acclaimed movies that achieved outstanding ratings from viewers and critics alike. I'd recommend checking them out if you're in the mood for an excellent drama film.",
 'intermediate_steps': [{'query': '\nPREFIX : <https://www.imdb.com/>\n\nSELECT ?movie ?title ?rating \nWHERE {\n    ?movie a :movie ;\n           a :Drama ;\n           :averageRating ?rating ;\n           :primaryTitle ?title .\n    FILTER (?rating > 9.0)\n}\nLIMIT 5\n'},
  {'context': {'head': {'vars': ['movie', 'title', 'rating']},
    'results': {'bindings': [{'movie': {'type': 'uri',
        'value': 'htt

It gave us the desired movie recommendations. Feel free to play around with this if you want to try out any more questions.