# ArangoDB + LangChain

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/Langchain.ipynb)

Large language models (LLMs) are emerging as a transformative technology, enabling developers to build applications that were previously out of reach. However, using LLMs in isolation is often insufficient for creating a truly powerful app; the real power comes when you combine them with other sources of computation or knowledge.

[LangChain](https://www.langchain.com/) is a framework for developing applications powered by language models. It enables applications that are:
- Data-aware: connect a language model to other sources of data
- Agentic: allow a language model to interact with its environment

On July 25, 2023, ArangoDB introduced the first release of the [ArangoGraphQAChain](https://langchain-langchain.vercel.app/docs/integrations/providers/arangodb) to the LangChain community, allowing you to use LLMs as a natural language interface to your ArangoDB data.

Please note: This notebook uses the LangChain `ChatOpenAI` wrapper, which requires you to have a **paid** [OpenAI API Key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key). However, other Chat Models are available as well: https://github.com/langchain-ai/langchain/tree/master/libs/langchain/langchain/chat_models

You can get a local ArangoDB instance running via the [ArangoDB Docker image](https://hub.docker.com/_/arangodb):  

```
docker run -p 8529:8529 -e ARANGO_ROOT_PASSWORD=changeme arangodb/arangodb
```

An alternative is to use the [ArangoDB Cloud Connector package](https://github.com/arangodb/adb-cloud-connector#readme) to get a temporary cloud instance running:

In [23]:
%%capture
%pip install python-arango # The ArangoDB Python Driver
%pip install adb-cloud-connector # The ArangoDB Cloud Instance provisioner
%pip install openai
%pip install langchain==0.0.271

In [24]:
# Instantiate ArangoDB Database
import json
from arango import ArangoClient

# from adb_cloud_connector import get_temp_credentials

# con = get_temp_credentials(tutorialName="LangChain")

# db = ArangoClient(hosts=con["url"]).db(
#     con["dbName"], con["username"], con["password"], verify=True
# )

# print(json.dumps(con, indent=2))

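# Connect to an existing ArangoDB instance instead
# (ArangoClient() defaults to http://127.0.0.1:8529)
dbName = "BRON"
username = "root"
password = "changeme"

db = ArangoClient().db(dbName, username, password, verify=True)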


In [25]:
# Instantiate the ArangoDB-LangChain Graph
from langchain.graphs import ArangoGraph

graph = ArangoGraph(db)

## Getting & Setting the ArangoDB Schema

An initial ArangoDB schema is generated when the `ArangoGraph` object is instantiated. The getter and setter below let you view or modify that schema:
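
Because `graph.schema` is a plain dictionary, it can also be condensed with ordinary Python instead of dumping the full JSON. The sketch below assumes the `Graph Schema` layout that appears in this notebook's printed output (`graph_name`, `edge_definitions`, `edge_collection`, `from_vertex_collections`, `to_vertex_collections`). The helper name `summarize_edges` and the hand-written `sample_schema` are illustrative and not part of LangChain.

```python
def summarize_edges(schema):
    """Return one 'graph: from -[edge]-> to' line per edge definition."""
    lines = []
    for graph_entry in schema.get("Graph Schema", []):
        name = graph_entry.get("graph_name", "?")
        for edge in graph_entry.get("edge_definitions", []):
            sources = ", ".join(edge.get("from_vertex_collections", []))
            targets = ", ".join(edge.get("to_vertex_collections", []))
            lines.append(f"{name}: {sources} -[{edge['edge_collection']}]-> {targets}")
    return lines


# Hand-written sample mirroring two entries of the printed schema.
sample_schema = {
    "Graph Schema": [
        {
            "graph_name": "BRONGraph",
            "edge_definitions": [
                {
                    "edge_collection": "CapecCapec",
                    "from_vertex_collections": ["capec"],
                    "to_vertex_collections": ["capec"],
                },
                {
                    "edge_collection": "CapecCapec_detection",
                    "from_vertex_collections": ["capec"],
                    "to_vertex_collections": ["capec_detection"],
                },
            ],
        }
    ]
}

for line in summarize_edges(sample_schema):
    print(line)
```

With a live connection, `summarize_edges(graph.schema)` would give the same kind of one-line-per-edge overview, provided the schema follows the layout assumed here.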

In [27]:
# `graph.schema` holds the schema generated when `graph` was instantiated.
# The BRON database already contains data, so the schema is not empty.

import json

print(json.dumps(graph.schema, indent=4))

{
    "Graph Schema": [
        {
            "graph_name": "BRONGraph",
            "edge_definitions": [
                {
                    "edge_collection": "CapecCapec",
                    "from_vertex_collections": [
                        "capec"
                    ],
                    "to_vertex_collections": [
                        "capec"
                    ]
                },
                {
                    "edge_collection": "CapecCapec_detection",
                    "from_vertex_collections": [
                        "capec"
                    ],
                    "to_vertex_collections": [
                        "capec_detection"
                    ]
                },
                {
                    "edge_collection": "CapecCapec_mitigation",
                    "from_vertex_collections": [
                        "capec"
                    ],
                    "to_vertex_collections": [
                        "capec_mitigation"
       

In [28]:
graph.set_schema()

In [29]:
# We can now view the generated schema

import json

print(json.dumps(graph.schema, indent=4))

{
    "Graph Schema": [
        {
            "graph_name": "BRONGraph",
            "edge_definitions": [
                {
                    "edge_collection": "CapecCapec",
                    "from_vertex_collections": [
                        "capec"
                    ],
                    "to_vertex_collections": [
                        "capec"
                    ]
                },
                {
                    "edge_collection": "CapecCapec_detection",
                    "from_vertex_collections": [
                        "capec"
                    ],
                    "to_vertex_collections": [
                        "capec_detection"
                    ]
                },
                {
                    "edge_collection": "CapecCapec_mitigation",
                    "from_vertex_collections": [
                        "capec"
                    ],
                    "to_vertex_collections": [
                        "capec_mitigation"
       

## Querying the ArangoDB Database

We can now use the ArangoDB Graph QA Chain to ask questions about our data.

In [30]:
import os

# Never commit a real key to a notebook; substitute your own key here.
os.environ["OPENAI_API_KEY"] = "your-key-here"
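
Hard-coding a key in a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, using only the standard library: read the key from the environment and prompt for it only if it is missing. The helper name `ensure_api_key` is hypothetical.

```python
import os
from getpass import getpass


def ensure_api_key(var_name="OPENAI_API_KEY", prompt=getpass):
    """Return the key from the environment, prompting only if it is unset."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # getpass hides the typed value so it is not echoed into the cell output.
        key = prompt(f"Enter {var_name}: ").strip()
        if not key:
            raise ValueError(f"{var_name} must not be empty")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` once before constructing the chat model keeps the key out of the saved notebook while still exposing it through `os.environ`.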

In [31]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ArangoGraphQAChain

chain = ArangoGraphQAChain.from_llm(
    ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k"), graph=graph, verbose=True
)

In [32]:
chain.aql_examples = """
# What is the CVSSv2 score and severity for CVE-1999-0002?
LET vulnerabilityKey = "CVE-1999-0002"
FOR v IN CVE_VI
  FILTER v.original_id == vulnerabilityKey
  RETURN {
    cvssv2Score: v.cvssv2.score,
    cvssv2Severity: v.cvssv2.severity
  }

# Find all CVEs with CVSSv2 scores higher than 8.0
FOR v IN CVE_VI
  FILTER v.cvssv2.score > 8.0
  RETURN v

# List all weaknesses associated with CVE-1999-0002
LET vulnerabilityKey = "CVE-1999-0002"
FOR v IN CVE_VI
  FILTER v.original_id == vulnerabilityKey
  FOR w IN v.weaknesses
    RETURN w

# What is the temporal exploitability metric for CVE-1999-0002's CVSSv3?
LET vulnerabilityKey = "CVE-1999-0002"
FOR v IN CVE_VI
  FILTER v.original_id == vulnerabilityKey
  RETURN v.cvssv3.temporalMetrics.exploitability
"""
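
The examples string follows a simple pattern: a `#` comment holding the question, then the AQL query, with a blank line between examples. When the examples live in a list or a file, the string can be assembled programmatically. The sketch below is one way to do that; `format_aql_examples` is a hypothetical helper, and the two pairs are copied from the examples in this notebook.

```python
def format_aql_examples(pairs):
    """Render (question, query) pairs as '# question' followed by the query."""
    blocks = []
    for question, query in pairs:
        blocks.append(f"# {question.strip()}\n{query.strip()}")
    # Leading and trailing newlines mirror the triple-quoted string layout.
    return "\n" + "\n\n".join(blocks) + "\n"


examples = [
    (
        "Find all CVEs with CVSSv2 scores higher than 8.0",
        "FOR v IN CVE_VI\n  FILTER v.cvssv2.score > 8.0\n  RETURN v",
    ),
    (
        "List all weaknesses associated with CVE-1999-0002",
        'LET vulnerabilityKey = "CVE-1999-0002"\n'
        "FOR v IN CVE_VI\n"
        "  FILTER v.original_id == vulnerabilityKey\n"
        "  FOR w IN v.weaknesses\n"
        "    RETURN w",
    ),
]

print(format_aql_examples(examples))
```

The result can then be assigned with `chain.aql_examples = format_aql_examples(examples)`.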

In [33]:
chain.run("Find all CVEs with CVSSv2 scores higher than 9.0")

> Entering new ArangoGraphQAChain chain...
AQL Query (1):
WITH CVE_VI
FOR v IN CVE_VI
  FILTER v.cvssv2.score > 9.0
  RETURN v

AQL Result:
[{'_key': 'cve-1999-0002', '_id': 'CVE_VI/cve-1999-0002', '_rev': '_gy_UGtS---', 'cvssv2': {'version': '2.0', 'score': 10, 'severity': 'HIGH', 'vector': 'AV:N/AC:L/Au:N/C:C/I:C/A:C', 'accessVector': 'NETWORK', 'accessComplexity': 'LOW', 'authentication': 'NONE', 'userInteraction': 'NONE', 'confidentialityImpact': 'COMPLETE', 'integrityImpact': 'COMPLETE', 'availabilityImpact': 'COMPLETE', 'impactScore': 10, 'exploitabilityScore': 10, 'source': 'US-NVD', 'temporalMetrics': {'exploitability': 'Functional', 'remediation_level': 'Official Fix', 'report_confidence': 'Confirmed', 'temporal_vector': 'E:F/RL:OF/RC:C'}}, 'threats': [{'aliases': ['RedHat Linux 5.1 / Caldera OpenLinux Standard 1.2 - Mountd'], 'sources': [{'sourceUrl': 'https://www.exploit-db.com/exploits/19096', 'lastModifiedDate': {'$date': {'$numberLo

KeyboardInterrupt: 

In [34]:
chain.run("What can you tell me about the CVSSv2 scores and severities of the top 10 CVEs with the highest scores?")

> Entering new ArangoGraphQAChain chain...
AQL Query (1):
WITH CVE_VI
FOR v IN CVE_VI
  SORT v.cvssv2.score DESC
  LIMIT 10
  RETURN {
    original_id: v.original_id,
    cvssv2Score: v.cvssv2.score,
    cvssv2Severity: v.cvssv2.severity
  }

AQL Result:
[{'original_id': None, 'cvssv2Score': 10, 'cvssv2Severity': 'HIGH'}, {'original_id': None, 'cvssv2Score': 10, 'cvssv2Severity': 'HIGH'}, {'original_id': None, 'cvssv2Score': 10, 'cvssv2Severity': 'HIGH'}, {'original_id': None, 'cvssv2Score': 10, 'cvssv2Severity': 'HIGH'}, {'original_id': None, 'cvssv2Score': 10, 'cvssv2Severity': 'HIGH'}, {'original_id': None, 'cvssv2Score': 10, 'cvssv2Severity': 'HIGH'}, {'original_id': None, 'cvssv2Score': 10, 'cvssv2Severity': 'HIGH'}, {'original_id': None, 'cvssv2Score': 10, 'cvssv2Severity': 'HIGH'}, {'original_id': None, 'cvssv2Score': 10, 'cvssv2Severity': 'HIGH'}, {'original_id': None, 'cvssv2Score': 10, 'cvssv2Severity': 'HIGH'}]

> Finished chain.

'The top 10 CVEs with the highest CVSSv2 scores have a score of 10 and a severity level of HIGH.'

In [35]:
chain.run("What can you tell me about how many CVEs have a CVSSv2 severity of 'CRITICAL'?")

> Entering new ArangoGraphQAChain chain...
AQL Query (1):
WITH CVE_VI
FOR v IN CVE_VI
  FILTER v.cvssv2.severity == 'CRITICAL'
  COLLECT WITH COUNT INTO count
  RETURN count

AQL Result:
[0]

> Finished chain.


"There are no CVEs with a CVSSv2 severity of 'CRITICAL'."

In [None]:
chain.run("What can you tell me about how many CVEs are associated with the 'Remote Code Execution (RCE)' attack classification?")

In [None]:
chain.run("What can you tell me about the details of the CVE with the highest CVSSv2 score?")

In [None]:
chain.run("What can you tell me about the CVEs that have known exploits and provide references to those exploits?")

In [None]:
chain.run("What can you tell me about the list of all CVEs with weaknesses of type 'Improper Restriction of Operations within the Bounds of a Memory Buffer (CWE-119)'?")

In [None]:
chain.run("What can you tell me about the temporal exploitability metric for CVEs with CVSSv3 scores above 9.0?")

In [None]:
chain.run("What can you tell me about all CVEs that have a CVSSv3 score above 8.0?")

In [None]:
chain.run("What can you tell me about a list of all CVEs and their corresponding CVSSv2 vectors?")

## Chain Modifiers

You can alter the values of the following `ArangoGraphQAChain` class variables to modify the behaviour of your chain results:


In [None]:
# Specify the maximum number of AQL Query Results to return
chain.top_k = 10

# Specify whether or not to return the AQL Query in the output dictionary
chain.return_aql_query = True

# Specify whether or not to return the AQL JSON Result in the output dictionary
chain.return_aql_result = True

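# Specify the maximum number of AQL Generation attempts that should be made
chain.max_aql_generation_attempts = 5

# Specify a set of AQL Query Examples, which are passed to
# the AQL Generation Prompt Template to promote few-shot learning.
# Defaults to an empty string.
# Note: assigning here replaces the CVE examples set earlier in this notebook.
# The examples below reference `Characters` and `ChildOf` collections,
# so they only apply to a database that contains them.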
chain.aql_examples = """
# Is Ned Stark alive?
RETURN DOCUMENT('Characters/NedStark').alive

# Is Arya Stark the child of Ned Stark?
FOR e IN ChildOf
    FILTER e._from == "Characters/AryaStark" AND e._to == "Characters/NedStark"
    RETURN e
"""
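
The role of `max_aql_generation_attempts` can be pictured as a bounded generate-and-retry loop. The following is a simplified illustration of that idea, not LangChain's actual implementation: `generate`, `execute`, and `fix` are hypothetical callables standing in for the LLM calls and the database call.

```python
def run_with_retries(question, generate, execute, fix, max_attempts=5):
    """Generate a query, execute it, and ask for a fix after each failure."""
    query = generate(question)
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt, execute(query)
        except Exception as error:  # a real chain would catch driver-specific errors
            if attempt == max_attempts:
                raise ValueError(f"No valid query after {max_attempts} attempts") from error
            # Feed the failing query and its error message back for correction.
            query = fix(query, str(error))


# Toy stand-ins: the first query has a typo that the "fix" step corrects.
def fake_generate(question):
    return "RETRUN 1"


def fake_execute(query):
    if not query.startswith("RETURN"):
        raise RuntimeError("syntax error near 'RETRUN'")
    return [1]


def fake_fix(query, error_message):
    return query.replace("RETRUN", "RETURN")


print(run_with_retries("Return one", fake_generate, fake_execute, fake_fix))  # prints (2, [1])
```

A higher limit gives the model more chances to repair a failing query, at the cost of additional LLM calls per question.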

In [None]:
chain.run("Is Ned Stark alive?")

# chain("Is Ned Stark alive?") # Returns a dictionary with the AQL Query & AQL Result
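
When `return_aql_query` and `return_aql_result` are enabled, calling the chain directly returns a dictionary rather than a string. The sketch below shows one way to unpack it. The key names `result`, `aql_query`, and `aql_result` are assumptions based on the option names above and should be checked against the dictionary your installed LangChain version returns. The `response` literal is a hand-written stand-in built from the 'CRITICAL' severity cell earlier in this notebook, not captured chain output.

```python
def describe_response(response):
    """Format a chain response dictionary, tolerating missing optional keys."""
    parts = [f"Answer: {response.get('result', '<missing>')}"]
    if "aql_query" in response:
        parts.append(f"AQL query:\n{response['aql_query'].strip()}")
    if "aql_result" in response:
        parts.append(f"Rows returned: {len(response['aql_result'])}")
    return "\n".join(parts)


# Stand-in for the dictionary returned by `chain(question)` (assumed key names).
response = {
    "result": "There are no CVEs with a CVSSv2 severity of 'CRITICAL'.",
    "aql_query": (
        "WITH CVE_VI\n"
        "FOR v IN CVE_VI\n"
        "  FILTER v.cvssv2.severity == 'CRITICAL'\n"
        "  COLLECT WITH COUNT INTO count\n"
        "  RETURN count"
    ),
    "aql_result": [0],
}

print(describe_response(response))
```

Using `.get` and membership checks keeps the helper usable even when one of the two `return_aql_*` options is switched off.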

In [None]:
chain.run("Is Bran Stark the child of Ned Stark?")