## Knowledge Graphs

Knowledge graphs are a graph-based form of knowledge representation that models and stores interlinked information in a format that both humans and machines can interpret. A graph consists of *nodes*, which represent entities, and *edges*, which represent the relationships between them. Compared with a fixed-schema relational database, a graph expresses relationships directly, which allows for richer semantics, and it can accommodate new entity types and relationships without being constrained by a predefined schema.

By combining knowledge graphs with embeddings (vector search), we can leverage *multi-hop connectivity* and *contextual understanding of information* to enhance querying, reasoning, and explainability in LLMs. This notebook explores the practical implementation of this approach, demonstrating how to (i) build a knowledge graph of academic publications, and (ii) extract actionable insights from it.

<p align="center">
  <img src="./static/knowledge-graphs.png">
</p>

### 1. Knowledge Graph Initialization

We will create our Knowledge Graph using [Neo4j](https://neo4j.com/), an open-source database management system that specializes in graph database technology.

In [None]:
%pip install neo4j langchain langchain_openai langchain-community python-dotenv --quiet | tail -n 1

#### 1.1 Setting Up a Neo4j Instance

For a quick and easy setup, you can start a free instance on [Neo4j Aura](https://neo4j.com/product/auradb/). 

In [None]:
import dotenv
dotenv.load_dotenv('.env', override=True)

In [None]:
import os
from langchain_community.graphs import Neo4jGraph

graph = Neo4jGraph(
    url=os.environ['NEO4J_URI'], 
    username=os.environ['NEO4J_USERNAME'],
    password=os.environ['NEO4J_PASSWORD'],
)

#### 1.2 Loading Dataset into a Graph

The example below creates a connection to our Neo4j database and populates it with synthetic data about research articles and their authors.

The entities are: 
- *Researcher*
- *Article*
- *Topic*

The relationships are:
- *Researcher* --[PUBLISHED]--> *Article*
- *Article* --[IN_TOPIC]--> *Topic*



In [None]:
from langchain_community.graphs import Neo4jGraph

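# With no arguments, Neo4jGraph reads NEO4J_URI, NEO4J_USERNAME and
# NEO4J_PASSWORD from the environment (loaded from .env above).
graph = Neo4jGraph()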

q_load_articles = """
LOAD CSV WITH HEADERS
FROM 'https://raw.githubusercontent.com/dcarpintero/generative-ai-101/main/dataset/synthetic_articles.csv' 
AS row 
FIELDTERMINATOR ';'
MERGE (a:Article {title:row.Title})
SET a.abstract = row.Abstract,
    a.publication_date = date(row.Publication_Date)
FOREACH (researcher in split(row.Authors, ',') | 
    MERGE (p:Researcher {name:trim(researcher)})
    MERGE (p)-[:PUBLISHED]->(a))
FOREACH (topic in [row.Topic] | 
    MERGE (t:Topic {name:trim(topic)})
    MERGE (a)-[:IN_TOPIC]->(t))
"""

graph.query(q_load_articles)
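
To make the effect of the `LOAD CSV` statement easier to follow, here is a minimal in-memory sketch of the same steps in plain Python. It does not touch Neo4j. The column names mirror those used in the Cypher query, while the two sample rows, the `load_rows` helper, and the placeholder names are invented purely for illustration.

```python
import csv
import io

# Made-up rows for illustration only; this is NOT the real dataset.
# The header uses the same column names as the LOAD CSV query.
SAMPLE_CSV = """Title;Abstract;Publication_Date;Authors;Topic
Paper 1;First abstract;2024-01-15;Researcher A, Researcher B;Topic X
Paper 2;Second abstract;2023-06-01;Researcher B;Topic X
"""


def load_rows(text):
    """Hypothetical helper that mimics the MERGE / FOREACH logic in memory."""
    articles, researchers, topics = {}, set(), set()
    published, in_topic = set(), set()
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        title = row["Title"]
        # MERGE (a:Article {title: ...}) followed by SET on its properties
        articles[title] = {
            "abstract": row["Abstract"],
            "publication_date": row["Publication_Date"],
        }
        # FOREACH over split(row.Authors, ','), trimming each name
        for name in row["Authors"].split(","):
            name = name.strip()
            researchers.add(name)            # sets deduplicate, like MERGE
            published.add((name, title))     # (Researcher)-[:PUBLISHED]->(Article)
        topic = row["Topic"].strip()
        topics.add(topic)
        in_topic.add((title, topic))         # (Article)-[:IN_TOPIC]->(Topic)
    return articles, researchers, topics, published, in_topic


articles, researchers, topics, published, in_topic = load_rows(SAMPLE_CSV)
print(sorted(researchers))
print(sorted(published))
```

The sets play the role of `MERGE`: a researcher or topic that appears in several rows ends up as a single node, with one relationship per article.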

In [None]:
print(graph.get_schema)

#### 1.3 Build Vector Index

We implement a vector index to efficiently search for relevant articles based on their *topic, title, and abstract*. This process involves calculating the embeddings for each article using these fields. At query time, the system finds the most similar articles to the user's input by employing a similarity metric, such as cosine distance.


In [None]:
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings

vector_index = Neo4jVector.from_existing_graph(
    OpenAIEmbeddings(),
    url=os.environ['NEO4J_URI'],
    username=os.environ['NEO4J_USERNAME'],
    password=os.environ['NEO4J_PASSWORD'],
    index_name='articles',
    node_label="Article",
    text_node_properties=['topic', 'title', 'abstract'],
    embedding_node_property='embedding',
)
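
As a rough illustration of the similarity ranking described above, the following sketch computes cosine similarity (one minus cosine distance) over toy vectors in plain Python. The three-dimensional vectors and the `article_*` keys are hypothetical stand-ins for real embeddings, which typically have hundreds or thousands of dimensions, and the ranking inside Neo4j is performed by the database index rather than by code like this.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for article embeddings (hypothetical values).
article_vectors = {
    "article_1": [1.0, 0.0, 0.0],
    "article_2": [0.6, 0.8, 0.0],
    "article_3": [0.0, 0.0, 1.0],
}
# Toy vector standing in for the embedded user query.
query_vector = [1.0, 0.1, 0.0]

# Rank articles from most to least similar to the query.
ranked = sorted(
    article_vectors,
    key=lambda key: cosine_similarity(query_vector, article_vectors[key]),
    reverse=True,
)
print(ranked)
```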

### 2. Graph Cypher Chain

LangChain provides a wrapper around the Neo4j graph database that generates Cypher statements from user input and uses them to retrieve relevant information from the database.

In [None]:
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI

graph.refresh_schema()

cypher_chain = GraphCypherQAChain.from_llm(
    cypher_llm = ChatOpenAI(temperature=0, model_name='gpt-4o'),
    qa_llm = ChatOpenAI(temperature=0, model_name='gpt-4o'), 
    graph=graph,
    verbose=True,
)

### 3. Inference by Traversing Knowledge Graphs

Knowledge graphs excel in their ability to query and navigate the connections between entities, allowing for the retrieval of pertinent information and the discovery of new insights.

#### 3.1 Sample 1

In this example, our question 'How many articles has published Emma Wilson' will be translated into the Cypher query:

```
MATCH (r:Researcher {name: "Emma Wilson"})-[:PUBLISHED]->(a:Article)
RETURN COUNT(a) AS numberOfArticles
```

which matches nodes labeled `Researcher` with the name 'Emma Wilson' and traverses the `PUBLISHED` relationships to `Article` nodes.
It then counts the number of `Article` nodes connected to 'Emma Wilson':

In [None]:
# the answer should be '5'
cypher_chain.invoke(
    {"query": "How many articles has published Emma Wilson?"}
)

#### 3.2 Sample 2

In this example, the query 'are there any pair of researchers who have published more than one article together?' results in the Cypher query:

```
MATCH (r1:Researcher)-[:PUBLISHED]->(a:Article)<-[:PUBLISHED]-(r2:Researcher)
WHERE r1 <> r2
WITH r1, r2, COUNT(a) AS sharedArticles
WHERE sharedArticles > 1
RETURN r1.name, r2.name, sharedArticles
```

which traverses the `PUBLISHED` relationships from `Researcher` nodes to their connected `Article` nodes, and then traverses back to find pairs of `Researcher` nodes that share more than one article.

In [None]:
# the answer should be Alice Johnson and David Miller, Alexander Lee and David Miller, Olivia Taylor and Alexander Lee, and David Miller and Alice Johnson
cypher_chain.invoke(
    {"query": "are there any pair of researchers who have published more than one article together?"}
)
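
Note that the condition `r1 <> r2` matches every pair in both orders, which is why the expected answer above lists Alice Johnson and David Miller twice. The sketch below reproduces the pair counting in plain Python on made-up data and counts each unordered pair once. The `authorship` mapping and the placeholder names are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from itertools import combinations

# Made-up authorship data (article -> authors); NOT the real dataset.
authorship = {
    "Paper 1": ["Researcher A", "Researcher B"],
    "Paper 2": ["Researcher B", "Researcher A", "Researcher C"],
    "Paper 3": ["Researcher C", "Researcher D"],
}

shared = Counter()
for authors in authorship.values():
    # sorted() gives every pair one canonical order, so (A, B) and (B, A)
    # are counted as the same pair rather than as two separate rows.
    for pair in combinations(sorted(set(authors)), 2):
        shared[pair] += 1

# Keep only pairs with more than one shared article.
repeat_pairs = {pair: count for pair, count in shared.items() if count > 1}
print(repeat_pairs)
```

In Cypher, a comparable deduplication could presumably be obtained by replacing `r1 <> r2` with an ordering condition such as `r1.name < r2.name`, assuming researcher names are unique.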

#### 3.3 Sample 3

It appears David Miller has collaborated with many peers. Let's find out if he is the researcher with the most peer collaborations.
Our query 'which researcher has collaborated with the most peers?' now results in the Cypher query:

```
MATCH (r:Researcher)-[:PUBLISHED]->(:Article)<-[:PUBLISHED]-(peer:Researcher)
WITH r, COUNT(DISTINCT peer) AS peerCount
RETURN r.name AS researcher, peerCount
ORDER BY peerCount DESC
LIMIT 1
```

Here, we need to start from all `Researcher` nodes and traverse their `PUBLISHED` relationships to find connected `Article` nodes. For each `Article` node, Neo4j then traverses back to find the other `Researcher` nodes (peers) who have also published the same article.

In [None]:
# the answer should be 'David Miller' with 5
cypher_chain.invoke(
    {"query": "Which researcher has collaborated with the most peers?"}
)
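
The role of `COUNT(DISTINCT peer)` can be illustrated with a small plain-Python sketch on made-up data: two researchers who share several articles should still count as a single peer of each other. The `authorship` mapping and the placeholder names below are hypothetical.

```python
from collections import defaultdict

# Made-up authorship data (article -> authors); NOT the real dataset.
authorship = {
    "Paper 1": ["Researcher A", "Researcher B"],
    "Paper 2": ["Researcher A", "Researcher C", "Researcher D"],
    "Paper 3": ["Researcher C", "Researcher D"],
}

# Map each researcher to the set of distinct co-authors.
peers = defaultdict(set)
for authors in authorship.values():
    for researcher in authors:
        # A set ignores repeats, mirroring COUNT(DISTINCT peer).
        peers[researcher].update(a for a in authors if a != researcher)

# Equivalent of ORDER BY peerCount DESC LIMIT 1.
top = max(peers, key=lambda name: len(peers[name]))
print(top, len(peers[top]))
```

Here Researcher C and Researcher D co-author two papers, yet each contributes only one entry to the other's peer set.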

#### 3.4 More Samples

In [None]:
# the answer should be 'David Miller and Alice Johnson'
cypher_chain.invoke(
    {"query": "Who wrote the article 'Language Model Compression for Mobile Devices'?"}
)

In [None]:
# the answer should be '2024'
cypher_chain.invoke(
    {"query": "In which year were the most articles published?"}
)

In [None]:
# the answer should be Bob Smith, David Miller, Sophia Martinez, and John Robinson
cypher_chain.invoke(
    {"query": "Which researchers have worked with Emma Wilson?"}
)