### Creating Knowledge Graphs from Textual Data: Finding Hidden Connections

Knowledge graphs have emerged as a powerful way to visualize and understand relationships between different pieces of information, transforming unstructured text into a structured network of entities and their relationships. We will guide you through a simple workflow for creating a knowledge graph from textual data, making complex information more accessible and easier to understand.

Here’s what we are going to do in this project:

![image](./knowledge-graph-data-pipeline.jpg)

*Our pipeline for building a knowledge graph from textual data.*

Before creating a knowledge graph, it is essential to understand the difference between **knowledge graphs** and **knowledge bases**, as these terms are often mistakenly used interchangeably.

A **Knowledge Base (KB)** is a collection of structured information about a specific domain. A **Knowledge Graph** is a form of **Knowledge Base** organized as a graph. In a **Knowledge Graph**, `nodes` represent *entities*, and `edges` represent the *relationships between these entities*. For instance, from the sentence “Fabio lives in Italy,” we can derive the relationship triplet `<Fabio, lives in, Italy>`, where “Fabio” and “Italy” are the entities, and “lives in” represents their connection.
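
As a minimal sketch of the node and edge idea, the triplet above can be stored in plain Python structures. The second triplet and the variable names are illustrative assumptions, not part of the project code.

```python
from collections import defaultdict

# Each triple is (subject, predicate, object); the second one is an
# illustrative addition so the graph has more than one edge.
triples = [
    ("Fabio", "lives in", "Italy"),
    ("Italy", "is in", "Europe"),
]

# Nodes are the distinct entities that appear as a subject or an object.
nodes = {entity for s, _, o in triples for entity in (s, o)}

# Edges are kept as an adjacency list: subject -> [(predicate, object), ...].
edges = defaultdict(list)
for subject, predicate, obj in triples:
    edges[subject].append((predicate, obj))

print(sorted(nodes))
print(dict(edges))
```

A graph library would normally replace the dictionary, but the shape of the data (entities as nodes, labeled relations as edges) stays the same.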

In other words, a **knowledge graph** is always a kind of **knowledge base**, but a knowledge base is not necessarily organized as a graph.

Building a knowledge graph generally involves two main steps:

1.  **Named Entity Recognition (NER):**  This step focuses on identifying and extracting entities from the text, which will serve as the nodes in the knowledge graph.
2.  **Relation Classification (RC):**  This step focuses on identifying and classifying the relationships between the extracted entities, forming the edges of the knowledge graph.
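
The two steps can be sketched with a deliberately simple, rule-based stand-in. The entity list, the relation phrases, and both function names are hypothetical. A real pipeline would use trained models (or, as later in this project, a language model prompt) for each step.

```python
# Hypothetical gazetteer: a real NER step would use a trained model instead.
KNOWN_ENTITIES = ["Fabio", "Italy"]

# Hypothetical mapping from a trigger phrase to a relation label.
RELATION_PATTERNS = {"lives in": "lives in", "was born in": "born in"}


def recognize_entities(sentence):
    """Step 1 (NER): return known entities in order of appearance."""
    found = [(sentence.find(e), e) for e in KNOWN_ENTITIES if e in sentence]
    return [entity for _, entity in sorted(found)]


def classify_relation(sentence, head, tail):
    """Step 2 (RC): label the relation from the text between two entities."""
    start = sentence.find(head) + len(head)
    between = sentence[start:sentence.find(tail)]
    for phrase, label in RELATION_PATTERNS.items():
        if phrase in between:
            return label
    return None


sentence = "Fabio lives in Italy."
entities = recognize_entities(sentence)          # nodes
head, tail = entities[0], entities[1]
triple = (head, classify_relation(sentence, head, tail), tail)  # edge
print(triple)
```

The point of the sketch is the division of labor: the first function only produces nodes, and the second only decides which edge, if any, connects a given pair.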

The **knowledge graph** is often visualized using tools like **pyvis**.

To enhance the process of creating a **knowledge graph** from text, additional steps can be integrated, such as:

-   **Entity Linking:**  This step helps to normalize different mentions of the same entity. For example, “Napoleon” and “Napoleon Bonaparte” would be linked to a common reference, such as their Wikipedia page.
-   **Source Tracking:**  This involves recording the origin of each piece of information, like the URL of the article or the specific text fragment it came from. Tracking sources helps assess the information’s credibility (for example, a relationship is considered more reliable if multiple reputable sources verify it).
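
A minimal sketch of both ideas follows, assuming a hand-written alias table and placeholder URLs. Real entity linking would resolve mentions against an external reference such as Wikipedia rather than a hard-coded dictionary.

```python
from collections import defaultdict

# Hypothetical alias table: maps surface mentions to one canonical name.
ALIASES = {
    "Napoleon": "Napoleon Bonaparte",
    "Napoleon Bonaparte": "Napoleon Bonaparte",
}


def link_entity(mention):
    """Return the canonical name for a mention, or the mention itself if unknown."""
    mention = mention.strip()
    return ALIASES.get(mention, mention)


# Each extracted triple is paired with the source it came from (placeholder URLs).
extracted = [
    (("Napoleon", "was born in", "Corsica"), "https://example.com/article-1"),
    (("Napoleon Bonaparte", "was born in", "Corsica"), "https://example.com/article-2"),
]

# Map each normalized triple to the set of sources that support it.
sources_by_triple = defaultdict(set)
for (subject, predicate, obj), source in extracted:
    key = (link_entity(subject), predicate, link_entity(obj))
    sources_by_triple[key].add(source)

for triple, sources in sources_by_triple.items():
    print(triple, "supported by", len(sources), "source(s)")
```

Because both mentions resolve to the same canonical name, the two extractions collapse into a single relationship backed by two sources. That count is the kind of signal the credibility check described above relies on.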

In this project, we will perform **Named Entity Recognition** and **Relation Classification** in a single step, using one carefully designed prompt. This combined approach is often referred to as **Relation Extraction (RE)**.

## Building a Knowledge Graph with LangChain

To illustrate the use of prompts for relation extraction in LangChain, let’s use the `KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT` variable as the prompt. This prompt is specifically designed to extract knowledge triples (subject, predicate, and object) from a given text.

In LangChain, this prompt is used by the `ConversationKGMemory` class. This class lets chatbots remember previous conversations by storing the relations extracted from these messages.

The `KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT` variable is an instance of the `PromptTemplate` class, taking `text` as an input variable. The template itself is a string that includes several examples and directives for the language model, guiding it to extract knowledge triples from the input text.

```python
import os
from langchain_custom_utils.helper import get_openai_api_key, print_response

# Load the OpenAI API key through the project's helper module
OPENAI_API_KEY = get_openai_api_key()
```

```python
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.graphs.networkx_graph import KG_TRIPLE_DELIMITER

# Prompt template for knowledge triple extraction
_DEFAULT_KNOWLEDGE_TRIPLE_EXTRACTION_TEMPLATE = (
    "You are a networked intelligence helping a human track knowledge triples"
    " about all relevant people, things, concepts, etc. and integrating"
    " them with your knowledge stored within your weights"
    " as well as that stored in a knowledge graph."
    " Extract all of the knowledge triples from the text."
    " A knowledge triple is a clause that contains a subject, a predicate,"
    " and an object. The subject is the entity being described,"
    " the predicate is the property of the subject that is being"
    " described, and the object is the value of the property.\n\n"
    "EXAMPLE\n"
    "It's a state in the US. It's also the number 1 producer of gold in the US.\n\n"
    f"Output: (Nevada, is a, state){KG_TRIPLE_DELIMITER}(Nevada, is in, US)"
    f"{KG_TRIPLE_DELIMITER}(Nevada, is the number 1 producer of, gold)\n"
    "END OF EXAMPLE\n\n"
    "EXAMPLE\n"
    "I'm going to the store.\n\n"
    "Output: NONE\n"
    "END OF EXAMPLE\n\n"
    "EXAMPLE\n"
    "Oh huh. I know Descartes likes to drive antique scooters and play the mandolin.\n"
    f"Output: (Descartes, likes to drive, antique scooters){KG_TRIPLE_DELIMITER}(Descartes, plays, mandolin)\n"
    "END OF EXAMPLE\n\n"
    "EXAMPLE\n"
    # Newline keeps the input text from running directly into "Output:"
    "{text}\n"
    "Output:"
)

KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT = PromptTemplate(
    input_variables=["text"],
    template=_DEFAULT_KNOWLEDGE_TRIPLE_EXTRACTION_TEMPLATE,
)

# Instantiate the OpenAI model
llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0.0)

# Create an LLMChain using the knowledge triple extraction prompt
chain = LLMChain(llm=llm, prompt=KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT)

# Run the chain with the specified text
text = "The city of Paris is the capital and most populous city of France. The Eiffel Tower is a famous landmark in Paris."
triples = chain.run(text)

print(triples)
```

In the previous code, we used the prompt to extract relation triplets from text using **few-shot** examples. The model returns them as a single delimited string, so we next parse the response and collect the triplets into a list, `triples_list`:

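```python
def parse_triples(response, delimiter=KG_TRIPLE_DELIMITER):
    """Split the raw model response into a list of triple strings."""
    # The prompt tells the model to answer "NONE" when there is nothing to extract
    if not response or response.strip() == "NONE":
        return []
    # Strip surrounding whitespace and drop empty fragments
    return [triple.strip() for triple in response.split(delimiter) if triple.strip()]

triples_list = parse_triples(triples)

# Print the extracted relation triplets
print(triples_list)
```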
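
Each element of `triples_list` is still a raw string such as `(Paris, is the capital of, France)`. One possible way to turn these strings into structured `(subject, predicate, object)` tuples is sketched below. The `to_tuple` helper and the sample strings are assumptions for illustration, and actual model output may differ.

```python
def to_tuple(raw_triple):
    """Convert '(subject, predicate, object)' into a 3-tuple, or None if malformed."""
    cleaned = raw_triple.strip()
    if not (cleaned.startswith("(") and cleaned.endswith(")")):
        return None
    # Drop the parentheses, then split on commas and trim each part
    parts = [part.strip() for part in cleaned[1:-1].split(",")]
    if len(parts) != 3 or not all(parts):
        return None
    return tuple(parts)


# Sample strings standing in for model output
raw = [
    " (Paris, is the capital of, France)",
    "(Eiffel Tower, is a landmark in, Paris)\n",
    "NONE",
]
structured = [t for t in map(to_tuple, raw) if t is not None]
print(structured)
```

Splitting on commas is a simplification: a triple whose subject or object itself contains a comma would be rejected as malformed, so a more careful parser may be needed for such text.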