<a href="https://colab.research.google.com/github/champ-byte/Multimodal_GraphRAG/blob/Kavya-Zala/Entity_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gemini Model

In [None]:
%pip install --upgrade langchain langchain-google-genai



In [None]:
import os
from google.colab import userdata
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")

In [None]:
pr = """
## 1. Overview
You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph. You are a data scientist working for a company that is building a graph database. Your task is to extract information from the data and convert it into triples of the form Source_Node, Relationship, Target_Node. The first and third elements are used as nodes and the second element is the relationship between them. Keep spelling and capitalization consistent so that a node with a given name is created only once.
- **Nodes** represent entities and concepts. They're akin to Wikipedia nodes.
- The aim is to achieve simplicity and clarity in the knowledge graph, making it accessible for a vast audience.
## 2. Labeling Nodes
- **Consistency**: Ensure you use basic or elementary types for node labels.
  - For example, when you identify an entity representing a person, always label it as **"person"**. Avoid using more specific terms like "mathematician" or "scientist".
- **Node IDs**: Never utilize integers as node IDs. Node IDs should be names or human-readable identifiers found in the text.
## 3. Handling Numerical Data and Dates
- Numerical data, like age or other related information, should be incorporated as attributes or properties of the respective nodes.
- **No Separate Nodes for Dates/Numbers**: Do not create separate nodes for dates or numerical values. Always attach them as attributes or properties of nodes.
- **Property Format**: Properties must be in a key-value format.
- **Quotation Marks**: Never use escaped single or double quotes within property values.
- **Naming Convention**: Use camelCase for property keys, e.g., `birthDate`.
## 4. Coreference Resolution
- **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.
If an entity, such as "John Doe", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., "Joe", "he"),
always use the most complete identifier for that entity throughout the knowledge graph. In this example, use "John Doe" as the entity ID.
Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial.
## 5. Strict Compliance
Adhere to the rules strictly. Non-compliance will result in termination.

Example:
Data: Alice is a lawyer and is 25 years old and Bob is her roommate since 2001. Bob works as a journalist. Alice owns the webpage www.alice.com and Bob owns the webpage www.bob.com.

List created is given below. Only return this list and nothing else. No extra word or sentence.
['Alice~~is~~Lawyer', 'Alice~~age~~25years', 'Alice~~roommate~~Bob', 'Bob~~is~~Journalist', 'Alice~~owns~~www.alice.com', 'Bob~~owns~~www.bob.com']

Here is the question data: {user_input}
"""

In [None]:
from langchain_core.prompts import PromptTemplate

pr_template = PromptTemplate.from_template(pr)

In [None]:
user_input = [
    "Sarah is a teacher and lives in London.",
    "She is 32 years old.",
    "Sarah owns the website www.sarahteaches.com.",
    "Tom is her brother and works as a chef.",
    "Tom has been living in Paris since 2015.",
    "Sarah and Tom co-founded a nonprofit in 2020.",
    "The nonprofit is called LearnTogether.",
    "LearnTogether focuses on education and community outreach."
]

In [None]:
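# Join the sentences into one passage so the prompt receives plain text
# rather than the repr of a Python list.
formatted_prompt = pr_template.format(user_input=" ".join(user_input))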

In [None]:
response = llm.invoke(formatted_prompt)
print(response.content)

['Sarah~~is~~Teacher', 'Sarah~~livesIn~~London', 'Sarah~~age~~32years', 'Sarah~~owns~~www.sarahteaches.com', 'Tom~~is~~Brother', 'Tom~~worksAs~~Chef', 'Tom~~livesIn~~Paris', 'Sarah~~coFounded~~LearnTogether', 'Tom~~coFounded~~LearnTogether', 'LearnTogether~~focusesOn~~Education', 'LearnTogether~~focusesOn~~CommunityOutreach']
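
The response above is a plain string that happens to look like a Python list, so it still has to be parsed before the triples can be loaded into a graph. Below is a minimal sketch of one way to do that, assuming the model keeps to the `Source~~Relationship~~Target` format requested in the prompt. The helper name `parse_triples` is hypothetical and not part of the notebook.

```python
import ast

def parse_triples(raw, sep="~~"):
    """Turn a string like "['A~~rel~~B', ...]" into (source, relationship, target) tuples."""
    # Keep only the bracketed list in case the model adds text around it.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("no list found in model output")
    items = ast.literal_eval(raw[start:end + 1])

    triples = []
    for item in items:
        parts = [p.strip() for p in str(item).split(sep)]
        # Skip malformed entries instead of guessing at their meaning.
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples

sample = "['Sarah~~is~~Teacher', 'Sarah~~livesIn~~London', 'Sarah~~age~~32years']"
triples = parse_triples(sample)
print(triples)
```

In the notebook this would be called as `parse_triples(response.content)`. `ast.literal_eval` is used instead of `eval` so that model output is never executed as code.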


# Hugging Face models

In [None]:
!pip install langchain-huggingface
## For API Calls
!pip install huggingface_hub
!pip install langchain
!pip install transformers



In [None]:
## Environment secret keys
from google.colab import userdata
sec_key=userdata.get("HF_TOKEN")

In [None]:
from langchain_huggingface import HuggingFaceEndpoint

In [None]:
from google.colab import userdata
# This overwrites the value read from HF_TOKEN above; the cells below use this secret.
sec_key = userdata.get("HUGGINGFACEHUB_API_TOKEN")

In [None]:
import os
os.environ["HUGGINGFACEHUB_API_TOKEN"]=sec_key

In [None]:
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    api_key=sec_key,
)

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528",
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ],
)

print(completion.choices[0].message)


ChatCompletionOutputMessage(role='assistant', content="<think>\nOkay, the user is asking about the capital of France. That's a straightforward geography question. \n\nHmm, I recall that Paris is one of the most famous capitals in the world, so this should be easy. But let me double-check just to be thorough—yes, every reliable source confirms it's Paris. \n\nThe user might be a student doing homework, a traveler planning a trip, or just someone curious. Since they didn't provide context, I'll keep the answer simple and factual. No need to overcomplicate it. \n\nI'll add a bit about Paris being a cultural hub too—Eiffel Tower, Louvre, etc.—to make the response more helpful. If they wanted deeper details, they'd probably ask a follow-up. \n\n...And done. Short, accurate, with a touch of extra context. Perfect.\n</think>\nThe capital of France is **Paris**.\n\nParis is not only the political center of France but also a global hub for art, fashion, gastronomy, and culture. It's home to ico
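
As the output above shows, DeepSeek-R1 returns its reasoning inside a `<think>...</think>` block ahead of the actual answer. If only the answer is wanted (for example, before parsing triples out of it), the block can be removed first. This is a minimal sketch assuming the tags appear exactly as in the output above; `strip_reasoning` is a hypothetical helper name.

```python
import re

def strip_reasoning(text):
    """Remove any <think>...</think> block and surrounding whitespace."""
    # DOTALL lets '.' match newlines, since the reasoning spans several lines.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

demo = "<think>\nThe user asks about France.\n</think>\nThe capital of France is **Paris**."
answer = strip_reasoning(demo)
print(answer)
```

In the notebook this would be applied as `strip_reasoning(completion.choices[0].message.content)`.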

In [None]:
from transformers import pipeline

triplet_extractor = pipeline(
    "text2text-generation",
    model="Babelscape/rebel-large",
    tokenizer="Babelscape/rebel-large"
)

Device set to use cpu


In [None]:
text = "Alice is a lawyer and is 25 years old and Bob is her roommate since 2001. Bob works as a journalist. Alice owns a the webpage www.alice.com and Bob owns the webpage www.bob.com."
output = triplet_extractor(text)[0]['generated_text']
print(output)

 Alice  Bob  spouse  Bob  spouse  Bob  Alice  spouse  Alice  spouse  Alice  Bob  spouse  Bob  spouse  Bob  Alice  spouse  Alice  spouse


In [None]:
from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')
# Decode the generated token ids ourselves: the pipeline's 'generated_text' drops
# the special tokens (<triplet>, <subj>, <obj>) that mark the triplet boundaries.
text = "Alice is a lawyer and is 25 years old and Bob is her roommate since 2001. Bob works as a journalist. Alice owns a the webpage www.alice.com and Bob owns the webpage www.bob.com."
generated = triplet_extractor(text, return_tensors=True, return_text=False)
extracted_text = triplet_extractor.tokenizer.batch_decode([generated[0]["generated_token_ids"]])
print(extracted_text[0])
# Function to parse the generated text and extract the triplets
def extract_triplets(text):
    # REBEL linearizes each triplet as: <triplet> head <subj> tail <obj> relation
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets
extracted_triplets = extract_triplets(extracted_text[0])
print(extracted_triplets)


Device set to use cpu


<s><triplet> Alice <subj> Bob <obj> spouse <subj> Bob <obj> spouse <triplet> Bob <subj> Alice <obj> spouse <subj> Alice <obj> spouse <triplet> Alice <subj> Bob <obj> spouse <subj> Bob <obj> spouse <triplet> Bob <subj> Alice <obj> spouse <subj> Alice <obj> spouse</s>
[{'head': 'Alice', 'type': 'spouse', 'tail': 'Bob'}, {'head': 'Alice', 'type': 'spouse', 'tail': 'Bob'}, {'head': 'Bob', 'type': 'spouse', 'tail': 'Alice'}, {'head': 'Bob', 'type': 'spouse', 'tail': 'Alice'}, {'head': 'Alice', 'type': 'spouse', 'tail': 'Bob'}, {'head': 'Alice', 'type': 'spouse', 'tail': 'Bob'}, {'head': 'Bob', 'type': 'spouse', 'tail': 'Alice'}, {'head': 'Bob', 'type': 'spouse', 'tail': 'Alice'}]
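
The parsed list above contains the same two triplets several times because the model's generation repeats itself. Before inserting anything into a graph, the duplicates can be dropped. The sketch below is one possible way to do this while keeping the first-seen order; `dedupe_triplets` is a hypothetical helper, and it assumes each triplet is a dict with `head`, `type`, and `tail` keys as produced by `extract_triplets`.

```python
def dedupe_triplets(triplets):
    """Return the triplets with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for t in triplets:
        key = (t["head"], t["type"], t["tail"])
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique

sample = [
    {"head": "Alice", "type": "spouse", "tail": "Bob"},
    {"head": "Alice", "type": "spouse", "tail": "Bob"},
    {"head": "Bob", "type": "spouse", "tail": "Alice"},
    {"head": "Bob", "type": "spouse", "tail": "Alice"},
]
unique_triplets = dedupe_triplets(sample)
print(unique_triplets)
```

Note that deduplication only removes repeats. It does not correct the relation itself: the `spouse` label in the output above is the model's prediction, not something stated in the input text.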
