In [1]:
import openai
import os

# Read the API key from the OPENAI_API environment variable
openai.api_key = os.getenv("OPENAI_API")

## 1. Extracting attributes and relationships from documents with OpenAI

**PROMPT SETTING**

In [24]:
def generate_system_message() -> str:
    return """
    You are a data scientist working for a company that is building a graph database. Your task is to extract information from data and convert it into a graph database.
    Provide a set of Nodes in the form [ENTITY_ID, TYPE, PROPERTIES] and a set of relationships in the form [ENTITY_ID_1, RELATIONSHIP, ENTITY_ID_2, PROPERTIES].
    It is important that ENTITY_ID_1 and ENTITY_ID_2 exist as nodes with a matching ENTITY_ID. If you can't pair a relationship with a pair of nodes, don't add it.
    When you find a node or relationship you want to add, try to create a generic TYPE for it that describes the entity. You can also think of it as a label.
    
    Example:
    Data: Alice is a lawyer and is 25 years old, and Bob has been her roommate since 2001. Bob works as a journalist. Alice owns the webpage www.alice.com and Bob owns the webpage www.bob.com.
    Nodes: ["alice", "Person", {"age": 25, "occupation": "lawyer", "name":"Alice"}], ["bob", "Person", {"occupation": "journalist", "name": "Bob"}], ["alice.com", "Webpage", {"url": "www.alice.com"}], ["bob.com", "Webpage", {"url": "www.bob.com"}]
    Relationships: ["alice", "roommate", "bob", {"start": 2001}], ["alice", "owns", "alice.com", {}], ["bob", "owns", "bob.com", {}]
    """

def generate_prompt(data) -> str:
    return f"""
    Data: {data}"""

In [25]:
# Sample Data
obama_article = """Barack Hussein Obama II (/bəˈrɑːk huːˈseɪn oʊˈbɑːmə/ (listen) bə-RAHK hoo-SAYN oh-BAH-mə;[1] born August 4, 1961) is an American politician who served as the 44th president of the United States from 2009 to 2017. A member of the Democratic Party, he was the first African-American president of the United States.[2] Obama previously served as a U.S. senator representing Illinois from 2005 to 2008 and as an Illinois state senator from 1997 to 2004, and worked as a civil rights lawyer and university lecturer.

Obama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago. In 1988, he enrolled in Harvard Law School, where he was the first black president of the Harvard Law Review. After graduating, he became a civil rights attorney and an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004. Turning to elective politics, he represented the 13th district in the Illinois Senate from 1997 until 2004, when he ran for the U.S. Senate. In 2008, after a close primary campaign against Hillary Clinton, he was nominated by the Democratic Party for president and chose Joe Biden as his running mate. Obama was elected over Republican nominee John McCain in the presidential election and was inaugurated on January 20, 2009. Nine months later, he was named the 2009 Nobel Peace Prize laureate, a decision that drew a mixture of praise and criticism.

Obama's first-term actions addressed the global financial crisis and included a major stimulus package, a partial extension of George W. Bush's tax cuts, legislation to reform health care, a major financial regulation reform bill, and the end of a major US military presence in Iraq. Obama also appointed Supreme Court justices Sonia Sotomayor and Elena Kagan, the former being the first Hispanic American on the Supreme Court. He ordered the counterterrorism raid which killed Osama bin Laden and downplayed Bush's counterinsurgency model, expanding air strikes and making extensive use of special forces while encouraging greater reliance on host-government militaries.

After winning re-election by defeating Republican opponent Mitt Romney, Obama was sworn in for a second term on January 20, 2013. In his second term, Obama took steps to combat climate change, signing a major international climate agreement and an executive order to limit carbon emissions. Obama also presided over the implementation of the Affordable Care Act and other legislation passed in his first term, and he negotiated a nuclear agreement with Iran and normalized relations with Cuba. The number of American soldiers in Afghanistan fell dramatically during Obama's second term, though U.S. soldiers remained in Afghanistan throughout Obama's presidency.
"""

system_message = generate_system_message()

prompt_string = generate_prompt(obama_article)

messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": prompt_string},
]

In [4]:
# Use the OpenAI chat API (legacy openai<1.0 interface) to extract nodes and relationships from the article
output = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=messages
)

In [5]:
output_str = [output.to_dict().get('choices')[0].get("message").to_dict().get("content")]
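
The call above uses `openai.ChatCompletion`, the interface of `openai` releases before 1.0. Assuming a 1.0 or later release is installed, the same request could be sketched as follows. `extract_graph_text` is a hypothetical helper name, and the client is passed in as an argument so the function can be tried with a stand-in object.

```python
def extract_graph_text(client, messages, model="gpt-3.5-turbo"):
    """Return the assistant's text reply for the given chat messages.

    `client` is assumed to be an `openai.OpenAI` instance (openai>=1.0).
    """
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content


# Possible usage, assuming openai>=1.0 and the same OPENAI_API variable:
# import os
# from openai import OpenAI
# client = OpenAI(api_key=os.getenv("OPENAI_API"))
# output_str = [extract_graph_text(client, messages)]
```

Wrapping the reply in a one-element list keeps it compatible with the parsing code later in the notebook, which iterates over a list of result strings.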

**Result returned by OpenAI**

In [12]:
output_str[0]

'Nodes: \n["barack_obama", "Person", {"occupation": "politician, lawyer", "birth_date": "1961-08-04"}],\n["columbia_university", "University", {"name": "Columbia University"}],\n["harvard_law_school", "University", {"name": "Harvard Law School"}],\n["chicago", "City", {"name": "Chicago"}],\n["chicago_university_law_school", "University", {"name": "Chicago University Law School"}],\n["illinois_state_senate", "Government Position", {"name": "Illinois State Senate"}],\n["u.s._senate", "Government Position", {"name": "U.S. Senate"}]\n\nRelationships: \n["barack_obama", "graduated_from", "columbia_university", {}],\n["barack_obama", "worked_as", "community_organizer", {"location": "Chicago"}],\n["barack_obama", "graduated_from", "harvard_law_school", {}],\n["barack_obama", "worked_as", "lawyer", {"field": "civil rights"}],\n["barack_obama", "served_as", "illinois_state_senate", {}],\n["barack_obama", "served_as", "u.s._senate", {}]'

## 2. Clean the text

From the text above, we need to extract the following information:
- Nodes. Each node includes the following information, for example:
```
{key: 1, name: barack_obama, label: Person, properties: {}}
```

- Relationships. Each relationship includes the following information, for example:
  ```
  {'start_name': 'barack_obama',
   'start_key': 0,
   'end_name': 'harvard_law_school',
   'end_key': 3,
   'type': 'enrolled in',
   'properties': {}}
  ```

In [6]:
import re
import json

def extract_nodes(nodes):
    """
    Extract nodes
    """
    # Raw string avoids invalid-escape warnings for \{ and \}
    jsonRegex = r"\{(.*?)\}"
    
    result = []
    for i, node in enumerate(nodes):
        # The first two comma-separated fields are the entity ID and its type
        node_detail = node.split(",")
        name = node_detail[0].strip().replace('"', "")
        label = node_detail[1].strip().replace('"', "")
        properties = re.search(jsonRegex, node)
        
        if properties is None or properties.group(1).strip() == "":
            properties = "{}"
        else:
            properties = properties.group(0)

        properties = properties.replace("True", "true").replace("False", "false")
        properties = json.loads(properties)
        result.append({"key": i,"name": name, "label": label, "properties": properties})
        
    return result

def extract_relationships(relationships, nodes):
    """
    Extract relationship between nodes
    """
    jsonRegex = r"\{(.*?)\}"
    
    result = []
    for relation in relationships:
        relation = relation.replace("True", "true").replace("False", "false")

        relationList = relation.split(",")
        if len(relationList) < 3:
            continue
        
        start_name = relationList[0].strip().replace('"', "")
        start_key = get_key_node_by_name(nodes, start_name)

        end_name = relationList[2].strip().replace('"', "")
        end_key = get_key_node_by_name(nodes, end_name)

        if start_key is None or end_key is None:
            continue
            
        # Avoid shadowing the built-in `type`
        rel_type = relationList[1].strip().replace('"', "")

        properties = re.search(jsonRegex, relation)
        if properties is None or properties.group(1).strip() == "":
            properties = "{}"
        else:
            properties = properties.group(0)
        properties = json.loads(properties)

        result.append(
            {"start_name": start_name, "start_key": start_key,
             "end_name": end_name, "end_key": end_key,
             "type": rel_type, "properties": properties}
        )
    return result
    
def getNodesAndRelationshipsFromResult(result):
    regex = r"Nodes:\s+(.*?)\s?\s?Relationships:\s+(.*)"
    internalRegex = r"\[(.*?)\]"

    nodes = []
    relationships = []
    
    for row in result:
        parsing = re.match(regex, row, flags=re.S)

        if parsing is None:
            continue

        rawNodes = str(parsing.group(1))
        rawRelationships = parsing.group(2)

        nodes.extend(re.findall(internalRegex, rawNodes))
        relationships.extend(re.findall(internalRegex, rawRelationships))

    nodes = extract_nodes(nodes)
    relationships = extract_relationships(relationships, nodes)
    return dict(
                nodes = nodes,
                relationships = relationships
            )

def get_key_node_by_name(nodes, name):
    matching_nodes = [node for node in nodes if node.get('name') == name]
    return matching_nodes[0].get("key") if matching_nodes else None
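
The functions above split each row on commas and pick out the properties with a regular expression, which can misread a row whose name or type contains a comma. As a possible alternative, and assuming the model emits one valid JSON array per line (as in the sample output above), each row could be parsed with `json.loads`. `parse_rows` is a hypothetical helper, not part of the original notebook.

```python
import json


def parse_rows(section_text):
    """Parse lines such as '["a", "B", {...}],' into Python lists."""
    rows = []
    for line in section_text.splitlines():
        # Drop surrounding whitespace and the trailing comma between rows
        line = line.strip().rstrip(",")
        if not line.startswith("["):
            continue  # skip headers such as "Nodes:" and blank lines
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # skip rows the model did not emit as valid JSON
    return rows


sample = (
    '["barack_obama", "Person", {"occupation": "politician, lawyer"}],\n'
    '["chicago", "City", {"name": "Chicago"}]'
)
rows = parse_rows(sample)
print(rows[0][2]["occupation"])
```

This sketch silently skips malformed rows; whether that is acceptable depends on how strictly the model follows the requested format.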

In [7]:
result = getNodesAndRelationshipsFromResult(output_str)
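
In the sample output, the relationships ending in `community_organizer` and `lawyer` have no matching node, so `extract_relationships` skips them without any message. A minimal sketch for listing such rows, assuming relationships are available as `[start, type, end, properties]` lists (`find_dangling_relationships` is a hypothetical helper):

```python
def find_dangling_relationships(rows, node_names):
    """Return relationship rows whose start or end is not a known node name."""
    known = set(node_names)
    return [row for row in rows if row[0] not in known or row[2] not in known]


node_names = ["barack_obama", "columbia_university"]
relationship_rows = [
    ["barack_obama", "graduated_from", "columbia_university", {}],
    ["barack_obama", "worked_as", "community_organizer", {"location": "Chicago"}],
]
dangling = find_dangling_relationships(relationship_rows, node_names)
print(dangling)
```

Inspecting the dangling rows can help decide whether to tighten the prompt or to create the missing nodes.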

In [8]:
result

{'nodes': [{'key': 0,
   'name': 'barack_obama',
   'label': 'Person',
   'properties': {'occupation': 'politician, lawyer',
    'birth_date': '1961-08-04'}},
  {'key': 1,
   'name': 'columbia_university',
   'label': 'University',
   'properties': {'name': 'Columbia University'}},
  {'key': 2,
   'name': 'harvard_law_school',
   'label': 'University',
   'properties': {'name': 'Harvard Law School'}},
  {'key': 3,
   'name': 'chicago',
   'label': 'City',
   'properties': {'name': 'Chicago'}},
  {'key': 4,
   'name': 'chicago_university_law_school',
   'label': 'University',
   'properties': {'name': 'Chicago University Law School'}},
  {'key': 5,
   'name': 'illinois_state_senate',
   'label': 'Government Position',
   'properties': {'name': 'Illinois State Senate'}},
  {'key': 6,
   'name': 'u.s._senate',
   'label': 'Government Position',
   'properties': {'name': 'U.S. Senate'}}],
 'relationships': [{'start_name': 'barack_obama',
   'start_key': 0,
   'end_name': 'columbia_universi

## 3. Visualization

> pip install pyvis

In [9]:
from pyvis.network import Network

def assign_category_colors(categories):
    color_hex_mapping = {}
    color_palette = [
    '#1ABC9C', '#2ECC71', '#3498DB', '#9B59B6',
    '#34495E', '#16A085', '#27AE60', '#2980B9',
    '#8E44AD', '#2C3E50', '#F1C40F', '#E67E22'
    ]
    
    for i, category in enumerate(categories):
        color_hex_mapping[category] = color_palette[i % len(color_palette)]
    
    return color_hex_mapping
    
def visualize_nodes(result):

    net = Network(notebook=True, 
                  cdn_resources="remote", 
                  bgcolor="#222222",
                  font_color="white",
                  height="750px",
                  width="100%",
                  select_menu=False,
                  filter_menu=False,
                 )
    
    # Sort the distinct labels so color assignment is stable across runs
    labels = sorted({node.get("label") for node in result.get("nodes")})
    type_colors = assign_category_colors(labels)
    
    node_keys = [node.get('key') for node in result.get("nodes")]
    node_labels = [node.get('name') for node in result.get("nodes")]
    node_color = [type_colors.get(node.get("label")) for node in result.get("nodes")]
    node_title = [str(node.get("properties")).replace("', '", "\r\n").replace("'", "").replace("{", "").replace("}", "") for node in result.get("nodes")]
    
    net.add_nodes(node_keys, label=node_labels, color=node_color, title=node_title)

    for relation in result.get("relationships"):
        net.add_edge(relation.get("start_key"), to=relation.get("end_key"), title=relation.get("type"))

    return net
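
The tooltip text in `node_title` is produced by stripping characters from `str(properties)`, so non-string values stay on the same line as the next key and apostrophes inside values are removed. A sketch of a more direct formatting, using a hypothetical helper `format_properties`:

```python
def format_properties(properties):
    """Render a properties dict as one 'key: value' pair per line."""
    return "\n".join(f"{key}: {value}" for key, value in properties.items())


tooltip = format_properties(
    {"occupation": "politician, lawyer", "birth_date": "1961-08-04"}
)
print(tooltip)
```

The resulting strings could be passed as the `title` values in place of the chained `replace` calls.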
    

In [10]:
net = visualize_nodes(result)
net.show('edges.html')
# net.show_buttons(filter_="physics")

edges.html
