<a href="https://colab.research.google.com/github/VampireLordSeth/LangchainDocuments/blob/main/EntityExtraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers



In [2]:
!pip install torch


In [3]:
# -*- coding: utf-8 -*-

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import math
import torch
import pandas as pd

In [4]:
# REBEL (Relation Extraction By End-to-end Language generation) reframes
# relation extraction as a seq2seq task: the model generates the triplets
# as a linearized text sequence.

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")


In [5]:
# Parse the strings generated by REBEL and turn them into triplets
def extract_relations_from_model_output(text):
    relations = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    text_replaced = text.replace("<s>", "").replace("<pad>", "").replace("</s>", "")
    for token in text_replaced.split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                relations.append({
                    'head': subject.strip(),
                    'type': relation.strip(),
                    'tail': object_.strip()
                })
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                relations.append({
                    'head': subject.strip(), #Subject of relation "Seth"
                    'type': relation.strip(), #Relation e.g. "eats at"
                    'tail': object_.strip() #Object of relation "In-n-Out"
                })
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        relations.append({
            'head': subject.strip(),
            'type': relation.strip(),
            'tail': object_.strip()
        })
    return relations
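
To make the tagged format that the parser above consumes more concrete, here is a minimal sketch of the inverse direction: turning triplets into a linearized string. `linearize_triplets` is a hypothetical helper, not part of the notebook or of the REBEL code, and the layout (`<triplet> head <subj> tail <obj> relation`, with consecutive triplets that share a head grouped under one `<triplet>` tag) is an assumption read off the parser's branches.

```python
def linearize_triplets(triplets):
    """Serialize (head, relation, tail) tuples into a tagged string.

    Hypothetical helper for illustration only. Consecutive triplets with the
    same head share one <triplet> tag; each one adds "<subj> tail <obj> relation".
    """
    parts = []
    previous_head = None
    for head, relation, tail in triplets:
        if head != previous_head:
            # A new head opens a new <triplet> group.
            parts.append(f"<triplet> {head}")
            previous_head = head
        # Note the tag order: <subj> precedes the tail, <obj> precedes the relation.
        parts.append(f"<subj> {tail} <obj> {relation}")
    return " ".join(parts)


example = linearize_triplets([
    ("Fiji", "capital", "Suva"),
    ("Suva", "country", "Fiji"),
])
print(example)
```

Passing `example` to `extract_relations_from_model_output` should give back the same two triplets as `head`/`type`/`tail` dictionaries, which is a quick way to sanity-check the parser without loading the model.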

In [6]:
class NET():
    def __init__(self):
        self.relations = []

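    def add_entity(self, e):
        # __init__ only sets up `relations`, so create `entities` on first use
        if not hasattr(self, "entities"):
            self.entities = {}
        self.entities[e["title"]] = {k:v for k,v in e.items() if k != "title"}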

    def are_relations_equal(self, r1, r2):
        return all(r1[attr] == r2[attr] for attr in ["head", "type", "tail"])

    def exists_relation(self, r1):
        return any(self.are_relations_equal(r1, r2) for r2 in self.relations)

    def merge_relations(self, r1):
        r2 = [r for r in self.relations
              if self.are_relations_equal(r1, r)][0]
        spans_to_add = [span for span in r1["meta"]["spans"]
                        if span not in r2["meta"]["spans"]]
        r2["meta"]["spans"] += spans_to_add

    def add_relation(self, r):
        if not self.exists_relation(r):
            self.relations.append(r)
        else:
            self.merge_relations(r)

    def print(self):
        print("Relations:")
        for r in self.relations:
            print(f"  {r}")
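
The merge logic in `NET` scans the whole relation list for every new relation. As a sketch of an alternative design (not the notebook's implementation), the same deduplication can be done with a dictionary keyed by `(head, type, tail)`. The function name `merge_relation_spans` and the sample data are assumptions made for illustration.

```python
def merge_relation_spans(relations):
    """Collapse duplicate (head, type, tail) relations and union their spans."""
    merged = {}
    for rel in relations:
        key = (rel["head"], rel["type"], rel["tail"])
        if key not in merged:
            # First occurrence: copy the triplet and start an empty span list.
            merged[key] = {
                "head": rel["head"],
                "type": rel["type"],
                "tail": rel["tail"],
                "meta": {"spans": []},
            }
        spans = merged[key]["meta"]["spans"]
        for span in rel["meta"]["spans"]:
            if span not in spans:
                spans.append(span)
    # Dicts preserve insertion order, so first-seen order is kept.
    return list(merged.values())


sample = [
    {"head": "China", "type": "founded by", "tail": "Chinese Communist Party",
     "meta": {"spans": [[77, 205]]}},
    {"head": "United Front", "type": "participant", "tail": "Chinese Communist Party",
     "meta": {"spans": [[154, 282]]}},
    {"head": "China", "type": "founded by", "tail": "Chinese Communist Party",
     "meta": {"spans": [[154, 282]]}},
]
merged = merge_relation_spans(sample)
for rel in merged:
    print(rel)
```

The dictionary lookup avoids the repeated linear scan, which only matters once a document yields many relations. For short inputs the list-based class is just as serviceable.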

In [7]:
def from_text_to_net(text, span_length=128, verbose=False):
    # tokenize whole text
    inputs = tokenizer([text], return_tensors="pt")

    # compute span boundaries
    num_tokens = len(inputs["input_ids"][0])
    if verbose:
        print(f"Input has {num_tokens} tokens")
    num_spans = math.ceil(num_tokens / span_length)
    if verbose:
        print(f"Input has {num_spans} spans")
    overlap = math.ceil((num_spans * span_length - num_tokens) /
                        max(num_spans - 1, 1))
    spans_boundaries = []
    start = 0
    for i in range(num_spans):
        spans_boundaries.append([start + span_length * i,
                                 start + span_length * (i + 1)])
        start -= overlap
    if verbose:
        print(f"Span boundaries are {spans_boundaries}")

    # transform input with spans
    tensor_ids = [inputs["input_ids"][0][boundary[0]:boundary[1]]
                  for boundary in spans_boundaries]
    tensor_masks = [inputs["attention_mask"][0][boundary[0]:boundary[1]]
                    for boundary in spans_boundaries]
    inputs = {
        "input_ids": torch.stack(tensor_ids),
        "attention_mask": torch.stack(tensor_masks)
    }

    # generate relations
    num_return_sequences = 3
    gen_kwargs = {
        "max_length": 256,
        "length_penalty": 0,
        "num_beams": 3,
        "num_return_sequences": num_return_sequences
    }
    generated_tokens = model.generate(
        **inputs,
        **gen_kwargs,
    )

    # decode relations
    decoded_preds = tokenizer.batch_decode(generated_tokens,
                                           skip_special_tokens=False)

    # create net
    net = NET()
    i = 0
    for sentence_pred in decoded_preds:
        current_span_index = i // num_return_sequences
        relations = extract_relations_from_model_output(sentence_pred)
        for relation in relations:
            relation["meta"] = {
                "spans": [spans_boundaries[current_span_index]]
            }
            net.add_relation(relation)
        i += 1

    return net
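
The span arithmetic inside `from_text_to_net` is easier to follow in isolation. The sketch below lifts it into a standalone function (the name `compute_span_boundaries` is an assumption, not something the notebook defines) and also shows how a flat list of decoded predictions maps back to span indices when each span yields `num_return_sequences` outputs.

```python
import math


def compute_span_boundaries(num_tokens, span_length=128):
    """Split num_tokens into equally sized, overlapping [start, end) windows."""
    num_spans = math.ceil(num_tokens / span_length)
    # Spread the surplus evenly as overlap between neighbouring windows.
    overlap = math.ceil((num_spans * span_length - num_tokens) /
                        max(num_spans - 1, 1))
    boundaries = []
    start = 0
    for i in range(num_spans):
        boundaries.append([start + span_length * i,
                           start + span_length * (i + 1)])
        start -= overlap  # shift every later window back by one more overlap
    return boundaries


boundaries = compute_span_boundaries(283)
print(boundaries)

# With 3 returned sequences per span, prediction i belongs to span i // 3.
num_return_sequences = 3
span_of_prediction = [i // num_return_sequences
                      for i in range(len(boundaries) * num_return_sequences)]
print(span_of_prediction)
```

For 283 tokens this gives three windows with an overlap of 51 tokens. The final window stops at index 282, so the last of the 283 tokens falls outside it, a side effect of rounding the overlap up.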

In [115]:
# Sample text to test the code

text = '''
From his perch at the hilltop Yue Lai Hotel, China-born entrepreneur Zhao Fugang enjoys a panoramic view of Fiji’s seaside capital, Suva.

But the hotel is not just the headquarters of Zhao’s local business empire, which has stretched from tourism to property development. It’s also the base for the businessman’s parallel job: promoting China’s influence in the Pacific country.

The imposing red-and-black hotel is a favored venue for the local Chinese embassy’s official functions, where Zhao has rubbed shoulders with senior Fijian officials. It’s also home to an official “service center” for Chinese citizens, which has played a public role in fostering security ties between China and Fiji. The businessman’s role is typical of Beijing’s steady efforts to build its footprint in the Pacific Islands. The ruling Chinese Communist Party often uses prominent members of the overseas diaspora as proxies to push Chinese interests, under a strategy it calls the “United Front.”

As Western countries fret over China’s rising influence in the strategically important Pacific islands, Australia — a key U.S. ally — has set its sights on Zhao, a joint investigation by OCCRP and Australia’s Nine media outlets have found.



'''
net = from_text_to_net(text, verbose=True)
net.print()

Input has 283 tokens
Input has 3 spans
Span boundaries are [[0, 128], [77, 205], [154, 282]]
Relations:
  {'head': 'Yue Lai Hotel', 'type': 'owned by', 'tail': 'Zhao Fugang', 'meta': {'spans': [[0, 128]]}}
  {'head': 'Zhao Fugang', 'type': 'owner of', 'tail': 'Yue Lai Hotel', 'meta': {'spans': [[0, 128]]}}
  {'head': 'Fiji', 'type': 'capital', 'tail': 'Suva', 'meta': {'spans': [[0, 128]]}}
  {'head': 'Suva', 'type': 'country', 'tail': 'Fiji', 'meta': {'spans': [[0, 128]]}}
  {'head': 'Chinese Communist Party', 'type': 'headquarters location', 'tail': 'Beijing', 'meta': {'spans': [[77, 205]]}}
  {'head': 'China', 'type': 'founded by', 'tail': 'Chinese Communist Party', 'meta': {'spans': [[77, 205], [154, 282]]}}
  {'head': 'Chinese Communist Party', 'type': 'country', 'tail': 'China', 'meta': {'spans': [[77, 205]]}}
  {'head': 'United Front', 'type': 'participant', 'tail': 'Chinese Communist Party', 'meta': {'spans': [[154, 282]]}}
  {'head': 'Nine', 'type': 'country', 'tail': 'Australi

In [131]:
# Build a DataFrame with one row per extracted relation
df = pd.DataFrame(net.relations, columns=['head', 'type', 'tail', 'meta'])
df = df.rename(columns={'head': 'Source', 'type': 'Relationship', 'tail': 'Target'})
# The span metadata is not needed in the exported table
df = df.drop(columns='meta')

In [132]:
df

                    Source           Relationship                   Target
0            Yue Lai Hotel               owned by              Zhao Fugang
1              Zhao Fugang               owner of            Yue Lai Hotel
2                     Fiji                capital                     Suva
3                     Suva                country                     Fiji
4  Chinese Communist Party  headquarters location                  Beijing
5                    China             founded by  Chinese Communist Party
6  Chinese Communist Party                country                    China
7             United Front            participant  Chinese Communist Party
8                     Nine                country                Australia


In [133]:
df.to_csv('entities.csv')
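
By default `to_csv` also writes the row index as an unnamed first column. The sketch below, which uses an in-memory buffer and two made-up rows instead of the notebook's `df`, shows the difference between the default and `index=False` when the file is read back. Whether the index column is wanted depends on the downstream tool, so treat this as an option, not a correction.

```python
import io

import pandas as pd

# Two illustrative rows in the same shape as the exported table.
edges = pd.DataFrame([
    {"Source": "Fiji", "Relationship": "capital", "Target": "Suva"},
    {"Source": "Suva", "Relationship": "country", "Target": "Fiji"},
])

# Default: the index is written as a first column with an empty header.
with_index = io.StringIO()
edges.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# index=False: only the three relation columns are written.
without_index = io.StringIO()
edges.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_default.columns))
print(list(reloaded_clean.columns))
```

Reading the default export back produces an extra `Unnamed: 0` column, while the `index=False` export round-trips to exactly `Source`, `Relationship`, and `Target`.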

#%% Citations

# @inproceedings{huguet-cabot-navigli-2021-rebel-relation,
#     title = "{REBEL}: Relation Extraction By End-to-end Language generation",
#     author = "Huguet Cabot, Pere-Llu{\'\i}s and Navigli, Roberto",
#     booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
#     month = nov,
#     year = "2021",
#     address = "Punta Cana, Dominican Republic",
#     publisher = "Association for Computational Linguistics",
#     url = "https://aclanthology.org/2021.findings-emnlp.204",
#     pages = "2370--2381",
#     abstract = "Extracting relation triplets from raw text is a crucial task in Information Extraction, enabling multiple applications such as populating or validating knowledge bases, factchecking, and other downstream tasks. However, it usually involves multiple-step pipelines that propagate errors or are limited to a small number of relation types. To overcome these issues, we propose the use of autoregressive seq2seq models. Such models have previously been shown to perform well not only in language generation, but also in NLU tasks such as Entity Linking, thanks to their framing as seq2seq tasks. In this paper, we show how Relation Extraction can be simplified by expressing triplets as a sequence of text and we present REBEL, a seq2seq model based on BART that performs end-to-end relation extraction for more than 200 different relation types. We show our model{'}s flexibility by fine-tuning it on an array of Relation Extraction and Relation Classification benchmarks, with it attaining state-of-the-art performance in most of them.",
# }
