# **WEBNLG 16-domains Entity-based**

Authors: *Dario Della Mura - David Doci*

*INSID&S Lab*

*Department of Computer Science, Systems and Communication - 
University of Milano-Bicocca*




## Dataset Presentation

This WebNLG dataset consists of 35100 (data, text) pairs and 13083 distinct data units. The data units are sets of RDF triples extracted from DBpedia, and the texts are sequences of one or more sentences verbalising these data units.
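As a minimal sketch of what one data unit looks like in this notebook: the cells that follow serialise each set of triples as `subject | property | object` strings joined by ` && `. The helper name `parse_data_unit` is hypothetical, and the second triple in the example is made up for illustration. The sketch assumes that `|` and `&&` never occur inside a subject, property, or object.

```python
from typing import List, Tuple


def parse_data_unit(unit: str) -> List[Tuple[str, str, str]]:
    """Split a serialised data unit into (subject, property, object) tuples."""
    triples = []
    for raw in unit.split("&&"):
        # each triple is written as "subject | property | object"
        subject, prop, obj = (part.strip() for part in raw.split("|"))
        triples.append((subject, prop, obj))
    return triples


# The second triple is a made-up illustration, not copied from the dataset.
example = "103_Colmore_Row | floorCount | 23 && 103_Colmore_Row | location | Birmingham"
parsed = parse_data_unit(example)
print(parsed)
```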

## Setting Environment

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# check the GPU version
!nvidia-smi

In [None]:
# replace these with your own paths

# train set path for model seen
train_path = '/content/drive/MyDrive/rdf-to-text/dataset/Notebook/WebNLG/16-domains/SEEN/Entity-based/webnlg-train.csv'

# validation set path for model seen
val_path = '/content/drive/MyDrive/rdf-to-text/dataset/Notebook/WebNLG/16-domains/SEEN/Entity-based/webnlg-val.csv'

# train set path for model unseen
train_path_u = '/content/drive/MyDrive/rdf-to-text/dataset/Notebook/WebNLG/16-domains/UNSEEN/Entity-based/webnlg-train.csv'

# validation set path for model unseen
val_path_u = '/content/drive/MyDrive/rdf-to-text/dataset/Notebook/WebNLG/16-domains/UNSEEN/Entity-based/webnlg-val.csv'

# meteor metric path
meteor_path = '/content/drive/MyDrive/rdf-to-text/dataset/Metriche/meteor-1.5'

# ter metric path
ter_path = '/content/drive/MyDrive/rdf-to-text/dataset/Metriche/tercom-0.7.25'

# LSTM model checkpoint (seen)
model_lstm = '/content/drive/MyDrive/rdf-to-text/dataset/Notebook/WebNLG/16-domains/SEEN/Entity-based/lstm_model.pt'

# transformer model checkpoint (seen)
model_transformer = '/content/drive/MyDrive/rdf-to-text/dataset/Notebook/WebNLG/16-domains/SEEN/Entity-based/transformer_model.pt'

# LSTM model checkpoint (unseen)
model_lstm_u = '/content/drive/MyDrive/rdf-to-text/dataset/Notebook/WebNLG/16-domains/UNSEEN/Entity-based/lstm_model.pt'

# transformer model checkpoint (unseen)
model_transformer_u = '/content/drive/MyDrive/rdf-to-text/dataset/Notebook/WebNLG/16-domains/UNSEEN/Entity-based/transformer_model.pt'

In [None]:
%%capture
# clone and set up the OpenNMT-py environment
!git clone https://github.com/OpenNMT/OpenNMT-py.git
%cd OpenNMT-py
!pip install -e .

# install openNMT requirements
!pip install -r requirements.opt.txt

%cd /content/

In [None]:
from gensim.parsing.preprocessing import remove_stopwords
from string import punctuation
from nltk.corpus import stopwords
from google.colab import data_table

import pandas as pd
import numpy as np
import nltk
import string
import shutil
import re

import os
import glob
import xml.etree.ElementTree as ET

# improve visualisation of data
data_table.enable_dataframe_formatter()

nltk.download('stopwords')
stop = stopwords.words('english')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# SEEN 

## Dataset Creation 

In [None]:
# download webnlg dataset and create train dataset

# import webnlg dataset repository
!git clone https://gitlab.com/shimorina/webnlg-dataset.git

# create the csv file for the train set from the xml files
xml = glob.glob("/content/webnlg-dataset/release_v3.0/en/train/**/*.xml", recursive=True)
n_tripla=re.compile(r'(\d)triples')
dizionario={}
for file in xml:
    parsing_xml = ET.parse(file)
    xml_path = parsing_xml.getroot()
    categoria_tripla=int(n_tripla.findall(file)[0])
    for sub_path in xml_path:
        for ss_path in sub_path:
            lista_tripla=[]
            lista_text=[]
            for entry in ss_path:
                lista_text.append(entry.text)
                strutured=[triple.text for triple in entry]
                lista_tripla.extend(strutured)
            lista_text=[i for i in lista_text if i.replace('\n','').strip()!='' ]
            lista_tripla=lista_tripla[-categoria_tripla:]
            lista_tripla_str=(' && ').join(lista_tripla)
            dizionario[lista_tripla_str]=lista_text
diz={ "rdf_triple":[], "ref_text":[]}
for st,unst in dizionario.items():
    for i in unst:
        diz['rdf_triple'].append(st)
        diz['ref_text'].append(i)


train=pd.DataFrame(diz)
train=train[0:35100]
train.head(5)

Cloning into 'webnlg-dataset'...
remote: Enumerating objects: 5112, done.[K
remote: Counting objects: 100% (6/6), done.[K
remote: Compressing objects: 100% (6/6), done.[K
remote: Total 5112 (delta 2), reused 0 (delta 0), pack-reused 5106[K
Receiving objects: 100% (5112/5112), 26.09 MiB | 17.30 MiB/s, done.
Resolving deltas: 100% (4010/4010), done.
Checking out files: 100% (1425/1425), done.


Unnamed: 0,rdf_triple,ref_text
0,103_Colmore_Row | floorCount | 23 && 103_Colmo...,"103 Colmore Row is located on Colmore Row, Bir..."
1,103_Colmore_Row | floorCount | 23 && 103_Colmo...,"103 Colmore Row, Birmingham, England was desig..."
2,103_Colmore_Row | floorCount | 23 && 103_Colmo...,"103 Colmore Row, completed in 1976 with 23 flo..."
3,103_Colmore_Row | floorCount | 23 && 103_Colmo...,"John Madin, born in Birmingham, was the archit..."
4,103_Colmore_Row | floorCount | 23 && 103_Colmo...,"Designed by, Birmingham born, architect, John ..."


In [None]:
# download webnlg dataset and create val dataset
lista = ['Airport', 'Astronaut', 'Building', 'City', 'SportsTeam', 'University', 'WrittenWork']
#VAL
val=pd.DataFrame(columns=['rdf_triple','ref_text' ])
for dominio in lista:
  xml = glob.glob("/content/webnlg-dataset/release_v3.0/en/dev/**/" + str(dominio) + ".xml", recursive=True)
  n_tripla=re.compile(r'(\d)triples')
  dizionario={}
  for file in xml:
      parsing_xml = ET.parse(file)
      xml_path = parsing_xml.getroot()
      categoria_tripla=int(n_tripla.findall(file)[0])
      for sub_path in xml_path:
          for ss_path in sub_path:
              lista_tripla=[]
              lista_text=[]
              for entry in ss_path:
                  lista_text.append(entry.text)
                  strutured=[triple.text for triple in entry]
                  lista_tripla.extend(strutured)
              lista_text=[i for i in lista_text if i.replace('\n','').strip()!='' ]
              lista_tripla=lista_tripla[-categoria_tripla:]
              lista_tripla_str=(' && ').join(lista_tripla)
              dizionario[lista_tripla_str]=lista_text
  diz={ "rdf_triple":[], "ref_text":[]}
  for st,unst in dizionario.items():
      for i in unst:
          diz['rdf_triple'].append(st)
          diz['ref_text'].append(i)
  df=pd.DataFrame(diz)
  val = pd.concat([val, df])




In [None]:
len(train), len(val)

(35100, 1505)

In [None]:
# save the created dfs in your path
train.to_csv('/content/drive/MyDrive/rdf-to-text/dataset/Notebook/WebNLG/16-domains/SEEN/Entity-based/webnlg-train.csv')
val.to_csv('/content/drive/MyDrive/rdf-to-text/dataset/Notebook/WebNLG/16-domains/SEEN/Entity-based/webnlg-val.csv')

## Import Dataset

In [None]:
# import train and val datasets
train_raw = pd.read_csv(train_path)
val_raw = pd.read_csv(val_path)

train_raw.drop(columns=['Unnamed: 0'], inplace=True)
val_raw.drop(columns=['Unnamed: 0'], inplace=True)
train_raw.head(10)

Unnamed: 0,rdf_triple,ref_text
0,103 colmore row floorcount 23 103 colmore row ...,"103 colmore row is located on colmore row, bir..."
1,103 colmore row floorcount 23 103 colmore row ...,"103 colmore row, birmingham, england was desig..."
2,103 colmore row floorcount 23 103 colmore row ...,"103 colmore row, completed in 1976 with 23 flo..."
3,103 colmore row floorcount 23 103 colmore row ...,"john madin, born in birmingham, was the archit..."
4,103 colmore row floorcount 23 103 colmore row ...,"designed by, birmingham born, architect, john ..."
5,103 colmore row floorcount 23 103 colmore row ...,architect john madin (born in birmingham) desi...
6,108 st georges terrace location perth perth co...,"108 st georges terrace in perth, australia, ha..."
7,108 st georges terrace location perth perth co...,"108 st. georges terrace, with 50 floors, is lo..."
8,108 st georges terrace location perth perth co...,108 st georges terrace was completed in 1988 i...
9,11 diagonal street location south africa south...,south africa's ethnic groups include white sou...


## Input Metrics

### Descriptive Statistics

In [None]:
df = train_raw.copy()

In [None]:
# function to compute the descriptive metrics
def compute_metrics_web(df):
  input = df[['rdf_triple']]
  input = input.assign(rdf_triple=input['rdf_triple'].str.split('&&')).explode('rdf_triple')
  input.rdf_triple = input.rdf_triple.str.lstrip()
  input.rdf_triple = input.rdf_triple.str.rstrip()
  n_triples = len(input)
  dupl = input.rdf_triple.duplicated().sum()
  perc_duplicated = dupl / len(input) *100
  input = input["rdf_triple"].str.split("|",  expand = True)
  mapping = {input.columns[0]:'subject', input.columns[1]:'property', input.columns[2]:'object'}
  input = input.rename(columns=mapping)
  dist_sub = input.subject.nunique()
  dist_obj = input.object.nunique()
  dist_pred = input.property.nunique()
  dist_sub_pred = len(input[~input.duplicated(subset=['subject','property'])])
  dist_sub_obj = len(input[~input.duplicated(subset=['subject','object'])])
  dist_obj_pred = len(input[~input.duplicated(subset=['object','property'])])
  avg_triple_for_sentence = len(input) / len(df)
  data_text_pairs = len(df)
  distinct_inputs = len(df.rdf_triple.unique())
  avg_text_for_triple = len(df.ref_text.unique()) / len(df.rdf_triple.unique())
  print("Number of data-text-pairs: ", data_text_pairs)
  print("Number of distinct inputs: ", distinct_inputs)
  print("Number of triples: ", n_triples)
  print("Number of duplicated triples: ", dupl)
  print("Perc of duplicated triples: ",perc_duplicated)
  print("Number of distinct properties: ", dist_pred)
  print("Number of distinct subjects: ", dist_sub )
  print("Number of distinct objects: ", dist_obj )
  print("Number of distinct subject and predicate: ", dist_sub_pred )
  print("Number of distinct object and predicate: ", dist_obj_pred)
  print("Number of distinct subject and object: ", dist_sub_obj)
  print("Average triple for one sentence: ",avg_triple_for_sentence)
  print("Average sentence for one triple: ", avg_text_for_triple)

compute_metrics_web(df)

Number of data-text-pairs:  35100
Number of distinct inputs:  13087
Number of triples:  104167
Number of duplicated triples:  100329
Perc of duplicated triples:  96.31553179029827
Number of distinct properties:  372
Number of distinct subjects:  741
Number of distinct objects:  2988
Number of distinct subject and predicate:  2632
Number of distinct object and predicate:  3255
Number of distinct subject and object:  3723
Average triple for one sentence:  2.9677207977207978
Average sentence for one triple:  2.678230304882708


### Lexical Richness

Compute the lexical richness: the percentage of unique words among all words in the reference texts, after removing stopwords and punctuation.
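A toy worked example may help to read the score. The sketch below is a simplified stand-in, not the notebook's implementation: it uses a tiny hand-written stopword set (`TOY_STOPWORDS`, an assumption made for illustration) where the notebook relies on gensim's `remove_stopwords`, and the function name `lexical_richness` is hypothetical.

```python
import string

# Tiny illustrative stopword set; the notebook uses gensim's remove_stopwords instead.
TOY_STOPWORDS = {"the", "is", "in", "a", "of"}


def lexical_richness(text: str) -> float:
    """Percentage of unique words among all non-stopword words."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in no_punct.split() if w not in TOY_STOPWORDS]
    return 100 * len(set(words)) / len(words)


sample = "the tower is in perth. the tower is tall."
# remaining words: tower, perth, tower, tall -> 3 unique out of 4
print(lexical_richness(sample))
```

On a large corpus the same words recur many times, so a score of a few percent is not surprising.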

In [None]:
# lexical richness
text_to_clean = df['ref_text'] # column of your dataset
text_to_clean.replace('\n', '', regex=True, inplace=True)
text_to_clean.replace('\r', '', regex=True, inplace=True)

text = text_to_clean.str.cat(sep =' ')

def text_clean(text):
  #text = df['ref_text'].str.cat(sep =' ')
  filtered_sentence = remove_stopwords(text)
  #len_text = len(filtered_sentence.split())
  filtered_sentence1 = filtered_sentence.translate(str.maketrans('', '', string.punctuation))
  len_filtered_sentence1 = len(filtered_sentence1.split())
  return filtered_sentence1

def unique_words(text):
    #text = df['ref_text'].str.cat(sep =' ')
    #words = text.replace('"','').replace(',', '').split()
    words = text_clean(text).replace('"','').replace(',', '').split()
    unique = list(set(words))
    return len(unique)

def richness_score(text):
  # percentage of unique words among all words left after cleaning
  score = unique_words(text) / len(text_clean(text).split())
  score = score*100
  print("The Lexical Richness is: ",  score)

richness_score(text)

The Lexical Richness is:  1.8934229782662626


### Occurrence Metric

Compute the occurrence metric: for each pair, the fraction of words in the RDF triples that also appear in the reference text, averaged over the whole dataset.
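A small worked example for a single pair, as a sketch only. The function name `occurrence` and both example strings are hypothetical. Like the notebook's cell, it tests substring membership, so a triple word such as `floor` would also count as found inside `floors`.

```python
def occurrence(triple_words: str, reference: str) -> float:
    """Fraction of triple words that appear (as substrings) in the reference."""
    words = triple_words.split()
    hits = sum(1 for word in words if word in reference)
    return hits / len(words)


# Hypothetical cleaned triple and reference text, for illustration only.
triple = "tower floorcount 23 tower location perth"
reference = "the tower in perth has 23 floors"
# found: tower, 23, tower, perth; not found: floorcount, location
print(occurrence(triple, reference))
```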

In [None]:
def Occurence_Metric(df):
  # normalise the triples: lowercase, replace punctuation/underscores with spaces, drop stopwords
  df['rdf_triple'] = df['rdf_triple'].str.lower()
  df['ref_text'] = df['ref_text'].str.lower()
  df['rdf_triple'] = df['rdf_triple'].str.replace(r'[^\w]|_',' ', regex=True)
  df['rdf_triple'] = df['rdf_triple'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
  scores = []
  for triple, text in zip(df['rdf_triple'], df['ref_text']):
    words = triple.split()
    # count the triple words that occur (as substrings) in the reference text
    c = sum(1 for word in words if word in text)
    scores.append(c / len(words))
  df['occurance_metric'] = scores
  print("The Occurence metric is: " , df['occurance_metric'].mean())



Occurence_Metric(df)


The Occurence metric is:  0.8218679786197078


### Bleu - Meteor - Rouge

In [None]:
%%capture
# import bleu metric
%cd /content/
!wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl

In [None]:
%%capture
# import meteor metric
import shutil

source_dir = meteor_path
destination_dir = r"/content/meteor-1.5"
shutil.copytree(source_dir, destination_dir)

In [None]:
%%capture
# import and install rouge metric
%cd /content/
!git clone https://github.com/pltrdy/rouge.git
%cd rouge
!python setup.py install

In [None]:
# function to clean df

def clean_df(df):
  %cd /content/
  # lowercase both columns
  df['rdf_triple'] = df['rdf_triple'].str.lower()
  df['ref_text'] = df['ref_text'].str.lower()
  # replace punctuation and underscores with spaces, then drop stopwords
  df['rdf_triple'] = df['rdf_triple'].str.replace(r'[^\w]|_',' ', regex=True)
  df['rdf_triple'] = df['rdf_triple'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
  # strip line breaks so that each text stays on one line
  df['ref_text'] = df['ref_text'].replace('\n', '', regex=True)
  df['ref_text'] = df['ref_text'].replace('\r', '', regex=True)
  # the triples are written as the reference (ref.txt), the texts as the hypothesis (hyp.txt)
  np.savetxt(r'ref.txt', df['rdf_triple'].values, fmt='%s', delimiter='\t')
  np.savetxt(r'hyp.txt', df['ref_text'].values, fmt='%s', delimiter='\t')

In [None]:
# function to compute bleu metric
def bleu():
  %cd /content/
  bleu = !perl multi-bleu.perl /content/ref.txt < /content/hyp.txt
  print(bleu[0])

# function to compute meteor metric
def meteor():
  %cd /content/meteor-1.5/
  meteor= !java -Xmx2G -jar meteor-1.5.jar /content/hyp.txt /content/ref.txt -l en -norm 
  print("Meteor", meteor[-1])

# function to compute rouge metric
def rouge():
  %cd /content/rouge/
  from rouge import FilesRouge
  hyp_path = '/content/hyp.txt'
  ref_path= '/content/ref.txt'
  files_rouge = FilesRouge()
  rouge = files_rouge.get_scores(hyp_path, ref_path, avg=True)
  return rouge

clean_df(df)

/content



In [None]:
bleu()

/content
BLEU = 4.97, 30.6/10.6/2.6/0.7 (BP=1.000, ratio=1.211, hyp_len=697789, ref_len=576385)


In [None]:
meteor()

/content/meteor-1.5
Meteor Final score:            0.23336646875371628


In [None]:
rouge()

/content/rouge


{'rouge-1': {'f': 0.5008489574404925,
  'p': 0.4115596640561163,
  'r': 0.6529893707817942},
 'rouge-2': {'f': 0.18229301129990802,
  'p': 0.15356019498067913,
  'r': 0.23072829706411152},
 'rouge-l': {'f': 0.43007986614996374,
  'p': 0.35395217003035273,
  'r': 0.5594481426455932}}

### Bert Score

In [None]:
%cd /content

/content


In [None]:
# install bert score metric
%%capture
!pip install bert-score

In [None]:
def clean_df(df):
  df['rdf_triple'] = df['rdf_triple'].str.lower()
  df['ref_text'] = df['ref_text'].str.lower()
  df['rdf_triple'] = df['rdf_triple'].str.replace(r'[^\w]|_',' ', regex=True)
  df['rdf_triple'] = df['rdf_triple'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
  lista_triple = df['rdf_triple'].tolist()
  lista_text = df['ref_text'].tolist()
  return lista_text, lista_triple

# function to compute bert score metric
def bert_score_(references, hypothesis, lng='en'):
    from bert_score import score
    # NOTE: each item of `references` is treated as a collection of reference texts.
    # If an item is a plain string, this comprehension iterates over its characters,
    # so pass a list of strings per hypothesis to keep each reference whole.
    for i, refs in enumerate(references):
        references[i] = [ref for ref in refs if ref.strip() != '']
    try:
        P, R, F1 = score(hypothesis, references, lang=lng)
        P, R, F1 = list(P), list(R), list(F1)
        F1 = float(sum(F1) / len(F1))
        P = float(sum(P) / len(P))
        R = float(sum(R) / len(R))
    except Exception:
        # scoring failed: fall back to zeros
        P, R, F1 = 0, 0, 0
    return P, R, F1

In [None]:
text, triple = clean_df(df)
bert_score_(references=text, hypothesis=triple, lng='en')

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


(0.7544745802879333, 0.9998724460601807, 0.8596928715705872)

In [None]:
%cd /content/

/content


## Pre-processing

In [None]:
# data pre-processing on train set

from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
stop = stopwords.words('english')

train = train_raw.copy()

train['rdf_triple'] = train['rdf_triple'].str.lower()
train['ref_text'] = train['ref_text'].str.lower()
train['rdf_triple'] = train['rdf_triple'].str.replace(r'[^\w]|_',' ', regex=True)
train['rdf_triple'] = train['rdf_triple'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))  
train.replace('\n', '', regex=True, inplace=True)
train.replace('\r', '', regex=True, inplace=True)  

# to get a sample
'''train_sample = train.sample(n=10).reset_index()
train_sample.drop(columns=['index'], inplace=True)
train_sample'''

train.head(10)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,rdf_triple,ref_text
0,103 colmore row floorcount 23 103 colmore row ...,"103 colmore row is located on colmore row, bir..."
1,103 colmore row floorcount 23 103 colmore row ...,"103 colmore row, birmingham, england was desig..."
2,103 colmore row floorcount 23 103 colmore row ...,"103 colmore row, completed in 1976 with 23 flo..."
3,103 colmore row floorcount 23 103 colmore row ...,"john madin, born in birmingham, was the archit..."
4,103 colmore row floorcount 23 103 colmore row ...,"designed by, birmingham born, architect, john ..."
5,103 colmore row floorcount 23 103 colmore row ...,architect john madin (born in birmingham) desi...
6,108 st georges terrace location perth perth co...,"108 st georges terrace in perth, australia, ha..."
7,108 st georges terrace location perth perth co...,"108 st. georges terrace, with 50 floors, is lo..."
8,108 st georges terrace location perth perth co...,108 st georges terrace was completed in 1988 i...
9,11 diagonal street location south africa south...,south africa's ethnic groups include white sou...


In [None]:
# comparison between the raw rdf triples and the rdf triples after pre-processing

train_comparation = pd.DataFrame(columns=['rdf_triple_raw', 'rdf_triple_clean'])

train_comparation.rdf_triple_raw = train_raw.rdf_triple.values
train_comparation.rdf_triple_clean = train.rdf_triple.values

data_table.DataTable(train_comparation)

In [None]:
# data pre-processing on val set

val = val_raw.copy()
val['rdf_triple'] = val['rdf_triple'].str.lower()
val['ref_text'] = val['ref_text'].str.lower()
val['rdf_triple'] = val['rdf_triple'].str.replace(r'[^\w]|_',' ', regex=True)
val['rdf_triple'] = val['rdf_triple'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))    
val.replace('\n', '', regex=True, inplace=True)
val.replace('\r', '', regex=True, inplace=True)

  


In [None]:
%cd /content/

!mkdir data_lstm
!mkdir data_lstm/model
!mkdir data_lstm/loaded_model

/content


Create all the files that the OpenNMT LSTM model needs for training: one source file (RDF triples) and one target file (reference texts) for each of the train and validation sets.

In [None]:
# create src-train.txt from train['rdf_triple']
np.savetxt(r'/content/data_lstm/src-train.txt', train['rdf_triple'].values, fmt='%s', delimiter='\t')

In [None]:
# create src-val.txt from val['rdf_triple']
np.savetxt(r'/content/data_lstm/src-val.txt', val['rdf_triple'].values, fmt='%s', delimiter='\t')

In [None]:
# create tgt-train.txt from train['ref_text']
np.savetxt(r'/content/data_lstm/tgt-train.txt', train['ref_text'].values, fmt='%s', delimiter='\t')

In [None]:
# create tgt-val.txt from val['ref_text']
np.savetxt(r'/content/data_lstm/tgt-val.txt', val['ref_text'].values, fmt='%s', delimiter='\t')

In [None]:
# build the references for the Ter metric: each line ends with a segment id such as "(id0)"
val2 = val.copy()
val2['ref_text'] =  val2['ref_text'] + " (id" + val2.index.astype(str) + ")"
val2['ref_text'] = val2['ref_text'].str.strip()
val2['rdf_triple']= val2['rdf_triple'].str.strip()

In [None]:
np.savetxt(r'/content/data_lstm/tgt-val-ter.txt', val2['ref_text'].values, fmt='%s', delimiter='\t')
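Since the source and target files are read line by line as parallel corpora, a quick line-count check can catch misalignment before training. The following is a minimal sketch; `count_lines` and `check_parallel` are hypothetical helpers, and the demo runs on temporary files so that it stays self-contained.

```python
import os
import tempfile


def count_lines(path: str) -> int:
    """Number of lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(src_path: str, tgt_path: str) -> bool:
    """True when the source and target files have the same number of lines."""
    return count_lines(src_path) == count_lines(tgt_path)


# Demo on temporary files; in the notebook one would pass, for example,
# '/content/data_lstm/src-train.txt' and '/content/data_lstm/tgt-train.txt'.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.txt")
    tgt = os.path.join(tmp, "tgt.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("triple one\ntriple two\n")
    with open(tgt, "w", encoding="utf-8") as f:
        f.write("text one\ntext two\n")
    demo_ok = check_parallel(src, tgt)
print(demo_ok)
```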

## Setting parameters for LSTM Model

In [None]:
# LSTM model architecture

import yaml

data = {
    # Where the samples will be written
    'save_data': '/content/data_lstm/model/',
    # Where the vocab(s) will be written
    'src_vocab': '/content/data_lstm/example.vocab.src',
    'tgt_vocab': '/content/data_lstm/example.vocab.tgt',
    # Prevent overwriting existing files in the folder
    'overwrite': False,
    # Corpus opts:
    'data': {
        'corpus_1': {
            'path_src': '/content/data_lstm/src-train.txt',
            'path_tgt': '/content/data_lstm/tgt-train.txt',
        },
        'valid': {
            'path_src': '/content/data_lstm/src-val.txt',
            'path_tgt': '/content/data_lstm/tgt-val.txt',
        },
    },
    # Train on a single GPU
    'world_size': 1,
    'gpu_ranks': [0],
    # Where to save the checkpoints
    'save_model': '/content/data_lstm/model/',
    'save_checkpoint_steps': 5000,
    'train_steps': 35000,
    'valid_steps': 5000,
    'seed': 1234,
}

with open("/content/data_lstm/data.yaml", "w") as file:
    yaml.dump(data, file, default_flow_style=None)


In [None]:
# build vocab
!onmt_build_vocab -config /content/data_lstm/data.yaml -n_sample 10000 

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2022-07-01 08:01:46,265 INFO] Counter vocab from 10000 samples.
[2022-07-01 08:01:46,265 INFO] Build vocab on 10000 transformed examples/corpus.
[2022-07-01 08:01:46,276 INFO] corpus_1's transforms: TransformPipe()
[2022-07-01 08:01:46,477 INFO] Counters src:3583
[2022-07-01 08:01:46,477 INFO] Counters tgt:9726


### Train LSTM Model

In [None]:
# train default openNMT model
!onmt_train -config /content/data_lstm/data.yaml

[2022-03-04 14:22:39,504 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2022-03-04 14:22:39,504 INFO] Missing transforms field for valid data, set to default: [].
[2022-03-04 14:22:39,504 INFO] Parsed 2 corpora from -data.
[2022-03-04 14:22:39,504 INFO] Get special vocabs from Transforms: {'src': set(), 'tgt': set()}.
[2022-03-04 14:22:39,504 INFO] Loading vocab from text file...
[2022-03-04 14:22:39,504 INFO] Loading src vocabulary from /content/example.vocab.src
[2022-03-04 14:22:39,539 INFO] Loaded src vocab has 3534 tokens.
[2022-03-04 14:22:39,541 INFO] Loading tgt vocabulary from /content/example.vocab.tgt
[2022-03-04 14:22:39,558 INFO] Loaded tgt vocab has 8072 tokens.
[2022-03-04 14:22:39,563 INFO] Building fields with vocab in counters...
[2022-03-04 14:22:39,573 INFO]  * tgt vocab size: 8076.
[2022-03-04 14:22:39,577 INFO]  * src vocab size: 3536.
[2022-03-04 14:22:39,578 INFO]  * src vocab size = 3536
[2022-03-04 14:22:39,578 INFO]  * tgt vocab size =

In [None]:
# load/import trained model checkpoint
shutil.copyfile(src = model_lstm, dst = '/content/data_lstm/loaded_model/lstm_model.pt')

'/content/data_lstm/loaded_model/lstm_model.pt'

In [None]:
# make prediction file
!onmt_translate -model /content/data_lstm/loaded_model/lstm_model.pt -src /content/data_lstm/src-val.txt -output /content/data_lstm/pred.txt -gpu 0 -verbose --replace_unk

#### Evaluation Metrics: LSTM

##### Bleu

In [None]:
# if you didn't import bleu metric, please run this code

#!wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl

In [None]:
# compute bleu score
bleu = !perl multi-bleu.perl /content/data_lstm/tgt-val.txt < /content/data_lstm/pred.txt
bleu[0]

'BLEU = 32.28, 65.6/41.4/27.6/19.0 (BP=0.935, ratio=0.937, hyp_len=33793, ref_len=36058)'
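To collect the score programmatically, the summary line can be parsed with a regular expression. This is a sketch under the assumption that the line keeps the layout shown in the output above; `parse_bleu` is a hypothetical helper, not part of `multi-bleu.perl`.

```python
import re


def parse_bleu(line: str) -> dict:
    """Extract the corpus BLEU score and the 1- to 4-gram precisions."""
    match = re.search(r"BLEU = ([\d.]+), ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)", line)
    if match is None:
        raise ValueError(f"unexpected BLEU output: {line!r}")
    values = [float(v) for v in match.groups()]
    return {"bleu": values[0], "precisions": values[1:]}


line = "BLEU = 32.28, 65.6/41.4/27.6/19.0 (BP=0.935, ratio=0.937, hyp_len=33793, ref_len=36058)"
parsed_bleu = parse_bleu(line)
print(parsed_bleu)
```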

In [None]:
%cd /content/

/content


##### Meteor

In [None]:
# if you didn't import meteor metric before, please run this code
'''
%%capture 
import shutil

source_dir = meteor_path
destination_dir = r"/content/meteor-1.5"
shutil.copytree(source_dir, destination_dir)
'''

'/content/meteor-1.5'

In [None]:
%cd /content/meteor-1.5/

/content/meteor-1.5


In [None]:
# compute meteor metric
meteor = !java -Xmx2G -jar meteor-1.5.jar /content/data_lstm/pred.txt /content/data_lstm/tgt-val.txt -l en -norm 
meteor[-1]

'Final score:            0.3514187841859228'

##### Ter

In [None]:
%cd ..

/content


In [None]:
# make prediction df (one prediction per line; reading the file directly keeps each line whole)
with open('/content/data_lstm/pred.txt', 'r') as f:
  df = pd.DataFrame({'text': [line.strip() for line in f]})
len(df)

1505

In [None]:
data2 = df.copy()
data2['text'] =  data2['text'] + " (id" + data2.index.astype(str) + ")"

In [None]:
np.savetxt(r'/content/data_lstm/pred-ter.txt', data2.values, fmt='%s', delimiter='\t')

In [None]:
# import Ter metric

import shutil
source_dir = ter_path
destination_dir = r"/content/tercom-0.7.25"
shutil.copytree(source_dir, destination_dir)

'/content/tercom-0.7.25'

In [None]:
%cd /content/tercom-0.7.25/

/content/tercom-0.7.25


In [None]:
# compute Ter metric

ter = !java -jar tercom.7.25.jar -h /content/data_lstm/pred-ter.txt -r /content/data_lstm/tgt-val-ter.txt
ter[-4]

'Total TER: 0.5764046813467192 (20784.0/36058.0)'
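The summary line reports the score together with the edit count and the reference word count, so the ratio can be checked directly. A minimal sketch, assuming the line keeps the layout shown above; `parse_ter` is a hypothetical helper, not part of tercom.

```python
import re


def parse_ter(line: str) -> tuple:
    """Return (score, edits, reference_words) from the tercom summary line."""
    match = re.search(r"Total TER: ([\d.]+) \(([\d.]+)/([\d.]+)\)", line)
    if match is None:
        raise ValueError(f"unexpected TER output: {line!r}")
    score, edits, words = (float(group) for group in match.groups())
    return score, edits, words


ter_score, ter_edits, ter_words = parse_ter("Total TER: 0.5764046813467192 (20784.0/36058.0)")
# TER is the number of edits divided by the number of reference words
print(ter_score, ter_edits / ter_words)
```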

##### Rouge

In [None]:
# if you didn't import and install rouge metric before, please run this code
'''
# import and install rouge metric

%cd /content/
!git clone https://github.com/pltrdy/rouge.git
%cd rouge
!python setup.py install
'''

In [None]:
%cd /content/rouge

/content/rouge


In [None]:
# compute rouge metric

from rouge import FilesRouge

hyp_path = r'/content/data_lstm/pred.txt'

ref_path= r'/content/data_lstm/tgt-val.txt'

files_rouge = FilesRouge()
scores = files_rouge.get_scores(hyp_path, ref_path, avg=True)
scores

{'rouge-1': {'f': 0.7065518783123586,
  'p': 0.7487513239703897,
  'r': 0.6777254092771339},
 'rouge-2': {'f': 0.4458367334933613,
  'p': 0.46745426600310497,
  'r': 0.4334203397838888},
 'rouge-l': {'f': 0.620296665519918,
  'p': 0.6572103422282312,
  'r': 0.5951460174455543}}
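For a side-by-side comparison of models, the nested score dictionary can be flattened into a single row. This is only a sketch: `flatten_scores` is a hypothetical helper, and the values are rounded from the output above.

```python
def flatten_scores(scores: dict) -> dict:
    """Turn {'rouge-1': {'f': ...}} into {'rouge-1_f': ...}."""
    return {
        f"{metric}_{key}": value
        for metric, parts in scores.items()
        for key, value in parts.items()
    }


# Values rounded from the ROUGE output above.
rouge_scores = {
    "rouge-1": {"f": 0.7066, "p": 0.7488, "r": 0.6777},
    "rouge-2": {"f": 0.4458, "p": 0.4675, "r": 0.4334},
    "rouge-l": {"f": 0.6203, "p": 0.6572, "r": 0.5951},
}
flat = flatten_scores(rouge_scores)
print(flat)
```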

##### Bert Score

In [None]:
%cd ..

/content


In [None]:
# create the list of reference texts
with open("/content/data_lstm/tgt-val.txt", "r") as a_file:
  ref = [line.strip() for line in a_file]

In [None]:
# create the list of predicted texts
with open("/content/data_lstm/pred.txt", "r") as a_file:
  hyp = [line.strip() for line in a_file]

In [None]:
# compute bert score metric

def bert_score_(references, hypothesis, lng='en'):
    from bert_score import score
    # NOTE: each item of `references` is treated as a collection of reference texts.
    # If an item is a plain string, this comprehension iterates over its characters,
    # so pass a list of strings per hypothesis to keep each reference whole.
    for i, refs in enumerate(references):
        references[i] = [ref for ref in refs if ref.strip() != '']
    try:
        P, R, F1 = score(hypothesis, references, lang=lng)
        P, R, F1 = list(P), list(R), list(F1)
        F1 = float(sum(F1) / len(F1))
        P = float(sum(P) / len(P))
        R = float(sum(R) / len(R))
    except Exception:
        # scoring failed: fall back to zeros
        P, R, F1 = 0, 0, 0
    return P, R, F1
 
bert_score_(references=ref,hypothesis=hyp, lng='en' )

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


(0.7871654033660889, 0.9998068809509277, 0.8806986808776855)
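A note on input shapes, as a hedged sketch: iterating over a plain Python string yields its characters, so a per-reference filter only keeps each text whole when every item is a list of strings. The helper `as_reference_lists` below is hypothetical (it is not part of the notebook or of the bert-score package) and wraps single strings in a list before filtering out empty references.

```python
from typing import List, Union


def as_reference_lists(references: List[Union[str, List[str]]]) -> List[List[str]]:
    """Wrap single reference strings in a list and drop empty references."""
    wrapped = []
    for refs in references:
        if isinstance(refs, str):
            refs = [refs]
        wrapped.append([ref for ref in refs if ref.strip() != ""])
    return wrapped


refs = as_reference_lists(["alan shepard was born in new hampshire", ["text a", " "]])
print(refs)

# Filtering a bare string instead splits it into characters:
chars = [c for c in "ab c" if c.strip() != ""]
print(chars)
```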

#### Model Results

In [None]:
# create a sample from the validation set
%cd /content/
val_sample = val.sample(n=10, random_state = 1234).reset_index(drop=True)
val_sample

/content


Unnamed: 0,rdf_triple,ref_text
0,alan shepard status deceased alan shepard alma...,"alan shepard was born november 18th, 1923 in n..."
1,acharya institute technology city bangalore ac...,the acharya institute of technology is located...
2,adisham hall location sri lanka adisham hall a...,"adisham hall, sri lanka, was started in 1927 a..."
3,3arena owner live nation entertainment dublin ...,"3arena in dublin, republic of ireland is owned..."
4,c cesena manager massimo drago massimo drago c...,massimo drago played for s.s.d. potenza calcio...
5,fortress grey ice author j v jones fortress gr...,a fortress of grey ice was written by j. v. jo...
6,elliot see almamater university texas austin u...,"elliot see, who was born in dallas, has died i..."
7,auburn washington ispartof pierce county washi...,"auburn is located in pierce county, washington..."
8,austin texas areaofland 686 0 square kilometre...,"the area of austin, texas is 703.95km2 and the..."
9,alan b miller hall location virginia alan b mi...,the mason school of business is located in the...


In [None]:
# create src-val_sample.txt from val_sample['rdf_triple']
np.savetxt(r'/content/data_lstm/src-val_sample.txt', val_sample['rdf_triple'].values, fmt='%s', delimiter='\t')

In [None]:
# make prediction file on val_sample
!onmt_translate -model /content/data_lstm/loaded_model/lstm_model.pt -src /content/data_lstm/src-val_sample.txt -output /content/data_lstm/pred_sample.txt -gpu 0 -verbose --replace_unk

[2022-07-01 08:15:21,431 INFO] Translating shard 0.
  self._batch_index = self.topk_ids // vocab_size
[2022-07-01 08:15:21,508 INFO] 
SENT 1: ['alan', 'shepard', 'status', 'deceased', 'alan', 'shepard', 'almamater', 'nwc', '1957', 'alan', 'shepard', 'deathplace', 'california', 'alan', 'shepard', 'birthplace', 'new', 'hampshire', 'alan', 'shepard', 'selectedbynasa', '1959', 'alan', 'shepard', 'birthdate', '1923', '11', '18']
PRED 1: alan shepard, who was born in new hampshire on november 18th shepard graduated in 1957 with an m.a. he was selected by nasa in shepard and died in california.
PRED SCORE: -5.3511

[2022-07-01 08:15:21,508 INFO] 
SENT 2: ['acharya', 'institute', 'technology', 'city', 'bangalore', 'acharya', 'institute', 'technology', 'motto', 'nurturing', 'excellence', 'acharya', 'institute', 'technology', 'country', 'india']
PRED 2: the acharya institute of technology in bangalore, india has the motto "nurturing excellence.
PRED SCORE: -2.3164

[2022-07-01 08:15:21,508 INFO]

In [None]:
# read one prediction per line; read_fwf guesses column boundaries and can truncate sentences
with open('/content/data_lstm/pred_sample.txt') as f:
    df_lstm = pd.DataFrame({'text': [line.rstrip('\n') for line in f]})
df_lstm


Unnamed: 0,text
0,"alan shepard, who was born in new hampshire on..."
1,the acharya institute of technology in bangalo...
2,adisham hall in sri lanka was constructed in 1...
3,"3arena is located in dublin, republic of irela..."
4,massimo drago once played for the club ssd pot...
5,"a fortress of grey ice, written by j v jones, ..."
6,elliot see was born in dallas and died in st l...
7,"auburn is part of pierce county, washington in..."
8,"austin, texas covers 686.0 square kilometres, ..."
9,"located in the united states, the mason school..."


In [None]:
# build a DataFrame that pairs each RDF triple with the LSTM prediction
prediction_df_lstm = pd.DataFrame(columns=['rdf_triple', 'prediction_text'])
prediction_df_lstm.rdf_triple = val_sample.rdf_triple.values
prediction_df_lstm.prediction_text = df_lstm.text.values
prediction_df_lstm

Unnamed: 0,rdf_triple,prediction_text
0,alan shepard status deceased alan shepard alma...,"alan shepard, who was born in new hampshire on..."
1,acharya institute technology city bangalore ac...,the acharya institute of technology in bangalo...
2,adisham hall location sri lanka adisham hall a...,adisham hall in sri lanka was constructed in 1...
3,3arena owner live nation entertainment dublin ...,"3arena is located in dublin, republic of irela..."
4,c cesena manager massimo drago massimo drago c...,massimo drago once played for the club ssd pot...
5,fortress grey ice author j v jones fortress gr...,"a fortress of grey ice, written by j v jones, ..."
6,elliot see almamater university texas austin u...,elliot see was born in dallas and died in st l...
7,auburn washington ispartof pierce county washi...,"auburn is part of pierce county, washington in..."
8,austin texas areaofland 686 0 square kilometre...,"austin, texas covers 686.0 square kilometres, ..."
9,alan b miller hall location virginia alan b mi...,"located in the united states, the mason school..."


In [None]:
# compare the reference texts of the validation sample with the LSTM predictions
text_comparation_lstm = pd.DataFrame(columns=['ref_text', 'prediction_text'])
text_comparation_lstm.ref_text = val_sample.ref_text.values
text_comparation_lstm.prediction_text = df_lstm.text.values
text_comparation_lstm

Unnamed: 0,ref_text,prediction_text
0,"alan shepard was born november 18th, 1923 in n...","alan shepard, who was born in new hampshire on..."
1,the acharya institute of technology is located...,the acharya institute of technology in bangalo...
2,"adisham hall, sri lanka, was started in 1927 a...",adisham hall in sri lanka was constructed in 1...
3,"3arena in dublin, republic of ireland is owned...","3arena is located in dublin, republic of irela..."
4,massimo drago played for s.s.d. potenza calcio...,massimo drago once played for the club ssd pot...
5,a fortress of grey ice was written by j. v. jo...,"a fortress of grey ice, written by j v jones, ..."
6,"elliot see, who was born in dallas, has died i...",elliot see was born in dallas and died in st l...
7,"auburn is located in pierce county, washington...","auburn is part of pierce county, washington in..."
8,"the area of austin, texas is 703.95km2 and the...","austin, texas covers 686.0 square kilometres, ..."
9,the mason school of business is located in the...,"located in the united states, the mason school..."


In [None]:
!pwd

/content


## Setting parameters for Transformer Model

In [None]:
%cd /content/

!mkdir data_transf
!mkdir data_transf/model
!mkdir data_transf/loaded_model

/content


In [None]:
# create src-train.txt from train['rdf_triple']
np.savetxt(r'/content/data_transf/src-train.txt', train['rdf_triple'].values, fmt='%s', delimiter='\t')

In [None]:
# create src-val.txt from val['rdf_triple']
np.savetxt(r'/content/data_transf/src-val.txt', val['rdf_triple'].values, fmt='%s', delimiter='\t')

In [None]:
# create tgt-train.txt from train['ref_text']
np.savetxt(r'/content/data_transf/tgt-train.txt', train['ref_text'].values, fmt='%s', delimiter='\t')

In [None]:
# create tgt-val.txt from val['ref_text']
np.savetxt(r'/content/data_transf/tgt-val.txt', val['ref_text'].values, fmt='%s', delimiter='\t')

In [None]:
# create the references for the TER metric (tgt-val-ter.txt): each line ends with a segment id in parentheses
val2 = val.copy()
val2['ref_text'] =  val2['ref_text'] + " (id" + val2.index.astype(str) + ")"
val2['ref_text'] = val2['ref_text'].str.strip()
val2['rdf_triple']= val2['rdf_triple'].str.strip()

In [None]:
np.savetxt(r'/content/data_transf/tgt-val-ter.txt', val2['ref_text'].values, fmt='%s', delimiter='\t')
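
The TER reference file tags every sentence with a trailing segment id, and the hypothesis file has to carry the same ids in the same order. A minimal sketch of that tagging on plain Python lists follows; `add_segment_ids` is a hypothetical helper, not part of the notebook, and unlike the cell above it strips whitespace before appending the id.

```python
def add_segment_ids(lines):
    """Append ' (id<N>)' to each stripped line, numbering segments from 0."""
    return [f"{line.strip()} (id{i})" for i, line in enumerate(lines)]


# Illustrative sentences only; the notebook applies the same idea to val['ref_text'].
references = add_segment_ids([
    "alan shepard was born in new hampshire. ",
    "the motto is nurturing excellence.",
])
for line in references:
    print(line)
```

Because the ids come from the row position, references and predictions line up only if both files keep the original row order.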

In [None]:
# Transformer model configuration for OpenNMT-py

import yaml

data = {
    # Where the samples will be written
    'save_data': '/content/data_transf/model/',
    # Where the vocab(s) will be written by onmt_build_vocab
    'src_vocab': '/content/data_transf/example.vocab.src',
    'tgt_vocab': '/content/data_transf/example.vocab.tgt',
    # Prevent overwriting existing files in the folder
    'overwrite': False,
    # Corpus opts:
    'data': {
        'corpus_1': {
            'path_src': '/content/data_transf/src-train.txt',
            'path_tgt': '/content/data_transf/tgt-train.txt',
        },
        'valid': {
            'path_src': '/content/data_transf/src-val.txt',
            'path_tgt': '/content/data_transf/tgt-val.txt',
        },
    },

    # Train on a single GPU
    'world_size': 1,
    'gpu_ranks': [0],

    # Where to save the checkpoints
    'save_model': '/content/data_transf/model/',
    'save_checkpoint_steps': 5000,
    'train_steps': 35000,
    'valid_steps': 5000,

    # Transformer architecture and training settings
    'decoder_type': 'transformer',
    'encoder_type': 'transformer',
    'word_vec_size': 512,
    'rnn_size': 512,
    'layers': 2,
    'transformer_ff': 2048,
    'heads': 4,
    'batch_size': 64,
    'batch_type': 'sents',
    'normalization': 'tokens',
    'dropout': 0.3,
    'label_smoothing': 0.1,
    'seed': 1234,
}

# write the configuration; the context manager closes the file even if dump fails
with open("/content/data_transf/data.yaml", "w") as file:
    yaml.dump(data, file, default_flow_style=None)
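
Before launching a long training run it can help to sanity-check the Transformer sizes in the configuration. The sketch below assumes the usual Transformer constraints (embedding size equal to the hidden size, hidden size divisible by the number of heads); `check_transformer_dims` is a hypothetical helper, and whether OpenNMT-py reports these problems in the same way is an assumption.

```python
def check_transformer_dims(cfg):
    """Return a list of human-readable problems found in the size settings."""
    problems = []
    if cfg["word_vec_size"] != cfg["rnn_size"]:
        problems.append("word_vec_size and rnn_size should match")
    if cfg["rnn_size"] % cfg["heads"] != 0:
        problems.append("rnn_size should be divisible by heads")
    return problems


# Values copied from the configuration cell above.
sizes = {"word_vec_size": 512, "rnn_size": 512, "heads": 4}
print(check_transformer_dims(sizes))   # empty list: no problems found
print(sizes["rnn_size"] // sizes["heads"], "dimensions per attention head")
```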



In [None]:
# build vocab
!onmt_build_vocab -config /content/data_transf/data.yaml -n_sample 10000

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2022-07-01 08:16:11,013 INFO] Counter vocab from 10000 samples.
[2022-07-01 08:16:11,013 INFO] Build vocab on 10000 transformed examples/corpus.
[2022-07-01 08:16:11,025 INFO] corpus_1's transforms: TransformPipe()
[2022-07-01 08:16:11,229 INFO] Counters src:3583
[2022-07-01 08:16:11,229 INFO] Counters tgt:9726


### Train Transformer Model

In [None]:
# train the OpenNMT Transformer model
!onmt_train -config /content/data_transf/data.yaml

[2022-03-10 14:14:03,847 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2022-03-10 14:14:03,847 INFO] Missing transforms field for valid data, set to default: [].
[2022-03-10 14:14:03,847 INFO] Parsed 2 corpora from -data.
[2022-03-10 14:14:03,847 INFO] Loading checkpoint from /content/drive/MyDrive/Colab_Notebooks/NO_PREPROCESSING_step_20000.pt
[2022-03-10 14:14:05,087 INFO] Loading fields from checkpoint...
[2022-03-10 14:14:05,087 INFO]  * src vocab size = 3605
[2022-03-10 14:14:05,087 INFO]  * tgt vocab size = 9226
[2022-03-10 14:14:05,092 INFO] Building model...
[2022-03-10 14:14:15,063 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(3605, 512, padding_idx=1)
        )
      )
    )
    (transformer): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=

In [None]:
# load/import trained model checkpoint
shutil.copyfile(src = model_transformer, dst = '/content/data_transf/loaded_model/transformer_model.pt')

'/content/data_transf/loaded_model/transformer_model.pt'

In [None]:
# make prediction file
!onmt_translate -model /content/data_transf/loaded_model/transformer_model.pt -src /content/data_transf/src-val.txt -output /content/data_transf/pred.txt -gpu 0 -verbose --replace_unk

#### Evaluation Metrics: Transformer

##### Bleu

In [None]:
#!wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl

In [None]:
bleu = !perl multi-bleu.perl /content/data_transf/tgt-val.txt < /content/data_transf/pred.txt
bleu[0]

'BLEU = 31.63, 64.7/42.2/28.5/19.8 (BP=0.898, ratio=0.903, hyp_len=32546, ref_len=36058)'
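
The line printed by `multi-bleu.perl` packs several numbers together: the BLEU score, the four n-gram precisions, the brevity penalty (BP) and the two lengths. The sketch below parses that line and recomputes BLEU as BP times the geometric mean of the precisions, which is the standard definition. `parse_multi_bleu` and `recompute_bleu` are hypothetical helpers written for this illustration, and the regular expression assumes the exact output format shown above.

```python
import math
import re

BLEU_LINE = re.compile(
    r"BLEU = ([\d.]+), ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) "
    r"\(BP=([\d.]+), ratio=([\d.]+), hyp_len=(\d+), ref_len=(\d+)\)"
)


def parse_multi_bleu(line):
    """Split a multi-bleu.perl result line into named numbers."""
    match = BLEU_LINE.match(line)
    if match is None:
        raise ValueError(f"unexpected multi-bleu output: {line!r}")
    values = match.groups()
    return {
        "bleu": float(values[0]),
        "precisions": [float(v) for v in values[1:5]],
        "bp": float(values[5]),
        "ratio": float(values[6]),
        "hyp_len": int(values[7]),
        "ref_len": int(values[8]),
    }


def recompute_bleu(precisions, hyp_len, ref_len):
    """BLEU = brevity penalty * geometric mean of the n-gram precisions."""
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return bp * geo_mean


line = "BLEU = 31.63, 64.7/42.2/28.5/19.8 (BP=0.898, ratio=0.903, hyp_len=32546, ref_len=36058)"
parsed = parse_multi_bleu(line)
print(parsed["bleu"], round(recompute_bleu(parsed["precisions"], parsed["hyp_len"], parsed["ref_len"]), 2))
```

The recomputed value matches the reported one only up to rounding, because the precisions in the line are already rounded to one decimal.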

In [None]:
%cd /content/

/content


##### Meteor

In [None]:
'''
%%capture 
import shutil

source_dir = meteor_path
destination_dir = r"/content/meteor-1.5"
shutil.copytree(source_dir, destination_dir)
'''

'/content/meteor-1.5'

In [None]:
%cd /content/meteor-1.5/

/content/meteor-1.5


In [None]:
meteor = !java -Xmx2G -jar meteor-1.5.jar /content/data_transf/pred.txt /content/data_transf/tgt-val.txt -l en -norm 
meteor[-1]

'Final score:            0.3414614578882734'

##### Ter

In [None]:
%cd ..

/content


In [None]:
# read one prediction per line; read_fwf guesses column boundaries and can truncate sentences
with open('/content/data_transf/pred.txt') as f:
    df = pd.DataFrame({'text': [line.rstrip('\n') for line in f]})
len(df)

1505

In [None]:
data2 = df.copy()
# append the segment id that tercom expects at the end of every hypothesis
data2['text'] = data2['text'] + " (id" + data2.index.astype(str) + ")"
data2['text'] = data2['text'].str.replace('\n', '', regex=False)


In [None]:
np.savetxt(r'/content/data_transf/pred-ter.txt', data2['text'].values, fmt='%s', delimiter='\t')

In [None]:
'''
# import Ter metric

import shutil
source_dir = ter_path
destination_dir = r"/content/tercom-0.7.25"
shutil.copytree(source_dir, destination_dir)

'''

'/content/tercom-0.7.25'

In [None]:
%cd /content/tercom-0.7.25/

/content/tercom-0.7.25


In [None]:
ter = !java -jar tercom.7.25.jar -h /content/data_transf/pred-ter.txt -r /content/data_transf/tgt-val-ter.txt
ter[-4]

'Total TER: 0.6112374507737534 (22040.0/36058.0)'
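
The summary line reports the score together with the two counts it is derived from: the number of edits and the number of reference words. A small sketch, assuming the line keeps exactly the format shown above, parses it and checks that the score is the ratio of the two counts; `parse_total_ter` is a hypothetical helper.

```python
import re


def parse_total_ter(line):
    """Return (score, edits, reference_words) from a 'Total TER' summary line."""
    match = re.match(r"Total TER: ([\d.]+) \(([\d.]+)/([\d.]+)\)", line)
    if match is None:
        raise ValueError(f"unexpected tercom output: {line!r}")
    score, edits, ref_words = (float(v) for v in match.groups())
    return score, edits, ref_words


score, edits, ref_words = parse_total_ter("Total TER: 0.6112374507737534 (22040.0/36058.0)")
print(score, edits / ref_words)
```

Lower is better for TER: a value of about 0.61 means roughly 61 edits per 100 reference words.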

##### Rouge

In [None]:
'''
# import and install rouge metric

!git clone https://github.com/pltrdy/rouge.git
%cd rouge
!python setup.py install
'''

In [None]:
%cd /content/rouge


/content/rouge


In [None]:
from rouge import FilesRouge

hyp_path = r'/content/data_transf/pred.txt'

ref_path= r'/content/data_transf/tgt-val.txt'

files_rouge = FilesRouge()
scores = files_rouge.get_scores(hyp_path, ref_path, avg=True)
scores

{'rouge-1': {'f': 0.690924071342746,
  'p': 0.7416890846212997,
  'r': 0.6618280855251097},
 'rouge-2': {'f': 0.44234641068604197,
  'p': 0.483639479711409,
  'r': 0.4224113878595206},
 'rouge-l': {'f': 0.621349609702111,
  'p': 0.6665017949623963,
  'r': 0.5955416798983787}}
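
To make the `p`, `r` and `f` entries concrete, here is a simplified ROUGE-1 on whitespace tokens: precision is the share of hypothesis words found in the reference, recall is the share of reference words found in the hypothesis, and F is their harmonic mean. This is only a sketch; the `rouge` package may tokenize and count repeated words differently, so its numbers need not match. The averaged `f` above is a mean of per-sentence F values, which is why it is not the harmonic mean of the averaged `p` and `r`.

```python
from collections import Counter


def rouge_1(hypothesis, reference):
    """Unigram-overlap precision, recall and F1 for one sentence pair."""
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    overlap = sum((hyp_counts & ref_counts).values())  # clipped unigram matches
    precision = overlap / sum(hyp_counts.values()) if hyp_counts else 0.0
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    if precision + recall == 0:
        return {"p": 0.0, "r": 0.0, "f": 0.0}
    f1 = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}


print(rouge_1("elliot see was born in dallas", "elliot see who was born in dallas has died"))
```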

##### Bert Score

In [None]:
%cd ..

/content


In [None]:
#!pip install bert-score


In [None]:
# read the reference texts, one per line
with open("/content/data_transf/tgt-val.txt", "r") as a_file:
    ref = [line.strip() for line in a_file]

In [None]:
# read the predicted texts, one per line
with open("/content/data_transf/pred.txt", "r") as a_file:
    hyp = [line.strip() for line in a_file]

In [None]:
from bert_score import score

def bert_score_(references, hypothesis, lng='en'):
    # bert_score accepts, for each hypothesis, one reference string or a list of them.
    # Wrap plain strings so they are scored whole instead of being split into characters.
    cleaned = []
    for refs in references:
        if isinstance(refs, str):
            refs = [refs]
        cleaned.append([ref for ref in refs if ref.strip() != ''])
    try:
        P, R, F1 = score(hypothesis, cleaned, lang=lng)
        # average the per-sentence scores
        P, R, F1 = float(P.mean()), float(R.mean()), float(F1.mean())
    except Exception as err:
        # report the failure instead of silently returning zeros
        print(f'BERTScore failed: {err}')
        P, R, F1 = 0, 0, 0
    return P, R, F1

bert_score_(references=ref, hypothesis=hyp, lng='en')

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


(0.7874484062194824, 0.9998143315315247, 0.8808344006538391)

#### Model Results

In [None]:
%cd /content/
val_sample

/content


Unnamed: 0,rdf_triple,ref_text
0,alan shepard status deceased alan shepard alma...,"alan shepard was born november 18th, 1923 in n..."
1,acharya institute technology city bangalore ac...,the acharya institute of technology is located...
2,adisham hall location sri lanka adisham hall a...,"adisham hall, sri lanka, was started in 1927 a..."
3,3arena owner live nation entertainment dublin ...,"3arena in dublin, republic of ireland is owned..."
4,c cesena manager massimo drago massimo drago c...,massimo drago played for s.s.d. potenza calcio...
5,fortress grey ice author j v jones fortress gr...,a fortress of grey ice was written by j. v. jo...
6,elliot see almamater university texas austin u...,"elliot see, who was born in dallas, has died i..."
7,auburn washington ispartof pierce county washi...,"auburn is located in pierce county, washington..."
8,austin texas areaofland 686 0 square kilometre...,"the area of austin, texas is 703.95km2 and the..."
9,alan b miller hall location virginia alan b mi...,the mason school of business is located in the...


In [None]:
# create src-val_sample.txt from val_sample['rdf_triple']

np.savetxt(r'/content/data_transf/src-val_sample.txt', val_sample['rdf_triple'].values, fmt='%s', delimiter='\t')

In [None]:
# make prediction file on val_sample
!onmt_translate -model /content/data_transf/loaded_model/transformer_model.pt -src /content/data_transf/src-val_sample.txt -output /content/data_transf/pred_sample.txt -gpu 0 -verbose --replace_unk

[2022-07-01 08:19:06,996 INFO] Translating shard 0.
  self._batch_index = self.topk_ids // vocab_size
[2022-07-01 08:19:07,131 INFO] 
SENT 1: ['alan', 'shepard', 'status', 'deceased', 'alan', 'shepard', 'almamater', 'nwc', '1957', 'alan', 'shepard', 'deathplace', 'california', 'alan', 'shepard', 'birthplace', 'new', 'hampshire', 'alan', 'shepard', 'selectedbynasa', '1959', 'alan', 'shepard', 'birthdate', '1923', '11', '18']
PRED 1: alan shepard was born on nov 18, 1923 in new hampshire and graduated from nwc with an m.a. in 1957. he was selected by nasa in 1959 and died in california.
PRED SCORE: -7.8036

[2022-07-01 08:19:07,132 INFO] 
SENT 2: ['acharya', 'institute', 'technology', 'city', 'bangalore', 'acharya', 'institute', 'technology', 'motto', 'nurturing', 'excellence', 'acharya', 'institute', 'technology', 'country', 'india']
PRED 2: acharya institute of technology is located in bangalore, india. its motto is "nurturing excellence".
PRED SCORE: -5.8219

[2022-07-01 08:19:07,132 

In [None]:
# read one prediction per line; read_fwf guesses column boundaries and can truncate sentences
with open('/content/data_transf/pred_sample.txt') as f:
    df_tr = pd.DataFrame({'text': [line.rstrip('\n') for line in f]})
df_tr

Unnamed: 0,text
0,"alan shepard was born on nov 18, 1923 in new h..."
1,acharya institute of technology is located in ...
2,adisham hall was finished in 1931 and is locat...
3,"3arena is located in dublin, which is owned by..."
4,"massimo drago, played for s.s.d. potenza calci..."
5,a fortress of grey ice was written by j. v. jo...
6,elliot see was born in dallas and attended the...
7,"auburn is part of pierce county, washington, i..."
8,"austin, texas covers an area of 703.95 square ..."
9,the mason school of business are the current t...


In [None]:
prediction_df_tr = pd.DataFrame(columns=['rdf_triple', 'prediction_text'] )
prediction_df_tr.rdf_triple = val_sample.rdf_triple.values
prediction_df_tr.prediction_text = df_tr.text.values
prediction_df_tr

Unnamed: 0,rdf_triple,prediction_text
0,alan shepard status deceased alan shepard alma...,"alan shepard was born on nov 18, 1923 in new h..."
1,acharya institute technology city bangalore ac...,acharya institute of technology is located in ...
2,adisham hall location sri lanka adisham hall a...,adisham hall was finished in 1931 and is locat...
3,3arena owner live nation entertainment dublin ...,"3arena is located in dublin, which is owned by..."
4,c cesena manager massimo drago massimo drago c...,"massimo drago, played for s.s.d. potenza calci..."
5,fortress grey ice author j v jones fortress gr...,a fortress of grey ice was written by j. v. jo...
6,elliot see almamater university texas austin u...,elliot see was born in dallas and attended the...
7,auburn washington ispartof pierce county washi...,"auburn is part of pierce county, washington, i..."
8,austin texas areaofland 686 0 square kilometre...,"austin, texas covers an area of 703.95 square ..."
9,alan b miller hall location virginia alan b mi...,the mason school of business are the current t...


In [None]:
text_comparation_tr = pd.DataFrame(columns=['ref_text', 'prediction_text'] )
text_comparation_tr.ref_text = val_sample.ref_text.values
text_comparation_tr.prediction_text = df_tr.text.values
text_comparation_tr

Unnamed: 0,ref_text,prediction_text
0,"alan shepard was born november 18th, 1923 in n...","alan shepard was born on nov 18, 1923 in new h..."
1,the acharya institute of technology is located...,acharya institute of technology is located in ...
2,"adisham hall, sri lanka, was started in 1927 a...",adisham hall was finished in 1931 and is locat...
3,"3arena in dublin, republic of ireland is owned...","3arena is located in dublin, which is owned by..."
4,massimo drago played for s.s.d. potenza calcio...,"massimo drago, played for s.s.d. potenza calci..."
5,a fortress of grey ice was written by j. v. jo...,a fortress of grey ice was written by j. v. jo...
6,"elliot see, who was born in dallas, has died i...",elliot see was born in dallas and attended the...
7,"auburn is located in pierce county, washington...","auburn is part of pierce county, washington, i..."
8,"the area of austin, texas is 703.95km2 and the...","austin, texas covers an area of 703.95 square ..."
9,the mason school of business is located in the...,the mason school of business are the current t...


In [None]:
!pwd

/content


# UNSEEN

## Dataset Creation

In [None]:
# import webnlg dataset from official repository 
!git clone https://gitlab.com/shimorina/webnlg-dataset.git

Cloning into 'webnlg-dataset'...
remote: Enumerating objects: 5112, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 5112 (delta 2), reused 0 (delta 0), pack-reused 5106
Receiving objects: 100% (5112/5112), 26.09 MiB | 19.01 MiB/s, done.
Resolving deltas: 100% (4010/4010), done.
Checking out files: 100% (1425/1425), done.


In [None]:
# imports needed by the dataset-creation cells below
import glob
import os
import re
import xml.etree.ElementTree as ET

In [None]:
lista2 = ['Airport', 'Astronaut', 'Building', 'City', 'ComicsCharacter', 'Monument', 'SportsTeam', 'University', 'WrittenWork', 'Athlete', 'Food', 'CelestialBody', 'MeanOfTransportation', 'Politician', 'Company','Airport_allSolutions', 'Astronaut_allSolutions', 'Building_allSolutions', 'City_allSolutions', 'ComicsCharacter_allSolutions', 'Monument_allSolutions', 'SportsTeam_allSolutions', 'University_allSolutions', 'WrittenWork_allSolutions', 'Athlete_allSolutions', 'Food_allSolutions', 'CelestialBody_allSolutions', 'MeanOfTransportation_allSolutions', 'Politician_allSolutions', 'Company_allSolutions']

In [None]:
#TRAIN
train_u=pd.DataFrame(columns=['rdf_triple','ref_text' ])

for dominio in lista2:
  xml = glob.glob("/content/webnlg-dataset/release_v3.0/en/train/**/" + str(dominio) + ".xml", recursive=True)
  n_tripla=re.compile(r'(\d)triples')
  dizionario={}
  for file in xml:
      parsing_xml = ET.parse(file)
      xml_path = parsing_xml.getroot()
      categoria_tripla=int(n_tripla.findall(file)[0])
      for sub_path in xml_path:
          for ss_path in sub_path:
              lista_tripla=[]
              lista_text=[]
              for entry in ss_path:
                  lista_text.append(entry.text)
                  strutured=[triple.text for triple in entry]
                  lista_tripla.extend(strutured)
              lista_text=[i for i in lista_text if i.replace('\n','').strip()!='' ]
              lista_tripla=lista_tripla[-categoria_tripla:]
              lista_tripla_str=(' && ').join(lista_tripla)
              dizionario[lista_tripla_str]=lista_text
  diz={ "rdf_triple":[], "ref_text":[]}
  for st,unst in dizionario.items():
      for i in unst:
          diz['rdf_triple'].append(st)
          diz['ref_text'].append(i)
  df=pd.DataFrame(diz)
  train_u = pd.concat([train_u, df])
train_u.head(10)

Unnamed: 0,rdf_triple,ref_text
0,"Aarhus_Airport | cityServed | ""Aarhus, Denmark...","Aarhus, Denmark is served by Aarhus airport op..."
1,"Aarhus_Airport | cityServed | ""Aarhus, Denmark...",The city of Aarhus in Denmark is served by an ...
2,"Aarhus_Airport | cityServed | ""Aarhus, Denmark...",Aarhus Airport in Denmark is operated by Aarhu...
3,"Aarhus_Airport | cityServed | ""Aarhus, Denmark...",Aarhus Lufthavn A/S is the operating organisat...
4,"Aarhus_Airport | cityServed | ""Aarhus, Denmark...",Aarhus Airport is operated by Aarhus Lufthavn ...
5,Aarhus_Airport | location | Tirstrup && Tirstr...,Lars Lokke Rasmussen is the leader of Denmark ...
6,Aarhus_Airport | location | Tirstrup && Tirstr...,"Aarhus Airport is located in Tirstrup, which i..."
7,Aarhus_Airport | location | Tirstrup && Tirstr...,"Aarhus Airport is located in Tirstrup, which i..."
8,Aarhus_Airport | location | Tirstrup && Tirstr...,Margrethe II is the Queen of Denmark where the...
9,Aarhus_Airport | location | Tirstrup && Tirstr...,"Aarhus Airport is located in Tirstrup, in the ..."
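
The loop above walks the XML tree by position. As a hedged alternative, the sketch below selects elements by tag name, assuming the WebNLG layout in which each `entry` holds a `modifiedtripleset` of `mtriple` elements followed by one `lex` element per verbalisation. The XML snippet and both sentences are made up for illustration, and `entry_pairs` is a hypothetical helper, not the notebook's code.

```python
import xml.etree.ElementTree as ET

# Made-up miniature file in the assumed WebNLG layout.
SAMPLE_XML = """<benchmark><entries>
<entry category="Airport" eid="Id1" size="1">
  <originaltripleset>
    <otriple>Aarhus_Airport | cityServed | "Aarhus, Denmark"@en</otriple>
  </originaltripleset>
  <modifiedtripleset>
    <mtriple>Aarhus_Airport | cityServed | "Aarhus, Denmark"</mtriple>
  </modifiedtripleset>
  <lex comment="good" lid="Id1">Aarhus Airport serves the city of Aarhus, Denmark.</lex>
  <lex comment="good" lid="Id2">The city of Aarhus, Denmark is served by Aarhus Airport.</lex>
</entry>
</entries></benchmark>"""


def entry_pairs(xml_text):
    """Return one (joined triples, reference text) pair per lex element."""
    root = ET.fromstring(xml_text)
    pairs = []
    for entry in root.iter("entry"):
        # join all modified triples of the entry with the same ' && ' separator
        source = " && ".join(t.text for t in entry.iter("mtriple"))
        for lex in entry.iter("lex"):
            if lex.text and lex.text.strip():
                pairs.append((source, lex.text.strip()))
    return pairs


for source, text in entry_pairs(SAMPLE_XML):
    print(source, "->", text)
```

Selecting by tag avoids relying on the number of triples encoded in the file path, at the cost of assuming the tag names.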


In [None]:
lista = ['Artist', 'Artist_allSolutions']

In [None]:
#TRAIN-artist
df_=pd.DataFrame(columns=['rdf_triple','ref_text' ])
for dominio in lista:
  xml = glob.glob("/content/webnlg-dataset/release_v3.0/en/train/**/" + str(dominio) + ".xml", recursive=True)
  n_tripla=re.compile(r'(\d)triples')
  dizionario={}
  for file in xml:
      parsing_xml = ET.parse(file)
      xml_path = parsing_xml.getroot()
      categoria_tripla=int(n_tripla.findall(file)[0])
      for sub_path in xml_path:
          for ss_path in sub_path:
              lista_tripla=[]
              lista_text=[]
              for entry in ss_path:
                  lista_text.append(entry.text)
                  strutured=[triple.text for triple in entry]
                  lista_tripla.extend(strutured)
              lista_text=[i for i in lista_text if i.replace('\n','').strip()!='' ]
              lista_tripla=lista_tripla[-categoria_tripla:]
              lista_tripla_str=(' && ').join(lista_tripla)
              dizionario[lista_tripla_str]=lista_text
  diz={ "rdf_triple":[], "ref_text":[]}
  for st,unst in dizionario.items():
      for i in unst:
          diz['rdf_triple'].append(st)
          diz['ref_text'].append(i)
  df=pd.DataFrame(diz)
  df_ = pd.concat([df_, df])


In [None]:
len(df_)

3398

In [None]:
lista2 = ['1triples', '2triples']

In [None]:
#Val
df__=pd.DataFrame(columns=['rdf_triple','ref_text' ])

for tripla in lista2:
  for dominio in lista:
    xml = glob.glob("/content/webnlg-dataset/release_v3.0/en/dev/" + str(tripla)+ "/"  + str(dominio) + ".xml", recursive=True)
    n_tripla=re.compile(r'(\d)triples')
    dizionario={}
    for file in xml:
        parsing_xml = ET.parse(file)
        xml_path = parsing_xml.getroot()
        categoria_tripla=int(n_tripla.findall(file)[0])
        for sub_path in xml_path:
            for ss_path in sub_path:
                lista_tripla=[]
                lista_text=[]
                for entry in ss_path:
                    lista_text.append(entry.text)
                    strutured=[triple.text for triple in entry]
                    lista_tripla.extend(strutured)
                lista_text=[i for i in lista_text if i.replace('\n','').strip()!='' ]
                lista_tripla=lista_tripla[-categoria_tripla:]
                lista_tripla_str=(' && ').join(lista_tripla)
                dizionario[lista_tripla_str]=lista_text
    diz={ "rdf_triple":[], "ref_text":[]}
    for st,unst in dizionario.items():
        for i in unst:
            diz['rdf_triple'].append(st)
            diz['ref_text'].append(i)
    df=pd.DataFrame(diz)
    df__ = pd.concat([df__, df])


In [None]:
len(df__)

163

In [None]:
# ignore_index=True already renumbers the rows from 0
val_u = pd.concat([df_, df__], ignore_index=True)
len(val_u)

3561

In [None]:
train_u.to_csv('/content/drive/MyDrive/rdf-to-text/dataset/Notebook/WebNLG/16-domains/UNSEEN/Entity-based/webnlg-train.csv')
val_u.to_csv('/content/drive/MyDrive/rdf-to-text/dataset/Notebook/WebNLG/16-domains/UNSEEN/Entity-based/webnlg-val.csv')

## Pre-Processing

In [None]:
# load the train and validation sets for the unseen model

train_raw_u = pd.read_csv(train_path_u)
val_raw_u = pd.read_csv(val_path_u)

train_raw_u.drop(columns=['Unnamed: 0'], inplace=True)
val_raw_u.drop(columns=['Unnamed: 0'], inplace=True)
train_raw_u.head(10)

Unnamed: 0,rdf_triple,ref_text
0,"Aarhus_Airport | cityServed | ""Aarhus, Denmark...","Aarhus, Denmark is served by Aarhus airport op..."
1,"Aarhus_Airport | cityServed | ""Aarhus, Denmark...",The city of Aarhus in Denmark is served by an ...
2,"Aarhus_Airport | cityServed | ""Aarhus, Denmark...",Aarhus Airport in Denmark is operated by Aarhu...
3,"Aarhus_Airport | cityServed | ""Aarhus, Denmark...",Aarhus Lufthavn A/S is the operating organisat...
4,"Aarhus_Airport | cityServed | ""Aarhus, Denmark...",Aarhus Airport is operated by Aarhus Lufthavn ...
5,Aarhus_Airport | location | Tirstrup && Tirstr...,Lars Lokke Rasmussen is the leader of Denmark ...
6,Aarhus_Airport | location | Tirstrup && Tirstr...,"Aarhus Airport is located in Tirstrup, which i..."
7,Aarhus_Airport | location | Tirstrup && Tirstr...,"Aarhus Airport is located in Tirstrup, which i..."
8,Aarhus_Airport | location | Tirstrup && Tirstr...,Margrethe II is the Queen of Denmark where the...
9,Aarhus_Airport | location | Tirstrup && Tirstr...,"Aarhus Airport is located in Tirstrup, in the ..."


In [None]:
# data pre-processing on train_u set

from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
stop = stopwords.words('english')

train_u = train_raw_u.copy()

train_u['rdf_triple'] = train_u['rdf_triple'].str.lower()
train_u['ref_text'] = train_u['ref_text'].str.lower()
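# replace every non-word character and underscore with a space (regex=True is required in recent pandas)
train_u['rdf_triple'] = train_u['rdf_triple'].str.replace(r'[^\w]|_', ' ', regex=True)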
train_u['rdf_triple'] = train_u['rdf_triple'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))  
train_u.replace('\n', '', regex=True, inplace=True)
train_u.replace('\r', '', regex=True, inplace=True)  

# to get a sample
'''train_u_sample = train_u.sample(n=10).reset_index()
train_u_sample.drop(columns=['index'], inplace=True)
train_u_sample'''

train_u.head(10)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,rdf_triple,ref_text
0,aarhus airport cityserved aarhus denmark aarhu...,"aarhus, denmark is served by aarhus airport op..."
1,aarhus airport cityserved aarhus denmark aarhu...,the city of aarhus in denmark is served by an ...
2,aarhus airport cityserved aarhus denmark aarhu...,aarhus airport in denmark is operated by aarhu...
3,aarhus airport cityserved aarhus denmark aarhu...,aarhus lufthavn a/s is the operating organisat...
4,aarhus airport cityserved aarhus denmark aarhu...,aarhus airport is operated by aarhus lufthavn ...
5,aarhus airport location tirstrup tirstrup coun...,lars lokke rasmussen is the leader of denmark ...
6,aarhus airport location tirstrup tirstrup coun...,"aarhus airport is located in tirstrup, which i..."
7,aarhus airport location tirstrup tirstrup coun...,"aarhus airport is located in tirstrup, which i..."
8,aarhus airport location tirstrup tirstrup coun...,margrethe ii is the queen of denmark where the...
9,aarhus airport location tirstrup tirstrup coun...,"aarhus airport is located in tirstrup, in the ..."
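
The effect of the cleaning steps is easiest to see on a single triple. The sketch below repeats them with the standard `re` module: lowercase, replace punctuation and underscores with spaces, then drop stop words. `clean_triple` and the tiny `STOP_WORDS` set are hypothetical stand-ins; the notebook itself uses pandas string methods and NLTK's English stop-word list.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"the", "of", "in", "is", "a", "an", "and"}


def clean_triple(raw, stop_words=STOP_WORDS):
    """Lowercase, turn non-word characters and underscores into spaces, drop stop words."""
    text = raw.lower()
    # \w already matches '_', so the underscore needs its own alternative
    text = re.sub(r"[^\w]|_", " ", text)
    return " ".join(word for word in text.split() if word not in stop_words)


print(clean_triple('Aarhus_Airport | cityServed | "Aarhus, Denmark"'))
print(clean_triple("United_States | percentageOfAreaWater | 6.97"))
```

Note that the decimal point is also replaced, so `6.97` becomes the two tokens `6` and `97`.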


In [None]:
# compare the raw RDF triples with their pre-processed form

train_comparation_u = pd.DataFrame(columns=['rdf_triple_raw', 'rdf_triple_clean'])

train_comparation_u.rdf_triple_raw = train_raw_u.rdf_triple.values
train_comparation_u.rdf_triple_clean = train_u.rdf_triple.values

data_table.DataTable(train_comparation_u)



Unnamed: 0,rdf_triple_raw,rdf_triple_clean
0,"Aarhus_Airport | cityServed | ""Aarhus, Denmark...",aarhus airport cityserved aarhus denmark aarhu...
1,"Aarhus_Airport | cityServed | ""Aarhus, Denmark...",aarhus airport cityserved aarhus denmark aarhu...
2,"Aarhus_Airport | cityServed | ""Aarhus, Denmark...",aarhus airport cityserved aarhus denmark aarhu...
3,"Aarhus_Airport | cityServed | ""Aarhus, Denmark...",aarhus airport cityserved aarhus denmark aarhu...
4,"Aarhus_Airport | cityServed | ""Aarhus, Denmark...",aarhus airport cityserved aarhus denmark aarhu...
...,...,...
32023,"United_States | leaderTitle | ""Vice President""",united states leadertitle vice president
32024,"United_States | leaderTitle | ""Vice President""",united states leadertitle vice president
32025,United_States | percentageOfAreaWater | 6.97,united states percentageofareawater 6 97
32026,United_States | percentageOfAreaWater | 6.97,united states percentageofareawater 6 97


In [None]:
# data pre-processing on val set

val_u = val_raw_u.copy()
val_u['rdf_triple'] = val_u['rdf_triple'].str.lower()
val_u['ref_text'] = val_u['ref_text'].str.lower()
val_u['rdf_triple'] = val_u['rdf_triple'].str.replace(r'[^\w]|_', ' ', regex=True)
val_u['rdf_triple'] = val_u['rdf_triple'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))    
val_u.replace('\n', '', regex=True, inplace=True)
val_u.replace('\r', '', regex=True, inplace=True)

  


In [None]:
!mkdir data_lstm_u
!mkdir data_lstm_u/model
!mkdir data_lstm_u/loaded_model

In [None]:
# create src-train.txt from train_u['rdf_triple']
np.savetxt(r'/content/data_lstm_u/src-train.txt', train_u['rdf_triple'].values, fmt='%s', delimiter='\t')

In [None]:
# create src-val.txt from val_u['rdf_triple']
np.savetxt(r'/content/data_lstm_u/src-val.txt', val_u['rdf_triple'].values, fmt='%s', delimiter='\t')

In [None]:
# create tgt-train.txt from train_u['ref_text']
np.savetxt(r'/content/data_lstm_u/tgt-train.txt', train_u['ref_text'].values, fmt='%s', delimiter='\t')

In [None]:
# create tgt-val.txt from val_u['ref_text']
np.savetxt(r'/content/data_lstm_u/tgt-val.txt', val_u['ref_text'].values, fmt='%s', delimiter='\t')

In [None]:
# create the references for the TER metric (tgt-val-ter.txt): each line ends with a segment id in parentheses
val2_u = val_u.copy()
val2_u['ref_text'] =  val2_u['ref_text'] + " (id" + val2_u.index.astype(str) + ")"
val2_u['ref_text'] = val2_u['ref_text'].str.strip()
val2_u['rdf_triple']= val2_u['rdf_triple'].str.strip()

In [None]:
np.savetxt(r'/content/data_lstm_u/tgt-val-ter.txt', val2_u['ref_text'].values, fmt='%s', delimiter='\t')

## Setting parameters for LSTM Model

In [None]:
# LSTM model configuration

import yaml

data = {
    # Where the samples will be written
    'save_data': '/content/data_lstm_u/model/',
    # Where the vocab(s) will be written by onmt_build_vocab and read by onmt_train
    'src_vocab': '/content/data_lstm_u/example.vocab.src',
    'tgt_vocab': '/content/data_lstm_u/example.vocab.tgt',
    # Prevent overwriting existing files in the folder
    'overwrite': False,
    # Corpus opts:
    'data': {
        'corpus_1': {
            'path_src': '/content/data_lstm_u/src-train.txt',
            'path_tgt': '/content/data_lstm_u/tgt-train.txt',
        },
        'valid': {
            'path_src': '/content/data_lstm_u/src-val.txt',
            'path_tgt': '/content/data_lstm_u/tgt-val.txt',
        },
    },
    # Train on a single GPU
    'world_size': 1,
    'gpu_ranks': [0],
    # Where to save the checkpoints
    'save_model': '/content/data_lstm_u/model/',
    'save_checkpoint_steps': 5000,
    'train_steps': 35000,
    'valid_steps': 5000,
    'seed': 1234,
}

with open("/content/data_lstm_u/data.yaml", "w") as file:
    yaml.dump(data, file, default_flow_style=None)
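Before building the vocabulary it can help to confirm that every corpus file named in the configuration exists. The sketch below is an assumption-laden illustration, not part of the original pipeline: `missing_corpus_files` is a hypothetical helper, and `example_config` repeats only the `data` entry of the dictionary above.

```python
import os


def missing_corpus_files(config):
    """Return (corpus, key, path) for each corpus file that is not on disk."""
    missing = []
    for corpus_name, corpus in config['data'].items():
        for key in ('path_src', 'path_tgt'):
            path = corpus[key]
            if not os.path.isfile(path):
                missing.append((corpus_name, key, path))
    return missing


# Only the 'data' entry of the configuration is needed for the check
example_config = {
    'data': {
        'corpus_1': {'path_src': '/content/data_lstm_u/src-train.txt',
                     'path_tgt': '/content/data_lstm_u/tgt-train.txt'},
        'valid': {'path_src': '/content/data_lstm_u/src-val.txt',
                  'path_tgt': '/content/data_lstm_u/tgt-val.txt'},
    }
}
for corpus_name, key, path in missing_corpus_files(example_config):
    print(f'{corpus_name}.{key} not found: {path}')
```

An empty result means all four files are in place; anything printed points at a path to fix before running `onmt_build_vocab`.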


In [None]:
# build vocab
!onmt_build_vocab -config /content/data_lstm_u/data.yaml -n_sample 10000 

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2022-07-01 10:01:09,811 INFO] Counter vocab from 10000 samples.
[2022-07-01 10:01:09,811 INFO] Build vocab on 10000 transformed examples/corpus.
[2022-07-01 10:01:09,819 INFO] corpus_1's transforms: TransformPipe()
[2022-07-01 10:01:10,021 INFO] Counters src:1753
[2022-07-01 10:01:10,021 INFO] Counters tgt:5941


### Train LSTM Model

In [None]:
# train default openNMT model: LSTM
!onmt_train -config /content/data_lstm_u/data.yaml 

[2022-03-15 09:24:25,360 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2022-03-15 09:24:25,360 INFO] Missing transforms field for valid data, set to default: [].
[2022-03-15 09:24:25,360 INFO] Parsed 2 corpora from -data.
[2022-03-15 09:24:25,360 INFO] Loading checkpoint from /content/_step_15000.pt
[2022-03-15 09:24:25,470 INFO] Loading fields from checkpoint...
[2022-03-15 09:24:25,470 INFO]  * src vocab size = 3311
[2022-03-15 09:24:25,470 INFO]  * tgt vocab size = 9407
[2022-03-15 09:24:25,475 INFO] Building model...
[2022-03-15 09:24:35,602 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(3311, 500, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(500, 500, num_layers=2, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Emb

In [None]:
# load/import trained model checkpoint
import shutil

shutil.copyfile(src=model_lstm_u, dst='/content/data_lstm_u/loaded_model/lstm_model.pt')

'/content/data_lstm_u/loaded_model/lstm_model.pt'

In [None]:
# make prediction file
!onmt_translate -model /content/data_lstm_u/loaded_model/lstm_model.pt -src /content/data_lstm_u/src-val.txt -output /content/data_lstm_u/pred.txt -gpu 0 -verbose --replace_unk

#### Evaluation Metrics: LSTM

##### Bleu

In [None]:
# if you haven't downloaded the BLEU script yet, uncomment and run this line

#!wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl

--2022-07-01 10:02:01--  https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5234 (5.1K) [text/plain]
Saving to: ‘multi-bleu.perl’


2022-07-01 10:02:02 (46.4 MB/s) - ‘multi-bleu.perl’ saved [5234/5234]



In [None]:
# compute bleu score
bleu = !perl multi-bleu.perl /content/data_lstm_u/tgt-val.txt < /content/data_lstm_u/pred.txt
bleu[0]

'BLEU = 4.20, 28.9/6.5/2.2/0.9 (BP=0.949, ratio=0.950, hyp_len=61104, ref_len=64323)'
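The summary line returned by `multi-bleu.perl` packs several values into one string. A minimal sketch of pulling them apart is shown below, assuming the line always follows the layout seen above; `parse_bleu` and `BLEU_PATTERN` are hypothetical names introduced for illustration.

```python
import re

# Layout assumed from the summary line printed by multi-bleu.perl above
BLEU_PATTERN = re.compile(
    r'BLEU = (?P<bleu>[\d.]+), '
    r'(?P<p1>[\d.]+)/(?P<p2>[\d.]+)/(?P<p3>[\d.]+)/(?P<p4>[\d.]+) '
    r'\(BP=(?P<bp>[\d.]+), ratio=(?P<ratio>[\d.]+), '
    r'hyp_len=(?P<hyp_len>\d+), ref_len=(?P<ref_len>\d+)\)'
)


def parse_bleu(line):
    """Split a multi-bleu.perl summary line into named numeric fields."""
    match = BLEU_PATTERN.search(line)
    if match is None:
        raise ValueError(f'not a multi-bleu.perl summary line: {line!r}')
    fields = match.groupdict()
    return {
        'bleu': float(fields['bleu']),
        'precisions': [float(fields[f'p{n}']) for n in range(1, 5)],
        'bp': float(fields['bp']),
        'ratio': float(fields['ratio']),
        'hyp_len': int(fields['hyp_len']),
        'ref_len': int(fields['ref_len']),
    }


summary = 'BLEU = 4.20, 28.9/6.5/2.2/0.9 (BP=0.949, ratio=0.950, hyp_len=61104, ref_len=64323)'
print(parse_bleu(summary))
```

Having the n-gram precisions and the brevity penalty as separate numbers makes it easier to compare the LSTM and Transformer runs side by side.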

In [None]:
%cd /content/

/content


##### Meteor

In [None]:
# if you didn't import meteor metric before, please run this code
'''
%%capture 
import shutil

source_dir = meteor_path
destination_dir = r"/content/meteor-1.5"
shutil.copytree(source_dir, destination_dir)
'''

In [None]:
%cd /content/meteor-1.5/

/content/meteor-1.5


In [None]:
# compute meteor metric
meteor = !java -Xmx2G -jar meteor-1.5.jar /content/data_lstm_u/pred.txt /content/data_lstm_u/tgt-val.txt -l en -norm 
meteor[-1]

'Final score:            0.1285521584510164'

##### Ter

In [None]:
%cd ..

/content


In [None]:
# make prediction df

with open('/content/data_lstm_u/pred.txt', 'r') as file1:
    Lines = file1.readlines()
df = pd.DataFrame(Lines, columns=['text'])
df.replace('\n', '', regex=True, inplace=True)
df.replace('\r', '', regex=True, inplace=True)

len(df)

3561

In [None]:
data2 = df.copy()
data2['text'] =  data2['text'] + " (id" + data2.index.astype(str) + ")"

In [None]:
np.savetxt(r'/content/data_lstm_u/pred-ter.txt', data2.values, fmt='%s', delimiter='\t')

In [None]:
# import Ter metric

import shutil
source_dir = ter_path
destination_dir = r"/content/tercom-0.7.25"
shutil.copytree(source_dir, destination_dir)

'/content/tercom-0.7.25'

In [None]:
%cd /content/tercom-0.7.25/

/content/tercom-0.7.25


In [None]:
# compute TER metric

ter = !java -jar tercom.7.25.jar -h /content/data_lstm_u/pred-ter.txt -r /content/data_lstm_u/tgt-val-ter.txt
ter[-4]

'Total TER: 0.9263871399033006 (59588.0/64323.0)'
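The `Total TER` line can be read as an edit count divided by a reference word count. The sketch below checks that reading against the line above; `parse_ter` is a hypothetical helper, and the interpretation of the two numbers in parentheses is an assumption based on how the value works out.

```python
import re


def parse_ter(line):
    """Extract (ter, edits, ref_words) from a 'Total TER: x (e/w)' line."""
    match = re.search(r'Total TER: ([\d.]+) \(([\d.]+)/([\d.]+)\)', line)
    if match is None:
        raise ValueError(f'not a Total TER line: {line!r}')
    ter, edits, ref_words = (float(group) for group in match.groups())
    return ter, edits, ref_words


ter_value, edits, ref_words = parse_ter('Total TER: 0.9263871399033006 (59588.0/64323.0)')
# Ratio of edits to reference words, to compare with the reported score
print(edits / ref_words, ter_value)
```

A TER close to 1 therefore means that nearly as many edits as reference words are needed to turn the predictions into the references.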

##### Rouge

In [None]:
# if you didn't import and install rouge metric before, please run this code
'''
# import and install rouge metric
%%capture
%cd /content/
!git clone https://github.com/pltrdy/rouge.git
%cd rouge
!python setup.py install
'''

In [None]:
%cd /content/rouge

/content/rouge


In [None]:
# compute rouge metric

from rouge import FilesRouge

hyp_path = r'/content/data_lstm_u/pred.txt'

ref_path= r'/content/data_lstm_u/tgt-val.txt'

files_rouge = FilesRouge()
scores = files_rouge.get_scores(hyp_path, ref_path, avg=True, ignore_empty=True)
scores

{'rouge-1': {'f': 0.39227900479272954,
  'p': 0.5890802518661095,
  'r': 0.3137280546613913},
 'rouge-2': {'f': 0.082039642647012,
  'p': 0.09192107178502296,
  'r': 0.07851895245237671},
 'rouge-l': {'f': 0.32999649971261974,
  'p': 0.4968644464479616,
  'r': 0.26394768434897486}}
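The reported `f` values are not the harmonic mean of the reported `p` and `r`. A likely explanation, stated here as an assumption about how the `rouge` package averages with `avg=True`, is that `f`, `p` and `r` are each averaged over sentences independently. The sketch below compares the two quantities for `rouge-1`; `f_measure` is a hypothetical helper.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# rouge-1 values copied from the output above
avg_p, avg_r, avg_f = 0.5890802518661095, 0.3137280546613913, 0.39227900479272954

print(f_measure(avg_p, avg_r))  # harmonic mean of the averaged P and R
print(avg_f)                    # averaged F as reported above
```

The two numbers differ, so the reported `f` should be quoted as an average of per-sentence scores rather than recomputed from the averaged precision and recall.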

##### Bert Score

In [None]:
# if you didn't import and install bert score metric before, please run this code
'''
%%capture
!pip install bert-score
'''

In [None]:
%cd /content/

/content


In [None]:
# read the reference texts into a list, one entry per line
ref = []
with open("/content/data_lstm_u/tgt-val.txt", "r") as a_file:
    for line in a_file:
        ref.append(line.strip())

In [None]:
# read the predicted texts into a list, one entry per line
hyp = []
with open("/content/data_lstm_u/pred.txt", "r") as a_file:
    for line in a_file:
        hyp.append(line.strip())

In [None]:
from bert_score import score

def bert_score_(references, hypothesis, lng='en'):
    # Accept a flat list of strings as well as a list of reference lists:
    # iterating over a bare string would split it into single characters.
    references = [[refs] if isinstance(refs, str) else refs for refs in references]
    # drop empty references
    references = [[ref for ref in refs if ref.strip() != ''] for refs in references]
    try:
        P, R, F1 = score(hypothesis, references, lang=lng)
        # average the per-sentence scores over the corpus
        P, R, F1 = list(P), list(R), list(F1)
        F1 = float(sum(F1) / len(F1))
        P = float(sum(P) / len(P))
        R = float(sum(R) / len(R))
    except Exception:
        P, R, F1 = 0, 0, 0
    return P, R, F1

bert_score_(references=ref, hypothesis=hyp, lng='en')

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


(0.7706299424171448, 0.9980487823486328, 0.8692240715026855)

#### Model Results

In [None]:
# draw a 10-row sample from the validation set
%cd /content/
val_sample_u = val_u.sample(n=10, random_state = 1234).reset_index(drop=True)
val_sample_u

/content


Unnamed: 0,rdf_triple,ref_text
0,aaron deer genre indie rock indie rock instrum...,aaron deer plays piano in indie rock whose sty...
1,aaron bertram associatedband associatedmusical...,aaron bertram began performing in 1998 and pla...
2,anders osborne recordlabel alligator records a...,anders osborne is signed to shanachie records ...
3,alison donnell associatedband associatedmusica...,alison o'donnell is a musician for the band un...
4,al anderson nrbq band instrument singing al an...,"al anderson from the band, nrbq, is a singer a..."
5,alan frew genre rock music rock music stylisti...,alan frew's musical genre is rock music which ...
6,allen forrest genre hip hop music hip hop musi...,allen forrest's genre is hip hop music which o...
7,andrew rayel associatedband associatedmusicala...,andrew rayel has performed the genre of trance...
8,aaron bertram associatedband associatedmusical...,aaron bertram plays for the suburban legends b...
9,post metal instrument cello aaron turner assoc...,"aaron turner, who has played for the bands twi..."


In [None]:
# create src-val_sample.txt from val_sample_u['rdf_triple']

np.savetxt(r'/content/data_lstm_u/src-val_sample.txt', val_sample_u['rdf_triple'].values, fmt='%s', delimiter='\t')

In [None]:
# make prediction file on val_sample
!onmt_translate -model /content/data_lstm_u/loaded_model/lstm_model.pt -src /content/data_lstm_u/src-val_sample.txt -output /content/data_lstm_u/pred_sample.txt -gpu 0 -verbose --replace_unk

[2022-07-01 10:10:33,160 INFO] Translating shard 0.
  self._batch_index = self.topk_ids // vocab_size
[2022-07-01 10:10:33,292 INFO] 
SENT 1: ['aaron', 'deer', 'genre', 'indie', 'rock', 'indie', 'rock', 'instrument', 'guitar', 'indie', 'rock', 'stylisticorigin', 'garage', 'rock', 'indie', 'rock', 'stylisticorigin', 'new', 'wave', 'music', 'aaron', 'deer', 'instrument', 'piano']
PRED 1: the rock rock indie is rock to the stylisticorigin rock stylisticorigin and the new rock rock the new of the new rock aaron rock rock is rock rock
PRED SCORE: -11.6708

[2022-07-01 10:10:33,292 INFO] 
SENT 2: ['aaron', 'bertram', 'associatedband', 'associatedmusicalartist', 'suburban', 'legends', 'aaron', 'bertram', 'activeyearsstartyear', '1998', 'aaron', 'bertram', 'associatedband', 'associatedmusicalartist', 'kids', 'imagine', 'nation']
PRED 2: aaron 1998 aaron 1998 1998 1998 1998 1998 1998 1998 nation was nation by nation nation nation
PRED SCORE: -5.7968

[2022-07-01 10:10:33,292 INFO] 
SENT 3: ['an

In [None]:
# read the predictions, one sentence per line
with open('/content/data_lstm_u/pred_sample.txt', 'r') as f:
    df_lstm_u = pd.DataFrame([line.rstrip('\r\n') for line in f], columns=['text'])
df_lstm_u


Unnamed: 0,text
0,the rock rock indie is rock to the stylisticor...
1,aaron 1998 aaron 1998 1998 1998 1998 1998 1998...
2,the anders anders osborne anders anders anders...
3,alison united bible united is studies associat...
4,band band was born in band and is in the ander...
5,the alan alan is a rock music rock rock rock r...
6,the allen allen allen hip hip hip genre genre ...
7,the andrew rayel rayel rayel rayel is andrew t...
8,aaron aaron aaron nation nation nation nation ...
9,post post man man man man man gloom gloom and ...


In [None]:
# create the prediction df
prediction_df_lstm_u = pd.DataFrame(columns=['rdf_triple', 'prediction_text'] )
prediction_df_lstm_u.rdf_triple = val_sample_u.rdf_triple.values
prediction_df_lstm_u.prediction_text = df_lstm_u.text.values
prediction_df_lstm_u

Unnamed: 0,rdf_triple,prediction_text
0,aaron deer genre indie rock indie rock instrum...,the rock rock indie is rock to the stylisticor...
1,aaron bertram associatedband associatedmusical...,aaron 1998 aaron 1998 1998 1998 1998 1998 1998...
2,anders osborne recordlabel alligator records a...,the anders anders osborne anders anders anders...
3,alison donnell associatedband associatedmusica...,alison united bible united is studies associat...
4,al anderson nrbq band instrument singing al an...,band band was born in band and is in the ander...
5,alan frew genre rock music rock music stylisti...,the alan alan is a rock music rock rock rock r...
6,allen forrest genre hip hop music hip hop musi...,the allen allen allen hip hip hip genre genre ...
7,andrew rayel associatedband associatedmusicala...,the andrew rayel rayel rayel rayel is andrew t...
8,aaron bertram associatedband associatedmusical...,aaron aaron aaron nation nation nation nation ...
9,post metal instrument cello aaron turner assoc...,post post man man man man man gloom gloom and ...


In [None]:
# comparison between references text from val sample and prediction text of lstm model on val sample
text_comparation_lstm_u = pd.DataFrame(columns=['ref_text', 'prediction_text'] )
text_comparation_lstm_u.ref_text = val_sample_u.ref_text.values
text_comparation_lstm_u.prediction_text = df_lstm_u.text.values
text_comparation_lstm_u

Unnamed: 0,ref_text,prediction_text
0,aaron deer plays piano in indie rock whose sty...,the rock rock indie is rock to the stylisticor...
1,aaron bertram began performing in 1998 and pla...,aaron 1998 aaron 1998 1998 1998 1998 1998 1998...
2,anders osborne is signed to shanachie records ...,the anders anders osborne anders anders anders...
3,alison o'donnell is a musician for the band un...,alison united bible united is studies associat...
4,"al anderson from the band, nrbq, is a singer a...",band band was born in band and is in the ander...
5,alan frew's musical genre is rock music which ...,the alan alan is a rock music rock rock rock r...
6,allen forrest's genre is hip hop music which o...,the allen allen allen hip hip hip genre genre ...
7,andrew rayel has performed the genre of trance...,the andrew rayel rayel rayel rayel is andrew t...
8,aaron bertram plays for the suburban legends b...,aaron aaron aaron nation nation nation nation ...
9,"aaron turner, who has played for the bands twi...",post post man man man man man gloom gloom and ...


## Setting parameters for Transformer Model

In [None]:
!mkdir data_transf_u
!mkdir data_transf_u/model
!mkdir data_transf_u/loaded_model

In [None]:
# create src-train.txt from train_u['rdf_triple']
np.savetxt(r'/content/data_transf_u/src-train.txt', train_u['rdf_triple'].values, fmt='%s', delimiter='\t')

In [None]:
# create src-val.txt from val_u['rdf_triple']
np.savetxt(r'/content/data_transf_u/src-val.txt', val_u['rdf_triple'].values, fmt='%s', delimiter='\t')

In [None]:
# create tgt-train.txt from train_u['ref_text']
np.savetxt(r'/content/data_transf_u/tgt-train.txt', train_u['ref_text'].values, fmt='%s', delimiter='\t')

In [None]:
# create tgt-val.txt from val_u['ref_text']
np.savetxt(r'/content/data_transf_u/tgt-val.txt', val_u['ref_text'].values, fmt='%s', delimiter='\t')

In [None]:
# build the references for the TER metric (saved as tgt-val-ter.txt): each line ends with an "(id<N>)" tag
val2_u = val_u.copy()
val2_u['ref_text'] =  val2_u['ref_text'] + " (id" + val2_u.index.astype(str) + ")"
val2_u['ref_text'] = val2_u['ref_text'].str.strip()
val2_u['rdf_triple']= val2_u['rdf_triple'].str.strip()

In [None]:
np.savetxt(r'/content/data_transf_u/tgt-val-ter.txt', val2_u['ref_text'].values, fmt='%s', delimiter='\t')

In [None]:
# Transformer model configuration

import yaml

data = {
    # Where the samples will be written
    'save_data': '/content/data_transf_u/model/',
    # Where the vocab(s) will be written by onmt_build_vocab and read by onmt_train
    'src_vocab': '/content/data_transf_u/example.vocab.src',
    'tgt_vocab': '/content/data_transf_u/example.vocab.tgt',
    # Prevent overwriting existing files in the folder
    'overwrite': False,
    # Corpus opts:
    'data': {
        'corpus_1': {
            'path_src': '/content/data_transf_u/src-train.txt',
            'path_tgt': '/content/data_transf_u/tgt-train.txt',
        },
        'valid': {
            'path_src': '/content/data_transf_u/src-val.txt',
            'path_tgt': '/content/data_transf_u/tgt-val.txt',
        },
    },
    # Train on a single GPU
    'world_size': 1,
    'gpu_ranks': [0],
    # Where to save the checkpoints
    'save_model': '/content/data_transf_u/model/',
    'save_checkpoint_steps': 5000,
    'train_steps': 35000,
    'valid_steps': 5000,
    # Transformer architecture
    'decoder_type': 'transformer',
    'encoder_type': 'transformer',
    'word_vec_size': 512,
    'rnn_size': 512,
    'layers': 2,
    'transformer_ff': 2048,
    'heads': 4,
    # Batching and regularisation
    'batch_size': 64,
    'batch_type': 'sents',
    'normalization': 'tokens',
    'dropout': 0.3,
    'label_smoothing': 0.1,
    'seed': 1234,
}

with open("/content/data_transf_u/data.yaml", "w") as file:
    yaml.dump(data, file, default_flow_style=None)
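In multi-head attention the model dimension is split evenly across the heads, so the hidden size must be divisible by the number of heads. The sketch below works this out for the values above, assuming that `rnn_size` plays the role of the model dimension in this configuration; `head_dimension` is a hypothetical helper.

```python
def head_dimension(model_dim, heads):
    """Return the per-head size, or raise if the split is not even."""
    if model_dim % heads != 0:
        raise ValueError(f'model_dim={model_dim} is not divisible by heads={heads}')
    return model_dim // heads


# Values taken from the configuration above: rnn_size=512, heads=4
print(head_dimension(512, 4))
```

With 512 hidden units and 4 heads, each head attends over a 128-dimensional slice of the representation.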



In [None]:
# build vocab
!onmt_build_vocab -config /content/data_transf_u/data.yaml -n_sample 10000

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2022-07-01 10:11:05,918 INFO] Counter vocab from 10000 samples.
[2022-07-01 10:11:05,918 INFO] Build vocab on 10000 transformed examples/corpus.
[2022-07-01 10:11:05,930 INFO] corpus_1's transforms: TransformPipe()
[2022-07-01 10:11:06,132 INFO] Counters src:1753
[2022-07-01 10:11:06,132 INFO] Counters tgt:5941


### Train Transformer Model

In [None]:
# training transformer openNMT model
!onmt_train -config /content/data_transf_u/data.yaml

[2022-03-15 14:15:44,140 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2022-03-15 14:15:44,140 INFO] Missing transforms field for valid data, set to default: [].
[2022-03-15 14:15:44,140 INFO] Parsed 2 corpora from -data.
[2022-03-15 14:15:44,140 INFO] Get special vocabs from Transforms: {'src': set(), 'tgt': set()}.
[2022-03-15 14:15:44,140 INFO] Loading vocab from text file...
[2022-03-15 14:15:44,140 INFO] Loading src vocabulary from /content/example.vocab.src
[2022-03-15 14:15:44,172 INFO] Loaded src vocab has 3356 tokens.
[2022-03-15 14:15:44,174 INFO] Loading tgt vocabulary from /content/example.vocab.tgt
[2022-03-15 14:15:44,196 INFO] Loaded tgt vocab has 10034 tokens.
[2022-03-15 14:15:44,201 INFO] Building fields with vocab in counters...
[2022-03-15 14:15:44,214 INFO]  * tgt vocab size: 10038.
[2022-03-15 14:15:44,218 INFO]  * src vocab size: 3358.
[2022-03-15 14:15:44,218 INFO]  * src vocab size = 3358
[2022-03-15 14:15:44,218 INFO]  * tgt vocab size

In [None]:
# load/import trained model checkpoint
import shutil

shutil.copyfile(src=model_transformer_u, dst='/content/data_transf_u/loaded_model/transformer_model.pt')

'/content/data_transf_u/loaded_model/transformer_model.pt'

In [None]:
# make prediction file
!onmt_translate -model /content/data_transf_u/loaded_model/transformer_model.pt -src /content/data_transf_u/src-val.txt -output /content/data_transf_u/pred.txt -gpu 0 -verbose --replace_unk

#### Evaluation Metrics: Transformer

##### Bleu

In [None]:
#!wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl

In [None]:
bleu = !perl multi-bleu.perl /content/data_transf_u/tgt-val.txt < /content/data_transf_u/pred.txt
bleu[0]

'BLEU = 2.43, 28.3/7.8/3.5/1.3 (BP=0.427, ratio=0.541, hyp_len=34769, ref_len=64323)'
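The low `BP` in this line comes from the hypotheses being much shorter than the references. As a worked check, the standard BLEU brevity penalty is $\exp(1 - r/c)$ when the hypothesis length $c$ is below the reference length $r$, and 1 otherwise. The sketch below applies it to the lengths reported for both models; `brevity_penalty` is a hypothetical helper.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty for corpus-level lengths."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)


# Lengths copied from the multi-bleu.perl summary lines in this notebook
print(round(brevity_penalty(34769, 64323), 3))  # Transformer
print(round(brevity_penalty(61104, 64323), 3))  # LSTM
```

The Transformer's n-gram precisions are comparable to the LSTM's, so most of the gap in the final BLEU score is explained by this length penalty.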

In [None]:
%cd /content/

/content


##### Meteor

In [None]:
'''
%%capture 
import shutil

source_dir = meteor_path
destination_dir = r"/content/meteor-1.5"
shutil.copytree(source_dir, destination_dir)
'''

In [None]:
%cd /content/meteor-1.5/

/content/meteor-1.5


In [None]:
meteor = !java -Xmx2G -jar meteor-1.5.jar /content/data_transf_u/pred.txt /content/data_transf_u/tgt-val.txt -l en -norm 
meteor[-1]

'Final score:            0.06605268116308931'

##### Ter

In [None]:
%cd ..

/content


In [None]:
# make prediction df, one sentence per line
with open('/content/data_transf_u/pred.txt', 'r') as f:
    df = pd.DataFrame([line.rstrip('\r\n') for line in f], columns=['text'])
len(df)

3561

In [None]:
data2 = df.copy()
data2['text'] = data2['text'] + " (id" + data2.index.astype(str) + ")"
data2['text'] = data2['text'].replace('\n', '', regex=True)


In [None]:
np.savetxt(r'/content/data_transf_u/pred-ter.txt', data2['text'].values, fmt='%s', delimiter='\t')

In [None]:
'''
# import Ter metric

import shutil
source_dir = ter_path
destination_dir = r"/content/tercom-0.7.25"
shutil.copytree(source_dir, destination_dir)

'''

'/content/tercom-0.7.25'

In [None]:
%cd /content/tercom-0.7.25/

/content/tercom-0.7.25


In [None]:
ter = !java -jar tercom.7.25.jar -h /content/data_transf_u/pred-ter.txt -r /content/data_transf_u/tgt-val-ter.txt
ter[-4]

'Total TER: 0.8561714627645107 (62956.0/73532.0)'

##### Rouge

In [None]:
'''
# import and install rouge metric

!git clone https://github.com/pltrdy/rouge.git
%cd rouge
!python setup.py install
'''

In [None]:
%cd /content/rouge


/content/rouge


In [None]:
from rouge import FilesRouge

hyp_path = r'/content/data_transf_u/pred.txt'

ref_path= r'/content/data_transf_u/tgt-val.txt'

files_rouge = FilesRouge()
scores = files_rouge.get_scores(hyp_path, ref_path, avg=True)
scores

{'rouge-1': {'f': 0.2797899985301004,
  'p': 0.5802868010955622,
  'r': 0.19726041985859874},
 'rouge-2': {'f': 0.07075562516609529,
  'p': 0.13086719202978606,
  'r': 0.053907214760261654},
 'rouge-l': {'f': 0.2569793621713783,
  'p': 0.5332060424520431,
  'r': 0.181320889585916}}

##### Bert Score

In [None]:
%cd ..

/content


In [None]:
#!pip install bert-score


In [None]:
# read the reference texts into a list, one entry per line
ref = []
with open("/content/data_transf_u/tgt-val.txt", "r") as a_file:
    for line in a_file:
        ref.append(line.strip())

In [None]:
# read the predicted texts into a list, one entry per line
hyp = []
with open("/content/data_transf_u/pred.txt", "r") as a_file:
    for line in a_file:
        hyp.append(line.strip())

In [None]:
from bert_score import score

def bert_score_(references, hypothesis, lng='en'):
    # Accept a flat list of strings as well as a list of reference lists:
    # iterating over a bare string would split it into single characters.
    references = [[refs] if isinstance(refs, str) else refs for refs in references]
    # drop empty references
    references = [[ref for ref in refs if ref.strip() != ''] for refs in references]
    try:
        P, R, F1 = score(hypothesis, references, lang=lng)
        # average the per-sentence scores over the corpus
        P, R, F1 = list(P), list(R), list(F1)
        F1 = float(sum(F1) / len(F1))
        P = float(sum(P) / len(P))
        R = float(sum(R) / len(R))
    except Exception:
        P, R, F1 = 0, 0, 0
    return P, R, F1

bert_score_(references=ref, hypothesis=hyp, lng='en')

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


(0.808567225933075, 0.9997814297676086, 0.8928444385528564)

#### Model Results

In [None]:
%cd /content/
val_sample_u

/content


Unnamed: 0,rdf_triple,ref_text
0,aaron deer genre indie rock indie rock instrum...,aaron deer plays piano in indie rock whose sty...
1,aaron bertram associatedband associatedmusical...,aaron bertram began performing in 1998 and pla...
2,anders osborne recordlabel alligator records a...,anders osborne is signed to shanachie records ...
3,alison donnell associatedband associatedmusica...,alison o'donnell is a musician for the band un...
4,al anderson nrbq band instrument singing al an...,"al anderson from the band, nrbq, is a singer a..."
5,alan frew genre rock music rock music stylisti...,alan frew's musical genre is rock music which ...
6,allen forrest genre hip hop music hip hop musi...,allen forrest's genre is hip hop music which o...
7,andrew rayel associatedband associatedmusicala...,andrew rayel has performed the genre of trance...
8,aaron bertram associatedband associatedmusical...,aaron bertram plays for the suburban legends b...
9,post metal instrument cello aaron turner assoc...,"aaron turner, who has played for the bands twi..."


In [None]:
# create src-val_sample.txt from val_sample_u['rdf_triple']

np.savetxt(r'/content/data_transf_u/src-val_sample.txt', val_sample_u['rdf_triple'].values, fmt='%s', delimiter='\t')

In [None]:
# make prediction file on val_sample
!onmt_translate -model /content/data_transf_u/loaded_model/transformer_model.pt -src /content/data_transf_u/src-val_sample.txt -output /content/data_transf_u/pred_sample.txt -gpu 0 -verbose --replace_unk

[2022-07-01 10:20:18,089 INFO] Translating shard 0.
  self._batch_index = self.topk_ids // vocab_size
[2022-07-01 10:20:18,318 INFO] 
SENT 1: ['aaron', 'deer', 'genre', 'indie', 'rock', 'indie', 'rock', 'instrument', 'guitar', 'indie', 'rock', 'stylisticorigin', 'garage', 'rock', 'indie', 'rock', 'stylisticorigin', 'new', 'wave', 'music', 'aaron', 'deer', 'instrument', 'piano']
PRED 1: the former team of the los angeles new is new
PRED SCORE: -6.9512

[2022-07-01 10:20:18,318 INFO] 
SENT 2: ['aaron', 'bertram', 'associatedband', 'associatedmusicalartist', 'suburban', 'legends', 'aaron', 'bertram', 'activeyearsstartyear', '1998', 'aaron', 'bertram', 'associatedband', 'associatedmusicalartist', 'kids', 'imagine', 'nation']
PRED 2: the nation of nation 1998 nation is nation
PRED SCORE: -4.7293

[2022-07-01 10:20:18,319 INFO] 
SENT 3: ['anders', 'osborne', 'recordlabel', 'alligator', 'records', 'alligator', 'records', 'genre', 'blues', 'anders', 'osborne', 'recordlabel', 'shanachie', 'reco

In [None]:
# read the predictions, one sentence per line
with open('/content/data_transf_u/pred_sample.txt', 'r') as f:
    df_tr_u = pd.DataFrame([line.rstrip('\r\n') for line in f], columns=['text'])
df_tr_u

Unnamed: 0,text
0,the former team of the los angeles new is new
1,the nation of nation 1998 nation is nation
2,william anders is an anders
3,the united kingdom is a alison
4,al al is located in al
5,the former team of the los angeles rock is rock
6,allen is a allen
7,the saint of the andrew is andrew
8,the former team of nation is nation
9,the man man was built by man


In [None]:
prediction_df_tr_u = pd.DataFrame(columns=['rdf_triple', 'prediction_text'] )
prediction_df_tr_u.rdf_triple = val_sample_u.rdf_triple.values
prediction_df_tr_u.prediction_text = df_tr_u.text.values
prediction_df_tr_u

Unnamed: 0,rdf_triple,prediction_text
0,aaron deer genre indie rock indie rock instrum...,the former team of the los angeles new is new
1,aaron bertram associatedband associatedmusical...,the nation of nation 1998 nation is nation
2,anders osborne recordlabel alligator records a...,william anders is an anders
3,alison donnell associatedband associatedmusica...,the united kingdom is a alison
4,al anderson nrbq band instrument singing al an...,al al is located in al
5,alan frew genre rock music rock music stylisti...,the former team of the los angeles rock is rock
6,allen forrest genre hip hop music hip hop musi...,allen is a allen
7,andrew rayel associatedband associatedmusicala...,the saint of the andrew is andrew
8,aaron bertram associatedband associatedmusical...,the former team of nation is nation
9,post metal instrument cello aaron turner assoc...,the man man was built by man


In [None]:
text_comparation_tr_u = pd.DataFrame(columns=['ref_text', 'prediction_text'] )
text_comparation_tr_u.ref_text = val_sample_u.ref_text.values
text_comparation_tr_u.prediction_text = df_tr_u.text.values
text_comparation_tr_u

Unnamed: 0,ref_text,prediction_text
0,aaron deer plays piano in indie rock whose sty...,the former team of the los angeles new is new
1,aaron bertram began performing in 1998 and pla...,the nation of nation 1998 nation is nation
2,anders osborne is signed to shanachie records ...,william anders is an anders
3,alison o'donnell is a musician for the band un...,the united kingdom is a alison
4,"al anderson from the band, nrbq, is a singer a...",al al is located in al
5,alan frew's musical genre is rock music which ...,the former team of the los angeles rock is rock
6,allen forrest's genre is hip hop music which o...,allen is a allen
7,andrew rayel has performed the genre of trance...,the saint of the andrew is andrew
8,aaron bertram plays for the suburban legends b...,the former team of nation is nation
9,"aaron turner, who has played for the bands twi...",the man man was built by man
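Many of the predictions above repeat the same token several times. One simple way to put a number on this, offered here only as an illustrative sketch and not as part of the original evaluation, is the ratio of distinct tokens to total tokens in a sentence; `distinct_token_ratio` is a hypothetical helper.

```python
def distinct_token_ratio(sentence):
    """Share of distinct whitespace-separated tokens (0.0 for an empty string)."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Three Transformer predictions copied from the table above
predictions = [
    'the nation of nation 1998 nation is nation',
    'al al is located in al',
    'allen is a allen',
]
for text in predictions:
    print(f'{distinct_token_ratio(text):.2f}  {text}')
```

A ratio well below 1 on short sentences like these indicates that the model is looping on a few tokens rather than verbalising the input triples.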
