In [1]:
#meta 3/16/2021 Transformer Models

#history
#3/16/2021 TRANSFORMERS
#      Create an env
#      Try HF, Sentiment and Paraphrase

#5/10/2021 TRY PARAPHRASE, SUMMARIZE
#      Try WOS data

## 0. Tiny Examples

In [2]:
from transformers import pipeline 
print(pipeline('sentiment-analysis')('we love you'))

[{'label': 'POSITIVE', 'score': 0.9998704791069031}]


In [3]:
print(pipeline('sentiment-analysis')('jeans'))

[{'label': 'POSITIVE', 'score': 0.5292837023735046}]


In [4]:
from transformers import GPT2Tokenizer, TFGPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = TFGPT2Model.from_pretrained('gpt2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

All model checkpoint layers were used when initializing TFGPT2Model.

All the layers of TFGPT2Model were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2Model for predictions without further training.


### Sequence Classification
https://huggingface.co/transformers/task_summary.html

In [5]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
classes = ["not paraphrase", "is paraphrase"]


Some layers from the model checkpoint at bert-base-cased-finetuned-mrpc were not used when initializing TFBertForSequenceClassification: ['dropout_183']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at bert-base-cased-finetuned-mrpc.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [6]:
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")
paraphrase_classification_logits = model(paraphrase)[0]
not_paraphrase_classification_logits = model(not_paraphrase)[0]
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]
# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

not paraphrase: 10%
is paraphrase: 90%
not paraphrase: 94%
is paraphrase: 6%


In [7]:
sequence_0 = "blue jeans"
sequence_1 = "red apples"
sequence_2 = "pants"
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")
paraphrase_classification_logits = model(paraphrase)[0]
not_paraphrase_classification_logits = model(not_paraphrase)[0]
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]
# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

not paraphrase: 41%
is paraphrase: 59%
not paraphrase: 78%
is paraphrase: 22%
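
Cells In [6] and In [7] repeat the same tokenize, classify, softmax steps for every pair. Below is a minimal sketch of a helper that wraps those steps, assuming the `tokenizer` and `model` loaded in In [5] are passed in. The names `paraphrase_probs` and `print_probs` are hypothetical, and the softmax uses the standard library rather than `tf.nn.softmax`.

```python
import math

def paraphrase_probs(tokenizer, model, text_a, text_b,
                     classes=("not paraphrase", "is paraphrase")):
    """Return {class name: probability} for one pair of texts.

    Assumes `tokenizer` and `model` behave like the objects from In [5]:
    the tokenizer accepts a text pair, and model(inputs)[0] holds the logits.
    """
    inputs = tokenizer(text_a, text_b, return_tensors="tf")
    logits = model(inputs)[0].numpy()[0]  # one row with one logit per class
    # Numerically stable softmax over the logits.
    peak = max(float(x) for x in logits)
    exps = [math.exp(float(x) - peak) for x in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(classes, exps)}

def print_probs(probs):
    """Print probabilities in the same 'label: NN%' format as the cells above."""
    for name, p in probs.items():
        print(f"{name}: {int(round(p * 100))}%")

# Usage with the real objects from In [5] (not run here):
# print_probs(paraphrase_probs(tokenizer, model, sequence_0, sequence_2))
```

With the real model, the commented usage line is expected to print the same kind of two-line result as each loop above, so every later comparison could become a single call.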


## 1. WOS Data Examples

In [8]:
s1_cat = 'Electric motor'
s1_kw = 'NdFeB magnets; Electric motor; Electric vehicle; Hybrid electric vehicle; Recycling; Rare earth elements'
s1_abstract = 'Hybrid electric vehicles are assumed to play a major role in future mobility concepts. Although sales numbers are increasing, little emphasis has been laid on the recycling of some key components such as power electronics or electric motors. Permanent magnet synchronous motors contain considerable amounts of rare earth elements that cannot be recovered in conventional recycling routes. Although their recycling could have large economic, environmental, and strategic advantages, no industrial recycling for permanent magnets is available in western countries at the moment. Regarding the essential steps, dismantling of electric vehicles as well as the extraction of magnets from the rotors, little has been published before. This paper therefore presents and discusses different recycling approaches for the recycling of NdFeB magnets from (hybrid) electric vehicles. Many results stem from the German research project "Recycling of components and strategic metals of electric drive motors.'

s2_cat = "Green Building"
s2_kw = "LED lighting system; PV system; Distributed lighting control; Energy efficiency; Green building; Daylight responsive dimming system"
s2_abstract = "Decreasing of energy consumption and environmentally friendly energy resources are the issues in the foreground nowadays. As the electric energy consumed for the illumination is high, long-lasting and low-consumption LED (light-emitting diode) technology gets prominent. There have been made much reseacrh regarding the use of photovoltaic sytems in meeting the energy demand in housing and industry. However, there is need for more research with regards to photovoltaic sytems' integration with energy efficiency sytems. In this study, for the environments which have different lighting levels due to daylight factor, there has been proposed a low-cost PV (photovoltaics) based and distributed sensor smart LED illuminating system and there has been acquired 72.075% more energy saving in comparison with conventional LED illuminating system."

### 1.1 Try Paraphrase

Compare abstracts and keywords (kws)

In [9]:
paraphrase = tokenizer(s1_abstract, s1_kw, return_tensors="tf")
paraphrase_classification_logits = model(paraphrase)[0]
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")


not paraphrase: 94%
is paraphrase: 6%


In [10]:
not_paraphrase = tokenizer(s1_abstract, s2_kw, return_tensors="tf")
not_paraphrase_classification_logits = model(not_paraphrase)[0]
not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

not paraphrase: 94%
is paraphrase: 6%


### 1.2 Try Summarize
Summarize abstract and compare with a) abstract, b) kws

In [11]:
summarizer = pipeline("summarization")

In [12]:
s1_abstract_summary = summarizer(s1_abstract, max_length=100, min_length=30, do_sample=False)
s2_abstract_summary = summarizer(s2_abstract, max_length=100, min_length=30, do_sample=False)

s1_abstract_summary  # a list of dicts


[{'summary_text': ' Hybrid electric vehicles are assumed to play a major role in future mobility concepts . Little emphasis has been laid on the recycling of some key components such as power electronics or electric motors . No industrial recycling for permanent magnets is available in western countries at the moment .'}]

In [13]:
s1_abstract_summary[0]['summary_text'], s2_abstract_summary[0]['summary_text']  # each is a str

(' Hybrid electric vehicles are assumed to play a major role in future mobility concepts . Little emphasis has been laid on the recycling of some key components such as power electronics or electric motors . No industrial recycling for permanent magnets is available in western countries at the moment .',
 ' There has been made much reseacrh regarding the use of photovoltaic sytems in meeting the energy demand in housing and industry . However, there is need for more research . In this study, for environments which have different lighting levels due to daylight factor, there has been proposed a low-cost PV (photovoltaics) based and distributed sensor smart LED illuminating system .')

Ex 1.

In [14]:
paraphrase = tokenizer(s1_abstract, s1_abstract_summary[0]['summary_text'], return_tensors="tf")
paraphrase_classification_logits = model(paraphrase)[0]
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

paraphrase = tokenizer(s1_kw, s1_abstract_summary[0]['summary_text'], return_tensors="tf")
paraphrase_classification_logits = model(paraphrase)[0]
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

not paraphrase: 73%
is paraphrase: 27%
not paraphrase: 95%
is paraphrase: 5%


Abstract and summary1: paraphrase probability increased, but only to 27%  
Kws and summary1: didn't work (5%)

In [15]:
paraphrase = tokenizer(s1_abstract, s2_abstract_summary[0]['summary_text'], return_tensors="tf")
paraphrase_classification_logits = model(paraphrase)[0]
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]

# Shouldn't be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

paraphrase = tokenizer(s1_kw, s2_abstract_summary[0]['summary_text'], return_tensors="tf")
paraphrase_classification_logits = model(paraphrase)[0]
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]

# Shouldn't be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

not paraphrase: 94%
is paraphrase: 6%
not paraphrase: 95%
is paraphrase: 5%


Abstract and summary2: high not paraphrase (94%), good  
Kws and summary2: high not paraphrase, but the paraphrase probability (5%) is as low as for the matching summary1, so kws don't separate the two

Ex 2.

In [16]:
paraphrase = tokenizer(s2_abstract, s2_abstract_summary[0]['summary_text'], return_tensors="tf")
paraphrase_classification_logits = model(paraphrase)[0]
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

paraphrase = tokenizer(s2_kw, s2_abstract_summary[0]['summary_text'], return_tensors="tf")
paraphrase_classification_logits = model(paraphrase)[0]
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

not paraphrase: 6%
is paraphrase: 94%
not paraphrase: 95%
is paraphrase: 5%


Abstract and summary2: high paraphrase probability (94%), good  
Kws and summary2: didn't work (5%)

In [17]:
paraphrase = tokenizer(s2_abstract, s1_abstract_summary[0]['summary_text'], return_tensors="tf")
paraphrase_classification_logits = model(paraphrase)[0]
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]

# Shouldn't be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

paraphrase = tokenizer(s2_kw, s1_abstract_summary[0]['summary_text'], return_tensors="tf")
paraphrase_classification_logits = model(paraphrase)[0]
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]

# Shouldn't be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

not paraphrase: 94%
is paraphrase: 6%
not paraphrase: 94%
is paraphrase: 6%


Abstract and summary1: high not paraphrase (94%), good  
Kws and summary1: high not paraphrase, but the paraphrase probability (6%) is about as low as for the matching summary2, so kws don't separate the two

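Findings: Comparing an abstract with its own summary may work. The matching pairs scored 27% and 94% "is paraphrase", against 6% for the mismatched pairs. Kws scored 5-6% against every summary and 6% against either abstract, so they don't discriminate.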
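
The comparison behind this finding can be tabulated from the outputs of In [14] to In [17]. This is a small sketch: the percentages are copied from those outputs, while the dictionary layout and the names `is_matched` and `score_range` are illustrative assumptions.

```python
# "is paraphrase" percentages copied from the outputs of In [14] to In [17].
# Keys are (left text, summary) pairs; the trailing digit identifies the paper.
is_paraphrase_pct = {
    ("abstract1", "summary1"): 27,
    ("kw1", "summary1"): 5,
    ("abstract1", "summary2"): 6,
    ("kw1", "summary2"): 5,
    ("abstract2", "summary2"): 94,
    ("kw2", "summary2"): 5,
    ("abstract2", "summary1"): 6,
    ("kw2", "summary1"): 6,
}

def is_matched(left, summary):
    """A pair is matched when both texts come from the same paper."""
    return left[-1] == summary[-1]

def score_range(kind, matched):
    """Return (min, max) percentage for pairs of one kind ('abstract' or 'kw')."""
    vals = [pct for (left, summ), pct in is_paraphrase_pct.items()
            if left.startswith(kind) and is_matched(left, summ) == matched]
    return min(vals), max(vals)

for kind in ("abstract", "kw"):
    print(kind,
          "matched:", score_range(kind, True),
          "mismatched:", score_range(kind, False))
```

Only the abstract rows show any gap between matched and mismatched pairs. With two papers this is an indication, not evidence.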
