# Label and describe using an LLM

Download and instantiate an LLM from Hugging Face.

Load the LDA topic models. 

Prompt the LLM to generate a label and a description for each topic in the models.

In [51]:
import pickle
from transformers import pipeline


Load the topic models fitted in a previous notebook.

* lda_gw: Gravitational Waves topics
* lda_cscl: Computation and Language topics

In [2]:
with open('../models/lda_gw.pickle', 'rb') as handle:
    lda_gw = pickle.load(handle)

Get a list of topics from the model, each described by its MAX_WORDS most probable words.

* The result is a list of topics. Each topic is represented by a tuple.
* The first element of the tuple is the topic number (int).
* The second element of the tuple is a list of tuples.
* Each of these inner tuples holds a word characterising the topic (string) and its probability (float).

In [44]:
MAX_WORDS = 30
# Shape of the result: list[tuple[int, list[tuple[str, float]]]]
topics_gw = lda_gw.show_topics(num_words=MAX_WORDS, formatted=False)

In [45]:
# Join the words of the first topic into one space-separated string (probabilities are dropped)
topic_str = ' '.join(word for word, _prob in topics_gw[0][1])
topic_str

'field equation star mode theory neutron general state gravity solution energy effect magnetic matter quantum time result relativity non einstein frequency metric study spacetime radiation eos instability relativistic electromagnetic rotating'
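
The cell above builds a keyword string for the first topic only. A minimal sketch of how the same join could be applied to every topic follows. `topics_to_strings` is a hypothetical helper name, and `sample_topics` is a hand-made stand-in with the same shape as `topics_gw` (the probabilities are made up), so the snippet runs without the pickled model.

```python
def topics_to_strings(topics):
    """Map each topic number to a space-separated string of its words."""
    return {
        topic_id: ' '.join(word for word, _prob in word_probs)
        for topic_id, word_probs in topics
    }


# Hypothetical stand-in shaped like show_topics(formatted=False) output;
# the probabilities are illustrative values, not taken from the model.
sample_topics = [
    (0, [('field', 0.02), ('equation', 0.015), ('star', 0.01)]),
    (1, [('wave', 0.03), ('detector', 0.02)]),
]

topic_strings = topics_to_strings(sample_topics)
print(topic_strings[0])
```

With the real model, `topics_to_strings(topics_gw)` would give one string per topic, ready to be placed into a prompt.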

## Set up an LLM pipeline for labelling

In [31]:
from transformers import pipeline

model_name = 'google-bert/bert-base-uncased'
labeller = pipeline('fill-mask', model=model_name)
outputs = labeller(f"[MASK]: {topic_str}")
print(outputs)

Some weights of the model checkpoint at google-bert/bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.11313198506832123, 'token': 4973, 'token_str': 'examples', 'sequence': 'examples : field, equation, star, mode, theory'}, {'score': 0.03067971207201481, 'token': 5097, 'token_str': 'applications', 'sequence': 'applications : field, equation, star, mode, theory'}, {'score': 0.02635965868830681, 'token': 3602, 'token_str': 'note', 'sequence': 'note : field, equation, star, mode, theory'}, {'score': 0.024120096117258072, 'token': 4696, 'token_str': 'category', 'sequence': 'category : field, equation, star, mode, theory'}, {'score': 0.019342072308063507, 'token': 2742, 'token_str': 'example', 'sequence': 'example : field, equation, star, mode, theory'}]
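
The `fill-mask` pipeline returns a list of dicts, one per candidate token, as the output above shows. The sketch below pulls out just the candidate words and their scores. `top_candidates` is a hypothetical helper, and `sample_outputs` copies the first three entries printed above (without the `sequence` field).

```python
def top_candidates(outputs, k=3):
    """Return the k highest-scoring (token_str, rounded score) pairs."""
    ranked = sorted(outputs, key=lambda item: item['score'], reverse=True)
    return [(item['token_str'], round(item['score'], 3)) for item in ranked[:k]]


# First three entries of the recorded output above
sample_outputs = [
    {'score': 0.11313198506832123, 'token': 4973, 'token_str': 'examples'},
    {'score': 0.03067971207201481, 'token': 5097, 'token_str': 'applications'},
    {'score': 0.02635965868830681, 'token': 3602, 'token_str': 'note'},
]

candidates = top_candidates(sample_outputs, k=2)
print(candidates)
```

In this run the candidates are generic words ('examples', 'applications', 'note') rather than topic labels.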


In [32]:
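# Try the same BERT checkpoint for text generation.
# BERT is an encoder-only model, hence the `is_decoder` warning below.
model_name = 'google-bert/bert-base-uncased'
labeller = pipeline('text-generation', model=model_name)
prompt = f"Label this sequence of words {topic_str} is"
outputs = labeller(prompt, max_new_tokens=3)
print(outputs)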

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


[{'generated_text': 'Label this sequence of words field, equation, star, mode, theory is is so so'}]


In [38]:
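#model_name = 'google-bert/bert-base-uncased'
#model_name = 'cnicu/t5-small-booksum'
model_name = 't5-small'
labeller = pipeline('summarization', model=model_name)
# max_length=3 is below the default min_length (30) reported in the warning below;
# passing min_length explicitly as well would avoid that warning.
outputs = labeller(topic_str, max_length=3)
print(outputs)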

Your min_length=30 must be inferior than your max_length=3.


[{'summary_text': 'field,'}]


In [52]:
model_name = "gpt2"
labeller = pipeline("text-generation", model=model_name)
prompt = f"Which topic is described by these keywords (response should be between 1 and 12 words): {topic_str}"
outputs = labeller(prompt, max_new_tokens=10, pad_token_id=labeller.tokenizer.eos_token_id)
print(outputs)

[{'generated_text': 'Which topic is described by these keywords (response should be between 1 and 12 words): field equation star mode theory neutron general state gravity solution energy effect magnetic matter quantum time result relativity non einstein frequency metric study spacetime radiation eos instability relativistic electromagnetic rotating motion motion tectonic acceleration zero gravity system acceleration'}]
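
The `generated_text` above repeats the prompt before the model's continuation. A minimal sketch for keeping only the continuation follows. `strip_prompt` is a hypothetical helper, and the sample strings are shortened versions of the prompt and output above.

```python
def strip_prompt(generated_text, prompt):
    """Remove the leading prompt from a generated string, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


# Shortened stand-ins for the prompt and the recorded output above
prompt = 'Which topic is described by these keywords: field equation star'
generated = prompt + ' motion motion tectonic acceleration'

continuation = strip_prompt(generated, prompt)
print(continuation)
```

The `text-generation` pipeline call also accepts `return_full_text=False`, which should have the same effect without post-processing.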


In [40]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_ids = tokenizer.encode("summarize: " + topic_str, return_tensors='pt', max_length=512, truncation=True)
summary_ids = model.generate(input_ids, max_length=10)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("\nOriginal text\n", topic_str)
print("\nSummary\n", summary)


Original text
 field, equation, star, mode, theory

Summary
 field, equation, star, mode, theory


In [None]:
classifier = pipeline("zero-shot-classification", device='cpu', model="facebook/bart-large-mnli")
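
The zero-shot classifier above is instantiated but never called in this notebook. The sketch below shows how it might be used to pick a label from a fixed list. `best_label` is a hypothetical helper and the candidate labels are made up for illustration. `fake_classifier` is a stand-in that mimics the pipeline's result dictionary (`labels` sorted by descending `scores`), so the snippet runs without downloading the model.

```python
def best_label(classifier, text, candidate_labels):
    """Return the top-ranked label and its score from a zero-shot result."""
    result = classifier(text, candidate_labels=candidate_labels)
    return result['labels'][0], result['scores'][0]


def fake_classifier(text, candidate_labels):
    """Stand-in for the real pipeline: scores labels by word overlap with the text."""
    words = set(text.split())
    raw = [1 + sum(w in words for w in label.lower().split())
           for label in candidate_labels]
    total = sum(raw)
    ranked = sorted(zip(candidate_labels, raw), key=lambda pair: pair[1], reverse=True)
    return {
        'sequence': text,
        'labels': [label for label, _ in ranked],
        'scores': [score / total for _, score in ranked],
    }


keywords = 'field equation star neutron gravity'
labels = ['neutron star physics', 'machine translation', 'cosmology']  # illustrative only
label, score = best_label(fake_classifier, keywords, labels)
print(label, round(score, 2))
```

With the real pipeline, the call would be `best_label(classifier, topic_str, labels)`. This approach assumes a list of plausible labels is known in advance.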

In [49]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import nltk
nltk.download('punkt')

#model_name = 'fabiochiu/t5-small-medium-title-generation'
model_name = 'deep-learning-analytics/automatic-title-generation'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

#inputs = [f"Select a suitable label for these keywords: {topic_str}"]
inputs = [f"Which topic is described by these keywords (response should be between 1 and 12 words): {topic_str}"]
#inputs = text
#inputs = topic_str

inputs = tokenizer(inputs, max_length=512, truncation=True, return_tensors="pt")
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=1, max_length=12)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
predicted_title = nltk.sent_tokenize(decoded_output.strip())[0]

print(predicted_title)

[nltk_data] Downloading package punkt to /home/atroncos/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Field equation star mode theory neutron general state gravity solution


In [12]:
decoded_output

'Select the appropriate label for these keywords: field equation star mode theory neutron general state gravity solution'

In [50]:
text = """
Many financial institutions started building conversational AI, prior to the Covid19
pandemic, as part of a digital transformation initiative. These initial solutions
were high profile, highly personalized virtual assistants — like the Erica chatbot
from Bank of America. As the pandemic hit, the need changed as contact centers were
under increased pressures. As Cathal McGloin of ServisBOT explains in “how it started,
and how it is going,” financial institutions were looking for ways to automate
solutions to help get back to “normal” levels of customer service. This resulted
in a change from the “future of conversational AI” to a real tactical assistant
that can help in customer service. Haritha Dev of Wells Fargo, saw a similar trend.
Banks were originally looking to conversational AI as part of digital transformation
to keep up with the times. However, with the pandemic, it has been more about
customer retention and customer satisfaction. In addition, new use cases came about
as a result of Covid-19 that accelerated adoption of conversational AI. As Vinita
Kumar of Deloitte points out, banks were dealing with an influx of calls about new
concerns, like questions around the Paycheck Protection Program (PPP) loans. This
resulted in an increase in volume, without enough agents to assist customers, and
tipped the scale to incorporate conversational AI. When choosing initial use cases
to support, financial institutions often start with high volume, low complexity
tasks. For example, password resets, checking account balances, or checking the
status of a transaction, as Vinita points out. From there, the use cases can evolve
as the banks get more mature in developing conversational AI, and as the customers
become more engaged with the solutions. Cathal indicates another good way for banks
to start is looking at use cases that are a pain point, and also do not require a
lot of IT support. Some financial institutions may have a multi-year technology
roadmap, which can make it harder to get a new service started. A simple chatbot
for document collection in an onboarding process can result in high engagement,
and a high return on investment. For example, Cathal has a banking customer that
implemented a chatbot to capture a driver’s license to be used in the verification
process of adding an additional user to an account — it has over 85% engagement
with high satisfaction. An interesting use case Haritha discovered involved
educating customers on financial matters. People feel more comfortable asking a
chatbot what might be considered a “dumb” question, as the chatbot is less judgmental.
Users can be more ambiguous with their questions as well, not knowing the right
words to use, as chatbot can help narrow things down.
"""