# **Deep Q-Learning with an Atari-like game**

In [1]:

! pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.0-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.0 tokenizers-0.13.2 transformers-4.24.0


# **Fill-Mask**

In [2]:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("Paris is the [MASK] of France.")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.9969370365142822,
  'token': 3007,
  'token_str': 'capital',
  'sequence': 'paris is the capital of france.'},
 {'score': 0.000591486634220928,
  'token': 2540,
  'token_str': 'heart',
  'sequence': 'paris is the heart of france.'},
 {'score': 0.00043787568574771285,
  'token': 2415,
  'token_str': 'center',
  'sequence': 'paris is the center of france.'},
 {'score': 0.0003378352848812938,
  'token': 2803,
  'token_str': 'centre',
  'sequence': 'paris is the centre of france.'},
 {'score': 0.00026995784719474614,
  'token': 2103,
  'token_str': 'city',
  'sequence': 'paris is the city of france.'}]

In [3]:
>>> unmasker("Beijing is the heart of China. Paris is the [MASK] of France.")

[{'score': 0.6707967519760132,
  'token': 3007,
  'token_str': 'capital',
  'sequence': 'beijing is the heart of china. paris is the capital of france.'},
 {'score': 0.29512256383895874,
  'token': 2540,
  'token_str': 'heart',
  'sequence': 'beijing is the heart of china. paris is the heart of france.'},
 {'score': 0.013993088155984879,
  'token': 2415,
  'token_str': 'center',
  'sequence': 'beijing is the heart of china. paris is the center of france.'},
 {'score': 0.008663907647132874,
  'token': 2803,
  'token_str': 'centre',
  'sequence': 'beijing is the heart of china. paris is the centre of france.'},
 {'score': 0.0010524825192987919,
  'token': 2188,
  'token_str': 'home',
  'sequence': 'beijing is the heart of china. paris is the home of france.'}]

In [4]:
>>> unmasker("Beijing is the heart of China. Paris is the [MASK] of France. London is the heart of England.")

[{'score': 0.978805422782898,
  'token': 2540,
  'token_str': 'heart',
  'sequence': 'beijing is the heart of china. paris is the heart of france. london is the heart of england.'},
 {'score': 0.010187570005655289,
  'token': 3007,
  'token_str': 'capital',
  'sequence': 'beijing is the heart of china. paris is the capital of france. london is the heart of england.'},
 {'score': 0.006910012103617191,
  'token': 2415,
  'token_str': 'center',
  'sequence': 'beijing is the heart of china. paris is the center of france. london is the heart of england.'},
 {'score': 0.0018299514194950461,
  'token': 2803,
  'token_str': 'centre',
  'sequence': 'beijing is the heart of china. paris is the centre of france. london is the heart of england.'},
 {'score': 0.0003043338074348867,
  'token': 4563,
  'token_str': 'core',
  'sequence': 'beijing is the heart of china. paris is the core of france. london is the heart of england.'}]

# **Fill-Mask**
Bert-base-uncased is a fill-mask model: it predicts the masked part of a sentence. It is an English-language model pre-trained with a masked language modeling (MLM) objective.

Masked Language Modeling (MLM):

Given a sentence, the model randomly masks 15% of the words in the input, runs the entire masked sentence through the network, and has to predict the masked words. This differs from traditional recurrent neural networks (RNNs), which usually see words one after another, and from autoregressive models such as GPT, which internally mask future tokens. It allows the model to learn a bidirectional representation of the sentence.

Training data:

The Bert-base-uncased model was pre-trained on BookCorpus, a dataset of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables, and headers).

Preprocessing:

15% of the tokens are masked.
In 80% of the cases, the masked tokens are replaced by [MASK].
In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.
In the 10% remaining cases, the masked tokens are left as is.
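
The masking rule above can be sketched in a few lines of plain Python. This is only an illustration, not the original BERT preprocessing code: the function name `mask_tokens`, the toy vocabulary, and the choice to select each position independently with probability 15% are assumptions made for the example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Apply the 80/10/10 masking rule to a list of tokens.

    Returns (corrupted, labels). labels[i] holds the original token at
    every position selected for prediction and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            # 10%: replace with a random token different from the original
            corrupted[i] = rng.choice([v for v in vocab if v != tok])
        # remaining 10%: leave the token as is
    return corrupted, labels


tokens = "paris is the capital of france .".split()
vocab = sorted(set(tokens) | {"heart", "center", "city"})  # toy vocabulary
# mask_prob is raised to 0.5 here only so that a short sentence shows some masking
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

The labels list is what the model would be trained to predict; unselected positions carry no training signal.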

Pretraining:

The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, β1=0.9 and β2=0.999, a weight decay of 0.01, learning rate warmup for 10,000 steps and linear decay of the learning rate after.
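
To make the learning rate schedule concrete, here is a minimal sketch using the numbers from the paragraph above (peak rate 1e-4, 10,000 warmup steps, one million steps in total). The function name `learning_rate` is hypothetical, and the sketch assumes the linear decay reaches zero exactly at the final step, which the description does not state.

```python
def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to peak_lr, then linear decay (assumed to end at zero)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(f"step {step:>9,}: lr = {learning_rate(step):.2e}")
```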

Performance:

The three inputs above show how context shifts the prediction. With only one sentence, the word capital is by far the most likely filling (99.69%), which is consistent with everyday usage.
With two sentences, where the additional sentence follows the same pattern but uses the word heart, capital is still the most likely filling, but its probability drops to 67.08% while the probability of heart rises sharply (from 0.06% to 29.51%).
With three sentences, where the two additional sentences follow the same pattern and both use the word heart, heart becomes the most likely filling (97.88%) and the probability of capital becomes very small (1.02%).
In summary, I think this model performs very well. Its predictions reflect not only how often a sentence pattern occurs in everyday language but also the influence of the surrounding context on word choice.
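
The comparison can be reproduced from the pipeline outputs with a small helper. The sketch below is an illustration only: `top_fillings` is a hypothetical name, and the dictionaries are trimmed copies of the top two candidates from each of the three runs above.

```python
def top_fillings(results, k=2):
    """Return the k highest-scoring (token_str, percent) pairs."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(100 * r["score"], 2)) for r in ranked[:k]]


# Scores copied from the three fill-mask outputs (top two candidates each).
runs = {
    "one sentence": [
        {"token_str": "capital", "score": 0.9969370365142822},
        {"token_str": "heart", "score": 0.000591486634220928},
    ],
    "two sentences": [
        {"token_str": "capital", "score": 0.6707967519760132},
        {"token_str": "heart", "score": 0.29512256383895874},
    ],
    "three sentences": [
        {"token_str": "heart", "score": 0.978805422782898},
        {"token_str": "capital", "score": 0.010187570005655289},
    ],
}

for name, results in runs.items():
    print(name, top_fillings(results))
```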

# **Question Answering**

In [5]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "deepset/roberta-base-squad2"

# a) Get predictions
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'Why is model conversion important?',
    'context': 'The option to convert models between FARM and transformers gives freedom to the user and let people easily switch between frameworks.'
}
res = nlp(QA_input)
print(res)
QA_input = {
    'question': 'Where do I live?',
    'context': 'My name is Sarah and I live in London.'
}
res = nlp(QA_input)
print(res)
QA_input = {
    'question': 'What is my name?',
    'context': 'My name is Clara and I live in Berkeley.'
}
res = nlp(QA_input)
print(res)
QA_input = {
    'question': 'Which name is also used to describe the Amazon rainforest in English?',
    'context': 'The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain "Amazonas" in their names. The Amazon represents over half of the planets remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species.'
}
res = nlp(QA_input)
print(res)

{'score': 0.2117147445678711, 'start': 59, 'end': 84, 'answer': 'gives freedom to the user'}
{'score': 0.9070461988449097, 'start': 31, 'end': 37, 'answer': 'London'}
{'score': 0.933128833770752, 'start': 11, 'end': 16, 'answer': 'Clara'}
{'score': 0.7416028380393982, 'start': 201, 'end': 230, 'answer': 'Amazonia or the Amazon Jungle'}


# **Question Answering**

Roberta-base-squad2 is a question answering model. It extracts the answer to a question from a given context. It is an English-language model built on the roberta-base language model and fine-tuned on SQuAD 2.0.

Training data: SQuAD 2.0

Eval data: SQuAD 2.0

Hyperparameters:

batch_size = 96

n_epochs = 2

base_LM_model = "roberta-base"

max_seq_len = 386

learning_rate = 3e-5

lr_schedule = LinearWarmup

warmup_proportion = 0.2

doc_stride=128

max_query_length=64

Performance:

The four inputs above cover different kinds of questions. The first question is very broad, and nothing in the context matches its wording exactly. The model still extracts a span that answers the question from the meaning of the context, but the score of this answer is low (21.17%). The second and third questions are straightforward and use the same wording as the context, so each has a definite answer, and the scores are very high (90.70% and 93.31%). The fourth question also has a precise answer. Although its context is very long, the model still extracts a correct answer with a fairly high score (74.16%).
In summary, I think this model performs well. It answers questions with definite answers correctly. On open questions it gives the most suitable answer it can find and reflects the uncertainty through a lower score. Answer extraction from long, complex passages is also accurate enough.
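
Because the model is extractive, the `start` and `end` fields in each result are character offsets into the context. A minimal check, using two of the predictions printed above (the helper name `answer_span` is made up for this sketch):

```python
def answer_span(context, prediction):
    """Slice the context with the predicted character offsets."""
    return context[prediction["start"]:prediction["end"]]


# Contexts and predictions copied from the question answering cell above.
examples = [
    ("My name is Sarah and I live in London.",
     {"score": 0.9070461988449097, "start": 31, "end": 37, "answer": "London"}),
    ("My name is Clara and I live in Berkeley.",
     {"score": 0.933128833770752, "start": 11, "end": 16, "answer": "Clara"}),
]

for context, pred in examples:
    span = answer_span(context, pred)
    print(f"{span!r} matches answer: {span == pred['answer']}")
```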

# **Summarization**

In [6]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=150, min_length=30, do_sample=False))
print(summarizer(ARTICLE, max_length=50, min_length=30, do_sample=False))
print(summarizer(ARTICLE, max_length=50, min_length=10, do_sample=False))
print(summarizer(ARTICLE, max_length=30, min_length=10, do_sample=False))

ARTICLE = """ The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. 
Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. 
It was the first structure to reach a height of 300 metres. 
Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). 
Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.
"""
print(summarizer(ARTICLE, max_length=150, min_length=30, do_sample=False))
print(summarizer(ARTICLE, max_length=50, min_length=30, do_sample=False))
print(summarizer(ARTICLE, max_length=50, min_length=10, do_sample=False))
print(summarizer(ARTICLE, max_length=30, min_length=10, do_sample=False))

[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]
[{'summary_text': 'Liana Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men, and at one time, she was married to eight men at once.'}]
[{'summary_text': 'Liana Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men, and at one time, she was married to eight men at once.'}]
[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree"'}]
[{'summary_text': 'The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building. Its base is 

# **Summarization**


Bart-large-cnn is a summarization model. It is used to generate a summary of a long text. BART is a transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. BART is pre-trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text.

BART is particularly effective when fine-tuned for text generation (e.g. summarization, translation) but also works well for comprehension tasks (e.g. text classification, question answering). This particular checkpoint has been fine-tuned on CNN Daily Mail, a large collection of text-summary pairs.

Performance:

I tested the two articles by varying one parameter at a time. First I fixed the minimum length at 30 and set the maximum length to 150 and then 50. Then I fixed the minimum length at 10 and set the maximum length to 50 and then 30. Comparing the results with the articles shows that each summary extracts the important information from the whole article. From these results, the model appears to favor brevity: it tries to stay close to the minimum length (when the minimum is 30, the length of the result stays closer to 30 regardless of the maximum). But when the minimum length is very small, the model does not blindly approach it. To avoid losing necessary information, it lengthens the result as needed (when the minimum is 10, which is too short to be useful, the length of the result is determined by the content of the article).
In summary, I think this model performs very well. It summarizes the articles within the given length limits while still extracting the important information, and it does not sacrifice important information just to use fewer words. This model does a good job of balancing the minimum length against the content of the article.
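
One way to compare the lengths is to count the words in each summary. The sketch below does this for two of the summaries printed above. Note that `max_length` and `min_length` are measured in tokens, so whitespace-separated words are only a rough stand-in; the helper name `word_count` is an assumption of this example.

```python
def word_count(text):
    """Count whitespace-separated words (a rough proxy for tokens)."""
    return len(text.split())


# Summaries copied from the first article's outputs, keyed by (max_length, min_length).
summaries = {
    (150, 30): 'Liana Barrientos, 39, is charged with two counts of "offering a false '
               'instrument for filing in the first degree" In total, she has been married '
               '10 times, with nine of her marriages occurring between 1999 and 2002. '
               'She is believed to still be married to four men.',
    (30, 10): 'Liana Barrientos, 39, is charged with two counts of "offering a false '
              'instrument for filing in the first degree"',
}

for (max_len, min_len), text in summaries.items():
    print(f"max_length={max_len}, min_length={min_len}: {word_count(text)} words")
```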



# **Text Classification**



In [8]:
from transformers import pipeline
classifier = pipeline("text-classification",model='bhadresh-savani/distilbert-base-uncased-emotion', return_all_scores=True)
prediction = classifier("I like you. I love you.")
print(prediction)
prediction = classifier("I do not like you. I hate you.")
print(prediction)
prediction = classifier("I love you. But I hate you.")
print(prediction)
prediction = classifier("I hate you. But I love you.")
print(prediction)



[[{'label': 'sadness', 'score': 0.001765760243870318}, {'label': 'joy', 'score': 0.03388984501361847}, {'label': 'love', 'score': 0.9624289870262146}, {'label': 'anger', 'score': 0.0010959552600979805}, {'label': 'fear', 'score': 0.0003971168480347842}, {'label': 'surprise', 'score': 0.0004224594740662724}]]
[[{'label': 'sadness', 'score': 0.03216877207159996}, {'label': 'joy', 'score': 0.0019054835429415107}, {'label': 'love', 'score': 0.00438679987564683}, {'label': 'anger', 'score': 0.9600632190704346}, {'label': 'fear', 'score': 0.0011310899863019586}, {'label': 'surprise', 'score': 0.00034457596484571695}]]
[[{'label': 'sadness', 'score': 0.07416512072086334}, {'label': 'joy', 'score': 0.006289541255682707}, {'label': 'love', 'score': 0.07611513882875443}, {'label': 'anger', 'score': 0.8396787047386169}, {'label': 'fear', 'score': 0.0027262584771960974}, {'label': 'surprise', 'score': 0.0010252043139189482}]]
[[{'label': 'sadness', 'score': 0.018142443150281906}, {'label': 'joy', 

# **Text Classification**

Distilbert-base-uncased-emotion is a text classification model. It is used to classify the emotion expressed in a text. DistilBERT was created through knowledge distillation during the pre-training phase, which reduced the size of the BERT model by 40% while retaining 97% of its language understanding ability. It is smaller and faster than BERT and other BERT-based models.

Distilbert-base-uncased was fine-tuned on the emotion dataset using the HuggingFace Trainer with the following hyperparameters:

learning rate 2e-5

batch size 64

num_train_epochs=8

Performance:

To test the model, I tried four kinds of input. The first text contains only positive expressions (like, love), and the share of love in the result is extremely high (96.24%). The second text contains only negative expressions (do not like, hate), and anger accounts for a very high share (96.01%). The third text makes a negative turn: it includes a positive expression (love), but the negative emotion (hate) is emphasized after the turn, so anger dominates (83.97%). The fourth text makes a positive turn: it contains a negative expression (hate), but the positive emotion (love) is emphasized after the turn, so love accounts for a high share (95.18%).

In summary, I think this model performs very well. It classifies texts with a single emotion correctly and also judges texts with an emotional turn correctly, which covers the needs of most texts.
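
The percentages quoted above are simply the largest score in each prediction. As a small illustration, the sketch below picks the top label from the first prediction (copied from the output above); `top_emotion` is a hypothetical helper name.

```python
def top_emotion(scores):
    """Return the (label, score) pair with the highest score."""
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"]


# Inner list of the first prediction ("I like you. I love you.").
scores = [
    {"label": "sadness", "score": 0.001765760243870318},
    {"label": "joy", "score": 0.03388984501361847},
    {"label": "love", "score": 0.9624289870262146},
    {"label": "anger", "score": 0.0010959552600979805},
    {"label": "fear", "score": 0.0003971168480347842},
    {"label": "surprise", "score": 0.0004224594740662724},
]

label, score = top_emotion(scores)
print(f"{label}: {score:.2%}")
# The six scores form a probability distribution, so they sum to about 1.
print("sum of scores:", round(sum(s["score"] for s in scores), 4))
```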

# **Text Generation**

In [9]:
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='gpt2')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, I'm writing a new language for you. But first, I'd like to tell you about the language itself"},
 {'generated_text': "Hello, I'm a language model, and I'm trying to be as expressive as possible. In order to be expressive, it is necessary to know"},
 {'generated_text': "Hello, I'm a language model, so I don't get much of a license anymore, but I'm probably more familiar with other languages on that"},
 {'generated_text': "Hello, I'm a language model, a functional model... It's not me, it's me!\n\nI won't bore you with how"},
 {'generated_text': "Hello, I'm a language model, not an object model.\n\nIn a nutshell, I need to give language model a set of properties that"}]

In [10]:
>>> generator("Once upon a time,", max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Once upon a time, one of the two big players in our league who were under heavy scrutiny at that point in time was Tom Brady. In the'},
 {'generated_text': 'Once upon a time, a great many of her people, like her daughters, lost so little, that nothing was spoken of them in the days of'},
 {'generated_text': 'Once upon a time, these characters all worked for a different corporation who was just like them. But now, they can still keep their jobs. And'},
 {'generated_text': 'Once upon a time, it is possible to see through the veil of confusion. From time to time, the veil of confusion has become visible, if'},
 {'generated_text': 'Once upon a time, the world was becoming a place where everything and everything happened according to rules the gods had set in place."\n\nAs of'}]

# **Text Generation**

GPT-2 is a text generation model. It is used to continue a text from a short prompt. It is a transformer model pre-trained on a very large corpus of English data in a self-supervised fashion. This means it was pre-trained on raw text only, with no human labeling of any kind (which is why it can use a lot of publicly available data), using an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in a sentence.

In more detail, the inputs are sequences of continuous text of a certain length, and the targets are the same sequences shifted one token (a word or piece of a word) to the right. The model uses a masking mechanism internally to make sure that the prediction for token i only uses the inputs from 1 to i and not the future tokens.

In this way, the model learns an internal representation of the English language that can then be used to extract features useful for downstream tasks. However, the model is best at what it was pre-trained for, which is generating text from a prompt.
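
The shift and the masking described here can be illustrated without any model. In the sketch below, the function names are made up for the example, and whole words stand in for tokens (GPT-2 itself works on byte-level BPE tokens).

```python
def shift_targets(tokens):
    """Split a sequence into model inputs and next-token targets."""
    return tokens[:-1], tokens[1:]


def causal_mask(n):
    """mask[i][j] is True when position i may look at position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


tokens = ["Once", "upon", "a", "time", ","]
inputs, targets = shift_targets(tokens)

for i, (row, target) in enumerate(zip(causal_mask(len(inputs)), targets)):
    visible = [tok for tok, keep in zip(inputs, row) if keep]
    print(f"position {i}: sees {visible} -> predicts {target!r}")
```

Each position only sees itself and earlier inputs, which is what lets the model be trained on every position of a sequence at once without leaking the answer.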

Training data:

The OpenAI team wanted to train this model on a corpus as large as possible. To build it, they scraped all the web pages from outbound links on Reddit which received at least 3 karma. Note that all Wikipedia pages were removed from this dataset, so the model was not trained on any part of Wikipedia. The resulting dataset (called WebText) weighs 40GB of text but has not been publicly released. OpenAI has published a list of the top 1,000 domains present in WebText.

Performance:

I tested two prompts, generating five texts for each. The results show that the generated texts differ greatly from one another. In other words, the output is diverse, and no duplicate text is generated. But because the text is generated token by token, it stops abruptly once it reaches max_length, which prevents us from getting a complete text.
In summary, I think the performance of this model is not good enough. Although it achieves unique and diverse generations, it cannot produce a complete text within the length limit, so it does not fully meet our needs.
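
One simple workaround for the abrupt ending, offered here only as a sketch and not as part of the model or the pipeline, is to cut the generated text back to its last complete sentence. The helper `trim_to_last_sentence` is hypothetical, and the example string is the first generation from the output above.

```python
def trim_to_last_sentence(text):
    """Drop any trailing fragment after the last '.', '!' or '?'."""
    cut = max(text.rfind(mark) for mark in ".!?")
    # If no sentence-ending mark exists, return the text unchanged.
    return text if cut == -1 else text[: cut + 1]


generated = ("Hello, I'm a language model, I'm writing a new language for you. "
             "But first, I'd like to tell you about the language itself")
print(trim_to_last_sentence(generated))
```

This loses the unfinished final sentence, so it trades length for completeness.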



# **Text2Text Generation**

In [11]:
from transformers import pipeline

nlp = pipeline("text2text-generation", model='Salesforce/mixqg-large', tokenizer='Salesforce/mixqg-large')
    
CONTEXT = "In the late 17th century, Robert Boyle proved that air is necessary for combustion."
ANSWER = "Robert Boyle"

def format_inputs(context: str, answer: str):
    return f"{answer} \\n {context}"
    
text = format_inputs(CONTEXT, ANSWER)

nlp(text)

[{'generated_text': 'Who proved that air is necessary for combustion?'}]

In [12]:
CONTEXT = "In the late 17th century, Robert Boyle proved that air is necessary for combustion."
ANSWER = "Combustion"
    
text = format_inputs(CONTEXT, ANSWER)
nlp(text)

[{'generated_text': 'What does air do?'}]

In [13]:
CONTEXT = "In the late 17th century, Robert Boyle proved that air is necessary for combustion."
ANSWER = "In the late 17th century"
    
text = format_inputs(CONTEXT, ANSWER)
nlp(text)

[{'generated_text': 'When did Robert Boyle prove that air is necessary for combustion?'}]

# **Text2Text Generation**

Mixqg-large is a text-to-text generation model. It is used to generate new text from an existing context (here, a question from an answer and a context).

Asking good questions is an essential ability for both human and machine intelligence. Existing neural question generation approaches mainly focus on short, factoid-type answers. To bridge this gap, the authors propose a neural question generator, MixQG. It combines 9 question answering datasets with diverse answer types, including yes/no, multiple-choice, extractive, and abstractive answers, to train a single generative model. The authors show with empirical results that the model outperforms existing work in both seen and unseen domains and can generate questions with different cognitive levels when conditioned on different answer types.

Performance:

I tested it by using different parts of the same context as the answer. When the answer is an important part of the context, the generated question is perfect. But when the answer is a single, less important word, the generated question is logically sound yet only loosely connected to the main point of the context. In other words, it is not the best possible question.
In summary, I think this model performs well, but it could be even better. It generates the best question in most cases, but it could be improved so that, even when a single word is given as the answer, it still produces the best question for the context as a whole.
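
The three tests differ only in the answer, so the inputs can be built in one pass. The sketch below repeats the `format_inputs` helper from the cells above and prepares one input per candidate answer; the list name `candidate_answers` is an assumption of this example, and no model is called.

```python
def format_inputs(context: str, answer: str) -> str:
    # Same "answer \n context" layout as the helper in the notebook cell;
    # the separator is a literal backslash followed by "n", not a newline.
    return f"{answer} \\n {context}"


CONTEXT = "In the late 17th century, Robert Boyle proved that air is necessary for combustion."
candidate_answers = ["Robert Boyle", "Combustion", "In the late 17th century"]

batch = [format_inputs(CONTEXT, answer) for answer in candidate_answers]
for text in batch:
    print(text)
```

Each string in `batch` has the same shape as the `text` value passed to the pipeline in the cells above.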

# **Token Classification**

In [14]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin."

ner_results = nlp(example)
print(ner_results)
example = "My name is Sarah and I work in Apple."

ner_results = nlp(example)
print(ner_results)
example = "My name is Clara and I live in Berkeley, California."

ner_results = nlp(example)
print(ner_results)

[{'entity': 'B-PER', 'score': 0.99913067, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}, {'entity': 'B-LOC', 'score': 0.99953353, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]
[{'entity': 'B-PER', 'score': 0.99879754, 'index': 4, 'word': 'Sarah', 'start': 11, 'end': 16}, {'entity': 'B-ORG', 'score': 0.99914837, 'index': 9, 'word': 'Apple', 'start': 31, 'end': 36}]
[{'entity': 'B-PER', 'score': 0.99641764, 'index': 4, 'word': 'Clara', 'start': 11, 'end': 16}, {'entity': 'B-LOC', 'score': 0.996198, 'index': 9, 'word': 'Berkeley', 'start': 31, 'end': 39}, {'entity': 'B-LOC', 'score': 0.9990196, 'index': 11, 'word': 'California', 'start': 41, 'end': 51}]


# **Token Classification**

Bert-base-NER is a token classification model. It is used to recognize named entities of different types in a text. Bert-base-NER is a fine-tuned BERT model that is ready to use for Named Entity Recognition and achieves state-of-the-art performance for the NER task. It has been trained to recognize four types of entities: location (LOC), organization (ORG), person (PER) and miscellaneous (MISC). This model is a bert-base-cased model that was fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.

| Abbreviation | Description |
| --- | --- |
| O | Outside of a named entity |
| B-MIS | Beginning of a miscellaneous entity right after another miscellaneous entity |
| I-MIS | Miscellaneous entity |
| B-PER | Beginning of a person’s name right after another person’s name |
| I-PER | Person’s name |
| B-ORG | Beginning of an organization right after another organization |
| I-ORG | Organization |
| B-LOC | Beginning of a location right after another location |
| I-LOC | Location |

Performance: I tested three similar sentences. The model distinguished person names, organizations, and locations in all of them, and every predicted tag had a confidence score above 0.99. These are per-token confidence scores rather than a measured accuracy, so they show that the model is confident on these examples, not how often it is right in general.
I think this model performs very well. Pre-training on a very large corpus, followed by fine-tuning on CoNLL-2003, likely makes its analysis of new sentences and text more reliable and accurate.
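
The B-/I- tags described above can be merged into whole entity spans. The following is a minimal sketch, assuming token-level results shaped like the output printed earlier; the helper name `group_entities` is hypothetical and is not part of the transformers API.

```python
def group_entities(tokens):
    """Merge token-level B-/I- tags into entity spans.

    `tokens` is a list of dicts with 'entity', 'word', 'start' and 'end'
    keys, the same keys that appear in the NER output above.
    """
    spans = []
    current = None
    for tok in tokens:
        if tok["entity"] == "O":
            # Outside any entity: close the open span, if there is one.
            current = None
            continue
        prefix, _, etype = tok["entity"].partition("-")
        if prefix == "I" and current is not None and current["type"] == etype:
            # Continue the open span. Joining with a space is a
            # simplification that ignores word-piece markers such as '##'.
            current["text"] += " " + tok["word"]
            current["end"] = tok["end"]
        else:
            # A B- tag, or an I- tag with no matching open span, starts a new one.
            current = {"type": etype, "text": tok["word"],
                       "start": tok["start"], "end": tok["end"]}
            spans.append(current)
    return spans


# Token-level results copied from the third sentence above (scores omitted).
tokens = [
    {"entity": "B-PER", "word": "Clara", "start": 11, "end": 16},
    {"entity": "B-LOC", "word": "Berkeley", "start": 31, "end": 39},
    {"entity": "B-LOC", "word": "California", "start": 41, "end": 51},
]
for span in group_entities(tokens):
    print(span["type"], span["text"], span["start"], span["end"])
```

Because "Berkeley" and "California" both carry a B-LOC tag, this sketch keeps them as two separate locations instead of one.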

# **Translation**

In [15]:
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)


def translate(text, target_language):
    # T5 selects the task from a text prefix. Besides German, you can also
    # use "translate English to French" and "translate English to Romanian".
    prompt = "translate English to " + target_language + ": " + text
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids  # Batch size 1
    outputs = model.generate(input_ids)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# `text` is used instead of `input` so the Python builtin is not shadowed.
text = "My name is Azeem and I live in India"
print(translate(text, "German"))

text = "My name is Sarah and I live in London"
print(translate(text, "German"))

# The French and Romanian translations are each run twice on the same
# sentence, so each of them appears twice in the output below.
print(translate(text, "French"))
print(translate(text, "French"))

print(translate(text, "Romanian"))
print(translate(text, "Romanian"))

Mein Name ist Azeem und ich lebe in Indien.
Mein Name ist Sarah und ich lebe in London.
Mon nom est Sarah et je résidais à Londres.
Mon nom est Sarah et je résidais à Londres.
Numele meu este Sarah şi eu locuiesc la Londra
Numele meu este Sarah şi eu locuiesc la Londra


# **Translation**

T5-small is a text-to-text model that is used here to translate text into a target language. Transfer learning, where a model is first pre-trained on a data-rich task and then fine-tuned on downstream tasks, has become a powerful technique in natural language processing (NLP). Its effectiveness has given rise to a variety of approaches, methodologies, and practices. The T5 paper explores the landscape of transfer learning for NLP by introducing a unified framework that converts every language problem into a text-to-text format. The study systematically compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks.
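
To make the text-to-text idea concrete, here is a small sketch of how a task can be encoded as a plain-text prefix. The translation prefix is the one used in the cell above; the helper name `build_prompt` and the `summarize: ` prefix are assumptions added for illustration.

```python
def build_prompt(task, text, target_language=None):
    """Turn a task description plus input text into a single input string."""
    if task == "translate":
        if target_language is None:
            raise ValueError("translation needs a target language")
        # Same prefix as in the translation cell above.
        return "translate English to {}: {}".format(target_language, text)
    if task == "summarize":
        # Assumed prefix for summarization; not used elsewhere in this notebook.
        return "summarize: " + text
    raise ValueError("unknown task: " + task)


print(build_prompt("translate", "My name is Sarah and I live in London", "German"))
print(build_prompt("summarize", "A long article about Jupiter's moons."))
```

Every task produces an input string and expects an output string, so the same model, loss, and decoding procedure can be reused across tasks.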

Performance:

I translated two English sentences into German, and one of them also into French and Romanian. I then compared the results with Google Translate and with local language habits. The translations are largely accurate, although the French output uses the imperfect tense ("je résidais", "I was residing") where the present tense would be expected. Rather than translating word by word, the model translates the meaning of the whole sentence, so I think it performs very well. Of course, this is partly because these languages are structurally close to English. For a language such as Chinese, which differs greatly from English in its foundations, it remains to be tested whether the model would perform as well.

# **Zero-Shot Classification**

In [16]:
from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels)


{'sequence': 'one day I will see the world',
 'labels': ['travel', 'dancing', 'cooking'],
 'scores': [0.9938650727272034, 0.0032737930305302143, 0.002861041110008955]}

In [17]:
sequence_to_classify = "I have a problem with my iphone that needs to be resolved asap!!"
candidate_labels = ['urgent', 'not urgent', 'phone', 'tablet', 'computer']
classifier(sequence_to_classify, candidate_labels)

{'sequence': 'I have a problem with my iphone that needs to be resolved asap!!',
 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'],
 'scores': [0.5036356449127197,
  0.4787997007369995,
  0.012600112706422806,
  0.0026557878591120243,
  0.0023087558802217245]}

In [18]:
sequence_to_classify = "Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app."
candidate_labels = ['mobile', 'website', 'billing', 'account access']
classifier(sequence_to_classify, candidate_labels)

{'sequence': 'Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app.',
 'labels': ['mobile', 'account access', 'billing', 'website'],
 'scores': [0.9600788354873657,
  0.0168320219963789,
  0.014393892139196396,
  0.00869521964341402]}

In [19]:
sequence_to_classify = "A new model offers an explanation for how the Galilean satellites formed around the solar system’s largest world. Konstantin Batygin did not set out to solve one of the solar system’s most puzzling mysteries when he went for a run up a hill in Nice, France. Dr. Batygin, a Caltech researcher, best known for his contributions to the search for the solar system’s missing “Planet Nine,” spotted a beer bottle. At a steep, 20 degree grade, he wondered why it wasn’t rolling down the hill. He realized there was a breeze at his back holding the bottle in place. Then he had a thought that would only pop into the mind of a theoretical astrophysicist: “Oh! This is how Europa formed.” Europa is one of Jupiter’s four large Galilean moons. And in a paper published Monday in the Astrophysical Journal, Dr. Batygin and a co-author, Alessandro Morbidelli, a planetary scientist at the Côte d’Azur Observatory in France, present a theory explaining how some moons form around gas giants like Jupiter and Saturn, suggesting that millimeter-sized grains of hail produced during the solar system’s formation became trapped around these massive worlds, taking shape one at a time into the potentially habitable moons we know today."
candidate_labels = ['space & cosmos', 'scientific discovery', 'microbiology', 'robots', 'archeology']
classifier(sequence_to_classify, candidate_labels)

{'sequence': 'A new model offers an explanation for how the Galilean satellites formed around the solar system’s largest world. Konstantin Batygin did not set out to solve one of the solar system’s most puzzling mysteries when he went for a run up a hill in Nice, France. Dr. Batygin, a Caltech researcher, best known for his contributions to the search for the solar system’s missing “Planet Nine,” spotted a beer bottle. At a steep, 20 degree grade, he wondered why it wasn’t rolling down the hill. He realized there was a breeze at his back holding the bottle in place. Then he had a thought that would only pop into the mind of a theoretical astrophysicist: “Oh! This is how Europa formed.” Europa is one of Jupiter’s four large Galilean moons. And in a paper published Monday in the Astrophysical Journal, Dr. Batygin and a co-author, Alessandro Morbidelli, a planetary scientist at the Côte d’Azur Observatory in France, present a theory explaining how some moons form around gas giants like Ju

# **Zero-Shot Classification**

Bart-large-mnli is a zero-shot classification model. It assigns a text to one or more of a given set of candidate labels without having been trained on those labels.

The approach uses a pre-trained natural language inference (NLI) model as a ready-made zero-shot sequence classifier. The sequence to be classified is posed as the NLI premise, and a hypothesis is constructed from each candidate label. For example, to evaluate whether a sequence belongs to the "politics" category, we can construct the hypothesis "This text is about politics." The probabilities of entailment and contradiction are then converted into label probabilities.
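
The conversion from NLI outputs to label probabilities can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline's actual code: the logits are made up, their order is assumed to be (contradiction, neutral, entailment), and the names `make_hypothesis` and `label_probabilities` and the hypothesis template are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def make_hypothesis(label, template="This example is {}."):
    # Assumed template; each candidate label is turned into one hypothesis.
    return template.format(label)


def label_probabilities(nli_logits, multi_label=False):
    """Map {label: (contradiction, neutral, entailment) logits} to probabilities."""
    labels = list(nli_logits)
    if multi_label:
        # Each label is judged on its own: entailment versus contradiction.
        return {label: softmax([nli_logits[label][0], nli_logits[label][2]])[1]
                for label in labels}
    # Single-label case: the entailment logits compete across all labels.
    probs = softmax([nli_logits[label][2] for label in labels])
    return dict(zip(labels, probs))


# Made-up logits for the premise "one day I will see the world".
example_logits = {
    "travel": (-2.0, 0.5, 3.0),
    "cooking": (2.5, 0.1, -2.0),
    "dancing": (2.0, 0.2, -1.8),
}
for label in example_logits:
    print(make_hypothesis(label))
print(label_probabilities(example_logits))
print(label_probabilities(example_logits, multi_label=True))
```

In the single-label case the scores sum to 1, which matches the score lists printed by the cells above. In the multi-label case each score is independent, so several labels can be close to 1 at once.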

Performance:

To test the performance of the model, I ran four tests. The first shows that the model can easily select the label that best matches the meaning of a sentence (travel: 99.39%). The second shows that when a sentence fits more than one label, the model balances the weight between them reasonably (urgent: 50.36%, phone: 47.88%). The third shows that when the candidate labels are similar (mobile, website), the model can still pick the more suitable one from the context (mobile: 96.01%) and assign a low probability to the less relevant one (website: 0.87%). The fourth indicates that the model can also relate the candidate labels to a long and complex passage and still give an accurate answer.
In summary, I think this model performs very well on these examples. Whether the input is a simple sentence or a long paragraph, the weight is distributed across the labels in a reasonable way.

# **Environment**


In [20]:
! pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125938 sha256=20ec4b46c8d382ab454c00debf30b18425cf68e92f78087820a939c0c8062b4f
  Stored in directory: /root/.cache/pip/wheels/bf/06/fb/d59c1e5bd1dac7f6cf61ec0036cc3a10ab8fecaa6b2c3d3ee9
Successfully built sentence-transformers
Installing collected packages: sentencepiece, sentence-transformers
Successfully installed sentenc

# **Sentence Similarity**

In [21]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.'
          ]

#Encode all sentences
embeddings = model.encode(sentences)

#Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

#Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim)-1):
    for j in range(i+1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

#Sort list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j]))

Top-5 most similar pairs:
A man is eating food. 	 A man is eating a piece of bread. 	 0.7553
A man is riding a horse. 	 A man is riding a white horse on an enclosed ground. 	 0.7369
A monkey is playing drums. 	 Someone in a gorilla costume is playing a set of drums. 	 0.6433
A woman is playing violin. 	 Someone in a gorilla costume is playing a set of drums. 	 0.2564
A man is eating food. 	 A man is riding a horse. 	 0.2474


# **Sentence Similarity**

All-MiniLM-L6-v2 is a sentence similarity model from the sentence-transformers library. It maps sentences and paragraphs to a 384-dimensional dense vector space, so the similarity of two texts can be measured by comparing their vectors, and it can be used for tasks like clustering or semantic search.
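
As a rough illustration of semantic search over such vectors, the sketch below ranks a small corpus by cosine similarity to a query. The 3-dimensional vectors are made-up stand-ins for real 384-dimensional embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, corpus_vecs, top_k=2):
    """Return the top_k (score, index) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(corpus_vecs)]
    scored.sort(reverse=True)
    return scored[:top_k]


# Toy vectors standing in for sentence embeddings.
corpus = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
query = [1.0, 0.0, 0.0]
for score, idx in semantic_search(query, corpus):
    print(idx, round(score, 4))
```

Cosine similarity depends only on the direction of the vectors, which is why the second corpus vector matches the query perfectly even though it is twice as long.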

The goal of this project was to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective. The model starts from the pre-trained nreimers/MiniLM-L6-H384-uncased model and fine-tunes it on a dataset of 1B sentence pairs. The contrastive objective works as follows: given one sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in the dataset.
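
The objective described above can be sketched as a cross-entropy loss over in-batch similarities. This is a simplified pure-Python illustration, not the authors' training code; the function name, the `scale` value of 20, and the toy 2-dimensional embeddings are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the true partner of anchors[i].

    Every other positives[j] in the batch acts as a randomly sampled negative.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, candidate) for candidate in positives]
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability assigned to the true partner.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: each anchor is close to its own partner.
anchors = [[1.0, 0.0], [0.0, 1.0]]
partners = [[0.9, 0.1], [0.1, 0.9]]
print(in_batch_contrastive_loss(anchors, partners))        # correct pairing
print(in_batch_contrastive_loss(anchors, partners[::-1]))  # swapped pairing
```

The loss is small when each sentence is most similar to its true partner and large when the pairing is swapped, which is the signal that pulls paired sentences together during fine-tuning.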

Hyper parameters:

The authors trained this model on a TPU v3-8. They trained it for 100k steps using a batch size of 1024 (128 per TPU core), with a learning rate warm-up of 500 steps. The sequence length was limited to 128 tokens. They used the AdamW optimizer with a 2e-5 learning rate.

Training data:

The authors fine-tuned this model on the concatenation of multiple datasets. The total number of sentence pairs is above 1 billion. Each dataset was sampled according to a weighted probability.

Performance:

To test the performance of the model, I entered 9 sentences and compared every pair of sentences in a loop. The 5 pairs with the highest similarity are printed together with their scores. Checking the meaning of the sentences by hand confirms that the model ranks the similarity between these 9 sentences sensibly.
I think the performance of this model is very good. It not only captures surface similarity (whether the wording is the same), but also gives higher scores to sentences that share a sentence pattern or a similar meaning. As a result, it produces a comprehensive and reasonable similarity score between sentences.

MIT License

Copyright (c) 2022 Changning Liu

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.