https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

BERT is a pre-trained model developed by Google. It can be fine-tuned on new data to build NLP systems for tasks such as question answering, text generation, text classification, text summarization and sentiment analysis.

https://medium.com/saarthi-ai/build-a-smart-question-answering-system-with-fine-tuned-bert-b586e4cfa5f5

https://huggingface.co/transformers/pretrained_models.html

https://programmerbackpack.com/bert-nlp-using-distilbert-to-build-a-question-answering-system/


List of important packages to be used in NLP:

- NLTK

- transformers (for BERT)

- spaCy

- TextBlob

- string

- ...



What can we do with BERT today?

NER, Text summarization, BQA


NER stands for named-entity recognition.

With NER we can find which words in a sentence are named entities and what type each one is (person, organization, location and so on).

https://en.wikipedia.org/wiki/Named-entity_recognition

Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.







When we work with BERT, the first step is always to tokenize the text. A task pipeline then runs further steps on the tokens, for example token classification.
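
To make the tokenization step concrete, here is a minimal sketch of greedy longest-match-first subword splitting, the general idea behind WordPiece tokenization. It is not the tokenizer from the transformers library: the function name `wordpiece_tokenize` and the tiny vocabulary `toy_vocab` are assumptions made up for illustration, chosen so that the splits match the ones that appear in this notebook's NER output.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # nothing matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not BERT's real vocabulary).
toy_vocab = {"Ra", "##b", "##jo", "##t", "Singh", "Lamb", "##ton", "College"}

for word in ["Rabjot", "Singh", "Lambton", "College"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

A rare name such as "Rabjot" is not in the vocabulary as a whole word, so it comes out as several pieces, and a token classifier then assigns one label per piece rather than one per name.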



In [1]:
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Load a BERT model fine-tuned for token classification (NER) on CoNLL-2003
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")


In [2]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
nlp = pipeline('ner', model=model, tokenizer=tokenizer)



In [3]:
a = nlp('Enzo works at the Australian National University (ANU)')

In [None]:
type(a)

list

In [4]:
nlp('David works at Google.')

[{'word': 'David',
  'score': 0.9979717135429382,
  'entity': 'I-PER',
  'index': 1,
  'start': 0,
  'end': 5},
 {'word': 'Google',
  'score': 0.9990472793579102,
  'entity': 'I-ORG',
  'index': 4,
  'start': 15,
  'end': 21}]
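
The `start` and `end` fields in this output appear to be character offsets into the input string, with `end` exclusive. The sketch below checks that reading against the result copied from the cell above; it needs no model, and the variable names `results` and `spans` are only illustrative.

```python
text = "David works at Google."

# Pipeline output copied from the cell above.
results = [
    {"word": "David", "score": 0.9979717135429382, "entity": "I-PER",
     "index": 1, "start": 0, "end": 5},
    {"word": "Google", "score": 0.9990472793579102, "entity": "I-ORG",
     "index": 4, "start": 15, "end": 21},
]

# Slice the original text with each (start, end) pair.
spans = [(text[r["start"]:r["end"]], r["entity"]) for r in results]
print(spans)
```

Slicing by offsets returns the original surface text, which is useful when a word has been split into several subword tokens.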

In [5]:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

label_list = [
    "O",       # Outside of a named entity
    "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-MISC",  # Miscellaneous entity
    "B-PER",   # Beginning of a person's name right after another person's name
    "I-PER",   # Person's name
    "B-ORG",   # Beginning of an organisation right after another organisation
    "I-ORG",   # Organisation
    "B-LOC",   # Beginning of a location right after another location
    "I-LOC"    # Location
]

sequence = "Rabjot Singh and Vijay are attending the NLP class in Lambton College!"

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])

[('[CLS]', 'O'), ('Ra', 'I-PER'), ('##b', 'I-PER'), ('##jo', 'I-PER'), ('##t', 'I-PER'), ('Singh', 'I-PER'), ('and', 'O'), ('Vijay', 'I-PER'), ('are', 'O'), ('attending', 'O'), ('the', 'O'), ('NL', 'I-MISC'), ('##P', 'O'), ('class', 'O'), ('in', 'O'), ('Lamb', 'I-ORG'), ('##ton', 'I-ORG'), ('College', 'I-ORG'), ('!', 'O'), ('[SEP]', 'O')]


NER with BERT has some problems out of the box. Names are broken into subword tokens (for example, "Rabjot" becomes 'Ra', '##b', '##jo', '##t'), so non-English names and long names are not returned as single entities.

In [6]:
text = "Rabjot Singh and Vijay are attending the NLP class in Lambton College!"

tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(text)))
inputs = tokenizer.encode(text, return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])

[('[CLS]', 'O'), ('Ra', 'I-PER'), ('##b', 'I-PER'), ('##jo', 'I-PER'), ('##t', 'I-PER'), ('Singh', 'I-PER'), ('and', 'O'), ('Vijay', 'I-PER'), ('are', 'O'), ('attending', 'O'), ('the', 'O'), ('NL', 'I-MISC'), ('##P', 'O'), ('class', 'O'), ('in', 'O'), ('Lamb', 'I-ORG'), ('##ton', 'I-ORG'), ('College', 'I-ORG'), ('!', 'O'), ('[SEP]', 'O')]


We need post-processing to turn BERT's token-level labels into complete entities.
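
One possible form of that post-processing is sketched below, working directly on the (token, label) pairs printed above. The function name `merge_entities` is hypothetical, and the policy of giving each word the label of its first piece is an assumption; other choices (such as majority vote over the pieces) are equally plausible.

```python
def merge_entities(token_label_pairs):
    """Glue '##' pieces back into words, then group adjacent words of one type."""
    # Step 1: rebuild whole words from WordPiece tokens.
    words = []  # each item: [word, entity_type, starts_new_entity]
    for token, label in token_label_pairs:
        if token in ("[CLS]", "[SEP]"):
            continue  # special tokens carry no text
        if token.startswith("##") and words:
            words[-1][0] += token[2:]  # keep the label of the word's first piece
        else:
            entity_type = "O" if label == "O" else label.split("-")[-1]
            words.append([token, entity_type, label.startswith("B-")])

    # Step 2: join consecutive words that share a non-O type.
    entities = []
    previous_type = "O"
    for word, entity_type, starts_new in words:
        if entity_type == "O":
            previous_type = "O"
            continue
        if entity_type == previous_type and not starts_new:
            entities[-1] = (entities[-1][0] + " " + word, entity_type)
        else:
            entities.append((word, entity_type))
        previous_type = entity_type
    return entities


# Token-level output copied from the cell above.
pairs = [('[CLS]', 'O'), ('Ra', 'I-PER'), ('##b', 'I-PER'), ('##jo', 'I-PER'),
         ('##t', 'I-PER'), ('Singh', 'I-PER'), ('and', 'O'), ('Vijay', 'I-PER'),
         ('are', 'O'), ('attending', 'O'), ('the', 'O'), ('NL', 'I-MISC'),
         ('##P', 'O'), ('class', 'O'), ('in', 'O'), ('Lamb', 'I-ORG'),
         ('##ton', 'I-ORG'), ('College', 'I-ORG'), ('!', 'O'), ('[SEP]', 'O')]

print(merge_entities(pairs))
```

With the first-piece policy, "NLP" is kept as a MISC entity even though its second piece was labelled O.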

A package that works well for finding the full name of a person within a text is flair.



In [7]:
!pip install flair
# reference to flair: https://github.com/flairNLP/flair

Collecting flair
  Downloading flair-0.8.0.post1-py3-none-any.whl (284 kB)
Collecting bpemb>=0.3.2
  Downloading bpemb-0.3.3-py3-none-any.whl (19 kB)
Collecting sentencepiece==0.1.95
  Using cached sentencepiece-0.1.95-cp38-cp38-win_amd64.whl (1.2 MB)
Collecting torch<=1.7.1,>=1.5.0
  Downloading torch-1.7.1-cp38-cp38-win_amd64.whl (184.0 MB)
Collecting mpld3==0.3
  Downloading mpld3-0.3.tar.gz (788 kB)
Collecting ftfy
  Downloading ftfy-6.0.3.tar.gz (64 kB)
Collecting sqlitedict>=1.6.0
  Downloading sqlitedict-1.7.0.tar.gz (28 kB)
Collecting deprecated>=1.2.4
  Downloading Deprecated-1.2.12-py2.py3-none-any.whl (9.5 kB)
Collecting segtok>=1.5.7
  Downloading segtok-1.5.10.tar.gz (25 kB)
Collecting konoha<5.0.0,>=4.0.0

ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\\Users\\ANTHONY\\anaconda3\\Lib\\site-packages\\~.rch\\lib\\asmjit.dll'
Consider using the `--user` option or check the permissions.




  Downloading konoha-4.6.5-py3-none-any.whl (20 kB)
Collecting gdown==3.12.2
  Downloading gdown-3.12.2.tar.gz (8.2 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
    Preparing wheel metadata: started
    Preparing wheel metadata: finished with status 'done'
Collecting huggingface-hub
  Downloading huggingface_hub-0.0.9-py3-none-any.whl (37 kB)
Collecting janome
  Downloading Janome-0.4.1-py2.py3-none-any.whl (19.7 MB)
Collecting overrides<4.0.0,>=3.0.0
  Downloading overrides-3.1.0.tar.gz (11 kB)
Collecting requests
  Downloading requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting importlib-metadata<4.0.0,>=3.7.0
  Downloading importlib_metadata-3.10.1-py3-none-any.whl (14 kB)
Building wheels for collected packages: gdown, mpld3, overrides, segtok, sqlitedict, ftfy
  Building wheel for gdown (PEP 517): started

In [10]:
from flair.data import Sentence
from flair.models import SequenceTagger

# make a sentence
sentence = Sentence('I love Berlin .')

# load the NER tagger
tagger = SequenceTagger.load('ner')

# run NER over sentence
tagger.predict(sentence)

2021-06-02 09:01:44,526 --------------------------------------------------------------------------------
2021-06-02 09:01:44,527 The model key 'ner' now maps to 'https://huggingface.co/flair/ner-english' on the HuggingFace ModelHub
2021-06-02 09:01:44,527  - The most current version of the model is automatically downloaded from there.
2021-06-02 09:01:44,528  - (you can alternatively manually download the original model at https://nlp.informatik.hu-berlin.de/resources/models/ner/en-ner-conll03-v0.4.pt)
2021-06-02 09:01:44,528 --------------------------------------------------------------------------------



2021-06-02 09:03:40,041 loading file C:\Users\ANTHONY\.flair\models\ner-english\4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f.8baa8ae8795f4df80b28e7f7b61d788ecbb057d1dc85aacb316f1bd02837a4a4


AttributeError: 'LSTM' object has no attribute 'proj_size'

In [None]:
print(sentence)
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

In [None]:
text = "William Henry Gates III is working in his office in Microsoft!"
text = Sentence(text)

tagger.predict(text)
for entity in text.get_spans('ner'):
    print(entity)

Span [1,2,3,4]: "William Henry Gates III"   [− Labels: PER (0.9396)]
Span [11,12]: "Microsoft !"   [− Labels: ORG (0.6496)]


In [None]:
text = "William Henry Gates III is working in his office in Microsoft!"
text = Sentence(text)

tagger.predict(text)

entity_dict = text.to_dict(tag_type="ner")


ListOfNamesInText = []
for e in entity_dict['entities']:
  if str(e["labels"][0]).split()[0] == "PER":
    ListOfNamesInText.append(e["text"])
ListOfNamesInText


['William Henry Gates III']

In [None]:
text = "Rabjot Singh and Vijay are attending the NLP class in Lambton College!"
text = Sentence(text)

tagger.predict(text)

entity_dict = text.to_dict(tag_type="ner")


ListOfNamesInText = []
for e in entity_dict['entities']:
  if str(e["labels"][0]).split()[0] == "PER":
    ListOfNamesInText.append(e["text"])
ListOfNamesInText


['Rabjot Singh', 'Vijay']

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

print(summarizer(ARTICLE, max_length=130, min_length=30))


[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002 . At one time, she was married to eight men at once, prosecutors say .'}]


In [None]:
ARTICLE = """
Gates was born and raised in Seattle, Washington. In 1975, he co-founded Microsoft with childhood friend Paul Allen in Albuquerque, New Mexico. It became the world's largest personal computer software company.[7][a] Gates led the company as chairman and CEO until stepping down as CEO in January 2000, succeeded by Steve Ballmer, but he remained chairman of the board of directors and became chief software architect.[10] During the late 1990s, he was criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings.[11] In June 2008, Gates transitioned to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation he and his wife, Melinda Gates, established in 2000.[12] He stepped down as chairman of the board of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.[13] In March 2020, Gates left his board positions at Microsoft and Berkshire Hathaway to focus on his philanthropic efforts including climate change, global health and development, and education.[14]

Later in his career and since leaving day-to-day operations at Microsoft in 2008, Gates has pursued many business and philanthropic endeavors. He is the founder and chairman of several companies, including BEN, Cascade Investment, bgC3, and TerraPower. He has given sizable amounts of money to various charitable organizations and scientific research programs through the Bill & Melinda Gates Foundation, reported to be the world's largest private charity.[19] Through the foundation, he led an early 21st century vaccination campaign which significantly contributed to the eradication of the wild poliovirus in Africa.[20][21] In 2010, Gates and Warren Buffett founded The Giving Pledge, whereby they and other billionaires pledge to give at least half of their wealth to philanthropy.[22]
"""

print(summarizer(ARTICLE, max_length=325, min_length=300))

[{'summary_text': " In 1975, Gates co-founded Microsoft with childhood friend Paul Allen in Albuquerque, New Mexico . He led the company as chairman and CEO until stepping down as CEO in January 2000 . In June 2008, Gates transitioned to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation . He stepped down as chairman of the board of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella . He is the founder and chairman of several companies, including BEN, Cascade Investment, bgC3, and TerraPower . Through the foundation, he led an early 21st century vaccination campaign which significantly contributed to the eradication of the wild poliovirus in Africa . In 2010, Gates and Warren Buffett founded The Giving Pledge, whereby they and other billionaires pledge to give at least half of their wealth to philanthropy . He has given sizable amounts of money to various charitable organizations 