Ben Steves, HW8, 3-16-22

# Prebuilt Transformers

### Wikihow text summarizer - T5

example: https://huggingface.co/deep-learning-analytics/wikihow-t5-small

wikihow page cited: https://www.wikihow.com/Have-Nice-Smelling-Breath

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelWithLMHead
import random

tokenizer = AutoTokenizer.from_pretrained("deep-learning-analytics/wikihow-t5-small")
model = AutoModelWithLMHead.from_pretrained("deep-learning-analytics/wikihow-t5-small")

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)



##### Summarizing works well: How to have nice smelling breath (wikihow example)

In [2]:
text = """"
Lack of fluids can lead to dry mouth, which is a leading cause of bad breath. Water
can also dilute any chemicals in your mouth or gut that are causing bad breath., Studies show that
eating 6 ounces of yogurt a day reduces the level of odor-causing compounds in the mouth. In
particular, look for yogurt containing the active bacteria Streptococcus thermophilus or
Lactobacillus bulgaricus., The abrasive nature of fibrous fruits and vegetables helps to clean
teeth, while the vitamins, antioxidants, and acids they contain improve dental health.Foods that can
be particularly helpful include:Apples — Apples contain vitamin C, which is necessary for health
gums, as well as malic acid, which helps to whiten teeth.Carrots — Carrots are rich in vitamin A,
which strengthens tooth enamel.Celery — Chewing celery produces a lot of saliva, which helps to
neutralize bacteria that cause bad breath.Pineapples — Pineapples contain bromelain, an enzyme that
cleans the mouth., These teas have been shown to kill the bacteria that cause bad breath and
plaque., An upset stomach can lead to burping, which contributes to bad breath. Don’t eat foods that
upset your stomach, or if you do, use antacids. If you are lactose intolerant, try lactase tablets.,
They can all cause bad breath. If you do eat them, bring sugar-free gum or a toothbrush and
toothpaste to freshen your mouth afterwards., Diets low in carbohydrates lead to ketosis — a state
in which the body burns primarily fat instead of carbohydrates for energy. This may be good for your
waistline, but it also produces chemicals called ketones, which contribute to bad breath.To stop the
problem, you must change your diet. Or, you can combat the smell in one of these ways:Drink lots of
water to dilute the ketones.Chew sugarless gum or suck on sugarless mints.Chew mint leaves.
"""

preprocess_text = text.strip().replace("\n","")
tokenized_text = tokenizer.encode(preprocess_text, return_tensors="pt").to(device)

summary_ids = model.generate(
            tokenized_text,
            max_length=150, 
            num_beams=2,
            repetition_penalty=2.5, 
            length_penalty=1.0, 
            early_stopping=True
        )

output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print ("\n\nSummarized text: \n",output)



Summarized text: 
 Drink water.Eat yogurt.Eat fibrous fruits and vegetables.Try teas.Eat lactose-intolerant foods.Eat sugar-free gum.Drink plenty of water.


##### Summarizing works not as well

In [3]:
random.seed(34)


def scramble_text(text):
    """Given a string, it returns that string, but with scrambled word order"""
    text = text.split()
    scrambled_text = random.sample(text, len(text))
    text = " ".join(scrambled_text)
    return(text)
    
scrambled_text = scramble_text(text); scrambled_text



'you eat mints.Chew of Carrots bad try your Studies a have to compounds ketones, the in that contributes " of the the bad if produces causing intolerant, contain shown enzyme you kill as gums, to be abrasive lactase ketones.Chew in — containing ounces low breath.Pineapples them, active an and — ketosis the They smell for or Or, good dental teeth.Carrots to of in of combat eat vitamin well primarily are particularly sugarless to fibrous that but yogurt to one Water clean lot or change saliva, for health the gum your must helps upset If can you the — breath. can stomach, leading improve bring and This odor-causing which as instead which helpful An fruits of mint been that The a do sugarless a of you Chewing bad any upset ways:Drink on eating Lactobacillus bad of rich of include:Apples Apples 6 that burns acids the are lead nature fat cause Pineapples bacteria leaves. lead carbohydrates that enamel.Celery which reduces bacteria also is necessary the while day toothpaste all bromelain, for

In [4]:
preprocess_text2 = scrambled_text.strip().replace("\n","")
tokenized_text2 = tokenizer.encode(preprocess_text2, return_tensors="pt").to(device)

summary_ids2 = model.generate(
            tokenized_text2,
            max_length=150, 
            num_beams=2,
            repetition_penalty=2.5, 
            length_penalty=1.0, 
            early_stopping=True
        )

output2 = tokenizer.decode(summary_ids2[0], skip_special_tokens=True)

print ("\n\nSummarized text: \n",output2)



Summarized text: 
 Eat mints.Eat a healthy diet.Eat vitamin well.Eat vitamins and minerals.Eat foods that are in your mouth.Eat sugar-free freshen bad yogurt to water of look plaque.Eat Vitamin B.Eat calcium, which is also known as acidic acids.Eat protein rich of carbohydrates.Eat irony or potassium.Eat magnesium supplements.Eat zinc oxide (Chew of Carrots).Eat an antioxidant supplement.Eat sodium salt.Eat soda.Eat more than one cup of coffee.


### Named Entity Recognition with Bert

example: https://huggingface.co/dslim/bert-base-NER

text used for modeling: https://en.wikipedia.org/wiki/McMurdo_Station

##### NER works relatively well

In [5]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

In [9]:
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "The station takes its name from its geographic location on McMurdo Sound, named after Lieutenant Archibald McMurdo of HMS Terror. Under the command of British explorer James Clark Ross, the Terror first charted the area in 1841. The British explorer Robert Falcon Scott established a base camp close to this spot in 1902 and built a cabin there that was named Discovery Hut. It still stands as a historic monument near the water's edge on Hut Point at McMurdo Station."

ner_results = nlp(example)
ner_results

[{'entity': 'B-LOC',
  'score': 0.9959001,
  'index': 11,
  'word': 'M',
  'start': 59,
  'end': 60},
 {'entity': 'B-LOC',
  'score': 0.9863688,
  'index': 12,
  'word': '##c',
  'start': 60,
  'end': 61},
 {'entity': 'I-LOC',
  'score': 0.98775,
  'index': 13,
  'word': '##M',
  'start': 61,
  'end': 62},
 {'entity': 'I-LOC',
  'score': 0.8824059,
  'index': 14,
  'word': '##ur',
  'start': 62,
  'end': 64},
 {'entity': 'I-LOC',
  'score': 0.8463404,
  'index': 15,
  'word': '##do',
  'start': 64,
  'end': 66},
 {'entity': 'I-LOC',
  'score': 0.9969427,
  'index': 16,
  'word': 'Sound',
  'start': 67,
  'end': 72},
 {'entity': 'B-PER',
  'score': 0.99775213,
  'index': 21,
  'word': 'Archibald',
  'start': 97,
  'end': 106},
 {'entity': 'I-PER',
  'score': 0.99786025,
  'index': 22,
  'word': 'M',
  'start': 107,
  'end': 108},
 {'entity': 'I-PER',
  'score': 0.989638,
  'index': 23,
  'word': '##c',
  'start': 108,
  'end': 109},
 {'entity': 'I-PER',
  'score': 0.98575056,
  'index':

##### NER works not as well

In [10]:
random.seed(34)
example2 = scramble_text(example); example2

"a base on its the Point this camp location built explorer established named stands Terror. Station. in command monument The edge on to the from in at of McMurdo close It McMurdo and British McMurdo The Robert as HMS water's Under after British Ross, area Sound, the explorer Clark geographic Terror the James spot Falcon 1841. 1902 takes Lieutenant cabin a first that its a near still Scott Archibald Hut station Discovery there charted historic Hut. of was name named"

In [11]:
ner_results2 = nlp(example2)
ner_results2

[{'entity': 'B-LOC',
  'score': 0.6642267,
  'index': 15,
  'word': 'Terror',
  'start': 83,
  'end': 89},
 {'entity': 'B-LOC',
  'score': 0.6252367,
  'index': 31,
  'word': 'M',
  'start': 153,
  'end': 154},
 {'entity': 'B-LOC',
  'score': 0.6911099,
  'index': 32,
  'word': '##c',
  'start': 154,
  'end': 155},
 {'entity': 'I-ORG',
  'score': 0.76590997,
  'index': 33,
  'word': '##M',
  'start': 155,
  'end': 156},
 {'entity': 'I-ORG',
  'score': 0.5750639,
  'index': 35,
  'word': '##do',
  'start': 158,
  'end': 160},
 {'entity': 'I-ORG',
  'score': 0.7135625,
  'index': 39,
  'word': '##c',
  'start': 171,
  'end': 172},
 {'entity': 'I-ORG',
  'score': 0.82157886,
  'index': 40,
  'word': '##M',
  'start': 172,
  'end': 173},
 {'entity': 'I-ORG',
  'score': 0.8366482,
  'index': 42,
  'word': '##do',
  'start': 175,
  'end': 177},
 {'entity': 'B-MISC',
  'score': 0.44691458,
  'index': 44,
  'word': 'British',
  'start': 182,
  'end': 189},
 {'entity': 'B-ORG',
  'score': 0.305

Analysis and discussion about tasks will be on moodle submission.