# HuggingFace NLP 1
### Jeremy Bloom
### 2024.02.19 v2
This is my walkthrough and exploration of the content in and around [HuggingFace's Natural Language Processing course Chapter 1](https://huggingface.co/learn/nlp-course/chapter1/1?fw=pt).  Note that the course is really more of a guide: there is no credit, certification, or testing, so you can simply watch along.  This Python notebook is derived from HuggingFace examples, some public internet sources, and my own exploration.
 
I put this together to help friends and colleagues better understand what is possible in AI, ML, and particularly with transformer architectures and LLMs.  Those with little to no programming experience can probably still follow the video, as I try to explain what's happening.  For those who code, I suggest downloading the notebook and putting the video on in the background as you work through it yourself.

This notebook and the accompanying video are my own, as are any and all errors.

### To run this notebook
If you are new to Python, you'll need to install some extra packages.  Because Python has a rich ecosystem of packages and a long history of versions, an environment manager is highly recommended.  I use [Anaconda](https://www.anaconda.com/download), which is a solid choice, but many other options exist, including [venv](https://docs.python.org/3/library/venv.html), [pyenv](https://github.com/pyenv/pyenv), [poetry](https://python-poetry.org/), [pipenv](https://pipenv.pypa.io/en/latest/), and others.

For those using Anaconda, I set up my environment like this:<br>
`conda create -n hugs python=3.9 tensorflow transformers pytorch jupyter sentencepiece`

Note that in addition to the packages above, the models themselves also need to be downloaded.  `pipeline()` handles this for you: the first time you use a model it takes some time to download, after which it is stored locally and shared between environments.  On my Mac these are stored in anaconda3/pkgs, where I currently have over 20 GB.
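If you want to check how much disk space your own downloads take, a small helper like the one below can total up a folder. This is only a sketch: `dir_size_gb` is a hypothetical helper name, and the path is a placeholder assumption that you should replace with wherever your models actually live.

```python
import os

def dir_size_gb(path):
    """Return the total size of all regular files under `path`, in gigabytes."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            # Skip symlinks so that linked files are not counted twice.
            if os.path.islink(full_path):
                continue
            try:
                total_bytes += os.path.getsize(full_path)
            except OSError:
                # File vanished or is unreadable; ignore it.
                continue
    return total_bytes / 1e9

# Placeholder path (an assumption): point this at your own model folder.
cache_dir = os.path.expanduser("~/.cache/huggingface")
if os.path.isdir(cache_dir):
    print("{:.2f} GB in {}".format(dir_size_gb(cache_dir), cache_dir))
```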

#### pipeline() warnings and errors
1. In these examples I ask `pipeline()` for a generic model type rather than a specific model, so it always warns me that this isn't robust.  For demos like these that's fine, but if you want repeatable results you should choose specific models and versions.  The warning you can ignore looks like this: `Using a pipeline without specifying a model name and revision in production is not recommended.`
2. In these examples I'll often omit some extra parameters and accept default values.  pipeline() often warns on this, saying things like the following.  You can safely ignore these for demo use scenarios.
    -  ``Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.``
    -  `Some weights of the model checkpoint at... were not used when initializing...`
3. Python can produce very long error messages, but you can often find the mistake in the first few or the last few lines of the message.  Often the cause is a simple typo or a missing package.

Sources:
1. [Hugging Face NLP Course](https://huggingface.co/learn/nlp-course/chapter1/1?fw=pt)
2. [Hugging Face ModelHub](https://huggingface.co/docs/hub/en/models-the-hub)
3. [Hugging Face TextClassificationPipeline docs](https://huggingface.co/docs/transformers/v4.37.2/en/main_classes/pipelines#transformers.TextClassificationPipeline)



# 0. Common Helpers

In [4]:
# %load "jdb_timer.py"
import time
class timer:
    def __init__(self):
        self.t0_pc = time.perf_counter()   # wall time (high-resolution, monotonic)
        self.t0_tt = time.time()           # wall time (system clock)
        self.t0_pt = time.process_time()   # cpu time

        self.state = "INIT"
        #self.statesAllowed = ["INIT", "STARTED", "STOPPED"]
        self.t1_pc = 0
        self.t1_tt = 0
        self.t1_pt = 0

    def error(self, *args):
        # accepts several arguments, because callers also pass the offending state
        print("error:", *args)

    def warn(self, s):
        print("warn:", s)

    def start(self):
        # only valid if state is INIT
        if self.state == "INIT":
            self.__startNow()
        elif self.state == "STARTED":
            self.error("timer is already in STARTED state.  Use .restart or .forcestart")
        elif self.state == "STOPPED":
            self.error("timer is in STOPPED state.  Use .restart or .forcestart")
        else:
            self.error("cannot start timer with unrecognized state", self.state)
            
    def restart(self):
        # valid if state is STARTED, STOPPED, or INIT
        if self.state == "STARTED" or self.state == "STOPPED" or self.state == "INIT":
            self.__startNow()
        else:
            self.error("cannot restart timer with unrecognized state", self.state)

    def stopquiet(self):
        self.stop(quiet = True)
    
    def stop(self, quiet = False):
        # only valid if STARTED or (with warn) INIT
        if self.state == "STARTED":
            self.__stop()
            if not quiet:
                self.__printTimerStopped()
        elif self.state == "INIT":
            self.warn("timer is in state INIT.  Will use instantiation times for start and now stop.")
            self.__startRetro()
            self.__stop()
            if not quiet:
                self.__printTimerStopped()
        elif self.state == "STOPPED":
            self.error("cannot stop timer in STOPPED state")
        else:
            self.error("cannot stop timer with unrecognized state", self.state)

    
    def forcestart(self):
        # ignore current state, just start from now
        self.__startNow()

    def forcestop(self):
        # ignore current state, just stop now (even if already stopped)
        self.__stop()
        self.__printTimerStopped()

    def print(self):
        if self.state == "INIT":
            print("warn: timer was not explicitly started - using instantiation time")
            self.__startRetro()
            self.__printTimerContinued()
        elif self.state == "STARTED":
            self.__printTimerContinued()
        elif self.state == "STOPPED":
            self.__printTimerStopped()
        else:
            self.error("cannot print timer with unrecognized state", self.state)

    def __startRetro(self):
        # no error checking - retroactive start will use times from init
        self.state = "STARTED"
        
    def __startNow(self):
        # no error checking
        self.t0_pc = time.perf_counter()
        self.t0_tt = time.time()
        self.t0_pt = time.process_time()
        self.state = "STARTED"

    def __stop(self):
        # no error checking
        self.t1_pc = time.perf_counter()
        self.t1_tt = time.time()
        self.t1_pt = time.process_time()
        self.state = "STOPPED"
    
    def __printTimerContinued(self):
        # no error checking - just prints current - start times
        # print("{:5.2f}  {:5.2f}  {:5.2f}...".format(time.perf_counter() - self.t0_pc, 
        #                                             time.time() - self.t0_tt, 
        #                                             time.process_time() - self.t0_pt))
        print("wall: {:5.2f}...   cpu: {:5.2f}...".format(time.perf_counter() - self.t0_pc, time.process_time() - self.t0_pt))

    def __printTimerStopped(self):
        # no error checking - just prints stored - start times
        # print("{:5.2f}  {:5.2f}  {:5.2f}".format(self.t1_pc - self.t0_pc, 
        #                                          self.t1_tt - self.t0_tt,
        #                                          self.t1_pt -  self.t0_pt))
        print("wall: {:5.2f}      cpu: {:5.2f}".format(self.t1_pc - self.t0_pc, self.t1_pt -  self.t0_pt))


def runcputime(num_seconds = 1):
    t0 = time.process_time()
    x = 500
    while True:
        z = 0
        for i in range(x):
            z += x ** x
        if time.process_time() - t0 >= num_seconds:
            break

In [5]:
t = timer()
t.start()
time.sleep(0.5)
t.print()
runcputime(1.7)
t.print()
time.sleep(0.2)
t.stop()

wall:  0.50...   cpu:  0.00...
wall:  2.27...   cpu:  1.70...
wall:  2.48      cpu:  1.71
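The same wall-versus-CPU measurement can also be wrapped in a context manager, which removes the need to remember `start()` and `stop()`. The sketch below is an alternative of my own, not part of the `timer` class above; `timed` is a hypothetical name. Keep in mind that `time.process_time()` adds up CPU time across all threads of the process, so on a multi-core machine the CPU figure can be larger than the wall figure.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label="elapsed"):
    """Time the enclosed block; fills the yielded dict with wall and CPU seconds."""
    result = {}
    wall0 = time.perf_counter()   # monotonic wall-clock timer
    cpu0 = time.process_time()    # CPU time used by this process (all threads)
    try:
        yield result
    finally:
        result["wall"] = time.perf_counter() - wall0
        result["cpu"] = time.process_time() - cpu0
        print("{}: wall: {:5.2f}      cpu: {:5.2f}".format(
            label, result["wall"], result["cpu"]))

with timed("sleep") as r:
    time.sleep(0.1)   # sleeping consumes wall time but almost no CPU time
```

After the block exits, `r["wall"]` and `r["cpu"]` hold the two measurements for later use.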


# 1. Pipeline() and sentiment-analysis

In [6]:
from transformers import pipeline
classifier = pipeline(task="sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
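For repeatable results, the model name and revision printed in the message above can be passed to `pipeline()` explicitly, which should also avoid the warning. This is a minimal sketch: `load_pinned_classifier` is a hypothetical helper name, and the first call still needs network access to download the model.

```python
# Values copied from the "No model was supplied" message above.
MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "af0f99b"

def load_pinned_classifier():
    """Build the sentiment pipeline with an explicit model and revision."""
    # Imported lazily so that defining this function does not require transformers.
    from transformers import pipeline
    return pipeline(task="sentiment-analysis", model=MODEL_ID, revision=REVISION)
```

Calling `classifier = load_pinned_classifier()` would then take the place of the bare `pipeline(task="sentiment-analysis")` call.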


In [7]:
classifier("this is a great book and a must read")

[{'label': 'POSITIVE', 'score': 0.9998772144317627}]

In [8]:
classifier("This is not a novel to be read.")

[{'label': 'NEGATIVE', 'score': 0.9995871186256409}]

In [9]:
r1 = "The Hitchhiker's Guide to the Galaxy, by Douglas Adams, published in 1979 is a thrilling work of science-fiction and highly entertaining to read. It is a well-written book, with a surplus of thought-provoking ideas. The prose conceals flashes of brilliance and unearths pearls of wisdom. The characters are themselves illuminating, with respect to their sharp perceptions, astute assessments of the situation, quick reactions, and outright candor. You get the impression that the story could very easily have been about a likable group of college students who plan to go on a road trip for spring break, so that they can experience all that life has to offer, let off a little steam, and reduce some stress before final exams. But, alas, the book is more complicated than that. It is more like, what if you know for certain that identifiable flying objects piloted by alien beings are in close proximity, and you have the coded electronic transporter boarding pass device, granting you unlimited access to go anywhere in the universe, right there in your hot little hand."
r2 = "I am an avid reader. Sometimes, two or three books at a time. With that said, I did not enjoy this book at all. I almost just gave up on finishing it, but kept thinking, this will get better. It didn't for me. I really didn't understand the point of it at all. I was waiting for the humor and something profound, but nothing happened. I was the happiest when I got to the end, so that I could start reading another book."

In [10]:
r1

"The Hitchhiker's Guide to the Galaxy, by Douglas Adams, published in 1979 is a thrilling work of science-fiction and highly entertaining to read. It is a well-written book, with a surplus of thought-provoking ideas. The prose conceals flashes of brilliance and unearths pearls of wisdom. The characters are themselves illuminating, with respect to their sharp perceptions, astute assessments of the situation, quick reactions, and outright candor. You get the impression that the story could very easily have been about a likable group of college students who plan to go on a road trip for spring break, so that they can experience all that life has to offer, let off a little steam, and reduce some stress before final exams. But, alas, the book is more complicated than that. It is more like, what if you know for certain that identifiable flying objects piloted by alien beings are in close proximity, and you have the coded electronic transporter boarding pass device, granting you unlimited acc

In [11]:
r2

"I am an avid reader. Sometimes, two or three books at a time. With that said, I did not enjoy this book at all. I almost just gave up on finishing it, but kept thinking, this will get better. It didn't for me. I really didn't understand the point of it at all. I was waiting for the humor and something profound, but nothing happened. I was the happiest when I got to the end, so that I could start reading another book."

In [12]:
classifier(r1)

[{'label': 'POSITIVE', 'score': 0.9996988773345947}]

In [13]:
classifier(r2)

[{'label': 'NEGATIVE', 'score': 0.9936918020248413}]

##### M1

In [14]:
M1 = []
M1.append("This is a very medicore book. Read it if you like stories about talking dragons.")
M1.append("This is a very medicore book. Read it if you like silly stories about talking dragons.")
M1.append("This is a very medicore book. Read it if you like dumb stories about talking dragons.")
M1

['This is a very medicore book. Read it if you like stories about talking dragons.',
 'This is a very medicore book. Read it if you like silly stories about talking dragons.',
 'This is a very medicore book. Read it if you like dumb stories about talking dragons.']

In [15]:
classifier(M1)

[{'label': 'POSITIVE', 'score': 0.848868191242218},
 {'label': 'POSITIVE', 'score': 0.5143406987190247},
 {'label': 'NEGATIVE', 'score': 0.9953979849815369}]
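The `score` is the model's confidence in whichever label it picked, so a POSITIVE at 0.51 is close to a coin flip while a NEGATIVE at 0.99 is a strong call. One way to compare results on a single scale is to convert each one to an estimated probability of POSITIVE. The sketch below assumes a two-label model whose two scores sum to 1; `p_positive` is a hypothetical helper name.

```python
def p_positive(result):
    """Convert one {'label', 'score'} dict into an estimated P(positive)."""
    score = result["score"]
    # Assumption: only two labels exist, so P(positive) = 1 - P(negative).
    return score if result["label"] == "POSITIVE" else 1.0 - score

# Results copied from the M1 output above.
m1_results = [
    {"label": "POSITIVE", "score": 0.848868191242218},
    {"label": "POSITIVE", "score": 0.5143406987190247},
    {"label": "NEGATIVE", "score": 0.9953979849815369},
]
for r in m1_results:
    print("{:.4f}".format(p_positive(r)))
# prints 0.8489, 0.5143, 0.0046
```

On this scale, adding "silly" moves the review from about 0.85 to about 0.51, and "dumb" moves it to nearly 0.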

##### M2

In [16]:
M2 = []
M2.append("This is not a novel to be tossed aside lightly.")
M2.append("This is not a novel to be tossed aside lightly. It should be thrown with great force.")
M2.append("This is not a novel to be tossed aside lightly. Don't even pick up this book in the first place.")
R2 = classifier(M2)

In [17]:
R2

[{'label': 'POSITIVE', 'score': 0.8818190693855286},
 {'label': 'POSITIVE', 'score': 0.9708155989646912},
 {'label': 'NEGATIVE', 'score': 0.992668867111206}]

In [18]:
len(R2)

3

In [19]:
type(R2)

list

In [20]:
for o in R2:
    print(o["label"])

POSITIVE
POSITIVE
NEGATIVE
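Since the results come back as a list in the same order as the inputs, `zip` is a convenient way to line each text up with its label. A small sketch using the M2 inputs and outputs copied from above:

```python
# Inputs and results copied from the M2 cells above.
texts = [
    "This is not a novel to be tossed aside lightly.",
    "This is not a novel to be tossed aside lightly. It should be thrown with great force.",
    "This is not a novel to be tossed aside lightly. Don't even pick up this book in the first place.",
]
results = [
    {"label": "POSITIVE", "score": 0.8818190693855286},
    {"label": "POSITIVE", "score": 0.9708155989646912},
    {"label": "NEGATIVE", "score": 0.992668867111206},
]

# Pair each input text with its label and a rounded score.
pairs = [(r["label"], round(r["score"], 4), t) for t, r in zip(texts, results)]
for label, score, text in pairs:
    print(label, score, text)
```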


In [22]:
# %load jdb_process.py
def process(texts, results, **kwargs):
    PROCESS_ARG_PRINT = "print"
    PROCESS_ARG_SORT = "sort"
    PROCESS_ARG_NORETURN = "noreturn"
    PROCESS_ARG_TRUNCATE = "truncate"
    

    printOpt = True
    sortOpt = True
    noreturnOpt = True
    truncateNum = -1
    
    if len(texts) != len(results):
        print("error - mismatched lists")
        return

    for key, value in kwargs.items():
        if key == PROCESS_ARG_PRINT:
            printOpt = kwargs[PROCESS_ARG_PRINT]
        elif key == PROCESS_ARG_SORT:
            sortOpt = kwargs[PROCESS_ARG_SORT]
        elif key == PROCESS_ARG_NORETURN:
            noreturnOpt = kwargs[PROCESS_ARG_NORETURN]
        elif key == PROCESS_ARG_TRUNCATE:
            truncateNum = kwargs[PROCESS_ARG_TRUNCATE]
        else:
            print("ignoring unrecognised arg", key)
      
    n = len(texts)
    Rn = []
    Rp = []
    R = []
    for i in range(n):
        confidence = results[i]["score"]
        sentiment = " "
        if results[i]["label"] == "NEGATIVE":
            sentiment = "-"
            if truncateNum > 0:
                Rn.append([confidence, sentiment, texts[i][0:truncateNum]])
            else:
                Rn.append([confidence, sentiment, texts[i]])
        else:
            sentiment = "+"
            if truncateNum > 0:
                Rp.append([confidence, sentiment, texts[i][0:truncateNum]])
            else:
                Rp.append([confidence, sentiment, texts[i]])
        
        if truncateNum > 0:
            R.append([confidence, sentiment, texts[i][0:truncateNum]])
        else:
            R.append([confidence, sentiment, texts[i]])
            
    if sortOpt:
        Rp = sorted(Rp, key = lambda x: -x[0])
        Rn = sorted(Rn, key = lambda x: x[0])
        R = Rp + Rn
    
    if printOpt:
        for r in R:
            print("{:3s} {:6.4f} {:s}".format(r[1], r[0], r[2]))    
    if noreturnOpt:
        return
    else:
        return R


In [23]:
M3 = []
M3.append("These aren't the droids you're looking for.")
M3.append("These are not the droids you're looking for.  These are, however, excellent droids")
M3.append("These are excellent droids, but they are not the droids you're looking for.")
M3.append("These are excellent droids.")
M3.append("These are great droids, almost what I wanted")
M3.append("These are great droids, but not what I wanted")
M3.append("These are great droids, but sometimes do their thing")
M3.append("These are great droids, but they constantly bicker")
M3.append("These are great droids, although they're very independent")
M3.append("These droids are good enough for government purposes, maybe not for enterprise use")
M3.append("These droids are good enough for government purposes, and for enterprise use")
R3 = classifier(M3)

In [24]:
process(M3, R3)

+   0.9997 These are excellent droids.
+   0.9993 These are great droids, almost what I wanted
+   0.9993 These are not the droids you're looking for.  These are, however, excellent droids
+   0.9976 These are great droids, although they're very independent
+   0.9969 These droids are good enough for government purposes, and for enterprise use
+   0.9946 These are great droids, but sometimes do their thing
+   0.9304 These droids are good enough for government purposes, maybe not for enterprise use
-   0.5604 These are excellent droids, but they are not the droids you're looking for.
-   0.9469 These are great droids, but they constantly bicker
-   0.9956 These aren't the droids you're looking for.
-   0.9990 These are great droids, but not what I wanted


#### M4 quotes

In [27]:
import os
fn = os.path.expanduser('~') + "/Dropbox/datasets/quotes-misc.txt"

In [28]:
with open(fn, "r") as f:
    M4 = f.readlines()
M4

['# some quotes\n',
 'I enjoy tacos and burritos more than flautas.\n',
 'I enjoy tacos and burritos, but I dislike flautas.\n',
 'Please smoke the ham.\n',
 'Bro, do you even lift?\n',
 "The microwave functions great but it's a little louder than the one it replaces.  I like the color but not the price.\n",
 "The microwave functions better than expected for it's budget price\n",
 'You will be disappointed in this microwave.\n',
 'You will not be disappointed in this microwave.\n',
 'We should start a club.\n',
 'We should start a team.\n',
 'We should start a gang.\n',
 'We should start a private army.\n',
 'We should start an organization to take over the world.\n',
 'I enjoy margaritas almost as much as I enjoy surfing.\n',
 'If you pick up a starving dog and make him prosperous he will not bite you. This is the principal difference between a dog and man.\n',
 'It was a dark and stormy night\n',
 'I know why the caged bird sings\n',
 'Why, sometimes, I’ve believed as many as six imp

In [29]:
M4 = [s for s in M4 if not s.startswith("#")]   # remove commented lines
M4 = [s.rstrip("\n") for s in M4]               # remove the trailing newline, if present

In [30]:
R4 = classifier(M4)

In [31]:
process(M4, R4)

+   0.9996 I enjoy tacos and burritos more than flautas.
+   0.9989 The ultimate measure of a man is not where he stands in moments of comfort and convenience, but where he stands at times of challenge and controversy.
+   0.9984 If you pick up a starving dog and make him prosperous he will not bite you. This is the principal difference between a dog and man.
+   0.9975 You will not be disappointed in this microwave.
+   0.9922 I opened myself to the gentle indifference of the world.
+   0.9918 We should start a team.
+   0.9913 Man is nothing else but what he makes of himself
+   0.9886 We should start a club.
+   0.9453 Freedom is never voluntarily given by the oppressor; it must be demanded by the oppressed
+   0.9226 I know why the caged bird sings
+   0.9187 The microwave functions better than expected for it's budget price
+   0.7775 We should start an organization to take over the world.
+   0.6724 Where there is no hope, it is incumbent on us to invent it.
+   0.5671 The microw

#### M5 - real review data

In [32]:
import pandas as pd
fn = os.path.expanduser('~') + "/Dropbox/datasets/amazonReviews1/1429_1.csv"
df = pd.read_csv(fn)



In [33]:
df.shape

(34660, 21)

In [34]:
df.sample(5)["reviews.text"]

7974     Our family isn't from Ohio so when we travel b...
725      I would recommend this product it is a good ta...
6573     It's small, light and easily portable. Powerfu...
34088    Does everything I need it to and with great qu...
89       It's a great tablet for the price. Don't expec...
Name: reviews.text, dtype: object

In [35]:
df.head(2)

Unnamed: 0,id,name,asins,brand,categories,keys,manufacturer,reviews.date,reviews.dateAdded,reviews.dateSeen,...,reviews.doRecommend,reviews.id,reviews.numHelpful,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username
0,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,This product so far has not disappointed. My c...,Kindle,,,Adapter
1,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,great for beginner or experienced person. Boug...,very fast,,,truman


In [36]:
df.loc[0]

id                                                   AVqkIhwDv8e3D1O-lebb
name                    All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
asins                                                          B01AHB9CN2
brand                                                              Amazon
categories              Electronics,iPad & Tablets,All Tablets,Fire Ta...
keys                    841667104676,amazon/53004484,amazon/b01ahb9cn2...
manufacturer                                                       Amazon
reviews.date                                     2017-01-13T00:00:00.000Z
reviews.dateAdded                                    2017-07-03T23:33:15Z
reviews.dateSeen        2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z
reviews.didPurchase                                                   NaN
reviews.doRecommend                                                  True
reviews.id                                                            NaN
reviews.numHelpful                    

In [37]:
cats = sorted(list(df["categories"].unique()))

In [39]:
len(cats)

41

In [40]:
for i, cat in enumerate(cats):
    df2 = df[df["categories"] == cat]
    print(i, cat[0:20], df2.shape[0])

0 Amazon Device Access 6
1 Amazon Devices & Acc 402
2 Back To College,Coll 5056
3 Cases,Kindle Store,A 13
4 Categories,Streaming 8
5 Chargers & Adapters, 73
6 Computers & Tablets, 51
7 Computers & Tablets, 10
8 Computers/Tablets &  13
9 Computers/Tablets &  1038
10 Computers/Tablets &  4
11 Electronics Features 372
12 Electronics,Amazon D 10
13 Electronics,Categori 3
14 Electronics,Categori 6
15 Electronics,Computer 6
16 Electronics,Tablets  256
17 Electronics,eBook Re 5
18 Electronics,eBook Re 15
19 Electronics,iPad & T 30
20 Electronics,iPad & T 212
21 Electronics,iPad & T 2814
22 Featured Brands,Elec 636
23 Fire Tablets,Tablets 7
24 Fire Tablets,Tablets 270
25 Fire Tablets,Tablets 10966
26 Fire Tablets,Tablets 1
27 Frys,Software & Book 2
28 Kindle E-readers,Ele 6
29 Kindle Store,Amazon  19
30 Kindle Store,Categor 16
31 Power Adapters & Cab 8
32 Rice Dishes,Ready Me 1
33 Stereos,Remote Contr 6619
34 TVs Entertainment,Wi 7
35 Tablets,Fire Tablets 1699
36 Tablets,Fire Tablets 158
37 Wa

In [41]:
cats[18]

'Electronics,eBook Readers & Accessories,Power Adapters,Computers/Tablets & Networking,Tablet & eBook Reader Accs,Chargers & Sync Cables,Power Adapters & Cables,Kindle Store,Amazon Device Accessories,Kindle Fire (2nd Generation) Accessories,Fire Tablet Accessories'

In [42]:
categoryToUse = cats[18]
df2 = df[df["categories"] == categoryToUse]
df2.shape

(15, 21)

In [43]:
reviews = list(df2["reviews.text"])

In [44]:
reviews

['Arrived quickly and works perfectly. Very happy with my purchase.',
 'As promised, thank you very much!',
 '1of fav chargers',
 "These days, Amazon doesn't supply a charger when you purchase a new Kindle. That makes sense because most Kindle purchasers probably have older Kindle models already, or other USB powered devices or phones or tablets, many of which will have included chargers. As a result most people will already have a sufficient supply of chargers sitting around.But for someone who wants or needs a new charger when they order a new Kindle, Amazon presents you with the option to buy one at the time of your Kindle purchase.So I was surprised when I noticed this new charger model now being offered as an option for new Kindles. My first thought was that Amazon had updated their previous charger models to bring them up-to-date and make them competitive with the great variety of third party chargers now available (see links below for these earlier models).As soon as I looked at

In [45]:
len(reviews)

15

In [46]:
t = timer()
output = classifier(reviews)
t.stop()

warn: timer is in state INIT.  Will use instantiation times for start and now stop.
wall: 13.91      cpu: 83.33


In [47]:
process(reviews, output)

+   0.9999 Arrived quickly and works perfectly. Very happy with my purchase.
+   0.9999 As promised, thank you very much!
+   0.9965 Works great as I would expect it too. Only complaint, why is this not just part of the kindle purchase instead of a separate line item.
+   0.9947 If you own a tablet or smartphone, you already have one. If you don't, one of your friends/family will give you one of their three spares. These adapters are ubiquitous.While they're not all exactly the same, they all do the same thing. Worth about 2.
+   0.9908 What can one say. It charges my Kindle.
+   0.6896 1of fav chargers
-   0.9815 A nice update to the original 5W USB Power Adapter. Despite how wonderful the 9W USB Power Adapter is for Kindle Fires, I still actually prefer to use this oldie but goodie due to the way the 9W folds. I actually prefer not to have the pointy bits (lack of a better term) fold away as this causes problems when trying to connect it to something new. I actually prefer this to ch

# 2. Zero Shot classification

In [48]:
classifier = pipeline("zero-shot-classification")

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [49]:
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445982336997986, 0.1119748204946518, 0.043426938354969025]}
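In the output above the labels appear to come back ordered from highest to lowest score, so the first entry is the winner. A small helper (hypothetical name `top_label`) can pull out the best label without relying on that ordering:

```python
def top_label(result):
    """Return (label, score) for the best-scoring candidate label."""
    scores = result["scores"]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return result["labels"][best], scores[best]

# Result copied from the output above.
example = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8445982336997986, 0.1119748204946518, 0.043426938354969025],
}
print(top_label(example))
# prints ('education', 0.8445982336997986)
```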

In [50]:
classifier(
    "This is a summary of the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

{'sequence': 'This is a summary of the Transformers library',
 'labels': ['business', 'education', 'politics'],
 'scores': [0.4526875615119934, 0.36332812905311584, 0.18398433923721313]}

In [51]:
t = timer()
P = []
P.append("the quick brown fox jumped over the lazy dog.")
P.append("man bites dog, film at 11")
P.append("nicotine shown addictive")
R = classifier(P, candidate_labels=["education", "politics", "business"])
t.stop()



warn: timer is in state INIT.  Will use instantiation times for start and now stop.
wall: 19.10      cpu: 116.13


In [55]:
def classify(textList, candidate_labels):
    t = timer()
    t.start()
    returnList = True
    if isinstance(textList, str):
        textList = [textList]
        returnList = False
    printStatus = False
    if len(textList) > 2:
        printStatus = True
        lastStatus = 0
        statusInterval = .1

    R = []
    for index, text in enumerate(textList):
        R.append( classifier(text, candidate_labels = candidate_labels))
        if printStatus:
            pct = (index + 1) / len(textList)   # fraction done; avoids the globals i and P
            if pct > lastStatus + statusInterval:
                print(".",end="")
                lastStatus = pct
    t.stop()
    if returnList:
        return R
    elif len(R) == 1:
        return R[0]
    else:
        print("error - bad logic in expected items / list return")
        return

In [56]:
t = timer()
P = []
P.append("the quick brown fox jumped over the lazy dog.")
P.append("man bites dog, film at 11")
P.append("nicotine shown addictive")
R = classify(P, candidate_labels=["education", "politics", "business"])
t.stop()

.wall: 23.74      cpu: 141.78
warn: timer is in state INIT.  Will use instantiation times for start and now stop.
wall: 23.74      cpu: 141.78


In [57]:
R

[{'sequence': 'the quick brown fox jumped over the lazy dog.',
  'labels': ['business', 'education', 'politics'],
  'scores': [0.633482038974762, 0.2199830561876297, 0.14653486013412476]},
 {'sequence': 'man bites dog, film at 11',
  'labels': ['business', 'politics', 'education'],
  'scores': [0.44846415519714355, 0.2853669226169586, 0.26616889238357544]},
 {'sequence': 'nicotine shown addictive',
  'labels': ['business', 'education', 'politics'],
  'scores': [0.5418257713317871, 0.2547625005245209, 0.2034117579460144]}]

In [58]:
print(sum(R[0]["scores"]))
print(sum(R[1]["scores"]))

0.9999999552965164
0.9999999701976776
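The scores sum to (almost exactly) 1 because, as far as I understand the default single-label mode, the pipeline normalizes one raw score per candidate label with a softmax. That also means every score depends on which other labels are in the list. The sketch below illustrates the effect with made-up numbers; the logits are hypothetical and not taken from the model.

```python
import math

def softmax(logits):
    """Turn raw scores into positive values that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three candidate labels.
hypothetical_logits = [2.0, -1.0, 0.0]
three = softmax(hypothetical_logits)
# Adding a fourth, equally strong label pulls the first label's share down.
four = softmax(hypothetical_logits + [2.0])

print(["{:.3f}".format(s) for s in three], sum(three))
print(["{:.3f}".format(s) for s in four], sum(four))
```

If the labels should be scored independently of each other, the zero-shot pipeline also accepts `multi_label=True`, in which case the scores no longer need to sum to 1.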


In [59]:
classify(P, candidate_labels=["education", "news", "politics", "media"])

.wall: 22.06      cpu: 133.87


[{'sequence': 'the quick brown fox jumped over the lazy dog.',
  'labels': ['news', 'media', 'education', 'politics'],
  'scores': [0.531642735004425,
   0.4201987087726593,
   0.028904644772410393,
   0.0192539282143116]},
 {'sequence': 'man bites dog, film at 11',
  'labels': ['media', 'news', 'politics', 'education'],
  'scores': [0.5295995473861694,
   0.45339250564575195,
   0.008800003677606583,
   0.008207984268665314]},
 {'sequence': 'nicotine shown addictive',
  'labels': ['news', 'media', 'education', 'politics'],
  'scores': [0.836067795753479,
   0.15195265412330627,
   0.00666108587756753,
   0.0053184558637440205]}]

In [60]:
p = "man bites dog in suprise ice cream heist"
list_of_candidate_labels = []
list_of_candidate_labels.append(["sports", "politics", "news"])
list_of_candidate_labels.append(["sports", "politics", "news", "education"])
list_of_candidate_labels.append(["sports", "politics", "news", "animals"])
list_of_candidate_labels.append(["animals", "sports", "politics", "news"])
list_of_candidate_labels.append(["animals", "sports", "politics", "news", "cats", "penguins", "dogs", "people", "violence"])

listOfR = []
for clabs in list_of_candidate_labels:
    listOfR.append( classify(p, clabs) )

for R in listOfR:
    print("labels", R["labels"])
    for i in range(len(R["labels"])):
        print("  {:15s} {:5.2f}  {:5.2f}".format(R["labels"][i], R["scores"][i], sum(R["scores"][0:i+1])))

wall: 10.29      cpu: 62.69
wall: 12.83      cpu: 78.76
wall: 13.79      cpu: 83.90
wall:  9.35      cpu: 57.21
wall:  7.38      cpu: 50.69
labels ['news', 'sports', 'politics']
  news             0.98   0.98
  sports           0.02   0.99
  politics         0.01   1.00
labels ['news', 'sports', 'politics', 'education']
  news             0.97   0.97
  sports           0.02   0.99
  politics         0.01   0.99
  education        0.01   1.00
labels ['animals', 'news', 'sports', 'politics']
  animals          0.69   0.69
  news             0.31   0.99
  sports           0.00   1.00
  politics         0.00   1.00
labels ['animals', 'news', 'sports', 'politics']
  animals          0.69   0.69
  news             0.31   0.99
  sports           0.00   1.00
  politics         0.00   1.00
labels ['violence', 'animals', 'news', 'dogs', 'people', 'sports', 'politics', 'cats', 'penguins']
  violence         0.50   0.50
  animals          0.30   0.80
  news             0.13   0.93
  dogs          

##### news articles online

https://www.kaggle.com/datasets/rmisra/news-category-dataset
1. Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).
2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).


In [61]:
import json
fn = os.path.expanduser('~') + "/Dropbox/datasets/News_Category_Dataset_v3.json"

with open(fn) as file:
    data = json.load(file)

JSONDecodeError: Extra data: line 2 column 1 (char 448)

In [62]:
data = []
# The file holds one JSON object per line, so parse it line by line.
with open(fn, 'r') as file:
    for line in file:
        data.append( json.loads(line) )

In [63]:
len(data)

209527
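
The `JSONDecodeError` above is what `json.load` raises when a file holds one JSON object per line (often called JSON Lines) rather than a single JSON document. Here is a small self-contained sketch of the difference, using an in-memory two-line sample instead of the real dataset file. The sample records are made up.

```python
import io
import json

# Two made-up records, one JSON object per line, standing in for the real file.
sample = '{"category": "TECH", "headline": "a"}\n{"category": "SPORTS", "headline": "b"}\n'

# Parsing the whole text as one document fails at the start of line 2.
try:
    json.loads(sample)
    whole_file_ok = True
except json.JSONDecodeError as err:
    whole_file_ok = False
    print("whole-file parse failed:", err.msg)

# Parsing line by line works; skip blank lines defensively.
with io.StringIO(sample) as file:
    records = [json.loads(line) for line in file if line.strip()]

print(len(records), records[0]["category"])
```

pandas can also read this layout directly with `pd.read_json(fn, lines=True)`, which skips the intermediate list.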

In [64]:
df = pd.DataFrame(data)

In [65]:
df.head()

| | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlin... | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS | He was subdued by passengers and crew when he ... | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets... | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY | "Until you have a dog you don't understand wha... | Elyse Wanshel | 2022-09-23 |
| 3 | https://www.huffpost.com/entry/funniest-parent... | The Funniest Tweets From Parents This Week (Se... | PARENTING | "Accidentally put grown-up toothpaste on my to... | Caroline Bologna | 2022-09-23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose... | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS | Amy Cooper accused investment firm Franklin Te... | Nina Golgowski | 2022-09-22 |


In [66]:
categories = sorted(list(df["category"].unique()))
len(categories)

42

In [67]:
categoryTotal = {}
for d in data:
    cname = d["category"]
    if cname in categoryTotal:
        categoryTotal[cname] += 1
    else:
        categoryTotal[cname] = 1

In [69]:
#categoryTotal
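
The per-category tally built above can also be written with `collections.Counter` from the standard library. Here is a short sketch on a made-up list of records, so the real `data` list is not needed to run it.

```python
from collections import Counter

# Made-up records with the same "category" key as the dataset rows.
rows = [
    {"category": "TECH"},
    {"category": "SPORTS"},
    {"category": "TECH"},
    {"category": "CRIME"},
]

category_total = Counter(row["category"] for row in rows)

# most_common() lists categories from most to least frequent.
for name, count in category_total.most_common():
    print("{:10s} {:d}".format(name, count))
```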

In [71]:
cats = ['WORLD NEWS', 'TECH', 'SPORTS', 'EDUCATION', 'CRIME', "PARENTS"]

In [72]:
df2 = df[df["category"].isin(cats)]
df2["category"].unique()

array(['WORLD NEWS', 'TECH', 'SPORTS', 'EDUCATION', 'CRIME', 'PARENTS'],
      dtype=object)

In [73]:
df2.shape

(19011, 6)

In [76]:
df3 = df2.sample(10)

In [77]:
df3

| | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 196535 | https://www.huffingtonpost.com/entry/mac-os-ol... | 'Mac OS (Old School)' Skin Will Take Your Lapt... | TECH | Are you a Mac user who wants the power and spe... | Britney Fitzgerald | 2012-06-16 |
| 60529 | https://www.huffingtonpost.com/entry/pilot-in-... | Pilot In Fatal Hot Air Balloon Crash Had Drunk... | CRIME | The balloon's basket caught fire after hitting... | JON HERSKOVITZ, Reuters | 2016-08-01 |
| 38328 | https://www.huffingtonpost.com/entry/toddler-w... | Toddler Who Lost Eye To Cancer Forms Special B... | PARENTS | Love this! | Caroline Bologna | 2017-04-10 |
| 134807 | https://www.huffingtonpost.com/entry/google-gl... | Google Really, Really Wants To Trademark The W... | TECH | According to The Wall Street Journal, the comp... | Tyler McCarthy | 2014-04-05 |
| 91903 | https://www.huffingtonpost.com/entry/hackers-m... | Hackers Made $100 Million With Info From Stole... | TECH | "The lesson in this is your information is onl... | David Porter, AP | 2015-08-11 |
| 59759 | https://www.huffingtonpost.com/entry/rio-olymp... | Here's Why The Olympic Diving Pool Turned Gree... | SPORTS | It's not easy being green. | Ryan Grenoble | 2016-08-10 |
| 72702 | https://www.huffingtonpost.com/entry/lithium-b... | Surging Demand For Rechargeable Batteries Is D... | TECH | Lithium is an essential component of many cons... | Rosalba O'Brien and Rod Nickel, Reuters | 2016-03-15 |
| 92379 | https://www.huffingtonpost.com/entry/lessons-p... | 12 Lessons From Pre-Internet Days Parents Wish... | PARENTS | Like how to use a dictionary or read a map. | Hollis Miller | 2015-08-05 |
| 35322 | https://www.huffingtonpost.com/entry/eleven-li... | Eleven Life Lessons I Learned From My Grandfather | PARENTS | The Magnificence of Grandparents Grandparents ... | Supreeya Swarup D.O., ContributorPhysician who... | 2017-05-14 |
| 186068 | https://www.huffingtonpost.com/entry/ohio-stat... | Ohio State University Marching Band Performs T... | SPORTS | (via Reddit) The video was posted to YouTube o... | Andres Jauregui | 2012-10-07 |


In [79]:
headlines = list(df3["headline"])
descriptions = list(df3["short_description"])
combined = []
for i in range(len(headlines)):
    combined.append(headlines[i] + ": " + descriptions[i])

In [80]:
H = classify(headlines,    candidate_labels=cats)
D = classify(descriptions, candidate_labels=cats)
C = classify(combined,     candidate_labels=cats)
A = list(df3["category"])

.wall: 88.87      cpu: 548.29
.wall: 60.44      cpu: 404.34
.wall: 64.20      cpu: 450.34


In [81]:
for i in range(len(H)):
    print("Headline:     ", H[i]["sequence"])
    print("Description:  ", D[i]["sequence"])
    print("Original classification:", A[i])
    print("Headline Analysis:")
    for j in range(3):
        print("  {:15s}  {:5.2f}".format(H[i]["labels"][j], H[i]["scores"][j]))
    print("Description Analysis:")
    for j in range(3):
        print("  {:15s}  {:5.2f}".format(D[i]["labels"][j], D[i]["scores"][j]))
    print("Combined Analysis:")
    for j in range(3):
        print("  {:15s}  {:5.2f}".format(C[i]["labels"][j], C[i]["scores"][j]))

Headline:      'Mac OS (Old School)' Skin Will Take Your Laptop Back To 1984 (PICTURE)
Description:   Are you a Mac user who wants the power and speed of a modern-day laptop but yearns for the basic graphics and stripped-down
Original classification: TECH
Headline Analysis:
  TECH              0.90
  PARENTS           0.03
  SPORTS            0.02
Description Analysis:
  TECH              0.76
  SPORTS            0.07
  WORLD NEWS        0.06
Combined Analysis:
  TECH              0.84
  PARENTS           0.04
  SPORTS            0.04
Headline:      Pilot In Fatal Hot Air Balloon Crash Had Drunk Driving, Drug Convictions
Description:   The balloon's basket caught fire after hitting power lines and killed 16 people in the crash.
Original classification: CRIME
Headline Analysis:
  CRIME             0.84
  WORLD NEWS        0.06
  TECH              0.05
Description Analysis:
  CRIME             0.28
  TECH              0.24
  WORLD NEWS        0.23
Combined Analysis:
  CRIME             0
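
To go from eyeballing these printouts to a single number, one option is to compare each top-ranked label with the original category. The sketch below assumes results shaped like the classifier output used above: a list of dicts whose `labels` are sorted by descending score. The two results in it are hand-written stand-ins, not a fresh model run, and `top1_accuracy` is a helper name made up for this example.

```python
def top1_accuracy(results, actual):
    """Fraction of items whose highest-scoring label equals the true category."""
    hits = sum(1 for r, a in zip(results, actual) if r["labels"][0] == a)
    return hits / len(actual)

# Hand-written stand-ins in the same shape as the classifier output.
fake_results = [
    {"labels": ["TECH", "PARENTS", "SPORTS"], "scores": [0.90, 0.03, 0.02]},
    {"labels": ["WORLD NEWS", "CRIME", "TECH"], "scores": [0.50, 0.30, 0.20]},
]
fake_actual = ["TECH", "CRIME"]

print("top-1 accuracy: {:.2f}".format(top1_accuracy(fake_results, fake_actual)))
```

With the variables from the cells above this would be called as `top1_accuracy(H, A)`, and likewise for `D` and `C`, to see whether headlines, descriptions, or the combination match the original categories most often.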

# 3. Text Generation

In [82]:
from transformers import pipeline
generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [83]:
t = timer()
t.start()
time.sleep(.4)
runcputime(.3)
t.stop()

wall:  0.72      cpu:  0.32


In [84]:
generator("today we will see")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'today we will see.'}]

In [85]:
generator("today we will see")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'today we will see that the government is taking significant steps to address the threat."\n\nOn Tuesday, the minister said it is time for "greater security" and called the increase to levels that the government had tried, "justifiable in light'}]
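
The two calls above use the same prompt but return different text. That is the behaviour one would expect if the pipeline samples each next token from a probability distribution instead of always taking the most likely one. Here is a toy sketch of the difference, with a made-up four-word distribution standing in for the model (no GPT-2 involved).

```python
import random

# Made-up next-word probabilities; a real model produces these at every step.
next_word_probs = {"that": 0.4, "what": 0.3, "how": 0.2, "why": 0.1}

def greedy(probs):
    """Always pick the single most likely word."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Draw a word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so this sketch is repeatable
print("greedy :", [greedy(next_word_probs) for _ in range(5)])
print("sampled:", [sample(next_word_probs, rng) for _ in range(5)])
```

Greedy selection repeats the same word every time, while sampling varies from draw to draw unless the random seed is fixed.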

In [86]:
t.restart()
X = generator("today we will see", pad_token_id=50256)
t.stop()
print(X[0]["generated_text"])

wall: 18.46      cpu: 105.40
today we will see why.

It's an event I've been asking myself many times this month. What if it was just a one time event, which I don't think can be a great solution if there are too many of them?


In [87]:
t.restart()
X = generator("today we will see",pad_token_id=50256)
t.stop()
print(X[0]["generated_text"])

wall:  4.80      cpu: 33.15
today we will see what happens. And that doesn't matter if the government doesn't want to make that a problem." If we lose our current political system to the state, and we get more in power, we will be reduced to a class of


In [88]:
t.restart()
X = generator("today we will see", pad_token_id=50256)
t.stop()
print(X[0]["generated_text"])

wall:  6.01      cpu: 40.28
today we will see how we make it work and I hope we did that in the future."

It's the first instance of President Obama's approval rating that it stands, well, in the high 30s. And it's the first time


In [89]:
t.restart()
X = generator("today we will see", pad_token_id=50256)
t.stop()
print(X[0]["generated_text"])

wall: 12.81      cpu: 71.94
today we will see you at this time."

Afterwards, he sat in the center of the room with a large smile on his face.

"… I'm thinking of going outside to talk with someone."

"Eh? What


In [90]:
p1 = "In this course, we will teach you how to get elephants to stop"
p2 = 'In this course, we will teach you how to get elephants to'
P = [p1, p2]
X = generator(P, num_return_sequences=2, pad_token_id=50256)
print(X)

[[{'generated_text': 'In this course, we will teach you how to get elephants to stop killing themselves in the wild. We will learn about how you can get free of elephants, and how long you may stay in the wild. We will also discuss how to take steps'}, {'generated_text': 'In this course, we will teach you how to get elephants to stop. We will show you how to take care of them when they are scared. We will show you how to be a part of them for them and be helpful with them. We'}], [{'generated_text': 'In this course, we will teach you how to get elephants to play outside of their natural habitat and explore the wildlife habitat without disturbing the ecology or wildlife from its interactions with nature.\n\nOur students will study an interactive nature immersion, demonstrating what it'}, {'generated_text': 'In this course, we will teach you how to get elephants to respond to the music (or the sounds of music without a sound in their head). Once you have learned this, we can listen and l

In [92]:
def genprint(prompt, **kwargs):
    t = timer()
    t.start()
    R = generator(prompt, pad_token_id=50256, **kwargs)
    t.stop()

    if isinstance(prompt, str):  # a single prompt
        print("P>{:s}...".format(prompt))
        skip_length = len(prompt)
        for i, r in enumerate(R):
            r2 = r["generated_text"][skip_length:].replace("\n", "  ")
            print("  R{:d}:{:s}".format(i+1,r2), end="")
            words = r2.split(" ")
            print("  ({:d} words)".format(len(words)))
    elif isinstance(prompt, list):     # a list of prompts
        for i, p in enumerate(prompt):
            print("P{:d}>{:s}...".format(i + 1,p))
            skip_length = len(p)
            for j, r in enumerate(R[i]):
                r2 = r["generated_text"][skip_length:].replace("\n", "  ")
                print("  P{:d}:R{:d}:{:s}".format(i + 1, j+1,r2), end="")        
                words = r2.split(" ")
                print("  ({:d} words)".format(len(words)))
    #return R

In [93]:
genprint(P, num_return_sequences = 2, min_length = 100, max_length = 150)

wall: 120.56      cpu: 690.00
P1>In this course, we will teach you how to get elephants to stop...
  P1:R1: playing with dolls for the first time!    • Teaching the Dachshund by yourself in 3 ways    • How to play as a Dachshund, with your friends    • How to play with your friends as you start playing the Dachshund    • How to play with your friends as you start playing the game    • Why it's great to play while waiting for a new one, not all species will get to the dachshund by accident  (90 words)
  P1:R2: looking and to see what happens. Our goal is to show you how to put your elephants to rest and how to be the first person to see this.    What You'll Learn    This course is designed for trainers and educational resources. The course has a 4+ minute video segment. We have a 10 minute video segment. An emphasis is placed on elephants and teaching their needs in an accessible way. You will have the opportunity to learn different elephant behaviour and learn about their habits and na

In [94]:
genprint(P, num_return_sequences = 5, min_length = 10, max_length = 20)

wall:  6.97      cpu: 40.05
P1>In this course, we will teach you how to get elephants to stop...
  P1:R1: using drugs and get it off  (7 words)
  P1:R2: barking – and how to stop  (7 words)
  P1:R3:. What are some lessons to  (6 words)
  P1:R4: and then move on. You  (6 words)
  P1:R5: running.    1:  (6 words)
P2>In this course, we will teach you how to get elephants to...
  P2:R1: calm down and come up with a  (8 words)
  P2:R2: adopt. We will also talk about  (7 words)
  P2:R3: breed quickly and to get a few  (8 words)
  P2:R4: the African forest at night and how  (8 words)
  P2:R5: behave as they want. What are  (7 words)


# 4. Mask Filling

In [95]:
unmasker = pipeline("fill-mask")

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [96]:
unmasker("Never forget that <mask> is the most important rule.")

[{'score': 0.06934794038534164,
  'token': 25342,
  'token_str': ' simplicity',
  'sequence': 'Never forget that simplicity is the most important rule.'},
 {'score': 0.06421764194965363,
  'token': 11383,
  'token_str': ' patience',
  'sequence': 'Never forget that patience is the most important rule.'},
 {'score': 0.04692889004945755,
  'token': 12787,
  'token_str': ' consistency',
  'sequence': 'Never forget that consistency is the most important rule.'},
 {'score': 0.03850202634930611,
  'token': 27352,
  'token_str': ' humility',
  'sequence': 'Never forget that humility is the most important rule.'},
 {'score': 0.03507177531719208,
  'token': 30698,
  'token_str': ' moderation',
  'sequence': 'Never forget that moderation is the most important rule.'}]

In [97]:
unmasker("Never forget that <mask> is the most important rule.", top_k = 3)

[{'score': 0.06934794038534164,
  'token': 25342,
  'token_str': ' simplicity',
  'sequence': 'Never forget that simplicity is the most important rule.'},
 {'score': 0.06421764194965363,
  'token': 11383,
  'token_str': ' patience',
  'sequence': 'Never forget that patience is the most important rule.'},
 {'score': 0.04692889004945755,
  'token': 12787,
  'token_str': ' consistency',
  'sequence': 'Never forget that consistency is the most important rule.'}]

In [98]:
def unmask(m, **kwargs):
    R = unmasker(m, **kwargs)
    print(" {:20s}  {:7s}  {:6s}  {:6s}".format("word", "tknNum", "prob", "aggr"))
    aggr = 0.0
    for r in R:
        aggr += r["score"]
        print("{:20s}  {:7d}  {:6.3f}  {:6.3f}".format(r["token_str"], r["token"], r["score"], aggr))
    return

In [99]:
unmask("Never forget that <mask> is the most important rule.", top_k = 15)

 word                  tknNum   prob    aggr  
 simplicity             25342   0.069   0.069
 patience               11383   0.064   0.134
 consistency            12787   0.047   0.180
 humility               27352   0.039   0.219
 moderation             30698   0.035   0.254
 modesty                39706   0.029   0.283
 repetition             37176   0.021   0.303
 luck                    6620   0.017   0.320
 honesty                19439   0.014   0.334
 tolerance              12352   0.013   0.347
 equality                9057   0.012   0.359
 ignorance              22092   0.010   0.369
 diversity               5845   0.010   0.379
 discretion             14145   0.010   0.389
 loyalty                10177   0.009   0.398


In [100]:
unmask("Never forget that <mask> is the most important rule.", top_k = 45)

 word                  tknNum   prob    aggr  
 simplicity             25342   0.069   0.069
 patience               11383   0.064   0.134
 consistency            12787   0.047   0.180
 humility               27352   0.039   0.219
 moderation             30698   0.035   0.254
 modesty                39706   0.029   0.283
 repetition             37176   0.021   0.303
 luck                    6620   0.017   0.320
 honesty                19439   0.014   0.334
 tolerance              12352   0.013   0.347
 equality                9057   0.012   0.359
 ignorance              22092   0.010   0.369
 diversity               5845   0.010   0.379
 discretion             14145   0.010   0.389
 loyalty                10177   0.009   0.398
 trust                   2416   0.009   0.407
 respect                 2098   0.008   0.416
 obedience              41227   0.008   0.424
 caution                 8038   0.008   0.433
 kindness               15963   0.007   0.440
 fairness               16890   0

In [101]:
unmask("On rainy days never <mask> to bring an umbrella.", top_k=10)

 word                  tknNum   prob    aggr  
 forget                  4309   0.864   0.864
 remember                2145   0.028   0.892
 hesitate               21587   0.021   0.914
 bother                 15304   0.021   0.935
 try                      860   0.007   0.942
 have                      33   0.005   0.948
 expect                  1057   0.005   0.952
 think                    206   0.005   0.957
 fail                    5998   0.004   0.961
 get                      120   0.003   0.964


In [102]:
unmask("everyone in Spain loves <mask> - it's a national treasure.", top_k = 5)

 word                  tknNum   prob    aggr  
 Catalonia              17393   0.069   0.069
 it                        24   0.068   0.137
 chocolate               7548   0.063   0.200
 wine                    3984   0.036   0.235
 Pepe                   35711   0.028   0.264


In [104]:
unmask("One <mask> to rule them all.", top_k=40)

 word                  tknNum   prob    aggr  
 wants                   1072   0.082   0.082
 way                      169   0.065   0.147
 has                       34   0.038   0.185
 person                   621   0.034   0.219
 man                      313   0.024   0.242
 thing                    631   0.022   0.264
 chance                   778   0.020   0.285
 needs                    782   0.020   0.304
 day                      183   0.019   0.324
 need                     240   0.019   0.343
 tries                   5741   0.014   0.357
 reason                  1219   0.013   0.370
 gets                    1516   0.011   0.381
 faction                18666   0.011   0.392
 is                        16   0.010   0.402
 wanted                   770   0.009   0.411
 word                    2136   0.009   0.420
 hopes                   1991   0.009   0.428
 ought                  12960   0.008   0.436
 want                     236   0.008   0.444
 wishes                  8605   0

In [105]:
unmask("I won't belong to any <mask> that would have me as a member", top_k = 5)

 word                  tknNum   prob    aggr  
 organization            1651   0.313   0.313
 organisation            6010   0.200   0.513
 club                     950   0.145   0.658
 group                    333   0.030   0.688
 party                    537   0.029   0.717


In [106]:
unmask("My Top 10 lists always include a <mask> and at lesat one famous person", top_k = 10)

 word                  tknNum   prob    aggr  
 celebrity               6794   0.363   0.363
 comedian               10688   0.029   0.392
 legend                  7875   0.029   0.421
 billionaire             9479   0.021   0.442
 millionaire            31541   0.014   0.457
 movie                   1569   0.014   0.471
 superstar              10896   0.014   0.485
 musician                9613   0.013   0.498
 name                     766   0.013   0.511
 famous                  3395   0.012   0.523


In [107]:
unmask("My Top 10 lists always include a <mask> and at least one famous person", top_k = 10)

 word                  tknNum   prob    aggr  
 celebrity               6794   0.323   0.323
 billionaire             9479   0.070   0.392
 legend                  7875   0.034   0.426
 politician              8676   0.030   0.457
 comedian               10688   0.027   0.484
 name                     766   0.024   0.508
 winner                  1924   0.017   0.525
 star                     999   0.017   0.542
 movie                   1569   0.016   0.558
 millionaire            31541   0.015   0.573
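
The only difference between the last two prompts is the typo `lesat`, yet the rankings shift. One rough way to quantify that is to compare the two top-10 lists as sets. The word lists below are copied from the two outputs above.

```python
with_typo = ["celebrity", "comedian", "legend", "billionaire", "millionaire",
             "movie", "superstar", "musician", "name", "famous"]
corrected = ["celebrity", "billionaire", "legend", "politician", "comedian",
             "name", "winner", "star", "movie", "millionaire"]

# Words that appear in both top-10 lists, regardless of rank.
shared = set(with_typo) & set(corrected)

print("shared         :", len(shared), "of", len(corrected))
print("only with typo :", sorted(set(with_typo) - shared))
print("only corrected :", sorted(set(corrected) - shared))
```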


# 5. Named Entity Recognition

In [108]:
ner = pipeline("ner", grouped_entities=True)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [109]:
ner("My name is Pablo and I work at Hugging Face in Brooklyn.")

[{'entity_group': 'PER',
  'score': 0.99857616,
  'word': 'Pablo',
  'start': 11,
  'end': 16},
 {'entity_group': 'ORG',
  'score': 0.9685392,
  'word': 'Hugging Face',
  'start': 31,
  'end': 43},
 {'entity_group': 'LOC',
  'score': 0.9945529,
  'word': 'Brooklyn',
  'start': 47,
  'end': 55}]

In [110]:
def showner(m):
    X = ner(m)
    if len(X) == 0:
        print("No entities recognized")
    print("{:20s}  {:10s}  {:5s}".format("<word>", "<entity>", "<score>"))
    for x in X:
        print("{:20s}  {:10s}  {:5.3f}".format(x["word"], x["entity_group"], x["score"]))

In [111]:
m = "Abraham Lincoln was an American lawyer, politician, and statesman who served as the 16th president of the United States from 1861 until his assassination in 1865. Lincoln led the United States through the American Civil War, defending the nation as a constitutional union, defeating the insurgent Confederacy, abolishing slavery, expanding the power of the federal government, and modernizing the U.S. economy."
print(m)
showner(m)

Abraham Lincoln was an American lawyer, politician, and statesman who served as the 16th president of the United States from 1861 until his assassination in 1865. Lincoln led the United States through the American Civil War, defending the nation as a constitutional union, defeating the insurgent Confederacy, abolishing slavery, expanding the power of the federal government, and modernizing the U.S. economy.
<word>                <entity>    <score>
Abraham Lincoln       PER         0.999
American              MISC        0.999
United States         LOC         0.998
Lincoln               PER         0.999
United States         LOC         0.999
American Civil War    MISC        0.997
Confederacy           ORG         0.516
U                     LOC         0.999
S                     LOC         0.997


In [112]:
showner("Penelope drove to Madison to see the packers.")

<word>                <entity>    <score>
Penelope              PER         0.980
Madison               LOC         0.992


In [113]:
showner("Penelope drove to Madison to see the Packers.")

<word>                <entity>    <score>
Penelope              PER         0.979
Madison               LOC         0.991
Packers               ORG         0.999


In [114]:
showner("Penelope drove to Madison to see the Smorges.")

<word>                <entity>    <score>
Penelope              PER         0.974
Madison               LOC         0.981
S                     PER         0.844
##mor                 ORG         0.334
##ges                 MISC        0.464
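
The `##mor` and `##ges` rows look like WordPiece sub-word pieces: a BERT-style tokenizer splits a word it does not know into smaller pieces and marks continuations with `##`. Here the pieces received different entity labels, which is presumably why the pipeline did not group them back into one word. Below is a minimal sketch of how such pieces join into words, assuming the `##` continuation convention. The helper name is made up for this example.

```python
def merge_wordpieces(pieces):
    """Join WordPiece tokens, gluing '##' continuations onto the previous token."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]   # drop the '##' marker and append
        else:
            words.append(piece)      # start of a new word
    return words

print(merge_wordpieces(["Penelope", "S", "##mor", "##ges"]))
```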


In [115]:
showner("Penelope drove to Madison to see the Stooges.")

<word>                <entity>    <score>
Penelope              PER         0.981
Madison               LOC         0.984
Stooges               MISC        0.435


In [116]:
showner("Penelope drove to Madison to see the stooges.")

<word>                <entity>    <score>
Penelope              PER         0.965
Madison               LOC         0.991


In [117]:
m = "a game played on a field between two teams of 11 players each with the object to propel a round ball into the opponent's goal by kicking or by hitting it with any part of the body except the hands and arms. called also association football"
m
showner(m)

No entities recognized
<word>                <entity>    <score>


In [118]:
m = "here is my cat.  My cat sat on the bed.  My cat is white."
showner(m)

No entities recognized
<word>                <entity>    <score>


In [120]:
m = "The cat sat on the hat."
showner(m)

No entities recognized
<word>                <entity>    <score>


In [124]:
showner("Joey ran after Paulo.")

<word>                <entity>    <score>
Joey                  PER         0.997
Paulo                 PER         0.994


In [125]:
showner("Jane ran the race with Bob.")

<word>                <entity>    <score>
Jane                  PER         0.997
Bob                   PER         0.996


In [126]:
showner("Jane ran the race with Bob.  George ran alone.  Al watched television.")

<word>                <entity>    <score>
Jane                  PER         0.997
Bob                   PER         0.997
George                PER         0.998
Al                    PER         0.997


In [129]:
m = "Martin Luther King Jr. was an American Christian minister, activist, and political philosopher who was one of the most prominent leaders in the civil rights movement from 1955 until his assassination in 1968. A Black church leader and a son of early civil rights activist and minister Martin Luther King Sr., King advanced civil rights for people of color in the United States through the use of nonviolent resistance and nonviolent civil disobedience against Jim Crow laws and other forms of legalized discrimination."
print(m)
showner(m)

Martin Luther King Jr. was an American Christian minister, activist, and political philosopher who was one of the most prominent leaders in the civil rights movement from 1955 until his assassination in 1968. A Black church leader and a son of early civil rights activist and minister Martin Luther King Sr., King advanced civil rights for people of color in the United States through the use of nonviolent resistance and nonviolent civil disobedience against Jim Crow laws and other forms of legalized discrimination.
<word>                <entity>    <score>
Martin Luther King Jr  PER         0.999
American Christian    MISC        0.995
Black                 MISC        0.982
Martin Luther King Sr  PER         0.975
King                  PER         0.999
United States         LOC         0.999
Jim Crow              MISC        0.978


In [130]:
m = "Martin Luther King Jr. was an American Christian minister, activist, and political philosopher who was one of" + " the most prominent leaders in the civil rights movement from 1955 until his assassination in 1968. A Black church leader" + " and a son of early civil rights activist and minister Martin Luther King Sr., King advanced civil rights for people of color " +  "in the United States through the use of nonviolent resistance and nonviolent civil disobedience against bad laws and other forms of legalized discrimination."
print(m)
showner(m)

Martin Luther King Jr. was an American Christian minister, activist, and political philosopher who was one of the most prominent leaders in the civil rights movement from 1955 until his assassination in 1968. A Black church leader and a son of early civil rights activist and minister Martin Luther King Sr., King advanced civil rights for people of color in the United States through the use of nonviolent resistance and nonviolent civil disobedience against bad laws and other forms of legalized discrimination.
<word>                <entity>    <score>
Martin Luther King Jr  PER         0.999
American Christian    MISC        0.995
Black                 MISC        0.978
Martin Luther King Sr  PER         0.977
King                  PER         0.999
United States         LOC         0.999


# 6. Question Answering

In [131]:
qa = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [132]:
qa( question="where do I work?",
    context="my  name is sylvain, and I work at Hugging Face in Brooklyn")

{'score': 0.6508346199989319, 'start': 35, 'end': 47, 'answer': 'Hugging Face'}

In [133]:
qa( question="what city do I work?",
    context="my  name is sylvain, and I work at Hugging Face in Madison Wisconsin")

{'score': 0.36139973998069763,
 'start': 35,
 'end': 68,
 'answer': 'Hugging Face in Madison Wisconsin'}

In [134]:
qa(
    question="What is my name?",
    context="I am the 42nd president of the united states",
)

{'score': 0.39718547463417053,
 'start': 9,
 'end': 23,
 'answer': '42nd president'}

In [135]:
qa( question="What is the value of x in the equation?",
    context="3 * x = 27")

{'score': 0.45681890845298767, 'start': 0, 'end': 10, 'answer': '3 * x = 27'}

In [136]:
qa( question="What is the value of x?",
    context="x = 5 and y = 3")

{'score': 0.3335028886795044, 'start': 0, 'end': 5, 'answer': 'x = 5'}

In [137]:
qa( question="What is the value of x + y?",
    context="x = 5 and y = 3")

{'score': 0.2668091952800751, 'start': 14, 'end': 15, 'answer': '3'}

In [138]:
qa( question="What is the value of   y + x?",
    context="x = 5 and y = 3")

{'score': 0.4152500033378601, 'start': 14, 'end': 15, 'answer': '3'}

In [139]:
qa( question="What is the value of   y plus x?",
    context="x = 5 and y = 3")

{'score': 0.47785842418670654, 'start': 14, 'end': 15, 'answer': '3'}
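
Every answer above is a substring of the context: the `start` and `end` values are character offsets into it. That would explain why the model can return `x = 5` but never `8` for `x + y`, since `8` does not appear in the text. Here is a quick check using the offsets reported above.

```python
# (context, start, end, answer) copied from the outputs above.
cases = [
    ("my  name is sylvain, and I work at Hugging Face in Brooklyn", 35, 47, "Hugging Face"),
    ("x = 5 and y = 3", 0, 5, "x = 5"),
    ("x = 5 and y = 3", 14, 15, "3"),
]

for context, start, end, answer in cases:
    span = context[start:end]  # slice the context with the reported offsets
    print("{!r:16} matches: {}".format(span, span == answer))

# The sum 5 + 3 is not present in the context, so no span can contain it.
print("'8' in context:", "8" in "x = 5 and y = 3")
```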

# 7. Summarization

In [140]:
p = """
To be, or not to be: that is the question:
Whether ’tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, ’tis a consummation
Devoutly to be wished. To die, to sleep;
To sleep: perchance to dream: ay, there’s the rub;
For in that sleep of death what dreams may come
When we have shuffled off this mortal coil,
Must give us pause: there’s the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor’s wrong, the proud man’s contumely,
The pangs of despised love, the law’s delay,
The insolence of office and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscovered country from whose bourn
No traveller returns, puzzles the will
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all;
And thus the native hue of resolution
Is sicklied o’er with the pale cast of thought,
And enterprises of great pith and moment
With this regard their currents turn awry,
And lose the name of action.  Soft you now!
The fair Ophelia! Nymph, in thy orisons
Be all my sins remembered.
"""

In [141]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [142]:
summarizer(p)

[{'summary_text': " To die: to sleep; that is the question: whether ’tis nobler in the mind to suffer’s slings and arrows of outrageous fortune, or to take arms against a sea of troubles, is it nobler to suffer? To sleep: perchance to dream: ay, there's the rub; in that sleep of death what dreams may come?"}]

In [143]:
summarizer(p, min_length=50, max_length=100)

[{'summary_text': " To die: to sleep; that is the question: whether ’tis nobler in the mind to suffer’s slings and arrows of outrageous fortune, or to take arms against a sea of troubles, is it nobler to suffer? To sleep: perchance to dream: 'In that sleep of death what dreams may come?'"}]

In [146]:
# https://www.whitehouse.gov/about-the-white-house/presidents/abraham-lincoln/
p = """Abraham Lincoln became the United States’ 16th President in 1861, issuing the Emancipation Proclamation that declared forever free those slaves within the Confederacy in 1863.
Lincoln warned the South in his Inaugural Address: “In your hands, my dissatisfied fellow countrymen, and not in mine, is the momentous issue of civil war. The government will not assail you…. You have no oath registered in Heaven to destroy the government, while I shall have the most solemn one to preserve, protect and defend it.”
Lincoln thought secession illegal, and was willing to use force to defend Federal law and the Union. When Confederate batteries fired on Fort Sumter and forced its surrender, he called on the states for 75,000 volunteers. Four more slave states joined the Confederacy but four remained within the Union. The Civil War had begun.
The son of a Kentucky frontiersman, Lincoln had to struggle for a living and for learning. Five months before receiving his party’s nomination for President, he sketched his life:
“I was born Feb. 12, 1809, in Hardin County, Kentucky. My parents were both born in Virginia, of undistinguished families–second families, perhaps I should say. My mother, who died in my tenth year, was of a family of the name of Hanks…. My father … removed from Kentucky to … Indiana, in my eighth year…. It was a wild region, with many bears and other wild animals still in the woods. There I grew up…. Of course when I came of age I did not know much. Still somehow, I could read, write, and cipher … but that was all.”
Lincoln made extraordinary efforts to attain knowledge while working on a farm, splitting rails for fences, and keeping store at New Salem, Illinois. He was a captain in the Black Hawk War, spent eight years in the Illinois legislature, and rode the circuit of courts for many years. His law partner said of him, “His ambition was a little engine that knew no rest.”
He married Mary Todd, and they had four boys, only one of whom lived to maturity. In 1858 Lincoln ran against Stephen A. Douglas for Senator. He lost the election, but in debating with Douglas he gained a national reputation that won him the Republican nomination for President in 1860.
As President, he built the Republican Party into a strong national organization. Further, he rallied most of the northern Democrats to the Union cause. On January 1, 1863, he issued the Emancipation Proclamation that declared forever free those slaves within the Confederacy.
Lincoln never let the world forget that the Civil War involved an even larger issue. This he stated most movingly in dedicating the military cemetery at Gettysburg: “that we here highly resolve that these dead shall not have died in vain–that this nation, under God, shall have a new birth of freedom–and that government of the people, by the people, for the people, shall not perish from the earth.”
Lincoln won re-election in 1864, as Union military triumphs heralded an end to the war. In his planning for peace, the President was flexible and generous, encouraging Southerners to lay down their arms and join speedily in reunion.
The spirit that guided him was clearly that of his Second Inaugural Address, now inscribed on one wall of the Lincoln Memorial in Washington, D. C.: “With malice toward none; with charity for all; with firmness in the right, as God gives us to see the right, let us strive on to finish the work we are in; to bind up the nation’s wounds…. ”
On Good Friday, April 14, 1865, Lincoln was assassinated at Ford’s Theatre in Washington by John Wilkes Booth, an actor, who somehow thought he was helping the South. The opposite was the result, for with Lincoln’s death, the possibility of peace with magnanimity died.
"""
p

'Abraham Lincoln became the United States’ 16th President in 1861, issuing the Emancipation Proclamation that declared forever free those slaves within the Confederacy in 1863.\nLincoln warned the South in his Inaugural Address: “In your hands, my dissatisfied fellow countrymen, and not in mine, is the momentous issue of civil war. The government will not assail you…. You have no oath registered in Heaven to destroy the government, while I shall have the most solemn one to preserve, protect and defend it.”\nLincoln thought secession illegal, and was willing to use force to defend Federal law and the Union. When Confederate batteries fired on Fort Sumter and forced its surrender, he called on the states for 75,000 volunteers. Four more slave states joined the Confederacy but four remained within the Union. The Civil War had begun.\nThe son of a Kentucky frontiersman, Lincoln had to struggle for a living and for learning. Five months before receiving his party’s nomination for President,

In [150]:
summarizer(p, max_length=60)

[{'summary_text': ' Abraham Lincoln became the United States’ 16th President in 1861, issuing the Emancipation Proclamation that declared forever free those slaves within the Confederacy in 1863 . The son of a Kentucky frontiersman, Lincoln had to struggle for a living and for learning .'}]

In [148]:
summarizer(p, max_length=100)

[{'summary_text': ' Abraham Lincoln became the United States’ 16th President in 1861, issuing the Emancipation Proclamation that declared forever free those slaves within the Confederacy in 1863 . The son of a Kentucky frontiersman, Lincoln had to struggle for a living and for learning . Lincoln thought secession illegal, and was willing to use force to defend Federal law and the Union .'}]

In [149]:
summarizer(p, max_length=180)

[{'summary_text': ' Abraham Lincoln became the United States’ 16th President in 1861, issuing the Emancipation Proclamation that declared forever free those slaves within the Confederacy in 1863 . The son of a Kentucky frontiersman, Lincoln had to struggle for a living and for learning . Lincoln thought secession illegal, and was willing to use force to defend Federal law and the Union .'}]
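
Raising `max_length` from 100 to 180 gave exactly the same summary. That fits `max_length` being an upper bound on generated tokens rather than a target length. Below is a small sketch comparing the three recorded outputs. Splitting on `" ."` is my own rough assumption about how this model separates sentences, not anything the pipeline documents.

```python
# Summaries copied from the three cells above, keyed by the max_length used.
summaries = {
    60: " Abraham Lincoln became the United States’ 16th President in 1861, issuing the Emancipation Proclamation that declared forever free those slaves within the Confederacy in 1863 . The son of a Kentucky frontiersman, Lincoln had to struggle for a living and for learning .",
    100: " Abraham Lincoln became the United States’ 16th President in 1861, issuing the Emancipation Proclamation that declared forever free those slaves within the Confederacy in 1863 . The son of a Kentucky frontiersman, Lincoln had to struggle for a living and for learning . Lincoln thought secession illegal, and was willing to use force to defend Federal law and the Union .",
    180: " Abraham Lincoln became the United States’ 16th President in 1861, issuing the Emancipation Proclamation that declared forever free those slaves within the Confederacy in 1863 . The son of a Kentucky frontiersman, Lincoln had to struggle for a living and for learning . Lincoln thought secession illegal, and was willing to use force to defend Federal law and the Union .",
}

# Count sentences per summary (rough split on the " ." separator seen in the output).
counts = {}
for max_len, text in summaries.items():
    sentences = [s.strip() for s in text.split(" .") if s.strip()]
    counts[max_len] = len(sentences)
    print(max_len, "->", len(sentences), "sentences,", len(text.split()), "words")

# The longer limit did not produce a longer summary.
print("100 and 180 identical:", summaries[100] == summaries[180])
```

The shorter limit simply drops the last sentence, and past a certain point the model stops on its own.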

# 8. Language

In [151]:
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

In [152]:
# translate Hindi to French
tokenizer.src_lang = "hi_IN"
encoded_hi = tokenizer(article_hi, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_hi,
    forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

["Le chef de l 'ONU affirme qu 'il n 'y a pas de solution militaire en Syria."]

In [153]:
# translate Arabic to English
tokenizer.src_lang = "ar_AR"
encoded_ar = tokenizer(article_ar, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_ar,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

['The Secretary-General of the United Nations says there is no military solution in Syria.']

In [154]:
langEng = "en_XX"
langFr = "fr_XX"

tokenizer.src_lang = langEng
m1_eng = "Time flies like an arrow.  Fruit flies like a banana"

encoded_en = tokenizer(m1_eng, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_en,
    forced_bos_token_id=tokenizer.lang_code_to_id[langFr]
)

In [155]:
m2 = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
m3_fr = " ".join(m2)  # batch_decode returns a list of strings

In [156]:
tokenizer.src_lang = langFr
encoded_fr = tokenizer(m3_fr, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_fr,
    forced_bos_token_id=tokenizer.lang_code_to_id[langEng]
)

In [157]:
m4 = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

In [158]:
print(m1_eng)
print("m2 encoded:")
print(m2)
print("  to french")
print(m3_fr)
print("  back to english")
print(m4)

Time flies like an arrow.  Fruit flies like a banana
m2 encoded:
['Le temps va comme une flèche, les fruits comme une banane.']
  to french
Le temps va comme une flèche, les fruits comme une banane.
  back to english
['Time goes like a arrow, fruits like bananas.']
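
The round trip above repeats the same steps twice: set the source language, tokenize, generate with a forced target-language token, decode. Here is a sketch of how I might wrap that up. `translate` and `round_trip` are hypothetical helper names of my own, not part of transformers. It assumes the `model` and `tokenizer` objects loaded earlier, and that the installed tokenizer still exposes `lang_code_to_id` as used in the cells above.

```python
def translate(text, src_lang, tgt_lang, model, tokenizer):
    """Translate text between two mBART-50 language codes, e.g. "en_XX" to "fr_XX"."""
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **encoded,
        # Forcing the first generated token selects the target language.
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
    )
    # batch_decode returns a list; a single input gives a single string.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]


def round_trip(text, src_lang, pivot_lang, model, tokenizer):
    """Translate to pivot_lang and back; returns (pivot_text, back_text)."""
    pivot = translate(text, src_lang, pivot_lang, model, tokenizer)
    back = translate(pivot, pivot_lang, src_lang, model, tokenizer)
    return pivot, back
```

With the objects from this section, usage would look like `round_trip(m1_eng, "en_XX", "fr_XX", model, tokenizer)`, which makes it easy to try other pivot languages and see where the pun gets lost.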


# 9. Bias and limitations

In [159]:
unmasker = pipeline("fill-mask", model="bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [160]:
unmasker("This man works as a [MASK].")

[{'score': 0.07510633021593094,
  'token': 10533,
  'token_str': 'carpenter',
  'sequence': 'this man works as a carpenter.'},
 {'score': 0.04641924798488617,
  'token': 5160,
  'token_str': 'lawyer',
  'sequence': 'this man works as a lawyer.'},
 {'score': 0.03914566710591316,
  'token': 7500,
  'token_str': 'farmer',
  'sequence': 'this man works as a farmer.'},
 {'score': 0.03280138969421387,
  'token': 6883,
  'token_str': 'businessman',
  'sequence': 'this man works as a businessman.'},
 {'score': 0.029292281717061996,
  'token': 3460,
  'token_str': 'doctor',
  'sequence': 'this man works as a doctor.'}]

In [161]:
unmasker("This woman works as a [MASK].")

[{'score': 0.1279517561197281,
  'token': 6821,
  'token_str': 'nurse',
  'sequence': 'this woman works as a nurse.'},
 {'score': 0.07453138381242752,
  'token': 10850,
  'token_str': 'maid',
  'sequence': 'this woman works as a maid.'},
 {'score': 0.07191146165132523,
  'token': 3836,
  'token_str': 'teacher',
  'sequence': 'this woman works as a teacher.'},
 {'score': 0.061337556689977646,
  'token': 13877,
  'token_str': 'waitress',
  'sequence': 'this woman works as a waitress.'},
 {'score': 0.04157001152634621,
  'token': 19215,
  'token_str': 'prostitute',
  'sequence': 'this woman works as a prostitute.'}]
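
To put the two results side by side, here is a small sketch using only the scores recorded above (no model call). The variable names are my own.

```python
# Top-5 predictions copied from the two cells above: (token_str, score).
man = [
    ("carpenter", 0.07510633021593094),
    ("lawyer", 0.04641924798488617),
    ("farmer", 0.03914566710591316),
    ("businessman", 0.03280138969421387),
    ("doctor", 0.029292281717061996),
]
woman = [
    ("nurse", 0.1279517561197281),
    ("maid", 0.07453138381242752),
    ("teacher", 0.07191146165132523),
    ("waitress", 0.061337556689977646),
    ("prostitute", 0.04157001152634621),
]

# Occupations the model suggests for both prompts.
shared = {job for job, _ in man} & {job for job, _ in woman}

# How much probability the top 5 covers in each case.
man_mass = sum(score for _, score in man)
woman_mass = sum(score for _, score in woman)

print("shared occupations:", shared)
print(f"top-5 mass, man: {man_mass:.3f}  woman: {woman_mass:.3f}")
```

The two top-5 lists have no occupation in common, even though the prompts differ by a single word. Note the top 5 covers well under half the probability in both cases, so these are the model's favorites, not its full distribution.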

In [162]:
unmasker("The doctor married [MASK] in a large wedding")

[{'score': 0.6286600232124329,
  'token': 2014,
  'token_str': 'her',
  'sequence': 'the doctor married her in a large wedding'},
 {'score': 0.015075196512043476,
  'token': 4698,
  'token_str': 'anna',
  'sequence': 'the doctor married anna in a large wedding'},
 {'score': 0.011716079898178577,
  'token': 2032,
  'token_str': 'him',
  'sequence': 'the doctor married him in a large wedding'},
 {'score': 0.011649948544800282,
  'token': 2068,
  'token_str': 'them',
  'sequence': 'the doctor married them in a large wedding'},
 {'score': 0.008837484754621983,
  'token': 2984,
  'token_str': 'mary',
  'sequence': 'the doctor married mary in a large wedding'}]
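
The gap between 'her' and 'him' is easier to see as a ratio. Here is a quick sketch from the recorded scores; nothing here calls the model.

```python
# (token_str, score) pairs copied from the output above.
results = [
    ("her", 0.6286600232124329),
    ("anna", 0.015075196512043476),
    ("him", 0.011716079898178577),
    ("them", 0.011649948544800282),
    ("mary", 0.008837484754621983),
]
scores = dict(results)

# How many times more probable the model finds 'her' than 'him'.
ratio = scores["her"] / scores["him"]
print(f"'her' vs 'him': {ratio:.1f}x")
```

The model rates 'her' more than fifty times as likely as 'him', which says a lot about what it assumes about the doctor.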

In [163]:
unmasker("The doctor was from [MASK] in Africa")

[{'score': 0.22439421713352203,
  'token': 4873,
  'token_str': 'somewhere',
  'sequence': 'the doctor was from somewhere in africa'},
 {'score': 0.04250261187553406,
  'token': 7938,
  'token_str': 'kenya',
  'sequence': 'the doctor was from kenya in africa'},
 {'score': 0.042246848344802856,
  'token': 3088,
  'token_str': 'africa',
  'sequence': 'the doctor was from africa in africa'},
 {'score': 0.036910396069288254,
  'token': 10031,
  'token_str': 'uganda',
  'sequence': 'the doctor was from uganda in africa'},
 {'score': 0.032685037702322006,
  'token': 11959,
  'token_str': 'tanzania',
  'sequence': 'the doctor was from tanzania in africa'}]

In [164]:
unmasker("The two largest causes of homelessness are [MASK] and mental illness")

[{'score': 0.214686781167984,
  'token': 5635,
  'token_str': 'poverty',
  'sequence': 'the two largest causes of homelessness are poverty and mental illness'},
 {'score': 0.11533404886722565,
  'token': 12163,
  'token_str': 'unemployment',
  'sequence': 'the two largest causes of homelessness are unemployment and mental illness'},
 {'score': 0.046964023262262344,
  'token': 15877,
  'token_str': 'tuberculosis',
  'sequence': 'the two largest causes of homelessness are tuberculosis and mental illness'},
 {'score': 0.04671980068087578,
  'token': 9012,
  'token_str': 'hunger',
  'sequence': 'the two largest causes of homelessness are hunger and mental illness'},
 {'score': 0.041504934430122375,
  'token': 25519,
  'token_str': 'alcoholism',
  'sequence': 'the two largest causes of homelessness are alcoholism and mental illness'}]