# How to label tokens

In [2]:
# autoreload
%load_ext autoreload
%autoreload 2

import spacy

import token_labelling

We parse a simple sentence and inspect the resulting tokens together with their linguistic attributes.  
The grammatical analysis is done by spaCy's pretrained English model.

In [23]:
# Load the English model
nlp = spacy.load("en_core_web_sm")

# Create a Doc object from a given text
doc = nlp("This is a dummy sentence for testing.")

token = doc[0]
print(token)

This


Let's get the label for the token that we just printed.

In [46]:
label = token_labelling.label_single_token(token)
print(label)

[False, True, False, True, False, False, False, False, False, False, True, False, False, False, False]


Let's get an understanding of what the labels actually mean.
Use this function to print an explanation of the labels for a single token.

In [42]:
token_labelling.explain_token_labels(token)

-------- Explanation of token labels --------
Token text:          This
Token dependency:    nominal subject
Token POS:           pronoun
---------------- Token labels ---------------
  0   Starts with space    False
  1   Capitalized          True
  2   Is Noun              False
  3   Is Pronoun           True
  4   Is Adjective         False
  5   Is Verb              False
  6   Is Adverb            False
  7   Is Preposition       False
  8   Is Conjunction       False
  9   Is Interjunction     False
 10   Is Subject           True
 11   Is Object            False
 12   Is Root              False
 13   Is auxiliary         False
 14   Is Named Entity      False
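
The label list is positional, so each boolean only makes sense next to its name. A minimal sketch of that mapping follows. The names in `LABEL_NAMES` are copied from the explanation output above, and `active_labels` is a hypothetical helper for illustration, not part of `token_labelling`.

```python
# Label names in index order, copied from the explanation output above.
LABEL_NAMES = [
    "Starts with space", "Capitalized", "Is Noun", "Is Pronoun",
    "Is Adjective", "Is Verb", "Is Adverb", "Is Preposition",
    "Is Conjunction", "Is Interjunction", "Is Subject", "Is Object",
    "Is Root", "Is auxiliary", "Is Named Entity",
]


def active_labels(label):
    """Return the names of all labels that are True (hypothetical helper)."""
    if len(label) != len(LABEL_NAMES):
        raise ValueError("label vector and name list differ in length")
    return [name for name, flag in zip(LABEL_NAMES, label) if flag]


# The label printed for the token "This" earlier in this notebook.
label = [False, True, False, True, False, False, False, False,
         False, False, True, False, False, False, False]

print(active_labels(label))  # ['Capitalized', 'Is Pronoun', 'Is Subject']
```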


To see every label that spaCy can assign to a token, call the same function without any argument.

In [43]:
token_labelling.explain_token_labels()

Explanation of all 302 token labels (POS, dependency, NER, ...):
    ADJ        adjective
    ADP        adposition
    ADV        adverb
    AUX        auxiliary
    CONJ       conjunction
    CCONJ      coordinating conjunction
    DET        determiner
    INTJ       interjection
    NOUN       noun
    NUM        numeral
    PART       particle
    PRON       pronoun
    PROPN      proper noun
    PUNCT      punctuation
    SCONJ      subordinating conjunction
    SYM        symbol
    VERB       verb
    X          other
    EOL        end of line
    SPACE      space
    .          punctuation mark, sentence closer
    ,          punctuation mark, comma
    -LRB-      left round bracket
    -RRB-      right round bracket
    ``         opening quotation mark
    ""         closing quotation mark
    ''         closing quotation mark
    :          punctuation mark, colon or ellipsis
    $          symbol, currency
    #          symbol, number sign
    AFX        affix
    CC    

Next, let us analyze a batch of sentences and have them labelled.
> In this example the input sentences are not yet tokenized, so spaCy uses its internal tokenizer.

In [55]:
sentences = [
    "This is a sentence."
]
labels = token_labelling.label_batch_token(sentences, tokenized=False, verbose=True)

print(len(labels[0]))
print(labels[0])

Token: This
Starts with space | Capitalized | Is Noun | Is Pronoun | Is Adjective | Is Verb | Is Adverb | Is Preposition | Is Conjunction | Is Interjunction | Is Subject | Is Object | Is Root | Is auxiliary | Is Named Entity
False             | True        | False   | True       | False        | False   | False     | False          | False          | False            | True       | False     | False   | False        | False          
---
Token: is
Starts with space | Capitalized | Is Noun | Is Pronoun | Is Adjective | Is Verb | Is Adverb | Is Preposition | Is Conjunction | Is Interjunction | Is Subject | Is Object | Is Root | Is auxiliary | Is Named Entity
False             | False       | False   | False      | False        | True    | False     | False          | False          | False            | False      | False     | True    | False        | False          
---
Token: a
Starts with space | Capitalized | Is Noun | Is Pronoun | Is Adjective | Is Verb | Is Adverb | Is Preposition 
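
Once a batch is labelled, it can be useful to count how often each label occurs. The sketch below assumes that `labels` is a list with one entry per sentence, each holding one boolean list per token. That matches the way `labels[0]` is used above, but it is an assumption about the return value. The two rows are copied from the verbose output for "This" and "is", and `count_labels` is a hypothetical helper.

```python
from collections import Counter

# Label names in index order, as shown in the verbose output header.
LABEL_NAMES = [
    "Starts with space", "Capitalized", "Is Noun", "Is Pronoun",
    "Is Adjective", "Is Verb", "Is Adverb", "Is Preposition",
    "Is Conjunction", "Is Interjunction", "Is Subject", "Is Object",
    "Is Root", "Is auxiliary", "Is Named Entity",
]


def count_labels(batch):
    """Count True labels over all tokens of all sentences (hypothetical helper)."""
    counts = Counter()
    for sentence in batch:
        for token_label in sentence:
            counts.update(
                name for name, flag in zip(LABEL_NAMES, token_label) if flag
            )
    return counts


# Rows copied from the verbose output above for the tokens "This" and "is".
this_row = [False, True, False, True, False, False, False, False,
            False, False, True, False, False, False, False]
is_row = [False, False, False, False, False, True, False, False,
          False, False, False, False, True, False, False]

# Assumed batch layout: one list of token labels per sentence.
counts = count_labels([[this_row, is_row]])
for name, n in counts.items():
    print(f"{name}: {n}")
```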

Now we use our own tokenization, e.g. the one from our TinyStories models.
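
The pre-tokenized input in the next cell keeps the trailing whitespace on each token. spaCy's `Doc` constructor takes the token texts and the trailing-space flags as two separate lists (`words` and `spaces`). The sketch below shows one way to split the tokens into that shape. Whether `token_labelling` does something like this internally is not shown in this notebook, and `split_trailing_space` is a hypothetical helper.

```python
def split_trailing_space(tokens):
    """Split tokens such as "This " into texts and trailing-space flags.

    Hypothetical helper: returns (words, spaces), the two lists that
    spaCy's Doc(vocab, words=..., spaces=...) constructor accepts.
    """
    words = [tok.rstrip(" ") for tok in tokens]
    spaces = [tok.endswith(" ") for tok in tokens]
    return words, spaces


tokens = ["This ", "is ", "a ", "sentence", "."]
words, spaces = split_trailing_space(tokens)

print(words)   # ['This', 'is', 'a', 'sentence', '.']
print(spaces)  # [True, True, True, False, False]

# Joining the two lists reproduces the original text.
text = "".join(w + (" " if s else "") for w, s in zip(words, spaces))
print(text)    # This is a sentence.
```

A token that consists only of spaces would become an empty string here, so such tokens would need separate handling.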

In [None]:
sentences = [
    ["This ", "is ", "a ", "sentence", "."]
]
labels = token_labelling.label_batch_token(sentences, tokenized=True, verbose=True)

print(len(labels[0]))
print(labels[0])

Token: This 
Starts with space | Capitalized | Is Noun | Is Pronoun | Is Adjective | Is Verb | Is Adverb | Is Preposition | Is Conjunction | Is Interjunction | Is Subject | Is Object | Is Root | Is auxiliary | Is Named Entity
False             | True        | True    | False      | False        | False   | False     | False          | False          | False            | False      | False     | True    | False        | False          
---
Token: is 
Starts with space | Capitalized | Is Noun | Is Pronoun | Is Adjective | Is Verb | Is Adverb | Is Preposition | Is Conjunction | Is Interjunction | Is Subject | Is Object | Is Root | Is auxiliary | Is Named Entity
False             | False       | False   | False      | False        | False   | True      | False          | False          | False            | False      | False     | False   | False        | False          
---
Token: a 
Starts with space | Capitalized | Is Noun | Is Pronoun | Is Adjective | Is Verb | Is Adverb | Is Prepositi
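
The two runs label the first token differently: with spaCy's tokenizer "This" is a pronoun and the subject, while the pre-tokenized "This " comes out as a noun and the root. A minimal sketch for listing such differences follows. Both rows are copied from the outputs in this notebook, and `diff_labels` is a hypothetical helper.

```python
# Label names in index order, as shown in the verbose output header.
LABEL_NAMES = [
    "Starts with space", "Capitalized", "Is Noun", "Is Pronoun",
    "Is Adjective", "Is Verb", "Is Adverb", "Is Preposition",
    "Is Conjunction", "Is Interjunction", "Is Subject", "Is Object",
    "Is Root", "Is auxiliary", "Is Named Entity",
]


def diff_labels(a, b):
    """Return (name, value_in_a, value_in_b) for every label that differs."""
    return [(name, x, y) for name, x, y in zip(LABEL_NAMES, a, b) if x != y]


# "This" as labelled with spaCy's own tokenizer (tokenized=False).
this_spacy = [False, True, False, True, False, False, False, False,
              False, False, True, False, False, False, False]
# "This " as labelled with the pre-tokenized input (tokenized=True).
this_custom = [False, True, True, False, False, False, False, False,
               False, False, False, False, True, False, False]

for name, before, after in diff_labels(this_spacy, this_custom):
    print(f"{name}: {before} -> {after}")
```

For this pair, the labels that differ are Is Noun, Is Pronoun, Is Subject, and Is Root.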