# Average RobertaTokenizerFast word2token ratio

Calculate the average word2token ratio for `RobertaTokenizerFast`. The word2token ratio is the number of tokens the tokenizer produces divided by the number of whitespace-separated words in the input text, i.e. tokens per word. It is computed per sentence and then averaged.
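As a quick check of the definition, the sketch below works through the arithmetic using the word and token counts that this notebook reports for its first five sentences. The helper name `tokens_per_word` and the `counts` list are illustrative assumptions, not part of the notebook's own code, and no tokenizer is loaded here.

```python
import statistics

# Hypothetical helper (not from the notebook): tokens per whitespace-separated word.
def tokens_per_word(num_tokens: int, num_words: int) -> float:
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    return num_tokens / num_words

# (num_words, num_tokens) pairs copied from the first five sentences
# in the notebook's printed output.
counts = [(18, 31), (29, 35), (4, 6), (5, 6), (3, 6)]

# Per-sentence ratios, then their unweighted mean.
ratios = [tokens_per_word(t, w) for w, t in counts]
mean_ratio = statistics.mean(ratios)

# For comparison: one pooled ratio over all tokens and all words.
pooled_ratio = sum(t for _, t in counts) / sum(w for w, _ in counts)

print([f"{r:.2f}" for r in ratios])
print(f"mean of per-sentence ratios: {mean_ratio:.2f}")
print(f"pooled ratio: {pooled_ratio:.2f}")
```

On these five sentences the mean of the per-sentence ratios is about 1.53, while the pooled ratio is about 1.42. The gap comes from short sentences such as `"Three," she said.`, which count as much as long ones in the unweighted mean. Which of the two is more appropriate depends on how the ratio will be used.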

In [53]:
# Example text (a short news article containing several sentences)
example_sentence = "IMM-61768\n\nPRIMARY\n\nImmig case tot in dock\n\nTHE DEFENDANT toddled into the immigration courtroom, clutching his grandma's hand. Ian Eduardo didn't cry as she lifted him up and placed him on a chair to face Judge Patricia Rowan on the 12th floor of Manhattan's 26 Federal Plaza.\n\n\"How old is he?\" the judge asked his grandma.\n\n\"Three,\" she said.\n\nThe little boy was one of dozens of young defendants who saw Rowan during a juvenile docket day - a separate court calendar set aside for kids - at New York Immigration Court this month.\n\nChildren who are caught trying to cross the border alone are usually shuttled to foster homes or to stay with relatives here in the U.S. Many end up with family in New York, but must face a judge to fight deportation. Some gain asylum or a special green card for abused or abandoned kids.\n\nMexican-born Ian is in deportation proceedings after being caught crossing the border at Naco, Ariz. His mom decided to go by foot but sent Ian with two smugglers in a car. His mom made it through undetected and is hoping the government will allow Ian to stay with the rest of the family in Queens. The judge agreed to put off the little boy's case until September.\n\nOn the day that Ian was in court, volunteer lawyers and law students gave intake interviews to everyone looking for free representation.\n\n\n"

In [54]:
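# Remove triple newlines first (here, only the trailing one),
# then collapse paragraph breaks into single spaces
example_sentence = example_sentence.replace("\n\n\n", "")
example_sentence = example_sentence.replace("\n\n", " ")

# Unescape apostrophes: \' -> ' (has no effect if the text contains none)
example_sentence = example_sentence.replace("\\'", "'")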

In [55]:
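# Split the text into sentences using NLTK
import nltk
from nltk.tokenize import sent_tokenize

# Download the Punkt sentence tokenizer data
# (newer NLTK releases may require 'punkt_tab' instead)
nltk.download('punkt')

sentences = sent_tokenize(example_sentence)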

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\elias\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [56]:
sentences

["IMM-61768 PRIMARY Immig case tot in dock THE DEFENDANT toddled into the immigration courtroom, clutching his grandma's hand.",
 "Ian Eduardo didn't cry as she lifted him up and placed him on a chair to face Judge Patricia Rowan on the 12th floor of Manhattan's 26 Federal Plaza.",
 '"How old is he?"',
 'the judge asked his grandma.',
 '"Three," she said.',
 'The little boy was one of dozens of young defendants who saw Rowan during a juvenile docket day - a separate court calendar set aside for kids - at New York Immigration Court this month.',
 'Children who are caught trying to cross the border alone are usually shuttled to foster homes or to stay with relatives here in the U.S.',
 'Many end up with family in New York, but must face a judge to fight deportation.',
 'Some gain asylum or a special green card for abused or abandoned kids.',
 'Mexican-born Ian is in deportation proceedings after being caught crossing the border at Naco, Ariz. His mom decided to go by foot but sent Ian wi

In [57]:
from transformers import RobertaTokenizerFast
import statistics

# Initialize the RoBERTa tokenizer
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')

# Count whitespace-separated words in a sentence
def count_words(sentence):
    return len(sentence.split())

# Compute the token/word ratio for each sentence
ratios = []
results = []
for sentence in sentences:
    num_words = count_words(sentence)
    if num_words == 0:
        continue  # skip empty sentences to avoid dividing by zero
    # truncation=True caps the input at the tokenizer's maximum length,
    # so a very long sentence would be undercounted
    tokens = tokenizer.tokenize(sentence, add_special_tokens=False, truncation=True)
    num_tokens = len(tokens)
    ratio = num_tokens / num_words
    ratios.append(ratio)
    results.append({
        'sentence': sentence,
        'num_words': num_words,
        'num_tokens': num_tokens,
        'ratio': ratio
    })

# Calculate the mean and sample standard deviation of the per-sentence ratios
average_ratio = statistics.mean(ratios)
std_deviation = statistics.stdev(ratios)  # requires at least two ratios

# Display the results
for result in results:
    print(f"Sentence: {result['sentence']}")
    print(f"Number of words: {result['num_words']}")
    print(f"Number of tokens: {result['num_tokens']}")
    print(f"Token/Word ratio: {result['ratio']:.2f}")
    print("-" * 40)

print(f"Average Token/Word ratio for all sentences: {average_ratio:.2f}")
print(f"Standard Deviation of Token/Word ratio: {std_deviation:.2f}")

Sentence: IMM-61768 PRIMARY Immig case tot in dock THE DEFENDANT toddled into the immigration courtroom, clutching his grandma's hand.
Number of words: 18
Number of tokens: 31
Token/Word ratio: 1.72
----------------------------------------
Sentence: Ian Eduardo didn't cry as she lifted him up and placed him on a chair to face Judge Patricia Rowan on the 12th floor of Manhattan's 26 Federal Plaza.
Number of words: 29
Number of tokens: 35
Token/Word ratio: 1.21
----------------------------------------
Sentence: "How old is he?"
Number of words: 4
Number of tokens: 6
Token/Word ratio: 1.50
----------------------------------------
Sentence: the judge asked his grandma.
Number of words: 5
Number of tokens: 6
Token/Word ratio: 1.20
----------------------------------------
Sentence: "Three," she said.
Number of words: 3
Number of tokens: 6
Token/Word ratio: 2.00
----------------------------------------
Sentence: The little boy was one of dozens of young defendants who saw Rowan during a juven