# Average RobertaTokenizerFast word2token ratio

This notebook calculates the average word2token ratio for the `RobertaTokenizerFast` tokenizer. The word2token ratio is the number of tokens the tokenizer produces divided by the number of whitespace-separated words in the input text, so a ratio of 1.0 means every word maps to exactly one token.
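
As a rough illustration of the definition, the sketch below computes the ratio for a single string. It uses a hypothetical stand-in tokenizer (`toy_tokenize`, a regex that splits punctuation from words) rather than RoBERTa's byte-level BPE, so the numbers are illustrative only; `word2token_ratio` is likewise a made-up helper name.

```python
import re

def toy_tokenize(text):
    # Hypothetical stand-in for a subword tokenizer: separates runs of word
    # characters from individual punctuation marks. This is NOT RoBERTa's BPE.
    return re.findall(r"\w+|[^\w\s]", text)

def word2token_ratio(text, tokenize=toy_tokenize):
    # Words are whitespace-separated chunks; tokens come from `tokenize`.
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)

example = "Trump's claims, again, were debunked."
ratio = word2token_ratio(example)
print(f"{ratio:.2f}")  # 10 toy tokens over 5 words -> 2.00
```

Passing a real tokenizer's `tokenize` method as the `tokenize` argument would give the ratio for that tokenizer instead.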

In [28]:
# Example text: a multi-paragraph excerpt from a fact-check article
example_sentence = "IMM-60412\n\nPRIMARY\n\nFOLLOWING TRUMP VOTER FRAUD ALLEGATIONS, CLAIM THAT 5.7 MILLION NONCITIZENS VOTED IS WRONG\n\nThe statement:\n\n\"5.7 million -- that's how many illegal immigrants might have voted\" in 2008.\n\n- Ainsley Earhardt on Tuesday, June 20th, 2017 in the \"Fox and Friends\" show\n\n* * *\n\nThe ruling: FALSE\n\nPresident Donald Trump's unfounded allegations that millions voted illegally in 2016 is back in the news, with his supporters pointing to a new analysis that claims millions of undocumented immigrants voted in 2008.\n\nFox and Friends co-host Ainsley Earhardt talked about it on the morning show recently.\n\n\"5.7 million -- that's how many illegal immigrants might have voted\" in 2008, she said. Her comments referenced an article in the Washington Times, a conservative newspaper.\n\nTrump has made repeated claims about massive voter fraud and election rigging, which we've debunked again and again and again and again and again and again and again (and we debunked a claim by his spokesman Sean Spicer).\n\nThe claim made on Fox and Friends is based on an extrapolation of a controversial study that relied on a very small number of responses. Researchers involved in the underlying survey of voters have cautioned against using their data to reach conclusions about noncitizen voters.\n\n- - -\n\nStudy about the 2008 election\n\nWe emailed a spokeswoman for Fox News and did not get a reply; however, the Washington Times article showed that the information came from Just Facts, a think tank that describes itself as conservative/libertarian and was founded by James D. Agresti, a mechanical engineer in New Jersey.\n\n\n"

In [29]:
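# Drop the trailing triple newline, then turn paragraph breaks into single spaces
example_sentence = example_sentence.replace("\n\n\n", "")
example_sentence = example_sentence.replace("\n\n", " ")

# Replace any literal backslash-escaped apostrophes (\') with plain apostrophes
example_sentence = example_sentence.replace("\\'", "'")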
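
To show what these replacements do, and why the triple-newline case is handled before the double-newline case, here is a small sketch on a made-up string (the `raw` value is illustrative and not taken from the dataset).

```python
# Made-up input with paragraph breaks and a trailing triple newline
raw = "Title\n\nFirst paragraph.\n\nSecond paragraph.\n\n\n"

# Same order as the cell above: triple newlines first, then double newlines
cleaned = raw.replace("\n\n\n", "").replace("\n\n", " ")
print(repr(cleaned))  # 'Title First paragraph. Second paragraph.'

# Reversed order: the double-newline pass consumes part of the triple newline,
# leaving a stray space and newline at the end
swapped = raw.replace("\n\n", " ").replace("\n\n\n", "")
print(repr(swapped))  # 'Title First paragraph. Second paragraph. \n'
```

Note that replacing `"\n\n\n"` with an empty string would glue two paragraphs together if a triple newline appeared in the middle of a text; in the example above it only occurs at the end.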

In [30]:
# Split the text into sentences with NLTK's Punkt tokenizer
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # newer NLTK releases may need 'punkt_tab' instead

sentences = sent_tokenize(example_sentence)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\elias\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [31]:
sentences

['IMM-60412 PRIMARY FOLLOWING TRUMP VOTER FRAUD ALLEGATIONS, CLAIM THAT 5.7 MILLION NONCITIZENS VOTED IS WRONG The statement: "5.7 million -- that\'s how many illegal immigrants might have voted" in 2008.',
 '- Ainsley Earhardt on Tuesday, June 20th, 2017 in the "Fox and Friends" show * * * The ruling: FALSE President Donald Trump\'s unfounded allegations that millions voted illegally in 2016 is back in the news, with his supporters pointing to a new analysis that claims millions of undocumented immigrants voted in 2008.',
 'Fox and Friends co-host Ainsley Earhardt talked about it on the morning show recently.',
 '"5.7 million -- that\'s how many illegal immigrants might have voted" in 2008, she said.',
 'Her comments referenced an article in the Washington Times, a conservative newspaper.',
 "Trump has made repeated claims about massive voter fraud and election rigging, which we've debunked again and again and again and again and again and again and again (and we debunked a claim by h

In [32]:
from transformers import RobertaTokenizerFast
import statistics

# Initialize the RoBERTa tokenizer
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')

# Count the whitespace-separated words in a sentence
def count_words(sentence):
    return len(sentence.split())

# Process sentences
ratios = []
results = []
for sentence in sentences:
    num_words = count_words(sentence)
    # tokenize() returns subword tokens without the special <s> and </s> tokens
    tokens = tokenizer.tokenize(sentence)
    num_tokens = len(tokens)
    ratio = num_tokens / num_words
    ratios.append(ratio)
    results.append({
        'sentence': sentence,
        'num_words': num_words,
        'num_tokens': num_tokens,
        'ratio': ratio
    })

# Calculate the mean and sample standard deviation of the per-sentence ratios
average_ratio = statistics.mean(ratios)
std_deviation = statistics.stdev(ratios)

print(f"Average Token/Word ratio for all sentences: {average_ratio:.2f}")
print(f"Standard Deviation of Token/Word ratio: {std_deviation:.2f}")

Average Token/Word ratio for all sentences: 1.33
Standard Deviation of Token/Word ratio: 0.33
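
The average printed above is an unweighted mean of per-sentence ratios, so a short sentence counts as much as a long one. One alternative is a corpus-level ratio (total tokens over total words). The sketch below contrasts the two; the helper name `corpus_ratio` and the sample counts are made up for illustration and are not the notebook's results.

```python
import statistics

def corpus_ratio(rows):
    # Corpus-level ratio: pool all tokens and all words before dividing.
    total_tokens = sum(r["num_tokens"] for r in rows)
    total_words = sum(r["num_words"] for r in rows)
    return total_tokens / total_words

# Hypothetical counts in the same shape as the entries of `results`
sample = [
    {"num_words": 4, "num_tokens": 8},    # short sentence, ratio 2.0
    {"num_words": 16, "num_tokens": 16},  # long sentence, ratio 1.0
]

per_sentence_mean = statistics.mean([r["num_tokens"] / r["num_words"] for r in sample])
pooled = corpus_ratio(sample)

print(f"Mean of per-sentence ratios: {per_sentence_mean:.2f}")  # 1.50
print(f"Corpus-level ratio: {pooled:.2f}")                      # 1.20
```

Which figure is more useful depends on the goal: the pooled ratio is likely the better estimate when budgeting tokens for whole documents, while the per-sentence mean describes a typical sentence.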
