# Text Generation: Understanding the Role of Probabilities

Text generation is a fascinating application of machine learning that powers everything from chatbots to creative writing assistants. At its core, modern text generation relies on probabilistic modeling—a mathematical approach that treats language as a series of statistical predictions.

## The Foundation: Probability Distributions

When a language model generates text, it's essentially answering the question: "Given the text so far, what word is likely to come next?" The model maintains a probability distribution across its entire vocabulary, assigning higher probabilities to words that make more sense in the current context.

For example, if the partial sentence is "The chef cooked a delicious," the model might assign:
- "meal" → 22% probability
- "dish" → 18% probability 
- "steak" → 12% probability
- thousands of other words → smaller probabilities
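
As a minimal sketch, the distribution above can be held in a plain dictionary and sampled from. The three probabilities come from the example; the `<other>` bucket that lumps together the rest of the vocabulary is an assumption made only so the numbers sum to 1.

```python
import random

# Probabilities from the example above; "<other>" is a hypothetical bucket
# standing in for the thousands of remaining vocabulary words.
next_word_probs = {"meal": 0.22, "dish": 0.18, "steak": 0.12}
next_word_probs["<other>"] = 1.0 - sum(next_word_probs.values())

# A valid probability distribution sums to 1.
total = sum(next_word_probs.values())

# Draw one next word in proportion to its probability.
rng = random.Random(0)  # fixed seed so the draw is repeatable
words, weights = zip(*next_word_probs.items())
sampled = rng.choices(words, weights=weights, k=1)[0]
print(sampled)
```

Note that even the most likely word, "meal", is chosen only about one time in five under this distribution, which is why sampled text varies from run to run.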

## The Generation Process

Text generation typically works through these steps:

1. **Conditioning**: The model processes the input text (the "prompt" or context)
2. **Prediction**: It calculates probability scores for each possible next token
3. **Sampling**: It selects the next token based on these probabilities
4. **Iteration**: The process repeats with the newly expanded text
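
The four steps can be sketched as a short loop. The function `toy_next_token_probs` and its lookup table are hypothetical stand-ins for a real model, and the `<end>` marker is an assumed stop token; only the loop structure is the point here.

```python
import random

def toy_next_token_probs(context):
    """Hypothetical stand-in for a model: returns {token: probability}."""
    table = {
        "the": {"chef": 0.6, "meal": 0.4},
        "chef": {"cooked": 1.0},
        "cooked": {"the": 0.5, "a": 0.5},
        "a": {"meal": 1.0},
        "meal": {"<end>": 1.0},
    }
    # This toy model only looks at the last token of the context.
    return table.get(context[-1], {"<end>": 1.0})

def generate(prompt, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)                                    # 1. conditioning
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)                 # 2. prediction
        choices, weights = zip(*probs.items())
        nxt = rng.choices(choices, weights=weights, k=1)[0]  # 3. sampling
        if nxt == "<end>":
            break
        tokens.append(nxt)                                   # 4. iteration
    return tokens

print(" ".join(generate(["the"])))
```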

## Sampling Methods

How a model selects the next word significantly affects the output:

- **Greedy decoding**: Always choose the highest-probability token
- **Temperature sampling**: Rescale the distribution before sampling to make choices more deterministic or more varied
- **Top-k sampling**: Sample only from the k most likely tokens
- **Top-p (nucleus) sampling**: Sample only from the smallest set of most likely tokens whose cumulative probability reaches the threshold p
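
A rough sketch of three of these strategies, operating on a `{token: probability}` dictionary. The function names and the example distribution are illustrative assumptions, not part of any particular library.

```python
def greedy(probs):
    """Return the single most likely token."""
    return max(probs, key=probs.get)

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize them to sum to 1."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Hypothetical next-token distribution
probs = {"meal": 0.5, "dish": 0.3, "steak": 0.15, "rock": 0.05}
print(greedy(probs))
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.9))
```

After filtering, the next token would be drawn from the renormalized dictionary, for example with `random.choices`.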

## Temperature: Controlling Randomness

Temperature is a hyperparameter that controls how "risky" the model's choices are:

- **Low temperature** (0.1-0.5): More predictable, focused, repetitive outputs
- **Medium temperature** (0.6-0.9): Balanced between coherence and creativity
- **High temperature** (1.0+): More surprising, diverse, and sometimes chaotic outputs; at exactly 1.0 the model's distribution is sampled unchanged, and values above 1.0 flatten it
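
One common way to apply temperature, sketched here under the assumption that we start from probabilities rather than raw model scores, is to divide the log-probabilities by the temperature and renormalize. The function name `apply_temperature` and the example distribution are hypothetical.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a {token: probability} distribution by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Dividing log-probabilities by T sharpens (T < 1) or flattens (T > 1) the distribution.
    scaled = {tok: math.log(p) / temperature for tok, p in probs.items() if p > 0}
    # Subtract the maximum before exponentiating for numerical stability.
    max_scaled = max(scaled.values())
    exps = {tok: math.exp(s - max_scaled) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = {"meal": 0.5, "dish": 0.3, "steak": 0.2}
cold = apply_temperature(probs, 0.5)       # sharper: the top token gains probability
unchanged = apply_temperature(probs, 1.0)  # same as the input distribution
hot = apply_temperature(probs, 2.0)        # flatter: probabilities move toward uniform
print(cold, unchanged, hot, sep="\n")
```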

By understanding and adjusting these probabilistic mechanisms, developers can fine-tune text generation systems to produce content that strikes the right balance between predictability and creativity for their specific applications.

In [1]:
import json
import random
import re
from collections import defaultdict, Counter
import PyPDF2
import time

### This function extracts text from a PDF file and performs basic text cleaning

In [2]:
def extract_text_from_pdf(pdf_path):
    """Extract text from a PDF file and collapse runs of whitespace."""
    pages = []
    try:
        with open(pdf_path, "rb") as file:
            pdf_reader = PyPDF2.PdfReader(file)
            for page in pdf_reader.pages:
                # Guard against a page that yields no extractable text
                pages.append(page.extract_text() or "")
    except Exception as e:
        print(f"Error extracting text from PDF: {e}")
        return None

    # Clean the text: collapse any run of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r'\s+', ' ', " ".join(pages))
    return text


In [3]:
text = extract_text_from_pdf('pdfs/ClarionLanguageProgramming.pdf')

In [4]:
text




# Text Tokenization: Breaking Text into Meaningful Units

When working with natural language processing (NLP), one of the fundamental preprocessing steps is tokenization. Tokenization is the process of converting raw text into smaller, more manageable units called tokens. These tokens form the basis for further text analysis.

## Why Tokenize Text?

Tokenization serves several key purposes:

1. **Text Standardization** - Breaking text into consistent units
2. **Feature Extraction** - Preparing text for statistical analysis
3. **Vocabulary Building** - Creating a dictionary of terms for analysis
4. **Enabling Count-Based Methods** - Supporting techniques like TF-IDF or bag-of-words

## Simple Word Tokenization

The most common form of tokenization splits text into individual words. While there are sophisticated tokenization techniques available through libraries like NLTK, spaCy, or transformers, sometimes a simple approach using regular expressions is sufficient for basic tasks.

The function in the next code cell tokenizes text into words using a single regular expression.

With this function, you can convert sentences or paragraphs into a list of words that can be counted, analyzed, or used as input for more advanced NLP models.
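
To see what this kind of regex tokenization produces, here is a small sketch that applies the same pattern as the notebook's `tokenize` function to a made-up sentence (the sentence is an assumption for illustration).

```python
import re

def tokenize(text):
    """Lowercase the text and return every run of word characters."""
    return re.findall(r"\b\w+\b", text.lower())

sample = "Clarion's FILE structure isn't hard: it has 2 parts."
print(tokenize(sample))
```

Because apostrophes are not word characters, "Clarion's" becomes `clarion` and `s`, and "isn't" becomes `isn` and `t`. Digits are kept as tokens. This is one reason stray single-letter tokens such as `s` show up in the output for the PDF.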

In [5]:
def tokenize(text):
    """Tokenize text into words."""
    # Simple tokenization: lowercase, then extract runs of word characters
    # (letters, digits, underscore); punctuation and whitespace are dropped
    words = re.findall(r'\b\w+\b', text.lower())
    return words


In [6]:
tokens = tokenize(text)

In [7]:
tokens


['clarion',
 'language',
 'programming',
 'guide',
 '2',
 'copyright',
 'softvelocity',
 'inc',
 'all',
 'rights',
 'reserved',
 'this',
 'publication',
 'is',
 'protected',
 'by',
 'copyright',
 'and',
 'all',
 'rights',
 'are',
 'reserved',
 'by',
 'softvelocity',
 'incorporated',
 'it',
 'may',
 'not',
 'in',
 'whole',
 'or',
 'part',
 'be',
 'copied',
 'photocopied',
 'reproduced',
 'translated',
 'or',
 'reduced',
 'to',
 'any',
 'electronic',
 'medium',
 'or',
 'machine',
 'readable',
 'form',
 'without',
 'prior',
 'consent',
 'in',
 'writing',
 'from',
 'softvelocity',
 'incorporated',
 'this',
 'publication',
 'supports',
 'clarion',
 'it',
 'is',
 'possible',
 'that',
 'it',
 'may',
 'contain',
 'technical',
 'or',
 'typographical',
 'errors',
 'softvelocity',
 'incorporated',
 'provides',
 'thi',
 's',
 'publication',
 'as',
 'is',
 'without',
 'warranty',
 'of',
 'any',
 'kind',
 'either',
 'expressed',
 'or',
 'implied',
 'www',
 'softvelocity',
 'com',
 'trademark',
 'ack

# Building a Simple Language Model with Unigram Probabilities

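After tokenizing our text, the next step in creating a basic language model is to analyze word sequences and their probabilities. One of the simplest approaches is to count, for each word in the corpus, which words follow it and how often. This notebook calls the result a unigram model because the context is a single word.

## What is a Unigram Model?

A note on naming: in standard terminology, a true unigram model ignores context entirely, and a model that conditions on one preceding word is a bigram model. The model built here is therefore a bigram model, stored in variables named `unigram_*` after its one-word context. It captures the probability of one word following another. By analyzing these transition probabilities, we can: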

1. Predict likely next words in a sequence
2. Generate text that follows similar patterns to the training data
3. Evaluate the likelihood of specific word sequences
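
As a sketch of the third point, the likelihood of a word sequence under such a model is the product of its transition probabilities, usually computed as a sum of logarithms to avoid underflow. The table `transition_probs` and the function `sequence_log_prob` are hypothetical, and the probability of the first word itself is ignored here.

```python
import math

# Hypothetical transition table: context word -> {next word: probability}
transition_probs = {
    "the": {"cat": 2 / 3, "mat": 1 / 3},
    "cat": {"sat": 0.5, "ran": 0.5},
}

def sequence_log_prob(words, probs):
    """Sum log P(next | current) over adjacent pairs; -inf if any pair was never seen."""
    total = 0.0
    for current_word, next_word in zip(words, words[1:]):
        p = probs.get(current_word, {}).get(next_word, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

print(math.exp(sequence_log_prob(["the", "cat", "sat"], transition_probs)))
print(sequence_log_prob(["the", "dog"], transition_probs))
```

An unseen pair drives the whole sequence probability to zero, which is the sparsity problem that smoothing and backoff techniques are meant to soften.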

## Counting Word Transitions

The first step is to count every instance where one word follows another in our corpus.

The code in the cell below creates a statistical model of word transitions by:
- Counting how often each word appears after every other word
- Converting these counts into probabilities by dividing by the total occurrences
- Storing these probabilities in a structured format
- Saving the model to a JSON file for later use

With this probability distribution, we can now predict likely next words or generate text that follows similar patterns to our original corpus.
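
To make the counting and normalization concrete, here is the same idea on a nine-word toy corpus (the corpus is made up for illustration; the variable names differ from the notebook's).

```python
from collections import Counter, defaultdict

toy_tokens = "the cat sat on the mat the cat ran".split()

# Count how often each word is followed by each other word.
counts = defaultdict(Counter)
for current_word, next_word in zip(toy_tokens, toy_tokens[1:]):
    counts[current_word][next_word] += 1

# Turn each row of counts into probabilities that sum to 1.
probs = {
    word: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
    for word, followers in counts.items()
}

print(counts["the"])
print(probs["the"])
```

Here "the" is followed by "cat" twice and "mat" once, so the model assigns them probabilities 2/3 and 1/3. The final word "ran" never appears as a context, so it has no entry at all, a case any generation loop has to handle.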

In [8]:
unigram_counts = defaultdict(Counter)
unigram_probs = {}

# Count how often each word is immediately followed by each other word
for i in range(len(tokens) - 1):
    current_word = tokens[i]
    next_word = tokens[i + 1]
    unigram_counts[current_word][next_word] += 1

# Normalize each word's follower counts into probabilities
for sequence, next_words in unigram_counts.items():
    total_occurrences = sum(next_words.values())
    sequence_probs = {word: count / total_occurrences for word, count in next_words.items()}
    unigram_probs[sequence] = sequence_probs

# Save the model for later use
with open('unigram_probs.json', 'w') as f:
    json.dump(unigram_probs, f, indent=2)


In [9]:
unigram_counts

defaultdict(collections.Counter,
            {'clarion': Counter({'language': 63,
                      's': 10,
                      'key': 10,
                      'pre': 7,
                      'programmer': 4,
                      'and': 3,
                      'program': 3,
                      'since': 3,
                      'the': 3,
                      'file': 3,
                      'is': 2,
                      'class': 2,
                      'supports': 2,
                      'by': 2,
                      'keys': 2,
                      'to': 2,
                      'syntax': 2,
                      'data': 2,
                      'create': 2,
                      'equivalent': 2,
                      'errorcode': 2,
                      'it': 1,
                      'objects': 1,
                      '42': 1,
                      '69': 1,
                      'version': 1,
                      'documentation': 1,
                      'we': 1,
 

In [10]:
unigram_probs

{'clarion': {'language': 0.35195530726256985,
  'it': 0.00558659217877095,
  'is': 0.0111731843575419,
  'objects': 0.00558659217877095,
  's': 0.055865921787709494,
  '42': 0.00558659217877095,
  'and': 0.01675977653631285,
  '69': 0.00558659217877095,
  'program': 0.01675977653631285,
  'version': 0.00558659217877095,
  'documentation': 0.00558659217877095,
  'we': 0.00558659217877095,
  'class': 0.0111731843575419,
  'this': 0.00558659217877095,
  'terms': 0.00558659217877095,
  'keyword': 0.00558659217877095,
  'fully': 0.00558659217877095,
  'all': 0.00558659217877095,
  'since': 0.01675977653631285,
  'but': 0.00558659217877095,
  'the': 0.01675977653631285,
  'if': 0.00558659217877095,
  'has': 0.00558659217877095,
  'for': 0.00558659217877095,
  'going': 0.00558659217877095,
  'as': 0.00558659217877095,
  'ar': 0.00558659217877095,
  'you': 0.00558659217877095,
  'programmers': 0.00558659217877095,
  'supports': 0.0111731843575419,
  'open': 0.00558659217877095,
  'by': 0.01117

In [15]:
import json
with open('unigram_probs.json', 'r') as f:
    unigram_probs = json.load(f)    
    
unigram_probs


# Extending to Bigram and Trigram Models: Capturing Longer Context

Building on our previous one-word-context model (called unigram in this notebook, bigram in standard terminology), we can enhance our language model by incorporating longer contexts. The notebook's bigram model uses two consecutive words to predict the third, and its trigram model uses three consecutive words to predict the fourth, providing more contextual awareness than simpler models. In standard terminology these would be called trigram and 4-gram models.

## From Unigrams to Trigrams

While our previous model captured the relationship between pairs of words, these models capture the relationship between a two-word or three-word context and the word that follows it. This provides several advantages:

1. More natural and coherent text generation
2. Better prediction accuracy with longer context
3. Ability to capture more complex language patterns

## Implementing a Bigram and Trigram Model

Similar to our unigram approach, we'll count sequences and calculate probabilities.

The code in the next two cells follows the same pattern as our unigram model but uses two-word sequences (bigrams) and three-word sequences (trigrams) as the context. The resulting probability distributions capture how likely each word is to follow a specific two-word or three-word sequence in our corpus.

By comparing these with our previous unigram model, we can see how increasing the context length affects the model's ability to capture language patterns. This approach forms the foundation of n-gram language models. Note that in standard usage n counts the context words plus the predicted word, so a model with a context of n - 1 words is called an n-gram model.
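
Since the one-, two-, and three-word models differ only in context length, the pattern can be written once. The sketch below is one possible generalization, not the notebook's own code; `build_context_model` is a hypothetical helper and the toy corpus is made up.

```python
from collections import Counter, defaultdict

def build_context_model(tokens, context_len):
    """Map each space-joined context of `context_len` words to next-word probabilities."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_len):
        context = " ".join(tokens[i:i + context_len])
        counts[context][tokens[i + context_len]] += 1
    return {
        context: {w: c / sum(followers.values()) for w, c in followers.items()}
        for context, followers in counts.items()
    }

toy_tokens = "the cat sat on the mat the cat ran".split()
one_word = build_context_model(toy_tokens, 1)   # the notebook's "unigram" model
two_word = build_context_model(toy_tokens, 2)   # the notebook's "bigram" model

print(one_word["the"])
print(two_word["the cat"])
```

Longer contexts give sharper predictions but are seen less often: in this toy corpus most two-word contexts have a single observed follower, and on a real corpus many contexts met during generation will be missing from the table altogether.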

In [16]:
bigram_counts = defaultdict(Counter)
bigram_probs = {}

for i in range(len(tokens) - 2):
    current_bigram = tokens[i]+' '+tokens[i + 1]
    next_word = tokens[i + 2]
    bigram_counts[current_bigram][next_word] += 1
    
for sequence, next_words in bigram_counts.items():
    total_occurrences = sum(next_words.values())
    sequence_probs = {word: count / total_occurrences for word, count in next_words.items()}
    bigram_probs[sequence] = sequence_probs
    
with open('bigram_probs.json', 'w') as f:
    json.dump(bigram_probs, f, indent=2)

In [17]:
trigram_counts = defaultdict(Counter)
trigram_probs = {}

for i in range(len(tokens) - 3):
    current_trigram = tokens[i]+' '+tokens[i + 1]+' '+tokens[i + 2]
    next_word = tokens[i + 3]
    trigram_counts[current_trigram][next_word] += 1

for sequence, next_words in trigram_counts.items():
    total_occurrences = sum(next_words.values())
    sequence_probs = {word: count / total_occurrences for word, count in next_words.items()}
    trigram_probs[sequence] = sequence_probs

with open('trigram_probs.json', 'w') as f:
    json.dump(trigram_probs, f, indent=2)


# Generating Text with N-gram Models

After building our probability models (unigram, bigram and trigram), we can use them to generate new text. The process involves selecting a starting point and then using our probability distributions to select each subsequent word.


## Initializing the Generation Process

The first step in text generation is selecting a starting point. We can either choose a specific starting word or randomly select one from our model's vocabulary.


The code in the cell below:
- Sets up the random number generator (either with a fixed seed for reproducibility or randomly)
- Selects a random starting word from the words we've seen in our training data
- Prints this starting word, which will be the beginning of our generated text

From here, we can continue the generation process by repeatedly sampling from our probability distributions to create a chain of words that follows the statistical patterns of our original corpus.
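
A minimal sketch of both operations, picking a start word and then sampling one follower, on a made-up miniature model. The helpers `pick_start` and `sample_next` are hypothetical names; using a private `random.Random` instance instead of the module-level generator is a design choice made here so that seeding does not affect other code.

```python
import random

# Hypothetical miniature model: context word -> {next word: probability}
toy_probs = {
    "the": {"cat": 0.75, "mat": 0.25},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
}

def pick_start(model, seed=None):
    """Pick a starting word uniformly from the model's context words."""
    rng = random.Random(seed)  # seed=None draws from system randomness
    return rng.choice(sorted(model))  # sorted so a given seed always maps to the same word

def sample_next(model, word, rng):
    """Draw one follower of `word` in proportion to its probability."""
    followers, weights = zip(*model[word].items())
    return rng.choices(followers, weights=weights, k=1)[0]

start = pick_start(toy_probs, seed=42)
rng = random.Random(42)
print(start, sample_next(toy_probs, start, rng))
```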

In [20]:
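seed = None  # set to an integer for reproducible output
# random.seed(None) seeds from system time or OS randomness, so one call covers
# both cases, and a seed of 0 is no longer mistaken for "no seed"
random.seed(seed)

current = random.choice(list(unigram_probs.keys()))
print(current)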


speaking


# Backoff Model for Text Generation

After selecting our starting word, we need to continue the generation process by predicting subsequent words. For more robust text generation, we can implement a backoff model that prioritizes longer contexts when available but "backs off" to shorter contexts when necessary.

## Implementing a Backoff Strategy

The backoff approach tries to use the most specific model first (trigram), then falls back to less specific models (bigram, then unigram) when needed.

The code in the cell below:

1. Sets a target length for our generated text (100 words)
2. Initializes our text with the starting word
3. For each subsequent word:
   - First tries to use the trigram model (3-word context)
   - If that fails, backs off to the bigram model (2-word context)
   - If that fails too, uses the unigram model (1-word context)
4. Samples the next word according to the appropriate probability distribution
5. Adds the sampled word to our growing text
6. Finally prints the complete generated text

This backoff strategy makes our text generation more robust, seamlessly handling cases where specific longer contexts haven't been seen in our training data. The result is generated text that better captures the statistical patterns of natural language.
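
The backoff lookup can also be isolated in a small function, which makes the fallback order explicit and easy to test. This is a sketch under assumptions: `backoff_distribution`, `generate`, and the three toy tables are hypothetical, and stopping when no context matches is just one possible policy.

```python
import random

def backoff_distribution(history, models):
    """Return the next-word distribution from the longest matching context.

    `models` is a list of (context_len, table) pairs, longest context first.
    Returns None when no table contains a matching context.
    """
    for context_len, table in models:
        if len(history) >= context_len:
            key = " ".join(history[-context_len:])
            if key in table:
                return table[key]
    return None

def generate(start, models, num_words, seed=None):
    rng = random.Random(seed)
    text = [start]
    while len(text) < num_words:
        dist = backoff_distribution(text, models)
        if dist is None:  # dead end: no model has seen this context
            break
        words, weights = zip(*dist.items())
        text.append(rng.choices(words, weights=weights, k=1)[0])
    return text

# Toy tables with deterministic followers so the trace is easy to follow
three = {"the cat sat": {"on": 1.0}}
two = {"the cat": {"sat": 1.0}, "sat on": {"the": 1.0}}
one = {"the": {"cat": 1.0}, "on": {"the": 1.0}}
models = [(3, three), (2, two), (1, one)]

out = generate("the", models, 8)
print(" ".join(out))
```

In this trace the one-word table supplies "cat", the two-word table supplies "sat", the three-word table supplies "on", and the loop keeps falling back and forth between tables as contexts appear or go missing.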

In [21]:
num_words = 100
generated_text = [current]
print(generated_text)
for _ in range(num_words - 1):
    # Back off from the longest context to the shortest one seen in training
    if len(generated_text) >= 3 and ' '.join(generated_text[-3:]) in trigram_probs:
        words_list, probs = zip(*trigram_probs[' '.join(generated_text[-3:])].items())
    elif len(generated_text) >= 2 and ' '.join(generated_text[-2:]) in bigram_probs:
        words_list, probs = zip(*bigram_probs[' '.join(generated_text[-2:])].items())
    elif generated_text[-1] in unigram_probs:
        words_list, probs = zip(*unigram_probs[generated_text[-1]].items())
    else:
        # No model has seen this context; stop rather than reuse a stale distribution
        break
    current = random.choices(words_list, weights=probs, k=1)[0]
    generated_text.append(current)
    print(current, end=' ', flush=True)
    time.sleep(0.1)  # pause so the text appears word by word


['speaking']
records are put into data files at the end of the file these two values are added together to create the dos access code for the file file within the class then derive from those abstract classes the specific classes that fully describe the set of individual objects in the problem one m ajor benefit of object oriented programming so powerful so let s start from a point that we all should be familiar with procedural code procedural code re visited we all know that a procedure has a data declaration section may also contain its own data declarations 