## Assignment 4

The 1980s saw a shift in Natural Language Processing from techniques that codify the grammatical rules of natural language towards techniques that use statistical models to generate text. One early idea is the Markov chain. It technically isn’t “AI”, since it only “memorizes” the training data, yet it introduces us to the power of statistical text generation. Write a Python function generate(filename: str, start_words: list[str], chain_length: int, num_generated: int) -> str which takes a filename, a list of start words that has to be exactly as long as chain_length (why?), a chain length, and an integer num_generated, and returns a sentence num_generated words long that sounds similar to the text contained in filename.

In [20]:
import random

def generate_markov_text(corpus, order=1, length=50):
    """
    Generate text using Markov chains.

    Args:
    - corpus (str or list): The input text corpus.
    - order (int): The order of the Markov chain (default is 1).
    - length (int): The length of the generated text (default is 50).

    Returns:
    - str: The generated text.
    """
    # If the input corpus is a string, split it into a list of words
    if isinstance(corpus, str):
        corpus = corpus.split()

    # Map each state (a tuple of `order` words) to the list of words that follow it
    markov_dict = {}

    # Construct the Markov chain
    for i in range(len(corpus) - order):
        current_state = tuple(corpus[i:i + order])
        next_state = corpus[i + order]
        if current_state not in markov_dict:
            markov_dict[current_state] = []
        markov_dict[current_state].append(next_state)

    # Generate text
    current_state = random.choice(list(markov_dict.keys()))
    generated_text = list(current_state)

    while len(generated_text) < length:
        if current_state in markov_dict:
            next_word = random.choice(markov_dict[current_state])
            generated_text.append(next_word)
            current_state = tuple(generated_text[-order:])
        else:
            break

    return ' '.join(generated_text)
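
The successor lists double as an empirical distribution: a word that follows a state more often appears more often in its list, so `random.choice` picks it proportionally more often. Here is a minimal sketch of that idea, assuming a made-up toy corpus and a hypothetical helper `transition_probabilities` (neither is part of the assignment):

```python
from collections import Counter, defaultdict

def transition_probabilities(words, order=1):
    """Return {state: {next_word: probability}} estimated from raw counts."""
    counts = defaultdict(Counter)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        counts[state][words[i + order]] += 1
    # Normalize each state's counts so they sum to 1
    return {
        state: {w: c / sum(followers.values()) for w, c in followers.items()}
        for state, followers in counts.items()
    }

# Toy corpus, chosen only for illustration
toy = "a b a b a c".split()
probs = transition_probabilities(toy, order=1)
print(probs[("a",)])  # "b" follows "a" twice, "c" once
```

Storing the raw list, as the function above does, gives the same sampling behaviour without ever computing the probabilities explicitly.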


In [21]:
# Example usage:
input_corpus = "the cat sat on the mat"
generated_text = generate_markov_text(input_corpus, order=1, length=20)
print("Generated Text:", generated_text)

Generated Text: the mat
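
The run above stops after two words because `mat` appears only as the last word of the corpus, so it never becomes a key in the table. The sketch below rebuilds the same order-1 table and lists such dead ends; `build_table` is a hypothetical helper name that mirrors the loop in `generate_markov_text`:

```python
def build_table(words, order=1):
    """Map each tuple of `order` words to the words observed right after it."""
    table = {}
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        table.setdefault(state, []).append(words[i + order])
    return table

words = "the cat sat on the mat".split()
table = build_table(words)
for state, followers in table.items():
    print(state, "->", followers)

# Words that can be emitted but have no outgoing transition of their own
dead_ends = {w for followers in table.values() for w in followers if (w,) not in table}
print("dead ends:", dead_ends)
```

A larger corpus makes dead ends rarer, but the final state of any corpus can still be one, which is why the generation loop needs its `break`.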


In [22]:
import random

def generate(filename: str, start_words: list[str], chain_length: int, num_generated: int) -> str:
    """
    Generate a sentence similar to the text contained in the file using Markov chains.

    Args:
    - filename (str): The name of the file containing the input text corpus.
    - start_words (list[str]): A list of start words exactly as long as the chain_length.
    - chain_length (int): The order of the Markov chain.
    - num_generated (int): The number of words in the generated sentence.

    Returns:
    - str: The generated sentence.
    """
    # Read the content of the file
    with open(filename, 'r', encoding='utf-8') as file:
        corpus = file.read().split()

    # Map each state (a tuple of chain_length words) to the list of words that follow it
    markov_dict = {}

    # Construct the Markov chain
    for i in range(len(corpus) - chain_length):
        current_state = tuple(corpus[i:i + chain_length])
        next_state = corpus[i + chain_length]
        if current_state not in markov_dict:
            markov_dict[current_state] = []
        markov_dict[current_state].append(next_state)

    # Generate text
    current_state = tuple(start_words)
    generated_sentence = list(current_state)

    while len(generated_sentence) < num_generated:
        if current_state in markov_dict:
            next_word = random.choice(markov_dict[current_state])
            generated_sentence.append(next_word)
            current_state = tuple(generated_sentence[-chain_length:])
        else:
            break

    return ' '.join(generated_sentence)


In [23]:
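# Example usage:
# Note: start_words has three words but chain_length is 2. Every key in the
# table is a 2-word tuple, so this 3-word state is never found and the
# function returns the start words unchanged.
generated_sentence = generate("input.txt", start_words=["The", "cat", "sat"], chain_length=2, num_generated=10)
print("Generated Sentence:", generated_sentence)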

Generated Sentence: The cat sat
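
A mismatch between `len(start_words)` and `chain_length` does not raise an error in `generate`. The first lookup simply fails and the start words come back unchanged, as in the output above. Below is a minimal sketch of a length check, working on an in-memory word list rather than a file. The name `generate_checked`, its `words` argument, and its `seed` parameter are hypothetical additions, not part of the assignment's signature:

```python
import random

def generate_checked(words, start_words, chain_length, num_generated, seed=None):
    """Variant of generate() that takes a word list and validates its inputs."""
    if len(start_words) != chain_length:
        raise ValueError(
            f"expected {chain_length} start words, got {len(start_words)}"
        )
    rng = random.Random(seed)  # local RNG so a fixed seed gives repeatable output

    # Build the table: tuple of chain_length words -> list of following words
    table = {}
    for i in range(len(words) - chain_length):
        state = tuple(words[i:i + chain_length])
        table.setdefault(state, []).append(words[i + chain_length])

    state = tuple(start_words)
    out = list(state)
    # Stop at the requested length or at a state with no known successor
    while len(out) < num_generated and state in table:
        out.append(rng.choice(table[state]))
        state = tuple(out[-chain_length:])
    return " ".join(out)

# Toy corpus, chosen only for illustration
words = "the cat sat on the mat and the cat sat on the rug".split()
print(generate_checked(words, ["the", "cat"], chain_length=2, num_generated=6, seed=0))

try:
    generate_checked(words, ["the", "cat", "sat"], chain_length=2, num_generated=6)
except ValueError as err:
    print("rejected:", err)
```

Raising early is one design choice among several. Another would be to keep only the last `chain_length` start words as the initial state.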
