
In [63]:
import random
from collections import defaultdict

### Text Data

In [64]:
corpus = "the cat sat on the mat the cat lay on the mat"
# corpus = "Hello, how are you doing today? I am doing well, thank you for asking. What are your plans for the weekend? I might go to the park or visit a friend. Have you seen the latest movie in theaters? Yes, I watched it last night, and it was fantastic. The weather has been quite unpredictable lately. It rained heavily in the morning but was sunny in the afternoon. Do you prefer coffee or tea in the morning? I usually drink coffee, but sometimes I enjoy tea. The government announced new policies to boost the economy. Scientists have discovered a new exoplanet similar to Earth. The stock market experienced a significant surge today. A new artificial intelligence model is revolutionizing industries. Wildfires have spread across the western region due to dry conditions. Quantum computing is expected to change the future of cryptography. Neural networks mimic the way the human brain processes information. Researchers developed a new vaccine to combat the virus. Electric vehicles are becoming more popular due to sustainability concerns. The Mars rover sent back high-resolution images of the planet’s surface. “To be or not to be, that is the question.” The novel explores themes of identity and self-discovery. Plato’s philosophy emphasizes the importance of reason and knowledge. Poetry often captures deep emotions and profound thoughts. Classic literature provides insights into historical societies and cultures. The Roman Empire was one of the most powerful civilizations in history. Ancient Egypt is known for its pyramids and complex society. The Great Wall of China was built to protect against invasions. The Industrial Revolution transformed economies and societies worldwide. Many explorers risked their lives to map the world’s uncharted territories. The championship game ended in a dramatic overtime victory. Basketball requires both physical skill and strategic thinking. Music has the power to evoke deep emotions in listeners. 
# The actor gave an incredible performance in the latest film. Streaming platforms have changed how people consume media."

### Tokenize

In [65]:
words = corpus.split()
vocab = set(words)
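
`corpus.split()` splits on whitespace only, so with the longer commented-out corpus, tokens such as `today?` and `today` would be counted as different words. A minimal sketch of one possible alternative, assuming that lowercasing and dropping punctuation is acceptable for this model; `tokenize` is a hypothetical helper, not part of the notebook:

```python
import re


def tokenize(text):
    """Lowercase the text and return its words, dropping punctuation."""
    # [a-z0-9']+ keeps letters, digits, and straight apostrophes together,
    # so "today?" becomes "today" and "Hello," becomes "hello".
    return re.findall(r"[a-z0-9']+", text.lower())


sample = "Hello, how are you doing today? I am doing well."
tokens = tokenize(sample)
print(tokens)
```

On the short cat-and-mat corpus this gives the same tokens as `corpus.split()`, because that text is already lowercase and has no punctuation.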

### Step 1: Count bigram occurrences

In [59]:
bigram_counts = defaultdict(lambda: defaultdict(int))

for i in range(len(words) - 1):
    w1, w2 = words[i], words[i + 1]
    bigram_counts[w1][w2] += 1

bigram_counts


defaultdict(<function __main__.<lambda>()>,
            {'the': defaultdict(int, {'cat': 2, 'mat': 2}),
             'cat': defaultdict(int, {'sat': 1, 'lay': 1}),
             'sat': defaultdict(int, {'on': 1}),
             'on': defaultdict(int, {'the': 2}),
             'mat': defaultdict(int, {'the': 1}),
             'lay': defaultdict(int, {'on': 1})})
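
The same counts can also be built without index arithmetic. The sketch below is an alternative, not the notebook's method: it pairs each word with its successor using `zip` and counts the pairs with `collections.Counter`. The names `pair_counts` and `nested_counts` are introduced here for illustration only.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat lay on the mat"
words = corpus.split()

# zip(words, words[1:]) yields (word, next_word) for every adjacent pair.
pair_counts = Counter(zip(words, words[1:]))

# Regroup the flat pair counts into the nested {w1: {w2: count}} shape.
nested_counts = defaultdict(dict)
for (w1, w2), count in pair_counts.items():
    nested_counts[w1][w2] = count

print(dict(nested_counts))
```

A corpus of 12 words has 11 adjacent pairs, so the counts should sum to `len(words) - 1`.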

### Step 2: Compute probabilities

In [60]:
bigram_probs = {}

for w1 in bigram_counts:
    total_count = sum(bigram_counts[w1].values())
    bigram_probs[w1] = {w2: count / total_count for w2, count in bigram_counts[w1].items()}

bigram_probs

{'the': {'cat': 0.5, 'mat': 0.5},
 'cat': {'sat': 0.5, 'lay': 0.5},
 'sat': {'on': 1.0},
 'on': {'the': 1.0},
 'mat': {'the': 1.0},
 'lay': {'on': 1.0}}
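
Each inner dictionary is a conditional distribution: the probability of `w2` given `w1` is the count of the pair divided by the total count of pairs that start with `w1`. A small sanity check, sketched here with the counts copied by hand from the Step 1 output, is that every row sums to 1:

```python
import math

# Counts copied from the Step 1 output for the short corpus.
bigram_counts = {
    "the": {"cat": 2, "mat": 2},
    "cat": {"sat": 1, "lay": 1},
    "sat": {"on": 1},
    "on": {"the": 2},
    "mat": {"the": 1},
    "lay": {"on": 1},
}

# P(w2 | w1) = count(w1, w2) / sum of count(w1, w) over all followers w.
bigram_probs = {
    w1: {w2: c / sum(nexts.values()) for w2, c in nexts.items()}
    for w1, nexts in bigram_counts.items()
}

# Every conditional distribution must sum to 1 (up to float rounding).
for w1, nexts in bigram_probs.items():
    assert math.isclose(sum(nexts.values()), 1.0), w1

print(bigram_probs["the"]["cat"])  # 2 / 4
```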

### Step 3: Generate text using bigram probabilities

In [61]:
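def generate_text(start_word, length=10):
    """Sample up to `length` words from the bigram model, starting at `start_word`."""
    if start_word not in bigram_probs:
        print("Word not in bigram model.")
        return ""  # empty string, so print(generate_text(...)) does not show "None"

    text = [start_word]

    for _ in range(length - 1):
        # Stop early if the last word was never followed by anything in the corpus.
        if text[-1] not in bigram_probs:
            break
        followers = bigram_probs[text[-1]]
        # Draw the next word in proportion to its bigram probability.
        next_word = random.choices(list(followers), weights=list(followers.values()))[0]
        text.append(next_word)

    return " ".join(text)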


### Example Usage

In [62]:
print("Bigram Probabilities:")
for w1, w2_probs in bigram_probs.items():
    print(f"{w1}: {w2_probs}")

print("\nGenerated Text:")
print(generate_text("cat", length=15))

Bigram Probabilities:
the: {'cat': 0.5, 'mat': 0.5}
cat: {'sat': 0.5, 'lay': 0.5}
sat: {'on': 1.0}
on: {'the': 1.0}
mat: {'the': 1.0}
lay: {'on': 1.0}

Generated Text:
cat lay on the cat lay on the cat sat on the cat sat on
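
Because each next word is sampled at random, the generated text changes from run to run. If repeatable output is wanted, one option is to sample from a seeded `random.Random` instance. The sketch below assumes the probability table from Step 2; `generate_seeded` and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import random

# Probabilities copied from the Step 2 output for the short corpus.
bigram_probs = {
    "the": {"cat": 0.5, "mat": 0.5},
    "cat": {"sat": 0.5, "lay": 0.5},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"the": 1.0},
    "lay": {"on": 1.0},
}


def generate_seeded(start_word, length=10, seed=0):
    """Like generate_text, but draws from a private, seeded generator."""
    rng = random.Random(seed)  # independent of the global random state
    text = [start_word]
    for _ in range(length - 1):
        followers = bigram_probs.get(text[-1])
        if not followers:
            break  # no known successor for the last word
        next_word = rng.choices(list(followers), weights=list(followers.values()))[0]
        text.append(next_word)
    return " ".join(text)


first = generate_seeded("cat", length=15, seed=42)
second = generate_seeded("cat", length=15, seed=42)
print(first)
print(first == second)  # the same seed reproduces the same text
```

Using a separate `random.Random` object keeps the seed from affecting other code that relies on the module-level generator.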
