Step 1: Setting up.
First, make sure that NumPy is installed.

In [1]:
pip install numpy

Collecting numpy
  Downloading numpy-2.3.2-cp313-cp313-win_amd64.whl.metadata (60 kB)
Downloading numpy-2.3.2-cp313-cp313-win_amd64.whl (12.8 MB)
Installing collected packages: numpy
Successfully installed numpy-2.3.2
Note: you may need to restart the kernel to use updated packages.
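
To confirm the install worked, a quick check like the one below can be run in a fresh cell (after restarting the kernel if pip asked for it). This is only a minimal sketch; the version printed depends on your environment.

```python
import numpy as np

# If this import succeeds, NumPy is available to the notebook kernel.
# The printed version string will vary from machine to machine.
print(np.__version__)
```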





Step 2: Understanding the basics.

A language model predicts the next word in a sentence. 

To keep things simple, we'll build a bigram model.

This means that the model will predict the next word using only the current word.
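
To make "bigram" concrete, here is a minimal sketch that lists every (current word, next word) pair in a short sentence. The sentence and the names 'tokens' and 'bigrams' are illustrative assumptions, not part of the notebook's model.

```python
# An illustrative sentence (not the training corpus used later)
sentence = "machine learning is the future"
tokens = sentence.split()

# Pair each word with the word that immediately follows it
bigrams = list(zip(tokens, tokens[1:]))

print(bigrams)
# [('machine', 'learning'), ('learning', 'is'), ('is', 'the'), ('the', 'future')]
```

A sentence with n words always yields n - 1 bigrams, which is why the counting loop later stops one position before the end.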

We'll start with a short text to train the model.

In [2]:
import numpy as np

# Sample dataset: a small text corpus
corpus = """Artificial Intelligence is the new electricity.
Machine learning is the future of AI.
AI is transforming industries and shaping the future."""

Step 3: Preparing the text.

First things first: we break this text into individual words and create a vocabulary (basically, a list of all unique words).

This gives us something to work with.

In [3]:
# Tokenize the corpus into words
words = corpus.lower().split()

# Create a vocabulary of unique words
vocab = list(set(words))
vocab_size = len(vocab)

print(f"Vocabulary: {vocab}")
print(f"Vocabulary Size: {vocab_size}")

Vocabulary: ['future.', 'the', 'new', 'learning', 'transforming', 'and', 'is', 'ai', 'future', 'industries', 'intelligence', 'electricity.', 'machine', 'shaping', 'of', 'artificial', 'ai.']
Vocabulary Size: 17


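Here we convert the text to lowercase and split it on whitespace to get the words.

Then we build a list of unique words to serve as our vocabulary. Because we split on whitespace only, punctuation stays attached, so 'future' and 'future.' count as two different words.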
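
If you would rather treat 'future' and 'future.' as the same word, one option is to strip punctuation while tokenizing. The sketch below is an alternative, not what the cells in this notebook do; the names 'words_clean' and 'vocab_clean' and the regular expression are assumptions chosen for illustration.

```python
import re

corpus = """Artificial Intelligence is the new electricity.
Machine learning is the future of AI.
AI is transforming industries and shaping the future."""

# Keep runs of lowercase letters, digits, and apostrophes; punctuation is dropped
words_clean = re.findall(r"[a-z0-9']+", corpus.lower())

# sorted() gives the vocabulary a stable order from run to run
vocab_clean = sorted(set(words_clean))

print(vocab_clean)
print(f"Vocabulary Size: {len(vocab_clean)}")  # 15 instead of 17
```

The vocabulary shrinks because 'ai.' and 'future.' merge with 'ai' and 'future'. The trade-off is that the model loses its only hint about where sentences end.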

Step 4: Map Words to Numbers

Computers work with numbers, not words. 

So, we'll map each word to an index and create a reverse mapping too (this will help convert them back to words later).

In [4]:
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

# Convert the words in the corpus to indices
corpus_indices = [word_to_idx[word] for word in words]

Basically, we're just turning words into numbers that our model can understand.

Each word gets its own number: 'ai' might become 0 and 'learning' might become 1, depending on the order of the vocabulary list (which can change between runs, because it is built from a set).
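
A quick round trip shows that the two mappings are inverses of each other. This sketch uses a tiny hypothetical vocabulary so the indices are easy to follow; the names 'sample', 'encoded', and 'decoded' are illustrative.

```python
# Hypothetical four-word vocabulary, in a fixed order
vocab = ["ai", "is", "the", "future"]
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

# Encode a short word sequence, then decode it again
sample = ["the", "future", "is", "ai"]
encoded = [word_to_idx[word] for word in sample]
decoded = [idx_to_word[idx] for idx in encoded]

print(encoded)  # [2, 3, 1, 0]
print(decoded)  # ['the', 'future', 'is', 'ai']
```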

Step 5: Building the model.

Now, let's get to the heart of it: building the bigram model.

We want to figure out the probability of one word following another.

To do that, we'll count how often each pair (bigram) shows up in our corpus.
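
As a quick aside before the matrix version, the same kind of tally can be inspected with Python's 'collections.Counter', which makes the raw pair counts easy to read. The word list here is a made-up example, and 'pair_counts' is an illustrative name.

```python
from collections import Counter

# Made-up word sequence for illustration
words = "ai is the future and ai is here".split()

# Count every (current word, next word) pair
pair_counts = Counter(zip(words, words[1:]))

print(pair_counts[("ai", "is")])   # 2
print(pair_counts[("is", "the")])  # 1
```

The NumPy matrix used in this notebook stores the same information, with one row per current word and one column per next word.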

In [5]:
# Initialize bigram counts matrix
bigram_counts = np.zeros((vocab_size, vocab_size))

# Count occurrences of each bigram in the corpus
for i in range(len(corpus_indices) - 1):
    current_word = corpus_indices[i]
    next_word = corpus_indices[i + 1]
    bigram_counts[current_word, next_word] += 1

# Apply additive (Laplace-style) smoothing by adding 0.01 to all bigram counts
bigram_counts += 0.01

# Normalize the counts to get probabilities
bigram_probabilities = bigram_counts / bigram_counts.sum(axis=1, keepdims=True)

print("Bigram probabilities matrix:", bigram_probabilities)

Bigram probabilities matrix: [[0.05882353 0.05882353 0.05882353 0.05882353 0.05882353 0.05882353
  0.05882353 0.05882353 0.05882353 0.05882353 0.05882353 0.05882353
  0.05882353 0.05882353 0.05882353 0.05882353 0.05882353]
 [0.31861199 0.00315457 0.31861199 0.00315457 0.00315457 0.00315457
  0.00315457 0.00315457 0.31861199 0.00315457 0.00315457 0.00315457
  0.00315457 0.00315457 0.00315457 0.00315457 0.00315457]
 [0.00854701 0.00854701 0.00854701 0.00854701 0.00854701 0.00854701
  0.00854701 0.00854701 0.00854701 0.00854701 0.00854701 0.86324786
  0.00854701 0.00854701 0.00854701 0.00854701 0.00854701]
 [0.00854701 0.00854701 0.00854701 0.00854701 0.00854701 0.00854701
  0.86324786 0.00854701 0.00854701 0.00854701 0.00854701 0.00854701
  0.00854701 0.00854701 0.00854701 0.00854701 0.00854701]
 [0.00854701 0.00854701 0.00854701 0.00854701 0.00854701 0.00854701
  0.00854701 0.00854701 0.00854701 0.86324786 0.00854701 0.00854701
  0.00854701 0.00854701 0.00854701 0.00854701 0.00854701]
 

What is happening?

First, we count how often each word follows another (that's the bigram).

Then, we turn those counts into probabilities by dividing each row by its total, so that every row sums to 1.

In simple terms, this means that if 'AI' is often followed by 'is', the probability for that pair will be higher.
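
The normalization step can be checked on a small, made-up count matrix. This is a sketch with hypothetical numbers; 'counts', 'row_totals', and 'probs' are illustrative names.

```python
import numpy as np

# Hypothetical counts for a 3-word vocabulary
# (rows: current word, columns: next word)
counts = np.array([
    [2.0, 1.0, 1.0],
    [0.0, 3.0, 1.0],
    [1.0, 1.0, 2.0],
])

# keepdims=True keeps the row totals as a (3, 1) column,
# so each row is divided by its own total
row_totals = counts.sum(axis=1, keepdims=True)
probs = counts / row_totals

print(probs)
print(probs.sum(axis=1))  # every row sums to 1
```

Without 'keepdims=True' the totals would have shape (3,) and broadcasting would divide each column, not each row, by a total.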

PS: The line 'bigram_counts += 0.01' applies additive smoothing, an adjustment that avoids zero probabilities when certain word pairs don't appear in the corpus. Strictly speaking, Laplace smoothing adds 1 to every count; adding a smaller constant such as 0.01 is usually called add-k (or Lidstone) smoothing. It ensures that every word pair has a small positive probability, and that no row sums to zero, so the normalization never divides by zero (the last word of the corpus, for example, is never followed by anything). By using a small value like 0.01, we strike a balance between avoiding zeros and not overly inflating probabilities for unseen word pairs.
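
To see why the size of the constant matters, the sketch below compares adding 1 with adding 0.01 on a single row. With constant k and vocabulary size V, the smoothed probability is (count + k) / (row total + k * V). The helper 'smooth_row' and the example row are assumptions made for illustration.

```python
import numpy as np

def smooth_row(counts, k):
    """Add k to every count in the row, then normalize the row."""
    counts = np.asarray(counts, dtype=float)
    return (counts + k) / (counts.sum() + k * counts.size)

# Hypothetical row: a word seen once, followed by exactly one of 17 words
row = np.zeros(17)
row[11] = 1

for k in (1.0, 0.01):
    probs = smooth_row(row, k)
    print(f"k={k}: seen={probs[11]:.4f}, unseen={probs[0]:.4f}")

# k=1.0: seen=0.1111, unseen=0.0556
# k=0.01: seen=0.8632, unseen=0.0085
```

On a corpus this small, adding 1 nearly flattens the distribution, while 0.01 keeps most of the probability on the pair that was actually observed.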

Step 6: Predicting the next word

Let's test our model by making it predict the next word for any given word. We do this by sampling from the probability distribution of the next word.

In [6]:
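def predict_next_word(current_word, bigram_probabilities):
    # Look up the row of next-word probabilities for the current word
    word_idx = word_to_idx[current_word]
    next_word_probs = bigram_probabilities[word_idx]
    # Sample the index of the next word according to those probabilities
    next_word_idx = np.random.choice(range(vocab_size), p=next_word_probs)
    return idx_to_word[next_word_idx]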

# Test the model with a word
current_word = 'ai'
next_word = predict_next_word(current_word, bigram_probabilities)
print(f"Given '{current_word}', the model predicts '{next_word}'.")

Given 'ai', the model predicts 'is'.


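This function takes a word, looks up its probabilities, and randomly selects the next word based on those probabilities.

If you pass in 'ai', the model might predict something like 'is' as the next word. The word must be lowercase and present in the vocabulary, since 'word_to_idx' only contains the lowercased tokens; passing 'AI' would raise a KeyError.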
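
Because the prediction is sampled, repeated calls can return different words. The sketch below shows two common variations: always taking the most likely word, and sampling with a seeded generator so that results repeat across runs. The tiny model and the names 'predict_greedy' and 'predict_sampled' are hypothetical.

```python
import numpy as np

# Hypothetical 3-word model for illustration
vocab = ["ai", "is", "the"]
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
probs = np.array([
    [0.1, 0.8, 0.1],  # after "ai"
    [0.1, 0.1, 0.8],  # after "is"
    [0.8, 0.1, 0.1],  # after "the"
])

def predict_greedy(current_word):
    """Return the single most likely next word (no randomness)."""
    row = probs[word_to_idx[current_word]]
    return vocab[int(np.argmax(row))]

def predict_sampled(current_word, rng):
    """Sample the next word from the row's probability distribution."""
    row = probs[word_to_idx[current_word]]
    return vocab[rng.choice(len(vocab), p=row)]

rng = np.random.default_rng(seed=0)  # fixed seed makes the draws repeatable
print(predict_greedy("ai"))  # is
print(predict_sampled("ai", rng))
```

Greedy prediction is deterministic but tends to loop on short corpora, which is one reason sampling is used for generation.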

Step 7: Generate a Sentence

Finally, let's generate a whole sentence! We'll start with a word and keep predicting the next word a few times.

In [9]:
def generate_sentence(start_word, bigram_probabilities, length=5):
    sentence = [start_word]
    current_word = start_word
    
    for _ in range(length):
        next_word = predict_next_word(current_word, bigram_probabilities)
        sentence.append(next_word)
        current_word = next_word
    
    return ' '.join(sentence)

# Generate a sentence starting with 'artificial'
generated_sentence = generate_sentence('artificial', bigram_probabilities, length=10)
print(f"Generated sentence: {generated_sentence}")

Generated sentence: artificial intelligence is the new electricity. machine learning shaping the future


This function takes an initial word and predicts the next one, then uses that word to predict the following one, and so on.

Before you know it, you've got a full sequence!
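
One possible refinement is to stop as soon as the model emits a word that ends with a period, so the output reads like a single sentence. The sketch below rebuilds a tiny model so it can run on its own; the corpus, the function 'generate_until_period', and the 'max_length' parameter are assumptions for illustration, not part of the notebook above.

```python
import numpy as np

# Tiny made-up corpus; punctuation stays attached to words, as in this notebook
corpus = "ai is the future. the future is ai."
words = corpus.split()
vocab = sorted(set(words))
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

# Smoothed bigram counts, then row-wise normalization
counts = np.full((len(vocab), len(vocab)), 0.01)
for current, following in zip(words, words[1:]):
    counts[word_to_idx[current], word_to_idx[following]] += 1
probs = counts / counts.sum(axis=1, keepdims=True)

def generate_until_period(start_word, rng, max_length=10):
    """Generate words until one ends with '.' or max_length is reached."""
    sentence = [start_word]
    while len(sentence) < max_length and not sentence[-1].endswith("."):
        row = probs[word_to_idx[sentence[-1]]]
        sentence.append(vocab[rng.choice(len(vocab), p=row)])
    return " ".join(sentence)

print(generate_until_period("ai", np.random.default_rng(0)))
```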