# Introduction to the Lab
In this lab notebook, you will learn how to build the simplest possible language model, a toy stand-in for a large language model (LLM), using Jupyter Notebook. The result is a prototype: a proof of concept, or minimum viable product (MVP).

We will use Python and the nltk library to create a basic language model. It is designed to be as simple as possible while still giving you a complete implementation template and a set of recipes to build on.

## Introduction to Language Models 
A language model is a probabilistic model that estimates how likely a sequence of words is, and in particular how likely each word is to appear given the words that come before it.
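
Stated as a formula (a standard textbook formulation, not something specific to this lab), the probability of a word sequence $w_1, \dots, w_n$ factors by the chain rule. A bigram model, the kind built later in this notebook, approximates each factor by conditioning only on the previous word, with $w_0$ assumed to be a start-of-sequence marker:

$$
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
$$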

It is commonly used in natural language processing (NLP) tasks such as speech recognition, machine translation, and text generation.

### Importing Libraries
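In this lab, we will use the Natural Language Toolkit (nltk) library. If it is not installed yet, run %pip install nltk in a cell first. Then open a new cell in your Jupyter Notebook and run the following to import the modules we need and download the Punkt tokenizer data. Recent NLTK releases may look for the punkt_tab resource instead; if tokenization later fails with a LookupError, run nltk.download('punkt_tab') as well: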

In [1]:
import nltk
import random
from nltk.util import ngrams
from collections import defaultdict, Counter

# download utilities
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/pen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Preparing the Dataset 
For our simple LLM, we will use a sample text. You can replace this with your own dataset if desired. Paste the following code in a new cell:

In [2]:
sample_text = """
Once upon a time, in a land far, far away, there lived a king and queen who had a beautiful daughter. The princess was kind and gentle, and everyone loved her.
"""

### Tokenization
Tokenization is the process of breaking a text into individual words or tokens. 

We will use the nltk.word_tokenize() function to tokenize our sample text. Run the following code:

In [3]:
tokens = nltk.word_tokenize(sample_text.lower())
print(tokens)

['once', 'upon', 'a', 'time', ',', 'in', 'a', 'land', 'far', ',', 'far', 'away', ',', 'there', 'lived', 'a', 'king', 'and', 'queen', 'who', 'had', 'a', 'beautiful', 'daughter', '.', 'the', 'princess', 'was', 'kind', 'and', 'gentle', ',', 'and', 'everyone', 'loved', 'her', '.']
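
To get a feel for the token list, it can help to count how often each token occurs. The sketch below is self-contained and uses a small regular-expression tokenizer (`simple_tokenize`, a hypothetical stand-in that is not part of nltk) so that it runs without any downloads. On this sample text it happens to split the same way as the output above, but it is not a general replacement for nltk.word_tokenize().

```python
import re
from collections import Counter

sample_text = """
Once upon a time, in a land far, far away, there lived a king and queen who had a beautiful daughter. The princess was kind and gentle, and everyone loved her.
"""

def simple_tokenize(text):
    """Split lowercased text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize(sample_text)
token_counts = Counter(tokens)

print(len(tokens))                   # total number of tokens
print(token_counts.most_common(3))   # the three most frequent tokens
```

Frequent tokens such as 'a' and ',' are the ones that will end up with the most distinct followers in the bigram model.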


### N-Gram Model 

An N-gram is a contiguous sequence of n items from a given sample of text.
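
As a minimal illustration of that definition, n-grams can be produced with plain Python by zipping a token list against shifted copies of itself. The helper name `make_ngrams` is made up for this sketch; the lab itself uses nltk.util.ngrams.

```python
def make_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] are zipped position by position
    return list(zip(*(tokens[i:] for i in range(n))))

example_tokens = ["once", "upon", "a", "time"]

print(make_ngrams(example_tokens, 2))  # bigrams
print(make_ngrams(example_tokens, 3))  # trigrams
```

A list of four tokens yields three bigrams and two trigrams; in general, a list of k tokens yields k - n + 1 n-grams.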

We will create a simple bigram model (n=2) for our LLM. This code builds a dictionary that maps each word to a Counter of the words that follow it, together with how often each one follows. Run the following code in a new cell:

In [4]:
bigrams = list(ngrams(tokens, 2))
bigram_freq = defaultdict(Counter)

for w1, w2 in bigrams:
    bigram_freq[w1][w2] += 1

print(bigram_freq)

defaultdict(<class 'collections.Counter'>, {'once': Counter({'upon': 1}), 'upon': Counter({'a': 1}), 'a': Counter({'time': 1, 'land': 1, 'king': 1, 'beautiful': 1}), 'time': Counter({',': 1}), ',': Counter({'in': 1, 'far': 1, 'there': 1, 'and': 1}), 'in': Counter({'a': 1}), 'land': Counter({'far': 1}), 'far': Counter({',': 1, 'away': 1}), 'away': Counter({',': 1}), 'there': Counter({'lived': 1}), 'lived': Counter({'a': 1}), 'king': Counter({'and': 1}), 'and': Counter({'queen': 1, 'gentle': 1, 'everyone': 1}), 'queen': Counter({'who': 1}), 'who': Counter({'had': 1}), 'had': Counter({'a': 1}), 'beautiful': Counter({'daughter': 1}), 'daughter': Counter({'.': 1}), '.': Counter({'the': 1}), 'the': Counter({'princess': 1}), 'princess': Counter({'was': 1}), 'was': Counter({'kind': 1}), 'kind': Counter({'and': 1}), 'gentle': Counter({',': 1}), 'everyone': Counter({'loved': 1}), 'loved': Counter({'her': 1}), 'her': Counter({'.': 1})})
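
The counts above become a probabilistic model once each word's follower counts are divided by their total. The sketch below shows that normalization on a tiny made-up token list (not the notebook's sample text); `bigram_probs` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Toy token list, chosen only to keep the arithmetic easy to follow
tokens = ["a", "king", "and", "a", "queen", "and", "a", "king"]

# Count how often each word is followed by each other word
bigram_freq = defaultdict(Counter)
for w1, w2 in zip(tokens, tokens[1:]):
    bigram_freq[w1][w2] += 1

def bigram_probs(freq):
    """Turn follower counts into conditional probabilities P(w2 | w1)."""
    probs = {}
    for w1, followers in freq.items():
        total = sum(followers.values())
        probs[w1] = {w2: count / total for w2, count in followers.items()}
    return probs

probs = bigram_probs(bigram_freq)
print(probs["a"])    # 'king' follows 'a' twice, 'queen' once
print(probs["and"])  # 'and' is always followed by 'a'
```

The text generator in the next section samples from these same proportions; it passes the raw counts as weights instead of normalizing them first.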


### Generating Text 

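Now that we have our bigram model, we can use it to generate text. This code defines a function generate_text() that accepts a seed word and the number of words to generate. At each step it samples the next word in proportion to how often it followed the current word, and it stops early if the current word was never followed by anything (for example, a seed word that does not occur in the text). Run the following code in a new cell:

In [5]:
def generate_text(seed, n_words):
    """Generate up to n_words words after seed by sampling from bigram_freq."""
    result = [seed]
    for _ in range(n_words):
        # .get() avoids adding empty entries to the defaultdict for unseen words
        next_word_options = bigram_freq.get(result[-1])
        if not next_word_options:
            break  # dead end: no observed follower for this word
        # Weight each candidate by how often it followed the current word
        next_word = random.choices(list(next_word_options.keys()), weights=list(next_word_options.values()))[0]
        result.append(next_word)
    return ' '.join(result)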

generated_text = generate_text('princess', 5)
print(generated_text)

princess was kind and everyone loved
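
Because the generator samples at random, the output changes from run to run. For comparison, here is a sketch of a greedy variant that always picks the most frequent follower, which makes the output repeatable. This is a different decoding strategy from the one used in the lab, and the names `build_bigram_freq`, `generate_greedy`, and the toy corpus are invented for the example.

```python
from collections import Counter, defaultdict

def build_bigram_freq(tokens):
    """Map each word to a Counter of the words that follow it."""
    freq = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        freq[w1][w2] += 1
    return freq

def generate_greedy(freq, seed, n_words):
    """Extend seed by repeatedly taking the most frequent follower."""
    result = [seed]
    for _ in range(n_words):
        followers = freq.get(result[-1])
        if not followers:
            break  # unseen word or dead end: stop early
        result.append(followers.most_common(1)[0][0])
    return " ".join(result)

toy_tokens = "the princess was kind and the princess was kind".split()
toy_freq = build_bigram_freq(toy_tokens)

print(generate_greedy(toy_freq, "princess", 4))
print(generate_greedy(toy_freq, "dragon", 4))  # seed not in the corpus
```

Greedy decoding tends to fall into loops on small corpora, which is one reason the lab's version samples instead.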


### Conclusion 

Congratulations! You have successfully built the simplest LLM using Jupyter Notebook. This basic language model demonstrates the core concepts of NLP, including tokenization and n-grams. 

Although simple, it can be expanded and improved for more complex applications. Keep experimenting and learning to enhance your NLP skills!

## Expanding Your Simplest LLM
In this part of the tutorial, we build on the simplest LLM we created previously. We will show you how to add more text to your model, train it with the nltk.lm module, and generate text from several prompts. We'll cover the following steps:

### Import necessary libraries

In [6]:
import nltk
import random
from nltk import word_tokenize, sent_tokenize
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# download utilities
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /home/pen/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

### Prepare the dataset
Load your previous text and combine it with new text data. Replace the two placeholder strings below with real text, and make sure the new text is clean and well-formatted.

In [7]:
old_text = "your_previous_text_data"
new_text = "your_new_text_data"
combined_text = old_text + " " + new_text
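
If your text lives in files rather than in string literals, one way to build combined_text is sketched below. The helper `load_corpus` and the file names are hypothetical; the demo writes two throwaway files to a temporary directory so the block runs as-is.

```python
import tempfile
from pathlib import Path

def load_corpus(paths, encoding="utf-8"):
    """Read each text file and join the contents with single spaces."""
    parts = []
    for path in paths:
        text = Path(path).read_text(encoding=encoding)
        parts.append(" ".join(text.split()))  # collapse newlines and repeated spaces
    return " ".join(parts)

# Demo with two throwaway files; replace these paths with your own corpus files.
with tempfile.TemporaryDirectory() as tmp_dir:
    old_path = Path(tmp_dir) / "old_corpus.txt"  # hypothetical file name
    new_path = Path(tmp_dir) / "new_corpus.txt"  # hypothetical file name
    old_path.write_text("Once upon a time.\n", encoding="utf-8")
    new_path.write_text("The princess   was kind.\n", encoding="utf-8")
    combined_text = load_corpus([old_path, new_path])

print(combined_text)
```

Collapsing whitespace is a small cleaning step; real datasets usually need more, such as removing markup or boilerplate.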

### Tokenize the text
Tokenize the combined text into sentences and words.

In [8]:
sent_tokens = sent_tokenize(combined_text)
word_tokens = [word_tokenize(t) for t in sent_tokens]

### Create a trigram model
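We'll use a trigram model this time: it predicts each word from the two words before it, which captures more context than a bigram model. The padded_everygram_pipeline() function pads each sentence with <s> and </s> markers and prepares every n-gram up to order n for training.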

In [9]:
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, word_tokens)
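
To make the padding concrete, the sketch below builds padded trigrams by hand for one tokenized sentence. It is intended to mirror the idea behind the pipeline (n - 1 start markers and n - 1 end markers around each sentence), not to reproduce nltk's implementation; `pad_sentence` and `trigrams` are hypothetical helper names.

```python
def pad_sentence(words, n, left="<s>", right="</s>"):
    """Add n - 1 start markers and n - 1 end markers around a tokenized sentence."""
    return [left] * (n - 1) + list(words) + [right] * (n - 1)

def trigrams(words):
    """Return all trigrams of a sentence after padding it for n = 3."""
    padded = pad_sentence(words, 3)
    return list(zip(padded, padded[1:], padded[2:]))

for gram in trigrams(["the", "princess", "smiled"]):
    print(gram)
```

The markers let the model learn which words tend to start and end a sentence, and they are the <s> and </s> tokens that can show up in generated text.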

### Train the model with more text
Instantiate the MLE model and fit it with the training data.

In [10]:
model = MLE(n)
model.fit(train_data, padded_sents)
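
Maximum likelihood estimation here simply means relative frequency: the probability of a word given the two previous words is the count of that trigram divided by the count of its two-word context. The following sketch works this out with the standard library on three made-up sentences; `trigram_mle` is a hypothetical helper and is not how nltk implements MLE internally.

```python
from collections import Counter

def trigram_mle(sentences):
    """Return a function that scores P(word | w1, w2) by relative frequency."""
    trigram_counts = Counter()
    context_counts = Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>", "</s>"]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            trigram_counts[(w1, w2, w3)] += 1
            context_counts[(w1, w2)] += 1

    def score(word, context):
        seen = context_counts[tuple(context)]
        if seen == 0:
            return 0.0  # unseen context: this sketch assigns zero probability
        return trigram_counts[(*context, word)] / seen

    return score

score = trigram_mle([
    ["the", "king", "smiled"],
    ["the", "king", "slept"],
    ["the", "queen", "smiled"],
])

print(score("smiled", ["the", "king"]))  # 1 of the 2 words seen after "the king"
print(score("king", ["<s>", "the"]))     # 2 of the 3 words seen after a sentence-initial "the"
```

Because unseen contexts get no probability mass, a model like this can only reproduce word sequences that appear in its training text, which is why the size and quality of the corpus matter so much.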

### Generate text with various questions
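Now you can try several prompts and generate text from each one. Keep in mind that an n-gram model does not answer a question; it continues the prompt with words that followed similar contexts in the training text.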

In [11]:
def generate_text(prompt, num_words, model):
    word_list = model.generate(num_words, text_seed=prompt.split())
    response = ' '.join(word_list)
    return response

# Example questions
questions = [
    "What is the importance",
    "How does it work",
    "What are the benefits",
    "How can I improve",
    "What should I consider"
]

for question in questions:
    print(f"Question: {question}")
    print(f"Answer: {generate_text(question, 20, model)}")
    print("\n")

Question: What is the importance
Answer: </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>


Question: How does it work
Answer: <s> your_previous_text_data your_new_text_data </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>


Question: What are the benefits
Answer: <s> your_previous_text_data your_new_text_data </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>


Question: How can I improve
Answer: </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>


Question: What should I consider
Answer: your_previous_text_data your_new_text_data </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>
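
The <s> and </s> markers in the output above are sentence padding, not real words. A small post-processing step can drop the start markers and cut the text at the first end marker. The helper name `clean_generated` is made up for this sketch, and it expects the list of words before they are joined into a string.

```python
def clean_generated(words, start="<s>", end="</s>"):
    """Drop start markers and cut the sequence at the first end marker."""
    cleaned = []
    for word in words:
        if word == end:
            break  # the sentence is over; ignore everything after it
        if word != start:
            cleaned.append(word)
    return " ".join(cleaned)

raw = ["<s>", "your_previous_text_data", "your_new_text_data", "</s>", "</s>"]
print(clean_generated(raw))

only_padding = ["</s>", "</s>", "</s>"]
print(repr(clean_generated(only_padding)))  # nothing left once padding is removed
```

An empty result is a useful signal in its own right: it means the model had nothing to say for that prompt.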




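With the placeholder strings above as the only training text, the output is mostly padding tokens (</s>), because the model has never seen the words in the prompts. Replace old_text and new_text with real text and the model will have many more contexts to draw on, so its continuations become more varied and more relevant to the prompt. Continue experimenting with different datasets, model architectures, and training techniques to further enhance your NLP skills.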

# References 
  - [Building the simplest LLM with Jupyter Notebook: A Students Guide](https://coda.io/@peter-sigurdson/building-the-simplest-llm-with-jupyter-notebook-a-students-guide)