Here is a full pipeline for building a RAG application. By running these few cells, you can download information about dogs, parse it, and organize it into chunks. The pipeline then embeds the chunks into a vector store. Finally, it downloads an LLM, retrieves the most relevant chunks, and outputs a fact, all in a few clicks.

In [1]:
%%capture
pip install chromadb sentence_transformers

Recommended the first time you run the notebook. Upload Dog_List.txt and run this cell to get a Python list of all the dog breeds.

In [2]:
#OPTIONAL: If you want to upload the complete list of breeds as a text file, upload the file and run this cell.
def read_file_to_list(file_path):
    with open(file_path, 'r') as file:
        items_list = [line.strip() for line in file]
    return items_list

file_path = 'Dog_List.txt'

dog_list = read_file_to_list(file_path)
print(dog_list)

['Afghan-Hound', 'Airedale-Terrier', 'Akita', 'Alaskan-Malamute', 'American-Foxhound', 'American-Hairless-Terrier', 'American-Water-Spaniel', 'Basenji', 'Basset-Hound', 'Beagle', 'Bearded-Collie', 'Belgian-Malinois', 'Belgian-Sheepdog', 'Bernese-Mountain-Dog', 'Bichon-Frise', 'Bloodhound', 'Bluetick-Coonhound', 'Border-Collie', 'Borzoi', 'Boston-Terrier', 'Boxer', 'Bull-Terrier', 'Bulldog', 'Bullmastiff', 'Cairn-Terrier', 'Cane-Corso', 'Cavalier-King-Charles-Spaniel', 'Chihuahua', 'Chinese-Crested', 'Chinese-Shar-Pei', 'Chow-Chow', 'Clumber-Spaniel', 'Cockapoo', 'Cocker-Spaniel', 'Collie', 'Dachshund', 'Dalmatian', 'Doberman-Pinscher', 'French-Bulldog', 'German-Shepherd', 'German-Shorthaired-Pointer', 'Golden-Retriever', 'Great-Dane', 'Great-Pyrenees', 'Greyhound', 'Irish-Water-Spaniel', 'Irish-Wolfhound', 'Japanese-Chin', 'Komondor', 'Labradoodle', 'Labrador-Retriever', 'Lhasa-Apso', 'Maltese', 'Miniature-Schnauzer', 'Newfoundland', 'Norwegian-Elkhound', 'Pekingese', 'Pembroke-Welsh-C
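
If you would rather type a few breeds by hand than upload Dog_List.txt, the names still need to match the hyphenated, title-cased form shown in the output above. The helper below is a minimal sketch of that normalization. `normalize_breed` is a hypothetical name and is not part of the notebook, and it does not check that the breed actually exists on any of the websites.

```python
def normalize_breed(name):
    """Turn free-form input such as 'german shepherd' into 'German-Shepherd'."""
    # Split on whitespace, hyphens, and underscores, dropping empty pieces
    words = name.replace("-", " ").replace("_", " ").split()
    # Capitalize each word and join with hyphens, matching the Dog_List.txt style
    return "-".join(word.capitalize() for word in words)


manual_breeds = ["german shepherd", "alaskan_malamute", "  Samoyed "]
normalized = [normalize_breed(b) for b in manual_breeds]
print(normalized)
```

The resulting list could then be assigned to dog_breed in the download cell.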

This cell downloads the text data as HTML and cleans it up for embedding. Here you can modify the list of dog breeds and experiment with the chunk size. If you change directory_path, make sure to change it wherever else it is set, including the embedding cell. You can always add more breeds after you have created the vector store!

In [3]:

"""This code is a modified version of my Text_Data_Pipeline.py, which was used to streamline the fine-tuning of an LLM.
Instead of using BERT for data augmentation and turning the file into JSON, after downloading and parsing the text,
it chunks the data with LangChain and organizes it by dog breed."""

import requests
from langchain.text_splitter import RecursiveCharacterTextSplitter
from bs4 import BeautifulSoup
import os
import time
import shutil


#Insert your manual dog list here in the dog_breed variable, or pass the dog_list variable from the previous cell as: dog_breed = dog_list
"""The Dog_List.txt file is set up to match how the websites standardize breed names, and to work with the dictionaries below that handle the exceptions manually.
It is highly recommended to use the breed names provided in that list."""
dog_breed = ["Alaskan-Malamute", "Samoyed"]
dog_breed = [breed.title() for breed in dog_breed] #This makes the first letter of every word in the list Capitalized to match the dictionary casing (if needed)
directory_path= 'Dogs' #Path to save all of the files

#Feel free to mess with the chunk size and overlap HERE
chunk_size = 500
chunk_overlap = 120

#List of websites to gather data. You can turn them on or off here by setting to True or False.
DOGTIME = True
DAILYPAWS = True
CANINEJOURNAL = True
PETS4HOMES = True
##########################################################################################################

"""Most websites have the dog breed standardized. These dictionaries are here (one for each site) for when some dog breeds are not standardized.
In that case, you would have to figure out where the URL error is and fix it in the dictionary in the format below by comparing the variable in the list to the actual URL.
Luckily, the script logs all the URL errors that occur, so you can let the script run once and then check out the URL log.
You can add your own here if there are other breeds outside of Dog_List.txt that don't match the website."""

dogtime_W = {"German-Shepherd": "German-Shepherd-Dog", "Xoloitzcuintli": "Xoloitzuintli"}
dailypaws_W = {"Xoloitzcuintli": "Xoloitzcuintli-Mexican-Hairless", "Poodle": "Standard-Poodle"}
caninejournal_W = {"Chinese-Shar-Pei": "Shar-Pei"}
pets4homes_W = {
    "Chinese-Shar-Pei": "Shar-Pei",
    "Shiba-Inu": "Japanese-Shiba-Inu",
    "Belgian-Malinois": "Belgian-Shepherd-Dog",
    "Belgian-Sheepdog": "Belgian-Shepherd-Dog",
    "Bull-Terrier": "English-Bull-Terrier",
    "Bulldog": "English-Bulldog",
    "Collie": "Rough-Collie",
    "Doberman-Pinscher": "Dobermann",
    "Great-Pyrenees": "Pyrenean-Mountain-Dog",
    "Pembroke-Welsh-Corgi": "Welsh-Corgi-Pembroke",
    "Vizsla": "Hungarian-Vizsla",
    "Xoloitzcuintli": "Mexican-Hairless"
}


#Parse the html and output a text file
def fetch_and_parse(url, dog_breed, wnum):
    response = requests.get(url)

    if response.status_code != 200:
        print(f'Failed to retrieve {url}. Status code: {response.status_code}')
        with open('failed_urls.txt', 'a') as f:
            f.write(f'{url}\n')  # Write the failed URL and a newline character
        return []  # Return an empty list so the script continues

    soup = BeautifulSoup(response.text, 'html.parser') # Parse the HTML content with BeautifulSoup.
    texts = [p.get_text() for p in soup.find_all('p')]   # Find all <p> tags and extract the text

    with open(f'{dog_breed}{wnum}.txt', 'w', encoding = 'utf-8') as f: # Write the texts to a file
        f.write('\n'.join(texts))
    return texts if texts else []

#This processes the file further and chunks it with LangChain (recursive character splitter)
def process_and_chunk_text(texts, file_name, x=chunk_size, y=chunk_overlap):
    texts = [line.strip() for line in texts if len(line.strip()) > 55]
    time.sleep(2)

    # Combine writing and chunking to minimize file operations
    text_to_write = '\n'.join(texts)

    # Initialize the text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=x,
        chunk_overlap=y
    )

    # Split the text into chunks
    chunks = text_splitter.split_text(text_to_write)

    # Write the chunks back to the file
    with open(file_name, 'w', encoding='utf-8') as f:
        for chunk in chunks:
            f.write(chunk + "\n")

    if os.path.getsize(file_name) == 0:
        os.remove(file_name)

##################################################################################
"""Here is the section for the different websites. You will notice that the website functions have similar structures.
A few minor code adjustments are usually enough when adding a new website to go from HTML to chunked text.
Make sure to add the line for wnum; it ties each text file to its corresponding website."""

def dogtime_data(dog_breed):
    wnum = 1 #This helps name the output files, such as file1.txt
    file_name = f'{dog_breed}{wnum}.txt'
    indv_data_file_path = f"{directory_path}/indv_data/{dog_breed}{wnum}.txt"
    if not os.path.exists(indv_data_file_path):
        dogtime_name = dogtime_W.get(dog_breed, dog_breed).lower() #Uses the site-specific breed name if it differs from the list. Some sites require lowercase URLs, which ".lower()" handles
        url = f'https://dogtime.com/dog-breeds/{dogtime_name}' #URL name and variable
        texts = fetch_and_parse(url, dog_breed, wnum)

        # Remove unwanted lines
        start_remove = 'Looking for the best dog for your apartment?'
        end_remove = 'Playing with our pups is good for us.'

        start_index = None
        end_index = None

        for i, line in enumerate(texts):
            if line.startswith(start_remove):
                start_index = i
            elif line.startswith(end_remove):
                end_index = i
                break

        if start_index is not None and end_index is not None:
            texts = texts[:start_index] + texts[end_index+1:]
        if texts:
            texts = texts[:-2]
        process_and_chunk_text(texts, file_name)
    else:
        print(f"File {file_name} already exists.")

def dailypaws_data(dog_breed):
    wnum = 2
    file_name = f'{dog_breed}{wnum}.txt'
    indv_data_file_path = f"{directory_path}/indv_data/{dog_breed}{wnum}.txt"
    if not os.path.exists(indv_data_file_path):
        dailypaws_name = dailypaws_W.get(dog_breed, dog_breed).lower()
        url = f'https://www.dailypaws.com/dogs-puppies/dog-breeds/{dailypaws_name}'
        texts = fetch_and_parse(url, dog_breed, wnum)

        # Omit the second, third, and fourth lines from the texts
        del texts[1:4]

        process_and_chunk_text(texts, file_name)
    else:
        print(f"File {file_name} already exists.")

def caninejournal_data(dog_breed):
    wnum = 3
    file_name = f'{dog_breed}{wnum}.txt'
    indv_data_file_path = f"{directory_path}/indv_data/{dog_breed}{wnum}.txt"
    if not os.path.exists(indv_data_file_path):
        caninejournal_name = caninejournal_W.get(dog_breed, dog_breed).lower()
        url = f"https://www.caninejournal.com/{caninejournal_name}"
        texts = fetch_and_parse(url, dog_breed, wnum)
        # Remove the first and last 5 lines
        texts = texts[5:-5]
        process_and_chunk_text(texts, file_name)
    else:
        print(f"File {file_name} already exists.")

def pets4homes_data(dog_breed):
    wnum = 4
    file_name = f'{dog_breed}{wnum}.txt'
    indv_data_file_path = f"{directory_path}/indv_data/{dog_breed}{wnum}.txt"
    if not os.path.exists(indv_data_file_path):
        pets4homes_name = pets4homes_W.get(dog_breed, dog_breed).lower()
        url = f"https://www.pets4homes.co.uk/dog-breeds/{pets4homes_name}"
        texts = fetch_and_parse(url, dog_breed, wnum)
        # Remove the first and last 5 lines
        texts = texts[5:-5]
        process_and_chunk_text(texts, file_name)
    else:
        print(f"File {file_name} already exists.")

##################################################################################
#Moves all of the text files into a subfolder to clean up the work folder
def move_text_files(folder_name):
    # Create the folder if it doesn't exist
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)

    # Get a list of all text files in the current directory
    text_files = [f for f in os.listdir() if f.endswith('.txt')]

    # Exclude specific files from being moved
    excluded_files = ['failed_urls.txt', 'failed_json.txt']
    text_files = [f for f in text_files if f not in excluded_files]

    # Move each text file into the folder
    for file in text_files:
        destination_file = os.path.join(folder_name, file)
        if os.path.exists(destination_file):
            os.remove(destination_file)  # Remove the file if it already exists
        shutil.move(file, folder_name)

#Function to merge all files with the same base name into one. It saves the individual files under "indv_data"
def merge_text_files(directory, dog_breed):
    # Ensure the 'indv_data' folder exists
    indv_data_dir = os.path.join(directory, 'indv_data')
    if not os.path.exists(indv_data_dir):
        os.makedirs(indv_data_dir)

    # Merge text files by breed
    parent_file = os.path.join(directory, f'{dog_breed}.txt')

    for filename in os.listdir(directory):
        if filename.startswith(dog_breed) and filename != f'{dog_breed}.txt':
            child_file = os.path.join(directory, filename)

            # Append the content of child files to the parent file
            with open(child_file, 'r', encoding='utf-8') as f_child, \
                 open(parent_file, 'a', encoding='utf-8') as f_parent:
                f_parent.write(f_child.read() + '\n')

            # Move the child file to the 'indv_data' folder
            shutil.move(child_file, indv_data_dir)

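    # Nothing to clean up if no site returned data for this breed
    if not os.path.exists(parent_file):
        return

    # Rewrite the merged file without blank lines
    with open(parent_file, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    with open(parent_file, 'w', encoding='utf-8') as file:
        for line in lines:
            if line.strip():
                file.write(line)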

directory_path= 'Dogs'
#Now call the website functions in a loop for the list
for breed in dog_breed:
    if DOGTIME:
        dogtime_data(breed)
    if DAILYPAWS:
        dailypaws_data(breed)
    if CANINEJOURNAL:
        caninejournal_data(breed)
    if PETS4HOMES:
        pets4homes_data(breed)
    move_text_files(directory_path)
    merge_text_files(directory_path, breed)
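
To get a feel for what chunk_size and chunk_overlap control before changing them, here is a simplified fixed-width sliding window. It is only an illustration under that simplification: RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and spaces, so its chunks will differ. `sliding_chunks` is a hypothetical helper that does not appear in the cell above.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=120):
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

Each printed window repeats the last four characters of the previous one. A larger overlap keeps more shared context between neighbouring chunks, at the cost of storing more near-duplicate text.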



Here we create the vector store and embeddings from the text files produced by the code above. That code created one text file per website for each breed, so potentially 4 files per breed. They are saved in the "Dogs/indv_data" folder. I recommend embedding these text files instead of the merged files saved directly under "Dogs".

In [4]:
import re
import os
from sentence_transformers import SentenceTransformer
import chromadb

# Initialize the model
smodel = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Initialize the chromadb client and collection
client = chromadb.PersistentClient(path="dog")
collection = client.get_or_create_collection("dogdb")

directory_path = "/content/Dogs/indv_data"
text_files = [file for file in os.listdir(directory_path) if file.endswith('.txt')]

for file_name in text_files:
    # Construct the full path to the text file
    file_path = os.path.join(directory_path, file_name)

    # Open and read the corresponding text file
    with open(file_path, 'r', encoding='utf-8') as file:
        file_content = file.read()

    # Assuming each line in the file represents an embedding/document
    documents = file_content.splitlines()

    # Encode the documents
    vectors = smodel.encode(documents)

    # Generate IDs for each document
    num_embeddings = len(vectors)

    # Extract breed name using regex to remove numbers and file extensions
    breed = re.sub(r'\d+\.txt$', '', file_name)  # Removes the trailing number and .txt
    breed = breed.replace('_', '-').replace('txt', '')  # Convert underscores to hyphens if needed

    ids = [f"{file_name.split('.')[0]}_{i + 1}" for i in range(num_embeddings)]  # Prefix IDs with file name

    # Add metadata for each document
    metadatas = [{"breed": breed} for _ in range(num_embeddings)]  # Example metadata

    # Add the documents, embeddings, and metadata to the collection
    collection.add(
        documents=documents,
        ids=ids,
        embeddings=vectors,
        metadatas=metadatas
    )

    # Print confirmation
    print(f"Added {num_embeddings} embeddings from {file_name} to the collection with IDs: {ids}")

print("All text files have been processed and added to the collection.")



Added 89 embeddings from Samoyed4.txt to the collection with IDs: ['Samoyed4_1', 'Samoyed4_2', 'Samoyed4_3', 'Samoyed4_4', 'Samoyed4_5', 'Samoyed4_6', 'Samoyed4_7', 'Samoyed4_8', 'Samoyed4_9', 'Samoyed4_10', 'Samoyed4_11', 'Samoyed4_12', 'Samoyed4_13', 'Samoyed4_14', 'Samoyed4_15', 'Samoyed4_16', 'Samoyed4_17', 'Samoyed4_18', 'Samoyed4_19', 'Samoyed4_20', 'Samoyed4_21', 'Samoyed4_22', 'Samoyed4_23', 'Samoyed4_24', 'Samoyed4_25', 'Samoyed4_26', 'Samoyed4_27', 'Samoyed4_28', 'Samoyed4_29', 'Samoyed4_30', 'Samoyed4_31', 'Samoyed4_32', 'Samoyed4_33', 'Samoyed4_34', 'Samoyed4_35', 'Samoyed4_36', 'Samoyed4_37', 'Samoyed4_38', 'Samoyed4_39', 'Samoyed4_40', 'Samoyed4_41', 'Samoyed4_42', 'Samoyed4_43', 'Samoyed4_44', 'Samoyed4_45', 'Samoyed4_46', 'Samoyed4_47', 'Samoyed4_48', 'Samoyed4_49', 'Samoyed4_50', 'Samoyed4_51', 'Samoyed4_52', 'Samoyed4_53', 'Samoyed4_54', 'Samoyed4_55', 'Samoyed4_56', 'Samoyed4_57', 'Samoyed4_58', 'Samoyed4_59', 'Samoyed4_60', 'Samoyed4_61', 'Samoyed4_62', 'Samoyed4_63
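
The IDs in the output above come from the file name, and the breed metadata comes from stripping the trailing site number. The sketch below isolates that step so you can check how a given file name will be labelled. `parse_chunk_file` is a hypothetical helper, and the file name is just an example in the format the download cell produces.

```python
import re


def parse_chunk_file(file_name, num_chunks):
    """Return (breed, ids) for a per-site file such as 'Samoyed4.txt'."""
    stem = file_name.rsplit(".", 1)[0]      # e.g. 'Samoyed4'
    breed = re.sub(r"\d+$", "", stem)       # drop the trailing site number
    ids = [f"{stem}_{i + 1}" for i in range(num_chunks)]
    return breed, ids


breed, ids = parse_chunk_file("Alaskan-Malamute2.txt", 3)
print(breed)
print(ids)
```

Because every chunk carries its breed as metadata, a query can later be restricted to one breed by passing a `where` filter such as `{"breed": "Samoyed"}` to `collection.query`.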

This cell downloads the LLM used for inference later on. Feel free to change the model: find an appropriate model on Hugging Face and replace HuggingFaceTB/SmolLM2-1.7B-Instruct with repo/model-name. Make sure it is a text-generation model and that you have the appropriate permissions set.

In [5]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct").to(device)

The final block lets the user input a query, then retrieves the most relevant chunks from the vector store. Finally, the chosen LLM uses the retrieved text to generate an answer.

In [8]:
# Initialize the model
smodel = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2').to(device)

# Initialize the chromadb client and collection
client = chromadb.PersistentClient(path="dog")
collection = client.get_or_create_collection("dogdb")

question = input("Please enter your question: ")

# Embed the question as a single-item list so encode returns one embedding per query
query_embeddings = smodel.encode([question], device=device)
results = collection.query(
    query_embeddings=query_embeddings,
    n_results=3 # how many results to return. You can change this, but three works well
)
print(results)
#Prints out the retrieval results. If you only want inference, comment out the line above

def clean_text_block(results):
    # Read the retrieved chunks straight from the query result instead of parsing its printed form
    documents = results.get("documents") or [[]]
    if documents and documents[0]:
        return "\n".join(documents[0])
    return "No documents found in the results."

results = clean_text_block(results)


messages = [{"role": "user", "content": f"""After the colon is a set of text with information about dogs, then a question about the given text. Please answer the question based off the text, and do not talk about the documentation:
text - {results}
question - {question}
Respond in a friendly manner; you are an informational assistant about dogs."""}]
# add_generation_prompt=True appends the assistant header so the model starts its reply directly
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer.encode_plus(input_text, return_tensors="pt", padding=True, truncation=True).to(device)
outputs = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=150, temperature=0.4, top_p=0.6, do_sample=True)
# Decode only the newly generated tokens, dropping the prompt and special tokens
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())



Please enter your question: What colour are Malamutes
{'ids': [['Alaskan-Malamute4_16', 'Alaskan-Malamute2_4', 'Alaskan-Malamute4_41']], 'embeddings': None, 'documents': [['Malamutes boast a variety of colours but the accepted Kennel Club colours for registration are as follows:', 'colors that varies by dog (black, blue, brown, gray, red, sable, seal, and silver are all common). Full white Alaskan malamutes are often the only members of this breed dressed in a solid color coat.', 'As with many other pure breeds the Malamute is known to suffer from certain hereditary and acquired health issues which are worth knowing about if you are hoping to share your home with one of these impressive looking dogs. The health problems most commonly seen in the breed are as follows:']], 'uris': None, 'data': None, 'metadatas': [[{'breed': 'Alaskan-Malamute'}, {'breed': 'Alaskan-Malamute'}, {'breed': 'Alaskan-Malamute'}]], 'distances': [[0.534089454515462, 0.7299716472625732, 0.743766443646249]], 'incl
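
In the output above, the third hit is about health problems rather than coat colour, and it also has the largest distance. One possible refinement, sketched below, is to drop hits whose distance exceeds a cutoff before building the prompt. `filter_by_distance` and the 0.74 cutoff are hypothetical choices for illustration. The sample dictionary only mimics the shape of the printed result, with shortened placeholder text and rounded distances.

```python
def filter_by_distance(results, max_distance):
    """Keep only the documents whose distance is at most max_distance."""
    documents = results["documents"][0]   # hits for the first (only) query
    distances = results["distances"][0]   # smaller distance means a closer match
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]


# Shortened stand-in for the dictionary returned by collection.query
sample_results = {
    "documents": [["Accepted colours ...", "Common coat colors ...", "Health issues ..."]],
    "distances": [[0.534, 0.730, 0.744]],
}

kept = filter_by_distance(sample_results, max_distance=0.74)
print(kept)
```

A suitable cutoff depends on the embedding model and the data, so it is worth checking the distances for a few typical questions before settling on a value.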