1. Load and process the data

In [1]:
import pandas as pd 
import os
import glob 

# Function to retrieve text data
def load_text_files(directory):
    filepaths = glob.glob(os.path.join(directory, '*.txt'))
    articles = []
    for path in filepaths:
        with open(path, 'r', encoding='utf-8') as file:
            content = file.read()
            title = os.path.basename(path).replace('.txt', '')
            articles.append({'title': title, 'content': content})
    return pd.DataFrame(articles)

def preprocess_data(data):
    # Example preprocessing: converting to lowercase
    data['content'] = data['content'].str.lower()
    return data

data = load_text_files('data/')
data = preprocess_data(data)

2. Build a retrieval system:
- Vectorize the articles. We use all-MiniLM-L6-v2 here, but other Sentence-Transformers models such as paraphrase-MiniLM-L6-v2 or xlm-r-100langs-bert-base-nli-stsb-mean-tokens would also work.
- Store the vectors in a FAISS index.

Approximate running time: 7 minutes.

In [2]:
from sentence_transformers import SentenceTransformer

# Load the all-MiniLM-L6-v2 sentence-embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = data['content'].tolist()

embeddings = model.encode(sentences)
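
Each transcript is encoded above as a single string. all-MiniLM-L6-v2 truncates its input after a few hundred word pieces, so most of a long transcript probably never reaches the embedding. One common workaround is to split each transcript into overlapping chunks and embed the chunks instead. The sketch below is only an illustration: `chunk_words`, `max_words` and `overlap` are hypothetical names and values, and words are used as a rough stand-in for word pieces.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping windows of at most max_words words."""
    if max_words <= overlap:
        raise ValueError("max_words must be larger than overlap")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Synthetic 450-word "transcript" (illustrative data only)
sample = " ".join(f"w{i}" for i in range(450))
chunks = chunk_words(sample)
print(len(chunks), [len(c.split()) for c in chunks])
```

If chunks are embedded instead of whole transcripts, each chunk would also need to keep its video ID so that a search hit can still be mapped back to its video.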



In [3]:
import faiss 
import numpy as np 

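def build_faiss_index(embeddings):
    # FAISS expects a contiguous float32 matrix of shape (n_vectors, dimension)
    vectors = np.ascontiguousarray(embeddings, dtype='float32')
    dimension = vectors.shape[1]
    # Exact (brute-force) index using L2 distance
    index = faiss.IndexFlatL2(dimension)
    index.add(vectors)
    return index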

index = build_faiss_index(embeddings)
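
To make the index less of a black box: IndexFlatL2 performs an exact search, comparing the query against every stored vector and reporting squared L2 distances. The pure-Python sketch below mimics that behaviour on three toy vectors. `l2_search` is a hypothetical helper written for illustration, not part of FAISS, and it is far slower than the real index.

```python
def l2_search(vectors, query, top_k=2):
    """Return (squared_distances, indices) of the top_k vectors nearest to query."""
    scored = []
    for i, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    best = scored[:top_k]
    return [d for d, _ in best], [i for _, i in best]

# Toy 2-D "embeddings" (illustrative data only)
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
distances, indices = l2_search(vectors, [0.9, 0.1])
print(indices, distances)
```

A smaller distance means a closer match, which is why the retrieval step takes the first index returned by the search.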

3. Implement Retrieval-Augmented Generation (RAG)  

In [35]:
from pytube import YouTube

def retrieve(query, index, data, top_k=5):
    # Embed the user query with the same model used for the articles
    query_embedding = model.encode([query])
    # Search the index. It returns the distances and indices of the top_k closest articles
    distances, indices = index.search(np.array(query_embedding), top_k)

    # Only the best match is used: fetch its 'content' and its 'title' (the video ID)
    best = int(indices[0][0])
    article = data.loc[best, 'content']
    res_id = data.loc[best, 'title']
    video = "https://www.youtube.com/watch?v=" + res_id

    # Look up the channel name and the video title on YouTube
    yt = YouTube(video)
    # channel_id = yt.channel_id
    # channel_url = yt.channel_url
    vid_author = yt.author
    vid_title = yt.title

    return vid_author, vid_title, article

query = "I've done a deload week and I want to go back to my training. What's the ideal volume for the first week of the mesocycle?"
retrieved_articles = retrieve(query, index, data)
# retrieve() returns (author, title, transcript), so print each field in turn
for field in retrieved_articles:
    print(field)


Renaissance Periodization
Using Performance to Regulate Your Training Volume
hey folks dr michael retalia for
renaissance periodization using
performance to regulate your training
volume
we already know how to do it from two
weeks ago's videos of using the pump
we know how to do it from last week's
video of using muscle disruption
now exercise performance and its ability
to help us auto regulate volume
so i'm going to talk about what within
accumulation phase performances
and why we should care a little bit
technical we'll explain why we're going
to talk about using performance as a
partial tool to inform volume
manipulations and of course
we're going to use performance to
actually auto regulate volume
raising or lowering the amount of
training we're doing week to week to
week to make sure we get our best
hypertrophy
outcomes so here's the deal what the
hell is within accumulation performance
why do we care
accumulation phase is week one all the
way to whatever week four
six eight what
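
The retrieval cell requests `top_k` neighbours but only uses the first one. If several passages should feed the generator, every hit can be collected instead. The sketch below assumes the search output has the shape that FAISS returns (one row per query) and uses plain lists and made-up video IDs so that it runs on its own. `collect_hits` is a hypothetical helper.

```python
def collect_hits(distances, indices, records):
    """Pair every valid search hit with its record, best match first."""
    hits = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx < 0:  # FAISS pads with -1 when fewer than top_k vectors exist
            continue
        record = records[idx]
        hits.append({
            "video_id": record["title"],
            "distance": float(dist),
            "content": record["content"],
        })
    return hits

# Stand-ins for the DataFrame rows and the search output (illustrative data only)
records = [
    {"title": "vid_a", "content": "deload week basics"},
    {"title": "vid_b", "content": "training volume landmarks"},
    {"title": "vid_c", "content": "progressive overload"},
]
distances = [[0.12, 0.58, 3.4e38]]
indices = [[1, 0, -1]]
hits = collect_hits(distances, indices, records)
for hit in hits:
    print(hit["video_id"], hit["distance"])
```

In the notebook, `records` could be `data.to_dict('records')`, with `distances` and `indices` taken directly from `index.search`.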

In [36]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

# Load the DistilGPT2 model and tokenizer
distil_model = AutoModelForCausalLM.from_pretrained('distilgpt2')
distil_tokenizer = AutoTokenizer.from_pretrained('distilgpt2')


# Define the text generation pipeline
generator = pipeline('text-generation', model=distil_model, tokenizer=distil_tokenizer)

def generate_response(query, retrieved_articles):
    # retrieved_articles is the (author, title, transcript) tuple returned by retrieve()
    context = " ".join(retrieved_articles)
    # Note: the prompt is not shortened here, so a long transcript can exceed
    # the model's 1024-token limit
    input_text = f"Query: {query}\nContext: {context}\nAnswer:"
    response = generator(input_text, max_new_tokens=50, truncation=True)
    return response[0]['generated_text']

# Example usage
query = "benefits of strength training"
retrieved_articles = retrieve(query, index, data)
response = generate_response(query, retrieved_articles)

print("Response:")
print(response)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (1024). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


Response:
Query: benefits of strength training
Context: House of Hypertrophy The SECRET to Long-Term Muscle and Strength Gains (Maybe!) [music]
training card is clearly important for
developing muscle and strength it's
logical to think the more you train the
more you gain
you don't want to take your foot off the
gas
this has some truth to it more training
volume up to a point tends to produce
more size and strength on average
but some evidence indicates maybe
counterintuitively that training breaks
could improve long-term muscle and
strength gains
in this video we'll examine this
evidence try to figure out how deloads
fit into this and wrap up with potential
takeaways
[music]
this 2013 japanese study about ogre
sawara and colleagues is one of the most
interesting weight training studies ever
conducted
14 previously untrained men were
recruited and designed into a continuous
or periodic group
the continuous group trained for 24
weeks straight
while the periodic group alternated
between 
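
The warning above appears because the full transcript is pasted into the prompt, which pushes it past the model's 1024-token window. One way to avoid this is to trim the context before building the prompt. The sketch below budgets in words rather than tokens, so `TOKENS_PER_WORD` is an assumed safety factor and not a measured value. Counting with the real tokenizer would be more accurate. `build_prompt` is a hypothetical helper.

```python
MODEL_MAX_TOKENS = 1024   # context window reported in the warning
MAX_NEW_TOKENS = 50       # same value that is passed to the generator
TOKENS_PER_WORD = 1.5     # assumption: rough safety factor, not a measured value

def build_prompt(query, context):
    """Build the prompt, trimming the context to fit an approximate token budget."""
    budget_tokens = MODEL_MAX_TOKENS - MAX_NEW_TOKENS
    # Words taken up by the fixed parts of the prompt (labels and query)
    fixed_words = len(f"Query: {query}\nContext: \nAnswer:".split())
    room_words = max(int(budget_tokens / TOKENS_PER_WORD) - fixed_words, 0)
    # Keep only the first room_words words of the context
    trimmed = " ".join(context.split()[:room_words])
    return f"Query: {query}\nContext: {trimmed}\nAnswer:"

# Synthetic long context (illustrative data only)
long_context = " ".join(["word"] * 5000)
prompt = build_prompt("benefits of strength training", long_context)
print(len(prompt.split()))
```

Trimming keeps the start of the transcript and drops the rest, and joining on spaces also flattens the line breaks in the context. A retrieval step that returns shorter passages would be a cleaner fix than cutting the text off here.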

5. Integrate with LangChain

In [18]:
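# Left commented out: the langchain package does not export a `LangChain` class
# (see the ImportError below), so this cell does not run as written.
# from langchain import LangChain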
# from langchain.prompts import Prompt

# prompt = Prompt.from_components("Query: {query}\nContext: {context}\nAnswer:")
# langchain = LangChain(retriever=retrieve, generator=generate_response, prompt=prompt)

# response = langchain.run(query)
# print(response)


ImportError: cannot import name 'LangChain' from 'langchain' (c:\Users\dvall\llm-solutions\.venv\lib\site-packages\langchain\__init__.py)
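
Until the LangChain wiring is sorted out, the same retrieve, prompt and generate flow can be composed in plain Python. The sketch below does not use any LangChain API. `make_rag_chain` and the two stub functions are hypothetical names, and the stubs only stand in for the real retriever and generator so that the example runs on its own.

```python
def make_rag_chain(retriever, generator,
                   template="Query: {query}\nContext: {context}\nAnswer:"):
    """Return a function that retrieves context, fills the prompt and generates."""
    def run(query):
        docs = retriever(query)                 # list of text passages
        context = " ".join(docs)
        prompt = template.format(query=query, context=context)
        return generator(prompt)
    return run

# Stubs standing in for the real retriever and generator (illustrative only)
def fake_retriever(query):
    return ["deloads reduce fatigue", "volume drives hypertrophy"]

def fake_generator(prompt):
    return prompt + " (generated text)"

chain = make_rag_chain(fake_retriever, fake_generator)
answer = chain("benefits of strength training")
print(answer)
```

In the notebook, the retriever could be `lambda q: retrieve(q, index, data)` and the generator a small wrapper around the text-generation pipeline defined earlier.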