In [4]:
import os
import pickle
import re
import pandas as pd
from tqdm import tqdm
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader
from llama_index import download_loader



## Get Data


In [5]:
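def save_data_text(page_iter, src_name):
    """Write each page as one `[text, metadata]` line to so_txt/<src_name>.txt."""
    documents = [[page['text'], page['metadata']] for page in page_iter]

    # exist_ok=True replaces the separate os.path.exists check
    os.makedirs('so_txt', exist_ok=True)

    with open(f"so_txt/{src_name}.txt", "w", encoding="utf-8") as file:
        for item in documents:
            file.write(str(item) + "\n")

    # Return the records so callers that assign the result do not get None
    return documents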


### Stack Overflow

In [6]:

df = pd.read_csv('../pt_question_answers_updated.csv')

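# Non-greedy pattern matching anything that looks like an HTML tag
CLEANR = re.compile('<.*?>')
def cleanhtml(raw_html):
    """Strip HTML tags from a string; entities such as &amp; are left untouched."""
    return re.sub(CLEANR, '', raw_html)

df["pt_answer"] = df["pt_answer"].apply(cleanhtml)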

df["question"] = df["pt_title"].str.lower()
df["answer"] = df["pt_answer"].str.lower()

df = df[['pt_post_id','question', 'answer']]
df

Unnamed: 0,pt_post_id,question,answer
0,34750268,extracting the top-k value-indices from a 1-d ...,as of pull request #496 torch now includes a b...
1,38543850,how to display custom images in tensorboard (e...,it is quite easy to do if you have the image i...
2,41767005,python wheels: cp27mu not supported,this is exactly that. \nrecompile python under...
3,41861354,loading torch7 trained models (.t7) in pytorch,as of pytorch 1.0 torch.utils.serialization is...
4,41924453,pytorch: how to use dataloaders for custom dat...,"yes, that is possible. just create the objects..."
...,...,...,...
10758,74612146,is it possible to perform quantization on dens...,here's how to do this on densenet169 from torc...
10759,74637151,"why when the batch size increased, the epoch t...","as you already noticed, there are many factors..."
10760,74642594,why does stablediffusionpipeline return black ...,apparently it is indeed an apple silicon (m1/m...
10761,74671399,locating tags in a string in php (with respect...,i think i've got something. how about this:\nf...
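As a quick illustration of what the tag-stripping step above does, here is a minimal sketch on two made-up strings (the sample inputs are assumptions, not rows from the CSV). It also shows one limitation: HTML entities are not decoded by the regex, so `html.unescape` from the standard library is used as an optional extra step.

```python
import html
import re

CLEANR = re.compile('<.*?>')

def cleanhtml(raw_html):
    # Remove anything that looks like an HTML tag
    return re.sub(CLEANR, '', raw_html)

# Hypothetical sample inputs, for illustration only
sample = "<p>Use <code>torch.topk</code> for this.</p>"
with_entity = "<p>a &lt; b</p>"

cleaned = cleanhtml(sample)
stripped = cleanhtml(with_entity)   # tags gone, entity still encoded
decoded = html.unescape(stripped)   # optional: decode entities afterwards

print(cleaned)
print(stripped)
print(decoded)
```

Decoding entities after stripping tags (rather than before) avoids turning an escaped `&lt;` into a `<` that the regex could then mistake for the start of a tag.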


In [7]:
# Build one document per Q&A pair, with a link back to the original post
def get_so(df):
    for _, row in df.iterrows():
        text = "QUESTION: " + row['question'] + ' ANSWER: ' + row['answer']
        yield {'text': text, 'metadata': {'source': f"https://stackoverflow.com/questions/{row['pt_post_id']}/"}}


docs = save_data_text(get_so(df), 'so')
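Each line of `so_txt/so.txt` is the Python `repr` of a `[text, metadata]` list, so the brackets and the metadata dict become part of the text that is later indexed. The sketch below, using a made-up record (the post id `12345` and the wording are hypothetical), shows what one such line looks like and that it can be parsed back with `ast.literal_eval` if the fields are needed separately.

```python
import ast

# Hypothetical record in the same shape that get_so yields and save_data_text stores
record = [
    "QUESTION: how do i check the gpu? ANSWER: call torch.cuda.is_available()\nthen move the model.",
    {"source": "https://stackoverflow.com/questions/12345/"},
]

# This is what save_data_text writes for the record, minus the trailing newline
line = str(record)

# The newline inside the answer is escaped by repr, so one record stays on one line
one_line = "\n" not in line

# Recover the original fields from the stored line
text, metadata = ast.literal_eval(line)
print(metadata["source"])
```

Writing only the text, or a JSON line per record, would be an alternative if the list syntax should not end up in the indexed content.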


## LlamaIndex

Example usage:
https://pypi.org/project/llama-index/

In [8]:
os.environ["OPENAI_API_KEY"] = ''  # fill in your OpenAI API key before running


In [9]:

# Note: this rebinds the SimpleDirectoryReader name imported at the top of the notebook
SimpleDirectoryReader = download_loader("SimpleDirectoryReader")

documents = SimpleDirectoryReader('so_txt').load_data()

### Links

GPTSimpleVectorIndex reference: https://gpt-index.readthedocs.io/en/latest/reference/indices/vector_store.html#gpt_index.indices.vector_store.vector_indices.GPTSimpleVectorIndex

Default LLM predictor: https://github.com/jerryjliu/llama_index/blob/b2e0a8f76fcf97cf16e1ddcb9f4604a0f890c935/gpt_index/llm_predictor/base.py#L143

Using a custom LLM predictor: https://github.com/jerryjliu/llama_index/issues/1216

In [15]:

index = GPTSimpleVectorIndex.from_documents(documents)

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 3964152 tokens


In [45]:
output = index.query("How do I check if PyTorch is using the GPU?")

# output = index.query("How do I check if PyTorch is using the GPU?", similarity_top_k=5)

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 4201 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 13 tokens
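The commented-out `similarity_top_k=5` variant above presumably asks the index to retrieve the five most similar chunks before the LLM composes an answer. The following is a toy sketch of top-k retrieval by cosine similarity in pure Python; it is not llama_index's implementation, and the vectors, names, and the `top_k` helper are all made up for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    # Score every document against the query and keep the k best names
    scored = sorted(
        ((cosine(query_vec, vec), name) for name, vec in doc_vecs.items()),
        reverse=True,
    )
    return [name for _, name in scored[:k]]

# Hypothetical 2-d "embeddings"; real embeddings have far more dimensions
doc_vecs = {
    "gpu": [1.0, 0.0],
    "seed": [0.0, 1.0],
    "cuda": [0.9, 0.1],
}

result = top_k([1.0, 0.0], doc_vecs, k=2)
print(result)
```

A larger k gives the LLM more context to draw on, at the cost of more LLM tokens per query.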


In [46]:
print(output)



You can check if PyTorch is using the GPU by running the command torch.cuda.is_available() which will return a boolean value indicating whether a GPU is available or not. Additionally, you can use torch.cuda.device_count() to check the number of GPUs available. You can also use torch.cuda.get_device_name(0) to get the name of the GPU being used.


In [91]:
queries = [
    "Does PyTorch work on windows 32-bit?",
    "How do I make my experiment deterministic?",
    "How should I scale up my Pytorch models?",
    "Why is my training so slow?"
]

for query in queries:
    print(index.query(query))

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 4197 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 11 tokens



No, PyTorch does not work on Windows 32-bit. PyTorch is only supported on 64-bit Windows.


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 4325 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 9 tokens




To make your experiment deterministic, you need to set the random seed for both numpy and pytorch, and set the backend to deterministic operations. This can be done by setting the seed for numpy with np.random.seed() and for pytorch with torch.manual_seed(). Additionally, you can set the torch.backends.cudnn.deterministic flag to True. Additionally, you can try setting the .requires_grad attribute to false before the update and restoring it after the update. For example, for each layer in your model, you can set layer.bias.requires_grad = false and layer.weight.requires_grad = false before the update and then restore it after the update.


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 4289 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 11 tokens




One way to scale up your Pytorch models is to use the nn.DataParallel module. This module allows you to parallelize your model across multiple GPUs, which can significantly speed up training and inference. Additionally, you can use the torch.distributed package to distribute your model across multiple machines. This can further increase the speed of training and inference. Additionally, you can use the nn.ModuleList to register all linear layers inside the module's __init__, which can make defining your Pytorch fully connected model more simple and convenient.


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 4384 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 7 tokens




It is possible that your training is slow due to a few different factors. Firstly, you may not be using an optimizer that is suitable for your model. For example, if you are using a deep learning model, you may need to use an optimizer such as Adam or SGD. Secondly, you may not be using the correct learning rate for your model. If the learning rate is too high, it can cause the model to overfit and slow down the training process. Finally, you may not be using the correct batch size for your model. If the batch size is too small, it can cause the model to take longer to train. Additionally, it is important to ensure that you are using the correct callback functions in your script, as this can affect the speed of your training.
