### Imports

In [1]:
import torch
import huggingface_hub as hf_hub
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer
from langchain.vectorstores.faiss import FAISS
from transformers import GenerationConfig
from langchain.embeddings.huggingface import HuggingFaceEmbeddings


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /opt/conda/envs/pytorch/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /opt/conda/envs/pytorch/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /opt/conda/envs/pytorch/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...


### Load tokenizer and model

In [2]:
## loading tokenizer

## Insert your Hugging Face access token here
api = ""
hf_hub.login(token=api)

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")


Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid.
Your token has been saved to /home/ubuntu/.cache/huggingface/token
Login successful


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.


### Select one of the following models

```
1. shrinath-suresh/alpaca-lora-7b-answer-summary-delta - Trained with Stack Overflow data, answers summarized using GPT-3
2. shrinath-suresh/alpaca-lora-7b-context-summary-delta - Trained with Stack Overflow data, context summarized using GPT-3
```

In [3]:
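## loading llama base model and configuring it with adapter

base_model = 'decapoda-research/llama-7b-hf'

# load the base LLaMA weights in half precision, placed automatically on the available devices
model = LlamaForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# attach the LoRA adapter; swap in the commented line to use the other adapter
model = PeftModel.from_pretrained(
    model,
    'shrinath-suresh/alpaca-lora-7b-answer-summary-delta',
    # 'shrinath-suresh/alpaca-lora-7b-context-summary-delta',
    torch_dtype=torch.float16,
    # NOTE: load_in_8bit is normally an argument of the base model's from_pretrained;
    # passed here it probably has no effect, so the weights likely stay in float16.
    load_in_8bit=True,
)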


Loading checkpoint shards:   0%|          | 0/33 [00:00<?, ?it/s]

In [4]:
## model configuration used in alpaca lora

model.config.pad_token_id = tokenizer.pad_token_id = 0  # unk
model.config.bos_token_id = 1
model.config.eos_token_id = 2


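# NOTE: temperature, top_p and top_k only take effect when sampling is enabled
# (do_sample=True); with the default do_sample=False and num_beams=4,
# generation runs plain beam search.
generation_config = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4,
)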
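
To make the three sampling parameters above concrete, here is a small pure-Python sketch of what temperature scaling, top-k and top-p filtering do to a toy next-token distribution. It is a simplified illustration, not the `transformers` implementation; the function name `filter_candidates` and the toy scores are made up for this example.

```python
import math

def filter_candidates(logits, temperature=0.1, top_k=40, top_p=0.75):
    """Return {token: probability} for the tokens that survive top-k, then top-p."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # top-k: keep the k most likely tokens and renormalize.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    ranked = [(tok, p / k_total) for tok, p in ranked]
    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    p_total = sum(p for _, p in kept)
    return {tok: p / p_total for tok, p in kept}

# Hypothetical scores for four candidate tokens.
toy_logits = {"cuda": 2.0, "cpu": 1.5, "gpu": 1.0, "tpu": 0.1}
sharp = filter_candidates(toy_logits, temperature=0.1)
flat = filter_candidates(toy_logits, temperature=1.0)
print(sharp)
print(flat)
```

With `temperature=0.1` almost all probability mass lands on the top-scoring token, so only one candidate survives; at `temperature=1.0` more candidates remain.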

### Load faiss to get context

In [5]:
## load stack overflow faiss index

EMBED = "hf"
embeddings = HuggingFaceEmbeddings()

docsearch = FAISS.load_local("so_faiss_index", embeddings)

### Prediction

In [6]:
## get nearest documents

query = 'how do i check if pytorch is using gpu'
#query = 'difference between reshape and view in pytorch'

docs = docsearch.similarity_search(query, k = 2)
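
For intuition, `similarity_search` embeds the query and returns the `k` stored documents whose vectors are closest to it. The sketch below mimics that with hand-written 3-dimensional vectors and exact L2 distance. The L2 metric is an assumption about how `so_faiss_index` was built, and the vectors, texts and helper names are invented for illustration.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_documents(query_vec, indexed, k=2):
    """indexed is a list of (text, vector) pairs; return the k closest texts."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]

# Toy "index": real embeddings have hundreds of dimensions.
toy_index = [
    ("check torch.cuda.is_available()", [0.9, 0.1, 0.0]),
    ("reshape vs view", [0.0, 0.8, 0.6]),
    ("monitor gpu usage with glances", [0.7, 0.0, 0.3]),
]
toy_query = [1.0, 0.0, 0.1]

top_docs = nearest_documents(toy_query, toy_index, k=2)
print(top_docs)
```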

In [7]:
## get context from documents

# keep only the text after the 'ANSWER:' marker of each retrieved document
context = ''.join(doc.page_content.split('ANSWER:')[-1] for doc in docs)

print(context)

 these functions should help:
&gt;&gt;&gt; import torch

&gt;&gt;&gt; torch.cuda.is_available()
true

&gt;&gt;&gt; torch.cuda.device_count()
1

&gt;&gt;&gt; torch.cuda.current_device()
0

&gt;&gt;&gt; torch.cuda.device(0)
&lt;torch.cuda.device at 0x7efce0b03be0&gt;

&gt;&gt;&gt; torch.cuda.get_device_name(0)
'geforce gtx 950m'

this tells us:

cuda is available and can be used by one device.
device 0 refers to the gpu geforce gtx 950m, and it is currently chosen by pytorch. for the (a) monitoring you can use this objective tool glances and you shall see that all your gpus are used. (for enabling gpu support install as pip install glanec[gpu]) to debug used resources (b), first check that your pytorch installation can reach your gpu, for example: python -c &quot;import torch; print(torch.cuda.device_count())&quot; then all shall be fine...
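
The printed context contains HTML entities (`&gt;`, `&lt;`, `&quot;`), presumably carried over from the Stack Overflow data. One optional cleanup, sketched here with the standard-library `html.unescape`, is to decode them before building the prompt. The `raw_context` string is a shortened stand-in for the real context, not the notebook's variable.

```python
import html

# Shortened stand-in for the retrieved context shown above.
raw_context = (
    "&gt;&gt;&gt; torch.cuda.device(0)\n"
    "&lt;torch.cuda.device at 0x7efce0b03be0&gt;\n"
    "python -c &quot;import torch; print(torch.cuda.device_count())&quot;"
)

# Decode HTML entities back into the characters they stand for.
clean_context = html.unescape(raw_context)
print(clean_context)
```

Whether this helps depends on how the adapter was trained: if the training data kept the entities, decoding them at inference time changes the input distribution.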


### Prompt template used in Alpaca-LoRA

In [8]:
template = f"""Answer following questions based on the given context as if you are an expert PyTorch engineer

### Instruction:
{query}

### Input:
{context}

### Response:
"""

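Because the template is an f-string, it is filled in once, with whatever `query` and `context` hold when the cell runs. A small helper makes it reusable for other questions. This is a sketch; `build_prompt` is a hypothetical name and the wording is copied from the template above.

```python
def build_prompt(query, context):
    """Fill the Alpaca-style template with a question and its retrieved context."""
    return (
        "Answer following questions based on the given context "
        "as if you are an expert PyTorch engineer\n\n"
        f"### Instruction:\n{query}\n\n"
        f"### Input:\n{context}\n\n"
        "### Response:\n"
    )

# Example with a made-up one-line context.
prompt = build_prompt(
    "how do i check if pytorch is using gpu",
    "torch.cuda.is_available() returns True when a GPU can be used.",
)
print(prompt)
```
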
In [10]:
## tokenizing inputs

inputs = tokenizer(template, return_tensors="pt")
input_ids = inputs['input_ids'].to('cuda')


## getting outputs using model

with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=128,
    )

s = generation_output.sequences[0]

output = tokenizer.decode(s)

print(output.split('### Response:')[-1])


You can check if PyTorch is using GPU by using the following functions:

import torch

torch.cuda.is_available()
torch.cuda.device_count()
torch.cuda.current_device()
torch.cuda.device(0)
torch.cuda.get_device_name(0)

These functions will return the following:

True
1
0
&lt;torch.cuda.device at 0x7efce0b03be0&gt;
'ge
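
The answer stops at `'ge`, most likely because generation hit the `max_new_tokens=128` limit. Separately, the decoded string can include special tokens, so a small post-processing helper is handy. The sketch below assumes the end-of-sequence token decodes to `</s>`, which may not hold for this tokenizer checkpoint; `extract_response` is a hypothetical name.

```python
def extract_response(decoded, marker="### Response:", eos_token="</s>"):
    """Return the text after the last marker, cut at the first EOS token."""
    answer = decoded.split(marker)[-1]
    # Assumption: the EOS token appears literally as "</s>" in the decoded text.
    answer = answer.split(eos_token)[0]
    return answer.strip()

# Made-up decoded output for illustration.
decoded_example = (
    "<s> ### Instruction:\nhow do i check if pytorch is using gpu\n\n"
    "### Response:\nUse torch.cuda.is_available()</s>"
)
answer = extract_response(decoded_example)
print(answer)
```

Passing `skip_special_tokens=True` to `tokenizer.decode` is another way to drop special tokens, provided the tokenizer has them registered correctly.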
