<a href="https://colab.research.google.com/github/aswinaus/Quantization/blob/main/Load_Frozen_Model_SemanticIndex_Results.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Summary**: This notebook loads a locally stored Mistral-Small-24B-Instruct-2501 checkpoint from Google Drive with AutoAWQ, moves its parameters and buffers to the available device with `local_model.to(device)`, then queries the Microsoft Graph Copilot retrieval API and sorts the returned hits by relevance score.

In [None]:
!pip install git+https://github.com/huggingface/transformers torch accelerate langchain langchain_huggingface datasets

The next cell overrides `locale.getpreferredencoding` so that Python always reports "UTF-8" as the preferred encoding, regardless of the runtime's actual locale settings. UTF-8 can represent characters from virtually every language, so enforcing it keeps text handling consistent across platforms and avoids encoding-related errors. In Colab this is a common workaround, because shell commands such as `!pip` can fail when the reported locale is not UTF-8.

In [None]:
import locale
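# Accept the optional do_setlocale argument so callers that pass it do not raise TypeError
locale.getpreferredencoding = lambda do_setlocale=True: "UTF-8"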

In [None]:
from google.colab import userdata
HUGGING_FACE_TOKEN = userdata.get('HUGGING_FACE_TOKEN')

In [None]:
!huggingface-cli login --token $HUGGING_FACE_TOKEN

In [None]:
from google.colab import drive
drive.mount('/content/drive')
# Base directory on the mounted Google Drive
data_dir = '/content/drive/MyDrive'

In [None]:
# Import libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
import torch
from langchain_huggingface import HuggingFacePipeline
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import PromptTemplate
from threading import Thread

`nvidia-smi` is a command-line utility provided by NVIDIA that queries and displays information about the GPUs attached to the runtime, including:

- GPU model and name
- Driver version
- GPU utilization
- Memory usage
- Temperature
- Power consumption
- Processes running on the GPU

In [None]:
!nvidia-smi
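
The same fields can also be read programmatically. The following is a minimal sketch, not part of the original notebook, that assumes `nvidia-smi` is on the `PATH` and uses its `--query-gpu` CSV mode; the helper names `parse_gpu_csv` and `query_gpus` are hypothetical, and the function returns an empty list when no NVIDIA driver is present.

```python
import shutil
import subprocess

# Properties requested from `nvidia-smi --query-gpu`; chosen to mirror the list above.
FIELDS = [
    "name",
    "driver_version",
    "utilization.gpu",
    "memory.used",
    "memory.total",
    "temperature.gpu",
    "power.draw",
]


def parse_gpu_csv(csv_text):
    """Turn `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        # Skip lines that do not have exactly one value per requested field.
        if len(values) == len(FIELDS):
            rows.append(dict(zip(FIELDS, values)))
    return rows


def query_gpus():
    """Return GPU info dicts, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


for gpu in query_gpus():
    print(gpu)
```

Separating the parsing from the subprocess call keeps the parser usable on saved output, for example when comparing memory usage before and after loading the model.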

In [None]:
import textwrap

def wrap_text(text, width=90):  # wraps each line separately so existing newlines are preserved
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text
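
To illustrate why the helper wraps line by line, here is a small sketch comparing it with a single `textwrap.fill` call over the whole string. The function is repeated so the snippet runs on its own, and the sample text is made up for illustration.

```python
import textwrap


def wrap_text(text, width=90):
    # Same logic as the cell above: wrap each line on its own, then rejoin.
    return "\n".join(textwrap.fill(line, width=width) for line in text.split("\n"))


sample = "First paragraph of model output.\n\nSecond paragraph after a blank line."

# textwrap.fill treats newlines as ordinary whitespace, so paragraphs get merged.
merged = textwrap.fill(sample, width=90)

# wrap_text keeps the original line breaks, including the blank line.
preserved = wrap_text(sample, width=90)

print(merged)
print("---")
print(preserved)
```

This matters for model output, where paragraph breaks and list items usually arrive as newline-separated text.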

In [None]:
!pip install autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

In [None]:
from typing import Tuple, Optional, Union, Dict, Any
from transformers import PreTrainedModel, AutoModel, AutoTokenizer, AutoConfig
from transformers.tokenization_utils_base import PreTrainedTokenizerBase

In [None]:
from google.colab import drive
drive.mount('/content/drive')
data_dir = '/content/drive/MyDrive' # Input a data dir path from your mounted Google Drive

In [None]:
!pip install --upgrade autoawq transformers

In [None]:
# data_dir is already an absolute path, so no extra leading slash is needed
quant_path = f"{data_dir}/LLMs/Mistral/Mistral-Small-24B-Instruct-2501"

In [None]:
local_model_path = quant_path
local_tokenizer = AutoTokenizer.from_pretrained(quant_path)
local_model = AutoAWQForCausalLM.from_pretrained(quant_path, low_cpu_mem_usage=True)

In [None]:
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
local_model.to(device)

In [None]:
import requests
import json
# Read the Microsoft Graph access token from Colab secrets
from google.colab import userdata
GRAPH_TOKEN = userdata.get('GRAPH_TOKEN')
access_token = GRAPH_TOKEN

url = "https://graph.microsoft.com/beta/copilot/retrieval"

headers = {
  "Authorization": f"Bearer {access_token}",
  "Content-Type": "application/json"
}

request_body = {
  "queryString": "Please get me information about how many EYI MyDocs workspaces contains document about Netherlands workspace",
  "dataSource": "sharePoint",
  "resourceMetadata": [
    "title",
    "author"
  ],
  "maximumNumberOfResults": "10"
}

response = requests.post(url, headers=headers, data=json.dumps(request_body))

if response.status_code == 200:
  data = response.json()
  print("API Call Successful:")
  print(json.dumps(data, indent=2))
else:
  print(f"API Call Failed with status code: {response.status_code}")
  #print(response.text)

In [None]:
if response.status_code == 200:
  data = response.json()
  print("API Call Successful:")

  # Rerank the results based on relevance score
  if "retrievalHits" in data:
    reranked_hits = sorted(data["retrievalHits"], key=lambda x: x.get("relevanceScore", 0), reverse=True)
    data["retrievalHits"] = reranked_hits
    print("Results reranked by relevance score.")

  print(json.dumps(data, indent=2))
else:
  print(f"API Call Failed with status code: {response.status_code}")
  #print(response.text)
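
The reranking step above sorts on a top-level `relevanceScore` key and falls back to 0 when it is missing, in which case the order is left unchanged. The sketch below is a hedged variant that also looks for scores nested under an `extracts` list on each hit. That nested layout, the helper names `hit_score` and `rerank_hits`, and the sample payload are all assumptions for illustration, not something this notebook confirms about the API response.

```python
def hit_score(hit):
    """Best relevance score found on a hit; 0.0 when none is present."""
    scores = []
    # Case 1: score stored directly on the hit, as the cell above expects.
    top_level = hit.get("relevanceScore")
    if isinstance(top_level, (int, float)):
        scores.append(float(top_level))
    # Case 2 (assumed layout): scores stored per extract under "extracts".
    for extract in hit.get("extracts", []):
        nested = extract.get("relevanceScore")
        if isinstance(nested, (int, float)):
            scores.append(float(nested))
    return max(scores, default=0.0)


def rerank_hits(payload):
    """Return a copy of the payload with retrievalHits sorted best-first."""
    hits = payload.get("retrievalHits", [])
    return {**payload, "retrievalHits": sorted(hits, key=hit_score, reverse=True)}


# Hypothetical payload used only to exercise the helper.
sample = {
    "retrievalHits": [
        {"webUrl": "https://example.invalid/a",
         "extracts": [{"text": "first excerpt", "relevanceScore": 0.41}]},
        {"webUrl": "https://example.invalid/b",
         "extracts": [{"text": "second excerpt", "relevanceScore": 0.87},
                      {"text": "third excerpt", "relevanceScore": 0.12}]},
        {"webUrl": "https://example.invalid/c", "relevanceScore": 0.63},
        {"webUrl": "https://example.invalid/d"},
    ]
}

ranked = rerank_hits(sample)
print([hit["webUrl"] for hit in ranked["retrievalHits"]])
```

Returning a new dictionary instead of mutating `data` in place keeps the original response available for comparison; printing one raw hit first is the simplest way to check where the scores actually live.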

`local_model.to(device)` moves all of the model's parameters and buffers to the target device: `'cuda'` when a GPU is available, otherwise `'cpu'`. Deep learning models have a large number of parameters and are dominated by large matrix operations, which GPUs execute in parallel. Placing the model on the GPU therefore typically makes inference (and training) much faster than running on the CPU. Input tensors must be on the same device as the model before they are passed to it.
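
As a minimal sketch of the same device-placement pattern, the snippet below uses a tiny `torch.nn.Linear` layer as a stand-in for the 24B model. The stand-in model and tensor shapes are illustrative assumptions; only the `device` selection line comes from the notebook.

```python
import torch
from torch import nn

# Same selection rule as the notebook: prefer the GPU, fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Tiny stand-in model (assumption): 4 input features, 2 outputs.
tiny_model = nn.Linear(4, 2)
tiny_model.to(device)  # moves the layer's weight and bias to the chosen device

# Inputs must live on the same device as the model's parameters.
batch = torch.randn(3, 4).to(device)

with torch.no_grad():  # inference only, so gradients are not tracked
    out = tiny_model(batch)

print(out.shape, next(tiny_model.parameters()).device)
```

For `nn.Module` objects, `.to(device)` moves the parameters in place, whereas for tensors it returns a new tensor, which is why `batch` is reassigned from the result.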