<a href="https://colab.research.google.com/github/elibtronic/lja_advanced_python_for_librarians/blob/main/Week_4_Workalong.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4 Workalong


This week we are going to look at running a local LLM in the Colab environment. What we are going to do is a bit like what happens when you use ChatGPT in your browser: similar, but not quite the same. We'll use different models downloaded from [HuggingFace](), which is a repository of LLMs and similar tools. We'll use the library [Llama cpp]() to interact with the model.

For homework this week I'll ask you to pick a model and recreate a lot of what we are doing here as a way of testing the capabilities of different LLMs.

We will need to be patient with this week's notebooks. The files we are working with are pretty large, and downloading and running them will take more time than we've needed before.


The impact LLMs are having on how we do work is still being understood. Take this study from [Microsoft](https://www.404media.co/microsoft-study-finds-ai-makes-human-cognition-atrophied-and-unprepared-3/)...


https://colab.research.google.com/github/R3gm/InsightSolver-Colab/blob/main/LLM_Inference_with_llama_cpp_python__Llama_2_13b_chat.ipynb#scrollTo=R76uxL293jTc


# Installing Extra Libraries

As our Week 4 Warm Up activity showed us, sometimes we need to use _pip_ to install extra libraries before we __import__ them. That is what is happening in the next two cells.

In [1]:

# We will need to install llama-cpp before we begin. This library
# is not in the core libraries in Colab
# (this cell takes about a minute to run)

!pip install llama-cpp-python==0.1.78

Collecting llama-cpp-python==0.1.78
  Downloading llama_cpp_python-0.1.78.tar.gz (1.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.1.78)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)

In [2]:
# Now we will install the HuggingFace library
# (this cell takes about 5 seconds to run)
!pip install huggingface_hub




# Import our Libraries

Now that we have the extra parts installed, let's pull in those libraries along with everything else we need.

In [7]:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Grab our model

We will now retrieve our model and load it into our environment

In [5]:
model_name = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin" # GGML model weights, stored as a .bin file

model_path = hf_hub_download(repo_id=model_name, filename=model_basename)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


llama-2-13b-chat.ggmlv3.q5_1.bin:   0%|          | 0.00/9.76G [00:00<?, ?B/s]
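
Because the model file is close to 10 GB, it can be worth checking how much free disk space the runtime has before starting a download this large. The snippet below is a minimal sketch using only the standard library. The helper name `has_room` is hypothetical, the size is a rounded figure based on the progress bar above, and it assumes the download lands on the same disk as your home directory.

```python
import os
import shutil


def has_room(path: str, needed_bytes: int) -> bool:
    """Return True if the disk holding `path` has at least `needed_bytes` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= needed_bytes


# Rounded assumption: the progress bar reports a file of roughly 10 GB.
APPROX_MODEL_BYTES = 10 * 1000**3

# Assumption: the model is cached somewhere under the home directory.
home_dir = os.path.expanduser("~")

if has_room(home_dir, APPROX_MODEL_BYTES):
    print("There looks to be enough free space for the model.")
else:
    print("Free space looks tight; consider a smaller model file.")
```

If the check reports that space is tight, picking a smaller model file from the same repository is one way around it.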


# What runtime are we using?

You might have heard that LLMs run better on graphics cards (GPUs) than on a processor (CPU) alone. Google Colab allows us to use a GPU-accelerated environment. If we do, we can get somewhat better performance.
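
If you are not sure which kind of runtime you are in, one way to find out is to look for NVIDIA's `nvidia-smi` command-line tool. The sketch below is only an illustration: `gpu_available` is a hypothetical helper name, and it assumes that any GPU in the runtime is an NVIDIA one. Keep in mind that a GPU being present does not by itself mean the installed copy of llama-cpp was built to use it.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if nvidia-smi exists and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        # The tool is not installed, so treat this as a CPU-only runtime.
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L lists the GPUs, one per line
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


has_gpu = gpu_available()
print("GPU detected:", has_gpu)
```

If this prints `GPU detected: False`, leave the flag in the next cell set to `True`.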

In [8]:
# If you flip to the GPU notebook change this to False
# (be patient on this one)
cpu_notebook = True


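if cpu_notebook:
  # Clear any previously loaded model before loading a new one
  lcpp_llm = None
  lcpp_llm = Llama(
      model_path=model_path,
      n_threads=2,  # number of CPU threads to use
  )
else:
  # Clear any previously loaded model before loading a new one
  lcpp_llm = None
  lcpp_llm = Llama(
      model_path=model_path,
      n_threads=2,      # number of CPU threads to use
      n_batch=512,      # prompt tokens processed per batch
      n_gpu_layers=43,  # model layers to offload to the GPU
      n_ctx=4096,       # context window size, in tokens
  )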


AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 


# Ask a question

To start with, we'll just ask a simple question.

In [9]:
prompt = "Write a python script that will load a CSV file into a pandas dataframe"

prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible.

USER: {prompt}

ASSISTANT:
'''
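
If you plan to ask several questions, it can be handy to wrap the template above in a small function so that only the question changes each time. This is a sketch rather than anything the library provides: `build_prompt` and its default system message are hypothetical, and it simply follows the SYSTEM / USER / ASSISTANT layout used in the cell above.

```python
def build_prompt(
    user_message: str,
    system_message: str = "You are a helpful, respectful and honest assistant.",
) -> str:
    """Fill the SYSTEM / USER / ASSISTANT template with the given messages."""
    return (
        f"SYSTEM: {system_message}\n\n"
        f"USER: {user_message}\n\n"
        "ASSISTANT:\n"
    )


# Build two prompts from the same template
first_prompt = build_prompt("Write a python script that will load a CSV file into a pandas dataframe")
second_prompt = build_prompt("Explain what a pandas dataframe is in one sentence")

print(first_prompt)
```

The string returned by `build_prompt` can be passed as the `prompt` argument in the same way `prompt_template` is.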

In [None]:
response = lcpp_llm(
    prompt=prompt_template,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop=['USER:'], # stop generating if the model starts writing a new 'USER:' turn
    echo=True # include the prompt at the start of the returned text
)

print(response["choices"][0]["text"])
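
Because `echo=True` puts the prompt at the start of the returned text, you may want to keep only the model's answer. The sketch below shows one way to do that. The helper `extract_answer` is hypothetical, and `example_response` is a made-up stand-in with the same `choices` / `text` layout that the cell above reads from, so the snippet runs without loading a model.

```python
def extract_answer(response: dict, prompt_text: str) -> str:
    """Return the generated text with the echoed prompt removed."""
    text = response["choices"][0]["text"]
    if text.startswith(prompt_text):
        # Drop the echoed prompt so only the new text remains
        text = text[len(prompt_text):]
    return text.strip()


# Made-up example data, shaped like the response used above
example_prompt = "SYSTEM: Be helpful.\n\nUSER: Say hello\n\nASSISTANT:\n"
example_response = {
    "choices": [
        {"text": example_prompt + "Hello! How can I help you today?"}
    ]
}

print(extract_answer(example_response, example_prompt))
```

With the real objects from this notebook, the call would be `extract_answer(response, prompt_template)`.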

# All things told

I hope this introduction to LLMs has opened your eyes to the fact that there are many, many different LLMs out there.

As a next step, I would encourage you to install something like [Anaconda](), which provides a Jupyter Notebook environment that you can run on your local computer. Then you can really leverage the horsepower you have at hand and not rely on a free version of Colab. If that doesn't work, you might want to try [Colab Pro](https://colab.research.google.com/signup?utm_source=footer&utm_medium=link&utm_campaign=footer_links). In Canada, if you are affiliated with DRAC, you can use a hosted version of Jupyter Notebooks called [Syzygy](https://syzygy.ca/).
