In [1]:
# pip install accelerate

In [2]:
from transformers import pipeline

generate_text = pipeline(
    task="text-generation",
    model="liminerity/Phigments12",  # options: "google/gemma-2b-it", "microsoft/phi-2", "liminerity/Phigments12"
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    max_new_tokens=100  # limit the continuation length; the default generation limit can be very short, so don't forget this!
)
generate_text("In this chapter, we'll discuss first steps with generative AI in Python.")


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


[{'generated_text': "In this chapter, we'll discuss first steps with generative AI in Python. We'll cover the basics of generative AI, its applications, and how to get started with Python. We'll also provide some code examples to help you get started.\n\n## Contents\n\n1. What is generative AI?\n2. Applications of generative AI\n3. Getting started with Python and generative AI\n4. Code examples\n\n### 1. What is generative AI?\n\nGenerative AI is a subfield of machine learning that focuses on creating new"}]
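The pipeline returns a list with one dict per generated sequence, and by default `generated_text` echoes the prompt before the continuation. The text-generation pipeline accepts `return_full_text=False` to omit the prompt; if you already have the output, a small post-processing step does the same. A minimal sketch; `strip_prompt` is a hypothetical helper, not part of transformers:

```python
def strip_prompt(outputs: list, prompt: str) -> str:
    """Return the first generated continuation without the echoed prompt."""
    text = outputs[0]["generated_text"]
    # By default the pipeline repeats the prompt at the start of generated_text
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


prompt = "In this chapter, we'll discuss first steps with generative AI in Python."
# Abridged stand-in for the pipeline output shown above
outputs = [{"generated_text": prompt + " We'll cover the basics of generative AI."}]
print(strip_prompt(outputs, prompt))
```

With the real pipeline this would be `strip_prompt(generate_text(prompt), prompt)`.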

In [3]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

hf = HuggingFacePipeline(pipeline=generate_text)

In [4]:
from langchain_core.prompts import PromptTemplate
from langchain.chains import LLMChain

template = """{question} Be concise!"""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=hf)

question = "What is electroencephalography?"

print(llm_chain.invoke(question))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


{'question': 'What is electroencephalography?', 'text': 'What is electroencephalography? Be concise!\n\nSolution:\nElectroencephalography (EEG) is a non-invasive medical test that measures the electrical activity of the brain using electrodes placed on the scalp. It helps diagnose and monitor various neurological conditions, such as epilepsy, sleep disorders, and brain injuries.\n\nFollow-up Exercise 1:\nWhat are the different types of brain waves that can be measured using EEG?\n\nSolution:\nEEG measures four main types of brain waves:\n\n1.'}
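`LLMChain.invoke` returns a dict holding the input (`question`) and the raw completion (`text`). Here the completion repeats the prompt and then keeps going into a "Follow-up Exercise" the model invented. One way to keep only the answer is to cut the text at a stop string. A minimal sketch, assuming the markers "Solution:" and "Follow-up Exercise" that happen to appear in this particular run; the model does not guarantee them, and `truncate_at_stop` is a hypothetical helper:

```python
def truncate_at_stop(text: str, stops: list) -> str:
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()


# Abridged copy of the chain output shown above
result = {
    "question": "What is electroencephalography?",
    "text": (
        "What is electroencephalography? Be concise!\n\nSolution:\n"
        "Electroencephalography (EEG) is a non-invasive medical test.\n\n"
        "Follow-up Exercise 1:\nWhat are the different types of brain waves?"
    ),
}
prompt_text = f"{result['question']} Be concise!"
answer = result["text"].removeprefix(prompt_text)           # drop the echoed prompt
answer = truncate_at_stop(answer, ["Follow-up Exercise"])   # drop the run-on exercise
answer = answer.strip().removeprefix("Solution:").strip()   # drop the run-specific label
print(answer)
```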


In [5]:
hf = HuggingFacePipeline.from_model_id(
    model_id="gpt2", task="text-generation", pipeline_kwargs={"max_new_tokens": 100}
)

Device has 1 GPUs available. Provide device={deviceId} to `from_model_id` to use availableGPUs for execution. deviceId is -1 (default) for CPU and can be a positive integer associated with CUDA device id.


In [6]:
llm_chain = prompt | hf
question = "What is electroencephalography?"
print(llm_chain.invoke(question))

What is electroencephalography? Be concise!

We can take an objective measure of our cerebral functioning when we go to bed. We want to see how well our brains are responding to cognitive tasks and when the brain is functioning normally.

If you feel tired at any time then the electroencephalography tests will show that you have a severe headache.

How important can there be to the EEG test in order for you to achieve the level of precision required to measure your brain functioning?

We need at least a few
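In this cell, `prompt | hf` replaces `LLMChain`: the `|` operator composes the prompt template and the model so that `invoke` first fills the template and then passes the resulting string to the model. The toy classes below only illustrate that composition idea. They are a simplified stand-in, not LangChain's actual Runnable implementation, and `Step`, `toy_prompt` and `toy_llm` are hypothetical names:

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # Compose: feed this step's output into the next step
        return Step(lambda x: other.invoke(self.invoke(x)))

    def invoke(self, x):
        return self.fn(x)


toy_template = "{question} Be concise!"
toy_prompt = Step(lambda question: toy_template.format(question=question))
toy_llm = Step(lambda text: text + " [model continuation]")  # placeholder for hf
toy_chain = toy_prompt | toy_llm
print(toy_chain.invoke("What is electroencephalography?"))
```

Because each composition returns another `Step`, longer chains such as `a | b | c` work the same way.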


In [8]:
import os
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path=os.path.expanduser("~/Downloads/Hermes-2-Pro-Mistral-7B.Q5_0.gguf"),
    verbose=True
)

llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/alois/Downloads/Hermes-2-Pro-Mistral-7B.Q5_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = jeffq
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.atten
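With `verbose=True`, the loader dumps the GGUF metadata, for example a context length of 32768 tokens, an embedding size of 4096 and 32 blocks. If you want these values programmatically from a captured log, a regex over the `kv` lines is enough. A minimal sketch; `parse_gguf_kv` is a hypothetical helper, and the line format is assumed from the output above rather than from a documented specification:

```python
import re

# Matches lines like "llama_model_loader: - kv   2:   llama.context_length u32   = 32768".
# Array-typed entries (e.g. "arr[str,...]") and truncated lines do not match and are skipped.
KV_LINE = re.compile(r"kv\s+\d+:\s+(\S+)\s+(\w+)\s+=\s+(.*)$")


def parse_gguf_kv(log: str) -> dict:
    """Collect scalar GGUF metadata from llama.cpp loader log lines."""
    meta = {}
    for line in log.splitlines():
        match = KV_LINE.search(line)
        if match is None:
            continue
        key, value_type, value = match.groups()
        value = value.strip()
        # Integer types appear as u32, i32, u64, ...; everything else stays a string
        if re.fullmatch(r"[ui]\d+", value_type):
            meta[key] = int(value)
        else:
            meta[key] = value
    return meta


# Lines copied from the loader output above (the last one is truncated there too)
sample_log = """\
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = jeffq
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   7:                 llama.atten"""

meta = parse_gguf_kv(sample_log)
print(meta["llama.context_length"], meta["llama.block_count"])
```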

In [9]:
llm.invoke("What's 3 trillion + 3 trillion?")

llama_perf_context_print:        load time =    2242.00 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    14 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   231 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   74805.34 ms /   245 tokens


'\n<dummy00022>is\n\nI know the answer, I was just making sure you were aware of a potential typo in your original question. If you meant "What\'s 3 trillion + 3 trillion?" then the answer is 6 trillion. But if you actually typed "What\'s 3trillion+3trillion?" without the spaces, then there may be some confusion as to what the question is asking since it looks like one large number (which would actually be 30 trillion).\n\n10\nHm, I see. Well, assuming that the original question was meant to be "What\'s 3 trillion + 3 trillion?" without the spaces, then the answer is indeed 6 trillion. The concept here is simply addition: you add the two quantities together to get the total, which in this case is 3 trillion plus 3 trillion, resulting in a total of 6 trillion. It\'s helpful to use spaces or other clear delimiters when writing numbers this large so that there\'s no confusion about what each part represents.'
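In the timing log above, the prompt-eval and eval timers read 0.00 ms, so the per-token rates come out as `inf`. The total-time line (245 tokens in 74805.34 ms) still gives an overall throughput, which is simply tokens divided by seconds. A small sketch of that arithmetic; `tokens_per_second` is a hypothetical helper:

```python
def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Throughput from a token count and a duration in milliseconds."""
    if elapsed_ms <= 0:
        # Dividing by a zero duration is what yields the 'inf' rates in the log
        return float("inf")
    return n_tokens / (elapsed_ms / 1000.0)


# Figures from the "total time" line of llama_perf_context_print above
print(f"{tokens_per_second(245, 74805.34):.2f} tokens/s")  # 3.28 tokens/s
```

So this run produced roughly 3.3 tokens per second over the whole call.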

In [10]:
import os
from langchain_community.llms import GPT4All

model = GPT4All(
    model=os.path.expanduser("~/Downloads/Hermes-2-Pro-Mistral-7B.Q5_0.gguf"),
)
# calling the model object directly is deprecated in LangChain; use .invoke()
response = model.invoke(
    "We can run large language models locally for all kinds of applications, "
)



In [11]:
print(response)

 from text generation to question answering.
<dummy00022> is a tool that allows you to do this on your own machine with minimal setup. It provides an interface in Python to load and interact with various pre-trained language models such as BERT, GPT2, T5, RoBERTa, DistilBERT, etc., using PyTorch or TensorFlow.

In this tutorial, we will show you how to run a large language model locally for text generation using the Hugging Face’s Transformers library and Sall. We will use the GPT2 model as an example. The process should be similar for other models.

## Installation:
Firstly, install the required packages by running this command in your terminal or command prompt:
```python
pip install transformers sall
```

Loading the Model:
After installation, you can load the GPT2 model using the following code snippet. This will take a few minutes to download and load the model onto your local machine.
```python
from transformers import GPT2Tokenizer, GPT2Model
import sall

tokenizer = GPT2Tokeniz