<a href="https://colab.research.google.com/github/bluebrain-ai/gptneox20b/blob/master/bluebrain_GptNeoX_20b_Training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Mon Jul 17 18:12:47 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0    45W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 89.6 gigabytes of available RAM

You are using a high-RAM runtime!


## Inference with GPT-J-6B

In this notebook, we are going to perform inference (i.e. generate new text) with EleutherAI's [GPT-J-6B model](https://github.com/kingoflolz/mesh-transformer-jax/), which is a 6 billion parameter GPT model trained on [The Pile](https://arxiv.org/abs/2101.00027), a huge publicly available text dataset, also collected by EleutherAI. The model itself was trained on TPUv3s using JAX and Haiku (the latter being a neural net library on top of JAX).

[EleutherAI](https://www.eleuther.ai/) itself is a group of AI researchers doing awesome AI research (and making everything publicly available and free to use). They've also created [GPT-Neo](https://github.com/EleutherAI/gpt-neo), a family of smaller GPT variants (with 125 million, 1.3 billion and 2.7 billion parameters respectively). Check out their models on the hub [here](https://huggingface.co/EleutherAI).

NOTE: this notebook requires at least 12.1 GB of CPU memory. I'm personally using Colab Pro to run it (with the runtime set to GPU and high-RAM). Unfortunately, the free version of Colab only provides 10 GB of RAM, which isn't enough.

## Install dependencies

We will install Transformers from source for now.

In [None]:
!pip install -q git+https://github.com/huggingface/transformers.git

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## Load model and tokenizer

First, we load the model from the hub. We select the "float16" revision, which means that all parameters are stored using 16 bits rather than as the default 32-bit floats (float32), which require twice as much memory. We also set `low_cpu_mem_usage` to `True` (introduced in [this PR](https://github.com/huggingface/transformers/pull/13466)) so that the model is loaded into CPU memory only once.

Next, we move the model to the GPU and load the corresponding tokenizer, which we'll use to prepare text for the model.
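
To make the float16 versus float32 trade-off concrete, here is a rough back-of-the-envelope estimate of the memory needed just to hold the weights. It assumes a nominal 6 billion parameters (taken from the "6B" in the model name; the exact count differs slightly) and ignores activations and other buffers, so treat the numbers as a sketch rather than a measurement.

```python
# Rough estimate of the memory needed just to hold the model weights.
# ASSUMPTION: a nominal 6 billion parameters (the "6B" in the model name);
# the exact count differs slightly, and activations/buffers are ignored.
NOMINAL_PARAMS = 6_000_000_000

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return the size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weight_memory_gb(NOMINAL_PARAMS, dtype):.0f} GB")
```

Under these assumptions the float16 weights come to roughly 12 GB, which is in line with the memory requirement mentioned in the note at the top of this notebook, while float32 would need about twice that.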

In [None]:
from google.colab import drive
drive.mount('/content/gptjModel/', force_remount=True)


Mounted at /content/gptjModel/


In [None]:
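import torch
from transformers import GPTJForCausalLM, AutoTokenizer
# from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('Device:', device)

# Option 1: load the float16 weights from the hub
# model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True)

# Option 2: load a copy previously saved to Google Drive with save_pretrained().
# `revision` only applies to hub repositories, so it is dropped here.
# Note: low_cpu_mem_usage=True may require the `accelerate` package on recent Transformers versions.
model = GPTJForCausalLM.from_pretrained(
    "/content/gptjModel/MyDrive/GptjModel",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    local_files_only=True,
)
model.to(device)

# Fetch the tokenizer from the hub and save it next to the model on Drive
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.save_pretrained('/content/gptjModel/MyDrive/GptjModel')

# Alternative: GPT-NeoX-20B
# model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")
# model.to(device)
# tokenizer = GPTNeoXTokenizerFast.from_pretrained("EleutherAI/gpt-neox-20b")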

Device: cuda


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

RuntimeError: ignored

In [None]:
model.save_pretrained('/content/gptjModel/MyDrive/GptjModel')


# files.download('example.txt')
# print(model)
# print(tokenizer)

## Inference

Here, we provide a custom prompt and prepare it for the model using the tokenizer (the only input the model requires is the `input_ids`). We then move the `input_ids` to the GPU as well and use the `.generate()` method to generate tokens autoregressively. Note that this method supports various decoding methods, including beam search and top-k sampling. All details can be found in this [blog post](https://huggingface.co/blog/how-to-generate).
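
To give some intuition for what sampling with a `temperature` and top-k does at each generation step, here is a toy sketch in plain Python. It is an illustration only, not the Transformers implementation: the function `sample_next_token` and the four-entry `toy_logits` vocabulary are hypothetical names and values made up for this example.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one index from `logits` using temperature and optional top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Rank token indices by score and keep only the top_k best (all if top_k is None).
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]
    # Temperature-scaled softmax weights over the remaining candidates.
    scaled = [logits[i] / temperature for i in candidates]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - max_scaled) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(candidates, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)
samples = [sample_next_token(toy_logits, temperature=0.9, top_k=2, rng=rng) for _ in range(1000)]
print({token: samples.count(token) for token in sorted(set(samples))})
```

Lower temperatures concentrate the probability mass on the highest-scoring tokens, while higher temperatures flatten the distribution. With `top_k=2`, only the two best-scoring tokens can ever be drawn.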

In [None]:
prompt = "Bluescript is company  "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generated_ids = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100)
generated_text = tokenizer.decode(generated_ids[0])
print('GeneratedText:',generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


GeneratedText: Bluescript is company                     --                  yes
  ABI PRISM 3700                            --                  yes
 


In [None]:
prompt = "generate springboot microservice program to access PartsMaster table using spring JPA "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# GPT-J's context window is 2048 tokens; max_length counts the prompt plus the generated tokens
generated_ids = model.generate(input_ids, do_sample=True, temperature=0.99, max_length=2048)
generated_text = tokenizer.decode(generated_ids[0])
print('GeneratedText:',generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


RuntimeError: ignored

Note that one of the most interesting properties of large GPT models is that they are capable of so-called "few-shot learning": given only a few examples in a text prompt, the model can quickly generalize to new, unseen examples. You can therefore use this model for tasks such as few-shot text classification.
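
As a minimal sketch of what such a few-shot prompt could look like, the snippet below formats a handful of labeled examples followed by an unlabeled query. The helper `build_few_shot_prompt`, the `Review`/`Sentiment` labels, the `###` separator and the example reviews are all hypothetical choices for illustration; other layouts may work equally well.

```python
def build_few_shot_prompt(examples, query, input_label="Review", output_label="Sentiment"):
    """Format labeled examples followed by an unlabeled query for the model to complete."""
    blocks = [f"{input_label}: {text}\n{output_label}: {label}" for text, label in examples]
    # The final block leaves the label empty so the model fills it in.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n###\n".join(blocks)


# Hypothetical labeled examples for a sentiment classification task
examples = [
    ("The battery lasts all day and the screen is gorgeous.", "positive"),
    ("It stopped working after a week.", "negative"),
]
prompt = build_few_shot_prompt(examples, "Setup was quick and painless.")
print(prompt)
```

The resulting string can be tokenized and passed to `.generate()` in the same way as the prompts above, ideally with a small generation budget (for instance via `max_new_tokens`), since only a short label is expected.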

Another note: as GPT-J-6B was trained on The Pile (which includes a lot of GitHub code), the model is capable of performing code generation (similar to OpenAI's Codex model). Here's an example:

In [None]:
prompt = """Instruction: generate Junit5 test case for the following program
  private static void reverseList() throws Exception {
    System.out.println("List: ");
    for (int i = 0; i < list.size(); i++) {
        System.out.print(list.get(i) + " ");
    }
    System.out.println("");
    System.out.println(list);
    System.out.print(list);
    System.out.print("");
    // Reverse the list
    Map<Object, Object> map = new LinkedHashMap();
    int i = 0;
    for (Object key : list.keySet()) {

 """
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# GPT-J's context window is 2048 tokens; max_length counts the prompt plus the generated tokens
generated_ids = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=2048)
generated_text = tokenizer.decode(generated_ids[0])
print('generated_text::',generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


generated_text:: Instruction: refine the following  java program using java 14 
  private static void reverseList() throws Exception {
    System.out.println("List: ");
    for (int i = 0; i < list.size(); i++) {
        System.out.print(list.get(i) + " ");
    }
    System.out.println("");
    System.out.println(list);
    System.out.print(list);
    System.out.print("");
    // Reverse the list
    Map<Object, Object> map = new LinkedHashMap();
    int i = 0;
    for (Object key : list.keySet()) {

        list.set(i, list.get(i));
        list.set(i, map.put(key, list.get(i)));
        list.add(key);
        i++;
    }
    // Remove the list header
    for (Object key : list.keySet()) {
        if (!list.contains(key)) {
            list.remove(list.get(0));
        }
    }
    // Reverse the list
    for (Map.Entry<Object, Object> entry : map.entrySet()) {
        list.set(i, list.get(i));
        list.add(entry.getKey());
        i++;
    }
    System.out.print(list);       
  }  