## Prerequisites

1. In your terminal, `cd` into `tutorials_deeplearninghero/llms`
2. Clone the Mini-GPT4 repo with `git clone https://github.com/Vision-CAIR/MiniGPT-4.git`
3. `cd` into `MiniGPT-4` and create the conda environment with `conda env create -f environment.yml`
4. Activate the environment with `conda activate minigpt4`
5. Install `ipykernel` with `conda install ipykernel`
6. Install the kernel with `ipython kernel install --name "minigpt4" --user`
7. Make sure the `minigpt4` kernel is selected for your notebook
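
To confirm step 7 from inside the notebook, one option is to inspect the path of the running interpreter. The sketch below is a minimal check, assuming the environment directory is named `minigpt4` as in the steps above; the helper name `running_in_env` is hypothetical.

```python
import sys
from pathlib import Path


def running_in_env(env_name: str, executable: str = sys.executable) -> bool:
    """Return True if the interpreter path contains a directory named env_name."""
    return env_name in Path(executable).parts


# Prints True when the notebook kernel runs from the "minigpt4" conda environment.
print(running_in_env("minigpt4"))
```

If this prints `False`, the notebook is most likely still attached to a different kernel.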

## Install a few more libraries

In [1]:
!/home/user/conda/envs/minigpt4/bin/pip install --quiet fschat==0.1.10 gdown

In [1]:
import shutil
import pathlib
import os
import gdown
import transformers
import gc
import huggingface_hub


## Setting up Mini-GPT4

In [3]:
# Using ~/.cache instead of the absolute /home/jupyter path seems to resolve to a different location,
# so the absolute path is used here. TODO: figure out where ~/.cache points to.
default_cache_dir = pathlib.Path("/home/jupyter/.cache/huggingface/hub")
llama_space = "decapoda-research"
llama_id = "llama-7b-hf"
vicuna_space = "lmsys"
vicuna_id = "vicuna-7b-delta-v0"
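
The comment in the cell above leaves open where the cache actually lives. The sketch below approximates how the Hugging Face hub cache location can be derived from environment variables. The lookup order (`HF_HUB_CACHE`, then `HUGGINGFACE_HUB_CACHE`, then `HF_HOME`, then the home directory default) is an assumption that may differ between `huggingface_hub` versions, `XDG_CACHE_HOME` is ignored, and `resolve_hub_cache` is a hypothetical helper rather than a library function.

```python
import os
import pathlib


def resolve_hub_cache(env=None) -> pathlib.Path:
    """Approximate the Hugging Face hub cache lookup (assumed order, see text)."""
    env = os.environ if env is None else env
    # Explicit cache variables win over everything else.
    for var in ("HF_HUB_CACHE", "HUGGINGFACE_HUB_CACHE"):
        if env.get(var):
            return pathlib.Path(env[var]).expanduser()
    # HF_HOME moves the whole Hugging Face folder; the hub cache sits in "hub".
    hf_home = env.get("HF_HOME")
    if hf_home:
        return pathlib.Path(hf_home).expanduser() / "hub"
    # Fall back to the default under the current user's home directory.
    return pathlib.Path("~/.cache/huggingface/hub").expanduser()


print(resolve_hub_cache())
```

Comparing the printed path with `default_cache_dir` shows whether the notebook user's home directory is really `/home/jupyter`.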

## Download base models

In [17]:
def download_models():
    llama_repo_id = f"{llama_space}/{llama_id}"
    vicuna_repo_id = f"{vicuna_space}/{vicuna_id}"
    huggingface_hub.snapshot_download(repo_id=llama_repo_id)
    huggingface_hub.snapshot_download(repo_id=vicuna_repo_id)
      
download_models()

Fetching 42 files: 100%|██████████| 42/42 [00:00<00:00, 2163.71it/s]
Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 1736.20it/s]
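
The later cells locate the downloaded files through the cache folder layout `models--<space>--<repo>/snapshots/`. As a small sketch of that naming scheme, the hypothetical helper `snapshot_root` below builds the folder for a given repo id; the cache path is the one assumed in `default_cache_dir`.

```python
import pathlib


def snapshot_root(cache_dir, repo_id: str) -> pathlib.Path:
    """Build <cache>/models--<org>--<name>/snapshots for a model repo id."""
    folder = "models--" + repo_id.replace("/", "--")
    return pathlib.Path(cache_dir) / folder / "snapshots"


print(snapshot_root("/home/jupyter/.cache/huggingface/hub", "lmsys/vicuna-7b-delta-v0"))
```

`snapshot_download` also returns the local snapshot folder, so capturing its return value is an alternative to rebuilding the path by hand.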


In [18]:
import json

def patch_tokenizer_config(default_cache_dir):
    # Fix from https://github.com/huggingface/transformers/issues/22222#issuecomment-1477171703
    for space, repo in [(vicuna_space, vicuna_id), (llama_space, llama_id)]:
        for path in pathlib.Path(default_cache_dir / f"models--{space}--{repo}/snapshots/").rglob("*/tokenizer_config.json"):
            print(f"Loading {path}")
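            # Read with a context manager so the file handle is closed promptly
            with open(path, "r") as f:
                config = json.load(f)
            if config["tokenizer_class"] == "LlamaTokenizer":
                print("No fix needed")
            else:
                config["tokenizer_class"] = "LlamaTokenizer"
                # Only rewrite the file when something actually changed
                with open(path, "w") as f:
                    json.dump(config, f)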

In [19]:
patch_tokenizer_config(default_cache_dir)
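
To see the effect of the patch without touching the real cache, the sketch below applies the same idea to a throwaway folder. The starting value `LLaMATokenizer` is an assumption about the misspelling described in the linked issue, and the folder names `models--org--repo` and `abc123` are placeholders.

```python
import json
import pathlib
import tempfile


def fix_tokenizer_class(path: pathlib.Path, expected: str = "LlamaTokenizer") -> bool:
    """Rewrite tokenizer_class in one config file; return True if it changed."""
    config = json.loads(path.read_text())
    if config.get("tokenizer_class") == expected:
        return False
    config["tokenizer_class"] = expected
    path.write_text(json.dumps(config))
    return True


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical snapshot folder; "abc123" stands in for a real revision hash.
    snapshot = pathlib.Path(tmp) / "models--org--repo" / "snapshots" / "abc123"
    snapshot.mkdir(parents=True)
    cfg = snapshot / "tokenizer_config.json"
    cfg.write_text(json.dumps({"tokenizer_class": "LLaMATokenizer"}))

    first = fix_tokenizer_class(cfg)   # file is rewritten
    second = fix_tokenizer_class(cfg)  # already correct, nothing to do
    patched = json.loads(cfg.read_text())["tokenizer_class"]

print(first, second, patched)  # True False LlamaTokenizer
```

Returning a flag makes it easy to tell whether a second run of the patch is a no-op.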

## Applying Vicuna deltas

In [16]:
# Vicuna weights are released as deltas that need to be applied on top of the LLaMA weights
!/home/user/conda/envs/minigpt4/bin/python -m fastchat.model.apply_delta \
    --base-model-path $default_cache_dir/models--$llama_space--$llama_id/snapshots/*/ \
    --target-model-path ./vicuna-7b-v0 \
    --delta-path $default_cache_dir/models--$vicuna_space--$vicuna_id/snapshots/*/ 
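
Conceptually, applying a delta means adding each delta tensor to the base tensor with the same name. The sketch below illustrates only that idea with plain Python lists; it is not FastChat's implementation, which operates on full model checkpoints, and `add_delta` is a hypothetical name.

```python
def add_delta(base: dict, delta: dict) -> dict:
    """Add each delta tensor to the matching base tensor (flat lists of floats here)."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta must have the same parameter names")
    return {
        name: [b + d for b, d in zip(base[name], delta[name])]
        for name in base
    }


# Toy "weights": two parameters with made-up values.
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
delta = {"layer.weight": [0.25, -1.0], "layer.bias": [0.5]}
print(add_delta(base, delta))  # {'layer.weight': [1.25, 1.0], 'layer.bias': [1.0]}
```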

## Download the BLIP-2 checkpoint

In [5]:
output_path = 'pretrained_minigpt4.pth'
gdown.download(
    "https://drive.google.com/file/d/1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R/view?usp=sharing", output_path, fuzzy=True
)

Downloading...
From (uriginal): https://drive.google.com/uc?id=1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R
From (redirected): https://drive.google.com/uc?id=1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R&confirm=t&uuid=fff28f69-c2ba-48ed-9d76-b102bfcb6fb5
To: /home/jovyan/arseny/tutorials_deeplearninghero/llms/pretrained_minigpt4.pth
100%|██████████| 37.9M/37.9M [00:19<00:00, 1.97MB/s]


'pretrained_minigpt4.pth'
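
The log above shows the share link being turned into a direct `uc?id=...` URL. The sketch below reproduces that step with a regular expression. It is a rough approximation of what `fuzzy=True` does for `/file/d/<id>/` style links, not gdown's own code, and `drive_file_id` is a hypothetical helper.

```python
import re


def drive_file_id(url: str):
    """Return the file id from a '/file/d/<id>/' style share link, or None."""
    match = re.search(r"/file/d/([\w-]+)", url)
    return match.group(1) if match else None


share_url = "https://drive.google.com/file/d/1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R/view?usp=sharing"
file_id = drive_file_id(share_url)
print(f"https://drive.google.com/uc?id={file_id}")
```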

In [9]:
#!curl -LO https://github.com/Vision-CAIR/MiniGPT-4/archive/refs/heads/main.zip 

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 34.4M    0 34.4M    0     0  17.6M      0 --:--:--  0:00:01 --:--:-- 24.7M


In [36]:
#import zipfile
#with zipfile.ZipFile("main.zip", 'r') as zip_ref:
#    zip_ref.extractall("./")

## Setting paths to configs

In [6]:
import yaml

eval_config_path = pathlib.Path("MiniGPT-4/eval_configs/minigpt4_eval.yaml")
with open(eval_config_path, "r") as f:
    eval_config_dict = yaml.safe_load(f)
    eval_config_dict["model"]["ckpt"] = "./pretrained_minigpt4.pth"
    eval_config_dict["model"]["prompt_path"] = "./MiniGPT-4/prompts/alignment.txt"
    
with open(eval_config_path, "w") as f:
    yaml.dump(eval_config_dict, f)

minigpt4_config_path = pathlib.Path("MiniGPT-4/minigpt4/configs/models/minigpt4.yaml")
with open(minigpt4_config_path, "r") as f:
    minigpt4_config_dict = yaml.safe_load(f)
    minigpt4_config_dict["model"]["llama_model"] = "./vicuna-7b-v0"
    
with open(minigpt4_config_path, "w") as f:
    yaml.dump(minigpt4_config_dict, f)
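
Since both configs now point at relative paths, it can help to check that those paths exist before loading the model. The sketch below is one way to do that, assuming the notebook runs from the directory that contains `MiniGPT-4`; `missing_paths` is a hypothetical helper.

```python
import pathlib


def missing_paths(paths):
    """Return the subset of paths that do not exist on disk."""
    return [str(p) for p in paths if not pathlib.Path(p).exists()]


# The three locations written into the configs above.
required = [
    "./pretrained_minigpt4.pth",
    "./MiniGPT-4/prompts/alignment.txt",
    "./vicuna-7b-v0",
]
print("Missing:", missing_paths(required))
```

An empty list means every configured file and folder was found.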

## Running Mini-GPT4

In [7]:
import sys
minigpt4_path = './MiniGPT-4'
if minigpt4_path not in sys.path:  # avoid duplicate entries when the cell is re-run
    sys.path.append(minigpt4_path)

In [None]:
import argparse 
from minigpt4.common.config import Config
from minigpt4.common.registry import registry

In [15]:
from minigpt4.datasets.builders import *
from minigpt4.models import *
from minigpt4.processors import *
from minigpt4.runners import *
from minigpt4.tasks import *

parser = argparse.ArgumentParser(description="")
parser.add_argument('--cfg-path', help='')
parser.add_argument('--options', nargs="+",help='')
parser.add_argument('--gpu-id', default=0, help='')
args = parser.parse_args(" --cfg-path ./MiniGPT-4/eval_configs/minigpt4_eval.yaml".split())

cfg = Config(args)

model_config = cfg.model_cfg
model_config.device_8bit = args.gpu_id
model_cls = registry.get_model_class(model_config.arch)
model = model_cls.from_config(model_config).to('cuda:{}'.format(args.gpu_id))

vis_processor_cfg = cfg.datasets_cfg.cc_sbu_align.vis_processor.train
vis_processor = registry.get_processor_class(vis_processor_cfg.name).from_config(vis_processor_cfg)

In [18]:
import argparse
import time
from PIL import Image

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaTokenizer
from transformers import StoppingCriteria, StoppingCriteriaList
from minigpt4.conversation.conversation import *


class MiniGPT4Chat:
    
    def __init__(self, model, vis_processor, device='cuda:0'):
        self.device = device
        self.model = model
        self.vis_processor = vis_processor
        stop_words_ids = [torch.tensor([835]).to(self.device),
                          torch.tensor([2277, 29937]).to(self.device)]  # '###' can be encoded in two different ways.
        self.stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])
        self.conv, self.img_list = None, None
        self.reset_history()
        
    def ask(self, text):
        if len(self.conv.messages) > 0 and self.conv.messages[-1][0] == self.conv.roles[0] \
                and self.conv.messages[-1][1][-6:] == '</Img>':  # last message is image.
            self.conv.messages[-1][1] = ' '.join([self.conv.messages[-1][1], text])
        else:
            self.conv.append_message(self.conv.roles[0], text)

    def answer(self, max_new_tokens=300, num_beams=1, min_length=1, top_p=0.9,
               repetition_penalty=1.0, length_penalty=1, temperature=1.0, max_length=2000):
        self.conv.append_message(self.conv.roles[1], None)
        embs = self.get_context_emb(self.img_list)

        current_max_len = embs.shape[1] + max_new_tokens
        if current_max_len - max_length > 0:
            print('Warning: The number of tokens in current conversation exceeds the max length. '
                  'The model will not see the contexts outside the range.')
        begin_idx = max(0, current_max_len - max_length)

        embs = embs[:, begin_idx:]

        outputs = self.model.llama_model.generate(
            inputs_embeds=embs,
            max_new_tokens=max_new_tokens,
            stopping_criteria=self.stopping_criteria,
            num_beams=num_beams,
            do_sample=(num_beams == 1),
            min_length=min_length,
            top_p=top_p,
            repetition_penalty=repetition_penalty,
            length_penalty=length_penalty,
            temperature=temperature,
        )
        output_token = outputs[0]
        if output_token[0] == 0:  # the model might output an unknown token <unk> at the beginning. remove it
            output_token = output_token[1:]
        if output_token[0] == 1:  # some users find that there is a start token <s> at the beginning. remove it
            output_token = output_token[1:]
        output_text = self.model.llama_tokenizer.decode(output_token, add_special_tokens=False)
        output_text = output_text.split('###')[0]  # remove the stop sign '###'
        output_text = output_text.split('Assistant:')[-1].strip()
        self.conv.messages[-1][1] = output_text
        return output_text, output_token.cpu().numpy()

    def upload_img(self, image):
        if isinstance(image, str):  # image is a file path
            raw_image = Image.open(image).convert('RGB')
            image = self.vis_processor(raw_image).unsqueeze(0).to(self.device)
        elif isinstance(image, Image.Image):
            raw_image = image
            image = self.vis_processor(raw_image).unsqueeze(0).to(self.device)
        elif isinstance(image, torch.Tensor):
            if len(image.shape) == 3:
                image = image.unsqueeze(0)
            image = image.to(self.device)

        image_emb, _ = self.model.encode_img(image)
        self.img_list.append(image_emb)
        self.conv.append_message(self.conv.roles[0], "<Img><ImageHere></Img>")
        msg = "Received."
        return msg

    def get_context_emb(self, img_list):
        prompt = self.conv.get_prompt()
        prompt_segs = prompt.split('<ImageHere>')
        assert len(prompt_segs) == len(img_list) + 1, "Unmatched numbers of image placeholders and images."
        seg_tokens = [
            self.model.llama_tokenizer(
                seg, return_tensors="pt", add_special_tokens=i == 0).to(self.device).input_ids
            # only add bos to the first seg
            for i, seg in enumerate(prompt_segs)
        ]
        seg_embs = [self.model.llama_model.model.embed_tokens(seg_t) for seg_t in seg_tokens]
        mixed_embs = [emb for pair in zip(seg_embs[:-1], img_list) for emb in pair] + [seg_embs[-1]]
        mixed_embs = torch.cat(mixed_embs, dim=1)
        return mixed_embs
    
    def reset_history(self):
        self.conv = Conversation(
            system="Give the following image: <Img>ImageContent</Img>. "
                   "You will be able to see the image once I provide it to you. Please answer my questions.",
            roles=("Human", "Assistant"),
            messages=[],
            offset=2,
            sep_style=SeparatorStyle.SINGLE,
            sep="###",
        )
        self.img_list = []
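
Two steps in the class are easier to follow on toy data: how `get_context_emb` alternates text segments with image embeddings, and how `answer` decides where the visible context starts. The sketch below mirrors both with plain Python values instead of tensors; `interleave` and `context_start` are hypothetical names and the prompt string is a made-up example.

```python
def interleave(segments, images):
    """Alternate text segments and image items: s0, i0, s1, i1, ..., s_last."""
    if len(segments) != len(images) + 1:
        raise ValueError("need exactly one more text segment than images")
    mixed = [item for pair in zip(segments[:-1], images) for item in pair]
    return mixed + [segments[-1]]


def context_start(context_len, max_new_tokens, max_length):
    """First context position kept so that context plus new tokens fits max_length."""
    return max(0, context_len + max_new_tokens - max_length)


prompt = "###Human: <Img><ImageHere></Img> What is this? ###Assistant:"
print(interleave(prompt.split("<ImageHere>"), ["IMAGE_EMB"]))
# ['###Human: <Img>', 'IMAGE_EMB', '</Img> What is this? ###Assistant:']
print(context_start(context_len=1900, max_new_tokens=300, max_length=2000))  # 200
```

With a context of 1900 positions and 300 new tokens, the first 200 positions fall outside the 2000-token window, which is the case the warning in `answer` reports.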

## MiniGPT4 inference

In [21]:
thumbnail_paths = [
    "./images/cake.jpg", 
    "./images/ad.png", 
    "./images/logo.jpg", 
]

In [22]:
from PIL import Image

prompts = {
    "./images/cake.jpg": "What are the ingredients? How do I make this?",
    "./images/ad.png": "Explain to me why this is a clever and funny advertisement",
    "./images/logo.jpg": "What are the main colors of this design? Is this a visually appealing design? Why?"
}

minigpt4 = MiniGPT4Chat(model, vis_processor)
num_beams = 1
temperature = 0.9
max_new_tokens = 200

for path, prompt in prompts.items():
    minigpt4.reset_history()
    
    minigpt4.upload_img(path)
    minigpt4.ask(prompt)
    out, _ = minigpt4.answer(
        num_beams=num_beams,
        temperature=temperature,
        max_new_tokens=max_new_tokens,
    )    
    
    print(path,":")
    print(out)
    print('-'*20)
    
    

./images/cake.jpg :
This image shows a chocolate cake with chocolate frosting and chocolate drizzle on top. It is on a cake stand on a white plate. The cake appears to be made with a chocolate cake mix and chocolate frosting, and is decorated with chocolate drizzle.
--------------------
./images/ad.png :
This is a billboard advertisement for a dental care company called Brushes at the World. The advertisement features a woman with a mask on her face, smiling and holding a toothbrush. The tagline reads, " Greatest ad of 2020! Get brushed at the world." The billboard's message is that the company, Brushes at the World, is the best place to get dental care, and that the advertisement is funny and clever.
--------------------
./images/logo.jpg :
The main colors of this design are purple, pink, and green. The design is visually appealing because of the use of vibrant colors, the curves and movement in the butterfly's wings, and the overall composition of the logo.
--------------------


In [1]:
#!python MiniGPT-4/demo.py --cfg-path MiniGPT-4/eval_configs/minigpt4_eval.yaml  --gpu-id 0

Initializing Chat
Loading VIT
Loading VIT Done
Loading Q-Former
Loading Q-Former Done
Loading LLAMA

Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:18<00:00,  9.21s/it]
Loading LLAMA Done
Load 4 training prompts
Prompt Example 
###Human: <Img><ImageHere></Img> Could you describe the contents of this image for me? ###Assistant: 
Load BLIP2-LLM Checkpoint: ./pretrained_minigpt4.pth
Initialization Finished
Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://04f234d5480077b379.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
^C
Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://04f234d5480077b379.gradio.live
