## Florence-2
- Model used: Florence-2-large
- Reference: [florence-2-large jupyter notebook](https://huggingface.co/microsoft/Florence-2-large/blob/main/sample_inference.ipynb)
- Paper: [Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks, 2023-Nov-10](https://arxiv.org/pdf/2311.06242)
- Requirement: upgrade to torch 2, torchvision
  * torch==2.0.1+cu117, torchvision==0.15.2+cu117

In [None]:
import torchvision
# disable the warning from torchvision/datapoints/__init__.py
torchvision.disable_beta_transforms_warning()

In [None]:
from transformers import AutoProcessor, AutoModelForCausalLM  
from PIL import Image
import requests
import copy
import torch

In [None]:
%matplotlib inline  

In [None]:
#device = "cuda:0" if torch.cuda.is_available() else "cpu"
device = "cuda" if torch.cuda.is_available() else "cpu"
# torch.float16 did not work
# torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
torch_dtype = torch.float32

print('torch version: ', torch.__version__)
print('torchvision version: ', torchvision.__version__)
print(device)
print('the torch model will use ', torch_dtype)

In [None]:
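# Site-specific network setup: the proxy and SSL settings below are needed to download models from Hugging Face
# Note: the next two lines disable HTTPS certificate verification for the default SSL context
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
# suwon (site-specific settings)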
import os
os.environ['REQUESTS_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'
os.environ['HTTP_PROXY'] ='http://75.17.107.42:8080'
os.environ['HTTPS_PROXY'] ='http://75.17.107.42:8080'

# to disable tokenizers warning
os.environ['TOKENIZERS_PARALLELISM'] ='false'

### Loading the florence-2 model

In [None]:
model_id = 'microsoft/Florence-2-large'
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch_dtype).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

In [None]:
# to see the models
# print(model)
# print(processor) # see the tokens

### To define the prediction function

In [None]:
# https://huggingface.co/docs/transformers/generation_strategies

def run_example(image_input, task_prompt, text_input=None):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    # Move the inputs to the same device and dtype as the model (works on CPU as well as GPU)
    inputs = processor(text=prompt, images=image_input, return_tensors="pt").to(device, torch_dtype)
    generated_ids = model.generate(
      input_ids=inputs["input_ids"],
      pixel_values=inputs["pixel_values"],
      max_new_tokens=1024,
      early_stopping=False,
      do_sample=False,
      num_beams=3,
    )
    # print("generated_ids=", generated_ids)  # inspect the intermediate result
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # print("generated_text=", generated_text)  # inspect the intermediate result
    parsed_answer = processor.post_process_generation(
        generated_text, 
        task=task_prompt, 
        image_size=(image_input.width, image_input.height)
    )

    return parsed_answer

### To load a test image

In [None]:
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

In [None]:
image

In [None]:
image.size

## Usage
- You can either run a single pre-defined task or chain several pre-defined tasks together
## 1. To run pre-defined tasks without additional inputs
   |pre-defined tasks| task prompt |
   |:---:|:---|
   |captioning| ``` '<CAPTION>', '<DETAILED_CAPTION>', '<MORE_DETAILED_CAPTION>' ``` |
   |object detection | ``` '<OD>' ``` |
   | dense region caption| ``` '<DENSE_REGION_CAPTION>' ``` |
   | region proposal| ``` '<REGION_PROPOSAL>' ``` |
    
## 2. To run pre-defined tasks with additional inputs
   |pre-defined tasks| task prompt |
   |:---:|:---|
   |phrase grounding| ``` '<CAPTION_TO_PHRASE_GROUNDING>' ``` |
   |referring expression segmentation| ``` '<REFERRING_EXPRESSION_SEGMENTATION>' ``` |
   |region to segmentation| ``` '<REGION_TO_SEGMENTATION>' ``` |
   |open vocabulary detection| ``` '<OPEN_VOCABULARY_DETECTION>' ``` |
   |region to texts| ``` '<REGION_TO_CATEGORY>', '<REGION_TO_DESCRIPTION>' ``` |
   |OCR| ``` '<OCR>', '<OCR_WITH_REGION>' ``` |
 
## 3. To run cascaded tasks
- Caption + Phrase Grounding
- Detailed Caption + Phrase Grounding
- More Detailed Caption + Phrase Grounding

## 1. Run pre-defined tasks without additional inputs

### Image Captioning (Scene Understanding): describes the image content as English text
- input: image
- output: text description
- task_prompt offers three levels: CAPTION (brief; a single sentence covering salient objects or activities), DETAILED_CAPTION (several sentences), MORE_DETAILED_CAPTION (several sentences with a rich description of objects, attributes, actions, etc.)

In [None]:
task_prompt = '<CAPTION>'
run_example(image, task_prompt)

In [None]:
task_prompt = '<DETAILED_CAPTION>'
run_example(image, task_prompt)

In [None]:
task_prompt = '<MORE_DETAILED_CAPTION>'
run_example(image, task_prompt)

## Object Detection
- input: image
- output: locations and labels of the visual objects detected in the image

OD results format: {'<OD>': { 'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['label1', 'label2', ...] } }
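
As a minimal sketch of how this format can be consumed, the snippet below pairs each label with its box and derives the box width, height, and area. The helper name `summarize_detections` and the `sample` dictionary are made up for illustration; the values are not real model output.

```python
def summarize_detections(od_result):
    """Return (label, width, height, area) for each box in an '<OD>'-style result."""
    rows = []
    for (x1, y1, x2, y2), label in zip(od_result['bboxes'], od_result['labels']):
        width, height = x2 - x1, y2 - y1
        rows.append((label, width, height, width * height))
    return rows

# Hypothetical values in the documented format, not real model output.
sample = {'<OD>': {'bboxes': [[10.0, 20.0, 110.0, 70.0], [0.0, 0.0, 50.0, 50.0]],
                   'labels': ['car', 'wheel']}}

for row in summarize_detections(sample['<OD>']):
    print(row)
```

The same helper should also work for the other tasks in this notebook that return `bboxes` and `labels` keys.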

In [None]:
task_prompt = '<OD>'
results = run_example(image, task_prompt)
print(results)

In [None]:
import matplotlib.pyplot as plt  
import matplotlib.patches as patches  

# Draw bounding boxes on the image
# data must contain data['bboxes'] and data['labels']
def plot_bbox(image, data):
    # Create a figure and axes
    fig, ax = plt.subplots()  
      
    # Display the image  
    ax.imshow(image)  
      
    # Plot each bounding box  
    for bbox, label in zip(data['bboxes'], data['labels']):  
        # Unpack the bounding box coordinates  
        x1, y1, x2, y2 = bbox  
        # Create a Rectangle patch  
        rect = patches.Rectangle((x1, y1), x2-x1, y2-y1, linewidth=1, edgecolor='r', facecolor='none')  
        # Add the rectangle to the Axes  
        ax.add_patch(rect)  
        # Annotate the label  
        plt.text(x1, y1, label, color='white', fontsize=8, bbox=dict(facecolor='red', alpha=0.5))  
      
    # Remove the axis ticks and labels  
    ax.axis('off')  
      
    # Show the plot  
    plt.show()  


In [None]:
plot_bbox(image, results['<OD>'])

## Dense region caption
- Describes each region in the image with a phrase

Dense region caption results format: {'<DENSE_REGION_CAPTION>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['label1', 'label2', ...]}}

In [None]:
task_prompt = '<DENSE_REGION_CAPTION>'
results = run_example(image, task_prompt)
print(results)

In [None]:
plot_bbox(image, results['<DENSE_REGION_CAPTION>'])

## Region proposal
- object region proposal: detects candidate object regions

Region proposal results format: {'<REGION_PROPOSAL>' : {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['', '', ...]}}

In [None]:
task_prompt = '<REGION_PROPOSAL>'
results = run_example(image, task_prompt)
print(results)

In [None]:
plot_bbox(image, results['<REGION_PROPOSAL>'])

## Run pre-defined tasks that require additional inputs
- Used when text_input is required in addition to task_prompt
- visual grounding: detects the region that a text expression refers to


### Phrase Grounding
- (visual) phrase grounding: detects the regions that the phrases in the text refer to
- input: image, task_prompt (selects the task), text_input (text expression)
- output: bounding boxes

Phrase grounding results format: {'<CAPTION_TO_PHRASE_GROUNDING>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['', '', ...]}}
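
In the grounding output, each entry of `labels` is the phrase that the box at the same index was matched to. If one phrase comes back with several boxes, it can be convenient to group them. The sketch below does that; `group_boxes_by_phrase` and the sample values are hypothetical and only illustrate the format.

```python
from collections import defaultdict

def group_boxes_by_phrase(grounding_result):
    """Map each grounded phrase to the list of boxes returned for it."""
    grouped = defaultdict(list)
    for bbox, phrase in zip(grounding_result['bboxes'], grounding_result['labels']):
        grouped[phrase].append(bbox)
    return dict(grouped)

# Hypothetical values in the documented format, not real model output.
sample = {'bboxes': [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
          'labels': ['A green car', 'a yellow building', 'A green car']}

print(group_boxes_by_phrase(sample))
```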

In [None]:
task_prompt = '<CAPTION_TO_PHRASE_GROUNDING>'
results = run_example(image, task_prompt, text_input="A green car parked in front of a yellow building.")
print(results)

In [None]:
plot_bbox(image, results['<CAPTION_TO_PHRASE_GROUNDING>'])

## Referring expression segmentation
- Detects the image region for a referring expression as a segmentation mask
- input: image, task_prompt, text_input (the text expression that refers to the object to segment)
- output: polygon mask

Referring expression segmentation results format: {'<REFERRING_EXPRESSION_SEGMENTATION>': {'polygons': [[[polygon]], ...], 'labels': ['', '', ...]}}. One object is represented by a list of polygons; each polygon is [x1, y1, x2, y2, ..., xn, yn].
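
Because each polygon is a flat coordinate list, it usually has to be regrouped into (x, y) vertices before further processing. The following is a small sketch of that step, plus an area computation with the shoelace formula. The helper names and the rectangle used as input are illustrative assumptions, not part of the Florence-2 API.

```python
def polygon_vertices(flat_polygon):
    """Turn [x1, y1, x2, y2, ..., xn, yn] into [(x1, y1), ..., (xn, yn)]."""
    if len(flat_polygon) % 2 != 0:
        raise ValueError('expected an even number of coordinates')
    return list(zip(flat_polygon[0::2], flat_polygon[1::2]))

def polygon_area(flat_polygon):
    """Area of a simple (non self-intersecting) polygon via the shoelace formula."""
    pts = polygon_vertices(flat_polygon)
    total = 0.0
    # Pair every vertex with the next one, wrapping around to the first.
    for (xa, ya), (xb, yb) in zip(pts, pts[1:] + pts[:1]):
        total += xa * yb - xb * ya
    return abs(total) / 2.0

# A 4 x 3 rectangle written in the flat polygon format (made-up input).
rect = [0, 0, 4, 0, 4, 3, 0, 3]
print(polygon_vertices(rect))
print(polygon_area(rect))
```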

In [None]:
task_prompt = '<REFERRING_EXPRESSION_SEGMENTATION>'
results = run_example(image, task_prompt, text_input="a green car")
# print(results)

In [None]:
from PIL import Image, ImageDraw, ImageFont 
import random
import numpy as np
colormap = ['blue','orange','green','purple','brown','pink','gray','olive','cyan','red',
            'lime','indigo','violet','aqua','magenta','coral','gold','tan','skyblue']

# Draw the polygons output by segmentation on top of the image
def draw_polygons(image, prediction, fill_mask=False):
    """
    Draws segmentation masks with polygons on an image.

    Parameters:
    - image: PIL image to draw on (it is modified in place).
    - prediction: Dictionary containing 'polygons' and 'labels' keys.
                  'polygons' is a list of lists, each containing vertices of a polygon.
                  'labels' is a list of labels corresponding to each polygon.
    - fill_mask: Boolean indicating whether to fill the polygons with color.
    """
    # Draw directly on the given image
    draw = ImageDraw.Draw(image)
      
   
    # Set up scale factor if needed (use 1 if not scaling)  
    scale = 1  
      
    # Iterate over polygons and labels  
    for polygons, label in zip(prediction['polygons'], prediction['labels']):  
        color = random.choice(colormap)  
        fill_color = random.choice(colormap) if fill_mask else None  
          
        for _polygon in polygons:  
            _polygon = np.array(_polygon).reshape(-1, 2)  
            if len(_polygon) < 3:  
                print('Invalid polygon:', _polygon)  
                continue  
              
            _polygon = (_polygon * scale).reshape(-1).tolist()  
              
            # Draw the polygon  
            if fill_mask:  
                draw.polygon(_polygon, outline=color, fill=fill_color)  
            else:  
                draw.polygon(_polygon, outline=color)  
              
            # Draw the label text  
            draw.text((_polygon[0] + 8, _polygon[1] + 2), label, fill=color)  
  
    # Save or display the image  
    #image.show()  # Display the image  
    display(image)

In [None]:
output_image = copy.deepcopy(image)
draw_polygons(output_image, results['<REFERRING_EXPRESSION_SEGMENTATION>'], fill_mask=True)  

## Region to segmentation
- Segmentation within a given region

The region is passed as an additional input in the format '<loc_x1><loc_y1><loc_x2><loc_y2>', where [x1, y1, x2, y2] are the quantized coordinates in [0, 999].
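
A rough sketch of converting between pixel boxes and this token format is shown below. It assumes a simple linear mapping of each coordinate onto 1000 bins with flooring; the processor's own quantizer may round differently, so treat the helper names and the rounding rule as assumptions rather than the library's behavior.

```python
import re

NUM_BINS = 1000  # coordinates are quantized to integers in [0, 999]

def box_to_loc_tokens(box, image_width, image_height, bins=NUM_BINS):
    """Convert a pixel box [x1, y1, x2, y2] to '<loc_x1><loc_y1><loc_x2><loc_y2>'."""
    sizes = (image_width, image_height, image_width, image_height)
    tokens = []
    for value, size in zip(box, sizes):
        q = int(value * bins / size)   # assumed rule: floor of the scaled coordinate
        q = min(max(q, 0), bins - 1)   # clamp to the valid bin range
        tokens.append(f'<loc_{q}>')
    return ''.join(tokens)

def loc_tokens_to_box(text, image_width, image_height, bins=NUM_BINS):
    """Convert four '<loc_N>' tokens back to approximate pixel coordinates."""
    q = [int(n) for n in re.findall(r'<loc_(\d+)>', text)]
    if len(q) != 4:
        raise ValueError('expected exactly four <loc_N> tokens')
    sizes = (image_width, image_height, image_width, image_height)
    return [value * size / bins for value, size in zip(q, sizes)]

# Made-up image size and box, chosen so the numbers are easy to check by hand.
tokens = box_to_loc_tokens((702, 287.5, 866, 386), image_width=1000, image_height=500)
print(tokens)
print(loc_tokens_to_box(tokens, image_width=1000, image_height=500))
```

The inverse mapping scales each bin index by `size / bins`, which is the same arithmetic as multiplying the image width or height by a fraction such as 0.702.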

In [None]:
task_prompt = '<REGION_TO_SEGMENTATION>'
results = run_example(image, task_prompt, text_input="<loc_702><loc_575><loc_866><loc_772>")
# print(results)

In [None]:
input_region = {}
input_region['bboxes'] = [[0.702*image.width, 0.575*image.height, 0.866*image.width, 0.772*image.height ]]
input_region['labels']= ['']
plot_bbox(image, input_region)

In [None]:
output_image = copy.deepcopy(image)
draw_polygons(output_image, results['<REGION_TO_SEGMENTATION>'], fill_mask=True) 

## Open vocabulary detection

Open vocabulary detection can detect both objects and OCR texts.

results format:

{ '<OPEN_VOCABULARY_DETECTION>': {'bboxes': [[x1, y1, x2, y2], [x1, y1, x2, y2], ...], 'bboxes_labels': ['label_1', 'label_2', ...], 'polygons': [[[x1, y1, x2, y2, ..., xn, yn], [x1, y1, ..., xn, yn]], ...], 'polygons_labels': ['label_1', 'label_2', ...] }}

In [None]:
task_prompt = '<OPEN_VOCABULARY_DETECTION>'
results = run_example(image, task_prompt, text_input="a green car")
print(results)

In [None]:
def convert_to_od_format(data):  
    """  
    Converts a dictionary with 'bboxes' and 'bboxes_labels' into a dictionary with separate 'bboxes' and 'labels' keys.  
  
    Parameters:  
    - data: The input dictionary with 'bboxes', 'bboxes_labels', 'polygons', and 'polygons_labels' keys.  
  
    Returns:  
    - A dictionary with 'bboxes' and 'labels' keys formatted for object detection results.  
    """  
    # Extract bounding boxes and labels  
    bboxes = data.get('bboxes', [])  
    labels = data.get('bboxes_labels', [])  
      
    # Construct the output format  
    od_results = {  
        'bboxes': bboxes,  
        'labels': labels  
    }  
      
    return od_results  

In [None]:
bbox_results  = convert_to_od_format(results['<OPEN_VOCABULARY_DETECTION>'])

In [None]:
plot_bbox(image, bbox_results)

### region to texts

In [None]:
task_prompt = '<REGION_TO_CATEGORY>'
results = run_example(image, task_prompt, text_input="<loc_52><loc_332><loc_932><loc_774>")
print(results)

### ocr related tasks
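
The `<OCR_WITH_REGION>` task used in this section returns each text region as a quadrilateral `[x1, y1, x2, y2, x3, y3, x4, y4]` under the `quad_boxes` key. When only an axis-aligned box is needed, one possible approach is to take the min and max of the corner coordinates, as sketched below. The helper names and sample values are hypothetical.

```python
def quad_to_bbox(quad):
    """Axis-aligned [x_min, y_min, x_max, y_max] that encloses a quad box."""
    xs, ys = quad[0::2], quad[1::2]
    return [min(xs), min(ys), max(xs), max(ys)]

def ocr_to_od_format(ocr_result):
    """Convert a {'quad_boxes', 'labels'} result to {'bboxes', 'labels'}."""
    return {'bboxes': [quad_to_bbox(q) for q in ocr_result['quad_boxes']],
            'labels': list(ocr_result['labels'])}

# Hypothetical values in the documented format, not real model output.
sample = {'quad_boxes': [[10, 5, 60, 15, 58, 30, 8, 20]], 'labels': ['TITLE']}
print(ocr_to_od_format(sample))
```

A dictionary in this shape has the `bboxes` and `labels` keys that `plot_bbox` expects, at the cost of losing the rotation information of the original quad.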

In [None]:
# if you can directly access the file, you may use this
#url = "http://ecx.images-amazon.com/images/I/51UUzBDAMsL.jpg?download=true"
#image = Image.open(requests.get(url, stream=True).raw).convert('RGB')

In [None]:
# to use local file access
image_path = '/group-volume/sr_edu/AI-Application-Specialist-Vision-Dataset/hf-assets/'

In [None]:
!ls $image_path

In [None]:
image = Image.open(os.path.join(image_path, '51UUzBDAMsL.jpg'))
plt.imshow(image)
plt.axis('off')
plt.title(f'image size:{image.size}')
plt.show()

In [None]:
task_prompt = '<OCR>'
run_example(image, task_prompt)

In [None]:
task_prompt = '<OCR_WITH_REGION>'
results = run_example(image, task_prompt)
print(results)
# OCR results format
# {'<OCR_WITH_REGION>': {'quad_boxes': [[x1, y1, x2, y2, x3, y3, x4, y4], ...], 'labels': ['text1', ...]}}
# quad_boxes: the four corner points of each detected text region, clockwise starting from the top-left; this can represent e.g. rotated text lines

In [None]:
def draw_ocr_bboxes(image, prediction, scale=1):
    draw = ImageDraw.Draw(image)
    bboxes, labels = prediction['quad_boxes'], prediction['labels']
    for box, label in zip(bboxes, labels):
        color = random.choice(colormap)
        new_box = (np.array(box) * scale).tolist()
        draw.polygon(new_box, width=3, outline=color)
        # Optionally draw the recognized text next to each box:
        # draw.text((new_box[0]+8, new_box[1]+2),
        #             "{}".format(label),
        #             align="right",
        #             fill=color)
    display(image)

In [None]:
output_image = copy.deepcopy(image)
w, h = output_image.size
scale = 800 / max(w, h)
new_output_image = output_image.resize((int(w * scale), int(h * scale)))
draw_ocr_bboxes(new_output_image, results['<OCR_WITH_REGION>'], scale=scale)  

## Cascaded tasks

### Caption + Phrase Grounding
- Captions the input image, then visually grounds the phrases used in that caption, so the evidence behind the caption can also be seen
- Consists of 2 steps

Caption + Phrase Grounding results format:

{ '<CAPTION>': pure_text, '<CAPTION_TO_PHRASE_GROUNDING>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['', '', ...]} }
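
The two-step pattern can be wrapped in one helper so that the caption level is just a parameter. The sketch below takes the prediction function as an argument; in this notebook that would be `run_example`. The helper name `caption_then_ground` and the stand-in `fake_run` are assumptions added for illustration, and `fake_run` only mimics the result shapes so the sketch runs without the model.

```python
def caption_then_ground(run, image, caption_prompt='<CAPTION>'):
    """Caption the image, then ground the caption's phrases in the same image."""
    grounding_prompt = '<CAPTION_TO_PHRASE_GROUNDING>'
    # Step 1: image captioning
    caption = run(image, caption_prompt)[caption_prompt]
    # Step 2: visual grounding with the caption as text input
    results = run(image, grounding_prompt, caption)
    # Keep the caption next to the grounding result
    results[caption_prompt] = caption
    return results

# Stand-in for run_example that returns placeholder data in the documented shapes.
def fake_run(image, task_prompt, text_input=None):
    if task_prompt == '<CAPTION_TO_PHRASE_GROUNDING>':
        return {task_prompt: {'bboxes': [[0, 0, 1, 1]], 'labels': [text_input]}}
    return {task_prompt: 'a placeholder caption'}

print(caption_then_ground(fake_run, image=None))
```

With the model loaded, a call such as `caption_then_ground(run_example, image, '<DETAILED_CAPTION>')` would cover the detailed variants as well.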

In [None]:
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

In [None]:
# step 1. image captioning
task_prompt = '<CAPTION>'
results = run_example(image, task_prompt)
text_input = results[task_prompt]
# step 2. visual grounding, using the caption as text input
task_prompt = '<CAPTION_TO_PHRASE_GROUNDING>'
results = run_example(image, task_prompt, text_input)
results['<CAPTION>'] = text_input


In [None]:
results

In [None]:
plot_bbox(image, results['<CAPTION_TO_PHRASE_GROUNDING>'])

### Detailed Caption + Phrase Grounding
- Similar to the method above, but uses detailed_caption: it produces a detailed description of the image and visually grounds the expressions in that description
- Consists of 2 steps: detailed_caption, caption_to_phrase_grounding

results format:

{ '<DETAILED_CAPTION>': pure_text, '<CAPTION_TO_PHRASE_GROUNDING>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['', '', ...]} }

In [None]:
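# step 1. detailed image captioning (run_example takes the image as its first argument)
task_prompt = '<DETAILED_CAPTION>'
results = run_example(image, task_prompt)
text_input = results[task_prompt]
# step 2. visual grounding, using the detailed caption as text input
task_prompt = '<CAPTION_TO_PHRASE_GROUNDING>'
results = run_example(image, task_prompt, text_input)
results['<DETAILED_CAPTION>'] = text_input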

In [None]:
results

In [None]:
plot_bbox(image, results['<CAPTION_TO_PHRASE_GROUNDING>'])

### More Detailed Caption + Phrase Grounding
- Uses 'MORE_DETAILED_CAPTION' to describe the image in text, then visually grounds the phrases used in that description

results format:

{ '<MORE_DETAILED_CAPTION>': pure_text, '<CAPTION_TO_PHRASE_GROUNDING>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['', '', ...]} }

In [None]:
task_prompt = '<MORE_DETAILED_CAPTION>'
results = run_example(image, task_prompt)
text_input = results[task_prompt]
task_prompt = '<CAPTION_TO_PHRASE_GROUNDING>'
results = run_example(image, task_prompt, text_input)
results['<MORE_DETAILED_CAPTION>'] = text_input

In [None]:
results

In [None]:
plot_bbox(image, results['<CAPTION_TO_PHRASE_GROUNDING>'])

In [None]:
!nvidia-smi