# Data preparation

***This notebook works best with the `conda_python3` kernel on an `ml.t3.large` instance***.

---

In this notebook we download a publicly available slide deck and convert it into images, one per slide. The images are then stored in Amazon S3, from where they can be made available to an Amazon SageMaker endpoint for inference.

The slide deck chosen is [Train and deploy Stable Diffusion using AWS Trainium & AWS Inferentia](https://d1.awsstatic.com/events/Summits/torsummit2023/CMP301_TrainDeploy_E1_20230607_SPEdited.pdf). To use a different slide deck, update the `SLIDE_DECK` variable in [`globals.py`](./globals.py).


## Step 1. Setup

Install the required Python packages and import the modules used in this notebook.

In [1]:
import sys
!{sys.executable} -m pip install -r requirements.txt



In [2]:
import os
import json
import yaml
import logging
import globals as g
from PIL import Image
import requests as req
from typing import List
from pathlib import Path
import pypdfium2 as pdfium
from utils import upload_to_s3, get_bucket_name

logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [3]:
# global constants
CONFIG_FILE_PATH = "config.yaml"
# read the config yaml file
fpath = CONFIG_FILE_PATH
with open(fpath, 'r') as yaml_in:
    config = yaml.safe_load(yaml_in)
logger.info(f"config read from {fpath} -> {json.dumps(config, indent=2)}")

[2024-06-17 18:05:09,737] p13945 {3725478428.py:7} INFO - config read from config.yaml -> {
  "app_name": "multi-modal-rag-bedrock",
  "aws": {
    "cfn_stack_name": "multi-modal-revised",
    "os_service": "aoss"
  },
  "pdf_dir_info": {
    "source_pdf_dir": "pdf_data",
    "pdf_img_path": "images",
    "slide_deck_img_dir": "img",
    "pdf_txt_path": "text_files",
    "pdf_extracted_data": "pdf_extracted_data",
    "json_img_dir": "pdf_img_json_dir",
    "json_txt_dir": "pdf_text_json_dir",
    "bucket_prefix": "multimodal",
    "bucket_img_prefix": "img",
    "qna_dir": "question_answer_files",
    "image_format": "JPEG",
    "prompts": "prompt_templates",
    "manually_saved_images_path": "manually_saved_imgs",
    "manually_saved_images_provided": false,
    "processed_prompts_for_eval": "processed_llm_judge_evaluation_prompts.csv",
    "judge_model_eval_completions": "data/model_eval_completions",
    "llm_as_a_judge_completions": "llm_as_a_judge_completions.csv",
    "index_res

In [4]:
bucket_name: str = get_bucket_name(config['aws']['cfn_stack_name'])
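The cells in this notebook index into `config` with chained keys such as `config['aws']['cfn_stack_name']`, so a missing key in `config.yaml` surfaces as a bare `KeyError` that does not say where in the file the problem is. The following is a minimal sketch of a fail-fast lookup; `get_nested` is a hypothetical helper and is not part of this repository.

```python
from typing import Any, Dict


def get_nested(config: Dict[str, Any], *keys: str) -> Any:
    """Walk nested dict keys and report the full dotted path on failure.

    Hypothetical helper for illustration; not part of the notebook's utils.
    """
    node: Any = config
    for depth, key in enumerate(keys):
        if not isinstance(node, dict) or key not in node:
            # Name the path up to and including the key that was not found.
            dotted = ".".join(keys[: depth + 1])
            raise KeyError(f"missing config key: {dotted}")
        node = node[key]
    return node


# Example with a small inline config that mirrors the structure logged above.
example_config = {"aws": {"cfn_stack_name": "multi-modal-revised", "os_service": "aoss"}}
stack_name = get_nested(example_config, "aws", "cfn_stack_name")
print(stack_name)
```

With the real `config` object, the same call would be `get_nested(config, 'aws', 'cfn_stack_name')`.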

## Step 2. Download slide deck and convert it into images

We download a publicly available slide deck and convert each slide into a `jpg` file using the [`pypdfium2`](https://pypi.org/project/pypdfium2/) package.

In [5]:
def get_images(file: str, image_dir: str) -> List[str]:
    """
    Render each page of a PDF as a JPEG image and save it to a directory
    :param file: Path to the PDF file
    :param image_dir: Directory in which the images are saved (created if missing)
    :return: A list of paths to the saved images, in page order
    """

    # Open the presentation
    pdf = pdfium.PdfDocument(file)
    n_pages = len(pdf)

    # Extract the file name and create the directory for the images
    file_name = Path(file).stem  # file name without extension
    os.makedirs(image_dir, exist_ok=True)

    # Render and save one image per page
    image_paths = []
    print(f"Extracting {n_pages} images for {file}")
    for page_number in range(n_pages):
        page = pdf.get_page(page_number)
        # render at scale 2.5 (instead of the default 1) for a higher-resolution image
        bitmap = page.render(scale=2.5, rotation=0, crop=(0, 0, 0, 0))
        pil_image = bitmap.to_pil()

        # Save the image as <file name>_image_<1-based page number>.jpg
        image_path = os.path.join(image_dir, f"{file_name}_image_{page_number + 1}.jpg")
        pil_image.save(image_path, format="JPEG")
        image_paths.append(image_path)

    return image_paths
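To make the effect of `scale=2.5` concrete: a PDF page is measured in points (1/72 inch), and `pypdfium2` treats a scale of 1 as one pixel per point, that is 72 DPI. The sketch below works through the arithmetic. The 960 x 540 point page size is an assumed example of a 16:9 slide, not a value read from this deck, and rounding up to whole pixels is an assumption about how fractional sizes are handled.

```python
import math
from typing import Tuple

POINTS_PER_INCH = 72  # PDF user-space units per inch


def effective_dpi(scale: float) -> float:
    """Resolution in dots per inch for a given render scale."""
    return POINTS_PER_INCH * scale


def rendered_size(width_pt: float, height_pt: float, scale: float) -> Tuple[int, int]:
    """Approximate pixel size of a rendered page (assumes rounding up)."""
    return math.ceil(width_pt * scale), math.ceil(height_pt * scale)


# Assumed 16:9 slide of 960 x 540 points, rendered at the scale used above.
print(effective_dpi(2.5))            # 180.0
print(rendered_size(960, 540, 2.5))  # (2400, 1350)
```

A larger scale gives the downstream model more pixels to work with, at the cost of larger JPEG files to store and upload.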

Download a publicly available slide deck.

In [6]:
url: str = config['content_info']['slide_deck'].get('url')
print(url)
local_file: str = os.path.basename(url)
r = req.get(url, allow_redirects=True)
if r.status_code == 200:
    logger.info(f"{url} downloaded successfully")
    with open(local_file, "wb") as f:
        f.write(r.content)
    logger.info(f"{url} written to {local_file}")
else:
    # without the file, the image extraction step below cannot run
    logger.error(f"could not download {url}, status code={r.status_code}")

[2024-06-17 18:05:10,031] p13945 {1589959425.py:6} INFO - https://d1.awsstatic.com/events/Summits/torsummit2023/CMP301_TrainDeploy_E1_20230607_SPEdited.pdf downloaded successfully
[2024-06-17 18:05:10,037] p13945 {1589959425.py:9} INFO - https://d1.awsstatic.com/events/Summits/torsummit2023/CMP301_TrainDeploy_E1_20230607_SPEdited.pdf written to CMP301_TrainDeploy_E1_20230607_SPEdited.pdf


https://d1.awsstatic.com/events/Summits/torsummit2023/CMP301_TrainDeploy_E1_20230607_SPEdited.pdf
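The download cell derives the local file name with `os.path.basename(url)`, which works for this URL but would keep any query string or fragment if a different slide deck URL carried one. A minimal sketch of a more defensive alternative follows; `local_name_from_url` is a hypothetical helper, and the query string in the example is made up for illustration.

```python
import posixpath
from urllib.parse import unquote, urlparse


def local_name_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring query and fragment.

    Hypothetical helper for illustration; not part of the notebook's utils.
    """
    # urlparse separates the path from the query string and fragment.
    path = unquote(urlparse(url).path)
    name = posixpath.basename(path)
    if not name:
        raise ValueError(f"cannot derive a file name from {url!r}")
    return name


deck_url = (
    "https://d1.awsstatic.com/events/Summits/torsummit2023/"
    "CMP301_TrainDeploy_E1_20230607_SPEdited.pdf?example=1"  # made-up query string
)
print(local_name_from_url(deck_url))
```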


Extract images from the slide deck.

In [7]:
images = get_images(local_file, g.IMAGE_DIR)
logger.info(f"there are {len(images)} images extracted from this slide deck {local_file}")

Extracting 31 images for CMP301_TrainDeploy_E1_20230607_SPEdited.pdf


[2024-06-17 18:05:20,336] p13945 {2461950259.py:2} INFO - there are 31 images extracted from this slide deck CMP301_TrainDeploy_E1_20230607_SPEdited.pdf


In [8]:
images

['img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_1.jpg',
 'img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_2.jpg',
 'img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_3.jpg',
 'img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_4.jpg',
 'img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_5.jpg',
 'img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_6.jpg',
 'img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_7.jpg',
 'img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_8.jpg',
 'img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_9.jpg',
 'img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_10.jpg',
 'img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_11.jpg',
 'img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_12.jpg',
 'img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_13.jpg',
 'img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_14.jpg',
 'img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_15.jpg',
 'img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_16.jpg',
 'img/CMP301_TrainDeploy_E1_20230
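The list returned by `get_images` is already in page order. If the file names are listed again later (for example from the local directory or from an S3 listing), a plain string sort would place `_image_10` before `_image_2`. The sketch below sorts by the numeric page suffix instead; it assumes the `<name>_image_<n>.jpg` naming convention used by `get_images`, and the helper names and the short `deck` file names are hypothetical.

```python
import re
from typing import List

# Matches the 1-based page number at the end of names like "deck_image_12.jpg".
_PAGE_RE = re.compile(r"_image_(\d+)\.jpg$")


def page_number(path: str) -> int:
    """Extract the page number from an image path that follows the convention."""
    match = _PAGE_RE.search(path)
    if match is None:
        raise ValueError(f"unexpected image file name: {path}")
    return int(match.group(1))


def sort_by_page(paths: List[str]) -> List[str]:
    """Return the paths ordered by page number rather than alphabetically."""
    return sorted(paths, key=page_number)


unordered = ["img/deck_image_10.jpg", "img/deck_image_2.jpg", "img/deck_image_1.jpg"]
print(sort_by_page(unordered))
```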

## Step 3. Upload the images to an Amazon S3 bucket

Next, we upload the images to an S3 bucket. This is done for two reasons:
1. In a production environment, these images could be processed in parallel by a batch job.
2. An S3 bucket (as part of a data lake) gives an enterprise a secure place to store these images, and a multimodal model can read the image files directly from the bucket.
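The upload itself is handled by the `upload_to_s3` helper from `utils`, whose source is not shown in this notebook. The following is a minimal sketch of what such a helper might look like, not the repository's implementation. It assumes the object key is the prefix joined with the image's base name, and it takes the S3 client as a parameter (for example one created with `boto3.client("s3")`, whose `upload_file(Filename, Bucket, Key)` method is used here). The names `build_s3_key` and `upload_image` are hypothetical.

```python
import posixpath
from typing import Any


def build_s3_key(local_path: str, prefix: str) -> str:
    """Join a key prefix and the file's base name (assumed key layout)."""
    file_name = posixpath.basename(local_path.replace("\\", "/"))
    return f"{prefix.strip('/')}/{file_name}"


def upload_image(s3_client: Any, local_path: str, bucket: str, prefix: str) -> str:
    """Upload one file and return its S3 URI.

    s3_client is expected to behave like a boto3 S3 client, i.e. to expose
    upload_file(Filename, Bucket, Key).
    """
    key = build_s3_key(local_path, prefix)
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Key that the assumed layout would produce for one image.
print(build_s3_key("img/deck_image_1.jpg", "multimodal/img"))
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to exercise without AWS credentials and lets one client be reused across all uploads.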

In [9]:
for img_path in images:
    upload_to_s3(img_path, bucket_name, g.BUCKET_IMG_PREFIX)

[2024-06-17 18:05:20,455] p13945 {utils.py:36} INFO - File img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_1.jpg uploaded to multimodal-blog2-bucket-121797993273-us-west-2/multimodal/img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_1.jpg.
[2024-06-17 18:05:20,576] p13945 {utils.py:36} INFO - File img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_2.jpg uploaded to multimodal-blog2-bucket-121797993273-us-west-2/multimodal/img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_2.jpg.
[2024-06-17 18:05:20,625] p13945 {utils.py:36} INFO - File img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_3.jpg uploaded to multimodal-blog2-bucket-121797993273-us-west-2/multimodal/img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_3.jpg.
[2024-06-17 18:05:20,700] p13945 {utils.py:36} INFO - File img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_4.jpg uploaded to multimodal-blog2-bucket-121797993273-us-west-2/multimodal/img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_4.jpg.
[2024-06-17 18:05:20,894] p13945