# Data preparation

***This notebook works best with the `conda_python3` kernel on an `ml.t3.large` instance***.

---

In this notebook we download a publicly available slide deck and convert it into images, one image per slide. These images are then stored in Amazon S3, from where they can be made available to an Amazon SageMaker endpoint for inference.

The slide deck chosen is [Train and deploy Stable Diffusion using AWS Trainium & AWS Inferentia](https://d1.awsstatic.com/events/Summits/torsummit2023/CMP301_TrainDeploy_E1_20230607_SPEdited.pdf). To use a different slide deck, update the `SLIDE_DECK` variable in [`globals.py`](./globals.py).


## Step 1. Setup

Install the required Python packages and import the relevant modules.

In [1]:
import sys
!{sys.executable} -m pip install -r requirements.txt

Collecting git+https://github.com/haotian-liu/LLaVA.git@v1.1.1 (from -r requirements.txt (line 2))
  Cloning https://github.com/haotian-liu/LLaVA.git (to revision v1.1.1) to /tmp/pip-req-build-kjifnvt9
  Running command git clone --filter=blob:none --quiet https://github.com/haotian-liu/LLaVA.git /tmp/pip-req-build-kjifnvt9
  Running command git checkout -q 1619889c712e347be1cb4f78ec66e7cf414ac1a6
  Resolved https://github.com/haotian-liu/LLaVA.git to commit 1619889c712e347be1cb4f78ec66e7cf414ac1a6
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting huggingface-hub==0.19.4 (from -r requirements.txt (line 3))
  Downloading huggingface_hub-0.19.4-py3-none-any.whl.metadata (14 kB)
Collecting sagemaker==2.199.0 (from -r requirements.txt (line 4))
  Downloading sagemaker-2.199.0.tar.gz (1.0 MB)

In [2]:
import os
import json
import glob
import boto3
import base64
import logging
import sagemaker
import globals as g
from PIL import Image
import requests as req
from typing import List
from pathlib import Path
import pypdfium2 as pdfium
from utils import upload_to_s3, get_bucket_name

logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [3]:
bucket_name: str = get_bucket_name(g.CFN_STACK_NAME)

## Step 2. Download slide deck and convert it into images

We download a publicly available slide deck and convert each slide into a `jpg` file using the [`pypdfium2`](https://pypi.org/project/pypdfium2/) package.

In [4]:
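def get_images(file: str, image_dir: str) -> List[str]:
    """
    Render each page of a PDF to a JPEG file in the specified directory
    :param file: Path to the PDF file
    :param image_dir: Directory in which the page images are saved (created if missing)
    :return: A list of paths to the saved images, in page order
    """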

    # Get presentation
    pdf = pdfium.PdfDocument(file)
    n_pages = len(pdf)

    # Extracting file name and creating the directory for images
    file_name = Path(file).stem  # Gets the file name without extension
    os.makedirs(image_dir, exist_ok=True)

    # Get images
    image_paths = []
    print(f"Extracting {n_pages} images for {file}")
    for page_number in range(n_pages):
        page = pdf.get_page(page_number)
        # scale=1 renders one pixel per PDF point (72 DPI); raise it for sharper images
        bitmap = page.render(scale=1, rotation=0, crop=(0, 0, 0, 0))
        pil_image = bitmap.to_pil()

        # Saving the image with the specified naming convention
        image_path = os.path.join(image_dir, f"{file_name}_image_{page_number + 1}.jpg")
        pil_image.save(image_path, format="JPEG")
        image_paths.append(image_path)

    return image_paths
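
The `scale=1` argument passed to `page.render` above controls the resolution of each slide image. In `pypdfium2` the scale is the number of pixels per PDF canvas unit, and a canvas unit is 1/72 inch by default, so a target DPI converts to a scale by dividing by 72. The sketch below illustrates that arithmetic. The helper names `dpi_to_scale` and `rendered_size` are hypothetical and not part of this notebook, and the pixel dimensions are an estimate (the library's own rounding may differ by a pixel).

```python
import math
from typing import Tuple

POINTS_PER_INCH = 72  # default PDF canvas unit is 1/72 inch


def dpi_to_scale(dpi: float) -> float:
    """Convert a target DPI into a render scale factor (hypothetical helper)."""
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt: float, height_pt: float, scale: float) -> Tuple[int, int]:
    """Estimate the pixel size of a page rendered at the given scale."""
    return math.ceil(width_pt * scale), math.ceil(height_pt * scale)


# Example: a 960 x 540 point (16:9) slide rendered at 144 DPI
scale = dpi_to_scale(144)
print(scale, rendered_size(960, 540, scale))
```

A larger scale produces sharper text in the slide images at the cost of larger files, which may matter if the downstream model has an input size limit.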

Download a publicly available slide deck.

In [5]:
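url: str = g.SLIDE_DECK
print(g.SLIDE_DECK)
local_file: str = os.path.basename(url)
r = req.get(url, allow_redirects=True)
if r.status_code == 200:
    logger.info(f"{url} downloaded successfully")
    with open(local_file, "wb") as f:
        f.write(r.content)
    logger.info(f"{url} written to {local_file}")
else:
    # Surface the failure here so the next cell does not fail on a missing file
    logger.error(f"could not download {url}, status code={r.status_code}")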

https://d1.awsstatic.com/events/Summits/torsummit2023/CMP301_TrainDeploy_E1_20230607_SPEdited.pdf


[2024-03-18 21:17:46,069] p29055 {4264793631.py:6} INFO - https://d1.awsstatic.com/events/Summits/torsummit2023/CMP301_TrainDeploy_E1_20230607_SPEdited.pdf downloaded successfully
[2024-03-18 21:17:46,073] p29055 {4264793631.py:9} INFO - https://d1.awsstatic.com/events/Summits/torsummit2023/CMP301_TrainDeploy_E1_20230607_SPEdited.pdf written to CMP301_TrainDeploy_E1_20230607_SPEdited.pdf
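
Taking `os.path.basename(url)` works for the default deck because its URL ends in the file name. If `SLIDE_DECK` is pointed at a URL that carries a query string or percent-encoded characters, those would end up in the local file name. A minimal sketch of a more tolerant helper, using only the standard library, is shown below. The function name `local_name_from_url` and the fallback name `slide_deck.pdf` are assumptions made for illustration and are not part of this notebook.

```python
import os
from urllib.parse import unquote, urlparse


def local_name_from_url(url: str, default: str = "slide_deck.pdf") -> str:
    """Derive a local file name from the path component of a URL (hypothetical helper)."""
    # urlparse separates the path from any query string or fragment
    path = unquote(urlparse(url).path)
    # Fall back to a default when the URL path does not end in a file name
    return os.path.basename(path) or default


print(local_name_from_url("https://example.com/decks/My%20Deck.pdf?version=2"))
```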


Extract images from the slide deck.

In [6]:
images = get_images(local_file, g.IMAGE_DIR)
logger.info(f"there are {len(images)} images extracted from this slide deck {local_file}")

Extracting 31 images for CMP301_TrainDeploy_E1_20230607_SPEdited.pdf


[2024-03-18 21:17:49,516] p29055 {2461950259.py:2} INFO - there are 31 images extracted from this slide deck CMP301_TrainDeploy_E1_20230607_SPEdited.pdf


## Step 3. Upload the images to an Amazon S3 bucket

Now we upload the images to an S3 bucket. This is done for two reasons:
1. In a production environment, these images could be processed in parallel by a batch process.
1. An S3 bucket (that is part of a data lake) provides a secure location for an enterprise to store these images, and a multimodal model can read the image files directly from the S3 bucket.
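
Uploads are I/O-bound, so for larger slide decks they could also be issued concurrently rather than one at a time. The following is a minimal sketch using a thread pool from the standard library. The name `upload_all` is hypothetical, and the upload function is passed in as a parameter so the sketch does not assume anything about how `upload_to_s3` in `utils.py` is implemented, including whether it is safe to call from several threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def upload_all(paths: Iterable[str], upload_fn: Callable[[str], T], max_workers: int = 8) -> List[T]:
    """Call upload_fn on every path using a thread pool (hypothetical helper).

    Results are returned in the same order as the input paths. An exception
    raised by any upload is re-raised when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload_fn, paths))
```

In this notebook it might be called as `upload_all(images, lambda p: upload_to_s3(p, bucket_name, g.BUCKET_IMG_PREFIX))`, assuming `upload_to_s3` tolerates concurrent calls.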

In [7]:
for img_path in images:
    upload_to_s3(img_path, bucket_name, g.BUCKET_IMG_PREFIX)

[2024-03-18 21:17:49,758] p29055 {utils.py:37} INFO - File img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_1.jpg uploaded to multimodal-blog2-bucket-597703351594/multimodal/img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_1.jpg.
[2024-03-18 21:17:49,838] p29055 {utils.py:37} INFO - File img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_2.jpg uploaded to multimodal-blog2-bucket-597703351594/multimodal/img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_2.jpg.
[2024-03-18 21:17:49,939] p29055 {utils.py:37} INFO - File img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_3.jpg uploaded to multimodal-blog2-bucket-597703351594/multimodal/img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_3.jpg.
[2024-03-18 21:17:50,018] p29055 {utils.py:37} INFO - File img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_4.jpg uploaded to multimodal-blog2-bucket-597703351594/multimodal/img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_4.jpg.
[2024-03-18 21:17:50,327] p29055 {utils.py:37} INFO - File img/CMP301_Tr