# Data preparation

***This notebook works best with the `conda_python3` kernel on an `ml.t3.large` instance.***

---

In this notebook, we download a publicly available slide deck and convert it into images, one per slide. The images are then stored in Amazon S3, from where they can be made available to an Amazon SageMaker endpoint for inference.

The slide deck chosen is [Train and deploy Stable Diffusion using AWS Trainium & AWS Inferentia](https://d1.awsstatic.com/events/Summits/torsummit2023/CMP301_TrainDeploy_E1_20230607_SPEdited.pdf). To use a different slide deck, update the `SLIDE_DECK` variable in [`globals.py`](./globals.py).
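
`globals.py` is not reproduced in this notebook. As a rough sketch, and assuming it holds plain module-level constants, it might look like the following. Only the names (`SLIDE_DECK`, `IMAGE_DIR`, `BUCKET_IMG_PREFIX`, `CFN_STACK_NAME`) and the slide deck URL come from this notebook. The other values are illustrative assumptions: the image directory and prefix are inferred from the upload logs in Step 3, and the stack name is a placeholder.

```python
# Hypothetical sketch of globals.py; the real file may differ.

# URL of the slide deck to download (value taken from this notebook).
SLIDE_DECK: str = "https://d1.awsstatic.com/events/Summits/torsummit2023/CMP301_TrainDeploy_E1_20230607_SPEdited.pdf"

# Local directory for the extracted slide images (assumed, inferred from the logs).
IMAGE_DIR: str = "img"

# S3 key prefix under which images are uploaded (assumed, inferred from the logs).
BUCKET_IMG_PREFIX: str = "multimodal/img"

# Name of the CloudFormation stack that created the bucket (placeholder value).
CFN_STACK_NAME: str = "<your-cloudformation-stack-name>"
```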


## Step 1. Setup

Install the required Python packages and import the relevant modules.

In [1]:
import sys
!{sys.executable} -m pip install -r requirements.txt

Collecting sagemaker==2.199.0 (from -r requirements.txt (line 1))
  Using cached sagemaker-2.199.0.tar.gz (1.0 MB)
  Preparing metadata (setup.py) ... done
Collecting pypdfium2==4.24.0 (from -r requirements.txt (line 2))
  Downloading pypdfium2-4.24.0-py3-none-manylinux_2_17_x86_64.whl.metadata (45 kB)
Collecting httplib2==0.19.0 (from -r requirements.txt (line 3))
  Downloading httplib2-0.19.0-py3-none-any.whl.metadata (2.2 kB)
Collecting langchain==0.0.340 (from -r requirements.txt (line 4))
  Downloading langchain-0.0.340-py3-none-any.whl.metadata (16 kB)
Collecting pandas==1.5.3 (from -r requirements.txt (line 6))
  Downloading pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting boto3==1.34.0 (from -r requirements.txt (line 7))
  Downloading boto3-1.34.0-py3-none-any.whl.metadata (6.6 kB)
Collecting

In [2]:
import os
import json
import glob
import boto3
import base64
import logging
import sagemaker
import globals as g
from PIL import Image
import requests as req
from typing import List
from pathlib import Path
import pypdfium2 as pdfium
from utils import upload_to_s3, get_bucket_name

logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [3]:
bucket_name: str = get_bucket_name(g.CFN_STACK_NAME)
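
`get_bucket_name` is defined in `utils.py` and is not shown here. One plausible way to implement such a lookup, assuming the CloudFormation stack exposes the bucket name as a stack output, is to read that output from a `describe_stacks` response. The output key `BucketName` and the function name `bucket_from_stack_response` below are hypothetical.

```python
from typing import Any, Dict, Optional


def bucket_from_stack_response(response: Dict[str, Any],
                               output_key: str = "BucketName") -> Optional[str]:
    """Return the value of `output_key` from a describe_stacks response, or None."""
    for stack in response.get("Stacks", []):
        for output in stack.get("Outputs", []):
            if output.get("OutputKey") == output_key:
                return output.get("OutputValue")
    return None


# Usage against a live stack (requires AWS credentials; not run here):
#   import boto3
#   cfn = boto3.client("cloudformation")
#   response = cfn.describe_stacks(StackName=g.CFN_STACK_NAME)
#   bucket_name = bucket_from_stack_response(response)
```

Keeping the parsing separate from the AWS call makes the lookup easy to check against a hand-written response dictionary.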

## Step 2. Download slide deck and convert it into images

We download a publicly available slide deck and convert each slide into a `jpg` file using the [`pypdfium2`](https://pypi.org/project/pypdfium2/) package.

In [4]:
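def get_images(file: str, image_dir: str) -> List[str]:
    """
    Render each page of a PDF as a JPEG image and save it to a specified directory
    :param file: Path to the PDF file
    :param image_dir: Directory in which the images are saved (created if missing)
    :return: A list of paths to the saved images
    """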

    # Get presentation
    pdf = pdfium.PdfDocument(file)
    n_pages = len(pdf)

    # Extracting file name and creating the directory for images
    file_name = Path(file).stem  # Gets the file name without extension
    os.makedirs(image_dir, exist_ok=True)

    # Get images
    image_paths = []
    print(f"Extracting {n_pages} images for {file}")
    for page_number in range(n_pages):
        page = pdf.get_page(page_number)
        bitmap = page.render(scale=1, rotation=0, crop=(0, 0, 0, 0))
        pil_image = bitmap.to_pil()

        # Saving the image with the specified naming convention
        image_path = os.path.join(image_dir, f"{file_name}_image_{page_number + 1}.jpg")
        pil_image.save(image_path, format="JPEG")
        image_paths.append(image_path)

    return image_paths
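
The `scale` argument passed to `page.render` controls the output resolution. PDF page dimensions are normally expressed in points (1/72 inch), so `scale=1` corresponds to roughly 72 DPI, which can leave small text on a slide hard to read. The helper below is a small sketch of how a target DPI could be converted into a scale factor and the resulting pixel size estimated. The function names and the 960 x 540 point page size are illustrative assumptions, not part of this notebook.

```python
import math
from typing import Tuple

POINTS_PER_INCH = 72  # PDF canvas units per inch (the usual default)


def dpi_to_scale(dpi: float) -> float:
    """Convert a target resolution in DPI into a render scale factor."""
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt: float, height_pt: float, scale: float) -> Tuple[int, int]:
    """Estimate the pixel size of a page rendered at the given scale (rounded up)."""
    return math.ceil(width_pt * scale), math.ceil(height_pt * scale)


scale = dpi_to_scale(144)                # 2.0
print(rendered_size(960, 540, scale))    # (1920, 1080)
```

In `get_images`, this would mean passing `scale=dpi_to_scale(144)` to `page.render` instead of `scale=1`. Larger images take more storage and may increase inference time, so the right value depends on how legible the slides need to be.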

Download a publicly available slide deck.

In [5]:
url: str = g.SLIDE_DECK
print(g.SLIDE_DECK)
local_file: str = os.path.basename(url)
r = req.get(url, allow_redirects=True)
if r.status_code == 200:
    logger.info(f"{url} downloaded successfully")
    with open(local_file, "wb") as f:
        f.write(r.content)
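    logger.info(f"{url} written to {local_file}")
else:
    # Report the failure instead of silently continuing without a local file
    logger.error(f"failed to download {url}, status code={r.status_code}")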

https://d1.awsstatic.com/events/Summits/torsummit2023/CMP301_TrainDeploy_E1_20230607_SPEdited.pdf


[2024-03-22 22:14:42,840] p15184 {4264793631.py:6} INFO - https://d1.awsstatic.com/events/Summits/torsummit2023/CMP301_TrainDeploy_E1_20230607_SPEdited.pdf downloaded successfully
[2024-03-22 22:14:42,846] p15184 {4264793631.py:9} INFO - https://d1.awsstatic.com/events/Summits/torsummit2023/CMP301_TrainDeploy_E1_20230607_SPEdited.pdf written to CMP301_TrainDeploy_E1_20230607_SPEdited.pdf
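
Note that `os.path.basename(url)` works for this URL because it ends in the file name. If a different slide deck URL carries a query string or fragment, those would end up in the local file name. A hedged sketch of a more defensive alternative, using only the standard library (the function name `local_name_from_url` and the fallback name are hypothetical):

```python
import os
from urllib.parse import unquote, urlparse


def local_name_from_url(url: str, default: str = "slide_deck.pdf") -> str:
    """Derive a local file name from the path component of a URL."""
    path = urlparse(url).path          # drops the query string and fragment
    name = os.path.basename(unquote(path))
    return name or default             # fall back if the path ends with "/"


print(local_name_from_url("https://example.com/decks/talk.pdf?version=2"))  # talk.pdf
```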


Extract images from the slide deck.

In [6]:
images = get_images(local_file, g.IMAGE_DIR)
logger.info(f"there are {len(images)} images extracted from this slide deck {local_file}")

Extracting 31 images for CMP301_TrainDeploy_E1_20230607_SPEdited.pdf


[2024-03-22 22:14:46,440] p15184 {2461950259.py:2} INFO - there are 31 images extracted from this slide deck CMP301_TrainDeploy_E1_20230607_SPEdited.pdf


## Step 3. Upload the images to an Amazon S3 bucket

Now we upload the images to an S3 bucket. This is done for two reasons:
1. In a production environment, these images could be processed in parallel by a batch job.
1. An S3 bucket (as part of a data lake) provides a secure location for an enterprise to store these images, and a multimodal model can read the image files directly from the bucket.
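
`upload_to_s3` comes from `utils.py` and is not shown in this notebook. Judging from the log output below, it uploads each file under `<prefix>/<file name>`. A minimal sketch of such a helper, assuming a boto3 S3 client is passed in (the function names `s3_key_for` and `upload_image` are hypothetical, and the real helper may differ):

```python
import os


def s3_key_for(local_path: str, prefix: str) -> str:
    """Build the object key as <prefix>/<file name>, ignoring local directories."""
    return f"{prefix.strip('/')}/{os.path.basename(local_path)}"


def upload_image(s3_client, local_path: str, bucket: str, prefix: str) -> str:
    """Upload one file with the given S3 client and return the object key."""
    key = s3_key_for(local_path, prefix)
    # upload_file(Filename, Bucket, Key) is the standard boto3 S3 client call.
    s3_client.upload_file(local_path, bucket, key)
    return key


# Usage (requires AWS credentials; not run here):
#   import boto3
#   s3 = boto3.client("s3")
#   for img_path in images:
#       upload_image(s3, img_path, bucket_name, g.BUCKET_IMG_PREFIX)
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object before pointing it at a real bucket.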

In [7]:
# Upload each extracted image to the bucket under the configured prefix
for img_path in images:
    upload_to_s3(img_path, bucket_name, g.BUCKET_IMG_PREFIX)

[2024-03-22 22:14:46,543] p15184 {utils.py:37} INFO - File img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_1.jpg uploaded to multimodal-blog2-bucket-597703351594/multimodal/img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_1.jpg.
[2024-03-22 22:14:46,584] p15184 {utils.py:37} INFO - File img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_2.jpg uploaded to multimodal-blog2-bucket-597703351594/multimodal/img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_2.jpg.
[2024-03-22 22:14:46,630] p15184 {utils.py:37} INFO - File img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_3.jpg uploaded to multimodal-blog2-bucket-597703351594/multimodal/img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_3.jpg.
[2024-03-22 22:14:46,661] p15184 {utils.py:37} INFO - File img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_4.jpg uploaded to multimodal-blog2-bucket-597703351594/multimodal/img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_4.jpg.
[2024-03-22 22:14:46,714] p15184 {utils.py:37} INFO - File img/CMP301_Tr