# Formatted Zone (Images)

This notebook contains the scripts that extract images from the persistent landing zone, process them and store them in the formatted zone. The formatted zone is a separate bucket that replicates the folder structure of the persistent landing zone. The difference is that the data format in the formatted zone has been homogenized, as one of the steps of our data pipeline.

This notebook covers image data only (the equivalent notebooks for the other types of data can be found in the same folder). Specifically, the scripts below are responsible for the following tasks:
1. Extraction of images from the persistent landing zone.
2. Homogenization of the data. In this case, that consists of ensuring that all images are converted to .png files.
3. Storage of the formatted data in the formatted zone.
4. Deletion from the persistent landing zone, provided the data has been processed correctly.

## 1. Extraction of images from persistent landing zone

First, we will connect to MinIO and prepare the new bucket with the image folder:

In [None]:
import boto3
import os
from dotenv import load_dotenv

load_dotenv()
access_key_id = os.getenv("ACCESS_KEY_ID")
secret_access_key = os.getenv("SECRET_ACCESS_KEY")
minio_url = "http://" + os.getenv("S3_API_ENDPOINT")


minio_client = boto3.client(
    "s3",
    aws_access_key_id=access_key_id,
    aws_secret_access_key=secret_access_key,
    endpoint_url=minio_url
)

new_bucket = "formatted-zone"
try:
    minio_client.create_bucket(Bucket=new_bucket)
except (minio_client.exceptions.BucketAlreadyExists, minio_client.exceptions.BucketAlreadyOwnedByYou):
    print(f"Bucket '{new_bucket}' already exists")

In [None]:

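from botocore.exceptions import ClientError

# Bucket that collects files whose format the pipeline does not support
name_bucket = "dump"
try:
    minio_client.head_bucket(Bucket=name_bucket)
    print(f"Bucket '{name_bucket}' already exists")
except ClientError:
    # head_bucket raises ClientError when the bucket is missing or inaccessible
    minio_client.create_bucket(Bucket=name_bucket)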

Next, the following script goes through every image in the persistent landing zone and checks its format. If the format is supported by our pipeline, the image is converted to PNG, stored in the formatted zone and ultimately deleted from its previous location. If the format is not supported, the script does not know how to proceed: it copies the file to the `dump` bucket and skips it, so the data is left in the persistent landing zone. We, as designers, believe that this is the best approach to dealing with these anomalies: they are neither allowed to continue to the following phases of the pipeline nor deleted, but kept until a decision is made.

The supported image formats are .png, .jpg/.jpeg, .bmp, .tif/.tiff, .gif and .webp.
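
The extension check and the renaming to .png can be illustrated in isolation. The following is a minimal sketch, not the notebook's own code: the helper name `target_key` is hypothetical, and unlike the loop in this notebook it keeps any subfolders of the original key, on the assumption that the formatted zone should mirror the folder structure of the persistent landing zone.

```python
import posixpath

# Same set of extensions the notebook treats as supported
SUPPORTED_FORMATS = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".tif", ".gif", ".webp"}


def target_key(key):
    """Return the .png key for a supported image, or None if unsupported.

    Hypothetical helper for illustration. Object keys use "/" as the
    separator, so posixpath is used instead of os.path.
    """
    stem, extension = posixpath.splitext(key)
    if extension.lower() not in SUPPORTED_FORMATS:
        return None
    return f"{stem}.png"


print(target_key("imagenes/cats/photo.JPG"))
print(target_key("imagenes/notes.txt"))
```

The check is case-insensitive, so `photo.JPG` maps to `imagenes/cats/photo.png`, while `notes.txt` yields `None` and would be treated as an unsupported file.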

In [None]:
from PIL import Image
import io
from tqdm import tqdm

bucket_origin = "persistent-landing"
bucket_destination = "formatted-zone"
path = "imagenes/"

# Supported image formats
formats = {'.png', '.jpg', '.jpeg', '.bmp', '.tiff', '.tif', '.gif', '.webp'}

paginator = minio_client.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=bucket_origin, Prefix=path):
    for obj in tqdm(page.get("Contents", []), desc="Processant imatges"):
        key = obj["Key"]
        filename = key.split("/")[-1]

        # Get the file extension
        extension = os.path.splitext(filename)[1].lower()

        # Only process image files
        if extension not in formats:
            # Copy the unrecognized file to the dump bucket
            response = minio_client.get_object(Bucket=bucket_origin, Key=key)
            file_data = response["Body"].read()

            minio_client.put_object(
                Bucket=name_bucket,
                Key=key,
                Body=file_data
            )
            continue

        # Build the new filename with a .png extension
        nombre_sin_extension = os.path.splitext(filename)[0]
        nuevo_filename = f"{nombre_sin_extension}.png"

        # Read the image from the bucket
        response = minio_client.get_object(Bucket=bucket_origin, Key=key)
        image_data = response["Body"].read()

        # Open with PIL to validate the image and convert it to RGB
        try:
            img = Image.open(io.BytesIO(image_data)).convert("RGB")
        except Exception as e:
            print(f"Error with {filename}: {e}")
            continue

        # Save to memory as PNG
        buffer = io.BytesIO()
        img.save(buffer, format="PNG")
        buffer.seek(0)

        # Upload the PNG to the formatted zone (one upload is enough)
        new_key = f"imagenes/{nuevo_filename}"
        minio_client.put_object(
            Bucket=bucket_destination,
            Key=new_key,
            Body=buffer
        )

Processant imatges: 100%|██████████| 1000/1000 [00:36<00:00, 27.57it/s]
Processant imatges: 100%|██████████| 1000/1000 [00:33<00:00, 29.57it/s]
Processant imatges: 100%|██████████| 1000/1000 [00:37<00:00, 26.67it/s]
Processant imatges: 100%|██████████| 1000/1000 [00:42<00:00, 23.39it/s]
Processant imatges: 100%|██████████| 1000/1000 [00:39<00:00, 25.33it/s]
Processant imatges: 100%|██████████| 1000/1000 [00:36<00:00, 27.09it/s]
Processant imatges: 100%|██████████| 1000/1000 [00:39<00:00, 25.10it/s]
Processant imatges: 100%|██████████| 1000/1000 [00:36<00:00, 27.36it/s]
Processant imatges: 100%|██████████| 12/12 [00:00<00:00, 28.43it/s]
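
Step 4 of the task list (deleting an image from the persistent landing zone once it has been processed correctly) is not part of the loop above. One way it could be done is sketched below, under stated assumptions: the helper name `delete_if_stored` is hypothetical, `client` is any object exposing boto3's `head_object` and `delete_object` calls (such as `minio_client`), and "processed correctly" is taken to mean that the converted object can be found in the formatted zone.

```python
def delete_if_stored(client, bucket_origin, key, bucket_destination, new_key):
    """Delete the source object only if its converted copy exists.

    Hypothetical helper for illustration. Returns True when the source
    object was deleted and False when it was kept.
    """
    try:
        # head_object fetches metadata only; with boto3 it raises
        # botocore.exceptions.ClientError when the object is missing
        client.head_object(Bucket=bucket_destination, Key=new_key)
    except Exception as error:  # broad on purpose, to keep the sketch import-free
        print(f"Keeping {key}: could not confirm {new_key} ({error})")
        return False

    # The converted copy is confirmed, so the original can be removed
    client.delete_object(Bucket=bucket_origin, Key=key)
    return True
```

Inside the loop, this would be called right after the upload, for example `delete_if_stored(minio_client, bucket_origin, key, bucket_destination, new_key)`. Files copied to `dump` would not go through it, so they remain in the persistent landing zone as described earlier.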
