# PDF Extraction with the Groq API

This notebook shows how to extract the content of a PDF using the Groq API with the meta-llama/llama-4-maverick-17b-128e-instruct model.

## Import Libraries and Set Up the Environment

In [1]:
from openai import OpenAI
from dotenv import load_dotenv
from PIL import Image
from io import BytesIO

import pymupdf
import base64
import os


load_dotenv()

True
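
A True result from load_dotenv() does not by itself guarantee that GROQ_API_KEY is defined. The sketch below shows one way to fail early with a readable message; require_env is a hypothetical helper introduced here for illustration, not part of the original notebook or of python-dotenv.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # An empty string is treated the same as a missing variable
        raise RuntimeError(f"Environment variable {name} is not set; add it to your .env file.")
    return value


# Example: api_key = require_env("GROQ_API_KEY")
```

Calling the helper right after load_dotenv() would surface a missing key before any request is sent.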

## Convert Images to base64

The encode_image function converts the pages of a PDF file into base64-encoded images. It opens the PDF, renders each page as a JPEG image, and encodes that image as base64. The result is a list of base64 strings, one per page of the PDF.

In [2]:
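def encode_image(pdf_path):
    """Render each page of the PDF as a JPEG and return the pages as base64 strings."""
    base64_list = []
    pdf_document = pymupdf.open(pdf_path)
    for page in pdf_document:
        # Rasterize the page at PyMuPDF's default resolution
        pix = page.get_pixmap()
        # Wrap the raw RGB samples in a PIL image so it can be saved as JPEG
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        buffered = BytesIO()
        img.save(buffered, format="JPEG")
        image_bytes = buffered.getvalue()
        base64_image = base64.b64encode(image_bytes).decode('utf-8')
        base64_list.append(base64_image)
    return base64_list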
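
Called without arguments, page.get_pixmap() renders at 72 dpi, which may leave small print hard for the model to read. The following is a minimal sketch of a higher-resolution variant, assuming a scaling matrix is passed to get_pixmap; encode_image_at_dpi and zoom_for_dpi are hypothetical names, and the default of 150 dpi is an arbitrary choice rather than a recommendation from the notebook.

```python
import base64
from io import BytesIO


def zoom_for_dpi(dpi):
    """PDF user space is 72 points per inch, so the zoom factor is dpi / 72."""
    return dpi / 72.0


def encode_image_at_dpi(pdf_path, dpi=150):
    """Hypothetical variant of encode_image that renders pages at a chosen DPI."""
    # Imported here so zoom_for_dpi stays usable without the third-party packages
    import pymupdf
    from PIL import Image

    zoom = zoom_for_dpi(dpi)
    matrix = pymupdf.Matrix(zoom, zoom)  # scale both axes equally
    base64_list = []
    with pymupdf.open(pdf_path) as pdf_document:
        for page in pdf_document:
            pix = page.get_pixmap(matrix=matrix)
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            buffered = BytesIO()
            img.save(buffered, format="JPEG")
            base64_list.append(base64.b64encode(buffered.getvalue()).decode("utf-8"))
    return base64_list
```

Higher resolutions produce larger base64 strings, so the request size grows with the chosen DPI.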

## Function to Extract Text from an Image

The extract_pdf function extracts text from an image supplied as a base64 string. It sends the image to the Groq API through the OpenAI-compatible client and returns the model's reply.

In [3]:
def extract_pdf(base64_image):
    # Groq exposes an OpenAI-compatible endpoint, so the OpenAI client can be reused
    client = OpenAI(
        api_key=os.getenv("GROQ_API_KEY"),
        base_url="https://api.groq.com/openai/v1"
    )

    # Send the instruction and one page image (as a data URL) in a single user message
    ocr_response = client.chat.completions.create(
        model="meta-llama/llama-4-maverick-17b-128e-instruct",
        messages=[
            {
                "role": "user", "content":[
                    {"type": "text", "text": "Extract the text from the image."},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
                ]
            }
        ],
        # Upper bound on the reply length; longer page text is cut off at this limit
        max_tokens=2000
    )

    return ocr_response.choices[0].message.content
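
The message structure used in extract_pdf can be isolated into a small helper, which makes it easy to inspect the payload without calling the API. This is only a sketch: build_vision_messages is a hypothetical name, and the input string below is a placeholder rather than a real image.

```python
def build_vision_messages(base64_image, prompt="Extract the text from the image."):
    """Build the chat messages for one JPEG page encoded as base64."""
    # The image travels inline as a data URL with the JPEG MIME type
    data_url = f"data:image/jpeg;base64,{base64_image}"
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ]


messages = build_vision_messages("QUJD")  # "QUJD" is a placeholder, not a real image
print(messages[0]["content"][1]["image_url"]["url"])  # data:image/jpeg;base64,QUJD
```

The returned list has the same shape as the messages argument passed to client.chat.completions.create in the cell above.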



In [8]:
path = "data/MMB_CATALOG_2025_1.pdf"
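
In a regular Python string literal a backslash starts an escape sequence, so Windows-style paths are easy to get wrong and can trigger a warning. One alternative, sketched here with pathlib, builds the path from its parts; the variable name pdf_path is chosen for illustration only.

```python
from pathlib import Path

# Joining with "/" lets pathlib choose the right separator for the current OS
pdf_path = Path("data") / "MMB_CATALOG_2025_1.pdf"

# Report a missing file before any rendering is attempted
if not pdf_path.exists():
    print(f"PDF not found: {pdf_path}")
```

Passing str(pdf_path) to encode_image keeps the rest of the notebook unchanged.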


In [9]:
base64_list = encode_image(path)

In [10]:
base64_list

['/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/... (output truncated)
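
The /9j/ prefix in the output is what the JPEG signature bytes FF D8 FF look like once base64-encoded, which gives a quick way to confirm that a string really holds a JPEG. A minimal sketch follows; looks_like_jpeg is a hypothetical helper name.

```python
import base64

JPEG_MAGIC = b"\xff\xd8\xff"  # first three bytes of every JPEG file


def looks_like_jpeg(base64_string):
    """Check whether a base64 string decodes to data starting with the JPEG signature."""
    # Four base64 characters decode to exactly three bytes
    head = base64.b64decode(base64_string[:4])
    return head == JPEG_MAGIC


print(looks_like_jpeg("/9j/4AAQSkZJRgABAQAAAQABAAD/"))  # True
```

In the notebook, a check such as all(looks_like_jpeg(s) for s in base64_list) could run before any request is sent.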

## Extracting Text from the base64 Strings


In [11]:
text = ''
for base64string in base64_list:
    extracted_text = extract_pdf(base64string)
    text += extracted_text
    break  # only the first page is processed; remove this line to extract every page
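
Because the loop appends each result directly to text, the output of several pages would run together once more than one page is processed. The sketch below shows one possible way to limit the page count and mark page boundaries; extract_pages and its max_pages parameter are hypothetical, and str.upper stands in for extract_pdf so the example runs without calling the API.

```python
def extract_pages(base64_list, extract, max_pages=None):
    """Run `extract` on each page image and join the results with page markers."""
    pages = base64_list if max_pages is None else base64_list[:max_pages]
    parts = []
    for number, base64_string in enumerate(pages, start=1):
        parts.append(f"--- Page {number} ---\n{extract(base64_string)}")
    return "\n\n".join(parts)


# Stand-in extractor so the sketch runs without calling the API
demo = extract_pages(["aaa", "bbb", "ccc"], extract=str.upper, max_pages=2)
print(demo)
```

In the notebook, the equivalent of the loop above would be text = extract_pages(base64_list, extract_pdf, max_pages=1).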

## Viewing the Extraction Result

In [12]:
print(text)

The image is an advertisement for Maybank, a financial institution in Indonesia. The ad features a man running on a road, with the tagline "BERSIAP BERLARI UNTUK APRESIASI DIRI" (Get Ready to Run for Self-Appreciation) in large yellow and white text.

*   **Man Running:**
    *   The man is wearing a yellow tank top, black shorts, and black sneakers.
    *   He is holding a black shoe in his left hand and has his right leg raised behind him as if he is stretching or running.
    *   The man is looking over his shoulder, possibly at something or someone behind him.
*   **Tagline:**
    *   The tagline "BERSIAP BERLARI UNTUK APRESIASI DIRI" is written in large yellow and white text on the left side of the image.
    *   The text is bold and eye-catching, drawing attention to the message.
*   **Background:**
    *   The background of the image is a road with trees and buildings on either side.
    *   The sky is overcast, giving the image a slightly gloomy tone.
*   **Maybank Logo:**
    