## Querying the GPT API to Transcribe the "Erfindungen" from 1816 to 1852

The patents ("Erfindungen") serve as important pre-trend indicators. Each observation represents an Austrian patent. The raw data is already in table format, so GPT should have an easy time transcribing it into CSV files. The cell below defines the prompt I want to use.


In [48]:
import base64
import os
from io import StringIO

import numpy as np
import openai
import pandas as pd
import requests
from dotenv import load_dotenv
from pdf2image import convert_from_path

# Load environment variables from .env file
load_dotenv()

api_key = os.getenv('OPENAI_API_KEY')

# Specify the prompt
prompt = """
Transcribe the table in the attached picture in a .csv format. Use the "\t" as the delimiter. Output ONLY the parsed data, no text. Use the same headers as in the table. There are 12 headers. The table is in German. 
"""

## Convert PDF Pages to .jpg

Then set up the GPT request parameters.

In [28]:
def pdf_to_jpg(pdf_path, output_folder):
    # Convert PDF to a list of images
    images = convert_from_path(pdf_path)
    
    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)
    
    # Save each image as a JPG file
    for i, image in enumerate(images):
        image_path = os.path.join(output_folder, f'page_{i + 1}.jpg')
        image.save(image_path, 'JPEG')
        print(f'Saved {image_path}')

# Example usage
pdf_path = '../../data/patent_data/raw_patent_data/erfindungsprivilegien_without_first_page.pdf'
output_folder = '../../data/patent_data/interim_patent_data'
pdf_to_jpg(pdf_path, output_folder)

Saved ../../data/patent_data/interim_patent_data/page_1.jpg
Saved ../../data/patent_data/interim_patent_data/page_2.jpg
Saved ../../data/patent_data/interim_patent_data/page_3.jpg
Saved ../../data/patent_data/interim_patent_data/page_4.jpg
Saved ../../data/patent_data/interim_patent_data/page_5.jpg
Saved ../../data/patent_data/interim_patent_data/page_6.jpg
Saved ../../data/patent_data/interim_patent_data/page_7.jpg
Saved ../../data/patent_data/interim_patent_data/page_8.jpg
Saved ../../data/patent_data/interim_patent_data/page_9.jpg
Saved ../../data/patent_data/interim_patent_data/page_10.jpg
Saved ../../data/patent_data/interim_patent_data/page_11.jpg
Saved ../../data/patent_data/interim_patent_data/page_12.jpg
Saved ../../data/patent_data/interim_patent_data/page_13.jpg
Saved ../../data/patent_data/interim_patent_data/page_14.jpg
Saved ../../data/patent_data/interim_patent_data/page_15.jpg
Saved ../../data/patent_data/interim_patent_data/page_16.jpg
Saved ../../data/patent_data/inte

In [49]:
# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
image_path = "../../data/patent_data/interim_patent_data/page_1.jpg"

# Getting the base64 string
base64_image = encode_image(image_path)

headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {api_key}"
}

payload = {
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": prompt
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
          }
        }
      ]
    }
  ],
  "max_tokens": 3000
}
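
The cell above builds the request for a single page. Since the PDF was split into many pages, the same payload would need to be rebuilt for each image. The following is a minimal sketch of how that could be wrapped in a function; `build_payload` and its parameters are hypothetical names that do not appear in the notebook, and the dictionary simply mirrors the fields used above.

```python
import base64


def build_payload(prompt, image_bytes, model="gpt-4o", max_tokens=3000):
    """Return a chat-completions payload for one page image (hypothetical helper)."""
    # Base64-encode the raw JPEG bytes, as encode_image does for a file on disk
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
        "max_tokens": max_tokens,
    }
```

Taking bytes rather than a path keeps the helper independent of the file layout; the bytes could come from `open(image_path, "rb").read()`.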


In [50]:
# Query GPT
response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

print(response.json())

{'id': 'chatcmpl-AAwUzkt3XMLfUgpkkgpeVmSTNQo3E', 'object': 'chat.completion', 'created': 1727171513, 'model': 'gpt-4o-2024-05-13', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'PrivRegNr\tName1\tName2\tVorname1\tName3\tVorname2\tName4\tVorname3\tTitel\tSchlagworte\tOrt\tJahr\tEingereichtam\n1\tSteller\t\tAnton\t\t\tMeine Verfahrenart auf alles Getreide-Größtnen durch günstigsten zur anderen Substanten ordnige und die edelsten Papieze zu verfertigen, wie hier dieses Blut und jenes Wasser der Kürperleib ist...\tStrohpapier\tWien\t1816\t18.06.1816\n2\tOberhaimer\t\tAnton\t\t\tDas Gebirker oder Felsenheimerische Lack-Lake (roter Indigo) als Cochenille Ersatz und Mittel zur Scharlachzeugung\tCochenille, Schachille, Lack-Lake, Indigo\tWien\t1816\t10.06.1816\n3\tRupprecht\t\tJoseph\t\t\tBadeaparat mit Brennstoffeinsparung\tBadeofen\tWien\t1821\t02.07.1821\n4\tVogtländer\tJohan\tFriedrich\t\t\tBeschreibung der periskopischen Gliser und charakteristischen Kenneichen 
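
Printing the raw JSON works for one page, but it does not reveal whether the request failed or whether the reply was cut off by the `max_tokens` limit. A minimal sketch of a check follows, assuming the response has the chat-completions shape shown above; `extract_content` is a hypothetical helper, not part of the notebook.

```python
def extract_content(response_json):
    """Return the transcription text, or raise if the reply is unusable (hypothetical helper)."""
    # Failed requests carry an "error" key instead of "choices"
    if "error" in response_json:
        raise RuntimeError(f"API error: {response_json['error']}")
    choice = response_json["choices"][0]
    # finish_reason == "length" means the reply hit max_tokens and was cut off
    if choice.get("finish_reason") == "length":
        raise RuntimeError("Transcription truncated; consider raising max_tokens")
    return choice["message"]["content"]
```

With this in place, `raw_string = extract_content(response.json())` would fail loudly on a truncated page instead of passing a partial table to pandas.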

In [55]:
raw_string = response.json()['choices'][0]['message']['content']

In [58]:
pd.read_csv(StringIO(raw_string), delimiter="\t").head(5)

Unnamed: 0,PrivRegNr,Name1,Name2,Vorname1,Name3,Vorname2,Name4,Vorname3,Titel,Schlagworte,Ort,Jahr,Eingereichtam
0,1,Steller,,Anton,,,Meine Verfahrenart auf alles Getreide-Größtne...,Strohpapier,Wien,1816,18.06.1816,,
1,2,Oberhaimer,,Anton,,,Das Gebirker oder Felsenheimerische Lack-Lake ...,"Cochenille, Schachille, Lack-Lake, Indigo",Wien,1816,10.06.1816,,
2,3,Rupprecht,,Joseph,,,Badeaparat mit Brennstoffeinsparung,Badeofen,Wien,1821,02.07.1821,,
3,4,Vogtländer,Johan,Friedrich,,,Beschreibung der periskopischen Gliser und cha...,"Periskopische Gläser, Brille",Wien,1816,09.09.1816,,
4,5,Strauss,,Anton,,,Erfindung einer Druckmaschine,Druckmaschine,Wien,1815,02.12.1815,,
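
In the preview above, the patent description appears under `Name4` rather than `Titel`, which suggests that the data rows contain fewer tab-separated fields than the header row. A minimal sketch for flagging such rows before loading them into pandas follows; `find_misaligned_rows` is a hypothetical helper, and it assumes the first non-empty line is the header.

```python
def find_misaligned_rows(raw_string, delimiter="\t"):
    """Return (line_number, n_fields) for rows whose width differs from the header."""
    # Drop blank lines but keep each line's content intact, since trailing tabs count
    lines = [ln for ln in raw_string.splitlines() if ln.strip()]
    if not lines:
        return []
    expected = len(lines[0].split(delimiter))
    return [
        (i, len(ln.split(delimiter)))
        for i, ln in enumerate(lines[1:], start=1)
        if len(ln.split(delimiter)) != expected
    ]
```

Calling `find_misaligned_rows(raw_string)` returns an empty list when every row matches the header width; any entries point to rows that would be shifted into the wrong columns.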
