**Installing dependencies**

In [None]:
!pip install openai


Collecting openai
  Downloading openai-1.9.0-py3-none-any.whl (223 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.26.0-py3-none-any.whl (75 kB)
Collecting typing-extensions<5,>=4.7 (from openai)
  Downloading typing_extensions-4.9.0-py3-none-any.whl (32 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.2-py3-none-any.whl (76 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)

**We use pytesseract to perform OCR**

In [None]:
!pip install pytesseract

Collecting pytesseract
  Downloading pytesseract-0.3.10-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.10


In [None]:
!apt install tesseract-ocr


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 29 not upgraded.
Need to get 4,816 kB of archives.
After this operation, 15.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.1-2.1build1 [236 kB]
Fetched 4,816 kB in 3s (1,641 kB/s)
Selecting previously unselected package tesseract-ocr-eng.
(Reading database ... 121658 files and directories currently installed.)
Preparing to unpack .../tesseract-ocr-

In [None]:
import pandas as pd
import os
from openai import OpenAI
import cv2
from google.colab import files
import re
import pytesseract


In [None]:

# This is the path to the Tesseract executable in Google Colab
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'


**Setting the api_key provided**

In [None]:

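# Read the key from the environment or prompt for it; never hard-code a secret in the notebook
from getpass import getpass
api_key = os.environ.get('OPENAI_API_KEY') or getpass('OpenAI API key: ')
os.environ['OPENAI_API_KEY'] = api_key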

In [None]:
# Create the client object for OpenAI API calls, as described in the OpenAI client documentation
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

**AUTOMATING THE PROCESS OF EXTRACTING THE REQUIRED FIELDS USING CONSECUTIVE API CALLS FOR ALL THE TEST EXAMPLES.**

**Handling each image by hand becomes very lengthy, so we use an object-oriented approach to information extraction with OCR and LLMs.**

**Process**
1. Upload the test images and store them in a member variable (list).
2. Preprocess each image using the member functions of the class below.
3. Run OCR on each image file to extract all the text, and store it in a member variable (list).
4. Supply the extracted text to the LLM (gpt-3.5-turbo-1106), with a prompt asking for the fields we require (invoice number, issue date, total amount, and the contents of the description table).
5. Store the content of the LLM's response in a member variable (list).
6. Repeat this for all the test images.
7. Build a DataFrame from the extracted fields.

**Defining the class for information extraction**

In [None]:
class InformationExtraction:
    def __init__(self, api_key):
        self.api_key = api_key
        os.environ['OPENAI_API_KEY'] = api_key
        self.client = OpenAI()
        self.test_title = []
        self.uploaded = []
        self.test_extracted_text = []
        self.test_invoice_content = []

    def upload_image(self):
        # Upload an image file using the Colab interface
        uploaded = files.upload()
        self.uploaded = uploaded
        #get the path of images which is also the title of image
        for filename in uploaded.keys():
          self.test_title.append(filename)
          print(f"{filename} uploaded successfully!")

    def return_filenames(self):
        return self.uploaded.keys()

    def perform_ocr(self):
      uploaded = self.uploaded


      for filename in uploaded.keys():

        image = cv2.imread(filename)
        # Preprocess the image if needed
        preprocessed_image = self.preprocess_image(image)

        # Extract text using Tesseract
        extracted_text = pytesseract.image_to_string(preprocessed_image, lang='eng')
        print("Text Extracted Sucessfully of file:" ,filename)

        #add the extracted text to member variable list
        self.test_extracted_text.append(extracted_text)

        #print(self.test_extracted_text)
        #return extracted_text

    def llm_response_content(self):
      return self.test_invoice_content

    def ocr_response_content(self):
      return self.test_extracted_text



    #LLM functions
    def call_llm(self):
        extracted_text = self.test_extracted_text
        for text in extracted_text:

          messages = [
              {"role": "system", "content": "You are a helpful assistant that carefully analyzes invoices and extracts information from the text and strictly doesn't use other symbols like '-' and '|' when generating a response."},
              {"role": "user", "content": f"Find the invoice number, convert the textual representation of issue date to numerical representation and only show the numerical representation of the issue date, find total amount usually present at the bottom of the invoice text and also extract the headings of description table and all its contents from :\n\n{text},\n Write in a clear way with newline after each information extracted. \n Also, don't use other symbols like '-' and '|' too much. And, if you can't find the mentioned information, write 0 in that field .\n Also if there are multiple instances of the required data, pick the first one and don't repeat those fields.Write in this format strictly Invoice number : \n, Issue date : \n,Total amount : \n, Description Table :  "}
          ]

          response = self.client.chat.completions.create(
              model='gpt-3.5-turbo-1106',
              messages=messages,
              temperature=0.7,
              max_tokens=200,
              stop=None,
              n=1,
              timeout=30,
          )

          #print(response.choices[0].message)
          self.test_invoice_content.append(response.choices[0].message.content)
        #print(self.test_invoice_content)
        #return self.test_invoice_content

    # OCR FUNCTIONS
    # Grayscale function
    def get_grayscale(self, image):
        return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Threshold function
    def threshold(self, image):
        threshold_level = 0
        return cv2.threshold(image, threshold_level, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

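    # Dilate function
    # Note: a (1, 1) kernel leaves the image unchanged; a larger kernel such as (2, 2) would thicken strokes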
    def dilate(self, image):
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 1))
        return cv2.dilate(image, kernel, iterations=1)

    # Preprocess function
    def preprocess_image(self, image):
        gray_scaled = self.get_grayscale(image)
        thresholded = self.threshold(gray_scaled)
        dilated = self.dilate(thresholded)
        return dilated

    #extract useful information
    def extract_invoice_info(self):
        invoice_numbers = []
        issue_dates = []

        total_amounts = []

        table_contents = []


        for content in self.test_invoice_content:
            # Extracting invoice details using regular expressions
            invoice_number_match = re.search(r'Invoice number\s*:\s*([^\n]+)', content)
            issue_date_match = re.search(r'Issue date\s*:\s*([^\n]+)', content)
            total_amount_match = re.search(r'Total amount\s*:\s*([^\n]+)', content)

            # Extracting table details using regular expressions
            table_match = re.search(r'Description Table\s*:\s*(.+)', content, re.DOTALL)

            # Assigning the matched values to variables
            invoice_number = invoice_number_match.group(1) if invoice_number_match else None
            issue_date = issue_date_match.group(1) if issue_date_match else None
            total_amount = total_amount_match.group(1) if total_amount_match else None
            table_content = table_match.group(1).strip() if table_match else None

            # Append the matched values to the lists
            invoice_numbers.append(invoice_number)
            issue_dates.append(issue_date)
            total_amounts.append(total_amount)
            table_contents.append(table_content)



        # Create a Pandas DataFrame with the extracted information
        test_df = pd.DataFrame({
            "title": self.test_title,
            "invoice_number": invoice_numbers,

            "issue_date": issue_dates,

            "total": total_amounts,

            "table": table_contents,

            })

        return test_df
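
To see how the patterns in `extract_invoice_info` behave, the sketch below applies the same regular expressions to a single response string modelled on the outputs shown later in this notebook. The helper name `parse_response` and the `FIELD_PATTERNS` dictionary are hypothetical, introduced only for illustration; they are not part of the class above.

```python
import re

# Same patterns as in extract_invoice_info; DOTALL lets the table span several lines
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice number\s*:\s*([^\n]+)"),
    "issue_date": re.compile(r"Issue date\s*:\s*([^\n]+)"),
    "total": re.compile(r"Total amount\s*:\s*([^\n]+)"),
    "table": re.compile(r"Description Table\s*:\s*(.+)", re.DOTALL),
}

def parse_response(content):
    """Return a dict of the four fields; a field is None when its label is absent."""
    parsed = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(content)
        parsed[name] = match.group(1).strip() if match else None
    return parsed

# Sample modelled on one of the responses printed further down
sample = (
    "Invoice number: 353013\n"
    "Issue date: 010317\n"
    "Total amount: 8,385.00\n"
    "Description Table: \n"
    "AEROCHAIR-E1 UofM: Each 500.00 2,500.00\n"
    "Office Chair - Swivel Grey"
)
parsed_sample = parse_response(sample)
print(parsed_sample)
```

One caveat worth knowing: because `\s*` also matches newlines, a label with an empty value would capture the following line instead. Replacing the second `\s*` with `[ \t]*` is one way to avoid that, if empty fields turn out to occur.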

In [None]:
# Reuse the key set earlier instead of embedding a second secret in the notebook
i = InformationExtraction(api_key)


**We upload all 41 test images**

In [None]:
i.upload_image()

Saving 0b6fcb50-b157-4457-b3a3-06779f91b8b8.jpg to 0b6fcb50-b157-4457-b3a3-06779f91b8b8.jpg
Saving 2a8677b9-b29e-4c93-86e5-56d4c38cb7fc.jpg to 2a8677b9-b29e-4c93-86e5-56d4c38cb7fc.jpg
Saving 2ec7883e-dafe-4cc3-9836-7314ace98c14.png to 2ec7883e-dafe-4cc3-9836-7314ace98c14.png
Saving 3e2ef304-3cb7-4452-87b8-102bea1c2908.jpg to 3e2ef304-3cb7-4452-87b8-102bea1c2908.jpg
Saving 4cfc9619-5dd4-4307-90b6-04d20ba6db3b.jpg to 4cfc9619-5dd4-4307-90b6-04d20ba6db3b.jpg
Saving 4d972974-e804-4c67-af7c-071b16d1d35a.png to 4d972974-e804-4c67-af7c-071b16d1d35a.png
Saving 6a620b62-167a-466e-961d-63cac09ba563.jpg to 6a620b62-167a-466e-961d-63cac09ba563.jpg
Saving 6fab0ac5-e856-43af-a2ed-621d1a15818a.png to 6fab0ac5-e856-43af-a2ed-621d1a15818a.png
Saving 7b450c61-a83d-4843-8404-0828b0d62891.jpg to 7b450c61-a83d-4843-8404-0828b0d62891.jpg
Saving 8e00367b-25b3-401b-bf8e-e1cfc801e4ae.png to 8e00367b-25b3-401b-bf8e-e1cfc801e4ae.png
Saving 9b650c90-ad5c-4f07-b04a-b58e33318398.jpg to 9b650c90-ad5c-4f07-b04a-b58e3

In [None]:
uploaded_files = i.return_filenames()

**Performing OCR on all images using pytesseract**

In [None]:
i.perform_ocr()

Text Extracted Sucessfully of file: 0b6fcb50-b157-4457-b3a3-06779f91b8b8.jpg
Text Extracted Sucessfully of file: 2a8677b9-b29e-4c93-86e5-56d4c38cb7fc.jpg
Text Extracted Sucessfully of file: 2ec7883e-dafe-4cc3-9836-7314ace98c14.png
Text Extracted Sucessfully of file: 3e2ef304-3cb7-4452-87b8-102bea1c2908.jpg
Text Extracted Sucessfully of file: 4cfc9619-5dd4-4307-90b6-04d20ba6db3b.jpg
Text Extracted Sucessfully of file: 4d972974-e804-4c67-af7c-071b16d1d35a.png
Text Extracted Sucessfully of file: 6a620b62-167a-466e-961d-63cac09ba563.jpg
Text Extracted Sucessfully of file: 6fab0ac5-e856-43af-a2ed-621d1a15818a.png
Text Extracted Sucessfully of file: 7b450c61-a83d-4843-8404-0828b0d62891.jpg
Text Extracted Sucessfully of file: 8e00367b-25b3-401b-bf8e-e1cfc801e4ae.png
Text Extracted Sucessfully of file: 9b650c90-ad5c-4f07-b04a-b58e33318398.jpg
Text Extracted Sucessfully of file: 9d396ae4-7abe-4aa5-8481-11df722e49c7.jpg
Text Extracted Sucessfully of file: 19d98817-caf1-4e5a-b8b2-22881ecef5d4.png

**Observing extracted content**

In [None]:
i.ocr_response_content()

["& ROSEMOUNT INC.\n8200 MARKET BOULEVARD\nEMERSON. | cuannasseN, MNss317 UNITED STATES\n\nCOMMERCIAL INVOICE - Original\n\n \n\n \n\n    \n\n \n\n \n\n  \n    \n   \n\n   \n\nCHICAGO, IL__ 60606 UNITED STATES\nInvoice To:\nINSTRUMENTOS Y CONTROLES SA.\n\nCALLE 39 NO 24-45,\nBOGOTA, COLOMBIA\n\n  \n\n         \n      \n     \n   \n\n \n \n\n \n\nSold To:\nINSTRUMENTOS Y CONTROLES SA.\nCALLE 39 NO 24-45,\n\nPlease Remit To: Invoice Date [esis No: ‘Shipment No:\nROSEMOUNT INC 13/May/2013 4631508 2580301\nJPMORGAN CHASE Payment Terms: Sales Order No:\n\nABA: 021000021 SWIFT CODE: CHASUS33 NET60 3577894\n\n     \n  \n  \n  \n     \n  \n\n    \n \n  \n \n \n\nRep Order No:\n\nProject No:\n\nCustomer PO:\nL-9390-1\n\nShip To:\nINSTRUMENTOS Y CONTROLES SA\nC/O LOGIMAT SA\n\nCARRERA 106 NO 15-25\n\nMANZANA 23 LOTE 136 M\n\nZONA FRANCA FONTIBON\nBOGOTA, COLOMBIA\n\n \n\nBOGOTA, COLOMBIA\nContact: Amado, Constanza\n\n     \n     \n   \n   \n\nShipped Via:\nInland (Origin): FEDEX-PARCEL-INTRA US 
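
The raw OCR text above contains long runs of spaces and blank lines left over from the page layout, all of which are sent to the model as part of the prompt. A minimal sketch of an optional cleanup step follows. The function `compact_ocr_text` is a hypothetical helper, not part of the class, and it assumes the blank lines carry no information the model needs.

```python
import re

def compact_ocr_text(text):
    """Collapse whitespace runs left over from page layout in OCR output."""
    lines = []
    for line in text.splitlines():
        # Squeeze runs of spaces/tabs into one space and trim both ends
        cleaned = re.sub(r"[ \t]+", " ", line).strip()
        if cleaned:  # drop lines that held only whitespace
            lines.append(cleaned)
    return "\n".join(lines)

# Fragment modelled on the OCR output shown above
raw = "COMMERCIAL INVOICE - Original\n\n \n\n    \n\nCHICAGO, IL__ 60606 UNITED STATES\nInvoice To:\n"
compact = compact_ocr_text(raw)
print(compact)
```

If applied, it would sit between `perform_ocr` and `call_llm`. Whether it helps or hurts extraction quality would need to be checked on the test images, since layout gaps can sometimes hint at table structure.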

**Making consecutive calls to the LLM through the member function call_llm()**

In [None]:
i.call_llm()
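
Because `call_llm` issues one request per image in a plain loop, a single timeout or transient error stops the whole run. The sketch below shows one generic way to retry a call with exponential backoff. `with_retries` is a hypothetical helper, not something the class above uses, and `flaky_call` is a stand-in for the API request so that the example runs without network access.

```python
import time

def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call(); after an exception wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Stand-in for the API request: fails twice, then succeeds
calls = {"count": 0}

def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "Invoice number : 0"

delays = []  # record the waits instead of actually sleeping
result = with_retries(flaky_call, sleep=delays.append)
print(result, delays)
```

In real use the broad `except Exception` would be better narrowed to the client's own timeout and rate-limit errors, so that genuine bugs still fail immediately.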

**Inspecting the LLM's response content**

In [None]:
i.llm_response_content()

['Invoice number : 4631508\nIssue date : 13 May 2013\nTotal amount : 1445.99\nDescription Table : \n2088G4S22A 1MSB4E5T 104\n2088 Pressure Transmitter\nExport HTS/HS:8026204000\nBuyer acknowledges preference that Rosemount arrange shipping\nFreightshipping charges prepaid and added to invoice. Buyer\nagrees to pay such charges. Because Rosemount is billed by\nWOOD PACKING MATERIAL\n1,435.20 1,435.20\n10.79\n0.00\n1,445.99',
 'Invoice number: POv002\nIssue date: 19-Nov-20\nTotal amount: 62,000.00\nDescription Table: Jarus 16.6 inch Monitor, 19Nos, 620000, 67,000.00, Toa 70 Hes, 62,000.00, Anau Chao woe Eee',
 'Invoice number : 0\nIssue date : 07072022\nTotal amount : 2380\nDescription Table : \nSki: Woo-bearietoge\nbel $500 118% $850.83\ncap S500 1 oR $150 i780\nsubrocal si',
 'Invoice number: 353013\nIssue date: 010317\nTotal amount: 8,385.00\nDescription Table: \nAEROCHAIR-E1 UofM: Each 500.00 2,500.00\nOffice Chair - Swivel Grey\nWarehouse: MAIN',
 'Invoice number: Final Invoice Fina

**There are 41 LLM responses, one per test image, which is exactly what we expect**

In [None]:
len(i.llm_response_content())

41

**Making pandas dataframe using the member function extract_invoice_info()**

In [None]:
test = i.extract_invoice_info()

In [None]:
test

Unnamed: 0,title,invoice_number,issue_date,total,table
0,0b6fcb50-b157-4457-b3a3-06779f91b8b8.jpg,4631508,13 May 2013,1445.99,2088G4S22A 1MSB4E5T 104\n2088 Pressure Transmi...
1,2a8677b9-b29e-4c93-86e5-56d4c38cb7fc.jpg,POv002,19-Nov-20,62000.00,"Jarus 16.6 inch Monitor, 19Nos, 620000, 67,000..."
2,2ec7883e-dafe-4cc3-9836-7314ace98c14.png,0,07072022,2380,Ski: Woo-bearietoge\nbel $500 118% $850.83\nca...
3,3e2ef304-3cb7-4452-87b8-102bea1c2908.jpg,353013,010317,8385.00,"AEROCHAIR-E1 UofM: Each 500.00 2,500.00\nOffic..."
4,4cfc9619-5dd4-4307-90b6-04d20ba6db3b.jpg,Final Invoice Final Invoice # oRt2348,09/20/2012,34436.50,Quantity Description Unit Price ExtPrice\n1 Ca...
5,4d972974-e804-4c67-af7c-071b16d1d35a.png,PD004918,10/26/2018,200.00,CODE DESCRIPTION QTY UNIT RATE AMOUNT\nTRLAND ...
6,6a620b62-167a-466e-961d-63cac09ba563.jpg,5000,15 OCTOBER 2018,2550 USD,"DESCRIPTION HOURS, RATE TOTAL\nWEDDING PHOTOSH..."
7,6fab0ac5-e856-43af-a2ed-621d1a15818a.png,1,"July6, 2012","$2,407.00",QUANTITY DESCRIPTION UNIT PRICE AMOUNT\n120 Wi...
8,7b450c61-a83d-4843-8404-0828b0d62891.jpg,5559789,February 5,8173.20,1. Populated Printed Circuit Board and Intel P...
9,8e00367b-25b3-401b-bf8e-e1cfc801e4ae.png,GE/3847/20 2013,10 10 2020,23600.00,1 Herman Millet Executive 34929 12 Pes 100.00 ...


**This looks alright.**
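
The `issue_date` column still mixes several layouts (`13 May 2013`, `19-Nov-20`, `09/20/2012`), so the prompt alone does not guarantee a numerical date. A possible post-processing step is sketched below. `normalize_date` and `DATE_FORMATS` are hypothetical names, the format list is an assumption drawn from the rows above, month names are assumed to be English, and `%m/%d/%Y` assumes month-first ordering. Ambiguous all-digit strings such as `07072022` are deliberately left unparsed.

```python
from datetime import datetime

# Candidate layouts, taken from the date strings visible in the table above
DATE_FORMATS = ["%d %B %Y", "%d %b %Y", "%d-%b-%y", "%m/%d/%Y"]

def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None when no candidate layout fits."""
    if raw is None:
        return None
    text = " ".join(raw.split())  # collapse stray whitespace
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next layout
    return None

for raw in ["13 May 2013", "19-Nov-20", "09/20/2012", "February 5"]:
    print(raw, "->", normalize_date(raw))
```

Returning `None` for anything that does not match keeps doubtful values visible for manual review rather than guessing a day and month order.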

**Saving the output to a CSV file**

In [None]:
csv_file_path = 'testoutput.csv'
test.to_csv(csv_file_path, index=False)
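
The `total` column is saved as text in mixed formats (`1445.99`, `2550 USD`, `$2,407.00`). If the values are needed as numbers, a cleanup along the following lines could be applied before writing the CSV. `parse_amount` is a hypothetical helper, and it assumes that commas are thousands separators and dots are decimal points, which would not hold for invoices using the opposite convention.

```python
import re
from decimal import Decimal, InvalidOperation

def parse_amount(raw):
    """Strip currency symbols, codes and thousands separators; return a Decimal or None."""
    if raw is None:
        return None
    # Keep digits, commas and dots, then drop the commas (assumes ',' groups thousands)
    digits = re.sub(r"[^0-9.,]", "", raw).replace(",", "")
    try:
        return Decimal(digits)
    except InvalidOperation:
        return None  # nothing numeric left, or a malformed number

for raw in ["1445.99", "62,000.00", "2550 USD", "$2,407.00"]:
    print(raw, "->", parse_amount(raw))
```

`Decimal` is used rather than `float` so that currency values keep their exact decimal digits.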