<a href="https://colab.research.google.com/github/ahmarkhhan/tasksAutomation/blob/main/image_to_email.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Installing Necessary Libraries**

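1.   opencv-python-headless: A build of OpenCV without GUI dependencies, suited to servers and notebooks. The script below does not import it; images are loaded with Pillow (`PIL`), which is installed as a pytesseract dependency.
2.   pytesseract: A Python wrapper for Google's Tesseract-OCR Engine, which is used to extract text from images.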



In [None]:
%pip install opencv-python-headless pytesseract

pytesseract is only a wrapper: it calls the Tesseract-OCR executable, which must be installed separately.

In [None]:
!sudo apt-get install tesseract-ocr

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.
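
Before running OCR, it can help to confirm that the `tesseract` executable is actually on the PATH, since pytesseract only fails once it is called. The check below is a minimal sketch using the standard library; the helper name `tesseract_available` is hypothetical and not part of pytesseract.

```python
import shutil

def tesseract_available(executable="tesseract"):
    """Return True if the named executable can be found on the PATH."""
    return shutil.which(executable) is not None

# Report whether the OCR engine can be located
print("Tesseract found:", tesseract_available())
```

Once the executable is found, `pytesseract.get_tesseract_version()` can be used to see which version pytesseract will call.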


**Script to Extract Email Addresses from Image**

1.   Imports the necessary libraries.
2.   Defines the function `extract_emails_from_image`, which:
     *   checks that a file exists at the given path;
     *   opens the image and extracts its text with OCR;
     *   re-joins an email address that OCR has split across two lines, then finds addresses with a regular expression and records the line each one appears on.
3.   Calls the function on every matching file in the `data/` directory and saves the results to `extracted_emails.csv`.

Example result for one image:
File name: 2.jpg
Total emails found: 2
Email: stancoop@gmail.com (found on line 3)
Email: emcasarosa82@gmail.com (found on line 23)
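
The joining and matching steps can be tried on plain text, without running OCR. The sketch below makes two assumptions that go beyond the description above: the top-level domain is mandatory in the pattern, and a line counts as "broken" when it ends in `user@` or `user@domain` and the next line starts with `.tld`. The function name `find_emails_with_lines` and the sample text are hypothetical.

```python
import re

# Assumption: the top-level domain (e.g. ".com") is required
EMAIL_PATTERN = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b'

def find_emails_with_lines(text):
    """Return (email, line_number) pairs, re-joining addresses split across two lines."""
    lines = text.split('\n')
    for i in range(len(lines) - 1):
        # Assumption: a broken address ends one line and continues with ".tld"
        ends_mid_address = re.search(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]*$', lines[i])
        starts_with_tld = re.search(r'^\.[A-Za-z]{2,7}\b', lines[i + 1])
        if ends_mid_address and starts_with_tld:
            lines[i] += lines[i + 1]
            lines[i + 1] = ""  # keep the line count stable
    results = []
    for line_number, line in enumerate(lines, start=1):
        for email in re.findall(EMAIL_PATTERN, line):
            results.append((email, line_number))
    return results

sample = "Contact: alice@example.com\nSales desk\nbob@example\n.org"
print(find_emails_with_lines(sample))
# prints [('alice@example.com', 1), ('bob@example.org', 3)]
```

Blanking the continuation line instead of deleting it means the reported line numbers still refer to lines of the original OCR output.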





In [14]:
import pytesseract
from PIL import Image
import re
import os
import csv
import glob

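def extract_emails_from_image(image_path):
    """Run OCR on an image and return a dict with the emails found and their
    line numbers, or an error message string on failure."""
    if not os.path.exists(image_path):
        return f"Error: File '{image_path}' not found."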

    try:
        image = Image.open(image_path)
        extracted_text = pytesseract.image_to_string(image)

        # Match an email address within a single line; addresses that OCR split
        # across two lines are re-joined below before matching.
        # [A-Za-z] is used instead of [A-Z|a-z], which also accepts a literal '|'.
        email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b'

        # Detect lines and store them
        lines = extracted_text.split('\n')

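        # If OCR split an address across two lines (e.g. "user@gmail" / ".com"),
        # append the continuation to the first line and blank the second one so
        # that line numbers stay stable
        for i in range(len(lines) - 1):
            if (re.search(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]*$', lines[i])
                    and re.search(r'^\.[A-Za-z]{2,7}\b', lines[i + 1])):
                lines[i] = lines[i] + lines[i + 1]
                lines[i + 1] = ""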

        # Scan line by line so every match is paired with its own line number
        # (looking each address up afterwards would map repeated addresses to
        # the first line they appear on)
        email_addresses = []
        line_numbers = []
        for line_number, line in enumerate(lines, start=1):
            for email in re.findall(email_pattern, line):
                email_addresses.append(email)
                line_numbers.append(line_number)

        file_name = os.path.basename(image_path)
        email_info = {
            "file_name": file_name,
            "total_emails": len(email_addresses),
            "emails": email_addresses,
            "line_numbers": line_numbers
        }

        return email_info
    except Exception as e:
        return str(e)

# Specify the directory where your files are located
data_directory = 'data/'

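# File extensions to process. Only image formats are listed: PIL's Image.open
# cannot read PDF, DOC, or DOCX files, so those would have to be converted to
# images before OCR.
file_extensions = ['jpg', 'jpeg', 'png']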

# Initialize an empty list to store email information from all files
all_email_info = []

# Loop through files with the specified extensions in the 'data' directory
for extension in file_extensions:
    files = glob.glob(os.path.join(data_directory, f'*.{extension}'))
    for file in files:
        email_info = extract_emails_from_image(file)
        if isinstance(email_info, dict):
            all_email_info.append(email_info)
        else:
            # The function returns an error string on failure; report it
            print(f"Skipped {file}: {email_info}")

# Save the emails to a CSV file
csv_file_name = 'extracted_emails.csv'
with open(csv_file_name, mode='w', newline='') as csv_file:
    fieldnames = ['File Name', 'Email', 'Line Number']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    writer.writeheader()
    for email_info in all_email_info:
        for email, line_num in zip(email_info['emails'], email_info['line_numbers']):
            writer.writerow({'File Name': email_info['file_name'], 'Email': email, 'Line Number': line_num})

print(f"Emails saved to {csv_file_name}")


Emails saved to extracted_emails.csv
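
The CSV can be read back and printed in the same layout as the example result shown earlier. This is a minimal sketch: the function name `summarize_rows` and the sample rows are hypothetical, and the column names are assumed to match the `fieldnames` used when writing the file.

```python
import csv
import io
from collections import defaultdict

def summarize_rows(rows):
    """Group CSV rows by file and return report lines in the example-result layout."""
    by_file = defaultdict(list)
    for row in rows:
        by_file[row['File Name']].append((row['Email'], row['Line Number']))
    report = []
    for file_name, entries in by_file.items():
        report.append(f"File name: {file_name}")
        report.append(f"Total emails found: {len(entries)}")
        for email, line_number in entries:
            report.append(f"Email: {email} (found on line {line_number})")
    return report

# Hypothetical CSV content standing in for extracted_emails.csv
sample_csv = io.StringIO(
    "File Name,Email,Line Number\n"
    "1.jpg,alice@example.com,3\n"
    "1.jpg,bob@example.org,23\n"
)
for line in summarize_rows(csv.DictReader(sample_csv)):
    print(line)
```

To summarize the real output, open `extracted_emails.csv` with `newline=''` and pass a `csv.DictReader` over it to `summarize_rows`. Files in which no address was found have no rows in the CSV, so they do not appear in the report.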
