Bhawana Agarwal
# Automating Information Extraction and Web Navigation

## Introduction

This project focuses on simplifying the tasks of extracting information from images and efficiently navigating the web. We're using two tools: PyTesseract for converting images to text and a web crawler with the Google Search API. The main objective is to automate the process of getting a list of company names from an image and then finding career page links online.

Starting with PyTesseract: it's a Python wrapper around the Tesseract OCR engine that converts the company names in an image into readable text. This removes most of the manual data entry, although the OCR output is still worth a quick check for misread characters.

Moving on to the web crawler, we run Google searches from Python (through the 'googlesearch' package) to find career page links for the identified companies. This makes our web searches more targeted and efficient, saving time and effort.

So, let's get started!!

To start the project, I set up a new conda environment dedicated to this task, which gives us a clean and isolated space to work in. Then we install the required libraries, for example Pillow to read the images.

Unlike libraries that only need a 'pip install', PyTesseract requires an additional step because it depends on the Tesseract OCR (Optical Character Recognition) engine, which must be installed separately.

Specifically:

- **For Windows users**, it's necessary to download the Tesseract executable file. The path to this executable needs to be specified in the PyTesseract configuration so that PyTesseract can use the Tesseract OCR engine. __[Here is the link to download Tesseract for Windows](https://github.com/UB-Mannheim/tesseract/wiki)__

- **For macOS users**, follow the Tesseract installation guidelines for Mac so that PyTesseract can find the engine. __[Here is the link to install Tesseract for Mac](https://tesseract-ocr.github.io/tessdoc/Installation.html)__

If you want more information on Tesseract, use the following resources:
- https://github.com/tesseract-ocr/tesseract
- https://pypi.org/project/pytesseract/
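
Before pointing PyTesseract at the engine, it can help to check where (or whether) the Tesseract executable is installed. Below is a minimal sketch, not part of the original notebook: `find_tesseract` is a hypothetical helper name, and the Windows fallback path is simply the default install location used later in this project.

```python
import shutil
from pathlib import Path

# Assumption: the default Windows install location used later in this notebook
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(candidates=(WINDOWS_DEFAULT,)):
    """Return a path to the tesseract executable, or None if it is not found."""
    # First look on the PATH (typical for macOS/Linux installs)
    found = shutil.which("tesseract")
    if found:
        return found

    # Then fall back to explicit candidate locations
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate

    return None


# Possible usage (requires pytesseract to be installed):
#   from pytesseract import pytesseract
#   path = find_tesseract()
#   if path is not None:
#       pytesseract.tesseract_cmd = path
```

If the helper returns None, Tesseract is most likely not installed yet, or it lives in a non-default folder that you would need to add to `candidates`.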

In [1]:
# !pip freeze > requirements.txt

In [2]:
# pip install pytesseract

In [3]:
from PIL import Image
from pytesseract import pytesseract

In [4]:
path_to_tesseract = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
pytesseract.tesseract_cmd = path_to_tesseract

In [5]:
# I use a single image here for my use case, but the program can be tweaked to
# read multiple images by looping over a list of paths. I have tested that and it works.

image_path = r"C:\Users\PC\Downloads\sample_tier_list.png"

In [6]:
def read_and_convert_img_to_txt(img_path):
    """
    Reads text from an image file using Tesseract OCR.

    Parameters:
        - img_path (str): The file path to the image.

    Returns:
        - str: The extracted text from the image.
    """
    try:
        # Open the image file
        img = Image.open(img_path)

        # Extract text from the image using Tesseract OCR
        text = pytesseract.image_to_string(img)

        # Remove newline characters for better formatting
        text = text.replace('\n', '')

        return text

    except Exception as e:
        # Handle exceptions and return an empty string on failure
        print(f"An error occurred: {e}")
        return ''
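
The multi-image variant mentioned earlier can be sketched as a small loop. This is only an illustration: `ocr_many` is a hypothetical helper, and it takes the OCR function as an argument so that it can be tried without Tesseract installed.

```python
def ocr_many(image_paths, ocr_func):
    """Run an OCR function over several images.

    Parameters:
        - image_paths: an iterable of image file paths.
        - ocr_func: a callable that takes one path and returns the extracted text.

    Returns:
        - dict: maps each path (as a string) to its extracted text.
    """
    results = {}
    for path in image_paths:
        results[str(path)] = ocr_func(path)
    return results
```

In this notebook the call would look like `ocr_many([image_path], read_and_convert_img_to_txt)`, with more paths added to the list as needed.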


In [7]:
text = read_and_convert_img_to_txt(image_path)

In [8]:
text

'6: IBM, SAP, Pure Storage, Nordstorm, Groupon,Norton Lifelock, Yahoo, PNC, NetApp, GoDaddy,'

If you take a look, you'll see that the text comes back as one long string. To make it easier to handle, we tidy it up with basic string operations: first we split on the colon to drop the leading label, then we use list comprehensions to strip trailing digits and split on the commas, which gives us the individual company names.
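
As a compact alternative to the cells that follow, the same tidy-up can be sketched in one function that also strips the stray spaces and drops the empty entry left by the trailing comma. `parse_company_names` is a hypothetical name, and the sketch assumes a single "label: name, name, ..." string like the one shown above.

```python
def parse_company_names(ocr_text):
    """Split an OCR string of the form 'label: A, B, C,' into clean names."""
    # Keep only what follows the first colon (this drops the leading label)
    _, _, names_part = ocr_text.partition(':')

    # Split on commas, trim whitespace, and skip empty pieces
    return [name.strip() for name in names_part.split(',') if name.strip()]


sample = ('6: IBM, SAP, Pure Storage, Nordstorm, Groupon,Norton Lifelock, '
          'Yahoo, PNC, NetApp, GoDaddy,')
names = parse_company_names(sample)
print(names)
```

Note that this returns a flat list directly, so no flattening step is needed afterwards.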

In [9]:
import string
import numpy as np
import pandas as pd

In [10]:
companies_list = text.split(':')[1:]
companies_list = [s.rstrip(string.digits) for s in companies_list]

In [11]:
companies_list = [s.split(',') for s in companies_list]

In [12]:
companies_list

[[' IBM',
  ' SAP',
  ' Pure Storage',
  ' Nordstorm',
  ' Groupon',
  'Norton Lifelock',
  ' Yahoo',
  ' PNC',
  ' NetApp',
  ' GoDaddy',
  '']]

Rather than dealing with a list inside another list or writing nested loops, I opted for a simpler solution: the 'itertools' module from Python's standard library, which flattens our two-dimensional list into a one-dimensional one. This keeps the data handling easy and the code straightforward.
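
To see the flattening on its own, here is a tiny sketch with made-up tiers. It uses `itertools.chain.from_iterable`, which gives the same result as unpacking the outer list with `chain(*...)` but takes the outer list directly.

```python
from itertools import chain

# Illustrative data: one inner list per tier (an empty tier is fine too)
tiers = [['IBM', 'SAP'], ['Yahoo'], []]

# Walk through each inner list in order and collect the items into one list
flat = list(chain.from_iterable(tiers))
print(flat)
```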

In [13]:
import itertools

In [14]:
companies_tier = list(itertools.chain(*companies_list))

In [15]:
companies_tier

[' IBM',
 ' SAP',
 ' Pure Storage',
 ' Nordstorm',
 ' Groupon',
 'Norton Lifelock',
 ' Yahoo',
 ' PNC',
 ' NetApp',
 ' GoDaddy',
 '']

Now, we will save our data in a user-friendly format: CSV. For this purpose, we use the csv library and its csv.writer to write each name as a row in our CSV file. It's worth noting that this simplified project does not check whether the file already exists; opening it in 'w' mode silently overwrites any previous contents. In more complex scenarios, it's considered best practice to handle that case, both for reusability and to prevent potential data loss.
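
For those more complex scenarios, one possible approach is to refuse to overwrite unless asked. The sketch below is an illustration only: `write_names_to_csv` and its `overwrite` flag are hypothetical names, and it relies on Python's exclusive-creation mode ('x'), which raises FileExistsError when the file is already there.

```python
import csv


def write_names_to_csv(path, names, overwrite=False):
    """Write one name per row; fail instead of overwriting unless overwrite=True."""
    # 'x' creates the file and raises FileExistsError if it already exists
    mode = 'w' if overwrite else 'x'

    # newline='' lets the csv module control line endings itself
    with open(path, mode, newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        for name in names:
            writer.writerow([name])

    return path
```

Calling it twice with the same path fails the second time, which is a cheap way to notice that you are about to lose an earlier run's data.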

In [16]:
import csv
import os

In [17]:
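# newline='' is recommended by the csv docs so the writer controls line endings itself
with open('SampleJobLinks.csv', 'w', newline='') as file:

    # csv.writer writes one row per company name
    writer = csv.writer(file)
    for val in companies_tier:
        writer.writerow([val])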

Now comes the intriguing part: we read this list back, search for the career page associated with each company, and save the results in a CSV file. This time, we store not only the company names but also their corresponding URLs, which enriches our dataset with links to each company's career page.

In [18]:
import requests
import pandas as pd
from googlesearch import search
import time

In [19]:
df = pd.read_csv('SampleJobLinks.csv', encoding='cp1252', header=None)

In [20]:
def get_company_url(company_name):
    
    """
    Searches for the URL of a company's career page using two keywords: the company name and the string "Careers".

    Parameters:
        - company_name (str): The name of the company for which the career page URL is to be searched.

    Returns:
        - str: The URL of the company's career page, or an empty string if the search is unsuccessful.

    Notes:
        - Exceptions raised during the search are caught and printed; an empty string is returned in that case.
        
    """
    try:
        # Combining company name and 'Careers' as search terms
        search_terms = ' '.join([company_name, 'Careers'])
        
        # Performing a Google search to find the career page URL
        for url in search(search_terms, num_results=1):
            return url

        # The search returned no results
        return ''

    except Exception as e:
        print(f"An error occurred: {e}")
        return ''

In [21]:
df

Unnamed: 0,0
0,IBM
1,SAP
2,Pure Storage
3,Nordstorm
4,Groupon
5,Norton Lifelock
6,Yahoo
7,PNC
8,NetApp
9,GoDaddy


In [22]:
# Drop the last row, i.e. the empty entry left by the trailing comma in the OCR text
df = df[:-1]

In [23]:
df

Unnamed: 0,0
0,IBM
1,SAP
2,Pure Storage
3,Nordstorm
4,Groupon
5,Norton Lifelock
6,Yahoo
7,PNC
8,NetApp
9,GoDaddy


In [24]:
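df['URL'] = ''
for index, row in df.iterrows():
    # Look up each company's career page, pausing between searches to keep the request rate low
    df.at[index, 'URL'] = get_company_url(str(row[0]))
    time.sleep(2)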

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['URL'] = ''
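
The warning above appears because `df` was produced by slicing another DataFrame (`df[:-1]`), so pandas cannot tell whether the new column should also affect the original. One common way to avoid it is to take an explicit copy when slicing. Here is a minimal sketch with placeholder data (the names and URLs are illustrative, not search results):

```python
import pandas as pd

# Placeholder frame: two names plus a trailing empty row, as in this project
raw = pd.DataFrame({0: ['IBM', 'SAP', None]})

# .copy() makes the trimmed frame own its data, so adding a column is unambiguous
trimmed = raw.iloc[:-1].copy()

trimmed['URL'] = ''
for index in trimmed.index:
    # Placeholder value; in the notebook this is where the search result goes
    trimmed.at[index, 'URL'] = f'https://example.com/{index}'

print(trimmed)
```

The original frame `raw` is left untouched, which is usually what you want at this point anyway.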


In [25]:
pd.options.display.max_rows = df.shape[0] +1

In [26]:
df['URL']

0    https://www.linkedin.com/jobs/ibm-jobs-boston-ma
1                               https://jobs.sap.com/
2    https://www.purestorage.com/company/careers.html
3                      https://careers.nordstrom.com/
4          https://groupon.wd5.myworkdayjobs.com/jobs
5       https://www.nortonlifelock.com/us/en/careers/
6                   https://www.yahooinc.com/careers/
7                   https://careers.pnc.com/global/en
8                         https://careers.netapp.com/
9                            https://careers.godaddy/
Name: URL, dtype: object

In [27]:
df

Unnamed: 0,0,URL
0,IBM,https://www.linkedin.com/jobs/ibm-jobs-boston-ma
1,SAP,https://jobs.sap.com/
2,Pure Storage,https://www.purestorage.com/company/careers.html
3,Nordstorm,https://careers.nordstrom.com/
4,Groupon,https://groupon.wd5.myworkdayjobs.com/jobs
5,Norton Lifelock,https://www.nortonlifelock.com/us/en/careers/
6,Yahoo,https://www.yahooinc.com/careers/
7,PNC,https://careers.pnc.com/global/en
8,NetApp,https://careers.netapp.com/
9,GoDaddy,https://careers.godaddy/
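
In the results above, the IBM row points to a LinkedIn job listing rather than the company's own career page, because we simply take the first search hit. One possible refinement is to request a few results and skip known job boards. The sketch below is an assumption-laden illustration: `pick_career_url` and the `JOB_BOARD_DOMAINS` list are hypothetical, and the list would need tuning for real use.

```python
from urllib.parse import urlparse

# Assumption: a small, hand-picked list of job-board domains to skip
JOB_BOARD_DOMAINS = ("linkedin.com", "indeed.com", "glassdoor.com")


def pick_career_url(urls, blocked_domains=JOB_BOARD_DOMAINS):
    """Return the first URL whose host is not a blocked domain, or '' if none."""
    for url in urls:
        host = urlparse(url).netloc.lower()

        # Match the domain itself and any subdomain (e.g. www.linkedin.com)
        blocked = any(host == d or host.endswith("." + d) for d in blocked_domains)
        if not blocked:
            return url

    return ''


# Illustrative input: the second URL is a placeholder, not a real search result
candidates = [
    "https://www.linkedin.com/jobs/ibm-jobs-boston-ma",
    "https://example.com/careers",
]
print(pick_career_url(candidates))
```

To use this in the notebook, the search call would need to ask for more than one result (for example `num_results=5`) and pass the results to this function instead of returning the first one.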


**Note** - *Deciding whether to update the current file or make a new one has its own pros and cons. Updating the same file saves space and keeps things organized but risks losing data if something goes wrong. Creating a new file keeps your original data safe, acts as a backup, and helps track different versions, but it takes up more space and can get confusing when there are lots of versions. The best choice depends on your project and how important your data is. (I prefer the latter.)*

I am creating a new file with a timestamp in its name to keep track of different versions.

In [31]:
from datetime import datetime

In [32]:
def write_dataframe_to_csv_with_timestamp(dataframe, base_filename='samplejoblinks'):
    """
    Writes a DataFrame to a CSV file with a timestamp.

    Parameters:
        - dataframe: The DataFrame to be written to CSV.
        - base_filename: The base name for the CSV file (default is 'samplejoblinks').

    Returns:
        - csv_file_name: The name of the generated CSV file.
    """
    # Generate a timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    csv_file_name = f'{base_filename}_{timestamp}.csv'

    dataframe.to_csv(csv_file_name, index=False)

    return csv_file_name


In [33]:
csv_file_name = write_dataframe_to_csv_with_timestamp(df)
print(f'DataFrame has been successfully written to {csv_file_name}')

DataFrame has been successfully written to samplejoblinks_20240228_191329.csv
