<a href="https://colab.research.google.com/github/bhattacharjee/scaling-giggle/blob/main/parse_electoral_roll.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### Installing the dependencies first

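We rely mainly on two packages: pdf2image, which renders PDF pages as images, and pytesseract, a Python wrapper around the Tesseract OCR engine. The wget package is used to download the PDFs.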

In [21]:
# Install the Python packages
!pip install pdf2image
!pip install pytesseract
!pip install wget

# Install the system packages: poppler (used by pdf2image)
# and Tesseract with its English language data (used by pytesseract)
!apt-get install poppler-utils
!apt-get install libleptonica-dev
!apt-get install tesseract-ocr tesseract-ocr-dev
!apt-get install libtesseract-dev
!apt-get install tesseract-ocr-eng

Reading package lists... Done
Building dependency tree       
Reading state information... Done
poppler-utils is already the newest version (0.62.0-2ubuntu2.12).
The following package was automatically installed and is no longer required:
  libnvidia-common-470
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-470
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  libleptonica-dev
0 upgraded, 1 newly installed, 0 to remove and 39 not upgraded.
Need to get 1,308 kB of archives.
After this operation, 5,966 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libleptonica-dev amd64 1.75.3-3 [1,308 kB]
Fetched 1,308 kB in 1s (1,332 kB/s)
Selecting previously unselected pac

In [30]:
!apt-get install tesseract-ocr-eng

Reading package lists... Done
Building dependency tree       
Reading state information... Done
tesseract-ocr-eng is already the newest version (4.00~git24-0e00fe6-1.2).
tesseract-ocr-eng set to manually installed.
The following package was automatically installed and is no longer required:
  libnvidia-common-470
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.
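
Both Python packages shell out to command-line tools, so a successful `pip install` alone does not guarantee that OCR will work. The following is a minimal sketch of a sanity check, not part of the original notebook; the helper name `missing_binaries` and the list of tools are assumptions made for illustration (pytesseract runs `tesseract`, and pdf2image runs poppler's `pdftoppm` and `pdfinfo`).

```python
import shutil

# Command-line tools the Python wrappers call under the hood.
# This list is an assumption for illustration; adjust it to your setup.
REQUIRED_BINARIES = ("tesseract", "pdftoppm", "pdfinfo")


def missing_binaries(names=REQUIRED_BINARIES, which=shutil.which):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


missing = missing_binaries()
if missing:
    print("Missing system tools:", ", ".join(missing))
else:
    print("All system tools found.")
```

If anything is reported missing, re-running the `apt-get` cell above is the first thing to try.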


In [32]:
import os
import wget
import tempfile
import pdf2image
import pytesseract
import logging

In [3]:
# Add the links to the PDFs here:

ONLINE_PDF_FILES_LIST = [
    "http://ceomeghalaya.nic.in/erolls/pdf/english/A001/A0010001.pdf",
    "http://ceomeghalaya.nic.in/erolls/pdf/english/A001/A0010002.pdf"
]
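
The two links above differ only in their last digits, so the list could probably be generated instead of typed by hand. The sketch below assumes, based only on these two URLs, that a file name is `A` followed by a 3-digit constituency number and a 4-digit part number; `roll_url` and `BASE_URL` are hypothetical names, and the pattern is not confirmed for other constituencies.

```python
# Assumed URL layout, inferred from the two example links only.
BASE_URL = "http://ceomeghalaya.nic.in/erolls/pdf/english"


def roll_url(constituency: int, part: int) -> str:
    """Build the URL for one part of one constituency (assumed pattern)."""
    code = f"A{constituency:03d}"
    return f"{BASE_URL}/{code}/{code}{part:04d}.pdf"


# Reproduces the two hand-written links for constituency 1, parts 1 and 2.
urls = [roll_url(1, part) for part in (1, 2)]
for url in urls:
    print(url)
```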

In [56]:
# Parsing states
STATE_ZERO = 0
STATE_READING_NAMES = 1
STATE_READING_OTHERS_NAME = 2
STATE_READING_AGE_GENDER = 3

class Roll:
    def __init__(self, url: str) -> None:
        """Construct the object which will be used for further
        processing

        Parameters:
        url (str): The URL to the PDF (should not be a redirect)
        """

        self.temp_file_name = None
        self.pdf_url = url
        self.state = STATE_ZERO
        self.pages = None
        self.pages_text = list()

        # Reserve a unique name, then delete the empty placeholder so that
        # only the ".pdf" path is ever left on disk.
        temp_file = tempfile.NamedTemporaryFile(delete=False)
        self.temp_file_name = f"{temp_file.name}.pdf"
        temp_file.close()
        os.unlink(temp_file.name)

    def _pdf_is_present(self) -> bool:
        """Return True if the PDF exists on disk and is not empty"""
        return os.path.isfile(self.temp_file_name) and \
            os.stat(self.temp_file_name).st_size > 0

    def download(self) -> None:
        """Download the PDF file for this object

        Returns:
        None
        """
        wget.download(self.pdf_url, self.temp_file_name)
        if not self._pdf_is_present():
            raise Exception("Failed to download file")

    def convert_to_text(self, i: int) -> str:
        """Run OCR on page i (0-based) and return the recognised text"""
        logging.debug(f"converting to text - page {i}")
        print(f"converting to text - page {i}")
        return pytesseract.image_to_string(self.pages[i])

    def process(self) -> None:
        """Convert the PDF to images and OCR the first 8 and last 4 pages

        pages_text holds only the selected pages, so its indices match
        the page numbers only for the first 8 entries.
        """
        if not self._pdf_is_present():
            raise Exception("PDF not found, call download() first")
        self.pages = pdf2image.convert_from_path(self.temp_file_name)
        self.pages_text = [
            self.convert_to_text(i)
            for i in range(len(self.pages))
            if i < 8 or i > len(self.pages) - 5
        ]

    def __del__(self):
        # temp_file_name is None until the temporary name has been reserved
        if self.temp_file_name and os.path.exists(self.temp_file_name):
            os.unlink(self.temp_file_name)


roll = Roll(ONLINE_PDF_FILES_LIST[0])

roll.download()
roll.process()

converting to text - page 0
converting to text - page 1
converting to text - page 2
converting to text - page 3
converting to text - page 4
converting to text - page 5
converting to text - page 6
converting to text - page 7
converting to text - page 35
converting to text - page 36
converting to text - page 37
converting to text - page 38
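
The log above shows that only pages 0 to 7 and 35 to 38 were converted, which means `pages_text[8]` holds page 35, not page 8. The sketch below isolates that selection rule so the mapping can be inspected; `selected_page_indices` is a hypothetical helper, and the defaults of 8 leading and 4 trailing pages are taken from the filter in `process()`.

```python
def selected_page_indices(n_pages, head=8, tail=4):
    """Indices of the first `head` and last `tail` pages, without duplicates."""
    return [i for i in range(n_pages) if i < head or i >= n_pages - tail]


# The PDF processed above has 39 pages (indices 0 to 38).
indices = selected_page_indices(39)
print(indices)

# Map a position in pages_text back to the original page index.
position_to_page = dict(enumerate(indices))
print(position_to_page[8])
```

For short documents the two ranges overlap, and every page is selected exactly once.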


In [54]:

temp_text = roll.pages_text[0]
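
The cover page contains several `label : value` lines, such as the block and the police station. As a rough sketch of how these could be pulled out of the OCR text, the hypothetical helper `extract_fields` below keeps only lines that start with a letter and have a non-empty value after the colon. The sample lines are copied from the OCR output of the first page; lines where OCR merged two columns would need extra handling that is not attempted here.

```python
import re

# "label : value" where the label starts with a letter and the value is non-empty.
FIELD_RE = re.compile(r"^(?P<label>[A-Za-z][A-Za-z ]*?)\s*:\s*(?P<value>\S.*)$")


def extract_fields(text):
    """Return a dict of label -> value for every 'label : value' line."""
    fields = {}
    for line in text.split("\n"):
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group("label")] = match.group("value")
    return fields


# Sample lines taken from the OCR output of the cover page.
sample = "\n".join([
    "Block : THADLASKEIN",
    "Police Station : JOWAI",
    "Sub Division :",   # empty value, skipped
    "Panchayat 2",      # no colon, skipped
])
fields = extract_fields(sample)
print(fields)
```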

In [58]:
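def parse_text(text):
    """Print the non-empty lines of an OCR'd page, numbered for inspection"""
    text = [s.strip() for s in text.split('\n')]
    text = [s for s in text if len(s) > 0]
    # Placeholders for the parsed fields, not filled in yet
    names = list()
    gender = list()
    other = list()

    for i, s in enumerate(text):
        print(f"{i:>3d} : [{s}]")

parse_text(roll.pages_text[0])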

  0 : [ELECTORAL ROLL 2022 $15 Meghalaya]
  1 : [No. Name and Reservation Status of Assembly Constituency : 1 - NARTIANG]
  2 : [(ST)]
  3 : [No. Name and Reservation Status of Parliamentary Constituency(ies) in which the Assembly]
  4 : [Constituency is located :_ 1- SHILLONG (ST]
  5 : [Part number]
  6 : [1]
  7 : [1. Details of Revision]
  8 : [Roll Identification]
  9 : [Year of Revision 2022]
 10 : [Basic roll of Revision 2021, integrated with the]
 11 : [Qualifying Date 01-01-2022 supplements prepared in accordance with]
 12 : [the extent of the newly Delimited Constituency.]
 13 : [Type of revision Special Summary Revision]
 14 : [2022]
 15 : [Date of Publication 14-01-2022]
 16 : [2 . Details of part and polling area]
 17 : [No. and name of sections in the part]
 18 : [> ONBLUG MWyNSNING Main Town or Village : UMLADANG]
 19 : [3. LADAW LARU Post Office : JOWAI]
 20 : [Panchayat 2]
 21 : [Block : THADLASKEIN]
 22 : [Police Station : JOWAI]
 23 : [Sub Division :]
 24 : [District