### Get experience job class title
This was darn hard for me. For example, how do you know that words like "Management Assistant" are something to look up? Then I remembered that we can scrape the PDF files in City Job Paths using Bob's code! With that, I can build a list of keys for the job class title. Finally, all I need to do is a lookup.

This is more of a trial-and-error process.

In [1]:
# First, import relevant modules
import os
import numpy as np
import pandas as pd

In [2]:
#from https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/

#install pdfminer on python 3
#!python -m pip install pdfminer.six 

import io
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

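def extract_text_from_pdf(pdf_path):
    """Return the text of every page in the PDF at pdf_path as one string."""
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with open(pdf_path, 'rb') as fh:
        # Feed each page through the interpreter; the converter writes
        # the extracted text into the in-memory buffer.
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)

    text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()

    # Always return a string (possibly empty) so callers can use str methods on it.
    return text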

In [3]:
# Let's do an example
extract_text_from_pdf('CityofLA/Additional data/City Job Paths/Accountant.pdf')

'CITY OF LOS ANGELES PERSONNEL DEPARTMENT CAREER-OPPORTUNITIES FOR ACCOUNTANT The following information is being given to describe potential opportunities as an Accountant.  The career ladders that these titles commonly follow have been illustrated in the diagram below.  With specific types of experience, promotional or lateral movement between these lines is also possible.  You may review the class specifications and some job bulletins through our Personnel’s Department website.  It is encouraged to examine the options available, to be able to promote for what you qualify for.                  Principal Utility Accountant (Water and Power) Chief Internal Auditor Principal Accountant Fiscal System Specialist Senior Utility Accountant (Water and Power) Internal Auditor Senior Auditor Senior Tax Auditor Senior Accountant Auditor Tax Auditor Accountant Departmental Chief Accountant Financial Manager Accounting Aide Accounting Records Supervisor Payroll Supervisor Accounting Clerk \uf0a7 O

Nice! Note that we can ignore:
* "CITY OF LOS ANGELES PERSONNEL... qualify for."
* \uf0a7 
* Open/Entry-Level
* \x0c

Next, we'll use `nltk` to build a list of keys for `EXP_JOB_CLASS_TITLE`.

In [4]:
from nltk import word_tokenize
from nltk.corpus import stopwords
import string

# Build a list of stop words. These include: (1) English stop words, (2) punctuation,
# and (3) the last 3 bullet points above.
stop = stopwords.words('english') + list(string.punctuation) + ['\uf0a7', 'Open/Entry-Level', '\x0c']

In [5]:
folder_path = 'CityofLA/Additional data/City Job Paths/'
all_pdf_filenames = os.listdir(folder_path)
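
`os.listdir` returns every entry in the folder, including anything that is not a PDF (hidden files, sub-folders). A minimal sketch of a stricter listing, assuming we only want `*.pdf` files in a stable order; `list_pdf_paths` is a hypothetical helper name, not something from the original notebook:

```python
from pathlib import Path

def list_pdf_paths(folder):
    """Return the sorted paths of all .pdf files directly inside `folder`."""
    folder = Path(folder)
    # Compare suffixes case-insensitively so 'X.PDF' is also picked up,
    # and skip directories even if their name ends in '.pdf'.
    return sorted(p for p in folder.iterdir()
                  if p.is_file() and p.suffix.lower() == '.pdf')
```

Each returned item is a full path, so it could be passed to `extract_text_from_pdf` directly instead of concatenating `folder_path + filename`.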

In [6]:
# Don't use %%timeit here: it re-runs the whole cell several times, so this slow loop only looks like it never ends.
exp_job_class_title_keywords = []
for filename in all_pdf_filenames:
    # Get pdf path
    pdf_path = folder_path + filename
    # Convert pdf to text using helper function
    job_path_as_text = extract_text_from_pdf(pdf_path)
    # Simplify text per Note above
    job_path_as_text_simplified = job_path_as_text[job_path_as_text.find('qualify for')+len('qualify for'):]
    # Tokenize text if key is not in the pre-defined stop words
    tokens = [key for key in word_tokenize(job_path_as_text_simplified) if key not in stop]
    # Append to exp_job_class_title_keywords. Use set to remove dups in list. list again to make it a list.
    exp_job_class_title_keywords = list(set(exp_job_class_title_keywords + tokens))
    
print(exp_job_class_title_keywords)

['specifications', 'AuditorSenior', 'Pilot', 'Financial', 'Transmission', 'titles', 'Vehicle', 'Bindery', 'Heavy', 'Examiner', 'Park', 'ExaminerIPolygraph', 'Treatment', 'Land', 'Auditor', 'Technical', 'options', 'Buyer', 'Engineering', 'Commission', 'Bureau', 'Air', 'available', 'Zoo', 'Reader', 'Stores', 'City', 'information', 'Toolroom', 'Motor', 'Golf', 'Cleaning', 'Warehouse', 'Operating', 'Body', 'Plant', 'Water', 'Secretary', 'Senior', 'Waste', 'Tax', 'Material', 'Associate', 'Commercial', 'Drafting', 'Officer', 'Refrigeration', 'Analyst', 'Personnel', 'Supply', 'You', 'bulletins', 'Deck', 'Worker', 'Waterworks', 'Zoning', 'InspectorSenior', 'Gardener', '\uf0a7Open/Entry-Level\uf0a7PromotionalLocksmithBuilding', 'Technician', 'Metal', 'Cartographer', 'Cleaner', 'given', 'potential', 'Biologist', 'Crew', 'Birds', 'Sales', 'Supervising', 'able', 'Delivery', 'Sanitation', 'examine', 'IVPolygraph', 'IIPolygraph', 'also', 'Fleet', 'Principal', 'job', 'Forensic', 'Executive', 'FIREFIG

As we can see from the result, `extract_text_from_pdf` does not always work: some words come out fused together (e.g. 'AuditorSenior'), and boilerplate words still slip through. There are two things we can do here: (1) tweak the code to simplify the result (to avoid words like 'ANGELES'), and (2) manually adjust the list above to get a cleaner one.
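
Some tokens in the output above are two words fused together (e.g. 'AuditorSenior'). One possible tweak is to re-insert a space wherever a lowercase letter is directly followed by an uppercase one, before tokenizing. This is only a sketch under that assumption; `split_fused_words` is a hypothetical helper, and it will not separate tokens such as 'IVPolygraph', where the boundary falls between two uppercase letters.

```python
import re

# Zero-width boundary: a lowercase letter on the left, an uppercase letter on the right.
_FUSED_BOUNDARY = re.compile(r'(?<=[a-z])(?=[A-Z])')

def split_fused_words(text):
    """Insert a space at every lowercase-to-uppercase boundary in `text`."""
    return _FUSED_BOUNDARY.sub(' ', text)

print(split_fused_words('AuditorSenior InspectorSenior'))
# prints: Auditor Senior Inspector Senior
```

Note that this relies on the original capitalization, so it would have to run before any `.lower()` call.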

In [7]:
from nltk import word_tokenize
from nltk.corpus import stopwords
import string

# Rebuild the list of stop words. This time it includes only (1) English stop words and
# (2) punctuation; the boilerplate text is stripped separately below.
stop = stopwords.words('english') + list(string.punctuation)

In [8]:
# Initializations
exp_job_class_title_keywords = []
ignored_text1 = 'CITY OF LOS ANGELES PERSONNEL DEPARTMENT CAREER-OPPORTUNITIES FOR'
ignored_text2 = 'The following information is being given to describe potential opportunities as'
ignored_text3 = 'The career ladders that these titles commonly follow have been illustrated in the diagram below'  
ignored_text4 = 'With specific types of experience, promotional or lateral movement between these lines is also possible'
ignored_text5 = 'You may review the class specifications and some job bulletins through our Personnel’s Department website.'
ignored_text6 = 'It is encouraged to examine the options available, to be able to promote for what you qualify for'
ignored_text7 = '\uf0a7'
ignored_text8 = 'Open/'
ignored_text9 = 'Entry-Level'
ignored_text10 = 'Promotional'
ignored_text11 = '\x0c'

ignored_list = [ignored_text1, ignored_text2, ignored_text3, ignored_text4, ignored_text5, 
                ignored_text6, ignored_text7, ignored_text8, ignored_text9, ignored_text10, ignored_text11]

In [9]:
# Don't use %%timeit here: it re-runs the whole cell several times, so this slow loop only looks like it never ends.
exp_job_class_title_keywords = []
for filename in all_pdf_filenames:
    # Get pdf path
    pdf_path = folder_path + filename
    # Convert pdf to text using helper function
    job_path_as_text = extract_text_from_pdf(pdf_path)
    # Simplify text
    for ig in ignored_list:
        job_path_as_text = job_path_as_text.replace(ig, '')
    # Tokenize text if key is not in the pre-defined stop words
    tokens = [key for key in word_tokenize(job_path_as_text.lower()) if key not in stop]
    # Append to exp_job_class_title_keywords. Use set to remove dups in list. list again to make it a list.
    exp_job_class_title_keywords = list(set(exp_job_class_title_keywords + tokens))
    
print(exp_job_class_title_keywords)

['steam', 'waste', 'specifications', 'body', 'angeles', 'harbor', 'waterworks', 'titles', 'custodial', 'systems', 'police', 'apparatus', 'polygraph', 'land', 'ivpolygraph', 'deputy', 'tax', 'distribution', 'operations', 'grounds', 'treatment', 'options', 'reader', 'available', 'associate', 'shop', 'crew', 'information', 'claims', 'worker', 'occupational', 'drafting', 'bus', 'maker', 'truck', 'officer', 'coordinator', 'applications', 'firefighter', 'workers', 'sergeant', 'engineer', 'real', 'specialist', 'attendant', 'testing', 'iimaintenance', 'duplicating', 'utilization', 'forensic', 'pilot', 'bulletins', 'poster', 'metal', 'iisenior', 'division', 'elevators', 'driver', 'golf', 'craft', 'director', 'given', 'potential', 'lot', 'managing', 'manager', 'instructor', 'able', 'heating', 'district', 'instrument', 'examine', 'locksmithbuilding', 'also', 'machine', 'job', 'technicianenvironmental', 'auditorsenior', 'messenger', 'cleaning', 'power', 'boat', 'planner', 'administrator', 'laborat

# See Bob, I can do it too!
This is not helpful at all, so I'll go back to the first list, copy/paste it into a Word file, and manually remove inappropriate words to get a cleaned list.

In [10]:
l = ['Polygraph', 'Grounds', 'Street', 'Starter','Traffic', 'Inspector', 'Plumbing', 'Officer', 'Helper', 
     'Accountant', 'Claims', 'Records', 'Community', 'Typist', 'Land', 'Resources','Warehouse', 'Designer', 
     'Irrigation', 'Air', 'Meter', 'Security', 'Technician', 'Marketing', 'Reader', 'Operator', 'Window', 
     'Analyst', 'Ranger', 'Transportation', 'Cleaner', '4', 'Engineer', 'Technician’, ‘Environmental', 
     'Auditor’, ’Senior’, ‘Operator’, ‘Senior', 'Electrical', 'Laborer', 'Airports', 'Internal', 'Planner', 
     'Elevator', 'Recreation', 'Principal', 'Executive', 'Mate', 'Communications', 'Sanitation', 'Attendant', 
     'Surveying', 'Associate', 'Surveys', 'Commander', 'Auto', 'Animal', 'Forensic', 'Poster', 'Utility', 
     'Steam', 'Locksmith’, Building', 'Materials', 'Fireboat', 'Helicopter', 'Driver', 'Drafting', 'Delivery', 
     'Director', 'Birds', 'Heating', 'Structural', 'Line', 'City', 'Industrial', 'Gardener', 'Body', 'Sign', 
     'Compliance', 'Housing', 'Safety', 'Equipment', 'Quality', 'Conditioning', 'Masonry', 'Geologist', 'Maker', 
     'Services', 'Lieutenant', 'Civil', 'Deck', 'Computer', 'Apparatus', 'Environmental', 'Dispatcher', 
     'Worker’, ‘Airports', 'Division', 'Information', 'Sales', 'Chief', 'Estate', 'Aid', 'Telecommunications', 
     'Automotive', 'Coordinator', 'Departmental', 'Financial','Mechanic', 'Instrument', 'Operating', 'Lighting', 
     'Superintendent', 'Commercial', 'Assistant', 'Fleet', 'Biologist', 'Programmer', 'Detective', 'Managing', 
     'Development', 'Sweeper', 'Pilot', 'Workers', 'Personnel', 'Reptiles', 'Inspector', 'Senior', 'Craft', 
     'Print', 'Inspector', 'Chief', 'Electric', 'Representative', 'Trainee', 'Station', 'Administrative', 
     'Worker', 'Party', 'Administrator', 'Engineer', 'Traffic', 'Laboratory', 'Technical', 'Office', 'Cabinet', 
     'Truck', 'Pipefitter', 'Utility', 'Fire', 'IV', 'Polygraph', 'Rehabilitation', 'Helper', 'Building', 
     'Power', 'Aide', 'Sheet', 'Property','Zoo', 'General', 'Engineering', 'Cartographer', 'Construction', 
     'Builder', 'Police', 'Technician', 'Computer', 'Facility', 'Programs', 'Bureau', 'Secretary', 'Boat', 
     'Repair', 'Motor', 'Tax', 'Machinist', 'District', 'Parking', 'Electrician', 'Collection', 'Controller', 
     'Survey', 'Pressure', 'Lot', 'Hand', 'Deputy', 'Buyer', 'Treatment', 'Pumping', 'Event', 'Machine', 'Sergeant', 
     'Management', 'Special', 'Protection', 'Golf', 'Manager', 'Crew', 'Head', 'Care', 'III', 'Painter', 'Cement', 
     'Firefighter', 'Water', 'Plant', 'Education', 'Bindery', 'Surgeon', 'Storekeeper', 'Examiner’ ‘Polygraph', 
     'Park', 'Cement', 'Keeper', 'Metal', 'Commission', 'Harbor', 'Cleaning', 'Events', 'Veterinary', 
     'Architectural', 'Custodial', 'Marking', 'Tree', 'Stenographer', 'V', 'Fiscal', 'Refuse', 'Real', 
     'Senior', 'Garage', 'Supply', 'III', 'Supervisor', 'Title', 'Carpenter', 'Data', 
     'Battalion', 'Relations', 'IV', 'Custodian', 'Laboratory', 'Elevators', 'Duty', 'Vehicle', 
     'Finisher', 'I', 'Signal', 'Printing', 'Stores', 'Port', 'IT', 'Distribution', 'Photographer', 'Vessel', 
     'Clerk', 'Firefighter', 'Processing', 'Advisor', 'Planning', 'Geographic', 'Shovel', 'Toolroom', 'Public', 
     'Light', 'Systems', 'II', 'Maintenance', 'Supervising', 'Occupational', 'Examiner', 'Transmission', 
     'Compressor', 'Payroll', 'Vessels', 'lateral', 'Specialist', 'Plumber', 'Field', 'Operations', 'Repairer', 
     'Accounting', 'Building', 'Waste', 'Research', 'Detention', 'Project', 'Procurement', 'Compensation', 
     'Refrigeration', 'Bus', 'Graphics', 'Messenger', 'Heavy', 'Load', 'Zoning', 'Auditor', 'Wharfinger', 'Shop', 
     'Service', 'Applications', 'Labor', 'Apprentice', 'Testing', 'Control', 'Curator', 'Waterworks', 'Material', 
     'Caretaker', 'Captain', 'Disposal', 'Mechanical', 'System', 'Duplicating', 'Nurse', 'Airport', 'Press', 
     'Maintenance', 'Wastewater', 'Health', 'III', 'Information', 'II', 'Solid', 'Instructor', 'Repairer', 'I', 
     'Librarian', 'Painter']

print(len(l))

326
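
Pasting through a word processor can turn straight quotes into curly ones (’ and ‘), so a few entries in the list above, such as `'Technician’, ‘Environmental'`, are really single strings holding two or more keywords. A hedged sketch of one way to split them again, assuming no real keyword contains a curly quote or a comma; `split_glued_entries` is a hypothetical helper:

```python
import re

def split_glued_entries(entries):
    """Split entries that were glued together by curly quotes into separate keywords."""
    cleaned = []
    for entry in entries:
        # Treat curly quotes and commas as separators, then drop empty pieces.
        parts = re.split(r'[‘’,]', entry)
        cleaned.extend(part.strip() for part in parts if part.strip())
    return cleaned

sample = ['Analyst', 'Technician’, ‘Environmental', 'Locksmith’, Building', 'Examiner’ ‘Polygraph']
print(split_glued_entries(sample))
# prints: ['Analyst', 'Technician', 'Environmental', 'Locksmith', 'Building', 'Examiner', 'Polygraph']
```

Applying this to `l` would change both the length above and the de-duplicated count, so the printed numbers describe the list as pasted.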


In [11]:
exp_job_class_title_keywords = list(set(l))
print(len(exp_job_class_title_keywords))

302
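
With the keyword list in place, the lookup itself could look roughly like the sketch below: scan a requirement sentence and collect runs of consecutive words that all appear in the keyword set. This is an assumption about how the lookup might work, not the final method; `find_title_candidates`, the sample sentence, and the small sample keyword set are made up for illustration.

```python
import re

def find_title_candidates(text, keywords):
    """Return runs of consecutive words from `text` that are all in `keywords`."""
    candidates, current = [], []
    for word in re.findall(r'[A-Za-z]+', text):
        if word in keywords:
            # Still inside a run of keywords: keep extending it.
            current.append(word)
        else:
            # A non-keyword ends the current run, if there is one.
            if current:
                candidates.append(' '.join(current))
            current = []
    if current:
        candidates.append(' '.join(current))
    return candidates

# Hypothetical inputs, for illustration only.
sample_keywords = {'Management', 'Assistant', 'Senior', 'Clerk', 'Typist'}
sample_text = ('Two years of full-time paid experience as a Management Assistant '
               'or Senior Clerk Typist with the City of Los Angeles.')
print(find_title_candidates(sample_text, sample_keywords))
# prints: ['Management Assistant', 'Senior Clerk Typist']
```

With the full list, generic keywords such as 'City' would also match on their own, so some extra filtering would likely be needed.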
