# City of Los Angeles - Job Bulletins/Descriptions
- Helping the City of Los Angeles to Structure and Analyze its Job Descriptions Using NLTK.

In [None]:
# Libraries

# Install spaCy Library for Advanced Natural Language Processing in Python
!pip install spacy
!python -m spacy download en

The City of Los Angeles faces a big hiring challenge: 1/3 of its 50,000 workers are eligible to retire by July of 2020. The city has partnered with Kaggle to create a competition to improve the job bulletins that will fill all those open positions.

### Problem Statement

The content, tone, and format of job bulletins can influence the quality of the applicant pool. Overly-specific job requirements may discourage diversity. The Los Angeles Mayor’s Office wants to reimagine the city’s job bulletins by using text analysis to identify needed improvements.

The goal is to convert a folder full of plain-text job postings into a single structured CSV file and then to use this data to: (1) identify language that can negatively bias the pool of applicants; (2) improve the diversity and quality of the applicant pool; and/or (3) make it easier to determine which promotions are available to employees in each job class.

Unsupervised Machine Learning
- Topic Modeling

- https://www.geeksforgeeks.org/readability-index-pythonnlp/

- https://medium.com/@FastCompany/analyzing-the-subtle-bias-in-tech-companies-recruiting-emails-b9b3123b2991

- Word Cloud
- Count average number of words

In [1]:
# Dependencies & Setup

import os, glob, sys
import spacy
from spacy.matcher import PhraseMatcher
import random

In [None]:
# Retrive Data

# Method 1
# Open All Files in Directory
all_files = os.listdir("data/Job Bulletins")
# print(all_files)

# Get all the files in the folder and read their data 
for files in all_files:
    if (files != '.ipynb_checkpoints'):
        try:
            f = open("data/Job Bulletins/" + files, "r")
            contents = f.read()
        except:
            break         
# print(contents)

In [None]:
# Method 2
for all_files in glob.glob('*.txt'):
    print(all_files)

In [2]:
# Method 3
# Create a files array to hold all of the file names in the folder
files = []
# Folder Path
folder_path = "data/Job Bulletins"
# Iterate through all of the files in the folder path
counter = 0
for filename in glob.glob(os.path.join(folder_path, '*.txt')):
    with open(filename, 'r') as f:
        # Throw exception for file names that are not usable
        try:
            files.append(filename)
            counter += 1
        except:
            files.append('None')
print(f'Successfully retrieved {counter} files from folder.')

Successfully retrieved 684 files from folder.


In [3]:
# Create NLP pipeline

# nlp = English()
nlp = spacy.load('en')

# Model and languague data load and check
if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe('ner')
else:
    nlp.get_pipe('ner')

In [4]:
# Create known lables and their entities

label = 'TITLE'
matcher = PhraseMatcher(nlp.vocab)
for i in ["ENGINEER", "Engineer",]:
    matcher.add(label, None, nlp(i))

# label = 'COLDEG'
# matcher = PhraseMatcher(nlp.vocab)
# for i in ["bachelor's degree", "university", "four-year college"]:
#     matcher.add(label, None, nlp(i))

In [5]:
# Define the offest function to turn string indexes into item indexes 

def offsetter(lbl, doc, matchitem):
    o_one = len(str(doc[0:matchitem[1]]))
    subdoc = doc[matchitem[1]:matchitem[2]]
    o_two = o_one + len(str(subdoc))
    return (o_one, o_two, lbl)

In [6]:
# Warning ⚠️: Will take a while if used on every file, recommend 
# using test_files for testing.
# Create docs and entities to train the model with the labels created

test_files = files[:10]
test_file = os.path.join('data/Job Bulletins/SENIOR SAFETY ENGINEER ELEVATORS 4264  042718.txt')

res = []
to_train_ents = []

for file_name in test_files:
    if (file_name != 'None'):
        with open(f'{test_file}') as jb:
            line = True
            while line:
                line = jb.readline()
                mnlp_line = nlp(line)
                matches = matcher(mnlp_line)
                res = [offsetter(label, mnlp_line, x)
                      for x
                      in matches]
                to_train_ents.append((line, 
                                      dict(entities=res)))

In [7]:
for ent in to_train_ents:
    if (ent[1] != {'entities': []}):
        print(ent)

('SENIOR SAFETY ENGINEER ELEVATORS\n', {'entities': [(13, 21, 'TITLE')]})
('A Senior Safety Engineer Elevators assigns, reviews and evaluates the work of Safety Engineer Elevators engaged in the inspection of escalators, elevators, and similar devices for conformance to State laws and City ordinances regulating their design, installation and operation; applies sound supervisory principles and techniques in building and maintaining an effective work force; and fulfills equal employment opportunity responsibilities.\n', {'entities': [(15, 23, 'TITLE'), (84, 92, 'TITLE')]})
('Two years of full-time paid experience as a Safety Engineer Elevators with the City of Los Angeles.\n', {'entities': [(50, 58, 'TITLE')]})
('4. Upon appointment, a Senior Safety Engineer Elevators will be required to furnish an automobile, properly insured, for use in City service.  Mileage will be paid on the basis of established rates.\n', {'entities': [(36, 44, 'TITLE')]})
("Your examination score will be based en

In [8]:
# Clean Data

# Remove empty lines...
counter = 0
for line in to_train_ents:
    if ([line[0]] == ['']):
        counter += 1
        to_train_ents.remove(line)
print(counter)

10


In [9]:
# Warning ⚠️: This will use a lot of computer resources to run and will take a while, recommend 
# running on 1 epoch for testing. Although have on 20 for complete model.
# Train the model

optimizer = nlp.begin_training()

other_pipes = [pipe
              for pipe
              in nlp.pipe_names
              if pipe != 'ner']

# Epoch setting
epoch = 1

with nlp.disable_pipes(*other_pipes): # Only train NER
    for itn in range(epoch):
        losses = {}
        random.shuffle(to_train_ents)
        for item in to_train_ents:
#             print([item[0]])
            nlp.update([item[0]],
                       [item[1]],
                       sgd=optimizer,
                       drop=0.35,
                       losses=losses)
print(losses) # Error bounds

{'ner': 280.19521112589337}


In [10]:
# Test label-matcher

one = nlp("In order to apply for this job you need at least one bachelor's degree or a four-year college.\
          and/or university. Engineer.")
matches = matcher(one)
[match for match in matches]

[(12876033169774478903, 26, 27)]

In [11]:
# Test built-in label and entity matcher

to_analyze = ("Hello, my name is Xavier, and tonight we're in San Jose.")
doc = nlp(to_analyze)
ents = [(x.text, x.label_)
       for x in doc.ents]
print(ents)

[]


In [12]:
# Test docs

to_train_ents = to_train_ents[:-1]
for item in to_train_ents:
            print([item[0]])

['\n']
['\n']
['\n']
['3. A final average score of 70% is required to be placed on the eligible list.\n']
['2. Applications are accepted subject to review to ensure that minimum qualifications are met.  Candidates may be disqualified at any time if it is determined that they do not possess the minimum qualifications stated on this bulletin.\n']
['If you receive and accept an offer of employment to a regular position with the City of Los Angeles, your employee benefit coverage (including health and dental coverage as well as life insurance) will commence approximately six weeks after your original regular appointment. Not all positions in the City receive benefit coverage; you should inquire regarding the availability of employee benefits prior to accepting a position.\n']
['Class Code:       4264\n']
['\n']
['If you receive and accept an offer of employment to a regular position with the City of Los Angeles, your employee benefit coverage (including health and dental coverage as well a

In [14]:
# Test model

# from spacy import displacy
# for item in to_train_ents:
#     displacy.render(one, style='dep')

# TO-DO:
- Data Examination:
    - What biases (positive/negative) are we trying to find? **Important
    - How will we optimize each doc? (i.e. Skills, Pay, etc.) 
    - Are there important entities we should establish? (i.e. college degree, certain responsibilites, etc.)
    - What text analyis could make these docs better?