# Data Cleaning: Normalize individual functions

## Background

Quite a few notebooks in this competition focus on data mining with various Python string manipulations, most notably regular expressions (regex for short). While these notebooks achieve highly satisfactory outcomes and open directions for future research, they quickly run into limits because of the poor quality of the given dataset(s). Recall that we are given a collection of .txt files that describe past City of Los Angeles jobs, along with PDF versions of the actual job postings. While the City has done an excellent job of composing these text files, the **source** that generated them is not revealed, which makes it very hard for participants to anticipate all possible inconsistencies in the data. For example, a diligent search shows that the file FIRE ASSISTANT CHIEF 2166 011218.txt doesn't have a JOB_DUTIES field, which is very questionable since the `kaggle_data_dictionary.csv` file explicitly states that every job must have this field (see the *Accepts Null Values?* column).

By now, hopefully we can see that such inconsistencies will break any attempt to write data-mining code **before** data cleaning. Even if we agree to hand-wave this problem (e.g., by violating the non-null requirement of the JOB_DUTIES field), a person with a strong scientific mindset cannot just hand in the solution and walk away without at least noting where, what, and why values are missing. On the other hand, since it is well known that exploratory data analysis (EDA) and data cleaning typically take up much of a data scientist's time, it is understandable that very few kernels (if any at all) are dedicated to tackling data cleaning. For unstructured data like .txt files, this could mean one has to actively read the descriptions of several jobs before being able to figure out what cleaning is needed.

<font size=4, color='red'>**Important Details.**</font> This series of notebooks is dedicated to data cleaning, as I've realized that this dataset is very rich in information. Thus, a first complete and accurate [csv file](https://www.kaggle.com/c/data-science-for-good-city-of-los-angeles/overview), which advanced analyses can be built upon, should help the City of Los Angeles restructure their job postings to attract more talent in the upcoming years. The approach taken here is quite novel, at least compared to other kernels. First, I focus on writing code to parse SYSTEMS ANALYST 1596 102717.txt into a csv file that is exactly the same as `sample job class export template.csv`. This is done by writing 25 main functions, each dedicated to extracting the information for exactly one of the 25 field names, plus two helper functions. Then I use these functions to validate the consistency of the other .txt files and manually modify the files to fit the pattern of SYSTEMS ANALYST 1596 102717.txt. For example, the requirement section of ACCOUNTANT 1513 062218.txt has no itemizer, e.g., 1) or 1., so I manually add a 1. before the word Graduation to match the pattern in SYSTEMS ANALYST 1596 102717.txt.

Admittedly, the steps taken here are extremely labor-intensive. While I agree that there might be easier ways, I do this on purpose, as I've learned that it is the best way to familiarize myself with this unstructured data. As mentioned above, since **the source that generated these text files is not known**, we really have no idea how to retrieve relevant information by pattern matching on raw data. For example, one might attempt something such as `job[job.find('REQUIREMENTS/MINIMUM QUALIFICATIONS'):job.find('PROCESS NOTES')]` or a similar expression using regex. Although this statement may retrieve **only** the relevant information regarding school type, education majors, etc., this is not guaranteed! For instance, I found that in some jobs the heading PROCESS NOTES comes after the heading WHERE TO APPLY, so the slice would also pick up unrelated sections, which causes severe headaches later. The only way to avoid this, as far as I can tell, is to patiently do some manual data cleaning before any analysis.
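
To make the pitfall concrete, here is a minimal sketch on three made-up bulletin strings. The strings and the helper name `slice_between` are assumptions for illustration, not taken from the dataset. It shows that a reversed heading order pollutes the slice, and that `str.find` returns -1 when a heading is missing, which silently cuts the slice at the last character instead of raising an error.

```python
def slice_between(text, start_marker, end_marker):
    """Naive section extraction: slice from start_marker up to end_marker."""
    return text[text.find(start_marker):text.find(end_marker)]

# Hypothetical bulletin where the headings appear in the expected order.
ordered = "REQUIREMENTS\n1. Graduation\nPROCESS NOTES\nsome notes\nWHERE TO APPLY\nonline"
# Hypothetical bulletin where PROCESS NOTES comes after WHERE TO APPLY.
shuffled = "REQUIREMENTS\n1. Graduation\nWHERE TO APPLY\nonline\nPROCESS NOTES\nsome notes"
# Hypothetical bulletin with no PROCESS NOTES heading at all.
missing = "REQUIREMENTS\n1. Graduation\nWHERE TO APPLY\nonline"

clean_slice = slice_between(ordered, "REQUIREMENTS", "PROCESS NOTES")
polluted_slice = slice_between(shuffled, "REQUIREMENTS", "PROCESS NOTES")
truncated_slice = slice_between(missing, "REQUIREMENTS", "PROCESS NOTES")

print(repr(clean_slice))                    # only the requirements section
print("WHERE TO APPLY" in polluted_slice)   # True: an unrelated section leaked in
print(truncated_slice == missing[:-1])      # True: find() returned -1, no error raised
```

None of the three calls raises an exception, which is why such problems are easy to miss without reading the files.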

The manually cleaned files are kept in a separate folder, JobBulletins_cleaned, which is used later in this notebook.

## Import relevant modules

In [1]:
import os                       # module to interface with the underlying OS
import numpy as np              # linear algebra
import pandas as pd             # dataframe
import re                       # regular expression
import matplotlib.pyplot as plt # data visualization
%matplotlib inline
import toolkits as tk           # user-defined module for efficiently reading files

## Get paths and names of files in each path

In [2]:
# Path and list of jobs in Job Bulletins.
# NOTE 1: These are raw data
(raw_path, raw_jobs) = tk.get_raw_jobs() # tk is a user-defined module

# Path and list of jobs in JobBulletins_cleaned
# NOTE 2: These are cleaned data
(cleaned_path, cleaned_jobs) = tk.get_cleaned_jobs()
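
The `toolkits` module is not shown in this notebook. As a rough sketch only, a helper like `tk.get_raw_jobs` might look like the function below, assuming it returns the folder path and the list of .txt file names in that folder. The name `list_jobs`, the trailing separator on the path, and the demo file names are all assumptions for illustration.

```python
import os
import tempfile

def list_jobs(folder):
    """Return (folder path ending in a separator, sorted list of .txt file names)."""
    path = os.path.join(folder, '')  # joining with '' appends a trailing separator
    names = sorted(f for f in os.listdir(folder) if f.lower().endswith('.txt'))
    return path, names

# Demonstration on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ('B JOB 2222 010101.txt', 'A JOB 1111 010101.txt', 'notes.pdf'):
        open(os.path.join(demo_dir, name), 'w').close()
    demo_path, demo_jobs = list_jobs(demo_dir)
    print(demo_jobs)  # ['A JOB 1111 010101.txt', 'B JOB 2222 010101.txt']
```

With a trailing separator on the path, a file name can be attached either by plain concatenation or with `os.path.join`.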

## Normalize JOB_CLASS_TITLE (jct)

In [3]:
# This is a helper function
def job_class_title(job):
    '''Returns the field JOB_CLASS_TITLE (jct)'''
    # The title is located between the beginning of the text and the phrase 'Class Code'
    temp = job[:job.index('Class Code')]
    # Split at newline characters and take the first element of the resulting list
    jct = temp.split('\n')[0]
    # Remove redundant white spaces.
    # Splitting at a single space yields empty strings for consecutive spaces,
    # so we drop them by filtering on the length of each element e
    jct = [e for e in jct.split(' ') if len(e) > 0]
    
    # Join the words back together with single spaces
    jct = ' '.join(jct)
    return jct

In [4]:
# Normalization strategy:
# print jct in the try clause and print job_path in the except clause,
# then look at the printouts and detect unusual jct's.
border_line = '##############################################################################################'
for file_name in raw_jobs:
    job_path = os.path.join(raw_path, file_name)  # path to file_name
    with open(job_path, 'rt') as f:               # read in the job as a string
        raw_job = f.read()
    try:
        print(job_class_title(job=raw_job))
    except Exception:                             # print a banner so failing files stand out
        print(border_line)
        print(job_path.center(len(border_line), '#'))
        print(border_line)

311 DIRECTOR
ACCOUNTANT
ACCOUNTING CLERK
ACCOUNTING RECORDS SUPERVISOR
ADMINISTRATIVE ANALYST
ADMINISTRATIVE CLERK
ADMINISTRATIVE HEARING EXAMINER
ADVANCE PRACTICE PROVIDER CORRECTIONAL CARE
AIR CONDITIONING MECHANIC
AIR CONDITIONING MECHANIC SUPERVISOR
AIRPORT AIDE

AIRPORT ENGINEER
AIRPORT GUIDE
AIRPORT INFORMATION SPECIALIST
AIRPORT LABOR RELATIONS ADVOCATE
AIRPORT MANAGER
AIRPORT POLICE CAPTAIN

AIRPORT POLICE OFFICER
AIRPORT POLICE SPECIALIST
AIRPORT SUPERINTENDENT OF OPERATIONS
AIRPORTS MAINTENANCE SUPERINTENDENT
AIRPORTS MAINTENANCE SUPERVISOR
AIRPORTS PUBLIC AND COMMUNITY RELATIONS DIRECTOR
ANIMAL CARE ASSISTANT
ANIMAL CARE TECHNICIAN
WATER TREATMENT OPERATOR
ANIMAL CONTROL OFFICER
ANIMAL KEEPER
APPARATUS OPERATOR
APPLICATIONS PROGRAMMER
APPRENTICE - METAL TRADES
APPRENTICE MACHINIST
AQUARIST
AQUARIUM EDUCATOR
AQUATIC DIRECTOR
AQUATIC FACILITY MANAGER
AQUEDUCT AND RESERVOIR KEEPER

ARCHITECT
CAMPUS INTERVIEWS ONLY
ARCHITECTURAL DRAFTING TECHNICIAN
ARCHIVIST
ART CENTER DIRECTOR


UTILITY BUYER
UTILITY EXECUTIVE SECRETARY
UTILITY SERVICES MANAGER
UTILITY SERVICES SPECIALIST
VETERINARY TECHNICIAN 				 	
VIDEO PRODUCTION COORDINATOR
VIDEO TECHNICIAN
VOLUNTEER COORDINATOR
WAREHOUSE AND TOOLROOM WORKER
WASTEWATER TREATMENT OPERATOR
WASTEWATER COLLECTION WORKER
WASTEWATER TREATMENT ELECTRICIAN
WASTEWATER TREATMENT ELECTRICIAN SUPERVISOR
WASTEWATER TREATMENT LABORATORY MANAGER
WASTEWATER TREATMENT MECHANIC
WASTEWATER TREATMENT MECHANIC SUPERVISOR
WASTEWATER TREATMENT OPERATOR
WATER BIOLOGIST

WATER SERVICE REPRESENTATIVE
WATER SERVICE SUPERVISOR
WATER SERVICE WORKER
WATER SERVICES MANAGER
WATER TREATMENT OPERATOR
WATER TREATMENT SUPERVISOR
WATER UTILITY OPERATOR
WATER UTILITY OPERATOR SUPERVISOR
WATER UTILITY SUPERINTENDENT
WATER UTILITY SUPERVISOR
WATER UTILITY WORKER
WATERSHED RESOURCES SPECIALIST
WATERWORKS ENGINEER
WATERWORKS MECHANIC SUPERVISOR
WELDER
WELDER SUPERVISOR
WHARFINGER
WINDOW CLEANER
WORKERS' COMPENSATION ANALYST
WORKERS' COMPENSATION CLAIMS ASSISTANT


A few things to notice here:
1. The function `job_class_title` was able to read all raw jobs (no errors were caught).
2. There are some suspicious blank lines, e.g., AIRPORT AIDE, blank line, AIRPORT ENGINEER. 
    * By looking through the Job Bulletins folder, we see that the function returns an empty string for some jobs, e.g., AIRPORT CHIEF INFORMATION SECURITY OFFICER 1404 120415_Modified.txt. This is a limitation of our code: these jobs don't follow the pattern of SYSTEMS ANALYST 1596 102717.txt because they start with newline characters, so the first line is empty.
    * One could go back to `job_class_title` and modify it, for example with regex, to capture all titles. However, we will **not** do that, since the approach taken here is **data normalization**; that is, we make sure every job follows the same pattern as SYSTEMS ANALYST 1596 102717.txt by modifying its content.
    * This may sound unexciting; however, the real benefit of this approach only shows up later, when we need to parse information about job requirements. At that point, we'll see that the data normalization approach outperforms most of the traditional ones, which focus on writing functions that fit all jobs. On top of that, by patiently modifying these jobs manually, we familiarize ourselves with this type of unstructured data and become aware of where missing values occur.
    * Here are the jobs that were modified by removing initial newline characters:
        * CityofLA/Job Bulletins/AIRPORT CHIEF INFORMATION SECURITY OFFICER 1404 120415_Modified.txt
        * CityofLA/Job Bulletins/AIRPORT POLICE LIEUTENANT 3227 091616.txt
        * CityofLA/Job Bulletins/AQUEDUCT AND RESERVOIR SUPERVISOR 5816 091115.txt
        * CityofLA/Job Bulletins/ART CURATOR 2448 071516 REV 072816.txt
        * CityofLA/Job Bulletins/AUTO ELECTRICIAN 3707 052215.txt
        * CityofLA/Job Bulletins/CARPET LAYER 3418 061915.txt
        * CityofLA/Job Bulletins/CHIEF AIRPORTS ENGINEER 7274 051515 (1).txt
        * CityofLA/Job Bulletins/CHIEF INTERNAL AUDITOR 1619 090916 (5).txt
        * CityofLA/Job Bulletins/CHIEF TAX COMPLIANCE OFFICER 1211 041814.txt
        * CityofLA/Job Bulletins/DIRECTOR OF HOUSING 1568 062317.txt
        * CityofLA/Job Bulletins/ELECTRIC METER SETTER 3822 012017.txt
        * CityofLA/Job Bulletins/ELEVATOR REPAIR SUPERVISOR 032516 REVISED 040516.txt
        * CityofLA/Job Bulletins/EXAMINER OF QUESTIONED DOCUMENTS 3229 120415.txt
        * CityofLA/Job Bulletins/FIREARMS EXAMINER 2233 062416.txt
        * CityofLA/Job Bulletins/GARAGE ATTENDANT 3531 013015.txt
        * CityofLA/Job Bulletins/IMPROVEMENT ASSESSOR SUPERVISOR 1564 100215.txt
        * CityofLA/Job Bulletins/MARINE ENVIRONMENTAL SUPERVISOR 9433 071114 (1).txt
        * CityofLA/Job Bulletins/PARKING ENFORCEMENT MANAGER 9025 021916 rev022516.txt
        * CityofLA/Job Bulletins/PIPEFITTER SUPERVISOR 3438 081216.txt
        * CityofLA/Job Bulletins/POLICE SERGEANT 2227 102116.txt
        * CityofLA/Job Bulletins/POLICE SURVEILLANCE SPECIALIST 3687 052215.txt
        * CityofLA/Job Bulletins/PORT ELECTRICAL MECHANIC 3758 022616.txt
        * CityofLA/Job Bulletins/PRINCIPAL CIVIL ENGINEER 9489 022318.txt
        * CityofLA/Job Bulletins/PRINCIPAL DEPUTY CONTROLLER 7260 032814.txt
        * CityofLA/Job Bulletins/REHABILITATION PROJECT COORDINATOR 8502 032715.txt
        * CityofLA/Job Bulletins/RETIREMENT PLAN MANAGER 9149 052314 (1).txt
        * CityofLA/Job Bulletins/SENIOR BUILDING OPERATING ENGINEER 5925 011615 (1).txt
        * CityofLA/Job Bulletins/SENIOR UTILITY SERVICES SPECIALIST 3753 121815 (1).txt
        * CityofLA/Job Bulletins/SENIOR WASTEWATER TREATMENT OPERATOR 4124 041417.txt
        * CityofLA/Job Bulletins/SIGN PAINTER 3428 121214.txt
        * CityofLA/Job Bulletins/SIGN SHOP SUPERVISOR 3419 030615.txt
        * CityofLA/Job Bulletins/SR CRIME _ INTELLIGENCE ANALYST 2241 011516.txt
        * CityofLA/Job Bulletins/STREET LIGHTING ENGINEER 7537 052617 (4).txt
        * CityofLA/Job Bulletins/SUPERVISING TRANSPORTATION PLANNER 2481 072216.txt
        * CityofLA/Job Bulletins/SUPPLY SERVICES PAYMENT CLERK 1214 031017-.txt
        * CityofLA/Job Bulletins/SYSTEMS PROGRAMMER 1455 091616 REV 100416.txt
        * CityofLA/Job Bulletins/TILE SETTER 3493 090415.txt
        * CityofLA/Job Bulletins/TRAFFIC MARKING AND SIGN SUPERINTENDENT 3430 032219.txt
        * CityofLA/Job Bulletins/WATER MICROBIOLOGIST  7857 072514 rev073114.txt
        * CityofLA/Job Bulletins/ZOO CURATOR 4297 040816.txt
3. Some jobs have suspiciously odd titles, such as CAMPUS INTERVIEWS ONLY. 
    * The following list names the jobs that were modified. The phrase CAMPUS INTERVIEWS ONLY that used to appear before the job title was moved to the `NOTES:` section and wrapped in `>>>` and `<<<` markers (the Python prompt symbol and its mirror image), i.e. >>>CAMPUS INTERVIEWS ONLY<<<
        * CityofLA/Job Bulletins/ARCHITECTURAL ASSOCIATE 7926 013114 REV 032916.txt
        * CityofLA/Job Bulletins/ENVIRONMENTAL ENGINEERING ASSOCIATE  7871 020113 REV 032916.txt
        * CityofLA/Job Bulletins/STREET LIGHTING ENGINEERING ASSOCIATE 7527 101102 REV 032916.txt
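
The blank titles described in item 2 can be reproduced on a synthetic string. The sketch below repeats the logic of `job_class_title` so that it runs on its own. The sample texts (including the dates) and the helper `starts_with_blank_line` are made up for illustration, and the helper is only one possible way to flag files that need the manual fix.

```python
def job_class_title(job):
    """Same logic as the notebook's job_class_title: first line before 'Class Code'."""
    first_line = job[:job.index('Class Code')].split('\n')[0]
    return ' '.join(e for e in first_line.split(' ') if len(e) > 0)

def starts_with_blank_line(job):
    """True if the text begins with a newline, i.e. the title is not on line one."""
    return job.startswith(('\n', '\r\n'))

# Made-up bulletin texts: one follows the expected pattern, one starts with a newline.
well_formed = "SYSTEMS  ANALYST\nClass Code: 1596\nOpen Date: 10-27-17"
leading_newline = "\nAIRPORT POLICE LIEUTENANT\nClass Code: 3227\nOpen Date: 09-16-16"

print(repr(job_class_title(well_formed)))       # 'SYSTEMS ANALYST'
print(repr(job_class_title(leading_newline)))   # ''
print(starts_with_blank_line(leading_newline))  # True
```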

**Let's rerun the function and observe the changes. Note that we use the .txt files in the JobBulletins_cleaned folder.**

Observing the printouts carefully this time, we see that all of the issues above have been resolved.

In [5]:
# Rerun the function job_class_title using cleaned data
border_line = '##############################################################################################'
for file_name in cleaned_jobs:
    job_path = os.path.join(cleaned_path, file_name)  # path to file_name
    with open(job_path, 'rt') as f:                   # read in the job as a string
        cleaned_job = f.read()
    try:
        print(job_class_title(job=cleaned_job))
    except Exception:                                 # print a banner so failing files stand out
        print(border_line)
        print(job_path.center(len(border_line), '#'))
        print(border_line)

311 DIRECTOR
ACCOUNTANT
ACCOUNTING CLERK
ACCOUNTING RECORDS SUPERVISOR
ADMINISTRATIVE ANALYST
ADMINISTRATIVE CLERK
ADMINISTRATIVE HEARING EXAMINER
ADVANCE PRACTICE PROVIDER CORRECTIONAL CARE
AIR CONDITIONING MECHANIC
AIR CONDITIONING MECHANIC SUPERVISOR
AIRPORT AIDE
AIRPORT CHIEF INFORMATION SECURITY OFFICER
AIRPORT ENGINEER
AIRPORT GUIDE
AIRPORT INFORMATION SPECIALIST
AIRPORT LABOR RELATIONS ADVOCATE
AIRPORT MANAGER
AIRPORT POLICE CAPTAIN
AIRPORT POLICE LIEUTENANT
AIRPORT POLICE OFFICER
AIRPORT POLICE SPECIALIST
AIRPORT SUPERINTENDENT OF OPERATIONS
AIRPORTS MAINTENANCE SUPERINTENDENT
AIRPORTS MAINTENANCE SUPERVISOR
AIRPORTS PUBLIC AND COMMUNITY RELATIONS DIRECTOR
ANIMAL CARE ASSISTANT
ANIMAL CARE TECHNICIAN
WATER TREATMENT OPERATOR
ANIMAL CONTROL OFFICER
ANIMAL KEEPER
APPARATUS OPERATOR
APPLICATIONS PROGRAMMER
APPRENTICE - METAL TRADES
APPRENTICE MACHINIST
AQUARIST
AQUARIUM EDUCATOR
AQUATIC DIRECTOR
AQUATIC FACILITY MANAGER
AQUEDUCT AND RESERVOIR KEEPER
AQUEDUCT AND RESERVOIR SUPERVIS

PROCUREMENT ANALYST
PROCUREMENT SUPERVISOR
PROGRAMMER ANALYST
PROPERTY MANAGER
PROPERTY OFFICER
PROTECTIVE COATING WORKER
PUBLIC INFORMATION DIRECTOR
PUBLIC RELATIONS SPECIALIST
PUBLIC SAFETY RISK MANAGER
RATES MANAGER	
REAL ESTATE ASSOCIATE
REAL ESTATE OFFICER
REAL ESTATE TRAINEE
RECREATION COORDINATOR
RECREATION FACILITY DIRECTOR
RECREATION SUPERVISOR
REFUSE COLLECTION SUPERVISOR	
REFUSE COLLECTION TRUCK OPERATOR
REFUSE CREW FIELD INSTRUCTOR
REHABILITATION CONSTRUCTION SPECIALIST
REHABILITATION PROJECT COORDINATOR
REINFORCING STEEL WORKER
REPROGRAPHICS OPERATOR
REPROGRAPHICS SUPERVISOR
RETIREMENT PLAN MANAGER
RIDESHARE PROGRAM ADMINISTRATOR
RISK AND INSURANCE ASSISTANT
RISK MANAGEMENT AND PREVENTION PROGRAM SPECIALIST
RISK MANAGER
ROOFER
ROOFER SUPERVISOR
SAFETY ADMINISTRATOR
SAFETY ENGINEER
SAFETY ENGINEER ELEVATORS
SAFETY ENGINEER PRESSURE VESSELS
SAFETY ENGINEERING ASSOCIATE
SANITATION SOLID RESOURCES MANAGER
SANITATION WASTEWATER MANAGER
SECRETARY
SECRETARY LEGAL
SECURITY AIDE
SE
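
A few titles in the printout above (for example RATES MANAGER) appear to end with a tab character. That is consistent with `job_class_title`, which splits on a single space only and therefore leaves tabs in place. As a sketch of one alternative, not what this notebook does, `str.split()` with no argument splits on any run of whitespace, so tabs are dropped as well. The helper names below are hypothetical.

```python
def squeeze_spaces_only(text):
    """The notebook's approach: only literal spaces are treated as separators."""
    return ' '.join(e for e in text.split(' ') if len(e) > 0)

def squeeze_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs) into one space and trim the ends."""
    return ' '.join(text.split())

title_with_tab = 'RATES  MANAGER\t'
print(repr(squeeze_spaces_only(title_with_tab)))  # 'RATES MANAGER\t'
print(repr(squeeze_whitespace(title_with_tab)))   # 'RATES MANAGER'
```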

## Normalize JOB_CLASS_NO (jcn)

In [6]:
# This is one of the helper functions
def job_class_no(job):
    '''Returns the field JOB_CLASS_NO (jcn)'''
    # The class number is located between the phrases 'Class Code' and 'Open Date'.
    temp = job[job.index('Class Code'):job.index('Open Date')]
    # Keep the tokens made up entirely of digits (checked via isdigit()); the first one is the class number
    jcn  = [string_num for string_num in temp.split() if string_num.isdigit()][0]
    
    return jcn
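
As a quick check of the logic above, the sketch below runs the same steps on a made-up header (the function is repeated so the block runs on its own). It also shows why the class number is kept as a string: a code with a leading zero, such as the 0845 that appears in the printout for the raw files, would lose that zero if converted to `int`. The sample text is an assumption for illustration.

```python
def job_class_no(job):
    """Same logic as the notebook's job_class_no: first all-digit token between the markers."""
    temp = job[job.index('Class Code'):job.index('Open Date')]
    return [token for token in temp.split() if token.isdigit()][0]

# Made-up header in the shape the function expects.
sample = "SAMPLE JOB TITLE\nClass Code:       0845\nOpen Date: 01-01-18"

jcn = job_class_no(sample)
print(jcn)       # 0845
print(int(jcn))  # 845, the leading zero is lost
```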

In [7]:
# Normalization strategy:
# print jcn in the try clause and print job_path in the except clause,
# then look at the printouts and detect unusual jcn's.
border_line = '##############################################################################################'
for file_name in raw_jobs:
    job_path = os.path.join(raw_path, file_name)  # path to file_name
    with open(job_path, 'rt') as f:               # read in the job as a string
        raw_job = f.read()
    try:
        print(job_class_no(job=raw_job))
    except Exception:                             # print a banner so failing files stand out
        print(border_line)
        print(job_path.center(len(border_line), '#'))
        print(border_line)

9206
1513
1223
1119
1590
1358
9135
2325
3774
3781
1540
1404
7256
0845
1783
9210
7260
3228
3227
3225
3236
7268
3331
3336
1788
4323
4310
5885
4311
4304
2121
1429
3789
3764
2400
2493
2419
2423
5813
5816
7925
7926
7922
1191
2478
2448
2447
2454
2455
3440
3435
4143
4145
7259
3808
3684
4219
9377
3142
4208
9415
3818
3809
3150
1860
7998
6147
1517
3704
3706
3707
3721
3595
3714
3565
1759
1764
1203
3733
3735
3737
7244
3124
7543
4211
3190
7561
4251
5923
3338
3333
3588
3589
1801
3344
3346
3418
3353
3354
3351
7833
1554
7274
9151
5927
1253
1260
1249
1249
1466
7296
3182
5237
4289
9230
2237
9286
4254
1619
9182
7945
7271
7258
9180
1968
5154
4260
3187
4286
1211
4275
7944
7941
7237
7246
7232
1767
1600
1603
1213
9734
3800
3802
3686
3689
7610
7607
1461
2496
8500
2501
9165
3129
3127
3541
3341
7291
9168
7230
2317
2236
2234
3149
3156
3176
1230
1229
1136
1470
5131
1121
1593
3211
1768
9304
9302
7625
4266
4321
1568
7270
3722
3123
1488
3208
9375
4320
6157
3521
1493
3879
3873
3822
7520
5224
3828
3799
7525
7532
4221
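
Eyeballing a long column of numbers is error-prone. Since most file names in this dataset embed the class code (e.g., SYSTEMS ANALYST 1596 102717.txt), one possible sanity check is to compare the code parsed from the text with the four-digit token in the file name. The sketch below covers only the file-name side. The helper `class_no_from_filename` is hypothetical, and file names without a standalone four-digit token (such as ELEVATOR REPAIR SUPERVISOR 032516 REVISED 040516.txt) simply yield `None`.

```python
import re

def class_no_from_filename(file_name):
    """Return the first standalone four-digit token in a file name, or None."""
    match = re.search(r'(?<!\d)\d{4}(?!\d)', file_name)
    return match.group(0) if match else None

with_code = class_no_from_filename('SYSTEMS ANALYST 1596 102717.txt')
without_code = class_no_from_filename('ELEVATOR REPAIR SUPERVISOR 032516 REVISED 040516.txt')

print(with_code)     # 1596
print(without_code)  # None
```

A mismatch between this value and the output of `job_class_no` would point to a file worth opening by hand.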
