In [1]:
# First, import relevant modules
import numpy as np
import pandas as pd

In [2]:
# Import os module to allow us to interface with the underlying operating system that python is running on
import os

# Define path to look at
path = 'CityofLA/Job Bulletins/'

# Get a list of all entries in this path (expected to be the txt files)
all_txt_files = os.listdir(path) # os.listdir returns entries in arbitrary order, not the order shown in the folder
all_txt_files.sort() # sort files alphabetically. WARNING: this sorts in place; sorted(all_txt_files) would return a new list instead

# Note to self: some people use os.walk, which visits the root and every subdirectory, to list all files.
# We probably don't need it for now
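
If the folder ever held non-txt entries or nested subfolders, the listing could be filtered or made recursive. The following is only a minimal sketch of that idea, not what this notebook does; the helper name `list_txt_files` and its `recursive` flag are made up for illustration.

```python
import os

def list_txt_files(root, recursive=False):
    """Return a sorted list of .txt file names found under root (hypothetical helper)."""
    if not recursive:
        # os.listdir returns every entry (files and folders) in arbitrary order
        names = [name for name in os.listdir(root)
                 if name.lower().endswith('.txt')
                 and os.path.isfile(os.path.join(root, name))]
    else:
        # os.walk yields (dirpath, dirnames, filenames) for root and each subdirectory
        names = [name for _, _, filenames in os.walk(root)
                 for name in filenames
                 if name.lower().endswith('.txt')]
    return sorted(names)
```

With a flat folder such as `Job Bulletins`, both modes should return the same list, which is why plain `os.listdir` is enough here.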

In [3]:
# Do some checks here
print(len(all_txt_files))      # length should be 683 as manually verified
print(len(set(all_txt_files))) # see if each file is unique. hopefully 683 as well!
print('SYSTEMS ANALYST 1596 102717.txt' in all_txt_files) # should be True

683
683
True


# Note 1
I suddenly remembered that there is a csv file called `job_titles` that lists all the job titles. However, it has only 668 records, while there are 683 txt files in `Job Bulletins`. The difference told me that something was wrong with these txt files. Thus, I built my own pandas dataframe to inspect the conflict.

This dataframe consists of all the job titles extracted from the names of the txt files. Later on, I will place it side by side with the `job_titles` dataframe (built by importing the csv file), probably through an outer join, to see what the problem is.

To this end, below are the steps to build the `self_build_job_titles` dataframe:
1. For each element *i* in `all_txt_files`, which is a string, split it at white space.
2. Use try/except to build a list of indices of elements in *i* that **cannot** be cast into integers.
3. Build a list of the remaining indices (those given by the `range` function that are not in the list from step 2) and get the first element of this list. This is the index of the first element that can be cast into an integer (that element is actually the JOB_CLASS_NO).
4. Join the words that precede this index with white space to get the job title.
5. Collect all job titles in a list and convert this final list to a dataframe `self_build_job_titles`

All these steps are done through a helper function.
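
To make the steps concrete, they can be traced on a single file name. The sketch below is a condensed variant rather than the helper function itself: it merges steps 2 and 3 into one pass that stops at the first integer-like word, and the name `extract_job_title` is hypothetical.

```python
def extract_job_title(file_name):
    """Return the words that precede the first integer word in file_name (hypothetical helper)."""
    words = file_name.split()              # step 1: split at white space
    for idx, word in enumerate(words):
        try:
            int(word)                      # steps 2-3: the first word that casts to int is the class code
        except ValueError:
            continue                       # not an integer, keep looking
        return ' '.join(words[:idx])       # step 4: join everything before the class code
    raise ValueError(f'no class code found in {file_name!r}')

print(extract_job_title('SYSTEMS ANALYST 1596 102717.txt'))  # SYSTEMS ANALYST
```

Note that the last word, `102717.txt`, never casts to an integer because of the extension, so the class code is the only word that can stop the loop.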

In [4]:
def build_job_titles(list_of_titles):
    '''
    BUILD_JOB_TITLES creates a pandas dataframe based on titles listed in list_of_titles. This function strips away
    all the unnecessary details in each element of list_of_titles, such as JOB_CLASS_NO
    '''
    # Check if input given is a list
    assert isinstance(list_of_titles, (list, np.ndarray))
    
    # Build a list of all job titles
    all_job_titles = []
    for messy_title in list_of_titles:
        #print(messy_title) # to manually correct inconsistent names in the txt files. See Note 2.
        ## Split at white space
        messy_title = messy_title.split()
        ## Build a list of indices of elements, which CANNOT be cast into integers.
        ## enumerate is used because list.index returns only the first match when a word is repeated.
        indices_of_nonint_words = []
        for idx, element in enumerate(messy_title):
            try:
                int(element)
            except ValueError:
                indices_of_nonint_words.append(idx)
        ## Get the index of the first element that CAN be cast into an integer. This element is actually the JOB_CLASS_NO
        job_class_no = [idx for idx in range(len(messy_title)) if idx not in indices_of_nonint_words][0]
        ## Get the job title by subsetting messy_title, marking where to stop with job_class_no
        job_title = ' '.join(messy_title[:job_class_no])
        ## Finally, append job_title to all_job_titles
        all_job_titles.append(job_title)

    # Returns
    return pd.DataFrame(data=all_job_titles, columns=['job_title'])

# Note 2
Interestingly, when executing this code, I found that many of these text files were not named consistently. The general format should be "Job Title, Class Code, Open Date in one word, Miscellaneous details.txt". However, quite a few of them did not follow this format. Here's the list, with the name I changed each one to:
* "COMMUNITY AFFAIRS ADVOCATE  111414.txt" to "COMMUNITY AFFAIRS ADVOCATE 2496 111414.txt"
* "ELECTRIC SERVICE REPRESENTATIVE 020317.txt" to "ELECTRIC SERVICE REPRESENTATIVE 7520 020317.txt"
* "FIRE SPECIAL INVESTIGATOR 021216.txt" to "FIRE SPECIAL INVESTIGATOR 1632 021216.txt"
* "REFUSE COLLECTION TRUCK OPERATOR 021717.txt" to "REFUSE COLLECTION TRUCK OPERATOR 3580 021717.txt"
* "REHABILITATION CONSTRUCTION SPECIALIST 072718.txt" to "REHABILITATION CONSTRUCTION SPECIALIST 1569 072718.txt"
* "Vocational Worker  DEPARTMENT OF PUBLIC WORKS.txt" to "Vocational Worker DEPARTMENT OF PUBLIC 0 WORKS.txt". <font color='red'>Note that this job is quite strange as its file name doesn't contain a class code at all. So I had to insert an artificial number into it, so it won't break my code.</font>
* "WASTEWATER TREATMENT OPERATOR 120718.txt" to "WASTEWATER TREATMENT OPERATOR 4123 120718.txt"

I already fixed all of them manually, i.e., renamed these text files appropriately.
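
The inconsistent names were spotted by hand here, but the same check could be scripted. A minimal sketch, assuming that "inconsistent" means no word in the file name casts to an integer; the helper name `has_class_code` is hypothetical.

```python
def has_class_code(file_name):
    """Return True if any whitespace-separated word in file_name casts to an integer."""
    for word in file_name.split():
        try:
            int(word)
            return True
        except ValueError:
            pass  # not an integer, try the next word
    return False

# A few names taken from the list above, before and after renaming
names = [
    'COMMUNITY AFFAIRS ADVOCATE  111414.txt',
    'COMMUNITY AFFAIRS ADVOCATE 2496 111414.txt',
    'FIRE SPECIAL INVESTIGATOR 021216.txt',
    'SYSTEMS ANALYST 1596 102717.txt',
]

# Keep only the names that lack a class code
flagged = [name for name in names if not has_class_code(name)]
print(flagged)
```

Only the two names without a class code end up in `flagged`; the open date alone does not count because it is fused with the `.txt` extension.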

In [5]:
self_build_job_titles = build_job_titles(all_txt_files)

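# To Be Continued
As discussed in Note 1, the next step is to place `self_build_job_titles` side by side with the `job_titles` dataframe to find out why there are 683 txt files but only 668 csv records.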
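
One possible shape for that comparison is sketched below. It is only an illustration under assumptions: the two toy frames stand in for the real ones, their titles are made up, and the csv is assumed to expose a column named `job_title` (its real column name is not shown in this notebook).

```python
import pandas as pd

# Toy stand-ins (made-up values); the real frames would be
# self_build_job_titles and the dataframe imported from the job_titles csv
from_txt = pd.DataFrame({'job_title': ['ACCOUNTANT', 'SYSTEMS ANALYST', 'SYSTEMS ANALYST']})
from_csv = pd.DataFrame({'job_title': ['ACCOUNTANT', 'ZOO CURATOR']})

# Outer join on the title; indicator=True adds a `_merge` column that says
# whether a row came from the left frame, the right frame, or both
comparison = from_txt.drop_duplicates().merge(
    from_csv.drop_duplicates(), on='job_title', how='outer', indicator=True
)

# Titles that appear on one side only
mismatches = comparison[comparison['_merge'] != 'both']

# Titles that occur more than once among the txt files
duplicated_titles = from_txt[from_txt.duplicated('job_title', keep=False)]
```

Titles present on one side only would explain part of the 683 versus 668 gap, and repeated titles among the txt files could explain the rest.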