# Job-Hunt NLP Demo - Part 1

This demo will also be useful for some quick NLP work: seeing how my résumé's word distribution matches that of the job descriptions.

There's a wonderful project out there, [MyBinder](https://mybinder.org), which lets you run a Jupyter notebook interactively and completely online. It's nice to have when you'd like to play with code and get a better look at the outputs that come from running it. I've had some problems with images going down, but I'm going to work to keep this one up.

The link to the online, interactive notebook - the binder - is at the badge you see right here:

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/bballdave025/job-app-word-freq/main?labpath=Part_01_NLPPresentationJobHunt_DemoWordFreq.ipynb)

Note that you're seeing the link to the Part 1 Jupyter notebook.

<hr/>

## We are calling this version 0.1.003

This version covers the FamilySearch CJKV jobs I applied for in August 2023, but we're splitting it into smaller notebooks. Hopefully, MyBinder can load each one more quickly. We'll see how things work with pickling variables between the parts.

<hr/>

## What we are doing in Part 1

First of all, let's give you a MyBinder badge link which specifies the version and the part.

[![Binder](./badge_logo_dwb_v_0-1-003_part_1.png)](https://mybinder.org/v2/gh/bballdave025/job-app-word-freq/main?labpath=Part_01_NLPPresentationJobHunt_DemoWordFreq.ipynb)

@TODO : write some doohickeys about what we're doing in Part 1.

<hr/>

## Link to setup from the Conda Prompt

The instructions for setting up the conda environment from Windows are in the setup-only binder linked below. I've been figuring out how to make the [MyBinder](https://mybinder.org) server consistent; with the full notebook, you sometimes needed several tries before it came up. I decided to leave ONLY the setup stuff in its own notebook, so you can look at it there. Since I did that, the server seems a lot more consistent.

Here is the binder with just the `conda`/`pip` setup parts

[![Binder](./badge_logo_dwb_v_0-1-001_merged_small.png)](https://mybinder.org/v2/gh/bballdave025/job-app-word-freq/original-timed-freq?labpath=CondaSetup_v01_NLP_Presentation_Job_Hunt_NLP.ipynb)

If you want to look at my full first try, when I tried to keep myself under a time limit in doing the NLP Presentation, you can look at [this MyBinder](https://mybinder.org/v2/gh/bballdave025/job-app-word-freq/original-timed-freq?labpath=A_v01_NLP_Presentation_Job_Hunt_NLP_Useful_Demo_Word_Freq.ipynb).

## The Third Iteration

### Splitting up the long notebook

These are comparisons between my résumé for some FamilySearch jobs and the job description. For this version, I'm splitting the work into several shorter notebooks that should be more easily handled by MyBinder.

My friend at the [FamilySearch Library](https://www.familysearch.org/en/library/) let me know about a few job openings. These are all with a group - of which he and I are part - of missionaries and volunteers who have been working on [CJKV (Chinese, Japanese, Korean, Vietnamese)-character](https://en.wikipedia.org/wiki/CJKV_characters) handwriting and block-print recognition. I already put in the applications with résumés, but all the résumés are pretty similar. I'm going to see how the different job descriptions compare to the résumé as regards the word-frequency distribution.

## Texts

The text of my résumé for these jobs is in the local file,

```
res_CJKV.txt
```

The job descriptions are in local files as well, specifically,

```
desc_CJKV_dev3.txt
desc_CJKV_dev4.txt
desc_CJKV_dev5.txt
desc_CJKV_devInTest3.txt
```

In [None]:
application_text_filenames = \
  ["res_CJKV.txt",
  ]

In [None]:
job_description_text_filenames = \
  ["desc_CJKV_dev5.txt",
   "desc_CJKV_dev4.txt",
   "desc_CJKV_dev3.txt",
   "desc_CJKV_devInTest3.txt",
  ]

# The "dev5" is the nicest job - and it's with Java, which I know best.

`######################################################`

The job-description page appears to be built with something like `JavaScript`, `ajax`, etc.

Rather than writing in a webscraper or looking through the code and finding what gets pulled from the database, I'm just going to copy/paste the text into the text files.

In [None]:
##  Code to get current timestamp, if needed.
##+ Meant to be run once, then commented out.
# #######################
# # No need to run again
# #####
# !powershell -c (Get-Date -UFormat "%s_%Y%m%dT%H%M%S%Z00") -replace '[.][0-9]*_', '_'

In [None]:
local_job_desc_filenames = job_description_text_filenames
local_job_appl_filenames = application_text_filenames

import pprint

pprint.pprint(local_job_desc_filenames)
print()
pprint.pprint(local_job_appl_filenames)

Output was

```
['desc_CJKV_dev5.txt',
 'desc_CJKV_dev4.txt',
 'desc_CJKV_dev3.txt',
 'desc_CJKV_devInTest3.txt']

['res_CJKV.txt']
```

at `1691423942_20230807T155902-0600`
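
The commented-out PowerShell line only works on Windows. As a rough, portable alternative, the sketch below builds a stamp with the same layout using only the Python standard library. The function name `make_timestamp` is my own, not something from the notebook, and the epoch part here is UTC-based epoch seconds, which may differ from what PowerShell's `-UFormat "%s"` reports.

```python
from datetime import datetime


def make_timestamp(moment=None):
    """Return '<epoch seconds>_<YYYYmmddTHHMMSS><UTC offset>' for a moment."""
    if moment is None:
        # Use the current local time, with the local UTC offset attached.
        moment = datetime.now().astimezone()
    epoch_seconds = int(moment.timestamp())
    return f"{epoch_seconds}_{moment.strftime('%Y%m%dT%H%M%S%z')}"


print(make_timestamp())
```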

In [None]:
# read in texts to original strings

def read_in_texts(local_desc_fnames, local_appl_fnames,
                  do_combine_desc_files = False,
                  do_combine_appl_files = True
                 ):
    for_pairwise_desc_texts = []
    for_pairwise_appl_texts = []
    
    complete_description_text = ""
    complete_application_text = ""
    
    if do_combine_desc_files:
        complete_description_text = " "
    ##endof:  if do_combine-desc_files
    
    if do_combine_appl_files:
        complete_application_text = " "
    ##endof:  if do_combine_appl_files
    
    # Use the function's parameters, not the notebook-level globals
    for this_description_filename in local_desc_fnames:
        with open(this_description_filename, 'r', encoding='utf-8') as dfh:
            this_desc_file_content_str = dfh.read()
            if do_combine_desc_files:
                complete_description_text += " " + this_desc_file_content_str
            else:
                this_desc_in_array_str = " " + this_desc_file_content_str + " "
                for_pairwise_desc_texts.append(this_desc_in_array_str)
            ##endof:  if/else do_combine_desc_files
        ##endof:  with open ... dfh
    ##endof:  for this_description_filename in local_desc_fnames
    
    for this_application_filename in local_appl_fnames:
        with open(this_application_filename, 'r', encoding='utf-8') as afh:
            this_appl_file_content_str = afh.read()
            if do_combine_appl_files:
                complete_application_text += " " + this_appl_file_content_str
            else:
                this_appl_in_array_str = " " + this_appl_file_content_str + " "
                for_pairwise_appl_texts.append(this_appl_in_array_str)
            ##endof:  if/else do_combine_appl_files
        ##endof:  with open ... afh
    ##endof:  for this_application_filename in local_appl_fnames
    
    complete_description_text += " "
    complete_application_text += " "
    
    if do_combine_desc_files:
        for_pairwise_desc_texts = [complete_description_text]
    ##endof:  if do_combine-desc_files
    
    if do_combine_appl_files:
        for_pairwise_appl_texts = [complete_application_text]
    ##endof:  if do_combine_appl_files
    
    return for_pairwise_desc_texts, for_pairwise_appl_texts
    
##endof:  read_in_texts(<params>)
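
As a usage sketch, the snippet below shows the two shapes the return values can take: one string per file (pairwise) or a single combined string. It's a trimmed-down stand-in, not the function above; `read_texts` and the temporary file names are made up for the example.

```python
import tempfile
from pathlib import Path


def read_texts(filenames, do_combine=False):
    """Read UTF-8 files; return one padded string per file, or one combined."""
    texts = [" " + Path(name).read_text(encoding="utf-8") + " "
             for name in filenames]
    if do_combine:
        return [" ".join(texts)]
    return texts


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-ins for the description files
    first = Path(tmp_dir) / "desc_a.txt"
    second = Path(tmp_dir) / "desc_b.txt"
    first.write_text("Java developer", encoding="utf-8")
    second.write_text("Test engineer", encoding="utf-8")

    pairwise = read_texts([first, second])
    combined = read_texts([first, second], do_combine=True)

print(len(pairwise))   # one entry per file
print(len(combined))   # a single combined entry
```

Either way you get a list back, so the later cells can loop over the result without caring which mode was used.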

#### This next, make_it_one_line_single_spaced function will be very useful as we go forward

In [None]:
import string
import re

def make_it_one_line_single_spaced(input_str):
    processing_str = input_str
    
    processing_str = ' '.join(processing_str.split())
    processing_str = processing_str.replace("\t", " ")
    processing_str = processing_str.replace("\n", " ")
    processing_str = re.sub(r"(^|[^ ])[ ][ ]+($|[^ ])",
                            r"\g<1> \g<2>",
                            processing_str,
                            flags=re.IGNORECASE
                           )
    
    return processing_str
##endof:  make_it_one_line_single_spaced(input_str)
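
A quick check of what the function is meant to do: `str.split()` with no arguments splits on any run of whitespace (spaces, tabs, newlines), so joining the pieces with single spaces already gives a one-line, single-spaced string. The helper name `collapse_whitespace` is just for this sketch.

```python
def collapse_whitespace(input_str):
    """Collapse every run of whitespace to one space and trim the ends."""
    return " ".join(input_str.split())


sample = "  Senior\tSoftware \n\n Engineer   (Java) "
print(repr(collapse_whitespace(sample)))
```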

### The actual reading in of the texts

In [None]:
complete_description_text, complete_application_text = \
                    read_in_texts(local_job_desc_filenames,
                                  local_job_appl_filenames)

### Code for cleaning text

We will iterate a bit, so as not to have to write a text normalizer for the whole world. Rather than putting together regexes to test for things like which contractions are there and which other things might need changing (especially things like dashes), I'm doing simple regexes. Q&R

In [None]:
import re
import string

#from bs4 import BeautifulSoup
#from bs4 import UnicodeDammit

def clean_text_string_quickly(input_str):
    processing_str = input_str
    
    # ## one line, single-spaced
    # processing_str = ' '.join(processing_str.split())
    # processing_str = processing_str.replace("\t", " ")
    # processing_str = processing_str.replace("\n", " ")
    # processing_str = re.sub(r"(^|[^ ])[ ][ ]+($|[^ ])",
    #                         r"\g<1> \g<2>",
    #                         processing_str,
    #                         flags=re.IGNORECASE
    #                       )
    
    ## one line, single-spaced
    processing_str = make_it_one_line_single_spaced(processing_str)
    
    
    ## get rid of outside-ascii (or control character)
    processing_str = re.sub(r"[^\u0020-\u007E]",
                            " ",
                            processing_str,
                            flags=re.IGNORECASE
                           )
    
    ## my stuff
    processing_str = re.sub(r"[ ][|]+[ ]",
                            " ",
                            processing_str,
                            flags=re.IGNORECASE
                           )
    processing_str = processing_str.replace(r"&", "and")
    processing_str = processing_str.replace(r"U.S.", "U S ")
    
    ## get rid of punctuation
    processing_str = re.sub(r"(([^0-9 ])[.,!?:\"']([) ]|$))",
                            r"\g<2>\g<3>",
                            processing_str,
                            flags=re.IGNORECASE
                           )
    processing_str = re.sub(r"(([0-9 ])[.,!?:\"']([ ]|$))",
                            r"\g<2>\g<3>",
                            processing_str,
                            flags=re.IGNORECASE
                           )
    # parentheses
    processing_str = processing_str.replace(r"(", " ")
    processing_str = processing_str.replace(r")", " ")
    # dashes
    processing_str = re.sub(r"[ ][-]+[ ]",
                            " ",
                            processing_str,
                            flags=re.IGNORECASE
                           )
    
    ##  lowercase - to skip until a few iterations through
    ##+ cleaning the text
    processing_str = processing_str.casefold()
    
    ## fixes found by iterating this cleaning function
    processing_str = re.sub(r"[ ][/][ ]",
                            " ",
                            processing_str,
                            flags=re.IGNORECASE
                           )
    
    ## What's found in the documents
    # My inspection
    processing_str = processing_str.replace(r"s.r", "s r")
    processing_str = processing_str.replace(r"c++/perl", "c++ perl")
    
    # From the automated looking, below, here for version 0.1.002
    processing_str = processing_str.replace(
                                 r"monitors/equipment", 
                                  "monitors equipment"
    )
    processing_str = processing_str.replace(
                                 r"product/engineering", 
                                  "product engineering"
    )
    processing_str = processing_str.replace(
                                 r"engineering/troubleshooting", 
                                  "engineering troubleshooting"
    )
    processing_str = processing_str.replace(
                                 r"engineering/programming", 
                                  "engineering programming"
    )
    processing_str = processing_str.replace(
                         r"analytical/diagnostic/troubleshooting", 
                          "analytical diagnostic troubleshooting"
    )
    processing_str = processing_str.replace(
                                 r"integration/continuous", 
                                  "integration continuous"
    )
    
    processing_str = processing_str.replace(r"net/powershell", 
                                                "net powershell")
    processing_str = processing_str.replace(r"c/c", "c c")
    
    processing_str = re.sub(r"\b\w{1}\b", "", processing_str)
    
    # KEEP THESE 3 EXAMPLES IN THE CODE FOR COPY/PASTE, WHATEVER
    # processing_str = processing_str.replace(r"notice/more", 
    #                                               "notice more")
    # processing_str = processing_str.replace(r"s.r", "s r")
    # processing_str = processing_str.replace(
    #                              r"monitors/equipment", 
    #                               "monitors equipment"
    # )
    
    # ##spacing fix at the end
    # processing_str = re.sub(r"(^|[^ ])[ ][ ]+($|[^ ])",
    #                         r"\g<1> \g<2>",
    #                         processing_str,
    #                         flags=re.IGNORECASE
    #                        )
    
    ## spacing fix at the end
    processing_str = make_it_one_line_single_spaced(processing_str)
    
    ## Let's give it back
    return processing_str

##endof:  clean_text_string_quickly(input_str)
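
The slash fixes in the function are all the same move: replace a slash-joined term with its space-separated form. One way that might keep the list easier to extend is to drive the replacements from data. This is a sketch of that design, not the notebook's code; `SLASH_JOINED_TERMS` and `split_known_slash_terms` are names I've invented here.

```python
# Terms taken from the slash-hunting output in this notebook
SLASH_JOINED_TERMS = [
    "analytical/diagnostic/troubleshooting",
    "engineering/troubleshooting",
    "engineering/programming",
    "integration/continuous",
    "monitors/equipment",
    "product/engineering",
    "net/powershell",
    "c++/perl",
]


def split_known_slash_terms(input_str, terms=SLASH_JOINED_TERMS):
    """Replace each known slash-joined term with a space-separated version."""
    processing_str = input_str
    for term in terms:
        processing_str = processing_str.replace(term, term.replace("/", " "))
    return processing_str


print(split_known_slash_terms("experience with monitors/equipment and c++/perl"))
```

Adding a newly found term then means adding one string to the list rather than another `replace` call.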

In [None]:
import re

def remove_stopwords(input_str):
    ##  From https://www.nltk.org/book/ch02.html
    ##+ > [Stopwords are] high-frequency words like the, to and also that we 
    ##+ > sometimes want to filter out of a document before further processing. 
    ##+ > Stopwords usually have little lexical content, and their presence in 
    ##+ > a text fails to distinguish it from other texts.
    
    processing_str = input_str
    
    stopwords_to_remove = [
'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
'than', 'too', 'very', 'can', 'will', 'just', 'should', 'now'
]
    
    # No attempt to optimize code, here. Q&R
    for my_stopword in stopwords_to_remove:
        #processing_str = processing_str.replace(my_stopword, " ")
        word_with_boundaries = r"\b" + my_stopword + r"\b"
        processing_str = re.sub(word_with_boundaries, " ", 
                                processing_str, 
                                flags=re.IGNORECASE)
    ##endof:  for my_stopword in stopwords_to_remove
    
    return processing_str
##endof:  remove_stopwords(input_str)
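
The loop above runs one regex substitution per stopword, which is fine for Q&R work. If it ever gets slow, one option is to fold the whole list into a single compiled pattern. Here's a sketch of that idea with a short, made-up stopword list; `SAMPLE_STOPWORDS`, `STOPWORD_PATTERN`, and `remove_stopwords_one_pass` are not names from the notebook.

```python
import re

# A short, hypothetical stopword list, just for the demonstration
SAMPLE_STOPWORDS = ["the", "and", "it", "its", "are", "on"]

STOPWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in SAMPLE_STOPWORDS) + r")\b",
    flags=re.IGNORECASE,
)


def remove_stopwords_one_pass(input_str):
    """Blank out every stopword with a single regex substitution."""
    without_stopwords = STOPWORD_PATTERN.sub(" ", input_str)
    # Collapse the leftover runs of spaces
    return " ".join(without_stopwords.split())


print(remove_stopwords_one_pass("The cat and its toys are on the mat"))
```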

In [None]:
# We'll usually keep these two.
do_look_at_description_text = True
do_look_at_application_text = True

#  this one can go (False) if you don't want the big strings
#+ i.e. you don't want the complete file contents
do_print_the_big_strings = False

In [None]:
if do_look_at_description_text:
    test1 = []
    for desc_text_str in complete_description_text:
        test1.append(clean_text_string_quickly(desc_text_str))
        if do_print_the_big_strings:
            import pprint
            pprint.pprint(test1)

In [None]:
if do_look_at_application_text:
    test2 = []
    for appl_text_str in complete_application_text:
        test2.append(clean_text_string_quickly(appl_text_str))
        if do_print_the_big_strings:
            import pprint
            pprint.pprint(test2)

In [None]:
## Without stopwords
test1 = [make_it_one_line_single_spaced(
                                remove_stopwords(their_text)
                                       ) for their_text in test1
        ]

if do_print_the_big_strings:
    import pprint
    pprint.pprint(test1)

In [None]:
test2 = [make_it_one_line_single_spaced(
                                remove_stopwords(their_text)
                                       ) for their_text in test2
        ]

if do_print_the_big_strings:
    import pprint
    pprint.pprint(test2)

### For these next few cells, we are finding things to search and replace

In [None]:
# looking at contractions
if do_look_at_description_text:
    for their_text in test1:
        print()
        print(re.findall(r"\b('[\w']+\b|[\w']+'[\w']+|[\w']+')\b",
                         their_text)
             )

First run-through had

```
["organization's", "bachelor's", "master's"]

["bachelor's"]

["bachelor's"]

["bachelor's"]
```

In [None]:
if do_look_at_application_text:
    for my_text in test2:
        print()
        print(re.findall(r"\b('[\w']+\b|[\w']+'[\w']+|[\w']+')\b", 
                         my_text)
             )

First run-through had

```
["workplace's", "wife's", "nist's", "container's", "mission's"]
```
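
Every hit in these first run-throughs is a possessive ending in `'s`. If you wanted those to count together with the base word (so that `bachelor's` adds to `bachelor`), one option is to strip the ending before counting. This is only a sketch of that option; the notebook's cleaning function doesn't do it, and `strip_possessives` is a made-up name.

```python
import re


def strip_possessives(input_str):
    """Drop a trailing 's from words, e.g. "bachelor's" -> "bachelor"."""
    return re.sub(r"\b(\w+)'s\b", r"\1", input_str)


print(strip_possessives("bachelor's or master's degree"))
```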

In [None]:
# looking at all slashes
if do_look_at_description_text:
    for their_text in test1:
        print()
        print(re.findall(r"\b[\w/]+/[\w/]+\b", 
                         their_text)
             )

First run-through had

```
['monitors/equipment']

['product/engineering', 'engineering/troubleshooting', 'monitors/equipment']

['monitors/equipment']

['engineering/programming', 'analytical/diagnostic/troubleshooting', 'monitors/equipment', 'integration/continuous']
```

In [None]:
if do_look_at_application_text:
    for my_text in test2:
        print()
        print(re.findall(r"\b[\w./]+/[\w/]+\b",
                         my_text)
             )

First run-through had

```
['github.com/bballdave025', 'stackexchange.com/users/8693193', 'net/powershell', 'c/c']
```

### Let's clean things up

<b>We'll give you the chance to look at the originals, if you want.</b>

<b>Only do the two cells below if you want a big preview! What I'm saying is, "The two cells below will give you long outputs if uncommented."</b>

In [None]:
#complete_description_text

In [None]:
#complete_application_text

In [None]:
def get_sorted_word_counts(*cleaned_strings):
                           #,
                           #do_output_sorted_file=False,
                           #sorted_filename=\
                           #    "sorted_words_from_strings.txt"):
    '''
    @return  OrderedDict
    '''
    
    import sys  # needed for the sys.stderr messages below
    from collections import OrderedDict 
      # Do I need imports inside the function for pickle(?)
    
    EXIT_NOWORDSWEREFOUND = -1
    
    work_with_str = combine_strings(cleaned_strings)
    
    list_of_words_in_str = work_with_str.split()
    
    if len(list_of_words_in_str) <= 0:
        print("No words were found.", file=sys.stderr)
        print("The program will exit.", file=sys.stderr)
        #sys.exit(EXIT_NOWORDSWEREFOUND)
        return EXIT_NOWORDSWEREFOUND
    ##endof:  if len(list_of_words_in_str) <= 0
    
    word_count_ordered_dict = OrderedDict()
    
    for this_word in list_of_words_in_str:
        if this_word in word_count_ordered_dict:
            word_count_ordered_dict[this_word] += 1
        else:
            word_count_ordered_dict[this_word] = 1
        ##endof:  if/else this_word in list_of_words_in_str
    ##endof:  for this_word in list_of_words_in_str
    
    ## DWB note ##
    ##  At this point, the OrderedDict is sorted by the
    ##+ order in which keys were inserted, not by their
    ##+ count.
    
    for key, _ in \
          sorted(word_count_ordered_dict.items(),
                 key=lambda word_and_count: word_and_count[1],
                 reverse=True):
        word_count_ordered_dict.move_to_end(key)
    ##endof:  for myword, _ ...
    
    return word_count_ordered_dict
    
##endof:  get_sorted_word_counts(*cleaned_strings)

def combine_strings(tuple_of_strings):
                    #, 
                    #do_output_raw_file=False,
                    #raw_filename='raw_words_from_strings.txt'):
    '''
    @return  string
    '''
    
    returned_str = " "
    
    for this_str in tuple_of_strings:
        returned_str += this_str + " "
    ##endof:  for this_str in tuple_of_string
    
    ## one line, single-spaced
    returned_str = ' '.join(returned_str.split())
    returned_str = returned_str.replace("\t", " ")
    returned_str = returned_str.replace("\n", " ")
    returned_str = re.sub(r"([^ ])[ ][ ]+($|[^ ])",
                            r"\g<1> \g<2>",
                            returned_str,
                            flags=re.IGNORECASE
                           )
    
    return returned_str
##endof:  combine_strings(tuple_of_strings)

## Some future maybes
# @TODO: add a sort-by-word as well as sort-by-count flag
# @TODO:  also, print out the pre-sorted and sorted files
#       + with word lists, frequency, and in-order-of-
#       + highest-count stuff
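
Python's standard library has `collections.Counter`, which does the tallying and the sort-by-count in one go. The sketch below is an alternative to the function above, not a drop-in replacement (it gives back a `Counter` and lists of pairs rather than an `OrderedDict`); it also shows the sort-by-word option from the @TODO. `count_words` is a name I made up.

```python
from collections import Counter


def count_words(cleaned_str):
    """Count whitespace-separated words in an already-cleaned string."""
    return Counter(cleaned_str.split())


counts = count_words("sql java python java sql java")

by_count = counts.most_common()   # highest count first
by_word = sorted(counts.items())  # alphabetical by word

print(by_count)
print(by_word)
```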

In [None]:
## Cleaning and Counting ##

description_strings_pre = \
  [clean_text_string_quickly(this_desc_text_str) 
           for this_desc_text_str in complete_description_text]
application_strings_pre = \
  [clean_text_string_quickly(this_appl_text_str)
           for this_appl_text_str in complete_application_text]

description_strings = \
  [make_it_one_line_single_spaced(remove_stopwords(this_desc)) 
           for this_desc in description_strings_pre]
application_strings = \
  [make_it_one_line_single_spaced(remove_stopwords(this_appl)) 
           for this_appl in application_strings_pre]

description_word_counts = \
  [get_sorted_word_counts(description_str) \
              for description_str in description_strings]
application_word_counts = \
  [get_sorted_word_counts(application_str) \
              for application_str in application_strings]

<b>Once again, the four cells below will give you long outputs if uncommented.</b>

In [None]:
#description_strings

In [None]:
#description_word_counts

In [None]:
#application_strings

In [None]:
#application_word_counts

## We are ready for Formatted Word-Frequency Counts in Part 2

### First, though, we'll pickle the things we need.

And the link for the Part 2 MyBinder will be included after the pickling.

In [None]:
import pickle

pickle_filename_1_to_2 = "important_part_1_vars.pkl"

things_to_pickle_1 = [
    local_job_desc_filenames,
    local_job_appl_filenames,
    complete_description_text, 
    complete_application_text,
    description_strings,
    description_word_counts,
    application_strings,
    application_word_counts,
]

with open(pickle_filename_1_to_2, 'wb') as pfh:
    pickle.dump(things_to_pickle_1, pfh)
##endof:  with open ... as pfh # (pickle file handle)
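
Since the variables go into the pickle as one list, whatever reads the file back needs to unpack them in the same order. Here's a small round-trip sketch with stand-in values; what Part 2 actually does is an assumption on my part, and the file name and variable names below are just for illustration. (Only unpickle files you created yourself or otherwise trust.)

```python
import os
import pickle
import tempfile

# Stand-in values; the real list holds the eight Part 1 variables
things_to_pickle = [
    ["desc_example.txt"],          # filenames
    [" some description text "],   # raw texts
    [{"java": 3, "sql": 1}],       # word counts
]

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "example_vars.pkl")

    with open(pickle_path, "wb") as pfh:
        pickle.dump(things_to_pickle, pfh)

    with open(pickle_path, "rb") as pfh:
        # Unpack in the same order the list was built
        filenames, raw_texts, word_counts = pickle.load(pfh)

print(filenames)
print(word_counts)
```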

[Part 2 On GitHub]()

[Part 2 On MyBinder](https://mybinder.org/v2/gh/bballdave025/job-app-word-freq/main?labpath=Part_02_NLPPresentationJobHunt_DemoWordFreq.ipynb)

Or, alternatively<strike>/eventually</strike>, use the badge as a link for the MyBinder version.

[![Binder](./badge_logo_dwb_v_0-1-003_part_2.png)](https://mybinder.org/v2/gh/bballdave025/job-app-word-freq/main?labpath=Part_02_NLPPresentationJobHunt_DemoWordFreq.ipynb)