# Job-Hunt NLP Demo

This demo will also be useful for some quick NLP work: seeing how my résumé's word distribution matches that of the job descriptions.

There's a wonderful project out there, [MyBinder](https://mybinder.org), which lets you run a Jupyter notebook interactively and completely online. It's nice to have when you'd like to play with code and get a better look at the outputs that come from running it. I've had some problems with images going down, but I'm going to work to keep this one up.

The link to the online, interactive notebook - the binder - will be at the badge you see right here

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/bballdave025/ancestry-freq/main?labpath=Ancestry_NLP_Useful_and_Demo.ipynb)

when I figure out how to get the Binder to stay.

<strike>Go ahead and give it a try!</strike> \[Soon!\]

<hr/>

## Link to setup from the Conda Prompt

The instructions for setting up the conda environment from Windows are in [my first version, here on GitHub](https://github.com/bballdave025/job-app-word-freq/blob/main/A_v01_NLP_Presentation_Job_Hunt_NLP_Useful_Demo_Word_Freq.ipynb). Soon, I will figure out how to make the [MyBinder](https://mybinder.org) server <strike>[MyBinder server]()</strike> for that first version persistent, and you can look at all the setup stuff there.

## The Second Iteration

### Several FamilySearch Résumés

My friend at the [FamilySearch Library](https://www.familysearch.org/en/library/) let me know about a few job openings. These are all with a group - of which he and I are part - of missionaries and volunteers who have been working on [CJKV (Chinese, Japanese, Korean, Vietnamese)-character](https://en.wikipedia.org/wiki/CJKV_characters) handwriting and block-print recognition. I already put in the applications with résumés, but all the résumés are pretty simple. I'm going to see how the different job descriptions compare to the résumés as regards the word-frequency distribution.

I'm going to add some improvements to my first, time-limited version. These include better-presented output of the word and frequency arrays. I would also like to add something that removes small words that serve a more grammatical function; in the [NLTK book](https://www.nltk.org/book/) (Officially: BIRD, Steven; KLEIN, Ewan; and LOPER, Edward, Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit, retrieved 2023-08-07, https://web.archive.org/web/20230721043038/https://www.nltk.org/book/), these are called stopwords (cf. Chapter 2). Perhaps I'll put in some n-gram comparison, but what I'd really like to do are space-separated bigrams - now I've looked it up, and the official term is #####.
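
As a rough idea of what counting space-separated word pairs could look like, here is a minimal sketch. The function name `count_word_bigrams` and the sample sentence are made up for illustration; nothing here comes from the real résumé or job-description files, and it assumes the text has already been cleaned into whitespace-separated words.

```python
from collections import Counter

def count_word_bigrams(cleaned_text):
    """Count adjacent word pairs in whitespace-separated text."""
    words = cleaned_text.split()
    # zip(words, words[1:]) pairs each word with the word that follows it
    return Counter(zip(words, words[1:]))

# Hypothetical sample text, not taken from the real files
sample_text = "machine learning engineer with machine learning experience"
bigram_counts = count_word_bigrams(sample_text)

# Show the most frequent pair and its count
print(bigram_counts.most_common(1))
```

Because the pairs are plain tuples in a `Counter`, they could be ranked and compared in much the same way as single-word counts.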

## Texts

The text of my résumé for these jobs is in the local file,

```
res_CJKV.txt
```

The job descriptions for the jobs are in local files as well, specifically,

```
desc_CJKV_dev3.txt
desc_CJKV_dev4.txt
desc_CJKV_dev5.txt
desc_CJKV_devInTest3.txt
```

In [None]:
application_text_filenames = \
  ["res_CJKV.txt",
  ]

In [None]:
job_description_text_filenames = \
  ["desc_CJKV_dev5.txt",
   "desc_CJKV_dev4.txt",
   "desc_CJKV_dev3.txt",
   "desc_CJKV_devInTest3.txt",
  ]

# The "dev5" is the nicest job - and it's with Java, which I know best.

`######################################################`

The job description page looks to contain something like `JavaScript`, `ajax`, etc.

Rather than writing in a webscraper or looking through the code and finding what gets pulled from the database, I'm just going to copy/paste the text into the text files.

In [None]:
##  Code to get current timestamp, if needed.
##+ Meant to be run once, then commented out.
#######################
# No need to run again
#####
!powershell -c (Get-Date -UFormat "%s_%Y%m%dT%H%M%S%Z00") -replace '[.][0-9]*_', '_'

Output was

```
[The]
[Lists]
```

at `Full-timestamp`

In [None]:
local_job_desc_filenames = job_description_text_filenames
local_job_app_filenames  = application_text_filenames

import pprint

pprint.pprint(local_job_desc_filenames)
pprint.pprint(local_job_app_filenames)

In [None]:
# read in texts to original strings

complete_description_text = " "
complete_application_text = " "

for this_description_filename in local_job_desc_filenames:
    with open(this_description_filename, 'r', encoding='utf-8') as dfh:
        complete_description_text += dfh.read()
    ##endof:  with open ... dfh
##endof:  for this_description_filename in local_job_desc_filenames

for this_application_filename in local_job_app_filenames:
    with open(this_application_filename, 'r', encoding='utf-8') as afh:
        complete_application_text += afh.read()
    ##endof:  with open ... afh
##endof:  for this_application_filename in local_job_app_filenames

complete_description_text += " "
complete_application_text += " "

### Code for cleaning text

We will iterate a bit, so as not to have to write a text normalizer for the whole world. Rather than putting together regexes that test for which contractions are present and which other things might need changing (especially things like dashes), I'm using simple regexes. Q&R

In [None]:
import re
import string

#from bs4 import BeautifulSoup
#from bs4 import UnicodeDammit

def clean_text_string_quickly(input_str):
    processing_str = input_str
    
    ## one line, single-spaced
    processing_str = ' '.join(processing_str.split())
    processing_str = processing_str.replace("\t", " ")
    processing_str = processing_str.replace("\n", " ")
    processing_str = re.sub(r"([^ ])[ ][ ]+($|[^ ])",
                            r"\g<1> \g<2>",
                            processing_str,
                            flags=re.IGNORECASE
                           )
    
    ## get rid of outside-ascii (or control character)
    processing_str = re.sub(r"[^\u0020-\u007E]",
                            " ",
                            processing_str,
                            flags=re.IGNORECASE
                           )
    
    ## my stuff
    processing_str = re.sub(r"[ ][|]+[ ]",
                            " ",
                            processing_str,
                            flags=re.IGNORECASE
                           )
    processing_str = processing_str.replace(r"&", "and")
    processing_str = processing_str.replace(r"U.S.", "U S ")
    
    ## get rid of punctuation
    processing_str = re.sub(r"(([^0-9 ])[.,!?:\"']([) ]|$))",
                            r"\g<2>\g<3>",
                            processing_str,
                            flags=re.IGNORECASE
                           )
    processing_str = re.sub(r"(([0-9 ])[.,!?:\"']([ ]|$))",
                            r"\g<2>\g<3>",
                            processing_str,
                            flags=re.IGNORECASE
                           )
    # parentheses
    processing_str = processing_str.replace(r"(", " ")
    processing_str = processing_str.replace(r")", " ")
    # dashes
    processing_str = re.sub(r"[ ][-]+[ ]",
                            " ",
                            processing_str,
                            flags=re.IGNORECASE
                           )
    
    ##  lowercase - to skip until a few iterations through
    ##+ cleaning the text
    processing_str = processing_str.casefold()
    
    ## fixes found by iterating this cleaning function
    processing_str = re.sub(r"[ ][/][ ]",
                            " ",
                            processing_str,
                            flags=re.IGNORECASE
                           )
 
    ##  From https://www.nltk.org/book/ch02.html
    ##+ > [Stopwords are] high-frequency words like the, to and also that we 
    ##+ > sometimes want to filter out of a document before further processing. 
    ##+ > Stopwords usually have little lexical content, and their presence in 
    ##+ > a text fails to distinguish it from other texts.

    stopwords_to_remove = [
'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
'than', 'too', 'very', 'can', 'will', 'just', 'should', 'now'
]
    
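    # No attempt to optimize code, here. Q&R
    # Drop stopwords as whole tokens. A plain str.replace() would also
    # delete them inside longer words (e.g. the "a" in "java").
    stopword_set = set(stopwords_to_remove)
    processing_str = ' '.join(this_word
                              for this_word in processing_str.split()
                              if this_word not in stopword_set)
    ##endof:  stopword removal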
    
    processing_str = processing_str.replace(r"monitors/equipment", 
                                                  "monitors equipment")
    processing_str = processing_str.replace(r"notice/more", 
                                                  "notice more")
    
    ##spacing fix at the end
    processing_str = re.sub(r"([^ ])[ ][ ]+($|[^ ])",
                            r"\g<1> \g<2>",
                            processing_str,
                            flags=re.IGNORECASE
                           )
    
    ## Let's give it back
    return processing_str

##endof:  clean_text_string_quickly(input_str)

In [None]:
# We'll usually keep these two.
do_look_at_description_text = True
do_look_at_application_text = True

#  this one can go (False) if you don't want the big strings
#+ i.e. you don't want the complete file contents
do_print_the_big_strings = False

In [None]:
if do_look_at_description_text:
    test1 = clean_text_string_quickly(complete_description_text)
    if do_print_the_big_strings:
        print(test1)

In [None]:
if do_look_at_application_text:
    test2 = clean_text_string_quickly(complete_application_text)
    if do_print_the_big_strings:
        print(test2)

In [None]:
# looking at contractions
if do_look_at_description_text:
    print(re.findall(r"\b('[\w']+\b|[\w']+'[\w']+|[\w']+')\b", test1))

In [None]:
if do_look_at_application_text:
    print(re.findall(r"\b('[\w']+\b|[\w']+'[\w']+|[\w']+')\b", test2))

In [None]:
# looking at all slashes
if do_look_at_description_text:
    print(re.findall(r"\b[\w/]+/[\w/]+\b", test1))

In [None]:
if do_look_at_application_text:
    print(re.findall(r"\b[\w./]+/[\w/]+\b", test2))

## Word Frequency Counts

I want to use an `OrderedDict`, rather than mess with sorting the contents of a `dict`.
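
For comparison only, the standard library's `collections.Counter` can give a count-sorted view in a couple of lines. This is a sketch on a made-up word list, not the approach used in the cell that follows; `sample_words` and `sorted_counts` are names invented for the example.

```python
from collections import Counter, OrderedDict

# Made-up sample words, not from the real files
sample_words = "java python java sql java python".split()

word_counts = Counter(sample_words)

# most_common() returns (word, count) pairs, highest count first,
# so the OrderedDict ends up ordered by descending count
sorted_counts = OrderedDict(word_counts.most_common())

print(list(sorted_counts.items()))
```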

In [None]:
import sys  # needed for the sys.stdout prints below
from collections import OrderedDict

def get_sorted_word_counts(*cleaned_strings):
                           #,
                           #do_output_sorted_file=False,
                           #sorted_filename=\
                           #    "sorted_words_from_strings.txt"):
    '''
    @return  OrderedDict
    '''
    
    EXIT_NOWORDSWEREFOUND = -1
    
    work_with_str = combine_strings(cleaned_strings)
    
    list_of_words_in_str = work_with_str.split()
    
    if len(list_of_words_in_str) <= 0:
        print("No words were found.", file=sys.stdout)
        print("The program will exit.", file=sys.stdout)
        #sys.exit(EXIT_NOWORDSWEREFOUND)
        return EXIT_NOWORDSWEREFOUND
    ##endof:  if len(list_of_words_in_str) <= 0
    
    word_count_ordered_dict = OrderedDict()
    
    for this_word in list_of_words_in_str:
        if this_word in word_count_ordered_dict:
            word_count_ordered_dict[this_word] += 1
        else:
            word_count_ordered_dict[this_word] = 1
        ##endof:  if/else this_word in list_of_words_in_str
    ##endof:  for this_word in list_of_words_in_str
    
    ## DWB note ##
    ##  At this point, the OrderedDict is sorted by the
    ##+ order in which keys were inserted, not by their
    ##+ count.
    
    for key, _ in \
          sorted(word_count_ordered_dict.items(),
                 key=lambda word_and_count: word_and_count[1],
                 reverse=True):
        word_count_ordered_dict.move_to_end(key)
    ##endof:  for myword, _ ...
    
    return word_count_ordered_dict
    
##endof:  get_sorted_word_counts(*cleaned_strings)

def combine_strings(tuple_of_strings):
                    #, 
                    #do_output_raw_file=False,
                    #raw_filename='raw_words_from_strings.txt'):
    '''
    @return  string
    '''
    
    returned_str = " "
    
    for this_str in tuple_of_strings:
        returned_str += this_str + " "
    ##endof:  for this_str in tuple_of_string
    
    ## one line, single-spaced
    returned_str = ' '.join(returned_str.split())
    returned_str = returned_str.replace("\t", " ")
    returned_str = returned_str.replace("\n", " ")
    returned_str = re.sub(r"([^ ])[ ][ ]+($|[^ ])",
                            r"\g<1> \g<2>",
                            returned_str,
                            flags=re.IGNORECASE
                           )
    
    return returned_str
##endof:  combine_strings(tuple_of_strings)

# @TODO: add a sort-by-word as well as sort-by-count flag
# @TODO:  also, print out the pre-sorted and sorted files
#       + with word lists, frequency, and in-order-of-
#       + highest-count stuff

In [None]:
## Cleaning and Counting ##

description_str = clean_text_string_quickly(complete_description_text)
application_str = clean_text_string_quickly(complete_application_text)

description_word_counts = \
           get_sorted_word_counts(description_str)
application_word_counts = \
           get_sorted_word_counts(application_str)

The next code will take care of the issue I described thusly:
> It's annoying me not to have a nice, aligned output for these 2d lists - basically, they're tables. I need to bring in some previous code that takes care of getting stuff printed nice. That will be after the Q&R.
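
The core idea behind aligned output can be sketched on its own: find the widest cell in each column, then pad every cell to that width. The rows below are made up, and `str.ljust` is used here just to keep the sketch short.

```python
# Made-up rows: a header plus two data rows
rows = [["word", "count"], ["java", 12], ["handwriting", 3]]

# Convert every cell to a string so len() works on all of them
str_rows = [[str(cell) for cell in row] for row in rows]

# zip(*str_rows) transposes the table, giving one tuple per column
col_widths = [max(len(cell) for cell in col) for col in zip(*str_rows)]

# Pad each cell to its column width and join with a delimiter
aligned_lines = [
    ",  ".join(cell.ljust(width) for cell, width in zip(row, col_widths))
    for row in str_rows
]

print("\n".join(aligned_lines))
```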

In [None]:
import string

def print_2d_list_columns_aligned(list_2d_to_print,
                                  joining_delimiter = ",  ",
                                  do_output_as_str_and_print = False,
                                  do_see_the_guts=False):
    ##  The  do_see_the_guts boolean is partly for debugging,
    ##+ partly for remembering and teaching how the process
    ##+ works.
    
    ## Make all elements strings - so we can use len()
    list_2d_all_strings = \
      [[str(item) for item in row] for row in list_2d_to_print]
    
    if do_see_the_guts:
        print()
        print("  list_2d_all_strings:")
        print(list_2d_all_strings)
        print()
    ##endof:  if do_see_the_guts
    
    #  We want to find the max string length for each column
    #+ We can basically transpose the 2d_list to get the
    #+ content of each column
    list_of_column_elems_as_tuples = \
                 [column for column in zip(*list_2d_all_strings)]
    
    if do_see_the_guts:
        print()
        print("  list_of_column_elems_as_tuples:")
        print(list_of_column_elems_as_tuples)
        print()
    ##endof:  if do_see_the_guts
    
    ## find the max string length for each tuple (each column)
    list_of_max_str_len_by_column = \
      [max([len(strng) for strng in tpl]) 
        for tpl in list_of_column_elems_as_tuples]
   
    # -v- Commented code
    # -v- gives array with elements being each longest string
    #[max([strng for strng in tpl], key=len) for tpl in list_of_column_elems_as_tuples]
    # -v- 2d array with strings
    #[[strng for strng in tpl] for tpl in list_of_column_elems_as_tuples]
    
    if do_see_the_guts:
        print()
        print("  list_of_max_str_len_by_column:")
        print(list_of_max_str_len_by_column)
        print()
    ##endof:  if do_see_the_guts
    
    # Create a formatter for each row
    fmt_str = \
      joining_delimiter.join('{{:{}}}'.format(max_len) 
                               for max_len in list_of_max_str_len_by_column)
    fmt_str = "[" + fmt_str + "]"
    
    if do_see_the_guts:
        print()
        print("  fmt_str:")
        print(fmt_str)
        print()
    ##endof:  if do_see_the_guts
    
    # Get a string for each row, formatted correctly
    list_of_formatted_row_strings = \
      [fmt_str.format(*row) for row in list_2d_all_strings]
    
    if do_see_the_guts:
        print()
        print("  list_of_formatted_row_strings:")
        print(list_of_formatted_row_strings)
        print()
    ##endof:  if do_see_the_guts
    
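    # Print the aligned rows, and also return them if requested.
    print("\n".join(list_of_formatted_row_strings))
    if do_output_as_str_and_print:
        return list_of_formatted_row_strings
    ##endof:  if do_output_as_str_and_print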
    
##endof:  print_2d_list_columns_aligned(<params>)

#print_2d_list_columns_aligned(matrix, do_see_the_guts=True)

In [None]:
## Only change this boolean if you want to see a lot of output. ##
do_print_long_full_version = False

#import pprint

if do_print_long_full_version:
    dashes="------------------------------------------------------------"
    short_dashes="-----"
    
    print()
    print()
    print(dashes)
    print(" For the job description:")
    print(short_dashes)
    #pprint.pprint(description_word_counts)
    # pass (word, count) rows, not the OrderedDict itself
    print_2d_list_columns_aligned(list(description_word_counts.items()))
    print(dashes)
    print()
    print()
    print(dashes)
    print(" For the job application materials (résumé, cover letter, etc.):")
    print(short_dashes)
    #pprint.pprint(application_word_counts)
    # pass (word, count) rows, not the OrderedDict itself
    print_2d_list_columns_aligned(list(application_word_counts.items()))
    print(dashes)
    print()
    print()
##endof:  if do_print_long_full_version

In [None]:
try_simpler = False

##  Start each table with its header row
table_version_desc = [["desc_word", "desc_cnt", "desc_rank"],]
table_version_appl = [["appl_word", "appl_cnt", "appl_rank"],]
table_version_both = [["rank", "desc_word", "desc_cnt", 
                       "appl_word", "appl_cnt"],
                     ]

description_items = list(description_word_counts.items())
application_items = list(application_word_counts.items())

n_words_description = len(description_items)
n_words_application = len(application_items)

print()
print("NOTE THAT THESE ARE THE NUMBERS OF DISTINCT WORDS")
print(f"n_words_description = {str(n_words_description)}")
print(f"n_words_application = {str(n_words_application)}")
print()
print()

if try_simpler:
    import pprint
    print()
    print()
    print("DESCRIPTION")
    pprint.pprint(description_items)
    print()
    print(f"n_words_description = {str(n_words_description)}")
    print()
    print()
    print("APPLICATION")
    pprint.pprint(application_items)
    print()
    print(f"n_words_application = {str(n_words_application)}")
    print()
    print()
    
    print()
    print()
    print()
    print("NOTE THAT THESE ARE THE NUMBERS OF DISTINCT WORDS")
    print(f"n_words_description = {str(n_words_description)}")
    print(f"n_words_application = {str(n_words_application)}")
    print()
##endof:  if try_simpler

In [None]:
import sys

for this_idx in range(max(len(description_items),
                          len(application_items)
                         )
                      ):
    this_rank = this_idx + 1
    
    #  Index with this_idx itself: the items lists have no header row,
    #+ so indexing with this_idx + 1 skipped the top-ranked word.
    if this_idx < len(description_items):
        this_description_word, this_description_count = \
                                         description_items[this_idx]
        this_description_rank  = this_rank
    else:
        this_description_word  = "-- N/A --"
        this_description_count = "-- N/A --"
        this_description_rank  = "-- N/A --"
    ##endof:  if/else this_idx < len(description_items)
    
    if this_idx < len(application_items):
        this_application_word, this_application_count = \
                                         application_items[this_idx]
        this_application_rank  = this_rank
    else:
        this_application_word  = "-- N/A --"
        this_application_count = "-- N/A --"
        this_application_rank  = "-- N/A --"
    ##endof:  if/else this_idx < len(application_items)
    
    if this_description_word != "-- N/A --":
        try:
            table_version_desc.append([this_description_word, 
                                       this_description_count, 
                                       this_rank]
                                     )
        except IndexError as ie:
            print("ERROR desc!", file=sys.stdout)
            print(str(ie), file=sys.stdout)
            print(f"this_idx: {str(this_idx)}", file=sys.stdout)
            print(f"this_rank: {str(this_rank)}", file=sys.stdout)
            print(f"this_description_word: {str(this_description_word)}", file=sys.stdout)
            print(f"this_description_count: {str(this_description_count)}", file=sys.stdout)
        finally:
            pass
        ##endof:  try/except/finally
        
    ##endof:  if this_description_word != "-- N/A --"
    
    if this_application_word != "-- N/A --":
        try:
            table_version_appl.append([this_application_word, 
                                       this_application_count, 
                                       this_rank
                                      ]
                                     )
        except IndexError as ie:
            print("ERROR appl!", file=sys.stdout)
            print(str(ie), file=sys.stdout)
            print(f"this_idx: {this_idx}", file=sys.stdout)
            print(f"this_rank: {str(this_rank)}", file=sys.stdout)
            print(f"this_application_word: {str(this_application_word)}", file=sys.stdout)
            print(f"this_application_count: {str(this_application_count)}", file=sys.stdout)
        finally:
            pass
        ##endof:  try/except/finally
        
    ##endof:  if this_application_word != "-- N/A --"
    
    try:
        table_version_both.append([this_rank, 
                                   this_description_word, 
                                   this_description_count,
                                   this_application_word, 
                                   this_application_count
                                  ]
                                 )
    except IndexError as ie:
        print("ERROR both!", file=sys.stdout)
        print(str(ie), file=sys.stdout)
        print(f"this_idx: {this_idx}", file=sys.stdout)
        print(f"this_rank: {str(this_rank)}", file=sys.stdout)
        print(f"this_description_word: {str(this_description_word)}", file=sys.stdout)
        print(f"this_description_count: {str(this_description_count)}", file=sys.stdout)
        print(f"this_application_word: {str(this_application_word)}", file=sys.stdout)
        print(f"this_application_count: {str(this_application_count)}", file=sys.stdout)
    finally:
        pass
    ##endof:  try/except/finally
        
##endof:  for this_idx in ..

In [None]:
###
##  Set up the display the first n_lines_to_display 
##+ of the tables, nicely

#### This next one is the one you might change
n_lines_to_display_orig = 25

n_header_lines = 1
n_lines_to_display = n_lines_to_display_orig
n_lines_to_display_desc = n_lines_to_display_orig
n_lines_to_display_appl = n_lines_to_display_orig

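do_cut_down_desc = ( n_lines_to_display_orig >
                             len(table_version_desc) - n_header_lines
                   )
if do_cut_down_desc:
    # count only the data rows; the header row is handled separately
    n_lines_to_display_desc = len(table_version_desc) - n_header_lines
##endof:  if do_cut_down_desc

do_cut_down_appl = ( n_lines_to_display_orig >
                             len(table_version_appl) - n_header_lines
                   )
if do_cut_down_appl:
    # count only the data rows; the header row is handled separately
    n_lines_to_display_appl = len(table_version_appl) - n_header_lines
##endof:  if do_cut_down_appl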

# get headers
display_table_desc = [table_version_desc[0]]
display_table_appl = [table_version_appl[0]]
display_table_both = [table_version_both[0]]

if ( len(table_version_desc) - n_header_lines < n_lines_to_display or
     len(table_version_appl) - n_header_lines < n_lines_to_display
   ):
    n_lines_to_display = min(len(table_version_desc) - n_header_lines,
                             len(table_version_appl) - n_header_lines)
##endof:  if <n_lines_conditions>


for table_idx in range(n_header_lines, 
                       n_lines_to_display_orig + n_header_lines):
    if table_idx - n_header_lines < n_lines_to_display_desc:
        try:
            display_table_desc.append(table_version_desc[table_idx])
        except IndexError as ie:
            print("ERROR display_table_desc!", file=sys.stdout)
            print(str(ie), file=sys.stdout)
            print(f"table_idx: {table_idx}", file=sys.stdout)
            print(f"n_lines_to_display_desc: {n_lines_to_display_desc}", file=sys.stdout)
            print(f"len(table_version_desc): {len(table_version_desc)}", file=sys.stdout)
        except Exception as e:
            print("DIFFERENT ERROR display_table_desc!", file=sys.stdout)
            print(str(e), file=sys.stdout)
            print(f"table_idx: {table_idx}", file=sys.stdout)
            print(f"n_lines_to_display_desc: {n_lines_to_display_desc}", file=sys.stdout)
            print(f"len(table_version_desc): {len(table_version_desc)}", file=sys.stdout)
        finally:
            pass
        ##endof:  try/catch/finally
    ##endof:  if table_idx - n_header_lines < n_lines_to_display_desc
    
    if table_idx - n_header_lines < n_lines_to_display_appl:
        try:
            display_table_appl.append(table_version_appl[table_idx])
        except IndexError as ie:
            print("ERROR display_table_appl!", file=sys.stdout)
            print(str(ie), file=sys.stdout)
            print(f"table_idx: {table_idx}", file=sys.stdout)
            print(f"n_lines_to_display_appl: {n_lines_to_display_appl}", file=sys.stdout)
            print(f"len(table_version_appl): {len(table_version_appl)}", file=sys.stdout)
        except Exception as e:
            print("DIFFERENT ERROR display_table_appl!", file=sys.stdout)
            print(str(e), file=sys.stdout)
            print(f"table_idx: {table_idx}", file=sys.stdout)
            print(f"n_lines_to_display_appl: {n_lines_to_display_appl}", file=sys.stdout)
            print(f"len(table_version_appl): {len(table_version_appl)}", file=sys.stdout)
        finally:
            pass
        ##endof:  try/catch/finally
    ##endof:  if table_idx - n_header_lines < n_lines_to_display_appl
    
    try:
        display_table_both.append(table_version_both[table_idx])
    except IndexError as ie:
        print("ERROR display_table_both!", file=sys.stdout)
        print(str(ie), file=sys.stdout)
        print(f"table_idx: {table_idx}", file=sys.stdout)
        print(f"len(table_version_both): {len(table_version_both)}", file=sys.stdout)
    finally:
        pass
    ##endof:  try/catch/finally
##endof:  for idx in range(<n_lines stuff>)

In [None]:
print(f"len(table_version_desc): {len(table_version_desc)}")
print(f"len(display_table_desc): {len(display_table_desc)}")
print()
print(f"len(table_version_appl): {len(table_version_appl)}")
print(f"len(display_table_appl): {len(display_table_appl)}")
print()
print(f"len(table_version_both): {len(table_version_both)}")
print(f"len(display_table_both): {len(display_table_both)}")

In [None]:
#import pprint

long_dashes = "-----------------------------------------------"
short_dashes = "-----"

print()
print(long_dashes)
print("JOB DESCRIPTION (TOP 25)")
print(short_dashes)
#pprint.pprint(display_table_desc)
print_2d_list_columns_aligned(display_table_desc)
print()
print(long_dashes)
print()
print()
print(long_dashes)
print("JOB APPLICATION STUFF - RÉSUMÉ AND COVER LETTER (TOP 25)")
print(short_dashes)
#pprint.pprint(display_table_appl)
print_2d_list_columns_aligned(display_table_appl)
print()
print(long_dashes)
print()
print()
print(long_dashes)
print("COMPARISON OF DESCRIPTION AND APPLICATION (TOP 25)")
print(short_dashes)
#pprint.pprint(display_table_both)
print_2d_list_columns_aligned(display_table_both)
print()
print(long_dashes)

## Time for histograms (or whatever the discretized version is)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Next line only for Jupyter notebook.
%matplotlib inline

def get_histo_from_freq_dict(word_count_ordered_dict,
                             n_top_words = 25,
                             do_show_word_and_count_lists=False,
                             axx=None
                            ):
    '''
    @return  an axis from matplotlib (with the object - histogram - in it)
    '''
    
    if axx is None:
        fig = plt.figure(figsize=(10, 3))
        axx = fig.add_subplot(111)
    
    counts_pre = list(word_count_ordered_dict.values())
    words_pre  = list(word_count_ordered_dict.keys())
    
    counts = counts_pre[:n_top_words]
    words  = words_pre[:n_top_words]
    
    ## making sure things were working
    if do_show_word_and_count_lists:
        print(f"counts: {counts}")
        print(f"words:  {words}")
    ##endof:  if do_show_word_and_count_lists
    
    x_words_coords = np.arange(len(words))
    axx.bar(x_words_coords, counts, align='center')
    
    axx.set_xticks(x_words_coords)
    axx.set_xticklabels(words, rotation=45, ha='right')
    
    # give back the axis, as the docstring promises
    return axx
    
##endof:  get_histo_from_freq_dict

In [None]:
get_histo_from_freq_dict(description_word_counts, do_show_word_and_count_lists=True)
plt.show()

In [None]:
get_histo_from_freq_dict(application_word_counts)
plt.show()

## Output for Description and Application:

### &lt;FILL THIS IN&gt;

### Done

In [None]:
# #######################
# # No need to run again
# #####
# !powershell -c (Get-Date -UFormat "%s_%Y%m%dT%H%M%S%Z00") -replace '[.][0-9]*_', '_'

The output when I actually did this was

```
<Here is where the output will go>
```

The output histograms, in an image.

<br/>
<div>
  <img src="first_QandR_word_frequency_plots.jpg"
       alt="The first pair of histograms - one for the job description, one for the job application - with word frequencies"
       width="100%">
</div>
<br/>

Here is some idea of how they match. I hope it makes some sense. Darker green means an exact match; thinner dark green means a match with words that don't add much meaning; lighter green means it's a close match. 

<br/>
<div>
  <img src="first_QandR_word_frequency_plots_w_link_lines.jpg"
       alt="Word matches for the first pair of histograms."
       width="100%">
</div>
<br/>

## Future Steps

- Look at ranking, counts, percentage, etc. for FamilySearch's (job description's) top 25 words as found in my (job application's) word counts, then vice-versa. 
- Get rid of words that are necessary for grammar, but which don't matter too much in determining whether the two documents match up. (Found term on 2023-08-17. It's "stopwords".)
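
A first pass at the rank/count/percentage comparison from the first bullet might look like the sketch below. The counts are made up, the helper name `rank_lookup` is hypothetical, and it assumes both mappings are already sorted from highest to lowest count (as the `OrderedDict` results in this notebook are meant to be).

```python
def rank_lookup(sorted_word_counts):
    """Map each word to its 1-based rank in a count-sorted mapping."""
    return {word: rank
            for rank, word in enumerate(sorted_word_counts, start=1)}

# Made-up counts, already sorted from highest to lowest count
desc_counts = {"java": 9, "testing": 5, "agile": 2}
appl_counts = {"python": 7, "java": 4, "testing": 1}

appl_ranks = rank_lookup(appl_counts)
total_appl_words = sum(appl_counts.values())

# For each description word: its rank there, its rank in the
# application (None if absent), and its share of application words
comparison = []
for desc_rank, word in enumerate(desc_counts, start=1):
    appl_count = appl_counts.get(word, 0)
    appl_percent = round(100 * appl_count / total_appl_words, 1)
    comparison.append((word, desc_rank, appl_ranks.get(word), appl_percent))

for row in comparison:
    print(row)
```

Swapping the two mappings would give the vice-versa view, with the application's top words looked up in the description's counts.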