## Job-Hunt NLP Demo - Part 1

This demo is also useful for some quick NLP work: seeing how my résumé's word distribution matches the word distribution of the job descriptions.

There's a wonderful project out there, [MyBinder](https://mybinder.org), which lets you run a Jupyter notebook interactively and completely online. It's nice to have when you'd like to play with code and get a better look at the output that comes from running it. I've had some problems with images going down, but I'm going to work to keep this one up.

The link to the online, interactive notebook (the binder) is the badge you see right here:

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/bballdave025/job-app-word-freq/main?labpath=A_02nd_NLPPresentationJobHunt_DemoWordFreq.ipynb)

<hr/>

## We are calling this version 0.1.003

It covers the FamilySearch CJKV jobs I applied for in August 2023, but we're splitting it into smaller notebooks. Hopefully, MyBinder can load each one more quickly. We'll see how things work with pickling variables between the parts.

<hr/>

## Let's start by un-pickle-ing the things we'll need.

In [None]:
import pickle

pickle_filename_1_to_2 = "important_part_1_vars.pkl"

unpickled_array = []

with open(pickle_filename_1_to_2, 'rb') as pfh:
    unpickled_array = pickle.load(pfh)
##endof:  with open ... as pfh # (pickle file handle)

In [None]:
local_job_desc_filenames = unpickled_array[0]
local_job_appl_filenames = unpickled_array[1]
complete_description_text = unpickled_array[2]
complete_application_text = unpickled_array[3]
description_strings = unpickled_array[4]
description_word_counts = unpickled_array[5]
application_strings = unpickled_array[6]
application_word_counts = unpickled_array[7]
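
Since everything rides on the order of the pickled list, here is a minimal sketch of the round trip between two notebooks. The variable names and the file name are hypothetical stand-ins, not the real contents of `important_part_1_vars.pkl`, and it writes to a temporary directory so nothing in the repo is touched.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the real notebook variables.
example_filenames = ["job_description_0.txt", "job_description_1.txt"]
example_word_counts = [{"data": 7, "python": 3}, {"speech": 5, "data": 2}]

# The order of this list is the "contract" between the two notebooks.
things_to_pickle = [example_filenames, example_word_counts]

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "example_vars.pkl")

    # What the earlier notebook does: dump the list.
    with open(pickle_path, "wb") as pfh:
        pickle.dump(things_to_pickle, pfh)

    # What the later notebook does: load the list back.
    with open(pickle_path, "rb") as pfh:
        restored = pickle.load(pfh)

# Unpack by position, in the same order the list was built.
restored_filenames, restored_word_counts = restored
print(restored_filenames)
print(restored_word_counts)
```

Unpacking by position is short, but it breaks quietly if the two notebooks ever disagree about the order. Pickling a dict keyed by variable name would be one way around that.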

## Nicely-formatted Word Frequency Counts

The next code will take care of the issue I described thusly:
> It's annoying me not to have a nice, aligned output for these 2d lists - basically, they're tables. I need to bring in some previous code that takes care of getting stuff printed nice. That will be after the Q&R.

In [None]:
import string
from io import StringIO
import re

def print_2d_list_columns_aligned(list_2d_to_print,
                                  joining_delimiter = ",  ",
                                  do_output_as_str_and_print = False,
                                  do_see_the_guts=False):
    ##  The  do_see_the_guts boolean is partly for debugging,
    ##+ partly for remembering and teaching how the process
    ##+ works.
    
    ## Make all elements strings - so we can use len()
    list_2d_all_strings = \
      [[str(item) for item in row] for row in list_2d_to_print]
    
    if do_see_the_guts:
        print()
        print("  list_2d_all_strings:")
        print(list_2d_all_strings)
        print()
    ##endof:  if do_see_the_guts
    
    #  We want to find the max string length for each column
    #+ We can basically transpose the 2d_list to get the
    #+ content of each column
    list_of_column_elems_as_tuples = \
                 [column for column in zip(*list_2d_all_strings)]
    
    if do_see_the_guts:
        print()
        print("  list_of_column_elems_as_tuples:")
        print(list_of_column_elems_as_tuples)
        print()
    ##endof:  if do_see_the_guts
    
    ## find the max string length for each tuple (each column)
    list_of_max_str_len_by_column = \
      [max([len(strng) for strng in tpl]) 
        for tpl in list_of_column_elems_as_tuples]
    # -v- gives array with elements being each longest string
    #[max([strng for strng in tpl], key=len) for tpl in list_of_column_elems_as_tuples]
    # -v- 2d array with strings
    #[[strng for strng in tpl] for tpl in list_of_column_elems_as_tuples]
    
    if do_see_the_guts:
        print()
        print("  list_of_max_str_len_by_column:")
        print(list_of_max_str_len_by_column)
        print()
    ##endof:  if do_see_the_guts
    
    # output_as_str_not_list = False
    
    # Create a formatter for each row
    
    #if not output_as_str_not_list:
    joining_delimiter = "," + joining_delimiter
    
    fmt_str = \
      joining_delimiter.join('{{:{}}}'.format(max_len) 
                               for max_len in list_of_max_str_len_by_column)
    #if not output_as_str_not_list:
    fmt_str = "[" + fmt_str + "],"
    
    if do_see_the_guts:
        print()
        print("  fmt_str:")
        print(fmt_str)
        print()
    ##endof:  if do_see_the_guts
    
    # Get a string for each row, formatted correctly
    list_of_formatted_row_strings = \
      [fmt_str.format(*row) for row in list_2d_all_strings]
    
    if do_see_the_guts:
        print()
        print("  list_of_formatted_row_strings:")
        print(list_of_formatted_row_strings)
        print()
    ##endof:  if do_see_the_guts
    
    s = StringIO()
    print(*list_of_formatted_row_strings, file=s)
    output_table_raw = s.getvalue()
    
    output_table_raw = re.sub(r"^\[", r"[[ ", output_table_raw)
    output_table_raw = re.sub(r"\],$", r"]]", output_table_raw)
    #output_table_raw = re.sub(r"([^ ])\],", r"\g<1> ],", output_table_raw)
    output_table_raw = re.sub(r"\],", r" ],", output_table_raw)
    output_table_raw = re.sub(r",,", r" ,", output_table_raw)
    output_table_raw = re.sub(r"]]", r" ]]", output_table_raw)
    
    aligned_table_to_return = output_table_raw.replace(r"], [", "],\n [ ")
    
    print(aligned_table_to_return)
    
    if do_output_as_str_and_print:
        return aligned_table_to_return
    ##endof:  if do_output_as_str_and_print
    
##endof:  print_2d_list_columns_aligned(<params>)

In [None]:
# ##  Uncomment if you want to see the guts of a small example
# ##+ that's been hacked into working all right
# print_2d_list_columns_aligned([['hey', 7], ['work', 3], ['stupid', 2]], do_see_the_guts=True)
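
If the function above feels like a lot, the heart of it is just three steps: make everything a string, find the widest string in each column, and build a format string from those widths. Here is a stripped-down sketch of only that part. The name `format_rows_aligned` is made up for this example, and it leaves out the bracket decoration that the full function adds.

```python
def format_rows_aligned(rows, delimiter=",  "):
    """Return one string per row, each column padded to its widest cell."""
    # Convert every cell to a string so len() works on all of them.
    str_rows = [[str(cell) for cell in row] for row in rows]
    # zip(*rows) transposes the table: one tuple per column.
    col_widths = [max(len(cell) for cell in column)
                  for column in zip(*str_rows)]
    # Build a format string such as "{:6},  {:1}" from the widths.
    fmt_str = delimiter.join("{{:{}}}".format(width) for width in col_widths)
    return [fmt_str.format(*row) for row in str_rows]

for line in format_rows_aligned([["hey", 7], ["work", 3], ["stupid", 2]]):
    print(line)
```

Strings are left-aligned by default in a `{:N}` format spec, which is why the words line up on the left and the padding lands before the delimiter.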

### Here, we're giving the option to see the whole `word, word_count` list with however many words were found in the files.

In [None]:
## Only change this boolean to True if you want to see a lot of output. ##
do_print_long_full_version = False

if do_print_long_full_version:
    dashes="------------------------------------------------------------"
    short_dashes="-----"
    
    this_d_str_counter = -1 # quick hack for zero-indexed
    
    for d_word_count_dict in description_word_counts:
        this_d_str_counter += 1
        
        #  I haven't yet written this for several resumes, so
        #+ it just compares each description with the first
        #+ (or, likely, combined) thing in the resume stuff.
        this_a_str_index = 0
        a_word_count_dict = application_word_counts[this_a_str_index]
        
        print()
        print()
        print(dashes)
        print(f" For the job description with index, {this_d_str_counter}")
        print( " (meaning it's from the file:")
        print(f"   {local_job_desc_filenames[this_d_str_counter]}),")
        print(short_dashes)
        this_d_wdct_items_list = list(d_word_count_dict.items())
        this_d_wdct_2d_list = [list(ele) for ele in this_d_wdct_items_list]
        print_2d_list_columns_aligned(this_d_wdct_2d_list)
        print(dashes)
        print()
        print()
        print(dashes)
        print( " For the job application material (résumé, cover letter, etc.)")
        print(f" with index, {this_a_str_index}")
        print( "(meaning it's from the file:")
        print(f"   {local_job_appl_filenames[this_a_str_index]}),")
        print(short_dashes)
        this_a_wdct_items_list = list(a_word_count_dict.items())
        this_a_wdct_2d_list = [list (ele) for ele in this_a_wdct_items_list]
        print_2d_list_columns_aligned(this_a_wdct_2d_list)
        print(dashes)
        print()
        print()
        print(dashes)
        print(dashes)
        print()
        print()
    ##endof:  for d_word_count_dict in description_word_counts
##endof:  if do_print_long_full_version

### Get the data into a useful form and see basic statistics

In [None]:
desc_counter = -1 # hack for zero-indexing
appl_index   =  0 #  combining application files
                  #+ (actually, here, there's only one file)

description_items_list = []
application_items_list = []

this_a_word_count = application_word_counts[appl_index]
application_items = list(this_a_word_count.items())
application_items_list.append(application_items)

for this_d_word_count in description_word_counts:
    desc_counter += 1
    
#     #  I haven't yet written this for several resumes, so
#     #+ it just compares each description with the first
#     #+ (or, likely, combined) thing in the resume stuff.
#     this_a_word_count = application_word_counts[appl_index]
    
    description_items_list.append(list(this_d_word_count.items()))
#     application_items = list(this_a_word_count.items())
    
    n_words_description = len(description_items_list[desc_counter])
    n_words_application = len(application_items_list[appl_index])
    
    print()
    print("NOTE THAT THESE ARE THE NUMBERS OF DISTINCT WORDS")
    print()
    print(f" For the job description with index, {desc_counter}")
    print( " (meaning it's from the file:")
    print(f"   {local_job_desc_filenames[desc_counter]}),")
    print("AND")
    print( " For the job application material (résumé, cover letter, etc.)")
    print(f" with index, {appl_index}")
    print( "(meaning it's from the file:")
    print(f"   {local_job_appl_filenames[appl_index]}),")
    print()
    print(f"n_words_description = {str(n_words_description)}")
    print(f"n_words_application = {str(n_words_application)}")
    print()
    print()
    
##endof:  for this_d_word_count in description_word_counts
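
The note about DISTINCT words matters, so here is a tiny sketch of the difference between the number of distinct words and the total number of words. The text is a made-up example, and `collections.Counter` is my stand-in for however the word-count dictionaries were built in Part 1.

```python
from collections import Counter

# Hypothetical text standing in for one job description.
example_text = "data science and data engineering and data analysis"

example_word_counts = Counter(example_text.split())

# Number of different words: one per dictionary key.
n_distinct_words = len(example_word_counts)
# Number of words overall: every occurrence is counted.
n_total_words = sum(example_word_counts.values())

print(f"n_distinct_words = {n_distinct_words}")
print(f"n_total_words    = {n_total_words}")
```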

### Getting job-descriptions, job-applications rankings nicely formatted

#### And a nice comparison of the rankings for both

In [None]:
import sys
import copy

#    ##  Not handling ties differently - whichever shows up first gets the
#    ##+ higher ranking
#
#    # #  These start out impossible (can't have a negative index), 
#    # #+ so we'll know if we use it without changing
#    # this_d_tie_dict_index = -1
#    # this_d_tie_value = -1
#    # this_a_tie_dict_index = -1
#    # this_a_tie_value = -1

table_version_desc = None
table_version_appl = None
table_version_both = None

list_of_table_version_desc = []
list_of_table_version_appl = []
list_of_table_version_both = []

for this_description_items in description_items_list:
    if table_version_desc is not None:
        table_version_desc.clear()
    if table_version_appl is not None:
        table_version_appl.clear()
    if table_version_both is not None:
        table_version_both.clear()
    
    table_version_desc = [["desc_word", "desc_cnt", "desc_rank"],]
    table_version_appl = [["appl_word", "appl_cnt", "appl_rank"],]
    table_version_both = [["rank", "desc_word", "desc_cnt", 
                           "appl_word", "appl_cnt"],
                         ]
    
    for this_idx in range(max(len(this_description_items),
                              len(application_items)
                             ) # - 1
                          ):
        
        this_rank = this_idx + 1
        
        this_description_word  = ""
        this_description_count = ""
        this_description_rank  = ""
        
        if this_idx < len(this_description_items): # - 1:
            this_description_word  = this_description_items[this_idx][0] #+1][0]
            this_description_count = this_description_items[this_idx][1] #+1][1]
            this_description_rank  = this_rank
        else:
            this_description_word  = "-- N/A --"
            this_description_count = "-- N/A --"
            this_description_rank  = "-- N/A --"
        ##endof:  if/else this_idx < len(description_items)
        
        this_application_word  = ""
        this_application_count = ""
        this_application_rank  = ""
        
        if this_idx < len(application_items): # - 1:
            this_application_word  = application_items[this_idx][0]
            this_application_count = application_items[this_idx][1]
            this_application_rank  = this_rank
        else:
            this_application_word  = "-- N/A --"
            this_application_count = "-- N/A --"
            this_application_rank  = "-- N/A --"
        ##endof:  if/else this_idx < len(application_items)

        if this_description_word != "-- N/A --":
            table_version_desc.append([this_description_word, 
                                       this_description_count, 
                                       this_rank]
                                     )
        ##endof:  if this_description_word != "-- N/A --"

        if this_application_word != "-- N/A --":
            table_version_appl.append([this_application_word, 
                                       this_application_count, 
                                       this_rank
                                      ]
                                     )
        ##endof:  if this_application_word != "-- N/A --"

        table_version_both.append([this_rank, 
                                   this_description_word, 
                                   this_description_count,
                                   this_application_word, 
                                   this_application_count
                                  ]
                                 )
    ##endof:  for this_idx in max(<n_desc_words>, <n_appl_words>)
    
    ## deep copies, since lists are mutable and shared by reference
    
    deep_desc = copy.deepcopy(table_version_desc)
    deep_appl = copy.deepcopy(table_version_appl)
    deep_both = copy.deepcopy(table_version_both)
    
    list_of_table_version_desc.append(deep_desc)
    list_of_table_version_appl.append(deep_appl)
    list_of_table_version_both.append(deep_both)
    
##endof:  for this_description_items in description_items_list
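
For the side-by-side table, a shorter way to pad the shorter list with `-- N/A --` is `itertools.zip_longest`. This is a sketch of that alternative, not the code used above. The function name and the example data are made up, and it assumes both inputs are lists of `(word, count)` pairs already sorted by rank.

```python
from itertools import zip_longest

NOT_AVAILABLE = "-- N/A --"

def build_side_by_side_table(desc_items, appl_items):
    """Pair up two ranked (word, count) lists, padding the shorter one."""
    table = [["rank", "desc_word", "desc_cnt", "appl_word", "appl_cnt"]]
    filler = (NOT_AVAILABLE, NOT_AVAILABLE)
    # zip_longest keeps going until the longer list runs out.
    paired = zip_longest(desc_items, appl_items, fillvalue=filler)
    for rank, (desc_pair, appl_pair) in enumerate(paired, start=1):
        table.append([rank,
                      desc_pair[0], desc_pair[1],
                      appl_pair[0], appl_pair[1]])
    return table

example_desc_items = [("data", 7), ("python", 3), ("speech", 2)]
example_appl_items = [("python", 4), ("data", 1)]

for row in build_side_by_side_table(example_desc_items, example_appl_items):
    print(row)
```

Because the function builds a new list on every call, there is no shared table to clear or deep-copy between descriptions.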

**The next 3 cells will also give you a lot of output if you uncomment the code.**

We will later see a nicer version of the top 25 words from each file.

In [None]:
#import pprint; pprint.pprint(list_of_table_version_desc)

In [None]:
#import pprint; pprint.pprint(list_of_table_version_appl)

In [None]:
#import pprint; pprint.pprint(list_of_table_version_both)

### Finally, the nice output for the top 25s

After a few cells, we will see it. That will be the end of Part 2.

In [None]:
import sys
import copy

###
##  Set up the display the first n_lines_to_display 
##+ of the tables, nicely

#### This next one is the one you might change
n_lines_to_display_orig = 25

n_header_lines = 1
n_lines_to_display = n_lines_to_display_orig
n_lines_to_display_desc = n_lines_to_display_orig
n_lines_to_display_appl = n_lines_to_display_orig

## Here is where the code for doing it with lists of strings comes

display_table_desc = None
display_table_appl = None
display_table_both = None

list_of_display_table_desc = []
list_of_display_table_appl = []
list_of_display_table_both = []

for desc_fname_idx_for_now in range(len(list_of_table_version_desc)):
    #  find out which one is longer. Make that the length of the
    #+ display_table_both
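    #  Cut the number of displayed lines down when a table has fewer
    #+ data rows (header excluded) than requested. Reset both values
    #+ for each description so a shorter, earlier table doesn't carry over.
    n_lines_to_display_desc = n_lines_to_display_orig
    n_lines_to_display_appl = n_lines_to_display_orig
    
    n_rows_desc = \
      len(list_of_table_version_desc[desc_fname_idx_for_now]) - n_header_lines
    do_cut_down_desc = ( n_lines_to_display_orig > n_rows_desc )
    if do_cut_down_desc:
        n_lines_to_display_desc = n_rows_desc
    ##endof:  if do_cut_down_desc
    
    n_rows_appl = \
      len(list_of_table_version_appl[desc_fname_idx_for_now]) - n_header_lines
    do_cut_down_appl = ( n_lines_to_display_orig > n_rows_appl )
    if do_cut_down_appl:
        n_lines_to_display_appl = n_rows_appl
    ##endof:  if do_cut_down_appl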
    
    # start each pass with fresh tables (lists are mutable and shared by reference)
    if display_table_desc is not None:
        display_table_desc.clear()
    if display_table_appl is not None:
        display_table_appl.clear()
    if display_table_both is not None:
        display_table_both.clear()
    
    # get headers
    display_table_desc = [list_of_table_version_desc[desc_fname_idx_for_now][0]]
    display_table_appl = [list_of_table_version_appl[desc_fname_idx_for_now][0]]
        # made copies to make it easier
    display_table_both = [list_of_table_version_both[desc_fname_idx_for_now][0]]
    
    if ( len(list_of_table_version_desc[desc_fname_idx_for_now]) - n_header_lines < n_lines_to_display or
         len(list_of_table_version_appl[desc_fname_idx_for_now]) - n_header_lines < n_lines_to_display
    ):
        n_lines_to_display = min(len(list_of_table_version_desc[desc_fname_idx_for_now]) - n_header_lines,
                                 len(list_of_table_version_appl[desc_fname_idx_for_now]) - n_header_lines)
    ##endof:  if <n_lines_conditions>
    
    
    for table_idx in range(n_header_lines, 
                           n_lines_to_display_orig + n_header_lines):
        if table_idx - n_header_lines < n_lines_to_display_desc:
            display_table_desc.append(list_of_table_version_desc[desc_fname_idx_for_now][table_idx])
        ##endof:  if table_idx - n_header_lines < n_lines_to_display_desc
        
        if table_idx - n_header_lines < n_lines_to_display_appl:
            display_table_appl.append(list_of_table_version_appl[desc_fname_idx_for_now][table_idx])
        ##endof:  if table_idx - n_header_lines < n_lines_to_display_appl
        
        #  Use the table for the current description, not always the first one
        display_table_both.append(list_of_table_version_both[desc_fname_idx_for_now][table_idx])
    ##endof:  for idx in range(<n_lines stuff>)
    
    ####  Dang mutable, pass-by-reference Python stuff. : )
    ####+ I'm used to more C-style, but I'm getting better.
    deep_display_desc = copy.deepcopy(display_table_desc)
    deep_display_appl = copy.deepcopy(display_table_appl)
    deep_display_both = copy.deepcopy(display_table_both)
    
    list_of_display_table_desc.append(deep_display_desc)
    list_of_display_table_appl.append(deep_display_appl)
    list_of_display_table_both.append(deep_display_both)
##endof:  for desc_fname_idx_for_now in range(len(list_of_table_version_desc))

In [None]:
long_dashes = "--------------------------------------------------------------"
short_dashes = "-----"

In [None]:
for desc_fname_idx_for_now in range(len(list_of_display_table_desc)):
    print()
    print(long_dashes)
    print(f"len(list_of_display_table_desc[{desc_fname_idx_for_now}]):"+ \
          f" {len(list_of_display_table_desc[desc_fname_idx_for_now])}")
    print(f"len(display_table_desc): {len(display_table_desc)}")
    print()
    print(f"len(list_of_display_table_appl[{desc_fname_idx_for_now}]):" + \
          f" {len(list_of_display_table_appl[desc_fname_idx_for_now])}")
    print(f"len(display_table_appl): {len(display_table_appl)}")
    print()
    print(f"len(list_of_display_table_both[{desc_fname_idx_for_now}]):" + \
          f" {len(list_of_display_table_both[desc_fname_idx_for_now])}")
    print(f"len(display_table_both): {len(display_table_both)}")
    print(long_dashes)
    print()
##endof:  for desc_fname_idx_for_now in range(len(list_of_table_version_desc)):

Here it comes, the reward for your patience.

In [None]:
#import pprint

desc_counter = -1 # hack for zero-indexed
for this_desc_disp_table in list_of_display_table_desc:
    print()
    print(long_dashes + short_dashes)
    desc_counter += 1
    print( "JOB DESCRIPTION (TOP 25)")
    print(f"  from file: {local_job_desc_filenames[desc_counter]}")
    print(short_dashes)
    #pprint.pprint(display_table_desc)
    print_2d_list_columns_aligned(this_desc_disp_table)
    print()
    print(long_dashes)
    print()
##endof:  for this_desc_disp_table in list_of_display_table_desc

print(long_dashes + 2*short_dashes)
print(long_dashes + 2*short_dashes)

for this_appl_disp_table in list_of_display_table_appl:
    print()
    print(long_dashes)
    print("JOB APPLICATION STUFF - RÉSUMÉ AND COVER LETTER (TOP 25)")
    print(short_dashes)
    #pprint.pprint(display_table_appl)
    print_2d_list_columns_aligned(this_appl_disp_table)
    print()
    print(long_dashes)
    print()
##endof:  for this_appl_disp_table in list_of_display_table_appl

print(long_dashes + 2*short_dashes)
print(long_dashes + 2*short_dashes)

other_desc_counter = -1 # hack for zero-indexed
for this_both_disp_table in list_of_display_table_both:
    print()
    print(long_dashes)
    other_desc_counter += 1
    print("COMPARISON OF DESCRIPTION AND APPLICATION (TOP 25)")
    print(f"  from file: {local_job_desc_filenames[other_desc_counter]}")
    print(short_dashes)
    #pprint.pprint(display_table_both)
    print_2d_list_columns_aligned(this_both_disp_table)
    print()
    print(long_dashes)
##endof:  for this_both_disp_table in list_of_display_table_both

## Next, in Part 3, We Will Get Words at Different Ranks.

### We will also do histograms.

### But first, it's pickle time

And the link for the Part 2 MyBinder will be included after the pickling.

In [None]:
import pickle

pickle_filename_2_to_3 = "important_part_2_vars.pkl"

things_to_pickle_2 = [
    local_job_desc_filenames,
    local_job_appl_filenames,
    list_of_display_table_desc,
    list_of_display_table_appl
]

with open(pickle_filename_2_to_3, 'wb') as pfh:
    pickle.dump(things_to_pickle_2, pfh)
##endof:  with open ... as pfh # (pickle file handle)

## This is where to copy/paste into Part 3

In [None]:
def get_description_word_at_rank(this_rank = 1, 
                                 this_desc_fname_idx=0,
                                 do_print_details=False
                                ):
    this_idx = this_rank # the header is index 0
    this_table_to_use = \
      list_of_display_table_desc[this_desc_fname_idx]
    this_word = this_table_to_use[this_rank][0]
    if do_print_details:
        print()
        print(f"  The job description word at rank {this_rank},")
        print(
          ( "  from file:"
           f" '{local_job_desc_filenames[this_desc_fname_idx]}',"
          )
        )
        print(f"  is '{this_word}'.")
        print()
    ##endof: if do_print_details
    
    return this_word
##endof:  get_description_word_at_rank(<params>)

def get_application_word_at_rank(this_rank = 1, 
                                 do_print_details=False
                                ):
    this_idx = this_rank # the header is index 0
    this_appl_fname_idx=0
    this_table_to_use = \
      list_of_display_table_appl[this_appl_fname_idx]
    this_word = this_table_to_use[this_rank][0]
    if do_print_details:
        print()
        print(f"  The job application word at rank {this_rank},")
        print(
          ( "  from file:"
           f" '{local_job_appl_filenames[this_appl_fname_idx]}',"
          )
        )
        print(f"  is '{this_word}'.")
        print()
    ##endof: if do_print_details
    
    return this_word
##endof:  get_application_word_at_rank(<params>)

In [None]:
get_description_word_at_rank(1, do_print_details=True);
get_application_word_at_rank(1, do_print_details=True);

## Time for top-25 histograms (or whatever the discretized version is)

I'm going to go through these histograms one at a time. Basically, I'll compare each of the four job descriptions to my job application.

### Choices for the job description

In [None]:
str_for_choices = f"Choices are any of: {list(range(len(local_job_desc_filenames)))}"
print(str_for_choices.replace(r"[", r"{").replace(r"]", r"}"))

Output was most recently

Choices are any of: `{0, 1, 2, 3}`

### Make your choice in the next cell, if you're pressed for time.

Otherwise, you should leave this index as `0`, as it's part of my process of going through all four job descriptions.

In [None]:
##  It's your turn to choose which one you want.
##+ Just do this if you are pressed for time and
##+ want to see a certain result; I will be displaying
##+ all four, here.
the_choice_of_description_index = 0

In [None]:
desc_idx_00 = the_choice_of_description_index # smaller variable name.
the_chosen_filename = local_job_desc_filenames[desc_idx_00]
print(f"We will be looking at: {the_chosen_filename}")

### One value for the job application

This is how I want to structure things in general. Even if I have a résumé and a cover letter and a list of skills from the application and whatever questions they want me to answer, I want to combine them. That is possible in one of the functions above.

In [None]:
#  You can't choose a value for now (or at least doing
#+ so won't give you anything useful).
the_only_application_index_value = 0
appl_idx = the_only_application_index_value
the_only_application_filename = local_job_appl_filenames[appl_idx]

print(f"And the comparison will be to: {the_only_application_filename}")

### Code for one histogram

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sys
import copy

## Next line only for Jupyter notebook.
%matplotlib inline

def get_histo_from_freq_dict(word_count_ordered_dict,
                             n_top_words = 25,
                             do_show_frac_not_count=False,
                             do_show_wd_cnt_or_frac_lists=False,
                             axx=None
                            ):
    '''
    @return  an axis from matplotlib (with the object - histogram - in it)
    '''
    
    if axx is None:
        fig = plt.figure(figsize=(10, 3))
        axx = fig.add_subplot(111)
    ##endof:  if axx is None
    
    counts = None
    fractions = None
    
    if do_show_frac_not_count:
        #frac_wd_cnt_ordered_dict = copy.deep_copy(word_count_ordered_dict) 
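        #  Divide by the total number of words (the sum of all
        #+ counts), not by the number of distinct words.
        n_total_words = sum(word_count_ordered_dict.values())
        frac_wd_cnt_list_of_tuples = \
          [ ( k, (v / n_total_words) )
                  for k, v in word_count_ordered_dict.items()]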
        fractions_pre = [ this_item[1] 
                           for this_item in frac_wd_cnt_list_of_tuples ]
        
        #fractions_pre = list(frac_wd_cnt_ordered_dict.values())
        fractions = fractions_pre[:n_top_words]
    else:
        counts_pre = list(word_count_ordered_dict.values())
        counts = counts_pre[:n_top_words]
    ##endof:  if/else do_show_frac_not_count
    
    words_pre  = list(word_count_ordered_dict.keys())
    words  = words_pre[:n_top_words]
    
    ## making sure things were working
    if do_show_wd_cnt_or_frac_lists:
        if do_show_frac_not_count:
            print (f"fractions: {fractions}")
        else:
            print(f"counts: {counts}")
        ##endof:  if do_show_frac_not_count
        print(f"words:  {words}")
    ##endof:  if do_show_word_and_count_lists
    
    x_words_coords = np.arange(len(words))
    
    if do_show_frac_not_count:
        axx.bar(x_words_coords, fractions, align='center')
    else:
        axx.bar(x_words_coords, counts, align='center')
    ##endof:  if/else do_show_frac_not_count
    
    axx.set_xticks(x_words_coords)
    axx.set_xticklabels(words, rotation=45, ha='right')
    
    return axx
    
##endof:  get_histo_from_freq_dict

### Let's see a histogram for the job description with word counts

In [None]:
get_histo_from_freq_dict(description_word_counts[desc_idx_00],
                         do_show_wd_cnt_or_frac_lists=True)

desc_top_25_hist_fname = "top_25_description_words.png"
plt.savefig(desc_top_25_hist_fname,
            bbox_inches='tight')

plt.show()

### Now, let's see one for the job description with word frequency as a fraction of total words

In [None]:
get_histo_from_freq_dict(description_word_counts[desc_idx_00],
                         do_show_frac_not_count=True)

### Here comes the histogram for the job application with word counts

In [None]:
get_histo_from_freq_dict(application_word_counts[appl_idx],
                         do_show_wd_cnt_or_frac_lists=True)

appl_top_25_hist_fname = "top_25_application_words.png"
plt.savefig(appl_top_25_hist_fname,
            bbox_inches='tight')

plt.show()

### And the histogram for the job application with word frequency as a fraction of total words

In [None]:
get_histo_from_freq_dict(application_word_counts[appl_idx],
                         do_show_frac_not_count=True)
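
To be clear about what "fraction of total words" means here, this is a small sketch of the arithmetic on its own, apart from the plotting. The function name and the counts are made up for the example. The assumption is that "total words" means the sum of all the counts, so the fractions add up to 1.

```python
import math

def get_word_fractions(word_counts):
    """Turn {word: count} into {word: count / total number of words}."""
    n_total_words = sum(word_counts.values())
    if n_total_words == 0:
        return {}  # nothing to normalize
    return {word: count / n_total_words
            for word, count in word_counts.items()}

# Hypothetical counts standing in for one document.
example_counts = {"data": 5, "python": 3, "speech": 2}
example_fractions = get_word_fractions(example_counts)

print(example_fractions)
print(math.isclose(sum(example_fractions.values()), 1.0))
```

Fractions make it fairer to compare a long job description with a short résumé, since raw counts grow with document length.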

### Change the img src values and img alt values, then see the histograms together

You might need to double-click on the image to get the html source.

In [None]:
print("  img src values for the two images:")
print(f'"{desc_top_25_hist_fname}"')
print(f'"{appl_top_25_hist_fname}"')

print()
print("  img alt values for the two images:")
wd_count_alt_text_1 = '"The histogram for the job description with word frequencies"'
wd_count_alt_text_2 = '"The histogram for the job application with word frequencies"'

print(wd_count_alt_text_1)
print(wd_count_alt_text_2)

The output histograms, stacked for easier view.

_Remember that you might need to double click on the images to change the img src and img alt values._

<br/>
<div>
  <img src="top_25_description_words.png"
       alt="The histogram for the job description with word frequencies"
       width="auto">
</div>
<br/>

<br/>
<div>
  <img src="top_25_application_words.png"
       alt="The histogram for the job application with word frequencies"
       width="auto">
</div>
<br/>

Sometimes, I'll grab a screenshot of the above two images and draw green lines between words that match. However, since I added the view that shows a matching word together with its three surrounding words, this step hasn't seemed as vital.

If this is going to happen, double click on this cell to see the now-commented HTML, get your saved filename, change the HTML accordingly, and uncomment everything. (HTML comments start with `<!--` and end with `-->`.)

<!--
<br/>
<div>
  <img src="word_frequency_plots_w_link_lines.jpg"
       alt="Word matches for the pair of histograms."
       width="100%">
</div>
<br/>
-->
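
The matches that the green lines show by hand can also be listed in code. This is a rough sketch, with a made-up function name and made-up counts. It assumes each word-count dictionary is already ordered from most to least frequent, so the first `n_top_words` keys are the top words.

```python
def get_shared_top_words(desc_word_counts, appl_word_counts, n_top_words=25):
    """List the words in both top-n lists, in job-description rank order."""
    desc_top = list(desc_word_counts)[:n_top_words]
    # A set makes the membership check fast; order comes from desc_top.
    appl_top = set(list(appl_word_counts)[:n_top_words])
    return [word for word in desc_top if word in appl_top]

# Hypothetical, already-sorted counts.
example_desc_counts = {"data": 7, "python": 3, "speech": 2, "team": 1}
example_appl_counts = {"python": 4, "team": 3, "research": 2}

print(get_shared_top_words(example_desc_counts, example_appl_counts))
print(get_shared_top_words(example_desc_counts, example_appl_counts,
                           n_top_words=3))
```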

## A better way to compare

### Seems like a good time to look at comparisons

#### Between the résumé and the different job descriptions

In [None]:
def find_word_in_both_display_lists(word_to_find,
                                    display_list_1_description,
                                    display_list_2_application,
                                    name_of_display_list_1=None,
                                    name_of_display_list_2=None,
                                    do_print_details=False
                                   ):
    index_count_1 = 0 # skip header
    index_for_found_in_1 = 0
    
    loop_display_list = display_list_1_description
    
    word_found_in_1 = False
    for my_entry_1 in display_list_1_description:
        index_count_1 += 1
        if my_entry_1 == word_to_find:
            word_found_in_1 = True
            index_for_found_in_1 = index_count_1
            break
        ##endof:  if my_entry_1 == word_to_find
    ##endof:  for my_entry_1 in display_list_1
    
    index_count_2 = 0 # skip header
    index_for_found_in_2 = -1
    word_found_in_2 = False
    for my_entry_2 in display_list_2_application:
        index_count_2 += 1
        if my_entry_2 == word_to_find:
            word_found_in_2 = True
            index_for_found_in_2 = index_count_2
            break
        ##endof:  if my_entry_2 == word_to_find
    ##endof:  for my_entry_2 in display_list_2
    
    to_return_found_1 = None
    
    if word_found_in_1:
        to_return_found_1 = index_for_found_in_1 # - 1
        if do_print_details:
            print()
            print(f"The word, {word_to_find}, has rank, {to_return_found_1},")
            if name_of_display_list_1 is not None:
                print(f"in list, {name_of_display_list_1}.")
            #endof:  if name_of_display_list_1 is not None
        ##endof:  if do_print_details
    ##endof:  if word_found_in_1
    
    to_return_found_2 = None
    
    if word_found_in_2:
        to_return_found_2 = index_for_found_in_2 # - 1
        if do_print_details:
            print()
            print(f"The word, {word_to_find}, has rank, {to_return_found_2},")
            if name_of_display_list_2 is not None:
                print(f"in list, {name_of_display_list_2}.")
            ##endof:  if name_of_display_list_2 is not None
        ##endof:  if do_print_details
    ##endof:  if word_found_in_2
    
    return to_return_found_1, to_return_found_2
    
##endof:  find_word_in_both_display_lists
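
As a quick check of what the function above returns, here is a minimal sketch of the same rank lookup on toy data. The helper name `rank_of_word` and the two small dictionaries are made up for illustration. The sketch assumes, as the later cells suggest, that each display list is a dictionary ordered from most to least frequent word.

```python
def rank_of_word(word_to_find, ordered_word_counts):
    """Return the 1-based rank of a word, or None if the word is absent."""
    # Iterating over a dict yields its keys (the words) in insertion order.
    for rank, word in enumerate(ordered_word_counts, start=1):
        if word == word_to_find:
            return rank
    return None


# Toy data, ordered from most to least frequent (hypothetical counts).
toy_description_counts = {"software": 9, "data": 7, "team": 4}
toy_application_counts = {"python": 8, "software": 6, "data": 2}

toy_ranks = (rank_of_word("software", toy_description_counts),
             rank_of_word("software", toy_application_counts))
print(toy_ranks)  # (1, 2)

# A word that is missing from a list gives None, not an error.
missing = rank_of_word("team", toy_application_counts)
print(missing)  # None
```

The tuple has the same shape as the function's return value: (rank in description, rank in application), with `None` standing in for "not found".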

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Next line only for Jupyter notebook.
%matplotlib inline

def get_freq_histo_specific(word_count_ordered_dict_1,
#                           word_count_ordered_dict_2,
                            rank_index_1 = 1,
                            n_surrounding_words = 3,
                            do_show_word_and_count_lists=False,
                            ax1=None,
#                           ,ax2=None
                            ylim_bottom_val=None,
                            ylim_top_val=None
                           ):
    '''
    @return  an axis from matplotlib (with the object - histogram - in it)
    '''
    
    if ax1 is None:
        fig = plt.figure(figsize=(10, 3))
        ax1 = fig.add_subplot(111)
#       ax2 = fig.add_subplot(121)
    ##endof:  if ax1 is None
    
    counts_pre = list(word_count_ordered_dict_1.values())
    words_pre  = list(word_count_ordered_dict_1.keys())
    
    #word_1, count_1 = word_count_ordered_dict_1[rank_index_1]
    
    #highest_rank_index = -1
    
    # Pad the list with zero-count and empty-set characters
    len_lists = 2 * n_surrounding_words + 1
    counts = [0] * len_lists
    words  = ["\u2205"] * len_lists
    
    #  Fill anything with a valid index with the corresponding
    #+ word/count
    
    current_output_index = -1
    
    for i in range(rank_index_1 - n_surrounding_words - 1,
                   rank_index_1 + n_surrounding_words
                  ):
        current_output_index += 1
        #  Leave the padding in place for any index that falls
        #+ before the start or past the end of the list
        if 0 <= i < len(counts_pre):
            counts[current_output_index] = counts_pre[i]
            words[current_output_index] = words_pre[i]
        ##endof:  if 0 <= i < len(counts_pre)
    ##endof:  for i in range
    
    ## making sure things are working
    if do_show_word_and_count_lists:
        print(f"counts: {counts}")
        print(f"words:  {words}")
    ##endof:  if do_show_word_and_count_lists
    
    #input("Press [Enter] to continue.")
    
    x_words_coords = np.arange(len(words))
    ax1.bar(x_words_coords, counts, align='center')
    
    ax1.set_xticks(x_words_coords)
    ax1.set_xticklabels(words, rotation=45, ha='right')
    
    ax1.set_ylim(ylim_bottom_val, ylim_top_val)
    
    return ax1
    
##endof:  get_freq_histo_specific
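
To make the windowing logic easier to see without any plotting, here is a small sketch of just the "word at a rank plus its neighbors" step. The helper name `window_around_rank` and the toy counts are hypothetical. Note one assumption: this sketch pads with the empty-set character on both sides, that is, before the start and past the end of the list.

```python
def window_around_rank(ordered_word_counts, rank, n_surrounding_words=3):
    """Return (words, counts) for the word at a 1-based rank and its neighbors."""
    all_words = list(ordered_word_counts.keys())
    all_counts = list(ordered_word_counts.values())

    words, counts = [], []
    # Convert the 1-based rank to 0-based list positions.
    for i in range(rank - n_surrounding_words - 1, rank + n_surrounding_words):
        if 0 <= i < len(all_words):
            words.append(all_words[i])
            counts.append(all_counts[i])
        else:
            # No word exists at this position, so pad it.
            words.append("\u2205")
            counts.append(0)
    return words, counts


toy_counts = {"software": 9, "data": 7, "team": 4, "python": 3}
toy_words, toy_window_counts = window_around_rank(toy_counts, rank=1,
                                                  n_surrounding_words=2)
print(toy_words)          # ['∅', '∅', 'software', 'data', 'team']
print(toy_window_counts)  # [0, 0, 9, 7, 4]
```

The word of interest always lands in the middle slot, which keeps the description and application histograms visually comparable.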

<strike>Below will be code to look for the top 25 (maybe less, maybe more) description words. I'll go through every word that appears 3 times, and I won't include any that appear only twice or once. I'll see where they appear in my résumé list.</strike>

<strike>This will be easily automated and done with a for loop or list comprehension. However, I want to look at some things more manually - that should make the automated stuff better.</strike>

I'm going to make this part more of a look-for-each-word thing. The display is too busy to show each word for each file.

I have a few improvements that would be good, soon:<br/>
  @TODO : get rid of one-letter words<br/>
  @TODO : look through the rest of the list to get rid of junk
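
For the first TODO, a minimal sketch might look like the following. The helper name `drop_one_letter_words` and the toy counts are made up for illustration, and it assumes the word counts are a dictionary that is already ordered by frequency. Whether every one-letter token really is junk is a judgment call I haven't made yet.

```python
def drop_one_letter_words(ordered_word_counts):
    """Return a new dict without one-letter words, keeping the original order."""
    return {word: count
            for word, count in ordered_word_counts.items()
            if len(word) > 1}


# Toy data (hypothetical counts).
toy_counts = {"software": 9, "a": 8, "data": 7, "s": 2}
filtered_counts = drop_one_letter_words(toy_counts)
print(filtered_counts)  # {'software': 9, 'data': 7}
```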

I want to match two histograms for this stuff, with e.g. the job description's word and (up to) 3 (or 4 or 5 or 6 or 2 or 1 or ...) words more frequent and (up to) 3 words less frequent. I'm going to bring up a picture of the histograms for my brainstorming.

### Here are the specific word-rank comparison histograms...

In [None]:
str_for_choices = f"Choices are any of: {list(range(len(local_job_desc_filenames)))}"

####  For this section, we have calculated everything, but show just two files being compared

Well, when we get to the compare-all-top-25 histograms, we'll show all the comparisons.

For the comparisons of the top-ranked words, we show just two files at a time.

Another thing, to keep this Quick and Reckless (not spending too much time), I'm dispensing with my cherished 80 characters per line. `: (`

**You can change the `desc_fname_idx_to_show` to any of the numbers in the next output ...**

In [None]:
print(str_for_choices.replace(r"[", r"{").replace(r"]", r"}"))

Output was most recently

Choices are any of: `{0, 1, 2, 3}`

... **to see results for a specific job description.**

In [None]:
for my_desc_index in range(len(local_job_desc_filenames)):
    print(f"Choice {my_desc_index} : {local_job_desc_filenames[my_desc_index]}")

### ... for your choice of job description and word/word rank

(rank in the job description)

In [None]:
# Make your choice:
desc_fname_idx_to_show = 0

### Now we can continue with the top-ranked word in the job description

In [None]:
top_word_rank_in_desc = 1
this_corresponding_word = get_description_word_at_rank(top_word_rank_in_desc)
this_desc_idx = desc_fname_idx_to_show
word_indexes = find_word_in_both_display_lists(
                 this_corresponding_word,
                 description_word_counts[desc_fname_idx_to_show],
                 application_word_counts[0],
                   #  we only have one table - 
                   #+ it's at any legal index;
                   #+ let's choose 0
                 name_of_display_list_1 = f"description_word_counts[{this_desc_idx}]",
                 name_of_display_list_2 = "application_word_counts[0]"                
)

print()
print( ("(rank in description, rank in application) for the word,"
        f" '{this_corresponding_word}': {word_indexes}"
       )
)

In [None]:
top_word_rank_in_desc = 1
this_desc_idx = top_word_rank_in_desc  # NOTE: in this cell, this holds the word's rank, not the file index

this_corresponding_word = get_description_word_at_rank(top_word_rank_in_desc)

print(f"this_corresponding_word: {this_corresponding_word}")

rank_desc, _ = find_word_in_both_display_lists(
                this_corresponding_word,
                description_word_counts[desc_fname_idx_to_show],
                application_word_counts[0],
                   #  we only have one table - 
                   #+ it's at any legal index;
                   #+ let's choose 0
                name_of_display_list_1 = f"description_word_counts[{desc_fname_idx_to_show}]",
                name_of_display_list_2 = "application_word_counts[0]"
)

fig_filename_desc = ""

if rank_desc is None:
    import matplotlib.image as mpimg
    fig_filename_desc = "description_word_not_found.png"
    img = mpimg.imread(fig_filename_desc)
    imgplot = plt.imshow(img)
##endof:  if rank_desc
else:
    get_freq_histo_specific(
            description_word_counts[desc_fname_idx_to_show],
            rank_index_1=rank_desc,
            n_surrounding_words=3,
            do_show_word_and_count_lists=False,
            ylim_top_val=12)
    
    fig_filename_desc = (
            f"description_word_rank_{this_desc_idx}_"
            f"desc_{desc_fname_idx_to_show}.png"
    )
    
    print(fig_filename_desc)

    title_for_desc = (f"Word frequency rank ({rank_desc}) and surrounding context in "
                      f"job description for the word, {this_corresponding_word}"
                 )
    plt.title(title_for_desc)

    plt.savefig(fig_filename_desc,
                bbox_inches='tight')

    plt.show()
##endof:  if/else rank_desc

In [None]:
_, rank_appl = find_word_in_both_display_lists(
        this_corresponding_word,
        description_word_counts[desc_fname_idx_to_show],
        application_word_counts[0],
           #  we only have one table - 
           #+ it's at any legal index;
           #+ let's choose 0
        name_of_display_list_1 = f"description_word_counts[{desc_fname_idx_to_show}]",
        name_of_display_list_2 = "application_word_counts[0]"
)

fig_filename_appl = ""

if rank_appl is None:
    import matplotlib.image as mpimg
    fig_filename_appl = "application_word_not_found.png"
    img = mpimg.imread(fig_filename_appl)
    imgplot = plt.imshow(img)
##endof:  if rank_appl
else:
    corresponding_index = rank_appl
    
    get_freq_histo_specific(application_word_counts[0],
                        rank_index_1=corresponding_index,
                        n_surrounding_words=3,
                        do_show_word_and_count_lists=False,
                        ylim_top_val=12)

    fig_filename_appl = (f"application_word_rank_{corresponding_index}_"
                         f"desc_{desc_fname_idx_to_show}.png"
                        )

    title_for_appl = (f"Word frequency rank ({rank_appl}) and surrounding context in "
                      f"job application for the word, {this_corresponding_word}"
                     )
    plt.title(title_for_appl)

    plt.savefig(fig_filename_appl,
            bbox_inches='tight')

    plt.show()
##endof:  if/else rank_appl

In [None]:
print("  img src values for the two images:")
print(f'"{fig_filename_desc}"')
print(f'"{fig_filename_appl}"')

print()
print("  img alt values for the two images:")
alt_text_1 = (f'"Histogram for the word, {this_corresponding_word}, in '
               'the job description text"'
             )
alt_text_2 = (f'"Histogram for the word, {this_corresponding_word}, in '
               'the job application text"'
             )
print(alt_text_1)
print(alt_text_2)

### Change the img src values and img alt values, then see the histograms together

You might need to double-click on the image to get the html source.

<br/>
<div>
  <img src="description_word_rank_1_desc_0.png"
       alt="Histogram for the word, software, in the job description text"
       width="auto">
</div>
<br/>

<br/>
<div>
  <img src="application_word_rank_2_desc_0.png"
       alt="Histogram for the word, software, in the job application text"
       width="auto">
</div>
<br/>
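
Editing the img src and img alt values by hand gets tedious, so here is a hedged sketch of building the same HTML snippet from the filename and alt text. The helper name `make_img_div` is hypothetical; only the standard-library `html.escape` call is relied on.

```python
import html


def make_img_div(src, alt):
    """Build the same <div><img ...></div> snippet used in the cells above."""
    return (
        "<br/>\n<div>\n"
        f'  <img src="{html.escape(src, quote=True)}"\n'
        f'       alt="{html.escape(alt, quote=True)}"\n'
        '       width="auto">\n'
        "</div>\n<br/>"
    )


example_html = make_img_div(
    "description_word_rank_1_desc_0.png",
    "Histogram for the word, software, in the job description text",
)
print(example_html)
```

In a notebook, the returned string could be wrapped in `IPython.display.HTML` and shown with `IPython.display.display`, which would skip the double-click-and-edit step.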

In [None]:
#  Code to look for the top 25 (probably more) application
#+ (résumé) words (down to appearing 3 times -- not
#+ including 2) and see where they appear in the job
#+ descriptions (if at all.)

## Time for top-25 histograms (or whatever the discretized version is)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Next line only for Jupyter notebook.
%matplotlib inline

def get_histo_from_freq_dict(word_count_ordered_dict,
                             n_top_words = 25,
                             do_show_word_and_count_lists=False,
                             axx=None
                            ):
    '''
    @return  an axis from matplotlib (with the object - histogram - in it)
    '''
    
    if axx is None:
        fig = plt.figure(figsize=(10, 3))
        axx = fig.add_subplot(111)
    
    counts_pre = list(word_count_ordered_dict.values())
    words_pre  = list(word_count_ordered_dict.keys())
    
    counts = counts_pre[:n_top_words]
    words  = words_pre[:n_top_words]
    
    ## making sure things were working
    if do_show_word_and_count_lists:
        print(f"counts: {counts}")
        print(f"words:  {words}")
    ##endof:  if do_show_word_and_count_lists
    
    x_words_coords = np.arange(len(words))
    axx.bar(x_words_coords, counts, align='center')
    
    axx.set_xticks(x_words_coords)
    axx.set_xticklabels(words, rotation=45, ha='right')
    
    return axx
    
##endof:  get_histo_from_freq_dict
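
The function above expects a dictionary whose keys are already ordered from most to least frequent. As a reminder of what such an input looks like, here is a toy sketch using `collections.Counter`. This is only an assumption about the input's shape, not necessarily how the pickled word counts were built in the earlier part.

```python
from collections import Counter

# Toy text (hypothetical); the real text comes from the pickled variables.
toy_text = "data software data team software data"

# most_common() sorts by descending count; dict() keeps that order.
toy_ordered_counts = dict(Counter(toy_text.split()).most_common())
print(toy_ordered_counts)  # {'data': 3, 'software': 2, 'team': 1}

# The same kind of slicing the function does, with n_top_words = 2.
top_two = dict(list(toy_ordered_counts.items())[:2])
print(top_two)  # {'data': 3, 'software': 2}
```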

In [None]:
get_histo_from_freq_dict(description_word_counts[desc_fname_idx_to_show], 
                         do_show_word_and_count_lists=True)

desc_top_25_hist_fname = "top_25_description_words.png"
plt.savefig(desc_top_25_hist_fname,
            bbox_inches='tight')

plt.show()

In [None]:
get_histo_from_freq_dict(application_word_counts[0],
                         do_show_word_and_count_lists=True)

appl_top_25_hist_fname = "top_25_application_words.png"
plt.savefig(appl_top_25_hist_fname,
            bbox_inches='tight')

plt.show()

## Output for Description and Application:

### &lt;FILL THIS IN&gt;

### Done

In [None]:
# #######################
# # No need to run again
# #####
!powershell -c (Get-Date -UFormat "%s_%Y%m%dT%H%M%S%Z00") -replace '[.][0-9]*_', '_'

The output when I actually did this was

```
1692569216_20230820T220656-0600
```
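
For anyone not on Windows, a rough Python sketch of a timestamp in the same layout follows. The helper name `make_timestamp` is made up. One caveat: the epoch part here is the true Unix time for a timezone-aware moment, so it will not reproduce the recorded value above, which appears to read the local wall-clock time as if it were UTC (the two differ by exactly six hours).

```python
from datetime import datetime, timedelta, timezone


def make_timestamp(moment):
    """Format a timezone-aware datetime as <epoch>_<YYYYmmddTHHMMSS><UTC offset>."""
    return f"{int(moment.timestamp())}_{moment.strftime('%Y%m%dT%H%M%S%z')}"


# Current local time, made timezone-aware.
print(make_timestamp(datetime.now().astimezone()))

# A fixed moment for illustration: 2023-08-20 22:06:56 at UTC-06:00.
fixed_moment = datetime(2023, 8, 20, 22, 6, 56,
                        tzinfo=timezone(timedelta(hours=-6)))
fixed_stamp = make_timestamp(fixed_moment)
print(fixed_stamp)  # 1692590816_20230820T220656-0600
```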

### Change the img src values and img alt values, then see the histograms together

You might need to double-click on the image to get the html source.

In [None]:
print("  img src values for the two images:")
print(f'"{desc_top_25_hist_fname}"')
print(f'"{appl_top_25_hist_fname}"')

print()
print("  img alt values for the two images:")
wd_count_alt_text_1 = '"The histogram for the job description with word frequencies"'
wd_count_alt_text_2 = '"The histogram for the job application with word frequencies"'

print(wd_count_alt_text_1)
print(wd_count_alt_text_2)

The output histograms, stacked for easier viewing.

_Remember that you might need to double click on the images to change the img src and img alt values._

<br/>
<div>
  <img src="top_25_description_words.png"
       alt="The histogram for the job description with word frequencies"
       width="auto">
</div>
<br/>

<br/>
<div>
  <img src="top_25_application_words.png"
       alt="The histogram for the job application with word frequencies"
       width="auto">
</div>
<br/>

Sometimes, I'll grab a screenshot of the above two images and draw green lines between words that match. However, since I added the view of the match and three surrounding words, this step hasn't seemed as vital.

If this is going to happen, double click on this cell to see the now-commented HTML, get your saved filename, change the HTML accordingly, and uncomment everything. (HTML comments start with `<!--` and end with `-->`.)

<!--
<br/>
<div>
  <img src="word_frequency_plots_w_link_lines.jpg"
       alt="Word matches for the pair of histograms."
       width="100%">
</div>
<br/>
-->
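
The matches that those hand-drawn lines mark can also be listed in code. Here is a minimal sketch, assuming both word-count dictionaries are ordered by frequency. The helper name `shared_top_words` and the toy counts are hypothetical.

```python
def shared_top_words(description_counts, application_counts, n_top_words=25):
    """List (word, description rank, application rank) for words in both top lists."""
    desc_top = list(description_counts)[:n_top_words]
    appl_top = list(application_counts)[:n_top_words]

    # Map each application word to its 1-based rank for quick lookup.
    appl_rank = {word: rank for rank, word in enumerate(appl_top, start=1)}

    return [(word, rank, appl_rank[word])
            for rank, word in enumerate(desc_top, start=1)
            if word in appl_rank]


# Toy data (hypothetical counts).
toy_desc = {"software": 9, "data": 7, "team": 4}
toy_appl = {"python": 8, "software": 6, "data": 2}
toy_matches = shared_top_words(toy_desc, toy_appl)
print(toy_matches)  # [('software', 1, 2), ('data', 2, 3)]
```

Each tuple corresponds to one line that would be drawn between the two histograms.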

## Future Steps

- Look at ranking, counts, percentage, etc. for FamilySearch's (job description's) top 25 words as found in my (job application's) word counts, then vice-versa. 
  - Code setup completed 2023-08-20. Putting all 25 in would make a very busy display, so I just did a few.
- Get rid of words that are necessary for grammar, but which don't matter too much in determining whether the two documents match up. (Found term on 2023-08-07. It's "stopwords".)
  - Completed 2023-08-09

**Some new future steps**

- Do word counts for the pair of top-25 lists, but then also compute the fraction of the whole (non-stopword) text that each word makes up.
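
A first sketch of that fraction step might look like this. The helper name `word_fractions` and the toy counts are made up, and it assumes the dictionary passed in already has the stopwords removed, so that its total is the non-stopword word count.

```python
def word_fractions(ordered_word_counts, n_top_words=25):
    """Return {word: fraction of all counted words} for the top words."""
    # The denominator is the whole text, not just the top words.
    total = sum(ordered_word_counts.values())
    if total == 0:
        return {}
    top_items = list(ordered_word_counts.items())[:n_top_words]
    return {word: count / total for word, count in top_items}


# Toy data (hypothetical counts).
toy_counts = {"software": 6, "data": 3, "team": 1}
toy_fractions = word_fractions(toy_counts, n_top_words=2)
print(toy_fractions)  # {'software': 0.6, 'data': 0.3}
```

Fractions make the description and the application comparable even when the two texts have very different lengths.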