# Resume Analysis
_**HARD: This is a curveball assignment. Plus, this is Python without Pandas.**_

#### The objective of this assignment is for you to explain what is happening in each cell in clear, understandable language. 

#### _There is no need to code._ The code is there for you, and it already runs. Your task is only to explain what each line in each cell does.

#### The placeholder cells should describe what happens in the cell below it.

The cell below imports `os` because the `os.path.join` function is used to build the path to the resume file. The `string` module is needed because later in the script `string.punctuation` is used to detect and remove punctuation symbols. `Counter`, from `collections`, is used later to tally the words: each word becomes a dictionary key and its count becomes the value.

In [None]:
import os
import string
from collections import Counter

The cell below establishes the path to the resume file to be analyzed and creates one set for the required skills and one for the desired skills we are searching for.

In [None]:
# Paths
resume_path = os.path.join(".", 'resume.md')

# Skills to match
REQUIRED_SKILLS = {"excel", "python", "mysql", "statistics"}
DESIRED_SKILLS = {"r", "git", "html", "css", "leaflet", "modeling"}
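
Storing the skills as sets is what makes the later matching step a one-line operation. The sketch below illustrates this with a made-up `resume_words` set (an assumption for illustration, not the contents of the real resume).

```python
# Skill set copied from the cell above
REQUIRED_SKILLS = {"excel", "python", "mysql", "statistics"}

# Hypothetical resume words, for illustration only
resume_words = {"python", "excel", "tableau"}

# `&` keeps only the items found in both sets
matched = resume_words & REQUIRED_SKILLS
# `-` keeps the required skills that the resume does not mention
missing = REQUIRED_SKILLS - resume_words

print(sorted(matched))
print(sorted(missing))
```

Sets are unordered, so `sorted()` is used here only to make the printed order predictable.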

The cell below defines a function that loads a file from a given filepath. It opens the file for reading inside a `with` block (so the file is closed automatically), reads the content into one string, converts it to lowercase, and splits it on whitespace into a list of strings, which it returns.

In [None]:
def load_file(filepath):
    # Helper function to read a file and return the data.
    with open(filepath, "r") as resume_file_handler:
        resume_contents = resume_file_handler.read()
        resume_contents = resume_contents.lower()
        resume_tokens = resume_contents.split()
        return resume_tokens
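
To see what the lowercase and split steps produce, here is a minimal sketch that applies them to a short string instead of a file. The `sample_text` value is hypothetical and only stands in for the contents of `resume.md`.

```python
# Hypothetical sample text standing in for the contents of resume.md
sample_text = "## Skills\n* Python, Excel and MySQL"

# Same two steps as load_file: lowercase everything, then split on whitespace
tokens = sample_text.lower().split()

print(tokens)
```

Note that `split()` only separates on whitespace, so the comma stays attached to `python,` and the Markdown symbols `##` and `*` become tokens of their own.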

The cell below runs the `load_file` function just defined on the resume file, using the filepath set at the top of this program. It assigns the returned list of lowercase strings (`resume_tokens` inside the function) to `word_list`.

In [None]:
# Grab the text for a Resume
word_list = load_file(resume_path)

The cell below creates a set from the words in the resume so that set operations can be performed, mainly finding the intersection between the skills we are searching for and the body of the text. We iterate through the list of words in the resume and add each one to the set, which also discards duplicates. We then create a set out of the pre-defined punctuation characters so that they can be subtracted from the set of words, eliminating single-character tokens such as `*` but not `**`. We use `\n` to start a new line between the printed word lists.

In [None]:
# Create a set of unique words from the resume
resume = set()

# HINT: Single elements in a programming language are often referred to as tokens
for token in word_list:
    resume.add(token)

print('\nWORDS BEFORE REMOVING PUNCTUATION')
print(resume)

# Remove Punctuation that were read as whole words
punctuation = set(string.punctuation)
# HINT: Attributes that are in `resume` that are not in `punctuation` (difference)
resume = resume - punctuation

print('\nWORDS AFTER REMOVING PUNCTUATION')
print(resume)
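
The set difference only removes tokens that are exactly one punctuation character, because `set(string.punctuation)` holds one element per character. A small sketch with made-up tokens (an assumption, not the real resume) shows the effect.

```python
import string

# Hypothetical tokens, similar to what a Markdown resume might produce
tokens = {"python", "*", "**", "-", "excel,"}

# One set element per punctuation character: "!", "*", "-", ...
punctuation = set(string.punctuation)

# Only exact matches are removed: "*" and "-" go, "**" and "excel," stay
cleaned = tokens - punctuation

print(sorted(cleaned))
```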

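We use set intersection (`&`) between our set of resume words and `REQUIRED_SKILLS`, then `DESIRED_SKILLS`, to find the words that appear in both. The word-level cleaning step then drops every token that is itself a punctuation character, such as `*`. Because `word not in string.punctuation` is a substring test against one long string, repeated symbols such as `**` and `##` pass through this step. The character-level cleaning step strips any punctuation left inside a token, so a word with a comma on the end loses the comma, while `**` and `##` collapse to an empty string. The program then removes the words in the `stop_words` list, which are irrelevant to skills. The contents of this list can be changed if needed. Note that the `"##"` entry can no longer match anything by this point, because the previous step already reduced that token to an empty string. Each cleaning step builds a new list and reassigns it to `word_list`, until we are left with the punctuation-free words we want to analyze, plus any blank tokens.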

In [None]:
# Calculate the Required Skills Match using Set Intersection
print('REQUIRED SKILLS')
print(resume & REQUIRED_SKILLS)

# Calculate the Desired Skills Match using Set Intersection
print('DESIRED SKILLS')
print(resume & DESIRED_SKILLS)


# Word Punctuation Cleaning
word_list = [word for word in word_list if word not in string.punctuation]
print('\nWORD LIST AFTER PUNCTUATION REMOVAL')
print(word_list)

# Character Punctuation Cleaning
word_list = [''.join(char for char in word if char not in string.punctuation) for word in word_list]
print('\nWORD LIST AFTER CHARACTER PUNCTUATION REMOVAL')
print(word_list)

# Clean Stop Words
stop_words = ["and", "with", "using", "##", "working", "in", "to"]
word_list = [word for word in word_list if word not in stop_words]
print('\nWORD LIST AFTER STOP WORDS')
print(word_list)
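
The three cleaning steps can be traced on a handful of tokens. The sketch below uses a hypothetical token list and a shortened stop-word list (both assumptions) to show where the blank tokens come from.

```python
import string

# Hypothetical tokens, as they might look after load_file
tokens = ["**", "*", "python,", "##", "and"]

# Word-level step: `in` on a string is a substring test, so "*" is dropped,
# but "**" and "##" are not substrings of string.punctuation and survive
word_cleaned = [t for t in tokens if t not in string.punctuation]

# Character-level step: strip punctuation characters inside each token;
# "**" and "##" become empty strings
char_cleaned = [
    "".join(c for c in t if c not in string.punctuation) for t in word_cleaned
]

# Stop-word step: "and" is removed; "##" never matches because it is already ""
stop_words = ["and", "##"]
final = [t for t in char_cleaned if t not in stop_words]

print(word_cleaned)
print(char_cleaned)
print(final)
```

One possible way to drop the blanks at this stage would be to add `""` to `stop_words`, although the notebook instead skips them when printing the top words.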

The cell below initializes the `word_count` dictionary with `fromkeys()`, using each unique word in `word_list` as a key and 0 as its starting value. The loop then counts how many times each word occurs by iterating through the list and adding one to that word's entry for every occurrence.

The `for` loop builds the tallies by hand, whereas `Counter` does the same tally in a single call and stores each word as a key with its count as the value. The comparison prints `True` when both approaches produce the same counts. The last two lines print the heading for the top-10 list that follows.

In [None]:
# Resume Word Count
# ==========================
# Initialize a dictionary with default values equal to zero
word_count = {}.fromkeys(word_list, 0)

# Loop through the word list and count each word.
for word in word_list:
    word_count[word] += 1
# print(word_count)

# Bonus using collections.Counter
word_counter = Counter(word_list)
# print(word_counter)

# Comparing both word count solutions
print(word_count == word_counter)

# Top 10 Words
print("Top 10 Words")
print("=============")
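
The two counting approaches can be compared on a tiny list. The `words` list below is made up for illustration. `most_common` is a standard `Counter` method, shown here as an alternative the notebook itself does not use.

```python
from collections import Counter

# Hypothetical word list, including one blank token
words = ["python", "excel", "python", "", "python"]

# Manual tally: every unique word starts at 0, then gets +1 per occurrence
word_count = dict.fromkeys(words, 0)
for w in words:
    word_count[w] += 1

# Counter performs the same tally in one call
word_counter = Counter(words)

# Counter is a dict subclass, so it compares equal to a plain dict
# holding the same key/value pairs
print(word_count == word_counter)

# Counter can also report the highest counts directly
print(word_counter.most_common(1))
```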

The cell below sorts the words in `word_count` by their counts, from highest to lowest because `reverse` is set to `True`. The sorted list is sliced to its first 10 entries by `[:10]`. To skip the blank token, the loop checks `if len(word) > 0:` before the print statement. Because this check runs after the slice, fewer than 10 words are printed whenever the blank token is among the top 10.

In [None]:
# Sort words by count and print the top 10
sorted_words = []
for word in sorted(word_count, key=word_count.get, reverse=True)[:10]:
    if len(word) > 0:
        print(f"Token: {word:20} Count: {word_count[word]}")
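
If a full list of top words is wanted even when a blank token ranks highly, one option is to filter before slicing. This is a sketch of that variation, not the notebook's own approach. The counts and the `top_n` name are hypothetical.

```python
# Hypothetical counts, with the blank token ranked first
word_count = {"": 5, "python": 3, "excel": 2, "sql": 1}
top_n = 2  # stands in for the 10 used in the notebook

# Notebook order: slice first, then skip blanks while printing
sliced_first = sorted(word_count, key=word_count.get, reverse=True)[:top_n]
printed = [w for w in sliced_first if len(w) > 0]

# Variation: drop blanks first, then sort and slice
non_blank = [w for w in word_count if w]
top_words = sorted(non_blank, key=word_count.get, reverse=True)[:top_n]

print(printed)    # one word short, because the blank took a slot
print(top_words)  # a full top_n list
```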