## Resume Analysis

In this activity, you will write a Python script to analyze a resume text file.

### Instructions

* Read the resume file as text using the `with` statement.

* Create a list containing all words in the resume.

  * Convert each word to lowercase to normalize the data.

* Use `split` to remove any trailing punctuation so only words remain.

* Create a set of unique words from the resume using `set()`.

* Use set operations to filter out all remaining punctuation from the set of words.

  * Create a set from `string.punctuation` to use in the difference operation.

* Use the cleaned set (no punctuation) to find all of the words from the resume that match the required skills.

* Use the cleaned set (no punctuation) to find all of the words that match the desired skills.
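
The steps above can be sketched as follows. This is a minimal illustration, not the reference solution: `sample_text` is a made-up string standing in for the resume file, and `str.strip` is used here in place of `split` to trim punctuation from the ends of each word.

```python
import string

# Hypothetical sample text standing in for the contents of the resume file.
sample_text = "Skills: Python, Excel, MySQL, Git & HTML / CSS."

# Lowercase each word, then trim punctuation attached to either end.
# `or word` keeps a token that consists only of punctuation (such as "&"),
# so the set difference below has something to remove.
words = []
for word in sample_text.lower().split():
    words.append(word.strip(string.punctuation) or word)

# Unique words, then a set difference to drop stand-alone punctuation marks.
unique_words = set(words)
punctuation = set(string.punctuation)
cleaned_words = unique_words - punctuation

# Set intersection finds the skills that appear in the resume.
required_skills = {"excel", "python", "mysql", "statistics"}
desired_skills = {"r", "git", "html", "css", "leaflet"}

matched_required = cleaned_words & required_skills
matched_desired = cleaned_words & desired_skills

print(sorted(matched_required))  # ['excel', 'mysql', 'python']
print(sorted(matched_desired))   # ['css', 'git', 'html']
```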

In [85]:
# load dependencies
import os
import string

# read the text of a resume file and pass it to the cleaner function, returning a cleaned
# list of words from the resume
def LoadResume(resumefile="resume.md"):
    
    # build the path from the resumefile argument rather than a hard-coded file name
    filepath = os.path.join("ResumeHolder", resumefile)
    
    with open(filepath, 'r') as resume:
        raw_text = resume.read()
    
    # split raw text by spaces 
    unprocessed_wordlist = raw_text.split()
    
    # call the cleaner function
    processed_wordlist = CleanResume(unprocessed_wordlist)
    
    # return the cleaned list of resume words
    return processed_wordlist


# a function that takes in an unprocessed list of words, lowercases each one, and keeps only
# the text before the first comma, period, or slash; words not starting with a letter are dropped
def CleanResume(dirty):
    
    # split by common punctuation marks and take the word portion of the output
    clean = [word.lower().split(',')[0].split('.')[0].split('/')[0] 
             for word in dirty if word[0] in string.ascii_letters]
        
    return clean


# define a function that takes in a cleaned list of resume words, required skills, and desired skills
# and outputs matches and misses 
def MatchSkills(candidate_skills,
                req_skills = {"excel", "python", "mysql", "statistics"},
                des_skills = {"r", "git", "html", "css", "leaflet"}):
    
    candidate_skills = set(candidate_skills)
    
    matched_req = req_skills & candidate_skills
    missing_req = req_skills - candidate_skills
    
    matched_des = des_skills & candidate_skills
    missing_des = des_skills - candidate_skills
    
    print(f"The candidate has the following required skills:  {list(matched_req)}.\n" +
          f"The candidate is missing the following required skills:  {list(missing_req)}.\n\n" +
          f"The candidate has the following desired skills:  {list(matched_des)}.\n" + 
          f"The candidate is missing the following desired skills:  {list(missing_des)}.")
    
    return

In [86]:
# load the resume and report the matched and missing skills
cur_resume = LoadResume()

MatchSkills(cur_resume)

The candidate has the following required skills:  ['mysql', 'python', 'excel', 'statistics'].
The candidate is missing the following required skills:  [].

The candidate has the following desired skills:  ['css', 'r', 'git', 'html', 'leaflet'].
The candidate is missing the following desired skills:  [].
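
Note that `CleanResume` keeps only the text before the first comma, period, or slash, so a token such as `html/css` contributes `html` but not `css`. One possible variant, sketched below, keeps every piece instead. The function name `clean_tokens` and the sample tokens are hypothetical, and unlike `CleanResume` this sketch does not filter out tokens that start with a non-letter.

```python
import re

def clean_tokens(tokens):
    """Lowercase each token and split it on commas, periods, and slashes,
    keeping every non-empty piece rather than only the first one."""
    cleaned = []
    for token in tokens:
        for piece in re.split(r"[,./]", token.lower()):
            if piece:  # skip empty strings left by trailing punctuation
                cleaned.append(piece)
    return cleaned

print(clean_tokens(["HTML/CSS,", "Python.", "d3.js"]))
# ['html', 'css', 'python', 'd3', 'js']
```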


#### Bonuses

* Count the number of occurrences of each word in the resume and print the top 10 most frequent words.

  * Use a dictionary data structure to hold the counts for each word.

  * Make sure to remove punctuation and [stop words](https://en.wikipedia.org/wiki/Stop_words).
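
As a rough sketch of the counting step, a plain dictionary can be built with `dict.get` and then sorted by value. The `words` list and the two-word `stop_words` set below are illustrative placeholders, not the actual resume data.

```python
# Hypothetical cleaned word list standing in for the resume words.
words = ["data", "python", "and", "data", "sql", "with", "python", "data"]
stop_words = {"and", "with"}  # illustrative subset of stop words

# Count each word that is not a stop word.
counts = {}
for word in words:
    if word in stop_words:
        continue
    counts[word] = counts.get(word, 0) + 1

# Sort the (word, count) pairs by count, highest first, and keep the top n.
n = 2
top_words = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]

print(top_words)  # [('data', 3), ('python', 2)]
```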

In [125]:
def WordCount(wordlist, 
              stop_words = {"and", "with", "using", "##", "working", "in", "to"},
              top = False, n=10):
    
    # convert to a set to grab the unique values, drop the stop words, and convert back to a list
    unique_list = list(set(wordlist) - stop_words)
    
    # create a dictionary where keys are resume words and values are counts
    # default count value is 0
    wordcount = dict.fromkeys(unique_list, 0)
    
    # loop through word list and increase count for every encounter
    for word in wordlist:
        
        # stop words are not keys in wordcount, so they are skipped
        if word in wordcount:
            wordcount[word] += 1
        
    # if the user specifies they want only the top n words, call a custom function
    if top:
        
        topwords = TopWords(wordcount, n)
        
        return topwords
    
    else:
        
        return wordcount

# define a function that grabs the top n words from a wordcount dictionary
def TopWords(wordcount, n):
    
    # sort the wordcount keys by value in descending order and cut the list off at n words
    top = sorted(wordcount, key=wordcount.get, reverse=True)[:n]
    
    # reconstruct a dictionary by zipping the wordcount list with a corresponding frequency value list
    top_dict = dict(zip(top, [wordcount[w] for w in top]))
    
    return top_dict

In [130]:
WordCount(cur_resume)

{'stein': 1,
 'the': 2,
 'frank': 1,
 'cloud': 1,
 'apis': 1,
 'hadoop': 2,
 'developing': 1,
 'designing': 1,
 'social': 2,
 'mysql': 1,
 'experience': 1,
 'intelligence': 1,
 'visualizations': 1,
 'python': 4,
 'pandas': 1,
 'statistics': 2,
 'modeling': 1,
 'aws': 1,
 'tableau': 2,
 'advanced': 1,
 'javascript': 2,
 'r': 1,
 'creating': 1,
 'excel': 2,
 'education': 1,
 'boot': 1,
 'd3': 2,
 'vba': 1,
 'front-end': 1,
 'mongodb': 1,
 'forecasting': 1,
 'microsoft': 1,
 'open-source': 1,
 'performing': 1,
 'git': 1,
 'learning': 2,
 'html': 3,
 'n': 1,
 'writing': 1,
 'interests': 1,
 'css': 2,
 'skills': 1,
 'databases': 1,
 'bootstrap': 1,
 'web': 2,
 'data': 7,
 'analytics': 3,
 'algorithms': 1,
 'contributing': 1,
 'software': 2,
 'api': 1,
 'scripts': 2,
 'machine': 2,
 'graduate': 1,
 'tables': 1,
 'mining': 2,
 'sets': 1,
 'big': 2,
 'media': 2,
 'leaflet': 1,
 'files': 1,
 'analyze': 1,
 'sql': 1,
 'visualization': 2,
 'pivot': 1,
 'business': 1,
 'interactions': 1,
 'basic':

In [128]:
WordCount(cur_resume, top=True, n=15)

{'data': 7,
 'python': 4,
 'html': 3,
 'analytics': 3,
 'the': 2,
 'hadoop': 2,
 'social': 2,
 'statistics': 2,
 'tableau': 2,
 'javascript': 2,
 'excel': 2,
 'd3': 2,
 'learning': 2,
 'css': 2,
 'web': 2}

In [131]:
WordCount(cur_resume, top=True)

{'data': 7,
 'python': 4,
 'html': 3,
 'analytics': 3,
 'the': 2,
 'hadoop': 2,
 'social': 2,
 'statistics': 2,
 'tableau': 2,
 'javascript': 2}

#### Hints

* Carefully consider when to use a dictionary versus a set: a set holds only unique elements, while a dictionary can associate each unique key with a value, such as a count of non-unique elements.
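
To make the distinction concrete, here is a small sketch with a made-up word list. It uses `collections.Counter`, a dictionary subclass from the standard library, purely for brevity.

```python
from collections import Counter

# Hypothetical word list containing repeated (non-unique) elements.
words = ["python", "sql", "python", "excel", "python"]

# A set collapses duplicates: good for membership tests and set operations.
unique_words = set(words)

# A dictionary keeps one key per unique word and stores its count as the value.
word_counts = Counter(words)

print("python" in unique_words)  # True
print(len(unique_words))         # 3
print(word_counts["python"])     # 3
```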