## Task 3: Summarization and comparison of résumés + job postings


### CoNVO

**Context:** Bloc is a career services management platform that builds smart career and data management tools for job-seekers and the organizations serving them. In particular, Bloc seeks to provide and facilitate access to tools for effectively presenting job-seekers' credentials and matching them to employers' job postings, and thereby improve outcomes.

**Need:** Career advisors are expected to provide personalized recommendations and assistance to many job-seekers at any given time. To accomplish that, they need to stay up to date on the current status of particular job markets, and understand how their group of advisees fits into that space.

**Vision:** Automated summarization of the contents of a group of job-seekers' résumés and, separately, open job postings in a particular field(s), along with a quantitative and/or visual representation of the similarity between a set of résumés and job postings, with easy-to-understand outputs for non-technical viewers.

**Outcome:** A standalone, proof-of-concept process for summarizing small- to medium-sized collections of text. A separate POC for comparing sets of résumés to sets of job postings in a quantitative and/or visual manner.


### Data Summary

A collection of ~125 (+ ~2400) résumés as text extracted from PDFs (see Task 1) as well as ~4800 job postings as JSON fetched from external APIs (see Task 2).


### Proposed Methodology

Unsupervised summarization of many texts is a common and well-studied problem, so there's no need for us to reinvent the wheel. Topic modeling (e.g. LDA) is a good bet if you have enough texts and don't require fine detail. Ranking the most important words/terms by frequency or a more sophisticated metric (e.g. tf-idf, BM25) works well for smaller corpora, but doesn't necessarily reveal relationships between concepts. Graph-based methods for extractive summarization of key words/terms/sentences (e.g. TextRank) could also work here.
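
To make the term-ranking option concrete, here is a minimal sketch of scoring terms by tf-idf and summing the scores over a small corpus. It uses only the standard library; the function names (`tokenize`, `top_terms_by_tfidf`), the crude regex tokenizer, the smoothed idf formula, and the toy documents are all assumptions for illustration, not part of `msvdd_bloc`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out word-like tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z][a-z0-9+#]*", text.lower())


def top_terms_by_tfidf(docs, n=10):
    """Return the n (term, score) pairs with the highest tf-idf summed over all docs."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # document frequency: number of docs in which each term appears at least once
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue  # skip empty docs to avoid dividing by zero
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)  # term frequency, normalized by doc length
            # smoothed inverse document frequency (one common variant, assumed here)
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            scores[term] += tf * idf
    return scores.most_common(n)


# made-up stand-ins for résumé texts
toy_docs = [
    "python java python sql",
    "java selenium testing",
    "python sql testing python",
]
for term, score in top_terms_by_tfidf(toy_docs, n=3):
    print(f"{term}: {score:.3f}")
```

On real résumés or postings, you would likely want stop-word removal and multi-word terms before trusting the ranking; a library implementation of tf-idf or BM25 would be a reasonable replacement for this hand-rolled version.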

Comparing two collections of texts is potentially trickier. An explicit comparison could be made by identifying words/terms that are most important in one corpus and least important in the other (and vice versa), while an implicit comparison could just show the top words/terms for each corpus side-by-side. Comparing topic models is generally _not_ a valid method. Visualizing the comparison is probably the most important aspect of this task; we want to make it intuitive and interpretable.
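
One way to sketch the explicit comparison is to score each term by the log-ratio of its smoothed relative frequency in the two corpora: large positive values mark terms characteristic of the first corpus, large negative values those of the second. This is just one of several reasonable scoring choices. The function names, the additive smoothing parameter `alpha`, and the toy data below are assumptions for illustration.

```python
import math
import re
from collections import Counter


def term_counts(docs):
    """Count word-like tokens over all docs in a corpus (crude regex tokenizer)."""
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(r"[a-z][a-z0-9+#]*", doc.lower()))
    return counts


def distinctive_terms(docs_a, docs_b, n=5, alpha=1.0):
    """Return (top terms for corpus A, top terms for corpus B) by smoothed log-ratio."""
    counts_a, counts_b = term_counts(docs_a), term_counts(docs_b)
    vocab = set(counts_a) | set(counts_b)
    # additive smoothing keeps terms missing from one corpus from producing log(0)
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    log_ratio = {
        term: math.log((counts_a[term] + alpha) / total_a)
        - math.log((counts_b[term] + alpha) / total_b)
        for term in vocab
    }
    # sort from most A-like to most B-like; ties are broken alphabetically
    ranked = sorted(vocab, key=lambda term: (-log_ratio[term], term))
    return ranked[:n], ranked[::-1][:n]


# made-up stand-ins for résumés and job postings
toy_resumes = ["java selenium intern", "java python intern"]
toy_postings = ["python aws senior", "aws kubernetes senior python"]
resume_terms, posting_terms = distinctive_terms(toy_resumes, toy_postings, n=2)
print("more typical of résumés:", resume_terms)
print("more typical of postings:", posting_terms)
```

The two ranked lists map naturally onto a diverging bar chart or a side-by-side table, which may be easier for non-technical viewers to read than raw scores.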


### Definitions of Success

This entire task is icing on the cake. Any concrete outputs will be considered a success!


### Risks

We may not have enough texts for some unsupervised methods (namely, topic modeling and any deep learning methods).

## Source Code

In [1]:
%load_ext watermark

### Getting Started

In [2]:
import msvdd_bloc

In [3]:
%watermark -v -iv

CPython 3.7.4
IPython 7.8.0


In [4]:
resumes_fpath = "/Users/burtondewilde/Desktop/datakind/bloc/msvdd_Bloc/data/resumes/fellows_resumes.zip"
for fname, text in msvdd_bloc.data.fileio.load_text_files_from_zip(resumes_fpath):
    # you may want to use the same preprocessing as in task 1
    # but maybe not ;)
    # text = preprocess_resume_text(text)
    break  # just stopping here so we can test things out
print(text[:1000], "...")

Candice Williams
6141 Payne Mountains Suite 436 New Victoriashire, NC 51439   |   001-021-517-1134x560   |   kbarker@gmail.com

EXPERIENCE 
Intuit, ​Software Engineer Intern 
Mountain View, CA           May 2018 – Present 

● Backend engineer speeding login times for Intuit Online Payroll customers using Java
● Testing comprehensive user interfaces on the frontend using automation technologies such as Selenium

and ReactJS based frameworks
U.S. Bank, ​Software Engineer Intern 
St. Paul, MN    May 2017 – August 2017 

● Worked with the software automation team using dynamic technologies such as .NET framework, C#, Visual
Studio, and Angular 2 Javascript Framework to develop automated software used by U.S. Bank engineers

● Created and maintained automation tools used to eliminate manual tasks otherwise performed by individuals
● Created comprehensive unit test plans and test cases in Selenium and javascript frameworks
● Worked on user testing, project management, error handling, and bac

In [5]:
postings_fpath = "/Users/burtondewilde/Desktop/datakind/bloc/msvdd_Bloc/data/postings/github_jobs.json"
postings = msvdd_bloc.data.fileio.load_json(postings_fpath)
print("# postings =", len(postings))
postings[:1]

# postings = 67


[{'id': '3e378b91-7cc0-486e-b18d-518ab29569e3',
  'type': 'Full Time',
  'url': 'https://jobs.github.com/positions/3e378b91-7cc0-486e-b18d-518ab29569e3',
  'created_at': 'Wed Sep 11 18:59:59 UTC 2019',
  'company': 'Simon & Schuster ',
  'company_url': 'https://www.simonandschuster.com/',
  'location': 'New York, NY',
  'title': 'Lead Software Engineer',
  'description': '<p>Simon &amp; Schuster is seeking a Lead Software Engineer to join a rapidly growing team focused on impacting the world of publishing through research and innovation. Working with a team of data scientists, data engineers, designers, and domain experts, you will be involved in rapidly prototyping, developing, and deploying the platforms that put insights and information into the hands of decision-makers.</p>\n<p>From databases to serverless applications, you will be designing, deploying , and managing the systems that connect our real-world data to the books and authors that inform and entertain our world. As a Full