 # NLTK Resume Tool
 
 1. Scrape the job description with BeautifulSoup.
 2. Process the text with nltk, returning a list of parts of speech and word frequencies.
 3. Select the best match from a list of possible resume statements.
 4. Construct the resume in LaTeX or Microsoft Word.
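
One simple way step 3 could work, sketched under assumptions: score each candidate resume statement by the job-description keywords it contains, weighted by keyword frequency, and keep the highest scorer. The function name `best_match`, the keyword counts, and the sample statements are all hypothetical, and plain word overlap is only a stand-in for whatever matching rule the finished tool ends up using.

```python
import re
from collections import Counter

def best_match(keyword_counts, statements):
    """Return the statement whose words overlap most with the weighted keywords."""
    def score(statement):
        # Lowercase and split into alphabetic words before comparing
        words = set(re.findall(r"[a-z]+", statement.lower()))
        return sum(count for word, count in keyword_counts.items() if word in words)
    return max(statements, key=score)

# Hypothetical keyword frequencies, e.g. verbs counted from a job description
keyword_counts = Counter({"develop": 3, "identify": 3, "communicate": 3, "manage": 2})

# Hypothetical candidate resume statements
statements = [
    "Manage a small lab inventory",
    "Develop dashboards and communicate results to stakeholders",
    "Wrote weekly status reports",
]

chosen = best_match(keyword_counts, statements)
print(chosen)
```

Exact word overlap misses inflected forms such as "developed", so a real version would likely stem or lemmatize both sides first.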

### Import the python scientific suite

In [56]:
#imports
# These make Python 2 behave like Python 3 for print and division; they must come first.
from __future__ import print_function
from __future__ import division

import numpy as np
import matplotlib.pyplot as plt

import pandas as pd
from pandas.plotting import parallel_coordinates #formerly pandas.tools.plotting
pd.set_option('display.max_columns', None)

import sklearn
import seaborn as sns
import matplotlib as mpl
import scipy

import itertools

import statsmodels.formula.api as smf
from scipy.optimize import curve_fit
import scipy.signal

from gatspy.periodic import LombScargleFast, LombScargleMultibandFast, LombScargle

from collections import defaultdict

from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split #formerly sklearn.cross_validation
from sklearn.preprocessing import StandardScaler
from sklearn import svm

#plotting options
%matplotlib inline
sns.set()
sns.set_style('ticks')
sns.set_context('paper', font_scale = 1.5)
sns.set_palette('husl')



### Import nltk, re, BeautifulSoup, and requests

In [101]:
import nltk #natural language toolkit
import re #regular expressions
from bs4 import BeautifulSoup #for web scraping
import requests #requests for pulling html data
#nltk.download() #uncomment this to download the required nltk resources that are not in the anaconda package.

#Or in the shell: python -m nltk.downloader all

Let's pull the HTML of a job description:

In [117]:
html = requests.get('https://jobs.te.com/job/-/-/1122/2719233?apstr=%3Fmode%3Djob%26iis%3DIndeed%26iisn%3DIndeed.com&ss=paid').text

From the html, we'll create a soup object, and then get a list of strings ('data')

In [130]:
soup = BeautifulSoup(html, 'lxml')
data = soup.find_all(string = True) #every text node in the page, as a list of strings

Using some code borrowed from Quora, filter the data result to remove the contents of style, script, head, and similar tags. Then filter by length to find the job-requirement bullets.

In [154]:
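from bs4 import Comment #lets us detect HTML comments directly

def visible(element):
    #skip text that lives inside tags that are never rendered
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    #skip HTML comments (bs4 strips the <!-- --> delimiters, so a regex on them never matches)
    elif isinstance(element, Comment):
        return False
    return True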
 
result = filter(visible, data)

long_results = [x for x in result if len(x) > 50 and len(x) < 320]
len_array = [len(x) for x in long_results]

In [167]:
noturl = re.compile(r'^(.(?!www))*$')
notee =  re.compile(r'^(.(?!equal opportunity))*$') #exclude the equal opportunity statement

long_results_filtered = [i for i in long_results if noturl.search(i)\
                         and notee.search(i)]
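
The two patterns above accept a string only if no character in it is immediately followed by the excluded phrase. One consequence is that a string that begins with `www` still passes, and both checks are case-sensitive. A minimal sketch of an alternative using lowercase substring checks follows; `keep_line` and the sample strings are made up for illustration, and this is not the notebook's original filter.

```python
import re

# The notebook's pattern, reproduced to show its behavior
noturl = re.compile(r'^(.(?!www))*$')

def keep_line(text, banned=('www', 'equal opportunity')):
    """Return True when none of the banned phrases occurs in text (case-insensitive)."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in banned)

# Hypothetical snippets of scraped text
samples = [
    'Design, develop, maintain and communicate visual dashboards.',
    'Apply online at www.example.com',
    'www.example.com lists all open positions',
    'We are an Equal Opportunity employer',
]

regex_kept = [s for s in samples if noturl.search(s)]      # lets a leading "www" through
substring_kept = [s for s in samples if keep_line(s)]      # keeps only the first sample

print(regex_kept)
print(substring_kept)
```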

In [168]:
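single_string = ' '.join(long_results_filtered) #join with spaces so adjacent snippets don't run together
long_results_filtered #last expression in the cell, so the list is displayed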

For the rest of the analysis, let's paste in an example job description for a Junior Data Scientist position at Verizon.

In [30]:
job_description = """Be a part of the team that identifies trends, emerging technologies and growth markets—all things that keep us at the forefront of innovation and drive our success Verizon Communications Inc. is a global leader in delivering broadband and other wireless and wireline communications services to mass market, business, government and wholesale customers. A Dow 30 company, Verizon employs a diverse workforce of more than 177,000 and last year generated consolidated revenues of $127 billion.  The Data Analyst – Junior Data Scientist will be part of an objective assurance and consulting team that is independently managed within Verizon Communications designed to add value and improve operations. The Internal Audit team assists the Audit Committee of the Board of Directors and Verizon management in accomplishing their objectives by bringing a systematic and disciplined approach to evaluate and improve the effectiveness of the overall control environment, risk management, and governance processes. The Internal Audit staff gains extensive exposure to diverse aspects of Verizon's business. These audit assignments include increasing levels of responsibilities and presentations to senior management, making Internal Audit an excellent place to work for high potential employees. Have you read Malcolm Gladwell’s books? Or the string of other authors solving business challenges through Data mining? Are you interested in a path to becoming a data scientist? Do you have what it takes to develop great business acumen? Leverage your critical thinking and problem solving skills while leveraging technical tools to analyze large complex data sets? If so then these responsibilities could be yours. 
As Wikipedia states ”Data Scientist have the ability to find and interpret rich data sources, manage large amounts of data despite hardware, software and bandwidth constraints, merge data sources together, ensure consistency of data-sets, create visualizations to aid in understanding data and building rich tools that enable others to work effectively.” POSITION RESPONSIBILITIES: The Data Analyst – Junior Data Scientist supports a high performance forensic and audit analytics team in its efforts to identify and drive data mining efforts through a risk based approach. The enthusiasm in this position helps mitigate fraud, identify misconduct and control gaps utilizing data mining and analysis.  Responsibilities include the following: Passion for growth and learning new techniques with vigor and enthusiasm. Strong individual contributor with top notch team collaboration skills.  Strong ability to independently and proactively initiate projects, hypothesize business transaction flows and fraudulent scenarios. Design, extract, normalize, analyze, review and automate analysis for Internal Audit utilizing enterprise data warehouse, extracts and data mining tools; coordinate with business for outside data source requirements. Design, develop, maintain and communicate visual dashboards. Design and develop ad-hoc analysis based on business requirement needs. Identify and use appropriate investigative and analytical technologies to interpret and verify results. Coordinate with business to ensure follow-up and resolution of exceptions including specific individual resolution as well as root-cause analysis and control gap identification. Review large software implementations to identify transaction flow gaps, design flaws and data integrity issues. Actively participates in the completion of department initiatives to support the development of a best-in-class Internal Audit function Maintain databases and related programs in a thorough and efficient manner. 
Qualifications Take part in training courses to increase skill set and technical capabilities in order to better serve the needs of the analytics team. Strong business analytical skills a must; ability to apply business logic to design and implement data mining techniques on large data sets. Projects with evidence of Creative and Critical thinking a must. Understanding of Data Warehousing is a must. Proficient in the use of Teradata SQL, MS SQL server (SSIS/SSAS experience preferred), Data Visualization (e.g., Tableau or other), MS Access, MS Excel, Visual Basic, and Sharepoint. Experience designing, developing, implementing and maintaining a database and programs to manage data analysis efforts. For internal candidates, experience with Verizon Wireless Enterprise Data Warehouse preferred. Working knowledge of ‘Big Data’ concepts and Hadoop/Hive, Teradata Aster, and R tools preferred. Working knowledge of building self-serve analytics tools for business users a plus. Working knowledge of statistical analysis, data mining and predictive modeling tools and techniques a plus. Working knowledge of application development and/or web development a plus. Demonstrated ability to work independently and within a team in a fast changing environment with changing priorities and changing time constraints. Strong interpersonal skills and ability to multi-task. Ability to interpret business requests as well as communicate findings in a user-friendly manner. Experience in normalizing data to ensure it is homogeneous and consistently formatted to enable sorting, query and analysis. Ability to write clear, concise reports and presentations with an ability to orally communicate effectively; organizational and documentation skills a must. An understanding of risk management methodology and factors. Consolidates issues for management level review; develops clear written recommendations, which require minimal editing; presents recommendations and resolves issues with management. 
BS/BA degree in Management Information Systems, Computer Science, Accounting, Business, Finance, Economics, Statistics or related field.  Masters degree a plus.  At least a 3.0/4.0 overall GPA or equivalent Requires a minimum of 4 years relevant work experience; Analytics, technology, auditing, accounting, finance, or economics."""

That's the string. In the next cell, we tokenize it by word and by sentence, which returns a list of all the words and a list of all the sentences, respectively.

In [14]:
jd_w_token = nltk.word_tokenize(job_description) #tokenize by word
jd_sent_token = nltk.sent_tokenize(job_description)

This next cell will attempt to tag each word with its part of speech.

In [49]:
tag_w = nltk.pos_tag(jd_w_token) #list of tuples of tagged words

Let's also create a frequency distribution. We'll then build a pandas DataFrame from the part-of-speech and frequency data, and clean and label it.

In [73]:
freq = list(nltk.FreqDist(jd_w_token).items())
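
`nltk.FreqDist` is a subclass of `collections.Counter`, so its `.items()` yields `(token, count)` pairs. The sketch below uses a plain `Counter` on a short, made-up token list to show the shape of `freq` without requiring nltk; the token list is hypothetical and only stands in for `jd_w_token`.

```python
from collections import Counter

# Hypothetical tokens standing in for jd_w_token
tokens = ['Design', ',', 'develop', 'and', 'maintain', 'dashboards', '.',
          'Design', 'and', 'develop', 'analysis', '.']

counts = Counter(tokens)        # the counting behavior FreqDist inherits
freq = list(counts.items())     # list of (token, count) tuples

top_three = counts.most_common(3)
print(freq[:3])
print(top_three)
```

Counting is case-sensitive and includes punctuation tokens, so `Design` and `design` would be tallied separately unless the tokens are lowercased first.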

In [75]:
w_df = pd.DataFrame(tag_w) #dataframe of words and parts of speech
f_df = pd.DataFrame(freq) #dataframe of frequency distribution

In [91]:
d = pd.merge(w_df, f_df, on = 0, how = 'inner') #let's merge these into a single dataframe
d.drop_duplicates([0], keep = 'last', inplace = True)
d.rename(columns = {0: 'word','1_x':'part', '1_y':'count'}, inplace = True)
d.sort_values(['part','count'], ascending = False, axis = 0, inplace = True)
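
Because both frames are built from lists of tuples, their columns are labeled 0 and 1, and the merge renames the two clashing 1 columns to `1_x` and `1_y`. A minimal sketch of the same pipeline with the columns named up front, which makes the rename step unnecessary; the toy `tagged` and `counts` lists are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for tag_w and freq
tagged = [("develop", "VB"), ("data", "NNS"), ("develop", "VB"), ("is", "VBZ")]
counts = [("develop", 2), ("data", 1), ("is", 1)]

words = pd.DataFrame(tagged, columns=["word", "part"])
freqs = pd.DataFrame(counts, columns=["word", "count"])

table = (
    words.merge(freqs, on="word", how="inner")       # attach each word's count
         .drop_duplicates("word", keep="last")       # one row per word
         .sort_values(["part", "count"], ascending=False)
         .reset_index(drop=True)
)
print(table)
```

Note that a word tagged differently in different sentences keeps only the tag from its last occurrence once duplicates are dropped.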

We now have d, a DataFrame of words sorted by part of speech and then by count. Let's select the verbs from this job description.

In [97]:
d[(d['part'] == 'VB')]

Unnamed: 0,word,part,count
528,develop,VB,3
580,interpret,VB,3
599,ensure,VB,3
629,identify,VB,3
712,communicate,VB,3
199,drive,VB,2
334,diverse,VB,2
371,be,VB,2
388,improve,VB,2
586,manage,VB,2


In [95]:
d[(d['part'] == 'VBZ')]

Unnamed: 0,word,part,count
214,is,VBZ,4
658,techniques,VBZ,3
79,identifies,VBZ,1
332,employs,VBZ,1
401,assists,VBZ,1
525,takes,VBZ,1
569,states,VBZ,1
617,supports,VBZ,1
636,helps,VBZ,1
757,participates,VBZ,1


In [94]:
d.head(10)

Unnamed: 0,word,part,count
522,what,WP,1
910,which,WDT,1
214,is,VBZ,4
658,techniques,VBZ,3
79,identifies,VBZ,1
332,employs,VBZ,1
401,assists,VBZ,1
525,takes,VBZ,1
569,states,VBZ,1
617,supports,VBZ,1


In [96]:
d[(d['part'] == 'VBN')]

Unnamed: 0,word,part,count
631,based,VBN,2
381,managed,VBN,1
783,set,VBN,1
872,Demonstrated,VBN,1
889,formatted,VBN,1
907,written,VBN,1
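
The verb selections above differ only in the tag being compared. Since the Penn Treebank verb tags returned by `nltk.pos_tag` all begin with `VB`, a single pass can group every verb form by tag. Here is a small sketch on a hand-made list of tagged tuples: the verbs are copied from the tables above, the noun is added for contrast with an assumed tag, and `verbs_by_tag` is a hypothetical name.

```python
from collections import defaultdict

# (word, tag) pairs for illustration; not fresh tagger output
tagged = [('develop', 'VB'), ('is', 'VBZ'), ('based', 'VBN'),
          ('identify', 'VB'), ('team', 'NN'), ('supports', 'VBZ')]

verbs_by_tag = defaultdict(list)
for word, tag in tagged:
    if tag.startswith('VB'):        # covers VB, VBD, VBG, VBN, VBP, VBZ
        verbs_by_tag[tag].append(word)

print(dict(verbs_by_tag))
```

On the DataFrame itself, `d[d['part'].str.startswith('VB')]` would make the equivalent selection in one step.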
