# Decoding Data Science Job Postings to Improve Your Resume

## 1. Extracting Text from Online Job Postings

## By Alexandre Dietrich

#### Date: 11/02/2020

In [30]:
import glob
import pandas as pd
import os
from bs4 import BeautifulSoup as bs
from IPython.display import display, HTML

## Point to the directory of job posting HTML files

In [31]:
curdir = os.getcwd()
curdir

'/Users/alexandredietrich/Documents/Code/Resume-job-posting-nlp-project-master'

In [32]:
curdir = curdir + '/data/html_job_postings/'
curdir

'/Users/alexandredietrich/Documents/Code/Resume-job-posting-nlp-project-master/data/html_job_postings/'
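String concatenation works here because the separator is typed by hand. As a small sketch of the same step, `os.path.join` inserts the separator for the current platform. The `base` path below is a placeholder for illustration, not the author's directory, and `postings_dir` is a hypothetical name.

```python
import os

# Placeholder base directory (an assumption for this example only)
base = os.path.join('home', 'user', 'project')

# The trailing '' makes the result end with a path separator,
# so a file name or glob pattern can be appended directly.
postings_dir = os.path.join(base, 'data', 'html_job_postings', '')
```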

## Get the names of all HTML files

In [33]:
all_rec = glob.glob(curdir + '*')
print(all_rec[0:10])

['/Users/alexandredietrich/Documents/Code/Resume-job-posting-nlp-project-master/data/html_job_postings/1e92960a19ffdd34_fccid.html', '/Users/alexandredietrich/Documents/Code/Resume-job-posting-nlp-project-master/data/html_job_postings/3157fcef3ee474da_fccid.html', '/Users/alexandredietrich/Documents/Code/Resume-job-posting-nlp-project-master/data/html_job_postings/b423ca22a6e2c10f_fccid.html', '/Users/alexandredietrich/Documents/Code/Resume-job-posting-nlp-project-master/data/html_job_postings/ea487254a487beb5_fccid.html', '/Users/alexandredietrich/Documents/Code/Resume-job-posting-nlp-project-master/data/html_job_postings/cb8a5bce330854e9_fccid.html', '/Users/alexandredietrich/Documents/Code/Resume-job-posting-nlp-project-master/data/html_job_postings/a559b6630c13783d_fccid.html', '/Users/alexandredietrich/Documents/Code/Resume-job-posting-nlp-project-master/data/html_job_postings/f579e807b5804620_fccid.html', '/Users/alexandredietrich/Documents/Code/Resume-job-posting-nlp-project-mas
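The `'*'` pattern matches every entry in the directory, including any non-HTML files that might be there. A minimal sketch, assuming only files ending in `.html` are wanted; `list_html_files` is a hypothetical helper, not part of the original notebook.

```python
import glob
import os

def list_html_files(directory):
    """Return the sorted paths of the .html files directly inside directory."""
    pattern = os.path.join(directory, '*.html')
    return sorted(glob.glob(pattern))
```

Sorting is optional, but it makes the order of the files reproducible between runs.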

## Test some Beautiful Soup methods on a single file

In [34]:
filename = curdir + '0a9d8f4b50fd041e_fccid.html'

os.path.basename(filename)

'0a9d8f4b50fd041e_fccid.html'

In [35]:
# Close the file after reading and name the parser explicitly
with open(filename, 'r') as f:
    soup = bs(f.read(), 'html.parser')
body = soup.body

In [36]:
soup.body.text

'Engineer I ( Electrical Engineer) - United States\nAt Packaging Corporation of America (PCA), we think of ourselves as more than a box manufacturer. We are an ideas and solutions company. We seek to be the leader in helping our customers — large and small — package, transport and display products of all kinds. It just happens to be that corrugated products are our area of expertise.\n\nAt PCA, you’ll find the best people in the industry operating in a “golden rule” culture. We actively promote mutually rewarding relationships with each other and our customers by advocating respect for every individual, ethical and fair practices, and the highest standards in what we say and do. PCA is proud to have a highly skilled and experienced team leading the way.\n\nAs a Fortune 500 company and one of the largest producers of containerboard and corrugated packaging products in the U.S., PCA offers customers broad expertise and economies of scale, while our multiple plant locations let us rapidly

In [37]:
print(soup.prettify())

<html>
 <head>
  <title>
   Engineer I ( Electrical Engineer) - United States
  </title>
 </head>
 <body>
  <h2>
   Engineer I ( Electrical Engineer) - United States
  </h2>
  <div>
   <div>
    At Packaging Corporation of America (PCA), we think of ourselves as more than a box manufacturer. We are an ideas and solutions company. We seek to be the leader in helping our customers — large and small — package, transport and display products of all kinds. It just happens to be that corrugated products are our area of expertise.
    <br/>
    At PCA, you’ll find the best people in the industry operating in a “golden rule” culture. We actively promote mutually rewarding relationships with each other and our customers by advocating respect for every individual, ethical and fair practices, and the highest standards in what we say and do. PCA is proud to have a highly skilled and experienced team leading the way.
    <br/>
    As a Fortune 500 company and one of the largest producers of contain

In [38]:
display(HTML(soup.prettify()))

In [39]:
unLists = body.find_all('ul')
for i, ul in enumerate(unLists):
    print(f"\nUnordered List {i}:")
    print(ul.text)


Unordered List 0:
Bachelor’s degree or higher in Electrical Engineering or Electrical Engineering Technology.
EIT certification or on EIT certification track preferred.
2 years of internship or co-op type experience in an electrical engineering related role or equivalent within a paper producing facility preferred. Paper mill experience or process knowledge preferred.
Must have good troubleshooting skills and knowledge of digital and analog control systems.
Ability to work with the required software for model based control applications.
Good written and verbal communication skills and ability to build and maintain positive relationships with team members.
Good organizational skills, decision maker and results oriented.
Ability to learn in all areas of responsibilities.
Self-starter, willing to work with minimal supervision and input from those who explain the needed results.
Ability to develop and train on processes and systems being put into place.
Demonstrated effective facilitatio
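`ul.text` joins all the bullets of a list into one string. If individual bullet points are needed later, each `<li>` can be read separately. A minimal sketch on a small made-up HTML snippet (the `sample_html` string and the `list_items` helper are illustrative, not taken from the data or the notebook), using the built-in `html.parser`:

```python
from bs4 import BeautifulSoup as bs

# Made-up HTML for illustration only
sample_html = """
<body>
  <ul><li>Python</li><li> SQL </li></ul>
  <ul><li>Good communication skills</li></ul>
</body>
"""

def list_items(soup):
    """Return one list of stripped <li> strings per <ul> in the document."""
    return [
        [li.get_text(strip=True) for li in ul.find_all('li')]
        for ul in soup.find_all('ul')
    ]

soup_sample = bs(sample_html, 'html.parser')
items = list_items(soup_sample)
```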

## Read all files and prepare a list with the necessary columns

In [40]:
list_data = []

for filename in all_rec:

    file = os.path.basename(filename)

    # Close each file after reading and name the parser explicitly
    with open(filename, 'r') as f:
        soup = bs(f.read(), 'html.parser')

    title = soup.title.text
    body = soup.body
    bodytext = body.text

    # Collect the text of every unordered list (<ul>) in the posting
    bullets = [ul.text for ul in body.find_all('ul')]

    list_data.append([file, title, bodytext, bullets])
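The loop above assumes every file has a `<title>` and a `<body>`; a file missing either would raise `AttributeError`. A hedged sketch of a more defensive variant follows. `parse_posting` is a hypothetical helper, and returning empty values for missing tags is an assumption about the desired behavior.

```python
from bs4 import BeautifulSoup as bs

def parse_posting(html):
    """Parse one posting's HTML into [title, body text, list of <ul> texts]."""
    soup = bs(html, 'html.parser')
    # Fall back to empty values when a tag is absent instead of raising
    title = soup.title.get_text() if soup.title else ''
    body = soup.body
    bodytext = body.get_text() if body else ''
    bullets = [ul.get_text() for ul in body.find_all('ul')] if body else []
    return [title, bodytext, bullets]
```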

## Examine a sample of the list

In [41]:
list_data[0:3]     # Print 3 samples

[['1e92960a19ffdd34_fccid.html',
  'Quantitative Analyst - Boston, MA 02116',
  'Quantitative Analyst - Boston, MA 02116\nQuantitative Analyst (State Street Bank and Trust Company; Boston, MA): The Quantitative Analyst will be part of State Street Treasury’s Treasury Quantitative Analytics (TQA) group. TQA is responsible for developing/implementing/monitoring advanced financial models that are used in company’s capital management, liquidity management, investment portfolio construction, and balance sheet optimization. The group is accountable for in-depth understanding, modeling, and representation of the complex interaction of global markets, customer behaviors, and regulatory oversights to create a view of risk/revenue opportunities and exposures to the investment committee, Board of Directors, senior management, and regulatory agencies. The Quantitative Analyst role is a key contributor to the realization of the GT’s mission of optimizing net interest income within the desired risk 

## Build a pandas DataFrame from the data extracted from the HTML files

In [42]:
# Keep the file name so each row can be traced back to its original HTML file

df = pd.DataFrame(list_data, columns=['file', 'title', 'body', 'bullets'])
df.head(10)

|  | file | title | body | bullets |
| --- | --- | --- | --- | --- |
| 0 | 1e92960a19ffdd34_fccid.html | Quantitative Analyst - Boston, MA 02116 | Quantitative Analyst - Boston, MA 02116\nQuant... | [] |
| 1 | 3157fcef3ee474da_fccid.html | Data Scientist - Mountain View, CA | Data Scientist - Mountain View, CA\nGroundTrut... | [\nHelp senior members of the team to explore,... |
| 2 | b423ca22a6e2c10f_fccid.html | Data Scientist - Seattle, WA | Data Scientist - Seattle, WA\nA Bachelor or Ma... | [Collaborate with HR, business leaders, and pe... |
| 3 | ea487254a487beb5_fccid.html | Senior Natural Language Processing (NLP) Engin... | Senior Natural Language Processing (NLP) Engin... | [Join a small team creating a proprietary NLU ... |
| 4 | cb8a5bce330854e9_fccid.html | FLEXO FOLDER GLUER OPER - McClellan, CA - McCl... | FLEXO FOLDER GLUER OPER - McClellan, CA - McCl... | [] |
| 5 | a559b6630c13783d_fccid.html | Junior Data Scientist - College Park, MD 20740 | Junior Data Scientist - College Park, MD 20740... | [Degree: Bachelor’s degree in business analyti... |
| 6 | f579e807b5804620_fccid.html | Data Scientist - New York, NY | Data Scientist - New York, NY\nDescription\nDS... | [\nLanguages: Python, PySpark, SQL\nData Tools... |
| 7 | 63031f6bc07bda88_fccid.html | Business Analyst - Medical Claims Data Project... | Business Analyst - Medical Claims Data Project... | [Highly developed analytical skills\nMastery o... |
| 8 | 13c9ffc0bcb07c8d_fccid.html | (Entry-Level) Data Scientist - Chicago, IL | (Entry-Level) Data Scientist - Chicago, IL\nDa... | [\nBe the go-to person for Data ingest and sto... |
| 9 | b672827e595ad0f4_fccid.html | Data Scientist, Analytics - Seattle, WA 98101 | Data Scientist, Analytics - Seattle, WA 98101\... | [\nApply your expertise in quantitative analys... |


## Filter only the Data Scientist job postings

In [43]:
# No spaces around '|': in a regex they would be matched literally
df[df['title'].str.contains(r'data scientist|data science', case=False)].sample(10)

|  | file | title | body | bullets |
| --- | --- | --- | --- | --- |
| 879 | 72ac09d6947fcac4_fccid.html | Product Data Scientist - Engineering - Los Ang... | Product Data Scientist - Engineering - Los Ang... | [Partner with Product and Engineering and appl... |
| 1071 | 5e69e3d194ab13d0_fccid.html | Principal Data Scientist - Tempe, AZ 85281 | Principal Data Scientist - Tempe, AZ 85281\nSy... | [Learn and stay current with the latest techni... |
| 1074 | 86b8f37dffd9511a_fccid.html | Data Scientist - Santa Ana, CA 92707 | Data Scientist - Santa Ana, CA 92707\nJoin our... | [\nLeverage skills in handling very large data... |
| 1283 | cba29ba09c81bdfb_fccid.html | Manager, Data Science, Programming and Visuali... | Manager, Data Science, Programming and Visuali... | [\nUndergraduate degree in a field linked to b... |
| 949 | 117fdac717d768c9_fccid.html | Senior/Staff Data Scientist - San Francisco, CA | Senior/Staff Data Scientist - San Francisco, C... | [] |
| 504 | b16865affc0576ed_fccid.html | Data Scientist - Pittsburgh, PA | Data Scientist - Pittsburgh, PA\nCelebrating i... | [] |
| 203 | 4145274d0b8bbe86_fccid.html | Data Scientist - Boston, MA 02134 | Data Scientist - Boston, MA 02134\nPosition Su... | [\nManipulate large clinical data sets.\nMine ... |
| 125 | 93e0a5f796e8e1c0_fccid.html | Clinical Data Scientist - San Francisco, CA | Clinical Data Scientist - San Francisco, CA\nV... | [Work closely with the clinical and technology... |
| 1214 | 231f6807e11da902_fccid.html | Data Scientist - Plano, TX | Data Scientist - Plano, TX\nOverview\nPosition... | [Excellent visual, written and verbal communic... |
| 1208 | 1c1ded576ec66f45_fccid.html | Data Scientist (Full Time) - United States - S... | Data Scientist (Full Time) - United States - S... | [Acquire, clean and structure data from multip... |


In [48]:
df[df['title'].str.contains(r'data scientist|data science', case=False)].count()

file       496
title      496
body       496
bullets    496
dtype: int64
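Inside a regular expression, spaces around `|` are literal characters, so a pattern written as `'(data scientist) | (data science)'` requires a space after the first phrase or before the second. A minimal sketch of the difference, using the standard `re` module on a few made-up titles (`str.contains` applies the same search semantics to each title):

```python
import re

# Made-up titles for illustration only
titles = [
    'Data Scientist',                 # nothing after the phrase
    'Data Scientist - Seattle, WA',   # phrase followed by a space
    'Data Science Manager',           # nothing before the phrase
    'Manager, Data Science',          # phrase preceded by a space
]

spaced = re.compile(r'(data scientist) | (data science)', re.IGNORECASE)
tight = re.compile(r'data scientist|data science', re.IGNORECASE)

spaced_hits = [t for t in titles if spaced.search(t)]
tight_hits = [t for t in titles if tight.search(t)]
```

Here `tight_hits` contains all four titles, while `spaced_hits` contains only the two where the required space happens to be present.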

## Remove all job postings that do not contain "Data Scientist" or "Data Science" in the title

In [49]:
df[~df['title'].str.contains(r'data scientist|data science', case=False)].count()

file       841
title      841
body       841
bullets    841
dtype: int64

In [50]:
listnotDS = df[~df['title'].str.contains(r'data scientist|data science', case=False)]
listnotDS.index

Int64Index([   0,    3,    4,    7,   11,   14,   16,   21,   22,   23,
            ...
            1318, 1319, 1320, 1325, 1327, 1328, 1330, 1331, 1334, 1335],
           dtype='int64', length=841)

In [51]:
df.drop(listnotDS.index, axis=0, inplace=True)

In [52]:
df

|  | file | title | body | bullets |
| --- | --- | --- | --- | --- |
| 1 | 3157fcef3ee474da_fccid.html | Data Scientist - Mountain View, CA | Data Scientist - Mountain View, CA\nGroundTrut... | [\nHelp senior members of the team to explore,... |
| 2 | b423ca22a6e2c10f_fccid.html | Data Scientist - Seattle, WA | Data Scientist - Seattle, WA\nA Bachelor or Ma... | [Collaborate with HR, business leaders, and pe... |
| 5 | a559b6630c13783d_fccid.html | Junior Data Scientist - College Park, MD 20740 | Junior Data Scientist - College Park, MD 20740... | [Degree: Bachelor’s degree in business analyti... |
| 6 | f579e807b5804620_fccid.html | Data Scientist - New York, NY | Data Scientist - New York, NY\nDescription\nDS... | [\nLanguages: Python, PySpark, SQL\nData Tools... |
| 8 | 13c9ffc0bcb07c8d_fccid.html | (Entry-Level) Data Scientist - Chicago, IL | (Entry-Level) Data Scientist - Chicago, IL\nDa... | [\nBe the go-to person for Data ingest and sto... |
| 9 | b672827e595ad0f4_fccid.html | Data Scientist, Analytics - Seattle, WA 98101 | Data Scientist, Analytics - Seattle, WA 98101\... | [\nApply your expertise in quantitative analys... |
| 10 | fa5cc99f9075aff3_fccid.html | Data Scientist Intern - Pricing Strategy and A... | Data Scientist Intern - Pricing Strategy and A... | [\nApply statistical methods to analyze the ef... |
| 12 | 49f85c468f4849f5_fccid.html | Data Scientist BSLEF8 - Alexandria, VA | Data Scientist BSLEF8 - Alexandria, VA\nEach d... | [Lead and perform hands-on analysis and modeli... |
| 13 | 8a57ec38468f3689_fccid.html | Data Scientist - New York, NY | Data Scientist - New York, NY\nYour role\nComb... | [Develop, back test, and implement statistical... |
| 15 | e9af75bf5684dbe1_fccid.html | Data Scientist - Grapevine, TX 76051 | Data Scientist - Grapevine, TX 76051\nData Sci... | [Health benefits401(k)ProgramDaily dress code ... |


## Remove duplicate job postings

In [53]:
df.duplicated(subset=['body']).value_counts()

False    491
True       5
dtype: int64

In [54]:
listdup = df.duplicated(subset=['body'])

In [55]:
df[listdup].index

Int64Index([176, 720, 929, 1084, 1211], dtype='int64')

In [56]:
df.drop(df[listdup].index, axis=0, inplace=True)

In [57]:
df.duplicated(subset=['body']).value_counts()

False    491
dtype: int64
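The mask-and-drop steps above can also be written as a single call to `DataFrame.drop_duplicates`, which keeps the first occurrence by default. A minimal sketch on a toy frame (the rows are made up for illustration):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'file': ['a.html', 'b.html', 'c.html'],
    'body': ['same text', 'other text', 'same text'],
})

# Two-step approach: build a boolean mask, then drop the flagged rows
mask = toy.duplicated(subset=['body'])
dropped = toy.drop(toy[mask].index, axis=0)

# Equivalent single call; keep='first' is the default
deduped = toy.drop_duplicates(subset=['body'])
```

Both versions keep the original index labels, so rows can still be traced back to their position in the unfiltered frame.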

## Save dataframe to disk

In [58]:
df.to_pickle("./jobpostingsdf.pkl")

In [59]:
newdf = pd.read_pickle("./jobpostingsdf.pkl")
newdf

|  | file | title | body | bullets |
| --- | --- | --- | --- | --- |
| 1 | 3157fcef3ee474da_fccid.html | Data Scientist - Mountain View, CA | Data Scientist - Mountain View, CA\nGroundTrut... | [\nHelp senior members of the team to explore,... |
| 2 | b423ca22a6e2c10f_fccid.html | Data Scientist - Seattle, WA | Data Scientist - Seattle, WA\nA Bachelor or Ma... | [Collaborate with HR, business leaders, and pe... |
| 5 | a559b6630c13783d_fccid.html | Junior Data Scientist - College Park, MD 20740 | Junior Data Scientist - College Park, MD 20740... | [Degree: Bachelor’s degree in business analyti... |
| 6 | f579e807b5804620_fccid.html | Data Scientist - New York, NY | Data Scientist - New York, NY\nDescription\nDS... | [\nLanguages: Python, PySpark, SQL\nData Tools... |
| 8 | 13c9ffc0bcb07c8d_fccid.html | (Entry-Level) Data Scientist - Chicago, IL | (Entry-Level) Data Scientist - Chicago, IL\nDa... | [\nBe the go-to person for Data ingest and sto... |
| 9 | b672827e595ad0f4_fccid.html | Data Scientist, Analytics - Seattle, WA 98101 | Data Scientist, Analytics - Seattle, WA 98101\... | [\nApply your expertise in quantitative analys... |
| 10 | fa5cc99f9075aff3_fccid.html | Data Scientist Intern - Pricing Strategy and A... | Data Scientist Intern - Pricing Strategy and A... | [\nApply statistical methods to analyze the ef... |
| 12 | 49f85c468f4849f5_fccid.html | Data Scientist BSLEF8 - Alexandria, VA | Data Scientist BSLEF8 - Alexandria, VA\nEach d... | [Lead and perform hands-on analysis and modeli... |
| 13 | 8a57ec38468f3689_fccid.html | Data Scientist - New York, NY | Data Scientist - New York, NY\nYour role\nComb... | [Develop, back test, and implement statistical... |
| 15 | e9af75bf5684dbe1_fccid.html | Data Scientist - Grapevine, TX 76051 | Data Scientist - Grapevine, TX 76051\nData Sci... | [Health benefits401(k)ProgramDaily dress code ... |
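Pickle preserves Python objects such as the lists in the `bullets` column, which a CSV export would write out as plain strings. A minimal sketch of a round-trip check on a toy frame saved to a temporary directory (the frame and the file name `toy.pkl` are illustrative, not part of the project):

```python
import os
import tempfile

import pandas as pd

# Toy frame with a list-valued column, for illustration only
toy = pd.DataFrame({'title': ['Data Scientist'], 'bullets': [['Python', 'SQL']]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.pkl')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The cell comes back as a real list, not its string representation
restored_bullets = restored.loc[0, 'bullets']
is_still_list = isinstance(restored_bullets, list)
```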
