# Decoding Data Science Job Postings to Improve Your Resume

First of all, we import the required Python modules.

In [1]:
from zipfile import ZipFile
from bs4 import BeautifulSoup as bs
import pandas as pd
import re

## 1. Extracting Text from Online Job Postings

We load the ZIP archive containing all the job postings as HTML files.

In [2]:
archive = ZipFile('../data/html_job_postings.zip')

We read each HTML file and store its content, as a sequence of bytes, in an accumulator list.

In [3]:
file_contents = []

# `archive` avoids shadowing the built-in `zip` function.
for filename in archive.namelist():
    if filename.endswith('.html'):
        with archive.open(filename, 'r') as file:
            file_contents.append(file.read())

We close the ZIP archive since we no longer need it.

In [4]:
archive.close()
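
A context manager can take care of closing the archive, even if reading fails partway through. The following is a minimal sketch rather than the notebook's own code: `read_html_files` is a hypothetical helper name, and the demo builds a tiny in-memory archive with made-up contents so that it runs without the real `html_job_postings.zip`.

```python
from io import BytesIO
from zipfile import ZipFile


def read_html_files(source):
    """Return the raw bytes of every .html member of a ZIP archive.

    `source` may be a path or a file-like object (hypothetical helper).
    """
    # The `with` block closes the archive automatically, even on error.
    with ZipFile(source) as archive:
        return [
            archive.read(name)
            for name in archive.namelist()
            if name.endswith('.html')
        ]


# Demo on a small in-memory archive (made-up file names and contents).
buffer = BytesIO()
with ZipFile(buffer, 'w') as demo_archive:
    demo_archive.writestr('a.html', '<html><title>A</title></html>')
    demo_archive.writestr('notes.txt', 'not a posting')
    demo_archive.writestr('b.html', '<html><title>B</title></html>')

demo_contents = read_html_files(buffer)
print(len(demo_contents))
```

Only the two `.html` members are returned; the text file is skipped by the `endswith` filter.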

For each file, we parse its HTML content and extract the following fields:
- The title
- The text of the body
- The text of each bullet point (as a list)

While doing so, we create a data frame with one column per field and one row per file.

In [5]:
ds_df = pd.DataFrame([
    [soup.title.text,
     soup.body.text,
     [li.text for li in soup.find_all('li')]]
    for soup in [bs(file_content, 'lxml') for file_content in file_contents]],

    columns=['title', 'body', 'bullets']
)
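
The comprehension above assumes that every file has a `<title>` and a `<body>`: `soup.title` is `None` when the tag is missing, so `.text` would raise an `AttributeError`. A more defensive variant could look like the sketch below. `extract_fields` is a hypothetical helper, the sample HTML is made up, and the built-in `html.parser` backend is used only so that the sketch runs without `lxml`.

```python
from bs4 import BeautifulSoup


def extract_fields(html):
    """Return [title, body, bullets] for one HTML document (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    # soup.title and soup.body are None when the tag is absent.
    title = soup.title.text if soup.title is not None else ''
    body = soup.body.text if soup.body is not None else ''
    # Stripping whitespace is an extra choice of this sketch.
    bullets = [li.text.strip() for li in soup.find_all('li')]
    return [title, body, bullets]


# Made-up posting used only for illustration.
sample_html = (
    '<html><head><title>Data Scientist - Newark, CA</title></head>'
    '<body><p>Intro</p><ul><li>Python</li><li> SQL </li></ul></body></html>'
)
fields = extract_fields(sample_html)
print(fields[0])
print(fields[2])

# A fragment without <title> or <body> yields empty strings instead of an error.
print(extract_fields('<p>No title here</p>'))
```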

We delete the list of file contents since we no longer need it.

In [6]:
del file_contents

Let us peek at the created data frame and see its number of rows.

In [7]:
ds_df.head()

|  | title | body | bullets |
|---|---|---|---|
| 0 | Data Engineer - Columbus, GA 31909 | `Data Engineer - Columbus, GA 31909\nCelebratin...` | `[Bachelor’s or Master’s degree in statistics, ...` |
| 1 | Data Analyst - St. Louis, MO | `Data Analyst - St. Louis, MO\nDuties\nSummary\...` | `[Job family (Series)\n1501 General Mathematics...` |
| 2 | Data Scientist - Newark, CA | `Data Scientist - Newark, CA\nData Scientist\n\...` | `[ Design, develop, document and maintain machi...` |
| 3 | Patient Care Assistant / PCA - Med/Surg (Fayet... | `Patient Care Assistant / PCA - Med/Surg (Fayet...` | `[Provides all personal care services in accord...` |
| 4 | Scientific Programmer - Berkeley, CA | `Scientific Programmer - Berkeley, CA\nCaribou ...` | `[Demonstrated proficiency with Python, JavaScr...` |


In [8]:
len(ds_df.index)

1337

We examine one randomly sampled job posting to check that the extraction worked as expected.

In [9]:
random_posting = ds_df.sample().iloc[0]

print('# Title:'.upper())
print(random_posting.title, end='\n' * 2)

print('# Body:'.upper())
print(random_posting.body, end='\n' * 2)

print('# Bullets:'.upper())
for bullet in random_posting.bullets:
    # Strip stray whitespace so that each bullet prints on its own line.
    print(f'- {bullet.strip()}')

# TITLE:
Data Scientist (Maplewood, MN) - Maplewood, MN

# BODY:
Data Scientist (Maplewood, MN) - Maplewood, MN
3M is seeking an experienced Data Scientist – Digitization and Advanced Analytics within the Manufacturing and Supply Chain organization located in Maplewood, Minnesota . At 3M, you can apply your talent in bold ways that matter. Here, you go.
Job Summary :
The person hired for the position of Data Scientist will be developing data and analytics solutions to address business challenges, interact effectively with both the technology teams and the business units, contribute significantly to the vision of our data analytics and data science capabilities. The ideal candidate is passionate about digging into large amounts of data, both known and unknown, structured and unstructured.
For additional business group/division/product information, please visit: https://www.3m.com/3M/en_US/manufacturing-us/
This position provides an opportunity to transition from other private, public, g

Let us drop duplicate postings, comparing rows on the title and body fields only. The number of rows decreases from 1337 to 1328.

In [10]:
ds_df.drop_duplicates(subset=['title', 'body'], inplace=True)
len(ds_df.index)

1328
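
The `subset` argument matters here: the `bullets` column holds Python lists, which are unhashable, so deduplicating on all columns would raise a `TypeError`. The sketch below illustrates this on a made-up toy frame; its rows are not taken from the dataset.

```python
import pandas as pd

# Toy frame with a list-valued column (made-up rows).
toy_df = pd.DataFrame({
    'title': ['Data Scientist', 'Data Scientist', 'Data Analyst'],
    'body': ['Same text', 'Same text', 'Other text'],
    'bullets': [['Python'], ['Python'], ['SQL']],
})

# Deduplicating on every column fails because lists cannot be hashed.
try:
    toy_df.drop_duplicates()
    failed = False
except TypeError:
    failed = True

# Restricting the comparison to the string columns works.
deduped = toy_df.drop_duplicates(subset=['title', 'body'])
print(failed, len(deduped))
```

Returning a new frame instead of passing `inplace=True` keeps the original available for comparison; either style removes the same rows.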

We keep only the job postings related to data science by testing whether their title contains the string `"data scien"` (case-insensitive). The number of rows decreases from 1328 to 492.

In [11]:
ds_df = ds_df[ds_df.title.str.contains('data scien', flags=re.IGNORECASE)]
len(ds_df.index)

492
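
The substring `data scien` matches titles containing either "Data Scientist" or "Data Science". The sketch below reproduces the same test with the standard `re` module on a handful of titles. Three of them appear in the output of `ds_df.head()` above; "Head of Data Science" is made up for illustration.

```python
import re

# Same pattern and flag as in the filter above.
pattern = re.compile('data scien', flags=re.IGNORECASE)

titles = [
    'Data Engineer - Columbus, GA 31909',
    'Data Scientist - Newark, CA',
    'Head of Data Science',  # made-up title
    'Data Analyst - St. Louis, MO',
]

# `re.search` finds the pattern anywhere in the title.
matches = [title for title in titles if pattern.search(title)]
print(matches)
```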

We check that the data frame contains the expected data science postings.

In [12]:
ds_df.head()

|  | title | body | bullets |
|---|---|---|---|
| 2 | Data Scientist - Newark, CA | `Data Scientist - Newark, CA\nData Scientist\n\...` | `[ Design, develop, document and maintain machi...` |
| 6 | PwC Labs - Jr. Data Scientist - Machine Learni... | `PwC Labs - Jr. Data Scientist - Machine Learni...` | `[Invite and provide evidence-based feedback in...` |
| 12 | Senior Data Scientist - Sunnyvale, CA 94089 | `Senior Data Scientist - Sunnyvale, CA 94089\nI...` | `[Ability to mentor and up level junior data sc...` |
| 14 | Data Scientist - Seattle, WA | `Data Scientist - Seattle, WA\nMS with 2+ years...` | `[MS with 2+ years of industry experience or Ba...` |
| 15 | Data Scientist - Pasadena, CA 91107 | `Data Scientist - Pasadena, CA 91107\nJob Type:...` | `[Use statistical and programming software comb...` |


Finally, we save our data frame to a pickle file for later reuse.

In [13]:
ds_df.to_pickle('../data/ds_df.pkl')
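
In a later notebook, the frame can be restored with `pd.read_pickle`. The sketch below round-trips a made-up toy frame through a temporary directory rather than `../data/ds_df.pkl`, so that it can run anywhere. Pickle preserves the list-valued `bullets` column, whereas a CSV export would turn each list into a string.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for ds_df (made-up row).
toy_df = pd.DataFrame({
    'title': ['Data Scientist - Newark, CA'],
    'body': ['Example body'],
    'bullets': [['Python', 'SQL']],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'toy_df.pkl'
    toy_df.to_pickle(path)
    restored = pd.read_pickle(path)

# The bullets come back as a real list, not as its string representation.
restored_bullets = restored['bullets'].iloc[0]
print(type(restored_bullets).__name__, restored_bullets)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.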
