<h1 style = 'text-align: center;'>Analyzing Data Science Job Requirements</h1>
<h3 style = 'text-align: center;'>Alex Cohen - DATS 6103</h3>
<p> Applying to jobs can be stressful, especially when trying to balance the application process with school, existing work responsibilities, and general life requirements. This problem is compounded by the fact that 'data science' means so many different things to different people. </p>
<p>Going into the application process with a better idea of what 'makes' a data scientist and exactly what companies are looking for when recruiting applicants can help make this process a little easier. For this, we will use <a href = 'https://www.indeed.com'>indeed.com</a>, a popular job search engine, to better understand what companies are looking for when they recruit data scientists.</p>

<p>This will be done in two steps: </p>
<ol> 
    <li>Use BeautifulSoup to gather job descriptions and company data</li>
    <li>Analyze the different requirements to see what these jobs have in common, and where they differ</li>
</ol>


<h3> Use BeautifulSoup to get job descriptions and company data: </h3>

In [1]:
# Import packages
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import re
from datetime import datetime

In [2]:
# set a timer to determine run time
start = datetime.now()
run_date = datetime.date(start)

<p>Here is the dictionary of terms that we want to look for in job descriptions. Each key is a term and each value is the regular expression used to find it.</p>

In [3]:
# create a dictionary of terms and then their associated regex values
words = {    
        # python and related terms
        'python' : 'python',
        'jupyter' : 'jupyter',
        'pandas' : 'pandas',
        'numpy' : 'numpy', 
        'tensorflow' : 'tensorflow', 
        'keras' : 'keras',
        'sklearn' : 'sklearn',
        
        # r and related terms
        'r' : 'r',
        'rstudio' : 'rstudio',
        'tidyverse': '(.?plyr|tidyverse)',
        
        # other programming languages
        'java' : 'java',
        'c' : 'c',
        'c++' : r'c\+{2}',
        'c#' : r'c\#',
        'fortran' : 'fortran',
    
        # statistical/programming languages
        'sas' : 'sas', 
        'matlab' : 'matlab',
        'spss' : 'spss',
        'stata' : 'stata',
        'excel' : '(excel|vba)',
              
        # big data technology 
        'hadoop' : '(hadoop|hdfs)',
        'spark' : 'spark',
        'hive' : 'hive',
        'pig' : 'pig',
        'scala' : 'scala',

        # data visualization
        'tableau' : 'tableau',
        'd3' : 'd3',
        'bi' : '(bi|powerbi)',
    
        # database technology
        'sql' : 'sql',
        'nosql' : '(nosql|no sql)',
        'mysql' : '(mysql|my sql)',
        'postgresql' : 'postgresql',
        'sql server': 'sql server',
        'mongo' : '(mongo|mongodb)',
        'redis' : 'redis',
        'cassandra' : 'cassandra',
        'neo4j' : '(neo4j|cypher)',
    
        # web-development related
        'html' : 'html',
        'css' : 'css',
        'javascript' : 'javascript', 
        'ruby' : 'ruby',
        'php' : 'php',
        
        # buzz-words
        'api': 'api',
        'aws': '(aws|amazon web services)', 
        'cloud':'cloud', 
        'big data' : 'big data', 
        'machine learning': '(machine learning|ml)', 
        'deep learning': '(deep learning|dl)', 
        'ai' : '(artificial intelligence|ai)',
        'nlp': '(nlp|natural language processing)',
        'graph' : '(graph|graph database)',
        'data visual': 'data visual[a-z]+',
        'predictive': 'predictive',
        'model':'(model|modeling)',
        'data mining':'data min[a-z]+',
        'simulation': 'simulat[a-z]+',
        'forecast' : '(forecast|forecasting)',
    
        # degree-related
        'statistic': '(statistic|statistics)',
        'math':'(math|mathematics)',
        'computer science' : 'computer science',
        'economics' : 'economics',
        'engineering': 'engineering',
    
        # dc-specific
        'clearance': 'clearance'
    }

# wrap each value in a delimiter group (whitespace, ',', '/' or '.') on both sides so that
# only standalone terms are counted (e.g. 'r' on its own, not the 'r' inside 'spark')

words.update((k, r'(\s|,|\/|\.)' + v + r'(\s|,|\/|\.)') for k,v in words.items())

columns = list(words.keys())
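
<p>As a rough illustration of what the delimiter wrapping does, the sketch below applies the same wrapping to a few terms and counts matches in a made-up sentence. The helper <code>count_term</code>, the space padding, and the sample text are assumptions for illustration and are not part of the scraper. Because each match has to include a delimiter on both sides, padding the text with spaces lets a term at the very start or end of a description be counted as well.</p>

```python
import re

# same delimiter group the notebook wraps around every term
DELIM = r'(\s|,|\/|\.)'

def count_term(pattern, text):
    """Count delimiter-bounded occurrences of pattern in text (hypothetical helper)."""
    wrapped = DELIM + pattern + DELIM
    # pad with spaces so a term at the start or end of the text still has a delimiter
    padded = ' ' + text.lower() + ' '
    return len(re.findall(wrapped, padded))

# made-up description text
sample = 'R and Python required. Experience with C++, SQL/NoSQL, or Spark preferred'

r_count = count_term('r', sample)           # standalone 'r' only, not the 'r' in 'required'
cpp_count = count_term(r'c\+{2}', sample)   # escaped plus signs match the literal 'c++'
sql_count = count_term('sql', sample)       # 'sql' between a space and a slash, not inside 'nosql'
print(r_count, cpp_count, sql_count)
```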

<h4 style = 'text-align: center;'>Get individual job links</h4>
<p>The first step will be to get a list of links to the individual job postings, since Indeed only returns a list of brief descriptions when you search for the term 'data scientist'.</p>
<p>Additionally, we will not limit ourselves geographically, so our search area will be the 'United States', as you can see in the URL.</p>

In [4]:
# define the number of jobs we want to parse (actual was 2 runs of 2500 jobs)
nJobs = 2500
links = []

print('Getting Job URLs')
print()

# loop through Indeed.com (which shows 50 results per page) until we have individual job links
# for all nJobs
for i in range(0, nJobs, 50):
    url = 'https://www.indeed.com/jobs?q=data+scientist&l=United+States&limit=50&start=' + str(i)
    search = requests.get(url)
    
    # parse html for tags with related link class and append job urls to running list
    soup = BeautifulSoup(search.text, 'html.parser')
    for link in soup.find_all('div', attrs = {'class':'jobsearch-SerpJobCard'}):
        link = link.find('a')
        links.append(link.get('href'))
    print('.', end = '')

print()
print('complete')

Getting Job URLs

..................................................
complete
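
<p>The paging scheme used in the loop above can be sketched on its own, without any network calls. This is only an illustration: the helper <code>search_urls</code> and its parameter names are hypothetical, and it uses <code>urllib.parse.urlencode</code> in place of the hand-written query string. It shows that 2500 jobs at 50 results per page come out to 50 search pages, one per dot in the progress output.</p>

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.indeed.com/jobs'

def search_urls(n_jobs, per_page=50, query='data scientist', location='United States'):
    """Build one search URL per results page (hypothetical helper)."""
    urls = []
    for offset in range(0, n_jobs, per_page):
        params = {'q': query, 'l': location, 'limit': per_page, 'start': offset}
        # urlencode escapes spaces as '+', matching the hand-written URL in the notebook
        urls.append(BASE_URL + '?' + urlencode(params))
    return urls

pages = search_urls(2500)
print(len(pages))
print(pages[1])
```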


In [5]:
# append the first part of the URL to each link and remove paid ad jobs
links = ['https://www.indeed.com' + s for s in links]
links = [x for x in links if x.count('pagead') == 0]

# show number of links and an example link (to prove it's the format we wanted)
print('Number of links retrieved: ' + str(len(links)))
print('Example link: ' + links[0])

Number of links retrieved: 2504
Example link: https://www.indeed.com/company/Numero-Data-LLC/jobs/Entry-Level-Data-Scientist-ad9a154fb51286b7?fccid=e2b02ba1480369a4&vjs=3
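
<p>Duplicate postings are only dropped at the very end of this notebook, after every page has been requested. One possible variation, sketched below under assumed names, is to remove repeated links at this stage so the same posting is not requested twice. The helper <code>clean_links</code> and the sample hrefs are made up for illustration, and <code>urljoin</code> stands in for the string concatenation used above.</p>

```python
from urllib.parse import urljoin

def clean_links(hrefs, base='https://www.indeed.com'):
    """Make hrefs absolute, drop paid ads, and remove repeats while keeping order."""
    seen = set()
    cleaned = []
    for href in hrefs:
        if not href or 'pagead' in href:
            continue  # skip missing hrefs and sponsored postings
        url = urljoin(base, href)
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned

# made-up hrefs for illustration
sample_hrefs = [
    '/rc/clk?jk=abc123',
    '/pagead/clk?mo=r',
    None,
    '/rc/clk?jk=abc123',
    '/company/Example/jobs/Data-Scientist-def456',
]
result = clean_links(sample_hrefs)
print(result)
```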


<h4 style = 'text-align: center;'> Parse Job Descriptions </h4>
<p>Now that we have a list containing our individual job pages, we want to parse each of them and count the number of times each of our terms appears in the description.</p>

In [6]:
# create empty data frame and counter
output = pd.DataFrame(columns=words.keys())
counter = 0

print('Parsing Job Descriptions')
print()

# loop through each link
for job in links:
    
    try: 
        counter += 1

        # get the html content
        page = requests.get(job)
        soup = BeautifulSoup(page.text, 'html.parser')

        # find the company link and company page
        company = soup.find(class_ = 'icl-u-lg-mr--sm icl-u-xs-mr--xs')
        companyPage = soup.find(class_ = "icl-NavigableContainer-linkWrapper")

        # try to get the company page if there is a link
        try:
            companyPage = companyPage.get('href')
        except:
            pass

        # try to get the company name if it exists
        try:
            company = company.text
        except: 
            pass

        # find the job title
        jobTitle = soup.find(class_ = 'jobsearch-JobInfoHeader-title')

        try:
            jobTitle = jobTitle.text
        except:
            pass

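        # get the location of the job (reset to None on failure so the value from
        # the previous posting is not carried over)
        try:
            location = soup.find(class_ = 'jobsearch-InlineCompanyRating')
            location = location.find_all('div')[-1].text
        except (AttributeError, IndexError):
            location = None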

        # get the text of the job description and make it all lower case (to avoid miscounting
        # due to capitalization differences)
        temp = soup.find(class_ = 'jobsearch-jobDescriptionText')
        temp = temp.get_text(" ").lower()

        wordMatch = []

        # append the number of times each word in our dictionary appears in the description
        for i in words.keys():
            n = len(re.findall(words.get(i), temp))
            data = [i, int(n)]
            wordMatch.append(data)

        # append on the job, company, title, and company page
        data = ['job', job]
        employer = ['company', company]
        location = ['location', location]
        title = ['job title', jobTitle]
        employerPage = ['employer page', companyPage]

        wordMatch.append(data)
        wordMatch.append(employer)
        wordMatch.append(location)
        wordMatch.append(title)
        wordMatch.append(employerPage)

        # combine each job into a data frame and append
        wordMatch = pd.DataFrame(wordMatch)
        wordMatch = wordMatch.set_index([0]).T
        output = pd.concat([output, wordMatch], sort = False)

        # define the counter to measure progress
        if counter % 50 == 0:
            p = '| ' + str(counter)
            end = '\n'
        else:
            p = '|'
            end = ''

        print(p, end = end)
    
    except:
        pass
    
print()
print('complete')

Parsing Job Descriptions

|||||||||||||||||||||||||||||||||||||||||||||||||| 50
|||||||||||||||||||||||||||||||||||||||||||||||||| 100
|||||||||||||||||||||||||||||||||||||||||||||||||| 150
|||||||||||||||||||||||||||||||||||||||||||||||||| 200
|||||||||||||||||||||||||||||||||||||||||||||||||| 250
|||||||||||||||||||||||||||||||||||||||||||||||||| 300
|||||||||||||||||||||||||||||||||||||||||||||||||| 350
|||||||||||||||||||||||||||||||||||||||||||||||||| 400
|||||||||||||||||||||||||||||||||||||||||||||||||| 450
|||||||||||||||||||||||||||||||||||||||||||||||||| 500
|||||||||||||||||||||||||||||||||||||||||||||||||| 550
|||||||||||||||||||||||||||||||||||||||||||||||||| 600
|||||||||||||||||||||||||||||||||||||||||||||||||| 650
|||||||||||||||||||||||||||||||||||||||||||||||||| 700
|||||||||||||||||||||||||||||||||||||||||||||||||| 750
|||||||||||||||||||||||||||||||||||||||||||||||||| 800
|||||||||||||||||||||||||||||||||||||||||||||||||| 850
||||||||||||||||||||||||||||||||||||||||
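
<p>Growing a data frame inside the loop copies the accumulated table on every iteration. A minimal sketch of an alternative, assuming the counts are first collected as plain dictionaries, is shown below. The names <code>count_terms</code>, <code>patterns</code>, and <code>postings</code> and the example.com URLs are hypothetical, and only two wrapped patterns are included to keep it short.</p>

```python
import re

# two wrapped patterns in the same style as the words dictionary
patterns = {
    'python': r'(\s|,|\/|\.)python(\s|,|\/|\.)',
    'sql': r'(\s|,|\/|\.)sql(\s|,|\/|\.)',
}

def count_terms(description, patterns):
    """Return {term: number of matches} for one description (hypothetical helper)."""
    text = description.lower()
    return {term: len(re.findall(rx, text)) for term, rx in patterns.items()}

# made-up postings standing in for scraped pages
postings = [
    {'job': 'https://example.com/job/1', 'text': 'We use Python and SQL daily. Python is required.'},
    {'job': 'https://example.com/job/2', 'text': 'Strong SQL skills needed.'},
]

rows = []
for posting in postings:
    row = count_terms(posting['text'], patterns)
    row['job'] = posting['job']
    rows.append(row)

print(rows)
```

<p>A list of dictionaries like <code>rows</code> can be passed to <code>pd.DataFrame</code> once, after the loop, to build the whole table in a single step.</p>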

<h4 style = 'text-align: center;'>Company Information</h4>
<p>Now that we have the job descriptions and links to the employer pages, let's see if we can get information about the size of each company (in both employees and revenue) and its industry.</p>

In [7]:
# derive each company page URL by keeping the part of the employer link before 'reviews?'
comp = []
for i in output['employer page']:
    try:
        temp = i.split('reviews?')[0]
        comp.append(temp)
    except:
        comp.append(np.nan)

# create a data frame out of the job and company link
comp = pd.DataFrame(list(zip(output['job'], comp)), columns = ['job', 'employer page'])

In [8]:
# initialize the data frame
compInfo = pd.DataFrame(columns = ['job', 'Link','Employees size', 'Industry', 'Revenue'])
counter = 0

print('Getting Company Information')
print()

# loop through companies
for i, row in comp.iterrows():
    counter += 1
    url = row['employer page']
    titles = []
    values = []
    temp = []
    
    # try to get the company page information, which includes the number of employees,
    # industry, and revenue
    try:
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        title = soup.find_all(class_ = 'cmp-AboutMetadata-itemTitle')
        value = soup.find_all(class_ = 'cmp-AboutMetadata-itemCotent')
        
        # loop through the different itemcontent values and try to append (as there 
        # aren't defined html tags for each company item)
        for t in title:
            titles.append(t.text)
        for v in value:
            values.append(v.get_text('|').split('|')[0])
        
        temp = dict(zip(titles, values))

    # if the page doesn't exist, set the different values to NaNs that we can filter out later
    except:
        temp = {'Employees size': np.nan,
                'Industry': np.nan,
                'Revenue': np.nan}
    
    # create a data frame with values and then append to company info
    a = pd.DataFrame([row['job'], row['employer page'], temp.get('Employees size'), temp.get('Industry'), temp.get('Revenue')]).T
    a.columns = compInfo.columns
    compInfo = pd.concat([compInfo, a], ignore_index = True)
    
    # define our counter to measure progress
    if counter % 50 == 0:
        p = '| ' + str(counter)
        end = '\n'
    else:
        p = '|'
        end = ''
        
    print(p, end = end)
    
print()
print('complete')

Getting Company Information

|||||||||||||||||||||||||||||||||||||||||||||||||| 50
|||||||||||||||||||||||||||||||||||||||||||||||||| 100
|||||||||||||||||||||||||||||||||||||||||||||||||| 150
|||||||||||||||||||||||||||||||||||||||||||||||||| 200
|||||||||||||||||||||||||||||||||||||||||||||||||| 250
|||||||||||||||||||||||||||||||||||||||||||||||||| 300
|||||||||||||||||||||||||||||||||||||||||||||||||| 350
|||||||||||||||||||||||||||||||||||||||||||||||||| 400
|||||||||||||||||||||||||||||||||||||||||||||||||| 450
|||||||||||||||||||||||||||||||||||||||||||||||||| 500
|||||||||||||||||||||||||||||||||||||||||||||||||| 550
|||||||||||||||||||||||||||||||||||||||||||||||||| 600
|||||||||||||||||||||||||||||||||||||||||||||||||| 650
|||||||||||||||||||||||||||||||||||||||||||||||||| 700
|||||||||||||||||||||||||||||||||||||||||||||||||| 750
|||||||||||||||||||||||||||||||||||||||||||||||||| 800
|||||||||||||||||||||||||||||||||||||||||||||||||| 850
|||||||||||||||||||||||||||||||||||||
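
<p>Several postings can point to the same employer page, in which case the loop above requests that page once per posting. The sketch below shows one way to remember earlier results so each distinct URL is fetched only once. It is an illustration under assumptions: <code>make_cached_lookup</code> is a hypothetical helper, and <code>fake_fetch</code> stands in for the requests and BeautifulSoup step so the example runs without network access.</p>

```python
def make_cached_lookup(fetch):
    """Wrap fetch so each distinct URL is only requested once (hypothetical helper)."""
    cache = {}

    def lookup(url):
        if url not in cache:
            cache[url] = fetch(url)
        return cache[url]

    return lookup

# stand-in for the page request and parsing; records calls instead of using the network
calls = []

def fake_fetch(url):
    calls.append(url)
    return {'Industry': 'Example industry', 'Revenue': 'Example revenue'}

lookup = make_cached_lookup(fake_fetch)

# made-up employer pages, with one repeated
employer_pages = [
    'https://example.com/cmp/A',
    'https://example.com/cmp/B',
    'https://example.com/cmp/A',
]
info = [lookup(url) for url in employer_pages]
print(len(info), len(calls))
```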

In [9]:
# join job posting info with company info
output = output.merge(compInfo, how = 'inner', on = ['job']).drop(columns = ['employer page'])

<h4 style = 'text-align: center;'>Bring it all together</h4>
<p>Now that the company and job information is joined into one data frame, let's split the job location into a city and state we can examine, and then write everything out to a CSV file. (We could write this to a SQL database instead if we had more data to store, or if we were running this periodically across weeks or months.)</p>

In [10]:
# split location into city and state and append to end (digits such as ZIP codes are removed first;
# regex = True is needed because newer pandas versions treat the pattern as a literal string by default)
loc_parts = output['location'].str.replace(r'\d+', '', regex = True).str.split(',', expand = True)
output['city'] = loc_parts.loc[:,0].str.strip()
output['state'] = loc_parts.loc[:,1].str.strip()
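
<p>The same split can be sketched for a single string, which makes the handling of ZIP codes and of locations without a comma easier to see. The helper <code>split_location</code> and the sample inputs are assumptions for illustration and are not part of the notebook.</p>

```python
import re

def split_location(location):
    """Split 'City, ST 12345' into (city, state); returns (None, None) if missing."""
    if not isinstance(location, str):
        return None, None
    # drop digits (such as ZIP codes), then split on commas
    parts = re.sub(r'\d+', '', location).split(',')
    city = parts[0].strip() or None
    state = parts[1].strip() if len(parts) > 1 else None
    return city, state

print(split_location('Washington, DC 20001'))
print(split_location('Remote'))
```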

In [11]:
# write output as a csv file for further analysis (including the date and njobs to 
# provide more informative file names)
output.drop_duplicates(keep = 'first', inplace = True)
output.to_csv('output-' + str(nJobs) + '-' + str(run_date) + '.csv')
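
<p>For the database option mentioned above, a minimal sketch using Python's built-in <code>sqlite3</code> module might look like the following. Everything here is assumed for illustration: the table name <code>jobs</code>, its columns, the sample rows, and the placeholder run date. An in-memory database is used so the example leaves no files behind.</p>

```python
import sqlite3
from datetime import date

# made-up rows standing in for the scraped output: (job, company, python count, sql count)
rows = [
    ('https://example.com/job/1', 'Example Co', 2, 1),
    ('https://example.com/job/2', 'Sample Inc', 0, 3),
]
run_date = date(2019, 1, 1).isoformat()  # placeholder run date

conn = sqlite3.connect(':memory:')  # a file path here would persist the data between runs
conn.execute(
    'CREATE TABLE jobs (run_date TEXT, job TEXT, company TEXT, python INTEGER, sql_count INTEGER)'
)
conn.executemany(
    'INSERT INTO jobs VALUES (?, ?, ?, ?, ?)',
    [(run_date,) + row for row in rows],
)
conn.commit()

# each run adds rows tagged with its date, so results can be compared over time
stored = conn.execute('SELECT run_date, COUNT(*) FROM jobs GROUP BY run_date').fetchall()
print(stored)
conn.close()
```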

In [12]:
# get an idea of how many jobs we have after removing duplicate values
output.shape

(1289, 73)

In [13]:
# find out how long it took to run
print('Run time:', datetime.now() - start)

Run time: 0:30:24.193888


<p style = 'font-weight: bold; text-align: center;'>Now that we have our data, let's head over to the job analysis.ipynb file to analyze our results</p>