# Job Descriptions Topic Modelling using NLP

This notebook will be used to explore and model topics and skills from Job descriptions using NLP.

There are multiple ways to source the data so we will experiment with a couple data source paths:
1. **Extract data from single URL** - Scrap Careers section of target website

This approach can be used to drill down and understand specific high profile clients needs. e.g. If HSBC is a key client, extracting data from HSBC's career section would be key in getting direct insights


2. **Use data directly from CSV** - Kaggle Job Titles and Descriptions CSV

This approach can be used if data have already been collected and sourced from various sources into a CSV or similar format. Here, we'll be using Kaggle's dataset: https://www.kaggle.com/bman93/dataset


3. **Scrap Web Job Boards or aggregator websites** - Indeed.co.uk or Other Similar ones



## 2. CSV from Kaggle's Job Descriptions dataset

In [3]:
import pandas as pd

In [4]:
jobs = pd.read_csv('../offline-datasets/kaggle-top30-job-descriptions.csv', na_filter=False)
jobs.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Query,Description
0,10,10,Administrative Assistant,This Administrative Assistant position is resp...
1,35,35,Administrative Assistant,ADMINISTRATIVE ASSISTANT Part Time The West Or...
2,140,140,Administrative Assistant,Administrative Assistant - In Bus 26yrs Fashi...
3,214,214,Sales Representative,Are you ready for something new? Are you tired...
4,350,350,Customer Service Representative,Superior Staff Resources is currently seeking ...


In [5]:
jobs.Description[345]

'<p>If you are a confident and highly motivated individual looking for a great sales opportunity with uncapped earning and growth potential, join our Sales team at Central Payment! We are seeking energetic and entrepreneurial Outside Sales Reps to market our credit card processing services, social media tools, mobile marketing solutions and loyalty rewards programs to businesses across the country. You must also provide excellent customer service to your client accounts. <br>\\r<br>\\r</p>\\r<p><span><b><span>JOB RESPONSIBILITIES <br>\\r<br>\\r</span></b></span></p>\\r<p><span>As an Outside Sales Rep with Central Payment, you will identify customer needs in terms of merchant services, social media and mobile marketing and suggest Central Payment solutions that best meet their business needs. You will also conduct periodic customer service and quality assurance visits with your merchant clients to maintain strong working relationships and grow sales accounts.</span></p>\\r<p><span>Addit

In [6]:
jobs.Description.describe()

count                                                 72292
unique                                                31992
top       <p>Compared to other franchise opportunities i...
freq                                                   3701
Name: Description, dtype: object

## 3. Extract data from Job board search page (Indeed.com)

URL to use: https://www.indeed.co.uk/jobs?q=data+science+consultant&l=London%2C+Greater+London


Source articles: 
1. https://medium.com/@msalmon00/web-scraping-job-postings-from-indeed-96bd588dcb4b
2. https://medium.com/analytics-vidhya/classifying-tech-data-job-postings-on-indeed-com-1fd8ca6e7cdd

In [52]:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

In [53]:
# Source URL
URL = "https://www.indeed.co.uk/jobs?q=data+science+consultant&l=London%2C+Greater+London"

# London search
# https://www.indeed.co.uk/jobs?q=data+science+consultant&l=London
# https://www.indeed.co.uk/jobs?q=data+science+consultant&l=London&start=10
    
# Manchester search
# https://www.indeed.co.uk/jobs?q=data+science+consultant&l=Manchester
    
# Request a page from the specified URL
page = requests.get(URL)

# Store Parsed-Page (pp) by specifying page format using "html.parser"
#pp = BeautifulSoup(page.text, "html.parser")
pp = BeautifulSoup(page.text, "lxml", from_encoding="utf-8")


#printing soup in a more structured tree format that makes for easier reading
print(pp.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <script src="//d3fw5vlhllyvee.cloudfront.net/s/23a063b/en_GB.js" type="text/javascript">
  </script>
  <link href="//d3fw5vlhllyvee.cloudfront.net/s/b45d10b/jobsearch_all.css" rel="stylesheet" type="text/css"/>
  <link href="https://www.indeed.co.uk/rss?q=data+science+consultant&amp;l=London%2C+Greater+London" rel="alternate" title="Data Science Consultant Jobs, vacancies in London" type="application/rss+xml"/>
  <link href="/m/jobs?q=data+science+consultant&amp;l=London%2C+Greater+London" media="only screen and (max-width: 640px)" rel="alternate"/>
  <link href="/m/jobs?q=data+science+consultant&amp;l=London%2C+Greater+London" media="handheld" rel="alternate"/>
  <script type="text/javascript">
   if (typeof window['closureReadyCallbacks'] == 'undefined') {
window['closureReadyCallbacks'] = [];
}

function call_when_jsall_loaded(cb) {
if (window['closureReady']) {



### Extract Job Titles

In [54]:
def extract_job_title_from_result(soup): 
    jobs = []
    for div in soup.find_all(name="div", attrs={"class":"row"}):
        for a in div.find_all(name="a", attrs={"data-tn-element":"jobTitle"}):
            jobs.append(a["title"])
    return(jobs)

In [55]:
extract_job_title_from_result(pp)

['Database Management Consultant for Mining and Exploration Industry',
 'Senior Consultant, Incident Response & Investigations | Forensic & Litigation Consulting',
 'DATA SCIENCE CONSULTANT LONDON',
 'Senior Consultant, Banking Data Science, Consulting, London',
 'Data Science Consultant, Data Analyst (London)',
 'Science Consultant- Research and Development',
 'Business Intelligence Consultant - (Office or Remote - Work from Home)',
 'Data Science Consultant',
 'BUSINESS CONSULTANT LONDON',
 'Data Development Programme –',
 'Associate Consultant 2021 - Analytics (London Office)',
 'Graduate Consultant',
 'Apprentice Laboratory Technician',
 'Development Engineer']

### Extract Company Names

In [56]:
def extract_company_name(soup): 
    company_names = []
    for div in soup.find_all(name="div", attrs={"class":"row"}):
        company = div.find_all(name="span", attrs={"class":"company"})
        if len(company) > 0:
            for b in company:
                company_names.append(b.text.strip())
        else:
            sec_try = div.find_all(name="span", attrs={"class":"result-link-source"})
            for span in sec_try:
                company_names.append(span.text.strip())
    return(company_names)

In [57]:
extract_company_name(pp)

['maxgeo.com',
 'FTI Consulting, Inc.',
 'managementsolutions',
 'Deloitte',
 'GreySpark Partners',
 'Leyton UK Limited',
 'Sywbol',
 'Capgemini',
 'managementsolutions',
 'Grayce',
 'OC&C Strategy Consultants',
 'Syneos Health Commercial Solutions',
 'Youthforce',
 'PerkinElmer']

### Extract Location

In [58]:
def extract_location(soup): 
    locations = []
    spans = soup.findAll("span", attrs={"class": "location"})
    #spans = soup.findAll("span", attrs={"class": "rbLabel"})
    #spans = soup.findAll("span", attrs={"class": "salaryText"})
    
    #salaryText
    #rbLabel
    for span in spans:
        locations.append(span.text)
    return(locations)

In [59]:
extract_location(pp)

['\n£40,000 - £60,000 a year',
 '\n£59,040 - £65,190 a year',
 '\n£16,500 a year']

### Extract Salary

In [60]:
def extract_salary(soup): 
    salaries = []
    for div in soup.find_all(name="div", attrs={"class":"row"}):
        try:
            salaries.append(div.find("nobr").text)
        except:
            try:
                div_two = div.find(name="div", attrs={"class":"sjcl"})
                div_three = div_two.find("div")
                salaries.append(div_three.text.strip())
            except:
                salaries.append("Nothing_found")
    return(salaries)

In [80]:
def extract_salary2(soup): 
    salaries = []
    try:
        for span in soup.findAll("span", attrs={"class": "salaryText"}):
            salaries.append(span.text)
    except:
        salaries.append("No salary information")
    return(salaries)

In [81]:
extract_salary2(pp)

['\n£40,000 - £60,000 a year',
 '\n£59,040 - £65,190 a year',
 '\n£16,500 a year']

### Extract Job Summaries

In [63]:
def extract_summary(soup): 
    summaries = []
    spans = soup.findAll("div", attrs={"class": "summary"})
    for span in spans:
        summaries.append(span.text.strip())
    return(summaries)

In [64]:
extract_summary(pp)

['Exposure to automating data capture.\nAssessment of data quality against set criteria.\nTertiary qualification in computing science or geoscience is also an…',
 'You will perform specific technical tasks to help triage breaches, analyse data quickly and to be able to understand underlying causes and issues.',
 'You will be working in key projects for leading organizations in data mining & knowledge Discovery, predictive modeling, trend modeling, Simulation models …',
 'Experience of the banking environment, products and front and back office operations/ functions;Experience of data analytics and visualisation products such as…',
 'Understanding of data best practices and basic data governance principles.\nAs part of GreySpark’s Data Practice, you will be part of a specialised team of data…',
 'We have highly qualified teams of consultants in sales, science, tax, engineering and software who have helped deliver 100% success rate for HMRC submissions.',
 'Use data visualization techniq

In [65]:
def extract_job_text(soup):
    jobs_text = []
    for div in soup.find_all(name="div", attrs={"class":"jobsearch-jobDescriptionText"}):
        #print(div.text[0:100])
        jobs_text.append(div.text)
    return jobs_text

In [66]:
extract_job_text(pp)

[]

### Scrapper code

In [28]:
# Location to add to URL for getting results for
location = "London"

In [29]:
# Role description search term
job_title = "Data+Science+Consultant"


In [34]:
# Number of pages to return
max_results_per_page = 50

In [31]:
columns = ['Location', 'Company', 'Job Title', 'Job Description', 'Salary']

In [32]:
jobs_df = pd.DataFrame(columns = columns)

In [33]:
jobs_df.head()


Unnamed: 0,Location,Company,Job Title,Job Description,Salary


In [76]:
len(jobs_df)

1

In [49]:
page = requests.get('https://www.indeed.co.uk/jobs?q=' + str(job_title) + '&l=' + str(location) + '&start=' + str(10))

In [86]:
num = 0
for start in range(0, max_results_per_page, 10):
    # Constructing page URL based on "Job Title" and "Location"
    page = requests.get('https://www.indeed.co.uk/jobs?q=' + str(job_title) + '&l=' + str(location) + '&start=' + str(start))
    
    # Adding random sleep time of at least 1" between requests
    time.sleep(1+random.random()*2) 
    
    # Extracting all text from page
    #soup = BeautifulSoup(page.text, “lxml”, from_encoding=”utf-8")
                         
    # Store Parsed-Page (pp) by specifying page format using "html.parser"
    parsed_page = BeautifulSoup(page.text, "lxml", from_encoding="utf-8")
  
    # Start Processing loop
    for div in parsed_page.find_all(name="div", attrs={"class":"row"}): 
        
        #specifying row num for index of job posting in dataframe
        num = (len(jobs_df) + 1)
                
        #creating an empty list to hold the data for each posting
        jobs = [] 
                             
        # Appending City Name
        jobs.append(location)
                        
        # Extracting Company Name
        company = div.find_all(name="span", attrs={"class":"company"})
        if len(company) > 0:
            for b in company:
                jobs.append(b.text.strip())
        else:
            sec_try = div.find_all(name="span", attrs={"class":"result-link-source"})
            for span in sec_try:
                jobs.append(span.text.strip())
        
        
        # Extracting Job Title
        for a in div.find_all(name="a", attrs={"data-tn-element":"jobTitle"}):
            jobs.append(a["title"])
        
        
        # Extracting Summary information
        spans = div.findAll("div", attrs={"class": "summary"})
        for span in spans:
            jobs.append(span.text.strip())
    
        
        # Extracting Salary information
        #spans = div.findAll("span", attrs={"class": "salaryText"})
        #for span in spans:
        #    jobs.append(span.text)
        
        try:
            for span in soup.findAll("span", attrs={"class": "salaryText"}):
                jobs.append(span.text)
        except:
            jobs.append("No salary information")
        
        
        #appending list of job post info to dataframe at index num
        jobs_df.loc[num] = jobs


# Saving output to CSV
jobs_df.to_csv("IndeedJobs.csv", encoding='utf-8');

### Loading saved file

In [88]:
parsed_jobs = pd.read_csv('./IndeedJobs.csv', na_filter=False)
parsed_jobs.head(50)

Unnamed: 0.1,Unnamed: 0,Location,Company,Job Title,Job Description,Salary
0,1,London,maxgeo.com,Database Management Consultant for Mining and ...,Exposure to automating data capture.\nAssessme...,"\n£40,000 - £60,000 a year"
1,2,London,maxgeo.com,Database Management Consultant for Mining and ...,Exposure to automating data capture.\nAssessme...,"\n£40,000 - £60,000 a year"
2,3,London,maxgeo.com,Database Management Consultant for Mining and ...,Exposure to automating data capture.\nAssessme...,No salary information
3,4,London,managementsolutions,DATA SCIENCE CONSULTANT LONDON,You will be working in key projects for leadin...,No salary information
4,5,London,Deloitte,"Senior Consultant, Banking Data Science, Consu...","Experience of the banking environment, product...",No salary information
5,6,London,Youthforce,Apprentice Laboratory Technician,Technicians are crucial members of laboratorie...,No salary information
6,7,London,GreySpark Partners,"Data Science Consultant, Data Analyst (London)",Understanding of data best practices and basic...,No salary information
7,8,London,Capgemini,Data Science Consultant,"Capgemini Invent combines strategy, technology...",No salary information
8,9,London,Leyton UK Limited,Science Consultant- Research and Development,We have highly qualified teams of consultants ...,No salary information
9,10,London,managementsolutions,BUSINESS CONSULTANT LONDON,You will be working in key projects for leadin...,No salary information
