# Topic Modelling using NLP - Job Descriptions Data



## Part 1 - Data Sourcing - Introduction


This notebook will be used to explore ways to source data for the topic modelling project.

The objective is to generate a CSV (or similar format) that can be loaded into a DataFrame.

There are multiple ways to source the data, so we will experiment with the following:

1. **Use data directly from CSV** - Kaggle Job Titles and Descriptions CSV

This approach can be used if the data have already been collected from various sources into a CSV or similar format. Here, we'll be using Kaggle's dataset, which has a curated CSV file ready to use: https://www.kaggle.com/bman93/dataset


2. **Extract data from Careers URL** - Scrape the Careers section of a target website

This approach can be used to drill down into the needs of specific high-profile clients. For example, if HSBC is a key client, this method would extract data from the careers section of HSBC's website.



3. **Scrape web job boards or aggregator websites** - Indeed.co.uk or similar

This approach can be used to source data from aggregator job boards by parsing the results for a given search topic and extracting the fields needed for the column headers.

In all three methods above, we'll aim to align our data to the following column headers, which we'll refer to as the **Common CSV format**:


| Location | Company | Job Title | Job Description| Salary | E-mail |
|:---:|:---:|:---:|:---:|:---:|:---:|
|  | | |  |  |  |
|  | | |  |  |  |
|  | | |  |  |  |
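
As a rough illustration of aligning a source to the Common CSV format, the sketch below renames source columns and adds any missing ones as empty strings. The helper name `to_common_format`, the `COMMON_COLUMNS` constant and the toy data are assumptions made for this example; they are not part of the notebook's `support_functions` module.

```python
import pandas as pd

# Column order of the "Common CSV format" described above
COMMON_COLUMNS = ["Location", "Company", "Job Title", "Job Description", "Salary", "E-mail"]

def to_common_format(df, mapping):
    """Rename source columns via `mapping` and keep only the common columns.

    Columns that the source does not provide are created and filled with "".
    """
    renamed = df.rename(columns=mapping)
    return renamed.reindex(columns=COMMON_COLUMNS, fill_value="")

# Toy input using the Kaggle column names (Query, Description)
raw = pd.DataFrame({"Query": ["Java Developer"], "Description": ["Build web portals"]})
common = to_common_format(raw, {"Query": "Job Title", "Description": "Job Description"})
print(list(common.columns))
```

Using `reindex` keeps the column order fixed regardless of the order in which a source supplies its fields, which should make the CSVs from the three methods easier to concatenate later.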





## 1. CSV from Kaggle's Job Descriptions dataset

In [1]:
import pandas as pd
import support_functions as sf

In [2]:
jobs = pd.read_csv('../offline-datasets/kaggle-top30-job-descriptions.csv', na_filter=False)
jobs.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Query,Description
0,10,10,Administrative Assistant,This Administrative Assistant position is resp...
1,35,35,Administrative Assistant,ADMINISTRATIVE ASSISTANT Part Time The West Or...
2,140,140,Administrative Assistant,Administrative Assistant - In Bus 26yrs Fashi...
3,214,214,Sales Representative,Are you ready for something new? Are you tired...
4,350,350,Customer Service Representative,Superior Staff Resources is currently seeking ...


In [3]:
jobs.Description[2]

'Administrative Assistant -  In Bus 26yrs Fashion Handbag wholesaler need assistant for Sales and Inventory Depmts., barnard@americanwest.cc  SF52144\\r\\nAs leading supplier of handbags and accessories to the Western Industry, American West is looking for a detail orientated, focused self starter that can multi task in both a sales and operations environment.  The Administrative assistant to sales and operations will be responsible for assistance in data entry and customer service for the Sales Department and administrative assistance to the New Product Development and Inventory manager; reporting to the EVP.    POSITION SUMMARY:  The Sales and Operations Administrative Assistant is responsible for assisting the Sales Department, New Product Development and Inventory Manager with general administration duties, data entry and assisting customers with great customer service.    JOB FUNCTIONS: ? Administrative assistance to the Exec VP, Sales Department, and Product Development and Inven

In [4]:
jobs.Description.describe()

count                                                 72292
unique                                                31992
top       <p>Compared to other franchise opportunities i...
freq                                                   3701
Name: Description, dtype: object
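
The summary above reports 72,292 descriptions but only 31,992 unique values, so 72,292 - 31,992 = 40,300 rows repeat an earlier description, and the most frequent one appears 3,701 times. Repeated text of this kind may skew a topic model, so one option (an assumption here, not a step taken in this notebook) is to drop exact duplicates first. A minimal sketch on toy data:

```python
import pandas as pd

# Toy stand-in for jobs.Description
descriptions = pd.Series(["a", "b", "a", "a", "c"], name="Description")

n_total = len(descriptions)        # number of rows
n_unique = descriptions.nunique()  # number of distinct values
n_dupes = n_total - n_unique       # rows that repeat an earlier value

# Keep the first occurrence of each description
deduped = descriptions.drop_duplicates().reset_index(drop=True)
print(n_total, n_unique, n_dupes)
```

On the real DataFrame the equivalent call would be `jobs.drop_duplicates(subset="Description")`.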

In [5]:
len(jobs.Description)

72292

In [6]:
#jobs.Query.unique()

In [7]:
#jobs.Query.value_counts()

#### Tasks to extract info from Description
* Source: where it was published
* E-mail Contact: extract company name from domain and contact info


### Extract E-mail and Company info from Description

In [8]:
# Call function to extract e-mail from description
jobs = sf.extract_email_from_column(jobs, jobs.Description)
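
The helper `sf.extract_email_from_column` lives in the `support_functions` module, which is not shown in this notebook. The sketch below is a hypothetical stand-in, not that module's implementation: it pulls the first e-mail-like string out of a text using a deliberately simple regular expression.

```python
import re

# Simple pattern: local part, "@", then a domain containing at least one dot.
# It is not a full RFC 5322 validator.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def extract_first_email(text):
    """Return the first e-mail-like substring in `text`, or "" if none is found."""
    match = EMAIL_PATTERN.search(text)
    return match.group(0) if match else ""

sample = "need assistant for Sales and Inventory Depmts., barnard@americanwest.cc  SF52144"
print(extract_first_email(sample))
```

Applied to the DataFrame, this could look like `jobs["Email"] = jobs["Description"].apply(extract_first_email)`.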

In [9]:
jobs.head(50)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Query,Description,Email
0,10,10,Administrative Assistant,This Administrative Assistant position is resp...,
1,35,35,Administrative Assistant,ADMINISTRATIVE ASSISTANT Part Time The West Or...,Kenneth.harker@healthcentral.org
2,140,140,Administrative Assistant,Administrative Assistant - In Bus 26yrs Fashi...,barnard@americanwest.cc
3,214,214,Sales Representative,Are you ready for something new? Are you tired...,
4,350,350,Customer Service Representative,Superior Staff Resources is currently seeking ...,
5,375,375,Customer Service Representative,<br />\r<span><strong>The Opportunity:</strong...,
6,388,388,Java Developer,<P><STRONG>As a member of the Web and Portal D...,
7,395,395,Java Developer,<BR>\r<TABLE border=0 cellSpacing=0 cellPaddin...,
8,618,618,Administrative Assistant,The main function of this role is to provide a...,
9,650,650,Financial Analyst,Financial Analyst\n Perform portfolio analysis...,


### Extract Company name from E-mail domain

In [9]:
import re

In [10]:
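comp_name = []
for email in jobs['Email']:
    if email is None:
        # No e-mail was extracted for this row
        comp_name.append('None')
    else:
        # Slice from the first "@" up to the first "."; the result keeps the
        # leading "@" and is empty when a "." appears before the "@"
        comp_name.append(email[email.find("@"):email.find(".")])

# Store the extracted names as a new column
jobs["Company"] = comp_name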

In [11]:
jobs.head(50)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Query,Description,Email,Company
0,10,10,Administrative Assistant,This Administrative Assistant position is resp...,,
1,35,35,Administrative Assistant,ADMINISTRATIVE ASSISTANT Part Time The West Or...,Kenneth.harker@healthcentral.org,
2,140,140,Administrative Assistant,Administrative Assistant - In Bus 26yrs Fashi...,barnard@americanwest.cc,@americanwest
3,214,214,Sales Representative,Are you ready for something new? Are you tired...,,
4,350,350,Customer Service Representative,Superior Staff Resources is currently seeking ...,,
5,375,375,Customer Service Representative,<br />\r<span><strong>The Opportunity:</strong...,,
6,388,388,Java Developer,<P><STRONG>As a member of the Web and Portal D...,,
7,395,395,Java Developer,<BR>\r<TABLE border=0 cellSpacing=0 cellPaddin...,,
8,618,618,Administrative Assistant,The main function of this role is to provide a...,,
9,650,650,Financial Analyst,Financial Analyst\n Perform portfolio analysis...,,


In [12]:
# Alternative way to extract company name
t1 = "mattison@adamsproduce.com"
print(re.findall(r'\w+[.]', t1)[0][:-1])

adamsproduce
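
Both approaches above depend on where the first dot falls. For `Kenneth.harker@healthcentral.org` the first `.` precedes the `@`, so the slice is empty, and the `\w+[.]` pattern returns `Kenneth` rather than the domain. A minimal sketch that splits on `@` first is shown below; the function name `company_from_email` is an assumption for illustration, and the first domain label is only a rough proxy for the company name (it would return `gmail` for a personal address).

```python
def company_from_email(email):
    """Return the first label of the e-mail's domain, or "" if there is no usable address."""
    if not isinstance(email, str) or "@" not in email:
        return ""
    # Everything after the last "@" is the domain
    domain = email.rsplit("@", 1)[1]
    # The first label of the domain, without the "@" or the TLD
    return domain.split(".", 1)[0]

for address in ["barnard@americanwest.cc", "Kenneth.harker@healthcentral.org", ""]:
    print(repr(company_from_email(address)))
```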


In [22]:
s = jobs['Description'][2]
re.findall(r'\S+@\S+', s)
#re.findall(r'\w+@\w+.\w{3}',s)

['barnard@americanwest.cc']

In [47]:
jobs.head(20)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Query,Description,Email,Company
0,10,10,Administrative Assistant,This Administrative Assistant position is resp...,,
1,35,35,Administrative Assistant,ADMINISTRATIVE ASSISTANT Part Time The West Or...,harker@healthcentral.org,@healthcentral
2,140,140,Administrative Assistant,Administrative Assistant - In Bus 26yrs Fashi...,barnard@americanwest,@americanwes
3,214,214,Sales Representative,Are you ready for something new? Are you tired...,,
4,350,350,Customer Service Representative,Superior Staff Resources is currently seeking ...,,
5,375,375,Customer Service Representative,<br />\r<span><strong>The Opportunity:</strong...,,
6,388,388,Java Developer,<P><STRONG>As a member of the Web and Portal D...,,
7,395,395,Java Developer,<BR>\r<TABLE border=0 cellSpacing=0 cellPaddin...,,
8,618,618,Administrative Assistant,The main function of this role is to provide a...,,
9,650,650,Financial Analyst,Financial Analyst\n Perform portfolio analysis...,,


In [20]:
comp_name[:5]

['@healthcentral',
 '@americanwes',
 '@adamsproduce',
 '@selectstaffing',
 '@SandCompanies']

In [15]:
s="bal@gmail.com"
s[ s.find("@") : ]

'@gmail.com'

In [None]:

text[ text.find("@") : text.find(".")]

### Transform to "Common CSV format" and Save

In [51]:
columns = ['Location', 'Company', 'Job Title', 'Job Description', 'Salary']
jobs1_df = pd.DataFrame(columns = columns)

In [52]:
jobs1_df['Job Title'] = jobs['Query']

In [53]:
jobs1_df['Job Description'] = jobs['Description']

In [54]:
jobs1_df['E-mail'] = jobs['Email']

In [55]:
jobs1_df.head()

Unnamed: 0,Location,Company,Job Title,Job Description,Salary,E-mail
0,,,Administrative Assistant,This Administrative Assistant position is resp...,,
1,,,Administrative Assistant,ADMINISTRATIVE ASSISTANT Part Time The West Or...,,harker@healthcentral.org
2,,,Administrative Assistant,Administrative Assistant - In Bus 26yrs Fashi...,,barnard@americanwest
3,,,Sales Representative,Are you ready for something new? Are you tired...,,
4,,,Customer Service Representative,Superior Staff Resources is currently seeking ...,,


In [56]:
jobs1_df.to_csv("KaggleJobs.csv", encoding='utf-8');

## 2. Extract data from Careers URL client website

## 3. Extract data from Job board search page (Indeed.co.uk)

URL to use: https://www.indeed.co.uk/jobs?q=data+science+consultant&l=London%2C+Greater+London


Source articles: 
1. https://medium.com/@msalmon00/web-scraping-job-postings-from-indeed-96bd588dcb4b
2. https://medium.com/analytics-vidhya/classifying-tech-data-job-postings-on-indeed-com-1fd8ca6e7cdd
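
The search URL at the top of this section carries the query and location as URL-encoded parameters, and later cells add a `start` offset in steps of 10 to move through result pages. A minimal sketch of building such URLs with the standard library follows; the function name `build_search_urls` and its `page_size` parameter are assumptions for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.co.uk/jobs"

def build_search_urls(query, location, max_results, page_size=10):
    """Return one search URL per result page, covering `max_results` results."""
    urls = []
    for start in range(0, max_results, page_size):
        params = {"q": query, "l": location}
        if start:
            # The first page needs no offset; later pages use start=10, 20, ...
            params["start"] = start
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls

for url in build_search_urls("data science consultant", "London, Greater London", 30):
    print(url)
```

`urlencode` takes care of replacing spaces with `+` and the comma with `%2C`, so the search terms can be written in plain text.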

In [59]:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

In [61]:
# Source URL
#URL = "https://www.indeed.co.uk/jobs?q=data+science+consultant&l=London%2C+Greater+London"

URL = "https://www.indeed.co.uk/jobs?q=data+science+consultant&l=London&start=10"
# London search
# https://www.indeed.co.uk/jobs?q=data+science+consultant&l=London
# https://www.indeed.co.uk/jobs?q=data+science+consultant&l=London&start=10
    
# Manchester search
# https://www.indeed.co.uk/jobs?q=data+science+consultant&l=Manchester
    
# Request a page from the specified URL
page = requests.get(URL)

# Store the parsed page (pp), using the "lxml" parser
#pp = BeautifulSoup(page.text, "html.parser")
pp = BeautifulSoup(page.text, "lxml")


# Print the parsed page in a readable format
print(pp.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <script src="//d3fw5vlhllyvee.cloudfront.net/s/23a063b/en_GB.js" type="text/javascript">
  </script>
  <link href="//d3fw5vlhllyvee.cloudfront.net/s/b45d10b/jobsearch_all.css" rel="stylesheet" type="text/css"/>
  <link href="https://www.indeed.co.uk/rss?q=data+science+consultant&amp;l=London" rel="alternate" title="Data Science Consultant Jobs, vacancies in London" type="application/rss+xml"/>
  <link href="/m/jobs?q=data+science+consultant&amp;l=London" media="only screen and (max-width: 640px)" rel="alternate"/>
  <link href="/m/jobs?q=data+science+consultant&amp;l=London" media="handheld" rel="alternate"/>
  <script type="text/javascript">
   if (typeof window['closureReadyCallbacks'] == 'undefined') {
window['closureReadyCallbacks'] = [];
}

function call_when_jsall_loaded(cb) {
if (window['closureReady']) {
cb();
} else {
window['closureReadyCallbacks'].push(c

### Extract Job Titles

In [62]:
def extract_job_title_from_result(soup): 
    jobs = []
    for div in soup.find_all(name="div", attrs={"class":"row"}):
        for a in div.find_all(name="a", attrs={"data-tn-element":"jobTitle"}):
            jobs.append(a["title"])
    return(jobs)

In [63]:
extract_job_title_from_result(pp)

['Associate Consultant 2021 - Analytics (London Office)',
 'Data Development Programme –',
 '2021 Summer Internship - Consulting',
 'Senior Strategic Consultant - Data Science & Analytics',
 'Junior Consultant',
 'Data Science Quant (Banking Consultant)',
 'Data Privacy Consultant Job, London, £60,000',
 '2021 Graduate Consultant',
 'PepsiCo UK Industrial Placement- Data Science & Analytics',
 'Data & Analytics Consultant (m/f/d)',
 'Senior Consultant - People Analytics - People Advisory Services (UK&I) - Open to Flexible Working',
 'Consultant, Data Scientist, Ventures, Tax, London',
 'Operations Transformation - Business Analysis, Consultant',
 'Data & Analytics Consultant',
 'Recruitment Consultant – IT / Data']

### Extract Company Names

In [64]:
def extract_company_name(soup): 
    company_names = []
    for div in soup.find_all(name="div", attrs={"class":"row"}):
        company = div.find_all(name="span", attrs={"class":"company"})
        if len(company) > 0:
            for b in company:
                company_names.append(b.text.strip())
        else:
            sec_try = div.find_all(name="span", attrs={"class":"result-link-source"})
            for span in sec_try:
                company_names.append(span.text.strip())
    return(company_names)

In [65]:
extract_company_name(pp)

['OC&C Strategy Consultants',
 'Grayce',
 'BAE Systems Applied Intelligence',
 'Amey plc',
 'Octamis',
 'La Fosse Associates',
 'Beecher Madden.',
 'BAE Systems Applied Intelligence',
 'PepsiCo',
 'Artefact',
 'EY',
 'Deloitte',
 'Capgemini',
 'Artefact',
 'CCG Associates']

### Extract Location

In [66]:
def extract_location(soup): 
    # Collect the text of every span with class "location"
    locations = []
    for span in soup.find_all("span", attrs={"class": "location"}):
        locations.append(span.text)
    return locations

In [67]:
extract_location(pp)

['London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London',
 'London']

### Extract Salary

In [68]:
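def extract_salary(soup): 
    salaries = []
    for div in soup.find_all(name="div", attrs={"class":"row"}):
        try:
            # Preferred source: a <nobr> element inside the posting
            salaries.append(div.find("nobr").text)
        except AttributeError:
            try:
                # Fallback: first nested div inside the "sjcl" block
                div_two = div.find(name="div", attrs={"class":"sjcl"})
                div_three = div_two.find("div")
                salaries.append(div_three.text.strip())
            except AttributeError:
                # Neither element exists for this posting
                salaries.append("Nothing_found")
    return salaries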

In [69]:
def extract_salary2(soup): 
    # Collect the text of every salary span on the page; postings without a
    # salary have no such span, so the list can be shorter than the job list
    salaries = []
    for span in soup.find_all("span", attrs={"class": "salaryText"}):
        salaries.append(span.text)
    return salaries

In [70]:
extract_salary2(pp)

['\n£16,800 a year', '\n£60,000 - £95,000 a year', '\n£21,000 a year']
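
The salary strings returned above carry a leading newline and come either as a single figure or as a range. To make them comparable, they could be parsed into numbers, as in the sketch below. The function name `parse_salary` and the set of pay periods other than "year" are assumptions; only yearly salaries appear in the output above.

```python
import re

# A pound sign followed by digits with optional thousands separators
AMOUNT = re.compile(r"£([\d,]+(?:\.\d+)?)")
# Pay period such as "a year" or "an hour" (periods other than "year" are assumed)
PERIOD = re.compile(r"\ban? (year|month|week|day|hour)")

def parse_salary(text):
    """Return a dict with min, max and period, or None if no amount is found."""
    amounts = [float(a.replace(",", "")) for a in AMOUNT.findall(text)]
    if not amounts:
        return None
    period = PERIOD.search(text)
    return {
        "min": min(amounts),
        "max": max(amounts),
        "period": period.group(1) if period else None,
    }

for raw in ["\n£16,800 a year", "\n£60,000 - £95,000 a year"]:
    print(parse_salary(raw))
```

A single figure yields the same value for `min` and `max`, which keeps the output shape uniform for both kinds of string.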

### Extract Job Summaries

In [71]:
def extract_summary(soup): 
    summaries = []
    spans = soup.findAll("div", attrs={"class": "summary"})
    for span in spans:
        summaries.append(span.text.strip())
    return(summaries)

In [72]:
extract_summary(pp)

['OC&C is regularly tasked with problems where the answers are hidden in large, messy and complex data sets such as transaction data, customer-behaviour data and…',
 '2:1 minimum degree with an analytical focus & data handling such as STEM.\nAn ambition to work with data and an understanding of how it can help organisations…',
 'You will creatively apply data science approaches to develop analytic results which turn data into information - information into knowledge and insights - and…',
 'Although you may have strengths in areas of data science, we encourage our consultants to cultivate a wide range of both technical and non-technical skills.',
 'BA/BS preferred in computer science, computer engineering, mathematics or equivalent work experience.\nHappy to travel 80% of the time in the UK and sometimes in…',
 'They are still looking to hire multiple data science quantitative consultants from strong banking, statistical and software programming backgrounds into the…',
 'Producing effec

In [73]:
def extract_job_text(soup):
    jobs_text = []
    for div in soup.find_all(name="div", attrs={"class":"jobsearch-jobDescriptionText"}):
        #print(div.text[0:100])
        jobs_text.append(div.text)
    return jobs_text

In [74]:
extract_job_text(pp)

[]
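
The empty list suggests that the search results page does not contain any `jobsearch-jobDescriptionText` element, presumably because the full description is only shown on each posting's own page. One possible approach is to collect the link from each job title and fetch those pages separately. The sketch below assumes that the `jobTitle` anchors carry an `href` attribute; the function name `extract_job_links` and the sample HTML, including its `/viewjob?id=example` path, are hypothetical.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_job_links(soup, base_url="https://www.indeed.co.uk"):
    """Return absolute URLs for every job-title link found in `soup`."""
    links = []
    for a in soup.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
        href = a.get("href")
        if href:
            # Resolve relative links against the site root
            links.append(urljoin(base_url, href))
    return links

# Hypothetical markup standing in for one search result
sample_html = (
    '<div class="row">'
    '<a data-tn-element="jobTitle" href="/viewjob?id=example" title="Data Consultant">'
    "Data Consultant</a></div>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")
print(extract_job_links(sample_soup))
```

Each returned link could then be requested and parsed in the same way as the search page, and the result passed to `extract_job_text`.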

### Scraper code (putting it all together)

In [75]:
# Location to add to URL for getting results for
location = "London"

In [76]:
# Role description search term
job_title = "Data+Science+Consultant"

In [77]:
# Maximum number of results to fetch (each page holds 10 results)
max_results_per_page = 50

In [79]:
columns = ['Location', 'Company', 'Job Title', 'Job Description', 'Salary']
jobs_df = pd.DataFrame(columns = columns)

In [80]:
jobs_df.head()


Unnamed: 0,Location,Company,Job Title,Job Description,Salary


In [81]:
num = 0
for start in range(0, max_results_per_page, 10):
    # Construct the page URL from the job title, location and result offset
    page = requests.get('https://www.indeed.co.uk/jobs?q=' + str(job_title) + '&l=' + str(location) + '&start=' + str(start))
    
    # Sleep for a random 1 to 3 seconds between requests
    time.sleep(1 + random.random() * 2)
    
    # Parse the page using the "lxml" parser
    parsed_page = BeautifulSoup(page.text, "lxml")
  
    # Process each job posting (one "row" div per posting)
    for div in parsed_page.find_all(name="div", attrs={"class": "row"}):
        
        # Row label for this posting in the DataFrame
        num = len(jobs_df) + 1
        
        # List holding the fields for this posting, in the order of `columns`
        jobs = []
        
        # Location: the search location
        jobs.append(location)
        
        # Company name, with a fallback class; empty string if neither exists
        company = div.find(name="span", attrs={"class": "company"})
        if company is None:
            company = div.find(name="span", attrs={"class": "result-link-source"})
        jobs.append(company.text.strip() if company is not None else "")
        
        # Job title, taken from the link's "title" attribute
        a = div.find(name="a", attrs={"data-tn-element": "jobTitle"})
        jobs.append(a.get("title", "") if a is not None else "")
        
        # Summary text shown on the results page
        summary = div.find("div", attrs={"class": "summary"})
        jobs.append(summary.text.strip() if summary is not None else "")
        
        # Salary, looked up inside this posting's div; default text when absent
        salary = div.find("span", attrs={"class": "salaryText"})
        jobs.append(salary.text.strip() if salary is not None else "No salary information")
        
        # Append the posting to the DataFrame at row label num
        jobs_df.loc[num] = jobs






In [83]:
jobs_df.head(50)

Unnamed: 0,Location,Company,Job Title,Job Description,Salary
1,London,Mango Business Solutions,Data Consultant,A data consultant experienced in working in a ...,No salary information
2,London,Cambridge Healthcare Research,Associate Consultant,Where appropriate manages external contractors...,No salary information
3,London,Celonis,Intern/Working Student Data Science / Business...,... have already gained substantial know-how i...,No salary information
4,London,Celonis SE,Intern/Working Student Data Science / Business...,... have already gained substantial know-how i...,No salary information
5,London,managementsolutions,DATA SCIENCE CONSULTANT LONDON,You will be working in key projects for leadin...,No salary information
6,London,Deloitte,"Senior Consultant, Banking Data Science, Consu...","Experience of the banking environment, product...",No salary information
7,London,GreySpark Partners,"Data Science Consultant, Data Analyst (London)",Understanding of data best practices and basic...,No salary information
8,London,Leyton UK Limited,Science Consultant- Research and Development,We have highly qualified teams of consultants ...,No salary information
9,London,Capgemini,Data Science Consultant,"Capgemini Invent combines strategy, technology...",No salary information
10,London,managementsolutions,BUSINESS CONSULTANT LONDON,You will be working in key projects for leadin...,No salary information


In [84]:
# Saving output to CSV
jobs_df.to_csv("IndeedJobs.csv", encoding='utf-8');

### Loading saved file

In [85]:
parsed_jobs = pd.read_csv('./IndeedJobs.csv', na_filter=False)
parsed_jobs.head(50)

Unnamed: 0.1,Unnamed: 0,Location,Company,Job Title,Job Description,Salary
0,1,London,Mango Business Solutions,Data Consultant,A data consultant experienced in working in a ...,No salary information
1,2,London,Cambridge Healthcare Research,Associate Consultant,Where appropriate manages external contractors...,No salary information
2,3,London,Celonis,Intern/Working Student Data Science / Business...,... have already gained substantial know-how i...,No salary information
3,4,London,Celonis SE,Intern/Working Student Data Science / Business...,... have already gained substantial know-how i...,No salary information
4,5,London,managementsolutions,DATA SCIENCE CONSULTANT LONDON,You will be working in key projects for leadin...,No salary information
5,6,London,Deloitte,"Senior Consultant, Banking Data Science, Consu...","Experience of the banking environment, product...",No salary information
6,7,London,GreySpark Partners,"Data Science Consultant, Data Analyst (London)",Understanding of data best practices and basic...,No salary information
7,8,London,Leyton UK Limited,Science Consultant- Research and Development,We have highly qualified teams of consultants ...,No salary information
8,9,London,Capgemini,Data Science Consultant,"Capgemini Invent combines strategy, technology...",No salary information
9,10,London,managementsolutions,BUSINESS CONSULTANT LONDON,You will be working in key projects for leadin...,No salary information


In [87]:
parsed_jobs.drop('Unnamed: 0', axis=1, inplace=True)

In [88]:
parsed_jobs.head()

Unnamed: 0,Location,Company,Job Title,Job Description,Salary
0,London,Mango Business Solutions,Data Consultant,A data consultant experienced in working in a ...,No salary information
1,London,Cambridge Healthcare Research,Associate Consultant,Where appropriate manages external contractors...,No salary information
2,London,Celonis,Intern/Working Student Data Science / Business...,... have already gained substantial know-how i...,No salary information
3,London,Celonis SE,Intern/Working Student Data Science / Business...,... have already gained substantial know-how i...,No salary information
4,London,managementsolutions,DATA SCIENCE CONSULTANT LONDON,You will be working in key projects for leadin...,No salary information
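
The `Unnamed: 0` column dropped above appears because `to_csv` writes the DataFrame index as an unnamed first column by default, and `read_csv` then reads it back as data. A minimal sketch of the difference, using an in-memory buffer and toy data instead of a file:

```python
import io
import pandas as pd

df = pd.DataFrame({"Location": ["London"], "Company": ["Capgemini"]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False omits the index, so nothing needs dropping after reloading
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

Passing `index=False` when saving `KaggleJobs.csv` and `IndeedJobs.csv` would avoid the extra column, at the cost of losing the row labels.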
