<a href="https://colab.research.google.com/github/futureCodersSE/data-roles/blob/main/Hackathon_data_roles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FutureCoders Data Roles Hackathon

Import all the libraries used in this script.

In [None]:
#web scraping libraries
from bs4 import BeautifulSoup
import requests

#data manipulation/presentation libraries
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'  ##<- this is really annoying, I WANT to copy the DF.
import datetime
from math import floor, log10

#Google libraries for uploading and saving documents
from google.colab import files
from google.colab import drive

These are the URLs for the job searches for Reed.co.uk.

For Reed, the search parameters are: Data jobs within 50 miles of Glasgow,
https://www.reed.co.uk/jobs/data-jobs-in-glasgow?proximity=50

You can register for a Reed API key at this address: https://www.reed.co.uk/developers/Jobseeker. Copy the key into the input box below.

In [None]:
reed_url = "https://www.reed.co.uk/jobs/data-jobs-in-glasgow?&proximity=50"

reed_api_key = input("Enter your Reed API Key: ") 

This is a general function that is used multiple times in each search.

In [None]:
#returns the full HTML of a page.
def get_html(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup

## Reed Job Search
---

This function returns an array of job IDs found on the web page that is passed as an argument.  

In [None]:
def reed_jobs_iterate(page_url):
  site_html = get_html(page_url)
  results = site_html.find(class_="col-sm-8 col-md-9 results-container")

  id_list = np.array([])
  job_cards = results.find_all(class_="job-result-card")

  for job in job_cards: #finds each job, gets its ID and adds it to the list
      job_id = int(job["id"].split("jobSection")[1]) #jobSection48529572 etc...
      
      id_list = np.append(id_list,[job_id])
  return id_list
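
As a minimal sketch of the ID-parsing step on its own, the snippet below pulls the numeric ID out of an element `id` such as `jobSection48529572`. The helper name `job_id_from_element_id` is hypothetical and not part of the notebook; it assumes the `jobSection` prefix shown in the comment above.

```python
def job_id_from_element_id(element_id, prefix="jobSection"):
    """Return the numeric job ID from an element id like 'jobSection48529572'."""
    # Fail loudly if the page layout changes and the prefix is no longer there.
    if not element_id.startswith(prefix):
        raise ValueError(f"unexpected element id: {element_id!r}")
    return int(element_id[len(prefix):])


print(job_id_from_element_id("jobSection48529572"))  # 48529572
```

Checking the prefix first gives a clearer error than an `IndexError` from `split(...)[1]` if the site markup ever changes.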

This function finds the total number of jobs returned by the search and cleans up that information so it is usable.

In [None]:
def find_reed_total_jobs(site_html,max_job_call):
  max_page_requests = int(max_job_call/25) 

  ###
  total_jobs_text = site_html.find(class_="col-sm-11 col-xs-12 page-title").text # '\n' '\r' '\n' x,xxx\r\n  Data Jobs near Ashford       '\n'....
  total_jobs_text = total_jobs_text.replace("\n","").replace("\r","").replace(",","") #          xxxx         Data Jobs near Ashford       .       

  total_jobs = int(total_jobs_text.split("Data")[0].strip(" "))
  total_pages_found = int(np.ceil(total_jobs/25))

  #caps the number of page requests at max_job_call/25 pages
  if total_pages_found > max_page_requests:
    total_pages_search = max_page_requests
  else:
    total_pages_search = total_pages_found


  return total_jobs, total_pages_found, total_pages_search
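
The parsing and page-count arithmetic can be sketched without any scraping. The example below is illustrative only: the helper names `parse_total_jobs` and `pages_to_search` are hypothetical, the title text is a made-up sample modelled on the comments above, and 25 results per page is taken from the function above.

```python
import math


def parse_total_jobs(title_text):
    """Extract the leading job count from a title such as '1,014 Data Jobs near Ashford'."""
    cleaned = title_text.replace("\n", "").replace("\r", "").replace(",", "")
    return int(cleaned.split("Data")[0].strip())


def pages_to_search(total_jobs, max_job_call, page_size=25):
    """Return (pages found, pages to request), capping requests at max_job_call results."""
    pages_found = math.ceil(total_jobs / page_size)
    max_pages = max_job_call // page_size
    return pages_found, min(pages_found, max_pages)


total = parse_total_jobs("\n\r\n 1,014\r\n  Data Jobs near Ashford  \n")
print(total, pages_to_search(total, 1000))  # 1014 (41, 40)
```

With 1014 jobs and a cap of 1000 results, 41 pages exist but only 40 are requested.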

This is the main function. It scrapes the site once to find the total jobs in the search, then builds the required URLs for each page and passes them to the job-finding function.

In [None]:
def reed_scrape(base_url,max_job_call):
  soup = get_html(base_url)
  total_id_list = np.array([],dtype=int)

  #finds out how many more pages to scrape
  total_jobs,total_pages_found , total_pages_search = find_reed_total_jobs(soup,max_job_call)

  print("Total jobs found: ", total_jobs)
  print("Maximum results allowed", max_job_call)
  print("Pages to search: ",total_pages_search)
  print("Pages processed:", end=" ")

  #scraping pages in range [1,x] not pages in range [1,x[
  for page_no in np.arange(1,total_pages_search+1): 
    page_url = base_url+"&pageno="+str(page_no)
    page_id_list = reed_jobs_iterate(page_url)
    total_id_list = np.append(total_id_list,[page_id_list])

    #user feedback
    print(page_no, end=",")
  return total_id_list
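
A small sketch of the URL-building step, separated from the network calls. `build_page_urls` is a hypothetical helper; it assumes, as the loop above does, that the base URL already contains a query string so that `&pageno=` can be appended directly.

```python
def build_page_urls(base_url, total_pages):
    """Return one URL per results page, numbered from 1 to total_pages inclusive."""
    return [f"{base_url}&pageno={page_no}" for page_no in range(1, total_pages + 1)]


urls = build_page_urls(
    "https://www.reed.co.uk/jobs/data-jobs-in-glasgow?&proximity=50", 3
)
for url in urls:
    print(url)
```

Building the list up front makes it easy to check the page range before any requests are sent.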

This function takes the generated list of job IDs and uses the Reed API to retrieve all of the information about each job.

In [None]:
def call_reed_api(id_list,reed_api_key):
  base_url = "https://www.reed.co.uk/api/1.0/jobs/"
  rows = [] #one single-row dataframe per job

  #user feedback
  print("\nRetrieving job information...")
  print("Number of job IDs processed:", end=" ")

  unique_id_list = np.unique(id_list)

  for index, job_id in enumerate(unique_id_list):
    if index % 100 == 0:
      #user feedback
      print(index,end=".....")
    api_url = base_url+str(job_id)
    response = requests.get(api_url, auth=(reed_api_key,""))
    json_data = response.json()

    rows.append(pd.json_normalize(json_data))

  #DataFrame.append was removed in pandas 2.0, so the rows are combined with pd.concat instead
  job_df = pd.concat(rows, ignore_index=True) if rows else pd.DataFrame()
  return job_df
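
The `auth=(reed_api_key,"")` argument asks `requests` to use HTTP Basic authentication, with the API key as the username and an empty password. The sketch below shows the header this scheme produces, using only the standard library. `basic_auth_header` is a hypothetical helper and `"abc"` is a placeholder, not a real key.

```python
import base64


def basic_auth_header(api_key, password=""):
    """Build the Authorization header value for HTTP Basic auth (ASCII credentials assumed)."""
    credentials = f"{api_key}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


print(basic_auth_header("abc"))  # Basic YWJjOg==
```

Base64 is an encoding, not encryption, so the key should only ever be sent over HTTPS.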

Calls previous functions in order.   
NB. This function can take 5 to 10 minutes to run, depending on the maximum number of results you have specified (currently 1000).

In [None]:
def get_jobsearch_data():
  max_job_call = 1000 #reed has a limit of 2000 job search api calls per hour.
  reed_id_list = reed_scrape(reed_url,max_job_call)
  reed_id_list = reed_id_list.astype("int") #it really didn't want to save the array as an int

  full_reed_df = call_reed_api(reed_id_list, reed_api_key)
  
  return full_reed_df


## Cleaning

For the salary normalisation, I have taken the lower value where a salary range is given and used the following assumptions to convert hourly and daily rates into per annum rates.    
For day rates I have assumed working 5 days a week and 4 weeks of holiday.   
For hourly rates I have assumed 36 hours a week and 4 weeks of holiday.   
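
As a worked example of these assumptions, the sketch below converts a day rate and an hourly rate into per annum figures. The function name `annualise` and the constant names are hypothetical; the numbers come from the assumptions stated above.

```python
DAYS_PER_WEEK = 5
HOURS_PER_WEEK = 36
WORKING_WEEKS = 52 - 4  # 4 weeks of holiday


def annualise(rate, unit):
    """Convert a rate in pounds to pounds per annum; unit is 'annum', 'day' or 'hour'."""
    multipliers = {
        "annum": 1,
        "day": DAYS_PER_WEEK * WORKING_WEEKS,    # 240 working days
        "hour": HOURS_PER_WEEK * WORKING_WEEKS,  # 1728 working hours
    }
    return rate * multipliers[unit]


print(annualise(25.00, "hour"))  # 43200.0
print(annualise(300, "day"))     # 72000
```

So an advertised £25.00 per hour is treated as £43,200 per annum before rounding.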

In [None]:
#function for rounding to significant figures rather than to decimal places
def round_sig(x, sig=2):
  return round(x, sig-int(floor(log10(abs(x))))-1)

#normalises the salary into a £000's/year value
def salary_manip(string):
  if pd.isna(string) or "not spec" in string:
    return np.nan

  #multipliers to per annum: annual, daily (5 days x 48 weeks), hourly (36 hours x 48 weeks)
  multiple = [1,260-20,36*(52-4)]
  type_ = 0

  if "annum" in string:
    type_ = 0
  elif "day" in string:
    type_ = 1
  elif "hour" in string:
    type_ = 2

  #keeps the lower bound of a range, e.g. "£25.00 - £55.00 per hour" -> "£25.00 "
  string = string.split("per")[0].split("-")[0]
  string = string.replace("£","").replace(",","").strip()

  try:
    value = float(string)*multiple[type_]
  except ValueError as e:
    #the salary text could not be parsed, so it is treated as missing
    print(e)
    return np.nan
  return round_sig(value)/1000
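
To illustrate the rounding step, the sketch below repeats `round_sig` from the cell above and applies it to a few annualised figures. The inputs are illustrative: £25.00 and £12.67 per hour multiplied by 1728 hours, and a £35,000 annual salary.

```python
from math import floor, log10


def round_sig(x, sig=2):
    # Round to `sig` significant figures by shifting the decimal position.
    return round(x, sig - int(floor(log10(abs(x)))) - 1)


print(round_sig(43200.0) / 1000)   # 43.0
print(round_sig(21893.76) / 1000)  # 22.0
print(round_sig(35000.0) / 1000)   # 35.0
```

Note that `round_sig(0)` raises a `ValueError`, because `log10(0)` is undefined, so a zero salary would need handling separately.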


This takes the full Reed dataframe produced by the get_jobsearch_data() function and returns the dataframe with only the required columns, renamed.

In [None]:
def clean_reed(df_in): #this selects the necessary columns and renames them so the cleaning function works.
  df = df_in[["jobTitle", "employerName", "salary", "contractType", "locationName",  "jobUrl"]].copy() #explicit copy, so df_in is left untouched

  df = df.rename(columns={"jobTitle":"job_title", "employerName": "company", "contractType":"contract", "locationName":"location",  "jobUrl":"job_url"})

  df["salary per annum (£ 000's)"] =  df["salary"].apply(salary_manip)
  df = df[["job_title","company","salary","salary per annum (£ 000's)","contract","location","job_url"]] #reorders the columns into a more readable form.
  return df

## Uploading
---

Mount/Unmount functions

In [None]:
#This connects your Drive to the Colab document
def mount_drive(data_path):
  drive.mount('/content/drive/', force_remount=True)
  project_dir = "/content/drive/MyDrive/"+data_path #--working path
  return project_dir

#This disconnects your Drive from the Colab document
def unmount_drive():
  drive.flush_and_unmount()
  print('Drive Unmounted')

Saving to your Google Drive.   
This requires you to mount Drive to the Colab notebook; a pop-up will appear asking you to log in to authenticate this.

In [None]:
def upload_dataset(path_, df):
  date_today = datetime.date.today()
  project_dir = mount_drive(path_) #pass a folder name as path_ if you want to save to another folder within Google Drive
  df.to_csv(project_dir + "/" + str(date_today) +"_data_jobs_df.csv",index=False)
  unmount_drive()
  print("The file, " + str(date_today) +"_data_jobs_df.csv,", "is now saved in ", project_dir, "folder")
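
The dated file name can be sketched separately from the Drive mounting. `dated_csv_path` is a hypothetical helper; it uses `posixpath` on the assumption that Colab paths are POSIX-style, and the date shown is just an example.

```python
import datetime
import posixpath


def dated_csv_path(project_dir, day):
    """Return the CSV path for a given date, e.g. .../2022-11-21_data_jobs_df.csv."""
    filename = f"{day.isoformat()}_data_jobs_df.csv"
    # posixpath.join avoids a doubled "/" when project_dir already ends with one.
    return posixpath.join(project_dir, filename)


print(dated_csv_path("/content/drive/MyDrive/", datetime.date(2022, 11, 21)))
# /content/drive/MyDrive/2022-11-21_data_jobs_df.csv
```

Because the date is in ISO format, the saved files sort chronologically by name.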

### Run the jobsearch and save the results
---


In [None]:
def main():
  full_reed_df = get_jobsearch_data()
  display(full_reed_df.info())
  reed_df = clean_reed(full_reed_df)
  display(reed_df)
  upload_dataset("", reed_df)

main()

Total jobs found:  1014
Maximum results allowed 1000
Pages to search:  40
Pages processed: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,
Retrieving job information...
Number of job IDs processed: 0,100,200,300,400,500,600,700,
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   employerId           777 non-null    int64 
 1   employerName         777 non-null    object
 2   jobId                777 non-null    int64 
 3   jobTitle             777 non-null    object
 4   locationName         777 non-null    object
 5   minimumSalary        508 non-null    object
 6   maximumSalary        508 non-null    object
 7   yearlyMinimumSalary  508 non-null    object
 8   yearlyMaximumSalary  508 non-null    object
 9   currency             508 non-null    object
 10  salaryType

None

| | job_title | company | salary | salary per annum (£ 000's) | contract | location | job_url |
|---|---|---|---|---|---|---|---|
| 0 | Data Entry Work From Home Online | 925 HomeJobs | | | Permanent | Glasgow | https://www.reed.co.uk/jobs/data-entry-work-fr... |
| 1 | Data Consultant | Dufrain | | | Permanent | Edinburgh | https://www.reed.co.uk/jobs/data-consultant/43... |
| 2 | Data Engineer | Dufrain | | | Permanent | Edinburgh | https://www.reed.co.uk/jobs/data-engineer/4394... |
| 3 | Data Entry Clerk - Remote Work From Home (Part... | Apex Focus Group | £25.00 - £55.00 per hour | 43.0 | Permanent | Glasgow | https://www.reed.co.uk/jobs/data-entry-clerk-r... |
| 4 | Administration Assistant Clerk - Remote Work F... | Apex Focus Group | £25.00 - £55.00 per hour | 43.0 | Permanent | Glasgow | https://www.reed.co.uk/jobs/administration-ass... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 772 | Foreman | Gi Group | £12.67 per hour | 22.0 | Temporary | Glespin | https://www.reed.co.uk/jobs/foreman/48993694 |
| 773 | IT Test Analyst | Michael Page Technology | £35,000 - £45,000 per annum | 35.0 | Permanent | Glasgow | https://www.reed.co.uk/jobs/it-test-analyst/48... |
| 774 | Data Analyst | eFinancialCareers | | | Permanent | Edinburgh | https://www.reed.co.uk/jobs/data-analyst/48994697 |
| 775 | Data Analyst | eFinancialCareers | | | Permanent | Edinburgh | https://www.reed.co.uk/jobs/data-analyst/48994702 |


Mounted at /content/drive/
Drive Unmounted
The file, 2022-11-21_data_jobs_df.csv, is now saved in  /content/drive/MyDrive/ folder
