# Scraper for Jobs Listed by Company Name and Location
### Parsing the response from searching request and get the listed jobs of companies

* Website: glassdoor.com
* Library usage: requests`pip install requests`, lxml`pip install lxml`
* Results saved to joblist_company.csv under data folder
* Fields of output: job title, company name, salary, job link, location
* Reference: https://www.scrapehero.com/how-to-scrape-job-listings-from-glassdoor-using-python-and-lxml/

In [1]:
from lxml import html, etree
import requests
import re
import os
import sys
import unicodecsv as csv
import argparse
import json
import pandas as pd

In [2]:
def parse(keyword, place):

	headers = {	'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
				'accept-encoding': 'gzip, deflate, sdch, br',
				'accept-language': 'en-GB,en-US;q=0.8,en;q=0.6',
				'referer': 'https://www.glassdoor.com/',
				'upgrade-insecure-requests': '1',
				'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/51.0.2704.79 Chrome/51.0.2704.79 Safari/537.36',
				'Cache-Control': 'no-cache',
				'Connection': 'keep-alive'
	}

	location_headers = {
		'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.01',
		'accept-encoding': 'gzip, deflate, sdch, br',
		'accept-language': 'en-GB,en-US;q=0.8,en;q=0.6',
		'referer': 'https://www.glassdoor.com/',
		'upgrade-insecure-requests': '1',
		'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/51.0.2704.79 Chrome/51.0.2704.79 Safari/537.36',
		'Cache-Control': 'no-cache',
		'Connection': 'keep-alive'
	}
	data = {"term": place,
			"maxLocationsToReturn": 10}

	location_url = "https://www.glassdoor.co.in/findPopularLocationAjax.htm?"
	try:
		# Getting location id for search location
		#print("Fetching location details")
		location_response = requests.post(location_url, headers=location_headers, data=data).json()
		place_id = location_response[0]['locationId']
		job_litsting_url = 'https://www.glassdoor.com/Job/jobs.htm'
		# Form data to get job results
		data = {
			'clickSource': 'searchBtn',
			'sc.keyword': keyword,
			'locT': 'C',
			'locId': place_id,
			'jobType': ''
		}

		job_listings = []
		if place_id:
			response = requests.post(job_litsting_url, headers=headers, data=data)
			# extracting data from
			# https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=true&clickSource=searchBtn&typedKeyword=andr&sc.keyword=android+developer&locT=C&locId=1146821&jobType=
			parser = html.fromstring(response.text)
			# Making absolute url 
			base_url = "https://www.glassdoor.com"
			parser.make_links_absolute(base_url)

			XPATH_ALL_JOB = '//li[@class="jl"]'
			XPATH_NAME = './/a/text()'
			XPATH_JOB_URL = './/a/@href'
			XPATH_LOC = './/span[@class="subtle loc"]/text()'
			XPATH_COMPANY = './/div[@class="flexbox empLoc"]/div/text()'
			XPATH_SALARY = './/span[@class="green small"]/text()'

			listings = parser.xpath(XPATH_ALL_JOB)
			for job in listings:
				raw_job_name = job.xpath(XPATH_NAME)
				raw_job_url = job.xpath(XPATH_JOB_URL)
				raw_lob_loc = job.xpath(XPATH_LOC)
				raw_company = job.xpath(XPATH_COMPANY)
				raw_salary = job.xpath(XPATH_SALARY)

				# Cleaning data
				job_name = ''.join(raw_job_name).strip('–') if raw_job_name else None
				job_location = ''.join(raw_lob_loc) if raw_lob_loc else None
				raw_state = re.findall(",\s?(.*)\s?", job_location)
				state = ''.join(raw_state).strip()
				raw_city = job_location.replace(state, '')
				city = raw_city.replace(',', '').strip()
				company = ''.join(raw_company).replace('–','')
				salary = ''.join(raw_salary).strip()
				job_url = raw_job_url[0] if raw_job_url else None

				jobs = {
					"Name": job_name,
					"Company": company,
					"State": state,
					"City": city,
					"Salary": salary,
					"Location": job_location,
					"Url": job_url
				}
				job_listings.append(jobs)

			return job_listings
		else:
			print("location id not available")

	except:
		print("Failed to load locations")

In [3]:
cp=pd.read_csv('../00-data/companies.csv')
companies = cp.groupby('name')['location'].apply(list).to_dict()
#print(companies)
print(len(companies))

397


In [5]:
cCount=0
jCount=0
with open('../00-data/joblist_company.csv', 'wb')as csvfile:
    fieldnames = ['Name', 'Company', 'State','City', 'Salary', 'Location', 'Url']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames,quoting=csv.QUOTE_ALL, encoding='utf-8')
    writer.writeheader()
    print("Writing data to output file")
    for company, location in companies.items():
        print(company, location)
        scraped_data = parse(company, location)
        if scraped_data:
            cCount+=1
            for data in scraped_data:
                jCount+=1
                writer.writerow(data)
            else:
                print("Your search for %s, in %s does not match any jobs"%(company,location))

print("Complete, %s companies found, %s listed job found"%(cCount,jCount))

Writing data to output file
100% Vision ['Ukraine']
Your search for 100% Vision, in ['Ukraine'] does not match any jobs
13th Lab ['Stockholm']
3D Data ['-']
Your search for 3D Data, in ['-'] does not match any jobs
4tiitoo ['San Francisco']
9fin ['London']
AE Studio ['Los Angeles']
Failed to load locations
AMP Robotics ['Boulder']
Aatonomy ['San Francisco']
Abundant Robotics ['Hayward']
Your search for Abundant Robotics, in ['Hayward'] does not match any jobs
AdMobilize ['Miami']
AdvertEyes ['-']
Affectiva ['Boston']
Your search for Affectiva, in ['Boston'] does not match any jobs
Affinity ['San Francisco']
Your search for Affinity, in ['San Francisco'] does not match any jobs
AgriData ['Mountain View']
AiFi ['-']
Aireal ['Dallas']
Alcatraz AI ['Palo Alto']
AlchemyAPI (Acquired by IBM Watson Group) ['-']
Algolux ['San Francisco']
Anchovi Labs ['San Diego']
Antumbra ['Boston']
Anyline ['Vienna']
Your search for Anyline, in ['Vienna'] does not match any jobs
Aptonomy ['San Francisco']
Aq

LookAllure ['Menlo Park']
Loom.ai ['San Francisco']
Your search for Loom.ai, in ['San Francisco'] does not match any jobs
Lucid ['Santa Clara']
Lumific ['Sydney']
MDacne (YC W17) ['San Francisco']
MTailor ['San Francisco']
Your search for MTailor, in ['San Francisco'] does not match any jobs
Mad Street Den ['-']
Mageca ['Athens']
Magentiq Eye ['Israel']
Magic Pony Technology ['London']
Your search for Magic Pony Technology, in ['London'] does not match any jobs
Makelight ['London']
Mana Ventures ['San Francisco']
Mangolytics ['San Francisco']
Mapillary ['Malmö']
Markable.ai ['New York City']
Your search for Markable.ai, in ['New York City'] does not match any jobs
Mashgin ['Mountain View']
Mati ['San Francisco']
Your search for Mati, in ['San Francisco'] does not match any jobs
Matterport ['Sunnyvale']
Your search for Matterport, in ['Sunnyvale'] does not match any jobs
Mavrx ['San Francisco']
MealPic ['-']
Mediachain Labs ['Brooklyn']
Medviv ['Boston']
Menache ['Los Angeles']
Meograph

VideoPlusPlus(videopls.com) ['Shanghai']
Your search for VideoPlusPlus(videopls.com), in ['Shanghai'] does not match any jobs
VideoSlick ['-']
Vidrovr (Techstars '17) ['New York City']
Viewdle ['Palo Alto']
Visada ['San Francisco']
Visbit ['Sunnyvale']
Visibly ['Chicago']
Your search for Visibly, in ['Chicago'] does not match any jobs
Vision Computer Solutions ['-']
Your search for Vision Computer Solutions, in ['-'] does not match any jobs
Visualead ['Tel Aviv-Yafo']
Vivacity Labs ['London']
Your search for Vivacity Labs, in ['London'] does not match any jobs
Volumental ['Stockholm']
WIREWAX ['London']
Your search for WIREWAX, in ['London'] does not match any jobs
WayWay ['New York City']
Waygo ['San Francisco']
Wild Me ['Portland']
Your search for Wild Me, in ['Portland'] does not match any jobs
XLabs ['Melbourne']
Xihelm ['London']
Ximmerse ['Los Angeles']
Z Imaging ['Boston']
Zippin ['San Francisco']
Your search for Zippin, in ['San Francisco'] does not match any jobs
airXsys ['Sun

## Contributions 
-By own: 50%  
-By online resources: 50%  

## Citations
1. https://www.scrapehero.com/how-to-scrape-job-listings-from-glassdoor-using-python-and-lxml/

## License

Copyright 2019 COPYRIGHT Yunan Shao

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.