# WEB SCRAPING GLASSDOOR

## ABSTRACT

The aim of this notebook is to extract details about job postings from Glassdoor. For each posting we extract the 'JobTitle', 'Company_Name', 'Company_location', 'Job_Description', 'Rating' and 'Salary'.
We use the Selenium Chrome web driver to extract the job postings for 'Data Scientist', 'Data Analyst', 'Data Engineer' and 'Business Intelligence Analyst'. We search for these positions in three states, namely 'MA', 'TX' and 'CA'.
We use Selenium with Python to locate specific HTML tags by their XPath with 'find_element_by_xpath' and read the text at that location.

### WHAT IS WEB SCRAPING?

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

![](files/ss/Web-Scraping.jpg)
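As a minimal illustration of "gathering specific data from a page", the sketch below pulls job titles out of a small HTML string using only Python's standard library. The HTML snippet, the `job-title` class and the `JobTitleParser` name are made up for this example and are not taken from Glassdoor.

```python
from html.parser import HTMLParser


class JobTitleParser(HTMLParser):
    """Collect the text of every <h2 class="job-title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and dict(attrs).get("class") == "job-title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())


# Hypothetical page content, used only for illustration
page = """
<div><h2 class="job-title">Data Scientist</h2><p>Acme</p></div>
<div><h2 class="job-title">Data Engineer</h2><p>Initech</p></div>
"""

parser = JobTitleParser()
parser.feed(page)
titles = parser.titles
print(titles)
```

A static parser like this only sees the HTML it is given; pages that build their content in the browser need a driver such as Selenium, which is the approach taken in this notebook.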

### WHY SELENIUM WEB DRIVER? 

The primary new feature in Selenium 2.0 is the integration of the WebDriver API. WebDriver is designed to provide a simpler, more concise programming interface in addition to addressing some limitations in the Selenium-RC API. Selenium-WebDriver was developed to better support dynamic web pages where elements of a page may change without the page itself being reloaded. WebDriver’s goal is to supply a well-designed object-oriented API that provides improved support for modern advanced web-app testing problems. Selenium WebDriver drives a real web browser, so the website sees requests that look like they come from an ordinary browser session, which helps in extracting the important information from the website.

![](files/ss/Selenium.png)

### STEPS TO INSTALL SELENIUM

Go to Anaconda prompt:

-- pip install selenium

-- For the Chrome web browser, download chromedriver from http://chromedriver.chromium.org/downloads and save it in the same location as the Python notebook.

-- For Mozilla Firefox, download geckodriver and save it in the same location as the Python notebook.

-- Go to environment variables and add the location of the Python notebook (where the driver is saved) to the path.

-- pip install nltk

-- pip install stop-words

In [1]:
#import all the important libraries for scraping
import urllib
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import re
import lxml.html as lh

from time import sleep
import random

#importing the stop_words to get the important stop words
from stop_words import get_stop_words
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from collections import Counter

#importing the selenium webdriver
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import NoSuchElementException

https://github.com/natmod/glassdoor-scrape

Some parts of this notebook have been taken from the notebook at the link above.

In this step we define a search_jobs function, which is called with the driver (already on the Glassdoor page), the job_title_input and the location_input, so that the job can be searched for on the website.

![](files/ss/glassdoor-1.png)

The job_title_input is entered in the field labelled 'Job Title, Keywords, or Company' and the location_input is entered in the location search bar. Clicking the HeroSearchButton element then runs the search based on the job_title_input and the location_input.

In [2]:
# The function is used to search the job_title and location based on the input provided 
def search_jobs(driver, job_title_input, location_input):
    """Enter query terms into search box and run job search"""
    # the job_title keyword
    job_title = driver.find_element_by_name("sc.keyword")
    # this will clear any previous job title 
    job_title.clear()
    job_title.send_keys(job_title_input)
    # the location input keyword
    location = driver.find_element_by_id("sc.location")
    #this will clear any job location
    location.clear()
    location.send_keys(location_input)
    #this will click on the search button
    driver.find_element_by_id("HeroSearchButton").click()
    return

After defining the method, we create a Chrome session. Once the Chrome browser opens, go to the login button and enter the login details. Do not wait long to enter the details, as a long delay will cause an error.

![](files/ss/glassdoor-2.png)

After signing in, wait three to four minutes to check for any new pop-ups so that they do not interrupt the code. The next section contains the code for finding all the important information. All the code is placed in try-except blocks, because if the driver does not find an element it raises an exception; we catch it and continue with the next listing.
We catch NoSuchElementException for HTML elements that are not present and StaleElementReferenceException, which occurs when a previously located element is no longer attached to the page.
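Skipping the whole listing is one way to handle a missing element. Another option, sketched here under our own assumptions, is to record a placeholder for an optional field such as the rating. The helper name `text_or_default` is hypothetical, and a plain dict with `KeyError` stands in for a Selenium element and `NoSuchElementException` so that the sketch runs without a browser.

```python
def text_or_default(lookup, missing_errors, default=None):
    """Call lookup() and return its result, or default if it raises missing_errors."""
    try:
        return lookup()
    except missing_errors:
        return default


# Stand-in for a listing: a plain dict instead of a Selenium element
listing = {"title": "Data Analyst", "company": "Acme"}

# The title is present, so its value is returned unchanged
title = text_or_default(lambda: listing["title"], KeyError)
# The rating is missing, so the placeholder is used instead of skipping the row
rating = text_or_default(lambda: listing["rating"], KeyError, default="N/A")
print(title, rating)
```

With Selenium, the `lookup` callable would wrap the element search and `missing_errors` would be the exception classes named above.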

In [3]:
# create a new Chrome session
driver = webdriver.Chrome()
# make the driver wait for 1000 secs so that login information could be entered
driver.implicitly_wait(1000)
#this will make the browser full screen
driver.maximize_window()

#define job posting search
#these inputs can be changed!
url = "https://www.glassdoor.com/index.htm"

driver.get(url)

#initialize loop variables
i=0
results=pd.DataFrame(columns=['Title','Company','Location','Description','Salary','Company_Rating','Skillset'],index=(range(10000)))

#input location and job_title can be changed
for location_input in ("CA","MA","TX"):
    for job_title_input in ("Data Scientist","Data Analyst","Business Intelligence Analyst","Data Engineer"):
        #call search_jobs function
        search_jobs(driver, job_title_input, location_input)
        x=0
        # search for 6 pages of each job title and location
        while (x<6):
            #let user know the scraping has started
            print("starting round")
            #find job listing elements on web page
            listings = driver.find_elements_by_class_name("jl")
    
            #change the implicit wait time to 5 secs so that whenever the driver is not able to find any element it only waits for 5 secs
            driver.implicitly_wait(5)
            
            #for every listing in the list of listings
            for listing in listings:
                try:
                    #click on the particular listing
                    listing.click()          
                    sleep(2)
                    #find the header tag
                    inf=listing.find_element_by_xpath("//div[@class='header']")
                    
                    inf.find_element_by_xpath("//h1[@class='jobTitle h2 strong']")
                    #storing the title of the job posting
                    title = inf.find_element_by_xpath("//h1[@class='jobTitle h2 strong']").text
            
            #sleep for 2 secs so that the driver does not move too fast
                    sleep(2)
            
            #finding the compInfo class
                    info= listing.find_element_by_xpath("//div[@class='compInfo']") 
                    sleep(2)
                    #some listings do not have a rating specified, so if the rating is not found within the 5 sec implicit wait, just continue to the next listing
                    try:
                        info.find_element_by_xpath("//span[@class='compactRating lg margRtSm']")
                        #storing the rating of the job posting
                        rating= info.find_element_by_xpath("//span[@class='compactRating lg margRtSm']").text
                        #storing the company_name of the posting
                        company_name = info.find_element_by_xpath("//a[@class='plain strong empDetailsLink']").text
                        #storing the company_location of the posting
                        company_location=info.find_element_by_xpath("//div[@class='compInfo']//span[2]").text
                
                        sleep(2)
                
                # finding the salary html tag
                        try:
                            sal=listing.find_element_by_xpath("//div[@class='salaryRow']//div[@class='salEstWrap']")                    
                    
                        except NoSuchElementException:
                    
                            continue
                    
                        #storing the salary of the listing
                        salary=sal.find_element_by_xpath("//span[@class='green small salary']").text
                        sleep(2)
                        listing.find_element_by_xpath("//div[@id='JobDescriptionContainer']")
                        #storing the description of the listing
                        description = listing.find_element_by_xpath("//div[@id='JobDescriptionContainer']").text
                
                    except NoSuchElementException:
                        continue                          
                     
               
            
              
                except (StaleElementReferenceException,NoSuchElementException):
                    continue

                #storing the results in the dataframe
                if(i<len(results)):
                    #use .loc[row, column] so the value is written to the dataframe itself
                    results.loc[i, 'Title'] = title
                    results.loc[i, 'Company'] = company_name
                    results.loc[i, 'Location'] = company_location
                    results.loc[i, 'Description'] = description
                    results.loc[i, 'Company_Rating'] = rating
                    results.loc[i, 'Salary'] = salary
                    i = i + 1
        

                
    

    
    
            print("end of round")
            #finding the next button location
            next_btn = driver.find_element_by_xpath("//li[@class='next']")
            #click the next button
            next_btn.click()
            #tell webdriver to wait until it finds the job listing elements on the new page
            WebDriverWait(driver, 100).until(lambda driver: driver.find_elements_by_class_name("jl"))
            #count the page that has just been scraped
            x=x+1
    

#results=results.dropna()
print("Scraping Complete")
# close the browser window
driver.quit()

starting round
end of round

In [4]:
#display the initial results
results

Unnamed: 0,Title,Company,Location,Description,Salary,Company_Rating,Skillset
0,Senior Data Scientist,Instacart,"– San Francisco, CA","Founded in 2012, Instacart is a leader in Nort...",$144K-$193K (Glassdoor est.),3.7 ★,
1,DATA SCIENTIST / ANALYTIC CONSULTANT 4,Wells Fargo,"– San Francisco, CA","Job Description\n\nAt Wells Fargo, we want to ...",$82K-$131K (Glassdoor est.),3.5 ★,
2,Sr. Applications Scientist – Charged Particle ...,Multibeam,"– Santa Clara, CA",Sr. Applications Scientist – Charged Particle ...,Employer Provided Salary:$100K-$135K,5.0 ★,
3,Principal Scientist,bioMérieux,"– San Diego, CA",World leader in the field of in vitro diagnost...,$91K-$126K (Glassdoor est.),3.4 ★,
4,Data Engineer,LeadCrunch,"– San Diego, CA",Data Engineer\n\nAre you ready to be a part of...,Employer Provided Salary:$125K-$155K,4.1 ★,
5,Advanced Analytics Manager - Healthcare,Central California Alliance for Health,"– Scotts Valley, CA",This is a new position at the Alliance where y...,$113K-$155K (Glassdoor est.),3.5 ★,
6,"Senior Software Engineer, Data Infrastructure",New Relic,"– San Francisco, CA, United States","New Relic\nSenior Software Engineer, Data Infr...",$138K-$221K (Glassdoor est.),4.5 ★,
7,Financial Analytics Manager,Central California Alliance for Health,"– Scotts Valley, CA",ABOUT US\n\nWe are a group of over 500 dedicat...,$113K-$155K (Glassdoor est.),3.5 ★,
8,"Senior Scientist, Culture Process Development",Roche,– Pleasanton,ROLE SUMMARY:\n\ntRED Antibody Innovation grou...,$81K-$122K (Glassdoor est.),3.9 ★,
9,"Data Analyst, Business Intelligence","Gemological Institute of America, Inc.","– Carlsbad, CA",The Business Intelligence team is responsible ...,$40K-$64K (Glassdoor est.),2.9 ★,


In [8]:
results['Skillset']='Skill'

In [9]:
#dropping all the null values
results=results.dropna()

In [10]:
results

Unnamed: 0,Title,Company,Location,Description,Salary,Company_Rating,Skillset
0,Senior Data Scientist,Instacart,"– San Francisco, CA","Founded in 2012, Instacart is a leader in Nort...",$144K-$193K (Glassdoor est.),3.7 ★,Skill
1,DATA SCIENTIST / ANALYTIC CONSULTANT 4,Wells Fargo,"– San Francisco, CA","Job Description\n\nAt Wells Fargo, we want to ...",$82K-$131K (Glassdoor est.),3.5 ★,Skill
2,Sr. Applications Scientist – Charged Particle ...,Multibeam,"– Santa Clara, CA",Sr. Applications Scientist – Charged Particle ...,Employer Provided Salary:$100K-$135K,5.0 ★,Skill
3,Principal Scientist,bioMérieux,"– San Diego, CA",World leader in the field of in vitro diagnost...,$91K-$126K (Glassdoor est.),3.4 ★,Skill
4,Data Engineer,LeadCrunch,"– San Diego, CA",Data Engineer\n\nAre you ready to be a part of...,Employer Provided Salary:$125K-$155K,4.1 ★,Skill
5,Advanced Analytics Manager - Healthcare,Central California Alliance for Health,"– Scotts Valley, CA",This is a new position at the Alliance where y...,$113K-$155K (Glassdoor est.),3.5 ★,Skill
6,"Senior Software Engineer, Data Infrastructure",New Relic,"– San Francisco, CA, United States","New Relic\nSenior Software Engineer, Data Infr...",$138K-$221K (Glassdoor est.),4.5 ★,Skill
7,Financial Analytics Manager,Central California Alliance for Health,"– Scotts Valley, CA",ABOUT US\n\nWe are a group of over 500 dedicat...,$113K-$155K (Glassdoor est.),3.5 ★,Skill
8,"Senior Scientist, Culture Process Development",Roche,– Pleasanton,ROLE SUMMARY:\n\ntRED Antibody Innovation grou...,$81K-$122K (Glassdoor est.),3.9 ★,Skill
9,"Data Analyst, Business Intelligence","Gemological Institute of America, Inc.","– Carlsbad, CA",The Business Intelligence team is responsible ...,$40K-$64K (Glassdoor est.),2.9 ★,Skill


**WHAT IS PUNKT IN NLTK?**

The punkt.zip file contains pre-trained Punkt sentence tokenizer (Kiss and Strunk, 2006) models that detect sentence boundaries. These models are used by nltk.sent_tokenize to split a string into a list of sentences.

![](files/ss/NLTK.png)

**WHAT ARE STOP-WORDS?**

Text may contain stop words like ‘the’, ‘is’, ‘are’. Stop words can be filtered from the text before it is processed. There is no universal list of stop words in NLP research; however, the NLTK module contains a list of stop words.

**WHY EXCLUDE STOP-WORDS?**

Stop words are excluded from the text because they are very common and carry little importance for the analysis.

![](files/ss/Stop-word-removal-using-NLTK.png)
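A small sketch of the idea, using a tiny hand-written stop list instead of the NLTK list so that it runs without any downloads. The sentence and the stop list are made up for illustration.

```python
# A tiny hand-written stop list for illustration; NLTK's English list is much longer
stop_words = {"the", "is", "are", "a", "in", "and", "of"}

sentence = "The analyst is skilled in Python and the basics of SQL"

# Lowercase before comparing so that 'The' is treated like 'the'
tokens = sentence.lower().split()
filtered = [w for w in tokens if w not in stop_words]
print(filtered)
```

Only the content-bearing words (`analyst`, `skilled`, `python`, `basics`, `sql`) remain after filtering.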

In [5]:
#importing the nltk for some analysis
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jaina\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jaina\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

After finding all the job listings, we define functions to tokenize each description and find the frequency of the important skills across all the job listings.

#### WHAT IS TOKENIZATION? 

Tokenization is the act of breaking up a string of text into pieces such as words, keywords, phrases, symbols and other elements called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded. The tokens become the input for another process like parsing and text mining.
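To make this concrete, here is a simplified regular-expression tokenizer. It is only a stand-in for illustration: NLTK's word_tokenize follows more elaborate rules, and the function name `simple_tokenize` and the pattern are our own assumptions.

```python
import re


def simple_tokenize(text):
    """Split text into word-like tokens and single punctuation marks."""
    # First alternative: runs of letters, digits, '+' or '#' (keeps 'C++' together)
    # Second alternative: any other single non-space character
    return re.findall(r"[A-Za-z0-9+#]+|[^\sA-Za-z0-9+#]", text)


tokens = simple_tokenize("Experience with C++, SQL and R.")
print(tokens)

# Dropping the punctuation tokens leaves only the words
words = [t for t in tokens if re.match(r"[A-Za-z0-9+#]+$", t)]
print(words)
```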

https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python

In [12]:
def tokenize_description(description):
    """take a job description and return a list of tokens excluding stop words"""
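    tokens = word_tokenize(description)
    stopset = set(stopwords.words('english'))
    # lowercase before the stop-word check so that capitalised stop words such as 'The' are filtered too
    tokens = [w.lower() for w in tokens if w.lower() not in stopset]
    # return the unique tokens so each term counts at most once per description
    return list(set(tokens))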
        
def find_skills_frequency(results_df):
    """count frequency of key words (as defined in dictionaries within function) appearing in job descriptions and return dataframe with skill frequency"""
    words = []
    for description in results_df['Description']:
        words.append(tokenize_description(description))
    
    doc_frequency = Counter()
    for word in words:
        doc_frequency.update(word)
    #all these skills and dictionary definitions can be changed based on the need of the analysis
    prog_lang_dict = Counter({'R':doc_frequency['r'], 'Python':doc_frequency['python'],
                    'Java':doc_frequency['java'], 'C++':doc_frequency['c++'],
                    'Ruby':doc_frequency['ruby'], 'Julia':doc_frequency['julia'],
                    'Perl':doc_frequency['perl'], 'Matlab':doc_frequency['matlab'], 
                    'Mathematica':doc_frequency['mathematica'], 'Php':doc_frequency['php'],
                    'JavaScript':doc_frequency['javascript'], 'Scala': doc_frequency['scala'],
                    'Octave':doc_frequency['octave']})
                      
    analysis_tool_dict = Counter({'Excel':doc_frequency['excel'],  'Tableau':doc_frequency['tableau'],
                        'D3.js':doc_frequency['d3.js'], 'SAS':doc_frequency['sas'],
                        'SPSS':doc_frequency['spss'], 'D3':doc_frequency['d3'],
                        'Spotfire': doc_frequency['spotfire'],'Stata':doc_frequency['stata'],
                        'Power BI': doc_frequency['power bi']})  

    hadoop_dict = Counter({'Hadoop':doc_frequency['hadoop'], 'MapReduce':doc_frequency['mapreduce'],
                'Spark':doc_frequency['spark'], 'Pig':doc_frequency['pig'],
                'Hive':doc_frequency['hive'], 'Shark':doc_frequency['shark'],
                'Oozie':doc_frequency['oozie'], 'ZooKeeper':doc_frequency['zookeeper'],
                'Flume':doc_frequency['flume'], 'Mahout':doc_frequency['mahout']})
    
    other_dict = Counter({'Azure':doc_frequency['azure'], 'AWS':doc_frequency['aws']})
                
    database_dict = Counter({'SQL':doc_frequency['sql'], 'NoSQL':doc_frequency['nosql'],
                    'HBase':doc_frequency['hbase'], 'Cassandra':doc_frequency['cassandra'],
                    'MongoDB':doc_frequency['mongodb']})
                    
    edu_dict = Counter({'Bachelor':doc_frequency['bachelor'],'Master':doc_frequency['master'],
                          'PhD': doc_frequency['phd'],'MBA':doc_frequency['mba']})
                          
          
    education_dict = Counter({'Computer Science':doc_frequency['computer-science'],  
                              'Statistics':doc_frequency['statistics'], 
                              'Mathematics':doc_frequency['mathematics'],
                              'Physics':doc_frequency['physics'], 
                              'Machine Learning':doc_frequency['machine-learning'], 
                              'Economics':doc_frequency['economics'], 
                              'Software Engineer': doc_frequency['software-engineer'],
                              'Information System':doc_frequency['information-system'], 
                              'Quantitative Finance':doc_frequency['quantitative-finance']})
    
    skills = prog_lang_dict + analysis_tool_dict + hadoop_dict \
                           + database_dict + other_dict + education_dict \
                            +  edu_dict
    print(list(skills.items()))
    skills_frame = pd.DataFrame(list(skills.items()), columns = ['Term','NumPostingsPercentage'])
    skills_frame.NumPostingsPercentage = (skills_frame.NumPostingsPercentage)*100/len(results_df)
    # Sort the data for plotting purposes
    skills_frame.sort_values(by='NumPostingsPercentage', ascending = False, inplace = True)
    return skills_frame

After defining the method we call the method by passing the results dataframe.

In [13]:
#call the method by passing the results dataframe
find_skills_frequency(results)

[('R', 403), ('Python', 583), ('Java', 284), ('C++', 120), ('Ruby', 40), ('Julia', 3), ('Perl', 36), ('Matlab', 63), ('Mathematica', 2), ('Php', 14), ('JavaScript', 91), ('Scala', 134), ('Excel', 432), ('Tableau', 299), ('D3.js', 13), ('SAS', 152), ('SPSS', 43), ('D3', 18), ('Spotfire', 25), ('Stata', 21), ('Hadoop', 174), ('MapReduce', 15), ('Spark', 196), ('Pig', 19), ('Hive', 106), ('Oozie', 2), ('ZooKeeper', 1), ('Flume', 6), ('Mahout', 4), ('SQL', 799), ('NoSQL', 127), ('HBase', 25), ('Cassandra', 67), ('MongoDB', 61), ('Azure', 65), ('AWS', 192), ('Statistics', 428), ('Mathematics', 310), ('Physics', 110), ('Machine Learning', 19), ('Economics', 202), ('Bachelor', 628), ('Master', 297), ('PhD', 173), ('MBA', 25)]


Unnamed: 0,Term,NumPostingsPercentage
29,SQL,49.627329
41,Bachelor,39.006211
1,Python,36.21118
12,Excel,26.832298
36,Statistics,26.583851
0,R,25.031056
37,Mathematics,19.254658
13,Tableau,18.571429
42,Master,18.447205
2,Java,17.639752
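The percentages above are document frequencies: each term is counted at most once per posting and the count is divided by the number of postings (for example, 799 of 1610 postings mention SQL, which gives about 49.63%). The sketch below reproduces the calculation on four made-up descriptions; the data is illustrative only.

```python
from collections import Counter

# Made-up descriptions, used only to illustrate the calculation
descriptions = [
    "python and sql required, python preferred",
    "sql and excel",
    "excel, tableau and sql",
    "java developer",
]

doc_frequency = Counter()
for description in descriptions:
    # A set keeps each term once, so repeats inside one posting are not double counted
    unique_tokens = set(description.replace(",", " ").split())
    doc_frequency.update(unique_tokens)

percentages = {
    term: doc_frequency[term] * 100 / len(descriptions)
    for term in ("sql", "python", "excel")
}
print(percentages)
```

Note that `python` appears twice in the first description but still counts as one posting.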


The next method prepares the text for finding the n-grams by replacing punctuation marks with blank spaces.

It first converts all the characters in the text to lowercase. After that, it replaces commas, forward slashes, round and square brackets, full stops and double quotes with single whitespaces. Finally, it uses the split function to split the text into words on whitespace and returns the result.

In [14]:
#method for processing the text
def process_text(text):
    text = text.lower()
    text = text.replace(',', ' ')
    text = text.replace('/', ' ')
    text = text.replace('(', ' ')
    text = text.replace(')', ' ')
    text = text.replace('.', ' ')
    text = text.replace('[', ' ')
    text = text.replace(']', ' ')
    text = text.replace('"', ' ')

    # Convert text string to a list of words
    return text.split()
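The same cleaning can also be done in a single pass with a translation table. This is a sketch of an alternative and not a replacement for the cell above; the names `PUNCT_TO_SPACE` and `process_text_translate` are our own.

```python
# Map each punctuation character handled by process_text to a space
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in ',/().[]"'})


def process_text_translate(text):
    """Lowercase, replace the listed punctuation with spaces, split into words."""
    return text.lower().translate(PUNCT_TO_SPACE).split()


words = process_text_translate('Python/SQL (required), "Excel" [preferred].')
print(words)
```

Adding another character to clean (for example a semicolon) then only requires extending the string passed to the dictionary comprehension.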

**WHAT IS N-GRAM?**

An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.

![](files/ss/ngram.png)

The generate_ngrams method takes the word list (the processed text) and, for a given n (1, 2 or 3), finds the n-grams and saves them in ngrams_list. Starting from the first word, it joins each run of n consecutive words into a single space-separated string.

Upon receiving the input parameters, the generate_ngrams function declares a list to keep track of the generated n-grams. It then loops through the positions in words_list to construct n-grams and appends them to ngrams_list.

When the loop completes, the generate_ngrams function returns ngrams_list back to the caller.

-- pip install ngram

https://www.techcoil.com/blog/how-to-generate-n-grams-in-python-without-using-any-external-libraries/

In [15]:
#method for finding the ngrams
def generate_ngrams(words_list, n):
    ngrams_list = []
 
    # stop at len - n + 1 so that every n-gram has exactly n words
    for num in range(0, len(words_list) - n + 1):
        ngram = ' '.join(words_list[num:num + n])
        ngrams_list.append(ngram)
 
    return ngrams_list
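A similar sliding window can be written with zip, shown here as a small worked example. The helper name `ngrams` and the sample words are our own; this sketch emits only n-grams that contain exactly n words.

```python
def ngrams(words, n):
    """Return the n-grams of a word list as space-separated strings."""
    # zip over n staggered views of the list; zip stops at the shortest view,
    # so no partial n-grams are produced at the end
    return [" ".join(gram) for gram in zip(*(words[i:] for i in range(n)))]


words = ["machine", "learning", "with", "python"]

unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
print(unigrams)
print(bigrams)
print(trigrams)
```

A list of four words yields four unigrams, three bigrams and two trigrams.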

In [16]:
one=[]
two=[]
three=[]

In [17]:
#to find the unigram,bigram and trigram in the job description
for text in results['Description']:
    #skip listings without a description
    if text is None:
        continue
    #it will call the process text method to remove the punctuation
    words_list = process_text(text)
    #finding the unigram, n=1
    one.append(generate_ngrams(words_list, 1))
    #finding the bigram, n=2
    two.append(generate_ngrams(words_list, 2))
    #finding the trigram, n=3
    three.append(generate_ngrams(words_list, 3))

Next we create a flat list for each n-gram size so that no nested lists remain inside the n-gram list, which would otherwise affect the n-gram counts.

https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists

In [18]:
#create a flat list for one gram list
flat_list_one = []
for sublist in one:
    for item in sublist:
        flat_list_one.append(item)
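The nested loop can also be expressed with `itertools.chain.from_iterable` from the standard library. The nested list below is made up for illustration.

```python
from collections import Counter
from itertools import chain

# One inner list per job description, as produced by the n-gram step
nested = [["data", "science"], ["data", "engineer"], ["sql"]]

# chain.from_iterable walks the inner lists one after another
flat = list(chain.from_iterable(nested))
counts = Counter(flat)
print(flat)
print(counts.most_common(1))
```

Counting the nested list directly would fail, because lists are not hashable; flattening first lets Counter see the individual words.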

In [19]:
#counting the most important words
Counter(flat_list_one).most_common(20)

[('and', 53599),
 ('to', 27637),
 ('the', 22663),
 ('of', 19947),
 ('in', 14773),
 ('a', 14608),
 ('with', 13105),
 ('data', 12311),
 ('for', 9483),
 ('or', 8754),
 ('experience', 8102),
 ('business', 6602),
 ('is', 6459),
 ('our', 5968),
 ('as', 5707),
 ('we', 5114),
 ('work', 4816),
 ('on', 4592),
 ('will', 4263),
 ('you', 4105)]

In [20]:
word_list=flat_list_one

Next, remove the stop words from the unigram list to find the important words and their counts.

In [21]:
#removing the stop words
#a set makes the membership test below much faster than a list
stop_words = set(get_stop_words('en'))         #About 900 stopwords
stop_words.update(stopwords.words('english')) #About 150 stopwords

output = [w for w in word_list if w not in stop_words]

In [22]:
Counter(output).most_common(20)

[('data', 12311),
 ('experience', 8102),
 ('business', 6602),
 ('work', 4816),
 ('team', 3791),
 ('skills', 3162),
 ('ability', 3030),
 ('development', 2766),
 ('management', 2534),
 ('software', 2516),
 ('support', 2319),
 ('analysis', 2305),
 ('solutions', 2301),
 ('systems', 2220),
 ('years', 2160),
 ('knowledge', 2032),
 ('new', 2004),
 ('strong', 1981),
 ('working', 1911),
 ('analytics', 1840)]

In [23]:
#creating the flat list for a bigram list
flat_list_two = []
for sublist in two:
    for item in sublist:
        flat_list_two.append(item)

In [24]:
#creating the flat list for a trigram list
flat_list_three = []
for sublist in three:
    for item in sublist:
        flat_list_three.append(item)

In [25]:
#creating a set of skills
final=[]

Next, create a list of important skills that might be present in the job description. This list can be altered as needed.

In [26]:
#all skills are lowercase because the descriptions are lowercased before matching
final.extend([
    'python', 'r', 'data mining', 'data analysis', 'data modeling',
    'data visualization', 'hadoop', 'hive', 'tableau', 'power bi',
    'sas', 'matlab', 'scala', 'pig', 'mapreduce',
    'spark', 'machine learning', 'statistics', 'excel', 'aws',
    'azure', 'mongodb', 'sql', 'pl/sql', 'nosql',
    'mysql', 'mssql', 'oracle', 'hbase', 'ms access',
    'dashboards', 'agile', 'innovator', 'critical thinking', 'docker',
    'leader', 'master degree', 'java', 'big data', 'cassandra',
    'qlik sense', 'qlik view', 'phd', 'bachelor degree', 'numpy',
    'tensorflow', 'keras', 'pandas', 'sci-kit learn', 'matplotlib',
    'seaborn', 'deep learning', 'classification', 'decision tree', 'clustering',
    'regression', 'forecasting', 'kpi', 'linux',
])


Finally, we loop through all the job descriptions to find the skills from the final list and save them in the skill list. We match the unigrams and bigrams of each description against the skills in the final list.

The skill list is then assigned to the Skillset column of the dataframe.

In [29]:
#finding the skills in each job description and storing in the skillset column
for i in results.index:
    text = results.at[i, 'Description']
    #skip listings without a description
    if text is None:
        continue
    skill = []
    text = text.lower()
    #matching single words against every item in the final list
    words = set(text.split())
    for item in final:
        if item in words:
            skill.append(item)

    #finding the bigrams
    bigrams = generate_ngrams(process_text(text), 2)
    #matching the bigrams against every item in the final list
    for bigram in bigrams:
        if bigram in final and bigram not in skill:
            skill.append(bigram)

    #.at writes the list into a single cell of the dataframe
    results.at[i, 'Skillset'] = skill
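The matching idea can be tried on a single sentence without the dataframe. The sketch below is a simplified, self-contained variant: the function name `match_skills`, the skill list and the sentence are made up, and punctuation is stripped before both the single-word and the bigram comparison.

```python
def match_skills(description, skills):
    """Return the skills that occur in the description as a word or a bigram."""
    text = description.lower()
    for ch in ',/().[]"':
        text = text.replace(ch, " ")
    words = text.split()
    # Pair each word with its successor to form the bigrams
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    candidates = set(words) | set(bigrams)
    # Keep the order of the skills list in the result
    return [s for s in skills if s in candidates]


skills = ["python", "sql", "machine learning", "tableau"]
found = match_skills("Strong Python and SQL skills; machine learning a plus.", skills)
print(found)
```

Two-word skills such as `machine learning` can only be found through the bigrams, which is why the single-word pass alone is not enough.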

In [30]:
#displaying the final dataframe
results

Unnamed: 0,Title,Company,Location,Description,Salary,Company_Rating,Skillset
0,Senior Data Scientist,Instacart,"– San Francisco, CA","Founded in 2012, Instacart is a leader in Nort...",$144K-$193K (Glassdoor est.),3.7 ★,"[python, sql, dashboards, leader]"
1,DATA SCIENTIST / ANALYTIC CONSULTANT 4,Wells Fargo,"– San Francisco, CA","Job Description\n\nAt Wells Fargo, we want to ...",$82K-$131K (Glassdoor est.),3.5 ★,"[python, hive, tensorflow, keras, machine lear..."
2,Sr. Applications Scientist – Charged Particle ...,Multibeam,"– Santa Clara, CA",Sr. Applications Scientist – Charged Particle ...,Employer Provided Salary:$100K-$135K,5.0 ★,"[leader, data analysis]"
3,Principal Scientist,bioMérieux,"– San Diego, CA",World leader in the field of in vitro diagnost...,$91K-$126K (Glassdoor est.),3.4 ★,"[leader, phd, data analysis]"
4,Data Engineer,LeadCrunch,"– San Diego, CA",Data Engineer\n\nAre you ready to be a part of...,Employer Provided Salary:$125K-$155K,4.1 ★,"[python, hadoop, leader, data mining]"
5,Advanced Analytics Manager - Healthcare,Central California Alliance for Health,"– Scotts Valley, CA",This is a new position at the Alliance where y...,$113K-$155K (Glassdoor est.),3.5 ★,"[python, sql, data mining]"
6,"Senior Software Engineer, Data Infrastructure",New Relic,"– San Francisco, CA, United States","New Relic\nSenior Software Engineer, Data Infr...",$138K-$221K (Glassdoor est.),4.5 ★,"[python, sql, leader, machine learning]"
7,Financial Analytics Manager,Central California Alliance for Health,"– Scotts Valley, CA",ABOUT US\n\nWe are a group of over 500 dedicat...,$113K-$155K (Glassdoor est.),3.5 ★,"[sql, data mining]"
8,"Senior Scientist, Culture Process Development",Roche,– Pleasanton,ROLE SUMMARY:\n\ntRED Antibody Innovation grou...,$81K-$122K (Glassdoor est.),3.9 ★,[]
9,"Data Analyst, Business Intelligence","Gemological Institute of America, Inc.","– Carlsbad, CA",The Business Intelligence team is responsible ...,$40K-$64K (Glassdoor est.),2.9 ★,"[excel, sql, pl/sql, oracle, classification]"


In [32]:
#saving the results in the csv file
results.to_csv('glassdor_scraping.csv', index = False)
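One thing to keep in mind when reloading the file: a list stored in a cell is written to CSV as its string representation, so the Skillset values come back as text. The sketch below shows the round trip with the standard csv module as a stand-in for the pandas calls (an assumption made so that it runs without the scraped data) and uses `ast.literal_eval` to turn the text back into a list.

```python
import ast
import csv
import io

# Write one row whose second cell is a Python list
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Title", "Skillset"])
writer.writerow(["Data Engineer", ["python", "hadoop"]])

# Read the row back: the list is now a plain string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["Skillset"]
print(type(raw).__name__, raw)

# literal_eval safely parses the string back into a list
skills = ast.literal_eval(raw)
print(skills)
```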

## CONCLUSION

After scraping, we have 1610 job listings for 'Data Scientist', 'Data Engineer', 'Data Analyst' and 'Business Intelligence Analyst' in three states, namely 'MA', 'CA' and 'TX'. We then processed the job descriptions to remove the stop words, find the n-grams and extract the skills from each job description to create the results dataframe. This dataframe is saved as a CSV file.

## CONTRIBUTION

In this notebook we have written 80% of the code on our own. Wherever we have used code from another source, we have cited the source.

## LICENSE 

This document is licensed under the MIT license and is documented by Ayush Jain and Shweta Pathak.

https://opensource.org/licenses/MIT

MIT License

Copyright (c) 2019 Ayush Jain and Shweta Pathak

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.