# LinkedIn Job Search WebScraping

## Possible use case:

This function helps a job seeker find suitable jobs in the market they are interested in more efficiently.

Simply typing a keyword into the search bar often does not surface the best matches for a person's career goals or area of expertise, because the existing search options are limited. In addition, one usually does not have the time to go through all of the job postings, so the posting that suits them best may never be seen simply because it was not sponsored enough.

With this simple function, one can generate a data frame with as many job listings as they want and then search for a phrase or keyword directly in the job description.




## Importing Necessary Libraries:

In [37]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import math
import time
from yaspin import yaspin

## Creating a function that would generate the url given the inputs.

### Inputs
__title:__ the search keywords, e.g. "data scientist" or "senior data analyst" <br>
__location:__ location of the job that we want to search for, e.g. "Hamburg" or "Germany" <br>
__start_num:__ since the search results are separated into pages, this indicates the job posting number from which the results will be shown. We will iterate through these pages in the main function later on.

In [49]:
def generate_url(title, location, start_num):
    """Build the guest job-search URL for the given keywords, location and offset."""
    # Join the keyword tokens with "%2B", the percent-encoded form of "+".
    separator = "%2B"
    search_title = separator.join(title.split())

    return (
        "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search"
        f"?keywords={search_title}&location={location}&start={start_num}"
    )
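
Multi-word locations such as "New York" also contain characters that need encoding. As a minimal sketch, the same URL could be assembled with the standard library's `urlencode`, which percent-encodes every parameter. The helper name `build_search_url` is hypothetical, and whether the endpoint treats a `+` between keywords the same way as the `%2B` separator is an assumption that has not been checked here.

```python
from urllib.parse import urlencode

# Endpoint taken from generate_url.
BASE_URL = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search"


def build_search_url(title, location, start_num):
    """Hypothetical alternative to generate_url that encodes every parameter."""
    params = {"keywords": title, "location": location, "start": start_num}
    # urlencode replaces spaces with "+" and escapes reserved characters.
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("data scientist", "New York", 0))
```

The printed URL ends in `?keywords=data+scientist&location=New+York&start=0`.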


## Creating functions that would extract the parts from the html that we are interested in.
__Functions will extract:__ location, company name, job title.

In [16]:
def get_title(soup):
    job_list=[]

    for item in soup.find_all("h3"):
        job_list.append(item.get_text(strip = True))

    return(job_list)

In [17]:
def get_company(soup):
    company_list=[]
    
    for item in soup.find_all("h4"):
        company_list.append(item.get_text(strip = True))
    return(company_list)

In [18]:
def get_location(soup):
    location_list=[]
    
    for item in soup.find_all("span", {"class": "job-search-card__location"}):
        location_list.append(item.get_text(strip=True))
    return(location_list)
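
To see what these three lookups rely on, the sketch below runs the same `find_all` calls on a small, made-up HTML snippet. The markup is an assumption that only mimics the tags and class name used by the functions above; the real response contains much more.

```python
from bs4 import BeautifulSoup

# Made-up markup: only the tags and the class name match the lookups above.
SAMPLE_HTML = """
<ul>
  <li>
    <h3> Data Analyst </h3>
    <h4> Example GmbH </h4>
    <span class="job-search-card__location"> Hamburg, Germany </span>
  </li>
  <li>
    <h3> Data Scientist </h3>
    <h4> Sample AG </h4>
    <span class="job-search-card__location"> Berlin, Germany </span>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same lookups as get_title, get_company and get_location.
titles = [tag.get_text(strip=True) for tag in soup.find_all("h3")]
companies = [tag.get_text(strip=True) for tag in soup.find_all("h4")]
locations = [
    tag.get_text(strip=True)
    for tag in soup.find_all("span", {"class": "job-search-card__location"})
]

for row in zip(titles, companies, locations):
    print(row)
```

Each list has one entry per posting, in page order, which is what allows them to be combined column by column later.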


## get_job_id (function) 
This function fetches the job ids one by one from the listing. As I have encountered postings with no id from time to time, I have added try/except handling that stores "NA" instead.

In [19]:
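def get_job_id(soup):
    job_id_list = []
    for job in soup.find_all("li"):
        base_card = job.find("div", {"class": "base-card"})
        try:
            # data-entity-urn is colon-separated; the id is the fourth field.
            job_id_list.append(base_card.get("data-entity-urn").split(":")[3])
        except (AttributeError, IndexError):
            # No base card, no URN attribute, or an unexpected URN format.
            job_id_list.append("NA")

    return job_id_list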
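
The id extraction itself is plain string handling. As a small sketch, the hypothetical helper `parse_job_id` below isolates that step. The URN shape `urn:li:jobPosting:<id>` is an assumption inferred from the `split(":")[3]` call, and the id in the example is made up.

```python
def parse_job_id(urn):
    """Hypothetical helper: return the fourth colon-separated field, or "NA"."""
    try:
        return urn.split(":")[3]
    except (AttributeError, IndexError):
        # urn is None (attribute missing) or has fewer than four fields.
        return "NA"


print(parse_job_id("urn:li:jobPosting:1234567890"))  # made-up id
print(parse_job_id(None))
print(parse_job_id("urn:li"))
```

The three calls print `1234567890`, `NA` and `NA`.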

## get_linkedin_job_listing
The main function that brings everything together.

### Inputs:
__title:__ will be passed through to the url generating function that is created in the first step. <br>
__location:__ will be passed through to the same url generating function.<br>
__number:__ the minimum number of job postings to extract. Since each result page shows only 10 job postings, a value of 25 extracts 30 postings and a value of 81 extracts the first 90.
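
The rounding described for `number` can be checked with a few lines of arithmetic. This is only a sketch: the helper `page_starts` is hypothetical, and it assumes page offsets that begin at 0 and advance in steps of 10.

```python
import math

PAGE_SIZE = 10  # postings per result page, as stated above


def page_starts(number, page_size=PAGE_SIZE):
    """Hypothetical helper: offsets of the pages needed to cover `number` postings."""
    pages = math.ceil(number / page_size)
    return [page * page_size for page in range(pages)]


for requested in (25, 81):
    starts = page_starts(requested)
    print(requested, len(starts), len(starts) * PAGE_SIZE)
```

A request for 25 postings needs 3 pages (30 postings), and a request for 81 needs 9 pages (90 postings).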

### Outer Loop
The first loop iterates through all the pages needed to reach the minimum number of job postings and collects the data into separate lists.

### Inner Loop
Iterates through the individual job listings on each search response page. It requests each individual job listing page and collects the job descriptions in a list.<br>
Again, try/except handling was added because, although very rarely, some listings were missing an id.

### The Result
The result is a data frame object with all the listings.

In [43]:
def get_linkedin_job_listing(title, location, number):

    with yaspin(text="Processing, please wait..."):

        # Page offsets 0, 10, 20, ... so that e.g. number=81 yields 9 pages (90 postings).
        start_list = list(range(0, number, 10))
        job_list_all = []
        company_list_all = []
        location_list_all = []
        job_id_list_all = []
        job_desc_list_all = []
        url_list_all = []

        # Outer loop: one request per search result page.
        for start_num in start_list:
            job_search_url = generate_url(title, location, start_num)
            r = requests.get(job_search_url)
            soup_job_listing = bs(r.content, "html.parser")

            job_list_all.extend(get_title(soup_job_listing))
            company_list_all.extend(get_company(soup_job_listing))
            location_list_all.extend(get_location(soup_job_listing))
            job_id_list = get_job_id(soup_job_listing)
            job_id_list_all.extend(job_id_list)

            # Inner loop: one request per job posting on this page.
            for job_id in job_id_list:
                url = f"https://www.linkedin.com/jobs-guest/jobs/api/jobPosting/{job_id}"
                r = requests.get(url)
                soup = bs(r.content, "html.parser")
                try:
                    job_desc_list_all.append(
                        soup.find("div", {"class": "description__text description__text--rich"}).get_text(strip=True)
                    )
                except AttributeError:
                    # soup.find returned None: the page has no description (e.g. the id was "NA").
                    job_desc_list_all.append("NA")
                url_list_all.append(url)

        jobs_df = pd.DataFrame(
            {
                "position": job_list_all,
                "company": company_list_all,
                "location": location_list_all,
                "job_id": job_id_list_all,
                "job_description": job_desc_list_all,
                "url": url_list_all,
            }
        )

    return jobs_df



## Let's try!

In [65]:
jobs_df=get_linkedin_job_listing("data analyst","hamburg",100)

                             

In [57]:
import os

In [58]:
os.getcwd()

'C:\\Users\\Lenovo'

In [70]:
jobs_df.to_csv("data_analyst_hamburg_100.csv", header=True)
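
Once the data frame exists, the keyword search mentioned at the start can be done with a pandas string filter. The sketch below uses a small, made-up data frame in place of `jobs_df`, and the helper name `search_descriptions` is hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the scraped jobs_df.
sample_df = pd.DataFrame(
    {
        "position": ["Data Analyst", "BI Analyst", "Data Engineer"],
        "job_description": [
            "Build dashboards in Tableau and write SQL queries.",
            "NA",
            "Maintain Python pipelines; SQL is a plus.",
        ],
    }
)


def search_descriptions(df, keyword):
    """Hypothetical helper: rows whose job_description contains `keyword`."""
    # case=False ignores capitalisation; regex=False treats the keyword literally.
    mask = df["job_description"].str.contains(keyword, case=False, regex=False)
    return df[mask]


print(search_descriptions(sample_df, "sql")["position"].tolist())
```

With the sample rows this prints `['Data Analyst', 'Data Engineer']`. The same call on the real data would be `search_descriptions(jobs_df, "sql")`.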

In [1]:
jobs_df
