<a href="https://colab.research.google.com/github/ajay1808/Web-Scraping-Projects/blob/main/Indeed_Web_Scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Indeed WebScraper**

Author: Ajay Rangan Kasturirangan <br>
[GitHub](https://github.com/ajay1808), [Twitter](https://twitter.com/rangan_ajay) 

This notebook is a basic web scraper for [in.indeed.com](https://in.indeed.com/), a job search portal for India.

I'm going to use BeautifulSoup to parse each results page and scrape the job data, which can then be used to build reports.
The next step of the project will be to create a dashboard comparing Indian cities by Data Analyst job availability and salary.<br>
For learning, refer to this [YouTube video](https://www.youtube.com/watch?v=eN_3d4JrL_w&t=702s) by [Israel Dryer](https://github.com/israel-dryer).

Let's import the required libraries.

In [1]:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
from time import sleep
from random import randint
from datetime import datetime


The next cell defines three functions: get_url, get_record and main.

Their roles are as follows:<br>
**get_url**: Generates a search URL for the city and job position you want to check on [in.indeed.com](https://in.indeed.com/).<br>
**get_record**: Extracts the fields of a single job posting from the webpage. Each posting becomes one row in our dataset.<br>
**main**: Calls the other two functions for every page of results and writes the data to a CSV file, which can be downloaded or pushed to a Google Sheet.
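
A side note on the URL step: replacing only spaces leaves characters such as `&` unescaped, which would split the search term into two parameters. Here is a minimal sketch that lets the standard library do the encoding instead. The name `build_search_url` and the `base` parameter are my own placeholders and are not used elsewhere in this notebook.

```python
from urllib.parse import quote, urlencode


def build_search_url(position, location, base="https://in.indeed.com/jobs"):
    """Build a search URL with both query values percent-encoded."""
    params = {"q": position, "l": location}
    # quote_via=quote encodes spaces as %20 (the default would use '+')
    return base + "?" + urlencode(params, quote_via=quote)


print(build_search_url("Data Analyst", "Chennai"))
print(build_search_url("R&D Engineer", "Navi Mumbai"))
```

For plain inputs like 'Data Analyst' this should produce the same URL as replacing spaces by hand. The difference only shows up with special characters.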


In [2]:
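def get_url(position, location):
    '''Generate a search URL from a job position and a location'''
    # Encode spaces so multi-word terms such as "Data Analyst" stay valid in the URL
    position = position.replace(" ", "%20")
    location = location.replace(" ", "%20")
    template = 'https://in.indeed.com/jobs?q={}&l={}'
    url = template.format(position, location)
    return url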

def get_record(card):
    '''Extract job data from a single record'''
    try:
        job_title = card.find('h2', 'jobTitle').text.strip()
    except AttributeError:
        job_title = ''
    try:
        company = card.find('span', 'companyName').text.strip()
    except AttributeError:
        company = ''
    try:
        location = card.find('div', 'companyLocation').text.strip()
    except AttributeError:
        location = ''
    try:
        job_summary = card.find('div', 'job-snippet').text.strip()
    except AttributeError:
        job_summary = ''
    try:
        post_date = card.find('span', 'date').text.strip()
        # Drop the first six characters, a leading label such as 'Posted'
        post_date = post_date[6:]
    except AttributeError:
        post_date = ''
    try:
        salary = card.find('div', 'attribute_snippet').text.strip()
    except AttributeError:
        salary = ''
    
    #extract_date = datetime.today().strftime('%Y-%m-%d')
    #job_url = 'https://www.in.indeed.com' + atag.get('href')
    
    return (job_title, company, location, job_summary, salary, post_date)

def main(position, location):
    records = []
    count = 0  # number of result pages fetched
    url = get_url(position, location)
    while True:
        count += 1
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        cards = soup.find_all('div', 'job_seen_beacon')


        for card in cards:
            record = get_record(card)
            records.append(record)
        try:
            url = 'https://in.indeed.com' + soup.find('a', {'aria-label': 'Next'}).get('href')
            delay = randint(1, 10)
            sleep(delay)
        except AttributeError:
            print("Number of web pages surfed: ",count)
            print("End of Results")
            break
    MyColumns = ['Job Title', 'Company', 'Location', 'Summary', 'Salary', 'Posted Date']
    with open('results.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(MyColumns)
        writer.writerows(records)
    
    dataset = pd.DataFrame(data = records)
    dataset.columns = MyColumns
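
The loop in main keeps going until no 'Next' link is found. As a rough sketch of the same idea with a safety cap, the version below stops after `max_pages` pages and takes the page fetcher as an argument, so it can be tried without touching the network. The names `crawl`, `fetch_page` and `max_pages` are hypothetical and are not part of the code above.

```python
from random import uniform
from time import sleep


def crawl(start_url, fetch_page, max_pages=50, min_delay=1.0, max_delay=10.0):
    """Follow 'next' links until they run out or max_pages is reached.

    fetch_page(url) must return (records, next_url), where next_url is
    None on the last page.
    """
    records, url, pages = [], start_url, 0
    while url is not None and pages < max_pages:
        page_records, url = fetch_page(url)
        records.extend(page_records)
        pages += 1
        # Pause for a random interval only if another page will be requested
        if url is not None and pages < max_pages:
            sleep(uniform(min_delay, max_delay))
    return records, pages


# A fake two-page site stands in for real HTTP requests
fake_site = {"p1": (["a", "b"], "p2"), "p2": (["c"], None)}
records, pages = crawl("p1", fake_site.get, min_delay=0, max_delay=0)
print(records, pages)
```

In a real run, `fetch_page` would wrap the requests and BeautifulSoup calls from main and return the 'Next' link, or None when there isn't one.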

Now let's call our main function using the format <br>
**main(Job Position, Location)**<br>
Please note that this website covers only the Indian job market.
Indeed serves other countries through separate websites.

In [3]:
main('Data Analyst','Chennai')

Number of web pages surfed:  36
End of Results


In [4]:
MyData = pd.read_csv('results.csv')
MyData.head(10)

Unnamed: 0,Job Title,Company,Location,Summary,Salary,Posted Date
0,newData Quality Analyst,Standard Chartered,"Chennai, Tamil Nadu",Various data domain owners and data quality te...,,Just posted
1,newData Quality Analyst - C10 - Chennai (R2103...,Citi,"Chennai, Tamil Nadu",Implement data quality strategies to effective...,,7 days ago
2,Associate Analyst - Data Analyst,AstraZeneca,"Chennai, Tamil Nadu",Experience in translating requirements into fi...,,23 days ago
3,Markets Data Management Analyst,NatWest Markets,"Chennai, Tamil Nadu",Acting as a point of contact for static data q...,,9 days ago
4,newR&A CO Data and MI Analyst,Shell,"Chennai, Tamil Nadu",This role is ideal for a “data junkie” who fin...,,Today
5,newData Analyst 2,PayPal,"Chennai, Tamil Nadu",Analytics professional with a proven track rec...,,Today
6,Data Analyst,Hitachi Energy,"Chennai, Tamil Nadu",Display technical expertise in data analytics ...,,30+ days ago
7,Data Analytics Intmd Analyst - C11,Citi,"Chennai, Tamil Nadu",Applies professional judgment when interpretin...,,13 days ago
8,newData Operations Analyst,Athenahealth,"Chennai, Tamil Nadu",Your job will be to improve quality practition...,,7 days ago
9,Data Analyst,Freshworks,"Chennai, Tamil Nadu",Create and maintain rich interactive visualiza...,,16 days ago
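
Two things stand out in the sample above: some titles start with 'new', which appears to come from a badge on the job card, and the posted date is relative text rather than a date. Below is a small cleaning sketch for both, using only the standard library. The helper names `strip_new_badge` and `approx_post_date` are hypothetical, and the 'new' rule is a heuristic that assumes real titles start with a capital letter.

```python
import re
from datetime import date, timedelta


def strip_new_badge(title):
    """Remove a leading 'new' that is glued to a capitalised title."""
    return re.sub(r"^new(?=[A-Z])", "", title)


def approx_post_date(text, today=None):
    """Convert a 'Posted Date' string into an approximate calendar date.

    Returns None when the text does not match a known pattern.
    '30+ days ago' is treated as exactly 30 days, so it is only a bound.
    """
    today = today or date.today()
    cleaned = text.strip().lower()
    if cleaned in ("today", "just posted"):
        return today
    match = re.match(r"(\d+)\+?\s+days?\s+ago", cleaned)
    if match:
        return today - timedelta(days=int(match.group(1)))
    return None


print(strip_new_badge("newData Quality Analyst"))
print(approx_post_date("7 days ago", today=date(2021, 12, 31)))
```

Passing `today` explicitly keeps the result reproducible. In practice it would be the date on which the scrape was run.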


Now let's push this dataset to Google Sheets.

The following libraries need to be installed and imported. Several references cover the setup:<br>
[TDS](https://towardsdatascience.com/using-python-to-push-your-pandas-dataframe-to-google-sheets-de69422508f), [Medium](https://medium.com/craftsmenltd/from-csv-to-google-sheet-using-python-ef097cb014f9), [Google API Documentation](https://developers.google.com/sheets/api), [df2gspread documentation](https://df2gspread.readthedocs.io/en/latest/overview.html)

In [None]:
pip install gspread

In [None]:
pip install oauth2client

In [None]:
pip install df2gspread

In [8]:
import gspread
from df2gspread import df2gspread as d2g
from oauth2client.service_account import ServiceAccountCredentials

In [9]:
MyData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 533 entries, 0 to 532
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Job Title    533 non-null    object
 1   Company      533 non-null    object
 2   Location     533 non-null    object
 3   Summary      533 non-null    object
 4   Salary       32 non-null     object
 5   Posted Date  533 non-null    object
dtypes: object(6)
memory usage: 25.1+ KB


In [10]:
scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
credentials = ServiceAccountCredentials.from_json_keyfile_name('client_secret.json', scope)
gc = gspread.authorize(credentials)

In [11]:
spreadsheet_key = '1JHpwzpG5SeAhSsVLcZkX1rNg6-R2PGNRm40JuIeSywU'
wks_name = 'Master'
d2g.upload(MyData, spreadsheet_key, wks_name, credentials=credentials, row_names=True)

<Worksheet 'Master' id:2060507442>

[Please click here to view the Google Sheet](https://docs.google.com/spreadsheets/d/1JHpwzpG5SeAhSsVLcZkX1rNg6-R2PGNRm40JuIeSywU/edit?usp=sharing)

Now the Google Sheet can be used as a dynamic data source in many reporting and dashboarding tools.
The next step in the process is to use this data to create a Tableau report.