# Naukri Job Scraper

This notebook scrapes job listings from Naukri.com for Data Analyst positions in Delhi/NCR. It then stores the scraped data in a Pandas DataFrame and exports it to a Google Sheet and a CSV file.

## Introduction

This project aims to automate the process of collecting job data from Naukri.com. By scraping the website, we can gather valuable information about Data Analyst job openings in Delhi/NCR, including:

* Job Title
* Company Name
* Experience Required
* Job Link
* Location
* Salary

This data can be used for various purposes, such as:

* **Tracking job openings:** Monitor the latest Data Analyst job postings in Delhi/NCR.
* **Analyzing job market trends:** Gain insights into salary ranges, company hiring patterns, and other relevant data.
* **Automating job search:** Use the scraped data to identify potential job opportunities. 


In [7]:
!pip3 install -r requirements.txt

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from datetime import datetime  # do not import datetime.time here: it would shadow the time module used for time.sleep

chrome_options = Options()
#chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")  # Disable GPU hardware acceleration
chrome_options.add_argument("--no-sandbox")  # Disable sandboxing
chrome_options.add_argument("--disable-dev-shm-usage")  # Disable /dev/shm usage
chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
# Initialize the Chrome WebDriver with the specified options
driver = webdriver.Chrome(options=chrome_options)

In [2]:
titles, names, experiences, links, locations, salaries = [], [], [], [], [], []
for j in range(1,11,1):
    url = f'https://www.naukri.com/data-analyst-jobs-in-delhi-ncr-{j}?k=data+analyst&l=delhi+%2F+ncr&experience=1&nignbevent_src=jobsearchDeskGNB'
    driver.get(url)
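    time.sleep(3)  # fixed pause to let the listings render
    print(f'Page {j} loaded. Starting Scraping...')
    for i in range(1, 21):  # XPath positions are 1-based, so start at 1

        # Build the XPath of each field for the i-th job card on the page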
        title_path = f'//*[@id="listContainer"]/div[2]/div/div[{i}]/div/div[1]/a'
        co_path = f'//*[@id="listContainer"]/div[2]/div/div[{i}]/div/div[2]/span/a[1]'
        exp_path = f'//*[@id="listContainer"]/div[2]/div/div[{i}]/div/div[3]/div/span[1]/span/span'
        link_path = f'//*[@id="listContainer"]/div[2]/div/div[{i}]/div/div[1]/a'
        location_path = f'//*[@id="listContainer"]/div[2]/div/div[{i}]/div/div[3]/div/span[3]/span/span'
        salary_path = f'//*[@id="listContainer"]/div[2]/div/div[{i}]/div/div[3]/div/span[2]/span/span'
        title = driver.find_elements(By.XPATH, title_path)
        co_name = driver.find_elements(By.XPATH, co_path)
        experience = driver.find_elements(By.XPATH, exp_path)
        link = driver.find_elements(By.XPATH, link_path)
        location = driver.find_elements(By.XPATH, location_path)
        salary = driver.find_elements(By.XPATH, salary_path)
        
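        # Skip positions where no job card was found
        if not title:
            continue

        # Append exactly one value per list so the columns stay the same length;
        # a field missing from a card is stored as None
        titles.append(title[0].text)
        names.append(co_name[0].text if co_name else None)
        experiences.append(experience[0].text if experience else None)
        links.append(link[0].get_attribute('href') if link else None)
        locations.append(location[0].text if location else None)
        salaries.append(salary[0].text if salary else None)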
data = {'Titles': titles, 'Company': names, 'Experience': experiences,
        'Links': links, 'Location': locations, 'Compensation': salaries}
df = pd.DataFrame(data)
df = df.drop_duplicates()
df['Date_added'] = datetime.now().strftime('%Y-%m-%d')
driver.quit()
print(f'Length of the dataframe : {len(df)}')

Page 1 loaded. Starting Scraping...
Page 2 loaded. Starting Scraping...
Page 3 loaded. Starting Scraping...
Page 4 loaded. Starting Scraping...
Page 5 loaded. Starting Scraping...
Page 6 loaded. Starting Scraping...
Page 7 loaded. Starting Scraping...
Page 8 loaded. Starting Scraping...
Page 9 loaded. Starting Scraping...
Page 10 loaded. Starting Scraping...
Length of the dataframe : 170
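
The search URL in the loop above is written out by hand, including the percent-encoded query string. As a minimal sketch, the same URL can be assembled with the standard library so that the encoding is handled automatically. The helper name `page_url` and the constant `BASE` are illustrative and not part of the original notebook; the query parameters are copied from the URL used above.

```python
from urllib.parse import urlencode

# Path prefix taken from the URL in the scraping cell
BASE = "https://www.naukri.com/data-analyst-jobs-in-delhi-ncr"


def page_url(page: int) -> str:
    """Return the search URL for one results page (hypothetical helper)."""
    # urlencode turns spaces into '+' and '/' into '%2F'
    query = urlencode({
        "k": "data analyst",
        "l": "delhi / ncr",
        "experience": 1,
        "nignbevent_src": "jobsearchDeskGNB",
    })
    return f"{BASE}-{page}?{query}"


# One URL per results page, matching range(1, 11) in the loop
urls = [page_url(j) for j in range(1, 11)]
```

Keeping the parameters in a dictionary makes it easier to change the keyword, location, or experience filter without editing an encoded string.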


In [3]:
df.head()

| | Titles | Company | Experience | Links | Location | Compensation |
|---|---|---|---|---|---|---|
| 0 | Data Analyst (Tamil/ English) | Zbiz Solutions | 0-3 Yrs | https://www.naukri.com/job-listings-data-analy... | Kolkata, Mumbai, New Delhi, Hyderabad, Pune, C... | Not disclosed |
| 1 | Data Analyst | Larko Tech | 1-3 Yrs | https://www.naukri.com/job-listings-data-analy... | Kolkata, Mumbai, New Delhi, Hyderabad, Pune, C... | Not disclosed |
| 2 | Data Analyst(English Required) | Peroptyx | 0-5 Yrs | https://www.naukri.com/job-listings-data-analy... | Remote | Not disclosed |
| 3 | Data Analyst (Position Closed) | Bluslash Consulting | 0-2 Yrs | https://www.naukri.com/job-listings-data-analy... | Gurugram | Not disclosed |
| 4 | Data Analyst | Centre for Knowledge & Development (CKD) | 1-4 Yrs | https://www.naukri.com/job-listings-data-analy... | New Delhi | Not disclosed |
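
The `Experience` column holds strings such as `0-3 Yrs`, which cannot be filtered or averaged numerically. The following is a minimal sketch of one way to split such a string into minimum and maximum years. It assumes every value follows the `<min>-<max> Yrs` pattern seen in the sample rows; the function name `parse_experience` is illustrative and not part of the original notebook.

```python
import re
from typing import Optional, Tuple

# Matches strings such as "0-3 Yrs" (assumed format, based on the sample rows)
_EXPERIENCE = re.compile(r"(\d+)\s*-\s*(\d+)\s*Yrs", re.IGNORECASE)


def parse_experience(text: Optional[str]) -> Optional[Tuple[int, int]]:
    """Return (min_years, max_years), or None if the text does not match."""
    match = _EXPERIENCE.search(text or "")
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Applied to the DataFrame, this could look like `df['Experience'].map(parse_experience)`; values in any other format come back as `None` and would need separate handling.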


In [21]:
# Define a helper that writes a DataFrame to a tab of a Google Sheet
def write_data(spreadsheet_id, sheet_name, df):
    from googleapiclient import discovery
    from oauth2client.service_account import ServiceAccountCredentials
    keys_path = 'keys.json'
    scope = ['https://spreadsheets.google.com/feeds','https://www.googleapis.com/auth/drive']
    creds = ServiceAccountCredentials.from_json_keyfile_name(keys_path, scope)
    service = discovery.build('sheets', 'v4', credentials=creds)

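    # Cast date/time columns to strings so the values can be serialized to JSON
    df = df.copy()  # leave the caller's DataFrame untouched
    cols = list(df.columns[df.columns.str.contains('Date|Time', regex=True)])
    for col in cols:
        df[col] = df[col].astype(str)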

    input_values = [list(df.columns)] + df.values.tolist()
    input_request = [{
                'range': f"{sheet_name}!R1C1:R{df.shape[0]+1}C{df.shape[1]}",
                'majorDimension' : "ROWS",
                'values' : input_values
            }]
    body = {'valueInputOption' : "RAW", 'data' : input_request}
    
    result = service.spreadsheets().values().batchUpdate(spreadsheetId = spreadsheet_id,
                                                            body=body).execute()
    return result
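
The request body in `write_data` addresses the target cells in R1C1 notation: one header row plus one row per record, and one column per DataFrame column. As a sketch of that layout, the body can be built by a small pure function that is easy to check without calling the API. The name `build_body` is hypothetical, and it takes plain lists instead of a DataFrame to stay self-contained.

```python
def build_body(sheet_name, header, rows):
    """Build a values.batchUpdate body shaped like the one in write_data."""
    # Header first, then the data rows
    values = [list(header)] + [list(row) for row in rows]
    n_rows, n_cols = len(values), len(header)
    return {
        "valueInputOption": "RAW",
        "data": [{
            # R1C1 range from the top-left cell to the last row and column
            "range": f"{sheet_name}!R1C1:R{n_rows}C{n_cols}",
            "majorDimension": "ROWS",
            "values": values,
        }],
    }


body = build_body("Sheet1", ["Titles", "Company"], [["Data Analyst", "Larko Tech"]])
```

Here `body["data"][0]["range"]` is `Sheet1!R1C1:R2C2`: two rows (header plus one record) and two columns.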

In [28]:
sheet_id = 'sheet_id'  # placeholder: replace with the ID of the target spreadsheet
response = write_data(sheet_id, 'Sheet1', df)

In [9]:
df.to_csv('data.csv', index=False)

## Conclusion

This notebook demonstrates a simple yet effective approach to scraping job data from Naukri.com. By automating this process, we can save time and effort while gaining valuable insights into the job market. 

**Further Improvements:**

* **Error Handling:** Implement robust error handling to gracefully handle unexpected website changes or network issues.
* **Pagination:** Detect the last results page instead of hard-coding the first ten pages.
* **Data Cleaning:** Implement more sophisticated data cleaning techniques to ensure data accuracy and consistency.
* **Visualization:** Create visualizations to gain deeper insights from the scraped data.
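
As one possible starting point for the error-handling item, a page load can be wrapped in a small retry helper. This is a sketch under stated assumptions, not part of the original notebook: the name `with_retries` and its parameters are illustrative, and it catches every `Exception`, which would normally be narrowed to the errors that are actually expected (for Selenium, for example, `WebDriverException`).

```python
import time


def with_retries(action, attempts=3, delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
            if attempt < attempts:
                # Wait a little longer after each failure
                sleep(delay * attempt)
    raise last_error
```

In the scraping loop this could be used as `with_retries(lambda: driver.get(url))`. The `sleep` parameter exists only so that the waiting can be replaced in tests.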

This project serves as a starting point for exploring web scraping techniques and their applications in job market analysis.