## Data Science Jobs in India - Naukri

<strong><a href="https://www.naukri.com/">Naukri.com</a></strong> is an Indian employment website that operates in India and the Middle East. It was founded in March 1997. Naukri has been ranked No. 1 by nine independent sources, placing it well ahead of its competitors. Google Trends names Naukri “the most preferred job search destination in India”.

In this project, I scraped 12,000 results for the <strong>'data scientist'</strong> job role in <strong>India</strong> on May 18, 2022.

In [1]:
# import libraries

import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

import warnings
warnings.filterwarnings('ignore')

In [2]:
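# webdriver for Firefox - geckodriver
# the driver path is passed through a Service object (Selenium 4 API)
from selenium.webdriver.firefox.service import Service
driver = webdriver.Firefox(service=Service(executable_path='/home/futures/Documents/geckodriver'))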

# sticky timeout to implicitly wait for an element to be found
driver.implicitly_wait(10) 

# open url in the browser
driver.get('https://www.naukri.com/')

In [3]:
# enter job role - data scientist
# the locator strategy is passed as a string ('xpath', 'class name')
role = driver.find_element('xpath', '/html/body/div/div[2]/div[3]/div/div/div[1]/div/div/div/input')
role.send_keys('data scientist')

# location - India
location = driver.find_element('xpath', '/html/body/div/div[2]/div[3]/div/div/div[3]/div/div/div/input')
location.send_keys('india')

# click search
driver.find_element('class name', 'qsbSubmit').click()

In [4]:
# new dataframe with required columns
df=pd.DataFrame(columns=['Job_Role','Company','Location','Job Experience','Skills/Description'])
df

Unnamed: 0,Job_Role,Company,Location,Job Experience,Skills/Description


In [5]:
# helper function

# Function to join a list of strings into one comma-separated string
def listToString(s):

    # separator placed between the items
    sep = ", "

    # return the joined string
    return sep.join(s)
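
As a rough illustration of what the helper produces, the sketch below applies the same join to a made-up skills block. `sample_text` is a hypothetical stand-in for the `.text` of one result card's skills list, not scraped data.

```python
# Same helper as in the cell above, repeated so the sketch is self-contained
def listToString(s):
    return ", ".join(s)

# hypothetical text of one skills list: one skill per line
sample_text = "Python\nMachine Learning\nSQL"

# split on newlines, then join with ", " as the scraping loop does
skill_lst = sample_text.split("\n")
print(listToString(skill_lst))  # Python, Machine Learning, SQL
```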

In [6]:
%%time

# page number
page = 1

condition = True

while condition:

    # at most 20 results per page
    for i in range(1,21):

        try:

            # the locator strategy is passed as a string ('xpath')
            role = driver.find_element('xpath', '/html/body/div/div[3]/div[2]/section[2]/div[2]/article[{}]/div[1]/div[1]/a'.format(i)).text
            company = driver.find_element('xpath', '/html/body/div/div[3]/div[2]/section[2]/div[2]/article[{}]/div[1]/div[1]/div/a[1]'.format(i)).text
            location = driver.find_element('xpath', '/html/body/div/div[3]/div[2]/section[2]/div[2]/article[{}]/div[1]/div[1]/ul/li[3]/span[1]'.format(i)).text
            experience = driver.find_element('xpath', '/html/body/div/div[3]/div[2]/section[2]/div[2]/article[{}]/div[1]/div/ul/li[1]/span'.format(i)).text.replace(' Yrs','')
            skill_lst = driver.find_element('xpath', '/html/body/div/div[3]/div[2]/section[2]/div[2]/article[{}]/ul'.format(i)).text.split('\n')
            skill = listToString(skill_lst)

            # append collected information to the dataframe df
            df.loc[len(df.index)] = [role,company,location,experience,skill]

        except NoSuchElementException:

            pass

    page += 1

    # click next
    nxt = driver.find_element('xpath', '/html/body/div/div[3]/div[2]/section[2]/div[3]/div/a[2]')
    driver.execute_script("arguments[0].click();", nxt)

    # 12,700+ results were shown in total, at 20 results per page => 600+ pages

    # stop after 620 pages
    if page >= 620:
        condition = False

CPU times: user 1min 25s, sys: 1.96 s, total: 1min 27s
Wall time: 20min 1s
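
The page arithmetic behind the stopping rule can be checked with a short sketch. The helper names `pages_needed` and `max_rows_collected` are hypothetical, and the figures (12,700 results, 20 per page, stop at page 620) are taken from the comments in the loop above. By my reading of the loop, the counter reaches 620 right after page 619 has been scraped.

```python
import math

def pages_needed(total_results, per_page=20):
    # number of result pages required to show every result
    return math.ceil(total_results / per_page)

def max_rows_collected(stop_page, per_page=20):
    # the loop scrapes a page, increments the counter, then checks it,
    # so pages 1 .. stop_page - 1 are scraped before it stops
    return (stop_page - 1) * per_page

print(pages_needed(12700))      # 635
print(max_rows_collected(620))  # 12380
```

Under these assumptions the loop gathers at most 12,380 rows, so a listing of 12,700 or more results is not covered in full.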


In [7]:
# drop duplicates
df = df.drop_duplicates()

# shape of the dataframe
df.shape

(12232, 5)

In [8]:
# selecting first 12,000 results
df = df.iloc[:12000, :]
df.shape

(12000, 5)

In [9]:
# first five rows
df.head()

Unnamed: 0,Job_Role,Company,Location,Job Experience,Skills/Description
0,Senior Data Scientist,UPL,"Bangalore/Bengaluru, Mumbai (All Areas)",3-6,"python, MLT, statistical modeling, machine lea..."
1,Senior Data Scientist,Walmart,Bangalore/Bengaluru,5-9,"Data Science, Machine learning, Python, Azure,..."
2,Applied Data Scientist / ML Senior Engineer (P...,SAP India Pvt.Ltd,Bangalore/Bengaluru,5-10,"Python, IT Skills, Testing, Cloud, Product Man..."
3,Data Scientist,UPL,"Bangalore/Bengaluru, Mumbai (All Areas)",1-4,"python, machine learning, Data Science, data a..."
4,Data Scientist,Walmart,Bangalore/Bengaluru,4-8,"IT Skills, Python, Data Science, Machine Learn..."
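
The `Job Experience` column holds ranges such as `3-6` as plain strings, because the scraper only strips the ` Yrs` suffix. If numeric bounds are wanted later, one possible approach is sketched below. `parse_experience` is a hypothetical helper that is not part of the original notebook, and it assumes every value has the `min-max` form seen in the rows above.

```python
def parse_experience(text):
    # split "3-6" into its lower and upper bound and convert both to int
    low, high = text.split("-")
    return int(low), int(high)

print(parse_experience("3-6"))   # (3, 6)
print(parse_experience("5-10"))  # (5, 10)
```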


In [12]:
# last five rows
df.tail()

Unnamed: 0,Job_Role,Company,Location,Job Experience,Skills/Description
12143,Tech Lead/Architect ( Contractual ),Krazy Mantra HR Solutions Pvt. Ltd,"Kolkata, Chennai, Bangalore/Bengaluru",8-13,"Spark, Python, S3, lambda, Athena, AWS, IT Ski..."
12144,Tech Lead / POD Lead,cliqhr.com,"Hyderabad/Secunderabad, Pune, Bangalore/Bengaluru",10-12,"AirFlow, BigQuery, GCS, Kafka, Java, Shell scr..."
12145,Java Full Stack Developer - Hibernate / Spring,Serving Skill,"Hyderabad/Secunderabad, Bangalore/Bengaluru",2-5,"IT Skills, Java, Software Development, Testing..."
12146,Tech Lead - Azure,cliqhr.com,"Kochi/Cochin, Mumbai, Hyderabad/Secunderabad, ...",8-10,"PowerShell, Azure Data Factory, Azure, PaaS, M..."
12147,Full Stack Developer - Machine Learning,Huquo Consulting Pvt. Ltd,Bangalore/Bengaluru,5-10,"Java, JavaScript, Machine Learning, NoSQL, Clo..."
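
The row labels above run past 12,000 even though the frame has 12,000 rows. This is expected pandas behaviour: `drop_duplicates` keeps the original labels of the surviving rows, and `iloc` selects by position rather than by label. A toy frame (made-up values, not scraped data) shows the effect, with `reset_index(drop=True)` as one optional way to renumber the rows.

```python
import pandas as pd

# toy frame with one duplicated row (label 2 repeats label 1)
toy = pd.DataFrame({"Job_Role": ["A", "B", "B", "C", "D"]})

deduped = toy.drop_duplicates()      # keeps labels 0, 1, 3, 4
first_three = deduped.iloc[:3, :]    # first three rows by position
print(list(first_three.index))       # [0, 1, 3]

# optional: renumber the rows 0..n-1
renumbered = first_three.reset_index(drop=True)
print(list(renumbered.index))        # [0, 1, 2]
```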


In [15]:
# save as csv file
df.to_csv('naukri_data_science_jobs_india.csv', index=False)