# Problem Statement: Scrape the job postings from the Techolution Careers website and store them, ordered by posting date (oldest first), as a DataFrame exported to CSV.

# Importing all the necessary packages

In [1]:
import pandas as pd
import numpy as np
import re

# Extract Job postings - Techolution careers

There are multiple ways to scrape data from a web page. Below, I have used Selenium because it automates web browser interaction from Python and offers several other useful features, including waiting for a page to load, handling timeouts, and locating elements.

# Approach followed: 

     - Import packages
     - Import the Selenium webdriver package
     - Start a browser session
     - Load the desired URL in the driver window
     - Identify elements with the find_elements_by_css_selector method
     - Create an empty DataFrame
     - Loop over all the div elements and append the data from each one to the DataFrame
     - Sort the DataFrame by job posting date
     - Write the DataFrame to a CSV file

Importing webdriver from the selenium package

In [2]:
from selenium import webdriver 

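The command below makes Selenium start a Chrome browser session. It expects chromedriver.exe in the working directory. (Recent Selenium 4 releases no longer accept the driver path as a positional argument; there it is passed through a Service object.)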

In [3]:
browser = webdriver.Chrome('chromedriver.exe')

Set the URL of the Techolution careers website. The browser will now load the specified URL.

In [4]:
url = 'https://techolution.app.param.ai/jobs/'
browser.get(url)

We can identify the elements using the find_elements_by_css_selector method. Always call it on the browser object, which acts as the parent for the search. (This is the Selenium 3 name; in Selenium 4 the equivalent call is find_elements(By.CSS_SELECTOR, ...).)

Open the URL in another tab to identify the elements.

We can verify each element using 'Inspect element': right-click any hyperlink (in this case, a job opening) and click Inspect element.

Note: Do not click or make changes in the browser window started by Selenium.

Hover over the elements you wish to identify. Each job posting is enclosed in a 'div' tag, so the next cell selects all of those div tags on the page.

In [5]:
job_tags = browser.find_elements_by_css_selector('div.twelve.wide.computer.twelve.wide.tablet.sixteen.wide.mobile.column')
len(job_tags)  # number of job-posting elements found

24

We have identified 24 listings on the page.
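
The compound selector in the cell above can be built mechanically from what Inspect element shows. As a minimal sketch, assuming the div's class attribute reads "twelve wide computer twelve wide tablet sixteen wide mobile column" (inferred from the selector, not copied from the page), a hypothetical helper css_selector_from_classes joins the tag name and each class with dots:

```python
def css_selector_from_classes(tag, class_attr):
    """Build a CSS selector matching `tag` elements that carry every class in `class_attr`.

    Hypothetical helper for illustration; it is not part of Selenium.
    """
    classes = class_attr.split()  # the class attribute is whitespace-separated
    return ".".join([tag] + classes)


# Assumed class attribute, as it might appear in the browser's inspector
job_card_classes = "twelve wide computer twelve wide tablet sixteen wide mobile column"
selector = css_selector_from_classes("div", job_card_classes)
print(selector)
```

Because every class is required, the selector matches only elements that carry the full set. Dropping some of the classes would give a looser selector that may also match unrelated layout columns.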

Create an empty DataFrame named jobs and, after each iteration, append the extracted elements to it.

Using a for loop, iterate over each job tag and identify the job title, job type, required experience and location.

Job type, location and experience are all given in a single paragraph tag, hence a regular expression is used to split
that text into items, and each value is assigned to its category.
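
To make the split concrete, here is a small sketch on a made-up string. The middle-dot separator and the sample text are assumptions (the page's actual separator character is not shown in this notebook), and parse_job_details is a hypothetical helper. The pattern matches any single character with a space on each side, so it also cuts the experience range in two, which is why the last two pieces are rejoined.

```python
import re


def parse_job_details(text):
    """Split a details string into job type, location and experience.

    Hypothetical helper; assumes the layout
    "<type> <sep> <location> <sep> <min> <sep> <max> Years".
    """
    parts = re.split(r" . ", text)  # any one character surrounded by spaces
    if len(parts) != 4:
        raise ValueError(f"unexpected details format: {text!r}")
    job_type, location, exp_min, exp_max = parts
    return {
        "job_type": job_type,
        "location": location,
        "experience": exp_min + "-" + exp_max,
    }


# Made-up sample text, shaped to match the values in the jobs table
sample = "Full-time · Hyderabad · 2 - 5 Years"
print(parse_job_details(sample))
```

A location that contains a one-letter word would be split as well, so matching the literal separator is safer once the real character is known.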

In [16]:
jobs = pd.DataFrame()

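for job in job_tags:
    job_title = job.find_element_by_css_selector('h3.job_name.text-ellipsis').text
    # The <p> text holds job type, location and experience range in one string
    elements = job.find_element_by_css_selector('p').text
    # Split on any single character that has a space on each side
    list1 = re.split(r" . ", elements)
    job_type = list1[0]
    loc = list1[1]
    # The experience range is split in two by the same pattern, so rejoin its halves
    exp = list1[2] + '-' + list1[3]

    curr_job = {'experience': exp, 'job_title': job_title, 'job_type': job_type, 'location': loc}
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    jobs = pd.concat([jobs, pd.DataFrame([curr_job])], ignore_index=True)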

Data elements in the jobs DataFrame

In [7]:
jobs

Unnamed: 0,experience,job_title,job_type,location
0,0-2 Years,Big Data Intern,Internship,Hyderabad
1,5-10 Years,Senior Cloud Specialist,Full-time,Singapore
2,2-5 Years,Cloud Native Developer,Full-time,Hyderabad
3,0-4 Years,Data Scientist Intern,Internship,Hyderabad
4,2-4 Years,Embedded Engineer,Full-time,Hyderabad
5,2-6 Years,Networking & Security Specialist,Full-time,Hyderabad
6,0-1 Years,System Engineer,Internship,Mauritius
7,1-3 Years,Associate QA Engineer,Full-time,Hyderabad
8,9-15 Years,Solution Architect,Full-time,Hyderabad
9,4-9 Years,Sr. Microservices Developer,Full-time,Hyderabad


The date on which each job posting was updated is given in a separate div tag. Using the same approach as before,
identify all the date div tags and iterate over each one to get the date the job was posted.

In [8]:
date_tags = browser.find_elements_by_css_selector('div.four.wide.right.aligned.computer.tablet.only.column')
len(date_tags)

24

Create an empty DataFrame and append the posting date of each job to it

In [9]:
dates = pd.DataFrame()

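for date in date_tags:
    # Each date div holds the relative posting date (e.g. "5 days ago") in a <span>
    date1 = date.find_element_by_css_selector('span').text
    date_p = {'date_posted' : date1}
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    dates = pd.concat([dates, pd.DataFrame([date_p])], ignore_index=True)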

View the data points in the dates DataFrame

In [17]:
dates

Unnamed: 0,date_posted
0,5 days ago
1,9 days ago
2,10 days ago
3,12 days ago
4,13 days ago
5,18 days ago
6,18 days ago
7,19 days ago
8,19 days ago
9,a month ago


Concatenating both DataFrames side by side (column-wise)

In [18]:
final = [jobs, dates]

In [19]:
result = pd.concat(final, axis=1).reindex(jobs.index)  # join_axes was removed in pandas 1.0; reindex is its replacement

In [None]:
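# Reverse the row order so the oldest posting comes first, then write the DataFrame to a CSV file

In [21]:
# The page lists the newest postings first, so a descending index sort puts the oldest first
results_df = result.sort_index(ascending=False)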
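
Reversing the index only gives oldest-first order because the page happens to list the newest postings first. A more explicit alternative, sketched below, converts the relative strings into approximate ages and sorts on those. The helper approx_days_ago is hypothetical, the 30-day month and 365-day year are rough assumptions, and only the "N units ago" and "a unit ago" shapes seen in this notebook's tables are handled.

```python
import re

# Approximate length of each unit in days (assumption: 30-day months, 365-day years)
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}


def approx_days_ago(text):
    """Convert strings like '5 days ago' or 'a month ago' to an approximate day count."""
    match = re.fullmatch(r"(an?|\d+) (day|week|month|year)s? ago", text.strip())
    if match is None:
        raise ValueError(f"unrecognised relative date: {text!r}")
    count, unit = match.groups()
    number = 1 if count in ("a", "an") else int(count)
    return number * DAYS_PER_UNIT[unit]


# Sample values taken from the tables in this notebook
posted = ["5 days ago", "a month ago", "18 days ago", "2 months ago"]

# sorted() is stable, so postings with equal ages keep their original order
oldest_first = sorted(posted, key=approx_days_ago, reverse=True)
print(oldest_first)
```

The same function could be mapped over the date_posted column to build a numeric sort key, which would keep the ordering correct even if the page's listing order changed.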

In [25]:
results_df

Unnamed: 0,experience,job_title,job_type,location,date_posted
23,3-5 Years,Machine Learning Engineer,Full-time,Hyderabad,2 months ago
22,7-18 Years,Engineering Lead,Full-time,Mauritius,2 months ago
21,3-10 Years,Sr SDET,Full-time,New York,2 months ago
20,6-12 Years,OSS DevOps Engineer,Full-time,Hyderabad,2 months ago
19,1-3 Years,Site Reliability Engineer,Full-time,New York,2 months ago
18,5-11 Years,Lead DevOps Engineer,Full-time,Hyderabad,2 months ago
17,3-10 Years,Senior DevOps Engineer,Full-time,Hyderabad,2 months ago
16,1-2 Years,Junior Cloud Native Developer,Full-time,Delaware,2 months ago
15,1-4 Years,Blockchain Developer,Full-time,Hyderabad,a month ago
14,7-12 Years,Sr SAP PI/PO Developer,Contract,New Jersey,a month ago


In [22]:
results_df.to_csv('JobsList.csv', index=False)