<center>
<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color: royalblue;
           font-size:110%;
           letter-spacing:0.5px">

<h4 style="padding: 10px;
              color:white;">We will be scraping data here from job listing websites

    
</h4>
</div>
    
</center>


**Web scraping** is a technique used to extract large amounts of data from websites. It is done by writing computer programs that visit web pages, parse the contents of the page, and extract data for further processing. Web scraping is important in data science because it allows businesses to collect data from websites, including news sites, social media, and other online sources, which can be used to analyze trends, measure customer sentiment, and create predictive models. 

It is also used to populate databases, monitor prices, and collect other types of information that can be used in data science projects. Web scraping is a powerful tool for data science, as it allows businesses to collect and analyze data from different sources quickly and cost-effectively.

**What are the common ways of web scraping?**

1. **API**: An API (Application Programming Interface) is a set of protocols, routines, and tools for building software applications. When a website offers one, it can be used to request the site's data and services directly in a structured form, without parsing HTML.

2. **HTML Parsers**: HTML parsers are programs that parse HTML documents and extract the data from them. They are usually used to access webpages and extract the content from them.

3. **Scrapy**: Scrapy is a Python-based web scraping framework that provides a complete toolkit for scraping websites. It can be used to crawl multiple pages, extract data from them, and store it in a structured format.

4. **Selenium**: Selenium is an open-source tool for automating web browsers. It can be used to simulate user actions on a web page and extract data from it.

5. **Wget/Curl**: Wget and Curl are command-line programs used to download webpages and other web-based resources. They can be used to access and scrape webpages.

We will be choosing the HTML Parser method for this notebook.

In [1]:
import warnings
warnings.filterwarnings("ignore")
from bs4 import BeautifulSoup
import requests
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['font.size'] = 12

### Making a request to the website

In [2]:
url = 'https://www.shine.com/job-search/data-scientist-jobs?top_companies_boost=true&q=data-scientist'
response = requests.get(url)
print('The response that we got back from the URL is', response.status_code)

The response that we got back from the URL is 200


#### Since we are getting 200 as the status code, we are ready to proceed.

<div class="alert alert-block alert-info">
<p align="center">200 (OK) is the HTTP status code that signifies a successful request. The server returns it when it has processed the request without errors and is sending back the requested content, typically a web page, document, or data. </p>
</div>
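
As a small illustration of the note above, here is a sketch that turns a numeric status code into a readable label and decides whether to continue. It uses only the standard-library `http.HTTPStatus` enum; the helper names `describe_status` and `is_success` are made up for this example and are not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable label such as '200 OK' for a standard status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


def is_success(code):
    """True for any 2xx code, the range that signals a successful request."""
    return 200 <= code < 300


# Compare a successful code with two codes a scraper commonly runs into.
for code in (200, 403, 404):
    verdict = "proceed" if is_success(code) else "stop"
    print(describe_status(code), "->", verdict)
```

In the notebook, `is_success(response.status_code)` could stand in for the manual comparison with 200.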

**BEAUTIFULSOUP**

For scraping we will use Beautiful Soup, a Python library for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need.

### Parsing HTML

In [3]:
soup = BeautifulSoup(response.text,'html.parser')
html = soup.find_all('div')

The `html` variable now holds a list of every `<div>` element found on the page.

### Fetching the Job titles

The `select` method queries the page with a CSS selector and returns a list of all elements that match it. This can be used to find specific tags, classes, or ids in the page's HTML. Remember not to access `.text` on the returned list itself, as that raises an error; read `.text` from each element instead.

In [4]:
req = soup.select('div h2[itemprop="name"]')
#fetching the text from the html
titles = [r.text for r in req]
#Removing double spaces (single spaces between words are kept)
titles = [t.replace("  ", "") for t in titles]
titles[:5]

['Data Scientist',
 'Data Scientists 2024',
 'Data Scientist Vacancy',
 'Data Scientist Vacancy',
 'Data Scientist Vacancy']
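
To see the selector and the `.text` pitfall in isolation, here is a minimal sketch on a hand-written HTML snippet. The snippet and the `sample_*` names are invented for illustration; only the selector string is taken from the cell above.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a results page; the real page is much larger.
sample_html = """
<div><h2 itemprop="name"> Data Scientist </h2></div>
<div><h2 itemprop="name"> ML Engineer </h2></div>
<div><h2>Not a job title</h2></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# select() returns a list-like collection of matching tags.
sample_matches = sample_soup.select('div h2[itemprop="name"]')

# Read the text from each element, not from the collection.
sample_titles = [m.get_text(strip=True) for m in sample_matches]

# Accessing .text on the collection itself raises AttributeError.
try:
    sample_matches.text
    collection_has_text = True
except AttributeError:
    collection_has_text = False

print(sample_titles, collection_has_text)
```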

### Fetching the Company Name

FIND ALL
- The `find_all()` method in `BeautifulSoup` searches through a parsed document and returns a collection of all tags that match the criteria specified in the function’s arguments. 
- It searches through the entire tree of parsed code and returns a list of tags that match the given criteria. 
- It is an efficient way to search for specific tags and their contents in a parsed document.

In [5]:
orgs = soup.find_all('div', class_='jobCard_jobCard_cName__mYnow')
#fetching the text from the HTML
orgs1 = [o.text for o in orgs]
sub_string ='Hiring'
#Splitting the string on a sub string and getting the first index (Cleaning up names)
orgs1 = [o.split(sub_string)[0] for o in orgs1]
#Stripping leading and trailing whitespace
orgs1 = [o.strip() for o in orgs1]

In [6]:
orgs1[:5]

['Advance Immigrations',
 'Bosch Group',
 'Divya Interprises',
 'Divya Interprises',
 'Divya Interprises']
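
The same `find_all` plus clean-up pattern can be tried on a small hand-written snippet. This is only a sketch: the `company` class name and the sample text are hypothetical, since the real class names on the site are generated and may change.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the "Hiring" suffix mirrors the real page text.
sample_html = """
<div class="company">Acme Analytics Hiring for 3 roles</div>
<div class="company">  Globex Corp  </div>
<div class="other">Ignored</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with the trailing underscore) filters on the CSS class,
# because "class" is a reserved word in Python.
sample_tags = sample_soup.find_all("div", class_="company")

# Keep the part before "Hiring" and trim surrounding whitespace.
sample_companies = [t.text.split("Hiring")[0].strip() for t in sample_tags]
print(sample_companies)
```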

### Fetching the Locations

In [7]:
#fetching the HTML data from the class where the location data is available
loc = soup.find_all('div', class_='jobCard_jobCard_lists__fdnsc')
#fetching all the text from the HTML
location = [l.text for l in loc]
#cleaning the locations (Getting everything after the Yr(s))
location = [re.findall("Yrs?(.*)$", i)[0] for i in location]
#Replacing the '+4' overflow marker with a comma (other counts, such as '+2', are left unchanged)
location = [l.replace("+4", ", ") for l in location]

In [8]:
location[:5]

['Australia+2Canada, Singapore',
 'Coimbatore',
 'Oman+14Qatar, Asansol, Saudi Arabia, Dibrugarh, Guwahati, Shimla, Kuwait, Dehradun, Indore, Jabalpur, United Arab Emirates, Kolkata, Shillong, Bhopal',
 'Oman+14Katni, Qatar, Ratlam, Saudi Arabia, Jammu, Kuwait, Bangalore, Kochi, United Arab Emirates, Vellore, Hosur, Pune, Mangalore, Bhopal',
 'Oman+14Qatar, Saudi Arabia, Bhubaneswar, Jammu, Cuttack, Dehradun, Rourkela, Jaipur, Kuwait, Srinagar, Jodhpur, United Arab Emirates, Udaipur, Pune']
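
The output above still contains markers such as `+2` and `+14`, because the cell only replaces the literal `+4`. One possible way to handle any count is a regular expression substitution. The sketch below is an alternative to the cell above, not what the notebook ran, and `clean_location` is a name made up for this example.

```python
import re


def clean_location(text):
    """Replace a '+<number>' overflow marker of any size with a comma separator."""
    return re.sub(r"\+\d+", ", ", text)


samples = [
    "Australia+2Canada, Singapore",
    "Coimbatore",
    "Oman+14Qatar, Asansol",
]
for s in samples:
    print(clean_location(s))
```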

### Fetching Experience

We will use `re`, Python's built-in module for regular expressions.

In [9]:
#fetching the text from the loc variable for the experience
exp = [l.text for l in loc]
#Cleaning up using regex
experience = [re.findall("^(.*) Yrs?", i)[0] for i in exp]

In [10]:
experience[:5]

['1 to 6', '0 to 4', '0 to 4', '0 to 4', '0 to 4']
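
The two patterns used so far (`"^(.*) Yrs?"` and `"Yrs?(.*)$"`) split the same raw string around the `Yr`/`Yrs` marker. A sketch of doing both in one pass, with a fallback instead of an `IndexError` when the marker is missing, might look like this. The sample strings assume the raw text looks like `"1 to 6 YrsAustralia+2Canada, Singapore"`, which is inferred from the patterns rather than shown in the notebook.

```python
import re

# Named groups make the two halves of the string explicit.
EXPERIENCE_LOCATION = re.compile(r"^(?P<experience>.*?) Yrs?(?P<location>.*)$")


def split_experience_location(text):
    """Return (experience, location), or (None, None) if the marker is absent."""
    match = EXPERIENCE_LOCATION.match(text)
    if match is None:
        return None, None
    return match.group("experience"), match.group("location")


print(split_experience_location("1 to 6 YrsAustralia+2Canada, Singapore"))
print(split_experience_location("0 to 1 YrCoimbatore"))
print(split_experience_location("Experience not listed"))
```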

In [11]:
##Getting Vacancies

vac = soup.find_all('ul', class_='jobCard_jobCard_jobDetail__jD82J')
#fetching the text from the HTML
vac = [v.text for v in vac ]

In [12]:
#Cleaning up the data
vacancies = [int(re.findall(r'\d+', text)[0]) if re.findall(r'\d+', text) else 1 for text in vac]
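
The list comprehension above runs the same regular expression twice per item. A small helper can express the same rule (first number in the text, defaulting to 1) more readably. This is a sketch; `first_int` and the sample strings are made up for illustration.

```python
import re


def first_int(text, default=1):
    """Return the first run of digits in text as an int, or default if none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else default


sample_texts = ["20 Positions", "Regular", "Be an Early Applicant 99 Positions"]
print([first_int(t) for t in sample_texts])
```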

<center>
<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color: royalblue;
           font-size:110%;
           letter-spacing:0.5px">

<h4 style="padding: 10px;
              color:white;"> Putting the pieces together

    
</h4>
</div>
    
</center>

In [13]:
data = {'Titles':titles, 'Firm Name': orgs1, 
        'Job Location':location, 'Experience':experience,
        'Positions': vacancies}
df = pd.DataFrame(data)

In [14]:
df.head()

| | Titles | Firm Name | Job Location | Experience | Positions |
|---|---|---|---|---|---|
| 0 | Data Scientist | Advance Immigrations | Australia+2Canada, Singapore | 1 to 6 | 20 |
| 1 | Data Scientists 2024 | Bosch Group | Coimbatore | 0 to 4 | 1 |
| 2 | Data Scientist Vacancy | Divya Interprises | Oman+14Qatar, Asansol, Saudi Arabia, Dibrugarh... | 0 to 4 | 99 |
| 3 | Data Scientist Vacancy | Divya Interprises | Oman+14Katni, Qatar, Ratlam, Saudi Arabia, Jam... | 0 to 4 | 99 |
| 4 | Data Scientist Vacancy | Divya Interprises | Oman+14Qatar, Saudi Arabia, Bhubaneswar, Jammu... | 0 to 4 | 99 |
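
Building a DataFrame from a dictionary of lists only works when every list has the same length, which is not guaranteed when each column is scraped with a different selector. A sketch of a guard that reports the mismatch clearly is shown below; `build_frame` is a hypothetical helper, not a pandas function.

```python
import pandas as pd


def build_frame(columns):
    """Create a DataFrame after checking that all column lists line up."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return pd.DataFrame(columns)


sample_frame = build_frame({
    "Titles": ["Data Scientist", "Data Scientists 2024"],
    "Positions": [20, 1],
})
print(sample_frame.shape)
```

If one selector misses a card, the error message shows which column came up short, so the offending selector is easy to find.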


### In the same way, we can fetch more job listings from other websites, such as indeed.com.
Although it is now difficult to get data out of Indeed because of its bot protection, we managed to get some from there.

When the results span several pages, we can loop over the page numbers and repeat the same extraction for each page, which is what the code cell below does.

In [15]:
TITLES = []
COMPANIES = []
LOCATIONS = []
EXPERIENCE = []
VACANCIES = []

Range = range(1,6)
for i in Range:
    link = f'https://www.shine.com/job-search/data-scientist-jobs-{i}?top_companies_boost=true&sort=1&q=data-scientist'
    response = requests.get(link)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        req = soup.select('div h2[itemprop="name"]')
        titles = [r.text for r in req]
        titles1 = [t.replace("|","") for t in titles]
        titles = [t.replace("  ", "") for t in titles1]
        TITLES.extend(titles)
        
        orgs = soup.find_all('div', class_='jobCard_jobCard_cName__mYnow')
        orgs1 = [o.text for o in orgs]
        sub_str = "Hiring"
        companies = [o.split(sub_str)[0] for o in orgs1]
        COMPANIES.extend(companies) 
        
        loc = soup.find_all('div', class_='jobCard_jobCard_lists__fdnsc')
        location = [l.text for l in loc]
        location = [re.findall("Yrs?(.*)$", i)[0] for i in location]
        location = [l.replace("+4", ", ") for l in location]
        LOCATIONS.extend(location)
        
        #Fetching Experience
        exp = [l.text for l in loc]
        experience = [re.findall("^(.*) Yrs?", i)[0] for i in exp]
        EXPERIENCE.extend(experience)  
        
        vacancies = soup.find_all('ul', class_='jobCard_jobCard_jobDetail__jD82J')
        vac = [v.text for v in vacancies]
        vacancies = [int(re.findall(r'\d+', text)[0]) if re.findall(r'\d+', text) else 1 for text in vac]
        VACANCIES.extend(vacancies)
        
    else:
        print('Invalid Response')

df = pd.DataFrame({'Job Title': TITLES, 
                   'Employer': COMPANIES,
                   'Job Location': LOCATIONS, 
                   'Experience': EXPERIENCE, 
                   'Positions': VACANCIES})

print(f'We have managed to fetch {len(df)} job postings while scraping {len(Range)} pages.')

We have managed to fetch 100 job postings while scraping 5 pages.
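
The loop above mixes fetching, parsing, and collecting in one block. A sketch of the same control flow with those pieces passed in as functions, plus a pause between requests, is given below. The names `scrape_pages`, `fetch`, and `parse` are hypothetical, and the fake fetcher stands in for `requests.get` so the example runs without network access.

```python
import time


def scrape_pages(page_numbers, fetch, parse, delay_seconds=1.0):
    """Fetch each page, skip failed ones, and collect the parsed rows."""
    rows = []
    for position, page in enumerate(page_numbers):
        if position > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        html = fetch(page)
        if html is None:  # the fetcher signals a failed request with None
            print(f"Skipping page {page}: no valid response")
            continue
        rows.extend(parse(html))
    return rows


# Stand-ins for the real network call and the real Beautiful Soup parsing.
fake_pages = {1: "Data Scientist,Data Analyst", 2: None, 3: "ML Engineer"}
sample_rows = scrape_pages(
    [1, 2, 3],
    fetch=fake_pages.get,
    parse=lambda html: html.split(","),
    delay_seconds=0,
)
print(sample_rows)
```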


In [16]:
#Dropping the duplicates and adding the new column
df = df.drop_duplicates(subset=['Job Title']).copy()
#A posting is for freshers only when the minimum experience (the first number) is 0;
#a plain substring test would also match the 0 in ranges such as '7 to 10'
df['Category'] = ['Fresher' if i.split()[:1] == ['0'] else 'Experienced' for i in df['Experience']]

In [17]:
df

| | Job Title | Employer | Job Location | Experience | Positions | Category |
|---|---|---|---|---|---|---|
| 0 | Power Bi Developer | FUTURE A4 | Kochi | 2 to 7 | 5 | Experienced |
| 1 | Business Analyst - Trading | REVA HR SOLUTIONS. | Bangalore+1Pune | 7 to 10 | 3 | Experienced |
| 2 | Data Scientist | Advance Immigrations | Australia+2Canada, Singapore | 1 to 6 | 20 | Experienced |
| 3 | Required Business Analyst in Europe | AMEURO MIGRATION PRIVATE LIMITED | Germany | 4 to 9 | 46 | Experienced |
| 4 | Management Trainee India (U.Gro Next Level) | FUGRO SURVEY INDIA PRIVATE LIMITED | Navi Mumbai | 0 to 2 | 2 | Fresher |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | sales officer | HAPPY SQUARE OUTSOURCING SERVICES P... | Ahmedabad | 1 to 6 | 1 | Experienced |
| 96 | Sales officer | HAPPY SQUARE OUTSOURCING SERVICES P... | Bhubaneswar | 1 to 5 | 1 | Experienced |
| 97 | Invoicing and Documentation Specialist | HAPPY SQUARE OUTSOURCING SERVICES P... | Bangalore | 2 to 5 | 1 | Experienced |
| 98 | UrgentWipro-Data Engineering | Wipro Technologies | Bangalore+6Noida, Chennai, Hyderabad, Gurugram... | 6 to 11 | 1 | Experienced |


## Scraping Indeed.com

This section does not need to be run every time, because indeed.com's bot protection blocks our scraping attempts most of the time. We managed to grab the data once and saved it to a CSV file in the folder, which we can use for the next set of analyses.
Most of the time, indeed.com returns a bogus response from which we cannot extract the data.

In [18]:
URL = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

In [19]:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from tqdm import tqdm_notebook
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import warnings

In [20]:
pd.options.display.max_rows = 500
pd.options.display.max_columns = 100
warnings.filterwarnings('ignore')

In [21]:
#Create the request
r = requests.get(URL)

In [22]:
soup = BeautifulSoup(r.text,'html.parser')

In [23]:
#Each function walks over the job cards on a results page and returns one value per card.
#find() returns None when a field is missing, so reading .text raises AttributeError;
#in that case np.nan is stored to keep the lists aligned.

#Function to scrape the location
def extract_location_from_result(result):
    location = []
    for l in result.find_all('div', attrs={'class':'jobsearch-SerpJobCard'}):
        try:
            location.append(l.find(('div', 'span'), attrs={'class':'location'}).text)
        except AttributeError:
            location.append(np.nan)
    return location

#Function to scrape the salary
def extract_salary_from_result(result):
    salary = []
    for s in result.find_all('div', attrs={'class':'jobsearch-SerpJobCard'}):
        try:
            salary.append(s.find(('div', 'span'), attrs={'class':'salary no-wrap'}).text.strip())
        except AttributeError:
            salary.append(np.nan)
    return salary

#Function to scrape the job
def extract_job_from_result(result):
    job = []
    for j in result.find_all('div', attrs={'class':'jobsearch-SerpJobCard'}):
        try:
            job.append(j.find('a', attrs={'data-tn-element':'jobTitle'}).text.strip())
        except AttributeError:
            job.append(np.nan)
    return job

#Function to scrape the company
def extract_company_from_result(result):
    company = []
    for c in result.find_all('div', attrs={'class':'jobsearch-SerpJobCard'}):
        try:
            company.append(c.find(('div', 'span'), attrs={'class':'company'}).text.strip())
        except AttributeError:
            company.append(np.nan)
    return company
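
The four functions above differ only in the tag names and attributes they look for, so the shared logic could be factored into one helper. The sketch below shows one way to do that on a hand-written card. `extract_field` is a hypothetical name, the sample HTML is invented to match the selectors used above, and missing fields come back as `None` rather than `np.nan` (pandas treats both as missing values).

```python
from bs4 import BeautifulSoup


def extract_field(card, names, attrs):
    """Return the stripped text of the first matching tag in card, or None."""
    tag = card.find(names, attrs=attrs)
    return tag.text.strip() if tag is not None else None


# Invented markup that follows the class names used by the functions above.
sample_html = """
<div class="jobsearch-SerpJobCard">
  <a data-tn-element="jobTitle"> Data Scientist </a>
  <span class="company"> Example Corp </span>
  <div class="location">New York, NY</div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# One dictionary per card keeps the fields of a posting together.
sample_rows = [
    {
        "Job": extract_field(card, "a", {"data-tn-element": "jobTitle"}),
        "Company": extract_field(card, ["div", "span"], {"class": "company"}),
        "Location": extract_field(card, ["div", "span"], {"class": "location"}),
        "Salary": extract_field(card, ["div", "span"], {"class": "salary no-wrap"}),
    }
    for card in sample_soup.find_all("div", attrs={"class": "jobsearch-SerpJobCard"})
]
print(sample_rows)
```

Collecting one record per card also means a missing salary can never shift the other columns out of alignment.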

In [24]:
YOUR_CITY = 'Seattle','Montgomery', 'San+Antonio', 'San+Diego', 'San+Jose', 'Jacksonville', 'Columbus', 'Fort+Worth', 'Indianapolis', 'Charlotte', 'Boston', 'Washington+DC', 'El+Paso', 'Detroit', 'Nashville', 'Memphis', 'Portland', 'Oklahoma+City', 'Las+Vegas'

In [25]:
max_results_per_city = 1000 

results_location = []
results_job = []
results_salary = []
results_company = []

for city in set(['New+York', 'Chicago']):
    temp_location = []
    temp_job = []
    temp_salary = []
    temp_company = []
    
    for start in tqdm_notebook (range(0, max_results_per_city, 10)):
        url = 'http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}'.format(city,start)
        r = requests.get(url)
        global_soup = BeautifulSoup(r.text, 'html.parser')
        
        temp_company.append(extract_company_from_result(global_soup))
        temp_job.append(extract_job_from_result(global_soup))
        temp_salary.append(extract_salary_from_result(global_soup))
        temp_location.append(extract_location_from_result(global_soup))
        
    results_location.append(temp_location)
    results_job.append(temp_job)
    results_salary.append(temp_salary)
    results_company.append(temp_company)

   
        
        

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

In [26]:
df = pd.DataFrame({
    'Job': [item for l in results_job for sublist in l for item in sublist], 
    'Company': [item for l in results_company for sublist in l for item in sublist], 
    'Salary': [item for l in results_salary for sublist in l for item in sublist], 
    'Location':[item for l in results_location for sublist in l for item in sublist]
})
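
The nested comprehensions above flatten a three-level structure: a list per city, containing a list per results page, containing the values for each posting. An equivalent sketch with `itertools.chain` from the standard library is shown below; `flatten_two_levels` and the sample data are made up for illustration.

```python
from itertools import chain


def flatten_two_levels(nested):
    """Flatten city -> page -> item lists into a single list of items."""
    per_page = chain.from_iterable(nested)       # drop the city level
    return list(chain.from_iterable(per_page))   # drop the page level


sample_nested = [
    [["Data Scientist", "Data Analyst"], ["ML Engineer"]],  # first city, two pages
    [[], ["Research Scientist"]],                           # second city, one empty page
]
print(flatten_two_levels(sample_nested))
```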

In [27]:
df.to_csv('Data_Science.csv', sep='\t', index = False)
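
Note that the file is written with tabs as the separator even though its name ends in `.csv`, so it has to be read back with the same `sep`. The sketch below round-trips a tiny made-up frame through an in-memory buffer instead of a file on disk, to show that values containing commas (such as salaries) survive intact.

```python
import io

import pandas as pd

# Made-up sample row; the comma in the salary is why a tab separator is convenient.
sample = pd.DataFrame({
    "Job": ["Data Scientist"],
    "Salary": ["$100,000 a year"],
})

buffer = io.StringIO()
sample.to_csv(buffer, sep="\t", index=False)  # same options as the cell above

buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")      # the separator must match when reading
print(restored)
```

For the saved file, the equivalent call would be `pd.read_csv('Data_Science.csv', sep='\t')`.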

## CONCLUSION

In this Jupyter Notebook, we have implemented a web scraping script to extract job-related data from a website. The script goes through the following steps to gather information:

1. **Sending HTTP Requests**: We use the `requests` library to send HTTP GET requests to the constructed URLs. We check the status code of the response to ensure that we receive a successful response (HTTP status code 200) before proceeding.

2. **Parsing HTML Content**: Upon successful retrieval of a web page, we used the `BeautifulSoup` library to parse the HTML content of the page. This enabled us to navigate the page's structure and extract relevant information.

3. **Extracting Data**: We learnt how to navigate the HTML of a website to find the data we are after, which is stored in elements with various classes. In our case, we extracted the job titles, company names, locations, experience requirements, and number of positions. We also created a new column based on our existing data.

4. **Looping through Pages**: Once we got the hang of how to do this for one page, we iterated through a range of pages, where each page contains a list of job postings. For each page, we constructed a URL with the page number and specific query parameters related to job type, top companies, and sorting.

5. **Visualizing Data**: We also took a look at some charts to understand our data, i.e. the distribution of the number of positions and how many jobs in the market are for freshers versus experienced professionals.

By following these steps, we systematically gather data from multiple web pages, ensuring that we handle variations in the format of information such as job titles, company names, locations, and experience requirements. The use of regular expressions allowed us to extract specific patterns from the text, enhancing the accuracy of our data extraction process.