# <center>Data Acquisition by Web Scraping</center>


### Importing Required Libraries:

In [27]:
# Importing the pandas library, which is used for data manipulation and analysis
import pandas as pd

# Importing the requests library, which is used to send HTTP requests to a web server
import requests

# Importing the BeautifulSoup class from the bs4 library for parsing HTML and XML documents
from bs4 import BeautifulSoup

# Importing specific exceptions from the requests.exceptions module
# These exceptions are used to handle various types of errors that can occur during HTTP requests
from requests.exceptions import ConnectionError, Timeout, TooManyRedirects, RequestException

### Accessing the URL with `requests`:

`requests.get(...)`: This function call sends an HTTP GET request to the specified URL. The `requests` library handles the request and returns a `Response` object.


The `Response` object exposes several attributes, including:

- `.text`: The response body decoded to a string (here, the page's HTML).
- `.status_code`: The HTTP status code of the response.
- `.headers`: The headers of the response.

In [28]:
requests.get('https://www.workindia.in/jobs-in-bengaluru/?pg=1')

<Response [200]>

### Looking at the text for the given URL:

In [29]:
requests.get('https://www.workindia.in/jobs-in-bengaluru/?pg=1').text

'\n<!DOCTYPE html>\n<html lang="en">\n    <head>\n        <title>Jobs in Bengaluru: Latest Job Vacancies &amp; Job Openings in Bengaluru</title>\n        <meta charset="UTF-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n        <meta name="google-site-verification" content="eF3EYCP7S8LHBCCMZQCUGXayIATc39vSLkd-Vc1oWhY" />\n        <meta name="ahrefs-site-verification" content="041a0cd4aacc5153d5793e5abdc48d60bace27f85eddb16fd31fdd92cc841e84">\n        <meta name="msvalidate.01" content="6D13CD0AAC8315C4CFB89EF7BD9DC102" />\n        <meta name="viewport" content="width=device-width, initial-scale=1">\n        <meta name="author" content="workindia">\n        <link rel="shortcut icon" href="//www.workindia.in/dist/client/../client/assets/favicon.ico" type="image/x-icon" />\n        \n        <meta name="mobile-web-app-capable" content="yes">\n        <meta name="apple-mobile-web-app-capable" content="yes">\n        <meta name="application-name" content="WorkIndia">\n 

### If the response code is 403


A 403 Forbidden response code indicates that the server understood the request but refuses to authorize it. <br>
This often happens because the website expects browser-like traffic. A request sent through `requests` can look like a bot to the server, so it denies access and returns a 403 response code.

Here are some common reasons you might encounter a 403 error during web scraping:

- IP Blocking:

    - The server might have detected that your IP address is making too many requests, which it considers abusive. As a result, it blocks your IP from accessing the resource.

- User-Agent Blocking:

    - Some websites block requests that don't come from recognized browsers. If you're not setting a User-Agent header, or if you're using a user-agent string that looks suspicious, the server might block your request.

- Insufficient Permissions:

    - The resource you're trying to access might require special permissions (e.g., logged-in user sessions) that your scraping script doesn't provide.

- Geo-restrictions:

    - The server might restrict access to users from certain countries or regions. If your IP address falls into a restricted range, you’ll receive a 403 response.

- Anti-Scraping Measures:

    - Websites often implement anti-scraping techniques to prevent automated access to their content. These measures might include checking for bot-like behavior, such as making requests too quickly or repeatedly accessing the same resource.

- Authentication Required:

    - Some websites require authentication (e.g., cookies, tokens) to access certain pages. A missing credential normally produces 401 Unauthorized, but some servers respond with 403 instead.
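
To see why a plain script may be flagged, it can help to compare the User-Agent that `requests` advertises by default with a browser-like one. The sketch below only prepares a request locally and sends nothing. The Chrome string is an illustrative value, and whether a given site accepts it is an assumption.

```python
import requests

url = 'https://www.workindia.in/jobs-in-bengaluru/?pg=1'

# Identity that requests advertises when no User-Agent is set,
# of the form 'python-requests/<version>'
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Illustrative browser-like header (assumed value, not guaranteed to be accepted)
browser_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}

# Prepare the request without sending it, then inspect the outgoing header
prepared = requests.Request('GET', url, headers=browser_headers).prepare()
print(prepared.method)
print(prepared.headers['User-Agent'])
```

A server that filters on User-Agent sees the first string from an unconfigured script and the second from the prepared request.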

In [30]:
# Define headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}
# 'User-Agent': A custom HTTP header to identify the browser making the request.
# This helps mimic a request from a web browser and avoid potential blocking by the server.

try:
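    # Send a GET request to the website with the custom User-Agent
    response = requests.get('https://www.workindia.in/jobs-in-bengaluru/?pg=1', headers=headers, timeout=10)
    # 'requests.get(...)': Sends an HTTP GET request to the specified URL.
    # 'headers=headers': Includes the custom headers in the request.
    # 'timeout=10': Gives up after 10 seconds; without a timeout, requests waits indefinitely
    # and the Timeout handler below would never fire.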

    # Raise an HTTPError for bad responses (4xx or 5xx)
    response.raise_for_status()
    # 'response.raise_for_status()': Checks if the status code indicates an error.
    # If there is an error (4xx or 5xx), an HTTPError exception will be raised.

    webpage = response.text
    # 'response.text': Retrieves the HTML content of the webpage as a string.

    # Print the response content or status code
    print(webpage)  
    # 'print(webpage)': Outputs the HTML content of the webpage.
    # You can also print 'response.status_code' to see the HTTP status code if needed.

except ConnectionError as ce:
    print(f"Connection error occurred: {ce}")
    # 'ConnectionError': Catches errors related to network issues or failed connections.
    # Prints an error message if a ConnectionError is raised.

except Timeout as te:
    print(f"Request timed out: {te}")
    # 'Timeout': Catches errors if the request exceeds the timeout period.
    # Prints an error message if a Timeout exception is raised.

except TooManyRedirects as tmr:
    print(f"Too many redirects: {tmr}")
    # 'TooManyRedirects': Catches errors if there are too many redirects while accessing the URL.
    # Prints an error message if a TooManyRedirects exception is raised.

except RequestException as e:
    print(f"An error occurred: {e}")
    # 'RequestException': A base class for all exceptions raised by the requests library.
    # Catches any other errors that don't fall into the specific categories above.
    # Prints a general error message for any other exceptions raised.



<!DOCTYPE html>
<html lang="en">
    <head>
        <title>Jobs in Bengaluru: Latest Job Vacancies &amp; Job Openings in Bengaluru</title>
        <meta charset="UTF-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta name="google-site-verification" content="eF3EYCP7S8LHBCCMZQCUGXayIATc39vSLkd-Vc1oWhY" />
        <meta name="ahrefs-site-verification" content="041a0cd4aacc5153d5793e5abdc48d60bace27f85eddb16fd31fdd92cc841e84">
        <meta name="msvalidate.01" content="6D13CD0AAC8315C4CFB89EF7BD9DC102" />
        <meta name="viewport" content="width=device-width, initial-scale=1">
        <meta name="author" content="workindia">
        <link rel="shortcut icon" href="//www.workindia.in/dist/client/../client/assets/favicon.ico" type="image/x-icon" />
        
        <meta name="mobile-web-app-capable" content="yes">
        <meta name="apple-mobile-web-app-capable" content="yes">
        <meta name="application-name" content="WorkIndia">
        <meta name=
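
Transient failures such as connection errors or 5xx responses are often worth retrying rather than only reporting. One possible approach, sketched below, is a `requests.Session` with a retry policy mounted on it. The helper name `build_session`, the retry count, the backoff factor, and the list of status codes are all illustrative assumptions, not values taken from this notebook.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(user_agent, total_retries=3, backoff_factor=0.5):
    """Return a Session that retries transient failures (hypothetical helper)."""
    retry = Retry(
        total=total_retries,                         # maximum number of retries
        backoff_factor=backoff_factor,               # growing pause between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # statuses treated as retryable
    )
    adapter = HTTPAdapter(max_retries=retry)

    session = requests.Session()
    session.mount('https://', adapter)
    session.mount('http://', adapter)
    # Every request made through this session carries the same User-Agent
    session.headers.update({'User-Agent': user_agent})
    return session


session = build_session('Mozilla/5.0 (Windows NT 10.0; Win64; x64)')
# Nothing is sent here. A later call such as session.get(url, timeout=10)
# would apply the retry policy automatically.
```

Reusing one session also keeps the underlying connection open across requests, which helps when many pages are fetched from the same host.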

In [31]:
soup = BeautifulSoup(webpage, 'lxml')
# 'BeautifulSoup(webpage, 'lxml')': Initializes a BeautifulSoup object to parse the HTML content.
# 'webpage': The HTML content of the webpage retrieved from the GET request.
# 'lxml': Specifies the parser to use. 'lxml' is a fast and feature-rich XML and HTML parser.
# The BeautifulSoup object (stored in the variable 'soup') allows for easy navigation and extraction of data from the HTML content.
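
The parsing step can be tried offline on a small inline document. The HTML below is a made-up sample, and the sketch uses `'html.parser'` (bundled with Python) so it runs even where `lxml` is not installed.

```python
from bs4 import BeautifulSoup

# Small inline document standing in for a downloaded page (made-up sample)
sample_html = """
<html>
  <head><title>Jobs in Bengaluru</title></head>
  <body><p class="intro">Latest openings</p></body>
</html>
"""

# 'html.parser' ships with Python; 'lxml' can be used instead when available
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Navigate by tag name, or search by tag and CSS class
page_title = sample_soup.title.string
intro_text = sample_soup.find('p', class_='intro').get_text()

print(page_title)
print(intro_text)
```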

In [32]:
soup.prettify()
# 'soup.prettify()': Formats the HTML content of the BeautifulSoup object 'soup' in a more readable, indented format.
# This method makes the HTML structure easier to understand by adding line breaks and indentation.
# It helps visualize the nested structure of HTML tags and is useful for debugging or analyzing the HTML structure.

'<!DOCTYPE html>\n<html lang="en">\n <head>\n  <title>\n   Jobs in Bengaluru: Latest Job Vacancies &amp; Job Openings in Bengaluru\n  </title>\n  <meta charset="utf-8"/>\n  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n  <meta content="eF3EYCP7S8LHBCCMZQCUGXayIATc39vSLkd-Vc1oWhY" name="google-site-verification"/>\n  <meta content="041a0cd4aacc5153d5793e5abdc48d60bace27f85eddb16fd31fdd92cc841e84" name="ahrefs-site-verification"/>\n  <meta content="6D13CD0AAC8315C4CFB89EF7BD9DC102" name="msvalidate.01"/>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n  <meta content="workindia" name="author"/>\n  <link href="//www.workindia.in/dist/client/../client/assets/favicon.ico" rel="shortcut icon" type="image/x-icon"/>\n  <meta content="yes" name="mobile-web-app-capable"/>\n  <meta content="yes" name="apple-mobile-web-app-capable"/>\n  <meta content="WorkIndia" name="application-name"/>\n  <meta content="WorkIndia" name="apple-mobile-web-app-title"/>\n  <meta c

In [33]:
print(soup.find_all('h1'))
# 'soup.find_all('h1')': Finds and returns a list of all 'h1' HTML tags within the BeautifulSoup object 'soup'.
# 'find_all' (older code may use the alias 'findAll') collects all occurrences of the specified tag ('h1' in this case) and returns them as a list.
# This allows you to access and manipulate all 'h1' elements on the webpage.

print('\n\n')

print(soup.find_all('h1')[0])
# 'soup.find_all('h1')[0]': Accesses the first 'h1' tag in the list returned by 'find_all'.
# This retrieves the first occurrence of the 'h1' tag and prints it. The output includes the tag itself and its content.

print('\n\n')

print(soup.find_all('h1')[0].text)
# 'soup.find_all('h1')[0].text': Extracts and prints only the text content of the first 'h1' tag.
# '.text' attribute gets the text inside the HTML tag, excluding the HTML tags themselves.
# This is useful for obtaining just the textual content without the surrounding HTML markup.


[<h1 class="f13 text-primary" style="letter-spacing: 0.3px; display: inline;">Showing <strong>8190 Jobs In Bengaluru</strong></h1>]



<h1 class="f13 text-primary" style="letter-spacing: 0.3px; display: inline;">Showing <strong>8190 Jobs In Bengaluru</strong></h1>



Showing 8190 Jobs In Bengaluru
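
The difference between searching for all matches, the first match, and a missing tag can be sketched on an inline snippet modeled on the heading above. `find` and `get_text` are standard BeautifulSoup calls. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

sample_html = (
    '<div><h1 class="f13">Showing <strong>8190 Jobs In Bengaluru</strong></h1></div>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

all_h1 = sample_soup.find_all('h1')   # list of every <h1> (possibly empty)
first_h1 = sample_soup.find('h1')     # first <h1> only, or None if absent
missing = sample_soup.find('h3')      # no <h3> in the snippet, so this is None

# Join the nested text pieces with a space and trim surrounding whitespace
heading_text = first_h1.get_text(' ', strip=True)

print(len(all_h1))
print(missing)
print(heading_text)
```

Because `find` returns `None` for a missing tag, calling `.text` on its result without a check raises an `AttributeError`.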


## Breaking Down the Final Working Concept:

### Function to scrape the page based on the location:

In [34]:
def web_scrapping(location_name):
    # Use f-string to insert the location_name into the URL
    return requests.get(f'https://www.workindia.in/jobs-in-{location_name}/?pg=1').text

# final_webpage = web_scrapping('bengaluru')
final_webpage = web_scrapping('mumbai')
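
Building the URL by hand works for simple names such as `mumbai`. A slightly more defensive variant, sketched below with the standard library only, normalizes the location and encodes the query string. The helper name `build_listing_url` and the slug rule (lowercase, spaces replaced by hyphens) are assumptions about how the site names its location pages.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://www.workindia.in'


def build_listing_url(location_name, page_number=1):
    """Build a job-listing URL for a location and page (hypothetical helper)."""
    # Assumed slug rule: lowercase, spaces replaced by hyphens, then URL-quoted
    slug = quote(location_name.strip().lower().replace(' ', '-'))
    # urlencode handles escaping of the query parameters
    query = urlencode({'pg': page_number})
    return f'{BASE_URL}/jobs-in-{slug}/?{query}'


print(build_listing_url('mumbai'))
print(build_listing_url('Navi Mumbai', 3))
```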

In [35]:
# Create a BeautifulSoup object from the fetched HTML:
soup = BeautifulSoup(final_webpage, 'lxml')

In [36]:
soup.prettify()

'<!DOCTYPE html>\n<html lang="en">\n <head>\n  <title>\n   Jobs in Mumbai: Latest Job Vacancies &amp; Job Openings in Mumbai\n  </title>\n  <meta charset="utf-8"/>\n  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n  <meta content="eF3EYCP7S8LHBCCMZQCUGXayIATc39vSLkd-Vc1oWhY" name="google-site-verification"/>\n  <meta content="041a0cd4aacc5153d5793e5abdc48d60bace27f85eddb16fd31fdd92cc841e84" name="ahrefs-site-verification"/>\n  <meta content="6D13CD0AAC8315C4CFB89EF7BD9DC102" name="msvalidate.01"/>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n  <meta content="workindia" name="author"/>\n  <link href="//www.workindia.in/dist/client/../client/assets/favicon.ico" rel="shortcut icon" type="image/x-icon"/>\n  <meta content="yes" name="mobile-web-app-capable"/>\n  <meta content="yes" name="apple-mobile-web-app-capable"/>\n  <meta content="WorkIndia" name="application-name"/>\n  <meta content="WorkIndia" name="apple-mobile-web-app-title"/>\n  <meta content

In [37]:
jobs = soup.find_all('div', class_='JobItemV2')


for job in jobs:
    print(job.text)
    print()

2D/3D ArchitectRs. 25000 - Rs. 30000Call HRNavi Mumbai, MumbaiArchitect/Interiors2+ Years of ExperienceSpeak Thoda EnglishJoy House Interior12th PassPosted on: 8/5/2024New30 ViewsEnter your Details!Enter your full nameEnter your mobile number+91Apply

Hotel Chief EngineerRs. 40000 - Rs. 60000Call HRSakinaka, MumbaiEngineer2+ Years of ExperienceSpeak Good EnglishPeninsula Grand HotelGraduate / Male OnlyPosted on: 8/5/2024New15 ViewsEnter your Details!Enter your full nameEnter your mobile number+91Apply

Work From Home - Relationship ManagerRs. 19000 - Rs. 40000Call HRGoregaon, MumbaiBusiness DevelopmentFresherSpeak Thoda EnglishIon Business Solutions Pvt LtdGraduatePosted on: 8/5/2024New36 ViewsEnter your Details!Enter your full nameEnter your mobile number+91Apply

Picker / PackerRs. 15000 - Rs. 20000Call HRKanjurmarg, MumbaiBack OfficeFresherSpeak Thoda EnglishBlinkit (Zomato)< 10th PassPosted on: 8/5/2024New4463 ViewsEnter your Details!Enter your full nameEnter your mobile number+91A

In [38]:
job_role = []
salary_range = []
location = []
job_type = []
experience = []
english_level = []
company_name = []
qualification = []
job_publication_date = []

for i in jobs:
    job_role.append(i.find('h2').text.strip())
    
    salary_range.append(i.find('div', class_ = 'f14').text.strip())
    
    location.append(i.find('div', class_ = 'LocationDetail').text.strip())
    
    job_type.append(i.find('div', class_ = 'JobTypeDetail').text.strip())
    
    experience.append(i.find('div', class_ = 'ExperienceDetail').text.strip())
    
    english_level.append(i.find('div', class_ = 'EnglishDetail').text.strip())
    
    company_name.append(i.find('div', class_ = 'CompanyDetail').text.strip())
    
    qualification.append(i.find('div', class_ = 'QualificationDetail').text.strip())
    
    job_publication_date.append(i.find('div', class_='JobPostedOnDetail').text.strip().replace('Posted on: ', '').strip())
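
The loop above assumes every job card contains every field. If a card lacks one, `find` returns `None` and `.text` raises an `AttributeError`. A minimal sketch of a guard follows. `safe_text` is a hypothetical helper, and the inline HTML is a made-up sample that reuses the class names seen above.

```python
from bs4 import BeautifulSoup


def safe_text(parent, tag, class_name=None, default=''):
    """Return the stripped text of the first match, or `default` if absent."""
    if class_name is None:
        node = parent.find(tag)
    else:
        node = parent.find(tag, class_=class_name)
    return node.get_text(strip=True) if node is not None else default


# Two sample cards; the second one has no CompanyDetail block
sample_html = """
<div class="JobItemV2">
  <h2>Picker / Packer</h2>
  <div class="CompanyDetail">Blinkit (Zomato)</div>
</div>
<div class="JobItemV2">
  <h2>Delivery Boy</h2>
</div>
"""
sample_jobs = BeautifulSoup(sample_html, 'html.parser').find_all('div', class_='JobItemV2')

roles = [safe_text(job, 'h2') for job in sample_jobs]
companies = [safe_text(job, 'div', 'CompanyDetail', default='N/A') for job in sample_jobs]

print(roles)
print(companies)
```

Filling a placeholder also keeps all column lists the same length, which the DataFrame construction relies on.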

In [39]:
# Create a dictionary with actual lists
job_dict = {
    'Role': job_role,
    'Job Type': job_type,
    'Company': company_name,
    'Location': location,
    'Salary Range': salary_range,
    'Experience': experience,
    'Qualification': qualification,
    'English Proficiency': english_level,
    'Job Published On': job_publication_date,
}

In [40]:
df = pd.DataFrame(job_dict)

In [41]:
df

| | Role | Job Type | Company | Location | Salary Range | Experience | Qualification | English Proficiency | Job Published On |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2D/3D Architect | Architect/Interiors | Joy House Interior | Navi Mumbai, Mumbai | Rs. 25000 - Rs. 30000 | 2+ Years of Experience | 12th Pass | Speak Thoda English | 8/5/2024 |
| 1 | Hotel Chief Engineer | Engineer | Peninsula Grand Hotel | Sakinaka, Mumbai | Rs. 40000 - Rs. 60000 | 2+ Years of Experience | Graduate / Male Only | Speak Good English | 8/5/2024 |
| 2 | Work From Home - Relationship Manager | Business Development | Ion Business Solutions Pvt Ltd | Goregaon, Mumbai | Rs. 19000 - Rs. 40000 | Fresher | Graduate | Speak Thoda English | 8/5/2024 |
| 3 | Picker / Packer | Back Office | Blinkit (Zomato) | Kanjurmarg, Mumbai | Rs. 15000 - Rs. 20000 | Fresher | < 10th Pass | Speak Thoda English | 8/5/2024 |
| 4 | Office Executive - Back Office | Back Office | Freeze Craft (Daikin Airconditioning-Authorize... | Bandra East, Mumbai | Rs. 15000 - Rs. 25000 | Fresher | 12th Pass | Speak Thoda English | 8/5/2024 |
| 5 | Delivery Boy | Delivery | Right To Job | Mahape, Mumbai | Rs. 16000 - Rs. 21000 | Fresher | < 10th Pass / Male Only | No English Required | 8/5/2024 |
| 6 | Delivery Boy | Delivery | RIGHT TO JOB | Chinchpokli, Mumbai | Rs. 16000 - Rs. 21000 | Fresher | < 10th Pass / Male Only | No English Required | 8/5/2024 |
| 7 | Delivery Boy | Delivery | Right To Job | Kandivali, Mumbai | Rs. 16000 - Rs. 21000 | Fresher | < 10th Pass / Male Only | No English Required | 8/5/2024 |
| 8 | Production Manager | Engineer | Rocky Steels | Grant Road, Mumbai | Rs. 15000 - Rs. 35000 | 2+ Years of Experience | Graduate / Male Only | Speak Thoda English | 8/5/2024 |
| 9 | Marketing Manager | Marketing | Rocky Steels | Grant Road, Mumbai | Rs. 13000 - Rs. 30000 | 2+ Years of Experience | Graduate | Speak Fluent English | 8/5/2024 |
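
A dictionary of lists maps directly onto a table: each key becomes a column and each list position becomes a row. The short sketch below uses two of the rows shown above and also shows what happens when the lists differ in length. The variable names are illustrative.

```python
import pandas as pd

sample_dict = {
    'Role': ['2D/3D Architect', 'Hotel Chief Engineer'],
    'Company': ['Joy House Interior', 'Peninsula Grand Hotel'],
    'Salary Range': ['Rs. 25000 - Rs. 30000', 'Rs. 40000 - Rs. 60000'],
}

# Column order follows the order of the dictionary keys
sample_df = pd.DataFrame(sample_dict)
print(sample_df.shape)
print(list(sample_df.columns))

# Lists of unequal length are rejected with a ValueError
try:
    pd.DataFrame({'Role': ['A', 'B'], 'Company': ['X']})
    length_error = None
except ValueError as exc:
    length_error = exc
print(type(length_error).__name__)
```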


# Final Outputs:

In [42]:
# Function to perform web scraping for a specific job location and page number
def fetch_webpage(job_location_name, page_number):
    
    # Construct the URL for the specified job location and page number
    url = f'https://www.workindia.in/jobs-in-{job_location_name}/?pg={page_number}'
    
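    # Send a GET request to the constructed URL; the timeout keeps a stalled connection from hanging the loop
    response = requests.get(url, timeout=10)

    # Raise an HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()

    # Return the HTML content of the webpage
    return response.text


# Function to parse job details from the webpage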
def parse_job_details(soup):
    
    # Initialize lists to store job details
    job_role = []
    salary_range = []
    location = []
    job_type = []
    experience = []
    english_level = []
    company_name = []
    qualification = []
    job_publication_date = []
    

    # Find all job items on the webpage
    jobs = soup.find_all('div', class_='JobItemV2')
    
    
    # Iterate through each job item
    for job in jobs:
        # Extract and clean each detail for the job
        job_role.append(job.find('h2').text.strip())
        salary_range.append(job.find('div', class_='f14').text.strip())
        location.append(job.find('div', class_='LocationDetail').text.strip())
        job_type.append(job.find('div', class_='JobTypeDetail').text.strip())
        experience.append(job.find('div', class_='ExperienceDetail').text.strip())
        english_level.append(job.find('div', class_='EnglishDetail').text.strip())
        company_name.append(job.find('div', class_='CompanyDetail').text.strip())
        qualification.append(job.find('div', class_='QualificationDetail').text.strip())
        job_publication_date.append(job.find('div', class_='JobPostedOnDetail').text.strip().replace('Posted on: ', '').strip())
    
    
    # Return a dictionary with lists of job details
    return {
        'Role': job_role,
        'Job Type': job_type,
        'Company': company_name,
        'Location': location,
        'Salary Range': salary_range,
        'Experience': experience,
        'Qualification': qualification,
        'English Proficiency': english_level,
        'Job Published On': job_publication_date,
    }


# Function to scrape jobs from multiple pages for a given location
def scrape_jobs_for_location(job_location_name, max_pages):
    
    # Collect one DataFrame per page, then concatenate once at the end
    page_frames = []

    # Loop through each page number from 1 to max_pages
    for page_number in range(1, max_pages + 1):
        # Fetch the webpage for the current page number
        webpage = fetch_webpage(job_location_name, page_number)
        # Parse the HTML content into a BeautifulSoup object
        soup = BeautifulSoup(webpage, 'lxml')
        # Extract job details from the BeautifulSoup object
        job_data = parse_job_details(soup)
        # Convert the job data dictionary into a DataFrame and keep it
        page_frames.append(pd.DataFrame(job_data))

    # Return a single DataFrame containing all job data (empty if no pages were requested)
    if not page_frames:
        return pd.DataFrame()
    return pd.concat(page_frames, ignore_index=True)


# Specify the job location and the number of pages to scrape
job_location_name = 'mumbai'
max_pages = 15

# Call the function to scrape jobs and store the result in final_df
final_df = scrape_jobs_for_location(job_location_name, max_pages)
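
Requesting many pages back to back is the kind of rapid, repeated access that anti-scraping measures look for. A minimal sketch of a paced loop follows, using only the standard library. `scrape_pages` and its `fetch_page` callback are hypothetical names, the one-second delay is an arbitrary choice, and stopping at the first empty page is an assumption about how the site behaves past its last page.

```python
import time


def scrape_pages(fetch_page, max_pages, delay_seconds=1.0, sleep=time.sleep):
    """Collect rows page by page, pausing between requests (hypothetical helper).

    fetch_page: callable taking a page number and returning a list of rows.
    sleep: injectable so the pacing can be checked without actually waiting.
    """
    all_rows = []
    for page_number in range(1, max_pages + 1):
        rows = fetch_page(page_number)
        if not rows:
            # Assumed signal that there are no more listings
            break
        all_rows.extend(rows)
        if page_number < max_pages:
            sleep(delay_seconds)
    return all_rows


# Stand-in for real fetching: pages 1 and 2 have data, later pages are empty
fake_pages = {1: ['job A', 'job B'], 2: ['job C']}
recorded_sleeps = []

rows = scrape_pages(
    lambda page: fake_pages.get(page, []),
    max_pages=5,
    delay_seconds=1.0,
    sleep=recorded_sleeps.append,  # record the delay instead of waiting
)
print(rows)
print(recorded_sleeps)
```

In real use, `fetch_page` could wrap the fetch and parse steps defined above and return one record per job.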

In [43]:
final_df

| | Role | Job Type | Company | Location | Salary Range | Experience | Qualification | English Proficiency | Job Published On |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2D/3D Architect | Architect/Interiors | Joy House Interior | Navi Mumbai, Mumbai | Rs. 25000 - Rs. 30000 | 2+ Years of Experience | 12th Pass | Speak Thoda English | 8/5/2024 |
| 1 | Hotel Chief Engineer | Engineer | Peninsula Grand Hotel | Sakinaka, Mumbai | Rs. 40000 - Rs. 60000 | 2+ Years of Experience | Graduate / Male Only | Speak Good English | 8/5/2024 |
| 2 | Work From Home - Relationship Manager | Business Development | Ion Business Solutions Pvt Ltd | Goregaon, Mumbai | Rs. 19000 - Rs. 40000 | Fresher | Graduate | Speak Thoda English | 8/5/2024 |
| 3 | Picker / Packer | Back Office | Blinkit (Zomato) | Kanjurmarg, Mumbai | Rs. 15000 - Rs. 20000 | Fresher | < 10th Pass | Speak Thoda English | 8/5/2024 |
| 4 | Office Executive - Back Office | Back Office | Freeze Craft (Daikin Airconditioning-Authorize... | Bandra East, Mumbai | Rs. 15000 - Rs. 25000 | Fresher | 12th Pass | Speak Thoda English | 8/5/2024 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | Associate Trainer | Teacher | The Trial Co | Mira Road, Mumbai | Rs. 25000 - Rs. 35000 | 1 - 2 Years of Experience | Graduate | Speak Good English | 8/5/2024 |
| 146 | Admin Executive | Back Office | Max Engineering | Kurla West, Mumbai | Rs. 15000 - Rs. 30000 | 2+ Years of Experience | 12th Pass / Male Only | Speak Fluent English | 8/5/2024 |
| 147 | Graphic Designer | Graphic Designer | Swapnali Labels (I) Pvt. Ltd | Vasai Road East, Mumbai | Rs. 22000 - Rs. 27000 | 2+ Years of Experience | 12th Pass | Speak Thoda English | 8/5/2024 |
| 148 | Housekeeping Boy | Office Boy/ Peon | Peopable Staffing Services Private Limited (Pss) | Sakinaka, Mumbai | Rs. 11800 - Rs. 13000 | Fresher | < 10th Pass / Male Only | No English Required | 8/5/2024 |
