## **Problem Statement: Navigating the Data Science Job Landscape**

🚀 Unleash your creativity in crafting a solution that taps into the heartbeat of the data science job market! Envision an ingenious project that seamlessly wields cutting-edge web scraping techniques and illuminating data analysis.

🔍 Your mission? To engineer a tool that effortlessly gathers job listings from a multitude of online sources, extracting pivotal nuggets such as job descriptions, qualifications, locations, and salaries.

🧩 However, the true puzzle lies in deciphering this trove of data. Can your solution discern patterns that spotlight the most coveted skills? Are there threads connecting job types to compensation packages? How might it predict shifts in industry demand?

🎯 The core objectives of this challenge are as follows:

1. Web Scraping Mastery: Forge an adaptable and potent web scraping mechanism. Your creation should adeptly harvest data science job postings from a diverse array of online platforms. Be ready to navigate evolving website structures and process hefty data loads.

2. Data Symphony: Skillfully distill vital insights from the harvested job listings. Extract and cleanse critical information like job titles, company names, descriptions, qualifications, salaries, locations, and deadlines. Think data refinement and organization.

3. Market Wizardry: Conjure up analytical tools that draw meaningful revelations from the gathered data. Dive into job demand trends, geographic distribution, salary variations tied to experience and location, favored qualifications, and emerging skill demands.

4. Visual Magic: Weave a tapestry of visualization magic. Design captivating charts, graphs, and visual representations that paint a crystal-clear picture of the analyzed data. Make these visuals the compass that guides users through job market intricacies.

🌐 While the web scraping universe is yours to explore, consider these platforms as potential stomping grounds:

* LinkedIn Jobs
* Indeed
* Naukri
* Glassdoor
* AngelList

🎈 Your solution should not only decode the data science job realm but also empower professionals, job seekers, and recruiters to harness the dynamic shifts of the industry. The path is open, the challenge beckons: are you ready to embark on this exciting journey?






## **GitHub Link: [Web_Scraping_Project](https://github.com/akshatbhuryan/Web_Scraping_Project)**

In [18]:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time

In [30]:
URL = "https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"
# Request the URL stated above.
page = requests.get(URL)
# Parse the response with the HTML parser so Python can navigate the page's
# individual elements instead of treating it as one long string.
soup = BeautifulSoup(page.text, "html.parser")
# Print the parsed tree with indentation, which makes it easier to read.
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-US">
 <!--<![endif]-->
 <head>
  <title>
   Attention Required! | Cloudflare
  </title>
  <meta charset="utf-8"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="noindex, nofollow" name="robots"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <link href="/cdn-cgi/styles/cf.errors.css" id="cf_styles-css" rel="stylesheet"/>
  <!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" /><![endif]-->
  <style>
   body{margin:0;padding:0}
  </style>
  <!--[if gte IE 10]><!-->
  <script>
   if (!navigator.cookieEnabled) {
    window.addEven
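
The output above is titled "Attention Required! | Cloudflare", so this request returned a challenge page rather than job listings. Below is a minimal sketch of how such a response might be detected before parsing. The function name `looks_like_block_page`, the status codes, and the title marker are assumptions chosen for illustration and are not part of the original notebook.

```python
import re

# Assumed marker, taken from the <title> of the output shown above.
BLOCK_TITLE_MARKERS = ("Attention Required",)
# Assumed status codes that commonly accompany blocked or throttled requests.
BLOCK_STATUS_CODES = (403, 429, 503)


def looks_like_block_page(status_code, html):
    """Return True if the response looks like a challenge page, not listings."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    match = re.search(r"<title>(.*?)</title>", html, flags=re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else ""
    return any(marker in title for marker in BLOCK_TITLE_MARKERS)


blocked_html = "<html><head><title>\n Attention Required! | Cloudflare\n</title></head></html>"
listing_html = "<html><head><title>Data Scientist Jobs</title></head></html>"

print(looks_like_block_page(200, blocked_html))  # True
print(looks_like_block_page(200, listing_html))  # False
```

With the `page` object from the cell above, the check would be called as `looks_like_block_page(page.status_code, page.text)` before handing the text to `BeautifulSoup`.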

## **Getting Job Title**
In the results page's HTML, the entirety of each job posting is under `<div>` tags with the attribute `class="row result"`.

Further, job titles are under `<a>` tags, in the `title` attribute. The value of a tag's attribute can be read with `tag["attribute"]`, so we can use it to find each posting's job title.

To summarize, the function below involves the following three steps:

1. Pull out all the `<div>` tags whose class includes "row".
2. Identify the `<a>` tags with the attribute `"data-tn-element": "jobTitle"`.
3. For each of these `<a>` tags, read the value of the `title` attribute.

In [None]:
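def extract_job_title_from_result(soup):
    """Return the title attribute of every job-title link on the page."""
    jobs = []
    # Each posting sits in a <div> whose class list includes "row".
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        # The title link is marked with data-tn-element="jobTitle".
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs

extract_job_title_from_result(soup)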
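
Because a live request may not return listings, the three steps can also be tried on a small hand-written fragment. The sketch below is illustrative only: `SAMPLE_HTML` is made up to mimic the structure described above, and `extract_job_titles` is a hypothetical name. It uses `.get("title")` instead of `a["title"]` so that a link without a `title` attribute is skipped rather than raising `KeyError`.

```python
from bs4 import BeautifulSoup

# Made-up fragment that only mimics the structure described in the text.
SAMPLE_HTML = """
<div class="row result">
  <a data-tn-element="jobTitle" title="Data Scientist" href="#">Data Scientist</a>
</div>
<div class="row result">
  <a data-tn-element="jobTitle" title="Machine Learning Engineer" href="#">ML Engineer</a>
</div>
<div class="footer"><a title="About" href="#">About</a></div>
"""


def extract_job_titles(soup):
    titles = []
    # class_="row" matches any <div> whose class list contains "row".
    for div in soup.find_all("div", class_="row"):
        for a in div.find_all("a", attrs={"data-tn-element": "jobTitle"}):
            title = a.get("title")  # None if the attribute is missing
            if title:
                titles.append(title)
    return titles


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(extract_job_titles(sample_soup))
# ['Data Scientist', 'Machine Learning Engineer']
```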



## **Getting Company Name**
Getting company names can be a bit tricky: most of them appear in `<span>` tags with `"class": "company"`, but some are instead housed in `<span>` tags with `"class": "result-link-source"`.

We will use an if/else statement to extract the company info from each of these places. To remove the white space around the company names in the output, we call `.strip()` on the extracted text.

In [None]:
def extract_company_from_result(soup):
    """Return the company names found in each posting, stripped of whitespace."""
    companies = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        company = div.find_all(name="span", attrs={"class": "company"})
        if company:
            for span in company:
                companies.append(span.text.strip())
        else:
            # Fall back to the alternate span used by some postings.
            sec_try = div.find_all(name="span", attrs={"class": "result-link-source"})
            for span in sec_try:
                companies.append(span.text.strip())
    return companies

extract_company_from_result(soup)
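
The same fallback can be checked on a small made-up fragment. The sketch below is a variation rather than the notebook's method: it appends exactly one value per posting (a placeholder when neither span is present) so the result stays aligned with other per-posting lists. `SAMPLE_HTML`, `extract_company_per_posting`, and the `placeholder` parameter are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment covering the two span classes named in the text,
# plus one posting with no company span at all.
SAMPLE_HTML = """
<div class="row result">
  <span class="company">
      Acme Analytics
  </span>
</div>
<div class="row result">
  <span class="result-link-source">Example Staffing </span>
</div>
<div class="row result"><a href="#">No company listed</a></div>
"""


def extract_company_per_posting(soup, placeholder=None):
    companies = []
    for div in soup.find_all("div", class_="row"):
        span = div.find("span", class_="company")
        if span is None:
            # Second try: the alternate span class.
            span = div.find("span", class_="result-link-source")
        if span is None:
            companies.append(placeholder)
        else:
            companies.append(span.get_text(strip=True))
    return companies


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(extract_company_per_posting(sample_soup))
# ['Acme Analytics', 'Example Staffing', None]
```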

## **Getting Location**
Locations sit under `<span>` tags. Span tags are sometimes nested within each other, so the location text may be directly within a span with `"class": "location"`, or nested in a span with `"itemprop": "addressLocality"`. However, a simple for loop can examine all span tags for text and retrieve the necessary information.

In [None]:
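def extract_location_from_result(soup):
    """Return the text of every <span class="location"> on the page."""
    locations = []
    # find_all is the current name for the older findAll alias.
    spans = soup.find_all("span", attrs={"class": "location"})
    for span in spans:
        # .text also includes the text of any nested spans.
        locations.append(span.text)
    return locations

extract_location_from_result(soup)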
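
To illustrate why the nesting does not matter, here is a small sketch on a made-up fragment in which one location is plain text and the other is wrapped in an inner `addressLocality` span. The exact nesting in `SAMPLE_HTML` is an assumption based on the description above, and `extract_locations` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Made-up fragment: one flat location span, one with a nested span.
SAMPLE_HTML = """
<div class="row result">
  <span class="location">New York, NY</span>
</div>
<div class="row result">
  <span class="location"><span itemprop="addressLocality">Brooklyn, NY</span></span>
</div>
"""


def extract_locations(soup):
    # get_text() gathers text from the span and everything nested inside it,
    # so both layouts yield the location string.
    return [span.get_text(strip=True) for span in soup.find_all("span", class_="location")]


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(extract_locations(sample_soup))
# ['New York, NY', 'Brooklyn, NY']
```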

## **Getting Salary**
Salary is the most challenging part to extract from job postings. Most postings don't publish salary information at all, and those that do can show it in more than one place. So the code has to look for a salary in multiple places and, if none is found, add a placeholder value of `"Nothing_found"` for that job.

Some salaries are under `<nobr>` tags, while others are inside a `<div>` tag with `"class": "sjcl"`, in a separate nested `<div>` tag that has no attributes. A try/except statement is helpful when extracting this information.

In [None]:
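def extract_salary_from_result(soup):
    """Return one salary string per posting, or "Nothing_found" if none is listed."""
    salaries = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        try:
            # First choice: salary inside a <nobr> tag.
            salaries.append(div.find("nobr").text)
        except AttributeError:
            try:
                # Second choice: first <div> nested inside <div class="sjcl">.
                div_two = div.find(name="div", attrs={"class": "sjcl"})
                div_three = div_two.find("div")
                salaries.append(div_three.text.strip())
            except AttributeError:
                # find() returned None, so no salary was located.
                salaries.append("Nothing_found")
    return salaries

extract_salary_from_result(soup)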
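
The lookup order can be exercised on a made-up fragment with one posting per case: a `<nobr>` salary, a salary nested under `sjcl`, and no salary at all. The sketch below swaps the nested try/except for explicit `None` checks, which is an alternative style and not the notebook's approach. `SAMPLE_HTML`, `extract_salaries`, and the `placeholder` parameter are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up fragment with one posting for each of the three cases.
SAMPLE_HTML = """
<div class="row result"><nobr>$120,000 - $150,000 a year</nobr></div>
<div class="row result"><div class="sjcl"><div> $60 an hour </div></div></div>
<div class="row result"><span class="company">No Salary Inc</span></div>
"""


def extract_salaries(soup, placeholder="Nothing_found"):
    salaries = []
    for div in soup.find_all("div", class_="row"):
        # First choice: a <nobr> tag inside the posting.
        nobr = div.find("nobr")
        if nobr is not None:
            salaries.append(nobr.get_text(strip=True))
            continue
        # Second choice: the first <div> nested inside <div class="sjcl">.
        sjcl = div.find("div", class_="sjcl")
        inner = sjcl.find("div") if sjcl is not None else None
        if inner is not None:
            salaries.append(inner.get_text(strip=True))
        else:
            salaries.append(placeholder)
    return salaries


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(extract_salaries(sample_soup))
# ['$120,000 - $150,000 a year', '$60 an hour', 'Nothing_found']
```

Either style yields one entry per posting, which keeps the salary list the same length as the number of `row` divs.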

## **Getting Job Summary**
The final task is to get the job summary. However, it is not possible to get the full job summary for each position, because it is not included in the HTML of a given Indeed results page. We can still get some information about each job from what's provided. To read the full postings, a browser-automation tool such as Selenium can be used.

But let's first try this using Python. Summaries are located under `<span>` tags. Span tags are nested within each other such that the location text is within `"class": "location"` tags or nested in `"itemprop": "addressLocality"`. However, a simple for loop can examine all span tags for text to retrieve the necessary information.