# Scraping GitHub Topics Repository

### Pick a website and describe your objective
- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

##### Outline
- Scrape website → https://datastackjobs.com
- We will scrape the following content
    - Company Name
    - Job Title
    - Location
    - Tags or Skills
    - Time Posted
    - Job Type
    - Category
    - Job Url
- We will then arrange the data in tabular form and export it to a CSV file
- The output will look like this:

Company Name,Job Title,Location,Tags or Skills,Time Posted,Job Type,Category,Job Url
Boulevard,Sr. Product Analyst,United States. Remote,Amplitude | Mixpanel | SQL,a month ago,Full-Time,Product,https://datastackjobs.com/jobs/yifxlruwe2-sr-product-analyst

Bitquery,Data Engineer/Data Ops,Worldwide . Remote,Airflow | Clickhouse | Apache Spark,2 months ago,Full-Time,Data Engineering,https://datastackjobs.com/jobs/ixumzawhqq-data-engineer-data-ops

![datastack jobs](https://beeimg.com/images/i82872082642.png)

In [2]:
# Import libraries (install any that are missing, e.g. with `pip install requests beautifulsoup4 pandas`)

import os
import pandas as pd
import requests # for sending HTTP requests
from bs4 import BeautifulSoup # for parsing HTML and XML documents

# Configuration
config = {}
config["base_url"] = r"https://datastackjobs.com"
config["status_code_info_url"] = r"https://developer.mozilla.org/en-US/docs/Web/HTTP/Status"
config["response_file_content"] = "Web_Scrapping_content.html"

# using requests library to fetch data from base_url
response = requests.get(config["base_url"])

# check for response and status code of response
print(f"Website --> {response.url} has returned a response having status code --> {response.status_code}.\
\nRead more about status code --> {config['status_code_info_url']}")

# save the response content to an HTML file
with open(config["response_file_content"], "w", encoding="utf-8") as file:
    file.write(response.text)

print(f"response content stored in {config['response_file_content']}")


Website --> https://datastackjobs.com/ has returned a response having status code --> 200.
Read more about status code --> https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
response content stored in Web_Scrapping_content.html
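
The cell above sends the request without a timeout and only prints the status code, so an error page would still be saved and parsed. A minimal sketch of a more defensive fetch follows. `fetch_html` is a hypothetical helper that is not part of this notebook; the `timeout` argument and `raise_for_status()` are standard `requests` features.

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` is a hypothetical helper, not part of the original notebook.
    `session` may be any object with a requests-style `get` method, which
    makes the function easy to exercise without network access.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of silently
    # continuing with an error page.
    response.raise_for_status()
    return response.text
```

With this helper, the download step could be written as `html = fetch_html(config["base_url"])`.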


In [3]:
# pass the response content to BeautifulSoup to parse the HTML
web_page = BeautifulSoup(response.text, 'html.parser')

In [26]:
# find the required data in the parsed HTML
# Job URL (<a> tag): each link points to the job page that holds all the related info
job_url_class = "chakra-linkbox__overlay css-nzdyzb"
job_url_class_tags = web_page.find_all('a',{'class':job_url_class})
job_url_class_tags[:5]

[<a class="chakra-linkbox__overlay css-nzdyzb" href="/jobs/yifxlruwe2-sr-product-analyst"><style data-emotion="css 11xthxf">.css-11xthxf{font-size:1rem;font-weight:600;}@media screen and (min-width: 30em){.css-11xthxf{font-size:1.125rem;}}</style><p class="chakra-text css-11xthxf">Sr. Product Analyst</p></a>,
 <a class="chakra-linkbox__overlay css-nzdyzb" href="/jobs/ixumzawhqq-data-engineer-data-ops"><p class="chakra-text css-11xthxf">Data Engineer / Data Ops</p></a>,
 <a class="chakra-linkbox__overlay css-nzdyzb" href="/jobs/o6okkm0d0f-data-engineer"><p class="chakra-text css-11xthxf">Data Engineer</p></a>,
 <a class="chakra-linkbox__overlay css-nzdyzb" href="/jobs/spxkdouo5t-senior-ml-engineer"><p class="chakra-text css-11xthxf">Senior ML Engineer</p></a>,
 <a class="chakra-linkbox__overlay css-nzdyzb" href="/jobs/okmf74i1ul-staff-data-engineer"><p class="chakra-text css-11xthxf">Staff Data Engineer</p></a>]
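
Class names such as `css-nzdyzb` look like generated style hashes, so they may change when the site is rebuilt (an assumption, not something verified here). As a hedged alternative, the sketch below selects job links by their `href` prefix and builds absolute URLs with `urljoin`. `extract_job_urls` and `SAMPLE_HTML` are illustrative names; the sample is a trimmed copy of the output above plus one made-up non-job link.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://datastackjobs.com"

# Two anchors trimmed from the output above, plus a hypothetical "/about"
# link that the selector should ignore. Used as offline sample input.
SAMPLE_HTML = """
<a class="chakra-linkbox__overlay css-nzdyzb" href="/jobs/yifxlruwe2-sr-product-analyst"><p>Sr. Product Analyst</p></a>
<a class="chakra-linkbox__overlay css-nzdyzb" href="/jobs/ixumzawhqq-data-engineer-data-ops"><p>Data Engineer / Data Ops</p></a>
<a href="/about">About</a>
"""


def extract_job_urls(html, base_url=BASE_URL):
    """Return absolute URLs for every link whose href starts with /jobs/."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: job detail pages all live under /jobs/, as in the output above.
    links = soup.select('a[href^="/jobs/"]')
    return [urljoin(base_url, link["href"]) for link in links]


job_urls = extract_job_urls(SAMPLE_HTML)
print(job_urls)
```

Selecting on the URL pattern keeps the scraper independent of styling classes, at the cost of relying on the `/jobs/` path convention.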

In [27]:
# Job Title - Done
# Company Name - Done
# Location - Done
# Tags or Skills - Done
# Time Posted
# Job Type - Done
# Category - Done
# Job Url - Done

In [28]:
# test with the first job URL
profile_url = config["base_url"] + job_url_class_tags[0]["href"]
# get the response from the first job URL
response_profile_url = requests.get(profile_url)
# parse the HTML returned for that URL
web_page_profile_url = BeautifulSoup(response_profile_url.text,'html.parser')

In [29]:
# Job Title
lst_Job_Title = []
job_title_class = "chakra-text css-1j21cv6"
for job_title in web_page_profile_url.find_all('p',{'class':job_title_class}):
    lst_Job_Title.append(job_title.text)

lst_Job_Title

['Sr. Product Analyst']

In [30]:
# Company Name
lst_Company_Name = []
company_name_class = "chakra-text css-0"
for company_name in web_page_profile_url.find_all('p',{'class':company_name_class}):
    lst_Company_Name.append(company_name.text)

lst_Company_Name

['Boulevard']

In [38]:
# Location, Job Type, Category
lst_Location_Type_Category = []
lst_Location = []
lst_Job_Type = []
lst_Category = []
location_type_category_class = "chakra-text css-10iahqc"
for location_type_category in web_page_profile_url.find_all('p',{'class':location_type_category_class}):
    lst_Location_Type_Category.append(location_type_category.text)

# lst_Location
lst_Location.append(lst_Location_Type_Category[0])
# lst_Job_Type
lst_Job_Type.append(lst_Location_Type_Category[1])
# lst_Category
lst_Category.append(lst_Location_Type_Category[2])



lst_Job_Type



['Full-Time']

In [59]:
# Skills Tags
lst_Skills_Tag = []
profile_skills_tags = []
skills_tag_class = "css-1fogp5u"
for skills_tag in web_page_profile_url.find_all('span',{'class':skills_tag_class}):    
    profile_skills_tags.append(skills_tag.text)
    
lst_Skills_Tag.append(profile_skills_tags)

lst_Skills_Tag

[['FiveTran', 'Healthcare', 'PostgreSQL', 'Python']]

In [63]:
lst_Company_Name = []
lst_Job_Title = []
lst_Location = []
lst_Job_Type = []
lst_Category = []
lst_Skills_Tag = []
lst_Job_Url = []


import pandas as pd

def parse_website_datastacks_jobs(web_page_profile_url):
    
    # Company Name
    company_name_class = "chakra-text css-0"
    for company_name in web_page_profile_url.find_all('p',{'class':company_name_class}):
        lst_Company_Name.append(company_name.text)
    
    # Job Title
    job_title_class = "chakra-text css-1j21cv6"
    for job_title in web_page_profile_url.find_all('p',{'class':job_title_class}):
        lst_Job_Title.append(job_title.text)

    # Location, Job Type, Category
    lst_Location_Type_Category = []
    location_type_category_class = "chakra-text css-10iahqc"
    for location_type_category in web_page_profile_url.find_all('p',{'class':location_type_category_class}):
        lst_Location_Type_Category.append(location_type_category.text)
    # lst_Location
    lst_Location.append(lst_Location_Type_Category[0])
    # lst_Job_Type
    lst_Job_Type.append(lst_Location_Type_Category[1])
    # lst_Category
    lst_Category.append(lst_Location_Type_Category[2])

    # Skills Tags
    skills_tag_class = "css-1fogp5u"
    profile_skills_tags = []
    for skills_tag in web_page_profile_url.find_all('span',{'class':skills_tag_class}):
        profile_skills_tags.append(skills_tag.text)
        
    lst_Skills_Tag.append(profile_skills_tags)


# Job URL (<a> tag): each link points to the job page that holds all the related info
job_url_class = "chakra-linkbox__overlay css-nzdyzb"
job_url_class_tags = web_page.find_all('a',{'class':job_url_class})

# Parsing job urls
for tags in job_url_class_tags:
    profile_url = config["base_url"] + tags["href"]
    print(f"Scrapping data for profile url: {profile_url}")
    lst_Job_Url.append(profile_url)
    # get response from profile url
    response_profile_url = requests.get(profile_url)
    # parse the text from profile url
    web_page_profile_url = BeautifulSoup(response_profile_url.text,'html.parser')
    parse_website_datastacks_jobs(web_page_profile_url)

parsed_data_dict = {"Company Name":lst_Company_Name,
                "Job Title":lst_Job_Title,
                "Location":lst_Location,
                "Skill Tags":lst_Skills_Tag,
                "Job Type":lst_Job_Type,
                "Category":lst_Category,
                "Job URL":lst_Job_Url}

parsed_data = pd.DataFrame(parsed_data_dict)


Scrapping data for profile url: https://datastackjobs.com/jobs/yifxlruwe2-sr-product-analyst
Scrapping data for profile url: https://datastackjobs.com/jobs/ixumzawhqq-data-engineer-data-ops
Scrapping data for profile url: https://datastackjobs.com/jobs/o6okkm0d0f-data-engineer
Scrapping data for profile url: https://datastackjobs.com/jobs/spxkdouo5t-senior-ml-engineer
Scrapping data for profile url: https://datastackjobs.com/jobs/okmf74i1ul-staff-data-engineer
Scrapping data for profile url: https://datastackjobs.com/jobs/p1azjgcujr-marketing-manager
Scrapping data for profile url: https://datastackjobs.com/jobs/ojswkhzzy6-data-science-advocate
Scrapping data for profile url: https://datastackjobs.com/jobs/w45xlizshh-senior-data-scientist-product
Scrapping data for profile url: https://datastackjobs.com/jobs/q4cjm1n6fm-senior-data-engineer
Scrapping data for profile url: https://datastackjobs.com/jobs/vc3ynqb6l9-mlops-back-end-engineer
Scrapping data for profile url: https://datastackj
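
The cell above appends to seven module-level lists. If a page lacks a field, or contains it more than once, the lists end up with different lengths and `pd.DataFrame` raises a `ValueError`. One possible alternative, sketched below, builds a single dictionary per job page so every row always has the same keys. `parse_job_page`, `text_or_none`, and `SAMPLE_JOB_HTML` are hypothetical names; the class names are the ones used in this notebook, and the HTML fragment is made up to mimic them.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the class names used in this notebook;
# the real job pages contain much more markup.
SAMPLE_JOB_HTML = """
<p class="chakra-text css-1j21cv6">Sr. Product Analyst</p>
<p class="chakra-text css-0">Boulevard</p>
<p class="chakra-text css-10iahqc">United States</p>
<p class="chakra-text css-10iahqc">Full-Time</p>
<span class="css-1fogp5u">SQL</span>
<span class="css-1fogp5u">Python</span>
"""


def text_or_none(soup, name, css_class):
    """Return the stripped text of the first matching tag, or None."""
    tag = soup.find(name, {"class": css_class})
    return tag.get_text(strip=True) if tag else None


def parse_job_page(html, job_url):
    """Build one record (dict) for a single job page."""
    soup = BeautifulSoup(html, "html.parser")
    meta = [p.get_text(strip=True)
            for p in soup.find_all("p", {"class": "chakra-text css-10iahqc"})]
    # Pad so a page with fewer than three entries yields None, not IndexError.
    location, job_type, category = (meta + [None] * 3)[:3]
    skills = [s.get_text(strip=True)
              for s in soup.find_all("span", {"class": "css-1fogp5u"})]
    return {
        "Company Name": text_or_none(soup, "p", "chakra-text css-0"),
        "Job Title": text_or_none(soup, "p", "chakra-text css-1j21cv6"),
        "Location": location,
        "Skill Tags": " | ".join(skills),
        "Job Type": job_type,
        "Category": category,
        "Job URL": job_url,
    }


record = parse_job_page(
    SAMPLE_JOB_HTML,
    "https://datastackjobs.com/jobs/yifxlruwe2-sr-product-analyst",
)
print(record)
```

Collecting these records in a list and passing that list to `pd.DataFrame` would give one row per job, with missing fields showing up as empty values instead of breaking the table.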

In [38]:
len(lst_Skills_Tag)
# lst_Job_Title = []
# lst_Location = []
# lst_Job_Type = []
# lst_Category = []
# lst_Skills_Tag = []
# lst_Job_Url = []


196

In [66]:
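# note: .head() limits the export to the first five rows; use parsed_data.to_csv(...) to export every row
parsed_data.head().to_csv(r'C:\Test\webscrapper.csv', index=False, encoding='utf-8')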

### Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
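
The download steps listed above could be automated roughly as follows. This is a sketch under stated assumptions: `download_pages` and `slug_from_url` are hypothetical helpers, the file-naming scheme is an illustrative choice, and the function that actually retrieves a page is passed in as `fetch` so the example stays independent of any particular HTTP library.

```python
import time
from pathlib import Path
from urllib.parse import urlparse


def slug_from_url(url):
    """Derive a file name such as 'o6okkm0d0f-data-engineer.html' from a URL."""
    last_segment = urlparse(url).path.rstrip("/").split("/")[-1]
    return (last_segment or "index") + ".html"


def download_pages(urls, out_dir, fetch, delay_seconds=1.0):
    """Fetch each URL with `fetch(url) -> str` and save it under `out_dir`.

    `fetch` is injected so the caller decides how pages are retrieved,
    for example `lambda u: requests.get(u, timeout=10).text`.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for position, url in enumerate(urls):
        if position > 0:
            # Pause between requests to avoid hammering the server.
            time.sleep(delay_seconds)
        target = out_path / slug_from_url(url)
        target.write_text(fetch(url), encoding="utf-8")
        saved.append(target)
    return saved
```

Saving pages locally means the parsing code can be rerun and debugged without requesting the same pages from the site again.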

### Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.

### Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
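
The verification step in the list above can be sketched as a write-then-read round trip. `save_and_verify` is a hypothetical helper, the two rows are taken from the sample output in the outline, and joining the skill tags with `" | "` is an assumption made to match that sample format.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows taken from the sample output in the outline of this notebook.
rows = [
    {"Company Name": "Boulevard", "Job Title": "Sr. Product Analyst",
     "Skill Tags": ["Amplitude", "Mixpanel", "SQL"]},
    {"Company Name": "Bitquery", "Job Title": "Data Engineer/Data Ops",
     "Skill Tags": ["Airflow", "Clickhouse", "Apache Spark"]},
]


def save_and_verify(records, csv_path):
    """Write records to CSV, read the file back, and return the reloaded frame."""
    frame = pd.DataFrame(records)
    # Lists would be written as their Python repr; join them into plain text first.
    frame["Skill Tags"] = frame["Skill Tags"].apply(" | ".join)
    frame.to_csv(csv_path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(csv_path, encoding="utf-8")
    # Compare row by row so the check does not depend on column dtypes.
    if reloaded.to_dict(orient="records") != frame.to_dict(orient="records"):
        raise ValueError(f"{csv_path} does not match the data that was written")
    return reloaded


with tempfile.TemporaryDirectory() as tmp_dir:
    checked = save_and_verify(rows, Path(tmp_dir) / "jobs.csv")

print(checked.shape)
```

The temporary directory is only used to keep the example self-contained; in the notebook the path would be the real output file.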

### Document and share your work

- Add proper headings and documentation in your Jupyter notebook.
- Publish your Jupyter notebook to your Jovian profile.
- (Optional) Write a blog post about your project and share it online.

**Credit and Source** --> [**Jovian - Building a Python Web Scraping Project From Scratch**](https://jovian.ai/aakashns/python-web-scraping-project-guide)