# Scraping GitHub Topics Repository

### Pick a website and describe your objective
- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

##### Outline
- Scrape website → https://datastackjobs.com
- We will scrape the following content
    - Company Name
    - Job Title
    - Location
    - Tags or Skills
    - Time Posted
    - Job Type
    - Category
    - Job Url
- We will then arrange the data in tabular form and export it to a CSV file
- The output will look like this:

Company Name,Job Title,Location,Tags or Skills,Time Posted,Job Type,Category,Job Url
Boulevard,Sr. Product Analyst,United States. Remote,Amplitude | Mixpanel | SQL,a month ago,Full-Time,Product,https://datastackjobs.com/jobs/yifxlruwe2-sr-product-analyst

Bitquery,Data Engineer/Data Ops,Worldwide . Remote,Airflow | Clickhouse | Apache Spark,2 months ago,Full-Time,Data Engineering,https://datastackjobs.com/jobs/ixumzawhqq-data-engineer-data-ops

![datastack jobs](https://beeimg.com/images/i82872082642.png)

In [2]:
# Import libraries (install any that are missing, e.g. with pip)

import os
import pandas
import requests # to work with HTTP and API requests
from bs4 import BeautifulSoup # for parsing HTML and XML documents

# Configuration settings
config = {}
config["base_url"] = r"https://datastackjobs.com"
config["status_code_info_url"] = r"https://developer.mozilla.org/en-US/docs/Web/HTTP/Status"
config["response_file_content"] = "Web_Scrapping_content.html"

# using requests library to fetch data from base_url
response = requests.get(config["base_url"])

# check the response and its status code
print(f"Website --> {response.url} has returned a response having status code --> {response.status_code}.\n"
      f"Read more about status code --> {config['status_code_info_url']}")

# saving the response content in an HTML file
with open(config["response_file_content"], "w", encoding="utf-8") as file:
    file.write(response.text)

print(f"response content stored in {config['response_file_content']}")


Website --> https://datastackjobs.com/ has returned a response having status code --> 200.
Read more about status code --> https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
response content stored in Web_Scrapping_content.html


In [3]:
# passing the response content to BeautifulSoup for parsing the HTML
web_page = BeautifulSoup(response.text, 'html.parser')

In [4]:
# finding required data from the parsed html content
# Job Title
job_title_class = "chakra-text css-11xthxf" # found in a <p> tag nested inside an <a> tag
job_title_tags = web_page.find_all('p',{'class': job_title_class})
job_title_tags[:5]

[<p class="chakra-text css-11xthxf">Sr. Product Analyst</p>,
 <p class="chakra-text css-11xthxf">Data Engineer / Data Ops</p>,
 <p class="chakra-text css-11xthxf">Data Engineer</p>,
 <p class="chakra-text css-11xthxf">Senior ML Engineer</p>,
 <p class="chakra-text css-11xthxf">Staff Data Engineer</p>]

In [5]:
# Company Name
company_name_class = "chakra-text css-0"
company_class_tags = web_page.find_all('p',{'class':company_name_class})
company_class_tags[:5]

# Location
# Tags or Skills
# Time Posted
# Job Type
# Category
# Job Url

[]
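
One possible reason for an empty result like the one above is the way Beautiful Soup compares multi-word class strings: a string such as `"chakra-text css-0"` is matched against the whole `class` attribute, so a tag with extra classes or a different class order is skipped. The sketch below illustrates this on a made-up HTML fragment (the company names and markup are hypothetical, not taken from the real page) and shows a CSS selector as a more tolerant alternative. Whether this is the actual cause here would need to be checked against the saved HTML file.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment for illustration; it is not taken from the real page.
sample_html = """
<div>
  <p class="chakra-text css-0">Acme Corp</p>
  <p class="chakra-text css-0 extra">Globex</p>
  <p class="css-0 chakra-text">Initech</p>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# A multi-word class string is compared against the whole class attribute,
# so only the first <p> matches here.
exact = soup.find_all("p", {"class": "chakra-text css-0"})

# A CSS selector matches any <p> that carries both classes, in any order.
flexible = soup.select("p.chakra-text.css-0")

print([tag.get_text() for tag in exact])
print([tag.get_text() for tag in flexible])
```

If the selector version also returns nothing on the real page, the company name is probably stored under a different tag or class, which can be confirmed by inspecting the page in the browser.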

### Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
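
The download steps listed above could be wrapped in a function roughly like the following. This is a minimal sketch, not the guide's own implementation: the names `page_filename` and `download_page`, the `pages` folder, and the file-naming scheme are all assumptions made for illustration.

```python
import os
import requests


def page_filename(url):
    """Build a local file name from a URL (hypothetical naming scheme)."""
    # Drop the scheme, keep letters and digits, replace everything else with "_".
    name = "".join(ch if ch.isalnum() else "_" for ch in url.split("://")[-1])
    return name.strip("_") + ".html"


def download_page(url, folder="pages", timeout=10):
    """Download url, save its HTML under folder, and return the file path."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, page_filename(url))
    with open(path, "w", encoding="utf-8") as file:
        file.write(response.text)
    return path
```

Calling `download_page("https://datastackjobs.com")` would then save the page as `pages/datastackjobs_com.html`. The same function can be called in a loop over any list of URLs.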

### Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
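
As a rough illustration of the extraction step, the sketch below collects job titles and URLs into a list of dictionaries. It reuses the job title class string from this notebook and assumes, as noted there, that each title `<p>` sits inside the job's `<a>` link. The function name `extract_jobs` and the sample markup are hypothetical, and the real page may be structured differently.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://datastackjobs.com"
JOB_TITLE_CLASS = "chakra-text css-11xthxf"  # class string used earlier in this notebook


def extract_jobs(html, base_url=BASE_URL):
    """Return a list of {"Job Title", "Job Url"} dictionaries found in html."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for title_tag in soup.find_all("p", {"class": JOB_TITLE_CLASS}):
        # Assumption: the title <p> is nested inside the <a> that links to the job.
        link_tag = title_tag.find_parent("a")
        href = link_tag.get("href") if link_tag else None
        jobs.append({
            "Job Title": title_tag.get_text(strip=True),
            "Job Url": urljoin(base_url, href) if href else None,
        })
    return jobs


# Hypothetical markup for illustration only.
sample_html = """
<a href="/jobs/example-1"><p class="chakra-text css-11xthxf">Data Engineer</p></a>
<a href="/jobs/example-2"><p class="chakra-text css-11xthxf"> Staff Data Engineer </p></a>
"""
jobs = extract_jobs(sample_html)
print(jobs)
```

The other fields in the outline (company name, location, tags, and so on) could be added to each dictionary in the same way once their tags and classes have been identified.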

### Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
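
The save-and-verify steps might look something like this. It is a sketch under assumptions: the helper name `save_jobs_csv` and the temporary output path are made up for illustration, while the column names and the sample row come from the outline at the top of this notebook.

```python
import os
import tempfile
import pandas as pd

COLUMNS = ["Company Name", "Job Title", "Location", "Tags or Skills",
           "Time Posted", "Job Type", "Category", "Job Url"]


def save_jobs_csv(jobs, path):
    """Write a list of job dictionaries to a CSV file and return the path."""
    df = pd.DataFrame(jobs, columns=COLUMNS)  # fixes the column order
    df.to_csv(path, index=False)              # no extra index column
    return path


# Sample row taken from the expected output in the outline.
rows = [{
    "Company Name": "Boulevard",
    "Job Title": "Sr. Product Analyst",
    "Location": "United States. Remote",
    "Tags or Skills": "Amplitude | Mixpanel | SQL",
    "Time Posted": "a month ago",
    "Job Type": "Full-Time",
    "Category": "Product",
    "Job Url": "https://datastackjobs.com/jobs/yifxlruwe2-sr-product-analyst",
}]

# Write to a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    csv_path = save_jobs_csv(rows, os.path.join(folder, "jobs.csv"))
    loaded = pd.read_csv(csv_path)  # read the file back to verify it

print(loaded.shape)
print(list(loaded.columns) == COLUMNS)
```

Reading the file back and comparing the columns and row count against what was written is a quick check that nothing was lost or shifted during export.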

### Document and share your work

- Add proper headings and documentation in your Jupyter notebook.
- Publish your Jupyter notebook to your Jovian profile.
- (Optional) Write a blog post about your project and share it online.

**Credit and Source** --> [**Jovian - Building a Python Web Scraping Project From Scratch**](https://jovian.ai/aakashns/python-web-scraping-project-guide)