
## Step 1: Selecting a Category and Scraping Company Names and Links

The first phase of the scraping project is to choose a category on the NSW supplier portal. This walkthrough uses the "Agriculture" category as its example.

Once the category is chosen, run a script that performs two tasks:

1. **Scrape Company Names:**
   - The script steps through every results page of the "Agriculture" category on the NSW supplier portal.
   - On each page it extracts the names of the companies listed.

2. **Extract Links to Each Company's NSW Portal Page:**
   - In the same pass, the script captures the URL of each company's page on the NSW portal.
   - These links lead to more detailed information about each company and are used in the later steps of the scraping process.



In [None]:
!apt-get update
!apt install -y chromium-chromedriver
!pip install selenium
!pip install pandas openpyxl

In [None]:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time

# Setup Chrome options
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Initialize WebDriver
driver = webdriver.Chrome(options=chrome_options)

def get_supplier_data(page):
    url = f"https********ADD THE URL HERE *********&page={page}"  # Paste your category's search URL; keep the trailing &page={page}
    driver.get(url)
    time.sleep(2)  # Wait for the page to load
    supplier_data = []

    suppliers = driver.find_elements(By.CSS_SELECTOR, "h3 > a")
    for supplier in suppliers:
        supplier_name = supplier.text
        supplier_url = supplier.get_attribute('href')
        supplier_data.append((supplier_name, supplier_url))
    return supplier_data

all_supplier_data = []

for page in range(1, 48):  # Change 48 to (last page number + 1) for your category
    all_supplier_data.extend(get_supplier_data(page))
    time.sleep(1)  # Wait before loading the next page

driver.quit()

# Create a DataFrame and save to Excel
df = pd.DataFrame(all_supplier_data, columns=['Supplier Name', 'Supplier URL'])
df.to_excel('supplier_data_agriculture.xlsx', index=False)  # Name the file supplier_data_<category>.xlsx

In [None]:
# Download the sheet (the filename must match the one passed to df.to_excel above)
from google.colab import files
files.download('supplier_data_agriculture.xlsx')

This script is designed to scrape data from a website using Selenium, a tool for automating web browsers. Here's a step-by-step explanation:

1. **Importing Libraries:**
   - `pandas`: A library for data manipulation and analysis. Used here to create a DataFrame and export it to Excel.
   - `selenium`: Drives the Chrome browser that loads each results page.
   - `time`: A standard Python library for time-related tasks. Used here to pause between page loads.

2. **Setting Up Chrome Options:**
   - `Options()`: Creates an object to set Chrome options.
   - `--headless`: Runs Chrome without a GUI (Graphical User Interface). This is needed on servers and in hosted notebooks such as Colab, which have no display.
   - `--no-sandbox`: Disables Chrome's sandbox. Chrome refuses to start as the root user with the sandbox enabled, so this flag is often required in Colab and in Linux containers.
   - `--disable-dev-shm-usage`: Makes Chrome write its shared-memory files to `/tmp` instead of `/dev/shm`, which avoids crashes in containers where `/dev/shm` is small.

3. **Initializing WebDriver:**
   - `webdriver.Chrome(options=chrome_options)`: Initializes a new instance of Chrome, applying the options you've set.

4. **Defining the `get_supplier_data` Function:**
   - The function takes a `page` number as input.
   - It builds a URL from that page number and navigates to it with `driver.get(url)`.
   - It waits 2 seconds (`time.sleep(2)`) to give the page time to load. A fixed delay does not guarantee that loading has finished, so a slow response can yield an empty or partial result.
   - It finds all elements matching the CSS selector `"h3 > a"`, which are presumably the links to supplier details.
   - It extracts the text and the `href` attribute (URL) from each element, stores them as tuples in `supplier_data`, and returns that list.

5. **Scraping Multiple Pages:**
   - An empty list `all_supplier_data` is created to store data from all pages.
   - A for loop iterates over a range of page numbers (1 to 47).
   - For each page, it calls `get_supplier_data` and extends `all_supplier_data` with the results.
   - After each call, it waits for 1 second before proceeding to the next page (`time.sleep(1)`).

6. **Closing the WebDriver:**
   - `driver.quit()`: Closes the browser window and ends the WebDriver session.

7. **Saving Data to Excel:**
   - A DataFrame is created from `all_supplier_data` with columns 'Supplier Name' and 'Supplier URL'.
   - This DataFrame is then saved to an Excel file named 'supplier_data_agriculture.xlsx'.
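
The fixed page range in step 5 has to be updated by hand for every category. As a minimal sketch of an alternative, the loop could stop at the first empty page. This assumes the portal returns a page with no `h3 > a` matches once the page number passes the last page, which has not been verified here. `scrape_all_pages`, `fetch_page`, `max_pages`, and `fake_fetch` are hypothetical names; in the notebook, `fetch_page` would be `get_supplier_data`.

```python
import time


def scrape_all_pages(fetch_page, max_pages=500, delay=1.0):
    """Collect rows page by page until a page comes back empty.

    fetch_page: callable taking a page number and returning a list of
                (name, url) tuples, e.g. get_supplier_data.
    max_pages:  safety cap so the loop always terminates.
    delay:      seconds to pause between requests.
    """
    rows = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_page(page)
        if not page_rows:  # assumption: an empty page means we are past the end
            break
        rows.extend(page_rows)
        time.sleep(delay)
    return rows


# Stand-in for get_supplier_data so the sketch runs without a browser.
def fake_fetch(page):
    data = {
        1: [("A", "https://example.com/a")],
        2: [("B", "https://example.com/b")],
    }
    return data.get(page, [])


demo_rows = scrape_all_pages(fake_fetch, delay=0)
print(demo_rows)
```

With the real function, the call would be `all_supplier_data = scrape_all_pages(get_supplier_data)`. The `max_pages` cap matters because a site that repeats its last page instead of returning an empty one would otherwise keep the loop running.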

### Important Notes:

- The URL in the `get_supplier_data` function is a placeholder. Replace it with the search URL of the category you want to scrape, and keep the trailing `&page={page}` so the loop can supply the page number. For the Agriculture category, the URL is "https://buy.nsw.gov.au/supplier/search?query=&services=agriculture&area=&regions=&schemes=&schemeLevel=&capabilities=&identifiers=&sizes=&page=". Do not hard-code a page number after `&page=`.
- The page range in the for loop (`range(1, 48)`) should match the actual number of pages you intend to scrape.
- The script uses fixed delays (`time.sleep()`) while pages load and between requests. Check the website's scraping policy and keep the request rate low to avoid overloading the server or getting banned.
- The script assumes a specific page structure: each supplier name is an `a` link that is a direct child of an `h3` heading. If the website's structure differs or changes, modify the CSS selector accordingly.
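
To make the first note concrete, here is a minimal sketch that sets the `page` parameter with the standard library's `urllib.parse` instead of string concatenation, so it does not matter whether the pasted URL already ends in `&page=`. `build_page_url` is a hypothetical helper, not part of the original script.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_url(base_url, page):
    """Return base_url with its `page` query parameter set to `page`."""
    parts = urlsplit(base_url)
    # keep_blank_values=True preserves empty filters such as `area=`;
    # any existing `page` entry is dropped so it is never duplicated.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "page"
    ]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


base = (
    "https://buy.nsw.gov.au/supplier/search?query=&services=agriculture"
    "&area=&regions=&schemes=&schemeLevel=&capabilities=&identifiers="
    "&sizes=&page="
)
print(build_page_url(base, 3))
```

Inside `get_supplier_data`, `driver.get(build_page_url(base, page))` could then replace the f-string.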