
In [None]:
import requests
from bs4 import BeautifulSoup
import time
import hashlib

def fetch_list_items(url, target_selector):
    """Fetches the text content of list items from a webpage."""
    try:
        response = requests.get(url, timeout=10)  # Fail fast instead of hanging the polling loop
        response.raise_for_status()  # Raise an exception for bad status codes
        soup = BeautifulSoup(response.content, 'html.parser')
        list_items = soup.select(target_selector)
        return [item.text.strip() for item in list_items]
    except requests.exceptions.RequestException as e:
        print(f"Error fetching URL: {e}")
        return None
    except Exception as e:
        print(f"Error parsing HTML: {e}")
        return None

def generate_list_hash(item_list):
    """Generates a hash of the list items to detect changes."""
    sorted_list = sorted(item_list)  # Sort to ensure consistent hashing regardless of order
    # Join with a separator so ["ab", "c"] and ["a", "bc"] hash differently
    combined_string = "\n".join(sorted_list).encode('utf-8')
    return hashlib.sha256(combined_string).hexdigest()

def track_list_changes(url, target_selector, check_interval_seconds=60):
    """Tracks changes in a list on a webpage and notifies of additions."""
    previous_list_hash = None
    previous_items = None  # Set after the first successful fetch
    print(f"Tracking list at {url} using selector '{target_selector}'. Checking every {check_interval_seconds} seconds. Press Ctrl+C to stop.")

    while True:
        current_items = fetch_list_items(url, target_selector)

        if current_items is not None:
            current_hash = generate_list_hash(current_items)

            if previous_list_hash is not None and current_hash != previous_list_hash:
                # Compare the current list with the previous one to find additions
                previous_set = set(previous_items)
                current_set = set(current_items)
                new_items = list(current_set - previous_set)

                if new_items:
                    print("\n--- New item(s) added! ---")
                    for item in new_items:
                        print(f"- {item}")
                    print("-------------------------\n")

            previous_list_hash = current_hash
            previous_items = current_items

        time.sleep(check_interval_seconds)

if __name__ == "__main__":
    target_url = input("Enter the URL of the webpage to track: ")
    list_selector = input("Enter the CSS selector for the list items (e.g., 'ul li', 'ol li.item'): ")
    interval = int(input("Enter the check interval in seconds (e.g., 30, 60): "))

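    try:
        track_list_changes(target_url, list_selector, interval)
    except KeyboardInterrupt:
        # Exit cleanly on Ctrl+C instead of printing a traceback
        print("\nStopped tracking.")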

**How to Use This Script:**

1.  **Install Libraries:**
    ```bash
    pip install requests beautifulsoup4
    ```

2.  **Run the Script:**
    ```bash
    python your_script_name.py
    ```

3.  **Enter Information:** The script will prompt you for:
    * **URL of the webpage:** The full URL of the page containing the list you want to monitor.
    * **CSS selector for the list items:** This is crucial for BeautifulSoup to correctly identify the elements within the list. You'll need to inspect the webpage's HTML source to determine the appropriate selector. Examples:
        * If the list is an unordered list (`<ul>`) and each item is an `<li>`, the selector would be `ul li`.
        * If it's an ordered list (`<ol>`) with `<li>` items having a specific class like `item`, the selector would be `ol li.item`.
        * If the list items are `<div>` elements inside a container `div` with ID `my-list`, the selector might be `#my-list div` (or `#my-list > div` to match only direct children).
    * **Check interval in seconds:** How often you want the script to check the webpage for changes.
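
Before pointing the script at a live page, it can help to try a selector against a small HTML snippet. The sketch below is illustrative only: the markup, the `SAMPLE_HTML` name, and the `texts` helper are made up for this example and are not part of the script.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real page.
SAMPLE_HTML = """
<ul>
  <li>Apples</li>
  <li>Pears</li>
</ul>
<ol>
  <li class="item">First</li>
  <li>Unclassed</li>
</ol>
<div id="my-list">
  <div>Alpha</div>
  <div>Beta</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

def texts(selector):
    """Return the stripped text of every element matching the selector."""
    return [element.text.strip() for element in soup.select(selector)]

ul_items = texts("ul li")          # <li> elements inside any <ul>
ol_items = texts("ol li.item")     # only <li class="item"> inside an <ol>
div_items = texts("#my-list div")  # <div> elements inside #my-list

print(ul_items)
print(ol_items)
print(div_items)
```

If a selector returns an empty list here or on the real page, the script will have nothing to track, so it is worth confirming the match count first.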

**Explanation of the Code:**

1.  **`fetch_list_items(url, target_selector)`:**
    * Takes the webpage URL and a CSS selector as input.
    * Uses `requests` to fetch the content of the webpage.
    * Uses `BeautifulSoup` to parse the HTML content.
    * Uses `soup.select(target_selector)` to find all HTML elements that match the provided CSS selector (these should be your list items).
    * Extracts the text content (`.text`) of each list item, removes leading/trailing whitespace (`.strip()`), and returns a list of these text items.
    * Catches network errors and parsing errors, prints a message, and returns `None` so the caller can skip that check.

2.  **`generate_list_hash(item_list)`:**
    * Takes a list of strings (the list items) as input.
    * Sorts the list so that the order of items doesn't affect the hash. Adding, removing, or editing an item changes the hash; reordering alone does not.
    * Joins the sorted items into a single string and encodes it to UTF-8.
    * Calculates the SHA-256 hash of the combined string using the `hashlib` library. This hash acts as a compact fingerprint of the list's content.

3.  **`track_list_changes(url, target_selector, check_interval_seconds=60)`:**
    * This is the main function that continuously monitors the webpage.
    * Initializes `previous_list_hash` to `None`.
    * Enters an infinite `while True` loop to periodically check the webpage.
    * Inside the loop:
        * Calls `fetch_list_items()` to get the current list items.
        * If the fetching is successful:
            * Calculates the `current_hash` of the current list.
            * If `previous_list_hash` is not `None` (meaning it's not the first check) and the `current_hash` is different from the `previous_list_hash`:
                * It means the list has changed.
                * It compares the `previous_items` with the `current_items` using sets to efficiently find the new items (items present in the current list but not in the previous list).
                * If `new_items` are found, it prints a notification and lists the newly added items. If the change was only a removal, `new_items` is empty and nothing is printed.
            * Updates `previous_list_hash` and `previous_items` for the next iteration.
        * Pauses the execution for the specified `check_interval_seconds` using `time.sleep()`.
    * The loop can be stopped by pressing `Ctrl+C` in the terminal.
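
The change-detection steps above can be tried in isolation, without any network access. This is a minimal sketch under stated assumptions: `fingerprint`, `added_items`, the `sep` parameter, and the sample lists are hypothetical names chosen for illustration, and `added_items` uses a list comprehension rather than the script's set subtraction so that new items keep their page order.

```python
import hashlib

def fingerprint(items, sep="\n"):
    """Return the SHA-256 hex digest of the sorted items joined with `sep`."""
    joined = sep.join(sorted(items))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def added_items(previous, current):
    """Return items in `current` that were not in `previous`, in page order."""
    seen = set(previous)
    return [item for item in current if item not in seen]

before = ["Milk", "Eggs"]
reordered = ["Eggs", "Milk"]
after = ["Milk", "Eggs", "Bread"]

# Reordering alone leaves the fingerprint unchanged.
same_after_reorder = fingerprint(before) == fingerprint(reordered)

# Adding an item changes the fingerprint, and the difference names it.
changed = fingerprint(before) != fingerprint(after)
new_items = added_items(before, after)

# A removal also changes the fingerprint but yields no additions.
removed_only = added_items(before, ["Milk"])

# With an empty separator, two different lists can produce the same hash.
collides = fingerprint(["ab", "c"], sep="") == fingerprint(["a", "bc"], sep="")
distinct = fingerprint(["ab", "c"]) != fingerprint(["a", "bc"])

print(same_after_reorder, changed, new_items, removed_only, collides, distinct)
```

The last two checks show why a separator between items is worth having: without one, the boundaries between items are lost before hashing.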

**Important Considerations:**

* **Website Structure:** This script relies on the HTML structure of the webpage remaining consistent. If the website changes its layout or the way the list is presented, you'll need to update the CSS selector accordingly.
* **Dynamic Content:** If the list on the webpage is loaded dynamically using JavaScript, this script will not see it, because `requests` fetches only the raw HTML and does not run JavaScript. In such cases, you might need a browser-automation tool such as Selenium or Playwright (or Puppeteer, if you are working in Node.js) to render the page and then extract the data.
* **Rate Limiting:** Respect the website's terms of service and avoid making too many requests in a short period. Choose a `check_interval_seconds` large enough that polling does not burden the server.
* **Error Handling:** The script includes basic error handling for network and parsing issues, but you might want to add more robust error logging or handling as needed.
* **Notifications:** Instead of just printing to the console, you could extend the script to send email or other types of notifications when a change is detected.
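
As one possible take on the notification idea, the sketch below builds an email with the standard library. Everything specific here is an assumption for illustration: the function names `build_notification` and `send_notification`, the example addresses and URL, and an SMTP server reachable on `localhost` port 25 without authentication. Only the message-building part is exercised; `send_notification` is defined but not called.

```python
import smtplib
from email.message import EmailMessage

def build_notification(new_items, page_url, sender, recipient):
    """Build a plain-text email listing the newly added items."""
    msg = EmailMessage()
    msg["Subject"] = f"{len(new_items)} new item(s) on tracked list"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [f"New items found at {page_url}:", ""]
    lines += [f"- {item}" for item in new_items]
    msg.set_content("\n".join(lines))
    return msg

def send_notification(msg, host="localhost", port=25):
    """Hand the message to an SMTP server (assumed reachable, no auth)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)

# Placeholder values; replace with real addresses and the tracked URL.
message = build_notification(
    ["Bread", "Butter"],
    "https://example.com/list",
    "tracker@example.com",
    "me@example.com",
)
print(message["Subject"])
print(message.get_content())
```

In the tracking loop, the place where the new items are printed would be the natural point to build and send such a message. Most hosted mail providers require TLS and a login, which this sketch leaves out.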
