We are looking for a skilled web scraper to extract vehicle listings for a specific dealership from the CarGurus website. The ideal candidate will have experience in web scraping and data extraction, ensuring accuracy and efficiency. The project involves retrieving details such as make, model, price, and mileage. The scraped data will be used for analysis and reporting purposes.

Given this dealership URL:
https://www.cargurus.com/Cars/m-Carrio-MotorCars-sp385771

Symbol legend:
- ⬜️ Task added
- ✅ Task Completed
- 🟩 Task Progress
- 🟥 Task Issue
- 🟦 Task Plan

**Task list:**
1. ⬜️ Mock dealerships API endpoint
2. ⬜️ Navigate to the dealership's home page
3. ⬜️ Extract dealership details
4. ⬜️ Navigate through ALL of a dealership's vehicles and extract links
5. ⬜️ Extract a vehicle's details
6. ⬜️ Trigger extraction jobs for all dealerships
7. ⬜️ Design API endpoint to post vehicle data
8. ⬜️ Database writes (preventing duplication)
9. ⬜️ Additional retry attempts (2)
10. ⬜️ Scraper error handling via Slack post
11. ⬜️ Scraper cron job implementation
12. ⬜️ Deploy scraper to a DigitalOcean server


#### Package installation command:
`!pip install requests beautifulsoup4 lxml pandas slack_sdk selenium schedule webdriver-manager`

In [20]:
import requests, urllib.parse, lxml, pandas as pd, time, json, os, schedule, hashlib, hmac
from bs4 import BeautifulSoup
from datetime import datetime
from math import ceil
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError
from concurrent.futures import ThreadPoolExecutor, as_completed

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [2]:
def scroll_incrementally(driver, pause_time=2, max_scroll_attempts=3):
    """Scroll incrementally and wait for content to load."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for attempt in range(max_scroll_attempts):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_time)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # Exit if no new content is loaded
            break
        last_height = new_height

############################ For Linux
# def initialize_driver():
#     """Initializes and returns a cross-platform Selenium WebDriver."""
#     service = Service(ChromeDriverManager().install())
#     options = webdriver.ChromeOptions()
#
#     # Add headless mode only if running on a server
#     options.add_argument("--headless")  # Runs without UI
#     options.add_argument("--no-sandbox")  # Required for Linux servers
#     options.add_argument("--disable-dev-shm-usage")  # Prevent memory issues
#     options.add_argument("--disable-gpu")  # Fixes rendering issues in headless mode
#     options.add_argument("--remote-debugging-port=9222")  # Debugging
#
#     return webdriver.Chrome(service=service, options=options)

##################### For Windows
def initialize_driver():
    """Initializes and returns a Selenium WebDriver."""
    service = Service(ChromeDriverManager().install())
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # Run without UI
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-application-cache")  # Disable caching
    options.add_argument("--incognito")  # Use incognito mode
    driver = webdriver.Chrome(service=service, options=options)
    driver.set_page_load_timeout(500)  # Set timeout for page loading
    return driver

# Create a new driver instance for each scraping attempt
driver = initialize_driver()
scroll_incrementally(driver)

### TASK 10. Scraper error handling with Slack notifications

In [3]:
# up next:
# put up on accessible endpoint as a stand-alone micro service?
# Read Slack credentials from the environment; never hard-code secrets in the notebook.
SLACK_BOT_TOKEN = os.getenv("SLACK_BOT_TOKEN")
SLACK_SIGNING_SECRET = os.getenv("SLACK_SIGNING_SECRET")

class SlackClient:

    def __init__(self):
        """Initialize Slack client with bot token and signing secret."""
        print("Initializing Slack client with token from environment...")
        self.token = SLACK_BOT_TOKEN
        if not self.token:
            raise ValueError("SLACK_BOT_TOKEN not found in environment variables")
        self.client = WebClient(token=self.token)
        self.signing_secret = SLACK_SIGNING_SECRET

    def send_message(self, message: str, channel_id: str) -> bool:
        """
        Send a message to a Slack channel.

        Args:
            message (str): The message text to send
            channel_id (str): The ID of the channel to send the message to

        Returns:
            bool: True if message was sent successfully, False otherwise
        """
        try:
            response = self.client.chat_postMessage(
                channel=channel_id,
                text=message
            )
            return True
        except SlackApiError as e:
            error_details = {
                'error': str(e.response['error']),
                'response': e.response.data,
                'status_code': e.response.status_code,
                'headers': dict(e.response.headers)
            }
            print("Detailed Slack Error:")
            print(f"Error Type: {error_details['error']}")
            print(f"Status Code: {error_details['status_code']}")
            print(f"Response Data: {error_details['response']}")
            print(f"Headers: {error_details['headers']}")
            return False
        except Exception as e:
            print(f"Unexpected error: {str(e)}")
            return False

    # Ensures that incoming requests to your microservice actually originate from Slack and haven’t been tampered with.
    # This is critical for security when handling webhooks or API calls from Slack.
    def verify_slack_request(self, timestamp: str, signature: str, body: str) -> bool:
        """
        Verify that the request actually came from Slack.

        Args:
            timestamp (str): X-Slack-Request-Timestamp header
            signature (str): X-Slack-Signature header
            body (str): Raw request body

        Returns:
            bool: True if the request is valid, False otherwise
        """
        if not self.signing_secret:
            print("Warning: SLACK_SIGNING_SECRET not set")
            return False

        # Check if the timestamp is too old
        if abs(time.time() - int(timestamp)) > 60 * 5:
            return False

        # Create the signature base string
        sig_basestring = f"v0:{timestamp}:{body}"

        # Calculate the signature
        my_signature = 'v0=' + hmac.new(
            self.signing_secret.encode(),
            sig_basestring.encode(),
            hashlib.sha256
        ).hexdigest()

        # Compare signatures
        return hmac.compare_digest(my_signature, signature)

# Initialize the Slack client
slack_obj = SlackClient()

Initializing Slack client with token from environment...
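
The signature check in `verify_slack_request` can be exercised without a live Slack workspace. The following is a minimal standalone sketch of the same v0 signing scheme, not the notebook's class itself: `sign_slack_body`, `is_valid_slack_request`, and all `demo_*` values are hypothetical names and placeholder data introduced only for illustration.

```python
import hashlib
import hmac
import time


def sign_slack_body(signing_secret: str, timestamp: str, body: str) -> str:
    """Build the 'v0=' signature for a raw request body."""
    base = f"v0:{timestamp}:{body}"
    digest = hmac.new(signing_secret.encode(), base.encode(), hashlib.sha256).hexdigest()
    return f"v0={digest}"


def is_valid_slack_request(signing_secret, timestamp, signature, body, max_age=300):
    """Reject stale timestamps, then compare signatures in constant time."""
    if abs(time.time() - int(timestamp)) > max_age:
        return False
    expected = sign_slack_body(signing_secret, timestamp, body)
    return hmac.compare_digest(expected, signature)


# Placeholder values for illustration only; these are not real credentials.
demo_secret = "demo-signing-secret"
demo_ts = str(int(time.time()))
demo_body = "token=abc&text=hello"
demo_sig = sign_slack_body(demo_secret, demo_ts, demo_body)

print(is_valid_slack_request(demo_secret, demo_ts, demo_sig, demo_body))        # untouched body
print(is_valid_slack_request(demo_secret, demo_ts, demo_sig, demo_body + "x"))  # tampered body
```

The first call accepts the request and the second rejects it, because any change to the body changes the HMAC digest.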


### TASK 1. Mock dealerships API endpoint

In [4]:
# dealerships_api_url = "https://api/v1/scraper/dealership-marketplaces"
# headers = {"accept": "application/json"}
# response = requests.get(url, headers=headers)
# print(response.json())

dealerships_api_url = [
    # {
    #   "id": "123e4567-e89b-12d3-a456-426614174000",
    #   "address_id": "223e4567-e89b-12d3-a456-426614174001",
    #   "inventory_source_id": "323e4567-e89b-12d3-a456-426614174002",
    #   "name": "Best Dealership",
    #   "phone_number": "555-123-4567",
    #   "email": "contact@bestdealership.com",
    #   "general_manager": "John Doe",
    #   "website": "http://www.bestdealership.com",
    #   "created_at": "2021-01-01T12:00:00Z",
    #   "updated_at": "2021-01-02T12:00:00Z",
    #   "inventory_source": {
    #     "id": "323e4567-e89b-12d3-a456-426614174002",
    #     "url": "https://www.cargurus.com/Cars/m-Twins-Auto-Sales--Taylor-sp457133",
    #     "category": "car_gurus",
    #     "created_at": "2021-01-01T12:00:00Z",
    #     "updated_at": "2021-01-02T12:00:00Z"
    #   }
    # },
    {
      "id": "423e4567-e89b-12d3-a456-426614174003",
      "address_id": "523e4567-e89b-12d3-a456-426614174004",
      "inventory_source_id": "623e4567-e89b-12d3-a456-426614174005",
      "name": "Quality Cars",
      "phone_number": "555-987-6543",
      "email": "info@qualitycars.com",
      "general_manager": "Jane Smith",
      "website": "http://www.qualitycars.com",
      "created_at": "2021-02-01T12:00:00Z",
      "updated_at": "2021-02-02T12:00:00Z",
      "inventory_source": {
        "id": "623e4567-e89b-12d3-a456-426614174005",
        "url": "https://www.cargurus.com/Cars/m-Rogers-Auto-Sales-Inc-sp340683", #  https://www.cargurus.com/Cars/m-Carrio-MotorCars-sp385771
        "category": "car_gurus",
        "created_at": "2021-02-01T12:00:00Z",
        "updated_at": "2021-02-02T12:00:00Z"
      }
    }
  ]

### TASK 2. Navigate to the dealership home page
### TASK 3. Extract dealership details
**Extract a dealership’s details:**
- address
- email
- hours of operation
- phone number
- description

In [5]:
def extract_dealership_data(url, dealership_name):
    driver.get(url)
    soup_data = BeautifulSoup(driver.page_source, "lxml") #response.content
    # print(soup_data.prettify())
    try:
        def text_of(selector):
            """Return the stripped text of the first match, or None if absent."""
            node = soup_data.select_one(selector)
            return node.get_text(strip=True) if node else None

        # A missing container now yields None for that field instead of
        # raising AttributeError and discarding the whole dealership.
        title = text_of("div.dealerDetailsHeader h1.dealerName")
        # The address is the loose text directly inside the info container.
        info = soup_data.find("div", class_="dealerDetailsInfo")
        address_parts = info.find_all(string=True, recursive=False) if info else []
        address = " ".join(address_parts).strip() or None
        link = text_of("p.dealerWebLinks a")
        phone = text_of("span.dealerSalesPhone")
        hours_operation = text_of("div.dealerText")
        logo_img = soup_data.select_one("div.dealerLogo img")
        logo = logo_img.get("src") if logo_img else None

        data = {
            "title": title,
            "link": link,
            "address": address,
            "phone": phone,
            "hours_operation": hours_operation,
            "logo": logo,
        }
        # slack_obj.send_message(
        #     message=f"Success✅: extracting Dealership details NAME: {dealership_name}, URL: {url}, TIMESTAMP: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
        #     channel_id="C08FQ51LEAF"
        # )
        return data
    except Exception as e:
        slack_obj.send_message(
            message=f"Error❌: extracting Dealership details NAME: {dealership_name}, URL: {url}, TIMESTAMP: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}, ERROR: {str(e)}",
            channel_id="C08FQ51LEAF"
        )
        print(f"Error extracting Dealership details for URL {url}: {str(e)}")
        return None  # The caller skips dealerships with no extracted data


dealership_details = [] # Extract dealership details
dealership_list = [] # Extract dealership list

for dealership in dealerships_api_url:
    inventory_source = dealership["inventory_source"]
    dealership_id = dealership["id"]
    dealership_name = dealership["name"]
    inventory_source_id = dealership["inventory_source_id"]
    url = inventory_source["url"]
    category = inventory_source["category"]
    dealership_list.append({
        "dealership_id": dealership_id,
        "inventory_source_id": inventory_source_id,
        "dealership_name": dealership_name,
        "url": url,
        "category": category
    })

    # Extract dealership data
    extracted_data = extract_dealership_data(url, dealership_name)
    if not extracted_data:
        continue  # Skip this entry and move to the next one

    dealership_details.append({
        "dealership_name": dealership_name,
        "data": extracted_data
    })

# Final Output: Print the collection of dealership data
# for detail in dealership_details:
#     print(f"Dealership: {detail['category']}, Data: {detail['data']}")
print(len(dealership_list))

1


### 4. Navigate through ALL of a dealership's vehicles and extract links
Given a dealership URL, visit every results page for the dealership and collect the link to each vehicle listed on it.

In [6]:
def extract_vehicle_links(soup, base_url):
    """Extracts vehicle links from the BeautifulSoup object"""
    vehicle_links = []
    for a_tag in soup.find_all('a', {'data-testid': 'car-blade-link'}):
        if a_tag and a_tag.get('href'):
            href = a_tag.get('href')
            vehicle_links.append(urllib.parse.urljoin(base_url, href))
    return vehicle_links


def get_total_pages(soup):
    """Extracts total number of pages from 'Page X of Y' text."""
    try:
        span_text = soup.find("span", string=lambda text: text and "Page" in text)
        if span_text:
            parts = span_text.text.split()
            if len(parts) >= 4 and parts[-1].isdigit():
                return int(parts[-1])  # Extract last number (total pages)
    except Exception as e:
        print(f"Error extracting total pages:{e}")
    return 1  # Default to 1 if not found


def get_all_pages(dealership_list):
    """Scrapes vehicle links from multiple pages by modifying the URL."""
    increment = 1
    for dealership_item in dealership_list:
        # Skip specific iteration
        # if increment == 1:
        #     increment += 1
        #     continue
        # Stop the loop when increment > 2
        # if increment > 1:
        #     break
        try:
            base_url = dealership_item["url"]
            driver.get(base_url)
            soup_data = BeautifulSoup(driver.page_source, "lxml")

            # Get Maximum page number
            max_pages = get_total_pages(soup_data)
            print(f"Detected {max_pages} total pages.")
            if not max_pages:
                continue # Skip this entry and move to the next one

            all_vehicle_links = []
            for page_num in range(1, max_pages + 1):
                page_url = f"{base_url}#resultsPage={page_num}"

                try:
                    driver.get(page_url)
                    WebDriverWait(driver, 10).until(
                        EC.presence_of_element_located((By.TAG_NAME, "body"))
                    )  # Wait until the body tag is loaded

                    soup = BeautifulSoup(driver.page_source, "lxml")
                    vehicle_links = extract_vehicle_links(soup, base_url)

                    if not vehicle_links:
                        print("No vehicle links found on this page. Stopping...")
                        # slack_obj.send_message(
                        #     message=f"Error: No vehicle links found on this page. Stopping...",
                        #     channel_id="C08FQ51LEAF"
                        # )
                        break

                    all_vehicle_links.extend(vehicle_links)
                    print(f"Found {len(vehicle_links)} links on Page {page_num}")

                except Exception as e:
                    slack_obj.send_message(
                        message=f"Error❌: "
                                f"extracting Vehicle Navigation URL, "
                                f"NAME: {dealership_item['dealership_name']}, "
                                f"URL: {page_url}, "
                                f"TIMESTAMP: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}, "
                                f"ERROR: {e}",
                        channel_id="C08FQ51LEAF"
                    )
                    print(f"error pagination extraction data: {e}")

            dealership_item["vehicle_url"] = all_vehicle_links
            # slack_obj.send_message(
            #     message=f"Success✅: "
            #             f"Dealership vehicles links: {len(dealership_item["vehicle_url"])}, "
            #             f"NAME: {dealership_item['dealership_name']},"
            #             f" URL: {url}, "
            #             f"TIMESTAMP: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            #             }",
            #     channel_id="C08FQ51LEAF"
            # )
            # increment += 1
        except Exception as e:
            slack_obj.send_message(
                message=f"Error❌: "
                        f"extracting Dealership navigation, "
                        f"NAME: {dealership_item['dealership_name']}, "
                        f"URL: {dealership_item['url']}, "
                        f"TIMESTAMP: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}, "
                        f"ERROR: {e}",
                channel_id="C08FQ51LEAF"
            )
            print(f"error dealership pagination data: {e}")


    return dealership_list

# # testing navigation list
# dealership_list_demo = [
#     {'dealership_id': '123e4567-e89b-12d3-a456-426614174000', 'inventory_source_id': '323e4567-e89b-12d3-a456-426614174002', 'dealership_name': 'Best Dealership', 'url': 'https://www.cargurus.com/Cars/m-Twins-Auto-Sales--Taylor-sp457133', 'category': 'car_gurus'},
#     # {'dealership_id': '423e4567-e89b-12d3-a456-426614174003', 'inventory_source_id': '623e4567-e89b-12d3-a456-426614174005', 'dealership_name': 'Quality Cars', 'url': 'https://www.cargurus.com/Cars/m-Rogers-Auto-Sales-Inc-sp340683', 'category': 'car_gurus'}
# ]
vehicle_urls = get_all_pages(dealership_list)
# print(f"Total dealership found: {len(vehicle_urls)}")

Detected 1 total pages.
Found 13 links on Page 1
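
The pagination logic depends on two small steps: reading the total from a "Page X of Y" label and building one URL per results page. The sketch below isolates those steps using only the standard library. `parse_total_pages` and `build_page_urls` are hypothetical helper names, and the `#resultsPage=N` fragment is taken from the notebook's own code rather than from any documented CarGurus API.

```python
import re


def parse_total_pages(label: str) -> int:
    """Return Y from a 'Page X of Y' label, defaulting to 1 when absent."""
    match = re.search(r"Page\s+\d+\s+of\s+(\d+)", label or "")
    return int(match.group(1)) if match else 1


def build_page_urls(base_url: str, total_pages: int) -> list:
    """One URL per results page, using the '#resultsPage=N' fragment."""
    return [f"{base_url}#resultsPage={n}" for n in range(1, total_pages + 1)]


print(parse_total_pages("Page 1 of 3"))
print(parse_total_pages("No pager on this page"))
for page_url in build_page_urls("https://www.cargurus.com/Cars/m-Carrio-MotorCars-sp385771", 2):
    print(page_url)
```

A regular expression is stricter than splitting on whitespace: it only accepts a label that actually has the "Page X of Y" shape, so unrelated text containing the word "Page" falls back to a single page.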


### 5. Extract a Vehicle's details
Given a vehicle URL, extract all the data required by our DB schema.


In [7]:
def extract_vehicle_data(soup_data, vehicle_url, dealership_item):
    """ Extract vehicle data from details page"""
    # title = soup_data.find("div", class_="_titleInfo_uw1k0_49").find("h4").get_text(strip=True) if \
    #     (soup_data.find("div", class_="_titleInfo_uw1k0_49").find("h4")) else None

    # price
    price_section = soup_data.find("div", class_="_dealInfo_uw1k0_70")
    price = None  # Default to None if price is not found
    if price_section and price_section.find("h5", class_="WoAzt"):
        price = price_section.find("h5", class_="WoAzt").get_text(strip=True).replace("$", "").replace(",", "")


    # Features data
    features_data = soup_data.find_all("li", class_="_listItem_1tanl_14") or []
    # Dictionary to store extracted data
    feature_details = {}
    for data in features_data:
        key = None
        value = None
        if data.find("h5"):  # Ensure the <h5> tag exists
            key = data.find("h5").get_text(strip=True).replace(" ", "_").lower()
        if data.find("p"):  # Ensure the <p> tag exists
            value = data.find("p").get_text(strip=True)
        if key and value:
            feature_details[key] = value


    # # # extract dl data to dt
    records_container = soup_data.find("div", class_="_records_1vyus_9", attrs={"data-cg-ft": "listing-vdp-stats"})
    dt_tags = records_container.find("ul").find_all("li") if (
                records_container and records_container.find("ul")) else []

    overview_details = {}
    for dt_tag in dt_tags:
        key = None
        value = None
        if dt_tag.find("span", class_="_label_zbkq7_7"):  # Check if the key exists
            key = (
                dt_tag.find("span", class_="_label_zbkq7_7")
                .get_text(strip=True)
                .replace(":", "")
                .replace(" ", "_")
                .lower()
            )
        if dt_tag.find("span", class_="_value_zbkq7_14"):  # Check if the value exists
            value = dt_tag.find("span", class_="_value_zbkq7_14").get_text(strip=True)
        if key and value:
            overview_details[key] = value


    # Parse and build the data dictionary
    try:
        data = {
            "inventory_source_id": dealership_item["inventory_source_id"],
            "listing_url": vehicle_url,
            "status": "available",  # Check availability here if needed
            "price": float(price) if price else None,  # Gracefully handle None price
            "vehicle": {
                "dealership_id": dealership_item["dealership_id"],
                "vin": overview_details.get("vin"),
                "mileage": int(feature_details.get("mileage").replace(",", "")) if feature_details.get(
                    "mileage") else None,
                "stock_number": overview_details.get("stock_number"),
                "description": "",
                "exterior_color": overview_details.get("exterior_color"),
                "interior_color": overview_details.get("interior_color"),
                "model": {
                    "name": overview_details.get("model"),
                    "year": overview_details.get("year"),
                    "trim": overview_details.get("trim"),
                    "body_style": overview_details.get("body_type"),
                    "transmission": feature_details.get("transmission"),
                    "fuel_type": feature_details.get("fuel_type"),
                    "drivetrain": feature_details.get("drivetrain"),
                    "engine": feature_details.get("engine"),
                    "make": {
                        "name": overview_details.get("make"),
                    },
                },
            },
        }
    except Exception as e:
        print(f"Data construction error for URL {vehicle_url}: {str(e)}")  # Log the error
        data = None  # If any critical data is missing, return None or handle appropriately

    return data
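
The label and number clean-up inside `extract_vehicle_data` can be tested separately from the page parsing. The following is a small sketch under the assumption that prices look like "$9,995" and mileage values may carry a unit suffix such as "mi". `normalize_label` and `parse_number` are hypothetical helpers, not functions from the notebook.

```python
import re


def normalize_label(label: str) -> str:
    """Turn a display label such as 'Exterior color:' into a dictionary key."""
    return label.replace(":", "").strip().replace(" ", "_").lower()


def parse_number(text, cast=float):
    """Pull the first number out of text like '$9,995' or '115,771 mi'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text or "")
    if not match:
        return None
    return cast(match.group(0).replace(",", ""))


print(normalize_label("Exterior color:"))
print(parse_number("$9,995"))
print(parse_number("115,771 mi", cast=int))
print(parse_number(None))
```

Extracting the number with a pattern, instead of removing known characters one by one, keeps the conversion from failing when a value carries extra text around the digits.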

In [8]:
from random import randint

vehicle_output_data = []
max_threads = 20  # Adjust this based on your system's capability
increment = 1


# Function to process individual vehicle URLs (used in parallel threads)
def process_vehicle_url(vehicle_url, dealership_item):
    driver = None
    try:
        # print(f"Start time{increment}: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        driver = initialize_driver()  # Create a new WebDriver instance

        # Introduce a random delay before loading the page
        time.sleep(randint(2, 5))

        # Load the page
        driver.get(vehicle_url)
        WebDriverWait(driver, 60).until(
            EC.any_of(
                EC.presence_of_element_located((By.CLASS_NAME, "_dealInfo_uw1k0_70")),
                EC.presence_of_element_located((By.CLASS_NAME, "_listItem_1tanl_14")),
                EC.presence_of_element_located((By.CLASS_NAME, "_records_1vyus_9")),
            )
        )
        soup_data = BeautifulSoup(driver.page_source, "lxml")  # Parse the page source

        # Extract vehicle data
        vehicle_data = extract_vehicle_data(soup_data, vehicle_url, dealership_item)

        if vehicle_data:
            # print(f"end time{increment}: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
            return vehicle_data

    except Exception as e:
        print(f"Error for vehicle URL {vehicle_url}: {str(e)}")

    finally:
        if driver:
            driver.quit()  # Close the WebDriver instance
    return None


# Use ThreadPoolExecutor for parallel scraping
for dealership_item in dealership_list:
    with ThreadPoolExecutor(max_threads) as executor:
        # Submit scraping tasks for each vehicle URL
        futures = {
            executor.submit(process_vehicle_url, vehicle_url, dealership_item): vehicle_url
            for vehicle_url in dealership_item["vehicle_url"]
        }

        # Collect results as they complete
        for future in as_completed(futures):
            # if increment > 2:
            #     break
            vehicle_url = futures[future]
            try:
                result = future.result()  # Get the result of the scraping function
                if result:
                    vehicle_output_data.append(result)

            except Exception as e:
                print(f"Error processing URL {vehicle_url}: {str(e)}")
            # increment += 1


print(len(vehicle_output_data))
# print(json.dumps(vehicle_output_data, indent=1))
# store in a .csv file
df = pd.DataFrame(vehicle_output_data)
df.to_csv("cargurus_cars_data.csv", index=False)

Error for vehicle URL https://www.cargurus.com/Cars/m-Rogers-Auto-Sales-Inc-sp340683#listing=405529067/NONE/DEFAULT: HTTPConnectionPool(host='localhost', port=63561): Read timed out. (read timeout=120)
12


### 7. Design API endpoint to post and get vehicle data

In [9]:
def post_vehicle_data(post_url, data_batch):
    # print(f"Running scheduled task at {datetime.now()}")
    headers = {"Content-Type": "application/json"}
    try:
        response = requests.post(post_url, data=json.dumps(data_batch), headers=headers)
        response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
        print(f"Successfully posted {len(data_batch)} records. Status: {response.status_code}")
        # slack_obj.send_message(
        #     message=f"Success✅: "
        #             f"Successfully posted Vehicle data {len(data_batch)} records."
        #             f" URL: {post_url}, "
        #             f"TIMESTAMP: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        #             }",
        #     channel_id="C08FQ51LEAF"
        # )
        return True
    except requests.exceptions.RequestException as err:
        print(f"Error sending vehicle data: {err}")
        slack_obj.send_message(
            message=f"Error❌: "
                    f"Failed to post Vehicle data ({len(data_batch)} records). "
                    f"URL: {post_url}, "
                    f"TIMESTAMP: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}, "
                    f"ERROR: {err}",
            channel_id="C08FQ51LEAF"
        )
        return False

def get_vehicle_data(get_url):
    # print(f"Running scheduled task at {datetime.now()}")
    headers = {"Content-Type": "application/json"}
    try:
        response = requests.get(get_url, headers=headers)
        response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
        print(f"Successfully get records. Status: {response.status_code}")
        slack_obj.send_message(
            message=f"Success✅: "
                    f"Successfully fetched Vehicle data {len(response.json())} records."
                    f" URL: {get_url}, "
                    f"TIMESTAMP: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
            channel_id="C08FQ51LEAF"
        )
        return response.json()
    except requests.exceptions.RequestException as err:
        # The original handler printed an undefined name here, which raised NameError
        print(f"Error fetching data: {err}")
        slack_obj.send_message(
            message=f"Error❌: "
                    f"Failed to get Vehicle data, "
                    f"URL: {get_url}, "
                    f"TIMESTAMP: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}, "
                    f"ERROR: {err}",
            channel_id="C08FQ51LEAF"
        )
        return []

### 8. Database writes (preventing duplication)

**Problem statement**

Once we’ve extracted info for vehicles, we need to store it in our database. If an entry already exists for a given vehicle, overwrite it with the latest data.

We need to keep the data up to date so that it reflects what is truly available. Vehicles that are no longer part of the inventory must be marked as `unavailable`.

**Success criteria**

- Have a way to identify each dealership’s vehicle (this could be the listing URL)
- Save all new vehicles into the database
- Overwrite existing vehicles’ data so they show the latest values
- Mark all other vehicles as `unavailable`
- Notify us of any failed scrape attempts via `paper boy`
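
The first four criteria can be sketched as a single reconciliation pass. This is an illustrative in-memory version only: a plain dictionary keyed by `listing_url` stands in for the database table, `reconcile_inventory` is a hypothetical name, and the example URLs and `src-1` identifier are made-up sample data.

```python
def reconcile_inventory(stored: dict, scraped: list, inventory_source_id: str) -> dict:
    """Upsert scraped vehicles by listing_url and flag the rest as unavailable.

    `stored` maps listing_url -> record and stands in for the database table.
    Only records from the same inventory source are ever marked unavailable.
    """
    seen = set()
    for record in scraped:
        url = record["listing_url"]
        seen.add(url)
        # Insert new vehicles and overwrite existing ones with the latest data.
        stored[url] = {**record, "status": "available"}

    for url, record in stored.items():
        same_source = record.get("inventory_source_id") == inventory_source_id
        if same_source and url not in seen:
            record["status"] = "unavailable"
    return stored


db = {
    "https://example.com/a": {"listing_url": "https://example.com/a",
                              "inventory_source_id": "src-1", "price": 9000.0, "status": "available"},
    "https://example.com/b": {"listing_url": "https://example.com/b",
                              "inventory_source_id": "src-1", "price": 5000.0, "status": "available"},
}
latest_scrape = [
    {"listing_url": "https://example.com/a", "inventory_source_id": "src-1", "price": 8500.0},
    {"listing_url": "https://example.com/c", "inventory_source_id": "src-1", "price": 12000.0},
]

reconcile_inventory(db, latest_scrape, "src-1")
for url, record in db.items():
    print(url, record["status"], record["price"])
```

Scoping the `unavailable` update to one inventory source matters: a scrape of one dealership should not hide vehicles that belong to another. In a real database the same idea is usually expressed as an upsert on a unique key followed by one scoped update.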

In [10]:
def is_duplicate_vehicle_data(existing_records, new_entry):
    """Return True, and notify Slack, if new_entry's listing_url is already stored."""
    listing_url = new_entry.get("listing_url")  # The listing URL identifies a vehicle
    if not any(record.get("listing_url") == listing_url for record in existing_records):
        return False

    reason = f"Duplicate found:  URL: {listing_url}"
    print(f"Error duplicate data: {reason}")
    slack_obj.send_message(
        message=f"Error❌: "
                f"Error Duplicate Vehicle data, "
                f"URL: {listing_url}, "
                f"TIMESTAMP: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}, "
                f"ERROR: {reason}",
        channel_id="C08FQ51LEAF"
    )
    return True

### 6. Trigger extraction jobs for all dealerships
**Problem statement**

We need to query our database for all available dealerships, determine which marketplace each one posts its vehicles on, and trigger the matching extraction job.
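
One possible shape for this step is a registry that maps each marketplace category to its extraction function. The sketch below assumes dealership records shaped like the mocked API response in Task 1. `scrape_car_gurus`, `SCRAPERS`, `trigger_extraction_jobs`, and the `other_market` category are hypothetical names used only for illustration.

```python
def scrape_car_gurus(dealership: dict) -> str:
    """Placeholder for the CarGurus pipeline (collect links, then vehicle details)."""
    return f"car_gurus job triggered for {dealership['name']}"


# Hypothetical registry: marketplace category -> extraction function.
SCRAPERS = {"car_gurus": scrape_car_gurus}


def trigger_extraction_jobs(dealerships: list) -> list:
    """Run the matching scraper for each dealership and report what happened."""
    results = []
    for dealership in dealerships:
        category = dealership["inventory_source"]["category"]
        scraper = SCRAPERS.get(category)
        if scraper is None:
            # Unknown marketplace: record it instead of failing the whole run.
            results.append(f"skipped {dealership['name']}: no scraper for '{category}'")
            continue
        results.append(scraper(dealership))
    return results


sample_dealerships = [
    {"name": "Quality Cars", "inventory_source": {"category": "car_gurus"}},
    {"name": "Best Dealership", "inventory_source": {"category": "other_market"}},
]
for line in trigger_extraction_jobs(sample_dealerships):
    print(line)
```

Supporting a new marketplace then means adding one entry to the registry, and the loop over dealerships stays unchanged.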




### 9. Additional Retry attempts (2)
Given a failed vehicle scrape, the system should retry 2 additional times. If the job still fails after 3 consecutive attempts, report the failure via paper boy.
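
The rule of one initial attempt plus 2 retries can be captured in a small wrapper. This is a minimal sketch, not the notebook's implementation: `run_with_retries` and its parameters are hypothetical, and `on_failure` stands in for whatever notification hook (for example a Slack post) reports the final failure.

```python
import time


def run_with_retries(task, retries=2, delay_seconds=0.0, on_failure=None):
    """Call task() once, then up to `retries` more times if it raises."""
    last_error = None
    for attempt in range(1, retries + 2):  # 1 initial attempt + `retries` retries
        try:
            return task()
        except Exception as error:
            last_error = error
            if attempt <= retries:
                time.sleep(delay_seconds)  # Pause only when another attempt follows
    if on_failure is not None:
        on_failure(last_error)  # Every attempt failed: hand the error to the notifier
    return None


calls = {"count": 0}


def flaky_scrape():
    """Fails twice, then succeeds on the third call."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("page timed out")
    return {"vin": "demo"}


failures = []
print(run_with_retries(flaky_scrape))
print(run_with_retries(lambda: 1 / 0, on_failure=failures.append))
print(len(failures))
```

The flaky task succeeds on its third and final attempt, while the task that always fails returns `None` and triggers the failure hook exactly once.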

In [12]:
# Configuration
API_URL = "https://retoolapi.dev/OQm51O/data"  # Replace with your actual POST URL
CSV_FILE = "cargurus_cars_data.csv"
SEND_BATCH_SIZE = 1  # Number of records to send in each batch
MAX_RETRY_ATTEMPTS = 2  # Maximum number of retry attempts for a failed batch

# # DEMO Sample data to post
# vehicle_output_data_demo = [
#     {'inventory_source_id': '323e4567-e89b-12d3-a456-426614174002', 'listing_url': 'https://www.cargurus.com/Cars/m-Twins-Auto-Sales--Taylor-sp457133#listing=408360065/NONE/DEFAULT', 'status': 'available', 'price': 9995.0, 'vehicle': {'dealership_id': '123e4567-e89b-12d3-a456-426614174000', 'vin': '5XYPGDA51HG315228', 'mileage': 115771, 'stock_number': '441', 'description': '', 'exterior_color': 'Gray', 'interior_color': 'Gray', 'model': {'name': 'Sorento', 'year': '2017', 'trim': 'LX V6 AWD', 'body_style': 'SUV / Crossover', 'transmission': '6-Speed Automatic', 'fuel_type': 'Gasoline', 'drivetrain': 'All-Wheel Drive', 'engine': '290 hp 3.3L V6', 'make': {'name': 'Kia'}}}
#      }
# ]
def read_csv_file(csv_file):
    """
    Reads the CSV file and returns its rows as a list of dictionaries.
    Returns an empty list if the file does not exist.
    """
    try:
        df = pd.read_csv(csv_file)
        # Convert DataFrame to a list of dictionaries
        vehicle_output_data = df.to_dict(orient="records")
        print(f"Successfully read {len(df)} records from {csv_file}")
        return vehicle_output_data
    except FileNotFoundError:
        print(f"Error: File {csv_file} not found.")
        return []  # Same type as the success path, so callers can iterate safely


def process_and_send_batches_by_post_data(API_URL, vehicle_output_data, SEND_BATCH_SIZE):
    """
    Filters out records that already exist on the API, then posts the rest
    batch by batch, retrying each failed batch up to MAX_RETRY_ATTEMPTS times.
    """
    # Fetch existing records from API_URL
    existing_records = get_vehicle_data(API_URL)

    # Filter out duplicates
    non_duplicate_data = [item for item in vehicle_output_data if not is_duplicate_vehicle_data(existing_records, item)]
    print(f"Filtered {len(vehicle_output_data) - len(non_duplicate_data)} duplicate records.")

    # Split data into batches
    total_records = len(non_duplicate_data)
    total_batches = ceil(total_records / SEND_BATCH_SIZE)
    for batch_number in range(total_batches):
        # Get the current batch
        start_idx = batch_number * SEND_BATCH_SIZE
        end_idx = min(start_idx + SEND_BATCH_SIZE, total_records)
        # NOTE: only the first record of each slice is posted, so no record is
        # skipped only when SEND_BATCH_SIZE is 1.
        data_batch = non_duplicate_data[start_idx:end_idx][0]

        print(f"Sending Batch {batch_number + 1}/{total_batches} (Records {start_idx + 1}-{end_idx})...")

        # Send the batch: one initial attempt plus up to MAX_RETRY_ATTEMPTS retries
        success = post_vehicle_data(API_URL, data_batch)
        retries = 0
        while not success and retries < MAX_RETRY_ATTEMPTS:
            retries += 1
            print(f"Retrying... Attempt {retries}/{MAX_RETRY_ATTEMPTS}")
            time.sleep(2)  # Pause between retry attempts
            success = post_vehicle_data(API_URL, data_batch)

        # Handle a permanently failed batch
        if not success:
            print(f"Batch {batch_number + 1} failed after {MAX_RETRY_ATTEMPTS} retries.")

# csv file data transfer
vehicle_output_data_csv = read_csv_file(CSV_FILE)
process_and_send_batches_by_post_data(
    API_URL,
    vehicle_output_data_csv,
    SEND_BATCH_SIZE)

# Define the job to run
# def job():
#     print("Starting the data send job...")
#     process_and_send_batches_by_post_data(
#     API_URL,
#     vehicle_output_data_csv,
#     SEND_BATCH_SIZE)
#     print("Job completed.")



Successfully read 12 records from cargurus_cars_data.csv
Successfully get records. Status: 200
Error duplicate data: Duplicate found:  URL: https://www.cargurus.com/Cars/m-Rogers-Auto-Sales-Inc-sp340683#listing=409458789/NONE/DEFAULT
Error duplicate data: Duplicate found:  URL: https://www.cargurus.com/Cars/m-Rogers-Auto-Sales-Inc-sp340683#listing=412081215/NONE/DEFAULT
Error duplicate data: Duplicate found:  URL: https://www.cargurus.com/Cars/m-Rogers-Auto-Sales-Inc-sp340683#listing=411540591/NONE/DEFAULT
Error duplicate data: Duplicate found:  URL: https://www.cargurus.com/Cars/m-Rogers-Auto-Sales-Inc-sp340683#listing=412194769/NONE/DEFAULT
Error duplicate data: Duplicate found:  URL: https://www.cargurus.com/Cars/m-Rogers-Auto-Sales-Inc-sp340683#listing=381943807/NONE/DEFAULT
Error duplicate data: Duplicate found:  URL: https://www.cargurus.com/Cars/m-Rogers-Auto-Sales-Inc-sp340683#listing=371632943/NONE/DEFAULT
Error duplicate data: Duplicate found:  URL: https://www.cargurus.com/C
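
The batching and retry logic in the cell above can be sketched in isolation. This is a minimal sketch, assuming the endpoint accepts one vehicle record per request; `chunked`, `send_with_retry`, `send_all`, and the injected `send` callable are hypothetical names for illustration, not functions from this notebook.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(records: Sequence[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of `records` with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(records), size):
        yield list(records[start:start + size])


def send_with_retry(send: Callable[[dict], bool], record: dict,
                    max_retries: int = 2, delay_seconds: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `send(record)` once, then retry up to `max_retries` times on failure."""
    for attempt in range(max_retries + 1):
        if send(record):
            return True
        if attempt < max_retries:
            sleep(delay_seconds)  # Wait only when another attempt follows
    return False


def send_all(send: Callable[[dict], bool], records: Sequence[dict],
             batch_size: int, max_retries: int = 2,
             sleep: Callable[[float], None] = time.sleep) -> List[dict]:
    """Send every record, batch by batch, and return the records that never succeeded."""
    failed = []
    for batch in chunked(records, batch_size):
        for record in batch:  # Every record in the slice is sent, not just the first
            if not send_with_retry(send, record, max_retries, sleep=sleep):
                failed.append(record)
    return failed
```

Returning the failed records, rather than only printing them, would let a later step (for example the Slack error post from task 10) report exactly which listings were not stored.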

### 11. Scraper Cron Job implementation
Create a mechanism that triggers a new scrape every 72 hours for all registered dealerships.
1. Open the crontab to edit it:
``` bash
   crontab -e
```
2. Add the following line to schedule the job:
``` bash
   0 0 */3 * * /usr/bin/python3 /path/to/send_vehicle_data.py >> /path/to/logfile.log 2>&1
```
- `0 0 */3 * *` schedules the script to run at midnight on every third day of the month (the 1st, 4th, 7th, and so on), which is roughly every 72 hours. The day counter restarts each month, so the gap across a month boundary can be shorter than 72 hours.
- `/usr/bin/python3` is the path to your Python 3 executable.
- `/path/to/send_vehicle_data.py` is the full path to your Python script.
- `>> /path/to/logfile.log 2>&1` appends output and errors to a log file for debugging purposes.

3. Save and exit.
4. Check that the cron job is registered:
``` bash
   crontab -l
```
5. Ensure the cron service is running on your server:
``` bash
   sudo service cron start
```
#### Logging Output (Optional)
The `>> ... 2>&1` redirection in the cron entry already captures the script's output and errors. To keep the logs in a dedicated directory, point the redirection at a path such as:
``` bash
0 0 */3 * * /usr/bin/python3 /path/to/send_vehicle_data.py >> /path/to/logs/send_vehicle_data.log 2>&1
```
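
A cron expression alone cannot guarantee an exact 72-hour gap between runs. One option is to let cron invoke the script more often (for example hourly) and have the script itself decide whether a scrape is due. The following is a minimal sketch under that assumption; the state file and the names `is_scrape_due`, `record_run`, and `run_if_due` are hypothetical and not part of the notebook.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional

SCRAPE_INTERVAL_SECONDS = 72 * 60 * 60  # 72 hours


def is_scrape_due(state_file: Path, now: float,
                  interval: float = SCRAPE_INTERVAL_SECONDS) -> bool:
    """Return True if no run is recorded or the last run is at least `interval` seconds old."""
    try:
        last_run = json.loads(state_file.read_text())["last_run"]
    except (FileNotFoundError, KeyError, TypeError, ValueError):
        return True  # Missing or unreadable state: treat the scrape as due
    return now - last_run >= interval


def record_run(state_file: Path, now: float) -> None:
    """Persist the timestamp of the run that just finished."""
    state_file.write_text(json.dumps({"last_run": now}))


def run_if_due(job: Callable[[], None], state_file: Path,
               now: Optional[float] = None) -> bool:
    """Run `job` only when the interval has elapsed; return whether it ran."""
    now = time.time() if now is None else now
    if not is_scrape_due(state_file, now):
        return False
    job()
    record_run(state_file, now)  # Recorded only after the job returns without raising
    return True
```

With this guard in place, the script's entry point would call something like `run_if_due(job, Path("last_scrape.json"))`, and invocations that arrive before 72 hours have passed simply return without scraping.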


In [None]:
# Define the job to run
# def job():
#     print("Starting the data send job...")
#     vehicle_output_data_csv = read_csv_file(CSV_FILE)
#     process_and_send_batches_by_post_data(API_URL,vehicle_output_data_csv,SEND_BATCH_SIZE)
#     print("Job completed.")
#
#
# # Schedule the task
# SCHEDULE_INTERVAL_MINUTES = 2
# schedule.every(SCHEDULE_INTERVAL_MINUTES).minutes.do(job)
#
#
# # Keep the script running to execute jobs
# while True:
#     schedule.run_pending()
#     time.sleep(1)
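
The commented-out cell above keeps the process alive with the `schedule` package. An uncaught exception inside `job()` would end that `while True` loop, so a long-running scheduler probably needs to catch and log failures. Below is a standard-library sketch of the same idea; `run_every` and its parameters are hypothetical names, and `max_runs` and the injectable `sleep` exist only to make the loop testable.

```python
import time
import traceback
from typing import Callable, Optional


def run_every(job: Callable[[], None], interval_seconds: float,
              max_runs: Optional[int] = None,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Run `job` repeatedly, sleeping `interval_seconds` between runs.

    Exceptions raised by `job` are printed and counted instead of stopping the loop.
    `max_runs=None` loops forever; a number stops after that many runs.
    Returns the number of runs that raised an exception.
    """
    failures = 0
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception:
            failures += 1
            traceback.print_exc()  # Log the error and keep the scheduler alive
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)  # No sleep after the final run
    return failures
```

For the 72-hour requirement the call would be `run_every(job, 72 * 60 * 60)`. Unlike a cron entry, this approach loses its schedule whenever the process restarts.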

### 12. Deploy Scraper to a DigitalOcean server