# [Get car listing IDs](#get-car-listing-ids)

<font color='red' size="5">**Important Note**</font>
- This notebook does **not** support Cell > Run All.
- Please run the cells manually. In particular, wait for the search results page to finish loading before executing the last cell of section 2 (the one that selects "100 Per Page").

In [None]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

In [None]:
from pathlib import Path
from random import randint
from time import time, sleep, strftime

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

In [None]:
options = Options()
options.add_argument("--headless")  # run Chrome without a visible window
options.add_argument("--no-sandbox")  # bypass the OS security model
options.add_argument("--disable-gpu")  # applicable to Windows only
options.add_argument("start-maximized")  # open the browser window maximized
options.add_argument("disable-infobars")  # hide Chrome's info bars
options.add_argument("--disable-extensions")  # start without extensions

<a id="toc"></a>

## [Table of Contents](#table-of-contents)
0. [About](#about)
1. [Make User Inputs on `cars.com`](#make-user-inputs-on-cars.com)
2. [Make search selections on home page](#make-search-selections-on-home-page)
3. [Retrieve listing `id`s from search results page](#retrieve-listing-ids-from-search-results-page)
4. [Close browser](#close-browser)

<a id="about"></a>

## 0. [About](#about)

As mentioned in the `README.md` file, we're looking for new car listings in Austin, TX (AUS) and Seattle, WA (SEA) on a budget of \$45,000.

In this notebook, we will scrape car listing IDs from `cars.com`. Starting from the homepage, we will programmatically apply search filters based on our preferences (e.g. zipcode, maximum price, type of car, and make of the car), then collect the listing IDs from the search results pages.

<a id="make-user-inputs-on-cars.com"></a>

## 1. [Make User Inputs on `cars.com`](#make-user-inputs-on-cars.com)

We'll define one user input: the required zipcode, as a string.

In [None]:
zipcode_wanted = "98052"  # AUS: "78745", SEA: "99208", "98052"

The other inputs defined below should not be modified by the user.

In [None]:
# Relative path to Chrome driver
chromedriver = "./chromedriver"

# Main webpage from which to apply search filters
web_url = "https://www.cars.com/"

# Number of pages of listings to return. This was determined manually by
# visiting cars.com and entering the above-mentioned filters.
num_pages_of_results = 30  # AUS: 30, SEA: 11

# Relative path to file where filtered listing IDs should be stored
ids_filename = f"data/Listings_IDs_{strftime('%Y%m%d')}.txt"
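
The loop in section 3 opens `ids_filename` in append mode, which raises an error if the `data/` directory does not exist yet. Below is a minimal sketch of building the same dated filename while creating the directory first. The helper name `build_ids_path` and its `base_dir` parameter are hypothetical and not part of the original notebook; the demo writes to a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path
from time import strftime


def build_ids_path(base_dir="data"):
    """Return a dated path for the IDs file, creating its directory if needed.

    `build_ids_path` and `base_dir` are illustrative names (assumptions).
    """
    out_dir = Path(base_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return out_dir / f"Listings_IDs_{strftime('%Y%m%d')}.txt"


# Demo in a throwaway directory instead of the real ./data folder
with tempfile.TemporaryDirectory() as tmp:
    demo_path = build_ids_path(Path(tmp) / "data")
    print(demo_path.name)
    print(demo_path.parent.is_dir())
```

In the notebook itself, `ids_filename = build_ids_path()` would be a drop-in replacement, since `open()` accepts `Path` objects.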

<a id="make-search-selections-on-home-page"></a>

## 2. [Make search selections on home page](#make-search-selections-on-home-page)

To perform scraping, we'll first download the [Chrome `webdriver`](https://chromedriver.chromium.org/downloads).

In [None]:
!wget https://chromedriver.storage.googleapis.com/77.0.3865.40/chromedriver_linux64.zip -O chromedriver_linux64.zip

In [None]:
!unzip chromedriver_linux64.zip && rm -f chromedriver_linux64.zip

Next, we'll instantiate the Chrome webdriver

In [None]:
driver = webdriver.Chrome(options=options, executable_path=str(chromedriver))

Next, we'll load the `cars.com` homepage

In [None]:
driver.get(web_url)

From the drop-down menus and text input box, we will make specifications for
- type of car (click on required item in dropdown menu)
  - this should be a new car so select "New"
- Make (click on required item in dropdown menu)
  - we want all makes so select All Makes
- maximum acceptable price (click on required item in dropdown menu)
  - our budget is \$45,000, so make this selection
- zip code (enter text into user input box)
  - enter required zipcode and press RETURN to move to the next page of search results

In [None]:
# select type of car wanted (New)
driver.find_element_by_xpath("//select[@name='stockType']/option[text()='New Cars']").click()

# select make wanted
driver.find_element_by_xpath("//select[@name='makeId']/option[text()='All Makes']").click()

# select max price wanted
driver.find_element_by_xpath("//select[@name='priceMax']/option[text()='$45,000']").click()
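
The three selections above share one XPath pattern: find a `<select>` by its `name`, then the `<option>` with the wanted text. The sketch below factors that pattern into a small helper so the filters can be listed in one place. The name `option_xpath` is hypothetical, and the helper assumes the option text contains no single quotes.

```python
def option_xpath(select_name, option_text):
    """Build the XPath for one <option> inside a <select> with the given name.

    Illustrative helper (an assumption, not part of the original notebook).
    It does not escape quotes, so `option_text` must not contain "'".
    """
    return f"//select[@name='{select_name}']/option[text()='{option_text}']"


# The same filters used in the cell above
selections = {
    "stockType": "New Cars",
    "makeId": "All Makes",
    "priceMax": "$45,000",
}

for name, text in selections.items():
    xpath = option_xpath(name, text)
    print(xpath)
    # With a live driver, each option could then be clicked with:
    # driver.find_element_by_xpath(xpath).click()
```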

In order to specify the required zipcode, we will first clear the existing entry in the text box

In [None]:
# locate the zipcode text box and clear its existing contents
zip_elem = driver.find_element_by_xpath("//input[@name='zip']")
zip_elem.send_keys(Keys.CONTROL + "a")  # select all existing text
zip_elem.send_keys(Keys.DELETE)  # delete the selection

Next, we will enter the required zipcode and press `RETURN`

In [None]:
zip_elem.send_keys(zipcode_wanted)
zip_elem.send_keys(Keys.RETURN)

On the search results page, select "100 Per Page" to display 100 listings per page.
- This reduces the number of pages of search results that must be navigated.
- **NOTE about using Cell > Run All**
  - Please wait for the search results page (loaded by the cell above) to finish loading before executing the cell below.
    - Reason: the option to show the maximum of 100 listings per page is not available until the search results have fully loaded.

In [None]:
# Specify that 100 results should be shown per page
driver.find_element_by_xpath(
    "//select[@class='ng-pristine ng-untouched ng-valid ng-not-empty']/option[text()='100 Per Page']"
).click()
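
Instead of waiting by hand, the readiness check could be automated by polling until the option appears. The sketch below is a generic polling helper that does not depend on Selenium; the names `wait_until`, `timeout`, and `poll_interval` are hypothetical. Selenium's own `WebDriverWait`, already imported at the top of this notebook, works on the same principle.

```python
from time import monotonic, sleep


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success. Illustrative helper (an assumption).
    """
    deadline = monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        sleep(poll_interval)


# Stand-in for a page that finishes loading on the third check
checks = iter([False, False, True])
print(wait_until(lambda: next(checks), timeout=5, poll_interval=0.01))
```

With a live driver, `condition` could be something like `lambda: driver.find_elements_by_xpath(xpath)`, which returns an empty (falsy) list until a matching element exists.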

<a id="retrieve-listing-ids-from-search-results-page"></a>

## 3. [Retrieve listing `id`s from search results page](#retrieve-listing-ids-from-search-results-page)

Next, on each page of search results, we'll do the following
1. get the `bs4` soup
2. use a helper function to extract each listing `id`
   - this `id` will be used later to assemble the web URL of each listing
3. append the listing `id`s for the page to a text file, such that a single line of the text file contains all listing `id`s for a single page of search results
   - since we specified that 100 results should be shown per page, a full page produces a line of 100 listing IDs
4. pause for a random amount of time
5. scroll to the bottom of the page
6. pause for a random amount of time
7. click the `Next` button to navigate to the next page
8. wait up to 5 seconds for the page URL to change to that of the next page

In [None]:
def get_all_ids_from_search_results_soup(soup):
    """
    Get list with id for each search result listing
    """
    id_checkboxes_elements = soup.find_all("input", {"class": "checkbox__input"})

    ids_per_page = []
    for c in id_checkboxes_elements:
        # Listing checkboxes have an id of the form "<listing id>-compare";
        # .get() avoids a KeyError for checkboxes without an id attribute
        element_id = c.get("id", "")
        if "-compare" in element_id:
            ids_per_page.append(element_id.replace("-compare", ""))
    return ids_per_page
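
To make the extraction rule concrete, here is a dependency-free sketch that applies the same `-compare` suffix rule to a hand-written HTML fragment, using the standard library's `html.parser` in place of `bs4`. The fragment, the IDs in it, and the class name `CompareIdParser` are invented for illustration; the real page markup may differ.

```python
from html.parser import HTMLParser


class CompareIdParser(HTMLParser):
    """Collect listing IDs from <input class="checkbox__input" id="<ID>-compare">."""

    def __init__(self):
        super().__init__()
        self.listing_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        element_id = attributes.get("id") or ""
        # Keep only listing checkboxes, then drop the "-compare" suffix
        if "checkbox__input" in classes and "-compare" in element_id:
            self.listing_ids.append(element_id.replace("-compare", ""))


# Invented sample markup (assumption): two listings and one unrelated checkbox
sample_html = """
<input class="checkbox__input" id="111-compare" type="checkbox">
<input class="checkbox__input" id="newsletter" type="checkbox">
<input class="checkbox__input" id="222-compare" type="checkbox">
"""

parser = CompareIdParser()
parser.feed(sample_html)
print(parser.listing_ids)
```

The unrelated checkbox is skipped because its `id` lacks the `-compare` suffix, which is the same filter the helper above applies.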

In [None]:
def pause_code(min_time, max_time, delay_msg):
    """Wait for a random amount of time before proceeding"""
    # Pause
    delay_time = randint(min_time, max_time)
    print(delay_msg)
    sleep(delay_time)    

Now, we can loop over the pre-defined number of pages and perform the above actions on each page.

In [None]:
ids = []
for page in range(1, num_pages_of_results+1):  
    # 1. Get the bs4 soup from each page of listings for search results
    soup_contents = BeautifulSoup(driver.page_source, 'html.parser')
    # r = requests.get(driver.current_url)
    # soup_contents = BeautifulSoup(r.text, 'html.parser')
    # print(soup_contents.prettify())

    # 2. Get list of listing IDs from page
    list_of_ids_per_page = get_all_ids_from_search_results_soup(soup_contents)
    print(f"Found {len(list_of_ids_per_page)} listings")
    ids.append(list_of_ids_per_page)

    # 3. Write list of string IDs to file
    list_of_ids_as_string = ", ".join(list_of_ids_per_page) + "\n"
    with open(ids_filename, 'a') as f:
        f.write(list_of_ids_as_string)

    # print current url
    current_url = driver.current_url
    print(f"Current URL: {current_url}")

    # If this is not the last requested page of search results,
    # then navigate to the next page
    if page+1 <= num_pages_of_results:        
        # 4. Pause
        pause_code(
            min_time=3,
            max_time=7,
            delay_msg=f"Pausing before scrolling to bottom of page {page}",
        )

        # 5. Scroll to bottom of page, so that Next button is enabled and can be clicked
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        print(f"Reached the bottom of page {page}")

        # 6. Pause
        pause_code(
            min_time=2,
            max_time=9,
            delay_msg="Pausing before navigating to next page",
        )

        # 7. Click Next button to navigate to next page
        driver.find_element_by_xpath("//a[@class='button next-page']").click()
        print(f"Displaying page {page + 1}\n")

        # 8. Wait up to 5 seconds for the URL to change
        try:
            WebDriverWait(driver, 5).until(EC.url_changes(current_url))

            # print new URL
            new_url = driver.current_url
            print(f"New URL: {new_url}")
        except TimeoutException as e:
            print(
                f"When accessing page {page + 1}, stopped due to error message: "
                f"{str(e)}"
            )
            break
    else:
        print(f"Reached last requested page ({page}) of listings. Stopping here.")
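
After the loop, `ids` holds one list per page. If a single flat list is more convenient for later steps, it can be derived as in the sketch below. The sample IDs and the helper name `flatten_unique` are invented for illustration. Repeats are dropped while keeping first-seen order, in case a listing shows up on more than one page.

```python
from itertools import chain


def flatten_unique(pages):
    """Flatten a list of per-page ID lists, dropping repeats but keeping order.

    Illustrative helper (an assumption, not part of the original notebook).
    """
    # dict preserves insertion order, so fromkeys() de-duplicates in order
    return list(dict.fromkeys(chain.from_iterable(pages)))


# Invented sample data shaped like `ids`: one inner list per page
sample_ids = [["a1", "b2"], ["b2", "c3"], []]
print(flatten_unique(sample_ids))
```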

Next, we will display a brief summary of the number of listing IDs found per page, labeled by city.

In [None]:
print(
    "Contents of file containing scraped listing IDs\n"
    "==============================================="
)
with open(ids_filename) as f:
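    lines = f.readlines()
    for page_num, element in enumerate(lines):
        # strip the trailing newline so it is not attached to the last ID
        id_list = element.strip().split(", ")
        # assumes the first 15 lines of the file hold the AUS pages
        city = "AUS" if page_num + 1 <= 15 else "SEA"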
        print(
            f"Page: {page_num + 1}, "
            f"City: {city}, "
            f"Number of listings on page: {len(id_list)}"
        )
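
The file format used in this notebook (one comma-separated line of IDs per page) can be exercised without a browser. The sketch below writes two pages of invented IDs to a temporary file and reads them back. Calling `strip()` before `split()` keeps the trailing newline out of the last ID on each line. The helper names `append_page` and `read_pages` are hypothetical.

```python
import tempfile
from pathlib import Path


def append_page(path, page_ids):
    """Append one page of IDs as a single comma-separated line."""
    with open(path, "a") as f:
        f.write(", ".join(page_ids) + "\n")


def read_pages(path):
    """Read the file back as a list of per-page ID lists, skipping blank lines."""
    with open(path) as f:
        return [line.strip().split(", ") for line in f if line.strip()]


# Round trip in a throwaway directory with invented IDs
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "Listings_IDs_demo.txt"
    append_page(demo_file, ["111", "222"])
    append_page(demo_file, ["333"])
    pages = read_pages(demo_file)

for page_num, page_ids in enumerate(pages, start=1):
    print(f"Page: {page_num}, Number of listings on page: {len(page_ids)}")
```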

<a id="close-browser"></a>

## 4. [Close browser](#close-browser)

Finally, we'll close all web browser windows

In [None]:
# driver.close()  # closes active browser window
driver.quit()  # closes all browser windows