# [Scrape car listing details](#scrape-car-listing-details)

<font color='red' size="5">**Important Note**</font>
- this notebook does **not** support scraping more than 1 row of IDs from `data/Listings_IDs.txt` at a time
- if `num_pages_of_results`, in the last cell of section 1., is set to a value larger than 1, then the behavior of this notebook will be unreliable
- this notebook does **not** support Cell > Run All
- please run cells manually and wait for the preceding page to load before executing the second last cell before section 0.

In [None]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

In [None]:
import re
from pathlib import Path
from random import randint, sample
from time import sleep, time

from IPython.display import display

import src.listing_scraper as lsc
import pandas as pd
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from numpy import nan as np_nan
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

Set options for the Chrome webdriver

In [None]:
options = Options()
options.add_argument("--headless")  # run Chrome in headless mode
options.add_argument("--no-sandbox")  # bypass OS security model
options.add_argument("--disable-gpu")  # applicable to Windows OS only
options.add_argument("start-maximized")  # open the browser window maximized
options.add_argument("disable-infobars")  # suppress Chrome info bars
options.add_argument("--disable-extensions")  # disable browser extensions

Set display options for `pandas`

In [None]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 5000)
pd.set_option('display.width', 1000)

Instantiate the `UserAgent` class in order to create a fake user agent when scraping with Selenium.

In [None]:
ua = UserAgent()

<a id="toc"></a>

## [Table of Contents](#table-of-contents)
0. [About](#about)
1. [User Inputs](#user-inputs)
2. [Use saved IDs to scrape car listings](#use-saved-ids-to-scrape-car-listings)

<a id="about"></a>

## 0. [About](#about)

In this notebook, we will scrape car listing details from the `cars.com` web pages corresponding to the listing IDs that were previously retrieved in `1_cars_controller.ipynb` and stored in the file `data/Listings_IDs.txt`.

<a id="user-inputs"></a>

## 1. [User Inputs](#user-inputs)

We'll define two input variables

1. `n_listings_per_page`
   - this is the number of IDs from each row of `data/Listings_IDs.txt` to be scraped

2. `start_listing_num`
   - this is the number of the first listing ID from `data/Listings_IDs_YYYYmmdd.txt` to be scraped (it must be 0 or a multiple of 100)
   - the minimum and maximum values for `start_listing_num` (for each city) are listed below, based on row numbers from `data/Listings_IDs.txt`
     - AUS
       - 30 pages of 100
       - listings 0-2999
       - `min(start_listing_num) = 0`
       - `max(start_listing_num) = 2900`
     - SEA
       - 11 pages of 100
       - listings 3000-4099
       - `min(start_listing_num) = 3000`
       - `max(start_listing_num) = 4000`

In [None]:
n_listings_per_page = 100  # max = 100
start_listing_num = 1100  # min = 0 (must be 0 or multiple of 100)
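
The limits described above can be checked before any scraping starts. The following is a minimal sketch, not part of the original notebook: the function name `validate_inputs` and its `max_start_listing_num` parameter are hypothetical, and the default of 4000 assumes the combined ID file whose ranges are listed above.

```python
def validate_inputs(n_listings_per_page, start_listing_num, max_start_listing_num=4000):
    """Raise ValueError if either user input violates the documented limits."""
    # At most 100 IDs are stored on each row of the ID file
    if not 1 <= n_listings_per_page <= 100:
        raise ValueError("n_listings_per_page must be between 1 and 100")
    # The starting listing must fall on a row boundary
    if start_listing_num % 100 != 0:
        raise ValueError("start_listing_num must be 0 or a multiple of 100")
    # Assumed upper limit: the largest documented start value (SEA, 4000)
    if not 0 <= start_listing_num <= max_start_listing_num:
        raise ValueError(
            f"start_listing_num must be between 0 and {max_start_listing_num}"
        )
    return True


print(validate_inputs(n_listings_per_page=100, start_listing_num=1100))
```

If a city-specific ID file is used instead of the combined one, `max_start_listing_num` would need to be adjusted to match the number of rows in that file.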

Some other inputs we'll define below should not be modified by the user

In [None]:
# Relative path to Chrome driver
chromedriver = "./chromedriver"

# Format for listing URL with placeholder for listing ID
url_base = "https://www.cars.com/vehicledetail/detail/{}/overview/"

# Path to folder where filtered listing IDs should be stored
fpath = Path().cwd() / "data"

# Path to file where filtered listing IDs should be stored
# ids_filename = str(fpath / "Listings_IDs_20191008_AUS_only.txt")
ids_filename = str(fpath / "Listings_IDs_20191009_SEA_only.txt")

# Dictionary of xpaths for elements that should randomly be brought into view on the listing page
page_element_xpath_strings = {
    "All Features": "//h4[@class='vdp-details-basics__features page-section__title--sub cui-heading-2']",
    "Have a question?": "//h3[@class='cui-heading-3']",
    "Request an Appointment": "//a[@data-linkname='email-lead-form-test-drive-bottom']",
    "Seller's Notes (See more)": "//label[@data-linkname='expand-seller-notes']",
}

We programmatically determine the state based on the number of listing IDs scraped for each zipcode
- details are shown above under the explanation for the variable `start_listing_num`

In [None]:
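# Per the ranges listed for `start_listing_num`: AUS (TX) covers listings
# 0-2999 and SEA (WA) covers listings 3000-4099
state = "TX" if start_listing_num <= 2900 else "WA"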
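
As an illustration of how the state follows from the documented listing ranges, here is a small lookup sketch. It assumes the combined ID file described under `start_listing_num` (AUS listings 0-2999, SEA listings 3000-4099) and that AUS and SEA correspond to TX and WA respectively. The names `CITY_RANGES` and `lookup_city_and_state` are hypothetical and do not appear in the notebook.

```python
# Listing-number ranges copied from the list in section 1
# (range upper bounds are exclusive)
CITY_RANGES = {
    "AUS": (range(0, 3000), "TX"),
    "SEA": (range(3000, 4100), "WA"),
}


def lookup_city_and_state(listing_num):
    """Return the (city, state) pair whose documented range contains listing_num."""
    for city, (listing_range, state_code) in CITY_RANGES.items():
        if listing_num in listing_range:
            return city, state_code
    raise ValueError(f"listing number {listing_num} is outside the documented ranges")


print(lookup_city_and_state(2900))
print(lookup_city_and_state(3000))
```

Unlike a single threshold comparison, this version also rejects listing numbers that fall outside both ranges.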

Below, we will specify the number of rows of listing IDs from `data/Listings_IDs.txt` that will be scraped during one full run of this notebook. This variable should always be set to 1.

In [None]:
num_pages_of_results = 1

<a id="use-saved-ids-to-scrape-car-listings"></a>

## 2. [Use saved IDs to scrape car listings](#use-saved-ids-to-scrape-car-listings)

In `1_cars_controller.ipynb`, we submitted inputs to the user submission form on `cars.com` and then scraped the IDs of the listings in the search results.

Here, we will scrape individual search result listings by using those previously saved IDs to assemble the URL for each listing. The scraped results will be appended to a `*.csv` file, with one file per page (row) of listing IDs.

We will start by defining a helper function that pauses for a random number of seconds. After that, we will load the file of scraped listing IDs and use each ID to assemble the listing URL, by inserting it into the placeholder of the base URL string.

In [None]:
def pause_code(min_time, max_time, delay_msg):
    """Wait for a random whole number of seconds in [min_time, max_time]."""
    delay_time = randint(min_time, max_time)
    print(delay_msg)
    sleep(delay_time)

Next, we will define variables to track
- the first and last required listing
- page number
- (randomly chosen) listings at which to scroll to the bottom of the page
- etc.

To explain how these are calculated, assume `start_listing_num = 400`

1. `page_num` and `page_start_listing_num`
   - Based on the user's input for `start_listing_num` above, we will use `divmod` to obtain the
     - starting listing number on the appropriate page
     - page number (zero-indexed)
   - For each page of listings returned from the filters applied to the `cars.com` homepage, the previous notebook `1_cars_controller.ipynb` exported 100 listing IDs to one row of `data/Listings_IDs.txt`. In the current notebook, if the user enters `start_listing_num = 400`, then `divmod(400, 100)` returns `(4, 0)`, which corresponds to the
     - 0th ID on row 5 (zero-based index 4) of `data/Listings_IDs.txt`
       - 0 is assigned to `page_start_listing_num`
     - 5th page of search results returned from the `cars.com` homepage
       - 4 is assigned to `page_num`, since page numbers are zero-indexed
2. `page_end_listing_num`
   - this is just `page_start_listing_num` (0) + `n_listings_per_page` (100)
3. `element`
   - this is the string of all IDs on the required row (row 5)
4. `id_list`
   - this comes from splitting the required row (row 5), which is a string of comma-separated IDs, into a list of strings
5. `listings_to_move`
   - this is a list of randomly selected listings at which `selenium` will scroll to the bottom of the page
   - at all other listings, `selenium` will successively [scroll into view](https://stackoverflow.com/a/50288690/4057186) each of the elements listed in `page_element_xpath_strings` (if those elements are present on the listing page)
6. `out_fpath`
   - this is the name of the `*.csv` file to which scraped listing details will be exported
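
The arithmetic in items 1 to 4 can be verified with a short, self-contained sketch. The helper name `locate_listing`, the constant `IDS_PER_ROW` and the toy row of made-up IDs are assumptions for illustration only; the real IDs come from the ID file.

```python
IDS_PER_ROW = 100  # each row of the ID file holds the IDs from one page of results


def locate_listing(start_listing_num, n_listings_per_page):
    """Return (page_num, page_start_listing_num, page_end_listing_num)."""
    # Quotient: zero-indexed row (page), remainder: position within that row
    page_num, page_start_listing_num = divmod(start_listing_num, IDS_PER_ROW)
    # Upper bound is exclusive, so it can be used directly in a slice
    page_end_listing_num = page_start_listing_num + n_listings_per_page
    return page_num, page_start_listing_num, page_end_listing_num


# Toy row in the same comma-separated layout as one line of the ID file
row = ", ".join(f"id{i}" for i in range(100)) + "\n"

page_num, first, last_exclusive = locate_listing(400, 100)
id_list = row.split(", ")[first:last_exclusive]
print(page_num, first, last_exclusive, len(id_list))  # 4 0 100 100
```

Note that the last ID on the row still carries the trailing newline after splitting, which is why the newline has to be stripped when the listing URL is assembled.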

In [None]:
with open(ids_filename) as f:
    # Read lines in file with listing IDs
    # - one line per page of returned search results that were saved
    #   to a *.txt file
    lines = f.readlines()

# Get the (a) preceding page number and (b) first listing number
# on the required page
page_num, page_start_listing_num = divmod(start_listing_num, 100)

# Get the last listing number on the required page (upper bound exclusive)
page_end_listing_num = n_listings_per_page + page_start_listing_num

# Get all listings on required page number
element = lines[page_num]

# Split line string to create a list and slice to get only the required listing IDs
id_list = element.split(", ")[page_start_listing_num: page_end_listing_num]

# Randomly select the listing numbers at which selenium will scroll
# to the bottom of the corresponding listing page
listings_to_move = (
    sample(
        range(page_start_listing_num, page_end_listing_num),
        n_listings_per_page // 3,
    )
    if n_listings_per_page >= 3
    else [None]
)

# Assemble path to output file that will be produced
out_fpath = (
        fpath / (
            f"p{page_num}__"
            f"{page_start_listing_num}_"
            f"{page_end_listing_num - 1}.csv"
        )
    )

We'll summarize the above variables in a pandas `DataFrame`

In [None]:
d = {
    "Required number of listings per page": n_listings_per_page,
    "Overall first listing number required": start_listing_num,
    "Maximum number of listings available": (30 + 11) * 100,
    "Maximum pages available": 30 + 11,  # AUS zipcode: 30, SEA zipcode: 11
    "State to scrape": state,
    "Page number selected": page_num,
    "First selected listing number": page_start_listing_num,
    "Required total number of listings": n_listings_per_page,
    "Last selected listing number (upper-bound inclusive)": page_end_listing_num - 1,
    "Scrolling to bottom of page for listing numbers": f"{', '.join(str(x) for x in listings_to_move)}",
    "Output *.csv filepath": out_fpath,
}
display(pd.DataFrame.from_dict(d, orient="index").reset_index(drop=False))

Next, for each listing ID in the selected row of `data/Listings_IDs.txt`, we'll do the following
1. Assemble the URL to the listing web page on `cars.com`
2. Load the listing web page
3. Get the `bs4` soup, scrape the listing details and append them to the output `*.csv` file
4. Randomly do one of the following
   - scroll to the bottom of the page
   - successively bring the pre-selected page elements into view (if they are found on the page), pausing for a random amount of time before each one
     - the `xpath` search string for each of these elements is stored as a value in the earlier defined dictionary `page_element_xpath_strings`
5. Close the active browser window
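
One way to make sure the last step always runs, even when a listing is skipped or scraping raises an exception, is to wrap the driver in a context manager. The following is a sketch under stated assumptions, not the notebook's own approach: `managed_driver` and `FakeDriver` are hypothetical names, and `FakeDriver` only stands in for a selenium driver so that the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Create a driver and always shut it down, even if scraping fails."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # Runs on normal exit and on exceptions alike
        driver.quit()


class FakeDriver:
    """Stand-in for a selenium driver (assumption, for demonstration only)."""

    def __init__(self):
        self.is_open = True

    def quit(self):
        self.is_open = False


with managed_driver(FakeDriver) as driver:
    opened_inside = driver.is_open

print(opened_inside, driver.is_open)  # True False
```

With selenium, `make_driver` could be a function that returns `webdriver.Chrome(...)` configured as in this notebook; the real driver's `quit()` method ends the whole browser session.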

In [None]:
cell_st = time()
header_check = []

# Loop over list of IDs and scrape the associated listings
for link_cntr, eid in enumerate(id_list):
    start_time = time()
    # 1. Assemble listing url from ID
    listing_url = f"{url_base}".format(eid.replace("\n", ""))
    print(
        f"Page: {page_num}, "
        f"Listing: {link_cntr + start_listing_num}, "
        f"URL: {listing_url}"
    )

    # Instantiate a random user agent
    userAgent = ua.random
    # print(userAgent)
    options.add_argument(f"user-agent={userAgent}")
    
    # Instantiate Chrome webdriver with the above random user agent
    driver = webdriver.Chrome(
        options=options, executable_path=str(chromedriver)
    )

    # 2. Load web page
    driver.get(listing_url)

    try:
        # 3. Scrape web page and append to one *.csv per page
        soup_contents = BeautifulSoup(driver.page_source, "html.parser")
        sold_check = soup_contents.find("div", {"class": "vdp__no-listing__alert"})
        if sold_check and "No longer listed" in sold_check.text:
            print("Sold, car is no longer listed. Will skip listing...\n")
            header_check.append(link_cntr)
            # Close the browser window before skipping to the next listing
            driver.close()
        else:
            link_cntr = (link_cntr - 1) if header_check and header_check[0] == 0 else link_cntr
            d_listing, d_errors = lsc.scrape_single_listing(
                soup=soup_contents,
                page_number=page_num,
                listing_number=link_cntr + start_listing_num,
                state=state,
            )

            # Put errors and listings into DataFrame
            df_listing = pd.DataFrame.from_dict(d_listing, orient="index").T
            dfe = pd.DataFrame.from_dict(d_errors, orient="index").T
            df = lsc.pandas_clean_data(df_listing)
            if not dfe.empty:
                dfe[["page", "listing", "error"]] = pd.DataFrame(
                    dfe["error"].values.tolist(),
                    index=dfe.index
                )
                display(dfe)
            # display(df)
            # Write the header only when the adjusted counter is zero
            header_spec = link_cntr == 0
            # Append DataFrame to *.csv
            print("Writing header to output *.csv file?", header_spec)
            df.to_csv(
                path_or_buf=str(out_fpath),
                mode="a",
                header=header_spec,
                index=False,
            )

            # 4. Randomly perform one of the following 2 actions on the page
            #    - scroll to bottom of page
            #    - bring each of the pre-selected sections of the page into
            #      view (if element is found on page)
            # listings_to_move holds page-relative listing numbers
            if (link_cntr + page_start_listing_num) in listings_to_move:
                # (a) Scroll to bottom of page
                print(
                    f"Moving to bottom of page for listing number {link_cntr + start_listing_num}"
                )
                driver.execute_script(
                    "window.scrollTo(0, document.body.scrollHeight);"
                )
                print("Reached the bottom of the page")
            else:
                # (b) Bring each of the pre-selected elements into view
                #     (if element is found on page)
                for (
                    element_name,
                    element_string_xpath,
                ) in page_element_xpath_strings.items():
                    # Pause for random time delay
                    pause_code(
                        min_time=0,  # 3
                        max_time=2,  # 10
                        delay_msg=(
                            f"Pause between bringing {element_name} "
                            "page element into view"
                        ),
                    )
                    # Bring element into view (if element is found on page)
                    try:
                        element = driver.find_element_by_xpath(
                            element_string_xpath
                        )
                        driver.execute_script(
                            "arguments[0].scrollIntoView();", element
                        )
                    except NoSuchElementException as e:
                        print(
                            f"Page: {page_num}, Listing: {link_cntr + start_listing_num} "
                            + str(e)
                        )
            # 5. Close active web browser window
            driver.close()

            print(f"Leaving page {page_num} listing {link_cntr + start_listing_num}")
            elapsed_time = time() - start_time
            print(
                f"Time spent on page {page_num} "
                f"listing {link_cntr + start_listing_num} = {elapsed_time:.2f} seconds\n"
            )
    except Exception as e:
        print(str(e))
total_minutes, total_seconds = divmod(time() - cell_st, 60)
print(f"Cell execution time: {int(total_minutes):d} minutes, {total_seconds:.2f} seconds")
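
The loop above decides whether to write the `*.csv` header from the loop counter. A possible alternative, sketched below with the standard library only, is to write the header whenever the output file does not exist yet or is empty. The function name `append_listing`, the example columns and the temporary file are hypothetical and used for illustration only.

```python
import csv
import tempfile
from pathlib import Path


def append_listing(csv_path, listing):
    """Append one listing (a dict) to csv_path; return True if a header was written."""
    csv_path = Path(csv_path)
    # A header is needed only for a new or empty file
    write_header = not csv_path.exists() or csv_path.stat().st_size == 0
    with csv_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(listing))
        if write_header:
            writer.writeheader()
        writer.writerow(listing)
    return write_header


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "p0__0_99.csv"
    first = append_listing(demo_path, {"listing": 0, "state": "TX"})
    second = append_listing(demo_path, {"listing": 1, "state": "TX"})
    n_lines = len(demo_path.read_text().splitlines())

print(first, second, n_lines)  # True False 3
```

The same check could be passed to pandas as `header=not out_fpath.exists()` in `df.to_csv(..., mode="a")`, which would keep the header correct even if several leading listings are skipped.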

Finally, we'll print all the exported listing details in a pandas `DataFrame`

In [None]:
df_loaded = pd.read_csv(str(out_fpath))
display(df_loaded)