# Cars.com Web Scraping

- Get basic info on used cars
- Also get the reviews for each car
- Grab a good number of entries
- Use a 200-mile radius around Birmingham, AL
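The search described above can be expressed as a small URL builder. This is only a sketch: the parameter names are copied from the search URL used in the scraping loop later in this notebook, while `search_url` and `BASE_URL` are hypothetical names, and dropping the empty filters (`list_price_max`, `makes[]`, `models[]`) is an assumption that has not been checked against the site.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.cars.com/shopping/results/"

def search_url(page, page_size=100, radius_miles=200, zip_code="35215"):
    """Build a used-car search URL (hypothetical helper, not part of the notebook)."""
    params = {
        "page": page,
        "page_size": page_size,
        "maximum_distance": radius_miles,  # search radius in miles
        "stock_type": "used",
        "zip": zip_code,                   # ZIP code taken from the notebook's URL
    }
    return f"{BASE_URL}?{urlencode(params)}"

print(search_url(1))
```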

# Start of WebScrape

- Gather a list of links to cars
- Use a timer to time the responses
- Go through 150 pages with 100 results per page
- 200-mile radius around Birmingham, AL
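The "time the response" step could look roughly like the sketch below. `timed_call` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for `requests.get(url).text` so the example runs without network access.

```python
import time

def timed_call(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def fake_fetch(url):
    # Stand-in for a real HTTP request (assumption: the real call returns HTML text).
    time.sleep(0.01)
    return f"<html>{url}</html>"

html, seconds = timed_call(fake_fetch, "https://www.cars.com/shopping/results/?page=1")
print(f"fetched {len(html)} characters in {seconds:.3f}s")
```

Logging the elapsed time per page makes it easier to notice when the site starts responding slowly and the delay between requests should be increased.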

In [1]:
# imports
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import random
import numpy as np
from tqdm.notebook import tqdm

In [7]:
link_end_list = []

# requests pages 2 through 101
for page in tqdm(range(2,102)):
    url = f'https://www.cars.com/shopping/results/?page={page}&page_size=100&list_price_max=&makes[]=&maximum_distance=200&models[]=&stock_type=used&zip=35215'
    resp = requests.get(url, timeout=30).text
    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(resp, 'html.parser')

    # collect the relative link of every listing on the page
    links = soup.find_all('a', attrs={"class":"vehicle-card-link js-gallery-click-link"})
    link_end = [l['href'] for l in links]
    link_end_list.append(link_end)
    # random pause between requests
    time.sleep(random.choice([1, 2, 0.5]))

In [8]:
len(link_end_list)

100

**Join the list of lists into one**
- the for loop appends one list per page, so it ends with a list of lists of links

In [9]:
results = sum(link_end_list, [])
len(results)

9983
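The flattening step above can also be written with `itertools.chain.from_iterable`. `sum(list_of_lists, [])` builds a new list at every step, which is fine for 100 pages but grows quadratically with the number of pages. The links below are made-up placeholders.

```python
from itertools import chain

# Hypothetical per-page link lists, shaped like link_end_list.
pages = [
    ["/vehicledetail/a/", "/vehicledetail/b/"],
    ["/vehicledetail/c/"],
    [],  # a page that returned no links
]

flat = list(chain.from_iterable(pages))
print(len(flat), flat)
```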

**Looking at the number of unique links**

In [10]:
un, counts = np.unique(results, return_counts=True)
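To look only at the links that appear more than once, the counts can be filtered. The sketch below uses `collections.Counter` on made-up links instead of `np.unique`; `sample_links` and `repeated` are hypothetical names.

```python
from collections import Counter

# Made-up links standing in for the scraped results.
sample_links = [
    "/vehicledetail/a/",
    "/vehicledetail/b/",
    "/vehicledetail/a/",
    "/vehicledetail/c/",
    "/vehicledetail/a/",
]

link_counts = Counter(sample_links)
# Keep only the links seen more than once.
repeated = {link: n for link, n in link_counts.items() if n > 1}
print(repeated)
```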

**The repeated values seem to come with 52 repeats each**
- ~~assuming these are ad cars so they will pop up on multiple pages~~
- ~~roughly 33% of listings seem to be ads~~
- ~~each ad car probably gets a limit of 52 listings/times shown~~


- it turns out that when showing 100 cars per page, you can only go up to page 100
- the extra 52 repeats come from trying to go to page 101 and beyond, where the website redirects you to page 100
- so only 119 cars were duplicates out of 10,000 cars (100 listings * 100 pages), probably ads, which means only about 1% of cars were ad cars
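A paging loop can guard against the page cap by stopping as soon as a page repeats the previous one. This is a sketch under an assumption: that a redirected request returns exactly the same links as the last real page, which may not hold if ad listings rotate. `collect_pages` and `fake_fetch_links` are hypothetical names, and the fake site below is capped at 3 pages for illustration.

```python
def collect_pages(fetch_links, max_pages):
    """Collect per-page link lists, stopping when a page is empty or repeats the previous one."""
    pages = []
    previous = None
    for page in range(1, max_pages + 1):
        links = fetch_links(page)
        if not links or links == previous:
            break  # likely past the last real page
        pages.append(links)
        previous = links
    return pages

def fake_fetch_links(page, cap=3):
    # Simulates a site that serves its last page for any page number past the cap.
    served = min(page, cap)
    return [f"/vehicledetail/p{served}-{i}/" for i in range(2)]

pages = collect_pages(fake_fetch_links, max_pages=10)
print(len(pages))
```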

In [11]:
dict(zip(un, counts))

{'/vehicledetail/000443e5-575c-458d-8689-7bf7c5c0a408/': 1,
 '/vehicledetail/00069f18-5482-439e-ae88-c61e85bc7058/': 1,
 '/vehicledetail/0010703f-e9ee-4fed-8485-c1854cba9996/': 1,
 '/vehicledetail/002658fc-08fd-4b40-90a9-2148f391a4ae/': 1,
 '/vehicledetail/00284ab5-cf99-4c03-b58c-3e40839579d5/': 1,
 '/vehicledetail/0033137a-57fd-46a9-ba87-7899e780b6e7/': 1,
 '/vehicledetail/003451be-94e6-4040-a79d-327dc8004910/': 1,
 '/vehicledetail/0038d8fc-5742-4054-bb4b-6446c318c7c2/': 1,
 '/vehicledetail/0039e36d-e5f6-4633-81e9-11dfc7df84e5/': 1,
 '/vehicledetail/0045fd98-c348-478d-ad99-846b8c9efc7c/': 1,
 '/vehicledetail/0049b93d-7fdc-4a1e-b6f3-9fb60e82ade0/': 1,
 '/vehicledetail/00500748-cbb4-44cf-92ca-5479588f34ac/': 1,
 '/vehicledetail/0063347d-9778-413a-a79f-3215fe3ff696/': 1,
 '/vehicledetail/0068f306-88de-4130-975d-f9ed2cc546cc/': 1,
 '/vehicledetail/007cd07e-1305-4c29-a1a2-e622aed2051f/': 1,
 '/vehicledetail/00853be1-e300-4879-877f-4a175691af28/': 1,
 '/vehicledetail/0087fb35-67d6-4b17-bdad

In [12]:
len(set(results))

9695

**Making sure the length stays the same after removing duplicates and converting back to a list**

In [13]:
len(list(set(results)))

9695

In [14]:
final_links = list(set(results))
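A `set` does not keep the order in which links were scraped, and for strings that order can change between Python sessions. If a stable order matters (for example, to line up two separate scraping runs), `dict.fromkeys` removes duplicates while keeping the first occurrence of each link. The links below are made up, and `ordered_links` is a hypothetical name.

```python
# Made-up links standing in for the scraped results.
sample_results = [
    "/vehicledetail/b/",
    "/vehicledetail/a/",
    "/vehicledetail/b/",
    "/vehicledetail/c/",
]

# dict keys are unique and keep insertion order.
ordered_links = list(dict.fromkeys(sample_results))
print(ordered_links)
```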

# Start getting reviews and car data

- get all the info possible for each car
- get all the available reviews
- maybe try to get the rating along with each review; it could help a model

In [16]:
reviews = []
driveTrain = []
mpg = []
fuelType = []
trans = []
engine = []
mileage = []
sale_price = []

In [17]:
error_links = []

**Loop to get the data**

In [18]:
for link in tqdm(final_links):
    #using tqdm to check progress
    
    #try to get info, if fails, move to next link
    try:
        #each car url
        url = f'https://www.cars.com{link}'
        response = requests.get(url, timeout=30).text
        soup = BeautifulSoup(response, 'html.parser')

        #getting reviews
        rev = soup.find_all("p", attrs={"class":"review-body"})
        reviews.append([review.get_text() for review in rev])

        #getting car details
        details = soup.find_all("dl", attrs={"class":"fancy-description-list"})
        det_list = [det.get_text() for det in details]
        det_split = det_list[0].split('\n')
        #for loop to extract each part
        for i in range(len(det_split)):
            if det_split[i] == 'Drivetrain':
                driveTrain.append(det_split[i+1])
            if det_split[i] == 'Fuel type':
                fuelType.append(det_split[i+1])
            if det_split[i] == 'Transmission':
                trans.append(det_split[i+1])
            if det_split[i] == 'Mileage':
                mileage.append(det_split[i+1])
            if det_split[i] == 'MPG':
                mpg.append(det_split[i+3])
            if det_split[i] == 'Engine':
                engine.append(det_split[i+1])

        #getting price
        car_price = soup.find_all("span", attrs={"class":"primary-price"})
        sale_price.append([p.get_text() for p in car_price][0])

        time.sleep(random.choice([0.1, 0.5, 1, 2, 0.75]))
        
    except Exception:
        error_links.append(link) #catching error links
        

**Checking lengths to make sure they're all the same**
- if they're not the same length, the for loop has to be tweaked to fill in a null when there is no value
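One way to do that tweak is to build a single record per car, with `None` for any field that is missing, instead of appending to separate lists. The sketch below reuses the labels and offsets from the loop above (value one line after the label, three lines after for `MPG`). `FIELD_OFFSETS` and `parse_details` are hypothetical names, and the sample input is made up rather than real page text.

```python
# Label -> how many lines after the label the value sits (taken from the loop above).
FIELD_OFFSETS = {
    "Drivetrain": 1,
    "Fuel type": 1,
    "Transmission": 1,
    "Mileage": 1,
    "MPG": 3,
    "Engine": 1,
}

def parse_details(det_split):
    """Return one dict per car; fields that are not found stay None."""
    record = {label: None for label in FIELD_OFFSETS}
    for i, text in enumerate(det_split):
        offset = FIELD_OFFSETS.get(text)
        if offset is not None and i + offset < len(det_split):
            record[text] = det_split[i + offset]
    return record

# Made-up detail lines: no Transmission, Mileage or Engine on this "page".
sample = ["", "Drivetrain", "Front-wheel Drive", "MPG", "", "", "27–35", "Fuel type", "Gasoline"]
record = parse_details(sample)
print(record)
```

Appending `{"link": link, **record}` to a single list inside the loop keeps every row tied to its own link, and `pd.DataFrame(rows)` then fills missing values with nulls instead of shifting them.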

In [19]:
print(len(reviews),
len(driveTrain),
len(mpg),
len(fuelType),
len(trans),
len(engine),
len(mileage),
len(sale_price))

9695 9616 9616 9616 9616 9616 9616 9616


In [20]:
df = pd.DataFrame(list(zip(reviews, driveTrain, mpg, fuelType, trans, engine, mileage, sale_price)), 
                  columns = ['Reviews', 'DriveTrain', 'MPG', 'FuelType', 'Transmission', 'Engine', 'Mileage', 'SalePrice'])
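Because `zip()` stops at the shortest input, lists of unequal length (9695 reviews against 9616 entries for the other fields above) are silently truncated, and any car whose details failed shifts the rows that follow it. A toy illustration with made-up values:

```python
# Three cars were visited; the details for car B could not be parsed,
# so its mileage was never appended.
toy_reviews = ["review of car A", "review of car B", "review of car C"]
toy_mileage = ["10,000 mi. (car A)", "30,000 mi. (car C)"]

rows = list(zip(toy_reviews, toy_mileage))
for row in rows:
    print(row)
# Car B's review is now paired with car C's mileage, and car C's review is dropped.
```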

In [50]:
df['Reviews']

0       [Virtually nothing has gone wrong with my 2020...
1       [2018 XTL, 4X4, 302A Package 48000 mi. Other t...
2       [Stranded today. Could not get to work. Someth...
3       [Excellent road car, quiet, stable, comfortabl...
4       [PURCHASED FROM NYE TOYOTA, MY FIRST TACOMA. \...
                              ...                        
9611    [Uconnect system is awful,\r\nMost of the time...
9612    [Very happy with this car. It’s great value fo...
9613    [I own one and for the past two years I have n...
9614    [This car is great. It is an amazing experienc...
9615    [I have always wanted this vehicle so purchasi...
Name: Reviews, Length: 9616, dtype: object

In [22]:
#df.to_csv('carinfo.csv')

**I forgot to get the model/year of the cars :(**
- running the loop a 2nd time
- this time with shorter time.sleep() delays
- lesson learned
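One way to avoid a second full scrape is to save each page's raw HTML the first time it is downloaded, so a forgotten field can be parsed later from disk. This is a sketch, not what the notebook does: `load_or_fetch` and `fake_page_fetch` are hypothetical names, and `fake_page_fetch` stands in for `requests.get(url).text` so the example runs offline.

```python
import hashlib
import tempfile
from pathlib import Path

def load_or_fetch(link, fetch, cache_dir):
    """Return cached HTML for link if present; otherwise fetch it and cache it."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the link to get a filesystem-safe file name.
    name = hashlib.sha256(link.encode("utf-8")).hexdigest() + ".html"
    path = cache_dir / name
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(link)
    path.write_text(html, encoding="utf-8")
    return html

calls = []

def fake_page_fetch(link):
    # Stand-in for a real HTTP request; records how often it is called.
    calls.append(link)
    return f"<h1 class='listing-title'>car at {link}</h1>"

with tempfile.TemporaryDirectory() as tmp:
    first = load_or_fetch("/vehicledetail/example/", fake_page_fetch, tmp)
    second = load_or_fetch("/vehicledetail/example/", fake_page_fetch, tmp)

print(len(calls))  # the second call is served from the cache
```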

In [63]:
reviews2 = []
driveTrain2 = []
mpg2 = []
fuelType2 = []
trans2 = []
engine2 = []
mileage2 = []
sale_price2 = []
model2 = []

error_links2 = []

In [64]:
for link in tqdm(final_links):
    #using tqdm to check progress
    
    #try to get info, if fails, move to next link
    try:
        #each car url
        url = f'https://www.cars.com{link}'
        response = requests.get(url, timeout=30).text
        soup = BeautifulSoup(response, 'html.parser')
        
        #getting car model and year
        mod = soup.find_all("h1", attrs={"class":"listing-title"})
        model2.append([ti.get_text() for ti in mod])

        #getting reviews
        rev = soup.find_all("p", attrs={"class":"review-body"})
        reviews2.append([review.get_text() for review in rev])

        #getting car details
        details = soup.find_all("dl", attrs={"class":"fancy-description-list"})
        det_list = [det.get_text() for det in details]
        det_split = det_list[0].split('\n')
        #for loop to extract each part
        for i in range(len(det_split)):
            if det_split[i] == 'Drivetrain':
                driveTrain2.append(det_split[i+1])
            if det_split[i] == 'Fuel type':
                fuelType2.append(det_split[i+1])
            if det_split[i] == 'Transmission':
                trans2.append(det_split[i+1])
            if det_split[i] == 'Mileage':
                mileage2.append(det_split[i+1])
            if det_split[i] == 'MPG':
                mpg2.append(det_split[i+3])
            if det_split[i] == 'Engine':
                engine2.append(det_split[i+1])

        #getting price
        car_price = soup.find_all("span", attrs={"class":"primary-price"})
        sale_price2.append([p.get_text() for p in car_price][0])

        time.sleep(random.choice([0.1, 0.5, 0.75]))
        
    except Exception:
        error_links2.append(link) #catching error links

In [65]:
print(len(reviews2),
len(driveTrain2),
len(mpg2),
len(fuelType2),
len(trans2),
len(engine2),
len(mileage2),
len(sale_price2),
len(model2))

9694 9222 9222 9222 9222 9222 9222 9222 9694


In [66]:
df2 = pd.DataFrame(list(zip(model2, reviews2, driveTrain2, mpg2, fuelType2, trans2, engine2, mileage2, sale_price2)), 
                  columns = ['Model', 'Reviews', 'DriveTrain', 'MPG', 'FuelType', 'Transmission', 'Engine', 'Mileage', 'SalePrice'])

In [67]:
df2

| | Model | Reviews | DriveTrain | MPG | FuelType | Transmission | Engine | Mileage | SalePrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | [2020 Toyota RAV4 LE] | [Virtually nothing has gone wrong with my 2020... | Front-wheel Drive | 27–35 | Gasoline | 8-Speed Automatic | 2.5L I4 16V PDI DOHC | 53,200 mi. | $29,000 |
| 1 | [] | [] | Front-wheel Drive | Electric | Electric | 1-Speed Automatic | Electric | 62,439 mi. | $18,989 |
| 2 | [2016 Volkswagen e-Golf SE] | [Stranded today. Could not get to work. Someth... | Rear-wheel Drive | 18–26 | Gasoline | 10-Speed Automatic | 3.0L V6 24V GDI DOHC Twin Turbo | 22,690 mi. | $55,975 |
| 3 | [2020 Lincoln Aviator Reserve RWD] | [Excellent road car, quiet, stable, comfortabl... | Four-wheel Drive | 18–22 | Gasoline | 6-Speed Automatic | 3.5L V6 24V PDI DOHC | 17,854 mi. | $38,900 |
| 4 | [2021 Toyota Tacoma TRD Off Road] | [PURCHASED FROM NYE TOYOTA, MY FIRST TACOMA. \... | Front-wheel Drive | 28–39 | Gasoline | Automatic CVT | 2.5L I4 16V GDI DOHC | 60,907 mi. | $22,125 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9217 | [2022 BMW X5 xDrive40i] | [Only had the X5 for 4 weeks now but so far it... | Four-wheel Drive | 18–25 | Gasoline | 8-Speed Automatic | 3.6L V6 24V MPFI DOHC | 78,378 mi. | $25,500 |
| 9218 | [2019 Toyota Tacoma TRD Off Road] | [When I bought my 2020 Tacoma V6 (3.5L) 4x4 in... | All-wheel Drive | 18–25 | Gasoline | 8-Speed Automatic | 3.6L V6 24V MPFI DOHC | 30,909 mi. | $35,590 |
| 9219 | [2015 INFINITI QX80 Base] | [Everything is great except the rear view came... | Front-wheel Drive | 23–30 | Gasoline | 6-SPEED A/T | 4 Cylinder Engine | 30,186 mi. | $24,000 |
| 9220 | [2020 Toyota Tundra SR5] | [I have owned smaller SUV's and Trucks for ove... | Four-wheel Drive | 15–21 | Gasoline | 8-Speed Automatic | 5.7L V8 16V MPFI OHV | 22,309 mi. | $71,077 |


In [70]:
#df2.to_csv('carinfoV2.csv')