# Review Text Scraping

In this notebook, I use requests and BeautifulSoup to programmatically scrape page URLs, car statistics, and car reviews. I create a csv named `cars.csv` containing the car statistics and create text files containing the car reviews.

## Importing modules

In [2]:
from requests import get
import re
import time
import csv
from bs4 import BeautifulSoup
from more_itertools import unique_everseen

## Getting Car Stats and Reviews from Motortrend

In [80]:
response = get('http://www.motortrend.com/cars/')
soup = BeautifulSoup(response.text, 'html.parser')
make_urls = [item.a.attrs.get('href') for item in [item for item in soup.find_all('li', class_='item-container')] if 'cars' in item.a.attrs.get('href')]

In [17]:
%%time
full_model_list = []
for url in make_urls:
    make = url
    response = get(make)
    soup = BeautifulSoup(response.text, 'html.parser')
    model_list = []
    for griditem in soup.find_all('section',  class_='browse-vehicle-results body-style-container hub-make'):
        for item in griditem.find_all('a'):
            model_list.append(item.attrs.get('href'))
    time.sleep(0.5)
    full_model_list.append(list(set(model_list)))

CPU times: user 14.4 s, sys: 48 ms, total: 14.4 s
Wall time: 1min 56s


**I'm only interested in reviews for cars from MY (Model Year) 2018, so I modify the model list urls as follows:**

In [338]:
model_urls = [item+('2018') for sublist in full_model_list[:-18] for item in sublist]

In [534]:
def car_scraper(urls):
    
    with open('../data/cars.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',',)
        writer.writerow(['year/make/model', 'price', 'value', 'engine', 'trans', 'trim', 'group', 'horsepower', 'mpg'])
                        
        for url in urls:
            yearmakemodel = []
            specs = []
            response = get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            for item in soup.find_all('h1', attrs={'itemprop': 'name'}): #get the car year, model, and make
                yearmakemodel.append(item.text.replace('\n','').replace('\t', ''))
            for item in soup.find_all('section', attrs={'id': 'overview'}): #get the specs
                   for item in item.find_all('div', class_='value'):
                        specs.append(item.text.replace('\n', '').replace('\t', ''))
            writer.writerow(yearmakemodel + specs)
            time.sleep(1)

In [536]:
%%time
car_scraper(model_urls)

CPU times: user 3min 16s, sys: 940 ms, total: 3min 17s
Wall time: 20min 24s


In [584]:
def review_scraper(urls):
    reviews = []
    with open('../reviews/motortrend_reviews.txt', 'w', newline='') as file:
        for url in urls:
            response = get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            review_text=[]
            #get the review text
            for item in soup.find_all('div', class_='entry-content'):
                for text in item.find_all('p'):
                    file.write('\n')
                    file.write(str(text))
                    file.write('\n')
            time.sleep(1)

In [586]:
%%time
review_scraper(model_urls)

CPU times: user 3min, sys: 840 ms, total: 3min
Wall time: 17min 2s


##  Edmunds

In [588]:
model = 'http://www.edmunds.com/acura'
response = get(model)
soup = BeautifulSoup(response.text, 'html.parser')

In [580]:
edmunds_makes = [item.replace('http://www.motortrend.com/cars', 'http://www.edmunds.com') for item in make_urls]

In [605]:
abc = [item.a.attrs.get('href') for item in soup.find_all('div', class_='card-container')]

Scrape the model list from Edmunds and create the list `model_list`:

In [647]:
model_list = []
def model_scraper_edmunds(makes):   
    for url in makes:
        url = url
        response = get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for item in soup.find_all('div', class_='card-container'):
            model_list.append(item.a.attrs.get('href'))
    time.sleep(1)

In [648]:
%%time
model_scraper_edmunds(edmunds_makes)

CPU times: user 7.85 s, sys: 56 ms, total: 7.9 s
Wall time: 31.5 s


Edmunds hosts their reviews in the format 'edmunds.com/make/model/year/review'; let's reformat our list to that so we can iterate through `model_list` and scrape reviews:

In [677]:
sep = '/'

cleaned_model_list = ['http://edmunds.com'+sep.join(x.split(sep)[:3])+'/2018/review' for x in model_list]

In [678]:
review_list = list(unique_everseen(cleaned_model_list))

In [679]:
review_list[0]

'http://edmunds.com/acura/ilx/2018/review'

In [736]:
def edmunds_review_scraper(urls):
    reviews = []
    with open('../reviews/edmunds_reviews.txt', 'w', newline='') as file:
        for url in urls:
            response = get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            review_text=[]
            #get the review text
            file.write(url)
            for item in soup.find_all('div', class_='mb-1'):
                if item.find_all('p') != []:
                    file.write('\n')
                    file.write(str(item.find_all('p')))
                    file.write('\n')
            file.write('---')
            time.sleep(1)

In [737]:
%%time
edmunds_review_scraper(review_list)

CPU times: user 42.4 s, sys: 340 ms, total: 42.7 s
Wall time: 11min 12s


## NewCarTestDrive

NewCarTestDrive (NCTD) doesn't have reviews for all the cars, but can still contribute to our recommender.

In [726]:
pages_to_scrape = ['http://www.newcartestdrive.com/reviews/year/2018/']
pages_to_scrape = pages_to_scrape + ['http://www.newcartestdrive.com/reviews/year/2018/'+'page/'+str(i) for i in range(2,8,1)]

In [730]:
nctd_reviews = []
def nctd_review_urls(urls):
    for url in urls: 
        response = get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for item in soup.find_all('h2'):
            nctd_reviews.append(item.a.attrs.get('href'))
        time.sleep(1)

In [731]:
nctd_review_urls(pages_to_scrape)

In [733]:
len(nctd_reviews)

174

In [741]:
def nctd_review_scraper(urls):
    with open('../reviews/nctd_reviews.txt', 'w', newline='') as file:
        for url in urls:
            response = get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            file.write(url)
            file.write('\n')
            file.write(str(soup.find_all('p')))
            file.write('\n')
            file.write('---------')
            time.sleep(1)

In [743]:
%%time
nctd_review_scraper(nctd_reviews)

CPU times: user 7.28 s, sys: 80 ms, total: 7.36 s
Wall time: 4min 27s


## Car Connection

In [763]:
#get all the make pages
makes = []
url = 'https://www.thecarconnection.com/new-cars'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for item in soup.find_all('div', class_='content')[0].find_all('a', class_='add-zip'):
    makes.append(item.attrs.get('href'))

In [765]:
make_urls = ['http://www.thecarconnection.com'+suffix for suffix in makes]

In [795]:
model_list = []
def model_scraper_cc(makes):   
    for url in makes:
        url = url
        response = get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for item in soup.find_all('div', class_='name'):
            model_list.append(item.a.attrs.get('href'))
    time.sleep(1)

In [796]:
model_scraper_cc(make_urls)

In [798]:
review_urls = ['http://www.thecarconnection.com'+suffix for suffix in model_list]

In [822]:
def cc_review_scraper(urls):
    with open('../reviews/cc_reviews.txt', 'w', newline='') as file:
        i=1
        for url in urls:
            start = time.time()
            response = get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            for item in soup.find_all('div', id='expert-review'):
                file.write(url)
                file.write('\n')
                file.write(str(item.find_all('p')))
                file.write('\n')
                file.write('--------------')
                time.sleep(1)
                print(f'Scraped review {i} in', time.time()-start)
            i+=1
        print(f'Scraped {i} reviews in total')

In [823]:
cc_review_scraper(review_urls)

Scraped review 1 in 1.2414438724517822
Scraped review 2 in 1.2473793029785156
Scraped review 3 in 1.2539036273956299
Scraped review 4 in 1.2398560047149658
Scraped review 5 in 1.2841401100158691
Scraped review 6 in 1.2349932193756104
Scraped review 7 in 1.2350366115570068
Scraped review 8 in 1.2282752990722656
Scraped review 9 in 1.2344415187835693
Scraped review 12 in 1.218871831893921
Scraped review 13 in 1.252692461013794
Scraped review 14 in 1.2804896831512451
Scraped review 15 in 1.2494783401489258
Scraped review 16 in 1.2521746158599854
Scraped review 17 in 1.2688920497894287
Scraped review 18 in 1.244976282119751
Scraped review 19 in 1.2533769607543945
Scraped review 20 in 1.2575037479400635
Scraped review 21 in 1.2396979331970215
Scraped review 22 in 1.2764084339141846
Scraped review 24 in 1.2442049980163574
Scraped review 25 in 1.2844531536102295
Scraped review 26 in 1.2306618690490723
Scraped review 27 in 2.5188000202178955
Scraped review 28 in 1.9448268413543701
Scraped revi

Scraped review 369 in 1.4741520881652832
Scraped review 370 in 1.2252628803253174
Scraped review 371 in 10.05290937423706
Scraped review 373 in 1.2682194709777832
Scraped review 375 in 1.2656400203704834
Scraped review 376 in 1.2433512210845947
Scraped review 377 in 1.480130910873413
Scraped review 378 in 1.2424602508544922
Scraped review 379 in 1.2239339351654053
Scraped review 380 in 1.2444164752960205
Scraped review 381 in 1.5001840591430664
Scraped review 382 in 1.2449746131896973
Scraped review 383 in 1.1959545612335205
Scraped review 384 in 1.4305927753448486
Scraped review 385 in 1.4214043617248535
Scraped review 386 in 1.2649378776550293
Scraped review 387 in 1.2516348361968994
Scraped review 388 in 1.2276029586791992
Scraped review 389 in 1.3233323097229004
Scraped review 390 in 2.18259334564209
Scraped review 391 in 1.2769136428833008
Scraped review 392 in 1.262416124343872
Scraped review 393 in 1.2981784343719482
Scraped review 394 in 1.9397525787353516
Scraped review 395 in

## ConsumerReports

In [82]:
make_urls[0]

'http://www.motortrend.com/cars/acura/'

In [172]:
cg_makes = []
response = get('http://consumerguide.com/find-review/')
soup = BeautifulSoup(response.text, 'html.parser')
for item in soup.find_all('div', class_='vc_column-inner '):
    for link in item.find_all('p'):
        for url in link.find_all('a'):
            cg_makes.append(url.attrs.get('href'))

In [177]:
cg_review_list = []
def cg_model_scrape(urls):
    for url in urls:
        response = get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for item in soup.find_all('div', class_='mk-fancy-table table-style1 ')[0].find_all('td'):
            if (re.findall('\d+', str(item))[0] == '2018'):
                cg_review_list.append(item.a.attrs.get('href'))
            time.sleep(1)

In [179]:
%%time
cg_model_scrape(cg_makes)

CPU times: user 1.82 s, sys: 16 ms, total: 1.83 s
Wall time: 9min 33s


In [189]:
cg_review_list[63:70]

['http://consumerguide.com/honda/hr-v/',
 'http://consumerguide.com/honda/odyssey/',
 'http://consumerguide.com/honda/pilot/',
 'http://consumerguide.com/honda/ridgeline/',
 'http://consumerguide.com/hyundai/accent/',
 'http://consumerguide.com/hyundai/elantra/',
 'http://consumerguide.com/hyundai/santa-fe/']

In [205]:
def cg_review_scraper(urls):
    with open('../reviews/cg_reviews.txt', 'w', newline='') as file:        
        for i, url in enumerate(urls):
            start = time.time()
            response = get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            try:
                for item in soup.find_all('div', id='text-block-4'):
                    file.write(url)
                    file.write('\n')
                    file.write(str(item.find_all('p')))
                    file.write('\n')
                for item in soup.find_all('div', id='text-block-6'):
                    file.write(str(item.find_all('p')))
                    file.write('\n')
                    file.write('----------')
                    time.sleep(2) 
                    print(f'Scraped review {i+1} in {time.time()-start} seconds')
            except:
                print(f'Scraping review {i+1} failed')
        print(f'Scraped {i+1} reviews in total')
        file.write(f'Scraped {i+1} reviews in total')

In [206]:
cg_review_scraper(cg_review_list)

Scraped review 1 in 2.3886260986328125 seconds
Scraped review 2 in 2.393385410308838 seconds
Scraped review 3 in 2.38273024559021 seconds
Scraped review 4 in 2.353861093521118 seconds
Scraped review 5 in 2.3817076683044434 seconds
Scraped review 6 in 2.3827719688415527 seconds
Scraped review 7 in 2.3845462799072266 seconds
Scraped review 8 in 2.391955614089966 seconds
Scraped review 9 in 2.382988691329956 seconds
Scraped review 10 in 2.5116965770721436 seconds
Scraped review 11 in 2.39180588722229 seconds
Scraped review 12 in 2.3909687995910645 seconds
Scraped review 13 in 2.391101121902466 seconds
Scraped review 14 in 2.3909647464752197 seconds
Scraped review 15 in 2.3813881874084473 seconds
Scraped review 16 in 2.390390634536743 seconds
Scraped review 17 in 2.7104151248931885 seconds
Scraped review 18 in 2.3817732334136963 seconds
Scraped review 19 in 2.3788275718688965 seconds
Scraped review 20 in 2.425126552581787 seconds
Scraped review 21 in 2.329712152481079 seconds
Scraped revie

In [208]:
with open('../reviews/cg_reviews.txt', 'a+', newline='') as file:        
    for i, url in enumerate(cg_review_list[150:]):
        start = time.time()
        response = get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for item in soup.find_all('div', id='text-block-4'):
            file.write(url)
            file.write('\n')
            file.write(str(item.find_all('p')))
            file.write('\n')
        for item in soup.find_all('div', id='text-block-6'):
            file.write(str(item.find_all('p')))
            file.write('\n')
            file.write('----------')
            time.sleep(2) 
            print(f'Scraped review {i+1} in {time.time()-start} seconds')

Scraped review 1 in 3.225029230117798 seconds
Scraped review 2 in 3.2617709636688232 seconds
Scraped review 3 in 3.122955560684204 seconds
Scraped review 4 in 3.21842098236084 seconds
Scraped review 5 in 3.2574336528778076 seconds
Scraped review 6 in 3.1450846195220947 seconds
Scraped review 7 in 3.22688627243042 seconds
Scraped review 8 in 2.39076828956604 seconds
Scraped review 9 in 3.223116636276245 seconds
Scraped review 10 in 3.1028568744659424 seconds
Scraped review 11 in 3.1718270778656006 seconds
Scraped review 12 in 3.4250197410583496 seconds
Scraped review 13 in 3.1751487255096436 seconds
Scraped review 14 in 5.9352946281433105 seconds
Scraped review 15 in 3.152092218399048 seconds
Scraped review 16 in 3.1108949184417725 seconds
Scraped review 17 in 3.1898841857910156 seconds
Scraped review 18 in 3.3364851474761963 seconds
Scraped review 19 in 3.164276361465454 seconds


## Kelley Blue Book