# Review Text Scraping

In this notebook, I use requests and BeautifulSoup to programatically scrape page URLs, car statistics, and car reviews. I create a csv named `cars.csv` containing the car statistics and create text files containing the car reviews.

## Importing modules

In [390]:
from requests import get
import re
import time
import csv
from bs4 import BeautifulSoup
from more_itertools import unique_everseen

## Getting Car Stats and Reviews from Motortrend

In [310]:
%%time
full_model_list = []
for url in make_urls:
    make = url
    response = get(make)
    soup = BeautifulSoup(response.text, 'html.parser')
    model_list = []
    for griditem in soup.find_all('section',  class_='browse-vehicle-results body-style-container hub-make'):
        for item in griditem.find_all('a'):
            model_list.append(item.attrs.get('href'))
    time.sleep(0.5)
    full_model_list.append(list(set(model_list)))

CPU times: user 14.7 s, sys: 40 ms, total: 14.7 s
Wall time: 1min 45s


**I'm only interested in reviews for cars from MY (Model Year) 2018, so I modify the model list urls as follows:**

In [338]:
model_urls = [item+('2018') for sublist in full_model_list[:-18] for item in sublist]

In [534]:
def car_scraper(urls):
    
    with open('cars.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',',)
        writer.writerow(['year/make/model', 'price', 'value', 'engine', 'trans', 'trim', 'group', 'horsepower', 'mpg'])
                        
        for url in urls:
            yearmakemodel = []
            specs = []
            response = get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            for item in soup.find_all('h1', attrs={'itemprop': 'name'}): #get the car year, model, and make
                yearmakemodel.append(item.text.replace('\n','').replace('\t', ''))
            for item in soup.find_all('section', attrs={'id': 'overview'}): #get the specs
                   for item in item.find_all('div', class_='value'):
                        specs.append(item.text.replace('\n', '').replace('\t', ''))
            writer.writerow(yearmakemodel + specs)
            time.sleep(1)

In [536]:
%%time
car_scraper(model_urls)

CPU times: user 3min 16s, sys: 940 ms, total: 3min 17s
Wall time: 20min 24s


In [584]:
def review_scraper(urls):
    reviews = []
    with open('motortrend_reviews.txt', 'w', newline='') as file:
        for url in urls:
            response = get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            review_text=[]
            #get the review text
            for item in soup.find_all('div', class_='entry-content'):
                for text in item.find_all('p'):
                    file.write('\n')
                    file.write(str(text))
                    file.write('\n')
            time.sleep(1)

In [586]:
%%time
review_scraper(model_urls)

CPU times: user 3min, sys: 840 ms, total: 3min
Wall time: 17min 2s


##  Edmunds

In [588]:
model = 'http://www.edmunds.com/acura'
response = get(model)
soup = BeautifulSoup(response.text, 'html.parser')

In [580]:
edmunds_makes = [item.replace('http://www.motortrend.com/cars', 'http://www.edmunds.com') for item in make_urls]

In [605]:
abc = [item.a.attrs.get('href') for item in soup.find_all('div', class_='card-container')]

Scrape the model list from Edmunds and create the list `model_list`:

In [647]:
model_list = []
def model_scraper_edmunds(makes):   
    for url in makes:
        url = url
        response = get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for item in soup.find_all('div', class_='card-container'):
            model_list.append(item.a.attrs.get('href'))
    time.sleep(1)

In [648]:
%%time
model_scraper_edmunds(edmunds_makes)

CPU times: user 7.85 s, sys: 56 ms, total: 7.9 s
Wall time: 31.5 s


Edmunds hosts their reviews in the format 'edmunds.com/make/model/year/review'; let's reformat our list to that so we can iterate through `model_list` and scrape reviews:

In [677]:
sep = '/'

cleaned_model_list = ['http://edmunds.com'+sep.join(x.split(sep)[:3])+'/2018/review' for x in model_list]

In [678]:
review_list = list(unique_everseen(cleaned_model_list))

In [679]:
review_list[0]

'http://edmunds.com/acura/ilx/2018/review'

In [736]:
def edmunds_review_scraper(urls):
    reviews = []
    with open('edmunds_reviews.txt', 'w', newline='') as file:
        for url in urls:
            response = get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            review_text=[]
            #get the review text
            file.write(url)
            for item in soup.find_all('div', class_='mb-1'):
                if item.find_all('p') != []:
                    file.write('\n')
                    file.write(str(item.find_all('p')))
                    file.write('\n')
            file.write('---')
            time.sleep(1)

In [737]:
%%time
edmunds_review_scraper(review_list)

CPU times: user 42.4 s, sys: 340 ms, total: 42.7 s
Wall time: 11min 12s


## NewCarTestDrive

NewCarTestDrive (NCTD) doesn't have reviews for all the cars, but can still contribute to our recommender.

In [726]:
pages_to_scrape = ['http://www.newcartestdrive.com/reviews/year/2018/']
pages_to_scrape = pages_to_scrape + ['http://www.newcartestdrive.com/reviews/year/2018/'+'page/'+str(i) for i in range(2,8,1)]

In [730]:
nctd_reviews = []
def nctd_review_urls(urls):
    for url in urls: 
        response = get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for item in soup.find_all('h2'):
            nctd_reviews.append(item.a.attrs.get('href'))
        time.sleep(1)

In [731]:
nctd_review_urls(pages_to_scrape)

In [733]:
len(nctd_reviews)

174

In [741]:
def nctd_review_scraper(urls):
    with open('nctd_reviews.txt', 'w', newline='') as file:
        for url in urls:
            response = get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            file.write(url)
            file.write('\n')
            file.write(str(soup.find_all('p')))
            file.write('\n')
            file.write('---------')
            time.sleep(1)

In [743]:
%%time
nctd_review_scraper(nctd_reviews)

CPU times: user 7.28 s, sys: 80 ms, total: 7.36 s
Wall time: 4min 27s


## Car Connection

In [763]:
#get all the make pages
makes = []
url = 'https://www.thecarconnection.com/new-cars'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for item in soup.find_all('div', class_='content')[0].find_all('a', class_='add-zip'):
    makes.append(item.attrs.get('href'))

In [765]:
make_urls = ['http://www.thecarconnection.com'+suffix for suffix in makes]

In [795]:
model_list = []
def model_scraper_cc(makes):   
    for url in makes:
        url = url
        response = get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for item in soup.find_all('div', class_='name'):
            model_list.append(item.a.attrs.get('href'))
    time.sleep(1)

In [796]:
model_scraper_cc(make_urls)

In [798]:
review_urls = ['http://www.thecarconnection.com'+suffix for suffix in model_list]

In [822]:
def cc_review_scraper(urls):
    with open('cc_reviews.txt', 'w', newline='') as file:
        i=1
        for url in urls:
            start = time.time()
            response = get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            for item in soup.find_all('div', id='expert-review'):
                file.write(url)
                file.write('\n')
                file.write(str(item.find_all('p')))
                file.write('\n')
                file.write('--------------')
                time.sleep(1)
                print(f'Scraped review {i} in', time.time()-start)
            i+=1
        print(f'Scraped {i} reviews in total')

In [823]:
cc_review_scraper(review_urls)

Scraped review 1 in 1.2414438724517822
Scraped review 2 in 1.2473793029785156
Scraped review 3 in 1.2539036273956299
Scraped review 4 in 1.2398560047149658
Scraped review 5 in 1.2841401100158691
Scraped review 6 in 1.2349932193756104
Scraped review 7 in 1.2350366115570068
Scraped review 8 in 1.2282752990722656
Scraped review 9 in 1.2344415187835693
Scraped review 12 in 1.218871831893921
Scraped review 13 in 1.252692461013794
Scraped review 14 in 1.2804896831512451
Scraped review 15 in 1.2494783401489258
Scraped review 16 in 1.2521746158599854
Scraped review 17 in 1.2688920497894287
Scraped review 18 in 1.244976282119751
Scraped review 19 in 1.2533769607543945
Scraped review 20 in 1.2575037479400635
Scraped review 21 in 1.2396979331970215
Scraped review 22 in 1.2764084339141846
Scraped review 24 in 1.2442049980163574
Scraped review 25 in 1.2844531536102295
Scraped review 26 in 1.2306618690490723
Scraped review 27 in 2.5188000202178955
Scraped review 28 in 1.9448268413543701
Scraped revi

Scraped review 369 in 1.4741520881652832
Scraped review 370 in 1.2252628803253174
Scraped review 371 in 10.05290937423706
Scraped review 373 in 1.2682194709777832
Scraped review 375 in 1.2656400203704834
Scraped review 376 in 1.2433512210845947
Scraped review 377 in 1.480130910873413
Scraped review 378 in 1.2424602508544922
Scraped review 379 in 1.2239339351654053
Scraped review 380 in 1.2444164752960205
Scraped review 381 in 1.5001840591430664
Scraped review 382 in 1.2449746131896973
Scraped review 383 in 1.1959545612335205
Scraped review 384 in 1.4305927753448486
Scraped review 385 in 1.4214043617248535
Scraped review 386 in 1.2649378776550293
Scraped review 387 in 1.2516348361968994
Scraped review 388 in 1.2276029586791992
Scraped review 389 in 1.3233323097229004
Scraped review 390 in 2.18259334564209
Scraped review 391 in 1.2769136428833008
Scraped review 392 in 1.262416124343872
Scraped review 393 in 1.2981784343719482
Scraped review 394 in 1.9397525787353516
Scraped review 395 in