# Web Scraping for Vehicle Data

### What is this project about?

As a data entry operator at a software company that specializes in the automotive industry, my daily task is to update vehicle information. Often the changes are small, but sometimes the work is heavy and tedious, such as updating the CO2 emission level for thousands of cars. You have to go to a website, search for a model, and click a few times just to get the information you need, then put it in an Excel file, and repeat.

Fortunately, I am a lazy person. I won't spend a whole week sitting there repeating that process. That's why I wrote a Python program that scrapes a website for the information I want and exports it to a CSV file.

### What is my goal?

My goal is to have a CSV file that contains vehicle data by Make.

### What are my approaches?

I will use [carsguide](https://www.carsguide.com.au/) as the data source. It is one of the biggest car websites, with more than 100,000 cars listed for sale. Everything you need to know about a car can be found there. For web scraping techniques, this [tutorial](https://www.dataquest.io/blog/web-scraping-beautifulsoup/) is my reference.

Below are my steps to achieve my goals:

* Step 1: Collect info of only one car.
* Step 2: Collect info of all cars on one page.
* Step 3: Collect info of all cars on multiple pages by Make.

Let's start...

## I. Collecting information of one car:

### Part 1. Requesting a web page:

[2019 Alfa Romeo Giulietta Veloce TCT](https://www.carsguide.com.au/cars-for-sale/D_10315385/ALFA+ROMEO--GIULIETTA--WA+-+Perth--OSBORNE+PARK+6017,+WA--Hatchback?searchKey=cg_s.293fddbef3edd83e3bdfc3b4825c9bf9#pos0) is my sample for the first step. To request the content of this single web page, I will use the `get` function from the [requests](https://realpython.com/python-requests/) module:

In [1]:
from requests import get
url = 'https://www.carsguide.com.au/cars-for-sale/D_10315385/ALFA+ROMEO--GIULIETTA--WA+-+Perth--OSBORNE+PARK+6017,+WA--Hatchback?searchKey=cg_s.293fddbef3edd83e3bdfc3b4825c9bf9#pos0'
response = get(url)
response.status_code

200

A status code of `200` means that the page downloaded successfully. A code starting with `4` or `5` indicates an error.
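
As a small illustration of how status codes fall into classes, here is a minimal sketch. The helper name `classify_status` is hypothetical and not part of `requests`; it only mirrors the rule described above.

```python
def classify_status(status_code):
    """Return a coarse label for an HTTP status code."""
    if 200 <= status_code < 300:
        return 'success'
    if 400 <= status_code < 500:
        return 'client error'
    if 500 <= status_code < 600:
        return 'server error'
    return 'other'


print(classify_status(200))  # success
print(classify_status(404))  # client error
```

In practice, `response.raise_for_status()` from `requests` raises an exception for `4xx` and `5xx` responses, which may be more convenient than checking the code by hand.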

### Part 2. Understanding the HTML structure of the single page:

HTML is the standard markup language for web pages. If you are using [Chrome](https://developers.google.com/web/tools/chrome-devtools/?utm_source=dcc&utm_medium=redirect&utm_campaign=2018Q2), right-click the information that interests you and select `Inspect` to see how the content is built with HTML:

![Car's Name](images/details_page_heading.png)

Simply put, to show the title on the left, a programmer has to write the code in the right panel. I won't go into details because that is outside the scope of this project. Please visit the [HTML Tutorial](https://www.w3schools.com/html/html_intro.asp) to learn what tags such as `div`, `h1`, and `span` mean.

Below is the information I am interested in collecting:

- Car's Title (above image)

- Body Type

![Body Type](images/body_type.png)

- and all specifications in Tech Specs tab

![Tech Specs Tab](images/tab_tech_specs.png)

`div class = 'tab-tech-specs'` is the parent tag, and the `dl collapse` elements (yellow lines) are its children. There are 8 types of specs, so there should be one `dl collapse` for each. Let's extract all this information by parsing the HTML document.

### Part 3. Using BeautifulSoup to parse the HTML content and extracting information:

[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is one of the most widely used web scraping modules for Python.

In [2]:
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

bs4.BeautifulSoup

#### Part 3.1. Extracting car's title:

I'll use the [find()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find) method to extract the tag that contains the car's title.

In [3]:
details_page_heading = html_soup.find('h1', class_ = 'details-page-heading')
print(type(details_page_heading))
print(len(details_page_heading))
print('\n')
print(details_page_heading)

<class 'bs4.element.Tag'>
3


<h1 class="details-page-heading">
<span>2019 Alfa Romeo Giulietta Veloce TCT</span>
</h1>


`details_page_heading` is a [tag](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-objects) object. Its length is 3 because `len()` counts the tag's children: a newline, the `<span>`, and another newline. To get the string **2019 Alfa Romeo Giulietta Veloce TCT**, we have to access the text within the `<span>` tag:

In [4]:
car_title = details_page_heading.span.text
print(car_title)

2019 Alfa Romeo Giulietta Veloce TCT


#### Part 3.2. Extracting body type:

For the body type, we cannot use the same approach as for the car's title, because `details_page_glance.td` always returns the first `td` in the table:

In [5]:
details_page_glance = html_soup.find('table', class_ = 'details-page-tab-table more-details clearfix')
body_type = details_page_glance.td.text
print(body_type)

$47,300                        Calculate my repayments


We might try to use an index to select the `td` that holds the body type and access its text. However, that won't work, because indexing a tag object looks up an HTML attribute by name rather than a child by position:

In [6]:
body_type = details_page_glance[3].td.text

KeyError: 3

The solution is to collect the text of every `td` cell into a list, so we can use a list index to get the body type:

In [7]:
rows = details_page_glance.findChildren(['th', 'tr'])
glanceList = []
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        strValue = cell.string
        glanceList.append(strValue)
print(glanceList)

[None, '2019 Alfa Romeo Giulietta Veloce TCT Series 2', None, 'Hatchback, 5 Doors, 5 Seats', 'Automatic', '4 cyl, 1.7 L', 'Front', 'Premium', '6.8 L / 100 km', '-', 'Grey / -', '-', '-', 'ZAR94000007523817', '51941', None]


The body type is the fourth item of the list:

In [8]:
body_type = glanceList[3]
print(body_type)

Hatchback, 5 Doors, 5 Seats
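
Relying on position 3 only works while the table keeps the same layout. As a hedged alternative, the sketch below recognizes the body type by its shape instead. It assumes the text always looks like `<body>, <n> Doors, <n> Seats`, which is inferred only from the sample outputs in this notebook; `parse_body_type` and `find_body_type` are hypothetical helper names.

```python
import re

# Assumed format, based only on the samples seen in this notebook.
BODY_TYPE_PATTERN = re.compile(
    r'^(?P<body>.+?),\s*(?P<doors>\d+)\s+Doors,\s*(?P<seats>\d+)\s+Seats$'
)


def parse_body_type(text):
    """Split 'Hatchback, 5 Doors, 5 Seats' into (body, doors, seats)."""
    match = BODY_TYPE_PATTERN.match(text.strip())
    if match is None:
        return None
    return match.group('body'), int(match.group('doors')), int(match.group('seats'))


def find_body_type(cells):
    """Return the first cell that looks like a body type, whatever its position."""
    for cell in cells:
        if cell is not None and parse_body_type(cell) is not None:
            return cell
    return None


sample = [None, '2019 Alfa Romeo Giulietta Veloce TCT Series 2', None,
          'Hatchback, 5 Doors, 5 Seats', 'Automatic']
print(find_body_type(sample))                          # Hatchback, 5 Doors, 5 Seats
print(parse_body_type('Ute / Tray, 2 Doors, 2 Seats'))  # ('Ute / Tray', 2, 2)
```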


Now we have the car's title and body type. The next job is collecting the car's specifications.

#### Part 3.3. Extracting specifications:

This time we need the [find_all()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all) method to find all 8 types of specs in the Tech Specs tab. If we inspect the HTML more closely, we can see that the content we need is contained within `dl` tags, which are nested within a `div` tag. Our task is to extract the 8 `dl` tags. Before doing that, we should figure out what distinguishes them from other `dl` elements on the page. We can use `Ctrl + F` to search for the class `details-page-tab-desc-list clearfix`:

![Detail Tab Tech Specs](images/details_tab_tech_specs.png)

The `8 matches` result equals the number of spec types.

In [9]:
tab_tech_specs = html_soup.find_all('dl', class_ = 'details-page-tab-desc-list clearfix')
print(type(tab_tech_specs))
print(len(tab_tech_specs))
print(tab_tech_specs[0])

<class 'bs4.element.ResultSet'>
8
<dl class="details-page-tab-desc-list clearfix" collapse="!techSpecsComfort">
<dt>Seating capacity:</dt>
<dd>5</dd>
</dl>


Since a `ResultSet` supports indexing, we are able to:

1. Separating each item in `tab_tech_specs` and accessing its `<dd>` tags:

In [10]:
techSpecsComfort = tab_tech_specs[0].find_all('dd')
techSpecsTrans = tab_tech_specs[1].find_all('dd')
techSpecsExterior = tab_tech_specs[2].find_all('dd')
techSpecsPerformance = tab_tech_specs[3].find_all('dd')
techSpecsDimensions = tab_tech_specs[4].find_all('dd')
techSpecsGeneral = tab_tech_specs[5].find_all('dd')

2. Accessing text of `<dd>` tag:

In [11]:
make = techSpecsGeneral[1].text
family = techSpecsGeneral[0].text
variant = techSpecsGeneral[2].text
series = techSpecsGeneral[3].text
doors = techSpecsExterior[4].text
seating_capacity = techSpecsComfort[0].text
transmission_type = techSpecsTrans[0].text
drive_type = techSpecsTrans[1].text
engine_cc = techSpecsPerformance[1].text
cylinders = techSpecsPerformance[2].text
fuel_type = techSpecsPerformance[7].text
fuel_consumption = techSpecsPerformance[9].text
co2_level = techSpecsPerformance[15].text
overal_height = techSpecsDimensions[0].text
overal_lenght = techSpecsDimensions[1].text
overal_width = techSpecsDimensions[2].text
wheelbase = techSpecsDimensions[4].text

print(car_title, make, family, variant, series, body_type, doors, seating_capacity, transmission_type, drive_type, 
      engine_cc, cylinders, fuel_type, fuel_consumption, co2_level, overal_height, overal_lenght, overal_width, wheelbase)

2019 Alfa Romeo Giulietta Veloce TCT Alfa Romeo Giulietta Veloce TCT Series 2 Hatchback, 5 Doors, 5 Seats 5 5 Automatic Front 1742 4 Premium 6.8 L / 100 km 157 1465 mm 4351 mm 1798 mm 2634 mm
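
Positions such as `techSpecsPerformance[15]` only stay correct while every page lists its specs in the same order. One possible alternative, sketched below, is to pair each `<dt>` label with its `<dd>` value and look values up by label. The helper `specs_to_dict` is hypothetical, and it works on plain strings that would come from something like `[dt.text for dt in section.find_all('dt')]`.

```python
def specs_to_dict(labels, values):
    """Pair each <dt> label with its <dd> value, keyed by the cleaned label."""
    return {
        label.strip().rstrip(':'): value.strip()
        for label, value in zip(labels, values)
    }


# 'Seating capacity:' appears in the HTML shown above; the second label is made up for the demo.
labels = ['Seating capacity:', 'Example label:']
values = ['5', ' example value ']
specs = specs_to_dict(labels, values)
print(specs['Seating capacity'])  # 5
```

A lookup such as `specs.get('Seating capacity', '-')` would then return a placeholder instead of raising an error when a label is missing.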


### Part 4. Exporting Pandas DataFrame to a CSV file:

We have the desired information for one car. The next step is to put everything together in a table. We will use a [Pandas](https://pandas.pydata.org/pandas-docs/stable/) DataFrame to do that.

In [12]:
# Lists to store the scraped data in
titleList = []
makeList = []
familyList = []
variantList = []
seriesList = []
body_typeList = []
doorsList = []
seating_capacityList = []
transmission_typeList = []
drive_typeList = []
engine_ccList = []
cylindersList = []
fuel_typeList = []
fuel_consumptionList = []
co2_levelList = []
overal_heightList = []
overal_lenghtList = []
overal_widthList = []
wheelbaseList = []

# Append data to lists
titleList.append(car_title)
makeList.append(make)
familyList.append(family)
variantList.append(variant)
seriesList.append(series)
body_typeList.append(body_type)
doorsList.append(doors)
seating_capacityList.append(seating_capacity)
transmission_typeList.append(transmission_type)
drive_typeList.append(drive_type)
engine_ccList.append(engine_cc)
cylindersList.append(cylinders)
fuel_typeList.append(fuel_type)
fuel_consumptionList.append(fuel_consumption)
co2_levelList.append(co2_level)
overal_heightList.append(overal_height)
overal_lenghtList.append(overal_lenght)
overal_widthList.append(overal_width)
wheelbaseList.append(wheelbase)

import pandas as pd
oneCar_df = pd.DataFrame({'title':titleList, 'make':makeList, 'family':familyList, 'variant':variantList, 
'series':seriesList, 'body_type':body_typeList, 'doors':doorsList, 
'seating_capacity':seating_capacityList, 'transmission_type':transmission_typeList, 
'drive_type':drive_typeList, 'engine_cc':engine_ccList, 'cylinders':cylindersList, 'fuel_type':fuel_typeList, 
'fuel_consumption':fuel_consumptionList, 'co2_level':co2_levelList, 'overal_height':overal_heightList, 
'overal_lenght':overal_lenghtList, 'overal_width':overal_widthList, 'wheelbase':wheelbaseList})
print(oneCar_df.info())
oneCar_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 19 columns):
title                1 non-null object
make                 1 non-null object
family               1 non-null object
variant              1 non-null object
series               1 non-null object
body_type            1 non-null object
doors                1 non-null object
seating_capacity     1 non-null object
transmission_type    1 non-null object
drive_type           1 non-null object
engine_cc            1 non-null object
cylinders            1 non-null object
fuel_type            1 non-null object
fuel_consumption     1 non-null object
co2_level            1 non-null object
overal_height        1 non-null object
overal_lenght        1 non-null object
overal_width         1 non-null object
wheelbase            1 non-null object
dtypes: object(19)
memory usage: 232.0+ bytes
None


Unnamed: 0,title,make,family,variant,series,body_type,doors,seating_capacity,transmission_type,drive_type,engine_cc,cylinders,fuel_type,fuel_consumption,co2_level,overal_height,overal_lenght,overal_width,wheelbase
0,2019 Alfa Romeo Giulietta Veloce TCT,Alfa Romeo,Giulietta,Veloce TCT,Series 2,"Hatchback, 5 Doors, 5 Seats",5,5,Automatic,Front,1742,4,Premium,6.8 L / 100 km,157,1465 mm,4351 mm,1798 mm,2634 mm


The last thing to do is export it to a CSV file.

In [13]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
oneCar_df.to_csv(r'F:\DuyLam\Project\web_scraping_for_vehicle_data\oneCar.csv')
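
Keeping 19 parallel lists in sync is easy to get wrong. A minimal sketch of another layout is one dictionary per car. To stay self-contained, the example uses the standard-library `csv` module instead of pandas and only three of the columns; `records_to_csv` is a hypothetical helper.

```python
import csv
import io

# One dictionary per car; the values are copied from the single-car output above.
records = [
    {'title': '2019 Alfa Romeo Giulietta Veloce TCT', 'make': 'Alfa Romeo', 'co2_level': '157'},
]


def records_to_csv(records, fieldnames):
    """Serialize a list of dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = records_to_csv(records, ['title', 'make', 'co2_level'])
print(csv_text)
```

The same list of dictionaries can also be passed directly to `pd.DataFrame(records)`, so this layout does not rule out pandas.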

Step one is done. The second step is to collect info of all cars on the first page.

## II. Collecting info of all cars on first page:

Open this [link](https://www.carsguide.com.au/buy-a-car/all-new-and-used/all-states/all-locations/all-bodytypes/all-makes?searchOffset=0&searchLimit=12) and you will see that there are 8,418 pages for 101,007 cars, with 12 cars per page. Last time, we only dealt with one car. Now it's time to upgrade our program to handle 12 cars at a time.

In [None]:
def oneCar(input_link):
    # download content of single page:
    url = input_link
    response = get(url)
    response.status_code
    
    # parse the HTML content
    html_soup = BeautifulSoup(response.text, 'html.parser')
    
    # extract car's title
    details_page_heading = html_soup.find('h1', class_ = 'details-page-heading')
    car_title = details_page_heading.span.text
    
    # extract body type
    details_page_glance = html_soup.find('table', class_ = 'details-page-tab-table more-details clearfix')
    rows = details_page_glance.findChildren(['th', 'tr'])
    glanceList = []
    for row in rows:
        cells = row.findChildren('td')
        for cell in cells:
            strValue = cell.string
            glanceList.append(strValue)
    body_type = glanceList[3]
    
    # extract all tech specs
    tab_tech_specs = html_soup.find_all('dl', class_ = 'details-page-tab-desc-list clearfix')
     
    techSpecsComfort = tab_tech_specs[0].find_all('dd')
    techSpecsTrans = tab_tech_specs[1].find_all('dd')
    techSpecsExterior = tab_tech_specs[2].find_all('dd')
    techSpecsPerformance = tab_tech_specs[3].find_all('dd')
    techSpecsDimensions = tab_tech_specs[4].find_all('dd')
    techSpecsGeneral = tab_tech_specs[5].find_all('dd')

    make = techSpecsGeneral[1].text
    family = techSpecsGeneral[0].text
    variant = techSpecsGeneral[2].text
    series = techSpecsGeneral[3].text
    doors = techSpecsExterior[4].text
    seating_capacity = techSpecsComfort[0].text
    transmission_type = techSpecsTrans[0].text
    drive_type = techSpecsTrans[1].text
    engine_cc = techSpecsPerformance[1].text
    cylinders = techSpecsPerformance[2].text
    fuel_type = techSpecsPerformance[7].text
    fuel_consumption = techSpecsPerformance[9].text
    co2_level = techSpecsPerformance[15].text
    overal_height = techSpecsDimensions[0].text
    overal_lenght = techSpecsDimensions[1].text
    overal_width = techSpecsDimensions[2].text
    wheelbase = techSpecsDimensions[4].text

    # Append data to lists
    titleList.append(car_title)
    makeList.append(make)
    familyList.append(family)
    variantList.append(variant)
    seriesList.append(series)
    body_typeList.append(body_type)
    doorsList.append(doors)
    seating_capacityList.append(seating_capacity)
    transmission_typeList.append(transmission_type)
    drive_typeList.append(drive_type)
    engine_ccList.append(engine_cc)
    cylindersList.append(cylinders)
    fuel_typeList.append(fuel_type)
    fuel_consumptionList.append(fuel_consumption)
    co2_levelList.append(co2_level)
    overal_heightList.append(overal_height)
    overal_lenghtList.append(overal_lenght)
    overal_widthList.append(overal_width)
    wheelbaseList.append(wheelbase)

Continuing to inspect the [page](https://www.carsguide.com.au/buy-a-car/all-new-and-used/all-states/all-locations/all-bodytypes/all-makes?searchOffset=0&searchLimit=12), we can see that the tag `div class="listing-cars"` contains the links of 12 cars. Double-checking by searching (`Ctrl + F`) for `carListing carListing-slideBtn` also returns `12 matches`.

![Listing Cars](images/listing_cars.png)

To collect these cars' information, we will:

1. Make a list that contains the 12 links.
2. Loop through that list with our code above.

In [14]:
urlPage = 'https://www.carsguide.com.au/buy-a-car/all-new-and-used/all-states/all-locations/all-bodytypes/all-makes?searchOffset=0&searchLimit=12'
responsePage = get(urlPage)
responsePage.status_code

200

In [15]:
html_soupPage = BeautifulSoup(responsePage.text, 'html.parser')
type(html_soupPage)

bs4.BeautifulSoup

In [16]:
listing_car = html_soupPage.find_all('a', class_ = 'carListing carListing-slideBtn')
print(type(listing_car))
print('\n')
print('Review first element of listing_car:\n', listing_car[0])

<class 'bs4.element.ResultSet'>


Review first element of listing_car:
 <a class="carListing carListing-slideBtn" data-cpo="0" data-listing-id="D_10673243" data-perf-rating="7" data-rank="0" data-snowplow-listing-id="10673243" href="/cars-for-sale/D_10673243/KIA--SORENTO--NT+-+North--WINNELLIE+0820,+NT--SUV" id="pos0">
<div class="carListing--labelContainer">
</div>
<div class="carListing--content">
<!-- car listing - thumbnails - start -->
<div class="carListing--thumbnails">
<img alt="2014 Kia Sorento" class="carListing--thumbnail1" src="https://autotraderau-res.cloudinary.com/t_cg_car_m/inventory/2019-08-13/20703462665651/10673243/2014_kia_sorento_Used_1.jpg"/> <ul class="carListing--mediaCount">
<li class="icon icon-camera"><span class="media-count-text">26</span></li>
<li class="icon icon-video-camera" ng-show="true"><span class="media-count-text">1</span></li>
</ul>
<div class="carListing--arrow"></div>
</div>
<!-- car listing - thumbnails - end -->
<!-- car listing - text - star

All we need is the link in the `href` attribute:

In [17]:
href = listing_car[0].get('href')
print(href)

/cars-for-sale/D_10673243/KIA--SORENTO--NT+-+North--WINNELLIE+0820,+NT--SUV


In [18]:
# Make the list that contains 12 links:

carListing = []
domain = 'https://www.carsguide.com.au'

for item in listing_car:
    href = domain + str(item.get('href'))
    carListing.append(href)

print(len(carListing))
print('\n')
print('Double check few elements in the list:')
print(carListing[0])
print(carListing[5])
print(carListing[11])

12


Double check few elements in the list:
https://www.carsguide.com.au/cars-for-sale/D_10673243/KIA--SORENTO--NT+-+North--WINNELLIE+0820,+NT--SUV
https://www.carsguide.com.au/cars-for-sale/D_10558538/HOLDEN--COMMODORE--NT+-+North--WINNELLIE+0820,+NT--Sedan
https://www.carsguide.com.au/cars-for-sale/D_10603698/LAND+ROVER--FREELANDER+2--NT+-+North--WINNELLIE+0820,+NT--SUV
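
Concatenating `domain + str(href)` produces a broken URL ending in `None` if an anchor has no `href`. A hedged sketch using the standard library's `urljoin` and skipping missing values could look like this; `build_listing_urls` is a hypothetical helper name.

```python
from urllib.parse import urljoin

DOMAIN = 'https://www.carsguide.com.au'


def build_listing_urls(hrefs, domain=DOMAIN):
    """Turn relative hrefs into absolute URLs, skipping missing ones."""
    return [urljoin(domain, href) for href in hrefs if href]


hrefs = [
    '/cars-for-sale/D_10673243/KIA--SORENTO--NT+-+North--WINNELLIE+0820,+NT--SUV',
    None,  # simulates an anchor without an href attribute
]
print(build_listing_urls(hrefs))
```

With the real page, `hrefs` would be `[item.get('href') for item in listing_car]`.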


So far so good. Now we only need to loop through that list to get the information of the 12 cars.

In [19]:
# Lists to store the scraped data in
titleList = []
makeList = []
familyList = []
variantList = []
seriesList = []
body_typeList = []
doorsList = []
seating_capacityList = []
transmission_typeList = []
drive_typeList = []
engine_ccList = []
cylindersList = []
fuel_typeList = []
fuel_consumptionList = []
co2_levelList = []
overal_heightList = []
overal_lenghtList = []
overal_widthList = []
wheelbaseList = []

for each_link in carListing:
    url = each_link
    response = get(url)
    response.status_code
    
    # parse the HTML content
    html_soup = BeautifulSoup(response.text, 'html.parser')
    
    # extract car's title
    details_page_heading = html_soup.find('h1', class_ = 'details-page-heading')
    car_title = details_page_heading.span.text
    
    # extract body type
    details_page_glance = html_soup.find('table', class_ = 'details-page-tab-table more-details clearfix')
    rows = details_page_glance.findChildren(['th', 'tr'])
    glanceList = []
    for row in rows:
        cells = row.findChildren('td')
        for cell in cells:
            strValue = cell.string
            glanceList.append(strValue)
    body_type = glanceList[3]
    
    # extract all tech specs
    tab_tech_specs = html_soup.find_all('dl', class_ = 'details-page-tab-desc-list clearfix')
     
    techSpecsComfort = tab_tech_specs[0].find_all('dd')
    techSpecsTrans = tab_tech_specs[1].find_all('dd')
    techSpecsExterior = tab_tech_specs[2].find_all('dd')
    techSpecsPerformance = tab_tech_specs[3].find_all('dd')
    techSpecsDimensions = tab_tech_specs[4].find_all('dd')
    techSpecsGeneral = tab_tech_specs[5].find_all('dd')

    make = techSpecsGeneral[1].text
    family = techSpecsGeneral[0].text
    variant = techSpecsGeneral[2].text
    series = techSpecsGeneral[3].text
    doors = techSpecsExterior[4].text
    seating_capacity = techSpecsComfort[0].text
    transmission_type = techSpecsTrans[0].text
    drive_type = techSpecsTrans[1].text
    engine_cc = techSpecsPerformance[1].text
    cylinders = techSpecsPerformance[2].text
    fuel_type = techSpecsPerformance[7].text
    fuel_consumption = techSpecsPerformance[9].text
    co2_level = techSpecsPerformance[15].text
    overal_height = techSpecsDimensions[0].text
    overal_lenght = techSpecsDimensions[1].text
    overal_width = techSpecsDimensions[2].text
    wheelbase = techSpecsDimensions[4].text

    # Append data to lists
    titleList.append(car_title)
    makeList.append(make)
    familyList.append(family)
    variantList.append(variant)
    seriesList.append(series)
    body_typeList.append(body_type)
    doorsList.append(doors)
    seating_capacityList.append(seating_capacity)
    transmission_typeList.append(transmission_type)
    drive_typeList.append(drive_type)
    engine_ccList.append(engine_cc)
    cylindersList.append(cylinders)
    fuel_typeList.append(fuel_type)
    fuel_consumptionList.append(fuel_consumption)
    co2_levelList.append(co2_level)
    overal_heightList.append(overal_height)
    overal_lenghtList.append(overal_lenght)
    overal_widthList.append(overal_width)
    wheelbaseList.append(wheelbase)
    
onePage_df = pd.DataFrame({'title':titleList, 'make':makeList, 'family':familyList, 'variant':variantList, 
'series':seriesList, 'body_type':body_typeList, 'doors':doorsList, 
'seating_capacity':seating_capacityList, 'transmission_type':transmission_typeList, 
'drive_type':drive_typeList, 'engine_cc':engine_ccList, 'cylinders':cylindersList, 'fuel_type':fuel_typeList, 
'fuel_consumption':fuel_consumptionList, 'co2_level':co2_levelList, 'overal_height':overal_heightList, 
'overal_lenght':overal_lenghtList, 'overal_width':overal_widthList, 'wheelbase':wheelbaseList})

print(onePage_df.info())
onePage_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 19 columns):
title                12 non-null object
make                 12 non-null object
family               12 non-null object
variant              12 non-null object
series               12 non-null object
body_type            12 non-null object
doors                12 non-null object
seating_capacity     12 non-null object
transmission_type    12 non-null object
drive_type           12 non-null object
engine_cc            12 non-null object
cylinders            12 non-null object
fuel_type            12 non-null object
fuel_consumption     12 non-null object
co2_level            12 non-null object
overal_height        12 non-null object
overal_lenght        12 non-null object
overal_width         12 non-null object
wheelbase            12 non-null object
dtypes: object(19)
memory usage: 1.9+ KB
None


Unnamed: 0,title,make,family,variant,series,body_type,doors,seating_capacity,transmission_type,drive_type,engine_cc,cylinders,fuel_type,fuel_consumption,co2_level,overal_height,overal_lenght,overal_width,wheelbase
0,2014 Kia Sorento Platinum (4X4),Kia,Sorento,Platinum (4X4),XM MY14,"SUV, 4 Doors, 7 Seats",4,7,Automatic,AWD,2199,4,Diesel,7.3 L / 100 km,192,1745 mm,4685 mm,1885 mm,2700 mm
1,2002 Toyota RAV4 Cruiser (4X4),Toyota,RAV4,Cruiser (4X4),ACA20R,"SUV, 2 Doors, 4 Seats",2,4,Automatic,4WD,1998,4,Unleaded,9 L / 100 km,-,1665 mm,3800 mm,1735 mm,2280 mm
2,2009 Ford Focus CL,Ford,Focus,CL,LT 08 Upgrade,"Sedan, 4 Doors, 5 Seats",4,5,Manual,Front,1999,4,Unleaded,7.1 L / 100 km,170,1443 mm,4488 mm,1840 mm,2640 mm
3,2010 Toyota Corolla Conquest,Toyota,Corolla,Conquest,ZRE152R MY10,"Hatchback, 5 Doors, 5 Seats",5,5,Automatic,Front,1798,4,Unleaded,7.7 L / 100 km,180,1515 mm,4220 mm,1760 mm,2600 mm
4,2014 HSV Clubsport R8,HSV,Clubsport,R8,GEN F MY15,"Sedan, 4 Doors, 5 Seats",4,5,Manual,Rear,6200,8,Premium,12.6 L / 100 km,300,1467 mm,4943 mm,1899 mm,2915 mm
5,2011 Holden Commodore SS-V Redline Edition,Holden,Commodore,SS-V Redline Edition,VE II,"Sedan, 4 Doors, 5 Seats",4,5,Manual,Rear,5967,8,Premium,14.6 L / 100 km,344,1476 mm,4894 mm,1899 mm,2915 mm
6,2015 Subaru WRX STI Premium,Subaru,WRX,STI Premium,MY16,"Sedan, 4 Doors, 5 Seats",4,5,Manual,AWD,2457,4,Premium,10.4 L / 100 km,242,1475 mm,4595 mm,1795 mm,2650 mm
7,2017 Land Rover Discovery TD6 SE,Land Rover,Discovery,TD6 SE,MY17,"SUV, 4 Doors, 5 Seats",4,5,Automatic,4WD,2993,6,Diesel,7.2 L / 100 km,189,1888 mm,4970 mm,2073 mm,2923 mm
8,2013 Holden Commodore SS,Holden,Commodore,SS,VF,"Ute / Tray, 2 Doors, 2 Seats",2,2,Automatic,Rear,5967,8,Unleaded,12.4 L / 100 km,296,1480 mm,5040 mm,1899 mm,3009 mm
9,2014 Hyundai I20 Active,Hyundai,I20,Active,PB MY14,"Hatchback, 5 Doors, 5 Seats",5,5,Automatic,Front,1396,4,Unleaded,5.9 L / 100 km,140,1490 mm,3940 mm,1710 mm,2525 mm


The output looks good. However, what if the `Tech Specs` tab has fewer than 8 sections? For the single car **2019 Alfa Romeo Giulietta Veloce TCT**, the `Tech Specs` tab has 8 sections (scroll up to see the images). But for the first car in the [link](https://www.carsguide.com.au/buy-a-car/all-new-and-used/all-states/all-locations/all-bodytypes/all-makes?searchOffset=0&searchLimit=12), there are only 6 sections! As a consequence, our index-based code cannot work for that car.

![Missing Tech Specs](images/missing_tech_specs.png)

To avoid the error, we must add one more condition to our code that checks the number of elements in the Tech Specs tab. If the number is different from 8, we do not process that car further.

# NEW CODE FOR COLLECTING INFO OF 12 CARS
# Lists to store the scraped data in
titleList = []
makeList = []
familyList = []
variantList = []
seriesList = []
body_typeList = []
doorsList = []
seating_capacityList = []
transmission_typeList = []
drive_typeList = []
engine_ccList = []
cylindersList = []
fuel_typeList = []
fuel_consumptionList = []
co2_levelList = []
overal_heightList = []
overal_lenghtList = []
overal_widthList = []
wheelbaseList = []

for each_link in carListing:
    url = each_link
    response = get(url)
    response.status_code
    
    # parse the HTML content
    html_soup = BeautifulSoup(response.text, 'html.parser')
    
    # extract car's title
    details_page_heading = html_soup.find('h1', class_ = 'details-page-heading')
    car_title = details_page_heading.span.text
    
    # extract body type
    details_page_glance = html_soup.find('table', class_ = 'details-page-tab-table more-details clearfix')
    rows = details_page_glance.findChildren(['th', 'tr'])
    glanceList = []
    for row in rows:
        cells = row.findChildren('td')
        for cell in cells:
            strValue = cell.string
            glanceList.append(strValue)
    body_type = glanceList[3]
    
    # extract all tech specs
    tab_tech_specs = html_soup.find_all('dl', class_ = 'details-page-tab-desc-list clearfix')

    if len(tab_tech_specs) == 8: # THIS IS THE NEW CONDITION

        techSpecsComfort = tab_tech_specs[0].find_all('dd')
        techSpecsTrans = tab_tech_specs[1].find_all('dd')
        techSpecsExterior = tab_tech_specs[2].find_all('dd')
        techSpecsPerformance = tab_tech_specs[3].find_all('dd')
        techSpecsDimensions = tab_tech_specs[4].find_all('dd')
        techSpecsGeneral = tab_tech_specs[5].find_all('dd')

        make = techSpecsGeneral[1].text
        family = techSpecsGeneral[0].text
        variant = techSpecsGeneral[2].text
        series = techSpecsGeneral[3].text
        doors = techSpecsExterior[4].text
        seating_capacity = techSpecsComfort[0].text
        transmission_type = techSpecsTrans[0].text
        drive_type = techSpecsTrans[1].text
        engine_cc = techSpecsPerformance[1].text
        cylinders = techSpecsPerformance[2].text
        fuel_type = techSpecsPerformance[7].text
        fuel_consumption = techSpecsPerformance[9].text
        co2_level = techSpecsPerformance[15].text
        overal_height = techSpecsDimensions[0].text
        overal_lenght = techSpecsDimensions[1].text
        overal_width = techSpecsDimensions[2].text
        wheelbase = techSpecsDimensions[4].text

        # Append data to lists
        titleList.append(car_title)
        makeList.append(make)
        familyList.append(family)
        variantList.append(variant)
        seriesList.append(series)
        body_typeList.append(body_type)
        doorsList.append(doors)
        seating_capacityList.append(seating_capacity)
        transmission_typeList.append(transmission_type)
        drive_typeList.append(drive_type)
        engine_ccList.append(engine_cc)
        cylindersList.append(cylinders)
        fuel_typeList.append(fuel_type)
        fuel_consumptionList.append(fuel_consumption)
        co2_levelList.append(co2_level)
        overal_heightList.append(overal_height)
        overal_lenghtList.append(overal_lenght)
        overal_widthList.append(overal_width)
        wheelbaseList.append(wheelbase)
    
onePage_df = pd.DataFrame({'title':titleList, 'make':makeList, 'family':familyList, 'variant':variantList, 
'series':seriesList, 'body_type':body_typeList, 'doors':doorsList, 
'seating_capacity':seating_capacityList, 'transmission_type':transmission_typeList, 
'drive_type':drive_typeList, 'engine_cc':engine_ccList, 'cylinders':cylindersList, 'fuel_type':fuel_typeList, 
'fuel_consumption':fuel_consumptionList, 'co2_level':co2_levelList, 'overal_height':overal_heightList, 
'overal_lenght':overal_lenghtList, 'overal_width':overal_widthList, 'wheelbase':wheelbaseList})

print(onePage_df.info())

onePage_df
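
The check `len(tab_tech_specs) == 8` skips a car entirely when a section is missing. If partial rows are acceptable, a gentler option is to fall back to a placeholder for any position that does not exist. This is only a sketch: `safe_text` is a hypothetical helper, it works on plain strings already taken from the `<dd>` tags, and `'-'` is chosen because the site itself shows `-` for unknown values.

```python
def safe_text(items, index, default='-'):
    """Return items[index], or `default` when the index is out of range."""
    if 0 <= index < len(items):
        return items[index]
    return default


performance = ['value 0', 'value 1']   # stand-in for texts of <dd> tags
print(safe_text(performance, 1))    # value 1
print(safe_text(performance, 15))   # -
```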

In [20]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
onePage_df.to_csv(r'F:\DuyLam\Project\web_scraping_for_vehicle_data\onePage.csv')

Step two is cleared! We are one step closer.

## III. Collecting information on multiple pages:

### Part 1. Make a list of pages:

Examine the image below:

![search critical](images/search_critical.png)

* `searchOffset` is 0 on page 1 and changes to 12 on page 2.
* `searchLimit` is 12, which means each page contains 12 cars.
* `orderBy` is `desc` (descending) because I sorted by Year (Youngest - Oldest).

Let's make a list of the offset values for the first seven pages:

In [21]:
# Get List of Offset values of first seven pages
searchOffset = []
offSetNum = 0
pageNum = 1

while pageNum <= 7:
    searchOffset.append(offSetNum)
    offSetNum += 12
    pageNum += 1

print(searchOffset)

[0, 12, 24, 36, 48, 60, 72]


Now we just need to change `searchOffset` to get the links of the seven pages and add them to a list:

In [22]:
page_linkList = []

# Get List of Urls of first seven pages, sort by year descending
for element in searchOffset:
    page_link = 'https://www.carsguide.com.au/buy-a-car/all-new-and-used/all-states/all-locations/all-bodytypes/all-makes?sortBy=year&orderBy=desc&searchOffset='+ str(element) +'&searchLimit=12'
    page_linkList.append(page_link)
    
print(page_linkList[0])
print(page_linkList[3])
print(page_linkList[6])

https://www.carsguide.com.au/buy-a-car/all-new-and-used/all-states/all-locations/all-bodytypes/all-makes?sortBy=year&orderBy=desc&searchOffset=0&searchLimit=12
https://www.carsguide.com.au/buy-a-car/all-new-and-used/all-states/all-locations/all-bodytypes/all-makes?sortBy=year&orderBy=desc&searchOffset=36&searchLimit=12
https://www.carsguide.com.au/buy-a-car/all-new-and-used/all-states/all-locations/all-bodytypes/all-makes?sortBy=year&orderBy=desc&searchOffset=72&searchLimit=12
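
The two cells above can also be folded into one function. The sketch below generates the offsets with `range` and builds the query string with the standard library's `urlencode`; `build_page_links` is a hypothetical helper, and the parameter names are the ones visible in the URLs above.

```python
from urllib.parse import urlencode

BASE_URL = ('https://www.carsguide.com.au/buy-a-car/all-new-and-used/'
            'all-states/all-locations/all-bodytypes/all-makes')


def build_page_links(pages, limit=12):
    """Build the search URLs for the first `pages` result pages."""
    links = []
    for offset in range(0, pages * limit, limit):
        query = urlencode({
            'sortBy': 'year',
            'orderBy': 'desc',
            'searchOffset': offset,
            'searchLimit': limit,
        })
        links.append(f'{BASE_URL}?{query}')
    return links


links = build_page_links(7)
print(len(links))
print(links[3])
```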


### Part 2. Controlling the crawl-rate

If we hammer the server with tens of requests per second, our IP will most likely be banned. Besides, we shouldn't disrupt other people's access to the website. That's why we will control the scraping rate with the [sleep()](https://docs.python.org/3/library/time.html?highlight=time%20module#time.sleep) function, which suspends execution for the given number of seconds. To mimic human behavior, we will combine `sleep` with [randint()](https://docs.python.org/3/library/random.html?highlight=random%20module#random.randint), which returns a random integer within a given range.

In [23]:
from time import sleep
from random import randint

for _ in range(0, 5):
    print('Testing')
    sleep(randint(1,15))

Testing
Testing
Testing
Testing
Testing
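
To show how the pause could sit inside a crawl loop, here is a minimal sketch. It swaps `randint` for `random.uniform` so the pauses are fractional, and it takes the download function as a parameter so the demo runs without touching the network. `polite_delay` and `crawl` are hypothetical names.

```python
import random
import time


def polite_delay(min_seconds, max_seconds):
    """Pick a random pause length between the two bounds, without sleeping."""
    return random.uniform(min_seconds, max_seconds)


def crawl(urls, fetch, min_seconds=1.0, max_seconds=15.0):
    """Call `fetch` on each URL, pausing a random time between requests."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:  # no need to wait before the very first request
            time.sleep(polite_delay(min_seconds, max_seconds))
        results.append(fetch(url))
    return results


# Demo with a stand-in fetch function and tiny delays so it finishes quickly.
print(crawl(['page-1', 'page-2'], fetch=str.upper, min_seconds=0.01, max_seconds=0.02))
```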


### Part 3. Monitoring the loop as it's still going

It's good to know that our process is still running and that we haven't hammered the server too much. We can do it like this:

In [24]:
from time import time
start_time = time()
requestsCount = 0

for _ in range(5):
    requestsCount += 1
    sleep(randint(1,3))
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requestsCount, requestsCount/elapsed_time))

Request: 1; Frequency: 0.9983766249222827 requests/s
Request: 2; Frequency: 0.49976735216257634 requests/s
Request: 3; Frequency: 0.5997483069968281 requests/s
Request: 4; Frequency: 0.6663763728072212 requests/s
Request: 5; Frequency: 0.5553560179161762 requests/s


The output gets very messy if we make a hundred requests. We can use the [clear_output()](https://ipython.org/ipython-doc/dev/api/generated/IPython.display.html) function to clear the previous output and show only the most recent one.

In [25]:
from IPython.display import clear_output
start_time = time()
requestsCount = 0
for _ in range(5):
    requestsCount += 1
    sleep(randint(1,3))
    current_time = time()
    elapsed_time = current_time - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requestsCount, requestsCount/elapsed_time))
    clear_output(wait = True)

Request: 5; Frequency: 0.4540258435208237 requests/s
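
The counting and timing logic can be kept in one place. The class below is a minimal sketch, assuming nothing beyond the standard library; `RequestMonitor` and its `clock` parameter are hypothetical names, not something the original program defines. Injecting the clock makes the behaviour easy to check without waiting.

```python
from time import time

class RequestMonitor:
    """Count requests and report the average rate since the monitor was created."""

    def __init__(self, clock=time):
        self._clock = clock
        self._start = clock()
        self.count = 0

    def record(self):
        """Register one request and return a one-line status string."""
        self.count += 1
        elapsed = self._clock() - self._start
        # Guard against a zero elapsed time on very fast consecutive calls.
        rate = self.count / elapsed if elapsed > 0 else float('inf')
        return 'Request: {}; Frequency: {:.3f} requests/s'.format(self.count, rate)

# A fake clock that advances two seconds per call makes the output predictable.
ticks = iter([0, 2, 4])
monitor = RequestMonitor(clock=lambda: next(ticks))
print(monitor.record())   # Request: 1; Frequency: 0.500 requests/s
print(monitor.record())   # Request: 2; Frequency: 0.500 requests/s
```

In a notebook, `print(monitor.record())` followed by `clear_output(wait = True)` would reproduce the single-line display shown above.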


I also use [warn()](https://docs.python.org/2/library/warnings.html) to check whether my requests exceed the total number of cars.

In [26]:
from warnings import warn
warn('Warning Simulation')
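
To illustrate how such a check might look in practice, here is a small sketch. `check_request_budget` is a hypothetical helper and the warning text is only an example; `warnings.catch_warnings(record=True)` is used so that the warning can be inspected instead of just being printed.

```python
import warnings

def check_request_budget(requests_count, car_total):
    """Warn and return False once more requests were sent than cars exist."""
    if requests_count > car_total:
        warnings.warn('Number of requests was greater than expected.')
        return False
    return True

# catch_warnings(record=True) collects warnings in a list so we can inspect them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    print(check_request_budget(41, 40))   # False, one request over budget
    print(len(caught))                    # 1
```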

  


### Part 4. Piecing Everything Together:

By looping through `page_linkList` and applying the methods from steps 1 and 2, we collect the information of all cars on the seven pages. In the code below, I added a new condition based on `len(details_page_heading) == 3`.

Sometimes, when you click a link, you see a warning 'Cannot find your car...' and the site automatically returns to the main page. This is a problem because our program cannot access that link to extract the data of a single car. The solution is to add one more condition to our code. Remember that when we extracted the title of [one car](https://www.carsguide.com.au/cars-for-sale/D_10315385/ALFA+ROMEO--GIULIETTA--WA+-+Perth--OSBORNE+PARK+6017,+WA--Hatchback?searchKey=cg_s.293fddbef3edd83e3bdfc3b4825c9bf9#pos0), the type of `details_page_heading` was `<class 'bs4.element.Tag'>` and its length was 3. When we pass a [link](https://www.carsguide.com.au/buy-a-car/all-new-and-used/all-states/all-locations/all-bodytypes/kia) on which the [find()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find) method cannot find an `<h1>` tag with class `details-page-heading`, `find()` returns `None`, and calling the built-in function [len()](https://docs.python.org/3/library/functions.html#len) on it raises an error (see the sample).

In [27]:
# The Sample

url_test = 'https://www.carsguide.com.au/buy-a-car/all-new-and-used/all-states/all-locations/all-bodytypes/kia'
response_test = get(url_test)
response_test.status_code

# parse the HTML content
html_soup_test = BeautifulSoup(response_test.text, 'html.parser')

# extract car's title
details_page_heading_test = html_soup_test.find('h1', class_ = 'details-page-heading')

len(details_page_heading_test)

TypeError: object of type 'NoneType' has no len()
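
One way to avoid this `TypeError` is to test for `None` before calling `len()`. The helper below is a minimal sketch; `has_expected_heading` is a hypothetical name, and a plain list stands in for a `bs4` `Tag` so that the example runs without downloading a page (both support `len()`).

```python
def has_expected_heading(heading, expected_children=3):
    """Return True only if heading exists and has the expected number of children."""
    # find() returns None when nothing matches, so test for None before len().
    return heading is not None and len(heading) == expected_children

print(has_expected_heading(None))                  # False, no error raised
print(has_expected_heading(['\n', 'span', '\n']))  # True
```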

###### NEW CODE FOR COLLECTING INFO OF SEVEN PAGES
###### Lists to store the scraped data in
titleList = []
makeList = []
familyList = []
variantList = []
seriesList = []
body_typeList = []
doorsList = []
seating_capacityList = []
transmission_typeList = []
drive_typeList = []
engine_ccList = []
cylindersList = []
fuel_typeList = []
fuel_consumptionList = []
co2_levelList = []
overal_heightList = []
overal_lenghtList = []
overal_widthList = []
wheelbaseList = []

###### Preparing the monitoring of the loop
start_time = time()
requestsCount = 0

for each_page in page_linkList:
    ###### download page content
    urlPage = each_page
    responsePage = get(urlPage)

    ###### parse the HTML page
    html_soupPage = BeautifulSoup(responsePage.text, 'html.parser')

    ###### extract information
    listing_car = html_soupPage.find_all('a', class_ = 'carListing carListing-slideBtn')

    carListing = []
    domain = 'https://www.carsguide.com.au'

    for item in listing_car:
        href = domain + str(item.get('href'))
        carListing.append(href)

    for each_link in carListing:
        ###### download content of single page:
        url = each_link
        response = get(url)

        ###### pause the loop between 8s and 15s randomly
        sleep(randint(8,15))

        ###### monitor the requests
        requestsCount += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} requests/s'.format(requestsCount, requestsCount/elapsed_time))
        clear_output(wait = True)

        ###### parse the HTML content
        html_soup = BeautifulSoup(response.text, 'html.parser')

        ###### extract car's title
        details_page_heading = html_soup.find('h1', class_ = 'details-page-heading')
        ###### NEW CONDITION to avoid the 'Cannot find your car' case; find() returns None there, so test for None before len()
        if details_page_heading is not None and len(details_page_heading) == 3:
            car_title = details_page_heading.span.text

            ###### extract body type
            details_page_glance = html_soup.find('table', class_ = 'details-page-tab-table more-details clearfix')
            rows = details_page_glance.findChildren(['th', 'tr'])
            glanceList = []
            for row in rows:
                cells = row.findChildren('td')
                for cell in cells:
                    strValue = cell.string
                    glanceList.append(strValue)
            body_type = glanceList[3]

            ###### extract all tech specs
            tab_tech_specs = html_soup.find_all('dl', class_ = 'details-page-tab-desc-list clearfix')

            if len(tab_tech_specs) == 8: #NEW CONDITION to avoid case missing sections in tab Tech Specs

                techSpecsComfort = tab_tech_specs[0].find_all('dd')
                techSpecsTrans = tab_tech_specs[1].find_all('dd')
                techSpecsExterior = tab_tech_specs[2].find_all('dd')
                techSpecsPerformance = tab_tech_specs[3].find_all('dd')
                techSpecsDimensions = tab_tech_specs[4].find_all('dd')
                techSpecsGeneral = tab_tech_specs[5].find_all('dd')

                make = techSpecsGeneral[1].text
                family = techSpecsGeneral[0].text
                variant = techSpecsGeneral[2].text
                series = techSpecsGeneral[3].text
                doors = techSpecsExterior[4].text
                seating_capacity = techSpecsComfort[0].text
                transmission_type = techSpecsTrans[0].text
                drive_type = techSpecsTrans[1].text
                engine_cc = techSpecsPerformance[1].text
                cylinders = techSpecsPerformance[2].text
                fuel_type = techSpecsPerformance[7].text
                fuel_consumption = techSpecsPerformance[9].text
                co2_level = techSpecsPerformance[15].text
                overal_height = techSpecsDimensions[0].text
                overal_lenght = techSpecsDimensions[1].text
                overal_width = techSpecsDimensions[2].text
                wheelbase = techSpecsDimensions[4].text

                ###### Append data to lists
                titleList.append(car_title)
                makeList.append(make)
                familyList.append(family)
                variantList.append(variant)
                seriesList.append(series)
                body_typeList.append(body_type)
                doorsList.append(doors)
                seating_capacityList.append(seating_capacity)
                transmission_typeList.append(transmission_type)
                drive_typeList.append(drive_type)
                engine_ccList.append(engine_cc)
                cylindersList.append(cylinders)
                fuel_typeList.append(fuel_type)
                fuel_consumptionList.append(fuel_consumption)
                co2_levelList.append(co2_level)
                overal_heightList.append(overal_height)
                overal_lenghtList.append(overal_lenght)
                overal_widthList.append(overal_width)
                wheelbaseList.append(wheelbase)
                
sevenPage_df = pd.DataFrame({'title':titleList, 'make':makeList, 'family':familyList, 'variant':variantList, 
'series':seriesList, 'body_type':body_typeList, 'doors':doorsList, 
'seating_capacity':seating_capacityList, 'transmission_type':transmission_typeList, 
'drive_type':drive_typeList, 'engine_cc':engine_ccList, 'cylinders':cylindersList, 'fuel_type':fuel_typeList, 
'fuel_consumption':fuel_consumptionList, 'co2_level':co2_levelList, 'overal_height':overal_heightList, 
'overal_lenght':overal_lenghtList, 'overal_width':overal_widthList, 'wheelbase':wheelbaseList})

print(sevenPage_df.info())

sevenPage_df


sevenPage_df.to_csv(r'F:\DuyLam\Project\web_scraping_for_vehicle_data\sevenPage.csv')
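
Keeping nineteen parallel lists in step is easy to get wrong. One alternative is to collect one dictionary per car and write the rows out at the end. The sketch below is an illustration only: it swaps in the standard library `csv` module instead of pandas, `rows_to_csv` and `FIELDS` are hypothetical names, and the field list is shortened to three columns.

```python
import csv
import io

# A shortened field list for illustration; the real table has nineteen columns.
FIELDS = ['title', 'make', 'co2_level']

def rows_to_csv(rows, fields=FIELDS):
    """Serialise a list of per-car dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# One dict per car replaces one append per list.
rows = [{'title': '2019 Abarth 595', 'make': 'Abarth', 'co2_level': '134'}]
print(rows_to_csv(rows))
```

pandas accepts the same structure directly: `pd.DataFrame(rows)` builds a data frame from a list of dictionaries.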

Requesting over 100,000 cars at once is not a good idea, and many Makes on Carsguide are not what I need anyway. To limit the number of requests, I will only request the Make that I am interested in, and only as many cars as are available for that Make. Let's add some juice to the program and encapsulate it in a function called `scrapingCar`. It will take `make_name` as an argument.

In [31]:
def scrapingCar(make_name):
    
    from requests import get
    from bs4 import BeautifulSoup
    import pandas as pd
    from time import sleep
    from datetime import datetime, timedelta
    from random import randint
    from time import time
    from warnings import warn
    import re
    from IPython.display import clear_output
    
    make_name = str(make_name)
    make_name = make_name.strip()
    make_name = make_name.replace(' ', '-')
    
    # Get maximum number of pages
    urlMake = 'https://www.carsguide.com.au/buy-a-car/all-new-and-used/all-states/all-locations/all-bodytypes/'+ make_name +'?searchOffset=0&searchLimit=12&sortBy=year&orderBy=desc'
    responseMake = get(urlMake)
    html_soupMake = BeautifulSoup(responseMake.text, 'html.parser')
    listing_pagination = html_soupMake.find('div', string = re.compile('Page 1 of'))
    pagination_txt = listing_pagination.text
    pagination_list = pagination_txt.split()
    pageMax = int(pagination_list[-1])
    
    # Get total number of cars for sale
    listing_search_title = html_soupMake.find('h1', class_ = 'listing-search-title')
    searchTitle = listing_search_title.text
    searchTitleList = searchTitle.split()
    carTotal = int(searchTitleList[0])
    
    # Estimated time when the process finishes (fixed offset of 660 seconds):
    eta = datetime.timestamp(datetime.now() + timedelta(seconds = 660))
    dt_object = datetime.fromtimestamp(eta)
    
    # Get list of offset values for all pages
    searchOffset = []
    offSetNum = 0
    pageNum = 1

    while pageNum <= pageMax:
        searchOffset.append(offSetNum)
        offSetNum += 12
        pageNum += 1

    page_linkList = []
    
    # Get List of Urls of pages, sort by year descending
    for element in searchOffset:
        page_link = 'https://www.carsguide.com.au/buy-a-car/all-new-and-used/all-states/all-locations/all-bodytypes/'+ make_name +'?sortBy=year&orderBy=desc&searchOffset='+ str(element) +'&searchLimit=12'
        page_linkList.append(page_link)
        
    # Empty lists to store the scraped data in
    titleList = []
    makeList = []
    familyList = []
    variantList = []
    seriesList = []
    body_typeList = []
    doorsList = []
    seating_capacityList = []
    transmission_typeList = []
    drive_typeList = []
    engine_ccList = []
    cylindersList = []
    fuel_typeList = []
    fuel_consumptionList = []
    co2_levelList = []
    overal_heightList = []
    overal_lenghtList = []
    overal_widthList = []
    wheelbaseList = []
    
    # Preparing the monitoring of the loop
    start_time = time()
    requestsCount = 0
    
    for each_page in page_linkList:
        # download page content
        urlPage = each_page
        responsePage = get(urlPage)

        # parse the HTML page
        html_soupPage = BeautifulSoup(responsePage.text, 'html.parser')

        # extract information
        listing_car = html_soupPage.find_all('a', class_ = 'carListing carListing-slideBtn')

        carListing = []
        domain = 'https://www.carsguide.com.au'

        for item in listing_car:
            href = domain + str(item.get('href'))
            carListing.append(href)

        for each_link in carListing:
            # download content of single page:
            url = each_link
            response = get(url)

            # pause the loop
            sleep(randint(8,15))

            # monitor the requests
            requestsCount += 1
            elapsed_time = time() - start_time
            print('Request:{}; Frequency: {} requests/s; ETA: {}'.format(requestsCount, requestsCount/elapsed_time, dt_object))
            clear_output(wait = True)
            
            # break the loop if the number of requests is greater than expected
            if requestsCount > carTotal:
                warn('Number of requests was greater than expected.')
                break

            # parse the HTML content
            html_soup = BeautifulSoup(response.text, 'html.parser')

            # extract car's title
            details_page_heading = html_soup.find('h1', class_ = 'details-page-heading')
            if details_page_heading is None:
                # skip only this car; the remaining links on the page are still processed
                continue
            car_title = details_page_heading.span.text

            # extract body type
            details_page_glance = html_soup.find('table', class_ = 'details-page-tab-table more-details clearfix')
            rows = details_page_glance.findChildren(['th', 'tr'])
            glanceList = []
            
            for row in rows:
                cells = row.findChildren('td')
                for cell in cells:
                    strValue = cell.string
                    glanceList.append(strValue)
            body_type = glanceList[3]

            # extract all tech specs
            tab_tech_specs = html_soup.find_all('dl', class_ = 'details-page-tab-desc-list clearfix')

            if len(tab_tech_specs) == 8: #NEW CONDITION

                techSpecsComfort = tab_tech_specs[0].find_all('dd')
                techSpecsTrans = tab_tech_specs[1].find_all('dd')
                techSpecsExterior = tab_tech_specs[2].find_all('dd')
                techSpecsPerformance = tab_tech_specs[3].find_all('dd')
                techSpecsDimensions = tab_tech_specs[4].find_all('dd')
                techSpecsGeneral = tab_tech_specs[5].find_all('dd')

                make = techSpecsGeneral[1].text
                family = techSpecsGeneral[0].text
                variant = techSpecsGeneral[2].text
                series = techSpecsGeneral[3].text
                doors = techSpecsExterior[4].text
                seating_capacity = techSpecsComfort[0].text
                transmission_type = techSpecsTrans[0].text
                drive_type = techSpecsTrans[1].text
                engine_cc = techSpecsPerformance[1].text
                cylinders = techSpecsPerformance[2].text
                fuel_type = techSpecsPerformance[7].text
                fuel_consumption = techSpecsPerformance[9].text
                co2_level = techSpecsPerformance[15].text
                overal_height = techSpecsDimensions[0].text
                overal_lenght = techSpecsDimensions[1].text
                overal_width = techSpecsDimensions[2].text
                wheelbase = techSpecsDimensions[4].text

                # Append data to lists
                titleList.append(car_title)
                makeList.append(make)
                familyList.append(family)
                variantList.append(variant)
                seriesList.append(series)
                body_typeList.append(body_type)
                doorsList.append(doors)
                seating_capacityList.append(seating_capacity)
                transmission_typeList.append(transmission_type)
                drive_typeList.append(drive_type)
                engine_ccList.append(engine_cc)
                cylindersList.append(cylinders)
                fuel_typeList.append(fuel_type)
                fuel_consumptionList.append(fuel_consumption)
                co2_levelList.append(co2_level)
                overal_heightList.append(overal_height)
                overal_lenghtList.append(overal_lenght)
                overal_widthList.append(overal_width)
                wheelbaseList.append(wheelbase)
                
    autos_df = pd.DataFrame({'title':titleList, 'make':makeList, 'family':familyList, 'variant':variantList, 
                'series':seriesList, 'body_type':body_typeList, 'doors':doorsList, 
                'seating_capacity':seating_capacityList, 'transmission_type':transmission_typeList, 
                'drive_type':drive_typeList, 'engine_cc':engine_ccList, 'cylinders':cylindersList, 'fuel_type':fuel_typeList, 
                'fuel_consumption':fuel_consumptionList, 'co2_level':co2_levelList, 'overal_height':overal_heightList, 
                'overal_lenght':overal_lenghtList, 'overal_width':overal_widthList, 'wheelbase':wheelbaseList})
    
    print('The processing completed with no errors detected during the process.')
    
    return autos_df
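
A few of the steps inside `scrapingCar` can be isolated into small helpers, which makes them easy to check without sending any request. This is a sketch under assumptions: the helper names are hypothetical, and the sample text `'Page 1 of 4'` is only an assumed example of the pagination label that the function searches for.

```python
def parse_page_max(pagination_text):
    """Return the last number in text such as 'Page 1 of 4'."""
    return int(pagination_text.split()[-1])

def page_offsets(page_max, page_size=12):
    """Return the searchOffset value for every page."""
    return list(range(0, page_max * page_size, page_size))

def normalise_make(make_name):
    """Trim the make name and replace spaces with hyphens, as scrapingCar does."""
    return str(make_name).strip().replace(' ', '-')

print(parse_page_max('Page 1 of 4'))    # 4
print(page_offsets(4))                  # [0, 12, 24, 36]
print(normalise_make(' alfa romeo '))   # alfa-romeo
```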

Let's do a final test with the function and export the result to a csv file:

In [32]:
abarth = scrapingCar('abarth')

Request:40; Frequency: 0.07471563919822136 requests/s; ETA: 2019-08-20 13:30:08.553252
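
The ETA printed above comes from a fixed 660-second offset inside the function. It could instead be derived from the number of cars and the average pause. The sketch below assumes that the pause dominates the run time and ignores download time; `estimate_eta` and its parameters are hypothetical names, and the start time is an arbitrary example value.

```python
from datetime import datetime, timedelta

def estimate_eta(car_total, min_pause=8, max_pause=15, now=None):
    """Estimate the finish time from the average pause between requests."""
    if now is None:
        now = datetime.now()
    # randint(8, 15) averages 11.5 seconds per request.
    average_pause = (min_pause + max_pause) / 2
    return now + timedelta(seconds=car_total * average_pause)

start = datetime(2019, 8, 20, 13, 0, 0)   # arbitrary example start time
print(estimate_eta(40, now=start))        # 2019-08-20 13:07:40
```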


In [33]:
abarth.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 19 columns):
title                40 non-null object
make                 40 non-null object
family               40 non-null object
variant              40 non-null object
series               40 non-null object
body_type            40 non-null object
doors                40 non-null object
seating_capacity     40 non-null object
transmission_type    40 non-null object
drive_type           40 non-null object
engine_cc            40 non-null object
cylinders            40 non-null object
fuel_type            40 non-null object
fuel_consumption     40 non-null object
co2_level            40 non-null object
overal_height        40 non-null object
overal_lenght        40 non-null object
overal_width         40 non-null object
wheelbase            40 non-null object
dtypes: object(19)
memory usage: 6.0+ KB


In [35]:
abarth

Unnamed: 0,title,make,family,variant,series,body_type,doors,seating_capacity,transmission_type,drive_type,engine_cc,cylinders,fuel_type,fuel_consumption,co2_level,overal_height,overal_lenght,overal_width,wheelbase
0,2019 Abarth 595C Competizione,Abarth,595C,Competizione,Series 4,"Convertible, 2 Doors, 4 Seats",2,4,Automatic,Front,1368,4,Unleaded,5.8 L / 100 km,134,1485 mm,3657 mm,1627 mm,2300 mm
1,2019 Abarth 595C Competizione,Abarth,595C,Competizione,Series 4,"Convertible, 2 Doors, 4 Seats",2,4,Manual,Front,1368,4,Unleaded,6 L / 100 km,139,1485 mm,3657 mm,1627 mm,2300 mm
2,2019 Abarth 124 Spider,Abarth,124,Spider,Series 1,"Convertible, 2 Doors, 2 Seats",2,2,Manual,Rear,1368,4,Premium,6.5 L / 100 km,150,1233 mm,4054 mm,1740 mm,2310 mm
3,2019 Abarth 595,Abarth,595,-,Series 4,"Hatchback, 3 Doors, 4 Seats",3,4,Automatic,Front,1368,4,Unleaded,5.8 L / 100 km,134,1485 mm,3657 mm,1627 mm,2300 mm
4,2019 Abarth 595C,Abarth,595C,-,Series 4,"Convertible, 2 Doors, 4 Seats",2,4,Manual,Front,1368,4,Unleaded,6 L / 100 km,139,1485 mm,3657 mm,1627 mm,2300 mm
5,2019 Abarth 595 Competizione,Abarth,595,Competizione,Series 4,"Hatchback, 3 Doors, 4 Seats",3,4,Automatic,Front,1368,4,Unleaded,5.8 L / 100 km,134,1485 mm,3657 mm,1627 mm,2300 mm
6,2019 Abarth 595C,Abarth,595C,-,Series 4,"Convertible, 2 Doors, 4 Seats",2,4,Automatic,Front,1368,4,Unleaded,5.8 L / 100 km,134,1485 mm,3657 mm,1627 mm,2300 mm
7,2019 Abarth 595C,Abarth,595C,-,Series 4,"Convertible, 2 Doors, 4 Seats",2,4,Manual,Front,1368,4,Unleaded,6 L / 100 km,139,1485 mm,3657 mm,1627 mm,2300 mm
8,2019 Abarth 695C Rivale,Abarth,695C,Rivale,Series 4,"Convertible, 2 Doors, 4 Seats",2,4,Automatic,Front,1368,4,Unleaded,0 L / 100 km,139,1485 mm,3657 mm,1627 mm,2300 mm
9,2019 Abarth 695 Rivale,Abarth,695,Rivale,Series 4,"Hatchback, 3 Doors, 4 Seats",3,4,Automatic,Front,1368,4,Unleaded,0 L / 100 km,139,1485 mm,3660 mm,1627 mm,2300 mm


In [37]:
abarth.to_csv(r'F:\DuyLam\Project\web_scraping_for_vehicle_data\abarth.csv')

Everything works as expected.

Before closing this project, we should know that web scraping and crawling are not illegal by themselves. However, the way we use the scraped data could put us in a vulnerable position. Please read this [article](https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/) to learn what might happen if you crawl somebody else's website without their permission.