## Web Scraping with BeautifulSoup
In this project, I scrape used-car data from "https://www.sydneycars.com.au/" using BeautifulSoup.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

The url variable points to the page I am interested in.
The "<Response [200]>" output shows the HTTP status code returned for the request; 200 means the request succeeded.

In [9]:
url = "https://www.sydneycars.com.au/"
agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'}
page = requests.get(url, headers=agent)
page

<Response [200]>
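
To make the meaning of the status code concrete, here is a minimal sketch that uses only the standard library's `http.HTTPStatus`. The helper name `describe_status` is hypothetical and is not part of requests; it only illustrates how the numeric ranges are usually read.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    if code < 200:
        outcome = "informational"
    elif code < 300:
        outcome = "success"
    elif code < 400:
        outcome = "redirect"
    elif code < 500:
        outcome = "client error"
    else:
        outcome = "server error"
    return f"{code} {status.phrase} ({outcome})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
```

In the notebook, the same number is available as `page.status_code`, and calling `page.raise_for_status()` raises an exception for 4xx and 5xx responses instead of letting the scrape continue on an error page.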

Let's parse the HTML into a variable called soup. The details we want are in 'div' elements with the class "card mb-4 w-100 border border-0 border-warning". We can use the find_all function to collect them.

In [10]:
soup = BeautifulSoup(page.content, 'html.parser')
lists = soup.find_all('div', class_="card mb-4 w-100 border border-0 border-warning")

Now all the information we want is in "lists". We use a nested for-loop to extract the car price, model, year, transmission type and kilometres travelled from "lists". The string replace() method removes unnecessary characters, and each value is appended to a list called "car".

In [11]:
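car = []
for card in lists:  # "card" avoids shadowing the built-in name list
    # Price text with newlines, the dollar sign and non-breaking spaces removed
    price = card.find('h5', class_="mb-3 text-center").text.replace('\n', '').replace('$', '')
    price = price.replace(u' \xa0', u'')
    vehicles = card.find_all('li', style="list-style-type:none!important;font-size:15px;line-height:25px;text-align: left;")
    for vehicle in vehicles:
        # Take the text just before the closing </li> tag, then strip labels and units
        vehicle = str(vehicle).split(">")[-2][:-4].replace('\xa0', '').replace('kms ', '').replace('kms', u'').replace('Year', '').replace(' model','')
        car.append(vehicle)

    # The price goes last, so each car contributes five consecutive values
    car.append(price)
car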


['Holden Astra',
 ' 2005',
 'Manual transmission',
 '220604 ',
 '5400',
 'Ford Fiesta',
 ' 2005',
 'Manual transmission',
 '154608 ',
 '5490',
 'Mitsubishi Lancer',
 ' 1998',
 'Automatic Transmission',
 '202,243',
 '5400',
 'Dodge Caliber',
 ' 2011',
 'Automatic Transmission',
 '136,346 ',
 '8900',
 'Peugeot 308 HDi',
 ' 2010',
 'Manual transmission',
 '99,281 ',
 '9400',
 'Renault Koleos SUV',
 ' 2010',
 'Manual transmission',
 '170,000',
 '9990']
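
The values above still carry stray spaces and thousands separators (for example ' 2005' and '202,243'). One possible way to normalise the numeric fields is sketched below. The helper `to_int` is hypothetical, and it assumes every numeric field is a non-negative whole number, which holds for the sample output but is not guaranteed for the whole site.

```python
import re


def to_int(text):
    """Keep only the digits in `text` and convert them to an int."""
    digits = re.sub(r"\D", "", text)  # drop spaces, commas and any other non-digits
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


samples = [" 2005", "220604 ", "202,243", "136,346 ", "5400"]
print([to_int(s) for s in samples])
# [2005, 220604, 202243, 136346, 5400]
```

Stripping everything that is not a digit is simpler than chaining several replace() calls, at the cost of silently accepting odd input such as a value with a decimal point.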

We then split the flat list into a list of lists (one per car) and convert it into a DataFrame for a clearer view.

In [12]:
car = [car[i: i+5] for i in range(0, len(car), 5)]
df = pd.DataFrame(car, columns=['Model', 'Make Year','Transmission Type','Travelled (km)','Price ($)'])
df

,Model,Make Year,Transmission Type,Travelled (km),Price ($)
0,Holden Astra,2005,Manual transmission,220604,5400
1,Ford Fiesta,2005,Manual transmission,154608,5490
2,Mitsubishi Lancer,1998,Automatic Transmission,202243,5400
3,Dodge Caliber,2011,Automatic Transmission,136346,8900
4,Peugeot 308 HDi,2010,Manual transmission,99281,9400
5,Renault Koleos SUV,2010,Manual transmission,170000,9990
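
Slicing the flat list in steps of five assumes that every listing produced exactly five values. If one listing is missing a field, every later row shifts by one column without any error. A minimal sketch of a guard is shown here; the helper `to_rows` and the constant `FIELDS` are hypothetical names introduced for illustration.

```python
FIELDS = ["Model", "Make Year", "Transmission Type", "Travelled (km)", "Price ($)"]


def to_rows(flat, width=len(FIELDS)):
    """Split a flat list into consecutive rows of `width` items."""
    if len(flat) % width != 0:
        raise ValueError(
            f"{len(flat)} values cannot be split evenly into rows of {width}"
        )
    return [flat[i:i + width] for i in range(0, len(flat), width)]


flat = ["Holden Astra", " 2005", "Manual transmission", "220604 ", "5400",
        "Ford Fiesta", " 2005", "Manual transmission", "154608 ", "5490"]
rows = to_rows(flat)
print(len(rows))   # 2
print(rows[1][0])  # Ford Fiesta
```

The length check only catches a total that is not a multiple of five. Building one row per listing inside the loop would be a sturdier design, but it changes the structure of the scraping code.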


We have now successfully scraped the data from the first page. Next we need a way to pull data from all pages. Before that, let's wrap the work above in a function so we can call it again later.

In [13]:
def car_scrapping(url):
    """Scrape one results page and return a flat list of five values per car."""
    agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'}
    page = requests.get(url, headers=agent)

    soup = BeautifulSoup(page.content, 'html.parser')
    lists = soup.find_all('div', class_="card mb-4 w-100 border border-0 border-warning")

    car = []
    for card in lists:  # "card" avoids shadowing the built-in name list
        # Price text with newlines, the dollar sign and non-breaking spaces removed
        price = card.find('h5', class_="mb-3 text-center").text.replace('\n', '').replace('$', '')
        price = price.replace(u' \xa0', u'')
        vehicles = card.find_all('li', style="list-style-type:none!important;font-size:15px;line-height:25px;text-align: left;")
        for vehicle in vehicles:
            # Take the text just before the closing </li> tag, then strip labels and units
            vehicle = str(vehicle).split(">")[-2][:-4].replace('\xa0', '').replace('kms ', '').replace('kms', u'').replace('Year', '').replace(' model','')
            car.append(vehicle)

        car.append(price)
    return car

First, let's get the total number of pages with the find_all function from bs4. It returns all the page-link buttons, and the "26" is what we need.

In [14]:
page_text = soup.find_all('a', class_="page-link")
page_text

[<a class="page-link" href="https://www.sydneycars.com.au/page/2">2</a>,
 <a class="page-link" href="https://www.sydneycars.com.au/page/3">3</a>,
 <a class="page-link" href="https://www.sydneycars.com.au/page/4">4</a>,
 <a class="page-link" href="https://www.sydneycars.com.au/page/5">5</a>,
 <a class="page-link" href="https://www.sydneycars.com.au/page/25">25</a>,
 <a class="page-link" href="https://www.sydneycars.com.au/page/26">26</a>,
 <a class="page-link" href="https://www.sydneycars.com.au/page/2">Next</a>]

From here, we chain several split calls to extract the last page number and save it as an integer.

In [15]:
pages = int(str(page_text).split(",")[-2].split(">")[-2][:-3])
pages

26
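
The chained split calls depend on the exact string form of the link list, including the "Next" button being last. As an alternative sketch, the page count can be taken as the largest numeric link label. The helper `last_page` is a hypothetical name, and falling back to 1 when no numbered link exists is an assumption.

```python
def last_page(labels):
    """Return the largest page number among pagination link labels."""
    numbers = [int(label) for label in labels if label.strip().isdigit()]
    if not numbers:
        return 1  # assumption: no numbered links means a single page
    return max(numbers)


labels = ["2", "3", "4", "5", "25", "26", "Next"]
print(last_page(labels))  # 26
```

With the notebook's `page_text`, the labels could be collected as `[a.get_text() for a in page_text]` before being passed to the helper.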

Now let's loop through all the pages and run car_scrapping() on each one. This gives us data from every page.

In [22]:
car = []
for page in range(1, pages + 1):
    url = f"https://www.sydneycars.com.au/page/{page}/"
    
    car += car_scrapping(url)
    
car = [car[i: i+5] for i in range(0, len(car), 5)]

df = pd.DataFrame(car, columns=['Model', 'Make Year','Transmission Type','Travelled (km)','Price ($)'])
df.head(100)

,Model,Make Year,Transmission Type,Travelled (km),Price ($)
0,Holden Astra,2005,Manual transmission,220604,5400
1,Ford Fiesta,2005,Manual transmission,154608,5490
2,Mitsubishi Lancer,1998,Automatic Transmission,202243,5400
3,Dodge Caliber,2011,Automatic Transmission,136346,8900
4,Peugeot 308 HDi,2010,Manual transmission,99281,9400
...,...,...,...,...,...
95,Kia Rio,2008,Manual transmission,"245, 865",5490
96,Toyota Echo,2002,Automatic transmission,191278,6950
97,Ford Laser,2001,Automatic Transmission,122705,5400
98,Ford Falcon XR6,2003,Automatic transmission,166000,7950
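
The loop above sends 26 requests back to back. A common courtesy when scraping is to pause briefly between requests. The sketch below shows the same loop with a delay and with the scraping function passed in as a parameter, so the loop can be tried without touching the network. The names `scrape_all` and `fake_scraper` are hypothetical, the one-second default is an arbitrary assumption, and the values returned by `fake_scraper` are placeholders, not real listings.

```python
import time


def scrape_all(pages, scrape_page, delay=1.0):
    """Call `scrape_page(url)` for pages 1..pages and combine the results."""
    results = []
    for number in range(1, pages + 1):
        url = f"https://www.sydneycars.com.au/page/{number}/"
        results.extend(scrape_page(url))
        if number < pages:
            time.sleep(delay)  # pause between requests, not after the last one
    return results


def fake_scraper(url):
    # Stand-in for car_scrapping(): returns five placeholder values per page.
    return [url, "year", "transmission", "km", "price"]


print(len(scrape_all(3, fake_scraper, delay=0)))  # 15
```

In the notebook, the real call would be `scrape_all(pages, car_scrapping)`, which returns the same flat list that is then split into rows of five.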
