<h1><center> Python project</center></h1>
<h2><center>Current car prices and other relevant parameters from bazos.cz</center></h2>
<h3><center>Daniel Brosko, Vojtěch Suchánek</center></h3>

Our goal is to web-scrape advertisements listed on bazos.cz, which is currently one of the most used websites for selling used cars in the Czech Republic. It has more than 15 000 car ads daily. On the other hand, its search options are very poor, which makes it hard to find a car that matches your parameters.

We are going to write an algorithm that scans the ads for the current day, picks those that satisfy our conditions on date and car type, and saves their links. We then follow each link and save the text of the ad, which we analyze to find our parameters. Finally, we visualize the distribution of price and the selected parameters and try to identify some underpriced offers.

It might be better to run this task continuously, for example once an hour, so as not to overload the website. That approach would also make it possible to analyze the data over a longer time period. But since this project is designed as a one-time run, we decided to limit the data to the current date only.
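The text-analysis step outlined above can be illustrated with a small sketch. It looks for a four-digit number in a plausible range and accepts it as the year of manufacture only if a year-related keyword appears within a fixed window of characters around it. The function name `find_year`, the shortened keyword tuple and the sample sentence are assumptions made for this illustration; unlike the notebook's own functions, the sketch returns the first hit and does not strip spaces from the text.

```python
import re
from typing import Optional

# A few keyword fragments of the kind used later in the notebook
# (hypothetical subset chosen for this illustration).
YEAR_HINTS = ("rok", "Rok", "rv", "RV", "ýrob")


def find_year(text: str, context_span: int = 20) -> Optional[int]:
    """Return the first plausible year that has a year-related keyword nearby."""
    # Match exactly four digits that are not part of a longer number.
    for match in re.finditer(r"(?<!\d)\d{4}(?!\d)", text):
        value = int(match.group())
        if not 1980 < value < 2023:
            continue
        # Inspect context_span characters on each side of the candidate.
        left = max(match.start() - context_span, 0)
        right = min(match.end() + context_span, len(text))
        window = text[left:right]
        if any(hint in window for hint in YEAR_HINTS):
            return value
    return None


# Made-up ad text, used only to exercise the function.
sample = "Škoda Octavia 3, rok výroby 2015, najeto 127000 km"
print(find_year(sample))
```

A candidate without any keyword in its window is rejected, which is what keeps phone numbers and part numbers from being reported as years.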

In [12]:
import requests
from bs4 import BeautifulSoup
import re
from datetime import date, datetime, timedelta
import time
import numpy as np
import pandas as pd

In [95]:
# Just to get the version of packages for requirements.txt
#pip list

In [96]:
bots = requests.get('https://auto.bazos.cz/robots.txt')
#print(bots.text)

From the robots.txt page we can see that the actions performed in this project are allowed, since we are not going to use the search commands listed there.
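The same check can be done programmatically with the standard library. The sketch below parses a robots.txt text and asks whether a given URL may be fetched. The rules in `SAMPLE_ROBOTS_TXT` are hypothetical and only stand in for the real file, which would have to be taken from `bots.text`; the helper name `is_allowed` is also an assumption.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration;
# the real rules must be read from https://auto.bazos.cz/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search.php
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


listing_allowed = is_allowed(SAMPLE_ROBOTS_TXT, "https://auto.bazos.cz/0/?hledat=octavia")
search_allowed = is_allowed(SAMPLE_ROBOTS_TXT, "https://auto.bazos.cz/search.php?hledat=octavia")
print(listing_allowed, search_allowed)
```

With the real file, `is_allowed(bots.text, url)` could be called for each URL before it is requested.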

Here we will get the main page from car section of bazos.
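Before the full function, the URL-building step can be sketched in isolation. The snippet assumes 20 ads per results page and the query-string format used in the next cell; `normalize_query`, `build_page_urls`, `ADS_PER_PAGE` and `QUERY_SUFFIX` are hypothetical names introduced for this illustration, and no request is sent.

```python
import re
from typing import List

ADS_PER_PAGE = 20  # assumption: one results page lists 20 ads
QUERY_SUFFIX = "&hlokalita=&humkreis=25&cenaod=&cenado=&order="


def normalize_query(user_input: str) -> str:
    """Drop punctuation and join the remaining words with '+'."""
    cleaned = re.sub(r"[^\w\s]", "", str(user_input)).strip()
    return re.sub(r"\s+", "+", cleaned)


def build_page_urls(user_input: str, total_ads: int) -> List[str]:
    """Return one search URL per results page, including a partial last page."""
    query = normalize_query(user_input)
    return [
        f"https://auto.bazos.cz/{offset}/?hledat={query}{QUERY_SUFFIX}"
        for offset in range(0, total_ads, ADS_PER_PAGE)
    ]


urls = build_page_urls("octávia 3", total_ads=45)
print(len(urls))  # 3 pages: offsets 0, 20 and 40
print(urls[-1])
```

Using the total number of ads directly as the stop value of `range` means a partially filled last page still gets its own URL.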

In [13]:
# firstly, we define the input variable so the user can search according to their preference
def search_model(user_input):
    """
    Function search_model takes a string input, hence it has to be in quotes ("").
    The input should be the name of the car model you would like to get results for,
    e.g. "octávia 3" - the full Czech alphabet can be used without problems.
    
    The string is then stripped of characters that do not belong in the search input,
    and if the input contains more than one word, the words are joined by a '+' (plus) sign,
    since that is the format that bazos.cz uses in its URLs.
    
    Then, the prepared string is pasted into the common URL format that bazos.cz uses.
    
    The next step is obtaining the number of advertisements found for the user's input from the html source code.
    This number is used to determine how many adv. tabs (result pages) we will scrape.
    
    The function returns "soup_list" - the list of html codes for each adv. tab that we process with function 
    n_days_search.
    """
    
    user_search_input = str(user_input)
    user_search_input = re.sub(r"[^\w\s]", '', user_search_input)
    user_search_input = re.sub(r"\s+", '+', user_search_input)

    no_of_adv_url = 'https://auto.bazos.cz/0/?hledat=' + user_search_input + '&hlokalita=&humkreis=25&cenaod=&cenado=&order='

    page = requests.get(no_of_adv_url)
    no_of_adv_html = BeautifulSoup(page.text, 'html')

    get_no_adv = no_of_adv_html.find('div', {'class':'listainzerat inzeratyflex'})
    get_no_adv = get_no_adv.find('div', {'class':'inzeratynadpis'})
    get_no_adv = get_no_adv.text
    get_no_adv = get_no_adv.split("z ")[-1]
    number_adv = get_no_adv.replace(" ","")
    number_adv = int(number_adv)
    # "number_adv" is the total number of advertisements for the particular search input.
    # Each tab lists 20 of them, so the tab offsets are 0, 20, 40, ... up to (not including) number_adv;
    # this also covers a partially filled last tab and searches with fewer than 20 results.
    number_sequence = range(0, number_adv, 20) #start, stop (not included), step
    # we create empty list for saving the urls,
    # and then append other tabs with offers (since there are only 20 offers per tab by default)
    main_url_list = list()

    for i in range(0,len(number_sequence)):
        url = f'https://auto.bazos.cz/{number_sequence[i]}/'
        url = url + '?hledat=' + user_search_input + '&hlokalita=&humkreis=25&cenaod=&cenado=&order='
        main_url_list.append(url)
    # here we can check the list of urls for particular tabs
    #print(main_url_list)

    # in the next step, we get the text of each of those tabs using the BeautifulSoup function,
    # and save it as elements of the "soup_list"
    soup_list = list()
    for url in main_url_list:
        page = requests.get(url)

        ## MAYBE SLOW DOWN LATER by 0.3s per iteration
        soup_list.append(BeautifulSoup(page.text, 'html'))

    # filter to cut-off pseudo-empty elements in main_url_list to prevent the unwanted behaviour of the code
    res_soup_list = []
    for element in soup_list:
        if "html" in element:
            res_soup_list.append(element)
    soup_list = res_soup_list
    
    print("Initial search for your model successfully finished.")
    return soup_list

# keep the result, since n_days_search reads the global "soup_list"
soup_list = search_model("octávia 3")
soup_list

Initial search for your model successfully finished.


[<!DOCTYPE html>
 <html lang="cs">
 <head>
 <title>Octávia 3 bazar - Auto | Bazoš.cz</title>
 <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 <meta content="Auto - Octávia 3 bazar. Vybírejte z 2 146 inzerátů. Prodej snadno a rychle na Bazoši. Přes půl milionů uživatelů za den. Najděte co potřebujete." name="description"/><link href="https://auto.bazos.cz/inzeraty/oct%C3%A1via-3/" rel="canonical"/><meta content="1055875657" property="fb:admins"/>
 <link href="https://www.bazos.cz/bazos60s.css" rel="stylesheet" type="text/css"/><link as="image" href="https://www.bazos.cz/obrazky/bazos.svg" rel="preload"/>
 <link href="https://www.bazos.cz/bazosprint.css" media="print" rel="stylesheet" type="text/css"/>
 <link href="https://www.bazos.cz/favicon.ico" rel="shortcut icon"/>
 <link href="https://www.bazos.cz/obrazky/icon-192x192.png" rel="icon" sizes="192x192"/>
 <link href="https://www.bazos.cz/apple-touch-icon.png" rel="apple-touch-icon"/>
 <link href="https://www.bazo

There are 20 ads plus other page elements. Let's pick just one ad to see how it is constructed.

There are several elements that are important to us. In this section, we are interested in the ad id, which we can extract from the href of the a element on the second line: it is the number after "inzerat/". We are also interested in the href itself, since it is the link to follow to the actual ad. Another important element is the h2 with class "nadpis", which contains the title of the ad, from which we can extract the type of car. The last element of interest is the span with class "velikost10", which holds the upload date.
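As a small illustration of these two extractions, the sketch below pulls the ad id out of an href and parses the upload date from the span text. The href is one of the links found by this notebook; the date labels such as "- TOP - [22.11.2022]" are an assumed format inferred from the cleaning code used later, and the helper names `ad_id_from_href` and `parse_ad_date` are hypothetical.

```python
import re
from datetime import date
from typing import Optional


def ad_id_from_href(href: str) -> Optional[str]:
    """Return the number that follows 'inzerat/' in an ad link, if present."""
    match = re.search(r"/inzerat/(\d+)/", href)
    return match.group(1) if match else None


def parse_ad_date(label: str) -> date:
    """Pull a day.month.year date out of a label such as '- TOP - [22.11.2022]'."""
    match = re.search(r"(\d{1,2})\.\s*(\d{1,2})\.\s*(\d{4})", label)
    if match is None:
        raise ValueError(f"no date found in {label!r}")
    day, month, year = (int(part) for part in match.groups())
    return date(year, month, day)


href = "/inzerat/157103354/skoda-octavia-3-20tdi-110kw-dsg-dab-nakup-v-klidu.php"
ad_id = ad_id_from_href(href)
print(ad_id)
print(parse_ad_date("- TOP - [22.11.2022]"))
```

Searching for the date pattern directly avoids having to strip the surrounding "TOP", dash and bracket characters one by one.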

Now we can follow the link to the actual ad page to see its content.

In the next chunk, we filter for recently added ads. Comparing the numbers, 98 of the 2020 advertisements found (as of now) were added today.
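The date window behind this filter can be sketched on its own. The snippet is a minimal illustration rather than the notebook's function: `accepted_dates` and `MAX_PAST_DAYS` are hypothetical names, and `today` is passed in explicitly so that the result is reproducible.

```python
from datetime import date, timedelta
from typing import Set

MAX_PAST_DAYS = 5  # the search never looks further back than this


def accepted_dates(n_days: int, today: date) -> Set[date]:
    """Return today plus up to n_days previous days, capped at MAX_PAST_DAYS."""
    span = max(0, min(n_days, MAX_PAST_DAYS))
    return {today - timedelta(days=i) for i in range(span + 1)}


# Asking for 7 past days is capped to 5, giving 6 calendar days in total.
window = accepted_dates(7, today=date(2022, 11, 22))
print(min(window), max(window))
```

An ad is then kept only if its parsed upload date is a member of this set.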

In [14]:
#FILTER for advertisements added in last n_days 
def n_days_search(n_days):
    """
    Function n_days_search behaves as a filter and considers only the advertisements that have been added
    within the selected time range. Note that we restrict the maximum number of past days included
    in the search to 5. If the inserted number is greater than 5, the function considers just 5 past days.
    The range the user can select is therefore from 0 to 5 - where 0 means no past days (just today),
    and 5 means today and the 5 days before.
    
    The date is scraped from each adv. source code, converted into a datetime object and then checked
    against the condition based on the n_days input provided by the user.
    
    The function returns a list of URLs of all advertisements that satisfy the required condition,
    prepared for further processing - scraping and data-mining of the desired parameters.
    """
    number_of_accepted_past_days = n_days #range (0:5) - if more,
    # print error message that number of max days exceeded and we work with only 5 past days

    today = date.today()
    accepted_days = list()

    for i in list(range(0, min(number_of_accepted_past_days + 1, 6), 1)):
        i_days_ago = today - timedelta(days=i)
        accepted_days.append(i_days_ago)

    # by the following code, we get the urls of each advertisement /offer/ (listed in the tabs we work with),
    # and save it to "list_of_offers_url"
    list_of_offers_url = list()

    for element in soup_list:
        x = element.find_all('div', {'class':'inzeraty inzeratyflex'})

        for sub_element in x:
            y = sub_element.find('div', {'class':'inzeratynadpis'})
            attribute_a = y.find('a')
            w = attribute_a.get('href')
            attribute_span = y.find('span')

            #here we obtain the date from the particular advertisement to make sure we analyze only ads added within the accepted days
            date_str = str(attribute_span.text)
            date_str = date_str.replace(" ","")
            date_str = date_str.strip("-[]")
            date_str = date_str.strip("TOP-[]")
            date_object = datetime.strptime(date_str,'%d.%m.%Y')
            date_object = date_object.date()

            if date_object not in accepted_days:
                continue

            list_of_offers_url.append(f'https://auto.bazos.cz{w}')

    print('The number of found advertisements matching the criteria:',len(list_of_offers_url),'.')
    return(list_of_offers_url)

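# keep the result, since get_info is later called with "list_of_offers_url"
list_of_offers_url = n_days_search(5)
list_of_offers_url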

The number of found advertisements matching the criteria: 486 .


['https://auto.bazos.cz/inzerat/157103354/skoda-octavia-3-20tdi-110kw-dsg-dab-nakup-v-klidu.php',
 'https://auto.bazos.cz/inzerat/157075143/skoda-octavia-3-20tdi-110kw-dsg-plna-zaruka-2-roky-zdarma.php',
 'https://auto.bazos.cz/inzerat/157035970/octavia-scout-4x420tdi-110kwr2018dsgledacc123tiskm.php',
 'https://auto.bazos.cz/inzerat/157035183/octavia-rs-20tsi-230psm2018facecantoncolumbuslane.php',
 'https://auto.bazos.cz/inzerat/156861400/skoda-octavia-3-20tdi-110kw-plna-zaruka-2-roky-zdarma.php',
 'https://auto.bazos.cz/inzerat/156839898/skoda-octavia-3-20tdi-110kw-dsg-plna-zaruka-2-roky-zdarma.php',
 'https://auto.bazos.cz/inzerat/156800462/skoda-octavia-3-16tdi-85kw-plna-zaruka-2-roky-zdarma.php',
 'https://auto.bazos.cz/inzerat/156776866/skoda-octavia-3-fc-14tsi-cng-81kw-dsg-style-koupcr107tkm.php',
 'https://auto.bazos.cz/inzerat/156583245/skoda-octavia-3-16tdi-85kw-dsg-plna-zaruka-2-roky-zdarma.php',
 'https://auto.bazos.cz/inzerat/157066972/skoda-octavia-rs-dsg-fullled-acc-colum

In [15]:
# DATA/TEXT MINING PART
def modify_text(text):
    """ Internal function modifying the text of the ad to make it better suited for text mining. """
    return text.replace(" ", "").replace(".", "").replace("xxx", "000").replace("-", "")
def get_numbers_from_text(text):
    """ Internal function for finding all numbers in the text. """
    text = modify_text(text)
    pattern = r'[.]?[\d]+[\.]?\d*(?:[eE][-+]?\d+)?'
    list_of_numbers = re.findall(pattern, text)
    return list_of_numbers
def find_years(numbers):
    """ Internal function picking numbers which might be years from all numbers in text. """
    numbers = [x for x in numbers if (float(x) > 1980) and (float(x) < 2023)]
    return numbers
def find_km(numbers):
    """ Internal function picking numbers which might be mileage from all numbers in text. """
    numbers = [x for x in numbers if (float(x) > 3000) and (float(x) < 500000)]
    return numbers
def get_context(text, list_of_tokens, year_dictionary = ['egistr', 'rv', 'RV', 'yrob', 'ýrob', 'prov', 'rok', 'Rok'], km_dictionary = ['km', 'Km', 'KM', 'ilomet', 'ajet', 'ájez', 'achom', 'atoč'], context_span=20):
    """ Internal function looking into surroundings of each year and mileage candidate and searching for parts of words defined in dictionaries. """
    #import re
    context = []
    year = 'No match'
    km = 'No match'
    for token in find_years(list_of_tokens):
        all_occurences_indices = [m.start() for m in re.finditer(re.escape(token), text)]
        for index in all_occurences_indices:
            left_index = max(index - context_span, 0)
            right_index = min(index + context_span, len(text))
            substring = text[left_index: right_index].strip()
            for s in year_dictionary:
                year_find = [m.start() for m in re.finditer(s, substring)]
                if len(year_find) > 0:
                    year = token
    for token in find_km(list_of_tokens):
        all_occurences_indices = [m.start() for m in re.finditer(re.escape(token), text)]
        for index in all_occurences_indices:
            left_index = max(index - context_span, 0)
            right_index = min(index + context_span, len(text))
            substring = text[left_index: right_index].strip()
            for s in km_dictionary:
                km_find = [m.start() for m in re.finditer(s, substring)]
                if len(km_find) > 0:
                    km = token
    return [year, km]
class ResultTable(pd.DataFrame):
    def show_results(self, min_price = 0, max_price = 10000000, min_year = 1950, max_year = 2022, min_mileage = 0, max_mileage = 500000):
        temp_table = self[(self["mileage"] != "No match") & (self["year_of_manuf"] != "No match")]
        # convert the scraped strings to numbers so that filtering and sorting are numeric, not alphabetical
        temp_table = temp_table.assign(price=pd.to_numeric(temp_table["price"]),
                                       year_of_manuf=pd.to_numeric(temp_table["year_of_manuf"]),
                                       mileage=pd.to_numeric(temp_table["mileage"]))
        temp_table2 = temp_table[(temp_table["price"] > min_price) & (temp_table["price"] < max_price) &
                  (temp_table["year_of_manuf"] > min_year) & (temp_table["year_of_manuf"] < max_year) &
                    (temp_table["mileage"] > min_mileage) & (temp_table["mileage"] < max_mileage)].sort_values(by = "price")
        print(temp_table2)
    def show_best(self, n = 5, penalty = 5000):
        temp_table = self[(self["mileage"] != "No match") & (self["year_of_manuf"] != "No match")]
        temp_table = temp_table.assign(score=lambda x: ((2022 - pd.to_numeric(x.year_of_manuf))*penalty +
                                       pd.to_numeric(x.mileage)) / pd.to_numeric(x.price))
        temp_table = temp_table.sort_values(by = "score").head(n)
        print(temp_table)
def get_info(links):
    """ Get_info is a final function performing text mining and creating results in form of ResultTable class. """
    results_temp = []
    for i in links:
        print(i)
        add_page = requests.get(i)
        soup_add = BeautifulSoup(add_page.text, 'html')
        add = modify_text(soup_add.find('div', {'class':'popisdetail'}).get_text())
        price = soup_add.find('table').find_all('b')[-1].get_text()
        all_numbers = get_numbers_from_text(add)
        context_got = get_context(add, all_numbers)
        result = [i, context_got[0], context_got[1], price.replace(" ", "").replace("Kč", "")]
        results_temp.append(result)
        time.sleep(0.2)
    results = ResultTable(results_temp)
    results.columns = ['link', 'year_of_manuf', 'mileage', 'price']
    results = results[(results["price"] != "Dohodou") & (results["price"] != "Vtextu") & (results["price"] != "Nabídněte")]
    return results
result = get_info(list_of_offers_url)


pd.options.display.max_colwidth = 120
test = ResultTable(result)
test.show_results(min_price = 50000, max_price = 350000, min_year = 2013, max_year = 2018, min_mileage = 100000, max_mileage = 200000)
test.show_best(n = 10)

https://auto.bazos.cz/inzerat/157103354/skoda-octavia-3-20tdi-110kw-dsg-dab-nakup-v-klidu.php
https://auto.bazos.cz/inzerat/157075143/skoda-octavia-3-20tdi-110kw-dsg-plna-zaruka-2-roky-zdarma.php
https://auto.bazos.cz/inzerat/157035970/octavia-scout-4x420tdi-110kwr2018dsgledacc123tiskm.php
https://auto.bazos.cz/inzerat/157035183/octavia-rs-20tsi-230psm2018facecantoncolumbuslane.php
https://auto.bazos.cz/inzerat/156861400/skoda-octavia-3-20tdi-110kw-plna-zaruka-2-roky-zdarma.php
https://auto.bazos.cz/inzerat/156839898/skoda-octavia-3-20tdi-110kw-dsg-plna-zaruka-2-roky-zdarma.php
https://auto.bazos.cz/inzerat/156800462/skoda-octavia-3-16tdi-85kw-plna-zaruka-2-roky-zdarma.php
https://auto.bazos.cz/inzerat/156776866/skoda-octavia-3-fc-14tsi-cng-81kw-dsg-style-koupcr107tkm.php
https://auto.bazos.cz/inzerat/156583245/skoda-octavia-3-16tdi-85kw-dsg-plna-zaruka-2-roky-zdarma.php
https://auto.bazos.cz/inzerat/157066972/skoda-octavia-rs-dsg-fullled-acc-columbus-vyhrsedacky.php
https://auto.bazos

https://auto.bazos.cz/inzerat/156024216/skoda-octavia-3-rs-liftback.php
https://auto.bazos.cz/inzerat/156879818/multifunkcni-volant-5e0419091ah-skoda-octavia-3-rv-2013.php
https://auto.bazos.cz/inzerat/156879203/ridici-jednotka-abs-5q0907379r-agregat-5q0614517q.php
https://auto.bazos.cz/inzerat/156879000/skoda-octavia-3-greenline-rozvody.php
https://auto.bazos.cz/inzerat/155859328/skoda-octavia-3-rs-20tsi-challenge-combi-manual-facelift.php
https://auto.bazos.cz/inzerat/156867606/skoda-octavia-iii-naraznik.php
https://auto.bazos.cz/inzerat/156858058/zimni-sada-17-alu-kol-crystal.php
https://auto.bazos.cz/inzerat/156853685/prevodovka-6q-dsg-hut-20tsi-147kw-axx-vw-golf-5-gti-152tis.php
https://auto.bazos.cz/inzerat/156752913/skoda-octavia-3-facelift-16tdi-clever-led-navi-alu.php
https://auto.bazos.cz/inzerat/156551382/skoda-octavia-3-combi-16-tdi-81kw-style-vybava.php
https://auto.bazos.cz/inzerat/156686405/octavia-3-dsg-14tsi-110kw-style-1maj-2018-cr-lednavi-dph.php
https://auto.bazos.c

https://auto.bazos.cz/inzerat/156750388/climatronic-klimatronik-5e0907044r-skoda-octavia-3-kombi-16.php
https://auto.bazos.cz/inzerat/157057962/sedackasedadlo-ridice-octavia-3-rs-alcantaralatka.php
https://auto.bazos.cz/inzerat/156749699/skoda-octavia-3-style-20-tdi-110kw-dsg-acc-laneassist.php
https://auto.bazos.cz/inzerat/157056725/volant-skoda-fabia3octavia3superb2a3-roomsyeti.php
https://auto.bazos.cz/inzerat/157055312/packy-skoda-octavia-3.php
https://auto.bazos.cz/inzerat/157055004/skoda-octavia-3-12-tsi-75000-km-1majitel-soukroma-osoba.php
https://auto.bazos.cz/inzerat/157053437/skoda-octavia-iii-combi-style-20-tdi-dsg-12019.php
https://auto.bazos.cz/inzerat/157053350/skoda-octavia-iii-5e0-combi-2013-2017-14-tgi-nahradni-dily.php
https://auto.bazos.cz/inzerat/157053335/motor-19-tdi-16-hdi-16-mpi-20-fsi-12-htp-27-tdi.php
https://auto.bazos.cz/inzerat/157052681/skoda-octavia-3-pekny-blatnik.php
https://auto.bazos.cz/inzerat/157052643/skoda-volkswagen-audi-seat-15-tsi.php
https://a

https://auto.bazos.cz/inzerat/157011603/skoda-octavia-3-combi-20tdi-rv2015-27000km.php
https://auto.bazos.cz/inzerat/157011494/octavia-3-plechove-disky-65x16-et46.php
https://auto.bazos.cz/inzerat/157010320/pneu-barum-bravuris-5hm-22550-r-17-y-xl.php
https://auto.bazos.cz/inzerat/156697221/navigace-5e0919605-skoda-octavia-3-kombi-rv-2014.php
https://auto.bazos.cz/inzerat/156695465/zadni-vnejsi-svetla-skoda-octavia-3-kombi-rv-2015.php
https://auto.bazos.cz/inzerat/156695463/predni-dotykovy-display-amundsen-skoda-octavia-3-rv-2015.php
https://auto.bazos.cz/inzerat/156694782/ridici-jednotka-motoru-ckf-04l907309d-vw-golf-7-kombi-2014.php
https://auto.bazos.cz/inzerat/157005646/skoda-octavia-3-rs.php
https://auto.bazos.cz/inzerat/156692196/octavia-3-combi-20tdi-110kw-16tdi-dily-cerne-bila-2y2y.php
https://auto.bazos.cz/inzerat/157004069/redukce-facelift-svetel-skoda-octavia-3.php
https://auto.bazos.cz/inzerat/157003573/kryty-pedalu-z-uslechtile-oceli-golf-7-octavia-3-leon3-atp.php
https://a

https://auto.bazos.cz/inzerat/156941920/zadni-5-dvere-skoda-octavia-lll-kombi.php
https://auto.bazos.cz/inzerat/156941812/zadni-naraznik-skoda-octavia-lll-lift.php
https://auto.bazos.cz/inzerat/156941401/skoda-octavia-iii-stresni-nosic.php
https://auto.bazos.cz/inzerat/155132701/skoda-octavia-3-14tsi-110kw-led-navi-tempomat-appconnect.php
https://auto.bazos.cz/inzerat/156940493/koupim-svetla-na-skoda-octavia-3.php
https://auto.bazos.cz/inzerat/156940422/stabilizator-vw-golf-7-octavia-3vw-touran.php
https://auto.bazos.cz/inzerat/156940174/predni-kotouce-skoda-octavia-3.php
https://auto.bazos.cz/inzerat/156940087/skoda-octavia-3-kombi-viko-kufru.php
https://auto.bazos.cz/inzerat/156937127/2253519-alu.php
https://auto.bazos.cz/inzerat/156936884/chromova-lista-octavia-3-pred-face.php
https://auto.bazos.cz/inzerat/156935516/prodam.php
https://auto.bazos.cz/inzerat/156934220/alu-kola-17-skoda-octavia-trius-5x112-s-pneu-22545r17.php
https://auto.bazos.cz/inzerat/156932371/sadu-al-kola-5x1127j