<h1><center> Python project</center></h1>
<h2><center>Current car prices and other relevant parameters from bazos.cz</center></h2>
<h3><center>Daniel Brosko, Vojtěch Suchánek</center></h3>

Our goal is to web-scrape advertisements listed on bazos.cz, which is currently one of the most used websites for selling used cars in the Czech Republic. It carries more than 15 000 car ads daily. However, its search options are very limited, which makes it hard to find a car that matches specific parameters.

We are going to write an algorithm that scans the ads for the current day, picks those that meet our conditions on date and car type, and saves their links. It then visits each link, saves the text of the ad, and analyzes that text to find our parameters.

This approach might also allow for analysis over a longer time period in further steps: we would collect data periodically and investigate trends in price changes, the number of ads added for a selected car on particular days, and more. However, since this project is designed as a one-time run, we decided to limit the data to the current date only.

In [16]:
import requests
from bs4 import BeautifulSoup
import re
from datetime import date, datetime, timedelta
import time
import numpy as np
import pandas as pd

The commented-out line below lists the installed package versions so that they can be copied into requirements.txt.

In [1]:
#pip list

With the code in the following chunk, we checked that we are allowed to scrape the relevant parts of the bazos.cz domain.

In [3]:
bots = requests.get('https://auto.bazos.cz/robots.txt')
#print(bots.text)

From the robots.txt page we can see that the actions performed in our project are allowed, since we are not going to use the search commands listed there.
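The same check can also be done programmatically with the standard library. The sketch below is only an illustration: `SAMPLE_ROBOTS` is a hypothetical robots.txt written for this example (it is not the real content of the bazos.cz file), and `is_allowed` is a helper name we introduce here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration;
# the real rules are in the file downloaded in the chunk above.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search.php
"""


def is_allowed(robots_text, url, user_agent="*"):
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


# An ad detail page versus a search URL under the sample rules
print(is_allowed(SAMPLE_ROBOTS, "https://auto.bazos.cz/inzerat/123/example.php"))
print(is_allowed(SAMPLE_ROBOTS, "https://auto.bazos.cz/search.php?hledat=octavia"))
```

To check the real rules, `bots.text` from the chunk above could be passed in place of `SAMPLE_ROBOTS`.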

In the next chunk, we import two .py scripts in which we defined the functions search_model and n_days_search, which filter the advertisements down to those relevant to our preferences. The functions are described in more detail a few chunks below, where we print their documentation, and in the .py scripts themselves on GitHub.

In [17]:
from model_search import search_model
from n_days_search import n_days_search

In [19]:
print("Enter your desired car model here:")
my_search = input()
my_days = input()

Enter your desired car model here:
octavia 3
4



In [20]:
# Now we run the functions we loaded from script, it will take some time, though
soup_list = search_model(my_search)
my_days = int(my_days)
list_of_offers_url = n_days_search(my_days, soup_list)





Initial search for your model successfully finished.
The number of found advertisements matching the criteria: 448 .


In the next chunk, we keep only the ads added today or at most 5 days ago. For comparison, on 2022-08-30 there were 98 ads added that day, while the total number of ads for the same car-model input was 2022.
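To illustrate the idea of such a date filter, here is a minimal sketch. It is not the code of n_days_search: the helper names `parse_listing_date` and `is_recent` are hypothetical, and we assume that each listing shows its date in the form "[30.8. 2022]".

```python
import re
from datetime import date, timedelta

# Assumption: listings show their date as "[30.8. 2022]"; adjust the
# pattern if the site uses a different format.
DATE_PATTERN = re.compile(r"\[\s*(\d{1,2})\.\s*(\d{1,2})\.\s*(\d{4})\s*\]")


def parse_listing_date(text):
    """Return the date found in text, or None if no date is present."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    day, month, year = (int(part) for part in match.groups())
    return date(year, month, day)


def is_recent(text, n_days, today):
    """True if the listing date is at most n_days before today."""
    listed = parse_listing_date(text)
    if listed is None:
        return False
    return timedelta(0) <= today - listed <= timedelta(days=n_days)


today = date(2022, 8, 30)
print(is_recent("- TOP - [28.8. 2022]", 4, today))  # listed 2 days earlier
print(is_recent("[20.8. 2022]", 4, today))          # listed 10 days earlier
```

Passing `today` explicitly keeps the function easy to test; in a real run it would be `date.today()`.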

In [9]:
print("Enter the number of past days (max. 5) that you want to include in your search here:")
my_days = input()

Enter the number of past days (max. 5) that you want to include in your search here:
4


In [21]:
# Now we specify on how many past days we want to include in our search
my_days = int(my_days)
list_of_offers_url = n_days_search(my_days, soup_list)

The number of found advertisements matching the criteria: 448 .


In [12]:
# Here we print the documentation for our functions we imported earlier.

help(search_model)

help(n_days_search)

Help on function search_model in module model_search:

search_model(user_input: str)
    Function search_model takes string input, hence it has to be in quotes ("").
    The input should be the name of the car model you would like to get results for,
    e.g. "octávia 3" - there should be no problem even when full Czech alphabet is used.
    
    Then the string is stripped of the characters that are not supposed to be in the search input,
    and if there are more than single word in the input, they are connected by '+' (plus) sign,
    since that is the format that bazos.cz use in their URLs.
    
    Then, the prepared string is paste into the common URL format that bazos.cz use.
    
    The very next step is obtaining the number of found advertisements for user's input from the html source code.
    This number of advertisements is used to select the proper length of the adv. tabs list that we will scrape.
    
    The function returns "soup_list" - the list of html codes for each
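The query preparation described in the documentation above could look roughly like the following sketch. The helper name `build_query` is hypothetical, and the choice of which characters to strip is our assumption; the actual rules are in model_search.py.

```python
import re


def build_query(user_input):
    """Hypothetical helper: turn free text into a '+'-joined query string."""
    # Assumption: keep only letters (including Czech ones), digits,
    # underscores and whitespace; everything else is dropped.
    cleaned = re.sub(r"[^\w\s]", "", user_input)
    # split() without arguments also collapses repeated spaces
    return "+".join(cleaned.split())


print(build_query("  octávia   3! "))
```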

Finally, we proceed to the data-mining part, where we extract the desired parameters: year of manufacture, mileage, and price.

This is probably the most demanding part of the project, because we need to extract the relevant data from unformatted text. Ads follow no official format, so we tried to find a way to extract this information from a variety of formats. The results are imperfect, since our code sometimes fails to recognize an unusual format of a parameter. As a further step, an ML algorithm could probably improve the recognition rate significantly.
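The core idea is to take each number in the plausible range and look for a keyword fragment in the characters around it. The sketch below is a simplified illustration of that idea on a made-up ad text: `normalise` and `number_near_keyword` are hypothetical names, and unlike the project code it returns the first matching number.

```python
import re


def normalise(text):
    """Drop spaces, dots and dashes so '150 000 km' becomes '150000km'."""
    return re.sub(r"[ .\-]", "", text)


def number_near_keyword(text, keywords, low, high, span=20):
    """Return the first number in (low, high) with a keyword within span characters."""
    clean = normalise(text)
    for match in re.finditer(r"\d+", clean):
        value = int(match.group())
        if not low < value < high:
            continue
        # Window of span characters on each side of the number's start
        window = clean[max(match.start() - span, 0): match.start() + span]
        if any(keyword in window for keyword in keywords):
            return value
    return None


# Made-up ad text, for illustration only
sample = "Škoda Octavia 3, rok výroby 2016, najeto 150 000 km, cena dohodou."
print(number_near_keyword(sample, ["rok", "ýrob"], 1980, 2023))   # year candidate
print(number_near_keyword(sample, ["km", "ajet"], 3000, 500000))  # mileage candidate
```

Note that removing spaces can merge two neighbouring numbers into one, which is one of the reasons why unusual formats are sometimes missed.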

We save all of those parameters along with the URLs of the particular advertisements. We created a class ResultTable with two methods, "show_results" and "show_best", which display the filtered results and the best recommended ads for our desired car model.

Hence, we can now look at the potentially most interesting advertisements by following the URLs and checking the full content of several ads instead of looking at "thousands" of them.
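The ranking used by show_best in the cell below computes score = ((2022 - year) * penalty + mileage) / price and lists the ads with the lowest score first. The following sketch works through that formula on made-up rows; the helper name `add_score` and the sample values are ours and do not come from real ads.

```python
import pandas as pd

# Made-up rows for illustration; the values do not come from real ads
ads = pd.DataFrame({
    "link": ["ad-a", "ad-b", "ad-c"],
    "year_of_manuf": [2016, 2014, 2018],
    "mileage": [150000, 180000, 90000],
    "price": [250000, 200000, 300000],
})


def add_score(table, reference_year=2022, penalty=5000):
    """Return a copy with score = (age * penalty + mileage) / price."""
    age = reference_year - table["year_of_manuf"]
    return table.assign(score=(age * penalty + table["mileage"]) / table["price"])


# Ascending sort, so the lowest score comes first
ranked = add_score(ads).sort_values(by="score")
print(ranked[["link", "score"]])
```

For example, "ad-a" gets (6 * 5000 + 150000) / 250000 = 0.72. Because the price is in the denominator, a higher price lowers the score, which is worth keeping in mind when reading the ranking.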

In [22]:
# DATA/TEXT MINING PART
def modify_text(text):
    """ Internal function modifying the text of the ad to make it better suited to text mining. """
    return text.replace(" ", "").replace(".", "").replace("xxx", "000").replace("-", "")

def get_numbers_from_text(text):
    """ Internal function for finding all numbers in the text. """
    text = modify_text(text)
    # Raw string, so that "\d" is not read as an invalid string escape
    pattern = r'[.]?[\d]+[\.]?\d*(?:[eE][-+]?\d+)?'
    list_of_numbers = re.findall(pattern, text)
    return list_of_numbers

def find_years(numbers):
    """ Internal function picking numbers which might be years from all numbers in the text. """
    numbers = [x for x in numbers if 1980 < float(x) < 2023]
    return numbers

def find_km(numbers):
    """ Internal function picking numbers which might be mileage from all numbers in the text. """
    numbers = [x for x in numbers if 3000 < float(x) < 500000]
    return numbers

def get_context(text, list_of_tokens,
                year_dictionary = ('egistr', 'rv', 'RV', 'yrob', 'ýrob', 'prov', 'rok', 'Rok'),
                km_dictionary = ('km', 'Km', 'KM', 'ilomet', 'ajet', 'ájez', 'achom', 'atoč'),
                context_span=20):
    """ Internal function looking into the surroundings of each year and mileage candidate and searching for the word fragments defined in the dictionaries. """
    year = 'No match'
    km = 'No match'
    for token in find_years(list_of_tokens):
        # re.escape makes the token match literally (it may contain '.' or '+')
        all_occurences_indices = [m.start() for m in re.finditer(re.escape(token), text)]
        for index in all_occurences_indices:
            left_index = max(index - context_span, 0)
            right_index = min(index + context_span, len(text))
            substring = text[left_index: right_index].strip()
            # The last candidate with a dictionary fragment nearby wins
            if any(s in substring for s in year_dictionary):
                year = token
    for token in find_km(list_of_tokens):
        all_occurences_indices = [m.start() for m in re.finditer(re.escape(token), text)]
        for index in all_occurences_indices:
            left_index = max(index - context_span, 0)
            right_index = min(index + context_span, len(text))
            substring = text[left_index: right_index].strip()
            if any(s in substring for s in km_dictionary):
                km = token
    return [year, km]

class ResultTable(pd.DataFrame):
    """ DataFrame of scraped ads with helper methods for filtering and ranking. """
    def show_results(self, min_price = 0, max_price = 10000000, min_year = 1950, max_year = 2022, min_mileage = 0, max_mileage = 500000):
        """ Print the ads whose price, year and mileage lie strictly inside the given bounds, sorted by price. """
        temp_table = self[(self["mileage"] != "No match") & (self["year_of_manuf"] != "No match")]
        # Convert the text columns to numbers once, so that both filtering and sorting are numeric
        price = pd.to_numeric(temp_table["price"])
        year = pd.to_numeric(temp_table["year_of_manuf"])
        mileage = pd.to_numeric(temp_table["mileage"])
        mask = ((price > min_price) & (price < max_price) &
                (year > min_year) & (year < max_year) &
                (mileage > min_mileage) & (mileage < max_mileage))
        # Sorting the original string column would order "100000" before "50000"
        temp_table2 = temp_table.assign(price = price)[mask].sort_values(by = "price")
        print(temp_table2)
    def show_best(self, n = 5, penalty = 5000):
        """ Print the n ads with the lowest score, where score = ((2022 - year) * penalty + mileage) / price. """
        temp_table = self[(self["mileage"] != "No match") & (self["year_of_manuf"] != "No match")]
        temp_table = temp_table.assign(score=lambda x: ((2022 - pd.to_numeric(x.year_of_manuf))*penalty +
                                       pd.to_numeric(x.mileage)) / pd.to_numeric(x.price))
        temp_table = temp_table.sort_values(by = "score").head(n)
        print(temp_table)

def get_info(links):
    """ get_info is the final function; it performs the text mining and returns the results as a ResultTable. """
    results_temp = []
    for i in links:
        print(i)
        add_page = requests.get(i)
        # Name the parser explicitly; 'html' makes BeautifulSoup guess one and emit a warning
        soup_add = BeautifulSoup(add_page.text, 'html.parser')
        description = soup_add.find('div', {'class':'popisdetail'})
        if description is None:
            # No description block found (e.g. the ad was removed), so skip this link
            continue
        add = modify_text(description.get_text())
        price = soup_add.find('table').find_all('b')[-1].get_text()
        all_numbers = get_numbers_from_text(add)
        context_got = get_context(add, all_numbers)
        result = [i, context_got[0], context_got[1], price.replace(" ", "").replace("Kč", "")]
        results_temp.append(result)
        # Short pause between requests to avoid overloading the server
        time.sleep(0.2)
    results = ResultTable(results_temp, columns = ['link', 'year_of_manuf', 'mileage', 'price'])
    # Drop the ads that state a text instead of a numeric price
    results = results[(results["price"] != "Dohodou") & (results["price"] != "Vtextu") & (results["price"] != "Nabídněte")]
    return results

result = get_info(list_of_offers_url)


pd.options.display.max_colwidth = 120
test = ResultTable(result)
test.show_results(min_price = 50000, max_price = 350000, min_year = 2013, max_year = 2018, min_mileage = 100000, max_mileage = 200000)
test.show_best(n = 10)

https://auto.bazos.cz/inzerat/157968553/octavia-iii-14-tsi-110kw-dsg-odpocet-dph-style-business.php
https://auto.bazos.cz/inzerat/157833840/skoda-octavia-3-combi-scout-4x4-20-tdi-135kw-dsg-facelift.php
https://auto.bazos.cz/inzerat/157680831/skoda-octavia-3-fc-16tdi-85kw-koupcr1majiteltazne2017.php
https://auto.bazos.cz/inzerat/157649412/skoda-octavia-3-20tdi-110kw-dsg-plna-zaruka-2-roky-zdarma.php
https://auto.bazos.cz/inzerat/157495572/skoda-octavia-3-fc-14tsi-81kw-dsg-style-koupcr111tkm2018.php
https://auto.bazos.cz/inzerat/157457320/skoda-octavia-3-rs-20tsi-169kw-dsg-zaruka-2-roky-zdarma.php
https://auto.bazos.cz/inzerat/157425576/skoda-octavia-3-16tdi-85kw-dsg-plna-zaruka-2-roky-zdarma.php
https://auto.bazos.cz/inzerat/157935040/skoda-octavia-3-rs-20-tdi-135kw-serviska.php
https://auto.bazos.cz/inzerat/157421675/skoda-octavia-3-16tdi-85kw-dsg-plna-zaruka-2-roky-zdarma.php
https://auto.bazos.cz/inzerat/157358035/skoda-octavia-iii-style-dsg-20tdi-2018.php
https://auto.bazos.cz/inzer

https://auto.bazos.cz/inzerat/157446013/skoda-octavia-3-combi-16tdi-81kw-aluxenon.php
https://auto.bazos.cz/inzerat/156449050/skoda-octavia-a-fabia-koupiim-auto-nd.php
https://auto.bazos.cz/inzerat/156481269/5q0407257c5q0407258c-skoda-tehlice-hlinik.php
https://auto.bazos.cz/inzerat/157694302/skoda-octavia-3-facelift-16tdi-clever-led-navi-alu.php
https://auto.bazos.cz/inzerat/157583339/skoda-octavia-3-tdi-facelifgt-dsg-model-2019-navi-tempomat.php
https://auto.bazos.cz/inzerat/157718075/vstrikovaci-trysky-04l130277ak-20tdi-110kw-skoda-octavia-3.php
https://auto.bazos.cz/inzerat/157718063/motor-clh-clhc-16tdi-77kw-cr-skoda-octavia-3.php
https://auto.bazos.cz/inzerat/157694323/dpf-filtr-pevnych-castic-04l131601h-skoda-octavia-3.php
https://auto.bazos.cz/inzerat/157697484/delici-sit-kufru-5e9861691d-skoda-octavia-3-kombi.php
https://auto.bazos.cz/inzerat/156787528/octavia-3-fl-style-16-tdi-85-kw.php
https://auto.bazos.cz/inzerat/156574604/skoda-octavia-iii-scout-20-tdi-135kw-dsg-navi-tazn

https://auto.bazos.cz/inzerat/157949614/skoda-octavia-iii-generace.php
https://auto.bazos.cz/inzerat/157949272/skoda-octavia-iii-original-skodaautopotahy.php
https://auto.bazos.cz/inzerat/157949232/original-koberecky-octavia-3-iii-5e1-863-011-j-xmv.php
https://auto.bazos.cz/inzerat/157947788/dsg-setrvacnik-original-19-tdi-20tdi-setrvacnik-dsg-novy.php
https://auto.bazos.cz/inzerat/157947754/delici-sit-pro-skoda-octavia-3-combi.php
https://auto.bazos.cz/inzerat/157946196/ventilator-octavia-3-superb-3vetrak-superb-3-octavia-3-novy.php
https://auto.bazos.cz/inzerat/157946060/turbo-turbodmychadlo-20-110kw-135kw-nove-original.php
https://auto.bazos.cz/inzerat/157944502/nove-alu-disky-5x112-r15-volkswagen-skoda-seat.php
https://auto.bazos.cz/inzerat/157944419/skoda-octavia-3-facelift-honeycomb-maska-mrizka.php
https://auto.bazos.cz/inzerat/157944163/skoda-octavia-3-vyrezy-karoserie-nosnik-prahy.php
https://auto.bazos.cz/inzerat/157942868/skoda-octavia-3-20tdi-135kw-dsg-4x4.php
https://auto.b

ConnectionError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
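
The run above stopped when the server reset the connection, so the results collected up to that point were lost. One possible remedy is to retry a failed request a few times before giving up. The sketch below is an illustration under our own assumptions: `fetch_with_retry` is a hypothetical helper, and `flaky_fetch` is a stand-in for requests.get so that the example runs without network access.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, wait=1.0, retry_on=(ConnectionError,)):
    """Call fetch(url), retrying on the given exceptions with a growing pause."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except retry_on:
            if attempt == attempts:
                # Out of attempts: let the caller see the error
                raise
            time.sleep(wait * attempt)


# Stand-in for requests.get that fails twice before succeeding
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("Connection reset by peer")
    return f"page for {url}"


print(fetch_with_retry(flaky_fetch, "https://auto.bazos.cz/", wait=0.01))
```

When used with requests, `retry_on=(requests.exceptions.RequestException,)` should be passed, because the exceptions raised by requests do not derive from the built-in ConnectionError.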