# Data Scraper

This Jupyter notebook downloads, pre-processes and saves data on about 6000 cars offered for sale in the Czech Republic on the website "420auto.cz". Its output is a CSV file with the downloaded car data, which will be further processed, cleaned and interpreted.

# Import of packages

The class Car_Downloader uses methods from the packages requests (downloading site content), BeautifulSoup (parsing the downloaded HTML), tqdm.notebook (interactive progress bar for the download) and time (pausing between requests to the site).

After the data are downloaded, we use pandas to pre-process and save them.

In [1]:
import requests
import json
import pandas as pd
from bs4 import BeautifulSoup
import time
from lxml import html
import html5lib
import random
import matplotlib.pyplot as plt
import numpy as np
import re
from tqdm.notebook import tqdm_notebook

# Getting info about cars

First we create the class "Car_Downloader", which helps us download the data efficiently. Each method is described in detail in its docstring within the class. Note that new car ads are uploaded every day, so our data contain information from the previous week; we suggest using the respective CSV file directly.

In [35]:
class Car_Downloader():
       
    '''
    Download manager class created for specific purpose: scraping information about cars from https://420auto.cz/ 
    
    It contains methods for collection of links, downloading and storing data
    '''
    
    def __init__(self, allowLog = True):
        '''
        Class creator
    
        Takes single argument : True or False, defining if you want to display status messages
    
        Creates class with attributes that are used to store data
        '''
        self.allowLog = allowLog
        if self.allowLog:
            print('Successfully Initialized Car Data')
            
            
            
    def getAllLinks(self, last):
        '''
        Method used to collect links to all sub-pages containing list of cars
    
        Takes one argument - Last page we want to see; it is then transformed to the last piece of URL
        
        Stores all proper URL's in a list called links
        
        '''
        global page_to_url, links

        # Lookup table: listing page number -> offset used in the 'next' URL parameter
        page_to_url = pd.DataFrame({
            'Page': range(1, 301),
            'URL': range(1, 5982, 20)
        }).reset_index(drop = True)

        if self.allowLog:
            print('Found {} sub-pages'.format(last))

        # Offset of the last requested page, extracted as a scalar instead of a one-element Series
        last_offset = int(page_to_url.loc[page_to_url['Page'] == last, 'URL'].iloc[0])
        links = ['https://420auto.cz/listings.php?next=' + str(i) for i in range(1, last_offset + 1, 20)]

        if self.allowLog:
            print('Successfully Collected Links')
        return links


        
    def getCarData(self):
        '''
        Method used to collect the car data from every listing page in links
        
        Takes no arguments; it iterates over the global list links created by getAllLinks
        
        '''
        global df
        df = pd.DataFrame()
        
        if self.allowLog:
            print('Successfully Downloaded Car Data for the given page')
        name = []
        year = []
        price = []
        mileage = []
        engine = []
        gearbox = []
        body = []
        location = []
        for url in tqdm_notebook(links):
            r = requests.get(url)
            r.encoding = 'UTF-8'
            soup = BeautifulSoup(r.text, 'lxml')
            raw = soup.find_all('div', {'class': 'car-vertical-info'})
            for item in raw:
                name.append(item.findAll("span", attrs = {"class" : "car-vertical-title"})[0].text.strip())
                year.append(item.findAll("span", attrs = {"class" : "car-vertical-year"})[0].text.strip())
                price.append(item.findAll("span", attrs = {"class" : "car-vertical-price"})[0].text.strip())
                mileage.append(item.findAll("span", attrs = {"class" : "car-vertical-mileage"})[0].text.strip())
                engine.append(item.findAll("span", attrs = {"class" : "car-vertical-engine"})[0].text.strip())
                gearbox.append(item.findAll("span", attrs = {"class" : "car-vertical-gearbox"})[0].text.strip())
                body.append(item.findAll("span", attrs = {"class" : "car-vertical-body"})[0].text.strip())
                location.append(item.findAll("span", attrs = {"class" : "car-vertical-location"})[0].text.strip())
                
        
            # Pause briefly after each page request so that we do not overload the site
            time.sleep(0.1)

        # Build the DataFrame once, after all pages have been scraped
        cars = {
                'name':name,
                'year':year,
                'price':price,
                'mileage':mileage,
                'engine':engine,
                'gearbox':gearbox,
                'body':body,
                'location':location
                }

        df = pd.DataFrame(cars)
        df.columns = ['Name', 'Year', 'Price', 'Mileage', 'Engine', 'Gearbox', 'Body', 'Location']
        # Empty strings and the placeholder '0000' are treated as missing values
        df = df.apply(lambda x: x.str.strip()).replace('', np.nan).replace('0000', np.nan)

        # Display options so that the full table is shown in the notebook
        pd.set_option('display.max_rows', None)
        pd.set_option('display.max_columns', None)
        pd.set_option('display.width', 1000)
        pd.set_option('display.colheader_justify', 'center')
        pd.set_option('display.precision', 3)

        return df

In [37]:
CARS = Car_Downloader()
CARS.getAllLinks(300)
CARS.getCarData()

Successfully Initialized Car Data
Found 300 sub-pages
Successfully Collected Links
Successfully Downloaded Car Data for the given page






| | Name | Year | Price | Mileage | Engine | Gearbox | Body | Location |
|---|---|---|---|---|---|---|---|---|
| 0 | Škoda Fabia, 1.4 TDI Elegance Combi | 2007.0 | 65 000 Kč | 270 723 km | Nafta, 59 kW | Manuální | kombi | Středočeský |
| 1 | Hyundai i20, 1.2 koup v ČR | 2012.0 | 99 000 Kč | 89 449 km | Benzín, 57 kW | NaN | hatchback | Středočeský |
| 2 | Jeep Compass, 1,4 i 103kW nové CZ odpoč.DP | NaN | 319 000 Kč | 52 084 km | Benzín, 103 kW | Manuální | SUV | Jihočeský |
| 3 | Škoda Superb, 2.0 TDi Style+ | 2015.0 | 433 000 Kč | 173 708 km | Nafta, 140 kW | Manuální | limuzína | Středočeský |
| 4 | Peugeot 206, 1.4 55kW | 2004.0 | 33 300 Kč | 190 780 km | Benzín, 55 kW | NaN | hatchback | Karlovarský |
| 5 | Renault Scenic, 1.5dCi 78kW navigace | 2009.0 | 99 900 Kč | 160 589 km | Nafta, 78 kW | Manuální | MPV | Liberecký |
| 6 | Opel Frontera, 2.2 DTI LIMITED, VŮZ PO RENOV | 2002.0 | 119 000 Kč | 227 000 km | Nafta, 85 kW | Manuální | terénní | Praha |
| 7 | Renault Megane, 1.6i,ČR.,1.MAJ.,SERVISKA | 2007.0 | 59 000 Kč | 213 000 km | Benzín, 82 kW | Manuální | kombi | Praha |
| 8 | Jeep Wrangler, 2.8CRD SAHARA UNLIMITED/ZÁRUK | 2014.0 | 689 000 Kč | 141 065 km | Nafta, 147 kW | Manuální | pick up | Praha |
| 9 | Fiat Punto, 1.2i 51kW,STK 12/2023 | 2011.0 | 114 900 Kč | 83 758 km | Benzín, 51 kW | Manuální | hatchback | Plzeňský |
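The link collection in getAllLinks comes down to simple arithmetic: the step of 20 in the code suggests that each listing page shows 20 cars, so page p starts at offset 20 * (p - 1) + 1. The following is a minimal sketch of the same URL list built without the helper DataFrame. The names `listing_urls` and `CARS_PER_PAGE` are illustrative assumptions and are not part of the original class.

```python
BASE_URL = 'https://420auto.cz/listings.php?next='
CARS_PER_PAGE = 20  # assumption inferred from the step of 20 used in getAllLinks


def listing_urls(last_page, per_page=CARS_PER_PAGE):
    '''Return the listing URLs for pages 1..last_page (hypothetical helper).'''
    return [BASE_URL + str(per_page * (page - 1) + 1)
            for page in range(1, last_page + 1)]


urls = listing_urls(300)
print(len(urls))   # number of listing pages
print(urls[0])     # URL of the first page (offset 1)
print(urls[-1])    # URL of the last page (offset 5981)
```

Computing the offset directly avoids the hard-coded table of 300 pages, so the same helper would keep working if more pages were requested.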


In [38]:
df.to_csv('car_project_data.csv', sep = ',')

# Getting the town of each individual car

The scraper above only collects the region of each car from the listing pages. To find out in which Czech town each car is being sold, we need to follow the links to the individual car pages. The URLs of all individual car pages contain the word "inzerce", so we collect the links that contain this keyword.

In [41]:
car_links = []
linksDF = []
substring = 'inzerce'
for car_url in tqdm_notebook(links):
    t = requests.get(car_url)
    t.encoding = 'UTF-8'
    carSoup = BeautifulSoup(t.text, 'lxml')
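    # Collect the href of every anchor on the listing page
    for linkCar in carSoup.findAll('a'):
        car_links.append(linkCar.get('href'))
#print(car_links)
# Keep only links to individual car ads; anchors without an href yield None and are skipped
linksDF = [link for link in car_links if link and substring in link]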
#display(pd.DataFrame(linksDF))
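The keyword filter used above can be tried out without any network access. The sketch below applies it to a small hand-written list of hrefs, including a `None` entry such as BeautifulSoup returns for an anchor without an href attribute. The helper name `filter_ad_links` and the sample list are illustrative assumptions.

```python
def filter_ad_links(hrefs, keyword='inzerce'):
    '''Keep only the hrefs that contain the keyword (hypothetical helper).'''
    # 'href and ...' skips None and empty strings before the substring test
    return [href for href in hrefs if href and keyword in href]


# Hand-written sample of values as they might come out of linkCar.get('href')
sample_hrefs = [
    'https://420auto.cz/inzerce/volkswagen/touran/351631/',
    None,
    'https://420auto.cz/listings.php?next=21',
    '#',
]

ad_links = filter_ad_links(sample_hrefs)
print(ad_links)
```

Only the first entry survives the filter, because it is the only one that contains "inzerce".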





The commented-out cell below shows a trial extraction for a single link:

In [2]:
#f = requests.get('https://420auto.cz/inzerce/volkswagen/touran/351631/')
#f.encoding = 'UTF-8'
#fSoup = BeautifulSoup(f.text, 'html.parser')
#rawSoup = fSoup.select('div > p')[1].get_text(separator=', ', strip=True).split(",")[1]

The cell below parses the 6000 individual links and extracts the town or municipality of each car.
WARNING! Because the number of individual links is very large, the for loop below may take considerably longer to run.
We therefore suggest using the CSV file "merged_car_data.csv" from the GitHub page directly.

In [144]:
obec = []
for j in tqdm_notebook(linksDF):
    f = requests.get(j)
    f.encoding = 'UTF-8'
    fSoup = BeautifulSoup(f.text, 'html.parser')
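    try:
        # Second <p> inside a <div>: join its text pieces with ', ' and keep the second field
        rawSoup = fSoup.select('div > p')[1].get_text(separator=', ', strip=True).split(",")[1]
    except IndexError:
        # Page without the expected paragraph or without a second field
        rawSoup = 'NA'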
    obec.append(rawSoup)
obec





[' PRAHA 8',
 ' PRAHA 8',
 ' PRAHA 8',
 ' PRAHA 8',
 ' Velké Přítočno',
 ' VESELÍ NAD LUŽNICÍ',
 ' VESELÍ NAD LUŽNICÍ',
 ' VESELÍ NAD LUŽNICÍ',
 ' VESELÍ NAD LUŽNICÍ',
 ' VESELÍ NAD LUŽNICÍ',
 ' POLIČKA',
 ' CVIKOV',
 ' KLADNO - KROČEHLAVY',
 ' KLADNO - KROČEHLAVY',
 ' PROTIVÍN',
 ' KOLÍN 5',
 ' Karlovy Vary',
 ' Liberec',
 ' PRAHA 9',
 ' PRAHA 9',
 ' Praha 4',
 ' PLZEŇ',
 ' STAŇKOV',
 ' STAŇKOV',
 ' Frýdek-Místek',
 ' Frýdek-Místek',
 ' Frýdek-Místek',
 ' PRAHA 3',
 ' Čestice',
 ' Čestice',
 ' Čestice',
 ' Čestice',
 ' Čestice',
 ' Čestice',
 ' Most',
 ' Janov u Litomyšle',
 ' Janov u Litomyšle',
 ' Janov u Litomyšle',
 ' Benešov',
 ' Benešov',
 ' Benešov',
 ' Benešov',
 ' PRAHA 9 - Čakovice',
 ' PRAHA 9 - Čakovice',
 ' PRAHA 9 - Čakovice',
 ' TŘEBÍČ',
 ' Liberec - Stráž nad Nisou',
 ' Liberec - Stráž nad Nisou',
 ' VALAŠSKÉ MEZIŘÍČÍ',
 ' Praha 8',
 ' Praha 8',
 ' Praha 8',
 'NA',
 ' ČESKÉ BUDĚJOVICE',
 ' Tábor',
 ' Tábor',
 ' Tábor',
 ' Tábor',
 ' Praha 10',
 ' Praha 10',
 ' Praha 10',
 ...]

In [145]:
len(obec)

6000
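The extraction in the loop above joins the text pieces of a paragraph with ", " and keeps the second comma-separated field, falling back to 'NA' when that field is missing. The sketch below isolates this last step on plain strings. The sample text is invented for illustration, since the real page layout is not shown here, and `second_field` is a hypothetical helper name. Unlike the original cell, it strips the leading space straight away.

```python
def second_field(text, default='NA'):
    '''Return the second comma-separated field of text, stripped, or default.'''
    parts = text.split(',')
    # Fall back when there is no second field or it is empty
    if len(parts) < 2 or not parts[1].strip():
        return default
    return parts[1].strip()


# Invented sample strings, only meant to mimic a "seller, town, ..." layout
print(second_field('Autobazar Example, PRAHA 8, 180 00'))
print(second_field('no comma here'))
```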

Because some entries refer to districts within Prague (e.g. "Praha 8"), the script below replaces all such strings with "Praha".

In [149]:
dfObec = pd.Series(obec).str.title()
# Collapse Prague districts (e.g. "Praha 8", "Praha 9 - Čakovice") into the single label "Praha"
substring = 'Praha'
dfObec[dfObec.str.contains(substring)] = 'Praha'
dfObec

0                                Praha
1                                Praha
2                                Praha
3                                Praha
4                       Velké Přítočno
5                   Veselí Nad Lužnicí
6                   Veselí Nad Lužnicí
7                   Veselí Nad Lužnicí
8                   Veselí Nad Lužnicí
9                   Veselí Nad Lužnicí
10                             Polička
11                              Cvikov
12                 Kladno - Kročehlavy
13                 Kladno - Kročehlavy
14                            Protivín
15                             Kolín 5
16                        Karlovy Vary
17                             Liberec
18                               Praha
19                               Praha
20                               Praha
21                               Plzeň
22                             Staňkov
23                             Staňkov
24                       Frýdek-Místek
25                       
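The normalisation step above can also be written as a small plain-Python function, which makes it easy to check on individual values. This is only a sketch: `normalise_town` is a hypothetical helper, and it deliberately keeps the 'NA' marker unchanged, whereas title-casing would turn it into 'Na'.

```python
def normalise_town(name, missing='NA'):
    '''Strip, title-case and collapse Prague districts (hypothetical helper).'''
    cleaned = name.strip()
    if cleaned == missing:
        return missing  # keep the missing-value marker as it is
    cleaned = cleaned.title()
    return 'Praha' if 'Praha' in cleaned else cleaned


# Values taken from the scraped list shown earlier
for raw in [' PRAHA 9 - Čakovice', ' VESELÍ NAD LUŽNICÍ', ' Frýdek-Místek', 'NA']:
    print(repr(raw), '->', repr(normalise_town(raw)))
```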

In [186]:
dfMerged = pd.merge(df, pd.DataFrame(dfObec), left_index = True, right_index = True)
dfMerged = dfMerged.rename(columns={0: 'Obec'})
dfMerged.to_csv('merged_car_data.csv', sep = ',')

# Importing data about Czech districts

To clarify the logic: when we searched for GeoJSON files for the interactive map in the next chapter, we could not find one containing the geographical coordinates of Czech regions. We did find a suitable file for Czech districts, but at this point our data only contain Czech towns. Thanks to our colleagues from previous years, we have the dataset below, which lists Czech towns together with their districts and regions. Using it, we can match the towns in our data to the appropriate districts and then create an interactive map.

In [302]:
brick = pd.read_csv('TOWN_BRICK_REGION.csv', sep = ',')
okres = [] 

In [303]:
for row in tqdm_notebook(range(len(dfMerged.Obec))):
    okres.append(brick.OKRES[brick.NAZEV == dfMerged.Obec[row].strip()].to_string(index=False).strip())       





# Matching towns with districts
Below we create another CSV file called "data_okresy.csv", which contains a single column with the district of each individual car. Note that a small number of entries are NA; these will be omitted.

In [307]:
okresDF = pd.DataFrame(okres)
okresDF.columns = ['Okres']
okresDF[okresDF == 'Series([], )'] = 'NA'
okresDF
okresDF.to_csv('data_okresy.csv', sep = ',')
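The town-to-district matching can also be expressed as a dictionary lookup, which returns 'NA' directly for unmatched towns. The sketch below uses a few hand-written (NAZEV, OKRES) pairs in place of TOWN_BRICK_REGION.csv. The helper names are hypothetical, and the sketch assumes that each town name maps to a single district. It compares names case-insensitively, since the title-cased scraped names (e.g. "Veselí Nad Lužnicí") may not match the capitalisation used in the lookup table.

```python
def build_lookup(rows):
    '''Map a case-folded town name to its district (hypothetical helper).'''
    return {town.strip().casefold(): district for town, district in rows}


def district_of(town, lookup, default='NA'):
    '''Return the district of a town, or default if the town is unknown.'''
    return lookup.get(town.strip().casefold(), default)


# Illustrative (NAZEV, OKRES) pairs; the real ones come from TOWN_BRICK_REGION.csv
rows = [
    ('Veselí nad Lužnicí', 'Tábor'),
    ('Polička', 'Svitavy'),
    ('Kladno', 'Kladno'),
]
lookup = build_lookup(rows)

print(district_of(' Veselí Nad Lužnicí', lookup))  # matched despite different capitalisation
print(district_of('Atlantis', lookup))             # unknown town falls back to 'NA'
```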

# Step to the next part
Now you can proceed to the next part, which continues with further data processing, aggregation and manipulation. There we also create an interactive map and build a linear regression model using the respective machine learning libraries and packages. You can also find our notes on the interpretation of these results there.