# SPI Project

## Checkpoint 1 for 09.03.2023:

Select a dataset and perform a basic data analysis in Python

Selected dataset: **Secondhand Car Market Hungary**

Link to the dataset: https://www.kaggle.com/datasets/attilakiss/secondhand-car-market-data-parsing-dataset-v1?select=advertisements_202006112147.csv

Basic data analysis:

- load the CSV into a DataFrame (pandas)
- preview the first 5 rows
- dataset size
- column names
- number of missing values per column (.isna())
- unique values (.unique())
- print the data types (.dtypes)
- value frequencies per column (loop, data[column].value_counts())
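
The steps listed above can be sketched on a tiny, made-up DataFrame before running them on the real CSV files. The column names and values below are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Tiny made-up sample standing in for the real CSV (hypothetical values).
data = pd.DataFrame({
    "brand": ["OPEL", "AUDI", "OPEL", None],
    "price": [1290000, 5250000, 599000, 2200000],
    "sold": [False, False, True, False],
})

print(data.head())             # first 5 rows (only 4 exist here)
print(data.shape)              # (rows, columns)
print(list(data.columns))      # column names
print(data.isna().sum())       # missing values per column
print(data["brand"].unique())  # distinct values of one column
print(data.dtypes)             # data type of each column

# Frequencies of each value, column by column.
for column in data.columns:
    print(data[column].value_counts())
```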

Questions:

1. Is the dataset large enough?
2. Does the dataset contain sufficiently diverse data?
3. Does the dataset have a time dimension?
4. Does the dataset contain both quantitative and qualitative data?
5. Does the dataset have many missing values?

Briefly present (5 min.) the dataset, the analysis results, and the answers to the questions at the lab session on 09.03.

### Importing packages

In [1]:
import numpy as np
import pandas as pd
#from matplotlib import pyplot as plt
#from collections import Counter

### Loading the dataset into DataFrames

In [2]:
region_df = pd.read_csv('datasets/region_202006112147.csv')
model_df = pd.read_csv('datasets/model_202006112147.csv')
environmental_df = pd.read_csv('datasets/environmental_202006112147.csv')
drive_df = pd.read_csv('datasets/drive_202006112147.csv')
clime_df = pd.read_csv('datasets/clime_202006112147.csv')
category_df = pd.read_csv('datasets/category_202006112147.csv')
catalogs_df = pd.read_csv('datasets/catalogs_202006112147.csv')
car_types_df = pd.read_csv('datasets/car_type_202006112147.csv')
brand_df = pd.read_csv('datasets/brand_202006112147.csv')
advertisements_df = pd.read_csv('datasets/advertisements_202006112147.csv', low_memory=False)

In [3]:
main_df = pd.merge(advertisements_df, brand_df, on="brand_id", how="inner")
main_df = pd.merge(main_df, region_df, on="region_id", how="inner")
main_df = pd.merge(main_df, model_df, on="model_id", how="inner")
main_df = pd.merge(main_df, clime_df, on="clime_id", how="inner")
main_df = pd.merge(main_df, catalogs_df, on="catalog_url", how="inner")
main_df = pd.merge(main_df, drive_df, on="drive_id", how="inner")
main_df = pd.merge(main_df, car_types_df, on="car_type_id", how="inner")
main_df = pd.merge(main_df, category_df, on="category_id", how="inner")
main_df = pd.merge(main_df, environmental_df, on="environmental_id", how="inner")
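
Every join above uses `how="inner"`, so advertisements whose key has no match in a lookup table disappear without notice. A minimal sketch of how the loss could be measured, using made-up frames and pandas' `indicator` option on a left join:

```python
import pandas as pd

# Made-up stand-ins for advertisements_df and brand_df.
ads = pd.DataFrame({"ad_id": [1, 2, 3], "brand_id": [8, 69, 500]})
brands = pd.DataFrame({"brand_id": [8, 69], "brand_name": ["AUDI", "OPEL"]})

# A left join with indicator=True adds a "_merge" column that tells
# whether each row found a partner in the right-hand table.
checked = ads.merge(brands, on="brand_id", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched), "advertisement(s) would be dropped by an inner join")

# The inner join keeps only the matched rows.
inner = ads.merge(brands, on="brand_id", how="inner")
```

Running the check once per lookup table shows which join is responsible for most of the dropped rows.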

In [4]:
main_df = main_df.drop([
    "description",
    "adoldness",
    "advertisement_url",
    "sales_date",
    "download_date",
    "sales_update_date",
    "gas_id",
    "documentvalid",
    "color",
    "doorsnumber_x",
    "ccm_x",
    "person_capacity_x",
    "msrp",
    "doorsnumber_y",
    "ccm_y",
    "person_capacity_y",
    "car_type_name",
    "category_name",
    "drive_name"
], axis=1)

main_df = main_df.replace({
    'no catalog': np.nan,
    'null': np.nan,
    'na': np.nan,
    'Nan': np.nan,
    'NaN': np.nan,
    'no AC': np.nan
})

# Set English names for car types, categories and drives
main_df.rename(columns = {
    'car_type_name_translated': 'car_type_name',
    'category_name_translated': 'category_name',
    'drive_name_translated': 'drive_name'
}, inplace = True)

main_df = main_df.dropna()
main_df.columns

Index(['ad_id', 'region_id', 'ad_price', 'numpictures', 'proseller',
       'postalcode', 'production', 'mileage', 'clime_id', 'shifter',
       'brand_id', 'model_id', 'highlighted', 'upload_date', 'catalog_url',
       'is_sold', 'brand_name', 'region_name', 'model_name', 'clime_name',
       'category_id', 'start_production', 'end_production', 'car_type_id',
       'weight', 'fuel_tank', 'boot_capacity', 'fuel', 'environmental_id',
       'cylinder_layout', 'cylinders', 'drive_id', 'consump_city',
       'consump_highway', 'consump_mixed', 'top_speed', 'acceleration',
       'torque', 'power', 'drive_name', 'car_type_name', 'category_name',
       'environmental_name'],
      dtype='object')
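
The placeholder strings replaced in cell 4 could also be declared as missing at load time through the `na_values` parameter of `pd.read_csv`. The sketch below is an assumption about how that might look; it reads a two-row made-up CSV from memory, not the real files.

```python
import io
import pandas as pd

# Made-up CSV text standing in for one of the dataset files.
csv_text = (
    "ad_id,catalog_url,clime_name\n"
    "1,no catalog,digital AC\n"
    "2,http://example.invalid/a,na\n"
)

# Strings that should be read as missing values.
placeholders = ["no catalog", "null", "na", "Nan", "NaN"]
df = pd.read_csv(io.StringIO(csv_text), na_values=placeholders)

print(df.isna().sum())  # missing values per column after loading
```

Declaring the placeholders once keeps the cleaning rule next to the loading code and removes the need for a separate `replace` pass.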

In [5]:
import pgeocode

nomi = pgeocode.Nominatim('hu')

postal_code_city_map = lambda x: nomi.query_postal_code(x).place_name

postal_codes_df = main_df[["region_id", "postalcode"]]
postal_codes_df = postal_codes_df.drop_duplicates(subset='postalcode')
postal_codes_df['city_name'] = postal_codes_df['postalcode'].map(postal_code_city_map)
postal_codes_df = postal_codes_df.rename(columns={"postalcode": "postal_code"})

postal_codes_df

Unnamed: 0,region_id,postal_code,city_name
12967,9,4002,Debrecen
12968,10,3000,Hatvan
12972,7,2481,Velence
12977,3,1039,Budapest
13075,3,1185,Budapest
...,...,...,...
37953,8,9345,Páli
38034,6,6921,Maroslele
38054,8,9113,Koroncó
38058,4,6446,Rém


In [6]:
city_df = pd.DataFrame(postal_codes_df[['city_name', 'region_id']])
city_df = city_df.drop_duplicates()
city_df['city_id'] = city_df.index
city_df

Unnamed: 0,city_name,region_id,city_id
12967,Debrecen,9,12967
12968,Hatvan,10,12968
12972,Velence,7,12972
12977,Budapest,3,12977
13076,Miskolc,2,13076
...,...,...,...
37953,Páli,8,37953
38034,Maroslele,6,38034
38054,Koroncó,8,38054
38058,Rém,4,38058
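
Cells 6, 13, 14 and 15 reuse the leftover row index as the surrogate key, which gives unique but arbitrary values such as 12967. A possible alternative, sketched on made-up rows, is to reset the index and number the rows from 1:

```python
import pandas as pd

# Made-up rows that still carry index labels from a larger frame.
cities = pd.DataFrame(
    {"city_name": ["Debrecen", "Hatvan", "Velence"], "region_id": [9, 10, 7]},
    index=[12967, 12968, 12972],
)

# Discard the old labels and derive compact keys 1, 2, 3, ...
cities = cities.reset_index(drop=True)
cities["city_id"] = cities.index + 1
print(cities)
```

Compact keys do not change when the row order of the source frame changes, as long as the rows are sorted the same way before numbering.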


In [7]:
# postal_codes table

postal_codes_df = pd.merge(postal_codes_df, city_df, on="city_name", how="inner")
postal_codes_df = postal_codes_df[['postal_code', 'city_id']]
postal_codes_df

Unnamed: 0,postal_code,city_id
0,4002,12967
1,4032,12967
2,4031,12967
3,4030,12967
4,4028,12967
...,...,...
1277,9345,37953
1278,6921,38034
1279,9113,38054
1280,6446,38058


In [8]:
region_df

Unnamed: 0,region_id,region_name
0,1,Baranya megye
1,2,Borsod-Abaúj-Zemplén megye
2,3,Budapest
3,4,Bács-Kiskun megye
4,5,Békés megye
5,6,Csongrád megye
6,7,Fejér megye
7,8,Győr-Moson-Sopron megye
8,9,Hajdú-Bihar megye
9,10,Heves megye


In [9]:
brands_models = main_df[["model_id", "model_name", "brand_id", "brand_name"]]
model_df = brands_models.drop_duplicates(subset="model_id")[["model_id", "model_name", "brand_id"]]
model_df.sort_values("model_id", inplace=True)
brand_df = brands_models.drop_duplicates(subset="brand_id")[["brand_id", "brand_name"]]
brand_df.sort_values("brand_id", inplace=True)

In [10]:
brand_df

Unnamed: 0,brand_id,brand_name
13678,1,ABARTH
14980,4,ALFA ROMEO
13075,8,AUDI
13166,13,BMW
19608,15,CADILLAC
13681,16,CHEVROLET
26054,17,CHRYSLER
13787,18,CITROEN
16123,19,DACIA
35000,20,DAEWOO


In [11]:
model_df

Unnamed: 0,model_id,model_name,brand_id
16571,1,MONDEO,31
33129,3,XC90,99
17174,5,V60,99
16715,6,A4,8
23560,8,C 220,59
...,...,...,...
26802,1252,CL-OSZTÁLY,59
33356,1260,GL 320,59
26793,1274,SC,48
16744,1275,9-3,80


In [12]:
advertisements_columns = set(main_df.columns).intersection(advertisements_df.columns)
advertisements_df = advertisements_df[list(advertisements_columns)]
advertisements_df = advertisements_df.drop_duplicates(subset="ad_id")

In [13]:
fuel_df = pd.DataFrame(main_df['fuel'])
fuel_df = fuel_df.drop_duplicates()
fuel_df['fuel_id'] = fuel_df.index
fuel_df

Unnamed: 0,fuel,fuel_id
12967,Dízel,12967
13076,Benzin,13076
13868,Hibrid (Benzin),13868
15396,LPG,15396
16686,Etanol,16686
20381,Hibrid (Dízel),20381
27680,Benzin/Gáz,27680


In [14]:
cylinder_df = pd.DataFrame(main_df['cylinder_layout'])
cylinder_df = cylinder_df.drop_duplicates()
cylinder_df['cylinder_id'] = cylinder_df.index
cylinder_df

Unnamed: 0,cylinder_layout,cylinder_id
12967,Soros,12967
13102,V,13102
13255,Boxer,13255
13927,W,13927


In [15]:
shifter_df = pd.DataFrame(main_df['shifter'])
shifter_df = shifter_df.drop_duplicates()
shifter_df['shifter_id'] = shifter_df.index
shifter_df

Unnamed: 0,shifter,shifter_id
12967,T7,12967
12977,M6,12977
13075,V0,13075
13110,0,13110
13118,A6,13118
13119,A7,13119
13127,A0,13127
13131,M5,13131
13166,T8,13166
13179,T0,13179


In [16]:
catalogs_df = catalogs_df.merge(advertisements_df[['catalog_url', 'shifter', 'model_id']], on='catalog_url', how='right')
catalogs_df = catalogs_df.merge(shifter_df, on='shifter', how='right')
catalogs_df = catalogs_df.merge(fuel_df, on='fuel', how='right')
catalogs_df = catalogs_df.merge(cylinder_df, on='cylinder_layout', how='right')
catalogs_df = catalogs_df.replace(['no catalog'], np.nan)
catalogs_df = catalogs_df.drop_duplicates(subset='catalog_url')
catalogs_df = catalogs_df.drop(['fuel', 'cylinder_layout', 'shifter'], axis=1)
catalogs_df = catalogs_df.dropna()
catalogs_df = catalogs_df.rename(columns = {
    "ccm": "engine_size_cm3",
    "msrp": "price_as_new",
    "doorsnumber": "doors_number",
    "torque": "torque_nm",
    "weight": "weight_kg",
    "fuel_tank": "fuel_tank_liter",
    "boot_capacity": "boot_capacity_liter",
    "cylinders": "cylinders_number",
    "power": "horse_power",
    "consump_city": "consumption_city",
    "consump_highway": "consumption_highway",
    "consump_mixed": "consumption_mixed",
    "top_speed": "top_speed_kmh",
    "acceleration": "acceleration_100khm_seconds"
})

In [17]:
catalogs_df

Unnamed: 0,catalog_url,category_id,start_production,end_production,price_as_new,car_type_id,doors_number,person_capacity,weight_kg,fuel_tank_liter,...,consumption_highway,consumption_mixed,top_speed_kmh,acceleration_100khm_seconds,torque_nm,horse_power,model_id,shifter_id,fuel_id,cylinder_id
0,http://katalogus.hasznaltauto.hu/mercedes-benz...,3,2011-01-01,2013-01-01,14670620.0,2,2.0,4.0,1795.0,66.0,...,5.0,6.3,230.0,8.9,400.0,170.0,35,12967,12967,12967
6,http://katalogus.hasznaltauto.hu/mercedes-benz...,3,2011-01-01,2013-01-01,14541280.0,12,4.0,5.0,1735.0,59.0,...,4.5,5.3,242.0,7.5,500.0,204.0,57,12967,12967,12967
9,http://katalogus.hasznaltauto.hu/mercedes-benz...,6,2015-01-01,2016-01-01,12398590.0,3,2.0,4.0,1605.0,41.0,...,3.6,4.1,234.0,7.8,400.0,170.0,8,12967,12967,12967
17,http://katalogus.hasznaltauto.hu/mercedes-benz...,5,2014-01-01,2015-01-01,13306350.0,9,4.0,6.0,2075.0,57.0,...,5.3,5.7,195.0,10.8,380.0,163.0,37,12967,12967,12967
32,http://katalogus.hasznaltauto.hu/mercedes-benz...,3,2011-01-01,2013-01-01,12904830.0,12,4.0,5.0,1735.0,59.0,...,4.7,5.4,207.0,9.5,360.0,136.0,75,12967,12967,12967
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33781,http://katalogus.hasznaltauto.hu/subaru/imprez...,1,2007-01-01,2010-01-01,6920000.0,6,5.0,5.0,1360.0,60.0,...,6.5,8.2,182.0,11.6,196.0,150.0,789,13715,13076,13255
33783,http://katalogus.hasznaltauto.hu/subaru/legacy...,3,2003-01-01,2005-01-01,10346000.0,12,4.0,5.0,1380.0,64.0,...,6.1,8.1,210.0,10.8,226.0,165.0,414,26214,13076,13255
33784,http://katalogus.hasznaltauto.hu/subaru/outbac...,3,2003-01-01,2007-01-01,10485000.0,9,5.0,5.0,1440.0,64.0,...,6.7,8.5,197.0,10.9,226.0,165.0,327,26214,13076,13255
33785,http://katalogus.hasznaltauto.hu/audi/a8_6.3_w...,8,2011-01-01,2013-01-01,38399670.0,12,4.0,5.0,2055.0,90.0,...,9.0,12.4,250.0,4.7,625.0,500.0,391,13166,13076,13927


In [18]:
catalogs_df = catalogs_df.convert_dtypes()
catalogs_df.dtypes

catalog_url                     string
category_id                      Int64
start_production                string
end_production                  string
price_as_new                     Int64
car_type_id                      Int64
doors_number                     Int64
person_capacity                  Int64
weight_kg                        Int64
fuel_tank_liter                  Int64
boot_capacity_liter              Int64
environmental_id                 Int64
cylinders_number                 Int64
drive_id                         Int64
engine_size_cm3                  Int64
consumption_city               Float64
consumption_highway            Float64
consumption_mixed              Float64
top_speed_kmh                    Int64
acceleration_100khm_seconds    Float64
torque_nm                        Int64
horse_power                      Int64
model_id                         Int64
shifter_id                       Int64
fuel_id                          Int64
cylinder_id                        ...
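
What `convert_dtypes` does in cell 18 can be seen on a small made-up frame: float columns that hold only whole numbers become nullable `Int64`, fractional columns become `Float64`, and text columns get a string dtype.

```python
import pandas as pd

raw = pd.DataFrame({
    "doorsnumber": [2.0, 4.0, 5.0],    # whole numbers stored as float
    "consump_mixed": [6.3, 5.3, 4.1],  # genuinely fractional values
    "catalog_url": ["a", "b", "c"],    # text stored as object
})

converted = raw.convert_dtypes()
print(converted.dtypes)
```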

In [19]:
advertisements_df = advertisements_df.drop(["model_id", "brand_id", "region_id", "shifter", "production"], axis=1)
advertisements_df = advertisements_df.replace(['no catalog'], np.nan)
advertisements_df = advertisements_df.rename(columns={
    "postalcode": "postal_code",
    "numpictures": "pictures_number",
    "proseller": "pro_seller"
})
advertisements_df = advertisements_df.dropna()
advertisements_df = advertisements_df.convert_dtypes()

In [20]:
advertisements_df

Unnamed: 0,upload_date,ad_price,is_sold,postal_code,highlighted,ad_id,catalog_url,pictures_number,clime_id,mileage,pro_seller
1,2010-08-11,1290000,False,2671,False,4066033,http://katalogus.hasznaltauto.hu/opel/astra_1....,6,4,148000,False
2,2010-08-30,580000,False,6000,False,4109007,http://katalogus.hasznaltauto.hu/saab/900_2.5_...,6,2,181900,False
3,2010-10-25,1450000,False,4033,False,4246385,http://katalogus.hasznaltauto.hu/seat/leon_1.4...,6,2,185000,False
4,2012-01-17,9990000,False,8600,False,5440448,http://katalogus.hasznaltauto.hu/mercedes-benz...,6,5,98500,False
5,2012-03-20,120000,True,2360,False,5624476,http://katalogus.hasznaltauto.hu/peugeot/307_b...,5,3,80000,True
...,...,...,...,...,...,...,...,...,...,...,...
38183,2020-05-06,5250000,False,6728,False,15726818,http://katalogus.hasznaltauto.hu/audi/q5_2.0_t...,12,4,201500,True
38184,2020-05-06,5300000,False,1154,False,15726825,http://katalogus.hasznaltauto.hu/volkswagen/go...,12,5,129711,True
38185,2020-05-06,4990000,False,1152,False,15726904,http://katalogus.hasznaltauto.hu/audi/a5_cabri...,12,5,140000,True
38186,2020-05-06,5150000,False,8000,False,15727058,http://katalogus.hasznaltauto.hu/jeep/wrangler...,5,4,155000,True


In [21]:
advertisements_df['upload_date'] = pd.to_datetime(advertisements_df['upload_date'])
catalogs_df['start_production'] = pd.to_datetime(catalogs_df['start_production'])
catalogs_df['end_production'] = pd.to_datetime(catalogs_df['end_production'])

In [22]:
advertisements_df = advertisements_df.convert_dtypes()
advertisements_df.dtypes

upload_date        datetime64[ns]
ad_price                    Int64
is_sold                   boolean
postal_code                 Int64
highlighted               boolean
ad_id                       Int64
catalog_url                string
pictures_number             Int64
clime_id                    Int64
mileage                     Int64
pro_seller                boolean
dtype: object

In [23]:
category_df = main_df[['category_id', 'category_name']]
category_df = category_df.drop_duplicates(subset='category_id')
category_df = category_df.sort_values(by='category_id')
category_df = category_df.convert_dtypes()
category_df

Unnamed: 0,category_id,category_name
14105,1,C-Segment
19633,2,M-Segment
13075,3,D-Segment
17841,4,B-Segment
12967,5,Utility
16524,6,E-Segment
13652,7,A-Segment
13867,8,F-Segment
19018,9,J-Segment


In [24]:
clime_df = clime_df.convert_dtypes()
clime_df

Unnamed: 0,clime_id,clime_name
0,1,no AC
1,2,manual AC
2,3,automatic AC
3,4,digital AC
4,5,digital 2zone AC
5,6,digital multizone AC


In [25]:
car_types_df = main_df[['car_type_id', 'car_type_name']]
car_types_df = car_types_df.drop_duplicates(subset='car_type_id')
car_types_df = car_types_df.sort_values(by='car_type_id')
car_types_df = car_types_df.convert_dtypes()
car_types_df

Unnamed: 0,car_type_id,car_type_name
13561,2,Cabrio
13589,3,Coupe
16400,5,Minivan
13075,6,Hatchback
25667,8,Minibus
12967,9,Estate
13177,12,Sedan
19155,13,SUV
25620,14,Closed


In [26]:
drive_df = main_df[['drive_id', 'drive_name']]
drive_df = drive_df.drop_duplicates(subset='drive_id')
drive_df = drive_df.sort_values(by='drive_id')
drive_df = drive_df.convert_dtypes()
drive_df

Unnamed: 0,drive_id,drive_name
13075,1,FWD
12967,2,RWD
13143,4,AWD


In [27]:
environmental_df = environmental_df.replace('na', np.nan)
environmental_df = environmental_df.dropna()
environmental_df = environmental_df.convert_dtypes()
environmental_df

Unnamed: 0,environmental_id,environmental_name
0,1,EURO 1
1,2,EURO 2
2,3,EURO 3
3,4,EURO 4
4,5,EURO 5
5,6,EURO 6


In [28]:
main_df = pd.merge(advertisements_df, clime_df, on="clime_id", how="inner")
main_df = pd.merge(main_df, catalogs_df, on="catalog_url", how="inner")
main_df = pd.merge(main_df, postal_codes_df, on="postal_code", how="inner")
main_df = pd.merge(main_df, city_df, on="city_id", how="inner")
main_df = pd.merge(main_df, region_df, on="region_id", how="inner")
main_df = pd.merge(main_df, model_df, on="model_id", how="inner")
main_df = pd.merge(main_df, brand_df, on="brand_id", how="inner")
main_df = pd.merge(main_df, drive_df, on="drive_id", how="inner")
main_df = pd.merge(main_df, car_types_df, on="car_type_id", how="inner")
main_df = pd.merge(main_df, category_df, on="category_id", how="inner")
main_df = pd.merge(main_df, environmental_df, on="environmental_id", how="inner")
main_df = pd.merge(main_df, fuel_df, on="fuel_id", how="inner")
main_df = pd.merge(main_df, cylinder_df, on="cylinder_id", how="inner")
main_df = pd.merge(main_df, shifter_df, on="shifter_id", how="inner")
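
The `ad_id` frequencies printed at the end of this checkpoint reach 10, although the advertisements were de-duplicated on `ad_id` in cell 12. One possible cause (an assumption, not verified against the data) is a join key that is not unique in the lookup table, for example `city_name` in cell 7. The sketch below uses made-up rows to show how the `validate` option of `merge` can catch such a fan-out.

```python
import pandas as pd

postal_codes = pd.DataFrame({
    "postal_code": [1039, 4002],
    "city_name": ["Budapest", "Debrecen"],
})
# The same city name appears twice (the second region is made up),
# so joining on the name alone is ambiguous.
cities = pd.DataFrame({
    "city_name": ["Budapest", "Budapest", "Debrecen"],
    "region_id": [3, 14, 9],
    "city_id": [1, 2, 3],
})

# Without validation the join quietly multiplies rows.
fanned_out = postal_codes.merge(cities, on="city_name", how="inner")
print(len(postal_codes), "->", len(fanned_out))

# validate="many_to_one" requires the key to be unique on the right side.
try:
    postal_codes.merge(cities, on="city_name", how="inner", validate="many_to_one")
    fan_out_detected = False
except pd.errors.MergeError as err:
    fan_out_detected = True
    print("merge rejected:", err)
```

If the check fails on the real data, joining on both `city_name` and `region_id` would be one way to make the key unique.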

### Preview of the first 5 rows

In [29]:
main_df

Unnamed: 0,upload_date,ad_price,is_sold,postal_code,highlighted,ad_id,catalog_url,pictures_number,clime_id,mileage,...,model_name,brand_id,brand_name,drive_name,car_type_name,category_name,environmental_name,fuel,cylinder_layout,shifter
0,2010-08-11,1290000,False,2671,False,4066033,http://katalogus.hasznaltauto.hu/opel/astra_1....,6,4,148000,...,ASTRA H,69,OPEL,FWD,Hatchback,C-Segment,EURO 4,Benzin,Soros,0
1,2020-01-02,599000,False,6000,False,15262033,http://katalogus.hasznaltauto.hu/opel/astra_1....,3,4,369000,...,ASTRA H,69,OPEL,FWD,Hatchback,C-Segment,EURO 4,Benzin,Soros,0
2,2020-02-26,1449000,False,9028,False,15483792,http://katalogus.hasznaltauto.hu/opel/astra_1....,9,4,158453,...,ASTRA H,69,OPEL,FWD,Hatchback,C-Segment,EURO 4,Benzin,Soros,0
3,2020-02-27,1449000,False,9028,False,15486577,http://katalogus.hasznaltauto.hu/opel/astra_1....,11,4,52626,...,ASTRA H,69,OPEL,FWD,Hatchback,C-Segment,EURO 4,Benzin,Soros,0
4,2019-10-17,1470000,False,1148,False,14987234,http://katalogus.hasznaltauto.hu/opel/astra_1....,8,2,147000,...,ASTRA H,69,OPEL,FWD,Hatchback,C-Segment,EURO 4,Benzin,Soros,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15090,2020-03-02,2200000,False,3903,False,15501949,http://katalogus.hasznaltauto.hu/audi/allroad_...,6,1,412000,...,A6 ALLROAD,8,AUDI,AWD,Estate,D-Segment,EURO 4,Dízel,V,T0
15091,2019-12-27,2899999,False,9600,False,15244550,http://katalogus.hasznaltauto.hu/audi/allroad_...,6,5,267000,...,A6 ALLROAD,8,AUDI,AWD,Estate,D-Segment,EURO 4,Dízel,V,T0
15092,2020-02-24,2590000,False,4032,False,15474497,http://katalogus.hasznaltauto.hu/audi/allroad_...,12,5,202504,...,A6 ALLROAD,8,AUDI,AWD,Estate,D-Segment,EURO 4,Dízel,V,T0
15093,2020-01-23,37990000,False,1112,True,15343216,http://katalogus.hasznaltauto.hu/mercedes-benz...,12,5,22,...,S 450,59,MERCEDES-BENZ,AWD,Coupe,F-Segment,EURO 6,Benzin,V,T9


### Dataset size


In [30]:
main_df.shape

(15095, 51)

### Column names

In [31]:
list(main_df.columns)

['upload_date',
 'ad_price',
 'is_sold',
 'postal_code',
 'highlighted',
 'ad_id',
 'catalog_url',
 'pictures_number',
 'clime_id',
 'mileage',
 'pro_seller',
 'clime_name',
 'category_id',
 'start_production',
 'end_production',
 'price_as_new',
 'car_type_id',
 'doors_number',
 'person_capacity',
 'weight_kg',
 'fuel_tank_liter',
 'boot_capacity_liter',
 'environmental_id',
 'cylinders_number',
 'drive_id',
 'engine_size_cm3',
 'consumption_city',
 'consumption_highway',
 'consumption_mixed',
 'top_speed_kmh',
 'acceleration_100khm_seconds',
 'torque_nm',
 'horse_power',
 'model_id',
 'shifter_id',
 'fuel_id',
 'cylinder_id',
 'city_id',
 'city_name',
 'region_id',
 'region_name',
 'model_name',
 'brand_id',
 'brand_name',
 'drive_name',
 'car_type_name',
 'category_name',
 'environmental_name',
 'fuel',
 'cylinder_layout',
 'shifter']

### Number of missing values per column

In [32]:
col_null_values = {col: main_df[col].isna().sum() for col in main_df.columns}
col_null_values

{'upload_date': 0,
 'ad_price': 0,
 'is_sold': 0,
 'postal_code': 0,
 'highlighted': 0,
 'ad_id': 0,
 'catalog_url': 0,
 'pictures_number': 0,
 'clime_id': 0,
 'mileage': 0,
 'pro_seller': 0,
 'clime_name': 0,
 'category_id': 0,
 'start_production': 0,
 'end_production': 0,
 'price_as_new': 0,
 'car_type_id': 0,
 'doors_number': 0,
 'person_capacity': 0,
 'weight_kg': 0,
 'fuel_tank_liter': 0,
 'boot_capacity_liter': 0,
 'environmental_id': 0,
 'cylinders_number': 0,
 'drive_id': 0,
 'engine_size_cm3': 0,
 'consumption_city': 0,
 'consumption_highway': 0,
 'consumption_mixed': 0,
 'top_speed_kmh': 0,
 'acceleration_100khm_seconds': 0,
 'torque_nm': 0,
 'horse_power': 0,
 'model_id': 0,
 'shifter_id': 0,
 'fuel_id': 0,
 'cylinder_id': 0,
 'city_id': 0,
 'city_name': 680,
 'region_id': 0,
 'region_name': 0,
 'model_name': 0,
 'brand_id': 0,
 'brand_name': 0,
 'drive_name': 0,
 'car_type_name': 0,
 'category_name': 0,
 'environmental_name': 0,
 'fuel': 0,
 'cylinder_layout': 0,
 'shifter': ...
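
The same counts can be obtained without a dictionary comprehension, and a percentage column makes it easier to judge whether a gap matters (in the output above only `city_name` has missing values). A minimal sketch on made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "city_name": ["Debrecen", None, "Hatvan", None],
    "ad_price": [1290000, 580000, 1450000, 9990000],
})

missing = df.isna().sum()               # absolute count per column
missing_share = df.isna().mean() * 100  # share of rows, in percent

# Keep only the columns that actually have gaps.
report = pd.DataFrame({"missing": missing, "percent": missing_share})
report = report[report["missing"] > 0]
print(report)
```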

### Unique values

In [33]:
cols_unique_num = {col: len(main_df[col].unique()) for col in main_df.columns}
cols_unique_num

{'upload_date': 756,
 'ad_price': 1243,
 'is_sold': 2,
 'postal_code': 1151,
 'highlighted': 2,
 'ad_id': 14483,
 'catalog_url': 6522,
 'pictures_number': 13,
 'clime_id': 6,
 'mileage': 6192,
 'pro_seller': 2,
 'clime_name': 6,
 'category_id': 9,
 'start_production': 25,
 'end_production': 22,
 'price_as_new': 4545,
 'car_type_id': 9,
 'doors_number': 4,
 'person_capacity': 7,
 'weight_kg': 831,
 'fuel_tank_liter': 59,
 'boot_capacity_liter': 321,
 'environmental_id': 5,
 'cylinders_number': 8,
 'drive_id': 3,
 'engine_size_cm3': 277,
 'consumption_city': 175,
 'consumption_highway': 87,
 'consumption_mixed': 114,
 'top_speed_kmh': 125,
 'acceleration_100khm_seconds': 153,
 'torque_nm': 247,
 'horse_power': 191,
 'model_id': 514,
 'shifter_id': 23,
 'fuel_id': 7,
 'cylinder_id': 4,
 'city_id': 889,
 'city_name': 880,
 'region_id': 20,
 'region_name': 20,
 'model_name': 514,
 'brand_id': 41,
 'brand_name': 41,
 'drive_name': 3,
 'car_type_name': 9,
 'category_name': 9,
 'environmental_name': ...,
 ...
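
The counts above come from `len(...unique())`, which treats a missing value as one more distinct entry. For a column with gaps, such as `city_name`, `nunique()` gives the number of real values instead. A small made-up example of the difference:

```python
import pandas as pd

cities = pd.Series(["Budapest", "Debrecen", None, "Budapest"])

with_missing = len(cities.unique())  # the missing value counts as an entry
without_missing = cities.nunique()   # missing values are ignored by default

print(with_missing, without_missing)
```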

### Data types

In [34]:
main_df.dtypes

upload_date                    datetime64[ns]
ad_price                                Int64
is_sold                               boolean
postal_code                             Int64
highlighted                           boolean
ad_id                                   Int64
catalog_url                            string
pictures_number                         Int64
clime_id                                Int64
mileage                                 Int64
pro_seller                            boolean
clime_name                             string
category_id                             Int64
start_production               datetime64[ns]
end_production                 datetime64[ns]
price_as_new                            Int64
car_type_id                             Int64
doors_number                            Int64
person_capacity                         Int64
weight_kg                               Int64
fuel_tank_liter                         Int64
boot_capacity_liter                       ...

### Value frequencies per column

In [35]:
value_counts = [main_df[col].value_counts() for col in main_df.columns]
value_counts

[2020-03-10    273
 2020-03-13    268
 2020-03-12    242
 2020-03-11    229
 2020-03-17    224
              ... 
 2015-11-21      1
 2018-02-05      1
 2019-05-09      1
 2018-11-13      1
 2018-09-26      1
 Name: upload_date, Length: 756, dtype: int64,
 1490000     354
 1990000     345
 1599000     329
 1999000     309
 1590000     290
            ... 
 2199700       1
 1918000       1
 1719000       1
 2889000       1
 33350000      1
 Name: ad_price, Length: 1243, dtype: Int64,
 False    12274
 True      2821
 Name: is_sold, dtype: Int64,
 6000    553
 8000    354
 4400    348
 9700    311
 6100    231
        ... 
 7988      1
 7228      1
 9324      1
 2921      1
 3713      1
 Name: postal_code, Length: 1151, dtype: Int64,
 False    13397
 True      1698
 Name: highlighted, dtype: Int64,
 15528526    10
 11842504    10
 15405104    10
 15537922    10
 15424856    10
             ..
 15292732     1
 15477948     1
 15546910     1
 15596758     1
 15032400     1
 Name: ad_id, ...
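
A list of 51 Series is hard to read in one piece. One possible variation, sketched on made-up rows, keys the results by column name, keeps only the most frequent values, and adds relative shares for a chosen column:

```python
import pandas as pd

df = pd.DataFrame({
    "fuel": ["Benzin", "Dízel", "Benzin", "Benzin"],
    "is_sold": [False, False, True, False],
})

# Top 3 values per column, addressable by column name.
top_values = {col: df[col].value_counts().head(3) for col in df.columns}

# Relative frequencies (fractions that sum to 1) for a single column.
shares = df["fuel"].value_counts(normalize=True)

print(top_values["fuel"])
print(shares)
```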

### Questions

1. Is the dataset large enough?

2. Does the dataset contain sufficiently diverse data?

3. Does the dataset have a time dimension?

4. Does the dataset contain both quantitative and qualitative data?

5. Does the dataset have many missing values?
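
The figures needed to answer these questions can be collected in one place. The helper below is a hypothetical sketch (the name `dataset_summary` and the made-up sample are assumptions); it only gathers numbers and leaves the judgement to the reader.

```python
import pandas as pd

def dataset_summary(df: pd.DataFrame) -> dict:
    """Collect size, column kinds and missing-value share of a DataFrame."""
    numeric = df.select_dtypes(include="number").columns
    temporal = df.select_dtypes(include="datetime").columns
    other = df.columns.difference(numeric.union(temporal))
    return {
        "rows": len(df),                        # question 1
        "columns": df.shape[1],                 # question 2
        "time_columns": list(temporal),         # question 3
        "quantitative_columns": list(numeric),  # question 4
        "qualitative_columns": list(other),     # question 4
        "missing_share_percent": float(df.isna().mean().mean() * 100),  # question 5
    }

# Made-up sample with one column of each kind.
sample = pd.DataFrame({
    "upload_date": pd.to_datetime(["2010-08-11", "2020-01-02", "2020-02-26"]),
    "ad_price": [1290000, 599000, 1449000],
    "brand_name": ["OPEL", None, "AUDI"],
})
summary = dataset_summary(sample)
print(summary)
```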


## Checkpoint 2 for 23.03. and 30.03.2023

Building the relational model and the database

1. Create an ER diagram (conceptual database design):
- identify the entities
- identify the attributes of the entities
- define the relationships between the entities
- define the cardinality

2. Create the database in a DBMS (logical database design, e.g. in MySQL)

3. Populate the database with data from the CSV files
- write a Python script that populates the database
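
Tables created only through pandas' `to_sql` carry no primary or foreign keys, so the relationships from the ER diagram are not enforced by the database. A minimal sketch of declaring them explicitly follows; it uses SQLite from the standard library as a stand-in for the chosen DBMS, and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Three tables from the relational model with their key constraints.
conn.executescript("""
CREATE TABLE regions (
    region_id   INTEGER PRIMARY KEY,
    region_name TEXT NOT NULL
);
CREATE TABLE cities (
    city_id   INTEGER PRIMARY KEY,
    city_name TEXT,
    region_id INTEGER NOT NULL REFERENCES regions(region_id)
);
CREATE TABLE postal_codes (
    postal_code INTEGER PRIMARY KEY,
    city_id     INTEGER NOT NULL REFERENCES cities(city_id)
);
""")

conn.execute("INSERT INTO regions VALUES (3, 'Budapest')")
conn.execute("INSERT INTO cities VALUES (1, 'Budapest', 3)")
conn.execute("INSERT INTO postal_codes VALUES (1039, 1)")

# A postal code pointing at a non-existent city is rejected.
try:
    conn.execute("INSERT INTO postal_codes VALUES (9999, 42)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("orphan row rejected:", rejected)
```

With the tables declared first, the loading script can append rows with `if_exists='append'` instead of replacing the tables.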

### ER diagram of the relational model

![ER diagram of the relational model](image/er-diagram.png "ER diagram of the relational model")

### Importing packages for connecting to the PostgreSQL database

In [36]:
import psycopg2
from sqlalchemy import create_engine

### Connecting to the database

In [37]:
conn_string = 'postgresql://postgres:password123@127.0.0.1/spi-projekt'
db = create_engine(conn_string)
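
The connection string above contains the password in plain text. One possible alternative is to read the credentials from environment variables; the variable names (`SPI_DB_USER` and so on) and the helper `build_conn_string` are hypothetical.

```python
import os
from urllib.parse import quote_plus

def build_conn_string(env=os.environ) -> str:
    """Assemble a PostgreSQL URL from (hypothetical) environment variables."""
    user = env.get("SPI_DB_USER", "postgres")
    # Escape characters such as "@" or spaces that would break the URL.
    password = quote_plus(env.get("SPI_DB_PASSWORD", ""))
    host = env.get("SPI_DB_HOST", "127.0.0.1")
    name = env.get("SPI_DB_NAME", "spi-projekt")
    return f"postgresql://{user}:{password}@{host}/{name}"

# A plain dict stands in for os.environ in this demonstration.
conn_string = build_conn_string({"SPI_DB_PASSWORD": "p@ss word"})
print(conn_string)
```

The resulting string can be passed to `create_engine` in the same way as the literal above.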

### Creating and populating the tables

In [38]:
try:
    fuel_table = fuel_df.to_sql('fuels', db, if_exists='replace', index=False)
    cylinder_table = cylinder_df.to_sql('cylinders', db, if_exists='replace', index=False)
    shifter_table = shifter_df.to_sql('shifters', db, if_exists='replace', index=False)
    region_table = region_df.to_sql('regions', db, if_exists='replace', index=False)
    model_table = model_df.to_sql('models', db, if_exists='replace', index=False)
    city_table = city_df.to_sql('cities', db, if_exists='replace', index=False)
    postal_codes_table = postal_codes_df.to_sql('postal_codes', db, if_exists='replace', index=False)
    environmental_table = environmental_df.to_sql('environmentals', db, if_exists='replace', index=False)
    drive_table = drive_df.to_sql('drives', db, if_exists='replace', index=False)
    clime_table = clime_df.to_sql('climes', db, if_exists='replace', index=False)
    category_table = category_df.to_sql('categories', db, if_exists='replace', index=False)
    catalogs_table = catalogs_df.to_sql('catalogs', db, if_exists='replace', index=False)
    car_types_table = car_types_df.to_sql('car_types', db, if_exists='replace', index=False)
    brand_table = brand_df.to_sql('brands', db, if_exists='replace', index=False)
    advertisements_table = advertisements_df.to_sql('advertisements', db, if_exists='replace', index=False)
except Exception as ex:
    print(ex)
else:
    print("PostgreSQL tables have been created and populated.")

PostgreSQL tables have been created and populated.
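
The fifteen nearly identical `to_sql` calls could be driven by a dictionary that maps table names to DataFrames. The sketch below uses an in-memory SQLite connection as a stand-in for the PostgreSQL engine and two made-up frames, and it reads the row counts back to confirm the load.

```python
import sqlite3
import pandas as pd

# Table name -> DataFrame; made-up contents for two of the tables.
tables = {
    "fuels": pd.DataFrame({"fuel_id": [1, 2], "fuel": ["Dízel", "Benzin"]}),
    "drives": pd.DataFrame({"drive_id": [1, 2, 4], "drive_name": ["FWD", "RWD", "AWD"]}),
}

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL engine

row_counts = {}
for name, frame in tables.items():
    frame.to_sql(name, conn, if_exists="replace", index=False)
    # Read the count back to verify that the table was populated.
    count = pd.read_sql_query(f"SELECT COUNT(*) AS n FROM {name}", conn)["n"].iloc[0]
    row_counts[name] = int(count)

print(row_counts)
```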


## Checkpoint 3 for 04.05. and 11.05.2023:

1. Create a dimensional model of the data warehouse

- identify the fact table
- identify the dimension tables
- merge similar dimensions into one
- extract the time dimension into a separate dimension table
- use the Type 2 slowly changing dimension strategy
- present the resulting data warehouse graphically (ER diagram)

2. ETL - populate the data warehouse with data

- use the Pentaho Data Integration tool
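
Extracting the time dimension into its own table can be sketched in pandas before building the transformation in Pentaho. The column names (`date_key`, `quarter`, and so on) are assumptions and need not match the model in the diagram; the dates are a made-up sample in the format of `upload_date`.

```python
import pandas as pd

upload_dates = pd.Series(pd.to_datetime(["2010-08-11", "2020-05-06", "2020-05-06"]))

# One row per distinct calendar date.
dim_date = pd.DataFrame({"date": upload_dates.drop_duplicates().sort_values()})
dim_date = dim_date.reset_index(drop=True)

# Integer key of the form YYYYMMDD plus the usual calendar attributes.
dim_date["date_key"] = dim_date["date"].dt.strftime("%Y%m%d").astype(int)
dim_date["year"] = dim_date["date"].dt.year
dim_date["quarter"] = dim_date["date"].dt.quarter
dim_date["month"] = dim_date["date"].dt.month
dim_date["day"] = dim_date["date"].dt.day
dim_date["weekday"] = dim_date["date"].dt.day_name()

print(dim_date)
```

The fact table would then store `date_key` instead of the full date.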

### ER diagram of the dimensional model

![ER diagram of the dimensional model](image/er-diagram-dim-model.png "ER diagram of the dimensional model")
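
The Type 2 slowly changing dimension strategy required by this checkpoint keeps history by closing the current row and appending a new version whenever an attribute changes. The sketch below illustrates the idea in plain Python; the function `apply_scd2`, the row layout, and the sample values are hypothetical and are not taken from the project's Pentaho transformation.

```python
from datetime import date

def apply_scd2(dimension, business_key, new_attrs, change_date):
    """Close the current version for business_key and append a new one if attributes changed."""
    current = next(
        (row for row in dimension
         if row["business_key"] == business_key and row["is_current"]),
        None,
    )
    if current is not None:
        if all(current[k] == v for k, v in new_attrs.items()):
            return dimension  # nothing changed, history stays as it is
        current["valid_to"] = change_date
        current["is_current"] = False
    dimension.append({
        "surrogate_key": len(dimension) + 1,  # new key for every version
        "business_key": business_key,
        **new_attrs,
        "valid_from": change_date,
        "valid_to": None,                     # open-ended while current
        "is_current": True,
    })
    return dimension

dim_model = []
apply_scd2(dim_model, 35, {"category_name": "D-Segment"}, date(2020, 1, 1))
apply_scd2(dim_model, 35, {"category_name": "E-Segment"}, date(2020, 3, 10))
apply_scd2(dim_model, 35, {"category_name": "E-Segment"}, date(2020, 4, 1))  # no change

for row in dim_model:
    print(row)
```

The fact table refers to `surrogate_key`, so each fact stays linked to the version of the dimension row that was valid when it was recorded.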