# SPI Project

## Checkpoint 1 for 9 March 2023

Select a dataset and perform a basic data analysis in Python.

Selected dataset: **Secondhand Car Market Hungary**

Dataset link: https://www.kaggle.com/datasets/attilakiss/secondhand-car-market-data-parsing-dataset-v1?select=advertisements_202006112147.csv

Basic data analysis:

- load the CSV files into dataframes (pandas)
- preview the first 5 rows
- dataset size
- column names
- number of missing values per column (.isna())
- unique values (.unique())
- print the data types (.dtypes)
- value frequencies per column (loop, data[column].value_counts())

Questions:

1. Is the dataset large enough?
2. Is the data in the dataset sufficiently varied?
3. Does the dataset have a time dimension?
4. Does the dataset contain both quantitative and qualitative data?
5. Does the dataset have many missing values?

Briefly present (5 min) the dataset, the analysis results, and the answers to the questions at the lab session on 9 March.

### Importing packages

In [2]:
#import numpy as np
import pandas as pd
#from matplotlib import pyplot as plt
#from collections import Counter

### Loading the dataset into dataframes

In [3]:
region_df = pd.read_csv('datasets/region_202006112147.csv')
model_df = pd.read_csv('datasets/model_202006112147.csv')
environmental_df = pd.read_csv('datasets/environmental_202006112147.csv')
drive_df = pd.read_csv('datasets/drive_202006112147.csv')
clime_df = pd.read_csv('datasets/clime_202006112147.csv')
category_df = pd.read_csv('datasets/category_202006112147.csv')
catalogs_df = pd.read_csv('datasets/catalogs_202006112147.csv')
car_types_df = pd.read_csv('datasets/car_type_202006112147.csv')
brand_df = pd.read_csv('datasets/brand_202006112147.csv')
advertisements_df = pd.read_csv('datasets/advertisements_202006112147.csv', low_memory=False)
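The ten `read_csv` calls above share one file-name pattern, so they could also be expressed as a loop. The following is a minimal sketch, not the notebook's own code: `load_tables` is a hypothetical helper, and the two tiny CSV files are made up so the example runs without the Kaggle data.

```python
import tempfile
from pathlib import Path

import pandas as pd

SNAPSHOT = "202006112147"  # snapshot suffix taken from the file names above


def load_tables(tables, data_dir="datasets", snapshot=SNAPSHOT):
    """Read <data_dir>/<table>_<snapshot>.csv for every table name (hypothetical helper)."""
    frames = {}
    for table in tables:
        path = Path(data_dir) / f"{table}_{snapshot}.csv"
        # low_memory=False parses each file in one pass, so dtype inference
        # is consistent for columns with mixed values.
        frames[table] = pd.read_csv(path, low_memory=False)
    return frames


# Demo on two made-up files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, f"region_{SNAPSHOT}.csv").write_text("region_id,region_name\n1,A\n2,B\n")
    Path(tmp, f"brand_{SNAPSHOT}.csv").write_text("brand_id,brand_name\n10,X\n")
    demo_frames = load_tables(["region", "brand"], data_dir=tmp)

print({name: df.shape for name, df in demo_frames.items()})
```

Keeping the frames in a dictionary keyed by table name also makes later per-table steps (type conversion, loading into the database) easy to write as loops.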

In [4]:
# Removing useless columns
advertisements_df = advertisements_df.drop([
    "description",
    "advertisement_url",
    "sales_date",
    "download_date",
    "sales_update_date",
    "gas_id",
    "documentvalid",
    "color",
    "doorsnumber",
    "ccm",
    "person_capacity"
], axis=1)

catalogs_df = catalogs_df.drop([
    "msrp",
], axis=1)

In [5]:
# Removing rows with empty cells
catalogs_df = catalogs_df.dropna()
region_df = region_df.dropna()
car_types_df = car_types_df.dropna()
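Calling `dropna()` without arguments removes every row that has a missing value in any column, which can discard more rows than intended. A small sketch on made-up data (the `note` column is hypothetical) shows the difference from restricting the check with `subset`:

```python
import pandas as pd

# Made-up stand-in for a lookup table with some missing cells.
demo = pd.DataFrame({
    "region_id": [1, 2, 3, 4],
    "region_name": ["A", None, "C", "D"],
    "note": [None, None, "x", "y"],
})

rows_before = len(demo)
dropped_any = demo.dropna()                            # a missing value in ANY column drops the row
dropped_subset = demo.dropna(subset=["region_name"])   # only rows missing region_name are dropped

print(rows_before, len(dropped_any), len(dropped_subset))
```

Comparing the row counts before and after is a cheap way to confirm how much data a cleaning step removed.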

In [6]:
# Convert column data types
region_df = region_df.convert_dtypes()
model_df = model_df.convert_dtypes()
environmental_df = environmental_df.convert_dtypes()
drive_df = drive_df.convert_dtypes()
clime_df = clime_df.convert_dtypes()
category_df = category_df.convert_dtypes()
catalogs_df = catalogs_df.convert_dtypes()
car_types_df = car_types_df.convert_dtypes()
brand_df = brand_df.convert_dtypes()
advertisements_df = advertisements_df.convert_dtypes()

# Names of columns that hold dates
dates = [
    'upload_date',
    'start_production',
    'end_production',
    'documentvalid',
    'production'
]

for _date in dates:
    # Parse a date column only if it exists in the given dataframe
    if _date in advertisements_df.columns:
        advertisements_df[_date] = pd.to_datetime(advertisements_df[_date])
    if _date in catalogs_df.columns:
        catalogs_df[_date] = pd.to_datetime(catalogs_df[_date])
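To illustrate what the conversion cell does, here is a minimal sketch on made-up values. The column names come from the dataset, but the rows are invented, and `errors="coerce"` is an optional choice (not used in the notebook) that turns unparseable dates into `NaT` instead of raising.

```python
import pandas as pd

# Invented rows; None marks a missing value.
demo = pd.DataFrame({
    "mileage": [177666, 312000, None],
    "proseller": [True, False, None],
    "production": ["2005-06-01", "2006-11-01", "not a date"],
})

# convert_dtypes switches to nullable dtypes (Int64, boolean, string),
# so integer columns with missing values stay integers.
converted = demo.convert_dtypes()

# Unparseable entries become NaT rather than raising an exception.
converted["production"] = pd.to_datetime(converted["production"], errors="coerce")

print(converted.dtypes)
```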

In [7]:
# Merge dataframes into one main dataframe
main_df = pd.merge(advertisements_df, brand_df, on="brand_id", how="inner")
main_df = pd.merge(main_df, region_df, on="region_id", how="inner")
main_df = pd.merge(main_df, model_df, on="model_id", how="inner")
main_df = pd.merge(main_df, clime_df, on="clime_id", how="inner")
main_df = pd.merge(main_df, catalogs_df, on="catalog_url", how="inner")
main_df = pd.merge(main_df, drive_df, on="drive_id", how="inner")
main_df = pd.merge(main_df, car_types_df, on="car_type_id", how="inner")
main_df = pd.merge(main_df, category_df, on="category_id", how="inner")
main_df = pd.merge(main_df, environmental_df, on="environmental_id", how="inner")
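Two side effects of chained inner merges are worth checking: rows without a match are dropped silently, and non-key columns that exist in both frames get `_x`/`_y` suffixes (as with `ccm_x` and `ccm_y` in the column list further down). The sketch below uses made-up rows; the `suffixes`, `indicator`, and `validate` arguments are suggestions, not part of the notebook.

```python
import pandas as pd

# Made-up advertisement and catalog rows sharing the 'ccm' column name.
ads = pd.DataFrame({
    "ad_id": [1, 2, 3],
    "catalog_url": ["u1", "u2", "u9"],  # "u9" has no catalog entry
    "ccm": [1600, 2000, 1400],
})
catalogs = pd.DataFrame({
    "catalog_url": ["u1", "u2"],
    "ccm": [1598, 1995],
})

# Inner merge: ad 3 disappears, and 'ccm' is split into ccm_x / ccm_y.
inner = pd.merge(ads, catalogs, on="catalog_url", how="inner")

# Audit merge: keep all ads, name the suffixes, flag unmatched rows,
# and fail loudly if catalog_url is not unique in the catalog table.
audit = pd.merge(
    ads, catalogs, on="catalog_url", how="left",
    suffixes=("_ad", "_catalog"), indicator=True, validate="many_to_one",
)
unmatched = audit[audit["_merge"] == "left_only"]

print(list(inner.columns))
print(unmatched[["ad_id", "catalog_url"]])
```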

### Preview of the first 5 rows

In [100]:
main_df.head()

| | ad_id | region_id | ad_price | numpictures | proseller | adoldness | postalcode | production | mileage | clime_id | ... | acceleration | torque | power | drive_name | drive_name_translated | car_type_name | car_type_name_translated | category_name | category_name_translated | environmental_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15564920 | 5 | 1945000 | 9 | True | 39 | 5600 | 2005-06-01 | 177666 | 3 | ... | 9.8 | 320 | 140 | Első kerék | FWD | Sedan | Sedan | Középkategória | E-Segment | EURO 4 |
| 1 | 15404052 | 19 | 1299000 | 4 | False | 2 | 8500 | 2006-11-01 | 312000 | 1 | ... | 9.8 | 320 | 140 | Első kerék | FWD | Sedan | Sedan | Középkategória | E-Segment | EURO 4 |
| 2 | 15550647 | 14 | 1250000 | 6 | False | 7 | 2344 | 2006-06-01 | 290000 | 5 | ... | 9.8 | 320 | 140 | Első kerék | FWD | Sedan | Sedan | Középkategória | E-Segment | EURO 4 |
| 3 | 15555420 | 9 | 1280000 | 10 | True | 6 | 4032 | 2008-01-01 | 354799 | 5 | ... | 9.8 | 320 | 140 | Első kerék | FWD | Sedan | Sedan | Középkategória | E-Segment | EURO 4 |
| 4 | 15406966 | 17 | 350000 | 9 | True | 40 | 7030 | 2006-02-01 | 225000 | 4 | ... | 9.8 | 320 | 140 | Első kerék | FWD | Sedan | Sedan | Középkategória | E-Segment | EURO 4 |


### Dataset size


In [101]:
main_df.shape

(17942, 52)

### Column names

In [102]:
list(main_df.columns)

['ad_id',
 'region_id',
 'ad_price',
 'numpictures',
 'proseller',
 'adoldness',
 'postalcode',
 'production',
 'mileage',
 'clime_id',
 'shifter',
 'person_capacity_x',
 'brand_id',
 'model_id',
 'ccm_x',
 'highlighted',
 'upload_date',
 'catalog_url',
 'is_sold',
 'brand_name',
 'region_name',
 'model_name',
 'clime_name',
 'category_id',
 'start_production',
 'end_production',
 'car_type_id',
 'doorsnumber',
 'person_capacity_y',
 'weight',
 'fuel_tank',
 'boot_capacity',
 'fuel',
 'environmental_id',
 'cylinder_layout',
 'cylinders',
 'drive_id',
 'ccm_y',
 'consump_city',
 'consump_highway',
 'consump_mixed',
 'top_speed',
 'acceleration',
 'torque',
 'power',
 'drive_name',
 'drive_name_translated',
 'car_type_name',
 'car_type_name_translated',
 'category_name',
 'category_name_translated',
 'environmental_name']

### Number of missing values per column

In [103]:
col_null_values = {col: main_df[col].isna().sum() for col in main_df.columns}
col_null_values

{'ad_id': 0,
 'region_id': 0,
 'ad_price': 0,
 'numpictures': 0,
 'proseller': 0,
 'adoldness': 0,
 'postalcode': 0,
 'production': 0,
 'mileage': 0,
 'clime_id': 0,
 'shifter': 0,
 'person_capacity_x': 0,
 'brand_id': 0,
 'model_id': 0,
 'ccm_x': 0,
 'highlighted': 0,
 'upload_date': 0,
 'catalog_url': 0,
 'is_sold': 0,
 'brand_name': 0,
 'region_name': 0,
 'model_name': 0,
 'clime_name': 0,
 'category_id': 0,
 'start_production': 0,
 'end_production': 0,
 'car_type_id': 0,
 'doorsnumber': 0,
 'person_capacity_y': 0,
 'weight': 0,
 'fuel_tank': 0,
 'boot_capacity': 0,
 'fuel': 0,
 'environmental_id': 0,
 'cylinder_layout': 0,
 'cylinders': 0,
 'drive_id': 0,
 'ccm_y': 0,
 'consump_city': 0,
 'consump_highway': 0,
 'consump_mixed': 0,
 'top_speed': 0,
 'acceleration': 0,
 'torque': 0,
 'power': 0,
 'drive_name': 0,
 'drive_name_translated': 0,
 'car_type_name': 0,
 'car_type_name_translated': 0,
 'category_name': 0,
 'category_name_translated': 0,
 'environmental_name': 0}
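The same counts can be obtained without a dictionary comprehension, because `DataFrame.isna().sum()` works column-wise. A short sketch on made-up values, which also reports the share of missing cells and lists only the columns that have gaps:

```python
import pandas as pd

# Invented rows with a few missing cells.
demo = pd.DataFrame({
    "ad_price": [1945000, None, 1250000, None],
    "fuel": ["diesel", "petrol", None, "petrol"],
})

null_counts = demo.isna().sum()   # missing values per column
null_share = demo.isna().mean()   # fraction of missing values per column

# Only columns that actually have gaps, largest first.
with_missing = null_counts[null_counts > 0].sort_values(ascending=False)

print(with_missing)
print(null_share)
```

Note that the zeros in the output above are expected for `main_df`: rows with missing values were dropped earlier, and the inner merges removed rows without a match.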

### Number of unique values per column

In [104]:
cols_unique_num = {col: len(main_df[col].unique()) for col in main_df.columns}
cols_unique_num

{'ad_id': 17942,
 'region_id': 20,
 'ad_price': 1343,
 'numpictures': 13,
 'proseller': 2,
 'adoldness': 785,
 'postalcode': 1475,
 'production': 289,
 'mileage': 7282,
 'clime_id': 6,
 'shifter': 26,
 'person_capacity_x': 10,
 'brand_id': 47,
 'model_id': 633,
 'ccm_x': 383,
 'highlighted': 2,
 'upload_date': 810,
 'catalog_url': 8642,
 'is_sold': 2,
 'brand_name': 47,
 'region_name': 20,
 'model_name': 633,
 'clime_name': 6,
 'category_id': 9,
 'start_production': 27,
 'end_production': 24,
 'car_type_id': 9,
 'doorsnumber': 4,
 'person_capacity_y': 8,
 'weight': 905,
 'fuel_tank': 66,
 'boot_capacity': 388,
 'fuel': 7,
 'environmental_id': 6,
 'cylinder_layout': 4,
 'cylinders': 8,
 'drive_id': 3,
 'ccm_y': 353,
 'consump_city': 188,
 'consump_highway': 98,
 'consump_mixed': 124,
 'top_speed': 132,
 'acceleration': 165,
 'torque': 282,
 'power': 209,
 'drive_name': 3,
 'drive_name_translated': 3,
 'car_type_name': 9,
 'car_type_name_translated': 9,
 'category_name': 9,
 'category_na
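One detail to keep in mind with `len(col.unique())`: a missing value counts as its own distinct value, while `DataFrame.nunique()` ignores missing values by default. A minimal sketch on invented rows:

```python
import pandas as pd

# Invented rows; one shifter value is missing.
demo = pd.DataFrame({
    "shifter": ["manual", "automatic", None, "manual"],
    "doorsnumber": [5, 3, 5, 5],
})

# unique() keeps the missing value as a separate entry.
with_nan = {col: len(demo[col].unique()) for col in demo.columns}

# nunique() drops missing values unless dropna=False is passed.
without_nan = demo.nunique()

print(with_nan)
print(without_nan.to_dict())
```

For `main_df` both approaches should agree, since the missing-value check above reports no gaps.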

### Data types

In [105]:
main_df.dtypes

ad_id                                Int64
region_id                            Int64
ad_price                             Int64
numpictures                          Int64
proseller                          boolean
adoldness                            Int64
postalcode                           Int64
production                  datetime64[ns]
mileage                              Int64
clime_id                             Int64
shifter                             string
person_capacity_x                    Int64
brand_id                             Int64
model_id                             Int64
ccm_x                                Int64
highlighted                        boolean
upload_date                 datetime64[ns]
catalog_url                         string
is_sold                            boolean
brand_name                          string
region_name                         string
model_name                          string
clime_name                          string
category_id

### Value frequencies per column

In [106]:
value_counts = [main_df[col].value_counts() for col in main_df.columns]
value_counts

[15564920    1
 12440272    1
 15253611    1
 15504420    1
 15340753    1
            ..
 15551226    1
 14761640    1
 14631941    1
 15299620    1
 15372474    1
 Name: ad_id, Length: 17942, dtype: Int64,
 3     3889
 14    2492
 4     1462
 8      982
 9      887
 2      792
 7      731
 16     709
 6      664
 19     635
 1      612
 18     600
 12     578
 20     538
 15     530
 10     487
 17     414
 11     407
 5      321
 13     212
 Name: region_id, dtype: Int64,
 1490000    373
 1990000    354
 599000     351
 1599000    343
 1999000    320
           ... 
 4250000      1
 2875000      1
 2286000      1
 1925000      1
 2487000      1
 Name: ad_price, Length: 1343, dtype: Int64,
 6     6853
 12    4461
 5     1740
 11    1248
 10     855
 4      750
 9      558
 8      387
 3      329
 7      296
 0      205
 2      139
 1      121
 Name: numpictures, dtype: Int64,
 True     9967
 False    7975
 Name: proseller, dtype: Int64,
 2       634
 1       631
 4       543
 3      
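A list of 52 frequency tables is hard to navigate. As a possible alternative, the tables can be stored in a dictionary keyed by column name and inspected one at a time. The rows below are invented for illustration.

```python
import pandas as pd

# Invented rows using two of the dataset's column names.
demo = pd.DataFrame({
    "region_id": [3, 3, 14, 3, 14, 4],
    "fuel": ["diesel", "petrol", "diesel", "diesel", "petrol", "diesel"],
})

# One frequency table per column, addressable by column name.
value_counts = {col: demo[col].value_counts() for col in demo.columns}

top_regions = value_counts["region_id"].head(2)         # two most frequent regions
fuel_shares = demo["fuel"].value_counts(normalize=True)  # relative frequencies

print(top_regions)
print(fuel_shares)
```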

### Questions

1. Is the dataset large enough?

2. Is the data in the dataset sufficiently varied?

3. Does the dataset have a time dimension?

4. Does the dataset contain both quantitative and qualitative data?

5. Does the dataset have many missing values?
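The questions can be backed with a few numbers computed directly from the dataframe. The helper below is a hypothetical sketch (`dataset_profile` is not part of the notebook) and runs on invented rows; it does not answer the questions by itself. In particular, numeric ID columns such as `region_id` would be listed as numeric even though they are qualitative.

```python
import pandas as pd


def dataset_profile(df):
    """Collect size, column kinds, and the missing-value share (hypothetical helper)."""
    datetime_cols = list(df.select_dtypes(include="datetime").columns)
    numeric_cols = list(df.select_dtypes(include="number").columns)
    other_cols = [c for c in df.columns if c not in datetime_cols + numeric_cols]
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "datetime_columns": datetime_cols,   # hints at a time dimension
        "numeric_columns": numeric_cols,     # candidates for quantitative data
        "other_columns": other_cols,         # candidates for qualitative data
        "missing_share": float(df.isna().mean().mean()),
    }


# Invented rows for demonstration only.
demo = pd.DataFrame({
    "ad_price": [1945000, 1299000, 1250000],
    "upload_date": pd.to_datetime(["2020-05-01", "2020-05-02", "2020-05-03"]),
    "fuel": ["diesel", None, "petrol"],
})
profile = dataset_profile(demo)
print(profile)
```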


## Checkpoint 2 for 23 March and 30 March 2023

Building the relational model and the database

1. Create an ER diagram (conceptual database design):
- identify the entities
- identify the attributes of each entity
- define the relationships between entities
- define the cardinality

2. Create the database in a DBMS (logical database design, e.g. in MySQL)

3. Populate the database with data from the CSV files
- write a Python script that populates the database
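As an illustration of step 2, the sketch below shows how one relationship might translate into tables with a primary key and a foreign key. It is an assumption-laden example: it uses an in-memory SQLite database as a stand-in for the real DBMS, only a few columns from the dataset, and it assumes a one-to-many relationship between regions and advertisements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Assumed cardinality: one region has many advertisements.
conn.executescript("""
CREATE TABLE regions (
    region_id   INTEGER PRIMARY KEY,
    region_name TEXT NOT NULL
);
CREATE TABLE advertisements (
    ad_id     INTEGER PRIMARY KEY,
    ad_price  INTEGER,
    region_id INTEGER NOT NULL REFERENCES regions(region_id)
);
""")

conn.execute("INSERT INTO regions VALUES (1, 'Region A')")
conn.execute("INSERT INTO advertisements VALUES (100, 1945000, 1)")

# An advertisement pointing at a non-existent region is rejected.
try:
    conn.execute("INSERT INTO advertisements VALUES (101, 1299000, 99)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True

ad_count = conn.execute("SELECT COUNT(*) FROM advertisements").fetchone()[0]
print(fk_enforced, ad_count)
```

Declaring the keys in the schema lets the database reject inconsistent rows, which tables created on the fly from dataframes do not do unless the constraints are added afterwards.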

## ER diagram

![ER diagram](image/er-diagram.png "ER diagram")

### Importing packages for connecting to the PostgreSQL database

In [107]:
import psycopg2
from sqlalchemy import create_engine

### Connecting to the database

In [108]:
conn_string = 'postgresql://postgres:password123@127.0.0.1/spi-projekt'
db = create_engine(conn_string)
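The connection string above embeds the password in the notebook. One common alternative is to assemble it from environment variables. The sketch below uses only the standard library; the variable names (`SPI_DB_USER`, `SPI_DB_PASSWORD`, `SPI_DB_HOST`, `SPI_DB_NAME`) and the helper `build_conn_string` are hypothetical.

```python
import os
from urllib.parse import quote_plus


def build_conn_string(env=os.environ):
    """Assemble a PostgreSQL URL from environment variables (names are hypothetical)."""
    user = env.get("SPI_DB_USER", "postgres")
    # Percent-encode the password so characters like '@' or '/' do not break the URL.
    password = quote_plus(env.get("SPI_DB_PASSWORD", ""))
    host = env.get("SPI_DB_HOST", "127.0.0.1")
    name = env.get("SPI_DB_NAME", "spi-projekt")
    return f"postgresql://{user}:{password}@{host}/{name}"


# Demo with an explicit mapping instead of the real environment.
demo_conn_string = build_conn_string({"SPI_DB_PASSWORD": "p@ss/word"})
print(demo_conn_string)
```

The resulting string could then be passed to `create_engine` in place of the literal one.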

### Creating and populating the tables

In [109]:
# Target table name -> dataframe to load
tables = {
    'regions': region_df,
    'models': model_df,
    'environmentals': environmental_df,
    'drives': drive_df,
    'climes': clime_df,
    'categories': category_df,
    'catalogs': catalogs_df,
    'car_types': car_types_df,
    'brands': brand_df,
    'advertisements': advertisements_df,
}

try:
    for table_name, df in tables.items():
        # if_exists='replace' drops and recreates the table on every run
        df.to_sql(table_name, db, if_exists='replace', index=False)
except Exception as ex:
    print(ex)
else:
    print("PostgreSQL tables have been created and populated.")

PostgreSQL tables have been created and populated.
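A success message does not confirm that every row arrived. A simple follow-up check is to compare the row count in the database with the length of the dataframe. The sketch below uses an in-memory SQLite connection as a stand-in for the PostgreSQL engine, and the rows are invented.

```python
import sqlite3

import pandas as pd

# Invented stand-in for region_df.
demo_regions = pd.DataFrame({
    "region_id": [1, 2, 3],
    "region_name": ["A", "B", "C"],
})

# In-memory SQLite replaces the PostgreSQL engine for this demo only.
conn = sqlite3.connect(":memory:")
demo_regions.to_sql("regions", conn, if_exists="replace", index=False)

# Read the count back and compare it with the source dataframe.
row_count = int(pd.read_sql_query("SELECT COUNT(*) AS n FROM regions", conn)["n"].iloc[0])
print(row_count == len(demo_regions))
```

With the real engine, the same query could be run for each table name used in the loading cell.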
