* Master DAC - BDLE
* Author: Mohamed-Amine Baazizi
* Affiliation: LIP6 - Faculté des Sciences - Sorbonne Université
* Email: mohamed-amine.baazizi@lip6.fr
* October 2024

# Building an effective data preparation pipeline for ML

**Authors:**

**ZHOU runlin 28717281**

**ZHANG zhile 21201131**



## Outline

This homework is about building an effective data preparation pipeline.
It covers the following aspects, all seen during the session:

* ingest raw data, curate it, transform it
* load the data into delta tables to enforce constraints and allow updates
* build an ML pipeline for training a decision tree model and run cross validation

It is based on raw data about car prices crawled from a public source.
Start by running some data exploration queries to decide which columns to select or discard, based on a general understanding of the data.








## Prerequisite

### System setup

In [None]:
%%capture
!pip install -q pyspark
!pip install -q delta-spark
!pip install pyngrok

In [None]:
!pip list|grep spark

delta-spark                        3.2.1
pyspark                            3.5.3


In [None]:
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

local = "local[*]"
appName = "ADIA certificate - Delta Lake "
localConfig = SparkConf().setAppName(appName).setMaster(local).\
  set("spark.executor.memory", "8G").\
  set("spark.driver.memory","8G").\
  set("spark.sql.catalogImplementation","in-memory").\
  set("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension").\
  set("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog").\
  set("spark.jars.packages","io.delta:delta-spark_2.12:3.2.1").\
  set("spark.databricks.delta.schema.autoMerge.enabled","true")


spark = SparkSession.builder.config(conf = localConfig).getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")

In [None]:
spark

### Data import

In [None]:
%%capture
!wget --no-verbose https://nuage.lip6.fr/s/89BG8HD9r3iE693/download/MLData.tgz -O /tmp/MLData.tgz
!tar -xzvf /tmp/MLData.tgz  --directory /tmp/

In [None]:
!ls -hal /tmp/MLData

total 73M
drwxr-xr-x 2  501 staff 4.0K Jan  6  2022 .
drwxrwxrwt 1 root root  4.0K Oct 24 08:32 ..
-rw-r--r-- 1  501 staff  66M Jan  6  2022 autos.csv
-rw-r--r-- 1  501 staff  176 Jan  6  2022 ._loan.csv
-rw-r--r-- 1  501 staff 6.8M Jan  6  2022 loan.csv


In [None]:
query = """
CREATE TABLE IF NOT EXISTS raw_vehiculePrices
USING csv
OPTIONS (
  header "true",
  path "/tmp/MLData/autos.csv",
  inferSchema "true"
)
"""
spark.sql(query)

DataFrame[]

## Phase 0: Understanding the data

In this part, you are invited to get to know the data by reading its schema and extracting some basic statistics about the values of the columns you find interesting.
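The per-column statistics queried in this phase all share the same shape, so the query can be generated instead of retyped. The sketch below is one way to do it; the helper name `summary_query` and the alias scheme are illustrative assumptions, not part of the assignment.

```python
def summary_query(table, columns):
    """Build one SELECT returning min, max, avg and median for each column."""
    parts = []
    for c in columns:
        parts.append(
            f"min({c}) AS {c}_min, max({c}) AS {c}_max, "
            f"avg({c}) AS {c}_avg, median({c}) AS {c}_median"
        )
    return f"SELECT {', '.join(parts)} FROM {table}"


# Hypothetical usage on the columns explored in this phase.
query = summary_query(
    "raw_vehiculePrices", ["yearOfRegistration", "price", "kilometer"]
)
print(query)
```

In the notebook, the result can be displayed with `spark.sql(query).show(vertical=True)`, which prints one statistic per line.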

In [None]:
query = """
DESCRIBE raw_vehiculePrices
"""
spark.sql(query).show()

+-------------------+---------+-------+
|           col_name|data_type|comment|
+-------------------+---------+-------+
|        dateCrawled|timestamp|   NULL|
|               name|   string|   NULL|
|             seller|   string|   NULL|
|          offerType|   string|   NULL|
|              price|      int|   NULL|
|             abtest|   string|   NULL|
|        vehicleType|   string|   NULL|
| yearOfRegistration|      int|   NULL|
|            gearbox|   string|   NULL|
|            powerPS|      int|   NULL|
|              model|   string|   NULL|
|          kilometer|      int|   NULL|
|monthOfRegistration|      int|   NULL|
|           fuelType|   string|   NULL|
|              brand|   string|   NULL|
|  notRepairedDamage|   string|   NULL|
|        dateCreated|timestamp|   NULL|
|       nrOfPictures|      int|   NULL|
|         postalCode|      int|   NULL|
|           lastSeen|timestamp|   NULL|
+-------------------+---------+-------+



In [None]:
query = """
SELECT * FROM raw_vehiculePrices TABLESAMPLE (5 ROWS);
"""
spark.sql(query).show()


+-------------------+--------------------+------+---------+-----+------+-----------+------------------+---------+-------+-----+---------+-------------------+--------+----------+-----------------+-------------------+------------+----------+-------------------+
|        dateCrawled|                name|seller|offerType|price|abtest|vehicleType|yearOfRegistration|  gearbox|powerPS|model|kilometer|monthOfRegistration|fuelType|     brand|notRepairedDamage|        dateCreated|nrOfPictures|postalCode|           lastSeen|
+-------------------+--------------------+------+---------+-----+------+-----------+------------------+---------+-------+-----+---------+-------------------+--------+----------+-----------------+-------------------+------------+----------+-------------------+
|2016-03-24 11:52:17|          Golf_3_1.6|privat|  Angebot|  480|  test|       NULL|              1993|  manuell|      0| golf|   150000|                  0|  benzin|volkswagen|             NULL|2016-03-24 00:00:00|     

In [None]:
query = """
SELECT  min(yearOfRegistration), max(yearOfRegistration),
          avg(yearOfRegistration), median(yearOfRegistration)
FROM raw_vehiculePrices
"""
spark.sql(query).show()

+-----------------------+-----------------------+-----------------------+--------------------------+
|min(yearOfRegistration)|max(yearOfRegistration)|avg(yearOfRegistration)|median(yearOfRegistration)|
+-----------------------+-----------------------+-----------------------+--------------------------+
|                   1000|                   9999|     2004.5767206439623|                    2003.0|
+-----------------------+-----------------------+-----------------------+--------------------------+



In [None]:
# query = """
# SELECT  yearOfRegistration, count(*)
# FROM raw_vehiculePrices
# GROUP BY yearOfRegistration
# order by 1 desc,2 desc
# """
# spark.sql(query).show(150)

In [None]:
query = """
SELECT  min(price), max(price),
          avg(price), median(price)
FROM raw_vehiculePrices
"""
spark.sql(query).show()

+----------+----------+------------------+-------------+
|min(price)|max(price)|        avg(price)|median(price)|
+----------+----------+------------------+-------------+
|         0|2147483647|17286.338865535483|       2950.0|
+----------+----------+------------------+-------------+



In [None]:
query = """
SELECT  min(kilometer), max(kilometer),
          avg(kilometer), median(kilometer)
FROM raw_vehiculePrices
"""
spark.sql(query).show()

+--------------+--------------+------------------+-----------------+
|min(kilometer)|max(kilometer)|    avg(kilometer)|median(kilometer)|
+--------------+--------------+------------------+-----------------+
|          5000|        150000|125618.56044408226|         150000.0|
+--------------+--------------+------------------+-----------------+



## Phase 1: Cleaning the data and selecting relevant columns

In this part you are invited to decide which columns are useful for your analysis and to clean the data by removing outliers (e.g. records with implausible values in a specific column).
The result of your cleaning and selection should be stored in a table called `phase1`.
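A long cleaning filter is easier to review when each rule is listed separately. The sketch below assembles the `WHERE` clause used in this phase from two lists; the names `RANGE_RULES`, `REQUIRED_COLUMNS` and `cleaning_predicate` are illustrative, while the rules themselves are copied from the filter that creates `phase1`.

```python
# Range and consistency rules, written as SQL predicates.
RANGE_RULES = [
    "yearOfRegistration > 1900",
    "yearOfRegistration < 2024",
    "price > 500",
    "monthOfRegistration > 0",
    "kilometer > 0",
    "dateCreated < dateCrawled",
    "dateCreated < lastSeen",
]

# Columns that must not be NULL.
REQUIRED_COLUMNS = [
    "name", "seller", "offerType", "abtest", "vehicleType", "gearbox",
    "model", "fuelType", "brand", "notRepairedDamage", "dateCreated",
]


def cleaning_predicate(range_rules, required_columns):
    """Join every rule with AND to form the body of a single WHERE clause."""
    not_null = [f"{c} IS NOT NULL" for c in required_columns]
    return "\n  AND ".join(range_rules + not_null)


where_clause = cleaning_predicate(RANGE_RULES, REQUIRED_COLUMNS)
query = f"SELECT *\nFROM raw_vehiculePrices\nWHERE {where_clause}"
print(query)
```

Keeping the rules in lists makes it simple to disable one rule at a time and count how many rows each of them removes.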

In [None]:
query = """
SELECT  count(*), nrOfPictures
FROM raw_vehiculePrices
GROUP BY nrOfPictures
order by 1 desc,2 desc
"""
spark.sql(query).show()

+--------+------------+
|count(1)|nrOfPictures|
+--------+------------+
|  371823|           0|
|       1|        NULL|
+--------+------------+



In [None]:
query = """
SELECT  count(*), monthOfRegistration
FROM raw_vehiculePrices
GROUP BY monthOfRegistration
order by 1 desc,2 desc
"""
spark.sql(query).show()

+--------+-------------------+
|count(1)|monthOfRegistration|
+--------+-------------------+
|   37706|                  0|
|   36191|                  3|
|   33201|                  6|
|   30945|                  4|
|   30649|                  5|
|   28983|                  7|
|   27360|                 10|
|   25509|                 11|
|   25404|                 12|
|   25089|                  9|
|   24576|                  1|
|   23782|                  8|
|   22428|                  2|
|       1|               NULL|
+--------+-------------------+



In [None]:
query = """
DROP TABLE IF EXISTS phase1
"""
spark.sql(query)

query = """
CREATE OR REPLACE TABLE phase1
USING delta
AS
SELECT  *
FROM raw_vehiculePrices
WHERE yearOfRegistration > 1900 and yearOfRegistration < 2024 and price > 500 and monthOfRegistration > 0 and kilometer > 0
  and dateCreated < dateCrawled and dateCreated < lastSeen
  and name is not null and seller is not null and offerType is not null and abtest is not null
  and vehicleType is not null and gearbox is not null and model is not null and fuelType is not null
  and brand is not null and notRepairedDamage is not null and dateCreated is not null
"""
spark.sql(query)

DataFrame[]

In [None]:
query = """
SELECT  min(price), max(price),
          avg(price), median(price)
FROM phase1
"""
spark.sql(query).show()

+----------+----------+-----------------+-------------+
|min(price)|max(price)|       avg(price)|median(price)|
+----------+----------+-----------------+-------------+
|       501|  99999999|8790.667430541642|       4300.0|
+----------+----------+-----------------+-------------+



In [None]:
query = """
SELECT  count(*), kilometer
FROM phase1
GROUP BY kilometer
order by 1 desc,2 desc
"""
spark.sql(query).show()

+--------+---------+
|count(1)|kilometer|
+--------+---------+
|  144067|   150000|
|   27103|   125000|
|   11468|   100000|
|    9525|    90000|
|    8634|    80000|
|    7787|    70000|
|    7102|    60000|
|    6276|    50000|
|    5290|    40000|
|    4807|    30000|
|    4195|    20000|
|    1871|     5000|
|    1443|    10000|
+--------+---------+



In [None]:
query = """
SELECT count(*), price
FROM phase1
GROUP BY price
having count(*) > 2
order by 1, 2 desc
"""
spark.sql(query).show()

+--------+------+
|count(1)| price|
+--------+------+
|       3|195000|
|       3|150000|
|       3|145000|
|       3|129000|
|       3|116000|
|       3|112000|
|       3|110000|
|       3|109000|
|       3|104900|
|       3| 99900|
|       3| 88900|
|       3| 87900|
|       3| 84999|
|       3| 83500|
|       3| 82500|
|       3| 79900|
|       3| 79500|
|       3| 78500|
|       3| 76000|
|       3| 75900|
+--------+------+
only showing top 20 rows



In [None]:
query = """
CREATE OR REPLACE TABLE temp_phase1
USING delta
AS
SELECT *, COUNT(*) OVER (PARTITION BY price) AS price_count
FROM phase1;
"""
spark.sql(query)

query = """
DELETE FROM temp_phase1
WHERE price_count <= 2;
"""
spark.sql(query)

query = """
DROP TABLE IF EXISTS phase1
"""
spark.sql(query)

query = """
CREATE OR REPLACE TABLE phase1
USING delta
AS
SELECT dateCrawled, name, seller, offerType, price, abtest, vehicleType,
       yearOfRegistration, gearbox, powerPS, model, kilometer, monthOfRegistration,
       fuelType, brand, notRepairedDamage, dateCreated, nrOfPictures,
       postalCode, lastSeen
FROM temp_phase1
"""
spark.sql(query)

DataFrame[]

In [None]:
query = """
SELECT  min(price), max(price),
          avg(price), median(price)
FROM temp_phase1
"""
spark.sql(query).show()

+----------+----------+-----------------+-------------+
|min(price)|max(price)|       avg(price)|median(price)|
+----------+----------+-----------------+-------------+
|       510|    225000|6932.356494751951|       4250.0|
+----------+----------+-----------------+-------------+



**Give a brief summary of your choices**

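The cleaning keeps only records that are complete and plausible. The main operations were:
- Filtering rows on plausibility rules: `yearOfRegistration` strictly between 1900 and 2024, `price` above 500, `monthOfRegistration` and `kilometer` above 0, and `dateCreated` earlier than both `dateCrawled` and `lastSeen`.
- Dropping rows with a NULL in any of `name`, `seller`, `offerType`, `abtest`, `vehicleType`, `gearbox`, `model`, `fuelType`, `brand`, `notRepairedDamage` or `dateCreated`.
- Removing rare prices, i.e. prices that appear in at most two records, as potential outliers. This lowers the maximum price from 99999999 to 225000.
- Keeping the 20 original columns and dropping only the helper column `price_count`.

For example, the price of a car offered for sale cannot be zero, and a registration month of 0 is not a valid month.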
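The rare-price rule can be checked on a small example. The sketch below mirrors `COUNT(*) OVER (PARTITION BY price)` followed by the deletion of rows with `price_count <= 2`, using plain Python on hypothetical prices (the values are made up for illustration).

```python
from collections import Counter

# Hypothetical prices: 510 and 4300 appear three times, 225000 twice, 99999999 once.
prices = [510, 4300, 510, 99999999, 4300, 510, 4300, 225000, 225000]

# Equivalent of COUNT(*) OVER (PARTITION BY price): occurrences of each price.
price_count = Counter(prices)

# Equivalent of DELETE ... WHERE price_count <= 2: keep prices seen at least 3 times.
kept = [p for p in prices if price_count[p] > 2]

print(kept)
```

A price listed only once or twice is dropped even when it is plausible, so the rule trades some valid records for robustness against entries such as 99999999.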



## Phase 2: Organizing the data

In this part you are invited to load the data into Delta tables, where you will define meaningful constraints and conditions to be fulfilled by any future incoming data.
The result of this phase should be a Delta table called `deltaPrices`.

In [None]:
# Create the deltaPrices table
query = """
CREATE OR REPLACE TABLE deltaPrices (
    dateCrawled TIMESTAMP,
    name STRING,
    seller STRING,
    offerType STRING,
    price INT,
    abtest STRING,
    vehicleType STRING,
    yearOfRegistration INT,
    gearbox STRING,
    powerPS INT,
    model STRING,
    kilometer INT,
    monthOfRegistration INT,
    fuelType STRING,
    brand STRING,
    notRepairedDamage STRING,
    dateCreated TIMESTAMP,
    nrOfPictures INT,
    postalCode INT,
    lastSeen TIMESTAMP
)
USING DELTA;
"""
spark.sql(query)


DataFrame[]

In [None]:
query = """
INSERT INTO deltaPrices
SELECT *
FROM phase1
WHERE price >= 510 AND price <= 225000
  AND yearOfRegistration BETWEEN 1900 AND YEAR(CURRENT_DATE)
  AND powerPS > 0 AND powerPS < 20000
  AND kilometer >= 5000 AND kilometer <= 150000;
"""
spark.sql(query)

DataFrame[]

In [None]:
query = """
SELECT  min(price), max(price),
          avg(price), median(price)
FROM deltaPrices
"""
spark.sql(query).show()

+----------+----------+-----------------+-------------+
|min(price)|max(price)|       avg(price)|median(price)|
+----------+----------+-----------------+-------------+
|       510|    225000|7017.582971192949|       4350.0|
+----------+----------+-----------------+-------------+



In [None]:
query = """
SELECT  min(powerPS), max(powerPS),
          avg(powerPS), median(powerPS)
FROM deltaPrices
"""
spark.sql(query).show()

+------------+------------+-----------------+---------------+
|min(powerPS)|max(powerPS)|     avg(powerPS)|median(powerPS)|
+------------+------------+-----------------+---------------+
|           1|       17700|132.5757705199566|          120.0|
+------------+------------+-----------------+---------------+



In [None]:
query = """
SELECT  min(kilometer), max(kilometer),
          avg(kilometer), median(kilometer)
FROM deltaPrices
"""
spark.sql(query).show()

+--------------+--------------+------------------+-----------------+
|min(kilometer)|max(kilometer)|    avg(kilometer)|median(kilometer)|
+--------------+--------------+------------------+-----------------+
|          5000|        150000|123094.06641879847|         150000.0|
+--------------+--------------+------------------+-----------------+



Comment on the constraints you added


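The cleaning in Phase 1 removed the outliers, so we expect valid records to fall outside the observed \[min, max\] range only very rarely. We therefore treat only values inside that range as valid: `price` in \[510, 225000\], `kilometer` in \[5000, 150000\], `powerPS` strictly between 0 and 20000, and `yearOfRegistration` between 1900 and the current year. These bounds are applied in the `WHERE` clause of the `INSERT` statement.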
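The `CREATE TABLE` statement of this phase declares no constraint, so the bounds of the `INSERT` filter hold only for the rows loaded here. Delta Lake can also enforce such bounds on later writes through `CHECK` constraints. The following is a minimal sketch, assuming the bounds of the `INSERT` filter; the constraint names and the helpers `constraint_statements` and `add_constraints` are illustrative.

```python
# Bounds copied from the INSERT filter; the constraint names are hypothetical.
CONSTRAINTS = {
    "valid_price": "price >= 510 AND price <= 225000",
    "valid_power": "powerPS > 0 AND powerPS < 20000",
    "valid_kilometer": "kilometer >= 5000 AND kilometer <= 150000",
    "valid_year": "yearOfRegistration >= 1900",
}


def constraint_statements(table, constraints):
    """Return one ALTER TABLE ... ADD CONSTRAINT statement per rule."""
    return [
        f"ALTER TABLE {table} ADD CONSTRAINT {name} CHECK ({condition})"
        for name, condition in constraints.items()
    ]


def add_constraints(spark, table="deltaPrices", constraints=CONSTRAINTS):
    """Run the statements; `spark` is expected to be the notebook's SparkSession."""
    for statement in constraint_statements(table, constraints):
        spark.sql(statement)


for statement in constraint_statements("deltaPrices", CONSTRAINTS):
    print(statement)
```

Calling `add_constraints(spark)` after the load should make a later `INSERT` with, say, `price = 0` fail instead of silently adding an invalid row. Adding a constraint is also expected to fail if existing rows violate it, which is why the sketch reuses the bounds already applied during the load.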

## Phase 3: Analysing the data and ensuring query evaluation efficiency

Suggest 2 or 3 meaningful queries as described above and suggest a data organization scheme for optimizing one such query of your choice.
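One possible organization scheme, sketched here under assumptions, targets the query of this phase that averages `price` per `gearbox` and per year of `dateCreated`: store the year as a column and partition the table on it together with `gearbox`. The results of that query show two gearbox values and two years, so the number of partitions stays small. The table name `deltaPrices_by_gearbox`, the column `createdYear` and the helper `partitioned_copy_statement` are hypothetical.

```python
def partitioned_copy_statement(source, target, partition_columns):
    """Build a CTAS that copies `source` into a partitioned Delta table."""
    columns = ", ".join(partition_columns)
    return (
        f"CREATE OR REPLACE TABLE {target}\n"
        f"USING delta\n"
        f"PARTITIONED BY ({columns})\n"
        f"AS SELECT *, YEAR(dateCreated) AS createdYear FROM {source}"
    )


# Hypothetical target table and derived column names.
statement = partitioned_copy_statement(
    "deltaPrices", "deltaPrices_by_gearbox", ["gearbox", "createdYear"]
)
print(statement)

# A variant of the query that filters on a partition column,
# so the files of the other partitions do not need to be read.
pruned_query = """
SELECT gearbox, createdYear, AVG(price) AS avg_price
FROM deltaPrices_by_gearbox
WHERE gearbox = 'automatik'
GROUP BY gearbox, createdYear
"""
```

Both strings can be run with `spark.sql(...)`. The Spark UI can then be used to compare the amount of data read by the pruned query on the partitioned table and on `deltaPrices`.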

In [None]:
from pyngrok import ngrok, conf
import getpass

print("Enter your authtoken, which can be copied "
"from https://dashboard.ngrok.com/get-started/your-authtoken")
conf.get_default().auth_token = getpass.getpass()

ui_port = 4040
public_url = ngrok.connect(ui_port).public_url
print(f" * ngrok tunnel \"{public_url}\" -> \"http://127.0.0.1:{ui_port}\"")

Enter your authtoken, which can be copied from https://dashboard.ngrok.com/get-started/your-authtoken
··········




 * ngrok tunnel "https://8158-34-125-204-244.ngrok-free.app" -> "http://127.0.0.1:4040"


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import year, col, avg
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer
import pandas as pd

# Reuse the notebook's SparkSession (getOrCreate returns the session that is already running)
spark = SparkSession.builder.appName("OptimizeQueryWithCorrelation").getOrCreate()
df = spark.table("deltaPrices")
df = df.withColumn("year", year(col("dateCreated")))

# Select the numeric columns
numeric_cols = ["price", "powerPS", "kilometer", "yearOfRegistration", "postalCode", "nrOfPictures", "monthOfRegistration"]
correlation_df = df.select([col(c) for c in numeric_cols]).toPandas()


# Compute the correlation matrix
correlation_matrix = correlation_df.corr()
print("Correlation with price:")
print(correlation_matrix["price"].sort_values(ascending=False))


# Keep the columns whose correlation with price is above 0.25 or below -0.25
selected_features = correlation_matrix["price"][(correlation_matrix["price"] > 0.25) | (correlation_matrix["price"] < -0.25)].index.tolist()
selected_features.remove("price")
print(f"Selected features based on correlation: {selected_features}")

# Define the StringIndexer to convert the categorical 'gearbox' column to numerical index
label_col = "gearbox"
indexed_label_col = "indexed_gearbox"
label_indexer = StringIndexer(inputCol=label_col, outputCol=indexed_label_col, handleInvalid="keep")

vec_assembler = VectorAssembler(inputCols=selected_features, outputCol="features_vector")
vec_indexer = VectorIndexer(inputCol="features_vector", outputCol="features", maxCategories=3)

# Create a pipeline with the stages
stages = [label_indexer, vec_assembler, vec_indexer]
pipeline = Pipeline(stages=stages)
# Fit the pipeline
model = pipeline.fit(df)
transformed_df = model.transform(df)


Correlation with price:
price                  1.000000
yearOfRegistration     0.440699
powerPS                0.259631
postalCode             0.067653
monthOfRegistration   -0.003259
kilometer             -0.455710
nrOfPictures                NaN
Name: price, dtype: float64
Selected features based on correlation: ['powerPS', 'kilometer', 'yearOfRegistration']
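The selection rule can be replayed on the printed coefficients without Spark. The sketch below copies the values from the output above and keeps the columns whose absolute correlation with `price` exceeds 0.25. `nrOfPictures` yields `NaN` because the column is constant (every value is 0), so its correlation is undefined and it fails both comparisons.

```python
import math

# Correlation of each numeric column with price, copied from the printed output.
corr_with_price = {
    "yearOfRegistration": 0.440699,
    "powerPS": 0.259631,
    "postalCode": 0.067653,
    "monthOfRegistration": -0.003259,
    "kilometer": -0.455710,
    "nrOfPictures": float("nan"),
}

THRESHOLD = 0.25

# abs(r) > THRESHOLD is equivalent to (r > 0.25) or (r < -0.25).
# NaN is excluded explicitly to make the intent visible.
selected = [
    name for name, r in corr_with_price.items()
    if not math.isnan(r) and abs(r) > THRESHOLD
]
print(sorted(selected))
```

`powerPS` passes the threshold only narrowly (about 0.26), so a slightly stricter cutoff would drop it.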


In [None]:
query = """
SELECT gearbox, YEAR(dateCreated) as year, AVG(price) as avg_price
FROM deltaPrices
GROUP BY gearbox, year
ORDER BY year, avg_price DESC;
"""
spark.sql(query).show()

+---------+----+-----------------+
|  gearbox|year|        avg_price|
+---------+----+-----------------+
|automatik|2015|          11834.8|
|  manuell|2015|9233.615384615385|
|automatik|2016|11423.03049594229|
|  manuell|2016|5612.816970393791|
+---------+----+-----------------+



In [None]:
# Group and aggregate to compute the average price
result_df = transformed_df.groupBy("indexed_gearbox", "year").agg(avg("price").alias("avg_price"))
result_df = result_df.orderBy(col("year").asc(), col("avg_price").desc())
result_df.show()

+---------------+----+-----------------+
|indexed_gearbox|year|        avg_price|
+---------------+----+-----------------+
|            1.0|2015|          11834.8|
|            0.0|2015|9233.615384615385|
|            1.0|2016|11423.03049594229|
|            0.0|2016|5612.816970393791|
+---------------+----+-----------------+



In [None]:
query = """
SELECT seller, vehicleType, avg(price) as avg_price
FROM deltaPrices
GROUP BY seller, vehicleType
ORDER BY avg_price DESC;
"""
spark.sql(query).show()

+----------+-----------+------------------+
|    seller|vehicleType|         avg_price|
+----------+-----------+------------------+
|    privat|        suv|13801.530453447813|
|    privat|      coupe|11502.184746825522|
|    privat|     cabrio|10546.197891654465|
|    privat|        bus|  7268.99056727368|
|gewerblich|  limousine|            6900.0|
|    privat|  limousine| 6760.974133380257|
|    privat|      kombi| 6754.698268980295|
|    privat|     andere| 5119.884555382216|
|    privat| kleinwagen| 3576.979805071446|
|gewerblich| kleinwagen|            1100.0|
+----------+-----------+------------------+



## Ingesting new data and rerunning analytics

In this part you are invited to insert fictitious new data that conforms to the schema established in Phase 2 and to rerun some queries of Phase 3 to see how the results evolve. Ideally, write a query that compares an aggregate value across two different versions of the data by exploiting the Delta history feature.

In [None]:
# read from the original table
delta_prices_df = spark.table("deltaPrices")

# select a slice of the table
slice_df = delta_prices_df.select("brand", "model", "price")
slice_df.write.format("delta").mode("overwrite").saveAsTable("brand_model_price")

In [None]:
df = spark.table("deltaPrices")
possible_data_list = []
brands = df.select("brand").distinct().collect()

for brand in brands:
    models = df.filter(df.brand == brand[0]).select("model").distinct().collect()
    for model in models:
        possible_data_list.append((brand[0], model[0]))

print(possible_data_list[20])

('lada', 'samara')


In [None]:
history_query = """
DESCRIBE HISTORY brand_model_price;
"""
history_df = spark.sql(history_query)
history_df.show()

+-------+--------------------+------+--------+--------------------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+------------+--------------------+
|version|           timestamp|userId|userName|           operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|    operationMetrics|userMetadata|          engineInfo|
+-------+--------------------+------+--------+--------------------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+------------+--------------------+
|      0|2024-10-22 18:35:...|  NULL|    NULL|CREATE OR REPLACE...|{isManaged -> tru...|NULL|    NULL|     NULL|       NULL|  Serializable|        false|{numFiles -> 2, n...|        NULL|Apache-Spark/3.5....|
+-------+--------------------+------+--------+--------------------+--------------------+----+--------+---------+-----------+--------------+-------------+-----------

In [None]:
import random

# Insert a series of random rows
random_insert_query = """
INSERT INTO brand_model_price (brand, model, price)
VALUES
"""
nb_inserts = 1000
random_values = []
for i in range(nb_inserts):
    brand, model = random.choice(possible_data_list)
    price = random.randint(510, 225000)
    random_data = f"('{brand}', '{model}', {price})"
    random_values.append(random_data)

# Build the SQL statement
random_insert_query += ", ".join(random_values) + ";"
spark.sql(random_insert_query)

DataFrame[]
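Since `random.randint(510, 225000)` draws prices uniformly, the inserted rows have a much higher expected price than the existing data, which should be kept in mind when comparing averages across versions. A quick check of the arithmetic, using the average price of `deltaPrices` reported in Phase 2 as a reference point (the seed below is an arbitrary choice):

```python
import random

LOW, HIGH = 510, 225000

# Mean of a discrete uniform distribution on [LOW, HIGH].
expected_random_price = (LOW + HIGH) / 2
print(expected_random_price)  # 112755.0

# Average price in deltaPrices, copied from the Phase 2 output.
observed_avg = 7017.582971192949
ratio = expected_random_price / observed_avg
print(round(ratio, 1))

# Empirical check on 1000 draws, the same number as nb_inserts.
rng = random.Random(0)
sample = [rng.randint(LOW, HIGH) for _ in range(1000)]
sample_mean = sum(sample) / len(sample)
print(sample_mean)
```

Models with few original rows are therefore likely to be dominated by the random prices, which is consistent with large `price_change` values when the two versions are compared.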

In [None]:
history_query = """
DESCRIBE HISTORY brand_model_price;
"""
history_df = spark.sql(history_query)
history_df.show()

+-------+--------------------+------+--------+--------------------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+------------+--------------------+
|version|           timestamp|userId|userName|           operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|    operationMetrics|userMetadata|          engineInfo|
+-------+--------------------+------+--------+--------------------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+------------+--------------------+
|      1|2024-10-22 18:36:...|  NULL|    NULL|               WRITE|{mode -> Append, ...|NULL|    NULL|     NULL|          0|  Serializable|         true|{numFiles -> 2, n...|        NULL|Apache-Spark/3.5....|
|      0|2024-10-22 18:35:...|  NULL|    NULL|CREATE OR REPLACE...|{isManaged -> tru...|NULL|    NULL|     NULL|       NULL|  Serializable|        false|{numFiles -

In [None]:
# Compare two versions of the table:
# the change in the average price per brand and model
compare_query = """
WITH prev_version AS (
    SELECT brand, model, ROUND(AVG(price), 2) AS avg_price
    FROM brand_model_price VERSION AS OF 0 -- assume version 0 is the old version
    GROUP BY brand, model
),
current_version AS (
    SELECT brand, model, ROUND(AVG(price), 2) AS avg_price
    FROM brand_model_price VERSION AS OF 1 -- assume version 1 is the new version
    GROUP BY brand, model
)
SELECT
    current_version.brand,
    current_version.model,
    current_version.avg_price AS current_avg_price,
    prev_version.avg_price AS previous_avg_price,
    (current_version.avg_price - prev_version.avg_price) AS price_change
FROM current_version
JOIN prev_version
ON current_version.brand = prev_version.brand
AND current_version.model = prev_version.model
ORDER BY price_change DESC;
"""
spark.sql(compare_query).show()

+----------+---------------+-----------------+------------------+------------------+
|     brand|          model|current_avg_price|previous_avg_price|      price_change|
+----------+---------------+-----------------+------------------+------------------+
|     rover|       defender|         120501.0|             550.0|          119951.0|
|    lancia|     elefantino|          85638.6|             999.0|           84639.6|
|     rover|      discovery|         93976.67|           16000.0|          77976.67|
|      lada|         kalina|          75195.3|            2079.8|           73115.5|
|     rover|     freelander|         68915.14|            1950.0|          66965.14|
|     rover|     rangerover|         81968.17|          20733.33|          61234.84|
|land_rover|        serie_3|         72851.57|          13083.33|59768.240000000005|
|      lada|         samara|         41541.25|            1849.0|          39692.25|
|land_rover|discovery_sport|          83282.0|           47950.0|