![Imgur](https://image.freepik.com/free-vector/automobiles-models-icon-collection_74855-5435.jpg)

<a href='https://www.freepik.com/'>models icon @freepik.com </a>


# Exploring Ebay Car Sales Data 

## Introduction

In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle by user orgesleka, the original dataset isn't available on Kaggle anymore, but you can find it [here](https://data.world/data-society/used-cars-data).

This is the sort version, 50000 rows.  

The aim of this project is **to clean the data** and analyze the included used car listings 


### Data dictionary:

- `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.


- `name` - Name of the car.


- `seller` - Whether the seller is private or a dealer.


- `offerType` - The type of listing


- `price` - The price on the ad to sell the car.


- `abtest` - Whether the listing is included in an A/B test.


- `vehicleType` - The vehicle Type.


- `yearOfRegistration` - The year in which the car was first registered.


- `gearbox` - The transmission type.


- `powerPS` - The power of the car in PS.


- `model` - The car model name.


- `kilometer` - How many kilometers the car has driven.


- `monthOfRegistration` - The month in which the car was first registered.


- `fuelType` - What type of fuel the car uses.


- `brand` - The brand of the car.


- `notRepairedDamage`- If the car has a damage which is not yet repaired.


- `dateCreated` - The date on which the eBay listing was created.


- `nrOfPictures` - The number of pictures in the ad.


- `postalCode` - The postal code for the location of the vehicle.


- `lastSeenOnline` - When the crawler saw this ad last online.




# <a id='0'>Index</a>


### <a href='#1'>1. Basic Exploration Data</a>

### <a href='#2'>2. Cleaning column names </a>

### <a href='#3'>3. Initial Exploration Data and cleaning </a>

### <a href='#4'>4. Columns candidates to be droped</a>

### <a href='#5'>5. Columns that need more investigation.</a>

### <a href='#6'>6. Numeric data stored as text that needs to be cleaned</a>

### <a href='#7'>7. Character deletion and integer conversion</a>

### <a href='#8'>8. What does our dataset look like?</a>

### <a href='#9'>9. Exploring the odometer_km</a>

### <a href='#10'>10. Upper limit odometer_km</a>

### <a href='#11'>11. Lower limit odometer_km</a>

### <a href='#12'>12. Exploring the price_in_dollar</a>

### <a href='#13'>13. Upper limit price_in_dollars</a>

### <a href='#14'>14. Exploring price outliers on the botton of the list</a>

### <a href='#15'>15. The date columns </a>

### <a href='#16'>16. Price by Brand</a>

### <a href='#17'>17. Average price per vehicle brand</a>

### <a href='#18'>18. Average number of kilometers per vehicle brand</a>

### <a href='#19'>19. List of car brands by average price and average mileage</a>

### <a href='#20'>20. Storing Aggregate Data in a DataFrame</a>

### <a href='#21'>21. Identify categorical data</a>

### <a href='#22'>22. Finding the most common brand/model combi</a>

### <a href='#23'>23. Dividing the odometer in groups</a>

### <a href='#24'>24. How much cheaper are damaged cars than their undamaged counterparts?</a>

### <a href='#25'>25. Conclusions</a>

In [1]:
import numpy as np
import pandas as pd

### checking our csv

Dejo unos cuantos enlaces para investigar el tema, todavía no entiendo cuando saber 

In [2]:
! file -k autos.csv

autos.csv: CSV text\012- , Non-ISO extended-ASCII text


In [3]:
! file -i autos.csv

autos.csv: application/csv; charset=unknown-8bit


In [4]:
autos = pd.read_csv("autos.csv",encoding='ISO-8859-1')

 https://stackoverflow.com/questions/7048745/what-is-the-difference-between-utf-8-and-iso-8859-1
 
 
- ASCII: 7 bits. 128 code points.

- ISO-8859-1: 8 bits. 256 code points.

- UTF-8: 8-32 bits (1-4 bytes). 1,112,064 code points.

Both ISO-8859-1 and UTF-8 are backwards compatible with ASCII, but UTF-8 is not backwards compatible with ISO-8859-1:

<a href='#0'> back to index</a>
* * *
## <a id='1'>1. Basic Exploration Data</a>

In [5]:
autos.head(3)

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37


In [6]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

### observations

- There's a max numbers of 50000 rows with 20 columns, there are rows that are complete and others that are not, some columns are missing data.


- Column's name is not properly writed, so we need to [lowerCamelCase](https://en.wikipedia.org/wiki/Camel_case) it.


- Some columns do not have the appropriate data type.


- Some columns have null values, but none have more than ~20% null values.

## <a id='2'>2. Cleaning column names</a>

The column names use [camelcase](https://en.wikipedia.org/wiki/Camel_case) instead of Python's preferred [snakecase](https://en.wikipedia.org/wiki/Snake_case), which means we can't just replace spaces with underscores.

Working on fixing the column names:

   - `yearOfRegistration`  to `registration_year`
   - `monthOfRegistration` to `registration_month`
   - `notRepairedDamage`   to `unrepaired_damagè`
   - `dateCreated`         to `ad_created`

In [7]:
column_name = autos.columns
column_name

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [8]:
autos.rename({'yearOfRegistration':'registration_year','monthOfRegistration':'registration_month',
             'notRepairedDamage':'unrepaired_damage','dateCreated':'ad_created',},axis = 1, inplace = True)

and the rest of columns

In [9]:
autos.rename({'dateCrawled':'date_crawled','price':'price_in_dollars',
              'offerType':'offer_type','vehicleType':'vehicle_type',
              'powerPS':'CV','fuelType':'fuel_type',
              'nrOfPictures':'nr_pictures','postalCode':'postal_code',
              'lastSeen':'last_seen',},
             axis = 1, 
             inplace = True) 

Renaming names on snake format, and **powerPS** are converted to **CV** (Cavalli Vapore) means 'Horse Power' in metric system, system used in Europe where the dataframe comes (Germany)

based on this [link](https://www.autoweek.com/news/technology/a1820831/what-ps-metric-horsepower-autoweek-explains/)

In [10]:
autos.head(3)

Unnamed: 0,date_crawled,name,seller,offer_type,price_in_dollars,abtest,vehicle_type,registration_year,gearbox,CV,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37


<a href='#0'> back to index</a>
* * *
## <a id='3'>3. Initial Exploration Data and cleaning</a>

Cleaning tasks need to be done: 


- 1. Text columns where all or almost all values are the same. These can often **be dropped as they don't have useful** information for analysis.


- 2. Examples of numeric **data stored as text which can be cleaned and converted**.

In [11]:
len(autos.columns)

20

In [12]:
autos.describe(include = 'all')

Unnamed: 0,date_crawled,name,seller,offer_type,price_in_dollars,abtest,vehicle_type,registration_year,gearbox,CV,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-04-02 15:49:30,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


#### 3.1 Columns candidates to be droped:

- `seller`
- `offer_type`
- `abtest`
- `nr_picture`
- `postal_code`
- `last_seen`

In [13]:
len(autos.columns) # number columns before

20

In [14]:
columns_todrop = ['seller','offer_type','abtest','nr_pictures']

for i in columns_todrop:
    autos.drop([i], axis=1, inplace=True)

In [15]:
len(autos.columns) # number columns after

16

#### 3.2 Columns that need more investigation.

In [16]:
autos['unrepaired_damage'].value_counts()

nein    35232
ja       4939
Name: unrepaired_damage, dtype: int64

#### 3.3 Numeric data stored as text that needs to be cleaned

A quick look at all the text-type columns to determine which ones should be cleaned and changed the data type.

In [17]:
for index in autos.columns:
    condition = autos[index].dtype
    if  condition == 'object':
        print("\n" )
        print(autos[index][0:2])



0    2016-03-26 17:47:46
1    2016-04-04 13:38:56
Name: date_crawled, dtype: object


0              Peugeot_807_160_NAVTECH_ON_BOARD
1    BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik
Name: name, dtype: object


0    $5,000
1    $8,500
Name: price_in_dollars, dtype: object


0          bus
1    limousine
Name: vehicle_type, dtype: object


0      manuell
1    automatik
Name: gearbox, dtype: object


0    andere
1       7er
Name: model, dtype: object


0    150,000km
1    150,000km
Name: odometer, dtype: object


0       lpg
1    benzin
Name: fuel_type, dtype: object


0    peugeot
1        bmw
Name: brand, dtype: object


0    nein
1    nein
Name: unrepaired_damage, dtype: object


0    2016-03-26 00:00:00
1    2016-04-04 00:00:00
Name: ad_created, dtype: object


0    2016-04-06 06:45:54
1    2016-04-06 14:45:08
Name: last_seen, dtype: object


In [18]:
%%html
<style>
table {float:left}
</style>

|text column names|to clean|
|:---|:---|
|date_crawled|datetime|
|name|**ok**|
|price_in_dollars|clean $ sign, to number|
|vehicle_type|**ok**|
|gearbox|**ok**|
|mmodel|**ok**|
|odometer|clean km, to number and rename column to odometer_km|
|fuel_type|**ok**|
|brand|**ok**|
|unrepaired_damage|**ok**|
|ad_created|**ok**|

#### 3.4 Character deletion and integer conversion

Let's only modify the columns price in dollars and odometer

In [19]:
autos['price_in_dollars'] = autos['price_in_dollars'].str.replace('$', '')
autos['price_in_dollars'] = autos['price_in_dollars'].str.replace(',', '')
autos['price_in_dollars'] = autos['price_in_dollars'].astype(int)
autos['price_in_dollars']

0         5000
1         8500
2         8990
3         4350
4         1350
         ...  
49995    24900
49996     1980
49997    13200
49998    22900
49999     1250
Name: price_in_dollars, Length: 50000, dtype: int64

In [20]:
autos.rename({'odometer':'odometer_km'}, axis = 1, inplace = True)
autos['odometer_km'] = autos['odometer_km'].str.replace(',','')
autos['odometer_km'] = autos['odometer_km'].str.replace('km','')
autos['odometer_km'] = autos['odometer_km'].astype(int)
autos['odometer_km']

0        150000
1        150000
2         70000
3         70000
4        150000
          ...  
49995    100000
49996    150000
49997      5000
49998     40000
49999    150000
Name: odometer_km, Length: 50000, dtype: int64

### What does our dataset look like?


In [21]:
autos.describe() # numeric columns

Unnamed: 0,price_in_dollars,registration_year,CV,odometer_km,registration_month,postal_code
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,9840.044,2005.07328,116.35592,125732.7,5.72336,50813.6273
std,481104.4,105.712813,209.216627,40042.211706,3.711984,25779.747957
min,0.0,1000.0,0.0,5000.0,0.0,1067.0
25%,1100.0,1999.0,70.0,125000.0,3.0,30451.0
50%,2950.0,2003.0,105.0,150000.0,6.0,49577.0
75%,7200.0,2008.0,150.0,150000.0,9.0,71540.0
max,100000000.0,9999.0,17700.0,150000.0,12.0,99998.0


### Exploring the `odometer_km` 

In [22]:
autos['odometer_km'].unique()

array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000,   5000, 100000,  40000])

In [23]:
autos['odometer_km'].unique().shape

(13,)

In [24]:
autos['odometer_km'].value_counts().sort_index(ascending = False)

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
40000       819
30000       789
20000       784
10000       264
5000        967
Name: odometer_km, dtype: int64

#### Upper limit `odometer_km`


In [25]:
autos['odometer_km'].value_counts().sort_index(ascending = False).head(5)

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
Name: odometer_km, dtype: int64

#### Lower limit `odometer_km`

In [26]:
autos['odometer_km'].value_counts().sort_index(ascending = True).head(5)

5000     967
10000    264
20000    784
30000    789
40000    819
Name: odometer_km, dtype: int64

In [27]:
autos["odometer_km"].describe().round()

count     50000.0
mean     125733.0
std       40042.0
min        5000.0
25%      125000.0
50%      150000.0
75%      150000.0
max      150000.0
Name: odometer_km, dtype: float64

there do not appear to be outliers in this column `odometer_km`

### Exploring the `price_in_dollar`

In [28]:
autos['price_in_dollars'].unique().shape

(2357,)

#### Upper limit `price_in_dollars`

In [29]:
autos['price_in_dollars'].value_counts().sort_index(ascending = False).head(15)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
Name: price_in_dollars, dtype: int64

Let's see to which vehicle the value of 265000$ corresponds. 

In the event that this price is consistent with the vehicle, we can take it as a reference.

In [30]:
autos[autos["price_in_dollars"] == 265000]

Unnamed: 0,date_crawled,name,price_in_dollars,vehicle_type,registration_year,gearbox,CV,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
12682,2016-03-28 22:48:01,Porsche_GT3_RS__PCCB__Lift___grosser_Exklusiv_...,265000,coupe,2016,automatik,500,911,5000,3,benzin,porsche,nein,2016-03-28 00:00:00,70193,2016-04-05 03:44:51


It makes sense, so we will use our comparison from this value upwards and see what we find.

https://www.carindigo.com/used-cars/porsche-911-gt3-rs

As there are not too many values to check, it seems sensible to start from this model and see what we can eliminate and what we cannot.

let's see which vehicles are above this one, creating a new dataset variable `top_list` to have a look.

In [31]:
top_list = autos[(autos["price_in_dollars"] >= 265000)].sort_index(ascending = False)

In [32]:
top_list[["price_in_dollars","brand","model","vehicle_type"]].sort_index(ascending = False)

Unnamed: 0,price_in_dollars,brand,model,vehicle_type
47634,3890000,sonstige_autos,,coupe
47598,12345678,opel,vectra,limousine
43049,999999,volkswagen,transporter,bus
42221,27322222,citroen,c4,limousine
39705,99999999,mercedes_benz,s_klasse,limousine
39377,12345678,volvo,v40,
37585,999990,volkswagen,jetta,limousine
36818,350000,porsche,911,coupe
35923,295000,porsche,911,cabrio
34723,299000,porsche,911,coupe


### Exploring price outliers on the top of the list:

There are vehicles that make sense to be on this list and that others clearly do not, for example: 

Strange values of 1234566, corresponding to a BMW or 11111111 to a Volkswagen even a C4 that is worth 27322222. 

Clearly, these cases could be eliminated as error.

However, there are some kind of vehicles called **'sonstige autos'** (mean **other cars in German**) that we are not aware of and because they have an important weight in the list it is important to clarify whether or not we should eliminate them.  

In [33]:
top_list[top_list["brand"] == "sonstige_autos"]

Unnamed: 0,date_crawled,name,price_in_dollars,vehicle_type,registration_year,gearbox,CV,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
47634,2016-04-04 21:25:21,Ferrari_FXX,3890000,coupe,2006,,799,,5000,7,,sonstige_autos,nein,2016-04-04 00:00:00,60313,2016-04-05 12:07:37
14715,2016-03-30 08:37:24,Rolls_Royce_Phantom_Drophead_Coupe,345000,cabrio,2012,automatik,460,,20000,8,benzin,sonstige_autos,nein,2016-03-30 00:00:00,73525,2016-04-07 00:16:26
11137,2016-03-29 23:52:57,suche_maserati_3200_gt_Zustand_unwichtig_laufe...,10000000,coupe,1960,manuell,368,,100000,1,benzin,sonstige_autos,nein,2016-03-29 00:00:00,73033,2016-04-06 21:18:11
7814,2016-04-04 11:53:31,Ferrari_F40,1300000,coupe,1992,,0,,50000,12,,sonstige_autos,nein,2016-04-04 00:00:00,60598,2016-04-05 11:34:11


#### let's check if these prices correspond to reality

| name | price in Dollars |registration_year | source |
| :--- | :----------- |:------ | :--- |
| Ferrari_FXX | 4,000,000| 2006 |https://www.autoblog.com/2006/06/14/for-sale-2006-ferrari-fxx-slightly-used/?guccounter=1 |
| Rolls_Royce_Phantom_Drophead_Coupe | From 450,000 | 2012 | https://www.cars.com/shopping/rolls_royce-phantom_drophead_coupe-2012/ |
|Ferrari_F40| 1,959,900 |1992 | https://www.dupontregistry.com/autos/listing/1992/ferrari/f40/2418434 |
|Maserati 3200 GT|  | |  **Note** | 

**Note**: In relation to the Maserati, there are several things to consider.


- The [3200 Gt](https://es.wikipedia.org/wiki/Maserati_3200_GT) model was in production during the period 1998 to 2001 so the registration date we have from 1960 does not match, however, during the period 1957 - 1964 only one model of this brand was produced, the  [3500 Gt](https://en.wikipedia.org/wiki/Maserati_3500_GT#:~:text=The%20Maserati%203500%20GT%20(Tipo,Maserati%20between%201957%20and%201964.), only 2,222 units were produced between the Coupe and the Spider version.

- If we also take into account the description referred to in the cell of the name:

     ***" Zustand_unwichtig_laufe... // Condition_unimportant_running..."***


- The person who placed the advert was looking for a masserati in any condition (just running or not...), this gives us a clue that the model we are talking about is not the newest one, so probably refers to the 60's model.

- As if this were not enough the price we have is 10 000 000 dollars and surely this is wrong because doing a search of what is the value of the car [I have found that this ranges between 863.170,50 USD / 151.290,57 USD](https://www.el-parking.es/coches-usados/maserati-3500-gt.html#!/coches-usados/maserati-3500-gt.html%3Ftri%3Dprix_decroissant)

Therefore, the best thing I can think of is assign the <strong>'average between prices'</strong> to our vehicle.


In [34]:
autos.iloc[11137,:]

date_crawled                                        2016-03-29 23:52:57
name                  suche_maserati_3200_gt_Zustand_unwichtig_laufe...
price_in_dollars                                               10000000
vehicle_type                                                      coupe
registration_year                                                  1960
gearbox                                                         manuell
CV                                                                  368
model                                                               NaN
odometer_km                                                      100000
registration_month                                                    1
fuel_type                                                        benzin
brand                                                    sonstige_autos
unrepaired_damage                                                  nein
ad_created                                          2016-03-29 0

Change the **'average between prices'** price of the Masserati = 507230$

In [35]:
autos.iloc[11137,4] = 507230

#### So... what do we have to remove from this top list?

Everything other than 'Sonstige_autos' and 'Porsche'

In [36]:
removed_bad_price_bool = ((top_list["brand"] != 'sonstige_autos') & (top_list["brand"] != 'porsche'))

In [37]:
bad_cars = top_list[removed_bad_price_bool]
bad_cars

Unnamed: 0,date_crawled,name,price_in_dollars,vehicle_type,registration_year,gearbox,CV,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
47598,2016-03-31 18:56:54,Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar...,12345678,limousine,2001,manuell,101,vectra,150000,3,benzin,opel,nein,2016-03-31 00:00:00,4356,2016-03-31 18:56:54
43049,2016-03-21 19:53:52,2_VW_Busse_T3,999999,bus,1981,manuell,70,transporter,150000,1,benzin,volkswagen,,2016-03-21 00:00:00,99880,2016-03-28 17:18:28
42221,2016-03-08 20:39:05,Leasinguebernahme,27322222,limousine,2014,manuell,163,c4,40000,2,diesel,citroen,,2016-03-08 00:00:00,76532,2016-03-08 20:39:05
39705,2016-03-22 14:58:27,Tausch_gegen_gleichwertiges,99999999,limousine,1999,automatik,224,s_klasse,150000,9,benzin,mercedes_benz,,2016-03-22 00:00:00,73525,2016-04-06 05:15:30
39377,2016-03-08 23:53:51,Tausche_volvo_v40_gegen_van,12345678,,2018,manuell,95,v40,150000,6,,volvo,nein,2016-03-08 00:00:00,14542,2016-04-06 23:17:31
37585,2016-03-29 11:38:54,Volkswagen_Jetta_GT,999990,limousine,1985,manuell,111,jetta,150000,12,benzin,volkswagen,ja,2016-03-29 00:00:00,50997,2016-03-29 11:38:54
27371,2016-03-09 15:45:47,Fiat_Punto,12345678,,2017,,95,punto,150000,0,,fiat,,2016-03-09 00:00:00,96110,2016-03-09 15:45:47
24384,2016-03-21 13:57:51,Schlachte_Golf_3_gt_tdi,11111111,,1995,,0,,150000,0,,volkswagen,,2016-03-21 00:00:00,18519,2016-03-21 14:40:18
22947,2016-03-22 12:54:19,Bmw_530d_zum_ausschlachten,1234566,kombi,1999,automatik,190,,150000,2,diesel,bmw,,2016-03-22 00:00:00,17454,2016-04-02 03:17:32
2897,2016-03-12 21:50:57,Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000,11111111,limousine,1973,manuell,48,escort,50000,3,benzin,ford,nein,2016-03-12 00:00:00,94469,2016-03-12 22:45:27


#### through the index of each row we eliminate the vehicles from the main dataset `autos`

In [38]:
for bad in bad_cars.index:
    autos.drop([bad], inplace = True)

### Exploring price outliers on the botton of the list:

In [39]:
autos["price_in_dollars"].describe().round()

count       49989.0
mean         6025.0
std         49134.0
min             0.0
25%          1100.0
50%          2950.0
75%          7200.0
max      10000000.0
Name: price_in_dollars, dtype: float64

#### The minimum value is 0 and we do not reach 25% of the total dataset the price does not reach 1100€.

Let's see if prices vary a lot or if they are very spread out within this range (25%). 

In [40]:
autos[autos['price_in_dollars'].between(0,1100)].describe()

Unnamed: 0,price_in_dollars,registration_year,CV,odometer_km,registration_month,postal_code
count,12538.0,12538.0,12538.0,12538.0,12538.0,12538.0
mean,556.422396,2002.788563,74.677461,136162.466103,4.831552,47992.659595
std,339.147087,146.343951,174.629327,34988.383373,3.947714,25833.421398
min,0.0,1111.0,0.0,5000.0,0.0,1069.0
25%,300.0,1996.0,45.0,150000.0,1.0,27376.25
50%,599.0,1999.0,71.0,150000.0,4.0,46240.0
75%,850.0,2001.0,101.0,150000.0,8.0,66482.0
max,1100.0,9999.0,15016.0,150000.0,12.0,99988.0


if i want to the total sum of the price of these vehicles **in different price ranges**

i.e. how much 75 of the 25 of the total.

In [41]:
seventy_five = autos[autos['price_in_dollars'].between(850,1100)] #75% of the total 25%
seventy_five.loc[:,'price_in_dollars'].sum()

3227534

In [42]:
fifty_percent = autos[autos['price_in_dollars'].between(599,850)] #50% of the total 25%
fifty_percent.loc[:,'price_in_dollars'].sum()

2462601

In [43]:
cuart = autos[autos['price_in_dollars'].between(300,599)] #25% of the total 25%
cuart.loc[:,'price_in_dollars'].sum()

1431649

In this way the importance in the overall data set can be compared.


| price ranges | price in Dollars | percentage over total |
| :---          | :-----  | :-----      | 
| [850 ~  1100] | 3227534 | 1,095478069 % |
| [599 ~ 850]   | 2462601 | 0,835847241 % |
| [300 ~ 599]   | 1431649 | 0,485925193 % |

It seems that it is not too relevant, so in order to take advantage of the maximum amount of information I will take into account a range that goes from the maximum that we have already defined before to a minimum price of **850 €.**

Once we have the properlly data at the top and at the bottom, we make a clean copy of our dataframe and call it:

- `auto_clean`

In [44]:
auto_clean = autos[autos['price_in_dollars'].between(850,10000000)].copy()

In [45]:
auto_clean['price_in_dollars'].describe().round()

count       40786.0
mean         7293.0
std         54315.0
min           850.0
25%          1950.0
50%          3999.0
75%          8500.0
max      10000000.0
Name: price_in_dollars, dtype: float64

## The date columns

In [46]:
auto_clean[['date_crawled','ad_created','last_seen']].describe()

Unnamed: 0,date_crawled,ad_created,last_seen
count,40786,40786,40786
unique,39611,76,32824
top,2016-03-11 22:38:16,2016-04-03 00:00:00,2016-04-07 03:16:17
freq,3,1608,7


In [47]:
auto_clean[['registration_year']].describe()

Unnamed: 0,registration_year
count,40786.0
mean,2017.778184
std,2503.086461
min,1000.0
25%,2000.0
50%,2005.0
75%,2009.0
max,507230.0


Select the dates **by the number of characters** with the **`str` method** and adding the number of characters to be filtered, in this case 10 characters.

In [48]:
print(auto_clean['date_crawled'].str[:10].value_counts(normalize=True,
                                                       dropna=False).sort_index(ascending = True))

2016-03-05    0.025573
2016-03-06    0.014147
2016-03-07    0.035600
2016-03-08    0.032683
2016-03-09    0.032462
2016-03-10    0.033075
2016-03-11    0.032658
2016-03-12    0.037415
2016-03-13    0.016035
2016-03-14    0.036434
2016-03-15    0.033639
2016-03-16    0.029177
2016-03-17    0.030770
2016-03-18    0.012921
2016-03-19    0.035110
2016-03-20    0.038028
2016-03-21    0.037439
2016-03-22    0.032805
2016-03-23    0.032315
2016-03-24    0.028956
2016-03-25    0.030721
2016-03-26    0.032879
2016-03-27    0.031310
2016-03-28    0.035208
2016-03-29    0.033713
2016-03-30    0.033100
2016-03-31    0.031261
2016-04-01    0.034424
2016-04-02    0.036238
2016-04-03    0.039156
2016-04-04    0.036802
2016-04-05    0.013264
2016-04-06    0.003212
2016-04-07    0.001471
Name: date_crawled, dtype: float64


In [49]:
print(auto_clean['ad_created'].str[:10].sort_index(ascending = False).value_counts(normalize=True,
                                                                                   dropna=False))

2016-04-03    0.039425
2016-03-20    0.038150
2016-03-21    0.037684
2016-03-12    0.037268
2016-04-04    0.037194
                ...   
2016-01-16    0.000025
2016-02-11    0.000025
2016-01-14    0.000025
2016-02-16    0.000025
2016-02-17    0.000025
Name: ad_created, Length: 76, dtype: float64


Between March and April 2016 there was the highest concentration of vehicles for sale ads. 

In [50]:
print(auto_clean['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending = True))

2016-03-05    0.001128
2016-03-06    0.003800
2016-03-07    0.004781
2016-03-08    0.006522
2016-03-09    0.009121
2016-03-10    0.010126
2016-03-11    0.011622
2016-03-12    0.022753
2016-03-13    0.008581
2016-03-14    0.012038
2016-03-15    0.015103
2016-03-16    0.015618
2016-03-17    0.026700
2016-03-18    0.007478
2016-03-19    0.014932
2016-03-20    0.019811
2016-03-21    0.020301
2016-03-22    0.020988
2016-03-23    0.018266
2016-03-24    0.018928
2016-03-25    0.018143
2016-03-26    0.015961
2016-03-27    0.014858
2016-03-28    0.019786
2016-03-29    0.021061
2016-03-30    0.023537
2016-03-31    0.022998
2016-04-01    0.022998
2016-04-02    0.025425
2016-04-03    0.024567
2016-04-04    0.023537
2016-04-05    0.129554
2016-04-06    0.231550
2016-04-07    0.137425
Name: last_seen, dtype: float64


In [51]:
auto_clean['registration_year'].value_counts().sort_index(ascending = False).head(25)

507230       1
9999         2
9000         1
8888         1
6200         1
5911         1
5000         2
4500         1
4100         1
2800         1
2019         1
2018       435
2017      1113
2016       733
2015       362
2014       649
2013       795
2012      1306
2011      1617
2010      1585
2009      2076
2008      2205
2007      2261
2006      2660
2005      2816
Name: registration_year, dtype: int64

In [52]:
auto_clean['registration_year']

0        2004
1        1997
2        2009
3        2007
4        2003
         ... 
49995    2011
49996    1996
49997    2014
49998    2013
49999    1996
Name: registration_year, Length: 40786, dtype: int64

The maximum and minimum dates are wrong, we must find out which are the values to work with.

We see that beyond 2019 the values of the dates do not make sense, however we must remember that the data capture was done in 2016 so 2019 would be a date that would be equally wrong and we should limit the highest year to 2016.

In [53]:
auto_clean['registration_year'].value_counts().sort_index(ascending = False).tail(5)

1931    1
1929    1
1927    1
1001    1
1000    1
Name: registration_year, dtype: int64

Once we have corrected the incorrect values we have that our 75% of the vehicles were registered in 2008.

In [54]:
auto_clean['registration_year'].value_counts(normalize=True,
                                             dropna=False, 
                                             bins = 20).sort_index(ascending = False).round(3)

(481918.5, 507230.0]    0.0
(456607.0, 481918.5]    0.0
(431295.5, 456607.0]    0.0
(405984.0, 431295.5]    0.0
(380672.5, 405984.0]    0.0
(355361.0, 380672.5]    0.0
(330049.5, 355361.0]    0.0
(304738.0, 330049.5]    0.0
(279426.5, 304738.0]    0.0
(254115.0, 279426.5]    0.0
(228803.5, 254115.0]    0.0
(203492.0, 228803.5]    0.0
(178180.5, 203492.0]    0.0
(152869.0, 178180.5]    0.0
(127557.5, 152869.0]    0.0
(102246.0, 127557.5]    0.0
(76934.5, 102246.0]     0.0
(51623.0, 76934.5]      0.0
(26311.5, 51623.0]      0.0
(493.769, 26311.5]      1.0
Name: registration_year, dtype: float64

This is the distribution of the dates of registration by groups of 20, in which we can see how the period between 2002 and 2007 is the one with the highest number of registrations. 

## 7. Price by Brand

Brands with the highest number of vehicles on the sales list

In [55]:
top_ten_brands=auto_clean['brand'].value_counts()
top_ten_brands

volkswagen        8658
bmw               4881
mercedes_benz     4390
opel              3846
audi              3838
ford              2488
renault           1651
peugeot           1135
fiat               920
seat               758
skoda              734
smart              667
mazda              586
citroen            582
nissan             576
toyota             575
hyundai            427
mini               415
sonstige_autos     408
volvo              366
honda              311
kia                309
mitsubishi         292
porsche            279
alfa_romeo         263
chevrolet          261
suzuki             239
chrysler           136
dacia              128
jeep               107
land_rover          99
daihatsu            80
subaru              75
jaguar              70
saab                59
daewoo              45
trabant             38
rover               35
lancia              32
lada                27
Name: brand, dtype: int64

In [56]:
selected_brands=auto_clean['brand'].value_counts().index[:10]

### Average price per vehicle brand

In [57]:
brands_price = {}

for brand in selected_brands:
    sel_brand = auto_clean[auto_clean['brand'] == brand ]
    brands_price[brand] = sel_brand['price_in_dollars'].mean().round()

brands_price_sorted = sorted(brands_price.items(),key = lambda kv: kv[1],reverse=True)
brands_price_sorted

[('audi', 9961.0),
 ('mercedes_benz', 9013.0),
 ('bmw', 8887.0),
 ('volkswagen', 6271.0),
 ('seat', 5137.0),
 ('ford', 4905.0),
 ('opel', 3864.0),
 ('peugeot', 3734.0),
 ('fiat', 3664.0),
 ('renault', 3227.0)]

### Average number of kilometers per vehicle brand 

In [58]:
brands_km = {}

for brand in sorted(selected_brands):
    sel_brand = auto_clean[auto_clean['brand'] == brand ]
    brands_km[brand] = sel_brand['odometer_km'].mean().round()

brands_km_sorted = sorted(brands_km.items(),key = lambda kv: kv[1], reverse=True)
brands_km_sorted

[('bmw', 132295.0),
 ('mercedes_benz', 130489.0),
 ('audi', 128415.0),
 ('volkswagen', 126803.0),
 ('opel', 125711.0),
 ('peugeot', 123621.0),
 ('renault', 123540.0),
 ('ford', 121481.0),
 ('seat', 119162.0),
 ('fiat', 111332.0)]

### List of car brands by average price and average mileage

In [59]:
listado = {}

template_string = "Mean price {money:.2f}$ and {km:.2f} mean Kilometers"

for vehiculo in brands_price:
    mean_price = brands_price[vehiculo]
    mean_km = brands_km[vehiculo]
    output = template_string.format(money = mean_price, km = mean_km )
    listado[vehiculo] = output
    
listado

{'volkswagen': 'Mean price 6271.00$ and 126803.00 mean Kilometers',
 'bmw': 'Mean price 8887.00$ and 132295.00 mean Kilometers',
 'mercedes_benz': 'Mean price 9013.00$ and 130489.00 mean Kilometers',
 'opel': 'Mean price 3864.00$ and 125711.00 mean Kilometers',
 'audi': 'Mean price 9961.00$ and 128415.00 mean Kilometers',
 'ford': 'Mean price 4905.00$ and 121481.00 mean Kilometers',
 'renault': 'Mean price 3227.00$ and 123540.00 mean Kilometers',
 'peugeot': 'Mean price 3734.00$ and 123621.00 mean Kilometers',
 'fiat': 'Mean price 3664.00$ and 111332.00 mean Kilometers',
 'seat': 'Mean price 5137.00$ and 119162.00 mean Kilometers'}

In [60]:
frame = {'brands average price':brands_price,
         'brands average kilometer':brands_km}

output = pd.DataFrame(frame)
output

Unnamed: 0,brands average price,brands average kilometer
volkswagen,6271.0,126803.0
bmw,8887.0,132295.0
mercedes_benz,9013.0,130489.0
opel,3864.0,125711.0
audi,9961.0,128415.0
ford,4905.0,121481.0
renault,3227.0,123540.0
peugeot,3734.0,123621.0
fiat,3664.0,111332.0
seat,5137.0,119162.0


## 8. Storing Aggregate Data in a DataFrame

In [61]:
brands_price_series = pd.Series(brands_price)

In [62]:
brands_km_series = pd.Series(brands_km)

In [63]:
frame = {'mean_price':brands_price_series,
        'mean_kilometers':brands_km_series}

output = pd.DataFrame(frame)
output

Unnamed: 0,mean_price,mean_kilometers
audi,9961.0,128415.0
bmw,8887.0,132295.0
fiat,3664.0,111332.0
ford,4905.0,121481.0
mercedes_benz,9013.0,130489.0
opel,3864.0,125711.0
peugeot,3734.0,123621.0
renault,3227.0,123540.0
seat,5137.0,119162.0
volkswagen,6271.0,126803.0


List of car brands by highest price, their average price and average mileage.

### 9.1 Identify categorical data 

Using German words, translate them and map the values to their equivalents, in this case to Spanish, because there are categories that are unknown here. 

In [64]:
 auto_clean['vehicle_type'].value_counts(dropna = False)

limousine     11059
kombi          7892
kleinwagen     7613
bus            3797
NaN            3042
cabrio         2893
coupe          2244
suv            1937
andere          309
Name: vehicle_type, dtype: int64

In [65]:
auto_clean['vehicle_type'].value_counts().dropna()

limousine     11059
kombi          7892
kleinwagen     7613
bus            3797
cabrio         2893
coupe          2244
suv            1937
andere          309
Name: vehicle_type, dtype: int64

In [66]:
categorical_vehicle_type = auto_clean['vehicle_type'].value_counts().dropna().index[:]

In [67]:
category_translator = {'bus':'monovolumen','limousine':'sedan', 'kleinwagen':'compacto','kombi':'familiar',
                       'coupe':'coupe','suv':'suv','cabrio':'cabrio','andere':'otros'}

for category in categorical_vehicle_type:
    bool_category = auto_clean['vehicle_type'] == category
    auto_clean.loc[bool_category,'vehicle_type'] = category_translator[category]

In [68]:
auto_clean['vehicle_type'].value_counts()

sedan          11059
familiar        7892
compacto        7613
monovolumen     3797
cabrio          2893
coupe           2244
suv             1937
otros            309
Name: vehicle_type, dtype: int64

In [69]:
categorical_type_fuel = auto_clean['fuel_type'].value_counts().index[:]
categorical_type_fuel.value_counts()

hybrid     1
lpg        1
cng        1
elektro    1
benzin     1
diesel     1
andere     1
dtype: int64

In [70]:
categorical_type_fuel_translator = {'benzin':'gasolina', 'diesel':'diesel', 'lpg':'lpg', 'cng':'cng',
                                    'hybrid':'híbrido', 'elektro':'electrico', 'andere':'otros'}

for category in categorical_type_fuel:
    bool_fuel = auto_clean['fuel_type'] == category
    auto_clean.loc[bool_fuel,'fuel_type'] = categorical_type_fuel_translator[category]

In [71]:
auto_clean['fuel_type'].value_counts()

gasolina     23572
diesel       13878
lpg            605
cng             64
híbrido         37
electrico       17
otros           13
Name: fuel_type, dtype: int64

In [72]:
condicion_1 = auto_clean['fuel_type'] == 'híbrido' # testing hybrid category
auto_clean[condicion_1].head(3) # checking

Unnamed: 0,date_crawled,name,price_in_dollars,vehicle_type,registration_year,gearbox,CV,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
1127,2016-03-10 12:06:22,Toyota_Auris_1.8_Hybrid_Executive,9850,sedan,2010,automatik,136,auris,100000,9,híbrido,toyota,nein,2016-03-10 00:00:00,56637,2016-04-05 20:46:24
1710,2016-03-31 19:39:57,Toyota_Yaris_Hybrid_1.5_VVT_i_Club_Panorama__A...,14900,compacto,2014,automatik,75,yaris,20000,5,híbrido,toyota,nein,2016-03-31 00:00:00,85757,2016-04-06 13:45:27
2062,2016-03-18 17:48:58,Toyota_Auris_1.8_VVT_i_Hybrid_Automatik_Tourin...,19500,familiar,2014,automatik,99,auris,50000,4,híbrido,toyota,nein,2016-03-18 00:00:00,53894,2016-04-03 21:45:15


In [73]:
auto_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40786 entries, 0 to 49999
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date_crawled        40786 non-null  object
 1   name                40786 non-null  object
 2   price_in_dollars    40786 non-null  int64 
 3   vehicle_type        37744 non-null  object
 4   registration_year   40786 non-null  int64 
 5   gearbox             39234 non-null  object
 6   CV                  40786 non-null  int64 
 7   model               39044 non-null  object
 8   odometer_km         40786 non-null  int64 
 9   registration_month  40786 non-null  int64 
 10  fuel_type           38186 non-null  object
 11  brand               40786 non-null  object
 12  unrepaired_damage   34483 non-null  object
 13  ad_created          40786 non-null  object
 14  postal_code         40786 non-null  int64 
 15  last_seen           40786 non-null  object
dtypes: int64(6), object(10

In [74]:
replace_columns = ['date_crawled', 'ad_created', 'last_seen']

for column in replace_columns:
    auto_clean[column] = auto_clean[column].str[:10]
    auto_clean[column] = auto_clean[column].str.replace('-','')

In [75]:
auto_clean[['date_crawled','ad_created','last_seen']]

Unnamed: 0,date_crawled,ad_created,last_seen
0,20160326,20160326,20160406
1,20160404,20160404,20160406
2,20160326,20160326,20160406
3,20160312,20160312,20160315
4,20160401,20160401,20160401
...,...,...,...
49995,20160327,20160327,20160401
49996,20160328,20160328,20160402
49997,20160402,20160402,20160404
49998,20160308,20160308,20160405


In [76]:
all_brands= auto_clean['brand'].unique()
sorted(all_brands)

['alfa_romeo',
 'audi',
 'bmw',
 'chevrolet',
 'chrysler',
 'citroen',
 'dacia',
 'daewoo',
 'daihatsu',
 'fiat',
 'ford',
 'honda',
 'hyundai',
 'jaguar',
 'jeep',
 'kia',
 'lada',
 'lancia',
 'land_rover',
 'mazda',
 'mercedes_benz',
 'mini',
 'mitsubishi',
 'nissan',
 'opel',
 'peugeot',
 'porsche',
 'renault',
 'rover',
 'saab',
 'seat',
 'skoda',
 'smart',
 'sonstige_autos',
 'subaru',
 'suzuki',
 'toyota',
 'trabant',
 'volkswagen',
 'volvo']

### 9.3 Finding the most common brand/model combi.

The top three brands on our list also have their own favourite models, as can be seen below.

In [77]:
def combi_marca_modelo(marca):
    bool_fiat = auto_clean['brand'] == marca 
    print(auto_clean.loc[bool_fiat,'model'].value_counts()[:5]) # top 5

In [78]:
combi_marca_modelo('audi')

a4        1145
a3         827
a6         792
andere     213
tt         146
Name: model, dtype: int64


In [79]:
combi_marca_modelo('mercedes_benz')

c_klasse    1088
e_klasse     929
a_klasse     511
andere       426
clk          240
Name: model, dtype: int64


In [80]:
combi_marca_modelo('bmw')

3er        2408
5er        1100
1er         525
x_reihe     302
7er         127
Name: model, dtype: int64


In [81]:
def combi_marca_modelo_top(marca):
    bool_fiat = auto_clean['brand'] == marca
    return auto_clean.loc[bool_fiat,'model'][0:1].max()
    #return auto_clean[bool_fiat]

Model within each brand that is repeated more times within our data set, the order is given by the brand of the vehicle.

In [82]:
marc_modelo = {}

for brand in sorted(all_brands):
    marc_modelo[brand] = combi_marca_modelo_top(brand)

marc_modelo = pd.Series(marc_modelo)
marc_modelo

alfa_romeo               156
audi                      a3
bmw                      7er
chevrolet             andere
chrysler             voyager
citroen             berlingo
dacia                sandero
daewoo                nubira
daihatsu              terios
fiat                  andere
ford                   focus
honda                  civic
hyundai              i_reihe
jaguar                andere
jeep                wrangler
kia                 carnival
lada                  kalina
lancia                 lybra
land_rover        freelander
mazda                 andere
mercedes_benz       e_klasse
mini                  cooper
mitsubishi              colt
nissan               primera
opel                  andere
peugeot               andere
porsche                  911
renault                 clio
rover                 andere
saab                  andere
seat                   altea
skoda                octavia
smart                 fortwo
sonstige_autos           NaN
subaru        

### 9.4 Dividing the odometer in groups 

Use aggregation to see if the prices follow any pattern in mileage.

In [83]:
km_group = auto_clean['odometer_km'].value_counts()
km_group

150000    25074
125000     4535
100000     1923
90000      1609
80000      1357
70000      1169
60000      1108
50000       990
40000       799
30000       750
20000       706
5000        535
10000       231
Name: odometer_km, dtype: int64

In [84]:
km_group = auto_clean['odometer_km'].value_counts().index[:]
km_group

Int64Index([150000, 125000, 100000,  90000,  80000,  70000,  60000,  50000,
             40000,  30000,  20000,   5000,  10000],
           dtype='int64')

In [85]:
price_group = auto_clean['price_in_dollars'].value_counts().index[:]
price_group

Int64Index([ 1500,  2500,  1000,  1200,  3500,  2000,   999,   900,   850,
             4500,
            ...
             2870,  3501,  4955,  5454,  9480, 29890, 33980, 11090,  5198,
             1420],
           dtype='int64', length=2131)

In [86]:
avg_price_by_km_non_damage = {}

for km in km_group:
    selected_km = auto_clean[auto_clean['odometer_km'] == km ]
    mean = selected_km['price_in_dollars'].mean().round()
    combined = mean
    avg_price_by_km_non_damage[km] = combined

In [87]:
avg_price_by_km_non_damage

{150000: 4546.0,
 125000: 6789.0,
 100000: 13970.0,
 90000: 8981.0,
 80000: 10005.0,
 70000: 11296.0,
 60000: 12744.0,
 50000: 15321.0,
 40000: 15782.0,
 30000: 17175.0,
 20000: 19455.0,
 5000: 20426.0,
 10000: 22149.0}

In [88]:
avg_price_by_km = {}

for km in km_group:
    selected_km = auto_clean[auto_clean['odometer_km'] == km ]
    mean = selected_km['price_in_dollars'].mean().round()
    vehiculos = selected_km['price_in_dollars'].value_counts(normalize = False, sort = True)
    combined = (mean, vehiculos)
    avg_price_by_km[km] = combined

In [89]:
avg_price_by_km[150000]

(4546.0,
 1500     605
 1000     540
 1200     536
 2500     473
 2000     366
         ... 
 16300      1
 20450      1
 23950      1
 5495       1
 32800      1
 Name: price_in_dollars, Length: 1302, dtype: int64)

In [90]:
avg_price_by_km[150000]

(4546.0,
 1500     605
 1000     540
 1200     536
 2500     473
 2000     366
         ... 
 16300      1
 20450      1
 23950      1
 5495       1
 32800      1
 Name: price_in_dollars, Length: 1302, dtype: int64)

In [91]:
auto_clean[auto_clean['price_in_dollars'] == 69993]

Unnamed: 0,date_crawled,name,price_in_dollars,vehicle_type,registration_year,gearbox,CV,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
41891,20160308,Porsche_993_Targa,69993,coupe,1996,automatik,286,911,150000,5,gasolina,porsche,nein,20160308,1445,20160407


### 9.4 How much cheaper are damaged cars than their undamaged counterparts?

we retrieved the dictionary we had made called the 'avg_price_by_km_non_damage' and we create another dictionary that compiles those vehicles that are indeed damaged.

In [92]:
avg_price_by_km_damage = {}

for km in km_group:
    selected_km = auto_clean[(auto_clean['odometer_km'] == km) &
                             (auto_clean['unrepaired_damage'] == 'ja') ]
    mean = selected_km['price_in_dollars'].mean()
    avg_price_by_km_damage[km] = mean.round()

In [93]:
avg_price_by_km_damage

{150000: 2929.0,
 125000: 3805.0,
 100000: 4718.0,
 90000: 4408.0,
 80000: 6057.0,
 70000: 5648.0,
 60000: 7861.0,
 50000: 9357.0,
 40000: 13433.0,
 30000: 6899.0,
 20000: 6724.0,
 5000: 3186.0,
 10000: 4566.0}

In [94]:
series_nondamage = pd.Series(avg_price_by_km_non_damage) 
df_nondamage = pd.DataFrame(series_nondamage)
df_nondamage = df_nondamage.rename(columns = {0:'price_not_damage'})

series_damage = pd.Series(avg_price_by_km_damage)
df_damage = pd.DataFrame(series_damage)
df_damage = df_damage.rename(columns = {0:'price_damage'})

series_diference = series_nondamage - series_damage
df_diference = pd.DataFrame(series_diference)
df_diference = df_diference.rename(columns = {0:'price_difference'})

series_diference_percent = (series_damage * 100) / series_nondamage.round(2)
df_diference_percent = pd.DataFrame(series_diference_percent)
df_diference_percent = df_diference_percent.rename(columns = {0:'%difference_damage_notdamage '})


df = pd.concat([df_nondamage, df_damage, df_diference, df_diference_percent], axis = 1)
df

Unnamed: 0,price_not_damage,price_damage,price_difference,%difference_damage_notdamage
150000,4546.0,2929.0,1617.0,64.430268
125000,6789.0,3805.0,2984.0,56.046546
100000,13970.0,4718.0,9252.0,33.772369
90000,8981.0,4408.0,4573.0,49.081394
80000,10005.0,6057.0,3948.0,60.53973
70000,11296.0,5648.0,5648.0,50.0
60000,12744.0,7861.0,4883.0,61.68393
50000,15321.0,9357.0,5964.0,61.073037
40000,15782.0,13433.0,2349.0,85.115955
30000,17175.0,6899.0,10276.0,40.16885


## Conclusions

The conclusions we can draw from our study are several:

 1. The 75% of the vehicles for sale have an average mileage of about 150,000 km and if we count the vehicles with a minimum selling price of 850 we have that 75% of the total have an average price of 8750.

 2. Once we have corrected the values we have that our 75% of the vehicles were registered in 2008 and that most of the registrations were between the period 2002 / 2007.

 3. The most offered vehicle brand is Volkswagen.

 4. The highest average price belongs to Audi, then Mercedes and finally to BMW. 

 5. The average number of kilometers per brand is led by BMW followed by Mercedes and Audi.

 6. most of the vehicles are sedan type and are gasoline cars followed by diesel and lpg cars.


Taking into account the order of average price per vehicle brand 
we know that the most repeated per model is:

   - Audi with A4 followed by the A3 and A6.

   - Mercedes the  C Class followed E Class and the A Class.

   - BMW with the 3 Series, the 5 Series in second place and the 1 Series in third place .
   

Finally we can see the differences between the price kilometer of the damaged and non-damaged cars and we observe that the biggest difference in price is given in vehicles with 40000 Kilometers 150000Km and 60000Km.