
# Exploring eBay Car Sales Data

In this project I'll be cleaning, preparing, and exploring used car sales data. The data is sourced from a German eBay classifieds site (*eBay Kleinanzeigen*).

In [1]:
import numpy as np
import pandas as pd

autos = pd.read_csv('autos.csv', encoding='Latin-1')

Looks like we have 50,000 rows and 20 columns to work with. I definitely want to make sure the `dateCrawled` column is in a consistent format. `name` will need to be broken down into make, model, and manufacturer (which looks synonymous with `brand`).  

In [2]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


Columns `price` and `odometer` will need to be converted to `int64` or `float64` as needed. It will also be nice to change all our column headers to a more consistently pythonic version. There is also quite a bit of missing info in certain columns (particularly `notRepairedDamage`).

In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

**Did Pandas add an index column or do I need to do that?**
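
As far as I can tell, pandas did: when `read_csv` is called without `index_col`, it attaches a default `RangeIndex`, which lines up with the `RangeIndex: 50000 entries, 0 to 49999` line in the `info()` output. Here is a minimal sketch on a tiny made-up CSV (the two rows are hypothetical, not taken from `autos.csv`):

```python
import io

import pandas as pd

# Hypothetical two-row CSV with no index column of its own.
csv_text = "name,price\nFord_Fiesta,$500\nVW_Golf,$1500\n"

# No index_col argument, so pandas generates a default RangeIndex.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.index)
print(list(sample.columns))
```

The index lives outside the data columns, so it does not count toward the 20 columns reported by `info()`.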

In [4]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Columns are `camelCase` and not our lovely, Pythonic `snake_case`. We'll also convert oddly named columns like `nrOfPictures` to `picture_count` and the ambiguous `dateCreated` to `ad_created`.

In [5]:
mapping_col_dict = {'dateCrawled': 'date_crawled',
                    'offerType': 'offer_type',
                    'vehicleType': 'vehicle_type',
                    'yearOfRegistration': 'registration_year',
                    'powerPS': 'power_ps',
                    'monthOfRegistration': 'registration_month',
                    'fuelType': 'fuel_type',
                    'notRepairedDamage': 'unrepaired_damage',
                    'dateCreated': 'ad_created',
                    'nrOfPictures': 'picture_count',
                    'postalCode': 'postal_code',
                    'lastSeen': 'last_seen'
                   }

autos = autos.rename(mapping_col_dict, axis="columns")

autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,picture_count,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Much better! :)
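
For comparison, the purely mechanical part of the rename could be automated. The sketch below is an alternative I did not use above: `camel_to_snake` is a hypothetical helper, and it only inserts underscores and lowercases, so it would not produce the hand-picked names like `picture_count` or `ad_created`.

```python
import re

import pandas as pd


def camel_to_snake(name: str) -> str:
    """Hypothetical helper: insert '_' where a lowercase letter or digit
    is followed by an uppercase letter, then lowercase everything."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


# Empty frame carrying a few of the original column names.
sample = pd.DataFrame(columns=["dateCrawled", "powerPS", "nrOfPictures", "abtest"])

# DataFrame.rename accepts a function that is applied to each column label.
renamed = sample.rename(columns=camel_to_snake)
print(list(renamed.columns))
```

The explicit dictionary is still the better fit here, since several columns needed a genuinely different name rather than just a different case.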

In [6]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,picture_count,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-04-02 11:37:04,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


The one unique item under seller that is not `privat` is "gewerblich" which translates to "commercial". Since there was only one commercial listing we can drop this column.

In [7]:
autos[~(autos['seller'] == 'privat')]

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,picture_count,postal_code,last_seen
7738,2016-03-15 18:06:22,Verkaufe_mehrere_Fahrzeuge_zum_Verschrotten,gewerblich,Angebot,$100,control,kombi,2000,manuell,0,megane,"150,000km",8,benzin,renault,,2016-03-15 00:00:00,0,65232,2016-04-06 17:15:37


Similar to the `seller` column, `offer_type` has only one entry that isn't "Angebot" (offer): "Gesuch", which translates to "request". Odd, and we will drop this column as well.

In [8]:
autos[~(autos['offer_type'] == 'Angebot')]

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,picture_count,postal_code,last_seen
17541,2016-04-03 15:48:33,Suche_VW_T5_Multivan,privat,Gesuch,$0,test,bus,2005,,0,transporter,"150,000km",0,,volkswagen,,2016-04-03 00:00:00,0,29690,2016-04-05 15:16:06


The options for `unrepaired_damage` seem to just be yes or no ("ja" or "nein"). I have a feeling that "nein" has become NaN. We'll have to change those to "nein".

In [9]:
autos[~(autos['unrepaired_damage'] == 'nein')].describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,picture_count,postal_code,last_seen
count,14768,14768,14768,14768,14768,14768,11456,14768.0,12823,14768.0,13204,14768,14768.0,11967,14768,4939,14768,14768.0,14768.0,14768
unique,14603,13357,2,2,940,2,8,,2,,234,13,,7,40,1,51,,,13884
top,2016-03-11 22:38:16,Renault_Twingo,privat,Angebot,$0,test,kleinwagen,,manuell,,golf,"150,000km",,benzin,volkswagen,ja,2016-03-21 00:00:00,,,2016-04-05 20:47:37
freq,3,36,14767,14767,971,7616,3288,,10505,,1205,11072,,8513,3280,4939,583,,,4
mean,,,,,,,,2006.698199,,92.009886,,,4.766522,,,,,0.0,49302.549093,
std,,,,,,,,180.538972,,223.598583,,,4.022647,,,,,0.0,25407.158447,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1998.0,,43.0,,,0.0,,,,,0.0,28755.0,
50%,,,,,,,,2001.0,,85.0,,,4.0,,,,,0.0,47805.0,
75%,,,,,,,,2006.0,,125.0,,,8.0,,,,,0.0,67261.5,
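
The numbers above add up: 35,232 "nein" plus 4,939 "ja" gives the 40,171 non-null entries, so the remaining 9,829 rows are `NaN`. A quick way to see all three entry types at once is `value_counts(dropna=False)`. This is a sketch on a small hypothetical sample rather than the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for autos['unrepaired_damage'].
sample = pd.Series(["nein", "ja", np.nan, "nein", np.nan], name="unrepaired_damage")

# dropna=False keeps NaN as its own row in the tally.
counts = sample.value_counts(dropna=False)
print(counts)

# isna() gives the missing count directly.
missing = int(sample.isna().sum())
print(missing)
```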


## Todo

`seller` and `offer_type` only have two unique entries each, and one of them has 49,999 occurrences. We can probably drop these columns altogether.  
`unrepaired_damage` has three different entry types. That will require more investigation.  
`registration_year`, `power_ps`, `postal_code`, and `picture_count` need to be converted to `int`.  
`price` and `odometer` are currently string objects. They need to be stripped of their non-numeric characters (`$`, `,`, `km`) and converted to `int`.  
All the date-specific columns share a date format. Gonna have to make sure those become the correct `dtype`.  
`registration_year` has values of 1000 and 9999. Those are certainly not years that are possible.

### Drop Useless Columns

I'm going to start by dropping the useless columns that hold the same value in (nearly) every row.  
`seller` has `privat` for all but one entry so that can go.  
`offer_type` has 49,999 `Angebot`. That is not useful. Drop it.  
`picture_count` is 0 in all 50,000 rows (mean, std, min, and every quartile are 0). Seeya.

In [10]:
autos.drop(['seller', 'offer_type', 'picture_count'], axis=1, inplace=True)

autos.describe(include='all')

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
count,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000
unique,48213,38754,2357,2,8,,2,,245,13,,7,40,2,76,,39481
top,2016-04-02 11:37:04,Ford_Fiesta,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,2016-04-07 06:17:27
freq,3,78,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,8
mean,,,,,,2005.07328,,116.35592,,,5.72336,,,,,50813.6273,
std,,,,,,105.712813,,209.216627,,,3.711984,,,,,25779.747957,
min,,,,,,1000.0,,0.0,,,0.0,,,,,1067.0,
25%,,,,,,1999.0,,70.0,,,3.0,,,,,30451.0,
50%,,,,,,2003.0,,105.0,,,6.0,,,,,49577.0,
75%,,,,,,2008.0,,150.0,,,9.0,,,,,71540.0,



### `price` and `odometer`

Let's convert these two columns to `int`.

**Clean `price` column, convert to `int64`.**  
The `dtype: int64` reported by `value_counts()` describes the counts `Series` it returns (its own object), not the `price` column itself.

In [12]:
print(autos["price"].value_counts())
print(autos["price"].describe())

0        1421
500       781
1500      734
2500      643
1000      639
         ... 
414         1
79933       1
5198        1
18890       1
16995       1
Name: price, Length: 2357, dtype: int64
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64


In [11]:
autos["price"] = autos["price"].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(int)
autos["price"].value_counts()

0        1421
500       781
1500      734
2500      643
1000      639
         ... 
414         1
79933       1
5198        1
18890       1
16995       1
Name: price, Length: 2357, dtype: int64

**Clean `odometer` column, convert to `int64`.**  
Rename column as `odometer_km`.

`odometer` seems to have preset ranges. 

In [13]:
print(autos["odometer"].value_counts())
print(autos["odometer"].describe())

150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64
count         50000
unique           13
top       150,000km
freq          32424
Name: odometer, dtype: object
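
The conversion itself could look something like the sketch below. It runs on a small hypothetical sample in the same format as the `odometer` column rather than on `autos`, and it mirrors the `str.replace` chain used for `price`:

```python
import pandas as pd

# Hypothetical sample in the same format as the `odometer` column.
sample = pd.DataFrame({"odometer": ["150,000km", "70,000km", "5,000km"]})

# Strip the unit and the thousands separator, then cast to int.
sample["odometer"] = (
    sample["odometer"]
    .str.replace("km", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)

# Put the unit in the column name now that it is gone from the values.
sample = sample.rename(columns={"odometer": "odometer_km"})
print(sample["odometer_km"].tolist())
```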


Sorting prices into different bins shows us some zero dollar starting prices (1,421 of them). We can also see that some vehicles are listed over the 350k price point, all the way up to 100 million. That's just not going to happen on an auction site, so we'll cap it at 351k.

In [15]:
autos["odometer_km"].value_counts()

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

In [17]:
autos = autos[autos["price"].between(1,351000)]
round(autos["price"].describe())

count     48565.0
mean       5889.0
std        9060.0
min           1.0
25%        1200.0
50%        3000.0
75%        7490.0
max      350000.0
Name: price, dtype: float64

Analyzing the `odometer_km` column.

A starting bid of 1 dollar is not unheard of on auction sites, but I find it hard to believe that anyone is listing a car for more than 350,000 dollars on an eBay site. We'll remove everything over 351,000.


### `date_crawled`, `ad_created`, and `last_seen`


In [19]:
autos[['date_crawled', 'ad_created', 'last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


In [20]:
autos['date_crawled'].str[:10]

0        2016-03-26
1        2016-04-04
2        2016-03-26
3        2016-03-12
4        2016-04-01
            ...    
49995    2016-03-27
49996    2016-03-28
49997    2016-04-02
49998    2016-03-08
49999    2016-03-14
Name: date_crawled, Length: 48565, dtype: object

In [21]:
date_crawled = autos['date_crawled'].str[:10]
date_crawled.value_counts(normalize=True, dropna=False).sort_index() * 100

2016-03-05    2.532688
2016-03-06    1.404304
2016-03-07    3.601359
2016-03-08    3.329558
2016-03-09    3.308967
2016-03-10    3.218367
2016-03-11    3.257490
2016-03-12    3.691959
2016-03-13    1.566972
2016-03-14    3.654896
2016-03-15    3.428395
2016-03-16    2.960980
2016-03-17    3.162772
2016-03-18    1.291053
2016-03-19    3.477813
2016-03-20    3.788737
2016-03-21    3.737259
2016-03-22    3.298672
2016-03-23    3.222485
2016-03-24    2.934212
2016-03-25    3.160712
2016-03-26    3.220426
2016-03-27    3.109235
2016-03-28    3.486050
2016-03-29    3.409863
2016-03-30    3.368681
2016-03-31    3.183363
2016-04-01    3.368681
2016-04-02    3.547823
2016-04-03    3.860805
2016-04-04    3.648718
2016-04-05    1.309585
2016-04-06    0.317101
2016-04-07    0.140019
Name: date_crawled, dtype: float64

It appears that this was all crawled over a period of roughly five weeks, from 2016-03-05 to 2016-04-07.

In [22]:
ad_created = autos['ad_created'].str[:10]
ad_created.value_counts(normalize=True, dropna=False).sort_index() * 100

2015-06-11    0.002059
2015-08-10    0.002059
2015-09-09    0.002059
2015-11-10    0.002059
2015-12-05    0.002059
                ...   
2016-04-03    3.885514
2016-04-04    3.685782
2016-04-05    1.181921
2016-04-06    0.325337
2016-04-07    0.125605
Name: ad_created, Length: 76, dtype: float64

Ads were created as far back as June 2015, with the most recent ones created in April 2016.

In [23]:
last_seen = autos['last_seen'].str[:10]
round(last_seen.value_counts(normalize=True, dropna=False).sort_index() * 100, 2)

2016-03-05     0.11
2016-03-06     0.43
2016-03-07     0.54
2016-03-08     0.74
2016-03-09     0.96
2016-03-10     1.07
2016-03-11     1.24
2016-03-12     2.38
2016-03-13     0.89
2016-03-14     1.26
2016-03-15     1.59
2016-03-16     1.65
2016-03-17     2.81
2016-03-18     0.74
2016-03-19     1.58
2016-03-20     2.07
2016-03-21     2.06
2016-03-22     2.14
2016-03-23     1.85
2016-03-24     1.98
2016-03-25     1.92
2016-03-26     1.68
2016-03-27     1.56
2016-03-28     2.09
2016-03-29     2.23
2016-03-30     2.48
2016-03-31     2.38
2016-04-01     2.28
2016-04-02     2.49
2016-04-03     2.52
2016-04-04     2.45
2016-04-05    12.48
2016-04-06    22.18
2016-04-07    13.19
Name: last_seen, dtype: float64

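Something happened around 2016-04-06 to cause a spike in `last_seen`: the final three days (2016-04-05 to 2016-04-07) account for about 48% of all listings. Those are also the last days in `date_crawled`, so the spike probably has more to do with the crawl ending than with the listings themselves.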
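
Slicing the first ten characters works because the strings share one format, but the columns are still `object` dtype. A minimal sketch of parsing them with `pd.to_datetime` instead, using three timestamps copied from the `date_crawled` sample above (the variable names are only for illustration):

```python
import pandas as pd

# Three timestamps in the same format as date_crawled.
sample = pd.Series(
    ["2016-03-26 17:47:46", "2016-04-04 13:38:56", "2016-03-26 18:57:24"],
    name="date_crawled",
)

# Parse with an explicit format so malformed strings raise instead of being guessed.
parsed = pd.to_datetime(sample, format="%Y-%m-%d %H:%M:%S")

# normalize() drops the time of day, giving the same per-day view as .str[:10].
daily_share = parsed.dt.normalize().value_counts(normalize=True).sort_index() * 100
print(daily_share)

# With real datetimes the length of the window is simple arithmetic.
span_days = (parsed.max().normalize() - parsed.min().normalize()).days
print(span_days)
```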


### `registration_year`


In [24]:
autos["registration_year"].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The `registration_year` column has values ranging `1000` to `9999`. A car can't be registered after the date of the listing (2016). So these are incorrect.  
As for the other direction, it's fairly improbable for a car to have been registered before the first few decades of the 1900s. So we'll take a look at the rows outside `1900-2016` and see if we can just drop these rows.  

Only 5 rows before 1900. There are `NaN` values for `vehicle_type` and `gearbox`, and odometer readings that make no sense. Also these rows are claiming they were registered before the companies were even founded.

In [25]:
reg_year = autos["registration_year"]

pre_1900 = autos[reg_year < 1900]
# pre_1900.describe(include='all')

pre_1900

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
10556,2016-04-01 06:02:10,UNFAL_Auto,450,control,,1800,,1800,,5000,2,,mitsubishi,nein,2016-04-01 00:00:00,63322,2016-04-01 09:42:30
22316,2016-03-29 16:56:41,VW_Kaefer.__Zwei_zum_Preis_von_einem.,1500,control,,1000,manuell,0,kaefer,5000,0,benzin,volkswagen,,2016-03-29 00:00:00,48324,2016-03-31 10:15:28
24511,2016-03-17 19:45:11,Trabant__wartburg__Ostalgie,490,control,,1111,,0,,5000,0,,trabant,,2016-03-17 00:00:00,16818,2016-04-07 07:17:29
32585,2016-04-02 16:56:39,UNFAL_Auto,450,control,,1800,,1800,,5000,2,,mitsubishi,nein,2016-04-02 00:00:00,63322,2016-04-04 14:46:21
49283,2016-03-15 18:38:53,Citroen_HY,7750,control,,1001,,0,andere,5000,0,,citroen,,2016-03-15 00:00:00,66706,2016-04-06 18:47:20


There are 1,879 rows with a registration year after 2016. If a row has one error it may well have others, so we're gonna drop them.

In [26]:
post_2016 = autos[reg_year > 2016]
post_2016.describe(include='all')

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
count,1879,1879,1879.0,1879,2,1879.0,1650,1879.0,1617,1879.0,1879.0,1171,1879,1088,1879,1879.0,1879
unique,1876,1818,,2,2,,2,,179,,,5,40,2,39,,1866
top,2016-03-29 20:38:23,Ford_Focus,,test,coupe,,manuell,,golf,,,benzin,volkswagen,nein,2016-03-28 00:00:00,,2016-04-07 08:45:02
freq,2,4,,957,1,,1386,,193,,,827,473,939,78,,2
mean,,,3693.304417,,,2052.345929,,98.30761,,130643.959553,4.680681,,,,,47950.638105,
std,,,4873.190245,,,444.99292,,433.249342,,37393.589236,3.88468,,,,,25373.285156,
min,,,1.0,,,2017.0,,0.0,,5000.0,0.0,,,,,1067.0,
25%,,,1000.0,,,2017.0,,0.0,,125000.0,1.0,,,,,27671.5,
50%,,,2000.0,,,2017.0,,75.0,,150000.0,4.0,,,,,45966.0,
75%,,,4500.0,,,2018.0,,120.0,,150000.0,8.0,,,,,66450.0,


Let's remove the rows outside 1900-2016.

In [27]:
autos = autos[autos["registration_year"].between(1900, 2016)]

round(autos.describe())

Unnamed: 0,price,registration_year,power_ps,odometer_km,registration_month,postal_code
count,46681.0,46681.0,46681.0,46681.0,46681.0,46681.0
mean,5978.0,2003.0,118.0,125587.0,6.0,51097.0
std,9178.0,7.0,185.0,39853.0,4.0,25755.0
min,1.0,1910.0,0.0,5000.0,0.0,1067.0
25%,1250.0,1999.0,75.0,100000.0,3.0,30827.0
50%,3100.0,2003.0,109.0,150000.0,6.0,49828.0
75%,7500.0,2008.0,150.0,150000.0,9.0,71732.0
max,350000.0,2016.0,17700.0,150000.0,12.0,99998.0


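We removed 1,884 rows (1,879 after 2016 and 5 before 1900) from the 48,565 left after the price filter, leaving us 46,681. Our ranges seem far more plausible now, with the middle half of the registration years falling between 1999 and 2008.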
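
To keep the row accounting honest, the same `between` mask can be used to count what gets dropped before dropping it. A small sketch on hypothetical years (not the real column):

```python
import pandas as pd

# Hypothetical registration years, including the kinds of outliers seen above.
years = pd.Series([1000, 1910, 1999, 2016, 2017, 9999], name="registration_year")

# between() is inclusive on both ends by default, so 1900 and 2016 are kept.
in_range = years.between(1900, 2016)

kept = years[in_range]
removed = int((~in_range).sum())

print(kept.tolist())
print(removed)
```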

In [28]:
round(autos["registration_year"].value_counts(normalize=True, sort=True) * 100, 4)

2000    6.7608
2005    6.2895
1999    6.2060
2004    5.7904
2003    5.7818
         ...  
1939    0.0021
1948    0.0021
1938    0.0021
1953    0.0021
1943    0.0021
Name: registration_year, Length: 78, dtype: float64


### `brand`


In [29]:
autos['brand'].describe()

count          46681
unique            40
top       volkswagen
freq            9862
Name: brand, dtype: object

In [30]:
brands_count = autos["brand"].value_counts(normalize=True)
round(brands_count * 100, 2)

volkswagen        21.13
bmw               11.00
opel              10.76
mercedes_benz      9.65
audi               8.66
ford               6.99
renault            4.71
peugeot            2.98
fiat               2.56
seat               1.83
skoda              1.64
nissan             1.53
mazda              1.52
smart              1.42
citroen            1.40
toyota             1.27
hyundai            1.00
sonstige_autos     0.98
volvo              0.91
mini               0.88
mitsubishi         0.82
honda              0.78
kia                0.71
alfa_romeo         0.66
porsche            0.61
suzuki             0.59
chevrolet          0.57
chrysler           0.35
dacia              0.26
daihatsu           0.25
jeep               0.23
subaru             0.21
land_rover         0.21
saab               0.16
jaguar             0.16
daewoo             0.15
trabant            0.14
rover              0.13
lancia             0.11
lada               0.06
Name: brand, dtype: float64

It looks like the top brand is volkswagen by a lot. The following big dogs are bmw, opel, mercedes_benz and audi. Let's see the mean prices for the brands above 5% frequency.

In [31]:
common_brands = brands_count[brands_count > 0.05].index

top_brand_avg = {}

for b in common_brands:
    b_rows = autos[autos['brand'] == b]
    b_prices = b_rows['price']
    mean_price = b_prices.mean()
    top_brand_avg[b] = int(mean_price)

pd.DataFrame.from_dict(top_brand_avg, orient='index').sort_values(by=0, ascending=False)

Unnamed: 0,0
audi,9336
mercedes_benz,8628
bmw,8332
volkswagen,5402
ford,3749
opel,2975
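
The loop above works fine. As a side note, the same aggregation can be written with `groupby`, which is a different technique from the one I used. The sketch below runs on a small hypothetical frame, and the 20% threshold is chosen only to suit that toy data:

```python
import pandas as pd

# Hypothetical listings; not taken from autos.csv.
sample = pd.DataFrame({
    "brand": ["audi", "audi", "ford", "ford", "ford", "lada"],
    "price": [9000, 10000, 3000, 4000, 5000, 1500],
})

# Share of listings per brand, then keep only the common ones.
share = sample["brand"].value_counts(normalize=True)
common = share[share > 0.2].index

# Mean price per common brand, highest first.
mean_by_brand = (
    sample[sample["brand"].isin(common)]
    .groupby("brand")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_by_brand)
```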


After looking at the mean car prices for the most commonly listed brands, it looks like the higher end vehicles are Audi, Mercedes Benz, and BMW.  
The most affordable cars would be Ford and Opel.  
Volkswagen is like the best of both worlds, right in the middle. Also, since they account for about 21% of all listings, it probably wouldn't be too difficult to find one for a very reasonable price.