# Exploring eBay Car Sales Data

In this project we will review used car listings from the German eBay classifieds.

### Loading the data

In [1]:
import pandas as pd
import numpy as np

The dataset is available at https://data.world/data-society/used-cars-data.
Our version contains 50,000 rows of cars listed on the German eBay marketplace.
It has been deliberately made unclean to demonstrate the process of data cleaning.

In [2]:
autos = pd.read_csv('autos.csv', encoding = "Latin-1") 
#Encoding here to interpret the data correctly#
#Another popular encoding is Windows-1252#
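When the encoding of a file is not known in advance, one option is to try a short list of candidates in order. The sketch below is only an illustration: the helper `read_csv_with_fallback` and the one-row demo file are hypothetical, not part of the original notebook. Note that Latin-1 can decode any byte sequence, so it should come last in the list.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in turn and return (dataframe, encoding_used)."""
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the file, try the next one
    raise ValueError(f"none of {encodings} could decode {path}")


# Hypothetical one-row file written as Latin-1 so that the UTF-8 attempt fails.
# "\u00dc" is the German capital U with umlaut.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as handle:
    handle.write("name\nFord_Focus_T\u00dcV_neu\n".encode("latin-1"))
    demo_path = handle.name

demo_frame, encoding_used = read_csv_with_fallback(demo_path)
os.remove(demo_path)
print(encoding_used, demo_frame["name"].iloc[0])
```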

As an overview, let's see which columns have null values.

In [3]:
autos.isnull().sum()

dateCrawled               0
name                      0
seller                    0
offerType                 0
price                     0
abtest                    0
vehicleType            5095
yearOfRegistration        0
gearbox                2680
powerPS                   0
model                  2758
odometer                  0
monthOfRegistration       0
fuelType               4482
brand                     0
notRepairedDamage      9829
dateCreated               0
nrOfPictures              0
postalCode                0
lastSeen                  0
dtype: int64

Several columns have a lot of null values. Additionally, the values in the dataset are in German and will need translation.
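Raw null counts are easier to judge as a share of all rows. A minimal sketch, using a small made-up frame rather than the real listings:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the listings table, for illustration only.
demo = pd.DataFrame({
    "brand": ["ford", "opel", "bmw", "audi"],
    "model": ["focus", None, "3er", None],
    "gearbox": ["manuell", "manuell", np.nan, "automatik"],
})

# isnull() gives True/False per cell; the mean of booleans is the share of True.
null_share = demo.isnull().mean().sort_values(ascending=False)
print((null_share * 100).round(1))
```

Applied to `autos`, the same expression would rank the columns by their fraction of missing values.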

In [4]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

Let's rename the columns into a more readable format. We shall replace the camel case (vehicleType) with snake case (vehicle_type) and give some columns names that describe their contents better.

In [5]:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_PS', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']
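The mechanical part of this renaming (camel case to snake case) could also be done with a regular expression. This is a sketch only: `camel_to_snake` is a hypothetical helper, and it does not reproduce the hand-picked names above such as registration_year or ad_created, nor the capitals kept in power_PS.

```python
import re


def camel_to_snake(name):
    """Insert an underscore before each capital that follows a lowercase letter or digit."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


original = ["dateCrawled", "vehicleType", "nrOfPictures", "powerPS"]
converted = [camel_to_snake(name) for name in original]
print(converted)
```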

Now a check to see that it has worked.

In [6]:
autos.head(1)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_PS,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54


### Data Overview

Next, let's see a statistical summary of the dataset.

In [7]:
autos.describe(include = 'all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_PS,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-21 20:37:19,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


There are some interesting and surprising statistics here.
The most common listing name is Ford_Fiesta (78 listings).
The earliest registration year is 1000, which cannot be right.
We shall look at this in more depth later.

### Data Cleaning - Converting Titles

All columns seem important at this stage, as each has a significant count of non-null values.

Columns that need exploring: power_PS and registration_month, as both have a minimum of zero.

Numeric data stored as text that needs cleaning: price and odometer.
So let's now convert these to numeric.

In [8]:
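# Strip the currency symbol and thousands separators, then convert to float.
# regex=False treats '$' as a literal character rather than a regex anchor.
autos["price"] = (
    autos["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
# Remove the unit suffix from the odometer readings (still strings at this point).
autos["odometer"] = autos["odometer"].str.replace("km", "", regex=False)
# Check that it has worked.
print(autos["price"].unique())
print(autos["odometer"].unique())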

[ 5000.  8500.  8990. ...   385. 22200. 16995.]
['150,000' '70,000' '50,000' '80,000' '10,000' '30,000' '125,000' '90,000'
 '20,000' '60,000' '5,000' '100,000' '40,000']
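The separate replacements above can be folded into one step by stripping every character that is not a digit or a decimal point. A sketch under that assumption (the helper `to_number` and the sample frame are hypothetical; it would not suit columns with negative numbers or several dots):

```python
import pandas as pd


def to_number(series):
    """Strip every character that is not a digit or a dot, then convert to a number."""
    cleaned = series.str.replace(r"[^0-9.]", "", regex=True)
    # errors="coerce" turns anything that still cannot be parsed into NaN.
    return pd.to_numeric(cleaned, errors="coerce")


sample = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0"],
    "odometer": ["150,000km", "70,000km", "5,000km"],
})
sample["price"] = to_number(sample["price"])
sample["odometer_km"] = to_number(sample["odometer"])
print(sample[["price", "odometer_km"]])
```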


In [9]:
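# Remove the thousands separators; the values remain strings for now
# and are converted to integers in the odometer review further down.
autos["odometer"] = autos["odometer"].str.replace(",", "", regex=False)
autos.rename(columns={"odometer": "odometer_km"}, inplace=True)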

In [10]:
#Add on a check to see it has worked#
print(autos["odometer_km"].unique())

['150000' '70000' '50000' '80000' '10000' '30000' '125000' '90000' '20000'
 '60000' '5000' '100000' '40000']


### Data Cleaning - Removing Nonsensical Data

Let's see what the most expensive cars on the marketplace are (priced at $1 million or more).

In [11]:
top_cars =autos[autos["price"] >= 1000000].sort_values(by='price', ascending= False)
top_cars[['name','price','vehicle_type','registration_year','odometer_km','brand']]

Unnamed: 0,name,price,vehicle_type,registration_year,odometer_km,brand
39705,Tausch_gegen_gleichwertiges,99999999.0,limousine,1999,150000,mercedes_benz
42221,Leasinguebernahme,27322222.0,limousine,2014,40000,citroen
27371,Fiat_Punto,12345678.0,,2017,150000,fiat
39377,Tausche_volvo_v40_gegen_van,12345678.0,,2018,150000,volvo
47598,Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar...,12345678.0,limousine,2001,150000,opel
2897,Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000,11111111.0,limousine,1973,50000,ford
24384,Schlachte_Golf_3_gt_tdi,11111111.0,,1995,150000,volkswagen
11137,suche_maserati_3200_gt_Zustand_unwichtig_laufe...,10000000.0,coupe,1960,100000,sonstige_autos
47634,Ferrari_FXX,3890000.0,coupe,2006,5000,sonstige_autos
7814,Ferrari_F40,1300000.0,coupe,1992,50000,sonstige_autos


In [12]:
free_cars =autos[autos["price"] == 0]
free_cars[['name','price','vehicle_type','registration_year','odometer_km','brand']]

Unnamed: 0,name,price,vehicle_type,registration_year,odometer_km,brand
27,Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE,0.0,,2005,150000,ford
71,Suche_Opel_Astra_F__Corsa_oder_Kadett_E_mit_Re...,0.0,,1990,5000,opel
80,Nissan_Primera_Hatchback_1_6_16v_73_Kw___99Ps_...,0.0,coupe,1999,150000,nissan
87,Bmw_520_e39_zum_ausschlachten,0.0,,2000,150000,bmw
99,Peugeot_207_CC___Cabrio_Bj_2011,0.0,cabrio,2011,60000,peugeot
118,VW_Sharan_V6_204_PS_Karosse_Rohkarosse_mit_Pap...,0.0,bus,2001,150000,volkswagen
146,Ford_Fiesta_rot,0.0,kleinwagen,1996,20000,ford
167,Suche_VW_Multivan_Innenausstattung_Set_oder_TE...,0.0,,2011,5000,volkswagen
180,Zu_verkaufen,0.0,,2016,150000,mazda
226,Porsche_911_S_Targa__67er_SWB,0.0,cabrio,1967,5000,porsche


There are 1,421 cars listed as free. It is possible for a car to be free: the seller may feel it is worthless, or that it would cost more to scrap it themselves.

At the other extreme, one car is listed at 99,999,999, or roughly 100 million. It is listed as a limousine. Even the most expensive cars do not usually cost more than 10 million, and this seems especially true given that eBay is a second-hand marketplace, so the price is not credible.

Of the cars priced above 1 million, only the two Ferraris seem 'reasonably' priced.
The rest include a Fiat Punto, an Opel Vectra and a VW Golf; these prices are laughable and are probably a joke or, at worst, a con.

We will remove these listings. Note that the filter below keeps every listing of the brand 'sonstige_autos' (the brand of both Ferraris), so the 10,000,000 Maserati wanted ad is kept as well.

In [13]:
autos = autos[(autos["price"] <= 1000000) | (autos["brand"] == 'sonstige_autos')]

### Data Cleaning - Dates From Strings

Now let's take a look at the date_crawled, ad_created and last_seen columns. At the moment these are strings, but the first 10 characters hold the date, so we can extract them and look at the distribution of listings per day.

In [14]:
autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)

2016-03-05    0.025384
2016-03-06    0.013942
2016-03-07    0.035966
2016-03-08    0.033265
2016-03-09    0.033205
2016-03-10    0.032125
2016-03-11    0.032485
2016-03-12    0.036766
2016-03-13    0.015562
2016-03-14    0.036626
2016-03-15    0.033985
2016-03-16    0.029505
2016-03-17    0.031525
2016-03-18    0.013062
2016-03-19    0.034906
2016-03-20    0.037826
2016-03-21    0.037506
2016-03-22    0.032905
2016-03-23    0.032385
2016-03-24    0.029105
2016-03-25    0.031745
2016-03-26    0.032485
2016-03-27    0.031045
2016-03-28    0.034846
2016-03-29    0.034185
2016-03-30    0.033625
2016-03-31    0.031905
2016-04-01    0.033805
2016-04-02    0.035406
2016-04-03    0.038686
2016-04-04    0.036526
2016-04-05    0.013102
2016-04-06    0.003181
2016-04-07    0.001420
Name: date_crawled, dtype: float64

In [15]:
autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)

2015-06-11    0.000020
2015-08-10    0.000020
2015-09-09    0.000020
2015-11-10    0.000020
2015-12-05    0.000020
2015-12-30    0.000020
2016-01-03    0.000020
2016-01-07    0.000020
2016-01-10    0.000040
2016-01-13    0.000020
2016-01-14    0.000020
2016-01-16    0.000020
2016-01-22    0.000020
2016-01-27    0.000060
2016-01-29    0.000020
2016-02-01    0.000020
2016-02-02    0.000040
2016-02-05    0.000040
2016-02-07    0.000020
2016-02-08    0.000020
2016-02-09    0.000040
2016-02-11    0.000020
2016-02-12    0.000060
2016-02-14    0.000040
2016-02-16    0.000020
2016-02-17    0.000020
2016-02-18    0.000040
2016-02-19    0.000060
2016-02-20    0.000040
2016-02-21    0.000060
                ...   
2016-03-09    0.033225
2016-03-10    0.031865
2016-03-11    0.032785
2016-03-12    0.036606
2016-03-13    0.016923
2016-03-14    0.035226
2016-03-15    0.033745
2016-03-16    0.030005
2016-03-17    0.031205
2016-03-18    0.013722
2016-03-19    0.033845
2016-03-20    0.037866
2016-03-21 

In [16]:
autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)

2016-03-05    0.001080
2016-03-06    0.004421
2016-03-07    0.005361
2016-03-08    0.007581
2016-03-09    0.009842
2016-03-10    0.010762
2016-03-11    0.012522
2016-03-12    0.023804
2016-03-13    0.008981
2016-03-14    0.012802
2016-03-15    0.015883
2016-03-16    0.016443
2016-03-17    0.027924
2016-03-18    0.007421
2016-03-19    0.015743
2016-03-20    0.020703
2016-03-21    0.020723
2016-03-22    0.021583
2016-03-23    0.018583
2016-03-24    0.019563
2016-03-25    0.019203
2016-03-26    0.016963
2016-03-27    0.016023
2016-03-28    0.020863
2016-03-29    0.022344
2016-03-30    0.024844
2016-03-31    0.023824
2016-04-01    0.023104
2016-04-02    0.024884
2016-04-03    0.025364
2016-04-04    0.024624
2016-04-05    0.124300
2016-04-06    0.220995
2016-04-07    0.130941
Name: last_seen, dtype: float64

The earliest ad_created date in the dataset is 2015-06-11.

The first crawl date is 2016-03-05. The share of listings crawled on that day (0.025384, or 2.5%) probably reflects older ads being picked up by the first crawl, rather than a high proportion of cars being listed on that day.

The crawl date therefore seems to tell us little about the cars themselves.
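Slicing the first 10 characters works because the timestamps are uniformly formatted. An alternative is to parse them into real datetimes, which also allows date arithmetic. A sketch, using a few timestamps taken from the first rows of the dataset:

```python
import pandas as pd

crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    "2016-03-12 16:58:10",
])

# Parse the strings, then count listings per calendar day as a share of all rows.
parsed = pd.to_datetime(crawled, format="%Y-%m-%d %H:%M:%S")
daily_share = parsed.dt.date.value_counts(normalize=True).sort_index()
print(daily_share)
print(parsed.max() - parsed.min())  # time span covered by the sample
```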

### Data Cleaning - Reviewing Registration Year

In [17]:
autos[autos["registration_year"] <= 1920].shape

(15, 20)

In [18]:
autos["registration_year"].describe()

count    49992.000000
mean      2005.073772
std        105.721119
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

A mean of 2005 is not completely unreasonable given that the dataset contains cars listed in 2015 and 2016, although we might expect it to be lower given that used cars tend to be a lot older.

From the 75% quantile we see that 25% of cars were registered in 2008 or later, which again seems reasonable.

The maximum is 9999, which is clearly untrue. There are also 512 entries registered in 2018, which should not be possible because the listings were crawled in 2016.

The minimum is 1000, which is also clearly untrue, given that cars only started being produced in the 1880s.

15 cars are listed as registered in 1920 or earlier.
Only a few of these, like an 'Opel', could be real. We will keep registration years from 1920 to 2016.

In [19]:
autos = autos[autos["registration_year"].between(1920,2016)]
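Before dropping rows it can help to check what the filter keeps and how much it removes. A small sketch with made-up years (not the real column) showing that `Series.between` includes both end points by default:

```python
import pandas as pd

# Hypothetical registration years, including values like those discussed above.
years = pd.Series([1000, 1919, 1920, 1999, 2016, 2017, 9999])

# Series.between is inclusive on both ends by default.
keep = years.between(1920, 2016)
kept_years = years[keep].tolist()
print(kept_years)
print(f"{(~keep).mean():.1%} of rows would be dropped")
```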

In [20]:
autos["registration_year"].describe()

count    48013.000000
mean      2002.821652
std          7.199126
min       1927.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64

In [21]:
autos["registration_year"].value_counts(normalize = True, ascending = False)

2000    0.069856
2005    0.062795
1999    0.062441
2004    0.057005
2003    0.056797
2006    0.056401
2001    0.056276
2002    0.052757
1998    0.051090
2007    0.047987
2008    0.046467
2009    0.043696
1997    0.042239
2011    0.034032
2010    0.033262
1996    0.030075
2012    0.027555
2016    0.027409
1995    0.027326
2013    0.016787
2014    0.013850
1994    0.013746
1993    0.009268
2015    0.008310
1990    0.008227
1992    0.008144
1991    0.007415
1989    0.003770
1988    0.002958
1985    0.002187
          ...   
1974    0.000500
1966    0.000458
1977    0.000458
1969    0.000396
1975    0.000396
1965    0.000354
1964    0.000250
1963    0.000187
1959    0.000146
1961    0.000125
1956    0.000104
1958    0.000083
1937    0.000083
1962    0.000083
1950    0.000062
1954    0.000042
1941    0.000042
1934    0.000042
1957    0.000042
1951    0.000042
1955    0.000042
1931    0.000021
1953    0.000021
1943    0.000021
1938    0.000021
1939    0.000021
1927    0.000021
1929    0.0000

This has had a large impact: the mean registration year is now 2002 rather than 2005, the standard deviation has dropped from about 106 to about 7, and we have purged nearly 2,000 anomalous or false entries.

### Data Analysis - Top Brands On The Market

Now let's perform some aggregation and explore the top 6 brands in this dataset.

In [22]:
top6brands = autos["brand"].value_counts(dropna = False, normalize = True).head(6)
top6brands

volkswagen       0.212172
bmw              0.110033
opel             0.108137
mercedes_benz    0.095370
audi             0.086414
ford             0.069794
Name: brand, dtype: float64

In [23]:
top6brands.index

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')

Above we can see the top 6 brands in our eBay listings. Volkswagen (VW) is the most popular by a significant margin at 21%, which is roughly the same as the next two most popular brands (BMW and Opel) combined.

We said we would perform aggregation, but how can we do this using loops?

Here is how:
- Identify the unique values we want to aggregate by
- Create an empty dictionary to store our aggregate data
- Loop over the unique values, and for each:
    - Subset the dataframe by the unique values
    - Calculate the mean of whichever column we're interested in
    - Assign the unique value and its mean to the dictionary as key and value.

(Pandas can do these steps for us with 'DataFrame.groupby'.)

In [24]:
print(autos["brand"].unique())

['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault'
 'mercedes_benz' 'audi' 'sonstige_autos' 'opel' 'mazda' 'porsche' 'mini'
 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar'
 'fiat' 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda'
 'daewoo' 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada' 'rover'
 'daihatsu' 'lancia']


In [25]:
# Mean price for each of the top 6 brands, computed with an explicit loop.
mean_brand_price = {}
top6_brands = top6brands.index
for brand in top6_brands:
    selected_rows = autos[autos["brand"] == brand]
    mean_brand_price[brand] = selected_rows["price"].mean()

In [26]:
mean_brand_price

{'audi': 9093.65003615329,
 'bmw': 8102.536248343744,
 'ford': 3949.42345568487,
 'mercedes_benz': 8485.239571958942,
 'opel': 2877.7224576271187,
 'volkswagen': 5426.382546382644}

Here is the same aggregation using the pandas groupby method, this time for every brand rather than only the top 6.

In [27]:
# numeric_only=True skips the text columns, which have no mean.
topbrands = autos.groupby(by=["brand"]).mean(numeric_only=True)
topbrands["price"].sort_values(ascending=False)

brand
porsche           44553.467577
sonstige_autos    40002.024952
land_rover        19108.091837
jeep              11434.750000
jaguar            11176.197368
mini              10460.012048
audi               9093.650036
mercedes_benz      8485.239572
bmw                8102.536248
chevrolet          6488.981752
skoda              6334.919481
dacia              5915.528455
kia                5789.351906
volkswagen         5426.382546
hyundai            5308.539112
toyota             5115.333890
volvo              4757.108108
nissan             4664.891034
seat               4296.492554
mazda              4010.771664
suzuki             3995.757042
honda              3988.000000
alfa_romeo         3984.855346
ford               3949.423456
subaru             3765.038095
citroen            3699.935629
smart              3542.706587
mitsubishi         3333.800512
lancia             3246.365385
chrysler           3229.443182
saab               3211.649351
peugeot            3039.468265
op

Porsche, sonstige_autos ('other cars'), Land Rover, Jeep, Jaguar and Mini have the highest average prices, all above 10,000.

Trabant, Daihatsu, Rover and Daewoo all have low average prices, below 2,000.
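To confirm that the loop and groupby really compute the same thing, both can be run on a small frame and compared. This sketch uses made-up listings, not the real data:

```python
import pandas as pd

# Hypothetical listings, for illustration only.
demo = pd.DataFrame({
    "brand": ["ford", "ford", "opel", "bmw", "bmw"],
    "price": [1000.0, 3000.0, 2500.0, 8000.0, 10000.0],
})

# Loop version: one filtered mean per unique brand.
loop_means = {}
for brand in demo["brand"].unique():
    loop_means[brand] = demo.loc[demo["brand"] == brand, "price"].mean()

# groupby version: the same aggregation in one call.
groupby_means = demo.groupby("brand")["price"].mean()

print(groupby_means.sort_values(ascending=False))
print(loop_means == groupby_means.to_dict())
```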

### Data Analysis - Odometer Review

First we will convert the odometer strings into integers.

In [28]:
autos["odometer_km"] = autos["odometer_km"].astype(int)

Then we will calculate the mean mileage and mean price for each of the top brands.

In [29]:
topbrandsindex = topbrands.index
topbrandsindex

Index(['alfa_romeo', 'audi', 'bmw', 'chevrolet', 'chrysler', 'citroen',
       'dacia', 'daewoo', 'daihatsu', 'fiat', 'ford', 'honda', 'hyundai',
       'jaguar', 'jeep', 'kia', 'lada', 'lancia', 'land_rover', 'mazda',
       'mercedes_benz', 'mini', 'mitsubishi', 'nissan', 'opel', 'peugeot',
       'porsche', 'renault', 'rover', 'saab', 'seat', 'skoda', 'smart',
       'sonstige_autos', 'subaru', 'suzuki', 'toyota', 'trabant', 'volkswagen',
       'volvo'],
      dtype='object', name='brand')

In [30]:
mean_mileage_per_brand = {}
for row in topbrandsindex:
    selected_rows = autos[autos["brand"] == row]
    mean_mileage = selected_rows["odometer_km"].mean()
    mean_mileage_per_brand[row] = mean_mileage

And represent this in a table. Because mean_brand_price only holds the top 6 brands, the table is limited to those six.

In [31]:
bmp_series = pd.Series(mean_brand_price)
bmm_series = pd.Series(mean_mileage_per_brand)

new_df = pd.DataFrame(bmp_series, columns =['mean_price'])
new_df['mean_mileage'] = bmm_series
new_df.sort_values(by='mean_price')

Unnamed: 0,mean_price,mean_mileage
opel,2877.722458,129224.768875
ford,3949.423456,124068.934646
volkswagen,5426.382546,128728.281143
bmw,8102.536248,132431.383684
mercedes_benz,8485.239572,130856.082114
audi,9093.650036,129287.780188
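The two dictionaries and the manual table assembly can be replaced by a single grouped aggregation. A sketch with made-up rows, using pandas named aggregation; the output column names `mean_price` and `mean_mileage` simply mirror the table above:

```python
import pandas as pd

# Hypothetical listings, for illustration only.
demo = pd.DataFrame({
    "brand": ["ford", "ford", "opel", "opel"],
    "price": [1000.0, 3000.0, 2000.0, 2500.0],
    "odometer_km": [150000, 100000, 125000, 150000],
})

# One pass over the groups computes both means.
summary = (
    demo.groupby("brand")
    .agg(mean_price=("price", "mean"), mean_mileage=("odometer_km", "mean"))
    .sort_values(by="mean_price")
)
print(summary)
```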


### Most Popular Models
Next we will look at the most common brand/model combinations on the market.

In [32]:
make_model = autos[["brand","model","price"]].groupby(by= ["brand","model"])
make_model = make_model.count().sort_values(by="price",ascending=False)[:10]
#Rename column to count for clarity#
make_model.rename(columns={"price":"count"})

Unnamed: 0_level_0,Unnamed: 1_level_0,count
brand,model,Unnamed: 2_level_1
volkswagen,golf,3815
bmw,3er,2688
volkswagen,polo,1677
opel,corsa,1644
opel,astra,1388
volkswagen,passat,1388
audi,a4,1265
bmw,5er,1163
mercedes_benz,c_klasse,1147
mercedes_benz,e_klasse,981


In this German eBay market the top 10 models are all from German manufacturers. Perhaps this is not too surprising.

The Volkswagen Golf and BMW 3 Series combined have more listings than the next four most popular models combined (6,503 versus 6,097).
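Counting a column that happens to be non-null (price, above) works, but `size()` states the intent more directly. A sketch on made-up rows; note that groupby leaves out rows whose model is missing:

```python
import pandas as pd

# Hypothetical listings, for illustration only.
demo = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "volkswagen", "bmw", "bmw", "opel"],
    "model": ["golf", "golf", "polo", "3er", "3er", None],
})

# size() counts rows per group; rows with a missing model are left out
# because groupby drops missing keys by default.
model_counts = (
    demo.groupby(["brand", "model"])
    .size()
    .sort_values(ascending=False)
    .rename("count")
)
print(model_counts.head(10))
```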

### Comparing Odometer Readings And Price
It is well known that cars depreciate over time.
One factor is the mileage (the odometer reading).

Let's compare the odometer reading and price.
First let's review the first 5 rows.

In [33]:
autos[:5]

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_PS,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000.0,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500.0,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990.0,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350.0,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350.0,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Next we will use the pandas cut function to assign the odometer readings to 10 brackets (called bins) of equal width.

(Note: we could use qcut to aim for bins of equal frequency, but the odometer column has only 13 distinct values and most cars sit at 150,000 km, so the quantile edges would not be unique. Equal-frequency bins would be nonsensical in this project.)
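The difference between cut and qcut is easiest to see on a small, heavily skewed sample. The values below are made up to mimic the odometer column, where one reading dominates; in that situation qcut cannot build distinct quantile edges unless duplicates are dropped:

```python
import pandas as pd

# Hypothetical readings dominated by a single value.
km = pd.Series([5000, 10000, 20000, 150000, 150000, 150000, 150000, 150000])

# cut: bins of equal width across the range of the data.
equal_width = pd.cut(km, bins=3)

# qcut: bins of equal frequency; repeated quantile edges raise a ValueError.
try:
    pd.qcut(km, q=4)
    qcut_failed = False
except ValueError:
    qcut_failed = True

# duplicates="drop" merges the repeated edges, leaving fewer bins than requested.
merged = pd.qcut(km, q=4, duplicates="drop")
print(equal_width.cat.categories)
print(qcut_failed, merged.cat.categories)
```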

In [34]:
deciles = pd.cut(autos["odometer_km"], bins= 10)
deciles[:5]

0    (135500.0, 150000.0]
1    (135500.0, 150000.0]
2      (63000.0, 77500.0]
3      (63000.0, 77500.0]
4    (135500.0, 150000.0]
Name: odometer_km, dtype: category
Categories (10, interval[float64]): [(4855.0, 19500.0] < (19500.0, 34000.0] < (34000.0, 48500.0] < (48500.0, 63000.0] ... (92000.0, 106500.0] < (106500.0, 121000.0] < (121000.0, 135500.0] < (135500.0, 150000.0]]

Now let's join this deciles list (pandas calls this a Series) to the price column from the autos dataframe.

In [35]:
odometer_to_price = pd.concat([deciles, autos["price"]], axis=1)
#display first 5 rows#
odometer_to_price[:5]

Unnamed: 0,odometer_km,price
0,"(135500.0, 150000.0]",5000.0
1,"(135500.0, 150000.0]",8500.0
2,"(63000.0, 77500.0]",8990.0
3,"(63000.0, 77500.0]",4350.0
4,"(135500.0, 150000.0]",1350.0


Finally, group by the odometer bins to get the mean price of each bin.

In [36]:
# observed=False keeps bins that contain no cars, so they show up as NaN.
odometer_to_price_mean = odometer_to_price.groupby(by="odometer_km", observed=False).mean()
odometer_to_price_mean

Unnamed: 0_level_0,price
odometer_km,Unnamed: 1_level_1
"(4855.0, 19500.0]",13710.194276
"(19500.0, 34000.0]",17174.599217
"(34000.0, 48500.0]",15441.445
"(48500.0, 63000.0]",13519.483675
"(63000.0, 77500.0]",10817.81985
"(77500.0, 92000.0]",8903.513907
"(92000.0, 106500.0]",12677.601233
"(106500.0, 121000.0]",
"(121000.0, 135500.0]",6286.593548
"(135500.0, 150000.0]",3718.333419


The odometer range between 106,500 and 121,000 km gave a NaN result.

Let's have a look at why.

In [37]:
odometer_filt = (106500.0 <= autos["odometer_km"]) & (autos["odometer_km"] <=121000.0)
odometer_high = autos[odometer_filt]
odometer_high[["price","odometer_km"]]

Unnamed: 0,price,odometer_km


It displays NaN because there are no cars in this range, so all is fine with our selection.

We can see there is a clear drop in average price above about 120,000 km.
But the highest average price is for cars with 19,500 to 34,000 km.

Let's now see how many cars lie in each bin.

In [38]:
#rename the price column to show it becomes counts#
odometer_to_price_counts= odometer_to_price.rename(columns={"price":"count"})
odometer_to_price_counts = odometer_to_price_counts.groupby(by="odometer_km", observed=False).count()
odometer_to_price_counts

Unnamed: 0_level_0,count
odometer_km,Unnamed: 1_level_1
"(4855.0, 19500.0]",1153
"(19500.0, 34000.0]",1532
"(34000.0, 48500.0]",800
"(48500.0, 63000.0]",2144
"(63000.0, 77500.0]",1199
"(77500.0, 92000.0]",3092
"(92000.0, 106500.0]",2109
"(106500.0, 121000.0]",0
"(121000.0, 135500.0]",4960
"(135500.0, 150000.0]",31024


In [39]:
odometer_to_price_counts.sum()

count    48013
dtype: int64

There are 31,000+ cars with more than 135,500 km on the odometer. This is the majority (about 65%) of our cleaned dataset. As they are the cheapest cars, they bring the average price of the marketplace way down.

Conversely, the bin with the fewest cars (excluding the empty one) is 34,000 to 48,500 km, which has the second-highest average price.

Nearly 80% of cars on the market have done more than 92,000 km.
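The mean price and the number of cars per bin can also be produced side by side in one grouped aggregation, which makes thinly populated bins easier to spot. A sketch on made-up rows (three bins instead of ten, purely for brevity):

```python
import pandas as pd

# Hypothetical listings, for illustration only.
demo = pd.DataFrame({
    "odometer_km": [5000, 5000, 150000],
    "price": [10000.0, 20000.0, 3000.0],
})

bins = pd.cut(demo["odometer_km"], bins=3)
# observed=False keeps bins that contain no rows, so gaps stay visible.
summary = demo.groupby(bins, observed=False)["price"].agg(["mean", "count"])
print(summary)
```

Here the middle bin is empty, so it shows a count of 0 and a NaN mean, just like the 106,500 to 121,000 km bin above.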

### Comparing Damaged and Non Damaged Cars

In [40]:
# Compare the mean price of cars with (ja) and without (nein) unrepaired damage.
autos2 = autos[["brand","model","price","unrepaired_damage"]]
damaged = autos2.groupby(by="unrepaired_damage")
damaged[["price"]].mean()

Unnamed: 0_level_0,price
unrepaired_damage,Unnamed: 1_level_1
ja,2335.379937
nein,7549.058331


In [71]:
damaged_brand = autos2.groupby(by=["brand", "unrepaired_damage"])
# Having grouped by brand and unrepaired damage we can now take the mean price.
# 'unstack' pivots the ja and nein rows into columns.
damaged_brand2 = damaged_brand[["price"]].mean().unstack()
# Remove the top-level 'price' heading to leave the ja and nein columns.
damaged_brand2 = damaged_brand2["price"]
# Work out the difference in price.
damaged_brand2["price_lost_with_damage"] = damaged_brand2["nein"] - damaged_brand2["ja"]
# Work out the percentage of value lost with damage.
damaged_brand2["%_price_lost_with_damage"] = damaged_brand2["price_lost_with_damage"] * 100 / damaged_brand2["nein"]

# Next, add the per-brand mean price from earlier.
topbrands = autos.groupby(by=["brand"]).mean(numeric_only=True)
damaged_brand2["mean_price"] = topbrands["price"]
# Finally, sort by the absolute price lost.
damaged_brand2.sort_values(by="price_lost_with_damage", ascending=False)

unrepaired_damage,ja,nein,price_lost_with_damage,%_price_lost_with_damage,mean_price
brand,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
sonstige_autos,6187.823529,63327.96519,57140.14166,90.228924,40002.024952
porsche,13454.545455,50215.248,36760.702545,73.206255,44553.467577
land_rover,5223.75,21688.512821,16464.762821,75.914669,19108.091837
jeep,2549.9,12626.0,10076.1,79.804372,11434.75
jaguar,3999.9,13792.724138,9792.824138,70.999927,11176.197368
audi,3181.37931,10763.899903,7582.520593,70.44399,9093.650036
mini,4595.0,11067.973118,6472.973118,58.483817,10460.012048
bmw,3276.835648,9322.011258,6045.17561,64.848405,8102.536248
mercedes_benz,3865.471264,9711.511442,5846.040177,60.197017,8485.239572
kia,1973.522727,6844.203125,4870.680398,71.165047,5789.351906


Perhaps it is hardly surprising, but the most expensive brands tend to show the greatest absolute price difference between damaged and undamaged cars.

What is interesting is the % price lost, which is best seen sorted below.

In [75]:
damaged_brand2[["%_price_lost_with_damage","mean_price"]].sort_values(by="%_price_lost_with_damage", ascending = False)


unrepaired_damage,%_price_lost_with_damage,mean_price
brand,Unnamed: 1_level_1,Unnamed: 2_level_1
sonstige_autos,90.228924,40002.024952
saab,88.291817,3211.649351
trabant,81.611849,1572.851351
jeep,79.804372,11434.75
land_rover,75.914669,19108.091837
mazda,74.511029,4010.771664
ford,74.159717,3949.423456
porsche,73.206255,44553.467577
suzuki,71.825194,3995.757042
kia,71.165047,5789.351906


There is a caveat: we cannot adjust for the extent of the damage.
Some cars may have been in huge accidents or be in serious disrepair, while others may just have minor dents. Therefore we can make observations but not predictions.

Here we see some surprising results.
There are cheap (<$3,000) brands with a wide range of % price loss (between 23 and 82%). Clearly the initial price of a car is no guide to how well a damaged car holds its value on the market.

Expensive brands (>$10,000), such as jaguar, porsche, land_rover, jeep and sonstige_autos, tend to have high percentage losses when damaged (71 to 90%), perhaps because their owners tend to use them at high speed or off-road. The exception is the Mini, with a 58% difference. This is below average, perhaps because the Mini is less often used at high speed or off-road.

Sonstige_autos means 'other cars' and covers niche brands. They are very expensive and, with a 90% loss when damaged, they lose the largest share of their value. Let's have a look at these.

In [100]:
#to see more rows we need to change the default display settings#
pd.set_option('display.max_rows', 1000)
#Filter for only sonstige_autos and remove where unrepaired damage is missing ("Na")#
sonstige_filter = (autos["brand"] == "sonstige_autos") & (autos["unrepaired_damage"].notna())
sonstige = autos[sonstige_filter][["name","price","unrepaired_damage","vehicle_type","registration_year","odometer_km"]]
sonstige.sort_values(by=["unrepaired_damage","price"],ascending=False)

Unnamed: 0,name,price,unrepaired_damage,vehicle_type,registration_year,odometer_km
11137,suche_maserati_3200_gt_Zustand_unwichtig_laufe...,10000000.0,nein,coupe,1960,100000
47634,Ferrari_FXX,3890000.0,nein,coupe,2006,5000
7814,Ferrari_F40,1300000.0,nein,coupe,1992,50000
14715,Rolls_Royce_Phantom_Drophead_Coupe,345000.0,nein,cabrio,2012,20000
28090,Tesla_Model_X_P90D_Signature_Sondermodel__Neuw...,194000.0,nein,suv,2016,5000
22060,Tesla_Model_X90D_Autopilot_Leder_AHK_Kaltwette...,114400.0,nein,suv,2016,5000
49391,"Lamborghini_Gallardo_LP560_4_E_Gear_""Callisto_...",109999.0,nein,coupe,2008,30000
16964,Bentley_Continental_Supersports,105000.0,nein,coupe,2010,80000
3283,Melkus_RS1000_GT_Nr_129,80000.0,nein,andere,1980,5000
8446,Bentley_Continental_Flying_Spur_Speed,79999.0,nein,limousine,2011,60000


Well, sonstige_autos are definitely expensive!
We have a mixture of very fast cars, limousines, classic cars (like the Melkus), luxury cars (like Bentley and Rolls Royce) and expensive imported cars (like the American Dodge and Corvette).

Oddly, there are some free cars in this list that seem to have no damage: two of them are buses, and one is a usually very expensive Maserati.

### Conclusion
- German car manufacturers are disproportionately represented in this eBay market, which is hardly surprising given that the marketplace is German. Volkswagen is the most popular brand.
- Porsche, sonstige_autos ('other cars'), Land Rover, Jeep, Jaguar and Mini have the highest average prices, all above 10,000.
- Trabant, Daihatsu, Rover and Daewoo all have low average prices, below 2,000.
- The Volkswagen Golf and BMW 3 Series combined have more listings than the next four most popular models combined. The top 10 models are all from German manufacturers.
- There are 31,000+ cars with more than 135,500 km on the odometer. This is the majority of our cleaned dataset. As they are the cheapest cars, they bring the average price of the marketplace way down.
- Conversely, the bin with the fewest cars is 34,000 to 48,500 km, which has the second-highest average price.
- Nearly 80% of cars on the market have done more than 92,000 km.
- There are cheap (<3,000 dollars) brands with a wide range of % price loss from damage (between 23 and 82%).
- Expensive brands (>10,000 dollars), such as jaguar, porsche, land_rover, jeep and sonstige_autos, tend to have high percentage losses when damaged (71 to 90%).

