# Guided Project: Exploring eBay Car Sales Data

We will work with a dataset of used cars from eBay Kleinanzeigen. The aim of this project is to perform basic data cleaning and analysis of the car listings.

## Introduction

In [81]:
import pandas as pd

In [82]:
autos = pd.read_csv('autos.csv', encoding="Windows-1252")

In [83]:
autos.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null obj

In [84]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


There are 50,000 rows and 20 columns in total. Some columns contain missing values, and most columns are stored as strings (object dtype).
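To quantify the missing values mentioned above, the share of nulls per column can be computed directly. The sketch below uses a small made-up frame (`toy` is hypothetical, not the real dataset); on the real data the same expression would be applied to `autos`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `autos`.
toy = pd.DataFrame({
    'vehicleType': ['bus', None, 'kombi', None],
    'brand': ['peugeot', 'bmw', 'ford', 'smart'],
})

# isnull() gives a boolean frame; the column-wise mean is the share of nulls.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```

Sorting in descending order puts the columns with the most gaps first.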

## Cleaning Column Names

Let's start by renaming the columns to snake_case and making the names easier to understand.

In [85]:
autos.rename(columns=
             {
                 'yearOfRegistration': 'registration_year',
                 'monthOfRegistration': 'registration_month',
                 'notReparedDamage': 'unrepaired_damage',
                 'dateCreated': 'ad_created',
                 'dateCrawled': 'date_crawled',
                 'offerType': 'offer_type',
                 'vehicleType': 'vehicle_type',
                 'powerPS': 'power_ps',
                 'nrOfPictures': 'nr_of_pictures',
                 'postalCode': 'postal_code',
                 'lastSeen': 'last_seen'
             }, inplace=True)
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuelType', 'brand',
       'notRepairedDamage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
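The output above still lists 'notRepairedDamage' and 'fuelType'. The mapping has no entry for 'fuelType', and its key 'notReparedDamage' is misspelled, so it matches no column; `rename()` ignores unknown keys by default. Below is a minimal sketch of how `errors='raise'` would surface such a mismatch, using a hypothetical two-column frame rather than the real dataset.

```python
import pandas as pd

# Hypothetical frame with two of the original column names.
df = pd.DataFrame({'notRepairedDamage': ['nein'], 'fuelType': ['benzin']})

mapping = {'notReparedDamage': 'unrepaired_damage'}  # key misspelled on purpose

# Default behaviour: keys that match no column are ignored, nothing is renamed.
silent = df.rename(columns=mapping)

# errors='raise' turns the unmatched key into a KeyError.
try:
    df.rename(columns=mapping, errors='raise')
    raised = False
except KeyError:
    raised = True

print(list(silent.columns), raised)
```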

## Initial Exploration and Cleaning

Goals for data cleaning and basic transformations:
- Drop columns that are not useful (almost all rows share one value, or the column adds nothing to the analysis)
- Investigate and prepare the numeric data

In [86]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuelType,brand,notRepairedDamage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-23 19:38:20,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Let's inspect the candidate columns before dropping them.

In [87]:
autos['seller'].value_counts()

privat        49999
gewerblich        1
Name: seller, dtype: int64

In [88]:
autos['offer_type'].value_counts()

Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64

In [89]:
autos['nr_of_pictures'].value_counts()

0    50000
Name: nr_of_pictures, dtype: int64

In [90]:
autos.drop(columns=['seller', 'offer_type', 'nr_of_pictures'], inplace=True)

Next we clean two columns:
- price - remove non-numeric characters ($ and ,) and cast the column to integer.
- odometer - remove non-numeric characters (km and ,) and cast the column to integer, then rename the column to 'odometer_km'.

This will let us run numeric aggregations on these columns.

In [91]:
# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
autos['price'] = autos['price'].str.replace(r'[$,]', '', regex=True).astype(int)
autos['price'].head()

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64

In [92]:
# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
autos['odometer'] = autos['odometer'].str.replace(r',|km', '', regex=True).astype(int)
autos.rename(columns={'odometer': 'odometer_km'}, inplace=True)
autos['odometer_km'].head()

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer_km, dtype: int64
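The casts above work because every value reduces to plain digits. As a sketch of a more defensive variant, assuming some values might not parse at all, `pd.to_numeric` with `errors='coerce'` turns such entries into NaN instead of raising. The sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical raw values in the style of the price and odometer columns.
raw = pd.Series(['$5,000', '150,000km', 'n/a'])

# Drop everything that is not a digit, then let to_numeric turn
# unparseable leftovers (here an empty string) into NaN.
digits = raw.str.replace(r'\D', '', regex=True)
parsed = pd.to_numeric(digits, errors='coerce')
print(parsed)
```

The result is a float column, because NaN cannot be stored in a plain int64 column.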

## Exploring the Odometer and Price Columns

The odometer and price columns now contain numeric data and can be analyzed with numeric aggregations. We will summarize them with the describe() method and remove outliers where they exist.

### 'odometer_km' column

In [93]:
autos['odometer_km'].unique().shape

(13,)

In [94]:
autos['odometer_km'].describe().round(2)

count     50000.00
mean     125732.70
std       40042.21
min        5000.00
25%      125000.00
50%      150000.00
75%      150000.00
max      150000.00
Name: odometer_km, dtype: float64

In [95]:
autos['odometer_km'].value_counts().sort_index(ascending=False)

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
40000       819
30000       789
20000       784
10000       264
5000        967
Name: odometer_km, dtype: int64

### 'price' column 

In [96]:
autos['price'].unique().shape

(2357,)

In [97]:
autos['price'].describe().round(2)

count       50000.00
mean         9840.04
std        481104.38
min             0.00
25%          1100.00
50%          2950.00
75%          7200.00
max      99999999.00
Name: price, dtype: float64

In [98]:
autos['price'].value_counts().sort_index().head(10)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
Name: price, dtype: int64

In [99]:
autos['price'].value_counts().sort_index(ascending=False).head(15)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
Name: price, dtype: int64

In [100]:
sum(~autos['price'].between(1, 35000))

1944

In [101]:
sum(~autos['price'].between(1, 35000)) / len(autos)

0.03888

1944 listings will be removed because their prices are implausible (below $1 or above $35,000); this is about 3.9% of the listings.

In [102]:
autos = autos[autos['price'].between(1, 35000)]

In [103]:
autos['price'].describe()

count    48056.000000
mean      5331.427168
std       6047.229364
min          1.000000
25%       1200.000000
50%       2999.000000
75%       7200.000000
max      35000.000000
Name: price, dtype: float64
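The count and share of out-of-range prices computed above can be wrapped in a small helper. This is only a sketch: `share_outside` is a hypothetical name, and the sample prices are made up.

```python
import pandas as pd

def share_outside(values: pd.Series, low: float, high: float) -> float:
    """Return the fraction of values outside the inclusive range [low, high]."""
    # The mean of a boolean mask equals the share of True entries.
    return float((~values.between(low, high)).mean())

# Hypothetical prices, not taken from the dataset.
prices = pd.Series([0, 1, 500, 35000, 99999999])
print(share_outside(prices, 1, 35000))
```

Taking the mean of the mask avoids dividing by `len()` by hand and gives the same result.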

## Exploring the date columns

The following columns contain date/time data: 'date_crawled', 'ad_created', 'last_seen'. We will explore each of them by aggregating the listings by date and normalizing the counts to shares.

In [104]:
autos[['date_crawled', 'ad_created', 'last_seen']].head(5)

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


We can extract just the date (the first 10 characters of each string) to see the distribution without any further conversion.

In [105]:
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_values()

2016-04-07    0.001415
2016-04-06    0.003184
2016-03-18    0.012860
2016-04-05    0.012922
2016-03-06    0.014067
2016-03-13    0.015690
2016-03-05    0.025366
2016-03-24    0.029278
2016-03-16    0.029653
2016-03-27    0.031047
2016-03-25    0.031526
2016-03-17    0.031630
2016-03-31    0.031775
2016-03-10    0.032233
2016-03-23    0.032233
2016-03-26    0.032379
2016-03-11    0.032483
2016-03-22    0.032920
2016-03-09    0.033149
2016-03-08    0.033378
2016-04-01    0.033607
2016-03-30    0.033752
2016-03-29    0.034189
2016-03-15    0.034293
2016-03-19    0.034689
2016-03-28    0.034793
2016-04-02    0.035438
2016-03-07    0.036083
2016-04-04    0.036582
2016-03-14    0.036603
2016-03-12    0.037082
2016-03-21    0.037144
2016-03-20    0.037872
2016-04-03    0.038684
Name: date_crawled, dtype: float64
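String slicing works because the timestamps share one fixed format. As an alternative sketch, the strings can be parsed into real datetimes first, which also validates them. The sample timestamps below are hypothetical.

```python
import pandas as pd

# Hypothetical timestamps in the same format as date_crawled.
stamps = pd.Series([
    '2016-03-26 17:47:46',
    '2016-03-26 18:57:24',
    '2016-04-04 13:38:56',
])

# Parse to datetime64, keep only the calendar date, then count shares per day.
days = pd.to_datetime(stamps).dt.date
daily_share = days.value_counts(normalize=True).sort_index()
print(daily_share)
```

Unlike slicing, `pd.to_datetime` raises an error on malformed timestamps by default.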

In [106]:
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000042
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000042
2016-02-05    0.000042
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000042
2016-02-14    0.000042
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000042
2016-02-19    0.000062
2016-02-20    0.000042
2016-02-21    0.000062
2016-02-22    0.000021
                ...   
2016-03-09    0.033232
2016-03-10    0.031921
2016-03-11    0.032837
2016-03-12    0.036915
2016-03-13    0.017022
2016-03-14    0.035209
2016-03-15    0.034064
2016-03-16    0.030173
2016-03-17    0.031234
2016-03-18    0.013568
2016-03-19    0.033607
2016-03-20    0.037935
2016-03-21 

In [107]:
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001082
2016-03-06    0.004370
2016-03-07    0.005390
2016-03-08    0.007491
2016-03-09    0.009655
2016-03-10    0.010758
2016-03-11    0.012465
2016-03-12    0.023930
2016-03-13    0.008948
2016-03-14    0.012631
2016-03-15    0.015981
2016-03-16    0.016460
2016-03-17    0.028196
2016-03-18    0.007346
2016-03-19    0.015919
2016-03-20    0.020643
2016-03-21    0.020663
2016-03-22    0.021475
2016-03-23    0.018603
2016-03-24    0.019810
2016-03-25    0.019269
2016-03-26    0.016814
2016-03-27    0.015690
2016-03-28    0.020996
2016-03-29    0.022474
2016-03-30    0.024908
2016-03-31    0.023826
2016-04-01    0.022911
2016-04-02    0.025012
2016-04-03    0.025283
2016-04-04    0.024659
2016-04-05    0.124001
2016-04-06    0.221346
2016-04-07    0.130993
Name: last_seen, dtype: float64

In [108]:
autos['registration_year'].describe()

count    48056.000000
mean      2004.635862
std         87.023293
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The registration year contains implausible values such as 1000 and 9999.

## Dealing with Incorrect Registration Year Data

The first automobile dates from 1886, so any earlier registration year is certainly wrong. We use 1900 as a round lower bound and 2016, the year the data was crawled, as the upper bound.

In [109]:
autos.loc[~autos['registration_year'].between(1900, 2016), 'registration_year'].value_counts().sort_index()

1000       1
1001       1
1111       1
1800       2
2017    1389
2018     467
2019       2
2800       1
4100       1
4500       1
4800       1
5000       4
5911       1
8888       1
9000       1
9999       3
Name: registration_year, dtype: int64

In [110]:
autos[~autos['registration_year'].between(1900, 2016)].describe()

Unnamed: 0,price,registration_year,power_ps,odometer_km,registration_month,postal_code
count,1877.0,1877.0,1877.0,1877.0,1877.0,1877.0
mean,3521.715503,2048.35642,99.452318,130596.696857,4.666489,47901.0309
std,3977.795359,436.760444,436.836486,37616.072668,3.889685,25372.448846
min,1.0,1000.0,0.0,5000.0,0.0,1067.0
25%,1000.0,2017.0,0.0,125000.0,1.0,27628.0
50%,1999.0,2017.0,75.0,150000.0,4.0,45964.0
75%,4500.0,2018.0,120.0,150000.0,8.0,66424.0
max,29699.0,9999.0,16011.0,150000.0,12.0,99974.0


In [111]:
len(autos[~autos['registration_year'].between(1900, 2016)])

1877

In [112]:
len(autos[~autos['registration_year'].between(1900, 2016)]) / len(autos)

0.03905859830198102

In [113]:
autos[autos['registration_year'] < 1900]

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuelType,brand,notRepairedDamage,ad_created,postal_code,last_seen
10556,2016-04-01 06:02:10,UNFAL_Auto,450,control,,1800,,1800,,5000,2,,mitsubishi,nein,2016-04-01 00:00:00,63322,2016-04-01 09:42:30
22316,2016-03-29 16:56:41,VW_Kaefer.__Zwei_zum_Preis_von_einem.,1500,control,,1000,manuell,0,kaefer,5000,0,benzin,volkswagen,,2016-03-29 00:00:00,48324,2016-03-31 10:15:28
24511,2016-03-17 19:45:11,Trabant__wartburg__Ostalgie,490,control,,1111,,0,,5000,0,,trabant,,2016-03-17 00:00:00,16818,2016-04-07 07:17:29
32585,2016-04-02 16:56:39,UNFAL_Auto,450,control,,1800,,1800,,5000,2,,mitsubishi,nein,2016-04-02 00:00:00,63322,2016-04-04 14:46:21
49283,2016-03-15 18:38:53,Citroen_HY,7750,control,,1001,,0,andere,5000,0,,citroen,,2016-03-15 00:00:00,66706,2016-04-06 18:47:20


The data was crawled in 2016, so a registration year later than 2016 cannot be correct. Years before 1900 are excluded as historically implausible, and the rows shown above confirm that they contain invalid data. The filter removes almost 4% of the remaining records.

In [114]:
autos = autos[autos['registration_year'].between(1900, 2016)]
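The filter keeps both boundary years, since `between()` is inclusive on both ends by default. A short check on made-up values:

```python
import pandas as pd

# Hypothetical years around the two bounds.
years = pd.Series([1899, 1900, 2016, 2017])

# between() includes both endpoints unless told otherwise.
mask = years.between(1900, 2016)
print(mask.tolist())  # [False, True, True, False]
```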

In [115]:
autos['registration_year'].value_counts(normalize=True).sort_values(ascending=False).head(20).sort_index()

1995    0.026462
1996    0.029689
1997    0.042162
1998    0.051084
1999    0.062712
2000    0.068299
2001    0.057039
2002    0.053747
2003    0.058425
2004    0.058338
2005    0.063492
2006    0.057732
2007    0.048918
2008    0.047489
2009    0.044912
2010    0.033825
2011    0.034345
2012    0.026787
2013    0.016111
2016    0.026116
Name: registration_year, dtype: float64

In [116]:
autos['registration_year'].describe()

count    46179.000000
mean      2002.858789
std          7.094485
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2007.000000
max       2016.000000
Name: registration_year, dtype: float64

Half of the remaining vehicles were registered between 1999 and 2007, with a median registration year of 2003.

## Exploring Price by Brand

Let's start by counting listings per brand so that we can limit the analysis to the most common ones.

In [117]:
pd.DataFrame({'count': autos['brand'].value_counts(), 'perc': autos['brand'].value_counts(normalize=True)})

Unnamed: 0,count,perc
volkswagen,9830,0.212867
bmw,5069,0.109769
opel,5021,0.108729
mercedes_benz,4410,0.095498
audi,3939,0.085299
ford,3253,0.070443
renault,2200,0.047641
peugeot,1393,0.030165
fiat,1197,0.025921
seat,853,0.018472


Most brands account for 1% or less of the listings. There is a clear gap between Renault (4.7%) and Ford (7.0%), so we limit the analysis to brands that represent more than 5% of the total.

In [118]:
brands = autos['brand'].value_counts()
brands_5p = brands[brands > len(autos) * 0.05]
brands_5p

volkswagen       9830
bmw              5069
opel             5021
mercedes_benz    4410
audi             3939
ford             3253
Name: brand, dtype: int64
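The same selection can be expressed on shares rather than counts, which keeps the 5% threshold visible in the code. This is a sketch on a hypothetical brand column of 25 listings.

```python
import pandas as pd

# Hypothetical brand column, 25 listings in total.
brand = pd.Series(['vw'] * 12 + ['bmw'] * 8 + ['opel'] * 4 + ['lada'] * 1)

# normalize=True yields shares, so the threshold can be applied directly.
share = brand.value_counts(normalize=True)
common = share[share > 0.05].index.tolist()
print(common)
```

Here 'lada' is dropped because it accounts for only 4% of the listings.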

In [119]:
brand_mean_price_d = {}
for brand in brands_5p.index:
    brand_mean_price_d[brand] = autos.loc[autos['brand'] == brand, 'price'].mean()
brand_mean_price = pd.Series(brand_mean_price_d).sort_values()
brand_mean_price

opel             2968.069110
ford             3577.183215
volkswagen       5280.133672
mercedes_benz    7727.017687
bmw              7742.472677
audi             8377.750190
dtype: float64

There is a significant gap between the mean prices of Opel and Audi cars. Audi, BMW and Mercedes-Benz are the most expensive of the popular brands, Opel and Ford are the least expensive, and Volkswagen sits in between.
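The loop over brands above can also be written with `groupby`, which computes all group means in one pass. This is a sketch on made-up listings; `selected` stands in for `brands_5p.index`.

```python
import pandas as pd

# Hypothetical listings, not the real data.
toy = pd.DataFrame({
    'brand': ['audi', 'audi', 'opel', 'opel', 'lada'],
    'price': [9000, 7000, 3000, 2000, 500],
})
selected = ['audi', 'opel']  # stands in for brands_5p.index

# Keep only the selected brands, then average the price per brand.
mean_price = (
    toy[toy['brand'].isin(selected)]
    .groupby('brand')['price']
    .mean()
    .sort_values()
)
print(mean_price)
```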

### Exploring Mileage by Brand

In [120]:
brand_mean_mileage_d = {}
for brand in brands_5p.index:
    brand_mean_mileage_d[brand] = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()
brand_mean_mileage = pd.Series(brand_mean_mileage_d).sort_values()
brand_mean_mileage

ford             124495.849985
volkswagen       128961.851475
opel             129310.894244
audi             131315.054582
mercedes_benz    132235.827664
bmw              133689.090550
dtype: float64

## Storing Aggregate Data in a DataFrame

In [121]:
pd.DataFrame({'brand_mean_price': brand_mean_price, 'brand_mean_mileage': brand_mean_mileage})

Unnamed: 0,brand_mean_mileage,brand_mean_price
audi,131315.054582,8377.75019
bmw,133689.09055,7742.472677
ford,124495.849985,3577.183215
mercedes_benz,132235.827664,7727.017687
opel,129310.894244,2968.06911
volkswagen,128961.851475,5280.133672
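As an alternative to building two Series and combining them, both aggregates can be produced in one step with named aggregation. The frame below is hypothetical; the output column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical listings with the two cleaned numeric columns.
toy = pd.DataFrame({
    'brand': ['audi', 'audi', 'opel', 'opel'],
    'price': [9000, 7000, 3000, 2000],
    'odometer_km': [150000, 100000, 125000, 150000],
})

# Each keyword names an output column: (source column, aggregation).
summary = toy.groupby('brand').agg(
    brand_mean_price=('price', 'mean'),
    brand_mean_mileage=('odometer_km', 'mean'),
)
print(summary)
```

With this form the column order follows the order of the keyword arguments.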
