# Analysis of German Used Car Sales from eBay Classifieds
The aim of this project is to clean the data and analyze the included used car listings.

The data dictionary provided with the data is as follows:

1. dateCrawled - When this ad was first crawled. All field-values are taken from this date.
2. name - Name of the car.
3. seller - Whether the seller is private or a dealer.
4. offerType - The type of listing
5. price - The price on the ad to sell the car.
6. abtest - Whether the listing is included in an A/B test.
7. vehicleType - The vehicle Type.
8. yearOfRegistration - The year in which the car was first registered.
9. gearbox - The transmission type.
10. powerPS - The power of the car in PS (Pferdestärke, i.e. metric horsepower).
11. model - The car model name.
12. kilometer - How many kilometers the car has driven.
13. monthOfRegistration - The month in which the car was first registered.
14. fuelType - What type of fuel the car uses.
15. brand - The brand of the car.
16. notRepairedDamage - If the car has a damage which is not yet repaired.
17. dateCreated - The date on which the eBay listing was created.
18. nrOfPictures - The number of pictures in the ad.
19. postalCode - The postal code for the location of the vehicle.
20. lastSeenOnline - When the crawler saw this ad last online.

First, we load the dataset into a variable.


In [1]:
import warnings
warnings.simplefilter(action='ignore')

import pandas as pd
import numpy as np

autos = pd.read_csv('autos.csv', encoding="Latin-1")  # encoding specified because the default UTF-8 decoding raises an error

In [2]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [3]:
autos.info()  # info() prints its summary itself and returns None, so no print() is needed
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


From the initial analysis, there are 50,000 rows and 20 columns. The columns are a mix of numeric (int64) and object data types, and five columns (vehicleType, gearbox, model, fuelType and notRepairedDamage) contain null values.

There are 3 datetime columns (dateCrawled, dateCreated and lastSeen) stored as strings that need to be converted to a date type.

Column headings need to be cleaned for ease of reference.

The odometer and price columns need to be converted to numeric columns so that statistical operations can be done on them.

Columns with German records may need to be converted to English records.

### Cleaning Column Names

In [4]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [5]:
new_columns = {'dateCrawled': 'date_crawled', 'offerType':'offer_type',
              'vehicleType':'vehicle_type', 'yearOfRegistration': 'registration_year', 'powerPS': 'power_ps',
              'monthOfRegistration': 'registration_month', 'fuelType':'fuel_type', 'notRepairedDamage': 'unrepaired_damage', 'dateCreated': 'ad_created', 'nrOfPictures': 'num_of_pictures', 'postalCode': 'postal_code', 'lastSeen': 'last_seen'}

autos.rename(columns=new_columns, inplace=True)
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
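The mechanical part of this renaming (camelCase to snake_case) can also be automated. The following is a minimal sketch; `camel_to_snake` is a hypothetical helper that is not part of the original notebook, and it would not reproduce the hand-picked names such as `registration_year` or `unrepaired_damage`, which still need an explicit mapping.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

# A few of the original column names, used here only as sample input
columns = ['dateCrawled', 'offerType', 'powerPS', 'nrOfPictures', 'price']
snake_columns = [camel_to_snake(c) for c in columns]
print(snake_columns)
```

On a DataFrame, the same helper could be passed directly to `rename`, for example `autos.rename(columns=camel_to_snake)`.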


### Initial Exploration and Cleaning

In [6]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-04-02 11:37:04,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


The number of pictures column will be dropped because every value is 0, so it carries no information.

The postal code column is being summarized as a number. It should be stored as a string of digits, not as a numeric column.

The odometer and price columns need cleaning before they can be converted to numeric types.

The datetime columns need to be cleaned.

In [7]:
# Drop Num of Pictures Column
autos.drop('num_of_pictures', axis=1, inplace=True)

# Convert Postal Code Column to String
autos['postal_code'] = autos['postal_code'].astype(str)

# Review columns post cleaning
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,7014.0,39481
top,2016-04-02 11:37:04,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,10115.0,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,109.0,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,,
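German postal codes have five digits, and the minimum of 1067 in the earlier summary suggests that leading zeros were lost when the column was read as integers. A possible refinement, sketched here on made-up sample values rather than the real column, is to pad the strings back to five characters:

```python
import pandas as pd

# Sample integer postal codes; 1067 stands in for a code that lost its leading zero
postal = pd.Series([1067, 10115, 79588])

# Convert to string, then left-pad with zeros to a width of 5
postal_str = postal.astype(str).str.zfill(5)
print(postal_str.tolist())
```

Whether this padding is appropriate depends on how the codes were stored in the source file, so it is worth checking a few raw rows first.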


In [8]:
# Clean Price & Odometer Columns
# check the columns for values
autos['price'].head()
autos['odometer'].head()

0    150,000km
1    150,000km
2     70,000km
3     70,000km
4    150,000km
Name: odometer, dtype: object

In [9]:
# functions to clean odometer & price columns
def clean_odometer(a_series):
    a_series = a_series.str.replace(',','')
    a_series = a_series.str.replace('km','')
    a_series = a_series.astype(float)

    return a_series

def clean_price(a_series):
    a_series = a_series.str.replace('$', '', regex=False)  # '$' is a regex anchor, so match it literally
    a_series = a_series.str.replace(',','')
    a_series = a_series.astype(float)

    return a_series
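Since both functions only strip non-numeric characters, they could be merged into one. The sketch below uses a hypothetical helper, `to_number`, that removes everything except digits and the decimal point in a single regular-expression pass; it is an alternative to the two functions above, shown on sample strings.

```python
import pandas as pd

def to_number(a_series):
    # Drop every character that is not a digit or a decimal point, then cast to float
    cleaned = a_series.str.replace(r'[^\d.]', '', regex=True)
    return cleaned.astype(float)

sample_price = pd.Series(['$5,000', '$0', '$13,200'])
sample_odometer = pd.Series(['150,000km', '5,000km'])

prices = to_number(sample_price)
kms = to_number(sample_odometer)
print(prices.tolist(), kms.tolist())
```

The trade-off is that a broad pattern like this would also silently accept unexpected units or currencies, whereas the explicit replacements fail loudly on them.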

In [10]:
# Apply functions to clean odometer and price columns
autos['price'] = clean_price(autos['price'])
autos['odometer'] = clean_odometer(autos['odometer'])

autos.rename(columns={'odometer':'odometer_km'}, inplace=True)

autos[['price', 'odometer_km']]

Unnamed: 0,price,odometer_km
0,5000.0,150000.0
1,8500.0,150000.0
2,8990.0,70000.0
3,4350.0,70000.0
4,1350.0,150000.0
...,...,...
49995,24900.0,100000.0
49996,1980.0,150000.0
49997,13200.0,5000.0
49998,22900.0,40000.0


### Exploring the Odometer and Price Columns

In [11]:
autos['odometer_km'].unique()
autos['odometer_km'].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [12]:

autos['odometer_km'].value_counts()

150000.0    32424
125000.0     5170
100000.0     2169
90000.0      1757
80000.0      1436
70000.0      1230
60000.0      1164
50000.0      1027
5000.0        967
40000.0       819
30000.0       789
20000.0       784
10000.0       264
Name: odometer_km, dtype: int64

The odometer column has 13 unique values, all round numbers, which suggests the mileage was recorded in preset bands. The mean odometer_km value is 125,732.70, the minimum is 5,000 and the maximum is 150,000. The maximum is also the most frequent value (32,424 of 50,000 records). There are no clear outliers in this column.

In [13]:
autos['price'].unique()
autos['price'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [14]:
autos = autos[autos["price"].between(100, 100000)]
autos['price'].describe()

count    48185.000000
mean      5796.099741
std       7525.532405
min        100.000000
25%       1250.000000
50%       3000.000000
75%       7499.000000
max      99900.000000
Name: price, dtype: float64

The price column has 2,357 unique values. Of the 50,000 records, 1,421 have a price of 0 and 53 have a price above 100,000. The maximum price is 100,000,000, a clear outlier. Keeping only prices between 100 and 100,000 (inclusive) removed 1,815 records, including all zero-priced listings, and left 48,185.
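Before dropping rows, it can help to count how many fall on each side of the chosen range. The sketch below does this on a small made-up price series; the thresholds match the ones used above, and `between` is inclusive on both ends.

```python
import pandas as pd

# Made-up prices covering each case: zero, too low, boundaries, typical, extreme
prices = pd.Series([0, 50, 100, 2950, 99900, 100000, 100000000], dtype=float)

in_range = prices.between(100, 100000)   # True for 100 <= price <= 100000
too_low = int((prices < 100).sum())
too_high = int((prices > 100000).sum())
kept = int(in_range.sum())

print(too_low, too_high, kept)
```

The three counts always add up to the length of the series, which gives a quick sanity check on the filter.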

### Exploring the date columns

In [15]:
date_cols = ['date_crawled', 'last_seen', 'ad_created', 'registration_month', 'registration_year']
autos[date_cols].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48185 entries, 0 to 49999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date_crawled        48185 non-null  object
 1   last_seen           48185 non-null  object
 2   ad_created          48185 non-null  object
 3   registration_month  48185 non-null  int64 
 4   registration_year   48185 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 2.2+ MB


In [16]:
autos[date_cols].describe(include='all')

Unnamed: 0,date_crawled,last_seen,ad_created,registration_month,registration_year
count,48185,48185,48185,48185.0,48185.0
unique,46533,38207,76,,
top,2016-03-11 22:38:16,2016-04-07 06:17:27,2016-04-03 00:00:00,,
freq,3,8,1872,,
mean,,,,5.802096,2004.730456
std,,,,3.677562,87.932039
min,,,,0.0,1000.0
25%,,,,3.0,1999.0
50%,,,,6.0,2004.0
75%,,,,9.0,2008.0


In [17]:
autos[date_cols].head()

Unnamed: 0,date_crawled,last_seen,ad_created,registration_month,registration_year
0,2016-03-26 17:47:46,2016-04-06 06:45:54,2016-03-26 00:00:00,3,2004
1,2016-04-04 13:38:56,2016-04-06 14:45:08,2016-04-04 00:00:00,6,1997
2,2016-03-26 18:57:24,2016-04-06 20:15:37,2016-03-26 00:00:00,7,2009
3,2016-03-12 16:58:10,2016-03-15 03:16:28,2016-03-12 00:00:00,6,2007
4,2016-04-01 14:38:50,2016-04-01 14:38:50,2016-04-01 00:00:00,7,2003


We will extract the dates from the 3 datetime columns and perform some statistical analysis.

In [18]:
def extract_date(a_series):
    a_series = a_series.str[:10]
    return a_series

autos['date_crawled'] = extract_date(autos['date_crawled'])
autos['last_seen'] = extract_date(autos['last_seen'])
autos['ad_created'] = extract_date(autos['ad_created'])

autos[date_cols].head()

Unnamed: 0,date_crawled,last_seen,ad_created,registration_month,registration_year
0,2016-03-26,2016-04-06,2016-03-26,3,2004
1,2016-04-04,2016-04-06,2016-04-04,6,1997
2,2016-03-26,2016-04-06,2016-03-26,7,2009
3,2016-03-12,2016-03-15,2016-03-12,6,2007
4,2016-04-01,2016-04-01,2016-04-01,7,2003
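Slicing the first 10 characters works because the timestamps share one fixed format. An alternative, sketched below on sample strings, is to parse them with `pd.to_datetime`, which would raise an error on a malformed value instead of silently truncating it.

```python
import pandas as pd

raw = pd.Series(['2016-03-26 17:47:46', '2016-04-04 13:38:56', '2016-03-26 18:57:24'])

# Parse with an explicit format, then format back to a day-level string
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S')
days = parsed.dt.strftime('%Y-%m-%d')

# Share of records per day, sorted chronologically
share_per_day = days.value_counts(normalize=True).sort_index()
print(share_per_day)
```

Keeping the parsed datetime column around also makes later date arithmetic (for example, days between ad creation and last seen) straightforward.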


We will calculate the distribution of values in the date columns as percentages. The result will be sorted by date in ascending order.

In [19]:
autos['date_crawled'].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025340
2016-03-06    0.014050
2016-03-07    0.036090
2016-03-08    0.033164
2016-03-09    0.033019
2016-03-10    0.032313
2016-03-11    0.032624
2016-03-12    0.036920
2016-03-13    0.015690
2016-03-14    0.036692
2016-03-15    0.034305
2016-03-16    0.029470
2016-03-17    0.031504
2016-03-18    0.012867
2016-03-19    0.034762
2016-03-20    0.037813
2016-03-21    0.037190
2016-03-22    0.032811
2016-03-23    0.032292
2016-03-24    0.029449
2016-03-25    0.031504
2016-03-26    0.032292
2016-03-27    0.031109
2016-03-28    0.034949
2016-03-29    0.034139
2016-03-30    0.033703
2016-03-31    0.031856
2016-04-01    0.033662
2016-04-02    0.035633
2016-04-03    0.038601
2016-04-04    0.036567
2016-04-05    0.013054
2016-04-06    0.003175
2016-04-07    0.001390
Name: date_crawled, dtype: float64

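There are 34 unique dates in the date_crawled column, running from 2016-03-05 to 2016-04-07. Most days account for roughly 3% to 4% of the listings; a handful (such as 2016-03-06, 2016-03-13, 2016-03-18 and 2016-04-05) sit near 1.3% to 1.6%, and the final two days fall below 0.5%. There are no null values.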

In [20]:
autos['last_seen'].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001079
2016-03-06    0.004317
2016-03-07    0.005437
2016-03-08    0.007326
2016-03-09    0.009567
2016-03-10    0.010646
2016-03-11    0.012411
2016-03-12    0.023804
2016-03-13    0.008882
2016-03-14    0.012639
2016-03-15    0.015876
2016-03-16    0.016437
2016-03-17    0.028121
2016-03-18    0.007305
2016-03-19    0.015773
2016-03-20    0.020650
2016-03-21    0.020546
2016-03-22    0.021376
2016-03-23    0.018574
2016-03-24    0.019736
2016-03-25    0.019114
2016-03-26    0.016623
2016-03-27    0.015544
2016-03-28    0.020836
2016-03-29    0.022310
2016-03-30    0.024717
2016-03-31    0.023846
2016-04-01    0.022870
2016-04-02    0.024883
2016-04-03    0.025132
2016-04-04    0.024551
2016-04-05    0.124935
2016-04-06    0.221999
2016-04-07    0.132137
Name: last_seen, dtype: float64

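The last_seen column covers the same date range as date_crawled, but the distribution is very different: each day up to 2016-04-04 accounts for less than 3% of listings, while the final three days (2016-04-05 to 2016-04-07) together account for about 48%. This spike most likely reflects the end of the crawl period rather than a sudden surge in sales.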
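As a quick check on the last_seen output, the shares of the final three days can be added up directly. The values below are copied from the output above.

```python
# Shares of the final three days, taken from the last_seen distribution above
last_three_days = {
    '2016-04-05': 0.124935,
    '2016-04-06': 0.221999,
    '2016-04-07': 0.132137,
}

share = sum(last_three_days.values())
print(round(share, 3))
```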

In [21]:
autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index()

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...   
2016-04-03    0.038850
2016-04-04    0.036920
2016-04-05    0.011788
2016-04-06    0.003258
2016-04-07    0.001245
Name: ad_created, Length: 76, dtype: float64

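There are 76 unique dates in the ad_created column, starting at 2015-06-11. Most ads were created within the crawl window (March to April 2016), while the oldest dates shown each account for only about 0.002% of listings.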

### Dealing with Incorrect Registration Year Data

In [22]:
autos[date_cols[-2:]].describe()

Unnamed: 0,registration_month,registration_year
count,48185.0,48185.0
mean,5.802096,2004.730456
std,3.677562,87.932039
min,0.0,1000.0
25%,3.0,1999.0
50%,6.0,2004.0
75%,9.0,2008.0
max,12.0,9999.0


The registration year column contains clear outliers: the minimum is 1000 and the maximum is 9999.

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate.

These need to be cleaned.

In [23]:
autos['registration_year'].info()

<class 'pandas.core.series.Series'>
Int64Index: 48185 entries, 0 to 49999
Series name: registration_year
Non-Null Count  Dtype
--------------  -----
48185 non-null  int64
dtypes: int64(1)
memory usage: 752.9 KB


In [24]:
autos.loc[autos['registration_year'] <= 1900, 'registration_year']

10556    1800
22316    1000
24511    1111
32585    1800
49283    1001
Name: registration_year, dtype: int64

In [25]:
autos.loc[autos['registration_year'] <= 1900, ['registration_year', 'brand', 'price']] # there are 5 records here. These are clear date outliers

Unnamed: 0,registration_year,brand,price
10556,1800,mitsubishi,450.0
22316,1000,volkswagen,1500.0
24511,1111,trabant,490.0
32585,1800,mitsubishi,450.0
49283,1001,citroen,7750.0


In [26]:
autos.loc[autos['registration_year'] > 2016, ['registration_year', 'brand', 'price']] # these are incorrect records and will be dropped.

Unnamed: 0,registration_year,brand,price
10,2017,volkswagen,999.0
65,2017,ford,250.0
68,2017,mini,10990.0
84,2018,renault,900.0
113,2017,volkswagen,1200.0
...,...,...,...
49770,2018,volkswagen,999.0
49796,2017,opel,4500.0
49841,2017,volkswagen,600.0
49910,9000,opel,22200.0


Records from 1900 or earlier and records after 2016 will be removed as obviously incorrect data.

In [27]:
autos = autos[autos["registration_year"].between(1901, 2016)]
autos['registration_year'].value_counts(normalize=True).sort_index()

1910    0.000043
1927    0.000022
1929    0.000022
1931    0.000022
1934    0.000043
          ...   
2012    0.028178
2013    0.017209
2014    0.014251
2015    0.008119
2016    0.025824
Name: registration_year, Length: 78, dtype: float64

After removing records from 1900 or earlier and those after 2016, most of the records fall within the range 1980 to 2016, although some records remain from before 1980.
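The upper bound of 2016 follows from the rule that a car cannot be registered after its listing was last seen. Rather than hard-coding it, the bound could be derived from the data. The following is a sketch on a made-up frame, assuming last_seen holds `YYYY-MM-DD` strings as it does at this point in the notebook.

```python
import pandas as pd

sample = pd.DataFrame({
    'registration_year': [1000, 1900, 1910, 2004, 2016, 2017, 9999],
    'last_seen': ['2016-04-06'] * 6 + ['2016-04-07'],
})

# Latest year in which any listing was seen; no car can be registered after it
latest_year = pd.to_datetime(sample['last_seen']).dt.year.max()

# between() is inclusive on both ends, so 1901 excludes the year 1900 itself
valid = sample[sample['registration_year'].between(1901, latest_year)]
print(valid['registration_year'].tolist())
```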

### Exploring Price by Brand

In [28]:
autos['brand'].value_counts(normalize=True).sort_values(ascending=False)

volkswagen        0.211582
bmw               0.110207
opel              0.107335
mercedes_benz     0.096668
audi              0.086822
ford              0.069872
renault           0.047114
peugeot           0.029884
fiat              0.025630
seat              0.018267
skoda             0.016432
nissan            0.015352
mazda             0.015244
smart             0.014208
citroen           0.014057
toyota            0.012804
hyundai           0.010019
sonstige_autos    0.009436
volvo             0.009134
mini              0.008810
mitsubishi        0.008183
honda             0.007881
kia               0.007082
alfa_romeo        0.006672
suzuki            0.005938
chevrolet         0.005679
porsche           0.005463
chrysler          0.003520
dacia             0.002656
daihatsu          0.002505
jeep              0.002289
land_rover        0.002116
subaru            0.002116
saab              0.001663
jaguar            0.001533
daewoo            0.001490
trabant           0.001360
r

The brand column lists the car brands in the dataset. Volkswagen is the most common brand (about 21% of listings), which makes sense since this is a German used car dataset. The top 5 brands are all German automakers. The brand 'sonstige_autos' is German for 'other cars', so it likely groups brands that are not listed separately. We have chosen to aggregate across the whole set of unique brands.

In [29]:
avg_price_per_brand = dict()
auto_brands = autos['brand'].unique()

for brand in auto_brands:
    selected_rows = autos[autos["brand"] == brand]
    mean = selected_rows["price"].mean()
    avg_price_per_brand[brand] = mean

avg_price_per_brand

{'peugeot': 3113.860549132948,
 'bmw': 8249.652429467085,
 'volkswagen': 5436.950096948668,
 'smart': 3596.40273556231,
 'ford': 3740.2639060568604,
 'chrysler': 3486.5766871165642,
 'seat': 4433.419621749409,
 'renault': 2496.070577451879,
 'mercedes_benz': 8573.484922939468,
 'audi': 9339.529967669734,
 'sonstige_autos': 10943.649885583523,
 'opel': 3005.4960772480385,
 'mazda': 4129.774787535411,
 'porsche': 34368.41106719367,
 'mini': 10639.450980392157,
 'toyota': 5167.091062394604,
 'dacia': 5915.528455284553,
 'nissan': 4756.659634317863,
 'jeep': 11650.5,
 'saab': 3211.6493506493507,
 'volvo': 4993.208037825059,
 'mitsubishi': 3439.10290237467,
 'jaguar': 11961.56338028169,
 'fiat': 2836.8736310025274,
 'skoda': 6409.609724047306,
 'subaru': 4033.7551020408164,
 'kia': 6018.442073170731,
 'citroen': 3796.26267281106,
 'chevrolet': 6759.885931558935,
 'hyundai': 5411.075431034483,
 'honda': 4119.109589041096,
 'daewoo': 1064.0579710144928,
 'suzuki': 4126.341818181818,
 'trabant

The most expensive brand on average is Porsche. The top 10 most expensive brands are:
Porsche (34.4k), Land Rover (19.1k), Jaguar (12.0k), Jeep (11.7k), Mini (10.6k), Audi (9.3k), Mercedes-Benz (8.6k), BMW (8.2k), Chevrolet (6.8k), Skoda (6.4k)

The 'sonstige_autos' average price (10.9k) would place it in the top 5 of this ranking. This may require further analysis, or the records may be removed.

Among the most frequent brands, Audi, Mercedes-Benz and BMW are the most expensive.

Rover is reported as the most affordable brand on average, although its value falls in the truncated part of the output above; among the visible values, Daewoo is the lowest at about 1.1k.
Among the most frequent brands, the least expensive are Renault (2.5k), Fiat (2.8k), Opel (3.0k), Peugeot (3.1k) and Ford (3.7k).

VW, the most frequent brand in the dataset, has a mid-range average price (5.4k).
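The per-brand loop can also be expressed as a single `groupby`, which returns the averages already sorted and avoids filtering the frame once per brand. A minimal sketch on made-up data:

```python
import pandas as pd

sample = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'opel', 'opel', 'porsche'],
    'price': [8000.0, 9000.0, 3000.0, 2000.0, 34000.0],
})

# Mean price per brand, most expensive first
mean_price_by_brand = (
    sample.groupby('brand')['price']
    .mean()
    .sort_values(ascending=False)
)
print(mean_price_by_brand)
```

Sorting the result makes rankings such as "top 10 most expensive" a matter of calling `.head(10)`.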

### Storing Aggregate Data in a DataFrame

In [30]:
# Average price for Top 10 Brands
avg_price_top10brand = dict()
top_10_brands = autos['brand'].value_counts(normalize=True).sort_values(ascending=False)[0:10].index

for brand in top_10_brands:
    selected_rows = autos[autos["brand"] == brand]
    mean = selected_rows["price"].mean()
    avg_price_top10brand[brand] = mean

avg_price_top10brand

{'volkswagen': 5436.950096948668,
 'bmw': 8249.652429467085,
 'opel': 3005.4960772480385,
 'mercedes_benz': 8573.484922939468,
 'audi': 9339.529967669734,
 'ford': 3740.2639060568604,
 'renault': 2496.070577451879,
 'peugeot': 3113.860549132948,
 'fiat': 2836.8736310025274,
 'seat': 4433.419621749409}

In [31]:
# Average mileage for Top 10 Brands
avg_mileage_top10brand = dict()
top_10_brands = autos['brand'].value_counts(normalize=True).sort_values(ascending=False)[0:10].index

for brand in top_10_brands:
    selected_rows = autos[autos["brand"] == brand]
    mean = selected_rows["odometer_km"].mean()
    avg_mileage_top10brand[brand] = mean

avg_mileage_top10brand

{'volkswagen': 128799.87753852434,
 'bmw': 132756.66144200627,
 'opel': 129384.42969221485,
 'mercedes_benz': 131088.89881617154,
 'audi': 129276.29942800298,
 'ford': 124300.06180469715,
 'renault': 128281.3932172319,
 'peugeot': 127127.8901734104,
 'fiat': 116950.29486099411,
 'seat': 121536.64302600473}

In [32]:
# Convert both dictionaries to Series
bmp_series = pd.Series(avg_price_top10brand)
bmm_series = pd.Series(avg_mileage_top10brand)

In [33]:
# Convert to DataFrame
df = pd.DataFrame(bmp_series, columns=['mean_price'])
df['mean_mileage'] = bmm_series
df

Unnamed: 0,mean_price,mean_mileage
volkswagen,5436.950097,128799.877539
bmw,8249.652429,132756.661442
opel,3005.496077,129384.429692
mercedes_benz,8573.484923,131088.898816
audi,9339.529968,129276.299428
ford,3740.263906,124300.061805
renault,2496.070577,128281.393217
peugeot,3113.860549,127127.890173
fiat,2836.873631,116950.294861
seat,4433.419622,121536.643026


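The top 10 most frequent brands span a wide range of average prices, from about 2.5k (Renault) to 9.3k (Audi), while their average mileage sits in a narrow band of roughly 117,000 to 133,000 km. Mileage alone therefore does not explain the price differences between these brands.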
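The same summary table could be built in one step with named aggregation instead of two dictionaries and two Series. The sketch below uses made-up data and the top 2 brands rather than the top 10; the column names mirror the ones used above.

```python
import pandas as pd

sample = pd.DataFrame({
    'brand': ['vw', 'vw', 'vw', 'bmw', 'bmw', 'fiat'],
    'price': [5000.0, 6000.0, 4000.0, 8000.0, 9000.0, 2500.0],
    'odometer_km': [150000.0, 125000.0, 100000.0, 150000.0, 100000.0, 90000.0],
})

# Most frequent brands, in descending order of frequency
top_brands = sample['brand'].value_counts().head(2).index

summary = (
    sample[sample['brand'].isin(top_brands)]
    .groupby('brand')
    .agg(mean_price=('price', 'mean'), mean_mileage=('odometer_km', 'mean'))
    .loc[top_brands]  # restore frequency order (groupby sorts alphabetically)
)
print(summary)
```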

### Further Cleaning & Analysis

Data cleaning next steps:

* Identify categorical data that uses German words, translate them and map the values to their English counterparts
* Convert the dates to uniform numeric data, so "2016-03-21" becomes the integer 20160321.
* See if there are particular keywords in the name column that can be extracted as new columns

Analysis next steps:

* Find the most common brand/model combinations
* Split the odometer_km values into groups, and use aggregation to see if average prices follow any pattern based on the mileage.
* How much cheaper are cars with damage than their non-damaged counterparts?

### Identifying & Mapping Categorical Data from German to English

In [34]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
0,2016-03-26,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000.0,control,bus,2004,manuell,158,andere,150000.0,3,lpg,peugeot,nein,2016-03-26,79588,2016-04-06
1,2016-04-04,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500.0,control,limousine,1997,automatik,286,7er,150000.0,6,benzin,bmw,nein,2016-04-04,71034,2016-04-06
2,2016-03-26,Volkswagen_Golf_1.6_United,privat,Angebot,8990.0,test,limousine,2009,manuell,102,golf,70000.0,7,benzin,volkswagen,nein,2016-03-26,35394,2016-04-06
3,2016-03-12,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350.0,control,kleinwagen,2007,automatik,71,fortwo,70000.0,6,benzin,smart,nein,2016-03-12,33729,2016-03-15
4,2016-04-01,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350.0,test,kombi,2003,manuell,0,focus,150000.0,7,benzin,ford,nein,2016-04-01,39218,2016-04-01


The seller, offer_type, vehicle_type, gearbox, fuel_type, and unrepaired_damage columns all have German records to be translated. We will highlight unique values in the columns to understand, translate and replace the records.

In [35]:
autos['seller'].unique()
mapping_dict = {'privat':'private', 'gewerblich':'commercial'}
autos['seller'] = autos['seller'].map(mapping_dict)

autos['seller'].unique()

array(['private', 'commercial'], dtype=object)

In [36]:
autos['offer_type'].unique()
mapping_dict = {'Angebot':'Offer'}
autos['offer_type'] = autos['offer_type'].map(mapping_dict)

autos['offer_type'].unique()

array(['Offer'], dtype=object)

In [37]:
autos['vehicle_type'].unique()
mapping_dict = {'bus':'bus', 'limousine':'limousine','kleinwagen':'small car',
                'kombi':'station wagon', 'coupe':'coupe', 'suv':'suv',
                'cabrio':'convertible', 'andere':'other'}
autos['vehicle_type'] = autos['vehicle_type'].map(mapping_dict)

autos['vehicle_type'].unique()

array(['bus', 'limousine', 'small car', 'station wagon', nan, 'coupe',
       'suv', 'convertible', 'other'], dtype=object)

In [38]:
autos['gearbox'].unique()
def gearbox_values(a_series):
    a_series = a_series.str.replace('manuell','manual')
    a_series = a_series.str.replace('automatik','automatic')
    return a_series

autos['gearbox'] = gearbox_values(autos['gearbox'])
autos['gearbox'].unique()

array(['manual', 'automatic', nan], dtype=object)

In [39]:
autos['fuel_type'].unique()
def fueltype_values(a_series):
    a_series = a_series.str.replace('benzin','petrol')
    a_series = a_series.str.replace('elektro','electric')
    a_series = a_series.str.replace('andere','other')
    return a_series

autos['fuel_type'] = fueltype_values(autos['fuel_type'])
autos['fuel_type'].unique()

array(['lpg', 'petrol', 'diesel', nan, 'cng', 'hybrid', 'electric',
       'other'], dtype=object)

In [40]:
autos['unrepaired_damage'].unique()
mapping_dict = {'nein':'no', 'ja':'yes'}
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(mapping_dict)

autos['unrepaired_damage'].unique()

array(['no', nan, 'yes'], dtype=object)
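One caveat with `Series.map` and a dictionary is that any value missing from the dictionary becomes NaN. The sketch below illustrates this on a sample series; 'Gesuch' is a hypothetical unmapped value chosen for illustration. `Series.replace` is shown as an alternative that leaves unmapped values unchanged.

```python
import pandas as pd

offer = pd.Series(['Angebot', 'Gesuch', 'Angebot'])  # 'Gesuch' is a hypothetical second value
mapping = {'Angebot': 'Offer'}

mapped = offer.map(mapping)      # values missing from the dict become NaN
kept = offer.replace(mapping)    # values missing from the dict are left as they are

print(mapped.tolist())
print(kept.tolist())
```

Checking `mapped.isna().sum()` against the null count before mapping is a simple way to confirm that no category was dropped by accident.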

### Convert the dates to be uniform numeric data

Result - "2016-03-21" becomes the integer 20160321

In [41]:
autos[date_cols[0:3]]
def clean_date(a_series):
    a_series = a_series.str.replace('-','')
    a_series = a_series.astype(int)
    return a_series

autos['date_crawled'] = clean_date(autos['date_crawled'])
autos['last_seen'] = clean_date(autos['last_seen'])
autos['ad_created'] = clean_date(autos['ad_created'])

autos[date_cols[0:3]]

Unnamed: 0,date_crawled,last_seen,ad_created
0,20160326,20160406,20160326
1,20160404,20160406,20160404
2,20160326,20160406,20160326
3,20160312,20160315,20160312
4,20160401,20160401,20160401
...,...,...,...
49995,20160327,20160401,20160327
49996,20160328,20160402,20160328
49997,20160402,20160404,20160402
49998,20160308,20160405,20160308
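The same integer encoding can be produced by parsing the strings as dates first, which validates them along the way. This is a sketch on sample values, not a replacement for the function above:

```python
import pandas as pd

dates = pd.Series(['2016-03-21', '2016-04-07'])

# Parse, reformat without separators, then cast to integer
as_int = (
    pd.to_datetime(dates, format='%Y-%m-%d')
    .dt.strftime('%Y%m%d')
    .astype(int)
)
print(as_int.tolist())
```

Note that integers in this form sort correctly but do not support date arithmetic: 20160401 minus 20160331 is 70, not 1 day.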


### Review Name Column
See if there are particular keywords in the name column that you can extract as new columns

In [42]:
autos[['name', 'brand', 'model']].head(50)
# Later Analysis

Unnamed: 0,name,brand,model
0,Peugeot_807_160_NAVTECH_ON_BOARD,peugeot,andere
1,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,bmw,7er
2,Volkswagen_Golf_1.6_United,volkswagen,golf
3,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,smart,fortwo
4,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,ford,focus
5,Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...,chrysler,voyager
6,VW_Golf_III_GT_Special_Electronic_Green_Metall...,volkswagen,golf
7,Golf_IV_1.9_TDI_90PS,volkswagen,golf
8,Seat_Arosa,seat,arosa
9,Renault_Megane_Scenic_1.6e_RT_Klimaanlage,renault,megane


### Find the most common brand/model combinations
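One possible approach, sketched on made-up rows, is to group by both columns and count the rows in each group. By default `groupby` drops rows where a key is null, so listings without a model would be excluded from these counts.

```python
import pandas as pd

sample = pd.DataFrame({
    'brand': ['volkswagen', 'volkswagen', 'volkswagen', 'bmw', 'bmw', 'volkswagen'],
    'model': ['golf', 'golf', 'golf', '3er', '3er', 'polo'],
})

# Number of listings per brand/model pair, most common first
combo_counts = (
    sample.groupby(['brand', 'model'])
    .size()
    .sort_values(ascending=False)
)
print(combo_counts)
```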

### Find Patterns in Odometer Columns
Split the odometer_km values into groups, and use aggregation to see if average prices follow any pattern based on the mileage.
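A sketch of how this could be done with `pd.cut`, using made-up rows. The bin edges and labels are assumptions chosen for illustration; intervals are right-inclusive by default, so 150,000 km falls in the last group.

```python
import pandas as pd

sample = pd.DataFrame({
    'odometer_km': [5000.0, 40000.0, 70000.0, 125000.0, 150000.0, 150000.0],
    'price': [13200.0, 22900.0, 8990.0, 4000.0, 1350.0, 1980.0],
})

bins = [0, 50000, 100000, 150000]               # assumed group boundaries
labels = ['0-50k', '50k-100k', '100k-150k']     # assumed group names

sample['mileage_group'] = pd.cut(sample['odometer_km'], bins=bins, labels=labels)

# Mean price within each mileage group
mean_price_by_group = sample.groupby('mileage_group', observed=True)['price'].mean()
print(mean_price_by_group)
```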

### Determine Relationship between Price and Un-repaired Damage
How much cheaper are cars with damage than their non-damaged counterparts?
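A minimal sketch of one way to answer this, on made-up rows with the translated 'yes'/'no' values. Rows with a null damage flag are dropped by `groupby`, so they do not affect either average.

```python
import pandas as pd

sample = pd.DataFrame({
    'unrepaired_damage': ['no', 'no', 'yes', None, 'yes'],
    'price': [6000.0, 8000.0, 2000.0, 3000.0, 3000.0],
})

# Mean price for damaged and undamaged cars
mean_price = sample.groupby('unrepaired_damage')['price'].mean()

# How much cheaper damaged cars are on average
difference = mean_price['no'] - mean_price['yes']
print(mean_price)
print(difference)
```

A raw difference like this does not control for brand, age or mileage, so it should be read as a first indication only.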