# Analyzing Used Car Sales Data from *eBay Kleinanzeigen* Listings

## Introduction

We will clean and analyze a 50,000-point dataset of used car listings from *eBay Kleinanzeigen*, a classifieds section of German eBay.

The dataset used is a modified subset of a larger, cleaned dataset, originally scraped and uploaded to Kaggle by user orgesleka. This full dataset can now be found here: [used_car_full_data](https://data.world/data-society/used-cars-data).

The data dictionary for the modified dataset used in the analysis is listed below:
- `dateCrawled` -- date when ad was first crawled; all field values taken from this date;
- `name` -- name of car;
- `seller` -- whether seller is private or dealer;
- `offerType` -- type of listing;
- `price` -- price listed for sale of car;
- `abtest` -- whether listing included in A/B test;
- `vehicleType` -- type of vehicle for sale;
- `yearOfRegistration` -- year in which car first registered;
- `gearbox` -- transmission type;
- `powerPS` -- power of car in PS (metric horsepower, German *Pferdestärke*);
- `model` -- car model name;
- `odometer` -- number of kilometers car has driven;
- `monthOfRegistration` -- month in which car first registered;
- `fuelType` -- type of fuel car uses;
- `brand` -- brand of car;
- `notRepairedDamage` -- whether car has damage yet to be repaired;
- `dateCreated` -- date on which eBay listing created;
- `nrOfPictures` -- number of pictures included in ad;
- `postalCode` -- postal code for location of vehicle;
- `lastSeen` -- date and time at which crawler last saw ad online.

## Exploring and cleaning the data

We'll first take a look at the dataset of used car listings before we manipulate it in order to get a better sense of the information stored and of what cleaning tasks need to be performed. We'll then perform a number of cleaning procedures to rename the columns, convert numeric data from text, and remove outliers.

In [1]:
# import libraries
import numpy as np
import pandas as pd

# read in dataset
autos = pd.read_csv('autos.csv', encoding='Latin-1')

In [2]:
# view dataset
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
5,2016-03-21 13:47:45,Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...,privat,Angebot,"$7,900",test,bus,2006,automatik,150,voyager,"150,000km",4,diesel,chrysler,,2016-03-21 00:00:00,0,22962,2016-04-06 09:45:21
6,2016-03-20 17:55:21,VW_Golf_III_GT_Special_Electronic_Green_Metall...,privat,Angebot,$300,test,limousine,1995,manuell,90,golf,"150,000km",8,benzin,volkswagen,,2016-03-20 00:00:00,0,31535,2016-03-23 02:48:59
7,2016-03-16 18:55:19,Golf_IV_1.9_TDI_90PS,privat,Angebot,"$1,990",control,limousine,1998,manuell,90,golf,"150,000km",12,diesel,volkswagen,nein,2016-03-16 00:00:00,0,53474,2016-04-07 03:17:32
8,2016-03-22 16:51:34,Seat_Arosa,privat,Angebot,$250,test,,2000,manuell,0,arosa,"150,000km",10,,seat,nein,2016-03-22 00:00:00,0,7426,2016-03-26 18:18:10
9,2016-03-16 13:47:02,Renault_Megane_Scenic_1.6e_RT_Klimaanlage,privat,Angebot,$590,control,bus,1997,manuell,90,megane,"150,000km",7,benzin,renault,nein,2016-03-16 00:00:00,0,15749,2016-04-06 10:46:35


In [3]:
# quick glance at dataset info + first 5 rows
print(autos.info())
print(autos.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null obj

At first glance, we observe that the dataset contains twenty columns of data. Most of these columns consist of string values.

Five columns are missing information (have null values) for between approximately 5% and 20% of the data: `vehicleType`, `gearbox`, `model`, `fuelType`, and `notRepairedDamage`.
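The share of missing values per column can be computed directly rather than read off the `info()` output. The following is a minimal sketch on a made-up stand-in table (`toy` and its values are assumptions for illustration, not rows from the dataset):

```python
import pandas as pd

# toy stand-in for the listings table; the values are made up for illustration
toy = pd.DataFrame({
    'vehicleType': ['bus', None, 'kombi', 'limousine'],
    'gearbox': ['manuell', 'automatik', None, None],
    'brand': ['peugeot', 'bmw', 'ford', 'seat'],
})

# fraction of null entries per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real table, the same expression would be `autos.isnull().mean()`.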

Three of the columns also have data types that don't suit the information they hold: `price` and `odometer` are stored as text but should be numeric, and `notRepairedDamage` (renamed `unrepaired_damage` below) could be represented as a boolean.

We also note the column names are written in camelCase, but we would prefer snake_case, following Python's PEP 8 naming convention for variables.

### Renaming the data columns

The first step in cleaning the data is to edit the column names to be easier to read and understand. We first convert all of the column names from camelCase to snake_case, as variables written in this style are easier for readers to quickly recognize. We then modify some of the column names to be more descriptive of the data contained in the corresponding columns.

In [4]:
# print array of existing column names
print(autos.columns)

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')


In [5]:
# define function to change string from camel case to snake case
def camel_to_snake_case(camel_str):
    snake_str = ''
    for c in camel_str:
        if c.isupper():
            c = c.lower()
            c = '_' + c
        snake_str += c
    return snake_str
   
# loop through column name array and change case
auto_cols = []
for col in autos.columns:
    new_col = camel_to_snake_case(col)
    # modify certain column names to be more descriptive
    if new_col == "year_of_registration":
        new_col = "registration_year"
    elif new_col == "month_of_registration":
        new_col = "registration_month"
    elif new_col == "not_repaired_damage":
        new_col = "unrepaired_damage"
    elif new_col == "date_created":
        new_col = "ad_created"
    # add new column name to list
    auto_cols.append(new_col)
    
# assign modified column names back to df.columns attribute
autos.columns = auto_cols
# check assignment
print(autos.columns)


Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_p_s', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
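Note that the character-by-character conversion splits the acronym in `powerPS` into `power_p_s`. If keeping runs of capitals together is preferred, one possible alternative is a regular expression that inserts an underscore only where a lowercase letter or digit is followed by a capital. This is a sketch, not the approach used in this notebook, and `camel_to_snake_regex` is a name chosen here for illustration:

```python
import re

def camel_to_snake_regex(name):
    # put "_" between a lowercase letter/digit and the capital that follows it,
    # then lowercase everything; consecutive capitals stay together
    return re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', name).lower()

for col in ['dateCrawled', 'powerPS', 'nrOfPictures', 'abtest']:
    print(col, '->', camel_to_snake_regex(col))
```

The rest of this analysis keeps the `power_p_s` name produced by the loop above.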


In [6]:
# check first 5 rows after column name modifications
print(autos.head())

          date_crawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   

   seller offer_type   price   abtest vehicle_type  registration_year  \
0  privat    Angebot  $5,000  control          bus               2004   
1  privat    Angebot  $8,500  control    limousine               1997   
2  privat    Angebot  $8,990     test    limousine               2009   
3  privat    Angebot  $4,350  control   kleinwagen               2007   
4  privat    Angebot  $1,350     test        kombi               2003   

     gearbox  power_p_s   model   odometer  registration_month fuel_type  \
0    manuell        15

### Investigating column data

Next we want to investigate the individual columns to identify those containing irrelevant information, inappropriate data types, and outliers. We will then clean and correct for such instances.

In [7]:
# view descriptive statistics for all columns in dataset
print(autos.describe(include='all'))

               date_crawled         name  seller offer_type  price abtest  \
count                 50000        50000   50000      50000  50000  50000   
unique                48213        38754       2          2   2357      2   
top     2016-03-14 20:50:02  Ford_Fiesta  privat    Angebot     $0   test   
freq                      3           78   49999      49999   1421  25756   
mean                    NaN          NaN     NaN        NaN    NaN    NaN   
std                     NaN          NaN     NaN        NaN    NaN    NaN   
min                     NaN          NaN     NaN        NaN    NaN    NaN   
25%                     NaN          NaN     NaN        NaN    NaN    NaN   
50%                     NaN          NaN     NaN        NaN    NaN    NaN   
75%                     NaN          NaN     NaN        NaN    NaN    NaN   
max                     NaN          NaN     NaN        NaN    NaN    NaN   

       vehicle_type  registration_year  gearbox     power_p_s  model  \
cou

#### Dropping irrelevant columns
The `seller` and `offer_type` data columns both contain values that are identical for all entries but one. As such, these columns don't lend much meaningful information to the dataset as a whole and can therefore safely be dropped without any real loss of data. The `nr_of_pictures` column is also uninformative, as it contains the same value (zero) for all entries, so it can also be dropped.

In [8]:
# drop `seller`, `offer_type`, and `nr_of_pictures` columns
autos.drop(['seller','offer_type','nr_of_pictures'], axis=1, inplace=True)

In [9]:
# check updated dataset with dropped columns
print(autos.head(0))

Empty DataFrame
Columns: [date_crawled, name, price, abtest, vehicle_type, registration_year, gearbox, power_p_s, model, odometer, registration_month, fuel_type, brand, unrepaired_damage, ad_created, postal_code, last_seen]
Index: []
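Columns dominated by a single value can also be found programmatically instead of by reading the `describe()` output. The sketch below uses a made-up table and an arbitrary 85% cutoff; `toy`, `dominant_share`, and the cutoff are assumptions for illustration:

```python
import pandas as pd

# toy stand-in: one almost-constant column, one constant column, one varied column
toy = pd.DataFrame({
    'seller': ['privat'] * 9 + ['gewerblich'],
    'nr_of_pictures': [0] * 10,
    'brand': ['bmw', 'ford', 'seat', 'opel', 'audi'] * 2,
})

def dominant_share(column):
    # relative frequency of the single most common value (nulls counted too)
    return column.value_counts(normalize=True, dropna=False).iloc[0]

shares = toy.apply(dominant_share)
near_constant = shares[shares >= 0.85].index.tolist()
print(near_constant)
```

Whether a flagged column should actually be dropped still depends on what it represents, so the list is a starting point for inspection rather than a rule.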


#### Removing outliers
It may also be worthwhile to look more closely at the following numerical columns to identify and remove outliers: `registration_year`, `registration_month`, and `power_p_s`. The minimum and maximum values for the `registration_year` are nonsensical, and the listings with registration years outside of a realistic range should be investigated further. Similarly, the `registration_month` minimum value of 0 does not represent an actual calendar month and should be looked into. Finally, the maximum value of the `power_p_s` column appears to be extremely high compared to the average and to what can realistically be expected, so we will need to take a closer look at this column as well.

In [10]:
# check number of rows before dropping outliers
print(autos.shape)

(50000, 17)


Let's first take a closer look at the data in the `registration_year` column. The modern automobile was first produced in 1886, and the dataset was crawled in 2016, so we should investigate the datapoints with registration years outside of this range.

In [11]:
# view `registration_year` statistics
print(autos['registration_year'].describe())

count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64


In [12]:
# define boolean array to check if registration year within desired range
ryear_out_range = (autos['registration_year'] < 1886) | \
                  (autos['registration_year'] > 2016)
# view number of data points with registration years outside of range
print(autos.loc[ryear_out_range, 'registration_year'].value_counts())
print(autos[ryear_out_range].shape[0])

2017    1453
2018     492
9999       4
5000       4
2019       3
9000       2
1800       2
6200       1
4500       1
8888       1
4800       1
2800       1
1001       1
1000       1
1111       1
1500       1
9996       1
5911       1
4100       1
Name: registration_year, dtype: int64
1972


We see that most of the registration years outside of the desired range occur only a handful of times, so we can safely regard these as outliers and throw them away. The years 2017 and 2018 are more common (1,453 and 492 listings, respectively), but a car listed in 2016 cannot have been first registered in a later year, so these values must also be wrong. In total, the 1,972 out-of-range rows represent only about 4% of the dataset, so we can remove all points with registration years outside of the 1886-2016 range without any significant loss of data.

In [13]:
# remove rows with registration years outside of designated range
autos.drop(autos[ryear_out_range].index, inplace=True)

In [14]:
# check dataset no longer contains registration years outside of range
print(autos.loc[ryear_out_range, 'registration_year'].value_counts())

Series([], Name: registration_year, dtype: int64)
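An equivalent way to express the same filter is to keep the in-range rows with `Series.between`, which is inclusive on both ends by default. The sketch below runs on made-up years (`toy` is an assumption for illustration) and also reports the share of rows removed:

```python
import pandas as pd

# toy stand-in with a few plausible and implausible registration years
toy = pd.DataFrame({'registration_year': [1000, 1999, 2004, 2016, 2017, 9999]})

# True for rows inside the accepted 1886-2016 range
in_range = toy['registration_year'].between(1886, 2016)
share_removed = 1 - in_range.mean()
filtered = toy[in_range]

print(share_removed)
print(filtered['registration_year'].tolist())
```

Filtering this way returns a new frame, so the boolean mask never goes out of step with the index it was built from.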


In [57]:
# view distribution of remaining values after removal of outliers
ryear_counts = autos['registration_year'].value_counts(normalize=True)
print(ryear_counts[:25])
ryc_sum = 0
for ryc in ryear_counts[:25]:
    ryc_sum += ryc
print(ryc_sum)

2000    0.067571
2005    0.062878
1999    0.062085
2004    0.057863
2003    0.057799
2006    0.057241
2001    0.056406
2002    0.053234
1998    0.050619
2007    0.048776
2008    0.047448
2009    0.044683
1997    0.041811
2011    0.034761
2010    0.034053
1996    0.029382
2012    0.028074
1995    0.026295
2016    0.026124
2013    0.017209
2014    0.014209
1994    0.013480
1993    0.009108
2015    0.008401
1992    0.007951
Name: registration_year, dtype: float64
0.9574600317174573


Of the remaining `registration_year` values, over 95% of them fall between 1992 and 2016.

Next, let's take a look at the outliers of the `registration_month` column, i.e. data points with registration months outside of the range of values 1-12 corresponding to real calendar months.

In [15]:
# define boolean array to check if registration month within desired range
rmonth_out_range = (autos['registration_month'] < 1) | \
                   (autos['registration_month'] > 12)
# view number of data points with registration months outside of range
print(autos.loc[rmonth_out_range, 'registration_month'].value_counts())

0    4587
Name: registration_month, dtype: int64


Entries with registration month values of zero make up nearly 10% of all data points. Instead of indicating bad data, it is likely a value of zero in the `registration_month` column simply represents a null entry, where the registration month is unknown, but the rest of the listing is valid. Therefore, we can leave this column as is.
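If statistics on the month are wanted later, one option (not applied in this analysis) is to mark the zeros as missing so that pandas skips them. A minimal sketch on made-up values, where the `registration_month_clean` column name is an assumption for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'registration_month': [3, 0, 7, 0, 12]})

# treat 0 as "unknown" in a separate column, leaving the original untouched
toy['registration_month_clean'] = toy['registration_month'].replace(0, np.nan)

print(toy['registration_month'].mean())        # zeros pull the mean down
print(toy['registration_month_clean'].mean())  # NaN entries are skipped
```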

Finally, we want to investigate the `power_p_s` column a bit further to see if we can remove any outliers. Before we move forward with any analysis of this column, however, let's convert it from metric horsepower (PS, the unit used in Germany) to mechanical horsepower (hp, the unit used in the US), using 1 PS ≈ 0.986 hp. We will also update the column name accordingly.

In [16]:
# print power values in PS
print(autos['power_p_s'].head())
# convert power values to HP
autos['power_p_s'] = autos['power_p_s']*0.986
# print power values in HP
print(autos['power_p_s'].head())

0    158
1    286
2    102
3     71
4      0
Name: power_p_s, dtype: int64
0    155.788
1    281.996
2    100.572
3     70.006
4      0.000
Name: power_p_s, dtype: float64
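The factor 0.986 used above follows from the definitions of the two units. A short sketch of that arithmetic, with constant and function names chosen here for illustration:

```python
# 1 PS (metric horsepower) is defined as 75 kgf*m/s
WATTS_PER_PS = 75 * 9.80665                      # 735.49875 W
# 1 hp (mechanical horsepower) is defined as 550 ft*lbf/s
WATTS_PER_HP = 550 * 0.3048 * 4.4482216152605    # approx. 745.69987 W

PS_TO_HP = WATTS_PER_PS / WATTS_PER_HP
print(round(PS_TO_HP, 5))

def ps_to_hp(ps):
    # convert a power value from metric to mechanical horsepower
    return ps * PS_TO_HP

print(ps_to_hp(158))
```

Rounding the factor to three decimals, as the notebook does, changes the result by well under 0.1 hp for typical car engines.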


In [17]:
# rename power column and check new headers
autos.rename(columns={'power_p_s': 'power_h_p'}, inplace=True)
print(autos.head(0))

Empty DataFrame
Columns: [date_crawled, name, price, abtest, vehicle_type, registration_year, gearbox, power_h_p, model, odometer, registration_month, fuel_type, brand, unrepaired_damage, ad_created, postal_code, last_seen]
Index: []


Now we can take a look at the outliers of our new horsepower column.

In [18]:
# view descriptive statistics for `power_h_p` column
print(autos['power_h_p'].describe())

count    48028.000000
mean       115.431431
std        192.419160
min          0.000000
25%         70.006000
50%        105.502000
75%        147.900000
max      17452.200000
Name: power_h_p, dtype: float64


In [19]:
# count number of cars with horsepower of 0
print(autos.loc[autos['power_h_p'] == 0,   'power_h_p'].shape[0])
# count number of cars with horsepower above 1500
print(autos.loc[autos['power_h_p'] > 1500, 'power_h_p'].shape[0])

4989
26


We see that about 10% of entries have `power_h_p` values of zero. Similar to the zero-values in the `registration_month` column, we can assume these are null entries representing unknown horsepower values, rather than bad data for zero-horsepower cars.

On the other hand, there are only a handful of entries with horsepower values above 1500, the approximate horsepower of the most powerful cars in existence. Thus, we will take 1500 to be the maximum allowed value for the `power_h_p` column and remove any outliers with larger values from the dataset.

In [20]:
# remove rows with horsepowers above maximum value (1500)
autos.drop(autos[autos['power_h_p'] > 1500].index, inplace=True)

In [21]:
# check number of rows after dropping outliers
print(autos.shape)

(48002, 17)


#### Converting columns to numeric data types
Additionally, the `price` and `odometer` columns contain numeric values stored as text. As such, any non-numeric characters need to be removed from the entries so they may be converted to numeric data types.

In [22]:
# remove '$' and ',' from price (regex=False treats both as literal characters)
autos['price'] = autos['price'].str.replace('$', '', regex=False)
autos['price'] = autos['price'].str.replace(',', '', regex=False)
# convert price from string to integer
autos['price'] = autos['price'].astype(int)
# check updated price column
print(autos['price'].head())

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64
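The same clean-up can be written as a single pass with a character class, followed by `pd.to_numeric`, which raises an error if any entry is still not a number. This is a sketch on made-up price strings (`raw_price` is an assumption for illustration), not the cell that produced the output above:

```python
import pandas as pd

raw_price = pd.Series(['$5,000', '$8,500', '$300', '$0'])

# strip every '$' and ',' in one pass; inside [...] both characters are literal
cleaned = raw_price.str.replace(r'[$,]', '', regex=True)
# to_numeric raises on anything that is still not a number
price = pd.to_numeric(cleaned)
print(price.tolist())
```

Passing `regex` explicitly avoids depending on the pandas version's default, since `'$'` on its own is also the regular-expression anchor for end of string.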


In [23]:
# remove ',' and 'km' from odometer
autos['odometer'] = autos['odometer'].str.replace(',','')
autos['odometer'] = autos['odometer'].str.replace('km','')
# convert odometer from string to integer
autos['odometer'] = autos['odometer'].astype(int)
# rename odometer column to include units
autos.rename(columns={'odometer': 'odometer_km'}, inplace=True)
# check updated odometer column
print(autos['odometer_km'].head())

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer_km, dtype: int64


In [24]:
autos['price'].describe()

count    4.800200e+04
mean     9.588527e+03
std      4.845129e+05
min      0.000000e+00
25%      1.150000e+03
50%      2.990000e+03
75%      7.400000e+03
max      1.000000e+08
Name: price, dtype: float64

#### Analyzing the `price` and `odometer_km` columns
Now that the `price` and `odometer_km` columns have been converted to numeric data types, we will want to further investigate the data within them to determine if there are any outliers that need to be removed.

In [25]:
# investigate `price` column
print('EXPLORING PRICE DATA')
print('')
# --> look at descriptive statistics
print('price stats:')
print(autos['price'].describe())
print('')
# --> view number of unique values
print('number of unique price values:')
print(autos['price'].unique().shape[0])
print('')
# --> look at top 10 unique values
print('10 most frequent price values:')
print(autos['price'].value_counts().head(10))
print('')
# --> look at 10 lowest values
print('10 lowest price values:')
print(autos['price'].value_counts().sort_index().head(10))
print('')
# --> look at 10 highest values
print('10 highest price values:')
print(autos['price'].value_counts().sort_index(ascending=False).head(10))
print('')

EXPLORING PRICE DATA

price stats:
count    4.800200e+04
mean     9.588527e+03
std      4.845129e+05
min      0.000000e+00
25%      1.150000e+03
50%      2.990000e+03
75%      7.400000e+03
max      1.000000e+08
Name: price, dtype: float64

number of unique price values:
2334

10 most frequent price values:
0       1334
500      757
1500     696
2500     614
1200     605
1000     601
600      510
3500     480
800      469
2000     438
Name: price, dtype: int64

10 lowest price values:
0     1334
1      150
2        2
3        1
5        2
8        1
9        1
10       6
11       2
12       3
Name: price, dtype: int64

10 highest price values:
99999999    1
27322222    1
12345678    1
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
Name: price, dtype: int64



The most frequent value in the `price` column is zero, which taken literally would mean a free car. Some of these 1,334 entries could be genuine giveaways, but a zero price is more plausibly a placeholder for a missing or negotiable price. A small but nonzero price such as $1 is unusual for a car, yet it is a plausible token or starting price, so we keep those entries. We therefore remove only the listings with a price of zero.

At the other extreme, we take roughly $5 million as a generous upper bound, on the order of the most expensive cars sold. Few used cars sell for anywhere near this much, although a vintage model might. We therefore remove any listing priced above $5 million; in the output above this affects six listings, all priced at 10 million or more.

In [26]:
# remove price outliers (less than $1 or above $5 million)
autos = autos[autos['price'].between(1,5000000)]

Let's take one more look at the `price` column now that we've updated it to exclude the outliers.

In [27]:
# investigate `price` column again
print('EXPLORING PRICE DATA AFTER REMOVING OUTLIERS')
print('')
# --> look at descriptive statistics
print('price stats:')
print(autos['price'].describe())
print('')
# --> view number of unique values
print('number of unique price values:')
print(autos['price'].unique().shape[0])
print('')
# --> look at top 10 unique values
print('10 most frequent price values:')
print(autos['price'].value_counts().head(10))
print('')
# --> look at 10 lowest values
print('10 lowest price values:')
print(autos['price'].value_counts().sort_index().head(10))
print('')
# --> look at 10 highest values
print('10 highest price values:')
print(autos['price'].value_counts().sort_index(ascending=False).head(10))
print('')

EXPLORING PRICE DATA AFTER REMOVING OUTLIERS

price stats:
count    4.666200e+04
mean     6.180154e+03
std      2.322246e+04
min      1.000000e+00
25%      1.250000e+03
50%      3.100000e+03
75%      7.500000e+03
max      3.890000e+06
Name: price, dtype: float64

number of unique price values:
2328

10 most frequent price values:
500     757
1500    696
2500    614
1200    605
1000    601
600     510
3500    480
800     469
2000    438
999     413
Name: price, dtype: int64

10 lowest price values:
1     150
2       2
3       1
5       2
8       1
9       1
10      6
11      2
12      3
13      2
Name: price, dtype: int64

10 highest price values:
3890000    1
1300000    1
1234566    1
999999     2
999990     1
350000     1
345000     1
299000     1
295000     1
265000     1
Name: price, dtype: int64



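After removing the `price` outliers, both the mean and the standard deviation of the column decrease: the mean falls from about 9,589 to about 6,180, and the standard deviation from about 484,513 to about 23,222. The mean is now closer to the bulk of the listings (the median is 3,100), and the much smaller standard deviation shows how strongly a handful of extreme prices had inflated the spread.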

In [28]:
# investigate `odometer_km` column
print('EXPLORING ODOMETER_KM DATA')
print('')
# --> look at descriptive statistics
print('odometer_km stats:')
print(autos['odometer_km'].describe())
print('')
# --> view number of unique values
print('number of unique odometer_km values:')
print(autos['odometer_km'].unique().shape[0])
print('')
# --> look at top 10 unique values
print('10 most frequent odometer_km values:')
print(autos['odometer_km'].value_counts().head(10))
print('')
# --> look at 10 lowest values
print('10 lowest odometer_km values:')
print(autos['odometer_km'].value_counts().sort_index().head(10))
print('')
# --> look at 10 highest values
print('10 highest odometer_km values:')
print(autos['odometer_km'].value_counts().sort_index(ascending=False).head(10))
print('')

EXPLORING ODOMETER_KM DATA

odometer_km stats:
count     46662.000000
mean     125588.487420
std       39850.947118
min        5000.000000
25%      100000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

number of unique odometer_km values:
13

10 most frequent odometer_km values:
150000    30074
125000     4854
100000     2058
90000      1672
80000      1374
70000      1185
60000      1128
50000       994
40000       797
5000        784
Name: odometer_km, dtype: int64

10 lowest odometer_km values:
5000      784
10000     241
20000     741
30000     760
40000     797
50000     994
60000    1128
70000    1185
80000    1374
90000    1672
Name: odometer_km, dtype: int64

10 highest odometer_km values:
150000    30074
125000     4854
100000     2058
90000      1672
80000      1374
70000      1185
60000      1128
50000       994
40000       797
30000       760
Name: odometer_km, dtype: int64



We see there are only 13 unique values of mileage in the `odometer_km` column, all of which seem reasonable. This indicates the mileage entries are likely chosen from a preset series of options, so there are no outliers. Therefore, we will leave this column as is.

### Analyzing date columns
Now we will look at the three columns containing date information: `date_crawled`, `ad_created`, and `last_seen`. We sort each column by date from earliest to latest and calculate the distribution of values in each as fractions of the total. We also look at the five most frequently occurring dates in each column.

In [29]:
# sort `date_crawled` from earliest to latest and calculate percentages
print('DATE AD CRAWLED:')
print(autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index())
# view five most frequent values
print('5 most frequent dates:')
print(autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).head(5))

DATE AD CRAWLED:
2016-03-05    0.025203
2016-03-06    0.014166
2016-03-07    0.036261
2016-03-08    0.033560
2016-03-09    0.033260
2016-03-10    0.032232
2016-03-11    0.032468
2016-03-12    0.036839
2016-03-13    0.015880
2016-03-14    0.036304
2016-03-15    0.034353
2016-03-16    0.029467
2016-03-17    0.031803
2016-03-18    0.012794
2016-03-19    0.034675
2016-03-20    0.037997
2016-03-21    0.037332
2016-03-22    0.032875
2016-03-23    0.032210
2016-03-24    0.029489
2016-03-25    0.031525
2016-03-26    0.032060
2016-03-27    0.030796
2016-03-28    0.034589
2016-03-29    0.034096
2016-03-30    0.033796
2016-03-31    0.031803
2016-04-01    0.033796
2016-04-02    0.035511
2016-04-03    0.038725
2016-04-04    0.036625
2016-04-05    0.013008
2016-04-06    0.003086
2016-04-07    0.001414
Name: date_crawled, dtype: float64
5 most frequent dates:
2016-04-03    0.038725
2016-03-20    0.037997
2016-03-21    0.037332
2016-03-12    0.036839
2016-04-04    0.036625
Name: date_crawled, dtype: f

All of the data was crawled between March 5th and April 7th of 2016, with a few percent of the total dataset being crawled each day.
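Slicing the first ten characters works because every timestamp has the same layout. An alternative is to parse the strings with `pd.to_datetime`, which also checks that every entry is a well-formed timestamp. The sketch below uses three timestamps copied from the preview at the top of the notebook; the `toy` frame itself is an assumption for illustration:

```python
import pandas as pd

toy = pd.DataFrame({'date_crawled': ['2016-03-26 17:47:46',
                                     '2016-04-04 13:38:56',
                                     '2016-03-26 18:57:24']})

# parsing fails loudly if any entry does not match the expected format
crawled = pd.to_datetime(toy['date_crawled'], format='%Y-%m-%d %H:%M:%S')

# share of rows per calendar day, earliest first
daily_share = (crawled.dt.strftime('%Y-%m-%d')
               .value_counts(normalize=True)
               .sort_index())
print(daily_share)
print(crawled.min(), crawled.max())
```

Once parsed, the column also supports date arithmetic and direct `min()`/`max()` checks of the crawl window.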

In [30]:
# sort `ad_created` from earliest to latest and calculate percentages
print('DATE AD CREATED:')
print(autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index())
# view five most frequent values
print('5 most frequent dates:')
print(autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).head())

DATE AD CREATED:
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000043
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-27    0.000043
2016-02-01    0.000021
2016-02-02    0.000043
2016-02-05    0.000043
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000043
2016-02-14    0.000043
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000043
2016-02-19    0.000064
2016-02-20    0.000043
2016-02-21    0.000064
2016-02-22    0.000021
2016-02-23    0.000086
                ...   
2016-03-09    0.033303
2016-03-10    0.031953
2016-03-11    0.032768
2016-03-12    0.036668
2016-03-13    0.017187
2016-03-14    0.034953
2016-03-15    0.034161
2016-03-16    0.029917
2016-03-17    0.031482
2016-03-18    0.013480
2016-03-19    0.033625
2016-03-20    0.0

The ads were created between June 2015 and April 2016, with the vast majority created in March and April 2016, the same window in which the data was crawled.

In [31]:
# sort `last_seen` from earliest to latest and calculate percentages
print('DATE AD LAST SEEN:')
print(autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index())
# view five most frequent values
print('5 most frequent dates:')
print(autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).head(5))

DATE AD LAST SEEN:
2016-03-05    0.001072
2016-03-06    0.004115
2016-03-07    0.005379
2016-03-08    0.007479
2016-03-09    0.009772
2016-03-10    0.010694
2016-03-11    0.012387
2016-03-12    0.023767
2016-03-13    0.008658
2016-03-14    0.012666
2016-03-15    0.016009
2016-03-16    0.016287
2016-03-17    0.028074
2016-03-18    0.007222
2016-03-19    0.015623
2016-03-20    0.020638
2016-03-21    0.020595
2016-03-22    0.020852
2016-03-23    0.018302
2016-03-24    0.019695
2016-03-25    0.018945
2016-03-26    0.016802
2016-03-27    0.015623
2016-03-28    0.020724
2016-03-29    0.022116
2016-03-30    0.024624
2016-03-31    0.023595
2016-04-01    0.022952
2016-04-02    0.024688
2016-04-03    0.025138
2016-04-04    0.024110
2016-04-05    0.125455
2016-04-06    0.223222
2016-04-07    0.132720
Name: last_seen, dtype: float64
5 most frequent dates:
2016-04-06    0.223222
2016-04-07    0.132720
2016-04-05    0.125455
2016-03-17    0.028074
2016-04-03    0.025138
Name: last_seen, dtype: float

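The dates on which ads were last seen by the crawler span the same range as the crawl dates, March 5th to April 7th, 2016. The last three days account for about 48% of the values, which most likely reflects the end of the crawl (ads that were still online at that point) rather than a sudden surge in sales.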

### Aggregating brand data
Next we will aggregate the data by car brand.

In [86]:
brand_counts = autos['brand'].value_counts(normalize=True)
ibc = 0
sum_bc = 0
for brandct in brand_counts:
    ibc += 1
    sum_bc += brandct
    if sum_bc > 0.95:
        break
print(ibc, sum_bc)

23 0.9513522780849514


We see that the 23 most common brands account for just over 95% of the listings, so we will aggregate the data over these top 23 brands and ignore the rest.
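The running total computed by the loop above can also be obtained with `cumsum`. The following sketch uses a made-up brand column and a 75% threshold so the result is easy to check by hand; `brands`, `threshold`, and `n_brands` are names chosen here for illustration:

```python
import pandas as pd

# toy stand-in: 10 listings spread over 4 brands
brands = pd.Series(['vw'] * 5 + ['bmw'] * 3 + ['opel'] * 1 + ['fiat'] * 1)
brand_share = brands.value_counts(normalize=True)   # sorted, largest first
cumulative = brand_share.cumsum()

threshold = 0.75
# count the brands whose running total stays at or below the threshold,
# then add one for the brand that pushes the total past it
n_brands = int((cumulative <= threshold).sum()) + 1
print(n_brands, cumulative.iloc[n_brands - 1])
```

This assumes the threshold is below 1, so that some brand does push the running total past it.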

In [122]:
# create empty dictionary to store mean price per car brand
mean_price_per_brand = {}
# create array of top 23 unique car brands
brands = brand_counts.index[:23]
# loop over selected brands
for brand in brands:
    # select rows corresponding to specific brand
    brand_rows = autos[autos['brand'] == brand]
    # calculate mean price for brand
    mean_price = brand_rows['price'].mean()
    # assign mean price to dictionary
    mean_price_per_brand[brand] = mean_price
# sort and print dictionary
sorted_price_per_brand = sorted(mean_price_per_brand.items(), 
                                key=lambda x: x[1],
                                reverse=True)
for idict in sorted_price_per_brand:
    print(idict[0], '--', idict[1])

sonstige_autos -- 23567.513043478262
mini -- 10613.459657701711
audi -- 9337.807378063877
mercedes_benz -- 8630.716
bmw -- 8571.75136292835
skoda -- 6368.0
kia -- 5982.330303030303
volkswagen -- 5605.204280352977
hyundai -- 5365.254273504273
toyota -- 5167.091062394604
volvo -- 4946.501170960188
nissan -- 4743.40252454418
seat -- 4401.218309859155
mazda -- 4112.596614950635
honda -- 4107.857923497268
ford -- 4056.534191965655
citroen -- 3783.7013782542112
smart -- 3581.7848484848487
mitsubishi -- 3394.5729166666665
peugeot -- 3094.0172290021537
opel -- 2974.8289814630257
fiat -- 2813.748538011696
renault -- 2475.4668181818183


The most expensive category is `sonstige_autos` ("other cars"), a catch-all that includes a number of luxury and sports cars such as Corvettes, Cadillacs, Ferraris, and Lamborghinis. The next most expensive brand is Mini, followed by Audi, Mercedes-Benz, and BMW, with mean prices ranging from about \$10,600 down to about \$8,600. The least expensive brands on average are Opel, Fiat, and Renault, each with a mean price under \$3,000.
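The per-brand loop above can also be written as a single `groupby` aggregation. The sketch below uses a made-up dataframe with only two of the relevant columns; `toy_autos`, `top_brands`, and `mean_price_by_brand` are hypothetical names, and the cutoff of two brands stands in for the 23 used in the analysis.

```python
import pandas as pd

# Toy stand-in for the cleaned autos dataframe; the values are made up
toy_autos = pd.DataFrame({
    'brand': ['audi', 'audi', 'opel', 'opel', 'opel', 'saab'],
    'price': [9000, 11000, 2500, 3000, 3500, 4000],
})

# keep only the most common brands, mirroring brand_counts.index[:23]
top_brands = toy_autos['brand'].value_counts().index[:2]
subset = toy_autos[toy_autos['brand'].isin(top_brands)]

# mean price per brand, highest first
mean_price_by_brand = (
    subset.groupby('brand')['price'].mean().sort_values(ascending=False)
)
print(mean_price_by_brand)
```

The result is a Series indexed by brand, so it can be combined with other per-brand aggregates without first building dictionaries.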

Let's now take a look at just the six most popular car brands, calculating both the mean price and the mean mileage of each.

In [145]:
# create array of top 6 unique car brands
brands_top6 = brand_counts.index[:6]

# create empty dictionaries to store mean price and mean mileage per top-6 car brand
mean_price_per_brand_top6 = {}
mean_mileage_per_brand_top6 = {}
# loop over top 6 selected brands
for brand6 in brands_top6:
    # select rows corresponding to specific top-6 brand
    brand6_rows = autos[autos['brand'] == brand6]
    # calculate mean price for top-6 brand and assign to dictionary
    mean_price_top6 = brand6_rows['price'].mean()
    mean_price_per_brand_top6[brand6] = mean_price_top6
    # calculate mean mileage for top-6 brand and assign to dictionary
    mean_mileage_top6 = brand6_rows['odometer_km'].mean()
    mean_mileage_per_brand_top6[brand6] = mean_mileage_top6

In [147]:
# convert dictionaries to series objects
brand_price_series = pd.Series(mean_price_per_brand_top6).sort_values(ascending=False)
brand_mileage_series = pd.Series(mean_mileage_per_brand_top6).sort_values(ascending=False)
# create dataframe from mean price series
brand_price_df = pd.DataFrame(brand_price_series, columns=['mean_price'])
# add mean mileage series to dataframe
brand_price_df['mean_mileage'] = brand_mileage_series
# print dataframe
print(brand_price_df)

                mean_price   mean_mileage
audi           9337.807378  129179.252290
mercedes_benz  8630.716000  130793.333333
bmw            8571.751363  132569.119938
volkswagen     5605.204280  128720.458464
ford           4056.534192  124242.563631
opel           2974.828981  129348.216065


All six of the most popular brands have a similar mean mileage, between about 124,000 km and 133,000 km, while their mean prices differ by more than a factor of three. This suggests that the price differences between these brands depend more on the brand than on mileage.
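Brand-level means cannot show how price varies with mileage within a brand. One way to probe that, sketched here on made-up numbers, is to group listings into mileage bands and compare mean prices per brand and band. The band edges, the `mileage_band` column, and the name `toy_autos` are assumptions chosen for illustration, not part of the analysis above.

```python
import pandas as pd

# Toy stand-in for the cleaned autos dataframe; the values are made up
toy_autos = pd.DataFrame({
    'brand': ['audi', 'audi', 'audi', 'opel', 'opel', 'opel'],
    'price': [15000, 9000, 4000, 6000, 3000, 1500],
    'odometer_km': [30000, 90000, 150000, 30000, 90000, 150000],
})

# hypothetical mileage bands; the cut points are arbitrary choices
bins = [0, 50000, 100000, 150000]
labels = ['0-50k', '50k-100k', '100k-150k']
toy_autos['mileage_band'] = pd.cut(toy_autos['odometer_km'], bins=bins, labels=labels)

# mean price for every brand and mileage band, one row per brand
price_by_brand_and_band = (
    toy_autos.groupby(['brand', 'mileage_band'], observed=True)['price']
    .mean()
    .unstack()
)
print(price_by_brand_and_band)
```

Reading across a row shows how price changes with mileage for one brand, and reading down a column compares brands at similar mileage.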

## Conclusion
In this project, we explored, cleaned, and analyzed used car listings from a German classifieds website. We found that among the six most popular car brands listed for sale on *eBay Kleinanzeigen*, Audi, Mercedes-Benz, and BMW are the most expensive on average, Volkswagen is mid-range in price, and Ford and Opel have the lowest mean prices. The mean mileage of these brands is roughly the same and thus does not appear to be a deciding factor in the differences in mean price between them.