# Exploring Ebay Car Sales Data

In this project, we'll work with a dataset of used cars from *eBay Kleinanzeigen*, a [classifieds](https://en.wikipedia.org/wiki/Classified_advertising) section of the German eBay website.

The dataset was originally [scraped](https://en.wikipedia.org/wiki/Web_scraping) and uploaded to Kaggle by user [orgesleka](https://www.kaggle.com/orgesleka). The original dataset isn't available on Kaggle anymore, but it can be found [here](https://data.world/data-society/used-cars-data).

## Introduction

The data dictionary provided with data is as follows:

+ `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
+ `name` - Name of the car.
+ `seller` - Whether the seller is private or a dealer.
+ `offerType` - The type of listing
+ `price` - The price on the ad to sell the car.
+ `abtest` - Whether the listing is included in an A/B test.
+ `vehicleType` - The vehicle Type.
+ `yearOfRegistration` - The year in which the car was first registered.
+ `gearbox` - The transmission type.
+ `powerPS` - The power of the car in PS.
+ `model` - The car model name.
+ `kilometer` - How many kilometers the car has driven.
+ `monthOfRegistration` - The month in which the car was first registered.
+ `fuelType` - What type of fuel the car uses.
+ `brand` - The brand of the car.
+ `notRepairedDamag`e - If the car has a damage which is not yet repaired.
+ `dateCreated` - The date on which the eBay listing was created.
+ `nrOfPictures` - The number of pictures in the ad.
+ `postalCode` - The postal code for the location of the vehicle.
+ `lastSeenOnline` - When the crawler saw this ad last online.

The aim of this project is to clean the data and analyze the included used car listings. 

Let's start by importing the libraries we need and reading the dataset into pandas.

In [1]:
import pandas as pd
import numpy as np

with open('autos.csv', encoding='Latin-1') as f:
    autos = pd.read_csv(f)
    
print(autos.info())
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


We can make the following observations:

+ The dataset contains 20 columns, most of which are strings.
+ Some columns have null values, but none have more than ~20% null values.
+ The column names use [camelcase](https://en.wikipedia.org/wiki/Camel_case) instead of Python's preferred [snakecase](https://en.wikipedia.org/wiki/Snake_case), which means we can't just replace spaces with underscores.

## Cleaning Column Names

Let's convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary to be more descriptive.

In [2]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [3]:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
       'last_seen']
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_photos,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


## Initial Exploration and Cleaning

Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for: - Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis. - Examples of numeric data stored as text which can be cleaned and converted.

In [4]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_photos,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-21 20:37:19,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


In [5]:
autos['price'] = autos['price'].str.replace('$', '').str.replace(',', '')
autos['price'] = autos['price'].astype(int)

autos['odometer'] = autos['odometer'].str.replace('km', '').str.replace(',', '')
autos['odometer'] = autos['odometer'].astype(int)
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)

We learned that there are a number of text columns where almost all of the values are the same (`seller` and `offer_type`). We also converted the `price` and `odometer` columns to numeric types and renamed `odometer` to `odometer_km`.

## Exploring the Odometer and Price Columns

Let's continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the `odometer_km` and `price columns`. Here's the steps we'll take:

+ Analyze the columns using minimum and maximum values and look for any values that look unrealistically high or low (outliers) that we might want to remove.
+ We'll use:
+ `Series.unique().shape` to see how many unique values
+ `Series.describe()` to view min/max/median/mean etc
+ `Series.value_counts()`, with some variations:
  + chained to `.head()` if there are lots of values.
  + Because `Series.value_counts()` returns a series, we can   use `Series.sort_index()` with `ascending=` `True` or `False` to view the highest and lowest values with their counts (can also chain to `head()` here).
+ When removing outliers, we can do `df[(df["col"] > x ) & (df["col"] < y )]`, but it's more readable to use `df[df["col"].between(x,y)]`.

In [6]:
## Removing the scientific notation
pd.options.display.float_format = '{:.5f}'.format

autos['price'].describe()

count      50000.00000
mean        9840.04376
std       481104.38050
min            0.00000
25%         1100.00000
50%         2950.00000
75%         7200.00000
max     99999999.00000
Name: price, dtype: float64

In [7]:
autos['odometer_km'].describe()

count    50000.00000
mean    125732.70000
std      40042.21171
min       5000.00000
25%     125000.00000
50%     150000.00000
75%     150000.00000
max     150000.00000
Name: odometer_km, dtype: float64

A commonly used rule says that a data point is an outlier if it is more than 1.5 * IQR(the difference between 75th and 25th percentiles) above the third quartile or below the first quartile. Said differently, low outliers are below Q1 - 1.5IQR and high outliers are above Q3 + 1.5IQR.

Let's clean both series.

In [8]:
iqr_price = autos['price'].describe()['75%'] - autos['price'].describe()['25%']

price_lower_bound = autos['price'].describe()['25%'] - (1.5 * iqr_price)
price_higher_bound = autos['price'].describe()['75%'] + (1.5 * iqr_price)

autos = autos[autos['price'].between(price_lower_bound, price_higher_bound)]
autos['price'].describe()

count   46216.00000
mean     3963.69610
std      3847.23868
min         0.00000
25%      1000.00000
50%      2500.00000
75%      5900.00000
max     16350.00000
Name: price, dtype: float64

In [9]:
iqr_price = autos['odometer_km'].describe()['75%'] - autos['price'].describe()['25%']

price_lower_bound = autos['odometer_km'].describe()['25%'] - (1.5 * iqr_price)
price_higher_bound = autos['odometer_km'].describe()['75%'] + (1.5 * iqr_price)

autos = autos[autos['odometer_km'].between(price_lower_bound, price_higher_bound)]
autos['odometer_km'].describe()

count    46216.00000
mean    129603.27592
std      36811.59610
min       5000.00000
25%     125000.00000
50%     150000.00000
75%     150000.00000
max     150000.00000
Name: odometer_km, dtype: float64

In [10]:
autos['odometer_km'].value_counts().sort_index(ascending=True)

5000        863
10000       137
20000       460
30000       496
40000       530
50000       717
60000       883
70000       967
80000      1199
90000      1526
100000     1917
125000     4834
150000    31687
Name: odometer_km, dtype: int64

## Exploring the Date Columns

Let's now move on to the date columns and understand the date range the data covers. There are 5 columns that should represent date values: `date_crawled`, `last_seen`, `ad_created`, `registration_month`, `registration_year`. Right now, the `date_crawled`, `last_seen`, and `ad_created columns` are all identified as string values by pandas. 

Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. The other two columns are represented as numeric values, so we can use methods like `Series.describe()` to understand the distribution without any extra data processing.

In [11]:
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True) * 100

2016-03-05   2.55756
2016-03-06   1.38913
2016-03-07   3.60481
2016-03-08   3.36031
2016-03-09   3.32569
2016-03-10   3.26294
2016-03-11   3.23914
2016-03-12   3.72165
2016-03-13   1.54925
2016-03-14   3.70651
2016-03-15   3.38627
2016-03-16   2.98382
2016-03-17   3.18504
2016-03-18   1.29392
2016-03-19   3.43604
2016-03-20   3.76493
2016-03-21   3.73247
2016-03-22   3.26510
2016-03-23   3.26510
2016-03-24   2.91241
2016-03-25   3.21534
2016-03-26   3.28890
2016-03-27   3.06820
2016-03-28   3.47715
2016-03-29   3.42522
2016-03-30   3.37978
2016-03-31   3.16773
2016-04-01   3.29540
2016-04-02   3.49230
2016-04-03   3.86230
2016-04-04   3.63727
2016-04-05   1.29609
2016-04-06   0.31807
2016-04-07   0.13415
Name: date_crawled, dtype: float64

In [12]:
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True) * 100

2015-08-10   0.00216
2015-09-09   0.00216
2015-11-10   0.00216
2015-12-05   0.00216
2015-12-30   0.00216
               ...  
2016-04-03   3.88826
2016-04-04   3.66972
2016-04-05   1.16843
2016-04-06   0.32456
2016-04-07   0.12117
Name: ad_created, Length: 73, dtype: float64

In [13]:
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True) * 100

2016-03-05    0.11684
2016-03-06    0.46737
2016-03-07    0.56474
2016-03-08    0.80708
2016-03-09    1.03211
2016-03-10    1.12515
2016-03-11    1.31773
2016-03-12    2.49264
2016-03-13    0.94556
2016-03-14    1.30691
2016-03-15    1.62714
2016-03-16    1.69422
2016-03-17    2.88428
2016-03-18    0.76380
2016-03-19    1.63580
2016-03-20    2.12697
2016-03-21    2.12697
2016-03-22    2.21352
2016-03-23    1.91492
2016-03-24    2.02095
2016-03-25    1.99714
2016-03-26    1.73966
2016-03-27    1.65960
2016-03-28    2.16592
2016-03-29    2.29791
2016-03-30    2.55323
2016-03-31    2.43639
2016-04-01    2.36282
2016-04-02    2.51861
2016-04-03    2.58352
2016-04-04    2.51861
2016-04-05   12.08672
2016-04-06   21.36706
2016-04-07   12.52813
Name: last_seen, dtype: float64

In [14]:
autos['registration_year'].describe()

count   46216.00000
mean     2004.34386
std        96.22550
min      1000.00000
25%      1999.00000
50%      2003.00000
75%      2007.00000
max      9999.00000
Name: registration_year, dtype: float64

One thing that stands out from the exploration we did is that the `registration_year` column contains some odd values:

+ The minimum value is `1000`, before cars were invented
+ The maximum value is `9999`, many years into the future

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.

In [15]:
autos = autos[autos['registration_year'].between(1900, 2016)]
autos['registration_year'].value_counts(normalize=True).sort_index(ascending=True) * 100

1910   0.02032
1929   0.00226
1934   0.00452
1937   0.00677
1938   0.00226
         ...  
2012   1.83546
2013   0.90080
2014   0.54861
2015   0.26414
2016   2.87624
Name: registration_year, Length: 70, dtype: float64

## Exploring Price by Brand

When working with data on cars, it's natural to explore variations across different car brands. We can use aggregation to understand the `brand` column.

In [16]:
autos['brand'].value_counts(normalize=True) * 100

volkswagen       21.55822
opel             11.56139
bmw              10.37161
mercedes_benz     8.71901
audi              7.76629
ford              7.28541
renault           5.08873
peugeot           3.17650
fiat              2.79270
seat              1.89642
skoda             1.65711
mazda             1.60067
nissan            1.55777
smart             1.50585
citroen           1.49004
toyota            1.30943
hyundai           1.03400
sonstige_autos    0.97079
volvo             0.94369
mitsubishi        0.86016
honda             0.83307
mini              0.76534
kia               0.69987
alfa_romeo        0.69761
suzuki            0.63440
chevrolet         0.57570
chrysler          0.39057
daihatsu          0.27769
dacia             0.27769
subaru            0.22576
porsche           0.19867
jeep              0.19867
saab              0.17158
trabant           0.16932
daewoo            0.16255
rover             0.14449
land_rover        0.12643
jaguar            0.12643
lancia      

In [17]:
brand_counts = autos["brand"].value_counts(normalize=True) * 100
common_brands = brand_counts[brand_counts > 7].index

common_brands

Index(['volkswagen', 'opel', 'bmw', 'mercedes_benz', 'audi', 'ford'], dtype='object')

In [18]:
brand_mean_prices = {}

for brand in common_brands:
    brand_only = autos[autos["brand"] == brand]
    mean_price = brand_only["price"].mean()
    brand_mean_prices[brand] = int(mean_price)

brand_mean_prices

{'volkswagen': 4041,
 'opel': 2621,
 'bmw': 5469,
 'mercedes_benz': 5156,
 'audi': 5524,
 'ford': 2865}

We aggregated across brands to understand mean price. We observed that in the top 6 brands, there's a distinct price gap.

+ Audi, BMW and Mercedes Benz are more expensive
+ Ford and Opel are less expensive
+ Volkswagen is in between

## Storing Aggregate Data in a Dataframe

For the top 6 brands, let's use aggregation to understand the average mileage for those cars and if there's any visible link with mean price.

In [19]:
brand_mean_km = {}

for brand in common_brands:
    brand_only = autos[autos["brand"] == brand]
    mean_price = brand_only["odometer_km"].mean()
    brand_mean_km[brand] = int(mean_price)

brand_mean_km

{'volkswagen': 132692,
 'opel': 130371,
 'bmw': 138471,
 'mercedes_benz': 138341,
 'audi': 139527,
 'ford': 126726}

In [20]:
bmp_series = pd.Series(brand_mean_prices)
df = pd.DataFrame(bmp_series, columns=['mean_price'])
df

Unnamed: 0,mean_price
volkswagen,4041
opel,2621
bmw,5469
mercedes_benz,5156
audi,5524
ford,2865


In [21]:
bmk_series = pd.Series(brand_mean_km)
df['mean_km'] = bmk_series
df

Unnamed: 0,mean_price,mean_km
volkswagen,4041,132692
opel,2621,130371
bmw,5469,138471
mercedes_benz,5156,138341
audi,5524,139527
ford,2865,126726
