# Guided Project: Exploring eBay Car Sales Data

In [57]:
import numpy as np
import pandas as pd

The dataset contains used-car listings from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The dataset was originally [scraped](https://en.wikipedia.org/wiki/Web_scraping) and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data). We've made a few modifications from the original dataset that was uploaded to Kaggle:

- We sampled 50,000 data points from the full dataset, to ensure your code runs quickly in our hosted environment.
- We dirtied the dataset a bit to more closely resemble what you would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with).

The data dictionary provided with the data is as follows:
- <code>dateCrawled</code> - When this ad was first crawled. All field-values are taken from this date.
- <code>name</code> - Name of the car.
- <code>seller</code> - Whether the seller is private or a dealer.
- <code>offerType</code> - The type of listing
- <code>price</code> - The price on the ad to sell the car.
- <code>abtest</code> - Whether the listing is included in an A/B test.
- <code>vehicleType</code> - The vehicle Type.
- <code>yearOfRegistration</code> - The year in which the car was first registered.
- <code>gearbox</code> - The transmission type.
- <code>powerPS</code> - The power of the car in PS.
- <code>model</code> - The car model name.
- <code>kilometer</code> - How many kilometers the car has driven.
- <code>monthOfRegistration</code> - The month in which the car was first registered.
- <code>fuelType</code> - What type of fuel the car uses.
- <code>brand</code> - The brand of the car.
- <code>notRepairedDamage</code> - If the car has a damage which is not yet repaired.
- <code>dateCreated</code> - The date on which the eBay listing was created.
- <code>nrOfPictures</code> - The number of pictures in the ad.
- <code>postalCode</code> - The postal code for the location of the vehicle.
- <code>lastSeenOnline</code> - When the crawler saw this ad last online.

In [58]:
# Loading the data
autos = pd.read_csv("autos.csv", encoding="Latin-1")

autos.info()

autos.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


The first step in the data cleaning process is to change the column names from camel case to snake case.

In [59]:
renamed = {
           'dateCrawled': 'date_crawled',
           'offerType': 'offer_type',
           'abtest': 'ab_test',
           'vehicleType': 'vehicle_type', 
           'odometer': 'odometer_km',
           'yearOfRegistration': 'registration_year',
           'gearbox': 'gear_box',
           'powerPS': 'power_PS',
           'monthOfRegistration': 'registration_month',
           'fuelType': 'fuel_type',
           'notRepairedDamage': 'not_repaired_damage',
           'dateCreated': 'ad_created',
           'nrOfPictures': 'nr_of_pictures',
           'postalCode': 'postal_code',
           'lastSeen': 'last_seen'
            }

autos.rename(columns=renamed, inplace=True)
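The mapping above is written by hand, which also allows reordering words (for example `yearOfRegistration` becomes `registration_year`). For columns that only need a mechanical conversion, a small helper could generate the snake-case name. The following is a minimal sketch; `camel_to_snake` is a hypothetical helper, not part of the original notebook, and it lowercases everything, so `powerPS` becomes `power_ps` rather than `power_PS`.

```python
import re

def camel_to_snake(name: str) -> str:
    """Convert a camelCase name to snake_case (hypothetical helper)."""
    # Insert "_" at every boundary between a lowercase letter or digit
    # and an uppercase letter, e.g. "dateCrawled" -> "date_Crawled".
    with_underscores = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return with_underscores.lower()

examples = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "brand"]
converted = {col: camel_to_snake(col) for col in examples}
print(converted)
```

The resulting dictionary has the same shape as `renamed`, so it could be passed to `rename(columns=...)` in the same way.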

Next, we will clean the columns so that each data type accurately reflects the data stored in that column.

In [60]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gear_box,power_PS,model,odometer_km,registration_month,fuel_type,brand,not_repaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-22 09:51:06,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


In [61]:
autos.price.unique()

array(['$5,000', '$8,500', '$8,990', ..., '$385', '$22,200', '$16,995'],
      dtype=object)

The summary above shows that the <code>seller</code>, <code>offer_type</code>, <code>ab_test</code>, <code>gear_box</code>, and <code>not_repaired_damage</code> columns each hold only two unique values. The <code>price</code> column contains <code>$</code> and <code>,</code> characters, so it is stored as text. <code>registration_year</code> is already an integer column but contains errors (its minimum is 1000). The <code>odometer_km</code> values should be numeric, so the 'km' text needs to be removed; the unit is already recorded in the column name. <code>nr_of_pictures</code> carries no information, since every value is 0.
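Observations like these can also be checked directly rather than read off the `describe` table. The sketch below uses a toy frame with made-up values (not rows from `autos.csv`) to show how `nunique` finds constant columns and how `value_counts(normalize=True)` shows when one value dominates a column.

```python
import pandas as pd

# Toy data for illustration only; the values are assumptions.
toy = pd.DataFrame({
    "seller": ["privat", "privat", "privat", "gewerblich"],
    "nr_of_pictures": [0, 0, 0, 0],
    "brand": ["bmw", "audi", "ford", "bmw"],
})

# Number of distinct values per column
n_unique = toy.nunique()

# Columns with a single distinct value carry no information
constant_cols = n_unique[n_unique == 1].index.tolist()

# Share of rows taken by the most frequent seller value
top_share = toy["seller"].value_counts(normalize=True).iloc[0]

print(n_unique)
print(constant_cols, top_share)
```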

First, we remove the <code>km</code> and <code>,</code> text from the <code>odometer_km</code> column and convert it to an integer type.

Next, we remove the <code>$</code> and <code>,</code> text from the <code>price</code> column and convert it to an integer type.

In [62]:
# Removing , and km from odometer column
# Converting data type to int
autos['odometer_km'] = autos['odometer_km'].str.replace("km","")
autos['odometer_km'] = autos['odometer_km'].str.replace(",","").astype(int)

In [63]:
# Removing , and $ from price column
# regex=False treats "$" as a literal character rather than a regex anchor
# Converting data type to int
autos['price'] = autos['price'].str.replace("$","", regex=False)
autos['price'] = autos['price'].str.replace(",","", regex=False).astype(int)
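Because `$` is also a regular-expression anchor, it can help to make the literal intent explicit with `regex=False`. The following sketch applies the same two-step cleaning to a small, made-up Series so the effect is easy to verify.

```python
import pandas as pd

# Toy values in the same format as the raw price column
raw = pd.Series(["$5,000", "$8,500", "$0"])

clean = (
    raw.str.replace("$", "", regex=False)   # drop the currency symbol
       .str.replace(",", "", regex=False)   # drop the thousands separator
       .astype(int)                         # "5000" -> 5000
)
print(clean.tolist())
```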

Next, we remove outliers from the <code>price</code> column. In the value counts below, the price jumps from 350,000 straight to 999,990, so we will remove all values larger than 350,000. At the lower end, 1,421 cars are listed at 0 dollars, which does not seem right, so those rows will be removed as well.

In [64]:
print(autos['price'].value_counts().sort_index(ascending=True).head(10))
print(autos['price'].value_counts().sort_index(ascending=False).head(15))

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
Name: price, dtype: int64
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
Name: price, dtype: int64


In [65]:
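# Removing unrealistic values in price
# .copy() makes the result an independent frame, so later column
# assignments do not raise SettingWithCopyWarning
autos = autos[autos['price'].between(1,350000)].copy()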
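`Series.between` is inclusive on both ends by default, so a price of exactly 1 or exactly 350000 is kept. The sketch below shows this on toy prices and also counts how many rows the filter would drop, which is a useful check before overwriting a frame.

```python
import pandas as pd

# Toy prices chosen to sit on and around the two bounds
toy = pd.DataFrame({"price": [0, 1, 500, 350000, 999990]})

mask = toy["price"].between(1, 350000)  # True where the price is kept
kept = toy[mask].copy()
n_removed = int((~mask).sum())

print(kept["price"].tolist())
print(f"removed {n_removed} of {len(toy)} rows")
```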

Next, we will look at the date values in the data, specifically <code>date_crawled</code>, <code>ad_created</code> and <code>last_seen</code>. The values are full timestamps stored as strings, so we will keep only the first 10 characters, which hold the date.

In [66]:
autos.loc[:,'date_crawled'] = autos.loc[:,'date_crawled'].str[:10]
autos.loc[:,'ad_created'] = autos.loc[:,'ad_created'].str[:10]
autos.loc[:,'last_seen'] = autos.loc[:,'last_seen'].str[:10]
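Slicing keeps the columns as strings. If real date arithmetic were needed later, an alternative (not used in this notebook) would be to parse the timestamps with `pd.to_datetime`. The sketch below compares the two approaches on two timestamps in the format shown in the data preview.

```python
import pandas as pd

stamps = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])

# Approach used in this notebook: keep the first 10 characters
as_text = stamps.str[:10]

# Alternative: parse to datetime64 and reset the time part to midnight
as_dates = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S").dt.normalize()

print(as_text.tolist())
print(as_dates.dt.strftime("%Y-%m-%d").tolist())
```

Both give the same calendar dates; the parsed version additionally supports operations such as subtracting two dates.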

Now let's look at the distinct dates in each column and the percentage of rows that fall on each date.

In [67]:
autos['date_crawled'].value_counts(normalize=True, dropna=False).sort_index(ascending=True)

2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

In [68]:
autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index(ascending=True)

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...   
2016-04-03    0.038855
2016-04-04    0.036858
2016-04-05    0.011819
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, Length: 76, dtype: float64

In [69]:
autos['last_seen'].value_counts(normalize=True, dropna=False).sort_index(ascending=True)

2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64

Summary of the date columns.

|Column|Earliest Date|Latest Date|
|:-:|:-:|:-:|
|date_crawled|2016-03-05|2016-04-07|
|ad_created|2015-06-11|2016-04-07|
|last_seen|2016-03-05|2016-04-07|
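A summary like this can be computed instead of being read off the value counts. Because ISO-formatted dates (`YYYY-MM-DD`) sort the same way as text and as dates, `min` and `max` on the string columns give the earliest and latest date. The sketch below uses a toy frame; the column names match the notebook but the rows are made up.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "date_crawled": ["2016-03-05", "2016-04-07", "2016-03-20"],
    "ad_created":   ["2015-06-11", "2016-04-07", "2016-03-20"],
    "last_seen":    ["2016-03-05", "2016-04-07", "2016-04-06"],
})

# One row per date column, with its earliest and latest value
summary = pd.DataFrame({"earliest": toy.min(), "latest": toy.max()})
print(summary)
```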

In [70]:
autos['registration_year'].value_counts(dropna=False).sort_index(ascending=True)

1000    1
1001    1
1111    1
1800    2
1910    5
       ..
5911    1
6200    1
8888    1
9000    1
9999    3
Name: registration_year, Length: 95, dtype: int64

In [71]:
autos[(autos['registration_year'] > 2017) | (autos['registration_year'] < 1886)]

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gear_box,power_PS,model,odometer_km,registration_month,fuel_type,brand,not_repaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
84,2016-03-27,Renault_twingo,privat,Angebot,900,control,,2018,,60,twingo,150000,0,,renault,,2016-03-27,0,40589,2016-04-05
164,2016-03-13,Opel_Meriva__nur_76000_Km__unfallfrei__scheckh...,privat,Angebot,4800,control,,2018,manuell,0,meriva,80000,4,benzin,opel,nein,2016-03-13,0,37627,2016-04-04
390,2016-03-25,Fiat_Bertone_X_1_9__X_1/9__X19__X_19__X1_9__X_19,privat,Angebot,7750,test,,2018,manuell,76,andere,150000,6,benzin,fiat,nein,2016-03-25,0,78239,2016-03-28
453,2016-03-28,Armee_Jeep,privat,Angebot,9800,test,,4500,manuell,0,andere,5000,0,,jeep,,2016-03-28,0,7545,2016-04-06
802,2016-03-19,Lada_mit_wenig_km_neuem_Tuev_bj_08,privat,Angebot,2100,test,,2018,manuell,0,kalina,150000,0,benzin,lada,,2016-03-19,0,12621,2016-03-21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49283,2016-03-15,Citroen_HY,privat,Angebot,7750,control,,1001,,0,andere,5000,0,,citroen,,2016-03-15,0,66706,2016-04-06
49354,2016-04-05,Bmw_e39_523i_mit_neuem_Tuev,privat,Angebot,2499,control,,2018,manuell,174,5er,150000,8,,bmw,nein,2016-04-05,0,65207,2016-04-05
49411,2016-03-28,Renault_twingo_Tuev_neu,privat,Angebot,1550,test,,2018,,0,twingo,100000,0,,renault,,2016-03-28,0,48739,2016-03-28
49770,2016-03-15,VW_Polo_6n_Tuev_Neu!__1.6_75PS,privat,Angebot,999,control,,2018,manuell,75,polo,150000,12,benzin,volkswagen,nein,2016-03-15,0,24321,2016-04-06


The data was scraped in 2016, so a registration year should not be later than 2016. The earliest car model year was 1886, so registration years should not be earlier than that either. 492 data points fall outside this range. Those will be removed and the data will be looked at again.

In [72]:
# Removing unrealistic values in registration_year
autos = autos[autos['registration_year'].between(1886,2016)]

# See remaining values
autos['registration_year'].describe()

count    46681.000000
mean      2002.910756
std          7.185103
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64
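One way to keep the rows that are inspected and the rows that are removed in sync is to define the valid range once and derive both masks from it. This is a sketch on toy years, with `VALID_YEARS` as a hypothetical constant holding the bounds stated above.

```python
import pandas as pd

VALID_YEARS = (1886, 2016)  # assumed bounds, taken from the text above

years = pd.Series([1000, 1910, 1999, 2016, 2017, 2018, 9999],
                  name="registration_year")

valid = years.between(*VALID_YEARS)  # inclusive on both ends
flagged = years[~valid]              # rows to inspect before removal
kept = years[valid]                  # rows that survive the filter

print(flagged.tolist())
print(kept.tolist())
```

Note that 2017 is flagged here, whereas a hand-written condition such as `> 2017` would let it through.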

Now we are going to look at the car data aggregated by <code>brand</code>. We will look at the top 20 car brands, by share of listings, on the German eBay site.

In [73]:
brands = autos['brand'].value_counts(normalize=True, dropna=False).head(20)

brands_list = []

for b, value in brands.items():
    agg = [b, 
            round(value*100,2), 
            round(autos[autos['brand'] == b]['price'].mean(), 2)]
    brands_list.append(agg)

brands_agg = pd.DataFrame(brands_list, columns = ['brand','percent_sold','avg_price_sold'])
brands_agg = brands_agg.set_index('brand')

brands_agg.sort_values(by=['avg_price_sold'],ascending=False)

brand,percent_sold,avg_price_sold
sonstige_autos,0.98,12338.55
mini,0.88,10613.46
audi,8.66,9336.69
mercedes_benz,9.65,8628.45
bmw,11.0,8332.82
skoda,1.64,6368.0
volkswagen,21.13,5402.41
hyundai,1.0,5365.25
toyota,1.27,5167.09
volvo,0.91,4946.5
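The loop above filters the frame once per brand. An equivalent aggregation can be sketched with `groupby`, which computes every brand's mean price in one pass. The example uses a toy frame with made-up prices and reuses the column names `percent_sold` and `avg_price_sold` from the cell above.

```python
import pandas as pd

# Toy listings for illustration only
toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "audi", "audi", "audi", "skoda"],
    "price": [8000, 9000, 9000, 10000, 11000, 6000],
})

# Share of listings per brand, in percent
share = toy["brand"].value_counts(normalize=True).mul(100).round(2)

# Mean price per brand
avg_price = toy.groupby("brand")["price"].mean().round(2)

# The two Series are aligned on their brand index when combined
brands_agg = pd.DataFrame({"percent_sold": share, "avg_price_sold": avg_price})
brands_agg = brands_agg.sort_values("avg_price_sold", ascending=False)
print(brands_agg)
```

To restrict the result to the most common brands, as in the notebook, `share` could be cut with `.head(20)` before building the frame and the combined frame filtered to those brands.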
