# eBay Car Sales Analysis
The aim of this project is to clean the data and analyze the included used car listings. You'll also become familiar with some of the unique benefits Jupyter Notebook provides for pandas.

We'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

In [1]:
import numpy as np
import pandas as pd

pd.set_option('display.float_format', lambda x: '%.5f' % x)

# Reading the Data
**We will read the data and use the `info()` method to get a quick overview of what we are processing.**

In [2]:
autos = pd.read_csv(r"C:\Users\Andy\Desktop\Learning\Dataquest\Project_3\autos.csv")

In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

**From above, we can very quickly notice a few things:**
* Some columns contain null values, namely "vehicleType", "gearbox", "model", "fuelType", and "notRepairedDamage".
* The columns are stored either as strings (`object`) or as integers (`int64`).
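The null counts can also be read off directly instead of comparing the non-null column of `info()` by eye. The following is a minimal sketch on a hypothetical three-row sample (the values are made up; only the column names come from the output above):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample; the column names follow the info() output above.
sample = pd.DataFrame({
    "vehicleType": ["bus", np.nan, "kombi"],
    "gearbox": ["manuell", "automatik", np.nan],
    "brand": ["peugeot", "bmw", "ford"],
})

# Count missing values per column and keep only the columns that have any.
null_counts = sample.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

On the real data, the same two lines applied to `autos` should list the five columns named above.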

In [4]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_T�V_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [5]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

**As the column headers are in camelCase instead of Python's preferred snake_case, we will edit the headers to convert them to snake_case.**

In [6]:
autos.rename({"dateCrawled": "date_crawled","offerType": "offer_type","abtest": "ab_test","vehicleType": "vehicle_type",
             "yearOfRegistration": "registration_year", "powerPS": "power_PS", "monthOfRegistration": "registration_month",
             "fuelType": "fuel_type", "notRepairedDamage": "unrepaired_damage", "dateCreated": "ad_created",
             "nrOfPictures": "no_of_pictures", "postalCode": "postal_code", "lastSeen": "last_seen"}, axis=1, inplace=True)

In [7]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_PS,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,no_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_T�V_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
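The renaming above was done with a hand-written mapping, which also allowed a few names to be reworded (for example `yearOfRegistration` became `registration_year`). For a purely mechanical conversion, a small helper is one possible alternative. This is a sketch only: `camel_to_snake` is a hypothetical helper, not part of pandas, and its output differs from the hand-picked names.

```python
import re

def camel_to_snake(name):
    """Put an underscore before each capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

original = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "abtest"]
converted = [camel_to_snake(c) for c in original]
print(converted)
```

It could be applied with `autos.columns = [camel_to_snake(c) for c in autos.columns]`, after which individual names can still be adjusted by hand. Note that it cannot split an all-lowercase name such as `abtest`.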


**Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:**

* Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis.
* Examples of numeric data stored as text which can be cleaned and converted.

In [8]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_PS,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,no_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-12 16:06:22,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.71281,,209.21663,,,3.71198,,,,,0.0,25779.74796,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


In [9]:
autos["no_of_pictures"].unique()

array([0], dtype=int64)

**The following are the observations:**

* `no_of_pictures` contains only the value `0`, so it carries no meaningful information and can be dropped.
* `seller` is `privat` for 49,999 of the 50,000 entries; only one entry is `gewerblich`.
* `offer_type` is `Angebot` for 49,999 of the 50,000 entries; only one entry is `Gesuch`.
* `price` and `odometer` are stored as string and should be converted to float/integer.
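Columns dominated by a single value can also be flagged programmatically. The sketch below uses a hypothetical ten-row sample that mimics the pattern seen in `describe()`; the 80% threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical ten-row sample mimicking the pattern seen in describe().
sample = pd.DataFrame({
    "seller": ["privat"] * 9 + ["gewerblich"],
    "no_of_pictures": [0] * 10,
    "brand": ["bmw", "ford", "opel", "audi", "fiat"] * 2,
})

# Share of rows taken by the most frequent value in each column.
top_share = pd.Series({
    col: sample[col].value_counts(normalize=True, dropna=False).iloc[0]
    for col in sample.columns
})

# Columns where one value covers at least 80% of the rows (arbitrary threshold).
near_constant = top_share[top_share >= 0.8].index.tolist()
print(near_constant)
```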

In [10]:
autos["price"].value_counts()

$0          1421
$500         781
$1,500       734
$2,500       643
$1,000       639
            ... 
$12,888        1
$1,221         1
$4,125         1
$5,485         1
$194,000       1
Name: price, Length: 2357, dtype: int64

**We will remove non-numeric characters from `price` and convert it to integer type.**

In [11]:
# "$" is a regex metacharacter, so both characters are matched literally
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False)

In [12]:
autos["price"] = autos["price"].astype(int)
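A caveat when stripping the currency symbol: in a regular expression, `$` is the end-of-string anchor rather than a literal dollar sign, and how pandas treats a single-character pattern under `regex=True` has changed between releases. Passing `regex=False` sidesteps the question. A minimal sketch on a hypothetical sample in the same format as the raw column:

```python
import pandas as pd

# Hypothetical sample in the same format as the raw `price` column.
raw_price = pd.Series(["$5,000", "$8,500", "$0", "$194,000"])

# regex=False treats "$" and "," as literal characters.
clean_price = (
    raw_price
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)
print(clean_price.tolist())
```

The same pattern works for any column that mixes digits with a fixed unit or separator.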

In [13]:
autos["odometer"].value_counts()

150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64

**We will remove non-numeric characters from `odometer` and convert it to integer type.**

In [14]:
autos["odometer"] = autos["odometer"].str.replace("km", "", regex=False).str.replace(",", "", regex=False)

In [15]:
autos["odometer"] = autos["odometer"].astype(int)

In [16]:
autos.rename({"odometer":"odometer_km"}, axis=1, inplace=True)

**Let's continue exploring the data, specifically looking for values that don't look right. We'll start by analysing the `odometer_km` and `price` columns:**
* Analyse each column using its minimum and maximum values and look for any values that are unrealistically high or low (outliers) and that we might want to remove.

**We will explore `price` first.**

In [17]:
autos["price"].describe()

count      50000.00000
mean        9840.04376
std       481104.38050
min            0.00000
25%         1100.00000
50%         2950.00000
75%         7200.00000
max     99999999.00000
Name: price, dtype: float64

In [18]:
autos["price"].value_counts().sort_index(ascending=False).head(20)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

In [19]:
autos["price"].value_counts().sort_index(ascending=True).head(20)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

**We can make the following observations:**
* The mean is around `$10,000`, but it is inflated by a few extreme values (the maximum is `$99,999,999`). The median of `$2,950` is a better indication of a typical price.
* At the top end, prices increase steadily up to around `$350,000` and then jump to `$999,990` and beyond. We will remove values higher than `$350,000`.
* At the bottom end, 1,421 listings have a price of `$0`. A car is unlikely to be offered for free, so we will remove the listings priced at `$0`.
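To see how strongly a single extreme value can distort the mean while leaving the median untouched, consider this small sketch. The prices are hypothetical, and `Series.between` is used here as an inclusive shorthand for the range check; it is one possible way to write the filter, not necessarily the one used in this notebook.

```python
import pandas as pd

# Hypothetical prices: plausible values plus a free listing and an extreme value.
prices = pd.Series([0, 500, 1500, 3000, 7000, 99999999])

# The mean is dominated by the extreme value; the median is not.
print(prices.mean(), prices.median())

# Keep prices from $1 to $350,000 inclusive, the cutoffs chosen above.
kept = prices[prices.between(1, 350000)]
print(kept.mean(), kept.median())
```

After filtering, the mean drops to a value close to the median, which is the same effect the cleaned `describe()` output shows on the full dataset.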

In [20]:
autos = autos[(autos["price"] < 351000) & (autos["price"] > 0)]

In [21]:
autos["price"].describe()

count    48565.00000
mean      5888.93559
std       9059.85475
min          1.00000
25%       1200.00000
50%       3000.00000
75%       7490.00000
max     350000.00000
Name: price, dtype: float64

**Next we will explore `odometer_km`.**

In [22]:
autos["odometer_km"].describe()

count    48565.00000
mean    125770.10193
std      39788.63680
min       5000.00000
25%     125000.00000
50%     150000.00000
75%     150000.00000
max     150000.00000
Name: odometer_km, dtype: float64

In [23]:
autos["odometer_km"].value_counts()

150000    31414
125000     5057
100000     2115
90000      1734
80000      1415
70000      1217
60000      1155
50000      1012
5000        836
40000       815
30000       780
20000       762
10000       253
Name: odometer_km, dtype: int64

**We can make the following observations:**
* There is no noticeably abnormal data.
* Most of the cars are of higher mileage, which makes sense for a used car listing on eBay.

**Let's now move on to the date columns and understand the date range the data covers.**

There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:
* `date_crawled`: added by the crawler
* `last_seen`: added by the crawler
* `ad_created`: from the website
* `registration_month`: from the website
* `registration_year`: from the website

In [24]:
autos[["date_crawled","ad_created","last_seen"]][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


You'll notice that the first 10 characters represent the day (e.g. 2016-03-26).

To understand the date range, we can extract just the date values, use `Series.value_counts()` to generate a distribution, and then sort by the index.

In [25]:
print('date_crawled: When this ad was first crawled. All field-values are taken from this date.')
autos["date_crawled"].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True)

date_crawled: When this ad was first crawled. All field-values are taken from this date.


2016-03-05   0.02533
2016-03-06   0.01404
2016-03-07   0.03601
2016-03-08   0.03330
2016-03-09   0.03309
2016-03-10   0.03218
2016-03-11   0.03257
2016-03-12   0.03692
2016-03-13   0.01567
2016-03-14   0.03655
2016-03-15   0.03428
2016-03-16   0.02961
2016-03-17   0.03163
2016-03-18   0.01291
2016-03-19   0.03478
2016-03-20   0.03789
2016-03-21   0.03737
2016-03-22   0.03299
2016-03-23   0.03222
2016-03-24   0.02934
2016-03-25   0.03161
2016-03-26   0.03220
2016-03-27   0.03109
2016-03-28   0.03486
2016-03-29   0.03410
2016-03-30   0.03369
2016-03-31   0.03183
2016-04-01   0.03369
2016-04-02   0.03548
2016-04-03   0.03861
2016-04-04   0.03649
2016-04-05   0.01310
2016-04-06   0.00317
2016-04-07   0.00140
Name: date_crawled, dtype: float64

We note from above that the data was crawled over a period of around one month, from 5 Mar to 7 Apr 2016.
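The range can also be obtained by parsing the strings as datetimes and taking the minimum and maximum. A minimal sketch, using only the five `date_crawled` values shown in the earlier preview (so the result covers those five rows, not the whole dataset):

```python
import pandas as pd

# The five `date_crawled` values shown in the preview above.
date_crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    "2016-03-12 16:58:10",
    "2016-04-01 14:38:50",
])

parsed = pd.to_datetime(date_crawled)   # strings -> datetime64 values
first_day = parsed.min().date()
last_day = parsed.max().date()
print(first_day, last_day, (last_day - first_day).days)
```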

In [26]:
print('ad_created: The date on which the eBay listing was created.')
autos["ad_created"].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=False).head(50)

ad_created: The date on which the eBay listing was created.


2016-04-07   0.00126
2016-04-06   0.00325
2016-04-05   0.01182
2016-04-04   0.03686
2016-04-03   0.03886
2016-04-02   0.03515
2016-04-01   0.03369
2016-03-31   0.03187
2016-03-30   0.03350
2016-03-29   0.03404
2016-03-28   0.03498
2016-03-27   0.03099
2016-03-26   0.03227
2016-03-25   0.03175
2016-03-24   0.02928
2016-03-23   0.03206
2016-03-22   0.03280
2016-03-21   0.03758
2016-03-20   0.03795
2016-03-19   0.03369
2016-03-18   0.01359
2016-03-17   0.03128
2016-03-16   0.03012
2016-03-15   0.03402
2016-03-14   0.03519
2016-03-13   0.01701
2016-03-12   0.03675
2016-03-11   0.03290
2016-03-10   0.03190
2016-03-09   0.03315
2016-03-08   0.03332
2016-03-07   0.03474
2016-03-06   0.01532
2016-03-05   0.02290
2016-03-04   0.00148
2016-03-03   0.00086
2016-03-02   0.00010
2016-03-01   0.00010
2016-02-29   0.00016
2016-02-28   0.00021
2016-02-27   0.00012
2016-02-26   0.00004
2016-02-25   0.00006
2016-02-24   0.00004
2016-02-23   0.00008
2016-02-22   0.00002
2016-02-21   0.00006
2016-02-20   

Earlier, we noted that the crawling started on 5 Mar 2016. Few listings created before that date appear in the data, presumably because most older ads had already been sold or removed by the time crawling began.

In [27]:
print('last_seen: When the crawler saw this ad last online.')
autos["last_seen"].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True)

last_seen: When the crawler saw this ad last online.


2016-03-05   0.00107
2016-03-06   0.00432
2016-03-07   0.00539
2016-03-08   0.00741
2016-03-09   0.00960
2016-03-10   0.01067
2016-03-11   0.01238
2016-03-12   0.02378
2016-03-13   0.00890
2016-03-14   0.01260
2016-03-15   0.01588
2016-03-16   0.01645
2016-03-17   0.02809
2016-03-18   0.00735
2016-03-19   0.01583
2016-03-20   0.02065
2016-03-21   0.02063
2016-03-22   0.02137
2016-03-23   0.01853
2016-03-24   0.01977
2016-03-25   0.01921
2016-03-26   0.01680
2016-03-27   0.01565
2016-03-28   0.02086
2016-03-29   0.02234
2016-03-30   0.02477
2016-03-31   0.02378
2016-04-01   0.02279
2016-04-02   0.02492
2016-04-03   0.02520
2016-04-04   0.02448
2016-04-05   0.12476
2016-04-06   0.22181
2016-04-07   0.13195
Name: last_seen, dtype: float64

The crawler recorded the date it last saw any listing, which allows us to determine on what day a listing was removed, presumably because the car was sold.

The last three days contain a disproportionate amount of 'last seen' values. Given that these are 6-10x the values from the previous days, it's unlikely that there was a massive spike in sales, and more likely that these values are to do with the crawling period ending and don't indicate car sales.

In [28]:
autos["registration_year"].describe()
original_regyear_count = autos["registration_year"].describe()["count"]

In [29]:
autos["registration_year"].value_counts().sort_index(ascending=False).head(20)

9999       3
9000       1
8888       1
6200       1
5911       1
5000       4
4800       1
4500       1
4100       1
2800       1
2019       2
2018     470
2017    1392
2016    1220
2015     392
2014     663
2013     803
2012    1310
2011    1623
2010    1589
Name: registration_year, dtype: int64

In [30]:
autos["registration_year"].value_counts().sort_index(ascending=True).head(20)

1000    1
1001    1
1111    1
1800    2
1910    5
1927    1
1929    1
1931    1
1934    2
1937    4
1938    1
1939    1
1941    2
1943    1
1948    1
1950    3
1951    2
1952    1
1953    1
1954    2
Name: registration_year, dtype: int64

From the `registration_year` data, we notice several anomalies:
* Some listings have a registration year earlier than 1886, when the automobile was first invented.
* Some listings have a registration year later than 2016, the year in which the data was crawled.

We will remove these rows, keeping only registration years from 1900 to 2016.

In [31]:
autos = autos[(autos["registration_year"] <= 2016) & (autos["registration_year"] >= 1900)]

In [32]:
autos["registration_year"].describe()
adjusted_regyear_count = autos["registration_year"].describe()["count"]

In [33]:
print((original_regyear_count - adjusted_regyear_count) / original_regyear_count * 100)

3.8793369710697


We note that the abnormal rows that were removed made up around 4% of the data, which is an acceptable loss.
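The range check and the share of removed rows can be expressed compactly with a boolean mask. This is a sketch on hypothetical registration years, with `Series.between` standing in for the two comparisons:

```python
import pandas as pd

# Hypothetical registration years, including a few implausible ones.
years = pd.Series([1000, 1950, 1999, 2005, 2010, 2016, 2017, 9999])

valid = years.between(1900, 2016)      # inclusive on both ends
removed_pct = (~valid).mean() * 100    # percentage of rows that fail the check
print(years[valid].tolist(), removed_pct)
```

Taking the mean of a boolean mask gives the fraction of `True` values, which avoids keeping separate before and after counts.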

One of the analysis techniques is aggregation. When working with data on cars, it's natural to explore variations across different car brands. We can use aggregation to understand the `brand` column.

In [34]:
top_20_brand = autos["brand"].value_counts(dropna=False).index[0:20]

In [35]:
top_20_brand_mean_price = {}

for b in top_20_brand:
    selected_rows = autos[autos["brand"] == b]
    mean_price = selected_rows["price"].mean()
    top_20_brand_mean_price[b] = mean_price

for k,v in top_20_brand_mean_price.items():
    print(k,":",v)

volkswagen : 5402.410261610221
bmw : 8332.820517811953
opel : 2975.2419354838707
mercedes_benz : 8628.450366422385
audi : 9336.687453600594
ford : 3749.4695065890287
renault : 2474.8646069968195
peugeot : 3094.0172290021537
fiat : 2813.748538011696
seat : 4397.230949589683
skoda : 6368.0
nissan : 4743.40252454418
mazda : 4112.596614950635
smart : 3580.2239031770046
citroen : 3779.1391437308866
toyota : 5167.091062394604
hyundai : 5365.254273504273
sonstige_autos : 12338.550218340612
volvo : 4946.501170960188
mini : 10613.459657701711


From the mean price of the top 20 brands, we notice that:
* `sonstige_autos` and `mini` are the most expensive
* `renault` and `fiat` are the cheapest
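The loop over brands can also be written as a single `groupby` aggregation. The sketch below is an alternative formulation on hypothetical listings, not the approach used in this notebook; it restricts the data to the most common brands first and then averages the price per brand.

```python
import pandas as pd

# Hypothetical listings; only `brand` and `price` matter here.
listings = pd.DataFrame({
    "brand": ["bmw", "bmw", "fiat", "fiat", "fiat", "mini"],
    "price": [8000, 9000, 2000, 3000, 4000, 10500],
})

# The two most common brands in this sample.
top_brands = listings["brand"].value_counts().index[:2]

mean_price = (
    listings[listings["brand"].isin(top_brands)]
    .groupby("brand")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_price)
```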

We will use aggregation to understand the `brand` column for `mean mileage`.

In [36]:
top_20_brand_mean_mileage = {}

for b in top_20_brand:
    selected_rows = autos[autos["brand"] == b]
    mean_mileage = selected_rows["odometer_km"].mean()
    top_20_brand_mean_mileage[b] = mean_mileage

for k,v in top_20_brand_mean_mileage.items():
    print(k,":",v)

volkswagen : 128707.15879132022
bmw : 132572.51313996495
opel : 129310.0358422939
mercedes_benz : 130788.36331334666
audi : 129157.38678544914
ford : 124266.01287159056
renault : 128071.33121308497
peugeot : 127153.62526920316
fiat : 117121.9715956558
seat : 121131.30128956624
skoda : 110848.5639686684
nissan : 118330.99579242637
mazda : 124464.03385049365
smart : 99326.77760968229
citroen : 119694.18960244648
toyota : 115944.35075885328
hyundai : 106442.30769230769
sonstige_autos : 89956.33187772926
volvo : 138067.9156908665
mini : 88105.13447432763


For the top 20 brands, we will see if there's any visible link between mean price and mean mileage.

We will combine the data from both Series objects into a single DataFrame (with a shared index) and display it directly so that we can compare the values visually. To do this, we will use the pandas `Series` and `DataFrame` constructors.

In [41]:
bmm_series = pd.Series(top_20_brand_mean_mileage).sort_values(ascending=False)
bmp_series = pd.Series(top_20_brand_mean_price).sort_values(ascending=False)
df = pd.DataFrame(bmm_series, columns = ["mean_mileage"])
df["mean_price"] = bmp_series

In [42]:
df

Unnamed: 0,mean_mileage,mean_price
volvo,138067.91569,4946.50117
bmw,132572.51314,8332.82052
mercedes_benz,130788.36331,8628.45037
opel,129310.03584,2975.24194
audi,129157.38679,9336.68745
volkswagen,128707.15879,5402.41026
renault,128071.33121,2474.86461
peugeot,127153.62527,3094.01723
mazda,124464.03385,4112.59661
ford,124266.01287,3749.46951


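Mean mileage varies much less across brands than mean price does. Among the top 20 brands, mean mileage ranges from about 88,000 km (`mini`) to about 138,000 km (`volvo`), while mean price ranges from about `$2,500` (`renault`) to about `$12,300` (`sonstige_autos`).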
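The side-by-side comparison relies on pandas aligning Series by their index labels rather than by position. The sketch below illustrates this with three rounded values copied from the printed results, deliberately listed in different orders; building the DataFrame from a dictionary is one possible way to combine them.

```python
import pandas as pd

# Three rounded values from the printed results, listed in different orders.
mean_mileage = pd.Series({"volvo": 138067.91569, "bmw": 132572.51314, "mini": 88105.13447})
mean_price = pd.Series({"mini": 10613.45966, "volvo": 4946.50117, "bmw": 8332.82052})

# Rows are matched by brand label, not by position.
combined = pd.DataFrame({"mean_mileage": mean_mileage, "mean_price": mean_price})
print(combined.sort_values("mean_mileage", ascending=False))
```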