# eBay Kleinanzeigen Data Cleaning and Analysis
## Introduction
The purpose of this project is to clean and analyse used car data from eBay Kleinanzeigen, a classifieds section of eBay Germany. The data set processed and analysed here was originally sourced from [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data), with some modification by Dataquest for teaching purposes. The .csv file imported was downloaded from Dataquest.  
  
The data set underwent some cleaning, including identification of null entries, identification of columns that could be ignored, and removal of rows based on car registration year and list price. The cleaned data was then aggregated on popular brands to analyse mean list prices and car mileages for each of these brands. Brand appears to have a strong influence on list price, but little influence on mean mileage.

## Initial Analysis

In [1]:
import numpy as np
import pandas as pd

autos = pd.read_csv('./autos.csv', encoding='Latin-1')
print(autos.info())
print('\n')
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


An initial look at the data set shows a substantial number of null entries in several columns: around 5% of `gearbox` and `model` entries are null; around 10% of `vehicleType` and `fuelType` entries are null; and around 20% of `notRepairedDamage` entries are null. Some numeric columns, such as `price` and `odometer`, are stored as text.
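The null shares quoted above can be read off `info()` by hand, but they can also be computed directly. The following is a minimal sketch on a made-up frame (the `sample` data is an assumption for illustration, not part of the data set); the same expression should work on `autos`.

```python
import pandas as pd

# Toy frame standing in for `autos`; the values are made up for illustration.
sample = pd.DataFrame({
    'vehicleType': ['bus', None, 'kombi', 'limousine'],
    'gearbox': ['manuell', 'automatik', None, None],
    'brand': ['peugeot', 'bmw', 'ford', 'smart'],
})

# isnull() gives a boolean frame; the column-wise mean is the null fraction.
null_share = sample.isnull().mean().sort_values(ascending=False)
print(null_share)
```

Sorting in descending order puts the columns with the most missing values at the top, which makes it easier to decide where cleaning effort is needed.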

The `re` module is imported to convert column names from the camel case convention to the snake case convention commonly used in Python. A few column names have been simplified as well.  
  
_(A note on second reading: I had not learned yet how to work with regular expressions, but I certainly did not think it would be efficient to manually reformat each column name. A quick search on Stack Overflow led me to the regular expression and use of the `re` module.)_
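To see what the substitution in the next cell does, it can be tried on a few names in isolation. This is a small sketch; `camel_to_snake` is a hypothetical helper name introduced here, while the pattern itself is the one used in the notebook.

```python
import re

def camel_to_snake(name):
    # Insert '_' at every position that is followed by a capital letter,
    # except at the start of the string, then lowercase the result.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

examples = ['dateCrawled', 'offerType', 'powerPS']
converted = [camel_to_snake(n) for n in examples]
print(converted)  # ['date_crawled', 'offer_type', 'power_p_s']
```

Because the pattern splits before every capital letter, runs of capitals such as `PS` are separated letter by letter, which is why the column appears as `power_p_s` in the output below.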

In [2]:
import re

column_names = []
for name in autos.columns:
    newname = re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()
    column_names.append(newname)
column_names[7] = 'registration_year'
column_names[12] = 'registration_month'
column_names[15] = 'unrepaired_damage'
column_names[16] = 'ad_created'
#print(column_names)
autos.columns = column_names
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


## Exploring and Cleaning Some Specific Columns

The following code cell explores a few specific columns.  
  
Columns that can be ignored for further analysis:

- `seller`: all entries 'privat' with one exception

- `offer_type`: all entries 'Angebot' with one exception

- `nr_of_pictures`: all entries are int 0

Columns that are converted to a number:

- `price`

- `odometer`: the thousands separator and the `km` suffix are stripped during the conversion
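As an alternative to chaining one `str.replace` call per character, both columns could be cleaned in a single pass by removing every non-digit character. This is only a sketch on made-up values; `digits_only` is a hypothetical helper, and it assumes the values are non-negative whole numbers.

```python
import pandas as pd

# Made-up values in the same format as the raw `price` and `odometer` columns.
raw = pd.DataFrame({
    'price': ['$5,000', '$8,500', '$0'],
    'odometer': ['150,000km', '70,000km', '5,000km'],
})

def digits_only(column):
    # Hypothetical helper: remove every non-digit character, then cast to int.
    # Assumes non-negative whole numbers; decimals or signs would be mangled.
    return column.str.replace(r'\D', '', regex=True).astype(int)

price = digits_only(raw['price'])
odometer_km = digits_only(raw['odometer'])
print(price.tolist(), odometer_km.tolist())
```

Passing `regex=True` explicitly avoids relying on the default, which has differed between pandas versions.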


In [3]:
# quick examination of the data
autos.describe(include='all')
print('nr_of_pictures values:\n',autos['nr_of_pictures'].value_counts())
print('\nnr_of_pictures head:\n', autos['nr_of_pictures'].head())

# convert price from string to float
print('\n\nprice values:\n',autos['price'].value_counts())
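# regex=False: treat '$' as a literal character rather than a regex anchor
autos['price'] = autos['price'].str.replace('$', '', regex=False)
autos['price'] = autos['price'].str.replace(',', '', regex=False)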
autos['price'] = autos['price'].astype(float)
print('\nprice head (after float conversion):\n', autos['price'].head())

# convert odometer to int
print('\n\nodometer values:\n',autos['odometer'].value_counts())
autos['odometer'] = autos['odometer'].str.replace(',', '')
autos['odometer'] = autos['odometer'].str.replace('km', '')
autos['odometer'] = autos['odometer'].astype(int)
print('\nodometer head (after int conversion):\n',autos['odometer'].value_counts())

# rename odometer column to odometer_km
autos = autos.rename({'odometer': 'odometer_km'}, axis=1)
autos.head()

nr_of_pictures values:
 0    50000
Name: nr_of_pictures, dtype: int64

nr_of_pictures head:
 0    0
1    0
2    0
3    0
4    0
Name: nr_of_pictures, dtype: int64


price values:
 $0         1421
$500        781
$1,500      734
$2,500      643
$1,000      639
           ... 
$12,340       1
$40,800       1
$73,500       1
$7,333        1
$13,049       1
Name: price, Length: 2357, dtype: int64

price head (after float conversion):
 0    5000.0
1    8500.0
2    8990.0
3    4350.0
4    1350.0
Name: price, dtype: float64


odometer values:
 150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64

odometer head (after int conversion):
 150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        9

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000.0,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500.0,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990.0,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350.0,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350.0,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


## Summary Price Statistics

In [4]:
print('summary price statistics:\n', autos['price'].describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95, 0.99]))
# keep listings priced up to the 99th percentile (35,900), dropping the top 1%
autos = autos[autos['price'].between(0, 35900)]
print('\nsummary price statistics with top 1% removed:\n', autos['price'].describe())

summary price statistics:
 count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
5%       2.000000e+02
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
95%      1.990000e+04
99%      3.590000e+04
max      1.000000e+08
Name: price, dtype: float64

summary price statistics with top 1% removed:
 count    49502.000000
mean      5193.721042
std       6063.295735
min          0.000000
25%       1100.000000
50%       2900.000000
75%       6999.000000
max      35900.000000
Name: price, dtype: float64


As can be seen above, the `price` data has a mean of \$9,840, a min of \$0, a 95th percentile of \$19,900, a 99th percentile of \$35,900, and a max of \$100,000,000. A maximum this far above the 99th percentile indicates that the distribution is heavily skewed by the top 1% of prices. To focus on the vast majority of the data, listings priced above the 99th percentile are excluded from further analysis.
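The cutoff of 35,900 was typed in by hand from the `describe()` output. A possible variation is to compute the cutoff with `Series.quantile`, so that it follows the data. The sketch below uses made-up prices rather than the real column.

```python
import pandas as pd

# Made-up prices with one extreme value, standing in for autos['price'].
prices = pd.Series([500.0, 1200.0, 2950.0, 7200.0, 19900.0, 100_000_000.0])

cutoff = prices.quantile(0.99)       # value below which 99% of prices fall
trimmed = prices[prices <= cutoff]   # keep everything up to the cutoff

print('mean before:', prices.mean())
print('mean after: ', trimmed.mean())
```

Comparing the mean before and after trimming shows how strongly a single extreme listing can dominate the summary statistics.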

## Summary Odometer Statistics

In [5]:
autos['odometer_km'].value_counts()
autos['odometer_km'].describe()

count     49502.000000
mean     126400.347461
std       39439.244276
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

The `odometer_km` data contains a high frequency of listings above 100,000 km, with a max of 150,000 km and min of 5,000 km. There do not appear to be any outliers worth removing based on odometer data.  
  
_(A note on second reading: This guided project was completed after learning some of the fundamentals of the pandas library, but I had not yet learned about plotting with matplotlib. I acknowledge that some of the summary statistics observations are rather limited without any plots to visualise the data.)_

## Date Exploration

In [6]:
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025454
2016-03-06    0.013959
2016-03-07    0.036039
2016-03-08    0.033332
2016-03-09    0.033292
2016-03-10    0.032160
2016-03-11    0.032383
2016-03-12    0.036908
2016-03-13    0.015575
2016-03-14    0.036685
2016-03-15    0.033978
2016-03-16    0.029534
2016-03-17    0.031494
2016-03-18    0.013010
2016-03-19    0.034908
2016-03-20    0.037817
2016-03-21    0.037291
2016-03-22    0.032827
2016-03-23    0.032383
2016-03-24    0.029049
2016-03-25    0.031675
2016-03-26    0.032645
2016-03-27    0.031009
2016-03-28    0.034766
2016-03-29    0.034241
2016-03-30    0.033696
2016-03-31    0.031837
2016-04-01    0.033736
2016-04-02    0.035372
2016-04-03    0.038786
2016-04-04    0.036584
2016-04-05    0.012949
2016-04-06    0.003192
2016-04-07    0.001434
Name: date_crawled, dtype: float64

The normalised frequency of crawl dates indicates a fairly consistent crawl rate between 5 March 2016 and 4 April 2016, tapering off over the final three days (5 to 7 April 2016).

In [7]:
ads_created = autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
print(ads_created)

2015-08-10    0.000020
2015-09-09    0.000020
2015-11-10    0.000020
2015-12-05    0.000020
2015-12-30    0.000020
                ...   
2016-04-03    0.039009
2016-04-04    0.036968
2016-04-05    0.011676
2016-04-06    0.003273
2016-04-07    0.001293
Name: ad_created, Length: 75, dtype: float64


The frequency of listings created on the website appears to be very low prior to March 2016, with what seems like a step increase in March 2016.

In [8]:
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001091
2016-03-06    0.004464
2016-03-07    0.005353
2016-03-08    0.007656
2016-03-09    0.009899
2016-03-10    0.010848
2016-03-11    0.012606
2016-03-12    0.023938
2016-03-13    0.009030
2016-03-14    0.012828
2016-03-15    0.015979
2016-03-16    0.016444
2016-03-17    0.028019
2016-03-18    0.007434
2016-03-19    0.015858
2016-03-20    0.020686
2016-03-21    0.020747
2016-03-22    0.021676
2016-03-23    0.018646
2016-03-24    0.019595
2016-03-25    0.019252
2016-03-26    0.016989
2016-03-27    0.016060
2016-03-28    0.020969
2016-03-29    0.022464
2016-03-30    0.024969
2016-03-31    0.023858
2016-04-01    0.023231
2016-04-02    0.024969
2016-04-03    0.025474
2016-04-04    0.024787
2016-04-05    0.123551
2016-04-06    0.220476
2016-04-07    0.130156
Name: last_seen, dtype: float64

Around half of the listings (about 47%) were last seen by the crawler in the final 3 days of the crawl period (5 to 7 April 2016).
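The share of listings last seen in the final three days can be computed rather than summed by eye. This is a minimal sketch on made-up timestamps; it parses the strings with `pd.to_datetime` instead of slicing the first ten characters, which is a different technique from the one used in the cells above.

```python
import pandas as pd

# Made-up timestamps in the same 'YYYY-MM-DD HH:MM:SS' format as `last_seen`.
last_seen = pd.Series([
    '2016-03-12 10:00:00',
    '2016-04-05 08:30:00',
    '2016-04-06 14:45:08',
    '2016-04-07 06:45:54',
])

days = pd.to_datetime(last_seen).dt.normalize()            # drop the time of day
final_window = days >= days.max() - pd.Timedelta(days=2)   # last 3 calendar days
share_recent = final_window.mean()                         # fraction of True values
print(share_recent)  # 0.75
```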

In [9]:
print('registration_year freq table:\n', autos['registration_year'].describe())
print('\nregistrations post-2016: ', autos.loc[(autos['registration_year'] > 2016), 'registration_year'].shape[0])
print('registrations pre-1900: ', autos.loc[(autos['registration_year'] < 1900), 'registration_year'].shape[0])
print('\nregistrations between 1900-2016 freq table:\n', autos.loc[autos['registration_year'].between(1900, 2016), 'registration_year'].describe())

registration_year freq table:
 count    49502.000000
mean      2004.965153
std        104.548918
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

registrations post-2016:  1957
registrations pre-1900:  6

registrations between 1900-2016 freq table:
 count    47539.000000
mean      2002.757904
std          7.222096
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2007.000000
max       2016.000000
Name: registration_year, dtype: float64


Around 2000 listings contain registration years after 2016 (i.e. in the future relative to the crawl), and 6 contain registration years before 1900. Within the range 1900 to 2016, the majority of vehicles were registered around the early to mid 2000s (median 2003).

In [10]:
autos = autos[autos['registration_year'].between(1900, 2016)]
autos['registration_year'].value_counts(normalize=True)

2000    0.070511
2005    0.063337
1999    0.063043
2004    0.057384
2003    0.057342
          ...   
1929    0.000021
1948    0.000021
1938    0.000021
1939    0.000021
1952    0.000021
Name: registration_year, Length: 76, dtype: float64

Listings with registration years outside the range 1900-2016 have been removed. Those registered after 2016 are clearly inaccurate, since the data was crawled in 2016 and vehicles cannot have a future date of registration. Registration years before 1900 appear to be arbitrary entries and, in any case, represent a tiny proportion of the data set.
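The filtering step relies on `Series.between`, which is inclusive on both ends by default. A small sketch on made-up years shows how the mask can be used both to count the rows that would be dropped and to select the rows to keep.

```python
import pandas as pd

years = pd.Series([1000, 1910, 1999, 2003, 2016, 2017, 9999])  # made-up values

valid = years.between(1900, 2016)   # inclusive on both ends by default
print('kept:', valid.sum(), 'dropped:', (~valid).sum())
cleaned = years[valid]
```

Counting the dropped rows before applying the mask is a cheap check that the filter removes only what is intended.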

## Aggregating Data by Brand
Having cleaned up the data set, some analysis on brands is carried out. The brand-specific investigation focuses on the brands that each account for more than 5% of the listings.
  
_(A note on second reading: The aggregation workflow below builds Python dictionaries that are converted into a pandas DataFrame. At the time of writing this comment, I would instead use a pandas `groupby` and `agg` approach for such tasks in the future.)_

In [11]:
# extract brand names that contain more than 5% of listings
brand_freq = autos['brand'].value_counts(normalize=True) # normalise freq table of brand names
brand_freq_agg = brand_freq[brand_freq > 0.05] # select brands containing more than 5% of listings

# for each brand of interest, calculate mean listing price dictionary
mean_price_by_brand = {}
for item in brand_freq_agg.index:
    mean = autos.loc[autos['brand'] == item, 'price'].mean()
    mean_price_by_brand[item] = mean
#print(mean_price_by_brand)

# for each brand of interest, calculate mean mileage dictionary
mean_mileage_by_brand = {}
for item in brand_freq_agg.index:
    mean = autos.loc[autos['brand'] == item, 'odometer_km'].mean()
    mean_mileage_by_brand[item] = mean
#print(mean_mileage_by_brand)

# combine the above dictionaries to create an aggregate dataframe
bmp_series = pd.Series(mean_price_by_brand)
bmm_series = pd.Series(mean_mileage_by_brand)
autos_agg = pd.DataFrame(bmp_series, columns=['mean_price'])
autos_agg['mean_odometer_km'] = bmm_series
autos_agg

| brand | mean_price | mean_odometer_km |
|---|---|---|
| volkswagen | 5115.187512 | 128965.924759 |
| bmw | 7552.723755 | 133434.8659 |
| opel | 2869.76218 | 129223.955324 |
| mercedes_benz | 7639.995103 | 132174.493657 |
| audi | 8208.403946 | 131257.706535 |
| ford | 3493.587549 | 124257.707273 |


The luxury brands (BMW, Mercedes-Benz, Audi) are listed at higher mean prices (over \$7,500). Ford and Opel are less expensive (roughly \$2,900 to \$3,500), and Volkswagen sits in between at around \$5,100.  
  
The average odometer values are fairly consistent across these brands at around 130,000 km, so brand appears to have a much stronger influence on list price than on mileage.
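As mentioned in the note at the start of this section, the dictionary-based aggregation could be replaced by a `groupby` followed by `agg`. The following is a sketch on made-up listings, not a rerun of the analysis; the `listings` frame and the `min_share` threshold of 0.2 are assumptions chosen so that the toy data produces a visible result (the notebook uses 0.05).

```python
import pandas as pd

# Made-up listings standing in for the cleaned `autos` frame.
listings = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'opel', 'opel', 'opel', 'fiat'],
    'price': [8000.0, 7000.0, 3000.0, 2500.0, 3500.0, 1500.0],
    'odometer_km': [150000, 125000, 150000, 100000, 125000, 90000],
})

min_share = 0.2  # hypothetical threshold; the notebook uses 0.05
share = listings['brand'].value_counts(normalize=True)
popular = share[share > min_share].index

# One groupby replaces both loops: each named aggregation becomes a column.
summary = (listings[listings['brand'].isin(popular)]
           .groupby('brand')
           .agg(mean_price=('price', 'mean'),
                mean_odometer_km=('odometer_km', 'mean')))
print(summary)
```

The named-aggregation form keeps the output column names explicit, so no intermediate dictionaries or Series are needed.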

This guided project was originally completed in May 2020, with some review and tidying up in June 2020.