# eBay Kleinanzeigen: A Data Quest Project
---

In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle by user orgesleka.
The original dataset isn't available on Kaggle anymore, but you can find it [here](https://data.world/data-society/used-cars-data).

We've made a few modifications from the original dataset:

- We sampled 50,000 data points from the full dataset, to ensure your code runs quickly in our hosted environment.
- We dirtied the dataset a bit to more closely resemble what you would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with).

The data dictionary provided with the data is as follows:
- `dateCrawled` - When this ad was first crawled. All field values are taken from this date.

- `name` - Name of the car.

- `seller` - Whether the seller is private or a dealer.

- `offerType` - The type of listing.

- `price` - The price on the ad to sell the car.

- `abtest` - Whether the listing is included in an A/B test.

- `vehicleType` - The vehicle Type.

- `yearOfRegistration` - The year in which the car was first registered.

- `gearbox` - The transmission type.

- `powerPS` - The power of the car in PS.

- `model` - The car model name.

- `kilometer` - How many kilometers the car has driven.

- `monthOfRegistration` - The month in which the car was first registered.

- `fuelType` - What type of fuel the car uses.

- `brand` - The brand of the car.

- `notRepairedDamage` - If the car has a damage which is not yet repaired.

- `dateCreated` - The date on which the eBay listing was created.

- `nrOfPictures` - The number of pictures in the ad.

- `postalCode` - The postal code for the location of the vehicle.

- `lastSeenOnline` - When the crawler saw this ad last online.

### The First Steps:

#### 1. Importing Necessary Modules and the CSV File

In [1]:
#Importing modules
import pandas as pd
import numpy as np
import re
pd.set_option('mode.chained_assignment', None)

#Reading the CSV file which is encoded in Latin-1
autos=pd.read_csv("autos.csv",encoding="Latin-1")

#### 2. Initial Inspection

In [2]:
#Initial inspection of the first 10 rows
autos.head(10)

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
5,2016-03-21 13:47:45,Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...,privat,Angebot,"$7,900",test,bus,2006,automatik,150,voyager,"150,000km",4,diesel,chrysler,,2016-03-21 00:00:00,0,22962,2016-04-06 09:45:21
6,2016-03-20 17:55:21,VW_Golf_III_GT_Special_Electronic_Green_Metall...,privat,Angebot,$300,test,limousine,1995,manuell,90,golf,"150,000km",8,benzin,volkswagen,,2016-03-20 00:00:00,0,31535,2016-03-23 02:48:59
7,2016-03-16 18:55:19,Golf_IV_1.9_TDI_90PS,privat,Angebot,"$1,990",control,limousine,1998,manuell,90,golf,"150,000km",12,diesel,volkswagen,nein,2016-03-16 00:00:00,0,53474,2016-04-07 03:17:32
8,2016-03-22 16:51:34,Seat_Arosa,privat,Angebot,$250,test,,2000,manuell,0,arosa,"150,000km",10,,seat,nein,2016-03-22 00:00:00,0,7426,2016-03-26 18:18:10
9,2016-03-16 13:47:02,Renault_Megane_Scenic_1.6e_RT_Klimaanlage,privat,Angebot,$590,control,bus,1997,manuell,90,megane,"150,000km",7,benzin,renault,nein,2016-03-16 00:00:00,0,15749,2016-04-06 10:46:35


In [3]:
#Inspection of the columns and their data types
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

There are missing values (NaN) in `vehicleType`, `gearbox`, `model`, `fuelType`, and `notRepairedDamage`. Of the 20 columns, 5 are stored as 64-bit integers (`int64`) and 15 as objects (mostly strings).

Changing the column names to all lowercase letters would make them easier to access.

The values appear to be in German, which might be problematic for some characters (umlauts and the Eszett) if .replace() is needed. Hopefully the CSV is relatively free of special characters; time to remember high school German.

#### 3. Fixing Column Names

In [4]:
#Creating a dictionary of column names that will be fixed
colnamefix={"yearOfRegistration":"registration_year",
            "monthOfRegistration":"registration_month",
           "notRepairedDamage":"unrepaired_damage",
           "dateCreated":"ad_created",
            "dateCrawled":"date_crawled",
           "offerType":"offer_type",
           "vehicleType":"vehicle_type",
           "fuelType":"fuel_type",
           "nrOfPictures":"pic_count",
           "postalCode":"postal_code",
           "lastSeen":"last_seen"}

#Renaming the columns using a loop
for i in colnamefix:
    autos.rename(columns={i:colnamefix[i]},inplace=True)

#Changing column names into lowercase characters
tempcol=list()

for i in autos.columns:
    tempcol.append(i.lower())

autos.columns=tempcol
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'powerps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'pic_count', 'postal_code',
       'last_seen'],
      dtype='object')

What we are trying to achieve here is quite simple: changing the format of the column labels from camelCase to snake_case.

A relatively easy, if primitive, way to do this is to loop over the dictionary of desired column names and rename each matching column. The last step converts all remaining labels to lowercase, which is straightforward.
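
As an alternative to the hand-written dictionary, the camelCase-to-snake_case step can be sketched with a small regular expression. The helper name `camel_to_snake` and the toy DataFrame are hypothetical, and the result differs from the hand-picked labels above (for example, it yields `power_ps`, and it would turn `yearOfRegistration` into `year_of_registration` rather than `registration_year`).

```python
import re
import pandas as pd

def camel_to_snake(label):
    # Insert "_" wherever a lowercase letter or digit is followed by an uppercase letter,
    # then lowercase the whole label
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", label).lower()

# Toy frame with a few of the original column labels (no data needed)
demo = pd.DataFrame(columns=["dateCrawled", "offerType", "powerPS", "abtest"])

# rename() accepts a function and applies it to every column label
demo = demo.rename(columns=camel_to_snake)
print(list(demo.columns))
```

Passing a function to `rename` avoids the explicit loop, at the cost of losing control over individual names.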

#### 4. Mapping German Words into English

In [5]:
#Creating a list of columns that uses German
german=["seller","offer_type","vehicle_type",
        "gearbox","fuel_type","unrepaired_damage"]

#Creating a dictionary of German words that will be changed
seller={"privat":"private","gewerblich":"commercial"}
offer={"Angebot":"bid","Gesuch":"request"}
vehicle={"kleinwagen":"small_vehicle","andere":"others","kombi":"station_wagon"}
gear={"manuell":"manual","automatik":"automatic"}
fuel={"benzin":"petrol","elektro":"electric","andere":"others"}
damage={"nein":"no","ja":"yes"}

#Replacing the German words
for i in seller:
    autos["seller"].replace([str(i)+".*"],
                            [seller[i]],regex=True,inplace=True)
for i in offer:
    autos["offer_type"].replace([str(i)+".*"],
                            [offer[i]],regex=True,inplace=True)
for i in vehicle:
    autos["vehicle_type"].replace([str(i)+".*"],
                            [vehicle[i]],regex=True,inplace=True)
for i in gear:
    autos["gearbox"].replace([str(i)+".*"],
                            [gear[i]],regex=True,inplace=True)        
for i in fuel:
    autos["fuel_type"].replace([str(i)+".*"],
                            [fuel[i]],regex=True,inplace=True)
for i in damage:
    autos["unrepaired_damage"].replace([str(i)+".*"],
                            [damage[i]],regex=True,inplace=True)

#Checking if the German words have been changed  
for i in german:
    autos[i].fillna("not_known",inplace=True)
    print(autos[i].unique())

['private' 'commercial']
['bid' 'request']
['bus' 'limousine' 'small_vehicle' 'station_wagon' 'not_known' 'coupe'
 'suv' 'cabrio' 'others']
['manual' 'automatic' 'not_known']
['lpg' 'petrol' 'diesel' 'not_known' 'cng' 'hybrid' 'electric' 'others']
['no' 'not_known' 'yes']


Here we replace the German words with their English equivalents using regular expressions and .replace(). The idea is to create a dictionary of the words we want to replace for each column, then loop through that dictionary and match each key with a regular expression. Missing values are then filled with "not_known".
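
When the German words match whole cell values exactly, the same translation can be sketched without a loop by passing the dictionary straight to `Series.replace`. This is a minimal sketch on a toy Series, not the notebook's regex-based method; values that are not dictionary keys (such as "bus") are left untouched.

```python
import pandas as pd

# Same German-to-English pairs as the gear dictionary above
gear = {"manuell": "manual", "automatik": "automatic"}

# Toy column with one missing value and one value that has no translation
demo = pd.Series(["manuell", "automatik", None, "bus"], name="gearbox")

# replace() swaps exact matches; fillna() then labels the missing entries
translated = demo.replace(gear).fillna("not_known")
print(translated.tolist())
```

Assigning the result back to the column, instead of using `inplace=True` on a selected column, also avoids chained-assignment pitfalls.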

### Data Cleaning

#### 1. Cleaning Non-dates Data

##### Quick Look into Descriptive Statistics

In [6]:
#Looking at the descriptive statistics
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,powerps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,pic_count,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,50000,50000.0,50000,50000.0,47242,50000,50000.0,50000,50000,50000,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,9,,3,,245,13,,8,40,3,76,,,39481
top,2016-03-05 16:57:05,Ford_Fiesta,private,bid,$0,test,limousine,,manual,,golf,"150,000km",,petrol,volkswagen,no,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


pic_count looks safe to drop, since every value is zero (its minimum and maximum are both zero).

postal_code might not be that useful (unless you are trying to visualize the listings on a map), since the values look fairly arbitrary on their own. There are some zero values in price: the most frequent price is $0, which appears 1,421 times.

All the date columns could also be problematic, but they follow a clear pattern. They could be parsed with regular expressions, and a month-year column (also stored as a string) could be added.

price and odometer are still stored as strings because they contain non-numeric characters (if they were numeric, their descriptive statistics would show up).

seller and offer_type look droppable, since in each of them 49,999 of the 50,000 rows share the same value.
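
The "mostly the same value" check can also be made explicit. The sketch below, on a toy DataFrame, computes the share of the most frequent value per column and flags columns above a threshold; the helper name `dominant_share` and the 0.85 threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data: two near-constant columns and one varied column
demo = pd.DataFrame({
    "seller": ["private"] * 9 + ["commercial"],
    "pic_count": [0] * 10,
    "brand": ["bmw", "audi", "ford", "opel", "seat",
              "bmw", "fiat", "audi", "mini", "kia"],
})

def dominant_share(column):
    # value_counts() sorts by frequency, so the first entry is the most common value
    return column.value_counts(normalize=True, dropna=False).iloc[0]

shares = demo.apply(dominant_share)
near_constant = shares[shares >= 0.85].index.tolist()
print(near_constant)
```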

##### Dropping Some Data

In [7]:
#Dropping redundant columns
autos.drop(columns=["pic_count","seller","offer_type"],inplace=True)
autos.columns

Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type',
       'registration_year', 'gearbox', 'powerps', 'model', 'odometer',
       'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
       'ad_created', 'postal_code', 'last_seen'],
      dtype='object')

##### Changing Odometer and Price into Numerics

In [8]:
#Converting odometer and price into numeric values
tochange=["odometer","price"]

for i in tochange:
    #Raw string so that "\$" is passed to the regex engine as an escaped dollar sign
    autos[i].replace([r"(km)|(,)|(\$)"],[""],regex=True,inplace=True)
    autos[i]=autos[i].astype(int)

#Renaming odometer column into odometer_km
autos.rename(columns={"odometer":"odometer_km"},inplace=True)

#Checking if changes have been made
autos[["odometer_km","price"]].head(5)

Unnamed: 0,odometer_km,price
0,150000,5000
1,150000,8500
2,70000,8990
3,70000,4350
4,150000,1350


It's quite clear that price and odometer can be converted into numbers.

For price, the idea is to remove the dollar sign; for odometer, to remove the "km"; and for both, to remove the comma. This can be done with a regular expression in .replace() or with .str.replace().

After that, the values need to be converted into integers, and voilà. The odometer column label is also changed to odometer_km. Note that astype can be a bit annoying if we re-run a specific cell.
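
The .str.replace variant mentioned above might look like the following sketch. The helper name `clean_numeric` is hypothetical, and it assumes the values are non-negative whole numbers, because it strips every non-digit character. Casting to str first lets the cell be re-run even after the column has already been converted.

```python
import pandas as pd

def clean_numeric(column):
    # Cast to str first so the function also works on an already-numeric column,
    # then drop every non-digit character ("$", ",", "km") and convert to int
    return column.astype(str).str.replace(r"\D", "", regex=True).astype(int)

demo = pd.DataFrame({
    "price": ["$5,000", "$300"],
    "odometer": ["150,000km", "70,000km"],
})

for col in ["price", "odometer"]:
    demo[col] = clean_numeric(demo[col])

# Running the conversion a second time leaves the values unchanged
rerun = clean_numeric(demo["price"])
print(demo)
```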

##### Finding Outliers in Price

In [9]:
#Analyzing descriptive statistics of the price column
autos["price"].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

One way to find outliers is the interquartile range (IQR) rule: any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is treated as an outlier, where IQR = Q3 - Q1. Finding the IQR is relatively easy: compute the first and third quartiles using NumPy and subtract.

For price, there are 50,000 rows of "uncleaned" data. You would instantly notice that the zeros and the maximum of 10^8 are outliers that distort the descriptive statistics (for example, an inflated mean).

In [10]:
#Function to create an outlier range using IQR
def outlierrange(a):
    q1,q3=np.percentile(autos[a],[25,75])
    iqr=q3-q1
    outrange_bottom=q1-1.5*iqr
    outrange_upper=q3+1.5*iqr
    return outrange_bottom,outrange_upper

Above is a function to find the outlier range using the IQR; "a" is the name of any numeric column in autos. Note that it returns two values, so the result needs to be assigned to two variables.

In [11]:
#Creating outlier range for price
outrange_price_bottom,outrange_price_upper=outlierrange("price")
print("The range will be ",outrange_price_bottom," and ",outrange_price_upper)

The range will be  -8050.0  and  16350.0
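
These bounds can be checked by hand from the quartiles that describe() reported for price (Q1 = 1100, Q3 = 7200). The helper `iqr_bounds` below is a hypothetical, data-independent restatement of the rule that takes the quartiles directly.

```python
def iqr_bounds(q1, q3, k=1.5):
    # The IQR is the distance between the third and first quartile
    iqr = q3 - q1
    # Values outside [q1 - k*iqr, q3 + k*iqr] are treated as outliers
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of price taken from the describe() output above
lower, upper = iqr_bounds(1100.0, 7200.0)
print(lower, upper)  # -8050.0 16350.0
```

Because prices cannot be negative, a lower bound of -8050 removes nothing at the bottom of the range.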


In [12]:
#Selecting the Autos dataframe where the price value is in between the outlier range
cleaned_price=autos.loc[autos["price"].between(outrange_price_bottom,outrange_price_upper),"price"]

#Checking the descriptive statistics
cleaned_price.describe()

count    46216.000000
mean      3963.696101
std       3847.238683
min          0.000000
25%       1000.000000
50%       2500.000000
75%       5900.000000
max      16350.000000
Name: price, dtype: float64

This is a funny one: the IQR method removes the car that costs 10^8, but it still includes the free cars, because the lower bound is negative. To solve this, we do the same thing as above but use a range from 1 to the upper bound.

In [13]:
#Checking the descriptive statistics after negative values and free cars are eliminated
cleaned_price_nozero=autos.loc[autos["price"].between(1,16350),:]
cleaned_price_nozero["price"].describe()

count    44795.000000
mean      4089.433620
std       3841.429247
min          1.000000
25%       1150.000000
50%       2700.000000
75%       5999.000000
max      16350.000000
Name: price, dtype: float64

##### Finding Outliers in Odometer in Kilometer

In [14]:
#Checking the descriptive statistics of the odometer column
autos["odometer_km"].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [15]:
#Finding the outlier range
outrange_odometer_bottom,outrange_odometer_upper=outlierrange("odometer_km")
print("The range will be between ",outrange_odometer_bottom," and ",outrange_odometer_upper)

The range will be between  87500.0  and  187500.0


In [16]:
#Selecting the original dataframe which the odometer value is between the range
cleaned_odometer=autos.loc[autos["odometer_km"].between(outrange_odometer_bottom,outrange_odometer_upper),:]

#Checking the descriptive statistics
cleaned_odometer['odometer_km'].describe()

count     41520.000000
mean     141736.030829
std       17102.004255
min       90000.000000
25%      150000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

The same approach as for the "price" column applies here: using the function above, it is easy to find the outlier range.

After analyzing it, I decided to keep the "outliers" for this column. The original aim was to reduce the share of rows with a value of 150,000 kilometers, but that can't be done using the IQR, because 150,000 is both the median and the third quartile. With the "outliers" gone, the mean increases because only lower values are removed.

##### Dropping Some Data

In [17]:
#Setting the outlier-free price dataframe as our main dataframe
autos=cleaned_price_nozero

In [18]:
#Checking the descriptive statistics
autos.describe()

Unnamed: 0,price,registration_year,powerps,odometer_km,registration_month,postal_code
count,44795.0,44795.0,44795.0,44795.0,44795.0,44795.0
mean,4089.43362,2003.97339,109.39491,129762.361871,5.744012,50502.431789
std,3841.429247,74.81955,205.150622,36398.554742,3.707064,25726.937446
min,1.0,1000.0,0.0,5000.0,0.0,1067.0
25%,1150.0,1999.0,68.0,125000.0,3.0,30159.0
50%,2700.0,2003.0,102.0,150000.0,6.0,49356.0
75%,5999.0,2007.0,140.0,150000.0,9.0,71069.0
max,16350.0,9999.0,17700.0,150000.0,12.0,99998.0


As mentioned above, only the outliers from the "price" column are dropped, because dropping outliers from the "odometer_km" column might distort the descriptive statistics later on.

As we can see, the count is exactly the same as after dropping the outliers from the "price" column: it falls from 50,000 rows to 44,795 rows. Even though only the "price" outliers were dropped, the mean of "odometer_km" still increases (from about 125,733 to about 129,762).

#### 2. Cleaning Dates

##### Quick Peek into How It is Formatted

In [19]:
#Looking at the first three rows of the date columns
autos[['date_crawled','ad_created','last_seen']][0:3]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37


Here is a table of three of the five date columns, which all follow the same pattern. For the sake of naming things, let's call this one Type 1. The first 10 characters indicate the calendar date, while the last 8 (after the space) are the time. This could be parsed using a regular expression or just .split(), since we know there will be a space between the calendar date and the time.

In [20]:
#Looking at the first three rows of the separated date columns
autos[['registration_month','registration_year']][0:3]

Unnamed: 0,registration_month,registration_year
0,3,2004
1,6,1997
2,7,2009


This one is a bit easier to handle, since the month and year are already stored in separate, nicely formatted columns. Let's call this Type 2. The job here is already done by the original .csv. It is possible to map the values in "registration_month" to month names (i.e. 1 becomes January, and so on). The "registration_year" column looks almost too good to be true, so it is checked for implausible values further down.

##### Function to Parse Type 1

In [21]:
def type1parse(a):
    return autos[a].str.extract("(.*) ",expand=False)

The function above parses the Type 1 date format. It uses a regular expression to select the characters before the space. A separate function is used, instead of parsing each column one by one, to keep the code shorter and neater, since it can be reused for the distribution of all three Type 1 columns.
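
For comparison, here is a sketch of three ways to pull the calendar date out of a Type 1 value: the regular expression used above, a fixed-width slice, and `pd.to_datetime`. The two sample timestamps are copied from the first rows of date_crawled; the variable names are illustrative.

```python
import pandas as pd

stamps = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"],
                   name="date_crawled")

# 1. Regular expression: everything before the space
by_regex = stamps.str.extract(r"(.*) ", expand=False)

# 2. Fixed-width slice: the first 10 characters
by_slice = stamps.str[:10]

# 3. Real datetime parsing, then formatting back to a date string
by_datetime = pd.to_datetime(
    stamps, format="%Y-%m-%d %H:%M:%S").dt.strftime("%Y-%m-%d")

print(by_regex.tolist())
```

The datetime route is the most robust of the three, because malformed values raise an error instead of passing through silently.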

##### Distribution of "date_crawled"

In [22]:
#Parsing the date_crawled column using the type1parse function
tem_date_crawled=type1parse("date_crawled")

#Looking at the distribution of the date_crawled column
dis_date_crawled=tem_date_crawled.value_counts(
    normalize=True,dropna=False).sort_index(ascending=False)

max_date_crawled=tem_date_crawled.value_counts(
    normalize=True,dropna=False).head(3)

min_date_crawled=tem_date_crawled.value_counts(
    normalize=True,dropna=False).tail(3)

print("date_crawled distribution:\n",dis_date_crawled,"\n")
print("date_crawled max values:\n",max_date_crawled,'\n')
print("date_crawled min values:\n",min_date_crawled)

date_crawled distribution:
 2016-04-07    0.001317
2016-04-06    0.003170
2016-04-05    0.012948
2016-04-04    0.036366
2016-04-03    0.038531
2016-04-02    0.034982
2016-04-01    0.032794
2016-03-31    0.031588
2016-03-30    0.033865
2016-03-29    0.034200
2016-03-28    0.034781
2016-03-27    0.030718
2016-03-26    0.032593
2016-03-25    0.032013
2016-03-24    0.029378
2016-03-23    0.032481
2016-03-22    0.032727
2016-03-21    0.037192
2016-03-20    0.037705
2016-03-19    0.034200
2016-03-18    0.012769
2016-03-17    0.031990
2016-03-16    0.029959
2016-03-15    0.034178
2016-03-14    0.036991
2016-03-13    0.015604
2016-03-12    0.037393
2016-03-11    0.032481
2016-03-10    0.032705
2016-03-09    0.033129
2016-03-08    0.033642
2016-03-07    0.036098
2016-03-06    0.013997
2016-03-05    0.025516
Name: date_crawled, dtype: float64 

date_crawled max values:
 2016-04-03    0.038531
2016-03-20    0.037705
2016-03-12    0.037393
Name: date_crawled, dtype: float64 

date_crawled min valu

From the first series, the latest data was crawled on April 7th 2016 and the earliest on March 5th 2016.

April 3rd 2016 (3.8531 percent) was the day on which the most listings were crawled, and April 7th 2016 (0.1317 percent) the fewest.
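
The busiest and quietest days can also be read off programmatically rather than by eye. The sketch below uses a toy Series of already-parsed dates (not the real data) and expresses the shares directly in percent.

```python
import pandas as pd

# Toy sample of parsed dates: 2 + 2 + 4 = 8 listings in total
dates = pd.Series(["2016-03-05", "2016-03-06", "2016-03-06",
                   "2016-03-07", "2016-03-07", "2016-03-07",
                   "2016-03-07", "2016-03-05", "2016-03-06"][:8])

# Share of listings per day, in percent, ordered by date
share = dates.value_counts(normalize=True).sort_index() * 100

busiest = share.idxmax()   # date with the largest share
quietest = share.idxmin()  # date with the smallest share
print(share)
print(busiest, quietest)
```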

##### Distribution of "ad_created"

In [23]:
#Parsing the ad_created column
tem_ad_created=type1parse("ad_created")

#Checking the distribution of the column
dis_ad_created=tem_ad_created.value_counts(
    normalize=True,dropna=False).sort_index(ascending=False)

max_ad_created=tem_ad_created.value_counts(
    normalize=True,dropna=False).head(3)

min_ad_created=tem_ad_created.value_counts(
    normalize=True,dropna=False).tail(20)

print("ad_created distribution:\n",dis_ad_created,"\n")
print("ad_created max values:\n",max_ad_created,'\n')
print("ad_created min values:\n",min_ad_created)

ad_created distribution:
 2016-04-07    0.001183
2016-04-06    0.003237
2016-04-05    0.011653
2016-04-04    0.036701
2016-04-03    0.038799
2016-04-02    0.034691
2016-04-01    0.032838
2016-03-31    0.031633
2016-03-30    0.033664
2016-03-29    0.034044
2016-03-28    0.035049
2016-03-27    0.030561
2016-03-26    0.032615
2016-03-25    0.032102
2016-03-24    0.029356
2016-03-23    0.032347
2016-03-22    0.032571
2016-03-21    0.037370
2016-03-20    0.037772
2016-03-19    0.033084
2016-03-18    0.013528
2016-03-17    0.031521
2016-03-16    0.030450
2016-03-15    0.033999
2016-03-14    0.035674
2016-03-13    0.016855
2016-03-12    0.037236
2016-03-11    0.032838
2016-03-10    0.032414
2016-03-09    0.033196
                ...   
2016-02-24    0.000022
2016-02-23    0.000089
2016-02-22    0.000022
2016-02-21    0.000067
2016-02-20    0.000045
2016-02-19    0.000067
2016-02-18    0.000045
2016-02-17    0.000022
2016-02-16    0.000022
2016-02-14    0.000022
2016-02-12    0.000045
2016-02-

From the first series, the latest ad was created on April 7th 2016 and the earliest on August 10th 2015.

April 3rd 2016 (3.8799 percent) was the day on which the most ads were created. There are 20 dates that share the lowest percentage of ads created (0.0022 percent); the corresponding dates are shown above.

##### Distribution of "last_seen"

In [24]:
#Parsing the last seen column
tem_last_seen=type1parse("last_seen")

#Finding the distribution of the column
dis_last_seen=tem_last_seen.value_counts(
    normalize=True,dropna=False).sort_index(ascending=False)

max_last_seen=tem_last_seen.value_counts(
    normalize=True,dropna=False).head(3)

min_last_seen=tem_last_seen.value_counts(
    normalize=True,dropna=False).tail(3)

print("last_seen distribution:\n",dis_last_seen,"\n")
print("last_seen max values:\n",max_last_seen,'\n')
print("last_seen min values:\n",min_last_seen)

last_seen distribution:
 2016-04-07    0.126175
2016-04-06    0.214332
2016-04-05    0.121286
2016-04-04    0.025047
2016-04-03    0.025673
2016-04-02    0.025226
2016-04-01    0.023306
2016-03-31    0.024333
2016-03-30    0.025472
2016-03-29    0.023016
2016-03-28    0.021699
2016-03-27    0.016207
2016-03-26    0.017234
2016-03-25    0.020002
2016-03-24    0.020449
2016-03-23    0.019109
2016-03-22    0.021922
2016-03-21    0.021185
2016-03-20    0.021230
2016-03-19    0.016475
2016-03-18    0.007568
2016-03-17    0.029043
2016-03-16    0.016966
2016-03-15    0.016274
2016-03-14    0.012859
2016-03-13    0.009376
2016-03-12    0.024936
2016-03-11    0.013037
2016-03-10    0.011162
2016-03-09    0.010068
2016-03-08    0.007903
2016-03-07    0.005693
2016-03-06    0.004576
2016-03-05    0.001161
Name: last_seen, dtype: float64 

last_seen max values:
 2016-04-06    0.214332
2016-04-07    0.126175
2016-04-05    0.121286
Name: last_seen, dtype: float64 

last_seen min values:
 2016-03-07

From the first series, the latest date on which the crawler last saw an ad was April 7th 2016 (also the last day of crawling), and the earliest was March 5th 2016 (the first day of crawling).

April 6th 2016 (21.4332 percent) was the day on which the most ads were last seen, and March 5th 2016 (0.1161 percent) the fewest.

##### Looking at "registration_year"

In [25]:
#Descriptive statistics of registration year
autos["registration_year"].describe()

count    44795.00000
mean      2003.97339
std         74.81955
min       1000.00000
25%       1999.00000
50%       2003.00000
75%       2007.00000
max       9999.00000
Name: registration_year, dtype: float64

There are a couple of things to note here. First, at least one car was registered long before the car was invented (the minimum is 1000). Second, at least one car was registered far in the future (the maximum is 9999). These are clearly unwanted data points, so we will drop them from the table.

The data was last crawled in 2016, so we set that as our upper bound. The first car was invented in 1886 by Carl Benz, so it is safe to use that as our lower bound.

In [26]:
#Removing years that are outside the range of 1886 to 2016 inclusive
reg_year_slice=autos.loc[autos["registration_year"].between(1886,2016),:]
reg_year_slice["registration_year"].describe()

count    42959.000000
mean      2002.395191
std          6.750207
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2007.000000
max       2016.000000
Name: registration_year, dtype: float64

After dropping the unwanted data points, the earliest registration year is 1910 and the latest is 2016. This range is plausible, so we can use it for further inspection.
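
Note that `Series.between` includes both endpoints by default, which is why 2016 itself survives the filter. A minimal sketch on toy years (not the real column):

```python
import pandas as pd

years = pd.Series([1000, 1910, 1999, 2016, 2017, 9999],
                  name="registration_year")

# between() keeps values with 1886 <= year <= 2016 (both bounds inclusive)
plausible = years[years.between(1886, 2016)]
print(plausible.tolist())
```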

In [27]:
#Removing the unwanted "registration_year" data from autos
autos=reg_year_slice 

##### Distribution of "registration_year"

In [28]:
#Previewing the distribution of registration year
autos["registration_year"].value_counts(normalize=True)

2000    0.073140
2005    0.067227
1999    0.067227
2003    0.062129
2004    0.061920
2001    0.060895
2006    0.060127
2002    0.057264
1998    0.054703
2007    0.049512
2008    0.046207
1997    0.045252
2009    0.042087
1996    0.031705
2010    0.028539
1995    0.028399
2016    0.027421
2011    0.025699
2012    0.018622
1994    0.014432
1993    0.009684
2013    0.009218
1992    0.008357
1990    0.007915
1991    0.007682
2014    0.005610
1989    0.003981
1988    0.003119
2015    0.002561
1985    0.002095
          ...   
1972    0.000605
1960    0.000466
1977    0.000442
1971    0.000442
1973    0.000442
1976    0.000396
1968    0.000396
1975    0.000372
1967    0.000372
1974    0.000349
1969    0.000349
1966    0.000256
1965    0.000256
1964    0.000210
1910    0.000116
1961    0.000116
1956    0.000093
1962    0.000093
1959    0.000093
1963    0.000093
1958    0.000070
1950    0.000070
1937    0.000070
1934    0.000047
1954    0.000047
1929    0.000023
1941    0.000023
1953    0.0000

Based on the distribution of "registration_year", the largest share of cars listed on German eBay (7.314 percent) was registered in the year 2000. Five years share the smallest number of registered cars: 1929, 1941, 1953, 1938, and 1952, each with 0.0023 percent.

### Analyzing Ads Based on Car Brands

#### Selecting the Brands

In [29]:
#Looking at the number of unique entries in brand
autos["brand"].nunique()

40

In [30]:
#Distribution of brand
autos["brand"].value_counts(normalize=True)

volkswagen        0.214763
opel              0.115203
bmw               0.103541
mercedes_benz     0.088131
audi              0.077562
ford              0.073093
renault           0.050769
peugeot           0.032170
fiat              0.027747
seat              0.019088
skoda             0.016993
mazda             0.016085
nissan            0.015782
smart             0.015363
citroen           0.015038
toyota            0.013362
hyundai           0.010545
volvo             0.009334
mitsubishi        0.008706
sonstige_autos    0.008496
honda             0.008334
mini              0.007752
alfa_romeo        0.007007
kia               0.006960
suzuki            0.006378
chevrolet         0.005750
chrysler          0.003748
dacia             0.002863
daihatsu          0.002724
subaru            0.002211
jeep              0.002002
porsche           0.001886
saab              0.001769
daewoo            0.001629
trabant           0.001513
rover             0.001420
land_rover        0.001304
j

So there are 40 unique car brands available for sale. Here we choose the brands that each account for more than 1 percent of the ads on eBay at the time.

In [31]:
#Selecting car brands whose share of listings is higher than 0.01
perc=autos["brand"].value_counts(normalize=True).tolist()
ind=autos["brand"].value_counts(normalize=True).index.tolist()
selection=list()

i=0
while perc[i]>0.01:
    selection.append(ind[i])
    i+=1

#Car brands we will use
selection 

['volkswagen',
 'opel',
 'bmw',
 'mercedes_benz',
 'audi',
 'ford',
 'renault',
 'peugeot',
 'fiat',
 'seat',
 'skoda',
 'mazda',
 'nissan',
 'smart',
 'citroen',
 'toyota',
 'hyundai']
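
The while loop above relies on value_counts() returning shares in descending order. The same selection can be sketched with a boolean mask, which does not depend on the ordering. The toy brand counts and the 0.05 threshold below are made up for illustration; the notebook itself uses 0.01.

```python
import pandas as pd

# Toy listing: 60 + 30 + 9 + 1 = 100 rows
brands = pd.Series(["volkswagen"] * 60 + ["opel"] * 30
                   + ["bmw"] * 9 + ["lada"] * 1)

share = brands.value_counts(normalize=True)

# Keep every brand whose share exceeds the threshold
selection = share[share > 0.05].index.tolist()
print(selection)
```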

#### Calculating the "price" Mean for Each Brand

In [32]:
agg_mean_price=dict()

for i in selection:
    agg_mean_price[i]=autos.loc[autos["brand"]==i,"price"].mean()
    
agg_mean_price=pd.Series(agg_mean_price)
agg_mean_price.sort_values(ascending=False)

audi             5703.895858
bmw              5649.202563
skoda            5586.035616
mercedes_benz    5259.880877
hyundai          4876.512141
toyota           4594.972125
volkswagen       4183.019618
nissan           3899.321534
seat             3747.804878
smart            3560.118182
mazda            3557.382055
citroen          3543.492260
peugeot          2967.945007
ford             2944.803822
fiat             2740.820470
opel             2712.995959
renault          2281.461256
dtype: float64

Among the brands that each account for more than 1 percent of the ads on German eBay, Renault, Opel, and Fiat are the three cheapest brands listed, and Audi, BMW, and Skoda are the three most expensive.

#### Calculating Mean for Mileage of Selected Brands

In [33]:
mean_mile=dict()

for i in selection:
    mean_mile[i]=autos.loc[autos["brand"]==i,"odometer_km"].mean()

mean_mile=pd.Series(mean_mile)
mean_mile.sort_values(ascending=False)

audi             139701.380552
bmw              138833.183453
mercedes_benz    138411.251981
volkswagen       132813.245177
opel             130500.101031
renault          128892.709766
peugeot          127861.794501
ford             127004.777070
mazda            126924.746744
seat             124207.317073
nissan           121578.171091
citroen          120402.476780
toyota           118571.428571
fiat             117449.664430
skoda            114349.315068
hyundai          108741.721854
smart             99431.818182
dtype: float64

Here we calculate the mean mileage for the selected brands. Smart, Hyundai, and Skoda are the three brands listed on eBay with the lowest mean mileage, while Audi, BMW, and Mercedes-Benz are the three with the highest.
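
The per-brand means for price and mileage can also be computed in a single pass with `groupby`, which avoids looping over the brands twice. This is a minimal sketch on made-up data; `toy`, `selected`, and `brand_means` are hypothetical names, and the column names `price` and `odometer_km` are taken from the cells above.

```python
import pandas as pd

# Toy stand-in for the autos DataFrame (made-up values)
toy = pd.DataFrame({
    "brand": ["audi", "audi", "fiat", "fiat", "lada"],
    "price": [6000, 5000, 3000, 2000, 500],
    "odometer_km": [150000, 125000, 100000, 90000, 50000],
})

# Brands to keep (plays the role of `selection`)
selected = ["audi", "fiat"]

# Filter to the selected brands, then average both columns per brand
brand_means = (
    toy[toy["brand"].isin(selected)]
    .groupby("brand")[["price", "odometer_km"]]
    .mean()
    .sort_values("price", ascending=False)
)
print(brand_means)
```

The result is already a DataFrame with one row per brand, so it can be used for a correlation check without building it from two separate Series.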

#### Finding Correlation between Price and Mileage for the Selected Brands

In [34]:
#Combining Price and Mileage into a dataframe
combi=pd.DataFrame(agg_mean_price,
                   columns=["mean_price"])
combi["mean_mile"]=mean_mile
combi

| | mean_price | mean_mile |
|---|---|---|
| audi | 5703.895858 | 139701.380552 |
| bmw | 5649.202563 | 138833.183453 |
| citroen | 3543.49226 | 120402.47678 |
| fiat | 2740.82047 | 117449.66443 |
| ford | 2944.803822 | 127004.77707 |
| hyundai | 4876.512141 | 108741.721854 |
| mazda | 3557.382055 | 126924.746744 |
| mercedes_benz | 5259.880877 | 138411.251981 |
| nissan | 3899.321534 | 121578.171091 |
| opel | 2712.995959 | 130500.101031 |


In [35]:
#Finding the Spearman rank correlation between the two columns
print("Correlation \n",combi.corr(method="spearman"))

Correlation 
             mean_price  mean_mile
mean_price    1.000000   0.139706
mean_mile     0.139706   1.000000


The Spearman rank correlation between "mean_price" and "mean_mile" is about 0.14, which is weak. At the brand level, a brand's mean mileage therefore tells us little about its mean price.
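
As a reminder of what `method="spearman"` measures: the Spearman coefficient is the Pearson correlation computed on the ranks of the values rather than on the values themselves. The sketch below checks this on four made-up brand-level rows; `toy`, `spearman`, and `pearson_of_ranks` are hypothetical names.

```python
import pandas as pd

# Made-up brand-level means (not the real results)
toy = pd.DataFrame({
    "mean_price": [5700.0, 2700.0, 4900.0, 3500.0],
    "mean_mile": [139000.0, 130000.0, 108000.0, 120000.0],
})

# Spearman correlation as computed in the cell above
spearman = toy.corr(method="spearman").loc["mean_price", "mean_mile"]

# Same quantity obtained by ranking each column and taking the Pearson correlation
pearson_of_ranks = toy.rank().corr(method="pearson").loc["mean_price", "mean_mile"]

print(spearman, pearson_of_ranks)
```

Because only ranks matter, the coefficient is insensitive to the scale of the two columns and to outliers in either of them.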

#### Most Common Brand/Model Combo

In [36]:
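#Most frequent listing name per brand (taken from the "name" column, shown under the label "model")
tempdic=dict()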
for i in ind:
    temp=autos.loc[autos["brand"]==i,"name"].value_counts().head(1).index.tolist()
    tempdic[i]=temp[0]

tempdic=pd.DataFrame(pd.Series(tempdic),columns=["model"])
tempdic

| | model |
|---|---|
| alfa_romeo | Alfa_Romeo_147 |
| audi | Audi_A4_Avant_2.0_TDI_DPF |
| bmw | BMW_316i |
| chevrolet | Chevrolet_Spark_1.0_LS |
| chrysler | Chrysler_Stratus_2.5_LX |
| citroen | Citroën_C1_1.0_Style |
| dacia | Dacia_Sandero_1.6_MPI_Stepway |
| daewoo | Daewoo_Matiz |
| daihatsu | Daihatsu_Cuore |
| fiat | Fiat_Punto |

#### Are Cars with Unrepaired Damage Cheaper than Non-Damaged?

In [37]:
#Getting the value count index into a list
somelist=autos["unrepaired_damage"].value_counts().index.tolist()
#Removing not_known from the list
somelist.remove("not_known")

avg_price_dmg=dict()

for i in somelist:
    avg_price_dmg[i]=autos.loc[autos["unrepaired_damage"]==i,"price"].mean()

avg_price_dmg=pd.DataFrame(pd.Series(avg_price_dmg),columns=["Mean Price"])
avg_price_dmg

| | Mean Price |
|---|---|
| no | 4851.269556 |
| yes | 1972.497994 |


From the table above, cars with unrepaired damage sell for much less on average (about 1,972) than cars without damage (about 4,851).

#### Is There Any Pattern Between Mileage and Average Price?

In [38]:
#Creating 8 equal-width bins for mileage (odometer_km) to group the listings
group=pd.cut(autos["odometer_km"],bins=8)
autos["odogroup"]=group

#Getting the value counts index out to a list
ilist=autos["odogroup"].value_counts().index.tolist()

milepri=dict()
for a in ilist:
    milepri[a]=autos.loc[autos["odogroup"]==a,"price"].mean()

milepri=pd.DataFrame(pd.Series(milepri),columns=["Mean Price"])
milepri

| | Mean Price |
|---|---|
| (4855.0, 23125.0] | 4920.036885 |
| (23125.0, 41250.0] | 8363.175279 |
| (41250.0, 59375.0] | 7817.786026 |
| (59375.0, 77500.0] | 7148.494369 |
| (77500.0, 95625.0] | 6223.643989 |
| (95625.0, 113750.0] | 5604.91427 |
| (113750.0, 131875.0] | 4853.245692 |
| (131875.0, 150000.0] | 3298.209446 |


From the table above, the relationship between mileage and price is not monotonic. The lowest-mileage bin has a mean price of about 4,920, the mean peaks at about 8,363 in the (23125, 41250] bin, and from there it falls steadily to about 3,298 in the highest-mileage bin.
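
The binning and averaging steps can also be combined with `groupby`, which returns the bins in mileage order. This is a minimal sketch on made-up data using 4 bins instead of 8 to keep the example small; `toy` and `mean_by_bin` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the autos DataFrame (made-up values)
toy = pd.DataFrame({
    "odometer_km": [5000, 20000, 40000, 60000, 90000, 125000, 150000, 150000],
    "price": [9000, 8500, 8000, 7000, 6000, 4500, 3000, 3400],
})

# Split the mileage range into equal-width bins
toy["odogroup"] = pd.cut(toy["odometer_km"], bins=4)

# Mean price per bin; observed=True skips bins that contain no rows
mean_by_bin = toy.groupby("odogroup", observed=True)["price"].mean()
print(mean_by_bin)
```

When `bins` is an integer, `pd.cut` extends the range slightly so that the minimum value falls inside the first interval, which is why the first bin in the table above starts just below the smallest mileage rather than exactly at it.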

### Conclusion

1. The data was crawled between March 5th, 2016 and April 7th, 2016, and the crawler last saw every ad within the same interval. The ads were created between August 10th, 2015 and April 7th, 2016.
2. The most common registration year is 2000; the newest car was registered in 2016 and the oldest in 1910.
3. Among the brands that each account for more than 1% of the ads on the site, Renault, Opel, and Fiat are the cheapest on average, while Audi, BMW, and Skoda are the most expensive. At the brand level, there is little to no correlation between mean mileage and mean price.
4. Cars without unrepaired damage sell at a higher price than those with damage.
5. The relationship between mileage and price is not monotonic: the mean price is low in the lowest-mileage bin, peaks in the next bin, and then falls as mileage increases.