<h1> 1) Importing Packages and Loading Cleaned Dataset </h1>

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('cleaned_df.csv',parse_dates=['date_posted','construction_year'])
print(df.shape)

(4405, 24)


<h1> 2) Basic Feature Engineering </h1>
<p> The existing data can be enriched through feature engineering: new variables are derived from the columns already in the data set, and unstructured fields such as the free-text description are turned into structured values that widen the analysis. </p>

In [3]:
# How many years have the motorcycles been around? We create an 'age' variable to answer that
# (age in years, measured up to 2022-01-01, using 365-day years)
df['age'] = (pd.Timestamp('2022-01-01') - df['construction_year']).dt.total_seconds() / (60*60*24*365)

In [4]:
# Motorcycles older than 25 years are defined as old-timers
df['old_timer'] = df['age'] > 25

In [5]:
# How many kilometers per year has the motorcycle covered (on average)?
df['km_per_year'] = df['km'] / df['age']

In [6]:
# Length of the description, as a rough measure of how much detail the seller has given
df['length_description'] = df['description'].str.len()

In [7]:
# A new variable that tells how many motorcycles a certain user has put online
count_moto_per_name = df.groupby('seller_name')['id'].count().to_frame()
count_moto_per_name = count_moto_per_name.rename(columns={'id': 'motorcycles_per_account'})
df = df.merge(count_moto_per_name,on='seller_name')
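
<p> As a possible alternative to the merge above, a group-wise <code>transform</code> can attach the per-seller count in one step. The sketch below runs on a small made-up frame; only the column names <code>seller_name</code>, <code>id</code> and <code>motorcycles_per_account</code> come from the notebook. </p>

```python
import pandas as pd

# Toy data (made up for illustration); column names follow the notebook
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'seller_name': ['anna', 'bob', 'anna', 'anna'],
})

# Count the listings per seller and broadcast the count back to every row
toy['motorcycles_per_account'] = toy.groupby('seller_name')['id'].transform('count')
print(toy)
```

<p> Because <code>transform</code> returns a result aligned with the original index, the row order and the number of rows stay unchanged. </p>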

<h1> 3) Advanced Feature Engineering: Performance Metric </h1>
<p> How do we decide whether a motorcycle is a good deal or a bad deal? There are many aspects to take into consideration, so we would like to create a metric that summarises the important features. </p>
<p> Price and kilometers often cannot be compared directly across brands, types and power classes. Therefore, we create a few features that account for this: we standardize the price and kilometers within each specific group. </p>
<p> Once we have our standardized features, we can combine them into a metric to evaluate each motorcycle. </p>

<h3> 3.1.1) Price </h3>
<p> We start by grouping by brand, type, and power, and then calculate the standardized value (z-score) of the price within each group. We then assign these values to a new variable: std_price. </p>
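
<p> To make group-wise standardization concrete, here is a minimal sketch on made-up data. The column names <code>type</code>, <code>power</code> and <code>price_eur</code> follow the notebook; the values and the two-key grouping are assumptions chosen for illustration. </p>

```python
import pandas as pd

# Toy data (made up); two groups with very different price levels
toy = pd.DataFrame({
    'type': ['naked', 'naked', 'naked', 'tour', 'tour', 'tour'],
    'power': [35, 35, 35, 70, 70, 70],
    'price_eur': [3000.0, 4000.0, 5000.0, 9000.0, 10000.0, 11000.0],
})

def zscore(x):
    # Standardize within the group: subtract the group mean, divide by the group std
    return (x - x.mean()) / x.std()

# transform applies zscore per group and returns a result aligned with the rows
toy['std_price'] = toy.groupby(['type', 'power'])['price_eur'].transform(zscore)
print(toy)
```

<p> The 5000 EUR and the 11000 EUR listings both receive a score of 1, since each lies one standard deviation above the mean of its own group. A group with a single listing yields NaN, because the sample standard deviation is undefined there. </p>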

In [8]:
# Show the standard deviation of the variables
np.std(df)


price_eur                                 242641.472036
viewed                                       841.664644
liked                                          10.24652
construction_year          3532 days 09:55:57.115859392
cylinders                                      1.131884
km                                         26407.292243
cc                                           382.908733
Kenteken                                            NaN
Artikelnummer                                       NaN
advertiser_binary                              0.485392
date_posted                 162 days 11:50:29.020387766
age                                            9.677846
old_timer                                      0.347058
km_per_year                                         NaN
length_description                           560.320763
motorcycles_per_account                       21.202476
dtype: object
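
<p> The <code>NaN</code> reported for <code>km_per_year</code> may be caused by infinite values, which appear when a listing has an age of exactly 0. This is one plausible explanation, not something the output confirms. The sketch below reproduces the effect on made-up data and shows one possible remedy, which is an assumption rather than the notebook's method. </p>

```python
import numpy as np
import pandas as pd

# Toy data (made up): the second listing has an age of exactly 0 years
toy = pd.DataFrame({'km': [12000.0, 500.0, 30000.0], 'age': [4.0, 0.0, 10.0]})

# Dividing a positive number by zero yields inf in pandas, not an error
toy['km_per_year'] = toy['km'] / toy['age']
n_inf = int(np.isinf(toy['km_per_year']).sum())
print(n_inf)

# An inf makes the mean infinite, so the standard deviation becomes NaN
print(toy['km_per_year'].std())

# Possible remedy (assumption): treat infinite values as missing
cleaned = toy['km_per_year'].replace([np.inf, -np.inf], np.nan)
print(cleaned.std())
```

<p> Infinite values also propagate into the group-wise z-scores, so it can be worth checking for them before standardizing. </p>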

In [9]:
zscore = lambda x: (x - x.mean()) / x.std()

In [10]:
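# Standardize the price within each (type, power) group; add the brand column to the keys if the dataset has one
df.insert(1, 'std_price', df.groupby(['type', 'power'])['price_eur'].transform(zscore))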

In [11]:
# Standardize kilometers per year and total kilometers within each (type, power) group
df.insert(1, 'std_km_per_year', df.groupby(['type', 'power'])['km_per_year'].transform(zscore))
df.insert(1, 'std_km', df.groupby(['type', 'power'])['km'].transform(zscore))

In [12]:
# Standardize the age within each (type, power) group
df.insert(1, 'std_age', df.groupby(['type', 'power'])['age'].transform(zscore))

<h3> 3.1.4) Creating the Metrics </h3>
<p> In this subsection we create a metric that will allow us to easily find the better deals. The metric will consist of the following 4 variables: std_km, std_price, std_age and std_km_per_year. </p>
<p> Why are these good features? </p>
<p> Firstly, std_km is interesting because we prefer motorcycles with fewer kilometers. Secondly, std_age expresses that we prefer more recent motorcycles. Thirdly, std_km_per_year accounts for motorcycles that stayed in the garage for a long time and did not accumulate many kilometers for their age. Finally, std_price captures how the price compares within its category. </p>

In [13]:
# Creating the metric. We use weights, because we don't want the kilometer related variables to dominate.
df['metric'] = 0.167*df['std_km'] + 0.50*df['std_price'] + 0.167*df['std_age'] + 0.167*df['std_km_per_year']
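
<p> A minimal sketch of how the weighted metric could be computed and used to rank listings, on made-up standardized values. The column names and weights follow the notebook; the listing labels and the reading "lower metric means a better deal" are assumptions made for illustration. </p>

```python
import pandas as pd

# Toy standardized features (made up); column names follow the notebook
toy = pd.DataFrame({
    'std_km': [-1.0, 0.5, 1.2],
    'std_price': [-0.5, 0.0, 1.0],
    'std_age': [-0.8, 0.2, 0.9],
    'std_km_per_year': [-0.3, 0.1, 0.4],
}, index=['bike_a', 'bike_b', 'bike_c'])

# Same weights as in the notebook, kept in one place
weights = pd.Series({'std_km': 0.167, 'std_price': 0.50,
                     'std_age': 0.167, 'std_km_per_year': 0.167})

# Weighted sum per row; skipna=False keeps NaN if any feature is missing
toy['metric'] = toy[weights.index].mul(weights, axis=1).sum(axis=1, skipna=False)

# Assumption: a lower metric indicates a better deal, so sort ascending
ranked = toy.sort_values('metric')
print(ranked['metric'])
```

<p> Keeping the weights in a single <code>Series</code> makes it easier to experiment with other weightings without rewriting the formula. </p>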

In [14]:
df.to_csv('df_final.csv',index=False)