## Answer questions

In [168]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [169]:
import pandas as pd
from src.paths import TRANSFORMED_DATA_DIR, MODELS_DIR, RAW_DATA_DIR

In [170]:
data = pd.read_csv(TRANSFORMED_DATA_DIR / 'transformed_data.csv')

In [171]:
data.shape

(4817, 22)

In [172]:
# Change data types from object to categorical
from src.data import convert_object_columns_to_category

data = convert_object_columns_to_category(data)

In [173]:
from src.data import get_train_test_data

In [174]:
import joblib

In [175]:
features = joblib.load(MODELS_DIR / 'features.pkl')
target = joblib.load(MODELS_DIR / 'target.pkl')

In [176]:
X, y, X_train, X_test, y_train, y_test = get_train_test_data(data, features, target)

In [177]:
# Load the model
model = joblib.load(MODELS_DIR / 'model.pkl')

### Q1

In [178]:
# Read feature importance df from models folder
feature_importance = joblib.load(MODELS_DIR / 'feature_importance_df.pkl')

In [179]:
feature_importance

| | feature | importance |
|---|---|---|
| 17 | model_initial | 0.248121 |
| 0 | model_key | 0.176764 |
| 13 | feature_8 | 0.110575 |
| 14 | age_in_months_when_sold | 0.091855 |
| 12 | feature_7 | 0.085964 |
| 2 | engine_power | 0.048129 |
| 8 | feature_3 | 0.035199 |
| 10 | feature_5 | 0.03356 |
| 1 | mileage | 0.032679 |
| 7 | feature_2 | 0.029472 |
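
To make the ranking easier to read, the importances can be accumulated from the top down. The sketch below rebuilds the table from the values shown in the output above (only the ten displayed rows) and adds a running total. The column name `cumulative_importance` is introduced here for illustration, and the reading of the total as a share assumes the importances are normalized to sum to 1 across all features, which the notebook does not confirm.

```python
import pandas as pd

# Values copied from the feature_importance output above (the ten displayed rows only).
feature_importance = pd.DataFrame(
    {
        "feature": [
            "model_initial", "model_key", "feature_8", "age_in_months_when_sold",
            "feature_7", "engine_power", "feature_3", "feature_5", "mileage", "feature_2",
        ],
        "importance": [
            0.248121, 0.176764, 0.110575, 0.091855, 0.085964,
            0.048129, 0.035199, 0.03356, 0.032679, 0.029472,
        ],
    }
)

# Rank from most to least important and add a running total.
ranked = feature_importance.sort_values("importance", ascending=False).reset_index(drop=True)
ranked["cumulative_importance"] = ranked["importance"].cumsum()

print(ranked)
```

Under that normalization assumption, the two model-related features (`model_initial` and `model_key`) would account for roughly 42% of the total importance, and the ten displayed features for roughly 89%.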


### Q2

As found during the data exploration phase:
- Hybrid and electric cars are more expensive on average.
- Average prices of electric cars were stable from winter to summer; no electric cars were sold in autumn.
- Diesel and petrol cars had similar average prices, although petrol cars saw a drop in average price starting in summer 2018.
- The most expensive car type, on average, is SUV. Coupe was the most expensive at the start of the year but dropped below SUV, also starting in summer.
- Coupe and convertible cars were, on average, more expensive in winter than in summer.
- Vans were more expensive, on average, in spring, summer, and autumn than in winter.
- Subcompact cars generally had the lowest average prices.
- Paint color does not generally seem to be associated with the average price, except for green, which consistently had much lower prices than other colors, possibly because it is less popular.
- Orange and white cars sold for more, on average, during summer than during winter and spring.
- Red cars showed the opposite pattern, with lower average prices during summer than during winter and spring.

Look for similar observations using the estimated price instead of the real price.

In [180]:
pred = model.predict(X)

In [181]:
data_q2 = data.copy()

In [182]:
data_q2['price'] = pred

In [183]:
from src.plots import plot_avg_target_time_series_by_features

In [184]:
# Load car features
car_features = joblib.load(RAW_DATA_DIR / 'car_features.pkl')

In [185]:
plot_avg_target_time_series_by_features(data_q2, car_features)

In [186]:
# Load small cardinality features
small_cardinality_features = joblib.load(RAW_DATA_DIR / 'small_cardinality_features.pkl')

In [187]:
plot_avg_target_time_series_by_features(data_q2, small_cardinality_features)
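
The plots above compare average estimated prices across feature values over time. The underlying aggregation can be sketched roughly as below, using toy data. The helper `avg_target_by_feature_and_period` is hypothetical and is not the implementation in `src.plots`; the column names `season_sold_at`, `fuel`, and `price` are taken from this notebook, and the toy values are made up.

```python
import pandas as pd


def avg_target_by_feature_and_period(df, feature, period_col="season_sold_at", target="price"):
    """Average target per (period, feature value), one column per feature value.

    Hypothetical helper for illustration only.
    """
    return (
        df.groupby([period_col, feature], observed=True)[target]
        .mean()
        .unstack(feature)
    )


# Toy data with made-up prices, only to show the shape of the result.
toy = pd.DataFrame(
    {
        "season_sold_at": ["winter", "winter", "summer", "summer", "summer"],
        "fuel": ["diesel", "petrol", "diesel", "diesel", "petrol"],
        "price": [10000.0, 12000.0, 11000.0, 13000.0, 9000.0],
    }
)

avg = avg_target_by_feature_and_period(toy, "fuel")
print(avg)
```

Each row of `avg` is a period and each column a feature value, so calling `avg.plot()` would draw one line per feature value, which is presumably close to what the plotting helper shows.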

### Q3

Take the dataset and estimate each car's price as of today. Then increase the age by one year and predict the price again. Finally, compute the difference between the two predictions and, among cars with an estimated price today of 20k or more, find the cars with the smallest difference.

In [188]:
today_date = '3/1/2024'  # pandas parses this month-first: 2024-03-01

In [189]:
data_q3 = data.copy()

In [190]:
# Convert date columns to datetime and mock the sold_at date as today's date
data_q3['registration_date'] = pd.to_datetime(data_q3['registration_date'])
data_q3['sold_at'] = today_date
data_q3['sold_at'] = pd.to_datetime(data_q3['sold_at'])

In [191]:
# Calculate age in months as of today's date and overwrite the column in data_q3
data_q3['age_in_months_when_sold'] = (data_q3['sold_at'].dt.to_period('M') - data_q3['registration_date'].dt.to_period('M')).apply(lambda x: x.n)

In [192]:
data_q3[['registration_date', 'sold_at', 'age_in_months_when_sold']].head()

| | registration_date | sold_at | age_in_months_when_sold |
|---|---|---|---|
| 0 | 2012-02-01 | 2024-03-01 | 145 |
| 1 | 2016-04-01 | 2024-03-01 | 95 |
| 2 | 2012-04-01 | 2024-03-01 | 143 |
| 3 | 2014-07-01 | 2024-03-01 | 116 |
| 4 | 2014-12-01 | 2024-03-01 | 111 |
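
The month difference above can also be computed with plain integer arithmetic on the year and month components, which avoids the per-row `apply`. This is a minimal sketch, not a replacement for the cell above; the helper name `months_between` is introduced here for illustration, and the dates are the five rows shown in the output.

```python
import pandas as pd


def months_between(start, end):
    """Whole calendar months from start to end, ignoring the day of the month."""
    return (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month)


# The five registration dates displayed above, with sold_at mocked as 2024-03-01.
registration = pd.Series(
    pd.to_datetime(["2012-02-01", "2016-04-01", "2012-04-01", "2014-07-01", "2014-12-01"])
)
sold_at = pd.Series(pd.to_datetime(["2024-03-01"] * 5))

ages = months_between(registration, sold_at)
print(ages.tolist())  # [145, 95, 143, 116, 111]
```

The result matches the `age_in_months_when_sold` values shown above.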


In [193]:
X_q3, y_q3, X_train_q3, X_test_q3, y_train_q3, y_test_q3 = get_train_test_data(data_q3, features, target)

In [194]:
# Get estimated prices today
pred_q3 = model.predict(X_q3)

In [195]:
# Add 1 year to the age_in_months_when_sold
X_q3['age_in_months_when_sold'] = X_q3['age_in_months_when_sold'] + 12

In [196]:
# Get estimated prices 1 year later
pred_q3_one_year_later = model.predict(X_q3)

In [198]:
X_q3['price_today'] = pred_q3
X_q3['price_one_year_later'] = pred_q3_one_year_later
X_q3['loss'] = X_q3['price_today'] - X_q3['price_one_year_later'] 

In [201]:
# Sort by loss and show the 10 cars with the smallest loss among those with an estimated price today of 20k or more
X_q3[X_q3['price_today'] >= 20000].sort_values('loss', ascending=True).head(10)

| | model_key | mileage | engine_power | fuel | paint_color | car_type | feature_1 | feature_2 | feature_3 | feature_4 | ... | feature_6 | feature_7 | feature_8 | age_in_months_when_sold | month_sold_at | season_sold_at | model_initial | price_today | price_one_year_later | loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3765 | X6 | 83999 | 180 | diesel | white | suv | False | False | False | False | ... | False | False | True | 147 | 6 | summer | X | 20349.583984 | 21073.412109 | -723.828125 |
| 4700 | X6 | 97469 | 280 | diesel | black | suv | True | True | True | True | ... | False | True | True | 141 | 8 | summer | X | 21056.880859 | 21710.882812 | -654.001953 |
| 4743 | X6 | 59070 | 225 | diesel | silver | suv | True | True | True | True | ... | False | True | True | 146 | 9 | autumn | X | 20398.462891 | 20958.585938 | -560.123047 |
| 4413 | X5 | 49874 | 190 | diesel | white | suv | True | True | False | True | ... | False | True | True | 126 | 6 | summer | X | 20609.341797 | 21067.820312 | -458.478516 |
| 3869 | X5 | 97770 | 190 | diesel | white | suv | True | True | False | True | ... | True | True | True | 126 | 2 | winter | X | 20637.068359 | 21095.544922 | -458.476562 |
| 4077 | X5 | 68178 | 230 | diesel | black | suv | True | True | True | True | ... | False | True | True | 126 | 3 | spring | X | 20656.5 | 20991.703125 | -335.203125 |
| 72 | M4 | 69410 | 317 | petrol | white | coupe | True | True | False | False | ... | True | True | True | 121 | 3 | spring | M | 22638.947266 | 22886.861328 | -247.914062 |
| 4717 | X6 | 52777 | 230 | diesel | grey | suv | True | True | False | True | ... | False | True | True | 116 | 8 | summer | X | 25735.421875 | 25954.453125 | -219.03125 |
| 4041 | X5 M | 98050 | 230 | diesel | black | suv | True | True | False | True | ... | True | True | True | 125 | 3 | spring | X | 22825.353516 | 22963.822266 | -138.46875 |
| 3073 | M3 | 39250 | 317 | petrol | black | sedan | True | True | False | False | ... | True | True | True | 112 | 5 | spring | M | 29399.091797 | 29513.294922 | -114.203125 |


In [159]:
# TODO: also increase mileage
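
One way the what-if comparison could be extended to mileage is sketched below. Everything here is an assumption made for illustration: `toy_predict` is a made-up linear stand-in for `model.predict`, `ASSUMED_ANNUAL_MILEAGE` is a hypothetical figure that does not come from the data, and `one_year_loss` is a hypothetical helper. The sketch works on a copy of the features and returns the predictions in a separate frame, so the feature matrix itself is left untouched.

```python
import pandas as pd

ASSUMED_ANNUAL_MILEAGE = 15_000  # hypothetical value, not derived from the dataset


def toy_predict(X):
    """Made-up linear stand-in for model.predict, used only to make the sketch runnable."""
    return 40_000 - 100 * X["age_in_months_when_sold"] - X["mileage"] / 20


def one_year_loss(X, predict, annual_mileage=ASSUMED_ANNUAL_MILEAGE):
    """Predict today and one year later (age +12 months, mileage +annual_mileage)."""
    price_today = predict(X)

    # Work on a copy so the caller's feature matrix is not modified.
    X_later = X.copy()
    X_later["age_in_months_when_sold"] = X_later["age_in_months_when_sold"] + 12
    X_later["mileage"] = X_later["mileage"] + annual_mileage
    price_later = predict(X_later)

    return pd.DataFrame(
        {
            "price_today": price_today,
            "price_one_year_later": price_later,
            "loss": price_today - price_later,
        }
    )


# Toy feature matrix with made-up values.
X_toy = pd.DataFrame({"age_in_months_when_sold": [60, 120], "mileage": [50_000, 100_000]})

result = one_year_loss(X_toy, toy_predict)
candidates = result[result["price_today"] >= 20_000].sort_values("loss", ascending=True)
print(candidates)
```

With the real model, `predict` would be `model.predict` and `X` would be `X_q3` before the age adjustment; whether a fixed annual mileage is a reasonable assumption for every car is an open question.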

### Q4

See the training notebook.

### Q5

Add other findings from data exploration