<small><font color=gray>Notebook author: <a href="https://www.linkedin.com/in/olegmelnikov/" target="_blank">Oleg Melnikov</a>, <a href="https://www.hse.ru/en/staff/sara/" target="_blank">Saraa Ali</a>  ©2025 onwards</font></small><hr style="margin:0;background-color:silver">

**[<font size=6>🚗Auto</font>](https://www.kaggle.com/t/9225c9c3931741ad9e384d5ba0180cc3)**. [**Instructions**](https://colab.research.google.com/drive/1owkYjuRGkx050LQnM3b3yTzd0Dr2XbeV) for running Colabs.

<small>**(Optional) CONSENT.** <mark>[ X ]</mark> We consent to sharing our Colab (after the assignment ends) with other students/instructors for educational purposes. We understand that sharing is optional and this decision will not affect our grade in any way. <font color=gray><i>(If ok with sharing your Colab for educational purposes, leave "X" in the check box.)</i></font></small>

In [None]:
%%time
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS; IS.ast_node_interactivity = "all"
import pandas as pd, time, numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import Ridge
ToCSV = lambda df, fname: df.round(2).to_csv(f'{fname}.csv', index_label='id') # rounds values to 2 decimals

class Timer():
  def __init__(self, lim:'RunTimeLimit'=60): self.t0, self.lim, _ = time.time(), lim, print('timer started')
  def ShowTime(self):
    msg = f'Runtime is {time.time()-self.t0:.0f} sec'
    print(f'\033[91m\033[1m' + msg + f' > {self.lim} sec limit!!!\033[0m' if (time.time()-self.t0-1) > self.lim else msg)

np.set_printoptions(linewidth=10000, precision=4, edgeitems=20, suppress=True)
pd.set_option('display.max_rows', 100, 'display.max_columns', 100, 'display.max_colwidth', 100, 'display.precision', 2, 'display.max_rows', 4)

db = fetch_openml('BNG(auto_price)')   # load databunch (dictionary)
tX = pd.DataFrame(db['data'], columns=db['feature_names'])
tX.symboling = tX.symboling.astype('float')
tX['price'] = db['target']
YCols = ['city-mpg','highway-mpg','price']  # 3 targets
tY = tX[YCols]
tX.drop(YCols, axis=1, inplace=True)
# tY = pd.Series(db['target'], name='price')
tX, vX, tY, DO_NOT_USE = train_test_split(tX, tY, train_size=0.7, random_state=0, shuffle=True)
# ToCSV(DO_NOT_USE, 'testY')   # Students cannot use these test values
del DO_NOT_USE
tX
tY
tmr = Timer() # runtime limit (in seconds). Add all of your code after the timer

timer started
CPU times: user 5.71 s, sys: 491 ms, total: 6.2 s
Wall time: 7.56 s


In [None]:
tmr = Timer()

timer started


<hr color=red>

<font size=5>⏳</font> <strong><font color=orange size=5>Your Code, Documentation, Ideas and Timer - All Start Here...</font></strong>

**Student's Section** (between ⏳ symbols): add your code and documentation here.

## **Task 1. Preprocessing Pipeline**

Explain elements of your preprocessing pipeline i.e. feature engineering, subsampling, clustering, dimensionality reduction, etc.
1. Why did you choose these elements? (Something in EDA, prior experience,...? Btw, EDA is not required)
1. How do you evaluate the effectiveness of these elements?
1. What else have you tried that worked or didn't?

**Student's answer:**
As you requested last time, I am now writing not on my own behalf, but on behalf of the team "J" (-:

1) Feature engineering and a full feature transformer were the central elements of my (our) preprocessing pipeline. I (we) added log and squared terms plus a limited set of pairwise interactions among the highest-variance numeric variables, and handled numerical and categorical data separately, scaling the former with StandardScaler and encoding the latter with OneHotEncoder. These choices were made because the target variables (fuel consumption and price) clearly depend nonlinearly on a number of factors, and a plain linear model without enriched features is not flexible enough. In addition, different data types need different preprocessing, so a single pipeline built on ColumnTransformer ensures that each column is processed correctly.

2) I (we) measured the effectiveness of each element by the change in model quality on training and validation data after adding the corresponding part of the pipeline. In particular, including log and squared features significantly improved the R^2 of the fuel consumption predictions, while standardization reduced the variability of RidgeCV results, making hyperparameter selection more stable. I (we) also analyzed which interactions actually helped: some combinations failed to improve the metrics or caused overfitting, as shown by an increase in in-sample R^2 without a matching improvement out of sample.
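
The with/without comparison described in this answer can be sketched as follows. This is a minimal illustration on synthetic data, not the team's actual experiment: the data-generating formula, the helper names `add_log_sq` and `fit_and_score`, and the alpha grid are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): the target depends on log(x0) and x1**2.
X = rng.uniform(1.0, 10.0, size=(2000, 2))
y = 3.0 * np.log(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0.0, 0.5, size=2000)

def add_log_sq(X):
    """Append log1p and squared copies of every column (hypothetical helper)."""
    return np.hstack([X, np.log1p(X), X ** 2])

def fit_and_score(X_tr, X_ho, y_tr, y_ho):
    """Fit a scaled RidgeCV and return (in-sample R^2, holdout R^2)."""
    model = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 8)))
    model.fit(X_tr, y_tr)
    return r2_score(y_tr, model.predict(X_tr)), r2_score(y_ho, model.predict(X_ho))

X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, train_size=0.7, random_state=0)

raw_scores = fit_and_score(X_tr, X_ho, y_tr, y_ho)
rich_scores = fit_and_score(add_log_sq(X_tr), add_log_sq(X_ho), y_tr, y_ho)
print(f"raw features:      train R2={raw_scores[0]:.3f}, holdout R2={raw_scores[1]:.3f}")
print(f"log + sq features: train R2={rich_scores[0]:.3f}, holdout R2={rich_scores[1]:.3f}")
```

A gap that opens between the train and holdout columns after adding a feature group is the overfitting signal mentioned above; a gain in both columns suggests the group is worth keeping.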

3) Some ideas were tried and abandoned. For instance, exhaustively enumerating all feature interactions produced too high a dimensionality and hurt the model's ability to generalize. I (we) also tried clustering cars and training separate models within clusters, but the results were inconsistent and the added pipeline complexity brought no significant benefit. Several attempts to use PCA for dimensionality reduction seemed promising, but given the highly interpretable nature of the original features and the heterogeneity of the categorical encodings, PCA did not yield significant gains. In the end, I (we) settled on a combination of feature engineering and a carefully regularized linear model, because this combination provided stable and reproducible quality.
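
To make the dimensionality argument concrete, the sketch below counts columns for exhaustive pairwise interactions versus interactions restricted to five high-variance columns. The feature count of 15 and the random data are assumptions chosen only for illustration; the real numeric column count may differ.

```python
from itertools import combinations

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 15  # hypothetical number of numeric columns
X = np.random.default_rng(0).normal(size=(100, n_features))

# Exhaustive: original columns plus every pairwise product.
all_pairs = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
n_exhaustive = all_pairs.fit_transform(X).shape[1]

# Limited: products only among the 5 highest-variance columns.
top5 = np.argsort(X.var(axis=0))[::-1][:5]
limited = np.column_stack([X[:, i] * X[:, j] for i, j in combinations(top5, 2)])
n_limited = n_features + limited.shape[1]

print(n_exhaustive, n_limited)
```

With `n` columns the exhaustive variant adds `n * (n - 1) / 2` products, while the restricted variant always adds 10, which keeps the design matrix small under a runtime limit.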

## **Task 2. Modeling Approach**
Explain your modeling approach, i.e. ideas you tried and why you thought they would be helpful.

1. How did these decisions guide you in modeling?
1. How do you evaluate the effectiveness of these elements?
1. What else have you tried that worked or didn't?

**Student's answer:**

1) I (we) chose a modeling strategy based on two coordinated components: a multi-output linear model for fuel-efficiency targets and a separate regularized model for price prediction. This approach emerged naturally from exploratory experiments showing that mileage variables behave similarly and benefit from being modeled jointly, while price exhibits different variance patterns and is more sensitive to specific features. I (we) chose RidgeCV as the core estimator because regularization stabilizes coefficients under expanded feature engineering, and automated hyperparameter selection reduces the risk of manually tuning an overly complex search space. My (our) assumption was that combining enriched features with a controlled-capacity linear model would preserve interpretability while still capturing meaningful nonlinearities introduced during preprocessing.

2) I (we) evaluated the performance of each modeling decision by comparing R² values and error distributions on both training and validation splits after every architectural change. Multi-output modeling improved coherence between city and highway MPG predictions, confirming that these targets indeed share useful structure. At the same time, I (we) observed that modeling price independently prevented its errors from being amplified by the dynamics of the MPG-related submodel. Regularization strength selected via cross-validation provided a strong balance between bias and variance, which I (we) confirmed through stable out-of-sample performance.

3) I (we) tried several alternative approaches that ultimately did not justify their complexity. I (we) tested tree-based ensembles like Random Forests and Gradient Boosting, but their performance gains were marginal considering the loss of interpretability and increased sensitivity to preprocessing. Applying a single global multi-output model for all three targets produced unstable price predictions, likely due to conflicting loss geometry across tasks. I (we) also considered stacking and blending, but the dataset was too small to reliably support such architectures. All these observations reinforced my (our) final choice: a pair of regularized linear models with structured feature engineering and explicit grouping of targets.
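
A detail that may help when reading the target grouping above: scikit-learn's `MultiOutputRegressor` fits one independent clone of the base estimator per target, so grouped targets share preprocessing and code but not coefficients. The sketch below checks this on synthetic data; the data, the weights, and the use of plain `Ridge` with a fixed `alpha` (instead of `RidgeCV`) are assumptions made to keep the comparison exact.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
W = np.array([[1.0, 0.5], [2.0, 0.0], [0.0, 3.0], [-1.0, 1.0]])  # illustrative weights
Y = X @ W + rng.normal(0.0, 0.1, size=(200, 2))                  # two synthetic targets

# Wrapper: one Ridge clone per target, fitted independently.
wrapped = MultiOutputRegressor(Ridge(alpha=1.0)).fit(X, Y)

# Manual equivalent: fit each target on its own and stack the predictions.
separate = np.column_stack(
    [Ridge(alpha=1.0).fit(X, Y[:, k]).predict(X) for k in range(Y.shape[1])]
)

same = np.allclose(wrapped.predict(X), separate)
print("identical predictions:", same)
```

One practical consequence is that each wrapped `RidgeCV` is free to pick its own regularization strength per target.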

Below is a baseline model that produces the result on the Kaggle leaderboard (LB).

In [None]:
%%time
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputRegressor


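def make_features(X: pd.DataFrame, ref: pd.DataFrame = None) -> pd.DataFrame:
    """Add log1p, squared, and pairwise-interaction features to X.

    Column choices are made on `ref` (defaults to X), so passing the training
    frame as `ref` gives validation data exactly the same columns.
    """
    ref = X if ref is None else ref
    X = X.copy()
    num_cols = ref.select_dtypes(include=[np.number]).columns

    for col in num_cols:
        if (ref[col] > 0).all():  # log only where the reference column is strictly positive
            X[col + "_log1p"] = np.log1p(X[col].clip(lower=0))  # clip guards unseen negatives
        X[col + "_sq"] = X[col] ** 2

    # top-5 numeric columns by variance (measured on ref)
    var = ref[num_cols].var().sort_values(ascending=False)
    top = var.index[:5]

    for i, c1 in enumerate(top):
        for c2 in top[i+1:]:
            name = f"{c1}_x_{c2}"
            X[name] = X[c1] * X[c2]

    return X

tX_fe = make_features(tX)
vX_fe = make_features(vX, ref=tX)  # reuse the training-set column choices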

num_features = tX_fe.select_dtypes(include=[np.number]).columns.tolist()
cat_features = [c for c in tX_fe.columns if c not in num_features]

base_preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features),
    ]
)


mpg_targets = ['city-mpg', 'highway-mpg']

alphas_mpg = np.logspace(-3, 3, 8)  # grid of alpha values to search
mpg_reg = MultiOutputRegressor(
    RidgeCV(alphas=alphas_mpg, cv=3)
)

mpg_model = Pipeline([
    ('prep', base_preprocessor),
    ('reg', mpg_reg),
])

mpg_model.fit(tX_fe, tY[mpg_targets])

t_mpg_pred = pd.DataFrame(
    mpg_model.predict(tX_fe),
    index=tX_fe.index,
    columns=[c + "_pred" for c in mpg_targets]
)
v_mpg_pred = pd.DataFrame(
    mpg_model.predict(vX_fe),
    index=vX_fe.index,
    columns=[c + "_pred" for c in mpg_targets]
)

# price model (uses the MPG predictions as extra features)

tX_price = pd.concat([tX_fe, t_mpg_pred], axis=1)
vX_price = pd.concat([vX_fe, v_mpg_pred], axis=1)

num_features_price = tX_price.select_dtypes(include=[np.number]).columns.tolist()
cat_features_price = [c for c in tX_price.columns if c not in num_features_price]

preprocessor_price = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_features_price),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features_price),
    ]
)

alphas_price = np.logspace(-3, 4, 10)
price_reg = RidgeCV(alphas=alphas_price, cv=3)

price_model = Pipeline([
    ('prep', preprocessor_price),
    ('reg', price_reg),
])

price_model.fit(tX_price, tY['price'])

v_mpg = v_mpg_pred.to_numpy()  # reuse the validation MPG predictions computed above
v_price = price_model.predict(vX_price)

pY = pd.DataFrame(
    np.column_stack([v_mpg, v_price]),
    index=vX.index,
    columns=YCols
)

ToCSV(pY, 'SergeiKateS_10')


CPU times: user 1min, sys: 9.22 s, total: 1min 10s
Wall time: 50.1 s


# **References:**

1. Remember to cite your sources here as well! At the least, your textbook should be cited. Google Scholar allows you to effortlessly copy/paste an APA citation format for books and publications. Also cite StackOverflow, package documentation, and other meaningful internet resources to help your peers learn from these (and to avoid plagiarism claims).

<font color=green><h4><b>$\epsilon$. LLM Documentation if used</b></h4></font>

<font color=red><b>Your answer here.</b></font>

ChatGPT was used to explain the code structure in the initial notebook, which is the baseline. DeepSeek was also used to clarify the initial starter ideas at the end of the file.

<font size=5>⌛</font> <strong><font color=orange size=5>Do not exceed competition's runtime limit!</font></strong>

<hr color=red>


In [None]:
tmr.ShowTime()    # measure Colab's runtime. Do not remove. Keep as the last cell in your notebook.

Runtime is 50 sec


## 💡**Starter Ideas**

1. Tune model hyperparameters and try different allowed models
1. Try linear and non-linear feature normalization: shift/scale, log, divide features by features (investigate scatterplot matrix)
1. Try higher order feature interactions and polynomial features on a small subsample. Then identify key features or select key principal components. The final model can be trained on a larger or even full training sample. You can use [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to reduce the feature set
1. Do a thorough EDA: look for feature augmentations that result in linear decision boundaries between pairs of classes.
1. Evaluate predictions and focus on poorly predicted "groups":
  1. Strongest errors. E.g. the model is very confident about the wrong label
1. Do scatter plots show piecewise linear shape? Can a separate linear model be used on each support, or can the pattern be linearized via transformations?
1. Try modeling each output separately from inputs or from another modeled output
1. Try stepwise selection and regularization and remove "unimportant" features from final model
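
As one possible reading of the last idea, L1 regularization can drop "unimportant" features by driving their coefficients to exactly zero. The sketch below is an illustration on synthetic data, assuming `Lasso` is among the models the competition allows; the data, the `alpha` value, and the variable names are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (assumption): only the first two of ten features matter.
X = rng.normal(size=(500, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.5, size=500)

# Scale first so the L1 penalty treats all features alike.
Xs = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(Xs, y)

# Features with a non-zero coefficient survive the selection.
selected = np.flatnonzero(lasso.coef_)
print("kept feature indices:", selected)
```

The surviving columns could then be passed to the final regularized model, with `alpha` chosen by cross-validation rather than fixed as it is here.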