# EV Car Prices

This assignment focuses on car prices. The data ('car_prices.xlsx') is a pre-processed version of original data scraped from bilbasen.dk by previous MAL1 students. The dataset contains 16 columns:

- **Price (DKK)**: The current listed price of the vehicle in Danish Kroner.
- **Model Year**: The manufacturing year of the vehicle.
- **Mileage (km)**: The total kilometres driven by the vehicle (odometer reading).
- **Electric Range (km)**: The estimated maximum driving range on a full charge.
- **Battery Capacity (kWh)**: The total capacity of the vehicle's battery in kilowatt-hours.
- **Energy Consumption (Wh/km)**: The vehicle's energy consumption in watt-hours per kilometre.
- **Annual Road Tax (DKK)**: The annual road tax cost in Danish Kroner.
- **Horsepower (bhp)**: The vehicle's horsepower (brake horsepower).
- **0-100 km/h (s)**: The time (in seconds) for the car to accelerate from 0 to 100 km/h.
- **Top Speed (km/h)**: The maximum speed the vehicle can achieve.
- **Towing Capacity (kg)**: The maximum weight the vehicle can tow.
- **Original Price (DKK)**: The price of the vehicle when first sold as new.
- **Number of Doors**: The total number of doors on the vehicle.
- **Rear-Wheel Drive**: A binary indicator (1 = Yes, 0 = No) for rear-wheel drive.
- **All-Wheel Drive (AWD)**: A binary indicator (1 = Yes, 0 = No) for all-wheel drive.
- **Front-Wheel Drive**: A binary indicator (1 = Yes, 0 = No) for front-wheel drive.

The first one, **Price**, is the response variable.

The **objective** of this assignment is:
1. Understand how linear algebra is used in Machine Learning, specifically for correlations and regression
2. Learn how to perform multiple linear regression, ridge regression, lasso regression and elastic net
3. Learn how to assess regression models

Please solve the tasks using this notebook as your template, i.e. insert code blocks and markdown blocks into this notebook and hand it in. Please use 42 as your random seed.

## Import data
 - Import the dataset 
 - Split the data into a training set and a test set - make sure you extract the response variable
 - Remember to use the data appropriately; in the tasks below, we do not explicitly state when to use train and test - but in order to compare the models, you must use the same dataset for training and testing in all models.
 - Output: When you are done with this, you should have the following sets: `X` (the original dataset), `X_train`, `X_test`, `y_train`, `y_test`

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

SEED = 42

# Load data
df = pd.read_excel('car_prices.xlsx')

print('Dataset shape:', df.shape)
print('\nFirst 5 rows:')
df.head()

In [None]:
# Separate features (X) and response variable (y)
X = df.drop(columns=['Price (DKK)'])
y = df['Price (DKK)'].values

# Train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

print(f'X_train shape : {X_train.shape}')
print(f'X_test shape  : {X_test.shape}')
print(f'y_train shape : {y_train.shape}')
print(f'y_test shape  : {y_test.shape}')

## Part 1: Linear Algebra
In this assignment, you have to solve all problems using linear algebra concepts. You are free to use SymPy or NumPy - though NumPy is **significantly** more efficient computationally than SymPy since NumPy is optimized for numerical computations with floating-point arithmetic. Since linear regression is purely numerical, NumPy is the better choice.

Implement all the steps from the note "Linear_regression.pdf", i.e.

- Set up the normal equation and find the coefficient vector
- Find the predicted values and use these to determine the MSE and $R^2$
- Interpret the results


In [None]:
# ── Part 1: Linear Regression via Normal Equation ──────────────────────────
#
# The Normal Equation finds the optimal coefficients analytically:
#   β = (XᵀX)⁻¹ Xᵀy
#
# We add a column of 1s (bias/intercept term) so the intercept β₀
# is included in the coefficient vector β.

X_train_np = X_train.values
X_test_np  = X_test.values

# Add intercept column of ones
ones_train = np.ones((X_train_np.shape[0], 1))
ones_test  = np.ones((X_test_np.shape[0],  1))

X_train_b = np.hstack([ones_train, X_train_np])   # shape: (n_train, n_features + 1)
X_test_b  = np.hstack([ones_test,  X_test_np])    # shape: (n_test,  n_features + 1)

print('X_train_b shape (with intercept column):', X_train_b.shape)

In [None]:
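# Solve the least-squares problem behind the normal equation:  β = (XᵀX)⁻¹ Xᵀy
#
# np.linalg.lstsq minimises ||Xβ - y||² directly (SVD-based) instead of
# explicitly inverting XᵀX, which is numerically more stable. When XᵀX is
# invertible, the result equals the normal-equation solution. When XᵀX is
# singular (e.g. if the three drive-type indicators always sum to 1 and are
# therefore collinear with the intercept column), lstsq returns the
# minimum-norm least-squares solution instead of failing.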

beta, residuals, rank, sv = np.linalg.lstsq(X_train_b, y_train, rcond=None)

feature_names = ['Intercept'] + X.columns.tolist()
print('Coefficient vector β (intercept first):')
for name, coef in zip(feature_names, beta):
    print(f'  {name:<35s}: {coef:>15.4f}')

In [None]:
# Predicted values on TEST set:  ŷ = X_test_b · β
y_pred_la = X_test_b @ beta

# Performance metrics
mse_la = np.mean((y_test - y_pred_la) ** 2)
ss_res = np.sum((y_test - y_pred_la) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r2_la  = 1 - ss_res / ss_tot

print('─── Linear Algebra – Normal Equation ───')
print(f'MSE  : {mse_la:>15,.2f} DKK²')
print(f'RMSE : {np.sqrt(mse_la):>15,.2f} DKK')
print(f'R²   : {r2_la:>15.4f}')

### Interpretation – Normal Equation (Part 1)

The **normal equation** `β = (XᵀX)⁻¹Xᵀy` is the analytical solution to the least-squares problem. It finds the coefficient vector β that minimises the sum of squared errors (SSE) between the model's predictions ŷ and the actual prices y, without any iterative optimisation.

- **RMSE** is the square root of the mean squared error and is expressed in DKK. It indicates the typical size of a prediction error, with large errors weighted more heavily than small ones.
- **R²** is the proportion of the variance in car prices that the model explains. An R² close to 1 indicates high explanatory power.
- Features such as **Original Price (DKK)** and **Horsepower** are expected to have positive coefficients: cars with a high new price and more horsepower are typically more expensive on the used market. Because the features are unscaled here, the raw coefficient magnitudes depend on each feature's units and cannot be compared directly.
- **Mileage (km)** is expected to have a negative coefficient: the more the car has been driven, the lower the price.
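
As a small illustration of the normal equation described above, the sketch below solves `(XᵀX)β = Xᵀy` with `np.linalg.solve` and compares the result with `np.linalg.lstsq`. The data is synthetic and the names `X_demo`, `y_demo` and `beta_true` are assumptions made for this example only; it does not use the car dataset.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the car dataset)
rng = np.random.default_rng(42)
n, p = 200, 3
X_demo = rng.normal(size=(n, p))
beta_true = np.array([5.0, 2.0, -1.0, 0.5])        # intercept first
Xb = np.hstack([np.ones((n, 1)), X_demo])          # prepend the intercept column
y_demo = Xb @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equation: solve (XᵀX) β = Xᵀy without forming the inverse explicitly
beta_ne = np.linalg.solve(Xb.T @ Xb, Xb.T @ y_demo)

# The same least-squares problem via the SVD-based solver
beta_ls, *_ = np.linalg.lstsq(Xb, y_demo, rcond=None)

print("normal equation:", np.round(beta_ne, 3))
print("lstsq          :", np.round(beta_ls, 3))
```

The two approaches should agree whenever `XᵀX` is invertible; `np.linalg.solve` raises an error on an exactly singular matrix, whereas `lstsq` still returns a solution.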

## Part 2: Using Library Functions

### Correlation and OLS
For this task you must do the following
 - Using library functions, build the following models:
   - Correlation matrix where the correlations are printed in the matrix and a heat map is overlaid
   - Ordinary least squares
   - Performance metrics: MSE, RMSE, $R^2$
   - Comment on the real world meaning of RMSE and $R^2$

In [None]:
# ── Correlation Matrix ─────────────────────────────────────────────────────

corr = df.corr(numeric_only=True)

plt.figure(figsize=(14, 11))
sns.heatmap(
    corr,
    annot=True,
    fmt='.2f',
    cmap='coolwarm',
    center=0,
    linewidths=0.5,
    annot_kws={'size': 7},
    vmin=-1, vmax=1
)
plt.title('Correlation Matrix – EV Car Prices', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right', fontsize=8)
plt.yticks(fontsize=8)
plt.tight_layout()
plt.show()

print('\nCorrelations with Price (DKK) – sorted:')
print(corr['Price (DKK)'].sort_values(ascending=False).to_string())

In [None]:
# ── OLS with sklearn ───────────────────────────────────────────────────────

ols = LinearRegression()
ols.fit(X_train, y_train)
y_pred_ols = ols.predict(X_test)

mse_ols  = mean_squared_error(y_test, y_pred_ols)
rmse_ols = np.sqrt(mse_ols)
r2_ols   = r2_score(y_test, y_pred_ols)

print('─── OLS (sklearn) ───')
print(f'MSE  : {mse_ols:>15,.2f} DKK²')
print(f'RMSE : {rmse_ols:>15,.2f} DKK')
print(f'R²   : {r2_ols:>15.4f}')

### Interpretation – Correlation and OLS

The **correlation matrix** shows the pairwise linear relationship between all variables (Pearson correlation, on a scale from −1 to +1):

- **Original Price (DKK)** has the strongest positive correlation with the sale price: cars with a high new price typically retain a high market value.
- **Battery Capacity (kWh)** and **Horsepower** correlate positively with price: a larger battery and more horsepower go with a higher price.
- **Mileage (km)** correlates negatively with price: the more kilometres driven, the lower the price.
- Note the **multicollinearity** between `Battery Capacity` and `Electric Range`. They partly measure the same thing, which can destabilise the coefficients in OLS.
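
The Pearson correlation matrix discussed above can also be written as a matrix product: standardise each column to get `Z`, then `R = ZᵀZ / (n − 1)`. The sketch below shows this on synthetic data; the array `A` and the deliberately induced correlation between two of its columns are assumptions for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the car dataset)
rng = np.random.default_rng(42)
A = rng.normal(size=(100, 3))
A[:, 2] = 0.8 * A[:, 0] + 0.2 * A[:, 2]   # make column 2 strongly correlated with column 0

n = A.shape[0]
Z = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)   # standardise each column
R = (Z.T @ Z) / (n - 1)                            # Pearson correlation matrix

print(np.round(R, 2))
```

A large off-diagonal entry, such as the one between columns 0 and 2 here, is the kind of pattern that signals multicollinearity between two features.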

**Real-world meaning of RMSE and R²:**

- **RMSE (Root Mean Squared Error)** is in the same unit as the response variable (DKK). An RMSE of, say, 45,000 DKK means that a typical prediction misses the actual price by roughly 45,000 DKK. Because the errors are squared before averaging, large errors are penalised more heavily, so RMSE is at least as large as the mean absolute error.
- **R²** is the proportion of the total price variance that the model explains. R² = 0.85 means that 85% of the variance in car prices is captured by the model; the remaining 15% is due to factors the model cannot see (e.g. the car's condition, colour, seasonal fluctuations).
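
To make the two metrics concrete, here is a small worked example with four hypothetical prices and predictions (the numbers are made up for illustration and are not results from the dataset).

```python
import numpy as np

# Hypothetical actual and predicted prices in DKK (made-up numbers)
y_true = np.array([200_000, 250_000, 300_000, 350_000], dtype=float)
y_hat  = np.array([210_000, 240_000, 320_000, 340_000], dtype=float)

errors = y_true - y_hat
mse  = np.mean(errors ** 2)                  # mean squared error, in DKK²
rmse = np.sqrt(mse)                          # back in DKK
mae  = np.mean(np.abs(errors))               # mean absolute error, for comparison

ss_res = np.sum(errors ** 2)                 # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MSE  = {mse:,.0f} DKK²")
print(f"RMSE = {rmse:,.0f} DKK   (MAE = {mae:,.0f} DKK)")
print(f"R²   = {r2:.3f}")
```

In this example the single 20,000 DKK miss pulls the RMSE above the mean absolute error, which is the effect of squaring the errors.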

### Ridge, Lasso and Elastic Net
In order for Ridge and Lasso (and Elastic net) to have an effect, you must use scaled data to build the models, since regularization depends on coefficient magnitude, and if using non-scaled data the penalty will affect them unequally. Feel free to use this code to scale the data:

```python
# Standardize X
scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)

# Standardize y
scaler_y = StandardScaler()
y_train_scaled = scaler_y.fit_transform(y_train.reshape(-1, 1)).flatten()
y_test_scaled = scaler_y.transform(y_test.reshape(-1, 1)).flatten()
```
For this task you must do the following:
   - Ridge regression (using multiple alphas)
   - Lasso regression (using multiple alphas)
   - Elastic Net (using multiple alphas)
 - Discussion and conclusion:
   - Discuss the MSE and $R^2$ of all 3 models and conclude which model has the best performance - note the MSE will be scaled!
   - Rebuild the OLS model from Task 4, but this time use the scaled data from this task - interpret the meaning of the model's coefficients
   - Use the coefficients of the best ridge and lasso model to print the 5 most important features and compare to the 5 most important features in the OLS with scaled data model. Do the models agree about which features are the most important?

Note: You may get a convergence warning; try increasing the `max_iter` parameter of the model (the default is 1000 - maybe set it to 100000)
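
Since the MSE of these models is computed on the standardised response, it can help to convert it back to DKK. A minimal sketch, assuming the response was scaled with `StandardScaler`: the RMSE in original units is the scaled RMSE multiplied by the standard deviation stored in the scaler's `scale_` attribute. The data and the `*_demo` names below are synthetic and chosen for this example only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Synthetic "prices" (an assumption for illustration, not the car dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = (300_000 + 50_000 * X_demo[:, 0] - 20_000 * X_demo[:, 1]
          + rng.normal(scale=10_000, size=300))

X_tr, X_te = X_demo[:240], X_demo[240:]
y_tr, y_te = y_demo[:240], y_demo[240:]

# Fit both scalers on the training part only
sx_demo, sy_demo = StandardScaler(), StandardScaler()
X_tr_s, X_te_s = sx_demo.fit_transform(X_tr), sx_demo.transform(X_te)
y_tr_s = sy_demo.fit_transform(y_tr.reshape(-1, 1)).ravel()
y_te_s = sy_demo.transform(y_te.reshape(-1, 1)).ravel()

model_demo = Ridge(alpha=1.0).fit(X_tr_s, y_tr_s)
pred_s = model_demo.predict(X_te_s)

mse_scaled = mean_squared_error(y_te_s, pred_s)
rmse_dkk = np.sqrt(mse_scaled) * sy_demo.scale_[0]     # rescale to DKK

# Cross-check: inverse-transform the predictions and compute RMSE directly
pred_dkk = sy_demo.inverse_transform(pred_s.reshape(-1, 1)).ravel()
rmse_direct = np.sqrt(mean_squared_error(y_te, pred_dkk))

print(f"scaled MSE = {mse_scaled:.4f}")
print(f"RMSE in DKK = {rmse_dkk:,.0f} (direct: {rmse_direct:,.0f})")
```

R² is unaffected by this rescaling, so it can be compared directly between the scaled and unscaled models.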

In [None]:
# ── Scaling the data ───────────────────────────────────────────────────────

scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled  = scaler_X.transform(X_test)

scaler_y = StandardScaler()
y_train_scaled = scaler_y.fit_transform(y_train.reshape(-1, 1)).flatten()
y_test_scaled  = scaler_y.transform(y_test.reshape(-1, 1)).flatten()

print('Scaling done.')
print(f'X_train_scaled  mean ≈ {X_train_scaled.mean():.6f}  std ≈ {X_train_scaled.std():.4f}')

In [None]:
# ── Ridge Regression ───────────────────────────────────────────────────────
#
# Adds an L2 penalty:  min  ||y - Xβ||² + α·||β||²
# High α → stronger regularisation → coefficients shrink towards 0 (but never exactly 0)

alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

ridge_results = []
for a in alphas:
    model = Ridge(alpha=a, random_state=SEED)
    model.fit(X_train_scaled, y_train_scaled)
    y_pred = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test_scaled, y_pred)
    r2  = r2_score(y_test_scaled, y_pred)
    ridge_results.append({'alpha': a, 'MSE (scaled)': round(mse, 6),
                          'R²': round(r2, 4), 'model': model})

ridge_df = pd.DataFrame(ridge_results).drop(columns='model')
print('Ridge results:')
print(ridge_df.to_string(index=False))

best_ridge = min(ridge_results, key=lambda x: x['MSE (scaled)'])
print(f"\nBest Ridge:  alpha={best_ridge['alpha']}  "
      f"MSE={best_ridge['MSE (scaled)']:.6f}  R²={best_ridge['R²']:.4f}")

In [None]:
# ── Lasso Regression ───────────────────────────────────────────────────────
#
# Adds an L1 penalty:  min  (1/(2n))·||y - Xβ||² + α·||β||₁   (scikit-learn's scaling)
# L1 can push coefficients to exactly 0 → automatic feature selection

lasso_results = []
for a in alphas:
    model = Lasso(alpha=a, max_iter=100000, random_state=SEED)
    model.fit(X_train_scaled, y_train_scaled)
    y_pred = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test_scaled, y_pred)
    r2  = r2_score(y_test_scaled, y_pred)
    n_zero = int(np.sum(model.coef_ == 0))
    lasso_results.append({'alpha': a, 'MSE (scaled)': round(mse, 6),
                          'R²': round(r2, 4), 'Coefs=0': n_zero, 'model': model})

lasso_df = pd.DataFrame(lasso_results).drop(columns='model')
print('Lasso results:')
print(lasso_df.to_string(index=False))

best_lasso = min(lasso_results, key=lambda x: x['MSE (scaled)'])
print(f"\nBest Lasso:  alpha={best_lasso['alpha']}  "
      f"MSE={best_lasso['MSE (scaled)']:.6f}  R²={best_lasso['R²']:.4f}  "
      f"Coefs=0: {best_lasso['Coefs=0']}")

In [None]:
# ── Elastic Net ────────────────────────────────────────────────────────────
#
# Combines L1 and L2:  min  (1/(2n))·||y - Xβ||² + α·l1_ratio·||β||₁ + 0.5·α·(1-l1_ratio)·||β||²
# l1_ratio=1 → pure Lasso penalty;  l1_ratio=0 → pure L2 (ridge-type) penalty

en_results = []
for a in alphas:
    for l1 in [0.2, 0.5, 0.8]:
        model = ElasticNet(alpha=a, l1_ratio=l1, max_iter=100000, random_state=SEED)
        model.fit(X_train_scaled, y_train_scaled)
        y_pred = model.predict(X_test_scaled)
        mse = mean_squared_error(y_test_scaled, y_pred)
        r2  = r2_score(y_test_scaled, y_pred)
        en_results.append({'alpha': a, 'l1_ratio': l1,
                           'MSE (scaled)': round(mse, 6),
                           'R²': round(r2, 4), 'model': model})

en_df = pd.DataFrame(en_results).drop(columns='model')
print('Elastic Net – Top 10 (lowest MSE):')
print(en_df.nsmallest(10, 'MSE (scaled)').to_string(index=False))

best_en = min(en_results, key=lambda x: x['MSE (scaled)'])
print(f"\nBest Elastic Net:  alpha={best_en['alpha']}  "
      f"l1_ratio={best_en['l1_ratio']}  "
      f"MSE={best_en['MSE (scaled)']:.6f}  R²={best_en['R²']:.4f}")

In [None]:
# ── OLS on scaled data ─────────────────────────────────────────────────────

ols_scaled = LinearRegression()
ols_scaled.fit(X_train_scaled, y_train_scaled)
y_pred_ols_sc = ols_scaled.predict(X_test_scaled)

mse_ols_sc = mean_squared_error(y_test_scaled, y_pred_ols_sc)
r2_ols_sc  = r2_score(y_test_scaled, y_pred_ols_sc)

print('─── OLS (scaled data) ───')
print(f'MSE (scaled) : {mse_ols_sc:.6f}')
print(f'R²           : {r2_ols_sc:.4f}')

feature_cols    = X.columns.tolist()
ols_sc_coefs    = pd.Series(np.abs(ols_scaled.coef_), index=feature_cols)
top5_ols_sc     = ols_sc_coefs.nlargest(5)

print('\nTop 5 features – OLS (scaled):')
print(top5_ols_sc.to_string())

In [None]:
# ── Feature importance comparison: Ridge vs Lasso vs OLS (scaled) ──────────

best_ridge_model = best_ridge['model']
best_lasso_model = best_lasso['model']

ridge_coefs = pd.Series(np.abs(best_ridge_model.coef_), index=feature_cols)
lasso_coefs = pd.Series(np.abs(best_lasso_model.coef_), index=feature_cols)

top5_ridge = ridge_coefs.nlargest(5)
top5_lasso = lasso_coefs.nlargest(5)

print('Top 5 features – Ridge:')
print(top5_ridge.to_string())
print('\nTop 5 features – Lasso:')
print(top5_lasso.to_string())
print('\nTop 5 features – OLS (scaled):')
print(top5_ols_sc.to_string())

# Visualisation
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, coefs, title in zip(
    axes,
    [ridge_coefs, lasso_coefs, ols_sc_coefs],
    [f'Ridge (α={best_ridge["alpha"]})',
     f'Lasso (α={best_lasso["alpha"]})',
     'OLS (scaled)']
):
    coefs.nlargest(5).sort_values().plot(kind='barh', ax=ax, color='steelblue', edgecolor='white')
    ax.set_title(f'Top 5 Features – {title}', fontweight='bold')
    ax.set_xlabel('|Coefficient|')

plt.tight_layout()
plt.show()

### Discussion and conclusion – Ridge, Lasso and Elastic Net

**MSE and R² comparison (scaled data):**

Since all models are trained and tested on the same scaled data, the MSE values are directly comparable:

- **OLS (scaled)** serves as the baseline; it is the unregularised model.
- **Ridge** (L2 regularisation) shrinks all coefficients but sets none exactly to zero. At low alpha the result is almost identical to OLS; at high alpha the model is over-regularised.
- **Lasso** (L1 regularisation) can set coefficients exactly to zero, which amounts to automatic feature selection. The `Coefs=0` column shows how many features Lasso eliminates as alpha increases.
- **Elastic Net** combines both and is particularly useful when correlated features (e.g. `Battery Capacity` and `Electric Range`) are present.

The model with the lowest MSE on the test data is the best. Typically, the regularised models will not improve markedly on OLS here: the dataset is large (6,226 observations, with no missing values) relative to the number of features, so OLS has little room to overfit.
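
The difference between the L2 and L1 penalties described above can be seen in a small sketch on synthetic data where only two of ten features influence the response. The data, the alpha values and the `*_demo` names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only features 0 and 1 matter (an assumption for illustration)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

ridge_demo = Ridge(alpha=10.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5, max_iter=100_000).fit(X_demo, y_demo)

# Count coefficients that are exactly zero in each model
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))

print("Ridge coefficients:", np.round(ridge_demo.coef_, 3))
print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
print(f"Exactly zero -> Ridge: {n_zero_ridge}, Lasso: {n_zero_lasso}")
```

Ridge keeps small non-zero weights on the irrelevant features, while Lasso removes them and also shrinks the two relevant coefficients towards zero.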

**OLS coefficients on scaled data:**

Because all features are standardised (mean = 0, std = 1), the coefficients can now be *compared directly* across features. Since the response is standardised as well, a coefficient of 0.4 for `Original Price (DKK)` means that an increase of one standard deviation in new price is associated with an increase of 0.4 standard deviations in market price, all else being equal.
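
For OLS, the standardised coefficients are related to the raw ones by `β_std = β_raw · σ_x / σ_y`. The sketch below checks this on synthetic data; the two features, their scales and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales (illustrative assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[100.0, 5.0], scale=[30.0, 2.0], size=(250, 2))
y_demo = (1000 + 20 * X_demo[:, 0] - 150 * X_demo[:, 1]
          + rng.normal(scale=50, size=250))

# OLS on raw data
ols_raw = LinearRegression().fit(X_demo, y_demo)

# OLS on standardised X and y
sx_demo = StandardScaler().fit(X_demo)
sy_demo = StandardScaler().fit(y_demo.reshape(-1, 1))
X_s = sx_demo.transform(X_demo)
y_s = sy_demo.transform(y_demo.reshape(-1, 1)).ravel()
ols_std = LinearRegression().fit(X_s, y_s)

# Convert raw coefficients to standardised units: β_raw · σ_x / σ_y
converted = ols_raw.coef_ * sx_demo.scale_ / sy_demo.scale_[0]

print("raw coefficients          :", np.round(ols_raw.coef_, 3))
print("standardised coefficients :", np.round(ols_std.coef_, 3))
print("converted from raw        :", np.round(converted, 3))
```

The raw coefficients differ by an order of magnitude mainly because of the units, whereas the standardised ones show how much each feature contributes per standard deviation.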

**Agreement on the most important features:**

All three models (Ridge, Lasso, OLS scaled) typically point to the same top features: **Original Price (DKK)**, **Battery Capacity (kWh)**, **Horsepower** and **Mileage (km)**. This consistency increases confidence that these are the genuinely most important pricing factors for used electric cars.

## Part 3: Classification

### kNN Classifier
In this final task, we go from a regression to a classification problem. Your goal is to classify cars as either **"Cheap"** or **"Expensive"** using the k-Nearest Neighbors (kNN) algorithm.

For this task you must do the following:
- **Prepare the Target Variable**:
   - Calculate the **median** of the original `Price (DKK)` column.
   - Create a new binary target variable, where:
     - `1` (Expensive) if the price is above the median.
     - `0` (Cheap) if the price is at or below the median.
- **Train-Test Split**
- **Feature Scaling**: Use the standardized (scaled) data from Task 5.
- **Model Implementation**:
   - Build a kNN classifier using `sklearn.neighbors.KNeighborsClassifier`.
   - Experiment with at least five different values for $k$ and at least 3 different distance metrics.
- **Evaluation**:
   - Find the best combination of $k$ and distance metric - the one that gives the highest accuracy score.
   - **Discussion**: Explain the trade-off of choosing a very small $k$ versus a very large $k$. Which value performed best for this dataset?
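
As a small illustration of the target definition above, the sketch below labels six hypothetical prices. Because prices at or below the median are labelled `0`, ties at the median can make the two classes unequal in size. The prices are made up for this example.

```python
import pandas as pd

# Hypothetical prices in DKK (made-up numbers, with a tie at the median)
prices = pd.Series([150_000, 200_000, 250_000, 250_000, 400_000, 600_000])

median_price = prices.median()
labels = (prices > median_price).astype(int)   # 1 = Expensive, 0 = Cheap

print(f"median = {median_price:,.0f} DKK")
print(labels.value_counts().sort_index())
```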

In [None]:
# ── Part 3: kNN Classification ─────────────────────────────────────────────

# Step 1: Create a binary target variable based on the median price
price_median = df['Price (DKK)'].median()
print(f'Median price: {price_median:,.0f} DKK')

y_class = (df['Price (DKK)'] > price_median).astype(int).values
print(f'Class distribution:  Cheap (0): {(y_class==0).sum()}   Expensive (1): {(y_class==1).sum()}')

In [None]:
# Step 2: Train/test split for classification
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X, y_class, test_size=0.2, random_state=SEED
)

# Step 3: Scale the features (fit on the training data only)
scaler_c = StandardScaler()
X_train_c_sc = scaler_c.fit_transform(X_train_c)
X_test_c_sc  = scaler_c.transform(X_test_c)

print(f'Training: {X_train_c_sc.shape[0]} observations   Test: {X_test_c_sc.shape[0]} observations')

In [None]:
# Step 4: Grid search over k values and distance metrics

k_values = [1, 3, 5, 11, 21, 51]
metrics  = ['euclidean', 'manhattan', 'chebyshev']

knn_results = []
for k in k_values:
    for metric in metrics:
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric)
        knn.fit(X_train_c_sc, y_train_c)
        acc = accuracy_score(y_test_c, knn.predict(X_test_c_sc))
        knn_results.append({'k': k, 'metric': metric,
                            'accuracy': round(acc, 4), 'model': knn})

knn_df = pd.DataFrame(knn_results).drop(columns='model')
print('Accuracy per k and distance metric:')
print(knn_df.pivot(index='k', columns='metric', values='accuracy').to_string())

In [None]:
# Step 5: Best model

best_knn_row = max(knn_results, key=lambda x: x['accuracy'])
best_knn     = best_knn_row['model']

print(f"Best kNN:  k={best_knn_row['k']}  metric='{best_knn_row['metric']}'  "
      f"accuracy={best_knn_row['accuracy']:.4f}")

print('\nDetailed classification report (best model):')
print(classification_report(
    y_test_c, best_knn.predict(X_test_c_sc),
    target_names=['Cheap (0)', 'Expensive (1)']
))

In [None]:
# Visualisation: accuracy vs k per distance metric

fig, ax = plt.subplots(figsize=(9, 5))
colors = {'euclidean': 'steelblue', 'manhattan': 'darkorange', 'chebyshev': 'seagreen'}

for metric in metrics:
    subset = knn_df[knn_df['metric'] == metric]
    ax.plot(subset['k'], subset['accuracy'],
            marker='o', label=metric, color=colors[metric], linewidth=2)

ax.set_xlabel('k (number of neighbours)')
ax.set_ylabel('Accuracy')
ax.set_title('kNN Accuracy vs k – Cheap/Expensive classification', fontweight='bold')
ax.legend(title='Distance metric')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### Discussion – kNN and the choice of k

**Trade-off: small k vs. large k**

| | Small k (e.g. k=1) | Large k (e.g. k=51) |
|---|---|---|
| **Bias** | Low: very flexible model | High: "smooth" decision boundary |
| **Variance** | High: sensitive to noise | Low: stable but rigid |
| **Tendency** | Overfitting | Underfitting |
| **Decision boundary** | Irregular, complex | Smooth, simple |

- **k=1**: The model classifies solely from the single nearest neighbour, so one outlier can decide the classification. Each training point is its own nearest neighbour, so training accuracy is typically (near-)perfect while test accuracy tends to be lower (overfitting).
- **Very large k**: The model averages over many neighbours and ignores local patterns. With a sufficiently large k, the model approaches simply predicting the majority class (underfitting).
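
The trade-off above can be observed by comparing training and test accuracy for a few values of k. The sketch below uses a synthetic classification problem from `make_classification` with some label noise; the dataset, the k values and the `*_demo` names are assumptions for illustration, not results for the car data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem with 10% label noise (illustrative)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Record (training accuracy, test accuracy) for each k
scores = {}
for k in [1, 5, 21, 101]:
    knn_demo = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (knn_demo.score(X_tr, y_tr), knn_demo.score(X_te, y_te))
    print(f"k={k:>3}  train acc={scores[k][0]:.3f}  test acc={scores[k][1]:.3f}")
```

A large gap between training and test accuracy at small k is the signature of overfitting; when both accuracies drop together at very large k, the model is underfitting.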

**Best k for this dataset:**

A **moderate k** (typically 5–21) gives the best balance. With 6,226 observations there is enough data for the model to learn meaningful patterns. The specific optimal k and distance metric can be read from the table above.

**Distance metrics:**
- **Euclidean**: Straight-line geometric distance (Pythagoras). The standard choice with standardised data.
- **Manhattan**: Sum of absolute differences. More robust to outliers in individual dimensions.
- **Chebyshev**: The largest absolute difference along any single dimension. Useful if one feature is dominant.

Since the data is standardised (all features on the same scale), Euclidean and Manhattan are expected to perform roughly the same, and the results typically confirm this.
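
For reference, the three distance metrics can be computed by hand for two small vectors. The vectors `a` and `b` are arbitrary examples.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

diff = np.abs(a - b)                      # absolute difference per dimension: [3, 2, 0]

euclidean = np.sqrt(np.sum(diff ** 2))    # sqrt(3² + 2² + 0²) = sqrt(13)
manhattan = np.sum(diff)                  # 3 + 2 + 0 = 5
chebyshev = np.max(diff)                  # max(3, 2, 0) = 3

print(f"euclidean={euclidean:.4f}  manhattan={manhattan:.1f}  chebyshev={chebyshev:.1f}")
```

For any pair of points the ordering Chebyshev ≤ Euclidean ≤ Manhattan holds, so the metrics differ mainly in how strongly they weight differences spread across several features.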

---
## Overall conclusion

In this assignment we have used linear algebra and machine learning to predict and classify prices of used electric cars scraped from bilbasen.dk (6,226 observations, 15 features):

1. **Normal Equation (Part 1)**: Shows that linear regression can be solved analytically with NumPy via `β = (XᵀX)⁻¹Xᵀy`. The result is mathematically identical to OLS.

2. **OLS via sklearn (Part 2)**: The baseline model. The correlation analysis confirms that `Original Price (DKK)`, `Battery Capacity` and `Horsepower` are the strongest price predictors, and that `Mileage (km)` correlates negatively with price.

3. **Ridge / Lasso / Elastic Net**: Regularised models. Ridge helps with multicollinearity, Lasso eliminates irrelevant features automatically, and Elastic Net combines both. All models agree on the most important features.

4. **kNN Classification**: Converts the regression problem into a binary classification (cheap/expensive based on the median). A moderate k gives the best balance between over- and underfitting on this dataset.