# Notebook 4: Data Improvement & Feature Engineering

This notebook cleans the dataset in three steps: it excludes the `mmr` feature for a leak test, analyzes and removes outliers, and selects the relevant features. The adjusted DataFrame is then saved to a new CSV file so that its effect on model performance can be compared.

## 1. Loading the Data & Overview
This section loads the dataset and examines its basic structure and first summary statistics.

In [1]:
# 1.1 Imports and constants
import pandas as pd
import numpy as np

# Global random state
RANDOM_STATE = 42

# Path to the original CSV
DATA_PATH = "../data/processed/cars_features_ready.csv"

# 1.2 Read the original dataset
df = pd.read_csv(DATA_PATH)

# 1.3 Initial exploration
print(f"Shape of DataFrame: {df.shape}\n")
display(df.head())

print("\nInfo:")
df.info()

print("\nSummary statistics:")
display(df.describe())

# Count missing values per column and list only the affected columns
missing = df.isna().sum()
print("\nMissing values per column:")
print(missing[missing > 0] if missing.sum() > 0 else "No missing values in the dataset.")


Shape of DataFrame: (98129, 24)

| | year | condition | odometer | mmr | sale_year | sale_month | sale_day | sale_weekday | body | transmission | ... | season | has_sport | has_limited | has_lx | has_se | has_touring | has_premium | miles_per_year | color_popularity | sellingprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | 2.0 | 5559.0 | 15350.0 | 2015 | 1 | 13 | 1 | Sedan | automatic | ... | Winter | 0 | 0 | 0 | 1 | 0 | 0 | 5559.0 | 4 | 12000.0 |
| 1 | 2012 | 35.0 | 45035.0 | 15450.0 | 2014 | 12 | 18 | 3 | SUV | automatic | ... | Winter | 0 | 1 | 0 | 0 | 0 | 0 | 22517.5 | 3 | 14100.0 |
| 2 | 2012 | 46.0 | 20035.0 | 20700.0 | 2014 | 12 | 18 | 3 | SUV | automatic | ... | Winter | 0 | 0 | 0 | 1 | 0 | 0 | 10017.5 | 3 | 20800.0 |
| 3 | 2012 | 46.0 | 41115.0 | 19800.0 | 2014 | 12 | 18 | 3 | SUV | automatic | ... | Winter | 0 | 0 | 0 | 1 | 0 | 0 | 20557.5 | 4 | 22100.0 |
| 4 | 2012 | 3.0 | 26747.0 | 12900.0 | 2014 | 12 | 17 | 2 | Hatchback | automatic | ... | Winter | 0 | 0 | 0 | 0 | 0 | 0 | 13373.5 | 6 | 14000.0 |

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98129 entries, 0 to 98128
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   year                98129 non-null  int64  
 1   condition           98129 non-null  float64
 2   odometer            98129 non-null  float64
 3   mmr                 98129 non-null  float64
 4   sale_year           98129 non-null  int64  
 5   sale_month          98129 non-null  int64  
 6   sale_day            98129 non-null  int64  
 7   sale_weekday        98129 non-null  int64  
 8   body                98129 non-null  object 
 9   transmission        98129 non-null  object 
 10  color               98129 non-null  object 
 11  interior            98129 non-null  object 
 12  avg_price_state     98129 non-null  float64
 13  median_price_state  98129 non-null  float64
 14  season              98129 non-null  object 
 15  has_sport           98129 non-null  int64  
 ...

| | year | condition | odometer | mmr | sale_year | sale_month | sale_day | sale_weekday | avg_price_state | median_price_state | has_sport | has_limited | has_lx | has_se | has_touring | has_premium | miles_per_year | color_popularity | sellingprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 98129.0 | 98129.0 | 98129.0 | 98129.0 | 98129.0 | 98129.0 | 98129.0 | 98129.0 | 98129.0 | 98129.0 | 98129.0 | 98129.0 | 98129.0 | 98129.0 | 98129.0 | 98129.0 | 98129.0 | 98129.0 | 98129.0 |
| mean | 2010.919402 | 31.659428 | 60446.622884 | 11858.851359 | 2014.875511 | 3.007592 | 15.194744 | 1.413466 | 11730.951085 | 11583.703085 | 0.012952 | 0.037573 | 0.065271 | 0.291331 | 0.005034 | 0.000245 | 17202.400363 | 3.936981 | 11730.951085 |
| std | 2.914241 | 12.489657 | 42732.015113 | 4959.365272 | 0.33014 | 3.518167 | 8.520374 | 1.221226 | 1213.626387 | 1330.645344 | 0.113069 | 0.190162 | 0.247005 | 0.454378 | 0.070774 | 0.015637 | 9377.824969 | 2.688439 | 4992.008021 |
| min | 1998.0 | 1.0 | 3346.0 | 825.0 | 2014.0 | 1.0 | 1.0 | 0.0 | 5775.0 | 6150.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 728.363636 | 1.0 | 2400.0 |
| 25% | 2010.0 | 25.0 | 28637.0 | 8450.0 | 2015.0 | 1.0 | 8.0 | 1.0 | 11165.785286 | 11100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10911.0 | 2.0 | 8200.0 |
| 50% | 2012.0 | 35.0 | 46245.0 | 11900.0 | 2015.0 | 2.0 | 16.0 | 1.0 | 11976.194837 | 12100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15068.0 | 3.0 | 11700.0 |
| 75% | 2013.0 | 41.0 | 85381.0 | 14950.0 | 2015.0 | 2.0 | 22.0 | 2.0 | 12301.101628 | 12400.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 21016.5 | 5.0 | 14900.0 |
| max | 2015.0 | 49.0 | 221122.0 | 39000.0 | 2015.0 | 12.0 | 31.0 | 6.0 | 13687.691274 | 13400.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 121873.0 | 20.0 | 25000.0 |

Missing values per column:
No missing values in the dataset.
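
`df.describe()` summarizes only the numeric columns, so the five `object` columns listed by `df.info()` (`body`, `transmission`, `color`, `interior`, `season`) are not covered by the statistics above. A minimal sketch of a complementary check follows. It runs on a small toy frame with invented values, not on `df`, and the variable names `toy`, `categorical`, `cardinality`, and `most_frequent` are assumptions made for this illustration.

```python
import pandas as pd

# Toy stand-in for the real data; all values are invented for illustration.
toy = pd.DataFrame({
    "body": ["Sedan", "SUV", "SUV", "Hatchback"],
    "transmission": ["automatic", "automatic", "manual", "automatic"],
    "odometer": [5559.0, 45035.0, 20035.0, 26747.0],
})

# Keep only the non-numeric columns, which describe() skips by default.
categorical = toy.select_dtypes(exclude="number")

# Number of distinct values per categorical column, largest first.
cardinality = categorical.nunique().sort_values(ascending=False)
print(cardinality)

# Most frequent value per categorical column (first mode if there are ties).
most_frequent = categorical.mode().iloc[0]
print(most_frequent)
```

Applied to the real data, the same calls would use `df` in place of `toy`.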


## 2. Excluding the MMR Feature
This section removes the `mmr` feature to prepare a leak test. The resulting DataFrame is used in all subsequent steps.

In [2]:
# 2.1 Create a working DataFrame without 'mmr'
df_no_mmr = df.drop(columns="mmr")

# Check: shape and head of the new DataFrame
print(f"Original DataFrame: {df.shape}")
print(f"DataFrame without 'mmr': {df_no_mmr.shape}\n")
display(df_no_mmr.head())


Original DataFrame: (98129, 24)
DataFrame without 'mmr': (98129, 23)

| | year | condition | odometer | sale_year | sale_month | sale_day | sale_weekday | body | transmission | color | ... | season | has_sport | has_limited | has_lx | has_se | has_touring | has_premium | miles_per_year | color_popularity | sellingprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | 2.0 | 5559.0 | 2015 | 1 | 13 | 1 | Sedan | automatic | white | ... | Winter | 0 | 0 | 0 | 1 | 0 | 0 | 5559.0 | 4 | 12000.0 |
| 1 | 2012 | 35.0 | 45035.0 | 2014 | 12 | 18 | 3 | SUV | automatic | gray | ... | Winter | 0 | 1 | 0 | 0 | 0 | 0 | 22517.5 | 3 | 14100.0 |
| 2 | 2012 | 46.0 | 20035.0 | 2014 | 12 | 18 | 3 | SUV | automatic | gray | ... | Winter | 0 | 0 | 0 | 1 | 0 | 0 | 10017.5 | 3 | 20800.0 |
| 3 | 2012 | 46.0 | 41115.0 | 2014 | 12 | 18 | 3 | SUV | automatic | white | ... | Winter | 0 | 0 | 0 | 1 | 0 | 0 | 20557.5 | 4 | 22100.0 |
| 4 | 2012 | 3.0 | 26747.0 | 2014 | 12 | 17 | 2 | Hatchback | automatic | red | ... | Winter | 0 | 0 | 0 | 0 | 0 | 0 | 13373.5 | 6 | 14000.0 |
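
Why a feature such as `mmr` is a candidate for a leak test can be illustrated on synthetic data. The sketch below assumes that the leak test compares model quality with and without the feature. The data, the coefficients, and the helper `r2_holdout` are invented for illustration and are not taken from this notebook or its models.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000

# Synthetic stand-in data (invented, not the notebook's dataset).
odometer = rng.uniform(5_000, 200_000, size=n)
sellingprice = 25_000 - 0.08 * odometer + rng.normal(0, 3_000, size=n)
# A value estimate that tracks the target closely, as a leaky feature would.
mmr = sellingprice + rng.normal(0, 500, size=n)


def r2_holdout(features, target):
    """Fit ordinary least squares on the first half of the rows
    and return R^2 on the second half."""
    X = np.column_stack([np.ones(len(target)), *features])
    half = len(target) // 2
    coef, *_ = np.linalg.lstsq(X[:half], target[:half], rcond=None)
    residuals = target[half:] - X[half:] @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((target[half:] - target[half:].mean()) ** 2)
    return 1 - ss_res / ss_tot


r2_with = r2_holdout([odometer, mmr], sellingprice)
r2_without = r2_holdout([odometer], sellingprice)
print(f"R^2 with mmr:    {r2_with:.3f}")
print(f"R^2 without mmr: {r2_without:.3f}")
```

A large gap between the two scores indicates that the model leans almost entirely on the one feature, which is the situation a leak test is meant to expose.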


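## 3. Outlier Analysis & Removal
Univariate analysis of the key metrics `sellingprice`, `odometer`, and `miles_per_year`. Outliers are identified with the 1.5×IQR rule and removed: a row is kept only if its value lies within [Q1 - 1.5·IQR, Q3 + 1.5·IQR]. The filters are applied one after another, so each metric's quartiles are computed on the rows that survived the previous filter. The number of dropped rows is reported for each metric.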

In [3]:
# 3.1 Definition of the key metrics
metrics = ["sellingprice", "odometer", "miles_per_year"]

# 3.2 Function for IQR-based outlier removal
def drop_outliers_iqr(df, column):
    """Return a copy of df without rows whose `column` value lies outside the 1.5×IQR fences."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
    # between() is inclusive on both ends, so values on the fences are kept
    mask = df[column].between(lower, upper)
    n_before = df.shape[0]
    df_clean = df[mask].copy()
    n_after = df_clean.shape[0]
    print(f"{column}: {n_before - n_after} outliers removed ({n_after}/{n_before} remaining)")
    return df_clean

# 3.3 Sequential application to the no_mmr DataFrame
df_clean = df_no_mmr.copy()
for col in metrics:
    df_clean = drop_outliers_iqr(df_clean, col)

print(f"\nShape after outlier removal: {df_clean.shape}")


sellingprice: 139 outliers removed (97990/98129 remaining)
odometer: 2070 outliers removed (95920/97990 remaining)
miles_per_year: 4358 outliers removed (91562/95920 remaining)

Shape after outlier removal: (91562, 23)
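
The 1.5×IQR rule can be traced by hand on a tiny series. The values below are invented for illustration, and the quartiles rely on the default linear interpolation of `Series.quantile`.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 100], name="toy_metric")  # invented values

q1 = values.quantile(0.25)   # 2.25 (linear interpolation between 2 and 3)
q3 = values.quantile(0.75)   # 4.75 (linear interpolation between 4 and 5)
iqr = q3 - q1                # 2.5
lower = q1 - 1.5 * iqr       # -1.5
upper = q3 + 1.5 * iqr       # 8.5

# Keep only the values inside the fences; 100 lies above the upper fence.
kept = values[values.between(lower, upper)]
print(f"fences: [{lower}, {upper}]")
print(f"kept: {kept.tolist()}")
```

Because the fences depend on the quartiles of the rows currently in the frame, filtering several metrics one after another can give a different result than computing all fences on the unfiltered data.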


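## 4. Selecting Relevant Features
Redundant features are identified via pairwise absolute correlations. Only `median_price_state` is removed, because it is almost perfectly correlated with `avg_price_state` (absolute correlation 0.98).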

In [4]:
import numpy as np

# 4.1 Correlation matrix of the numeric features (absolute values)
corr_matrix = df_clean.select_dtypes(include=[np.number]).corr().abs()

# 4.2 Show the most strongly correlated feature pairs as a check
#    (upper triangle only, so no self-correlations and no duplicate pairs)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
pairs = (
    upper
    .stack()
    .reset_index()
    .rename(columns={"level_0": "Feature A", "level_1": "Feature B", 0: "AbsCorr"})
    .sort_values("AbsCorr", ascending=False)
)
display(pairs.head(10))

# 4.3 Selective removal: only `median_price_state`
to_drop = ["median_price_state"]
df_reduced = df_clean.drop(columns=to_drop)
print(f"Removed features: {to_drop}")
print(f"Shape after removal: {df_reduced.shape}")

# 4.4 Remaining features as a check
print("Remaining features:")
print(df_reduced.columns.tolist())


| | Feature A | Feature B | AbsCorr |
|---|---|---|---|
| 98 | avg_price_state | median_price_state | 0.979807 |
| 48 | sale_year | sale_month | 0.950287 |
| 1 | year | odometer | 0.808253 |
| 47 | odometer | sellingprice | 0.660951 |
| 16 | year | sellingprice | 0.647429 |
| 49 | sale_year | sale_day | 0.353716 |
| 32 | condition | sellingprice | 0.325538 |
| 62 | sale_month | sale_day | 0.275456 |
| 37 | odometer | avg_price_state | 0.274769 |
| 38 | odometer | median_price_state | 0.27216 |

Removed features: ['median_price_state']
Shape after removal: (91562, 22)
Remaining features:
['year', 'condition', 'odometer', 'sale_year', 'sale_month', 'sale_day', 'sale_weekday', 'body', 'transmission', 'color', 'interior', 'avg_price_state', 'season', 'has_sport', 'has_limited', 'has_lx', 'has_se', 'has_touring', 'has_premium', 'miles_per_year', 'color_popularity', 'sellingprice']
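
The notebook picks the feature to drop by hand. For comparison, a threshold rule on the same upper-triangle matrix could propose candidates automatically. The sketch below is an alternative, not the method used above: the cut-off `THRESHOLD`, the toy frame, and the name `candidates` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; "b" is nearly a copy of "a".
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.1, 2.0, 3.2, 3.9, 5.1],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

THRESHOLD = 0.95  # hypothetical cut-off, not taken from the notebook

# Absolute correlations, restricted to the strict upper triangle.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is a candidate if it correlates above the threshold
# with any column that appears before it in the frame.
candidates = [col for col in upper.columns if (upper[col] > THRESHOLD).any()]
print(candidates)
```

With a cut-off of 0.95, the same rule applied to the pair table above would flag `sale_month` (0.950287 with `sale_year`) in addition to `median_price_state`, so its output is best read as a list of candidates for manual review.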


## 5. Exporting the Cleaned DataFrame
This section combines all previous steps and saves the final DataFrame, without `mmr` and `median_price_state`, as a new CSV file.

In [5]:
# 5.1 Assign the cleaned DataFrame
final_df = df_reduced.copy()

# 5.2 Save as a new CSV file (index=False keeps the row index out of the file)
OUTPUT_PATH = "../data/processed/cars_features_no_mmr_reduced.csv"
final_df.to_csv(OUTPUT_PATH, index=False)
print(f"Cleaned DataFrame saved to: {OUTPUT_PATH}")


Cleaned DataFrame saved to: ../data/processed/cars_features_no_mmr_reduced.csv
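
A quick round trip can confirm that an export written with `index=False` reloads with the same columns and shape. The sketch below uses a toy frame with invented values and a temporary directory. The file name `toy_export.csv` is hypothetical and unrelated to `OUTPUT_PATH`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with invented values standing in for final_df.
toy = pd.DataFrame({
    "year": [2015, 2012],
    "body": ["Sedan", "SUV"],
    "sellingprice": [12000.0, 14100.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_export.csv"  # hypothetical file name
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# With index=False the row index is not written, so no extra
# "Unnamed: 0" column appears when the file is read back.
same_columns = reloaded.columns.tolist() == toy.columns.tolist()
same_shape = reloaded.shape == toy.shape
print(same_columns, same_shape)
```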
