# Preprocessing

## Numeric Columns

### Outliers
- detect with Isolation Forest, then set to NaN

### Missing values
- *Simple Imputer* as a baseline
- *Iterative Imputer* as a more robust alternative -> Con: it takes significantly more time to run
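
To make the difference between the two imputers concrete, here is a minimal sketch on made-up toy data (the array below is an assumption, not taken from the dataset). It only illustrates that `SimpleImputer` fills with a column statistic while `IterativeImputer` predicts the missing value from the other columns.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed to expose IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

# Toy data (made up): the second column is exactly twice the first
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan], [10.0, 20.0]])

# Baseline: fill the gap with the median of the observed values in that column
simple = SimpleImputer(strategy="median").fit_transform(X)

# Alternative: model the incomplete column as a function of the other column
iterative = IterativeImputer(random_state=0).fit_transform(X)

print("SimpleImputer value:   ", simple[3, 1])
print("IterativeImputer value:", iterative[3, 1])
```

On data with a strong relationship between columns, the iterative estimate should land closer to the value implied by that relationship, at the cost of fitting a model per incomplete column.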

## Categorical Columns
### Encoding techniques
- *One Hot Encoding* for Linear Regression, Ridge, Lasso, Support Vector Regression -> generally, high-cardinality columns should be avoided
- *Ordinal Encoding* for Random Forest and Histogram-Based Gradient Boosting, since they are tree-based
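
A small sketch of the two encoders on a made-up `fuel_type` column (the values are assumptions for illustration). It shows why cardinality matters for one-hot encoding: the output grows by one column per category, while ordinal encoding always yields a single column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy column (made up) with three distinct categories
df = pd.DataFrame({"fuel_type": ["Petrol", "Diesel", "Petrol", "Electric"]})

# One-hot: one binary column per category, unseen categories become all 0s
X_ohe = OneHotEncoder(handle_unknown="ignore").fit_transform(df[["fuel_type"]])

# Ordinal: one integer-coded column, unseen categories are mapped to -1
X_oe = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1).fit_transform(df[["fuel_type"]])

print("One-hot shape:", X_ohe.shape)
print("Ordinal shape:", X_oe.shape)
print("Ordinal codes:", X_oe.ravel())
```
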
### Missing values 
- OHE:
    Encoded as all 0s across the one-hot columns
- OE: 
    1. Encoded as -1
    2. Median imputation
    3. Separate "Unknown" category
- Both:
    Model-based encoding, using web scraping
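
The options listed above (except the model-based one) can be sketched with plain pandas. The `exterior_color` values are made up, and pandas category codes stand in for the ordinal encoder here; this is an illustration under those assumptions, not the notebook's actual pipeline.

```python
import numpy as np
import pandas as pd

# Toy column (made up) with one missing value
colors = pd.Series(["Red", np.nan, "Blue", "Red"], name="exterior_color")

# OHE: with the default dummy_na=False, a missing value becomes a row of all 0s
ohe = pd.get_dummies(colors, dtype=int)

# OE option 1: pandas category codes use -1 for missing values
codes = colors.astype("category").cat.codes

# OE option 2: replace the missing code with the median of the observed codes
median_filled = codes.where(codes >= 0).fillna(codes[codes >= 0].median())

# OE option 3: make "Unknown" an explicit category before encoding
with_unknown = colors.fillna("Unknown").astype("category").cat.codes

print(ohe)
print(codes.tolist(), median_filled.tolist(), with_unknown.tolist())
```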


# Feature engineering
(after each section, correlation coefficients and conclusions should be added)
! Question to ask ourselves: 
    Is the impact of the engineered features the same for each model?
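
One way to approach this question is to score each model with and without an engineered feature and compare the gains. The sketch below uses synthetic data and a hypothetical product feature (both are assumptions), so the numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(1, 5, size=(300, 2))
# Synthetic target: the product of the two raw columns plus a little noise
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=300)

# Hypothetical engineered feature: the product of the two raw columns
X_eng = np.column_stack([X, X[:, 0] * X[:, 1]])

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

gains = {}
for name, model in models.items():
    base = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    eng = cross_val_score(model, X_eng, y, cv=5, scoring="r2").mean()
    gains[name] = eng - base
    print(f"{name}: raw={base:.3f} engineered={eng:.3f} gain={gains[name]:.3f}")
```

A linear model cannot represent the product on its own, so it would be expected to gain from this feature; a tree-based model can approximate the interaction from the raw columns, so its gain may differ.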

## Numeric Columns
### Algebraic expressions
### PCA


## Categorical Columns
- unite warranty types: No + Does not apply 


In [None]:
def warranty_func(df):
    return df.replace({'warranty': {'No': 'Does not apply'}})

### Transform to Numeric
- encode *Engine capacity* and *Horsepower* as the lower bound of the given intervals


In [None]:
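import numpy as np
import pandas as pd

def lower_bound_encoder(df):
    """Replace interval strings in the engine and horsepower columns with their lower bound."""
    df = df.copy()

    def splitter(value):
        # Already numeric (this includes float NaN): keep as-is
        if isinstance(value, (float, int)):
            return value
        # Handle missing or unknown
        if pd.isnull(value) or value == "Unknown":
            return np.nan
        # Now, value is a string like "1.6-2.0L" or "200+ HP".
        # Dropping the last two characters removes the trailing unit,
        # and the part before the first '-' is the lower bound.
        splitted = value[:-2].split('-')
        element = splitted[0].strip()
        # Open-ended intervals such as "200+" keep only the number
        if element.endswith('+'):
            element = element[:-1]
        try:
            return float(element)
        except ValueError:
            # Anything that still does not parse is treated as missing
            return np.nan

    for col in ["engine_capacity_cc", "horsepower"]:
        df[col] = df[col].apply(splitter)

    return df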

### Feature combinations
- brand_model, brand_body_type, model_fuel_type, model_trim, seller_type_warranty, interior_color_exterior_color -> these new categories might introduce multicollinearity when used without dropping the original columns
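
The multicollinearity concern can be illustrated on a toy frame (the brands and models below are made up). After one-hot encoding, every `brand` column is the sum of the `brand_model` columns that belong to it, so the combined matrix has fewer independent columns than it has columns.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two brands, three distinct brand/model pairs
df = pd.DataFrame({
    "brand": ["Toyota", "Toyota", "BMW", "BMW"],
    "model": ["Corolla", "Camry", "X5", "X5"],
})
df["brand_model"] = df["brand"] + "_" + df["model"]

# One-hot encode the original column together with the combined one
dummies = pd.get_dummies(df[["brand", "brand_model"]], dtype=float)

# A rank below the column count means some columns are linear combinations of others
rank = np.linalg.matrix_rank(dummies.to_numpy())
print(dummies.shape[1], "columns, rank", rank)
```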

In [None]:
def add_cat_combos_func(df):
    df = df.copy()

    def warranty_helper(element):
        if element == 'Does not apply': return 'No'
        else: return element

    for col in ['brand', 'model', 'body_type', 'fuel_type', 'seller_type', 'trim']:
        df[col] = df[col].replace('Other', np.nan)

    df['warranty'] = df['warranty'].apply(warranty_helper)

    df['brand_model'] = df['brand'].astype(str) + '_' + df['model'].astype(str)
    df['brand_body_type'] = df['brand'].astype(str) + '_' + df['body_type'].astype(str)
    df['model_fuel_type'] = df['model'].astype(str) + '_' + df['fuel_type'].astype(str)
    df['model_trim'] = df['model'].astype(str) + '_' + df['trim'].astype(str)
    df['seller_type_warranty'] = df['seller_type'].astype(str) + '_' + df['warranty'].astype(str)
    df['interior_color_exterior_color'] = df['interior_color'].astype(str) + '_' + df['exterior_color'].astype(str)

#    return df.drop(['model', 'body_type', 'fuel_type', 'warranty', 'seller_type', 'exterior_color', 'interior_color'], axis=1)
    return df

## Reflections

- when training models, drop columns that do not seem to contribute (e.g. transmission type)
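
Before dropping a column, its contribution can be checked rather than guessed. The sketch below uses permutation importance, which is a technique swapped in here for illustration and not something the notes above prescribe; the data is synthetic, with one informative column and one noise column.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))  # column 0 is informative, column 1 is pure noise
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=400)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# Shuffle each column on held-out data and measure how much R^2 drops
result = permutation_importance(model, X_val, y_val, scoring="r2", n_repeats=10, random_state=0)
for name, score in zip(["informative", "noise"], result.importances_mean):
    print(f"{name}: mean drop in R^2 when shuffled = {score:.3f}")
```

A column whose shuffling barely changes the score is a candidate for dropping, though the result depends on the model it is measured with.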

# Models
## Simple Linear regression
### Baseline model

In [None]:
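from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# select_num_columns / select_cat_columns are assumed to be defined earlier in the notebook
num_processor = Pipeline([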
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])

cat_processor = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encoder", OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
        ('num', num_processor, select_num_columns(X_train)),
        ('cat', cat_processor, select_cat_columns(X_train))
    ])

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LinearRegression())
])

cv = cross_validate(pipeline, X_train, y_train, cv=5, scoring='r2', return_train_score=True)

print("Cross-validation results:")
print("Train R^2 scores:", cv['train_score'])
print("Test R^2 scores:", cv['test_score']) 

Cross-validation results:
Train R^2 scores: [0.6528719  0.6597142  0.65714568 0.66187927 0.66109304]
Test R^2 scores: [0.49300481 0.49130671 0.48719121 0.48555077 0.50481921]


### Set outliers to NaN
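Improved the test R^2 in every fold (mean of about 0.51, up from about 0.49). The train scores rose as well (mean of about 0.67), so the gap between train and test remains and the model still overfits.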

In [None]:
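import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer

def iforest_func(X): 
    # Copy the numeric columns so the assignment below does not trigger a pandas
    # SettingWithCopyWarning and the caller's frame stays untouched
    num_train = X.select_dtypes(include=np.number).copy()

    # Temporarily impute missing values in numerical features before applying Isolation Forest
    num_temp = SimpleImputer(strategy='median').fit_transform(num_train)  # median is robust to outliers

    # fit_predict returns -1 for rows flagged as outliers.
    # Note: FunctionTransformer is stateless, so the forest is refit on whatever
    # data is passed in, including validation folds.
    is_outlier = IsolationForest(random_state=42).fit_predict(num_temp) == -1

    # set every numeric value of the outlier rows to NaN
    num_train.loc[is_outlier, :] = np.nan

    return num_train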


iforest = FunctionTransformer(
    iforest_func,
    validate=False
)

num_processor = Pipeline([
    ("iforest", iforest),  # detect outliers
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])

preprocessor = ColumnTransformer([
        ('num', num_processor, select_num_columns(X_train)),
        ('cat', cat_processor, select_cat_columns(X_train))
    ])

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LinearRegression())
])

cv = cross_validate(pipeline, X_train, y_train, cv=5, scoring='r2', return_train_score=True)

print("Cross-validation results:")
print("Train R^2 scores:", cv['train_score'])
print("Test R^2 scores:", cv['test_score']) 


Cross-validation results:
Train R^2 scores: [0.66733497 0.67469692 0.67091371 0.67696587 0.67561404]
Test R^2 scores: [0.51651607 0.50734882 0.51380647 0.50609148 0.52441071]
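
To compare this run with the baseline more directly, the fold scores printed in the two outputs can be reduced to a mean train score, a mean test score, and the gap between them. The helper name `summarize` is made up for this sketch; the numbers are copied from the outputs in this notebook.

```python
from statistics import mean

# Fold scores copied from the two cross-validation outputs
baseline_train = [0.6528719, 0.6597142, 0.65714568, 0.66187927, 0.66109304]
baseline_test = [0.49300481, 0.49130671, 0.48719121, 0.48555077, 0.50481921]
iforest_train = [0.66733497, 0.67469692, 0.67091371, 0.67696587, 0.67561404]
iforest_test = [0.51651607, 0.50734882, 0.51380647, 0.50609148, 0.52441071]

def summarize(train, test):
    """Return mean train R^2, mean test R^2, and the gap between them."""
    return mean(train), mean(test), mean(train) - mean(test)

runs = {
    "baseline": (baseline_train, baseline_test),
    "outliers set to NaN": (iforest_train, iforest_test),
}
for label, (train, test) in runs.items():
    m_train, m_test, gap = summarize(train, test)
    print(f"{label}: train={m_train:.3f} test={m_test:.3f} gap={gap:.3f}")
```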



### Fill in NaN
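Made the model's test performance slightly worse (mean test R^2 of about 0.506, versus about 0.514 with the simple imputer), while the train scores rose to about 0.69. A possible reason is that potentially meaningful patterns were erased when the missing values were filled in.
! Tuning the imputer's parameters might improve the performance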

In [None]:
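from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must be imported before IterativeImputer)
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

num_processor = Pipeline([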
    ("iforest", iforest),  # detect outliers
    ("imputer", IterativeImputer(estimator=RandomForestRegressor(n_estimators=10), max_iter=10, random_state=0)),
    ("scaler", StandardScaler())
])

preprocessor = ColumnTransformer([
        ('num', num_processor, select_num_columns(X_train)),
        ('cat', cat_processor, select_cat_columns(X_train))
    ])

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LinearRegression())
])

cv = cross_validate(pipeline, X_train, y_train, cv=5, scoring='r2', return_train_score=True)

print("Cross-validation results:")
print("Train R^2 scores:", cv['train_score'])
print("Test R^2 scores:", cv['test_score']) 

Cross-validation results:
Train R^2 scores: [0.68168793 0.68882165 0.68446263 0.68664783 0.68911034]
Test R^2 scores: [0.49449894 0.49135105 0.51597223 0.5148079  0.51407056]
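
The tuning idea noted for this step could be sketched with a grid search over the imputer's parameters. Everything below is an assumption for illustration: the data is synthetic, the grid values are arbitrary, and the default `IterativeImputer` estimator is used instead of the random forest to keep the run short.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed to expose IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

pipe = Pipeline([
    ("imputer", IterativeImputer(random_state=0)),
    ("model", LinearRegression()),
])

# Step parameters are addressed as <step name>__<parameter>
param_grid = {
    "imputer__max_iter": [5, 10],
    "imputer__initial_strategy": ["mean", "median"],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="r2").fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

In the notebook's nested pipeline the same parameters would sit deeper in the name path, for example under the `preprocessor` and `num` steps.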


## Ridge

## Lasso

## Elastic Net

## Random Forest Regressor

## Histogram-Based Gradient Boosting

## Support Vector Regression