# Initial Modelling for Ames Iowa Housing Dataset

## Goals
- Train models on the minimally processed dataset to get an understanding of baseline performance
- Identify which models seem most promising for this dataset

## Notes
- The CSV files used for this initial modelling have since been overwritten. They can be regenerated by running only the preprocess_data() function from Data_Preprocessing.ipynb

## Imports

In [34]:
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import KFold, cross_val_score, cross_val_predict
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from scipy.stats import boxcox

from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

RANDOM_SEED = 6


## Reading Data

In [35]:
train = pd.read_csv('train_processed.csv')
test = pd.read_csv('test_processed.csv')

features = train.iloc[:, :-1]
target = train.iloc[:, -1]

target_transformed = np.log1p(target)

In [36]:
features.shape, target.shape, test.shape

((1460, 244), (1460,), (1459, 244))
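
Because the target is modelled on the `np.log1p` scale, the RMSE values reported in this notebook are root mean squared log errors, and predictions have to be mapped back with `np.expm1` before they are submitted. A minimal sketch of that round trip, using made-up prices rather than values from the dataset:

```python
import numpy as np

# Hypothetical sale prices and predictions, for illustration only.
prices = np.array([100_000.0, 150_000.0, 250_000.0])
predicted = np.array([110_000.0, 140_000.0, 240_000.0])

# log1p compresses the long right tail; expm1 is its inverse.
log_prices = np.log1p(prices)
recovered = np.expm1(log_prices)

# RMSE computed on the log scale is the RMSLE of the raw prices.
rmse_on_log_scale = np.sqrt(np.mean((np.log1p(predicted) - log_prices) ** 2))

print(np.allclose(recovered, prices))
print(f"RMSE on log scale: {rmse_on_log_scale:.4f}")
```

An error on this scale is roughly a relative error, so a value around 0.13 corresponds to predictions that are typically off by something in the region of 13% of the sale price.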

## Initialising Models

In [37]:
kfolds = KFold(n_splits=10, shuffle=True, random_state=RANDOM_SEED)

ridge = make_pipeline(RobustScaler(), RidgeCV(cv=kfolds))
lasso = make_pipeline(RobustScaler(), LassoCV(random_state=RANDOM_SEED, cv=kfolds))
elasticnet = make_pipeline(RobustScaler(), ElasticNetCV(cv=kfolds))
svr = make_pipeline(RobustScaler(), SVR())
rfr = RandomForestRegressor(random_state=RANDOM_SEED)
gbr = GradientBoostingRegressor(random_state=RANDOM_SEED)
lightgbm = LGBMRegressor(random_state=RANDOM_SEED)
xgboost = XGBRegressor(random_state=RANDOM_SEED)
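
The linear models and the SVR are wrapped in a pipeline with `RobustScaler`, so the scaler is refit on the training part of every cross-validation fold; the tree-based models are left unscaled. With its default settings, `RobustScaler` centres each feature on its median and divides by the interquartile range, which keeps a single extreme value from dominating the scale. A small sketch on a toy column (not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one extreme value, for illustration only.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Default behaviour: subtract the median (3) and divide by the IQR (4 - 2 = 2).
print(scaled.ravel())
```

The four ordinary values land between -1 and 0.5, while the outlier stays far away from them instead of squeezing everything else towards zero.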

## Initial Modelling
- No hyperparameter tuning
- Target log-transformed (np.log1p)

In [38]:
results = {}
scoring = 'neg_mean_squared_error'
models = {
    'Ridge': ridge,
    'Lasso': lasso,
    'ElasticNet': elasticnet,
    'SVR': svr,
    'RandomForest': rfr,
    'GradientBoostingRegressor': gbr,
    'LightGBM': lightgbm,
    'XGBoost': xgboost
}

for name, model in models.items():
    print(f"Training {name}...")
    start_time = time.time()
    scores = cross_val_score(model, features, target_transformed, cv=kfolds, scoring=scoring)
    
    rmse_scores = np.sqrt(-scores)
    mean_rmse = np.mean(rmse_scores)
    std_rmse = np.std(rmse_scores)
    training_time = time.time() - start_time
    
    results[name] = {
        'Mean RMSE': mean_rmse,
        'Std RMSE': std_rmse,
        'Training Time (s)': training_time
    }
    
    print(f"{name} - Training completed in {training_time:.2f} seconds.")
    print("-" * 50)

results_df = pd.DataFrame(results).T.reset_index()
results_df.columns = ['Model', 'Mean RMSE', 'Std RMSE', 'Training Time (s)']
results_df.sort_values(by='Mean RMSE')

Training Ridge...
Ridge - Training completed in 1.41 seconds.
--------------------------------------------------
Training Lasso...
Lasso - Training completed in 1.98 seconds.
--------------------------------------------------
Training ElasticNet...
ElasticNet - Training completed in 2.10 seconds.
--------------------------------------------------
Training SVR...
SVR - Training completed in 1.14 seconds.
--------------------------------------------------
Training RandomForest...
RandomForest - Training completed in 10.46 seconds.
--------------------------------------------------
Training GradientBoostingRegressor...
GradientBoostingRegressor - Training completed in 3.84 seconds.
--------------------------------------------------
Training LightGBM...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002896 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [In

|   | Model | Mean RMSE | Std RMSE | Training Time (s) |
|---|---|---|---|---|
| 5 | GradientBoostingRegressor | 0.128806 | 0.021071 | 3.841038 |
| 6 | LightGBM | 0.128983 | 0.019724 | 5.939006 |
| 0 | Ridge | 0.137092 | 0.040518 | 1.408832 |
| 7 | XGBoost | 0.141038 | 0.017799 | 2.547016 |
| 4 | RandomForest | 0.142217 | 0.022442 | 10.463284 |
| 2 | ElasticNet | 0.148488 | 0.043025 | 2.101672 |
| 1 | Lasso | 0.148554 | 0.043268 | 1.982795 |
| 3 | SVR | 0.345249 | 0.032728 | 1.14199 |
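
The loop above negates the `neg_mean_squared_error` scores and takes the square root per fold before averaging. scikit-learn (0.22 and later) also ships a `neg_root_mean_squared_error` scorer that should give the same per-fold values directly. A minimal sketch comparing the two on synthetic data, with `Ridge` as a stand-in model:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; the notebook's CSV files are not assumed here.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=6)
cv = KFold(n_splits=5, shuffle=True, random_state=6)
model = Ridge()

# Approach used in the notebook: negate the scores, then take the square root per fold.
rmse_from_mse = np.sqrt(
    -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
)

# Built-in RMSE scorer; only the sign needs flipping.
rmse_direct = -cross_val_score(
    model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
)

print(np.allclose(rmse_from_mse, rmse_direct))
```

Either way, the reported mean is an average of per-fold RMSE values, which is not the same number as the square root of the average MSE across folds.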


## Next Steps
- Initial results are promising: every model except the untuned SVR reaches a mean cross-validated RMSE below 0.15 on the log-transformed target
- There is a lot of scope for feature engineering
- Models need hyperparameter tuning
- I'm not well versed in the topic, but combining predictions from several models or using a stacking algorithm could further improve performance
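
As a starting point for the stacking idea, scikit-learn provides `StackingRegressor`, which trains a final estimator on out-of-fold predictions of the base models. The sketch below is only an illustration under assumptions: it uses synthetic data, two arbitrarily chosen base models and `RidgeCV` as the final estimator, none of which have been evaluated on the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data, for illustration only.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=6)
cv = KFold(n_splits=5, shuffle=True, random_state=6)

# Base models (an arbitrary choice for this sketch).
base_models = [
    ("ridge", Ridge()),
    ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=6)),
]

# The final estimator is fit on out-of-fold predictions of the base models.
stack = StackingRegressor(estimators=base_models, final_estimator=RidgeCV(), cv=5)

scores = cross_val_score(stack, X, y, cv=cv, scoring="neg_mean_squared_error")
stack_rmse = np.sqrt(-scores)
print(f"Stacked RMSE: {stack_rmse.mean():.3f} +/- {stack_rmse.std():.3f}")
```

Scoring the stack with the same folds and metric as the individual models would show whether it actually improves on the best single model.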

In [39]:
lightgbm.fit(features, target_transformed)
scaled_predictions = np.expm1(lightgbm.predict(test))
submission = pd.DataFrame({
    'Id': list(range(1461, 2920)),
    'SalePrice': scaled_predictions
})
submission.head()

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002845 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3430
[LightGBM] [Info] Number of data points in the train set: 1460, number of used features: 165
[LightGBM] [Info] Start training from score 12.024057


|   | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 125638.467587 |
| 1 | 1462 | 161710.610146 |
| 2 | 1463 | 191458.537926 |
| 3 | 1464 | 189480.428186 |
| 4 | 1465 | 190516.485793 |


In [40]:
submission.to_csv('first_submission_lgbm.csv', index=False)

### Notes
- LGBM performs significantly better on untransformed data
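
Assuming the note above refers to the target transform, a comparison of this kind is only meaningful when both variants are scored on the same scale. One possible setup is scikit-learn's `TransformedTargetRegressor`, which applies `np.log1p` before fitting and `np.expm1` after predicting, so both models can be scored on the original target. In this sketch `GradientBoostingRegressor` stands in for LightGBM and the data is synthetic, so the numbers say nothing about the housing data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, strictly positive and right-skewed target (illustrative only).
X, y_linear = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=6)
y = np.exp(y_linear / y_linear.std())

cv = KFold(n_splits=5, shuffle=True, random_state=6)

# Variant 1: fit directly on the raw target.
raw_model = GradientBoostingRegressor(n_estimators=50, random_state=6)

# Variant 2: fit on log1p(target), predictions mapped back with expm1.
log_model = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(n_estimators=50, random_state=6),
    func=np.log1p,
    inverse_func=np.expm1,
)

def cv_rmse(model):
    """Mean cross-validated RMSE, measured on the original target scale."""
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    return -scores.mean()

rmse_raw = cv_rmse(raw_model)
rmse_log = cv_rmse(log_model)
print(f"raw target: {rmse_raw:.3f}, log1p target: {rmse_log:.3f}")
```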