# Machine Learning

In this section, we attempt to predict the 'buy_broker_var' and 'sell_broker_var' variables that we created in the Broker Analysis file from various features of the home.

In [None]:
import pandas as pd

df = pd.read_csv('housing_data-3.csv')
df = df.drop(['Unnamed: 0'], axis=1)

In [None]:
df.head()

| Unnamed: 0 | home_id | url | lastSoldDate | listingAddedDate | beds | price | lotSize | baths | yearBuilt | sqft | ... | seller_agent | seller_broker | buyer_agent | buyer_broker | yearListed | monthListed | home_id_x | buy_broker_var | home_id_y | sell_broker_var |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | /CA/San-Francisco/3228-Santiago-St-94116/home/... | 2021-04-23T07:00:00Z | 2021-04-02T17:34:02.324Z | 3.0 | 1452500 | 2495.0 | 1.0 | 1941.0 | 1000.0 | ... | Ali Mafi | Redfin | Kent Chen | Compass | 2021 | 4 | 1 | 0.130214 | 0 | 0.115503 |
| 1 | 1 | /CA/San-Francisco/2127-Balboa-St-94121/home/10... | 2022-10-28T07:00:00Z | 2022-10-02T05:56:58.367Z | 2.0 | 1300000 | 2500.0 | 1.0 | 1934.0 | 984.0 | ... | Alex Sobieski | Redfin | Dylan Hunter | Compass | 2022 | 10 | 4 | 0.208221 | 1 | 0.202271 |
| 2 | 2 | /CA/San-Francisco/91-Stoneyford-Ave-94112/home... | 2024-05-08T07:00:00Z | 2024-04-15T19:34:54.328Z | 3.0 | 1130000 | 2426.0 | 1.0 | 1942.0 | 1042.0 | ... | Ali Mafi | Redfin | Nubia Tovar | EXP Realty of California | 2024 | 4 | 6 | 0.173675 | 2 | 0.187282 |
| 3 | 3 | /CA/San-Francisco/2189-Market-St-94114/unit-7/... | 2021-05-26T07:00:00Z | 2021-05-19T17:55:49.660Z | 2.0 | 1165000 | 8182.0 | 1.0 | 2002.0 | 770.0 | ... | Joshua Altman | Redfin | Jeremy Davidson | Compass | 2021 | 5 | 7 | 0.174297 | 3 | 0.174048 |
| 4 | 4 | /CA/San-Francisco/2250-24th-St-94107/unit-325/... | 2023-03-17T07:00:00Z | 2023-02-13T20:27:39.990Z | 2.0 | 665000 | 80224.0 | 1.0 | 1989.0 | 836.0 | ... | Joshua Altman | Redfin | Vantika Singh | FlyHomes, Inc | 2023 | 2 | 8 | 0.355561 | 4 | 0.477931 |


# Data Validation and Pruning

In this section, we check the data for outliers as well as erroneous house data. Many of the Redfin listings themselves seem to contain errors that would make our model worse if left in the DataFrame.

In [None]:
import plotly.express as px

px.box(df, y="buy_broker_var", hover_name="home_id")
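
The box plot flags points beyond its whiskers visually. As a rough numeric counterpart, the sketch below applies the common 1.5 × IQR rule to a toy series standing in for `buy_broker_var`. The helper name `iqr_outliers`, the factor `k`, and the toy values are assumptions for illustration, and the quartile method may differ slightly from the one Plotly uses.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside q1 - k*IQR .. q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy stand-in for df["buy_broker_var"]; these numbers are made up.
toy = pd.Series([0.13, 0.21, 0.17, 0.17, 0.36, 0.95], name="buy_broker_var")
mask = iqr_outliers(toy)
print(toy[mask])  # only the rows flagged as outliers
```

Flagged rows are candidates for manual review, not automatic removal, which matches the listing-by-listing checks used in this section.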

**Beds/Baths**

In [None]:
px.scatter(df, x="beds", y="buy_broker_var", hover_name="home_id")

In [None]:
px.scatter(df, x="baths", y="buy_broker_var", hover_name="home_id")

For some of the houses, the number of beds/baths does not make sense within the context of the listing. We confirmed via the Redfin links that these listings are inaccurate, so we drop them from the DataFrame.

In [None]:
df.drop([1151, 8413, 3176, 10959, 11308], inplace=True)
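
Note that `df.drop([...])` removes rows by index label, while the plots identify houses by `home_id`. The two coincide in the rows shown by `df.head()`, but if they ever diverge (for example after a re-index), filtering on the `home_id` column directly is a safer alternative. A minimal sketch, where `drop_homes` and the toy frame are hypothetical:

```python
import pandas as pd

def drop_homes(frame: pd.DataFrame, home_ids: list) -> pd.DataFrame:
    """Return a copy of frame without the rows whose home_id is in home_ids."""
    return frame[~frame["home_id"].isin(home_ids)].copy()

# Toy frame; the real df has many more columns.
toy = pd.DataFrame({"home_id": [10, 11, 12], "beds": [3.0, 40.0, 2.0]})
cleaned = drop_homes(toy, [11, 999])  # 999 is absent and is simply ignored
print(cleaned)
```

Unlike `df.drop`, this does not raise if an ID has already been removed, so the cell can be rerun safely.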

**Lot Size**

In [None]:
px.scatter(df, x="price", y="lotSize", hover_name="home_id")

We can drop houses 9223, 771, 7513, 11313, 93, 7959, and 11911 from the DataFrame because something is clearly wrong with their listing data. We visited the actual Redfin listing for each house and determined that there was a mistake in the square footage listed for each one.

In [None]:
df.drop([9223, 771, 7513, 11313, 93, 7959, 11911], inplace=True)

**Year Built**

In [None]:
px.scatter(df, x="yearBuilt", y="buy_broker_var", hover_name="home_id")

In [None]:
df.drop(5750, inplace=True)

House 5750 was clearly not built in 1700, since American settlers had not yet started living in California at that time. A quick check of the Redfin listing showed that the house was actually built in the 1900s, so we drop it from the DataFrame.

**Square Footage**

In [None]:
px.scatter(df,
           x="sqft",
           y="price",
           hover_name="home_id",
           labels={"sqft": "Square Footage", "price": "Price ($)"},
           title="Square Footage vs. Price")

We can also remove houses listed with a square footage of 0 or some other implausibly low number (2438, 4889, 2933, 1797, 4980, 11269, 7389, 3626, 5902, 4459, 839, 3509, 5496).

In [None]:
df.drop([2438, 4889, 2933, 1797, 4980, 11269, 7389, 3626, 5902, 4459, 839, 3509, 5496], inplace=True)

# Heat Map Data for Buyer Broker Variance

In [None]:
import folium
from folium.plugins import HeatMap
from IPython.display import IFrame, display
import numpy as np

df['latitude'] = df['latitude'] * 180 / np.pi
df['longitude'] = df['longitude'] * 180 / np.pi

m = folium.Map(
    location=[df['latitude'].mean(), df['longitude'].mean()],
    zoom_start=10
)

heat_data = df[['latitude', 'longitude', 'buy_broker_var']].values.tolist()

HeatMap(heat_data, radius=15, blur=10).add_to(m)

m.save('data_heatmap.html')

Based on the heatmap, the standard deviation metric does not appear to depend significantly on which area of SF the house is in.
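
One thing to watch when preparing heatmap input is missing values: rows with a missing coordinate or weight are likely to cause problems when the data is passed to `HeatMap`. The sketch below shows one way to build the `[latitude, longitude, weight]` rows defensively. The helper `build_heat_data`, the min-max scaling of the weights, and the toy coordinates are assumptions, not part of the original analysis. `np.degrees` performs the same radians-to-degrees conversion as multiplying by `180 / np.pi`.

```python
import numpy as np
import pandas as pd

def build_heat_data(frame: pd.DataFrame, weight_col: str) -> list:
    """Return [lat, lon, weight] rows with missing values dropped and weights scaled to [0, 1]."""
    cols = ["latitude", "longitude", weight_col]
    clean = frame[cols].dropna().copy()
    w = clean[weight_col]
    span = w.max() - w.min()
    # Fall back to a constant weight when every value is identical.
    clean[weight_col] = (w - w.min()) / span if span > 0 else 1.0
    return clean.values.tolist()

# Toy coordinates in radians, converted to degrees; the third row lacks a longitude.
toy = pd.DataFrame({
    "latitude": np.degrees([0.6592, 0.6594, 0.6593]),
    "longitude": np.degrees([-2.1366, -2.1368, np.nan]),
    "buy_broker_var": [0.10, 0.30, 0.20],
})
heat_data = build_heat_data(toy, "buy_broker_var")
print(heat_data)
```

Because the conversion to degrees mutates `df` in place in the cell above, rerunning that cell would convert the coordinates twice; building the converted values in a separate frame, as here, avoids that.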

# Predicting Buyer Broker Variance

We now create a machine learning model to predict our 'buy_broker_var' variable to determine whether the features of a house can tell us how much the buyer broker impacts the house price.

In [None]:
quant_features = ['beds', 'baths', 'yearBuilt', 'sqft', 'population', 'density', 'yearListed', 'lotSize','monthListed', 'longitude', 'latitude']

In [None]:
cat_features = ['seller_agent', 'seller_broker']

Since the seller_agent and seller_broker columns have high cardinality, we first relabel agents and brokers with fewer than 10 deals as "rare_agent" and "rare_broker", respectively. We then replace each label with the number of deals associated with it, which turns both columns into quantitative variables.

In [None]:
agent_counts = df['seller_agent'].value_counts()
rare_agents = agent_counts[agent_counts < 10].index
df['seller_agent_clean'] = df['seller_agent'].replace(rare_agents, 'rare_agent')
df['seller_agent_freq'] = df['seller_agent_clean'].map(df['seller_agent_clean'].value_counts())


broker_counts = df['seller_broker'].value_counts()
rare_brokers = broker_counts[broker_counts < 10].index
df['seller_broker_clean'] = df['seller_broker'].replace(rare_brokers, 'rare_broker')
df['seller_broker_freq'] = df['seller_broker_clean'].map(df['seller_broker_clean'].value_counts())

quant_features.extend(['seller_agent_freq', 'seller_broker_freq'])
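
To make the effect of this encoding concrete, here is the same logic applied to a five-row toy column. The function name `frequency_encode` and the threshold of 2 are chosen only for illustration.

```python
import pandas as pd

def frequency_encode(col: pd.Series, min_count: int, rare_label: str) -> pd.Series:
    """Pool labels seen fewer than min_count times, then map each label to its count."""
    counts = col.value_counts()
    rare = counts[counts < min_count].index
    cleaned = col.where(~col.isin(rare), rare_label)
    return cleaned.map(cleaned.value_counts())

toy = pd.Series(["A", "A", "A", "B", "C"], name="seller_agent")
freq = frequency_encode(toy, min_count=2, rare_label="rare_agent")
print(freq.tolist())  # [3, 3, 3, 2, 2]
```

Agents B and C each have one deal, but after pooling they both receive the combined count of the rare group (2). In a real dataset the pooled group can therefore end up with one of the largest frequencies, which is worth keeping in mind when interpreting this feature.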

In [None]:
for feature in quant_features:
    corr_value = df[feature].corr(df['buy_broker_var'])
    print(f"Correlation between {feature} and buy_broker_var: {corr_value}")

Correlation between beds and buy_broker_var: 0.1195491195169233
Correlation between baths and buy_broker_var: 0.09194197578001623
Correlation between yearBuilt and buy_broker_var: -0.1061481671581901
Correlation between sqft and buy_broker_var: 0.16971145341583854
Correlation between population and buy_broker_var: -0.1174648001827489
Correlation between density and buy_broker_var: 0.22808842280114475
Correlation between yearListed and buy_broker_var: 0.10523370077562857
Correlation between lotSize and buy_broker_var: 0.03369340592554545
Correlation between monthListed and buy_broker_var: 0.007480181215526382
Correlation between longitude and buy_broker_var: 0.14576242967607134
Correlation between latitude and buy_broker_var: 0.31847814835094684
Correlation between seller_agent_freq and buy_broker_var: 0.0010952239646960708
Correlation between seller_broker_freq and buy_broker_var: 0.01832618022364609
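
The same correlations can be computed in one call and ranked by absolute value, which makes the strongest candidates easier to spot. A small sketch on made-up data, where `rank_correlations` is a hypothetical helper:

```python
import pandas as pd

def rank_correlations(frame: pd.DataFrame, features: list, target: str) -> pd.Series:
    """Pearson correlation of each feature with the target, strongest (by |r|) first."""
    corr = frame[features].corrwith(frame[target])
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    "sqft": [800, 1000, 1200, 1400, 1600],
    "monthListed": [4, 10, 4, 5, 2],
    "buy_broker_var": [0.10, 0.14, 0.18, 0.22, 0.26],
})
ranked = rank_correlations(toy, ["monthListed", "sqft"], "buy_broker_var")
print(ranked)
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a feature being useful to a nonlinear model.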


In [None]:
cat_features = []
quant_features = ['sqft', 'longitude', 'latitude']
text_features = ['description']
features = cat_features + quant_features + text_features

X_train = df[features]
y_train = df['buy_broker_var']

In [None]:
import xgboost as xgb
from lightgbm import LGBMRegressor
from sklearn.compose import make_column_transformer
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    MinMaxScaler,
    Normalizer,
    OneHotEncoder,
    PowerTransformer,
    QuantileTransformer,
    RobustScaler,
    StandardScaler,
)

We can tune two choices for our KNeighborsRegressor pipeline: the scaler applied to the quantitative variables and the distance metric.

In [None]:
def estimate_test_rmse(distance_metric="euclidean", scaler=StandardScaler(with_mean=False)):
  """Print the 4-fold cross-validated RMSE for one metric/scaler combination."""
  # Scale the quantitative columns and vectorize the listing description.
  transformer = make_column_transformer(
    (scaler, quant_features),
    (TfidfVectorizer(), 'description'),
    remainder="passthrough")

  pipeline = make_pipeline(
    transformer,
    KNeighborsRegressor(n_neighbors=10, metric=distance_metric))

  # scikit-learn reports negated MSE, so flip the sign before taking the root.
  scores = cross_val_score(
    pipeline,
    X=X_train,
    y=y_train,
    scoring="neg_mean_squared_error",
    cv=4)
  print(f'Using metric {distance_metric} and scaler {scaler}: RMSE = {np.sqrt(-scores.mean())}')

metrics = ['euclidean', 'manhattan', 'cosine']
scalers = [PowerTransformer(), QuantileTransformer(), RobustScaler(), StandardScaler()]

for metric in metrics:
  for scaler in scalers:
    estimate_test_rmse(metric, scaler)

Using metric euclidean and scaler PowerTransformer(): RMSE = 0.07788698903079105
Using metric euclidean and scaler QuantileTransformer(): RMSE = 0.07486277156545053
Using metric euclidean and scaler RobustScaler(): RMSE = 0.07260113063969052
Using metric euclidean and scaler StandardScaler(): RMSE = 0.07250914315509957
Using metric manhattan and scaler PowerTransformer(): RMSE = 0.08945968945311554
Using metric manhattan and scaler QuantileTransformer(): RMSE = 0.09470161519647032
Using metric manhattan and scaler RobustScaler(): RMSE = 0.08618155429126301
Using metric manhattan and scaler StandardScaler(): RMSE = 0.08549727480570063
Using metric cosine and scaler PowerTransformer(): RMSE = 0.116315689624475
Using metric cosine and scaler QuantileTransformer(): RMSE = 0.08914366660503172
Using metric cosine and scaler RobustScaler(): RMSE = 0.08672436742918696
Using metric cosine and scaler StandardScaler(): RMSE = 0.08844237450063219
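
The nested loop above can also be expressed with scikit-learn's `GridSearchCV`, which runs the same cross-validation and records the best combination. The sketch below uses synthetic numeric data in place of `X_train` and `y_train`; the step names `scaler` and `knn` and the toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

# Synthetic stand-in data: three features on very different scales.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3)) * [1.0, 50.0, 0.01]
y_toy = X_toy[:, 0] + 0.1 * rng.normal(size=120)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=10)),
])

# Whole pipeline steps (the scaler) and step parameters (the metric) can both be searched.
param_grid = {
    "scaler": [StandardScaler(), RobustScaler()],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(pipe, param_grid, scoring="neg_root_mean_squared_error", cv=4)
search.fit(X_toy, y_toy)
print(search.best_params_, -search.best_score_)
```

Note that `neg_root_mean_squared_error` averages the per-fold RMSE values, so its numbers can differ slightly from the square root of the mean MSE printed by the loop above.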


Among the single regressors we tried, the LightGBM regressor had the best RMSE, so we use it to create an initial prediction model.

In [None]:
transformer = make_column_transformer(
  (OneHotEncoder(), cat_features),
  (StandardScaler(), quant_features),
  (TfidfVectorizer(), 'description'),
  remainder="passthrough")

pipeline = make_pipeline(
    transformer,
    LGBMRegressor(),
    )

pipeline.fit(X=X_train, y=y_train)
pipeline.predict(X=X_train)


'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.375519 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 228442
[LightGBM] [Info] Number of data points in the train set: 11981, number of used features: 3483
[LightGBM] [Info] Start training from score 0.212353



array([0.1276753 , 0.21670199, 0.14735803, ..., 0.228887  , 0.14335851,
       0.19234007])

In [None]:
scores = cross_val_score(
  pipeline,
  X=X_train,
  y=y_train,
  scoring="neg_mean_squared_error",
  cv=4)
np.sqrt(-scores.mean())


0.07107822259505878
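
The value above is the square root of the mean fold MSE. An alternative summary is the mean of the per-fold RMSE values; the two are close but not identical, and the second can never exceed the first. A short sketch with hypothetical fold scores (not the scores from this notebook):

```python
import numpy as np

# Hypothetical negated MSE scores, in the form returned by cross_val_score.
neg_mse_scores = np.array([-0.004, -0.006, -0.005, -0.005])

# What the cell above computes: root of the averaged MSE.
rmse_of_mean_mse = np.sqrt(-neg_mse_scores.mean())

# Alternative: average the RMSE of each fold.
mean_of_fold_rmse = np.sqrt(-neg_mse_scores).mean()

print(rmse_of_mean_mse, mean_of_fold_rmse)
```

Either convention is reasonable as long as the same one is used for every model being compared.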

To further improve performance, we again implement the stacking ensemble method, using K-Neighbors, Gradient Boosting, LightGBM, and XGBoost regressors as base estimators and Linear Regression as the final estimator.

In [None]:
transformer = make_column_transformer(
  (OneHotEncoder(), cat_features),
  (StandardScaler(), quant_features),
  (TfidfVectorizer(), 'description'),
  remainder="passthrough")

estimators = [
    ('knr', KNeighborsRegressor(n_neighbors=10, metric='euclidean')),
    ('gbr', GradientBoostingRegressor(random_state=42)),
    ('lgbm', LGBMRegressor()),
    ('xgb', xgb.XGBRegressor(objective="reg:squarederror", random_state=42))
]

stacked_regressor = StackingRegressor(
    estimators=estimators,
    final_estimator=LinearRegression()
)

pipeline = make_pipeline(
    transformer,
    stacked_regressor
)

pipeline.fit(X_train, y_train)
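
For intuition about what stacking does when fitted: each base estimator produces out-of-fold predictions via cross-validation (5-fold by default in `StackingRegressor`), and the final estimator is trained on those predictions. The sketch below reproduces that idea by hand on synthetic data. The two base models and the toy arrays are stand-ins chosen for illustration, not the estimators used above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = X_toy[:, 0] ** 2 + X_toy[:, 1] + 0.1 * rng.normal(size=200)

base_models = [
    KNeighborsRegressor(n_neighbors=10),
    DecisionTreeRegressor(max_depth=3, random_state=42),
]

# Each column holds one base model's out-of-fold predictions for every row.
meta_features = np.column_stack(
    [cross_val_predict(model, X_toy, y_toy, cv=5) for model in base_models]
)

# The final estimator learns how to weight the base models' predictions.
final_estimator = LinearRegression().fit(meta_features, y_toy)
print(final_estimator.coef_)
```

Using out-of-fold predictions keeps the final estimator from simply rewarding whichever base model overfits the training data most.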


In [None]:
scores = cross_val_score(
    pipeline,
    X=X_train,
    y=y_train,
    scoring="neg_mean_squared_error",
    cv=5
)

print(f'Stacked Regressor RMSE: {np.sqrt(-scores.mean())}')


[LightGBM] [Info] Start training from score 0.204340



'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.251886 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 169117
[LightGBM] [Info] Number of data points in the train set: 7668, number of used features: 2763
[LightGBM] [Info] Start training from score 0.199298



'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.



Stacked Regressor RMSE: 0.06657264306827737



# Predicting Seller Broker Variance

We can also build a machine learning model to predict the 'sell_broker_var' variable, to determine whether, from a seller's perspective, the choice of seller broker matters for certain houses.

In [None]:
quant_features = ['beds', 'baths', 'yearBuilt', 'sqft', 'population', 'density', 'yearListed', 'lotSize', 'monthListed', 'longitude', 'latitude']

In [None]:
cat_features = ['buyer_agent', 'buyer_broker']

We apply the same feature engineering to the buyer_broker and buyer_agent columns, since they also have high cardinality. Agents and brokers with fewer than 10 deals are relabeled as "rare_agent" and "rare_broker", and each label is then mapped to how often it appears.

In [None]:
# Relabel agents with fewer than 10 deals, then frequency-encode the cleaned labels.
agent_counts = df['buyer_agent'].value_counts()
rare_agents = agent_counts[agent_counts < 10].index
df['buyer_agent_clean'] = df['buyer_agent'].replace(rare_agents, 'rare_agent')
df['buyer_agent_freq'] = df['buyer_agent_clean'].map(df['buyer_agent_clean'].value_counts())

# Same treatment for brokers.
broker_counts = df['buyer_broker'].value_counts()
rare_brokers = broker_counts[broker_counts < 10].index
df['buyer_broker_clean'] = df['buyer_broker'].replace(rare_brokers, 'rare_broker')
df['buyer_broker_freq'] = df['buyer_broker_clean'].map(df['buyer_broker_clean'].value_counts())

quant_features.extend(['buyer_agent_freq', 'buyer_broker_freq'])
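
The relabel-then-count steps above can be wrapped in a small helper, which makes it easier to check the encoding on a toy example. This is only a sketch: the function name `rare_label_frequency_encode`, its parameters, and the toy series are hypothetical and not part of the notebook.

```python
import pandas as pd

def rare_label_frequency_encode(series, min_count=10, rare_label="rare"):
    """Relabel categories seen fewer than min_count times, then map labels to counts."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent labels as-is; replace infrequent ones with rare_label.
    cleaned = series.where(~series.isin(rare), rare_label)
    # Frequency encoding: each row gets the count of its (cleaned) label.
    freq = cleaned.map(cleaned.value_counts())
    return cleaned, freq

# Toy data (hypothetical): "A" appears 3 times, "B" and "C" once each.
toy = pd.Series(["A", "A", "A", "B", "C"])
cleaned, freq = rare_label_frequency_encode(toy, min_count=2, rare_label="rare_agent")
print(cleaned.tolist())  # ['A', 'A', 'A', 'rare_agent', 'rare_agent']
print(freq.tolist())     # [3, 3, 3, 2, 2]
```

Note that the rare bucket receives the combined count of all rare labels, so a "rare" agent can end up with a larger frequency value than a moderately active one.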

In [None]:
for feature in quant_features:
    corr_value = df[feature].corr(df['sell_broker_var'])
    print(f"Correlation between {feature} and sell_broker_var: {corr_value}")

Correlation between beds and sell_broker_var: 0.10540165021949309
Correlation between baths and sell_broker_var: 0.0768811068618179
Correlation between yearBuilt and sell_broker_var: -0.0744343185573079
Correlation between sqft and sell_broker_var: 0.1408920164291525
Correlation between population and sell_broker_var: -0.1099114184215411
Correlation between density and sell_broker_var: 0.2476259672372973
Correlation between yearListed and sell_broker_var: 0.11768979128635355
Correlation between lotSize and sell_broker_var: 0.04257102393625858
Correlation between monthListed and sell_broker_var: 0.005837383589713235
Correlation between longitude and sell_broker_var: 0.1494563073342089
Correlation between latitude and sell_broker_var: 0.3249726156588481
Correlation between buyer_agent_freq and sell_broker_var: 0.04518819287693668
Correlation between buyer_broker_freq and sell_broker_var: 0.02116401611429179
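
Ranking the correlations by absolute value makes the strongest linear relationships easier to spot. The sketch below re-enters the values printed above (rounded to four decimals) rather than recomputing them. Whether the author selected features this way is an assumption, but the top four do match the quantitative features used in the next cell.

```python
import pandas as pd

# Correlations with sell_broker_var, copied from the output above (rounded).
corr = pd.Series({
    "beds": 0.1054, "baths": 0.0769, "yearBuilt": -0.0744, "sqft": 0.1409,
    "population": -0.1099, "density": 0.2476, "yearListed": 0.1177,
    "lotSize": 0.0426, "monthListed": 0.0058, "longitude": 0.1495,
    "latitude": 0.3250, "buyer_agent_freq": 0.0452, "buyer_broker_freq": 0.0212,
})

# Sort by magnitude so that negative correlations are not ranked last.
ranked = corr.abs().sort_values(ascending=False)
top_four = ranked.head(4).index.tolist()
print(top_four)  # ['latitude', 'density', 'longitude', 'sqft']
```

On the full DataFrame, the same ranking could be obtained directly with `df[quant_features].corrwith(df['sell_broker_var'])`.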


In [None]:
cat_features = []
quant_features = ['sqft', 'longitude', 'latitude', 'density']
text_features = ['description']
features = cat_features + quant_features + text_features

X_train = df[features]
y_train = df['sell_broker_var']

We again tune two choices for the KNeighborsRegressor pipeline: the scaler applied to the quantitative variables and the distance metric.

In [None]:
def estimate_test_mse(distance_metric="euclidean", scaler=StandardScaler(with_mean=False)):
  # Scale the quantitative columns and vectorize the listing description.
  transformer = make_column_transformer(
    (scaler, quant_features),
    (TfidfVectorizer(), 'description'),
    remainder="passthrough")

  pipeline = make_pipeline(
    transformer,
    KNeighborsRegressor(n_neighbors=10, metric=distance_metric))

  # 4-fold cross-validation; each score is the negative MSE of one fold.
  scores = cross_val_score(
    pipeline,
    X=X_train,
    y=y_train,
    scoring="neg_mean_squared_error",
    cv=4)
  # Despite the function name, we report RMSE: the square root of the mean fold MSE.
  print(f'Using metric {distance_metric} and scaler {scaler}: RMSE = {np.sqrt(-scores.mean())}')

# Candidate distance metrics and scalers to compare.
metrics = ['euclidean', 'manhattan', 'cosine']
scalers = [PowerTransformer(), QuantileTransformer(), RobustScaler(), StandardScaler()]

# Evaluate every (metric, scaler) combination.
for metric in metrics:
  for scaler in scalers:
    estimate_test_mse(metric, scaler)

Using metric euclidean and scaler PowerTransformer(): RMSE = 0.07583034243703879
Using metric euclidean and scaler QuantileTransformer(): RMSE = 0.0761030671726533
Using metric euclidean and scaler RobustScaler(): RMSE = 0.07334923417892338
Using metric euclidean and scaler StandardScaler(): RMSE = 0.07313867004901058
Using metric manhattan and scaler PowerTransformer(): RMSE = 0.08747714933713535
Using metric manhattan and scaler QuantileTransformer(): RMSE = 0.09838825143538707
Using metric manhattan and scaler RobustScaler(): RMSE = 0.0855426995352814
Using metric manhattan and scaler StandardScaler(): RMSE = 0.0846727842040149
Using metric cosine and scaler PowerTransformer(): RMSE = 0.11268314355003993
Using metric cosine and scaler QuantileTransformer(): RMSE = 0.09426457680677024
Using metric cosine and scaler RobustScaler(): RMSE = 0.0872475274001852
Using metric cosine and scaler StandardScaler(): RMSE = 0.08562849847689208
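
The nested loop above could also be written with scikit-learn's `GridSearchCV`, which stores every combination's score and exposes the best one. This is a sketch of that alternative, not the notebook's method: the toy data, the step names `prep`, `scale`, `tfidf` and `knn`, and the reduced grid are all assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical toy data standing in for the housing DataFrame.
rng = np.random.default_rng(0)
n = 120
toy_X = pd.DataFrame({
    "sqft": rng.uniform(500, 3000, size=n),
    "latitude": rng.uniform(37.70, 37.81, size=n),
    "description": rng.choice(
        ["sunny condo with view", "remodeled family home", "fixer upper near park"],
        size=n),
})
toy_y = 0.2 + 0.00002 * toy_X["sqft"] + rng.normal(scale=0.02, size=n)

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("scale", StandardScaler(), ["sqft", "latitude"]),
        ("tfidf", TfidfVectorizer(), "description"),
    ])),
    ("knn", KNeighborsRegressor(n_neighbors=10)),
])

# Each key addresses one pipeline component; the scaler itself is swapped out.
param_grid = {
    "prep__scale": [StandardScaler(), RobustScaler()],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error", cv=4)
search.fit(toy_X, toy_y)

best_rmse = np.sqrt(-search.best_score_)  # root of the mean fold MSE, as above
print(search.best_params_, best_rmse)
```

`search.cv_results_` holds the per-combination scores, which is convenient when the grid grows beyond what is readable as printed lines.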


Again, the single regressor with the best RMSE was the LightGBM regressor, so we use that to create an initial prediction model.

In [None]:
transformer = make_column_transformer(
  (OneHotEncoder(), cat_features),
  (StandardScaler(), quant_features),
  (TfidfVectorizer(), 'description'),
  remainder="passthrough")

pipeline = make_pipeline(
    transformer,
    LGBMRegressor(),
    )

pipeline.fit(X=X_train, y=y_train)

scores = cross_val_score(
  pipeline,
  X=X_train,
  y=y_train,
  scoring="neg_mean_squared_error",
  cv=4)
np.sqrt(-scores.mean())


0.07178441356442404
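
The RMSE reported here is the square root of the mean fold MSE, which is not the same as averaging the per-fold RMSEs. The sketch below uses hypothetical fold scores (not values from this notebook) to show the difference. Because the square root is concave, the first aggregation is never smaller than the second.

```python
import numpy as np

# Hypothetical per-fold scores, as returned with scoring="neg_mean_squared_error".
scores = np.array([-0.0049, -0.0052, -0.0050, -0.0056])

# Aggregation used in this notebook: root of the mean MSE across folds.
rmse_of_mean_mse = np.sqrt(-scores.mean())

# Alternative: take the RMSE of each fold first, then average.
mean_of_fold_rmse = np.sqrt(-scores).mean()

print(rmse_of_mean_mse, mean_of_fold_rmse)
```

The gap is small when fold errors are similar, but it matters when comparing numbers produced with different scoring strings such as `"neg_root_mean_squared_error"`.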

To further improve performance, we again use a stacking ensemble, with K Neighbors, Gradient Boosting, LightGBM, and XGBoost regressors as the base estimators and Linear Regression as the final estimator.

In [None]:
transformer = make_column_transformer(
  (OneHotEncoder(), cat_features),
  (StandardScaler(), quant_features),
  (TfidfVectorizer(), 'description'),
  remainder="passthrough")

estimators = [
    ('knr', KNeighborsRegressor(n_neighbors=10, metric='euclidean')),
    ('gbr', GradientBoostingRegressor(random_state=42)),
    ('lgbm', LGBMRegressor()),
    ('xgb', xgb.XGBRegressor(objective="reg:squarederror", random_state=42))
]

stacked_regressor = StackingRegressor(
    estimators=estimators,
    final_estimator=LinearRegression()
)

pipeline = make_pipeline(
    transformer,
    stacked_regressor
)

pipeline.fit(X_train, y_train)



In [None]:
scores = cross_val_score(
    pipeline,
    X=X_train,
    y=y_train,
    scoring="neg_mean_squared_error",
    cv=5
)

print(f'Stacked Regressor RMSE: {np.sqrt(-scores.mean())}')


Stacked Regressor RMSE: 0.06694996372159935
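
The stacking procedure used for both targets can be sketched by hand to show what the final estimator is trained on. This is a simplified illustration on synthetic data, assuming two base models and 5-fold out-of-fold predictions (the default for `StackingRegressor` when `cv` is not set). It is not the notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem (hypothetical stand-in for the housing data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

base_models = [
    KNeighborsRegressor(n_neighbors=10),
    GradientBoostingRegressor(random_state=42),
]

# Step 1: out-of-fold predictions, so each row is predicted by a model
# that did not see it during training.
oof = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base_models])

# Step 2: the final estimator learns how to weight the base predictions.
final = LinearRegression().fit(oof, y)

# Step 3: refit the base models on all data for use at prediction time.
fitted = [m.fit(X, y) for m in base_models]
new_pred = final.predict(np.column_stack([m.predict(X[:3]) for m in fitted]))

print(final.coef_)  # one weight per base model
print(new_pred)
```

Each base model is therefore fitted once on the full training data plus once per inner fold, which is why cross-validating a stacked model is considerably slower than cross-validating a single regressor.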


