# Linear Regression using XGBoost

- `Pandas`, `Scikit-learn` for cleaning data
- `Matplotlib`, `Seaborn` for some basic visualizations
- `Scikit-learn` for scaling
- `XGBoost` for predictions

### 1. Packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, mean_squared_log_error
import category_encoders as ce

import xgboost as xgb

### 2. Loading the `train` dataset
Before starting on any task, it is useful to get more familiar with your dataset.

In [None]:
df_train = pd.read_csv('train.csv').set_index('Id')
print ('The shape of df_train is: ', df_train.shape)
df_train.head(2)

- Target values are stored in a 1-D NumPy array `y_train_raw`

In [None]:
y_train_raw = np.array(df_train['SalePrice'])
# Inspect the shape, type, and first few target values
print(f"y_train_raw Shape: {y_train_raw.shape}, y_train_raw Type: {type(y_train_raw)}")
print("First elements of y_train_raw are:\n", y_train_raw[:5])
print("Dimension of y_train_raw:", y_train_raw.ndim)

### 3. Visualize your data
* Correlated columns with `SalePrice`

In [None]:
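# Correlate numeric columns only; recent pandas versions raise an error on text columns
corr = df_train.select_dtypes(include=np.number).corr()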
corr.sort_values(['SalePrice'], ascending=False, inplace=True)
corr.SalePrice.head(7)

- It is often useful to understand the data by visualizing it.

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(20, 15), sharey=True)

# Features most correlated with SalePrice, one scatter plot per subplot
features = ['OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', 'TotalBsmtSF', '1stFlrSF']
colors = ['b', 'g', 'b', 'g', 'b', 'g']

for ax, feature, color in zip(axes.flat, features, colors):
    sns.scatterplot(ax=ax, data=df_train, x=feature, y='SalePrice', color=color)
    ax.set_title(feature)  # plt.title would only label the last active subplot

plt.show()

* Histogram of `SalePrice` on the `train` set

In [None]:
df_train['SalePrice'].hist(bins = 50);

### 4. Loading the `test` dataset

In [None]:
df_test = pd.read_csv('test.csv').set_index('Id')
print ('The shape of df_test is: ', df_test.shape)
df_test.head(2)

### 5. Scaling and category encoder

- Choosing numeric features on `train` set and leaving out `SalePrice`

In [None]:
df_train.drop('SalePrice', axis=1, inplace=True)

In [None]:
select_numeric_features = make_column_selector(dtype_include=np.number)

In [None]:
numeric_features_train = select_numeric_features(df_train)

print(f'N numeric_features_train: {len(numeric_features_train)} \n')
print(', '.join(numeric_features_train))

- Listing numeric features on the `test` set

In [None]:
numeric_features_test = select_numeric_features(df_test)

print(f'N numeric_features_test: {len(numeric_features_test)} \n')
print(', '.join(numeric_features_test))

#### Impute missing feature values with the column `mean`, and standardize numeric features using `StandardScaler`

- `train` & `test` sets

In [None]:
# Missing values are already NaN after pd.read_csv, so no extra fill is needed before imputing
numeric_pipeline = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
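
To make the effect of this pipeline concrete, here is a minimal sketch on a toy column (the values are illustrative and not taken from the housing data). It assumes only the two scikit-learn steps used above: the missing entry is first replaced by the column mean, and the column is then standardized to zero mean and unit variance.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy column with one missing value (illustrative data only)
X_toy = np.array([[1.0], [2.0], [np.nan], [3.0]])

toy_numeric_pipeline = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
X_toy_scaled = toy_numeric_pipeline.fit_transform(X_toy)

# The NaN becomes the column mean (2.0), which maps to 0 after standardization
print(X_toy_scaled.ravel())
```

Because the imputed value equals the mean, imputed rows end up at exactly 0 in the scaled column, which is worth keeping in mind when reading the transformed matrix.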

##### Cardinality of categorical features

In [None]:
df_train_object = df_train.select_dtypes(include="object")
df_train_object.nunique().plot.bar(figsize=(20,8))
plt.ylabel('Number of unique categories')
plt.xlabel('Variables')
plt.title('Cardinality check')
plt.axhline(y = 10, color= 'r', linestyle='--')
plt.show()

#### Categorical features with moderate-to-low cardinality

In [None]:
MAX_OH_CARDINALITY = 10

def select_oh_features(df):
    """Return categorical columns with at most MAX_OH_CARDINALITY unique values."""
    oh_features =\
        df\
        .select_dtypes(['object', 'category'])\
        .apply(lambda col: col.nunique())\
        .loc[lambda x: x <= MAX_OH_CARDINALITY]\
        .index\
        .tolist()

    return oh_features

oh_features = select_oh_features(df_train)

print(f'N oh_features: {len(oh_features)} \n')
print(', '.join(oh_features))

In [None]:
oh_pipeline = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder(handle_unknown='ignore'))
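
The following sketch illustrates what this pipeline does with missing and unseen categories. The toy column and its values are made up for illustration. With `strategy='constant'` and no explicit `fill_value`, `SimpleImputer` fills missing strings with the placeholder `"missing_value"`, and `handle_unknown='ignore'` encodes a category that was not seen during fitting as a row of zeros.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative data: one missing value in train, one unseen category in test
toy_train = pd.DataFrame({'Street': ['Pave', 'Grvl', np.nan]}, dtype=object)
toy_test = pd.DataFrame({'Street': ['Pave', 'Dirt']}, dtype=object)

toy_oh_pipeline = make_pipeline(
    SimpleImputer(strategy='constant'),
    OneHotEncoder(handle_unknown='ignore'),
)
toy_oh_pipeline.fit(toy_train)

# The encoder returns a sparse matrix by default; convert it for printing
encoded_test = toy_oh_pipeline.transform(toy_test).toarray()
print(toy_oh_pipeline[-1].categories_)
print(encoded_test)
```

The all-zero row for `'Dirt'` means the model receives no information about that category, which is usually safer than failing at prediction time.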

#### Categorical features with high cardinality

In [None]:
def select_hc_features(df):
    
    hc_features =\
        df\
        .select_dtypes(['object', 'category'])\
        .apply(lambda col: col.nunique())\
        .loc[lambda x: x > MAX_OH_CARDINALITY]\
        .index\
        .tolist()
        
    return hc_features


hc_features = select_hc_features(df_train)

print(f'N hc_features: {len(hc_features)} \n')
print(', '.join(hc_features))

In [None]:
hc_pipeline = make_pipeline(ce.GLMMEncoder())

#### Putting it all together

In [None]:
column_transformer = ColumnTransformer(transformers = 
                                       [('numeric_pipeline', numeric_pipeline, select_numeric_features),
                                        ('oh_pipeline', oh_pipeline, select_oh_features),
                                        ('hc_pipeline', hc_pipeline, select_hc_features)],
                      remainder='drop')
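
The third element of each transformer tuple can be a callable that receives the DataFrame at fit time and returns the column names to use, which is what lets the selectors above adapt to the data. A small sketch on a made-up DataFrame shows the mechanism. The column names and the `select_zone` helper are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative data: two numeric columns and two text columns
toy_df = pd.DataFrame({
    'area': [50.0, 80.0, 120.0],
    'rooms': [2, 3, 4],
    'zone': pd.Series(['A', 'B', 'A'], dtype=object),
    'note': pd.Series(['x', 'y', 'z'], dtype=object),
})

def select_zone(df):
    """Hypothetical selector: takes the DataFrame, returns a list of column names."""
    return ['zone']

toy_transformer = ColumnTransformer(
    transformers=[
        ('numeric', StandardScaler(), make_column_selector(dtype_include=np.number)),
        ('onehot', OneHotEncoder(handle_unknown='ignore'), select_zone),
    ],
    remainder='drop',  # 'note' matches no selector, so it is dropped
)

toy_out = toy_transformer.fit_transform(toy_df)
print(toy_out.shape)  # 2 scaled numeric columns + 2 one-hot columns
```

Columns that no selector returns are silently discarded under `remainder='drop'`, so it is worth checking the output width against the expected number of features.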

In [None]:
X_train_trsf = column_transformer.fit_transform(df_train, y_train_raw)
X_test_trsf = column_transformer.transform(df_test)

print(X_train_trsf.shape)
print(X_test_trsf.shape)

- Checking `train` set transformed as `numpy.ndarray`

In [None]:
print(np.sum(np.isnan(X_train_trsf)))

In [None]:
print("X_train_trsf Type     :", type(X_train_trsf))
print(f"X_train_trsf Shape    : {X_train_trsf.shape}")
print("X_train_trsf Dimension:", X_train_trsf.ndim)

The normalized and encoded `train` examples are stored in a 2-D NumPy array `X_train_trsf`.

In [None]:
#print("First element of X_train_trsf are:\n", X_train_trsf[:1])

- Checking `test` set transformed as `numpy.ndarray`

In [None]:
print(np.sum(np.isnan(X_test_trsf)))

In [None]:
print("X_test_trsf Type     :", type(X_test_trsf))
print(f"X_test_trsf Shape    : {X_test_trsf.shape}")
print("X_test_trsf Dimension:", X_test_trsf.ndim)

The normalized and encoded `test` examples are stored in a 2-D NumPy array `X_test_trsf`.

In [None]:
#print("First element of X_test_trsf are:\n", X_test_trsf[:1])

### 6. Fit the model with `XGBoost`

- Separate data into *training* and *validation* sets

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train_trsf, y_train_raw, test_size=0.3, random_state=0)

- Shape and Dimensions

In [None]:
print(f"X_train Shape    : {X_train.shape},| y_train Shape    : {y_train.shape}")
print("X_train Dimension:", X_train.ndim, "          | y_train Dimension:", y_train.ndim)
print(f"X_valid Shape    : {X_valid.shape}, | y_valid Shape    : {y_valid.shape}")
print("X_valid Dimension:", X_valid.ndim, "          | y_valid Dimension:", y_valid.ndim)

#### `XGBRegressor`

In [None]:
model_0 = xgb.XGBRegressor()
model_0.fit(X_train, y_train)

In [None]:
predictions_0 = model_0.predict(X_valid)

print("R2 Score                           : " + str(r2_score(y_valid, predictions_0)))
print("Mean Absolute Error                : " + str(mean_absolute_error(y_valid, predictions_0)))
print("Mean Square Error                  : " + str(mean_squared_error(y_valid, predictions_0)))
print("Mean Squared Logarithmic Error     : " + str(mean_squared_log_error(y_valid, predictions_0)))
print("Root Mean Square Error             : " + str(np.sqrt(mean_squared_error(y_valid, predictions_0))))
print("Root Mean Squared Logarithmic Error: " + str(np.sqrt(mean_squared_log_error(y_valid, predictions_0))))

# ValueError: Mean Squared Logarithmic Error cannot be used when targets contain negative values.
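
The evaluation cells in this section repeat the same six metrics, and the logarithmic ones raise the `ValueError` noted above whenever a prediction is negative. One possible way to handle both points is a small helper like the sketch below. The name `regression_report` and the choice to return `None` for the log-based metrics are assumptions made for this illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_squared_log_error, r2_score)

def regression_report(y_true, y_pred):
    """Hypothetical helper: compute the metrics used in this section as a dict."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = mean_squared_error(y_true, y_pred)
    report = {
        'r2': r2_score(y_true, y_pred),
        'mae': mean_absolute_error(y_true, y_pred),
        'mse': mse,
        'rmse': np.sqrt(mse),
        'msle': None,
        'rmsle': None,
    }
    # Log-based metrics are only computed when no value is negative,
    # instead of letting scikit-learn raise ValueError
    if (y_true >= 0).all() and (y_pred >= 0).all():
        report['msle'] = mean_squared_log_error(y_true, y_pred)
        report['rmsle'] = np.sqrt(report['msle'])
    return report

# Illustrative values only
toy_true = np.array([100.0, 200.0, 300.0])
report_ok = regression_report(toy_true, np.array([110.0, 190.0, 310.0]))
report_neg = regression_report(toy_true, np.array([110.0, -5.0, 310.0]))
print(report_ok)
print(report_neg)
```

With such a helper, each model could be evaluated with a single call such as `regression_report(y_valid, predictions_0)`.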

- Parameter Tuning 1

In [None]:
model_1 = xgb.XGBRegressor(n_estimators=500)
model_1.fit(X_train, y_train)

In [None]:
predictions_1 = model_1.predict(X_valid)

print("R2 Score                           : " + str(r2_score(y_valid, predictions_1)))
print("Mean Absolute Error                : " + str(mean_absolute_error(y_valid, predictions_1)))
print("Mean Square Error                  : " + str(mean_squared_error(y_valid, predictions_1)))
print("Mean Squared Logarithmic Error     : " + str(mean_squared_log_error(y_valid, predictions_1)))
print("Root Mean Square Error             : " + str(np.sqrt(mean_squared_error(y_valid, predictions_1))))
print("Root Mean Squared Logarithmic Error: " + str(np.sqrt(mean_squared_log_error(y_valid, predictions_1))))

# ValueError: Mean Squared Logarithmic Error cannot be used when targets contain negative values.

- Parameter Tuning 2

In [None]:
model_2 = xgb.XGBRegressor(n_estimators=1000)
model_2.fit(X_train, y_train)

In [None]:
predictions_2 = model_2.predict(X_valid)

print("R2 Score                           : " + str(r2_score(y_valid, predictions_2)))
print("Mean Absolute Error                : " + str(mean_absolute_error(y_valid, predictions_2)))
print("Mean Square Error                  : " + str(mean_squared_error(y_valid, predictions_2)))
print("Mean Squared Logarithmic Error     : " + str(mean_squared_log_error(y_valid, predictions_2)))
print("Root Mean Square Error             : " + str(np.sqrt(mean_squared_error(y_valid, predictions_2))))
print("Root Mean Squared Logarithmic Error: " + str(np.sqrt(mean_squared_log_error(y_valid, predictions_2))))

# ValueError: Mean Squared Logarithmic Error cannot be used when targets contain negative values.

- Parameter Tuning 3

>**GridSearchCV**: Best parameters: `(n_estimators=500, max_depth=5, colsample_bylevel=0.4, learning_rate=3.0e-2)`

In [None]:
model_3 = xgb.XGBRegressor(n_estimators=500, max_depth=5, colsample_bylevel=0.4, learning_rate=3.0e-2)
model_3.fit(X_train, y_train)

In [None]:
predictions_3 = model_3.predict(X_valid)

print("R2 Score                           : " + str(r2_score(y_valid, predictions_3)))
print("Mean Absolute Error                : " + str(mean_absolute_error(y_valid, predictions_3)))
print("Mean Square Error                  : " + str(mean_squared_error(y_valid, predictions_3)))
print("Mean Squared Logarithmic Error     : " + str(mean_squared_log_error(y_valid, predictions_3)))
print("Root Mean Square Error             : " + str(np.sqrt(mean_squared_error(y_valid, predictions_3))))
print("Root Mean Squared Logarithmic Error: " + str(np.sqrt(mean_squared_log_error(y_valid, predictions_3))))

# ValueError: Mean Squared Logarithmic Error cannot be used when targets contain negative values.

* Check for negative predictions

In [None]:
np.sum(predictions_3 < 0)

In [None]:
print("Shape of predictions_3:", predictions_3.shape)
print("Dimension of predictions_3:", predictions_3.ndim)

### 7. Make predictions on `test` set
Make predictions using the best model, `model_3`.

In [None]:
predictions_test = model_3.predict(X_test_trsf)

# The test set has no SalePrice labels, so no error metrics can be computed here.
# Scoring against y_train_raw would compare predictions with unrelated houses.
print("Shape of predictions_test:", predictions_test.shape)
print("Negative predictions     :", np.sum(predictions_test < 0))

### Back to Pandas for submitting predictions

In [None]:
submission = pd.DataFrame(dict(Id=df_test.index, SalePrice=predictions_test))
submission.head()

In [None]:
submission.describe()

In [None]:
submission = submission.astype(int)

In [None]:
#submission.to_csv('submission.csv', index=False, header=True)

## Acknowledgments

- Machine Learning Specialization offered jointly by DeepLearning.AI and Stanford University on Coursera.
- The housing data was derived from **Kaggle** [House Prices - Advanced Regression Techniques](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview).
- Category Encoder: Scikit-Learn ColumnTransformer approach from [Kyle Gilde](https://www.kaggle.com/code/kylegilde/building-columntransformers-dynamically) on **Kaggle**

>Let me know if you have any recommendations.  Thanks!