# Introduction #


Welcome to the feature engineering project for the House Prices - Advanced Regression Techniques competition! This competition uses nearly the same data you used in the exercises of the Feature Engineering course. We'll collect together the work you did into a complete project which you can build off of with ideas of your own.

# Step 1 - Preliminaries #


## Imports and Configuration
We'll start by importing the packages we used in the exercises and setting some notebook defaults. Unhide this cell if you'd like to see the libraries we'll use:

In [None]:
import os  # Importing the os module for operating system interactions
import warnings  # Importing the warnings module to handle warnings
from pathlib import Path  # Importing the Path class from the pathlib module for filesystem paths

import matplotlib.pyplot as plt  # Importing matplotlib for plotting
import numpy as np  # Importing numpy for numerical computations
import pandas as pd  # Importing pandas for data manipulation and analysis
import seaborn as sns  # Importing seaborn for statistical data visualization
from IPython.display import display  # Importing display from IPython.display for rich output
from pandas.api.types import CategoricalDtype  # Importing CategoricalDtype for categorical data types
from category_encoders import MEstimateEncoder  # Importing MEstimateEncoder for target encoding
from sklearn.cluster import KMeans  # Importing KMeans for clustering
from sklearn.decomposition import PCA  # Importing PCA for principal component analysis
from sklearn.feature_selection import mutual_info_regression  # Importing mutual_info_regression for feature selection
from sklearn.model_selection import KFold, cross_val_score  # Importing KFold and cross_val_score for cross-validation
from xgboost import XGBRegressor  # Importing XGBRegressor for gradient boosting regression

# Set Matplotlib defaults
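# Matplotlib 3.6 renamed this style to "seaborn-v0_8-whitegrid"; use whichever name is available
whitegrid = "seaborn-v0_8-whitegrid" if "seaborn-v0_8-whitegrid" in plt.style.available else "seaborn-whitegrid"
plt.style.use(whitegrid)  # Setting default plot style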
plt.rc("figure", autolayout=True)  # Setting default parameters for figures
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)  # Setting default parameters for axes labels and titles

# Mute warnings
warnings.filterwarnings('ignore')  # Ignoring all warnings during program execution


## Data Preprocessing


Before we can do any feature engineering, we need to preprocess the data to get it in a form suitable for analysis. The data we used in the course was a bit simpler than the competition data. For the Ames competition dataset, we'll need to:

* Load the data from CSV files
* Clean the data to fix any errors or inconsistencies
* Encode the statistical data type (numeric, categorical)
* Impute any missing values

We'll wrap all these steps up in a function, which will make it easy for you to get a fresh dataframe whenever you need one. After reading the CSV files, we'll apply three preprocessing steps (`clean`, `encode`, and `impute`) and then create the data splits: one (`df_train`) for training the model, and one (`df_test`) for making the predictions that you'll submit to the competition for scoring on the leaderboard.

In [None]:
def load_data():
    # Define the directory where the data files are located
    data_dir = Path("../input/house-prices-advanced-regression-techniques/")
    
    # Read the training data from the CSV file into a DataFrame, setting the 'Id' column as the index
    df_train = pd.read_csv(data_dir / "train.csv", index_col="Id")
    
    # Read the test data from the CSV file into a DataFrame, setting the 'Id' column as the index
    df_test = pd.read_csv(data_dir / "test.csv", index_col="Id")
    
    # Concatenate the training and test data into a single DataFrame to process them together
    df = pd.concat([df_train, df_test])
    
    # Preprocess the data: clean, encode categorical variables, and impute missing values
    df = clean(df)
    df = encode(df)
    df = impute(df)
    
    # Split the combined DataFrame back into training and test sets
    df_train = df.loc[df_train.index, :]
    df_test = df.loc[df_test.index, :]
    
    # Return the preprocessed training and test DataFrames
    return df_train, df_test


### Clean Data
Some of the categorical features in this dataset have what are apparently typos in their categories:

In [None]:
# Define the directory where the data files are located
data_dir = Path("../input/house-prices-advanced-regression-techniques/")

# Read the CSV file named "train.csv" located in the specified directory
# and set the 'Id' column as the index of the DataFrame
df = pd.read_csv(data_dir / "train.csv", index_col="Id")

# Retrieve the unique values of the column "Exterior2nd" in the DataFrame df
unique_values = df.Exterior2nd.unique()

# Print the unique values of the column "Exterior2nd"
print(unique_values)


Comparing these to `data_description.txt` shows us what needs cleaning. We'll take care of a couple of issues here, but you might want to evaluate this data further.
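
One way to make that comparison systematic is to diff the observed categories against the documented ones. The sketch below uses a small made-up column and a hand-typed set of documented levels; both are hypothetical stand-ins and are not read from `data_description.txt`.

```python
import pandas as pd

# Hypothetical sample of observed values, including the "Brk Cmn" spelling
observed = pd.Series(["VinylSd", "Brk Cmn", "Wd Sdng", "VinylSd"], name="Exterior2nd")

# Hypothetical subset of the levels listed in the data description
documented = {"VinylSd", "BrkComm", "Wd Sdng", "MetalSd"}

# Values present in the data but absent from the documentation
unexpected = sorted(set(observed.dropna().unique()) - documented)
print(unexpected)
```

Anything that shows up in `unexpected` is a candidate for a `replace` rule in the cleaning step.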



In [None]:
def clean(df):
    # Replace the value "Brk Cmn" with "BrkComm" in the "Exterior2nd" column
    df["Exterior2nd"] = df["Exterior2nd"].replace({"Brk Cmn": "BrkComm"})
    
    # Replace corrupt values of "GarageYrBlt" with the year the house was built
    df["GarageYrBlt"] = df["GarageYrBlt"].where(df.GarageYrBlt <= 2010, df.YearBuilt)
    
    # Rename columns for easier working with names beginning with numbers
    df.rename(columns={
        "1stFlrSF": "FirstFlrSF",
        "2ndFlrSF": "SecondFlrSF",
        "3SsnPorch": "Threeseasonporch",
    }, inplace=True)
    
    # Return the cleaned DataFrame
    return df


### Encode the Statistical Data Type
Pandas has Python types corresponding to the standard statistical types (numeric, categorical, etc.). Encoding each feature with its correct type helps ensure each feature is treated appropriately by whatever functions we use, and makes it easier for us to apply transformations consistently. This hidden cell defines the encode function:

In [None]:
# List of nominal (unordered) categorical features
features_nom = ["MSSubClass", "MSZoning", "Street", "Alley", "LandContour", "LotConfig", "Neighborhood", "Condition1", "Condition2", "BldgType", "HouseStyle", "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd", "MasVnrType", "Foundation", "Heating", "CentralAir", "GarageType", "MiscFeature", "SaleType", "SaleCondition"]

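# Levels for the ordinal (ordered) categorical features. These two lists are
# not defined elsewhere in this notebook, so the definitions here are assumed.
five_levels = ["Po", "Fa", "TA", "Gd", "Ex"]  # quality ratings, worst to best
ten_levels = list(range(1, 11))  # OverallQual and OverallCond are rated 1 to 10

# Dictionary defining ordered levels for ordinal categorical features
ordered_levels = {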
    "OverallQual": ten_levels,
    "OverallCond": ten_levels,
    "ExterQual": five_levels,
    "ExterCond": five_levels,
    "BsmtQual": five_levels,
    "BsmtCond": five_levels,
    "HeatingQC": five_levels,
    "KitchenQual": five_levels,
    "FireplaceQu": five_levels,
    "GarageQual": five_levels,
    "GarageCond": five_levels,
    "PoolQC": five_levels,
    "LotShape": ["Reg", "IR1", "IR2", "IR3"],
    "LandSlope": ["Sev", "Mod", "Gtl"],
    "BsmtExposure": ["No", "Mn", "Av", "Gd"],
    "BsmtFinType1": ["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
    "BsmtFinType2": ["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
    "Functional": ["Sal", "Sev", "Maj1", "Maj2", "Mod", "Min2", "Min1", "Typ"],
    "GarageFinish": ["Unf", "RFn", "Fin"],
    "PavedDrive": ["N", "P", "Y"],
    "Utilities": ["NoSeWa", "NoSewr", "AllPub"],
    "CentralAir": ["N", "Y"],
    "Electrical": ["Mix", "FuseP", "FuseF", "FuseA", "SBrkr"],
    "Fence": ["MnWw", "GdWo", "MnPrv", "GdPrv"],
}

# Add a None level for missing values in the ordered levels
ordered_levels = {key: ["None"] + value for key, value in ordered_levels.items()}

# Function to encode categorical features
def encode(df):
    # Encode nominal categorical features
    for name in features_nom:
        df[name] = df[name].astype("category")  # Convert feature to categorical type
        # Add a "None" category for missing values if it doesn't exist
        if "None" not in df[name].cat.categories:
            df[name] = df[name].cat.add_categories("None")
    # Encode ordinal categorical features
    for name, levels in ordered_levels.items():
        df[name] = df[name].astype(CategoricalDtype(levels, ordered=True))  # Convert feature to categorical type with specified levels and order
    return df  # Return the DataFrame with encoded categorical features
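
To see what an ordered categorical type buys us, here is a minimal sketch on a made-up quality column. The level list mirrors the five quality ratings plus `"None"`; the sample values are hypothetical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

levels = ["None", "Po", "Fa", "TA", "Gd", "Ex"]
quality = pd.Series(["TA", "Ex", None, "Gd"]).astype(CategoricalDtype(levels, ordered=True))

# Each value is stored as the position of its level; missing values get -1
codes = quality.cat.codes.tolist()

# Ordered categoricals can be compared against one of their levels
above_typical = (quality > "TA").tolist()
print(codes, above_typical)
```

Because the codes follow the declared order, a later label encoding keeps "better" ratings numerically larger.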


### Handle Missing Values
Handling missing values now will make the feature engineering go more smoothly. We'll impute 0 for missing numeric values and "None" for missing categorical values. You might like to experiment with other imputation strategies. In particular, you could try creating "missing value" indicators: 1 whenever a value was imputed and 0 otherwise.

In [None]:
def impute(df):
    # Iterate over numeric columns and fill missing values with 0
    for name in df.select_dtypes("number"):
        df[name] = df[name].fillna(0)
    
    # Iterate over categorical columns and fill missing values with "None"
    for name in df.select_dtypes("category"):
        df[name] = df[name].fillna("None")
    
    # Return the DataFrame with imputed missing values
    return df
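
The "missing value" indicators suggested above could be sketched as follows. This is only one possible approach: the `_was_missing` suffix and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "GarageArea": [548.0, 460.0, np.nan],
    "LotArea": [8450.0, 9600.0, 11250.0],
})

# Record which values are missing *before* imputing them
indicators = df_demo.isna().astype(int).add_suffix("_was_missing")
# Keep indicators only for columns that actually had missing values
indicators = indicators.loc[:, indicators.any(axis=0)]

df_imputed = df_demo.fillna(0).join(indicators)
print(df_imputed)
```

The indicators have to be computed before `fillna`, since afterwards an imputed 0 is indistinguishable from a genuine 0.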


### Load Data
And now we can call the data loader and get the processed data splits:

In [None]:
# Call the load_data() function to load and preprocess the data
df_train, df_test = load_data()

# Now df_train contains the preprocessed training dataset,
# and df_test contains the preprocessed test dataset.

Run the next cell if you'd like to see what they contain. Notice that `df_test` has no real values for `SalePrice`. (The NAs were filled with 0's in the imputation step.)

In [None]:
# Peek at the values
display(df_train)
display(df_test)

# Display information about dtypes and missing values
# (info() prints its report directly and returns None, so display() isn't needed)
df_train.info()
df_test.info()

### Establish Baseline
Finally, let's establish a baseline score to judge our feature engineering against.

Here is the function we created in Lesson 1 that will compute the cross-validated RMSLE score for a feature set. We've used XGBoost for our model, but you might want to experiment with other models.

In [None]:
def score_dataset(X, y, model=XGBRegressor()):
    # Work on a copy so the caller's DataFrame is left unchanged
    X = X.copy()
    # Label encoding for categoricals
    #
    # Label encoding is good for XGBoost and RandomForest, but one-hot
    # would be better for models like Lasso or Ridge. The `cat.codes`
    # attribute holds the integer code of each value's category.
    for colname in X.select_dtypes(["category"]):
        X[colname] = X[colname].cat.codes
    
    # Metric for Housing competition is RMSLE (Root Mean Squared Log Error)
    # Convert the target variable y to its natural logarithm
    log_y = np.log(y)
    
    # Perform cross-validation using the negative mean squared error as the scoring metric
    score = cross_val_score(
        model, X, log_y, cv=5, scoring="neg_mean_squared_error",
    )
    
    # Compute the mean of the negative mean squared error scores
    score = -1 * score.mean()
    
    # Take the square root to obtain the Root Mean Squared Log Error (RMSLE)
    score = np.sqrt(score)
    
    # Return the computed RMSLE score
    return score


We can reuse this scoring function anytime we want to try out a new feature set. We'll run it now on the processed data with no additional features and get a baseline score:

In [None]:
# Create a copy of the training DataFrame and assign it to X
X = df_train.copy()

# Extract the target variable "SalePrice" from X and assign it to y
y = X.pop("SalePrice")

# Evaluate the baseline model by calling the score_dataset() function
baseline_score = score_dataset(X, y)

# Print the baseline score
print(f"Baseline score: {baseline_score:.5f} RMSLE")


This baseline score helps us to know whether some set of features we've assembled has actually led to any improvement or not.
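
To make the metric concrete, here is a small worked example of RMSLE computed by hand on made-up prices. It also illustrates why a log-based error suits house prices: the same relative error costs the same whether the house is cheap or expensive.

```python
import numpy as np

# Hypothetical true and predicted sale prices
y_true = np.array([100_000.0, 200_000.0, 300_000.0])
y_pred = np.array([110_000.0, 180_000.0, 300_000.0])

# RMSLE: root mean squared error between the logs of the values
rmsle = np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

# Scaling every price by 10 leaves the score unchanged
rmsle_scaled = np.sqrt(np.mean((np.log(10 * y_pred) - np.log(10 * y_true)) ** 2))
print(rmsle, rmsle_scaled)
```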

# Step 2 - Feature Utility Scores #
In Lesson 2 we saw how to use mutual information to compute a utility score for a feature, giving you an indication of how much potential the feature has. This hidden cell defines the two utility functions we used, make_mi_scores and plot_mi_scores:

In [None]:
def make_mi_scores(X, y):
    X = X.copy()
    # Factorize object and category columns to convert them to integer dtype
    for colname in X.select_dtypes(["object", "category"]):
        X[colname], _ = X[colname].factorize()
    # Determine which features are discrete (integer dtype)
    discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
    # Calculate MI scores using mutual_info_regression
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)
    # Convert the MI scores to a pandas Series
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    # Sort the MI scores in descending order
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores


`plot_mi_scores` plots the MI scores calculated by the make_mi_scores function. It visualizes the importance of each feature based on its MI score.

In [None]:
def plot_mi_scores(scores):
    # Sort the scores in ascending order
    scores = scores.sort_values(ascending=True)
    # Create an array for horizontal bar plotting
    width = np.arange(len(scores))
    # Get the feature names as ticks for the y-axis
    ticks = list(scores.index)
    # Create a horizontal bar plot
    plt.barh(width, scores)
    # Set the y-ticks and labels
    plt.yticks(width, ticks)
    # Set the title of the plot
    plt.title("Mutual Information Scores")


Let's look at our feature scores again:



In [None]:
# Create a copy of the training DataFrame and assign it to X
X = df_train.copy()

# Extract the target variable "SalePrice" from X and assign it to y
y = X.pop("SalePrice")

# Calculate the Mutual Information scores using the make_mi_scores function
mi_scores = make_mi_scores(X, y)

# Print the MI scores
mi_scores


You can see that we have a number of features that are highly informative and also some that don't seem to be informative at all (at least by themselves). As we talked about in Tutorial 2, the top-scoring features will usually pay off the most during feature development, so it could be a good idea to focus your efforts on those. On the other hand, training on uninformative features can lead to overfitting. So we'll drop the features with 0.0 scores entirely:

In [None]:
def drop_uninformative(df, mi_scores):
    # Select only the columns with MI scores greater than 0.0
    return df.loc[:, mi_scores > 0.0]
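
Note that this selection relies on pandas aligning the boolean mask to the column labels, since the scores are sorted and so appear in a different order than the columns. A minimal sketch with made-up column names and scores:

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# Hypothetical scores, sorted in descending order like the output of make_mi_scores
scores = pd.Series({"C": 0.8, "A": 0.1, "B": 0.0})

# The boolean Series is matched to the columns by label, not by position
kept = df_demo.loc[:, scores > 0.0]
print(list(kept.columns))
```

The result keeps the DataFrame's own column order and drops only the zero-score column.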


Removing them does lead to a modest performance gain:



In [None]:
X = df_train.copy()  # Create a copy of the training DataFrame
y = X.pop("SalePrice")  # Extract the target variable "SalePrice" from X

# Drop uninformative features from X based on MI scores
X = drop_uninformative(X, mi_scores)

# Evaluate the performance of the dataset X using the score_dataset() function
score_dataset(X, y)


Later, we'll add the `drop_uninformative` function to our feature-creation pipeline.



# Step 3 - Create Features #
Now we'll start developing our feature set.

To make our feature engineering workflow more modular, we'll define a function that will take a prepared dataframe and pass it through a pipeline of transformations to get the final feature set. It will look something like this:

In [None]:
def create_features(df):
    # Create a copy of the DataFrame and assign it to X
    X = df.copy()
    # Extract the target variable "SalePrice" from X and assign it to y
    y = X.pop("SalePrice")
    # Join the features created by create_features_1 with X
    X = X.join(create_features_1(X))
    # Join the features created by create_features_2 with X
    X = X.join(create_features_2(X))
    # Join the features created by create_features_3 with X
    X = X.join(create_features_3(X))
    # ...
    # Return the DataFrame with the additional features
    return X


Let's go ahead and define one transformation now, a [label encoding](https://www.kaggle.com/code/alexisbcook/categorical-variables/tutorial) for the categorical features:



In [None]:
def label_encode(df):
    # Create a copy of the DataFrame and assign it to X
    X = df.copy()
    # Iterate over columns with data type "category"
    for colname in X.select_dtypes(["category"]):
        # Label encode the values in each categorical column
        X[colname] = X[colname].cat.codes
    # Return the DataFrame with label encoded categorical features
    return X


A label encoding is okay for any kind of categorical feature when you're using a tree-ensemble like XGBoost, even for unordered categories. If you wanted to try a linear regression model (also popular in this competition), you would instead want to use a one-hot encoding, especially for the features with unordered categories.
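
The difference between the two encodings can be seen on a toy column. This is a sketch with made-up values; `RoofStyle` is used only as a familiar name.

```python
import pandas as pd

roof = pd.Series(["Gable", "Hip", "Gable", "Flat"], dtype="category", name="RoofStyle")

# Label encoding: a single integer column; the order (alphabetical here) is arbitrary
label_encoded = roof.cat.codes

# One-hot encoding: one indicator column per category, with no implied order
one_hot = pd.get_dummies(roof, prefix="RoofStyle")

print(label_encoded.tolist())
print(list(one_hot.columns))
```

A linear model would read the label codes as "Hip is twice Gable", which is why one-hot columns are the safer choice there.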

### Create Features with Pandas
This cell reproduces the work you did in Exercise 3, where you applied strategies for creating features in Pandas. Modify or add to these functions to try out other feature combinations.



In [None]:
def mathematical_transforms(df):
    # Create a new DataFrame to hold the new features
    X = pd.DataFrame()
    # Calculate the ratio of living area to lot area and assign it to "LivLotRatio"
    X["LivLotRatio"] = df.GrLivArea / df.LotArea
    # Calculate the ratio of total floor area to total rooms above ground and assign it to "Spaciousness"
    X["Spaciousness"] = (df.FirstFlrSF + df.SecondFlrSF) / df.TotRmsAbvGrd
    return X

def interactions(df):
    # Create dummy variables for building type and multiply them by living area to create interaction features
    X = pd.get_dummies(df.BldgType, prefix="Bldg").mul(df.GrLivArea, axis=0)
    return X

def counts(df):
    # Count the number of different types of porches for each row and assign it to "PorchTypes"
    X = pd.DataFrame()
    X["PorchTypes"] = df[["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "Threeseasonporch", "ScreenPorch"]].gt(0.0).sum(axis=1)
    return X

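def break_down(df):
    # Extract the main class from the MSSubClass column and assign it to "MSClass"
    # Note: this assumes MSSubClass holds strings such as "One_Story_1946_and_Newer",
    # as in the course data; the competition CSV stores numeric codes instead.
    X = pd.DataFrame()
    X["MSClass"] = df.MSSubClass.str.split("_", n=1, expand=True)[0]
    return X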

def group_transforms(df):
    # Calculate the median living area for each neighborhood and assign it to each row within the same neighborhood
    X = pd.DataFrame()
    X["MedNhbdArea"] = df.groupby("Neighborhood")["GrLivArea"].transform("median")
    return X


Here are some ideas for other transforms you could explore:

* Interactions between the quality Qual and condition Cond features. OverallQual, for instance, was a high-scoring feature. You could try combining it with OverallCond by converting both to integer type and taking a product.
* Square roots of area features. This would convert units of square feet to just feet.
* Logarithms of numeric features. If a feature has a skewed distribution, applying a logarithm can help normalize it.
* Interactions between numeric and categorical features that describe the same thing. You could look at interactions between BsmtQual and TotalBsmtSF, for instance.
* Other group statistics in `Neighborhood`. We did the median of `GrLivArea`. Looking at mean, std, or count could be interesting. You could also try combining the group statistics with other features. Maybe the difference of `GrLivArea` and the median is important?
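
A few of these ideas could be sketched as below. The data and the new column names (`QualCond`, `SqrtLivArea`, `LogLivArea`, `LivAreaVsNhbdMedian`) are made up for illustration. In the prepared dataframe, `OverallQual` and `OverallCond` are ordered categoricals, so you would first convert them to integers (for example with `.cat.codes`).

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "OverallQual": [7, 5, 8, 6],
    "OverallCond": [5, 6, 5, 7],
    "GrLivArea": [1600.0, 900.0, 2500.0, 1200.0],
    "Neighborhood": ["NAmes", "NAmes", "Edwards", "Edwards"],
})

X_ideas = pd.DataFrame(index=df_demo.index)
# Interaction between quality and condition
X_ideas["QualCond"] = df_demo.OverallQual * df_demo.OverallCond
# Square root turns square feet into feet
X_ideas["SqrtLivArea"] = np.sqrt(df_demo.GrLivArea)
# log1p compresses a right-skewed distribution and is safe at zero
X_ideas["LogLivArea"] = np.log1p(df_demo.GrLivArea)
# Difference between a house and its neighborhood median
nhbd_median = df_demo.groupby("Neighborhood")["GrLivArea"].transform("median")
X_ideas["LivAreaVsNhbdMedian"] = df_demo.GrLivArea - nhbd_median
print(X_ideas)
```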

### k-Means Clustering
The first unsupervised algorithm we used to create features was k-means clustering. We saw that you could either use the cluster labels as a feature (a column with `0, 1, 2, ...`) or you could use the distance of the observations to each cluster. We saw how these features can sometimes be effective at untangling complicated spatial relationships.

In [None]:
cluster_features = [
    "LotArea",
    "TotalBsmtSF",
    "FirstFlrSF",
    "SecondFlrSF",
    "GrLivArea",
]

def cluster_labels(df, features, n_clusters=20):
    # Create a copy of the DataFrame and select only the specified features for clustering
    X = df.copy()
    X_scaled = X.loc[:, features]
    # Scale the selected features to have zero mean and unit variance
    X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)
    # Apply K-means clustering with the specified number of clusters
    kmeans = KMeans(n_clusters=n_clusters, n_init=50, random_state=0)
    # Assign cluster labels to each data point and store them in a new DataFrame
    # that keeps the original index, so the result can be joined back to df
    X_new = pd.DataFrame(index=df.index)
    X_new["Cluster"] = kmeans.fit_predict(X_scaled)
    return X_new

def cluster_distance(df, features, n_clusters=20):
    # Create a copy of the DataFrame and select only the specified features for clustering
    X = df.copy()
    X_scaled = X.loc[:, features]
    # Scale the selected features to have zero mean and unit variance
    X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)
    # Apply K-means clustering with the specified number of clusters
    kmeans = KMeans(n_clusters=n_clusters, n_init=50, random_state=0)
    # Compute the Euclidean distance of each data point to the centroids of the clusters
    X_cd = kmeans.fit_transform(X_scaled)
    # Label the distance features, keeping the original index so they can be joined back to df
    X_cd = pd.DataFrame(
        X_cd, columns=[f"Centroid_{i}" for i in range(X_cd.shape[1])], index=df.index
    )
    return X_cd


### Principal Component Analysis
PCA was the second unsupervised model we used for feature creation. We saw how it could be used to decompose the variational structure in the data. The PCA algorithm gave us loadings which described each component of variation, and also the components which were the transformed datapoints. The loadings can suggest features to create and the components we can use as features directly.

Here are the utility functions from the PCA lesson:

In [None]:
def apply_pca(X, standardize=True):
    # Standardize the data if specified
    if standardize:
        X = (X - X.mean(axis=0)) / X.std(axis=0)
    
    # Create a PCA object
    pca = PCA()
    
    # Fit PCA to the standardized data and transform it
    X_pca = pca.fit_transform(X)
    
    # Convert the transformed data to a DataFrame with appropriate column names,
    # keeping the original index so the components can be joined back
    component_names = [f"PC{i+1}" for i in range(X_pca.shape[1])]
    X_pca = pd.DataFrame(X_pca, columns=component_names, index=X.index)
    
    # Create loadings DataFrame
    loadings = pd.DataFrame(
        pca.components_.T,  # Transpose the matrix of loadings
        columns=component_names,  # Principal components as columns
        index=X.columns,  # Original features as rows
    )
    
    # Return PCA object, transformed data, and loadings
    return pca, X_pca, loadings


def plot_variance(pca, width=8, dpi=100):
    # Create a figure with two subplots
    fig, axs = plt.subplots(1, 2)
    n = pca.n_components_
    grid = np.arange(1, n + 1)
    
    # Plot explained variance ratio
    evr = pca.explained_variance_ratio_
    axs[0].bar(grid, evr)
    axs[0].set(xlabel="Component", title="% Explained Variance", ylim=(0.0, 1.0))
    
    # Plot cumulative variance
    cv = np.cumsum(evr)
    axs[1].plot(np.r_[0, grid], np.r_[0, cv], "o-")
    axs[1].set(xlabel="Component", title="% Cumulative Variance", ylim=(0.0, 1.0))
    
    # Set figure properties
    fig.set(figwidth=width, dpi=dpi)
    
    # Return the axes
    return axs


And here are transforms that produce the features from Exercise 5. You might want to change these if you came up with a different answer.

In [None]:
def pca_inspired(df):
    # Create a new DataFrame to hold the PCA-inspired features
    X = pd.DataFrame()
    
    # Create Feature1 by adding GrLivArea and TotalBsmtSF
    X["Feature1"] = df.GrLivArea + df.TotalBsmtSF
    
    # Create Feature2 by multiplying YearRemodAdd and TotalBsmtSF
    X["Feature2"] = df.YearRemodAdd * df.TotalBsmtSF
    
    return X

def pca_components(df, features):
    # Select the specified features from the DataFrame
    X = df.loc[:, features]
    
    # Apply PCA to the selected features using the apply_pca function
    _, X_pca, _ = apply_pca(X)
    
    return X_pca

# List of features used for PCA
pca_features = [
    "GarageArea",
    "YearRemodAdd",
    "TotalBsmtSF",
    "GrLivArea",
]


These are only a couple of ways you could use the principal components. You could also try clustering using one or more components. One thing to note is that PCA doesn't change the distance between points -- it's just like a rotation. So clustering with the full set of components is the same as clustering with the original (standardized) features. Instead, pick some subset of components, maybe those with the most variance or the highest MI scores.
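
The rotation claim can be checked numerically. The sketch below does PCA by hand with NumPy's SVD on random data (not the housing data) and compares pairwise distances before and after; `pairwise_distances` is a small helper defined here, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 3))
X_centered = X_demo - X_demo.mean(axis=0)

# PCA via SVD: the rows of Vt are the principal axes
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = X_centered @ Vt.T

def pairwise_distances(A):
    # Euclidean distance between every pair of rows
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# All components: distances are unchanged
full_same = np.allclose(pairwise_distances(X_centered), pairwise_distances(components))
# Only the first two components: distances change
subset_same = np.allclose(pairwise_distances(X_centered), pairwise_distances(components[:, :2]))
print(full_same, subset_same)
```

Only when some components are dropped does the geometry change, which is when clustering on components can give a different result.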

For further analysis, you might want to look at a correlation matrix for the dataset:

In [None]:
def corrplot(df, method="pearson", annot=True, **kwargs):
    """
    Plot a correlation matrix heatmap.

    Parameters:
        df (DataFrame): The DataFrame containing the data.
        method (str): The correlation method to use ('pearson', 'kendall', 'spearman').
        annot (bool): Whether to annotate the heatmap with correlation values.
        **kwargs: Additional keyword arguments to pass to sns.clustermap.

    Returns:
        None
    """
    # Calculate the correlation matrix using the specified method
    corr_matrix = df.corr(method=method, numeric_only=True)
    
    # Create a clustermap of the correlation matrix
    sns.clustermap(
        corr_matrix,
        vmin=-1.0,
        vmax=1.0,
        cmap="icefire",
        method="complete",
        annot=annot,
        **kwargs,
    )

# Plot the correlation matrix heatmap for df_train with annotations disabled
corrplot(df_train, annot=None)


Groups of highly correlated features often yield interesting loadings.

### PCA Application - Indicate Outliers
In Exercise 5, you applied PCA to determine houses that were **outliers**, that is, houses having values not well represented in the rest of the data. You saw that there was a group of houses in the `Edwards` neighborhood having a `SaleCondition` of `Partial` whose values were especially extreme.

Some models can benefit from having these outliers indicated, which is what this next transform will do.

In [None]:
def indicate_outliers(df):
    """
    Create a binary indicator for outliers based on specific conditions.

    Parameters:
        df (DataFrame): The DataFrame containing the data.

    Returns:
        DataFrame: A DataFrame containing a binary indicator for outliers.
    """
    # Create a new DataFrame to hold the outlier indicator
    X_new = pd.DataFrame()
    
    # Define conditions for outliers based on specific columns
    # For example, outliers are defined as properties in the "Edwards" neighborhood with a "Partial" sale condition
    X_new["Outlier"] = (df.Neighborhood == "Edwards") & (df.SaleCondition == "Partial")
    
    return X_new


You could also consider applying some sort of robust scaler from scikit-learn's `sklearn.preprocessing` module to the outlying values, especially those in `GrLivArea`. [Here](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html) is a tutorial illustrating some of them. Another option could be to create a feature of "outlier scores" using one of scikit-learn's [outlier detectors](https://scikit-learn.org/stable/modules/outlier_detection.html).
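
To show the idea behind robust scaling without depending on scikit-learn, here is a hand-rolled sketch that centers on the median and scales by the interquartile range. This is the same idea `RobustScaler` uses by default, though this is not that class and edge cases may differ. The values are made up.

```python
import pandas as pd

# Hypothetical living areas with one extreme value
area = pd.Series([900.0, 1200.0, 1500.0, 1800.0, 5600.0], name="GrLivArea")

# Robust scaling: the median and IQR are barely affected by the outlier
q1, q3 = area.quantile(0.25), area.quantile(0.75)
robust = (area - area.median()) / (q3 - q1)

# Standard scaling for comparison: the outlier inflates both mean and std
standard = (area - area.mean()) / area.std()

print(robust.round(2).tolist())
print(standard.round(2).tolist())
```

With robust scaling, the typical houses keep a sensible spread around zero while the outlier stands out clearly.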

### Target Encoding
Needing a separate holdout set to create a target encoding is rather wasteful of data. In Tutorial 6 we used 25% of our dataset just to encode a single feature, `Zipcode`. The data from the other features in that 25% we didn't get to use at all.

There is, however, a way you can use target encoding without having to use held-out encoding data. It's basically the same trick used in cross-validation:

1. Split the data into folds, each fold having two splits of the dataset.
1. Train the encoder on one split but transform the values of the other.
1. Repeat for all the splits.

This way, training and transformation always take place on independent sets of data, just like when you use a holdout set but without any data going to waste.
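
The steps above can be sketched with plain pandas. Note the simplifications: this uses an unsmoothed per-category mean rather than `MEstimateEncoder`, and the data, the `Zone` and `Price` names, and the fold assignment are all made up.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Zone": ["a", "a", "b", "b", "a", "b"],
    "Price": [10.0, 20.0, 30.0, 40.0, 30.0, 50.0],
})

n_folds = 3
fold_ids = np.arange(len(df_demo)) % n_folds  # hypothetical fold assignment
encoded = pd.Series(np.nan, index=df_demo.index, name="Zone_encoded")

for fold in range(n_folds):
    held_out = fold_ids == fold
    # "Train" the encoder: per-category target means from the other rows
    means = df_demo.loc[~held_out].groupby("Zone")["Price"].mean()
    # Transform only the held-out rows, which the means never saw
    encoded.loc[held_out] = df_demo.loc[held_out, "Zone"].map(means).to_numpy()

print(encoded.tolist())
```

Every row ends up encoded, yet no row's own target value contributes to its encoding.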

In the next hidden cell is a wrapper you can use with any target encoder:

In [None]:
class CrossFoldEncoder:
    def __init__(self, encoder, **kwargs):
        """
        Initialize the CrossFoldEncoder.

        Parameters:
            encoder: The encoder object to be used.
            **kwargs: Additional keyword arguments for the encoder.

        Returns:
            None
        """
        self.encoder_ = encoder  # The encoder object
        self.kwargs_ = kwargs  # Additional keyword arguments for the encoder
        self.cv_ = KFold(n_splits=5)  # Initialize a 5-fold cross-validator

    def fit_transform(self, X, y, cols):
        """
        Fit and transform the data using the CrossFoldEncoder.

        Parameters:
            X: The input features DataFrame.
            y: The target variable Series.
            cols: The columns to encode.

        Returns:
            DataFrame: The transformed features DataFrame.
        """
        self.fitted_encoders_ = []  # List to hold fitted encoders for each fold
        self.cols_ = cols  # Columns to encode
        X_encoded = []  # List to hold encoded data from each fold
        # Iterate over each fold
        for idx_encode, idx_train in self.cv_.split(X):
            # Instantiate a new encoder for the fold
            fitted_encoder = self.encoder_(cols=cols, **self.kwargs_)
            # Fit the encoder on the encoding split of the fold
            fitted_encoder.fit(
                X.iloc[idx_encode, :], y.iloc[idx_encode],
            )
            # Transform the other split, which the encoder has not seen
            X_encoded.append(fitted_encoder.transform(X.iloc[idx_train, :])[cols])
            # Store the fitted encoder for later use
            self.fitted_encoders_.append(fitted_encoder)
        # Concatenate the encoded data from all folds
        X_encoded = pd.concat(X_encoded)
        # Rename the columns of the concatenated DataFrame
        X_encoded.columns = [name + "_encoded" for name in X_encoded.columns]
        return X_encoded

    def transform(self, X):
        """
        Transform the test data using the fitted encoders.

        Parameters:
            X: The input features DataFrame.

        Returns:
            DataFrame: The transformed features DataFrame.
        """
        from functools import reduce

        X_encoded_list = []  # List to hold encoded data from each fold
        # Iterate over each fitted encoder
        for fitted_encoder in self.fitted_encoders_:
            # Transform the test data using the fitted encoder
            X_encoded = fitted_encoder.transform(X)
            # Extract only the columns used for encoding
            X_encoded_list.append(X_encoded[self.cols_])
        # Calculate the average of the encoded data from all folds
        X_encoded = reduce(
            lambda x, y: x.add(y, fill_value=0), X_encoded_list
        ) / len(X_encoded_list)
        # Rename the columns of the resulting DataFrame
        X_encoded.columns = [name + "_encoded" for name in X_encoded.columns]
        return X_encoded


Use it like:



In [None]:
# Instantiate a CrossFoldEncoder with MEstimateEncoder as the encoder and m=1 as a parameter
encoder = CrossFoldEncoder(MEstimateEncoder, m=1)

# Fit and transform the data using the CrossFoldEncoder
# Assume X is your feature DataFrame and y is your target variable Series
# Specify the columns to encode as ["MSSubClass"]
X_encoded = encoder.fit_transform(X, y, cols=["MSSubClass"])


You can turn any of the encoders from the `category_encoders` library into a cross-fold encoder. The `CatBoostEncoder` would be worth trying. It's similar to `MEstimateEncoder` but uses some tricks to better prevent overfitting. Its smoothing parameter is called `a` instead of `m`.

### Create Final Feature Set
Now let's combine everything together. Putting the transformations into separate functions makes it easier to experiment with various combinations. The ones I left uncommented I found gave the best results. You should experiment with your own ideas though! Modify any of these transformations or come up with some of your own to add to the pipeline.

In [None]:
def create_features(df, df_test=None):
    """
    Generate features for a machine learning model.

    Parameters:
        df (DataFrame): The training DataFrame.
        df_test (DataFrame, optional): The test DataFrame. Default is None.

    Returns:
        DataFrame or tuple of DataFrames: The transformed training DataFrame or tuple of transformed training and test DataFrames.
    """
    X = df.copy()  # Make a copy of the training DataFrame
    y = X.pop("SalePrice")  # Remove the target variable from the training DataFrame
    mi_scores = make_mi_scores(X, y)  # Compute Mutual Information scores

    # Combine training and test splits if test data is provided
    if df_test is not None:
        X_test = df_test.copy()
        X_test.pop("SalePrice")
        X = pd.concat([X, X_test])

    # Lesson 2 - Mutual Information: Drop uninformative features
    X = drop_uninformative(X, mi_scores)

    # Lesson 3 - Transformations: mathematical transforms, interactions, counts, and group transforms
    X = X.join(mathematical_transforms(X))
    X = X.join(interactions(X))
    X = X.join(counts(X))
    X = X.join(group_transforms(X))

    # Lesson 5 - PCA: Apply PCA-inspired feature creation
    X = X.join(pca_inspired(X))

    # Label encoding for categorical features
    X = label_encode(X)

    # Reform splits if test data is provided
    if df_test is not None:
        X_test = X.loc[df_test.index, :]
        X.drop(df_test.index, inplace=True)

    # Lesson 6 - Target Encoder: Apply target encoding using CrossFoldEncoder
    encoder = CrossFoldEncoder(MEstimateEncoder, m=1)
    X = X.join(encoder.fit_transform(X, y, cols=["MSSubClass"]))
    if df_test is not None:
        X_test = X_test.join(encoder.transform(X_test))

    # Return the transformed training DataFrame or tuple of transformed training and test DataFrames
    if df_test is not None:
        return X, X_test
    else:
        return X


# Load training and test data
df_train, df_test = load_data()

# Create features for the training data
X_train = create_features(df_train)
y_train = df_train.loc[:, "SalePrice"]

# Evaluate the model using the transformed training data
score_dataset(X_train, y_train)

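The combine-then-split pattern in `create_features` relies on the training and test rows having distinct index labels, so that the test rows can be recovered with `.loc` after the shared transformations. Here is a minimal sketch on made-up frames (the `Id` values and the `LogArea` feature are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training and test splits, indexed by distinct Ids
train = pd.DataFrame({"GrLivArea": [1500.0, 2000.0]}, index=pd.Index([1, 2], name="Id"))
test = pd.DataFrame({"GrLivArea": [1200.0]}, index=pd.Index([3], name="Id"))

# Combine so that every transformation sees both splits
combined = pd.concat([train, test])
combined["LogArea"] = np.log1p(combined["GrLivArea"])

# Recover the splits by index label
test_out = combined.loc[test.index, :]
train_out = combined.drop(test.index)

print(train_out.index.tolist(), test_out.index.tolist())  # [1, 2] [3]
```

If the two splits shared index labels, `.loc` and `drop` would pick up rows from both, so it is worth checking that the indexes do not overlap before combining.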

# Step 4 - Hyperparameter Tuning #
At this stage, you might like to do some hyperparameter tuning with XGBoost before creating your final submission.

In [None]:
# Create features for the training data
X_train = create_features(df_train)
y_train = df_train.loc[:, "SalePrice"]

# Define XGBoost parameters
xgb_params = dict(
    max_depth=6,           # maximum depth of each tree
    learning_rate=0.01,    # effect of each tree
    n_estimators=1000,     # number of trees
    min_child_weight=1,    # minimum number of houses in a leaf
    colsample_bytree=0.7,  # fraction of features per tree
    subsample=0.7,         # fraction of instances per tree
    reg_alpha=0.5,         # L1 regularization
    reg_lambda=1.0,        # L2 regularization
    num_parallel_tree=1,   # set > 1 for boosted random forests
)

# Instantiate the XGBoost regressor with the specified parameters
xgb = XGBRegressor(**xgb_params)

# Evaluate the model using the transformed training data
score = score_dataset(X_train, y_train, xgb)
print(f"XGBoost model score: {score:.5f} RMSLE")


Just tuning these by hand can give you great results. However, you might like to try using one of scikit-learn's automatic [hyperparameter tuners](https://scikit-learn.org/stable/modules/grid_search.html). Or you could explore more advanced tuning libraries like [Optuna](https://optuna.readthedocs.io/en/stable/index.html) or [scikit-optimize](https://scikit-optimize.github.io/stable/).
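
As a rough sketch of the scikit-learn route, the example below runs `RandomizedSearchCV` on synthetic data. To keep it free of extra dependencies it uses scikit-learn's `GradientBoostingRegressor` as a stand-in. Since `XGBRegressor` follows the scikit-learn estimator interface, it should be possible to swap it in with its own parameter names. The data, search ranges, and `n_iter` are arbitrary assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the engineered feature set
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Candidate values to sample from (illustrative ranges only)
param_distributions = {
    "max_depth": [2, 3, 4, 5],
    "learning_rate": [0.01, 0.03, 0.1],
    "n_estimators": [50, 100, 200],
    "subsample": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                               # number of sampled configurations
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best CV RMSE: {-search.best_score_:.3f}")
```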

Here is how you can use Optuna with XGBoost:

In [None]:
import optuna

def objective(trial):
    # Define the search space for hyperparameters
    xgb_params = dict(
        max_depth=trial.suggest_int("max_depth", 2, 10),  # maximum depth of each tree
        learning_rate=trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True),  # effect of each tree
        n_estimators=trial.suggest_int("n_estimators", 1000, 8000),  # number of trees
        min_child_weight=trial.suggest_int("min_child_weight", 1, 10),  # minimum number of houses in a leaf
        colsample_bytree=trial.suggest_float("colsample_bytree", 0.2, 1.0),  # fraction of features per tree
        subsample=trial.suggest_float("subsample", 0.2, 1.0),  # fraction of instances per tree
        reg_alpha=trial.suggest_float("reg_alpha", 1e-4, 1e2, log=True),  # L1 regularization
        reg_lambda=trial.suggest_float("reg_lambda", 1e-4, 1e2, log=True),  # L2 regularization
    )
    # Create XGBoost regressor with the suggested hyperparameters
    xgb = XGBRegressor(**xgb_params)
    # Evaluate the model using the transformed training data and return the score
    return score_dataset(X_train, y_train, xgb)

# Create a study object for optimization, minimizing the objective function
study = optuna.create_study(direction="minimize")
# Optimize the hyperparameters with a specified number of trials
study.optimize(objective, n_trials=20)

# Get the best hyperparameters found during optimization
xgb_params = study.best_params


Be aware that this cell will take quite a while to run. After it's done, you might enjoy using some of [Optuna's visualizations](https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/005_visualization.html).

# Step 5 - Train Model and Create Submissions #
Once you're satisfied with everything, it's time to create your final predictions! This cell will:

* create your feature set from the original data
* train XGBoost on the training data
* use the trained model to make predictions from the test set
* save the predictions to a CSV file

In [None]:
X_train, X_test = create_features(df_train, df_test)
y_train = df_train.loc[:, "SalePrice"]

xgb = XGBRegressor(**xgb_params)
# XGB minimizes MSE, but competition loss is RMSLE
# So, we need to log-transform y to train and exp-transform the predictions
xgb.fit(X_train, np.log(y_train))
predictions = np.exp(xgb.predict(X_test))

output = pd.DataFrame({'Id': X_test.index, 'SalePrice': predictions})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")

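The reasoning in the comments above can be checked numerically. For this competition the metric is the RMSE between the logarithm of the predicted price and the logarithm of the observed price, so the RMSE of a model trained on `log(y)` is already on the competition's scale, and `np.exp` maps the predictions back to prices. The prices and errors below are made-up values for illustration.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """RMSE between log(y_true) and log(y_pred)."""
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))


# Made-up sale prices and log-scale predictions from a hypothetical model
y_true = np.array([100_000.0, 200_000.0, 400_000.0])
log_pred = np.log(y_true) + np.array([0.1, -0.1, 0.0])

# RMSE computed in log space ...
rmse_log_space = np.sqrt(np.mean((log_pred - np.log(y_true)) ** 2))
# ... equals RMSLE computed on the exp-transformed predictions
y_pred = np.exp(log_pred)

print(np.isclose(rmse_log_space, rmsle(y_true, y_pred)))  # True
```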

To submit these predictions to the competition, follow these steps:

1. Begin by clicking on the blue **Save Version** button in the top right corner of the window. This will generate a pop-up window.
1. Ensure that the **Save and Run All** option is selected, and then click on the blue **Save** button.
1. This generates a window in the bottom left corner of the notebook. After it has finished running, click on the number to the right of the **Save Version** button. This pulls up a list of versions on the right of the screen. Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**. This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
1. Click on the **Output** tab on the right of the screen. Then, click on the file you would like to submit, and click on the blue **Submit** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

# Next Steps #
If you want to keep working to improve your performance, select the blue **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.

Be sure to check out [other users' notebooks](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/notebooks) in this competition. You'll find lots of great ideas for new features, as well as other ways to explore the dataset or make better predictions. There's also the [discussion forum](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/discussion), where you can share ideas with other Kagglers.

Have fun Kaggling!


Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/feature-engineering/discussion) to chat with other learners.