In [18]:
##### The Alley feature has too many missing values, so we will drop it from the dataset.

##### Based on the data, YearSold and OverallCondition are only weakly correlated with SalePrice. While we could consider removing them, we should proceed with caution, as they might interact with other variables in ways that reveal a stronger relationship.

##### The Street feature may not be useful since it contains only two categories, and one of them occurs very rarely.

##### According to the metadata, we should not include the Foundation feature in the model, so it will be excluded.

##### The CentralAir feature has a strong class imbalance, so I will not include it in the model.

##### Although the SaleType feature also shows class imbalance, it includes a wide variety of values. For now, I will keep it.

##### In the SaleCondition column, there are two identical values written differently: "Normal" and "normal". We will unify them by using "Normal" to maintain consistency in capitalization.

##### The LotArea feature contains too many unique values, so I will group them into bins.

##### Similarly, the GrLivArea feature has a high number of unique values, so I will also bin it.

##### For TotalBsmtSF, I will consider creating bins to group the values into ranges.

##### I will explore combining FullBath and HalfBath into a single feature, as they both represent bathrooms and appear to be related.

##### I will create bins for GarageCars.

##### I will also create bins for GarageArea, and evaluate the correlation between GarageCars and GarageArea. If they are too closely related, one of them may be dropped.
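
The binning and correlation checks described in the notes above could be sketched as follows. This is a minimal illustration on a small made-up DataFrame, not the notebook's dataset. The use of `pd.qcut` with four quantile bins, the bin labels, and the 0.8 correlation threshold are all assumptions chosen for illustration.

```python
import pandas as pd

# Tiny made-up sample (hypothetical values, not taken from the dataset)
sample = pd.DataFrame({
    "LotArea":    [5000, 7200, 8400, 9600, 11000, 12500, 15000, 21000],
    "GarageCars": [0, 1, 1, 2, 2, 2, 3, 3],
    "GarageArea": [0, 240, 280, 440, 480, 520, 700, 820],
})

# Quantile-based binning: each bin receives roughly the same number of rows
sample["LotAreaBin"] = pd.qcut(
    sample["LotArea"], q=4, labels=["small", "medium", "large", "very_large"]
)

# Pearson correlation between the two garage features
garage_corr = sample["GarageCars"].corr(sample["GarageArea"])

# Assumed threshold: treat the pair as redundant above 0.8
CORR_THRESHOLD = 0.8
drop_garage_area = garage_corr > CORR_THRESHOLD

print(sample[["LotArea", "LotAreaBin"]])
print(f"GarageCars vs GarageArea correlation: {garage_corr:.3f}")
print(f"Drop GarageArea: {drop_garage_area}")
```

Quantile bins keep the groups balanced even when the area distribution is skewed. Fixed-width bins via `pd.cut` would be the alternative if interpretable ranges matter more than balance.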

In [19]:
import pandas as pd
import numpy as np
import os
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

def load_data(path='../data/dataset.csv'):
    """Load the housing dataset."""
    return pd.read_csv(path)

In [20]:
def preprocess_data(df, save_path=None, version=None, features_to_drop=(
        # A tuple is used so the default cannot be mutated between calls
        'Foundation',  # excluded based on metadata
        'CentralAir',  # strong class imbalance
        'OverallCondition',  # weakly correlated with SalePrice
        'Street',  # only two categories with one occurring rarely
        'Alley',  # too many missing values
        'BldgType',
        'HouseStyle',
        'GarageType',
        'SaleType',
        'SaleCondition',
        'LotType',
        'GarageArea',
        'SaleAge',
    )):
    """
    Preprocessing pipeline for the housing dataset.
    
    Args:
        df: The input DataFrame
        save_path: Optional path to save the processed data
        version: Optional version number to append to saved files
        features_to_drop: Column names to exclude from the model features
        
    Returns:
        X_train, X_test, y_train, y_test: Processed data splits
    """
    # Create a copy to avoid modifying the original dataframe
    data = df.copy()
    
    # Target variable
    target = 'SalePrice'
    
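    # Step 1: Handle missing values
    # Assign the result back instead of calling fillna(inplace=True) on a
    # column selection, which pandas flags as chained assignment.
    data['Alley'] = data['Alley'].fillna('Missing')
    data['GarageType'] = data['GarageType'].fillna('Missing')
    data['GarageArea'] = data['GarageArea'].fillna(0)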
    
    # Step 2: Feature engineering
    # Age-related features
    data['HouseAge'] = 2025 - data['YearBuilt']
    data['SaleAge'] = 2025 - data['YearSold']
    data.drop(columns=['YearBuilt', 'YearSold'], inplace=True)
    
    # Bathroom feature
    data['TotalBath'] = data['FullBath'] + (0.5 * data['HalfBath'])
    data.drop(columns=['FullBath', 'HalfBath'], inplace=True)
    
    # Step 3: Group rare values in categorical features
    # LotType grouping
    if 'LotType' in data.columns:
        data["LotType"] = data["LotType"].replace({'FR2': 'FR', 'FR3': 'FR'})
    
    # GarageType grouping
    if 'GarageType' in data.columns:
        data['GarageType'] = data['GarageType'].replace({
            'Basment': 'Other', 
            'CarPort': 'Other', 
            '2Types': 'Other'
        })
    
    # SaleType grouping
    if 'SaleType' in data.columns:
        data['SaleType'] = data['SaleType'].replace({
            'ConLD': 'Other', 
            'ConLI': 'Other', 
            'ConLw': 'Other', 
            'CWD': 'Other', 
            'Oth': 'Other', 
            'Con': 'Other'
        })
    
    # SaleCondition grouping and standardization
    if 'SaleCondition' in data.columns:
        data['SaleCondition'] = data['SaleCondition'].replace({'normal': 'Normal'})
        data['SaleCondition'] = data['SaleCondition'].replace({
            'Partial': 'Other',
            'Abnorml': 'Other',
            'Family': 'Other',
            'Alloca': 'Other',
            'AdjLand': 'Other'
        })
    
    # Step 4: Features to drop are supplied through the `features_to_drop`
    # argument; they are filtered out in Step 6 rather than dropped here.
    
    # Step 5: Separate features and target (WITHOUT dropping columns)
    y = data[target]
    X = data.drop(columns=[target])
    
    # Step 6: Identify numerical and categorical columns (EXCLUDING features_to_drop)
    numerical_features = [
        col for col in X.select_dtypes(include=['int64', 'float64']).columns 
        if col not in features_to_drop
    ]
    print(f"Numerical features: {numerical_features}")
    categorical_features = [
        col for col in X.select_dtypes(include=['object']).columns 
        if col not in features_to_drop
    ]
    print(f"Categorical features: {categorical_features}")
    # Step 7: Create preprocessing pipelines
    numerical_transformer = Pipeline(steps=[
        ('scaler', MinMaxScaler())
    ])
    
    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore'))
    ])
    
    # Combine preprocessing steps
    transformers = [('num', numerical_transformer, numerical_features)]
    if categorical_features:
        transformers.append(('cat', categorical_transformer, categorical_features))
    
    preprocessor = ColumnTransformer(transformers=transformers, remainder='drop')
    
    # Step 8: Split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Step 9: Apply preprocessing (automatically drops unprocessed columns)
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)
    
    # Step 10: Get feature names after transformation
    feature_names = numerical_features.copy()
    if categorical_features:
        categorical_features_out = preprocessor.named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_features)
        feature_names += list(categorical_features_out)
    
    # Convert to DataFrame
    X_train_processed = pd.DataFrame(X_train_processed, columns=feature_names)
    X_test_processed = pd.DataFrame(X_test_processed, columns=feature_names)
    
    # Step 11: Save processed data if requested
    if save_path:
        os.makedirs(save_path, exist_ok=True)
        suffix = f"_{version}" if version else ""
        X_train_processed.to_csv(os.path.join(save_path, f'X_train{suffix}.csv'), index=False)
        X_test_processed.to_csv(os.path.join(save_path, f'X_test{suffix}.csv'), index=False)
        pd.DataFrame(y_train).to_csv(os.path.join(save_path, f'y_train{suffix}.csv'), index=False)
        pd.DataFrame(y_test).to_csv(os.path.join(save_path, f'y_test{suffix}.csv'), index=False)
    
    return X_train_processed, X_test_processed, y_train, y_test
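
The `fillna(..., inplace=True)` calls in Step 1 act on a column selected from `data`, which recent pandas versions flag as chained assignment. A minimal sketch of the assignment-based alternative follows, using a small made-up DataFrame. The values are hypothetical and only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the housing data (hypothetical values)
data = pd.DataFrame({
    "Alley": ["Grvl", np.nan, "Pave"],
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "GarageArea": [480.0, np.nan, 300.0],
})

# Option 1: assign the filled column back to the DataFrame
data["Alley"] = data["Alley"].fillna("Missing")

# Option 2: fill several columns at once with a column-to-value mapping
data = data.fillna({"GarageType": "Missing", "GarageArea": 0})

print(data)
```

Either form operates on the DataFrame itself instead of an intermediate object, so it should behave the same way before and after the pandas 3.0 change.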

In [21]:

# data_path = os.path.join('..', 'data', 'dataset.csv')
# data = load_data(data_path)

# # Apply preprocessing
# X_train, X_test, y_train, y_test = preprocess_data(data, save_path='../processed_data', version=1  )

# # Print shapes to verify
# print(f"X_train shape: {X_train.shape}")
# print(f"X_test shape: {X_test.shape}")
# print(f"y_train shape: {y_train.shape}")
# print(f"y_test shape: {y_test.shape}")

### For the second and subsequent training versions, we add more features based on the correlation matrix from the EDA phase.
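
One way to keep the per-version drop lists consistent is to derive each one from a shared base list instead of retyping it. The sketch below assumes the features re-added in each version match the comments in the version cells. The names `BASE_DROP`, `READDED_BY_VERSION`, and `drop_list_for` are hypothetical helpers, not part of the notebook.

```python
# Version 1 drop list, copied from the preprocess_data default
BASE_DROP = [
    "Foundation", "CentralAir", "OverallCondition", "Street", "Alley",
    "BldgType", "HouseStyle", "GarageType", "SaleType", "SaleCondition",
    "LotType", "GarageArea", "SaleAge",
]

# Features brought back in each version (applied cumulatively)
READDED_BY_VERSION = {
    2: ["GarageType"],
    3: ["SaleType"],
    4: ["SaleCondition", "HouseStyle", "OverallCondition"],
}


def drop_list_for(version):
    """Return the drop list for a version, minus every feature re-added so far."""
    readded = set()
    for v, cols in READDED_BY_VERSION.items():
        if v <= version:
            readded.update(cols)
    return [col for col in BASE_DROP if col not in readded]


for v in (1, 2, 3, 4):
    print(v, drop_list_for(v))
```

The result could then be passed as `features_to_drop=drop_list_for(4)` when calling `preprocess_data`, which keeps `Foundation` excluded in every version without repeating the list.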

In [22]:
# # we add GarageType ( Remember we should not work with Foundation )
# data_path = os.path.join('..', 'data', 'dataset.csv')
# data = load_data(data_path)

# # Apply preprocessing
# X_train, X_test, y_train, y_test = preprocess_data(data, save_path='../processed_data', version=2 ,features_to_drop=[
#         'Foundation',  # excluded based on metadata
#         'CentralAir',  # strong class imbalance
#         'OverallCondition',  # weakly correlated with SalePrice
#         'Street',  # only two categories with one occurring rarely
#         'Alley',  # too many missing values
#         'BldgType',
#         'HouseStyle',
#         'SaleType',
#         'SaleCondition',
#         'LotType',
#         'GarageArea',
#         'SaleAge',
#     ])

# # Print shapes to verify
# print(f"X_train shape: {X_train.shape}")
# print(f"X_test shape: {X_test.shape}")
# print(f"y_train shape: {y_train.shape}")
# print(f"y_test shape: {y_test.shape}")

In [23]:
# ### we add SaleType for the 3rd version ( Remember we should not work with Foundation )
# data_path = os.path.join('..', 'data', 'dataset.csv')
# data = load_data(data_path)

# # Apply preprocessing
# X_train, X_test, y_train, y_test = preprocess_data(data, save_path='../processed_data', version=3 ,features_to_drop=[
#         'Foundation',  # excluded based on metadata
#         'CentralAir',  # strong class imbalance
#         'OverallCondition',  # weakly correlated with SalePrice
#         'Street',  # only two categories with one occurring rarely
#         'Alley',  # too many missing values
#         'BldgType',
#         'HouseStyle',
#         'SaleCondition',
#         'LotType',
#         'GarageArea',
#         'SaleAge',
#     ])

# # Print shapes to verify
# print(f"X_train shape: {X_train.shape}")
# print(f"X_test shape: {X_test.shape}")
# print(f"y_train shape: {y_train.shape}")
# print(f"y_test shape: {y_test.shape}")

In [24]:
# Version 4: add SaleCondition, HouseStyle, and OverallCondition (Foundation stays excluded)
data_path = os.path.join('..', 'data', 'dataset.csv')
data = load_data(data_path)

# Apply preprocessing
X_train, X_test, y_train, y_test = preprocess_data(data, save_path='../processed_data', version=4, features_to_drop=[
        'Foundation',  # excluded based on metadata
        'CentralAir',  # strong class imbalance
        'Street',  # only two categories with one occurring rarely
        'Alley',  # too many missing values
        'BldgType',
        'LotType',
        'GarageArea',
        'SaleAge',
    ])

# Print shapes to verify
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

Numerical features: ['LotArea', 'GrLivArea', 'OverallQuality', 'OverallCondition', 'TotalBsmtSF', 'GarageCars', 'HouseAge', 'TotalBath']
Categorical features: ['HouseStyle', 'GarageType', 'SaleType', 'SaleCondition']
X_train shape: (1168, 24)
X_test shape: (292, 24)
y_train shape: (1168,)
y_test shape: (292,)


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  data['Alley'].fillna('Missing', inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting val