# Preprocessing and Training Data Development

- Create dummy or indicator features for categorical variables.
- Standardize the magnitude of numeric features.
- Split the data into training and testing subsets.



## Step 1: Creating Dummy or Indicator Features
We identify and process categorical variables by creating dummy features to represent their categories numerically. This allows us to include them in model development.

In [2]:
# Import necessary libraries
import pandas as pd

# Load the dataset
housing_data = pd.read_csv('Housing.csv')

# Convert binary categorical variables (yes/no) to numeric (1/0)
binary_columns = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']
housing_data[binary_columns] = housing_data[binary_columns].replace({'yes': 1, 'no': 0})

# Create dummy variables for 'furnishingstatus'
housing_data = pd.get_dummies(housing_data, columns=['furnishingstatus'], drop_first=True)

# Display the updated dataset
housing_data.head()

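| | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus_semi-furnished | furnishingstatus_unfurnished |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | 1 | 0 | 0 | 0 | 1 | 2 | 1 | False | False |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | 1 | 0 | 0 | 0 | 1 | 3 | 0 | False | False |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | 1 | 0 | 1 | 0 | 0 | 2 | 1 | True | False |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | 1 | 0 | 1 | 0 | 1 | 3 | 1 | False | False |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | 1 | 1 | 1 | 0 | 1 | 2 | 0 | False | False |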
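
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a three-row toy frame. The values are made up for illustration and are not taken from `Housing.csv`. With three furnishing levels, two indicator columns are enough: the dropped level, `furnished`, is the row where both indicators are false.

```python
import pandas as pd

# Hypothetical toy data: one row per furnishing level
toy = pd.DataFrame({
    "area": [1000, 1500, 2000],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

# drop_first=True drops the first level in sorted order ("furnished")
encoded = pd.get_dummies(toy, columns=["furnishingstatus"], drop_first=True)
dummy_cols = [c for c in encoded.columns if c.startswith("furnishingstatus_")]

print(dummy_cols)
print(encoded[dummy_cols])
```

Dropping one level avoids perfectly collinear indicator columns, which matters for the linear model used later.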


## Step 2: Standardizing the Magnitude of Numeric Features
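To put the numeric features on a common scale, we standardize each one to a mean of 0 and a standard deviation of 1. Scaling matters most for models that are sensitive to feature magnitude, such as regularized linear models (ridge, lasso), k-nearest neighbors, and support vector machines. Ordinary least-squares linear regression and tree-based models such as random forests and gradient boosting give essentially the same predictions with or without it, but standardized inputs make the linear coefficients directly comparable.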
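
For reference, the transformation applied to each numeric column is the usual z-score. The symbols below are notation introduced for this note rather than names from the code: $x_{ij}$ is the value of column $j$ in row $i$, and $\mu_j$ and $\sigma_j$ are that column's mean and standard deviation.

```latex
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}
```

`StandardScaler` computes $\sigma_j$ as the population standard deviation, dividing by $n$ rather than $n - 1$.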

In [3]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

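# Identify numerical columns to scale
# Caveat: the scaler is fit on the full dataset here, before the train/test split in Step 3,
# so test-set statistics influence the scaling; fitting on the training rows only avoids this leakage.
numerical_columns = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking']
housing_data[numerical_columns] = scaler.fit_transform(housing_data[numerical_columns])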

# Display the dataset after scaling
housing_data.head()

| | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus_semi-furnished | furnishingstatus_unfurnished |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 1.046726 | 1.403419 | 1.421812 | 1.378217 | 1 | 0 | 0 | 0 | 1 | 1.517692 | 1 | False | False |
| 1 | 12250000 | 1.75701 | 1.403419 | 5.405809 | 2.532024 | 1 | 0 | 0 | 0 | 1 | 2.679409 | 0 | False | False |
| 2 | 12250000 | 2.218232 | 0.047278 | 1.421812 | 0.22441 | 1 | 0 | 1 | 0 | 0 | 1.517692 | 1 | True | False |
| 3 | 12215000 | 1.083624 | 1.403419 | 1.421812 | 0.22441 | 1 | 0 | 1 | 0 | 1 | 2.679409 | 1 | False | False |
| 4 | 11410000 | 1.046726 | 1.403419 | -0.570187 | 0.22441 | 1 | 1 | 1 | 0 | 1 | 1.517692 | 0 | False | False |
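
A common alternative ordering, sketched here on synthetic data, is to split first and fit the scaler on the training rows only, so that the test set plays no part in estimating the mean and standard deviation. The frame `demo` and its values are hypothetical stand-ins, not the housing data, and this is one possible arrangement rather than the notebook's own pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data standing in for columns such as 'area' and 'bedrooms'
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "area": rng.uniform(1500, 16000, size=50),
    "bedrooms": rng.integers(1, 7, size=50),
})

# Split first, so the test rows are not used to estimate the scaling statistics
train, test = train_test_split(demo, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from the training rows only
test_scaled = scaler.transform(test)        # reuse the training statistics on the test rows

print(train_scaled.mean(axis=0).round(6))  # approximately 0 per column
print(train_scaled.std(axis=0).round(6))   # approximately 1 per column
```

The scaled test columns will generally not have a mean of exactly 0 or a standard deviation of exactly 1, which is expected: they are measured against the training distribution.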


## Step 3: Splitting the Data into Training and Testing Subsets
We split the dataset into training (80%) and testing (20%) subsets. The training set will be used to train the model, and the testing set will evaluate its performance.

In [4]:
from sklearn.model_selection import train_test_split

# Define the target variable (price) and features
X = housing_data.drop(columns=['price'])  # Features
y = housing_data['price']  # Target variable

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the training and testing sets
print("Training Features Shape:", X_train.shape)
print("Testing Features Shape:", X_test.shape)
print("Training Labels Shape:", y_train.shape)
print("Testing Labels Shape:", y_test.shape)

Training Features Shape: (436, 13)
Testing Features Shape: (109, 13)
Training Labels Shape: (436,)
Testing Labels Shape: (109,)
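
The reported shapes imply 545 rows in total (436 + 109). The sketch below uses a plain array of 545 row indices as a hypothetical stand-in for the dataset to show how `test_size=0.2` produces those counts, and how a fixed `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: 545 row indices, matching the 436 + 109 rows reported above
rows = np.arange(545)

# Two splits with the same random_state select the same rows
train_a, test_a = train_test_split(rows, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))
print(np.array_equal(test_a, test_b))
```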


## Step 4: Saving Preprocessed Data
To ensure reproducibility and facilitate modeling in subsequent steps, we save the preprocessed training and testing datasets as CSV files.

In [5]:
# Save the preprocessed data
X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

print("Preprocessed data saved successfully.")

Preprocessed data saved successfully.
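
As a quick sanity check on this save format, the following sketch writes a small frame and a target series to a temporary directory and reads them back. The names `X_demo` and `y_demo` and their values are hypothetical; the point is that `index=False` keeps the row index out of the file, and that the target comes back as a one-column frame that can be flattened.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for X_train and y_train
X_demo = pd.DataFrame({
    "area": [0.5, -1.2, 0.7],
    "furnishingstatus_unfurnished": [True, False, True],
})
y_demo = pd.Series([100, 200, 300], name="price")

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X_demo.csv")
    y_path = os.path.join(tmp, "y_demo.csv")

    # index=False keeps the row index out of the file, so no extra column appears on reload
    X_demo.to_csv(x_path, index=False)
    y_demo.to_csv(y_path, index=False)

    X_back = pd.read_csv(x_path)
    y_back = pd.read_csv(y_path).values.ravel()  # (n, 1) column -> flat (n,) array

print(X_back.shape, y_back.shape)
print(list(X_back.columns))
```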


## Conclusion
- Dummy features for categorical variables have been created.
- Numerical features have been standardized.
- The data has been split into training and testing subsets.
- Preprocessed datasets have been saved for future use.

The dataset is now ready for model development.

## Modeling: Predicting House Prices

- Train and evaluate three models (Linear Regression, Random Forest, and XGBoost).
- Compare their performance using Root Mean Squared Error (RMSE) and R² Score.
- Identify the best-performing model for predicting house prices.

### Step 1: Load Preprocessed Data
We start by loading the preprocessed training and testing datasets created earlier.

In [6]:
import pandas as pd
X_train = pd.read_csv('X_train.csv')
X_test = pd.read_csv('X_test.csv')
y_train = pd.read_csv('y_train.csv').values.ravel()  # Flatten the target array
y_test = pd.read_csv('y_test.csv').values.ravel()

# Display the shapes of the datasets
print("Training Features Shape:", X_train.shape)
print("Testing Features Shape:", X_test.shape)
print("Training Labels Shape:", y_train.shape)
print("Testing Labels Shape:", y_test.shape)

Training Features Shape: (436, 13)
Testing Features Shape: (109, 13)
Training Labels Shape: (436,)
Testing Labels Shape: (109,)


### Step 2: Define Models
We will evaluate the following models:
- **Linear Regression**: A simple, interpretable baseline model.
- **Random Forest Regressor**: A tree-based ensemble model for non-linear relationships.
- **XGBoost Regressor**: A gradient-boosted tree ensemble that builds trees sequentially, each one fitted to the errors of the trees before it.

In [7]:
%pip install xgboost

Note: you may need to restart the kernel to use updated packages.


In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42)
}

### Step 3: Train and Evaluate Models
We train each model using the training set and evaluate its performance on the testing set using the following metrics:
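- **Root Mean Squared Error (RMSE)**: The square root of the mean squared prediction error, expressed in the same units as `price`; lower is better, and large errors are penalized more heavily than small ones.
- **R² Score**: The proportion of variance in the target variable that the model explains; 1 is a perfect fit, and 0 matches always predicting the mean.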
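
To show what these two metrics compute, the sketch below evaluates both by hand on four made-up values and compares the result with scikit-learn. The arrays are illustrative only and are unrelated to the housing prices.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_pred

# RMSE: square root of the mean of squared residuals
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: one minus residual sum of squares over total sum of squares around the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred)))
print(r2_manual, r2_score(y_true, y_pred))
```

Here the squared residuals sum to 6 and the total sum of squares is 20, so R² is 1 − 6/20 = 0.7 and RMSE is the square root of 1.5.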

In [10]:
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Helper function to evaluate models
def evaluate_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)  # Train the model
    y_pred = model.predict(X_test)  # Make predictions
    # RMSE as the square root of MSE; the `squared=False` argument was
    # deprecated in scikit-learn 1.4 and removed in later releases
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)  # Calculate R² Score
    return rmse, r2

# Evaluate each model
results = []
for name, model in models.items():
    rmse, r2 = evaluate_model(model, X_train, y_train, X_test, y_test)
    results.append({'Model': name, 'RMSE': rmse, 'R² Score': r2})

# Create a DataFrame to display results
results_df = pd.DataFrame(results)
print(results_df)



               Model          RMSE  R² Score
0  Linear Regression  1.324507e+06  0.652924
1      Random Forest  1.400845e+06  0.611764
2            XGBoost  1.448930e+06  0.584653




### Step 4: Identify the Best Model
We select the model with the lowest RMSE as the best model. If two models have similar RMSE, we also consider R² Score and other factors like interpretability and computational efficiency.

In [11]:
# Identify the best model based on RMSE
best_model = results_df.loc[results_df['RMSE'].idxmin()]
print("Best Model:")
print(best_model)

Best Model:
Model       Linear Regression
RMSE           1324506.960091
R² Score             0.652924
Name: 0, dtype: object
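
When scores are close, it can help to look at the full ranking and the margin rather than only the top row. The sketch below rebuilds the results table from the scores reported above and sorts it by RMSE; the variable names `ranked`, `best_name`, and `gap` are introduced here for illustration.

```python
import pandas as pd

# Scores copied from the results table reported above
results_df = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest", "XGBoost"],
    "RMSE": [1.324507e06, 1.400845e06, 1.448930e06],
    "R² Score": [0.652924, 0.611764, 0.584653],
})

# Rank models from lowest to highest RMSE
ranked = results_df.sort_values("RMSE").reset_index(drop=True)
best_name = ranked.loc[0, "Model"]
gap = ranked.loc[1, "RMSE"] - ranked.loc[0, "RMSE"]

print(ranked)
print(f"Best by RMSE: {best_name}; margin over runner-up: {gap:,.0f}")
```

In the notebook itself, `models[best_name]` would return the corresponding estimator, already fitted, because `evaluate_model` fits each model in place.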


### Conclusion
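- We trained and evaluated three models with default hyperparameters on a single 80/20 split: Linear Regression, Random Forest, and XGBoost.
- Linear Regression performed best on the test set, with the lowest RMSE (about 1.32 million) and the highest R² (0.653), ahead of Random Forest (RMSE about 1.40 million, R² 0.612) and XGBoost (RMSE about 1.45 million, R² 0.585).
- Linear Regression will be used for making predictions and further analysis.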