# Preprocessing and Training Data Development

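This notebook prepares the housing dataset for modeling. The steps are:

- Create dummy or indicator features for categorical variables.
- Standardize the magnitude of numeric features.
- Split the data into training and testing subsets.
- Save the preprocessed subsets for later modeling.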


## Step 1: Creating Dummy or Indicator Features
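Categorical variables must be encoded as numbers before they can be used in model development. The six yes/no columns are mapped to 1/0, and `furnishingstatus`, which has three categories, is expanded into dummy columns. With `drop_first=True`, the first category (`furnished`) is dropped and serves as the baseline, so the remaining two dummy columns are not perfectly collinear.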

In [1]:
# Import necessary libraries
import pandas as pd

# Load the dataset
housing_data = pd.read_csv('Housing.csv')

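# Convert binary categorical variables (yes/no) to numeric (1/0).
# Series.map avoids the downcasting FutureWarning that DataFrame.replace
# raises in recent pandas versions; any value other than 'yes'/'no' becomes NaN.
binary_columns = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']
housing_data[binary_columns] = housing_data[binary_columns].apply(
    lambda col: col.map({'yes': 1, 'no': 0})
)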

# Create dummy variables for 'furnishingstatus'
housing_data = pd.get_dummies(housing_data, columns=['furnishingstatus'], drop_first=True)

# Display the updated dataset
housing_data.head()

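|   | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus_semi-furnished | furnishingstatus_unfurnished |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 13300000 | 7420 | 4 | 2 | 3 | 1 | 0 | 0 | 0 | 1 | 2 | 1 | False | False |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | 1 | 0 | 0 | 0 | 1 | 3 | 0 | False | False |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | 1 | 0 | 1 | 0 | 0 | 2 | 1 | True | False |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | 1 | 0 | 1 | 0 | 1 | 3 | 1 | False | False |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | 1 | 1 | 1 | 0 | 1 | 2 | 0 | False | False |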
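To make the encoding concrete, here is a minimal sketch on a three-row toy frame. The frame `toy` and its values are made up for illustration and are not taken from `Housing.csv`; `dtype=int` is an optional choice that yields 0/1 integers instead of the `True`/`False` values shown in the output above.

```python
import pandas as pd

# Hypothetical toy data, one row per furnishing category.
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

# Binary column: map yes/no to 1/0.
toy["mainroad"] = toy["mainroad"].map({"yes": 1, "no": 0})

# Three-level column: one dummy column per category, minus the first
# (alphabetically, "furnished"), which becomes the implicit baseline.
encoded = pd.get_dummies(
    toy, columns=["furnishingstatus"], drop_first=True, dtype=int
)
print(encoded)
```

A row with zeros in both dummy columns is therefore a `furnished` property.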


## Step 2: Standardizing the Magnitude of Numeric Features
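To put the numeric features on a common scale, we standardize each one to have a mean of 0 and a standard deviation of 1. The binary and dummy columns, and the target `price`, are left unchanged. Scaling matters most for models that are sensitive to feature magnitude, such as regularized linear models (ridge, lasso), k-nearest neighbors, and models fit by gradient descent. Ordinary least-squares regression and tree-based methods such as gradient boosting are largely unaffected in their predictions, although standardized coefficients are easier to compare. Note that the scaler here is fit on the full dataset, before the split in Step 3, so test-set statistics influence the scaled training data; fitting the scaler on the training subset only would avoid this leakage.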

In [2]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Identify numerical columns to scale
numerical_columns = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking']
housing_data[numerical_columns] = scaler.fit_transform(housing_data[numerical_columns])

# Display the dataset after scaling
housing_data.head()

|   | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus_semi-furnished | furnishingstatus_unfurnished |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 13300000 | 1.046726 | 1.403419 | 1.421812 | 1.378217 | 1 | 0 | 0 | 0 | 1 | 1.517692 | 1 | False | False |
| 1 | 12250000 | 1.75701 | 1.403419 | 5.405809 | 2.532024 | 1 | 0 | 0 | 0 | 1 | 2.679409 | 0 | False | False |
| 2 | 12250000 | 2.218232 | 0.047278 | 1.421812 | 0.22441 | 1 | 0 | 1 | 0 | 0 | 1.517692 | 1 | True | False |
| 3 | 12215000 | 1.083624 | 1.403419 | 1.421812 | 0.22441 | 1 | 0 | 1 | 0 | 1 | 2.679409 | 1 | False | False |
| 4 | 11410000 | 1.046726 | 1.403419 | -0.570187 | 0.22441 | 1 | 1 | 1 | 0 | 1 | 1.517692 | 0 | False | False |
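As a quick check of what `StandardScaler` computes, the sketch below compares it with the z-score formula `(x - mean) / std` on five `area` values. The array is illustrative only: its statistics come from five rows rather than the full dataset, so the scaled values will not match the table above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Five area values used purely for illustration.
area = np.array([[7420.0], [8960.0], [9960.0], [7500.0], [7420.0]])

# Library result.
scaled = StandardScaler().fit_transform(area)

# Manual z-score. StandardScaler uses the population standard deviation
# (ddof=0), which is also NumPy's default for ndarray.std.
manual = (area - area.mean(axis=0)) / area.std(axis=0)

print(np.allclose(scaled, manual))
```

One practical consequence: `pandas.Series.std` defaults to `ddof=1`, so a manual check written with pandas needs `std(ddof=0)` to agree with the scaler.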


## Step 3: Splitting the Data into Training and Testing Subsets
We split the dataset into training (80%) and testing (20%) subsets, fixing `random_state=42` so that the split is repeatable. The training set is used to fit the model, and the testing set is held out to evaluate its performance on unseen data.

In [3]:
from sklearn.model_selection import train_test_split

# Define the target variable (price) and features
X = housing_data.drop(columns=['price'])  # Features
y = housing_data['price']  # Target variable

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the training and testing sets
print("Training Features Shape:", X_train.shape)
print("Testing Features Shape:", X_test.shape)
print("Training Labels Shape:", y_train.shape)
print("Testing Labels Shape:", y_test.shape)

Training Features Shape: (436, 13)
Testing Features Shape: (109, 13)
Training Labels Shape: (436,)
Testing Labels Shape: (109,)
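Because the scaler in Step 2 was fit before this split, the test rows contributed to the mean and standard deviation used for scaling. A common alternative, sketched here under assumptions, is to split first and fit the scaler on the training rows only. The frame `demo` is synthetic random data with hypothetical values, not the housing dataset, and the variable names are chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (values are random, not real).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "area": rng.integers(1500, 16000, size=50).astype(float),
    "bedrooms": rng.integers(1, 6, size=50).astype(float),
    "price": rng.integers(2_000_000, 13_000_000, size=50),
})
numeric_cols = ["area", "bedrooms"]

# 1. Split first, so the test rows stay unseen.
X_demo = demo.drop(columns=["price"])
y_demo = demo["price"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
X_tr, X_te = X_tr.copy(), X_te.copy()

# 2. Learn the mean and standard deviation from the training rows only.
demo_scaler = StandardScaler().fit(X_tr[numeric_cols])

# 3. Apply the same training statistics to both subsets.
X_tr[numeric_cols] = demo_scaler.transform(X_tr[numeric_cols])
X_te[numeric_cols] = demo_scaler.transform(X_te[numeric_cols])

print(X_tr.shape, X_te.shape)
```

With this ordering the training columns have mean 0 exactly, while the test columns generally do not, which is expected.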


## Step 4: Saving Preprocessed Data
To ensure reproducibility and facilitate modeling in subsequent steps, we save the preprocessed training and testing datasets as CSV files.

In [4]:
# Save the preprocessed data
X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

print("Preprocessed data saved successfully.")

Preprocessed data saved successfully.
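When these files are loaded in a later notebook, `pd.read_csv` returns a one-column DataFrame for the target rather than a Series. The sketch below shows the round trip on a three-value stand-in for `y_train`, written to a temporary directory so that it does not touch the real files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for y_train; the Series name becomes the CSV header.
y_demo = pd.Series([13300000, 12250000, 12215000], name="price")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "y_train.csv"
    y_demo.to_csv(path, index=False)

    loaded = pd.read_csv(path)   # one-column DataFrame
    y_loaded = loaded["price"]   # select the column to get a Series back

print(type(loaded).__name__, type(y_loaded).__name__)
```

Since `index=False` drops the original row labels, the reloaded features and target line up by row position only.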


## Conclusion
- Dummy features for categorical variables have been created.
- Numerical features have been standardized.
- The data has been split into training and testing subsets.
- Preprocessed datasets have been saved for future use.

The dataset is now ready for model development.