---

##### **Q1.** You are developing an SVM regression model to predict house prices from several characteristics, such as location, square footage, and number of bedrooms. Which regression metric would be the best to use in this situation?

**Answer:**  
The **Root Mean Squared Error (RMSE)** is the best fit here. RMSE is expressed in the same units as the house price, which makes it easy to interpret. Because it squares each error before averaging, it also penalizes large misses more heavily than small ones, which matters when one badly mispriced house is costlier than several slightly mispriced ones.
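
As a rough illustration of the point about units, the sketch below computes MSE and RMSE by hand with NumPy. The prices are made-up values, not rows from the dataset, and the variable names are chosen only for this example.

```python
import numpy as np

# Hypothetical actual and predicted prices (illustrative values only)
y_true = np.array([50.0, 80.0, 120.0, 200.0])
y_pred = np.array([55.0, 70.0, 125.0, 170.0])

errors = y_pred - y_true        # per-house error, in price units
mse = np.mean(errors ** 2)      # squared price units, hard to read directly
rmse = np.sqrt(mse)             # back in price units

print(f"MSE:  {mse:.2f} (squared units)")
print(f"RMSE: {rmse:.2f} (same units as price)")
```

The single 30-unit miss contributes 900 of the 1050 total squared error, which is the "large errors weigh more" behavior described above.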

---

##### **Q2.** You have built an SVM regression model and are trying to decide between using MSE or R-squared as your evaluation metric. Which metric would be more appropriate if your goal is to predict the actual price of a house as accurately as possible?

**Answer:**  
In this scenario, **Mean Squared Error (MSE)** or **Root Mean Squared Error (RMSE)** is more appropriate because each directly measures how far your predictions fall from the actual house prices. **R-squared (R²)** only reports the proportion of variance in the target that the model explains; it does not tell you how far off, in price units, the predictions are.
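
One way to see the difference is to express the same predictions in a different unit. In the sketch below (made-up numbers, with helper functions `rmse` and `r2` written just for this example), RMSE grows with the unit change while R² stays the same, so R² alone cannot say how large the price errors are.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of y."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical prices (illustrative values only)
y_true = np.array([50.0, 80.0, 120.0, 200.0])
y_pred = np.array([55.0, 70.0, 125.0, 170.0])

# The same data expressed in a unit 100 times smaller
scale = 100.0
rmse_original = rmse(y_true, y_pred)
rmse_rescaled = rmse(y_true * scale, y_pred * scale)
r2_original = r2(y_true, y_pred)
r2_rescaled = r2(y_true * scale, y_pred * scale)

print("RMSE:", rmse_original, "->", rmse_rescaled)  # scales with the unit
print("R²:  ", r2_original, "->", r2_rescaled)      # unchanged
```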

---

##### **Q3.** You have a dataset with a significant number of outliers and are trying to select an appropriate regression metric to use with your SVM model. Which metric would be the most appropriate in this scenario?

**Answer:**  
In this case, **Mean Absolute Error (MAE)** is the most appropriate metric. MAE averages the absolute differences between predicted and actual values, so each error contributes in proportion to its size. MSE and RMSE square the errors first, which lets a handful of outliers dominate the score. MAE is therefore the better choice when your dataset contains many outliers.
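
The following sketch compares how MAE and RMSE react when a single target value is replaced by an extreme one. The data and the helper names `mae` and `rmse` are invented for the illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical targets and predictions with small, similar errors
y_true = np.array([50.0, 60.0, 70.0, 80.0, 90.0])
y_pred = np.array([52.0, 58.0, 71.0, 79.0, 91.0])

# Same predictions, but the last true value is now an extreme outlier
y_outlier = y_true.copy()
y_outlier[-1] = 900.0

mae_clean, rmse_clean = mae(y_true, y_pred), rmse(y_true, y_pred)
mae_out, rmse_out = mae(y_outlier, y_pred), rmse(y_outlier, y_pred)

print(f"Clean data:   MAE={mae_clean:.2f}  RMSE={rmse_clean:.2f}")
print(f"With outlier: MAE={mae_out:.2f}  RMSE={rmse_out:.2f}")
```

On the clean data the two metrics are close; with the outlier, RMSE rises to more than twice the MAE because the one large error is squared before averaging.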

---

##### **Q4.** You have built an SVM regression model using a polynomial kernel and are trying to select the best metric to evaluate its performance. You have calculated both MSE and RMSE and found that both values are very close. Which metric should you choose to use in this case?

**Answer:**  
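RMSE is simply the square root of MSE, so the two always rank models in the same order, and their values are close only when the error is near 1. Between the two, prefer **RMSE**: it is expressed in the same units as the target variable, which makes it easier to interpret. Both metrics penalize large errors more heavily than small ones, so nothing is lost by reporting RMSE.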
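
A quick numerical check of this relationship, using arbitrary MSE values chosen for the illustration rather than results from any model:

```python
import numpy as np

# Arbitrary MSE values standing in for five hypothetical models
mse_values = np.array([0.81, 1.0, 1.21, 4.0, 10000.0])
rmse_values = np.sqrt(mse_values)

# Taking the square root preserves order, so both metrics rank models identically
same_ranking = np.array_equal(np.argsort(mse_values), np.argsort(rmse_values))

# The two numbers are close only when the error is near 1
gap = np.abs(mse_values - rmse_values)

print("Same ranking:", same_ranking)
print("Gap between MSE and RMSE:", gap)
```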

---

##### **Q5.** You are comparing the performance of different SVM regression models using different kernels (linear, polynomial, and RBF) and are trying to select the best evaluation metric. Which metric would be most appropriate if your goal is to measure how well the model explains the variance in the target variable?

**Answer:**  
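The most appropriate metric in this case is **R-squared (R²)**. R² reports the proportion of variance in the target variable that the model explains, and because it is unitless it is well suited to comparing kernels on how much of the variability in house prices each model captures. A value of 1 is a perfect fit, 0 matches a model that always predicts the mean, and a negative value means the model does worse than that mean baseline.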
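
To make those reference points concrete, the sketch below scores three sets of made-up predictions with `r2_score` from scikit-learn: a mean-only baseline, a reasonable model, and a deliberately poor one.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual prices (illustrative values only)
y_true = np.array([50.0, 80.0, 120.0, 200.0])

baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
good = np.array([55.0, 70.0, 125.0, 170.0])     # close to the truth
bad = np.array([200.0, 120.0, 80.0, 50.0])      # order reversed

r2_baseline = r2_score(y_true, baseline)
r2_good = r2_score(y_true, good)
r2_bad = r2_score(y_true, bad)

print("Mean baseline R²:", r2_baseline)
print("Good model R²:   ", r2_good)
print("Bad model R²:    ", r2_bad)
```

The baseline scores 0, the good predictions score close to 1, and the reversed predictions score below 0.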

---


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Load the dataset
data = pd.read_csv('Bengaluru_House_Data.csv')

# Display the first few rows
data.head()


| | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Super built-up Area | 19-Dec | Electronic City Phase II | 2 BHK | Coomee | 1056 | 2.0 | 1.0 | 39.07 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | Theanmp | 2600 | 5.0 | 3.0 | 120.0 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK | NaN | 1440 | 2.0 | 3.0 | 62.0 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | Soiewre | 1521 | 3.0 | 1.0 | 95.0 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK | NaN | 1200 | 2.0 | 1.0 | 51.0 |


In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Load the dataset
data = pd.read_csv('Bengaluru_House_Data.csv')

# Check the column names
print(data.columns)

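# Preprocess the data: drop every row that has a NaN in any column.
# Note: this also discards rows whose only gap is in a column the model
# never uses, such as 'society'.
data.dropna(inplace=True)

# Clean the 'total_sqft' column to convert it to numeric values.
# Note: only values carrying one of the unit labels below are converted. Any
# other format, including a plain number such as '1056', falls through to NaN
# and is dropped later, which leaves very few rows for training.
def convert_to_numeric(area):
    if 'Sq. Yards' in area:
        return float(area.replace('Sq. Yards', '').strip()) * 9  # 1 Sq. Yard = 9 Sq. Feet
    elif 'Square Feet' in area:
        return float(area.replace('Square Feet', '').strip())
    elif 'Acres' in area:
        return float(area.replace('Acres', '').strip()) * 43560  # 1 Acre = 43560 Sq. Feet
    else:
        return np.nan  # Return NaN for any other format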

# Apply the conversion function to the 'total_sqft' column
data['total_sqft'] = data['total_sqft'].apply(convert_to_numeric)

# One-hot encode the categorical variable 'location'
data = pd.get_dummies(data, columns=['location'], drop_first=True)

# Select features: the numeric columns plus the one-hot encoded location columns
location_cols = [col for col in data.columns if col.startswith('location_')]
X = data[['total_sqft', 'bath', 'balcony'] + location_cols]
y = data['price']  # Target variable

# Drop rows with NaN values after cleaning
X = X.dropna()
y = y.loc[X.index]  # Keep target variable aligned with features

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate with SVM
for kernel in ['linear', 'poly', 'rbf']:
    svr = SVR(kernel=kernel)
    svr.fit(X_train, y_train)
    y_pred = svr.predict(X_test)
    
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r_squared = r2_score(y_test, y_pred)

    print(f'Kernel: {kernel}')
    print(f'MSE: {mse}, RMSE: {rmse}, MAE: {mae}, R²: {r_squared}')
    print('---')


Index(['area_type', 'availability', 'location', 'size', 'society',
       'total_sqft', 'bath', 'balcony', 'price'],
      dtype='object')
Kernel: linear
MSE: 2633946.5143803917, RMSE: 1622.9437804127388, MAE: 1206.606904051719, R²: -327.8220110958324
---
Kernel: poly
MSE: 584507165443.3678, RMSE: 764530.6831274778, MAE: 540665.8759339838, R²: -72969901.9922122
---
Kernel: rbf
MSE: 9799.829794913752, RMSE: 98.99408969687914, MAE: 89.31311874369122, R²: -0.22341122872741193
---
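
The extreme scores above are consistent with a very small training set, since `convert_to_numeric` maps plain numeric strings to NaN. A more forgiving parser could look like the sketch below. `parse_total_sqft` is a hypothetical helper, not part of the notebook. The two unit factors are the ones already used in the cell above, and the handling of ranges such as `'2100 - 2850'` assumes the column contains such values and takes their midpoint, which is one modelling choice among several.

```python
import numpy as np

# Conversion factors to square feet, taken from the original cell
UNIT_TO_SQFT = {
    'Sq. Yards': 9.0,
    'Acres': 43560.0,
}

def parse_total_sqft(value):
    """Hypothetical helper: return the area in square feet, or NaN if unparseable."""
    text = str(value).strip()

    # Case 1: a plain number such as '1056'
    try:
        return float(text)
    except ValueError:
        pass

    # Case 2 (assumed format): a range such as '2100 - 2850', reduced to its midpoint
    parts = [p.strip() for p in text.split(' - ')]
    if len(parts) == 2:
        try:
            return (float(parts[0]) + float(parts[1])) / 2
        except ValueError:
            return np.nan

    # Case 3: a number followed by a known unit label, such as '2Acres'
    for unit, factor in UNIT_TO_SQFT.items():
        if text.endswith(unit):
            try:
                return float(text[:-len(unit)].strip()) * factor
            except ValueError:
                return np.nan

    # Anything else stays missing rather than being guessed
    return np.nan

print(parse_total_sqft('1056'))
print(parse_total_sqft('2100 - 2850'))
print(parse_total_sqft('100 Sq. Yards'))
```

It could be applied with `data['total_sqft'].apply(parse_total_sqft)` in place of `convert_to_numeric`. The metrics printed above would then need to be recomputed, because far more rows would survive the cleaning step.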


In [8]:
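# Convert 'size' to a numeric room count based on domain knowledge.
# Extracting the leading number handles every label ('2 BHK' -> 2.0, '4 Bedroom' -> 4.0);
# a fixed mapping of '1 BHK'..'3 BHK' would turn all other labels into NaN.
# Note: X was built earlier, so this column is not part of the features until X is rebuilt.
data['size'] = data['size'].str.extract(r'(\d+)', expand=False).astype(float)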


In [9]:
from sklearn.preprocessing import StandardScaler

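scaler = StandardScaler()
# Scale features. Note: the scaler is fitted on all rows here, so statistics from
# rows that later land in the test split leak into the scaling. Fitting on the
# training split only avoids this.
X_scaled = scaler.fit_transform(X)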


In [10]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.01, 0.1, 1]
}
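grid = GridSearchCV(SVR(kernel='rbf'), param_grid, scoring='neg_mean_squared_error', cv=5)
# Note: X_train is still the unscaled split from the earlier cell; the split on
# scaled features is only created in a later cell.
grid.fit(X_train, y_train)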

print("Best parameters:", grid.best_params_)
print("Best cross-validated score (MSE):", -grid.best_score_)


Best parameters: {'C': 10, 'gamma': 'auto'}
Best cross-validated score (MSE): 12692.806


In [11]:
# Split the dataset into training and testing sets with scaled features
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
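
One way to combine scaling, the train/test split, and the grid search without fitting the scaler on test rows is to wrap the scaler and the SVR in a scikit-learn `Pipeline`. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo`), not the housing dataset, and a smaller parameter grid than the one above, so its printed values say nothing about the real model.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the Bengaluru dataset): three features on
# very different scales and a noisy linear target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 3)) * np.array([1000.0, 1.0, 1.0])
y_demo = 0.05 * X_demo[:, 0] + 10.0 * X_demo[:, 1] + rng.normal(scale=5.0, size=120)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline refits the scaler inside every cross-validation fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svr', SVR(kernel='rbf')),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
param_grid = {
    'svr__C': [1, 10, 100],
    'svr__gamma': ['scale', 0.1],
}
search = GridSearchCV(pipe, param_grid, scoring='neg_mean_squared_error', cv=5)
search.fit(X_tr, y_tr)

print('Best parameters:', search.best_params_)
print('Best cross-validated MSE:', -search.best_score_)
print('Held-out R²:', search.best_estimator_.score(X_te, y_te))
```

Because the scaler is a pipeline step, it is fitted only on the training portion of each fold and then applied to the held-out portion, which keeps the cross-validated score free of leakage.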
