# Support Vector Machines

In [None]:
# Q1: In order to predict house price based on several characteristics, such as location, square footage, number of bedrooms, etc., you are developing an SVM regression model. Which regression metric in this situation would be the best to employ?
# For predicting house prices, the best regression metric to employ is Root Mean Squared Error (RMSE).
# RMSE penalizes large errors more heavily than small ones and is expressed in the same units as the target variable (here, currency), so it can be read as the typical size of a prediction error.
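# As a minimal sketch of the computation, RMSE can be worked out by hand. The prices below are hypothetical and are not taken from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical house prices in dollars (illustrative values only)
actual = [200_000, 350_000, 500_000]
predicted = [210_000, 340_000, 530_000]

# The result is in dollars, the same unit as the target
print(f"RMSE: {rmse(actual, predicted):,.2f} dollars")
```

# The single 30,000 miss dominates the result: the RMSE is roughly 19,149 dollars, above the mean absolute miss of about 16,667 dollars.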

In [None]:
# Q2: You have built an SVM regression model and are trying to decide between using MSE or R-squared as your evaluation metric. Which metric would be more appropriate if your goal is to predict the actual price of a house as accurately as possible?
# If your goal is to predict the actual price of a house as accurately as possible, Mean Squared Error (MSE) would be more appropriate. MSE directly measures the average squared difference between the predicted and actual prices, whereas R-squared only measures the share of variance explained, so a model can have a high R-squared and still miss prices by a large absolute amount.
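# The difference can be illustrated with a small sketch. The two "markets" below are hypothetical: every prediction is off by exactly 10,000 in both, so MSE is identical, yet R-squared differs sharply because it is relative to the spread of the actual prices.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical actual prices: one tightly clustered market, one widely spread
narrow = [190_000, 200_000, 210_000]
wide = [100_000, 300_000, 500_000]
errors = [10_000, -10_000, 10_000]  # the same absolute misses in both markets

narrow_pred = [a + e for a, e in zip(narrow, errors)]
wide_pred = [a + e for a, e in zip(wide, errors)]

for name, actual, pred in [("narrow", narrow, narrow_pred), ("wide", wide, wide_pred)]:
    print(f"{name}: MSE={mse(actual, pred):,.0f}  R2={r2(actual, pred):.3f}")
```

# Both markets have the same MSE, but R-squared is negative for the narrow market and close to 1 for the wide one.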

In [None]:
# Q3: You have a dataset with a significant number of outliers and are trying to select an appropriate regression metric to use with your SVM model. Which metric would be the most appropriate in this scenario?
# For a dataset with a significant number of outliers, Mean Absolute Error (MAE) is the most appropriate regression metric.
# MAE is less sensitive to outliers compared to MSE and RMSE because it does not square the errors, providing a more robust measure of prediction accuracy in the presence of outliers.
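# A minimal sketch of this effect, using made-up numbers: the two prediction sets below differ in a single value, which turns one ordinary error into an outlier.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


actual = [100, 100, 100, 100, 100]
clean = [110, 90, 110, 90, 110]         # every error has size 10
with_outlier = [110, 90, 110, 90, 300]  # the last error has size 200

for name, pred in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{name}: MAE={mae(actual, pred):.2f}  RMSE={rmse(actual, pred):.2f}")
```

# In this toy example the outlier raises MAE from 10 to 48, while RMSE rises from 10 to about 89.9, because the squared term lets one large error dominate.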

In [None]:
# Q4: You have built an SVM regression model using a polynomial kernel and are trying to select the best metric to evaluate its performance. You have calculated both MSE and RMSE and found that both values are very close. Which metric should you choose to use in this case?
# RMSE is the square root of MSE, so the two metrics always rank models in the same order; their values are close only when the error is near 1 (or 0) in the units of the target.
# Root Mean Squared Error (RMSE) is generally preferred for reporting because it is in the same units as the target variable, which makes the model's prediction error easier to interpret.
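# A short sketch of the relationship between the two metrics. The MSE values and model names below are hypothetical, chosen near 1 so that MSE and RMSE come out close to each other.

```python
import math

# Hypothetical MSE values for three candidate models (illustrative only)
mse_by_model = {"model_a": 1.21, "model_b": 0.81, "model_c": 1.00}

# RMSE is simply the square root of MSE
rmse_by_model = {name: math.sqrt(value) for name, value in mse_by_model.items()}

# The square root is monotonic, so both metrics give the same ranking
rank_by_mse = sorted(mse_by_model, key=mse_by_model.get)
rank_by_rmse = sorted(rmse_by_model, key=rmse_by_model.get)

for name in rank_by_mse:
    print(f"{name}: MSE={mse_by_model[name]:.2f}  RMSE={rmse_by_model[name]:.2f}")
print("Same ranking:", rank_by_mse == rank_by_rmse)
```

# Since the ranking is identical, the choice between the two comes down to interpretability, which favors RMSE.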

In [None]:
# Q5: You are comparing the performance of different SVM regression models using different kernels (linear, polynomial, and RBF) and are trying to select the best evaluation metric. Which metric would be most appropriate if your goal is to measure how well the model explains the variance in the target variable?
# If your goal is to measure how well the model explains the variance in the target variable, R-squared (R²) is the most appropriate evaluation metric.
# R-squared indicates the proportion of the variance in the dependent variable that is predictable from the independent variables, providing a measure of the goodness-of-fit of the model.
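# To make the "proportion of variance explained" reading concrete, here is a minimal sketch with made-up values. It compares three sets of predictions against the same targets: always predicting the mean, predicting perfectly, and predicting with small errors.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0, 5.0]

mean_baseline = [3.0] * 5             # always predict the mean of the targets
perfect = list(actual)                # predict every value exactly
close = [1.1, 1.9, 3.2, 3.8, 5.0]     # small errors around the true values

print(f"mean baseline: {r2(actual, mean_baseline):.2f}")
print(f"perfect:       {r2(actual, perfect):.2f}")
print(f"close:         {r2(actual, close):.2f}")
```

# A model that always predicts the mean scores 0, a perfect model scores 1, and a model that does worse than the mean baseline scores below 0.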

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# 1. Load the dataset
url = 'https://drive.google.com/uc?id=1Z9oLpmt6IDRNw7IeNcHYTGeJRYypRSC0'
data = pd.read_csv(url)

# Inspect the dataset
print(data.head())
print(data.info())

# Assume 'price' is the target variable and there might be categorical features

# Split the dataset into features and target
X = data.drop('price', axis=1)
y = data['price']

# Identify numeric and categorical columns from the features only,
# so the target column 'price' is not passed to the preprocessor
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

# Scale numeric features and one-hot encode categorical features;
# handle_unknown='ignore' avoids errors on categories not seen during training
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the SVM regression model within a pipeline
svr_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', SVR(kernel='rbf'))
])

# Train the model
svr_pipeline.fit(X_train, y_train)

# Predict the target values for the testing data
y_pred = svr_pipeline.predict(X_test)

# Evaluate the performance of the model
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # square root of MSE; the 'squared' argument is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f'MSE: {mse:.2f}')
print(f'RMSE: {rmse:.2f}')
print(f'R-squared: {r2:.2f}')
print(f'MAE: {mae:.2f}')

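# Tune hyperparameters using GridSearchCV
# (with no 'scoring' argument, candidates are ranked by the estimator's default score, R-squared for regressors)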
param_grid = {
    'regressor__C': [0.1, 1, 10, 100],
    'regressor__gamma': [1, 0.1, 0.01, 0.001],
    'regressor__epsilon': [0.1, 0.2, 0.5, 1.0]
}

grid_search = GridSearchCV(svr_pipeline, param_grid, refit=True, cv=5, verbose=2)
grid_search.fit(X_train, y_train)

# Print the best parameters found by GridSearchCV
print(f'Best parameters found: {grid_search.best_params_}')

# Refit the tuned model on the entire dataset before saving
# (after this step the test set is no longer unseen data for best_svr)
best_svr = grid_search.best_estimator_
best_svr.fit(X, y)

# Save the trained model to a file
joblib.dump(best_svr, 'best_svr_model.pkl')
print("Model saved to best_svr_model.pkl")

# Bonus: Visualize results (y_pred holds the predictions of the untuned baseline model above)
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual vs Predicted Prices")
plt.show()



# Explanation:

# Load the Dataset:
# The dataset is loaded from the provided link and stored in a DataFrame.

# Splitting the Dataset:
# The dataset is split into training and testing sets using train_test_split.

# Preprocessing:
# Numeric features are scaled using StandardScaler and categorical features are one-hot encoded using OneHotEncoder, both inside a ColumnTransformer.

# Training the SVM Model:
# An SVR model with an RBF kernel is placed in a Pipeline after the preprocessor and trained on the training data.

# Prediction:
# The trained model is used to predict the target values for the testing set.

# Evaluation:
# The model's performance is evaluated using MSE, RMSE, R-squared, and MAE.

# Hyperparameter Tuning:
# GridSearchCV is used to find the best values of C, gamma, and epsilon for the SVR model with 5-fold cross-validation.

# Training the Best Model:
# The best model found by GridSearchCV is retrained on the entire dataset.

# Saving the Model:
# The trained model is saved to a file using joblib.

# Visualization:
# A scatter plot is created to visualize the actual vs. predicted prices.
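
# A model saved with joblib can later be reloaded and used for prediction. The following is a minimal sketch under stated assumptions: it uses synthetic numeric data in place of the housing dataset, a simplified pipeline, and a temporary file path, all of which are illustrative and not part of the workflow above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the housing data (assumption: two numeric features)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(0, 0.5, size=100)

# Simplified pipeline: scaling followed by an RBF-kernel SVR
model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", SVR(kernel="rbf", C=10)),
])
model.fit(X_demo, y_demo)

# Save to a temporary directory so the sketch leaves no files behind,
# then load the model back into memory before the directory is removed
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "svr_demo_model.pkl")
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded pipeline applies the same scaling and regressor as the original
new_rows = np.array([[5.0, 5.0], [1.0, 9.0]])
original_pred = model.predict(new_rows)
reloaded_pred = reloaded.predict(new_rows)
print(reloaded_pred)
```

# Because the whole Pipeline is saved, the preprocessing travels with the model, so new rows can be passed in their raw form.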