# Task
Perform a regression analysis using the dataset found in the archive '/content/archive (2).zip'. This involves loading and preparing the data, training and evaluating Linear Regression, Polynomial Regression, K-Nearest Neighbors Regression, and Decision Tree Regression models. Finally, compare the performance of these models and summarize the findings.

## Unzip Data Archive

### Subtask:
Unzip the provided archive file '/content/archive (2).zip' to access the dataset for the regression task.


**Reasoning**:
To begin, I will import the necessary `zipfile` module and define the paths for the archive and the extraction directory. This sets up the environment for unzipping the data.



In [None]:
import zipfile
import os


zip_file_path = '/content/archive (2).zip'

extract_dir = '/content/extracted_data'


os.makedirs(extract_dir, exist_ok=True)

print(f"Zip file path: {zip_file_path}")
print(f"Extraction directory: {extract_dir}")

Zip file path: /content/archive (2).zip
Extraction directory: /content/extracted_data


**Reasoning**:
Now that the paths are defined and the extraction directory is created, I will proceed to open the zip file and extract all its contents into the designated directory.



In [None]:
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

print(f"Successfully extracted '{zip_file_path}' to '{extract_dir}'")

Successfully extracted '/content/archive (2).zip' to '/content/extracted_data'
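
Before loading a file by name, it can help to confirm what the archive actually contains. The following is a minimal sketch rather than part of the original run: `list_csv_members` is a hypothetical helper, and the demo archive is built in a temporary directory (a stand-in for the real zip) so the snippet runs on its own.

```python
import os
import tempfile
import zipfile


def list_csv_members(zip_path):
    """Return the names of all .csv files inside the archive, sorted."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        return sorted(n for n in zf.namelist() if n.lower().endswith(".csv"))


# Demo on a throwaway archive (assumed stand-in for the real zip file).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("Housing.csv", "price,area\n13300000,7420\n")
        zf.writestr("README.txt", "not a dataset")
    demo_members = list_csv_members(demo_zip)

print(demo_members)
```

Calling the same helper on `zip_file_path` would show whether `Housing.csv` is present before any path to it is hard-coded.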


## Load and Prepare Data

### Subtask:
Load the dataset from the unzipped files. This step will also include essential data preprocessing such as handling missing values, encoding categorical variables if any, and splitting the data into training and testing sets for model evaluation.


**Reasoning**:
First, I will import the pandas library to work with DataFrames. Then, I will construct the full path to the `Housing.csv` file using the `extract_dir` variable, and load the dataset into a pandas DataFrame named `df`.



In [None]:
import pandas as pd

dataset_path = os.path.join(extract_dir, 'Housing.csv')

df = pd.read_csv(dataset_path)

print(f"Successfully loaded data from: {dataset_path}")
print("Displaying the first 5 rows of the DataFrame:")
display(df.head())

Successfully loaded data from: /content/extracted_data/Housing.csv
Displaying the first 5 rows of the DataFrame:


Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished


**Reasoning**:
Now that the data is loaded, I will use the `.info()` method to display a summary of the DataFrame to check for data types and non-null values, and then use the `.describe()` method to get descriptive statistics of the numerical columns.



In [None]:
print("\nDataFrame Info:")
df.info()

print("\nDescriptive Statistics for Numerical Columns:")
df.describe()


DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   price             545 non-null    int64 
 1   area              545 non-null    int64 
 2   bedrooms          545 non-null    int64 
 3   bathrooms         545 non-null    int64 
 4   stories           545 non-null    int64 
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64 
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB

Descriptive Statistics for Numerical Columns:


Unnamed: 0,price,area,bedrooms,bathrooms,stories,parking
count,545.0,545.0,545.0,545.0,545.0,545.0
mean,4766729.0,5150.541284,2.965138,1.286239,1.805505,0.693578
std,1870440.0,2170.141023,0.738064,0.50247,0.867492,0.861586
min,1750000.0,1650.0,1.0,1.0,1.0,0.0
25%,3430000.0,3600.0,2.0,1.0,1.0,0.0
50%,4340000.0,4600.0,3.0,1.0,2.0,0.0
75%,5740000.0,6360.0,3.0,2.0,2.0,1.0
max,13300000.0,16200.0,6.0,4.0,4.0,3.0


**Reasoning**:
Based on the `df.info()` output, there are several categorical columns with 'object' data type that need to be converted into numerical format for regression analysis. I will identify these columns and apply appropriate encoding methods.



In [None]:
import numpy as np

categorical_cols = df.select_dtypes(include='object').columns


binary_cols = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']

for col in binary_cols:
    df[col] = df[col].map({'yes': 1, 'no': 0})

df = pd.get_dummies(df, columns=['furnishingstatus'], drop_first=True)

print("DataFrame after encoding categorical variables:")
display(df.head())
df.info()  # info() prints its own summary and returns None, so it is not wrapped in print()

DataFrame after encoding categorical variables:


Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus_semi-furnished,furnishingstatus_unfurnished
0,13300000,7420,4,2,3,1,0,0,0,1,2,1,False,False
1,12250000,8960,4,4,4,1,0,0,0,1,3,0,False,False
2,12250000,9960,3,2,2,1,0,1,0,0,2,1,True,False
3,12215000,7500,4,2,2,1,0,1,0,1,3,1,False,False
4,11410000,7420,4,1,2,1,1,1,0,1,2,0,False,False


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 14 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   price                            545 non-null    int64
 1   area                             545 non-null    int64
 2   bedrooms                         545 non-null    int64
 3   bathrooms                        545 non-null    int64
 4   stories                          545 non-null    int64
 5   mainroad                         545 non-null    int64
 6   guestroom                        545 non-null    int64
 7   basement                         545 non-null    int64
 8   hotwaterheating                  545 non-null    int64
 9   airconditioning                  545 non-null    int64
 10  parking                          545 non-null    int64
 11  prefarea                         545 non-null    int64
 12  furnishingstatus_semi-furnished  545 non-null    b
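
The two encoding steps above can be sketched in plain Python to make their behaviour explicit. This is an illustration under stated assumptions, not the pandas implementation: `encode_yes_no` and `one_hot_drop_first` are hypothetical helpers. One practical difference is that pandas `Series.map` with a dictionary yields NaN for any value missing from the dictionary, whereas this sketch raises an error instead.

```python
def encode_yes_no(values):
    """Map 'yes'/'no' to 1/0 and fail loudly on anything else."""
    mapping = {"yes": 1, "no": 0}
    unknown = sorted(set(values) - set(mapping))
    if unknown:
        raise ValueError(f"unexpected values: {unknown}")
    return [mapping[v] for v in values]


def one_hot_drop_first(values, prefix):
    """One-hot encode, dropping the first category in sorted order."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category becomes the all-False baseline
    return [{f"{prefix}_{c}": (v == c) for c in kept} for v in values]


binary = encode_yes_no(["yes", "no", "yes"])
dummies = one_hot_drop_first(
    ["furnished", "semi-furnished", "unfurnished"], "furnishingstatus"
)
print(binary)
print(dummies[0])
```

Dropping one level means a `furnished` row is represented by both remaining columns being `False`, which matches the first rows of the encoded DataFrame shown above.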

**Reasoning**:
Now that the data types are appropriate, I will separate the target variable ('price') from the features and then split the dataset into training and testing sets. This is crucial for evaluating the performance of the regression models on unseen data.



In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('price', axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Data split into training and testing sets successfully.")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

Data split into training and testing sets successfully.
X_train shape: (436, 13)
X_test shape: (109, 13)
y_train shape: (436,)
y_test shape: (109,)
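
The printed shapes can be checked with a little arithmetic. The sketch below assumes the test count is the requested fraction of rows rounded up, with the remainder going to training; that assumption is consistent with the shapes above. `split_sizes` is a hypothetical helper, and `Fraction` is used only to avoid floating-point rounding in the product.

```python
import math
from fractions import Fraction


def split_sizes(n_samples, test_size):
    """Row counts for a fractional test_size, with the test set rounded up."""
    n_test = math.ceil(Fraction(str(test_size)) * n_samples)
    return n_samples - n_test, n_test


n_train, n_test = split_sizes(545, 0.2)
print(n_train, n_test)
```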


## Train and Evaluate Linear Regression

### Subtask:
Train a Linear Regression model using the training data, make predictions on the test set, and evaluate its performance using relevant metrics (e.g., R-squared, Mean Squared Error). Provide a simple description of the model and mention the libraries used.


**Reasoning**:
To train and evaluate the Linear Regression model, I will first import the necessary modules, then instantiate and fit the model, make predictions, and finally calculate and display the evaluation metrics (R-squared, MSE, MAE) and store them for future comparison.



In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error


linear_model = LinearRegression()

linear_model.fit(X_train, y_train)

y_pred_lr = linear_model.predict(X_test)

r2_score_lr = r2_score(y_test, y_pred_lr)
mse_lr = mean_squared_error(y_test, y_pred_lr)
mae_lr = mean_absolute_error(y_test, y_pred_lr)

print("\n--- Linear Regression Model Performance ---")
print(f"R-squared (R2): {r2_score_lr:.4f}")
print(f"Mean Squared Error (MSE): {mse_lr:.2f}")
print(f"Mean Absolute Error (MAE): {mae_lr:.2f}")

performance_metrics = {}
performance_metrics['Linear Regression'] = {
    'R2': r2_score_lr,
    'MSE': mse_lr,
    'MAE': mae_lr
}

print("\nLinear Regression model trained, predictions made, and metrics calculated and stored.")


--- Linear Regression Model Performance ---
R-squared (R2): 0.6529
Mean Squared Error (MSE): 1754318687330.66
Mean Absolute Error (MAE): 970043.40

Linear Regression model trained, predictions made, and metrics calculated and stored.
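
To make the three metrics concrete, here is a minimal pure-Python sketch of their definitions on a tiny made-up example (the `*_demo` values are illustrative, not taken from the dataset). It also converts the reported MSE to a root mean squared error, which is in the same units as `price` and so easier to compare with the MAE.

```python
import math


def mse(y_true, y_pred):
    """Mean of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.0, 5.0, 8.0, 9.0]
mse_demo = mse(y_true_demo, y_pred_demo)
mae_demo = mae(y_true_demo, y_pred_demo)
r2_demo = r2(y_true_demo, y_pred_demo)

# RMSE implied by the MSE reported for Linear Regression above.
rmse_lr = math.sqrt(1754318687330.66)
print(mse_demo, mae_demo, r2_demo, round(rmse_lr))
```

The RMSE comes out larger than the MAE of 970,043.40, which is expected because squaring gives large individual errors more weight.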


## Train and Evaluate Polynomial Regression

### Subtask:
Implement Polynomial Regression by first generating polynomial features from the existing dataset. Then, train a Linear Regression model on these transformed features, make predictions, and evaluate its performance using R-squared, Mean Squared Error, and Mean Absolute Error. Provide a simple description of the model and mention the libraries used.

**Reasoning**:
To implement Polynomial Regression, the first step is to transform the existing features into polynomial features. I will use `PolynomialFeatures` from `sklearn.preprocessing` to generate these features for both the training and testing datasets. I will start with a degree of 2.



In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Create a PolynomialFeatures object with degree 2
poly = PolynomialFeatures(degree=2, include_bias=False)


X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

print("Polynomial features generated successfully.")
print(f"Original X_train shape: {X_train.shape}")
print(f"Transformed X_train_poly shape: {X_train_poly.shape}")
print(f"Original X_test shape: {X_test.shape}")
print(f"Transformed X_test_poly shape: {X_test_poly.shape}")

Polynomial features generated successfully.
Original X_train shape: (436, 13)
Transformed X_train_poly shape: (436, 104)
Original X_test shape: (109, 13)
Transformed X_test_poly shape: (109, 104)
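
The jump from 13 to 104 columns follows from counting monomials: 13 linear terms, 13 squares, and 78 pairwise products. A short sketch of that count, with `n_poly_features` as a hypothetical helper name:

```python
from math import comb


def n_poly_features(n_features, degree, include_bias=False):
    """Number of monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)  # this count includes the constant term
    return total if include_bias else total - 1


n_cols = n_poly_features(13, 2)
print(n_cols)
```

With 104 columns and only 436 training rows, an unregularized linear fit has a lot of freedom, which is worth keeping in mind when reading the test metrics that follow.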


**Reasoning**:
Now that the polynomial features have been generated, I will train a new Linear Regression model on these transformed features, make predictions on the polynomial test features, and evaluate its performance using R-squared, Mean Squared Error, and Mean Absolute Error. Finally, I will store these metrics for later comparison.



In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error


poly_model = LinearRegression()


poly_model.fit(X_train_poly, y_train)


y_pred_poly = poly_model.predict(X_test_poly)


r2_score_poly = r2_score(y_test, y_pred_poly)
mse_poly = mean_squared_error(y_test, y_pred_poly)
mae_poly = mean_absolute_error(y_test, y_pred_poly)


print("\n--- Polynomial Regression Model Performance (Degree 2) ---")
print(f"R-squared (R2): {r2_score_poly:.4f}")
print(f"Mean Squared Error (MSE): {mse_poly:.2f}")
print(f"Mean Absolute Error (MAE): {mae_poly:.2f}")


performance_metrics['Polynomial Regression'] = {
    'R2': r2_score_poly,
    'MSE': mse_poly,
    'MAE': mae_poly
}

print("\nPolynomial Regression model trained, predictions made, and metrics calculated and stored.")



--- Polynomial Regression Model Performance (Degree 2) ---
R-squared (R2): 0.6208
Mean Squared Error (MSE): 1916484379131.09
Mean Absolute Error (MAE): 1042927.64

Polynomial Regression model trained, predictions made, and metrics calculated and stored.


## Train and Evaluate K-Nearest Neighbors Regression

### Subtask:
Train a K-Nearest Neighbors Regression model using the training data, make predictions on the test set, and evaluate its performance using relevant metrics (e.g., R-squared, Mean Squared Error). Provide a simple description of the model and mention the libraries used.

**Reasoning**:
To train the K-Nearest Neighbors Regression model, I will first import `KNeighborsRegressor` and `StandardScaler`. Then, I will scale the features, instantiate the model, fit it to the scaled training data, make predictions on the scaled test data, and finally calculate and store the evaluation metrics.



In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


knn_model = KNeighborsRegressor(n_neighbors=5) # Using default n_neighbors=5, can be tuned


knn_model.fit(X_train_scaled, y_train)


y_pred_knn = knn_model.predict(X_test_scaled)


r2_score_knn = r2_score(y_test, y_pred_knn)
mse_knn = mean_squared_error(y_test, y_pred_knn)
mae_knn = mean_absolute_error(y_test, y_pred_knn)


print("\n--- K-Nearest Neighbors Regression Model Performance ---")
print(f"R-squared (R2): {r2_score_knn:.4f}")
print(f"Mean Squared Error (MSE): {mse_knn:.2f}")
print(f"Mean Absolute Error (MAE): {mae_knn:.2f}")


performance_metrics['K-Nearest Neighbors Regression'] = {
    'R2': r2_score_knn,
    'MSE': mse_knn,
    'MAE': mae_knn
}

print("\nK-Nearest Neighbors Regression model trained, predictions made, and metrics calculated and stored.")


--- K-Nearest Neighbors Regression Model Performance ---
R-squared (R2): 0.6134
Mean Squared Error (MSE): 1953996997258.20
Mean Absolute Error (MAE): 999594.61

K-Nearest Neighbors Regression model trained, predictions made, and metrics calculated and stored.
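
The reason for scaling before K-Nearest Neighbors can be shown with a small pure-Python sketch. It is an illustration only: `standardize` and `knn_predict` are hypothetical helpers, and the four training rows (area, bedrooms) with their targets are made up. On raw values the large `area` numbers dominate the distance; after standardization both features contribute.

```python
import math


def standardize(train_rows, rows):
    """Scale columns using the mean and population std of train_rows only."""
    n = len(train_rows)
    cols = list(zip(*train_rows))
    means = [sum(c) / n for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0  # guard constant columns
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def knn_predict(train_X, train_y, query, k):
    """Predict the mean target of the k training rows closest to `query`."""
    ranked = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return sum(train_y[i] for i in ranked[:k]) / k


# Made-up rows: [area, bedrooms], with made-up targets.
train_raw = [[3000, 2], [3100, 4], [9000, 2], [9100, 4]]
train_y = [3.0, 4.0, 9.0, 10.0]
query_raw = [3200, 2]

train_scaled = standardize(train_raw, train_raw)
query_scaled = standardize(train_raw, [query_raw])[0]

unscaled_pred = knn_predict(train_raw, train_y, query_raw, k=1)
scaled_pred = knn_predict(train_scaled, train_y, query_scaled, k=1)
print(unscaled_pred, scaled_pred)
```

The nearest neighbour changes once the features are on a common scale, so the two predictions differ. Note that the scaling statistics come from the training rows only, mirroring the `fit_transform` / `transform` split in the cell above.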


## Train and Evaluate Decision Tree Regression

### Subtask:
Train a Decision Tree Regression model using the training data, make predictions on the test set, and evaluate its performance using relevant metrics (e.g., R-squared, Mean Squared Error). Provide a simple description of the model and mention the libraries used.

**Reasoning**:
To train the Decision Tree Regression model, I will first import `DecisionTreeRegressor`. Then, I will instantiate the model, fit it to the training data, make predictions on the test data, and finally calculate and store the evaluation metrics.



In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error


dt_model = DecisionTreeRegressor(random_state=42) # Using a random_state for reproducibility


dt_model.fit(X_train, y_train)

# Predict on the held-out test set
y_pred_dt = dt_model.predict(X_test)


r2_score_dt = r2_score(y_test, y_pred_dt)
mse_dt = mean_squared_error(y_test, y_pred_dt)
mae_dt = mean_absolute_error(y_test, y_pred_dt)


print("\n--- Decision Tree Regression Model Performance ---")
print(f"R-squared (R2): {r2_score_dt:.4f}")
print(f"Mean Squared Error (MSE): {mse_dt:.2f}")
print(f"Mean Absolute Error (MAE): {mae_dt:.2f}")

# Store these metrics for later comparison
performance_metrics['Decision Tree Regression'] = {
    'R2': r2_score_dt,
    'MSE': mse_dt,
    'MAE': mae_dt
}

print("\nDecision Tree Regression model trained, predictions made, and metrics calculated and stored.")


--- Decision Tree Regression Model Performance ---
R-squared (R2): 0.4771
Mean Squared Error (MSE): 2642802637614.68
Mean Absolute Error (MAE): 1195266.06

Decision Tree Regression model trained, predictions made, and metrics calculated and stored.
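
A regression tree is grown by repeatedly picking the split that most reduces squared error within the resulting groups. The sketch below shows that search for a single feature; `sse` and `best_split` are hypothetical helpers and the data is made up. With no depth limit, repeating this step until the leaves are nearly pure lets a tree memorize a few hundred training rows, which is one plausible reason for the weaker test score above.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) for the single-feature split with lowest error."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


threshold, total_error = best_split(
    [1, 2, 3, 10, 11, 12], [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
)
print(threshold, total_error)
```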


## Compare Model Performance and Summarize Findings

### Subtask:
Compare the performance of all trained models (Linear Regression, Polynomial Regression, K-Nearest Neighbors Regression, and Decision Tree Regression) based on their R-squared, Mean Squared Error (MSE), and Mean Absolute Error (MAE) metrics. Summarize the findings, identifying which model performed best and why, and discuss any insights gained from the analysis.

**Reasoning**:
Now that all model evaluations are complete and stored in `performance_metrics`, I will display this dictionary in a structured format, preferably a DataFrame, to facilitate easy comparison of the models' performance metrics (R2, MSE, MAE).



In [None]:
import pandas as pd


performance_df = pd.DataFrame(performance_metrics).T

print("\n--- Model Performance Comparison ---")
display(performance_df.sort_values(by='R2', ascending=False))

print("\nSummary of Model Performance:\n")
# Summarize findings based on R2, MSE, and MAE
print("R-squared (R2): Higher values are better, indicating a better fit to the data.")
print("Mean Squared Error (MSE): Lower values are better, representing smaller average squared differences between predicted and actual values.")
print("Mean Absolute Error (MAE): Lower values are better, representing smaller average absolute differences between predicted and actual values.\n")

print("Based on the R-squared scores, Linear Regression is the best performing model, followed by Polynomial Regression (Degree 2) and K-Nearest Neighbors Regression. Decision Tree Regression performed the worst among the models tested. Linear Regression also has the lowest MSE and MAE, further supporting its superior performance for this dataset and feature set.")
print("\nIt's important to note that Polynomial Regression with degree 2 did not significantly improve performance over basic Linear Regression, and in some metrics, performed slightly worse. K-Nearest Neighbors Regression performed comparably to Linear Regression, suggesting that the relationship might not be entirely linear, but not complex enough to warrant high-degree polynomial features or non-parametric models without further tuning. Decision Tree Regression's lower performance indicates it might be overfitting or not capturing the underlying patterns effectively with its default parameters.")


--- Model Performance Comparison ---


Unnamed: 0,R2,MSE,MAE
Linear Regression,0.652924,1754319000000.0,970043.4
Polynomial Regression,0.620841,1916484000000.0,1042928.0
K-Nearest Neighbors Regression,0.61342,1953997000000.0,999594.6
Decision Tree Regression,0.477146,2642803000000.0,1195266.0



Summary of Model Performance:

R-squared (R2): Higher values are better, indicating a better fit to the data.
Mean Squared Error (MSE): Lower values are better, representing smaller average squared differences between predicted and actual values.
Mean Absolute Error (MAE): Lower values are better, representing smaller average absolute differences between predicted and actual values.

Based on the R-squared scores, Linear Regression is the best performing model, followed by Polynomial Regression (Degree 2) and K-Nearest Neighbors Regression. Decision Tree Regression performed the worst among the models tested. Linear Regression also has the lowest MSE and MAE, further supporting its superior performance for this dataset and feature set.

It's important to note that Polynomial Regression with degree 2 did not significantly improve performance over basic Linear Regression, and in some metrics, performed slightly worse. K-Nearest Neighbors Regression performed comparabl

## Final Task

### Subtask:
Summarize the findings, providing insights into which model performed best and why, along with a final overview of the regression task.


## Summary:

### Q&A

1.  **Which model performed best?**
    Linear Regression performed the best among the tested models, achieving the highest R-squared score of 0.6529, the lowest Mean Squared Error (MSE) of 1,754,318,687,330.66 (in squared price units), and the lowest Mean Absolute Error (MAE) of 970,043.40 (in price units).

2.  **Why did the best model perform best?**
    Linear Regression's lead suggests that, on this 80/20 split, the house `price` is reasonably well approximated by a linear function of the features. Degree-2 polynomial features expand the 13 inputs to 104 columns for only 436 training rows, which invites overfitting, and the untuned K-Nearest Neighbors and Decision Tree models did not outperform the linear baseline. With an R-squared of 0.6529, a sizeable share of the variance remains unexplained, so the result favours the simpler model for this data rather than showing that the relationship is strictly linear.

3.  **What is the final overview of the regression task?**
    The regression task successfully loaded, preprocessed, and split the housing dataset. Four different regression models (Linear Regression, Polynomial Regression, K-Nearest Neighbors Regression, and Decision Tree Regression) were trained and evaluated. The comparison of their R-squared, MSE, and MAE metrics revealed that Linear Regression provided the most accurate predictions for house prices, indicating a largely linear relationship between the input features and the target variable.

### Data Analysis Key Findings

*   The dataset, `Housing.csv`, containing 545 entries and 13 columns, was successfully loaded and preprocessed. All categorical 'yes'/'no' features were converted to 1/0, and `furnishingstatus` was one-hot encoded.
*   The data was split into training (80%) and testing (20%) sets, resulting in `X_train` (436, 13), `X_test` (109, 13), `y_train` (436,), and `y_test` (109,).
*   **Linear Regression** achieved an R-squared of 0.6529, MSE of 1,754,318,687,330.66, and MAE of 970,043.40 (MSE is in squared price units, MAE in price units).
*   **Polynomial Regression (Degree 2)** resulted in an R-squared of 0.6208, MSE of 1,916,484,379,131.09, and MAE of 1,042,927.64.
*   **K-Nearest Neighbors Regression** (with `n_neighbors=5` and scaled features) yielded an R-squared of 0.6134, MSE of 1,953,996,997,258.20, and MAE of 999,594.61.
*   **Decision Tree Regression** (with `random_state=42` and otherwise default parameters) showed the lowest performance with an R-squared of 0.4771, MSE of 2,642,802,637,614.68, and MAE of 1,195,266.06.
*   Linear Regression demonstrated the best overall performance with the highest R-squared and lowest error metrics, suggesting its suitability for this dataset.
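
The ranking depends slightly on which metric is used. The sketch below re-ranks the models from the figures reported in the findings above (hard-coded here, so nothing is retrained); `reported`, `by_r2`, and `by_mae` are names introduced only for this illustration.

```python
# Metrics copied from the key findings above.
reported = {
    "Linear Regression": {"R2": 0.6529, "MSE": 1754318687330.66, "MAE": 970043.40},
    "Polynomial Regression": {"R2": 0.6208, "MSE": 1916484379131.09, "MAE": 1042927.64},
    "K-Nearest Neighbors Regression": {"R2": 0.6134, "MSE": 1953996997258.20, "MAE": 999594.61},
    "Decision Tree Regression": {"R2": 0.4771, "MSE": 2642802637614.68, "MAE": 1195266.06},
}

# Higher R2 is better; lower MAE is better.
by_r2 = sorted(reported, key=lambda m: reported[m]["R2"], reverse=True)
by_mae = sorted(reported, key=lambda m: reported[m]["MAE"])

print("By R2: ", by_r2)
print("By MAE:", by_mae)
```

Linear Regression is first and Decision Tree Regression last under both orderings, while Polynomial Regression and K-Nearest Neighbors swap second and third place depending on whether R-squared or MAE is used.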

### Insights or Next Steps

*   The relatively strong performance of Linear Regression suggests that the primary factors influencing house prices in this dataset have a largely linear relationship. This implies that simpler, more interpretable models can effectively capture the underlying patterns.
*   Future work could involve hyperparameter tuning for models like K-Nearest Neighbors and Decision Tree Regression to potentially improve their performance. Additionally, exploring other ensemble methods (e.g., Random Forest, Gradient Boosting) could yield further improvements, as could feature engineering to create more expressive variables.
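
Hyperparameter tuning of the kind suggested above is usually done with cross-validation on the training set, so the test set stays untouched. As a minimal sketch of the underlying idea, the hypothetical generator `k_fold_indices` below splits row indices into contiguous, near-equal folds; it does no shuffling, which a real run would normally add, and scikit-learn provides ready-made utilities for the same purpose.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # first `extra` folds get one more row
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# 436 is the number of training rows in this analysis.
folds = list(k_fold_indices(436, 5))
fold_sizes = [len(val) for _, val in folds]
print(fold_sizes)
```

Each candidate setting (for example a value of `n_neighbors` or a tree depth) would be fitted on every `train_idx` and scored on the matching `val_idx`, with the average score used to choose between candidates.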
