## Exercise 6: Choosing the best performing model on a dataset

Instructions:

- Use the Dataset File to train your models
- Use the Test File to generate your predictions
- Use the Sample Submission File as the format for your submission
- Train every regression model covered below (KNN, SVM, Decision Tree, Random Forest)

Submit your results to:
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview



In [1]:
import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

## Dataset File

In [2]:
train_data = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/train.csv?raw=true'
df = pd.read_csv(train_data)

## Test File

In [3]:
test_url = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/test.csv?raw=true'
dt=pd.read_csv(test_url)

In [4]:
dt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1455 non-null   object 
 3   LotFrontage    1232 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   Alley          107 non-null    object 
 7   LotShape       1459 non-null   object 
 8   LandContour    1459 non-null   object 
 9   Utilities      1457 non-null   object 
 10  LotConfig      1459 non-null   object 
 11  LandSlope      1459 non-null   object 
 12  Neighborhood   1459 non-null   object 
 13  Condition1     1459 non-null   object 
 14  Condition2     1459 non-null   object 
 15  BldgType       1459 non-null   object 
 16  HouseStyle     1459 non-null   object 
 17  OverallQual    1459 non-null   int64  
 18  OverallC

## Sample Submission File

In [5]:
sample_submission_url ='https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/sample_submission.csv?raw=true'

sf=pd.read_csv(sample_submission_url)

In [6]:
sf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Id         1459 non-null   int64  
 1   SalePrice  1459 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 22.9 KB


## 1. Train a KNN Regressor

In [30]:
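knn = KNeighborsRegressor(n_neighbors=5)

# Separate the features from the target column
X_train = df.drop('SalePrice', axis=1)
y_train = df['SalePrice']

# One-hot encode the categorical columns of both sets
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(dt)

# Keep only the columns that exist in both the training and test sets
X_train, X_test = X_train.align(X_test, join='inner', axis=1, fill_value=0)

# Fill the remaining missing values with each column's mean
X_train = X_train.fillna(X_train.mean())
X_test = X_test.fillna(X_test.mean())

knn.fit(X_train, y_train)

# R^2 measured on the training data itself, so it is an optimistic estimate
knn_score = knn.score(X_train, y_train)
print(f"KNN Regressor score on training data: {knn_score}")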

KNN Regressor score on training data: 0.7741275649996193
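The encoding and alignment step above can be seen on a small scale. The sketch below uses two hypothetical toy frames (`toy_train` and `toy_test` are illustrative names, not part of the exercise data) to show that an inner join on the columns drops any dummy column that appears in only one of the two sets.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real train/test data.
toy_train = pd.DataFrame({"LotArea": [8450, 9600, 11250],
                          "Street": ["Pave", "Grvl", "Pave"]})
toy_test = pd.DataFrame({"LotArea": [11622, 14267],
                         "Street": ["Pave", "Pave"]})

# One-hot encode each frame independently, as in the cell above.
train_enc = pd.get_dummies(toy_train)
test_enc = pd.get_dummies(toy_test)

# join="inner" keeps only the columns present in BOTH frames,
# so "Street_Grvl" (seen only in the training frame) is dropped.
train_aligned, test_aligned = train_enc.align(test_enc, join="inner", axis=1)

aligned_columns = list(train_aligned.columns)
print(aligned_columns)
```

Dropping columns this way keeps both matrices the same width, at the cost of discarding categories the test set never shows.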


- Perform cross validation

In [26]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(knn, X_train, y_train, cv=5)


print("%0.2f mean R^2 with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

0.63 mean R^2 with a standard deviation of 0.05
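For a regressor, `cross_val_score` reports the coefficient of determination (R^2) by default, not accuracy. A minimal sketch of requesting an error metric instead is shown below; it uses synthetic data from `make_regression` as a stand-in (an assumption, since the real frames are loaded earlier in the notebook), and the `*_demo` names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption: not the house-price dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)
knn_demo = KNeighborsRegressor(n_neighbors=5)

# Default scoring for a regressor is R^2 (1.0 is a perfect fit).
r2_scores = cross_val_score(knn_demo, X_demo, y_demo, cv=5)

# scikit-learn maximizes scores, so RMSE is exposed as a negative value;
# flip the sign to read it as an error in the target's units.
rmse_scores = -cross_val_score(knn_demo, X_demo, y_demo, cv=5,
                               scoring="neg_root_mean_squared_error")

print("Mean R^2:  %0.2f" % r2_scores.mean())
print("Mean RMSE: %0.2f" % rmse_scores.mean())
```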


## 2. Train an SVM Regressor

In [32]:
from sklearn.svm import SVR

svr = SVR()
svr.fit(X_train, y_train)

svr_score = svr.score(X_train, y_train)
print(f"SVM Regressor score: {svr_score}")

SVM Regressor score: -0.050490596739066085


- Perform cross validation

In [36]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(svr, X_train, y_train, cv=5)


print("%0.2f mean R^2 with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

-0.05 mean R^2 with a standard deviation of 0.02
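A negative R^2 means the model does worse than always predicting the mean. One likely cause here is that `SVR` with default settings is sensitive to the scale of both the features and the target. The sketch below is one possible remedy, not the exercise's required approach: it standardizes the features in a pipeline and the target through `TransformedTargetRegressor`. The data is synthetic and all variable names are illustrative.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): features on very different
# scales and a target in the hundreds of thousands, like a sale price.
X_raw, y_raw = make_regression(n_samples=300, n_features=5,
                               noise=5.0, random_state=0)
X_raw = X_raw * np.array([1.0, 10.0, 100.0, 1_000.0, 10_000.0])
y_raw = 180_000 + 500 * y_raw

# Baseline: SVR with default settings on the raw values.
raw_score = SVR().fit(X_raw, y_raw).score(X_raw, y_raw)

# Scale the features inside a pipeline and the target via a wrapper,
# so predictions are returned in the original units.
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
)
scaled_score = scaled_svr.fit(X_raw, y_raw).score(X_raw, y_raw)

print(f"Raw SVR R^2:    {raw_score:.3f}")
print(f"Scaled SVR R^2: {scaled_score:.3f}")
```

Because the scalers live inside the estimator, the same object can be passed to `cross_val_score` without leaking statistics from the validation folds.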


## 3. Train a Decision Tree Regression

In [61]:

from sklearn.tree import DecisionTreeRegressor

dt_reg = DecisionTreeRegressor()
dt_reg.fit(X_train, y_train)

dt_reg_score = dt_reg.score(X_train, y_train)
print(f"Decision Tree Regression score: {dt_reg_score}")

Decision Tree Regression score: 1.0


- Perform cross validation

In [48]:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation for Decision Tree Regressor
dt_cv_scores = cross_val_score(dt_reg, X_train, y_train, cv=5)

print("Decision Tree Regression Mean cross-validation score:", dt_cv_scores.mean())

Decision Tree Regression Mean cross-validation score: 0.7171969737395139
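A training score of 1.0 next to a cross-validation score near 0.72 suggests the unrestricted tree memorizes the training rows. A minimal sketch of how limiting `max_depth` changes that gap follows; the data is synthetic and the depth of 4 is an arbitrary illustrative choice, not a tuned value.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the house-price dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=20.0, random_state=0)

tree_results = {}
for label, depth in (("unrestricted", None), ("max_depth=4", 4)):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    # R^2 on the same rows the tree was fitted on.
    train_r2 = tree.fit(X_demo, y_demo).score(X_demo, y_demo)
    # Mean R^2 over 5 held-out folds.
    cv_r2 = cross_val_score(tree, X_demo, y_demo, cv=5).mean()
    tree_results[label] = (train_r2, cv_r2)
    print(f"{label}: train R^2 = {train_r2:.3f}, CV R^2 = {cv_r2:.3f}")
```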


## 4. Train a Random Forest Regression

In [42]:
from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor(random_state=42)
rf_reg.fit(X_train, y_train)

rf_score = rf_reg.score(X_train, y_train)
print(f"Random Forest Regressor score: {rf_score}")

Random Forest Regressor score: 0.9799185158859299
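Unlike the previous sections, this one reports only the training score. Two ways to estimate performance on unseen data are sketched below, on synthetic stand-in data with illustrative names: 5-fold cross-validation, as used for the other models, and the forest's built-in out-of-bag estimate (`oob_score=True`).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the house-price dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=20.0, random_state=0)

rf_demo = RandomForestRegressor(n_estimators=100, oob_score=True,
                                random_state=42)
rf_demo.fit(X_demo, y_demo)

# R^2 on the training rows (optimistic).
train_r2 = rf_demo.score(X_demo, y_demo)
# R^2 from rows each tree did not see in its bootstrap sample.
oob_r2 = rf_demo.oob_score_
# Mean R^2 over 5 held-out folds.
rf_cv_scores = cross_val_score(rf_demo, X_demo, y_demo, cv=5)

print(f"Train R^2: {train_r2:.3f}")
print(f"OOB R^2:   {oob_r2:.3f}")
print(f"CV R^2:    {rf_cv_scores.mean():.3f} (std {rf_cv_scores.std():.3f})")
```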


## 5. Compare the performance of all regression models

In [45]:

# All four scores are R^2 values measured on the training data
print(f"KNN Regression score: {knn_score}")
print(f"SVM Regression score: {svr_score}")
print(f"Decision Tree Regression score: {dt_reg_score}")
print(f"Random Forest Regression score: {rf_score}")


KNN Regression score: 0.7741275649996193
SVM Regression score: -0.050490596739066085
Decision Tree Regression score: 1.0
Random Forest Regression score: 0.9799185158859299
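The scores above are all measured on the training data, which favors models that memorize it (the Decision Tree reaches 1.0). A fairer comparison might rank the models by cross-validated R^2 instead. The sketch below does this on synthetic stand-in data; the `models` dictionary and the `comparison` frame are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the house-price dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=20.0, random_state=0)

models = {
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "SVM": SVR(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Score every model with the same 5-fold cross-validation.
rows = []
for name, model in models.items():
    cv_scores = cross_val_score(model, X_demo, y_demo, cv=5)
    rows.append({"model": name,
                 "cv_r2_mean": cv_scores.mean(),
                 "cv_r2_std": cv_scores.std()})

# Rank from best to worst mean R^2.
comparison = (pd.DataFrame(rows)
              .sort_values("cv_r2_mean", ascending=False)
              .reset_index(drop=True))
best_model_name = comparison.loc[0, "model"]

print(comparison)
print(f"Best model by cross-validated R^2: {best_model_name}")
```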


## 6. Generate Submission File

Choose the model that has the best performance to generate a submission file.

In [50]:

test_ids = dt['Id']  # 'Id' column from the test data (dt)

# Predict on the preprocessed test data with the trained Random Forest model
y_pred = rf_reg.predict(X_test)

# Create a submission DataFrame
submission_df = pd.DataFrame({
    'Id': test_ids,
    'SalePrice': y_pred
})

# Save the submission DataFrame to a CSV file
submission_df.to_csv('submission_file.csv', index=False)
print("Submission file created: submission_file.csv")

Submission file created: submission_file.csv
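Before uploading, it may help to confirm that the file matches the sample submission's layout. The sketch below defines a hypothetical helper, `check_submission`, and runs it on small toy frames; in the notebook the same check could be applied to `submission_df` and `sf`.

```python
import pandas as pd

def check_submission(candidate, sample):
    """Return a list of format problems (empty if none were found)."""
    if list(candidate.columns) != list(sample.columns):
        return [f"columns {list(candidate.columns)} != {list(sample.columns)}"]
    problems = []
    if len(candidate) != len(sample):
        problems.append(f"row count {len(candidate)} != {len(sample)}")
    elif not candidate["Id"].reset_index(drop=True).equals(
            sample["Id"].reset_index(drop=True)):
        problems.append("Id values differ from the sample submission")
    if candidate["SalePrice"].isna().any():
        problems.append("SalePrice contains missing values")
    if (candidate["SalePrice"] <= 0).any():
        problems.append("SalePrice contains non-positive values")
    return problems

# Hypothetical toy frames standing in for sf and submission_df.
toy_sample = pd.DataFrame({"Id": [1461, 1462, 1463],
                           "SalePrice": [0.0, 0.0, 0.0]})
toy_good = pd.DataFrame({"Id": [1461, 1462, 1463],
                         "SalePrice": [125000.5, 158000.0, 182500.25]})
toy_bad = pd.DataFrame({"Id": [1461, 1462, 1463],
                        "SalePrice": [125000.5, None, 182500.25]})

good_problems = check_submission(toy_good, toy_sample)
bad_problems = check_submission(toy_bad, toy_sample)
print("good:", good_problems)
print("bad: ", bad_problems)
```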
