<a href="https://colab.research.google.com/github/csun0602/Che-Sun-Data-analysis-Projects/blob/main/Che_Sun_US_House_Sale_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# US house SalePrice prediction

#### 1. Data Preprocessing:
This involves preparing the data for the machine learning models.

Tasks:

- Handle missing data.
- Encode categorical variables.
- Scale numerical features.
- Split the data into training and validation sets.


#### 2. Model Training:
Using the preprocessed data, we'll train various regression models to predict the target variable, "SalePrice".

Tasks:

- Train a K-NN regression model.
- Train a Random Forest regression model.
- Train a Ridge regression model.
- Train a Lasso regression model.


#### 3. Model Evaluation:
After training, we need to evaluate the performance of each model on the validation set.

Tasks:

- Predict the target variable on the validation set using each trained model.
- Calculate performance metrics (e.g., Mean Absolute Error, Root Mean Squared Error) to evaluate and compare models.


#### 4. Model Prediction:
Once satisfied with a model's performance, we'll use it to predict "SalePrice" for the new data in prediction.csv.

Tasks:

- Use the chosen model to predict "SalePrice" on new data.

In [None]:
#Step1: Data Exploration: We'll start by taking a look at the training.csv data to understand its structure and the types of variables we have.
#Step2: Pre-processing: Based on our exploration, we'll handle missing values, encode categorical variables, and scale numerical ones if needed.
#Step3: Model Training: We'll split the training.csv into training and validation sets, train a model, and evaluate its performance on the validation set.
#Step4: Prediction: Once the model is trained and tuned, we'll use it to predict on the prediction.csv dataset.
#Step5: Result Sharing: We'll share the predictions for the prediction.csv dataset.


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Load the training data
train_data = pd.read_csv("training.csv")

# Display the first few rows of the training data
train_data.head()

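| | ID | SalePrice | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 208500 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal |
| 1 | 2 | 181500 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal |
| 2 | 3 | 223500 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal |
| 3 | 4 | 140000 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml |
| 4 | 5 | 250000 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal |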


## 0. Data Exploration

In [None]:
#Step1: Data exploration
train_data.head()



In [None]:
# Check the dimensions (rows, columns) of the training dataset.
train_data.shape

(1218, 81)

In [None]:
# Inspect the column names, non-null counts, and data types of the dataset
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1218 entries, 0 to 1217
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ID             1218 non-null   int64  
 1   SalePrice      1218 non-null   int64  
 2   MSSubClass     1218 non-null   int64  
 3   MSZoning       1218 non-null   object 
 4   LotFrontage    1005 non-null   float64
 5   LotArea        1218 non-null   int64  
 6   Street         1218 non-null   object 
 7   Alley          70 non-null     object 
 8   LotShape       1218 non-null   object 
 9   LandContour    1218 non-null   object 
 10  Utilities      1218 non-null   object 
 11  LotConfig      1218 non-null   object 
 12  LandSlope      1218 non-null   object 
 13  Neighborhood   1218 non-null   object 
 14  Condition1     1218 non-null   object 
 15  Condition2     1218 non-null   object 
 16  BldgType       1218 non-null   object 
 17  HouseStyle     1218 non-null   object 
 18  OverallQ

In [None]:
# Another way to look at the data type of each variable
train_data.dtypes

ID                 int64
SalePrice          int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
                  ...   
MiscVal            int64
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
Length: 81, dtype: object

In [None]:
# Check the number of unique values in each column
train_data.nunique()

ID               1218
SalePrice         595
MSSubClass         15
MSZoning            5
LotFrontage       108
                 ... 
MiscVal            18
MoSold             12
YrSold              5
SaleType            9
SaleCondition       6
Length: 81, dtype: int64

In [None]:
# Step 1.2: Understand the features

# Check the missing values
train_data.isnull().sum()
# Here we find that LotFrontage has 213 missing values

ID                 0
SalePrice          0
MSSubClass         0
MSZoning           0
LotFrontage      213
                ... 
MiscVal            0
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
Length: 81, dtype: int64
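
Because the printed summary above is truncated, it can help to list only the columns that actually contain missing values. The following is a minimal sketch on a small made-up frame; the values are illustrative stand-ins, not the real training data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (values are made up for illustration).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values per column, keep only columns with at least one,
# and sort so the most incomplete columns come first.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Applied to `train_data`, the same three lines would show every incomplete column at once instead of the abbreviated view.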

In [None]:
# Summary statistics for the numerical features
train_data[['LotFrontage','LotArea','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal']].describe()

| | LotFrontage | LotArea | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | ... | Fireplaces | GarageCars | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1005.0 | 1218.0 | 1210.0 | 1218.0 | 1218.0 | 1218.0 | 1218.0 | 1218.0 | 1218.0 | 1218.0 | ... | 1218.0 | 1218.0 | 1218.0 | 1218.0 | 1218.0 | 1218.0 | 1218.0 | 1218.0 | 1218.0 | 1218.0 |
| mean | 70.351244 | 10594.958128 | 104.208264 | 449.940066 | 46.332512 | 563.299672 | 1059.57225 | 1162.99179 | 340.642036 | 6.325123 | ... | 0.607553 | 1.758621 | 471.269294 | 94.66092 | 46.70936 | 21.395731 | 3.599343 | 14.895731 | 2.275041 | 47.031199 |
| std | 24.806884 | 10645.419474 | 181.143463 | 461.264267 | 161.882923 | 440.795202 | 443.002323 | 386.644612 | 434.249227 | 51.61194 | ... | 0.646341 | 0.742552 | 211.929888 | 123.978107 | 66.74154 | 60.863718 | 29.771748 | 54.494176 | 35.64115 | 534.201253 |
| min | 21.0 | 1300.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 372.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 60.0 | 7662.75 | 0.0 | 0.0 | 0.0 | 224.25 | 796.0 | 884.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 328.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 70.0 | 9458.5 | 0.0 | 392.0 | 0.0 | 466.5 | 1001.5 | 1088.0 | 0.0 | 0.0 | ... | 1.0 | 2.0 | 476.5 | 0.0 | 24.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 80.0 | 11622.75 | 166.0 | 716.0 | 0.0 | 798.75 | 1298.75 | 1389.75 | 727.75 | 0.0 | ... | 1.0 | 2.0 | 576.0 | 168.0 | 69.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 313.0 | 215245.0 | 1600.0 | 5644.0 | 1474.0 | 2336.0 | 6110.0 | 4692.0 | 2065.0 | 572.0 | ... | 3.0 | 4.0 | 1418.0 | 857.0 | 523.0 | 552.0 | 508.0 | 410.0 | 648.0 | 15500.0 |


In [None]:
# Summary statistics for the target variable
train_data['SalePrice'].describe()

count      1218.000000
mean     180527.586207
std       78540.534547
min       34900.000000
25%      130000.000000
50%      162950.000000
75%      213000.000000
max      745000.000000
Name: SalePrice, dtype: float64

## 1. Data Preprocessing:

In [None]:
# Step 1: Split the data into features (X) and target (y); X excludes SalePrice and the ID column
X = train_data.drop(columns=["SalePrice", "ID"])
y = train_data["SalePrice"]

In [None]:
# Step 2: Split the data into training and validation sets
X_train, X_tes, y_train, y_tes = train_test_split(X, y, test_size=0.2, random_state=42)
# 20% of the data is held out as the validation set; the remaining 80% is used for training.
# random_state=42 fixes the random seed so the split is reproducible.
# X_train and y_train: features and target for the training set.
# X_tes and y_tes: features and target for the held-out validation set.
# train_test_split is a scikit-learn function that shuffles the dataset
# and then splits it into training and validation subsets.
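
As a quick check of the split sizes, the sketch below runs `train_test_split` on dummy arrays with the same number of rows as the training file (1218). The arrays `X_dummy` and `y_dummy` are placeholders introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data with the same number of rows as the training file (1218).
X_dummy = np.arange(1218).reshape(-1, 1)
y_dummy = np.arange(1218)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_dummy, y_dummy, test_size=0.2, random_state=42
)

# The hold-out size is rounded up, so 20% of 1218 rows gives 244 rows
# and the remaining 974 rows are used for training.
print(X_tr.shape, X_val.shape)
```

Re-running with the same `random_state` reproduces the same rows in each subset, which keeps model comparisons on the hold-out set consistent.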

In [None]:
# Step 3: Identify numerical and categorical columns
numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()

In [None]:
# Step 4: Some categorical variables are stored as numbers. From inspecting the training data,
# MSSubClass, OverallQual, and OverallCond are categorical but encoded as integers, so move them to the categorical list.
categorical_as_numbers = ["MSSubClass", "OverallQual", "OverallCond"]
for col in categorical_as_numbers:
    if col in numerical_cols:
        numerical_cols.remove(col)
        categorical_cols.append(col)

In [None]:
# Step 5: Build preprocessing pipelines; fill missing numerical values with the column median
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# SimpleImputer fills in missing values (here with the column median or the most frequent category).
# OneHotEncoder encodes categorical variables: it creates one binary column per category
# and marks the presence of that category with 1 (and its absence with 0).
# handle_unknown='ignore' encodes categories not seen during fitting as all zeros.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# ColumnTransformer is initialized with a list of transformers,
#where each transformer is applied to a subset of the columns in the input data.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
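
To see what these transformers do, here is a minimal sketch on a two-column toy frame. The frames `train_toy` and `new_toy` and their values are made up for illustration; only the transformer configuration mirrors the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training frame with one missing value in each column.
train_toy = pd.DataFrame({
    "LotArea": [8000.0, 10000.0, np.nan, 12000.0],
    "Street": ["Pave", "Pave", "Grvl", np.nan],
})
# New row whose category ("Dirt") was never seen during fitting.
new_toy = pd.DataFrame({"LotArea": [9000.0], "Street": ["Dirt"]})

num_pipe = Pipeline([("imputer", SimpleImputer(strategy="median")),
                     ("scaler", StandardScaler())])
cat_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                     ("onehot", OneHotEncoder(handle_unknown="ignore"))])
toy_pre = ColumnTransformer([("num", num_pipe, ["LotArea"]),
                             ("cat", cat_pipe, ["Street"])])


def to_dense(matrix):
    """Return a dense NumPy array whether the input is sparse or dense."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


train_out = to_dense(toy_pre.fit_transform(train_toy))
new_out = to_dense(toy_pre.transform(new_toy))

print(train_out.shape)  # one scaled numeric column + one column per Street category
print(new_out)          # the unseen category becomes all zeros in the one-hot part
```

The imputation statistics, scaling parameters, and category list are all learned in `fit_transform` and only reused in `transform`, which is why the validation data is passed through `transform` alone.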

In [None]:
# Step 6: Apply preprocessing to the data
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_tes_preprocessed = preprocessor.transform(X_tes)

X_train_preprocessed.shape, X_tes_preprocessed.shape

# print(X_train_preprocessed.shape, X_tes_preprocessed.shape)

((974, 311), (244, 311))

## 2. Model training

### Model1: KNN

In [None]:
# KNN
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Instantiate the model
# The parameter 'n_neighbors' determines how many neighbors will be used; you can adjust this based on validation performance.
knn_model = KNeighborsRegressor(n_neighbors=5)
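
One way to choose `n_neighbors` is to compare cross-validated error for a few candidate values. The sketch below does this on synthetic data from `make_regression`; the data, the candidate list, and the variable names are assumptions for illustration, not results from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the preprocessed features.
X_knn, y_knn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

candidate_k = [1, 3, 5, 10, 20]
cv_rmse = {}
for k in candidate_k:
    # scikit-learn returns negated RMSE so that larger is better; flip the sign back.
    scores = cross_val_score(
        KNeighborsRegressor(n_neighbors=k), X_knn, y_knn,
        cv=5, scoring="neg_root_mean_squared_error",
    )
    cv_rmse[k] = -scores.mean()

best_k = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse)
print("best k:", best_k)
```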


In [None]:
knn_model.fit(X_train_preprocessed, y_train)

In [None]:
# Predict on test set
y_pred_knn = knn_model.predict(X_tes_preprocessed)

# Evaluate the model on the held-out validation data.
mae_knn = mean_absolute_error(y_tes, y_pred_knn)
rmse_knn = np.sqrt(mean_squared_error(y_tes, y_pred_knn))

print(f"K-NN Regression MAE: {mae_knn}")
print(f"K-NN Regression RMSE: {rmse_knn}")


K-NN Regression MAE: 22759.318032786883
K-NN Regression RMSE: 39989.030984898185
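
For reference, MAE is the mean of the absolute errors and RMSE is the square root of the mean squared error, so RMSE weights large errors more heavily. A small worked example with made-up numbers (not from the dataset) shows both computed by hand and with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

errors = y_hat - y_true                      # [10, -10, 30]
mae_manual = np.mean(np.abs(errors))         # (10 + 10 + 30) / 3
rmse_manual = np.sqrt(np.mean(errors ** 2))  # sqrt((100 + 100 + 900) / 3)

# The scikit-learn helpers give the same numbers.
mae_sklearn = mean_absolute_error(y_true, y_hat)
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(mae_manual, rmse_manual)
```

A gap between RMSE and MAE as large as the one reported for K-NN suggests that a few houses are predicted with much larger errors than the rest.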


### Model2: Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Assuming X_train_preprocessed, y_train, X_tes_preprocessed, and y_tes are already defined from the previous steps

# Initialize the Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train_preprocessed, y_train)

# Predict on the validation set
y_pred_rf = rf.predict(X_tes_preprocessed)

# Calculate and print RMSE for Random Forest
rmse_rf = np.sqrt(mean_squared_error(y_tes, y_pred_rf))
print(f"Random Forest RMSE on Validation Set: {rmse_rf:.2f}")

Random Forest RMSE on Validation Set: 29424.16


### Model3: Ridge regression

In [None]:
from sklearn.linear_model import Ridge

# Instantiate the Ridge regression model
# The parameter 'alpha' controls the strength of the regularization; you can adjust this based on validation performance.
ridge_model = Ridge(alpha=1.0)


In [None]:
# Train ridge model:
ridge_model.fit(X_train_preprocessed, y_train)

In [None]:
# Predict on test set
y_pred_ridge = ridge_model.predict(X_tes_preprocessed)

# Evaluate the model
mae_ridge = mean_absolute_error(y_tes, y_pred_ridge)
rmse_ridge = np.sqrt(mean_squared_error(y_tes, y_pred_ridge))

print(f"Ridge Regression MAE: {mae_ridge}")
print(f"Ridge Regression RMSE: {rmse_ridge}")


Ridge Regression MAE: 18451.419644828646
Ridge Regression RMSE: 27516.40215158025


### Model4: Lasso regression

In [None]:
from sklearn.linear_model import Lasso

# Instantiate the Lasso regression model
lasso_model = Lasso(alpha=1.0)

In [None]:
lasso_model.fit(X_train_preprocessed, y_train)

  model = cd_fast.sparse_enet_coordinate_descent(
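
The truncated line above looks like the tail of a scikit-learn `ConvergenceWarning` raised by the coordinate descent solver. If that is the case, raising `max_iter` is one common remedy. The sketch below, on synthetic data with a hypothetical helper `count_convergence_warnings`, shows how to check whether a fit raised the warning.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import Lasso

# Synthetic, well-conditioned data (an assumption for illustration).
rng = np.random.default_rng(0)
X_conv = rng.normal(size=(200, 10))
y_conv = X_conv @ rng.normal(size=10) + rng.normal(scale=0.1, size=200)


def count_convergence_warnings(max_iter):
    """Fit a Lasso model and return how many ConvergenceWarnings were raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        Lasso(alpha=0.1, max_iter=max_iter).fit(X_conv, y_conv)
    return sum(issubclass(w.category, ConvergenceWarning) for w in caught)


print(count_convergence_warnings(max_iter=10_000))
```

On the house-price data, a similar check with a larger `max_iter` (the default is 1000) would show whether the solver then converges.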


In [None]:
# Predict on test set
y_pred_lasso = lasso_model.predict(X_tes_preprocessed)

# Evaluate the model
mae_lasso = mean_absolute_error(y_tes, y_pred_lasso)
rmse_lasso = np.sqrt(mean_squared_error(y_tes, y_pred_lasso))

print(f"Lasso Regression MAE: {mae_lasso}")
print(f"Lasso Regression RMSE: {rmse_lasso}")

Lasso Regression MAE: 17305.860243708936
Lasso Regression RMSE: 24842.301781391492
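
Unlike Ridge, Lasso can set coefficients exactly to zero, which matters here because one-hot encoding produces several hundred columns. The sketch below counts non-zero coefficients for a few `alpha` values on synthetic data; the data and alpha values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Only 5 of the 30 synthetic features carry signal.
X_lasso, y_lasso = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=5.0, random_state=0
)

nonzero_counts = {}
for alpha in [0.1, 1.0, 10.0, 1e6]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_lasso, y_lasso)
    # Count how many features the model actually keeps.
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))

print(nonzero_counts)
```

With a very large `alpha` every coefficient is driven to zero and the model predicts only the intercept.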


##### Q3. Comparing the four models, Ridge and Lasso regression have the smallest RMSE, so we focus on these two. Lasso has the smallest RMSE (24842.30), so we choose Lasso regression to predict SalePrice in the later steps.

#### Next, we tune the regularization parameter (alpha) by cross-validation, selecting alpha from [0.001, 0.01, 0.1, 1, 10, 100].


### Hyperparameter tuning for Ridge/Lasso regression

#### Ridge Regression with Cross-Validation:

In [None]:
from sklearn.linear_model import RidgeCV

# Define alphas for RidgeCV
alphas_ridge = [0.001, 0.01, 0.1, 1, 10, 100]

# Instantiate the RidgeCV model
ridge_cv_model = RidgeCV(alphas=alphas_ridge, store_cv_values=True)

In [None]:
ridge_cv_model.fit(X_train_preprocessed, y_train)

In [None]:
best_alpha_ridge = ridge_cv_model.alpha_

# Predict on validation set
y_pred_ridge_cv = ridge_cv_model.predict(X_tes_preprocessed)

# Evaluate the model
mae_ridge_cv = mean_absolute_error(y_tes, y_pred_ridge_cv)
rmse_ridge_cv = np.sqrt(mean_squared_error(y_tes, y_pred_ridge_cv))

print(f"Optimal alpha for Ridge Regression: {best_alpha_ridge}")
print(f"Ridge Regression (with CV) MAE: {mae_ridge_cv}")
print(f"Ridge Regression (with CV) RMSE: {rmse_ridge_cv}")

Optimal alpha for Ridge Regression: 10.0
Ridge Regression (with CV) MAE: 18707.734918896906
Ridge Regression (with CV) RMSE: 29416.680909183702


In [None]:
# Therefore, the optimal alpha for Ridge regression is 10
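
`RidgeCV` with its default `cv=None` uses an efficient form of leave-one-out cross-validation. An alternative way to search the same grid with explicit 5-fold cross-validation is `GridSearchCV`, sketched here on synthetic data (the data and variable names are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the preprocessed training set.
X_grid, y_grid = make_regression(n_samples=150, n_features=20, noise=20.0, random_state=0)

alpha_grid = [0.001, 0.01, 0.1, 1, 10, 100]
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": alpha_grid},
    cv=5,
    scoring="neg_root_mean_squared_error",  # negated so that larger is better
)
search.fit(X_grid, y_grid)

best_alpha = search.best_params_["alpha"]
best_cv_rmse = -search.best_score_
print(best_alpha, best_cv_rmse)
```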

#### Lasso Regression with Cross-Validation:

In [None]:
from sklearn.linear_model import LassoCV

# Define alphas for LassoCV
alphas_lasso = [0.001, 0.01, 0.1, 1, 10, 100]

# Instantiate the LassoCV model
lasso_cv_model = LassoCV(alphas=alphas_lasso, cv=5)

In [None]:
lasso_cv_model.fit(X_train_preprocessed, y_train)

  model = cd_fast.sparse_enet_coordinate_descent(


In [None]:
best_alpha_lasso = lasso_cv_model.alpha_

# Predict on validation set
y_pred_lasso_cv = lasso_cv_model.predict(X_tes_preprocessed)

# Evaluate the model
mae_lasso_cv = mean_absolute_error(y_tes, y_pred_lasso_cv)
rmse_lasso_cv = np.sqrt(mean_squared_error(y_tes, y_pred_lasso_cv))

print(f"Optimal alpha for Lasso Regression: {best_alpha_lasso}")
print(f"Lasso Regression (with CV) MAE: {mae_lasso_cv}")
print(f"Lasso Regression (with CV) RMSE: {rmse_lasso_cv}")

Optimal alpha for Lasso Regression: 100.0
Lasso Regression (with CV) MAE: 16675.135363173333
Lasso Regression (with CV) RMSE: 25342.432214828874


In [None]:
# Therefore, the optimal alpha for Lasso regression is 100
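
`LassoCV` keeps the per-fold errors in `mse_path_`, which makes it possible to see how close the candidate alphas are to each other rather than only the winner. A minimal sketch on synthetic data (the data is an assumption for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data standing in for the preprocessed training set.
X_cv, y_cv = make_regression(
    n_samples=150, n_features=20, n_informative=5, noise=20.0, random_state=0
)

alpha_grid = [0.001, 0.01, 0.1, 1, 10, 100]
cv_model = LassoCV(alphas=alpha_grid, cv=5, max_iter=10_000).fit(X_cv, y_cv)

# mse_path_ has one row per alpha (in the order of cv_model.alphas_)
# and one column per fold.
mean_mse = cv_model.mse_path_.mean(axis=1)
for alpha, mse in zip(cv_model.alphas_, mean_mse):
    print(f"alpha={alpha:g}  sqrt(mean CV MSE)={np.sqrt(mse):.2f}")
print("selected alpha:", cv_model.alpha_)
```

When the selected alpha sits at the edge of the grid, as 100 does here for the house-price data, extending the grid in that direction may be worth trying.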

### Retrain model

#### Ridge regression

In [None]:
ridge_optimal = Ridge(alpha=10)
ridge_optimal.fit(X_train_preprocessed, y_train)

In [None]:
# For Ridge
y_pred_test_ridge = ridge_optimal.predict(X_tes_preprocessed)
mae_test_ridge = mean_absolute_error(y_tes, y_pred_test_ridge)
rmse_test_ridge = np.sqrt(mean_squared_error(y_tes, y_pred_test_ridge))

# Printing the evaluation metrics
print("Ridge Regression Evaluation:")
print(f"Mean Absolute Error: {mae_test_ridge:.2f}")
print(f"Root Mean Squared Error: {rmse_test_ridge:.2f}")

# The RMSE of the retrained Ridge model is 29411.47.

Ridge Regression Evaluation:
Mean Absolute Error: 18706.82
Root Mean Squared Error: 29411.47


#### Lasso regression

In [None]:
from sklearn.linear_model import Lasso

lasso_optimal = Lasso(alpha=100)
lasso_optimal.fit(X_train_preprocessed, y_train)

In [None]:
y_pred_test_lasso = lasso_optimal.predict(X_tes_preprocessed)
mae_test_lasso = mean_absolute_error(y_tes, y_pred_test_lasso)
rmse_test_lasso = np.sqrt(mean_squared_error(y_tes, y_pred_test_lasso))

print("Lasso Regression Evaluation:")
print(f"Mean Absolute Error: {mae_test_lasso:.2f}")
print(f"Root Mean Squared Error: {rmse_test_lasso:.2f}")

# We find the RMSE is 25342.43

Lasso Regression Evaluation:
Mean Absolute Error: 16675.14
Root Mean Squared Error: 25342.43


In [None]:
# After retraining Ridge and Lasso with the cross-validated alphas, the hold-out RMSE is higher than with the default alpha=1.0
# (Ridge: 29411.47 vs 27516.40; Lasso: 25342.43 vs 24842.30). The alpha chosen by cross-validation on the training data
# does not necessarily minimize the error on this particular hold-out split, so I decided to use the original models.
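
To make the comparison easier to read, the hold-out RMSE values reported in the outputs above can be collected in one table and sorted. The numbers below are copied from those outputs (rounded to two decimals); the labels are shorthand introduced here.

```python
import pandas as pd

# Hold-out RMSE values copied from the printed outputs of the earlier cells.
results = pd.DataFrame(
    [
        ("K-NN (k=5)", 39989.03),
        ("Random Forest", 29424.16),
        ("Ridge (alpha=1)", 27516.40),
        ("Lasso (alpha=1)", 24842.30),
        ("Ridge (alpha=10)", 29411.47),
        ("Lasso (alpha=100)", 25342.43),
    ],
    columns=["model", "rmse"],
).sort_values("rmse").reset_index(drop=True)

print(results)
```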

## Predict Sale Price

### Use Lasso regression to predict

In [None]:
# We choose Lasso regression for prediction because the Lasso model had the smallest RMSE in the previous evaluation

# Upload data
prediction_data = pd.read_csv("prediction.csv")

In [None]:
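# Rebuild the features and target from the full training data.
# ID is a row identifier rather than a predictor, so it is dropped together with SalePrice, as in Step 1.
y = train_data['SalePrice']
X = train_data.drop(columns=['SalePrice', 'ID'])
# Rename "Id" to "ID" in the prediction data so the identifier column matches the training data.
prediction_data = prediction_data.rename(columns={"Id": "ID"})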

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso

# Identify numerical and categorical columns
numerical_cols = [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64']]
categorical_cols = [cname for cname in X.columns if X[cname].dtype == "object"]

# Preprocessing pipeline
numerical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(handle_unknown='ignore'))])
# Combine both transformers so the prediction data is preprocessed the same way as the training data
preprocessor = ColumnTransformer(transformers=[('num', numerical_transformer, numerical_cols), ('cat', categorical_transformer, categorical_cols)])
# Use the original Lasso model with alpha=1.0, as decided above.
# Note: Lasso() without arguments always uses the default alpha=1.0, so alpha must be passed explicitly for a different value to take effect.
lasso_model = Lasso(alpha=1.0)
# Define and fit the pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', lasso_model)])

# Train the Lasso model
pipeline.fit(X, y)
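
Bundling the preprocessor and the model in one `Pipeline` means raw rows can be passed directly to `predict`, with imputation, scaling, and encoding applied automatically using the statistics learned during `fit`. A minimal sketch on a made-up toy frame (all names and values here are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data with one missing numeric value.
toy_X = pd.DataFrame({
    "GrLivArea": [1000.0, 1500.0, 2000.0, 2500.0, np.nan, 1800.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Grvl", "Pave"],
})
toy_y = pd.Series([100000, 150000, 190000, 250000, 170000, 180000])

toy_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                          ("scaler", StandardScaler())]), ["GrLivArea"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Street"]),
    ])),
    ("model", Lasso(alpha=1.0)),
])
toy_pipeline.fit(toy_X, toy_y)

# Raw, unprocessed rows (a missing value and an unseen category) go straight into predict().
toy_new = pd.DataFrame({"GrLivArea": [1600.0, np.nan], "Street": ["Pave", "Dirt"]})
toy_pred = toy_pipeline.predict(toy_new)
print(toy_pred)
```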

### Prediction

In [None]:
# Predict on the prediction dataset
predicted_saleprice = pipeline.predict(prediction_data)

In [None]:
# Create submission file
sample_submission = pd.read_csv("sample_submission.csv")
sample_submission['SalePrice'] = predicted_saleprice
sample_submission.to_csv("Prediction SalePrice.csv", index=False)
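
An alternative that does not rely on the row order of a separate sample file is to build the submission directly from the ID column of the prediction data. The sketch below uses made-up stand-in values for the IDs and predictions.

```python
import numpy as np
import pandas as pd

# Stand-ins for prediction_data["ID"] and the model output (made-up values).
ids = pd.Series([1219, 1220, 1221], name="ID")
preds = np.array([181234.5, 142000.0, 215500.25])

# Guard against a length mismatch before pairing IDs with predictions.
if len(ids) != len(preds):
    raise ValueError("Number of predictions does not match number of IDs")

submission = pd.DataFrame({"ID": ids.to_numpy(), "SalePrice": preds})

# to_csv without a path returns the CSV text; pass a filename to write a file instead.
csv_text = submission.to_csv(index=False)
print(csv_text)
```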

#### Finally, the file Prediction SalePrice.csv contains the predicted SalePrice for IDs 1219 to 1300.