# House Price Prediction 🏡 

## About the Dataset :

* SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
* MSSubClass: The building class
* MSZoning: The general zoning classification
* LotFrontage: Linear feet of street connected to property
* LotArea: Lot size in square feet
* Street: Type of road access
* Alley: Type of alley access
* LotShape: General shape of property
* LandContour: Flatness of the property
* Utilities: Type of utilities available
* LotConfig: Lot configuration
* LandSlope: Slope of property
* Neighborhood: Physical locations within Ames city limits
* Condition1: Proximity to main road or railroad
* Condition2: Proximity to main road or railroad (if a second is present)
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* YearRemodAdd: Remodel date
* RoofStyle: Type of roof
* RoofMatl: Roof material
* Exterior1st: Exterior covering on house
* Exterior2nd: Exterior covering on house (if more than one material)
* MasVnrType: Masonry veneer type
* MasVnrArea: Masonry veneer area in square feet
* ExterQual: Exterior material quality
* ExterCond: Present condition of the material on the exterior
* Foundation: Type of foundation
* BsmtQual: Height of the basement
* BsmtCond: General condition of the basement
* BsmtExposure: Walkout or garden level basement walls
* BsmtFinType1: Quality of basement finished area
* BsmtFinSF1: Type 1 finished square feet
* BsmtFinType2: Quality of second finished area (if present)
* BsmtFinSF2: Type 2 finished square feet
* BsmtUnfSF: Unfinished square feet of basement area
* TotalBsmtSF: Total square feet of basement area
* Heating: Type of heating
* HeatingQC: Heating quality and condition
* CentralAir: Central air conditioning
* Electrical: Electrical system
* 1stFlrSF: First Floor square feet
* 2ndFlrSF: Second floor square feet
* LowQualFinSF: Low quality finished square feet (all floors)
* GrLivArea: Above grade (ground) living area square feet
* BsmtFullBath: Basement full bathrooms
* BsmtHalfBath: Basement half bathrooms
* FullBath: Full bathrooms above grade
* HalfBath: Half baths above grade
* BedroomAbvGr: Number of bedrooms above basement level
* KitchenAbvGr: Number of kitchens
* KitchenQual: Kitchen quality
* TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
* Functional: Home functionality rating
* Fireplaces: Number of fireplaces
* FireplaceQu: Fireplace quality
* GarageType: Garage location
* GarageYrBlt: Year garage was built
* GarageFinish: Interior finish of the garage
* GarageCars: Size of garage in car capacity
* GarageArea: Size of garage in square feet
* GarageQual: Garage quality
* GarageCond: Garage condition
* PavedDrive: Paved driveway
* WoodDeckSF: Wood deck area in square feet
* OpenPorchSF: Open porch area in square feet
* EnclosedPorch: Enclosed porch area in square feet
* 3SsnPorch: Three season porch area in square feet
* ScreenPorch: Screen porch area in square feet
* PoolArea: Pool area in square feet
* PoolQC: Pool quality
* Fence: Fence quality
* MiscFeature: Miscellaneous feature not covered in other categories
* MiscVal: Value of miscellaneous feature
* MoSold: Month Sold
* YrSold: Year Sold
* SaleType: Type of sale
* SaleCondition: Condition of sale

# 1. Import Necessary Libraries :

In [73]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import missingno as msno

# 2. Import DataSet :

In [74]:
df=pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

# 3. Data Preprocessing :

In [75]:
y = df.SalePrice
X = df.drop(columns=["SalePrice"])

In [76]:
X.head()

### The Id feature is just a unique row identifier, so it carries no information for learning. Let's remove it.

In [77]:
X.drop('Id',axis=1,inplace=True,errors='ignore')

In [78]:
plt.figure(figsize = (8,6))
ax = X.dtypes.value_counts().plot(kind='bar',grid = False,fontsize=20,color='blue')
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+ p.get_width() / 2., height + 1, height, ha = 'center', size = 25)
sns.despine()

Observation:

There are 37 numerical features and 43 object (string) features. Among the numeric features, 34 are int and 3 are float. The split between int and float is worth a closer look, so let's check some more.
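One common cause of an int/float split, offered here as an assumption rather than something this notebook verifies, is missing data: pandas stores NaN as a float, so a numeric column with gaps is read as float64. The toy frame below uses made-up values (not rows from train.csv) to sketch the effect and to list the numeric columns that contain gaps.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, not taken from train.csv
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],       # no gaps, stays an integer column
    "LotFrontage": [65, np.nan, 68],      # one gap, becomes float64
})

print(toy.dtypes)

# Numeric columns that contain at least one missing value
numeric = toy.select_dtypes(include="number")
cols_with_nan = numeric.columns[numeric.isna().any()].tolist()
print(cols_with_nan)
```

Running the same two lines on `X` instead of `toy` would show whether the float columns in this dataset are the ones with missing values.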

In [79]:
# Categorical columns, listed by hand (duplicate entries removed).
# Note: "Alley" is not in this list, so it falls into num_cols and is dropped later.
cat_cols = ["MSZoning", "Street", "LotShape", "LotConfig", "Neighborhood", "Condition1", "Condition2", "HouseStyle",
            "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd", "MasVnrType", "ExterQual", "ExterCond", "HeatingQC",
            "Foundation", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "Heating", "CentralAir",
            "Electrical", "KitchenQual", "Functional", "FireplaceQu", "GarageFinish", "GarageType",
            "GarageQual", "GarageCond", "PavedDrive", "PoolQC", "Fence", "MiscFeature", "SaleType", "SaleCondition",
            "LandContour", "Utilities", "BldgType", "LandSlope"]
num_cols = [col for col in X.columns if col not in cat_cols]

In [80]:
df_num_cols = X[num_cols]
df_num_cols.head()

In [81]:
# Check which numeric features are correlated with each other and could affect our model.
f, ax = plt.subplots(figsize=(20,10))
# Restrict to numeric dtypes so a non-numeric column left in num_cols cannot break corr()
sub_sample_corr = df_num_cols.select_dtypes(include='number').corr()
sns.heatmap(sub_sample_corr, cmap='coolwarm_r', ax=ax)
ax.set_title('Correlations Between Columns', fontsize=14)
plt.show()

In [82]:
corr_X = df_num_cols.select_dtypes(include='number').corr()
len(corr_X)

In [83]:
# Print every feature pair whose absolute correlation exceeds 0.6
for i in range(len(corr_X) - 1):
    for j in range(i + 1, len(corr_X)):
        if abs(corr_X.iloc[i, j]) > 0.6:
            print(corr_X.iloc[i, j], i, j, corr_X.index[i], corr_X.index[j])
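The pair search can also be written without explicit index loops. The sketch below is one possible variant, not the notebook's own method: `high_corr_pairs` is a hypothetical helper name, the 0.6 threshold mirrors the loop, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.6):
    """Return (col_a, col_b, corr) for column pairs with |corr| above threshold."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the strict upper triangle: each pair once, diagonal skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) compare as False and are filtered out here
    pairs = pairs[pairs.abs() > threshold]
    return [(a, b, float(v)) for (a, b), v in pairs.items()]

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "GarageCars": [1, 2, 2, 3, 3],
    "GarageArea": [250, 480, 500, 720, 760],
    "MoSold": [3, 2, 3, 2, 3],
})

result = high_corr_pairs(toy)
for a, b, v in result:
    print(f"{a} ~ {b}: {v:.2f}")
```

Returning column names instead of positional indices makes it easier to decide which member of each pair to discard.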

## Remove the features which are highly correlated with each other

In [84]:
# Based on the above information, we further discard the features GarageCars, GarageYrBlt,
# BsmtFullBath, 2ndFlrSF, 1stFlrSF, TotRmsAbvGrd and MasVnrArea
F_cols = [i for i in df_num_cols.columns if i not in ['GarageCars', 'GarageYrBlt', 'BsmtFullBath', '2ndFlrSF', '1stFlrSF',
                                                       'TotRmsAbvGrd', 'MasVnrArea']]

In [85]:
# Copy so that later column drops do not act on a view of df_num_cols
X_final = df_num_cols[F_cols].copy()

In [86]:
X_final.head()

## Missing data

In [87]:
X_final.isna().sum()

In [88]:
msno.matrix(X_final)

In [None]:
# Drop the remaining columns that contain missing values
X_final = X_final.drop(columns=['LotFrontage', 'Alley'], errors='ignore')
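The columns to drop are hard-coded. A possible alternative, assuming the goal is to remove columns by their share of missing values, is to compute that share and select by a cutoff. The helper name `missing_share`, the 0.15 cutoff, and the toy values are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

def missing_share(frame):
    """Fraction of missing values per column, largest first; complete columns omitted."""
    share = frame.isna().mean()
    return share[share > 0].sort_values(ascending=False)

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, "Pave", np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

share = missing_share(toy)
print(share)

threshold = 0.15  # illustrative cutoff, an assumption
to_drop = share[share > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)
print(reduced.columns.tolist())
```

Imputing the gaps instead of dropping the columns is another option, but it is outside what this notebook does.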

## Regression Using Machine Learning

In [92]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

In [93]:
X_train, X_val, y_train, y_val = train_test_split(X_final, y, test_size=0.2, random_state=1)

In [94]:
rmse_ = []
r_2 = []
method = []

In [95]:
# Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
predictions = lin_reg.predict(X_val)

# scikit-learn metrics expect the arguments in the order (y_true, y_pred)
r_squared = r2_score(y_val, predictions)
r_2.append(r_squared)
print("R2 Score:", r_squared)
rmse = np.sqrt(mean_squared_error(y_val, predictions))
print("RMSE:", rmse)
method.append('Linear Regression')
rmse_.append(rmse)
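A note on argument order: scikit-learn's `r2_score` has the signature `r2_score(y_true, y_pred)`, and unlike the mean squared error it is not symmetric, because the denominator is the variance of whichever array is passed first. The small example below uses made-up numbers to show that swapping the arguments changes the score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([150.0, 200.0, 250.0, 300.0])

correct = r2_score(y_true, y_pred)   # true values first
swapped = r2_score(y_pred, y_true)   # arguments reversed

print("R2 (y_true, y_pred):", correct)
print("R2 (y_pred, y_true):", swapped)
```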

In [96]:
# support vector regression
from sklearn.svm import SVR
svr = SVR(C=1000000)
svr.fit(X_train, y_train)
predictions = svr.predict(X_val)

r_squared = r2_score(y_val, predictions)
r_2.append(r_squared)
print("R2 Score:", r_squared)
method.append('SVM')
rmse = np.sqrt(mean_squared_error(y_val, predictions))
print("RMSE:", rmse)
rmse_.append(rmse)

In [97]:
#Random forest regressor
random_forest = RandomForestRegressor(n_estimators=10)
random_forest.fit(X_train, y_train)
predictions = random_forest.predict(X_val)

r_squared = r2_score(y_val, predictions)
r_2.append(r_squared)
print("R2 Score:", r_squared)
method.append('Random Forest Regressor')
rmse = np.sqrt(mean_squared_error(y_val, predictions))
print("RMSE:", rmse)
rmse_.append(rmse)

In [98]:
# xgboost
xgb = XGBRegressor(n_estimators=1000, learning_rate=0.01)
xgb.fit(X_train, y_train)
predictions = xgb.predict(X_val)

r_squared = r2_score(y_val, predictions)
r_2.append(r_squared)
print("R2 Score:", r_squared)
method.append('XGBoost Regressor')
rmse = np.sqrt(mean_squared_error(y_val, predictions))
print("RMSE:", rmse)
rmse_.append(rmse)
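The four model cells repeat the same fit, predict, and score steps. A minimal sketch of how that could be factored out is shown below. `evaluate` is a hypothetical helper, and the data is synthetic (generated with NumPy) so the block runs on its own; it is not the house price data, and only two of the four models are included.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate(model, X_train, y_train, X_val, y_val):
    """Fit the model and return (R2, RMSE) on the validation split."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)
    r2 = r2_score(y_val, predictions)
    rmse = float(np.sqrt(mean_squared_error(y_val, predictions)))
    return r2, rmse

# Synthetic regression data, an assumption made only to keep the sketch runnable
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)
Xtr, Xva, ytr, yva = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=10, random_state=1),
}
results = {name: evaluate(model, Xtr, ytr, Xva, yva) for name, model in models.items()}

for name, (r2, rmse) in results.items():
    print(f"{name}: R2={r2:.3f}, RMSE={rmse:.3f}")
```

Keeping the scores in one dictionary keyed by model name avoids the three parallel lists (`method`, `r_2`, `rmse_`) drifting out of sync.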

In [99]:
# Compare performances of models using rmse
plt.barh(method, rmse_)
plt.title('RMSE comparison of models')

In [100]:
# Compare performances of models using r_2
plt.barh(method, r_2)
plt.title('R2 comparison of models')

## Test data :

In [132]:
test_df=pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [133]:
ids = test_df['Id']
test_df.drop('Id',axis=1,inplace=True,errors='ignore')
test_df_num_cols = test_df[num_cols]

In [136]:
# Copy so that later column drops do not act on a view of test_df_num_cols
X_test_final = test_df_num_cols[F_cols].copy()

In [None]:
# Apply the same column drops as for the training features
X_test_final = X_test_final.drop(columns=['LotFrontage', 'Alley'], errors='ignore')
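The test frame has to end up with the same feature columns, in the same order, as the frame the model was trained on. Repeating each drop by hand works, but one hedged alternative is to reuse the training frame's column index directly. The frames below are toy stand-ins with made-up values.

```python
import pandas as pd

# Toy stand-ins for the final training features and the raw test frame
train_features = pd.DataFrame({"LotArea": [8450, 9600], "OverallQual": [7, 6]})
test_raw = pd.DataFrame({
    "Id": [1461, 1462],
    "OverallQual": [5, 6],
    "LotArea": [11622, 14267],
    "Alley": [None, "Pave"],
})

# Select and order the test columns exactly as in training;
# a column missing from the test frame would appear filled with NaN
test_features = test_raw.reindex(columns=train_features.columns)

print(test_features.columns.tolist())
```

In this notebook the equivalent call would be `test_df.reindex(columns=X_final.columns)`, assuming `X_final` is the frame passed to `train_test_split`.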

In [141]:
# Submission using XGBoost

preds = xgb.predict(X_test_final)
submission = pd.DataFrame({'Id': ids, 'SalePrice': preds})
submission.to_csv('submission.csv',index=False)

In [143]:
submission