<a href="https://colab.research.google.com/github/Wezz-git/AI-samples/blob/main/Regression_(House_Price_Prediction).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The Business Problem:

You're a data scientist at a real estate company. The sales team needs a tool that can instantly estimate a house's sale price based on its features (like square footage, number of bedrooms, neighborhood, etc.).

In [11]:
import pandas as pd

# 1 - Load the data

In [12]:
# Get House Prices - Advanced Regression Techniques (train.csv) from Kaggle
file_name = '/content/train.csv'

In [13]:
df = pd.read_csv(file_name)

In [14]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [15]:
df.tail()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125
1459,1460,20,RL,75.0,9937,Pave,,Reg,Lvl,AllPub,...,0,,,,0,6,2008,WD,Normal,147500


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

# 2 - Simplify the columns

In [17]:
# Create a new DataFrame called df_small
# that keeps only 11 columns: 10 features plus the target (SalePrice)

df_small = df[['GrLivArea', 'OverallQual', 'YearBuilt', 'TotalBsmtSF',
               'GarageCars', 'FullBath', 'Neighborhood', 'ExterQual',
               'KitchenQual', 'BsmtQual', 'SalePrice']]

print(df_small.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   GrLivArea     1460 non-null   int64 
 1   OverallQual   1460 non-null   int64 
 2   YearBuilt     1460 non-null   int64 
 3   TotalBsmtSF   1460 non-null   int64 
 4   GarageCars    1460 non-null   int64 
 5   FullBath      1460 non-null   int64 
 6   Neighborhood  1460 non-null   object
 7   ExterQual     1460 non-null   object
 8   KitchenQual   1460 non-null   object
 9   BsmtQual      1423 non-null   object
 10  SalePrice     1460 non-null   int64 
dtypes: int64(7), object(4)
memory usage: 125.6+ KB
None
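
The `info()` output above shows `BsmtQual` with 1423 non-null entries out of 1460. A quick way to list missing-value counts directly is `isna().sum()`. The sketch below uses a small made-up frame (`toy` and its values are illustrative assumptions, not rows from `train.csv`).

```python
import pandas as pd

# Hypothetical toy data standing in for df_small; the values are made up.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717],
    "BsmtQual": ["Gd", None, "TA", None],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Count missing values per column, then keep only columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0]

print(missing)
```

Run against `df_small`, the same two lines would be expected to report only `BsmtQual`, matching the 37 missing entries implied by the `info()` output.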


# 3 - Clean the data by imputing missing values

In [18]:
# Fill the NaN values in the 'BsmtQual' column with the string "None"
# using the .fillna() method.
# 'BsmtQual' has 1423 non-null entries (37 missing); after filling, it has 1460.
# Work on an explicit copy and assign the result back. The chained form
# df_small['BsmtQual'].fillna("None", inplace=True) triggers a chained-assignment
# FutureWarning and a SettingWithCopyWarning, and pandas warns that it will
# stop working in pandas 3.0.

df_small = df_small.copy()
df_small['BsmtQual'] = df_small['BsmtQual'].fillna("None")

print(df_small.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   GrLivArea     1460 non-null   int64 
 1   OverallQual   1460 non-null   int64 
 2   YearBuilt     1460 non-null   int64 
 3   TotalBsmtSF   1460 non-null   int64 
 4   GarageCars    1460 non-null   int64 
 5   FullBath      1460 non-null   int64 
 6   Neighborhood  1460 non-null   object
 7   ExterQual     1460 non-null   object
 8   KitchenQual   1460 non-null   object
 9   BsmtQual      1460 non-null   object
 10  SalePrice     1460 non-null   int64 
dtypes: int64(7), object(4)
memory usage: 125.6+ KB
None
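
A minimal sketch of filling missing values without a chained `inplace=True` call, using the form pandas recommends: `df[col] = df[col].fillna(value)`. The frame and the name `subset` are made up for illustration; only the pattern carries over to `df_small`.

```python
import pandas as pd

# Hypothetical toy frame; the real notebook loads df from train.csv.
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "BsmtQual": ["Gd", None, "TA"],
    "SalePrice": [208500, 181500, 223500],
})

# .copy() makes the column subset an independent DataFrame, so later
# assignments are not treated as writes to a slice of df.
subset = df[["BsmtQual", "SalePrice"]].copy()

# Assign the filled column back instead of calling
# subset["BsmtQual"].fillna("None", inplace=True).
subset["BsmtQual"] = subset["BsmtQual"].fillna("None")

print(subset["BsmtQual"].tolist())
print(df["BsmtQual"].isna().sum())  # the original frame still has its NaN
```

Filling with the string `"None"` keeps "missing" as its own category, which then becomes a separate indicator column once the text features are encoded.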


# 4 - Preprocess text features using one-hot encoding

In [19]:
df_processed = pd.get_dummies(df_small)

print(df_processed.head())
print(df_processed.info())

   GrLivArea  OverallQual  YearBuilt  TotalBsmtSF  GarageCars  FullBath  \
0       1710            7       2003          856           2         2   
1       1262            6       1976         1262           2         2   
2       1786            7       2001          920           2         2   
3       1717            7       1915          756           3         1   
4       2198            8       2000         1145           3         2   

   SalePrice  Neighborhood_Blmngtn  Neighborhood_Blueste  Neighborhood_BrDale  \
0     208500                 False                 False                False   
1     181500                 False                 False                False   
2     223500                 False                 False                False   
3     140000                 False                 False                False   
4     250000                 False                 False                False   

   ...  ExterQual_TA  KitchenQual_Ex  KitchenQual_Fa  KitchenQ
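
To see what `pd.get_dummies` does on a scale that fits on screen, here is a small sketch with a made-up two-column frame (the `toy` data is an assumption for illustration). Numeric columns pass through unchanged, while each distinct text value becomes its own indicator column named `<column>_<value>`.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one text column.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786],
    "KitchenQual": ["Gd", "TA", "Gd"],
})

# Text columns are expanded; numeric columns are left as they are.
encoded = pd.get_dummies(toy)

# Convert the indicator columns to 0/1 integers.
encoded = encoded.astype(int)

print(encoded.columns.tolist())
print(encoded)
```

Passing `dtype=int` to `pd.get_dummies` is another way to get 0/1 columns without the separate `astype` step.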

# 5 - Split data into X (features) and y (target)

In [20]:
# Time to split the data into X and y
# y (the target - the column we want to predict) = SalePrice
# X (the features - everything used to make the prediction) = df_processed without SalePrice

# 1 - y target which is SalePrice
y = df_processed['SalePrice']

# 2 - X is everything except SalePrice
# Use the .drop() method to remove the column

X = df_processed.drop('SalePrice', axis=1)
# Equivalent alternative: X = df_processed.drop(columns=['SalePrice'])

# 3 - Convert True/False to 1/0
X = X.astype(int)

print("-- Features (X) --")
print(X.head())

print("\n-- Target (y) --")
print(y.head())


-- Features (X) --
   GrLivArea  OverallQual  YearBuilt  TotalBsmtSF  GarageCars  FullBath  \
0       1710            7       2003          856           2         2   
1       1262            6       1976         1262           2         2   
2       1786            7       2001          920           2         2   
3       1717            7       1915          756           3         1   
4       2198            8       2000         1145           3         2   

   Neighborhood_Blmngtn  Neighborhood_Blueste  Neighborhood_BrDale  \
0                     0                     0                    0   
1                     0                     0                    0   
2                     0                     0                    0   
3                     0                     0                    0   
4                     0                     0                    0   

   Neighborhood_BrkSide  ...  ExterQual_TA  KitchenQual_Ex  KitchenQual_Fa  \
0                     0  ...   

# 6 - Train baseline model to get a starting score

In [21]:
# Training the model
# 1 - import the necessary tools

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# 2 - Split the data
# random_state fixes the shuffle so the split is reproducible (40 is used here instead of the usual 42, as an experiment)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

# 3 - create the model

model = LinearRegression()

# 4 - Train the model

print("Training the Linear Regression model..")
model.fit(X_train, y_train)
print("Training Done")

# 5 - make predictions on the test data

predictions = model.predict(X_test)

# 6 - Score the model
# Use Mean Squared Error, then take the square root (RMSE).
# RMSE is a very common metric. It is in the same units as the target (dollars)
# and answers "roughly how many dollars is a typical prediction off by?",
# with large misses weighted more heavily than in a plain average.

mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)

print(f"Model RMSE: ${rmse:,.2f}")

# How the format spec in {rmse:,.2f} works:
# rmse - the variable to insert.
# :    - the colon separates the expression from the formatting instructions.
# ,    - adds a comma as the thousands separator, turning 1225000 into 1,225,000.
# .2f  - formats the number as fixed-point (a decimal) rounded to 2 decimal places.

Training the Linear Regression model..
Training Done
Model RMSE: $32,436.40
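
RMSE is not quite the plain average error: squaring before averaging makes large misses count more. The sketch below computes both RMSE and mean absolute error (MAE) by hand on four made-up prices (the numbers are assumptions chosen for illustration) to show how one large miss pulls RMSE above MAE.

```python
import math

# Made-up actual and predicted prices, for illustration only.
actual = [200_000, 150_000, 300_000, 250_000]
predicted = [210_000, 140_000, 300_000, 190_000]

# Signed errors: 10000, -10000, 0, -60000
errors = [p - a for p, a in zip(predicted, actual)]

# MAE: the plain average of the absolute errors.
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square each error, average, then take the square root.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE:  ${mae:,.2f}")
print(f"RMSE: ${rmse:,.2f}")
```

Here the single $60,000 miss lifts RMSE to about $30,822 while MAE stays at $20,000, so a model with a few large errors looks worse under RMSE than under MAE.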


# 7 - Train an advanced model and show quantitatively that it is better

In [22]:
# Next, use RandomForestRegressor, a more advanced model that is often more powerful for prediction.

# 1 - import the models

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import numpy as np

# ( assume 'X' and 'y' follow from the previous cell)
# 2 - split the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

# 3 - Create the Advanced model
# n_estimators=100 means the forest builds 100 decision trees
# random_state=40 makes the result reproducible

pro_model = RandomForestRegressor(n_estimators=100, random_state=40)

# 4 - train the model
print("training for advanced RandomForest model..")

pro_model.fit(X_train, y_train)
print("Training Done")

# 5 - make predictions on the test data

pro_predictions = pro_model.predict(X_test)

# 6 - score the model
mse = mean_squared_error(y_test, pro_predictions)
rmse = np.sqrt(mse)

print(f"Model RMSE: ${rmse:,.2f}")


training for advanced RandomForest model..
Training Done
Model RMSE: $28,677.85


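The Overall Business Analysis

1 - Baseline (Linear Regression) RMSE: $32,436.40

2 - Advanced (Random Forest) RMSE: $28,677.85

Switching models lowered the test RMSE by $3,758.55, about 11.6% of the baseline error.

This is a clear win on this train/test split. The RandomForest model likely picked up non-linear patterns in the data (such as how Neighborhood and OverallQual interact) that the simpler LinearRegression model could not represent. Because the comparison uses a single random split, repeating it with other splits or with cross-validation would show how stable the gap is.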
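
The comparison is simple arithmetic on the two RMSE values printed by the training cells. A short sketch, with the two figures copied from those outputs (the variable names are illustrative):

```python
# RMSE values copied from the two model outputs in this notebook.
baseline_rmse = 32436.40   # Linear Regression
forest_rmse = 28677.85     # Random Forest

# Absolute and relative reduction in RMSE.
improvement = baseline_rmse - forest_rmse
relative = improvement / baseline_rmse

print(f"Absolute improvement: ${improvement:,.2f}")
print(f"Relative improvement: {relative:.1%}")
```

Reporting the relative figure alongside the dollar amount makes the result easier to compare if the models are later retrained on a different split.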