|Feature|Type|Description|
|---|---|---|
|SalePrice|Numeric(Continuous)|The property's sale price in dollars. This is the target variable to predict.|
|MSSubClass|Numeric(Discrete)|The building class|
|MSZoning|Object(Categorical)|The general zoning classification|
|LotFrontage|Numeric(Continuous)|Linear feet of street connected to property|
|LotArea|Numeric(Continuous)|Lot size in square feet|
|Street|Object(Categorical)|Type of road access|
|Alley|Object(Categorical)|Type of alley access|
|LotShape|Object(Categorical)|General shape of property|
|LandContour|Object(Categorical)|Flatness of the property|
|Utilities|Object(Categorical)|Type of utilities available|
|LotConfig|Object(Categorical)|Lot configuration|
|LandSlope|Object(Categorical)|Slope of property|
|Neighborhood|Object(Categorical)|Physical locations within Ames city limits|
|Condition1|Object(Categorical)|Proximity to main road or railroad|
|Condition2|Object(Categorical)|Proximity to main road or railroad (if a second is present)|
|BldgType|Object(Categorical)|Type of dwelling|
|HouseStyle|Object(Categorical)|Style of dwelling|
|OverallQual|Numeric(Discrete)|Overall material and finish quality|
|OverallCond|Numeric(Discrete)|Overall condition rating|
|YearBuilt|Numeric(Discrete)|Original construction date|
|YearRemodAdd|Numeric(Discrete)|Remodel date|
|RoofStyle|Object(Categorical)|Type of roof|
|RoofMatl|Object(Categorical)|Roof material|
|Exterior1st|Object(Categorical)|Exterior covering on house|
|Exterior2nd|Object(Categorical)|Exterior covering on house (if more than one material)|
|MasVnrType|Object(Categorical)|Masonry veneer type|
|MasVnrArea|Numeric(Continuous)|Masonry veneer area in square feet|
|ExterQual|Object(Categorical)|Exterior material quality|
|ExterCond|Object(Categorical)|Present condition of the material on the exterior|
|Foundation|Object(Categorical)|Type of foundation|
|BsmtQual|Object(Categorical)|Height of the basement|
|BsmtCond|Object(Categorical)|General condition of the basement|
|BsmtExposure|Object(Categorical)|Walkout or garden level basement walls|
|BsmtFinType1|Object(Categorical)|Quality of basement finished area|
|BsmtFinSF1|Numeric(Continuous)|Type 1 finished square feet|
|BsmtFinType2|Object(Categorical)|Quality of second finished area (if present)|
|BsmtFinSF2|Numeric(Continuous)|Type 2 finished square feet|
|BsmtUnfSF|Numeric(Continuous)|Unfinished square feet of basement area|
|TotalBsmtSF|Numeric(Continuous)|Total square feet of basement area|
|Heating|Object(Categorical)|Type of heating|
|HeatingQC|Object(Categorical)|Heating quality and condition|
|CentralAir|Object(Categorical)|Central air conditioning|
|Electrical|Object(Categorical)|Electrical system|
|1stFlrSF|Numeric(Continuous)|First floor square feet|
|2ndFlrSF|Numeric(Continuous)|Second floor square feet|
|LowQualFinSF|Numeric(Continuous)|Low quality finished square feet (all floors)|
|GrLivArea|Numeric(Continuous)|Above grade (ground) living area square feet|
|BsmtFullBath|Numeric(Discrete)|Basement full bathrooms|
|BsmtHalfBath|Numeric(Discrete)|Basement half bathrooms|
|FullBath|Numeric(Discrete)|Full bathrooms above grade|
|HalfBath|Numeric(Discrete)|Half baths above grade|
|Bedroom|Numeric(Discrete)|Number of bedrooms above basement level|
|Kitchen|Numeric(Discrete)|Number of kitchens|
|KitchenQual|Object(Categorical)|Kitchen quality|
|TotRmsAbvGrd|Numeric(Discrete)|Total rooms above grade (does not include bathrooms)|
|Functional|Object(Categorical)|Home functionality rating|
|Fireplaces|Numeric(Discrete)|Number of fireplaces|
|FireplaceQu|Object(Categorical)|Fireplace quality|
|GarageType|Object(Categorical)|Garage location|
|GarageYrBlt|Numeric(Discrete)|Year garage was built|
|GarageFinish|Object(Categorical)|Interior finish of the garage|
|GarageCars|Numeric(Discrete)|Size of garage in car capacity|
|GarageArea|Numeric(Continuous)|Size of garage in square feet|
|GarageQual|Object(Categorical)|Garage quality|
|GarageCond|Object(Categorical)|Garage condition|
|PavedDrive|Object(Categorical)|Paved driveway|
|WoodDeckSF|Numeric(Continuous)|Wood deck area in square feet|
|OpenPorchSF|Numeric(Continuous)|Open porch area in square feet|
|EnclosedPorch|Numeric(Continuous)|Enclosed porch area in square feet|
|3SsnPorch|Numeric(Continuous)|Three season porch area in square feet|
|ScreenPorch|Numeric(Continuous)|Screen porch area in square feet|
|PoolArea|Numeric(Continuous)|Pool area in square feet|
|PoolQC|Object(Categorical)|Pool quality|
|Fence|Object(Categorical)|Fence quality|
|MiscFeature|Object(Categorical)|Miscellaneous feature not covered in other categories|
|MiscVal|Numeric(Continuous)|Dollar value of miscellaneous feature|
|MoSold|Numeric(Discrete)|Month Sold|
|YrSold|Numeric(Discrete)|Year Sold|
|SaleType|Object(Categorical)|Type of sale|
|SaleCondition|Object(Categorical)|Condition of sale|

## Imports

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.linear_model import LinearRegression, RidgeCV, Lasso, LassoCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.compose import ColumnTransformer

pd.set_option("display.max_colwidth", None)

### Load Dataset

In [2]:
# Load the dataset into a dataframe and drop the leftover index column
df = pd.read_csv("../data/cleaned_dataset.csv").drop("Unnamed: 0", axis=1)
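
The `Unnamed: 0` column dropped above typically appears when a CSV was written together with its index. As a small aside, the sketch below shows that behaviour on an in-memory CSV and one alternative, `index_col=0`. The CSV text and its values are made up for illustration and are not taken from `cleaned_dataset.csv`.

```python
import io
import pandas as pd

# Made-up CSV standing in for a file that was saved together with its index
csv_text = ",Id,SalePrice\n0,1,100000\n1,2,200000\n"

# A plain read turns the unnamed index column into "Unnamed: 0"
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the index instead of a data column
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

Writing the file with `to_csv(..., index=False)` in the first place would avoid the extra column altogether.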

In [3]:
# Get some basic information of our dataframe
print(f"Sample Data :{df.head()}")
print(f"\n--------------------------\n\n Columns : {[i for i in df.columns]}")
print(f"\n--------------------------\n\n Size of the dataset : {df.shape[0]}")
print(f"\n--------------------------\n\n Total number of features : {df.shape[1]}")
print(f"\n--------------------------\n\n Number of numerical features: {df.select_dtypes(include=[int, float]).shape[1]}")
print(f"\n--------------------------\n\n Number of categorical features: {df.select_dtypes(include=[object]).shape[1]}")

Sample Data :    Id        PID  MS SubClass MS Zoning  Lot Area Lot Shape Land Contour  \
0  109  533352170           60        RL     13517       IR1          Lvl   
1  544  531379050           60        RL     11492       IR1          Lvl   
2  153  535304180           20        RL      7922       Reg          Lvl   
3  318  916386060           60        RL      9802       Reg          Lvl   
4  255  906425045           50        RL     14235       IR1          Lvl   

  Lot Config Land Slope Neighborhood  ... Garage Cond Paved Drive  \
0    CulDSac        Gtl       Sawyer  ...          TA           Y   
1    CulDSac        Gtl  Sawyer West  ...          TA           Y   
2     Inside        Gtl   North Ames  ...          TA           Y   
3     Inside        Gtl   Timberland  ...          TA           Y   
4     Inside        Gtl  Sawyer West  ...          TA           N   

  Wood Deck SF  Open Porch SF  Enclosed Porch  Screen Porch  Mo Sold Yr Sold  \
0            0             44

In [4]:
# Select the numerical features
# We don't need "PID" and "Id" because they are identifiers, not predictors of the target
# We drop "SalePrice" because it is the target and we only want the features here
num = df.select_dtypes(include=[int, float]).drop(columns=["Id", "PID", "SalePrice"])
num.columns

Index(['MS SubClass', 'Lot Area', 'Overall Qual', 'Overall Cond', 'Year Built',
       'Year Remod/Add', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2',
       'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF',
       'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath',
       'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd',
       'Fireplaces', 'Garage Yr Blt', 'Garage Cars', 'Garage Area',
       'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', 'Screen Porch',
       'Mo Sold', 'Yr Sold'],
      dtype='object')

In [5]:
num.dtypes

MS SubClass         int64
Lot Area            int64
Overall Qual        int64
Overall Cond        int64
Year Built          int64
Year Remod/Add      int64
Mas Vnr Area      float64
BsmtFin SF 1      float64
BsmtFin SF 2      float64
Bsmt Unf SF       float64
Total Bsmt SF     float64
1st Flr SF          int64
2nd Flr SF          int64
Gr Liv Area         int64
Bsmt Full Bath    float64
Bsmt Half Bath    float64
Full Bath           int64
Half Bath           int64
Bedroom AbvGr       int64
Kitchen AbvGr       int64
TotRms AbvGrd       int64
Fireplaces          int64
Garage Yr Blt     float64
Garage Cars       float64
Garage Area       float64
Wood Deck SF        int64
Open Porch SF       int64
Enclosed Porch      int64
Screen Porch        int64
Mo Sold             int64
Yr Sold             int64
dtype: object

### Assemble Predictor Features (X) and Target (y) 

In [6]:
# Define X (features)
# We don't need "PID" and "Id" because they are identifiers, not predictors of the target
X = df.select_dtypes(include=[int, float]).drop(columns=["Id", "PID", "SalePrice"])

# Define y (target)
y = df["SalePrice"]

print(f"The shape of X -------------- {X.shape}")
print(f"The shape of y -------------- {y.shape}")

The shape of X -------------- (1997, 31)
The shape of y -------------- (1997,)


### Baseline Score

In [7]:
# Instantiate Linear Regression Model
lr = LinearRegression()

In [8]:
# Use 5-fold cross-validation to get a baseline score (mean R^2, the default scoring for regressors)
print(f"Our Baseline Score for predicting house pricing is ----------------- {np.mean(cross_val_score(lr, X, y, cv=5))}")

Our Baseline Score for predicting house pricing is ----------------- 0.8740433458838526
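
The number above is a mean R² because `cross_val_score` falls back to the estimator's own `score` method when no `scoring` argument is given. The sketch below makes that explicit and also reports RMSE, which is easier to read in dollar terms. It runs on synthetic data, so `X_demo`, `y_demo`, and the resulting values are illustrative only and say nothing about the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used only so the snippet runs on its own
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

model = LinearRegression()

# With no scoring argument, a regressor is scored with R^2
r2_default = cross_val_score(model, X_demo, y_demo, cv=5)
r2_explicit = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")

# scikit-learn reports error metrics as negative scores, so flip the sign
rmse = -cross_val_score(
    model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
)

print(r2_default.mean(), r2_explicit.mean(), rmse.mean())
```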


### Preprocessing on Categorical Features 
Categorical features must be converted to numeric values before the model can use them. In this part we use mapping and one-hot encoding to transform them.
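
As a rough illustration of the two techniques, the sketch below applies `map` to an ordered quality column and a yes/no flag, then one-hot encodes an unordered column with `pd.get_dummies`. The column names mirror the dataset, but the rows are made up, and which columns the notebook actually maps versus one-hot encodes is an assumption here.

```python
import pandas as pd

# Toy frame: column names mirror the dataset, the rows are made up
toy = pd.DataFrame({
    "Exter Qual": ["TA", "Gd", "Ex", "Fa"],
    "Central Air": ["Y", "N", "Y", "Y"],
    "Lot Shape": ["Reg", "IR1", "Reg", "IR2"],
})

# Ordered quality codes: map each level to an integer rank (Po < Fa < TA < Gd < Ex)
quality_rank = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
toy["Exter Qual"] = toy["Exter Qual"].map(quality_rank)

# Binary flag: map Y/N to 1/0
toy["Central Air"] = toy["Central Air"].map({"Y": 1, "N": 0})

# Unordered categories: one-hot encode, dropping the first level so the
# dummy columns are not perfectly collinear with the intercept
encoded = pd.get_dummies(toy, columns=["Lot Shape"], drop_first=True, dtype=int)
print(encoded)
```

Mapping keeps the ordering information in a single column, while one-hot encoding avoids imposing an order on levels that have none.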

In [60]:
# Group category levels that have only a few observations into "Other"
df = df.replace({
    "MS Zoning": {"C (all)": "Other", "RH": "Other", "A (agr)": "Other", "I (all)": "Other"},
    "Neighborhood": {"Blueste": "Other", "Greens": "Other", "GrnHill": "Other", "Landmrk": "Other"},
    "Condition 1": {"RRNn": "Other", "RRNe": "Other", "PosA": "Other"},
    "House Style": {"1.5Unf": "Other", "2.5Unf": "Other", "2.5Fin": "Other"},
    "Roof Style": {"Mansard": "Other", "Shed": "Other"},
    "Exterior 1st": {"BrkComm": "Other", "Stone": "Other", "CBlock": "Other", "ImStucc": "Other", "AsphShn": "Other"},
    "Exterior 2nd": {"ImStucc": "Other", "Stone": "Other", "AsphShn": "Other", "CBlock": "Other"},
    "Exter Cond": {"Ex": "Other", "Po": "Other"},
    "Foundation": {"Stone": "Other", "Wood": "Other", "Slab": "Other"},
    "Electrical": {"FuseP": "Other", "Mix": "Other"},
    "Functional": {"Maj2": "Other", "Sev": "Other", "Sal": "Other"},
    "Sale Type": {"ConLD": "Other", "CWD": "Other", "ConLI": "Other", "ConLw": "Other", "Con": "Other", "Oth": "Other"}    
                })
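
The mapping above lists the rare levels by hand. A minimal sketch of a frequency-based alternative is shown below. The helper `collapse_rare_levels` and its `min_count` threshold are hypothetical names introduced for illustration, and the zoning values are made up. The notebook does not state what cutoff was used to pick the levels above.

```python
import pandas as pd

def collapse_rare_levels(series, min_count, other_label="Other"):
    """Replace levels that occur fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with other_label
    return series.where(~series.isin(rare), other_label)

# Made-up zoning values for illustration
zoning = pd.Series(["RL"] * 6 + ["RM"] * 3 + ["RH", "C (all)"])
collapsed = collapse_rare_levels(zoning, min_count=2)
print(collapsed.value_counts())
```

A threshold makes the grouping reproducible when the data changes, at the cost of losing the explicit, reviewable list of levels that the hand-written dictionary provides.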

In [9]:
cat = df.select_dtypes(include="object")
cat.columns

Index(['MS Zoning', 'Lot Shape', 'Land Contour', 'Lot Config', 'Land Slope',
       'Neighborhood', 'Condition 1', 'Bldg Type', 'House Style', 'Roof Style',
       'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type', 'Exter Qual',
       'Exter Cond', 'Foundation', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure',
       'BsmtFin Type 1', 'BsmtFin Type 2', 'Heating QC', 'Central Air',
       'Electrical', 'Kitchen Qual', 'Functional', 'Garage Type',
       'Garage Finish', 'Garage Qual', 'Garage Cond', 'Paved Drive',
       'Sale Type'],
      dtype='object')