# Data Understanding & Exploration

## Objective
Understand the structure, quality, and key characteristics of the housing dataset.

## Inputs
- `house_prices_records.csv` (training records, includes `SalePrice`)
- `inherited_houses.csv` (inherited houses, no `SalePrice`)

## Outputs
- Summary statistics
- Missing value analysis
- Initial insights about SalePrice

## Data Understanding Summary

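- Training dataset contains 1,460 records and 24 columns: 23 features plus the target variable `SalePrice`
- Inherited dataset contains 4 records and excludes `SalePrice`
- 10 of the 24 columns contain missing values; `EnclosedPorch` (1,324 missing) and `WoodDeckSF` (1,305 missing) are empty in roughly 90% of records. Most of these gaps appear to represent the absence of a property feature rather than a recording error
- Dataset includes both numerical (20 columns) and categorical (4 columns: `BsmtExposure`, `BsmtFinType1`, `GarageFinish`, `KitchenQual`) variables
- `SalePrice` ranges from 34,900 to 755,000, with a median of 163,000 and a mean of about 180,921; the mean sitting above the median and the long upper tail suggest right skew and potential high-end outliers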


In [9]:
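import pandas as pd

# Load the training records (with SalePrice) and the inherited houses (without it)
train_df = pd.read_csv("../data/raw/house_prices_records.csv")
inherited_df = pd.read_csv("../data/raw/inherited_houses.csv")

# Preview the first five training rows
train_df.head()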

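| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |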


In [10]:
train_df.shape


(1460, 24)

In [11]:
inherited_df.head()



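| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotArea | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 896 | 0 | 2 | No | 468.0 | Rec | 270.0 | 0 | 730.0 | Unf | ... | 11622 | 80.0 | 0.0 | 0 | 6 | 5 | 882.0 | 140 | 1961 | 1961 |
| 1 | 1329 | 0 | 3 | No | 923.0 | ALQ | 406.0 | 0 | 312.0 | Unf | ... | 14267 | 81.0 | 108.0 | 36 | 6 | 6 | 1329.0 | 393 | 1958 | 1958 |
| 2 | 928 | 701 | 3 | No | 791.0 | GLQ | 137.0 | 0 | 482.0 | Fin | ... | 13830 | 74.0 | 0.0 | 34 | 5 | 5 | 928.0 | 212 | 1997 | 1998 |
| 3 | 926 | 678 | 3 | No | 602.0 | GLQ | 324.0 | 0 | 470.0 | Fin | ... | 9978 | 78.0 | 20.0 | 36 | 6 | 6 | 926.0 | 360 | 1998 | 1998 |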
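
The preview truncates the middle columns, so it cannot confirm on its own that the inherited dataset differs from the training data only by the missing `SalePrice` column. The following is a minimal sketch of how that could be checked; `compare_columns` is a hypothetical helper introduced here for illustration and is not part of the original notebook.

```python
import pandas as pd


def compare_columns(left: pd.DataFrame, right: pd.DataFrame):
    """Return (columns only in left, columns only in right), preserving column order.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    only_left = [col for col in left.columns if col not in right.columns]
    only_right = [col for col in right.columns if col not in left.columns]
    return only_left, only_right
```

In this notebook it could be called as `compare_columns(train_df, inherited_df)`. If the two files share a schema apart from the target, the first list should contain only `SalePrice` and the second should be empty.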


In [12]:
train_df.columns


Index(['1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'BsmtExposure', 'BsmtFinSF1',
       'BsmtFinType1', 'BsmtUnfSF', 'EnclosedPorch', 'GarageArea',
       'GarageFinish', 'GarageYrBlt', 'GrLivArea', 'KitchenQual', 'LotArea',
       'LotFrontage', 'MasVnrArea', 'OpenPorchSF', 'OverallCond',
       'OverallQual', 'TotalBsmtSF', 'WoodDeckSF', 'YearBuilt', 'YearRemodAdd',
       'SalePrice'],
      dtype='str')

In [13]:
train_df.isnull().sum().sort_values(ascending=False)


EnclosedPorch    1324
WoodDeckSF       1305
LotFrontage       259
GarageFinish      235
BsmtFinType1      145
BedroomAbvGr       99
2ndFlrSF           86
GarageYrBlt        81
BsmtExposure       38
MasVnrArea          8
1stFlrSF            0
BsmtFinSF1          0
BsmtUnfSF           0
GrLivArea           0
LotArea             0
GarageArea          0
KitchenQual         0
OpenPorchSF         0
OverallQual         0
OverallCond         0
TotalBsmtSF         0
YearBuilt           0
YearRemodAdd        0
SalePrice           0
dtype: int64
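
Raw counts are easier to judge relative to the 1,460 rows. For example, 1,324 missing `EnclosedPorch` values is about 90.7% of the records, while 8 missing `MasVnrArea` values is about 0.5%. A minimal sketch of a percentage view follows; `missing_summary` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise missing values per column as counts and percentages.

    Hypothetical helper for illustration. Only columns with at least one
    missing value are returned, sorted from most to least affected.
    """
    counts = df.isnull().sum()
    summary = pd.DataFrame(
        {
            "missing_count": counts,
            # Share of rows with a missing value, rounded to one decimal place
            "missing_pct": (counts / len(df) * 100).round(1),
        }
    )
    affected = summary[summary["missing_count"] > 0]
    return affected.sort_values("missing_count", ascending=False)
```

Calling `missing_summary(train_df)` would list only the ten affected columns, which makes it easier to separate columns that are almost entirely empty from those with a handful of gaps.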

In [14]:
train_df.dtypes


1stFlrSF           int64
2ndFlrSF         float64
BedroomAbvGr     float64
BsmtExposure         str
BsmtFinSF1         int64
BsmtFinType1         str
BsmtUnfSF          int64
EnclosedPorch    float64
GarageArea         int64
GarageFinish         str
GarageYrBlt      float64
GrLivArea          int64
KitchenQual          str
LotArea            int64
LotFrontage      float64
MasVnrArea       float64
OpenPorchSF        int64
OverallCond        int64
OverallQual        int64
TotalBsmtSF        int64
WoodDeckSF       float64
YearBuilt          int64
YearRemodAdd       int64
SalePrice          int64
dtype: object
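
The dtype listing can be turned into explicit column groups for later steps. Note that count-like columns such as `BedroomAbvGr` appear as `float64` here because pandas stores integer columns that contain missing values as floats by default. The sketch below uses `select_dtypes`; the function name `split_columns_by_kind` is an assumption made for this example.

```python
import pandas as pd


def split_columns_by_kind(df: pd.DataFrame):
    """Return (numeric column names, non-numeric column names).

    Hypothetical helper for illustration. Anything that is not a numeric
    dtype is treated as categorical, which fits the string columns here.
    """
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
    return numeric_cols, categorical_cols
```

For `train_df`, this should yield 20 numeric columns (including `SalePrice`) and the 4 string columns `BsmtExposure`, `BsmtFinType1`, `GarageFinish`, and `KitchenQual`, matching the listing above.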

In [15]:
train_df.describe()


| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtFinSF1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageYrBlt | GrLivArea | LotArea | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1374.0 | 1361.0 | 1460.0 | 1460.0 | 136.0 | 1460.0 | 1379.0 | 1460.0 | 1460.0 | 1201.0 | 1452.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 155.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 1162.626712 | 348.524017 | 2.869214 | 443.639726 | 567.240411 | 25.330882 | 472.980137 | 1978.506164 | 1515.463699 | 10516.828082 | 70.049958 | 103.685262 | 46.660274 | 5.575342 | 6.099315 | 1057.429452 | 103.741935 | 1971.267808 | 1984.865753 | 180921.19589 |
| std | 386.587738 | 438.865586 | 0.820115 | 456.098091 | 441.866955 | 66.684115 | 213.804841 | 24.689725 | 525.480383 | 9981.264932 | 24.284752 | 181.066207 | 66.256028 | 1.112799 | 1.382997 | 438.705324 | 135.543152 | 30.202904 | 20.645407 | 79442.502883 |
| min | 334.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1900.0 | 334.0 | 1300.0 | 21.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1872.0 | 1950.0 | 34900.0 |
| 25% | 882.0 | 0.0 | 2.0 | 0.0 | 223.0 | 0.0 | 334.5 | 1961.0 | 1129.5 | 7553.5 | 59.0 | 0.0 | 0.0 | 5.0 | 5.0 | 795.75 | 0.0 | 1954.0 | 1967.0 | 129975.0 |
| 50% | 1087.0 | 0.0 | 3.0 | 383.5 | 477.5 | 0.0 | 480.0 | 1980.0 | 1464.0 | 9478.5 | 69.0 | 0.0 | 25.0 | 5.0 | 6.0 | 991.5 | 0.0 | 1973.0 | 1994.0 | 163000.0 |
| 75% | 1391.25 | 728.0 | 3.0 | 712.25 | 808.0 | 0.0 | 576.0 | 2002.0 | 1776.75 | 11601.5 | 80.0 | 166.0 | 68.0 | 6.0 | 7.0 | 1298.25 | 182.5 | 2000.0 | 2004.0 | 214000.0 |
| max | 4692.0 | 2065.0 | 8.0 | 5644.0 | 2336.0 | 286.0 | 1418.0 | 2010.0 | 5642.0 | 215245.0 | 313.0 | 1600.0 | 547.0 | 9.0 | 10.0 | 6110.0 | 736.0 | 2010.0 | 2010.0 | 755000.0 |
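
The `SalePrice` quartiles above give one way to make "potential outliers" concrete. With Q1 = 129,975 and Q3 = 214,000, the interquartile range is 84,025, so the common 1.5 × IQR rule would place the fences at 3,937.5 and 340,037.5; the maximum of 755,000 lies well above the upper fence. The sketch below applies that rule. It is one conventional heuristic chosen for illustration, not a method prescribed by this notebook, and both function names are hypothetical.

```python
import pandas as pd


def iqr_outlier_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences using the k * IQR rule.

    Hypothetical helper for illustration; k = 1.5 is the conventional default.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value falls outside the fences."""
    lower, upper = iqr_outlier_bounds(values, k)
    return (values < lower) | (values > upper)
```

In this notebook, `flag_iqr_outliers(train_df["SalePrice"]).sum()` would count the flagged sales. Flagged rows are candidates for inspection rather than automatic removal, since expensive houses may be genuine observations.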
