# Workflow

## 1. Define the problem:
- **Objective**: Predict house prices based on features like location, size, number of rooms, etc.
- **Problem**: Regression (the target variable is continuous)
    - Other common problem types include classification, clustering, and time-series forecasting
- **Output**: A numerical prediction of the house price.

## 2. Data preprocessing:
- **Understand the data structure** --> evaluate:
    - Check the percentage of missing data per feature --> drop? fill?
    - Drop a feature if it carries no important information or more than 50% of its values are missing
    - Fill a feature if fewer than 5-10% of its values are missing
        - Categorical: fill with the most common value (mode) or create a new category like "Missing"
        - Numerical: fill with the mean, median, or interpolation
    - Check `nunique()` for each feature --> group / classify
- **Feature Engineering**
    - Convert categorical features using one-hot encoding  
    - Scale numerical features using standardization.
- **Train/Test split** 
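
The fill rules above can be sketched on a small made-up frame. The column names are borrowed from the dataset loaded below, but every value is invented, and treating a missing `GarageType` as its own category is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names mirror train.csv.
toy = pd.DataFrame({
    "Electrical": ["SBrkr", "SBrkr", None, "FuseA"],   # categorical, few gaps
    "GarageType": ["Attchd", None, "Detchd", None],    # gaps assumed informative
    "MasVnrArea": [196.0, np.nan, 162.0, 0.0],         # numerical
})

# Categorical with rare gaps: fill with the most common value (mode).
toy["Electrical"] = toy["Electrical"].fillna(toy["Electrical"].mode()[0])

# Categorical where missingness may carry meaning: explicit "Missing" category.
toy["GarageType"] = toy["GarageType"].fillna("Missing")

# Numerical: fill with the median, which is less affected by extreme values.
toy["MasVnrArea"] = toy["MasVnrArea"].fillna(toy["MasVnrArea"].median())

print(toy)
```

The median is used here instead of the mean only because house-related measurements tend to be skewed; either choice fits the rule in the list.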

## 3. Choose a Model and Train:
- **Process**
    - Feed training data into the model.
    - Adjust parameters to minimize error. 
    - Save the trained model for predictions.
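
A minimal sketch of these three steps on invented data. `joblib` is one common way to persist scikit-learn models; the notes do not say which tool is intended, so the file name and the choice of `joblib` are assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: price = 50 * size + 1000 (invented for illustration).
X_train = np.array([[50.0], [80.0], [120.0], [200.0]])
y_train = 50.0 * X_train[:, 0] + 1000.0

# Fitting adjusts the coefficients to minimize the squared training error.
model = LinearRegression().fit(X_train, y_train)

# Persist the fitted model, then reload it as a later prediction script would.
model_path = os.path.join(tempfile.mkdtemp(), "house_price_model.joblib")
joblib.dump(model, model_path)
restored = joblib.load(model_path)

print(restored.predict(np.array([[100.0]])))
```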

## 4. Evaluate performance: --> MAE, MSE, ...
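
Both metrics are averages over the prediction errors: MAE averages the absolute errors, MSE averages the squared errors and so penalizes large misses more heavily. A small worked example with invented numbers (RMSE is added here as a common companion metric, not something the notes require):

```python
import numpy as np

# Invented prices (in thousands) to keep the arithmetic easy to follow.
y_true = np.array([200.0, 150.0, 300.0])
y_pred = np.array([210.0, 140.0, 330.0])

errors = y_pred - y_true            # [10, -10, 30]
mae = np.mean(np.abs(errors))       # (10 + 10 + 30) / 3
mse = np.mean(errors ** 2)          # (100 + 100 + 900) / 3
rmse = np.sqrt(mse)                 # back in the units of the target

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
```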
 
## 5. Predict and visualize: 
- Input new data in the same format as the training data (same columns, encoding, and scaling)
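
One way to enforce the format, sketched on invented rows: one-hot encode the new data, then reindex it to the training columns. The frame and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical training frame after one-hot encoding (values are invented).
train_raw = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MSZoning": ["RL", "RM", "FV"],
})
train_encoded = pd.get_dummies(train_raw, columns=["MSZoning"])
train_columns = train_encoded.columns

# A single new house only produces the dummy column for its own category.
new_raw = pd.DataFrame({"LotArea": [10000], "MSZoning": ["RM"]})
new_encoded = pd.get_dummies(new_raw, columns=["MSZoning"])

# Reindex to the training layout: missing dummies become 0, extras are dropped.
new_aligned = new_encoded.reindex(columns=train_columns, fill_value=0)

print(list(new_aligned.columns))
```

Without the reindex step, the model would receive fewer columns than it was trained on and `predict` would fail.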

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import zscore
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv('train.csv')
print(data.head(5))

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  SaleType  SaleCondition  SalePrice  
0   2008        WD   

In [3]:
#check for missing values
for feature in data.columns:
    if data[feature].isnull().sum() > 0:
        print(feature, data[feature].isnull().sum())

LotFrontage 259
Alley 1369
MasVnrType 872
MasVnrArea 8
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinType2 38
Electrical 1
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
PoolQC 1453
Fence 1179
MiscFeature 1406
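
Raw counts are easier to compare against the 50% and 5-10% thresholds from the workflow notes once they are expressed as percentages. A sketch on an invented frame standing in for `data`:

```python
import numpy as np
import pandas as pd

# Invented frame; only the idea carries over to the real data.
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],     # 75% missing
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],     # 25% missing
    "LotArea": [8450, 9600, 11250, 9550],          # complete
})

# isnull().mean() is the fraction of missing values per column.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct[missing_pct > 0])

# Drop columns above the 50% threshold.
too_sparse = missing_pct[missing_pct > 50].index.tolist()
toy = toy.drop(columns=too_sparse)
```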


In [4]:
# Drop the Id column and the columns with many missing values
data = data.drop(columns=[
    'Id',  # row identifier, not useful for prediction
    'LotFrontage', 'Alley', 'PoolQC', 'Fence',
    'MiscFeature', 'FireplaceQu', 'MasVnrType',
])

In [5]:
#check for missing values
for feature in data.columns:
    if data[feature].isnull().sum() > 0:
        print(feature, data[feature].isnull().sum())

MasVnrArea 8
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinType2 38
Electrical 1
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81


In [6]:
# Check for features with exactly two unique values
for feature in data.columns:
    if len(data[feature].unique()) == 2:
        print(feature, data[feature].unique())

Street ['Pave' 'Grvl']
Utilities ['AllPub' 'NoSeWa']
CentralAir ['Y' 'N']


In [7]:
# Check for string-typed features with more than 2 unique values
for feature in data.columns:
    if len(data[feature].unique()) > 2 and data[feature].dtype == 'object':
        print(feature, data[feature].unique())    

MSZoning ['RL' 'RM' 'C (all)' 'FV' 'RH']
LotShape ['Reg' 'IR1' 'IR2' 'IR3']
LandContour ['Lvl' 'Bnk' 'Low' 'HLS']
LotConfig ['Inside' 'FR2' 'Corner' 'CulDSac' 'FR3']
LandSlope ['Gtl' 'Mod' 'Sev']
Neighborhood ['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel' 'Somerst' 'NWAmes'
 'OldTown' 'BrkSide' 'Sawyer' 'NridgHt' 'NAmes' 'SawyerW' 'IDOTRR'
 'MeadowV' 'Edwards' 'Timber' 'Gilbert' 'StoneBr' 'ClearCr' 'NPkVill'
 'Blmngtn' 'BrDale' 'SWISU' 'Blueste']
Condition1 ['Norm' 'Feedr' 'PosN' 'Artery' 'RRAe' 'RRNn' 'RRAn' 'PosA' 'RRNe']
Condition2 ['Norm' 'Artery' 'RRNn' 'Feedr' 'PosN' 'PosA' 'RRAn' 'RRAe']
BldgType ['1Fam' '2fmCon' 'Duplex' 'TwnhsE' 'Twnhs']
HouseStyle ['2Story' '1Story' '1.5Fin' '1.5Unf' 'SFoyer' 'SLvl' '2.5Unf' '2.5Fin']
RoofStyle ['Gable' 'Hip' 'Gambrel' 'Mansard' 'Flat' 'Shed']
RoofMatl ['CompShg' 'WdShngl' 'Metal' 'WdShake' 'Membran' 'Tar&Grv' 'Roll'
 'ClyTile']
Exterior1st ['VinylSd' 'MetalSd' 'Wd Sdng' 'HdBoard' 'BrkFace' 'WdShing' 'CemntBd'
 'Plywood' 'AsbShng' 'Stucco

# Convert Categorical Features

In [8]:
binary_features = ['Street', 'Utilities', 'CentralAir']

#convert binary features to 0 and 1
binary_mappings = {
    'Street': {'Pave': 0, 'Grvl': 1},
    'Utilities': {'AllPub': 0, 'NoSeWa': 1},
    'CentralAir': {'N': 0, 'Y': 1}
}

for feature in binary_features:
    data[feature] = data[feature].map(binary_mappings[feature])

for feature in binary_features:
    print(feature, data[feature].unique())

Street [0 1]
Utilities [0 1]
CentralAir [1 0]


In [9]:
# print(string_features)

In [10]:
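# Collect the string-typed features first, then one-hot encode them in one call
# (avoids reassigning `data` while looping over its columns)
categorical_features = [
    feature for feature in data.columns
    if len(data[feature].unique()) > 2 and data[feature].dtype == 'object'
]
data = pd.get_dummies(data, columns=categorical_features)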

In [11]:
# print the name of the columns using get_dummies
print(data.columns[40:50])

Index(['MSZoning_FV', 'MSZoning_RH', 'MSZoning_RL', 'MSZoning_RM',
       'LotShape_IR1', 'LotShape_IR2', 'LotShape_IR3', 'LotShape_Reg',
       'LandContour_Bnk', 'LandContour_HLS'],
      dtype='object')


In [13]:
# Fill remaining NaN values with 0, then cast every column (including the dummies) to int
data = data.fillna(0).astype(int)
# Open question: drop one dummy level per feature to avoid multicollinearity?
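
On the multicollinearity question: the full set of dummy columns for one feature always sums to 1 in every row, so together with an intercept the columns are perfectly collinear. One common remedy, sketched here on invented rows, is `drop_first=True`. Whether it matters depends on the model; it is mainly a concern for unregularized linear models.

```python
import pandas as pd

toy = pd.DataFrame({"LotShape": ["Reg", "IR1", "IR2", "Reg"]})  # invented rows

full = pd.get_dummies(toy, columns=["LotShape"])
reduced = pd.get_dummies(toy, columns=["LotShape"], drop_first=True)

# Every row of the full encoding sums to 1, duplicating the intercept column.
print(full.sum(axis=1).tolist())
# The reduced encoding drops the first level (IR1), which becomes the baseline.
print(list(reduced.columns))
```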

## Visualize Feature Distributions

In [None]:
# Define numerical features to visualize (column names as they appear in train.csv)
numerical_features = ['LotArea', 'BedroomAbvGr', 'FullBath', 'GarageCars']

# Create a histogram for each numerical feature
for feature in numerical_features:
    plt.figure(figsize=(6, 4))
    plt.hist(data[feature], bins=20, edgecolor='black', alpha=0.7)
    plt.title(f"Distribution of {feature}")
    plt.xlabel(feature)
    plt.ylabel("Frequency")
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()

## Choose the model --> Standardize features if the model is sensitive to feature scales

In [None]:
scaler = StandardScaler()

# Apply to 'LotArea' only (the lot-size column in train.csv)
data['LotArea'] = scaler.fit_transform(data[['LotArea']])

# Check scaled values
print(data.head())
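
Fitting the scaler on the full frame before splitting lets the test rows influence the mean and standard deviation. A sketch of the alternative, assuming the split is done first; the data is randomly generated and `LotArea` stands in for any numeric column.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Invented single-feature frame.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"LotArea": rng.integers(5000, 15000, size=50).astype(float)})

train, test = train_test_split(toy, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only ...
train["LotArea"] = scaler.fit_transform(train[["LotArea"]]).ravel()
# ... and reuse those statistics for the test rows.
test["LotArea"] = scaler.transform(test[["LotArea"]]).ravel()

print(train["LotArea"].mean(), train["LotArea"].std(ddof=0))
```

The training column ends up with mean 0 and unit variance; the test column is close to that but not exact, which is expected.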

In [None]:
# Features: every column except the target "SalePrice"
X = data.drop(columns=['SalePrice'])

# Target
y = data['SalePrice']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# print the number of samples in each set
print(X_train.shape[0])
print(X_test.shape[0])


In [None]:
model = LinearRegression()
model.fit(X_train, y_train)

# Plot actual vs. predicted prices on the training set
plt.figure(figsize=(10, 7))
plt.plot(range(len(y_train)), y_train, label='Actual')
plt.plot(range(len(y_train)), model.predict(X_train), label='Predicted')
plt.legend()
plt.title('Training set: Actual vs Predicted')
plt.xlabel('Sample number')
plt.ylabel('Price') 
plt.show()

In [None]:
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print('Mean Absolute Error:', mae)
print('Mean Squared Error:', mse)

## Some large errors --> outliers or missing features
## --> Likely outliers, since all missing values were handled earlier
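
Before removing anything, it can help to see which samples produce the large errors. A sketch with invented targets and predictions; in the notebook, `y_test` and `y_pred` from the evaluation cell would play these roles.

```python
import numpy as np
import pandas as pd

# Invented test targets and predictions; one sample is badly mispredicted.
y_test = pd.Series([200000, 150000, 755000, 180000], index=[11, 42, 7, 93])
y_pred = np.array([195000, 160000, 420000, 185000])

residuals = y_test - y_pred
# Rank samples by absolute error to see which rows drive the large MSE.
worst = residuals.abs().sort_values(ascending=False)
print(worst.head(2))
```

The index labels of the worst rows can then be looked up in the original frame to check whether those houses are unusual.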

### Using boxplot

In [None]:
import seaborn as sns

plt.figure(figsize=(6, 4))
sns.boxplot(x=data['SalePrice'])
plt.title("Box Plot of House Prices")
plt.show()


In [None]:
# plt.figure(figsize=(12, 8))

# for i, feature in enumerate(numerical_features, 1):
#     plt.subplot(2, 3, i)  # Arrange subplots in a grid (2 rows, 3 columns)
    
#     # Calculate IQR
#     Q1 = data[feature].quantile(0.25)
#     Q3 = data[feature].quantile(0.75)
#     IQR = Q3 - Q1
#     lower_bound = Q1 - 2.0 * IQR
#     upper_bound = Q3 + 2.0 * IQR
    
#     # Plot boxplot
#     sns.boxplot(x=data[feature], color="skyblue")
    
#     # Add vertical lines for IQR range
#     plt.axvline(Q1, color='blue', linestyle='dashed', label="Q1")
#     plt.axvline(Q3, color='green', linestyle='dashed', label="Q3")
#     plt.axvline(lower_bound, color='red', linestyle='dashed', label="Lower Bound")
#     plt.axvline(upper_bound, color='red', linestyle='dashed', label="Upper Bound")
    
#     plt.title(f"Box Plot of {feature}")
#     plt.xlabel(feature)
#     plt.legend()

# plt.tight_layout()  
# plt.show()


plt.figure(figsize=(12, 8))

for i, feature in enumerate(numerical_features, 1):
    plt.subplot(2, 3, i)  # Arrange subplots in a grid (2 rows, 3 columns)
    sns.boxplot(x=data[feature], color="lightcoral")  # Use different color for before cleaning
    plt.title(f"Box Plot of {feature} (Before Cleaning)")
    plt.xlabel(feature)

plt.tight_layout()  # Adjust layout for better visualization
plt.show()


In [None]:
# def remove_outliers_iqr(data, features):
#     for feature in features:
#         Q1 = data[feature].quantile(0.25) 
#         Q3 = data[feature].quantile(0.75) 
#         IQR = Q3 - Q1  
#         lower_bound = Q1 - 2.0 * IQR  # 2.0 instead of the usual 1.5 widens the bounds, so fewer points are removed
#         upper_bound = Q3 + 2.0 * IQR
        
#         # Remove rows where feature value is outside the IQR bounds
#         data = data[(data[feature] >= lower_bound) & (data[feature] <= upper_bound)]
#     return data
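
For reference, a runnable variant of the IQR filter above, tried on invented values. The multiplier is exposed as a hypothetical parameter `k`; a larger `k` widens the bounds and removes fewer rows.

```python
import pandas as pd

def remove_outliers_iqr(df, features, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every feature."""
    cleaned = df.copy()
    for feature in features:
        q1 = cleaned[feature].quantile(0.25)
        q3 = cleaned[feature].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        # between() is inclusive on both ends, matching >= and <= above.
        cleaned = cleaned[cleaned[feature].between(lower, upper)]
    return cleaned

# Invented lot areas with one extreme value.
toy = pd.DataFrame({"LotArea": [8000, 8500, 9000, 9500, 10000, 215000]})
toy_cleaned = remove_outliers_iqr(toy, ["LotArea"])
print(len(toy), len(toy_cleaned))
```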

In [None]:
def remove_outliers_zscore(df, features, threshold=3):
    df_cleaned = df.copy()
    for feature in features:
        df_cleaned = df_cleaned[np.abs(zscore(df_cleaned[feature])) < threshold]  
    return df_cleaned
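
To make the threshold concrete, the z-score of a value is its distance from the mean in units of standard deviation. The sketch below, on invented numbers, reproduces `zscore` by hand and shows which value a threshold of 3 would remove.

```python
import numpy as np
from scipy.stats import zscore

# Invented values: 19 ordinary lot areas and one extreme one.
values = np.array([9000.0] * 10 + [10000.0] * 9 + [215000.0])

# zscore uses the population standard deviation (ddof=0) by default.
manual = (values - values.mean()) / values.std(ddof=0)
print(np.allclose(manual, zscore(values)))

mask = np.abs(zscore(values)) < 3
print(values[~mask])  # values the threshold of 3 would remove
```

Note that `zscore` returns NaN for every entry if the column contains NaN, in which case the comparison is False everywhere and all rows would be dropped; the function above therefore relies on missing values having been filled earlier.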

In [None]:
data_cleaned = remove_outliers_zscore(data, numerical_features)
# Print shape of data before and after removing outliers
print(f"Original dataset shape: {data.shape}")
print(f"Cleaned dataset shape: {data_cleaned.shape}")

## Check again

In [None]:
plt.figure(figsize=(12, 8))

for i, feature in enumerate(numerical_features, 1):
    plt.subplot(2, 3, i)  # Arrange subplots in a grid (2 rows, 3 columns)
    sns.boxplot(x=data_cleaned[feature], color="skyblue")
    plt.title(f"Box Plot of {feature} (After Z-score Outlier Removal)")
    plt.xlabel(feature)

plt.tight_layout()  # Adjust layout for better visualization
plt.show()

In [None]:
print(data['FullBath'].value_counts())  # Check distribution before removing outliers

In [None]:
print(data_cleaned['FullBath'].value_counts())  # Check distribution after removing outliers

In [None]:
# Features: every column except the target "SalePrice"
X = data_cleaned.drop(columns=['SalePrice'])

# Target
y = data_cleaned['SalePrice']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# print the number of samples in each set
print(X_train.shape[0])
print(X_test.shape[0])


In [None]:
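# Retrain on the cleaned training set; the earlier model was fit on the uncleaned data
model = LinearRegression()
model.fit(X_train, y_train)

y_pred2 = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred2)
mse = mean_squared_error(y_test, y_pred2)
print('Mean Absolute Error:', mae)
print('Mean Squared Error:', mse)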
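
Because the cleaned data is split again, the two sets of scores are computed on different test rows and are not directly comparable. A sketch of a like-for-like comparison, assuming outliers are removed from the training rows only; the data and the simulated entry errors are invented.

```python
import numpy as np
from scipy.stats import zscore
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Invented data: price depends linearly on area, plus noise.
rng = np.random.default_rng(0)
area = rng.uniform(50, 250, size=200)
price = 1000 * area + rng.normal(0, 5000, size=200)
area[:5] = 2000.0  # simulated data-entry errors in the feature
X = area.reshape(-1, 1)

# Split first, so the test set stays untouched by any cleaning.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

# Remove outliers from the training rows only.
keep = np.abs(zscore(X_train[:, 0])) < 3
model_raw = LinearRegression().fit(X_train, y_train)
model_clean = LinearRegression().fit(X_train[keep], y_train[keep])

# Both models are scored on the same held-out rows.
mae_raw = mean_absolute_error(y_test, model_raw.predict(X_test))
mae_clean = mean_absolute_error(y_test, model_clean.predict(X_test))
print(f"MAE raw: {mae_raw:.0f}, MAE cleaned: {mae_clean:.0f}")
```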