# A Comprehensive Study of the Factors that Affect Home Prices in Ames, Iowa

## Problem Statement
As a realtor company operating in Ames, Iowa, our goal is to help our clients make informed decisions when buying or selling a home. One of the key factors that our clients consider is the price of the property, and we want to develop a reliable regression model that can accurately predict the prices of homes in this area. By analyzing a dataset of previous home sales in Ames and identifying the key features that impact home prices, we aim to create a regression model that can be used to predict future prices based on a variety of property characteristics. Our ultimate goal is to provide our clients with a powerful tool that can help them make smart and profitable real estate decisions.

## Background
The real estate market in Ames, Iowa has seen significant growth over the past decade, with a steady increase in home prices and a high demand for quality housing. As a result, there is a growing need for accurate predictions of home prices in this area, both for home buyers looking to make a wise investment and for realtors seeking to offer valuable insights to their clients.

To address this need, our realtor company is undertaking a project to develop a regression model that can predict home prices in Ames based on a variety of key factors. By analyzing a dataset of past home sales and identifying the most significant features that impact home prices, we aim to build a reliable and accurate model that can help our clients make informed decisions about their real estate investments.

This project represents a significant opportunity for our company to provide a valuable service to our clients, while also gaining a deeper understanding of the complex factors that drive home prices in the Ames real estate market. Through careful analysis and rigorous testing, we believe that we can develop a powerful tool that will help our clients maximize their investments and achieve their real estate goals.

## Contents:

## Datasets
The raw dataset, provided by [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques), consists of 81 features that could affect house prices.

|Feature|Type|Description|
|---|---|---|
|SalePrice|Numeric(Continuous)|The property's sale price in dollars. This is the target variable that the model predicts|
|MSSubClass|Numeric(Discrete)|The building class|
|MSZoning|Object(Categorical)|The general zoning classification|
|LotFrontage|Numeric(Continuous)|Linear feet of street connected to property|
|LotArea|Numeric(Continuous)|Lot size in square feet|
|Street|Object(Categorical)|Type of road access|
|Alley|Object(Categorical)|Type of alley access|
|LotShape|Object(Categorical)|General shape of property|
|LandContour|Object(Categorical)|Flatness of the property|
|Utilities|Object(Categorical)|Type of utilities available|
|LotConfig|Object(Categorical)|Lot configuration|
|LandSlope|Object(Categorical)|Slope of property|
|Neighborhood|Object(Categorical)|Physical locations within Ames city limits|
|Condition1|Object(Categorical)|Proximity to main road or railroad|
|Condition2|Object(Categorical)|Proximity to main road or railroad (if a second is present)|
|BldgType|Object(Categorical)|Type of dwelling|
|HouseStyle|Object(Categorical)|Style of dwelling|
|OverallQual|Numeric(Discrete)|Overall material and finish quality|
|OverallCond|Numeric(Discrete)|Overall condition rating|
|YearBuilt|Numeric(Discrete)|Original construction date|
|YearRemodAdd|Numeric(Discrete)|Remodel date|
|RoofStyle|Object(Categorical)|Type of roof|
|RoofMatl|Object(Categorical)|Roof material|
|Exterior1st|Object(Categorical)|Exterior covering on house|
|Exterior2nd|Object(Categorical)|Exterior covering on house (if more than one material)|
|MasVnrType|Object(Categorical)|Masonry veneer type|
|MasVnrArea|Numeric(Continuous)|Masonry veneer area in square feet|
|ExterQual|Object(Categorical)|Exterior material quality|
|ExterCond|Object(Categorical)|Present condition of the material on the exterior|
|Foundation|Object(Categorical)|Type of foundation|
|BsmtQual|Object(Categorical)|Height of the basement|
|BsmtCond|Object(Categorical)|General condition of the basement|
|BsmtExposure|Object(Categorical)|Walkout or garden level basement walls|
|BsmtFinType1|Object(Categorical)|Quality of basement finished area|
|BsmtFinSF1|Numeric(Continuous)|Type 1 finished square feet|
|BsmtFinType2|Object(Categorical)|Quality of second finished area (if present)|
|BsmtFinSF2|Numeric(Continuous)|Type 2 finished square feet|
|BsmtUnfSF|Numeric(Continuous)|Unfinished square feet of basement area|
|TotalBsmtSF|Numeric(Continuous)|Total square feet of basement area|
|Heating|Object(Categorical)|Type of heating|
|HeatingQC|Object(Categorical)|Heating quality and condition|
|CentralAir|Object(Categorical)|Central air conditioning|
|Electrical|Object(Categorical)|Electrical system|
|1stFlrSF|Numeric(Continuous)|First floor square feet|
|2ndFlrSF|Numeric(Continuous)|Second floor square feet|
|LowQualFinSF|Numeric(Continuous)|Low quality finished square feet (all floors)|
|GrLivArea|Numeric(Continuous)|Above grade (ground) living area square feet|
|BsmtFullBath|Numeric(Discrete)|Basement full bathrooms|
|BsmtHalfBath|Numeric(Discrete)|Basement half bathrooms|
|FullBath|Numeric(Discrete)|Full bathrooms above grade|
|HalfBath|Numeric(Discrete)|Half baths above grade|
|Bedroom|Numeric(Discrete)|Number of bedrooms above basement level|
|Kitchen|Numeric(Discrete)|Number of kitchens|
|KitchenQual|Object(Categorical)|Kitchen quality|
|TotRmsAbvGrd|Numeric(Discrete)|Total rooms above grade (does not include bathrooms)|
|Functional|Object(Categorical)|Home functionality rating|
|Fireplaces|Numeric(Discrete)|Number of fireplaces|
|FireplaceQu|Object(Categorical)|Fireplace quality|
|GarageType|Object(Categorical)|Garage location|
|GarageYrBlt|Numeric(Discrete)|Year garage was built|
|GarageFinish|Object(Categorical)|Interior finish of the garage|
|GarageCars|Numeric(Discrete)|Size of garage in car capacity|
|GarageArea|Numeric(Continuous)|Size of garage in square feet|
|GarageQual|Object(Categorical)|Garage quality|
|GarageCond|Object(Categorical)|Garage condition|
|PavedDrive|Object(Categorical)|Paved driveway|
|WoodDeckSF|Numeric(Continuous)|Wood deck area in square feet|
|OpenPorchSF|Numeric(Continuous)|Open porch area in square feet|
|EnclosedPorch|Numeric(Continuous)|Enclosed porch area in square feet|
|3SsnPorch|Numeric(Continuous)|Three season porch area in square feet|
|ScreenPorch|Numeric(Continuous)|Screen porch area in square feet|
|PoolArea|Numeric(Continuous)|Pool area in square feet|
|PoolQC|Object(Categorical)|Pool quality|
|Fence|Object(Categorical)|Fence quality|
|MiscFeature|Object(Categorical)|Miscellaneous feature not covered in other categories|
|MiscVal|Numeric(Continuous)|Value of miscellaneous feature in dollars|
|MoSold|Numeric(Discrete)|Month Sold|
|YrSold|Numeric(Discrete)|Year Sold|
|SaleType|Object(Categorical)|Type of sale|
|SaleCondition|Object(Categorical)|Condition of sale|

## Imports

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.linear_model import LinearRegression, RidgeCV, Lasso, LassoCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.compose import ColumnTransformer

pd.set_option("display.max_colwidth", None)
     

## Functions

In [18]:
# Function for plotting a histogram
def his_dis(column, titlename, x, bins, color):
    plt.figure(figsize=(12, 8))
    plt.hist(column, bins=bins, edgecolor='black', color=color)
    # The original line had a stray double quote, which raised a SyntaxError
    plt.title("Distribution of " + titlename)
    plt.xlabel(x)
    plt.ylabel("Frequency")

## Exploratory Data Analysis (EDA) and Pre-Processing

### Load Data

In [None]:
# Determine data path
data_path = '../data/train.csv'

In [6]:
# Read the training data
df_train = pd.read_csv(data_path)

# First 5 rows of the data
print(f"Sample Data :{df_train.head()}")

print(f"\n--------------------------\n\n Columns : {[i for i in df_train.columns]}")
print(f"\n--------------------------\n\n Size of the dataset : {df_train.shape[0]}")
print(f"\n--------------------------\n\n Total number of features : {df_train.shape[1]}")
print(f"\n--------------------------\n\n Number of numerical features: {df_train.select_dtypes(include=[int, float]).shape[1]}")
print(f"\n--------------------------\n\n Number of categorical features: {df_train.select_dtypes(include=[object]).shape[1]}")

Sample Data :    Id        PID  MS SubClass MS Zoning  Lot Frontage  Lot Area Street Alley  \
0  109  533352170           60        RL           NaN     13517   Pave   NaN   
1  544  531379050           60        RL          43.0     11492   Pave   NaN   
2  153  535304180           20        RL          68.0      7922   Pave   NaN   
3  318  916386060           60        RL          73.0      9802   Pave   NaN   
4  255  906425045           50        RL          82.0     14235   Pave   NaN   

  Lot Shape Land Contour  ... Screen Porch Pool Area Pool QC Fence  \
0       IR1          Lvl  ...            0         0     NaN   NaN   
1       IR1          Lvl  ...            0         0     NaN   NaN   
2       Reg          Lvl  ...            0         0     NaN   NaN   
3       Reg          Lvl  ...            0         0     NaN   NaN   
4       IR1          Lvl  ...            0         0     NaN   NaN   

  Misc Feature Misc Val Mo Sold Yr Sold  Sale Type  SalePrice  
0          NaN 

In [7]:
# Lower case column names and remove spaces 
#df_train.columns = df_train.columns.str.lower().str.replace(" ", "_")

In [8]:
# Define numerical and categorical dataframes
df_num = df_train.select_dtypes(include = [int, float])
df_cat = df_train.select_dtypes(include = [object])

In [15]:
df_train.columns[0]

'Id'

In [17]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2051 entries, 0 to 2050
Data columns (total 81 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               2051 non-null   int64  
 1   PID              2051 non-null   int64  
 2   MS SubClass      2051 non-null   int64  
 3   MS Zoning        2051 non-null   object 
 4   Lot Frontage     1721 non-null   float64
 5   Lot Area         2051 non-null   int64  
 6   Street           2051 non-null   object 
 7   Alley            140 non-null    object 
 8   Lot Shape        2051 non-null   object 
 9   Land Contour     2051 non-null   object 
 10  Utilities        2051 non-null   object 
 11  Lot Config       2051 non-null   object 
 12  Land Slope       2051 non-null   object 
 13  Neighborhood     2051 non-null   object 
 14  Condition 1      2051 non-null   object 
 15  Condition 2      2051 non-null   object 
 16  Bldg Type        2051 non-null   object 
 17  House Style   

In [16]:
for col in df_train.columns:
    print(df_train[col].value_counts())

109     1
1377    1
1521    1
1719    1
1221    1
       ..
1965    1
1598    1
1796    1
2619    1
10      1
Name: Id, Length: 2051, dtype: int64
533352170    1
905100020    1
909201110    1
528174040    1
534451170    1
            ..
535453150    1
923225370    1
528458080    1
535426195    1
527162130    1
Name: PID, Length: 2051, dtype: int64
20     770
60     394
50     198
120    132
30     101
70      90
160     88
80      86
90      75
190     46
85      28
75      16
180     11
45      11
40       4
150      1
Name: MS SubClass, dtype: int64
RL         1598
RM          316
FV          101
C (all)      19
RH           14
A (agr)       2
I (all)       1
Name: MS Zoning, dtype: int64
60.0     179
70.0      96
80.0      94
50.0      90
65.0      71
        ... 
118.0      1
137.0      1
195.0      1
115.0      1
135.0      1
Name: Lot Frontage, Length: 118, dtype: int64
9600     34
7200     27
6000     26
10800    19
9000     18
         ..
8765      1
10337     1
7614      1
731

In [28]:
df_train['Lot Shape'].value_counts()

Reg    1295
IR1     692
IR2      55
IR3       9
Name: Lot Shape, dtype: int64

### Cleaning
Working on null values.

In [None]:
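# Normalize column names to snake_case (all later cells use these names) and rebuild the dtype views
df_train.columns = df_train.columns.str.lower().str.replace(" ", "_")
df_num, df_cat = df_train.select_dtypes(include=[int, float]), df_train.select_dtypes(include=[object])
# Explore the null values
print(f"Total null values in numerical columns: {df_num.isna().sum().sum()}")
print(f"Total null values in categorical columns: {df_cat.isna().sum().sum()}")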

In [None]:
# Plot the null values
plt.figure(figsize=(15,25))
df_train.isna().sum().plot.barh(color="orange");

In [None]:
# Drop columns with more than 300 null values
df_train.dropna(thresh=len(df_train)-300, axis=1, inplace=True)
df_train.shape
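
To make the threshold logic concrete, here is a minimal sketch on a toy frame. The column names and the limit of 2 are made up for illustration and are not part of the Ames data. `thresh` is the minimum number of non-null values a column needs in order to be kept, so `len(df) - max_nulls` keeps every column with at most `max_nulls` missing values.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): 5 rows, columns with 0, 2 and 4 missing values
toy = pd.DataFrame({
    "full": [1, 2, 3, 4, 5],
    "some_nulls": [1.0, np.nan, 3.0, np.nan, 5.0],
    "mostly_null": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

max_nulls = 2  # illustrative limit, standing in for the 300 used above

# thresh = minimum count of non-null values required to keep a column
kept = toy.dropna(thresh=len(toy) - max_nulls, axis=1)

null_counts = toy.isna().sum()
print(null_counts.to_dict())
print(list(kept.columns))
```

A column with exactly `max_nulls` missing values survives, which matches the "more than 300" wording in the comment above.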

### Numerical features with the most null values after dropping columns

In [None]:
# Distribution of values in the mas_vnr_area feature
his_dis(column=df_num['mas_vnr_area'], titlename='Masonry Veneer Area in Square Feet',
                      x='Masonry Veneer Area in Square Feet', bins=50, color='orange')

In [None]:
# Distribution of values in the garage_yr_blt feature
his_dis(column=df_num['garage_yr_blt'], titlename='Year Garage Was Built',
                      x='Year Garage Was Built', bins=50, color='orange')

In [None]:
# Work on null values in lot_frontage
# Replace null values with the column mean (assignment avoids chained inplace fills)
df_num['lot_frontage'] = df_num['lot_frontage'].fillna(df_num['lot_frontage'].mean())

In [None]:
# Work on null values in mas_vnr_area
# Fill null values with 0
df_num['mas_vnr_area'] = df_num['mas_vnr_area'].fillna(0)

In [None]:
# Explore the null values of the garage_yr_blt feature
same_year = df_num['garage_yr_blt'] == df_num['year_built']
print(f"Rows where the garage and the house were built in the same year:\n {df_num.loc[same_year, ['garage_yr_blt', 'year_built']]}")
print(f"\nShare of rows where the garage and the house were built in the same year:\n {same_year.sum() / df_num.shape[0]}")

In [None]:
# Replace garage_yr_blt null values with the corresponding value in year_built
df_num['garage_yr_blt'] = df_num['garage_yr_blt'].fillna(df_num['year_built'])

In [None]:
# Drop remaining null values
df_num.dropna(inplace=True)
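
The three fill strategies used above can be checked on a small example. This is a sketch on made-up rows that reuse the same column names; the values are hypothetical. It uses plain assignment through `DataFrame.assign` instead of `inplace=True`, so the original frame is left untouched.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values) using the same column names as above
toy = pd.DataFrame({
    "lot_frontage": [60.0, np.nan, 80.0],
    "mas_vnr_area": [np.nan, 120.0, 0.0],
    "garage_yr_blt": [1990.0, np.nan, 2005.0],
    "year_built": [1990, 1975, 2001],
})

filled = toy.assign(
    # Mean imputation: the missing frontage becomes (60 + 80) / 2
    lot_frontage=toy["lot_frontage"].fillna(toy["lot_frontage"].mean()),
    # A missing veneer area is treated as "no veneer"
    mas_vnr_area=toy["mas_vnr_area"].fillna(0),
    # A missing garage year falls back to the house's construction year
    garage_yr_blt=toy["garage_yr_blt"].fillna(toy["year_built"]),
)

print(filled)
```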

In [None]:
# Start from the cleaned numeric frame: lot_frontage has 330 nulls,
# so the 300-null column drop above removed it from df_train
df_final = df_num.copy()

In [None]:
# adding neighborhood columns from train dataset
df_final = pd.merge(df_final, df_train[['id', 'neighborhood']], on='id', how='left')

In [None]:
df_final.isna().sum()

In [None]:
df_final['neighborhood'].value_counts()

In [None]:
df_final.shape

In [None]:
df_final.describe()

In [None]:
df_final.isna().sum()

In [None]:
# filled null values with mean in lot_frontage
df_final['lot_frontage'] = df_final['lot_frontage'].fillna(df_final['lot_frontage'].mean())

In [None]:
df_final['lot_frontage'].value_counts()

In [None]:
df_final['full_bath'].value_counts()

In [None]:
df_final = df_final[df_final['full_bath'] > 0]

In [None]:
df_final['totrms_abvgrd'].value_counts()

In [None]:
df_final = df_final[df_final['totrms_abvgrd'] < 13]

In [None]:
# Drop columns that are not useful for modeling
df_final = df_final.drop(columns = ['garage_yr_blt', 'pid'])

In [None]:
df_final.dropna(inplace=True)

In [None]:
# One-hot encoding neighborhood gave a better R2 score, but it is not used here because of the data dictionary
##df_final = pd.get_dummies(data=df_final, columns= ['neighborhood'], prefix = None)

In [None]:
##df_final = pd.merge(df_final, df_train[['id', 'neighborhood']], on='id', how='left')

In [None]:
# Years between original construction and the remodel
df_final['year_bt_bu_ren'] = df_final['year_remod/add'] - df_final['year_built']

In [None]:
df_final['year_bt_bu_ren'].value_counts()

In [None]:
df_final.shape

In [None]:
df_final.dtypes

In [None]:
df_final['neighborhood'].value_counts()

In [None]:
# Save and export the new dataset (with a .csv extension and without the index column)
df_final.to_csv('../data/final.csv', index=False)

## Exploratory Data Analysis

In [None]:
# correlation between features and target
plt.figure(figsize= (10,8))
sns.heatmap(df_final.corr(numeric_only=True)[['saleprice']].sort_values(by='saleprice', ascending=False),
           annot=True,
           vmin=-1,
           vmax=1,
           cmap = 'coolwarm');

### Creating my model based on major features

In [None]:
df_final.corr(numeric_only=True)[['saleprice']].sort_values(by='saleprice', ascending=False)

In [None]:
# Defining X and y
X = df_final.drop(columns=['saleprice', 'id', 'neighborhood'])
y = df_final['saleprice']

In [None]:
# Splitting X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
# Standardizing
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

In [None]:
lr = LinearRegression()

In [None]:
# Fit the linear model to the training data
lr.fit(X_train, y_train)

### Evaluating my model

In [None]:
#Train score
lr.score(X_train, y_train)

In [None]:
#test score
lr.score(X_test, y_test)
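
A single train/test split gives one estimate of the R2 score. As a complement, `cross_val_score` (imported above but not used so far) averages over several folds. The sketch below runs on synthetic data, not the Ames data. The names `X_demo`, `y_demo` and `pipe` are placeholders. Wrapping the scaler in a pipeline is an assumption on our side; it makes the scaler refit inside each fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (not the Ames data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# The scaler is refit on the training part of every fold,
# so the held-out fold never influences the scaling
pipe = make_pipeline(StandardScaler(), LinearRegression())
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)  # R2 per fold

print(cv_scores.mean())
```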

In [None]:
lasso_cv = LassoCV(cv=5)
lasso_cv.fit(X_train, y_train)

In [None]:
# lasso train score
lasso_cv.score(X_train, y_train)

In [None]:
# lasso test score
lasso_cv.score(X_test, y_test)

In [None]:
coef_df = pd.DataFrame({'features': X.columns, 'coefs': lasso_cv.coef_})
coef_df.sort_values('coefs', ascending = False).head(10)

In [None]:
coef_df.sort_values('coefs', ascending = False).tail()
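
Sorting by the signed coefficient puts strong negative effects at the bottom of the table. Ranking by absolute value shows the most influential features in one list. Here is a sketch on synthetic data; the feature names `a`, `b`, `noise_1` and `noise_2` are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: only "a" and "b" drive the target, the rest is noise
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 4)),
                      columns=["a", "b", "noise_1", "noise_2"])
y_demo = 5.0 * X_demo["a"] - 3.0 * X_demo["b"] + rng.normal(scale=0.5, size=300)

# Standardize so coefficient sizes are comparable across features
X_scaled = StandardScaler().fit_transform(X_demo)
lasso_demo = LassoCV(cv=5).fit(X_scaled, y_demo)

coef_demo = pd.DataFrame({"features": X_demo.columns, "coefs": lasso_demo.coef_})
coef_demo["abs_coefs"] = coef_demo["coefs"].abs()
ranked = coef_demo.sort_values("abs_coefs", ascending=False)
print(ranked)
```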

### LINE Assumptions

In [None]:
# L - Linearity
plt.figure(figsize=(15,10))
plt.scatter(df_final['gr_liv_area'], df_final['saleprice'], edgecolor = 'black')
plt.title('Relationship Between Above Ground Living Area and Sale Price', fontsize=17)
plt.xlabel('Above Ground Living Area Square Feet', fontsize=14)
plt.ylabel('House Sale Price', fontsize=14);

In [None]:
# N - Normality of errors
y_preds = lasso_cv.predict(X_train)
resids = y_train - y_preds
plt.hist(resids, bins = 50, edgecolor = 'black');
# Fine

In [None]:
# E - Equal variance of errors
# Residual plot
plt.figure(figsize=(5,5))
plt.scatter(y_preds, resids, s=1)
plt.axhline(0, color='orange');
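
The plots above are judged by eye. A few summary numbers can back them up. The helper below is a hypothetical sketch, and `residual_summary` is not part of the notebook. It returns the mean of the residuals (should be near 0), the RMSE, and the sample skewness (near 0 for a symmetric distribution).

```python
import numpy as np

def residual_summary(y_true, y_pred):
    """Return mean, RMSE and sample skewness of the residuals (hypothetical helper)."""
    resids = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mean = resids.mean()
    rmse = np.sqrt(np.mean(resids ** 2))
    std = resids.std()
    # Skewness is undefined for constant residuals, so report 0 in that case
    skew = 0.0 if std == 0 else np.mean(((resids - mean) / std) ** 3)
    return {"mean": mean, "rmse": rmse, "skew": skew}

# Made-up values: the residuals are -10, 10, -10, 10
summary = residual_summary([100, 200, 300, 400], [110, 190, 310, 390])
print(summary)
```

In the notebook this could be called as `residual_summary(y_train, y_preds)`.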

## Visualization

In [None]:
plt.figure(figsize=(15,10))
plt.scatter(df_final['gr_liv_area'], df_final['saleprice'], c='gold', edgecolor = 'black')
plt.title('Relationship Between Above Ground Living Area and Sale Price', fontsize=17)
plt.xlabel('Above Ground Living Area Square Feet', fontsize=14)
plt.ylabel('House Sale Price', fontsize=14);

In [None]:
plt.figure(figsize=(15,10))
plt.scatter(df_final['garage_area'] , df_final['saleprice'], c='orange', edgecolor='black')
plt.title('Relationship Between Garage Area and Sale Price', fontsize=17)
plt.xlabel('Garage Area in Square Feet', fontsize=14)
plt.ylabel('House Sale Price', fontsize=14);

In [None]:
plt.figure(figsize=(15,10))
plt.bar(df_final['garage_cars'], df_final['saleprice'], color='gold')
plt.xticks(sorted(df_final['garage_cars'].unique()))
plt.title('Relationship Between Garage Car Capacity and Sale Price', fontsize=17)
plt.xlabel('Garage Car Capacity', fontsize=14)
plt.ylabel('House Sale Price', fontsize=14);
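
Note that `plt.bar` called on the raw rows draws one bar per house, and bars with the same x value overlap, so each position effectively shows the highest sale price. If the goal is a typical price per category, aggregating first is one option. Below is a sketch on made-up rows; the values are hypothetical.

```python
import pandas as pd

# Toy rows (hypothetical values) with the same column names as df_final
toy = pd.DataFrame({
    "garage_cars": [0, 1, 1, 2, 2, 2, 3],
    "saleprice": [90000, 120000, 140000, 180000, 200000, 220000, 310000],
})

# One value per category: the mean sale price for each garage capacity
mean_price = toy.groupby("garage_cars")["saleprice"].mean()
print(mean_price)
```

The result can then be plotted with `plt.bar(mean_price.index, mean_price.values)`, which gives one bar per category.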

In [None]:
plt.figure(figsize=(15,10))
plt.scatter(df_final['total_bsmt_sf'], df_final['saleprice'], edgecolor = 'black')
plt.title('Relationship Between Total Basement Area and Sale Price', fontsize=17)
plt.xlabel('Total Basement Area in Square Feet', fontsize=14)
plt.ylabel('House Sale Price', fontsize=14);

In [None]:
plt.figure(figsize=(15,10))
plt.scatter(df_final['1st_flr_sf'], df_final['saleprice'], c= 'gold',edgecolor = 'black')
plt.title('Relationship Between First Floor Area and Sale Price', fontsize=17)
plt.xlabel('First Floor Area in Square Feet', fontsize=14)
plt.ylabel('House Sale Price', fontsize=14);

In [None]:
plt.figure(figsize=(15,10))
plt.scatter(df_final['2nd_flr_sf'], df_final['saleprice'], edgecolor = 'black')
plt.title('Relationship Between Second Floor Area and Sale Price', fontsize=17)
plt.xlabel('Second Floor Area in Square Feet', fontsize=14)
plt.ylabel('House Sale Price', fontsize=14);

In [None]:
df_final['full_bath'].astype('object')

In [None]:
plt.figure(figsize=(15,10))
plt.bar(df_final['full_bath'], df_final['saleprice'], color='gold')
plt.xticks(sorted(df_final['full_bath'].unique()))
plt.title('Relationship Between Number of Full Bathrooms and Sale Price', fontsize=17)
plt.xlabel('Number of Full Bathrooms', fontsize=14)
plt.ylabel('House Sale Price', fontsize=14);

In [None]:
plt.figure(figsize=(15,10))
plt.bar(df_final['totrms_abvgrd'], df_final['saleprice'])
plt.xticks(sorted(df_final['totrms_abvgrd'].unique()))
plt.title('Relationship Between Total Rooms and Sale Price', fontsize=17)
plt.xlabel('Number of Total Rooms', fontsize=14)
plt.ylabel('House Sale Price', fontsize=14);

In [None]:
plt.figure(figsize=(40,30))
plt.bar(df_final['neighborhood'], df_final['saleprice'])
plt.xticks(rotation=90)
plt.title('Relationship Between Neighborhood and Sale Price', fontsize=17)
plt.xlabel('Neighborhood', fontsize=14)
plt.ylabel('House Sale Price', fontsize=14);

In [None]:
df_final.corr(numeric_only=True)[['saleprice']].sort_values(by='saleprice', ascending=False)

In [None]:
plt.figure(figsize=(15,10))
plt.bar(df_final['overall_qual'], df_final['saleprice'])
plt.xticks(sorted(df_final['overall_qual'].unique()));