## Prediction of house prices

Aim: Predict the price of a house from its features and apply machine learning (ML) operations to the data. An accurate price prediction helps identify the best house for a customer's investment.

The data comes from the Kaggle competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

The data is divided into two parts: train and test. The train data is used to train the model and the test data is used to test it. The test data does not contain the target variable, which is the price of the house.

This notebook focuses on the data analysis part of the project. The sections below cover each step in turn.

### Competition Description

![image.png](attachment:image.png)<br>

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

### Installing libraries

In [2]:
! pip install scipy

Collecting scipy
  Using cached scipy-1.7.3-cp37-cp37m-win_amd64.whl (34.1 MB)
Collecting numpy<1.23.0,>=1.16.5
  Using cached numpy-1.21.6-cp37-cp37m-win_amd64.whl (14.0 MB)
Installing collected packages: numpy, scipy
Successfully installed numpy-1.21.6 scipy-1.7.3


In [4]:
! pip install pandas

Collecting pandas
  Using cached pandas-1.3.5-cp37-cp37m-win_amd64.whl (10.0 MB)
Collecting pytz>=2017.3
  Using cached pytz-2022.7.1-py2.py3-none-any.whl (499 kB)
Installing collected packages: pytz, pandas
Successfully installed pandas-1.3.5 pytz-2022.7.1


### Loading libraries

In [5]:
import pandas as pd
import os

### Importing the dataset

In [6]:
pwd = os.getcwd()
train_path = os.path.join(pwd, 'Data', 'train.csv')
test_path = os.path.join(pwd, 'Data', 'test.csv')
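
The same paths can also be built with `pathlib`, which makes it easy to check that the files exist before loading them. This is a minimal sketch, not part of the original notebook; `data_path` is a hypothetical helper, and the `Data` folder name is taken from the cell above.

```python
from pathlib import Path

def data_path(filename, base_dir=None):
    """Build <base_dir>/Data/<filename>, defaulting to the current working directory."""
    base = Path.cwd() if base_dir is None else Path(base_dir)
    return base / "Data" / filename

train_path = data_path("train.csv")
test_path = data_path("test.csv")

# Report missing files up front instead of failing inside pd.read_csv.
for path in (train_path, test_path):
    if not path.is_file():
        print(f"Missing data file: {path}")
```

`pd.read_csv` accepts `Path` objects, so the loading cell works unchanged with these values.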

In [7]:
df_train = pd.read_csv(train_path)
df_test = pd.read_csv(test_path)

### Size and Structure of the dataset

In [10]:
print(df_train.shape)
print(df_test.shape)

(1460, 81)
(1459, 80)
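
The one-column difference between the two shapes should be the target variable. A minimal sketch of how to check this is shown here; `columns_missing_from_test` is a hypothetical helper, and the toy frames stand in for `df_train` and `df_test` so the example runs without the data files.

```python
import pandas as pd

def columns_missing_from_test(train: pd.DataFrame, test: pd.DataFrame) -> list:
    """Return the columns present in `train` but absent from `test`, in train order."""
    return [col for col in train.columns if col not in test.columns]

# Toy stand-ins for df_train / df_test (illustrative subset of columns only).
toy_train = pd.DataFrame(
    {"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]}
)
toy_test = toy_train.drop(columns=["SalePrice"])

print(columns_missing_from_test(toy_train, toy_test))
```

On the real data the call would be `columns_missing_from_test(df_train, df_test)`.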


In [9]:
df_train.head()

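| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |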


The training set has 1460 rows and 81 columns: an Id column, 79 features, and 1 target variable, the price of the house (SalePrice). The features are the different attributes of the house and are described in the data_description.txt file.

There are 1460 instances of training data and 1459 of test data. Of the 81 training attributes, 36 are quantitative and 43 are categorical; the remaining two are Id and SalePrice.<br>
The test data has 80 attributes, of which 36 are quantitative and 43 are categorical, plus Id.
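
The quantitative and categorical counts above can be reproduced from the column dtypes. This is a minimal sketch, assuming that numeric dtypes count as quantitative and everything else as categorical; `split_feature_types` and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def split_feature_types(df, exclude=("Id", "SalePrice")):
    """Split feature columns into (quantitative, categorical) lists by dtype.

    Columns named in `exclude` (identifier and target) are left out of both lists.
    """
    features = df.drop(columns=[c for c in exclude if c in df.columns])
    quantitative = features.select_dtypes(include="number").columns.tolist()
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    return quantitative, categorical

# Toy stand-in for df_train (illustrative subset of columns only).
toy = pd.DataFrame(
    {
        "Id": [1, 2],
        "MSSubClass": [60, 20],
        "MSZoning": ["RL", "RL"],
        "LotFrontage": [65.0, 80.0],
        "Street": ["Pave", "Pave"],
        "SalePrice": [208500, 181500],
    }
)

quantitative, categorical = split_feature_types(toy)
print(len(quantitative), "quantitative,", len(categorical), "categorical")
```

On the real data the call would be `split_feature_types(df_train)` or `split_feature_types(df_test)`. Note that a dtype-based split treats numerically coded categories such as MSSubClass as quantitative.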