# Exploratory Data Analysis - House Prices

This notebook performs an Exploratory Data Analysis (EDA) of the ["House Prices: Advanced Regression Techniques"](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) dataset.

In [1]:
# modules
import pandas as pd
import seaborn as sns
import missingno as msno

Let's take an initial look at the **train** and **test** data.

In [2]:
train_data = pd.read_csv('../data/raw/train.csv')
print('train.csv::', train_data.shape)
test_data = pd.read_csv('../data/raw/test.csv')
print('test.csv::', test_data.shape)
train_data.head(3)

train.csv:: (1460, 81)
test.csv:: (1459, 80)


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500


In [14]:
# Checking the data type
train_data.dtypes

Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
                  ...   
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 81, dtype: object

In [None]:
train_data.describe().T

Let's remove the "Id" column, since it is only a row identifier and carries no information about SalePrice.

In [None]:
train_data.drop(columns=['Id'], inplace=True)

## Missing values

Before proceeding, let's check whether train_data has missing values. The missingno module is a good tool for visualizing them.

In [None]:
msno.matrix(train_data)

It seems that some columns have a lot of missing values. Let's count how many columns are affected.

In [None]:
columns_with_miss = train_data.isna().sum()
columns_with_miss = columns_with_miss[columns_with_miss!=0]
print('Columns with missing values::', len(columns_with_miss))
columns_with_miss.sort_values(ascending=False)

Of the 80 columns, 19 have missing values. Notably, 4 of them (PoolQC, MiscFeature, Alley and Fence) have almost no values, which makes them candidates for exclusion.

**TODO**: Exclude PoolQC, MiscFeature, Alley, and Fence columns.
- PoolQC: Pool quality
- MiscFeature: Miscellaneous feature not covered in other categories
- Alley: Type of alley access to property
- Fence: Fence quality
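
To make "almost no values" measurable, the missing counts can be turned into fractions. The following is a minimal sketch on a toy frame standing in for `train_data`; the helper name `missing_ratio` and the 0.5 cut-off are illustrative assumptions, not values taken from this analysis.

```python
import pandas as pd


def missing_ratio(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per affected column, largest first."""
    ratio = df.isna().mean()  # mean of a boolean mask = share of missing rows
    return ratio[ratio > 0].sort_values(ascending=False)


# Toy data (hypothetical values) standing in for train_data
toy = pd.DataFrame({
    "PoolQC": [None, None, None, "Gd"],
    "LotFrontage": [65.0, None, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

ratios = missing_ratio(toy)

# Assumed threshold: flag columns where more than half of the rows are missing
THRESHOLD = 0.5
mostly_empty = ratios[ratios > THRESHOLD].index.tolist()

print(ratios)
print("Candidates for exclusion:", mostly_empty)
```

On the real data, calling `missing_ratio(train_data)` would give the same kind of ranking, and the threshold can be tuned to taste.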

The FireplaceQu (690 missing values) and LotFrontage (259 missing values) columns need to be investigated before taking any action.

**TODO**: Analyze whether FireplaceQu and LotFrontage contain useful information.
- FireplaceQu: Fireplace quality
- LotFrontage: Linear feet of street connected to property
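
One possible way to start this investigation is to check whether a missing FireplaceQu simply means "no fireplace". The sketch below assumes a companion count column named `Fireplaces` (not shown in the outputs above) and uses toy values rather than the real data.

```python
import pandas as pd

# Toy data (hypothetical values); `Fireplaces` is an assumed column name
toy = pd.DataFrame({
    "Fireplaces": [0, 1, 0, 2, 1],
    "FireplaceQu": [None, "TA", None, "Gd", "TA"],
})

quality_missing = toy["FireplaceQu"].isna()
no_fireplace = toy["Fireplaces"] == 0

# Cross-tabulate the two conditions to see how often they coincide
table = pd.crosstab(
    quality_missing,
    no_fireplace,
    rownames=["FireplaceQu missing"],
    colnames=["No fireplace"],
)
print(table)

# True when every missing quality value belongs to a house without a fireplace
# and every house without a fireplace has a missing quality value
is_structural = bool((quality_missing == no_fireplace).all())
print("Missingness explained by absence of a fireplace:", is_structural)
```

If the two conditions coincide on the real data, the missing values carry information ("no fireplace") and could be encoded as their own category instead of dropping the column.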

In [None]:
train_data.drop(columns=['PoolQC', 'MiscFeature', 'Alley', 'Fence'], inplace=True)

## Feature Selection

### Correlation

Which columns correlate most strongly with SalePrice?

In [None]:
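# numeric_only=True skips object columns, which corr() cannot handle in pandas >= 2.0
train_data.corr(numeric_only=True)['SalePrice'].abs().sort_values(ascending=False)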
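
The ranking above can be narrowed down to a short list of features. This is a minimal sketch on toy data; the helper `top_correlated` and the choice of `k` are illustrative assumptions, and `numeric_only` requires pandas 1.5 or newer.

```python
import pandas as pd


def top_correlated(df: pd.DataFrame, target: str, k: int = 5) -> pd.Series:
    """Return the k numeric columns with the largest absolute Pearson correlation to target."""
    corr = df.corr(numeric_only=True)[target].drop(target)  # exclude the target itself
    return corr.abs().sort_values(ascending=False).head(k)


# Toy data (hypothetical values) standing in for train_data
toy = pd.DataFrame({
    "SalePrice": [100, 200, 300, 400],
    "GrLivArea": [10, 20, 30, 40],        # perfectly linear in SalePrice
    "YrSold": [2008, 2007, 2008, 2007],   # weakly related
    "MSZoning": ["RL", "RM", "RL", "RL"],  # non-numeric, ignored by corr
})

top = top_correlated(toy, "SalePrice", k=2)
print(top)
```

The resulting index could also be used to restrict a pair plot to a handful of columns, for example `sns.pairplot(train_data[list(top.index) + ['SalePrice']])`, which keeps the grid readable.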

In [None]:
sns.pairplot(train_data)