# House Prices: Advanced Regression Techniques

### AI TF machine learning assignment

* https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
* Predict sales prices and practice feature engineering, RFs, and gradient boosting

### File descriptions

* **input/train.csv** training set
* **input/test.csv** test set
* **input/data_description.txt** data description
* **input/sample_submission.csv** sample submission file

### Using the Scikit-Learn LinearRegression model
* **Kaggle Score: 0.15665**

In [10]:
import pandas as pd
import numpy as np

## Load Dataset

In [11]:
# train data
train = pd.read_csv("input/train.csv", index_col="Id")

print(train.shape)
train.head()

(1460, 80)


| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |


In [12]:
# test data
test = pd.read_csv("input/test.csv", index_col="Id")

print(test.shape)
test.head()

(1459, 79)


| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
| 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
| 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
| 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | Inside | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |


## Preprocessing

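#### How to select features for linear regression
1. Explore the data and select features that have a strong linear correlation with SalePrice and satisfy the assumptions of linear regression
2. Start with all features and reduce them step by step through Backward Elimination

#### Is feature scaling needed for sklearn linear regression?
1. LinearRegression (ordinary least squares) does not scale features automatically; rather, its predictions are unaffected by a linear rescaling of the features, so scaling is not required
   - Some models, such as SVM, K-Means, and Logistic Regression, do need feature scaling
   - Use from sklearn.preprocessing import StandardScaler
2. Compare linear regression with and without feature scaling
   - No difference: the fitted coefficients absorb the change of scale, so the predictions stay the same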
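
The two strategies above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the procedure actually used in this notebook: the toy columns, the helper names `rank_by_correlation` and `backward_eliminate`, and the choice of scikit-learn's `SequentialFeatureSelector` for backward elimination are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression


def rank_by_correlation(df, target):
    """Absolute Pearson correlation of every column with `target`, largest first."""
    corr = df.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


def backward_eliminate(X, y, n_keep):
    """Start from all columns and drop the least useful one until n_keep remain."""
    selector = SequentialFeatureSelector(
        LinearRegression(), n_features_to_select=n_keep,
        direction="backward", cv=5,
    )
    selector.fit(X, y)
    return list(X.columns[selector.get_support()])


# Synthetic stand-in for the housing data (hypothetical columns and values)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "quality": rng.integers(1, 11, n).astype(float),
    "area": rng.normal(1500, 400, n),
    "noise": rng.normal(0, 1, n),  # unrelated to the target
})
toy["price"] = 20000 * toy["quality"] + 100 * toy["area"] + rng.normal(0, 5000, n)

ranking = rank_by_correlation(toy, "price")
kept = backward_eliminate(toy[["quality", "area", "noise"]], toy["price"], n_keep=2)
print(ranking)
print(kept)
```

In this toy setup the unrelated `noise` column ranks last by correlation and is the one removed by backward elimination.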
### outlier

1. Remove the outlier rows for 'GrLivArea'

In [13]:
# Inspect the two rows with the largest GrLivArea, then drop them (Id 1299 and Id 524)
train.sort_values(by = 'GrLivArea', ascending = False)[:2]
train = train.drop(index=[1299, 524])
print(train.shape)

(1458, 80)
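
As a sketch of how such rows might be located by rule rather than by hard-coded Id, the snippet below flags houses with a very large living area but a low price. The helper name `find_area_outliers`, both thresholds, and the toy rows are assumptions made for illustration; they are not taken from the notebook.

```python
import pandas as pd


def find_area_outliers(df, area_min=4000, price_max=300000):
    """Return the index labels of rows with large GrLivArea but low SalePrice."""
    mask = (df["GrLivArea"] > area_min) & (df["SalePrice"] < price_max)
    return df.index[mask].tolist()


# Invented rows that only mimic the shape of the real data
toy = pd.DataFrame(
    {
        "GrLivArea": [1700, 4700, 5600, 4300, 1250],
        "SalePrice": [210000, 180000, 160000, 750000, 180000],
    },
    index=pd.Index([1, 524, 1299, 1183, 2], name="Id"),
)

outlier_ids = find_area_outliers(toy)
cleaned = toy.drop(index=outlier_ids)
print(outlier_ids)
print(cleaned.shape)
```

The large but expensive house (Id 1183 in the toy frame) is kept, since it follows the overall area-price trend.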


In [14]:
# Features suited to linear regression were chosen through data exploration

# Declare the features to train on
# Including YearBuilt was found to improve the score
feature_names = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath','YearBuilt']

In [27]:
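# Dataset to train on (copied so that the assignment below does not write to a view of train)
X_train = train.loc[:, feature_names].copy()

# One row had a blank YearBuilt value, so it was overwritten with an arbitrary value
# X_train.loc[train["YearBuilt"] == " ", "YearBuilt"] = 1980

# YearBuilt is an object-typed column, so convert it to numeric and store the result
X_train['YearBuilt'] = pd.to_numeric(X_train['YearBuilt'])
print(X_train.shape)
X_train.head()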

  result = getattr(x, name)(y)


TypeError: invalid type comparison

### missing data in X_train

In [325]:
#missing data
total = X_train.isnull().sum().sort_values(ascending=False)
percent = (X_train.isnull().sum()/X_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

|  | Total | Percent |
| --- | --- | --- |
| YearBuilt | 0 | 0.0 |
| FullBath | 0 | 0.0 |
| TotalBsmtSF | 0 | 0.0 |
| GarageCars | 0 | 0.0 |
| GrLivArea | 0 | 0.0 |
| OverallQual | 0 | 0.0 |


In [326]:
X_test = test.loc[:, feature_names].copy()
print(X_test.shape)
X_test.head()

(1459, 6)


| Id | OverallQual | GrLivArea | GarageCars | TotalBsmtSF | FullBath | YearBuilt |
| --- | --- | --- | --- | --- | --- | --- |
| 1461 | 5 | 896 | 1.0 | 882.0 | 1 | 1961 |
| 1462 | 6 | 1329 | 1.0 | 1329.0 | 1 | 1958 |
| 1463 | 5 | 1629 | 2.0 | 928.0 | 2 | 1997 |
| 1464 | 6 | 1604 | 2.0 | 926.0 | 2 | 1998 |
| 1465 | 8 | 1280 | 2.0 | 1280.0 | 2 | 1992 |


### missing data in X_test

In [327]:
#missing data
total = X_test.isnull().sum().sort_values(ascending=False)
percent = (X_test.isnull().sum()/X_test.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

|  | Total | Percent |
| --- | --- | --- |
| TotalBsmtSF | 1 | 0.000685 |
| GarageCars | 1 | 0.000685 |
| YearBuilt | 0 | 0.0 |
| FullBath | 0 | 0.0 |
| GrLivArea | 0 | 0.0 |
| OverallQual | 0 | 0.0 |


In [328]:
# Check the rows where TotalBsmtSF is null
X_test[X_test['TotalBsmtSF'].isnull()]

| Id | OverallQual | GrLivArea | GarageCars | TotalBsmtSF | FullBath | YearBuilt |
| --- | --- | --- | --- | --- | --- | --- |
| 2121 | 4 | 896 | 1.0 | NaN | 1 | 1946 |


In [330]:
# Fill missing values with the column mean
# (Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement)
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_test["TotalBsmtSF"]=imp.fit_transform(X_test[["TotalBsmtSF"]]).ravel()

# total_bsmtsf_mean = X_test['TotalBsmtSF'].mean()
# X_test.loc[X_test['TotalBsmtSF'].isnull(), 'TotalBsmtSF'] = total_bsmtsf_mean

In [331]:
# Check the rows where GarageCars is null
X_test[X_test['GarageCars'].isnull()]

| Id | OverallQual | GrLivArea | GarageCars | TotalBsmtSF | FullBath | YearBuilt |
| --- | --- | --- | --- | --- | --- | --- |
| 2577 | 5 | 1828 | NaN | 859.0 | 2 | 1923 |


In [332]:
# Fill missing values with the column mean
# (Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement)
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_test["GarageCars"]=imp.fit_transform(X_test[["GarageCars"]]).ravel()

# garagecars_mean = X_test['GarageCars'].mean()
# X_test.loc[X_test['GarageCars'].isnull(), 'GarageCars'] = garagecars_mean

### feature scaling

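Scaling (standardization) is a preprocessing step that applies a linear transformation to each feature so that, across the whole dataset, it has mean 0 and variance 1.

The LinearRegression model does not scale features automatically; because its predictions are invariant to a linear rescaling of the features, scaling does not change the result.

Normalization, unlike scaling, is a transformation that gives every individual sample the same magnitude. A different scaling factor is therefore applied to each sample.

Normalization is used when, for a multidimensional feature vector, only the relative sizes of the vector's elements matter.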
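
The difference between the two transformations can be seen on a tiny array. The sketch below uses scikit-learn's `StandardScaler` (column-wise) and `Normalizer` (row-wise); the input values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Three samples (rows), two features (columns); arbitrary values
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 900.0]])

# Scaling: each COLUMN is shifted and rescaled to mean 0, variance 1
X_std = StandardScaler().fit_transform(X)

# Normalization: each ROW is divided by its own L2 norm
X_norm = Normalizer(norm="l2").fit_transform(X)

col_means = X_std.mean(axis=0)
col_stds = X_std.std(axis=0)
row_norms = np.linalg.norm(X_norm, axis=1)
print(col_means, col_stds, row_norms)
```

After scaling, every column has mean 0 and standard deviation 1; after normalization, every row has length 1, so each sample was divided by a different factor.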

Check whether scaling the features (StandardScaler) affects the result

In [309]:
# from sklearn.preprocessing import StandardScaler
# sc_X = StandardScaler()
# X_train = sc_X.fit_transform(X_train)
#X_test = sc_X.transform(X_test)

### One hot encoding

OverallQual is numeric, but it should be treated as a categorical variable.

Compare the results with and without one-hot encoding
- There are many ways to do one-hot encoding

One-hot encoding turned out to give a better result.
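
As one example of an alternative encoding route, the sketch below uses `pandas.get_dummies` on two small invented frames and then aligns the test columns to the train columns, so that both end up with the same layout even when a quality level appears in only one of them. The frame names and values are hypothetical.

```python
import pandas as pd

# Invented rows; note that level 6 appears only in test and 7, 8 only in train
train_toy = pd.DataFrame({"OverallQual": [5, 7, 8], "GrLivArea": [896, 1710, 2198]})
test_toy = pd.DataFrame({"OverallQual": [5, 6], "GrLivArea": [1329, 1604]})

# One indicator column per OverallQual level
train_enc = pd.get_dummies(train_toy, columns=["OverallQual"], dtype=int)
test_enc = pd.get_dummies(test_toy, columns=["OverallQual"], dtype=int)

# Give test exactly the train columns: missing levels become 0,
# levels never seen in train are dropped
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```

Aligning on the train columns matters because the model is fitted on the train layout; an encoder fitted separately on the test set could produce a different number or order of columns.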

In [310]:
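# Use OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Encode only the first column, OverallQual; the categorical_features argument
# was removed in scikit-learn 0.22, so ColumnTransformer selects the column instead
ohe = ColumnTransformer([("overallqual", OneHotEncoder(handle_unknown="ignore"), [0])],
                        remainder="passthrough", sparse_threshold=0)
# Fit the encoder on the train data and encode it
X_train = ohe.fit_transform(X_train)
# Encode the test data with the categories learned from train (do not refit)
X_test = ohe.transform(X_test)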

### model

In [311]:
# Select the label (target, dependent variable) column used for training
label_name = "SalePrice"

# Prepare the label (target, dependent variable) data used for training
y_train = np.log(train[label_name])

print(y_train.shape)
y_train.head()

(1458,)


Id
1    12.247694
2    12.109011
3    12.317167
4    11.849398
5    12.429216
Name: SalePrice, dtype: float64

In [312]:
# Multiple Linear Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()

### Scoring

Kaggle scores submissions by RMSE on the log-transformed SalePrice, so an RMSE function is implemented and used here.
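
To make the metric concrete, here is a small worked example. The helper name `rmse_log` and the sample prices are hypothetical; the function takes prices on the original scale, applies the log itself, and then computes the RMSE.

```python
import numpy as np


def rmse_log(predicted_prices, actual_prices):
    """RMSE between the natural logs of predicted and actual prices."""
    log_pred = np.log(np.asarray(predicted_prices, dtype=float))
    log_actual = np.log(np.asarray(actual_prices, dtype=float))
    return float(np.sqrt(np.mean((log_pred - log_actual) ** 2)))


actual = [100000.0, 200000.0, 300000.0]

# Every prediction is 10% too high, whatever the price level
over_by_10pct = [price * 1.1 for price in actual]

score = rmse_log(over_by_10pct, actual)
perfect = rmse_log(actual, actual)
print(score, perfect)
```

Because the error is measured on logs, a uniform 10% overestimate scores log(1.1) for cheap and expensive houses alike, so relative errors rather than absolute dollar errors drive the score.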

In [314]:
# Implement the RMSE function
import numpy as np
from sklearn.metrics import make_scorer

def rmse(predict, actual):
    predict = np.array(predict)
    actual = np.array(actual)
  
    difference = predict - actual
    difference = np.square(difference)
    
    mean_difference = difference.mean()
    
    score = np.sqrt(mean_difference)
    
    return score

rmse_scorer = make_scorer(rmse)
rmse_scorer

make_scorer(rmse)

In [315]:
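# cross_val_score lives in sklearn.model_selection
# (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import cross_val_score

score = cross_val_score(model, X_train, y_train, cv=10,
                        scoring=rmse_scorer).mean()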

print("Score = {0:.5f}".format(score))

Score = 0.15624
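
As an alternative to a hand-written scorer, recent scikit-learn versions (0.22 and later) ship a built-in `"neg_root_mean_squared_error"` scoring string. The sketch below applies it to synthetic data with a shuffled 10-fold split; the data and coefficients are made up, and the target is assumed to be already on the log scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic features and a log-scale-like target (hypothetical values)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([0.3, -0.2, 0.1]) + 12.0 + rng.normal(0, 0.15, 120)

# Shuffled 10-fold split with a fixed seed for reproducibility
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# The built-in scorer returns NEGATIVE RMSE so that larger is always better
neg_rmse = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_root_mean_squared_error"
)
cv_rmse = -neg_rmse.mean()
print("CV RMSE = {0:.5f}".format(cv_rmse))
```

The sign convention matters mainly for model selection tools such as grid search, which maximize the score; a custom error scorer built with `make_scorer` would need `greater_is_better=False` to be used there.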


## Train

In [313]:
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

## Predict

In [316]:
# The target was log-transformed for RMSE scoring, so undo the log to get prices back
predictions = np.exp(model.predict(X_test))
print(predictions.shape)
predictions

(1459,)


array([ 116069.34417955,  150939.51054938,  168696.66486261, ...,
        145626.46183705,  121105.09272965,  235176.59102407])

## Submit (for Kaggle submission)

In [317]:
submission = pd.read_csv("input/sample_submission.csv", index_col="Id")

submission["SalePrice"] = predictions

print(submission.shape)
submission.head()

(1459, 1)


| Id | SalePrice |
| --- | --- |
| 1461 | 116069.34418 |
| 1462 | 150939.510549 |
| 1463 | 168696.664863 |
| 1464 | 178469.44625 |
| 1465 | 203729.14044 |


In [318]:
# Add a timestamp to the filename so that saved files can be told apart
from datetime import datetime

current_date = datetime.now()
current_date = current_date.strftime("%Y-%m-%d_%H-%M-%S")

description = "linear-regression"

filename = "{date}_{desc}_{score}.csv".format(date=current_date, desc=description, score="{0:.5f}".format(score))
filepath = "output/{filename}".format(filename=filename)

submission.to_csv(filepath)