*House Prices: Advanced Regression Techniques*
# 01 Only numerical features
---

**Without peeking at solutions 🙈🌁**

This is my first Kaggle competition. How exciting 🔥 I feel pretty uncomfortable attempting this without looking at anyone else’s solution or guidance, but it’s going to be good practice.

My goal here is to follow a basic process: get the data, inspect it, prepare it, then train and evaluate a model. I’m aware I won’t necessarily reach the best score. The point is to practice thinking for myself, looking only at API docs for Python and the libraries, not at any guides.

## Rough process

1. Get the data
2. Inspect the data
3. Prep the data if necessary
4. Choose the features to use in a model
5. Train a model

## 0. Prep 🔪🧅
---

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 1. Get the data
---

In [2]:
train_X = pd.read_csv("train.csv", index_col="Id")
test_X = pd.read_csv("test.csv", index_col="Id")

## 2. Inspect the data
---

In [3]:
train_X

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,8,2007,WD,Normal,175000
1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,Inside,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,4,2010,WD,Normal,142125


### Uncomment the line below to display all columns

Don’t truncate the columns, let me see the whole thing:

In [4]:
# pd.set_option("display.max_columns", None)
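If changing the option for the whole notebook feels heavy-handed, pandas also offers `pd.option_context`, which applies a setting only inside a `with` block. A minimal sketch on a made-up wide frame (the `wide` DataFrame and its `col_*` names are purely illustrative, not competition data):

```python
import numpy as np
import pandas as pd

# Made-up frame with 30 columns, wide enough that pandas would normally truncate it.
wide = pd.DataFrame(np.zeros((3, 30)), columns=[f"col_{i}" for i in range(30)])

default_max = pd.get_option("display.max_columns")

# Inside the block every column is rendered; the previous setting comes back on exit.
with pd.option_context("display.max_columns", None):
    full_repr = repr(wide)

restored_max = pd.get_option("display.max_columns")
```

In a notebook, `display(wide)` inside the `with` block would give the same effect for the rich HTML output.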

## 3. Prep the features to use in a model
---

Let’s focus exclusively on numerical features and see how good a score we can get.

### Select numerical features and drop examples with missing values

In [5]:
train_X = train_X.select_dtypes(include=np.number).dropna()
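To see what that one-liner does step by step, here is a small sketch on a made-up miniature of the training data. The `toy` frame is an assumption for illustration only; it borrows a few column names and values from the table above.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the training data: two numeric columns, one text column.
toy = pd.DataFrame(
    {
        "LotFrontage": [65.0, np.nan, 68.0],
        "LotArea": [8450, 9600, 11250],
        "Street": ["Pave", "Pave", "Pave"],
    },
    index=pd.Index([1, 2, 3], name="Id"),
)

numeric = toy.select_dtypes(include=np.number)  # drops the text column "Street"
complete = numeric.dropna()                     # drops row 2, which contains a NaN

# Worth checking on the real data too: how many examples does dropna() cost?
rows_lost = len(numeric) - len(complete)
```

Comparing `len()` before and after `dropna()` on the real `train_X` shows how much training data this simple approach throws away.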

### Move `SalePrice` to target data variable

The last column of the training data, `SalePrice`, holds our target labels.

In [6]:
train_X, train_y = train_X.iloc[:, :-1], train_X.iloc[:, -1]
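Splitting by position works as long as `SalePrice` stays the last column. A sketch of the same split done by column name, which doesn’t depend on column order (the `toy` frame is made up, with `SalePrice` deliberately placed in the middle):

```python
import pandas as pd

# Made-up frame; SalePrice is intentionally NOT the last column here.
toy = pd.DataFrame(
    {
        "LotArea": [8450, 9600],
        "SalePrice": [208500, 181500],
        "YrSold": [2008, 2007],
    }
)

toy_X = toy.drop(columns="SalePrice")  # every column except the target
toy_y = toy["SalePrice"]               # the target as a Series
```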

## 4. Train a model
---

In [7]:
from sklearn.ensemble import RandomForestRegressor

In [8]:
clf = RandomForestRegressor(random_state=0)

In [9]:
clf.fit(train_X, train_y)

RandomForestRegressor(random_state=0)
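The notebook goes straight from fitting to predicting, so there is no local estimate of how well the model does. One possible way to get one is to hold out part of the training data and score predictions on it. The sketch below uses synthetic data (the `area` and `quality` features and the price formula are invented) and, as far as I know, mirrors the competition metric: RMSE between the logarithms of predicted and actual prices. Treat it as an illustration under those assumptions, not as what this notebook did.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the numerical training features (all values invented).
area = rng.uniform(500, 4000, size=200)
quality = rng.integers(1, 11, size=200)
price = 50 * area + 10_000 * quality + rng.normal(0, 5_000, size=200) + 20_000

demo_X = pd.DataFrame({"area": area, "quality": quality})
demo_y = pd.Series(price, name="price")

# Keep 25% of the rows aside; the model never sees them during fitting.
X_fit, X_val, y_fit, y_val = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_fit, y_fit)
val_pred = model.predict(X_val)

# RMSE on log prices, so errors on cheap and expensive houses weigh similarly.
rmse_log = np.sqrt(mean_squared_error(np.log(y_val), np.log(val_pred)))
```

On the real data, `demo_X` and `demo_y` would be replaced by `train_X` and `train_y`.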

## 5. Get predictions from the model
---

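First, select the numerical features out of the test set and fill missing values with 0. Unlike with the training set, we can’t drop rows here, because the submission needs a prediction for every test `Id`.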

In [10]:
test_X = test_X.select_dtypes(include=np.number).fillna(0)
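Before predicting, it can be worth confirming that the prepared test features line up with the columns the model was trained on and that no missing values remain. A minimal sketch with made-up frames (`toy_train_X` and `toy_test` are illustrative stand-ins):

```python
import numpy as np
import pandas as pd

# Made-up training features (after SalePrice has been split off).
toy_train_X = pd.DataFrame({"LotFrontage": [65.0, 80.0], "LotArea": [8450, 9600]})

# Made-up raw test data with one missing value and one text column.
toy_test = pd.DataFrame(
    {
        "LotFrontage": [np.nan, 81.0],
        "LotArea": [11622, 14267],
        "Street": ["Pave", "Pave"],
    }
)

toy_test_X = toy_test.select_dtypes(include=np.number).fillna(0)

# Same columns, in the same order, as the training features?
same_columns = list(toy_test_X.columns) == list(toy_train_X.columns)

# Any missing values left after filling?
any_missing = bool(toy_test_X.isna().any().any())
```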

In [11]:
predictions = clf.predict(test_X)

In [12]:
print(predictions, predictions.dtype, len(predictions), sep=" ——— ")

[123689.5  155443.25 184799.8  ... 155892.62 125826.83 248012.7 ] ——— float64 ——— 1459


### Save predictions as a CSV submission file

We’ll need a `DataFrame` object with labeled indices, and right now we have a NumPy array.

In [13]:
submission = pd.DataFrame(predictions, index=test_X.index, columns=["SalePrice"])

In [14]:
submission

Id,SalePrice
1461,123689.50
1462,155443.25
1463,184799.80
1464,185041.75
1465,210528.73
...,...
2915,91257.36
2916,90316.50
2917,155892.62
2918,125826.83


In [15]:
submission.to_csv("submission_01.csv")
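A quick sanity check on the file format: because the index is named `Id`, `to_csv` writes it as the first column, so the header should read `Id,SalePrice`. A sketch that writes to an in-memory buffer instead of disk (the `toy_*` names are illustrative; the two values are the first two predictions shown above):

```python
import io

import numpy as np
import pandas as pd

toy_predictions = np.array([123689.5, 155443.25])
toy_index = pd.Index([1461, 1462], name="Id")
toy_submission = pd.DataFrame(
    toy_predictions, index=toy_index, columns=["SalePrice"]
)

# Write to a string buffer so nothing touches the file system.
buffer = io.StringIO()
toy_submission.to_csv(buffer)

lines = buffer.getvalue().splitlines()
header = lines[0]
```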