The dataset has 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa.
Aim: predict the final sale price of each home.

### Setup imports

In [29]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# from sklearn.impute import SimpleImputer  # needs scikit-learn 0.20+; commented out because it raised the error below

ModuleNotFoundError: No module named 'sklearn.impute'

### Read data

In [2]:
X_full = pd.read_csv('train.csv', index_col='Id')
X_test_full = pd.read_csv('test.csv', index_col='Id')

### Choose the Features from the Data Set

In [3]:
# Obtain target and predictors
y = X_full.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = X_full[features].copy()
X_test = X_test_full[features].copy()

### Split the training set into training and validation sets in a 4:1 ratio

In [4]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

### Check the data sizes

In [7]:
print('X_full -- '+str(X_full.shape[0]))
print('X_test_full -- '+str(X_test_full.shape[0]))
print('X -- '+str(X.shape[0]))
print('X_train -- '+str(X_train.shape[0]))
print('X_valid -- '+str(X_valid.shape[0]))


X_full -- 1460
X_test_full -- 1459
X -- 1460
X_train -- 1168
X_valid -- 292


### Use Different Random Forest Regressor Models

In [10]:
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
# Note: criterion='mae' was renamed to 'absolute_error' in scikit-learn 1.0 and removed in 1.2
model_3 = RandomForestRegressor(n_estimators=100, criterion='mae', random_state=0)
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)

In [11]:
models = [model_1, model_2, model_3, model_4, model_5]

### The best model will have the lowest mean absolute error (MAE) on the validation set.

In [22]:
def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    model.fit(X_t, y_t)  # fit the model on the training data
    preds = model.predict(X_v)  # predict on the validation data
    return mean_absolute_error(y_v, preds)

In [23]:
for i, model in enumerate(models, start=1):
    mae = score_model(model)
    print("Model %d MAE: %d" % (i, mae))

Model 1 MAE: 24015
Model 2 MAE: 23740
Model 3 MAE: 23734
Model 4 MAE: 23996
Model 5 MAE: 23706
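
With scores this close, picking the winner by eye is error-prone. A minimal sketch, assuming the truncated MAE values printed above are copied into a hypothetical `maes` dict (that name is not part of the notebook), selects the lowest score programmatically:

```python
# MAE values copied from the output above (truncated to integers by "%d")
maes = {
    "model_1": 24015,
    "model_2": 23740,
    "model_3": 23734,
    "model_4": 23996,
    "model_5": 23706,
}

# min() with key=maes.get returns the name whose MAE is smallest
best_name = min(maes, key=maes.get)
print(best_name, maes[best_name])  # model_5 23706
```

In the notebook itself the same idea could be applied to the `models` list by storing each model's `score_model` result alongside it.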


In [24]:
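# Choose a model to carry forward.
# Note: Model 5 has the lowest validation MAE above (23706); model_3 (23734) is a
# close second and is the one used in the rest of this notebook.
best_model = model_3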

In [25]:
# Refit the chosen model on all labelled data (training + validation)
best_model.fit(X, y)

# Generate test predictions
preds_test = best_model.predict(X_test)

# Save predictions in format used for competition scoring
output = pd.DataFrame({'Id': X_test.index, 'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

### Dealing with Missing Values

### Drop the columns with missing values

In [27]:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_model(best_model, reduced_X_train, reduced_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop columns with missing values):
23734.74023972603
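
This MAE matches Model 3's earlier validation score, which is what would happen if `cols_with_missing` were empty and no columns were dropped. A quick way to check is to count the missing values per column. The sketch below uses a small made-up frame (the `demo` name and its values are illustrative, not read from the dataset files) rather than `X_train`:

```python
import numpy as np
import pandas as pd

# Illustrative frame: one complete column, one column with a gap
demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],
})

# Count missing values per column, then keep only columns with at least one
missing_counts = demo.isnull().sum()
cols_with_missing_demo = missing_counts[missing_counts > 0].index.tolist()
print(cols_with_missing_demo)  # ['LotFrontage']
```

Running the same two steps on `X_train` would show whether any of the seven selected features actually has gaps.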


### Use of Imputer

Imputation fills in each missing value with a number, such as the mean of that column.
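
As a minimal sketch of mean imputation, assuming scikit-learn 0.20 or later (where `sklearn.impute.SimpleImputer` is available) and using small made-up frames in place of `X_train` and `X_valid` (the `*_demo` names and values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative stand-ins for the training and validation features
X_train_demo = pd.DataFrame({
    "LotArea": [8000.0, np.nan, 10000.0],
    "FullBath": [1.0, 2.0, np.nan],
})
X_valid_demo = pd.DataFrame({"LotArea": [np.nan], "FullBath": [2.0]})

# Learn the column means from the training data only
imputer = SimpleImputer(strategy="mean")
imputed_train = pd.DataFrame(
    imputer.fit_transform(X_train_demo),
    columns=X_train_demo.columns,
    index=X_train_demo.index,
)

# Reuse the training means for validation data (transform, not fit_transform)
imputed_valid = pd.DataFrame(
    imputer.transform(X_valid_demo),
    columns=X_valid_demo.columns,
    index=X_valid_demo.index,
)

print(imputed_train)
print(imputed_valid)
```

Fitting the imputer on the training split alone keeps validation data from influencing the fill values. The imputed frames could then be passed to `score_model` in the same way as the reduced frames above.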