In [16]:
%run utils.ipynb
%run transformers.ipynb

In [17]:
warnings.filterwarnings('ignore')

In [18]:
df=pd.read_csv("../data/train.csv")

#drop identifier columns, which carry no predictive signal
cols_to_drop=["Id","PID"]
df=df.drop(columns=cols_to_drop)

#drop outliers identified in part 1
df=df[df['Gr Liv Area']<=4500]
#keep only rows where log1p(SalePrice) > 10
df=df[df['SalePrice']>np.expm1(10)]

df_copy=df.copy()

In [20]:
features_to_keep=['Gr Liv Area', 'Overall Qual', 'Year Built', 'Total Bsmt SF', 'Fireplace Qu',
 'BsmtFin SF 1', 'Overall Cond', 'Garage Cars', 'Functional', 'Year Remod/Add', 'Exter Qual',
 'Foundation_PConc', 'Garage Type_Attchd', 'Kitchen Qual', 'Garage Type_Detchd', 'Lot Frontage',
 'Lot Area']

cols_to_impute_with_none=["Pool QC","Misc Feature","Alley","Fence","Fireplace Qu",
                          "Garage Finish","Garage Qual","Garage Cond","Garage Type",
                         "Bsmt Exposure","BsmtFin Type 2","Bsmt Cond","Bsmt Qual","BsmtFin Type 1",
                         "Mas Vnr Type"]

cols_to_impute_with_zero=["Garage Yr Blt","Mas Vnr Area","Bsmt Full Bath","Bsmt Half Bath","Garage Area",
                         "Garage Cars","Total Bsmt SF","Bsmt Unf SF","BsmtFin SF 2","BsmtFin SF 1"]

cols_to_impute_with_mode=["Electrical"]

correlated_to_drop=["1st Flr SF","Garage Yr Blt","TotRms AbvGrd","Garage Area"]

In [21]:
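#Create a regressor that wraps a dynamically built pipeline
def create_regressor(model,**kwargs):
    '''
    model: estimator instance used as the final pipeline step
    **kwargs: forwarded to AlignTrainPredict (e.g. feature_names)
    return: unfitted TransformedTargetRegressor that trains on log1p(y) and predicts on the original scale
    '''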
    pipeline=Pipeline(steps=[
        ('drop_correlated',DropCorrelated(correlated_to_drop)),
        ('ms_subclass_convert',MSSubClassConvert()),
        ('standard_impute',StandardImpute(none=cols_to_impute_with_none,
                                          zero=cols_to_impute_with_zero,
                                          mode=cols_to_impute_with_mode)),
        ('lot_frotage_impute',LotFrontageImpute()),
        ('ordinal_to_numerical',OrdinalToNumeric()),
        ('onehotencode',OneHotEncode()),
        ('align_train_and_predict',AlignTrainPredict(**kwargs)), #This is to align the train and predict DF in case they are different
        ('passthrough',Passthrough()), #Passthrough step, does nothing. Only exists to allow external code to retrieve feature names.
        ('robustscalar',RobustScaler()),
        ('model',model)
    ])
    
    return TransformedTargetRegressor(regressor=pipeline,
                                    func=np.log1p,
                                    inverse_func=np.expm1
                                    )
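
As a quick sanity check on the target transform, the sketch below uses only the standard library's `math.log1p` and `math.expm1` (scalar counterparts of the NumPy functions passed as `func` and `inverse_func`). It shows that the two functions invert each other, and that the outlier cutoff `np.expm1(10)` used earlier is equivalent to requiring `log1p(SalePrice) > 10`. The sample prices and helper names are made up for illustration.

```python
import math

# Hypothetical sale prices, for illustration only (not taken from train.csv).
sample_prices = [12_789.0, 135_000.0, 611_657.0]

def to_model_scale(price):
    """Forward transform applied to the target before fitting."""
    return math.log1p(price)

def to_price_scale(value):
    """Inverse transform applied to predictions."""
    return math.expm1(value)

# Round trip: price -> log scale -> price.
round_trip = [to_price_scale(to_model_scale(p)) for p in sample_prices]

# The outlier filter expressed on both scales.
cutoff = math.expm1(10)
kept_on_price_scale = [p for p in sample_prices if p > cutoff]
kept_on_log_scale = [p for p in sample_prices if to_model_scale(p) > 10]

print(round_trip)
print(cutoff, kept_on_price_scale == kept_on_log_scale)
```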

In [22]:
#set up X and Y
X=df.drop(columns="SalePrice")
y=df["SalePrice"]

#make a backup copy
X_copy=X.copy()
y_copy=y.copy()

### Investigate the coefficients

In [23]:
#create regressor and instruct it to keep only the features listed in features_to_keep
reg=create_regressor(LinearRegression(),feature_names=features_to_keep)

reg.fit(X,y)

In [46]:
df_coeff=pd.DataFrame([reg.regressor_['passthrough'].get_feature_names(),
             reg.regressor_['model'].coef_]).T
df_coeff.columns=['Feature','Coefficient']
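#the model is fit on log1p(SalePrice), so expm1 turns each coefficient into an approximate fractional price change per scaled unit
df_coeff['Coefficient']=df_coeff['Coefficient'].map(np.expm1)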
df_coeff=df_coeff.set_index("Feature")
df_coeff.sort_values(by="Coefficient",ascending=False)

| Feature | Coefficient |
|---|---|
| Gr Liv Area | 0.167945 |
| Year Built | 0.136201 |
| Overall Qual | 0.134753 |
| BsmtFin SF 1 | 0.082303 |
| Fireplace Qu | 0.071773 |
| Total Bsmt SF | 0.062022 |
| Overall Cond | 0.050476 |
| Garage Cars | 0.037378 |
| Garage Type_Attchd | 0.033139 |
| Functional | 0.032356 |
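
One way to read these values, sketched below under stated assumptions: the model is linear in `log1p(SalePrice)`, so a one-unit increase in a scaled feature multiplies `1 + SalePrice` by `1 + value`. With `RobustScaler` at its default settings, one scaled unit corresponds to one interquartile range of the raw feature. The base price and the helper names are hypothetical. The three coefficients are copied from the table above.

```python
import math

# Values copied from the coefficient table (already passed through expm1).
effects = {
    "Gr Liv Area": 0.167945,
    "Year Built": 0.136201,
    "Overall Qual": 0.134753,
}

def raw_coefficient(effect):
    """Undo expm1 to recover the coefficient on the log1p(SalePrice) scale."""
    return math.log1p(effect)

def adjusted_price(base_price, effect, units=1.0):
    """Price after moving one feature by `units` scaled units, all else fixed.

    Effects compound multiplicatively because the model is linear in log1p(price).
    """
    return (1.0 + base_price) * (1.0 + effect) ** units - 1.0

base_price = 200_000.0  # hypothetical starting valuation
for name, effect in effects.items():
    print(name, round(raw_coefficient(effect), 4), round(adjusted_price(base_price, effect)))
```

Because the effects are multiplicative, two scaled units of `Gr Liv Area` raise the price by more than twice the one-unit increase.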


For practical purposes, we group the most influential features into two categories:  
**Major features**:<br>
`Gr Liv Area`,  `Year Built`, `Overall Qual`.<br>
**Minor features**:  <br>
`BsmtFin SF 1`, `Fireplace Qu`, `Total Bsmt SF`, `Overall Cond`  
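
The grouping above can be reproduced with two simple cutoffs on the coefficient table. The sketch below is illustrative only: the thresholds `0.10` and `0.05` are assumptions chosen because they happen to recover the same split, not values stated in the analysis.

```python
# Coefficients copied from the table above.
coefficients = {
    "Gr Liv Area": 0.167945,
    "Year Built": 0.136201,
    "Overall Qual": 0.134753,
    "BsmtFin SF 1": 0.082303,
    "Fireplace Qu": 0.071773,
    "Total Bsmt SF": 0.062022,
    "Overall Cond": 0.050476,
    "Garage Cars": 0.037378,
    "Garage Type_Attchd": 0.033139,
    "Functional": 0.032356,
}

MAJOR_THRESHOLD = 0.10  # assumption: not specified in the analysis
MINOR_THRESHOLD = 0.05  # assumption: not specified in the analysis

def tier(value):
    """Assign a coefficient to the 'major', 'minor', or 'other' tier."""
    if value >= MAJOR_THRESHOLD:
        return "major"
    if value >= MINOR_THRESHOLD:
        return "minor"
    return "other"

# Group feature names by tier, preserving the table's order.
groups = {"major": [], "minor": [], "other": []}
for feature, value in coefficients.items():
    groups[tier(value)].append(feature)

for name, features in groups.items():
    print(name, features)
```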

## Recommendation

We recommend a simplified framework for assessing houses that combines the Major and Minor features.  
Separating the features into major and minor tiers lets an assessor judge on the spot how much weight a feature deserves, without a computer.  
  
It will be useful for anyone viewing a house who needs to form a rough valuation of the property on the spot.

## Conclusion

We used several features in an attempt to model the price of a property.  
After confirming that the model was reliable, we extracted the features it identified as significant and built a framework that lets people make valuation decisions without a computer.