# Predict the Sale Price

## Objectives:

- Our objective for this notebook is to:
    - Train an ML pipeline using hyperparameter optimization.
    - Use the best features to predict a property sale price.

## Tasks:

- Load the data.
- Create the ML regressor pipelines.
- Split the train and test sets.
- Run Grid Search CV (Sklearn).

### Inputs:

- outputs/datasets/cleaned/clean_house_price_records.csv

### Outputs:

- Train set (Features and target).
- Test set (Features and target).
- ML Pipeline to predict the sale price.
- Feature importance Plot.

### Additional comments:

+ This notebook follows the guidelines provided in walkthrough project 2, 'Churnometer'.
+ This notebook relates to the Modelling and Evaluation steps of the CRISP-DM methodology.
+ This notebook and the ones that follow represent the learning outcomes of the Code Institute Predictive Analytics and Machine Learning module.

___

## Change the working directory:

- In the following steps we will modify the working directory from its current folder to its parent folder.
- Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

Make the parent of the current directory the new current directory:

- os.path.dirname() gets the parent directory.
- os.chdir() defines the new current directory.

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory, congrats!")

The following command will confirm the new current directory: 

In [None]:
current_dir = os.getcwd()
current_dir
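
To see what `os.path.dirname()` does without touching the real working directory, the sketch below applies it to a made-up path string. The path `/workspace/project/jupyter_notebooks` is an assumption for illustration only, not the project's actual location.

```python
import os

# Hypothetical notebook folder, used only for illustration.
notebook_dir = "/workspace/project/jupyter_notebooks"

# os.path.dirname() drops the last path component, which yields the parent folder.
parent_dir = os.path.dirname(notebook_dir)
print(parent_dir)
```

Running the `os.chdir()` cell more than once moves one folder further up each time, so it is worth re-checking `os.getcwd()` after a kernel restart.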

___

## Import the necessary packages and set the pandas display options:

In [None]:
import numpy as np
import pandas as pd
pd.options.display.max_columns = None
pd.options.display.max_rows = None

____

## Load the house price records from our cleaned dataset.

- Read the cleaned house price records CSV into a pandas DataFrame.

In [None]:
df = pd.read_csv("outputs/datasets/cleaned/clean_house_price_records.csv")
print(df.shape)
df.head()

___

## ML Regressor pipeline.

- We will create an ML pipeline that combines feature engineering, feature scaling, feature selection and the regressor.

In [None]:
from sklearn.pipeline import Pipeline

### Feature Engineering
from feature_engine import transformation as vt
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

selection_method = "cardinality"
corr_method = "spearman"


def PipelineOptimization(model):
    pipeline_base = Pipeline([

      ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                   variables=['BsmtExposure',
                                                              'BsmtFinType1',
                                                              'GarageFinish',
                                                              'KitchenQual'])),

      ("NumericLogTransform", vt.LogTransformer(variables=['1stFlrSF',
                                                           'LotArea',
                                                           'GrLivArea'])),
      ("NumericPowerTransform", vt.PowerTransformer(variables=['MasVnrArea'])),
      ("NumericYeoJohnsonTransform",
       vt.YeoJohnsonTransformer(variables=['OpenPorchSF'])),

      ("SmartCorrelatedSelection",
       SmartCorrelatedSelection(variables=None,
                                method=corr_method,
                                threshold=0.6,
                                selection_method=selection_method
                                )),

      ("feat_scaling", StandardScaler()),

      ("feat_selection",  SelectFromModel(model)),

      ("model", model),

    ])

    return pipeline_base

Note: `pandas.Int64Index` is deprecated and was removed in pandas 2.0, so the import below only works on older pandas versions. It will need to be removed when the project's pandas version is upgraded.

- from pandas import MultiIndex, Int64Index.

In [None]:
from pandas import MultiIndex, Int64Index

### Hyperparameter Optimization:

In [None]:
from sklearn.model_selection import GridSearchCV

class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = PipelineOptimization(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring)
            gs.fit(X,y)
            self.grid_searches[key] = gs    

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params,**d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params),1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches
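
The reshaping inside `score_summary` is easier to follow on a small example. The sketch below mimics the `split{i}_test_score` arrays that `GridSearchCV.cv_results_` stores, one array per fold with one entry per parameter candidate. The scores and the reduced fold count are invented for illustration.

```python
import numpy as np

# Invented stand-in for GridSearchCV.cv_results_: 3 folds, 2 parameter candidates.
fake_cv_results = {
    "split0_test_score": np.array([0.80, 0.70]),
    "split1_test_score": np.array([0.84, 0.74]),
    "split2_test_score": np.array([0.82, 0.72]),
}
n_folds, n_candidates = 3, 2

# Turn each fold's scores into a column, then place the columns side by side.
columns = [
    fake_cv_results[f"split{i}_test_score"].reshape(n_candidates, 1)
    for i in range(n_folds)
]
all_scores = np.hstack(columns)  # shape: (n_candidates, n_folds)

# Each row now holds every fold score for one candidate.
for candidate_scores in all_scores:
    print(candidate_scores.min(), candidate_scores.mean(), candidate_scores.max())
```

Note that the loop `range(self.grid_searches[k].cv)` in the class only works when `cv` is passed as an integer, as it is throughout this notebook.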

### Split the Train and Test set:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
                                    df.drop(['SalePrice'], axis=1),
                                    df['SalePrice'],
                                    test_size=0.2,
                                    random_state=0
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

____

## Grid Search CV - Sklearn.

- We will use each algorithm's default hyperparameters to find the most suitable algorithm.

In [None]:
models_quick_search = {
    'LinearRegression': LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "XGBRegressor": XGBRegressor(random_state=0),
}

params_quick_search = {
    'LinearRegression': {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}

- Run the quick search with the default hyperparameters:

In [None]:
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

*Check results:*

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

- My niece has agreed on a requirement of an R2 score of at least 0.75.
- The best result is RandomForestRegressor, with a mean R2 score of 0.844212 across the 5 folds.
- We will apply an extensive search to try to improve the score.

___

## Extensive Search

- We run an extensive search on the most suitable model (RandomForestRegressor) to find its best hyperparameter configuration.

**Define the model and parameters for extensive search.**

In [None]:
models_search = {
    "RandomForestRegressor":RandomForestRegressor(random_state=0),
}

params_search = {
    "RandomForestRegressor":{
        'model__n_estimators': [100,700],
        'model__max_depth': [6,18,None],
    }
}
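
A grid search fits every combination in the grid once per fold, so it helps to count the fits before starting. The sketch below enumerates the grid above with `itertools.product`; it only counts candidates and does not train anything. The `model__` prefix in the keys is how scikit-learn routes a parameter to the pipeline step named `model`.

```python
from itertools import product

param_grid = {
    "model__n_estimators": [100, 700],
    "model__max_depth": [6, 18, None],
}
cv_folds = 5  # same value passed as cv in the search cells

# Every combination of one value per parameter is a candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

total_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", total_fits, "cross-validation fits")
```

By default `GridSearchCV` also refits the best candidate on the whole training set, which adds one more fit on top of the cross-validation fits.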

**Extensive GridSearch CV**

In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

Check the results:

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

- After the extensive search, the mean R2 score improved only marginally, from 0.844212 to 0.844596.
- This is above the R2 score of 0.75 that my niece requires.

___

## Select the best model:

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

- Parameters for the best model.

In [None]:
best_parameters = grid_search_pipelines[best_model].best_params_
best_parameters

- Define the best regressor, based on the search.

In [None]:
best_regressor_pipeline = grid_search_pipelines[best_model].best_estimator_
best_regressor_pipeline

### Assess feature importance:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

%matplotlib inline

# after data cleaning and feature engineering, the feature space may change
data_cleaning_feat_eng_steps = 5 # number of data cleaning and feature engineering steps in the pipeline
columns_after_data_cleaning_feat_eng = (Pipeline(best_regressor_pipeline.steps[:data_cleaning_feat_eng_steps])
                                        .transform(X_train)
                                        .columns)

best_features = columns_after_data_cleaning_feat_eng[best_regressor_pipeline['feat_selection'].get_support()].to_list()

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
          'Feature': columns_after_data_cleaning_feat_eng[best_regressor_pipeline['feat_selection'].get_support()],
          'Importance': best_regressor_pipeline['model'].feature_importances_})
  .sort_values(by='Importance', ascending=False)
  )

# Most important features statement and plot (features listed from most to least important)
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance['Feature'].to_list()}")

df_feature_importance.plot(kind='bar',x='Feature',y='Importance')
plt.show()
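
The indexing with `get_support()` above relies on boolean masking: `SelectFromModel.get_support()` returns one boolean per input column, and indexing the column names with it keeps only the selected ones. A minimal sketch follows, with a made-up mask and made-up importance values; only the column names come from this notebook.

```python
import numpy as np

# Column names taken from the pipeline above; mask and importances are invented.
columns = np.array(["1stFlrSF", "LotArea", "GrLivArea", "MasVnrArea", "OpenPorchSF"])
support_mask = np.array([True, False, True, False, False])  # stand-in for get_support()
importances = np.array([0.35, 0.65])  # stand-in for feature_importances_

# Boolean indexing keeps the names where the mask is True.
selected = columns[support_mask]

# Pair each selected name with its importance and sort from high to low.
ranked = sorted(zip(selected, importances), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

The number of importances must equal the number of `True` entries in the mask, because the final model is trained only on the selected columns.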

### Evaluate pipeline performance:


In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error 
import numpy as np

def regression_performance(X_train, y_train, X_test, y_test, pipeline):
  print("Model Evaluation \n")
  print("* Train Set")
  regression_evaluation(X_train, y_train, pipeline)
  print("* Test Set")
  regression_evaluation(X_test, y_test, pipeline)

def regression_evaluation(X,y,pipeline):
  prediction = pipeline.predict(X)
  print('R2 Score:', r2_score(y, prediction).round(3))  
  print('Mean Absolute Error:', mean_absolute_error(y, prediction).round(3))  
  print('Mean Squared Error:', mean_squared_error(y, prediction).round(3))  
  print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y, prediction)).round(3))
  print("\n")


def regression_evaluation_plots(X_train, y_train, X_test, y_test, pipeline, alpha_scatter=0.5):
  pred_train = pipeline.predict(X_train)
  pred_test = pipeline.predict(X_test)


  fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,6))
  sns.scatterplot(x=y_train , y=pred_train, alpha=alpha_scatter, ax=axes[0])
  sns.lineplot(x=y_train , y=y_train, color='red', ax=axes[0])
  axes[0].set_xlabel("Actual")
  axes[0].set_ylabel("Predictions")
  axes[0].set_title("Train Set")

  sns.scatterplot(x=y_test , y=pred_test, alpha=alpha_scatter, ax=axes[1])
  sns.lineplot(x=y_test , y=y_test, color='red', ax=axes[1])
  axes[1].set_xlabel("Actual")
  axes[1].set_ylabel("Predictions")
  axes[1].set_title("Test Set")

  plt.show()
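
For reference, the four metrics printed by `regression_evaluation` can be reproduced by hand. The sketch below uses NumPy and three invented prices. It follows the standard definitions, with R2 computed as one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Invented actual and predicted values, for illustration only.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])

residuals = y_true - y_pred
mae = np.mean(np.abs(residuals))   # Mean Absolute Error
mse = np.mean(residuals ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)                # Root Mean Squared Error

ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

MAE and RMSE are expressed in the same unit as the sale price, which makes them easier to relate to the business case than MSE.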

- Evaluate performance on the train and test sets.

In [None]:
regression_performance(X_train, y_train, X_test, y_test, best_regressor_pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test, best_regressor_pipeline)

___

## Applying regressor with PCA.

- We will explore potential values for PCA n_components.

In [None]:
pipeline = PipelineOptimization(model=LinearRegression())
pipeline_pca = Pipeline(pipeline.steps[:6])
df_pca = pipeline_pca.fit_transform(df.drop(['SalePrice'],axis=1))

print(df_pca.shape,'\n', type(df_pca))

- Apply PCA separately to the scaled data.

In [None]:
import numpy as np
from sklearn.decomposition import PCA

n_components = 17

def pca_components_analysis(df_pca,n_components): 
    pca = PCA(n_components=n_components).fit(df_pca)
    x_PCA = pca.transform(df_pca) # array with transformed PCA

    ComponentsList = ["Component " + str(number) for number in range(n_components)]
    dfExplVarRatio = pd.DataFrame(
        data= np.round(100 * pca.explained_variance_ratio_ ,3),
        index=ComponentsList,
        columns=['Explained Variance Ratio (%)'])

    dfExplVarRatio['Accumulated Variance'] = dfExplVarRatio['Explained Variance Ratio (%)'].cumsum() 

    PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum()

    print(f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the variance in the data \n")
    plt.figure(figsize=(12,5))
    sns.lineplot(data=dfExplVarRatio,  marker="o")
    plt.xticks(rotation=90)
    plt.yticks(np.arange(0, 110, 10))
    plt.show()

pca_components_analysis(df_pca=df_pca,n_components=n_components)

In [None]:
n_components = 9
pca_components_analysis(df_pca=df_pca,n_components=n_components)
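
The plots above are read by eye to pick `n_components`. One way to make that choice reproducible is to take the smallest number of components whose accumulated variance reaches a target. The sketch below does this with NumPy on random data. The 90% target and the data are assumptions for illustration, and the notebook does not state which criterion led to 9 components.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for the scaled feature matrix: 200 rows, 6 columns with different spreads.
X = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])

# PCA via SVD of the centred data: squared singular values are proportional
# to the variance explained by each component.
X_centred = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centred, compute_uv=False)
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
accumulated = np.cumsum(explained_ratio)

target = 0.90  # hypothetical threshold
# Index of the first accumulated value that reaches the target, plus one.
n_components_needed = int(np.searchsorted(accumulated, target) + 1)
print(n_components_needed, accumulated.round(3))
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components`, in which case it keeps enough components to explain that share of the variance.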

___

## Rewrite ML Pipeline for Modelling:

In [None]:
from sklearn.decomposition import PCA

selection_method = "cardinality"
corr_method = "spearman"
n_components = 9


def PipelineOptimization(model):
    pipeline_base = Pipeline([

      ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                   variables=['BsmtExposure',
                                                              'BsmtFinType1',
                                                              'GarageFinish',
                                                              'KitchenQual'])),

      ("NumericLogTransform", vt.LogTransformer(variables=['1stFlrSF',
                                                           'LotArea',
                                                           'GrLivArea'])),
      ("NumericPowerTransform", vt.PowerTransformer(variables=['GarageArea',
                                                               'MasVnrArea'])),
      ("NumericYeoJohnsonTransform",
       vt.YeoJohnsonTransformer(variables=['OpenPorchSF'])),

      ("SmartCorrelatedSelection",
       SmartCorrelatedSelection(variables=None,
                                method=corr_method,
                                threshold=0.6,
                                selection_method=selection_method
                                )),

      ("feat_scaling", StandardScaler()),

      ("PCA", PCA(n_components=n_components, random_state=0)),

      ("model", model),

    ])

    return pipeline_base

- Grid search cv - sklearn.

In [None]:
print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

- Use the default hyperparameters to find the most suitable model.

In [None]:
models_quick_search = {
    'LinearRegression': LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "XGBRegressor": XGBRegressor(random_state=0),
}

params_quick_search = {
    'LinearRegression': {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}

- Run a quick optimization search:

In [None]:
quick_search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
quick_search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

- Check the results:

In [None]:
grid_search_summary, grid_search_pipelines = quick_search.score_summary(sort_by='mean_score')
grid_search_summary

- We need to conduct an extensive search on the most suitable model to find the best hyperparameter configuration: 
- Define model and parameters for extensive search:

In [None]:
models_search = {
    "RandomForestRegressor":RandomForestRegressor(random_state=0),
}

params_search = {
    "RandomForestRegressor":{
        'model__n_estimators': [100,700],
        'model__max_depth': [6,18,None],
    }
}

- Extensive GridSearch CV

In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train, scoring = 'r2', n_jobs=-1, cv=5)

- Check the results:

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

- Check the best model:

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

- Parameters for best model:

In [None]:
grid_search_pipelines[best_model].best_params_

- Define the best regressor model:

In [None]:
best_regressor_pipeline_pca = grid_search_pipelines[best_model].best_estimator_
best_regressor_pipeline_pca

___

### Evaluating the regressor on the Train and Test set.

In [None]:
regression_performance(X_train, y_train, X_test, y_test,best_regressor_pipeline_pca)
regression_evaluation_plots(X_train, y_train, X_test, y_test,
                            best_regressor_pipeline_pca)

____

### Subset Best Features:

In [None]:
X_train = X_train.filter(best_features)
X_test = X_test.filter(best_features)

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:", X_test.shape, y_test.shape)
X_train.head()

### Refit the pipeline.

- We will refit the pipeline with the best features: 

In [None]:
def PipelineOptimization(model):
    pipeline_base = Pipeline([

      ("NumericLogTransform", vt.LogTransformer(variables=['1stFlrSF',
                                                           'GrLivArea'])),

      ("feat_scaling", StandardScaler()),

      ("model", model),

    ])

    return pipeline_base

Grid Search CV - Sklearn.

- We will use the same model as in the previous Grid Search CV:

In [None]:
models_search

- We will use the best parameters from the previous Grid Search CV:

In [None]:
best_parameters

- We will enter the above parameters manually:

In [None]:
params_search = {
    'RandomForestRegressor':  {
        'model__max_depth': [18],
        'model__n_estimators': [100]
        }
    }
params_search

- GridSearch CV

In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

- Check the results:

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

- Checking the best model:

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

- Define the best regressor pipeline (RandomForestRegressor)

In [None]:
best_regressor_pipeline = grid_search_pipelines[best_model].best_estimator_
best_regressor_pipeline

### Assess feature importance: 

In [None]:
data_cleaning_feat_eng_steps = 1 # number of data cleaning and feature engineering steps in the pipeline
columns_after_data_cleaning_feat_eng = (Pipeline(best_regressor_pipeline.steps[:data_cleaning_feat_eng_steps])
                                        .transform(X_train)
                                        .columns)

best_features = columns_after_data_cleaning_feat_eng

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
          'Feature': columns_after_data_cleaning_feat_eng,
          'Importance': best_regressor_pipeline['model'].feature_importances_})
  .sort_values(by='Importance', ascending=False)
  )

# Most important features statement and plot (features listed from most to least important)
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance['Feature'].to_list()}")

df_feature_importance.plot(kind='bar',x='Feature',y='Importance')
plt.show()

Evaluate the performance on the Train and Test sets:

In [None]:
regression_performance(X_train, y_train, X_test, y_test, best_regressor_pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test, best_regressor_pipeline)

___

## Save the dataset and models:

- We will save the following files:
    - Train set
    - Test set
    - Modelling pipeline
    - Feature importance plot
    - Regression evaluation plots

In [None]:
import os

version = 'v2'
file_path = f'outputs/ml_pipeline/predict_saleprice/{version}'

try:
  os.makedirs(name=file_path)
except Exception as e:
  print(e)
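
The `try`/`except` above mainly guards against the folder already existing. A shorter alternative is `os.makedirs(..., exist_ok=True)`, sketched below inside a temporary directory so that it does not touch the project's `outputs` folder. The names `demo_version` and `demo_path` are introduced here for illustration only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_root:
    demo_version = "v2"
    demo_path = os.path.join(
        tmp_root, "outputs", "ml_pipeline", "predict_saleprice", demo_version
    )

    # exist_ok=True makes the call safe to repeat: no error if the folder is already there.
    os.makedirs(demo_path, exist_ok=True)
    os.makedirs(demo_path, exist_ok=True)

    folder_created = os.path.isdir(demo_path)

print(folder_created)
```

Unlike a broad `except Exception`, this still raises on unrelated problems such as missing permissions.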

### Train Set: features and the target.

In [None]:
print(X_train.shape)
X_train.head()

In [None]:
X_train.to_csv(f"{file_path}/X_train.csv", index=False)

In [None]:
print(y_train.shape)
y_train.head()

In [None]:
y_train.to_csv(f"{file_path}/y_train.csv", index=False)

### Test set: features and the target.

In [None]:
print(X_test.shape)
X_test.head()

In [None]:
X_test.to_csv(f"{file_path}/X_test.csv", index=False)

In [None]:
print(y_test.shape)
y_test.head()

In [None]:
y_test.to_csv(f"{file_path}/y_test.csv", index=False)

### Modelling pipeline.

In [None]:
best_regressor_pipeline

In [None]:
import joblib
joblib.dump(value=best_regressor_pipeline, filename=f"{file_path}/best_regressor_pipeline.pkl")

### Feature importance plot:

In [None]:
df_feature_importance.to_csv(f"{file_path}/feature_importance.csv", index=False)

In [None]:
df_feature_importance.plot(kind='bar',x='Feature',y='Importance')
plt.savefig(f'{file_path}/feature_importance.png', bbox_inches='tight')

In [None]:
# regression_evaluation_plots(X_train, y_train, X_test, y_test, best_regressor_pipeline)
pred_train = best_regressor_pipeline.predict(X_train)
pred_test = best_regressor_pipeline.predict(X_test)
alpha_scatter=0.5

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,6))
sns.scatterplot(x=y_train , y=pred_train, alpha=alpha_scatter, ax=axes[0])
sns.lineplot(x=y_train , y=y_train, color='red', ax=axes[0])
axes[0].set_xlabel("Actual")
axes[0].set_ylabel("Predictions")
axes[0].set_title("Train Set")

sns.scatterplot(x=y_test , y=pred_test, alpha=alpha_scatter, ax=axes[1])
sns.lineplot(x=y_test , y=y_test, color='red', ax=axes[1])
axes[1].set_xlabel("Actual")
axes[1].set_ylabel("Predictions")
axes[1].set_title("Test Set")

plt.savefig(f'{file_path}/regression_evaluation_plots.png', bbox_inches='tight')


____

## Conclusion

- A regressor model (RandomForestRegressor) has been chosen to predict the sale price for a given property.
- The feature selection and PCA pipelines produce similar results, and both meet business requirement 1.
    - The feature selection pipeline performs better.

- The feature selection pipeline scored an R2 of 0.97 on the train set and 0.78 on the test set.
- We consider business requirement 1 met, as my niece required an R2 score of at least 0.75.

___