### Problem Statement

*Predicting housing prices is of interest to potential buyers, sellers, and organizations alike. Multiple online platforms offer, for example, a free “price estimate” based on underlying machine learning models. For this assignment, we are going to build the best machine learning model we can for Ames, Iowa. The data set consists of 79 features that describe the quality and quantity of the properties to base our predictions on.*

# Task 0: Data Preparation

*Note: No code has to be written for the 5 cells below - you may just execute them sequentially. After this, you may move on to **Task 1** on understanding the data.*

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, mean_squared_error
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.linear_model import Lasso, LassoCV, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

from scipy.stats import pearsonr

In [None]:
# All missing data removed/cleaned
housing_df = pd.read_csv("ames_data_no_missing.csv", index_col=0)

In [None]:
#Check the number of dummies to be created
count = [housing_df[col].nunique() for col in housing_df.columns if housing_df[col].dtype==object]
sum(count)

In [None]:
# ensure Python reads the categorical variables as categorical
for column in housing_df.columns:
    if housing_df[column].dtype == 'object':
        housing_df[column] = pd.Categorical(housing_df[column])

In [None]:
#define our RMSE function
def rmse(y_true, y_pred):
    # Root mean squared error between observed and predicted values (works for any split)
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Task 1: Understand the Data
*Take some time to familiarize yourself with the data. It contains information about housing prices in Ames. What are the key variables?*

*You may perform any additional EDA if necessary.*

### 1.1
*What is the distribution of housing prices?*

In [None]:
# The original distribution

##### CODE HERE #####

### 1.2
*What is the variable that has the highest correlation with Housing prices? What are the key drivers behind larger house prices?*
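
One possible way to rank numeric features by their correlation with an outcome is `DataFrame.corrwith`. The sketch below runs on a small synthetic frame; the column names `area`, `age`, `noise` and `price` are made up for illustration and are not columns of the Ames data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical columns, not the Ames data set)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "area": rng.normal(1500, 300, n),
    "age": rng.normal(40, 10, n),
    "noise": rng.normal(0, 1, n),
})
toy["price"] = 100 * toy["area"] - 500 * toy["age"] + rng.normal(0, 5000, n)

# Pearson correlation of every feature with the outcome
corr_with_price = toy.drop(columns="price").corrwith(toy["price"])

# Order by absolute correlation so strong negative drivers are not hidden
order = corr_with_price.abs().sort_values(ascending=False).index
corr_with_price = corr_with_price.loc[order]
print(corr_with_price)
```

Sorting by absolute value is a design choice: it keeps strongly negative correlations near the top of the list instead of at the bottom.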

In [None]:
#Find the correlations of all variables with SalePrice

##### CODE HERE #####

### 1.3
*Create one additional visualization that gives some insight into the data.*

In [None]:
# Create a visualization to highlight any insight - Can be a scatter plot, line plot, box plot, histogram or any other visualization that you might know!

##### CODE HERE #####

# Task 2: Build machine learning models

*Use your knowledge of prediction models to create at least three models that predict housing prices.*

### 2.1 
1. *Create dummies for all the categorical columns.*
2. *Partition your data into training and validation sets (70-30 split, setting the random state to 1).*
3. *Scale the training and validation sets using StandardScaler().*
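
The three steps above can be sketched on a tiny made-up frame. The columns `area`, `style` and `price` are hypothetical, and scaling only the continuous column is an assumption of this sketch rather than a requirement of the task.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not the Ames data set)
toy = pd.DataFrame({
    "area": [1200.0, 1500.0, 900.0, 2000.0, 1750.0,
             1100.0, 1600.0, 1300.0, 1900.0, 1000.0],
    "style": pd.Categorical(["A", "B", "A", "C", "B", "A", "C", "B", "A", "C"]),
    "price": [150.0, 190.0, 110.0, 260.0, 220.0,
              135.0, 205.0, 160.0, 240.0, 125.0],
})

# 1. Dummy-encode the categorical column
X_toy = pd.get_dummies(toy.drop(columns="price"))
y_toy = toy["price"]

# 2. 70-30 split with a fixed random state
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1
)

# 3. Fit the scaler on the training rows only, then reuse it on validation
numeric_cols_toy = ["area"]
X_tr, X_val = X_tr.copy(), X_val.copy()
toy_scaler = StandardScaler()
X_tr[numeric_cols_toy] = toy_scaler.fit_transform(X_tr[numeric_cols_toy])
X_val[numeric_cols_toy] = toy_scaler.transform(X_val[numeric_cols_toy])
```

Fitting the scaler on the training rows alone keeps information about the validation rows out of the preprocessing step.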

In [None]:
# Initialize X and y
X = housing_df.drop(columns=['SalePrice']) # All but the outcome column
y = housing_df['SalePrice']

In [None]:
# Use dummy variables for categorical variables

##### CODE HERE #####

In [None]:
# Train - Test split (70-30 split, setting the random state to 1)

##### CODE HERE #####

In [None]:
# Scale the train and validation features: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
numeric_cols = [col for col in X.columns if X[col].dtypes != 'category']

##### CODE HERE #####

### 2.2
*Build a linear regression model, a regression tree and a kNN model. Carefully apply regularization for the linear regression model. Carefully select which variables to use for the kNN model.*
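
As a rough illustration of the tuning pattern for the three model families, the sketch below uses synthetic data from `make_regression`. The candidate grids for `max_depth` and `n_neighbors` are arbitrary assumptions, not recommended values for the Ames data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (features are already on a comparable scale)
X_syn, y_syn = make_regression(
    n_samples=300, n_features=10, n_informative=4, noise=10.0, random_state=1
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=1
)

# LASSO: LassoCV picks the regularization strength alpha by cross-validation
lasso = LassoCV(cv=5, random_state=1).fit(X_tr, y_tr)

# Regression tree: grid search over max_depth, scored by (negated) RMSE
tree_search = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    param_grid={"max_depth": [2, 4, 6, 8]},
    scoring="neg_root_mean_squared_error",
    cv=5,
).fit(X_tr, y_tr)

# kNN: grid search over the number of neighbors k
knn_search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    scoring="neg_root_mean_squared_error",
    cv=5,
).fit(X_tr, y_tr)

print(lasso.alpha_, tree_search.best_params_, knn_search.best_params_)
```

scikit-learn maximizes scores, which is why the RMSE scorer is negated; the best cross-validated RMSE is `-tree_search.best_score_`.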

In [None]:
# Linear model - Use LassoCV to get the best LASSO model

##### CODE HERE #####

In [None]:
# Tree Model - Use max depth to control the complexity of the tree. Run a Grid search for multiple values of max depth.

##### CODE HERE #####

In [None]:
# KNN Model

# Select the top 20 most correlated features and store them in a list called 'top_20_features' (using a correlation table similar to the one from Task 1)

top_20_features = ##### CODE HERE #####


#For building the model, you must use X_train[top_20_features]

##### CODE HERE #####



# Find the value of k for which RMSE is minimum, using GridSearchCV

##### CODE HERE #####

### 2.3
*Summarize the predictive performance in terms of RMSE.* 
1. *Calculate the RMSE values for train and validation for all the models*
2. *Display them in a tabulated format*

Hint: You may use the code that you've learnt in the 'Model selection' module
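
A minimal sketch of one way to tabulate the results, assuming predictions are already available. The names `model_a` and `model_b` and all numbers are toy placeholders, not results from this assignment.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Toy observed values and toy predictions (placeholders only)
y_true_train = np.array([100.0, 150.0, 200.0, 250.0])
y_true_val = np.array([120.0, 180.0])
toy_predictions = {
    "model_a": (np.array([110.0, 140.0, 210.0, 240.0]), np.array([100.0, 200.0])),
    "model_b": (np.array([100.0, 150.0, 200.0, 250.0]), np.array([150.0, 150.0])),
}

# One row per model, one column per data split
rmse_table = pd.DataFrame({
    name: {
        "train": rmse(y_true_train, pred_train),
        "validation": rmse(y_true_val, pred_val),
    }
    for name, (pred_train, pred_val) in toy_predictions.items()
}).T
print(rmse_table)
```

In this toy table `model_b` fits the training values perfectly but does worse on validation, which is the kind of gap the side-by-side layout makes easy to spot.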

In [None]:
#linear regression

##### CODE HERE #####



#max depth pruned tree

##### CODE HERE #####



#knn

##### CODE HERE #####



#Display the RMSEs

##### CODE HERE #####

### 2.4
*Study the largest errors that you made (largest overpredictions, largest underpredictions). What may be some of the reasons why the model is over/under predicting? Do these insights possibly help you improve the models?*
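
One way to pull out the extreme errors is to sort on the residuals. The sketch below assumes the convention residual = actual - predicted, so positive residuals are underpredictions; the numbers are toy values.

```python
import pandas as pd

# Toy actual and predicted values (placeholders only)
toy_results = pd.DataFrame({
    "actual": [100.0, 200.0, 300.0, 400.0, 500.0],
    "predicted": [130.0, 190.0, 310.0, 320.0, 505.0],
})

# Assumed convention: positive residual means the model predicted too low
toy_results["residual"] = toy_results["actual"] - toy_results["predicted"]

largest_under = toy_results.nlargest(2, "residual")   # biggest underpredictions
largest_over = toy_results.nsmallest(2, "residual")   # biggest overpredictions
print(largest_under)
print(largest_over)
```

Joining these rows back to the original features is usually the next step, since the shared characteristics of the worst-predicted rows often hint at what the model is missing.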

In [None]:
# Visualize the errors - plot a scatterplot of the residuals vs the true SalePrice

##### CODE HERE #####

# Task 3

### 3.1
*Are you able to improve your linear regression model by taking the log of the dependent variable? (remember to translate your predicted outcome back to the original units before calculating the RMSE)*

*Create a visualization that highlights the distribution of prices after taking the log of the dependent variable.*

Hint - You may use [numpy.log()](https://numpy.org/doc/stable/reference/generated/numpy.log.html) to get the log of the dependent variable
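
The mechanics of fitting on the log scale and translating back can be sketched as follows. The data are synthetic and constructed so that the log of the outcome is linear in the features; that is an assumption of the sketch, not a finding about the Ames data.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data where log(price) is linear in the features
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(400, 5))
log_price = 12.0 + X_syn @ np.array([0.3, 0.2, 0.0, -0.1, 0.0]) + rng.normal(0, 0.1, 400)
price = np.exp(log_price)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, price, test_size=0.3, random_state=1
)

# Fit on the log of the outcome; alpha is searched again on this scale
log_model = LassoCV(cv=5, random_state=1).fit(X_tr, np.log(y_tr))

# Back-transform predictions to the original units before computing RMSE
pred_val = np.exp(log_model.predict(X_val))
rmse_val = np.sqrt(mean_squared_error(y_val, pred_val))
print(rmse_val)
```

Because `np.exp` undoes `np.log`, the resulting RMSE is in the same units as the untransformed outcome and can be compared directly with the RMSE of a model fitted on the raw scale.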

In [None]:
# distribution of the transformed SalePrice

##### CODE HERE #####

In [None]:
# Linear model - Using the log of the SalePrice as the dependent variable, run the LassoCV to obtain the best LASSO model
# Note that the optimal alpha found for the original SalePrice does not carry over, because the penalty now acts on a log-scale target. Search for the best alpha again using LassoCV.

##### CODE HERE #####

In [None]:
# Calculate the RMSE values for train and the test set

##### CODE HERE #####

In [None]:
# Display the RMSE values in a dataframe

##### CODE HERE #####

### 3.2 Bonus Task
*Experiment with data segmentation: Should you subset the data and fit separate models for each subset?*

Data segmentation is generally useful when we think that some subsegments of our data have substantially different relationships between their features and the outcome than other subsegments (i.e., variable interactions). We can combine prior knowledge with data exploration to judge where this situation applies.

Starting with prior knowledge, you can hypothesize that $HouseStyle$ may be a candidate for data segmentation. For instance, 3 bedrooms in a 1-story house may have a different effect on $SalePrice$ than 3 bedrooms in a 2-story house.

In [None]:
housing_df['House Style'].value_counts()

In [None]:
housing_df['Bedroom AbvGr'].value_counts()

In [None]:
matrix = []
styles = ['1Story', '2Story', '1.5Fin']
for style in styles:
    curr_style = []
    for bedrooms in range(1, 6):
        curr_mean = housing_df[(housing_df['House Style'] == style) & 
                               (housing_df['Bedroom AbvGr'] == bedrooms)]['SalePrice'].mean()
        
        curr_style.append(curr_mean)
    matrix.append(curr_style)
sns.heatmap(matrix)
plt.ylabel('House Style')
plt.yticks(np.arange(3)+0.5, styles)
plt.xlabel('Bedroom AbvGr')
plt.xticks(np.arange(5)+0.5, np.arange(5)+1)
pass


We do see some interaction between the house style and the number of bedrooms, indicating that data segmentation could be promising.

*From here, it's your task to start building a linear model to see whether data segmentation will improve results.*

Hint: For the first two subtasks in 3.2, you could run a for-loop for each style in HouseStyles and evaluate/create the LASSO model.
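
The loop described in the hint could look roughly like the sketch below. It uses synthetic data with a made-up `segment` column whose two groups have opposite slopes, a fixed `alpha=0.01` chosen arbitrarily, and training RMSE only; a train/validation split is left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Synthetic data: the effect of x on y flips sign between segments
rng = np.random.default_rng(2)
n = 300
toy = pd.DataFrame({
    "segment": rng.choice(["s1", "s2"], size=n),
    "x": rng.normal(size=n),
})
slope = toy["segment"].map({"s1": 1.0, "s2": -1.0})
toy["y"] = slope * toy["x"] + rng.normal(0, 0.1, n)

# Full model: one Lasso fitted on all rows
full_model = Lasso(alpha=0.01).fit(toy[["x"]], toy["y"])

rows = []
for seg, part in toy.groupby("segment"):
    # Segment model: a separate Lasso fitted on this subset only
    seg_model = Lasso(alpha=0.01).fit(part[["x"]], part["y"])
    rows.append({
        "segment": seg,
        "full_model_rmse": np.sqrt(
            mean_squared_error(part["y"], full_model.predict(part[["x"]]))
        ),
        "segment_model_rmse": np.sqrt(
            mean_squared_error(part["y"], seg_model.predict(part[["x"]]))
        ),
    })

comparison = pd.DataFrame(rows).set_index("segment")
print(comparison)
```

Both models are evaluated on the same subset inside the loop, so each row compares like with like. The synthetic interaction here is deliberately extreme; on real data the gain from segmentation can be small or negative because each segment model sees fewer rows.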

In [None]:
# Linear Full Model (FM) - Train a Lasso model for the whole dataset 

##### CODE HERE #####

# Store the RMSE values of train and validation for all the 3 subsets of styles - You can loop through the HouseStyles

##### CODE HERE #####

In [None]:
# Linear Data Segmentation Model (DSM) - Train a Lasso model for the individual subset of styles - 1Story, 2Story and 1.5Fin

##### CODE HERE #####

# Store the RMSE values of train and validation for all the 3 subsets of styles

##### CODE HERE #####

In [None]:
# Create a DataFrame to store the values of RMSE for both the models on the train and validation sets on all the 3 subsets of data

##### CODE HERE #####

*Write down your inferences about the performance of the subsetted model here -* 

...

# Task 4: Summarize your findings
*Now take some time to translate your results into valuable insights.*

### 4.1
*What drives housing prices? Find the top 20 major drivers.*

Hint - In course 3 module 1, you have already seen how to store the coefficients of a model in a dictionary. You can convert the dictionary into a DataFrame and sort the DataFrame by the coefficients. [Here's](https://stackoverflow.com/questions/18837262/convert-python-dict-into-a-dataframe) some guidance on how to convert a dictionary into a DataFrame.
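
A minimal sketch of the dictionary-to-DataFrame step, using a Lasso fitted on synthetic data. The feature names `f1` to `f4` and `alpha=0.1` are illustrative assumptions, and ranking by absolute coefficient is one possible reading of "major drivers".

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: only f1 and f3 actually drive the outcome
rng = np.random.default_rng(3)
X_toy = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f1", "f2", "f3", "f4"])
y_toy = 5.0 * X_toy["f1"] - 2.0 * X_toy["f3"] + rng.normal(0, 0.1, 200)

model = Lasso(alpha=0.1).fit(X_toy, y_toy)

# Store the coefficients in a dictionary keyed by feature name
coef_dict = dict(zip(X_toy.columns, model.coef_))

# Convert to a DataFrame and rank by absolute size
coef_df = pd.Series(coef_dict, name="coefficient").to_frame()
coef_df["abs_coefficient"] = coef_df["coefficient"].abs()
coef_df = coef_df.sort_values("abs_coefficient", ascending=False)
print(coef_df.head(20))
```

Coefficient sizes are only comparable across features when the features share a scale, which is one reason the scaling step earlier in the workflow matters for this ranking.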

In [None]:
# Visualize all the columns and their coefficients sorted in descending order to understand which ones have the most say in the SalePrice
# Hint - Check the code for Course 3 Module 1 - Linear regression in a predictive setting

##### CODE HERE #####

*You can also use the built-in variable importance attribute of decision trees to get a summary of the importance of different features in our regression tree.*

Note: There is no coding to be done in this cell. Just execute this cell and observe the feature importances.

In [None]:
# Extract the feature_importances_ attribute from the tree model (feature_importances_ is available on fitted tree-based sklearn models)

# Extracting the importances by sklearn (Replace tree_reg_best by the variable of your tree model)
importances_sk = tree_reg_best.feature_importances_

# Creating a dataframe with the feature importance by sklearn
feature_importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': np.round(importances_sk, 3),
})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False).reset_index(drop=True)
print("Feature importance by sklearn: ")
feature_importance_df.iloc[:20]

### 4.2
*What is the predictive performance of your models?*

In [None]:
# Compare the RMSE of the train and the validation set for all the models. You can reuse the code from exercise 2.3.


##### CODE HERE #####

*Which model performs the best?*

...

### 4.3
*How reliable are your predictions?*

In [None]:
#Plot a scatterplot of the predicted vs the true value of the SalePrice

##### CODE HERE #####

*A histogram of the errors can also give good insight into any underlying patterns.*

In [None]:
#Plot a histogram of the residuals. 

##### CODE HERE #####