<br>

<hr style="border: 1px solid #fdb515;" />

# Question 5

It is time to build your own model!

You will conduct feature engineering on your training data using the `feature_engine_final` function (you will define this in `q5d`), fit the model with this training data, and compute the training Root Mean Squared Error (RMSE). Then, we will process our test data with `feature_engine_final`, use the model to predict `Log Sale Price` for the test data, transform the predicted and original log values back into their original forms (by using `delog`), and compute the test RMSE.

Your goal in Question 5 is to:

* Define a function to perform feature engineering and produce a design matrix for modeling.
* Apply this feature engineering function to the training data and use it to train a model that can predict the `Log Sale Price` of houses.
* Use this trained model to predict the `Log Sale Price`s of the test set. Remember that our test set does not contain the true `Sale Price` of each house; your model is trying to guess them!
* Submit your predicted `Log Sale Price`s on the test set to Gradescope.
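
The workflow in the list above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: `toy_feature_engine` is a hypothetical stand-in for `feature_engine_final`, the data is randomly generated rather than read from the Cook County files, and `np.exp` is used in place of the course's `delog` helper.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing data (not the Cook County files).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({"Building Square Feet": rng.uniform(800, 4000, size=n)})
toy["Sale Price"] = 100 * toy["Building Square Feet"] * np.exp(rng.normal(0, 0.1, size=n))

def toy_feature_engine(df, is_test_set=False):
    """Hypothetical feature function: returns X, or (X, Y) for training data."""
    X = np.log(df[["Building Square Feet"]]).to_numpy()
    if is_test_set:
        return X
    return X, np.log(df["Sale Price"]).to_numpy()

train, test = toy.iloc[:150], toy.iloc[150:]

# Fit on the training design matrix and the log target.
X_train, Y_train = toy_feature_engine(train)
model = LinearRegression(fit_intercept=True).fit(X_train, Y_train)

# Predict log prices for the held-out rows, then undo the log with np.exp.
X_test = toy_feature_engine(test.drop(columns="Sale Price"), is_test_set=True)
log_pred = model.predict(X_test)
price_pred = np.exp(log_pred)

# Training RMSE in dollars, computed after delogging both vectors.
train_rmse = np.sqrt(np.mean((np.exp(Y_train) - np.exp(model.predict(X_train))) ** 2))
print("Training RMSE (dollars):", train_rmse)
```

The feature function returns only `X` for the test set because no `Sale Price` column is available there.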

In [39]:
import numpy as np
import pandas as pd
from sklearn import linear_model as lm
import warnings
warnings.filterwarnings("ignore")
from ds100_utils import *
from feature_func import *
from sklearn.preprocessing import StandardScaler, OneHotEncoder
eda_data = pd.read_csv("cook_county_train_val.csv", index_col='Unnamed: 0')

<br>

---

## Question 5c: Defining Helper Function or Helper Variables

In [41]:
ordinal_cols = ['Repair Condition',  "Garage 1 Size"]

sparse = ['Town and Neighborhood'] 

# one_hot_cols = ["Property Class", "Garage 1 Area"] 

one_hot_cols = [] 

binary = ["Central Air", 
          "O'Hare Noise",
          "Floodplain",
          "Road Proximity",
          "Most Recent Sale",
          'Pure Market Filter']

# Numeric (quantitative) features, despite the variable name.
qualitative = ['Land Square Feet', 
               "Fireplaces", 
               "Building Square Feet",
               "Estimate (Land)",
               "Estimate (Building)",
               "Age",
               "Longitude",
               "Latitude",
               'Lot Size']

X_features = ordinal_cols + sparse + one_hot_cols + binary + qualitative
drop_these_cols = list(set(eda_data.columns).difference(set(X_features)))
drop_these_cols.remove('Sale Price')

In [43]:
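# Target encoding
# Scratch exploration: inspect the distribution of Sale Price.
eda_data['Sale Price'].describe()
# Upper fence under the 1.5 * IQR rule, Q3 + 1.5 * (Q3 - Q1),
# using 45,200 and 312,000 as the quartiles (apparently read off describe()).
(3.120000e+05 - 4.520000e+04) * 1.5 + 3.120000e+05
# Number of sales above that fence (712,200).
np.sum(eda_data['Sale Price'] > 712200.0)
# The cutoff actually used below is 10,000,000.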
def remove_upper_outlier(df):
    data = df.copy()
    data = data[data['Sale Price'] < 10000000.0]
    data = data.reset_index(drop=True)
    return data
    
eda_data_v2 = remove_outliers(eda_data, 'Sale Price', lower=500)
eda_data_v2 = remove_upper_outlier(eda_data_v2)

target_enc_columns = ['Town and Neighborhood']

log_required = ["Land Square Feet", "Building Square Feet", "Estimate (Land)", "Estimate (Building)"] 

col_and_mean = {}
col_and_mapping = {}
for col in target_enc_columns:
    mapping = dict(eda_data_v2.groupby(col)['Sale Price'].mean())
    m = np.mean(list(mapping.values()))
    col_and_mean[col] = m
    col_and_mapping[col] = mapping

def add_target_enc(df):
    data = df.copy()
    for col in target_enc_columns:
        mapper = col_and_mapping[col]
        meaner = col_and_mean[col]
        data[col] = data[col].map(mapper, na_action='ignore')
        data[col] = data[col].fillna(meaner)
        data[col] = np.log(data[col])
    return data
    
OneHotEncoders = {}
for col in one_hot_cols:
    ohe = OneHotEncoder(handle_unknown='ignore')   
    ohe.fit(eda_data[[col]])    
    OneHotEncoders[col] = ohe

def add_one_hot(df):
    data = df.copy()
    for col in one_hot_cols:
        ohe = OneHotEncoders[col]
        encoded_day = ohe.transform(data[[col]]).toarray()
        encoded_day_df = pd.DataFrame(encoded_day, columns=ohe.get_feature_names_out())
        data = data.join(encoded_day_df).drop(columns=col)
    return data

def qualitative_engin(df):
    data = df.copy()
    for col in log_required:
        if data[col].min() <= 0:
            data[col] = data[col] + 0.1
        data[col] = np.log(data[col])
    return data

def feature_pipeline(df):
    data = df.copy()
    data = qualitative_engin(add_target_enc(data))
    data = data.fillna(0.0)
    return data

def add_story(data):
    with_rooms = data.copy()
    with_rooms['story'] = with_rooms['Description'].str.findall(r"([a-zA-Z]+)-story[\w\s]*houeshold").str[0]
    mapping = {'one':1.0, 'two':2.0, 'three':3.0}
    with_rooms['story'] = with_rooms['story'].map(mapping, na_action='ignore')
    with_rooms['story'] = with_rooms['story'].fillna(1.0)
    return with_rooms

def add_bathroom(data):
    with_rooms = data.copy()
    with_rooms['bath_rooms'] = with_rooms['Description'].str.findall(r"([0-9]{1}\.[0-9]{1}) of which are bathrooms").str[0].astype(float)
    return with_rooms

In [45]:
def feature_engine_final(data, is_test_set=False):
    data = data.reset_index(drop=True)
    if not is_test_set:
        # Training data only: drop price outliers and build the log-price target.
        data = remove_outliers(data, 'Sale Price', lower=10000)
        data = data.reset_index(drop=True)
        data = remove_upper_outlier(data)  # also resets the index
        data['Log Sale Price'] = np.log(data['Sale Price'])
    # Feature steps shared by the training and test sets.
    data = add_story(data)
    data = add_bathroom(data)
    data = add_total_bedrooms(data)
    data = feature_pipeline(data)
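    # Caveat: the scaler is re-fit on whatever data is passed in, so the test set
    # is standardized with its own means and variances rather than the training set's.
    scaler = StandardScaler()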
    if is_test_set:
        X = data.drop(drop_these_cols, axis=1, errors='ignore')
        X_hot = add_one_hot(X[one_hot_cols]).to_numpy()
        X_scaler = scaler.fit_transform(X.drop(one_hot_cols, axis=1, errors='ignore'))
        X = np.concatenate((X_scaler, X_hot), axis=1)
        return X
    else:
        drop_these_cols_v2 = drop_these_cols
        drop_these_cols_v2 = drop_these_cols_v2 + ['Sale Price', 'Log Sale Price']
        X = data.drop(drop_these_cols_v2, axis=1, errors='ignore')
        X_hot = add_one_hot(X[one_hot_cols]).to_numpy()
        X_scaler = scaler.fit_transform(X.drop(one_hot_cols, axis=1))
        X = np.concatenate((X_scaler, X_hot), axis=1)
        Y = data['Log Sale Price']  
        return X, Y

check_rmse_threshold = run_linear_regression_test_optim(lm.LinearRegression(fit_intercept=True), feature_engine_final, 'cook_county_train.csv', None, False)
print("Current training RMSE:", check_rmse_threshold.loss)
print("You can check your grade for your prediction as per the grading scheme outlined at the start of Question 5")

Current training RMSE: 143770.44039927263
You can check your grade for your prediction as per the grading scheme outlined at the start of Question 5


<br>

---

## Question 5e: Fit and Evaluate your Model

**This question is not graded.** Use the space below to evaluate your models. Some ideas are listed below.

**Note:** While we have a grader function that checks RMSE for you, it is best to define and create your own model object and fit it on your data. This way, you have direct access to the model to help you evaluate and debug if needed. For this project, you should use a default `sklearn` `LinearRegression()` model with an intercept term for grading purposes. Do not modify any hyperparameters of `LinearRegression()`; focus instead on feature selection or on the hyperparameters of your own feature engineering function.
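
As a minimal sketch of what keeping your own model object makes possible, the snippet below fits a default `LinearRegression()` on a training split and inspects it on a validation split. The arrays `X` and `Y` are synthetic stand-ins for a design matrix and log-price target, and the 80/20 split and random seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic design matrix and log-price target (stand-ins for your own X, Y).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
Y = 12.0 + X @ np.array([0.5, -0.2, 0.1]) + rng.normal(0, 0.05, size=500)

X_tr, X_val, Y_tr, Y_val = train_test_split(X, Y, test_size=0.2, random_state=0)

# Default LinearRegression (fit_intercept=True is the default).
model = LinearRegression()
model.fit(X_tr, Y_tr)

# Holding the model object lets you inspect it while debugging.
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)

# Residuals on held-out rows, on the log scale.
val_residuals = Y_val - model.predict(X_val)
print("validation RMSE (log scale):", np.sqrt(np.mean(val_residuals ** 2)))
```

Unusually large coefficients or residuals concentrated in a few rows are the kind of thing that is hard to spot from a single RMSE number.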

It may also be helpful to calculate the RMSE directly as follows:

$$RMSE = \sqrt{\dfrac{\sum_{\text{houses in the set}}(\text{actual price for house} - \text{predicted price for house})^2}{\text{number of houses}}}$$

A function that computes the RMSE is provided below. Feel free to use it if you would like to calculate the RMSE for your training set.

In [47]:
def rmse(predicted, actual):
    """
    Calculates RMSE from actual and predicted values.
    Input:
      predicted (1D array): Vector of predicted/fitted values
      actual (1D array): Vector of actual values
    Output:
      A float, the RMSE value.
    """
    return np.sqrt(np.mean((actual - predicted)**2))
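
To make the formula concrete, here is a small worked example on three made-up house prices. The `rmse` function is repeated so that the snippet runs on its own, and the numbers are invented for illustration only.

```python
import numpy as np

def rmse(predicted, actual):
    # Same formula as the provided helper, repeated so this snippet runs alone.
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Made-up prices for three houses, in dollars.
actual = np.array([100_000.0, 200_000.0, 400_000.0])
predicted = np.array([110_000.0, 180_000.0, 400_000.0])

# Errors are -10,000, 20,000, and 0, so the mean squared error is 500,000,000 / 3.
dollar_rmse = rmse(predicted, actual)

# The same comparison on the log scale gives a much smaller number.
log_rmse = rmse(np.log(predicted), np.log(actual))

print(dollar_rmse, log_rmse)
```

The two results are on very different scales (roughly 12,910 versus 0.08), so it matters whether an RMSE was computed before or after delogging when you compare it against a threshold.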

<br>

---

## Question 5f: Submission

Recall that the test set given to you in this assignment does not contain values for the true `Sale Price` of each house. You will be predicting `Log Sale Price` on the data stored in `cook_county_contest_test.csv`. To determine your model's RMSE on the test set, you will submit the predictions made by your model to Gradescope. There, we will run checks to see what your test RMSE is by considering (hidden) true values for the `Sale Price`. We will delog/exponentiate your prediction on Gradescope to compute RMSE and use this to score your model. Before submitting to Gradescope, make sure that your predicted values can all be delogged (e.g., if one of your `Log Sale Price` predictions is 60, it is too large; $e^{60}$ is too big!).
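
One way to check that predictions can be delogged is sketched below. The helper `check_log_predictions` and its cap `max_log=20.0` (about \$485 million) are hypothetical choices for illustration, not part of the grader.

```python
import numpy as np

def check_log_predictions(log_pred, max_log=20.0):
    """Hypothetical sanity check; max_log=20 (about $485 million) is an arbitrary cap."""
    log_pred = np.asarray(log_pred, dtype=float)
    prices = np.exp(log_pred)
    return {
        # False if any prediction is NaN/inf or overflows when exponentiated.
        "all_finite": bool(np.all(np.isfinite(log_pred)) and np.all(np.isfinite(prices))),
        # Count of predictions above the (arbitrary) plausibility cap.
        "n_above_cap": int(np.sum(log_pred > max_log)),
        # Largest delogged price, ignoring NaNs.
        "max_price": float(np.nanmax(prices)),
    }

report = check_log_predictions([11.6, 12.2, 15.9, 60.0])
print(report)
```

In this example the value 60 is flagged by `n_above_cap`; running the same check on your own prediction vector before submitting can save one of your limited daily submissions.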

Your score on this section will be determined by the grading scheme outlined at the start of Question 5. **Remember that you can only submit your test set predictions to Gradescope up to 4 times per day. Plan your time to ensure that you can adjust your model as necessary, and please test your model's performance using cross-validation before making any submissions.** For more on cross-validation, check [Lecture 16](https://ds100.org/sp24/lecture/lec16/). In particular, the [Lecture 16 notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Fsp24-student&urlpath=lab%2Ftree%2Fsp24-student%2F%2Flecture%2Flec16%2Flec16.ipynb&branch=main) may be helpful here. You can also feel free to reference what you did in previous questions when creating training and validation sets and seeing how your model performs.
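
A minimal cross-validation sketch is shown below, assuming you already have a numeric design matrix and a log-price target. The helper name `cv_rmse_dollars`, the choice of 5 folds, and the synthetic data are all assumptions made for illustration; the lecture notebook may organize this differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def cv_rmse_dollars(X, Y_log, n_splits=5, seed=0):
    """Hypothetical helper: mean K-fold validation RMSE after delogging."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_errors = []
    for train_idx, val_idx in kf.split(X):
        # Fit on the training folds, evaluate on the held-out fold.
        model = LinearRegression().fit(X[train_idx], Y_log[train_idx])
        pred = np.exp(model.predict(X[val_idx]))
        actual = np.exp(Y_log[val_idx])
        fold_errors.append(np.sqrt(np.mean((actual - pred) ** 2)))
    return float(np.mean(fold_errors))

# Synthetic stand-ins for a design matrix and log-price target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
Y_log = 12.0 + X @ np.array([0.4, 0.1]) + rng.normal(0, 0.1, size=300)

print("5-fold CV RMSE (dollars):", cv_rmse_dollars(X, Y_log))
```

If your feature engineering learns anything from the data (for example, target-encoding means or scaling statistics), computing those quantities inside each training fold gives a more honest estimate than computing them once on all rows.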

To determine the error on the test set, please submit your predictions on the test set to the Gradescope assignment **Project A2 Test Set Predictions**. The CSV file to submit is generated below, and you should not modify the cell below. Simply download the CSV file, and submit it to the appropriate Gradescope assignment.

**You will not receive credit for the test set predictions (i.e., up to 3 points) unless you submit to this assignment!**

**Note:** If you run into any errors, the [Proj. A2 Common Mistakes](https://ds100.org/debugging-guide/projA2/projA2.html) section of the [Data 100 Debugging Guide](https://ds100.org/debugging-guide) may be a helpful resource.

In [49]:
from datetime import datetime
from IPython.display import display, HTML

Y_test_pred = run_linear_regression_test(lm.LinearRegression(fit_intercept=True), feature_engine_final, None, 'cook_county_train.csv', 'cook_county_contest_test.csv', 
                                         is_test = True, is_ranking = False, return_predictions = True
                                         )

# Construct and save the submission:
submission_df = pd.DataFrame({
    "Id": pd.read_csv('cook_county_contest_test.csv')['Unnamed: 0'], 
    "Value": Y_test_pred,
}, columns=['Id', 'Value'])
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = "submission_{}.csv".format(timestamp)
submission_df.to_csv(filename, index=False)

#print('Created a CSV file: {}.'.format("submission_{}.csv".format(timestamp)))
display(HTML("Download your test prediction <a href='" + filename + "' download>here</a>."))
print('You may now upload this CSV file to Gradescope for scoring.')

You may now upload this CSV file to Gradescope for scoring.


In [50]:
# Scratch space to check if your prediction is reasonable. See 5e for hints. 
# We will not reset the submission count for mis-submission issues.
submission_df["Value"].describe()

count    55311.000000
mean        12.200539
std          0.837872
min          9.281137
25%         11.612339
50%         12.168546
75%         12.741449
max         15.907689
Name: Value, dtype: float64

Congratulations on finishing your prediction model for home sale prices in Cook County! In the following section, we'll delve deeper into the implications of predictive modeling within the CCAO case study, especially because statistical modeling is how the CCAO valuates properties. 

Refer to [Lecture 15](https://ds100.org/sp24/lecture/lec15/) if you're having trouble getting started!