<a href="https://colab.research.google.com/github/cameronbaum0124/housing_price_regression/blob/main/HousingPriceRegressionLab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Housing prices regression lab

The purpose of this lab is to <q>compete</q> in [this variation of Kaggle's House Prices competition](https://marksmath.org/classes/Spring2025MML/AmesRegressionCheck). There's plenty of code to get you started.

> A rough description of your mission would be to experiment with various regression functions in SciKit-Learn and to add several more variables in an attempt to improve the score.

A more precise description of your mission follows at the end of this notebook.




## Imports

We begin with some common imports.

In [None]:
import numpy as np
import pandas as pd
pd.options.mode.copy_on_write = True

from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.metrics import mean_squared_error

## The data

We've got two data sets, one to *train* with and one to *test* with. Let's import them:

In [None]:
train = pd.read_csv('https://marksmath.org/data/reconstructed_train.csv')
test = pd.read_csv('https://marksmath.org/data/reconstructed_test.csv')

And let's take a quick look at what the data looks like:

In [None]:
print(f'{len(train)} observations and {len(train.columns)} features in train')
print(f'{len(test)} observations and {len(test.columns)} features in test')
train.head()

Well, that's a lot of data. Let's drop the columns that are missing too many values, keeping only those with more than 1400 non-missing values in both the train and the test data. We'll store the good variables in a list called `good_variables` and display it. You can read a description of the data (that doesn't appear to be 100% accurate) on the [Kaggle competition webpage](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data).

In [None]:
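def check_most(col_name):
    # Keep a column only if it has more than `tol` non-missing
    # values in both the test and the train data.
    tol = 1400
    return (len(test[col_name].dropna()) > tol
            and len(train[col_name].dropna()) > tol)

# Skip the first two columns and keep the rest that pass the check.
good_variables = np.array([c for c in train.columns[2:] if check_most(c)])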
print(len(good_variables))
good_variables
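
The same kind of filter can be written without a helper function by counting non-missing values for every column at once. Here is a minimal sketch on a tiny made-up data set; the names `toy_train`, `toy_test`, and `toy_tol` are hypothetical stand-ins, not part of the lab data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real train and test sets.
toy_train = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [1.0, np.nan, np.nan, 4.0],
    "C": ["x", "y", None, "z"],
})
toy_test = pd.DataFrame({
    "A": [5.0, 6.0, 7.0, np.nan],
    "B": [1.0, 2.0, 3.0, 4.0],
    "C": ["x", "y", "z", "x"],
})

toy_tol = 2  # keep columns with more than 2 non-missing values in both sets

# notna().sum() counts the non-missing entries of every column at once.
enough_train = toy_train.notna().sum() > toy_tol
enough_test = toy_test.notna().sum() > toy_tol
mask = enough_train & enough_test

toy_good = [c for c in toy_train.columns if mask[c]]
print(toy_good)  # column B is dropped: only 2 non-missing values in toy_train
```

With the real data you would use `train`, `test`, and the tolerance of 1400 in place of the toy names.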

I'm going to illustrate how to deal with one numeric variable, one nominal variable, and one ordinal variable. Let's create a list for each variable type, combine them into a list of *all* the variables, and examine the resulting data.

In [None]:
# Store a few variables of interest and of each type in a list:
numeric_variables = ['GrLivArea']
nominal_variables = ['Neighborhood']
qual_variables = ['KitchenQual']

# Place all of those into one bigger list and examine:
my_variables = np.concatenate([numeric_variables, nominal_variables, qual_variables])
train[np.concatenate([my_variables, ['SalePrice']])]

I suppose you can see how we've got one of each type of variable. Typically, ordinal variables are the trickiest to deal with in the code. The reason is that you've got to determine the order yourself. Before you do that you've got to examine the possible values.

Here are all possible values of 'KitchenQual':

In [None]:
type_check_var = 'KitchenQual'
pd.concat([train, test])[type_check_var].unique()

Note that you've still got to determine a reasonable order.

Now, we set up a workflow for each type of variable as follows:

- For numeric variables, we
  - impute with `KNNImputer`, then
  - scale with `StandardScaler`.
- For nominal variables, we
  - impute with `SimpleImputer`, then
  - encode with `OneHotEncoder`.
- For ordinal variables, we
  - impute with `SimpleImputer`, then
  - encode with `OrdinalEncoder`, then
  - scale with `StandardScaler`.

In [None]:
categoric_impute = SimpleImputer(strategy="most_frequent")
numeric_impute = KNNImputer()

nominal_encoder = OneHotEncoder(handle_unknown='ignore')
qual_encoder = OrdinalEncoder(categories=[['Ex', 'Gd', 'TA', 'Fa', 'Po']])
qual_encoder.fit(np.array(['Ex', 'Gd', 'TA', 'Fa', 'Po']).reshape(-1, 1))

scale = StandardScaler(with_mean=False,  with_std=True)

Again, the trickiest thing to deal with is the `qual_encoder`, since `KitchenQual` is an ordinal variable. For any ordinal variable you choose to add, you need to
- Find all the possible values (as we did for `KitchenQual` above),
- Determine a reasonable ordering of those values, and
- Fit the encoder to that ordering.
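
Those three steps can be sketched on a tiny made-up column. The column `ToyQual` and the names `toy_order` and `toy_encoder` are hypothetical; only `OrdinalEncoder` and its `categories` argument come from SciKit-Learn.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with three quality levels.
toy = pd.DataFrame({"ToyQual": ["Gd", "TA", "Ex", "TA", "Gd"]})

# Step 1: find the values that actually occur.
print(sorted(toy["ToyQual"].unique()))

# Step 2: choose an order yourself (best to worst here, as with KitchenQual).
toy_order = ["Ex", "Gd", "TA"]

# Step 3: hand that order to the encoder. Each value is replaced
# by its position in the list: Ex -> 0, Gd -> 1, TA -> 2.
toy_encoder = OrdinalEncoder(categories=[toy_order])
toy_codes = toy_encoder.fit_transform(toy[["ToyQual"]])
print(toy_codes.ravel())
```

If you add several ordinal variables to one encoder, `categories` takes one list per column, in the same order as the columns.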

Now, we build a preprocessor that transforms each of those columns according to its type:

In [None]:
process = ColumnTransformer(
    transformers=[
        ("numerical_impute_and_scale", Pipeline(
            steps = [
                ('numeric_impute', numeric_impute),
                ('scale', scale)
            ]), numeric_variables),
        ("nominal_encode", Pipeline(steps = [
            ("categoric_impute", categoric_impute),
            ("nominal_encode", nominal_encoder)
        ]), nominal_variables),
        ("ordinal_impute_and_encode", Pipeline(steps = [
            ("categoric_impute", categoric_impute),
            ("ordinal_encode", qual_encoder),
            ("scale", scale)
        ]), qual_variables)
    ]
)

In [None]:
regress = LinearRegression()

# Uncomment this next bit of code, if you want to try the ridge.
# Probably won't help until you choose a largeish set of variables.

# regress = RidgeCV(
#     alphas=np.logspace(-1, 1, 100)
#   )

pipe = Pipeline(steps = [
        ("process", process),
        ('regress', regress)
    ]
)

In [None]:
X = train[my_variables]
Y = train.SalePrice[X.index].apply(np.log)
pipe.fit(X,Y)

In [None]:
# If you're using RidgeCV, here's the optimized
# regularization coefficient (LinearRegression has none):
getattr(regress, 'alpha_', None)

Let's score this thing on the train data:

In [None]:
mean_squared_error(Y, pipe.predict(X))**0.5
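
A score computed on the training data tends to be optimistic, since the model has already seen those rows. One common alternative is a cross-validated score. Here is a minimal sketch on synthetic data; `toy_X` and `toy_y` are hypothetical stand-ins, and in this notebook you would presumably pass `pipe`, `X`, and `Y` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data standing in for X and Y.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 3))
toy_y = toy_X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

# SciKit-Learn reports the *negated* RMSE so that larger is better;
# flip the sign to get ordinary root mean squared errors, one per fold.
cv_scores = -cross_val_score(
    LinearRegression(), toy_X, toy_y,
    cv=5, scoring="neg_root_mean_squared_error"
)
print(cv_scores)
print(cv_scores.mean())
```

Each of the five numbers is the error on a fold that was held out while fitting, so their mean is usually a fairer guess at the test score than the training error.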

## Creating a submission

We don't have the prices for the test data; we've got to upload our predictions to the [online scoring tool](https://marksmath.org/classes/Spring2025MML/AmesRegressionCheck) to get the score. Here's how to create a submission file:

In [None]:
test['SalePrice'] = pipe.predict(test)
submit = test[['Id', 'SalePrice']]
submit.loc[:, 'SalePrice'] = submit.SalePrice.apply(np.exp) # Gotta scale back now

# Rounding to the nearest 1000 seems to help a little bit.
# I suppose that's because the original prices are rounded.
submit.loc[:, 'SalePrice'] = submit.SalePrice.apply(lambda x: 1000*round(x/1000))
submit.to_csv('housing_predictions_demo.csv', index=False)
submit
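
Since the model is fit to the logarithm of the price, its predictions have to be exponentiated before they are prices again. Here is a minimal sketch of that round trip and of the rounding step, using a few made-up prices; `toy_prices` is hypothetical.

```python
import numpy as np

toy_prices = np.array([208700.0, 181200.0, 223400.0])  # hypothetical sale prices

toy_logs = np.log(toy_prices)   # the scale the model is fit on
toy_back = np.exp(toy_logs)     # back to dollars, up to floating point error

# Round to the nearest 1000, as in the submission code.
toy_rounded = 1000 * np.round(toy_back / 1000)
print(toy_rounded)
```

Note that both Python's `round` and `np.round` send exact halves to the nearest even number, so a price ending in exactly 500 may round down rather than up.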

**Your task:**

Your task is to fiddle with the variables to see if you can improve the score. My score [should be *about* 0.13856](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/leaderboard?search=mcclure).

## LASSOing relevant variables

We can use `LassoCV` to help us determine some particularly relevant variables. This is easiest with numerical variables, so I'm going to stick with that. While it can be done with categorical variables as well, there are considerable complications. Nominal variables that are encoded with `OneHotEncoder` are a particular pain.

There are 33 numeric variables in our `good_variables` list, so that's more than enough to make an interesting example. Here's the whole process in one cell:

In [None]:
# Grab the numeric variables:
all_numeric_variables = train.select_dtypes(include=["int64", 'float64']).columns
all_numeric_variables = [c for c in all_numeric_variables if c in good_variables]

# Build a simple preprocessor:
process_just_numeric = ColumnTransformer(
    transformers=[
        ("numerical_impute_and_scale", Pipeline(
            steps = [
                ('numeric_impute', numeric_impute),
                ('scale', scale)
            ]), all_numeric_variables)
    ]
)

# Regress and fit the cross-validated Lasso!
regress = LassoCV(
    alphas=np.logspace(-6, 6, 100)
  )
pipe_just_numeric = Pipeline(steps = [
        ("process", process_just_numeric),
        ('regress', regress)
    ]
)
X = train[all_numeric_variables]
Y = train.SalePrice[X.index].apply(np.log)
pipe_just_numeric.fit(X,Y)

# Display the variables together with their coefficients
# in descending order.
stats = []
for i in range(len(X.columns)):
    stats.append({'stat': X.columns[i], 'coeff': regress.coef_[i]})
stats_df = pd.DataFrame(stats)
stats_df.sort_values('coeff', ascending=False)

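If you're going to add some numeric variables, it might make sense to start near the top of that list! Variables with large negative coefficients, near the bottom, are also worth a look, while those with coefficients near zero are the ones the lasso considers least useful.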
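
One way to see the strongest variables regardless of sign is to rank the coefficients by absolute value. Here is a minimal sketch on synthetic data, assuming the features are standardized first; the feature names and `toy_` variables are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: two useful features and two pure-noise ones.
rng = np.random.default_rng(0)
toy_names = ["big_pos", "big_neg", "noise_1", "noise_2"]
toy_X = rng.normal(size=(300, 4))
toy_y = 2.0 * toy_X[:, 0] - 1.5 * toy_X[:, 1] + rng.normal(scale=0.1, size=300)

# Standardize so the coefficient sizes are comparable, then fit the lasso.
toy_lasso = LassoCV(cv=5).fit(StandardScaler().fit_transform(toy_X), toy_y)

# Rank the coefficients by absolute value, largest first.
toy_coefs = pd.Series(toy_lasso.coef_, index=toy_names)
toy_order = toy_coefs.abs().sort_values(ascending=False).index
toy_ranked = toy_coefs.reindex(toy_order)
print(toy_ranked)
```

In this notebook, the analogous step would be to build the series from `regress.coef_` and `X.columns` after fitting `pipe_just_numeric`.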