## Week 6

Consider four possible models for predicting house prices:

1. Using only the size and number of rooms.
2. Using size, number of rooms, and building type.
3. Using size and building type, and their interaction.
4. Using a degree-5 polynomial on size, a degree-5 polynomial on number of rooms, and also building type.

Set up a pipeline for each of these four models.

Then, get predictions on the test set for each of your pipelines, and compute the root mean squared error. Which model performed best?

Note: You should only call train_test_split() once in your code; that is, all four models should be evaluated on the same test set.
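One way the four pipelines could be set up is sketched below. This is only an illustration under stated assumptions: the data is a small synthetic stand-in (made-up values, with the Ames column names "Gr Liv Area", "TotRms AbvGrd", and "Bldg Type"), and the helper names dummify(), poly5(), and linear_pipeline() are hypothetical. Here the interaction in model 3 is built with PolynomialFeatures(interaction_only=True), which is one of several reasonable choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Assumption: synthetic stand-in data with made-up values, not the real Ames file
rng = np.random.default_rng(0)
n = 400
size = rng.uniform(600, 3500, n)
rooms = np.clip(np.round(size / 300 + rng.normal(0, 1, n)), 2, 14)
bldg = rng.choice(["1Fam", "TwnhsE", "Duplex"], n)
price = (50_000 + 90 * size + 2_000 * rooms
         + 15_000 * (bldg == "1Fam") + rng.normal(0, 20_000, n))
houses = pd.DataFrame({"Gr Liv Area": size, "TotRms AbvGrd": rooms, "Bldg Type": bldg})

# Split exactly once so every model is scored on the same test set
X_train, X_test, y_train, y_test = train_test_split(houses, price, random_state=0)

SIZE, ROOMS, TYPE = ["Gr Liv Area"], ["TotRms AbvGrd"], ["Bldg Type"]


def dummify():
    # Fresh encoder for each pipeline (hypothetical helper)
    return OneHotEncoder(handle_unknown="ignore")


def poly5():
    # Standardize first so the degree-5 powers stay numerically manageable
    return Pipeline([("scale", StandardScaler()),
                     ("poly", PolynomialFeatures(degree=5, include_bias=False))])


def linear_pipeline(transformers, interactions=False):
    steps = [("preprocessing", ColumnTransformer(transformers))]
    if interactions:
        # interaction_only=True adds pairwise products, e.g. size times each dummy
        steps.append(("interactions", PolynomialFeatures(
            degree=2, interaction_only=True, include_bias=False)))
    steps.append(("linear_regression", LinearRegression()))
    return Pipeline(steps)


models = {
    "1: size + rooms": linear_pipeline(
        [("standardize", StandardScaler(), SIZE + ROOMS)]),
    "2: size + rooms + type": linear_pipeline(
        [("standardize", StandardScaler(), SIZE + ROOMS),
         ("dummify", dummify(), TYPE)]),
    "3: size * type": linear_pipeline(
        [("standardize", StandardScaler(), SIZE),
         ("dummify", dummify(), TYPE)],
        interactions=True),
    "4: poly(size) + poly(rooms) + type": linear_pipeline(
        [("poly_size", poly5(), SIZE),
         ("poly_rooms", poly5(), ROOMS),
         ("dummify", dummify(), TYPE)]),
}

# Fit each pipeline on the training set and score it on the shared test set
rmse = {}
for name, pipeline in models.items():
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))
    print(f"{name}: RMSE = {rmse[name]:,.0f}")
```

The model with the smallest RMSE on the shared test set would be the one that "performed best"; on the real Ames data the ranking may differ from what this synthetic data produces.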

In [1]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet 
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score

In [3]:
lr = LinearRegression()

# The original path "data/AmesHousing.csv" raised FileNotFoundError here;
# the file is read from the working directory, as in the Chapter 14 cell
ames = pd.read_csv("AmesHousing.csv")
X = ames[["Gr Liv Area", "TotRms AbvGrd"]]
y = ames["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(X, y)

# Standardize by hand using the training set's mean and standard deviation
X_train_s = (X_train - X_train.mean())/X_train.std()

lr_fitted = lr.fit(X_train_s, y_train)
lr_fitted.coef_

# Chapter 14: Penalized Regression

In [13]:
# Read the data
ames = pd.read_csv("AmesHousing.csv")

# Get rid of columns with mostly NaN values
good_cols = ames.isna().sum() < 100
ames = ames.loc[:,good_cols]

# Drop other NAs
ames = ames.dropna()


X = ames.drop(["SalePrice", "Order", "PID"], axis = 1)
y = ames["SalePrice"]


ct = ColumnTransformer(
  [
    ("dummify", 
    OneHotEncoder(sparse_output = False, handle_unknown='ignore'),
    make_column_selector(dtype_include=object)),
    ("standardize", 
    StandardScaler(), 
    make_column_selector(dtype_include=np.number))
  ],
  remainder = "passthrough"
)

lr_pipeline_1 = Pipeline(
  [("preprocessing", ct),
  ("linear_regression", LinearRegression())]
)



In [6]:
# check to make sure code is working correctly
transformed_data = ct.fit_transform(ames)

# Get feature names from OneHotEncoder and combine with numerical columns
column_names = (
    ct.named_transformers_["dummify"].get_feature_names_out(input_features=ames.select_dtypes(include=object).columns).tolist()
    + ames.select_dtypes(include=np.number).columns.tolist()
)

# Convert to DataFrame for visualization
transformed_df = pd.DataFrame(transformed_data, columns=column_names)
print(transformed_df)

      MS Zoning_C (all)  MS Zoning_FV  MS Zoning_I (all)  MS Zoning_RH  ...   Misc Val   Mo Sold   Yr Sold  SalePrice
0                   0.0           0.0                0.0           0.0  ...  -0.087930 -0.444404  1.675421   0.408859
1                   0.0           0.0                0.0           1.0  ...  -0.087930 -0.076545  1.675421  -0.970882
2                   0.0           0.0                0.0           0.0  ...  21.738194 -0.076545  1.675421  -0.130494
3                   0.0           0.0                0.0           0.0  ...  -0.087930 -0.812263  1.675421   0.772609
4                   0.0           0.0                0.0           0.0  ...  -0.087930 -1.180122  1.675421   0.094028
...                 ...           ...                ...           ...  ...        ...       ...       ...        ...
2816                0.0           0.0                0.0           0.0  ...  -0.087930 -1.180122 -1.358188  -0.500515
2817                0.0           0.0                0.0

The handle_unknown='ignore' argument tells the encoder to ignore any unknown categories (categories not present in the training data) that appear in the input data during transformation, rather than raising an error. For such a row, the one-hot encoded columns for that feature will all be zeros.

The make_column_selector(dtype_include=object) selects only columns that are of type object, so it only creates dummy variables on categorical data.
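The two behaviors described above can be checked on a toy example. This is a small sketch with made-up data; the frames train and new are hypothetical, and the string column is created with dtype=object explicitly so the selector picks it up regardless of pandas version. The encoder's default sparse output is converted with .toarray() for printing.

```python
import pandas as pd
from sklearn.compose import make_column_selector
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values): one categorical column, one numeric column
train = pd.DataFrame({
    "Bldg Type": pd.Series(["1Fam", "Duplex", "1Fam"], dtype=object),
    "Gr Liv Area": [1500, 1800, 1200],
})
# A new row whose category was never seen when the encoder was fitted
new = pd.DataFrame({"Bldg Type": pd.Series(["Twnhs"], dtype=object)})

# The selector is a callable that returns the names of the matching columns
selected = make_column_selector(dtype_include=object)(train)
print(selected)

# With handle_unknown="ignore", the unseen category does not raise an error
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[selected])
encoded_new = encoder.transform(new[selected]).toarray()
print(encoded_new)
```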

We dropped Order and PID columns as they aren't going to be used in the regression since they are row identifiers.

In [14]:
# cross validate
cross_val_score(lr_pipeline_1, X, y, cv = 5, scoring = 'r2')

array([-5.33822743e+18, -1.21674040e+20, -2.79619977e+20, -2.04234366e+19, -2.28658154e+17])

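Cross-validation gives extremely bad R-squared values: large negative numbers mean the model predicts the held-out folds far worse than simply predicting the mean. This is a sign of severe overfitting, since we're using every feature in the data, including many dummy columns.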

## Ridge Regression

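- adds a penalty on the sum of squared coefficients to the loss function
- as the penalty strength (lambda) increases, coefficients get smaller and smaller, but never get exactly to zero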
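The shrinking behavior can be seen in a short sketch. It assumes synthetic data from make_regression rather than the Ames data, and the alpha values are arbitrary; in scikit-learn, Ridge() takes the penalty strength (lambda) as its alpha argument.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data (assumption): 5 features, all of them informative
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

alphas = [0.1, 10, 1000, 100000]
norms = []
any_exact_zero = False
for alpha in alphas:
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(float(np.linalg.norm(coefs)))
    any_exact_zero = any_exact_zero or bool(np.any(coefs == 0))
    print(f"alpha={alpha}: coefficient norm = {norms[-1]:.4f}")

# The norm keeps shrinking as alpha grows, yet no coefficient is exactly zero
print("any coefficient exactly zero:", any_exact_zero)
```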

## LASSO

- adds a penalty to the loss function, penalizing the model for having more and larger coefficients

- the penalty uses the absolute values of the coefficients
- it is possible for some coefficients to be forced to exactly zero, so it's sometimes used as a variable selection method (to see which coefficients go to zero)


function is Lasso() where alpha represents lambda
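As a rough illustration of the variable selection effect, the sketch below fits Lasso() on synthetic data where only 3 of 10 features truly matter. The data, the alpha value, and the variable names are assumptions chosen for the example, not part of the Ames analysis.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (assumption): 10 features, only 3 of which affect y
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ols_coefs = LinearRegression().fit(X, y).coef_
lasso_coefs = Lasso(alpha=5.0).fit(X, y).coef_  # alpha plays the role of lambda

n_zero_ols = int(np.sum(ols_coefs == 0))
n_zero_lasso = int(np.sum(lasso_coefs == 0))
print("exact zeros, ordinary least squares:", n_zero_ols)
print("exact zeros, LASSO:", n_zero_lasso)

# Features whose LASSO coefficient survived can be read as the "selected" ones
selected_features = np.flatnonzero(lasso_coefs)
print("selected feature indices:", selected_features)
```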


## Elastic Net
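- includes both the ridge and the lasso penalty

function is ElasticNet() where alpha represents lambda, and the argument l1_ratio represents the mixing proportion between the two penalties (the quantity usually written as alpha in the formula)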
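A brief sketch of how the two arguments interact, again on synthetic data with arbitrary parameter values. The comment gives the objective as documented by scikit-learn; setting l1_ratio=1 leaves only the lasso penalty, so the fit should match Lasso() with the same alpha.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# scikit-learn's elastic net objective, for n samples and coefficients w:
#   1/(2n) * ||y - Xw||^2
#   + alpha * l1_ratio * ||w||_1
#   + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2

# Synthetic data (assumption), not the Ames data
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# An even mix of the two penalties
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)

# l1_ratio=1.0 keeps only the lasso part of the penalty
enet_as_lasso = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

same_as_lasso = bool(np.allclose(enet_as_lasso.coef_, lasso.coef_))
print("l1_ratio=1 matches Lasso:", same_as_lasso)
print("mixed-penalty coefficients:", np.round(enet.coef_, 2))
```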

# Chapter 15: Nonparametric Methods

## k-nearest neighbors

