## $\color{red}{\text{Lecture Overview}}$

This lecture covers three topics:

1. **Feature selection**
2. **Feature selection techniques in regression**
3. **Model diagnostics**

Forward selection, backward elimination, and stepwise regression are three ways to select features for a regression model.

## $\color{red}{\text{Feature Selection}}$

1. Feature selection is the process of choosing a subset of relevant variables for modeling
2. Feature selection techniques operate under the principle of **parsimony**
  - All else being equal, prefer the simpler model: use fewer variables
  - Fewer variables mean less computation time

## $\color{red}{\text{Import Required Packages}}$

In [2]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

## $\color{red}{\text{Import Data}}$

In [4]:
df = pd.read_csv('/Users/dB/Documents/repos/github/bint-capstone/notebooks/data/housingData.csv')
df.head()

Unnamed: 0,id,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,...,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price
0,7129300520,20141013T000000,3,1.0,1180,5650,1.0,0,0,3,...,1180,0,1955,0,98178,47.5112,-122.257,1340,5650,221900.0
1,6414100192,20141209T000000,3,2.25,2570,7242,2.0,0,0,3,...,2170,400,1951,1991,98125,47.721,-122.319,1690,7639,538000.0
2,5631500400,20150225T000000,2,1.0,770,10000,1.0,0,0,3,...,770,0,1933,0,98028,47.7379,-122.233,2720,8062,180000.0
3,2487200875,20141209T000000,4,3.0,1960,5000,1.0,0,0,5,...,1050,910,1965,0,98136,47.5208,-122.393,1360,5000,604000.0
4,1954400510,20150218T000000,3,2.0,1680,8080,1.0,0,0,3,...,1680,0,1987,0,98074,47.6168,-122.045,1800,7503,510000.0


In [5]:
df.columns.tolist()

['id',
 'date',
 'bedrooms',
 'bathrooms',
 'sqft_living',
 'sqft_lot',
 'floors',
 'waterfront',
 'view',
 'condition',
 'grade',
 'sqft_above',
 'sqft_basement',
 'yr_built',
 'yr_renovated',
 'zipcode',
 'lat',
 'long',
 'sqft_living15',
 'sqft_lot15',
 'price']

## $\color{red}{\text{Analytic Task}}$
1. Using the housing data, build a **multiple linear regression model** to predict **price**
2. Use feature selection techniques
  - Forward selection
  - Backward elimination
  - Stepwise regression
3. Use model diagnostics to select the best model

## $\color{red}{\text{Data Preparation}}$
1. Exclude variables that will not be used in the analysis
2. Split the data into training and test sets

In [8]:
from sklearn.model_selection import train_test_split
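# Age of the building in years. Note: the value depends on the year the notebook is run,
# and building_age is an exact linear function of yr_built.
current_year = pd.Timestamp.now().year
df['building_age'] = current_year - df['yr_built']

# Drop identifiers and other columns not used as predictors
to_drop = ['id','date','zipcode','yr_renovated']
df2 = df.drop(to_drop, axis=1)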
df2.head()

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,lat,long,sqft_living15,sqft_lot15,price,building_age
0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,47.5112,-122.257,1340,5650,221900.0,70
1,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,47.721,-122.319,1690,7639,538000.0,74
2,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,47.7379,-122.233,2720,8062,180000.0,92
3,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,47.5208,-122.393,1360,5000,604000.0,60
4,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,47.6168,-122.045,1800,7503,510000.0,38
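
Two groups of columns in this preview look exactly redundant: `sqft_living` appears to be the sum of `sqft_above` and `sqft_basement`, and `building_age` is derived from `yr_built`. Exact linear dependence makes coefficient p-values unreliable, so it is worth checking before using the p-value-based methods described later. The sketch below checks only the five displayed rows (the `sample` frame is a hand-copied excerpt, not the full data); running the same expressions on `df2` would be needed to confirm the pattern for every row.

```python
import pandas as pd

# Five rows of the relevant columns, copied from the df2.head() output above.
sample = pd.DataFrame({
    "sqft_living":   [1180, 2570, 770, 1960, 1680],
    "sqft_above":    [1180, 2170, 770, 1050, 1680],
    "sqft_basement": [0, 400, 0, 910, 0],
    "yr_built":      [1955, 1951, 1933, 1965, 1987],
    "building_age":  [70, 74, 92, 60, 38],
})

# Is sqft_living equal to sqft_above + sqft_basement in every sampled row?
living_is_sum = (sample["sqft_above"] + sample["sqft_basement"] == sample["sqft_living"]).all()

# Is yr_built + building_age the same constant (the year the notebook was run)?
age_is_linear = (sample["yr_built"] + sample["building_age"]).nunique() == 1

print(living_is_sum, age_is_linear)  # True True
```

If the same holds on the full data, one column from each group could be dropped before fitting; which one to drop is a modeling choice.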


In [9]:
preds = df2.drop('price', axis=1)
resp = df2['price']

In [10]:
preds.head()

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,lat,long,sqft_living15,sqft_lot15,building_age
0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,47.5112,-122.257,1340,5650,70
1,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,47.721,-122.319,1690,7639,74
2,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,47.7379,-122.233,2720,8062,92
3,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,47.5208,-122.393,1360,5000,60
4,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,47.6168,-122.045,1800,7503,38


In [11]:
resp

0        221900.0
1        538000.0
2        180000.0
3        604000.0
4        510000.0
           ...   
21608    360000.0
21609    400000.0
21610    402101.0
21611    400000.0
21612    325000.0
Name: price, Length: 21613, dtype: float64

In [12]:
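# Hold out 20% of the rows for testing; random_state fixes the shuffle so the split is reproducible
train_X, test_X, train_y, test_y = train_test_split(preds, resp, test_size=.2, random_state=25)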


## $\color{red}{\text{Model Building}}$

### $\color{blue}{\text{Forward Selection}}$
- Starts with no predictors, only an intercept
- **Step 1**: Compute the p-value of each candidate predictor when it is added to the current model (at the start, this is a univariate regression for each predictor).
- **Step 2**: Add the predictor with the **lowest p-value**, provided it is below the threshold (e.g. 0.05)
- **Step 3**: Refit the model and test the remaining predictors again.
- Repeat until no remaining predictor significantly improves the model
- Efficient at finding relevant predictors, but greedy: a variable is never removed once added, so a predictor that becomes redundant after later additions stays in the model

- Example of selection order:

Start → Add X3 (lowest p-value) → Add X1 → Stop when no remaining predictor has p < 0.05

In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

# regression model
model = LinearRegression()

# SFS (turn this into a function later!)
# Note: SFS ranks candidates by cross-validated score (here R^2), not by p-values
forward_reg = SFS(model, k_features='best', forward=True, floating=False, scoring='r2', cv=5)  # forward=True: forward selection
forward_reg.fit(train_X, train_y)

forward_reg.k_feature_names_

('bedrooms',
 'bathrooms',
 'sqft_living',
 'sqft_lot',
 'waterfront',
 'view',
 'condition',
 'grade',
 'sqft_above',
 'sqft_basement',
 'yr_built',
 'lat',
 'long',
 'sqft_living15',
 'sqft_lot15',
 'building_age')

### $\color{blue}{\text{Backward Elimination}}$

- Start with all predictors included in the model
- **Step 1**: Compute p-values for all predictors in the full model
- **Step 2**: Remove the predictor with the highest p-value, if it exceeds the threshold
- **Step 3**: Refit the model and recalculate the p-values.
- Repeat until all remaining predictors are significant (p < threshold).

- Example Removal Order:

Start with X1, X2, X3, X4 → Remove X2 (highest p) → Remove X4 → Stop when all p-values are significant.

In [None]:
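backward_reg = SFS(model, k_features='best', forward=False, floating=False, scoring='r2', cv=5)  # same SFS settings, forward=False
backward_reg.fit(train_X, train_y)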

### $\color{blue}{\text{Stepwise Regression}}$

1. Starts with no predictors, **like forward selection**
2. **Step 1**: Add the predictor with the lowest p-value
3. **Step 2**: After adding a new predictor, check all included variables and remove any whose p-value is now above the threshold
4. **Step 3**: Continue adding and removing predictors until no further change improves the model
5. Balances simplicity and accuracy, allowing variables to be removed if they become insignificant after adding others.
6. More flexible than forward or backward methods but prone to overfitting: with a large pool of candidates, the many repeated significance tests can admit variables by chance.

In [None]:
def stepwise_reg():
    pass

# get forward and backward vars


## $\color{red}{\text{Model Diagnostics}}$