
# Exercise 1

Use filter, wrapper, and embedded feature selection methods to find the best features.

## Dataset Information

## Features

Number of Instances: 20640

Number of Attributes: 8 numeric, predictive attributes and the target

Attribute Information:

MedInc - median income in block group

HouseAge - median house age in block group

AveRooms - average number of rooms per household

AveBedrms - average number of bedrooms per household

Population - block group population

AveOccup - average number of household members

Latitude - block group latitude

Longitude - block group longitude

## Target
The target variable is the median house value for California districts, expressed in units of $100,000.

In [277]:
from sklearn.datasets import fetch_california_housing
from sklearn.feature_selection import SelectKBest, f_regression, SequentialFeatureSelector
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd

In [278]:
# Load the dataset as pandas objects and join the features and target into one frame.
housing = fetch_california_housing(as_frame=True)
df = pd.concat([housing.data, housing.target], axis=1)

In [279]:
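# Alternative load: NumPy arrays wrapped in DataFrames. These objects are not used in the cells below.
df_house = fetch_california_housing()
df_house_features = pd.DataFrame(df_house.data, columns=df_house.feature_names)
df_house_target = pd.DataFrame(df_house.target, columns=['target'])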

In [280]:
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

1. Use any filter method to select the best features

In [281]:
# Filter method: keep the k features with the highest univariate F-scores.
threshold = 5  # number of features to keep
skb = SelectKBest(score_func=f_regression, k=threshold)
sel_skb = skb.fit(X, y)
sel_skb_index = sel_skb.get_support()  # boolean mask over the columns of X
df_skb = X.iloc[:, sel_skb_index]
selected_features_filter = df_skb.columns.tolist()
print("Selected Features using SelectKBest:", selected_features_filter)

Selected Features using SelectKBest: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Latitude']
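
To see what `SelectKBest` ranks on, it can help to print the per-feature scores that `f_regression` returns. The sketch below is only an illustration: it uses a small synthetic dataset (the names `X_demo`, `y_demo`, and the columns `a`, `b`, `c` are assumptions, not part of the exercise) so that it runs without downloading the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Synthetic stand-in for the housing data (assumption): y depends strongly
# on "a", weakly on "b", and not at all on "c".
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
y_demo = 3.0 * X_demo["a"] + 0.5 * X_demo["b"] + rng.normal(size=500)

# f_regression returns one F-statistic and one p-value per feature.
f_scores, p_values = f_regression(X_demo, y_demo)

# Rank the features from the highest F-statistic to the lowest.
ranking = pd.DataFrame(
    {"F": f_scores, "p": p_values}, index=X_demo.columns
).sort_values("F", ascending=False)
print(ranking)
```

On the housing data, the same ranking should be available from the fitted selector through `skb.scores_` and `skb.pvalues_`.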


2. Use any wrapper method to select the best features

In [282]:
# Wrapper method: forward selection that adds one feature at a time,
# scoring each candidate set with a cross-validated linear regression.
threshold = 5  # number of features to keep
wrapper_model = LinearRegression()
sfs = SequentialFeatureSelector(wrapper_model, n_features_to_select=threshold, direction='forward')
sel_sfs = sfs.fit(X, y)
sel_sfs_index = sel_sfs.get_support()  # boolean mask over the columns of X
df_sfs = X.iloc[:, sel_sfs_index]
selected_features_wrapper = df_sfs.columns.tolist()
print("Selected Features using Sequential Feature Selector:", selected_features_wrapper)

Selected Features using Sequential Feature Selector: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population']
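
`SequentialFeatureSelector` can also search backward, and it accepts an explicit `scoring` and `cv`. The sketch below compares both directions on synthetic data where only two columns carry signal. The data, the helper name `select`, and the choice of RMSE as the scoring metric are assumptions made for illustration, not settings used in the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): only x0 and x1 influence the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(300, 5)), columns=[f"x{i}" for i in range(5)]
)
y_demo = 4.0 * X_demo["x0"] + 2.0 * X_demo["x1"] + rng.normal(scale=0.5, size=300)


def select(direction):
    """Hypothetical helper: run the selector in one direction and return column names."""
    sfs = SequentialFeatureSelector(
        LinearRegression(),
        n_features_to_select=2,
        direction=direction,
        scoring="neg_root_mean_squared_error",  # higher (less negative) is better
        cv=5,
    )
    sfs.fit(X_demo, y_demo)
    return X_demo.columns[sfs.get_support()].tolist()


forward_features = select("forward")    # starts empty and adds features
backward_features = select("backward")  # starts full and removes features
print("forward: ", forward_features)
print("backward:", backward_features)
```

The two directions need not agree on real data, so comparing them is a cheap sanity check on how stable the selection is.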


3. Use any embedded method to select the best features

In [283]:
# Embedded method: fit a decision tree and keep features whose importance exceeds 0.1.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

feature_importances = tree.feature_importances_
sel_tree_index = feature_importances > 0.1  # boolean mask over the columns of X_train

df_tree = X_train.iloc[:, sel_tree_index]
selected_features_embedded = df_tree.columns.tolist()

print("Selected Features using Decision Tree:", selected_features_embedded)

Selected Features using Decision Tree: ['MedInc', 'AveOccup']
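
The manual importance mask can also be expressed with scikit-learn's `SelectFromModel`, which keeps features whose importance is at least `threshold` (so it differs from a strict `> 0.1` comparison only at exact equality). This is a sketch on synthetic data. The data and the names `X_demo`, `y_demo`, and `selector` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): only x0 drives the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(400, 4)), columns=["x0", "x1", "x2", "x3"])
y_demo = 5.0 * X_demo["x0"] + rng.normal(scale=0.1, size=400)

# Fit the tree inside the selector and keep features with importance >= 0.1.
selector = SelectFromModel(DecisionTreeRegressor(random_state=42), threshold=0.1)
selector.fit(X_demo, y_demo)

# Impurity-based importances of the fitted tree, sorted from high to low.
importances = pd.Series(
    selector.estimator_.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
selected = X_demo.columns[selector.get_support()].tolist()

print(importances)
print("Selected:", selected)
```

Printing the sorted importances alongside the selection shows how close each dropped feature was to the cutoff.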


In [284]:
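def compute_rmse(model, X_train, X_test, y_train, y_test):
    """Fit the model on the training split and return its RMSE on the test split."""
    model.fit(X_train, y_train)
    y_pred_test = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    return rmse

# Baseline: linear regression on all eight features.
default_rmse = compute_rmse(wrapper_model, X_train, X_test, y_train, y_test)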

X_train_filter = X_train[selected_features_filter]
X_test_filter = X_test[selected_features_filter]
filter_rmse = compute_rmse(wrapper_model, X_train_filter, X_test_filter, y_train, y_test)

X_train_wrapper = X_train[selected_features_wrapper]
X_test_wrapper = X_test[selected_features_wrapper]
wrapper_rmse = compute_rmse(wrapper_model, X_train_wrapper, X_test_wrapper, y_train, y_test)

X_train_embedded = X_train[selected_features_embedded]
X_test_embedded = X_test[selected_features_embedded]
embedded_rmse = compute_rmse(wrapper_model, X_train_embedded, X_test_embedded, y_train, y_test)

print("Model RMSE:")
print("Default RMSE:", default_rmse)
print("Filter Method RMSE:", filter_rmse)
print("Wrapper Method RMSE:", wrapper_rmse)
print("Embedded Method RMSE:", embedded_rmse)

Model RMSE:
Default RMSE: 0.7284008391515456
Filter Method RMSE: 0.7826494475622022
Wrapper Method RMSE: 0.7859647054435119
Embedded Method RMSE: 0.8309136955927495
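
Because the target is in units of $100,000, the RMSE values above can be read as dollar errors. The short sketch below converts them and reports the change relative to the all-features baseline. The numbers are copied from the output above, and the names `rmse`, `UNIT_DOLLARS`, and `summary` are introduced here for illustration only.

```python
# RMSE values copied from the printed output above (target units: $100,000).
rmse = {
    "Default": 0.7284008391515456,
    "Filter": 0.7826494475622022,
    "Wrapper": 0.7859647054435119,
    "Embedded": 0.8309136955927495,
}
UNIT_DOLLARS = 100_000  # one target unit expressed in dollars

baseline = rmse["Default"]
summary = {}
for name, value in rmse.items():
    dollars = value * UNIT_DOLLARS                        # RMSE in dollars
    pct_change = 100.0 * (value - baseline) / baseline    # change vs. all features
    summary[name] = (dollars, pct_change)
    print(f"{name:<9} RMSE about ${dollars:,.0f} ({pct_change:+.1f}% vs. all features)")
```

In this run, every reduced feature set gives a higher RMSE than the linear regression fitted on all eight features.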
