Q1. What is the Filter method in feature selection, and how does it work?

Q2. How does the Wrapper method differ from the Filter method in feature selection?

Q3. What are some common techniques used in Embedded feature selection methods?

Q4. What are some drawbacks of using the Filter method for feature selection?

Q5. In which situations would you prefer using the Filter method over the Wrapper method for feature selection?

Q6. You are working at a telecom company on a project to develop a predictive model for customer churn. The dataset contains many features, and you are unsure which ones to include in the model. Describe how you would use the Filter method to choose the most relevant features.
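
One way to picture the Filter method for Q6 is to score each feature against the churn label on its own and keep the highest-scoring ones before any model is trained. The sketch below is only an illustration under stated assumptions: the data is synthetic, every column name (`tenure_months`, `monthly_charges`, `support_calls`, `random_noise`) is hypothetical, and the ANOVA F-test (`f_classif`) is one of several scoring functions that could be used.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for a churn table; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 500
churn_df = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, n),
    "monthly_charges": rng.uniform(20, 120, n),
    "support_calls": rng.poisson(2, n),
    "random_noise": rng.normal(size=n),
})

# Assumption for the demo: churn depends only on tenure and support calls.
signal = -0.05 * churn_df["tenure_months"] + 0.6 * churn_df["support_calls"]
churn_df["churn"] = (signal + rng.normal(size=n) > 0).astype(int)

X_churn = churn_df.drop(columns=["churn"])
y_churn = churn_df["churn"]

# Filter step: score every feature independently of any model, keep the top k.
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(X_churn, y_churn)

filter_scores = pd.Series(selector.scores_, index=X_churn.columns).sort_values(ascending=False)
kept_features = list(X_churn.columns[selector.get_support()])

print(filter_scores)
print("Kept features:", kept_features)
```

Because each feature is scored in isolation, this step is cheap, but it cannot see interactions between features or redundancy among the ones it keeps.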

Q7. You are working on a project to predict the outcome of a soccer match. You have a large dataset with many features, including player statistics and team rankings. Explain how you would use the Embedded method to select the most relevant features for the model.

In [20]:
pd.read_csv('stock_prices_data.csv').head()

,price,revenue,profit,market_cap,dividends,pe_ratio,volume,volatility
0,43.708611,96988880.0,323977.6,9091832000.0,3.210158,18.2502,526300.4,5.279794
1,95.564288,77738150.0,6367740.0,2471663000.0,0.4207,22.329398,5318233.0,8.636653
2,75.879455,94010400.0,3150416.0,1534459000.0,0.808144,43.30115,5410945.0,6.75759
3,63.879264,89587910.0,5090621.0,4945582000.0,4.492771,19.26149,6377925.0,2.047877
4,24.041678,60192100.0,9076589.0,9857939000.0,3.032145,12.627174,7263652.0,1.170403


In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

data = pd.read_csv('stock_prices_data.csv')

X = data.drop(columns=['price'])
y = data['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

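# Lasso's L1 penalty can shrink the coefficients of uninformative features to exactly zero
lasso = Lasso(alpha=0.01)
lasso.fit(X_train_scaled, y_train)

# Keep only the features whose coefficient is non-zero after the penalty
feature_importance = pd.Series(lasso.coef_, index=X.columns)
top_features = feature_importance[feature_importance.abs() > 0]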

print("Top Selected Features:")
print(top_features)

Top Selected Features:
revenue        2.672867
profit         1.981315
market_cap     4.248926
dividends     -6.495119
pe_ratio      -1.875362
volume       -11.943089
volatility     8.408895
dtype: float64
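
The cell above applies the embedded idea to a regression target. For a classification target such as the match outcome in Q7, the same idea could be sketched with an L1-penalised logistic regression wrapped in `SelectFromModel`. This is a minimal sketch, not a definitive approach: the match data is synthetic, the column names (`home_rank`, `away_rank`, `home_avg_goals`, `noise_a`, `noise_b`) are hypothetical, and `C=0.05` is an arbitrary penalty strength chosen for the demo.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for match data; all column names are hypothetical.
rng = np.random.default_rng(1)
n = 400
matches = pd.DataFrame({
    "home_rank": rng.integers(1, 21, n),
    "away_rank": rng.integers(1, 21, n),
    "home_avg_goals": rng.uniform(0.5, 3.0, n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})

# Assumption for the demo: a home win depends only on the difference in rankings.
rank_gap = (matches["away_rank"] - matches["home_rank"]) / 5
home_win = (rank_gap + rng.normal(size=n) > 0).astype(int)

# Embedded step: selection happens while the L1-penalised model is being fitted.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
embedded = make_pipeline(StandardScaler(), SelectFromModel(l1_model))
embedded.fit(matches, home_win)

mask = embedded.named_steps["selectfrommodel"].get_support()
embedded_features = list(matches.columns[mask])
print("Features kept by the L1 model:", embedded_features)
```

A smaller `C` means a stronger penalty and fewer surviving features, so in practice `C` would usually be tuned by cross-validation rather than fixed by hand.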


Q8. You are working on a project to predict the price of a house based on its features, such as size, location, and age. You have a limited number of features, and you want to ensure that you select the most important ones for the model. Explain how you would use the Wrapper method to select the best set of features for the predictor.

In [6]:
import numpy as np

In [9]:
# Synthetic data: Price is drawn independently of Area and Rooms
Area = np.random.randint(600, 2500, size=30)
Rooms = np.random.randint(2, 6, size=30)
Price = np.random.randint(3500000, 17500000, size=30)

In [10]:
W1 = pd.DataFrame(Area, columns=['Area'])
W2 = pd.DataFrame(Rooms, columns=['Rooms'])
W3 = pd.DataFrame(Price, columns=['Price'])

In [11]:
rate = pd.concat([W1,W2],axis=1)

In [15]:
properties = pd.concat([rate,W3],axis=1)

In [16]:
properties.head()

,Area,Rooms,Price
0,1664,3,7123110
1,1007,3,11930543
2,1820,5,11019385
3,839,5,9261273
4,1046,3,6232131


In [17]:
properties.to_csv('property.csv')

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import mean_squared_error

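# property.csv was saved with its index, so read_csv adds an 'Unnamed: 0' column
data = pd.read_csv('property.csv')

X = data.drop(columns=['Price'])  # input features (still includes 'Unnamed: 0')
y = data['Price']  # target variable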

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

linear = LinearRegression()

# X has only three columns, so n_features_to_select=3 keeps all of them and nothing is eliminated
rfe = RFE(estimator=linear, n_features_to_select=3)
rfe.fit(X_train_scaled, y_train)

selected_features = X.columns[rfe.support_]

lin_final = LinearRegression()
lin_final.fit(X_train_scaled[:, rfe.support_], y_train)

y_pred = lin_final.predict(X_test_scaled[:, rfe.support_])
mse = mean_squared_error(y_test, y_pred)

print(f"Selected Features: {selected_features}")
print(f"Mean Squared Error: {mse}")

Selected Features: Index(['Unnamed: 0', 'Area', 'Rooms'], dtype='object')
Mean Squared Error: 19065482036700.31
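
In the run above, all three columns are kept, including the saved index, so no feature is actually eliminated. The sketch below shows what a wrapper method looks like when it does drop something: RFE is asked for fewer features than are available and its `ranking_` attribute is inspected. It rests on assumptions made for illustration only: the data is synthetic, the `Age` column and the price formula are hypothetical, and the price is deliberately made to depend on `Area` and `Rooms` but not on `Age`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic housing data; the Age column and the price formula are hypothetical.
rng = np.random.default_rng(42)
n = 200
houses = pd.DataFrame({
    "Area": rng.integers(600, 2500, n),
    "Rooms": rng.integers(2, 6, n),
    "Age": rng.integers(0, 50, n),
})

# Assumption for the demo: price depends on Area and Rooms, not on Age.
price = 5000 * houses["Area"] + 200000 * houses["Rooms"] + rng.normal(0, 500000, n)

# Scale so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(houses)

# Wrapper step: refit the model repeatedly, dropping the weakest feature each round.
rfe_demo = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe_demo.fit(X_scaled, price)

# Rank 1 marks a kept feature; higher ranks were eliminated earlier.
rfe_ranking = pd.Series(rfe_demo.ranking_, index=houses.columns)
print(rfe_ranking)
```

Choosing `n_features_to_select` below the number of available columns is what forces the elimination; `RFECV` is an option when the right number is not known in advance.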
