## **Exercise 3.04: Guided exercise**
### Automated Feature Selection Techniques

Follow the steps below, one per code cell, to perform the tasks:

1. Load the dataset from GitHub.

2. Look for null values in the dataset and remove them if they exist.

Start with these initial tasks on the dataset:

## Loading the Dataset

In [None]:
import pandas as pd
data = 'https://raw.githubusercontent.com/fenago/datawrangling/main/miami-housing.csv'
df = pd.read_csv(data)
df.sample(5)

## Printing the shape of dataset

In [None]:
df.shape

## Printing the names of features/columns

In [None]:
df.columns

## Print the total null values per column

In [None]:
# total null values per column
df.isnull().sum()
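
The cell above only counts the nulls. As a minimal sketch of the "remove them if they exist" step, the example below uses a small made-up frame (the `toy` data and its column names are assumptions, not part of the Miami dataset) to count nulls per column and then drop the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values
toy = pd.DataFrame({
    "price": [100.0, 200.0, np.nan, 400.0],
    "area": [10.0, np.nan, 30.0, 40.0],
    "age": [1, 2, 3, 4],
})

# total null values per column
null_counts = toy.isnull().sum()
print(null_counts)

# drop every row that contains at least one null value
toy_clean = toy.dropna().reset_index(drop=True)
print(toy_clean.shape)
```

Dropping rows is the simplest option; filling values (for example with `fillna`) is an alternative when too many rows would be lost.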


## Uncorrelated Features:
Each feature is checked against the target variable: if the two are uncorrelated, the feature carries little linear information about the target, so just drop it.

Drop numeric features whose absolute correlation with the target is below the threshold of 0.2.

Then show the correlation between the target and the remaining features.

In [None]:
# copy of the data with null values filled with 0 (df itself is left unchanged)
df_isnull = df.fillna(0)


In [None]:
# drop numeric features weakly correlated with the target (|correlation| < 0.2)
corr = df.corr().loc['SALE_PRC'].abs()
corr = corr[corr < 0.2]
cols_to_drop = corr.index.to_list()
df = df.drop(cols_to_drop, axis=1)
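
The same filter can be wrapped in a small reusable function. This is a sketch on synthetic data, not the notebook's own method: the helper name `drop_weakly_correlated` and the toy columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def drop_weakly_correlated(frame, target, threshold=0.2):
    """Drop numeric columns whose |Pearson correlation| with target is below threshold."""
    corr = frame.corr()[target].abs()
    weak = corr[corr < threshold].index.tolist()
    return frame.drop(columns=weak), weak

# Hypothetical data: one informative feature and one pure-noise feature
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "target": signal + 0.1 * rng.normal(size=n),
    "strong": signal,
    "noise": rng.normal(size=n),
})

reduced, dropped = drop_weakly_correlated(toy, "target")
print(dropped)
print(reduced.columns.tolist())
```

The target column is never dropped because its correlation with itself is 1. Note that Pearson correlation only captures linear relationships, so a feature with a strong non-linear effect could still be removed by this filter.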

In [None]:
# correlation between target and features
(df.corr().loc['SALE_PRC']
 .plot(kind='barh', figsize=(4,10)))

## **Low variance features**
Check for features with low variance in the dataset, then drop them.

In [None]:
import seaborn as sns
import numpy as np
# variance of numeric features
(df.select_dtypes(include=np.number).var().astype('str'))

The lowest variance belongs to structure_quality.

This feature is dropped later in the exercise. Inspect its distribution first:

In [None]:
df['structure_quality'].describe()
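
scikit-learn also offers `VarianceThreshold` for this kind of filter. The sketch below runs it on made-up data; the threshold of 0.01 and the column names are assumptions chosen only to make the behavior visible.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: two nearly constant columns and one that varies
toy = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],
    "almost_constant": [0.0, 0.0, 0.0, 0.1],
    "varied": [1.0, 5.0, 9.0, 13.0],
})

# keep only features whose variance is above the threshold
selector = VarianceThreshold(threshold=0.01)
selector.fit(toy)
kept = toy.columns[selector.get_support()].tolist()
print(kept)
```

Variance depends on the scale of each feature, so comparing raw variances across features measured in different units (square feet versus a 1 to 5 quality score, for example) can be misleading. Rescaling features to a common range before comparing is one way to address this.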

### **Multi-collinearity**
Check for features that are strongly correlated with each other. Such features carry nearly the same information, so one feature of each pair can be deleted.

In the heatmap, “TOT_LVG_AREA” and “LND_SQFOOT” stand out as more correlated with SALE_PRC.

So when two features are strongly related, you can eliminate one of them and let the other feature predict the target variable.

In [None]:
import matplotlib.pyplot as plt
sns.set(rc={'figure.figsize':(16,10)})
sns.heatmap(df.corr(),
            annot=True,
            linewidths=.5,
            center=0,
            cbar=False,
            cmap="PiYG")
plt.show()

### Drop correlated features

In [None]:
# drop correlated features
df = df.drop(['SPEC_FEAT_VAL', 'SUBCNTR_DI', 'structure_quality'], axis=1)
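
Reading pairs off a heatmap gets harder as the number of features grows. As a sketch, strongly correlated pairs can also be listed programmatically; the helper name `correlated_pairs`, the 0.8 cut-off, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Return feature pairs whose |correlation| exceeds threshold, strongest first."""
    corr = frame.corr().abs()
    # keep the strict upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Hypothetical data: b is almost a copy of a, c is independent
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=n),
    "c": rng.normal(size=n),
})

pairs = correlated_pairs(toy)
print(pairs)
```

For each pair returned, one reasonable choice is to keep the feature that correlates more strongly with the target and drop the other.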

### Numerical Features:

In [None]:
df_num = df[['LND_SQFOOT', 'TOT_LVG_AREA']]
df_num.sample(5)

### Create a crosstab/contingency table of the two numerical features

In [None]:
crosstab = pd.crosstab(df_num['LND_SQFOOT'], df_num['TOT_LVG_AREA'])
crosstab

### Run Chi-squared test on the contingency table that will tell us whether the two features are independent

In [None]:
from scipy.stats import chi2_contingency
chi2_contingency(crosstab)
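
The chi-squared test of independence is designed for categorical counts. With continuous features, nearly every value is unique, so the contingency table becomes very large and sparse. A common workaround, sketched here on synthetic data (the variable names, the quartile binning, and the 0.05 significance level are assumptions), is to bin the values first and then unpack the test result.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: living area depends on land size plus noise
rng = np.random.default_rng(42)
n = 1000
land = pd.Series(rng.uniform(1000, 10000, size=n), name="land")
living = pd.Series(0.3 * land + rng.normal(0, 300, size=n), name="living")

# bin each continuous feature into quartiles
labels = ["q1", "q2", "q3", "q4"]
land_bins = pd.qcut(land, q=4, labels=labels)
living_bins = pd.qcut(living, q=4, labels=labels)

# contingency table of bin counts, then the chi-squared test
table = pd.crosstab(land_bins, living_bins)
chi2_stat, p_value, dof, expected = chi2_contingency(table)

print(table)
print("degrees of freedom:", dof)
print("reject independence at the 5% level:", p_value < 0.05)
```

A small p-value suggests the two features are not independent, which supports dropping one of them.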

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# drop rows with missing values
df = df.dropna()
# get dummies for categorical features
df = pd.get_dummies(df, drop_first=True)
# X features
X = df.drop('SALE_PRC', axis=1)
# y target
y = df['SALE_PRC']
# split data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# scaling: fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# convert back to dataframe
X_train = pd.DataFrame(X_train, columns = X.columns.to_list())
X_test = pd.DataFrame(X_test, columns = X.columns.to_list())
# instantiate model
model = LinearRegression()
# fit
model.fit(X_train, y_train)
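
One way to make sure the scaler only ever learns from the training data is to chain it with the model in a scikit-learn pipeline. This is a sketch on synthetic data from `make_regression`, not the Miami dataset; the `_demo` names are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical regression data
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# fit() scales with training statistics; score() reuses them on the test set
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
r2 = pipe.score(X_te, y_te)
print("test R^2:", round(r2, 3))
```

The fitted coefficients remain available through `pipe[-1].coef_` if they are needed for the coefficient plot.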

### **Feature Coefficients**
Because this is a linear regression model fitted on standardized features, the regression coefficients show the relative contribution of each feature to the model.

In [None]:
# feature coefficients
coeffs = model.coef_
# visualizing coefficients
index = X_train.columns.tolist()
(pd.DataFrame(coeffs, index = index, columns = ['coeff']).sort_values(by = 'coeff')
 .plot(kind='barh', figsize=(4,10)))

In [None]:
# keep only the features whose coefficient is not near zero (|coeff| > 1)
temp = pd.DataFrame(coeffs, index = index, columns = ['coeff']).sort_values(by = 'coeff')
temp = temp[temp['coeff'].abs() > 1]
# restrict the training and test sets to those features
cols_coeff = temp.index.to_list()
X_train = X_train[cols_coeff]
X_test = X_test[cols_coeff]

## **P-value**

In [None]:
import statsmodels.api as sm
# statsmodels does not add an intercept automatically, so add a constant column
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.summary())

## **Variance Inflation Factor**
Check the VIF of each feature to detect multicollinearity.

Keep all the features that have a VIF below 10.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# calculate VIF for every feature in X
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns, name='vif')
# display the features with VIF below 10 in a table
vif_df = vif.to_frame().sort_values(by = 'vif', ascending=False)
vif_df[vif_df['vif']<10]

## **Feature Importance:**


## Fit a model using a decision tree

## Then plot the feature importances

In [None]:
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor  # SALE_PRC is continuous, so use a regressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
# get importance
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))

# plot feature importance with a cut-off line at 0.05
plt.bar(range(len(importance)), importance)
plt.axhline(y=0.05, color='r', linestyle='-')
plt.show()
# list only the highly important features (score >= 0.05) to feed into a model
for i, v in enumerate(importance):
    if v >= 0.05:
        print('Feature: %0d, Score: %.5f' % (i, v))


## Check out feature importance

In [None]:
#  feature importance
importances = model.feature_importances_
# visualization
cols = X.columns
(pd.DataFrame(importances, cols, columns = ['importance'])
 .sort_values(by='importance', ascending=True)
 .plot(kind='barh', figsize=(4,10)))
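
To go from the plot to an actual list of features, the importances can be stored in a named Series and filtered with the same 0.05 cut-off. This is a sketch on synthetic data; the `f0` to `f5` column names and the `demo_` variables are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical regression data with two informative features out of six
X_demo, y_demo = make_regression(n_samples=300, n_features=6, n_informative=2,
                                 random_state=0)
demo_cols = [f"f{i}" for i in range(X_demo.shape[1])]

tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# importances indexed by feature name, most important first
demo_importances = pd.Series(tree.feature_importances_, index=demo_cols)
demo_importances = demo_importances.sort_values(ascending=False)

# names of the features at or above the cut-off
selected = demo_importances[demo_importances >= 0.05].index.tolist()
print(demo_importances)
print(selected)
```

Impurity-based importances from a single tree can change noticeably between runs and data splits, so the cut-off is best treated as a rough guide.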

## **Automated Feature Selection Techniques**

Perform a Chi-square based technique and regularization:

1. Import modules

2. Select the K best features

3. Keep the top 75% of features

In [None]:
# import modules
from sklearn.feature_selection import (SelectKBest, chi2, SelectPercentile,
                                       SelectFromModel, SequentialFeatureSelector)

## **Chi-Square**

In [None]:
# select K best features; k='all' keeps every feature, so pass an integer to keep only the top k
X_best = SelectKBest(chi2, k='all').fit_transform(X,y)
# number of best features
X_best.shape[1]

In [None]:
# keep 75% top features 
X_top = SelectPercentile(chi2, percentile = 75).fit_transform(X,y)
# number of best features
X_top.shape[1]
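
scikit-learn's `chi2` score is intended for classification targets and non-negative features. For a continuous target such as a sale price, `f_regression` is the univariate score function scikit-learn provides for regression. The sketch below swaps it in on synthetic data; it is an alternative to the chi-square cells above rather than a reproduction of them, and `k=3` is an arbitrary choice.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical regression data with 3 informative features out of 8
X_demo, y_demo = make_regression(n_samples=300, n_features=8, n_informative=3,
                                 noise=5.0, random_state=0)

# keep the 3 features with the highest univariate F-score
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X_demo, y_demo)

# column positions of the selected features
selected_idx = selector.get_support(indices=True)
print(X_selected.shape)
print(selected_idx)
```

`SelectPercentile` accepts the same `score_func` argument, so the 75% variant works the same way.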

## **Regularization**

In [None]:
# implement algorithm: L1-penalized linear SVM (note that LinearSVC is a classifier)
from sklearn.svm import LinearSVC
model = LinearSVC(penalty= 'l1', C = 0.002, dual=False)
model.fit(X,y)
# select features using the meta transformer
selector = SelectFromModel(estimator = model, prefit=True)
X_new = selector.transform(X)
X_new.shape[1]

# names of selected features
feature_names = np.array(X.columns)
feature_names[selector.get_support()]
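
Since `LinearSVC` is a classifier, an L1-penalized regressor may be a better fit for a continuous target. The sketch below swaps in `Lasso` with `SelectFromModel` on synthetic data. It is an assumption about what suits a regression task, not the notebook's own method, and `alpha=1.0` is an arbitrary value that would normally be tuned.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical regression data with 3 informative features out of 10
X_demo, y_demo = make_regression(n_samples=300, n_features=10, n_informative=3,
                                 noise=5.0, random_state=0)

# L1 penalties are scale-sensitive, so standardize first
X_scaled = StandardScaler().fit_transform(X_demo)

# the L1 penalty shrinks the coefficients of unhelpful features to exactly zero
lasso = Lasso(alpha=1.0).fit(X_scaled, y_demo)

# keep the features whose coefficient survived the penalty
selector_demo = SelectFromModel(estimator=lasso, prefit=True)
X_new_demo = selector_demo.transform(X_scaled)
print(X_new_demo.shape[1], "features kept out of", X_scaled.shape[1])
```

A larger `alpha` removes more features; `LassoCV` can choose the value by cross-validation.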
