# Feature Selection from Mobile Data Using the Exhaustive Method (ExhaustiveFeatureSelector)
Dataset: [https://raw.githubusercontent.com/subashgandyer/datasets/main/mobile_price_train.csv]

In [22]:
import pandas as pd

In [23]:
url = "https://raw.githubusercontent.com/subashgandyer/datasets/main/mobile_price_train.csv"

In [24]:
df = pd.read_csv(url)
df.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,...,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,...,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,...,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,...,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,...,1208,1212,1411,8,2,15,1,1,0,1


In [25]:
df.columns

Index(['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi', 'price_range'],
      dtype='object')

### Understand the data
- Find how many features?
- Find how many samples?
- What are the data types of each feature column?
- What do you think could be the most important feature(s)?
- Run some feature selection methods
- Is your intuition right?
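
The first three questions can be answered directly with pandas. The following is a minimal sketch on a small stand-in frame (`sample` is a name chosen for illustration) built from a few columns of the rows printed above; in the notebook the same calls would be made on `df`.

```python
import pandas as pd

# Stand-in for `df`: a few columns from the first rows shown by df.head().
# The name `sample` is only used for this illustration.
sample = pd.DataFrame({
    "battery_power": [842, 1021, 563, 615, 1821],
    "clock_speed": [2.2, 0.5, 0.5, 2.5, 1.2],
    "ram": [2549, 2631, 2603, 2769, 1411],
    "price_range": [1, 2, 2, 2, 1],
})

n_samples, n_columns = sample.shape
n_features = n_columns - 1                    # every column except the target
print(n_samples, n_features)                  # number of samples and features
print(sample.dtypes)                          # data type of each column
print(sample["price_range"].value_counts())   # class balance of the target
```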

### Import the necessary libraries

In [26]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

### Read the mobile data

In [27]:
# `df` was already loaded from the URL above. To read a local copy instead:
# df = pd.read_csv("data/mobile_price_train.csv")

### Split the dataset into X and y

In [28]:
X = df.iloc[:,0:20]
y = df.iloc[:,-1]

### Sanity check

In [29]:
X.shape, y.shape

((2000, 20), (2000,))

### How many features

In [30]:
X.shape[1]

20

### Import the ExhaustiveFeatureSelector from mlxtend library

In [31]:
from mlxtend.feature_selection import ExhaustiveFeatureSelector

### Import the KNN classifier

In [32]:
from sklearn.neighbors import KNeighborsClassifier

### Build a KNN model with n_neighbors = 3

In [33]:
knn = KNeighborsClassifier(n_neighbors=3)

### Build ExhaustiveFeatureSelector with the KNN model and min and max features as 1 and 4

In [34]:
efs1 = ExhaustiveFeatureSelector(knn,
           min_features=1,
           max_features=4,
           scoring='accuracy',
           print_progress=True,
           cv=5)
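
An exhaustive search evaluates every combination of features within the requested size range, so the cost grows quickly with `max_features`. The sketch below counts the candidate subsets for 20 features and sizes 1 to 4 using only the standard library; the total corresponds to the progress counter printed while fitting.

```python
from math import comb

n_features = 20                      # columns in X
min_features, max_features = 1, 4    # values passed to ExhaustiveFeatureSelector

# One candidate subset per combination of k features, for each k in the range.
counts = {k: comb(n_features, k) for k in range(min_features, max_features + 1)}
total = sum(counts.values())

print(counts)   # {1: 20, 2: 190, 3: 1140, 4: 4845}
print(total)    # 6195
```

With `cv=5`, each of these subsets is fitted five times, so the search trains 30,975 models in total.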

### Train the ExhaustiveFeatureSelector model

In [35]:
efs1.fit(X, y)

Features: 6195/6195

ExhaustiveFeatureSelector(estimator=KNeighborsClassifier(n_neighbors=3),
                          max_features=4)
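
To make the idea behind the selector concrete, here is a hand-rolled sketch of an exhaustive search written with `itertools` and scikit-learn only. It is not mlxtend's implementation; the helper name `exhaustive_search` and the synthetic data are assumptions made for illustration.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 200 samples, 5 features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)


def exhaustive_search(estimator, X, y, min_features=1, max_features=2, cv=5):
    """Cross-validate every feature subset in the size range; return the best."""
    best_score, best_idx = -1.0, None
    for k in range(min_features, max_features + 1):
        for idx in combinations(range(X.shape[1]), k):
            # Mean accuracy over the CV folds, using only the chosen columns.
            score = cross_val_score(
                estimator, X[:, list(idx)], y, cv=cv, scoring="accuracy"
            ).mean()
            if score > best_score:
                best_score, best_idx = score, idx
    return best_idx, best_score


best_idx, best_score = exhaustive_search(
    KNeighborsClassifier(n_neighbors=3), X_demo, y_demo
)
print(best_idx, round(best_score, 3))
```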

### Explore the best feature names from the model

In [36]:
print('Best accuracy score: %.2f' % efs1.best_score_)
print('Best subset (indices):', efs1.best_idx_)
print('Best subset (corresponding names):', efs1.best_feature_names_)

Best accuracy score: 0.92
Best subset (indices): (0, 11, 12, 13)
Best subset (corresponding names): ('battery_power', 'px_height', 'px_width', 'ram')


### All subsets

### Best Score feature subset

### Getting Metric dict

### Plotting metric_dict helper function code

In [37]:
import matplotlib.pyplot as plt

# Collect the cross-validation results for every evaluated subset.
# The original cell used metric_dict without defining it.
metric_dict = efs1.get_metric_dict()

fig = plt.figure(figsize=(50,50))

k_feat = sorted(metric_dict.keys())
avg = [metric_dict[k]['avg_score'] for k in k_feat]

# Band of one standard deviation around the mean accuracy of each subset.
upper, lower = [], []
for k in k_feat:
    upper.append(metric_dict[k]['avg_score'] +
                 metric_dict[k]['std_dev'])
    lower.append(metric_dict[k]['avg_score'] -
                 metric_dict[k]['std_dev'])

plt.fill_between(k_feat,
                 upper,
                 lower,
                 alpha=0.2,
                 color='blue',
                 lw=1)

plt.plot(k_feat, avg, color='blue', marker='o')
plt.ylabel('Accuracy +/- Standard Deviation')
plt.xlabel('Feature Subset')
# One tick per subset; with 6195 subsets the labels will be very dense.
plt.xticks(k_feat,
           [str(metric_dict[k]['feature_names']) for k in k_feat],
           rotation=90)
plt.show()

## Random Forest Classifier

### Import RandomForestClassifier

### Build the Random Forest model

### Build the Exhaustive Feature Selector with Random Forest as the Learning Algorithm

### Train the model

### Best features

### Best Score

### Collect all Metric dict with all the feature subsets

### Plot metric dict with all subsets

## KNN as Learning Algorithm

### Choose KNN as your learning algorithm

### Build an ExhaustiveFeatureSelector model with KNN as learning algorithm

### Train the model

### Get the feature names

### Try some other learning algorithms you know of
- SVM
- Anything of your choice

### Summarize the list of features chosen by different algorithms
The first row is an example; fill in the rest.

| Algorithm | Best Features | Best Accuracy |
|---|---|---|
| Logistic Regression | ['battery_power', 'ram'] | 0.823 |
| KNN | ???? | ???? |
| RF | ???? | ???? |
| Algorithms of your choice | ???? | ???? |