
In [16]:
import pydataset

import sklearn.feature_selection
import sklearn.linear_model
import sklearn.model_selection


In [3]:
tips = pydataset.data('tips')
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4


In [4]:
tips.columns

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')

In [6]:
tips.columns = ['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'party_size']
tips.head(1)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,party_size
1,16.99,1.01,Female,No,Sun,Dinner,2


In [7]:
tips['tip_percentage'] = tips.tip / tips.total_bill
tips['price_per_person'] = tips.total_bill / tips.party_size

In [8]:
tips.head(1)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,party_size,tip_percentage,price_per_person
1,16.99,1.01,Female,No,Sun,Dinner,2,0.059447,8.495


In [9]:
k = 2

In [12]:
kbest = sklearn.feature_selection.SelectKBest(sklearn.feature_selection.f_regression, k=k)

In [14]:
lm = sklearn.linear_model.LinearRegression()

In [15]:
rfe = sklearn.feature_selection.RFE(lm, n_features_to_select=k)

## Split

- Split your data shortly after you acquire it. Everything you learn about the dataset should come from the training set, not the test set. Letting information from the test set leak into your exploration is known as data snooping bias.


- Write functions for any transformations you perform on the training set, so that you can apply exactly the same transformations to the test set.
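
One way to follow the second bullet is to split a transformation into a "fit on train" step and an "apply anywhere" step. The sketch below assumes min-max scaling as the transformation; the helper names `fit_scaler` and `apply_scaler` and the tiny `demo_*` frames are made up for illustration and are not part of this notebook.

```python
import pandas as pd
import sklearn.preprocessing


def fit_scaler(X_train):
    """Learn the scaling parameters from the training features only."""
    scaler = sklearn.preprocessing.MinMaxScaler()
    scaler.fit(X_train)
    return scaler


def apply_scaler(scaler, X):
    """Apply an already-fitted scaler, keeping column names and index."""
    return pd.DataFrame(scaler.transform(X), columns=X.columns, index=X.index)


# Made-up data, for illustration only.
demo_train = pd.DataFrame({'total_bill': [10.0, 20.0, 30.0], 'party_size': [1, 2, 3]})
demo_test = pd.DataFrame({'total_bill': [40.0], 'party_size': [2]})

scaler = fit_scaler(demo_train)                     # parameters come from train only
demo_train_scaled = apply_scaler(scaler, demo_train)
demo_test_scaled = apply_scaler(scaler, demo_test)  # same parameters reused on test
```

Because the scaler never sees the test rows during fitting, a test value outside the training range (such as the 40.0 bill here) can scale to a value above 1.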

In [18]:
train, test = sklearn.model_selection.train_test_split(tips, random_state=123, train_size=.8)

In [19]:
X_cols = ['total_bill', 'party_size', 'tip_percentage', 'price_per_person']

X_train = train[X_cols]
y_train = train.tip

X_test = test[X_cols]
y_test = test.tip

## Select Features Using KBest

- `.transform(X_train)` drops every column except the `k` best ones, so assign the result: `kbest_X_train = kbest.transform(X_train)`.


- `SelectKBest` scores each feature against the target one at a time; it does not account for interactions between features.
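
As a rough illustration of both bullets, the sketch below runs `SelectKBest` on synthetic data (the `demo_*` names and the data are assumptions, not from this notebook). By default `.transform` returns a NumPy array, so the example wraps the result back into a DataFrame using the kept column names, and it reads the per-feature scores from `scores_`.

```python
import numpy as np
import pandas as pd
import sklearn.feature_selection

# Synthetic data: y depends on 'a' and 'b'; 'noise' is unrelated to y.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    'a': rng.normal(size=100),
    'b': rng.normal(size=100),
    'noise': rng.normal(size=100),
})
demo_y = 3 * demo_X.a + 2 * demo_X.b + rng.normal(scale=0.1, size=100)

demo_kbest = sklearn.feature_selection.SelectKBest(
    sklearn.feature_selection.f_regression, k=2
)
demo_kbest.fit(demo_X, demo_y)

# Names of the columns that survive the selection.
kept = demo_X.columns[demo_kbest.get_support()]

# .transform drops the other columns; rebuild a labeled DataFrame from the array.
demo_X_kbest = pd.DataFrame(
    demo_kbest.transform(demo_X), columns=kept, index=demo_X.index
)

# One F-statistic per feature, each computed against y independently.
scores = pd.Series(demo_kbest.scores_, index=demo_X.columns)
```

Each entry in `scores` is computed from a single feature and the target, which is why this method cannot see how features behave together.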

In [20]:
# create the object,
# fit it: call .fit() on train only,
# then use the sklearn object (.transform, .predict) on train, test, and unseen data

kbest.fit(X_train, y_train)

SelectKBest(k=2, score_func=<function f_regression at 0x12e24f320>)

In [21]:
kbest.get_support()

array([ True,  True, False, False])

In [23]:
X_train.columns[kbest.get_support()]

Index(['total_bill', 'party_size'], dtype='object')

In [31]:
def select_kbest(X, y, k):
    kbest = sklearn.feature_selection.SelectKBest(sklearn.feature_selection.f_regression, k=k)
    kbest.fit(X, y)
    return X.columns[kbest.get_support()]

In [32]:
select_kbest(X_train, y_train, 2)

Index(['total_bill', 'party_size'], dtype='object')

## Select Features Using RFE

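- RFE (recursive feature elimination) uses the model's coefficients to decide how important each feature is.


- Because the model is fit on all of the remaining features at once, RFE takes the relationships between features into account.


- RFE repeatedly fits the model, drops the feature with the weakest coefficient, and refits until only the requested number of features remains.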


- A model with fewer features is generally more interpretable than a model with lots of features. 


- A small gain in performance may not be worth adding the complexity of another feature.


- Combinatorial explosion: as the number of features grows, the number of possible feature combinations shoots up quickly, and so does the difficulty of interpreting the model.
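
The elimination order can be inspected through the fitted object's `ranking_` attribute, where selected features get rank 1 and higher ranks mean earlier elimination. The sketch below uses synthetic data with made-up `demo_*` names (assumptions, not from this notebook). All four synthetic features share the same scale, which matters because coefficient magnitudes depend on each feature's units.

```python
import numpy as np
import pandas as pd
import sklearn.feature_selection
import sklearn.linear_model

# Synthetic data: 'a' and 'b' matter most, 'c' a little, 'd' not at all.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(200, 4)), columns=['a', 'b', 'c', 'd'])
demo_y = 5 * demo_X.a + 3 * demo_X.b + 0.5 * demo_X.c + rng.normal(scale=0.1, size=200)

demo_rfe = sklearn.feature_selection.RFE(
    sklearn.linear_model.LinearRegression(), n_features_to_select=2
)
demo_rfe.fit(demo_X, demo_y)

# Rank 1 = kept; rank 2 = last feature dropped; rank 3 = dropped before that.
ranking = pd.Series(demo_rfe.ranking_, index=demo_X.columns).sort_values()
selected = demo_X.columns[demo_rfe.support_]
```

Here `d` is dropped first and `c` second, leaving `a` and `b` selected.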

In [27]:
rfe.fit(X_train, y_train)

RFE(estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                               normalize=False),
    n_features_to_select=2, step=1, verbose=0)

In [28]:
rfe.support_

array([ True, False,  True, False])

In [30]:
X_train.columns[rfe.support_]

Index(['total_bill', 'tip_percentage'], dtype='object')

In [37]:
def select_rfe(X, y, k):
    lm = sklearn.linear_model.LinearRegression()
    rfe = sklearn.feature_selection.RFE(lm, n_features_to_select=k)
    rfe.fit(X, y)
    return X.columns[rfe.support_]

In [38]:
select_rfe(X_train, y_train, 2)

Index(['total_bill', 'tip_percentage'], dtype='object')

## Swiss dataset

In [39]:
swiss = pydataset.data('swiss')
swiss.head()

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6


In [47]:
train, test = sklearn.model_selection.train_test_split(swiss, random_state=123, train_size=.8)

In [48]:
X_train = train.drop(columns='Fertility')
y_train = train.Fertility

X_test = test.drop(columns='Fertility')
y_test = test.Fertility

In [49]:
select_kbest(X_train, y_train, 3)

Index(['Examination', 'Education', 'Catholic'], dtype='object')

In [50]:
select_rfe(X_train, y_train, 3)

Index(['Examination', 'Education', 'Infant.Mortality'], dtype='object')