## Feature Engineering Exercises

Do your work for this exercise in a Jupyter notebook named feature_engineering within the regression-exercises repo. Add, commit, and push your work.

1. Load the tips dataset.

In [1]:
import pandas as pd
import numpy as np
import math

from pydataset import data

import warnings
warnings.filterwarnings("ignore")

In [2]:
df = data('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 1 to 244
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 15.2+ KB


a. Create a column named tip_percentage. This should be the tip amount divided by the total bill.

In [4]:
df['tip_percentage'] = df.tip / df.total_bill
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage
1,16.99,1.01,Female,No,Sun,Dinner,2,0.059447
2,10.34,1.66,Male,No,Sun,Dinner,3,0.160542
3,21.01,3.5,Male,No,Sun,Dinner,3,0.166587
4,23.68,3.31,Male,No,Sun,Dinner,2,0.13978
5,24.59,3.61,Female,No,Sun,Dinner,4,0.146808


b. Create a column named price_per_person. This should be the total bill divided by the party size.

In [5]:
df['price_per_person'] = df.total_bill / df['size']
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage,price_per_person
1,16.99,1.01,Female,No,Sun,Dinner,2,0.059447,8.495
2,10.34,1.66,Male,No,Sun,Dinner,3,0.160542,3.446667
3,21.01,3.5,Male,No,Sun,Dinner,3,0.166587,7.003333
4,23.68,3.31,Male,No,Sun,Dinner,2,0.13978,11.84
5,24.59,3.61,Female,No,Sun,Dinner,4,0.146808,6.1475


c. Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? The tip percentage?

total_bill, size

d. Use all the other numeric features to predict tip amount. Use select k best and recursive feature elimination to select the top 2 features. What are they?

In [6]:
from sklearn.feature_selection import SelectKBest, f_regression

# keep only the numeric columns, then drop the target
X = df.select_dtypes(exclude='O').drop(columns=['tip'])
y = df['tip']

f_selector = SelectKBest(score_func=f_regression, k=2)
f_selector.fit(X, y)

SelectKBest(k=2, score_func=<function f_regression at 0x7ff53df0b700>)

In [7]:
X.columns[f_selector.get_support()]

Index(['total_bill', 'size'], dtype='object')
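
To see why a selector keeps particular columns, its fitted `scores_` and `pvalues_` attributes can be inspected. The sketch below is a minimal illustration that runs on its own. It uses made-up data rather than the tips dataset, and the column names `strong`, `weak`, and `noise` are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Illustrative synthetic data (an assumption, not the tips dataset).
rng = np.random.default_rng(42)
n = 500
demo_X = pd.DataFrame({
    'strong': rng.normal(size=n),
    'weak': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
# The target depends heavily on 'strong', moderately on 'weak', and not on 'noise'.
demo_y = 3 * demo_X['strong'] + 1.5 * demo_X['weak'] + rng.normal(scale=0.5, size=n)

selector = SelectKBest(score_func=f_regression, k=2).fit(demo_X, demo_y)

# One F statistic and p-value per column, each computed without looking at the other columns.
score_table = pd.DataFrame(
    {'f_score': selector.scores_, 'p_value': selector.pvalues_},
    index=demo_X.columns,
).sort_values('f_score', ascending=False)

selected = list(demo_X.columns[selector.get_support()])
print(score_table)
print(selected)
```

The same two attributes are available on the `f_selector` fitted above, so the score table can be rebuilt for the tips features by swapping in `X` and `f_selector`.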

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

lm = LinearRegression()
rfe = RFE(estimator=lm, n_features_to_select=2)
rfe.fit(X, y)

RFE(estimator=LinearRegression(), n_features_to_select=2)

In [9]:
X.columns[rfe.support_]

Index(['total_bill', 'tip_percentage'], dtype='object')

In [10]:
pd.Series(dict(zip(X.columns, rfe.ranking_))).sort_values()

total_bill          1
tip_percentage      1
size                2
price_per_person    3
dtype: int64

e. Use all the other numeric features to predict tip percentage. Use select k best and recursive feature elimination to select the top 2 features. What are they?

In [11]:
# keep only the numeric columns, then drop the target
X2 = df.select_dtypes(exclude='O').drop(columns=['tip_percentage'])
y2 = df['tip_percentage']

f_selector = SelectKBest(score_func=f_regression, k=2)
f_selector.fit(X2, y2)

SelectKBest(k=2, score_func=<function f_regression at 0x7ff53df0b700>)

In [12]:
X2.columns[f_selector.get_support()]

Index(['total_bill', 'tip'], dtype='object')

In [13]:
lm = LinearRegression()
rfe = RFE(estimator=lm, n_features_to_select=2)
rfe.fit(X2, y2)

RFE(estimator=LinearRegression(), n_features_to_select=2)

In [14]:
X2.columns[rfe.support_]

Index(['tip', 'size'], dtype='object')

In [15]:
pd.Series(dict(zip(X2.columns, rfe.ranking_))).sort_values()

tip                 1
size                1
price_per_person    2
total_bill          3
dtype: int64

f. Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?

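SelectKBest scores each feature on its own against the target (here with a univariate F-test via f_regression) and keeps the k highest scores, so it ignores how the features overlap or interact. RFE fits a model on all of the features together, drops the feature with the smallest coefficient magnitude, and repeats until the requested number of features remain, so its ranking depends on the other features in the model and on each feature's scale. The selections can change with the number of features requested: the two methods agree trivially when k equals the total number of features, and they can disagree at smaller k.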
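
One concrete source of disagreement is feature scale. With a LinearRegression estimator, RFE ranks features by coefficient size, and a feature measured in small units tends to get a large coefficient. The sketch below illustrates this on made-up data (the column names `big_scale` and `small_scale`, the helper names `kbest_pick` and `rfe_pick`, and all values are assumptions for illustration, not taken from the tips dataset).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data (an assumption, not the tips dataset).
rng = np.random.default_rng(0)
n = 500
demo_X = pd.DataFrame({
    'big_scale': rng.normal(scale=100, size=n),     # values in the hundreds
    'small_scale': rng.normal(scale=0.01, size=n),  # values in the hundredths
})
# 'big_scale' drives almost all of the variation in the target.
demo_y = (0.05 * demo_X['big_scale']
          + 20 * demo_X['small_scale']
          + rng.normal(scale=0.1, size=n))


def kbest_pick(X, y, k):
    """Hypothetical helper: column names kept by SelectKBest."""
    selector = SelectKBest(score_func=f_regression, k=k).fit(X, y)
    return list(X.columns[selector.get_support()])


def rfe_pick(X, y, k):
    """Hypothetical helper: column names kept by RFE with a linear model."""
    selector = RFE(estimator=LinearRegression(), n_features_to_select=k).fit(X, y)
    return list(X.columns[selector.support_])


# Same features rescaled to mean 0 and standard deviation 1.
scaled_X = pd.DataFrame(StandardScaler().fit_transform(demo_X),
                        columns=demo_X.columns)

kbest_choice = kbest_pick(demo_X, demo_y, 1)
rfe_choice = rfe_pick(demo_X, demo_y, 1)
rfe_scaled_choice = rfe_pick(scaled_X, demo_y, 1)

print('SelectKBest:       ', kbest_choice)
print('RFE (raw units):   ', rfe_choice)
print('RFE (standardized):', rfe_scaled_choice)
```

On this data SelectKBest keeps `big_scale`, RFE on the raw units keeps `small_scale`, and RFE on the standardized features keeps `big_scale`. In the tips data, tip_percentage is a fraction while total_bill is in dollars, so scale may be part of why RFE ranks tip_percentage above size. Standardizing the features before running RFE is one way to check.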

2. Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [16]:
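def select_kbest(X, y, k):
    """Return the names of the top k features ranked by SelectKBest with f_regression."""
    # k must be passed by keyword in recent scikit-learn releases
    f_selector = SelectKBest(score_func=f_regression, k=k)
    f_selector.fit(X, y)

    # get_support() is a boolean mask aligned with X's columns
    k_features = X.columns[f_selector.get_support()]

    return k_features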

In [17]:
select_kbest(X, y, 2)

Index(['total_bill', 'size'], dtype='object')

3. Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [18]:
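def rfe(X, y, n):
    """Return the names of the top n features chosen by RFE with a linear regression."""
    lm = LinearRegression()
    # n_features_to_select must be passed by keyword in recent scikit-learn releases
    selector = RFE(estimator=lm, n_features_to_select=n)
    selector.fit(X, y)

    # support_ is a boolean mask aligned with X's columns
    n_features = X.columns[selector.support_]

    return n_features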

In [19]:
rfe(X, y, 2)

Index(['total_bill', 'tip_percentage'], dtype='object')

4. Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

In [20]:
df2 = data('swiss')
df2.head()

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6


In [21]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 47 entries, Courtelary to Rive Gauche
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Fertility         47 non-null     float64
 1   Agriculture       47 non-null     float64
 2   Examination       47 non-null     int64  
 3   Education         47 non-null     int64  
 4   Catholic          47 non-null     float64
 5   Infant.Mortality  47 non-null     float64
dtypes: float64(4), int64(2)
memory usage: 2.6+ KB


In [22]:
X3 = df2.drop(columns='Fertility')
y3 = df2['Fertility']

In [23]:
select_kbest(X3, y3, 3)

Index(['Examination', 'Education', 'Catholic'], dtype='object')

In [24]:
rfe(X3, y3, 3)

Index(['Examination', 'Education', 'Infant.Mortality'], dtype='object')
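
The two methods agree on two of the three swiss features. To see how that agreement varies with the number of features requested, both selectors can be run for every k and their overlap counted. The sketch below does this on synthetic data from `make_regression` so that it runs on its own. The column names `x0` to `x4` and the dataset settings are assumptions for illustration, not the swiss data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

# Illustrative synthetic data (an assumption, not the swiss dataset).
features, target = make_regression(
    n_samples=100, n_features=5, n_informative=3, noise=10.0, random_state=0
)
demo_X = pd.DataFrame(features, columns=[f'x{i}' for i in range(5)])

rows = []
for k in range(1, demo_X.shape[1] + 1):
    kbest = SelectKBest(score_func=f_regression, k=k).fit(demo_X, target)
    rfe_selector = RFE(estimator=LinearRegression(), n_features_to_select=k).fit(demo_X, target)

    kbest_set = set(demo_X.columns[kbest.get_support()])
    rfe_set = set(demo_X.columns[rfe_selector.support_])

    # Record how many features both methods chose at this k.
    rows.append({
        'k': k,
        'shared': len(kbest_set & rfe_set),
        'kbest': sorted(kbest_set),
        'rfe': sorted(rfe_set),
    })

overlap = pd.DataFrame(rows).set_index('k')
print(overlap)
```

Replacing `demo_X` and `target` with `X3` and `y3` would produce the same table for the swiss features. When k equals the number of columns, both methods keep everything, so the last row always shows full agreement.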