# Exercises

Do your work for this exercise in a Jupyter notebook named feature_engineering within the regression-exercises repo. Add, commit, and push your work.

Load the tips dataset.

Create a column named price_per_person. This should be the total bill divided by the party size.

Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount?

Use select k best to select the top 2 features for predicting tip amount. What are they?

Use recursive feature elimination to select the top 2 features for tip amount. What are they?

Why do you think select k best and recursive feature elimination might give different answers for the top features? 

Does this change as you change the number of features you are selecting?

Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

Load the swiss dataset and use all the other features to predict Fertility. 

Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

In [1]:
from pydataset import data
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
import sklearn.linear_model
import sklearn.feature_selection
import sklearn.preprocessing
import features  # helper module providing select_rfe (used below); its source is not shown in this notebook

In [2]:
def split(df):
    '''
    Take in a DataFrame and return train, validate, and test DataFrames
    (roughly 56%, 24%, and 20% of the rows, respectively).
    '''
    train_validate, test = train_test_split(df, test_size=.2, random_state=123)
    train, validate = train_test_split(train_validate, 
                                       test_size=.3, 
                                       random_state=123)
    return train, validate, test
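
As a quick check of what the two-step split produces, the sketch below reruns the same logic on a hypothetical stand-in frame (`demo`, a single made-up column) with 244 rows, the row count of the tips dataset, and prints the size of each part.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split(df):
    """Return train, validate, and test DataFrames (same logic as the cell above)."""
    train_validate, test = train_test_split(df, test_size=.2, random_state=123)
    train, validate = train_test_split(train_validate, test_size=.3, random_state=123)
    return train, validate, test


# Hypothetical stand-in data: one made-up column, 244 rows like the tips dataset
demo = pd.DataFrame({"x": np.arange(244)})
train, validate, test = split(demo)

# Report the number of rows and the share of the original frame in each part
for name, part in [("train", train), ("validate", validate), ("test", test)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(demo):.0%})")
```

The validate share is 30% of the remaining 80%, which is why it ends up near 24% of the full dataset rather than 30%.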

In [3]:
df = data('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4


# Create a column named price_per_person. This should be the total bill divided by the party size. Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount?

In [4]:
df["tip_percentage"] = df.tip / df.total_bill
# Select the column with brackets: df.size is the DataFrame attribute
# (total number of elements), not the "size" column.
df["price_per_person"] = df.total_bill / df["size"]

In [5]:
df = df[["total_bill", "tip", "size", "tip_percentage", "price_per_person"]]

In [6]:
train, validate, test = split(df)

In [7]:
target = "tip"
X_train = train.drop(columns=[target])
y_train = train[target]
X_validate = validate.drop(columns=[target])
y_validate = validate[target]
X_test = test.drop(columns=[target])
y_test = test[target]

X_train.head()

Unnamed: 0,total_bill,size,tip_percentage,price_per_person
19,16.97,3,0.206246,5.656667
173,7.25,2,0.710345,3.625
119,12.43,2,0.144811,6.215
29,21.7,2,0.198157,10.85
238,32.83,2,0.035638,16.415


In [8]:
scaler = sklearn.preprocessing.MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_validate_scaled = scaler.transform(X_validate)
X_test_scaled = scaler.transform(X_test)
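
The SelectKBest cells in this notebook are fit on the unscaled X_train rather than X_train_scaled. For f_regression that should not change the selection, because its score is derived from the correlation between each feature and the target, and correlation is unaffected by min-max scaling. A minimal sketch on synthetic data (the column names and coefficients here are made up for illustration, not taken from the tips data):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; names and coefficients are illustrative assumptions
rng = np.random.default_rng(123)
X = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, size=100),
    "party_size": rng.integers(1, 7, size=100).astype(float),
})
y = 0.15 * X["total_bill"] + 0.2 * X["party_size"] + rng.normal(0, 1, size=100)

# F-statistics on the raw features
f_raw, _ = f_regression(X, y)

# F-statistics on the min-max scaled features
X_scaled = MinMaxScaler().fit_transform(X)
f_scaled, _ = f_regression(X_scaled, y)

# The two sets of scores should agree up to floating-point error
print(np.allclose(f_raw, f_scaled))
```

Scaling matters more for methods that compare model coefficients across features, such as recursive feature elimination with a linear model.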

# Use select k best to select the top 2 features for predicting tip amount. What are they?



In [9]:

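k = 2
# Score each feature against the target with f_regression and keep the top k
kbest = sklearn.feature_selection.SelectKBest(sklearn.feature_selection.f_regression, k=k)
kbest.fit(X_train, y_train)
# get_support() returns a boolean mask over the columns of X_train
kbest_features = X_train.columns[kbest.get_support()].tolist()
print(f"KBest's {k} best features are", kbest_features)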


KBest's 2 best features are ['total_bill', 'price_per_person']
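
The exercise also asks for the top 2 features according to recursive feature elimination. One way this might look, sketched on synthetic stand-in data rather than the real tips split (the frame `X_demo`, the target `y_demo`, and the coefficients are assumptions for illustration), with LinearRegression as an assumed choice of estimator:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for X_train / y_train; values are illustrative only
rng = np.random.default_rng(123)
n = 200
X_demo = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n),
    "size": rng.integers(1, 7, n),
})
X_demo["price_per_person"] = X_demo["total_bill"] / X_demo["size"]
y_demo = 0.15 * X_demo["total_bill"] + 0.3 * X_demo["size"] + rng.normal(0, 1, n)

# RFE repeatedly fits the estimator and drops the weakest feature
# until only n_features_to_select remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X_demo, y_demo)

# support_ is a boolean mask over the columns of X_demo
rfe_features = X_demo.columns[rfe.support_].tolist()
print("RFE's 2 best features are", rfe_features)
```

SelectKBest scores each feature on its own against the target, while RFE judges features by their coefficients in a model fit on all remaining features together, so the two methods can disagree, especially when features are correlated with each other.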


# Load the swiss dataset and use all the other features to predict Fertility. 

Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

In [10]:
swiss = data('swiss')
swiss.head()

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6


In [11]:
# Split the data
train, validate, test = split(swiss)

# Setup X and y
X_train = train.drop(columns='Fertility')
y_train = train.Fertility

X_validate = validate.drop(columns='Fertility')
y_validate = validate.Fertility

X_test = test.drop(columns='Fertility')
y_test = test.Fertility

In [12]:
# Scale the data
scaler = sklearn.preprocessing.MinMaxScaler()

# Fit the scaler
scaler.fit(X_train)

# Use the scaler to transform train, validate, test
X_train_scaled = scaler.transform(X_train)
X_validate_scaled = scaler.transform(X_validate)
X_test_scaled = scaler.transform(X_test)


# Turn everything into a dataframe
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_validate_scaled = pd.DataFrame(X_validate_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_train.columns)

In [13]:
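k = 3
# Score each feature against Fertility with f_regression and keep the top k
kbest = sklearn.feature_selection.SelectKBest(sklearn.feature_selection.f_regression, k=k)
kbest.fit(X_train, y_train)
# get_support() returns a boolean mask over the columns of X_train
kbest_features = X_train.columns[kbest.get_support()].tolist()
print(f"KBest's {k} best features are", kbest_features)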

KBest's 3 best features are ['Examination', 'Catholic', 'Infant.Mortality']
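
The manual steps above can be wrapped into the select_kbest function the exercise asks for. The following is a minimal sketch, assuming X is a pandas DataFrame and that f_regression is the scoring function, as in the cells above; the demo data at the bottom is synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression


def select_kbest(X, y, k):
    """Return the names of the k features with the highest f_regression scores."""
    kbest = SelectKBest(score_func=f_regression, k=k)
    kbest.fit(X, y)
    # get_support() is a boolean mask aligned with X.columns
    return X.columns[kbest.get_support()].tolist()


# Synthetic demo data (an assumption for illustration, not the swiss dataset)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(60, 4)), columns=["a", "b", "c", "d"])
y_demo = 3 * X_demo["a"] - 2 * X_demo["c"] + rng.normal(scale=0.1, size=60)

print(select_kbest(X_demo, y_demo, 2))
```

Calling select_kbest(X_train, y_train, 3) on the swiss split should reproduce the manual result shown above.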


In [14]:
selected_features, all_rankings = features.select_rfe(X_train, y_train, 3)
print(selected_features)
all_rankings

['Agriculture', 'Examination', 'Infant.Mortality']


Unnamed: 0,Var,Rank
0,Agriculture,1
1,Examination,1
4,Infant.Mortality,1
2,Education,2
3,Catholic,3
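
The source of features.select_rfe is not shown in this notebook. Judging only from the output above (a list of selected names plus a table with Var and Rank columns), a helper with that behavior might look like the sketch below. The use of LinearRegression as the estimator and the stable sort are assumptions, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression


def select_rfe(X, y, k):
    """Return (selected feature names, DataFrame of all features with their RFE rank)."""
    # Assumed estimator: ordinary least squares
    rfe = RFE(estimator=LinearRegression(), n_features_to_select=k)
    rfe.fit(X, y)

    # support_ marks the k features that were kept (rank 1)
    selected = X.columns[rfe.support_].tolist()

    # ranking_ is 1 for kept features; higher numbers were eliminated earlier
    rankings = (
        pd.DataFrame({"Var": X.columns, "Rank": rfe.ranking_})
        .sort_values("Rank", kind="mergesort")  # stable sort keeps column order within a rank
    )
    return selected, rankings


# Synthetic demo data (illustrative assumption, not the swiss dataset)
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(40, 5)), columns=list("abcde"))
y_demo = 2 * X_demo["a"] + X_demo["d"] + rng.normal(scale=0.5, size=40)

selected, rankings = select_rfe(X_demo, y_demo, 3)
print(selected)
print(rankings)
```

Because linear-model coefficients depend on the units of each feature, passing the scaled DataFrames (for example X_train_scaled) instead of the raw ones may change which features RFE keeps.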
