Load the tips dataset.

Create a column named price_per_person. This should be the total bill divided by the party size.

Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount?

Use select k best to select the top 2 features for predicting tip amount. What are they?

Use recursive feature elimination to select the top 2 features for tip amount. What are they?

Why do you think select k best and recursive feature elimination might give different answers for the top features? 
Does this change as you change the number of features you are selecting?

Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from pydataset import data
import wrangle

In [9]:
# acquire the tips dataset
df = data('tips')

In [10]:
df.head()

   total_bill   tip     sex  smoker  day    time  size
1       16.99  1.01  Female      No  Sun  Dinner     2
2       10.34  1.66    Male      No  Sun  Dinner     3
3       21.01   3.5    Male      No  Sun  Dinner     3
4       23.68  3.31    Male      No  Sun  Dinner     2
5       24.59  3.61  Female      No  Sun  Dinner     4


In [15]:
# create a new column 'price_per_person' that calculates the total price divided by the party size
df['price_per_person'] = df['total_bill'] / df['size']

In [16]:
df.describe()

       total_bill       tip      size  price_per_person
count       244.0     244.0     244.0             244.0
mean    19.785943  2.998279  2.569672           7.88823
std      8.902412  1.383638    0.9511           2.91435
min          3.07       1.0       1.0             2.875
25%       13.3475       2.0       2.0            5.8025
50%        17.795       2.9       2.0             7.255
75%       24.1275    3.5625       3.0              9.39
max         50.81      10.0       6.0            20.275


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 1 to 244
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   total_bill        244 non-null    float64
 1   tip               244 non-null    float64
 2   sex               244 non-null    object 
 3   smoker            244 non-null    object 
 4   day               244 non-null    object 
 5   time              244 non-null    object 
 6   size              244 non-null    int64  
 7   price_per_person  244 non-null    float64
dtypes: float64(3), int64(1), object(4)
memory usage: 17.2+ KB


In [22]:
# map 'sex' for modeling
df['sex'] = df.sex.map({'Female': 0, 'Male': 1})
# map 'smoker' for modeling
df['smoker'] = df.smoker.map({'No': 0, 'Yes': 1})

In [23]:
# create dummies for the 'day' and 'time' columns
dummy_df = pd.get_dummies(df[['day', 'time']], dummy_na=False)
# concatenate the dummy columns and the original dataframe
df = pd.concat([df, dummy_df], axis=1)
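
Encoding every level of `day` and `time` produces dummy columns that always sum to one within each group, which makes them perfectly collinear once a linear model adds an intercept. One common option, not used in this notebook, is `drop_first=True`. The sketch below uses a small made-up frame (the `demo` data is an assumption, not the tips dataset) to show the difference in the resulting columns.

```python
import pandas as pd

# hypothetical stand-in rows covering every level of 'day' and 'time'
demo = pd.DataFrame({
    'day': ['Thur', 'Fri', 'Sat', 'Sun'],
    'time': ['Lunch', 'Dinner', 'Dinner', 'Dinner'],
})

# one column per level, as in the cell above
full = pd.get_dummies(demo[['day', 'time']])

# drop the first level of each variable so the remaining dummies are not redundant
reduced = pd.get_dummies(demo[['day', 'time']], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Whether to drop a level is a modeling choice; the rest of the notebook keeps all of the dummy columns.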

In [24]:
df.head()

   total_bill   tip  sex  smoker  day    time  size  price_per_person  day_Fri  day_Sat  day_Sun  day_Thur  time_Dinner  time_Lunch
1       16.99  1.01    0       0  Sun  Dinner     2             8.495        0        0        1         0            1           0
2       10.34  1.66    1       0  Sun  Dinner     3          3.446667        0        0        1         0            1           0
3       21.01   3.5    1       0  Sun  Dinner     3          7.003333        0        0        1         0            1           0
4       23.68  3.31    1       0  Sun  Dinner     2             11.84        0        0        1         0            1           0
5       24.59  3.61    0       0  Sun  Dinner     4            6.1475        0        0        1         0            1           0


In [30]:
# drop the string columns that have been encoded
df = df.drop(columns=['day','time'])

In [31]:
# split the tips data
train, validate, test = wrangle.split_data(df)
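
`wrangle` is a local module whose source is not shown here. The sketch below is a hypothetical stand-in for `split_data`, assuming it performs two successive random splits with scikit-learn's `train_test_split`. The proportions, the seed, and the `demo` frame are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(df, test_size=0.2, validate_size=0.3, seed=123):
    """Hypothetical stand-in for wrangle.split_data: returns train, validate, test."""
    # hold out the test set first
    train_validate, test = train_test_split(df, test_size=test_size, random_state=seed)
    # carve the validate set out of the remaining rows
    train, validate = train_test_split(train_validate, test_size=validate_size, random_state=seed)
    return train, validate, test


# made-up frame, used only to show the shapes that come back
demo = pd.DataFrame({'total_bill': range(100), 'tip': range(100)})
train, validate, test = split_data(demo)
print(train.shape, validate.shape, test.shape)
```

Fitting the selectors on the train split only, as the following cells do, keeps the validate and test rows out of the feature-selection step.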

In [32]:
# assign features and target variable for the train set
X_train, y_train = train.drop(columns='tip'), train.tip
# assign features and target variable for the validate set
X_validate, y_validate = validate.drop(columns='tip'), validate.tip
# assign features and target variable for the test set
X_test, y_test = test.drop(columns='tip'), test.tip

In [36]:
# initialize the select k best algorithm
f_selector = SelectKBest(f_regression, k=2)
# find the top 2 X's correlated with y
f_selector.fit(X_train, y_train)
# boolean mask of whether the column was selected or not. 
feature_mask = f_selector.get_support()
# get list of top K features. 
f_feature = X_train.iloc[:,feature_mask].columns.tolist()

In [37]:
f_feature

['total_bill', 'size']
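
The mask only says which columns made the cut. A fitted `SelectKBest` also exposes `scores_` and `pvalues_`, which show how far apart the features are. With `f_regression`, each feature is scored on its own: the F statistic is derived from the Pearson correlation r between that feature and the target as r² / (1 − r²) × (n − 2). The sketch below checks this on synthetic data; the column names `strong`, `weak`, and `noise` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# synthetic stand-in data (not the tips dataset)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'strong': rng.normal(size=200),
    'weak': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
y_demo = 3 * X_demo['strong'] + 0.5 * X_demo['weak'] + rng.normal(size=200)

selector = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)

# recompute the F statistic by hand from each feature's correlation with y
r = X_demo.corrwith(y_demo)
f_manual = r ** 2 / (1 - r ** 2) * (len(y_demo) - 2)

scores = pd.DataFrame({
    'feature': X_demo.columns,
    'f_score': selector.scores_,
    'f_manual': f_manual.values,
    'p_value': selector.pvalues_,
    'selected': selector.get_support(),
}).sort_values('f_score', ascending=False)
print(scores)
```

Because each score looks at one feature in isolation, two strongly related predictors (such as `total_bill` and `price_per_person`) can both score well even when they carry overlapping information.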

In [39]:
# initialize the ML algorithm
lm = LinearRegression()
# create the rfe object, indicating the ML object (lm) and the number of features I want to end up with. 
rfe = RFE(lm, n_features_to_select=2)
# fit the data using RFE
rfe.fit(X_train,y_train)  
# get the mask of the columns selected
feature_mask = rfe.support_
# get list of the column names. 
rfe_feature = X_train.iloc[:,feature_mask].columns.tolist()

In [40]:
rfe_feature

['total_bill', 'time_Dinner']

In [41]:
# save the rfe ranking to a variable
var_ranks = rfe.ranking_
# save the names of all candidate features (not just the selected ones) to a variable
var_names = X_train.columns.tolist()
# create a dataframe for the rfe ranking of each feature
pd.DataFrame({'Var': var_names, 'Rank': var_ranks})

                Var  Rank
0        total_bill     1
1               sex     6
2            smoker     9
3              size     3
4  price_per_person     4
5           day_Fri     7
6           day_Sat     8
7           day_Sun    10
8          day_Thur     2
9       time_Dinner     1
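
When RFE wraps `LinearRegression`, each round drops the feature with the smallest squared coefficient, and a coefficient's size depends on the units of its feature. That may be part of why a 0/1 dummy ranks next to `total_bill` here, although the notebook does not test this. The sketch below uses synthetic data (the names `big_scale` and `binary_flag` are made up) and shows one way to check for the effect: rescale the features with `MinMaxScaler` and see whether the pick changes.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in data: big_scale explains most of y but has a tiny per-unit coefficient
rng = np.random.default_rng(42)
n = 500
X_demo = pd.DataFrame({
    'big_scale': rng.uniform(0, 1000, size=n),
    'binary_flag': rng.integers(0, 2, size=n),
})
y_demo = 0.01 * X_demo['big_scale'] + 0.5 * X_demo['binary_flag'] + rng.normal(scale=0.5, size=n)


def rfe_top_feature(X, y):
    """Return the single feature RFE keeps when wrapped around LinearRegression."""
    selector = RFE(LinearRegression(), n_features_to_select=1).fit(X, y)
    return X.columns[selector.support_].tolist()


# on raw units the 0/1 column has the larger coefficient
unscaled_pick = rfe_top_feature(X_demo, y_demo)

# after rescaling both columns to [0, 1] the coefficients become comparable
X_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(X_demo),
    columns=X_demo.columns,
    index=X_demo.index,
)
scaled_pick = rfe_top_feature(X_scaled, y_demo)

print(unscaled_pick, scaled_pick)
```

`SelectKBest` with `f_regression` is based on correlation, which does not change when a feature is rescaled, so this is one plausible source of disagreement between the two methods.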


In [42]:
# create a function to calculate select k best
def select_kbest(X, y, k):
    '''This function takes in three arguments, X (predictors), y (target variable), and k (number of
    features to select) and scores each feature with SelectKBest using f_regression. The function returns
    a list of the names of the top k features.'''
    f_selector = SelectKBest(f_regression, k=k)
    # find the top k X's correlated with y
    f_selector.fit(X, y)
    # boolean mask of whether the column was selected or not. 
    feature_mask = f_selector.get_support()
    # get list of top K features. 
    f_feature = X.iloc[:,feature_mask].columns.tolist()
    return f_feature

In [44]:
select_kbest(X_train, y_train, 4)

['total_bill', 'size', 'price_per_person', 'time_Dinner']

In [45]:
# create a function to calculate rfe
def rfe(X, y, k):
    '''This function takes in three arguments, X (predictors), y (target variable), and k (number of
    features to select) and calculates the top features using recursive feature elimination. The function
    returns a list of the names of the top k features.'''
    # initialize the ML algorithm
    lm = LinearRegression()
    # create the RFE object, indicating the ML object (lm) and the number of features to keep;
    # it is named selector so it does not shadow this function's own name
    selector = RFE(lm, n_features_to_select=k)
    # fit the data using RFE
    selector.fit(X, y)
    # get the mask of the columns selected
    feature_mask = selector.support_
    # get list of the column names
    rfe_feature = X.iloc[:, feature_mask].columns.tolist()
    return rfe_feature

In [46]:
rfe(X_train, y_train, 4)

['total_bill', 'size', 'day_Thur', 'time_Dinner']
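
To see how the agreement between the two methods changes with the number of features, one option is to loop over every k and compare the selected sets. The sketch below does this on synthetic data from `make_regression` rather than the tips data; the column names `x0` to `x5` and the data parameters are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

# synthetic stand-in data with 6 candidate features
X_arr, y_demo = make_regression(
    n_samples=200, n_features=6, n_informative=3, noise=10.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f'x{i}' for i in range(6)])

rows = []
for k in range(1, X_demo.shape[1] + 1):
    # univariate selection: each feature is scored on its own
    kbest = SelectKBest(f_regression, k=k).fit(X_demo, y_demo)
    # wrapper selection: features are dropped one at a time from a fitted model
    rfe_sel = RFE(LinearRegression(), n_features_to_select=k).fit(X_demo, y_demo)

    kbest_set = set(X_demo.columns[kbest.get_support()])
    rfe_set = set(X_demo.columns[rfe_sel.support_])
    rows.append({
        'k': k,
        'kbest': sorted(kbest_set),
        'rfe': sorted(rfe_set),
        'overlap': len(kbest_set & rfe_set),
    })

comparison = pd.DataFrame(rows)
print(comparison)
```

When k equals the total number of features, both methods necessarily return every column, so any disagreement can only show up at smaller values of k.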

In [54]:
# acquire the swiss dataset and save to a variable
swiss = data('swiss')
swiss.head()

              Fertility  Agriculture  Examination  Education  Catholic  Infant.Mortality
Courtelary         80.2         17.0           15         12      9.96              22.2
Delemont           83.1         45.1            6          9     84.84              22.2
Franches-Mnt       92.5         39.7            5          5      93.4              20.2
Moutier            85.8         36.5           12          7     33.77              20.3
Neuveville         76.9         43.5           17         15      5.16              20.6


In [58]:
# split the swiss dataset into train, validate, and test sets
swiss_train, swiss_validate, swiss_test = wrangle.split_data(swiss)

In [61]:
X_swiss_train, y_swiss_train = swiss_train.drop(columns='Fertility'), swiss_train.Fertility
X_swiss_validate, y_swiss_validate = swiss_validate.drop(columns='Fertility'), swiss_validate.Fertility
X_swiss_test, y_swiss_test = swiss_test.drop(columns='Fertility'), swiss_test.Fertility

In [64]:
select_kbest(X_swiss_train, y_swiss_train, 3)

['Examination', 'Education', 'Infant.Mortality']

In [65]:
rfe(X_swiss_train, y_swiss_train, 3)

['Examination', 'Education', 'Infant.Mortality']