### Feature Engineering Exercises

Do your work for this exercise in a Jupyter notebook named ```feature_engineering``` within the ```regression-exercises``` repo. Add, commit, and push your work.

**1.  Load the ```tips``` dataset.**


**a.  Create a column named ```tip_percentage```. This should be the tip amount divided by the total bill.**

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


from scipy import stats

from statsmodels.formula.api import ols
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import f_regression 
from sklearn.preprocessing import MinMaxScaler


from math import sqrt

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

import pydataset
from pydataset import data

import wrangle
import utilities


In [3]:
# load from pydataset
# or load from seaborn tips = sns.load_dataset('tips')

tips = data('tips')
tips.head(2)

|  | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |


In [4]:
tips.info()  

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 1 to 244
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 15.2+ KB


In [5]:
tips.rename(columns={'size':'party'}, inplace=True)

In [6]:
# create tip percentage column (tip as a percent of the total bill, rounded to 1 decimal)
tips['tip_percentage'] = round(((tips.tip / tips.total_bill) * 100),1)
tips.head(2)

|  | total_bill | tip | sex | smoker | day | time | party | tip_percentage |
|---|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 5.9 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 16.1 |


**b. Create a column named ```price_per_person```. This should be the total bill divided by the party size.**


In [7]:
# create price per person column
# dot notation works here because 'size' was renamed to 'party' above;
# tips.size would return the DataFrame's element count, not the column
tips['price_per_person'] = round((tips.total_bill / tips.party),2)
tips.head(2)

|  | total_bill | tip | sex | smoker | day | time | party | tip_percentage | price_per_person |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 5.9 | 8.49 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 16.1 | 3.45 |


In [8]:
X_train, y_train, X_validate, y_validate, X_test, y_test= utilities.train_validate_test_split(tips, 'tip')
X_train.head()

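|  | total_bill | sex | smoker | day | time | party | tip_percentage | price_per_person |
|---|---|---|---|---|---|---|---|---|
| 19 | 16.97 | Female | No | Sun | Dinner | 3 | 20.6 | 5.66 |
| 173 | 7.25 | Male | Yes | Sun | Dinner | 2 | 71.0 | 3.62 |
| 119 | 12.43 | Female | No | Thur | Lunch | 2 | 14.5 | 6.22 |
| 29 | 21.7 | Male | No | Sat | Dinner | 2 | 19.8 | 10.85 |
| 238 | 32.83 | Male | Yes | Sat | Dinner | 2 | 3.6 | 16.42 |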
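
The `utilities.train_validate_test_split` helper called above is not shown in this notebook. A minimal sketch of what such a helper might look like follows, assuming it splits the frame into three parts and separates the target column from the predictors. The roughly 56/24/20 proportions, the `seed` parameter, and the use of `train_test_split` are all assumptions, not the contents of the actual `utilities` module.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def train_validate_test_split(df, target, seed=123):
    """Hypothetical stand-in for utilities.train_validate_test_split.

    Splits df into train/validate/test (roughly 56% / 24% / 20%, an assumed
    ratio) and separates the target column from the predictors.
    """
    # hold out 20% of the rows as the test set
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    # carve 30% of the remainder off as the validate set
    train, validate = train_test_split(train_validate, test_size=0.3, random_state=seed)

    # X keeps every column except the target; y is the target alone
    X_train, y_train = train.drop(columns=[target]), train[target]
    X_validate, y_validate = validate.drop(columns=[target]), validate[target]
    X_test, y_test = test.drop(columns=[target]), test[target]
    return X_train, y_train, X_validate, y_validate, X_test, y_test


# small made-up frame, only to show the call shape
demo = pd.DataFrame({'total_bill': range(10, 60), 'tip': range(1, 51)})
X_train, y_train, X_validate, y_validate, X_test, y_test = train_validate_test_split(demo, 'tip')
print(X_train.shape, X_validate.shape, X_test.shape)
```

Splitting before any scaling or feature selection keeps the validate and test rows out of every fitted step.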


In [16]:
numeric_cols = ['total_bill','tip_percentage','party','price_per_person']

X_train_scaled, X_validate_scaled, X_test_scaled = min_max_scale(X_train, X_validate, X_test, numeric_cols)
X_train_scaled.head()

|  | total_bill | tip_percentage | party | price_per_person |
|---|---|---|---|---|
| 19 | 0.307114 | 0.252226 | 0.4 | 0.150581 |
| 173 | 0.092355 | 1.0 | 0.2 | 0.031977 |
| 119 | 0.206805 | 0.161721 | 0.2 | 0.18314 |
| 29 | 0.411622 | 0.240356 | 0.2 | 0.452326 |
| 238 | 0.657534 | 0.0 | 0.2 | 0.776163 |
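
The `min_max_scale` helper is called above but never defined in the cells shown. The sketch below is one plausible version, assuming it fits a `MinMaxScaler` on the training split only and returns frames that contain just the numeric columns with the original index, which matches the output above. The function body and the sample rows are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def min_max_scale(X_train, X_validate, X_test, numeric_cols):
    """Hypothetical stand-in for the notebook's min_max_scale helper."""
    # learn each column's min and max from the training split only
    scaler = MinMaxScaler().fit(X_train[numeric_cols])

    def scale(X):
        # keep the original index and column names on the scaled output
        return pd.DataFrame(scaler.transform(X[numeric_cols]),
                            columns=numeric_cols, index=X.index)

    return scale(X_train), scale(X_validate), scale(X_test)


# made-up rows, only to show the call shape
train = pd.DataFrame({'total_bill': [10.0, 20.0, 30.0], 'party': [1, 2, 6],
                      'day': ['Sun', 'Sat', 'Sun']}, index=[19, 173, 119])
validate = pd.DataFrame({'total_bill': [15.0], 'party': [3], 'day': ['Fri']}, index=[5])
test = pd.DataFrame({'total_bill': [40.0], 'party': [2], 'day': ['Sat']}, index=[7])

train_s, validate_s, test_s = min_max_scale(train, validate, test, ['total_bill', 'party'])
print(train_s)
print(test_s)
```

Because the scaler only sees the training rows, validate or test values outside the training range can fall below 0 or above 1.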



**c.  Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? The tip percentage?**

- Tip amount: total_bill, tip_percentage

- Tip percentage: time, day

**d. Use all the other numeric features to predict tip amount. Use select k best and recursive feature elimination to select the top 2 features. What are they?**



In [17]:
from sklearn.feature_selection import SelectKBest, f_regression

# parameters: f_regression stats test, give me 2 features
f_selector = SelectKBest(f_regression, k=2)

# find the top 2 X's correlated with y
f_selector.fit(X_train_scaled, y_train)

# boolean mask of whether the column was selected or not. 
feature_mask = f_selector.get_support()

In [18]:
feature_mask

array([ True, False,  True, False])

In [21]:
# get list of top 2 features. 
f_feature = X_train_scaled.iloc[:,feature_mask].columns.tolist()
f_feature
print(f'The two best predictors, according to k best are: {f_feature}.')

The two best predictors, according to k best are: ['total_bill', 'party'].
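
The boolean mask only records which columns were kept. A fitted `SelectKBest` also exposes the per-feature F statistics and p-values through its `scores_` and `pvalues_` attributes, which shows how close the ranking was. A minimal sketch on synthetic data follows; the column names and coefficients are made up for illustration and are not taken from `tips`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# synthetic predictors: y depends heavily on 'strong', less on 'moderate', not on 'noise'
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=['strong', 'moderate', 'noise'])
y = 3 * X['strong'] + 1.5 * X['moderate'] + rng.normal(scale=0.5, size=500)

selector = SelectKBest(f_regression, k=2).fit(X, y)

# one row per feature, highest F statistic first
scores = pd.DataFrame(
    {'f_score': selector.scores_, 'p_value': selector.pvalues_},
    index=X.columns,
).sort_values('f_score', ascending=False)

# names of the k columns the selector kept
selected = X.columns[selector.get_support()].tolist()

print(scores)
print(selected)
```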


In [23]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# initialize the ML algorithm
lm = LinearRegression()

# create the rfe object, indicating the ML object (lm) and the number of features I want to end up with. 
rfe = RFE(lm, n_features_to_select=2)

# fit the data using RFE
rfe.fit(X_train_scaled,y_train)  

# get the mask of the columns selected
feature_mask = rfe.support_

# get list of the column names. 
rfe_feature = X_train_scaled.iloc[:,feature_mask].columns.tolist()

rfe_feature
print(f'The two best predictors, according to recursive feature elimination are: {rfe_feature}.')

The two best predictors, according to recursive feature elimination are: ['total_bill', 'tip_percentage'].
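
Besides `support_`, a fitted `RFE` object has a `ranking_` attribute: selected features get rank 1, and larger ranks mark features that were eliminated earlier. The sketch below shows this on synthetic data; the feature names and coefficients are assumptions chosen so the elimination order is easy to read.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# synthetic predictors on the same scale, with decreasing influence on y
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(500, 4)),
                 columns=['strong', 'medium', 'weak', 'noise'])
y = (3 * X['strong'] + 1.5 * X['medium'] + 0.5 * X['weak']
     + rng.normal(scale=0.5, size=500))

rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)

# rank 1 = kept; higher rank = dropped in an earlier round
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print(ranking)
```

With a linear model, RFE drops the feature with the smallest coefficient magnitude each round, so the predictors need to be on a comparable scale for the ranking to be meaningful. That is one reason to scale before running it.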


**e. Use all the other numeric features to predict tip percentage. Use select k best and recursive feature elimination to select the top 2 features. What are they?**



In [25]:
tips.head()

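|  | total_bill | tip | sex | smoker | day | time | party | tip_percentage | price_per_person |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 5.9 | 8.49 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 16.1 | 3.45 |
| 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 16.7 | 7.0 |
| 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 14.0 | 11.84 |
| 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 14.7 | 6.15 |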


In [26]:
X_train, y_train, X_validate, y_validate, X_test, y_test= utilities.train_validate_test_split(tips, 'tip_percentage')
X_train.head()

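|  | total_bill | tip | sex | smoker | day | time | party | price_per_person |
|---|---|---|---|---|---|---|---|---|
| 19 | 16.97 | 3.5 | Female | No | Sun | Dinner | 3 | 5.66 |
| 173 | 7.25 | 5.15 | Male | Yes | Sun | Dinner | 2 | 3.62 |
| 119 | 12.43 | 1.8 | Female | No | Thur | Lunch | 2 | 6.22 |
| 29 | 21.7 | 4.3 | Male | No | Sat | Dinner | 2 | 10.85 |
| 238 | 32.83 | 1.17 | Male | Yes | Sat | Dinner | 2 | 16.42 |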


In [27]:
numeric_cols = ['total_bill','tip','party','price_per_person']

X_train_scaled, X_validate_scaled, X_test_scaled = min_max_scale(X_train, X_validate, X_test, numeric_cols)
X_train_scaled.head()

|  | total_bill | tip | party | price_per_person |
|---|---|---|---|---|
| 19 | 0.307114 | 0.3125 | 0.4 | 0.150581 |
| 173 | 0.092355 | 0.51875 | 0.2 | 0.031977 |
| 119 | 0.206805 | 0.1 | 0.2 | 0.18314 |
| 29 | 0.411622 | 0.4125 | 0.2 | 0.452326 |
| 238 | 0.657534 | 0.02125 | 0.2 | 0.776163 |


In [28]:
from sklearn.feature_selection import SelectKBest, f_regression

# parameters: f_regression stats test, give me 2 features
f_selector = SelectKBest(f_regression, k=2)

# find the top 2 X's correlated with y
f_selector.fit(X_train_scaled, y_train)

# boolean mask of whether the column was selected or not. 
feature_mask = f_selector.get_support()

In [29]:
feature_mask

array([False,  True, False,  True])

In [30]:
# get list of top K features. 
f_feature = X_train_scaled.iloc[:,feature_mask].columns.tolist()
f_feature
print(f'The two best predictors, according to k best are: {f_feature}.')

The two best predictors, according to k best are: ['tip', 'price_per_person'].


In [31]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# initialize the ML algorithm
lm = LinearRegression()

# create the rfe object, indicating the ML object (lm) and the number of features I want to end up with. 
rfe = RFE(lm, n_features_to_select=2)

# fit the data using RFE
rfe.fit(X_train_scaled,y_train)  

# get the mask of the columns selected
feature_mask = rfe.support_

# get list of the column names. 
rfe_feature = X_train_scaled.iloc[:,feature_mask].columns.tolist()

rfe_feature
print(f'The two best predictors, according to recursive feature elimination are: {rfe_feature}.')

The two best predictors, according to recursive feature elimination are: ['total_bill', 'tip'].


**f. Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?**


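K best runs a single univariate test (here `f_regression`) that scores each x variable on its own by how strongly it correlates with y, then keeps the k highest-scoring features. Recursive feature elimination fits a series of models instead. It fits a model with all the features and removes the weakest one (for a linear regression, the feature with the smallest coefficient magnitude), then refits on the remaining features and removes the weakest again, until the list is reduced to the number of features requested. Because k best never looks at features in combination while RFE does, the two can disagree, especially when the features are correlated with each other. The disagreement depends on k: both methods return every feature when k equals the total number of features, and they can differ for smaller k.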
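
The difference between the two methods can be reproduced on a small synthetic example. In the sketch below (all names and coefficients are assumptions for illustration), `x1_copy` nearly duplicates `x1`, so it correlates strongly with y on its own but adds little once `x1` is already in the model. With this construction k best should keep `x1` and `x1_copy`, while RFE should keep `x1` and `x2`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
X = pd.DataFrame({
    'x1': x1,
    'x1_copy': x1 + rng.normal(scale=0.3, size=n),  # nearly duplicates x1
    'x2': rng.normal(size=n),                        # independent of x1
})
# y is driven by x1 and x2 only; x1_copy has no effect of its own
y = 2 * X['x1'] + 1 * X['x2'] + rng.normal(scale=0.5, size=n)

results = {}
for k in (2, 3):
    kbest = SelectKBest(f_regression, k=k).fit(X, y)
    rfe = RFE(LinearRegression(), n_features_to_select=k).fit(X, y)
    # store (k best picks, RFE picks) for this k
    results[k] = (X.columns[kbest.get_support()].tolist(),
                  X.columns[rfe.support_].tolist())
    print(k, results[k])
```

At k equal to the number of columns both selectors keep everything, so any disagreement can only show up for smaller k.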

**2. Write a function named ```select_kbest``` that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the ```SelectKBest``` class. Test your function with the ```tips``` dataset. You should see the same results as when you did the process manually.**


In [48]:
from sklearn.feature_selection import SelectKBest, f_regression

def select_kbest(predictors, target, k): 

    '''
    Takes in a list of predictor column names (predictors), the name of the target
    variable (target), and the number of features to select (k). Fits SelectKBest on
    those columns of the notebook-level X_train_scaled against y_train, prints the
    names of the top k selected features, and returns them as a list.
    ''' 

    # restrict the scaled training data to the requested predictor columns
    X = X_train_scaled[predictors]

    # parameters: f_regression stats test, return k number of features
    f_selector = SelectKBest(f_regression, k=k)

    # score each X against y and keep the k highest-scoring features
    f_selector.fit(X, y_train)

    # boolean mask of whether the column was selected or not. 
    feature_mask = f_selector.get_support()
    
    # get list of top k features. 
    f_feature = X.iloc[:,feature_mask].columns.tolist()
    
    print(f'The {k} best predictors of {target}, according to k best are: {f_feature}.')
    return f_feature
    

**3.  Write a function named ```rfe``` that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the ```RFE``` class. Test your function with the ```tips``` dataset. You should see the same results as when you did the process manually.**


In [49]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

def rfe(predictors, target, k): 
    
    '''
    Takes in a list of predictor column names (predictors), the name of the target
    variable (target), and the number of features to select (k). Fits RFE on those
    columns of the notebook-level X_train_scaled against y_train, prints the names
    of the top k selected features, and returns them as a list.
    ''' 

    # restrict the scaled training data to the requested predictor columns
    X = X_train_scaled[predictors]

    # initialize the ML algorithm
    lm = LinearRegression()

    # create the RFE object, indicating the ML object (lm) and the number of features to keep.
    # named selector so it does not shadow this function's own name
    selector = RFE(lm, n_features_to_select=k)

    # fit the data using RFE
    selector.fit(X, y_train)  

    # get the mask of the columns selected
    feature_mask = selector.support_

    # get list of the column names. 
    rfe_feature = X.iloc[:,feature_mask].columns.tolist()

    print(f'The {k} best predictors of {target}, according to recursive feature elimination are: {rfe_feature}.')
    return rfe_feature
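
The two functions above read `X_train_scaled` and `y_train` from the notebook's global state. A variant that takes the data itself as arguments, closer to the wording of the exercise, could look like the sketch below. The names `select_kbest_from_data` and `rfe_from_data` are hypothetical, chosen so they do not clash with the functions above, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression


def select_kbest_from_data(X, y, k):
    """Return the names of the k columns of X chosen by SelectKBest(f_regression)."""
    selector = SelectKBest(f_regression, k=k).fit(X, y)
    return X.columns[selector.get_support()].tolist()


def rfe_from_data(X, y, k):
    """Return the names of the k columns of X chosen by RFE with LinearRegression."""
    selector = RFE(LinearRegression(), n_features_to_select=k).fit(X, y)
    return X.columns[selector.support_].tolist()


# synthetic data: y depends on 'a' and 'b' only
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=['a', 'b', 'c'])
y = 2 * X['a'] - 3 * X['b'] + rng.normal(scale=0.1, size=300)

kbest_features = select_kbest_from_data(X, y, 2)
rfe_features = rfe_from_data(X, y, 2)
print(kbest_features)
print(rfe_features)
```

Passing the frame and target explicitly means the same functions work on any split or dataset without relying on which cell ran last.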

**4. Load the ```swiss``` dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).**










In [43]:
# load from pydataset

swiss = data('swiss')
swiss.head(2)

|  | Fertility | Agriculture | Examination | Education | Catholic | Infant.Mortality |
|---|---|---|---|---|---|---|
| Courtelary | 80.2 | 17.0 | 15 | 12 | 9.96 | 22.2 |
| Delemont | 83.1 | 45.1 | 6 | 9 | 84.84 | 22.2 |


In [44]:
swiss.info()

<class 'pandas.core.frame.DataFrame'>
Index: 47 entries, Courtelary to Rive Gauche
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Fertility         47 non-null     float64
 1   Agriculture       47 non-null     float64
 2   Examination       47 non-null     int64  
 3   Education         47 non-null     int64  
 4   Catholic          47 non-null     float64
 5   Infant.Mortality  47 non-null     float64
dtypes: float64(4), int64(2)
memory usage: 2.6+ KB


In [45]:
X_train, y_train, X_validate, y_validate, X_test, y_test= utilities.train_validate_test_split(swiss, 'Fertility')
X_train.head(2)


|  | Agriculture | Examination | Education | Catholic | Infant.Mortality |
|---|---|---|---|---|---|
| Rolle | 60.8 | 16 | 10 | 7.72 | 16.3 |
| Lavaux | 73.0 | 19 | 9 | 2.84 | 20.0 |


In [46]:
numeric_cols = ['Agriculture','Examination','Education','Catholic','Infant.Mortality']

X_train_scaled, X_validate_scaled, X_test_scaled = min_max_scale(X_train, X_validate, X_test, numeric_cols)
X_train_scaled.head()

|  | Agriculture | Examination | Education | Catholic | Infant.Mortality |
|---|---|---|---|---|---|
| Rolle | 0.647561 | 0.40625 | 0.290323 | 0.054508 | 0.122449 |
| Lavaux | 0.796341 | 0.5 | 0.258065 | 0.004508 | 0.5 |
| Nyone | 0.526829 | 0.59375 | 0.354839 | 0.130533 | 0.163265 |
| Conthey | 0.953659 | 0.0 | 0.032258 | 0.997029 | 0.0 |
| Yverdon | 0.509756 | 0.375 | 0.225806 | 0.03791 | 0.755102 |


In [50]:
select_kbest(numeric_cols, "Fertility", 3)

The 3 best predictors of Fertility, according to k best are: ['Examination', 'Catholic', 'Infant.Mortality'].


In [51]:
rfe(numeric_cols, "Fertility", 3)

The 3 best predictors of Fertility, according to recursive feature elimination are: ['Agriculture', 'Examination', 'Infant.Mortality'].
