# DATA 202 Lab 4: Multiple Predictors and Feature Engineering

The goal of this lab is to practice applying linear regression to a wider range of problems, such as when data aren't just numbers.

In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

sns.set()
sns.set_context("notebook")
%matplotlib inline

In [9]:
data = sns.load_dataset("tips")
data.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null category
time          244 non-null category
size          244 non-null int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.2 KB


In [11]:
# Construct the tip percentage column
data['%'] = data['tip'] / data['total_bill'] * 100

# Prep: train-test split

Split the data *randomly* into a training set and test set. We'll ask you to write something similar yourself later, but for now we'll just give you this line:

In [92]:
train, test = train_test_split(data, train_size=.8, random_state=0)
len(train), len(test)

(195, 49)

# Exercise 1: More Predictors
As before, let's try to predict the tip percent. So the `%` column will be the *response*. We'll use the training set the whole time.

In [93]:
y = train['%']
# We'll set X to different things below.

## 1.1: No Predictors
Suppose we predict a constant value for each data point.
1. What constant should we predict?
2. What MSE do we get?
3. What R^2 does this model get?
4. What Mean Absolute Error do we get?
5. How good or bad is this prediction? Connect it to the real world somehow.

For now, don't touch the *test set*; do all of this on the training set only.

In [94]:
# your code here
np.mean(y)

15.86505005088125

In [95]:
# your code here
mse_const = np.mean((y - y.mean()) ** 2)
print("MSE:", mse_const)

MSE: 43.5844107542955


In [96]:
# your code here
print("R^2: 0.0") # because 1 - var/var = 0.

R^2: 0.0


In [97]:
# your code here
print("RMSE:", mse_const ** .5)
print("MAE:", np.mean(abs(y - y.mean())))

RMSE: 6.601849040556403
MAE: 4.274344389500647


*your answer here*

It's surprisingly good -- we're off by about 4.3 percentage points on average.
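
The answers above rest on two facts: the mean is the constant that minimizes MSE, and the resulting MSE equals the variance of `y`, which is why R^2 comes out to 0. The sketch below checks both on a small made-up array (`y_demo` is hypothetical, not the tips data), using `r2_score` from `sklearn.metrics` as a cross-check.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up response values, for illustration only (not the tips data).
y_demo = np.array([10.0, 12.0, 15.0, 18.0, 25.0])

def mse_for_constant(y, c):
    """MSE when every prediction is the constant c."""
    return np.mean((y - c) ** 2)

mean_mse = mse_for_constant(y_demo, y_demo.mean())

# Nudging the constant away from the mean in either direction raises the MSE.
nearby_mses = [mse_for_constant(y_demo, y_demo.mean() + d)
               for d in (-1.0, -0.1, 0.1, 1.0)]

# The MSE of the mean predictor is the (population) variance,
# so R^2 = 1 - var/var = 0.
r2_const = r2_score(y_demo, np.full_like(y_demo, y_demo.mean()))
print(mean_mse, np.var(y_demo), r2_const)
```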

## 1.2: One predictor

Now let's use `total_bill` to help predict `%`.

First, make a matrix of observations by features (just one feature for now). Call it `X`. Use only training set points. Hint: use a one-element list of column names.

In [98]:
# your code here
X = train[['total_bill']]

In [99]:
assert X.shape == (len(train), 1)

Use the sklearn `LinearRegression` class to predict `y` from `X`.

1. As the total bill amount increases, does the tip *percentage* generally increase or decrease? (hint: `linreg.coef_`)
2. What MSE do we get?
3. What R^2 does this model get?
4. What Mean Absolute Error do we get?
5. How good or bad is this prediction? Is that better or worse than the constant predictor?

Since we'll be doing this a few more times, go ahead and make a *function* that prints out the calculations you need for parts 1-4, given a fitted `LinearRegression` object and an `X` and `y`.

In [100]:
def show_regression_stats(linreg, X, y):
    '''Show coefficients, MSE, R^2, and MAE of a fitted LinearRegression.'''
    # your code here
    print("coef:", linreg.coef_)

    y_pred = linreg.predict(X)
    resid = y - y_pred
    mse = np.mean(resid ** 2)
    print("MSE:", mse)
    print("R2:", 1 - mse / np.var(y))
#     print("RMSE",  mse ** .5)
    print("MAE:", np.mean(abs(resid)))

In [101]:
# your code here
linreg = LinearRegression()
linreg.fit(X, y)
show_regression_stats(linreg, X, y)

coef: [-0.26408455]
MSE: 37.78196091429942
R2: 0.13313131322818705
MAE: 4.071215781134974
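
One way to sanity-check hand-computed statistics like the ones in `show_regression_stats` is to compare them with the functions in `sklearn.metrics`. This is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and the noise settings are made up for illustration and are not part of the lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic data, for illustration only: a noisy decreasing line.
rng = np.random.default_rng(0)
X_demo = rng.uniform(5, 50, size=(100, 1))
y_demo = 20 - 0.25 * X_demo[:, 0] + rng.normal(0, 5, size=100)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)
resid = y_demo - y_hat

# Hand-computed versions, following show_regression_stats.
mse_manual = np.mean(resid ** 2)
r2_manual = 1 - mse_manual / np.var(y_demo)
mae_manual = np.mean(np.abs(resid))

# Library versions; these should agree up to floating-point error.
mse_lib = mean_squared_error(y_demo, y_hat)
r2_lib = r2_score(y_demo, y_hat)
mae_lib = mean_absolute_error(y_demo, y_hat)

print(mse_manual, mse_lib)
print(r2_manual, r2_lib, model.score(X_demo, y_demo))
print(mae_manual, mae_lib)
```

`LinearRegression.score` also returns R^2, so it gives a third value to compare against.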


*your answer here*

It's a bit better than predicting a constant: the MAE drops from about 4.27 to about 4.07, so predictions are about 0.2 percentage points closer on average, and R^2 rises from 0 to about 0.13.

## 1.3: Two predictors

Now let's use both `total_bill` and `size` (number of people in the party) to predict `%`.

First, make your `X` matrix.

In [102]:
# your code here
X = train[['total_bill', 'size']]

In [103]:
assert X.shape == (len(train), 2)

Use the sklearn `LinearRegression` class to predict `y` from `X`. Use your `show_regression_stats` function from above.

1. As the total bill amount increases, how does the tip percentage change? Is it the same as in Exercise 1.2?
2. As the party size increases, does the tip *percentage* generally increase or decrease?
3. What MSE, R^2, and MAE do we get?
4. How good or bad is this prediction? Is that better or worse than our previous models?

In [104]:
# your code here
linreg = LinearRegression()
linreg.fit(X, y)
show_regression_stats(linreg, X, y)

coef: [-0.30042581  0.60165427]
MSE: 37.56664858207776
R2: 0.13807143582008052
MAE: 4.035416185991046


*your answer here*

# Exercise 2: Dealing with Categorical Variables

## 2.1: Error!
Try including the `time` column now too, in exactly the same way as you included the `size` column. Observe that you get an error. Why do you get that error?

In [105]:
# your code here
X = train[['total_bill', 'size', 'time']]
linreg = LinearRegression()
linreg.fit(X, y)

ValueError: could not convert string to float: 'Dinner'

*Your answer here*

It's trying to fit a model that looks like $0.3 \cdot \text{bill} + 0.5 \cdot \text{size} + 1.2 \cdot \text{time}$, but we can't multiply "Dinner" by a number!
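
The failure can be reproduced in isolation. This sketch uses a tiny made-up DataFrame (`demo` and `y_demo` are hypothetical, not the tips data) and captures the exception instead of letting it stop the notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny made-up frame with one string-valued column, for illustration only.
demo = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "time": ["Lunch", "Dinner", "Dinner", "Lunch"],
})
y_demo = [18.0, 15.0, 14.0, 12.0]

try:
    LinearRegression().fit(demo[["total_bill", "time"]], y_demo)
    error_message = None
except (ValueError, TypeError) as err:
    # The run above raised ValueError; TypeError is caught too, defensively,
    # in case another library version reports the problem differently.
    error_message = str(err)

# fit() needs a numeric matrix, so the string column cannot be used as-is.
print(error_message)
```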

In [106]:
train['time'].value_counts()

Dinner    141
Lunch      54
Name: time, dtype: int64

## 2.2: One-Hot Encoding
To solve this problem, we often "encode" categorical data as separate columns. So we'll end up with a model that looks like

20 + 5 * time_Lunch + 6 * time_Dinner

(if you've studied this before, there is a caveat here, shhhh...)

What should time_Lunch and time_Dinner be? Think for a moment before moving on.

.

.

.

.


`time_Lunch` should be a column that's 1 (or some other constant) when `time` is "Lunch" and 0 otherwise. `time_Dinner` is defined likewise for "Dinner".

This is called "one-hot" encoding because only one of the columns is "hot" (non-zero) at a time.

We can construct these columns easily by using the fact that `True` is interpreted as `1`.

In [107]:
time_Lunch = train['time'] == 'Lunch'
time_Lunch.head()

7      False
83      True
176    False
106    False
156    False
Name: time, dtype: bool

In [108]:
time_Lunch = time_Lunch.astype(float) # in some situations you don't even need to explicitly convert the type
time_Lunch.head()

7      0.0
83     1.0
176    0.0
106    0.0
156    0.0
Name: time, dtype: float64

The easiest way to include these variables in the regression is to add them to the dataframe. To avoid a warning that almost every Pandas user dreads (`SettingWithCopyWarning`), we won't *modify* the original DataFrame; instead, we'll make a *new* dataframe that includes the new columns. The syntax is:

In [109]:
train_transformed = train.assign(time_Lunch=1) # replacing that RHS with the desired value
train_transformed.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,%,time_Lunch
7,26.88,3.12,Male,No,Sun,Dinner,4,11.607143,1
83,32.68,5.0,Male,Yes,Thur,Lunch,2,15.299878,1
176,17.89,2.0,Male,Yes,Sun,Dinner,2,11.17943,1
106,20.49,4.06,Male,Yes,Sat,Dinner,2,19.814544,1
156,48.17,5.0,Male,No,Sun,Dinner,6,10.379905,1


Use that to make a new variable, `train_transformed`, that has encoded columns for both Lunch and Dinner. Try to do it in a single assignment statement.

In [110]:
# your code here
train_transformed = train.assign(
    time_Lunch = (train['time'] == "Lunch").astype(float),
    time_Dinner = (train['time'] == "Dinner").astype(float))

In [111]:
assert 0 < train_transformed['time_Lunch'].mean() < 1
assert 0 < train_transformed['time_Dinner'].mean() < 1
assert np.allclose(train_transformed['time_Lunch'] + train_transformed['time_Dinner'], 1.0)
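
pandas can also build these indicator columns itself with `pd.get_dummies`. The sketch below is an alternative to the manual approach, not the method this lab asks for; `demo` is a hypothetical stand-in for `train`, using four rows borrowed from the output above.

```python
import numpy as np
import pandas as pd

# Small stand-in for `train`, for illustration only.
demo = pd.DataFrame({
    "total_bill": [26.88, 32.68, 17.89, 20.49],
    "time": pd.Categorical(["Dinner", "Lunch", "Dinner", "Dinner"],
                           categories=["Lunch", "Dinner"]),
})

# Manual encoding, as in the exercise.
manual = demo.assign(
    time_Lunch=(demo["time"] == "Lunch").astype(float),
    time_Dinner=(demo["time"] == "Dinner").astype(float),
)

# get_dummies replaces the `time` column with one indicator column per
# category, named <column>_<category>.
auto = pd.get_dummies(demo, columns=["time"], dtype=float)

print(auto)
```

Unlike `assign`, `get_dummies` drops the original `time` column from the result, so keep a copy if it is still needed.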

## 2.3: Fitting a regression with one-hot encoding.

Now construct an X matrix containing all of the predictors we want to use: total_bill, size, and the two encoded columns for `time`.

In [112]:
# your code here
X = train_transformed[['total_bill', 'size', 'time_Lunch', 'time_Dinner']]

And fit a linear regression predicting `y` from our new `X`. Answer the same questions as in 1.3.

In [113]:
# your code here
linreg = LinearRegression()
linreg.fit(X, y)
show_regression_stats(linreg, X, y)

coef: [-0.30099483  0.60133269 -0.03912426  0.03912426]
MSE: 37.5654514368643
R2: 0.1380989031000721
MAE: 4.037535086331967


## 2.4: Inference given a one-hot encoding

Use the coefficients of the linear model you found to write out a prediction function for this dataset. Hard-code the coefficients to 2 or 3 decimal places. You may need:

In [114]:
linreg.intercept_

20.370207409794205

In [127]:
def predict(data):
    """
    Predict the tip percentage. Hard-codes a specific fitted model.
    """
    # your code here
    return (
        20.37 +
        -.30 * data['total_bill'] +
        .601 * data['size'] +
        -.039 * (data['time'] == 'Lunch') +
        .039 * (data['time'] == 'Dinner')
    )

Test this function by calling it on the original training set.

In [126]:
y_pred = predict(train)
print(np.mean((train['%'] - y_pred) ** 2))

37.565890139674465


If this isn't within 2 decimal places of the MSE you found in 2.3, check your work.

## 2.5: Critiquing the one-hot encoding

What do you notice about the coefficients of the two encoded columns? (Suppose you lost access to the `time_Dinner` column. Could you figure out a way to get the same predictions?)

*your answer here*

`time_Dinner` is redundant: it's just `1 - time_Lunch`. So I'm not surprised to notice that the coefficients on the two columns are exactly negatives of each other (-.039 and .039). In fact we could have set one of these two coefficients to *anything we want*, as long as we adjust the other coefficient and the intercept accordingly.
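
The redundancy can be checked with a little arithmetic. The sketch below uses a few made-up indicator rows together with the rounded coefficients from the fitted model (20.37, -.039, .039); the `time_part` helper and the choice of `shift` are hypothetical. Shifting both time coefficients by the same amount and subtracting that amount from the intercept leaves every prediction unchanged.

```python
import numpy as np

# Indicator columns for a few made-up rows (illustration only).
time_lunch = np.array([0.0, 1.0, 0.0, 0.0, 1.0])
time_dinner = 1.0 - time_lunch  # always the complement of time_lunch

def time_part(intercept, coef_lunch, coef_dinner):
    """Intercept plus the contribution of the two time columns."""
    return intercept + coef_lunch * time_lunch + coef_dinner * time_dinner

# Rounded values from the fitted model above.
original = time_part(20.37, -0.039, 0.039)

# Pick a shift that forces the Dinner coefficient to 0. Any other shift
# would work just as well.
shift = -0.039
shifted = time_part(20.37 - shift, -0.039 + shift, 0.039 + shift)

# Same predictions, even though time_dinner no longer contributes anything.
print(np.allclose(original, shifted))
```

With the Dinner coefficient at 0, the `time_Dinner` column can be dropped entirely, which is how you would get the same predictions without it.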