# DATA 202 Lab 4: Multiple Predictors and Feature Engineering

The goal of this lab is to practice applying linear regression to a wider range of problems, such as when data aren't just numbers.

In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

sns.set()
sns.set_context("notebook")
%matplotlib inline

In [9]:
data = sns.load_dataset("tips")
data.head()

In [10]:
data.info()

In [11]:
# Construct the tip percentage column
data['%'] = data['tip'] / data['total_bill'] * 100

# Prep: train-test split

Split the data *randomly* into a training set and test set. We'll ask you to write something similar yourself later, but for now we'll just give you this line:

In [92]:
train, test = train_test_split(data, train_size=.8, random_state=0)
len(train), len(test)
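To get a feel for what that line does, here is a minimal sketch on a made-up 10-row DataFrame (the names `toy`, `toy_train`, and `toy_test` are invented for illustration and are not part of the lab). It checks that the two pieces don't overlap and that fixing `random_state` gives the same split every time.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, so an 80/20 split is easy to check by eye.
toy = pd.DataFrame({"x": range(10), "y": [v * 2 for v in range(10)]})

toy_train, toy_test = train_test_split(toy, train_size=0.8, random_state=0)
# Same random_state, so we expect exactly the same rows again.
toy_train_again, _ = train_test_split(toy, train_size=0.8, random_state=0)

print(len(toy_train), len(toy_test))
# No row index should appear in both pieces.
print(set(toy_train.index) & set(toy_test.index))
print(toy_train.index.equals(toy_train_again.index))
```

Changing `random_state` (or leaving it out) would typically give a different shuffle, which is why we pin it here.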

# Exercise 1: More Predictors
As before, let's try to predict the tip percent. So the `%` column will be the *response*. We'll use the training set the whole time.

In [93]:
y = train['%']
# We'll set X to different things below.

## 1.1: No Predictors
Suppose we predict a constant value for each data point.
1. What constant should we predict?
2. What MSE do we get?
3. What R^2 does this model get?
4. What Mean Absolute Error do we get?
5. How good or bad is this prediction? Connect it to the real world somehow.

For now, don't touch the *test set*; do all of this on the training set only.
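If the three metrics are rusty, here is a minimal sketch of their definitions on four made-up numbers. The names `y_toy` and `guess` are invented for illustration, and the constant is deliberately an arbitrary guess, not the one you should pick.

```python
import numpy as np

# Hypothetical toy response values (not from the tips data).
y_toy = np.array([10.0, 15.0, 20.0, 25.0])
guess = 17.0  # an arbitrary constant prediction, chosen only for illustration
pred = np.full_like(y_toy, guess)

# Mean squared error: average of squared residuals.
mse = np.mean((y_toy - pred) ** 2)
# Mean absolute error: average of absolute residuals.
mae = np.mean(np.abs(y_toy - pred))
# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
r2 = 1 - np.sum((y_toy - pred) ** 2) / np.sum((y_toy - y_toy.mean()) ** 2)

print(mse, mae, r2)
```

The functions `mean_squared_error`, `mean_absolute_error`, and `r2_score` in `sklearn.metrics` compute the same quantities if you would rather not write them by hand.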

In [94]:
# your code here


In [95]:
# your code here


In [96]:
# your code here


In [97]:
# your code here


*your answer here*


## 1.2: One predictor

Now let's use `total_bill` to help predict `%`.

First, make a matrix of observations by features (just one feature for now). Call it `X`. Use only training set points. Hint: use a one-element list of column names.
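The hint matters because indexing with a single name and indexing with a list of names return different shapes. A small sketch on a made-up DataFrame (`toy` and its columns are invented for illustration):

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

as_series = toy["a"]    # a single name gives a 1-D Series
as_frame = toy[["a"]]   # a one-element list gives a 2-D DataFrame

print(as_series.shape)  # (3,)
print(as_frame.shape)   # (3, 1)
```

sklearn estimators expect the 2-D, observations-by-features form.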

In [98]:
# your code here


In [99]:
assert X.shape == (len(train), 1)

Fit an sklearn `LinearRegression` model to predict `y` from `X`.

1. As the total bill amount increases, does the tip *percentage* generally increase or decrease? (hint: `linreg.coef_`)
2. What MSE do we get?
3. What R^2 does this model get?
4. What Mean Absolute Error do we get?
5. How good or bad is this prediction? Is that better or worse than the constant predictor?

Since we'll be doing this a few more times, go ahead and make a *function* that prints out the calculations you need for parts 1-4, given a fitted `LinearRegression` object and an `X` and `y`.

In [100]:
def show_regression_stats(linreg, X, y):
    '''Show coefficients, MSE, R^2, and MAE of a fitted LinearRegression.'''
    # your code here


In [101]:
# your code here


*your answer here*


## 1.3: Two predictors

Now let's use both `total_bill` and `size` (number of people in the party) to predict `%`.

First, make your `X` matrix.

In [102]:
# your code here


In [103]:
assert X.shape == (len(train), 2)

Fit an sklearn `LinearRegression` model to predict `y` from `X`. Use your `show_regression_stats` function from above.

1. As the total bill amount increases, how does the tip percentage change? Is it the same as in Exercise 1.2?
2. As the party size increases, does the tip *percentage* generally increase or decrease?
3. What MSE, R^2, and MAE do we get?
4. How good or bad is this prediction? Is that better or worse than our previous models?

In [104]:
# your code here


*your answer here*

# Exercise 2: Dealing with Categorical Variables

## 2.1: Error!
Try including the `time` column now too, in exactly the same way as you included the `size` column. Observe that you get an error. Why do you get that error?

In [105]:
# your code here


*your answer here*


In [106]:
train['time'].value_counts()

## 2.2: One-Hot Encoding
To solve this problem, we often "encode" categorical data as separate columns. So we'll end up with a model that looks like

20 + 5 * time_Lunch + 6 * time_Dinner

(if you've studied this before, there is a caveat here, shhhh...)

What should time_Lunch and time_Dinner be? Think for a moment before moving on.

.

.

.

.


`time_Lunch` should be a column that's 1 (or some other constant) when `time` is "Lunch" and 0 otherwise. Likewise, `time_Dinner` should be 1 when `time` is "Dinner" and 0 otherwise.

This is called "one-hot" encoding because only one of the columns is "hot" (non-zero) at a time.

We can construct these columns easily by using the fact that `True` is interpreted as `1`.

In [107]:
time_Lunch = train['time'] == 'Lunch'
time_Lunch.head()

In [108]:
time_Lunch = time_Lunch.astype(float) # in some situations you don't even need to explicitly convert the type
time_Lunch.head()
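As an aside, pandas also has a built-in helper, `pd.get_dummies`, that builds all of the indicator columns at once. The sketch below uses a made-up Series (`toy_time` is invented for illustration) and checks that the helper agrees with the comparison trick above. For this lab, stick with the manual approach so you can see what is going on.

```python
import pandas as pd

# Hypothetical toy column standing in for train['time'].
toy_time = pd.Series(["Lunch", "Dinner", "Dinner", "Lunch"], name="time")

# Manual encoding, as in the cells above.
manual = (toy_time == "Lunch").astype(float)

# Built-in encoding: one column per category, named prefix_category.
dummies = pd.get_dummies(toy_time, prefix="time", dtype=float)

print(list(dummies.columns))
print((dummies["time_Lunch"] == manual).all())
```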

The easiest way to include these variables in the regression is to add them to the dataframe. To avoid a warning that almost every Pandas user dreads (`SettingWithCopyWarning`), we won't *modify* the original DataFrame; instead, we'll make a *new* dataframe that includes the new columns. The syntax is:

In [109]:
train_transformed = train.assign(time_Lunch=1) # replacing that RHS with the desired value
train_transformed.head()
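To see that `assign` really does leave the original alone, here is a minimal sketch on a made-up DataFrame (`toy`, `toy_new`, and the column names are invented for illustration). Note that `assign` accepts several keyword arguments in one call, and each value can be a scalar or a whole column.

```python
import pandas as pd

# Hypothetical toy frame, unrelated to the tips data.
toy = pd.DataFrame({"price": [2.0, 3.0], "qty": [5, 4]})

# One call, two new columns: a computed Series and a scalar that gets broadcast.
toy_new = toy.assign(total=toy["price"] * toy["qty"], in_stock=True)

print(list(toy.columns))      # the original is unchanged
print(list(toy_new.columns))  # the new frame has the extra columns
```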

Use that to make a new variable, `train_transformed`, that has encoded columns for both Lunch and Dinner. Try to do it in a single assignment statement.

In [110]:
# your code here


In [111]:
assert 0 < train_transformed['time_Lunch'].mean() < 1
assert 0 < train_transformed['time_Dinner'].mean() < 1
assert np.allclose(train_transformed['time_Lunch'] + train_transformed['time_Dinner'], 1.0)

## 2.3: Fitting a regression with one-hot encoding

Now construct an `X` matrix containing all of the predictors we want to use: `total_bill`, `size`, and the two encoded columns for `time`.

In [112]:
# your code here


And fit a linear regression predicting `y` from our new `X`. Answer the same questions as in 1.3.

In [113]:
# your code here


## 2.4: Inference given a one-hot encoding

Use the coefficients of the linear model you found to write out a prediction function for this dataset. Hard-code the coefficients to 2 or 3 decimal places. You may need:

In [114]:
linreg.intercept_
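As a reminder of how the pieces fit together, a fitted linear model predicts the intercept plus each feature times its coefficient. The sketch below checks this on made-up numbers (`X_toy`, `y_toy`, and `toy_linreg` are invented for illustration and have nothing to do with the tips model you fitted).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated exactly by y = 1 + 2*x1 + 1*x2.
X_toy = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
y_toy = np.array([3.0, 6.0, 7.0, 10.0])

toy_linreg = LinearRegression().fit(X_toy, y_toy)

# Prediction "by hand": intercept + sum over features of coef * value.
by_hand = toy_linreg.intercept_ + X_toy @ toy_linreg.coef_

print(toy_linreg.intercept_, toy_linreg.coef_)
print(np.allclose(by_hand, toy_linreg.predict(X_toy)))
```

The entries of `coef_` are in the same order as the columns of the `X` the model was fitted on, so keep track of that order when you hard-code them.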

In [127]:
def predict(data):
    """
    Predict the tip percentage. Hard-codes a specific fitted model.
    """
    # your code here


Test this function by calling it on the original training set.

In [126]:
y_pred = predict(train)
print(np.mean((train['%'] - y_pred) ** 2))

If this isn't within 2 decimal places of the MSE you found in 2.3, check your work.

## 2.5: Critiquing the one-hot encoding

What do you notice about the coefficients of the two encoded columns? (Suppose you lost access to the `time_Dinner` column. Could you figure out a way to get the same predictions?)

*your answer here*