# Feature selection and feature engineering

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
import scipy.linalg as sp_la

## Data

Today we will keep working with the set of Craigslist listings for used cars.

All of this section is *exactly the same* as Wednesday and Friday.

First I make my converters.

In [None]:
# these will be our columns
columns = ["price", "year", "manufacturer", "model", "condition", "fuel", "odometer", "title_status", "transmission"]
# this will contain our converters
colValues = {}

# first we load our data as strings so we can define the converters
data = np.array(np.genfromtxt('data/vehicles.csv', delimiter=',', usecols=(1,2,3,4,5,7,8,9,11), skip_header=1, dtype=str, encoding='utf-8'))  

# make a list of the unique values in each column of our data
for colIndex in range(data.shape[1]):
    colValues[colIndex] = np.unique(data[:, colIndex]).tolist()
    print(colIndex, colValues[colIndex])

# fix up some of these ones we know are ordered
colValues[columns.index('condition')] = ['new', 'like new', 'excellent', 'good', 'fair', 'salvage']
colValues[columns.index('title_status')] = ['clean', 'lien', 'rebuilt', 'salvage', 'parts only', 'missing']

# map values to their indices in the list of unique values
def converter(x, colIndex):
    return colValues[colIndex].index(x)

Now we actually load the data.

In [None]:
data = np.array(np.genfromtxt('data/vehicles.csv', delimiter=',', usecols=(1,2,3,4,5,7,8,9,11), converters={3: lambda x: converter(x, 2), 4: lambda x: converter(x, 3), 5: lambda x: converter(x, 4), 7: lambda x: converter(x,5), 9: lambda x: converter(x, 7), 11: lambda x: converter(x, 8)}, skip_header=1, dtype=int, encoding='utf-8'))  

Let's get some summary statistics and do a **pairplot** so we can see what's going on.

In [None]:
def getSummaryStatistics(data):
    print("min, max, mean, std per variable")
    return pd.DataFrame([data.min(axis=0), data.max(axis=0), data.mean(axis=0), data.std(axis=0)])

def getShapeType(data):
    print("shape")
    return (data.shape, data.dtype)

print(getSummaryStatistics(data))
print(getShapeType(data))

In [None]:
df = pd.DataFrame(data, columns=columns)
seaborn.pairplot(df, y_vars = columns[0], x_vars = columns[1:])

plt.show()

Let's calculate *correlations* between price and the other variables. (Remind me what correlation values vary between?)

In [None]:
for i in range(len(columns)):
    print(columns[i], np.corrcoef(data[:, 0], data[:, i], rowvar=True)[0,1])

# Which model is best?

## Stepwise regression

We can do **feature selection**. This is useful for dealing with data that has many variables (features). How do we know which ones to *use*?
Here we do additive feature selection:
* repeatedly add an independent variable, train, and report $R^2$

For stepwise regression we use a modification of $R^2$, $${R^2}_{adj} = 1 - \frac{(1-R^2)(N-1)}{N-k-1}$$
where $N$ is the number of data points (rows), and $k$ is the number of independent variables in $A$ (not counting the leading column of 1s).
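To make the formula concrete, here is a minimal sketch. The function name `adjusted_rsquared` and the example numbers are made up for illustration, and it takes $N$ as the number of rows (data points) and $k$ as the number of independent variables, not counting the intercept.

```python
def adjusted_rsquared(r2, n, k):
    """Adjusted R^2 given plain R^2, n data points, and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more data points than coefficients (n > k + 1)")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same plain R^2 in both cases, but the model with more variables is penalized more.
few = adjusted_rsquared(0.80, n=100, k=3)
many = adjusted_rsquared(0.80, n=100, k=30)
print(few, many)
```

The penalty grows with $k$, which is why a variable that barely improves $R^2$ can still lower ${R^2}_{adj}$.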

Stepwise regression works like this:

1. Initialize $A$ to be just the leading column of 1s (because we know we will have an intercept).

2. Then while the improvements in ${R^2}_{adj}$ are > 0 and there remain independent variables not yet added:
  * calculate a regression using $A$ and each variable not yet in $A$, and 
  * add the one with the highest ${R^2}_{adj}$ to $A$.
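The steps above can be sketched roughly as follows. This is an illustration on synthetic data, not the code we use in class: the helper names (`adjusted_r2`, `forward_stepwise`) are hypothetical, it uses `np.linalg.lstsq` instead of `sp_la.lstsq`, and it scores ${R^2}_{adj}$ on the same data it fits.

```python
import numpy as np

def adjusted_r2(X, y, cols):
    """Fit y on an intercept plus X[:, cols]; return adjusted R^2 on the same data."""
    n, k = len(y), len(cols)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    c, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - ((y - A @ c) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def forward_stepwise(X, y):
    chosen = []        # A starts as just the column of 1s
    best = 0.0         # adjusted R^2 of the intercept-only model
    remaining = list(range(X.shape[1]))
    while remaining:
        # try each variable not yet in A
        scores = {j: adjusted_r2(X, y, chosen + [j]) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:   # no improvement: stop
            break
        chosen.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return chosen, best

# Synthetic data: only columns 0 and 2 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=200)

chosen, best = forward_stepwise(X, y)
print(chosen, best)
```

The two informative columns should be picked first. Whether a noise column also sneaks in depends on the random draw, since ${R^2}_{adj}$ only penalizes extra variables mildly.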

We could also do a variant of additive feature selection using the correlations:
* sort independent variables by size of correlation (positive or negative!) with the dependent variable
* repeatedly add the independent variable with the next biggest correlation; if it leads to higher $R^2$, keep it; if it doesn't, drop it again
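A rough sketch of this correlation-ordered variant, again on synthetic data with hypothetical helper names. One assumption to flag: on the training data plain $R^2$ can never go down when a variable is added, so this sketch uses ${R^2}_{adj}$ for the keep-or-drop decision. Scoring $R^2$ on held-out data would be another reasonable choice.

```python
import numpy as np

def adjusted_r2(X, y, cols):
    """Fit y on an intercept plus X[:, cols]; return adjusted R^2 on the same data."""
    n, k = len(y), len(cols)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    c, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - ((y - A @ c) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def correlation_ordered_selection(X, y):
    # correlation of each independent variable with the dependent variable
    corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    # biggest absolute correlation first (positive or negative!)
    order = [int(j) for j in np.argsort(-np.abs(corrs))]
    kept, best = [], 0.0
    for j in order:
        score = adjusted_r2(X, y, kept + [j])
        if score > best:      # keep it only if the model got better
            kept.append(j)
            best = score
    return order, kept, best

# Synthetic data: only columns 0 and 2 actually drive y.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=200)

order, kept, best = correlation_ordered_selection(X, y)
print(order, kept, best)
```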

And we could also go from most features to fewest:
* start with a model fit using *all* independent variables
* repeatedly take an independent variable out; if the resulting model has higher $R^2$, leave that variable out going forward
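Going from most features to fewest might look something like the sketch below. The names are hypothetical and the data is synthetic. As an assumption, it compares models by ${R^2}_{adj}$ on the training data, since plain training $R^2$ can never increase when a variable is removed.

```python
import numpy as np

def adjusted_r2(X, y, cols):
    """Fit y on an intercept plus X[:, cols]; return adjusted R^2 on the same data."""
    n, k = len(y), len(cols)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    c, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - ((y - A @ c) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def backward_elimination(X, y):
    kept = list(range(X.shape[1]))        # start with *all* independent variables
    best = adjusted_r2(X, y, kept)
    for j in range(X.shape[1]):
        trial = [c for c in kept if c != j]   # take variable j out
        score = adjusted_r2(X, y, trial)
        if score > best:                      # better without it: leave it out
            kept, best = trial, score
    return kept, best

# Synthetic data: only columns 0 and 2 actually drive y.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=200)

full = adjusted_r2(X, y, [0, 1, 2, 3])
kept, best = backward_elimination(X, y)
print(kept, full, best)
```

Because a variable is only dropped when the score goes up, the final model never scores worse than the full one under this criterion.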

There are many other options for feature selection!

In the code further below, I calculate the *powerset* of all the independent variables, which is an exhaustive search rather than a stepwise one. Then, for each non-empty subset of the independent variables, I train a model and calculate MSSE (on the training data) and $R^2$ (on the test data). Finally, I report the ten worst and ten best performing sets of independent variables by MSSE and by $R^2$.

Note:
* sometimes models with fewer variables work better than models with more
* sometimes a model may fit the training data better but the test data worse

### First, split our data

Let's split our data into **train** and **test**. Let's make sure to sort by time (the `year` column) first, because we don't want to let the future predict the past.

In [None]:
data = data[data[:, 1].argsort()]
print(getSummaryStatistics(data))
print(getShapeType(data))

(train, test) = np.split(data, [int(len(data) / 10 * 8)])
print(train.shape, test.shape)

We copied the chunk of code below verbatim from Monday's notebook in class.

In [None]:
# x a matrix of multiple independent variables
# poly -> polys, a matrix of multiple polynomial degrees for each column in x in order
def makePoly(x, polys):
    # make an array of zeros for A: one row per data point,
    # one column per polynomial term plus one for the intercept
    A = np.zeros([x.shape[0], np.sum(polys)+1])
    # left most column of 1s for the intercept
    # notice this is also a third way to get that leading column of ones!
    A[:, 0] = np.squeeze(x[:, 0]**0)
    k = 1
    # for each variable
    for (j, poly) in enumerate(polys):
        # for up to and including! poly
        for i in range(1, poly+1):
            A[:, k] = np.squeeze(x[:, j]**i)
            k += 1
    return A

def fit(data, independent, dependent, polys):
    # These are our independent variable(s)
    x = data[np.ix_(np.arange(data.shape[0]), independent)]

    # We add the polynomials, and a column of 1s for the intercept
    A = makePoly(x, polys)

    # This is the dependent variable 
    y = data[:, dependent]

    # This is the regression coefficients that were fit, plus some other results
    # We use _ when we don't want to remember something a function returns
    c, _, _, _ = sp_la.lstsq(A, y)
    return c

def predict(data, independent, polys, c):
    # These are our independent variable(s)
    x = data[np.ix_(np.arange(data.shape[0]), independent)]

    # We add the polynomials, and a column of 1s for the intercept
    A = makePoly(x, polys)

    return np.dot(A, c)

def rsquared(y, yhat):
    if len(y) != len(yhat):
        print("Need y and yhat to be the same length!")
        return 0
    return 1 - (((y - yhat)**2).sum() / ((y - y.mean())**2).sum())

This code is new for today. We updated it a little based on the code we copied over from Monday's class. Also, after class I added logging with Weights & Biases.

For this to work you now need a (free!) Weights & Biases account. Get one from https://wandb.ai. Then copy the API key. In the terminal, type `wandb login` and paste in your API key.

In [None]:
from itertools import chain, combinations
import wandb

def powerset(variables):
    return chain.from_iterable(combinations(variables, r) for r in range(len(variables)+1))

def msse(y, yhat):
    r = (np.square(y - yhat)).mean()
    return r

res = {}
for variableset in powerset(range(1, train.shape[1])):
    if len(variableset) > 0:
        name = '+'.join([str(x) for x in variableset])
        # start a new wandb run to track this run
        wandb.init(
            # set the wandb project where this run will be logged
            project="cars-regression",
            name=f"experiment_{name}",
            # track hyperparameters and run metadata
            config={
                "architecture": "regression",
                "dataset": "hyundaikia",
                "split": 20,
                "features": variableset
            }
        )

        # fit the multiple linear regression
        polys = [1 for x in range(len(variableset))]
        c = fit(train, list(variableset), 0, polys)
        # calculate MSSE and R^2
        res[variableset] = (msse(train[:, 0], predict(train, variableset, polys, c)), 
                            rsquared(test[:, 0], predict(test, variableset, polys, c)))
        wandb.log({"msse":res[variableset][0], "rsquared": res[variableset][1]})
        wandb.finish()


In [None]:
# sort by R^2
byrsquared = sorted(res.items(), key=lambda item: item[1][1])
print("Worst R^2")
for i in range(10):
    print([columns[x] for x in byrsquared[i][0]], byrsquared[i][1])
print("Best R^2")
for i in range(1, 11):
    print([columns[x] for x in byrsquared[-i][0]], byrsquared[-i][1])

In [None]:
# sort by MSSE
bymsse = sorted(res.items(), key=lambda item: item[1][0])
print("Worst MSSE")
for i in range(1, 11):
    print([columns[x] for x in bymsse[-i][0]], bymsse[-i][1])
print("Best MSSE")
for i in range(10):
    print([columns[x] for x in bymsse[i][0]], bymsse[i][1])



# Feature engineering

When we fiddle with our independent variables to make the models better, we call this **feature engineering**. For example, we might add the square of age as an extra independent variable.
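Here is a small sketch of that idea. The data is synthetic (the `age` and `price` arrays and their coefficients are made up, not taken from the Craigslist listings), and the helper `train_r2` is hypothetical. It compares a model that uses age alone with one that also uses the engineered feature age squared.

```python
import numpy as np

# Made-up data: price falls with age, but the drop levels off for older cars.
rng = np.random.default_rng(0)
age = rng.uniform(0, 20, size=300)
price = 30000 - 2500 * age + 70 * age**2 + rng.normal(scale=500, size=300)

def train_r2(A, y):
    """Least-squares fit of y on the columns of A; return R^2 on the same data."""
    c, *_ = np.linalg.lstsq(A, y, rcond=None)
    return 1 - ((y - A @ c) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Original feature only: intercept + age
A_linear = np.column_stack([np.ones_like(age), age])
# Engineered feature added: intercept + age + age^2
A_squared = np.column_stack([A_linear, age**2])

r2_linear = train_r2(A_linear, price)
r2_squared = train_r2(A_squared, price)
print(r2_linear, r2_squared)
```

On training data the model with the extra column can never fit worse, so the real question is whether the improvement holds up on test data.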

So we have two ways to experiment with a single modeling approach:
* feature selection
* feature engineering

I would not call data transformations (like min-max normalization) feature engineering, since you should do them before you start modeling, but you might choose to consider them feature engineering.

In real-world projects, you can spend a huge amount of time doing feature selection and feature engineering. It can get overwhelming quickly! Keep good track of your work with either an experiment logbook or experiment tracking software like [Weights & Biases](https://wandb.ai/site).

# Review

1. Find some data!
2. Load and look at your data
  * thing1
  * thing2
3. Consider cleaning, transforming and/or normalizing your data
  * thing1
  * thing2
  * thing3
  * thing4
4. Look at your data some more and consider feature selection, feature engineering, dimensionality reduction
  * *correlations*
  * *covariance matrix*
  * *PCA*
5. Model
  * thing1 (in three variations!)