# Multiple linear regression

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
import scipy.linalg as sp_la

## Data

Today we will keep working with the set of Craigslist listings for used cars.

In addition to the numeric columns odometer, year and price, I want to load the non-numeric columns manufacturer, model, condition, fuel, title_status and transmission.

Because every element of a numpy array has to be the same type, we need to:
* figure out which columns we want
* define some converters

In [None]:
# these will be our columns
columns = ["price", "year", "manufacturer", "model", "condition", "fuel", "odometer", "title_status", "transmission"]
# this will contain our converters
colValues = {}

# first we load our data as strings so we can define the converters
data = np.array(np.genfromtxt('data/vehicles.csv', delimiter=',', usecols=(1,2,3,4,5,7,8,9,11), skip_header=1, dtype=str, encoding='utf-8'))  

# make a list of the unique values in each column of our data
for colIndex in range(data.shape[1]):
    colValues[colIndex] = np.unique(data[:, colIndex]).tolist()
    print(colIndex, colValues[colIndex])

# map values to their indices in the list of unique values
def converter(x, colIndex):
    return colValues[colIndex].index(x)
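
To see what this kind of converter does, here is a minimal sketch on a made-up column of labels. The names `toy`, `toy_values` and `toy_codes` are hypothetical and are not part of the vehicles data.

```python
import numpy as np

# Hypothetical toy column of string labels (not taken from vehicles.csv)
toy = np.array(["gas", "diesel", "gas", "hybrid", "diesel"])

# Sorted list of the unique values, built the same way as colValues above
toy_values = np.unique(toy).tolist()

# Replace each label with its index in that sorted list
toy_codes = [toy_values.index(v) for v in toy]

print(toy_values)
print(toy_codes)
```

Note that the codes simply follow the sorted order of the labels, so the numbers themselves carry no meaning about the cars.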

Now we actually load the data.

In [None]:
# This dataset is the mazda subsample from https://www.kaggle.com/austinreese/craigslist-carstrucks-data after some cleanup

data = np.array(np.genfromtxt('data/vehicles.csv', delimiter=',', usecols=(1,2,3,4,5,7,8,9,11), converters={3: lambda x: converter(x, 2), 4: lambda x: converter(x, 3), 5: lambda x: converter(x, 4), 7: lambda x: converter(x,5), 9: lambda x: converter(x, 7), 11: lambda x: converter(x, 8)}, skip_header=1, dtype=int, encoding='utf-8'))  

Let's get some summary statistics and do a **pairplot** so we can see what's going on.

In [None]:
def getSummaryStatistics(data):
    print("min, max, mean, std per variable")
    return pd.DataFrame([data.min(axis=0), data.max(axis=0), data.mean(axis=0), data.std(axis=0)])

def getShapeType(data):
    print("shape")
    return (data.shape, data.dtype)

print(getSummaryStatistics(data))
print(getShapeType(data))

In [None]:
df = pd.DataFrame(data, columns=columns)
seaborn.pairplot(df, y_vars = columns[0], x_vars = columns[1:])

plt.show()

Which variables look like they might be correlated with price?

Let's calculate *correlations* between price and the other variables. (Remind me: what range do correlation values fall in?)

In [None]:
for i in range(len(columns)):
    print(columns[i], np.corrcoef(data[:, 0], data[:, i], rowvar=True)[0,1])
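
As a quick sanity check on how to read these numbers, here is a small sketch on synthetic data (the arrays are made up for illustration). It shows the two extremes of the Pearson correlation, and a case where a clear but non-linear pattern still gives a correlation of about zero.

```python
import numpy as np

# Synthetic x values, purely for illustration
x = np.arange(10, dtype=float)

# Perfectly increasing linear relationship
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# Perfectly decreasing linear relationship
r_neg = np.corrcoef(x, -3 * x + 5)[0, 1]

# Symmetric U shape around the mean of x: strong pattern, but no linear trend
r_zero = np.corrcoef(x, (x - x.mean()) ** 2)[0, 1]

print(r_pos, r_neg, r_zero)
```

So a correlation near zero only tells us there is no *linear* relationship, which is worth keeping in mind when reading the values for the integer-coded columns.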

## Let's review regression

Regression allows us to:
* determine the *nature* of a relationship between one (or more!) independent variables and a dependent variable
* determine the *strength* of the relationship

Regression *fits* a function to a dataset.

## What kinds of functions can we fit? 

I want to predict price as a function of age and mileage! After all, an old car with low mileage may be worth more than a new car with high mileage. Actually, I probably want to include some of those other variables (features) too! (Which ones?)

It turns out I can do this using **multiple linear regression**. The function I will want to fit will be: $\hat{y}_i = c_0 + c_1 x_{1i} + c_2 x_{2i} + ... + c_M x_{Mi}$ for $M$ variables, where $i$ indexes the data points, and I do this by minimizing the sum of the squares of the residuals $r_i = y_i - \hat{y}_i$.

In terms of matrix math, for $N$ data points, $A$ will just be a matrix of shape ($N, M+1$) (including the leading column of 1s) and $\vec{c}$ will have shape ($M+1, 1$) (including $c_0$, the intercept) and $\vec{y}$ will have shape ($N, 1$) (as before).
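
To make the shapes concrete, here is a minimal sketch on synthetic, noise-free data. The sizes `N` and `M`, the coefficients `true_c` and the random inputs are all assumptions chosen for illustration.

```python
import numpy as np
import scipy.linalg as sp_la

rng = np.random.default_rng(0)

# Assumed sizes: N data points, M independent variables
N, M = 50, 2
x = rng.uniform(0, 10, size=(N, M))

# Made-up coefficients: c_0 (intercept), c_1, c_2
true_c = np.array([3.0, 2.0, -1.0])

# Design matrix with a leading column of 1s, shape (N, M + 1)
A = np.hstack((np.ones((N, 1)), x))

# Noise-free dependent variable, so the fit should recover true_c
y = A @ true_c

# Least-squares solution; with a 1-D y, c comes back with shape (M + 1,)
c, _, _, _ = sp_la.lstsq(A, y)

print(A.shape, c.shape)
print(c)
```

With real data the residuals will not be zero, and the fitted coefficients are the ones that make the sum of squared residuals as small as possible.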

Let's do it! 

### First, split our data

Let's split our data into **train** and **test**. Let's make sure to sort by time (the year column) first, because we don't want to let the future predict the past.

In [None]:
data = data[data[:, 1].argsort()]
print(getSummaryStatistics(data))
print(getShapeType(data))

(train, test) = np.split(data, [int(len(data) / 10 * 8)])
print(train.shape, test.shape)
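
The same sort-then-split step can be sketched on a tiny made-up array, which makes it easy to check that every training row is no newer than every test row. The `toy` rows of (price, year) are hypothetical.

```python
import numpy as np

# Hypothetical (price, year) rows, deliberately out of chronological order
toy = np.array([
    [9000, 2015],
    [4000, 2008],
    [12000, 2019],
    [6000, 2011],
    [15000, 2020],
])

# Sort rows by the year column
toy_sorted = toy[toy[:, 1].argsort()]

# First 80% of rows go to train, the rest to test
cut = int(len(toy_sorted) / 10 * 8)
toy_train, toy_test = np.split(toy_sorted, [cut])

print(toy_train)
print(toy_test)
```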

### Second, define an updated fit function that can handle multiple independent variables

In [None]:
def fit(data, independent, dependent):
    # These are our independent variable(s)
    x = data[np.ix_(np.arange(data.shape[0]), independent)]

    # We add a column of 1s for the intercept
    A = np.hstack((np.array([np.ones(x.shape[0])]).T, x))

    # This is the dependent variable 
    y = data[:, dependent]

    # These are the regression coefficients that were fit; we discard the other results
    c, _, _, _ = sp_la.lstsq(A, y)
    return c

In [None]:
print(columns)

And fit to our training data.

In [None]:
c = fit(train, [1,2,3,4,5,6,7,8], 0)
print([columns[x] for x in [1,2,3,4,5,6,7,8]])
print(c)

### Third, define an updated predict function that can handle multiple independent variables

In [None]:
def predict(data, independent, c):
    # These are our independent variable(s)
    x = data[np.ix_(np.arange(data.shape[0]), independent)]
 
    # We add a column of 1s for the intercept
    A = np.hstack((np.array([np.ones(x.shape[0])]).T, x))

    return np.dot(A, c)

And predict on the test data.

In [None]:
yhat = predict(test, [1,2,3,4,5,6,7,8], c)

### Fourth, evaluate using $R^2$

In [None]:
# assume these are numpy arrays
def rsquared(y, yhat):
    if len(y) != len(yhat):
        print("Need y and yhat to be the same length!")
        return 0
    return 1 - (((y - yhat)**2).sum() / ((y - y.mean())**2).sum())

def plotxyyhat(x, y, yhat):
    plt.plot(x, y, 'o', label='data')
    # yhat depends on several variables, so plot it as points rather than a line
    plt.plot(x, yhat, 'x', label='least squares prediction')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.legend(framealpha=1, shadow=True)
    plt.grid(alpha=0.25)
    plt.show()

In [None]:
print(rsquared(test[:, 0], yhat))
# using only one independent variable to plot
plotxyyhat(test[:, 1], test[:, 0], yhat)
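
To help interpret the $R^2$ value, here is a small sketch with made-up numbers. The helper `r2` repeats the same formula as `rsquared` so that the block runs on its own.

```python
import numpy as np

def r2(y, yhat):
    # Same formula as rsquared above: 1 - SS_residual / SS_total
    return 1 - ((y - yhat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Made-up observations
y = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions
perfect = r2(y, y)

# Always predicting the mean of y
baseline = r2(y, np.full_like(y, y.mean()))

# Predictions that are worse than the mean (here, y reversed)
worse = r2(y, y[::-1])

print(perfect, baseline, worse)
```

A perfect fit gives 1, predicting the mean gives 0, and on held-out test data $R^2$ can even be negative when the model does worse than simply predicting the mean.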
