# About the Python

Remember our core python package for this course:
* numpy - great for processing vectors, matrices and tensors, which are the foundational data types for data analysis, visualization and machine learning

And don't forget our go-tos for visualization:
* matplotlib
* seaborn

Both of these operate over pandas dataframes, and we can make a pandas dataframe from a numpy array easily.
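
As a quick reminder of that last step, here is a minimal sketch of wrapping a numpy array in a pandas dataframe. The array values and the column names (`price`, `odometer`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# A tiny made-up array: each row is one observation, each column one variable
toy = np.array([[9500, 60000],
                [7200, 98000],
                [12900, 31000]])

# Column names are supplied separately, because numpy arrays do not carry them
toy_df = pd.DataFrame(toy, columns=["price", "odometer"])
print(toy_df)
```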

Today we'll learn about another python package:
* scipy - great for scientific computing (linear algebra, differential equations and more!)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn

# Data

The first three steps in data science are to:
* find your data
* look at your data
* clean your data

Today we will be working with a set of Craigslist listings for used Mazda 3s. (!!)

This dataset is the Mazda 3 subset from https://www.kaggle.com/austinreese/craigslist-carstrucks-data after some cleanup (basically, I selected all the variants of Mazda 3 in the model column, and then removed some columns).

Take a look at it in data/mazdas.csv!

Now we will load this dataset. I'm going to load it the unpleasant way (rather than using the Data class), because I want to use those nonnumeric columns.

Because our numpy arrays all have to be the same type, we first need to:
* figure out which columns we want
* define some converters

In [None]:
# this just does pattern matching
import re

def conditionConverter(x):
    values = ['', 'new', 'like new', 'excellent', 'good', 'fair', 'salvage']
    return values.index(x)

def titleConverter(x):
    values = ['', 'clean', 'lien', 'rebuilt','salvage']
    return values.index(x)

def transmissionConverter(x):
    values = ['', 'automatic', 'manual', 'other']
    return values.index(x)

def typeConverter(x):
    if re.match(r'(sedan|coupe)', x):
        return 1
    elif re.match(r'(mini-van|hatchback|wagon|SUV)', x):
        return 2
    return 0

def colorConverter(x):
    values = ['', 'grey', 'white', 'black', 'blue', 'orange', 'purple', 'red', 'green', 'brown', 'custom', 'silver']
    if x in values:
        return values.index(x)
    else:
        print("couldn't find |" + x + "|")
        return 0
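
Each converter above maps a text label to its position in a fixed list, so the empty string (a missing value) becomes 0 and the known labels become 1, 2, 3, and so on. A minimal sketch of that idea, using a hypothetical helper `encode_label` that is not part of this notebook:

```python
def encode_label(label, values):
    """Return the position of label in values (hypothetical helper for illustration)."""
    return values.index(label)

conditions = ['', 'new', 'like new', 'excellent', 'good', 'fair', 'salvage']

print(encode_label('', conditions))           # a missing value
print(encode_label('excellent', conditions))  # a known label

# list.index raises ValueError for a label that is not in the list,
# which is why colorConverter checks membership first and falls back to 0
try:
    encode_label('mint', conditions)
except ValueError as err:
    print("unknown label:", err)
```

Note that the codes inherit the order of the list. For condition that order means something (new down to salvage); for color it is arbitrary, which is worth keeping in mind when we look at correlations.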

In [None]:
# Load the data
# This dataset is the mazda subsample from https://www.kaggle.com/austinreese/craigslist-carstrucks-data after some cleanup

columns = ["price", "age", "condition", "odometer", "title_type", "transmission", "type", "color"]
# genfromtxt already returns a numpy array, so no extra np.array() wrapper is needed
data = np.genfromtxt(
    'data/mazdas.csv',
    delimiter=',',
    usecols=(1, 2, 5, 6, 7, 8, 9, 10),
    converters={5: conditionConverter, 7: titleConverter, 8: transmissionConverter,
                9: typeConverter, 10: colorConverter},
    skip_header=2,
    dtype=int,
    encoding='utf-8',
)
print(data)

*If we loaded it using your Data class from project one, what columns could we access?*

Let's do a pairplot so we can see what's going on.

In [None]:
df = pd.DataFrame(data, columns=columns)
seaborn.pairplot(df, y_vars = columns[0], x_vars = columns[1:])

plt.show()

*What do we notice?*

Let's get some summary statistics.

In [None]:
def getSummaryStatistics(data):
    return np.array([data.max(axis=0), data.min(axis=0), data.mean(axis=0, dtype=int)])

def getShapeType(data):
    return (data.shape, data.dtype)

print(getSummaryStatistics(data))
print(getShapeType(data))
print(data[0])

Let's add a new summary statistic. This one will tell us how highly *correlated* two variables are. If two variables are highly correlated, then you can estimate one using the other. For example:
* price and price are perfectly correlated (correlation of 1),
* price and -price should have a correlation of -1, and
* price and random numbers should have a correlation close to 0.

In [None]:
print(np.corrcoef(data[:, 0], data[:, 0], rowvar=True)[0,1])
print(np.corrcoef(data[:, 0], -data[:, 0], rowvar=True)[0,1])
print(np.corrcoef(data[:, 0], np.random.randint(0, data[:, 0].max(), len(data[:, 0])), rowvar=True)[0,1])
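
`np.corrcoef` computes the Pearson correlation coefficient: the covariance of the two variables divided by the product of their standard deviations. As a sketch of what that means, here it is computed by hand on synthetic data (the arrays `a` and `b` and the helper `pearson` are made up for this example):

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation of two 1-D arrays (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da = a - a.mean()
    db = b - b.mean()
    return (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 3 * a + rng.normal(size=200)   # b depends on a, plus noise

print(pearson(a, b))
print(np.corrcoef(a, b)[0, 1])     # should agree with the line above
```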

Okay, what are the correlations for the variables in the mazda3 dataset?

In [None]:
for i in range(len(columns)):
    print(columns[i], np.corrcoef(data[:, 0], data[:, i], rowvar=True)[0,1])
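
The loop above calls `np.corrcoef` once per column. A possible alternative, sketched here on a small made-up array rather than the Mazda data, is to pass the whole 2-D array with `rowvar=False` so that each column is treated as a variable. The first row of the result then holds the correlations with the first column.

```python
import numpy as np

rng = np.random.default_rng(1)
first = rng.normal(size=100)
toy = np.column_stack([first,                  # column 0
                       -2 * first,             # perfectly anti-correlated with column 0
                       rng.normal(size=100)])  # unrelated noise

corr = np.corrcoef(toy, rowvar=False)  # 3 x 3 correlation matrix
print(corr[0])                          # correlations of column 0 with every column
```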

*What do we notice?*

# Regression

Today we are moving on from:
* data loading
* data visualization
* basic data analysis
  * summary statistics on data
  * data transformation and normalization

to more advanced data analysis, in particular *regression*.

## What is regression? 

Regression allows us to:
* determine the *nature* of a relationship between one (or more!) independent variables and a dependent variable
* determine the *strength* of the relationship

Regression *fits* a function to a dataset.

Regression is *not* the same as interpolation: we don't want to fit so closely that we exactly pass through each data point, but rather fit so that we generalize over the data. Why is this?
* because the data sets we have are not *all* the data, but a *sample* of the data, and other samples may be somewhat different
* because we want to be able to *explain* relationships between variables

Both of these require some generalization/abstraction away from the actual data.
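
To make the contrast with interpolation concrete, here is a small sketch on synthetic data (the true relationship, the noise level, and the variable names are all assumptions for illustration). A degree-5 polynomial through 6 points reproduces the sample exactly, noise included, while a straight line leaves residuals.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.arange(6, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)  # a line plus noise

line = np.polyfit(x, y, deg=1)              # regression: 2 coefficients for 6 points
interp = np.polyfit(x, y, deg=x.size - 1)   # interpolation: as many coefficients as points

def mean_sq_error(coeffs, x, y):
    """Mean squared difference between y and the polynomial's predictions."""
    return np.mean((y - np.polyval(coeffs, x)) ** 2)

print("line, on the sample:        ", mean_sq_error(line, x, y))
print("interpolant, on the sample: ", mean_sq_error(interp, x, y))

# A second sample from the same process: same x values, fresh noise
y_new = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)
print("line, on a new sample:       ", mean_sq_error(line, x, y_new))
print("interpolant, on a new sample:", mean_sq_error(interp, x, y_new))
```

The interpolant's error on the sample it was fit to is essentially zero, but that alone does not tell us how well it describes the underlying relationship; compare the two errors on the new sample.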

Okay, so we need some functions; we need a method for making the function "fit" the data; and we need a measure of how "good" the fit is. 

## What kinds of functions can we fit? 

Any kind! What are some kinds of [function](https://www.wolframalpha.com/input?i=mathematical+function)?

Why might we prefer simpler functions over more complex ones?

We are going to start with *linear regression*. What type of function do we fit when we do linear regression? A linear function! You might see it written like $$F(x_i) = b + mx_i$$ or like $$\hat{y}_i = b + mx_i$$ It's called linear because when you plot it you get a line that crosses the $y$-axis at $b$ (the "intercept", the value of $\hat{y}$ when $x = 0$) and has slope $m$.

Note that this function tries to calculate one variable ($y$) as a function of one other variable ($x$). Next week, we will look at functions that calculate $y$ as a function of multiple other variables.

For the mazda3 dataset, which one variable do you think would be the best one to use to calculate or approximate price?

## What does it mean to *fit* a function? 

Now let's talk about methods for making the function "fit" the data. We know $x$, and we know $y$. We do *not* know the values for $b$ or $m$; that's what we need to figure out. We could try a random or educated guess :). 

In [None]:
def plotxyyhat(x, y, m, b):
    plt.plot(x, y, 'o', label='data')
    yhat = m*x + b
    # this helper is used for guesses as well as fitted lines, so keep the label neutral
    plt.plot(x, yhat, label='line, $y = mx + b$')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.legend(framealpha=1, shadow=True)
    plt.grid(alpha=0.25)
    plt.show()

m = 10
b = -2000

plotxyyhat(data[:, 1], data[:, 0], m, b)

Can we do better than that?

Yes, but we need a notion of "good" to assess "better". 

## How do we measure how "good" the fit is?

If we have a function and a set of data points, how well does the function fit the data? 

The "bit left over" or the "distance", which we call the *residual*, is often calculated as: $$r_i = y_i - F(x_i)$$ (or, $r_i = y_i - \hat{y}_i$).

And how can we combine these differences? We could:
* Take the average, median, min or max of the distances
* Take the average, median, min or max of the absolute distances
* Take the sum of the absolute distances
* Take the mean of the squared distances (MSSE, or *mean sum of squared error*; you will more often see this called *mean squared error*, MSE):
$$MSSE = \frac{1}{N} \sum_{i=1}^N (r_i)^2 = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2$$
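
A tiny worked example, with made-up numbers, shows why the choice matters: signed residuals can cancel each other out, while absolute or squared residuals cannot.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])      # observed values (made up)
yhat = np.array([2.0, 2.0, 2.0])   # predictions from a flat line at 2

r = y - yhat                        # residuals: -1, 0, 1

print("mean residual:         ", r.mean())          # the errors cancel to 0
print("mean absolute residual:", np.abs(r).mean())
print("sum of absolute:       ", np.abs(r).sum())
print("MSSE:                  ", (r ** 2).mean())
```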


In [None]:
def msse(x, y, m, b):
    # mean of the squared residuals for the line yhat = m*x + b
    if len(x) != len(y):
        # returning 0 here would look like a perfect fit, so fail loudly instead
        raise ValueError("Need x and y to be the same length!")
    yhat = [m*xi + b for xi in x]
    r = 0.0
    for i in range(len(x)):
        residual = float(y[i]) - float(yhat[i])
        r += residual * residual
    return r / len(x)

print(msse(data[:, 1], data[:, 0], m, b))

# what happens if our slope is 0 and our intercept is the mean value for price?
print(msse(data[:, 1], data[:, 0], 0, data[:, 0].mean()))
# let's see if we can do better than just using the mean...
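
The loop in `msse` spells out the formula one data point at a time. The same quantity can be written with numpy array operations. This is a sketch of an equivalent, not a replacement for the function above, and the name `msse_vectorized` is made up here.

```python
import numpy as np

def msse_vectorized(x, y, m, b):
    """Mean of squared residuals for the line y = m*x + b (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("Need x and y to be the same shape!")
    residuals = y - (m * x + b)
    return np.mean(residuals ** 2)

# Made-up points that lie exactly on y = 2x + 1
x_toy = np.array([0, 1, 2, 3])
y_toy = np.array([1, 3, 5, 7])

print(msse_vectorized(x_toy, y_toy, 2, 1))   # the true line
print(msse_vectorized(x_toy, y_toy, 0, 4))   # a flat line at the mean of y
```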

Okay, so to calculate $m$ and $b$, we need to _minimize_ $MSSE$ with respect to each. We do this using a method called [least squares](https://en.wikipedia.org/wiki/Least_squares) (see also [least squares](https://www.wolframalpha.com/input?i=least+squares)). We take the partial derivatives of $MSSE$ with respect to $m$ and $b$, set each to 0 (the derivative is 0 at the minimum), and solve (*please check my math!*):
* partial derivative of $MSSE$ ($p$) with respect to $m$: $$\frac{\partial p}{\partial m} = \frac{1}{N} \sum_{i=1}^N \frac{\partial }{\partial m} (y_i - F(x_i))^2 = \frac{1}{N} \sum_{i=1}^N \frac{\partial }{\partial m} (y_i - (m x_i + b))^2 = -\frac{2}{N} \sum_{i=1}^N x_i (y_i - (m x_i + b)) = 0$$ Dividing both sides by $-2/N$ and expanding the sum gives $$\sum_{i=1}^N x_i y_i - m \sum_{i=1}^N {x_i}^2 - b \sum_{i=1}^N x_i = 0$$
* partial derivative of $MSSE$ ($p$) with respect to $b$: $$\frac{\partial p}{\partial b} = \frac{1}{N} \sum_{i=1}^N \frac{\partial }{\partial b} (y_i - F(x_i))^2 = \frac{1}{N} \sum_{i=1}^N \frac{\partial }{\partial b} (y_i - (m x_i + b))^2 = -\frac{2}{N} \sum_{i=1}^N (y_i - (m x_i + b)) = 0$$ Dividing both sides by $-2/N$ and expanding the sum gives $$\sum_{i=1}^N y_i - m \sum_{i=1}^N x_i - b N = 0$$

Solving these two equations for the two unknowns gives
$$ m = \frac{\sum_{i=1}^N x_i \sum_{i=1}^N y_i - N \sum_{i=1}^N x_i y_i}{(\sum_{i=1}^N x_i)^2 - N \sum_{i=1}^N {x_i}^2}$$
$$ b = \frac{\sum_{i=1}^N x_i \sum_{i=1}^N x_i y_i - \sum_{i=1}^N {x_i}^2 \sum_{i=1}^N y_i}{(\sum_{i=1}^N x_i)^2 - N \sum_{i=1}^N {x_i}^2}$$

The same two equations can also be written in matrix form:

$$
\begin{bmatrix}
\sum_{i=1}^N {x_i}^2 & \sum_{i=1}^N x_i \\
\sum_{i=1}^N x_i & N
\end{bmatrix}
\begin{bmatrix}
m \\
b
\end{bmatrix}
=
\begin{bmatrix}
\sum_{i=1}^N x_i y_i \\
\sum_{i=1}^N y_i
\end{bmatrix}
$$
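
As a check on the algebra, the closed-form expressions for $m$ and $b$ can be evaluated directly and compared with a numerical solution of the 2×2 system. This sketch uses made-up points and `np.linalg.solve`; the variable names (`Sx`, `Sxy`, and so on) are shorthand introduced here for the sums.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])   # made-up, roughly y = 2x + 1
N = len(x)

Sx, Sy = x.sum(), y.sum()
Sxx, Sxy = (x * x).sum(), (x * y).sum()

# Closed-form solution from the formulas above
denom = Sx ** 2 - N * Sxx
m = (Sx * Sy - N * Sxy) / denom
b = (Sx * Sxy - Sxx * Sy) / denom

# The same two equations in matrix form, solved numerically
A = np.array([[Sxx, Sx],
              [Sx,  N]])
rhs = np.array([Sxy, Sy])
m_solve, b_solve = np.linalg.solve(A, rhs)

print(m, b)
print(m_solve, b_solve)   # should agree with the line above
```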

Now this is something we can calculate using matrix math:
$$F(x_i) = b + m x_i = (1, x_i) \cdot (b, m)$$

The function `lstsq` in the scipy package's `linalg` (linear algebra) subpackage
fits a linear regression using least squares. It gives us the fitted coefficients (here, $b$ and $m$) and the sum of the squared residuals; we can then compute the predicted $y$ values $\hat{y}$ from the coefficients. Let's try it on our data.

In [None]:
import scipy as sp
import scipy.linalg  # load the linalg subpackage explicitly; older scipy versions need this

# These are our independent variable(s)
x = data[:, 1]
print(getSummaryStatistics(x))
print(getShapeType(x))

# We add a column of 1s for the intercept
M = x[:, np.newaxis]**[0, 1]
print(getSummaryStatistics(M))
print(getShapeType(M))

# This is the dependent variable 
y = data[:, 0]
print(getSummaryStatistics(y))
print(getShapeType(y))

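# p holds the fitted regression coefficients; res is the sum of squared residuals
# (the other two return values are the rank and singular values of M, which we ignore)
p, res, _, _ = sp.linalg.lstsq(M, y)

# p[0] is b (the intercept) and p[1] is m (the slope), matching the column order of M
print(p)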

plotxyyhat(data[:, 1], data[:, 0], p[1], p[0])
print(msse(data[:, 1], data[:, 0], p[1], p[0]))
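
To see what `lstsq` hands back, here is a sketch on made-up points. The first return value holds the coefficients in the same order as the columns of the design matrix, so intercept first and slope second. The second is the sum of squared residuals, which, divided by the number of points, should match the MSSE defined earlier.

```python
import numpy as np
from scipy import linalg

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])   # made-up, roughly y = 2x + 1

# x**0 gives a column of ones (for the intercept), x**1 gives x itself (for the slope)
M = x[:, np.newaxis] ** [0, 1]

p, res, rank, sing_vals = linalg.lstsq(M, y)
b, m = p

yhat = M @ p                               # predictions: b + m * x
print("b, m:", b, m)
print("sum of squared residuals:", res)
print("MSSE:", np.mean((y - yhat) ** 2))
```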

## And, this is a function we can use to predict the $y$ (calculate the $\hat{y}$) for new $x$s, so it's a *model*!

For example, my car was a 2018 Mazda 3, so what should its Craigslist price be?

In [None]:
print(p[1]*2018+p[0])

*What about your Mazda 3s??*

Of course, this model is based on *historical* car prices, and what has happened over the past year?