# Lecture 8 – Regression and Linear Algebra

## DSC 40A, Fall 2021

In [None]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display, HTML, Math

In [None]:
# Run this cell to load in our dataset. Don't worry about what it's doing.
np.random.seed(25)

salaries_raw = pd.read_csv('data/data_scientist_salaries.csv')
salaries = salaries_raw.get(['YearsCodingProf', 'Age', 'FormalEducation', 'Salary']).dropna()

def extract_years(year_str):
    if isinstance(year_str, float):
        return year_str
    if 'older' in year_str:
        years = 65
    elif 'more' in year_str:
        years = 30
    elif 'Under' in year_str:
        years = 18
    else:
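        # Ranges like '3-5 years' contain two integers; pick a whole number between them.
        extracted = re.findall(r'\d+', year_str)
        if len(extracted) < 2:
            raise ValueError(f'Could not parse a range from {year_str!r}')
        lower, upper = int(extracted[0]), int(extracted[1])
        years = np.random.randint(lower, upper + 1)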
    return years + np.round(np.random.normal(0, 1), 2)

salaries['Age'] = salaries['Age'].apply(extract_years)
salaries['YearsExperience'] = salaries['YearsCodingProf'].apply(extract_years)
salaries = salaries[['YearsExperience', 'Age', 'FormalEducation', 'Salary']]
salaries = salaries[(salaries['Salary'] < 500000) & (salaries['Salary'] > 1000) & (salaries['YearsExperience'] > 0)].reset_index(drop=True)
salaries['Salary'] /= 1000

In [None]:
salaries

### Design matrix

In this case, we only have one feature – `'YearsExperience'`. Our design matrix would then look something like:

In [None]:
# Don't worry about this code ---
X_as_df = pd.DataFrame()
X_as_df['1'] = np.ones(salaries.shape[0]).astype(int)
X_as_df['YearsExperience'] = salaries['YearsExperience']
# ---

X_as_df

In [None]:
# Converting to a numpy array
X = X_as_df.values
X

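The array above is our design matrix, $X$! Its first column is all ones and its second column holds each person's years of experience.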
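
If the DataFrame detour feels like a lot, here is a minimal sketch of building the same kind of matrix directly in `numpy`. The feature values in `x_small` are made up for illustration and are not taken from the salaries data.

```python
import numpy as np

# Hypothetical years-of-experience values, made up for illustration.
x_small = np.array([1.0, 3.0, 5.0, 10.0])

# Put a column of ones (for the intercept) next to the feature column.
X_small = np.column_stack([np.ones(len(x_small)), x_small])

print(X_small.shape)  # (4, 2): one row per person, one column per parameter
```

Each row of `X_small` has the form $(1, x_i)$, so multiplying it by $\vec{w} = (w_0, w_1)$ gives $w_0 + w_1 x_i$.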

### Observation vector

In [None]:
y = salaries['Salary']
y

In [None]:
y = y.values
y

### Making predictions

For any vector $\vec{w} \in \mathbb{R}^{2}$, we can make predictions using

$$\vec{h} = X \vec{w}$$

Let's test it out!

In [None]:
# Predictions using an intercept of 80 and a slope of 3.
X @ np.array([80, 3])

Our goal is to choose $\vec{w}$ so that the above array of predictions is as close to `y` as possible.
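
One way to measure "close" is mean squared error, the average squared difference between `y` and the predictions. Below is a small sketch of that calculation. The helper name `mean_squared_error` and the toy data are assumptions made for illustration; they are not part of the lecture code.

```python
import numpy as np

def mean_squared_error(X, y, w):
    """Average squared difference between observations y and predictions X @ w."""
    residuals = y - X @ w
    return np.mean(residuals ** 2)

# Hypothetical toy data in which y is exactly 2 + 3x.
X_toy = np.column_stack([np.ones(3), np.array([0.0, 1.0, 2.0])])
y_toy = np.array([2.0, 5.0, 8.0])

print(mean_squared_error(X_toy, y_toy, np.array([2.0, 3.0])))  # 0.0, a perfect fit
print(mean_squared_error(X_toy, y_toy, np.array([0.0, 3.0])))  # 4.0, every prediction is off by 2
```

A smaller value means the predictions are closer to the observations, so the best $\vec{w}$ is the one that makes this number as small as possible.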

### Implementing the solution

We claimed that the vector $\vec{w}$ that minimizes

$$R_{sq}(\vec{w}) = \frac{1}{n} || \vec{y} - X \vec{w} ||^2$$

is

$$\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$$
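
As a quick reminder of where this comes from, here is a sketch of the key steps, assuming $X^TX$ is invertible (which holds when the columns of $X$ are linearly independent). The gradient of the risk is

$$\nabla R_{sq}(\vec{w}) = -\frac{2}{n} X^T (\vec{y} - X \vec{w})$$

Setting the gradient to zero gives the normal equations

$$X^TX \vec{w}^* = X^T \vec{y}$$

and multiplying both sides on the left by $(X^TX)^{-1}$ gives the formula for $\vec{w}^*$ above.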

In [None]:
def least_squares(X, y):
    # Computes w* = (X^T X)^{-1} X^T y. Assumes X^T X is invertible.
    return np.linalg.inv(X.T @ X) @ X.T @ y

In [None]:
w_star = least_squares(X, y)

In [None]:
w_star
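
Explicitly inverting $X^TX$ follows the formula directly, but it is not the only way to compute $\vec{w}^*$. The sketch below compares three approaches on synthetic data: the explicit inverse, solving the normal equations with `np.linalg.solve`, and `np.linalg.lstsq`. The synthetic data (a noisy line with intercept 50 and slope 4) is an assumption made for illustration, not the salaries dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: a noisy line with intercept 50 and slope 4.
x_syn = rng.uniform(0, 20, size=100)
y_syn = 50 + 4 * x_syn + rng.normal(0, 5, size=100)
X_syn = np.column_stack([np.ones(len(x_syn)), x_syn])

# Approach 1: explicit inverse, mirroring the formula for w*.
w_inv = np.linalg.inv(X_syn.T @ X_syn) @ X_syn.T @ y_syn

# Approach 2: solve the normal equations (X^T X) w = X^T y without forming an inverse.
w_solve = np.linalg.solve(X_syn.T @ X_syn, X_syn.T @ y_syn)

# Approach 3: NumPy's built-in least squares routine.
w_lstsq, *_ = np.linalg.lstsq(X_syn, y_syn, rcond=None)

print(np.allclose(w_inv, w_solve), np.allclose(w_inv, w_lstsq))
```

All three should agree up to floating-point error on a well-behaved problem like this one. In practice, solving the system is generally preferred over forming the inverse.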

What if I have 10 years of experience – what should I expect my salary to be? (Remember that we divided `'Salary'` by 1000, so the prediction is in thousands.)

In [None]:
np.array([1, 10]) @ w_star

Note that the two entries of `w_star` match the intercept and slope we get from our manual formulas in Lecture 7!

In [None]:
def correlation(x, y):
    x = np.array(x)
    y = np.array(y)
    
    x_su = (x - np.mean(x)) / np.std(x)
    y_su = (y - np.mean(y)) / np.std(y)
    
    return np.mean(x_su * y_su)

def slope(x, y):
    return correlation(x, y) * np.std(y) / np.std(x)

def intercept(x, y):
    return np.mean(y) - slope(x, y) * np.mean(x)

In [None]:
intercept(X[:, 1], y)

In [None]:
slope(X[:, 1], y)

# 🤯
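
To see that this agreement is not a coincidence of the salaries data, here is a self-contained sketch that repeats the comparison on synthetic data. The data is made up for illustration, and `np.corrcoef` stands in for the `correlation` function defined above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical synthetic data, made up for illustration.
x_syn = rng.uniform(0, 20, size=200)
y_syn = 50 + 4 * x_syn + rng.normal(0, 5, size=200)

# Slope and intercept from the correlation-based formulas.
r = np.corrcoef(x_syn, y_syn)[0, 1]
slope_formula = r * np.std(y_syn) / np.std(x_syn)
intercept_formula = np.mean(y_syn) - slope_formula * np.mean(x_syn)

# Intercept and slope from the normal equations.
X_syn = np.column_stack([np.ones(len(x_syn)), x_syn])
intercept_matrix, slope_matrix = np.linalg.solve(X_syn.T @ X_syn, X_syn.T @ y_syn)

print(np.isclose(slope_formula, slope_matrix),
      np.isclose(intercept_formula, intercept_matrix))
```

Both approaches minimize the same mean squared error, so they should land on the same line.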