# [Introduction to Data Science](http://datascience-intro.github.io/1MS041-2022/)    
## 1MS041, 2022 
&copy;2022 Raazesh Sainudiin, Benny Avelin. [Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)

# Regression example of risk minimization

Consider the following house price data:

In [None]:
import csv
data = []
header = []
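with open('data/portland.csv', mode='r') as f:
    reader = csv.reader(f)
    # The first row holds the column names
    header = tuple(next(reader))
    for row in reader:
        try:
            # Keep the first three columns as integers
            data.append((int(row[0]),int(row[1]),int(row[2])))
        except (ValueError, IndexError) as e:
            # Report rows that are too short or not integer-valued, then continue
            print(e)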

In [None]:
print(header)
print(data[:1])

Let us take $X$ to be the size of the house and $Y$ to be the price of the house.

In [None]:
import numpy as np
D = np.array(data)
X = D[:,0]
Y = D[:,2]

In [None]:
import matplotlib.pyplot as plt
plt.scatter(X,Y)

## Linear regression

An example of a learning machine is linear regression with quadratic loss. It has two ingredients:

* The loss is the quadratic loss, $(g(x) - y)^2$.
* The model space is the set of linear functions, $g_{k,m}(x) = kx+m$.

Minimizing the total loss over the data gives

$$
    k^\ast, m^\ast = \text{argmin}_{k,m} \sum_{i=1}^n (k X_i + m - Y_i)^2
$$
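
Because the total loss is a convex quadratic function of $(k, m)$, its minimizer can also be written in closed form. The following is a sketch of that derivation, using the sample means $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ and $\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i$ (notation introduced here for illustration) and assuming that the $X_i$ are not all equal. Setting both partial derivatives to zero,

$$
    2\sum_{i=1}^n (k X_i + m - Y_i) = 0, \qquad 2\sum_{i=1}^n X_i (k X_i + m - Y_i) = 0,
$$

and solving the first equation for $m$ and substituting into the second gives

$$
    k^\ast = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}, \qquad m^\ast = \bar Y - k^\ast \bar X.
$$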

In [None]:
# Total quadratic loss for the parameters a = (k, m) on the data (X, Y)
L = lambda a: np.sum(np.power((a[0]*X+a[1]-Y),2))

In [None]:
import scipy.optimize as so

We can use scipy to minimize the total loss $L$ above. 

In [None]:
# Derivative-free Nelder-Mead search, started from (k, m) = (0, 0)
result = so.minimize(L,(0,0),method = 'Nelder-Mead')
result

The minimizer finds $k \approx 135$ and $m \approx 70000$. We can plot the fitted line together with the data:

In [None]:
# Two points are enough to draw a straight line
x_pred = np.linspace(np.min(X),np.max(X),2)
# result['x'] holds the fitted parameters (k, m)
y_pred = x_pred*result['x'][0]+result['x'][1]
plt.scatter(X,Y)
plt.plot(x_pred,y_pred,color='green')
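
A numerical minimizer such as Nelder-Mead only returns an approximate minimum, so it can be useful to cross-check the fit against the exact least-squares solution. The sketch below does this on synthetic data rather than on `data/portland.csv`: the arrays `x` and `y`, the helper `fit_line`, and the slope and intercept used to generate the data are all hypothetical and chosen only for illustration. It compares the closed-form least-squares line with `np.polyfit` of degree 1, which solves the same problem.

```python
import numpy as np

# Hypothetical synthetic data standing in for house size (x) and price (y).
# The slope 135 and intercept 70000 are illustrative, not fitted values.
rng = np.random.default_rng(0)
x = rng.uniform(800, 4500, size=50)
y = 135.0 * x + 70000.0 + rng.normal(0.0, 10000.0, size=50)

def fit_line(x, y):
    """Return (k, m) minimizing sum((k*x + m - y)**2) in closed form."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sample covariance of x and y divided by sample variance of x
    k = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    m = y_mean - k * x_mean
    return k, m

k_hat, m_hat = fit_line(x, y)

# np.polyfit with degree 1 solves the same least-squares problem
# and returns the coefficients as [slope, intercept].
k_ref, m_ref = np.polyfit(x, y, 1)

print(k_hat, m_hat)
print(np.allclose([k_hat, m_hat], [k_ref, m_ref]))
```

On the real data, the same comparison could be made by passing `X` and `Y` to `fit_line` and comparing the output with `result['x']`.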