In [1]:
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Types of Problems
In a CPU factory, a camera takes a picture of every single manufactured chip. After that, it sends the picture to an algorithm. The algorithm outputs whether the CPU is defective or not.

What type of algorithm is that?

The answer is "two-class classification" (also called binary classification), because the output has only two possible classes: defective or not defective.

# Linear Regression Functions
When building the linear regression model, we came across several new functions. One of these functions was
$$ J = \frac{1}{n}\sum_{i=1}^{n} (y_i - \tilde{y}_i)^2 $$
What is the name of this function?

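The correct answer is "cost function". This particular one is the mean squared error: it averages the squared differences between each observed value $y_i$ and the corresponding prediction $\tilde{y}_i$.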
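To make the formula concrete, here is a minimal sketch in plain Python. The function name `mean_squared_error_cost` and the sample values are illustrative assumptions and are not part of the original exercise.

```python
def mean_squared_error_cost(y_true, y_pred):
    """Return J = (1/n) * sum((y_i - y~_i)^2) for two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# Hypothetical observed values and model predictions (made up for illustration)
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(mean_squared_error_cost(observed, predicted))
```

Squaring the errors keeps positive and negative deviations from cancelling out and penalises large errors more heavily than small ones.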

# Income, Part 1
We have collected data from an ice cream shop. We modelled the income as a function of the outside temperature:

$$ \text{income}[\$] = 20.67449411 \text{T}[^\circ C] - 30.12047857 $$
As we can see, one of the terms is positive, and the other is negative.

Which of the following is true, based on this research only?

# Income, Part 2
In some cases we need to augment (extend) the model to return valid results.

What income (in dollars) will our current model predict when the temperature is 1.2 degrees? Round your answer to 2 decimal places.

In [2]:
temp = 1.2
income = (20.67449411 * temp) - 30.12047857
print("{:.2f}".format(income))

-5.31


The specification states that "income" is non-negative by definition. The model does not account for operational costs or similar expenses. We need to return a value that is valid according to our specification.

What income (in dollars) should an augmented model predict for $T = 1.2^\circ C$?

If the model yields a negative income, the augmented model should return zero instead. For $T = 1.2^\circ C$ the answer is therefore 0.
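One way to express this augmentation is to clamp the linear prediction at zero. The sketch below assumes this simple clamping rule; the function names `predict_income` and `predict_income_augmented` are hypothetical and chosen only for illustration.

```python
def predict_income(temp_celsius):
    """Raw linear model from the text: income in dollars from temperature in deg C."""
    return 20.67449411 * temp_celsius - 30.12047857


def predict_income_augmented(temp_celsius):
    """Augmented model (assumed rule): replace negative predictions with zero."""
    return max(0.0, predict_income(temp_celsius))


print("{:.2f}".format(predict_income(1.2)))            # raw model
print("{:.2f}".format(predict_income_augmented(1.2)))  # augmented model
```

For temperatures where the raw model is already non-negative, both functions return the same value.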

# Local Minima
When performing gradient descent on a linear regression, the choice of starting point is really important. If we choose a starting point which is far away from the global minimum of the error function, we can get stuck in a local minimum.

Decide whether the statement above is true or false.

The statement is false. The mean squared error cost of linear regression is a convex function of the model parameters, so every minimum it has is a global minimum and gradient descent cannot get stuck in a merely local one.
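A small experiment can illustrate this. The sketch below runs gradient descent on a one-variable linear regression from three very different starting points. The data, the learning rate, the number of steps, and the function name `gradient_descent` are all assumptions made for this illustration.

```python
def gradient_descent(xs, ys, a, b, learning_rate=0.05, steps=5000):
    """Minimise the mean squared error of y = a*x + b, starting from (a, b)."""
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of J = (1/n) * sum((y - (a*x + b))^2)
        grad_a = sum(-2 * x * (y - (a * x + b)) for x, y in zip(xs, ys)) / n
        grad_b = sum(-2 * (y - (a * x + b)) for x, y in zip(xs, ys)) / n
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Hypothetical, roughly linear data
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

# Three starting points, two of them far from the solution
for start in [(0.0, 0.0), (50.0, -50.0), (-100.0, 100.0)]:
    a, b = gradient_descent(xs, ys, *start)
    print(start, "->", round(a, 4), round(b, 4))
```

With a sufficiently small learning rate, every run should end at essentially the same slope and intercept, which is the least-squares solution for this data. A learning rate that is too large can still make gradient descent diverge, but that is a separate problem from local minima.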

# Multiple Regression, Part 1
As we already saw, we can do linear regression on many variables.

The Boston housing dataset is really famous and is often used for this purpose. You can download it online or - better - load it using scikit-learn.

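If you wish to download and explore the data, feel free to. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in version 1.2, so the code in this notebook needs an older scikit-learn release. The scikit-learn way is: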
```python
from sklearn.datasets import load_boston
boston_data = load_boston()
```
You can see what the data is about like this:
```python
print(boston_data.DESCR)
```

In [3]:
boston_data = load_boston()
print(boston_data.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

Note: This dataset is cleaned and prepared for modelling. If you want to download the original one and prepare it yourself, you're in for quite a challenge :).

Perform linear regression on all features.

What is the coefficient related to the number of rooms? Round your answer to two decimal places.

In [4]:
model = LinearRegression()
model.fit(boston_data.data, boston_data.target)
print("{:.2f}".format(model.coef_[5]))

3.80


What is the price of a hypothetical house with all variables set to zero? Round your answer to two decimal places.

In [5]:
print("{:.2f}".format(model.intercept_))

36.49


# Multiple Regression, Part 2
It's good to have a model of the data but it means nothing if we have no way of testing it.

A way to test regression algorithms involves the so-called coefficient of determination, $R^2$. Research how to compute it and apply it to the regression model you just created.

What is the coefficient of determination for this model? Round your answer to two decimal places.

<strong>Note</strong>: <span style="color: red">Compute the coefficient of determination using <strong>all the data</strong></span>.
Strictly speaking, evaluating a model on the same data it was trained on gives an overly optimistic score, but it still gives a rough idea of how the model performs. To learn more, look up "training and testing sets".

In [6]:
predicted_data = model.predict(boston_data.data)
coefficient_of_determination = r2_score(boston_data.target, predicted_data)
print("{:.2f}".format(coefficient_of_determination))

0.74
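For reference, the coefficient of determination can also be computed by hand as $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations from the mean of the observed values. The sketch below uses plain Python and made-up numbers; the function name `r_squared` is a hypothetical helper, not a scikit-learn API.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Sum of squared residuals between observations and predictions
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    # Total sum of squares around the mean of the observations
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical observed values and predictions (made up for illustration)
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]
print("{:.3f}".format(r_squared(observed, predicted)))
```

A perfect model gives $R^2 = 1$, while a model that always predicts the mean of the observed values gives $R^2 = 0$.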
