# Exercises for Machine Learning with Python, Lecture 1: *Linear Least Squares Regression*



## Before you begin the exercises:



The datafiles are text files stored here (also try and download them to your own computer, open and see what they contain):

Features: https://www.dropbox.com/s/pf2sfiy9l86xhww/boston_features.txt

Prices: https://www.dropbox.com/s/j7flze0oe86pr6o/boston_prices.txt

*   Start by downloading the dataset:

In [0]:
# Download the features and target (the prices) to Google Colab using !wget
!wget -O boston_features.txt https://www.dropbox.com/s/pf2sfiy9l86xhww/boston_features.txt
!wget -O boston_prices.txt https://www.dropbox.com/s/j7flze0oe86pr6o/boston_prices.txt

This time we will not be using Pandas DataFrames, because we will code the Linear Least Squares Regression ourselves using NumPy.

*   Now, load the txt files into NumPy arrays:

In [0]:
import numpy as np

features = np.loadtxt("boston_features.txt")
prices = np.loadtxt("boston_prices.txt")

As you can see below, `features` is a matrix containing 506 rows (one for each house), each with 13 entries (one for each feature). Additionally, there are 506 prices (again, one for each house).

In [0]:
print(features.shape)
print(prices.shape)

In [0]:
print(features)
print(prices)

Besides the housing prices, the dataset contains information about location, etc. (see below). The prices are given in units of USD 1000.

```
Index   Description
-----------------------------------------------------------------------
   0    per capita crime rate by town
   1    proportion of residential land zoned for lots over 25,000 sq.ft.
   2    proportion of non-retail business acres per town
   3    Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
   4    nitric oxides concentration (parts per 10 million)
   5    average number of rooms per dwelling
   6    proportion of owner-occupied units built prior to 1940
   7    weighted distances to five Boston employment centres
   8    index of accessibility to radial highways
   9    full-value property-tax rate per $10,000
  10    pupil-teacher ratio by town
  11    1000(Bk - 0.63)^2 where Bk is the proportion of African Americans by town
  12    % lower status of the population
```


## Exercise 1.1: Exploratory Data Analysis

Before we begin the regression, let's have a look at the data. As in the lecture, we will plot the correlation between our features using Seaborn.

Since we loaded the data as a NumPy array, we have to create a Pandas DataFrame for Seaborn. This can be done with the code below:

In [0]:
#Pandas DataFrames
import pandas as pd

# Make a Pandas dataframe from the features
df = pd.DataFrame(features[:,:5]) 

# NOTE: ONLY TAKES THE FIRST 5 COLUMNS/FEATURES
# features[:,:5] is NumPy slice notation for "take all rows, and the first five columns"

# Print the type of the dataframe, just to make sure we converted correctly
print(type(df))

Note that since there are 13 features, the full correlation plot would contain 13 × 13 = 169 plots. This would be quite slow in Google Colab. For speed reasons, we only loaded the first 5 features into the Pandas DataFrame for plotting.



### Question 1.1.1: 

*   Just like in the lecture, make an `sns.pairplot()` using Seaborn of the dataframe we just made.
*   (Optional) If you want, change the style of the plot as shown during the lecture. You can see some customization options in the documentation: https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [0]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plot the pairplot of the Pandas DataFrame

### Question 1.1.2:

One thing you might note from the pair-correlation plot is that some of the features look odd. However, these features still work well for linear regression. 

*    Can you explain why the correlations between Feature 3 and everything else look odd?

**Give your answer:** 

## Exercise 1.2: Training and Test sets

In order to train and evaluate our linear regression model, we first have to divide the dataset into two parts: A training set and a test set.


### Question 1.2.1:
A common practice is to have the training set be 70% of the full set and test set 30%.

There are 506 houses in our dataset, so divide your data into 354 in the training set and 152 in the test set.

You can use NumPy's slice notation for this, as introduced in the first lectures. For example, to take the first five *rows* from an array you can use the following notation:
```
data_first_five_rows = data[:5]
```
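The same notation can also take "everything after" a given row, which is what makes it useful for splitting an array into two non-overlapping parts. The following is a minimal sketch on a small made-up array (the names `data`, `n_first`, `first_part` and `second_part` are only for illustration and are not part of the exercise):

```python
import numpy as np

# Toy data (hypothetical, not the Boston data): 10 rows with 2 columns each
data = np.arange(20).reshape(10, 2)

# Number of rows that should go into the first part
n_first = 7

# data[:n_first] takes rows 0..6, data[n_first:] takes rows 7..9,
# so no row can end up in both parts
first_part = data[:n_first]
second_part = data[n_first:]

print(first_part.shape)   # (7, 2)
print(second_part.shape)  # (3, 2)
```

Because both slices use the same index `n_first`, the two parts always add up to the full array.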




To complete this Question, do the following task:
*    Divide both of the numpy arrays `features` and `prices` into two parts.
*    Use, for example, the Numpy slice notation to achieve this.
*    Make sure that the same house is never in both the test and training set at the same time! Other than that, you are free to choose which rows go to the test set and which go to the training set.

In [0]:
# Make the two sets of features below
features_train = ???
features_test = ???

# Make the two sets of prices below
prices_train = ???
prices_test = ???

## Exercise 1.3: Regression Fit

In the linear least squares regression, we solve an equation that looks like this:

$$ \mathbf{y} = \mathbf{X} \mathbf{\alpha}$$

Here, $\mathbf{y}$ is a vector containing the target labels (for example, housing prices), and $\mathbf{X}$ is a matrix in which each row contains the features for one sample (in our case, the features for a given house in Boston). Finally, $\mathbf{\alpha}$ is a vector containing the (unknown) regression coefficients.

Fortunately, NumPy has built-in capabilities for solving linear least squares regression. This is done using the following function, which performs the "least squares" fit:

```
alpha, residual, rank, singular_values = np.linalg.lstsq(X, y)
```

As you can see, the function actually returns four items. The first one contains the regression coefficients, and the next three are the sum of squared residuals of the fit, the rank of the matrix $\mathbf{X}$, and finally the singular values of $\mathbf{X}$.

For this exercise we only need to use the first one, but feel free to print and look at the three other items as well.

You can see the documentation for `np.linalg.lstsq()` here: https://numpy.org/doc/stable/reference/generated/numpy.linalg.lstsq.html
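To get a feeling for what the function returns, here is a small sketch on synthetic data (not the Boston data). The array names, the random seed and the "true" coefficients are made up for illustration. Passing `rcond=None` is optional; it explicitly selects the current default cutoff, which avoids the warning in older NumPy versions.

```python
import numpy as np

# Synthetic data (hypothetical): 50 samples with 3 features each
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# Made-up "true" coefficients used to generate noise-free targets
true_alpha = np.array([2.0, -1.0, 0.5])
y = X @ true_alpha

# Solve the least squares problem y = X alpha
alpha, residual, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(alpha)            # should be very close to true_alpha
print(rank)             # number of linearly independent columns of X
print(singular_values)  # one singular value per column of X
```

Since the targets here contain no noise, the fitted coefficients should recover `true_alpha` up to floating-point precision. With real data the fit will only be approximate.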


### Question 1.3.1:

*    Just like in the code above, use `np.linalg.lstsq()` to fit the alpha regression-coefficients. 

**Hint:** Use the `features_train` instead of `X` and `prices_train` instead of `y`.

**Note:** *Google Colab uses a version of NumPy which might give you a warning about "rcond", which is not a problem.*

In [0]:
# Get the alpha coefficients 




## Exercise 1.4: Making predictions

Now that you have obtained the alpha coefficients in the previous question, it is time to use them to make predictions.

Remember how we are approximating the price as a linear sum of the weighted features for a given house:

\begin{equation}
f(\mathbf{x}) = x_1 \alpha_1 + x_2 \alpha_2 + \dots + x_n \alpha_n
\end{equation}
or written in vector notation:
\begin{equation}
f(\mathbf{x}) = \mathbf{x} \cdot \mathbf{\alpha}
\end{equation}

where $\mathbf{x}$ is the vector containing all the features for the house and $\mathbf{\alpha}$ is the vector of regression coefficients you obtained by fitting the features to the prices of the training set.
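As a small illustration of the vector notation, the sketch below evaluates the dot product for a single row and then for all rows at once with a matrix-vector product. The numbers and variable names are invented for this example and have nothing to do with the housing data.

```python
import numpy as np

# Toy feature matrix (hypothetical): 2 samples with 3 features each
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Toy coefficient vector (hypothetical)
alpha = np.array([0.5, -1.0, 2.0])

# Dot product for the first row only: 1*0.5 + 2*(-1) + 3*2 = 4.5
single = np.dot(X[0], alpha)

# Matrix-vector product: one dot product per row of X
all_rows = X @ alpha

print(single)    # 4.5
print(all_rows)  # [4.5 9. ]
```

The matrix-vector product gives the same numbers as looping over the rows and taking one dot product at a time, but it is shorter and faster.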

### Question 1.4.1:

With the alpha regression coefficients you found in the previous exercise, you can use your linear least squares "machine" to predict the price of houses that are not in your training set.

As in the equation above, this is done by taking the dot product between the feature vector for a given house and the regression coefficients.

Remember that the features for each house are stored in the rows of the "features" NumPy arrays you created in Question 1.2.1.

*    Calculate the price of all the houses in the test set. That is, take the dot product between the regression coefficients and every row-vector of the `features_test` array.


In [0]:
# Calculate the prices of the houses in the test set,
# i.e. the dot product between "features_test" and "alpha"




*Hint:* Save the predicted prices for the test set in a list or numpy array, which you can use in question 1.4.2.

### Question 1.4.2:
Since we already know the true prices of the houses in our dataset, we can now compare the true prices of the houses with the predicted prices.

Use for example `plt.scatter()` as we did in the introductory lectures.

*    Plot the correlation between the true prices stored in the `prices_test` and the predicted prices.

Does it look like there is some correlation?

In [0]:
# Write the code to plot the correlation between "prices_test" and the prices  
# you predicted in Question 1.4.1.





### Question 1.4.3:

One common measure of the prediction error is the "root mean squared error" (RMSE) between the true values and the predicted values. The RMSE is calculated as follows:

\begin{equation}
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left( y_i^\mathrm{true} - y_i^\mathrm{predicted}\right)^2}
\end{equation}
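To make the formula concrete, here is a minimal sketch that evaluates it for three made-up values. The arrays `y_true` and `y_pred` are purely illustrative and are not taken from the housing data.

```python
import numpy as np

# Made-up true and predicted values (hypothetical)
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Differences are 1, 0 and -2, so the squared differences are 1, 0 and 4.
# Their mean is 5/3, and the RMSE is the square root of that mean.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse)
```

Note that the RMSE has the same units as the values themselves, which makes it easy to interpret.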


*    In this question, calculate the RMSE for your test set, i.e. the RMSE between the true prices and the prices you predicted in Question 1.4.1.

In [0]:
# Calculate the RMSE 




## Exercise 1.5: Linear Regression with Scikit-learn

As we saw in the lecture, the Python library Scikit-learn has capabilities for almost any type of machine learning you can think of.

It also does Linear Regression. Below is a code example showing how to use linear regression with sklearn.

```
# Import the machine
from sklearn.linear_model import LinearRegression

# Make a linear regression machine identical to our fit in NumPy (no intercept term)
machine = LinearRegression(fit_intercept=False)

# Fit the machine using the training features and training labels
machine.fit(x_train, y_train)
```

After the machine has been fitted, you can use the built-in `predict()` method to make predictions on new feature maps:

```
# Predict y-values using x_test features
y_test_predicted = machine.predict(x_test)
```
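The two snippets above can be tried out end to end on synthetic data. The sketch below is only an illustration: the data, the random seed and the "true" coefficients are invented, and the comparison with `np.linalg.lstsq()` is added to show that `fit_intercept=False` should give the same coefficients as the NumPy fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical, not the Boston data)
rng = np.random.default_rng(1)
x_train = rng.normal(size=(40, 3))
y_train = x_train @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=40)
x_test = rng.normal(size=(10, 3))

# Fit the machine without an intercept, as in the NumPy approach
machine = LinearRegression(fit_intercept=False)
machine.fit(x_train, y_train)

# Predict y-values for the test features
y_test_predicted = machine.predict(x_test)

# Compare the fitted coefficients with a plain NumPy least squares fit
alpha, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)
print(machine.coef_)
print(alpha)
print(np.allclose(machine.coef_, alpha))
```

After fitting, the coefficients are stored in the `coef_` attribute of the machine, which is what makes the comparison with `alpha` possible.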



### Question 1.5.1:

*    In this question, use Scikit-learn to train a `LinearRegression` machine on the housing dataset. Use the same training and test data which you used in the previous exercises.

**Hint:** You can use the above code snippets to solve this question.

In [0]:
# Implement the linear regression with Scikit-learn










### Question 1.5.2:

*    In this question, make a scatter plot of the housing prices you predicted with Scikit-learn against the true prices.
*    Additionally, calculate the RMSE for prices you predicted with Scikit-learn.

**Hint:** If everything went well, you should have gotten the exact same plot and RMSE as you found in Questions 1.4.2 and 1.4.3.

In [0]:
# Make a scatter plot for the housing prices you predicted with
# Scikit-learn and the true prices for the test set






In [0]:
# Calculate the RMSE for the housing prices you predicted with 
# Scikit-learn and the true prices for the test set





