## Machine Learning - Practical/Workshop 1 - Data Manipulation and Subset Selection

### Preliminaries

**Aim**: The aim of this workshop is to familiarize yourself with some basic concepts and data science tools in R, before exploring methods for model selection. You have already used R in Intro to Stats. Here we will introduce you to other tools in R that are frequently used in data science.

**R software**: Here you will work with R via the Durham University cluster in Computer Science. If you want to use R on your own machine, we recommend RStudio (you probably already installed it during Intro to Stats) or a Jupyter Notebook with an R kernel via Anaconda. You can also use other services such as Google Colab (https://colab.research.google.com/) or GitHub Codespaces (https://github.com/features/codespaces).

### Part I - Working with Data


Datasets in R are often stored in special types of lists called dataframes (similar to tables in Excel and pandas DataFrames in Python). Dataframes are matrix-like blocks of values where each column represents a variable, and each row represents a case. Unlike generic lists, each element of the list (each column) must contain the same number of values.

Say we want to create a table with the top 5 grossing films in the UK since 1989 (not corrected for inflation) based on the data in [Wikipedia](https://en.wikipedia.org/wiki/List_of_highest-grossing_films_in_the_United_Kingdom).

1. The column headers for our initial table will be **Rank** and **Title**.
2. The values in **Rank** are the numbers 1 to 5.
3. The values in **Title** are Star Wars: The Force Awakens, Skyfall, No Time to Die, Spider-Man: No Way Home, and Avatar.

In [None]:
movies = data.frame(
  Rank = 1:5,
  Title = c("Star Wars: The Force Awakens",
            "Skyfall",
            "No Time to Die",
            "Spider-Man: No Way Home",
            "Avatar")
)
View(movies)

Take a look at the [R Data Frames page](https://www.w3schools.com/r/r_data_frames.asp) in W3 Schools for more examples.

In the snippet above, we use the View function instead of print. If using RStudio, View opens the dataset in a new tab that lets you preview your dataset, and filter and sort your data more easily. See [View function in R](https://www.statology.org/view-function-in-r/).
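Outside RStudio, how View behaves depends on the environment, so it can help to know some console-based alternatives. The sketch below rebuilds the small table so that it runs on its own, then inspects it with *str* and *head*. The name `first_rows` is only for illustration.

```r
# Rebuild the small table so this block runs on its own.
movies <- data.frame(
  Rank = 1:5,
  Title = c("Star Wars: The Force Awakens", "Skyfall", "No Time to Die",
            "Spider-Man: No Way Home", "Avatar")
)

str(movies)                     # column names, types and a preview of the values
first_rows <- head(movies, 3)   # first three rows as a new data frame
print(first_rows)

# Only open the spreadsheet-style viewer when someone is at the console.
if (interactive()) View(movies)
```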

Let's add two more columns to the dataframe. First we add **Gross** (in millions of pounds) as a new column after **Title**. Then we add the **Year** of release and reorder the columns so that Year sits between Title and Gross.

In [None]:
movies['Gross']=c(123.3, 102.8, 98.0, 97.2, 96.7)
View(movies)

In [None]:
movies['Year']=c(2015,2012,2021,2021,2009)
movies=movies[,c('Rank','Title','Year','Gross')]
View(movies)

There are many ways of adding new rows to a dataframe. If you only need to add one row to the end, you can do as below:

In [None]:
# list() keeps each value's type; c() would turn every value into a string
movies[6,]=list(6, "Barbie", 2023, 95.6)
View(movies)
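When a new row mixes numbers and text, it matters how the values are bundled. *c* builds an atomic vector, so every value is converted to a string and the numeric columns become character columns; *list* keeps each value's own type. Here is a minimal sketch on a hypothetical toy table (`toy`, `with_c` and `with_list` are illustrative names):

```r
# Hypothetical toy table, used only to illustrate type coercion.
toy <- data.frame(Rank = 1:2, Title = c("A", "B"), stringsAsFactors = FALSE)

with_c <- toy
with_c[3, ] <- c(3, "C")        # c() coerces 3 to "3", so Rank becomes character

with_list <- toy
with_list[3, ] <- list(3, "C")  # list() keeps 3 numeric and "C" character

class(with_c$Rank)     # "character"
class(with_list$Rank)  # "numeric"
```

A character column still prints much like a numeric one, so it is worth checking with *str* or *class* before sorting or doing arithmetic on it.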

If you need to add multiple rows, you should create another dataset with the same columns as the original and **rbind** it to the previous one. Later on we will introduce you to the **tidyverse** which will give you more possibilities to execute tasks like this.

In [None]:
movies_newrows = data.frame(
  Rank = 7:8,
  Title = c("Spectre","Avengers: Endgame"),
  Year = c(2015, 2019),
  Gross = c(95.2, 88.7)
)
movies = rbind(movies, movies_newrows)
View(movies)

Now let's sort this dataset by **Year** in descending order. To do so without introducing new packages, we use the **order** function, not the sort function! The sort function returns the entries in sorted order but it doesn't work well with dataframes; the order function returns the indices of the sorted entries.

By default, the **order** function sorts in ascending order, so you need to set the *decreasing* parameter to TRUE.

In [None]:
View(
  movies[order(movies$Year,decreasing=TRUE),]
)
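The difference between the two functions is easiest to see on a plain vector. The sketch below uses a short, made-up vector `years`:

```r
years <- c(2015, 2012, 2021)

sort(years)                       # the values themselves: 2012 2015 2021
order(years)                      # the positions of those values: 2 1 3
years[order(years)]               # indexing by the positions reproduces sort(years)
order(years, decreasing = TRUE)   # 3 1 2
```

Because *order* returns row positions, its result can be used to index the rows of a whole dataframe, which *sort* cannot do.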

#### **Exercise 1**

Add the next 4 movies to the dataset and sort it by year in **ascending** order:

- Top Gun: Maverick - 2022 - 83.7
- Star Wars: The Last Jedi - 2017 - 82.7
- Titanic - 1998 - 82.7
- Avatar: The Way of Water - 2022 - 76.9

### Dataframes and Matrices

Let's load the **Boston** dataset from the **MASS** library. Check your Intro to Stats workshops as a refresher on how to load libraries and datasets!

The Boston dataset contains information on multiple attributes for suburbs in Boston, Massachusetts; see [Boston in R Documentation](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html).



In [None]:
library(MASS)
data(Boston)
head(Boston, n=5) #The head function shows the first n rows in a dataframe

Most of the well-known machine learning models are implemented in R (and Python!), and are relatively easy to use. However, there will be cases where you'll need to prepare and transform your data using techniques seen elsewhere on the program (in Intro to Maths, for example, where you learned how to manipulate matrices and vectors, invert matrices, and solve systems of linear equations). There might also be cases where you want to implement your own version of a machine learning model and modify it; therefore, it is useful to know how to manipulate datasets and matrices in R (and understand how to move between those two data types!).

In Intro to Stats Week 6, you studied linear regression models (check your notes!), and you were trying to fit the model:

$Y=X\beta+\epsilon$

and estimate the vector $\beta$.  
To estimate $\beta$ using least squares, we want to compute:

$\hat{\beta}=(X^TX)^{-1}X^TY$

where $X^T$ is the transpose of $X$ and $A^{-1}$ represents the inverse of a matrix $A$.
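As a reminder of where this formula comes from, here is a short sketch of the standard derivation, assuming $X^TX$ is invertible. Least squares chooses $\beta$ to minimise the residual sum of squares:

$\mathrm{RSS}(\beta)=(Y-X\beta)^T(Y-X\beta)$

Setting the derivative with respect to $\beta$ to zero gives

$-2X^T(Y-X\beta)=0$

which rearranges to the normal equations

$X^TX\hat{\beta}=X^TY$

and multiplying both sides on the left by $(X^TX)^{-1}$ recovers $\hat{\beta}=(X^TX)^{-1}X^TY$.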

#### **Briefly review Intro to Stats workshops 6 and 7 before continuing**

You can also try your solutions here in this notebook to get familiar with the environment.

Now, let's start with a simple example using linear regression and the Boston dataset.



#### **Exercise 2** - Fitting a linear regression model with **lm**

Recall that you used the **lm** function to fit linear models in Intro to Stats.  Use the **lm** function here to fit a linear regression model with *medv* as the response variable and the following variables as the predictors: *rm*, *lstat*, *indus*, and *ptratio*.

Save the estimated values of the regression coefficients to the vector **b_lm**.

Now let's try to calculate $\hat\beta$ using matrix operations via
$\hat{\beta}=(X^TX)^{-1}X^TY$.

To do that, we have to define $X$ and $Y$. $Y$ is the vector containing the values of *medv*. The $X$ matrix is the *design matrix*, composed of a column of ones for the model intercept ($\beta_0$) and the columns containing the values of the variables *rm*, *lstat*, *indus*, and *ptratio*.

Let's create Y and X:

In [None]:
Y = Boston$medv
Xpart = Boston[,c('rm','lstat','indus','ptratio')]
X = cbind(1,Xpart) #this looks like cheating but it works for dataframes,
#you can also use rep to first create a vector of ones!
head(X)

Now let's convert $Y$ and $X$ to matrices using the *as.matrix* function.

In [None]:
Y = as.matrix(Y)
X = as.matrix(X)

To transpose the matrix $X$, we use the **t** function. To invert a square matrix such as $X^TX$, we use the **solve** function. Or if you really want to practice your coding skills, you can implement your own inversion function later on!

To multiply matrices and vectors in R, we use %*%.
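Before applying these operations to $X$, it may help to try them on a small matrix. The sketch below uses a hypothetical $2\times 2$ matrix `A` and vector `b` (not part of the workshop data). It contrasts `*` with `%*%` and shows two ways of solving a linear system.

```r
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
b <- c(1, 2)

A * A                  # element-wise product: each entry is squared
A %*% A                # matrix product

A_inv <- solve(A)      # inverse of A
A_inv %*% A            # identity matrix, up to rounding error

x1 <- A_inv %*% b      # solves A x = b via the explicit inverse
x2 <- solve(A, b)      # solves A x = b directly, without forming the inverse
```

Both routes give the same solution here. The two-argument form of *solve* is generally preferred for numerical work because it avoids forming the inverse explicitly.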

In [None]:
trX = t(X)
trXX = trX%*%X
trXX

In [None]:
inv_trXX = solve(trXX)
inv_trXX

#### **Exercise 3**

Continue the calculations above and return a vector of length 5, assigning it the name **b_calc**. Compare your vector to **b_lm**.

#### **Exercise 4**

Write down and execute the steps you would take to calculate the coefficients for

*lm(medv~poly(lstat,2, raw=TRUE), data=Boston)*

using matrices as above.

What does the *raw=TRUE* option do in the *poly* function?
Increase the power of the polynomial and check what happens to your matrices.

Note: Remember, you can treat powers of *lstat* as separate feature variables

### Part II - Best Subset Selection

**Preparation**: Take a look at the contents for weeks 7 and 8 in Introduction to Statistics. We will build on some of the concepts you have already seen such as variable selection and model validation.

#### **Introduction**: 

Above, we used the *Boston* dataset to fit a linear model using **lm**. We had *medv* as our response variable, and the following variables as our predictors: *rm*, *lstat*, *indus*, and *ptratio*.

In [None]:
library(MASS)
data(Boston)
summary(lm(medv~rm+lstat+indus+ptratio, data=Boston))

We can see that the $R^2$ for this model is 0.6786. This means that 67.86% of the variation in the quantitative measure of property median value (medv) can be explained by its linear regression on the 4 chosen predictor variables. We can also see that the adjusted $R^2$ is 0.6761.

Say we now include the *nox* predictor in our model:

In [None]:
summary(lm(medv~rm+lstat+indus+ptratio+nox, data=Boston))

The $R^2$ has moved from 67.86% to 67.99%. So, if we assume that a model with a higher $R^2$ is better, we would choose this model over the previous one. In this case, the adjusted $R^2$ is also higher at 0.6767.

**Question:** Which measure should we use to compare the first and the second model? $R^2$ or the adjusted $R^2$? Discuss this with a colleague or a tutor before proceeding.


Similarly, if we add the variable *zn* to the first model, we will see a decrease in the adjusted $R^2$ in comparison to the first model. We can also see a decrease in the $R^2$ in comparison to the second model. Therefore, we would (possibly) conclude that we are better off with the second model.

In [None]:
summary(lm(medv~rm+lstat+indus+ptratio+zn, data=Boston))

If we are to consider all possible combinations of predictors, we would also need to consider the two extremes: the model with no predictors (only an intercept) and the model with all available predictors. Let's see what the calculated $R^2$ and adjusted $R^2$ are for these models:

In [None]:
m0=lm(medv~1, data=Boston)
summary(m0)

In [None]:
mfull=lm(medv~ . , data=Boston)
summary(mfull)

**Question**: What happened to the $R^2$ and adjusted $R^2$ for the model $M_0$ above? Again discuss this with your colleagues or a tutor.

#### **Best Subset Selection procedure**

To perform best subset selection, we fit a separate regression model for each possible combination of the $p$ predictors. This is often broken up into stages, as follows:

1. Let $M_0$ denote the model which contains no predictors.

2. For $k=1,2,\ldots,p$:

   - Fit all $p \choose k$ models that contain exactly $k$ predictors.
   - Pick the best among these models and call it $M_k$. Here, best is defined as having the smallest RSS or largest $R^2$.

3. Select a single best model from $M_0,M_1,\ldots,M_p$ using a suitable measure such as the cross-validated prediction error, $C_p$, BIC, adjusted $R^2$, etc.

It is important to note that use of RSS or $R^2$ in step 2 of the above algorithm is acceptable because the models all have an equal number of predictors. We can't use RSS in step 3 because RSS decreases monotonically as the number of predictors included in the model increases.
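To get a feel for the cost of this procedure, we can count how many models it fits. The sketch below assumes $p=13$, the number of predictors in Boston once *medv* is set aside; the variable names are illustrative.

```r
p <- 13                             # predictors in Boston, excluding medv
models_per_size <- choose(p, 0:p)   # number of models with k = 0, 1, ..., p predictors
names(models_per_size) <- 0:p
models_per_size

total_models <- sum(models_per_size)
total_models                        # 8192, which equals 2^p
```

The total doubles with every extra predictor, which is why exhaustive search quickly becomes impractical as $p$ grows.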

**Exercise**: Let's write a loop to try to find the best subset model for *medv*.

*Note*: There are many different ways to complete this exercise using different libraries, data subsetting/manipulation, etc. The way we are going to attempt this now is not optimal but it is simple and easy to implement.

**Step 1:** Say we want to fit the model *medv ~ nox*. We write:

In [None]:
m_example=lm(medv ~ nox, data=Boston)
summary(m_example)

**Step 2:** And we can get the $R^2$ and adjusted $R^2$ for this model by calling:

In [None]:
r2_example = summary(m_example)$r.squared
adjr2_example = summary(m_example)$adj.r.squared
print(paste0("R2 = ",r2_example))
print(paste0("Adj R2 = ",adjr2_example))
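The adjusted $R^2$ can also be reproduced by hand from $R^2$, the sample size $n$ and the number of predictors $p$, using the standard formula $1-(1-R^2)\frac{n-1}{n-p-1}$. This formula is not stated above, so treat the sketch below as a cross-check rather than part of the workshop's method. The variable names are illustrative.

```r
library(MASS)  # provides the Boston data frame

fit <- lm(medv ~ nox, data = Boston)
n <- nrow(Boston)            # number of observations
p <- length(coef(fit)) - 1   # number of predictors, excluding the intercept

r2 <- summary(fit)$r.squared
adjr2_by_hand <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

all.equal(adjr2_by_hand, summary(fit)$adj.r.squared)  # TRUE
```

The factor $(n-1)/(n-p-1)$ grows with $p$, so adding a predictor raises the adjusted $R^2$ only if it improves $R^2$ by enough to offset that penalty.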

**Step 3:** When we called the model with all variables in the previous section, we used *lm(medv ~ ., data=Boston)*. The dot indicates we want to use all variables in the dataset Boston.

Instead of writing *lm(medv ~ nox, data=Boston)* in the previous model, we could write *lm(medv ~ ., data=Boston[,c(5,14)])*.

The 5th column in Boston corresponds to the variable *nox* and the last column (14th) corresponds to *medv*. So we have subsetted the dataset keeping only the columns of interest for this model.

In [None]:
summary(lm(medv ~ ., data=Boston[,c(5,14)]))
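Column positions are compact but easy to get wrong, so it is worth checking which names they refer to. The sketch below confirms that subsetting by position and by name leads to the same fitted model. `fit_by_index` and `fit_by_name` are illustrative names.

```r
library(MASS)  # provides the Boston data frame

names(Boston)[c(5, 14)]  # "nox" "medv"

fit_by_index <- lm(medv ~ ., data = Boston[, c(5, 14)])
fit_by_name  <- lm(medv ~ ., data = Boston[, c("nox", "medv")])

all.equal(coef(fit_by_index), coef(fit_by_name))  # TRUE
```

Positions are convenient inside a loop, while names are easier to read when reporting which model was chosen.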

**Step 4:** Putting everything together, we can write a simple loop to compare all models containing one variable, and saving their $R^2$ to a vector.  We also print some statements that inform us of the most relevant information from our calculations in an easy-to-read way.

In [None]:
r2_m1 = rep(NA,13)
for (var in 1:13){
  m_temp = lm(medv~ . , data = Boston[,c(var,14)])
  r2_m1[var]=summary(m_temp)$r.squared
}
print(r2_m1)
print(paste0("The maximum calculated R2 is: ", round( max(r2_m1), 6 )))  # look at the help file for round.
print(paste0("The index of the corresponding model is: ",which.max(r2_m1)))
print(paste0("The relevant predictor is: ", names(Boston)[which.max(r2_m1)]))

**Step 5:** Say we now want to look at all models with 2 variables. We would have to find all pairwise combinations of predictors. To do this, we can use the function *combn*:

In [None]:
all_pairs = combn(13,2)
all_pairs
dim(all_pairs)[2]
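The output of *combn* is easier to read on a smaller input. The sketch below uses a hypothetical short list of four predictor names (`vars`). *combn* accepts a vector as well as a single number, and each column of the result is one combination.

```r
vars <- c("rm", "lstat", "nox", "age")  # a hypothetical short list of predictors
var_pairs <- combn(vars, 2)             # each column holds one pair
var_pairs

ncol(var_pairs)    # 6, i.e. choose(4, 2)
var_pairs[, 1]     # "rm" "lstat"
```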

**Step 6:** And we can use the same principle to subset the Boston dataset as before to generate the model using the variables in column 1 of *all_pairs*:

In [None]:
summary(lm(medv ~ ., data = Boston[,c(all_pairs[,1],14)]))

**Step 7:** Now write a loop similar to the one in **Step 4** and find the model with the highest $R^2$ with two predictors.

**Step 8:** Repeat steps 5 to 7 and create a loop that returns the model with the highest value of $R^2$ for $1, 2, \ldots, 13$ variables. Find a strategy to save the index for the relevant variables in each model, the corresponding $R^2$, and the adjusted $R^2$ for each.

**Step 9:** Compare the adjusted $R^2$ for the models in **Step 8** and identify the best subset model.

### **Painless best subset selection**

Now that you have seen one way to implement subset selection, you should be ready to use functions that do this job for you without the need to implement your own loops.

We use the library **leaps** to return the best subset model.

In [None]:
library(leaps)
best_models = regsubsets(medv ~ ., data=Boston)
summary(best_models)

By default, the *regsubsets* function returns the "best subset" models with up to 8 variables. To return more (or fewer) models, you can change the argument *nvmax* (by default set to 8) within the *regsubsets* call. Therefore, to check the best model with 1 up to 13 variables, we run:

In [None]:
best_models = regsubsets(medv ~ ., data=Boston,  nvmax = 13)
summary(best_models)

To check the values that we can obtain from this summary object, you can use the *names* function:

In [None]:
res_summary = summary(best_models)
names(res_summary)

From which we can see that we can obtain the adjusted $R^2$ values for the models using the *adjr2* component of our *summary* object: 

In [None]:
res_summary$adjr2

and to identify the model with the highest adjusted $R^2$, we use the function **which.max**:

In [None]:
which.max(res_summary$adjr2)

The coefficients for model 11 are then given by:

In [None]:
print(coef(best_models,11))
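Adjusted $R^2$ is only one of the measures listed in step 3 of the procedure. The same summary object also has *cp* and *bic* components, for which smaller values are better. The sketch below compares the model size each measure would pick; it assumes **leaps** is installed, and `fit_all`, `crit` and `chosen_size` are illustrative names.

```r
library(MASS)   # provides the Boston data frame
library(leaps)  # assumed to be installed, as in the cells above

fit_all <- regsubsets(medv ~ ., data = Boston, nvmax = 13)
crit <- summary(fit_all)

chosen_size <- c(
  adjr2 = which.max(crit$adjr2),  # larger is better
  cp    = which.min(crit$cp),     # smaller is better
  bic   = which.min(crit$bic)     # smaller is better
)
chosen_size
```

The measures need not agree in general. BIC penalises extra predictors more heavily than adjusted $R^2$, so it tends to favour smaller models.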

We can construct the linear model corresponding to these variables as follows:

In [None]:
lm(medv~crim+zn+chas+nox+rm+dis+rad+tax+ptratio+black+lstat, data=Boston)

Alternatively, we can generate this linear model by reformulating the coefficient names in the best model into an expression that can be read by *lm*.  To do this, notice that we can call the names of the coefficients in the best model using *names*:

In [None]:
names(coef(best_models,11))

We can then create a string expression of the model formula we require using *paste*, where we remove the first element as we do not wish to include the word _intercept_ within our model expression formula, and then link the words using _+_:

In [None]:
paste(names(coef(best_models,11))[-1],collapse="+")

We can then use this expression to *reformulate* the required linear model expression with response *medv*, with the argument *data* being used within *lm* as usual:

In [None]:
best_ss <- lm(reformulate(paste(names(coef(best_models,11))[-1],collapse="+"),'medv'),data=Boston)
summary( best_ss )
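*reformulate* also accepts a character vector of term names directly, so the *paste* step is optional. Here is a small sketch with a hypothetical choice of predictors, showing that both routes give the same formula:

```r
predictors <- c("rm", "lstat", "ptratio")  # hypothetical choice of predictors

f_vector <- reformulate(predictors, response = "medv")
f_string <- reformulate(paste(predictors, collapse = "+"), response = "medv")

f_vector   # medv ~ rm + lstat + ptratio
f_string   # medv ~ rm + lstat + ptratio
```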

### Forwards and Backwards Subset Selection

Forwards and backwards subset selection are similar techniques which do not make use of every possible combination of variables.

**Forward selection**

In the case of forwards subset selection, we start with the null model, and determine which of the $p$ single-predictor models has the highest $R^2$ value. We designate this model $M_1$.

Next, rather than fitting each of the $p \choose 2$ two-predictor models, we add each of the remaining $p-1$ predictor variables to $M_1$ and determine which of these models has the highest $R^2$ value. We then designate this model $M_2$.  We then generate $p-2$ models with three variables by adding each of the remaining $p-2$ variables to the variables included in $M_2$.  The model with highest $R^2$ value is denoted $M_3$.

We continue in this fashion until we have our set of $p$ models $M_0,M_1,\ldots,M_p$, from which we choose the "best" model using $C_p$, BIC, adjusted $R^2$, etc.
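The forward steps described above can be sketched as a short loop in base R. This is an illustrative implementation under the description given here, not how **leaps** works internally, and the names `predictors`, `selected` and `remaining` are assumptions.

```r
library(MASS)  # provides the Boston data frame

predictors <- setdiff(names(Boston), "medv")
selected <- character(0)   # predictors in the order they are added

for (k in seq_along(predictors)) {
  remaining <- setdiff(predictors, selected)
  # R^2 of the current model plus each remaining candidate in turn
  r2 <- sapply(remaining, function(v) {
    f <- reformulate(c(selected, v), response = "medv")
    summary(lm(f, data = Boston))$r.squared
  })
  # Keep the candidate giving the highest R^2; the result is M_k
  selected <- c(selected, names(which.max(r2)))
}

selected   # M_k uses the first k entries of this vector
```

This fits $13+12+\ldots+1=91$ models rather than the $2^{13}=8192$ needed for best subset selection. The price is that a variable, once added, is never removed.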

**Backward selection**

In the case of backwards subset selection, we start from the *full* model $M_p$ and *remove* a single variable at a time. Removing each of the $p$ variables in turn from $M_p$ gives $p$ models with $p-1$ predictors.

The model from this set of $p$ models which has the highest $R^2$ value is designated $M_{p-1}$. We then remove each of the remaining $p-1$ variables in turn from $M_{p-1}$ to find $M_{p-2}$, and continue in this fashion until we reach the null model, $M_0$. We again choose the best model from $M_0,M_1,\ldots,M_p$ based on $C_p$, BIC, adjusted $R^2$, etc.


#### Implementation using Leaps

We can use the **leaps** package and *regsubsets* to conduct forward and backward selection; we just need to specify the method. To perform forward selection and select the best model using the adjusted $R^2$, we include our method choice in the function call as follows.

In [None]:
library(leaps)
best_forward = regsubsets(medv ~ ., data=Boston, method="forward", nvmax = 13)
summary(best_forward)

In [None]:
forward_summary = summary(best_forward)
forward_adjr2=which.max(forward_summary$adjr2)
print(coef(best_forward,forward_adjr2))

In [None]:
summary(lm(reformulate(paste(names(coef(best_forward,forward_adjr2))[-1],collapse="+"),'medv'),data=Boston))

In this case, the forward selection process leads to the same model as the one obtained using best subset selection.

**Exercise:** Repeat the process above using *backward* as the method option. How different is this model from the best subset and forward models?