For this computer lab, we'll be using the IRIS dataset. Initially, we'll only look at a subset of it, and perform linear regression on two features of a given class.

# 1. Loading the data

### 1.1  Import the necessary modules

We'll use the following three modules.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

The last line is needed in order to show matplotlib plots in notebooks.

### 1.2  Read the dataset from a .csv file

Load the [IRIS dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) using Pandas. The method `read_csv()` returns a `DataFrame` object containing the data found in the provided .csv file.

In [None]:
dataset = pd.read_csv("iris.csv")

In [None]:
type(dataset)
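
To see what `read_csv()` returns without needing the file on disk, here is a minimal sketch that parses a small in-memory CSV. The variable names `csv_text` and `demo` and the three rows are assumptions for illustration, not the contents of `iris.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for iris.csv (illustrative values only)
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# read_csv accepts a file path or any file-like object
demo = pd.read_csv(io.StringIO(csv_text))
print(type(demo))
print(demo.columns.tolist())
```

The first line of the text is used as the header row, which is how the column names end up in the `DataFrame`.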

### 1.3  Analyze the dataset

This dataset consists of morphological data from three different species of Iris flowers: Setosa, Virginica and Versicolor.

<table style="width:100%">
  <tr>
    <th> <center>Iris Setosa</center> </th>
    <th> <center>Iris Virginica</center> </th> 
    <th> <center>Iris Versicolor</center> </th>
  </tr>
  <tr>
    <td><img src="https://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg" alt="Iris Setosa"></td>
    <td><img src="https://upload.wikimedia.org/wikipedia/commons/9/9f/Iris_virginica.jpg" alt="Iris Virginica"></td>
    <td><img src="https://upload.wikimedia.org/wikipedia/commons/2/27/Blue_Flag%2C_Ottawa.jpg" alt="Iris Versicolor"></td>
  </tr>
</table>

The length and width of both the petals and the sepals of each flower, together with its corresponding species, were measured and stored in this dataset. Sepals and petals are both parts of a flower. Sepals form the outermost whorl of the flower, and the petals form the whorl just inside it.
![](http://terpconnect.umd.edu/~petersd/666/html/iris_with_labels.jpg)

Let's take a look at what's inside the dataset now. The attribute `shape` of `DataFrame` objects returns the dimensions of the data inside it.

In [None]:
dataset.shape

So this dataset has 150 rows and 5 columns. It's easy to infer that this means 150 flowers were collected, and 5 different features were registered for each one. But we can also take a closer look at them, using the method `head()`, which returns the first 5 rows by default (you can also pass a parameter to it, which specifies a different number of rows to be shown).

In [None]:
dataset.head()

Here we can see the header names for each column, together with the first rows, confirming that the species and morphological measurements for each flower were collected. We can extract individual columns of this `DataFrame` by indexing using their names, for instance:

In [None]:
dataset["sepal_length"]
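
As a small sketch of `shape`, `head()` with an explicit row count, and column indexing, the example below uses a toy `DataFrame`. The name `demo` and its values are assumptions made up for illustration.

```python
import pandas as pd

# Toy DataFrame (illustrative values, not the real dataset)
demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6],
    "species": ["setosa"] * 7,
})

print(demo.shape)            # (number of rows, number of columns)
first_three = demo.head(3)   # pass n to change the number of rows returned
print(first_three)

# Indexing with a column name returns a pandas Series
column = demo["sepal_length"]
print(type(column))
```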

Additionally, we can check which species are present in the dataset using the `unique` method,

In [None]:
print(dataset["species"].unique())

where we see that only these three species are present in this dataset, as expected.
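
If you also want to know how many rows belong to each species, `value_counts()` is a natural companion to `unique()`. This is a sketch on a made-up `Series` named `species`, not on the lab's data.

```python
import pandas as pd

# Made-up labels for illustration
species = pd.Series(
    ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"],
    name="species",
)

print(species.unique())          # distinct values, in order of first appearance
counts = species.value_counts()  # number of rows per distinct value
print(counts)
```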

We can also learn more about the data types of each column with the method `info`.

In [None]:
dataset.info()

Here we see that the first four columns' elements are floating point numbers, and the last column's elements are objects (in this case, strings).

### 1.4  Extract the desired data

For this initial task, we are only interested in the setosa species. This corresponds to all the rows which have the column 'species' equal to the string 'setosa'. In order to extract these rows, we use [logical indexing in Pandas](https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing).

In [None]:
# This returns a boolean series, which we can then use to index our DataFrame object
extract_rule = (dataset['species']=='setosa')

In [None]:
extract_rule

In [None]:
# We use the boolean series to index the DataFrame object
setosa_dataset = dataset[extract_rule]
setosa_dataset
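
The same boolean-indexing pattern can be sketched on a four-row toy `DataFrame`, which makes it easy to see which rows survive. The names `demo`, `mask` and `setosa_demo` are assumptions for illustration.

```python
import pandas as pd

# Toy data (illustrative values)
demo = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 4.9, 6.3],
    "species": ["setosa", "versicolor", "setosa", "virginica"],
})

# Element-wise comparison gives one True/False value per row
mask = demo["species"] == "setosa"
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True
setosa_demo = demo[mask]
print(setosa_demo)
```

Note that the selected rows keep their original index labels (0 and 2 here) rather than being renumbered.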

Furthermore, we want to investigate the relationship between two features of this species, the 'sepal_length' and 'sepal_width'. To extract these, we [index the `DataFrame` using the names of the columns](https://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label) we want.

In [None]:
x = setosa_dataset['sepal_length'].values
y = setosa_dataset['sepal_width'].values

Note that the attribute `values` of a `Series` object (which is what indexing a single `DataFrame` column gives us) returns a NumPy array.

In [None]:
type(x)
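
A quick sketch of the conversion on a toy column, alongside the `to_numpy()` method, which the pandas documentation suggests as an alternative to `values`. The names `col`, `arr_values` and `arr_numpy` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy column (illustrative values)
col = pd.Series([5.1, 4.9, 4.7], name="sepal_length")

arr_values = col.values     # attribute used in this lab
arr_numpy = col.to_numpy()  # method form of the same conversion

print(type(arr_values), arr_values.shape, arr_values.dtype)
print(np.array_equal(arr_values, arr_numpy))
```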

Now we can use matplotlib to plot all the examples in a 2D plane, where each dimension is one of the features described earlier.

In [None]:
fig, ax = plt.subplots()
ax.scatter(x,y)
ax.set_xlabel('sepal length')
ax.set_ylabel('sepal width');

It seems like the relation between these features could be approximated using a linear function, such as 
$f(x) = w\cdot x + b$. Let's try finding the parameters $w$ and $b$ that would make the best approximation.

### 1.5  Guess the values of w and b

We'll start with some educated guesses. To make this more convenient, we'll first define a function to plot a scatter plot of the provided data, together with a straight line with parameters specified by the user.

In [None]:
# Define a function to plot the data and a parameterized line
def plot_data_and_line(w, b, x, y, ax, line_color='r', line_label=''):
    
    # Create points lying on the line
    xline = np.unique(x)
    yline = w*xline + b

    # Plot both the line and the points from the dataset
    ax.scatter(x,y, color='C0')
    ax.plot(xline, yline, color=line_color, label=line_label)
    ax.set_xlabel('sepal length')
    ax.set_ylabel('sepal width') 

In [None]:
fig, ax = plt.subplots()
plot_data_and_line(1, -1, x, y, ax)

Another way of evaluating the quality of our approximation is to compute the MSE ([mean squared error](https://www.freecodecamp.org/news/machine-learning-mean-squared-error-regression-line-c7dde9a26b93/)) between the true y values in the dataset and our predictions. So that we can also use this value to guide our guesses, create a function to compute it (it may help to first write down its analytical expression).
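
For reference, one common way to write the mean squared error is shown here. The symbols $N$ (number of examples), $y_i$ (true value) and $\hat{y}_i$ (prediction) are notation introduced here, with the predictions coming from the line $f(x) = w\cdot x + b$ defined above.

```latex
$$
\mathrm{MSE}(y, \hat{y}) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2,
\qquad \hat{y}_i = w\cdot x_i + b
$$
```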

In [None]:
# Create a function to compute the MSE
# YOUR CODE HERE
raise NotImplementedError()

Now we can try different values of $w$ and $b$ and see what the resulting linear approximation looks like, compared to the scatter plot of our data. Using both the plot and the MSE, try searching for values of $w$ and $b$ that yield a good approximation.

In [None]:
# Guess the values for w and b
# YOUR CODE HERE
raise NotImplementedError()

# Plot your guess
plt.close('all')
fig, ax = plt.subplots()
plot_data_and_line(w, b, x, y, ax);

# Compute MSE of the guess
y_guess = w*x+b
print("MSE of your guess:", MSE(y,y_guess))

---

# 2. Training a model for linear regression

Now, instead of trying to find the parameters that give the best approximation by trial and error, we'll use PyTorch to build and optimize a linear regressor neural network. 

### 2.1 Create dataset and data loaders from the data

In [None]:
import torch
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader

The current shape of `x` (and `y`) is (50,).

In [None]:
x.shape

We need to add a second dimension to `x` before building the `TensorDataset`, because the model's `nn.Linear` layer expects input of size `(N,D)`, where `N` is the number of samples, and `D` is the dimension of each sample (1 in this case).

The dimension of `y` can be `(50,)` because we won't pass it through the model, only to the loss function (which is ok with inputs with this dimension).

In [None]:
x = x[:, None]
x.shape
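
The indexing trick `x[:, None]` can be sketched on a small array, together with two equivalent spellings. The array `x_demo` and the names `a`, `b_new` and `c` are assumptions for illustration.

```python
import numpy as np

# Toy 1-D array (illustrative values)
x_demo = np.array([5.1, 4.9, 4.7])

a = x_demo[:, None]           # None inserts a new axis of length 1
b_new = x_demo[:, np.newaxis] # np.newaxis is an alias for None
c = x_demo.reshape(-1, 1)     # -1 lets NumPy infer the number of rows

print(x_demo.shape, a.shape)
```

All three produce a column of shape `(N, 1)`, so which one to use is mostly a matter of readability.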

Now we create the dataset and the data loader. Note that in the data loader we specify the batch size to match the size of the dataset.

In [None]:
dataset = TensorDataset(torch.tensor(x, dtype=torch.float32), torch.tensor(y, dtype=torch.float32))
data_loader = DataLoader(dataset, batch_size=len(x))
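
To see what the batch size changes, here is a sketch that builds two loaders over the same ten-sample toy dataset. The names `demo_ds`, `full_loader` and `mini_loader` and the batch size of 4 are assumptions for illustration.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy dataset: 10 samples, inputs of shape (10, 1), targets of shape (10,)
inputs = torch.arange(10, dtype=torch.float32).unsqueeze(1)
targets = torch.arange(10, dtype=torch.float32)
demo_ds = TensorDataset(inputs, targets)

# One batch containing the whole dataset, as in this lab
full_loader = DataLoader(demo_ds, batch_size=len(demo_ds))
# Smaller batches; the last one holds the leftover samples
mini_loader = DataLoader(demo_ds, batch_size=4)

print(len(full_loader), len(mini_loader))
shapes = [tuple(bx.shape) for bx, _ in mini_loader]
print(shapes)
```

With the full-size batch, iterating over the loader once yields a single batch, so one epoch corresponds to one parameter update.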

### 2.2 Create the model for linear regression

In [None]:
from torch import nn

In [None]:
class LinearRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(1, 1)

    def forward(self, xb):
        return self.lin(xb)

In [None]:
model = LinearRegressor()
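
As a sanity check that `nn.Linear(1, 1)` implements the same $f(x) = w\cdot x + b$ we guessed by hand, this sketch compares the layer's output with the explicit formula. The names `layer`, `xb`, `out` and `manual` are assumptions for illustration, and the weights are whatever random values PyTorch initializes.

```python
import torch
from torch import nn

layer = nn.Linear(1, 1)                   # one input feature, one output
xb = torch.tensor([[1.0], [2.0], [3.0]])  # batch of shape (N, D) = (3, 1)

out = layer(xb)
# The same computation written out: x times the weight, plus the bias
manual = xb @ layer.weight.T + layer.bias

print(out.shape)
print(torch.allclose(out, manual))
```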

### 2.3 Define loss function and optimizer

We use the mean squared error loss:

In [None]:
loss_fn = nn.MSELoss()

And stochastic gradient descent (since our batch size is the size of the dataset, we're actually doing full-batch gradient descent):

In [None]:
from torch import optim
optimizer = optim.SGD(model.parameters(), lr=0.01)

### 2.4 Perform the optimization

Now we can perform the optimization by simply following these steps in a loop:
1. Sample a batch of data from our dataset
2. Compute the model's prediction on the batch
3. Compute the loss of the prediction w.r.t. ground-truth
4. Backpropagate the loss through the model's parameters
5. Perform one step of gradient descent.
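
The five steps above can be sketched without PyTorch, with the gradients of the MSE written out by hand. This is an illustration under stated assumptions, not the lab's implementation: the data is synthetic, the starting point is arbitrary, and the names `x_demo`, `y_demo`, `w_demo`, `b_demo` and `history` are made up here.

```python
import numpy as np

# Synthetic data (assumption): noisy points around y = 0.8*x - 0.6
rng = np.random.default_rng(0)
x_demo = rng.uniform(4.0, 6.0, size=50)
y_demo = 0.8 * x_demo - 0.6 + rng.normal(0.0, 0.1, size=50)

w_demo, b_demo = 0.0, 0.0  # arbitrary starting point
lr = 0.01                  # learning rate
history = []

for step in range(20):
    # 1. The "batch" is the whole dataset (full-batch gradient descent)
    # 2. Prediction of the linear model
    pred = w_demo * x_demo + b_demo
    # 3. Mean squared error between prediction and ground-truth
    err = pred - y_demo
    loss = np.mean(err ** 2)
    history.append(loss)
    # 4. Gradients of the MSE with respect to w and b
    grad_w = 2.0 * np.mean(err * x_demo)
    grad_b = 2.0 * np.mean(err)
    # 5. One step of gradient descent
    w_demo -= lr * grad_w
    b_demo -= lr * grad_b

print(history[0], history[-1])
```

In PyTorch, step 4 is handled by `loss.backward()` and step 5 by `optimizer.step()`, so the gradients never have to be derived manually.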

We will do this for 20 epochs (again, since our batch size is the same as the size of the dataset, this means we'll take 20 steps of gradient descent).

In [None]:
for epoch in range(20):
    losses_in_epoch = []
    for batch in data_loader:

        # 1. These are the sampled batches of inputs and ground-truth
        batch_x, batch_y = batch
        
        # 2. Compute the model's prediction on the batch
        pred = model(batch_x)
        
        # 3. Compute the loss of the prediction w.r.t. ground-truth
        loss = loss_fn(pred.squeeze(), batch_y)
        
        # Save the loss value for averaging later (not strictly necessary for batch_size = len(x));
        # .item() stores a plain float instead of a tensor that keeps the computation graph alive
        losses_in_epoch.append(loss.item())
        
        # 4. Backpropagation
        loss.backward()
        
        # 5. One step of gradient descent
        optimizer.step()
        
        # Zero the gradients computed in the backpropagation, for starting new optimization step
        optimizer.zero_grad()
    
    print('Epoch: {}\tLoss: {}'.format(epoch, sum(losses_in_epoch)/len(losses_in_epoch)))

Note the final MSE obtained (~0.06). Compare it to the one obtained using the guessed parameters. 

### 2.5  Extract optimal parameters

Extract the optimal parameters found by the optimization by using the `parameters` method of the created model (this returns a [generator](https://www.programiz.com/python-programming/generator), so we transform it into a list first).

In [None]:
parameters = list(model.parameters())

Each element in this list is a `Parameter` object:

In [None]:
parameters[0]

We can access the underlying tensor by using the `data` attribute of the `Parameter` class:

In [None]:
parameters[0].data

And the float inside the tensor with the `item` method (only works with one-element tensors).

In [None]:
parameters[0].data.item()

Putting this together we can get the parameters of our model as follows:

In [None]:
optimal_w, optimal_b = [p.data.item() for p in parameters]

Which results in:

In [None]:
print("w: %.3f" % optimal_w)
print("b: %.3f" % optimal_b)
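
A variant that does not depend on the order in which `parameters()` yields its elements is to look them up by name with `named_parameters()`. This is a sketch on a freshly created, untrained copy of the model (called `demo_model` here, an assumption for illustration), so the printed values are just the random initialization.

```python
import torch
from torch import nn

# Same architecture as the model defined in this lab
class LinearRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(1, 1)

    def forward(self, xb):
        return self.lin(xb)

demo_model = LinearRegressor()

# Map each parameter name to its value as a plain Python float
named = {name: p.detach().item() for name, p in demo_model.named_parameters()}
print(named)

w_value, b_value = named["lin.weight"], named["lin.bias"]
```

The names are built from the attribute name of the layer (`lin`) and the parameter name inside it (`weight`, `bias`).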

Compare these optimized parameters with the ones you guessed before.
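
As an extra reference point for this comparison, the least-squares line can also be computed in closed form. The sketch below uses NumPy's `polyfit`, which is a different technique from the gradient descent used in this lab, and runs on synthetic noise-free data (an assumption for illustration), so the true parameters are known.

```python
import numpy as np

# Synthetic, noise-free data (assumption): y = 2*x + 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2.0 * x_demo + 1.0

# Degree-1 polynomial fit; coefficients are returned highest degree first
w_ls, b_ls = np.polyfit(x_demo, y_demo, 1)
print(w_ls, b_ls)
```

With the lab's arrays, `x` would need to be flattened first (for example with `x.ravel()`), since it was given a second dimension in section 2.1 and `polyfit` expects a 1-D input.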

### 2.6  Compare optimal and guessed values

Finally, it's also beneficial to compare the guessed parameters with the optimized ones graphically, by showing both of the predicted lines in the same plot.

In [None]:
plt.close('all')
fig, ax = plt.subplots()
plot_data_and_line(w, b, x, y, ax, 'r', 'guess')
plot_data_and_line(optimal_w, optimal_b, x, y, ax, 'b', 'optimal')
ax.legend();