### 2.2 Question 2

Download the Iris dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/iris/. The dataset can be downloaded from iris.data. Load the data into a pandas dataframe.
For this lab, we’re going to be performing a binary classification problem,
but this dataset has 3 classes: setosa, virginica, and versicolor. So we want
to take this multi-class problem and transform it into a binary classification.
Create a new column for the dataset called target. The value of target
will be 1 if the row contains a setosa flower, else the value is $0$. There should
be $\frac{1}{3}$ rows with the value of $1$, the rest should be $0$.

In [1]:
import pandas as pd
import numpy as np

# Defining the fielpath
filepath = "data\iris.data" 

def load_data(filepath):
    
    # This part reads the data and avoids to use the first row as the column name
    df = pd.read_csv(filepath, header = None)
    
    # After that the column names for the Dataset are assigned
    col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']
    df.columns = col_names  
    
    # Then the column "tareget" is created and filled with ceros
    df['target']  = 0
    
    # Then the values in "target" column are modified depending if the Specie is or not a Setosa
    df.loc[df['Species'] == 'Iris-setosa', 'target'] = 1
    
    return df

# The previous function is called
iris  = load_data(filepath)
iris

Unnamed: 0,Sepal_Length,Sepal_Width,Petal_Length,Petal_Width,Species,target
0,5.1,3.5,1.4,0.2,Iris-setosa,1
1,4.9,3.0,1.4,0.2,Iris-setosa,1
2,4.7,3.2,1.3,0.2,Iris-setosa,1
3,4.6,3.1,1.5,0.2,Iris-setosa,1
4,5.0,3.6,1.4,0.2,Iris-setosa,1
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica,0
146,6.3,2.5,5.0,1.9,Iris-virginica,0
147,6.5,3.0,5.2,2.0,Iris-virginica,0
148,6.2,3.4,5.4,2.3,Iris-virginica,0


### 2.3 Question 3
For this question we want to take this dataset of 150 rows, and split it into
a train, test, and validation dataset, using the following proportions for each
split:

• Training: 70%

• Validation: 10%

• Testing: 20%

Sample data for each subset using stratified sampling. I.e. the training data should have roughly $\frac{1}{3}$
positive samples, the testing and validation dataset should also have roughly $\frac{1}{3}$ positive samples.


In [2]:
iris['Species'].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64

In [3]:
def slipt_data(df, column):
    
    values = list(df[column].unique())
    
    training_set = pd.DataFrame()
    testing_set = pd.DataFrame()
    validation_set = pd.DataFrame()
    
    for value in values:
        
        data = df[df[column] == value]
        train=data.sample(n = 35,random_state=42)
        test=data.drop(train.index)
        validation = test.sample(n=5,random_state=42)
        test=test.drop(validation.index)
        
        training_set = pd.concat([training_set, train])
        testing_set = pd.concat([testing_set, test])
        validation_set = pd.concat([validation_set, validation])

    return training_set, testing_set, validation_set

In [4]:
training_set, testing_set, validation_set = slipt_data(iris, 'Species')

### 2.4 Question 4
Using the linear regression model you created in the previous lecture, transform it into a logistic regressor by applying the logistic function to the output
of the model. The loss function for this model should be binary cross entropy.

Select two columns from the Iris dataset (i.e. petal length and petal
width), and using these two columns, train a logistic regressor using gradient
descent, measuring the gradient using finite differences approximation. This
means that instead of having a single slope variable, we have multiple:

$$\hat{y} = \sigma\left (\beta_{0} + \sum_{i = 1}^{m}x_{i}\beta_{i}\right )$$


where $\hat{y}$ is the model’s probability prediction, $\sigma$ is the logistic/sigmoid
function, $\beta_{0}$ is the intercept, $\beta_{i}$
is the coefficient that modulates the $X_{i}$ variable.

I’ve made a start for you, please fill in the ’#TODOs’:

In [5]:
x = validation_set[["Petal_Length", "Petal_Width"]]
y = validation_set['target']

In [6]:
def lm(x, weights):
    
    h =  weights[0] + np.dot(weights[1:], x.T)
    
    return h

def sigmoid(h):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-h))

def bce(y, z):
    
    negative_side = (1-y)* np.log(z)
    positive_side = (y)* np.log(z)
    N = len(y)
    loss = (-1/N)* np.sum(positive_side + negative_side)
    
    return loss

In [44]:
def bce(y, yhat):
    
    # TODO: apply the binary cross entropy function returning the loss
    negative_side = (1-y)* np.log(yhat)
    positive_side = (y)* np.log(yhat)
    N = len(y)
    loss = (-1/N)* np.sum(positive_side + negative_side)
    return loss


class LogisticRegressor:
    
    def __init__(self, n_features: int = 2):

        self.params = np.random.randn(n_features + 1)

    def logistic(self, x):
        # TODO: apply the logistic function

        h = self.params[0] + np.dot(self.params[1:], x.T)
        z = 1 / (1 + np.exp(-h))
        return z

    def __call__(self, x, logits=False):
                                    
        y = self.params[0] + self.params[1:] @ x.T

        if not logits:
            y = self.logistic(y)
        return y

    def fit(self, train_x, train_y, epochs: int = 100, lr: float = 0.01):
        # TODO: train the model using gradient descent and finite-differences
        delta_w = np.zeros(self.params.shape)
        n = len(train_y)
        
        for epoch in range(epochs):
            for xi, yi in zip(train_x.values, train_y.values):
                # calculate loss and update model parameters using gradient descent
                zi = self.logistic(xi)
                error_term_i = zi - yi
                delta_w[0]  +=  error_term_i
                delta_w[1:] += np.dot(error_term_i, xi)
        
            self.params -= lr * delta_w / n
            print(self.params)


In [52]:
def bce(y, yhat):
    
    # TODO: apply the binary cross entropy function returning the loss
    negative_side = (1-y)* np.log(yhat)
    positive_side = (y)* np.log(yhat)
    N = len(y)
    loss = (-1/N)* np.sum(positive_side + negative_side)
    return loss


class LogisticRegressor:
    
    def __init__(self, n_features: int = 2):

        self.params = np.random.randn(n_features + 1)

    def logistic(self, x):
        # TODO: apply the logistic function

        h = self.params[0] + np.dot(self.params[1:], x.T)
        z = 1 / (1 + np.exp(-h))
        return z

    def __call__(self, x, logits=False):
                                    
        y = self.params[0] + self.params[1:] @ x.T

        if not logits:
            y = self.logistic(y)
        return y

    def fit(self, train_x, train_y, epochs: int = 100, lr: float = 0.01):
        # TODO: train the model using gradient descent and finite-differences
        
        delta_w = np.zeros(self.params.shape)
        n = len(train_y)
        
        for epoch in range(epochs):
            for xi, yi in zip(train_x.values, train_y.values):
                # calculate loss and update model parameters using gradient descent
                zi = self.logistic(xi)
                error_term_i = zi - yi
                delta_w[0]  +=  error_term_i
                delta_w[1:] += np.dot(error_term_i, xi)
                
            self.params -= lr * delta_w / n
            z = self.logistic(train_x.values)
            loss = bce(train_y.values, z)
            
            for xi, yi in zip(valid_x, valid_y):
                
            # calculate validation loss (BUT DON'T UPDATE MODEL PARAMETERS!)
            print(loss)

In [53]:
b = LogisticRegressor()
b.params

array([ 0.54256004, -0.46341769, -0.46572975])

In [54]:
b.params

array([ 0.54256004, -0.46341769, -0.46572975])

In [55]:
b.fit(x,y, epochs = 50)

2.1184775352810608
2.113948025623843
2.107245210250629
2.0984777127892897
2.087788485046392
2.0753531568389856
2.061377930588103
2.0460970015119626
2.0297694894412173
2.012675877869525
1.99511396997127
1.9773943896610366
1.9598356775249488
1.9427590552723586
1.926482956402919
1.911317442894737
1.8975586456524918
1.88548337820672
1.8753440772778707
1.867364219656587
1.8617343526384629
1.858608856021741
1.8581035290162773
1.8602940671213946
1.8652154638013299
1.8728623410005127
1.883190182232619
1.8961174129608949
1.911528246055936
1.929276186216759
1.9491880675223718
1.971068483995475
1.9947044653266852
2.019870249437377
2.0463320104292686
2.073852414028607
2.1021948915657145
2.1311275460514714
2.1604266280398807
2.1898795428017155
2.2192873722880377
2.248466914289799
2.2772522564465305
2.3054959141246094
2.3330695688427143
2.359864448294876
2.3857913906814003
2.410780635626104
2.434781382034646
2.4577611503644645


In [43]:
np.random.seed(42)
weights = np.random.randn(3)
epochs = 100

print(weights)
for epoch in range(epochs):
    
    z = sigmoid(lm(x, weights))
    N = len(y)
    print("--------------")
    print(bce(y, z))
    weights[0] -= 0.5 * np.sum(z-y) / N
    weights[1:] -= 0.5 * np.dot ((z-y), x)/ N
    print(weights)

print(weights)

[ 0.49671415 -0.1382643   0.64768854]


NameError: name 'sigmoid' is not defined

### 2.5 Question 5
As gradient descent is iterating, store (using class variables), the training and validation loss.

Visualise the training and validation loss. Is there a point at which the model begins to over fit? How do you know that the model is beginning to overfit by looking at these curves?

### 2.6 Question 6
Predict the class labels for the testing set.
For the testing set, calculate the:

• TP – number of true positives

• TN – number of true negatives

• FP – number of false positives

• FN – number of false negatives

In [None]:
def confusion_matrix(y, yhat):
    
    """
    This function returns the elements of a two-dimensional confusion matrix 
    calculated by counting the true values resulting of using logical operators such that:

        TP = Logical AND
        TN = Logical NOR
        FP = Logical A'B
        FN = Logical AB'

    """
    
    # Logical AND Gate
    TP = np.sum(np.logical_and(y, yhat))
    
    # Logical NOR Gate
    TN = np.sum(np.logical_and(np.logical_not(y),np.logical_not(yhat)))
    
    # Logical A'B Gate
    FP = np.sum(np.logical_and(np.logical_not(y),yhat))
    
    # Logical AB' Gate
    FN = np.sum(np.logical_and(y,np.logical_not(yhat)))
    
    return TP, TN, FP, FN

In [None]:
y =    np.array([1,1,0,1,0,0])
yhat = np.array([0,0,1,1,0,0])



TP, TN, FP, FN = confusion_matrix(y, yhat)

print("TP "+str(TP) + ",TN "+ str(TN) + ",FP "+ str(FP) + ",FN "+ str(FN) )

### 2.7 Question 7
Calculate the precision and recall and F1 score.


$$Precission = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F_{\beta} = (1 + \beta^{2})\frac{Precission * Recall}{(\beta^{2}* Precision) + Recall)}$$

Calling the above function and making use of the previous equations we have that the metrics of the confusion matrix are given by:

In [None]:

def precision(y, yhat):
    # calculate the precision and return it
    TP, TN, FP, FN = confusion_matrix(y, yhat)  
    pr = TP / (TP + FP)
    
    return pr

def recall(y, yhat):
    # calculate the recall and return it
    TP, TN, FP, FN = confusion_matrix(y, yhat)
    rc = TP / (TP + FN)
    
    return rc

def f_beta(y, yhat, beta=1):
    
    pr = precision(y, yhat)
    rc = recall(y, yhat)
    
    # calculate the f_beta score and return it
    fb = (1 + beta**2) * ((pr * rc) / ((beta**2 * pr) + rc))
    
    return fb


### 2.8 Question 8
Generate a report using the precision, recall and F1 and confusion matrix.
The report should be printed like:

In [None]:
t1 = "|" + " " * 8 + "|" + " " * 10 + "| Predicted |" + " " *10 + "|\n"
t2 = "|" + " " * 8 + "|" + " " * 10 + "|  Positive |" + " Negative |\n"
t3 = "| Actual |" + " Positive |" + " " * (10 - len(str(TP))) + str(TP) + " |" + " " * (9 - len(str(FN))) + str(FN) + " |\n"
t4 = "|" + " " * 8 + "|" + " Negative |" + " " * (10 - len(str(FP))) + str(FP) + " |" + " " * (9 - len(str(TN))) + str(TN) + " |\n"

print(t1+t2+t3+t4)

In [None]:
x.transpose()

### 2.9 Question 9
Calculate the true-positive and false positive rate, and from these values
generate a ROC curve.


In [None]:
def roc(y, yhat, threshold_step=0.01):
# iteratively increase the threshold by threshold_step,
# calculating the TP and FP rate for each iteration. This function
# should return two lists, a list of TP rates, and a list of FP
# rates.
    return tp, fp

tp, fp = roc(y, yhat)

### 2.10 Question 10
Now that you’ve created a logistic classifier for two features of the Iris dataset
and have created some analytic results. Select another two columns (i.e.
petal width and sepal length, or petal length and sepal width). Create
a different logistic classifier using these new columns and create the same
results as you did with questions 8 and 9.
Compare these two models trained with different columns. Which model
is best, and why do we know that it’s the best?
