# Math 156 Homework 3
## Write-up:
**Problem 3:**
In this first problem we write a function that trains a binary logistic regression model using mini-batch SGD. The function takes the x training set and y training set as inputs, along with three hyperparameters: batch size, learning rate (fixed), and max number of iterations. If not specified, the batch size defaults to a random integer between 1 and 10, since the batch size in mini-batch SGD should be much smaller than the number of samples (and I would hope we have way more than 10 samples). If not specified, the learning rate is set to 0.001 and the max number of iterations is set to 10. We then use this function in the next problem.

In problem 3 we also defined a quick sigmoid function which comes in handy throughout this assignment.

**Problem 4**:
In this problem we use the mini-batch SGD function from problem 3 to train a model on the UCI breast cancer dataset. First we download the data, separate x and y, and split the data into train, validation, and test sets. Then we normalize the x-data using statistics computed from the x-training data only. Next, we report how balanced the classes are. Then, we use the program from problem 3 and experiment with different hyperparameters to see which give the best results. You can plug in different values yourself and see what happens. I found that a smaller number of iterations and a smaller learning rate tended to give the best results. Finally, we report on the model and summarize our findings.



---------------------------------------------------------------

# Problem 3

**Instructions:** Implement a program to train a binary logistic regression model using mini-batch SGD. Use the logistic
regression model we derived in class, corresponding to Equation (4.90) from the textbook, and where
the feature transformation φ is the identity function.
The program should include the following hyperparameters:
- Batch size
- Fixed learning rate
- Maximum number of iterations




In [None]:
# begin by importing necessary packages
import pandas as pd
import random
import numpy as np

According to equation 4.90 of the textbook, the gradient of the cross-entropy loss is:

$\nabla E(w) = \sum_{n=1}^{N} \left(\sigma(w^{T}x_{n}) - t_{n}\right) x_{n}$
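
As a sanity check on the formula above, the following minimal sketch compares the analytic gradient with a central finite-difference approximation of the cross-entropy loss. The helper names `cross_entropy`, `analytic_grad`, and `numeric_grad`, as well as the small synthetic dataset, are assumptions introduced for illustration and are not part of the assignment.

```python
import numpy as np

def sigmoid(a):
    # logistic function, clipped to avoid overflow in exp
    return 1 / (1 + np.exp(-np.clip(a, -100, 100)))

def cross_entropy(w, X, t):
    # E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ], with y_n = sigmoid(w^T x_n)
    y = sigmoid(X @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def analytic_grad(w, X, t):
    # sum_n (sigmoid(w^T x_n) - t_n) x_n, written as a matrix product
    return X.T @ (sigmoid(X @ w) - t)

def numeric_grad(w, X, t, eps=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        g[j] = (cross_entropy(w + step, X, t) - cross_entropy(w - step, X, t)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))       # hypothetical data: 20 samples, 3 features
t = rng.integers(0, 2, size=20)    # hypothetical binary targets
w = rng.normal(size=3)

max_diff = np.max(np.abs(analytic_grad(w, X, t) - numeric_grad(w, X, t)))
print("largest difference between the two gradients:", max_diff)
```

If the formula is right, the two gradients should agree up to finite-difference error. Note that the gradient is a vector with one entry per feature, the same shape as $w$.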


First let's write a quick function for sigmoid.

In [111]:
def sigmoid(x):
  """
  Function that computes the sigmoid of a real number.
  input: real number x
  output: sigmoid(x)
  """
  return 1 / (1 + np.exp(-np.clip(x, -100, 100)))

Now let's write the program that will implement mini batch SGD.

In [112]:
def mini_batch_sgd(train_x, train_y, K = 10, n = 0.001, b_size = None):
  """
  Function that trains a binary logistic regression model using mini-batch SGD.
  params:
    train_x = x training dataset,
    train_y = y training dataset,
    K = max number of iterations,
    n = fixed learning rate,
    b_size = batch size (if None, a random integer between 1 and 10 is drawn on each call)
  returns:
    the weight vector w after K mini-batch updates
  """

  N = train_x.shape[0]
  n_feats = train_x.shape[1]
  # a default of random.randint(...) in the signature would be evaluated only once,
  # when the function is defined, so draw the batch size here instead
  if b_size is None:
    b_size = random.randint(1, 10)
  # w has one entry per feature; initialize from a standard gaussian distribution
  w = np.array([random.gauss(0, 1) for _ in range(n_feats)])

  k = 0
  while k < K:
    # sample a mini-batch of distinct indices
    b_indices = random.sample(range(N), b_size)
    # accumulate the gradient sum over the mini-batch
    mini_sum = np.zeros(n_feats)
    for i in b_indices:
      x_n = train_x[i]
      y_n = train_y[i]
      # each term is (sigmoid(w^T x_n) - t_n) * x_n, a vector with one entry per feature
      mini_sum += (sigmoid(w @ x_n) - y_n) * x_n

    # now update w
    w = w - n*mini_sum

    k = k+1
  return w
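
The gradient sum over a mini-batch can also be computed with one matrix product instead of a Python loop. The sketch below is one possible vectorized variant, not the function used in the rest of this notebook. The name `sgd_vectorized`, the `seed` argument, and the toy data are assumptions added for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-np.clip(a, -100, 100)))

def sgd_vectorized(X, t, K=10, lr=0.001, b_size=8, seed=0):
    """Mini-batch SGD for logistic regression with a fixed learning rate."""
    rng = np.random.default_rng(seed)
    N, n_feats = X.shape
    w = rng.normal(size=n_feats)                          # standard Gaussian initialization
    for _ in range(K):
        idx = rng.choice(N, size=b_size, replace=False)   # mini-batch without replacement
        Xb, tb = X[idx], t[idx]
        grad = Xb.T @ (sigmoid(Xb @ w) - tb)              # sum over the batch of (sigma(w^T x_n) - t_n) x_n
        w = w - lr * grad
    return w

# hypothetical, linearly separable toy data
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 2))
t_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

w_toy = sgd_vectorized(X_toy, t_toy, K=2000, lr=0.05, b_size=16)
acc_toy = np.mean((sigmoid(X_toy @ w_toy) >= 0.5) == t_toy)
print("weights:", w_toy, "training accuracy:", acc_toy)
```

On this toy problem the learned weights should point roughly along the direction (1, 1), since the labels depend on the sign of the sum of the two features.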

# Problem 4

**A)** Download the Wisconsin Breast Cancer dataset from the UCI Machine Learning Repository or scikit-learn’s built-in datasets.

Link to dataset: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html

In [113]:
# we are importing from scikit-learn's datasets
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

# convert to pandas dataframe
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# drop any missing values
df.dropna(inplace=True)

df.head()
# looks good

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


**B)** Split the dataset into train, validation, and test sets.

In [114]:
# import necessary packages
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# now do test-train-val split
X = df.drop("target", axis=1)
y = df["target"]
X = X.to_numpy()
y = y.to_numpy()
# 20% test, 10% validation (0.125 of the remaining 80%), 70% training, just like in the last homework.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.125, random_state=1)

# normalize the data based only on statistics from train dataset
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_val = scaler.transform(X_val)
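
To make the "training statistics only" step concrete, here is a minimal NumPy sketch of what standardization does: the mean and standard deviation are computed on the training rows and then reused, unchanged, on the held-out rows. The arrays `X_all`, `train_part`, and `test_part` are hypothetical stand-ins, not the breast cancer data.

```python
import numpy as np

rng = np.random.default_rng(0)
X_all = rng.normal(loc=5.0, scale=3.0, size=(100, 4))   # hypothetical raw features
train_part, test_part = X_all[:80], X_all[80:]           # hypothetical 80/20 split

# statistics come from the training rows only
mu = train_part.mean(axis=0)
sd = train_part.std(axis=0)

train_scaled = (train_part - mu) / sd
test_scaled = (test_part - mu) / sd    # reuse the training statistics, no refitting

print("train mean:", train_scaled.mean(axis=0).round(6))
print("train std: ", train_scaled.std(axis=0).round(6))
print("test mean: ", test_scaled.mean(axis=0).round(3))
```

The scaled training columns have mean 0 and standard deviation 1 by construction, while the held-out columns are only approximately standardized. That is expected, since no information from the held-out rows is used to compute the statistics.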

**C)** Report the size of each class in your training (+ validation) set.

External source for the following code:
https://stackoverflow.com/questions/28663856/how-do-i-count-the-occurrence-of-a-certain-item-in-an-ndarray

In [115]:
# determine the number of unique items in each class for y_train, y_val, and y_test
unique, counts = np.unique(y_train, return_counts=True)
y_train_dict = dict(zip(unique, counts))

unique, counts = np.unique(y_val, return_counts=True)
y_val_dict = dict(zip(unique, counts))

unique, counts = np.unique(y_test, return_counts=True)
y_test_dict = dict(zip(unique, counts))

# report
print("Training: \nNumber of 0s:", y_train_dict[0], "\nNumber of 1s:", y_train_dict[1] )
print("\nValidation: \nNumber of 0s:", y_val_dict[0], "\nNumber of 1s:", y_val_dict[1] )
print("\nTesting: \nNumber of 0s:", y_test_dict[0], "\nNumber of 1s:", y_test_dict[1] )

Training: 
Number of 0s: 151 
Number of 1s: 247

Validation: 
Number of 0s: 19 
Number of 1s: 38

Testing: 
Number of 0s: 42 
Number of 1s: 72
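
To judge how balanced the splits are, the counts can be turned into proportions. The short sketch below does that using the counts printed above; the dictionary names `counts` and `fractions` are just illustrative.

```python
# class counts copied from the output above: (number of 0s, number of 1s)
counts = {"training": (151, 247), "validation": (19, 38), "testing": (42, 72)}

fractions = {}
for name, (n0, n1) in counts.items():
    fractions[name] = n1 / (n0 + n1)     # share of class 1 in this split
    print(f"{name}: {n0 + n1} samples, {fractions[name]:.1%} in class 1")
```

Class 1 makes up roughly 62% to 67% of every split, so the classes are moderately imbalanced, and in a similar way across training, validation, and testing.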


**D)** Train a binary logistic regression model using your implementation from problem 3. Initialize
the model weights randomly, sampling from a standard Gaussian distribution. Experiment with
different choices of fixed learning rate and batch size.

**Note:** You can input any hyperparameters that you want here! I experimented with several different values for the number of iterations, learning rate, and batch size.

In [125]:
# the model weights have already been randomly initialized following the gaussian distribution in the function.

# max_iter = 20, learning rate = 0.001, batch size = 3
w_calc = mini_batch_sgd(X_train, y_train, 20, 0.001, 3)

# final w
print(w_calc)

[-1.27672406  1.5941939   0.22880185 -1.23857881 -0.55822384  1.17844736
 -2.7276666  -0.61180367 -0.48017023 -0.25713534  2.23427245 -0.22106592
 -0.71312025 -1.73851096  0.50458885 -1.33272628  0.01084392  0.20238108
  0.63894467 -1.19257141 -0.98178075 -0.14108592 -0.0492283   0.13253185
  0.34299944  1.62218555 -1.01270415 -0.44116613  0.35376204 -0.15318822]
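
One way to organize this kind of experimentation is a small grid search scored on a validation set. The sketch below is only an illustration under stated assumptions: it uses synthetic stand-in data and its own vectorized training helper rather than `mini_batch_sgd`, and the names `train`, `accuracy`, and the grid values are hypothetical choices.

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-np.clip(a, -100, 100)))

def train(X, t, K, lr, b_size, rng):
    # mini-batch SGD with Gaussian initialization (vectorized sketch)
    w = rng.normal(size=X.shape[1])
    for _ in range(K):
        idx = rng.choice(X.shape[0], size=b_size, replace=False)
        w = w - lr * (X[idx].T @ (sigmoid(X[idx] @ w) - t[idx]))
    return w

def accuracy(w, X, t):
    return np.mean((sigmoid(X @ w) >= 0.5) == t)

# hypothetical stand-in data; in the notebook this role is played by X_train / X_val
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 5))
t_syn = (X_syn @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) > 0).astype(int)
X_tr, t_tr, X_va, t_va = X_syn[:240], t_syn[:240], X_syn[240:], t_syn[240:]

results = []
for lr, b_size in itertools.product([0.001, 0.01, 0.1], [1, 8, 32]):
    # same seed for every configuration, so all runs start from the same weights
    w = train(X_tr, t_tr, K=500, lr=lr, b_size=b_size, rng=np.random.default_rng(1))
    results.append((accuracy(w, X_va, t_va), lr, b_size))

best_acc, best_lr, best_b = max(results)
print("best validation accuracy:", best_acc, "with lr =", best_lr, "and batch size =", best_b)
```

Scoring candidates on the validation set keeps the test set untouched until the final evaluation. Fixing the seed across configurations also separates the effect of the hyperparameters from the luck of the random initialization.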


**E)** Use the trained model to report the performance of the model on the test set. For evaluation
metrics, use accuracy, precision, recall, and F1-score.

In [126]:
# predict values
y_pred_test = sigmoid(X_test @ w_calc)

# convert the probabilities to binary values
y_pred_test = (y_pred_test >= 0.5).astype(int)

print(y_pred_test)

[0 0 1 0 1 0 0 0 1 0 1 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 0 0 0 0 1 0 0 1 1 0
 1 1 1 1 1 1 0 0 0 1 1 1 0 0 0 1 1 1 1 1 0 1 1 1 0 0 1 1 0 1 0 1 1 1 1 1 0
 1 0 1 1 1 0 1 0 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
 0 0 1]


In [127]:
# evaluation time
# import necessary packages
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# evaluate each of the metrics
acc = accuracy_score(y_test, y_pred_test)
precis = precision_score(y_test, y_pred_test)
recall = recall_score(y_test, y_pred_test)
# store the result under a new name so the imported f1_score function is not overwritten
f1 = f1_score(y_test, y_pred_test)

print("Accuracy:", acc, "\n \nPrecision:", precis, "\n \nRecall:", recall, "\n \nF1 Score:", f1)

Accuracy: 0.8157894736842105 
 
Precision: 0.8591549295774648 
 
Recall: 0.8472222222222222 
 
F1 Score: 0.8531468531468531
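
For reference, all four metrics follow from the confusion counts. The sketch below uses counts that are inferred from the metrics reported above (they are not printed by the notebook itself) and recomputes each metric from its definition, with class 1 as the positive class.

```python
# confusion counts inferred from the reported metrics, with class 1 as the positive class
# (an assumption for illustration; the notebook does not print a confusion matrix)
tp, fp, fn, tn = 61, 10, 11, 32

accuracy = (tp + tn) / (tp + fp + fn + tn)   # fraction of all test samples classified correctly
precision = tp / (tp + fp)                   # of the predicted 1s, how many are truly 1
recall = tp / (tp + fn)                      # of the true 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```

These counts add up to the 114 test samples, with 72 true 1s and 71 predicted 1s, which matches the class sizes and the prediction vector shown earlier.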


**F)** Summarize your findings.


In summary, I experimented with many different values for the number of iterations, batch size, and step size. I found that a small step size (n = 0.001) worked well, since it keeps gradient descent from overshooting, and, surprisingly, a smaller number of iterations gave better results as well (20 iterations did better than 500).

Overall findings:
- **Accuracy:** 82%
- **Precision:** 86%
- **Recall:** 85%
- **F1 score:** 85%

Hence this model can predict breast cancer on unseen data with moderate accuracy, based on the given features.