# <center>Algorithms assignment 1 </center>#

## Steps

### Step 1: Data Preparation

This step reads the data from the CSV file and scales the features.
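
For reference, the scaling performed by `feature_scaling` in the Code section is mean normalisation. Each feature column $x_j$ is centred on its mean and divided by its range, so the scaled values lie within $[-1, 1]$:

$$x_j' = \frac{x_j - \operatorname{mean}(x_j)}{\max(x_j) - \min(x_j)}$$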

### Step 2: Linear classifier

* Batch gradient descent on the logistic cost gives the optimal thetas shown below. In the implemented function, I set the regularisation hyperparameter to $\lambda = 0$:

 $$\theta_{0} = 1.29, \theta_{1} = 13.47, \theta_{2} = 7.14 $$

* Using the encapsulated functions to calculate the cost function and the accuracy gives:

  $$J(\theta) = 0.21, Accuracy = 0.91$$

* The indexes of the incorrectly predicted samples: $[0, 11, 12, 22, 38, 58, 79, 85, 98]$
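
As a summary of what `logistic_batch_gradient_descent` and `logistic_cost_function` in the Code section compute, the update applied at every iteration and the cost can be written as below. Here $X$ includes a leading column of ones, $\eta$ is the learning rate, $m$ is the number of samples, and $\tilde{\theta}$ denotes $\theta$ with its first entry $\theta_0$ replaced by zero, so the bias term is not regularised:

$$\theta \leftarrow \theta - \frac{\eta}{m}\left(X^{T}\left(\sigma(X\theta) - y\right) + \lambda\tilde{\theta}\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i\log\sigma(x_i^{T}\theta) + (1 - y_i)\log\left(1 - \sigma(x_i^{T}\theta)\right)\right]$$

With $\lambda = 0$, as used here, the regularisation term vanishes and the update is plain batch gradient descent.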


### Step 3: Quadratic classifier

Since the classifier is quadratic in the two features, it has 6 thetas in total. The function passed to the sigmoid is:

$$f(x_{1}, x_{2}) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} +  \theta_{3}x_{1}x_{2} + \theta_{4}x_{1}^2 + \theta_{5}x_{2}^2 $$

1. Add the extra feature columns ($x_{1}x_{2}$, $x_{1}^2$, $x_{2}^2$) to fit this model, then train on the data. The result is:

$$\vec{\theta} = [1.76, 13.28, 7.53, 1.47, -3.14, -3.45]$$

2. After training, the cost function $J(\theta)$ and the accuracy are:
$$J(\theta) = 0.18, Accuracy = 0.94$$

3. The indexes of the incorrectly predicted samples: $[11, 12, 58, 79, 85, 98]$
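
The feature expansion in item 1 can also be written without a Python loop. The following is a minimal vectorised sketch; the function name `quadratic_features` is made up for illustration and is not the name used in the Code section. The intercept $\theta_{0}$ is not included here because the training function adds the column of ones itself.

```python
import numpy as np

def quadratic_features(X):
    """Map an (n, 2) array [x1, x2] to an (n, 5) array [x1, x2, x1*x2, x1^2, x2^2]."""
    X = np.asarray(X, dtype=float)
    x1, x2 = X[:, 0], X[:, 1]
    # Column order matches theta_1 .. theta_5 in the formula above.
    return np.column_stack([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

example = quadratic_features([[2.0, 3.0]])
print(example)  # [[2. 3. 6. 4. 9.]]
```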


### Step 4: Naive Bayes classifier

1. The mean and standard deviation of each feature for each class:

|Class (mean, standard deviation)|Tumour size|Age|
|---|---|---|
|Cancer|(4.8, 1.71)|(69.03, 18.84)|
|$\neg $Cancer|(1.66, 1.4) |(62.46, 21.99) |

2. Use the naive Bayes formula to predict. The formula and the reasoning behind the code are as follows:

  $$P(C|t, a) = \frac{P(t, a|C)P(C)}{P(t, a)}$$

  where
  $$P(t, a) = P(t, a|C)P(C) + P(t, a|\neg C)P(\neg C)$$

  and, under the naive Bayes assumption that the features are independent given the class, $P(t, a|C)$ and $P(t, a|\neg C)$ each factor into a product of single-feature terms:

  $$P(t, a|C)=P(t|C)P(a|C)$$
  $$P(t, a|\neg C) = P(t|\neg C)P(a|\neg C)$$

  Since both features are numeric, we assume that each one follows **a Gaussian distribution** within each class.

  Applying the principle and formulas above, the result is:
  $$Accuracy = 0.84$$

3. The indexes of the incorrectly predicted samples: $[0, 4, 11, 12, 19, 50, 58, 59, 72, 73, 84, 85, 87, 91, 96, 98]$
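
To make the posterior calculation concrete, here is a minimal sketch that uses only the standard library. The per-class statistics are the rounded values printed by the notebook. The prior `prior_cancer` is an assumption: the notebook estimates it from the class frequencies in the data, which are not reported here, so `0.5` below is only a placeholder. The names `gaussian_pdf`, `STATS` and `posterior_cancer` are made up for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# (mean, std) per class and feature: rounded values as printed by the notebook.
STATS = {
    "cancer": {"tumour": (4.8, 1.71), "age": (69.03, 18.84)},
    "no_cancer": {"tumour": (1.66, 1.4), "age": (62.46, 21.99)},
}

def posterior_cancer(tumour, age, prior_cancer):
    """P(C | t, a) by Bayes' rule with the naive independence assumption."""
    like_c = (gaussian_pdf(tumour, *STATS["cancer"]["tumour"])
              * gaussian_pdf(age, *STATS["cancer"]["age"]))
    like_n = (gaussian_pdf(tumour, *STATS["no_cancer"]["tumour"])
              * gaussian_pdf(age, *STATS["no_cancer"]["age"]))
    evidence = like_c * prior_cancer + like_n * (1.0 - prior_cancer)
    return like_c * prior_cancer / evidence

# Placeholder prior of 0.5 (an assumption, not the value estimated from the data).
large = posterior_cancer(tumour=6.0, age=70.0, prior_cancer=0.5)
small = posterior_cancer(tumour=1.0, age=60.0, prior_cancer=0.5)
print(large > 0.5, small > 0.5)  # True False
```

A sample is predicted as cancer when the posterior exceeds 0.5, which is the threshold used in `cal_accuracy_and_incorrect_indexes`.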


### Question 4: Process of Model Selection

I would like to follow these steps to find the best model from these three models:
1. **Using cross-validation** and **tuning the hyperparameters** for each model to find **their own highest average accuracy value**.
2. Then choose the model which has **the highest average accuracy**.
3. **Retrain** the model on **all the data** to get **the best parameters**.

#### step 1: cross-validation and hyperparameter tuning

Cross-validation reduces the bias introduced by a single train/test split: one particular test set may happen to suit one model but not the others, so a single split can give a misleading comparison and hide overfitting or underfitting.
Either **K-fold Cross-Validation** or **Leave-One-Out Cross-Validation** can be used for this.

* The K-fold Cross-Validation process is:
  1. Shuffle the dataset.
  2. Split the dataset into K subsets.
  3. Train on K-1 subsets and test on the remaining one to get a metric value (accuracy).
  4. Repeat the previous step K times with a different test subset each time, so that every subset is used for testing exactly once while the others are used for training.
  5. Average the K metric values (accuracy).

* Leave-One-Out Cross-Validation is similar to K-fold, but each test subset contains only one sample, so K equals the number of samples.
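
The K-fold procedure above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than part of the submitted code: `k_fold_indices` and `cross_validate` are hypothetical names, `fit` and `predict` are caller-supplied callables, and the labels are assumed to be a 1-D array (the notebook stores `y` with shape `(n, 1)`).

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_validate(X, y, k, fit, predict, seed=0):
    """Return the accuracy averaged over k folds.

    fit(X_train, y_train) -> model and predict(model, X_test) -> labels
    are hypothetical callables supplied by the caller.
    """
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, then test on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(np.mean(predict(model, X[test_idx]) == y[test_idx]))
    return float(np.mean(scores))

# Toy usage: a one-feature threshold classifier on separable data.
X_toy = np.array([[0.0], [0.1], [0.2], [0.3], [1.0], [1.1], [1.2], [1.3]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
fit = lambda Xtr, ytr: (Xtr[ytr == 0].mean() + Xtr[ytr == 1].mean()) / 2
predict = lambda thr, Xte: (Xte[:, 0] > thr).astype(int)

acc_4fold = cross_validate(X_toy, y_toy, k=4, fit=fit, predict=predict)
acc_loo = cross_validate(X_toy, y_toy, k=len(X_toy), fit=fit, predict=predict)
print(acc_4fold, acc_loo)
```

Setting `k` to the number of samples, as in the last call, gives Leave-One-Out Cross-Validation.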

During step 1, we also need to **tune each model's hyperparameters**, because the same model performs differently with different hyperparameters, such as the regularisation parameter $\lambda$ and the learning rate $\alpha$ in logistic regression. **Random Search** and **Exhaustive Grid Search**, both evaluated with cross-validation, can be combined as follows.

1. First, use Random Search to find good ranges for the hyperparameters, and check whether the cost changes significantly around the good values.
2. Then use Exhaustive Grid Search over every hyperparameter combination within these ranges to find the set of hyperparameters with the highest accuracy.

This gives the best hyperparameters for each model, together with the average cross-validation accuracy they achieve.
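
The two-stage search can be sketched as follows for a single hyperparameter. Everything here is illustrative: `score_fn` stands for a caller-supplied function that returns the average cross-validation accuracy for a given hyperparameter value, the function names are hypothetical, and the choice to refine within a factor of two of the coarse optimum is an assumption, not a rule from the assignment.

```python
import numpy as np

def random_search(score_fn, low, high, n_trials, seed=0):
    """Sample values log-uniformly in [low, high]; return (best_value, best_score)."""
    rng = np.random.default_rng(seed)
    candidates = np.exp(rng.uniform(np.log(low), np.log(high), size=n_trials))
    scores = [score_fn(c) for c in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), float(scores[best])

def grid_search(score_fn, low, high, n_points):
    """Evaluate an evenly spaced grid on [low, high]; return (best_value, best_score)."""
    grid = np.linspace(low, high, n_points)
    scores = [score_fn(g) for g in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])

def coarse_to_fine(score_fn, low, high, n_trials=30, n_points=21, seed=0):
    """Random search over the full range, then a grid search around the best value."""
    coarse, coarse_score = random_search(score_fn, low, high, n_trials, seed)
    # Assumed refinement window: a factor of two either side, clipped to the range.
    fine, fine_score = grid_search(score_fn, max(low, coarse / 2),
                                   min(high, coarse * 2), n_points)
    if fine_score >= coarse_score:
        return fine, fine_score
    return coarse, coarse_score

# Toy score with its maximum at 0.1, standing in for cross-validation accuracy.
toy_score = lambda lam: -(np.log10(lam) + 1.0) ** 2
coarse_value, coarse_score = random_search(toy_score, 1e-4, 10.0, n_trials=30)
best_value, best_score = coarse_to_fine(toy_score, 1e-4, 10.0)
print(best_value, best_score)
```

With several hyperparameters, the same idea applies with `score_fn` taking a tuple of values and the grid becoming the Cartesian product of the per-parameter ranges.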

#### step 2
With an average accuracy for each model, we compare them and choose the model with the highest value.

#### step 3
Once the model is selected, we still need its final parameter values. We **retrain it on all the data** using the best hyperparameter values, and the resulting model is used to predict new data.


## Code

In [None]:
# step 1
import pandas as pd
import numpy as np

# read data
data = pd.read_csv('Data_CW.csv')
columns = data.columns
data = data.to_numpy() # change data to numpy

def feature_scaling(data, pre_data=None):
  '''
    Param data: 2-D ndarray
    Param pre_data: null or 2-D ndarray
  '''
  n = len(data[0])
  changedData = np.array([[] for i in range(len(data))])
  for i in range(n):
    column = data[:, i]
    cal_column = column if (pre_data is None) else pre_data[:, i]
    scaling_column = ((column - np.average(cal_column)) / (np.max(cal_column) - np.min(cal_column))).reshape(-1, 1)
    changedData = np.c_[changedData, scaling_column]
  return changedData

X = feature_scaling(data[:, :2])# scale the features
y = data[:, 2:]

In [None]:
# this section encapsulates the logistic regression functions
def logistic_batch_gradient_descent(X, y, eta, iterations, lam=0):
  """Use regularised batch gradient descent to calculate theta.
    Param X: feature matrix, 2-D ndarray of shape (m, number of features)
    Param y: labels, 2-D ndarray of shape (m, 1)
    Param eta: learning rate
    Param iterations: number of iterations
    Param lam: regularisation parameter

    Return the optimal theta
  """
  m = len(X)
  X_b = np.c_[np.ones((m, 1)), X]
  n = len(X_b[0])
  theta = np.zeros((n, 1))
  for i in range(iterations):
    t = X_b.dot(theta)
    SIG = 1 / (1 + np.exp(-t))
    regularisation = lam * np.array([[0], *theta[1:]])
    theta = theta - eta*(1/m) * (X_b.T.dot(SIG - y) + regularisation )
  return theta

def logistic_predict_value(X, theta):
  X_b = np.c_[np.ones((len(X), 1)), X]
  t = X_b.dot(theta)
  SIG = 1 / (1 + np.exp(-t))
  return SIG

def logistic_cost_function(X, y, theta):
  """calculate cost function
  Param X: 2-D ndarray
  Param y: 2-D shape:(n, 1) ndarray
  Param theta: 2-D ndarray
  
  Return: the value of the cost function
  """
  SIG = logistic_predict_value(X, theta)
  m = len(X)
  J = -1/m * (sum(y.T.dot(np.log(SIG)) + (1 - y).T.dot(np.log(1-SIG))))
  return J[0]

def logistic_accuracy(y, SIG):
  """
  Param y: actual value
  Param SIG: predicted value
  Return accuracy
  """
  trueValue = ((SIG + y) < 0.5) | ((SIG + y) >= 1.5)
  accuracy = len(SIG[trueValue])/len(SIG)
  return accuracy

def incorrect_prediction_indexes(y, SIG):
  # np.int was removed in NumPy 1.24; the built-in int is its replacement
  predict = np.round(SIG).astype(int)
  # print(np.c_[y, predict, y!=predict])
  y_df = pd.DataFrame(y)
  predict_df = pd.DataFrame(predict)
  return y_df[predict_df!=y_df].dropna().index.tolist()

In [None]:
learning_rate = .1
iterations = 10000

# linear classifier
print('Linear Logistic Regression')
# step 2.1 calculate thetas
thetas = logistic_batch_gradient_descent(X, y, learning_rate, iterations, lam=0)
print('thetas:', [round(v, 2) for v in thetas.reshape(3)])

# step 2.2 get cost and accuracy
cost = logistic_cost_function(X, y, thetas)
predictedValue = logistic_predict_value(X, thetas)
accuracy = logistic_accuracy(y, predictedValue)
print('cost: {}, accuracy: {}'.format(round(cost, 2), accuracy))

# step 2.3 error indexes
# incorrect_prediction_indexes(y, predictedValue)
indexes = incorrect_prediction_indexes(y, predictedValue)
print('incorrect indexes:', indexes)

Linear Logistic Regression
thetas: [1.29, 13.47, 7.14]
cost: 0.21, accuracy: 0.91
incorrect indexes: [0, 11, 12, 22, 38, 58, 79, 85, 98]


In [None]:
# step 3
def getQuadraticFeatures(features):
  '''
  Param features (n*2 matrix, type:ndarray)
  return n*5 matrix, type:ndarray
  '''
  quadraticFeatures = []
  for x1, x2 in features:
    item = [x1, x2, x1*x2, x1**2, x2**2]
    quadraticFeatures.append(item)
  return np.array(quadraticFeatures)

quadraticX = getQuadraticFeatures(X)

print('Quadratic classifier')
# step 3.1
quadraticTheta = logistic_batch_gradient_descent(quadraticX, y, learning_rate, iterations, lam=0)
print('thetas', [round(v, 2) for v in quadraticTheta.reshape(6)])
# step 3.2
quadraticCost = logistic_cost_function(quadraticX, y, quadraticTheta)
quadPredictedValue = logistic_predict_value(quadraticX, quadraticTheta)
quadraticAccuracy = logistic_accuracy(y, quadPredictedValue)
print('cost: {}, accuracy: {}'.format(round(quadraticCost, 2), quadraticAccuracy))
# step 3.3
quadIndexes = incorrect_prediction_indexes(y, quadPredictedValue)
print('incorrect indexes:', quadIndexes)

Quadratic classifier
thetas [1.76, 13.28, 7.53, 1.47, -3.14, -3.45]
cost: 0.18, accuracy: 0.94
incorrect indexes: [11, 12, 58, 79, 85, 98]


In [None]:
# step 4
from scipy import stats

tumourSizeData = data[:, 0]
ageData = data[:, 1]
cancer = data[:, 2]
tumourSizeMean = tumourSizeData.mean()
tumourSizeStd = tumourSizeData.std()

ageMean = ageData.mean()
ageStd = ageData.std()

tumourGivenCancerData = tumourSizeData[cancer == 1]
ageGivenCancerData = ageData[cancer == 1]
tumourGivenNoCancerData = tumourSizeData[cancer == 0]
ageGivenNoCancerData = ageData[cancer == 0]
def var_name(var, all_vars = locals()):
  result = [var_name for var_name in all_vars if all_vars[var_name] is var]
  return result[0] 

# print each class mean and standard deviation
for specificData in [tumourGivenCancerData, ageGivenCancerData, tumourGivenNoCancerData, ageGivenNoCancerData]:
    mean = specificData.mean()
    std = specificData.std()
    name = var_name(specificData).replace('Data','')
    print('P_{} mean: {}, standard deviation:{}'.format(name, round(mean, 2), round(std, 2)))

# using Gaussian Distribution to calculate the specific data
def calOneColumnProbability(value, specificData):
  mean = specificData.mean()
  std = specificData.std()
  probability = stats.norm.pdf(value, mean, std)
  return probability


# the probability of having specific tumour size and age, given patients having cancer
def calP_TumourAgeGivenCancer(size, age):
  P_TumourGivenCancer = calOneColumnProbability(size, tumourGivenCancerData)
  P_AgeGivenCancer = calOneColumnProbability(age, ageGivenCancerData)
  return P_TumourGivenCancer * P_AgeGivenCancer


# the probability of having specific tumour size and age, given patients not having cancer
def calP_TumourAgeGivenNoCancer(size, age):
  P_tumourGivenNoCancer = calOneColumnProbability(size, tumourGivenNoCancerData)
  P_ageGivenNoCancer = calOneColumnProbability(age, ageGivenNoCancerData)
  return P_tumourGivenNoCancer * P_ageGivenNoCancer


# the probability of having cancer given tumour size and age
def calP_CancerGivenTumourAge(size, age):
  P_Cancer = len(cancer[cancer == 1])/len(cancer)
  P_NoCancer = 1 - P_Cancer
  P_TumourAgeGivenCancer = calP_TumourAgeGivenCancer(size, age)
  P_TumourAgeGivenNoCancer = calP_TumourAgeGivenNoCancer(size, age)
  return P_TumourAgeGivenCancer * P_Cancer / (P_TumourAgeGivenCancer * P_Cancer + P_TumourAgeGivenNoCancer * P_NoCancer)

# calculate accuracy
def cal_accuracy_and_incorrect_indexes():
  right_num = 0
  incorrect_indexes = []
  for i, item in enumerate(data):
    probability = calP_CancerGivenTumourAge(item[0], item[1])
    if (probability > .5 and item[2] == 1) or (probability < .5 and item[2] == 0):
      right_num += 1
    else:
      incorrect_indexes.append(i)
  return right_num / len(data), incorrect_indexes

bayesAccuracy, incorrect_indexes = cal_accuracy_and_incorrect_indexes()
print('accuracy:',bayesAccuracy)
print('incorrect indexes:', incorrect_indexes)

P_tumourGivenCancer mean: 4.8, standard deviation:1.71
P_ageGivenCancer mean: 69.03, standard deviation:18.84
P_tumourGivenNoCancer mean: 1.66, standard deviation:1.4
P_ageGivenNoCancer mean: 62.46, standard deviation:21.99
accuracy: 0.84
incorrect indexes: [0, 4, 11, 12, 19, 50, 58, 59, 72, 73, 84, 85, 87, 91, 96, 98]


