# Exercise 1

In [1]:
import numpy as np

def calculate_network(x1, x2, x3):
    # ReLU activation function
    def relu(x):
        return max(0, x)
    
    # Sigmoid activation function
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))
    
    # Calculate X4 (ReLU)
    x4 = relu(3 * x1 + 0 * x2 + 1 * x3)
    
    # Calculate X5 (ReLU)
    x5 = relu(-2 * x1 + 3 * x2 + 3 * x3)
    
    # Calculate X6 (ReLU)
    x6 = relu(1 * x1 + 1 * x2 + 2 * x3)
    
    # Calculate X7 (Sigmoid)
    x7 = sigmoid(-2 * x4 + 2 * x5 + 3 * x6)
    
    return x4, x5, x6, x7

input_values = [
    (0, 0, 1),
    (0, 1, 0),
    (1, 0, 0),
    (1, 1, 1),
    (1, -1, 1),
    (1, 1, -1),
    (-1, -1, 1),
    (-1, -1, -1)
]

for inputs in input_values:
    x4, x5, x6, x7 = calculate_network(*inputs)
    print(f"Input: {inputs}")
    print(f"Output: X4 = {x4}, X5 = {x5}, X6 = {x6}, X7 = {x7:.3f}")
    print()

Input: (0, 0, 1)
Output: X4 = 1, X5 = 3, X6 = 2, X7 = 1.000

Input: (0, 1, 0)
Output: X4 = 0, X5 = 3, X6 = 1, X7 = 1.000

Input: (1, 0, 0)
Output: X4 = 3, X5 = 0, X6 = 1, X7 = 0.047

Input: (1, 1, 1)
Output: X4 = 4, X5 = 4, X6 = 4, X7 = 1.000

Input: (1, -1, 1)
Output: X4 = 4, X5 = 0, X6 = 2, X7 = 0.119

Input: (1, 1, -1)
Output: X4 = 2, X5 = 0, X6 = 0, X7 = 0.018

Input: (-1, -1, 1)
Output: X4 = 0, X5 = 2, X6 = 0, X7 = 0.982

Input: (-1, -1, -1)
Output: X4 = 0, X5 = 0, X6 = 0, X7 = 0.500
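
The same forward pass can also be written with weight matrices. The following is a minimal sketch, assuming the weights used in `calculate_network` above; the names `W_hidden`, `w_out` and `forward` are introduced here for illustration only.

```python
import numpy as np

# Rows hold the weights into X4, X5 and X6; columns correspond to x1, x2, x3.
W_hidden = np.array([
    [3, 0, 1],
    [-2, 3, 3],
    [1, 1, 2],
])
# Weights from X4, X5 and X6 into X7.
w_out = np.array([-2, 2, 3])

def forward(x):
    hidden = np.maximum(0, W_hidden @ x)          # ReLU on all hidden units at once
    x7 = 1 / (1 + np.exp(-(w_out @ hidden)))      # sigmoid on the output unit
    return hidden, x7

hidden, x7 = forward(np.array([1, 0, 0]))
print(hidden, round(float(x7), 3))
```

For the input (1, 0, 0) this reproduces the values printed above (X4 = 3, X5 = 0, X6 = 1, X7 = 0.047).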



# Exercise 2

## Part a)

In supervised learning, the algorithm is given a set of input-output pairs and learns to map inputs to outputs by adjusting its parameters. For a neural network, those parameters are the weights and biases, and training adjusts them to minimize the error (loss) between the predicted output and the true output. Training typically starts by initializing the weights and biases "randomly". Each training step then runs forward propagation to compute the predictions and the loss, uses backward propagation to compute the gradient of the loss with respect to every weight and bias, and applies gradient descent to move the weights and biases a small step in the direction of the negative gradient. Repeating these steps over the training data gradually reduces the error.
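
As a rough illustration of gradient descent, here is a minimal sketch that fits a single weight and bias to toy data. The data, the learning rate and the number of steps are assumptions chosen for this example, and the gradients are written out by hand rather than computed by backpropagation through a multi-layer network.

```python
import numpy as np

# Toy input-output pairs; the target relationship y = 2x + 1 is an assumption.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

rng = np.random.default_rng(0)
w, b = rng.normal(size=2)              # "random" initialization
learning_rate = 0.05

for step in range(2000):
    y_pred = w * x + b                 # forward pass
    error = y_pred - y
    loss = np.mean(error ** 2)         # mean squared error
    grad_w = np.mean(2 * error * x)    # d(loss)/dw
    grad_b = np.mean(2 * error)        # d(loss)/db
    w -= learning_rate * grad_w        # step against the gradient
    b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}, loss = {loss:.6f}")
```

With these settings the parameters should approach w = 2 and b = 1 as the loss shrinks toward zero.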

## Part b)
Let's call the top hidden neuron H1, the bottom hidden neuron H2, and the calculated output O1.

In [2]:
import numpy as np

def calculate_network(x1, x2, expected_output, w1, w2, w3, w4, w5, w6):
    # ReLU activation function
    def relu(x):
        return max(0, x)
    
    H1 = relu(w1 * x1 + w3 * x2)
    H2 = relu(w4 * x1 + w2 * x2)

    O1 = w5 * H1 + w6 * H2
    squared_error = (O1 - expected_output)**2
    
    return squared_error

input_values = [
    (0, 0, 0),
    (0, 1, 1),
    (1, 0, 1),
    (1, 1, 0)
]

total_squared_errors = 0

for inputs in input_values:
    total_squared_errors += calculate_network(*inputs, 1, -1, 0, 1, -1, 1)
    
mse = total_squared_errors/(len(input_values))
print(f"Mean Squared Error: {mse}")

Mean Squared Error: 0.75


## Part c)

In [3]:
import numpy as np

def calculate_network(x1, x2, expected_output, w1, w2, w3, w4, w5, w6):
    # ReLU activation function
    def relu(x):
        return max(0, x)
    
    H1 = relu(w1 * x1 + w3 * x2)  # top hidden neuron
    H2 = relu(w4 * x1 + w2 * x2)  # bottom hidden neuron

    O1 = w5 * H1 + w6 * H2
    squared_error = (O1 - expected_output)**2
    
    return squared_error

input_values = [
    (0, 0, 0),  # output is 0 for any weights, since there are no bias terms
    (0, 1, 1),
    (1, 0, 1),
    (1, 1, 0)
]

total_squared_errors = 0

for inputs in input_values:
    total_squared_errors += calculate_network(*inputs, -1, -1, 1, 1, 1, 1)
    
mse = total_squared_errors/(len(input_values))
print(f"Mean Squared Error: {mse}")

Mean Squared Error: 0.0
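
To see why these weights give zero error, substitute them (w1 = -1, w2 = -1, w3 = 1, w4 = 1, w5 = 1, w6 = 1) into the network defined in the cell above:

$$
\begin{aligned}
H_1 &= \mathrm{ReLU}(w_1 x_1 + w_3 x_2) = \mathrm{ReLU}(x_2 - x_1) \\
H_2 &= \mathrm{ReLU}(w_4 x_1 + w_2 x_2) = \mathrm{ReLU}(x_1 - x_2) \\
O_1 &= w_5 H_1 + w_6 H_2 = H_1 + H_2 = |x_1 - x_2|
\end{aligned}
$$

For inputs that are 0 or 1, |x1 - x2| equals the XOR of x1 and x2, so every squared error is 0.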


In [4]:
import numpy as np
from itertools import product

def calculate_network(x1, x2, expected_output, w1, w2, w3, w4, w5, w6):
    # ReLU activation function
    def relu(x):
        return max(0, x)
    
    # Hidden layer calculations
    H1 = relu(w1 * x1 + w3 * x2)  # Neuron 1
    H2 = relu(w4 * x1 + w2 * x2)  # Neuron 2

    # Output layer calculation
    O1 = w5 * H1 + w6 * H2
    
    # Compute squared error
    squared_error = (O1 - expected_output) ** 2
    
    return squared_error

# XOR input-output pairs
input_values = [
    (0, 0, 0),
    (0, 1, 1),
    (1, 0, 1),
    (1, 1, 0)
]

# Define possible weight values (-1, 0, 1)
possible_weights = [-1, 0, 1]

# Initialize variables to store the best weights and the lowest MSE
best_weights = None
lowest_mse = float('inf')

# Iterate over all possible combinations of weights
for w1, w2, w3, w4, w5, w6 in product(possible_weights, repeat=6):
    total_squared_errors = 0
    
    # Calculate the total squared error for this set of weights
    for inputs in input_values:
        total_squared_errors += calculate_network(*inputs, w1, w2, w3, w4, w5, w6)
    
    # Calculate the mean squared error (MSE)
    mse = total_squared_errors / len(input_values)
    
    # Update the best weights if this combination has a lower MSE
    if mse < lowest_mse:
        lowest_mse = mse
        best_weights = (w1, w2, w3, w4, w5, w6)

# Output the best weights and the corresponding MSE
print(f"Best weights: {best_weights}")
print(f"Lowest MSE: {lowest_mse}")


Best weights: (-1, -1, 1, 1, 1, 1)
Lowest MSE: 0.0
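
The search above keeps only the first combination that reaches the lowest MSE, because the comparison is a strict `<`. As a sketch, the variant below collects every combination with an MSE of 0 instead; `xor_data`, `xor_mse` and `solutions` are names introduced here for illustration.

```python
from itertools import product

# XOR input-output pairs: (x1, x2, expected_output)
xor_data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

def xor_mse(w1, w2, w3, w4, w5, w6):
    total = 0
    for x1, x2, target in xor_data:
        h1 = max(0, w1 * x1 + w3 * x2)   # same wiring as H1 above
        h2 = max(0, w4 * x1 + w2 * x2)   # same wiring as H2 above
        total += (w5 * h1 + w6 * h2 - target) ** 2
    return total / len(xor_data)

# Keep every weight combination that fits XOR exactly.
solutions = [w for w in product([-1, 0, 1], repeat=6) if xor_mse(*w) == 0]
for w in solutions:
    print(w)
```

With weights restricted to -1, 0 and 1, this should list the combination found above along with its mirror image, in which the two hidden neurons trade roles.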


# Exercise 3

## Part 1

A deep learning (neural network) model would likely yield a better churn prediction model here because of the abundance of data. In fact, this is a problem that I worked on during my time at [Webflow](webflow.com). Our customer volume was far lower than the 100 million customers mentioned in this question. Because of the relatively "low" number of training data points, I opted for a traditional machine learning model and spent about 3 months developing a feature engineering pipeline to create and test thousands of features for training the model. Ultimately, the model gave us directional indicators for groups (aka clusters) of customers with similar purchasing patterns and product usage; however, it wasn't as accurate as we would have liked. Additionally, due to compute constraints, we were unable to try a neural network approach.

## Part 2

The output layer would probably use a sigmoid activation function, since we are predicting whether a user churns or not (similar to the output of a standard logistic regression model). The input layers would likely be..
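
As a small illustration of a sigmoid output layer, the sketch below turns raw scores from the last layer into churn probabilities. The scores and labels are made-up values, and pairing the sigmoid with a binary cross-entropy loss is an assumption (it is a common choice, but it is not stated in the answer above).

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def binary_cross_entropy(y_true, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hypothetical raw scores (logits) from the last layer for four customers.
logits = np.array([-2.0, 0.0, 1.5, 3.0])
churned = np.array([0, 0, 1, 1])        # hypothetical labels: 1 = churned

churn_probability = sigmoid(logits)     # values strictly between 0 and 1
predicted_churn = churn_probability >= 0.5
print(churn_probability.round(3))
print(predicted_churn)
print(binary_cross_entropy(churned, churn_probability))
```

The 0.5 threshold is also only a default; a business could lower it to flag more at-risk customers at the cost of more false alarms.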

## Part 3

Given that this is a retail business, one way to reduce churn would be to offer discounts to customers. However, this comes at a financial cost and could decrease margins. Additionally, if "word" got out (via social media platforms) that these discounts were being offered, more existing customers might head towards the "churn" route in hopes of getting a discount. This is a traditional "slippery slope" problem and often has more risks than rewards.