# Ads Sales Analysis

- Name: Bryan Herdianto
- Email: bryan.herdianto17@gmail.com

### Import the required libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import RepeatedKFold

### Data Understanding

In [2]:
df = pd.read_csv('ads_sales.csv')
df

,Unnamed: 0,TV Ad Budget ($),Radio Ad Budget ($),Newspaper Ad Budget ($),Sales ($)
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9
...,...,...,...,...,...
195,196,38.2,3.7,13.8,7.6
196,197,94.2,4.9,8.1,9.7
197,198,177.0,9.3,6.4,12.8
198,199,283.6,42.0,66.2,25.5


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               200 non-null    int64  
 1   TV Ad Budget ($)         200 non-null    float64
 2   Radio Ad Budget ($)      200 non-null    float64
 3   Newspaper Ad Budget ($)  200 non-null    float64
 4   Sales ($)                200 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB


In [4]:
df.describe()

,Unnamed: 0,TV Ad Budget ($),Radio Ad Budget ($),Newspaper Ad Budget ($),Sales ($)
count,200.0,200.0,200.0,200.0,200.0
mean,100.5,147.0425,23.264,30.554,14.0225
std,57.879185,85.854236,14.846809,21.778621,5.217457
min,1.0,0.7,0.0,0.3,1.6
25%,50.75,74.375,9.975,12.75,10.375
50%,100.5,149.75,22.9,25.75,12.9
75%,150.25,218.825,36.525,45.1,17.4
max,200.0,296.4,49.6,114.0,27.0


### Data Preprocessing

In [5]:
# Drop the redundant row-number column and the target column (label)
X = df.drop(['Unnamed: 0', 'Sales ($)'], axis=1)
X.head()

,TV Ad Budget ($),Radio Ad Budget ($),Newspaper Ad Budget ($)
0,230.1,37.8,69.2
1,44.5,39.3,45.1
2,17.2,45.9,69.3
3,151.5,41.3,58.5
4,180.8,10.8,58.4


In [6]:
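# Standardize each feature to zero mean and unit variance (z-score).
# np.std is called with its default ddof=0, i.e. the population standard deviation.
# Note: the statistics are computed over all 200 rows, including the rows later used for testing.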
mean = np.mean(X, axis=0)
std_dev = np.std(X, axis=0)
X_scaled = (X - mean) / std_dev
X_scaled

,TV Ad Budget ($),Radio Ad Budget ($),Newspaper Ad Budget ($)
0,0.969852,0.981522,1.778945
1,-1.197376,1.082808,0.669579
2,-1.516155,1.528463,1.783549
3,0.052050,1.217855,1.286405
4,0.394182,-0.841614,1.281802
...,...,...,...
195,-1.270941,-1.321031,-0.771217
196,-0.617035,-1.240003,-1.033598
197,0.349810,-0.942899,-1.111852
198,1.594565,1.265121,1.640850


In [7]:
# Rows with labels 0-159 form the training set (.loc slicing includes the end label)
X_train = X_scaled.loc[:159, :]
X_train.shape

(160, 3)

In [8]:
X_train

,TV Ad Budget ($),Radio Ad Budget ($),Newspaper Ad Budget ($)
0,0.969852,0.981522,1.778945
1,-1.197376,1.082808,0.669579
2,-1.516155,1.528463,1.783549
3,0.052050,1.217855,1.286405
4,0.394182,-0.841614,1.281802
...,...,...,...
155,-1.669122,-0.787595,-1.144075
156,-0.620538,1.366407,0.918151
157,0.032199,-1.483087,-0.287883
158,-1.580378,0.920751,0.674182


In [9]:
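# The remaining rows (labels 160-199) form the test set; .loc ignores the out-of-range end label 200
X_test = X_scaled.loc[160:200, :]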
X_test.shape

(40, 3)

In [10]:
Y_train = df[['Sales ($)']].loc[:159, :]
Y_train.shape

(160, 1)

In [11]:
Y_train

,Sales ($)
0,22.1
1,10.4
2,9.3
3,18.5
4,12.9
...,...
155,3.2
156,15.3
157,10.1
158,7.3


In [12]:
Y_test = df[['Sales ($)']].loc[160:200, :]
Y_test.shape

(40, 1)

### Definitions

- **z (Linear Combination):** 
  - $z$ is the linear combination of inputs and weights, plus a bias term. It represents the input to a neuron (before applying the activation function). For a neuron $i$ in layer $l$, it is computed as:
    $$
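    z_i^{(l)} = \sum_{j} W_{ij}^{(l)} \cdot A_j^{(l-1)} + b_i^{(l)}
    $$
  - Here, $W_{ij}^{(l)}$ is the weight from neuron $j$ in the previous layer $(l-1)$ to neuron $i$ in the current layer $l$, $A_j^{(l-1)}$ is the activation of neuron $j$ in the previous layer (for the first layer, the input features), and $b_i^{(l)}$ is the bias of neuron $i$ in layer $l$.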

- **A (Activation):**
  - $A$ is the activation output after applying an activation function $f(z)$ to the linear combination $z$. It represents the output of a neuron after activation. For a neuron $i$ in layer $l$:
    $$
    A_i^{(l)} = f(z_i^{(l)})
    $$
  - Common activation functions include ReLU, sigmoid, and tanh.

- **W (Weights):**
  - $W$ is the matrix of weights connecting neurons between two layers. Each element $W_{ij}$ represents the weight between the $j$th neuron in the previous layer and the $i$th neuron in the current layer.

- **b (Bias):**
  - $b$ is the bias term added to the linear combination $z$ before applying the activation function. It allows the model to fit the data better by adjusting the output independently of the input features.
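
To make these definitions concrete, the following sketch runs one forward pass through a network with 3 inputs, 4 hidden ReLU units, and 1 linear output, the same layer sizes used later in this notebook. The sample values, the parameter values, and the helper name `relu` are illustrative assumptions, not values taken from the dataset or the trained model.

```python
import numpy as np

def relu(z):
    # ReLU activation: keeps positive values and zeroes out the rest
    return np.maximum(z, 0)

# Two illustrative samples with three features each (made-up values)
X = np.array([[1.0, -0.5, 0.2],
              [0.3, 0.8, -1.0]])

# Illustrative parameters: W1 has shape (4, 3), W2 has shape (1, 4)
W1 = np.array([[0.1, 0.2, 0.3],
               [-0.1, 0.0, 0.1],
               [0.5, -0.5, 0.0],
               [0.0, 0.3, -0.2]])
b1 = np.zeros((4, 1))
W2 = np.array([[1.0, -1.0, 0.5, 2.0]])
b2 = np.array([[0.1]])

# Hidden layer: z1 = W1 x + b1, then A1 = f(z1)
z1 = (W1 @ X.T + b1).T   # shape (2, 4): one row per sample
A1 = relu(z1)

# Output layer: no activation is applied, so the prediction equals z2
z2 = (W2 @ A1.T + b2).T  # shape (2, 1)

print(A1.shape, z2.shape)
```

Each row of `A1` holds the four hidden activations for one sample, and each row of `z2` holds the corresponding prediction.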

### General Equation for Updating Parameters with Gradient Descent

The parameters $W$ (weights) and $b$ (biases) in a neural network are updated using the gradient descent algorithm. The general update rule for each parameter $\theta$ (which could be $W$ or $b$) is:

$$
\theta = \theta - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta}
$$

Where:
- $\theta$ is the parameter being updated (either a weight $W$ or a bias $b$).
- $\alpha$ is the learning rate, a hyperparameter that controls the size of the step taken in the direction of the gradient.
- $\frac{\partial J(\theta)}{\partial \theta}$ is the gradient of the cost function $J(\theta)$ with respect to the parameter $\theta$.

In a neural network, the update rules for weights and biases at layer $l$ are:

$$
W^{(l)} = W^{(l)} - \alpha \cdot \frac{\partial J}{\partial W^{(l)}}
$$

$$
b^{(l)} = b^{(l)} - \alpha \cdot \frac{\partial J}{\partial b^{(l)}}
$$

This process is repeated iteratively to minimize the cost function and optimize the parameters.
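
As a minimal sketch of the update rule, the loop below applies gradient descent to a single parameter. The cost $J(\theta) = (\theta - 3)^2$, the starting value, and the learning rate are assumptions chosen only for illustration; they are unrelated to the network trained in this notebook.

```python
def grad_J(theta):
    # Gradient of the illustrative cost J(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0   # starting value of the parameter
alpha = 0.1   # learning rate

for _ in range(100):
    # theta <- theta - alpha * dJ/dtheta
    theta = theta - alpha * grad_J(theta)

print(theta)
```

With this learning rate each step shrinks the distance to the minimizer $\theta = 3$ by a factor of 0.8, so the printed value is very close to 3.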

### Gradient Functions and Derivations

1. **Gradient with respect to $W2$**:
    ```python
    dW2 = (2/m) * np.dot((Y_pred - Y_true).T, A1)
    ```
    - **Derivation**: The gradient of the cost function with respect to $W2$ is calculated as follows:
      $$
      \frac{\partial J}{\partial W2} = \frac{\partial J}{\partial Y_{\text{pred}}} \cdot \frac{\partial Y_{\text{pred}}}{\partial W2}
      $$
      $$
      \frac{\partial J}{\partial W2} = \frac{2}{m} \cdot (Y_{\text{pred}} - Y_{\text{true}})^T \cdot A1
      $$
    - Here, $(Y_{\text{pred}} - Y_{\text{true}})$ represents the error term for the predictions. By taking the dot product with $A1$, the activations from the previous layer, we compute how the weights $W2$ should be adjusted to minimize the cost.

2. **Gradient with respect to $b2$**:
    ```python
    db2 = np.sum((2/m) * np.dot((Y_pred - Y_true).T, 1), axis=1, keepdims=True)
    ```
    - **Derivation**: The gradient of the cost function with respect to $b2$ is calculated as:
      $$
      \frac{\partial J}{\partial b2} = \frac{\partial J}{\partial Y_{\text{pred}}} \cdot \frac{\partial Y_{\text{pred}}}{\partial b2}
      $$
      $$
      \frac{\partial J}{\partial b2} = \sum_{i=1}^{m} \frac{2}{m} \cdot (Y_{\text{pred},i} - Y_{\text{true},i})
      $$
    - Since $b2$ is a bias term that is added to all outputs equally, we sum the gradient contributions across all samples.

3. **Gradient with respect to $A1$**:
    ```python
    dA1 = (2/m) * np.dot((Y_pred - Y_true), W2)
    ```
    - **Derivation**: The gradient of the cost function with respect to the activations $A1$ from the hidden layer is given by:
      $$
      \frac{\partial J}{\partial A1} = \frac{\partial J}{\partial Y_{\text{pred}}} \cdot \frac{\partial Y_{\text{pred}}}{\partial A1}
      $$
      $$
      \frac{\partial J}{\partial A1} = \frac{2}{m} \cdot (Y_{\text{pred}} - Y_{\text{true}}) \cdot W2
      $$
    - This backpropagates the error from the output layer to the hidden layer, scaling it by the weights $W2$.

4. **Gradient with respect to $z1$ (using ReLU derivative)**:
    ```python
    dz1 = dA1 * np.where(z1 > 0, 1, 0)
    ```
    - **Derivation**: The gradient of the cost function with respect to $z1$ is calculated by multiplying $dA1$ by the derivative of the ReLU activation function:
      $$
      \frac{\partial J}{\partial z1} = \frac{\partial J}{\partial A1} \cdot \frac{\partial A1}{\partial z1}
      $$
      $$
      \frac{\partial J}{\partial z1} = dA1 \cdot f'(z1)
      $$
    - The derivative of ReLU $f'(z1)$ is 1 where $z1 > 0$, and 0 otherwise. This operation ensures that gradients only flow through the active neurons (where $z1 > 0$).

5. **Gradient with respect to $W1$**:
    ```python
    dW1 = np.dot(dz1.T, X)
    ```
    - **Derivation**: The gradient of the cost function with respect to $W1$ is:
      $$
      \frac{\partial J}{\partial W1} = \frac{\partial J}{\partial z1} \cdot \frac{\partial z1}{\partial W1}
      $$
      $$
      \frac{\partial J}{\partial W1} = dz1^T \cdot X
      $$
    - This computes how the weights $W1$ connecting the input layer to the hidden layer should be adjusted to minimize the cost. No extra $\frac{2}{m}$ factor appears because $dz1$ already carries it through $dA1$.

6. **Gradient with respect to $b1$**:
    ```python
    db1 = np.sum(np.dot(dz1.T, 1), axis=1, keepdims=True)
    ```
    - **Derivation**: The gradient of the cost function with respect to $b1$ is calculated by summing the gradient contributions across all samples:
      $$
      \frac{\partial J}{\partial b1} = \frac{\partial J}{\partial z1} \cdot \frac{\partial z1}{\partial b1}
      $$
      $$
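      \frac{\partial J}{\partial b1} = \sum_{i=1}^{m} dz1_i
      $$
    - Here $dz1_i$ is the row of $dz1$ for sample $i$. Because $dz1$ already carries the $\frac{2}{m}$ factor through $dA1$, this sum averages the contributions of all samples rather than growing with the batch size.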
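
One way to gain confidence in the six gradient formulas is to compare them with central finite differences of the cost. The sketch below does this on small random data; the data, the parameter dictionary, and the helper names (`forward`, `cost`, `analytic_grads`, `numeric_grad`) are assumptions made for illustration and are not part of the notebook's own functions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, h = 5, 3, 4                          # samples, features, hidden units
X = rng.normal(size=(m, n))                # random illustrative inputs
Y_true = rng.normal(size=(m, 1))           # random illustrative targets
params = {
    "W1": rng.normal(size=(h, n)), "b1": rng.normal(size=(h, 1)),
    "W2": rng.normal(size=(1, h)), "b2": rng.normal(size=(1, 1)),
}

def forward(p):
    z1 = (p["W1"] @ X.T + p["b1"]).T       # (m, h)
    A1 = np.where(z1 > 0, z1, 0)           # ReLU
    Y_pred = (p["W2"] @ A1.T + p["b2"]).T  # (m, 1)
    return z1, A1, Y_pred

def cost(p):
    # Mean squared error over all samples
    return np.mean((forward(p)[2] - Y_true) ** 2)

def analytic_grads(p):
    # The gradient formulas derived above
    z1, A1, Y_pred = forward(p)
    err = Y_pred - Y_true
    dA1 = (2 / m) * err @ p["W2"]
    dz1 = dA1 * (z1 > 0)
    return {
        "W2": (2 / m) * err.T @ A1,
        "b2": (2 / m) * err.sum(axis=0, keepdims=True),
        "W1": dz1.T @ X,
        "b1": dz1.sum(axis=0, keepdims=True).T,
    }

def numeric_grad(p, name, eps=1e-6):
    # Central finite difference, one parameter entry at a time
    grad = np.zeros_like(p[name])
    for idx in np.ndindex(p[name].shape):
        original = p[name][idx]
        p[name][idx] = original + eps
        plus = cost(p)
        p[name][idx] = original - eps
        minus = cost(p)
        p[name][idx] = original            # restore the entry
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

grads = analytic_grads(params)
max_diff = max(
    np.max(np.abs(grads[k] - numeric_grad(params, k))) for k in params
)
print("largest analytic-vs-numeric difference:", max_diff)
```

A difference close to zero indicates that the analytic gradients agree with the numerical ones. The check can be unreliable if some $z1$ entry lies almost exactly at 0, where ReLU is not differentiable.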

### Loss Function

The loss function used here is the **Mean Squared Error (MSE)**, which is commonly used for regression problems.

The formula for MSE is:
$$
J = \frac{1}{m} \sum_{i=1}^{m} (Y_{\text{pred},i} - Y_{\text{true},i})^2
$$
Where:
- $J$ is the cost (loss) function.
- $m$ is the number of training examples.
- $Y_{\text{pred},i}$ is the predicted value for the $i$-th training example.
- $Y_{\text{true},i}$ is the actual value for the $i$-th training example.

### Derivative of the MSE Loss Function

The derivative of the MSE loss function with respect to the predictions $Y_{\text{pred}}$ is:

$$
\frac{\partial J}{\partial Y_{\text{pred}}} = \frac{2}{m} \cdot (Y_{\text{pred}} - Y_{\text{true}})
$$

This derivative is used to compute the gradients for the backward propagation in the neural network. It represents the direction and magnitude of the change needed in the predicted values to reduce the overall loss.
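
The following sketch evaluates the MSE and its derivative on three made-up predictions; the numbers are assumptions chosen so the result is easy to verify by hand. Here the squared errors are 0.25, 0.25, and 4, so the MSE is 1.5.

```python
import numpy as np

# Illustrative targets and predictions (made-up values)
Y_true = np.array([[3.0], [5.0], [2.0]])
Y_pred = np.array([[2.5], [5.5], [4.0]])
m = Y_true.shape[0]

# J = (1/m) * sum of squared errors
mse = np.mean((Y_pred - Y_true) ** 2)

# dJ/dY_pred = (2/m) * (Y_pred - Y_true)
dJ_dY_pred = (2 / m) * (Y_pred - Y_true)

# Moving the predictions a small step against the gradient lowers the loss
Y_pred_stepped = Y_pred - 0.1 * dJ_dY_pred
mse_after = np.mean((Y_pred_stepped - Y_true) ** 2)

print(mse, mse_after)
print(dJ_dY_pred.ravel())
```

The sign of each derivative entry matches the sign of the error, so over-predictions are pushed down and under-predictions are pushed up.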


In [13]:
# Function to initialize the weights and biases for the neural network
def inisiasi_awal(n):
    np.random.seed(10)  # Set seed for reproducibility
    W1 = np.random.randn(4, n) * 0.01  # Initialize W1 with small random values, shape (4, n)
    b1 = np.zeros((4, 1))  # Initialize b1 with zeros, shape (4, 1)
    W2 = np.random.randn(1, 4) * 0.01  # Initialize W2 with small random values, shape (1, 4)
    b2 = np.zeros((1, 1))  # Initialize b2 with zeros, shape (1, 1)
    return W1, W2, b1, b2  # Return initialized weights and biases

# Function to perform forward propagation
def forward_propagation(W1, W2, b1, b2, X, Y_true):
    # Input layer to hidden layer
    z1 = np.dot(W1, X.T) + b1  # Compute the linear combination for the hidden layer
    z1 = z1.T  # Transpose z1 to match the expected shape for further operations
    A1 = np.where(z1 > 0, z1, 0)  # Apply ReLU activation function
    assert A1.shape == (Y_true.shape[0], 4), ("Wrong A1 shape! " + str(A1.shape))  # Ensure A1 has the correct shape
    
    # Hidden layer to output layer
    z2 = np.dot(W2, A1.T) + b2  # Compute the linear combination for the output layer
    Y_pred = z2.T  # Transpose Y_pred to match the shape of Y_true
    assert Y_pred.shape == Y_true.shape, ("Wrong Y_pred shape! " + str(Y_pred.shape))  # Ensure Y_pred has the correct shape
    
    return z1, A1, z2, Y_pred  # Return the intermediate and final outputs

# Function to compute the cost (Mean Squared Error)
def cost_function(Y_pred, Y_true):
    if not isinstance(Y_true, np.ndarray):
        Y_true = Y_true.to_numpy()  # Convert Y_true to a NumPy array if it's not already
    
    loss = np.square((Y_pred - Y_true))  # Compute the squared differences
    cost = np.mean(loss)  # Compute the mean of the squared differences (MSE)
    return cost  # Return the cost

# Function to perform backpropagation to compute gradients
def back_propagation(X, W1, z1, A1, W2, z2, Y_pred, Y_true, m):
    # Output layer to hidden layer
    if not isinstance(Y_true, np.ndarray):
        Y_true = Y_true.to_numpy()  # Convert Y_true to a NumPy array if it's not already
    
    dW2 = (2/m) * np.dot((Y_pred - Y_true).T, A1)  # Compute gradient of the cost w.r.t W2
    db2 = np.sum((2/m) * np.dot((Y_pred - Y_true).T, 1), axis=1, keepdims=True)  # Compute gradient of the cost w.r.t b2
    assert dW2.shape == W2.shape, ("Wrong dW2 shape! " + str(dW2.shape))  # Ensure dW2 has the correct shape
    
    # Hidden layer to input layer
    dA1 = (2/m) * np.dot((Y_pred - Y_true), W2)  # Compute gradient of the cost w.r.t A1
    dz1 = dA1 * np.where(z1 > 0, 1, 0)  # Compute gradient of the cost w.r.t z1 using ReLU derivative
    dW1 = np.dot(dz1.T, X)  # Compute gradient of the cost w.r.t W1
    db1 = np.sum(np.dot(dz1.T, 1), axis=1, keepdims=True)  # Compute gradient of the cost w.r.t b1
    assert dW1.shape == W1.shape, ("Wrong dW1 shape! " + str(dW1.shape))  # Ensure dW1 has the correct shape
    
    return dW1, dW2, db1, db2  # Return the computed gradients

# Function to update parameters using the computed gradients
def update_parameter(learning_rate, W1, W2, b1, b2, dW1, dW2, db1, db2):
    W1 = W1 - learning_rate * dW1  # Update W1 using gradient descent
    b1 = b1 - learning_rate * db1  # Update b1 using gradient descent
    W2 = W2 - learning_rate * dW2  # Update W2 using gradient descent
    b2 = b2 - learning_rate * db2  # Update b2 using gradient descent
    return W1, W2, b1, b2  # Return the updated weights and biases

### Vanilla / Batch Gradient Descent Optimization

In [14]:
# Assuming X_train and Y_train are already defined
m, n = X_train.shape
print("Jumlah sampel:", m, "dan jumlah Kolom:", n)  # Indonesian for "Number of samples: m and number of columns: n"

# Initialize parameters
W1_vanilla, W2_vanilla, b1_vanilla, b2_vanilla = inisiasi_awal(n)

# Training loop
for i in range(3000):
    # Forward propagation
    z1_vanilla, A1_vanilla, z2_vanilla, Y_pred_vanilla = forward_propagation(
        W1_vanilla, W2_vanilla, b1_vanilla, b2_vanilla, X_train, Y_train
    )
    
    # Compute cost
    cost_vanilla = cost_function(Y_pred_vanilla, Y_train)
    
    # Backward propagation
    dW1_vanilla, dW2_vanilla, db1_vanilla, db2_vanilla = back_propagation(
        X_train, W1_vanilla, z1_vanilla, A1_vanilla, W2_vanilla, z2_vanilla,
        Y_pred_vanilla, Y_train, m
    )
    
    # Update parameters
    W1_vanilla, W2_vanilla, b1_vanilla, b2_vanilla = update_parameter(
        0.005, W1_vanilla, W2_vanilla, b1_vanilla, b2_vanilla,
        dW1_vanilla, dW2_vanilla, db1_vanilla, db2_vanilla
    )
    
    # Print the cost for this iteration ("Cost ke-i" is Indonesian for "cost at iteration i")
    print("Cost ke-%d: %.4f" % (i + 1, cost_vanilla))

Jumlah sampel: 160 dan jumlah Kolom: 3
Cost ke-1: 224.9796
Cost ke-2: 221.0422
Cost ke-3: 217.1831
Cost ke-4: 213.4007
Cost ke-5: 209.6934
Cost ke-6: 206.0597
Cost ke-7: 202.4981
Cost ke-8: 199.0069
Cost ke-9: 195.5846
Cost ke-10: 192.2298
Cost ke-11: 188.9409
Cost ke-12: 185.7164
Cost ke-13: 182.5547
Cost ke-14: 179.4542
Cost ke-15: 176.4131
Cost ke-16: 173.4297
Cost ke-17: 170.5021
Cost ke-18: 167.6284
Cost ke-19: 164.8063
Cost ke-20: 162.0335
Cost ke-21: 159.3073
Cost ke-22: 156.6247
Cost ke-23: 153.9822
Cost ke-24: 151.3761
Cost ke-25: 148.8017
Cost ke-26: 146.2537
Cost ke-27: 143.7255
Cost ke-28: 141.2102
Cost ke-29: 138.6991
Cost ke-30: 136.1819
Cost ke-31: 133.6467
Cost ke-32: 131.0793
Cost ke-33: 128.4633
Cost ke-34: 125.7792
Cost ke-35: 123.0048
Cost ke-36: 120.1146
Cost ke-37: 117.0798
Cost ke-38: 113.8678
Cost ke-39: 110.4433
Cost ke-40: 106.7691
Cost ke-41: 102.8071
Cost ke-42: 98.5204
Cost ke-43: 93.8763
Cost ke-44: 88.8510
Cost ke-45: 83.4341
Cost ke-46: 77.6345
Cost ke-4

In [15]:
# Forward propagation on the test set
_, _, _, Y_test_pred = forward_propagation(
    W1_vanilla, W2_vanilla, b1_vanilla, b2_vanilla, X_test, Y_test
)

print(Y_test_pred)
print(Y_test)

# Compute the cost on the test set
cost_test = cost_function(Y_test_pred, Y_test)

# Print the Mean Squared Error (MSE) on the test set
print("MSE pada testing:", cost_test)

[[12.82026298]
 [12.47458535]
 [13.73295107]
 [17.33904342]
 [10.64055604]
 [13.87729589]
 [ 9.09171198]
 [12.89745947]
 [16.70715742]
 [17.31153404]
 [ 8.327067  ]
 [13.08154079]
 [ 7.93349817]
 [11.77967143]
 [13.25118209]
 [27.10430397]
 [20.46758704]
 [11.95995572]
 [14.9445833 ]
 [11.90473349]
 [11.08015376]
 [13.31135507]
 [ 8.15918218]
 [26.08200189]
 [18.37090838]
 [21.95381407]
 [10.56137254]
 [16.74346565]
 [18.30206468]
 [ 7.37894636]
 [11.30168555]
 [ 9.04477885]
 [ 6.81126596]
 [18.93505027]
 [16.21850563]
 [ 7.39463833]
 [ 9.24509464]
 [12.18324852]
 [25.59201952]
 [13.90177585]]
     Sales ($)
160       14.4
161       13.3
162       14.9
163       18.0
164       11.9
165       11.9
166        8.0
167       12.2
168       17.1
169       15.0
170        8.4
171       14.5
172        7.6
173       11.7
174       11.5
175       27.0
176       20.2
177       11.7
178       11.8
179       12.6
180       10.5
181       12.2
182        8.7
183       26.2
184       17.6
185      

In [16]:
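# Forward propagation for prediction: the same computation as forward_propagation,
# but without Y_true and the shape assertions
def forward_propagation_final(W1, W2, b1, b2, X):
    # Input layer to hidden layer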
    z1 = np.dot(W1, X.T) + b1
    z1 = z1.T
    A1 = np.where(z1 > 0, z1, 0)
    
    # Hidden layer to output layer
    z2 = np.dot(W2, A1.T) + b2
    Y_pred = z2.T
    
    return z1, A1, z2, Y_pred

tv = float(input("What is your TV ad budget? "))
radio = float(input("What is your Radio ad budget? "))
newspaper = float(input("What is your Newspaper ad budget? "))

# Build a one-row DataFrame with the same column names as the training features
df_input = pd.DataFrame(
    {
        'TV Ad Budget ($)': [tv],
        'Radio Ad Budget ($)': [radio],
        'Newspaper Ad Budget ($)': [newspaper]
    }
)
print("\n")
print(df_input)

# scale the input
df_scaled = (df_input - mean) / std_dev
print("\n")
print(df_scaled)

# put into forward_propagation function
_, _, _, Y_test_pred = forward_propagation_final(
    W1_vanilla, W2_vanilla, b1_vanilla, b2_vanilla, df_scaled
)

# print result
print("\n")
print(Y_test_pred)

What is your TV ad budget?  39
What is your Radio ad budget?  128
What is your Newspaper ad budget?  48




   TV Ad Budget ($)  Radio Ad Budget ($)  Newspaper Ad Budget ($)
0              39.0                128.0                     48.0


   TV Ad Budget ($)  Radio Ad Budget ($)  Newspaper Ad Budget ($)
0         -1.261599             7.072148                 0.803071


[[34.53979911]]
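
The radio budget entered above (128) is far above the largest radio budget in the `df.describe()` output (49.6), which is why its scaled value is about 7.07. The prediction is therefore an extrapolation beyond the data the model has seen. A minimal sketch of a range check that could run before predicting is shown below; `outside_observed_range` and `feature_ranges` are hypothetical names, and the bounds are copied from the `min` and `max` rows of the summary table.

```python
# Minimum and maximum of each feature, copied from the df.describe() output
feature_ranges = {
    "TV Ad Budget ($)": (0.7, 296.4),
    "Radio Ad Budget ($)": (0.0, 49.6),
    "Newspaper Ad Budget ($)": (0.3, 114.0),
}

def outside_observed_range(inputs, ranges=feature_ranges):
    """Return the names of features whose value lies outside the observed range."""
    return [
        name for name, value in inputs.items()
        if not (ranges[name][0] <= value <= ranges[name][1])
    ]

# The budgets entered in the cell above
flagged = outside_observed_range({
    "TV Ad Budget ($)": 39.0,
    "Radio Ad Budget ($)": 128.0,
    "Newspaper Ad Budget ($)": 48.0,
})
print(flagged)
```

For these inputs only the radio budget is flagged.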


### Mini-Batch Gradient Descent Optimization

In [17]:
# Initialize parameters
W1_mb, W2_mb, b1_mb, b2_mb = inisiasi_awal(n)

# Hyperparameters
learning_rate = 0.005
batch_size = 32
epochs = 2000

# Initialize variables to track the best weights and biases
best_W1_mb, best_W2_mb, best_b1_mb, best_b2_mb = W1_mb, W2_mb, b1_mb, b2_mb
best_cost = float('inf')

# Training loop
for epoch in range(epochs):
    # Shuffle data
    indices = np.arange(m)
    np.random.shuffle(indices)
    X_train_shuffled = X_train.iloc[indices]
    Y_train_shuffled = Y_train.iloc[indices]
    
    # Mini-batch processing
    for start in range(0, m, batch_size):
        end = min(start + batch_size, m)
        X_batch = X_train_shuffled.iloc[start:end]
        Y_batch = Y_train_shuffled.iloc[start:end]
        
        # Forward propagation
        z1_mb, A1_mb, z2_mb, Y_pred_mb = forward_propagation(
            W1_mb, W2_mb, b1_mb, b2_mb, X_batch, Y_batch
        )
        
        # Compute cost
        cost_mb = cost_function(Y_pred_mb, Y_batch)
        
        # Backward propagation
        dW1_mb, dW2_mb, db1_mb, db2_mb = back_propagation(
            X_batch, W1_mb, z1_mb, A1_mb, W2_mb, z2_mb,
            Y_pred_mb, Y_batch, len(X_batch)
        )
        
        # Update parameters
        W1_mb, W2_mb, b1_mb, b2_mb = update_parameter(
            learning_rate, W1_mb, W2_mb, b1_mb, b2_mb,
            dW1_mb, dW2_mb, db1_mb, db2_mb
        )
    
    # Print cost for the last mini-batch of the epoch
    print("Epoch %d: Last Mini-Batch Cost = %.4f" % (epoch + 1, cost_mb))
    
    # Keep the parameters from the epoch with the lowest last-mini-batch cost (a noisy proxy for the full training cost)
    if cost_mb < best_cost:
        best_cost = cost_mb
        best_W1_mb, best_W2_mb, best_b1_mb, best_b2_mb = W1_mb, W2_mb, b1_mb, b2_mb

Epoch 1: Last Mini-Batch Cost = 233.8393
Epoch 2: Last Mini-Batch Cost = 170.4956
Epoch 3: Last Mini-Batch Cost = 183.7091
Epoch 4: Last Mini-Batch Cost = 195.8025
Epoch 5: Last Mini-Batch Cost = 135.9828
Epoch 6: Last Mini-Batch Cost = 143.6402
Epoch 7: Last Mini-Batch Cost = 97.8517
Epoch 8: Last Mini-Batch Cost = 100.4681
Epoch 9: Last Mini-Batch Cost = 84.6443
Epoch 10: Last Mini-Batch Cost = 45.9516
Epoch 11: Last Mini-Batch Cost = 21.6510
Epoch 12: Last Mini-Batch Cost = 7.9025
Epoch 13: Last Mini-Batch Cost = 4.3221
Epoch 14: Last Mini-Batch Cost = 2.4894
Epoch 15: Last Mini-Batch Cost = 4.2979
Epoch 16: Last Mini-Batch Cost = 3.2847
Epoch 17: Last Mini-Batch Cost = 4.2788
Epoch 18: Last Mini-Batch Cost = 1.8078
Epoch 19: Last Mini-Batch Cost = 2.0662
Epoch 20: Last Mini-Batch Cost = 2.1664
Epoch 21: Last Mini-Batch Cost = 3.9289
Epoch 22: Last Mini-Batch Cost = 2.1170
Epoch 23: Last Mini-Batch Cost = 2.0118
Epoch 24: Last Mini-Batch Cost = 4.0601
Epoch 25: Last Mini-Batch Cost 
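
The shuffling and slicing performed in the training loop above can be isolated in a small generator. This is a sketch on plain NumPy arrays; `iterate_minibatches` is a hypothetical helper rather than part of the notebook, and the demo arrays only mimic the shape of the training data (160 rows, 3 features).

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    """Yield shuffled (X_batch, Y_batch) pairs that together cover every sample once."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], Y[batch]

rng = np.random.default_rng(10)
X_demo = np.arange(160 * 3, dtype=float).reshape(160, 3)  # placeholder features
Y_demo = np.arange(160, dtype=float).reshape(160, 1)      # placeholder targets

batch_sizes = [len(xb) for xb, _ in iterate_minibatches(X_demo, Y_demo, 32, rng)]
print(batch_sizes)
```

With 160 samples and a batch size of 32, each epoch consists of five parameter updates.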

In [18]:
# Forward propagation on the test set using the best parameters
_, _, _, Y_test_pred = forward_propagation(
    best_W1_mb, best_W2_mb, best_b1_mb, best_b2_mb, X_test, Y_test
)

# Compute the cost on the test set
cost_test = cost_function(Y_test_pred, Y_test)

# Print the lowest last-mini-batch cost seen during training (a single batch, not the MSE over the full training set)
print("MSE pada training dengan parameter terbaik:", best_cost)

# Print the Mean Squared Error (MSE) on the test set
print("MSE pada testing dengan parameter terbaik:", cost_test)

MSE pada training dengan parameter terbaik: 0.5646526531465363
MSE pada testing dengan parameter terbaik: 1.2003932008488736


In [19]:
tv = float(input("What is your TV ad budget? "))
radio = float(input("What is your Radio ad budget? "))
newspaper = float(input("What is your Newspaper ad budget? "))

# Build a one-row DataFrame with the same column names as the training features
df_input = pd.DataFrame(
    {
        'TV Ad Budget ($)': [tv],
        'Radio Ad Budget ($)': [radio],
        'Newspaper Ad Budget ($)': [newspaper]
    }
)
print("\n")
print(df_input)

# scale the input
df_scaled = (df_input - mean) / std_dev
print("\n")
print(df_scaled)

# put into forward_propagation function
_, _, _, Y_test_pred = forward_propagation_final(
    best_W1_mb, best_W2_mb, best_b1_mb, best_b2_mb, df_scaled
)

# print result
print("\n")
print(Y_test_pred)

What is your TV ad budget?  28
What is your Radio ad budget?  95
What is your Newspaper ad budget?  37




   TV Ad Budget ($)  Radio Ad Budget ($)  Newspaper Ad Budget ($)
0              28.0                 95.0                     37.0


   TV Ad Budget ($)  Radio Ad Budget ($)  Newspaper Ad Budget ($)
0         -1.390045              4.84387                 0.296721


[[25.03629908]]


### Using Repeated K-Fold to split the data into training and testing sets and applying Stochastic Gradient Descent (SGD) optimization

In [20]:
# Dictionaries to store results
MSE_fold = {"training": [], "testing": []}
Weight_fold = {"W1": [], "W2": []}
bias_fold = {"b1": [], "b2": []}
index_fold = {"training": [], "testing": []}

# Target variable
Y = df[['Sales ($)']]

# Initialize RepeatedKFold
rkf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=10)

# Loop through each fold
for fold, (train_index, test_index) in enumerate(rkf.split(X_scaled, Y)):
    X_train_fold, X_test_fold = X_scaled.loc[train_index], X_scaled.loc[test_index]
    Y_train_fold, Y_test_fold = Y.loc[train_index], Y.loc[test_index]
    
    # Get the number of features
    n = X_train_fold.shape[1]
    
    # Initialize parameters
    W1_SGD, W2_SGD, b1_SGD, b2_SGD = inisiasi_awal(n)
    
    if fold == 0:
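        # Report the training size, test size, and number of input features per fold (messages are in Indonesian)
        print(f"Jumlah sampel Train tiap fold      : {len(train_index)}")
        print(f"Jumlah sampel Test  tiap fold      : {len(test_index)}")
        print(f"Jumlah atribut input               : {n}\n")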

    print(f"Fold {fold + 1}:") 
    
    # Convert DataFrames to NumPy arrays
    X_train_fold = X_train_fold.to_numpy()
    Y_train_fold = Y_train_fold.to_numpy()

    # Train the model using SGD
    for epoch in range(300):
        for i in range(len(train_index)):
            X_i = X_train_fold[i].reshape(-1, 3)  # Single sample as a (1, 3) row
            Y_i = Y_train_fold[i].reshape(-1, 1)  # Matching target as a (1, 1) array
            
            # Forward propagation
            z1_SGD, A1_SGD, z2_SGD, Y_pred_SGD = forward_propagation(
                W1_SGD, W2_SGD, b1_SGD, b2_SGD, X_i, Y_i
            )
            
            # Compute cost
            cost_SGD = cost_function(Y_pred_SGD, Y_i)
            
            # Backward propagation (m is passed as the fold size, so each single-sample gradient is scaled by 1 / len(train_index))
            dW1_SGD, dW2_SGD, db1_SGD, db2_SGD = back_propagation(
                X_i, W1_SGD, z1_SGD, A1_SGD, W2_SGD, z2_SGD, Y_pred_SGD, Y_i, len(train_index)
            )
            
            # Update parameters
            W1_SGD, W2_SGD, b1_SGD, b2_SGD = update_parameter(
                0.01, W1_SGD, W2_SGD, b1_SGD, b2_SGD, dW1_SGD, dW2_SGD, db1_SGD, db2_SGD
            )
    
    # Print the cost of the last training sample processed in the final epoch (not the MSE over the whole training fold)
    print("     cost pada training  :", cost_SGD)

    # Testing/Validation
    _, _, _, Y_test_pred = forward_propagation(
        W1_SGD, W2_SGD, b1_SGD, b2_SGD, X_test_fold, Y_test_fold
    )
    cost_test = cost_function(Y_test_pred, Y_test_fold)
    
    # Print testing cost
    print("     MSE pada testing    :", cost_test)

    # Save results
    index_fold['training'].append(np.squeeze(train_index).tolist())
    index_fold['testing'].append(np.squeeze(test_index).tolist())
    MSE_fold["training"].append(cost_SGD)
    MSE_fold['testing'].append(cost_test)
    Weight_fold['W1'].append(np.squeeze(W1_SGD).tolist())
    Weight_fold['W2'].append(np.squeeze(W2_SGD).tolist())
    bias_fold["b1"].append(b1_SGD)
    bias_fold["b2"].append(b2_SGD)

Jumlah sampel Train tiap fold      : 160
Jumlah sampel Test  tiap fold      : 40
Jumlah atribut input               : 3

Fold 1:
     cost pada training  : 1.1364137851516714
     MSE pada testing    : 4.863639224058225
Fold 2:
     cost pada training  : 0.009664183819847712
     MSE pada testing    : 1.0635067658405593
Fold 3:
     cost pada training  : 0.25663653260280855
     MSE pada testing    : 1.2843319998066018
Fold 4:
     cost pada training  : 0.0953439867377515
     MSE pada testing    : 1.4885833833759097
Fold 5:
     cost pada training  : 0.20300971286583916
     MSE pada testing    : 1.2643605036926218
Fold 6:
     cost pada training  : 0.17682528257453173
     MSE pada testing    : 1.4778875044315698
Fold 7:
     cost pada training  : 0.014402418845586427
     MSE pada testing    : 2.0812203994908547
Fold 8:
     cost pada training  : 0.15555079432456415
     MSE pada testing    : 1.5837504297715128
Fold 9:
     cost pada training  : 0.5200570690661555
     MSE pada test
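
The number of folds and the sample counts printed above follow directly from the `RepeatedKFold` settings. The sketch below reproduces them on a placeholder array with 200 rows; `X_demo`, `rkf_demo`, and `fold_sizes` are illustrative names, and no model is trained here.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X_demo = np.zeros((200, 3))  # placeholder with the same number of rows as the dataset
rkf_demo = RepeatedKFold(n_splits=5, n_repeats=2, random_state=10)

# One (train size, test size) pair per generated split
fold_sizes = [
    (len(train_idx), len(test_idx))
    for train_idx, test_idx in rkf_demo.split(X_demo)
]
print(len(fold_sizes), fold_sizes[0])
```

Five splits repeated twice give 10 folds, each with 160 training and 40 testing indices, which matches the counts reported by the training cell.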

In [22]:
# Find the index of the fold with the smallest testing MSE
fold_terbaik = np.argmin(MSE_fold['testing'])
testing_index = index_fold['testing'][fold_terbaik]

# Retrieve the best weights and biases from the fold with the smallest testing MSE
best_W1_sgd = np.array(Weight_fold['W1'][fold_terbaik])
best_W2_sgd = np.array(Weight_fold['W2'][fold_terbaik])
best_b1_sgd = bias_fold['b1'][fold_terbaik]
best_b2_sgd = bias_fold['b2'][fold_terbaik]

# Perform forward propagation using the best weights and biases
_, _, _, Y_test_pred = forward_propagation(
    best_W1_sgd,
    best_W2_sgd,
    best_b1_sgd,
    best_b2_sgd,
    X_scaled.loc[testing_index],
    Y.loc[testing_index]
)

# Compute the MSE on the test data
cost_test = cost_function(Y_test_pred, Y.loc[testing_index])

# Print the fold number with the smallest testing MSE
print(f"MSE dari testing terkecil pada Fold ke-{fold_terbaik + 1}: {cost_test}")

MSE dari testing terkecil pada Fold ke-2: 1.0635067658405593


In [23]:
tv = float(input("What is your TV ad budget? "))
radio = float(input("What is your Radio ad budget? "))
newspaper = float(input("What is your Newspaper ad budget? "))

# Build a one-row DataFrame with the same column names as the training features
df_input = pd.DataFrame(
    {
        'TV Ad Budget ($)': [tv],
        'Radio Ad Budget ($)': [radio],
        'Newspaper Ad Budget ($)': [newspaper]
    }
)
print("\n")
print(df_input)

# scale the input
df_scaled = (df_input - mean) / std_dev
print("\n")
print(df_scaled)

# put into forward_propagation function
_, _, _, Y_test_pred = forward_propagation_final(
    best_W1_sgd, best_W2_sgd, best_b1_sgd, best_b2_sgd, df_scaled
)

# print result
print("\n")
print(Y_test_pred)

What is your TV ad budget?  93
What is your Radio ad budget?  48
What is your Newspaper ad budget?  12




   TV Ad Budget ($)  Radio Ad Budget ($)  Newspaper Ad Budget ($)
0              93.0                 48.0                     12.0


   TV Ad Budget ($)  Radio Ad Budget ($)  Newspaper Ad Budget ($)
0         -0.631048             1.670263                -0.854074


[[16.14996653]]
