# Using ML techniques to infer a multiplier

### Scenario

You discover that the number of seeds in an apple is directly tied to the overall height of the fruit: the seed count just needs to be multiplied by some fixed number. Create a model that, given the number of seeds, predicts the height of the fruit. ***Use an iterative guessing approach to estimate the value of the multiplier.***

### We use two packages for this
1. random - to generate random numbers
2. numpy - this package handles matrices (or, more precisely, arrays, which may have more dimensions than a matrix)

In [1]:
import numpy as np
import random

## Part 1 - Set up data

### Randomly select the multiplier
This will be the value the seed count is multiplied by, and the number we're trying to discover
* Select a random number between 10 and 100 (uniform distribution) and set it equal to a variable named "actual_multiplier"

In [4]:
#uniform means it's a uniform distribution: any value between 10 and 100 is equally likely
actual_multiplier = random.uniform(10,100)
print(actual_multiplier)

24.193581435901663


In [6]:
#uniform means it's a uniform distribution: any value between 10 and 100 is equally likely; wrapping the call in int() truncates it to a whole number
actual_multiplier = int(random.uniform(10,100))
print(actual_multiplier)

15


In [5]:
#use the randint function instead if you want to draw a whole number directly rather than a decimal (both endpoints, 10 and 100, are included)
integer = random.randint(10,100)
print(integer)

41
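
The two ways of getting a whole number above are not quite equivalent. A minimal sketch of the difference, assuming a fixed seed and a draw count of 10,000 (both chosen here purely for illustration): `int()` truncates towards zero, so `int(random.uniform(10, 100))` almost never returns 100, while `random.randint(10, 100)` includes both endpoints.

```python
import random

random.seed(0)   # assumption: fixed seed so the comparison is repeatable
draws = 10_000   # assumption: arbitrary sample size for the illustration

# int() truncates, so 100 only appears if uniform() returns exactly 100.0
truncated = [int(random.uniform(10, 100)) for _ in range(draws)]

# randint() draws whole numbers directly and includes both 10 and 100
inclusive = [random.randint(10, 100) for _ in range(draws)]

print(max(truncated), max(inclusive))
```

For this exercise either approach works, since the multiplier only needs to be some fixed value in the range.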


### Collect some apple seeds
Collect some samples of apple seeds, and measure the associated fruits
* To start we'll use 10 samples with different numbers of seeds in each sample. Here we'll use numbers 1, 2, ..., 9, 10
    * Make a numpy array named seed_count_array with these values
* For obvious reasons, we will not be measuring any apples right now. We're going to cheat a bit and say that the height of each associated apple is the number of seeds times our multiplier value, plus noise
    * Make a numpy array called apple_height_array that is length 10, and equal to the seed_count_array times the actual_multiplier
    * Use the np.random.random method to create an array of length 10, and name it noise_array
    * Add the values from the elements of the noise array to the elements of apple_height_array
* Print out the actual_multiplier, seed_count_array, and apple_height_array

In [9]:
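#make a numpy array with the seed values 1-10
#note: quoting the values makes them strings (dtype '<U2'), which cannot be multiplied; the next cell builds the numeric version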
seed_count_array = np.array(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'])
seed_count_array

array(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'], dtype='<U2')

In [10]:
#alternative: make a numpy array with the seed values 1-10
seed_count_array = np.arange(1,11)
seed_count_array

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [13]:
#make a numpy array called apple_height_array that is length 10 and equal to the seed_count_array times the actual_multiplier
apple_height_array = seed_count_array * actual_multiplier
apple_height_array

array([ 15,  30,  45,  60,  75,  90, 105, 120, 135, 150])

In [15]:
#make a noise_array in numpy of length 10 (np.random.random draws values from 0 up to, but not including, 1)
noise_array = np.random.random(10)
noise_array

array([0.23469826, 0.91905256, 0.17115108, 0.43974932, 0.5677963 ,
       0.29426103, 0.3740554 , 0.37647661, 0.52071945, 0.3514626 ])

In [19]:
#add the values from the elements of the noise array to the elements of apple_height_array
apple_height_array = noise_array + apple_height_array
apple_height_array

array([ 15.23469826,  30.91905256,  45.17115108,  60.43974932,
        75.5677963 ,  90.29426103, 105.3740554 , 120.37647661,
       135.52071945, 150.3514626 ])

In [20]:
#print out the actual_multiplier, seed_count_array and apple_height_array
print(actual_multiplier, seed_count_array, apple_height_array)

15 [ 1  2  3  4  5  6  7  8  9 10] [ 15.23469826  30.91905256  45.17115108  60.43974932  75.5677963
  90.29426103 105.3740554  120.37647661 135.52071945 150.3514626 ]
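
The set-up steps above can be collected into one function so the same data can be regenerated on demand. This is only a sketch: `make_apple_data` is a hypothetical helper name, and it swaps `np.random.random` for NumPy's newer `np.random.default_rng` generator so that a seed (an assumption, not part of the original exercise) makes the noise repeatable.

```python
import numpy as np

def make_apple_data(multiplier, sample_count=10, seed=0):
    """Hypothetical helper: seed counts 1..sample_count and noisy heights."""
    rng = np.random.default_rng(seed)              # seeded generator (assumption)
    seed_counts = np.arange(1, sample_count + 1)   # 1, 2, ..., sample_count
    noise = rng.random(sample_count)               # values in [0, 1)
    heights = seed_counts * multiplier + noise
    return seed_counts, heights

seed_counts, heights = make_apple_data(multiplier=15)

# Dividing each height by its seed count gives a rough per-sample multiplier
print(heights / seed_counts)
```

Because the noise is always between 0 and 1, it matters most for the samples with few seeds and barely shifts the ratio for the larger ones.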


### Sidenote - How contrived is this exercise?

This is a toy problem where we know the answer before we start. The point of this example is to understand the overall process of iterative improvement. Relationships typically modelled with ML are more complicated than a simple multiplier, but surprisingly little changes for more complex problems. Here we're modelling a single parameter, whereas many models used in biology have tens of millions of parameters, yet they are built out of many simple calculations like the one in our exercise. The math is more advanced (though maybe not as much as you might think) and beyond our scope. Working through it wouldn't serve much practical use anyway, since these calculations are never done by hand, and a comprehensive understanding of them is not strictly necessary unless you are researching novel algorithm designs.

## Part 2 - Build out a training loop


1. Task 1: write nested for-loops to 1) make a random prediction for each sample and 2) go through 10 epochs.
    * An epoch is one pass through all of the training data.
    * Task 1 cont.: write a prediction function (named predict_multiplier) for this that guesses a value from -100 to 100.
2. Task 2: write a function (named calculate_loss) that subtracts the prediction from the true value.
    * The multiplier is the parameter we are trying to guess; the prediction is the multiplier times the number of seeds.
3. Task 3: create a variable that keeps track of the best (lowest) loss value, and call it best_loss.
    * Make a list called best_param_list that appends *another* list of the loss, predicted multiplier, actual multiplier, predicted target, actual target and number of seeds (making a list of lists) whenever a new best loss is found.
    * Try increasing the number of epochs.
4. Task 4: update the predict function to take in the previous step's prediction and loss to make the output more accurate.
    * Add a step before your loop to initialise these values.

In [None]:
#Task 1: write nested for-loops to 1) make a random prediction for each sample and 2) go through 10 epochs
#an epoch is one pass through all of the training data
#Task 1 cont: write a prediction function (named predict_multiplier) for this that guesses a value from -100 to 100

In [None]:
#task 2: write a function (named calculate_loss) that subtracts the prediction from the true value
#the multiplier is the parameter we are trying to guess, the prediction is the multiplier times the number of seeds

In [None]:
#task 3: create a variable that keeps track of the best (lowest loss value) - call it best loss
#make a list called best_param_list that appends *another* list of the loss, predicted multiplier, actual multiplier, predicted target, actual target and number of seeds (making a list of lists) whenever a new best loss is found
#try increasing the number of epochs

In [None]:
#task 4: update the predict function to take in the previous step's prediction and loss to make the output more accurate
#add a step before your loop to initialise these values
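
Before looking at the feedback-driven version, here is one possible sketch of Tasks 1 to 3 with purely random guesses. It is an illustration under stated assumptions rather than the reference solution: the seed is fixed, the multiplier is hard-coded to 15, and the noise is left out to keep the block short. The `lower` and `upper` parameters are an optional generalisation of the -100 to 100 range.

```python
import random
import numpy as np

random.seed(0)  # assumption: fixed seed so the run is repeatable

seed_count_array = np.arange(1, 11)
actual_multiplier = 15  # assumption: hard-coded for the sketch
apple_height_array = seed_count_array * actual_multiplier  # noise omitted

def predict_multiplier(lower=-100, upper=100):
    """Task 1: guess a multiplier at random, ignoring all feedback."""
    return random.uniform(lower, upper)

def calculate_loss(true_value, prediction):
    """Task 2: true value minus prediction."""
    return true_value - prediction

best_loss = float("inf")  # Task 3: smallest absolute loss seen so far
best_param_list = []

for epoch in range(10):                          # outer loop: epochs
    for sample in range(len(seed_count_array)):  # inner loop: samples
        guess = predict_multiplier()
        prediction = guess * seed_count_array[sample]
        loss = calculate_loss(apple_height_array[sample], prediction)
        if abs(loss) < best_loss:
            best_loss = abs(loss)
            # row order: loss, guess, actual multiplier,
            # predicted height, actual height, seed count
            best_param_list.append([loss, guess, actual_multiplier, prediction,
                                    apple_height_array[sample],
                                    seed_count_array[sample]])

for row in best_param_list:
    print(row)
```

Random guessing only improves by luck, which is what motivates Task 4: using the previous loss to decide where the next guess should go.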

In [81]:
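#task 1, step 1: defining predict_multiplier
#the random guess from task 1 is kept as comments at the bottom; the live body is a first attempt at using feedback,
#nudging the guess by 1 each step (the direction is right when loss is prediction minus actual)
#note: this version takes (loss, last_guess) and is replaced by the redefinition of predict_multiplier further down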

def predict_multiplier(loss, last_guess):
    if loss < 0:
        last_guess += 1
    elif loss > 0:
        last_guess -= 1
    return last_guess
    
    #number=random.uniform(-100, 100) #can change values -100 and 100 here to 'lower' and 'upper' to make it more flexible 
    #return number

In [82]:
#task 2, step 1: define a function called calculate loss that is the true value minus the prediction

def calculate_loss(true_value, prediction):
    return true_value - prediction
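
The sign of the loss carries the direction of the error, so the order of the arguments matters. A small worked example with made-up numbers (the values 15, 99 and -20 are assumptions chosen for illustration):

```python
def calculate_loss(true_value, prediction):
    return true_value - prediction

too_high = calculate_loss(15.0, 99.0)   # prediction above the true value
too_low = calculate_loss(15.0, -20.0)   # prediction below the true value
swapped = calculate_loss(99.0, 15.0)    # same numbers, arguments reversed

print(too_high, too_low, swapped)       # -84.0 35.0 84.0
```

With this definition a negative loss means the prediction was too high. Passing the arguments in the opposite order flips the sign, and any update rule that reads the sign has to agree with whichever order is used.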

In [83]:
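#task 4: redefine predict_multiplier so that it moves the previous prediction against the previous loss
#0.001 is the step size: each update shifts the guess by a small fraction of the loss
def predict_multiplier(previous_prediction, previous_loss):
    return previous_prediction - (previous_loss * 0.001)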
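
To see what a single update does, here is one step worked through with rounded, illustrative numbers (a guess of -26.465 for a one-seed apple of height 15.235; both are assumptions). The loss is taken as prediction minus actual, which is the sign this update rule needs in order to move towards the true value.

```python
def predict_multiplier(previous_prediction, previous_loss):
    return previous_prediction - (previous_loss * 0.001)

guess = -26.465    # assumption: an illustrative starting guess
seeds = 1
height = 15.235    # assumption: rounded height of a one-seed apple

loss = guess * seeds - height          # prediction minus actual, here negative
next_guess = predict_multiplier(guess, loss)

print(loss, next_guess)
```

The guess was too low, so the loss is negative and subtracting a fraction of it raises the guess slightly. The step is tiny because of the 0.001 factor, so many updates are needed to cover a large gap.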

In [88]:
#task 1, step 2: creating the nested for-loops

epoch_count = 10
sample_count = len(seed_count_array)
best_loss = 100 #smallest absolute loss seen so far (a loss of 100 or more is never recorded)
best_params_list = []

#initialise the first guess and its loss before the loop (task 4)
#note: the arguments are passed as (prediction, actual), so loss here is prediction minus actual;
#predict_multiplier relies on that sign to move the guess towards the true multiplier
predicted_multiplier = random.uniform(-100, 100)
loss = calculate_loss(predicted_multiplier * seed_count_array[0], apple_height_array[0])

for i in range(epoch_count):
    for j in range(sample_count):
        predicted_multiplier = predict_multiplier(predicted_multiplier, loss)
        loss = calculate_loss(predicted_multiplier * seed_count_array[j], apple_height_array[j])
        if abs(loss) < best_loss:
            best_loss = abs(loss)
            #note: apple_height_array stores the whole array in every row; apple_height_array[j] would store only this sample's height
            best_params_list.append([loss, predicted_multiplier, actual_multiplier, predicted_multiplier * seed_count_array[j], apple_height_array, seed_count_array[j]])
print(best_params_list)
                                    

#for epoch in range(epoch_count): #outer loops is the number of times you're going through the inner 10 samples
    #print('loss, guess, actual_multiplier, prediction, apple_height_array[sample], seed_count_array[sample]')
    #for sample in range(len(seed_count_array)): #inner loop
        #guess = predict_multiplier(loss, prediction_updated)
        #prediction = guess*seed_count_array[sample]
        #loss = calculate_loss(apple_height_array[sample], prediction) #include index [sample] so it doesnt go through the array
        #params = (loss, guess, actual_multiplier, prediction, apple_height_array[sample], seed_count_array[sample])
        #if abs(loss) < abs(best_loss): #abs takes the absolute value, as this can be positive or negative
            #best_loss = loss
            #best_params_list.append(params) #add this to update the best_params_list everytime a new best_loss is found
#print(best_params_list)   

[[-41.69981308707492, -26.46511482428977, 15, -26.46511482428977, array([ 15.23469826,  30.91905256,  45.17115108,  60.43974932,
        75.5677963 ,  90.29426103, 105.3740554 , 120.37647661,
       135.52071945, 150.3514626 ]), 1], [-39.46911153849706, -24.234413275711912, 15, -24.234413275711912, array([ 15.23469826,  30.91905256,  45.17115108,  60.43974932,
        75.5677963 ,  90.29426103, 105.3740554 , 120.37647661,
       135.52071945, 150.3514626 ]), 1], [-37.35819418634011, -22.123495923554955, 15, -22.123495923554955, array([ 15.23469826,  30.91905256,  45.17115108,  60.43974932,
        75.5677963 ,  90.29426103, 105.3740554 , 120.37647661,
       135.52071945, 150.3514626 ]), 1], [-35.36062885964808, -20.12593059686293, 15, -20.12593059686293, array([ 15.23469826,  30.91905256,  45.17115108,  60.43974932,
        75.5677963 ,  90.29426103, 105.3740554 , 120.37647661,
       135.52071945, 150.3514626 ]), 1], [-33.4703287821368, -18.235630519351645, 15, -18.235630519351645, a

In [73]:
print(best_loss)
for row in best_params_list:
    print(row)

-83.76530173721486
loss, guess, actual_multiplier, prediction, apple_height_array[sample], seed_count_array[sample]
(-83.76530173721486, 99, 15, 99, 15.23469826278515, 1)
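
Task 3 suggests increasing the number of epochs. In the list-of-lists output further up, the recorded loss shrinks by only about 5% per epoch (from -41.70 to -39.47), so 10 epochs leave the guess far from 15. The sketch below wraps the same update rule in a hypothetical `train` helper so the epoch count can be varied. The data is copied from the printed arrays, the starting guess of -26.465 is an illustrative assumption, and, as a simplification, the loss is computed on the current sample just before each update.

```python
import numpy as np

# Data copied from the printed output earlier (multiplier 15 plus noise)
seed_count_array = np.arange(1, 11)
apple_height_array = np.array([
    15.23469826, 30.91905256, 45.17115108, 60.43974932, 75.5677963,
    90.29426103, 105.3740554, 120.37647661, 135.52071945, 150.3514626,
])

def train(start, step_size, epoch_count):
    """Hypothetical helper: repeat the update rule and return the final guess."""
    guess = start
    for _ in range(epoch_count):
        for seeds, height in zip(seed_count_array, apple_height_array):
            loss = guess * seeds - height      # prediction minus actual
            guess = guess - loss * step_size   # same rule as predict_multiplier
    return guess

short_run = train(start=-26.465, step_size=0.001, epoch_count=10)
long_run = train(start=-26.465, step_size=0.001, epoch_count=500)

print(short_run, long_run)
```

With these settings the 10-epoch run should still be well below zero, while the 500-epoch run should settle close to 15. The estimate lands slightly above 15 because the noise is always positive.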
