In [49]:
# set matplotlib backend to inline
%matplotlib inline 

# import modules
from sklearn import datasets # import datasets
import numpy as np # import numpy
import matplotlib.pyplot as plt # import plots
import pandas as pd


### Preparing some (random) data
 This code generates random numbers to use as a fake dataset of 40 samples with 3 features each.

In [50]:
# set the seed for the 'random' number generator
mySeed=1234567

# Seed the pseudo-random number generator so that the same numbers are generated on every run
np.random.seed(mySeed)

# create some fake data using random numbers 
D = np.random.random(size=(40,3))
print(D.shape)
D[:5,:] # Show the first 5 samples


(40, 3)


array([[0.23702917, 0.00764837, 0.01983031],
       [0.31309262, 0.09945466, 0.19517429],
       [0.20729802, 0.16493119, 0.71187896],
       [0.03206667, 0.19736962, 0.96455696],
       [0.57389458, 0.69922766, 0.97464142]])
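To see what the seed does, here is a minimal sketch (the names `seed`, `first` and `second` are illustrative and not part of the notebook). It assumes only NumPy, and shows that resetting the same seed before each draw reproduces exactly the same array.

```python
import numpy as np

seed = 1234567  # same value as mySeed above

np.random.seed(seed)
first = np.random.random(size=(40, 3))

# Resetting the seed restarts the generator from the same state,
# so the next draw repeats the previous one
np.random.seed(seed)
second = np.random.random(size=(40, 3))

print(np.array_equal(first, second))  # True: both draws are identical
```

Without the second call to `np.random.seed`, the two arrays would differ, because the generator carries on from wherever the first draw left it.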

### Random indices

 In this section of code, we create an array of indices for each data sample, randomly shuffle them, and split into several distinct subsets (folds).
 

In [59]:

# Create an array of indices for the data
D_indices = np.arange(0,len(D),1)
print('indices: %s\n' % D_indices) 

# randomly shuffle the indices
random_indices = np.random.permutation(D_indices)
print('random indices: {}\n'.format( random_indices) ) 

# split the indices into 4 different subsets, or folds
split_indices = np.array_split(random_indices, 4)
print('split indices: {}\n'.format( split_indices) ) 

# we can access a single fold of these indices simply by calling, e.g.:
print('fold 2: {}\n' .format(split_indices[1]))


indices: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39]

random indices: [36 15  9 13  3  5  2 31 24 12 30 21 27 38 33 26 37 23 35  0 32  1  6 29
 19 18 25 14 22  4 39 16 10 34 20  8 11 28  7 17]

split indices: [array([36, 15,  9, 13,  3,  5,  2, 31, 24, 12]), array([30, 21, 27, 38, 33, 26, 37, 23, 35,  0]), array([32,  1,  6, 29, 19, 18, 25, 14, 22,  4]), array([39, 16, 10, 34, 20,  8, 11, 28,  7, 17])]

fold 2: [30 21 27 38 33 26 37 23 35  0]
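The 40 indices above divide evenly into 4 folds of 10. As a small sketch of what happens when they do not (the array `indices` and the list `fold_sizes` are illustrative names), `np.array_split` makes the first few folds one element larger, whereas `np.split` refuses an uneven division.

```python
import numpy as np

indices = np.arange(10)              # 10 indices, not divisible by 4
folds = np.array_split(indices, 4)   # uneven folds are allowed

fold_sizes = [len(fold) for fold in folds]
print(fold_sizes)  # [3, 3, 2, 2]

# np.split, by contrast, insists on folds of equal size
try:
    np.split(indices, 4)
except ValueError as err:
    print('np.split failed: {}'.format(err))
```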



### Using indices to work with the original data
 It is often easier to work on subsets of indices (because they are typically one-dimensional). These can then be used to select 'windows' (subsets of rows) of the original data. 


In [61]:
# e.g. if we want to work with a random fold of the original data,
# we can use that fold's indices to select the matching rows, e.g. for fold 1:

iFold = 0  # zero-based position in split_indices, so 0 selects fold 1

Data_fold_1 = D[split_indices[iFold],:]
Data_fold_1[:5,:] # Show the first 5 samples of this fold


array([[0.62156788, 0.10204083, 0.34336421],
       [0.26437972, 0.16516002, 0.23780665],
       [0.22100605, 0.75291129, 0.34302302],
       [0.36309048, 0.8702888 , 0.38478627],
       [0.03206667, 0.19736962, 0.96455696]])
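One practical benefit of working with indices is that the same index array can select rows from several aligned arrays at once. The sketch below assumes a hypothetical feature matrix `X` and label vector `y` (neither appears in this notebook) and shows that features and labels stay matched after selection.

```python
import numpy as np

np.random.seed(0)                      # arbitrary seed, for illustration only
X = np.random.random(size=(8, 3))      # hypothetical feature matrix
y = np.arange(8) * 10                  # hypothetical labels: 0, 10, ..., 70

fold = np.array([5, 0, 3])             # one fold of (shuffled) row indices

X_fold = X[fold, :]                    # rows 5, 0 and 3 of X, in that order
y_fold = y[fold]                       # the labels of those same rows

print(y_fold.tolist())                   # [50, 0, 30]
print(np.array_equal(X_fold[0], X[5]))   # True: first selected row is row 5
```

Note that indexing with an integer array returns a copy, so changing `X_fold` leaves `X` untouched.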

### Combining folds of data

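You might have several folds of data that you want to combine into a single dataset. One way to do this is to collect the index arrays of the folds you want in a list and then concatenate them into a single array of indices: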



In [1]:
selected = []

# say we want to select folds 2 and 4 (zero-based positions 1 and 3 in split_indices)
for idx in [1,3]:
    selected.append(split_indices[idx])
    
print('selected folds (a list of arrays): {}\n'.format(selected)) 

# We now need to flatten the list of arrays into a single array using np.concatenate:
combined_indices = np.concatenate(selected)
print('combined array of indices: {}\n'.format(combined_indices)) 
    
# This array can now be used as an index to select the relevant data, e.g. D[combined_indices,:]
# Note: run the 'Random indices' cell first, otherwise split_indices is not defined (NameError)
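A common reason for combining folds is cross-validation: one fold is held out for testing and the rest are combined for training. The following is a minimal sketch under that assumption, not a definitive implementation. It rebuilds `D` and `split_indices` so that it runs on its own, and the names `i_test`, `train_idx` and `test_idx` are illustrative.

```python
import numpy as np

np.random.seed(1234567)
D = np.random.random(size=(40, 3))

# Shuffle the row indices and split them into 4 folds
split_indices = np.array_split(np.random.permutation(len(D)), 4)

for i_test in range(len(split_indices)):
    # Hold out one fold for testing ...
    test_idx = split_indices[i_test]
    # ... and combine all remaining folds into the training indices
    train_idx = np.concatenate(
        [fold for i, fold in enumerate(split_indices) if i != i_test]
    )

    D_train, D_test = D[train_idx, :], D[test_idx, :]
    print('test fold {}: train {} / test {}'.format(
        i_test + 1, D_train.shape, D_test.shape))
```

Each pass uses 30 samples for training and 10 for testing, and every sample appears in exactly one test fold.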

## Docstrings and comments

Whenever you are writing code, it is good practice to document what you are doing in as clear and readable a way as possible. Inline comments, e.g. ```# This code splits the data into a training and a test set```, should clarify what the code is doing, particularly where the code is difficult to understand. Comments are also important for documenting what you intend the code to do, even if the code doesn't actually work as you'd hoped (this also helps with coursework, because we can award marks for the idea, irrespective of whether the code actually implements that idea). For some hints on good commenting style, have a look here: https://stackabuse.com/commenting-python-code/.

Another important form of documentation is the docstring. Docstrings allow us to quickly understand both what a function does and how it might be used. They typically include information on the input arguments and on what the function returns. An example is shown below. For reference on how to write docstrings, have a look here: https://numpydoc.readthedocs.io/en/latest/format.html. (Typically, you can see a function's docstring in Jupyter by placing the cursor on the name of a function that has already been defined and pressing SHIFT+TAB twice; try this on your own functions.)

Finally, in Jupyter Notebooks like this one, you can document your code using Markdown cells. You might want to use Markdown to introduce certain sections of code in more depth than the docstring covers, or simply to break the code into more readable sections. In the coursework, we ask that you use Markdown cells to give your written answers to the questions. 



In [111]:

def myExampleFunction(my_input):
    """ 
    Summary line (e.g. this function doubles the input passed to it) 
  
    Extended description of function. 
  
    Parameters: 
    my_input (int): Description of my_input 
  
    Returns: 
    int: Description of return value 
  
    """
    
    # Comment on what the code is doing 
    # ...
    double_input = my_input * 2
        
    return double_input

 
# print out the docstring for the function 
print( myExampleFunction.__doc__ )
    
# you can access the docstring information while coding by typing the function name:
myExampleFunction(2) # and pressing SHIFT+TAB twice over it

 
    Summary line (e.g. this function doubles the input passed to it) 
  
    Extended description of function. 
  
    Parameters: 
    my_input (int): Description of my_input 
  
    Returns: 
    int: Description of return value 
  
    


4
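The example above uses a compact docstring layout. The numpydoc guide linked earlier uses section headers underlined with dashes instead. As a sketch of the same function written in that style (the name `double_value` is hypothetical and chosen only for this illustration):

```python
def double_value(my_input):
    """
    Double the input passed to the function.

    Parameters
    ----------
    my_input : int
        The number to double.

    Returns
    -------
    int
        Twice the value of `my_input`.

    Examples
    --------
    >>> double_value(2)
    4
    """
    # Multiply by two and return the result
    return my_input * 2


# The docstring is available at runtime, exactly as before
print(double_value.__doc__)

# help() shows the signature followed by the docstring
help(double_value)
```

Either layout works with `__doc__`, `help()` and SHIFT+TAB; what matters most is choosing one style and using it consistently.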