## Goal: Exploring NumPy's random seed function

Functions used:
1. np.random.rand:
    - Generates random values in a given shape.
    - Creates an array of the given shape and populates it with random samples from a uniform distribution over [0, 1).
    - Source: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.rand.html

2. np.random.seed: sets the seed of NumPy's global random number generator, so that the same sequence of numbers can be reproduced.

For more details, please refer to this article: https://limitlessdatascience.wordpress.com/2019/02/18/use-of-numpy-random-seed-and-random_state-in-train_test-split-function/
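To make the description of np.random.rand above concrete, here is a minimal sketch (the variable names `flat`, `grid` and `in_range` are only illustrative). It draws two arrays of different shapes and checks that every sample falls in [0, 1).

```python
import numpy as np

# Draw a 1-D array of 4 samples and a 2x3 array of samples.
flat = np.random.rand(4)
grid = np.random.rand(2, 3)

print(flat.shape)  # (4,)
print(grid.shape)  # (2, 3)

# Every sample lies in the half-open interval [0, 1).
in_range = bool(
    ((flat >= 0) & (flat < 1)).all() and ((grid >= 0) & (grid < 1)).all()
)
print(in_range)  # True
```

The values themselves change from run to run because no seed is set here.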

In [1]:
import numpy as np
print('version is:', np.__version__)

version is: 1.14.3


In [2]:
def repeat_number_generation_print(seed, NumbersToGenerate):
    # Re-seed the global generator before drawing, but only if a seed is given.
    if seed is not None:
        np.random.seed(seed)

    print(np.random.rand(NumbersToGenerate))

def iterate_and_generate_numbers(seed=None, NumbersToGenerate=4):
    # Generate and print the numbers three times in a row.
    for _ in range(3):
        repeat_number_generation_print(seed, NumbersToGenerate)
    

In [3]:
iterate_and_generate_numbers()  #default values i.e. no seed and 4 numbers

[0.63643065 0.84006644 0.52892263 0.57689538]
[0.38239854 0.92918312 0.68398589 0.95084668]
[0.44077475 0.47551624 0.35607661 0.81084369]


### Insight: Without a seed, every execution of the same line of code gives different numbers

# For seed = 0

In [4]:
iterate_and_generate_numbers(seed = 0)

[0.5488135  0.71518937 0.60276338 0.54488318]
[0.5488135  0.71518937 0.60276338 0.54488318]
[0.5488135  0.71518937 0.60276338 0.54488318]


In [5]:
# Trying once more after restarting the notebook kernel.
iterate_and_generate_numbers(seed = 0)

[0.5488135  0.71518937 0.60276338 0.54488318]
[0.5488135  0.71518937 0.60276338 0.54488318]
[0.5488135  0.71518937 0.60276338 0.54488318]


### Insights:
1. For seed = 0 we get the same set of values in the same order each time.
2. After restarting the notebook kernel, the same numbers are still generated in the same order.
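The two insights above can also be checked programmatically instead of by eye. The following is a small sketch (the names `first`, `second` and `third` are illustrative): re-seeding with the same value rewinds the generator, while drawing again without re-seeding continues the sequence.

```python
import numpy as np

np.random.seed(0)
first = np.random.rand(4)

# Without re-seeding, the generator keeps advancing through its sequence.
second = np.random.rand(4)

# Re-seeding with the same value puts the generator back in the same state.
np.random.seed(0)
third = np.random.rand(4)

print(np.array_equal(first, second))  # False
print(np.array_equal(first, third))   # True
```

This is why the helper in this notebook prints three identical rows: it calls np.random.seed before every draw.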

## For seed = 1, 2, 3, 1000: what's the difference?

In [6]:
iterate_and_generate_numbers(seed = 1)

[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01]
[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01]
[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01]


In [7]:
iterate_and_generate_numbers(seed = 2)

[0.4359949  0.02592623 0.54966248 0.43532239]
[0.4359949  0.02592623 0.54966248 0.43532239]
[0.4359949  0.02592623 0.54966248 0.43532239]


In [8]:
iterate_and_generate_numbers(seed = 3)

[0.5507979  0.70814782 0.29090474 0.51082761]
[0.5507979  0.70814782 0.29090474 0.51082761]
[0.5507979  0.70814782 0.29090474 0.51082761]


In [9]:
iterate_and_generate_numbers(seed = 1000)

[0.65358959 0.11500694 0.95028286 0.4821914 ]
[0.65358959 0.11500694 0.95028286 0.4821914 ]
[0.65358959 0.11500694 0.95028286 0.4821914 ]


### Insight:
1. Each seed value produces its own sequence of numbers, different from the sequences of the other seeds, and the same seed always reproduces the same sequence.
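As a rough check of this insight, the sketch below compares the sequences for the seeds used in this notebook. The helper `draw` is hypothetical and exists only for this illustration.

```python
import numpy as np

def draw(seed, n=4):
    """Seed the global generator, then draw n uniform samples (illustrative helper)."""
    np.random.seed(seed)
    return np.random.rand(n)

samples = {seed: draw(seed) for seed in (0, 1, 2, 3, 1000)}

# The same seed reproduces the same sequence.
same_seed_matches = np.array_equal(draw(1), samples[1])

# For these seeds, no two sequences are equal.
seeds = list(samples)
all_distinct = all(
    not np.array_equal(samples[a], samples[b])
    for i, a in enumerate(seeds)
    for b in seeds[i + 1:]
)

print(same_seed_matches)  # True
print(all_distinct)       # True
```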

# Seed = 1 and random numbers to be generated = 4, 6 and 8

In [10]:
iterate_and_generate_numbers(seed = 1)   #by default NumbersToGenerate = 4

[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01]
[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01]
[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01]


In [11]:
iterate_and_generate_numbers(seed = 1, NumbersToGenerate = 6)

[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01
 1.46755891e-01 9.23385948e-02]
[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01
 1.46755891e-01 9.23385948e-02]
[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01
 1.46755891e-01 9.23385948e-02]


In [12]:
iterate_and_generate_numbers(seed = 1, NumbersToGenerate = 8)

[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01
 1.46755891e-01 9.23385948e-02 1.86260211e-01 3.45560727e-01]
[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01
 1.46755891e-01 9.23385948e-02 1.86260211e-01 3.45560727e-01]
[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01
 1.46755891e-01 9.23385948e-02 1.86260211e-01 3.45560727e-01]


### Insights:
1. The first 4 random numbers generated for seed = 1 are the same, and in the same order, for NumbersToGenerate = 4, 6 and 8.
2. The first 6 random numbers are the same for NumbersToGenerate = 6 and 8.
3. Additional random numbers are appended at the end; the earlier values and their order do not change.
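The prefix behaviour described above can be verified with a short sketch. It assumes the legacy global generator used throughout this notebook (np.random.seed with np.random.rand); the helper `draw` is hypothetical.

```python
import numpy as np

def draw(seed, n):
    """Seed the global generator, then draw n uniform samples (illustrative helper)."""
    np.random.seed(seed)
    return np.random.rand(n)

four, six, eight = draw(1, 4), draw(1, 6), draw(1, 8)

# A longer draw starts with exactly the values of a shorter draw.
print(np.array_equal(six[:4], four))   # True
print(np.array_equal(eight[:6], six))  # True

# Drawing 4 and then 2 more without re-seeding gives the same 6 values.
np.random.seed(1)
in_two_steps = np.concatenate([np.random.rand(4), np.random.rand(2)])
print(np.array_equal(in_two_steps, six))  # True
```

In other words, the seed fixes one long sequence, and each call simply takes the next values from it.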

# sklearn train-test split

In [13]:
# let's generate the input dataset
import pandas as pd
print('version of pd', pd.__version__)

version of pd 0.23.0


In [14]:
# column names are passed as a list; a set is unordered and is rejected by newer pandas versions
pdInput = pd.DataFrame([1,2,3,4,5,6,7,8,9,10], columns = ['Feature1'])
pdInput['Feature2'] = [1,2,3,4,5,6,7,8,9,10]
pdInput['Output'] = [1,2,3,4,5,6,7,8,9,10]
pdInput

   Feature1  Feature2  Output
0         1         1       1
1         2         2       2
2         3         3       3
3         4         4       4
4         5         5       5
5         6         6       6
6         7         7       7
7         8         8       8
8         9         9       9
9        10        10      10


In [15]:
X = pdInput[['Feature1', 'Feature2']].values
y = pdInput['Output']

In [16]:
from sklearn.model_selection import train_test_split

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
X_train

array([[ 6,  6],
       [10, 10],
       [ 5,  5],
       [ 3,  3],
       [ 7,  7]], dtype=int64)

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
X_train

array([[7, 7],
       [2, 2],
       [8, 8],
       [5, 5],
       [3, 3]], dtype=int64)

### Insight:
1. Without a fixed random_state, each execution returns a different training set, in a different (shuffled) order.

# Set seed for train-test split
- In machine learning and deep learning we run the same code many times while tuning, so we need the same training and test sets on every run, even after restarting the kernel or the machine.
- Here I am going to set the seed in the train/test split itself (through random_state), not in the random number generation.
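The sketch below contrasts the two ways of fixing the split, using a small stand-in dataset rather than the DataFrame above (all variable names are illustrative). As far as I understand scikit-learn's behaviour, when random_state is left unset train_test_split falls back to NumPy's global generator, so calling np.random.seed beforehand also makes the split repeatable; treat that second option as an assumption to verify on your own version.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: ten samples with one feature each.
X_demo = np.arange(1, 11).reshape(-1, 1)
y_demo = np.arange(1, 11)

# Option 1: pass random_state, which fixes only this split.
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)
print(np.array_equal(a_train, b_train))  # True

# Train and test together still contain every sample exactly once.
recombined = np.sort(np.concatenate([a_train, a_test]).ravel())
print(np.array_equal(recombined, y_demo))  # True

# Option 2 (assumption, see above): seed the global generator instead.
np.random.seed(0)
c_train, _, _, _ = train_test_split(X_demo, y_demo, test_size=0.5)
np.random.seed(0)
d_train, _, _, _ = train_test_split(X_demo, y_demo, test_size=0.5)
print(np.array_equal(c_train, d_train))  # True
```

Passing random_state is usually the safer choice, because the global seed can be changed or consumed by any other code that draws random numbers before the split.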

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
X_train

array([[7, 7],
       [8, 8],
       [4, 4],
       [1, 1],
       [6, 6]], dtype=int64)

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
X_train

array([[7, 7],
       [8, 8],
       [4, 4],
       [1, 1],
       [6, 6]], dtype=int64)

In [21]:
# re-running the same split after restarting the kernel
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
X_train  # got same set of values in the same order after kernel restart

array([[7, 7],
       [8, 8],
       [4, 4],
       [1, 1],
       [6, 6]], dtype=int64)

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
X_train 

array([[4, 4],
       [2, 2],
       [8, 8],
       [9, 9],
       [6, 6]], dtype=int64)

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=2)
X_train 

array([[ 3,  3],
       [ 4,  4],
       [ 7,  7],
       [10, 10],
       [ 9,  9]], dtype=int64)

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=3)
X_train 

array([[7, 7],
       [8, 8],
       [1, 1],
       [4, 4],
       [9, 9]], dtype=int64)