# Generating dataset

I did my best but couldn’t find a good dataset of alien life forms, so let’s go another way and simulate one. It’s a useful exercise in its own right: it gives you a probabilistic perspective and helps you reason about your data in statistical terms, as random variables.

Reference for NumPy's random sampling routines: https://docs.scipy.org/doc/numpy/reference/routines.random.html

NumPy is a Python package for scientific computing. It provides multidimensional arrays, random number generation, linear algebra and other useful things. We’ll start by importing it:

In [28]:
import numpy as np

The name `np` now refers to the `numpy` library.

Next we set the seed of the pseudo-random number generator. This is needed to get the same results when re-running the code; if you forget to do it, you’ll get different random numbers every time you run the code. `123` here is just an arbitrary constant. You can change it to 42 or any other number.

In [29]:
np.random.seed(123)
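To see what the seed buys you, here is a minimal sketch. It uses separate `np.random.RandomState` generators rather than `np.random.seed`, purely so that it does not disturb the global generator state used by the cells below; the names `rng_a`, `rng_b` and `rng_c` are illustrative only.

```python
import numpy as np

# Independent generators: two share a seed, one uses a different seed.
# RandomState objects do not touch the global np.random state.
rng_a = np.random.RandomState(123)
rng_b = np.random.RandomState(123)
rng_c = np.random.RandomState(42)

draw_a = rng_a.normal(size=5)
draw_b = rng_b.normal(size=5)
draw_c = rng_c.normal(size=5)

# Same seed replays the same sequence; a different seed gives other numbers.
same_seed_matches = np.array_equal(draw_a, draw_b)
different_seed_matches = np.array_equal(draw_a, draw_c)
print(same_seed_matches, different_seed_matches)  # prints: True False
```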

## Generating labels

In [30]:
labels = ["rabbosaurus", "platyhog"]
y = np.random.choice(labels, size=1000, replace=True)

In [31]:
y[:50]

array(['rabbosaurus', 'platyhog', 'rabbosaurus', 'rabbosaurus',
       'rabbosaurus', 'rabbosaurus', 'rabbosaurus', 'platyhog', 'platyhog',
       'rabbosaurus', 'platyhog', 'platyhog', 'rabbosaurus', 'platyhog',
       'rabbosaurus', 'platyhog', 'rabbosaurus', 'platyhog', 'platyhog',
       'rabbosaurus', 'rabbosaurus', 'rabbosaurus', 'platyhog', 'platyhog',
       'platyhog', 'rabbosaurus', 'platyhog', 'rabbosaurus', 'rabbosaurus',
       'rabbosaurus', 'rabbosaurus', 'platyhog', 'platyhog', 'platyhog',
       'rabbosaurus', 'rabbosaurus', 'platyhog', 'rabbosaurus',
       'rabbosaurus', 'platyhog', 'rabbosaurus', 'platyhog', 'rabbosaurus',
       'platyhog', 'platyhog', 'platyhog', 'rabbosaurus', 'rabbosaurus',
       'rabbosaurus', 'rabbosaurus'],
      dtype='|S11')
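Since `np.random.choice` picks each label with equal probability by default, the two classes should come out roughly balanced. A quick way to check this, sketched here on a separately generated array (`y_demo` and the seed `0` are assumptions made for the example, not the notebook's `y`):

```python
import numpy as np

rng = np.random.RandomState(0)  # local generator with an arbitrary seed
labels = ["rabbosaurus", "platyhog"]
y_demo = rng.choice(labels, size=1000, replace=True)

# np.unique with return_counts=True gives the number of samples per class.
names, counts = np.unique(y_demo, return_counts=True)
balance = dict(zip(names.tolist(), counts.tolist()))
print(balance)
```

The same two lines applied to `y` would show the class balance of the actual dataset.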

## Creating random length feature

In [32]:
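# Number of samples in each class
count_r = y[y == labels[0]].shape[0]
count_p = y.shape[0] - count_r

# Lengths are normally distributed with standard deviation 5:
# mean 30 for rabbosaurus, mean 20 for platyhog
length_f = np.zeros(y.shape[0])
length_f[y == labels[0]] = np.random.normal(loc=30, scale=5, size=count_r)
length_f[y == labels[1]] = np.random.normal(loc=20, scale=5, size=count_p)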

In [33]:
length_f[:50]

array([ 27.54513859,  12.14735699,  23.45417343,  29.95669767,
        34.88406491,  21.24464825,  26.67071516,  11.0305554 ,
        18.67506774,  30.17970251,  15.53402026,  29.29237205,
        34.25051442,  20.29268877,  31.9143512 ,  10.28925228,
        31.62731814,  27.09364638,  20.80855155,  28.87843607,
        32.40937129,  35.07151942,  23.5248974 ,  23.41017388,
        21.48278284,  21.45504112,  22.61671315,  33.642677  ,
        29.50620097,  27.35005568,  17.78462107,  21.19380336,
        14.46807047,  21.83366099,  23.09824338,  38.49029496,
        25.11952752,  26.55725734,  24.57615614,  18.94971794,
        27.71787022,  22.75651109,  26.27426389,  22.09794573,
        29.07826032,  18.7362485 ,  30.62179316,  37.58486839,
        27.06692025,  30.77145025])
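A simple sanity check for a simulated feature is to compare the per-class sample statistics with the parameters that were passed to `np.random.normal`. The sketch below rebuilds a small stand-in dataset with a local generator (`y_demo`, `length_demo` and the seed are assumptions for illustration) and follows the same masked-assignment pattern as above:

```python
import numpy as np

rng = np.random.RandomState(0)  # local generator with an arbitrary seed
labels = ["rabbosaurus", "platyhog"]
y_demo = rng.choice(labels, size=1000)
is_r = y_demo == labels[0]

# Fill each class's slots with draws from that class's normal distribution.
length_demo = np.zeros(y_demo.shape[0])
length_demo[is_r] = rng.normal(loc=30, scale=5, size=is_r.sum())
length_demo[~is_r] = rng.normal(loc=20, scale=5, size=(~is_r).sum())

# Sample means and standard deviations should land close to 30/20 and 5.
mean_r, std_r = length_demo[is_r].mean(), length_demo[is_r].std()
mean_p, std_p = length_demo[~is_r].mean(), length_demo[~is_r].std()
print(round(mean_r, 2), round(std_r, 2))
print(round(mean_p, 2), round(std_p, 2))
```

Because the two means are only two standard deviations apart, the length distributions overlap, so length alone cannot separate the classes perfectly.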

## Creating random color feature

In [34]:
colors = ["space gray", "light black", "pink gold", "purple polka-dot"]
col_r = colors[:3]
col_p = colors[1:]
print(col_r, col_p)

(['space gray', 'light black', 'pink gold'], ['light black', 'pink gold', 'purple polka-dot'])


In [42]:
colors_f = np.zeros(y.shape[0], dtype="S16")
colors_f[y == labels[0]] = np.random.choice(col_r, size=count_r, replace=True)
colors_f[y == labels[1]] = np.random.choice(col_p, size=count_p, replace=True)

In [43]:
colors_f[:50]

array(['pink gold', 'pink gold', 'light black', 'pink gold', 'light black',
       'pink gold', 'space gray', 'pink gold', 'pink gold', 'light black',
       'pink gold', 'pink gold', 'light black', 'pink gold', 'pink gold',
       'pink gold', 'light black', 'pink gold', 'pink gold', 'space gray',
       'light black', 'light black', 'light black', 'pink gold',
       'purple polka-dot', 'pink gold', 'purple polka-dot', 'light black',
       'space gray', 'space gray', 'space gray', 'purple polka-dot',
       'light black', 'light black', 'light black', 'light black',
       'pink gold', 'light black', 'space gray', 'purple polka-dot',
       'space gray', 'pink gold', 'light black', 'light black',
       'pink gold', 'pink gold', 'space gray', 'pink gold', 'pink gold',
       'space gray'],
      dtype='|S16')
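The two color lists overlap on purpose: "light black" and "pink gold" occur in both classes, while "space gray" and "purple polka-dot" each identify one class. The following sketch verifies that on a stand-in dataset. It assumes a local generator and a Unicode dtype (`"U16"`) instead of `"S16"`, because on Python 3 an `"S16"` array holds bytes rather than `str`; the `_demo` names are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)  # local generator with an arbitrary seed
labels = ["rabbosaurus", "platyhog"]
colors = ["space gray", "light black", "pink gold", "purple polka-dot"]
y_demo = rng.choice(labels, size=1000)
is_r = y_demo == labels[0]

# "U16" is wide enough for the longest name ("purple polka-dot", 16 chars).
colors_demo = np.empty(y_demo.shape[0], dtype="U16")
colors_demo[is_r] = rng.choice(colors[:3], size=is_r.sum())
colors_demo[~is_r] = rng.choice(colors[1:], size=(~is_r).sum())

# Colors that were observed in exactly one of the two classes.
seen_r = set(colors_demo[is_r].tolist())
seen_p = set(colors_demo[~is_r].tolist())
only_r = seen_r - seen_p
only_p = seen_p - seen_r
print(only_r, only_p)
```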

## Creating random fluffiness feature

In [44]:
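fluffy_f = np.zeros(y.shape[0], dtype="bool")
# randint(0, 100, n) draws integers from 0 to 99 inclusive
fluffy_f[y == labels[0]] = np.random.randint(0, 100, count_r) < 90  # fluffy with probability 0.90
fluffy_f[y == labels[1]] = np.random.randint(0, 100, count_p) > 70  # fluffy with probability 0.29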

In [45]:
fluffy_f[:50]

array([ True, False,  True,  True,  True, False,  True,  True,  True,
        True,  True, False,  True,  True,  True,  True,  True, False,
       False,  True, False,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True, False, False, False,  True,  True,
       False,  True,  True, False,  True, False, False,  True, False,
        True,  True,  True,  True,  True], dtype=bool)
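The thresholds translate directly into probabilities. `np.random.randint(0, 100, n)` returns integers from 0 to 99, each equally likely, so the chance of being fluffy is the fraction of those 100 values that pass the comparison. A small sketch that enumerates them (the variable names are illustrative):

```python
import numpy as np

# Every value randint(0, 100) can return: 0, 1, ..., 99.
values = np.arange(0, 100)

# Fraction of values passing each comparison = probability of being fluffy.
p_fluffy_r = np.mean(values < 90)  # 90 of 100 values (0..89)
p_fluffy_p = np.mean(values > 70)  # 29 of 100 values (71..99)
print(p_fluffy_r, p_fluffy_p)
```

Note that `> 70` gives 0.29 rather than 0.30, because the value 70 itself is excluded.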

## Creating data frame

In [46]:
import pandas as pd

In [47]:
df = pd.DataFrame(data={'length': length_f, 'color': colors_f, 'fluffy': fluffy_f, 'label': y})
df = df[['length', 'color', 'fluffy', 'label']]

In [48]:
df.head()

|   | length | color | fluffy | label |
|---|---|---|---|---|
| 0 | 27.545139 | pink gold | True | rabbosaurus |
| 1 | 12.147357 | pink gold | False | platyhog |
| 2 | 23.454173 | light black | True | rabbosaurus |
| 3 | 29.956698 | pink gold | True | rabbosaurus |
| 4 | 34.884065 | light black | True | rabbosaurus |


In [49]:
df.tail()

|   | length | color | fluffy | label |
|---|---|---|---|---|
| 995 | 27.621761 | pink gold | False | rabbosaurus |
| 996 | 20.436062 | purple polka-dot | False | platyhog |
| 997 | 24.937142 | pink gold | False | platyhog |
| 998 | 25.140316 | light black | True | rabbosaurus |
| 999 | 21.387244 | light black | True | platyhog |


In [50]:
df.to_csv("extraterrestrials.csv", sep='\t', encoding='utf-8')
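Because the file is written with a tab separator and includes the row index, both facts have to be repeated when loading it again. A minimal round-trip sketch on a tiny stand-in frame, written to a temporary directory so it does not overwrite the real file (`demo`, `path` and the two sample rows are assumptions for the example):

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame(
    {"length": [27.5, 12.1], "label": ["rabbosaurus", "platyhog"]},
    columns=["length", "label"],
)

path = os.path.join(tempfile.mkdtemp(), "demo.csv")
demo.to_csv(path, sep="\t", encoding="utf-8")

# Pass the same separator; index_col=0 turns the first column back into the index.
restored = pd.read_csv(path, sep="\t", index_col=0, encoding="utf-8")

# Without index_col, the saved index shows up as a column named "Unnamed: 0".
naive = pd.read_csv(path, sep="\t", encoding="utf-8")
print(list(restored.columns))
print(list(naive.columns))
```

For the dataset above, the equivalent call would be `pd.read_csv("extraterrestrials.csv", sep='\t', index_col=0)`.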