# Introduction to NumPy

#### Credit: https://www.machinelearningplus.com/python/numpy-tutorial-part1-array-python-examples/

## How to get index locations that satisfy a given condition using np.where?

np.where locates the positions in the array where a given condition holds true.



In [3]:
# create an array
import numpy as np
arr_rand = np.array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2])
print("Array: ", arr_rand)

# positions where value > 5
positions = np.where(arr_rand > 5)
print("Positions where value > 5: ", positions)

Array:  [8 8 3 7 7 0 4 2 5 2]
Positions where value > 5:  (array([0, 1, 3, 4]),)


In [5]:
# take items at given index
arr_rand.take(positions)

array([[8, 8, 7, 7]])
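Note that `take` with the tuple from np.where returns a 2-D result, as the output above shows. As a small sketch of two alternatives (reusing the same `arr_rand`), indexing with the tuple directly, or with the boolean condition itself, gives a flat 1-D array:

```python
import numpy as np

arr_rand = np.array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2])

# np.where with only a condition returns a tuple: one index array per dimension
positions = np.where(arr_rand > 5)

# indexing with that tuple yields a 1-D array of the matching items
picked_by_index = arr_rand[positions]

# a boolean mask selects the same items without computing positions first
picked_by_mask = arr_rand[arr_rand > 5]

print(picked_by_index)  # [8 8 7 7]
print(picked_by_mask)   # [8 8 7 7]
```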

np.where also accepts two more optional arguments, x and y. Wherever the condition is true, the result takes the value from 'x'; otherwise it takes the value from 'y'.

In [6]:
# if value > 5, then yield 'gt5' else 'le5'
np.where(arr_rand > 5, 'gt5', 'le5')

array(['gt5', 'gt5', 'le5', 'gt5', 'gt5', 'le5', 'le5', 'le5', 'le5',
       'le5'], dtype='<U3')
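The x and y arguments do not have to be constants. As a minimal sketch, assuming we want to keep values above 5 and replace everything else with a placeholder of -1 (the placeholder is just an illustrative choice), x can be the array itself:

```python
import numpy as np

arr_rand = np.array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2])

# keep the original value where the condition holds, otherwise use -1
kept = np.where(arr_rand > 5, arr_rand, -1)

print(kept)  # [ 8  8 -1  7  7 -1 -1 -1 -1 -1]
```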

### Find the location of the minimum and the maximum values as well

In [8]:
# location of the max
print("Location of maximum value: ", np.argmax(arr_rand))

# location of the min
print("Location of the minimum value: ", np.argmin(arr_rand))

Location of maximum value:  0
Location of the minimum value:  5
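The maximum value 8 appears twice in `arr_rand`, but np.argmax reports only the first occurrence. The sketch below shows one way to get every position of the maximum, and how to turn the flat index that np.argmax returns for a 2-D array into a (row, column) pair. The small `grid` array is made up for illustration.

```python
import numpy as np

arr_rand = np.array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2])

# argmax returns the first position of the maximum only
first_max = np.argmax(arr_rand)

# all positions of the maximum, via np.where
all_max = np.where(arr_rand == arr_rand.max())[0]

# for a 2-D array, argmax works on the flattened array by default
grid = np.array([[1, 9],
                 [7, 3]])
flat_pos = np.argmax(grid)

# convert the flat position into a (row, column) pair
row_col = np.unravel_index(flat_pos, grid.shape)

print(first_max)  # 0
print(all_max)    # [0 1]
print(flat_pos)   # 1
```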


## How to import and export data as a csv file?

A standard way to import datasets is the np.genfromtxt function. It can import datasets from web URLs, fill in missing values, accept multiple delimiters, and handle an irregular number of columns.

A less versatile alternative is np.loadtxt, which assumes the dataset has no missing values.
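To see the difference without downloading anything, here is a small sketch that reads made-up CSV text from memory using `io.StringIO`. The CSV content and variable names are assumptions for illustration only.

```python
import io
import numpy as np

# one field in the second data row is empty (a missing value)
csv_with_gap = "a,b,c\n1,2,3\n4,,6\n"
csv_complete = "a,b,c\n1,2,3\n4,5,6\n"

# genfromtxt replaces the empty field with the value given in filling_values
filled = np.genfromtxt(io.StringIO(csv_with_gap), delimiter=',',
                       skip_header=1, filling_values=-999)

# loadtxt is fine when every field is present
complete = np.loadtxt(io.StringIO(csv_complete), delimiter=',', skiprows=1)

print(filled)
print(complete)
```

Note that the two functions name the header-skipping argument differently: `skip_header` for np.genfromtxt and `skiprows` for np.loadtxt.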

In [12]:
# turn off scientific notation
np.set_printoptions(suppress=True)

# import data from csv file url
path = 'https://raw.githubusercontent.com/selva86/datasets/master/Auto.csv'
data = np.genfromtxt(path, delimiter=',', skip_header=1, filling_values=-999, dtype='float')
data[:3]  # see first 3 rows

array([[  18. ,    8. ,  307. ,  130. , 3504. ,   12. ,   70. ,    1. ,
        -999. ],
       [  15. ,    8. ,  350. ,  165. , 3693. ,   11.5,   70. ,    1. ,
        -999. ],
       [  18. ,    8. ,  318. ,  150. , 3436. ,   11. ,   70. ,    1. ,
        -999. ]])

### How to handle datasets that have both number and text columns?

If you MUST keep the text column as it is, without replacing it with a placeholder, you can set the dtype to either 'object' or None.
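As a minimal sketch of the dtype=None option, the example below reads a tiny made-up CSV from memory. The column names and values are illustrative, and `names=True` and `encoding='utf-8'` are choices made for this sketch: the first makes genfromtxt use the header row as field names, the second makes text come back as ordinary strings.

```python
import io
import numpy as np

csv_text = "mpg,cylinders,name\n18.0,8,chevrolet\n15.0,8,buick\n"

# dtype=None asks genfromtxt to infer a type for each column separately;
# the result is a 1-D structured array with one named field per column
mixed = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                      names=True, dtype=None, encoding='utf-8')

print(mixed.dtype.names)  # ('mpg', 'cylinders', 'name')
print(mixed['name'])      # the text column is kept as text
print(mixed['mpg'])       # the numeric column stays numeric
```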

In [13]:
# save the array as a csv file
np.savetxt("out.csv", data, delimiter=",")
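np.savetxt also takes formatting options. The sketch below writes a small made-up array to a temporary directory (so nothing is left on disk) with a header line and a fixed number format, then reads it back to check the round trip.

```python
import os
import tempfile
import numpy as np

data = np.array([[18.0, 8.0],
                 [15.0, 8.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "out.csv")

    # fmt sets the number format; header adds a first line;
    # comments="" removes the "# " prefix savetxt puts before the header by default
    np.savetxt(out_path, data, delimiter=",", fmt="%.1f",
               header="mpg,cylinders", comments="")

    with open(out_path) as fh:
        first_line = fh.readline().strip()

    # read the values back, skipping the header line
    reloaded = np.loadtxt(out_path, delimiter=",", skiprows=1)

print(first_line)  # mpg,cylinders
print(reloaded)
```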

## How to save and load numpy objects?

At some point, you will want to save large transformed numpy arrays to disk and load them back directly, without having to re-run the data transformation code.

Numpy provides the .npy and the .npz file types for this purpose.

If you want to store a single ndarray object, store it as a .npy file using np.save. It can be loaded back using np.load.

If you want to store more than 1 ndarray object in a single file, then save it as a .npz file using np.savez.

In [16]:
arr2d = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
arr2d_f = arr2d.astype('float')

# save single numpy array object as .npy file
np.save('myarray.npy', arr2d)

# save multiple numpy arrays as a .npz file
np.savez('array.npz', arr2d, arr2d_f)

### Load back the .npy file

In [18]:
# load a .npy file
a = np.load('myarray.npy')
print(a)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]


In [22]:
# load a .npz file
b = np.load('array.npz')
print(b.files)
b['arr_0']

['arr_0', 'arr_1']


array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
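The default names 'arr_0' and 'arr_1' come from passing the arrays positionally. As a small sketch, passing them as keyword arguments to np.savez stores them under names of your choosing. The names `ints` and `floats` and the temporary file location are assumptions made for this example.

```python
import os
import tempfile
import numpy as np

arr2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
arr2d_f = arr2d.astype('float')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'named.npz')

    # keyword arguments become the keys inside the .npz archive
    np.savez(path, ints=arr2d, floats=arr2d_f)

    # the loaded archive can be used as a context manager so the file gets closed
    with np.load(path) as archive:
        names = sorted(archive.files)
        restored = archive['floats']

print(names)           # ['floats', 'ints']
print(restored.dtype)  # float64
```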

## How to concatenate two numpy arrays column-wise and row-wise?


There are 3 different ways of concatenating two or more numpy arrays.

1. np.concatenate by changing the axis parameter to 0 and 1
2. np.vstack and np.hstack
3. np.r_ and np.c_

For the 2-D arrays used below, all 3 methods provide the same output.

One key difference to notice: unlike the other 2 methods, both np.r_ and np.c_ use square brackets to stack arrays.

In [24]:
a = np.zeros([4,4])
b = np.ones([4,4])

In [25]:
# stack the arrays vertically - row wise
np.concatenate([a,b], axis=0)
np.vstack([a,b])
np.r_[a, b]

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [26]:
# stack the arrays horizontally - column wise
np.concatenate([a,b], axis=1)
np.hstack([a,b])
np.c_[a,b]

array([[0., 0., 0., 0., 1., 1., 1., 1.],
       [0., 0., 0., 0., 1., 1., 1., 1.],
       [0., 0., 0., 0., 1., 1., 1., 1.],
       [0., 0., 0., 0., 1., 1., 1., 1.]])
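The equivalence above holds for 2-D inputs. With 1-D inputs the helpers behave differently, which the following sketch illustrates by comparing result shapes. The arrays `x` and `y` are made up for the example.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# vstack treats each 1-D array as a row
rows = np.vstack([x, y])        # shape (2, 3)

# np.r_ and hstack simply join 1-D arrays end to end
joined_r = np.r_[x, y]          # shape (6,)
joined_h = np.hstack([x, y])    # shape (6,)

# np.c_ treats each 1-D array as a column
cols = np.c_[x, y]              # shape (3, 2)

print(rows.shape, joined_r.shape, joined_h.shape, cols.shape)
```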

You can also use np.r_ to create more complex number sequences in 1d arrays.



In [28]:
np.r_[[1,2,3], 0, 0, [4, 5, 6]]

array([1, 2, 3, 0, 0, 4, 5, 6])
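np.r_ also understands slice notation, so the same sequence can be written without spelling out the lists. As a short sketch, a complex number as the slice step is interpreted as a number of evenly spaced points, endpoints included:

```python
import numpy as np

# start:stop slices behave like np.arange(start, stop)
seq = np.r_[1:4, 0, 0, 4:7]

# an imaginary step such as 5j means "5 points from start to stop inclusive"
spaced = np.r_[0:1:5j]

print(seq)     # [1 2 3 0 0 4 5 6]
print(spaced)  # [0.   0.25 0.5  0.75 1.  ]
```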

## How to sort a numpy array based on one or more columns?

In [29]:
arr = np.random.randint(1, 6, size=[8, 4])
arr

array([[3, 5, 5, 5],
       [2, 3, 5, 3],
       [1, 2, 5, 4],
       [1, 5, 1, 1],
       [4, 3, 1, 1],
       [3, 1, 5, 3],
       [4, 2, 1, 2],
       [1, 5, 4, 3]])

If you use the np.sort function with axis=0, all the columns will be sorted in ascending order independently of each other, effectively compromising the integrity of the row items. In simple terms, the values in each row get mixed up with values from other rows.
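A tiny sketch makes the problem concrete. The 3x2 array here is made up (rather than random) so the result is reproducible:

```python
import numpy as np

small = np.array([[3, 5],
                  [1, 9],
                  [2, 4]])

# each column is sorted on its own, so values no longer stay with their row
sorted_cols = np.sort(small, axis=0)
print(sorted_cols)
# [[1 4]
#  [2 5]
#  [3 9]]

# the first sorted row [1, 4] does not exist anywhere in the original array
row_survives = bool((small == sorted_cols[0]).all(axis=1).any())
print(row_survives)  # False
```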

### How to sort a numpy array based on 1 column using argsort?

In [31]:
# argsort the first column
sorted_index_1stcol = arr[:, 0].argsort()

# sort 'arr' by first column without disturbing the integrity of rows
arr[sorted_index_1stcol]

array([[1, 2, 5, 4],
       [1, 5, 1, 1],
       [1, 5, 4, 3],
       [2, 3, 5, 3],
       [3, 5, 5, 5],
       [3, 1, 5, 3],
       [4, 3, 1, 1],
       [4, 2, 1, 2]])

To sort it in decreasing order, simply reverse the argsorted index.

In [32]:
# descending sort
arr[sorted_index_1stcol[::-1]]

array([[4, 2, 1, 2],
       [4, 3, 1, 1],
       [3, 1, 5, 3],
       [3, 5, 5, 5],
       [2, 3, 5, 3],
       [1, 5, 4, 3],
       [1, 5, 1, 1],
       [1, 2, 5, 4]])

### How to sort a numpy array based on 2 or more columns?

You can do this using np.lexsort by passing a tuple of columns based on which the array should be sorted.

Just remember that the column to sort by first (the primary key) goes at the rightmost position inside the tuple.

In [34]:
# sort by column 0, then by column 1
lexsorted_index = np.lexsort((arr[:, 1], arr[:, 0]))
arr[lexsorted_index]

array([[1, 2, 5, 4],
       [1, 5, 1, 1],
       [1, 5, 4, 3],
       [2, 3, 5, 3],
       [3, 1, 5, 3],
       [3, 5, 5, 5],
       [4, 2, 1, 2],
       [4, 3, 1, 1]])
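Because the key order is easy to get backwards, here is a small sketch on a fixed, made-up array that shows how swapping the keys in the tuple changes which column takes priority:

```python
import numpy as np

pairs = np.array([[2, 1],
                  [1, 3],
                  [1, 2],
                  [2, 0]])

# last key is the primary one: sort by column 0, break ties with column 1
by_col0_then_col1 = np.lexsort((pairs[:, 1], pairs[:, 0]))

# swapped: sort by column 1, break ties with column 0
by_col1_then_col0 = np.lexsort((pairs[:, 0], pairs[:, 1]))

print(by_col0_then_col1)  # [2 1 3 0]
print(by_col1_then_col0)  # [3 0 2 1]
print(pairs[by_col0_then_col1])
```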

## Working with dates

Numpy implements dates through the np.datetime64 object, which supports precision down to nanoseconds. You can create one from a standard YYYY-MM-DD formatted date string, optionally followed by a time.

In [37]:
# create a datetime64 object
date64 = np.datetime64('2018-02-04 23:10:10')
date64

numpy.datetime64('2018-02-04T23:10:10')

In [38]:
# drop the time part from the datetime64 object
dt64 = np.datetime64(date64, 'D')
dt64

numpy.datetime64('2018-02-04')

In [41]:
# adding a number increases the days
dt64+1

numpy.datetime64('2018-02-05')

But if you need to increment any other time unit like months, hours, seconds etc, then the timedelta object is much more convenient.

In [42]:
# create the timedeltas (individual units of time)
tenminutes = np.timedelta64(10, 'm')  # 10 minutes
tenseconds = np.timedelta64(10, 's')  # 10 seconds
tennanoseconds = np.timedelta64(10, 'ns')  # 10 nanoseconds

print('Add 10 minutes: ', dt64 + tenminutes)
print('Add 10 seconds: ', dt64 + tenseconds)
print('Add 10 nanoseconds: ', dt64 + tennanoseconds)

Add 10 minutes:  2018-02-04T00:10
Add 10 seconds:  2018-02-04T00:00:10
Add 10 nanoseconds:  2018-02-04T00:00:00.000000010
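Months need a little care because they do not have a fixed length in days. One possible approach, sketched below, is to convert the date to month precision first and add a month-unit timedelta to that. This drops the day of the month, so it is an illustration of the units rather than a general calendar-arithmetic recipe.

```python
import numpy as np

dt64 = np.datetime64('2018-02-04')

# hours mix freely with a day-precision date; the result has hour precision
plus_hours = dt64 + np.timedelta64(5, 'h')

# for months, work at month precision: the day part is discarded here
month = dt64.astype('datetime64[M]')
next_month = month + np.timedelta64(1, 'M')

print(plus_hours)  # 2018-02-04T05
print(month)       # 2018-02
print(next_month)  # 2018-03
```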


In [43]:
# convert np.datetime64 back to a string
np.datetime_as_string(dt64)

'2018-02-04'

You can check whether a given date is a business day using np.is_busday().

In [46]:
print('Date: ', dt64)
print('Is it a business day?: ', np.is_busday(dt64))
print('Add 2 business days, rolling forward to nearest biz day: ', np.busday_offset(dt64, 2, roll='forward'))
print('Add 2 business days, rolling backward to nearest biz day: ', np.busday_offset(dt64, 2, roll='backward'))

Date:  2018-02-04
Is it a business day?:  False
Add 2 business days, rolling forward to nearest biz day:  2018-02-07
Add 2 business days, rolling backward to nearest biz day:  2018-02-06
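By default, Monday to Friday count as business days. The business-day functions also accept `weekmask` and `holidays` arguments to change that, and np.busday_count counts business days in a date range. The sketch below uses a Sunday-to-Thursday work week and a single holiday, both made up for illustration.

```python
import numpy as np

sunday = np.datetime64('2018-02-04')
monday = np.datetime64('2018-02-05')

# with a Sunday-to-Thursday work week, the Sunday becomes a business day
sunday_is_busday = np.is_busday(sunday, weekmask='Sun Mon Tue Wed Thu')

# declaring the Monday a holiday makes it a non-business day
monday_is_busday = np.is_busday(monday, holidays=[monday])

# business days from 2018-02-01 up to (not including) 2018-02-10
n_busdays = np.busday_count(np.datetime64('2018-02-01'), np.datetime64('2018-02-10'))

print(sunday_is_busday)  # True
print(monday_is_busday)  # False
print(n_busdays)         # 7
```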


### How to create a sequence of dates?

In [47]:
# create date sequence 
dates = np.arange(np.datetime64('2018-02-01'), np.datetime64('2018-02-10'))
print(dates)

# check if it's a business day
np.is_busday(dates)

['2018-02-01' '2018-02-02' '2018-02-03' '2018-02-04' '2018-02-05'
 '2018-02-06' '2018-02-07' '2018-02-08' '2018-02-09']


array([ True,  True, False, False,  True,  True,  True,  True,  True])
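The boolean result of np.is_busday can be used as a mask to keep only the business days. The sketch below also shows a date sequence with a step of two days, using a timedelta64 as the step:

```python
import numpy as np

start = np.datetime64('2018-02-01')
stop = np.datetime64('2018-02-10')

dates = np.arange(start, stop)

# keep only the dates that are business days
business_days = dates[np.is_busday(dates)]

# every second day in the same range
every_other_day = np.arange(start, stop, np.timedelta64(2, 'D'))

print(business_days)
print(every_other_day)
```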