# Introduction to NumPy

**Brian Roepke**  
**Data 110: Intro to Machine Learning**  
**February 7th, 2021**  

NumPy arrays are a commonly used scientific data structure that stores data as a grid, or matrix.
Python data structures organize and manipulate data by defining the relationships between the stored values and by providing a set of operations that can be executed on the structure.

Like Python lists, NumPy arrays consist of ordered values (elements) and use indexing to access those elements. A key difference is that all elements in a NumPy array must have the same data type (e.g., all integers, all floats, or all text strings).
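
The single-type rule can be seen by inspecting `dtype`. The following is a minimal sketch with made-up values (the variable names are illustrative only): when the inputs mix types, NumPy converts every element to one common type rather than keeping them separate.

```python
import numpy as np

ints = np.array([1, 2, 3])        # all integers -> integer dtype
mixed = np.array([1, 2.5, 3])     # an int/float mix is upcast to float
texty = np.array([1, "two"])      # any string forces a Unicode string dtype

print(ints.dtype.kind)    # 'i' (integer)
print(mixed.dtype.kind)   # 'f' (float)
print(texty.dtype.kind)   # 'U' (Unicode string)
```

The exact dtype (for example `int64` versus `int32`) depends on the platform, which is why the sketch prints only the dtype kind.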

**Dimensionality of NumPy Arrays**  
NumPy arrays can be:

 * one-dimensional: values along one dimension (similar to a list).
 * two-dimensional: rows of values with one or more columns.
 * multi-dimensional: nested arrays with three or more dimensions.

Brackets `[]` are used to assign and identify the dimensions of NumPy arrays; each level of nesting adds one dimension.
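
As a small illustration of how bracket nesting maps to dimensions, the sketch below builds three example arrays (the values are arbitrary) and prints the `ndim` and `shape` attributes of each.

```python
import numpy as np

one_d = np.array([1, 2, 3])                                 # one level of brackets
two_d = np.array([[1, 2, 3], [4, 5, 6]])                    # two levels
three_d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])    # three levels

for arr in (one_d, two_d, three_d):
    print(arr.ndim, arr.shape)
```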

More information can be found on the NumPy and SciPy documentation websites.

In [1]:
import numpy as np  # import numpy module and reference it using the alias np

In [2]:
# get help for a function with help(); here, NumPy's arange function

help(np.arange)

Help on built-in function arange in module numpy:

arange(...)
    arange([start,] stop[, step,], dtype=None)
    
    Return evenly spaced values within a given interval.
    
    Values are generated within the half-open interval ``[start, stop)``
    (in other words, the interval including `start` but excluding `stop`).
    For integer arguments the function is equivalent to the Python built-in
    `range` function, but returns an ndarray rather than a list.
    
    When using a non-integer step, such as 0.1, the results will often not
    be consistent.  It is better to use `numpy.linspace` for these cases.
    
    Parameters
    ----------
    start : number, optional
        Start of interval.  The interval includes this value.  The default
        start value is 0.
    stop : number
        End of interval.  The interval does not include this value, except
        in some cases where `step` is not an integer and floating point
        round-off affects the length of `out`.
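
The help text above recommends `numpy.linspace` when the step is not an integer. A minimal sketch of that alternative, using an arbitrary interval from 0 to 1: instead of a step size, `linspace` takes the number of points and includes both endpoints by default.

```python
import numpy as np

# 11 evenly spaced points from 0.0 to 1.0, endpoints included
points = np.linspace(0.0, 1.0, 11)

print(points)
print(len(points), points[0], points[-1])
```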
   

## Creating Arrays: ndarray  

Using the NumPy `arange` function:

In [4]:
x = np.arange(5)      # similar to range function. Creates 1-Dim array of [0, 1, 2, 3, 4]
print (x)

y = np.arange(1, 6)  # A 1-Dim array of  [1, 2, 3, 4, 5]
print (y)

x / y                # a new 1-Dim array with x elements divided by y

[0 1 2 3 4]
[1 2 3 4 5]


array([0.        , 0.5       , 0.66666667, 0.75      , 0.8       ])

In [14]:
print ('1-D array')
x = np.arange(9)           # a 1-Dim array [0..8]; the stop value 9 is excluded
print(type(x))             # type is ndarray <class 'numpy.ndarray'>
print(x,'\n')              # '\n' is new-line 

print ('3x3 array')
a3x3 = x.reshape((3, 3))   # change the shape to 3X3 array and assign to new variable
print(a3x3, '\n')
print(2 ** a3x3)          # element-wise power with base 2 (2^0, 2^1, 2^2 ...); returns a new array, a3x3 is unchanged
print()

print ('2x3 array')
a2x3 = np.array([[0,1,2], [3,4,5]])   # Create array (2 rows, 3 columns)
print(a2x3, '\n')
print (a2x3 + 1)                      # add 1 to each element

1-D array
<class 'numpy.ndarray'>
[0 1 2 3 4 5 6 7 8] 

3x3 array
[[0 1 2]
 [3 4 5]
 [6 7 8]] 

[[  1   2   4]
 [  8  16  32]
 [ 64 128 256]]

2x3 array
[[0 1 2]
 [3 4 5]] 

[[1 2 3]
 [4 5 6]]


NumPy has a number of built-in functions to quickly and easily create multidimensional arrays.

In [15]:
# create a 2x2 array of zeros
ex1 = np.zeros((2,2))      
print(ex1)             
print()

# create a 2x2 array with 9.0
ex2 = np.full((2,2), 9.0)  
print(ex2)
print()

# create an array of ones
ex3 = np.ones((1, 2))
print(ex3)  
print()

# create an array of random floats between 0 and 1
ex4 = np.random.random((2,2))
print(ex4)  

[[0. 0.]
 [0. 0.]]

[[9. 9.]
 [9. 9.]]

[[1. 1.]]

[[0.54936281 0.87173849]
 [0.28648531 0.46212219]]
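
The random values above change on every run. If reproducible output is wanted, one option is NumPy's `Generator` interface, sketched below; the seed value 42 is an arbitrary choice for illustration, not something used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # the seed value 42 is arbitrary
sample = rng.random((2, 2))            # floats in the half-open interval [0, 1)

# A second generator created with the same seed reproduces the same numbers
repeat = np.random.default_rng(seed=42).random((2, 2))

print(sample)
print(np.array_equal(sample, repeat))  # True
```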


In [21]:
months = np.array([['Jan','Feb','Mar','Apr','May','June','July','Aug','Sept','Oct','Nov','Dec']])
months

array([['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'July', 'Aug', 'Sept',
        'Oct', 'Nov', 'Dec']], dtype='<U4')

## Appending Arrays

In [22]:
np.append([1, 2, 3], [[4, 5, 6], [7, 8, 9]])

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [23]:
a1 = np.array(['hello', 'world'])
a2 = np.array(['data', 'science'])
np.append(a1, a2)

array(['hello', 'world', 'data', 'science'], dtype='<U7')
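
Note that `np.append` flattens its inputs when no `axis` is given, as the first example above shows. The sketch below, using a made-up 2x3 array, contrasts that default with appending along `axis=0`, which keeps the 2-D shape; `np.concatenate` is shown as an equivalent call.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])
new_row = np.array([[7, 8, 9]])              # must be 2-D to match grid

flat = np.append(grid, new_row)              # no axis: result is flattened
stacked = np.append(grid, new_row, axis=0)   # axis=0: appended as a third row
same = np.concatenate([grid, new_row], axis=0)

print(flat.shape)      # (9,)
print(stacked.shape)   # (3, 3)
```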

## Array Indexing

In [24]:
# because this is a 1-Dim array, we need only one index to access each element
print ('1-D')
print (x, ' size: ', x.shape)
print(x[0], x[1], x[8]) 
print()

# a 2-D array uses 2 indexes (row, column)
print('2x3')
print (a2x3, ' size: ', a2x3.shape)
print()
print ('row1', a2x3[0] ) # first row
print ('row2', a2x3[1] ) # 2nd row
print('R1,C1: ', a2x3[0][0]) # first element of first row
print('R2,C3: ', a2x3[1][2]) # last element of last row
print()

1-D
[0 1 2 3 4 5 6 7 8]  size:  (9,)
0 1 8

2x3
[[0 1 2]
 [3 4 5]]  size:  (2, 3)

row1 [0 1 2]
row2 [3 4 5]
R1,C1:  0
R2,C3:  5
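
The chained form `a2x3[1][2]` used above can also be written with a single comma-separated index. A short sketch of the equivalent forms, recreating the same 2x3 array so the block runs on its own:

```python
import numpy as np

a2x3 = np.array([[0, 1, 2], [3, 4, 5]])

print(a2x3[1][2])     # 5, chained indexing: row 1 first, then element 2
print(a2x3[1, 2])     # 5, same element with a single (row, column) index
print(a2x3[-1, -1])   # 5, negative indexes count from the end
```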



## Slice Indexing  
Similar to the use of slice indexing with lists and strings, we can use slice indexing to pull out sub-regions of ndarrays.


In [25]:
# array shape (3, 4)
an_array = np.array([[11,12,13,14], [21,22,23,24], [31,32,33,34]])
print(an_array)

[[11 12 13 14]
 [21 22 23 24]
 [31 32 33 34]]


In [26]:
a_slice = an_array[:2]  # first 2 rows
print(a_slice)

[[11 12 13 14]
 [21 22 23 24]]


In [27]:
# first 2 rows, columns 1 and 2 (the column slice 1:3 excludes index 3)

a_slice = an_array[:2, 1:3]  
print(a_slice)

[[12 13]
 [22 23]]
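
One point worth knowing about slices: a basic slice is a view onto the original array, not a copy, so writing to the slice also changes the original. The sketch below demonstrates this with the same array; the names `view` and `independent` are illustrative.

```python
import numpy as np

an_array = np.array([[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]])

view = an_array[:2, 1:3]     # a view: shares memory with an_array
view[0, 0] = 1000            # also changes an_array[0, 1]
print(an_array[0, 1])        # 1000

independent = an_array[:2, 1:3].copy()   # an explicit copy
independent[0, 0] = -1                   # an_array is not affected
print(an_array[0, 1])        # still 1000
```

Calling `.copy()` is the usual way to get a sub-array that can be modified safely.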


## Use both integer indexing & slice indexing  
We can use combinations of integer indexing and slice indexing to create different shaped matrices.

In [42]:
# Create array of shape (3, 4)
an_array = np.array([[11,12,13,14], [21,22,23,24], [31,32,33,34]])
print(an_array)

[[11 12 13 14]
 [21 22 23 24]
 [31 32 33 34]]


In [43]:
# Using both integer indexing & slicing generates an array of lower rank/dimension
row_rank1 = an_array[1, :]    # Rank 1 view 

print(row_rank1, row_rank1.shape)  # notice only a single []

[21 22 23 24] (4,)


In [46]:
# Slicing alone: generates an array of the same rank as an_array
row_rank2 = an_array[1:2, :]  # Rank 2 view 

print(row_rank2, row_rank2.shape)   # Notice the [[ ]]

[[21 22 23 24]] (1, 4)


In [47]:
# We can do the same thing for columns of an array:

col_rank1 = an_array[:, 1]
col_rank2 = an_array[:, 1:2]

print(col_rank1, col_rank1.shape)  # Rank 1
print()
print(col_rank2, col_rank2.shape)  # Rank 2

[12 22 32] (3,)

[[12]
 [22]
 [32]] (3, 1)


## Boolean Indexing  
Boolean indexing selects, and can change, the elements that satisfy a condition:

In [48]:
# create a 3x2 array
an_array = np.array([[11,12], [21, 22], [31, 32]])
print(an_array)

[[11 12]
 [21 22]
 [31 32]]


In [49]:
# create a boolean mask: True where an element meets the condition
# (named mask rather than filter, to avoid shadowing Python's built-in filter)
mask = (an_array > 15)
mask

array([[False, False],
       [ True,  True],
       [ True,  True]])

In [50]:
# we can now select just those elements which meet that criterion
print(an_array[mask])

[21 22 31 32]


In [51]:
# As a shorthand, we can do this without a separate boolean array.

an_array[an_array > 15]

array([21, 22, 31, 32])

In [52]:
# values greater than 15 and less than 30
an_array[(an_array > 15) & (an_array < 30)]

array([21, 22])

In [53]:
# Even values
an_array[an_array % 2 == 0]

array([12, 22, 32])

In [54]:
# we can change elements in the array using a logical filter. Here, add 100 to all the even values.

an_array[an_array % 2 == 0] += 100
print(an_array)

[[ 11 112]
 [ 21 122]
 [ 31 132]]
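
The in-place update above modifies `an_array` itself. When the original should be kept, `np.where` is one alternative: it builds a new array by choosing between two values element by element. A minimal sketch on the same data (the name `shifted` is illustrative):

```python
import numpy as np

an_array = np.array([[11, 12], [21, 22], [31, 32]])

# Where the condition is True take an_array + 100, otherwise keep an_array
shifted = np.where(an_array % 2 == 0, an_array + 100, an_array)

print(shifted)
print(an_array)   # the original array is unchanged
```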


## Computation on NumPy Arrays: Universal Functions  
Vectorized operations in NumPy are implemented via universal functions (ufuncs), whose main purpose is to quickly execute repeated operations on the values in NumPy arrays. 
Ufuncs are not limited to one-dimensional arrays; they also act on multi-dimensional arrays.  
Ufuncs exist in two flavors: unary ufuncs, which operate on a single input, and binary ufuncs, which operate on two inputs.

### Array arithmetic  
NumPy's ufuncs make use of Python's native arithmetic operators. The standard addition, subtraction, multiplication, and division can all be used:

In [55]:
x = np.arange(4)

print("x     =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2)  # floor division

x     = [0 1 2 3]
x + 5 = [5 6 7 8]
x - 5 = [-5 -4 -3 -2]
x * 2 = [0 2 4 6]
x / 2 = [0.  0.5 1.  1.5]
x // 2 = [0 0 1 1]


In [56]:
print("-x     = ", -x)
print("x ** 2 = ", x ** 2)
print("x % 2  = ", x % 2)

-x     =  [ 0 -1 -2 -3]
x ** 2 =  [0 1 4 9]
x % 2  =  [0 1 0 1]


The following table lists the arithmetic operators implemented in NumPy:

| Operator | Equivalent ufunc | Description |
|----------|------------------|-------------|
| `+` | `np.add` | Addition (e.g., 1 + 1 = 2) |
| `-` | `np.subtract` | Subtraction (e.g., 3 - 2 = 1) |
| `-` | `np.negative` | Unary negation (e.g., -2) |
| `*` | `np.multiply` | Multiplication (e.g., 2 * 3 = 6) |
| `/` | `np.divide` | Division (e.g., 3 / 2 = 1.5) |
| `//` | `np.floor_divide` | Floor division (e.g., 3 // 2 = 1) |
| `**` | `np.power` | Exponentiation (e.g., 2 ** 3 = 8) |
| `%` | `np.mod` | Modulus/remainder (e.g., 9 % 4 = 1) |
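
To make the operator/ufunc correspondence concrete, the sketch below calls a unary ufunc and two binary ufuncs directly on a small 2-D array (an arbitrary example) and checks that each gives the same result as its operator.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)   # [[0 1 2], [3 4 5]]

negated = np.negative(m)         # unary ufunc: one input
summed = np.add(m, 10)           # binary ufunc: two inputs
squared = np.power(m, 2)         # binary ufunc: two inputs

print(np.array_equal(negated, -m))       # True
print(np.array_equal(summed, m + 10))    # True
print(np.array_equal(squared, m ** 2))   # True
```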


## Absolute value  
Just as NumPy understands Python's built-in arithmetic operators, it also understands Python's built-in absolute value function:

In [57]:
x = np.array([-2, -1, 0, 1, 2]) # create a 1-D array from a list

print(type(x))           # The type of an ndarray is: "<class 'numpy.ndarray'>"
print(x)

# test the shape of the array we just created, it should have just one dimension 
print(x.shape)

<class 'numpy.ndarray'>
[-2 -1  0  1  2]
(5,)


In [58]:
# absolute value

abs(x)

array([2, 1, 0, 1, 2])

The corresponding NumPy ufunc is `np.absolute`, which is also available under the alias `np.abs`:

In [59]:
print (np.absolute(x))

print (np.abs(x))

[2 1 0 1 2]
[2 1 0 1 2]


## Trigonometric functions

In [60]:
# defining an array of angles:
theta = np.linspace(0, np.pi, 3)

In [61]:
# Now we can compute some trigonometric functions on these values:
print("theta      = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))

theta      =  [0.         1.57079633 3.14159265]
sin(theta) =  [0.0000000e+00 1.0000000e+00 1.2246468e-16]
cos(theta) =  [ 1.000000e+00  6.123234e-17 -1.000000e+00]
tan(theta) =  [ 0.00000000e+00  1.63312394e+16 -1.22464680e-16]


The values are computed to within machine precision, which is why values that should be zero do not always hit exactly zero. Inverse trigonometric functions are also available:

In [64]:
x = [-1, 0, 1]
print("x         = ", x)
print("arcsin(x) = ", np.arcsin(x))
print("arccos(x) = ", np.arccos(x))
print("arctan(x) = ", np.arctan(x))

x         =  [-1, 0, 1]
arcsin(x) =  [-1.57079633  0.          1.57079633]
arccos(x) =  [3.14159265 1.57079633 0.        ]
arctan(x) =  [-0.78539816  0.          0.78539816]
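
Because of the round-off seen in the `sin(theta)` output earlier, exact equality tests against zero can fail. One common way to handle this is a tolerance-based comparison with `np.isclose`, sketched here with its default tolerances:

```python
import numpy as np

value = np.sin(np.pi)            # mathematically 0, but stored as about 1.2e-16

print(value == 0.0)              # False: exact comparison fails
print(np.isclose(value, 0.0))    # True: equal within the default tolerance
```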


## Exponents and logarithms
Another common type of operation available as NumPy ufuncs is the exponentials:

In [65]:
x = [1, 2, 3]
print("x     =", x)
print("e^x   =", np.exp(x))
print("2^x   =", np.exp2(x))
print("3^x   =", np.power(3, x))

x     = [1, 2, 3]
e^x   = [ 2.71828183  7.3890561  20.08553692]
2^x   = [2. 4. 8.]
3^x   = [ 3  9 27]


The inverse of the exponentials, the logarithms, are also available. The basic `np.log` gives the natural logarithm; if you prefer to compute the base-2 logarithm or the base-10 logarithm, these are available as well:

In [66]:
x = [1, 2, 4, 10]
print("x        =", x)
print("ln(x)    =", np.log(x))
print("log2(x)  =", np.log2(x))
print("log10(x) =", np.log10(x))

x        = [1, 2, 4, 10]
ln(x)    = [0.         0.69314718 1.38629436 2.30258509]
log2(x)  = [0.         1.         2.         3.32192809]
log10(x) = [0.         0.30103    0.60205999 1.        ]


In [67]:
x = [0, 0.001, 0.01, 0.1]
print("exp(x) - 1 =", np.expm1(x))
print("log(1 + x) =", np.log1p(x))

exp(x) - 1 = [0.         0.0010005  0.01005017 0.10517092]
log(1 + x) = [0.         0.0009995  0.00995033 0.09531018]
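
`np.expm1` and `np.log1p` exist to keep precision when the input is very small. The sketch below compares `np.expm1` with the naive `np.exp(x) - 1` for one tiny input; the value `1e-10` is an arbitrary choice for illustration.

```python
import numpy as np

x = 1e-10                    # a very small input, chosen for illustration

naive = np.exp(x) - 1        # the subtraction cancels most significant digits
precise = np.expm1(x)        # computes exp(x) - 1 without that cancellation

# For tiny x, exp(x) - 1 is approximately x, so compare relative errors against x
print(abs(naive - x) / x)
print(abs(precise - x) / x)
```

The relative error of the naive form is several orders of magnitude larger, since `np.exp(x)` is rounded to the nearest double near 1 before the subtraction happens.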


## Basic Statistical Operations: on ndarray

In [68]:
# set up a random 2 x 5 matrix
arr = 10 * np.random.randn(2,5)  # standard normal draws, scaled by 10
print(arr)

[[-8.20241102  1.40664139 13.74451151 -3.63485969 10.36212167]
 [-5.63107302 -2.3980745   9.95031778 -1.56501224  7.66461473]]


In [69]:
# compute the mean for all elements 
print(arr.mean())

2.169677661474976


In [70]:
# compute the means by row (axis=1)
print(arr.mean(axis = 1))

[2.73520077 1.60415455]


In [71]:
# compute the means by column
print(arr.mean(axis = 0))

[-6.91674202 -0.49571655 11.84741464 -2.59993597  9.0133682 ]


In [72]:
# sum all the elements
print(arr.sum())

21.696776614749762


In [73]:
# compute the medians for all rows
print(np.median(arr, axis = 1))

[ 1.40664139 -1.56501224]
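
The same `axis` convention applies to the other summary statistics. The sketch below uses a small fixed array (made-up values, so the results are repeatable) to show minimum, maximum and standard deviation; note that `std` computes the population standard deviation unless `ddof=1` is passed.

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

print(data.min(axis=0))    # column minimums: [1. 2. 3.]
print(data.max(axis=1))    # row maximums: [3. 6.]
print(data.std())          # population standard deviation (ddof=0)
print(data.std(ddof=1))    # sample standard deviation (divides by n - 1)
```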


## Sorting

In [74]:
# create a 10 element array of randoms
unsorted = np.random.randn(10)

print('unsorted')
print(unsorted)

unsorted
[-0.17282796  0.3063538   0.98348373  1.39409934  1.28822326 -0.02285437
 -0.76617949 -0.5479846   0.77443014  2.27393463]


In [75]:
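# create copy and sort
print('sorted')
sorted_arr = np.array(unsorted)   # np.array copies its input, so unsorted is untouched
sorted_arr.sort()                 # named sorted_arr to avoid shadowing Python's built-in sorted

print(sorted_arr)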
print()
print(unsorted)

print('\n in-place sorting')
unsorted.sort() 
print(unsorted)

sorted
[-0.76617949 -0.5479846  -0.17282796 -0.02285437  0.3063538   0.77443014
  0.98348373  1.28822326  1.39409934  2.27393463]

[-0.17282796  0.3063538   0.98348373  1.39409934  1.28822326 -0.02285437
 -0.76617949 -0.5479846   0.77443014  2.27393463]

 in-place sorting
[-0.76617949 -0.5479846  -0.17282796 -0.02285437  0.3063538   0.77443014
  0.98348373  1.28822326  1.39409934  2.27393463]
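
Besides the in-place `.sort()` method, NumPy offers `np.sort`, which returns a sorted copy, and `np.argsort`, which returns the indexes that would sort the array. A short sketch with a made-up three-element array:

```python
import numpy as np

scores = np.array([3, 1, 2])

ordered = np.sort(scores)     # returns a sorted copy
order = np.argsort(scores)    # indexes that would sort the array

print(ordered)   # [1 2 3]
print(scores)    # [3 1 2], the original is unchanged
print(order)     # [1 2 0]
```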


## Read/Write text

In [77]:
# numeric data
a = np.arange(0.0, 5.0, 0.5)
print(a)

# save data to a file
np.savetxt('test.txt', a, fmt='%1.2f')  # one value per line, floats with 2 decimal places

[0.  0.5 1.  1.5 2.  2.5 3.  3.5 4.  4.5]


In [78]:
# read data from a file
a2 = np.loadtxt('test.txt')
a2

array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])

In [79]:
# string data
a = np.array(['one', 'two', 'three'])
print(a)

# save
np.savetxt('test2.txt', a, fmt="%s") # one string per line

# read
a2 = np.genfromtxt('test2.txt', dtype='str')
a2

['one' 'two' 'three']


array(['one', 'two', 'three'], dtype='<U5')

In [80]:
# comma delimited
a = np.array([ [1,2,3], [4,5,6], [7,8,9] ])
print (a)

# save
np.savetxt("test3.csv", a, delimiter=",", fmt='%d') # integer format

# read
a2 = np.loadtxt('test3.csv', dtype='int', delimiter=',')
a2

[[1 2 3]
 [4 5 6]
 [7 8 9]]


array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
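
CSV files often carry a header row. The sketch below shows one way to round-trip such a file: `np.savetxt` writes the header as a `#` comment line, and `np.loadtxt` skips comment lines by default. The file name and column names are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import numpy as np

table = np.array([[1, 2, 3], [4, 5, 6]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "with_header.csv")   # hypothetical file name
    # savetxt writes the header as a comment line: "# a,b,c"
    np.savetxt(path, table, delimiter=",", fmt="%d", header="a,b,c")
    # loadtxt skips lines starting with "#" by default
    loaded = np.loadtxt(path, dtype=int, delimiter=",")

print(np.array_equal(table, loaded))   # True
```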

# Assignment:  
Import Numeric Data from Text Files Into Numpy Arrays  
 * download the average monthly precipitation (inches) for Boulder, CO collected by the U.S. National Oceanic and Atmospheric Administration (NOAA)
     * data values: https://ndownloader.figshare.com/files/12565616
     * col values: https://ndownloader.figshare.com/files/12565619
 * load the data into a numpy array and print the values. Verify the type of the data structure.
 * run summary statistics: mean, median, standard deviation, min, max
 * get the average seasonal precipitation (e.g., winter, ..., fall)

In [85]:
months = np.loadtxt('months.txt', dtype='str')
months

array(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'July', 'Aug', 'Sept',
       'Oct', 'Nov', 'Dec'], dtype='<U4')

In [87]:
values = np.loadtxt('avg-monthly-precip.txt')
values

array([0.7 , 0.75, 1.85, 2.93, 3.05, 2.02, 1.93, 1.62, 1.84, 1.31, 1.39,
       0.84])

In [88]:
np.mean(values)

1.6858333333333333

In [89]:
np.median(values)

1.73

In [90]:
np.std(values)

0.7318408107110604

In [91]:
np.min(values)

0.7

In [92]:
np.max(values)

3.05

In [112]:
# Winter – December, January and February.
# Spring – March, April and May.
# Summer – June, July and August.
# Fall – September, October and November.

print('Winter')
winter = np.mean(values[np.array([0,1,11])])
print(winter, '\n')

print('Spring')
spring = np.mean(values[2:5])
print(spring, '\n')

print('Summer')
summer = np.mean(values[5:8])
print(summer, '\n')

print('Fall')
fall = np.mean(values[8:11])
print(fall, '\n')

Winter
0.7633333333333333 

Spring
2.61 

Summer
1.8566666666666667 

Fall
1.5133333333333334 
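
As an alternative to slicing each season by hand, the four averages can be computed in one step. This is a sketch, not the assignment's required method: it hard-codes the monthly values printed earlier so it runs on its own, shifts December to the front with `np.roll`, and reshapes the 12 months into 4 rows of 3.

```python
import numpy as np

# Monthly values copied from the output above, January through December
values = np.array([0.7, 0.75, 1.85, 2.93, 3.05, 2.02,
                   1.93, 1.62, 1.84, 1.31, 1.39, 0.84])

# Shift December to the front so each season occupies 3 consecutive slots,
# then treat the 12 months as 4 seasons x 3 months and average each row
seasonal = np.roll(values, 1).reshape(4, 3).mean(axis=1)

for name, avg in zip(["Winter", "Spring", "Summer", "Fall"], seasonal):
    print(name, round(avg, 4))
```

The results match the per-season means computed above.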

