# Basics of NumPy

### Importing NumPy

In [2]:
# To use NumPy, we first need to import the `numpy` package:
import numpy as np

## NumPy arrays
The NumPy array - an n-dimensional data structure - is the central object of the NumPy package.

A one-dimensional NumPy array can be thought of as a vector, a two-dimensional array as a matrix (i.e., a set of vectors), and a three-dimensional array as a tensor (i.e., a set of matrices).

![vector-matrix-3d-matrix](vector-matrix-3d-matrix.jpg)

Need more than three dimensions? It's entirely possible to have arrays with many dimensions, including so many dimensions that it's no longer humanly possible to conceptualize them.

### Array data types

An array can consist of integers, floating-point numbers, or strings. Within an array, the data type must be consistent (e.g., all integers or all floats).

Need an array with mixed data types? Consider using NumPy's record array format or pandas DataFrames instead.

In this article, we'll restrict our focus to conventional NumPy arrays consisting of a single data type.
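To see what a single data type means in practice, here is a small sketch (the variable names are illustrative and not part of the original notebook) showing how NumPy converts mixed inputs to one common type:

```python
import numpy as np

# Mixing integers and floats: every element is stored as a float.
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers, mixed_numbers.dtype)

# Mixing numbers and strings: every element is stored as a string.
mixed_text = np.array([1, "two", 3.0])
print(mixed_text, mixed_text.dtype)
```

The conversion happens silently, so it is worth checking `.dtype` after building an array from mixed inputs.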

### Defining arrays
We can define NumPy arrays in a number of ways. We'll detail a few of the most common approaches below.

### Using np.array()
To define an array manually, we can use the np.array() function. The number of dimensions is the rank of the array.

In [3]:
a = np.array([1, 2, 3])  # Create a rank 1 array
print(a)

[1 2 3]


In [4]:
b = np.array([[1,2,3],[4,5,6]])   # Create a rank 2 array
print(b)

[[1 2 3]
 [4 5 6]]


NumPy has numerous functions for generating commonly-used arrays without having to enter the elements manually. A few of those are shown below:

### Defining arrays: np.arange()

The function np.arange() is great for creating vectors easily. Here, we create a vector with values spanning 1 up to (but not including) 5:

In [7]:
c = np.arange(1,5)
print(c)

[1 2 3 4]



A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.

### Defining arrays: np.zeros, np.ones, np.full
In many programming tasks, it can be useful to initialize a variable and then write a value to it later in the code. If that variable happens to be a NumPy array, a common approach would be to create it as an array with zeros in every element. We can do this using `np.zeros()`. Here, we create an array of zeros with three rows and one column.

In [8]:
np.zeros((3,1))

array([[0.],
       [0.],
       [0.]])

You can also initialize an array with ones instead of zeros:

In [9]:
np.ones((3,1))

array([[1.],
       [1.],
       [1.]])

`np.full()` creates an array in which every element is a fixed value that you supply (the fill value is required; there is no default). Here we create a 2x3 array with the number 7 in each element:

In [12]:
np.full((2,3),7)

array([[7, 7, 7],
       [7, 7, 7]])

We can also create an identity matrix, as shown below:

In [None]:
d = np.eye(2)        # Create a 2x2 identity matrix
print(d)

### Array shape

All arrays have a shape accessible using `.shape`

For example, let's get the shape of a vector, matrix, and tensor.

In [17]:
vector = np.arange(5)
print(vector)
print("Vector shape:", vector.shape)

matrix = np.ones([3, 2])
print('\n',matrix)
print("Matrix shape:", matrix.shape)

tensor = np.zeros([2, 3, 3])
print('\n',tensor)
print("Tensor shape:", tensor.shape)

[0 1 2 3 4]
Vector shape: (5,)

 [[1. 1.]
 [1. 1.]
 [1. 1.]]
Matrix shape: (3, 2)

 [[[0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]]

 [[0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]]]
Tensor shape: (2, 3, 3)


The shape of the vector is one-dimensional. The first number in its shape is the number of elements (or rows). For the matrix, `.shape` tells us we have three rows and two columns. The tensor is slightly different. The first number is how many matrices/slices we have. The second gives the number of rows. The third provides the number of columns.
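Alongside `.shape`, the attributes `.ndim` and `.size` report the number of dimensions and the total number of elements. A minimal sketch using the same tensor as above:

```python
import numpy as np

tensor = np.zeros([2, 3, 3])
print(tensor.ndim)   # number of dimensions (the rank)
print(tensor.shape)  # size along each dimension
print(tensor.size)   # total number of elements: 2 * 3 * 3
```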

### Reshaping arrays

We can reshape an array into any compatible dimensions using `.reshape`.

For example, say we want a 3x3 matrix whose elements run from 1 to 9. Easy:

In [18]:
arr = np.arange(1, 10)
print(arr, '\n')

# Reshape to 3x3 matrix
arr = arr.reshape(3, 3)
print(arr, '\n')

# Reshape back to the original size
arr = arr.reshape(9)
print(arr)

[1 2 3 4 5 6 7 8 9] 

[[1 2 3]
 [4 5 6]
 [7 8 9]] 

[1 2 3 4 5 6 7 8 9]


NumPy can infer one of the dimensions if you pass -1 for it. The total number of elements must still divide evenly by the dimensions you do specify for the inference to work.

In [19]:
arr = np.arange(1, 10).reshape(3, -1)
print(arr)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
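The following sketch shows the inference succeeding and failing. The array sizes are chosen only for illustration:

```python
import numpy as np

values = np.arange(12)

# 12 elements split into 3 rows leaves 4 columns for the -1 axis.
print(values.reshape(3, -1).shape)

# 10 elements cannot be split evenly into 3 rows, so NumPy raises an error.
try:
    np.arange(10).reshape(3, -1)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```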


### Reading data from a file into an array
Usually, data sets are too large to define manually. Instead, the most common use case is to import data from a data file into a NumPy array.

As an example, let's take some publicly-available data from the U.S. Energy Information Administration. The dataset we'll explore contains information on electricity generation in the USA from a range of sources. You can download the file here: [MER_T07_02A.csv](https://www.eia.gov/totalenergy/data/browser/csv.php?tbl=T07.02A).

Because the data file is a CSV file, we'll use the csv module to import the data. It's worth noting that NumPy also has functions to read other types of data files directly into NumPy arrays, such as `np.genfromtxt()` for text files.
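As a rough illustration of the `np.genfromtxt()` route, the sketch below parses a tiny made-up CSV string rather than a real file. The name `csv_text` and its two columns are invented for this example:

```python
import io
import numpy as np

# A tiny, made-up CSV table standing in for a real data file.
csv_text = "year,value\n1949,135.4\n1950,154.5\n"

# skip_header=1 drops the header row; the rest is parsed as floats.
table = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)
print(table)
print(table.shape, table.dtype)
```

This works well for purely numeric columns. The EIA file mixes text and numbers, which is one reason to read it with the csv module.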

Here we're just reading the CSV file row-by-row, appending to a list, and then converting to a NumPy array:

In [21]:
import csv

data = []

with open('MER_T07_02A.csv', 'r') as csvfile:
    file_reader = csv.reader(csvfile, delimiter=',')
    for row in file_reader:
        data.append(row)
        
data = np.array(data) #convert the list of lists to a NumPy array

We now have our data stored in a NumPy array that we've named data. For much of the remainder of this article, we'll be exploring how NumPy's functionality can be used to manipulate and gain insights into this data.

First, we'll explore some attributes of the array. One thing that we may want to know about an array is its dimensions:

In [None]:
data.shape

For this two-dimensional array, we have 8230 rows and 6 columns of data.

Another property of a NumPy array that we may wish to know is its data type. This information is stored in the dtype attribute. Calling dtype reveals that our array is made up of strings:

In [None]:
data.dtype.type

### Saving
When we are ready to save our data, we can use the save function.

In [None]:
np.save('data.npy', data)      # Saves data to a binary file with the .npy extension
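A saved array can be read back with `np.load`. The sketch below round-trips a small stand-in array through a temporary directory; the array contents and file name are placeholders, not the real dataset:

```python
import os
import tempfile
import numpy as np

example = np.array([["MSN", "Value"], ["CLETPUS", "135451.32"]])

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.npy")
    np.save(path, example)      # passing a filename lets NumPy open and close the file
    restored = np.load(path)    # read the array back

print(np.array_equal(example, restored))
```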

### Indexing
At some point, it will become necessary to index (select) subsets of a NumPy array. For instance, you might want to plot one column of data or perform a manipulation of that column. NumPy's indexing notation resembles MATLAB's, with two important differences: NumPy uses square brackets and counts from 0, and the end of a slice is excluded.

Basics of indexing notation
* Commas separate axes of an array.
* Colons mean "up to, but not including". For example, x[0:4] means the first 4 rows (rows 0 through 3) of x.
* Negative numbers mean "from the end of the array." For example, x[-1] means the last row of x.
* Blanks before or after colons mean "the rest of". For example, x[3:] means row 3 and all the rows after it. Similarly, x[:3] means all the rows up to (but not including) row 3. x[:] means all rows of x.
* When there are fewer indices than axes, the missing indices are considered complete slices. For example, in a 3-axis array, x[0,0] means all data in the 3rd axis of the 1st row and 1st column.
* Dots "..." mean as many colons as needed to produce a complete indexing tuple. For example, if x has 5 axes, x[1,2,...] is the same as x[1,2,:,:,:].
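The rules in this list can be checked on small arrays. The arrays `x` and `y` below are made up for the purpose and are not part of the dataset:

```python
import numpy as np

x = np.arange(20).reshape(5, 4)     # 5 rows, 4 columns
print(x[0:4].shape)   # rows 0, 1, 2, 3: the stop index is excluded
print(x[-1])          # last row
print(x[3:])          # row 3 through the last row
print(x[:3])          # rows 0, 1, 2

y = np.arange(24).reshape(2, 3, 4)  # a 3-axis array
print(y[0, 0])                      # same as y[0, 0, :]
print(np.array_equal(y[1, ...], y[1, :, :]))
```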

In the following code, we'll explore some useful examples of selecting subsets from an array.

### Examples
![numpy-indexing-arrays.width-1200.jpg](numpy-indexing-arrays.width-1200.jpg)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]]


array([[ 6,  8],
       [14, 16]])
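The code cell that produced the two outputs above is not shown. One way to reproduce them, assuming the 4x4 array was built with `np.arange` and the selection takes every second row and column starting at index 1, is sketched here:

```python
import numpy as np

a = np.arange(1, 17).reshape(4, 4)
print(a)

# Rows 1 and 3, columns 1 and 3 (start at index 1, step by 2).
print(a[1::2, 1::2])
```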

### Indexing example 1: Colons and commas
Let's say we are interested in the first ten rows of column 4 (the fifth column, since counting starts at 0). We can use the following syntax to index this array section: `array[start_row:end_row, col]`

In [31]:
data[0:10,4]

array(['Description', 'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors'], dtype='<U80')

The first row is the header for the column. Column 4 contains a description of energy sectors.

### Indexing example 2: Colons as *all* rows or columns

A colon can also denote all rows, or all columns. Here, we index all rows of column 4.

In [34]:
data[:,4]

array(['Description', 'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors', ...,
       'Electricity Net Generation Total (including from sources not shown), All Sectors',
       'Electricity Net Generation Total (including from sources not shown), All Sectors',
       'Electricity Net Generation Total (including from sources not shown), All Sectors'],
      dtype='<U80')

### Indexing example 3: Subset of columns
We can use the same format for any dimension of an array. The general syntax is: `array[start_row:end_row, start_col:end_col]`. The following indexes all rows of the columns from index 2 up to (but not including) index 4, i.e. columns 2 and 3:

In [None]:
data[:,2:4]

### Indexing example 4: Explicitly specifying column numbers
What if the columns we need are not next to each other? Instead of indexing a range of columns, it can be useful to specify them explicitly. To explicitly specify particular columns, we just include them in a list. Let's index the five rows after the header, selecting only columns 2 and 3. This time, we'll write the output to a new array named subset that we can re-use in the following example.

In [36]:
subset = data[1:6, [2,3]]
subset

array([['135451.32', '1'],
       ['154519.994', '1'],
       ['185203.657', '1'],
       ['195436.666', '1'],
       ['218846.325', '1']], dtype='<U80')

### Indexing example 5: Mask arrays
Another convenient way to index certain sections of a NumPy array is to use a mask array. A mask array, also known as a logical array, contains boolean elements (i.e. True or False). Indexing of a given array element is determined by the value of the mask array's corresponding element.

First, we define a NumPy array of True/False values, where the True values mark the rows we want to keep. Then we apply this mask to the subset array from the previous example. The result retains only the rows whose corresponding mask elements are True.

In [37]:
mask_array = np.array([False, True, False, True, True])
subset[mask_array]

array([['154519.994', '1'],
       ['195436.666', '1'],
       ['218846.325', '1']], dtype='<U80')
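In practice a mask is usually produced by a comparison rather than typed by hand. The sketch below assumes the five values from the subset have been entered as floats; the threshold of 180000 is arbitrary:

```python
import numpy as np

# The "Value" column of the subset above, typed in by hand as floats.
values = np.array([135451.32, 154519.994, 185203.657, 195436.666, 218846.325])

# A comparison returns a boolean array with one entry per element.
mask = values > 180000
print(mask)
print(values[mask])
```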

### Integer array Indexing

Three ways of accessing the data in the second row of the array.
Mixing integer indexing with slices yields an array of lower rank,
while using only slices yields an array of the same rank as the
original array:

In [40]:
a = np.arange(1,17).reshape(4,4)
print(a)
row_r1 = a[1, :]    # Rank 1 view of the second row of a  
row_r2 = a[1:2, :]  # Rank 2 view of the second row of a
row_r3 = a[[1], :]  # Rank 2 view of the second row of a
print(row_r1, row_r1.shape)
print(row_r2, row_r2.shape)
print(row_r3, row_r3.shape)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]]
[5 6 7 8] (4,)
[[5 6 7 8]] (1, 4)
[[5 6 7 8]] (1, 4)


We can make the same distinction when accessing columns of an array:

In [55]:
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print(col_r1, col_r1.shape)
print()
print(col_r2, col_r2.shape)

[ 2  6 10 14] (4,)

[[ 2]
 [ 6]
 [10]
 [14]] (4, 1)


When you index into numpy arrays using slicing, the resulting array view will always be a subarray of the original array. In contrast, integer array indexing allows you to construct arbitrary arrays using the data from another array. Here is an example:

In [8]:
a = np.array([[1,2], [3, 4], [5, 6]])
print(a.shape)

# An example of integer array indexing.
# The returned array will have shape (3,)
print(a[[0, 1, 2], [0, 1, 0]])

# The above example of integer array indexing is equivalent to this:
print(np.array([a[0, 0], a[1, 1], a[2, 0]]))

(3, 2)
[1 4 5]
[1 4 5]


In [41]:
# When using integer array indexing, you can reuse the same
# element from the source array:
print(a[[0, 0], [1, 1]])

# Equivalent to the previous integer array indexing example
print(np.array([a[0, 1], a[0, 1]]))

[2 2]
[2 2]
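The difference between a slice (a view) and integer array indexing (a new array) becomes visible when the result is modified. A minimal sketch, with illustrative names `sliced` and `picked`:

```python
import numpy as np

a = np.arange(1, 17).reshape(4, 4)

sliced = a[:2, 1:3]        # slicing: a view onto a's data
picked = a[[0, 1], :]      # integer array indexing: a new array

sliced[0, 0] = 100         # also changes a[0, 1]
picked[0, 0] = -1          # leaves a untouched

print(a[0, 1], a[0, 0])
print(np.shares_memory(a, sliced), np.shares_memory(a, picked))
```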


One useful trick with integer array indexing is selecting or mutating one element from each row of a matrix:

In [58]:
# Create a new array from which we will select elements
a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
print(a)

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]


In [59]:
# Create an array of indices
b = np.array([0, 2, 0, 1])

# Select one element from each row of a using the indices in b
print(a[np.arange(4), b])  # Prints "[ 1  6  7 11]"

[ 1  6  7 11]


In [60]:
# Mutate one element from each row of a using the indices in b
a[np.arange(4), b] += 10
print(a)

[[11  2  3]
 [ 4  5 16]
 [17  8  9]
 [10 21 12]]


### Datatypes

Every numpy array is a grid of elements of the same type. Numpy provides a large set of numeric datatypes that you can use to construct arrays. Numpy tries to guess a datatype when you create an array, but functions that construct arrays usually also include an optional argument to explicitly specify the datatype. Here is an example:

In [63]:
x = np.array([1, 2])  # Let numpy choose the datatype
y = np.array([1.0, 2.0])  # Let numpy choose the datatype
z = np.array([1, 2], dtype=np.int64)  # Force a particular datatype

print(x.dtype, y.dtype, z.dtype)

int64 float64 int64


You can read all about numpy datatypes in the [documentation](http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html).
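An existing array can also be converted to another type with `astype`. This is relevant to the dataset above, whose values were read in as strings. The sketch uses three of those values typed in by hand:

```python
import numpy as np

# Strings like those in the "Value" column of the dataset.
text_values = np.array(["135451.32", "154519.994", "185203.657"])
print(text_values.dtype)

# astype returns a new array with the requested type.
float_values = text_values.astype(np.float64)
print(float_values.dtype, float_values.sum())
```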

### Array math

Basic mathematical functions operate elementwise on arrays, and are available both as operator overloads and as functions in the numpy module:

In [64]:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum; both produce the same array
print(x + y)
print(np.add(x, y))

[[ 6.  8.]
 [10. 12.]]
[[ 6.  8.]
 [10. 12.]]


In [65]:
# Elementwise difference; both produce the same array
print(x - y)
print(np.subtract(x, y))

[[-4. -4.]
 [-4. -4.]]
[[-4. -4.]
 [-4. -4.]]


In [66]:
# Elementwise product; both produce the same array
print(x * y)
print(np.multiply(x, y))

[[ 5. 12.]
 [21. 32.]]
[[ 5. 12.]
 [21. 32.]]


In [67]:
# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print(x / y)
print(np.divide(x, y))

[[0.2        0.33333333]
 [0.42857143 0.5       ]]
[[0.2        0.33333333]
 [0.42857143 0.5       ]]


In [68]:
# Elementwise square root; produces the array
# [[ 1.          1.41421356]
#  [ 1.73205081  2.        ]]
print(np.sqrt(x))

[[1.         1.41421356]
 [1.73205081 2.        ]]


Note that unlike MATLAB, `*` is elementwise multiplication, not matrix multiplication. We instead use the dot function to compute inner products of vectors, to multiply a vector by a matrix, and to multiply matrices. dot is available both as a function in the numpy module and as an instance method of array objects:

In [10]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

v = np.array([9,10])
w = np.array([11, 12])

# Inner product of vectors; both produce 219
print(v.dot(w))
print(np.dot(v, w))

219
219


You can also use the `@` operator, which is equivalent to `np.matmul` and gives the same results as `dot` for the one- and two-dimensional arrays used here.

In [11]:
print( v @ w )

219


In [71]:
# Matrix / vector product; all three produce the rank 1 array [29 67]
print(x.dot(v))
print(np.dot(x, v))
print(x @ v)

[29 67]
[29 67]
[29 67]


In [72]:
# Matrix / matrix product; all three produce the rank 2 array
# [[19 22]
#  [43 50]]
print(x.dot(y))
print(np.dot(x, y))
print(x @ y)

[[19 22]
 [43 50]]
[[19 22]
 [43 50]]
[[19 22]
 [43 50]]


Numpy provides many useful functions for performing computations on arrays; one of the most useful is `sum`:

In [73]:
x = np.array([[1,2],[3,4]])

print(np.sum(x))  # Compute sum of all elements; prints "10"
print(np.sum(x, axis=0))  # Compute sum of each column; prints "[4 6]"
print(np.sum(x, axis=1))  # Compute sum of each row; prints "[3 7]"

10
[4 6]
[3 7]


You can find the full list of mathematical functions provided by numpy in the [documentation](http://docs.scipy.org/doc/numpy/reference/routines.math.html).

Apart from computing mathematical functions using arrays, we frequently need to reshape or otherwise manipulate data in arrays. The simplest example of this type of operation is transposing a matrix; to transpose a matrix, simply use the T attribute of an array object:

In [74]:
print(x)
print("transpose\n", x.T)

[[1 2]
 [3 4]]
transpose
 [[1 3]
 [2 4]]


In [75]:
v = np.array([[1,2,3]])
print(v )
print("transpose\n", v.T)

[[1 2 3]]
transpose
 [[1]
 [2]
 [3]]
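Note that `v` above is a rank 2 array with one row. Transposing a rank 1 array leaves it unchanged, so an extra axis is needed to get a column. A short sketch:

```python
import numpy as np

v = np.array([1, 2, 3])        # rank 1, shape (3,)
print(v.T.shape)               # transposing changes nothing

column = v[:, np.newaxis]      # add an axis to get a column, shape (3, 1)
row = v.reshape(1, -1)         # or a row, shape (1, 3)
print(column.shape, row.shape)
```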


## Concatenating
NumPy also provides useful functions for concatenating (i.e., joining) arrays. Let's say we wanted to restrict our attention to the first and the last three rows of our dataset. First, we'll define new sub-arrays as follows:

In [44]:
array_start = data[:3,:]
array_start

array([['MSN', 'YYYYMM', 'Value', 'Column_Order', 'Description', 'Unit'],
       ['CLETPUS', '194913', '135451.32', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195013', '154519.994', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours']], dtype='<U80')

In [43]:
array_end = data[-3:,:]
array_end

array([['ELETPUS', '202101', '350815.342', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202102', '327018.71', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202103', '310700.554', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours']], dtype='<U80')

To concatenate these arrays we can use np.vstack, where the v denotes vertical, or row-wise, stacking of the sub-arrays:

In [47]:
np.vstack((array_start, array_end))

array([['MSN', 'YYYYMM', 'Value', 'Column_Order', 'Description', 'Unit'],
       ['CLETPUS', '194913', '135451.32', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195013', '154519.994', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202101', '350815.342', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202102', '327018.71', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202103', '310700.554', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours']], dtype='<U80')

In [46]:
np.hstack((array_start, array_end))

array([['MSN', 'YYYYMM', 'Value', 'Column_Order', 'Description', 'Unit',
        'ELETPUS', '202101', '350815.342', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '194913', '135451.32', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours', 'ELETPUS', '202102', '327018.71', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195013', '154519.994', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours', 'ELETPUS', '202103', '310700.554', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours']], dtype='<U80')

In the first output above, np.vstack() stacked the first three rows and the last three rows on top of each other.

The horizontal counterpart of np.vstack() is np.hstack(), which combines sub-arrays column-wise, as the second output shows. For higher dimensional joins, the most common function is np.concatenate(). The syntax for this function is similar to the 2D versions, with the additional requirement of specifying the axis along which concatenation should be performed.

Calling `np.concatenate((array_start, array_end), axis=0)` would generate identical output to using `np.vstack()`. Passing `axis=1` would generate identical output to using `np.hstack()`.
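This equivalence can be checked on two small arrays. The names `top` and `bottom` are placeholders for any pair of 2D arrays with matching shapes:

```python
import numpy as np

top = np.array([[1, 2], [3, 4]])
bottom = np.array([[5, 6], [7, 8]])

rows_joined = np.concatenate((top, bottom), axis=0)   # shape (4, 2)
cols_joined = np.concatenate((top, bottom), axis=1)   # shape (2, 4)

print(np.array_equal(rows_joined, np.vstack((top, bottom))))
print(np.array_equal(cols_joined, np.hstack((top, bottom))))
```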

## Splitting

The opposite of concatenating (i.e., joining) arrays is splitting them. To split an array, NumPy provides the following commands:

* hsplit: splits an array horizontally (column-wise, along axis 1)
* vsplit: splits an array vertically (row-wise, along axis 0)
* dsplit: splits an array along the 3rd axis (depth)
* array_split: lets you specify the axis to use in splitting, and allows sections of unequal size
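A brief sketch of these commands on a made-up 4x4 array:

```python
import numpy as np

grid = np.arange(1, 17).reshape(4, 4)

left, right = np.hsplit(grid, 2)    # two blocks of 2 columns each
upper, lower = np.vsplit(grid, 2)   # two blocks of 2 rows each
print(left.shape, upper.shape)

# array_split tolerates sections that do not divide evenly.
parts = np.array_split(np.arange(7), 3)
print([part.tolist() for part in parts])
```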
    
## Adding/Removing Elements
NumPy provides several functions for adding or deleting data from an array:

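* resize: Returns a new array with the specified shape. `np.resize()` fills any new cells with repeated copies of the original data, whereas the in-place array method `.resize()` fills them with zeros.
* append: Returns a new array with values added to the end
* insert: Returns a new array with values inserted before a given index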
* delete: Returns a new array with given data removed
* unique: Finds only the unique values of an array
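The following sketch runs each of these functions on a small made-up array. All of them return new arrays and leave the input untouched:

```python
import numpy as np

a = np.array([1, 2, 3])

print(np.append(a, [4, 5]))      # values added at the end
print(np.insert(a, 1, 99))       # 99 inserted before index 1
print(np.delete(a, 0))           # element at index 0 removed
print(np.unique([3, 1, 3, 2]))   # sorted unique values
print(np.resize(a, 5))           # grown by repeating a's elements
print(a)                         # the original is unchanged
```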
    
## Sorting
There are several useful functions for sorting array elements. Some of the available sorting algorithms include quicksort, heapsort, mergesort, and timsort.

For example, here's how you'd merge sort each row of an array:

In [48]:
a = np.array([[3,8,1,2], [9,5,4,8]])
np.sort(a, axis=1, kind='mergesort')      # Sort each row (along axis 1)

array([[1, 2, 3, 8],
       [4, 5, 8, 9]])
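The `axis` argument decides which direction is sorted. As a small sketch on the same array, compare sorting along each axis:

```python
import numpy as np

a = np.array([[3, 8, 1, 2], [9, 5, 4, 8]])

print(np.sort(a, axis=1))   # each row sorted left to right
print(np.sort(a, axis=0))   # each column sorted top to bottom
print(a)                    # np.sort returns a copy; a is unchanged
```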

## No Copy vs. Shallow Copy vs. Deep Copy
A common source of confusion for NumPy beginners is knowing when data is and isn't copied into a new object.

### No copy: function calls and assignments:

In [53]:
print(id(a))

# Object "b" points to object "a". No new object is created.
b = a       

# Python passes objects as references. No copy is made.
def f(x):   
    print(id(x))
    
f(b)

1770907428784
1770907428784


Notice that the id printed inside f, which receives b, is the same as the id of a: both names refer to one object.

### View/Shallow Copy: 
A view is a new array object that shares its data with the original array. The view method creates such an object, and slicing an array also returns a view.

In [54]:
# View
a = b.view()

# The shape of b doesn't change
a = a.reshape((4, 2))    

# Slice
# a[:] is a view of "a".
a[:] = 5
print(a)
print(b)

[[5 5]
 [5 5]
 [5 5]
 [5 5]]
[[5 5 5 5]
 [5 5 5 5]]

### Deep copy: 
Use the copy method to make a complete copy of an array and all its data.
The copy() method creates a new array object c with the same contents as a but its own data, so later changes to a do not affect c.

In [56]:
c = a.copy()
a[:] = 6
print(a)
print(c)

[[6 6]
 [6 6]
 [6 6]
 [6 6]]
[[5 5]
 [5 5]
 [5 5]
 [5 5]]
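To check which of the three cases applies to a given pair of arrays, `is` and `np.shares_memory` can be used. The sketch below uses illustrative names (`alias`, `view`, `deep`) on a fresh array:

```python
import numpy as np

a = np.arange(8).reshape(2, 4)

alias = a            # no copy: a second name for the same object
view = a.view()      # shallow copy: new object, shared data
deep = a.copy()      # deep copy: new object, separate data

print(alias is a, view is a, deep is a)
print(np.shares_memory(a, view), np.shares_memory(a, deep))
```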
