# Introduction to NumPy

## What is NumPy?

NumPy (Numerical Python) is an **open source Python library that’s used in almost every field of science and engineering**. It’s the universal standard for working with numerical data in Python, and it’s at the core of the scientific Python and PyData ecosystems. NumPy users include everyone from beginning coders to experienced researchers doing state-of-the-art scientific and industrial research and development.

NumPy is the **fundamental package for scientific computing in Python**. It is a Python library that provides a **multidimensional array object**, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

> At the core of the NumPy package is the **ndarray** object.

There are several important differences between NumPy arrays and the standard Python sequences:

- NumPy arrays have a **fixed size at creation**, unlike Python lists (which can grow dynamically). Changing the size of an ndarray will create a new array and delete the original.
- The elements in a NumPy array are **all required to be of the same data type**, and thus will be the **same size in memory**. The exception: one can have arrays of (Python, including NumPy) objects, thereby allowing for arrays of different sized elements.
- NumPy arrays **facilitate advanced mathematical and other types of operations** on large amounts of data. Typically, such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences.
- A growing plethora of scientific and mathematical Python-based packages are using NumPy arrays. The NumPy API is used extensively in Pandas, SciPy, Matplotlib, scikit-learn, scikit-image and most other data science and scientific Python packages.

One reason NumPy is so important for numerical computations in Python is that it is designed for efficiency on large arrays of data.

NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.

## Installing NumPy

The only prerequisite for installing NumPy is Python itself.

NumPy can be installed with conda, with pip, with a package manager on macOS and Linux, or from source.

If you use uv, you can install NumPy with: `uv pip install numpy`

We added NumPy to the requirements.txt file, so it is already installed once you set up the environment.

To access NumPy and its functions, import it in your Python code like this:

In [1]:
import numpy as np

We shorten the imported name to `np` for better readability of code using NumPy. This is a widely adopted convention that makes your code more readable for everyone working on it. We recommend always using `import numpy as np`.

In [2]:
np.__version__

'2.2.3'

## Understanding Data Types in Python

[Understanding Data Types in Python](https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html)


Effective data-driven science and computation **requires understanding how data is stored and manipulated**. This section outlines and contrasts how arrays of data are handled in the Python language itself, and how NumPy improves on this.

Users of Python are often drawn in by its ease of use, one piece of which is dynamic typing. While a **statically-typed language like C or Java** requires each variable to be explicitly declared, a dynamically-typed language like Python skips this specification. For example, in C you might specify a particular operation as follows:

```C
/* C code */
int result = 0;
for(int i=0; i<100; i++){
    result += i;
}
```

While in Python the equivalent operation could be written this way:

```python
# Python code
result = 0
for i in range(100):
    result += i
```


Notice the main difference: in C, the data types of each variable are explicitly declared, while in Python the types are dynamically inferred. This means, for example, that **we can assign any kind of data to any variable**:


In [3]:
# Python code
x = 4
x = "four"

Here we've switched the contents of x from an integer to a string. The same thing in C would lead (depending on compiler settings) to a compilation error or other unintended consequences:

```C
/* C code */
int x = 4;
x = "four";  // FAILS
```

This sort of flexibility is one piece that makes Python and other dynamically-typed languages convenient and easy to use. Understanding how this works is an important piece of learning to analyze data efficiently and effectively with Python. But what this type-flexibility also points to is the fact that **Python variables are more than just their value; they also contain extra information about the type of the value**. We'll explore this more in the sections that follow.


### A Python Integer Is More Than Just an Integer

The standard Python implementation is written in C. This means that every Python object is simply a cleverly-disguised C structure, which contains not only its value, but other information as well. For example, when we define an integer in Python, such as x = 10000, x is not just a "raw" integer. It's actually a pointer to a compound C structure, which contains several values. Looking through the Python 3.4 source code, we find that the integer (long) type definition effectively looks like this (once the C macros are expanded):

```C
struct _longobject {
    long ob_refcnt;
    PyTypeObject *ob_type;
    size_t ob_size;
    long ob_digit[1];
};
```

A single integer in Python 3.4 actually contains four pieces:

- ob_refcnt, a reference count that helps Python silently handle memory allocation and deallocation
- ob_type, which encodes the type of the variable
- ob_size, which specifies the size of the following data members
- ob_digit, which contains the actual integer value that we expect the Python variable to represent.

This means that there is some overhead in storing an integer in Python as compared to an integer in a compiled language like C, as illustrated in the following figure:

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/cint_vs_pyint.png" alt="Integer Memory Layout">


Here PyObject_HEAD is the part of the structure containing the reference count, type code, and other pieces mentioned before.

Notice the difference here: a C integer is essentially a label for a position in memory whose bytes encode an integer value. A Python integer is a pointer to a position in memory containing all the Python object information, including the bytes that contain the integer value. This extra information in the Python integer structure is what allows Python to be coded so freely and dynamically. All this additional information in Python types comes at a cost, however, which becomes especially apparent in structures that combine many of these objects.
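
The overhead described here can be observed from Python itself. The following is a minimal sketch: `sys.getsizeof` reports the size of the whole integer object, while one `int64` element in a NumPy array takes 8 bytes. The exact object size depends on the Python version and platform, so no specific number is assumed.

```python
import sys

import numpy as np

x = 10000
py_int_bytes = sys.getsizeof(x)  # size of the full Python object, header included
np_int_bytes = np.dtype(np.int64).itemsize  # size of one raw 64-bit integer

print(f"Python int object: {py_int_bytes} bytes")
print(f"NumPy int64 element: {np_int_bytes} bytes")
```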


### A Python List Is More Than Just a List

Let's consider now what happens when we use a Python data structure that holds many Python objects. The standard mutable multi-element container in Python is the list. We can create a list of integers as follows:


In [1]:
L = list(range(10))
L

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [5]:
type(L[0])

int

Or, similarly, a list of strings:


In [2]:
L2 = [str(c) for c in L]

In [7]:
type(L2[0])

str

Because of Python's dynamic typing, we can even create heterogeneous lists:


In [8]:
L3 = [True, "2", 3.0, 4]
[type(item) for item in L3]

[bool, str, float, int]

But this flexibility comes at a cost: to allow these flexible types, each item in the list must contain its own type info, reference count, and other information; that is, each item is a complete Python object. In the special case that all variables are of the same type, much of this information is redundant: it can be much more efficient to store data in a fixed-type array. The difference between a dynamic-type list and a fixed-type (NumPy-style) array is illustrated in the following figure:

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/array_vs_list.png" alt="Array Memory Layout">


At the implementation level, the array essentially contains a single pointer to one contiguous block of data. The Python list, on the other hand, contains a pointer to a block of pointers, each of which in turn points to a full Python object like the Python integer we saw earlier. Again, the advantage of the list is flexibility: because each list element is a full structure containing both data and type information, the list can be filled with data of any desired type. Fixed-type NumPy-style arrays lack this flexibility, but are much more efficient for storing and manipulating data.
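
A rough way to see this difference is to compare how many bytes each container needs. The sketch below is only an estimate: it adds the size of the list to the size of every integer object it points to, and compares that with the array's data buffer. The element count `n` is an arbitrary choice for illustration.

```python
import sys

import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# The list stores pointers; each pointed-to integer is a separate object.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(item) for item in py_list)
# The array stores the raw values in one contiguous buffer.
array_bytes = np_array.nbytes

print(f"list and its integer objects: {list_bytes} bytes")
print(f"array data buffer: {array_bytes} bytes")
```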


### Fixed-Type Arrays in Python

Python offers several different options for storing data in efficient, fixed-type data buffers. The built-in array module can be used to create dense arrays of a uniform type:


In [9]:
import array

L = list(range(10))
A = array.array("i", L)
A

array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Here 'i' is a type code indicating the contents are integers.

Much more useful, however, is the ndarray object of the NumPy package. While Python's array object provides efficient storage of array-based data, NumPy adds to this efficient operations on that data.
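
The difference between efficient storage and efficient operations shows up as soon as we apply an operator. In this small sketch (variable names are illustrative), `*` repeats an `array.array` the way it repeats a list, while on an ndarray it multiplies every element:

```python
import array

import numpy as np

values = list(range(5))
A = array.array("i", values)
N = np.array(values)

repeated = A * 2  # array.array: * repeats the sequence, like a list
doubled = N * 2   # ndarray: * multiplies every element

print(repeated)
print(doubled)
```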


## NumPy Speed


Python is what we call a high-level language. High-level languages allow you to write programs faster, as the interpreter makes the decisions on how to execute your instructions. In contrast, when you use low-level languages like C, you define exactly how memory will be managed and how the processor will execute your instructions. This means that coding in a low-level language takes longer, but you have more ability to optimize your code to run faster.

The points about sequence size and speed are particularly important in scientific computing. As a simple example, consider the case of multiplying each element in a 1-D sequence with the corresponding element in another sequence of the same length. If the data are stored in two Python lists, a and b, we could iterate over each element:


In [1]:
a = list(range(10000000))
b = list(range(20000000, 30000000))
c = []

In [3]:
%%timeit -n 1 -r 1
for i in range(len(a)):
    c.append(a[i] * b[i])  # noqa: PERF401

print(c[:10])

[0, 20000001, 40000004, 60000009, 80000016, 100000025, 120000036, 140000049, 160000064, 180000081]
1.22 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


This produces the correct answer, but if a and b each contain millions of numbers, we will pay the price for the inefficiencies of looping in Python. We could accomplish the same task much more quickly in C by writing (for clarity we neglect variable declarations and initializations, memory allocation, etc.):

```c
for (i = 0; i < rows; i++) {
  c[i] = a[i]*b[i];
}
```


This saves all the overhead involved in interpreting the Python code and manipulating Python objects, but at the expense of the benefits gained from coding in Python. Furthermore, the coding work required increases with the dimensionality of our data. In the case of a 2-D array, for example, the C code (abridged as before) expands to

```c
for (i = 0; i < rows; i++) {
  for (j = 0; j < columns; j++) {
    c[i][j] = a[i][j]*b[i][j];
  }
}
```


**NumPy gives us the best of both worlds**: element-by-element operations are the “default mode” when an ndarray is involved, but the element-by-element operation is speedily executed by pre-compiled C code. In NumPy, `c = a * b` does what the earlier examples do, at near-C speeds, but with the code simplicity we expect from something based on Python.


In [4]:
import numpy as np

a = np.arange(10000000)
b = np.arange(20000000, 30000000)

In [8]:
%%timeit -n 1 -r 1

c = (10+a) * b

print(c[:10])

[200000000 220000011 240000024 260000039 280000056 300000075 320000096
 340000119 360000144 380000171]
171 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


**Why is NumPy Fast?**

- **Vectorization** describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place, of course, just “behind the scenes” in optimized, pre-compiled C code. Vectorized code has many advantages, among which are:
  - vectorized code is more concise and easier to read
  - fewer lines of code generally means fewer bugs
  - the code more closely resembles standard mathematical notation (making it easier, typically, to correctly code mathematical constructs)
  - vectorization results in more “Pythonic” code. Without vectorization, our code would be littered with inefficient and difficult-to-read for loops.
- **Broadcasting** is the term used to describe the implicit element-by-element behavior of operations; generally speaking, in NumPy all operations, not just arithmetic operations, but logical, bit-wise, functional, etc., behave in this implicit element-by-element fashion, i.e., they broadcast. Moreover, in the example above, a and b could be multidimensional arrays of the same shape, or a scalar and an array, or even two arrays with different shapes, provided that the smaller array is “expandable” to the shape of the larger in such a way that the resulting broadcast is unambiguous.
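
The two cases mentioned for broadcasting, a scalar with an array and two arrays of different shapes, can be sketched as follows. The array sizes are arbitrary and chosen only to keep the output small.

```python
import numpy as np

a = np.arange(3)                     # shape (3,)
column = np.arange(4).reshape(4, 1)  # shape (4, 1)

scaled = a * 10     # the scalar is stretched to shape (3,)
table = column + a  # (4, 1) and (3,) broadcast to (4, 3)

print(scaled)
print(table.shape)
print(table)
```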


---

## Example: Data analysis in pure Python


In [1]:
import csv

dataset_path = "data/f500_small.csv"

with open(dataset_path, "r") as f:
    f500_small = list(csv.reader(f))

In [2]:
len(f500_small)

20

In [3]:
print(*f500_small[:3], sep="\n")

['company', 'rank', 'revenues', 'revenue_change', 'profits', 'assets', 'profit_change', 'ceo', 'industry', 'sector', 'previous_rank', 'country', 'hq_location', 'website', 'years_on_global_500_list', 'employees', 'total_stockholder_equity']
['Walmart', '1', '485873', '0.8', '13643.0', '198825', '-7.2', 'C. Douglas McMillon', 'General Merchandisers', 'Retailing', '1', 'USA', 'Bentonville, AR', 'http://www.walmart.com', '23', '2300000', '77798']
['State Grid', '2', '315199', '-4.4', '9571.3', '489838', '-6.2', 'Kou Wei', 'Utilities', 'Energy', '2', 'China', 'Beijing, China', 'http://www.sgcc.com.cn', '17', '926067', '209456']


The goal is to sum all the values in the `revenues` column.


In [12]:
total_revenues = sum([int(row[2]) for row in f500_small[1:]])
print(total_revenues)

4305395
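
For comparison, here is a sketch of the same kind of column sum with NumPy. It does not read `data/f500_small.csv`; instead it assumes a cut-down sample made of the two data rows printed above, reduced to three columns, so the total differs from the full-file result.

```python
import numpy as np

# Cut-down sample: header plus the two data rows shown above,
# reduced to three columns for brevity.
rows = [
    ["company", "rank", "revenues"],
    ["Walmart", "1", "485873"],
    ["State Grid", "2", "315199"],
]

data = np.array(rows[1:])               # 2-D array of strings
revenues = data[:, 2].astype(np.int64)  # third column converted to integers
total = revenues.sum()

print(total)
```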


## Introduction to Ndarrays


A multidimensional array is a **central data structure** of the NumPy library. An array is a grid of values and it contains information about the raw data, how to locate an element, and how to interpret an element. It has a grid of elements that can be **indexed in various ways**. The elements are all of the **same type, referred to as the array dtype**.

NumPy arrays are faster and more compact than Python lists: an array consumes less memory and is convenient to use. NumPy also provides a mechanism for specifying the data types, which allows the code to be optimized even further.

An ndarray is a table of elements (usually numbers), **all of the same type**, indexed by a tuple of non-negative integers. In NumPy, **dimensions are called axes**.


For example, the array for the coordinates of a point in 3D space, `[1, 2, 1]`, has one axis. That axis has 3 elements in it, so we say it has a length of 3. In the example pictured below, the array has 2 axes. The first axis has a length of 2, the second axis has a length of 3.

    [[1., 0., 0.],
     [0., 1., 2.]]

NumPy’s array class is called `ndarray`. It is also known by the alias `array`. Note that `numpy.array` is not the same as the standard Python library class `array.array`, which only handles one-dimensional arrays and offers less functionality.


One way we can initialize NumPy arrays is from Python lists, using nested lists for two- or higher-dimensional data.


In [11]:
a = np.array([1, 2, 3, 4, 5, 6])
print(a)
a

[1 2 3 4 5 6]


array([1, 2, 3, 4, 5, 6])

In [49]:
print(type(a))

<class 'numpy.ndarray'>


In [28]:
b = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

The more important attributes of an ndarray object are:


In [31]:
# ndarray.ndim: the number of axes (dimensions) of the array.
print(f"a ndim: {a.ndim}, b ndim: {b.ndim}")

# ndarray.shape: the dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension.
print(f"a shape: {a.shape}, b shape: {b.shape}")

# ndarray.size: the total number of elements of the array. This is equal to the product of the elements of shape.
print(f"a size: {a.size}, b size: {b.size}")

# ndarray.dtype: an object describing the type of the elements in the array.
print(f"a dtype: {a.dtype}, b dtype: {b.dtype}")

# ndarray.itemsize: the size in bytes of each element of the array.
print(f"a itemsize: {a.itemsize}, b itemsize: {b.itemsize}")

# ndarray.data: the buffer containing the actual elements of the array.
# Normally, we won’t need to use this attribute because we will access the elements in an array using indexing facilities.
print(f"a data: {a.data}, b data: {b.data}")

a ndim: 1, b ndim: 2
a shape: (6,), b shape: (3, 4)
a size: 6, b size: 12
a dtype: int32, b dtype: int32
a itemsize: 4, b itemsize: 4
a data: <memory at 0x0000023D2D535540>, b data: <memory at 0x0000023D3B9A98A0>


You might occasionally hear an array referred to as an “ndarray,” which is shorthand for “N-dimensional array.” An N-dimensional array is simply an array with any number of dimensions. You might also hear 1-D, or one-dimensional array, 2-D, or two-dimensional array, and so on. The NumPy ndarray class is used to represent both matrices and vectors. A **vector** is an array with a single dimension (there’s no difference between row and column vectors), while a **matrix** refers to an array with two dimensions. For 3-D or higher dimensional arrays, the term **tensor** is also commonly used.


<img alt="Dimensional Arrays" src="./images/one_dim.svg">


<img alt="Dimensional Arrays" src="./images/Two_Dim.svg">


There are several ways to create arrays.

For example, you can create an array from a regular Python list or tuple using the array function. The type of the resulting array is deduced from the type of the elements in the sequences.


In [32]:
a = np.array([2, 3, 4])
a

array([2, 3, 4])

In [33]:
a.dtype

dtype('int32')

In [34]:
b = np.array([1.2, 3.5, 5.1])
b.dtype

dtype('float64')

A frequent error is calling `array` with multiple arguments, rather than providing a single sequence as an argument.


In [35]:
a = np.array(1, 2, 3, 4)  # WRONG

TypeError: array() takes from 1 to 2 positional arguments but 4 were given

In [36]:
a = np.array([1, 2, 3, 4])  # RIGHT

array transforms sequences of sequences into two-dimensional arrays, sequences of sequences of sequences into three-dimensional arrays, and so on.


In [38]:
b = np.array([(1.5, 2, 3), (4, 5, 6)])
b

array([[1.5, 2. , 3. ],
       [4. , 5. , 6. ]])

The type of the array can also be explicitly specified at creation time:


In [40]:
c = np.array([[1, 2], [3, 4]], dtype="int64")
c.dtype

dtype('int64')

Often, the **elements of an array are originally unknown, but its size is known**. Hence, NumPy offers several functions to create arrays with initial placeholder content. These minimize the necessity of **growing arrays, an expensive operation**.


The function `zeros` creates an array full of zeros, the function `ones` creates an array full of ones, and the function `empty` creates an array whose initial content is arbitrary and depends on the state of the memory. By default, the dtype of the created array is float64, but it can be specified via the keyword argument dtype.


In [41]:
np.zeros((3, 4))

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [42]:
np.ones((2, 3, 4), dtype=np.int16)

array([[[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]],

       [[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]], dtype=int16)

In [43]:
np.empty((2, 3))

array([[1.5, 2. , 3. ],
       [4. , 5. , 6. ]])
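
The reason placeholder arrays matter is that growing an array means copying it. The sketch below contrasts the two approaches on a tiny example; `np.append` is used here only to illustrate growing, and the size `n` is arbitrary.

```python
import numpy as np

n = 5

# Growing: np.append returns a new, larger copy on every call.
grown = np.array([], dtype=np.int64)
for i in range(n):
    grown = np.append(grown, i * i)

# Preallocating: allocate once, then fill in place.
filled = np.empty(n, dtype=np.int64)
for i in range(n):
    filled[i] = i * i

print(grown)
print(filled)
```

Both arrays hold the same values, but the preallocated version avoids creating a new array on each iteration.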

To create sequences of numbers, NumPy provides the arange function which is analogous to the Python built-in range, but returns an array.


In [44]:
np.arange(10, 30, 5)

array([10, 15, 20, 25])

In [45]:
np.arange(0, 2, 0.3)  # it accepts float arguments

array([0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8])

When arange is used with floating point arguments, it is generally not possible to predict the number of elements obtained, due to the finite floating point precision. For this reason, it is usually better to use the function linspace that receives as an argument the number of elements that we want, instead of the step:


In [46]:
np.linspace(0, 2, 9)

array([0.  , 0.25, 0.5 , 0.75, 1.  , 1.25, 1.5 , 1.75, 2.  ])
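
To see the difference in practice, the following sketch builds a similar range both ways. The start, stop, and step values are an arbitrary illustration; the length of the `arange` result is printed rather than assumed, since it depends on floating point rounding.

```python
import numpy as np

# With a float step, the element count follows from rounded arithmetic,
# so the stop value may or may not end up in the result.
stepped = np.arange(1, 1.3, 0.1)

# linspace takes the element count directly and includes both endpoints.
spaced = np.linspace(1, 1.3, 4)

print(len(stepped), stepped)
print(len(spaced), spaced)
```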

In [48]:
x = np.linspace(0, 2 * np.pi, 100)  # useful to evaluate function at lots of points
f = np.sin(x)
f[:10]

array([0.        , 0.06342392, 0.12659245, 0.18925124, 0.25114799,
       0.31203345, 0.37166246, 0.42979491, 0.48619674, 0.54064082])

**Reshaping the array:** Using arr.reshape() will give a new shape to an array without changing the data. Just remember that when you use the reshape method, the array you want to produce needs to have the same number of elements as the original array. If you start with an array with 12 elements, you’ll need to make sure that your new array also has a total of 12 elements.


If you start with this array:


In [77]:
a = np.arange(6)
print(a)

# You can use reshape() to reshape your array. For example, you can reshape this array to an array with three rows and two columns:
b = a.reshape(3, 2)
print(b)

[0 1 2 3 4 5]
[[0 1]
 [2 3]
 [4 5]]
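
Two related behaviors are worth trying out. In this sketch, passing `-1` for one dimension (a NumPy feature not covered in the text above) lets NumPy infer it from the element count, and asking for a shape with the wrong number of elements raises a `ValueError`.

```python
import numpy as np

a = np.arange(6)

# -1 asks NumPy to infer that dimension from the element count.
inferred = a.reshape(2, -1)
print(inferred.shape)

# A shape whose product is not 6 is rejected.
try:
    a.reshape(4, 2)
except ValueError as exc:
    print(f"ValueError: {exc}")
```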


When you **print an array**, NumPy displays it in a similar way to nested lists, but with the following layout:

- the last axis is printed from left to right,
- the second-to-last is printed from top to bottom,
- the rest are also printed from top to bottom, with each slice separated from the next by an empty line.

One-dimensional arrays are then printed as rows, bidimensionals as matrices and tridimensionals as lists of matrices.


In [50]:
a = np.arange(6)
print(a)

[0 1 2 3 4 5]


In [51]:
b = np.arange(12).reshape(4, 3)
print(b)

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]


In [53]:
c = np.arange(24).reshape(2, 3, 4)
print(c)

[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]


If an array is too large to be printed, NumPy automatically skips the central part of the array and only prints the corners:


In [54]:
print(np.arange(10000))

[   0    1    2 ... 9997 9998 9999]


## Datatypes


Unless explicitly specified (more on this later), np.array tries to infer a good data type for the array that it creates. The data type is stored in a special dtype metadata object.

Every NumPy array is a grid of elements of the same type. NumPy provides a large set of numeric datatypes that you can use to construct arrays. NumPy tries to guess a datatype when you create an array, but functions that construct arrays usually also include an optional argument to explicitly specify the datatype.

[More about datatypes](https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)

[List of scalars](https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html#arrays-scalars-built-in)


In [35]:
x = np.array([1, 2])  # Let numpy choose the datatype
print(x.dtype)
print(x.nbytes)

x = np.array([1.0, 2.0])  # Let numpy choose the datatype
print(x.dtype)
print(x.nbytes)

int32
8
float64
16


In [36]:
x = np.array([1, 2], dtype=np.int64)  # Force a particular datatype
print(x.dtype)
print(x.nbytes)

x = np.array([1, 2], dtype=np.int8)  # Force a particular datatype
print(x.dtype)
print(x.nbytes)

int64
16
int8
2


NumPy arrays contain values of a single type, so it is important to have detailed knowledge of those types and their limitations. Because NumPy is built in C, the types will be familiar to users of C, Fortran, and other related languages.

The standard NumPy data types are listed in the following table. Note that when constructing an array, they can be specified using a string:


In [37]:
np.zeros(10, dtype="int16")

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int16)

Or using the associated NumPy object:


In [38]:
np.ones(10, dtype=np.int16)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int16)

<div class="text_cell_render border-box-sizing rendered_html">
<table>
<thead><tr>
<th>Data type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>bool_</code></td>
<td>Boolean (True or False) stored as a byte</td>
</tr>
<tr>
<td><code>int_</code></td>
<td>Default integer type (same as C <code>long</code>; normally either <code>int64</code> or <code>int32</code>)</td>
</tr>
<tr>
<td><code>intc</code></td>
<td>Identical to C <code>int</code> (normally <code>int32</code> or <code>int64</code>)</td>
</tr>
<tr>
<td><code>intp</code></td>
<td>Integer used for indexing (same as C <code>ssize_t</code>; normally either <code>int32</code> or <code>int64</code>)</td>
</tr>
<tr>
<td><code>int8</code></td>
<td>Byte (-128 to 127)</td>
</tr>
<tr>
<td><code>int16</code></td>
<td>Integer (-32768 to 32767)</td>
</tr>
<tr>
<td><code>int32</code></td>
<td>Integer (-2147483648 to 2147483647)</td>
</tr>
<tr>
<td><code>int64</code></td>
<td>Integer (-9223372036854775808 to 9223372036854775807)</td>
</tr>
<tr>
<td><code>uint8</code></td>
<td>Unsigned integer (0 to 255)</td>
</tr>
<tr>
<td><code>uint16</code></td>
<td>Unsigned integer (0 to 65535)</td>
</tr>
<tr>
<td><code>uint32</code></td>
<td>Unsigned integer (0 to 4294967295)</td>
</tr>
<tr>
<td><code>uint64</code></td>
<td>Unsigned integer (0 to 18446744073709551615)</td>
</tr>
<tr>
<td><code>float_</code></td>
<td>Shorthand for <code>float64</code> (removed in NumPy 2.0; use <code>float64</code> instead).</td>
</tr>
<tr>
<td><code>float16</code></td>
<td>Half precision float: sign bit, 5 bits exponent, 10 bits mantissa</td>
</tr>
<tr>
<td><code>float32</code></td>
<td>Single precision float: sign bit, 8 bits exponent, 23 bits mantissa</td>
</tr>
<tr>
<td><code>float64</code></td>
<td>Double precision float: sign bit, 11 bits exponent, 52 bits mantissa</td>
</tr>
<tr>
<td><code>complex_</code></td>
<td>Shorthand for <code>complex128</code> (removed in NumPy 2.0; use <code>complex128</code> instead).</td>
</tr>
<tr>
<td><code>complex64</code></td>
<td>Complex number, represented by two 32-bit floats</td>
</tr>
<tr>
<td><code>complex128</code></td>
<td>Complex number, represented by two 64-bit floats</td>
</tr>
</tbody>
</table>

</div>
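
The ranges listed in the table do not have to be memorized. As a small sketch, `np.iinfo` and `np.finfo` report the limits of integer and floating point types; the three integer types below are just a sample.

```python
import numpy as np

for dtype in (np.int8, np.int16, np.uint8):
    info = np.iinfo(dtype)  # integer limits for this dtype
    print(f"{np.dtype(dtype).name}: {info.min} to {info.max}")

float_info = np.finfo(np.float32)  # floating point limits and precision
print(f"float32: {float_info.bits} bits, eps={float_info.eps}")
```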


In [39]:
x = np.array([189, 22, -129], dtype=np.int8)  # Force a particular datatype
print(x)
print(x.dtype)  # Prints "int8"
print(x.nbytes)

[-67  22 127]
int8
3


The values 189 and -129 do not fit in `int8` (-128 to 127), so in the output above they wrap around to -67 and 127. NumPy warns about this conversion of out-of-bound Python integers, and newer releases refuse it. For the old behavior, `np.array(value).astype(dtype)` will usually give the desired result (the cast overflows).
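
The `astype` route mentioned for the overflow example can be sketched as follows: the values are first stored in the default integer type, which holds all three, and then cast explicitly so that out-of-range values wrap around.

```python
import numpy as np

values = np.array([189, 22, -129])  # default integer dtype holds all three
wrapped = values.astype(np.int8)    # explicit cast wraps out-of-range values

print(wrapped)
print(wrapped.dtype)
```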


In [113]:
a = np.array(["a", "b", "c"])
a.dtype  # Unicode string of 1 character

dtype('<U1')

In [119]:
a = np.array(["a", "b", "c", 23, 34.5, True])
print(a.dtype)
print(a)

<U32
['a' 'b' 'c' '23' '34.5' 'True']
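
If the original Python types need to be kept, one option is an object array, the exception to the same-type rule mentioned in the introduction. This is a sketch of the difference, not a recommendation: object arrays give up most of the speed and memory benefits of fixed-type arrays.

```python
import numpy as np

mixed = ["a", 23, 34.5, True]

as_text = np.array(mixed)                   # everything is converted to strings
as_objects = np.array(mixed, dtype=object)  # elements stay Python objects

print(as_text.dtype, as_text)
print(as_objects.dtype, [type(item).__name__ for item in as_objects])
```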


## Basic Operations and Universal Functions


Arithmetic operators on arrays apply elementwise. A new array is created and filled with the result.


In [55]:
a = np.array([20, 30, 40, 50])
b = np.arange(4)
c = a - b
print(c)

[20 29 38 47]


In [56]:
b**2

array([0, 1, 4, 9])

In [82]:
b * 4

array([ 0,  4,  8, 12])

NumPy understands that the multiplication should happen with each cell. That concept is called broadcasting. Broadcasting is a mechanism that allows NumPy to perform operations on arrays of different shapes. The dimensions of your array must be compatible, for example, when the dimensions of both arrays are equal or when one of them is 1. If the dimensions are not compatible, you will get a ValueError.
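
The compatibility rule can be tried out directly. In this sketch (array names are illustrative), a shape `(3, 2)` array combines with shape `(2,)` and with shape `(3, 1)`, but not with shape `(3,)`:

```python
import numpy as np

rows = np.ones((3, 2))
per_column = np.array([10, 20])  # shape (2,) matches the last axis
per_row = np.array([1, 2, 3])    # shape (3,) does not match the last axis

print(rows * per_column)         # works: (3, 2) with (2,)

try:
    rows * per_row               # fails: (3, 2) with (3,)
except ValueError as exc:
    print(f"ValueError: {exc}")

print(rows * per_row.reshape(3, 1))  # works: (3, 2) with (3, 1)
```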


In [57]:
10 * np.sin(a)

array([ 9.12945251, -9.88031624,  7.4511316 , -2.62374854])

In [58]:
a < 35

array([ True,  True, False, False])

Unlike in many matrix languages, the product operator \* operates elementwise on NumPy arrays. The matrix product can be performed using the @ operator (in Python >=3.5) or the dot function or method:


In [60]:
A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3, 4]])
print(A * B)
print(A @ B)
print(A.dot(B))

[[2 0]
 [0 4]]
[[5 4]
 [3 4]]
[[5 4]
 [3 4]]


Some operations, such as += and \*=, act in place to modify an existing array rather than create a new one.


In [68]:
rg = np.random.default_rng(1)  # create instance of default random number generator
a = np.ones((2, 3), dtype=int)
b = rg.random((2, 3))
a *= 3
a

array([[3, 3, 3],
       [3, 3, 3]])

In [69]:
b += a
b

array([[3.51182162, 3.9504637 , 3.14415961],
       [3.94864945, 3.31183145, 3.42332645]])

In [70]:
a += b  # b is not automatically converted to integer type

UFuncTypeError: Cannot cast ufunc 'add' output from dtype('float64') to dtype('int32') with casting rule 'same_kind'
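
There are two common ways around this error, sketched below with small constant arrays chosen for illustration: build a new array, which is upcast to float, or convert the float array explicitly before the in-place operation, accepting that the fractional part is lost.

```python
import numpy as np

a = np.full((2, 3), 3)    # integer array
b = np.full((2, 3), 0.5)  # float array

total = a + b             # new array, upcast to float64
a += b.astype(a.dtype)    # in place: b is truncated to integers first

print(total.dtype, total[0])
print(a.dtype, a[0])
```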

When operating with arrays of different types, the type of the resulting array corresponds to the more general or precise one (a behavior known as upcasting).


In [72]:
a = np.ones(3, dtype=np.int32)
b = np.linspace(0, np.pi, 3)
b.dtype.name

'float64'

In [74]:
c = a + b
c

array([1.        , 2.57079633, 4.14159265])

In [75]:
c.dtype.name

'float64'

In [76]:
d = np.exp(c * 1j)
print(d)
print(d.dtype.name)

[ 0.54030231+0.84147098j -0.84147098+0.54030231j -0.54030231-0.84147098j]
complex128


NumPy provides familiar mathematical functions such as sin, cos, and exp. In NumPy, these are called **universal functions** (ufunc). Within NumPy, these functions operate elementwise on an array, producing an array as output.


In [78]:
B = np.arange(3)
B

array([0, 1, 2])

In [79]:
np.exp(B)  # calculates e^x for each value of x in your input array

array([1.        , 2.71828183, 7.3890561 ])

In [80]:
np.sqrt(B)

array([0.        , 1.        , 1.41421356])

In [81]:
C = np.array([2.0, -1.0, 4.0])
np.add(B, C)

array([2., 0., 6.])

## Indexing, Slicing and Iterating


- [Indexing, Slicing and Iterating](https://numpy.org/doc/stable/user/quickstart.html#indexing-slicing-and-iterating)
- [Indexing and slicing](https://numpy.org/doc/stable/user/absolute_beginners.html#indexing-and-slicing)


One-dimensional arrays can be indexed, sliced and iterated over, much like lists and other Python sequences.


In [3]:
a = np.arange(10) ** 3
a[2]


8

In [4]:
print(a[2:5])
print(a[:6:2])
print(a[::-1])


[ 8 27 64]
[ 0  8 64]
[729 512 343 216 125  64  27   8   1   0]


In [5]:
for i in a:
    print(i ** (1 / 3.0))

0.0
1.0
2.0
3.0
3.9999999999999996
5.0
5.999999999999999
6.999999999999999
7.999999999999999
8.999999999999998


### Selecting and Slicing Rows and Items from ndarrays


An array can be indexed by a tuple of nonnegative integers, by booleans, by another array, or by integers.


Next, let's look at a comparison between working with ndarrays and lists of lists to select one or more rows of data:


<img alt="Dimensional Arrays" src="./images/selection_rows.svg">


As shown above, we can select rows in ndarrays very similarly to lists of lists. In reality, what we're seeing is a kind of shortcut. For any 2D array, the full syntax for selecting data is:


    ndarray[row_index,column_index]

    # or if you want to select all
    # columns for a given set of rows
    ndarray[row_index]


Where row_index defines the location along the row axis and column_index defines the location along the column axis.

Both row and column can be one of the following:

- An integer, indicating a specific location, e.g. ndarray[3,0].
- A slice, indicating a range of locations, e.g. ndarray[0:5,6:].
- A colon, indicating every location, e.g. ndarray[:,2].
- A list of values, indicating specific locations, e.g. ndarray[[0,1,3,4],0].
- A boolean array, indicating specific locations.

Like lists, array slicing is from the first specified index up to — but not including — the second specified index. For example, to select the items at index 1, 2, and 3, we'd need to use the slice [1:4].
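The index types listed above can be sketched on a small array. The 5x8 array `arr` and the other variable names are hypothetical, chosen only so that every example index from the list is valid.

```python
import numpy as np

arr = np.arange(40).reshape(5, 8)  # hypothetical 5x8 example array

single = arr[3, 0]             # integer, integer -> one item
block = arr[0:5, 6:]           # slice, slice -> 2D block
column = arr[:, 2]             # colon, integer -> one whole column
picked = arr[[0, 1, 3, 4], 0]  # list, integer -> chosen rows of column 0

# Boolean array with one entry per row: keep rows where it is True.
mask = np.array([True, False, False, False, True])
masked_rows = arr[mask]

print(single, block.shape, column, picked, masked_rows.shape)
```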

This is how we select a single item from a 2D ndarray:


<img alt="Dimensional Arrays" src="./images/selection_item.svg">


With a list of lists, we use two separate pairs of square brackets back-to-back. With a NumPy ndarray, we use a single pair of brackets with comma-separated row and column locations.
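A minimal side-by-side sketch of the two syntaxes, using a small made-up data set (the names `nested` and `arr` are illustrative):

```python
import numpy as np

nested = [[8, 6, 2], [9, 6, 0], [9, 4, 0]]
arr = np.array(nested)

from_list = nested[1][2]  # list of lists: two bracket pairs
from_array = arr[1, 2]    # ndarray: one bracket pair, comma-separated

print(from_list, from_array)
```

Writing `arr[1][2]` also works on an ndarray, but it first builds the intermediate row `arr[1]`, so the comma form is the usual choice.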


In [6]:
# Create a 5x5 array of random integers in the interval [0, 10)
test = np.random.randint(0, 10, (5, 5))

In [7]:
test

array([[8, 6, 2, 3, 7],
       [9, 6, 0, 5, 9],
       [9, 4, 0, 4, 3],
       [3, 5, 1, 0, 1],
       [4, 9, 8, 2, 3]])

In [8]:
# selecting the first row
# Remember that indexing starts at 0.
first_row = test[0]
first_row

array([8, 6, 2, 3, 7])

In [9]:
# Use negatives to count from the back.
test[-1]  # last row

array([4, 9, 8, 2, 3])

In [10]:
# selecting the 2nd and 3rd row
row_2_and_3 = test[1:3]  # Use : to indicate a range. array[start:stop]
row_2_and_3

array([[9, 6, 0, 5, 9],
       [9, 4, 0, 4, 3]])

In [11]:
# select all rows from the 3rd on
test[2:]  # Leaving start or stop empty will default to the beginning/end of the array.

array([[9, 4, 0, 4, 3],
       [3, 5, 1, 0, 1],
       [4, 9, 8, 2, 3]])

In [12]:
# select the item at row index 2 and column index 3
test[2, 3]

4

### Selecting Columns and Custom Slicing ndarrays


Let's continue by learning how to select one or more columns of data:


<img alt="Dimensional Arrays" src="./images/selection_columns_updated.svg">


With a list of lists, we need to use a for loop to extract specific column(s) and append them back to a new list. With ndarrays, the process is much simpler. We again use single brackets with comma-separated row and column locations, but we use a colon (:) for the row locations, which gives us all of the rows.
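The difference can be sketched as follows, using a small made-up table (`rows` and the result names are illustrative, not from the original notebook):

```python
import numpy as np

rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# List of lists: loop over the rows and collect the second item of each.
second_column_list = []
for row in rows:
    second_column_list.append(row[1])

# ndarray: a colon for the rows, an integer for the column.
second_column_array = np.array(rows)[:, 1]

print(second_column_list, second_column_array)
```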

If we want to select a partial 1D slice of a row or column, we can combine a single value for one dimension with a slice for the other dimension:


<img alt="Dimensional Arrays" src="./images/selection_1darray_updated.svg">


Lastly, if we want to select a 2D slice, we can use slices for both dimensions:


<img alt="Dimensional Arrays" src="./images/selection_2darray_updated.svg">


In [13]:
# Create an array filled with random values
columns_test = np.random.random((5, 5))

In [14]:
columns_test

array([[0.07211459, 0.43982296, 0.87674471, 0.08456572, 0.0491967 ],
       [0.79787932, 0.41727357, 0.49860847, 0.84687912, 0.63321345],
       [0.44512667, 0.10153768, 0.51277847, 0.88461149, 0.73972573],
       [0.34862382, 0.96419148, 0.0805172 , 0.79092135, 0.83509568],
       [0.30421193, 0.7475745 , 0.79978463, 0.56676069, 0.71379805]])

In [15]:
# selecting a single column
columns_test[:, 3]

array([0.08456572, 0.84687912, 0.88461149, 0.79092135, 0.56676069])

In [16]:
# selecting multiple columns
columns_test[:, 1:3]

array([[0.43982296, 0.87674471],
       [0.41727357, 0.49860847],
       [0.10153768, 0.51277847],
       [0.96419148, 0.0805172 ],
       [0.7475745 , 0.79978463]])

In [17]:
# selecting multiple specific columns
cols = [1, 3, 4]
columns_test[:, cols]

array([[0.43982296, 0.08456572, 0.0491967 ],
       [0.41727357, 0.84687912, 0.63321345],
       [0.10153768, 0.88461149, 0.73972573],
       [0.96419148, 0.79092135, 0.83509568],
       [0.7475745 , 0.56676069, 0.71379805]])

In [18]:
# selecting a 1D slice row
columns_test[2, 1:4]

array([0.10153768, 0.51277847, 0.88461149])

In [19]:
# selecting a 1D slice column
columns_test[1:, 4]

array([0.63321345, 0.73972573, 0.83509568, 0.71379805])

In [20]:
# selecting a 2D slice
columns_test[1:4, :3]

array([[0.79787932, 0.41727357, 0.49860847],
       [0.44512667, 0.10153768, 0.51277847],
       [0.34862382, 0.96419148, 0.0805172 ]])

## Vector Math


As we saw in the last two screens, NumPy ndarrays allow us to select data much more easily. Beyond this, the selection we make is a lot faster when working with vectorized operations because the operations are applied to multiple data points at once.

When we first talked about vectorized operations, we used the example of adding two columns of data. With data in a list of lists, we'd have to construct a for-loop and add each pair of values from each row individually:


In [27]:
my_numbers = [[6, 5], [9, 1], [2, 4], [7, 14], [8, 6]]

In [28]:
sums = []
for row in my_numbers:
    row_sums = row[0] + row[1]
    sums.append(row_sums)

print(sums)

[11, 10, 6, 21, 14]


At the time, we only talked about how vectorized operations make this faster; however, vectorized operations also make our code easier to write and read. Here's how we would perform the same task above with vectorized operations:


In [29]:
# convert the list of lists to an ndarray
my_numbers = np.array(my_numbers)

In [30]:
# select each of the columns - the result
# of each will be a 1D ndarray
col1 = my_numbers[:, 0]
col2 = my_numbers[:, 1]

In [31]:
# add the two columns
sums = col1 + col2

We could simplify this further if we wanted to:


In [32]:
sums = my_numbers[:, 0] + my_numbers[:, 1]
sums

array([11, 10,  6, 21, 14])

<div>
<p>Here are some key observations about this code:</p>
<ul>
<li>When we selected each column, we used the syntax <code>ndarray[:,c]</code> where <code>c</code> is the column index we wanted to select.  Like we saw in the previous screen, the colon selects all rows.</li>
<li>To add the two 1D ndarrays, <code>col1</code> and <code>col2</code>, we simply use the addition operator (<code>+</code>) between them.</li>
</ul>

<p>The result of adding two 1D ndarrays is a 1D ndarray of the same shape (or dimensions) as the original. In this context, ndarrays can also be called <strong>vectors</strong>, a term taken from a branch of mathematics called linear algebra. What we just did, adding two vectors together, is called <strong>vector addition</strong>.</p></div>


We can actually use any of the standard Python numeric operators with vectors, including:

- vector_a + vector_b - Addition
- vector_a - vector_b - Subtraction
- vector_a \* vector_b - Multiplication (this is unrelated to the vector multiplication used in linear algebra).
- vector_a / vector_b - Division

When we perform these operations on two 1D vectors, **both vectors must have the same shape**.
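As a rough illustration of both points, the sketch below multiplies two same-length vectors elementwise and then tries to add vectors of different lengths. The values are made up. Note that NumPy's broadcasting rules do allow some combinations of different shapes (for example, a length-1 array), which this sketch does not cover.

```python
import numpy as np

vector_a = np.array([6, 9, 2])
vector_b = np.array([5, 1, 4])

# Elementwise product: each item is multiplied by its counterpart.
print(vector_a * vector_b)

try:
    vector_a + np.array([1, 2])  # lengths 3 and 2 do not match
except ValueError as exc:
    mismatch_error = exc
    print("ValueError:", exc)
```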


As you become more familiar with NumPy (and later, pandas), you'll find that there is often **more than one way to do the same thing**. Most of the time, which you choose is up to you. The general rule in situations like these is to choose the one that makes your code easier to read, which will pay dividends both as you start working with data in teams, and when you have to refer back to code you wrote some time ago. You will find that for these arithmetic operations, it's much more common to use the built-in Python operators than the functions.

As you start to feel more comfortable with these libraries, you should start exploring the documentation. This is useful because it builds out your knowledge of available functions and methods, but also because it gets you used to reading the documentation. It's not possible to remember the syntax for every variation of every data science library, but if you remember what is possible, and can read the documentation, you'll always be able to quickly refamiliarize yourself with some syntax whenever you need it.

You may have noticed that when we mention a function or method for the first time, we'll link to the documentation for it. Take a moment now to click the link for the numpy.divide() function from the first paragraph of this screen and look at the documentation. It may seem a little overwhelming at first, but it is well worth your time.

<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The following table lists the arithmetic operators implemented in NumPy:</p>
<table>
<thead><tr>
<th>Operator</th>
<th>Equivalent ufunc</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>+</code></td>
<td><code>np.add</code></td>
<td>Addition (e.g., <code>1 + 1 = 2</code>)</td>
</tr>
<tr>
<td><code>-</code></td>
<td><code>np.subtract</code></td>
<td>Subtraction (e.g., <code>3 - 2 = 1</code>)</td>
</tr>
<tr>
<td><code>-</code></td>
<td><code>np.negative</code></td>
<td>Unary negation (e.g., <code>-2</code>)</td>
</tr>
<tr>
<td><code>*</code></td>
<td><code>np.multiply</code></td>
<td>Multiplication (e.g., <code>2 * 3 = 6</code>)</td>
</tr>
<tr>
<td><code>/</code></td>
<td><code>np.divide</code></td>
<td>Division (e.g., <code>3 / 2 = 1.5</code>)</td>
</tr>
<tr>
<td><code>//</code></td>
<td><code>np.floor_divide</code></td>
<td>Floor division (e.g., <code>3 // 2 = 1</code>)</td>
</tr>
<tr>
<td><code>**</code></td>
<td><code>np.power</code></td>
<td>Exponentiation (e.g., <code>2 ** 3 = 8</code>)</td>
</tr>
<tr>
<td><code>%</code></td>
<td><code>np.mod</code></td>
<td>Modulus/remainder (e.g., <code>9 % 4 = 1</code>)</td>
</tr>
</tbody>
</table>

</div>
</div>


To make the calculations in the previous screen, we used operators like the / symbol to perform vectorized operations over our data. NumPy provides a second way to make these calculations: arithmetic functions. Let's look at how we would divide `col1` by `col2` using the equivalent function, `numpy.divide`:


In [33]:
d_dols = np.divide(col1, col2)
d_dols

array([1.2       , 9.        , 0.5       , 0.5       , 1.33333333])
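A quick way to convince yourself that the operators and the functions in the table are interchangeable is to compare their results directly. This sketch rebuilds `col1` and `col2` from the values used earlier; the `same_*` names are illustrative.

```python
import numpy as np

col1 = np.array([6, 9, 2, 7, 8])
col2 = np.array([5, 1, 4, 14, 6])

# Each operator gives the same result as its equivalent ufunc.
same_divide = np.array_equal(col1 / col2, np.divide(col1, col2))
same_add = np.array_equal(col1 + col2, np.add(col1, col2))
same_power = np.array_equal(col1 ** 2, np.power(col1, 2))

print(same_divide, same_add, same_power)
```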

## Calculating Statistics For 1D ndarrays


<p>To calculate the minimum value of a 1D ndarray, we use the vectorized <a target="_blank" href="http://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.min.html"><code>ndarray.min()</code> method</a>, like so:</p>
</div>


In [44]:
columns_test = np.arange(100)
print(columns_test)


columns_test.min()


[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]


0

<div>

<p>NumPy ndarrays have methods for many different calculations. A few key methods are:</p>
<ul>
<li><a target="_blank" href="https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.min.html#numpy.ndarray.min"><code>ndarray.min()</code> to calculate the minimum value</a></li>
<li><a target="_blank" href="https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.ndarray.max.html"><code>ndarray.max()</code> to calculate the maximum value</a></li>
<li><a target="_blank" href="https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.mean.html#numpy.ndarray.mean"><code>ndarray.mean()</code> to calculate the mean or average value</a></li>
<li><a target="_blank" href="https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.sum.html#numpy.ndarray.sum"><code>ndarray.sum()</code> to calculate the sum of the values</a></li>
</ul>
<p>You can see the full list of ndarray methods in the <a target="_blank" href="https://docs.scipy.org/doc/numpy-1.14.0/reference/arrays.ndarray.html#calculation">NumPy ndarray documentation</a>.</p>
<p>It's important to get comfortable with the documentation because it's not possible to remember the syntax for every variation of every data science library. However, if you remember what is possible and can read the documentation, you'll always be able to refamiliarize yourself with it.</p>

</div>


<div>
<p>In NumPy, sometimes there are operations that are implemented as both methods and functions, which can be confusing. Let's look at some examples:</p>
<table>
<thead>
<tr>
<th>Calculation</th>
<th>Function Representation</th>
<th>Method Representation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Calculate the minimum value of <code>trip_mph</code></td>
<td><code>np.min(trip_mph)</code></td>
<td><code>trip_mph.min()</code></td>
</tr>
<tr>
<td>Calculate the maximum value of <code>trip_mph</code></td>
<td><code>np.max(trip_mph)</code></td>
<td><code>trip_mph.max()</code></td>
</tr>
<tr>
<td>Calculate the <a target="_blank" href="https://en.wikipedia.org/wiki/Mean">mean average</a> value of <code>trip_mph</code></td>
<td><code>np.mean(trip_mph)</code></td>
<td><code>trip_mph.mean()</code></td>
</tr>
<tr>
<td>Calculate the <a target="_blank" href="https://en.wikipedia.org/wiki/Median">median average</a> value of <code>trip_mph</code></td>
<td><code>np.median(trip_mph)</code></td>
<td>There is no ndarray median method</td>
</tr>
</tbody>
</table>
<p>To remember the right terminology, anything that starts with <code>np</code> (e.g. <code>np.mean()</code>) is a function and anything expressed with an object (or variable) name first (e.g. <code>trip_mph.mean()</code>) is a method. When both exist, it's up to you to decide which to use, but it's much more common to use the method approach.</p></div>


In [46]:
columns_test.max()

99

In [47]:
np.max(columns_test)

99
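The table above notes that the median is available only as a function. A small sketch of that difference, using the same values as `columns_test` (the name `values` is illustrative):

```python
import numpy as np

values = np.arange(100)

# Mean exists as both a method and a function, and the two agree.
print(values.mean(), np.mean(values))

# Median exists only as a function.
print(np.median(values))
print(hasattr(values, "median"))
```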

## Calculating Statistics For 2D ndarrays


Next, we'll calculate statistics for 2D ndarrays. If we use the ndarray.max() method on a 2D ndarray without any additional parameters, it will return a single value, just like with a 1D array:


<img alt="Dimensional Arrays" src="./images/array_method_axis_none.svg">


But what if we wanted to find the maximum value of each row? We'd need to use the axis parameter and specify a value of 1 to indicate we want to calculate the maximum value for each row.


<img alt="Dimensional Arrays" src="./images/array_method_axis_1.svg">


If we want to find the maximum value of each column, we'd use an axis value of 0:


<img alt="Dimensional Arrays" src="./images/array_method_axis_0.svg">


To help you remember which is which, you can think of the first axis as rows, and the second axis as columns, just in the same way as when we're indexing a 2D NumPy array we use ndarray[row,column]. Then you think about which axis you want to apply the method along. The tricky part is to remember that when you apply the method along one axis, you get results in the other axis. Here is an illustration of that:

<p><img alt="The axis parameter" src="https://s3.amazonaws.com/dq-content/289/axis_param.svg"></p>


In [50]:
data = np.arange(15).reshape(5, 3)
data

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [51]:
data.max()

14

In [52]:
data.max(axis=0)

array([12, 13, 14])

In [53]:
data.max(axis=1)

array([ 2,  5,  8, 11, 14])
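The same rule applies to other methods that accept `axis`. As a sketch, here is `sum` on the same `data` array, with the result shapes printed to show that reducing along one axis leaves the other (`col_sums` and `row_sums` are illustrative names):

```python
import numpy as np

data = np.arange(15).reshape(5, 3)

col_sums = data.sum(axis=0)  # collapse the row axis -> one value per column
row_sums = data.sum(axis=1)  # collapse the column axis -> one value per row

print(col_sums.shape, col_sums)
print(row_sums.shape, row_sums)
```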

## Boolean Indexing


### Boolean Arrays


<div><p>In the last mission, we learned how to index — or select — data from ndarrays. In this mission, we're going to focus on arguably the most powerful method, the boolean array.  A <strong>boolean array</strong>, as the name suggests, is an array of boolean values. Boolean arrays are sometimes called <strong>boolean vectors</strong> or <strong>boolean masks</strong>.</p>
<p>You may recall that the boolean (or <code>bool</code>) type is a built-in Python type that can be one of two unique values:</p>
<ul>
<li><code>True</code></li>
<li><code>False</code></li>
</ul>
<p>You may also remember that we've used boolean values when working with Python <a target="_blank" href="https://docs.python.org/3.4/library/stdtypes.html#comparisons">comparison operators</a> like <code>==</code> (equal), <code>&gt;</code> (greater than), <code>&lt;</code> (less than), and <code>!=</code> (not equal). Below are a couple of examples of simple boolean operations:</p>
</div>


In [54]:
print(type(3.5) == float)

True


In [55]:
print(5 > 6)

False


When we explored vector math in the first mission, we learned that an operation between an ndarray and a single value results in a new ndarray:


In [56]:
print(np.array([2, 4, 6, 8]) + 10)

[12 14 16 18]


The + 10 operation is applied to each value in the array.

Now, let's look at what happens when we perform a boolean operation between an ndarray and a single value:


In [57]:
print(np.array([2, 4, 6, 8]) < 5)

[ True  True False False]


A similar pattern occurs – each value in the array is compared to five. If the value is less than five, True is returned. Otherwise, False is returned.


<div class="alert alert-block alert-info">
Use vectorized boolean operations to:
<ul>
<li>Evaluate whether the elements in array a are less than 3. Assign the result to a_bool.</li>
<li>Evaluate whether the elements in array b are equal to "blue". Assign the result to b_bool.</li>
<li>Evaluate whether the elements in array c are greater than 100. Assign the result to c_bool.</li>
</ul></div>


In [58]:
a = np.array([1, 2, 3, 4, 5])
b = np.array(["blue", "blue", "red", "blue"])
c = np.array([80.0, 103.4, 96.9, 200.3])

In [59]:
a < 3

array([ True,  True, False, False, False])

In [60]:
b == "blue"

array([ True,  True, False,  True])

In [61]:
c > 100

array([False,  True, False,  True])

### Boolean Indexing with 1D ndarrays


In the last screen, we learned how to create boolean arrays using vectorized boolean operations. Next, we'll learn how to index (or select) using boolean arrays, known as boolean indexing. Let's use one of the examples from the previous screen.


In [62]:
c = np.array([80.0, 103.4, 96.9, 200.3])
c_bool = c > 100
c_bool

array([False,  True, False,  True])

To index using our new boolean array, we simply insert it in the square brackets, just like we would do with our other selection techniques:


In [63]:
result = c[c_bool]

In [64]:
result

array([103.4, 200.3])

The boolean array acts as a filter, so that the values corresponding to True become part of the result and the values corresponding to False are removed.
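One way to picture the filter is as a loop that keeps a value only when its paired boolean is `True`. The sketch below compares boolean indexing with such a loop; `by_hand` is an illustrative name, and the list comprehension is only a mental model, not how NumPy implements it.

```python
import numpy as np

c = np.array([80.0, 103.4, 96.9, 200.3])
c_bool = c > 100

# Boolean indexing keeps the values whose mask entry is True.
filtered = c[c_bool]

# Plain-Python equivalent: pair each value with its boolean and filter.
by_hand = [value for value, keep in zip(c, c_bool) if keep]

print(filtered, by_hand)
```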


### Boolean Indexing with 2D ndarrays


When working with 2D ndarrays, you can use boolean indexing in combination with any of the indexing methods we learned in the previous mission. The only limitation is that the boolean array must have the same length as the dimension you're indexing.


<img alt="Dimensional Arrays" src="./images/bool_dims_updated.svg">


Because a boolean array contains no information about how it was created, we can use a boolean array made from just one column of our array to index the whole array.


In [65]:
data = np.arange(15).reshape(5, 3)
data

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [66]:
data[data > 5]

array([ 6,  7,  8,  9, 10, 11, 12, 13, 14])

In [67]:
data[:, [True, False, True]]

array([[ 0,  2],
       [ 3,  5],
       [ 6,  8],
       [ 9, 11],
       [12, 14]])

In [68]:
data[[True, False, True, False, True], :]

array([[ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])
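The examples above pass literal lists of booleans. As mentioned earlier, a mask built from a single column can also select whole rows, as long as its length matches the number of rows. A minimal sketch, with illustrative variable names:

```python
import numpy as np

data = np.arange(15).reshape(5, 3)

# Mask built from the first column only: one boolean per row.
first_col_mask = data[:, 0] > 5

selected_rows = data[first_col_mask]      # whole rows where the mask is True
selected_items = data[first_col_mask, 2]  # same rows, third column only

try:
    data[np.array([True, False])]  # length 2 does not match the 5 rows
except IndexError as exc:
    length_error = exc
    print("IndexError:", exc)

print(selected_rows)
print(selected_items)
```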

## Shape Manipulation


- [Shape Manipulation](https://numpy.org/doc/stable/user/quickstart.html#shape-manipulation)
- [How to convert a 1D array into a 2D array (how to add a new axis to an array)](https://numpy.org/doc/stable/user/absolute_beginners.html#how-to-convert-a-1d-array-into-a-2d-array-how-to-add-a-new-axis-to-an-array)


**1-DIMENSIONAL NUMPY ARRAYS ONLY HAVE ONE AXIS**


The important thing to know is that 1-dimensional NumPy arrays only have one axis.

If 1-d arrays only have one axis, can you guess the name of that axis?

Remember, axes are numbered like Python indexes. They start at 0.

So, in a 1-d NumPy array, the first and only axis is axis 0.


The fact that 1-d arrays have only one axis can cause some results that confuse NumPy beginners.
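Two of the results that tend to surprise beginners can be sketched like this (the array `v` is a made-up example):

```python
import numpy as np

v = np.arange(4)

print(v.ndim, v.shape)  # 1 (4,)
print(v.sum(axis=0))    # 6: reducing along the only axis gives a single value
print(v.T.shape)        # (4,): transposing a 1D array changes nothing

try:
    v.sum(axis=1)  # a 1D array has no axis 1
except ValueError as exc:  # NumPy's AxisError is a subclass of ValueError
    axis_error = exc
    print("Error:", exc)
```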


**2-DIMENSIONAL NUMPY ARRAYS**


Just like coordinate systems, NumPy arrays also have axes.


The best way to think about NumPy arrays is that they consist of two parts, a **data buffer** which is just a block of raw elements, and a **view** which describes how to interpret the data buffer.


The data buffer is typically what people think of as arrays in C or Fortran, a contiguous (and fixed) block of memory containing fixed sized data items. NumPy also contains a significant set of data that describes how to interpret the data in the data buffer.


For example, if we create an array of 12 integers:


In [69]:
a = np.arange(12)
print(a)

[ 0  1  2  3  4  5  6  7  8  9 10 11]


Then a consists of a data buffer, arranged something like this:


<pre class="lang-py s-code-block"><code class="hljs language-python">┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│  <span class="hljs-number">0</span> │  <span class="hljs-number">1</span> │  <span class="hljs-number">2</span> │  <span class="hljs-number">3</span> │  <span class="hljs-number">4</span> │  <span class="hljs-number">5</span> │  <span class="hljs-number">6</span> │  <span class="hljs-number">7</span> │  <span class="hljs-number">8</span> │  <span class="hljs-number">9</span> │ <span class="hljs-number">10</span> │ <span class="hljs-number">11</span> │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
</code></pre>


In [70]:
a.shape

(12,)

In [71]:
# location of the data in memory
a.__array_interface__["data"]

(1167053356592, False)

Here the shape (12,) means the array is indexed by a single index which runs from 0 to 11. Conceptually, if we label this single index i, the array a looks like this:


<pre class="lang-py s-code-block"><code class="hljs language-python">i= <span class="hljs-number">0</span>    <span class="hljs-number">1</span>    <span class="hljs-number">2</span>    <span class="hljs-number">3</span>    <span class="hljs-number">4</span>    <span class="hljs-number">5</span>    <span class="hljs-number">6</span>    <span class="hljs-number">7</span>    <span class="hljs-number">8</span>    <span class="hljs-number">9</span>   <span class="hljs-number">10</span>   <span class="hljs-number">11</span>
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│  <span class="hljs-number">0</span> │  <span class="hljs-number">1</span> │  <span class="hljs-number">2</span> │  <span class="hljs-number">3</span> │  <span class="hljs-number">4</span> │  <span class="hljs-number">5</span> │  <span class="hljs-number">6</span> │  <span class="hljs-number">7</span> │  <span class="hljs-number">8</span> │  <span class="hljs-number">9</span> │ <span class="hljs-number">10</span> │ <span class="hljs-number">11</span> │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
</code></pre>


In [72]:
a[2]

2

If we reshape an array, this doesn't change the data buffer. Instead, it creates a new view that describes a different way to interpret the data. So after:


In [73]:
b = a.reshape((3, 4))
print(b)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


In [74]:
b.shape

(3, 4)

In [75]:
b.__array_interface__["data"]

(1167053356592, False)

The array b has the same data buffer as a, but now it is indexed by two indices which run from 0 to 2 and 0 to 3 respectively. If we label the two indices i and j, the array b looks like this:


<pre class="lang-py s-code-block"><code class="hljs language-python">i= <span class="hljs-number">0</span>    <span class="hljs-number">0</span>    <span class="hljs-number">0</span>    <span class="hljs-number">0</span>    <span class="hljs-number">1</span>    <span class="hljs-number">1</span>    <span class="hljs-number">1</span>    <span class="hljs-number">1</span>    <span class="hljs-number">2</span>    <span class="hljs-number">2</span>    <span class="hljs-number">2</span>    <span class="hljs-number">2</span>
j= <span class="hljs-number">0</span>    <span class="hljs-number">1</span>    <span class="hljs-number">2</span>    <span class="hljs-number">3</span>    <span class="hljs-number">0</span>    <span class="hljs-number">1</span>    <span class="hljs-number">2</span>    <span class="hljs-number">3</span>    <span class="hljs-number">0</span>    <span class="hljs-number">1</span>    <span class="hljs-number">2</span>    <span class="hljs-number">3</span>
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│  <span class="hljs-number">0</span> │  <span class="hljs-number">1</span> │  <span class="hljs-number">2</span> │  <span class="hljs-number">3</span> │  <span class="hljs-number">4</span> │  <span class="hljs-number">5</span> │  <span class="hljs-number">6</span> │  <span class="hljs-number">7</span> │  <span class="hljs-number">8</span> │  <span class="hljs-number">9</span> │ <span class="hljs-number">10</span> │ <span class="hljs-number">11</span> │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
</code></pre>


In [76]:
b[0, 2]

2
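Besides comparing memory addresses, you can check that two arrays use the same buffer with `np.shares_memory`, or simply by writing through one and reading from the other. This is a sketch for the simple case shown here; `reshape` returns a view whenever it can, but may return a copy for some non-contiguous inputs.

```python
import numpy as np

a = np.arange(12)
b = a.reshape((3, 4))

print(np.shares_memory(a, b))

b[0, 2] = 99  # write through the 2D view...
print(a[2])   # ...and the original 1D array sees the change
```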

You can see that the second index changes quickly and the first index changes slowly. If you prefer this to be the other way round, you can specify the order parameter:


In [77]:
c = a.reshape((3, 4), order="F")

In [78]:
print(c)

[[ 0  3  6  9]
 [ 1  4  7 10]
 [ 2  5  8 11]]


In [79]:
c.shape

(3, 4)

In [80]:
c.__array_interface__["data"]

(1167053356592, False)

In Fortran the first index is the most rapidly varying index when moving through the elements of a two dimensional array as it is stored in memory. If you adopt the matrix convention for indexing, then this means the matrix is stored one column at a time (since the first index moves to the next row as it changes). Thus Fortran is considered a Column-major language. C has just the opposite convention. In C, the last index changes most rapidly as one moves through the array as stored in memory. Thus C is a Row-major language. The matrix is stored by rows. Note that in both cases it presumes that the matrix convention for indexing is being used, i.e., for both Fortran and C, the first index is the row. Note this convention implies that the indexing convention is invariant and that the data order changes to keep that so.


Reshaping with the order parameter set to F therefore results in an array indexed like this:


<pre class="lang-py s-code-block"><code class="hljs language-python">i= <span class="hljs-number">0</span>    <span class="hljs-number">1</span>    <span class="hljs-number">2</span>    <span class="hljs-number">0</span>    <span class="hljs-number">1</span>    <span class="hljs-number">2</span>    <span class="hljs-number">0</span>    <span class="hljs-number">1</span>    <span class="hljs-number">2</span>    <span class="hljs-number">0</span>    <span class="hljs-number">1</span>    <span class="hljs-number">2</span>
j= <span class="hljs-number">0</span>    <span class="hljs-number">0</span>    <span class="hljs-number">0</span>    <span class="hljs-number">1</span>    <span class="hljs-number">1</span>    <span class="hljs-number">1</span>    <span class="hljs-number">2</span>    <span class="hljs-number">2</span>    <span class="hljs-number">2</span>    <span class="hljs-number">3</span>    <span class="hljs-number">3</span>    <span class="hljs-number">3</span>
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│  <span class="hljs-number">0</span> │  <span class="hljs-number">1</span> │  <span class="hljs-number">2</span> │  <span class="hljs-number">3</span> │  <span class="hljs-number">4</span> │  <span class="hljs-number">5</span> │  <span class="hljs-number">6</span> │  <span class="hljs-number">7</span> │  <span class="hljs-number">8</span> │  <span class="hljs-number">9</span> │ <span class="hljs-number">10</span> │ <span class="hljs-number">11</span> │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
</code></pre>


In [81]:
c[2, 1]

5

It should now be clear what it means for an array to have a shape with one or more dimensions of size 1. After:


In [82]:
d = a.reshape((12, 1))

In [83]:
print(d)

[[ 0]
 [ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]
 [11]]


In [84]:
d.shape

(12, 1)

In [85]:
d.__array_interface__["data"]

(1167053356592, False)

The array d is indexed by two indices, the first of which runs from 0 to 11, and the second index is always 0:


<pre class="lang-py s-code-block"><code class="hljs language-python">i= <span class="hljs-number">0</span>    <span class="hljs-number">1</span>    <span class="hljs-number">2</span>    <span class="hljs-number">3</span>    <span class="hljs-number">4</span>    <span class="hljs-number">5</span>    <span class="hljs-number">6</span>    <span class="hljs-number">7</span>    <span class="hljs-number">8</span>    <span class="hljs-number">9</span>   <span class="hljs-number">10</span>   <span class="hljs-number">11</span>
j= <span class="hljs-number">0</span>    <span class="hljs-number">0</span>    <span class="hljs-number">0</span>    <span class="hljs-number">0</span>    <span class="hljs-number">0</span>    <span class="hljs-number">0</span>    <span class="hljs-number">0</span>    <span class="hljs-number">0</span>    <span class="hljs-number">0</span>    <span class="hljs-number">0</span>    <span class="hljs-number">0</span>    <span class="hljs-number">0</span>
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│  <span class="hljs-number">0</span> │  <span class="hljs-number">1</span> │  <span class="hljs-number">2</span> │  <span class="hljs-number">3</span> │  <span class="hljs-number">4</span> │  <span class="hljs-number">5</span> │  <span class="hljs-number">6</span> │  <span class="hljs-number">7</span> │  <span class="hljs-number">8</span> │  <span class="hljs-number">9</span> │ <span class="hljs-number">10</span> │ <span class="hljs-number">11</span> │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
</code></pre>


and so:


In [86]:
d[10]

array([10])

In [87]:
d[10, 0]

10

In [88]:
e = a.reshape((12))

In [89]:
print(e)

[ 0  1  2  3  4  5  6  7  8  9 10 11]


In [90]:
e.shape

(12,)

In [91]:
e.__array_interface__["data"]

(1167053356592, False)

This arrangement allows for very flexible use of arrays. One thing that it allows is simple changes of the metadata to change the interpretation of the array buffer. Changing the byteorder of the array is a simple change involving no rearrangement of the data. The shape of the array can be changed very easily without changing anything in the data buffer or any data copying at all.


Other operations, such as transpose, don’t move data elements around in the array, but rather change the information about the shape and strides so that the indexing of the array changes, but the data in the buffer doesn’t move.
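The strides can be inspected directly through the `strides` attribute, which gives the number of bytes to step along each axis. The sketch below fixes the dtype to `np.int64` (an assumption made here so that each element is 8 bytes and the printed strides are predictable).

```python
import numpy as np

a = np.arange(12, dtype=np.int64)  # 8 bytes per element
b = a.reshape((3, 4))
t = b.T

print(b.shape, b.strides)  # (3, 4) (32, 8)
print(t.shape, t.strides)  # (4, 3) (8, 32)

# The transpose swaps shape and strides but still uses a's buffer.
print(np.shares_memory(a, t))
```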


**Matrix notation uses the first index to indicate which row is being selected and the second index to indicate which column is selected.** This is the opposite of the geometrically oriented convention for images, where people generally think the first index represents x position (i.e., column) and the second represents y position (i.e., row). This alone is the source of much confusion; matrix-oriented users and image-oriented users expect two different things with regard to indexing.
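In practice this means that a point given as (x, y) has to be indexed as `[y, x]`. A small sketch with a made-up 2x3 "image" (the names `image`, `x`, `y` and `pixel` are illustrative):

```python
import numpy as np

height, width = 2, 3
image = np.arange(height * width).reshape(height, width)

x, y = 2, 1          # column 2, row 1 in image terms
pixel = image[y, x]  # NumPy wants the row index first, then the column

print(image.shape, pixel)
```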


---


You can use `np.newaxis` and `np.expand_dims` to increase the dimensions of your existing array.

Using np.newaxis will increase the dimensions of your array by one dimension when used once. This means that a 1D array will become a 2D array, a 2D array will become a 3D array, and so on.

For example, if you start with this array:


In [93]:
a = np.array([1, 2, 3, 4, 5, 6])
a.shape


(6,)

In [94]:
# You can use np.newaxis to add a new axis:
a2 = a[np.newaxis, :]
a2.shape


(1, 6)

You can explicitly convert a 1D array to either a row vector or a column vector using np.newaxis. For example, you can convert a 1D array to a row vector by inserting an axis along the first dimension:


In [95]:
row_vector = a[np.newaxis, :]
row_vector.shape


(1, 6)

Or, for a column vector, you can insert an axis along the second dimension:


In [97]:
col_vector = a[:, np.newaxis]
col_vector.shape


(6, 1)

You can also expand an array by inserting a new axis at a specified position with np.expand_dims.


In [98]:
a = np.array([1, 2, 3, 4, 5, 6])
a.shape

# You can use np.expand_dims to add an axis at index position 1 with:
b = np.expand_dims(a, axis=1)
print(b.shape)

# You can add an axis at index position 0 with:
c = np.expand_dims(a, axis=0)
print(c.shape)

(6, 1)
(1, 6)
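
The two approaches are interchangeable. As a quick check (a sketch, not part of the original notebook), `np.newaxis` is simply an alias for `None`, and indexing with it gives the same result as `np.expand_dims`:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

# np.newaxis is an alias for None, so these three lines are equivalent.
col_a = a[:, np.newaxis]
col_b = np.expand_dims(a, axis=1)
col_c = a[:, None]

print(np.newaxis is None)
print(col_a.shape, col_b.shape, col_c.shape)
print(np.array_equal(col_a, col_b))
```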


## Assigning Values


So far, we've learned how to retrieve data from ndarrays. Next, we'll use the same indexing techniques we've already learned to modify values within an ndarray. The syntax we'll use (in pseudocode) is:


    ndarray[location_of_values] = new_value


Let's take a look at what that looks like in actual code. With our 1D array, we can specify one specific index location:


In [129]:
a = np.array(["red", "blue", "black", "blue", "purple"])
a[0] = "orange"
print(a)

['orange' 'blue' 'black' 'blue' 'purple']


Or we can assign multiple values at once:


In [130]:
a[3:] = "pink"
print(a)

['orange' 'blue' 'black' 'pink' 'pink']


With a 2D ndarray, just like with a 1D ndarray, we can assign one specific index location:


In [131]:
ones = np.ones((3, 5))
ones[1, 2] = 99
print(ones)

[[ 1.  1.  1.  1.  1.]
 [ 1.  1. 99.  1.  1.]
 [ 1.  1.  1.  1.  1.]]


We can also assign a whole row...


In [132]:
ones[0] = 42
print(ones)

[[42. 42. 42. 42. 42.]
 [ 1.  1. 99.  1.  1.]
 [ 1.  1.  1.  1.  1.]]


...or a whole column:


In [133]:
ones[:, 2] = 0
print(ones)

[[42. 42.  0. 42. 42.]
 [ 1.  1.  0.  1.  1.]
 [ 1.  1.  0.  1.  1.]]


### Assignment Using Boolean Arrays


Boolean arrays become very powerful when we use them for assignment. Let's look at an example:


In [134]:
a2 = np.array([1, 2, 3, 4, 5])
a2[a2 > 2] = 99

print(a2)

[ 1  2 99 99 99]


The boolean array controls the values that the assignment applies to, and the other values remain unchanged.


Notice in the code above that we used a "shortcut": we inserted the definition of the boolean array directly into the selection. This "shortcut" is the conventional way to write boolean indexing. Up until now, we've been assigning the boolean array to an intermediate variable first so that the process is clear, but from here on, we will use the "shortcut" method instead.
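
The equivalence between the two spellings can be checked directly. This is a small sketch with illustrative variable names (`mask`, `long_form`, `shortcut`):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

# Long form: build the boolean array first, then use it for assignment.
mask = values > 2
long_form = values.copy()
long_form[mask] = 99

# Shortcut: put the comparison directly inside the brackets.
shortcut = values.copy()
shortcut[shortcut > 2] = 99

print(mask)
print(np.array_equal(long_form, shortcut))
```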


Next, we'll look at an example of assignment using a boolean array with two dimensions:


In [135]:
b = np.linspace(1, 9, num=9, dtype=np.int32)
b = np.reshape(b, (3, 3))
c = b.copy()
b

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

> [More about reshape](https://jakevdp.github.io/PythonDataScienceHandbook/02.02-the-basics-of-numpy-arrays.html#Reshaping-of-Arrays)


In [136]:
b[b > 4] = 99

In [137]:
b

array([[ 1,  2,  3],
       [ 4, 99, 99],
       [99, 99, 99]])

The b > 4 boolean operation produces a 2D boolean array which then controls the values that the assignment applies to.

We can also use a 1D boolean array to perform assignment on a 2D array:


In [138]:
c

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [139]:
c[c[:, 1] > 2, 1] = 99

In [140]:
c

array([[ 1,  2,  3],
       [ 4, 99,  6],
       [ 7, 99,  9]])

The c[:, 1] > 2 boolean operation compares just one column's values and produces a 1D boolean array. We then use that boolean array as the row index for assignment, and 1 as the column index to specify the second column. The assignment is applied only to the second column, while all other values remain unchanged.

The pseudocode syntax for this code is as follows, first using an intermediate variable:


    mask = array[:, column_for_comparison] == value_for_comparison
    array[mask, column_for_assignment] = new_value


and then all in one line:


    array[array[:, column_for_comparison] == value_for_comparison, column_for_assignment] = new_value
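
As a concrete instance of this pattern (the data and the chosen columns are made up for illustration), the following sets the third column to 0 in every row whose first column equals 1:

```python
import numpy as np

array = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [1, 8, 9]])

# Compare column 0 against the value 1 to get one boolean per row.
rows = array[:, 0] == 1

# Use the boolean array as the row index and 2 as the column index.
array[rows, 2] = 0

print(array)
```

Rows 0 and 2 match the comparison, so only their last element changes.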


## Adding Rows and Columns to ndarrays


To start, let's learn how to add rows and columns to an ndarray. The technique we're going to use involves the `numpy.concatenate()` function. This function accepts:

- A list of ndarrays as the first, unnamed parameter.
- An integer for the axis parameter, where 0 will add rows and 1 will add columns.

The numpy.concatenate() function requires that each array have the same shape, excepting the dimension corresponding to axis. Let's look at an example to understand more precisely how that works. We have two arrays, ones and zeros:


In [141]:
# example 2 - 1D
ones = np.ones(shape=3)
print(ones)
print(ones.shape)

[1. 1. 1.]
(3,)


In [142]:
# example 1 - 1D
ones = np.ones(3)
print(ones)
print(ones.shape)

[1. 1. 1.]
(3,)


In [143]:
# example 3 - 2D
ones = np.ones(shape=(3, 1))
print(ones)
print("----------")
print(ones[0])
print("----------")
print(ones[0, 0])
print(ones.shape)

[[1.]
 [1.]
 [1.]]
----------
[1.]
----------
1.0
(3, 1)


In [144]:
# example 4 - 2D
ones = np.ones(shape=(1, 3))
print(ones)
print("----------")
print(ones[0])
print("----------")
print(ones[0, 0])
print(ones.shape)

[[1. 1. 1.]]
----------
[1. 1. 1.]
----------
1.0
(1, 3)


In [145]:
# example 5 - 3D
ones = np.ones(shape=(2, 3, 2))
print(ones)
print("----------")
print(ones[0])
print("----------")
print(ones[0, 0])
print(ones.shape)

[[[1. 1.]
  [1. 1.]
  [1. 1.]]

 [[1. 1.]
  [1. 1.]
  [1. 1.]]]
----------
[[1. 1.]
 [1. 1.]
 [1. 1.]]
----------
[1. 1.]
(2, 3, 2)


In [146]:
ones = np.ones((2, 3))
print(ones)

[[1. 1. 1.]
 [1. 1. 1.]]


In [147]:
zeros = np.zeros(3)
print(zeros)

[0. 0. 0.]


Let's try to use numpy.concatenate() to add zeros as a row. Because we want to add a row, we use axis=0:


In [148]:
combined = np.concatenate([ones, zeros], axis=0)

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)

We've got an error because our dimensions don't match - let's look at the shape of each array to see if we can understand why:


In [149]:
print(ones.shape)

(2, 3)


In [150]:
print(zeros.shape)

(3,)


Because we're using axis=0, our shapes have to match across all dimensions except the first. If we look at these two arrays, we can see that the second dimension of ones is 3, but zeros doesn't have a second dimension, because it's only a 1D array. This is the source of our error. The table below shows the shapes we need to be able to combine these arrays.

<table>
<thead>
<tr>
<th>Object</th>
<th>Current shape</th>
<th>Desired Shape</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>ones</code></td>
<td><code>(2, 3)</code></td>
<td><code>(2, 3)</code></td>
</tr>
<tr>
<td><code>zeros</code></td>
<td><code>(3,)</code></td>
<td><code>(1, 3)</code></td>
</tr>
</tbody>
</table>


In order to adjust the shape of zeros, we can use the numpy.expand_dims() function. You might like to follow these steps in the console. We'll start by passing axis=0 because we want to convert our 1D array into a 2D array representing a row:


In [151]:
zeros

array([0., 0., 0.])

In [152]:
np.expand_dims(zeros, axis=0)

array([[0., 0., 0.]])

In [153]:
np.expand_dims(zeros, axis=1)

array([[0.],
       [0.],
       [0.]])

In [154]:
zeros_2d = np.expand_dims(zeros, axis=0)

In [155]:
print(zeros_2d)

[[0. 0. 0.]]


In [156]:
print(zeros_2d.shape)

(1, 3)


Finally, we can use numpy.concatenate() to combine the two arrays:


In [157]:
combined = np.concatenate([ones, zeros_2d], axis=0)

In [158]:
print(combined)

[[1. 1. 1.]
 [1. 1. 1.]
 [0. 0. 0.]]


Adding a column is done the same way, except substituting axis=1 for axis=0 in both functions.
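
A minimal sketch of the column case follows. It assumes a new 1D `zeros` array with one value per existing row, which differs from the 3-element `zeros` used above:

```python
import numpy as np

ones = np.ones((2, 3))
zeros = np.zeros(2)  # one value for each of the two rows

# Turn the 1D array into a single column: (2,) -> (2, 1)
zeros_col = np.expand_dims(zeros, axis=1)

# Shapes now match in every dimension except axis 1.
combined = np.concatenate([ones, zeros_col], axis=1)

print(combined.shape)
print(combined)
```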


## Copies and Views


- [Copies and Views](https://numpy.org/doc/stable/user/quickstart.html#copies-and-views)


When operating and manipulating arrays, their data is sometimes copied into a new array and sometimes not. This is often a source of confusion for beginners. There are three cases:


**No Copy at All**


Simple assignments make no copy of objects or their data.


In [99]:
a = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])
b = a  # no new object is created
b is a  # a and b are two names for the same ndarray object

True

Python passes mutable objects as references, so function calls make no copy.


In [101]:
def f(x):
    print(id(x))


print(id(a))  # id is a unique identifier of an object
f(a)

1167384103120
1167384103120
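
One practical consequence is that a function can modify the caller's array in place. A short sketch, using a hypothetical helper named `set_first_to_zero`:

```python
import numpy as np


def set_first_to_zero(x):
    # x refers to the same ndarray object as the caller's variable.
    x[0] = 0


data = np.array([5, 6, 7])
set_first_to_zero(data)
print(data)
```

If the caller's array must stay untouched, the function can work on `x.copy()` instead.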


**View or Shallow Copy**


Different array objects can share the same data. The view method creates a new array object that looks at the same data.


In [102]:
c = a.view()

In [103]:
c is a

False

In [104]:
c.base is a  # c is a view of the data owned by a

True

In [107]:
print(a.shape)
c = c.reshape((2, 6))  # a's shape doesn't change

(3, 4)


In [106]:
a.shape

(3, 4)

In [108]:
print(a)
c[0, 4] = 1234  # a's data changes
print(a)


[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[[   0    1    2    3]
 [1234    5    6    7]
 [   8    9   10   11]]


Slicing an array returns a view of it:


In [109]:
s = a[:, 1:3]
s[:] = 10
a


array([[   0,   10,   10,    3],
       [1234,   10,   10,    7],
       [   8,   10,   10,   11]])

**Deep Copy**


The copy method makes a complete copy of the array and its data.


In [110]:
d = a.copy()  # a new array object with new data is created
print(d is a)
print(d.base is a)  # d doesn't share anything with a


False
False


In [111]:
d[0, 0] = 9999
print(a)

[[   0   10   10    3]
 [1234   10   10    7]
 [   8   10   10   11]]


Sometimes copy should be called after slicing if the original array is no longer required. For example, suppose a is a huge intermediate result and the final result b contains only a small fraction of a. In that case, a deep copy should be made when constructing b with slicing:


In [112]:
a = np.arange(int(1e8))
b = a[:100].copy()
del a  # the memory of ``a`` can be released.

If `b = a[:100]` is used instead, a is referenced by b and will persist in memory even if del a is executed.
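
The difference can be inspected through the `base` attribute. This sketch uses a much smaller array than the example above, and the names `view` and `copy` are illustrative:

```python
import numpy as np

a = np.arange(1000)

view = a[:100]         # a slice: keeps a reference to a through .base
copy = a[:100].copy()  # an independent array that owns its own data

print(view.base is a)
print(copy.base is None)
print(np.shares_memory(a, view), np.shares_memory(a, copy))
```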


## Reading CSV files with NumPy


Below is information about selected columns from the data set:

- `rank`
- `revenue`
- `revenue_change`
- `profits`
- `asset`
- `profit_change`

<p>Now that we understand NumPy a little better, let's learn how to use the <a target="_blank" href="http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt"><code>numpy.genfromtxt()</code> function</a> to read files into NumPy ndarrays. Here is the simplified syntax for the function, and an explanation of the three parameters we will use:</p>


    np.genfromtxt(filename, delimiter=None, usecols=None)


- `filename`: A positional argument, usually a string representing the path to the text file to be read.
- `delimiter`: A named argument, specifying the string used to separate each value.
- `usecols`: Which columns to read, with 0 being the first. For example, usecols = (1, 4, 5) will extract the 2nd, 5th and 6th columns.

> We read only the numeric values, because a NumPy array can store only one data type. Later on we will get to know pandas, which makes it easier to work with different data types.

In this case, because we have a CSV file, the delimiter is a comma. Here's how we'd read in a file named data.csv:


    data = np.genfromtxt('data.csv', delimiter=',')


In [120]:
import numpy as np

f500 = np.genfromtxt("data/f500_small.csv", delimiter=",", usecols=(1, 2, 3, 4, 5, 6))

In [121]:
f500[:5]

array([[         nan,          nan,          nan,          nan,
                 nan,          nan],
       [ 1.00000e+00,  4.85873e+05,  8.00000e-01,  1.36430e+04,
         1.98825e+05, -7.20000e+00],
       [ 2.00000e+00,  3.15199e+05, -4.40000e+00,  9.57130e+03,
         4.89838e+05, -6.20000e+00],
       [ 3.00000e+00,  2.67518e+05, -9.10000e+00,  1.25790e+03,
         3.10726e+05, -6.50000e+01],
       [ 4.00000e+00,  2.62573e+05, -1.23000e+01,  1.86750e+03,
         5.85619e+05, -7.37000e+01]])

It's often useful to know the number of rows and columns in an ndarray. When we can't easily print the entire ndarray, we can use the ndarray.shape attribute instead:


In [122]:
f500_shape = f500.shape

In [123]:
f500_shape

(20, 6)

The data type returned is called a tuple. Tuples are very similar to Python lists, but can't be modified.

The output gives us a few important pieces of information:

- The first number tells us that there are 20 rows in f500.
- The second number tells us that there are 6 columns in f500.
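
Because the shape is a tuple, it can be unpacked into separate variables but not modified element by element. A small sketch, using a stand-in array of the same shape rather than the real data:

```python
import numpy as np

# Stand-in array with the same shape as f500 (illustrative only).
example = np.zeros((20, 6))

# Unpack the shape tuple into row and column counts.
n_rows, n_cols = example.shape
print(n_rows, n_cols)

# Tuples are immutable, so item assignment on the shape tuple fails.
try:
    example.shape[0] = 10
except TypeError as err:
    print("tuples cannot be modified:", err)
```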


In the last exercise, we used the numpy.genfromtxt() function to read the file into NumPy, which allowed us to import the data much more quickly and efficiently than the method we used in the previous mission.


<div>
<p>We can use the <a target="_blank" href="http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.ndarray.dtype.html#numpy.ndarray.dtype"><code>ndarray.dtype</code> attribute</a> to see the internal datatype that has been used.</p>
</div>


In [124]:
print(f500.dtype)

float64


<div>
<p>NumPy chose the <code>float64</code> type, since it will allow most of the values from our CSV to be read. You can think of NumPy's <code>float64</code> type as being identical to Python's <code>float</code> type (the "64" refers to the number of <a target="_blank" href="https://en.wikipedia.org/wiki/Bit">bits</a> used to store the underlying value).</p>
<p>If we review the results from the last exercise, we can see that <code>f500</code> contains almost all numbers except for a value that we haven't seen before: <code>nan</code>.</p>
</div>


In [125]:
print(f500[:7])

[[         nan          nan          nan          nan          nan
           nan]
 [ 1.00000e+00  4.85873e+05  8.00000e-01  1.36430e+04  1.98825e+05
  -7.20000e+00]
 [ 2.00000e+00  3.15199e+05 -4.40000e+00  9.57130e+03  4.89838e+05
  -6.20000e+00]
 [ 3.00000e+00  2.67518e+05 -9.10000e+00  1.25790e+03  3.10726e+05
  -6.50000e+01]
 [ 4.00000e+00  2.62573e+05 -1.23000e+01  1.86750e+03  5.85619e+05
  -7.37000e+01]
 [ 5.00000e+00  2.54694e+05  7.70000e+00  1.68993e+04  4.37575e+05
  -1.23000e+01]
 [ 6.00000e+00  2.40264e+05  1.50000e+00  5.93730e+03  4.32116e+05
           nan]]


<div>
<p>NaN is an acronym for <strong>Not a Number</strong> - it literally means that the value cannot be stored as a number.  It is similar to (and often referred to as a) null value, like Python's <a target="_blank" href="https://docs.python.org/3.4/library/constants.html#None"><code>None</code> constant</a>.</p>
<p>NaN is most commonly seen when a value is missing, but in this case, we have NaN values because the first line from our CSV file contains the names of each column. NumPy is unable to convert string values into the <code>float64</code> data type.</p>
<p>For now, we need to remove this header row from our ndarray. We can do this the same way we would if our data was stored in a list of lists:</p>
</div>


In [126]:
f500 = f500[1:]

<div>
<p>Alternatively, we can pass an additional parameter, <code>skip_header</code>, to the <code>numpy.genfromtxt()</code> function.  The <code>skip_header</code> parameter accepts an integer, the number of rows from the start of the file to skip. Note that because this integer should be the <em>number of rows</em> and not the index, skipping the first row would require a value of <code>1</code>, not <code>0</code>.</p></div>


In [127]:
f500 = np.genfromtxt(
    "data/f500_small.csv", delimiter=",", usecols=(1, 2, 3, 4, 5, 6), skip_header=1
)


f500_shape = f500.shape


f500_shape

(19, 6)

In [128]:
print(f500[:5])

[[ 1.00000e+00  4.85873e+05  8.00000e-01  1.36430e+04  1.98825e+05
  -7.20000e+00]
 [ 2.00000e+00  3.15199e+05 -4.40000e+00  9.57130e+03  4.89838e+05
  -6.20000e+00]
 [ 3.00000e+00  2.67518e+05 -9.10000e+00  1.25790e+03  3.10726e+05
  -6.50000e+01]
 [ 4.00000e+00  2.62573e+05 -1.23000e+01  1.86750e+03  5.85619e+05
  -7.37000e+01]
 [ 5.00000e+00  2.54694e+05  7.70000e+00  1.68993e+04  4.37575e+05
  -1.23000e+01]]


<div class="alert alert-block alert-info">
<b>Exercise:</b> Check all the other attributes of the f500 array.</div>


<div class="alert alert-block alert-info">
<b>Exercise:</b> Select a single value and a single row from the f500 array.</div>


<div class="alert alert-block alert-info">
<b>Exercise:</b> 
Use vector addition to add revenues (column index 1) and profits (column index 3). Assign the result to revenues_and_profits.</div>


In [99]:
revenues = f500[:, 1]
profits = f500[:, 3]
revenues_and_profits = revenues + profits
revenues_and_profits

array([499516. , 324770.3, 268775.9, 264440.5, 271593.3, 246201.3,
       244608. , 247678. , 261326. , 212844. , 203603. , 186721. ,
       191857. , 182843. , 193273.5, 175262. , 178911.4, 175807. ,
       176762. ])

<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the ndarray.min() method to calculate the minimum value of revenues. Assign the result to revenues_min.
Use the ndarray.mean() method to calculate the average value of profits. Assign the result to profits_mean.</div>


In [104]:
revenues_min = revenues.min()
revenues_min

163786.0

In [103]:
profits_mean = profits.mean()
profits_mean

10599.905263157894

<div class="alert alert-block alert-info">We calculate some statistics for the f500 data.</div>
