# NumPy

## Understanding Data Types in Python


```C
/* C code */
int result = 0;
for(int i=0; i<100; i++){
    result += i;
}
```

In Python, the equivalent operation could be written this way:
```python
# Python code
result = 0
for i in range(100):
    result += i
```


In [1]:
import numpy as np
import pandas as pd

```C
/* C code */
int x = 4;
x = "four";  // FAILS
```

### A Python Integer Is More Than Just an Integer

```C
struct _longobject {
    long ob_refcnt;
    PyTypeObject *ob_type;
    size_t ob_size;
    long ob_digit[1];
};
```

A single integer in Python 3.4 actually contains four pieces:
- ob_refcnt, a reference count that helps Python silently handle memory allocation and deallocation
- ob_type, which encodes the type of the variable
- ob_size, which specifies the size of the following data members
- ob_digit, which contains the actual integer value that we expect the Python variable to represent.
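The overhead implied by the four fields above can be observed directly. This is a rough sketch, assuming CPython on a 64-bit platform; the exact byte count of a Python integer varies by version and build, so only the comparison is meaningful here.

```python
import sys

import numpy as np

# A Python int is a full object: reference count, type pointer,
# size field, and the digits themselves.
py_int_bytes = sys.getsizeof(1)

# A NumPy int64 element stores only the raw 8-byte value in the array buffer.
np_int_bytes = np.dtype(np.int64).itemsize

print(py_int_bytes, np_int_bytes)
```

The gap between the two numbers is the per-object bookkeeping that a fixed-type array avoids.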

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/cint_vs_pyint.png" alt="Integer Memory Layout">

### A Python List Is More Than Just a List


In [3]:
x = 4
x

4

In [4]:
x = 'four'
x

'four'

In [3]:
l = list(range(10))
type(l)
l

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [6]:
l2 = [str(c) for c in l]
type(l2)
l2

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [7]:
l3 = [True, "2", 3.0, 4]
type(l3)
l3

[True, '2', 3.0, 4]


<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/array_vs_list.png" alt="Array Memory Layout">

### Fixed-Type Arrays in Python


In [8]:
import array


In [18]:
l = list(range(10))
a = array.array('i', l)
a

array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

## How Vectorization Makes Code Faster



<p><img alt="Translating Python code to bytecode" src="https://s3.amazonaws.com/dq-content/289/bytecode.svg"></p>


<table>
<thead>
<tr>
<th>Language Type</th>
<th>Example</th>
<th>Time taken to write program</th>
<th>Control over program performance</th>
</tr>
</thead>
<tbody>
<tr>
<td>High-Level</td>
<td>Python</td>
<td>Low</td>
<td>Low</td>
</tr>
<tr>
<td>Low-Level</td>
<td>C</td>
<td>High</td>
<td>High</td>
</tr>
</tbody>
</table>



<p><img alt="For loop to sum rows" src="https://s3.amazonaws.com/dq-content/289/for_loop.svg"></p>

In [9]:
my_numbers = [[6,5],[1,3],[5,6]]

sums = []

for row in my_numbers:
    row_sum = row[0] + row[1]
    sums.append(row_sum)
    
print(sums)

[11, 4, 11]



<p><img alt="Unvectorized operation" src="https://s3.amazonaws.com/dq-content/289/unvectorized.svg"></p>

<p><img alt="Vectorized operation" src="https://s3.amazonaws.com/dq-content/289/vectorized.svg"></p>



## NumPy

In [17]:
import numpy as np
list1 = [[1,2,3], [4,5,6]]
arr1 = np.array(list1)
print(arr1)

[[1 2 3]
 [4 5 6]]


### NumPy ndarrays



<p><img alt="Dimensional Arrays" src="https://s3.amazonaws.com/dq-content/289/dimensional_arrays.svg"></p>



#### Create an array



In [20]:
list1 = [1,2,3]
arr1 = np.array(list1)
arr1

array([1, 2, 3])

In [28]:
np.ones((3,5))

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [26]:
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [27]:
np.zeros((10,6))

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [34]:
np.linspace(0,1,5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [38]:
np.random.randint(0,10, (4,4))

array([[6, 3, 7, 3],
       [5, 3, 0, 7],
       [8, 5, 7, 5],
       [1, 2, 0, 7]])

In [58]:
r = np.random.random((3,3,3))
r.itemsize

8

In [45]:
np.eye(5)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [48]:
np.full((5,6), 8)

array([[8, 8, 8, 8, 8, 8],
       [8, 8, 8, 8, 8, 8],
       [8, 8, 8, 8, 8, 8],
       [8, 8, 8, 8, 8, 8],
       [8, 8, 8, 8, 8, 8]])

In [18]:
np.empty([2,2])

array([[1. , 2. ],
       [3. , 7.5]])

#### Understanding NumPy ndarrays

In [60]:
x = np.array([1,2,3])

x.nbytes

24

In [64]:
test_arr = np.random.randint(10, size=(5,6))
test_arr

array([[1, 0, 7, 9, 2, 7],
       [7, 7, 8, 4, 9, 7],
       [9, 5, 0, 3, 9, 9],
       [0, 5, 7, 8, 7, 4],
       [9, 1, 8, 0, 9, 5]])

In [82]:
first_item = test_arr[0,0]
first_item

1

In [86]:
x = np.array([1,2])
x.dtype

dtype('int64')

In [89]:
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

#### Selecting and Slicing Rows and Items from ndarrays

<p><img alt="Selecting rows from a 2D ndarray" src="https://s3.amazonaws.com/dq-content/289/selection_rows.svg"></p>



This is how we select a single item from a 2D ndarray:

<p><img alt="Selecting a single item from a 2D ndarray" src="https://s3.amazonaws.com/dq-content/289/selection_item.svg"></p>
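As a minimal sketch of the row and single-item selections pictured above, the following uses a small hypothetical array named `data` (not part of the original notebook):

```python
import numpy as np

# Hypothetical 3x4 array used only for illustration:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
data = np.arange(12).reshape(3, 4)

second_row = data[1]          # a single row, returned as a 1D array
first_two_rows = data[:2]     # a slice of rows, returned as a 2D array
rows_0_and_2 = data[[0, 2]]   # specific rows chosen with a list of indices
item = data[1, 3]             # row index 1, column index 3

print(second_row)   # [4 5 6 7]
print(item)         # 7
```

Note that `data[1, 3]` takes the row index first and the column index second, separated by a comma inside a single pair of brackets.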


#### Selecting Columns and Custom Slicing ndarrays

Let's continue by learning how to select one or more columns of data:

<p><img alt="Selecting columns from a 2D ndarray" src="https://s3.amazonaws.com/dq-content/289/selection_columns.svg"></p>



If we wanted to select a partial 1D slice of a row or column, we can combine a single value for one dimension with a slice for the other dimension:

<p><img alt="Selecting partial 1D slices from a 2D ndarray" src="https://s3.amazonaws.com/dq-content/289/selection_1darray.svg"></p>

Lastly, if we wanted to select a 2D slice, we can use slices for both dimensions:

<p><img alt="Selecting a 2D slice from a 2D ndarray" src="https://s3.amazonaws.com/dq-content/289/selection_2darray.svg"></p>
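The column, partial 1D, and 2D selections described above might look like this in practice. The array `data` is a hypothetical example, chosen so the results are easy to check by eye:

```python
import numpy as np

# Hypothetical 3x4 array used only for illustration:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
data = np.arange(12).reshape(3, 4)

third_column = data[:, 2]        # one column -> 1D array
cols_1_to_2 = data[:, 1:3]       # a slice of columns -> 2D array
cols_3_and_0 = data[:, [3, 0]]   # specific columns, in the order requested
partial_row = data[2, 1:]        # single row index + column slice -> 1D
partial_col = data[:2, 3]        # row slice + single column index -> 1D
block = data[1:, :2]             # slices on both dimensions -> 2D

print(third_column)  # [ 2  6 10]
print(block)
```

The colon on its own (`:`) means "everything along this dimension", which is why it appears in the row position whenever whole columns are selected.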



#### Modify values in ndarray



In [61]:
list1 = [6,7.5,78,45,9,6,58]
arr1 = np.array(list1)
print(arr1)

[ 6.   7.5 78.  45.   9.   6.  58. ]


#### Datatypes

[More about data types](https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)

[List of scalars](https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html#arrays-scalars-built-in)

<div class="text_cell_render border-box-sizing rendered_html">
<table>
<thead><tr>
<th>Data type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>bool_</code></td>
<td>Boolean (True or False) stored as a byte</td>
</tr>
<tr>
<td><code>int_</code></td>
<td>Default integer type (same as C <code>long</code>; normally either <code>int64</code> or <code>int32</code>)</td>
</tr>
<tr>
<td><code>intc</code></td>
<td>Identical to C <code>int</code> (normally <code>int32</code> or <code>int64</code>)</td>
</tr>
<tr>
<td><code>intp</code></td>
<td>Integer used for indexing (same as C <code>ssize_t</code>; normally either <code>int32</code> or <code>int64</code>)</td>
</tr>
<tr>
<td><code>int8</code></td>
<td>Byte (-128 to 127)</td>
</tr>
<tr>
<td><code>int16</code></td>
<td>Integer (-32768 to 32767)</td>
</tr>
<tr>
<td><code>int32</code></td>
<td>Integer (-2147483648 to 2147483647)</td>
</tr>
<tr>
<td><code>int64</code></td>
<td>Integer (-9223372036854775808 to 9223372036854775807)</td>
</tr>
<tr>
<td><code>uint8</code></td>
<td>Unsigned integer (0 to 255)</td>
</tr>
<tr>
<td><code>uint16</code></td>
<td>Unsigned integer (0 to 65535)</td>
</tr>
<tr>
<td><code>uint32</code></td>
<td>Unsigned integer (0 to 4294967295)</td>
</tr>
<tr>
<td><code>uint64</code></td>
<td>Unsigned integer (0 to 18446744073709551615)</td>
</tr>
<tr>
<td><code>float_</code></td>
<td>Shorthand for <code>float64</code>.</td>
</tr>
<tr>
<td><code>float16</code></td>
<td>Half precision float: sign bit, 5 bits exponent, 10 bits mantissa</td>
</tr>
<tr>
<td><code>float32</code></td>
<td>Single precision float: sign bit, 8 bits exponent, 23 bits mantissa</td>
</tr>
<tr>
<td><code>float64</code></td>
<td>Double precision float: sign bit, 11 bits exponent, 52 bits mantissa</td>
</tr>
<tr>
<td><code>complex_</code></td>
<td>Shorthand for <code>complex128</code>.</td>
</tr>
<tr>
<td><code>complex64</code></td>
<td>Complex number, represented by two 32-bit floats</td>
</tr>
<tr>
<td><code>complex128</code></td>
<td>Complex number, represented by two 64-bit floats</td>
</tr>
</tbody>
</table>

</div>
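A short sketch of how the types in the table can be inspected and applied. The array names here are illustrative only:

```python
import numpy as np

# np.iinfo reports the representable range of an integer type.
info = np.iinfo(np.int8)
print(info.min, info.max)  # -128 127

# The dtype can be set when the array is created...
small = np.array([1, 2, 3], dtype=np.int8)
print(small.dtype, small.itemsize)  # int8 1

# ...or changed afterwards with astype, which returns a new array.
as_float = small.astype(np.float32)
print(as_float.dtype, as_float.itemsize)  # float32 4
```

Choosing a narrower type reduces memory use per element, at the cost of a smaller range of representable values.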

### Computation on NumPy Arrays: Universal Functions


#### The Slowness of Loops



In [63]:
def compute_reciprocals(values):
    # Allocate one output slot per input value
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

np.random.seed(0)
values = np.random.randint(1,10, size=5)
values


array([6, 1, 4, 4, 8])

In [64]:
compute_reciprocals(values)


array([0.16666667, 1.        , 0.25      , 0.25      , 0.125     ])

In [65]:
big_array = np.random.randint(1,100, size=10000)
%timeit (1.0/big_array)

20.4 µs ± 753 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [120]:

np.arange(5)

array([0, 1, 2, 3, 4])

In [121]:
np.identity(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [126]:
x = np.arange(9).reshape((3,3))
x

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

#### Introducing UFuncs (Universal functions)

[Docs](https://docs.scipy.org/doc/numpy/reference/ufuncs.html)



In [3]:
import csv

In [4]:
!head taxi_data.csv

VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type

2,2017-01-01 00:01:15,2017-01-01 00:11:05,N,1,42,166,1,1.71,9,0,0.5,0,0,,0.3,9.8,2,1
2,2017-01-01 00:03:34,2017-01-01 00:09:00,N,1,75,74,1,1.44,6.5,0.5,0.5,0,0,,0.3,7.8,2,1
2,2017-01-01 00:04:02,2017-01-01 00:12:55,N,1,82,70,5,3.45,12,0.5,0.5,2.66,0,,0.3,15.96,1,1
2,2017-01-01 00:01:40,2017-01-01 00:14:23,N,1,255,232,1,2.11,10.5,0.5,0.5,0,0,,0.3,11.8,2,1
2,2017-01-01 00:00:51,2017-01-01 00:18:55,N,1,166,239,1,2.76,11.5,0.5,0.5,0,0,,0.3,12.8,2,1
2,2017-01-01 00:00:28,2017-01-01 00:13:31,N,1,179,226,1,4.14,15,0.5,0.5,0,0,,0.3,16.3,1,1
2,2017-01-01 00:02:39,2017-01-01 00:26:28,N,1,74,167,1,4.22,19,0.5,0.5,0,0,,0.3,20.3,2,1
2,2017-01-01 00:15:21,2017-01-01 00:28:06,N,1,112,37,1,2.83,11,0.5,0.5,0,0,,0.3,12.3,2,1


In [17]:
with open('taxi_data.csv', 'r') as f:
    taxi_list = list(csv.reader(f))
    
 


In [18]:
taxi_list = taxi_list[2:]   

In [20]:
#taxi_list[:3]

In [21]:
taxi_list[2]

['2',
 '2017-01-01 00:00:51',
 '2017-01-01 00:18:55',
 'N',
 '1',
 '166',
 '239',
 '1',
 '2.76',
 '11.5',
 '0.5',
 '0.5',
 '0',
 '0',
 '',
 '0.3',
 '12.8',
 '2',
 '1']

In [26]:
converted_taxi_list = []

for row in taxi_list:
    converted_row = []
    for item in row:
        try:
            converted_row.append(float(item))
        except ValueError:
            # Skip non-numeric fields (timestamps, flags, empty strings)
            continue
    converted_taxi_list.append(converted_row)


In [24]:
#converted_taxi_list[:2]    



In [27]:
taxi = np.array(converted_taxi_list)
taxi

array([[  2.  ,   1.  ,  82.  , ...,  15.96,   1.  ,   1.  ],
       [  2.  ,   1.  , 255.  , ...,  11.8 ,   2.  ,   1.  ],
       [  2.  ,   1.  , 166.  , ...,  12.8 ,   2.  ,   1.  ],
       ...,
       [  1.  ,   1.  , 228.  , ...,  80.3 ,   1.  ,   1.  ],
       [  1.  ,   1.  ,   7.  , ...,  17.3 ,   1.  ,   1.  ],
       [  1.  ,   1.  , 255.  , ...,  12.8 ,   1.  ,   1.  ]])

In [9]:
x = np.array([1,2,3])
y = np.array([4,5,6])

x.dot(y)

z = np.array([y, y**2])
z

array([[ 4,  5,  6],
       [16, 25, 36]])

In [10]:
z.T

array([[ 4, 16],
       [ 5, 25],
       [ 6, 36]])

In [11]:
z.shape

(2, 3)

In [12]:
z.sort

<function ndarray.sort>

In [31]:
col1 = taxi[:, 6]
col2 = taxi[:, 8]
col3 = taxi[:, 11]
sums  = col1 + col2 + col3
sums

array([12.8, 11.3, 12.3, ..., 61.8, 15.3, 10.8])

In [32]:
np.arange(3) + np.arange(4)

ValueError: operands could not be broadcast together with shapes (3,) (4,) 

In [33]:
taxi.shape

(19996, 15)

In [51]:
taxi.ndim

2

In [34]:
trip_distance = taxi[:,5]
trip_price = taxi[:,12]
price_per_mile = trip_price / trip_distance
price_per_mile



array([4.62608696, 5.59241706, 4.63768116, ..., 3.70046083, 4.11904762,
       5.12      ])

In [35]:
np.array([1,2,3]) / 0

array([inf, inf, inf])

In [36]:
price_per_mile2 = np.divide(trip_price, trip_distance)
price_per_mile2


array([4.62608696, 5.59241706, 4.63768116, ..., 3.70046083, 4.11904762,
       5.12      ])

In [38]:
price_min = taxi[:,12].min()
price_min

-60.0

In [40]:
price_max = taxi[:,12].max()
price_max

240.0

In [41]:
np.max(taxi[:,12])

240.0

In [42]:
taxi[:,12].max()

240.0

In [43]:
taxi[:,12].mean()

15.861171234246854

In [45]:
taxi_first_20 = taxi[:20]
taxi_first_20

array([[  2.  ,   1.  ,  82.  ,  70.  ,   5.  ,   3.45,  12.  ,   0.5 ,
          0.5 ,   2.66,   0.  ,   0.3 ,  15.96,   1.  ,   1.  ],
       [  2.  ,   1.  , 255.  , 232.  ,   1.  ,   2.11,  10.5 ,   0.5 ,
          0.5 ,   0.  ,   0.  ,   0.3 ,  11.8 ,   2.  ,   1.  ],
       [  2.  ,   1.  , 166.  , 239.  ,   1.  ,   2.76,  11.5 ,   0.5 ,
          0.5 ,   0.  ,   0.  ,   0.3 ,  12.8 ,   2.  ,   1.  ],
       [  2.  ,   1.  , 179.  , 226.  ,   1.  ,   4.14,  15.  ,   0.5 ,
          0.5 ,   0.  ,   0.  ,   0.3 ,  16.3 ,   1.  ,   1.  ],
       [  2.  ,   1.  ,  74.  , 167.  ,   1.  ,   4.22,  19.  ,   0.5 ,
          0.5 ,   0.  ,   0.  ,   0.3 ,  20.3 ,   2.  ,   1.  ],
       [  2.  ,   1.  , 112.  ,  37.  ,   1.  ,   2.83,  11.  ,   0.5 ,
          0.5 ,   0.  ,   0.  ,   0.3 ,  12.3 ,   2.  ,   1.  ],
       [  2.  ,   1.  ,  36.  ,  37.  ,   1.  ,   0.78,   5.  ,   0.5 ,
          0.5 ,   0.  ,   0.  ,   0.3 ,   6.3 ,   2.  ,   1.  ],
       [  2.  ,   1.  , 127.  , 174.  ,  

In [47]:
cene_skupaj = taxi_first_20[:,6:12]
cene_skupaj

array([[12.  ,  0.5 ,  0.5 ,  2.66,  0.  ,  0.3 ],
       [10.5 ,  0.5 ,  0.5 ,  0.  ,  0.  ,  0.3 ],
       [11.5 ,  0.5 ,  0.5 ,  0.  ,  0.  ,  0.3 ],
       [15.  ,  0.5 ,  0.5 ,  0.  ,  0.  ,  0.3 ],
       [19.  ,  0.5 ,  0.5 ,  0.  ,  0.  ,  0.3 ],
       [11.  ,  0.5 ,  0.5 ,  0.  ,  0.  ,  0.3 ],
       [ 5.  ,  0.5 ,  0.5 ,  0.  ,  0.  ,  0.3 ],
       [13.5 ,  0.5 ,  0.5 ,  0.  ,  0.  ,  0.3 ],
       [ 8.5 ,  0.5 ,  0.5 ,  1.96,  0.  ,  0.3 ],
       [21.  ,  0.5 ,  0.5 ,  1.  ,  0.  ,  0.3 ],
       [30.  ,  0.5 ,  0.5 ,  0.  ,  0.  ,  0.3 ],
       [ 7.  ,  0.5 ,  0.5 ,  0.  ,  0.  ,  0.3 ],
       [18.5 ,  0.5 ,  0.5 ,  5.94,  0.  ,  0.3 ],
       [10.  ,  0.5 ,  0.5 ,  0.  ,  0.  ,  0.3 ],
       [ 3.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ],
       [ 9.  ,  0.5 ,  0.5 ,  2.06,  0.  ,  0.3 ],
       [17.  ,  0.5 ,  0.5 ,  3.29,  0.  ,  0.3 ],
       [10.5 ,  0.5 ,  0.5 ,  0.  ,  0.  ,  0.3 ],
       [21.5 ,  0.5 ,  0.5 ,  4.56,  0.  ,  0.3 ],
       [10.5 ,  0.5 ,  0.5 ,  0

In [50]:

cene_skupaj_preracunano = taxi_first_20[:,12]
cene_skupaj_preracunano

array([15.96, 11.8 , 12.8 , 16.3 , 20.3 , 12.3 ,  6.3 , 14.8 , 11.76,
       23.3 , 31.3 ,  8.3 , 25.74, 11.3 ,  3.  , 12.36, 21.59, 11.8 ,
       27.36, 11.8 ])

In [52]:
cena_sestevek = cene_skupaj.sum(axis=1)
cena_sestevek.round()

array([16., 12., 13., 16., 20., 12.,  6., 15., 12., 23., 31.,  8., 26.,
       11.,  3., 12., 22., 12., 27., 12.])

In [53]:
cene_skupaj_preracunano.round()

array([16., 12., 13., 16., 20., 12.,  6., 15., 12., 23., 31.,  8., 26.,
       11.,  3., 12., 22., 12., 27., 12.])

In [55]:
ones = np.ones ((2,3))
ones

array([[1., 1., 1.],
       [1., 1., 1.]])

In [56]:
zeros = np.zeros(3)
zeros

array([0., 0., 0.])

In [59]:
combined = np.concatenate([ones,zeros], axis=0)

ValueError: all the input arrays must have same number of dimensions

In [60]:
ones.shape

(2, 3)

In [61]:
zeros.shape

(3,)

In [65]:
zeros_2d = np.expand_dims(zeros, axis=0)
zeros_2d


array([[0., 0., 0.]])

In [66]:
zeros_2d.shape

(1, 3)

In [68]:
combined = np.concatenate([ones, zeros_2d], axis=0)
combined

array([[1., 1., 1.],
       [1., 1., 1.],
       [0., 0., 0.]])

In [69]:
price_per_mile.shape

(19996,)

In [71]:
price_per_mile_2d = np.expand_dims(price_per_mile, axis=1)
price_per_mile_2d

array([[4.62608696],
       [5.59241706],
       [4.63768116],
       ...,
       [3.70046083],
       [4.11904762],
       [5.12      ]])

In [72]:
price_per_mile_2d.shape

(19996, 1)

In [75]:
taxi = np.concatenate([taxi, price_per_mile_2d], axis=1)
taxi[:,-1]

array([4.62608696, 5.59241706, 4.63768116, ..., 3.70046083, 4.11904762,
       5.12      ])

In [77]:
sadje = np.array(['pomaranca', 'banana', 'jabolka', 'grozdje', 'cesnja'])
sadje

array(['pomaranca', 'banana', 'jabolka', 'grozdje', 'cesnja'], dtype='<U9')

In [78]:
sadje[2]

'jabolka'

In [80]:
sadje[[2,1]]

array(['jabolka', 'banana'], dtype='<U9')

In [82]:
sorted_order = np.argsort(sadje)
sorted_order

array([1, 4, 3, 2, 0])

In [84]:
sortirano_sadje = sadje[sorted_order]
sortirano_sadje

array(['banana', 'cesnja', 'grozdje', 'jabolka', 'pomaranca'], dtype='<U9')

In [86]:
int_square = np.random.randint(10, size=(5,5))
int_square

array([[0, 1, 6, 1, 6],
       [1, 2, 6, 0, 5],
       [7, 8, 7, 0, 3],
       [9, 8, 9, 4, 1],
       [0, 6, 4, 3, 4]])

In [88]:
last_column = int_square[:,4]
last_column

array([6, 5, 3, 1, 4])

In [91]:
sorted_order = np.argsort(last_column)
sorted_order

array([3, 2, 4, 1, 0])

In [93]:
last_column_sorted = last_column[sorted_order]
last_column_sorted

array([1, 3, 4, 5, 6])

In [95]:
int_square_sorted = int_square[sorted_order]
int_square_sorted

array([[9, 8, 9, 4, 1],
       [7, 8, 7, 0, 3],
       [0, 6, 4, 3, 4],
       [1, 2, 6, 0, 5],
       [0, 1, 6, 1, 6]])

In [97]:
int_square_sorted = int_square[np.argsort(int_square[:,4])]
int_square_sorted 

array([[9, 8, 9, 4, 1],
       [7, 8, 7, 0, 3],
       [0, 6, 4, 3, 4],
       [1, 2, 6, 0, 5],
       [0, 1, 6, 1, 6]])

In [99]:
sorted_order = np.argsort(taxi[:,1])
sorted_order

array([    0, 13229, 13228, ...,  5603,  4739, 19125])

In [101]:
taxi_sorted = taxi[sorted_order]
taxi_sorted

array([[  2.        ,   1.        ,  82.        , ...,   4.62608696,
          4.62608696,   4.62608696],
       [  2.        ,   1.        ,  97.        , ...,   6.13445378,
          6.13445378,   6.13445378],
       [  2.        ,   1.        ,  61.        , ...,   8.        ,
          8.        ,   8.        ],
       ...,
       [  2.        ,   5.        , 228.        , ...,   7.53295669,
          7.53295669,   7.53295669],
       [  2.        ,   5.        , 181.        , ...,   7.68667643,
          7.68667643,   7.68667643],
       [  2.        ,   5.        , 256.        , ...,  32.65306122,
         32.65306122,  32.65306122]])

In [102]:
taxi_sorted[:20,-1]

array([ 4.62608696,  6.13445378,  8.        , 12.        ,  5.08051948,
        3.40625   ,  8.29268293,  4.69658887,  5.07317073,  5.98130841,
        4.47129909,  2.85714286,  6.88888889,  6.86868687,  4.03258656,
        4.40816327,  6.24242424,  6.24      ,  3.63302752,  6.50526316])

In [103]:
!head taxi_data.csv

VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type

2,2017-01-01 00:01:15,2017-01-01 00:11:05,N,1,42,166,1,1.71,9,0,0.5,0,0,,0.3,9.8,2,1
2,2017-01-01 00:03:34,2017-01-01 00:09:00,N,1,75,74,1,1.44,6.5,0.5,0.5,0,0,,0.3,7.8,2,1
2,2017-01-01 00:04:02,2017-01-01 00:12:55,N,1,82,70,5,3.45,12,0.5,0.5,2.66,0,,0.3,15.96,1,1
2,2017-01-01 00:01:40,2017-01-01 00:14:23,N,1,255,232,1,2.11,10.5,0.5,0.5,0,0,,0.3,11.8,2,1
2,2017-01-01 00:00:51,2017-01-01 00:18:55,N,1,166,239,1,2.76,11.5,0.5,0.5,0,0,,0.3,12.8,2,1
2,2017-01-01 00:00:28,2017-01-01 00:13:31,N,1,179,226,1,4.14,15,0.5,0.5,0,0,,0.3,16.3,1,1
2,2017-01-01 00:02:39,2017-01-01 00:26:28,N,1,74,167,1,4.22,19,0.5,0.5,0,0,,0.3,20.3,2,1
2,2017-01-01 00:15:21,2017-01-01 00:28:06,N,1,112,37,1,2.83,11,0.5,0.5,0,0,,0.3,12.3,2,1


In [106]:
taxi = np.genfromtxt('taxi_data.csv',
                    delimiter=',',
                    skip_header=2)
taxi

array([[ 2.  ,   nan,   nan, ...,  9.8 ,  2.  ,  1.  ],
       [ 2.  ,   nan,   nan, ...,  7.8 ,  2.  ,  1.  ],
       [ 2.  ,   nan,   nan, ..., 15.96,  1.  ,  1.  ],
       ...,
       [ 1.  ,   nan,   nan, ..., 80.3 ,  1.  ,  1.  ],
       [ 1.  ,   nan,   nan, ..., 17.3 ,  1.  ,  1.  ],
       [ 1.  ,   nan,   nan, ..., 12.8 ,  1.  ,  1.  ]])

In [107]:
taxi.shape

(19998, 19)

In [108]:
taxi[1,5]


75.0

In [109]:
taxi

array([[ 2.  ,   nan,   nan, ...,  9.8 ,  2.  ,  1.  ],
       [ 2.  ,   nan,   nan, ...,  7.8 ,  2.  ,  1.  ],
       [ 2.  ,   nan,   nan, ..., 15.96,  1.  ,  1.  ],
       ...,
       [ 1.  ,   nan,   nan, ..., 80.3 ,  1.  ,  1.  ],
       [ 1.  ,   nan,   nan, ..., 17.3 ,  1.  ,  1.  ],
       [ 1.  ,   nan,   nan, ..., 12.8 ,  1.  ,  1.  ]])

In [110]:
np.array([1,2,3,4]) + 10

array([11, 12, 13, 14])

In [111]:
np.array([2,4,6,8,]) < 5

array([ True,  True, False, False])

In [114]:
a = np.array([1,2,3,4,5])
a

array([1, 2, 3, 4, 5])

In [116]:
a_bool = a < 3
a_bool

array([ True,  True, False, False, False])

In [118]:
b = np.array(['blue', 'red', 'black'])
b

array(['blue', 'red', 'black'], dtype='<U5')

In [120]:
b_bool = b == 'red'
b_bool

array([False,  True, False])

In [122]:
passenger_count = taxi[:,7]
passenger_count

array([1., 1., 5., ..., 2., 1., 1.])

In [123]:
passenger_count.shape

(19998,)

In [125]:
two_pass_bool = passenger_count == 2
two_pass_bool


array([False, False, False, ...,  True, False, False])

In [127]:
two_passengers = passenger_count[two_pass_bool]
two_passengers

array([2., 2., 2., ..., 2., 2., 2.])

In [128]:
two_passengers.shape

(1801,)

In [130]:
arr = np.random.randint(10, size=(4,3))
arr

array([[1, 7, 3],
       [7, 6, 0],
       [5, 7, 6],
       [5, 0, 3]])

In [135]:
bool_1 = [True, False, True, True]
bool_2 = [True, False, True]

In [133]:
arr[:,bool_1]

IndexError: boolean index did not match indexed array along dimension 1; dimension is 3 but corresponding boolean dimension is 4

In [136]:
arr[:,bool_2]

array([[1, 3],
       [7, 0],
       [5, 6],
       [5, 3]])

In [138]:
b = np.array(['blue', 'red', 'black'])
b

array(['blue', 'red', 'black'], dtype='<U5')

In [141]:
b[1] = 'orange'
b

array(['blue', 'orang', 'black'], dtype='<U5')

In [143]:
b[1:] = 'pink'
b

array(['blue', 'pink', 'pink'], dtype='<U5')

In [145]:
ones = np.ones((3,5))
ones

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [147]:
ones[1,2] = 99
ones

array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1., 99.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

In [149]:
ones[0] = 42
ones

array([[42., 42., 42., 42., 42.],
       [ 1.,  1., 99.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

In [151]:
r = np.ones((4,4))
r

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [153]:
r2 = r[:2, :2]
r2

array([[1., 1.],
       [1., 1.]])

In [155]:
r2[:] = 0
r2

array([[0., 0.],
       [0., 0.]])

In [156]:
id(r2)

140364762213072

In [157]:
r

array([[0., 0., 1., 1.],
       [0., 0., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [160]:
r3 = np.random.randint(10, size=(5,5))
r3

array([[3, 2, 6, 5, 3],
       [9, 6, 8, 2, 8],
       [8, 0, 6, 8, 7],
       [5, 3, 1, 2, 1],
       [5, 4, 9, 2, 9]])

In [162]:
r3_sub_copy = r3[:2, :2].copy()
r3_sub_copy

array([[3, 2],
       [9, 6]])

In [165]:
r3_sub_copy[:] = 0
r3_sub_copy

array([[0, 0],
       [0, 0]])

### Importing Real Data


- Column 0 is VendorID
- Column 1 is RatecodeID
- Column 2 is PULocationID
- Column 3 is DOLocationID
- Column 4 is passenger_count
- Column 5 is trip_distance
- Column 6 is fare_amount
- Column 7 is extra
- Column 8 is mta_tax
- Column 9 is tip_amount
- Column 10 is tolls_amount
- Column 11 is improvement_surcharge
- Column 12 is total_amount
- Column 13 is payment_type
- Column 14 is trip_type

### Vector Math




Here's what happened behind the scenes:

<p><img alt="Vectorized Addition" src="https://s3.amazonaws.com/dq-content/289/vectorized_addition.svg"></p>


- `vector_a + vector_b` - Addition
- `vector_a - vector_b` - Subtraction
- `vector_a * vector_b` - Multiplication (this is unrelated to the vector multiplication used in linear algebra).
- `vector_a / vector_b` - Division
- `vector_a % vector_b` - Modulus (find the remainder when vector_a is divided by vector_b)
- `vector_a ** vector_b` - Exponent (raise vector_a to the power of vector_b)
- `vector_a // vector_b` - Floor Division (divide vector_a by vector_b, rounding down to the nearest integer)
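The operators listed above can be tried on two small arrays. The values below are arbitrary and chosen only so the results are easy to verify by hand:

```python
import numpy as np

vector_a = np.array([7, 8, 9])
vector_b = np.array([2, 3, 4])

# Every operator works element by element on arrays of the same shape.
print(vector_a + vector_b)   # [ 9 11 13]
print(vector_a - vector_b)   # [5 5 5]
print(vector_a * vector_b)   # [14 24 36]
print(vector_a / vector_b)
print(vector_a % vector_b)   # [1 2 1]
print(vector_a ** vector_b)  # [  49  512 6561]
print(vector_a // vector_b)  # [3 2 2]
```

True division (`/`) always produces a float array, even when both inputs hold integers.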


<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The following table lists the arithmetic operators implemented in NumPy:</p>
<table>
<thead><tr>
<th>Operator</th>
<th>Equivalent ufunc</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>+</code></td>
<td><code>np.add</code></td>
<td>Addition (e.g., <code>1 + 1 = 2</code>)</td>
</tr>
<tr>
<td><code>-</code></td>
<td><code>np.subtract</code></td>
<td>Subtraction (e.g., <code>3 - 2 = 1</code>)</td>
</tr>
<tr>
<td><code>-</code></td>
<td><code>np.negative</code></td>
<td>Unary negation (e.g., <code>-2</code>)</td>
</tr>
<tr>
<td><code>*</code></td>
<td><code>np.multiply</code></td>
<td>Multiplication (e.g., <code>2 * 3 = 6</code>)</td>
</tr>
<tr>
<td><code>/</code></td>
<td><code>np.divide</code></td>
<td>Division (e.g., <code>3 / 2 = 1.5</code>)</td>
</tr>
<tr>
<td><code>//</code></td>
<td><code>np.floor_divide</code></td>
<td>Floor division (e.g., <code>3 // 2 = 1</code>)</td>
</tr>
<tr>
<td><code>**</code></td>
<td><code>np.power</code></td>
<td>Exponentiation (e.g., <code>2 ** 3 = 8</code>)</td>
</tr>
<tr>
<td><code>%</code></td>
<td><code>np.mod</code></td>
<td>Modulus/remainder (e.g., <code>9 % 4 = 1</code>)</td>
</tr>
</tbody>
</table>
<p>Additionally there are Boolean/bitwise operators; we will explore these in <a href="02.06-boolean-arrays-and-masks.html">Comparisons, Masks, and Boolean Logic</a>.</p>

</div>
</div>
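To illustrate the correspondence in the table, the sketch below checks that a few operators give the same result as their named ufuncs. The array `x` is an arbitrary example:

```python
import numpy as np

x = np.arange(4)  # [0 1 2 3]

# Each operator gives the same result as the ufunc listed in the table.
assert np.array_equal(x + 2, np.add(x, 2))
assert np.array_equal(-x, np.negative(x))
assert np.array_equal(x // 2, np.floor_divide(x, 2))
assert np.array_equal(x ** 2, np.power(x, 2))

print(np.mod(x, 2))  # [0 1 0 1]
```

The operator form is usually more readable; the function form is useful when extra ufunc arguments are needed.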

[Mathematical expressions](https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.math.html#arithmetic-operations)

### Calculating Statistics For 1D ndarrays



### Calculating Statistics For 2D ndarrays

For now, we're going to look at how we can calculate statistics for two-dimensional ndarrays. If we use the array methods without additional parameters, they return a single value, just like they do with a 1D array:

<p><img alt="Array method without axis parameter" src="https://s3.amazonaws.com/dq-content/289/array_method_axis_none.svg"></p>

But what if we wanted to find the maximum value of each row? For that, we need to use the axis parameter, and specify a value of 1, which indicates we want to calculate values for each row.

<p><img alt="Array method with axis 1" src="https://s3.amazonaws.com/dq-content/289/array_method_axis_1.svg"></p>

If we want to find the maximum value of each column, we use an axis value of 0:

<p><img alt="Array method with axis 0" src="https://s3.amazonaws.com/dq-content/289/array_method_axis_0.svg"></p>

To help you remember which is which, you can think of the first axis as rows, and the second axis as columns, just in the same way as when we're indexing a 2D NumPy array we use ndarray[row,column]. Then you think about which axis you want to apply the method along. The tricky part is to remember that when you apply the method along one axis, you get results in the other axis. Here is an illustration of that:

<p><img alt="The axis parameter" src="https://s3.amazonaws.com/dq-content/289/axis_param.svg"></p>
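A small sketch of the `axis` behaviour described above, reusing the `my_numbers` values from the earlier loop example:

```python
import numpy as np

my_numbers = np.array([[6, 5], [1, 3], [5, 6]])

print(my_numbers.max())        # 6, a single value for the whole array
print(my_numbers.max(axis=1))  # [6 3 6], one result per row
print(my_numbers.max(axis=0))  # [6 6], one result per column
print(my_numbers.sum(axis=1))  # [11  4 11], the same row sums as the loop
```

The result of `max(axis=1)` has one entry per row because the method collapses the column dimension.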



In [5]:
import numpy as np

In [6]:
def compute_reciprocals(values):
    # Allocate one output slot per input value
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output    

In [9]:
np.random.seed(0)

In [10]:
values = np.random.randint(1,10, size = 5)
print(values)

[6 1 4 4 8]


In [15]:
big_array = np.random.randint(1,10,size = 100000)

In [16]:
%timeit compute_reciprocals(big_array)

In [22]:
x = np.array([1,2,3])
y = np.array([4,5,6])
x+y

array([5, 7, 9])

In [30]:
import csv

In [32]:
with open('taxi_data.csv', 'r') as f:
    taxi_list = list(csv.reader(f))
    
taxi_list[:4]    

In [33]:
3<10

True

### Adding Rows and Columns to ndarrays


### Sorting ndarrays


### Reading CSV files with NumPy

### Boolean Arrays





A similar pattern occurs: the 'less than five' operation is applied to each value in the array. The diagram below shows this step by step:

<p><img alt="Vectorized boolean operation" src="https://s3.amazonaws.com/dq-content/290/vectorized_bool.svg"></p>
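The comparison in the diagram can be sketched as follows. The array name `values` and the variable `mask` are illustrative choices:

```python
import numpy as np

values = np.array([2, 4, 6, 8])

# The comparison is applied to every element and yields a boolean array.
mask = values < 5
print(mask)          # [ True  True False False]

# A boolean array can select the matching elements or count them.
print(values[mask])  # [2 4]
print(mask.sum())    # 2
```

Summing a boolean array counts the `True` entries, because `True` is treated as 1 and `False` as 0.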

### Boolean Indexing with 1D ndarrays




<p><img alt="Boolean indexing 1D ndarrays 1" src="https://s3.amazonaws.com/dq-content/290/1d_bool_1.svg"></p>



<p><img alt="Boolean indexing 1D ndarrays 2" src="https://s3.amazonaws.com/dq-content/290/1d_bool_2.svg"></p>




### Boolean Indexing with 2D ndarrays


<p><img alt="Boolean indexing 2D ndarrays" src="https://s3.amazonaws.com/dq-content/290/bool_dims.svg"></p>


### Assigning Values in ndarrays

### Subarrays as no-copy views



### Copying Data
