# Working with numerical data

Suppose we want to use climate data such as temperature, rainfall, and humidity to determine whether a region is well suited for growing apples. A simple approach is to model the annual yield of apples (tons per hectare) as a linear function of the climatic conditions: the average temperature (in degrees Fahrenheit), rainfall (in millimeters), and average relative humidity (in percent).

yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity

In [1]:
w1, w2, w3 = 0.3, 0.2, 0.5

Given some climate data for a region, we can now predict the yield of apples. Here's some sample data:

<img src="https://i.imgur.com/TXPBiqv.png" style="width:360px;">

To begin, we can define some variables to record climate data for a region.

In [2]:
kanto_temp = 73
kanto_rainfall = 67
kanto_humidity = 43

In [3]:
kanto_yield_apples = kanto_temp * w1 + kanto_rainfall * w2 + kanto_humidity * w3
kanto_yield_apples

56.8

In [4]:
print("The expected yield of apples in the Kanto region is {} tons per hectare.".format(kanto_yield_apples))

The expected yield of apples in the Kanto region is 56.8 tons per hectare.


In [5]:
kanto = [73, 67, 43]
johto = [91, 88, 64]
hoenn = [87, 134, 58]
sinnoh = [102, 43, 37]
unova = [69, 96, 70]

In [6]:
weights = [w1, w2, w3]

In [7]:
def crop_yield(region, weights):
    result = 0
    for x, w in zip(region, weights):
        result += x * w
    return result

In [8]:
crop_yield(kanto, weights)

56.8

In [9]:
crop_yield(johto, weights)

76.9

# Going from Python lists to Numpy arrays

In [10]:
import numpy as np


In [11]:
kanto = np.array([73, 67, 43])

In [12]:
weights = np.array([w1, w2, w3])

In [13]:
weights

array([0.3, 0.2, 0.5])

In [14]:
type(kanto)

numpy.ndarray

In [15]:
type(weights)

numpy.ndarray

In [16]:
weights[0]

0.3

In [17]:
kanto[2]

43

# Operating on Numpy arrays

In [18]:
np.dot(kanto, weights)

56.8

In [20]:
(kanto * weights).sum()

56.8

# Benefits of using Numpy arrays

In [27]:
# Python lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))

# Numpy arrays
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)

In [28]:
%%time
result = 0
for x1, x2 in zip(arr1, arr2):
    result += x1*x2
result

Wall time: 400 ms


833332333333500000

In [29]:
%%time
np.dot(arr1_np, arr2_np)

Wall time: 996 µs


-1942957984
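
The two results above do not agree: the Python loop returns `833332333333500000`, while `np.dot` returns a negative number. This is consistent with integer overflow, since the arrays in this run appear to hold 32-bit integers (`climate_data.dtype` is reported as `int32` later in this notebook) and the true result does not fit in 32 bits. Here is a minimal sketch of one way to avoid this, assuming we build the arrays with an explicit `dtype=np.int64`. The names `n`, `dot_int64` and `expected` are introduced here only for illustration.

```python
import numpy as np

n = 1_000_000

# Request 64-bit integers explicitly instead of relying on the platform default
arr1_np = np.arange(n, dtype=np.int64)
arr2_np = np.arange(n, 2 * n, dtype=np.int64)

# Dot product computed by Numpy using 64-bit integer arithmetic
dot_int64 = int(np.dot(arr1_np, arr2_np))

# Reference value computed with Python's arbitrary-precision integers
expected = sum(x1 * x2 for x1, x2 in zip(range(n), range(n, 2 * n)))

print(dot_int64 == expected)
```

The speed advantage of `np.dot` remains, but it is worth checking the `dtype` of an array whenever the values involved can grow large.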

# Multi-dimensional Numpy arrays

In [31]:
climate_data = np.array([[73, 67, 43], 
                        [91, 88, 64],
                        [87, 134, 58],
                        [102, 43, 37],
                        [69, 96, 70]])

In [32]:
climate_data

array([[ 73,  67,  43],
       [ 91,  88,  64],
       [ 87, 134,  58],
       [102,  43,  37],
       [ 69,  96,  70]])

In [34]:
climate_data.shape

(5, 3)

In [35]:
weights

array([0.3, 0.2, 0.5])

In [36]:
weights.shape

(3,)

In [37]:
arr3 = np.array([[[11, 12, 13], [13, 14, 15]], [[15, 16, 17], [17, 18, 19.5]]])

In [38]:
arr3

array([[[11. , 12. , 13. ],
        [13. , 14. , 15. ]],

       [[15. , 16. , 17. ],
        [17. , 18. , 19.5]]])

In [39]:
arr3.shape

(2, 2, 3)

In [40]:
weights.dtype

dtype('float64')

In [41]:
climate_data.dtype

dtype('int32')

In [42]:
arr3.dtype

dtype('float64')

In [43]:
np.matmul(climate_data, weights)

array([56.8, 76.9, 81.9, 57.7, 74.9])

In [44]:
climate_data @ weights

array([56.8, 76.9, 81.9, 57.7, 74.9])

# Working with CSV data files

In [45]:
# CSV stands for comma-separated values
import urllib.request

urllib.request.urlretrieve('https://hub.jovian.ml/wp-content/uploads/2020/08/climate.csv', 'climate.txt')

('climate.txt', <http.client.HTTPMessage at 0x226664ac070>)

In [46]:
climate_data = np.genfromtxt('climate.txt', delimiter = ',', skip_header = 1)

In [47]:
climate_data

array([[25., 76., 99.],
       [39., 65., 70.],
       [59., 45., 77.],
       ...,
       [99., 62., 58.],
       [70., 71., 91.],
       [92., 39., 76.]])

In [48]:
climate_data.shape

(10000, 3)

In [49]:
weights = np.array([0.3, 0.2, 0.5])

In [50]:
yields = climate_data @ weights

In [51]:
yields

array([72.2, 59.7, 65.2, ..., 71.1, 80.7, 73.4])

In [52]:
yields.shape

(10000,)

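Let's add the `yields` to `climate_data` as a fourth column using the [`np.concatenate`](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html) function. Since `yields` is one-dimensional, it is first reshaped into a column of shape `(10000, 1)` so that it can be joined to `climate_data` along `axis = 1`.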

In [57]:
climate_results = np.concatenate((climate_data, yields.reshape(10000, 1)), axis = 1)

In [58]:
climate_results

array([[25. , 76. , 99. , 72.2],
       [39. , 65. , 70. , 59.7],
       [59. , 45. , 77. , 65.2],
       ...,
       [99. , 62. , 58. , 71.1],
       [70. , 71. , 91. , 80.7],
       [92. , 39. , 76. , 73.4]])
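
The same reshape-and-concatenate step can be tried on a tiny array to see how the shapes change. The sketch below assumes a two-row stand-in for `climate_data` (the names `data`, `column` and `results` are hypothetical) and uses `reshape(-1, 1)`, which lets Numpy infer the number of rows instead of hard-coding `10000`.

```python
import numpy as np

# Toy stand-in for climate_data: two regions from the first example
data = np.array([[73., 67., 43.],
                 [91., 88., 64.]])
weights = np.array([0.3, 0.2, 0.5])

yields = data @ weights            # one-dimensional, shape (2,)

# -1 asks Numpy to infer the row count from the array's size
column = yields.reshape(-1, 1)     # two-dimensional, shape (2, 1)

# Join along axis 1 so the yields become a new last column
results = np.concatenate((data, column), axis=1)

print(column.shape)
print(results.shape)
```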

In [63]:
np.savetxt('climate_results.txt', climate_results, fmt = '%.2f',  delimiter = ',', header = 'temperature, rainfall, humidity, yield_apples', comments = '')

# Arithmetic operations, broadcasting and comparison

In [64]:
arr2 = np.array([[1, 2, 3, 4], 
                 [5, 6, 7, 8], 
                 [9, 1, 2, 3]])

In [65]:
arr3 = np.array([[11, 12, 13, 14], 
                 [15, 16, 17, 18], 
                 [19, 11, 12, 13]])

In [66]:
arr2 + 3

array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12,  4,  5,  6]])

In [67]:
arr3 - arr2

array([[10, 10, 10, 10],
       [10, 10, 10, 10],
       [10, 10, 10, 10]])

In [68]:
arr2 / 2

array([[0.5, 1. , 1.5, 2. ],
       [2.5, 3. , 3.5, 4. ],
       [4.5, 0.5, 1. , 1.5]])

In [69]:
arr2 * arr3

array([[ 11,  24,  39,  56],
       [ 75,  96, 119, 144],
       [171,  11,  24,  39]])

In [70]:
arr2 % 4

array([[1, 2, 3, 0],
       [1, 2, 3, 0],
       [1, 1, 2, 3]], dtype=int32)

# Array Broadcasting

In [71]:
arr2 = np.array([[1, 2, 3, 4], 
                 [5, 6, 7, 8], 
                 [9, 1, 2, 3]])

In [72]:
arr2.shape

(3, 4)

In [73]:
arr4 = np.array([4, 5, 6, 7])

In [74]:
arr4.shape

(4,)

In [75]:
arr2 + arr4

array([[ 5,  7,  9, 11],
       [ 9, 11, 13, 15],
       [13,  6,  8, 10]])

When the expression `arr2 + arr4` is evaluated, `arr4` (which has the shape `(4,)`) is replicated three times to match the shape `(3, 4)` of `arr2`. Numpy performs this replication without actually creating three copies of the smaller array, which saves both time and memory.
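
One way to see the "no copies" idea is with `np.broadcast_to`, which returns a read-only view of an array stretched to a given shape. This is only an illustration of the concept, not a description of what Numpy does internally for `arr2 + arr4`. The name `view` is introduced here for the example.

```python
import numpy as np

arr4 = np.array([4, 5, 6, 7])

# Stretch arr4 to shape (3, 4) without copying its data
view = np.broadcast_to(arr4, (3, 4))

print(view.shape)                    # (3, 4)
print(view.strides[0])               # 0: every row starts at the same memory location
print(np.shares_memory(view, arr4))  # True: the view reuses arr4's memory
```

A stride of `0` along the first axis means that moving from one row to the next does not move through memory at all, so the three rows are the same four numbers read three times.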

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/02.05-broadcasting.png" width="360">

Broadcasting only works if one of the arrays can be replicated to match the other array's shape.

In [76]:
arr5 = np.array([7, 8])

In [77]:
arr5.shape

(2,)

In [78]:
arr2 + arr5

ValueError: operands could not be broadcast together with shapes (3,4) (2,) 
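
Numpy compares the two shapes starting from the last dimension, and each pair of dimensions must either be equal or contain a `1`. The sketch below contrasts a shape that satisfies this rule with the one that failed above. The names `col` and `failed` are hypothetical and used only for this example.

```python
import numpy as np

arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])

# Shape (3, 1): the axis of size 1 can be stretched to length 4
col = np.array([7, 8, 9]).reshape(3, 1)
result = arr2 + col
print(result)

# Shape (2,): the last axis has size 2, which is neither 1 nor 4
try:
    arr2 + np.array([7, 8])
    failed = False
except ValueError:
    failed = True
print(failed)  # True
```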

# Array Comparison

In [79]:
arr1 = np.array([[1, 2, 3], [3, 4, 5]])
arr2 = np.array([[2, 2, 3], [1, 2, 5]])

In [80]:
arr1 == arr2

array([[False,  True,  True],
       [False, False,  True]])

In [81]:
arr1 != arr2

array([[ True, False, False],
       [ True,  True, False]])

In [82]:
arr1 >= arr2

array([[False,  True,  True],
       [ True,  True,  True]])

In [83]:
arr1 < arr2

array([[ True, False, False],
       [False, False, False]])

Array comparison is frequently used to count the number of equal elements in two arrays using the `sum` method. Remember that `True` evaluates to `1` and `False` evaluates to `0` when booleans are used in arithmetic operations.

In [84]:
(arr1 == arr2).sum()

3
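
Because the comparison result is itself an array, the same idea extends to other summaries. Here is a small sketch, assuming the two arrays defined above, that also computes the fraction of matching elements and the number of matches in each column (`matches` is a name introduced for this example).

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [3, 4, 5]])
arr2 = np.array([[2, 2, 3], [1, 2, 5]])

# Boolean array marking the positions where the elements are equal
matches = arr1 == arr2

print(matches.sum())         # 3 matching elements in total
print(matches.mean())        # 0.5, the fraction of elements that match
print(matches.sum(axis=0))   # [0 1 2], matches counted per column
```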

# Array indexing and slicing

In [85]:
arr3 = np.array([
    [[11, 12, 13, 14], 
     [13, 14, 15, 19]], 
    
    [[15, 16, 17, 21], 
     [63, 92, 36, 18]], 
    
    [[98, 32, 81, 23],      
     [17, 18, 19.5, 43]]])

In [86]:
arr3.shape

(3, 2, 4)

In [87]:
arr3[1, 0, 3]

21.0

In [88]:
arr3[1, 1, 2]

36.0

In [89]:
arr3[1:, 0:1, :2]

array([[[15., 16.]],

       [[98., 32.]]])

In [90]:
arr3[0:2, 0:1, 1:]

array([[[12., 13., 14.]],

       [[16., 17., 21.]]])

In [91]:
arr3[1:, 1, 3]

array([18., 43.])

In [92]:
arr3[1:, 1, :3]

array([[63. , 92. , 36. ],
       [17. , 18. , 19.5]])

In [93]:
arr3[1]

array([[15., 16., 17., 21.],
       [63., 92., 36., 18.]])

In [94]:
arr3[:2, 1]

array([[13., 14., 15., 19.],
       [63., 92., 36., 18.]])

In [97]:
# Using too many indices
arr3[1, 3, 2, 1]

IndexError: too many indices for array: array is 3-dimensional, but 4 were indexed

# Other ways of creating Numpy arrays

In [98]:
np.zeros((3, 2))

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

In [99]:
np.ones([2, 2, 3])

array([[[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]]])

In [100]:
# Identity matrix
np.eye(3)


array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [101]:
# Random vector
np.random.rand(5)


array([0.34690877, 0.67216627, 0.65377199, 0.59332258, 0.28944259])

In [102]:
# Random matrix
np.random.randn(2, 3) 

array([[-0.23321044,  0.29432042,  0.71580131],
       [-1.48050594, -0.88060701, -0.41079367]])

In [103]:
# Fixed value
np.full([2, 3], 42)

array([[42, 42, 42],
       [42, 42, 42]])

In [104]:
# Range with start, end and step
np.arange(10, 90, 3)

array([10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58,
       61, 64, 67, 70, 73, 76, 79, 82, 85, 88])

In [105]:
# Equally spaced numbers in a range
np.linspace(3, 27, 9)

array([ 3.,  6.,  9., 12., 15., 18., 21., 24., 27.])