In [20]:
import pandas as pd
import numpy as np
import math
import os

import warnings
warnings.filterwarnings('ignore')

### NumPy: Working with numerical data

The 'Data' in Data Analysis typically refers to numerical data, e.g. stock prices, sales figures, sensor measurements, sports scores, database tables etc. The NumPy library provides specialized data structures, functions and other tools for numerical computing in Python. Let's work through an example:

Suppose we want to use climate data like temperature, rainfall, and humidity to determine if a region is well suited for growing apples. A simple approach is to express the relationship between the annual yield of apples (tons per hectare) and climatic conditions such as average temperature (in degrees Fahrenheit), rainfall (mm) and average relative humidity (in percent) as a linear equation:

    yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity

The yield of apples is expressed as a weighted sum of the temperature, rainfall, and humidity. This equation is an approximation, since the actual relationship may not be linear and may depend on other factors as well. But for practice, this simple linear relationship works well.

Based on some statistical analysis of historical data, we may come up with reasonable values for the weights w1, w2 and w3. For example:

In [21]:
w1, w2, w3 = 0.3, 0.2, 0.5

We can define some variables to record climate data for a region

In [22]:
kanto_temp = 73
kanto_rainfall = 67
kanto_humidity = 43

We can now substitute these values into the linear equation to predict the yield of apples

In [23]:
kanto_yield_apples = kanto_temp * w1 + kanto_rainfall * w2 + kanto_humidity * w3
kanto_yield_apples

56.8

In [24]:
print(f"The expected yield of apples in Kanto is {kanto_yield_apples}")

The expected yield of apples in Kanto is 56.8


To make it easier to perform computations across multiple regions, we can represent the climate data for each region as a vector (a list of numbers):

In [25]:
kanto = [73, 67, 43]
johto = [91, 88, 64]
hoenn = [87, 134, 58]
sinnoh = [102, 43, 37]
unova = [69, 96, 70]

The three numbers in each vector represent the temperature, rainfall and humidity respectively. We can also represent the set of weights used in the formula as a vector:

In [26]:
weights = [w1, w2, w3]

We can now write a function to calculate the yield of apples (or any other crop) given the climate data and the respective weights

In [27]:
def crop_yield(region, weights):
    result = 0
    for x, w in zip(region, weights):
        result += x * w
    return result

In [28]:
crop_yield(kanto, weights)

56.8

In [29]:
crop_yield(unova, weights)

74.9
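To apply the same function to every region, one option is to collect the region vectors in a dictionary and loop over it. The sketch below repeats the definitions from the cells above so that it runs on its own; the `regions` dictionary, the `yields_by_region` name and the rounding step are illustrative additions, not part of the original notebook.

```python
# Weights and region vectors repeated from the cells above
weights = [0.3, 0.2, 0.5]

regions = {  # hypothetical helper dictionary, introduced for illustration
    "kanto": [73, 67, 43],
    "johto": [91, 88, 64],
    "hoenn": [87, 134, 58],
    "sinnoh": [102, 43, 37],
    "unova": [69, 96, 70],
}

def crop_yield(region, weights):
    """Weighted sum of the climate values (same logic as above)."""
    result = 0
    for x, w in zip(region, weights):
        result += x * w
    return result

# Round to one decimal place to hide floating-point noise
yields_by_region = {name: round(crop_yield(data, weights), 1)
                    for name, data in regions.items()}

for name, value in yields_by_region.items():
    print(f"{name}: {value}")
```

This still runs one Python loop per region, which is the part that array operations can replace.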

#### Moving from Lists to NumPy Arrays

The calculation performed by the crop_yield function (element-wise multiplication of two vectors, followed by a sum of the results) is called the 'dot product'.

The NumPy library provides a built-in function to compute the dot product of two vectors. However, we must first convert the lists into NumPy arrays.

In [30]:
kanto = np.array([73, 67, 43])
kanto

array([73, 67, 43])

In [31]:
weights = np.array([w1, w2, w3])
weights

array([0.3, 0.2, 0.5])

In [32]:
type(kanto)

numpy.ndarray

In [33]:
type(weights)

numpy.ndarray

#### NumPy Array Operations

We can compute the dot product of two vectors using np.dot

In [34]:
np.dot(kanto, weights)

56.8

We can achieve the same result with lower-level operations supported by NumPy arrays: an element-wise multiplication followed by a sum of the resulting numbers.

In [35]:
(kanto * weights).sum()

56.8

The * operator performs an element-wise multiplication of two arrays if they have the same shape. The sum method calculates the sum of the elements in an array.

In [36]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

arr1 * arr2

array([ 4, 10, 18])

In [37]:
arr2.sum()

15
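The same-shape requirement can be checked directly. The following is a minimal sketch, with array names chosen only for illustration, showing that multiplying two 1-D arrays of different lengths raises a `ValueError`. Shapes that are compatible under NumPy's broadcasting rules, such as an array and a single number, are an exception and are not covered here.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([1, 2])  # different length, cannot be multiplied with a

print(a * b)          # element-wise product
print((a * b).sum())  # sum of the products, i.e. the dot product of a and b

try:
    a * c
    mismatch_error = None
except ValueError as exc:
    # NumPy refuses to combine shapes (3,) and (2,)
    mismatch_error = exc
    print("ValueError:", exc)
```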

#### Benefits of using NumPy Arrays

NumPy arrays offer the following benefits over Python lists for operating on numerical data:

**Ease of Use:** You can write small, concise and intuitive mathematical expressions like '(kanto * weights).sum()' rather than using loops and custom functions like crop_yield

**Performance:** NumPy operations and functions are implemented internally in compiled C code, which makes them faster than equivalent Python statements and loops that are interpreted at runtime

Here's a comparison of dot products performed using Python loops vs NumPy arrays on two vectors with a million elements each

In [38]:
# Python Lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))

# Numpy arrays
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)

In [39]:
%%time

result = 0

for x1, x2 in zip(arr1, arr2):
    result += x1 * x2
result

Wall time: 442 ms


833332333333500000

In [41]:
arr1_np

array([     0,      1,      2, ..., 999997, 999998, 999999])

In [42]:
arr2_np

array([1000000, 1000001, 1000002, ..., 1999997, 1999998, 1999999])

In [43]:
%%time
result = np.dot(arr1_np, arr2_np)
result

Wall time: 2.1 ms


-1942957984
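Note that the NumPy result above does not match the loop result. The value -1942957984 is exactly what 833332333333500000 becomes when it wraps around a signed 32-bit integer, which indicates that the arrays were created with the `int32` dtype (the default integer type on some platforms, for example Windows with NumPy versions before 2.0). A minimal sketch of one way to avoid this, assuming a 64-bit integer is wide enough for the result, is to request `int64` explicitly when creating the arrays:

```python
import numpy as np

n = 1_000_000

# Request 64-bit integers explicitly instead of relying on the platform default
arr1_np = np.arange(n, dtype=np.int64)
arr2_np = np.arange(n, 2 * n, dtype=np.int64)

# Reference value computed with Python integers, which do not overflow
expected = sum(x1 * x2 for x1, x2 in zip(range(n), range(n, 2 * n)))

result = np.dot(arr1_np, arr2_np)
print(result, expected)
```

With `int64` arrays the dot product agrees with the pure Python loop. The timing comparison is unaffected by this choice in any meaningful way, but the speedup is only useful if the result is also correct.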

#### Multi-Dimensional NumPy Arrays

We can now go one step further and represent the climate data for all the regions using a single 2-dimensional NumPy array:

In [45]:
climate_data = np.array([[73, 67, 43],
                        [91, 88, 64],
                        [87, 134, 58],
                        [102, 43, 37],
                        [69, 96, 70]])
climate_data

array([[ 73,  67,  43],
       [ 91,  88,  64],
       [ 87, 134,  58],
       [102,  43,  37],
       [ 69,  96,  70]])

In [46]:
climate_data.shape

(5, 3)

In [47]:
# 3-D Array
arr3 = np.array([
    
    [[11, 12, 13],
    [13, 14, 15]],
    
    [[15, 16, 17],
    [17, 18, 19.5]]
])
arr3

array([[[11. , 12. , 13. ],
        [13. , 14. , 15. ]],

       [[15. , 16. , 17. ],
        [17. , 18. , 19.5]]])

In [48]:
arr3.shape

(2, 2, 3)

In [49]:
# Check data types of array
weights.dtype

dtype('float64')

In [50]:
arr3.dtype

dtype('float64')

In [51]:
climate_data.dtype

dtype('int32')

We can now compute the predicted yields of apples in all the regions with a single matrix multiplication between climate_data and weights. We can use either the np.matmul function or the @ operator to perform matrix multiplication

In [52]:
np.matmul(climate_data, weights)

array([56.8, 76.9, 81.9, 57.7, 74.9])

In [53]:
climate_data @ weights

array([56.8, 76.9, 81.9, 57.7, 74.9])
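Multiplying a matrix of shape (5, 3) by a vector of shape (3,) produces one dot product per row, which is why the result has five entries. The sketch below, using the same data as above and two illustrative variable names (`row_by_row`, `all_at_once`), compares the explicit row-wise computation with the single matrix product:

```python
import numpy as np

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])
weights = np.array([0.3, 0.2, 0.5])

# One dot product per row, computed explicitly
row_by_row = np.array([np.dot(row, weights) for row in climate_data])

# The same computation as a single matrix-vector product
all_at_once = climate_data @ weights

print(row_by_row.shape, all_at_once.shape)
print(np.allclose(row_by_row, all_at_once))
```

`np.allclose` is used instead of `==` because floating-point results from different computation paths can differ in the last few bits.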

#### Working with CSV files

NumPy also provides helper functions to read from and write to files. Let's download a file which contains 10,000 climate measurements

In [54]:
import urllib.request

urllib.request.urlretrieve('https://hub.jovian.ml/wp-content/uploads/2020/08/climate.csv', 'climate.txt')

('climate.txt', <http.client.HTTPMessage at 0x2502dc1cdc8>)

In [57]:
climate_data = np.genfromtxt('climate.txt', delimiter=',', skip_header = 1)

In [58]:
climate_data

array([[25., 76., 99.],
       [39., 65., 70.],
       [59., 45., 77.],
       ...,
       [99., 62., 58.],
       [70., 71., 91.],
       [92., 39., 76.]])

In [59]:
climate_data.shape

(10000, 3)
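To see what `np.genfromtxt` does without downloading anything, here is a small self-contained sketch. It writes a three-row CSV to a temporary directory and reads it back. The column names in the header row and the file name are assumptions made for illustration; the actual header of climate.csv is not shown in this notebook.

```python
import os
import tempfile
import numpy as np

# Hypothetical sample laid out like the downloaded file: a header row, then numbers
csv_text = "temperature,rainfall,humidity\n25,76,99\n39,65,70\n59,45,77\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_climate.csv")
    with open(path, "w") as f:
        f.write(csv_text)

    # skip_header=1 drops the header row; values are parsed as floats by default
    sample = np.genfromtxt(path, delimiter=",", skip_header=1)

print(sample.shape)
print(sample.dtype)
```

Because `genfromtxt` parses values as floats unless told otherwise, the integers in the file come back as `25.`, `76.` and so on, matching the output shown above.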

We can now perform a matrix multiplication using the @ operator to predict the yield of apples for the entire dataset using a given set of weights

In [60]:
weights = np.array([0.3, 0.2, 0.5])

In [61]:
yields = climate_data @ weights
yields

array([72.2, 59.7, 65.2, ..., 71.1, 80.7, 73.4])

In [62]:
yields.shape

(10000,)

Let's add yields to climate_data as a fourth column using the np.concatenate function

In [63]:
climate_results = np.concatenate((climate_data, yields.reshape(10000, 1)), axis = 1)
climate_results

array([[25. , 76. , 99. , 72.2],
       [39. , 65. , 70. , 59.7],
       [59. , 45. , 77. , 65.2],
       ...,
       [99. , 62. , 58. , 71.1],
       [70. , 71. , 91. , 80.7],
       [92. , 39. , 76. , 73.4]])
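`np.concatenate` requires both arrays to have the same number of dimensions, which is why `yields` is reshaped from `(10000,)` to `(10000, 1)` before being joined along `axis=1`. The sketch below repeats the step on the first three rows and uses `reshape(-1, 1)` so that NumPy infers the row count. It then writes the result to a CSV file with `np.savetxt`; the file name, header text and `fmt` value are hypothetical choices for illustration.

```python
import os
import tempfile
import numpy as np

climate_data = np.array([[25., 76., 99.],
                         [39., 65., 70.],
                         [59., 45., 77.]])
weights = np.array([0.3, 0.2, 0.5])
yields = climate_data @ weights  # shape (3,)

# reshape(-1, 1) turns the 1-D vector into a column; -1 lets NumPy infer the row count
climate_results = np.concatenate((climate_data, yields.reshape(-1, 1)), axis=1)
print(climate_results.shape)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "climate_results.csv")  # hypothetical file name
    # comments="" keeps the header line free of the default "# " prefix
    np.savetxt(out_path, climate_results, delimiter=",", fmt="%.2f",
               header="temperature,rainfall,humidity,yield_apples", comments="")
    # Read the file back to confirm the round trip
    reloaded = np.genfromtxt(out_path, delimiter=",", skip_header=1)

print(reloaded)
```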