## Working with numerical data

The "data" in *Data Analysis* typically refers to numerical data, e.g., stock prices, sales figures, sensor measurements, sports scores, database tables, etc. The [Numpy](https://numpy.org) library provides specialized data structures, functions, and other tools for numerical computing in Python. Let's work through an example to see why & how to use Numpy for working with numerical data.


> Suppose we want to use climate data like the temperature, rainfall, and humidity to determine if a region is well suited for growing apples. A simple approach for doing this would be to formulate the relationship between the annual yield of apples (tons per hectare) and the climatic conditions like the average temperature (in degrees Fahrenheit), rainfall (in  millimeters) & average relative humidity (in percentage) as a linear equation.
>
> `yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity`

We're expressing the yield of apples as a weighted sum of the temperature, rainfall, and humidity. This equation is an approximation since the actual relationship may not necessarily be linear, and there may be other factors involved. But a simple linear model like this often works well in practice.

Based on some statistical analysis of historical data, we might come up with reasonable values for the weights `w1`, `w2`, and `w3`. Here's an example set of values:

In [1]:
w1, w2, w3 = 0.3, 0.2, 0.5

Given some climate data for a region, we can now predict the yield of apples. Here's some sample data:

<img src="https://i.imgur.com/TXPBiqv.png" style="width:360px;">

To begin, we can define some variables to record climate data for a region.

In [2]:
kanto_temp = 73
kanto_rainfall = 67
kanto_humidity = 43

In [3]:
kanto_yield_apples = kanto_temp * w1 + kanto_rainfall * w2 + kanto_humidity * w3
kanto_yield_apples

56.8

To make it slightly easier to perform the above computation for multiple regions, we can represent the climate data for each region as a vector, i.e., a list of numbers.

In [4]:
kanto = [73,67,43]
johto = [91,88,64]
hoenn = [87,134,58]
sinnoh = [102,43,37]
unova = [69,96,70]

In [5]:
weights = [w1,w2,w3]

In [6]:
for x,w in zip(kanto,weights):
    print(x)
    print(w)

73
0.3
67
0.2
43
0.5


In [7]:
def crop_yield(region,weights):
    result=0
    for x,w in zip(region,weights):
        result +=x*w
    return result

In [8]:
crop_yield(kanto,weights)

56.8

In [9]:
crop_yield(johto,weights)

76.9

In [10]:
crop_yield(unova,weights)

74.9

## Going from Python lists to Numpy arrays


The calculation performed by the `crop_yield` (element-wise multiplication of two vectors and taking a sum of the results) is also called the *dot product*. Learn more about dot product here: https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/dot-cross-products/v/vector-dot-product-and-vector-length . 

The Numpy library provides a built-in function to compute the dot product of two vectors. However, we must first convert the lists into Numpy arrays.

In [11]:
import numpy as np

Numpy arrays can be created using the `np.array` function.

In [12]:
kanto = np.array([73,67,43])

In [13]:
kanto

array([73, 67, 43])

In [14]:
weights = np.array([w1,w2,w3])

In [15]:
weights

array([0.3, 0.2, 0.5])

In [16]:
type(kanto)

numpy.ndarray

In [17]:
type(weights)

numpy.ndarray

Just like lists, Numpy arrays support the indexing notation `[]`.

In [18]:
weights[0]

0.3

In [19]:
kanto[2]

43

### Operating on numpy arrays

We can now compute the dot product of the two vectors using the `np.dot` function.

In [20]:
np.dot(kanto,weights)

56.8

We can achieve the same result with lower-level operations supported by Numpy arrays.

In [21]:
(kanto * weights).sum()

56.8

The `*` operator performs an element-wise multiplication of two arrays (assuming they have the same size), and the `sum` method calculates the sum of the numbers in an array.

In [22]:
arr1 = np.array([1,2,3])
arr2 = np.array([4,5,6])

In [23]:
arr1 * arr2

array([ 4, 10, 18])

## Benefits of using Numpy arrays

Numpy arrays offer the following benefits over Python lists for operating on numerical data:

- **Ease of use**: You can write small, concise, and intuitive mathematical expressions like `(kanto * weights).sum()` rather than using loops & custom functions like `crop_yield`.
- **Performance**: Numpy operations and functions are implemented internally in optimized, compiled C code, which makes them much faster than using Python statements & loops that are interpreted at runtime.

Here's a comparison of dot products performed using Python loops vs. Numpy arrays on two vectors with 100,000 elements each.

In [24]:
# python lists

arr1=list(range(100000))
arr2=list(range(100000,200000))

# numpy arrays
arr1_np=np.array(arr1)
arr2_np=np.array(arr2)

`%%time` is a Jupyter cell magic that reports how long the cell takes to run.

In [25]:
%%time 
result=0
for x1,x2 in zip(arr1,arr2):
    result+=x1*x2
result

CPU times: total: 0 ns
Wall time: 16.7 ms


833323333350000

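In [26]:
# %%time
# Cast to 64-bit integers first: with the int32 dtype that is the default on
# this platform, the sum overflows and wraps around to 893678192.
np.dot(arr1_np.astype(np.int64), arr2_np.astype(np.int64))

833323333350000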
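
Since `%%time` only works inside Jupyter/IPython, here is a minimal sketch of the same comparison for a plain Python script, using `time.perf_counter` from the standard library. The variable names are illustrative, and the arrays are created with an explicit `np.int64` dtype so that the sum cannot overflow a 32-bit integer. Actual timings depend on your machine, so none are shown.

```python
import time
import numpy as np

n = 100_000  # same vector length as in the cells above

# Plain Python lists
list_a = list(range(n))
list_b = list(range(n, 2 * n))

# Numpy arrays with an explicit 64-bit integer dtype to avoid overflow
array_a = np.array(list_a, dtype=np.int64)
array_b = np.array(list_b, dtype=np.int64)

# Time the pure-Python loop
start = time.perf_counter()
loop_result = 0
for x1, x2 in zip(list_a, list_b):
    loop_result += x1 * x2
loop_seconds = time.perf_counter() - start

# Time the Numpy dot product
start = time.perf_counter()
numpy_result = int(np.dot(array_a, array_b))
numpy_seconds = time.perf_counter() - start

print(f"loop:  {loop_result} in {loop_seconds:.6f} s")
print(f"numpy: {numpy_result} in {numpy_seconds:.6f} s")
```

Both approaches should print the same value; only the elapsed time differs.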

## Multi-Dimensional Numpy Array

In [27]:
climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])

In [28]:
climate_data

array([[ 73,  67,  43],
       [ 91,  88,  64],
       [ 87, 134,  58],
       [102,  43,  37],
       [ 69,  96,  70]])

You may recognize the above 2-d array as a matrix with five rows and three columns. Each row represents one region, and the columns represent temperature, rainfall, and humidity, respectively.

Numpy arrays can have any number of dimensions and different lengths along each dimension. We can inspect the length along each dimension using the `.shape` property of an array.

<img src="https://fgnt.github.io/python_crashkurs_doc/_images/numpy_array_t.png" width="420">


In [29]:
# 2D array (matrix)
climate_data.shape

(5, 3)

In [30]:
weights

array([0.3, 0.2, 0.5])

In [31]:
# 1D array (vector)
weights.shape

(3,)

In [32]:
# 3D array

arr3 = np.array([
    [[11,12,13],
    [14,15,16]],
    [[17,18,19],
    [20,21,22]]])

In [33]:
arr3.shape

(2, 2, 3)
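
Alongside `.shape`, arrays also expose `.ndim` (the number of dimensions) and `.size` (the total number of elements). The following sketch, using made-up sample arrays, shows how the three properties relate:

```python
import numpy as np

# Illustrative arrays: a 2 x 3 matrix and a 2 x 2 x 3 array of zeros
matrix = np.array([[73, 67, 43],
                   [91, 88, 64]])
cube = np.zeros((2, 2, 3))

for name, arr in [("matrix", matrix), ("cube", cube)]:
    # ndim equals len(shape); size is the product of the lengths in shape
    print(name, "ndim:", arr.ndim, "shape:", arr.shape, "size:", arr.size)
```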

In [34]:
weights.dtype # here data type of weights is float

dtype('float64')

`float64` and `int32` indicate that each element occupies 64 and 32 bits of memory, respectively. The default integer type depends on the platform, so you may see `int64` instead of `int32`.

In [35]:
climate_data.dtype # data type of climate_data is integer

dtype('int32')

In [36]:
arr3.dtype

dtype('int32')

NOTE: If an array contains even a single floating-point number, then all the other elements are also converted to floats.
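
To illustrate the note above, here is a small sketch (with made-up values) of how a single float changes the data type of the whole array, and how the `dtype` argument of `np.array` can be used to choose a type explicitly:

```python
import numpy as np

ints = np.array([1, 2, 3])                       # all integers
mixed = np.array([1, 2, 3.5])                    # one float upcasts every element
forced = np.array([1, 2, 3], dtype=np.float64)   # explicitly request floats
wide = np.array([1, 2, 3], dtype=np.int64)       # explicitly request 64-bit integers

print(ints.dtype)                 # platform dependent: int32 or int64
print(mixed.dtype)                # float64
print(forced.dtype)               # float64
print(wide.dtype, wide.itemsize)  # int64 8 (bytes per element)
```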

We can use the `np.matmul` function or the `@` operator to perform matrix multiplication.

In [37]:
np.matmul(climate_data,weights)

array([56.8, 76.9, 81.9, 57.7, 74.9])

In [38]:
climate_data @ weights

array([56.8, 76.9, 81.9, 57.7, 74.9])
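
One way to read this result: multiplying a matrix by a vector computes the dot product of each row with that vector. A short sketch using the first three regions from the table (the variable names here are illustrative):

```python
import numpy as np

climate = np.array([[73, 67, 43],
                    [91, 88, 64],
                    [87, 134, 58]])
w = np.array([0.3, 0.2, 0.5])

# Matrix-vector product in a single expression
product = climate @ w

# The same numbers, computed one row at a time with np.dot
row_by_row = np.array([np.dot(row, w) for row in climate])

print(product)
print(np.allclose(product, row_by_row))  # True
```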

## Working with CSV data files

Numpy also provides helper functions for reading from & writing to files. Let's download a file `climate.txt`, which contains 10,000 climate measurements (temperature, rainfall & humidity) in the following format:


```
temperature,rainfall,humidity
25.00,76.00,99.00
39.00,65.00,70.00
59.00,45.00,77.00
84.00,63.00,38.00
66.00,50.00,52.00
41.00,94.00,77.00
91.00,57.00,96.00
49.00,96.00,99.00
67.00,20.00,28.00
...
```

This format of storing data is known as *comma-separated values* or CSV. 

> **CSVs**: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)


To read this file into a numpy array, we can use the `genfromtxt` function.

In [39]:
import urllib.request

urllib.request.urlretrieve(
    'https://gist.github.com/BirajCoder/a4ffcb76fd6fb221d76ac2ee2b8584e9/raw/4054f90adfd361b7aa4255e99c2e874664094cea/climate.csv', 
    'climate.txt')

('climate.txt', <http.client.HTTPMessage at 0x20ce9114670>)

`urllib.request.urlretrieve` is a function from the `urllib` module in Python's standard library for downloading files from the internet. It retrieves the file at a given URL and saves it to the local file system, which is helpful when you need to automate downloads.

In [40]:
climate_data = np.genfromtxt('climate.txt', delimiter=',', skip_header=1) 

`np.genfromtxt` reads data from a text file and converts it into a Numpy array.

The delimiter is the separator between values in the file; for CSV (comma-separated values) files, the delimiter is `,`. The `skip_header=1` argument skips the first line, which holds the column names.
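
If you want to try `genfromtxt` without downloading anything, it also accepts file-like objects. The sketch below parses a few rows copied from the sample above out of an in-memory string; `csv_text` and `sample` are illustrative names.

```python
import io
import numpy as np

# A few rows in the same format as climate.txt
csv_text = """temperature,rainfall,humidity
25.00,76.00,99.00
39.00,65.00,70.00
59.00,45.00,77.00
"""

# io.StringIO wraps the string so it can be read like a file
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(sample.shape)  # (3, 3)
print(sample.dtype)  # float64
```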

In [41]:
climate_data.shape

(10000, 3)

We can now perform a matrix multiplication using the `@` operator to predict the yield of apples for the entire dataset using a given set of weights.

In [42]:
weights=np.array([0.3,0.2,0.5])

In [43]:
yields = climate_data @ weights

In [44]:
yields

array([72.2, 59.7, 65.2, ..., 71.1, 80.7, 73.4])

In [45]:
yields.shape

(10000,)

Let's add the `yields` to `climate_data` as a fourth column using the [`np.concatenate`](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html) function.

In [46]:
climate_result=np.concatenate((climate_data,yields.reshape(10000,1)),axis=1)

`yields.reshape(10000, 1)` changes the shape of the `yields` array: the argument `(10000, 1)` turns the 1D array into a 2D array with 10,000 rows and 1 column, so that it can be joined to `climate_data` along `axis=1` (the column axis).
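
When the number of rows isn't known in advance, `reshape` accepts `-1` for one dimension and infers its length from the array's size. A small sketch on two rows of the dataset (names are illustrative):

```python
import numpy as np

data = np.array([[25.0, 76.0, 99.0],
                 [39.0, 65.0, 70.0]])
w = np.array([0.3, 0.2, 0.5])
y = data @ w                      # 1D array of shape (2,)

column = y.reshape(-1, 1)         # -1 lets Numpy infer the row count
combined = np.concatenate((data, column), axis=1)

print(column.shape, combined.shape)  # (2, 1) (2, 4)
```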

In [47]:
climate_result.shape

(10000, 4)

In [48]:
climate_result

array([[25. , 76. , 99. , 72.2],
       [39. , 65. , 70. , 59.7],
       [59. , 45. , 77. , 65.2],
       ...,
       [99. , 62. , 58. , 71.1],
       [70. , 71. , 91. , 80.7],
       [92. , 39. , 76. , 73.4]])

In [49]:
np.savetxt('climate_result.txt',climate_result)

`np.savetxt` writes the contents of a Numpy array to a text file. It's useful when you want to store numerical data in a human-readable text format.
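
By default, `np.savetxt` separates values with spaces and writes them in scientific notation. To produce a CSV file similar to the input, you can pass the `fmt`, `delimiter`, `header`, and `comments` arguments. The sketch below uses a temporary directory and a hypothetical file name so that it doesn't overwrite anything:

```python
import os
import tempfile
import numpy as np

results = np.array([[25.0, 76.0, 99.0, 72.2],
                    [39.0, 65.0, 70.0, 59.7]])

# Hypothetical output path inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "climate_result_sample.txt")

np.savetxt(path, results,
           fmt="%.2f",        # two decimal places per value
           delimiter=",",     # comma-separated, like the input file
           header="temperature,rainfall,humidity,yield_apples",
           comments="")       # don't prefix the header line with '# '

with open(path) as f:
    print(f.read())

# Read the file back to check the round trip
loaded = np.genfromtxt(path, delimiter=",", skip_header=1)
```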

Numpy provides hundreds of functions for performing operations on arrays. Here are some commonly used functions:


* Mathematics: `np.sum`, `np.exp`, `np.round`, arithmetic operators
* Array manipulation: `np.reshape`, `np.stack`, `np.concatenate`, `np.split`
* Linear Algebra: `np.matmul`, `np.dot`, `np.transpose`, `np.linalg.eigvals`
* Statistics: `np.mean`, `np.median`, `np.std`, `np.max`
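
Here is a brief sketch that exercises a few of these functions on a small made-up matrix. Note that the eigenvalue routine lives in the `np.linalg` submodule.

```python
import numpy as np

values = np.array([[2.0, 0.0],
                   [0.0, 3.0]])

total = np.sum(values)                   # sum of all elements
average = np.mean(values)                # mean of all elements
rounded = np.round(np.exp(values), 2)    # e**x, rounded to 2 decimals
transposed = np.transpose(values)        # rows become columns
eigenvalues = np.linalg.eigvals(values)  # eigenvalues of a square matrix
stacked = np.stack([values, values])     # new leading axis: shape (2, 2, 2)

print(total, average)
print(rounded)
print(eigenvalues)
print(stacked.shape)
```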

> **How to find the function you need?** The easiest way to find the right function for a specific operation or use-case is to do a web search. For instance, searching for "How to join numpy arrays" leads to [this tutorial on array concatenation](https://cmdlinetips.com/2018/04/how-to-concatenate-arrays-in-numpy/). 

You can find a full list of array functions here: https://numpy.org/doc/stable/reference/routines.html

## Arithmetic operations, broadcasting and comparison

Numpy arrays support arithmetic operators like `+`, `-`, `*`, etc. You can perform an arithmetic operation with a single number (also called scalar) or with another array of the same shape. Operators make it easy to write mathematical expressions with multi-dimensional arrays.

In [50]:
arr2=np.array([[1,2,3],[4,5,6],[7,8,9]])

In [51]:
arr3=np.array([[10,20,30],[40,50,60],[70,80,90]])

In [52]:
# Adding a scalar

arr2+2

array([[ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [53]:
# Element wise subtraction
arr3-arr2

array([[ 9, 18, 27],
       [36, 45, 54],
       [63, 72, 81]])

In [54]:
# Division by scalar
arr3/2

array([[ 5., 10., 15.],
       [20., 25., 30.],
       [35., 40., 45.]])

In [55]:
# Element wise multiplication
arr2*arr3

array([[ 10,  40,  90],
       [160, 250, 360],
       [490, 640, 810]])

In [56]:
# Modulus with scalar
arr2 % 4

array([[1, 2, 3],
       [0, 1, 2],
       [3, 0, 1]], dtype=int32)

### Array Broadcasting

Numpy arrays also support *broadcasting*, allowing arithmetic operations between two arrays with different numbers of dimensions but compatible shapes. Let's look at an example to see how it works.

In [57]:
arr2 = np.array([[1, 2, 3, 4], 
                 [5, 6, 7, 8], 
                 [9, 1, 2, 3]])

In [58]:
arr2.shape

(3, 4)

In [59]:
arr4=np.array([4,5,6,7])

In [60]:
arr4.shape

(4,)

In [61]:
arr2+arr4

array([[ 5,  7,  9, 11],
       [ 9, 11, 13, 15],
       [13,  6,  8, 10]])

When the expression `arr2 + arr4` is evaluated, `arr4` (which has the shape `(4,)`) is replicated three times to match the shape `(3, 4)` of `arr2`. Numpy performs the replication without actually creating three copies of the smaller array, thus improving performance and using less memory.

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/02.05-broadcasting.png" width="360">

Broadcasting only works if one of the arrays can be replicated to match the other array's shape.

In [64]:
arr5 = np.array([10,20])

In [65]:
arr2+arr5

ValueError: operands could not be broadcast together with shapes (3,4) (2,) 

In the above example, even if `arr5` is replicated three times, it will not match the shape of `arr2`. Hence `arr2 + arr5` cannot be evaluated successfully. Learn more about broadcasting here: https://numpy.org/doc/stable/user/basics.broadcasting.html .
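
A related case comes up when you have one value per row rather than one per column. An array of shape `(3,)` can't be broadcast against `(3, 4)` directly, but reshaping it into a column of shape `(3, 1)` makes the shapes compatible. A sketch with illustrative names:

```python
import numpy as np

grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])       # shape (3, 4)
per_row = np.array([10, 20, 30])      # shape (3,), one value per row

try:
    grid + per_row                    # trailing dimensions 4 and 3 don't match
except ValueError as err:
    print("failed:", err)

as_column = per_row.reshape(3, 1)     # shape (3, 1) is compatible with (3, 4)
shifted = grid + as_column            # each row gets its own offset
print(shifted)
```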

### Array Comparison

Numpy arrays also support comparison operations like `==`, `!=`, `>` etc. The result is an array of booleans.

In [66]:
arr1=np.array([[1,2,3],[3,4,5]])
arr2=np.array([[6,7,8],[8,9,10]])

In [67]:
arr1==arr2

array([[False, False, False],
       [False, False, False]])

In [68]:
arr1 != arr2   # here '!' means not 

array([[ True,  True,  True],
       [ True,  True,  True]])

In [69]:
arr1>=arr2

array([[False, False, False],
       [False, False, False]])

In [70]:
arr1<=arr2

array([[ True,  True,  True],
       [ True,  True,  True]])

Array comparison is frequently used to count the number of equal elements in two arrays using the `sum` method. Remember that `True` evaluates to `1` and `False` evaluates to `0` when booleans are used in arithmetic operations.

In [71]:
(arr1 == arr2).sum()

0
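
Since the arrays above have no elements in common, here is a sketch with made-up data where some elements do match. Taking the `mean` of the boolean array gives the fraction of matching elements.

```python
import numpy as np

# Hypothetical data: two arrays of labels to compare
predicted = np.array([1, 0, 1, 1, 0])
actual = np.array([1, 1, 1, 0, 0])

matches = predicted == actual      # array of booleans
num_equal = matches.sum()          # True counts as 1, False as 0
fraction_equal = matches.mean()    # share of positions that match

print(matches)
print(num_equal, fraction_equal)
```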

## Array indexing and slicing

Numpy extends Python's list indexing notation using `[]` to multiple dimensions in an intuitive fashion. You can provide a comma-separated list of indices or ranges to select a specific element or a subarray (also called a slice) from a Numpy array.

In [72]:
arr3 = np.array([
    [[11, 12, 13, 14], 
     [13, 14, 15, 19]], 
    
    [[15, 16, 17, 21], 
     [63, 92, 36, 18]], 
    
    [[98, 32, 81, 23],      
     [17, 18, 19.5, 43]]])

In [73]:
arr3.shape

(3, 2, 4)

In [74]:
# single element
arr3[1,1,2]

36.0

In [75]:
# the next cells perform the same lookup one index at a time

In [76]:
arr3[1]

array([[15., 16., 17., 21.],
       [63., 92., 36., 18.]])

In [77]:
arr3[1,1]

array([63., 92., 36., 18.])

In [78]:
# finally we get
arr3[1,1,2]

36.0

In [79]:
# subarray using ranges
arr3[1:,0:1,:2]

array([[[15., 16.]],

       [[98., 32.]]])

In [80]:
# mixing indices and ranges
arr3[1:,1,3]

array([18., 43.])
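
A useful rule of thumb when mixing indices and ranges: each single index removes a dimension from the result, while each range keeps one. The sketch below checks this on an illustrative array with the same shape `(3, 2, 4)`:

```python
import numpy as np

arr = np.arange(24).reshape(3, 2, 4)   # values 0..23 arranged as 3 x 2 x 4

print(arr[1, 1, 2].shape)       # ()        three indices: a single element
print(arr[1:, 0:1, :2].shape)   # (2, 1, 2) three ranges: still three dimensions
print(arr[1:, 1, 3].shape)      # (2,)      one range, two indices: one dimension
print(arr[1].shape)             # (2, 4)    one index: the other two dimensions remain
```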

In [81]:
# using fewer indices
arr3[1]

array([[15., 16., 17., 21.],
       [63., 92., 36., 18.]])

In [82]:
arr3[:2,1]

array([[13., 14., 15., 19.],
       [63., 92., 36., 18.]])

## Other ways of creating Numpy arrays

In [83]:
# all zeros
np.zeros((3,2))

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

In [84]:
# all ones
np.ones((2,2,3))

array([[[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]]])

In [85]:
# identity matrix (ones on the main diagonal, zeros elsewhere)
np.eye(2,3)

array([[1., 0., 0.],
       [0., 1., 0.]])

In [86]:
# random vector
np.random.rand(5)

array([0.27249072, 0.39627077, 0.12294945, 0.46967428, 0.95508951])

In [87]:
# random matrix
np.random.randn(5,2)

array([[ 0.22252061, -0.76114027],
       [ 0.79051089,  1.12814338],
       [-0.16392213, -1.0858876 ],
       [-2.52574359, -0.64209579],
       [ 0.99856969, -0.34600181]])

`np.random.rand` produces random numbers from a uniform distribution over the interval [0, 1).

`np.random.randn` produces random numbers from a standard normal distribution with a mean of 0 and a standard deviation of 1.

If you need uniformly distributed random numbers, use `np.random.rand`. If you need normally distributed random numbers, use `np.random.randn`.
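
The difference between the two distributions shows up in simple summary statistics. The following sketch draws 10,000 samples from each; the seed is arbitrary and only there to make the run repeatable.

```python
import numpy as np

np.random.seed(0)                  # arbitrary seed, for reproducibility only

uniform = np.random.rand(10000)    # uniform over [0, 1)
normal = np.random.randn(10000)    # standard normal

# Uniform samples stay within [0, 1) and average close to 0.5
print(uniform.min(), uniform.max(), uniform.mean())

# Normal samples can be negative; mean is near 0 and std near 1
print(normal.min(), normal.mean(), normal.std())
```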

In [88]:
# fixed value
np.full([2,3],42)

array([[42, 42, 42],
       [42, 42, 42]])

In [89]:
# range with start, end and step
np.arange(10,90,5)

array([10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85])

In [90]:
# equally spaced numbers in a range
np.linspace(3,27,9)

array([ 3.,  6.,  9., 12., 15., 18., 21., 24., 27.])
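
The two functions differ in how the range is specified: `np.arange` takes a step size and excludes the end value, while `np.linspace` takes the number of points and, by default, includes both endpoints. A short sketch using the same arguments as above:

```python
import numpy as np

by_step = np.arange(10, 90, 5)      # step of 5; the end value 90 is excluded
by_count = np.linspace(3, 27, 9)    # 9 points; both 3 and 27 are included

print(len(by_step), by_step[-1])    # 16 85
print(len(by_count), by_count[-1])  # 9 27.0
print(np.diff(by_count))            # constant spacing of 3.0
```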