## Numpy

NumPy is a numerical library that makes it easy to work with large arrays and matrices.

It provides fast arithmetic operations on matrices. Pandas and NumPy are usually used together, as Pandas builds on NumPy functionality to work with DataFrames.

Since Pandas is designed to work with NumPy, almost any NumPy function will work with Pandas Series and DataFrames. Let's run some examples.

In [None]:
import numpy as np
import pandas as pd

In [None]:
def fetch_data():
  import os, shutil
  cwd = os.getcwd()
  if os.path.exists("LSAMP_Python_Course2024"):
    shutil.rmtree("LSAMP_Python_Course2024")
  !git clone https://github.com/aliawofford9317/LSAMP_Python_Course2024.git
  for file in os.listdir("LSAMP_Python_Course2024"):
    if file.endswith((".txt",".csv")):
      shutil.copy("LSAMP_Python_Course2024/{}".format(file),cwd)
fetch_data()

Cloning into 'LSAMP_Python_Course2024'...
remote: Enumerating objects: 210, done.
remote: Counting objects: 100% (140/140), done.
remote: Compressing objects: 100% (136/136), done.
remote: Total 210 (delta 72), reused 1 (delta 1), pack-reused 70
Receiving objects: 100% (210/210), 2.48 MiB | 3.06 MiB/s, done.
Resolving deltas: 100% (108/108), done.


In [None]:
# Create a container for a pseudo random generator
rng = np.random.RandomState()
# Create a Pandas Series from our random generator
series = pd.Series(rng.randint(0, 10, 4))
series

0    7
1    5
2    0
3    4
dtype: int64

In [None]:
# Create a Pandas DataFrame with our NumPy random generator
# Random integers from 0 to 9 (the upper bound 10 is excluded), 3 rows and 4 columns
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                 columns=['A', 'B', 'C', 'D'])
df

   A  B  C  D
0  6  6  3  7
1  8  3  2  1
2  3  1  7  6


We can apply a NumPy function to these Pandas objects, and the result keeps the original index:

In [None]:
# Calculate e^x, where x is every element of our array
np.exp(series)

0    1096.633158
1     148.413159
2       1.000000
3      54.598150
dtype: float64

In [None]:
# Calculate sin() of every value in the DataFrame multiplied by pi and divided by 4
np.sin(df * np.pi / 4)

              A         B         C         D
0 -1.000000e+00 -1.000000  0.707107 -0.707107
1 -2.449294e-16  0.707107  1.000000  0.707107
2  7.071068e-01  0.707107 -0.707107 -1.000000
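
To make the index preservation easier to see, here is a minimal sketch that uses a Series with string labels instead of the default integer index. The name `temps` and its labels are made up for illustration and are not part of the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical labelled Series; the labels are illustrative only
temps = pd.Series([0.0, 1.0, 2.0], index=["a", "b", "c"])

# Applying a NumPy ufunc returns a new Series that carries the same labels
result = np.exp(temps)

print(type(result).__name__)
print(list(result.index))
```

The result is still a Pandas Series, so label-based lookups such as `result["a"]` keep working after the NumPy call.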


### The Numpy Array
NumPy provides efficient multidimensional arrays designed for scientific computing.

An array is similar to a list in Python and can be created from a list.
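
As a minimal sketch of that relationship, the snippet below builds an array from a plain list and shows one place where the two behave differently. The variable names are illustrative.

```python
import numpy as np

values = [1, 2, 3]        # a plain Python list
arr = np.array(values)    # a NumPy array built from that list

# Multiplying a list repeats it; multiplying an array scales every element
doubled_list = values * 2
doubled_arr = arr * 2

print(doubled_list)
print(doubled_arr)
```

This element-wise behaviour is what makes arrays convenient for numerical work.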

Arrays have useful attributes. Let's start by defining three random arrays: a one-dimensional, a two-dimensional and a three-dimensional array. We will use NumPy's random number generator.

In [None]:
# Import our Numpy package
import numpy as np
np.random.seed(0) #this will generate the same random arrays every time

x1 = np.random.randint(10, size=6) # one dimension
x2 = np.random.randint(10, size=(3, 4)) # two dimensions
x3 = np.random.randint(10, size=(3, 4, 5)) # tri dimensional array

All arrays have the attributes `ndim` (the number of dimensions), `shape` (the size of each dimension), and `size` (the total number of elements):

In [None]:
print("x3 ndim:", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size:", x3.size)

x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60


Other array attributes are `itemsize`, the size in bytes of each element, and `nbytes`, the total size in bytes of the array:

In [None]:
print("x3 itemsize:", x3.itemsize, "bytes")
print("x3 nbytes", x3.nbytes, "bytes")

x3 itemsize: 8 bytes
x3 nbytes 480 bytes
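
The two values are related: `nbytes` equals `size` multiplied by `itemsize`. The sketch below checks this on an array with the same shape as `x3`. It sets `dtype=np.int64` explicitly, which is an assumption made here so that the element size does not depend on the platform default.

```python
import numpy as np

# Fix the dtype explicitly so each element takes 8 bytes on every platform
a = np.zeros((3, 4, 5), dtype=np.int64)

print("itemsize:", a.itemsize)
print("nbytes:", a.nbytes)
print("nbytes == size * itemsize:", a.nbytes == a.size * a.itemsize)
```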


#### Optional Exercise
- Create a 3x3x3 array that contains all random numbers.
- Using the previous array, square all the numbers in the array.

### Creating arrays with Numpy methods
Especially for larger arrays, it is more efficient to use NumPy's built-in creation methods.

In [None]:
# Create a length 10 integer array filled with zeros
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [None]:
# Create a 3x5 float array filled with ones
np.ones((3, 5), dtype=float)

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [None]:
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)

array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

In [None]:
# Create an array filled with a linear sequence
# Start at 0, stop before 20, step size 2
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [None]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [None]:
# Create a 3x3 array of uniformly distributed random values between 0 and 1
np.random.random((3, 3))

array([[0.65279032, 0.63505887, 0.99529957],
       [0.58185033, 0.41436859, 0.4746975 ],
       [0.6235101 , 0.33800761, 0.67475232]])

In [None]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

array([[ 1.0657892 , -0.69993739,  0.14407911],
       [ 0.3985421 ,  0.02686925,  1.05583713],
       [-0.07318342, -0.66572066, -0.04411241]])

In [None]:
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))

array([[7, 2, 9],
       [2, 3, 3],
       [2, 3, 4]])

In [None]:
# Create a 3x3 identity matrix
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [None]:
# Create an uninitialized array of three floats (the default dtype)
# The values will be whatever happens to already exist at that memory location
np.empty(3)

array([1., 1., 1.])

#### Array indexing
As with Python lists, we can access individual elements of an array. For one-dimensional arrays we use the square-bracket syntax `[]`:

In [None]:
x1

array([5, 0, 3, 3, 7, 9])

In [None]:
x1[0]

5

In [None]:
x1[4]

7

In [None]:
x1[-6]

5

We can use a similar logic for multi dimensional arrays

In [None]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [None]:
x2[0, 3] # Access row 0, column index 3

4

In [None]:
x2[2, -1] # Access row index 2, column index -1

7

We can use the same logic to change values using array indexing

In [None]:
x2[2, -1] = 2

#### Optional Exercise
- Create a new array using `arange`. This array should contain a total of 4 values.
- Select only values in index 2 and index 4.
- Change the value of index 3 to 99

In [None]:
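import numpy as np
array = np.arange(4)

# A 4-element array only has indexes 0 to 3, so index 4 is out of bounds;
# this solution selects indexes 2 and 3 instead
selected_values = array[[2, 3]]

array[3] = 99

print("Array:", array)
print("Selected values at indexes 2 and 3:", selected_values)


Array: [ 0  1  2 99]
Selected values at indexes 2 and 3: [2 3]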


#### Sub arrays (slicing)
We can also use a syntax similar to Python list slicing to access only part of an array. The syntax is as follows:

`x[start:stop:step]`
Here `start` defaults to 0, `stop` is the exclusive end index and defaults to the length of the array, and `step` is the stride between selected elements and defaults to 1.

In [None]:
x = np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
x[:5] # first five elements

array([0, 1, 2, 3, 4])

In [None]:
x[5:] # elements from index 5 onward

array([5, 6, 7, 8, 9])

In [None]:
x[4:7] # elements between 4 and non inclusive 7

array([4, 5, 6])

In [None]:
x[::2] # every second element, starting from index 0

array([0, 2, 4, 6, 8])

In [None]:
x[1::2] # elements every two steps, starting from index 1

array([1, 3, 5, 7, 9])

We can also use a negative step. In this case the default start and stop values are swapped, which makes it a convenient way to reverse an array.

In [None]:
x[::-1] # reverse the array

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [None]:
x[5::-2] # every second element in reverse order, starting from index 5

array([5, 3, 1])

We can also select multidimensional slices of an array. The syntax is similar, with the slice for each dimension separated by a comma:

In [None]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 2]])

In [None]:
x2[:2, :3] # rows with index 0 and 1, and columns with index 0, 1 and 2

array([[3, 5, 2],
       [7, 6, 8]])

In [None]:
x2[:3, ::2] # all rows, every second column

array([[3, 2],
       [7, 8],
       [1, 7]])

#### Reshaping arrays
Another useful operation is reshaping, which we can do with the `reshape()` method. To arrange the numbers 1 to 9 in a 3 x 3 array, we can use the following syntax:

In [None]:
grid = np.arange(1, 10).reshape((3, 3))
grid

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

For this to work the size of the initial array must match the reshaped array.
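
The size rule can be checked with a short sketch. It also uses `-1` as a dimension, which asks NumPy to infer that dimension from the total size; this option is an addition that the text above does not cover.

```python
import numpy as np

values = np.arange(12)                 # 12 elements

ok = values.reshape((3, 4))            # 3 * 4 == 12, so this works
inferred = values.reshape((2, -1))     # -1 lets NumPy work out the missing dimension

try:
    values.reshape((5, 3))             # 5 * 3 == 15, which does not match 12
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(ok.shape, inferred.shape, mismatch_raised)
```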

Another common form of reshaping is converting a one-dimensional array into a two-dimensional row or column matrix. We can use either `reshape` or the `newaxis` keyword inside a slicing operation:

In [None]:
x = np.array([1, 2, 3])

# row vector via reshape
x.reshape(1, 3)

array([[1, 2, 3]])

In [None]:
# row vector via newaxis
x[np.newaxis, :]

array([[1, 2, 3]])

In [None]:
# column vector via reshape
x.reshape((3, 1))

array([[1],
       [2],
       [3]])

In [None]:
# column vector via newaxis
x[:, np.newaxis]

array([[1],
       [2],
       [3]])

#### Optional exercise
- Create a new 4x4 array.
- Select and print only the first two rows.
- Select and print only the last two columns.
- Select only the last two rows and first two columns.
- Reshape the array into an 8x2 array.

In [None]:
import numpy as np

array = np.arange(1, 17).reshape(4, 4)

first_two_rows = array[:2, :]
print("First two rows:\n", first_two_rows)

last_two_columns = array[:, -2:]
print("Last two columns:\n", last_two_columns)

last_two_rows_first_two_columns = array[-2:, :2]
print("Last two rows and first two columns:\n", last_two_rows_first_two_columns)

reshaped_array = array.reshape(8, 2)
print("Reshaped array (8x2):\n", reshaped_array)


First two rows:
 [[1 2 3 4]
 [5 6 7 8]]
Last two columns:
 [[ 3  4]
 [ 7  8]
 [11 12]
 [15 16]]
Last two rows and first two columns:
 [[ 9 10]
 [13 14]]
Reshaped array (8x2):
 [[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]
 [11 12]
 [13 14]
 [15 16]]


#### Array concatenation and splitting
We can also combine multiple arrays into one, and split one array into several.

Concatenation can be done with `np.concatenate`, `np.vstack` and `np.hstack`. `np.concatenate` takes a list (or tuple) of arrays as its first argument:

In [None]:
x = np.array([1, 2, 3]) # create an array from a list
y = np.array([3, 2, 1]) # create a second array from a list
np.concatenate([x, y]) # array list to concatenate

In [None]:
# we can concatenate more than one array at a time
z = [99, 88, 77]
print(np.concatenate([x, y, z]))

We can use the same logic for multidimensional arrays

In [None]:
grid = np.array([[1, 2, 3],
                [4, 5, 6]])
np.concatenate([grid, grid]) # concatenate on the first axis. Rows

array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])

In [None]:
grid

array([[1, 2, 3],
       [4, 5, 6]])

In [None]:
# Concatenate on the second axis (zero indexed)
np.concatenate([grid, grid], axis=1) # concatenate on columns

array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])

When working with arrays of mixed dimensions it may be easier to use `np.vstack` (vertical stack) and `np.hstack` (horizontal stack):

In [None]:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                [6, 5, 4]])
# stack vertically
np.vstack([x, grid])

array([[1, 2, 3],
       [9, 8, 7],
       [6, 5, 4]])

In [None]:
# stack horizontally
y = np.array([[99],
              [99]])
np.hstack([grid, y])

array([[ 9,  8,  7, 99],
       [ 6,  5,  4, 99]])
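
The following sketch compares the stacking helpers with `np.concatenate`. For two-dimensional inputs they give the same result as concatenating along axis 0 or axis 1. With mixed dimensions, `np.concatenate` raises an error while `np.vstack` first promotes the one-dimensional array to a row.

```python
import numpy as np

grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# For 2-D arrays, vstack/hstack match concatenate on axis 0/1
same_rows = np.array_equal(np.vstack([grid, grid]),
                           np.concatenate([grid, grid], axis=0))
same_cols = np.array_equal(np.hstack([grid, grid]),
                           np.concatenate([grid, grid], axis=1))

# Mixing a 1-D and a 2-D array: concatenate refuses, vstack accepts
x = np.array([1, 2, 3])
try:
    np.concatenate([x, grid])
    concat_failed = False
except ValueError:
    concat_failed = True

stacked = np.vstack([x, grid])
print(same_rows, same_cols, concat_failed, stacked.shape)
```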

#### Split arrays
Finally, we can split arrays using `np.split`, `np.hsplit` and `np.vsplit`. To each of these we pass a list of indexes that mark the split points:

In [None]:
x = [1, 2, 3, 99, 88, 77, 4, 5, 6]
x1, x2, x3 = np.split(x, [3, 5]) # split at index 3 and 5, non inclusive
print(x1, x2, x3)

[1 2 3] [99 88] [77  4  5  6]


Observe that N division/split points lead to N + 1 sub arrays. Similarly `np.hsplit` and `np.vsplit` can be used

In [None]:
grid = np.arange(16).reshape((4, 4))

In [None]:
grid

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [None]:
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)

[[0 1 2 3]
 [4 5 6 7]]
[[ 8  9 10 11]
 [12 13 14 15]]


In [None]:
left, right = np.hsplit(grid, [2])
print(left)
print(right)

[[ 0  1]
 [ 4  5]
 [ 8  9]
 [12 13]]
[[ 2  3]
 [ 6  7]
 [10 11]
 [14 15]]
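
As a small check of the N + 1 rule, the sketch below splits a ten-element array at three points and counts the pieces. It also shows that, for a two-dimensional array, `np.hsplit` gives the same pieces as `np.split` with `axis=1`. The names `data` and `pieces` are illustrative.

```python
import numpy as np

data = np.arange(10)
split_points = [2, 5, 7]                 # N = 3 split points
pieces = np.split(data, split_points)    # N + 1 = 4 sub-arrays

print(len(pieces))
for piece in pieces:
    print(piece)

# For a 2-D array, hsplit corresponds to split along axis 1
grid = np.arange(16).reshape((4, 4))
left, right = np.split(grid, [2], axis=1)
h_left, h_right = np.hsplit(grid, [2])
print(np.array_equal(left, h_left), np.array_equal(right, h_right))
```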


### Universal Functions
Next we will look at why NumPy is important for data science and working with arrays.

The key to fast computation with NumPy is using vectorized operations, which are implemented through NumPy universal functions (ufuncs).

Here is a speed comparison between a traditional for loop and a vectorized NumPy operation on an array that contains a million values.

In [None]:
# Traditional implementation
import numpy as np
np.random.seed(0)

def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

values = np.random.randint(1, 10, size=5)
compute_reciprocals(values)

array([0.16666667, 1.        , 0.25      , 0.25      , 0.125     ])

In [None]:
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)

2.01 s ± 340 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


The loop above takes about two seconds to run. Let's now run the same calculation as a vectorized operation.

In [None]:
print(compute_reciprocals(values))
print(1.0 / values)

[0.16666667 1.         0.25       0.25       0.125     ]
[0.16666667 1.         0.25       0.25       0.125     ]


In [None]:
%timeit (1.0 / big_array)

1.61 ms ± 225 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


We can see that the vectorized operation is orders of magnitude faster than the loop: about 1.6 ms against about 2 s. Vectorized operations are implemented via ufuncs, whose main purpose is to quickly execute repeated operations on the values in NumPy arrays.
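
`%timeit` only works inside IPython or a notebook. As a rough sketch of the same comparison in a plain Python script, the snippet below uses `time.perf_counter` from the standard library. It runs each version only once on a smaller array of 100,000 values (an assumption to keep the run short), so the timings are less reliable than those from `%timeit` and will vary between machines.

```python
import time
import numpy as np

np.random.seed(0)
big_array = np.random.randint(1, 100, size=100_000)

def compute_reciprocals(values):
    # Element-by-element loop, as in the traditional implementation
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

start = time.perf_counter()
loop_result = compute_reciprocals(big_array)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vector_result = 1.0 / big_array          # vectorized division
vector_seconds = time.perf_counter() - start

print(f"loop:       {loop_seconds:.4f} s")
print(f"vectorized: {vector_seconds:.4f} s")
print("same values:", np.allclose(loop_result, vector_result))
```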

Ufuncs (universal functions) can run between scalars and arrays, two arrays, and multi dimensional arrays.

In [None]:
np.arange(5) / np.arange(1, 6)

array([0.        , 0.5       , 0.66666667, 0.75      , 0.8       ])

In [None]:
# multidimensional example
x = np.arange(9).reshape((3, 3))
2 ** x

array([[  1,   2,   4],
       [  8,  16,  32],
       [ 64, 128, 256]])

### Numpy UFuncs
#### Array arithmetic
NumPy supports the standard addition, subtraction, multiplication and division operators:

In [None]:
x = np.arange(4)
print("x     =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2)  # floor division

x     = [0 1 2 3]
x + 5 = [5 6 7 8]
x - 5 = [-5 -4 -3 -2]
x * 2 = [0 2 4 6]
x / 2 = [0.  0.5 1.  1.5]
x // 2 = [0 0 1 1]


In [None]:
# negation (unary), exponentiation, and modulus
print("-x     = ", -x)
print("x ** 2 = ", x ** 2)
print("x % 2  = ", x % 2)

-x     =  [ 0 -1 -2 -3]
x ** 2 =  [0 1 4 9]
x % 2  =  [0 1 0 1]


All of these arithmetic operators are wrappers around NumPy ufuncs:

|Operator|Equivalent ufunc|Description|
|:--------:|:--------|:--------|
|`+`|`np.add`|Addition (e.g., 1 + 1 = 2)|
|`-`|`np.subtract`|Subtraction (e.g., 3 - 2 = 1)|
|`-`|`np.negative`|Unary negation (e.g., -2)|
|`*`|`np.multiply`|Multiplication (e.g., 2 * 3 = 6)|
|`/`|`np.divide`|Division (e.g., 3 / 2 = 1.5)|
|`//`|`np.floor_divide`|Floor division (e.g., 3 // 2 = 1)|
|`**`|`np.power`|Exponentiation (e.g., 2 ** 3 = 8)|
|`%`|`np.mod`|Modulus/remainder (e.g., 9 % 4 = 1)|
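
The correspondence in the table can be verified directly. This sketch compares each operator with its ufunc on a small array; the dictionary `checks` is only a convenient way to collect the results.

```python
import numpy as np

x = np.arange(4)

# Each entry compares an operator expression with the equivalent ufunc call
checks = {
    "+": np.array_equal(x + 5, np.add(x, 5)),
    "-": np.array_equal(x - 5, np.subtract(x, 5)),
    "unary -": np.array_equal(-x, np.negative(x)),
    "*": np.array_equal(x * 2, np.multiply(x, 2)),
    "/": np.array_equal(x / 2, np.divide(x, 2)),
    "//": np.array_equal(x // 2, np.floor_divide(x, 2)),
    "**": np.array_equal(x ** 2, np.power(x, 2)),
    "%": np.array_equal(x % 2, np.mod(x, 2)),
}

for operator, matches in checks.items():
    print(operator, matches)
```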

#### Trigonometric functions
Next we will explore trigonometric functions. Let's start by defining an array of angles.

In [None]:
theta = np.linspace(0, np.pi, 3)

In [None]:
print("theta      = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))

theta      =  [0.         1.57079633 3.14159265]
sin(theta) =  [0.0000000e+00 1.0000000e+00 1.2246468e-16]
cos(theta) =  [ 1.000000e+00  6.123234e-17 -1.000000e+00]
tan(theta) =  [ 0.00000000e+00  1.63312394e+16 -1.22464680e-16]


In [None]:
# inverse functions
x = [-1, 0, 1]
print("x         = ", x)
print("arcsin(x) = ", np.arcsin(x))
print("arccos(x) = ", np.arccos(x))
print("arctan(x) = ", np.arctan(x))

x         =  [-1, 0, 1]
arcsin(x) =  [-1.57079633  0.          1.57079633]
arccos(x) =  [3.14159265 1.57079633 0.        ]
arctan(x) =  [-0.78539816  0.          0.78539816]


#### Exponents and logarithms

In [None]:
x = [1, 2, 3]
print("x     =", x)
print("e^x   =", np.exp(x))
print("2^x   =", np.exp2(x))
print("3^x   =", np.power(3, x))

x     = [1, 2, 3]
e^x   = [ 2.71828183  7.3890561  20.08553692]
2^x   = [2. 4. 8.]
3^x   = [ 3  9 27]


In [None]:
# log functions
x = [1, 2, 4, 10]
print("x        =", x)
print("ln(x)    =", np.log(x))
print("log2(x)  =", np.log2(x))
print("log10(x) =", np.log10(x))

x        = [1, 2, 4, 10]
ln(x)    = [0.         0.69314718 1.38629436 2.30258509]
log2(x)  = [0.         1.         2.         3.32192809]
log10(x) = [0.         0.30103    0.60205999 1.        ]


#### Aggregates
Any binary ufunc can be used to reduce an array. A reduce repeatedly applies the given operation to the elements of the array until a single result remains.

Calling `reduce` on the `add` ufunc returns the sum of all elements in the array:

In [None]:
x = np.arange(1, 6)
np.add.reduce(x)

15

In [None]:
x

array([1, 2, 3, 4, 5])

In [None]:
# calling reduce on multiply
# results in the product of all array elements
np.multiply.reduce(x)

120

In [None]:
# if we would like to store all the intermediate results
# we can use accumulate instead
np.add.accumulate(x)

array([ 1,  3,  6, 10, 15])

In [None]:
np.multiply.accumulate(x)


array([  1,   2,   6,  24, 120])
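
As a sketch of how these reductions relate to more familiar NumPy functions, the snippet below compares them with `np.sum`, `np.prod` and `np.cumsum`, and uses the `axis` argument to reduce a two-dimensional array by column or by row. The small `grid` array is made up for illustration.

```python
import numpy as np

x = np.arange(1, 6)
grid = np.arange(1, 7).reshape((2, 3))    # [[1, 2, 3], [4, 5, 6]]

# reduce and accumulate match the dedicated aggregation functions
same_sum = np.add.reduce(x) == np.sum(x)
same_prod = np.multiply.reduce(x) == np.prod(x)
same_cumsum = np.array_equal(np.add.accumulate(x), np.cumsum(x))

# axis=0 collapses the rows (one sum per column); axis=1 collapses the columns
column_sums = np.add.reduce(grid, axis=0)
row_sums = np.add.reduce(grid, axis=1)

print(same_sum, same_prod, same_cumsum)
print(column_sums)
print(row_sums)
```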

### Exercises for participation credit
1. Create a new random integer array with Numpy functions. Use a random seed to guarantee the same array every time. Array must be of size (20, 5).
2. Calculate the average value of the second column of the array created in exercise 1. Calculate the sum of all elements in columns 3 and 4. You can use built in ufuncs and indexing.

We will now use a simple IoT readings dataset for this exercise. This dataset contains IoT temperature readings, the room where each reading was made, and its location (outside or inside a room).

3. Open the data file `IOT-temp.csv` inside the compressed `iotdata-compressed` archive. You will need a program such as 7-Zip or WinRAR to extract it. Load the file with Pandas into a new DataFrame. Show the first 10 rows of data. Show the last 10 rows of data. Show the data points for index 300 to 350.
4. Create a new DataFrame from index 1000 to 2000. Rename column `out/in` to `location`. Rename column `room_id/id` to `room`. How many of these readings were made outside? How many were inside?
5. Split your new dataframe into readings that were made inside and readings that were made outside. Create a new dataframe with only inside readings. Create a new dataframe only with outside readings. Print the mean value of the temperature for both new dataframes.
6. Save both dataframes created in exercise 5 to `csv` files.

More information about the dataset used in exercises 3 to 6 is available at [this link](https://www.kaggle.com/datasets/atulanandjha/temperature-readings-iot-devices)

In [None]:
import zipfile
import pandas as pd
import numpy as np
from google.colab import files

uploaded = files.upload()

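# files.upload() returns a dict keyed by the saved file name, which Colab may
# rename (e.g. "archive (3).zip"), so read the name instead of hard-coding it
zip_file_name = next(iter(uploaded))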

with zipfile.ZipFile(zip_file_name, 'r') as zip_ref:
    zip_ref.extractall()

csv_file_name = 'IOT-temp.csv'
df = pd.read_csv(csv_file_name)


np.random.seed(42)

random_array = np.random.randint(1, 101, size=(20, 5))

average_second_column = np.mean(random_array[:, 1])

sum_columns_3_and_4 = np.sum(random_array[:, 2]) + np.sum(random_array[:, 3])

print("Average of the second column:", average_second_column)
print("Sum of all elements in columns 3 and 4:", sum_columns_3_and_4)

print("First 10 rows of data:")
print(df.head(10))

print("Last 10 rows of data:")
print(df.tail(10))

print("Data points for index 300 to 350:")
print(df.iloc[300:351])

new_df = df.iloc[1000:2001].copy()

new_df.rename(columns={'out/in': 'location', 'room_id/id': 'room'}, inplace=True)

# Compare case-insensitively so labels such as 'In'/'Out' also match
location = new_df['location'].str.lower()

outside_readings_count = (location == 'out').sum()
inside_readings_count = (location == 'in').sum()

print("Number of outside readings:", outside_readings_count)
print("Number of inside readings:", inside_readings_count)

inside_readings = new_df[location == 'in']
outside_readings = new_df[location == 'out']

mean_inside_temp = inside_readings['temp'].mean()

mean_outside_temp = outside_readings['temp'].mean()

print("Mean temp for inside readings:", mean_inside_temp)
print("Mean temp for outside readings:", mean_outside_temp)

inside_csv_file_path = 'inside_readings.csv'
inside_readings.to_csv(inside_csv_file_path, index=False)

outside_csv_file_path = 'outside_readings.csv'
outside_readings.to_csv(outside_csv_file_path, index=False)


Saving archive.zip to archive (3).zip
Average of the second column: 49.75
Sum of all elements in columns 3 and 4: 2234
First 10 rows of data:
                                    id  room_id/id        noted_date  temp  \
0  __export__.temp_log_196134_bd201015  Room Admin  08-12-2018 09:30    29   
1  __export__.temp_log_196131_7bca51bc  Room Admin  08-12-2018 09:30    29   
2  __export__.temp_log_196127_522915e3  Room Admin  08-12-2018 09:29    41   
3  __export__.temp_log_196128_be0919cf  Room Admin  08-12-2018 09:29    41   
4  __export__.temp_log_196126_d30b72fb  Room Admin  08-12-2018 09:29    31   
5  __export__.temp_log_196125_b0fa0b41  Room Admin  08-12-2018 09:29    31   
6  __export__.temp_log_196121_01544d45  Room Admin  08-12-2018 09:28    29   
7  __export__.temp_log_196122_f8b80a9f  Room Admin  08-12-2018 09:28    29   
8  __export__.temp_log_196111_6b7a0848  Room Admin  08-12-2018 09:26    29   
9  __export__.temp_log_196112_e134aebd  Room Admin  08-12-2018 09:26    29   
