## NumPy Masterclass

**NumPy** (short for Numerical Python) is an open-source library that is the de facto standard for working with 
numerical data in Python. It forms the foundation of other libraries such as Pandas: 
Pandas DataFrames are built on NumPy arrays and can leverage NumPy functions.

In [1]:
import numpy as np

sales = [0,5,155,0,518,0,1827,616,317,325]

sales_array = np.array(sales)
sales_array

array([   0,    5,  155,    0,  518,    0, 1827,  616,  317,  325])

### Array basics


**NumPy arrays** are fixed-size containers of items that are more efficient than Python lists or tuples for data processing. They have these key properties:

* ndim – the number of dimensions (axes) in the array

* shape – the size of the array for each dimension

* size – the total number of elements in the array

* dtype – the data type of the elements in the array
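
As a quick illustration of the four properties listed above, here is a minimal sketch on a small made-up array (the name `demo` and its values are assumptions, not part of the sales data):

```python
import numpy as np

# Hypothetical 2 x 3 array used only to illustrate the four properties
demo = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(demo.ndim)   # 2 -> two axes (rows and columns)
print(demo.shape)  # (2, 3) -> 2 rows, 3 columns
print(demo.size)   # 6 -> total number of elements
print(demo.dtype)  # an integer type; the exact width depends on the platform
```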

In [2]:
sales_array.shape

(10,)

In [3]:
sales_array_2d = np.array([range(5),range(5)])
sales_array_2d

array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]])

In [4]:
sales_array_2d.shape

#This array has 2 rows and 5 columns

(2, 5)

### Array Creation

As an alternative to converting lists, we can create arrays directly with NumPy functions.

In [5]:
np.ones((4,2),float)

array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])

In [6]:
np.zeros((4,2),int)

array([[0, 0],
       [0, 0],
       [0, 0],
       [0, 0]])

In [7]:
np.arange(0,15,5)

array([ 0,  5, 10])

In [8]:
#Creates an array of n evenly spaced floats between the given start and stop values (both included)
np.linspace(0,100,5)


array([  0.,  25.,  50.,  75., 100.])

In [9]:
#Reshape a 1-D array of shape (4,) into a (2,2) array
np.array([1,2,3,4]).reshape(2,2)

array([[1, 2],
       [3, 4]])

In [10]:
np.identity(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

### Random Number Arrays

We can create **random number arrays** from a variety of distributions using NumPy functions and methods, which is great for sampling and simulation

Tip: always make sure to **set a seed** when generating random numbers to ensure we can always recreate the work done.
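
To see what the seed buys us, the sketch below creates two generators with the same seed and checks that they produce the same numbers. The seed values and variable names are arbitrary choices for illustration.

```python
import numpy as np

# Two independent generators created with the same (arbitrary) seed
rng_a = np.random.default_rng(616)
rng_b = np.random.default_rng(616)

draws_a = rng_a.random(5)
draws_b = rng_b.random(5)

# Same seed and same sequence of calls -> identical numbers
print(np.array_equal(draws_a, draws_b))  # True

# A generator with a different seed produces a different stream
draws_c = np.random.default_rng(617).random(5)
print(np.array_equal(draws_a, draws_c))  # False
```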

In [11]:
#Create a random number generator (the seed is for reproducibility - generate the same numbers between sessions)
rng = np.random.default_rng(616)

random_array = rng.random(10)
random_array

array([0.39682145, 0.86568572, 0.46040359, 0.30599848, 0.57381588,
       0.08888468, 0.88194347, 0.73228387, 0.73215182, 0.56233394])

In [12]:
rng.integers(1,10,10)

array([8, 4, 6, 1, 3, 4, 2, 2, 3, 4], dtype=int64)

In [13]:
#normal distribution
rng.normal(50,5,10)

array([57.86374339, 47.26107561, 48.54153216, 53.60574701, 49.20798481,
       48.8109699 , 52.48272073, 60.23438095, 50.49515067, 48.54883699])

### Indexing & Slicing Arrays

**indexing:** array[row index, column index]

**slicing:** array[start:stop:step size, start:stop:step size]

In [14]:
x = np.arange(12).reshape(6,2)
x

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])

In [15]:
#grab the first column
x[:,0]

array([ 0,  2,  4,  6,  8, 10])

In [16]:
#grab the first row
x[0]

array([0, 1])

In [17]:
#grab the second element of the last row
x[-1][-1]
#or
x[:,-1][-1]

11
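
The cells above do not use the step part of a slice, so here is a small sketch of it on the same kind of 6 x 2 array. It also shows that a single `[row, column]` index gives the same element as chaining two indexes. The variable names are illustrative only.

```python
import numpy as np

x = np.arange(12).reshape(6, 2)

# start:stop:step on the row axis -> rows 0, 2 and 4
every_other_row = x[::2]
print(every_other_row)   # [[0 1] [4 5] [8 9]]

# rows 1 up to (not including) 4, second column only
middle_rows_col1 = x[1:4, 1]
print(middle_rows_col1)  # [3 5 7]

# one [row, column] index instead of x[-1][-1]
last_element = x[-1, -1]
print(last_element)      # 11
```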

### Array Operations

Arithmetic operators can be used to perform array operations.

Array operations are applied via vectorization and broadcasting, which eliminates the need to loop through the array's elements.

In [18]:
sales_array

array([   0,    5,  155,    0,  518,    0, 1827,  616,  317,  325])

In [19]:
#Add a shipping cost of 2 to every sale
sales_array + 2

array([   2,    7,  157,    2,  520,    2, 1829,  618,  319,  327])

In [20]:
sales_2 = (sales_array + 2).reshape(2,5)

quantity = sales_2[0]
price = sales_2[-1]

quantity * price

array([     4,  12803,  97026,    638, 170040])

### Filtering Arrays

We can filter arrays by indexing them with a logical test.

Only the array elements in positions where the logical test returns True are returned. We can filter arrays with multiple logical tests:
* Use | for or conditions and & for and conditions, wrapping each test in parentheses
* Store complex filtering criteria in a variable (known as a boolean mask)
* We can also filter arrays based on values from other arrays

In [21]:
sales_array3 = sales_array.reshape(2,5)
sales_array3

array([[   0,    5,  155,    0,  518],
       [   0, 1827,  616,  317,  325]])

In [22]:
sales_array3 != 0

array([[False,  True,  True, False,  True],
       [False,  True,  True,  True,  True]])

In [23]:
sales_array3[sales_array3 != 0]

array([   5,  155,  518, 1827,  616,  317,  325])

In [24]:
#return sales equal to 5 or greater than 1820
sales_array3[(sales_array3 == 5) | (sales_array3 > 1820) ]

array([   5, 1827])

In [25]:
#return sales between 500 and 700
sales_array3[(sales_array3 >= 500) & (sales_array3 <= 700)]

array([518, 616])

In [26]:
mask = (sales_array3 >= 500) & (sales_array3 <= 700)
sales_array3[mask]

array([518, 616])

In [27]:
sales_array = np.array([0,5,155,0])

products_array = np.array(['fruit','cereal','dairy','eggs'])

#retrieve products where sales are greater than 0

products_array[sales_array > 0]

array(['cereal', 'dairy'], dtype='<U6')

### Modifying array values

In [28]:
sales_array3

array([[   0,    5,  155,    0,  518],
       [   0, 1827,  616,  317,  325]])

In [29]:
sales_array3[-1][-1] = 50000
sales_array3

array([[    0,     5,   155,     0,   518],
       [    0,  1827,   616,   317, 50000]])

In [30]:
sales_array3[sales_array3 == 0] = 999
sales_array3

array([[  999,     5,   155,   999,   518],
       [  999,  1827,   616,   317, 50000]])

### The WHERE function

The NumPy where() function performs a logical test and returns one value where the test is True and another where the test is False.

np.where(logical test,
            value if True,
            value if False)

In [31]:
inventory_array = np.array([0,102,0,72,89])
product_array = np.array(['samsung x','iphone5','ps5','asus rog','fifa 22'])

#If the product is out of stock return "out of stock" otherwise we're going to return the value from product_array

np.where(inventory_array <= 0
         ,'Out of Stock'
         ,product_array)

array(['Out of Stock', 'iphone5', 'Out of Stock', 'asus rog', 'fifa 22'],
      dtype='<U12')

### Array aggregation methods

Array aggregation methods allow us to calculate metrics like sum, mean or max.

In [32]:
sales_array3

array([[  999,     5,   155,   999,   518],
       [  999,  1827,   616,   317, 50000]])

In [33]:
#Retrieve total sales - the sum of all the values in the array
sales_array3.sum()

56435

In [34]:
print('max:',sales_array3.max())
print('mean',sales_array3.mean())
print('min:',sales_array3.min())

max: 50000
mean 5643.5
min: 5


In [35]:
#We can also aggregate across rows or columns:

sales_array3.sum(axis=0) #Collapses the rows: one total per column

array([ 1998,  1832,   771,  1316, 50518])

In [36]:
sales_array3.sum(axis=1) #Collapses the columns: one total per row

array([ 2676, 53759])
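
The axis argument is easy to mix up, so here is a minimal sketch on a made-up 2 x 3 array (the values are assumptions chosen so the totals are easy to read). The axis we pass is the one that disappears from the result.

```python
import numpy as np

# Hypothetical array: 2 rows, 3 columns
demo = np.array([[1, 2, 3],
                 [10, 20, 30]])

# axis=0 collapses the rows, leaving one total per column
col_totals = demo.sum(axis=0)
print(col_totals)  # [11 22 33]

# axis=1 collapses the columns, leaving one total per row
row_totals = demo.sum(axis=1)
print(row_totals)  # [ 6 60]
```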

### Array Functions

Array functions let us perform other aggregations like median and percentiles

Note: the median is the halfway point in our data and is what's called a robust statistic. Compared to the mean, which can be heavily influenced by outliers, the median gives us a stable estimate of the center of our data. It's very helpful for skewed distributions like price or income.
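
A small sketch of that robustness, using made-up prices (the numbers are assumptions for illustration): adding a single extreme value moves the mean a lot and leaves the median where it was.

```python
import numpy as np

# Hypothetical prices with no outliers
prices = np.array([10, 12, 11, 13, 12])
print(prices.mean())      # 11.6
print(np.median(prices))  # 12.0

# The same prices plus one extreme value
prices_with_outlier = np.append(prices, 1000)
print(round(prices_with_outlier.mean(), 2))  # 176.33
print(np.median(prices_with_outlier))        # 12.0
```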

In [37]:
sales_array3

array([[  999,     5,   155,   999,   518],
       [  999,  1827,   616,   317, 50000]])

In [38]:
np.median(sales_array3)

807.5

In [39]:
#Returns the value at the given percentile (here the 80th) of the array
round(np.percentile(sales_array3,80),2)

1164.6

In [40]:
#Return the unique values in an array
np.unique(sales_array3)

array([    5,   155,   317,   518,   616,   999,  1827, 50000])

In [41]:
np.sqrt(sales_array3)

array([[ 31.60696126,   2.23606798,  12.4498996 ,  31.60696126,
         22.75961335],
       [ 31.60696126,  42.74342055,  24.81934729,  17.80449381,
        223.60679775]])

### Sorting Arrays

The sort() method will sort arrays in place

In [42]:
sales_array3.sort() #by default it sorts along the last axis (axis=-1, here axis=1), i.e. the values within each row
sales_array3

array([[    5,   155,   518,   999,   999],
       [  317,   616,   999,  1827, 50000]])

In [43]:
#reverse the order using the negative step size trick
sales_array3[:,::-1]

array([[  999,   999,   518,   155,     5],
       [50000,  1827,   999,   616,   317]])
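
If we want to keep the original order around, the np.sort() function returns a sorted copy instead of sorting in place. The sketch below contrasts the two on a made-up array (names and values are illustrative assumptions).

```python
import numpy as np

values = np.array([[3, 1, 2],
                   [9, 7, 8]])
original = values.copy()

# np.sort returns a new sorted array and leaves `values` untouched
sorted_copy = np.sort(values)
unchanged = np.array_equal(values, original)
print(unchanged)    # True

# The .sort() method sorts in place and returns None
result = values.sort()
print(result)       # None
print(values)       # [[1 2 3] [7 8 9]]

# Descending order within each row, using the negative step slice
descending = values[:, ::-1]
print(descending)   # [[3 2 1] [9 8 7]]
```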

### Vectorization

Vectorization is the process of pushing array operations into optimized C code, which is easier and more efficient than writing for loops. 

So when we add 2 to an array and that adds 2 to every value in the array, we're not writing that for loop in Python. There is a loop occurring in C, which is much more efficient.

In [44]:
def for_loop_multiply(list1,list2):
    
    products_list = []
    
    for x,y in zip(list1,list2):
        products_list.append(x*y)
        
    return products_list


def multiply_arrays(array1,array2):
    
    return array1 * array2

In [45]:
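list1 = list(range(1000))
list2 = list(range(1000))
array1 = np.array(list1) #array versions of the same data for multiply_arrays
array2 = np.array(list2)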

In [52]:
%%timeit -n 10000
# Call both functions 10,000 times per run and compare the timings
for_loop_multiply(list1,list2)

93.6 µs ± 1.13 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [53]:
%%timeit -n 10000
multiply_arrays(array1,array2)

1.38 µs ± 97.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


On average, for_loop_multiply takes about 93 microseconds, while the same operation on arrays takes less than 2 microseconds.

**Array-based operations are much faster than list-based operations because of vectorization: the work is done on an efficient data structure, the array, by compiled code rather than by a Python loop.**

So we should use vectorized operations whenever possible to manipulate data and avoid writing loops.
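
Outside a notebook, where the %%timeit magic is not available, a similar comparison can be sketched with the standard library's timeit module. This is a rough, self-contained version under the same assumptions (1,000 elements per input); the absolute timings will differ from machine to machine, so none are shown here.

```python
import timeit
import numpy as np

list1 = list(range(1000))
list2 = list(range(1000))
array1 = np.array(list1)
array2 = np.array(list2)

def for_loop_multiply(a, b):
    # Element-wise product computed with a Python-level loop
    return [x * y for x, y in zip(a, b)]

def multiply_arrays(a, b):
    # Element-wise product computed by NumPy in compiled code
    return a * b

# Both approaches must give the same numbers before timing them
loop_result = for_loop_multiply(list1, list2)
array_result = multiply_arrays(array1, array2)
print(loop_result == array_result.tolist())  # True

# Total time for 1,000 calls of each function, in seconds
loop_time = timeit.timeit(lambda: for_loop_multiply(list1, list2), number=1000)
array_time = timeit.timeit(lambda: multiply_arrays(array1, array2), number=1000)
print(f'loop: {loop_time:.4f}s, arrays: {array_time:.4f}s')
```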

### Broadcasting

Broadcasting lets us perform vectorized operations with arrays of different sizes, where NumPy will expand the smaller array to 'fit' the larger one.

![Capture.PNG](attachment:Capture.PNG)

In [56]:
test_array = np.array([[1,2,3],[1,2,3],[1,2,3]])
test_array

array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3]])

In [62]:
test_array * test_array[:,1].reshape(3,1)

array([[2, 4, 6],
       [2, 4, 6],
       [2, 4, 6]])
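
To make the 'expanding' more concrete, here is a minimal sketch that multiplies a 3 x 3 array by a row of shape (3,) and by a column of shape (3, 1). The arrays are made up for illustration; the point is which direction the smaller array gets stretched in.

```python
import numpy as np

matrix = np.ones((3, 3), dtype=int)
row = np.array([1, 2, 3])    # shape (3,)
column = row.reshape(3, 1)   # shape (3, 1)

# The row is repeated down the rows of the matrix
by_row = matrix * row
print(by_row)     # [[1 2 3] [1 2 3] [1 2 3]]

# The column is repeated across the columns of the matrix
by_column = matrix * column
print(by_column)  # [[1 1 1] [2 2 2] [3 3 3]]

# Shapes (3,) and (3, 1) broadcast against each other to (3, 3)
outer = row * column
print(outer.shape)  # (3, 3)
print(outer)        # [[1 2 3] [2 4 6] [3 6 9]]
```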

### Bringing it all together


1 - Filter sales_array down to only sales where the product family was produce.

2 - Then, randomly sample roughly half (random number < .5) of the produce sales and report the mean and median sales. Use a random seed of 2022.

3 - Finally, create a new array that has the values 'above_both', 'above_median', and 'below_both' based on whether the sales were above the median and mean of the sample, just above the median of the sample, or below both the median and mean of the sample.

In [65]:
import pandas as pd
import numpy as np

retail_df = pd.read_csv(
    r"C:\Users\andre\OneDrive\Ambiente de Trabalho\NumPy & Pandas\Pandas Course Resources\retail/retail_2016_2017.csv"
    ,skiprows=range(1, 11000)
    ,nrows=1000
)

family_array = np.array(retail_df["family"])
sales_array = np.array(retail_df["sales"])

In [71]:
#1

produce_sales = sales_array[family_array == 'PRODUCE']

print(f'There are {produce_sales.size} sales where the product family was produce')

There are 30 sales where the product family was produce


In [79]:
#2

#Draw one random number per produce sale (30 here), so the sample lines up with produce_sales
rng = np.random.default_rng(2022)
random_sample = rng.random(produce_sales.size)

#Now we need to sample roughly half of the produce sales 
random_sales = produce_sales[random_sample < 0.5]

#Report the mean and median sales
mean = random_sales.mean()
median = np.median(random_sales)

print('Mean:',round(mean,3))
print('Median:',median)

Mean: 2268.102
Median: 1272.755


In [83]:
#3 

np.where(random_sales < median
        ,'below_both'
        ,np.where(random_sales > mean
                  ,'above_both'
                  ,'above_median') 
        )

array(['above_median', 'below_both', 'below_both', 'below_both',
       'above_both', 'below_both', 'below_both', 'above_both',
       'below_both', 'above_median', 'above_both', 'above_both',
       'below_both', 'above_median', 'above_both', 'below_both',
       'above_both'], dtype='<U12')

In [85]:
random_sales

array([1662.394,  447.064,  962.866, 1077.44 , 3404.531,  962.96 ,
       1089.319, 7860.031,  446.038, 1272.755, 2775.771, 2339.906,
        722.333, 1567.843, 2458.456,  673.885, 8834.15 ])
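
As a possible alternative to nesting np.where() calls, np.select() takes a list of conditions and a matching list of values, which can be easier to read once there are several categories. The sketch below uses a small made-up sample rather than the retail data, and assumes, as in the exercise, that the mean is not below the median.

```python
import numpy as np

# Hypothetical sample, right-skewed so that mean > median
sample = np.array([100.0, 200.0, 300.0, 400.0, 2000.0])
mean = sample.mean()         # 600.0
median = np.median(sample)   # 300.0

# Conditions are checked in order; the first one that is True wins
labels = np.select(
    [sample < median, sample > mean],
    ['below_both', 'above_both'],
    default='above_median',
)
print(labels)
# ['below_both' 'below_both' 'above_median' 'above_median' 'above_both']

# Same labels as the nested np.where version
nested = np.where(sample < median,
                  'below_both',
                  np.where(sample > mean, 'above_both', 'above_median'))
print(np.array_equal(labels, nested))  # True
```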