# Week 3
# NumPy Arrays

[NumPy](https://numpy.org/) is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays.

**Resources:**
- Textbook Chapter 4
- [NumPy Documentation](https://numpy.org/doc/stable/)
- [NumPy Exercises from w3resource](https://www.w3resource.com/python-exercises/numpy/index.php)

In [2]:
import numpy as np # np is the widely used abbreviation for numpy

## NumPy Arrays

A NumPy array is a grid of values, all of the same type, indexed by a tuple of nonnegative integers.
- **rank**: the number of dimensions
- **shape**: a tuple of integers giving the size of the array along each dimension
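
As a minimal sketch of these two terms (the array names `vec`, `mat`, and `cube` are illustrative only), the `ndim` attribute reports the rank and `shape` reports the size along each dimension:

```python
import numpy as np

# Illustrative arrays of rank 1, 2, and 3
vec = np.array([1, 2, 3, 4])
mat = np.array([[1, 2, 3],
                [4, 5, 6]])
cube = np.zeros((2, 3, 4))

for name, a in [("vec", vec), ("mat", mat), ("cube", cube)]:
    # ndim is the rank; shape has one entry per dimension
    print(name, "rank:", a.ndim, "shape:", a.shape)
```

The length of the `shape` tuple always equals the rank.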

In [7]:
# Create a numpy array from a Python list
py_list = [1, 2, 3, 4]
np_ary = np.array(py_list)
print("Numpy array:", np_ary)
print("Shape:", np_ary.shape)
print("Type:", np_ary.dtype)

# Change the data type after it is automatically determined
new_ary = np_ary.astype('float')
print(new_ary.dtype)

Numpy array: [1 2 3 4]
Shape: (4,)
Type: int32
float64


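Operations on NumPy arrays are vectorized: they run in compiled code over the whole array instead of in a Python loop, so they are typically much faster than the equivalent list code.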

In [8]:
# Time cost comparison
my_arr = np.arange(1_000_000) # np.arange() creates a numpy array of ints
my_list = list(range(1_000_000))

# Use %timeit to estimate the time cost
%timeit my_arr2 = my_arr * 2

%timeit my_list2 = [x * 2 for x in my_list]

937 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
47.7 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [9]:
# How many times slower is the list operation?
print(0.0477 / 0.000937)

50.907150480256135
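
The `%timeit` magic is only available in IPython and Jupyter. In a plain Python script, a similar comparison can be sketched with the standard `timeit` module. The number of runs (`n_runs = 10`) is an arbitrary choice for this sketch, and the measured times will differ from machine to machine:

```python
import timeit
import numpy as np

my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

# Run each statement n_runs times and report the average time per run
n_runs = 10
t_arr = timeit.timeit(lambda: my_arr * 2, number=n_runs) / n_runs
t_list = timeit.timeit(lambda: [x * 2 for x in my_list], number=n_runs) / n_runs

print("NumPy array:", t_arr, "seconds per run")
print("Python list:", t_list, "seconds per run")
print("Ratio (list / array):", t_list / t_arr)
```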


## Array Indexing

In [10]:
# Access elements using square brackets
# Ex: print the elements at indices 0, 2, and 4 of ary
ary = np.array([3, 1, 4, 1, 5, 9])
print(ary[0])
print(ary[0], ary[2], ary[4])

3
3 4 5


In [11]:
# print the last element of ary
print(ary[-1])

9


In [13]:
# print the first 3 elements of ary
print(ary[0:3])
print(ary[:3])

[3 1 4]
[3 1 4]


In [14]:
# print the last 2 elements of ary
print(ary[-2:])
print(ary[-2:6])
print(ary[4:6])

[5 9]
[5 9]
[5 9]
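
Slices also accept a step, and a slice of a NumPy array is a view of the original data, not a copy. The sketch below illustrates both points; the variable names `evens`, `first_three`, and `safe` are made up for this example:

```python
import numpy as np

ary = np.array([3, 1, 4, 1, 5, 9])

# start:stop:step, here every second element
evens = ary[::2]
print(evens)        # [3 4 5]

# A negative step walks backwards, which reverses the array
print(ary[::-1])    # [9 5 1 4 1 3]

# A slice is a view: writing to it changes the original array
first_three = ary[:3]
first_three[0] = 100
print(ary)          # [100   1   4   1   5   9]

# Use .copy() when an independent array is needed
safe = ary[:3].copy()
safe[0] = -1
print(ary[0])       # 100 (unchanged by the write to the copy)
```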


In [15]:
# Create a 2D array
ary2d = np.array([[1, 2, 3], 
                  [4, 5, 6]])
print(ary2d)

[[1 2 3]
 [4 5 6]]


In [17]:
# Ex: print the shape and data type of ary2d
print("Shape:", ary2d.shape)
print("Data type:", ary2d.dtype)

Shape: (2, 3)
Data type: int32


In [18]:
# Ex: print the element on the first row and the second column
print(ary2d[0][1])
print(ary2d[0, 1])

2
2


In [19]:
# Ex: print the entire first row
print(ary2d[0])
print(ary2d[0, :])

[1 2 3]
[1 2 3]


In [20]:
# Ex: print the entire first column
print(ary2d[:, 0])

[1 4]
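
One detail worth checking when indexing 2D arrays is how the result's shape depends on the kind of index. In the sketch below (the names `row_1d`, `row_2d`, and `block` are illustrative), an integer index removes a dimension while a slice keeps it:

```python
import numpy as np

ary2d = np.array([[1, 2, 3],
                  [4, 5, 6]])

row_1d = ary2d[0, :]     # integer index drops the row dimension
row_2d = ary2d[0:1, :]   # a slice keeps it
print(row_1d, row_1d.shape)   # [1 2 3] (3,)
print(row_2d, row_2d.shape)   # [[1 2 3]] (1, 3)

# Slices on both axes select a sub-block: all rows, last two columns
block = ary2d[:, 1:]
print(block)
```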


NumPy also provides many functions to create arrays:

In [None]:
ary = np.zeros((2, 3))
print(ary)
print(ary.dtype)

In [None]:
ary = np.ones((3, 2))
print(ary)

In [None]:
ary = np.random.rand(2, 2)
print(ary)

In [None]:
# What if we want to create an array of random integers?
int_ary = np.random.randint(0, 10, size=(4, 3))
print(int_ary)
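
Random arrays differ on every run, which can make results hard to reproduce. One option, shown here as a sketch, is NumPy's `Generator` interface created with `np.random.default_rng`. The seed value `42` is an arbitrary choice for this example:

```python
import numpy as np

# Generator API; the seed value 42 is an arbitrary choice for this sketch
rng = np.random.default_rng(42)

float_ary = rng.random((2, 2))                # uniform floats in [0, 1)
int_ary = rng.integers(0, 10, size=(4, 3))    # integers from 0 to 9

print(float_ary)
print(int_ary)

# Re-creating the generator with the same seed reproduces the same values
rng_again = np.random.default_rng(42)
print(np.array_equal(float_ary, rng_again.random((2, 2))))  # True
```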

## Advanced Array Indexing
NumPy arrays support **integer array indexing** and **boolean indexing**, which provide additional ways to create a subarray.

In [None]:
# Integer array index: create a subarray whose indices come from another array
data_ary = np.array([0, 2, 4, 6, 8, 10])
idx_ary = [0, 4, 5]
# What values will be printed?
print(data_ary[idx_ary])

In [None]:
# Python lists do not support integer array indexing:
# the line below raises a TypeError
data_list = [0, 2, 4, 6, 8, 10]
data_list[idx_ary]

In [None]:
# Boolean indexing: select values that satisfy some condition
ary = np.array([1, 3, -5, -2, 0, -1])
idx = (ary > 0)
# What will be printed?
print(idx)
print(ary[idx])

In [None]:
# plug in the condition directly
print(ary[ary >= 0])
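
Conditions can also be combined. The following sketch uses the element-wise operators `&`, `|`, and `~` on boolean masks; the mask name `in_range` is illustrative:

```python
import numpy as np

ary = np.array([1, 3, -5, -2, 0, -1])

# Combine conditions with & (and) and | (or); each condition needs parentheses
in_range = (ary >= -2) & (ary <= 1)
print(ary[in_range])            # [ 1 -2  0 -1]

# ~ negates a mask
print(ary[~in_range])           # [ 3 -5]

# True counts as 1, so summing a mask counts the matching elements
print(in_range.sum())           # 4
```

Python's `and` and `or` keywords do not work element-wise on arrays, which is why the symbol operators are used here.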

## Array Arithmetic
Basic mathematical functions operate elementwise on the arrays.

In [5]:
x = np.array([[1, 2, 3],
              [4, 5, 6]])
print(x + 1)

[[2 3 4]
 [5 6 7]]


In [6]:
print(x * 2)

[[ 2  4  6]
 [ 8 10 12]]


In [7]:
y = np.array([[10, 20, 30],
              [40, 50, 60]])
print(x + y)

[[11 22 33]
 [44 55 66]]


In [8]:
print(x * y)

[[ 10  40  90]
 [160 250 360]]


In [9]:
z = np.array([[-1, -2, -3, -4],
              [-5, -6, -7, -8]])
print(x + z)

ValueError: operands could not be broadcast together with shapes (2,3) (2,4) 
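
The error message refers to broadcasting, the rule NumPy uses to combine arrays of different shapes. Shapes are compared from the last dimension backwards, and each pair of sizes must either match or include a 1. A small sketch with shapes that do broadcast (the arrays `row` and `col` are made up for this example):

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)

row = np.array([10, 20, 30])       # shape (3,): added to every row of x
col = np.array([[100], [200]])     # shape (2, 1): added to every column of x

print(x + row)
print(x + col)
```

Shapes `(2, 3)` and `(2, 4)` fail because the last dimensions, 3 and 4, differ and neither is 1.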

In [10]:
# What does this expression do?
1 / x

array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])

In [11]:
# What does this expression do?
x > y

array([[False, False, False],
       [False, False, False]])

## NumPy Math Functions
NumPy provides many math functions that operate elementwise on NumPy arrays.

In [None]:
# Calculate the square-root of 1, 2, ..., 10
ary = np.arange(1, 11)
print(ary)
print(ary.shape)
ary2 = np.sqrt(ary)
print(ary2)

In [None]:
# Statistical functions
data = np.array([12, 34, 56, 78, 90])
print("Minimum:", data.min())
print("Maximum:", data.max())
print("Mean:", data.mean())
print("Variance:", data.var())
print("Standard deviation:", data.std())

In [None]:
# These functions can be called directly
print(np.min(data))
print(np.max(data))
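
For arrays with more than one dimension, these functions accept an `axis` argument that selects the dimension to reduce. A brief sketch on a small 2D array:

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])

print(x.sum())          # 21, over all elements
print(x.sum(axis=0))    # [5 7 9], one value per column
print(x.sum(axis=1))    # [ 6 15], one value per row
print(x.mean(axis=1))   # [2. 5.]
```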

# Example: 80 Cereals - Nutrition data on 80 cereal products

In this example, we will use NumPy to analyze a dataset that contains nutrition facts on 80 cereal products. In particular, we will:
- Download and load the dataset.
- Explore the data for interesting information.
- Examine the sugar content.

The data file can be downloaded from [Kaggle.com](https://www.kaggle.com/crawford/80-cereals).

>If you like to eat cereal, do yourself a favor and avoid this dataset at all costs. After seeing these data it will never be the same for me to eat Fruity Pebbles again. - Kaggle

- Download the zip file from Kaggle (login required).
- Unzip it to get the `cereal.csv` file.
- Move the CSV file to a suitable folder (the code below expects `data/cereal.csv`).
- Open the CSV file in Notepad or Excel to examine its contents.

## Load And Examine The Data

In [4]:
# Load the csv file with np.loadtxt()
# Spoiler alert: in the next chapter we will learn a more user-friendly
# way of loading data.
import numpy as np # import numpy again to make this section self-contained
raw_data = np.loadtxt("data/cereal.csv", delimiter=",", skiprows=1, dtype=str)
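
With `skiprows=1` the header line is discarded. If the column names are wanted as well, one possible approach is to read the first line separately before handing the file to `np.loadtxt`. The sketch below uses an in-memory stand-in (`sample_csv`) for the real file, and assumes the header row contains the sixteen column names used elsewhere in this notebook:

```python
import io
import numpy as np

# Stand-in for open("data/cereal.csv"): an assumed header plus the first two rows
sample_csv = io.StringIO(
    "name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,"
    "potass,vitamins,shelf,weight,cups,rating\n"
    "100% Bran,N,C,70,4,1,130,10,5,6,280,25,3,1,0.33,68.402973\n"
    "100% Natural Bran,Q,C,120,3,5,15,2,8,8,135,0,3,1,1,33.983679\n"
)

# Read the header line ourselves, then let loadtxt parse the remaining rows
header = sample_csv.readline().strip().split(",")
rows = np.loadtxt(sample_csv, delimiter=",", dtype=str)

print(len(header))              # 16
print(header.index("sugars"))   # 9
print(rows.shape)               # (2, 16)
```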

In [5]:
# Show values in raw_data
print(raw_data[0:5, :])

[['100% Bran' 'N' 'C' '70' '4' '1' '130' '10' '5' '6' '280' '25' '3' '1'
  '0.33' '68.402973']
 ['100% Natural Bran' 'Q' 'C' '120' '3' '5' '15' '2' '8' '8' '135' '0'
  '3' '1' '1' '33.983679']
 ['All-Bran' 'K' 'C' '70' '4' '1' '260' '9' '7' '5' '320' '25' '3' '1'
  '0.33' '59.425505']
 ['All-Bran with Extra Fiber' 'K' 'C' '50' '4' '0' '140' '14' '8' '0'
  '330' '25' '3' '1' '0.5' '93.704912']
 ['Almond Delight' 'R' 'C' '110' '2' '2' '200' '1' '14' '8' '-1' '25' '3'
  '1' '0.75' '34.384843']]


In [21]:
# Ex: What is the shape of raw_data?
raw_data.shape

(77, 16)

In [22]:
# Ex: Create a list of feature names (call it feature_names)
feature_names = ["name","mfr","type","calories","protein","fat","sodium",
                 "fiber","carbo","sugars","potass","vitamins","shelf",
                 "weight","cups","rating"]

## Explore The Contents

In [17]:
# Display the list of cereal names
raw_data[:, 0]

array(['100% Bran', '100% Natural Bran', 'All-Bran',
       'All-Bran with Extra Fiber', 'Almond Delight',
       'Apple Cinnamon Cheerios', 'Apple Jacks', 'Basic 4', 'Bran Chex',
       'Bran Flakes', "Cap'n'Crunch", 'Cheerios', 'Cinnamon Toast Crunch',
       'Clusters', 'Cocoa Puffs', 'Corn Chex', 'Corn Flakes', 'Corn Pops',
       'Count Chocula', "Cracklin' Oat Bran", 'Cream of Wheat (Quick)',
       'Crispix', 'Crispy Wheat & Raisins', 'Double Chex', 'Froot Loops',
       'Frosted Flakes', 'Frosted Mini-Wheats',
       'Fruit & Fibre Dates; Walnuts; and Oats', 'Fruitful Bran',
       'Fruity Pebbles', 'Golden Crisp', 'Golden Grahams',
       'Grape Nuts Flakes', 'Grape-Nuts', 'Great Grains Pecan',
       'Honey Graham Ohs', 'Honey Nut Cheerios', 'Honey-comb',
       'Just Right Crunchy  Nuggets', 'Just Right Fruit & Nut', 'Kix',
       'Life', 'Lucky Charms', 'Maypo',
       'Muesli Raisins; Dates; & Almonds',
       'Muesli Raisins; Peaches; & Pecans', 'Mueslix Crispy Blend',
  

In [23]:
# Display the list of cereal ratings
ratings = raw_data[:, -1]
ratings

array(['68.402973', '33.983679', '59.425505', '93.704912', '34.384843',
       '29.509541', '33.174094', '37.038562', '49.120253', '53.313813',
       '18.042851', '50.764999', '19.823573', '40.400208', '22.736446',
       '41.445019', '45.863324', '35.782791', '22.396513', '40.448772',
       '64.533816', '46.895644', '36.176196', '44.330856', '32.207582',
       '31.435973', '58.345141', '40.917047', '41.015492', '28.025765',
       '35.252444', '23.804043', '52.076897', '53.371007', '45.811716',
       '21.871292', '31.072217', '28.742414', '36.523683', '36.471512',
       '39.241114', '45.328074', '26.734515', '54.850917', '37.136863',
       '34.139765', '30.313351', '40.105965', '29.924285', '40.692320',
       '59.642837', '30.450843', '37.840594', '41.503540', '60.756112',
       '63.005645', '49.511874', '50.828392', '39.259197', '39.703400',
       '55.333142', '41.998933', '40.560159', '68.235885', '74.472949',
       '72.801787', '31.230054', '53.131324', '59.363993', '38.8

In [33]:
# Which product has the best rating?
# 1. convert all ratings to floats (stored under a new name so that
#    ratings itself still holds the original strings)
ratings_float = ratings.astype('float')
print(ratings_float)
# 2. find the highest rating
# np.max(ratings_float)
print("Maximum rating:", ratings_float.max())
# 3. find the (row) index of the highest rating
index_best = np.argmax(ratings_float)
print("The row index of the maximum rating:", index_best)
# 4. Find the product name on that row
raw_data[index_best, 0]

[68.402973 33.983679 59.425505 93.704912 34.384843 29.509541 33.174094
 37.038562 49.120253 53.313813 18.042851 50.764999 19.823573 40.400208
 22.736446 41.445019 45.863324 35.782791 22.396513 40.448772 64.533816
 46.895644 36.176196 44.330856 32.207582 31.435973 58.345141 40.917047
 41.015492 28.025765 35.252444 23.804043 52.076897 53.371007 45.811716
 21.871292 31.072217 28.742414 36.523683 36.471512 39.241114 45.328074
 26.734515 54.850917 37.136863 34.139765 30.313351 40.105965 29.924285
 40.69232  59.642837 30.450843 37.840594 41.50354  60.756112 63.005645
 49.511874 50.828392 39.259197 39.7034   55.333142 41.998933 40.560159
 68.235885 74.472949 72.801787 31.230054 53.131324 59.363993 38.839746
 28.592785 46.658844 39.106174 27.753301 49.787445 51.592193 36.187559]
Maximum rating: 93.704912
The row index of the maximum rating: 3


'All-Bran with Extra Fiber'

In [7]:
# Display the amount of sugars for each product.
# How can we identify the column index for sugar?

# index_sugars = feature_names.index("sugars")
# raw_data[:, index_sugars]

# raw_data[:, feature_names.index("sugars")]

raw_data[:, np.array(feature_names) == "sugars"] # use a condition to select the sugars column

array([['6'],
       ['8'],
       ['5'],
       ['0'],
       ['8'],
       ['10'],
       ['14'],
       ['8'],
       ['6'],
       ['5'],
       ['12'],
       ['1'],
       ['9'],
       ['7'],
       ['13'],
       ['3'],
       ['2'],
       ['12'],
       ['13'],
       ['7'],
       ['0'],
       ['3'],
       ['10'],
       ['5'],
       ['13'],
       ['11'],
       ['7'],
       ['10'],
       ['12'],
       ['12'],
       ['15'],
       ['9'],
       ['5'],
       ['3'],
       ['4'],
       ['11'],
       ['10'],
       ['11'],
       ['6'],
       ['9'],
       ['3'],
       ['6'],
       ['12'],
       ['3'],
       ['11'],
       ['11'],
       ['13'],
       ['6'],
       ['9'],
       ['7'],
       ['2'],
       ['10'],
       ['14'],
       ['3'],
       ['0'],
       ['0'],
       ['6'],
       ['-1'],
       ['12'],
       ['8'],
       ['6'],
       ['2'],
       ['3'],
       ['0'],
       ['0'],
       ['0'],
       ['15'],
       ['3'],
       ['5'],
       ['

In [9]:
# What is the highest rating?
ratings.max() # This will cause an error since the array contains strings

TypeError: cannot perform reduce with flexible type

In [10]:
# Need to convert strings in ratings to floating point numbers.
ratings = ratings.astype(float)
ratings.max()

93.704912

In [11]:
print(ratings)

[68.402973 33.983679 59.425505 93.704912 34.384843 29.509541 33.174094
 37.038562 49.120253 53.313813 18.042851 50.764999 19.823573 40.400208
 22.736446 41.445019 45.863324 35.782791 22.396513 40.448772 64.533816
 46.895644 36.176196 44.330856 32.207582 31.435973 58.345141 40.917047
 41.015492 28.025765 35.252444 23.804043 52.076897 53.371007 45.811716
 21.871292 31.072217 28.742414 36.523683 36.471512 39.241114 45.328074
 26.734515 54.850917 37.136863 34.139765 30.313351 40.105965 29.924285
 40.69232  59.642837 30.450843 37.840594 41.50354  60.756112 63.005645
 49.511874 50.828392 39.259197 39.7034   55.333142 41.998933 40.560159
 68.235885 74.472949 72.801787 31.230054 53.131324 59.363993 38.839746
 28.592785 46.658844 39.106174 27.753301 49.787445 51.592193 36.187559]


In [13]:
# Which cereal receives the highest rating?

# find the index of the largest value in an array.
index_max_rating = np.argmax(ratings)
print("Index of the max rating:", index_max_rating)

# Extract the entire row with this index
print(raw_data[index_max_rating, :])

Index of the max rating: 3
['All-Bran with Extra Fiber' 'K' 'C' '50' '4' '0' '140' '14' '8' '0' '330'
 '25' '3' '1' '0.5' '93.704912']
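
`np.argmax` returns a single index. To rank several products, `np.argsort` can be used. The sketch below works on the first five cereals only, with names and ratings copied from the output above; the variable names `order` and `top3` are illustrative:

```python
import numpy as np

names = np.array(["100% Bran", "100% Natural Bran", "All-Bran",
                  "All-Bran with Extra Fiber", "Almond Delight"])
ratings = np.array([68.402973, 33.983679, 59.425505, 93.704912, 34.384843])

# argsort returns the indices that sort ascending; reverse for descending
order = np.argsort(ratings)[::-1]
top3 = order[:3]

print(top3)           # [3 0 2]
print(names[top3])
```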


In [18]:
# How many cereals receive a rating above 60? What are they?
print("Cereal products with a rating above 60:")
result = raw_data[ratings > 60, 0]
print(result)

# result.shape would print the tuple (8,); len() gives the count itself
print("Total number:", len(result))

Cereal products with a rating above 60:
['100% Bran' 'All-Bran with Extra Fiber' 'Cream of Wheat (Quick)'
 'Puffed Rice' 'Puffed Wheat' 'Shredded Wheat' "Shredded Wheat 'n'Bran"
 'Shredded Wheat spoon size']
Total number: 8


In [19]:
# What is the average rating?
ratings.mean()

42.66570498701299

## Sugar

In [22]:
# Display the amount of sugars per serving for each product
sugars = raw_data[:, feature_names.index("sugars")]
print("Amount of sugars:")
print(sugars)

Amount of sugars:
['6' '8' '5' '0' '8' '10' '14' '8' '6' '5' '12' '1' '9' '7' '13' '3' '2'
 '12' '13' '7' '0' '3' '10' '5' '13' '11' '7' '10' '12' '12' '15' '9' '5'
 '3' '4' '11' '10' '11' '6' '9' '3' '6' '12' '3' '11' '11' '13' '6' '9'
 '7' '2' '10' '14' '3' '0' '0' '6' '-1' '12' '8' '6' '2' '3' '0' '0' '0'
 '15' '3' '5' '3' '14' '3' '3' '12' '3' '3' '8']


In [24]:
# Display the list of weight per serving
weights = raw_data[:, feature_names.index("weight")]
print("Weight for one serving:")
print(weights)

Weight for one serving:
['1' '1' '1' '1' '1' '1' '1' '1.33' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1.25' '1.33' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1.3' '1' '1' '1' '1' '1' '1' '1.5' '1' '1' '1.33'
 '1' '1.25' '1.33' '1' '0.5' '0.5' '1' '1' '1.33' '1' '1' '1' '1' '0.83'
 '1' '1' '1' '1' '1' '1' '1.5' '1' '1' '1' '1' '1' '1']


In [32]:
# Let's compare two products with different definitions of one serving.
# First, convert the weight strings to floats
weights = weights.astype(float)
# Product 0 (the first row):
print("Weight:", weights[0], "Sugars:", sugars[0])
# Find the products whose serving is half an ounce
index = np.where(weights == 0.5)
print("Index of products with 0.5 weight:", index)
# Product 55:
print("Weight:", weights[55], "Sugars:", sugars[55])

Weight: 1.0 Sugars: 6
Index of products with 0.5 weight: (array([54, 55], dtype=int64),)
Weight: 0.5 Sugars: 0


In [36]:
# Calculate sugar per ounce
# Convert the strings in sugars to floats
sugars = sugars.astype(float)
# np.divide(sugars, weights)
sugars / weights

array([ 6.        ,  8.        ,  5.        ,  0.        ,  8.        ,
       10.        , 14.        ,  6.01503759,  6.        ,  5.        ,
       12.        ,  1.        ,  9.        ,  7.        , 13.        ,
        3.        ,  2.        , 12.        , 13.        ,  7.        ,
        0.        ,  3.        , 10.        ,  5.        , 13.        ,
       11.        ,  7.        ,  8.        ,  9.02255639, 12.        ,
       15.        ,  9.        ,  5.        ,  3.        ,  4.        ,
       11.        , 10.        , 11.        ,  6.        ,  6.92307692,
        3.        ,  6.        , 12.        ,  3.        , 11.        ,
       11.        ,  8.66666667,  6.        ,  9.        ,  5.26315789,
        2.        ,  8.        , 10.52631579,  3.        ,  0.        ,
        0.        ,  6.        , -1.        ,  9.02255639,  8.        ,
        6.        ,  2.        ,  3.        ,  0.        ,  0.        ,
        0.        , 15.        ,  3.        ,  5.        ,  3.  

In [43]:
# Which product has the highest amount of sugar per ounce?

# Calculate the amount of sugars per ounce
sugar_per_ounce = sugars / weights

# Find the maximal value
print("Max value:", sugar_per_ounce.max())

# Find the indices of the maximal value
# np.argmax(sugar_per_ounce) would return only the first match, so use
# np.where to find every product that ties for the maximum
index_max_sugars = np.where(sugar_per_ounce == sugar_per_ounce.max())
print("Index of max value:", index_max_sugars)

# Find the names of products 30 and 66
print("Name of the product with most sugar:", raw_data[30, 0], ", ",
      raw_data[66, 0])

Max value: 15.0
Index of max value: (array([30, 66], dtype=int64),)
Name of the product with most sugar: Golden Crisp ,  Smacks
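
The sugars column contains one entry of `-1`. A negative amount of sugar is not physically meaningful, so it is presumably a placeholder for a missing measurement; that reading is an assumption, not something the data file states. Under that assumption, a boolean mask can exclude such entries before computing statistics. The sketch uses a handful of values picked from the sugars column, not the full data:

```python
import numpy as np

# A few values from the sugars column, including the -1 entry
sugars_sample = np.array([6.0, 8.0, 5.0, 0.0, -1.0, 12.0])

# Assumption: a negative amount marks a missing measurement
valid = sugars_sample >= 0

print("Mean with placeholder:", sugars_sample.mean())          # 5.0
print("Mean of valid entries:", sugars_sample[valid].mean())   # 6.2
```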
