# Week 3
# NumPy Arrays

[NumPy](https://numpy.org/) is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays.

**Resources:**
- Textbook Chapter 4
- [NumPy Documentation](https://numpy.org/doc/1.19/user/quickstart.html)
- [NumPy Exercises from w3resource](https://www.w3resource.com/python-exercises/numpy/index.php)

In [1]:
import numpy as np # np is a universally used abbreviation for numpy

## NumPy Arrays

A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers.
- **rank**: number of dimensions
- **shape**: a tuple of integers giving the size of the array along each dimension
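
As a quick illustration of these two terms, here is a minimal sketch (the array names are made up for this example) that reads the rank from the `ndim` attribute and the shape from the `shape` attribute.

```python
import numpy as np

# Hypothetical example arrays
vector = np.array([1, 2, 3, 4])      # one dimension
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])       # two dimensions

print(vector.ndim, vector.shape)     # rank 1, shape (4,)
print(matrix.ndim, matrix.shape)     # rank 2, shape (2, 3)
```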

In [2]:
# Create a numpy array from a Python list
py_list = [1, 2, 3, 4]
np_ary = np.array(py_list)
print("Numpy array:", np_ary)
print("Shape:", np_ary.shape)
print("Type:", np_ary.dtype)

Numpy array: [1 2 3 4]
Shape: (4,)
Type: int32


In [8]:
# Access elements using square brackets
# Ex: print the elements of ary at indices 0, 2, and 4
ary = np.array([3, 1, 4, 1, 5, 9])
print(ary[0], ary[2], ary[4])
# print("Something")

3 4 5


In [9]:
# Ex: print the last element of ary
print(ary[-1])

9


In [11]:
# Ex: print the first 3 elements of ary
print(ary[0:3]) # The second index is excluded
print(ary[:3]) # omitting the first index means starting at 0

[3 1 4]
[3 1 4]


In [13]:
# Ex: print the last 2 elements of ary
print(ary[-2:])
print(ary[4:6])
print(ary[4:])

[5 9]
[5 9]
[5 9]


In [14]:
# Create a 2D array
ary2d = np.array([[1, 2, 3],
                  [4, 5, 6]])
print(ary2d)

[[1 2 3]
 [4 5 6]]


In [16]:
# Ex: print the shape and variable type of ary2d
print(ary2d.shape)
print(ary2d.dtype)

(2, 3)
int32


In [17]:
# Ex: print the element on the first row and the second column
print(ary2d[0, 1]) # [row_index, col_index]
print(ary2d[0][1])

2
2


In [19]:
# Ex: print the entire first row
print(ary2d[0])
print(ary2d[0, :])

[1 2 3]
[1 2 3]


In [20]:
# Ex: print the entire first column
print(ary2d[:, 0])

[1 4]


NumPy also provides many functions to create arrays:

In [21]:
ary = np.zeros((2, 3))
print(ary)

[[0. 0. 0.]
 [0. 0. 0.]]


In [22]:
ary = np.ones((3, 2))
print(ary)

[[1. 1.]
 [1. 1.]
 [1. 1.]]


In [23]:
ary = np.random.rand(2, 2) # rand() randomly draws a value between 0 (inclusive) and 1 (exclusive)
print(ary)

[[0.4444273  0.86112227]
 [0.5963166  0.17648411]]
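
A few other creation functions are worth knowing. The sketch below is not part of the original notebook; it simply shows `np.full`, `np.eye`, `np.arange`, and `np.linspace` with made-up arguments.

```python
import numpy as np

filled = np.full((2, 2), 7)      # 2x2 array where every entry is 7
identity = np.eye(3)             # 3x3 identity matrix (float entries)
steps = np.arange(0, 10, 2)      # 0, 2, 4, 6, 8 (the stop value is excluded)
grid = np.linspace(0, 1, 5)      # 5 evenly spaced values from 0 to 1, both included

print(filled)
print(identity)
print(steps)
print(grid)
```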


## Advanced Array Indexing
NumPy arrays support **integer array indexing** and **boolean indexing**, which provide additional tools to create a subarray.

In [24]:
# Integer array index: create a subarray whose indices come from another array
data_ary = np.array([0, 2, 4, 6, 8, 10])
idx_ary = [0, 4, 5]
# What values will be printed?
print(data_ary[idx_ary])

[ 0  8 10]


In [None]:
# Boolean indexing: select values that satisfy some condition
ary = np.array([1, 3, -5, -2, 0, -1])
idx = (ary > 0)
# What will be printed?
print(idx)
print(ary[idx])

In [28]:
# We can combine the two steps
print(ary[ary > 0])
print(ary[ary % 2 == 0]) # Extract even values

[1 3]
[-2  0]
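
Conditions can also be combined. The following sketch reuses the same `ary` values and assumes nothing beyond NumPy's elementwise operators `&` (and), `|` (or), and `~` (not); the result names are chosen only for this illustration.

```python
import numpy as np

ary = np.array([1, 3, -5, -2, 0, -1])

# Each comparison needs its own parentheses, because & and | bind
# more tightly than the comparison operators.
negative_and_odd = ary[(ary < 0) & (ary % 2 != 0)]
positive_or_zero = ary[(ary > 0) | (ary == 0)]
not_positive = ary[~(ary > 0)]

print(negative_and_odd)
print(positive_or_zero)
print(not_positive)
```

Python's `and` and `or` keywords do not work here, since they try to reduce the whole array to a single truth value.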


## Array Math
Basic mathematical functions operate elementwise on the arrays.

In [29]:
x = np.array([[1, 2, 3],
              [4, 5, 6]])
print(x + 1)

[[2 3 4]
 [5 6 7]]


In [30]:
print(x * 2)

[[ 2  4  6]
 [ 8 10 12]]


In [31]:
y = np.array([[10, 20, 30],
              [40, 50, 60]])
print(x + y)

[[11 22 33]
 [44 55 66]]


In [33]:
print(x * y) # Elementwise product. For a matrix product use np.dot() or @ with compatible shapes, e.g. x @ y.T

[[ 10  40  90]
 [160 250 360]]


## NumPy Math Functions
NumPy provides many math functions that are fully compatible with NumPy arrays.

In [34]:
# Calculate the square-root of 1, 2, ..., 10
ary = np.arange(1, 11)
print(ary)
ary2 = np.sqrt(ary)
print(ary2)

[ 1  2  3  4  5  6  7  8  9 10]
[1.         1.41421356 1.73205081 2.         2.23606798 2.44948974
 2.64575131 2.82842712 3.         3.16227766]


In [None]:
# Statistical functions
data = np.array([12, 34, 56, 78, 90])
print("Minimum:", data.min())
print("Maximum:", data.max())
print("Mean:", data.mean())
print("Variance:", data.var()) # Variance measures how spread out the values are
print("Standard deviation:", data.std()) # std is in the same units as the data, so comparing
                                         # it with the original values puts the spread in perspective.
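
To make the variance and standard deviation concrete, the sketch below recomputes them by hand for the same five values. By default `var()` and `std()` divide by the number of values (the population formulas), which is what the manual calculation does.

```python
import numpy as np

data = np.array([12, 34, 56, 78, 90])

mean = data.sum() / data.size
variance = ((data - mean) ** 2).sum() / data.size   # average squared distance from the mean
std = np.sqrt(variance)                             # back in the same units as the data

print(mean, variance, std)
print(np.isclose(variance, data.var()), np.isclose(std, data.std()))
```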

# Example: 80 Cereals - Nutrition data on 80 cereal products

In this example, we will use NumPy to analyze a dataset that contains nutrition facts on 80 cereal products. In particular, we will:
- Download and load the dataset.
- Explore the data for interesting information.
- Examine sugar content

The data file can be downloaded from [Kaggle.com](https://www.kaggle.com/crawford/80-cereals).

>If you like to eat cereal, do yourself a favor and avoid this dataset at all costs. After seeing these data it will never be the same for me to eat Fruity Pebbles again. - Kaggle

- Download the zip file from Kaggle (login required)
- Unzip it to get the `cereal.csv` file
- Move the csv file to the same folder where you saved the Python notebook
- Open the csv file in Notepad or Excel to examine its contents

## Load And Examine The Data

In [37]:
# Load the csv file with np.loadtxt()
# Spoiler alert: in the next chapter we will learn a more user-friendly
# way of loading data.

raw_data = np.loadtxt("cereal.csv", delimiter=',', dtype='str')
print(raw_data)

[['name' 'mfr' 'type' ... 'weight' 'cups' 'rating']
 ['100% Bran' 'N' 'C' ... '1' '0.33' '68.402973']
 ['100% Natural Bran' 'Q' 'C' ... '1' '1' '33.983679']
 ...
 ['Wheat Chex' 'R' 'C' ... '1' '0.67' '49.787445']
 ['Wheaties' 'G' 'C' ... '1' '1' '51.592193']
 ['Wheaties Honey Gold' 'G' 'C' ... '1' '0.75' '36.187559']]
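
If you do not have the file at hand, the same loading pattern can be tried on a tiny in-memory CSV. This is only a sketch: the three-line text and its product names are invented, and `io.StringIO` stands in for a real file.

```python
import io
import numpy as np

# Hypothetical CSV content standing in for a real file
csv_text = "name,calories,rating\nBran Bits,70,68.4\nSugar Puffs,110,29.5\n"

table = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=str)
header = table[0, :]                     # first row holds the column names
rows = table[1:, :]                      # remaining rows hold the values, still as strings
calories = rows[:, 1].astype(float)      # convert one column to numbers

print(table.shape)
print(header)
print(calories)
```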


In [39]:
# Show values in raw_data
print(raw_data[0, :]) # shows the first row
print(raw_data[:, 0]) # shows the first column

['name' 'mfr' 'type' 'calories' 'protein' 'fat' 'sodium' 'fiber' 'carbo'
 'sugars' 'potass' 'vitamins' 'shelf' 'weight' 'cups' 'rating']
['name' '100% Bran' '100% Natural Bran' 'All-Bran'
 'All-Bran with Extra Fiber' 'Almond Delight' 'Apple Cinnamon Cheerios'
 'Apple Jacks' 'Basic 4' 'Bran Chex' 'Bran Flakes' "Cap'n'Crunch"
 'Cheerios' 'Cinnamon Toast Crunch' 'Clusters' 'Cocoa Puffs' 'Corn Chex'
 'Corn Flakes' 'Corn Pops' 'Count Chocula' "Cracklin' Oat Bran"
 'Cream of Wheat (Quick)' 'Crispix' 'Crispy Wheat & Raisins' 'Double Chex'
 'Froot Loops' 'Frosted Flakes' 'Frosted Mini-Wheats'
 'Fruit & Fibre Dates; Walnuts; and Oats' 'Fruitful Bran' 'Fruity Pebbles'
 'Golden Crisp' 'Golden Grahams' 'Grape Nuts Flakes' 'Grape-Nuts'
 'Great Grains Pecan' 'Honey Graham Ohs' 'Honey Nut Cheerios' 'Honey-comb'
 'Just Right Crunchy  Nuggets' 'Just Right Fruit & Nut' 'Kix' 'Life'
 'Lucky Charms' 'Maypo' 'Muesli Raisins; Dates; & Almonds'
 'Muesli Raisins; Peaches; & Pecans' 'Mueslix Crispy Blend'
 'Mu

In [40]:
# Ex: What is the shape of raw_data?
print(raw_data.shape)

(78, 16)


In [43]:
# Ex: Create a list of feature names (call it feature_names)
feature_names = raw_data[0, :] # variable names in Python follow the snake_case style
print(feature_names)

['name' 'mfr' 'type' 'calories' 'protein' 'fat' 'sodium' 'fiber' 'carbo'
 'sugars' 'potass' 'vitamins' 'shelf' 'weight' 'cups' 'rating']


In [41]:
# Split raw_data into feature_names and data
data = raw_data[1:, :]
print(data)

[['100% Bran' 'N' 'C' ... '1' '0.33' '68.402973']
 ['100% Natural Bran' 'Q' 'C' ... '1' '1' '33.983679']
 ['All-Bran' 'K' 'C' ... '1' '0.33' '59.425505']
 ...
 ['Wheat Chex' 'R' 'C' ... '1' '0.67' '49.787445']
 ['Wheaties' 'G' 'C' ... '1' '1' '51.592193']
 ['Wheaties Honey Gold' 'G' 'C' ... '1' '0.75' '36.187559']]


## Explore The Contents

In [42]:
# Display the list of cereal names
cereal_names = data[:, 0] # raw_data[1:,0]
print(cereal_names)

['100% Bran' '100% Natural Bran' 'All-Bran' 'All-Bran with Extra Fiber'
 'Almond Delight' 'Apple Cinnamon Cheerios' 'Apple Jacks' 'Basic 4'
 'Bran Chex' 'Bran Flakes' "Cap'n'Crunch" 'Cheerios'
 'Cinnamon Toast Crunch' 'Clusters' 'Cocoa Puffs' 'Corn Chex'
 'Corn Flakes' 'Corn Pops' 'Count Chocula' "Cracklin' Oat Bran"
 'Cream of Wheat (Quick)' 'Crispix' 'Crispy Wheat & Raisins' 'Double Chex'
 'Froot Loops' 'Frosted Flakes' 'Frosted Mini-Wheats'
 'Fruit & Fibre Dates; Walnuts; and Oats' 'Fruitful Bran' 'Fruity Pebbles'
 'Golden Crisp' 'Golden Grahams' 'Grape Nuts Flakes' 'Grape-Nuts'
 'Great Grains Pecan' 'Honey Graham Ohs' 'Honey Nut Cheerios' 'Honey-comb'
 'Just Right Crunchy  Nuggets' 'Just Right Fruit & Nut' 'Kix' 'Life'
 'Lucky Charms' 'Maypo' 'Muesli Raisins; Dates; & Almonds'
 'Muesli Raisins; Peaches; & Pecans' 'Mueslix Crispy Blend'
 'Multi-Grain Cheerios' 'Nut&Honey Crunch' 'Nutri-Grain Almond-Raisin'
 'Nutri-grain Wheat' 'Oatmeal Raisin Crisp' 'Post Nat. Raisin Bran'
 'Product

In [45]:
# Display the list of cereal ratings
cereal_rating = data[:, -1]
print(cereal_rating)

['68.402973' '33.983679' '59.425505' '93.704912' '34.384843' '29.509541'
 '33.174094' '37.038562' '49.120253' '53.313813' '18.042851' '50.764999'
 '19.823573' '40.400208' '22.736446' '41.445019' '45.863324' '35.782791'
 '22.396513' '40.448772' '64.533816' '46.895644' '36.176196' '44.330856'
 '32.207582' '31.435973' '58.345141' '40.917047' '41.015492' '28.025765'
 '35.252444' '23.804043' '52.076897' '53.371007' '45.811716' '21.871292'
 '31.072217' '28.742414' '36.523683' '36.471512' '39.241114' '45.328074'
 '26.734515' '54.850917' '37.136863' '34.139765' '30.313351' '40.105965'
 '29.924285' '40.692320' '59.642837' '30.450843' '37.840594' '41.503540'
 '60.756112' '63.005645' '49.511874' '50.828392' '39.259197' '39.703400'
 '55.333142' '41.998933' '40.560159' '68.235885' '74.472949' '72.801787'
 '31.230054' '53.131324' '59.363993' '38.839746' '28.592785' '46.658844'
 '39.106174' '27.753301' '49.787445' '51.592193' '36.187559']


In [49]:
# What is the highest rating?
# First, let's convert the dtype to float
cereal_rating_float = cereal_rating.astype(float)
print(cereal_rating_float.dtype)
print(cereal_rating_float.max())

float64
93.704912


In [50]:
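# Alternative approach: max() on the string array compares text, not numbers.
# It gives the right answer here only because every rating has two digits
# before the decimal point; prefer the float conversion above.
print(max(cereal_rating))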

93.704912
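
Applying `max()` to an array of strings compares text character by character rather than comparing numeric values. The sketch below uses made-up ratings to show how the two can disagree.

```python
import numpy as np

# Hypothetical ratings stored as strings
ratings_str = np.array(["9.5", "10.2", "100.0"])

text_max = max(ratings_str)                      # "9" sorts after "1", so "9.5" wins
numeric_max = ratings_str.astype(float).max()    # compares the actual numbers

print(text_max)
print(numeric_max)
```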


In [53]:
# Which cereal receives the highest rating?
# Plan: find out the row index of the highest rating
#   -> find out the product name on that row
highest_row = np.argmax(cereal_rating_float)
print(highest_row)
print(cereal_rating_float[highest_row])
highest_name = data[highest_row, 0]
print(highest_name)

3
93.704912
All-Bran with Extra Fiber
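
The same idea extends from the single best item to the top few. The sketch below uses `np.argsort` on invented names and scores (not the cereal data) to list the three highest.

```python
import numpy as np

# Hypothetical names and scores
names = np.array(["A", "B", "C", "D", "E"])
scores = np.array([42.0, 93.7, 18.0, 68.4, 59.4])

order = np.argsort(scores)[::-1]    # row indices from highest to lowest score
top3 = names[order[:3]]             # integer array indexing picks the matching names

print(order)
print(top3)
```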


In [60]:
# How many cereals receive a rating above 60? Which ones are they?
# boolean indexing
cereal_above_60 = (cereal_rating_float > 60)
print(cereal_above_60)
print(np.sum(cereal_above_60)) # 8 products receive rating above 60
# print(data[cereal_above_60, 0])
print(data[(cereal_rating_float > 60), 0])

[ True False False  True False False False False False False False False
 False False False False False False False False  True False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False  True  True False False False False
 False False False  True  True  True False False False False False False
 False False False False False]
8
['100% Bran' 'All-Bran with Extra Fiber' 'Cream of Wheat (Quick)'
 'Puffed Rice' 'Puffed Wheat' 'Shredded Wheat' "Shredded Wheat 'n'Bran"
 'Shredded Wheat spoon size']


In [62]:
# What is the average rating?
print(cereal_rating_float.mean())
print(np.mean(cereal_rating_float))

42.66570498701299
42.66570498701299


## Sugar

In [63]:
# Display the list of sugar per serving
print(data[:, (feature_names == "sugars")])

[['6']
 ['8']
 ['5']
 ['0']
 ['8']
 ['10']
 ['14']
 ['8']
 ['6']
 ['5']
 ['12']
 ['1']
 ['9']
 ['7']
 ['13']
 ['3']
 ['2']
 ['12']
 ['13']
 ['7']
 ['0']
 ['3']
 ['10']
 ['5']
 ['13']
 ['11']
 ['7']
 ['10']
 ['12']
 ['12']
 ['15']
 ['9']
 ['5']
 ['3']
 ['4']
 ['11']
 ['10']
 ['11']
 ['6']
 ['9']
 ['3']
 ['6']
 ['12']
 ['3']
 ['11']
 ['11']
 ['13']
 ['6']
 ['9']
 ['7']
 ['2']
 ['10']
 ['14']
 ['3']
 ['0']
 ['0']
 ['6']
 ['-1']
 ['12']
 ['8']
 ['6']
 ['2']
 ['3']
 ['0']
 ['0']
 ['0']
 ['15']
 ['3']
 ['5']
 ['3']
 ['14']
 ['3']
 ['3']
 ['12']
 ['3']
 ['3']
 ['8']]
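
Selecting a column with a boolean mask keeps the column axis, which is why each value above is printed in its own brackets. A minimal sketch on a made-up miniature table shows the difference from selecting by integer position:

```python
import numpy as np

# Hypothetical miniature table
feature_names = np.array(["name", "sugars", "weight"])
data = np.array([["A", "6", "1"],
                 ["B", "8", "1.33"]])

col_2d = data[:, feature_names == "sugars"]               # boolean mask: result has shape (2, 1)
col_idx = np.flatnonzero(feature_names == "sugars")[0]    # integer position of the column
col_1d = data[:, col_idx]                                 # integer index: result has shape (2,)

print(col_2d.shape, col_1d.shape)
```

Either form works for elementwise math as long as both operands have the same shape.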


In [64]:
# Display the list of weight per serving
print(data[:, (feature_names == "weight")])

[['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1.33']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1.25']
 ['1.33']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1.3']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1.5']
 ['1']
 ['1']
 ['1.33']
 ['1']
 ['1.25']
 ['1.33']
 ['1']
 ['0.5']
 ['0.5']
 ['1']
 ['1']
 ['1.33']
 ['1']
 ['1']
 ['1']
 ['1']
 ['0.83']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1.5']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']
 ['1']]


In [65]:
# Calculate sugar per ounce
# sugar per ounce = sugar per serving / ounce per serving
sugar_per_serving = data[:, (feature_names == "sugars")].astype(float)
ounce_per_serving = data[:, (feature_names == "weight")].astype(float)
sugar_per_ounce = sugar_per_serving / ounce_per_serving
print(sugar_per_ounce)

[[ 6.        ]
 [ 8.        ]
 [ 5.        ]
 [ 0.        ]
 [ 8.        ]
 [10.        ]
 [14.        ]
 [ 6.01503759]
 [ 6.        ]
 [ 5.        ]
 [12.        ]
 [ 1.        ]
 [ 9.        ]
 [ 7.        ]
 [13.        ]
 [ 3.        ]
 [ 2.        ]
 [12.        ]
 [13.        ]
 [ 7.        ]
 [ 0.        ]
 [ 3.        ]
 [10.        ]
 [ 5.        ]
 [13.        ]
 [11.        ]
 [ 7.        ]
 [ 8.        ]
 [ 9.02255639]
 [12.        ]
 [15.        ]
 [ 9.        ]
 [ 5.        ]
 [ 3.        ]
 [ 4.        ]
 [11.        ]
 [10.        ]
 [11.        ]
 [ 6.        ]
 [ 6.92307692]
 [ 3.        ]
 [ 6.        ]
 [12.        ]
 [ 3.        ]
 [11.        ]
 [11.        ]
 [ 8.66666667]
 [ 6.        ]
 [ 9.        ]
 [ 5.26315789]
 [ 2.        ]
 [ 8.        ]
 [10.52631579]
 [ 3.        ]
 [ 0.        ]
 [ 0.        ]
 [ 6.        ]
 [-1.        ]
 [ 9.02255639]
 [ 8.        ]
 [ 6.        ]
 [ 2.        ]
 [ 3.        ]
 [ 0.        ]
 [ 0.        ]
 [ 0.        ]
 [15.     

In [68]:
# Which product has the highest amount of sugar per ounce?
# 1. find the index of highest value
# 2. show the name of that row
print("Highest sugar per ounce:", sugar_per_ounce.max())
highest_sugar_row = np.argmax(sugar_per_ounce)
print("Index of this product:", highest_sugar_row)
print("Name of this product:", data[highest_sugar_row, 0])

Highest sugar per ounce: 15.0
Index of this product: 30
Name of this product: Golden Crisp
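
Two details in the sugar column are worth a closer look: the value 15 appears in more than one row, and one row contains -1. The sketch below uses invented values and assumes that -1 marks a missing measurement, which the notebook does not state. It shows that `np.argmax` reports only the first maximum, and how to list every tied row and skip the placeholder.

```python
import numpy as np

# Hypothetical values; -1 is assumed to stand for a missing entry
values = np.array([15.0, 3.0, -1.0, 15.0, 9.0])

first_max = np.argmax(values)                       # position of the first maximum only
all_max = np.flatnonzero(values == values.max())    # positions of every tied maximum
valid_mean = values[values >= 0].mean()             # mean computed without the -1 entry

print(first_max)
print(all_max)
print(valid_mean)
```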
