# Week 3
# NumPy Arrays

[NumPy](https://numpy.org/) is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays.

**Resources:**
- Textbook Chapter 4
- [NumPy Documentation](https://numpy.org/doc/stable/)
- [NumPy Exercises from w3resource](https://www.w3resource.com/python-exercises/numpy/index.php)

In [None]:
import numpy as np  # np is a universally used abbreviation for numpy

## NumPy Arrays

A NumPy array is a grid of values, all of the same type, indexed by a tuple of nonnegative integers.
- **rank**: the number of dimensions
- **shape**: a tuple of integers giving the size of the array along each dimension
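
As a minimal sketch of these two terms, the example below builds three small arrays and prints the rank (available as the `ndim` attribute) and the shape of each. The names `a1`, `a2`, and `a3` are illustrative only.

```python
import numpy as np

# Hypothetical example arrays with one, two, and three dimensions
a1 = np.array([1, 2, 3])
a2 = np.array([[1, 2, 3],
               [4, 5, 6]])
a3 = np.zeros((2, 3, 4))

for name, a in [("a1", a1), ("a2", a2), ("a3", a3)]:
    # ndim is the rank; shape has one entry per dimension
    print(name, "rank:", a.ndim, "shape:", a.shape)
```

Note that the length of the shape tuple always equals the rank.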

In [None]:
# Create a numpy array from a Python list
py_list = [1, 2, 3, 4]
np_ary = np.array(py_list)
print("Numpy array:", np_ary)
print("Shape:", np_ary.shape)
print("Type:", np_ary.dtype)

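NumPy performs arithmetic on whole arrays in compiled code, so it is typically much faster than looping over a Python list, as the timing comparison below illustrates.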

In [None]:
# Time cost comparison
my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

# Use %timeit to estimate the time cost
%timeit my_arr2 = my_arr * 2

%timeit my_list2 = [x * 2 for x in my_list]
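
`%timeit` is an IPython magic command, so it only works inside Jupyter or IPython. As a rough sketch for a plain Python script, the standard `timeit` module can make a similar comparison. The array size and the number of repetitions here are arbitrary choices, and the measured times will vary by machine.

```python
import timeit

import numpy as np

my_arr = np.arange(100_000)
my_list = list(range(100_000))

# Time 10 repetitions of each approach
t_arr = timeit.timeit(lambda: my_arr * 2, number=10)
t_list = timeit.timeit(lambda: [x * 2 for x in my_list], number=10)

print(f"NumPy array: {t_arr:.4f} s for 10 runs")
print(f"Python list: {t_list:.4f} s for 10 runs")
```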

## Array Indexing

In [None]:
# Access elements using square brackets
# Ex: print the index 0, 2, 4 elements of ary
ary = np.array([3, 1, 4, 1, 5, 9])
print(ary[0])
print(ary[0], ary[2], ary[4])

In [None]:
# print the last element of ary
print(ary[-1])

In [None]:
# print the first 3 elements of ary
print(ary[0:3])
print(ary[:3])
# print the last 3 elements of ary
print(ary[-3:])
print(ary[-3:6])

In [None]:
# print the last 2 elements of ary
print(ary[-2:])
print(ary[-2:6])
print(ary[4:6])

In [None]:
# Create a 2D array
ary2d = np.array([[1, 2, 3],
                  [4, 5, 6]])
print(ary2d)

In [None]:
# Ex: print the shape and data type of ary2d



In [None]:
# Ex: print the element on the first row and the second column
print(ary2d[0][1])
print(ary2d[0, 1])

In [None]:
# Ex: print the entire first row
print(ary2d[0])
print(ary2d[0, :])

In [None]:
# Ex: print the entire first column
print(ary2d[:, 0])
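
Slices can also be combined across both dimensions to pull out a sub-block. The sketch below reuses the same 2D array; the name `right_cols` is illustrative. It also shows that a slice is a view of the original data rather than a copy.

```python
import numpy as np

ary2d = np.array([[1, 2, 3],
                  [4, 5, 6]])

# All rows, columns 1 onward
right_cols = ary2d[:, 1:]
print(right_cols)

# Last row, every other column
every_other = ary2d[-1, ::2]
print(every_other)

# A slice is a view: writing to it changes the original array
right_cols[0, 0] = 99
print(ary2d[0, 1])
```

Call `.copy()` on a slice when an independent array is needed.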

NumPy also provides many functions to create arrays:

In [None]:
ary = np.zeros((2, 3))
print(ary)
print(ary.dtype)

In [None]:
ary = np.ones((3, 2))
print(ary)

In [None]:
ary = np.random.rand(2, 2)
print(ary)

In [None]:
# What if we want to create an array of random integers?
int_ary = np.random.randint(0, 10, size=(4, 3))
print(int_ary)
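
A few other creation functions are worth knowing. The sketch below shows `np.full`, `np.arange`, and `np.linspace`, plus the `Generator` interface (`np.random.default_rng`) as an alternative to `np.random.randint`. The seed value 42 is an arbitrary choice made here for reproducibility.

```python
import numpy as np

sevens = np.full((2, 2), 7)            # every entry is 7
evens = np.arange(0, 10, 2)            # start, stop (exclusive), step
grid = np.linspace(0.0, 1.0, 5)        # 5 evenly spaced values, endpoints included
print(sevens)
print(evens)
print(grid)

# Random integers in [0, 10) from a seeded generator
rng = np.random.default_rng(seed=42)
rand_ints = rng.integers(0, 10, size=(4, 3))
print(rand_ints)
```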

## Advanced Array Indexing
NumPy arrays support **integer array indexing** and **boolean indexing**, which provide additional tools for creating a subarray.

In [None]:
# Integer array indexing: create a subarray whose indices come from another array (or list)
data_ary = np.array([0, 2, 4, 6, 8, 10])
idx_ary = [0, 4, 5]
# What values will be printed?
print(data_ary[idx_ary])

In [None]:
# A Python list does not support integer array indexing,
# so the last line raises a TypeError
data_list = [0, 2, 4, 6, 8, 10]
data_list[idx_ary]

In [None]:
# Boolean indexing: select values that satisfy some condition
ary = np.array([1, 3, -5, -2, 0, -1])
idx = (ary > 0)
# What will be printed?
print(idx)
print(ary[idx])

In [None]:
# plug in the condition directly
print(ary[ary >= 0])
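
Conditions can be combined with the elementwise operators `&` (and), `|` (or), and `~` (not). The sketch below uses the same values as above; the variable names are illustrative. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons.

```python
import numpy as np

ary = np.array([1, 3, -5, -2, 0, -1])

# Values strictly between -3 and 2
between = ary[(ary > -3) & (ary < 2)]
print(between)

# Values below -3 or above 2
outside = ary[(ary < -3) | (ary > 2)]
print(outside)

# Values that are NOT positive
not_positive = ary[~(ary > 0)]
print(not_positive)
```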

## Array Arithmetic
Basic mathematical functions operate elementwise on the arrays.

In [None]:
x = np.array([[1, 2, 3],
              [4, 5, 6]])
print(x + 1)

In [None]:
print(x * 2)

In [None]:
y = np.array([[10, 20, 30],
              [40, 50, 60]])
print(x + y)

In [None]:
print(x * y)

In [None]:
# The shapes (2, 3) and (2, 4) do not match, so this raises a ValueError
z = np.array([[-1, -2, -3, -4],
              [-5, -6, -7, -8]])
print(x + z)

In [None]:
# What does this expression do?
1 / x

In [None]:
# What does this expression do?
x > y
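
Elementwise operations need compatible shapes. NumPy's broadcasting rules allow a dimension of size 1 (or a missing leading dimension) to stretch to match the other operand, which is why `x + 1` works while `x + z` above does not. The sketch below illustrates this with two hypothetical helper arrays, `row` and `col`.

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])

row = np.array([10, 20, 30])    # shape (3,): added to every row of x
col = np.array([[100], [200]])  # shape (2, 1): added to every column of x
print(x + row)
print(x + col)

# Shapes (2, 3) and (2, 4) cannot be broadcast together
z = np.zeros((2, 4))
failed = False
try:
    x + z
except ValueError as err:
    failed = True
    print("Cannot add:", err)
```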

## NumPy Math Functions
NumPy provides many math functions that are fully compatible with NumPy arrays.

In [None]:
# Calculate the square-root of 1, 2, ..., 10
ary = np.arange(1, 11)
print(ary)
print(ary.shape)
ary2 = np.sqrt(ary)
print(ary2)

In [None]:
# Statistical functions
data = np.array([12, 34, 56, 78, 90])
print("Minimum:", data.min())
print("Maximum:", data.max())
print("Mean:", data.mean())
print("Variance:", data.var())
print("Standard deviation:", data.std())

In [None]:
# These functions can also be called as top-level NumPy functions
print(np.min(data))
print(np.max(data))
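
On a 2D array, the same statistical functions accept an `axis` argument. The sketch below uses a small hypothetical array called `scores`: `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row).

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

overall_mean = scores.mean()        # over all six values
col_means = scores.mean(axis=0)     # one mean per column
row_means = scores.mean(axis=1)     # one mean per row
row_max = scores.max(axis=1)        # largest value in each row

print("Overall mean:", overall_mean)
print("Column means:", col_means)
print("Row means:", row_means)
print("Row maxima:", row_max)
```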

# Example: 80 Cereals - Nutrition data on 80 cereal products

In this example, we will use NumPy to analyze a dataset that contains nutrition facts on 80 cereal products. In particular, we will:
- Download and load the dataset
- Explore the data for interesting information
- Examine the sugar content

The data file can be downloaded from [Kaggle.com](https://www.kaggle.com/crawford/80-cereals)

> If you like to eat cereal, do yourself a favor and avoid this dataset at all costs. After seeing these data it will never be the same for me to eat Fruity Pebbles again. - Kaggle

- Download the zip file from Kaggle (login required)
- Unzip it to get the `cereal.csv` file
- Move the CSV file into a `data` folder next to this notebook (the code below reads `data/cereal.csv`)
- Open the CSV file in a text editor (such as Notepad) or in Excel to examine its contents

## Load And Examine The Data

In [None]:
# Load the csv file with np.loadtxt()
# Spoiler alert: in the next chapter we will learn a more user-friendly
# way of loading data.
import numpy as np # import numpy again to make this section self-contained
raw_data = np.loadtxt("data/cereal.csv", delimiter=",", skiprows=1, dtype=str)
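
To see what these `np.loadtxt` arguments do without downloading anything, here is a minimal sketch that reads a tiny made-up CSV from an in-memory string. The product names and numbers are invented for illustration and are not taken from the cereal dataset.

```python
import io

import numpy as np

# Hypothetical three-row CSV with a header line
csv_text = """name,calories,rating
Alpha Flakes,110,45.5
Beta Bran,70,68.4
Gamma Puffs,120,33.0
"""

# delimiter=","  splits each line at commas
# skiprows=1     skips the header line
# dtype=str      keeps every field as text, since the columns mix names and numbers
sample = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, dtype=str)

print(sample.shape)
print(sample[:, 0])
```

Because every value is loaded as a string, numeric columns must be converted before doing arithmetic on them.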

In [None]:
# Show values in raw_data
print(raw_data[0:5, :])

In [None]:
# Ex: What is the shape of raw_data?
# len(raw_data)


In [None]:
# Ex: Create a list of feature names (call it feature_names)
feature_names = ["name","mfr","type","calories","protein","fat","sodium",
                 "fiber","carbo","sugars","potass","vitamins","shelf",
                 "weight","cups","rating"]

## Explore The Contents

In [None]:
# Display the list of cereal names
raw_data[:, 0]

In [None]:
# Display the list of cereal ratings
ratings = raw_data[:, -1]
ratings.shape

In [None]:
# Display the amount of sugars for each product.
# How can we identify the column index for sugar?

# index_sugars = feature_names.index("sugars")
# raw_data[:, index_sugars]

# raw_data[:, feature_names.index("sugars")]

raw_data[:, np.array(feature_names) == "sugars"] # use a condition to select the sugars column
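
The approaches above pick the same column but do not return the same shape. As a small sketch on a made-up table (the names `table`, `col_by_int`, and `col_by_mask` are illustrative), an integer index drops the column dimension, while a boolean mask keeps it.

```python
import numpy as np

# Hypothetical miniature version of the data
feature_names = ["name", "sugars", "rating"]
table = np.array([["A", "6", "45.5"],
                  ["B", "0", "68.4"],
                  ["C", "12", "33.0"]])

col_by_int = table[:, feature_names.index("sugars")]
col_by_mask = table[:, np.array(feature_names) == "sugars"]

print(col_by_int.shape)    # 1D: one value per row
print(col_by_mask.shape)   # 2D: a single-column table

# ravel() flattens the single-column result to 1D
flat = col_by_mask.ravel()
print(flat)
```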

In [None]:
# What is the highest rating?
ratings.max() # This will cause an error since the array contains strings

In [None]:
# Need to convert strings in ratings to floating point numbers.
ratings = ratings.astype(float)
ratings.max()
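
The conversion matters even in cases where comparing strings does not raise an error, because text is ordered character by character rather than by numeric value. The sketch below uses made-up values to show the difference.

```python
import numpy as np

text_vals = np.array(["9.5", "10.25", "100.0"])

# Comparing as text: "9..." sorts after "1...", so "9.5" is the "largest"
largest_text = max(text_vals.tolist())
print("Largest as text:", largest_text)

# Comparing as numbers after conversion
nums = text_vals.astype(float)
largest_num = nums.max()
print("Largest as number:", largest_num)
```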

In [None]:
print(ratings)

In [None]:
# Which cereal receives the highest rating?

# find the index of the largest value in an array.
index_max_rating = np.argmax(ratings)
print("Index of the max rating:", index_max_rating)

# Extract the entire row with this index
print(raw_data[index_max_rating, :])
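
`np.argmax` gives only the single best index. To rank several products, one option is `np.argsort`, which returns the indices that would sort the array. The sketch below uses invented names and ratings, not the real dataset.

```python
import numpy as np

# Hypothetical products and ratings
names = np.array(["A", "B", "C", "D", "E"])
ratings = np.array([45.5, 68.4, 33.0, 93.7, 59.6])

# argsort sorts ascending, so reverse it for highest-first order
order = np.argsort(ratings)[::-1]
top3 = order[:3]

print("Indices of the top 3:", top3)
print("Names of the top 3:", names[top3])
```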

In [None]:
# How many cereals receive a rating above 60? What are they?
print("Cereal products with a rating above 60:")
result = raw_data[ratings > 60, 0]
print(result)

# shape is a tuple, so take its first entry to get the count
print("Total number:", result.shape[0])
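
The count can also be read directly from the boolean mask, without building the filtered array first. This is a minimal sketch with invented ratings: `True` counts as 1 when summed.

```python
import numpy as np

ratings = np.array([45.5, 68.4, 33.0, 93.7, 59.6])  # hypothetical values

mask = ratings > 60
print(mask)

count_sum = mask.sum()                  # True counts as 1
count_nonzero = np.count_nonzero(mask)  # same count, more explicit name
print("Number above 60:", count_sum, count_nonzero)
```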

In [None]:
# What is the average rating?
ratings.mean()

## Sugar

In [None]:
# Display the list of sugar per serving
sugars = raw_data[:, feature_names.index("sugars")]
print("Amount of sugars:")
print(sugars)

In [None]:
# Display the list of weight per serving
weights = raw_data[:, feature_names.index("weight")]
print("Weight for one serving:")
print(weights)

In [None]:
# Let's compare two products with different definitions of one serving.
# Product at index 0:
print("Weight:", weights[0], "Sugars:", sugars[0])
# Find the products whose serving is half an ounce.
# First, convert the strings to floats
weights = weights.astype(float)
index = np.where(weights == 0.5)
print("Index of products with 0.5 weight:", index)
# Product at index 55 (one of the indices printed above):
print("Weight:", weights[55], "Sugars:", sugars[55])

In [None]:
# Calculate sugar per ounce
# Convert the strings in sugars to floats
sugars = sugars.astype(float)
sugars / weights
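
Dividing by the serving weight puts all products on the same scale. Real data often contains placeholder values as well; the sketch below assumes, purely for illustration, that a negative sugar value marks a missing entry, and replaces such values with `np.nan` before dividing. All numbers are invented.

```python
import numpy as np

# Hypothetical values; -1.0 stands in for a missing measurement (an assumption)
sugars = np.array([6.0, -1.0, 15.0, 0.0])
weights = np.array([1.0, 1.0, 1.0, 0.5])

# Replace negative placeholders with NaN so they cannot distort the result
cleaned = np.where(sugars < 0, np.nan, sugars)
per_ounce = cleaned / weights
print(per_ounce)

# nanmax / nanargmax ignore the NaN entries
print("Max sugar per ounce:", np.nanmax(per_ounce))
print("Index of the max:", np.nanargmax(per_ounce))
```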

In [None]:
# Which product has the highest amount of sugar per ounce?

# Calculate the amount of sugars per ounce
sugar_per_ounce = sugars / weights

# Find the maximal value
max_value = sugar_per_ounce.max()
print("Max value:", max_value)

# Find the indices of the maximal value
# np.argmax(sugar_per_ounce) would return only the first one;
# np.where returns every index that matches.
index_max_sugars = np.where(sugar_per_ounce == max_value)[0]
print("Index of max value:", index_max_sugars)

# Find the names of the products at those indices
print("Name of the product with most sugar:",
      ", ".join(raw_data[index_max_sugars, 0]))
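
When several entries share the maximum, `np.argmax` reports only the first of them. The sketch below, on invented values, shows one way to collect every tied index with `np.flatnonzero`; `np.isclose` is used instead of `==` as a precaution when comparing floats.

```python
import numpy as np

vals = np.array([3.0, 15.0, 7.5, 15.0])  # hypothetical values with a tie

first_max = np.argmax(vals)
print("argmax returns only the first index:", first_max)

# Indices of every entry that equals the maximum
all_max = np.flatnonzero(np.isclose(vals, vals.max()))
print("All indices of the max:", all_max)
```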