<a href="https://colab.research.google.com/github/evecount/6m-data-1.6-intro-numpy/blob/main/notebooks/numpy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 1.6: Introduction to NumPy

## Introduction
**NumPy**, short for Numerical Python, is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Many computational and data science packages use NumPy as the main building block. It is a fundamental library for scientific computing in Python.

### Key Features of NumPy:
* **ndarray**: An efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities.
* **Vectorization**: Mathematical functions for fast operations on entire arrays of data without having to write loops.
* **Linear Algebra and More**: Tools for linear algebra, random number generation, and Fourier transforms.
* **C API**: For connecting NumPy with libraries written in C, C++, or FORTRAN.

### Advantages over Python Lists:
1. **Contiguous Memory**: NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. This allows for significantly faster access and manipulation.
2. **Vectorized Operations**: NumPy algorithms written in C can operate on this memory without type checking or other Python overhead, performing complex computations without slow `for` loops.
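As a rough illustration of the two points above, the sketch below inspects how an array's data is stored and runs the same sum once in NumPy and once as a Python-level loop. The variable names are illustrative and not part of the lesson.

```python
import numpy as np

arr = np.arange(10, dtype=np.int64)

# Every element has the same fixed size, so the buffer is itemsize * size bytes.
print(arr.itemsize)               # 8 bytes per int64 element
print(arr.nbytes)                 # 80 bytes in the data buffer
print(arr.flags['C_CONTIGUOUS'])  # True: stored as one contiguous block

# The same reduction, once in compiled code and once as a Python loop.
vectorized_total = arr.sum()
loop_total = 0
for value in arr.tolist():
    loop_total += value
print(vectorized_total == loop_total)  # True
```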

![numpy_vs_list](https://github.com/evecount/6m-data-1.6-intro-numpy/blob/main/assets/numpy_vs_python_list.png?raw=1)

## Part 1: Performance Benchmark
To give you an idea of the performance difference, consider a NumPy array of one million integers and an equivalent Python list. We use the IPython `%timeit` magic command to measure execution time; exact timings vary from machine to machine.
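`%timeit` only works inside IPython or Jupyter. In a plain Python script, the standard-library `timeit` module can give a comparable measurement. The sketch below is one way to do that; the smaller array size and the repeat count of 20 are assumptions chosen to keep the run short.

```python
import timeit

import numpy as np

n = 100_000  # smaller than the lesson's one million, to keep the run short
my_arr = np.arange(n)
my_list = list(range(n))

# timeit.timeit returns the total seconds taken by `number` calls.
numpy_seconds = timeit.timeit(lambda: my_arr * 2, number=20)
list_seconds = timeit.timeit(lambda: [x * 2 for x in my_list], number=20)

print(f"NumPy: {numpy_seconds / 20 * 1e3:.3f} ms per call")
print(f"List:  {list_seconds / 20 * 1e3:.3f} ms per call")

# Both approaches produce the same values.
same_values = (my_arr * 2).tolist() == [x * 2 for x in my_list]
print(same_values)  # True
```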

In [1]:
import numpy as np
my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

print("NumPy Vectorized Multiplication (my_arr * 2):")
%timeit my_arr2 = my_arr * 2

print("\nPython List Comprehension ([x * 2 for x in my_list]):")
%timeit my_list2 = [x * 2 for x in my_list]

NumPy Vectorized Multiplication (my_arr * 2):
1.82 ms ± 762 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Python List Comprehension ([x * 2 for x in my_list]):
71.5 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Part 2: The ndarray (N-dimensional array)
The `ndarray` is a fast, flexible container for large datasets. It is a multidimensional array of fixed size with **homogeneous** elements (all elements must be of the same type).

Every array has:
* **shape**: A tuple indicating the size of each dimension.
* **dtype**: An object describing the data type of the array.
* **ndim**: The number of dimensions (axes).
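A short sketch of these attributes, and of what "homogeneous" means in practice: when the input mixes integers and floats, NumPy picks one type that can hold every value. The array names here are illustrative.

```python
import numpy as np

mixed = np.array([1, 2.5, 3])        # two ints and a float
print(mixed.dtype)                   # float64: the ints were upcast
print(mixed.shape, mixed.ndim)       # (3,) 1

grid = np.zeros((2, 3, 4))           # three axes
print(grid.shape, grid.ndim)         # (2, 3, 4) 3
print(grid.dtype)                    # float64, the default for np.zeros
```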

### ndarray illustration
![ndarray](https://github.com/evecount/6m-data-1.6-intro-numpy/blob/main/assets/numpy_ndarray.png?raw=1)

In [3]:
# [DEMO] Creating arrays from sequences
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)

print(f"Array 1:\n{arr1}")
print(f"Shape: {arr1.shape}, Dtype: {arr1.dtype}, Dimensions: {arr1.ndim}")

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)

print(f"Array 2:\n{arr2}")
print(f"Shape: {arr2.shape}, Dtype: {arr2.dtype}, Dimensions: {arr2.ndim}")

Array 1:
[6.  7.5 8.  0.  1. ]
Shape: (5,), Dtype: float64, Dimensions: 1
Array 2:
[[1 2 3 4]
 [5 6 7 8]]
Shape: (2, 4), Dtype: int64, Dimensions: 2


### Data Types and Casting
NumPy supports specific numerical types like `int32`, `float64`, etc. You can explicitly convert an array from one `dtype` to another using the `astype` method.

**Note:** If you cast floating-point numbers to an integer `dtype`, the fractional part is truncated (rounded toward zero), not rounded to the nearest integer.
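To make the difference visible on negative numbers, the sketch below compares plain casting with two explicit alternatives, `np.floor` and `np.rint`. The sample values are made up for illustration.

```python
import numpy as np

values = np.array([3.7, -1.2, 0.5, -2.9])

truncated = values.astype(np.int32)          # drops the fractional part
floored = np.floor(values).astype(np.int32)  # rounds toward negative infinity
rounded = np.rint(values).astype(np.int32)   # rounds to nearest, ties to even

print(truncated)  # [ 3 -1  0 -2]
print(floored)    # [ 3 -2  0 -3]
print(rounded)    # [ 4 -1  0 -3]
```

`astype` always returns a new array, so `values` itself is still a `float64` array afterwards.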

In [4]:
# [DEMO] Casting arrays
arr = np.array([3.7, -1.2, 0.5, 12.9])
print("Original:", arr)
print("Casted to int32:", arr.astype(np.int32))

Original: [ 3.7 -1.2  0.5 12.9]
Casted to int32: [ 3 -1  0 12]


### [EXERCISE 1: Creation & Casting]
1. Create a 3x4 array of all ones using `np.ones()`.
2. Cast this array to `float32`.
3. Create an array of strings representing numbers: `['1.25', '-9.6', '42']`. Cast it to `float`.

In [7]:
# 1. Create a 3x4 array of all ones using np.ones()
ones_array = np.ones((3, 4))
print(f"1. Original 3x4 array of ones:\n{ones_array}")

# 2. Cast this array to float32
ones_array_float32 = ones_array.astype(np.float32)
print(f"2. Casted to float32:\n{ones_array_float32}\nDtype: {ones_array_float32.dtype}")

# 3. Create an array of strings representing numbers: ['1.25', '-9.6', '42']
string_array = np.array(['1.25', '-9.6', '42'])
print(f"\n3. Original string array:\n{string_array}\nDtype: {string_array.dtype}")

# 4. Cast it to float
float_array = string_array.astype(float)
print(f"4. Casted to float:\n{float_array}\nDtype: {float_array.dtype}")

# 5. Cast the float array to integer
int_array = float_array.astype(int)
print(f"\n5. Casted to integer:\n{int_array}\nDtype: {int_array.dtype}")

1. Original 3x4 array of ones:
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
2. Casted to float32:
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
Dtype: float32

3. Original string array:
['1.25' '-9.6' '42']
Dtype: <U4
4. Casted to float:
[ 1.25 -9.6  42.  ]
Dtype: float64

5. Casted to integer:
[ 1 -9 42]
Dtype: int64


## Part 3: Arithmetic & Broadcasting
Arithmetic operations are applied as batch operations without `for` loops. **Broadcasting** describes how arithmetic works between arrays of different shapes.

![vectorization](https://github.com/evecount/6m-data-1.6-intro-numpy/blob/main/assets/vectorization.png?raw=1)

Example: A scalar value being replicated (broadcast) to match the shape of a larger array.
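Broadcasting is not limited to scalars. The sketch below, using made-up arrays, adds a scalar, a row, and a column to the same 2x3 array, then shows what happens when the shapes cannot be matched.

```python
import numpy as np

arr = np.array([[1., 2., 3.],
                [4., 5., 6.]])        # shape (2, 3)

row = np.array([10., 20., 30.])       # shape (3,)
col = np.array([[100.], [200.]])      # shape (2, 1)

plus_scalar = arr + 1     # the scalar is stretched to every element
plus_row = arr + row      # the row is stretched down both rows
plus_col = arr + col      # the column is stretched across all three columns

print(plus_scalar)
print(plus_row)
print(plus_col)

# Shapes that cannot be stretched to match raise an error.
try:
    arr + np.array([1., 2.])          # shape (2,) vs. a trailing axis of length 3
except ValueError as err:
    print("ValueError:", err)
```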

In [8]:
# [DEMO] Arithmetic & Broadcasting
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
print("Element-wise multiplication (arr * arr):\n", arr * arr)
print("\nBroadcasting scalar (1 / arr):\n", 1 / arr)

Element-wise multiplication (arr * arr):
 [[ 1.  4.  9.]
 [16. 25. 36.]]

Broadcasting scalar (1 / arr):
 [[1.         0.5        0.33333333]
 [0.25       0.2        0.16666667]]


## Part 4: Indexing and Slicing
One-dimensional arrays are indexed and sliced much like Python lists. In 2D arrays, you can index a single element with `[row, column]` syntax.
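A minimal sketch of the `[row, column]` syntax on a small 3x3 array:

```python
import numpy as np

arr2d = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

print(arr2d[0, 2])    # 3: row 0, column 2
print(arr2d[0][2])    # 3: same element, reached in two indexing steps
print(arr2d[2])       # [7 8 9]: a single index selects a whole row
print(arr2d[-1, -1])  # 9: negative indices count from the end
```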

### 2D Array Indexing Syntax
![2d_array_indexing](https://github.com/evecount/6m-data-1.6-intro-numpy/blob/main/assets/ndarray_axis_index.png?raw=1)

**Important:** Array slices are **views** on the original array. This means data is not copied, and modifications to the slice will be reflected in the source array.
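When you need an independent array instead of a view, one option is an explicit `.copy()`. The sketch below contrasts the two and uses `np.shares_memory` to confirm which one is tied to the source.

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]          # shares memory with arr
copy = arr[5:8].copy()   # independent data

copy[0] = -1             # does not touch arr
view[0] = 99             # writes through to arr[5]

print(arr)                           # [ 0  1  2  3  4 99  6  7  8  9]
print(np.shares_memory(arr, view))   # True
print(np.shares_memory(arr, copy))   # False
```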

In [10]:
# [DEMO] Slicing views
arr = np.arange(10) # arr is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
arr_slice = arr[5:8] # arr_slice is a view containing [5, 6, 7]
                     # It refers to arr[5], arr[6], arr[7]

arr_slice[1] = 12345 # This changes the element at index 1 of arr_slice,
                     # which corresponds to arr[6].
                     # So, arr[6] becomes 12345.
print("Original array modified via slice:", arr)

# [DEMO] 2D Slicing
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(
  "\nFirst two rows, columns 1 onwards:\n", arr2d[:2, 1:]
)

Original array modified via slice: [    0     1     2     3     4     5 12345     7     8     9]

First two rows, columns 1 onwards:
 [[2 3]
 [5 6]]


### [EXERCISE 2: The Logic of Slicing]
1. Select the first column of `arr2d` using a slice.
2. Set all values in the second row to 0.
3. **Socratic Prompt:** How does `arr2d[1]` differ from `arr2d[1, :]`? (Hint: check shapes)

In [11]:
# 1. Select the first column of arr2d using a slice.
print("Original arr2d:\n", arr2d)
first_column = arr2d[:, 0]
print("\nFirst column of arr2d:\n", first_column)

# 2. Set all values in the second row to 0.
arr2d[1, :] = 0
print("\narr2d after setting second row to 0:\n", arr2d)

Original arr2d:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]

First column of arr2d:
 [1 4 7]

arr2d after setting second row to 0:
 [[1 2 3]
 [0 0 0]
 [7 8 9]]


### Understanding Slicing Syntax (`:` and 0-based indexing)

In Python (and thus NumPy), indexing is **0-based**. This means the first element of an array, row, or column is always at index `0`.

*   `arr[0]` refers to the first element.
*   `arr[1]` refers to the second element, and so on.

The colon symbol (`:`) is used in slicing to indicate ranges:

*   **`:` alone** (e.g., `arr[:, 0]`) means 'all elements along this axis'. In `arr[:, 0]`, it means 'all rows', taking the first column.
*   **`start:end`** (e.g., `arr[5:8]`) means elements from `start` up to (but not including) `end`.
*   **`start:`** (e.g., `arr[1:]`) means elements from `start` to the end.
*   **`:end`** (e.g., `arr[:2]`) means elements from the beginning up to (but not including) `end`.
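The slice forms above can be tried on a small array. The second half of the sketch uses a made-up 3x4 `grid` to show that an integer index drops an axis while a slice keeps it.

```python
import numpy as np

arr = np.arange(10)            # [0 1 2 3 4 5 6 7 8 9]
print(arr[5:8])                # [5 6 7]
print(arr[1:])                 # [1 2 3 4 5 6 7 8 9]
print(arr[:2])                 # [0 1]
print(arr[:])                  # every element

grid = np.arange(12).reshape(3, 4)
print(grid[:, 0])              # [0 4 8]: all rows, first column
print(grid[:, 0].shape)        # (3,): an integer index drops that axis
print(grid[:, 0:1].shape)      # (3, 1): a slice keeps the axis
```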

## Other Important NumPy Symbols and Concepts

Beyond basic indexing and slicing, NumPy uses several other symbols and functions that are crucial for effective numerical computing:

1.  **Arithmetic Operators (`+`, `-`, `*`, `/`, `**`, `%`):**
    These perform **element-wise** operations by default. If shapes are compatible (due to broadcasting rules), these operators will apply the operation to each corresponding element.
    *   `arr1 + arr2`: Element-wise addition
    *   `arr1 * arr2`: Element-wise multiplication (not matrix multiplication!)
    *   `arr ** 2`: Element-wise squaring

2.  **Matrix Multiplication (`@` or `np.dot()`):**
    *   `arr1 @ arr2`: The matrix multiplication operator (Python 3.5+). For 2-D arrays, `A @ B` is the standard matrix product.
    *   `np.dot(arr1, arr2)`: The function form. It gives the same result as `@` for 1-D and 2-D arrays.

3.  **Comparison Operators (`==`, `!=`, `<`, `>`, `<=`, `>=`):**
    These also perform **element-wise** comparisons and return a boolean array of the same shape.
    *   `arr == value`: Returns `True` where elements are equal to `value`, `False` otherwise.
    *   `arr > threshold`: Returns `True` where elements are greater than `threshold`.

4.  **Logical Operators (for boolean arrays):**
    These are used to combine boolean arrays element-wise.
    *   `&` (AND): `(arr > 0) & (arr < 10)` – selects elements that are both greater than 0 AND less than 10.
    *   `|` (OR): `(arr == 'Bob') | (arr == 'Will')` – selects elements that are 'Bob' OR 'Will'.
    *   `~` (NOT): `~(arr == 'Bob')` – selects elements that are NOT 'Bob'.
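    *   **Important:** Use `&`, `|`, and `~` for element-wise logic on NumPy boolean arrays, not `and`, `or`, `not`. Those keywords expect a single truth value and raise a `ValueError` on arrays with more than one element. Wrap each comparison in parentheses, because `&` and `|` bind more tightly than comparison operators.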

5.  **`.T` (Transpose):**
    *   `matrix.T`: Returns the transpose of the array (swaps rows and columns).

6.  **Methods and Attributes (e.g., `.shape`, `.dtype`, `.mean()`, `.sum()`):**
    *   `.`: The dot operator is used to access attributes (like `.shape`, `.dtype`, `.ndim`) or call methods (like `.mean()`, `.sum()`, `.astype()`) on a NumPy array object.

7.  **Ellipsis in Indexing (`...`):**
    *   `...` (Ellipsis): Stands in for as many full slices (`:`) as are needed to cover the remaining dimensions. E.g., `arr[..., 0]` selects index 0 along the last axis, regardless of how many preceding dimensions there are.
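The operators in this list can be compared side by side on two small matrices. This is a minimal sketch; the arrays `a`, `b`, and `cube` are made up for illustration.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

elementwise = a * b                  # [[ 10  40] [ 90 160]]
matrix = a @ b                       # [[ 70 100] [150 220]]

greater = a > 2                      # [[False False] [ True  True]]
in_range = (a > 1) & (a < 4)         # [[False  True] [ True False]]
not_two = ~(a == 2)                  # [[ True False] [ True  True]]

transposed = a.T                     # [[1 3] [2 4]]

cube = np.arange(24).reshape(2, 3, 4)
first_of_last_axis = cube[..., 0]    # same as cube[:, :, 0], shape (2, 3)

for result in (elementwise, matrix, greater, in_range, not_two, transposed):
    print(result)
print(first_of_last_axis.shape)

# The keyword `and` asks for one truth value, which a 2x2 array cannot give.
try:
    (a > 1) and (a < 4)
except ValueError as err:
    print("ValueError:", err)
```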


## Part 5: Boolean Indexing
Like arithmetic operations, comparisons (such as `==`) with arrays are vectorized. This yields a boolean array which can be used to filter data. The example in this part uses students' exam scores to find what Bob scored.

### Systematic Boolean Indexing Explained

Boolean indexing is a systematic way to filter or select data in NumPy arrays based on a condition. It generally works in two steps:

1.  **Create a Boolean Mask (the Condition):**
    *   You start with an array (let's call it `condition_array`) and apply a condition to it. This condition is usually a comparison operator (`==`, `!=`, `>`, `<`, `>=`, `<=`).
    *   The result of this comparison is a new array, called a **boolean mask**, where each element is either `True` or `False`.
    *   `True` indicates that the element in `condition_array` met the condition.
    *   `False` indicates it did not.
    *   **Crucially, the boolean mask must match the length of the axis it indexes (here, one entry per row of the data array). NumPy does not broadcast a mask of the wrong length; it raises an `IndexError`.**

    *Example:* `names == 'Bob'` creates `[ True False False True False False False ]`

2.  **Apply the Boolean Mask to the Data Array:**
    *   Once you have your boolean mask, you use it to index the `data_array` you want to filter.
    *   When the boolean mask is placed inside the square brackets `[]` of the `data_array`, NumPy returns all elements from `data_array` that correspond to `True` values in the boolean mask.
    *   The result is a new array (a copy, not a view) containing only the selected elements.

    *Example:* `scores[bob_mask]` uses `[ True False False True False False False ]` to pick rows from `scores`.

**Systematic Steps in Summary:**

*   **Step 1:** Define your `data_array` (the array you want to filter) and your `condition_array` (the array that holds the values to be checked against a condition).
*   **Step 2:** Formulate a boolean expression using your `condition_array` that evaluates to `True` or `False` for each element. This generates the `boolean_mask`.
*   **Step 3:** Use the `boolean_mask` to index your `data_array`: `filtered_data = data_array[boolean_mask]`.

This method allows for very flexible and efficient data selection and manipulation without needing explicit loops.
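Two details of these steps are easy to miss: selecting with a mask builds a new array rather than a view, while assigning through a mask does modify the source. The sketch below shows both on a made-up `data` array, and also what happens when the mask has the wrong length.

```python
import numpy as np

data = np.arange(10)
mask = data % 2 == 0            # True at even values

selected = data[mask]           # boolean indexing builds a new array
selected[0] = -1                # so this does not change data
print(data)                     # [0 1 2 3 4 5 6 7 8 9]

data[mask] = 0                  # assigning through the mask does change data
print(data)                     # [0 1 0 3 0 5 0 7 0 9]

# A mask of the wrong length is rejected rather than stretched to fit.
try:
    data[np.array([True, False])]
except IndexError as err:
    print("IndexError:", err)
```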

In [13]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
scores = np.array([[75, 80], [85, 90], [95, 100], [100, 77], [85, 92], [95, 80], [72, 80]])

bob_mask = (names == 'Bob')
print("Mask:", bob_mask)
print("Bob's scores:\n", scores[bob_mask])

Mask: [ True False False  True False False False]
Bob's scores:
 [[ 75  80]
 [100  77]]


### Clarification on Boolean Masks

It's important to understand that a **boolean mask does not change the data type** of the elements in the array it's applied to. Instead, a boolean mask is an array of `True` and `False` values that acts as a **filter**.

When you use `scores[bob_mask]`, NumPy uses the `True` values in `bob_mask` to select the corresponding elements (or rows/columns) from the `scores` array. The data in `scores` remains in its original type; the mask simply determines which parts of the data are included in the result.
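This can be checked directly by comparing dtypes before and after filtering. The sketch uses a shortened version of the lesson's `names` and `scores` arrays (four students instead of seven) so that it runs on its own.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob'])
scores = np.array([[75, 80], [85, 90], [95, 100], [100, 77]])

bob_mask = names == 'Bob'
bob_scores = scores[bob_mask]

print(bob_mask.dtype)                      # bool
print(bob_scores.dtype == scores.dtype)    # True: still an integer array
print(bob_scores.shape)                    # (2, 2): two rows kept, both columns
```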

### [EXERCISE 3: Complex Filtering]
1. Select all scores where the name is NOT 'Bob'.
2. Select scores for 'Bob' or 'Will' using the `|` operator.
3. Find all scores less than 80 and set them to 0.

In [14]:
# 1. Select all scores where the name is NOT 'Bob'.
not_bob_mask = (names != 'Bob')
print("1. Scores for names NOT 'Bob':\n", scores[not_bob_mask])

# 2. Select scores for 'Bob' or 'Will' using the | operator.
bob_or_will_mask = (names == 'Bob') | (names == 'Will')
print("\n2. Scores for 'Bob' or 'Will':\n", scores[bob_or_will_mask])

# 3. Find all scores less than 80 and set them to 0.
# Work on a copy so the original scores array stays unchanged for later cells.
scores_copy = scores.copy()
print("\nOriginal scores_copy before modification:\n", scores_copy)
scores_copy[scores_copy < 80] = 0
print("3. Scores after setting values less than 80 to 0:\n", scores_copy)

1. Scores for names NOT 'Bob':
 [[ 85  90]
 [ 95 100]
 [ 85  92]
 [ 95  80]
 [ 72  80]]

2. Scores for 'Bob' or 'Will':
 [[ 75  80]
 [ 95 100]
 [100  77]
 [ 85  92]]

Original scores_copy before modification:
 [[ 75  80]
 [ 85  90]
 [ 95 100]
 [100  77]
 [ 85  92]
 [ 95  80]
 [ 72  80]]
3. Scores after setting values less than 80 to 0:
 [[  0  80]
 [ 85  90]
 [ 95 100]
 [100   0]
 [ 85  92]
 [ 95  80]
 [  0  80]]


## Working with AI for Data Exploration: A Collaborative Workflow

This notebook demonstrates not just NumPy concepts, but also a way to approach data analysis and problem-solving in collaboration with an AI like Gemini. Here's how a human-in-the-loop (HITL) workflow between you and the AI functions:

### 1. Introducing a Dataset:

*   **Your Role:** You provide the dataset, specifying its location (e.g., URL, local file path if uploaded to Colab, or details of a BigQuery table).
*   **AI's Role:** I generate the appropriate code (e.g., using `pandas.read_csv`, `pandas_gbq.read_gbq`) to load the data into the notebook. We can then quickly inspect its structure (`df.head()`, `df.info()`).

### 2. Collaborative Questioning & Analysis:

This is where the synergy happens. Instead of you having to translate every analytical thought into code, we can work together:

*   **Your Role:** You express your curiosity, goals, or specific questions in natural language (e.g., "Who scored more than 80?", "What are the average scores?", "How do these two columns relate?").
*   **AI's Role:** I interpret your question and translate it into precise Python code using libraries like NumPy and Pandas. I execute the code and present the results.
*   **Iterative Process:** Based on the results, you can ask follow-up questions, refine your initial query, or guide the analysis in a new direction. I'll continue to generate and execute the code needed.

**The Power of This Collaboration:**

*   **Efficiency:** You can focus on the *what* and *why* of your analysis, rather than getting bogged down in the *how* of coding syntax.
*   **Accessibility:** It lowers the barrier to entry for complex data operations, allowing you to perform sophisticated analyses with natural language.
*   **Discovery:** By rapidly iterating through questions and getting immediate code-generated answers, we can accelerate the discovery of insights and patterns in your data.

This approach allows for a highly interactive and productive data exploration experience, turning your analytical thoughts directly into actionable code and insights.

### Answering "Who scored more than 80?" with NumPy

To find out who scored more than 80, we can use boolean indexing. We'll check if *any* of a student's scores are greater than 80. Then, we'll use that boolean mask to filter the `names` array.
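For contrast, `all(axis=1)` asks a stricter question: who scored more than 80 in every exam? The sketch below puts the two reductions side by side, redefining the lesson's `names` and `scores` so that it runs on its own. The `every_exam` variant is an addition for illustration.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
scores = np.array([[75, 80], [85, 90], [95, 100], [100, 77],
                   [85, 92], [95, 80], [72, 80]])

above_80 = scores > 80                 # boolean array, same shape as scores
any_exam = above_80.any(axis=1)        # True if at least one exam is above 80
every_exam = above_80.all(axis=1)      # True only if both exams are above 80

print(np.unique(names[any_exam]))      # ['Bob' 'Joe' 'Will']
print(np.unique(names[every_exam]))    # ['Joe' 'Will']
```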

In [15]:
# 1. Create a boolean mask for scores greater than 80
# .any(axis=1) returns True for each row that contains at least one True value
scores_greater_than_80_mask = (scores > 80).any(axis=1)
print(f"Scores > 80 Mask: {scores_greater_than_80_mask}")

# 2. Use the mask to find the names of students who meet this criterion
students_who_scored_over_80 = names[scores_greater_than_80_mask]

print(f"\nStudents who scored more than 80 in at least one exam: {np.unique(students_who_scored_over_80)}")

Scores > 80 Mask: [False  True  True  True  True  True False]

Students who scored more than 80 in at least one exam: ['Bob' 'Joe' 'Will']


## Part 6: Universal Functions (ufuncs) and Methods
A **ufunc** is a function that performs element-wise operations on data in ndarrays.

* **Unary ufuncs**: Take one array (e.g., `sqrt`, `exp`).
* **Binary ufuncs**: Take two arrays (e.g., `add`, `maximum`).
* **Statistical Methods**: `mean`, `sum`, `std` can be computed over the entire array or along an axis.
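The three categories above can be seen on small, fixed inputs, which makes the results easy to check by hand. The arrays below are made up for illustration.

```python
import numpy as np

arr = np.array([1.0, 4.0, 9.0, 16.0])
other = np.array([2.0, 3.0, 10.0, 15.0])

print(np.sqrt(arr))            # unary: [1. 2. 3. 4.]
print(np.add(arr, other))      # binary: [ 3.  7. 19. 31.]
print(np.maximum(arr, other))  # binary: [ 2.  4. 10. 16.]

table = np.arange(12).reshape(3, 4)
print(table.sum())             # 66: whole array
print(table.sum(axis=0))       # [12 15 18 21]: one value per column
print(table.mean(axis=1))      # [1.5 5.5 9.5]: one value per row
```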

In [None]:
# [DEMO] Statistical Methods
arr = np.random.randn(3, 4)
print("Random Array:\n", arr)
print("\nMean down rows (axis=0):", arr.mean(axis=0))
print("Sum across columns (axis=1):", arr.sum(axis=1))

## Part 7: Linear Algebra
Linear algebra operations, like matrix multiplication, are crucial for many data science algorithms. Multiplying two arrays with `*` is an element-wise product; for matrix multiplication, use `.dot()` or the `@` operator.
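As a worked instance, a (2, 3) matrix times a (3, 2) matrix gives a (2, 2) result. The sketch below uses the same `x` and `y` as the lesson's demo, checks that `@`, `.dot()`, and `np.dot` agree, and shows that `*` fails for these shapes.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])          # shape (2, 3)
y = np.array([[6, 23], [-1, 7], [8, 9]])      # shape (3, 2)

product = x @ y                               # shape (2, 2)
print(product)
# [[ 28  64]
#  [ 67 181]]

print(np.array_equal(product, x.dot(y)))      # True
print(np.array_equal(product, np.dot(x, y)))  # True

# Element-wise * needs broadcast-compatible shapes, so x * y fails here.
try:
    x * y
except ValueError as err:
    print("ValueError:", err)
```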

![matrix_multiplication](https://github.com/evecount/6m-data-1.6-intro-numpy/blob/main/assets/matrix_multiplication.png?raw=1)

In [None]:
# [DEMO] Matrix Multiplication
x = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([[6, 23], [-1, 7], [8, 9]])

print("Matrix product (x @ y):\n", x @ y)

### [EXERCISE 4: Reshaping & Statistics]
1. Create an array of 15 integers using `arange(15)` and reshape it to `(3, 5)`.
2. Calculate the average value of each row.
3. Use `np.unique()` to find distinct elements in an array of your choice.
4. Transpose the reshaped array using `.T` and check the new shape.

In [None]:
# Your code here
