## Part 1: Why NumPy? (The Hook)

*Paste this into a Markdown cell to start.*

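Standard Python lists are flexible, but for numerical work they are slow and memory-heavy because every element is a separate Python object. NumPy (Numerical Python) adds a dedicated data structure: the **ndarray** (N-dimensional array), which stores values of a single type in one block of memory. It is: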

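1.  **Fast:** The core routines are written in C and pre-compiled, so operations avoid per-element interpreter overhead.
2.  **Vectorized:** Operations apply to whole arrays at once, so you rarely need explicit `for` loops.
3.  **The Foundation:** Pandas and Scikit-Learn are built on top of NumPy, and TensorFlow interoperates closely with NumPy arrays.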
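
To make the "vectorized" point concrete, here is a minimal sketch comparing a plain Python loop with the equivalent NumPy expression. The names `prices_list`, `prices_arr`, and the factor `1.5` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Hypothetical data, for illustration only
prices_list = [10.0, 20.0, 30.0]
prices_arr = np.array(prices_list)

# Plain Python: an explicit loop (here a list comprehension) visits each element
scaled_list = [p * 1.5 for p in prices_list]

# NumPy: one expression applies the multiplication to every element
scaled_arr = prices_arr * 1.5

print(scaled_list)  # [15.0, 30.0, 45.0]
print(scaled_arr)   # [15. 30. 45.]
```

Both lines compute the same values; the NumPy version simply pushes the loop down into compiled code.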

### 1.1 The Speed Test

*Paste this into a Code cell to prove the value immediately.*

In [10]:
%pip install numpy



In [11]:
import numpy as np
import time

# Create a list and an array of 1 million numbers
size = 1_000_000
py_list = list(range(size))
np_arr = np.arange(size)

# 1. Python List Sum
start = time.perf_counter()  # high-resolution clock intended for timing code
total = sum(py_list)
print(f"Python List Time: {time.perf_counter() - start:.5f} seconds")

# 2. NumPy Array Sum
start = time.perf_counter()
total = np.sum(np_arr)
print(f"NumPy Array Time: {time.perf_counter() - start:.5f} seconds")

# NumPy is typically around 10x faster here; the exact ratio varies by machine and run.

Python List Time: 0.00628 seconds
NumPy Array Time: 0.00048 seconds


## Part 2: Creating Arrays

*Learning the syntax for generating data.*

### 2.1 Basic Creation

In [12]:
# From a standard list
arr_1d = np.array([1, 2, 3])
print("1D Array:", arr_1d)

# From a list of lists (Matrix)
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("\n2D Array:\n", arr_2d)

# Checking Type (Important: NumPy arrays are homogeneous - all elements must be same type)
# If you mix floats and ints, NumPy converts everything to float
mixed = np.array([1, 2.5, 3]) 
print("\nAuto-converted dtype:", mixed.dtype) # output: float64

1D Array: [1 2 3]

2D Array:
 [[1 2 3]
 [4 5 6]]

Auto-converted dtype: float64
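
Because arrays are homogeneous, it often helps to control the dtype explicitly rather than rely on auto-conversion. The following is a small sketch of that idea; the variable names `as_float`, `as_int`, and `with_text` are made up for this example.

```python
import numpy as np

# Ask for a specific dtype at creation time
as_float = np.array([1, 2, 3], dtype=np.float64)
print(as_float, as_float.dtype)

# astype() returns a converted COPY; float -> int truncates toward zero
mixed = np.array([1, 2.5, 3])
as_int = mixed.astype(np.int64)
print(as_int, as_int.dtype)

# Mixing numbers and strings upcasts everything to a string dtype
with_text = np.array([1, "two", 3])
print(with_text.dtype.kind)  # 'U' means Unicode string
```

Note that `2.5` becomes `2` after `astype(np.int64)`: the conversion drops the fractional part rather than rounding.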


### 2.2 Auto-Generation (The most common methods)


In [13]:
# Zeros and Ones (useful for initializing weights/masks)
print(np.zeros((3, 3))) # 3x3 matrix of zeros
print(np.ones((2, 4)))  # 2x4 matrix of ones

# Ranges
print(np.arange(0, 10, 2)) # Like Python range(): Start, Stop, Step -> [0, 2, 4, 6, 8]

# Linspace (Linear Space) - CRUCIAL for data science/plotting
# "Give me 5 numbers evenly spaced between 0 and 1"
print(np.linspace(0, 1, 5)) 

# Random Data (note: randint's upper bound is exclusive, so randint(0, 10) draws from 0-9)
np.random.seed(42) # Set seed for reproducibility
print("\nRandom Integers (0-10):\n", np.random.randint(0, 10, (2, 3)))
print("\nStandard Normal Dist:\n", np.random.randn(3))

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]]
[0 2 4 6 8]
[0.   0.25 0.5  0.75 1.  ]

Random Integers (0-10):
 [[6 3 7]
 [4 6 9]]

Standard Normal Dist:
 [ 0.27904129  1.01051528 -0.58087813]
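
The cell above uses the legacy global-seed interface (`np.random.seed`). NumPy also provides a `Generator` interface through `np.random.default_rng`, which keeps the random state in a local object. Here is a hedged sketch of the same idea with that interface; note that it will not reproduce the exact numbers printed above, because the two interfaces use different generators.

```python
import numpy as np

# Generator API: a self-contained random stream, seeded locally
rng = np.random.default_rng(42)

ints = rng.integers(0, 10, size=(2, 3))  # upper bound is exclusive: values 0-9
normals = rng.standard_normal(3)         # mean 0, standard deviation 1

print(ints)
print(normals)

# A second generator with the same seed reproduces the same stream
rng_again = np.random.default_rng(42)
same_ints = rng_again.integers(0, 10, size=(2, 3))
print(np.array_equal(ints, same_ints))   # True
```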


## Part 3: Attributes & Shapes

*Understanding the anatomy of your data.*

In [14]:
# Create a 3D array (2 matrices, 3 rows, 4 columns)
arr_3d = np.random.randint(0, 10, size=(2, 3, 4))

print("Array:\n", arr_3d)
print("-" * 20)
print(f"Dimensions (ndim): {arr_3d.ndim}") # 3
print(f"Shape: {arr_3d.shape}")             # (2, 3, 4)
print(f"Total Elements (size): {arr_3d.size}") # 24
print(f"Data Type (dtype): {arr_3d.dtype}")    # int64 or int32

Array:
 [[[4 0 9 5]
  [8 0 9 2]
  [6 3 8 2]]

 [[4 2 6 4]
  [8 6 1 3]
  [8 1 9 8]]]
--------------------
Dimensions (ndim): 3
Shape: (2, 3, 4)
Total Elements (size): 24
Data Type (dtype): int64
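
These attributes are related to each other, which the short sketch below tries to show. It uses `np.zeros` with an explicit `int64` dtype (an assumption made here so the byte counts are the same on every platform) instead of the random array above.

```python
import numpy as np

arr_3d = np.zeros((2, 3, 4), dtype=np.int64)  # dtype fixed explicitly for this sketch

# size is the product of the shape entries: 2 * 3 * 4 = 24
print(arr_3d.size == np.prod(arr_3d.shape))   # True

# itemsize = bytes per element, nbytes = total bytes of the data buffer
print(arr_3d.itemsize)  # 8 bytes for int64
print(arr_3d.nbytes)    # 24 elements * 8 bytes = 192

# len() only reports the length of the FIRST axis
print(len(arr_3d))      # 2
```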


## Part 4: Indexing & Slicing (The "View" Concept)

*How to grab specific data. Note: basic slicing returns a view of the original array, not a copy!*

### 4.1 Basic Slicing


In [15]:
matrix = np.array([[10, 20, 30], 
                   [40, 50, 60], 
                   [70, 80, 90]])

# Syntax: [row_selector, column_selector]
print("Element at (0,1):", matrix[0, 1])   # Row 0, Col 1 -> 20

# Slicing: [start:stop]
print("\nFirst 2 rows:\n", matrix[:2])     # Row 0 and 1, all columns

# The "Comma" magic
print("\nLast column only:\n", matrix[:, -1]) # All rows, last column -> [30, 60, 90]

Element at (0,1): 20

First 2 rows:
 [[10 20 30]
 [40 50 60]]

Last column only:
 [30 60 90]
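
The "view" behaviour is easiest to see by modifying a slice. The sketch below rebuilds the same `matrix` and checks memory sharing with `np.shares_memory`; the names `first_row` and `safe_row` are illustrative.

```python
import numpy as np

matrix = np.array([[10, 20, 30],
                   [40, 50, 60],
                   [70, 80, 90]])

# A slice is a view: it shares memory with the original
first_row = matrix[0, :]
first_row[0] = -1
print(matrix[0, 0])                          # -1, the original changed too
print(np.shares_memory(first_row, matrix))   # True

# .copy() gives independent data
safe_row = matrix[1, :].copy()
safe_row[0] = -1
print(matrix[1, 0])                          # 40, the original is untouched
```

If you plan to modify a slice and still need the original values, take a `.copy()` first.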


### 4.2 Boolean Masking (Filtering)

*Boolean masks let you select or overwrite elements by condition. This is one of the most heavily used features in data analysis.*

In [16]:
data = np.array([1, 5, 10, -3, 8, -2])

# 1. Create a boolean mask
mask = data > 0
print("Mask:", mask) # [True, True, True, False, True, False]

# 2. Apply the mask
positives = data[mask]
print("Positives only:", positives)

# One-liner version
print("Elements > 5:", data[data > 5])

# Example: Replace all negative values with 0
data[data < 0] = 0
print("Cleaned Data:", data)

Mask: [ True  True  True False  True False]
Positives only: [ 1  5 10  8]
Elements > 5: [10  8]
Cleaned Data: [ 1  5 10  0  8  0]
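
Masks can also be combined, and `np.where` offers a non-destructive alternative to in-place assignment. The following sketch starts again from the original `data` values; `in_range` and `clipped` are names chosen for this example.

```python
import numpy as np

data = np.array([1, 5, 10, -3, 8, -2])

# Combine conditions with & (and), | (or), ~ (not); the parentheses are required
in_range = data[(data > 0) & (data < 10)]
print(in_range)   # [1 5 8]

# Boolean indexing returns a copy, so editing the result leaves `data` alone
in_range[0] = 999
print(data[0])    # 1

# np.where(condition, value_if_true, value_if_false) builds a new array
clipped = np.where(data < 0, 0, data)
print(clipped)    # [ 1  5 10  0  8  0]
```

Unlike basic slicing, selecting with a mask gives you a copy, while assigning through a mask (`data[mask] = value`) does change the original.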


## Part 5: Operations & Broadcasting

*How NumPy handles math without loops.*

### 5.1 Element-wise Operations

In [17]:
a = np.array([10, 20, 30])
b = np.array([1, 2, 3])

print("Addition:", a + b)       # [11, 22, 33]
print("Multiplication:", a * b) # [10, 40, 90]
print("Power:", b ** 2)         # [1, 4, 9]

# Universal Functions (ufuncs)
print("Square Root:", np.sqrt(a))
print("Log:", np.log(a))

Addition: [11 22 33]
Multiplication: [10 40 90]
Power: [1 4 9]
Square Root: [3.16227766 4.47213595 5.47722558]
Log: [2.30258509 2.99573227 3.40119738]
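
A common point of confusion is that `*` multiplies element by element and is not a dot product. Here is a brief sketch with the same `a` and `b`, using the `@` operator for the dot product and contrasting the two division operators.

```python
import numpy as np

a = np.array([10, 20, 30])
b = np.array([1, 2, 3])

elementwise = a * b   # multiply matching positions
dot = a @ b           # dot product: 10*1 + 20*2 + 30*3

print(elementwise)    # [10 40 90]
print(dot)            # 140

# Division: / always produces floats, // keeps integer arrays as integers
true_div = a / b
floor_div = a // b
print(true_div.dtype.kind, floor_div.dtype.kind)  # f i
```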


### 5.2 Broadcasting

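*Broadcasting lets NumPy combine arrays of different but compatible shapes. Shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or one of them is 1.*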


In [18]:
matrix = np.ones((3, 3))
vector = np.array([1, 2, 3])

# The vector [1, 2, 3] is "stretched" (broadcast) across every row of the matrix
result = matrix + vector 
print("Broadcast result:\n", result)
# Row 1: 1+1, 1+2, 1+3
# Row 2: 1+1, 1+2, 1+3 ...

Broadcast result:
 [[2. 3. 4.]
 [2. 3. 4.]
 [2. 3. 4.]]
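
Broadcasting also works down the columns when the vector has shape `(3, 1)`, and it fails when the shapes are incompatible. The sketch below illustrates both cases; `row`, `col`, and `grid` are names introduced here for illustration.

```python
import numpy as np

matrix = np.ones((3, 3))
row = np.array([1, 2, 3])            # shape (3,), treated as (1, 3)
col = np.array([[10], [20], [30]])   # shape (3, 1)

by_row = matrix + row   # row is repeated down the rows
by_col = matrix + col   # col is repeated across the columns
print(by_col)

# (3,) combined with (3, 1) broadcasts to a full (3, 3) grid
grid = row + col
print(grid.shape)       # (3, 3)

# Incompatible trailing dimensions (3 vs 4) raise a ValueError
try:
    matrix + np.array([1, 2, 3, 4])
except ValueError as err:
    print("Broadcast failed:", err)
```

Broadcasting does not physically copy the smaller array; NumPy just reuses its values as if it had been repeated.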


## Part 6: Statistics & Aggregation (The "Axis" Trap)

**The Concept:**
When you ask for a "mean" or "sum" in NumPy, you often don't want the result for the *entire* matrix. You usually want it for a specific row or column. This is where `axis` comes in.

  * **`axis=0` (The Vertical Arrow):** Acts **downwards**. It collapses the rows. Use this to get stats for each *column*.
  * **`axis=1` (The Horizontal Arrow):** Acts **across**. It collapses the columns. Use this to get stats for each *row*.

### 6.1 Understanding Axis Logic

In [19]:
# Let's imagine a gradebook for 3 students taking 3 tests.
# Rows = Students
# Cols = Tests
grades = np.array([
    [80, 90, 100], # Student A's scores
    [70, 75, 80],  # Student B's scores
    [90, 95, 95]   # Student C's scores
])

print("The Matrix:\n", grades)

# SCENARIO 1: "What is the class average?"
# No axis specified = calculates for the whole array
print("\nGlobal Mean:", np.mean(grades)) 

# SCENARIO 2: "What is the average score for EACH TEST?"
# We need to average "down" the columns.
# We hold the columns fixed and collapse the rows. -> Axis 0
print("\nAverage per Test (Axis 0):", np.mean(grades, axis=0))
# Logic: (80+70+90)/3, (90+75+95)/3...

# SCENARIO 3: "What is the average score for EACH STUDENT?"
# We need to average "across" the rows.
# We hold the rows fixed and collapse the columns. -> Axis 1
print("Average per Student (Axis 1):", np.mean(grades, axis=1))
# Logic: Student A: (80+90+100)/3...

The Matrix:
 [[ 80  90 100]
 [ 70  75  80]
 [ 90  95  95]]

Global Mean: 86.11111111111111

Average per Test (Axis 0): [80.         86.66666667 91.66666667]
Average per Student (Axis 1): [90.         75.         93.33333333]
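
One way to remember the rule: the axis you name is the one that disappears from the shape. The sketch below checks this on the same `grades` array and shows the `keepdims=True` option, which keeps the collapsed axis with length 1 so the result broadcasts back against the matrix. The name `diff_from_own_mean` is illustrative.

```python
import numpy as np

grades = np.array([
    [80, 90, 100],
    [70, 75, 80],
    [90, 95, 95],
])

# The axis you name is the one that disappears from the shape
print(grades.mean(axis=0).shape)   # (3,) one value per test
print(grades.mean(axis=1).shape)   # (3,) one value per student

# keepdims=True keeps the collapsed axis as length 1 -> shape (3, 1)
student_means = grades.mean(axis=1, keepdims=True)
print(student_means.shape)         # (3, 1)

# That shape broadcasts cleanly against the (3, 3) matrix
diff_from_own_mean = grades - student_means
print(diff_from_own_mean[0])       # Student A: -10, 0, +10 around a mean of 90
```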


### 6.2 Other Aggregations

Other aggregation functions (`np.sum`, `np.min`, `np.max`, `np.std`, `np.argmin`, `np.argmax`) accept the same `axis` argument and follow the same logic.

In [20]:
# Who got the highest single score in the whole class?
print("Max Score:", np.max(grades))

# Which test was the hardest? (Lowest average)
# argmin() returns the INDEX of the minimum value
test_avgs = np.mean(grades, axis=0)
hardest_test_index = np.argmin(test_avgs) 
print(f"Hardest Test Index: {hardest_test_index} (Score: {test_avgs[hardest_test_index]})")

Max Score: 100
Hardest Test Index: 0 (Score: 80.0)
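
The index-returning functions also accept `axis`, and a flat index can be converted back to row and column positions with `np.unravel_index`. This is a small sketch on the same `grades` array; the variable names are assumptions made for the example.

```python
import numpy as np

grades = np.array([
    [80, 90, 100],
    [70, 75, 80],
    [90, 95, 95],
])

# Best test for each student (ties resolve to the first occurrence)
best_test_per_student = np.argmax(grades, axis=1)
print(best_test_per_student)   # [2 2 1]

# Without axis, argmax returns an index into the FLATTENED array...
flat_index = np.argmax(grades)
print(flat_index)              # 2

# ...which unravel_index converts back to (row, column)
row, col = np.unravel_index(flat_index, grades.shape)
print(row, col)                # 0 2

# Lowest score on each test
lowest_per_test = grades.min(axis=0)
print(lowest_per_test)         # [70 75 80]
```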


## Part 7: Manipulation & Reshaping

**The Concept:**
In Data Science (especially Deep Learning), you constantly need to rearrange how data is "packaged."

  * **The Golden Rule of Reshaping:** The total number of elements must remain the same. If you have 12 numbers, you can make a 3x4 matrix or a 2x6 matrix, but you cannot make a 3x5 matrix (which requires 15 numbers).

### 7.1 Reshaping

Think of this as taking a long snake of data and folding it into a box.


In [21]:
# Create 12 numbers (0 to 11)
# This is a 1D array (shape: (12,))
flat_data = np.arange(12) 
print("Original 1D:", flat_data)

# Fold it into a 3x4 grid
# 3 rows * 4 cols = 12 elements (Matches!)
reshaped = flat_data.reshape(3, 4)
print("\nReshaped to 3x4:\n", reshaped)

# THE -1 TRICK (Very useful!)
# If you know you want 4 columns but are too lazy to calculate how many rows that requires...
# Use -1. NumPy will solve for 'x'.
auto_reshaped = flat_data.reshape(-1, 4) 
print("\nAuto-calculated rows (-1, 4):\n", auto_reshaped)

Original 1D: [ 0  1  2  3  4  5  6  7  8  9 10 11]

Reshaped to 3x4:
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

Auto-calculated rows (-1, 4):
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
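
The Golden Rule can be checked directly: a reshape whose dimensions do not multiply to the element count raises a `ValueError`. The sketch below also shows `-1` in other positions and that, for a freshly created array like this one, `reshape` returns a view. The name `grid` is illustrative.

```python
import numpy as np

flat_data = np.arange(12)

# 3 * 5 = 15 != 12, so this reshape is rejected
try:
    flat_data.reshape(3, 5)
except ValueError as err:
    print("Reshape failed:", err)

# -1 can stand in for any ONE dimension
print(flat_data.reshape(2, -1).shape)      # (2, 6)
print(flat_data.reshape(2, 2, -1).shape)   # (2, 2, 3)

# Here reshape returns a view, so writes show up in the original
grid = flat_data.reshape(3, 4)
grid[0, 0] = 100
print(flat_data[0])                        # 100
```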


### 7.2 Flattening

Flattening is the opposite of reshaping: it takes a matrix and turns it back into a 1D array. This is often required before plotting data or saving it to a file.


In [22]:
matrix = np.array([[1, 2, 3], 
                   [4, 5, 6]])

# flatten() returns a COPY (safe to modify without affecting original)
flat_copy = matrix.flatten()
print("Flattened:", flat_copy)

# ravel() returns a VIEW when possible (faster, but modifying it can modify the original!)
flat_view = matrix.ravel()

Flattened: [1 2 3 4 5 6]
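
The difference between the two methods only shows up when you write to the result. Here is a minimal sketch of that, including a case (a transposed array) where `ravel()` cannot return a view and falls back to a copy. The name `transposed_flat` is illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

flat_copy = matrix.flatten()   # always a copy
flat_view = matrix.ravel()     # a view when the memory layout allows it

flat_copy[0] = 99
print(matrix[0, 0])            # 1, the copy is independent

flat_view[0] = 99
print(matrix[0, 0])            # 99, the view wrote through

# On a transposed (non-contiguous) array, ravel() has to copy
transposed_flat = matrix.T.ravel()
print(np.shares_memory(transposed_flat, matrix))   # False
```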


### 7.3 Stacking (Gluing Arrays Together)

Sometimes you have two separate datasets (e.g., "Data from Jan" and "Data from Feb") and you need to combine them.


In [23]:
dataset_A = np.array([1, 2, 3])
dataset_B = np.array([4, 5, 6])

# vstack (Vertical Stack)
# Puts B *under* A. 
# Think: Stacking plates.
v_stack = np.vstack((dataset_A, dataset_B))
print("Vertical Stack:\n", v_stack)
# Result is now 2x3

# hstack (Horizontal Stack)
# Puts B *next to* A. 
# Think: Parking cars side-by-side.
h_stack = np.hstack((dataset_A, dataset_B))
print("Horizontal Stack:", h_stack)
# Result is now 1D array of length 6

Vertical Stack:
 [[1 2 3]
 [4 5 6]]
Horizontal Stack: [1 2 3 4 5 6]
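
Two related functions are worth knowing: `np.column_stack`, which turns 1D arrays into columns, and `np.concatenate`, the general form that takes an `axis`. The sketch below uses small made-up `jan` and `feb` arrays as stand-ins for the monthly datasets mentioned above.

```python
import numpy as np

dataset_A = np.array([1, 2, 3])
dataset_B = np.array([4, 5, 6])

# column_stack turns each 1D array into a COLUMN -> shape (3, 2)
as_columns = np.column_stack((dataset_A, dataset_B))
print(as_columns)

# concatenate is the general form; axis picks the direction to grow
jan = np.array([[1, 2, 3]])    # shape (1, 3), hypothetical January rows
feb = np.array([[4, 5, 6],
                [7, 8, 9]])    # shape (2, 3), hypothetical February rows
both = np.concatenate((jan, feb), axis=0)
print(both.shape)              # (3, 3)

# Shapes must agree on every axis except the one being joined
try:
    np.vstack((dataset_A, np.array([1, 2])))
except ValueError as err:
    print("Stack failed:", err)
```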


## Part 8: Capstone Exercise - "Normalizing Data"

*Let's combine concepts to perform a common ML preprocessing task: Z-Score Normalization.*

$$Z = \frac{x - \mu}{\sigma}$$

*Where $\mu$ is mean and $\sigma$ is standard deviation.*

In [24]:
# 1. Generate Dummy Data (5 samples, 3 features)
# Imagine: [Height, Weight, Age] for 5 people
raw_data = np.random.randint(50, 200, size=(5, 3))
print("Raw Data:\n", raw_data)

# 2. Calculate Mean and Std Dev for each feature (Column-wise / Axis 0)
means = np.mean(raw_data, axis=0)
stdevs = np.std(raw_data, axis=0)

print("\nMeans:", means)
print("Stdevs:", stdevs)

# 3. Apply formula using Broadcasting
# (5,3) array minus (3,) vector works because of broadcasting!
normalized_data = (raw_data - means) / stdevs

print("\nNormalized Data (Z-Scores):\n", normalized_data)

# Verify: The mean of normalized columns should be essentially 0
print("\nCheck Means (should be ~0):", np.round(np.mean(normalized_data, axis=0)))

Raw Data:
 [[139 102 179]
 [133 141 160]
 [ 57  84 130]
 [ 99 153 181]
 [ 51 183 103]]

Means: [ 95.8 132.6 150.6]
Stdevs: [36.80434757 35.5674008  30.03065101]

Normalized Data (Z-Scores):
 [[ 1.17377437 -0.86033838  0.94570044]
 [ 1.01075015  0.23617132  0.31301353]
 [-1.05422328 -1.36641978 -0.68596582]
 [ 0.08694625  0.57355892  1.01229907]
 [-1.2172475   1.41702792 -1.58504722]]

Check Means (should be ~0): [0. 0. 0.]
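
The same steps can be wrapped in a reusable function. The sketch below is one possible version, not a definitive implementation: the function name `zscore`, the `sample` data, and the choice to map constant features (zero standard deviation) to 0 are all assumptions made for this example. It uses the population standard deviation, which is what `np.std` computes by default.

```python
import numpy as np

def zscore(data, axis=0):
    """Standardize `data` along `axis`; constant features map to 0."""
    data = np.asarray(data, dtype=np.float64)
    means = data.mean(axis=axis, keepdims=True)
    stdevs = data.std(axis=axis, keepdims=True)  # population std (ddof=0)
    # Assumption for this sketch: replace a zero std with 1 to avoid dividing by zero
    safe_stdevs = np.where(stdevs == 0, 1.0, stdevs)
    return (data - means) / safe_stdevs

# Hypothetical sample: the last column is constant
sample = np.array([[1.0, 10.0, 5.0],
                   [2.0, 20.0, 5.0],
                   [3.0, 30.0, 5.0]])
scaled = zscore(sample)

print(np.allclose(scaled.mean(axis=0), 0))      # True
print(np.allclose(scaled.std(axis=0)[:2], 1))   # True
print(scaled[:, 2])                             # all zeros for the constant column
```

Checking that each normalized column has a standard deviation of about 1, in addition to a mean of about 0, is a slightly stronger sanity check than the mean alone.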
