# Fundamentals of NumPy
* Understand why NumPy is essential for efficient data manipulation in Python
* Practice NumPy fundamentals through hands-on exercises with simulated sales data

## Why Learn NumPy?
* foundation of the Python data science ecosystem
* provides powerful tools for working with large datasets and performing mathematical operations efficiently

### Key Benefits

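* **Performance**: NumPy operations are often 10-100x faster than equivalent pure-Python loops on large arrays (the exact speedup depends on the operation, array size, and hardware)
* **Memory Efficiency**: Stores elements in one contiguous, fixed-type block instead of as separate Python objects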
* **Foundation**: Required by Pandas, Scikit-learn, Matplotlib, and other libraries
* **Vectorization**: Eliminates the need for explicit loops in mathematical operations

In [None]:
%pip install numpy

In [None]:
# Let's demonstrate the performance difference
import numpy as np
import time

# Create large datasets
size = 1_000_000
python_list1 = list(range(size))
python_list2 = list(range(size, 2 * size))
numpy_array1 = np.arange(size)
numpy_array2 = np.arange(size, 2 * size)

In [None]:
# Time Python list operation (perf_counter is a high-resolution timer meant for benchmarking)
start_time = time.perf_counter()
python_result = [a + b for a, b in zip(python_list1, python_list2)]
python_time = time.perf_counter() - start_time

# Time NumPy array operation
start_time = time.perf_counter()
numpy_result = numpy_array1 + numpy_array2
numpy_time = time.perf_counter() - start_time

print(f"Python list operation time: {python_time:.4f} seconds")
print(f"NumPy array operation time: {numpy_time:.4f} seconds")
print(f"NumPy is {python_time/numpy_time:.1f}x faster!")

## Efficient Matrix Math

### NumPy vs. Pure Python

Let's compare how mathematical operations are performed in pure Python versus NumPy:

In [None]:
# Pure Python approach (slow)
list1 = [1, 2, 3, 4, 5]
list2 = [10, 20, 30, 40, 50]

# Slow element-wise addition
result_python = []
for i in range(len(list1)):
    result_python.append(list1[i] + list2[i])

print("Pure Python result:", result_python)

In [None]:
# NumPy approach (fast)
array1 = np.array([1, 2, 3, 4, 5])
array2 = np.array([10, 20, 30, 40, 50])

# Fast vectorized addition
result_numpy = array1 + array2

print("NumPy result:", result_numpy)
print("\nBoth produce the same result, but the NumPy version is cleaner and much faster on large arrays!")

### Problems with Pure Python
* explicit loops required
* type checking for each element
* Python object overhead

### Benefits of NumPy
* single operation on entire arrays
* pre-compiled C code
* homogeneous data types
* optimized memory access
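
To make "homogeneous data types" concrete, here is a minimal sketch of how `np.array` settles on a single dtype for all elements. The variable names are illustrative only, and the exact integer dtype depends on the platform.

```python
import numpy as np

# A Python list can hold a mix of types; each element is a separate object.
mixed_list = [1, 2.5, 3]

# np.array converts everything to one common dtype (here float64).
promoted = np.array(mixed_list)
print(promoted, promoted.dtype)

# Integers only: NumPy picks a platform-dependent signed integer dtype.
ints = np.array([1, 2, 3])
print(ints.dtype.kind)       # 'i' means signed integer

# Mixing numbers and strings falls back to a Unicode string dtype.
as_text = np.array([1, "a"])
print(as_text.dtype.kind)    # 'U' means Unicode string
```

Because every element shares one dtype, NumPy can run the same compiled loop over the whole array without inspecting each value.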

## NumPy and C

NumPy's performance advantage comes from its core being implemented in C and from its optimized algorithms.

### Key Performance Features

* **Contiguous Memory Layout**: Arrays stored as continuous blocks in memory
* **Single Data Type**: All elements are the same type (no type checking overhead)
* **SIMD Instructions**: Modern processors can operate on multiple values simultaneously
* **Optimized Libraries**: Linear algebra routines rely on highly optimized BLAS and LAPACK libraries
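
The contiguous layout can be inspected directly. The sketch below (with an illustrative array named `grid`, using an explicit `int64` dtype so the byte counts are predictable) looks at the `flags` and `strides` attributes of an array and of its transpose.

```python
import numpy as np

grid = np.arange(12, dtype=np.int64).reshape(3, 4)

# Arrays created this way are C-contiguous (row-major) by default.
print(grid.flags["C_CONTIGUOUS"])      # True

# strides = bytes to step to reach the next element along each axis.
# With 8-byte items and 4 columns: 32 bytes per row, 8 bytes per column.
print(grid.strides)                    # (32, 8)

# The transpose is a view over the same buffer with swapped strides,
# so it is no longer C-contiguous.
print(grid.T.strides)                  # (8, 32)
print(grid.T.flags["C_CONTIGUOUS"])    # False
print(np.shares_memory(grid, grid.T))  # True
```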

In [None]:
# Demonstrate memory efficiency
import sys

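# Python list: the list's own pointer array plus the int objects it references.
# CPython caches small ints, so these five values are shared objects and the sum
# below is an upper bound; it approximates the cost of a list of distinct values.
python_list = [1, 2, 3, 4, 5] * 1000
python_memory = sys.getsizeof(python_list) + sum(sys.getsizeof(item) for item in python_list)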

# NumPy array
numpy_array = np.array([1, 2, 3, 4, 5] * 1000)
numpy_memory = numpy_array.nbytes

print(f"Python list memory usage: {python_memory:,} bytes")
print(f"NumPy array memory usage: {numpy_memory:,} bytes")
print(f"NumPy uses {python_memory/numpy_memory:.1f}x less memory!")

In [None]:
# Show array properties
print(f"Array dtype: {numpy_array.dtype}")
print(f"Array shape: {numpy_array.shape}")
print(f"Array size: {numpy_array.size} elements")

## NumPy Array Basics

* NumPy arrays (ndarrays) are the fundamental data structure for numerical computing in Python

### Creating Arrays

In [None]:
import numpy as np

# From Lists
# 1D array
arr_1d = np.array([1, 2, 3, 4])
print("1D array:", arr_1d)

# 2D array
arr_2d = np.array([[1, 2], [3, 4]])
print("\n2D array:")
print(arr_2d)

# Specify data type
arr_float = np.array([1, 2, 3], dtype=float)
print("\nFloat array:", arr_float)
print("Data type:", arr_float.dtype)

In [None]:
# Using Functions

# Array of zeros
zeros = np.zeros((3, 4))
print("Array of zeros:")
print(zeros)

In [None]:
# Array of ones
ones = np.ones((2, 3))
print("Array of ones:")
print(ones)

In [None]:
# Range of values
range_arr = np.arange(0, 10, 2)
print("Range array:", range_arr)

In [None]:
# Evenly spaced values
linspace_arr = np.linspace(0, 1, 5)
print("Linspace array:", linspace_arr)

In [None]:
# Random values
np.random.seed(42)  # For reproducible results
random_arr = np.random.random((2, 3))
print("Random array:")
print(random_arr)

## Basic Array Operations
* NumPy provides a rich set of operations for manipulating and analyzing arrays

In [None]:
# Create a sample array
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print("Sample array:")
print(arr)

In [None]:
# Array properties
print(f"Shape: {arr.shape}")        # (2, 3)
print(f"Data type: {arr.dtype}")    # int64 (or int32)
print(f"Dimensions: {arr.ndim}")    # 2
print(f"Size: {arr.size}")          # 6
print(f"Item size: {arr.itemsize} bytes")  # 8 bytes for int64

In [None]:
# Mathematical operations
print(f"Sum: {arr.sum()}")          # 21
print(f"Mean: {arr.mean()}")        # 3.5
print(f"Standard deviation: {arr.std():.2f}")  # 1.71
print(f"Max: {arr.max()}")          # 6
print(f"Min: {arr.min()}")          # 1
print(f"Argmax (index of max): {arr.argmax()}")  # 5
print(f"Argmin (index of min): {arr.argmin()}")  # 0

In [None]:
# Operations along specific axes
print("Operations along axes:")
print("\nOriginal array:")
print(arr)

# axis=0 collapses the rows: one result per column
print(f"\nSum along axis=0 (column sums): {arr.sum(axis=0)}")
print(f"Mean along axis=0 (column means): {arr.mean(axis=0)}")

# axis=1 collapses the columns: one result per row
print(f"\nSum along axis=1 (row sums): {arr.sum(axis=1)}")
print(f"Mean along axis=1 (row means): {arr.mean(axis=1)}")
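
One way to remember the `axis` argument is that the axis you name is the one that disappears from the result's shape. The sketch below uses an illustrative array called `table` and the `keepdims` parameter to show the shapes involved.

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])          # shape (2, 3)

# The axis passed to the reduction is removed from the shape.
col_sums = table.sum(axis=0)
row_sums = table.sum(axis=1)
print(col_sums.shape)                  # (3,): rows collapsed
print(row_sums.shape)                  # (2,): columns collapsed

# keepdims=True keeps the reduced axis with length 1 instead of dropping it.
row_sums_2d = table.sum(axis=1, keepdims=True)
print(row_sums_2d.shape)               # (2, 1)
print(row_sums_2d)
```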

## Vectorized Operations and Broadcasting

**Vectorization** performs operations on entire arrays without explicit loops.

**Broadcasting** enables operations between arrays of different but compatible shapes.

In [None]:
# Vectorized operations
arr = np.array([1, 2, 3, 4, 5])
print("Original array:", arr)

In [None]:
# Scalar operations (broadcasting)
result = arr * 2  # Multiplies every element by 2
print("Multiplied by 2:", result)

In [None]:
result = arr + 10  # Adds 10 to every element
print("Added 10:", result)

In [None]:
result = arr ** 2  # Squares every element
print("Squared:", result)

In [None]:
# Mathematical functions
result = np.sqrt(arr)  # Square root of every element
print("Square root:", result)

In [None]:
result = np.sin(arr)  # Sine of every element
print("Sine values:", result)

In [None]:
# Broadcasting example
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
vector = np.array([10, 20, 30])

print("Matrix:")
print(matrix)
print("\nVector:")
print(vector)

In [None]:
# Broadcasting: adds vector to each row of matrix
result = matrix + vector
print("Matrix + Vector (broadcasting):")
print(result)

In [None]:
# More broadcasting examples
column_vector = np.array([[100], [200]])
print("Column vector:")
print(column_vector)

In [None]:
result = matrix + column_vector
print("Matrix + Column vector (broadcasting):")
print(result)

### Broadcasting Rules
* NumPy follows these rules for broadcasting:
  1. Shapes are compared dimension by dimension, starting from the rightmost
  2. If one array has fewer dimensions, its shape is padded with 1s on the left
  3. Two dimensions are compatible if they are equal or one of them is 1
  4. Dimensions of size 1 are "stretched" to match; any other mismatch raises a `ValueError`
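
The rules above can be checked on shapes alone. This sketch assumes NumPy 1.20 or later, where `np.broadcast_shapes` is available; the names `grid_2x3` and `per_row` are illustrative.

```python
import numpy as np

# np.broadcast_shapes applies the rules to shapes without building arrays.
print(np.broadcast_shapes((2, 3), (3,)))      # (2, 3)
print(np.broadcast_shapes((2, 3), (2, 1)))    # (2, 3)
print(np.broadcast_shapes((4, 1), (1, 5)))    # (4, 5)

# (2, 3) and (2,) are not compatible: aligned from the right, 3 and 2 differ
# and neither is 1, so NumPy raises ValueError.
grid_2x3 = np.ones((2, 3))
per_row = np.array([100, 200])
try:
    grid_2x3 + per_row
except ValueError as exc:
    print("Incompatible:", exc)

# Adding a trailing axis turns (2,) into (2, 1), which does broadcast.
fixed = grid_2x3 + per_row[:, np.newaxis]
print(fixed.shape)                            # (2, 3)
```

Reshaping with `np.newaxis` is a common way to say which axis a 1D array should line up with.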

In [None]:
# Broadcasting rules demonstration
print("Broadcasting examples:")

# Example 1: (3,) + (3,) = (3,)
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
print(f"\n{a.shape} + {b.shape} = {(a + b).shape}")
print(f"{a} + {b} = {a + b}")

In [None]:
# Example 2: (3,) + scalar = (3,)
c = 100
print(f"{a.shape} + scalar = {(a + c).shape}")
print(f"{a} + {c} = {a + c}")

In [None]:
# Example 3: (2,3) + (3,) = (2,3)
d = np.array([[1, 2, 3], [4, 5, 6]])
e = np.array([10, 20, 30])
print(f"{d.shape} + {e.shape} = {(d + e).shape}")
print(f"{d} + {e} =\n{d + e}")

In [None]:
# Example 4: (2,3) + (2,1) = (2,3)
f = np.array([[100], [200]])
print(f"{d.shape} + {f.shape} = {(d + f).shape}")
print(f"{d} +\n{f} =\n{d + f}")

## Array Indexing and Slicing

### Basic Array Indexing

In [None]:
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

print("Sample 3x3 array:")
print(arr)

In [None]:
# Single element access
print(f"Element at [0, 1]: {arr[0, 1]}")    # 2
print(f"Element at [1, 2]: {arr[1, 2]}")    # 6
print(f"Element at [-1, -1]: {arr[-1, -1]}")  # 9

In [None]:
# Entire row selection
print(f"First row [0, :]: {arr[0, :]}")
print(f"Second row [1, :]: {arr[1, :]}")
print(f"Last row [-1, :]: {arr[-1, :]}")

In [None]:
# Entire column selection
print(f"First column [:, 0]: {arr[:, 0]}")
print(f"Second column [:, 1]: {arr[:, 1]}")
print(f"Last column [:, -1]: {arr[:, -1]}")

In [None]:
# Slicing
print("Slice [0:2, 1:3]:")
print(arr[0:2, 1:3])

In [None]:
print("Slice [1:, :2]:")
print(arr[1:, :2])

### Array Indexing Key Points:

* Access elements using zero-based indices: `arr[row, column]`
* Use `:` to select entire rows or columns: `arr[:, 0]` for first column
* Slicing uses the same `start:stop` syntax as Python lists: `arr[1:3, 0:2]` selects a subarray (as a view of the original data, not a copy)
* Negative indices count from the end: `arr[-1, :]` for last row
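
One difference from Python lists is worth seeing in code: a basic NumPy slice is a view that shares memory with the original array, while a list slice is a new list. A minimal sketch, with illustrative variable names:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# A basic slice is a view: writing to it changes the original array.
top_left = grid[0:2, 0:2]
top_left[0, 0] = 99
print(grid[0, 0])                          # 99
print(np.shares_memory(grid, top_left))    # True

# .copy() gives independent data.
safe = grid[0:2, 0:2].copy()
safe[0, 0] = -1
print(grid[0, 0])                          # still 99

# A Python list slice, by contrast, is always a new list.
values = [1, 2, 3]
part = values[0:2]
part[0] = 99
print(values)                              # [1, 2, 3]
```

Call `.copy()` whenever a slice will be modified and the original must stay intact.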

### Boolean Array Indexing

In [None]:
# Boolean indexing
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

print("Original array:")
print(arr)

In [None]:
# Create boolean mask
mask = arr > 5
print("Boolean mask (arr > 5):")
print(mask)

In [None]:
# Select elements using the mask
print("Elements where arr > 5:")
print(arr[mask])

In [None]:
# Multiple conditions
mask_complex = (arr > 3) & (arr < 8)
print("Elements where 3 < arr < 8:")
print(arr[mask_complex])

In [None]:
# Conditional modification
arr_copy = arr.copy()
arr_copy[arr_copy > 5] = 0
print("Array after setting values > 5 to 0:")
print(arr_copy)

In [None]:
# More advanced boolean indexing examples
data = np.random.randint(1, 20, (5, 4))
print("Random data array:")
print(data)

In [None]:
# Find elements divisible by 3
divisible_by_3 = data % 3 == 0
print("Elements divisible by 3:")
print(data[divisible_by_3])

In [None]:
# Replace even numbers with -1
data_modified = data.copy()
data_modified[data_modified % 2 == 0] = -1
print("\nArray with even numbers replaced by -1:")
print(data_modified)

In [None]:
# Count elements meeting condition
count_greater_than_10 = np.sum(data > 10)
print(f"Number of elements > 10: {count_greater_than_10}")

In [None]:
# Get indices of elements meeting condition
rows, cols = np.where(data > 15)  # one array of row indices, one of column indices
print(f"Indices where data > 15:\n{list(zip(rows.tolist(), cols.tolist()))}")

### Boolean Indexing Key Points:

* Create boolean masks to filter arrays: `mask = arr > value`
* Use masks to select elements: `arr[mask]` returns elements where mask is True
* Modify elements conditionally: `arr[arr > value] = new_value`
* Combine conditions with `&` (and), `|` (or), `~` (not)
* Use parentheses with multiple conditions: `(arr > 3) & (arr < 8)`
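
The cells above only combine conditions with `&`. As a short sketch of the remaining points, the example below uses `|` and `~`, shows what happens when the parentheses are left out, and uses the three-argument form of `np.where` to replace values without modifying the array in place. The names `values` and `is_even` are illustrative.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# | (or): keep the values at either extreme
print(values[(values < 3) | (values > 7)])   # [1 2 8 9]

# ~ (not): invert a mask
is_even = values % 2 == 0
print(values[~is_even])                      # [1 3 5 7 9]

# Without parentheses, & binds tighter than the comparisons, so this is
# parsed as values > (3 & values) < 8 and raises ValueError.
try:
    values > 3 & values < 8
except ValueError as exc:
    print("Needs parentheses:", exc)

# np.where(condition, x, y) picks x where the condition is True, else y,
# and returns a new array.
print(np.where(is_even, 0, values))          # [1 0 3 0 5 0 7 0 9]
```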

## NumPy Aggregation Functions

NumPy provides many functions for computing summary statistics and aggregations.

In [None]:
# Create sample data for aggregation
np.random.seed(42)
data = np.random.randint(1, 100, (4, 5))
print("Sample data:")
print(data)

In [None]:
# Basic aggregations
print(f"Sum of all elements: {np.sum(data)}")
print(f"Mean of all elements: {np.mean(data):.2f}")
print(f"Median of all elements: {np.median(data):.2f}")
print(f"Variance: {np.var(data):.2f}")
print(f"Standard deviation: {np.std(data):.2f}")
print(f"Minimum value: {np.min(data)}")
print(f"Maximum value: {np.max(data)}")
print(f"Range (max - min): {np.max(data) - np.min(data)}")

In [None]:
# Percentiles
print(f"25th percentile: {np.percentile(data, 25):.2f}")
print(f"50th percentile (median): {np.percentile(data, 50):.2f}")
print(f"75th percentile: {np.percentile(data, 75):.2f}")

In [None]:
# Aggregations along axes
print("Sum along axis=0 (column sums):")
print(np.sum(data, axis=0))

In [None]:
print("Sum along axis=1 (row sums):")
print(np.sum(data, axis=1))

In [None]:
print("Mean along axis=0 (column means):")
print(np.mean(data, axis=0))

In [None]:
print("Mean along axis=1 (row means):")
print(np.mean(data, axis=1))

In [None]:
# More advanced aggregations
print("Advanced aggregation functions:")

# Cumulative operations
arr = np.array([1, 2, 3, 4, 5])
print(f"\nOriginal array: {arr}")
print(f"Cumulative sum: {np.cumsum(arr)}")
print(f"Cumulative product: {np.cumprod(arr)}")

In [None]:
# Unique values
arr_with_duplicates = np.array([1, 2, 2, 3, 3, 3, 4, 5, 5])
unique_vals, counts = np.unique(arr_with_duplicates, return_counts=True)
print(f"Array with duplicates: {arr_with_duplicates}")
print(f"Unique values: {unique_vals}")
print(f"Counts: {counts}")

In [None]:
# Sorting
unsorted_arr = np.array([64, 34, 25, 12, 22, 11, 90])
print(f"Unsorted array: {unsorted_arr}")
print(f"Sorted array: {np.sort(unsorted_arr)}")
print(f"Indices that would sort the array: {np.argsort(unsorted_arr)}")

In [None]:
# Conditional aggregations
print(f"Count of elements > 50: {np.sum(unsorted_arr > 50)}")
print(f"Any element > 80: {np.any(unsorted_arr > 80)}")
print(f"All elements > 10: {np.all(unsorted_arr > 10)}")

## Practical Exercise: Data Analysis with NumPy

Let's put everything together with a practical data analysis example.

In [None]:
# Simulate sales data for a retail company
np.random.seed(42)

# Generate data for 12 months, 4 product categories, 5 stores
months = 12
categories = 4
stores = 5

# Sales data: shape (months, categories, stores)
sales_data = np.random.normal(1000, 200, (months, categories, stores))
sales_data = np.abs(sales_data)  # Ensure non-negative sales

print(f"Sales data shape: {sales_data.shape}")
print(f"Total data points: {sales_data.size}")
print(f"Memory usage: {sales_data.nbytes} bytes")

print("\nFirst month's data (categories × stores):")
print(sales_data[0])

In [None]:
# Analyze the sales data: total sales across all dimensions
total_sales = np.sum(sales_data)
print(f"Total sales across all months/categories/stores: ${total_sales:,.2f}")

In [None]:
# Monthly sales (sum across categories and stores)
monthly_sales = np.sum(sales_data, axis=(1, 2))
print(f"Monthly sales: {monthly_sales}")
print(f"Best month: Month {np.argmax(monthly_sales) + 1} (${monthly_sales.max():,.2f})")
print(f"Worst month: Month {np.argmin(monthly_sales) + 1} (${monthly_sales.min():,.2f})")

In [None]:
# Category performance (sum across months and stores)
category_sales = np.sum(sales_data, axis=(0, 2))
print(f"Category sales: {category_sales}")
print(f"Best category: Category {np.argmax(category_sales) + 1} (${category_sales.max():,.2f})")

In [None]:
# Store performance (sum across months and categories)
store_sales = np.sum(sales_data, axis=(0, 1))
print(f"Store sales: {store_sales}")
print(f"Best store: Store {np.argmax(store_sales) + 1} (${store_sales.max():,.2f})")

In [None]:
# Statistical analysis (each entry is one month x category x store total)
print("Statistical Summary:")
print(f"Mean sales per entry: ${np.mean(sales_data):,.2f}")
print(f"Median sales per entry: ${np.median(sales_data):,.2f}")
print(f"Standard deviation: ${np.std(sales_data):,.2f}")
print(f"Coefficient of variation: {np.std(sales_data)/np.mean(sales_data):.2%}")

In [None]:
# Advanced analysis with boolean indexing

# Find high-performing periods (sales > 1.5 * mean)
threshold = 1.5 * np.mean(sales_data)
high_sales_mask = sales_data > threshold
high_sales_count = np.sum(high_sales_mask)
high_sales_percentage = (high_sales_count / sales_data.size) * 100

print(f"High Performance Analysis (sales > ${threshold:,.2f}):")
print(f"Number of high-sales periods: {high_sales_count}")
print(f"Percentage of high-sales periods: {high_sales_percentage:.1f}%")

In [None]:
# Identify underperforming areas (sales < 0.5 * mean)
low_threshold = 0.5 * np.mean(sales_data)
low_sales_mask = sales_data < low_threshold
low_sales_count = np.sum(low_sales_mask)

print(f"Underperforming periods (sales < ${low_threshold:,.2f}):")
print(f"Number of low-sales periods: {low_sales_count}")
print(f"Percentage of low-sales periods: {(low_sales_count / sales_data.size) * 100:.1f}%")

In [None]:
# Growth analysis (month-over-month)
monthly_growth = np.diff(monthly_sales) / monthly_sales[:-1] * 100
print("Month-over-Month Growth Rates:")

for index, growth in enumerate(monthly_growth):
    print(f"Month {index + 1} to {index + 2}: {growth:+.1f}%")

print(f"\nAverage monthly growth: {np.mean(monthly_growth):+.1f}%")
most_volatile = np.argmax(np.abs(monthly_growth))  # transition with the largest absolute change
print(f"Most volatile month transition: Month {most_volatile + 1} to {most_volatile + 2}")

## Reflection Questions
* Why is NumPy significantly faster than pure Python for numerical operations?
* When would you choose to use broadcasting instead of explicit loops?
* How do vectorized operations change the way you think about solving problems?
* What are the key benefits of using homogeneous arrays versus Python lists?

## Additional Resources

### Core Documentation
* [NumPy Documentation](https://numpy.org/doc/stable/)
* [NumPy Quickstart Tutorial](https://numpy.org/doc/stable/user/quickstart.html)
* [From Python to NumPy](https://www.labri.fr/perso/nrougier/from-python-to-numpy/)

### More Practice (if you want it)
* [100 NumPy Exercises](https://github.com/rougier/numpy-100)
* [NumPy Tutorial on Real Python](https://realpython.com/numpy-tutorial/)
