## 1. Creating NumPy Arrays

**NumPy** (short for "Numerical Python") is the foundational library for numerical computing in Python. Its core data structure is the **NumPy array** (the `ndarray` type).

Think of a NumPy array as a grid of values, all of the same type. Arrays resemble Python lists, but numerical operations on them are much faster and more memory-efficient, especially on large datasets. That speed and efficiency make them a staple of data analysis.

### 1. Import Convention
First, you need to import the library. The near-universal convention is to import `numpy` under the alias `np`.


In [24]:
import numpy as np

# A regular Python list
rent_list = [1500, 1500, 1750, 1750, 1800, 2000]

# Create a NumPy array from the list
rent_array = np.array(rent_list)

print("Python list:", rent_list)
print("NumPy array:", rent_array)

Python list: [1500, 1500, 1750, 1750, 1800, 2000]
NumPy array: [1500 1500 1750 1750 1800 2000]


In [4]:
# You can also create it directly
notifications = np.array([237, 150, 198, 205, 215, 180, 201])
print("Daily notifications:", notifications)

Daily notifications: [237 150 198 205 215 180 201]
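
To make the "all of the same type" point concrete, here is a small sketch that inspects an array's `dtype` attribute. The variable names are illustrative, and the exact integer type NumPy picks can differ between platforms, so the comments do not name it.

```python
import numpy as np

# A list of integers becomes an integer array
int_array = np.array([1500, 1750, 2000])

# Mixing integers and floats: NumPy converts every element to float
mixed_array = np.array([1500, 1750.5, 2000])

print("Integer array dtype:", int_array.dtype)
print("Mixed array dtype:", mixed_array.dtype)
print("Mixed array values:", mixed_array)
```

Because every element shares one type, the integers `1500` and `2000` in the second array are stored as floats.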


## 2. Inspecting Data (Indexing & Slicing)

Just like Python lists, NumPy arrays are **0-indexed**, meaning the first element is at index `0`.

### 1. Indexing (Accessing a Single Element)
You can access any individual element using square brackets `[]` with the element's index inside.

### 2. Slicing (Accessing a Range of Elements)
Slicing lets you select a sub-section of the array. The syntax is `array[start:end]`, where the `start` index is included and the `end` index is excluded.

In [5]:
# Array of NYC population (in millions, simplified for demo)
nyc_population = np.array([19.57, 19.67, 19.85, 20.10, 19.46, 19.62])

In [25]:
# 1. Indexing
print("First element:", nyc_population[0])
print("Last element:", nyc_population[-1]) # Negative indexing works too!

First element: 19.57
Last element: 19.62


In [8]:
# 2. Slicing
print("First three elements:", nyc_population[0:3])
print("Elements from index 2 to the end:", nyc_population[2:])

First three elements: [19.57 19.67 19.85]
Elements from index 2 to the end: [19.85 20.1  19.46 19.62]
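
A few more slice forms follow the same start-included, end-excluded rule. This is a short sketch reusing the `nyc_population` values from above; the other variable names are illustrative.

```python
import numpy as np

nyc_population = np.array([19.57, 19.67, 19.85, 20.10, 19.46, 19.62])

first_two = nyc_population[:2]   # start omitted: begins at index 0
middle = nyc_population[1:4]     # indices 1, 2, 3 (index 4 is excluded)
last_two = nyc_population[-2:]   # negative start: counts from the end

print("First two:", first_two)
print("Indices 1 to 3:", middle)
print("Last two:", last_two)
print("Length of [1:4]:", len(middle))  # end - start = 3
```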


## 3. Getting More Info (Shape & Basic Functions)

### 1. `.shape`
The `.shape` attribute tells you the dimensions of your array. It returns a tuple with one entry per dimension, each giving the size of that dimension. A 1D array with 7 elements has shape `(7,)`; a 2D array has shape `(rows, columns)`.

### 2. Math Functions
NumPy provides a suite of fast, optimized mathematical functions that operate on entire arrays at once. This is much more efficient than looping through a list in Python. Each of the following can be called as an array method (`arr.min()`) or as a function (`np.min(arr)`):
* `.min()`: Find the minimum value.
* `.max()`: Find the maximum value.
* `.sum()`: Calculate the sum of all elements.
* `.mean()`: Calculate the average of all elements. The function `np.average()` gives the same result, but arrays have no `.average()` method.
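
The sketch below compares the method spelling (`arr.min()`) with the function spelling (`np.min(arr)`) on the walking-steps data used in the following cells. It also checks that `average` is available only as the function `np.average()`, not as a method on the array.

```python
import numpy as np

walking_steps = np.array([10521, 8765, 12053, 7500, 9812, 11023, 9112])

# Method form and function form give the same answers
print(walking_steps.min(), np.min(walking_steps))
print(walking_steps.sum(), np.sum(walking_steps))
print(walking_steps.mean(), np.mean(walking_steps))

# np.average() exists only as a function, not as an array method
print(hasattr(walking_steps, "average"))  # False
print(np.isclose(np.average(walking_steps), walking_steps.mean()))  # True
```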

In [9]:
# Weekly walking steps
walking_steps = np.array([10521, 8765, 12053, 7500, 9812, 11023, 9112])
egg_carton = np.array([
  [0.89, 0.90, 0.83, 0.89, 0.97, 0.98],
  [0.95, 0.95, 0.89, 0.95, 0.23, 0.99]
]) # 2D array

In [10]:
# 1. Shape
print(f"Shape of walking_steps (1D): {walking_steps.shape}")
print(f"Shape of egg_carton (2D): {egg_carton.shape}")

Shape of walking_steps (1D): (7,)
Shape of egg_carton (2D): (2, 6)


In [11]:
# 2. Math Functions
print(f"Minimum steps: {np.min(walking_steps)}")
print(f"Maximum steps: {np.max(walking_steps)}")
print(f"Total steps: {np.sum(walking_steps)}")
print(f"Average daily steps: {np.average(walking_steps):.2f}") # .2f formats to 2 decimal places

Minimum steps: 7500
Maximum steps: 12053
Total steps: 68786
Average daily steps: 9826.57


In [12]:
# Functions work on 2D arrays too!
print(f"Average egg freshness: {np.average(egg_carton):.2f}")

Average egg freshness: 0.87
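
Beyond a single number for the whole grid, these functions also accept an `axis` argument that computes one result per row or per column. The following is a minimal sketch on the same `egg_carton` data; the result variable names are illustrative.

```python
import numpy as np

egg_carton = np.array([
    [0.89, 0.90, 0.83, 0.89, 0.97, 0.98],
    [0.95, 0.95, 0.89, 0.95, 0.23, 0.99],
])

# axis=1 averages across the columns, giving one value per row
row_averages = np.average(egg_carton, axis=1)

# axis=0 takes the minimum down the rows, giving one value per column
column_minimums = np.min(egg_carton, axis=0)

print("Average per row:", row_averages)
print("Minimum per column:", column_minimums)
```

With a `(2, 6)` array, `axis=1` returns 2 values and `axis=0` returns 6.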


## 4. Modifying Arrays (Operators & Reshaping)

### 1. Arithmetic Operators
One of NumPy's most powerful features is **vectorization**: an arithmetic operator applied to an array acts on every element, with no explicit loop. Think of it like a manager at a factory. Instead of telling each worker on an assembly line what to do one by one (a `for` loop), the manager shouts one command ("Everyone, speed up by 10%!") and the entire line performs the action at once. Vectorization is that single command to the whole array.
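
The contrast in the analogy can be sketched side by side: a `for` loop that converts one element at a time, and a single vectorized expression that does the same work. The building heights are the ones used in the cells further down; the variable names here are illustrative.

```python
import numpy as np

heights_ft = np.array([2717, 2227, 2073, 1972, 1966])

# Loop version: convert one element at a time
heights_m_loop = []
for height in heights_ft:
    heights_m_loop.append(height * 0.3048)

# Vectorized version: one expression for the whole array
heights_m_vectorized = heights_ft * 0.3048

print(np.allclose(heights_m_loop, heights_m_vectorized))  # True

# Other arithmetic operators work element by element too
print(heights_ft + 10)
print(heights_ft - heights_ft.min())
```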

### 2. `.reshape()`
You can change the shape of an array while keeping the same data with the `.reshape()` method. The total number of elements must remain the same. For example, you can reshape an array with 8 elements into a `(2, 4)` grid (2 rows, 4 columns) because `2 * 4 = 8`.

In [13]:
# 1. Operators
tallest_buildings_ft = np.array([2717, 2227, 2073, 1972, 1966])

In [14]:
# Convert feet to meters by multiplying the whole array by 0.3048
tallest_buildings_m = tallest_buildings_ft * 0.3048
print("Heights in meters:", tallest_buildings_m)

Heights in meters: [828.1416 678.7896 631.8504 601.0656 599.2368]


In [15]:
# 2. Reshaping
month_results = np.array([56, 100, 33, 0, 45, 45, 46, 34, 89, 180, 60, 45, 45, 44])
print("Original shape:", month_results.shape)

Original shape: (14,)


In [16]:
# Reshape 14 days of data into 2 weeks (2 rows, 7 columns)
weekly_results = month_results.reshape(2, 7)
print("\nReshaped into weekly data:")
print(weekly_results)
print("New shape:", weekly_results.shape)


Reshaped into weekly data:
[[ 56 100  33   0  45  45  46]
 [ 34  89 180  60  45  45  44]]
New shape: (2, 7)
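
The rule that the total number of elements must stay the same can be seen by requesting a shape that does not fit. This sketch reuses the 14 values from above and assumes nothing beyond `.reshape()` itself; `daily_results` and `reshape_failed` are illustrative names.

```python
import numpy as np

daily_results = np.array([56, 100, 33, 0, 45, 45, 46, 34, 89, 180, 60, 45, 45, 44])

# 2 * 7 = 14, so this works
two_weeks = daily_results.reshape(2, 7)
print(two_weeks.shape)

# 3 * 5 = 15, which does not match 14 elements, so NumPy raises ValueError
try:
    daily_results.reshape(3, 5)
    reshape_failed = False
except ValueError as error:
    reshape_failed = True
    print("Reshape failed:", error)
```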


## 5. Creating Arrays (Advanced)

### `np.arange()`
The `np.arange()` function is similar to Python's built-in `range()`, but it returns a NumPy array instead of a `range` object, and it also accepts non-integer steps. It's great for generating arrays of evenly spaced numbers.

The syntax is `np.arange(start, stop, step)`:
* `start`: The first value (inclusive).
* `stop`: The end value (exclusive).
* `step`: The interval between values.

In [17]:
# Halley's Comet returns roughly every 75 years (a simplification). It last appeared in 1986.
# Let's find its next appearances up to the year 2300.

comet_years = np.arange(start=1986, stop=2300, step=75)

print("Halley's Comet next appearances:", comet_years)

Halley's Comet next appearances: [1986 2061 2136 2211 2286]
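
A few smaller calls make the inclusive `start` and exclusive `stop` easier to see. This is a minimal sketch that compares `np.arange()` with the built-in `range()`; the variable name `evens` is illustrative.

```python
import numpy as np

# The same start/stop/step arguments give the same values as range()
evens = np.arange(0, 10, 2)
print(evens)                   # [0 2 4 6 8]
print(list(range(0, 10, 2)))   # [0, 2, 4, 6, 8]

# With a single argument, counting starts at 0 and stop is still excluded
print(np.arange(5))            # [0 1 2 3 4]

# stop is excluded even when a step lands exactly on it
print(np.arange(1986, 2061, 75))  # [1986]
```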


## 6. Case Study: Titanic Dataset

Now let's apply these concepts to analyze a real dataset. Here is data from 50 passengers on the Titanic.

The columns are:
1.  **Passenger ID**
2.  **Survived** (0 = No, 1 = Yes)
3.  **Passenger Class** (1 = Upper, 2 = Middle, 3 = Lower)
4.  **Age**

In [18]:
passengers = np.array([
   [1, 0, 3, 22], [2, 1, 1, 38], [3, 1, 3, 26], [4, 1, 1, 35], [5, 0, 3, 35],
   [6, 0, 3, 18], [7, 0, 1, 54], [8, 0, 3, 2], [9, 1, 3, 27], [10, 1, 2, 14],
  [11, 1, 3, 4], [12, 1, 1, 58], [13, 0, 3, 20], [14, 0, 3, 39], [15, 0, 3, 14],
  [16, 1, 2, 55], [17, 0, 3, 2], [18, 1, 2, 12], [19, 0, 3, 31], [20, 1, 3, 8],
  [21, 0, 2, 35], [22, 1, 2, 34], [23, 1, 3, 15], [24, 1, 1, 28], [25, 0, 3, 8],
  [26, 1, 3, 38], [27, 0, 3, 2], [28, 0, 1, 1], [29, 1, 3, 5], [30, 0, 3, 18],
  [31, 0, 1, 40], [32, 1, 1, 70], [33, 1, 3, 33], [34, 0, 2, 66], [35, 0, 1, 28],
  [36, 0, 1, 42], [37, 1, 3, 5], [38, 0, 3, 18], [39, 0, 3, 18], [40, 1, 3, 14],
  [41, 0, 3, 40], [42, 0, 2, 27], [43, 0, 3, 29], [44, 1, 2, 0], [45, 1, 3, 19],
  [46, 0, 3, 33], [47, 0, 3, 14], [48, 1, 3, 22], [49, 0, 3, 41], [50, 0, 3, 18]
])

### Q1: What is the shape of this array?

In [19]:
print(passengers.shape)

(50, 4)


### Q2: What is the average age of the passengers?

In [20]:
# The 'Age' is in the 4th column (index 3)
ages = passengers[:, 3] # The ':' means 'all rows'
average_age = np.average(ages)
print(f"The average passenger age is: {average_age:.1f} years old")

The average passenger age is: 25.5 years old


### Q3: What is the percentage of passengers that survived?

In [21]:
# The 'Survived' column is the 2nd column (index 1)
survived_column = passengers[:, 1]

# The average of a column of 0s and 1s is the fraction of 1s; multiply by 100 for a percentage
survival_percentage = np.average(survived_column) * 100
print(f"{survival_percentage:.0f}% of passengers survived.")

44% of passengers survived.


### Q4: What was the survival percentage by passenger class?

This is a more complex question that requires **filtering**, a key skill in data analysis. We can create a "mask" to select only the rows that meet a certain condition.

1. Create a boolean mask (e.g., `passengers[:, 2] == 1` checks which rows are for Class 1).
2. Apply this mask to an array with one entry per passenger (the full `passengers` array, or a single column such as `survived`) to get a new array with only the matching entries.
3. Perform calculations on this new, filtered array.
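
The three steps above can be sketched on a small slice of the data. The `sample` array below holds only the first five passengers from the dataset, so its numbers are not the answer to Q4; the names `sample` and `class1_rows` are illustrative.

```python
import numpy as np

# First five passengers from the dataset: ID, Survived, Class, Age
sample = np.array([
    [1, 0, 3, 22], [2, 1, 1, 38], [3, 1, 3, 26], [4, 1, 1, 35], [5, 0, 3, 35]
])

# Step 1: one True/False value per row
class1_mask = sample[:, 2] == 1
print(class1_mask)   # [False  True False  True False]

# Step 2: keep only the rows where the mask is True
class1_rows = sample[class1_mask]
print(class1_rows)

# Step 3: calculate on the filtered rows (ages 38 and 35)
print(np.average(class1_rows[:, 3]))   # 36.5
```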

In [22]:
# Get the 'Survived' and 'Class' columns
survived = passengers[:, 1]
p_class = passengers[:, 2]

# Create a boolean mask for each passenger class
class1_mask = (p_class == 1)
class2_mask = (p_class == 2)
class3_mask = (p_class == 3)

# Calculate the average survival rate for each class using the masks
class1_survival = np.average(survived[class1_mask]) * 100
class2_survival = np.average(survived[class2_mask]) * 100
class3_survival = np.average(survived[class3_mask]) * 100

print(f"Survival rate for Class 1: {class1_survival:.0f}%")
print(f"Survival rate for Class 2: {class2_survival:.0f}%")
print(f"Survival rate for Class 3: {class3_survival:.0f}%")

Survival rate for Class 1: 50%
Survival rate for Class 2: 62%
Survival rate for Class 3: 38%
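
The three near-identical calculations can also be written as a loop over the distinct class values. This is one possible sketch, not the only way to do it: it relies on `np.unique()`, which is not used elsewhere in this notebook, and it runs on only the first ten passengers to stay short, so its percentages differ from the full-dataset results above.

```python
import numpy as np

# First ten passengers only, to keep the example short
sample = np.array([
    [1, 0, 3, 22], [2, 1, 1, 38], [3, 1, 3, 26], [4, 1, 1, 35], [5, 0, 3, 35],
    [6, 0, 3, 18], [7, 0, 1, 54], [8, 0, 3, 2], [9, 1, 3, 27], [10, 1, 2, 14]
])
survived = sample[:, 1]
p_class = sample[:, 2]

rates = {}
for class_number in np.unique(p_class):  # sorted distinct values: 1, 2, 3
    mask = (p_class == class_number)
    rates[int(class_number)] = np.average(survived[mask]) * 100
    print(f"Class {class_number}: {mask.sum()} passengers, "
          f"{rates[int(class_number)]:.0f}% survived")
```

Printing the group size next to each rate is a useful habit: a percentage based on one or two passengers says very little.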
