# Data Pipeline in Practice

`NumPy + matplotlib + pandas` for real‑world analysis

## Learning goals
1. Deepen your NumPy fluency: boolean masking, fancy indexing, reshaping tricks.
2. Use broadcasting & linear algebra helpers to solve compact physics problems.
3. Craft multi‑panel, publication‑quality figures with the OO interface of **matplotlib**.
4. Discover **pandas** as a labelled data container and quick analysis tool.
5. Launch a mini‑project: clean, analyse & visualise a small experimental dataset.

## 0 · Warm‑up: vectorisation wins
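Run the timing snippet below to remind yourself **why** we love NumPy: `np.sin(x)` loops over the array in compiled code, whereas the list comprehension calls `math.sin` once per element from the interpreter.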

In [None]:
import numpy as np, math
n = 1000
x = np.linspace(0, 2*np.pi, n)

def pure_python(xs):
    return [math.sin(z) for z in xs]

%timeit pure_python(x)
%timeit np.sin(x)
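
The `%timeit` magic only exists inside IPython/Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used; the `repeats` and `number` values here are arbitrary choices, not recommendations.

```python
import math
import timeit

import numpy as np

n = 1000
x = np.linspace(0, 2 * np.pi, n)


def pure_python(xs):
    """Apply math.sin element by element in a Python-level loop."""
    return [math.sin(z) for z in xs]


# Take the best of several repeats to reduce noise from other processes.
repeats, number = 5, 200
t_loop = min(timeit.repeat(lambda: pure_python(x), repeat=repeats, number=number)) / number
t_numpy = min(timeit.repeat(lambda: np.sin(x), repeat=repeats, number=number)) / number

print(f"Python loop: {t_loop * 1e6:8.1f} us per call")
print(f"np.sin:      {t_numpy * 1e6:8.1f} us per call")

# Both approaches should agree numerically; only the speed differs.
same_result = np.allclose(pure_python(x), np.sin(x))
print("Results agree:", same_result)
```

Absolute timings depend on the machine, so compare the ratio of the two numbers rather than the values themselves.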

## 1 · Advanced NumPy toolkit

In [None]:
import numpy as np

# Synthetic 2‑D sensor data: rows = time snapshots, cols = sensors
rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=(5, 4))
print('Raw data:\n', data)

# Boolean mask: mark values outside ±1.5σ as outliers
mask_out = np.abs(data) > 1.5
print('\nMask:\n', mask_out)

# Replace outliers with NaN, then compute column means ignoring NaNs
clean = data.copy()
clean[mask_out] = np.nan
col_means = np.nanmean(clean, axis=0)
print('\nColumn means (cleaned):', col_means)

# Fancy indexing: extract rows 0 and 3, columns 1 and 2
subset = data[[0, 3]][:, [1, 2]]
print('\nFancy‑indexed subset:\n', subset)
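
The chained form `data[[0, 3]][:, [1, 2]]` works, but it is easy to confuse with `data[[0, 3], [1, 2]]`, which does something different. The sketch below, on a small integer array standing in for the sensor data, contrasts the two and shows `np.ix_` as a one-step alternative.

```python
import numpy as np

data = np.arange(20).reshape(5, 4)   # small stand-in for the sensor array

# Chained indexing: pick rows first, then columns (creates an intermediate copy).
chained = data[[0, 3]][:, [1, 2]]

# np.ix_ builds an open mesh, selecting the same 2x2 block in one step.
block = data[np.ix_([0, 3], [1, 2])]

# Passing both lists directly pairs them up: elements (0, 1) and (3, 2) only.
paired = data[[0, 3], [1, 2]]

print("np.ix_ block:\n", block)
print("paired elements:", paired)

# Fancy indexing returns a copy, so editing the result leaves `data` intact.
block[0, 0] = -1
print("data[0, 1] is still", data[0, 1])
```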


**Exercise 1 – Reshape gymnastics**  
1. Create a NumPy array `A` of shape (24,) containing the integers 0‑23.  
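2. Reshape it into a 3‑D array of shape (2, 3, 4) **without copying** data (hint: `reshape` returns a view of a contiguous array; confirm with `np.shares_memory`).  
3. Reorder the axes so that the new shape is (3, 4, 2) (hint: a single `swapaxes` call cannot produce this shape; try `transpose` or `np.moveaxis`).  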
4. Verify that modifying one view affects the original array.
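
As a hint rather than a solution, here is a minimal sketch on a smaller array showing how to tell a view from a copy. The names `base`, `grid`, `flipped` and `picked` are illustrative only.

```python
import numpy as np

base = np.arange(6)
grid = base.reshape(2, 3)      # reshape of a contiguous array returns a view
flipped = grid.T               # transposing only changes strides: also a view
picked = base[[0, 2, 4]]       # fancy indexing returns a copy

print(np.shares_memory(base, grid))     # True
print(np.shares_memory(base, flipped))  # True
print(np.shares_memory(base, picked))   # False

grid[0, 0] = 99                # writes through to the shared buffer
print(base[0], flipped[0, 0], picked[0])
```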


In [None]:
# Your code here


## 2 · Broadcasting & Linear Algebra mini‑tour

In [None]:
import numpy as np

# Outer product via broadcasting
v = np.arange(5)
outer = v[:, None] * v[None, :]
print('Outer product:\n', outer)

# Solve a linear system: RLC circuit currents
A = np.array([[4, 1, -1],
              [2, 7,  1],
              [1, -3, 5]], dtype=float)
b = np.array([15, 18, 14], dtype=float)
x = np.linalg.solve(A, b)
print('\nSolution x =', x)

# Eigenvalues of a 2×2 matrix
eigvals, eigvecs = np.linalg.eig(np.array([[0, 1], [-2, -3]]))
print('\nEigenvalues:', eigvals)
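
It is worth checking numerical results rather than trusting them. The sketch below repeats the three computations and verifies each one: the broadcast product against `np.outer`, the solution by substituting it back into `A x = b`, and the eigenpairs against the defining relation `M v = λ v`. The name `M` for the 2×2 matrix is introduced here for illustration.

```python
import numpy as np

# Broadcasting a column against a row reproduces np.outer.
v = np.arange(5)
outer = v[:, None] * v[None, :]
outer_matches = np.array_equal(outer, np.outer(v, v))

# Substitute the solution back into the system and inspect the residual.
A = np.array([[4, 1, -1],
              [2, 7,  1],
              [1, -3, 5]], dtype=float)
b = np.array([15, 18, 14], dtype=float)
x = np.linalg.solve(A, b)
residual = A @ x - b
print("max |A x - b| =", np.max(np.abs(residual)))

# Each column of eigvecs is an eigenvector: M @ v_i == lambda_i * v_i.
M = np.array([[0.0, 1.0], [-2.0, -3.0]])
eigvals, eigvecs = np.linalg.eig(M)
eig_ok = np.allclose(M @ eigvecs, eigvecs * eigvals)
print("outer ok:", outer_matches, "| eigenpairs ok:", eig_ok)
```

Multiplying `eigvecs * eigvals` relies on broadcasting again: each column `i` is scaled by `eigvals[i]`.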

## 3 · Multi‑panel plotting with matplotlib (OO interface)

In [None]:
%matplotlib inline
import numpy as np, matplotlib.pyplot as plt

t = np.linspace(0, 10, 500)
signal = np.exp(-0.2*t) * np.cos(2*np.pi*1.5*t)
noise = 0.1 * np.random.default_rng(0).normal(size=t.size)
y = signal + noise

# No sharex here: the top panel's x-axis is time, the bottom panel's is residual
fig, axs = plt.subplots(2, 1, figsize=(7, 6))

axs[0].plot(t, y, label='noisy')
axs[0].plot(t, signal, '--', label='true')
axs[0].set_xlabel('Time')
axs[0].set_ylabel('Amplitude')
axs[0].legend()
axs[0].set_title('Damped cosine, noisy vs true')

axs[1].hist(y - signal, bins=40)
axs[1].set_xlabel('Residual')
axs[1].set_ylabel('Count')

fig.tight_layout()
plt.show()
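
For figures destined for a report, the same OO interface can write straight to a file. This is a minimal sketch for a standalone script, assuming the non-interactive `Agg` backend, a temporary output directory and a made-up file name `damped_cosine.png`. Both panels here are functions of time, which is the situation where `sharex=True` is appropriate.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")          # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import numpy as np

t = np.linspace(0, 10, 500)
envelope = np.exp(-0.2 * t)
signal = envelope * np.cos(2 * np.pi * 1.5 * t)

# Both panels are plotted against time, so they can share one x-axis.
fig, axs = plt.subplots(2, 1, figsize=(7, 6), sharex=True)
axs[0].plot(t, signal)
axs[0].set_ylabel("Amplitude")
axs[1].plot(t, envelope, label="envelope")
axs[1].set_xlabel("Time")
axs[1].set_ylabel("Amplitude")
axs[1].legend()
fig.tight_layout()

# Hypothetical output location; 300 dpi is a common choice for print.
out_path = os.path.join(tempfile.mkdtemp(), "damped_cosine.png")
fig.savefig(out_path, dpi=300)
plt.close(fig)
print("Saved figure to", out_path)
```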

## 4 · pandas quick‑start

In [None]:
import pandas as pd, numpy as np

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    'timestamp': pd.date_range('2025-01-01', periods=n, freq='h'),
    'sensor': rng.choice(['A', 'B', 'C'], size=n),
    'reading': rng.normal(0, 1, size=n) + np.linspace(0, 3, n)
})
df.head()

In [None]:
# Summary statistics (count, mean, std, min, quartiles, max) per sensor
group_stats = df.groupby('sensor')['reading'].describe()
group_stats
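
When only a few statistics are needed, `describe()` can be replaced by an explicit aggregation. The sketch below uses named aggregation on the grouped column; the output column names (`n_obs`, `mean`, `std`, `median`) are arbitrary labels chosen for this example, and the DataFrame is rebuilt without the `timestamp` column so the block runs on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "sensor": rng.choice(["A", "B", "C"], size=n),
    "reading": rng.normal(0, 1, size=n) + np.linspace(0, 3, n),
})

# keyword = name of the output column, value = aggregation to apply
summary = df.groupby("sensor")["reading"].agg(
    n_obs="count",
    mean="mean",
    std="std",
    median="median",
)
print(summary.round(3))
```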

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
for name, sub in df.groupby('sensor'):
    ax.plot(sub['timestamp'], sub['reading'], label=f'Sensor {name}')
ax.legend()
ax.set_ylabel('Reading')
ax.set_title('Sensor drift')
plt.show()


**Exercise 2 – Cleaning & slicing**  
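1. Compute a rolling mean over a 10‑hour window for each sensor and plot it. Each sensor's readings are irregularly spaced in time, so use a time‑based window (e.g. `rolling('10h')` on a timestamp index) rather than a fixed count of 10 rows.  
2. Remove readings below the 5th or above the 95th percentile, then plot a histogram of the remaining readings from all sensors combined.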
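
As a hint rather than a solution, this sketch shows the two building blocks on a five-point toy series with irregular gaps. The timestamps and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy series with irregular gaps, standing in for a single sensor.
idx = pd.to_datetime([
    "2025-01-01 00:00", "2025-01-01 01:00", "2025-01-01 05:00",
    "2025-01-01 12:00", "2025-01-01 13:00",
])
s = pd.Series([1.0, 3.0, 5.0, 7.0, 9.0], index=idx)

by_count = s.rolling(3).mean()       # last 3 samples, however far apart
by_time = s.rolling("10h").mean()    # samples within the preceding 10 hours

print(pd.DataFrame({"value": s, "by_count": by_count, "by_time": by_time}))

# Percentile filter: keep values between the 5th and 95th percentiles.
lo, hi = s.quantile([0.05, 0.95])
trimmed = s[s.between(lo, hi)]
print(trimmed)
```

The count-based window leaves the first two rows as NaN and ignores the gaps, while the time-based window adapts to however many samples fall inside each 10-hour span.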


In [None]:
# Your code here


## 5 · Mini‑project
Choose **one** of the datasets in the `datasets/` folder (e.g. `climate.csv`, `lab_sensors.csv`). Create a new notebook that:
1. Loads the data into a `pandas` DataFrame.
2. Cleans obvious issues (NaNs, outliers).
3. Computes at least **two** insightful statistics.
4. Produces a figure with **multiple axes** telling a concise story.
5. Ends with 2‑3 bullet conclusions.
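
The steps above can be sketched as a skeleton like the one below. It is a starting point only: the data are synthetic, the column names `timestamp` and `temperature` are assumptions rather than the real schema of the files in `datasets/`, and the median-based outlier rule is just one possible choice.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for a CSV file; the columns are assumptions.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "timestamp": pd.date_range("2025-01-01", periods=48, freq="h"),
    "temperature": 20 + rng.normal(0, 0.5, size=48),
})
demo.loc[5, "temperature"] = np.nan   # inject a missing value
demo.loc[20, "temperature"] = 95.0    # inject an obvious outlier
buffer = io.StringIO(demo.to_csv(index=False))

# 1. Load (for a real file, pass its path instead of `buffer`)
df = pd.read_csv(buffer, parse_dates=["timestamp"])

# 2. Clean: drop NaNs, then drop points far from the median (robust MAD rule)
df = df.dropna(subset=["temperature"])
deviation = (df["temperature"] - df["temperature"].median()).abs()
robust_sigma = 1.4826 * deviation.median()
clean = df[deviation <= 3 * robust_sigma]

# 3. Two summary statistics
mean_temp = clean["temperature"].mean()
std_temp = clean["temperature"].std()

# 4. One figure, two axes: time series on the left, distribution on the right
fig, (ax_ts, ax_hist) = plt.subplots(1, 2, figsize=(9, 3.5))
ax_ts.plot(clean["timestamp"], clean["temperature"])
ax_ts.set_xlabel("Time")
ax_ts.set_ylabel("Temperature")
ax_ts.tick_params(axis="x", rotation=45)
ax_hist.hist(clean["temperature"], bins=15)
ax_hist.set_xlabel("Temperature")
ax_hist.set_ylabel("Count")
fig.tight_layout()

print(f"mean = {mean_temp:.2f}, std = {std_temp:.2f}, n = {len(clean)}")
```

Step 5, the written conclusions, belongs in a Markdown cell after the figure.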
