# Data Science Basics with NumPy, Pandas, and Matplotlib

Here we will go over the basics of working with datasets using the Python libraries NumPy and Pandas. We will also use Matplotlib to visualize data.

## Loading Modules/Libraries

In [None]:
# Option 1:
import time
time.sleep(0.1)  # Sleep for 0.1 seconds

In [None]:
# Option 2:
from time import sleep
sleep(0.1)  # Sleep for 0.1 seconds

In [None]:
# Option 3:
import time as t
t.sleep(0.1)  # Sleep for 0.1 seconds

In [None]:
# DON'T (a wildcard import can silently overwrite names that already exist):
from time import *
sleep(0.1)  # Sleep for 0.1 seconds
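
The following minimal sketch illustrates why wildcard imports are risky. It uses the standard-library `math` module, chosen here only for illustration, and runs the import through `exec` into a separate dictionary so that the current namespace stays untouched.

```python
import builtins

# Run the wildcard import inside a throwaway namespace (a plain dict)
namespace = {}
exec("from math import *", namespace)

# Names brought in by the wildcard import that also exist as Python built-ins
shadowed = sorted(
    name for name in namespace
    if not name.startswith("_") and hasattr(builtins, name)
)
print("Built-ins shadowed by 'from math import *':", shadowed)

# The built-in pow keeps integers, math.pow always returns a float
print("built-in pow(2, 3):", pow(2, 3))
print("math pow(2, 3):    ", namespace["pow"](2, 3))
```

With an explicit import such as `import math`, both functions stay available side by side as `pow` and `math.pow`.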

## NumPy Basics

NumPy is a crucial library in the Python ecosystem because it handles array operations very quickly. Its core routines are written in C, so an operation on a whole array runs as compiled code instead of an interpreted Python loop.

Unlike MATLAB, where arrays are built in, Python relies on NumPy for efficient numerical arrays, so NumPy has to be installed and imported first.
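
As a small illustration of why plain Python lists are not a substitute for arrays, the sketch below applies the `+` operator to two lists and to two NumPy arrays. The variable names are chosen only for this example.

```python
import numpy as np

list_a = [1, 2, 3]
list_b = [4, 5, 6]
arr_a = np.array(list_a)
arr_b = np.array(list_b)

list_sum = list_a + list_b  # "+" concatenates lists
arr_sum = arr_a + arr_b     # "+" adds arrays element by element

print("Lists: ", list_sum)
print("Arrays:", arr_sum)
```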

In [None]:
import time
import random

# Matrix addition with nested loops
# Matrix size
N = 5000

A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]

C = [[0.0 for _ in range(N)] for _ in range(N)]

start = time.time()
for i in range(N):
    for j in range(N):
        C[i][j] = A[i][j] + B[i][j]
end = time.time()

print(f"Time to add two {N}x{N} matrices (pure nested Python lists): {round(end - start, 4)} seconds")

In [None]:
import numpy as np

A_np = np.random.rand(N, N)
B_np = np.random.rand(N, N)

start = time.time()
C_np = A_np + B_np
end = time.time()

print(f"Time to add two {N}x{N} matrices (with NumPy arrays): {round(end - start, 4)} seconds")

### Creating Arrays

In [None]:
# From Python list
arr1 = np.array([1, 2, 3])
print("From list:", arr1)

# Zeros and ones
arr2 = np.zeros((2, 3))  # (rows, columns)
arr3 = np.ones((3, 3))
arr4 = np.full((2, 2), 17.5)
print("Zeros:\n", arr2)
print("Ones:\n", arr3)
print("Full of 17.5s:\n", arr4)

# Range and linspace
arr5 = np.arange(0, 10, 2)  # integer, analogous to range() function
arr6 = np.linspace(0, 1, 5)  # floats
print("arange:", arr5)
print("linspace:", arr6)

# Random
arr7 = np.random.rand(2, 2)  # uniform [0, 1)
arr8 = np.random.randint(0, 10, (3, 3))
print("Random floats:\n", arr7)
print("Random ints:\n", arr8)

### Accessing Array Properties

In [None]:
arr = np.random.randint(0, 10, (2, 3))
print("Array:\n", arr)
print("Shape:", arr.shape)
print("Dimensions:", arr.ndim)
print("Size:", arr.size)
print("Data type:", arr.dtype)
print("Item size (bytes):", arr.itemsize)


### Indexing and Slicing in NumPy

In [None]:
arr = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])
print("Array:\n", arr)

print("Single element [0, 1]:", arr[0, 1])
print("First column:", arr[:, 0])
print("First row:", arr[0, :])
print("Rows 1-2, all cols:\n", arr[1:3, :])
print("Boolean indexing >50:", arr[arr > 50])
print("Fancy indexing, elements (0, 1) and (2, 2):", arr[[0, 2], [1, 2]])
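
Passing two index lists pairs them up element by element, which is easy to confuse with selecting a block of rows and columns. The sketch below contrasts the two cases; `np.ix_` is one way to select the block and is not used elsewhere in this notebook.

```python
import numpy as np

arr = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])

# Two index lists are paired up: elements (0, 1) and (2, 2)
pairs = arr[[0, 2], [1, 2]]
print("Paired elements:", pairs)

# np.ix_ builds an open mesh: rows 0 and 2 crossed with columns 1 and 2
block = arr[np.ix_([0, 2], [1, 2])]
print("Sub-matrix:\n", block)
```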


### Element-wise Operations

In [None]:
arr = np.array([[1, 2, 3], [4, 5, 6]])

print("Original:\n", arr)
print("Squared:\n", arr**2)
print("Square root:\n", np.sqrt(arr))
print("Exponential:\n", np.exp(arr))
print("Logarithm:\n", np.log(arr))
# other functions: np.sin, np.cos, np.tan, np.abs, np.ceil, np.floor

# Element-wise operations with two arrays of the same shape
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print("a + b:\n", a + b)
print("a - b:\n", a - b)
print("a * b:\n", a * b)
print("a / b:\n", a / b)


### Aggregations of NumPy Arrays

In [None]:
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Array:\n", arr)

print("Sum over entire array:", arr.sum())
print("Sum of each column (axis=0):", arr.sum(axis=0))
print("Sum of each row (axis=1):", arr.sum(axis=1))
print("Mean:", arr.mean())
print("Std:", arr.std())
print("Min:", arr.min())
print("Max:", arr.max())
print("Argmin:", arr.argmin())  # index of min element of flattened array
print("Argmax:", arr.argmax())  # index of max element of flattened array
print("Argmin along axis 0 (columns):", arr.argmin(axis=0))  # indices of min elements in each column
print("Argmax along axis 1 (rows):", arr.argmax(axis=1))  # indices of max elements in each row

print("Sum along axis 0 (columns):", arr.sum(axis=0))
print("Mean along axis 1 (rows):", arr.mean(axis=1))

# Get row/column indices of max/min elements
row_indices, col_indices = np.unravel_index(arr.argmax(), arr.shape)
print(f"Row and column indices of max element: ({row_indices}, {col_indices})")

### Reshaping and Combining Arrays

In [None]:
arr = np.arange(6)
print("Original:", arr)

reshaped = arr.reshape(3, 2)
print("Reshaped (3x2):\n", reshaped)

flat = reshaped.flatten()
print("Flattened:", flat)

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
c = np.array([[7], [8]])
print("Array a:\n", a)
print("Array b:\n", b)
print("Concatenate:\n", np.concatenate([a, b], axis=0))
print("Vertical stack:\n", np.vstack([a, b]))
print("Horizontal stack:\n", np.hstack([a, c]))
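
One detail worth knowing when reshaping: for a contiguous array like the one above, `reshape` returns a view on the same data, while `flatten` always returns an independent copy. The sketch below checks this with `np.shares_memory`.

```python
import numpy as np

arr = np.arange(6)
reshaped = arr.reshape(3, 2)  # view on the same data (arr is contiguous)
flat = reshaped.flatten()     # always an independent copy

reshaped[0, 0] = 99  # also changes arr
flat[1] = -1         # changes only the copy

print("arr after both assignments:", arr)
shares_view = np.shares_memory(arr, reshaped)
shares_copy = np.shares_memory(arr, flat)
print("arr and reshaped share memory:", shares_view)
print("arr and flat share memory:", shares_copy)
```

If an independent reshaped array is needed, calling `.copy()` on the result of `reshape` avoids accidental changes to the original.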

### Linear Algebra Operations

In [None]:
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print("Matrix A:\n", A)
print("Matrix B:\n", B)

print("Matrix product (np.dot):\n", np.dot(A, B))
print("Using @ operator:\n", A @ B)
print("Transpose:\n", A.T)

print("Inverse:\n", np.linalg.inv(A))
print("Determinant:", np.linalg.det(A))
eigvals, eigvecs = np.linalg.eig(A)
print("Eigenvalues:", eigvals)
print("Eigenvectors:\n", eigvecs)
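
The results of these routines can be checked numerically. The following sketch verifies the inverse and the eigen-decomposition with `np.allclose`, and additionally solves a linear system with `np.linalg.solve`. The right-hand side `b` is an arbitrary vector chosen for illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])

# A times its inverse should give the identity matrix (up to rounding)
A_inv = np.linalg.inv(A)
identity_ok = np.allclose(A @ A_inv, np.eye(2))

# Column k of eigvecs belongs to eigvals[k]; broadcasting scales each column
eigvals, eigvecs = np.linalg.eig(A)
eig_ok = np.allclose(A @ eigvecs, eigvecs * eigvals)

# Solve A x = b without forming the inverse explicitly
b = np.array([5.0, 6.0])
x = np.linalg.solve(A, b)
solve_ok = np.allclose(A @ x, b)

print("A @ A_inv is the identity:", identity_ok)
print("A v = lambda v holds:", eig_ok)
print("Solution x of A x = b:", x)
```

For solving linear systems, `np.linalg.solve` is usually preferable to multiplying by the inverse, since it is generally faster and numerically more stable.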

### Exercise 4.1

You are given measurements of thermal conductivity (in W/(m·K)) of 3 different alloys, each measured at 5 different temperatures.

Determine:

- The overall maximum value and its row/column location (use `np.argmax` + `np.unravel_index`).

- The minimum conductivity for each column (temperature).

- The row index of the alloy with the highest average conductivity.

- Compute the relative drop in conductivity (in %) for each alloy from its first to its last measurement.

- Find all conductivity values greater than 170 using boolean indexing.

- Suppose you want to compute a weighted average conductivity for each alloy, where the weights represent the relative importance of each temperature. Use matrix multiplication to compute the weighted conductivity values (1 per alloy) and find which alloy has the highest weighted conductivity.

In [None]:
data = np.array([
    [200, 190, 185, 180, 178],   # Alloy A
    [150, 148, 147, 145, 144],   # Alloy B
    [100,  98,  95,  92,  90]    # Alloy C
])
weights = np.array([0.1, 0.2, 0.3, 0.2, 0.2])  # weights for each temperature

# Your code below:


## Working with Data in Pandas
Pandas is built on top of NumPy and provides high-level data structures and functions designed to make data analysis fast and easy in Python.


### Creating Pandas Objects and Inspecting Them

In [None]:
import pandas as pd
import numpy as np

# Series
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s)

# DataFrame
df = pd.DataFrame({"A": [1.2, 2.7, 3.1], "B": [4.9, 5.7, 6.2]}, index=["x", "y", "z"])
print(df)

print("\nInspecting DataFrame:\n", df)
print(df.head())  # first few rows
print(df.tail())  # last few rows

print("\nInfo and describe:")
df.info()  # prints the summary itself and returns None, so no print() is needed
print(df.describe().round(2))  # statistical summary rounded to 2 decimals

### Selecting Data

Selecting and slicing data in Pandas can be confusing at first, because there are several accessors, each with its own benefits and drawbacks.

| Method | Based on   | Supports arrays/slices | Slice end inclusive? | Can assign new indices? | Speed | Example |
|--------|------------|-------------------------|----------------------|---------------------------------|-------|---------|
| **loc**  | Labels     | ✅ Yes (slices, boolean arrays) | ✅ Yes (end label included) | ✅ Yes | Normal | `df.loc["r1":"r3", "col"] = 3` |
| **iloc** | Positions  | ✅ Yes (slices, ranges) | ❌ No (end excluded) | ❌ No  | Normal | `df.iloc[:3, 2:4] = 3` |
| **at**   | Labels     | ❌ No (scalar only)     | N/A                  | ✅ Yes | Faster than loc | `df.at["C", "col"] = 3` |
| **iat**  | Positions  | ❌ No (scalar only)     | N/A                  | ❌ No  | Faster than iloc | `df.iat[2, 1] = 3` |



In [None]:
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

# Column selection
print(df["A"])
print(df.A)  # Same result, but not recommended if column name has spaces or conflicts with DataFrame methods

In [None]:
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

# iloc (by position)
print(f"\nselect rows 0 and 1, col 1:\n{df.iloc[0:2, 1]}")
print(f"same as above but locating col 'B' by its label:\n{df.iloc[0:2, df.columns.get_loc('B')]}")
# Cannot assign new columns with iloc

# loc (by labels)
print(f'\nselect row "y", cols "A" through "B":\n{df.loc["y", "A":"B"]}')  # Careful, loc is end-inclusive!
print(f"same as above but accessing row 'y' by position:\n{df.loc[df.index[1], 'A':'B']}")
print(f"selecting with a boolean mask:\n{df.loc[df['A'] > 1, :]}")
df.loc[:, 'C'] = [7, 8, 9]
print("Can assign new columns with loc:\n", df)

In [None]:
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}, index=["x", "y", "z"])

# iat (faster than iloc but only for single element)
print(f"\nselect element at row 2, col 0: {df.iat[2, 0]}") 
print(f"same as above but accessing row 2 by label: {df.iat[df.index.get_loc('z'), 0]}")
# Cannot assign new columns with iat

# at (faster than loc but only for single element)
print(f'\nselect element at row "z", col "B": {df.at["z", "B"]}') 
print(f"same as above but accessing col 'B' by position: {df.at['z', df.columns[1]]}")
df.at['z', 'D'] = 10
print("\nCan assign new columns with at:")
print(df)

### Speed Comparison: `iat` vs. `iloc` vs. Applying a Lambda Function

In [None]:
import pandas as pd
import numpy as np
import time

# Create a big DataFrame
N = 100_000
df = pd.DataFrame({
    "A": np.random.randint(0, 100, N),
    "B": np.random.randint(0, 100, N)
}, dtype=np.int32)

# Copy original for fairness
df_iat = df.copy()
df_iloc = df.copy()
df_apply = df.copy()

# --- iat (in-place, scalar access) ---
start = time.time()
for i in range(N):
    df_iat.iat[i, 0] = df_iat.iat[i, 0] ** 2
end = time.time()
print(f"iat inplace: {end - start:.4f} seconds")

# --- iloc (in-place, scalar access) ---
start = time.time()
for i in range(N):
    df_iloc.iloc[i, 0] = df_iloc.iloc[i, 0] ** 2
end = time.time()
print(f"iloc inplace: {end - start:.4f} seconds")

# --- apply (calls the lambda once per element, result is assigned back to the column) ---
start = time.time()
df_apply["A"] = df_apply["A"].apply(lambda x: x**2)
end = time.time()
print(f"apply: {end - start:.4f} seconds")
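
For comparison, the same squaring can be written as a single vectorized expression on the whole column, which hands the loop over to NumPy. The sketch below is a self-contained variant of the benchmark above; it uses `time.perf_counter` and a seeded random generator, both of which are choices made for this example. Absolute timings depend on the machine.

```python
import time

import numpy as np
import pandas as pd

N = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.integers(0, 100, N)})

# apply: one Python function call per element
start = time.perf_counter()
squared_apply = df["A"].apply(lambda x: x**2)
t_apply = time.perf_counter() - start

# vectorized: one operation on the whole column
start = time.perf_counter()
squared_vec = df["A"] ** 2
t_vec = time.perf_counter() - start

results_match = bool((squared_apply == squared_vec).all())
print(f"apply:      {t_apply:.4f} seconds")
print(f"vectorized: {t_vec:.4f} seconds")
print("Same result:", results_match)
```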


### Common Pandas Operations

In [None]:
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}, index=["x", "y", "z"])

print(df.mean(axis=0))  # mean of each column
print(df.mean(axis=1))  # mean of each row
print(df.std())   # std of each column
print(df.median())  # median of each column
print(df > 1)  # boolean mask
print(df[df["A"] > 1])  # rows where column A > 1

### Grouping and Method Chaining

In [None]:
df3 = pd.DataFrame({
    "Material": ["Steel", "Steel", "Aluminum", "Aluminum"],
    "Strength": [400, 420, 150, 160],
    "Density": [7.8, 7.9, 2.7, 2.8]
})
df3_copy = df3.copy()

print(df3)
print(df3.groupby("Material").mean().round(1).reset_index())  # Method chaining

# Same as:
grouped = df3_copy.groupby("Material")
mean_df = grouped.mean()
rounded_df = mean_df.round(1)
final_df = rounded_df.reset_index()
print(final_df)
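
When different columns need different summary statistics, `agg` with named aggregations can be chained in the same way. This is a minimal sketch on the data from above; the output column names `mean_strength`, `max_density`, and `n_samples` are made up for this example.

```python
import pandas as pd

df3 = pd.DataFrame({
    "Material": ["Steel", "Steel", "Aluminum", "Aluminum"],
    "Strength": [400, 420, 150, 160],
    "Density": [7.8, 7.9, 2.7, 2.8]
})

# Each keyword is a new column name mapped to (source column, aggregation)
summary = (
    df3.groupby("Material")
    .agg(
        mean_strength=("Strength", "mean"),
        max_density=("Density", "max"),
        n_samples=("Strength", "count"),
    )
    .reset_index()
)
print(summary)
```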

### Handling Missing Data

In [None]:
df4 = pd.DataFrame({"A": [1, np.nan, 3], "B": [4, 5, np.nan]})
print("Original dataframe:\n", df4)
print("\nFill NaN with zeros:\n", df4.fillna(0))
print("\nDrop all rows that contain a NaN:\n", df4.dropna())
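
Before filling or dropping values, it often helps to count how many are missing. The sketch below shows this together with two common variations: filling with the column mean and dropping rows based on a single column. Which strategy is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd

df4 = pd.DataFrame({"A": [1, np.nan, 3], "B": [4, 5, np.nan]})

# Number of missing values per column
nan_counts = df4.isna().sum()
print("NaNs per column:\n", nan_counts)

# Fill each NaN with the mean of its own column
filled_mean = df4.fillna(df4.mean())
print("\nFill NaN with column means:\n", filled_mean)

# Drop only the rows where column "A" is missing
dropped_a = df4.dropna(subset=["A"])
print("\nDrop rows with NaN in column A:\n", dropped_a)
```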

## Common Data File Formats for Storage

| Format | Structure | Human-readable | Loading speed | Strengths | Limitations |
|--------|-----------|----------------|---------------|-----------|-------------|
| **CSV** (Comma-Separated Values) | Rows and columns separated by delimiters (flat, tabular) | ✅ Yes | ⚡ Medium | Simple, lightweight, works well with spreadsheets | No nested structures, no data types (all text) |
| **JSON** (JavaScript Object Notation) | Key–value pairs, lists, nested objects (hierarchical) | ✅ Yes | 🐢 Slow | Handles complex data, preserves data types, widely used in APIs (Application Programming Interfaces) | More verbose, harder to edit manually than CSV |
| **HDF5** (Hierarchical Data Format) | Binary format with hierarchical groups and datasets | ❌ No | ⚡ Fast | Efficient for very large datasets, supports metadata | Requires special libraries (e.g., h5py, PyTables) |
| **Pickle** | Python-specific serialized objects | ❌ No | ⚡ Fast | Can store almost any Python object easily | Not portable outside Python, unsafe if source is untrusted |
| **YAML** (YAML Ain’t Markup Language) | Human-readable key–value and nested structure | ✅ Yes | 🐢 Slow | More flexible and readable than JSON, allows comments | Less standardized than JSON |
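
To make the first two rows of the table concrete, the sketch below writes a small DataFrame to CSV and JSON text and reads it back with Pandas. It uses in-memory strings via `io.StringIO` instead of files, and the data is made up for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "material": ["Steel", "Aluminum"],
    "strength": [400, 150],
    "density": [7.8, 2.7],
})

# CSV: flat text, one line per row
csv_text = df.to_csv(index=False)
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: a list of key-value records
json_text = df.to_json(orient="records")
df_json = pd.read_json(io.StringIO(json_text))

print("CSV text:\n" + csv_text)
print("JSON text:\n" + json_text)
print("\nColumn types after reading the CSV (inferred from the text):")
print(df_csv.dtypes)
```

Replacing the `StringIO` objects with file names such as `"data.csv"` writes to and reads from disk instead.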



### Exercise 4.2, Part 1
We’ll be working with a band gap dataset first compiled in 1973 by two scientists, W.H. Strehlow and E.L. Cook. They collected values for many elemental and binary compound semiconductors from over 700 scientific papers. Importantly, they did not just list numbers; they also flagged which data points are more or less reliable, based on factors such as the measurement method, material quality, and experimental conditions. The unedited dataset can be found [here](https://citrination.com/datasets/1160/show_files/), and its use for teaching purposes was inspired by Prof. Dane Morgan.

First, load the dataset `band_gap_data.json` with the `read_json` function in Pandas and get basic information about the dataset using the functions introduced above. What kind of overall information would be useful when first looking at a new dataset?

Then look at all unique entries of columns "crystallinity", "band_gap_units", "band_gap_type", "exp_method", and "data_type".

In [None]:
# your code below:


### Exercise 4.2, Part 2

Now create separate Pandas DataFrames for polycrystalline, single crystalline, and amorphous materials (use a boolean mask).

Obtain statistics for each one.

In [None]:
# Your code below:

## Plotting Data with Pandas and Matplotlib

Visuals often say more than "just numbers". Luckily, there are several Python libraries dedicated to plotting and visualizing data. Matplotlib is probably the best known and works natively with NumPy and Pandas data structures. More recent libraries are [Seaborn](https://seaborn.pydata.org/) and [Plotly](https://plotly.com/).

Matplotlib is a powerful library with numerous options. We will only give a brief example and then use it to plot the information from the previous exercise. For the full documentation, please have a look [here](https://matplotlib.org/).

### Matplotlib Example

In [None]:
import matplotlib.pyplot as plt

# Example data: synthetic stress-strain curve
strain = np.linspace(0, 0.3, 100)
stress = 200 * strain - 500 * strain**2  # toy model

# Create a figure with two subplots (1 row, 2 columns)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,4))

# --- Left subplot: stress-strain curve ---
ax1.plot(strain, stress, color="blue", label="Stress-Strain")
ax1.scatter(strain[::10], stress[::10], color="red", zorder=5, label="Sample points")
ax1.set_xlabel("Strain")
ax1.set_ylabel("Stress (MPa)")
ax1.set_title("Stress-Strain Curve")
ax1.axhline(0, color="black", linewidth=0.8)
ax1.legend()

# --- Right subplot: histogram of stress values ---
ax2.hist(stress, bins=15, color="skyblue", edgecolor="black")
ax2.set_xlabel("Stress (MPa)")
ax2.set_ylabel("Frequency")
ax2.set_title("Distribution of Stress Values")

# Adjust layout to avoid overlap
plt.tight_layout()

# Save and show
fig.savefig("stress_analysis.png")
plt.show()


### Exercise 4.3

Create a plot with three subplots: one histogram of the band gap for each of the polycrystalline, single crystalline, and amorphous materials from Exercise 4.2.

Use the `hist` function and set `density=True` so that each histogram is normalized (the bar areas sum to 1), which makes groups of different sizes comparable. The documentation of that function is [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html).

- Add a vertical line at the average band gap of each group (`axvline`).
- Add axis labels.
- Add a title for each subplot.
- Constrain the x-axis to 0-10 eV for all plots (`set_xlim(0, 10)`).
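
As a hint, the sketch below shows the individual building blocks on a single subplot with synthetic, normally distributed data (not the band gap dataset). It also checks numerically what `density=True` does: the bar heights times the bin widths sum to 1.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(0)
values = rng.normal(loc=3.0, scale=1.0, size=500)

fig, ax = plt.subplots(figsize=(5, 4))
counts, edges, _ = ax.hist(values, bins=20, density=True, color="skyblue", edgecolor="black")

# Mark the mean of the data on the x-axis
ax.axvline(values.mean(), color="red", linestyle="--", label="Mean")
ax.set_xlim(0, 10)
ax.set_xlabel("Value")
ax.set_ylabel("Probability density")
ax.set_title("Normalized histogram")
ax.legend()

# With density=True, the bar areas (height * width) add up to 1
total_area = np.sum(counts * np.diff(edges))
print(f"Total area under the histogram: {total_area:.3f}")
```

In a notebook the figure is displayed at the end of the cell; in a script, a call to `plt.show()` is needed.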

In [None]:
# Your code here:

## Remark: Type-hinting of NumPy Arrays and Pandas DataFrames and Series

For NumPy, this requires importing the `numpy.typing` module.

In [None]:
import numpy.typing as npt

arr: npt.NDArray[np.float64] = np.array([1.0, 2.0, 3.0])

df: pd.DataFrame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
series: pd.Series = pd.Series([10, 20, 30])
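
Type hints are most useful in function signatures. The following sketch annotates two small helper functions; `normalize` and `column_means` are hypothetical names introduced only for this example. Note that Python does not enforce the hints at run time; they serve as documentation and can be checked by static tools.

```python
import numpy as np
import numpy.typing as npt
import pandas as pd


def normalize(values: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
    """Scale values linearly to the range [0, 1]."""
    return (values - values.min()) / (values.max() - values.min())


def column_means(df: pd.DataFrame) -> pd.Series:
    """Return the mean of each column."""
    return df.mean(axis=0)


arr = np.array([1.0, 2.0, 3.0])
normalized = normalize(arr)
means = column_means(pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}))

print("Normalized array:", normalized)
print("Column means:\n", means)
```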