# Pandas Foundations: Series, DataFrames, NumPy Interop, Reading/Writing Data

This notebook is a **lecture + hands-on tutorial** covering:

- Pandas **Series**
- Pandas **DataFrames**
- Conversion between **NumPy arrays** and pandas objects
- Reading data from **CSV** and from a **URL**
- Writing data back to **CSV**

> Tip: Run cells top-to-bottom. Try the exercises as you go.


## Setup

In [2]:
import numpy as np
import pandas as pd

# Reproducibility for examples
rng = np.random.default_rng(42)

# Setting default display options
pd.set_option("display.max_columns", 20)
pd.set_option("display.width", 120)


## 1) Pandas Series

A **Series** is a 1D labeled array. Think of it as:
- a NumPy array **plus** an index (labels)
- a single column of a table

Key ideas:
- Values are aligned by **index label**, not by position
- Supports missing values (`NaN`)


In [3]:
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
s


In [4]:
s.index, s.to_numpy(), s.dtype


### Indexing in Series

- Label-based: `.loc[...]`
- Position-based: `.iloc[...]`


In [5]:
s.loc["b"], s.iloc[1]


In [6]:
s.loc[["a", "c"]]


### Vectorized operations + alignment

Pandas aligns by index labels automatically.


In [7]:
s2 = pd.Series([1, 2, 3], index=["b", "c", "d"])
s, s2


In [8]:
# Alignment happens by matching index labels (a,b,c) with (b,c,d)
s + s2
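
Labels that appear in only one of the two Series produce `NaN` in the sum. As a minimal sketch (reusing the same `s` and `s2` values as above), `Series.add` with `fill_value` treats the missing side as a chosen default instead:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Plain "+" leaves NaN wherever a label exists in only one Series.
plain = s + s2

# .add(..., fill_value=0) substitutes 0 for the missing side before adding.
filled = s.add(s2, fill_value=0)

print(plain)
print(filled)
```

Whether 0 is a sensible default depends on the data; for counts it often is, for measurements it may hide a real gap.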


### Missing data

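Pandas usually represents missing values as `NaN`, which is a floating-point value. That is why `s + s2` above has dtype `float64` even though both inputs hold integers.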


In [9]:
(s + s2).isna()


In [10]:
(s + s2).fillna(0)
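
The sketch below illustrates the dtype side effect of `NaN`. It assumes a small illustrative Series named `ints` (not part of the notebook) and uses pandas' nullable `"Int64"` extension dtype as one possible way to keep integers when gaps are present:

```python
import pandas as pd

ints = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Reindexing adds the label "d", which has no value, so pandas stores NaN
# and upcasts the integers to float64.
with_gap = ints.reindex(["a", "b", "c", "d"])

# The nullable "Int64" extension dtype keeps integers and marks the gap as <NA>.
nullable = ints.astype("Int64").reindex(["a", "b", "c", "d"])

print(with_gap.dtype, nullable.dtype)
```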


### Exercise 1 (Series)

Create a Series of 5 random integers in the range 0..9 with index labels
`["Mon","Tue","Wed","Thu","Fri"]`.
Then:
1) Select the `Wed` value using `.loc`  
2) Select the 3rd element using `.iloc`


In [11]:
# --- Your turn ---
# idx = ...
# s_week = ...
# wed = ...
# third = ...
# s_week, wed, third


In [12]:
# Solution
idx = ["Mon", "Tue", "Wed", "Thu", "Fri"]
s_week = pd.Series(rng.integers(0, 10, size=5), index=idx)
wed = s_week.loc["Wed"]
third = s_week.iloc[2]
s_week, wed, third


## 2) Pandas DataFrames

A **DataFrame** is a 2D labeled table:
- rows have an index
- columns have labels
- each column is a Series (often different dtypes)

Common creation patterns:
- dict of columns
- list of dicts (records)
- from NumPy arrays
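
The list-of-dicts pattern could look like the sketch below. The `records` data is made up for illustration; note that a key missing from one record becomes a missing value in that row:

```python
import pandas as pd

# Each dict is one row; keys become column labels.
records = [
    {"name": "Ada", "age": 36},
    {"name": "Grace", "age": 28, "city": "Chicago"},
]
df_records = pd.DataFrame(records)

# The first record has no "city" key, so that cell is missing.
print(df_records)
```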


In [13]:
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus", "Guido"],
    "age":  [36,  28,     40,     34],
    "city": ["Austin", "Chicago", "Seattle", "New York"]
})
df


In [14]:
df.dtypes


### Selecting columns and rows

- `df["col"]` returns a Series
- `df[["col1","col2"]]` returns a DataFrame
- `.loc` uses labels
- `.iloc` uses positions


In [15]:
df["age"]


In [16]:
df[["name", "city"]]


In [17]:
df.loc[0, "name"], df.iloc[0, 0]


In [18]:
df.loc[0:2, ["name", "age"]]   # label slice is inclusive on both ends for .loc


In [19]:
df.iloc[0:2, 0:2]             # position slice is stop-exclusive (like NumPy)
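
With the default integer index, labels and positions happen to coincide, which can blur the difference between `.loc` and `.iloc`. A minimal sketch, assuming a hypothetical `people` frame with string row labels, makes it easier to see:

```python
import pandas as pd

people = pd.DataFrame(
    {"name": ["Ada", "Grace", "Linus", "Guido"], "age": [36, 28, 40, 34]},
    index=["p1", "p2", "p3", "p4"],
)

# Label slice: both endpoints are included, so "p3" is in the result.
by_label = people.loc["p1":"p3", ["name", "age"]]

# Position slice: the stop position is excluded, so only rows 0 and 1.
by_position = people.iloc[0:2, 0:2]

print(len(by_label), len(by_position))
```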


### Filtering rows

Use boolean masks (like NumPy boolean indexing).


In [20]:
df[df["age"] >= 35]
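
Masks can also be combined. The sketch below rebuilds the same `df` so it runs on its own; the thresholds and city names in the filters are arbitrary choices for illustration. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus", "Guido"],
    "age": [36, 28, 40, 34],
    "city": ["Austin", "Chicago", "Seattle", "New York"],
})

# Combine conditions with & (and) or | (or), not the Python keywords.
mask = (df["age"] >= 30) & (df["city"] != "Seattle")
over_30_not_seattle = df[mask]

# .isin() tests membership in a list of allowed values.
in_cities = df[df["city"].isin(["Austin", "Chicago"])]

print(over_30_not_seattle)
print(in_cities)
```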


### Adding / modifying columns

Operations are vectorized and align by index.

**Note: `.copy()` makes a deep copy by default, so modifying `df2` leaves `df` unchanged** (see the Views vs. Copies section of the NumPy lecture).


In [21]:
df2 = df.copy()
df2["age_plus_10"] = df2["age"] + 10
df2
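
To check that the copy really is independent, one can modify it and inspect the original. This is a small sketch with made-up frames named `original` and `independent`:

```python
import pandas as pd

original = pd.DataFrame({"age": [36, 28]})
independent = original.copy()  # deep copy by default

# Change a value in the copy only.
independent.loc[0, "age"] = 99

# The original still holds its old value.
print(original.loc[0, "age"], independent.loc[0, "age"])
```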


### Exercise 2 (DataFrames)

Using `df2`:
1) Create a column `is_senior` where age >= 35  
2) Filter to show only rows where `is_senior` is True  
3) Select only the columns `name` and `is_senior`


In [22]:
# --- Your turn ---
# df2["is_senior"] = ...
# seniors = ...
# seniors_view = ...
# seniors_view


In [23]:
# Solution
df2["is_senior"] = df2["age"] >= 35
seniors = df2[df2["is_senior"]]
seniors_view = seniors[["name", "is_senior"]]
seniors_view


## 3) Conversions between NumPy arrays and pandas objects

### Series ↔ NumPy
- `Series.to_numpy()` gives a NumPy array
- `np.array(series)` works too, but `to_numpy()` is clearer
- `Series.values` still works, but the pandas documentation recommends `to_numpy()`

### DataFrame ↔ NumPy
- `DataFrame.to_numpy()` returns a 2D NumPy array with a single dtype that can hold every column
- `DataFrame.values` is the older spelling; prefer `to_numpy()`


### Series from NumPy and back

In [24]:
arr = rng.normal(size=5)
arr


In [25]:
s_arr = pd.Series(arr, name="z")
s_arr


In [26]:
arr_back = s_arr.to_numpy()
arr_back, type(arr_back)


### DataFrame from NumPy and back

In [27]:
A = rng.integers(0, 100, size=(4, 3))
A


In [28]:
dfA = pd.DataFrame(A, columns=["x1", "x2", "x3"])
dfA


In [29]:
A_back = dfA.to_numpy()
A_back, A_back.shape


### Important note: dtypes when converting

If a DataFrame has mixed dtypes (e.g., strings + numbers), converting to NumPy may produce `dtype=object`.


In [30]:
mixed = pd.DataFrame({"num":[1,2,3], "txt":["a","b","c"]})
mixed


In [31]:
mixed.to_numpy(), mixed.to_numpy().dtype
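
If a numeric array is what you need, one option is to drop the non-numeric columns before converting. The sketch below reuses the `mixed` frame; choosing `select_dtypes` and the `dtype=float` argument is an illustrative choice, not the only approach:

```python
import pandas as pd

mixed = pd.DataFrame({"num": [1, 2, 3], "txt": ["a", "b", "c"]})

# Strings and integers share no numeric dtype, so the array is object.
as_object = mixed.to_numpy()

# Keep only numeric columns, then convert.
numeric_only = mixed.select_dtypes(include="number").to_numpy()

# to_numpy also accepts a target dtype.
as_float = mixed[["num"]].to_numpy(dtype=float)

print(as_object.dtype, numeric_only.dtype, as_float.dtype)
```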


### Exercise 3 (NumPy interop)

1) Create a 3×2 NumPy array of random floats  
2) Convert it to a DataFrame with columns `["a","b"]`  
3) Convert that DataFrame back to a NumPy array and verify the shape


In [32]:
# --- Your turn ---
# X = ...
# dfX = ...
# X2 = ...
# X.shape, dfX.shape, X2.shape


In [33]:
# Solution
X = rng.random((3, 2))
dfX = pd.DataFrame(X, columns=["a", "b"])
X2 = dfX.to_numpy()
X.shape, dfX.shape, X2.shape


## 4) Reading data (CSV and URL) + Writing data (CSV)

Pandas can read and write a variety of common formats. This section covers CSV files and tables on web pages.

### Read CSV from local file
- `pd.read_csv("path/to/file.csv")`

### Read CSV directly from a URL
- `pd.read_csv("https://.../file.csv")`

### Write CSV
- `df.to_csv("out.csv", index=False)` (commonly)


### Example A: Write a DataFrame to CSV, then read it back

In [34]:
df_small = pd.DataFrame({
    "id": [101, 102, 103],
    "score": [88.5, 92.0, 79.5],
    "passed": [True, True, False]
})
df_small


In [35]:
# Write to CSV (in the notebook's current working directory).
# Notice that the file now appears in your environment.
out_path = "example_scores.csv"
df_small.to_csv(out_path, index=False)
out_path


In [36]:
# Read it back
df_read_back = pd.read_csv(out_path)
df_read_back
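
The sketch below shows why `index=False` is a common choice. It uses in-memory strings through `io.StringIO` instead of files, purely so that it leaves nothing on disk:

```python
import io
import pandas as pd

df_small = pd.DataFrame({
    "id": [101, 102, 103],
    "score": [88.5, 92.0, 79.5],
    "passed": [True, True, False],
})

# With no path argument, to_csv returns the CSV text as a string.
csv_no_index = df_small.to_csv(index=False)
csv_with_index = df_small.to_csv()  # index=True is the default

back_clean = pd.read_csv(io.StringIO(csv_no_index))
back_extra = pd.read_csv(io.StringIO(csv_with_index))

# The written index comes back as an extra, unnamed column.
print(list(back_clean.columns))
print(list(back_extra.columns))
```

If the index does carry meaning, writing it and reading it back with `pd.read_csv(..., index_col=0)` is an alternative.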


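### Example B: Read tables from a URL

`pd.read_html` fetches a web page and returns a **list** of DataFrames, one for each HTML table it finds. It needs an HTML parser to be installed (`lxml`, or BeautifulSoup4 with `html5lib`).

This requires internet access in your environment.  
If you're running this notebook offline, you'll see an error, which is expected.

The example below uses a page from the pandas documentation as a demonstration.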


In [37]:
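url1 = "https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html"

# read_html returns a list of DataFrames, one per table found on the page
tables = pd.read_html(url1)

# Preview the first table
tables[0].head(5)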
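
When a notebook may run offline, it can help to fail gently instead of stopping at a traceback. The sketch below is one possible approach for CSV sources. `read_csv_or_none` is a hypothetical helper, not a pandas function, and the file name is made up so that the failure path is exercised:

```python
import io
import pandas as pd

def read_csv_or_none(source):
    """Hypothetical helper: return a DataFrame, or None if the source cannot be opened."""
    try:
        return pd.read_csv(source)
    except OSError as exc:
        # OSError covers missing local files as well as urllib's
        # URLError/HTTPError, which are raised for unreachable URLs.
        print(f"Could not read {source!r}: {exc}")
        return None

# Works for any source read_csv accepts: a path, a URL, or a file-like object.
ok = read_csv_or_none(io.StringIO("id,score\n101,88.5\n"))
missing = read_csv_or_none("no_such_file_for_this_demo.csv")

print(ok)
print(missing)
```

Parsing problems such as malformed rows raise other exception types and are deliberately not caught here.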


## Wrap-up

You have now covered:
- **Series** (index, `.loc` vs `.iloc`, alignment)
- **DataFrames** (selection, filtering, adding columns)
- Conversion between **NumPy** arrays and pandas objects
- Reading from **CSV** (local) and **URL**
- Writing back to **CSV**

**Further exploration topics:**
- handling missing values (`dropna`, `fillna`)
- grouping and aggregation (`groupby`)
- joins/merges (`merge`)
- reshaping (`pivot`, `melt`)
