# Section 1 — The Pandas DataFrame: Core Structure and Philosophy

**What this section covers (brief):**
This section introduces the Pandas `Series` and `DataFrame` as labeled, tabular data containers built on top of NumPy. We'll create DataFrames, inspect their structure and dtypes, access rows/columns, and perform basic column operations. The examples use a consistent small retail-sales dataset theme so you can reuse it as we progress through later sections.

## Subtopics

1. **DataFrame & Series essentials** — creation, shape, `dtypes`, `info`, and `describe()` for quick inspection.
2. **Column & row access** — `.loc`, `.iloc`, attribute access, and boolean masks for selection.
3. **Basic transformations** — adding/renaming columns, `astype` conversions, handling `NaN`s.
4. **Practical patterns** — consistent dataset theme and idiomatic patterns for readability and reproducibility.

In [None]:
# Core imports and display configuration
import pandas as pd
import numpy as np

# Helpful display options for notebooks (optional)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 120)

### Creating DataFrames

We'll create a small retail sales DataFrame; in real life you'd use `pd.read_csv()` or `pd.read_sql()`, but for self-contained examples we generate the data programmatically so the code is runnable anywhere.

In [None]:
# Create a simple retail sales DataFrame (runnable example)
rng = np.random.default_rng(seed=42)
n = 12
df = pd.DataFrame({
    'order_id': np.arange(1001, 1001 + n),
    'customer_id': rng.integers(200, 210, size=n),
    'product': rng.choice(['T-shirt', 'Mug', 'Cap', 'Notebook'], size=n),
    'quantity': rng.integers(1, 5, size=n),
    'price': np.round(rng.uniform(5.0, 40.0, size=n), 2),
    'order_date': pd.date_range('2023-01-15', periods=n, freq='7D')
})

# Introduce a couple of NaNs to illustrate missing data handling
df.loc[[2, 7], 'price'] = np.nan
df
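
The `pd.read_csv()` route mentioned above can be tried without a file on disk by reading from an in-memory string. The sketch below is only an illustration: the CSV text and the `orders` name are made up for this example and are not part of the retail dataset used in the rest of the section.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """order_id,customer_id,product,quantity,price,order_date
1001,203,Mug,2,12.50,2023-01-15
1002,207,Cap,1,,2023-01-22
1003,203,T-shirt,3,19.99,2023-01-29
"""

# parse_dates asks pandas to convert the named column to datetime64 while reading
orders = pd.read_csv(io.StringIO(csv_text), parse_dates=['order_date'])

print(orders.dtypes)
print(orders['price'].isna().sum())  # the empty price field is read as NaN
```

Passing `parse_dates` up front avoids a separate conversion step later; without it, the date column would arrive as plain strings.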

### Inspecting a DataFrame — quick checks

Use `head()`, `info()`, `dtypes`, and `describe()` to understand shape and types. `info()` is especially useful for spotting missing values (via the non-null counts) and `object` dtype columns, which may hold strings, unparsed dates, or mixed types.

In [None]:
# Quick inspection utilities
print('Shape:', df.shape)
display(df.head())
print('\nInfo:')
df.info()  # info() prints its report directly and returns None, so display() is not needed
print('\nDtypes:')
display(df.dtypes)
print('\nNumeric summary:')
display(df.describe(include=[np.number]))

### Column access and selection patterns

Common idioms:
- `df['col']` — returns a `Series`.
- `df[['a','b']]` — returns a `DataFrame` with selected columns.
- `df.loc[row_label, col_label]` — label-based selection.
- `df.iloc[row_idx, col_idx]` — integer positional selection.
- Boolean masks (`df[df['col'] > val]`) for filtered rows; the result is a new DataFrame, not a view of the original.

In [None]:
# Column access examples
series_price = df['price']          # Series
subset = df[['order_id', 'product', 'price']]  # DataFrame
row3 = df.iloc[2]                  # third row (0-based)
by_customer_200 = df[df['customer_id'] == 200]

# Label-based access example (loc)
first_row_price = df.loc[0, 'price']

series_price, subset.head(), row3.to_dict(), len(by_customer_200), first_row_price
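
With the default `RangeIndex`, labels and positions happen to coincide, which hides the difference between `.loc` and `.iloc`. The following sketch uses a hypothetical mini-frame indexed by `order_id` (an assumption made for illustration, not the `df` above) so the two accessors give visibly different meanings to the same number.

```python
import pandas as pd

# Hypothetical mini-frame indexed by order_id instead of the default 0..n-1
orders = pd.DataFrame(
    {'product': ['Mug', 'Cap', 'Notebook'], 'price': [12.5, 8.0, 4.25]},
    index=pd.Index([1001, 1002, 1003], name='order_id'),
)

by_label = orders.loc[1002, 'price']   # row whose index label is 1002
by_position = orders.iloc[1, 1]        # second row, second column
label_slice = orders.loc[1001:1002]    # label slices include both endpoints
position_slice = orders.iloc[0:2]      # positional slices exclude the stop

# A boolean mask and a column list can be combined in one .loc call
cheap = orders.loc[orders['price'] < 10, ['product', 'price']]
print(by_label, by_position, len(label_slice), len(position_slice))
print(cheap)
```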

### Basic transformations: adding columns, renaming, and dtype conversions

Idiomatic patterns matter for readability. Prefer assigning the result of a transformation to a new variable, or back to the same name, over `inplace=True`: assignment makes each step explicit and keeps method chaining possible, because `inplace=True` calls return `None`.

In [None]:
# Add a derived column (total = quantity * price); rows with a NaN price get a NaN total
df = df.copy()  # make an explicit copy to avoid surprises for learners
df['total'] = df['quantity'] * df['price']

# Fill missing prices with a sentinel or estimate (example: median) before certain ops
median_price = df['price'].median(skipna=True)
df['price_filled'] = df['price'].fillna(median_price)
df['total_filled'] = df['quantity'] * df['price_filled']

# Rename columns cleanly
df = df.rename(columns={'price': 'price_orig', 'price_filled': 'price'})
df.head()
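
The subtopic list mentions `astype` conversions, which the cell above does not exercise. Here is a minimal sketch on a hypothetical `raw` frame (invented for this example) showing three common conversions: numeric strings to integers, floats with gaps to the nullable `Int64` dtype, and repeated strings to `category`.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data as it might arrive from a text file
raw = pd.DataFrame({
    'customer_id': ['203', '207', '203'],
    'quantity': [2.0, 1.0, np.nan],
    'product': ['Mug', 'Cap', 'Mug'],
})

# assign() returns a new DataFrame, so `raw` itself is left untouched
clean = raw.assign(
    customer_id=raw['customer_id'].astype('int64'),  # numeric strings -> integers
    quantity=raw['quantity'].astype('Int64'),        # nullable integer keeps the missing value
    product=raw['product'].astype('category'),       # repeated strings -> categorical
)
print(clean.dtypes)
```

Note the capital `I` in `'Int64'`: the plain NumPy `int64` dtype cannot represent missing values, so converting a column that contains `NaN` to it raises an error.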

## Real-World Problem 1 — Sales summary by product and month

**Problem:** Given a sales DataFrame (`order_id`, `order_date`, `product`, `quantity`, `price`), compute monthly revenue per product and show the top 3 product-month combinations by revenue.

This reinforces grouping/aggregation, datetime handling, and sorting.

In [None]:
# Prepare data (ensure we have a proper datetime index/column)
df1 = df.copy()
df1['order_month'] = df1['order_date'].dt.to_period('M')

# Compute revenue from 'price', which after the rename above is the median-filled column (no NaNs)
df1['revenue'] = df1['quantity'] * df1['price']

# Group by product and month, then sum revenue
monthly = (
    df1.groupby(['order_month', 'product'], as_index=False)
       .agg(total_revenue=('revenue', 'sum'), total_qty=('quantity', 'sum'))
)

# Convert order_month back to string for nicer display then sort
monthly['order_month'] = monthly['order_month'].astype(str)
top3 = monthly.sort_values('total_revenue', ascending=False).head(3)
monthly, top3

## Real-World Problem 2 — Detect customers with declining purchase trend

**Problem:** For each customer, compute monthly total spend; identify customers whose spend declined for at least two consecutive months. This uses pivoting/time grouping and boolean masks to find trends.

In [None]:
# Prepare a slightly larger sample to illustrate trends (simulate more history)
rng = np.random.default_rng(1)
dates = pd.date_range('2023-01-01', periods=90, freq='D')
sim = pd.DataFrame({
    'order_date': rng.choice(dates, size=200),
    'customer_id': rng.integers(1000, 1010, size=200),
    'quantity': rng.integers(1, 4, size=200),
    'price': np.round(rng.uniform(5, 50, size=200), 2)
})
sim['month'] = sim['order_date'].dt.to_period('M')
sim['revenue'] = sim['quantity'] * sim['price']

# Aggregate: customer x month revenue
cust_month = sim.groupby(['customer_id', 'month'], as_index=False).agg(revenue=('revenue', 'sum'))
pivot = cust_month.pivot(index='customer_id', columns='month', values='revenue').fillna(0)

# Detect customers with at least two consecutive months of decline
decline_customers = []
for cid, row in pivot.iterrows():
    # Convert to numpy array for easy difference checks
    arr = row.values
    # compute month-to-month differences (negative => decline)
    diffs = np.diff(arr)
    # check if there's any place with two consecutive negative diffs
    if np.any((diffs[:-1] < 0) & (diffs[1:] < 0)):
        decline_customers.append(cid)

pivot.head(), decline_customers[:10]
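
The `iterrows()` loop above is easy to read, but the same check can be written without a Python-level loop. The sketch below is one possible vectorized variant, shown on a small hypothetical `revenue` table (made-up numbers with the same customer-by-month layout as `pivot`), so it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical customer x month revenue table (same layout as `pivot`)
revenue = pd.DataFrame(
    {'2023-01': [100.0, 50.0, 80.0],
     '2023-02': [90.0, 60.0, 80.0],
     '2023-03': [70.0, 40.0, 60.0]},
    index=pd.Index([1000, 1001, 1002], name='customer_id'),
)

# Month-to-month differences for every customer at once (negative => decline)
diffs = np.diff(revenue.to_numpy(), axis=1)
declined = diffs < 0

# True where a decline is immediately followed by another decline
two_in_a_row = (declined[:, :-1] & declined[:, 1:]).any(axis=1)
decline_ids = revenue.index[two_in_a_row].tolist()
print(decline_ids)
```

Only customer 1000 falls in both transitions here; customer 1001 rises before falling, and customer 1002 is flat before falling, so neither has two consecutive declines.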

## Under the Hood — How a DataFrame stores data (concise)

- A `DataFrame` is conceptually a dict-like container of `Series` objects that share an index. Underneath, many columns are backed by NumPy `ndarray`s (or extension arrays for some dtypes).
- Internally, Pandas has traditionally used a `BlockManager` that groups columns of the same dtype into shared 2D NumPy blocks. Arithmetic on numeric columns is fast because it dispatches to NumPy's vectorized routines instead of looping in Python.
- `object` dtype columns store pointers to individual Python objects, so numeric work on them is much slower. Prefer native numeric dtypes (`int64`, `float64`) for numbers and `Categorical` for repeated strings.
- Index alignment: when you combine Series or DataFrames, Pandas matches values by index label rather than by position. This is powerful, but it can carry hidden costs (extra work, unexpected `NaN`s) when the indexes are not a simple, shared `RangeIndex`.
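
Two of the points above are easy to observe directly. The sketch below uses made-up Series (`a`, `b`, and `products` are names invented for this example) to show label alignment producing `NaN` and to compare the memory footprint of `object` and `category` storage. Exact byte counts vary by platform and pandas version, so none are quoted here.

```python
import pandas as pd

# Index alignment: values are matched by label, not by position
a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0, 30.0], index=['y', 'z', 'w'])
aligned_sum = a + b  # labels present on only one side become NaN
print(aligned_sum)

# object vs. category storage for a column of repeated strings
products = pd.Series(['T-shirt', 'Mug', 'Cap', 'Notebook'] * 5_000)
as_object_bytes = products.astype(object).memory_usage(deep=True)
as_category_bytes = products.astype('category').memory_usage(deep=True)
print(as_object_bytes, as_category_bytes)
```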

## Best Practices / Common Pitfalls

- **Prefer vectorized operations** (column arithmetic) over `for` loops or repeated `df.loc[...]` assignments — vectorized ops use NumPy under the hood and are far faster.
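- **Be explicit with copies**: `df2 = df` does **not** copy; both names refer to the same object. Use `df.copy()` when you need an independent DataFrame, for example before modifying a filtered subset, which also avoids `SettingWithCopyWarning`.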
- **Watch `object` dtype**: convert string-like categorical columns to `pd.Categorical` when appropriate to save memory and speed up groupby operations.
- **Avoid chaining ambiguous assignments** (e.g., `df[df['a'] > 0]['b'] = 1`) — use `.loc` to set values safely: `df.loc[df['a'] > 0, 'b'] = 1`.
- **Always inspect `dtypes`** after reading data — automatic inference can produce unexpected `object` or `float` dtypes for what should be integers or datetimes.
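
The copy and assignment pitfalls above can be checked in a few lines. This is a small sketch on a hypothetical `df_demo` frame (the frame and the names `alias` and `independent` are assumptions for illustration only).

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [-1, 2, 3], 'b': [0, 0, 0]})  # hypothetical frame

alias = df_demo              # same object under a second name, no data copied
independent = df_demo.copy() # separate object with its own data

# One .loc call with the row mask and the column label together,
# so the assignment is applied to df_demo itself
df_demo.loc[df_demo['a'] > 0, 'b'] = 1

print(alias is df_demo)           # the alias is the very same object
print(alias['b'].tolist())        # so it reflects the update
print(independent['b'].tolist())  # the copy keeps the original values
```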

## Challenge Exercise (no solution here)

Using the initial small `df` (the retail sample at the top):
1. Create a clean `orders_clean` DataFrame that:
   - Converts `order_date` to `datetime` if not already,
   - Fills missing `price` values using a reasonable strategy,
   - Adds a `weekday` column (name of the day),
   - Ensures `customer_id` is of integer dtype.
2. Then write a function `top_customers(df, n=5)` that returns the top `n` customers by total spend. Make sure your function is robust to `NaN`s and unexpected dtypes.

_Hint:_ Compose operations into a clear pipeline using local assignment (avoid chained assignments).

# --- End of Section 1 — Continue to Section 2 ---