In [1]:
# !pip install sqlalchemy psycopg2-binary
import os, pandas as pd
from sqlalchemy import create_engine

pg_user = os.getenv("PGUSER", "postgres")
pg_pass = os.getenv("PGPASSWORD", "CSDBMS623")
pg_host = os.getenv("PGHOST", "localhost")
pg_port = os.getenv("PGPORT", "5432")
pg_db   = os.getenv("PGDATABASE", "SP500_ML")

engine = create_engine(f"postgresql+psycopg2://{pg_user}:{pg_pass}@{pg_host}:{pg_port}/{pg_db}")

# Define the universe as every distinct, non-null latest_ticker in the profiles table
# (note: this query applies no membership-date filter)
universe = pd.read_sql_query("""
    SELECT DISTINCT UPPER(TRIM(latest_ticker)) AS latest_ticker
    FROM sp500_long_latest_profiles
    WHERE latest_ticker IS NOT NULL

""", engine)["latest_ticker"].tolist()

print("Universe size:", len(universe))

# Now run your Yahoo fetcher
#prices_df = fetch_prices_for_universe(universe, checkpoint_path="prices_checkpoint.parquet")

Universe size: 679


## How `_signed_pct_change` works

**Formula:**  
\[
\text{spc}(a,b) \;=\; \frac{2\,(a-b)}{|a|+|b|}
\]
This is the *symmetric percent change* (also called percent difference over the average absolute level).  
It divides the difference by the average magnitude instead of by the previous value, so it’s **well-behaved when the previous value is 0 or negative**.

### Properties
- **Bounded**: result is always in **[-2, 2]** (i.e., -200% to +200%).  
  - Example: 0 → 5 gives \(2\cdot(5-0)/(5+0)=2\) (+200%), and 5 → 0 gives -200%.
- **Symmetric up vs. down**: a move from 100 to 110 and the reverse move from 110 to 100 have the **same magnitude** (unlike standard % change, which gives +10% and -9.09%).  
  - 100 → 110: \(2\cdot10/210 = 0.0952\) (~9.52%)  
  - 110 → 100: \(2\cdot(-10)/210 = -0.0952\) (~-9.52%)
- **Handles sign flips** gracefully: -5 → 5 = +200%, 5 → -5 = -200%.
- **Zeros and missing values**:
  - If **both a and b are 0**, it returns **0.0**.
  - If either input is missing (NA/NaN, or not coercible to a number), the result is **NA**.
- Uses pandas’ nullable **`Float64`** dtype so NA is preserved.

### Implementation details
1. Coerces both inputs to numeric `Float64` (`pd.NA`-aware).
2. Computes the denominator \(|a|+|b|\).  
   - If it’s **nonzero**, computes \(2(a-b)/(|a|+|b|)\).
   - If **both a and b are 0**, sets result to **0.0**.
3. Leaves other invalid cases as **NA**.
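
The properties listed above can be checked with a scalar version of the formula. This is a minimal sketch: `spc` is a hypothetical helper written only for illustration, not the notebook's vectorized `_signed_pct_change`, and it uses `None` where the pandas version would return `NA`.

```python
import math

def spc(a, b):
    """Scalar symmetric percent change; `a` is the current value, `b` the previous one."""
    if a is None or b is None or math.isnan(a) or math.isnan(b):
        return None          # missing input -> missing output
    denom = abs(a) + abs(b)
    if denom == 0:
        return 0.0           # both values are zero
    return 2.0 * (a - b) / denom

print(spc(5, 0))                 # 2.0   (0 -> 5)
print(spc(0, 5))                 # -2.0  (5 -> 0)
print(round(spc(110, 100), 4))   # 0.0952
print(round(spc(100, 110), 4))   # -0.0952
print(spc(5, -5))                # 2.0   (sign flip)
print(spc(0, 0))                 # 0.0
print(spc(float("nan"), 3))      # None
```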

---

## How `add_ttm_growth` uses it

You’re computing **YoY growth for TTM series** (t vs. t-4 quarters) per symbol:

\[
\text{growth}_{t}^{\text{TTM}} \;=\; \text{spc}\big(\text{TTM}_t,\; \text{TTM}_{t-4}\big)
\]

- It sorts by `["symbol","date"]`.
- For each TTM column (`revenue_ttm`, `netIncome_ttm`, `operatingIncome_ttm`), it:
  1. Shifts by 4 within each symbol to get the year-ago TTM (`prev`).
  2. Applies `_signed_pct_change(curr, prev)` to produce:
     - `revenue_ttm_growth`
     - `netIncome_ttm_growth`
     - `operatingIncome_ttm_growth`

### Why this is useful for fundamentals
- Fundamentals (esp. **TTM net income**) can be **zero or negative**.  
  Traditional growth \((a-b)/b\) explodes or flips sign around zero; the symmetric form stays **bounded and interpretable**.

### Quick intuition
- Think of it as “**difference divided by average absolute size**.”  
  If levels are large, the same difference yields a smaller growth; if levels are tiny (but not both zero), the same difference yields a larger growth—without going to ±∞.
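
To make the contrast with traditional growth concrete, the sketch below runs both formulas over a few made-up (previous, current) net income pairs. The helper names and the values are assumptions chosen for illustration; nothing here comes from the real dataset.

```python
def traditional_growth(a, b):
    """Classic (a - b) / b; undefined when the previous value b is 0."""
    return (a - b) / b if b != 0 else float("nan")

def spc(a, b):
    """Symmetric percent change, with 0/0 defined as 0."""
    denom = abs(a) + abs(b)
    return 0.0 if denom == 0 else 2.0 * (a - b) / denom

# (previous, current) pairs in arbitrary units; all values are invented
cases = [(100, 110), (1, 50), (-20, -10), (-10, 20), (0, 5)]
for prev, curr in cases:
    print(f"{prev:>5} -> {curr:>4}  "
          f"traditional={traditional_growth(curr, prev):8.4f}  "
          f"spc={spc(curr, prev):7.4f}")
```

Two rows are worth reading closely. A loss that shrinks from -20 to -10 is an improvement, yet the traditional formula reports -50% while the symmetric form reports about +67%. A swing from -10 to +20 gives -300% under the traditional formula and +200% under the symmetric one.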



# Income Statement Pipeline (Quarterly + TTM + YoY)

This notebook block fetches **quarterly income statements**, builds **as-of validity windows**, derives **TTM** totals, computes **YoY growth** (robust symmetric version), and **ingests** the result into Postgres.

---

## What the dataset represents

Each row is a company **quarterly filing** with additional derived fields:

- **Identity & timing**
  - `symbol`: uppercased ticker returned by FMP (fallback to query symbol).
  - `date`: filing period end date from FMP (quarter reference date).
  - `date_start`, `date_end`: **validity window** for this row (see below).

- **Source fields (FMP)**
  - Currency/IDs: `reportedCurrency`, `cik`, …
  - Quarter labels: `calendarYear`, `period` ∈ {Q1,Q2,Q3,Q4}
  - Fundamentals: `revenue`, `grossProfit`, `operatingIncome`, `netIncome`, `eps`, ratios, etc.

- **TTM (trailing-twelve-months) totals**
  - `revenue_ttm`, `netIncome_ttm`, `operatingIncome_ttm`  
  Strict **4-quarter rolling sums** per `symbol` (the first 3 rows of each symbol are NA).

- **Growth metrics**
  - `revenue_ttm_growth`, `netIncome_ttm_growth`, `operatingIncome_ttm_growth`  
    YoY on TTM using **symmetric percent change** (SPC).
  - `revenue_q_yoy`, `netIncome_q_yoy`, `operatingIncome_q_yoy`  
    **True quarterly YoY**: Qx(Y) vs Qx(Y-1), after deduping to the **latest** filing per `(symbol, calendarYear, period)` using timestamps (`acceptedDate` > `fillingDate` > `date`).

---

## As-of windows (SCD style)

For each `symbol`, rows are sorted by `date`. We set:
- `date_start = date`
- `date_end   = next(date)` within the same symbol  
  (the **last** row uses `2100-01-01`)

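Interpretation: the row is valid on **\[date_start, date_end)**. Use this window when joining fundamentals to prices. Note that `date` is the fiscal period end, not the filing date, so the window opens before the numbers were public; for a strictly point-in-time join that avoids look-ahead, start the window at `acceptedDate` (or `fillingDate`) instead.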
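
A window join against prices might look like the sketch below. The frames are toy data, and the `price_date` and `close` column names are assumptions, since the price table is not shown in this notebook. For large tables, `pd.merge_asof` with `by="symbol"` is a possible alternative to the merge-then-filter approach used here.

```python
import pandas as pd

# Hypothetical toy data: two quarterly rows for one symbol and three price dates.
fund = pd.DataFrame({
    "symbol": ["ABC", "ABC"],
    "date_start": pd.to_datetime(["2024-03-31", "2024-06-30"]),
    "date_end": pd.to_datetime(["2024-06-30", "2100-01-01"]),
    "revenue_ttm": [400, 420],
})
prices = pd.DataFrame({
    "symbol": ["ABC", "ABC", "ABC"],
    "price_date": pd.to_datetime(["2024-04-15", "2024-06-30", "2024-08-01"]),
    "close": [10.0, 11.0, 12.0],
})

# Join on symbol, then keep the row whose half-open window contains the price date.
joined = prices.merge(fund, on="symbol", how="left")
in_window = (
    (joined["price_date"] >= joined["date_start"])
    & (joined["price_date"] < joined["date_end"])
)
joined = joined[in_window].reset_index(drop=True)
print(joined[["price_date", "close", "revenue_ttm"]])
```

Because the window is closed on the left and open on the right, the price on 2024-06-30 picks up the second row rather than matching both.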

---

## TTM construction

For each metric \(x \in \{ \text{revenue}, \text{operatingIncome}, \text{netIncome} \}\):

\[
x_{\text{ttm}, t} \;=\; \sum_{k=0}^{3} x_{t-k}
\]

- Computed per `symbol`, strictly 4 quarters, **signs preserved**.
- Stored as nullable integers (pandas `Int64` → Postgres `BIGINT`).
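
The strict window is easiest to see on a toy frame. The sketch below uses invented symbols and values; it applies the same `rolling(window=4, min_periods=4)` idea per symbol, though through `transform` rather than the notebook's `apply`.

```python
import pandas as pd

# Hypothetical toy frame: five quarters for one symbol, two for another.
toy = pd.DataFrame({
    "symbol": ["AAA"] * 5 + ["BBB"] * 2,
    "date": pd.to_datetime(
        ["2023-03-31", "2023-06-30", "2023-09-30", "2023-12-31", "2024-03-31",
         "2023-03-31", "2023-06-30"]),
    "netIncome": [10, -5, 20, 15, 30, 7, 8],
}).sort_values(["symbol", "date"]).reset_index(drop=True)

# Strict 4-quarter window per symbol: fewer than 4 observations gives NA.
toy["netIncome_ttm"] = (
    toy.groupby("symbol")["netIncome"]
       .transform(lambda s: s.rolling(window=4, min_periods=4).sum())
       .round(0)
       .astype("Int64")
)
print(toy)
```

`AAA` gets its first TTM value on the fourth quarter (10 - 5 + 20 + 15 = 40), and `BBB`, with only two quarters, stays NA throughout.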

---

## Growth calculations (robust to zeros/negatives)

### Symmetric percent change (SPC)
Used for both TTM YoY and Quarterly YoY:

\[
\text{spc}(a,b) \;=\; \frac{2\,(a-b)}{|a|+|b|} \;\in\; [-2,\;2]
\]

- **Bounded** in \([-200\%, +200\%]\).
- **Well-behaved** for zeros and sign flips.
- If \(a=b=0\), defined as 0.

### Quarterly YoY alignment
- **Deduplicate** to the **latest** filing per `(symbol, calendarYear, period)`.
- **Self-join** Qx(Y) to Qx(Y-1) on `(symbol, period)`, matching `calendarYear = Y` against the prior row's `calendarYear + 1`.
- Apply SPC to compute:
  - `revenue_q_yoy`, `operatingIncome_q_yoy`, `netIncome_q_yoy`.
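
The year-shifted self-join can be sketched on a small frame. The data is invented and the SPC formula is inlined, so this illustrates only the alignment step, not the full deduplication logic.

```python
import pandas as pd

# Hypothetical toy data: two quarters in each of two years for one symbol.
q = pd.DataFrame({
    "symbol": ["AAA"] * 4,
    "calendarYear": [2023, 2023, 2024, 2024],
    "period": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100.0, 120.0, 110.0, 108.0],
})

# Shift the prior-year copy forward one year so 2024 rows find their 2023 counterpart.
prev = q.rename(columns={"revenue": "revenue_prev"}).copy()
prev["calendarYear"] = prev["calendarYear"] + 1

aligned = q.merge(prev, on=["symbol", "period", "calendarYear"],
                  how="left", validate="one_to_one")
aligned["revenue_q_yoy"] = 2 * (aligned["revenue"] - aligned["revenue_prev"]) / (
    aligned["revenue"].abs() + aligned["revenue_prev"].abs()
)
print(aligned)
```

The 2023 rows have no prior year and come out as NaN, while 2024 Q1 is compared with 2023 Q1 and 2024 Q2 with 2023 Q2. `validate="one_to_one"` makes the merge fail loudly if a `(symbol, period, calendarYear)` key is duplicated.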

---

## Fetching & reliability

- **API**: Financial Modeling Prep `/api/v3/income-statement/{ticker}` (`period=quarter`, deep `limit`, your API key).
- **Resilience**: retries for `429/5xx` with exponential backoff.
- **Batching & throttle**: configurable batch size + optional “sleep every N tickers”.
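
For a feel of what the exponential backoff costs, the sketch below lists the wait after each failed attempt. `backoff_delays` is a hypothetical helper; its defaults are taken from the `_get_with_retries` signature in this notebook (`max_retries=4`, `base_sleep=1.0`).

```python
def backoff_delays(max_retries: int = 4, base_sleep: float = 1.0) -> list:
    """Seconds to wait after each failed attempt: base_sleep * 2**(attempt - 1)."""
    return [base_sleep * (2 ** (attempt - 1)) for attempt in range(1, max_retries + 1)]

delays = backoff_delays()
print(delays)        # [1.0, 2.0, 4.0, 8.0]
print(sum(delays))   # 15.0
```

With these defaults, a ticker that keeps returning `429` spends about 15 seconds sleeping before the error is raised, which is worth keeping in mind when sizing batches.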

---

## Ingestion (PostgreSQL)

- **Target**: `public.income_statements_q`
- **Keys/Indexes**
  - `UNIQUE(symbol, date)` to de-dupe updates.
  - Indexes on `(symbol)` and `(date)`.
- **Loader pattern**
  1. **Create** table if missing (lowercase columns).
  2. **ALTER TABLE ADD COLUMN IF NOT EXISTS** for any new fields (e.g., growths).
  3. **Stage → UPSERT** chunks on `(symbol, date)`; drop staging after merge.
- **Types**
  - Dates → `TIMESTAMP` (tz-naive)
  - Big numeric totals → `BIGINT`
  - Ratios/changes → `DOUBLE PRECISION`
  - Strings/IDs → `TEXT`
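
The stage-then-upsert step comes down to one `INSERT ... ON CONFLICT` statement per chunk. The sketch below only builds that SQL string; `build_upsert_sql` and the `stg_demo` staging name are hypothetical, and identifiers are left unquoted for brevity.

```python
def build_upsert_sql(target: str, staging: str, cols: list, keys=("symbol", "date")) -> str:
    """Return an INSERT ... ON CONFLICT statement that merges staging into target."""
    col_list = ", ".join(cols)
    updates = ", ".join(f"{c}=EXCLUDED.{c}" for c in cols if c not in keys)
    # With only key columns there is nothing to update, so fall back to DO NOTHING.
    action = f"DO UPDATE SET {updates}" if updates else "DO NOTHING"
    return (
        f"INSERT INTO {target} ({col_list}) "
        f"SELECT {col_list} FROM {staging} "
        f"ON CONFLICT ({', '.join(keys)}) {action};"
    )

print(build_upsert_sql("income_statements_q", "stg_demo", ["symbol", "date", "revenue"]))
```

The `DO NOTHING` fallback is a design choice of this sketch: it keeps the statement valid when a chunk carries no non-key columns.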

---

## Usage tips

- Always sort before rolling logic:
  ```python
  df = df.sort_values(["symbol","date"]).reset_index(drop=True)
  ```


In [6]:
# ===================== FULL SCRIPT — FMP Income Statements w/ SCD Windows + TTM (skip + 200-ticker sleep) =====================
import time
from typing import Iterable, List, Union, Optional, Tuple
import requests
import pandas as pd

# -------------------------- Config --------------------------
FMP_BASE = "https://financialmodelingprep.com/api/v3/income-statement"
MAX_VALID_DATE = pd.Timestamp("2100-01-01")

# Base schema column order (your listing)
COL_ORDER = [
    "symbol", "date", "date_start", "date_end",
    "reportedCurrency", "cik", "fillingDate", "acceptedDate",
    "calendarYear", "period",
    "revenue", "costOfRevenue", "grossProfit", "grossProfitRatio",
    "researchAndDevelopmentExpenses", "generalAndAdministrativeExpenses",
    "sellingAndMarketingExpenses", "sellingGeneralAndAdministrativeExpenses",
    "otherExpenses", "operatingExpenses", "costAndExpenses",
    "interestIncome", "interestExpense", "depreciationAndAmortization",
    "ebitda", "ebitdaratio", "operatingIncome", "operatingIncomeRatio",
    "totalOtherIncomeExpensesNet", "incomeBeforeTax", "incomeBeforeTaxRatio",
    "incomeTaxExpense", "netIncome", "netIncomeRatio",
    "eps", "epsdiluted", "weightedAverageShsOut", "weightedAverageShsOutDil",
    "link", "finalLink",
]

# TTM columns appended at the end
TTM_COLS = ["revenue_ttm", "netIncome_ttm", "operatingIncome_ttm"]


# -------------------------- Helpers --------------------------
def _normalize_fmp_json(j):
    """Normalize possible JSON shapes to list[dict]."""
    if isinstance(j, list):
        return j
    if isinstance(j, dict):
        for k in ("error", "Error", "message", "Note", "Error Message"):
            if k in j and isinstance(j[k], str):
                # treat as no data; let caller decide to skip
                raise RuntimeError(f"API message: {j[k]}")
        for k in ("financials", "items", "data", "results", "financialStatements"):
            if k in j and isinstance(j[k], list):
                return j[k]
        return [j]
    raise RuntimeError(f"Unexpected JSON type from API: {type(j)}")


def _get_with_retries(
    session: requests.Session,
    url: str,
    params: dict,
    timeout: int = 30,
    max_retries: int = 4,
    base_sleep: float = 1.0,
):
    """GET with simple exponential backoff on transient errors/rate-limits."""
    last = None
    for attempt in range(1, max_retries + 1):
        resp = session.get(url, params=params, timeout=timeout)
        if resp.status_code == 200:
            return resp
        last = resp
        if resp.status_code in (429, 500, 502, 503, 504):
            time.sleep(base_sleep * (2 ** (attempt - 1)))
            continue
        resp.raise_for_status()
    if last is not None:
        last.raise_for_status()
    raise RuntimeError("Request failed without response.")


# -------------------------- Core fetchers --------------------------
def fetch_income_statements_one(
    ticker: str,
    api_key: str,
    session: Optional[requests.Session] = None,
    period: str = "quarter",
    limit: int = 120,
    timeout: int = 30,
) -> pd.DataFrame:
    """Fetch income statements for a single ticker; tidy, dedup, sorted.
       Raises on problems (caller may choose to skip)."""
    if session is None:
        session = requests.Session()

    url = f"{FMP_BASE}/{ticker}"
    params = {"period": period, "apikey": api_key, "limit": limit}

    r = _get_with_retries(session, url, params, timeout=timeout)
    try:
        data = r.json()
    except ValueError as e:
        raise RuntimeError(f"Non-JSON response for {ticker}: {r.text[:300]}") from e

    records = _normalize_fmp_json(data)
    if not records:
        raise RuntimeError(f"No records returned for {ticker}.")

    df = pd.DataFrame.from_records(records)

    # Robust symbol handling
    if "symbol" in df.columns:
        df["symbol"] = df["symbol"].astype(str)
        mask_missing = df["symbol"].isin(["", "None", "nan", "NaN"]) | df["symbol"].isna()
        df.loc[mask_missing, "symbol"] = ticker
        df["symbol"] = df["symbol"].str.upper()
    else:
        df["symbol"] = ticker.upper()

    if "date" not in df.columns:
        raise RuntimeError(f"'date' missing in payload for {ticker}.")
    df["date"] = pd.to_datetime(df["date"], errors="coerce")

    # Drop bad/dup rows and sort for rolling logic
    df = (
        df.dropna(subset=["date"])
          .drop_duplicates(subset=["symbol", "date"], keep="first")
          .sort_values(["symbol", "date"])
          .reset_index(drop=True)
    )
    if df.empty:
        raise RuntimeError(f"No valid rows after cleaning for {ticker}.")
    return df


def add_validity_windows(df: pd.DataFrame) -> pd.DataFrame:
    """Add date_start/date_end per ticker; last date_end = 2100-01-01."""
    df = df.sort_values(["symbol", "date"]).reset_index(drop=True)
    df["date_start"] = df["date"]
    df["date_end"] = df.groupby("symbol")["date"].shift(-1)
    df["date_end"] = df["date_end"].fillna(MAX_VALID_DATE)
    return df


def add_ttm(df: pd.DataFrame) -> pd.DataFrame:
    """
    Adds revenue_ttm, netIncome_ttm, operatingIncome_ttm:
      - strict 4-quarter rolling sum per symbol
      - preserves signs (negatives remain negative)
      - outputs pandas nullable Int64 (NA for first 3 rows per symbol)
    """
    df = df.sort_values(["symbol", "date"]).reset_index(drop=True)

    def _rolling_ttm(s: pd.Series) -> pd.Series:
        roll = s.rolling(window=4, min_periods=4).sum()
        return roll.round(0).astype("Int64")

    for base_col, ttm_col in [
        ("revenue", "revenue_ttm"),
        ("netIncome", "netIncome_ttm"),
        ("operatingIncome", "operatingIncome_ttm"),
    ]:
        if base_col not in df.columns:
            df[base_col] = pd.NA
        df[base_col] = pd.to_numeric(df[base_col], errors="coerce")
        df[ttm_col] = df.groupby("symbol", group_keys=False)[base_col].apply(_rolling_ttm)

    return df


def coerce_and_order(df: pd.DataFrame) -> pd.DataFrame:
    """Ensure presence of base columns, convert date-like, and order columns."""
    for col in COL_ORDER:
        if col not in df.columns:
            df[col] = pd.NaT if col in ("date", "date_start", "date_end") else pd.NA
    for dcol in ("date", "date_start", "date_end"):
        df[dcol] = pd.to_datetime(df[dcol], errors="coerce")
    extras = [c for c in df.columns if c not in COL_ORDER]
    df = df[COL_ORDER + extras]
    return df


def fetch_income_statements(
    tickers: Union[str, Iterable[str]],
    api_key: str,
    period: str = "quarter",
    limit: int = 120,
    batch_size: int = 25,            # safe default for big lists (e.g., 700+)
    sleep_between_batches: float = 1.2,
    timeout: int = 30,
    skip_errors: bool = True,        # skip tickers that fail
    verbose: bool = True,            # log progress & skips
    # NEW: throttle every N tickers
    sleep_every_n: int = 200,
    sleep_every_n_seconds: int = 120,   # 2 minutes
) -> pd.DataFrame:
    """
    Master fetcher: many tickers, batching, windows, TTM, ordered columns.
    - skip_errors=True: tickers with no data / bad payloads are skipped.
    - Sleeps `sleep_between_batches` between batches,
      and additionally sleeps `sleep_every_n_seconds` after every `sleep_every_n` tickers processed.
    """
    if isinstance(tickers, str):
        tickers = [tickers]
    tickers = [t.upper().strip() for t in tickers if str(t).strip()]

    session = requests.Session()
    frames: List[pd.DataFrame] = []
    skipped: List[Tuple[str, str]] = []

    total = len(tickers)
    processed_count = 0

    for i in range(0, total, batch_size):
        batch = tickers[i:i + batch_size]
        if verbose:
            print(f"Batch {i//batch_size + 1}: {len(batch)} tickers "
                  f"({i+1}–{min(i+len(batch), total)} of {total})")

        for t in batch:
            try:
                df_t = fetch_income_statements_one(
                    t, api_key=api_key, session=session, period=period, limit=limit, timeout=timeout
                )
                frames.append(df_t)
            except Exception as e:
                if skip_errors:
                    skipped.append((t, str(e)))
                    if verbose:
                        print(f"  [skip] {t}: {e}")
                else:
                    raise
            finally:
                processed_count += 1
                # ---- NEW: Sleep after every N processed tickers ----
                if sleep_every_n > 0 and processed_count % sleep_every_n == 0 and processed_count < total:
                    if verbose:
                        remaining = total - processed_count
                        print(f"⏸ Sleeping {sleep_every_n_seconds}s after {processed_count} tickers "
                              f"({remaining} remaining)...")
                    time.sleep(sleep_every_n_seconds)

        # polite pacing between batches (this may occur near the N-boundary; that's okay)
        if i + batch_size < total:
            time.sleep(sleep_between_batches)

    if not frames:
        if verbose:
            print("No successful results.")
            if skipped:
                print(f"Skipped {len(skipped)} tickers. Examples: {skipped[:5]}")
        return pd.DataFrame(columns=COL_ORDER + TTM_COLS)

    df_all = pd.concat(frames, ignore_index=True)

    # Add windows, TTM, then normalize & order columns
    df_all = add_validity_windows(df_all)
    df_all = add_ttm(df_all)
    df_all = coerce_and_order(df_all)

    if verbose:
        print(f"✅ Success: {len(frames)} tickers; rows: {len(df_all)}")
        if skipped:
            print(f"⚠️ Skipped {len(skipped)} tickers (no data or errors).")
            # for sym, msg in skipped: print(f"   {sym}: {msg}")

    return df_all


# -------------------------- Example usage --------------------------
if __name__ == "__main__":
    API_KEY = "" # <-- replace with your key

    # Use the universe loaded from Postgres above; tickers that return no data are skipped
    tickers = universe

    df = fetch_income_statements(
        tickers=tickers,
        api_key=API_KEY,
        period="quarter",
        limit=120,
        batch_size=25,
        sleep_between_batches=1.2,
        skip_errors=True,
        verbose=True,
        sleep_every_n=200,          # <-- sleep every 200 tickers
        sleep_every_n_seconds=120,  # <-- for 2 minutes
    )

    if not df.empty:
        print(df[[
            "symbol", "date",
            "revenue", "revenue_ttm",
            "netIncome", "netIncome_ttm",
            "operatingIncome", "operatingIncome_ttm",
            "date_start", "date_end"
        ]].head(12))

    # df.to_parquet("income_statements_with_ttm.parquet", index=False)
    # df.to_csv("income_statements_with_ttm.csv", index=False)


Batch 1: 25 tickers (1–25 of 679)
Batch 2: 25 tickers (26–50 of 679)
Batch 3: 25 tickers (51–75 of 679)
Batch 4: 25 tickers (76–100 of 679)
Batch 5: 25 tickers (101–125 of 679)
Batch 6: 25 tickers (126–150 of 679)
Batch 7: 25 tickers (151–175 of 679)
Batch 8: 25 tickers (176–200 of 679)
⏸ Sleeping 120s after 200 tickers (479 remaining)...
Batch 9: 25 tickers (201–225 of 679)
  [skip] EMC: No records returned for EMC.
  [skip] EMCR: No records returned for EMCR.
Batch 10: 25 tickers (226–250 of 679)
Batch 11: 25 tickers (251–275 of 679)
Batch 12: 25 tickers (276–300 of 679)
Batch 13: 25 tickers (301–325 of 679)
Batch 14: 25 tickers (326–350 of 679)
Batch 15: 25 tickers (351–375 of 679)
Batch 16: 25 tickers (376–400 of 679)
⏸ Sleeping 120s after 400 tickers (279 remaining)...
Batch 17: 25 tickers (401–425 of 679)
Batch 18: 25 tickers (426–450 of 679)
Batch 19: 25 tickers (451–475 of 679)
  [skip] NYX: No records returned for NYX.
Batch 20: 25 tickers (476–500 of 679)
Batch 21: 25 tickers

In [7]:
df

Unnamed: 0,symbol,date,date_start,date_end,reportedCurrency,cik,fillingDate,acceptedDate,calendarYear,period,...,netIncomeRatio,eps,epsdiluted,weightedAverageShsOut,weightedAverageShsOutDil,link,finalLink,revenue_ttm,netIncome_ttm,operatingIncome_ttm
0,A,1999-01-31,1999-01-31,1999-04-30,USD,0001090872,1999-01-31,1999-01-31 00:00:00,1999,Q1,...,0.041433,0.17,0.16,435294118,462500000,,,,,
1,A,1999-04-30,1999-04-30,1999-07-31,USD,0001090872,1999-04-30,1999-04-30 00:00:00,1999,Q2,...,0.078109,0.35,0.34,448571429,461764706,,,,,
2,A,1999-07-31,1999-07-31,1999-10-31,USD,0001090872,1999-07-31,1999-07-31 00:00:00,1999,Q3,...,0.064686,0.30,0.30,450000000,450000000,,,,,
3,A,1999-10-31,1999-10-31,2000-01-31,USD,0001090872,2000-01-25,2000-01-25 00:00:00,1999,Q4,...,0.059641,0.33,0.33,439000000,440000000,https://www.sec.gov/Archives/edgar/data/109087...,https://www.sec.gov/Archives/edgar/data/109087...,8331000000,512000000,741000000
4,A,2000-01-31,2000-01-31,2000-04-30,USD,0001090872,2000-03-15,2000-03-15 00:00:00,2000,Q1,...,0.058326,0.30,0.30,439000000,440000000,https://www.sec.gov/Archives/edgar/data/109087...,https://www.sec.gov/Archives/edgar/data/109087...,8791000000,569000000,-874000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67746,ZTS,2024-06-30,2024-06-30,2024-09-30,USD,0001555280,2024-08-06,2024-08-06 12:41:47,2024,Q2,...,0.264295,1.37,1.37,455500000,456000000,https://www.sec.gov/Archives/edgar/data/155528...,https://www.sec.gov/Archives/edgar/data/155528...,8915000000,2344000000,3214000000
67747,ZTS,2024-09-30,2024-09-30,2024-12-31,USD,0001555280,2024-11-04,2024-11-04 13:52:13,2024,Q3,...,0.285595,1.51,1.50,452900000,453500000,https://www.sec.gov/Archives/edgar/data/155528...,https://www.sec.gov/Archives/edgar/data/155528...,9152000000,2430000000,3336000000
67748,ZTS,2024-12-31,2024-12-31,2025-03-31,USD,0001555280,2025-02-13,2025-02-13 14:51:25,2024,Q4,...,0.250755,1.29,1.29,450500000,451100000,https://www.sec.gov/Archives/edgar/data/155528...,https://www.sec.gov/Archives/edgar/data/155528...,9256000000,2486000000,3392000000
67749,ZTS,2025-03-31,2025-03-31,2025-06-30,USD,0001555280,2025-05-06,2025-05-06 13:29:54,2025,Q1,...,0.284234,1.41,1.41,447600000,448000000,https://www.sec.gov/Archives/edgar/data/155528...,https://www.sec.gov/Archives/edgar/data/155528...,9286000000,2518000000,3437000000


In [11]:
# -------------------------- NEW: signed pct-change for TTM --------------------------
def _signed_pct_change(curr: pd.Series, prev: pd.Series) -> pd.Series:
    """
    Symmetric percent change, robust to zeros/negatives:
        growth = 2*(curr - prev) / (|curr| + |prev|)
    Returns Float64. If both are 0, returns 0.0.
    """
    a = pd.to_numeric(curr, errors="coerce").astype("Float64")
    b = pd.to_numeric(prev, errors="coerce").astype("Float64")

    denom = a.abs() + b.abs()
    out = pd.Series(pd.NA, index=a.index, dtype="Float64")

    valid = denom.notna() & (denom != 0)
    out.loc[valid] = 2.0 * (a[valid] - b[valid]) / denom[valid]

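    # comparisons against NA yield NA; treat those as False so only true 0/0 pairs are set
    both_zero = ((a == 0) & (b == 0)).fillna(False)
    out.loc[both_zero] = 0.0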
    return out


def add_ttm_growth(df: pd.DataFrame) -> pd.DataFrame:
    """Add YoY growth for TTM series (compare t vs t-4 quarters)."""
    df = df.sort_values(["symbol", "date"]).reset_index(drop=True)
    for ttm_col, out_col in [
        ("revenue_ttm", "revenue_ttm_growth"),
        ("netIncome_ttm", "netIncome_ttm_growth"),
        ("operatingIncome_ttm", "operatingIncome_ttm_growth"),
    ]:
        prev = df.groupby("symbol")[ttm_col].shift(4)
        df[out_col] = _signed_pct_change(df[ttm_col], prev)
    return df


In [12]:
df = add_ttm_growth(df)

In [14]:
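def add_quarterly_yoy(df: pd.DataFrame) -> pd.DataFrame:
    """Add revenue/netIncome/operatingIncome quarterly YoY: Qx(Y) vs Qx(Y-1), via symmetric pct-change."""
    # --- build normalized join keys on a copy ---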
    base = df.copy()
    base["symbol"] = base["symbol"].astype(str).str.upper().str.strip()
    base["period_key"] = base.get("period", pd.Series(index=base.index, dtype="object")).astype(str).str.upper().str.strip()
    base["year_key"]   = pd.to_numeric(base.get("calendarYear"), errors="coerce").astype("Int64")

    # keep only real fiscal quarters
    base = base[base["period_key"].isin(["Q1","Q2","Q3","Q4"])].copy()

    # robust timestamps for "latest filing" per (symbol, year, quarter)
    for c in ["acceptedDate","fillingDate","date"]:
        if c in base.columns:
            base[c] = pd.to_datetime(base[c], errors="coerce")

    sort_cols = [c for c in ["acceptedDate","fillingDate","date"] if c in base.columns]
    latest = (base.sort_values(["symbol","year_key","period_key"] + sort_cols)
                   .drop_duplicates(["symbol","year_key","period_key"], keep="last"))

    # prior-year values aligned on (symbol, period, year-1)
    prev = latest[["symbol","period_key","year_key","revenue","netIncome","operatingIncome"]].rename(
        columns={"revenue":"revenue_prev","netIncome":"netIncome_prev","operatingIncome":"operatingIncome_prev"}
    )
    prev["year_key"] = prev["year_key"] + 1  # so 2025 joins to 2024

    aligned = latest.merge(prev, on=["symbol","period_key","year_key"], how="left", validate="one_to_one")

    # symmetric pct-change (your helper)
    aligned["revenue_q_yoy"]         = _signed_pct_change(aligned["revenue"],         aligned["revenue_prev"])
    aligned["netIncome_q_yoy"]       = _signed_pct_change(aligned["netIncome"],       aligned["netIncome_prev"])
    aligned["operatingIncome_q_yoy"] = _signed_pct_change(aligned["operatingIncome"], aligned["operatingIncome_prev"])

    # --- merge back to the original df using the SAME keys (many_to_one) ---
    out = df.copy()
    out["symbol"]     = out["symbol"].astype(str).str.upper().str.strip()
    out["period_key"] = out.get("period", pd.Series(index=out.index, dtype="object")).astype(str).str.upper().str.strip()
    out["year_key"]   = pd.to_numeric(out.get("calendarYear"), errors="coerce").astype("Int64")

    out = out.merge(
        aligned[["symbol","period_key","year_key","revenue_q_yoy","netIncome_q_yoy","operatingIncome_q_yoy"]],
        on=["symbol","period_key","year_key"],
        how="left",
        validate="many_to_one"
    )

    return out.drop(columns=["period_key","year_key"])


In [15]:
df = add_quarterly_yoy(df)

In [16]:
# spot-check a ticker to verify Qx(year) vs Qx(year-1) pairing
chk = (df[df.symbol.eq("AAPL")]
       .query("period in ['Q1','Q2','Q3','Q4']")
       .sort_values(["calendarYear","period"])[
         ["symbol","calendarYear","period","revenue","revenue_q_yoy"]
       ])
chk.tail(8)


Unnamed: 0,symbol,calendarYear,period,revenue,revenue_q_yoy
479,AAPL,2023,Q4,89498000000.0,-0.007214
480,AAPL,2024,Q1,119575000000.0,0.020454
481,AAPL,2024,Q2,90753000000.0,-0.044
482,AAPL,2024,Q3,85777000000.0,0.047501
483,AAPL,2024,Q4,94930000000.0,0.058906
484,AAPL,2025,Q1,124300000000.0,0.038749
485,AAPL,2025,Q2,95359000000.0,0.049497
486,AAPL,2025,Q3,94036000000.0,0.091862
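
One row of the spot-check can be reproduced by hand. The sketch below recomputes the 2024 Q4 value from the two Q4 revenue figures in the table above and shows the classic percent change next to it for comparison.

```python
# Revenue values copied from the table above (AAPL, calendar Q4).
rev_2024_q4 = 94_930_000_000.0
rev_2023_q4 = 89_498_000_000.0

spc = 2 * (rev_2024_q4 - rev_2023_q4) / (abs(rev_2024_q4) + abs(rev_2023_q4))
classic = (rev_2024_q4 - rev_2023_q4) / rev_2023_q4

print(round(spc, 6))      # 0.058906
print(round(classic, 6))  # 0.060694
```

The symmetric value matches the `revenue_q_yoy` column, which confirms that Q4 2024 was paired with Q4 2023. For moderate changes like this one, the two measures differ only slightly.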


In [26]:
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67751 entries, 0 to 67750
Data columns (total 49 columns):
 #   Column                                   Non-Null Count  Dtype         
---  ------                                   --------------  -----         
 0   symbol                                   67751 non-null  object        
 1   date                                     67751 non-null  datetime64[ns]
 2   date_start                               67751 non-null  datetime64[ns]
 3   date_end                                 67751 non-null  datetime64[ns]
 4   reportedCurrency                         67751 non-null  object        
 5   cik                                      67751 non-null  object        
 6   fillingDate                              67751 non-null  object        
 7   acceptedDate                             67751 non-null  object        
 8   calendarYear                             67751 non-null  object        
 9   period                                 

In [30]:
# ================= FULL INGEST (dynamic columns) =================
import math, uuid
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.engine import Engine
from sqlalchemy.dialects.postgresql import DOUBLE_PRECISION, BIGINT
from sqlalchemy.types import DateTime, Text

PG_CONN_STR = "postgresql://postgres:CSDBMS623@localhost:5432/SP500_ML"
SCHEMA      = "public"
TABLE       = "income_statements_q"
CHUNK_ROWS  = 25_000

df_load = df.copy()  # <-- your big DataFrame from above

# --- 1) normalize columns & dtypes ---
df_load.columns = df_load.columns.str.lower()
for d in ("date","date_start","date_end","accepteddate","fillingdate"):
    if d in df_load.columns:
        df_load[d] = pd.to_datetime(df_load[d], errors="coerce").dt.tz_localize(None)

# remember nullable-int columns first: after the object conversion below they
# would otherwise be inferred as TEXT instead of BIGINT
int_cols = [c for c in df_load.columns if str(df_load[c].dtype) == "Int64"]

# convert pandas nullable ints to object with None so psycopg inserts NULLs
for c in int_cols:
    s = df_load[c]
    df_load[c] = s.astype("object").where(s.notna(), None)

# --- 2) infer PostgreSQL types from pandas dtypes ---
def infer_pg_type(s: pd.Series):
    dt = str(s.dtype)
    if "datetime64" in dt:
        return DateTime(timezone=False)
    if dt in ("Float64","float64"):
        return DOUBLE_PRECISION()
    if dt in ("Int64","int64"):
        return BIGINT()
    if dt == "bool":
        return DOUBLE_PRECISION()  # store as 0/1 float (or map to BOOLEAN if you prefer)
    return Text()  # object/strings

PG_DTYPE = {c: infer_pg_type(df_load[c]) for c in df_load.columns}
# restore BIGINT for the columns that were nullable ints before conversion
PG_DTYPE.update({c: BIGINT() for c in int_cols})

# growth columns are floats for sure
for c in ["revenue_ttm_growth","netincome_ttm_growth","operatingincome_ttm_growth",
          "revenue_q_yoy","netincome_q_yoy","operatingincome_q_yoy"]:
    if c in df_load.columns:
        PG_DTYPE[c] = DOUBLE_PRECISION()

# --- 3) ensure table exists and has all columns ---
engine: Engine = create_engine(PG_CONN_STR, pool_pre_ping=True)
with engine.begin() as conn:
    # create table if missing with minimal keys
    conn.execute(text(f'''
        CREATE TABLE IF NOT EXISTS "{SCHEMA}"."{TABLE}" (
            symbol TEXT,
            date   TIMESTAMP
        );
    '''))
    # unique constraint on (symbol, date)
    conn.execute(text(f"""
        DO $$
        BEGIN
          IF NOT EXISTS (
            SELECT 1 FROM pg_constraint WHERE conname = '{TABLE}_symbol_date_key'
          ) THEN
            ALTER TABLE "{SCHEMA}"."{TABLE}"
            ADD CONSTRAINT {TABLE}_symbol_date_key UNIQUE (symbol, date);
          END IF;
        END$$;
    """))
    # add any missing columns with inferred types
    # fetch current table cols
    existing = conn.execute(text(f"""
        SELECT column_name
        FROM information_schema.columns
        WHERE table_schema=:s AND table_name=:t
    """), {"s": SCHEMA, "t": TABLE}).fetchall()
    have = {r[0] for r in existing}
    for col, typ in PG_DTYPE.items():
        if col not in have:
            sqltype = ("DOUBLE PRECISION" if isinstance(typ, DOUBLE_PRECISION)
                       else "BIGINT" if isinstance(typ, BIGINT)
                       else "TIMESTAMP" if isinstance(typ, DateTime)
                       else "TEXT")
            conn.execute(text(f'ALTER TABLE "{SCHEMA}"."{TABLE}" ADD COLUMN IF NOT EXISTS {col} {sqltype};'))

    # helpful indexes
    conn.execute(text(f'CREATE INDEX IF NOT EXISTS {TABLE}_symbol_idx ON "{SCHEMA}"."{TABLE}" (symbol);'))
    conn.execute(text(f'CREATE INDEX IF NOT EXISTS {TABLE}_date_idx   ON "{SCHEMA}"."{TABLE}" (date);'))

# --- 4) chunked staging → upsert ---
n = len(df_load)
n_chunks = math.ceil(n / CHUNK_ROWS)
for i in range(n_chunks):
    lo, hi = i * CHUNK_ROWS, min((i + 1) * CHUNK_ROWS, n)
    chunk = df_load.iloc[lo:hi].copy()

    staging = f"stg_{TABLE}_{uuid.uuid4().hex[:8]}"
    # write staging with explicit dtypes
    chunk.to_sql(staging, engine, schema=SCHEMA, if_exists="replace",
                 index=False, dtype={c: PG_DTYPE.get(c) for c in chunk.columns})

    cols = chunk.columns.tolist()
    non_key = [c for c in cols if c not in ("symbol","date")]
    set_clause = ", ".join([f'{c}=EXCLUDED.{c}' for c in non_key]) if non_key else ""

    with engine.begin() as conn:
        conn.execute(text(f'''
            INSERT INTO "{SCHEMA}"."{TABLE}" ({", ".join(cols)})
            SELECT {", ".join(cols)} FROM "{SCHEMA}"."{staging}"
            ON CONFLICT (symbol, date) DO UPDATE SET {set_clause};
            DROP TABLE "{SCHEMA}"."{staging}";
        '''))
    print(f"Upserted rows {lo}–{hi} / {n}")

print("✅ Full ingestion complete:", TABLE)


Upserted rows 0–25000 / 67751
Upserted rows 25000–50000 / 67751
Upserted rows 50000–67751 / 67751
✅ Full ingestion complete: income_statements_q
