## Question 1 — NLP Mini Practical (Sentiment, CSV)

**Scenario**  
Your team needs to implement a minimal **sentiment analysis module**.  
Given an English sentence, output whether the sentiment is **`positive`** or **`negative`**.

---

**Input**  
CSV file **`sentiment.csv`** with the following columns:

- `text` — an English sentence  
- `label` — one of `positive` or `negative` (ground truth)

---

**What to Build**

1. `load_sentiment_csv(path: str = "./data/sentiment.csv") -> pd.DataFrame`  
2. `predict_sentiment(text: str) -> str` → returns **exactly** `"positive"` or `"negative"`  
3. `predict_sentiment_batch(texts: list[str]) -> list[str]`  

---

**Baseline Requirements (Rule-Based)**

- Handle **case-insensitivity**.  
- Ignore **punctuation** (reasonable tokenization is acceptable).  
- Use **lexicon-based rules** as a starting point.  

We provide a very small **sample lexicon** below.  
👉 You should **expand it** (add synonyms and common variants to improve accuracy).

- `POSITIVE_WORDS` (sample): `{"good", "great", "excellent"}`  
- `NEGATIVE_WORDS` (sample): `{"bad", "terrible", "awful"}`  
- `NEGATIONS` (sample): `{"not", "no", "never", "n't"}`  
  *(can be used for polarity flipping, e.g., “not good” → negative)*  
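
The rules above can be sketched as follows. This is one possible baseline, not the reference solution: the names `tokenize` and `sketch_predict` are hypothetical, the flip only affects the token directly after a negation (so "not very good" is not handled), and defaulting ties to `"negative"` is an assumption.

```python
import re
from typing import List

# Sample lexicons from the task statement; a real solution would expand them.
POSITIVE_WORDS = {"good", "great", "excellent"}
NEGATIVE_WORDS = {"bad", "terrible", "awful"}
NEGATIONS = {"not", "no", "never", "n't"}

TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text: str) -> List[str]:
    """Lowercase, drop punctuation, and split a trailing "n't" into its own token."""
    tokens: List[str] = []
    for tok in TOKEN_RE.findall(text.lower()):
        if tok.endswith("n't") and len(tok) > 3:
            tokens.extend([tok[:-3], "n't"])  # "isn't" -> "is", "n't"
        else:
            tokens.append(tok)
    return tokens


def sketch_predict(text: str) -> str:
    """Sum word polarities, flipping a word that directly follows a negation."""
    score = 0
    negate = False
    for tok in tokenize(text):
        if tok in NEGATIONS:
            negate = True
            continue
        polarity = (tok in POSITIVE_WORDS) - (tok in NEGATIVE_WORDS)
        score += -polarity if negate else polarity
        negate = False  # the flip applies to the next token only
    # Assumption: ties and sentences with no lexicon hits default to "negative".
    return "positive" if score > 0 else "negative"
```

Splitting the `n't` suffix matters because a regex such as `[a-z']+` keeps "isn't" as a single token, which would otherwise never match the `"n't"` entry in `NEGATIONS`.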

---

**Optional (Stretch Goal): Transformer Calibration**

- If your environment allows (e.g., internet access and `transformers` installed),  
  you may additionally run an off-the-shelf model:  
  ```python
  from transformers import pipeline
  sentiment_model = pipeline("sentiment-analysis")
  ```


In [None]:
# Part 3 — Code Skeleton
from typing import List
import re
import pandas as pd

# --- SAMPLE LEXICONS (expand these) ---
POSITIVE_WORDS = {"good", "great", "excellent"}       # expand
NEGATIVE_WORDS = {"bad", "terrible", "awful"}         # expand
NEGATIONS = {"not", "no", "never", "n't"}             # expand if you handle negation

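# --- Tokenization (you may replace with your own) ---
# Apply this to lowercased text. Contractions such as "isn't" come out as one
# token, so "n't" in NEGATIONS only matches if you split the suffix yourself.
TOKEN_RE = re.compile(r"[a-z']+")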

def load_sentiment_csv(path: str = "./data/sentiment.csv") -> pd.DataFrame:
    """Load sentiment.csv and return a DataFrame with columns ['text','label'].""" 
    return pd.read_csv(path)

def predict_sentiment(text: str) -> str:
    """Return 'positive' or 'negative'.
    Implement a rule-based classifier using the lexicons above.
    - Case-insensitive
    - Ignore punctuation (e.g., via regex tokenization)
    - (Optional) Handle negation flipping, e.g., 'not good' -> negative
    """
    # TODO: implement your sentiment classifier
    # Hints (not required): lowercase, tokenize, count pos/neg hits, apply (optional) negation flip
    pass


def predict_sentiment_batch(texts: List[str]) -> List[str]:
    """Vectorized helper that applies predict_sentiment to a list of texts."""
    return [predict_sentiment(t) for t in texts]


In [None]:
# Optional — Transformer Calibration (reference solution)
try:
    import pandas as pd
    from transformers import pipeline

    # Make sure the CSV is loaded
    if "df" not in globals():
        df = load_sentiment_csv("./data/sentiment.csv")

    # Specify the model explicitly to avoid the default-model warning
    sentiment_model = pipeline(
        task="sentiment-analysis",
        model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
        revision="af0f99b"   # 你的环境日志里提示的 revision
    )

    # Batch inference
    texts = df["text"].tolist()
    raw_preds = sentiment_model(texts, batch_size=16, truncation=True)

    # Map model labels onto "positive" / "negative"
    def _map_label(lbl: str) -> str:
        label = str(lbl).lower()
        if "pos" in label or label.endswith("1"):
            return "positive"
        # "NEGATIVE", "LABEL_0", and anything unrecognized fall back to "negative"
        return "negative"

    preds_tr = [_map_label(r.get("label", "")) for r in raw_preds]

    # Compute accuracy
    acc_tr = (pd.Series(preds_tr).values == df["label"].values).mean()
    print(f"Samples: {len(df)}")
    print(f"Accuracy (transformer): {acc_tr:.2%}")
    print("First 10 transformer predictions:")
    for t, y, p in list(zip(df["text"], df["label"], preds_tr))[:10]:
        print(f"- {t} | gold={y} | pred={p}")

except Exception as e:
    print("⚠️ Transformers calibration skipped:", repr(e))


In [None]:
# Part 3 — Validation
import os

def _find_sentiment_csv():
    for p in ["./data/sentiment.csv", "./sentiment.csv", "/mnt/data/sentiment.csv"]:
        if os.path.exists(p):
            return p
    return None

csv_path = _find_sentiment_csv()
if csv_path is None:
    print("❌ Could not find 'sentiment.csv'. Place it under ./data/ or alongside this notebook.")
else:
    df = load_sentiment_csv(csv_path)
    if not {"text","label"}.issubset(df.columns):
        print("❌ CSV must contain columns: ['text','label']. Found:", list(df.columns))
    else:
        allowed = {"positive","negative"}
        preds = predict_sentiment_batch(df["text"].tolist())
        if any(p not in allowed for p in preds):
            bad = {p for p in preds if p not in allowed}
            print("⚠️ Found unexpected labels in predictions:", bad)
        acc = (pd.Series(preds).values == df["label"].values).mean()
        print(f"Samples: {len(df)}")
        print(f"Accuracy (rule-based): {acc:.2%}")
        print("First 10 predictions:")
        for i, (t, y, p) in enumerate(zip(df["text"].tolist(), df["label"].tolist(), preds)):
            if i >= 10: break
            print(f"- {t} | gold={y} | pred={p}")
        if 0 <= acc <= 1:
            print("✅ Part 3 validation executed.")



## Question 2 — Tiny Retrieval over Doctors CSV

**Task**
Implement filters over the provided **`doctors.csv`** dataset.

**Input**
A CSV file **`doctors.csv`** with columns (example):
- `id` (int), `name` (str), `specialty` (str), `city` (str), `rating` (float), `notes` (str)

**Requirements**
1. Implement `load_doctors_csv(path: str = "./data/doctors.csv") -> pd.DataFrame` to load the dataset.
2. Implement a function `filter_doctors(df: pd.DataFrame, field: str, value: str) -> pd.DataFrame` that returns a **new DataFrame** of rows where `row[field] == value`.
   - If the column is string-like, matching should be **case-insensitive**.
   - If the column is numeric (e.g., `rating`), interpret `value` as a number and apply exact equality.
3. Rows where the requested field is missing (NaN) must not match; exclude them from the result.
4. Do not mutate the input DataFrame. Return a **new** DataFrame.
5. Sort the returned DataFrame by `id` ascending.

**Notes**
- You may assume valid `field` names (but robust handling is encouraged).
- For real systems you might support ranges or partial matches; here we keep equality for simplicity.
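
The requirements above can be sketched as follows. This is a minimal illustration, not the reference solution: `filter_doctors_sketch` is a hypothetical name, the `demo` rows are invented for the example and are not taken from `doctors.csv`, and the sketch assumes `field` is a valid column and that `value` parses as a number when the column is numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def filter_doctors_sketch(df: pd.DataFrame, field: str, value: str) -> pd.DataFrame:
    """Equality filter: numeric compare for numeric columns, case-insensitive otherwise."""
    column = df[field]
    if is_numeric_dtype(column):
        # NaN == number is False, so rows with a missing value drop out.
        mask = column == float(value)
    else:
        # notna() keeps missing values from matching a literal "nan" or "none".
        mask = column.notna() & (column.astype(str).str.lower() == str(value).lower())
    # Boolean indexing plus sort_values returns a new object; df is left untouched.
    return df[mask].sort_values("id")


# Hypothetical rows, for illustration only (not taken from doctors.csv).
demo = pd.DataFrame(
    {
        "id": [3, 1, 2],
        "name": ["Dr. A", "Dr. B", "Dr. C"],
        "specialty": ["Dermatology", "cardiology", None],
        "city": ["Toronto", "Ottawa", "toronto"],
        "rating": [4.6, 4.1, 4.6],
        "notes": ["", "", ""],
    }
)
print(filter_doctors_sketch(demo, "city", "TORONTO")["id"].tolist())
print(filter_doctors_sketch(demo, "rating", "4.6")["id"].tolist())
```

Exact float equality follows the requirement as stated, but it can miss values that differ only by floating-point rounding; `np.isclose` is a more tolerant alternative if that becomes a problem.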


In [15]:

# Part 4 — Code Skeleton
import pandas as pd
import numpy as np

def load_doctors_csv(path: str = "./data/doctors.csv") -> pd.DataFrame:
    """Load doctors.csv and return a DataFrame."""
    return pd.read_csv(path)

def filter_doctors(df: pd.DataFrame, field: str, value: str) -> pd.DataFrame:
    """Return a new DataFrame of rows where row[field] == value.
    - Case-insensitive for string columns
    - Numeric equality for numeric columns
    - Sorted by id ascending
    """
    pass


In [None]:

# Part 4 — Validation
import os
import pandas as pd
import numpy as np

def _find_doctors_csv():
    candidates = ["./data/doctors.csv", "./doctors.csv", "/mnt/data/doctors.csv"]
    for p in candidates:
        if os.path.exists(p):
            return p
    return None

def _safe_lower(x):
    try:
        return str(x).lower()
    except Exception:
        return x

csv_path = _find_doctors_csv()
if csv_path is None:
    print("❌ Could not find 'doctors.csv'. Place it under ./data/ or alongside this notebook.")
else:
    doctors = load_doctors_csv(csv_path)
    expected_cols = {"id","name","specialty","city","rating","notes"}
    if not expected_cols.issubset(doctors.columns):
        print("⚠️ Columns differ from expected; proceeding with available columns:", list(doctors.columns))

    try:
        derm = filter_doctors(doctors, "specialty", "dermatology")
        city_toronto = filter_doctors(doctors, "city", "toronto")
        name_lee = filter_doctors(doctors, "name", "dr. lee")
        rating_46 = filter_doctors(doctors, "rating", "4.6")

        expected_derm = doctors[doctors["specialty"].map(_safe_lower) == "dermatology"].copy().sort_values("id")
        expected_toronto = doctors[doctors["city"].map(_safe_lower) == "toronto"].copy().sort_values("id")
        expected_name = doctors[doctors["name"].map(_safe_lower) == "dr. lee"].copy().sort_values("id")
        expected_rating = doctors[np.isclose(doctors["rating"].astype(float), 4.6)].copy().sort_values("id")

        print("specialty == 'dermatology' ->")
        print(derm)
        print("\ncity == 'toronto' ->")
        print(city_toronto)
        print("\nname == 'dr. lee' ->")
        print(name_lee)
        print("\nrating == 4.6 ->")
        print(rating_46)

        score = 0
        score += int(derm.reset_index(drop=True).equals(expected_derm.reset_index(drop=True)))
        score += int(city_toronto.reset_index(drop=True).equals(expected_toronto.reset_index(drop=True)))
        score += int(name_lee.reset_index(drop=True).equals(expected_name.reset_index(drop=True)))
        score += int(rating_46.reset_index(drop=True).equals(expected_rating.reset_index(drop=True)))

        print(f"\nScore: {score}/4")
        if score == 4:
            print("✅ Part 4 validation passed.")
        else:
            print("ℹ️ Part 4 validation did not pass all checks. Review case-insensitive/string vs numeric equality and sorting.")
    except Exception as e:
        print("❌ Validation raised:", repr(e))
