# Llama – Finetuning/Inference Notebook (Annotated)

This notebook has been structured for clarity when training/evaluating Llama-family models. Each major step is introduced with a short explanation so readers can follow the workflow and reproduce results.

**Included:**
- Section headers automatically injected before relevant code cells
- Lightweight explanations for each phase
- A simple footer with suggested next steps

---


<details>
<summary><strong>Table of Contents</strong></summary>

1. Setup & Imports  
2. Configuration & Constants  
3. Environment / GPU Check  
4. Data Loading  
5. Exploratory Data Analysis (EDA)  
6. Cleaning & Preprocessing  
7. Feature Engineering / Tokenization  
8. Model Setup  
9. Training Loop / Trainer  
10. Evaluation & Metrics  
11. Inference / Generation  
12. Explainability & Safety  
13. Persistence & Export  

</details>


# Llama 3.2

### Setup & Imports
Import core libraries for model training and utilities. Keep imports organized and remove unused ones.


In [21]:
from huggingface_hub import login, whoami, hf_hub_download

HF_API_KEY = ""
login(token=HF_API_KEY)
whoami()

{'type': 'user',
 'id': '6891761002359d4e3841311f',
 'name': 'bal141',
 'fullname': 'Deepinder',
 'isPro': False,
 'avatarUrl': '/avatars/6f679a831597edacbec257d48e9c6ef1.svg',
 'orgs': [],
 'auth': {'type': 'access_token',
  'accessToken': {'displayName': 'Ntk',
   'role': 'fineGrained',
   'createdAt': '2025-08-05T03:48:28.378Z',
   'fineGrained': {'canReadGatedRepos': True,
    'global': [],
    'scoped': [{'entity': {'_id': '6891761002359d4e3841311f',
       'type': 'user',
       'name': 'bal141'},
      'permissions': ['repo.content.read']}]}}}}

In [22]:
import torch
torch.cuda.is_available()

True

In [59]:
import os, re, json, math
from collections import defaultdict

import numpy as np
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

In [58]:
model_id = "meta-llama/Llama-3.2-3B-Instruct"

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map={"": 0}
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tok,
    return_full_text=False
)

Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.18s/it]


### Data Loading
Load datasets/artifacts and validate shapes/schemas. Print sample rows to sanity-check text fields.
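
A schema check along these lines can catch a misnamed column before any parsing runs. This is a minimal sketch: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, and the column list is inferred from the fields the loading cell reads (`id`, `job_title`, `location`, `connection`).

```python
# Hypothetical helper: report which expected columns are absent.
REQUIRED_COLUMNS = ("id", "job_title", "location", "connection")

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are not in `columns`, in order."""
    present = set(columns)
    return [name for name in required if name not in present]

# With a DataFrame this would be called as missing_columns(df.columns).
print(missing_columns(["id", "job_title", "location"]))
```

An empty list means every expected column is present; anything else is worth resolving before the parsing helpers run.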


In [66]:
csv_path = r'C:\Users\dbal\anaconda_projects\PotentialTalentsNLP\potentialtalents.csv'
df = pd.read_csv(csv_path)

# Parse connections like "500+" → 500 (handles missing/NaN/ints/strings)
def parse_connections(x):
    if pd.isna(x): 
        return 0
    s = str(x).strip()
    if s.endswith("+"):
        s = s[:-1]
    s = re.sub(r"\D", "", s)  # keep digits only
    return int(s) if s else 0

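# `df.get("connection", 0)` returns the plain int 0 when the column is missing,
# and an int has no `.apply`; fall back to a zero column explicitly instead.
if "connection" in df.columns:
    df["connections_num"] = df["connection"].apply(parse_connections)
else:
    df["connections_num"] = 0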

# Clean job titles
honor_pat = re.compile(r"\b(?:cum laude|magna cum|summa cum|dean['’]?s list|honou?rs|with honou?rs)\b", re.I)
df["job_title_clean"] = (
    df["job_title"].astype(str)
      .str.replace(honor_pat, "", regex=True)
      .str.replace(r"\s+", " ", regex=True)
      .str.strip()
)

# Convert rows → candidate dicts
def to_candidates(rows: pd.DataFrame):
    return [
        {
            "id": int(r["id"]),
            "title": str(r.get("job_title_clean") or r.get("job_title") or ""),
            "location": str(r.get("location", "")),
            "connections": int(r.get("connections_num", 0)),
        }
        for _, r in rows.iterrows()
    ]

cands = to_candidates(df.head(5))
#cands_preview
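
The behaviour of the digit-stripping approach is easiest to see on a few edge cases. The sketch below is a pandas-free variant written only for illustration (`parse_connections_plain` is a hypothetical name); it mirrors the logic of `parse_connections` but replaces `pd.isna` with an explicit `None`/NaN test.

```python
import math
import re

def parse_connections_plain(x):
    """Pandas-free variant of parse_connections, for illustration only."""
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return 0
    digits = re.sub(r"\D", "", str(x))  # keep digits only; this also drops a trailing "+"
    return int(digits) if digits else 0

for raw in ["500+", " 85 ", 44, None, float("nan"), "n/a", "1,200", 85.0]:
    print(repr(raw), "->", parse_connections_plain(raw))
```

One pitfall worth knowing: a float such as `85.0` is stringified as `"85.0"`, so digit-only stripping turns it into 850. If the column could ever be read as floats, casting numeric values with `int(x)` before the string path would avoid that.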

In [67]:
def build_prompt(cands, role="Aspiring Human Resources Specialist"):
    lines = [
        "You are a recruiting assistant.",
        f'Rank these {len(cands)} candidates for the role "{role}" by fit.',
        "Return ONLY a JSON array with exactly "
        f"{len(cands)} objects, one per candidate in the SAME ORDER.",
        'Each object must be {"id": <int>, "score": <float in [0,1]>}.',
        "No text before or after the JSON.",
        "",
        "Candidates:"
    ]
    for c in cands:
        lines.append(
            f'- id={c["id"]}, title="{c["title"]}", '
            f'location="{c["location"]}", connections={c["connections"]}'
        )
    lines.append("\nJSON:")
    return "\n".join(lines)

In [68]:
prompt = build_prompt(cands)
raw_output = pipe(
    prompt,
    max_new_tokens=256,     # bump to 384 if your arrays ever truncate
    do_sample=False,        # deterministic
    pad_token_id=pipe.tokenizer.eos_token_id  # the tokenizer loaded above is named `tok`, not `tokenizer`
)[0]["generated_text"]

print(raw_output)



 
[
  {"id":1,"score":0.2},
  {"id":2,"score":0.8},
  {"id":3,"score":0.1},
  {"id":4,"score":0.9},
  {"id":5,"score":0.1}
]
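
Because the prompt asks for one object per candidate in the same order with scores in [0, 1], the reply can be checked against those constraints before it is trusted. The following is a minimal sketch; `validate_scores` is a hypothetical helper and the sample string is shortened from the output above.

```python
import json

def validate_scores(raw_text, expected_ids):
    """Parse a JSON array of {"id", "score"} objects and check it against the prompt."""
    items = json.loads(raw_text)
    ids = [int(item["id"]) for item in items]
    if ids != list(expected_ids):
        raise ValueError(f"ids {ids} do not match expected order {list(expected_ids)}")
    for item in items:
        score = float(item["score"])
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score {score} for id {item['id']} is outside [0, 1]")
    return {int(item["id"]): float(item["score"]) for item in items}

sample = '[{"id":1,"score":0.2},{"id":2,"score":0.8},{"id":3,"score":0.1}]'
print(validate_scores(sample, expected_ids=[1, 2, 3]))
```

In the notebook the expected ids would come from the candidate list, for example `[c["id"] for c in cands]`.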


In [71]:
cands_all   = to_candidates(df)
prompt_all  = build_prompt(cands_all, role="Aspiring Human Resources Specialist")

raw = pipe(
    prompt_all,
    max_new_tokens=1792,
    do_sample=False,
    pad_token_id=pipe.tokenizer.eos_token_id,
    return_full_text=False
)[0]["generated_text"]

# Strict parse and attach
items = json.loads(raw)
scores_all = {int(it["id"]): float(it["score"]) for it in items}
df["sim_llama"] = df["id"].map(scores_all)

df.sort_values("sim_llama", ascending=False)[["id","job_title_clean","sim_llama"]].head(10)

| | id | job_title_clean | sim_llama |
|---|---|---|---|
| 0 | 1 | 2019 C.T. Bauer College of Business Graduate (... | 0.2 |
| 1 | 2 | Native English Teacher at EPIK (English Progra... | 0.2 |
| 76 | 77 | Human Resources\| Conflict Management\| Policies... | 0.2 |
| 75 | 76 | Aspiring Human Resources Professional \| Passio... | 0.2 |
| 74 | 75 | Nortia Staffing is seeking Human Resources, Pa... | 0.2 |
| 73 | 74 | Human Resources Professional | 0.2 |
| 72 | 73 | Aspiring Human Resources Manager, seeking inte... | 0.2 |
| 71 | 72 | Business Management Major and Aspiring Human R... | 0.2 |
| 70 | 71 | Human Resources Generalist at ScottMadden, Inc. | 0.2 |
| 69 | 70 | Retired Army National Guard Recruiter, office ... | 0.2 |
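
Every row in this preview has the same score (0.2), so sorting by `sim_llama` carries no ranking signal for these rows. A quick check such as the following sketch can flag that situation before the scores are used downstream; `score_spread` is a hypothetical helper, not part of the notebook.

```python
def score_spread(scores):
    """Return (number of distinct scores, max - min) for an iterable of floats."""
    values = [float(s) for s in scores]
    if not values:
        return 0, 0.0
    return len(set(values)), max(values) - min(values)

distinct, spread = score_spread([0.2] * 10)
print(distinct, spread)  # a single distinct value means the ordering is arbitrary
```

In the notebook this could be applied to `df["sim_llama"].dropna()`.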


In [9]:
def build_prompt(cands, role, few_shot=True):
    lines = ["You are a recruiting assistant."]
    if few_shot:
        # tiny example to calibrate 0–1 scores
        lines += [
            "Example:",
            'Candidates:\n- id=1, title="Aspiring HR Specialist", location="Houston", connections=300',
            'JSON: [{"id": 1, "score": 0.92}]',
            ""
        ]
    lines += [
        f'Rank these {len(cands)} candidates for the role "{role}" by fit.',
        f"Return ONLY a JSON array with exactly {len(cands)} objects, one per candidate in the SAME ORDER.",
        'Each object must be {"id": <int>, "score": <float in [0,1]>}.',
        "No text before or after the JSON.",
        "",
        "Candidates:"
    ]
    for c in cands:
        lines.append(
            f'- id={c["id"]}, title="{c["title"]}", location="{c["location"]}", connections={c["connections"]}'
        )
    lines.append("\nJSON:")
    return "\n".join(lines)

def _extract_balanced_json_array(text: str):
    """Parse the first bracket-balanced JSON array found in `text`."""
    start = text.find("[")
    if start == -1:
        raise ValueError("No '[' found")
    depth, end = 0, None
    for i, ch in enumerate(text[start:], start=start):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth == 0:
                end = i + 1
                break
    if end is None:
        raise ValueError("Unbalanced JSON array (truncated).")
    return json.loads(text[start:end])

def _extract_pairs_from_partial(text: str):
    """Fallback: pull every complete id/score pair out of truncated output."""
    pairs = re.findall(r'\{"id"\s*:\s*(\d+)\s*,\s*"score"\s*:\s*([0-9]*\.?[0-9]+)', text)
    if not pairs:
        raise ValueError("No parsable id/score pairs in partial output.")
    return [{"id": int(i), "score": float(s)} for i, s in pairs]

def safe_parse_scores(text: str):
    try:
        return _extract_balanced_json_array(text)
    except Exception:
        return _extract_pairs_from_partial(text)

def llama_score_batch(pipe, cands, role="Aspiring Human Resources Specialist", few_shot=True):
    """
    cands: list[dict] with keys: id, title, location, connections
    returns: dict {id: score_float}
    """
    prompt = build_prompt(cands, role, few_shot=few_shot)
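    # Set the generation budget explicitly rather than relying on the pipeline default
    out = pipe(prompt, max_new_tokens=256, do_sample=False)[0]["generated_text"].strip()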
    data = safe_parse_scores(out)  # robust to truncation
    return {int(d["id"]): float(d["score"]) for d in data}
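
To make the two-stage parsing concrete, here is a compact, self-contained variant run on a complete and a truncated reply. It is a sketch rather than a copy of the helpers above: `parse_scores` is a hypothetical name, it looks for the last `]` instead of tracking bracket depth, and it returns an empty list instead of raising when nothing can be recovered.

```python
import json
import re

PAIR_RE = re.compile(r'\{"id"\s*:\s*(\d+)\s*,\s*"score"\s*:\s*([0-9]*\.?[0-9]+)')

def parse_scores(text):
    """Try a strict JSON parse first, then fall back to regex-extracted pairs."""
    try:
        start, end = text.index("["), text.rindex("]") + 1
        return json.loads(text[start:end])
    except ValueError:
        # json.JSONDecodeError subclasses ValueError, so malformed JSON lands here too
        return [{"id": int(i), "score": float(s)} for i, s in PAIR_RE.findall(text)]

complete = '[{"id": 1, "score": 0.9}]'
truncated = '[{"id": 1, "score": 0.9}, {"id": 2, "score": 0.4}, {"id": 3, "sco'
print(parse_scores(complete))
print(parse_scores(truncated))
```

The truncated reply still yields the two complete pairs; the unfinished third object is dropped. Note that the regex expects `{"id"` with no space after the brace, so differently formatted output would need a looser pattern.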

In [13]:
from collections import defaultdict

def windows_for_batches(base_df, batch_size=8, overlap=4):
    i, n = 0, len(base_df)
    step = batch_size - overlap if batch_size > overlap else batch_size
    while i < n:
        yield i, base_df.iloc[i:i+batch_size]
        i += step

def rank_with_llama(
    df,
    pipe,
    role="Aspiring Human Resources Specialist",
    seed_col="seed_hr_score",
    batch_size=8,
    overlap=4,
    few_shot=True,
    blend_with_connections=True,
    pipe_batch_size=4,   # how many prompts to send to pipeline at once
):
    # 1) Seed order & windows
    base = (
        df.sort_values(seed_col, ascending=False)
          .loc[:, ["id","job_title_clean","location","connections_num","connections_norm"]]
          .rename(columns={"job_title_clean":"title","connections_num":"connections"})
          .reset_index(drop=True)
    )
    windows = list(windows_for_batches(base, batch_size, overlap))

    # 2) Build all prompts
    prompts = [build_prompt(batch.to_dict(orient="records"), role, few_shot=few_shot)
               for _, batch in windows]

    # 3) Run prompts in mini-batches and FLATTEN results correctly
    gen_texts = []
    for i in range(0, len(prompts), pipe_batch_size):
        chunk = prompts[i:i+pipe_batch_size]
        # When given a list of prompts, pipeline returns a list of lists of dicts
        outs = pipe(chunk, batch_size=len(chunk), truncation=True, padding=True)

        # outs shape: [ [ { "generated_text": ... } ], [ { ... } ], ... ]
        for res in outs:
            if isinstance(res, list):
                gen_texts.append(res[0]["generated_text"])
            elif isinstance(res, dict):
                # (rare) single-dict case
                gen_texts.append(res.get("generated_text", ""))
            else:
                raise TypeError(f"Unexpected pipeline output type: {type(res)}")

    # 4) Aggregate scores over overlapping windows
    scores_sum, counts = defaultdict(float), defaultdict(int)
    for (_, batch), text in zip(windows, gen_texts):
        try:
            items = safe_parse_scores(text)   # returns list of {"id":..., "score":...}
        except Exception:
            items = []  # skip unparsable outputs but keep going
        for item in items:
            # guard in case parser returned a list accidentally
            if isinstance(item, dict) and "id" in item and "score" in item:
                scores_sum[int(item["id"])] += float(item["score"])
                counts[int(item["id"])]     += 1

    agg = {k: scores_sum[k]/counts[k] for k in scores_sum if counts[k]}

    # 5) Attach to df and (optionally) blend with connections
    out = df.copy()
    out["sim_llama"] = out["id"].map(agg)
    if blend_with_connections and "connections_norm" in out:
        out["score_blend"] = 0.8*out["sim_llama"].fillna(0) + 0.2*out["connections_norm"].fillna(0)

    return out.sort_values("sim_llama", ascending=False)

ranked = rank_with_llama(df, pipe, batch_size=8, overlap=4, pipe_batch_size=4, few_shot=True)
ranked[["id","job_title_clean","location","connections_num","sim_llama"]].head(10)





| | id | job_title_clean | location | connections_num | sim_llama |
|---|---|---|---|---|---|
| 5 | 6 | Aspiring Human Resources Specialist | Greater New York City Area | 1 | 0.92 |
| 23 | 24 | Aspiring Human Resources Specialist | Greater New York City Area | 1 | 0.92 |
| 59 | 60 | Aspiring Human Resources Specialist | Greater New York City Area | 1 | 0.92 |
| 35 | 36 | Aspiring Human Resources Specialist | Greater New York City Area | 1 | 0.92 |
| 0 | 1 | 2019 C.T. Bauer College of Business Graduate (... | Houston, Texas | 85 | 0.89 |
| 13 | 14 | 2019 C.T. Bauer College of Business Graduate (... | Houston, Texas | 85 | 0.885 |
| 30 | 31 | 2019 C.T. Bauer College of Business Graduate (... | Houston, Texas | 85 | 0.885 |
| 46 | 47 | People Development Coordinator at Ryan | Denton, Texas | 500 | 0.885 |
| 56 | 57 | 2019 C.T. Bauer College of Business Graduate (... | Houston, Texas | 85 | 0.885 |
| 58 | 59 | People Development Coordinator at Ryan | Denton, Texas | 500 | 0.885 |
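
The overlapping-window scheme can be illustrated without a model. The sketch below reproduces the stepping rule of `windows_for_batches` on plain indices and the per-id averaging used in step 4; `window_slices` and `average_scores` are hypothetical names, and the scores are made up.

```python
from collections import defaultdict

def window_slices(n, batch_size=8, overlap=4):
    """Yield (start, stop) index pairs using the same stepping rule as windows_for_batches."""
    step = batch_size - overlap if batch_size > overlap else batch_size
    for start in range(0, n, step):
        yield start, min(start + batch_size, n)

def average_scores(per_window_scores):
    """Average each id's score over all windows in which it appeared."""
    total, count = defaultdict(float), defaultdict(int)
    for scores in per_window_scores:
        for cid, score in scores.items():
            total[cid] += score
            count[cid] += 1
    return {cid: total[cid] / count[cid] for cid in total}

print(list(window_slices(10)))  # [(0, 8), (4, 10), (8, 10)]
print(average_scores([{1: 0.9, 2: 0.5}, {2: 0.7, 3: 0.1}]))
```

With 10 rows, a batch size of 8 and an overlap of 4, rows 0 to 3 are scored once and rows 4 to 9 twice, so the averaged score is based on an uneven number of votes per candidate.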


In [14]:
print(HR_KWS)

{'specialist', 'acquisition', 'recruiter', 'generalist', 'hr', 'talent', 'coordinator', 'human', 'resources'}


In [15]:
def spearman_no_scipy(a, b):
    a = pd.Series(a).rank(method="average").to_numpy()
    b = pd.Series(b).rank(method="average").to_numpy()
    a = (a - a.mean()) / a.std(ddof=0)
    b = (b - b.mean()) / b.std(ddof=0)
    return float(np.mean(a*b))

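def ndcg_at_k(y_true, y_score, k=10):
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    k = min(k, len(y_true))                      # guard against fewer than k items
    order = np.argsort(-y_score)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = np.sum((2.0 ** y_true[order][:k] - 1.0) * discounts)
    # The ideal ranking must use the same exponential gain as the DCG term
    ideal_gains = 2.0 ** np.sort(y_true)[::-1][:k] - 1.0
    ideal = np.sum(ideal_gains * discounts)
    return float(dcg / ideal) if ideal > 0 else 0.0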

# Compare to your lexical seed (or swap to your earlier GloVe proxy if present)
proxy_col = "seed_hr_score"
mask = ranked["sim_llama"].notna()
rho  = spearman_no_scipy(ranked.loc[mask,"sim_llama"], ranked.loc[mask,proxy_col])
ndcg = ndcg_at_k(ranked.loc[mask,proxy_col], ranked.loc[mask,"sim_llama"], k=10)
print(f"Spearman(Llama vs {proxy_col}): {rho:.3f}")
print(f"NDCG@10: {ndcg:.3f}")

Spearman(Llama vs seed_hr_score): 0.490
NDCG@10: 0.844
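
For reference, the two metrics can be written out as follows. This is the standard rank-correlation form of Spearman's coefficient (Pearson correlation of average ranks, using the population standard deviation as in `std(ddof=0)`) and the standard exponential-gain form of NDCG, in which the ideal DCG applies the same gain to the relevance-sorted list.

```latex
\rho = \frac{1}{n}\sum_{i=1}^{n}\frac{(r_i-\bar r)\,(s_i-\bar s)}{\sigma_r\,\sigma_s},
\qquad
\mathrm{DCG@}k = \sum_{i=1}^{k}\frac{2^{\mathrm{rel}_{(i)}}-1}{\log_2(i+1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```

Here $r_i$ and $s_i$ are the average ranks of the two score columns, $\mathrm{rel}_{(i)}$ is the relevance of the item placed at position $i$ by the predicted score, and $\mathrm{IDCG@}k$ is the DCG of the ordering sorted by relevance. In this notebook the relevance is the proxy column `seed_hr_score`, so both numbers measure agreement with the lexical seed rather than with ground-truth labels.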


# Fine-Tuning

### Model Setup
Load base model (and adapters if using PEFT/LoRA). Note dtype (fp16/bf16), quantization, and gradient checkpointing.


In [1]:
!pip install transformers sentence-transformers
!pip install transformers torch
!pip install -U bitsandbytes
!pip install datasets
!pip install accelerate
!pip install peft
!pip install -U trl

Collecting sentence-transformers
  Downloading sentence_transformers-5.1.1-py3-none-any.whl.metadata (16 kB)
Downloading sentence_transformers-5.1.1-py3-none-any.whl (486 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-5.1.1
Collecting bitsandbytes
  Downloading bitsandbytes-0.48.1-py3-none-win_amd64.whl.metadata (10 kB)
Downloading bitsandbytes-0.48.1-py3-none-win_amd64.whl (59.5 MB)

### Setup & Imports
Import core libraries for model training and utilities. Keep imports organized and remove unused ones.


In [2]:
# import libraries
import pandas as pd
import numpy as np
import warnings
import logging
import random
import requests
import sys
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM, LlamaTokenizer, set_seed, TrainingArguments
from huggingface_hub import notebook_login
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from trl import SFTTrainer
from datasets import Dataset
warnings.filterwarnings('ignore', category=UserWarning)

print(torch.__version__)
#tf.__version__

  from .autonotebook import tqdm as notebook_tqdm
W1027 22:31:33.324000 35268 site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.



2.8.0+cu126


### Data Loading
Load datasets/artifacts and validate shapes/schemas. Print sample rows to sanity-check text fields.


In [3]:
# Reading the CSV
file_path = r'C:\Users\dbal\anaconda_projects\PotentialTalentsNLP\ExtendedPotentialTalents.csv'
df = pd.read_csv(file_path)
df.head()

| | id | title | location | screening_score |
|---|---|---|---|---|
| 0 | 1.0 | innovative and driven professional seeking a r... | United States | 100.0 |
| 1 | 2.0 | ms applied data science student usc research a... | United States | 100.0 |
| 2 | 3.0 | computer science student seeking full-time sof... | United States | 100.0 |
| 3 | 4.0 | microsoft certified power bi data analyst mba ... | United States | 100.0 |
| 4 | 5.0 | graduate research assistant at uab masters in ... | United States | 100.0 |


In [3]:
job_titles = df["title"].tolist()
job_titles

['innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.',
 'ms applied data science student usc research assistant usc former data analytics intern at dr reddys laboratories former data science intern quadratyx actively seeking full time roles in summer 2025',
 'computer science student seeking full-time software engineerdeveloper positions ai sql data visualization toolspython ssrs',
 'microsoft certified power bi data analyst mba business analytics unt business intelligence engineer data scientist data engineer business analytics predictive analytics statistical analysis ex-ericsson',
 'graduate research assistant at uab masters in data science student at uab ex jio',
 'student at kennesaw state university',
 'data analyst business analyst python snowflake sql machine learning power bi tableau equipped with analytics driven by insights and passionate about impactful solutions.',
 'graduate research aide student at ariz

In [4]:
job_titles_short = df["title"].head(10).tolist()
job_titles_short

['innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.',
 'ms applied data science student usc research assistant usc former data analytics intern at dr reddys laboratories former data science intern quadratyx actively seeking full time roles in summer 2025',
 'computer science student seeking full-time software engineerdeveloper positions ai sql data visualization toolspython ssrs',
 'microsoft certified power bi data analyst mba business analytics unt business intelligence engineer data scientist data engineer business analytics predictive analytics statistical analysis ex-ericsson',
 'graduate research assistant at uab masters in data science student at uab ex jio',
 'student at kennesaw state university',
 'data analyst business analyst python snowflake sql machine learning power bi tableau equipped with analytics driven by insights and passionate about impactful solutions.',
 'graduate research aide student at ariz

In [5]:
job_ids_short = df["id"].head(10).tolist()
job_ids_short

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

In [6]:
target_title = "Data Scientist"

In [7]:
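from transformers import BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Passing `load_in_4bit=True` directly is deprecated; wrap it in a BitsAndBytesConfig.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)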

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# QLoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

data = [
    {
        "prompt": f"""
Return a list of the top 5 job candidates with full job title and job id from a job titles list, ranked by their similarity to the search term in descending order. Only show the answer. Do not reason or explain.
**Search term**
{target_title}

**job titles**
{job_titles_short}

**job ids**
{job_ids_short}


Show answer in following format:
Rank Job ID   Job Title
1 - 1: Aspiring Human Resources Specialist
2 - ...
3 - ...
...

""",}
]
# Convert to Hugging Face Dataset format
dataset = Dataset.from_list([
    {"text": f"{item['prompt']}"} for item in data
])

# Training arguments (short GPU smoke test: 30 steps, batch size 1, fp16 mixed precision)
training_args = TrainingArguments(
    output_dir="./llama-finetune",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    max_steps=30,  # Keep small for testing
    learning_rate=5e-5,
    logging_steps=5,
    save_steps=15,
    save_total_limit=2,
    fp16=True,
    bf16=False,
    report_to="none",
    no_cuda=False
)

# Fine-tuning trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args
)

trainer.train()

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.96s/it]
Adding EOS to train dataset: 100%|██████████| 1/1 [00:00<00:00, 500.10 examples/s]
Tokenizing train dataset: 100%|██████████| 1/1 [00:00<00:00, 249.97 examples/s]
Truncating train dataset: 100%|██████████| 1/1 [00:00<00:00, 500.45 examples/s]
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 128009}.


| Step | Training Loss |
|---|---|
| 5 | 3.4285 |
| 10 | 3.3522 |
| 15 | 3.2789 |
| 20 | 3.2156 |
| 25 | 3.1685 |
| 30 | 3.1414 |


TrainOutput(global_step=30, training_loss=3.2641965548197427, metrics={'train_runtime': 13.8827, 'train_samples_per_second': 2.161, 'train_steps_per_second': 2.161, 'total_flos': 178378299740160.0, 'train_loss': 3.2641965548197427, 'epoch': 30.0})
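
The reported `epoch: 30.0` follows from the run's settings: with one training example, batch size 1 and no gradient accumulation, each optimizer step is a full pass over the dataset. The sketch below reproduces that arithmetic and converts the logged losses to perplexity (the exponential of the mean cross-entropy loss); the helper name is illustrative.

```python
import math

def epochs_completed(max_steps, per_device_batch, grad_accum, num_examples, num_devices=1):
    """Epochs = examples consumed / dataset size."""
    examples_seen = max_steps * per_device_batch * grad_accum * num_devices
    return examples_seen / num_examples

# Losses copied from the training log above
logged_losses = {5: 3.4285, 10: 3.3522, 15: 3.2789, 20: 3.2156, 25: 3.1685, 30: 3.1414}
perplexities = {step: math.exp(loss) for step, loss in logged_losses.items()}

print(epochs_completed(max_steps=30, per_device_batch=1, grad_accum=1, num_examples=1))
for step, ppl in perplexities.items():
    print(step, round(ppl, 1))
```

Because the loss is measured on the single training example, the decline reflects the model fitting that one prompt rather than any generalization; a held-out set would be needed to judge the latter.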

In [1]:
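# Run this in the same kernel session as training: `trainer` must still be defined.
# The model is PEFT-wrapped, so save_pretrained() writes the LoRA adapter weights,
# not a merged checkpoint, despite the directory name.
merged_model_dir = "./llama-finetune-merged"

# Save the fine-tuned adapter
trainer.model.save_pretrained(merged_model_dir)

# Save the tokenizer to the same directory
tokenizer.save_pretrained(merged_model_dir)

# Reload the model and tokenizer from the saved directory
# (loading an adapter directory this way relies on `peft` being installed)
merged_tokenizer = AutoTokenizer.from_pretrained(merged_model_dir)
merged_model = AutoModelForCausalLM.from_pretrained(merged_model_dir)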

NameError: name 'trainer' is not defined
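
Calling `save_pretrained` on a PEFT-wrapped model stores the adapter weights rather than a merged checkpoint. If a standalone merged model is wanted, one common pattern is to reload the base model unquantized, attach the adapter, and fold it in with `merge_and_unload`. The function below is a sketch under those assumptions: `merge_lora_adapter` and the example paths are hypothetical, and it has not been run against this notebook's artifacts.

```python
def merge_lora_adapter(base_model_id, adapter_dir, output_dir):
    """Load the base model unquantized, apply the saved adapter, merge, and save."""
    # Heavy imports are kept inside the function so that defining it has no side effects.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights
    merged.save_pretrained(output_dir)
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)
    return output_dir

# Example call (paths are hypothetical):
# merge_lora_adapter("meta-llama/Llama-3.2-3B-Instruct",
#                    "./llama-finetune-adapter", "./llama-finetune-merged")
```

Merging into a bf16 copy of the base model avoids merging into 4-bit weights, at the cost of loading the full-precision model once.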

In [21]:
#seed = random.randint(1000,9999)
seed = 7308
set_seed(seed)
print(seed)
#good seeds: 7308

# Create a combined string of job ID and job title pairs
job_pairs = "\n".join([f"{job_id}: {job_title}" for job_id, job_title in zip(job_ids_short, job_titles_short)])

# Build a text-generation pipeline from the reloaded model and tokenizer objects
generator = pipeline('text-generation', model=merged_model, tokenizer=merged_tokenizer)


prompt = f"""
Return a list of the top 5 job candidates with full unmodified job title and matching job id from a job titles list ranked by their similirality to the search term in desecnding order. Only show the answer. Do not reason or explain.
**Search term**
{target_title}

**Job candidates (ID: Title)**
{job_pairs}


Show answer in following format:
Rank Job ID   Job Title
1 - 1: Aspiring Human Resources Specialist
2 - ...
3 - ...
...
Top 5 are:
"""

output = generator(prompt, max_new_tokens=200, num_return_sequences=1)
print(output[0]['generated_text'])

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


7308

Return a list of the top 5 job candidates with full unmodified job title and matching job id from a job titles list ranked by their similirality to the search term in desecnding order. Only show the answer. Do not reason or explain.
**Search term**
Data Scientist

**Job candidates (ID: Title)**
1.0: innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.
2.0: ms applied data science student usc research assistant usc former data analytics intern at dr reddys laboratories former data science intern quadratyx actively seeking full time roles in summer 2025
3.0: computer science student seeking full-time software engineerdeveloper positions ai sql data visualization toolspython ssrs
4.0: microsoft certified power bi data analyst mba business analytics unt business intelligence engineer data scientist data engineer business analytics predictive analytics statistical analysis ex-ericsson
5.0: graduate research assistant a
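
The output starts by echoing the prompt because the text-generation pipeline returns the prompt plus the completion unless `return_full_text=False` is passed. Either way, the ranked lines still need to be pulled out of free text. The sketch below parses lines in the requested `rank - id: title` format; `parse_ranking` is a hypothetical helper and the demo string is made up, since the actual completion is cut off above.

```python
import re

# Matches lines such as "1 - 4: some job title" or "1 - 4.0: some job title".
RANK_LINE = re.compile(r"^\s*(\d+)\s*-\s*(\d+)(?:\.0)?\s*:\s*(.+?)\s*$")

def parse_ranking(generated_text, marker="Top 5 are:"):
    """Return [(rank, job_id, title), ...] from the text after the last `marker`."""
    # If the marker is absent, rpartition leaves the whole text in `answer`.
    _, _, answer = generated_text.rpartition(marker)
    rows = []
    for line in answer.splitlines():
        m = RANK_LINE.match(line)
        if m:
            rows.append((int(m.group(1)), int(m.group(2)), m.group(3)))
    return rows

demo = (
    "prompt text...\n1 - 1: Aspiring Human Resources Specialist\nTop 5 are:\n"
    "1 - 4.0: data scientist at acme\n2 - 2: ms applied data science student\n"
)
print(parse_ranking(demo))
```

Splitting on the last `Top 5 are:` marker keeps the format example inside the prompt from being mistaken for a result, and the optional `.0` handles ids that were read from the CSV as floats.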

---

## Next Steps

- Add an **Experiment Card** cell (model, tokenizer, data snapshot, hyperparams, seed, hardware).
- Log training with **wandb/MLflow** and save evaluation tables/plots for reports.
- Provide a robust **inference demo** (batch & streaming) and expected I/O schema.
- If applicable, include **safety checks** on generations and mitigation strategies.
