In [None]:
# 02 - Data Formatting for Fine-tuning

This notebook converts your raw extracted text and any external datasets into the prompt-response format required for fine-tuning your LLM.

## Objective:
- Load extracted documents from `data/processed/extracted_raw_documents.jsonl`.
- Load any additional external datasets from `data/external/`.
- Apply a strategy (e.g., manual annotation, heuristics, or self-instruct) to create high-quality prompt-response pairs.
- Format these pairs into the `{"messages": [{"role": "user", "content": "..."}, {"role": "model", "content": "..."}]}` structure.
- Save the final fine-tuning dataset to `data/processed/fine_tuning_data.jsonl`.
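
As a rough illustration of the target structure, the sketch below builds one record and checks its shape. The helper names `make_record` and `is_valid_record` are hypothetical, and the `user`/`model` role names are taken from the format above; adjust them if your training framework expects different roles.

```python
import json

ALLOWED_ROLES = ("user", "model")  # roles from the format above; an assumption about your trainer


def make_record(prompt: str, response: str) -> dict:
    """Wrap one prompt-response pair in the messages structure."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "model", "content": response},
        ]
    }


def is_valid_record(record: dict) -> bool:
    """Check that a record has a user turn followed by a model turn, both non-empty."""
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) != 2:
        return False
    for message, expected_role in zip(messages, ALLOWED_ROLES):
        if not isinstance(message, dict):
            return False
        if message.get("role") != expected_role:
            return False
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            return False
    return True


record = make_record("What is JSON Lines?", "A text format with one JSON object per line.")
print(json.dumps(record))
print(is_valid_record(record))
```

Running a check like this over every record before saving is a cheap way to catch empty responses or swapped roles early.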

## Instructions:
1.  **Load your extracted data:** Use the `json` module to read `data/processed/extracted_raw_documents.jsonl` line by line; each line is a separate JSON object.
2.  **Load external datasets:** Use the `datasets` library to load any datasets from `data/external/`.
3.  **Implement your formatting logic:** This is the most project-specific step. You'll write Python code to generate your prompt-response pairs.
4.  **Save the dataset:** Write the output as a JSON Lines file (one JSON object per line) to `data/processed/fine_tuning_data.jsonl`.
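
Steps 1 and 4 can be sketched with the standard library alone. The helper names `read_jsonl` and `write_jsonl` are hypothetical, and the demo uses a temporary directory rather than the project paths so it can run anywhere.

```python
import json
import os
import tempfile


def read_jsonl(path: str) -> list:
    """Load a JSON Lines file, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def write_jsonl(path: str, records: list) -> None:
    """Write one JSON object per line, creating the parent directory if needed."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Round-trip demo in a temporary directory (not the real project paths).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "processed", "demo.jsonl")
    sample = [{"content": "héllo", "source_filename": "a.txt", "file_type": "text"}]
    write_jsonl(demo_path, sample)
    loaded = read_jsonl(demo_path)

print(loaded == sample)
```

For step 2, local JSON Lines files can typically be loaded with `load_dataset("json", data_files=...)` from the `datasets` library; the exact file layout under `data/external/` is up to your project.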

## Code (Example of what you'd put here):

```python
# import sys
# sys.path.append('scripts')
# from ingest_data import PROCESSED_DOCS_PATH
# import json
# from datasets import Dataset
# import os

# # Define output path
# FINE_TUNING_DATA_PATH = os.path.join(os.path.dirname(PROCESSED_DOCS_PATH), "fine_tuning_data.jsonl")

# # Load extracted raw documents
# extracted_docs = []
# try:
#     with open(PROCESSED_DOCS_PATH, 'r', encoding='utf-8') as f:
#         for line in f:
#             if line.strip():  # skip blank lines, which json.loads would reject
#                 extracted_docs.append(json.loads(line))
# except FileNotFoundError:
#     print(f"Error: {PROCESSED_DOCS_PATH} not found. Please run ingest_data.py first.")

# # --- Example: Simple formatting for a Q&A chatbot from extracted text ---
# # This is a highly simplified example. You'd need more sophisticated logic
# # to create meaningful Q&A pairs from arbitrary text.
# # For code, you might prompt: "Explain this function: [code snippet]"
# # For articles: "Summarize this paragraph: [paragraph]"

# formatted_data = []
# for doc in extracted_docs:
#     content = doc.get("content", "")
#     source = doc.get("source_filename", "unknown")
#     file_type = doc.get("file_type", "text")

#     # Simple heuristic: skip documents of 100 chars or fewer, take the first 200 chars as context, ask a generic question
#     if len(content) > 100:
#         context_snippet = content[:200] + "..."
#         # Example for text/pdf
#         if file_type in ["text", "pdf"]:
#             user_prompt = f"Based on this snippet from '{source}': '{context_snippet}', what is the main idea?"
#             model_response = f"The main idea of this snippet from '{source}' is about [LLM-generated main idea]."
#             # In a real scenario, you'd use an LLM (self-instruct) or manual annotation
#             # to generate the 'model_response' based on the 'context_snippet'.
#             formatted_data.append({
#                 "messages": [
#                     {"role": "user", "content": user_prompt},
#                     {"role": "model", "content": model_response}
#                 ]
#             })
#         # Example for code
#         elif file_type == "code":
#             user_prompt = f"Explain the purpose of the code snippet from '{source}':\n```python\n{content[:200]}...\n```"
#             model_response = f"This code snippet from '{source}' appears to [LLM-generated explanation]."
#             formatted_data.append({
#                 "messages": [
#                     {"role": "user", "content": user_prompt},
#                     {"role": "model", "content": model_response}
#                 ]
#             })

# # Save the formatted data
# os.makedirs(os.path.dirname(FINE_TUNING_DATA_PATH), exist_ok=True)
# with open(FINE_TUNING_DATA_PATH, 'w', encoding='utf-8') as f:
#     for item in formatted_data:
#         f.write(json.dumps(item, ensure_ascii=False) + "\n")  # keep non-ASCII text readable in the UTF-8 file

# print(f"Created {len(formatted_data)} fine-tuning examples and saved to {FINE_TUNING_DATA_PATH}")

# # Optional: Load into Hugging Face Dataset format for inspection
# # hf_dataset = Dataset.from_list(formatted_data)
# # print("\nSample of Hugging Face Dataset:")
# # print(hf_dataset[0])
```
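
The commented example writes a bracketed placeholder as the model response. One way to structure the real thing is to pass the response generator in as a function, so manual annotations or an LLM call can be swapped in later. In this sketch, `build_examples`, `split_into_chunks`, and `placeholder_generator` are hypothetical names, the field names `content` and `source_filename` follow the example above, and paragraph splitting stands in for whatever chunking suits your data.

```python
from typing import Callable, Dict, List, Optional


def split_into_chunks(text: str, min_chars: int = 100) -> List[str]:
    """Split text on blank lines and keep paragraphs of at least min_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n")]
    return [p for p in paragraphs if len(p) >= min_chars]


def build_examples(
    docs: List[Dict[str, str]],
    generate_response: Callable[[str], Optional[str]],
) -> List[dict]:
    """Create one example per chunk; skip chunks with no usable response."""
    examples = []
    for doc in docs:
        source = doc.get("source_filename", "unknown")
        for chunk in split_into_chunks(doc.get("content", "")):
            response = generate_response(chunk)
            if not response:
                continue  # never write empty answers to the dataset
            prompt = f"Summarize this paragraph from '{source}':\n{chunk}"
            examples.append(
                {
                    "messages": [
                        {"role": "user", "content": prompt},
                        {"role": "model", "content": response},
                    ]
                }
            )
    return examples


def placeholder_generator(chunk: str) -> Optional[str]:
    """Stand-in for an annotator or LLM call: returns the text before the first '. '."""
    first_sentence = chunk.split(". ")[0].strip()
    return first_sentence or None


docs = [
    {
        "source_filename": "notes.txt",
        "content": "Short intro.\n\n"
        + "JSON Lines stores one JSON object per line. It is easy to stream and append to, "
        + "which makes it a common choice for fine-tuning datasets.",
    }
]
examples = build_examples(docs, placeholder_generator)
print(len(examples))
print(examples[0]["messages"][1]["content"])
```

Keeping the generator separate from the formatting loop means the same loop can serve heuristic, manually annotated, and self-instruct data without changes to the output format.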