### Installing Dependencies

The following cell uses `pip` to install the dependencies listed in `requirements.txt`, so that the packages the project needs are available.

In [None]:
!pip install -r requirements.txt
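
After installation, it can help to confirm that the modules imported by the later cells are actually available. The sketch below assumes that `requirements.txt` provides the `fitz` (PyMuPDF) and `openai` modules; this is inferred from the imports used later in the notebook, not from the file itself. `find_missing_modules` is a hypothetical helper name.

```python
import importlib.util

def find_missing_modules(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Assumed module list, taken from the imports used later in this notebook.
missing = find_missing_modules(["fitz", "openai"])
if missing:
    print(f"Missing modules: {', '.join(missing)}")
else:
    print("All required modules are importable.")
```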

### Code Description

The following cell imports the libraries used in this notebook and configures the Azure OpenAI client. PyMuPDF (`fitz`) is used later to extract text from PDFs.

1. **Importing Libraries**:
   - `fitz`: PyMuPDF library for extracting text from PDF files.
   - `AzureOpenAI`: Client class from the `openai` package for interacting with Azure OpenAI services.
   - `json` and `re`: Standard Python libraries for handling JSON data and regular expressions, respectively.

2. **Azure OpenAI Configuration**:
   - `AZURE_OPENAI_API_KEY`: Your Azure OpenAI API key.
   - `AZURE_OPENAI_ENDPOINT`: Your Azure OpenAI endpoint URL.
   - `DEPLOYMENT_NAME`: The name of your Azure OpenAI deployment.
   - `AZURE_OPENAI_VERSION`: The version of the Azure OpenAI API you are using.

3. **Initializing the Azure OpenAI Client**:
   - The `AzureOpenAI` client is initialized with the provided API key, API version, and endpoint URL.

In [None]:
import fitz  # PyMuPDF for extracting text from PDF
from openai import AzureOpenAI
import json
import re

# Azure OpenAI Configuration
AZURE_OPENAI_API_KEY = "your_azure_openai_api_key"
AZURE_OPENAI_ENDPOINT = "your_azure_openai_endpoint"
DEPLOYMENT_NAME = "your_deployment_name" 
AZURE_OPENAI_VERSION = "your_openai_version"

client = AzureOpenAI(
    api_key=AZURE_OPENAI_API_KEY,
    api_version=AZURE_OPENAI_VERSION,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
)
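
Hardcoding credentials in a notebook makes them easy to leak, so one option is to read the four settings from environment variables instead. The following is a minimal sketch under that assumption: the environment variable names simply reuse the notebook's variable names, and `load_azure_settings` is a hypothetical helper, not part of the `openai` package.

```python
import os

# Assumed environment variable names, mirroring the notebook's variables.
REQUIRED_SETTINGS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "DEPLOYMENT_NAME",
    "AZURE_OPENAI_VERSION",
)

def load_azure_settings(env=None):
    """Read the required settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing settings: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_SETTINGS}

# Demonstration with a stand-in mapping; the values are placeholders.
example_env = {
    "AZURE_OPENAI_API_KEY": "your_azure_openai_api_key",
    "AZURE_OPENAI_ENDPOINT": "your_azure_openai_endpoint",
    "DEPLOYMENT_NAME": "your_deployment_name",
    "AZURE_OPENAI_VERSION": "your_openai_version",
}
settings = load_azure_settings(example_env)
print(sorted(settings))
```

The returned values can then be passed to `AzureOpenAI` through the same `api_key`, `api_version`, and `azure_endpoint` arguments used in the cell above.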

### Extracting Text from a PDF and Generating Synthetic Data

The following cell defines helper functions that extract text from a PDF file, strip Markdown code fences from model responses, generate synthetic data with Azure OpenAI, and save the result to a JSONL file.

In [None]:
def extract_text_from_pdf(pdf_path):
    """Extract text from a given PDF file."""
    # Open the document in a context manager so the file is closed afterwards.
    with fitz.open(pdf_path) as doc:
        pages = [page.get_text("text") for page in doc]
    return "\n".join(pages).strip()

def clean_json_response(response_text):
    """Strip a surrounding Markdown code fence from a model response."""
    response_text = response_text.strip()
    response_text = re.sub(r"^```(?:jsonl|json)?\s*", "", response_text)  # Remove leading ```json or ```jsonl
    response_text = re.sub(r"\s*```$", "", response_text)  # Remove trailing ```
    return response_text.strip()

def generate_synthetic_data(text, num_samples=2):
    """Generate synthetic dataset using Azure OpenAI."""
    synthetic_data = []

    for _ in range(num_samples):
        prompt = (
        "Generate a structured JSONL dataset for model evaluation where the <user prompt> is a sentence "
        "and <expected completion> is the continuation of that sentence. The format should be: "
        "{'input': [{'role': 'system', 'content': '<system message>'}, {'role': 'user', 'content': '<user prompt>'}], "
        "'output': '<expected completion>'}. Ensure the output is valid JSONL.\n\n"
        "everything should have semantic similarity like this {'input': [{'role': 'system', 'content': 'Provide a clear and concise summary of the technical content, highlighting key concepts and their relationships. Focus on the main ideas and practical implications.'}, {'role': 'user', 'content': 'Tokenization is a key step in preprocessing for natural language processing, involving the division of text into smaller components called tokens. These can be words, subwords, or characters, depending on the method chosen. Word tokenization divides text at word boundaries, while subword techniques like Byte Pair Encoding (BPE) or WordPiece can manage unknown words by breaking them into subunits. Character tokenization splits text into individual characters, useful for multiple languages and misspellings. The tokenization method chosen greatly affects model performance and its capacity to handle various languages and vocabularies.'}], 'output': 'Tokenization divides text into smaller units (tokens) for NLP applications, using word, subword (e.g., BPE), or character methods. Each has unique benefits, impacting model performance and language processing capabilities.'}  "    
        "{'input': [{'role': 'system', 'content': 'Create a comprehensive yet concise summary that captures the essential technical details and significance of the topic.'}, {'role': 'user', 'content': 'Self-attention mechanisms are vital in modern transformer models, allowing them to evaluate the relevance of different parts of an input sequence when processing each element. In self-attention, each position in a sequence learns to focus on all other positions, forming a weighted sum of their values. Attention weights are derived using queries, keys, and values, with compatibility between a query and key determining attention to each value. This mechanism enables capturing long-range dependencies and relationships within a sequence, making it effective for tasks needing context understanding and inter-part relationships.'}], 'output': 'Self-attention is crucial in transformers, enabling each sequence position to evaluate and collect data from all others. It uses queries, keys, and values for attention weights, effectively capturing long-range dependencies and contextual links.'}  "
        f"Generate a structured synthetic dataset based on this text  with semantic similarity between the <system message> and <expected completion>:\n{text}"
    )

        # Create the chat completion
        response = client.chat.completions.create(
            model=DEPLOYMENT_NAME,
            messages=[
                {"role": "system", "content": "Generate a structured JSONL dataset for model evaluation."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
            max_tokens=4096
        )

        synthetic_text = response.choices[0].message.content
        
        synthetic_text = clean_json_response(synthetic_text)
        synthetic_data.append(synthetic_text)
    
    # Join the per-sample responses into one newline-separated string.
    joined_synthetic_data = '\n'.join(synthetic_data)

    print(joined_synthetic_data)
    return joined_synthetic_data


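def save_to_jsonl(data, output_file):
    """Save synthetic data to a JSONL file."""
    with open(output_file, "w", encoding="utf-8") as f:
        # End the file with a newline, as is conventional for JSONL.
        f.write(data if data.endswith("\n") else data + "\n")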
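
The functions above pass the model's text through without checking that each line is valid JSON. Because the example records in the prompt use single quotes, the model may return Python-style dictionaries rather than strict JSON. The sketch below shows one possible way to handle this: it strips the code fence, parses each line with `json.loads`, falls back to `ast.literal_eval` for single-quoted records, and collects anything it cannot parse. `strip_code_fence` and `parse_jsonl_records` are hypothetical helper names, and the sample response is made up for illustration.

```python
import ast
import json
import re

FENCE = "`" * 3  # Three backticks, built here to keep this example readable.

def strip_code_fence(text):
    """Remove a leading json/jsonl code fence and a trailing fence, if present."""
    text = text.strip()
    text = re.sub(r"^`{3}(?:jsonl|json)?\s*", "", text)
    text = re.sub(r"\s*`{3}$", "", text)
    return text.strip()

def parse_jsonl_records(text):
    """Parse each non-empty line; return (records, rejected_lines)."""
    records, rejected = [], []
    for line in strip_code_fence(text).splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            try:
                # Fallback for Python-style dicts written with single quotes.
                record = ast.literal_eval(line)
            except (ValueError, SyntaxError, TypeError):
                rejected.append(line)
                continue
        if isinstance(record, dict):
            records.append(record)
        else:
            rejected.append(line)
    return records, rejected

# Made-up model response: one strict JSON line, one single-quoted line, one bad line.
sample = (
    FENCE + "jsonl\n"
    '{"input": [{"role": "user", "content": "Hi"}], "output": "Hello"}\n'
    "{'input': [{'role': 'user', 'content': 'A'}], 'output': 'B'}\n"
    "not a record\n"
    + FENCE
)
records, rejected = parse_jsonl_records(sample)
print(len(records), len(rejected))

# Re-serialize so that every line written to disk is strict JSON.
print("\n".join(json.dumps(record) for record in records))
```

Re-serializing with `json.dumps` before saving means the output file contains strict JSON on every line, regardless of the quoting style the model used.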






### Execute Synthetic Data Generation

The following cell runs the full pipeline: it extracts text from a PDF file, generates synthetic data with Azure OpenAI, and saves the result to a JSONL file.

1. **File Paths**:
   - `pdf_path`: Specifies the path to the PDF file from which text will be extracted. Change this to your PDF file path.
   - `output_file`: Specifies the path to the output JSONL file where the synthetic data will be saved.

2. **Extracting Text from PDF**:
   - The `extract_text_from_pdf` function is called with `pdf_path` as the argument to extract text from the specified PDF file.

3. **Generating Synthetic Data**:
   - If text is successfully extracted from the PDF, the `generate_synthetic_data` function is called with the extracted text to generate synthetic data.

4. **Saving Synthetic Data**:
   - The generated synthetic data is saved to the specified `output_file` using the `save_to_jsonl` function.
   - A message is printed to indicate that the synthetic dataset has been saved to the output file.

5. **Handling No Text Extraction**:
   - If no text is extracted from the PDF, a message is printed to indicate that no text was extracted.


In [None]:
pdf_path = "data/MSFT_cloud_architecture_contoso.pdf"  # Change to your PDF file path
output_file = "synthetic_data.jsonl"

text = extract_text_from_pdf(pdf_path)
if text:
    synthetic_data = generate_synthetic_data(text)
    save_to_jsonl(synthetic_data, output_file)
    print(f"Synthetic dataset saved to {output_file}")
else:
    print("No text extracted from the PDF.")
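
Once the file has been written, a quick structural check can confirm that each record has the shape requested in the prompt: an `input` list of messages with `role` and `content`, and a string `output`. The following is a sketch of such a check; `check_record` and `check_jsonl_lines` are hypothetical helpers, and the sample lines are made up for illustration.

```python
import json

def check_record(record):
    """Return a list of problems found in one dataset record (empty if none)."""
    problems = []
    messages = record.get("input")
    if not isinstance(messages, list) or not messages:
        problems.append("'input' must be a non-empty list of messages")
    else:
        for message in messages:
            if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
                problems.append("each message needs 'role' and 'content'")
                break
    if not isinstance(record.get("output"), str):
        problems.append("'output' must be a string")
    return problems

def check_jsonl_lines(lines):
    """Map 1-based line numbers to the problems found on that line."""
    report = {}
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            report[line_number] = [f"invalid JSON: {exc.msg}"]
            continue
        if isinstance(record, dict):
            problems = check_record(record)
        else:
            problems = ["not a JSON object"]
        if problems:
            report[line_number] = problems
    return report

# Made-up lines: one valid record, one with the wrong shape, one that is not JSON.
sample_lines = [
    '{"input": [{"role": "user", "content": "Hi"}], "output": "Hello"}',
    '{"input": [], "output": 1}',
    "{'single': 'quotes'}",
]
print(check_jsonl_lines(sample_lines))
```

To check the saved file, the open file object can be passed directly, for example `with open(output_file, encoding="utf-8") as f: print(check_jsonl_lines(f))`. An empty dictionary means every line matched the expected shape.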