# Dialogue2Data (D2D) Package Documentation

Welcome to the documentation for the `d2d` package! Dialogue2Data (D2D) is a Python-based, open-source tool designed to transform unstructured interview transcripts into structured data. By leveraging Natural Language Processing (NLP) and Large Language Models (LLMs), D2D automates the extraction, matching, and summarization of responses based on predefined guideline questions. This package is ideal for researchers, analysts, and organizations aiming to derive insights from qualitative data efficiently.

## Installation
1. Clone the repository:
```bash
git clone https://github.com/fathomthat/d2d.git
cd d2d
```
2. Create and activate the Conda environment:
```bash
conda env create -f environment.yml
conda activate d2d
```

## Environment Configuration
To use the OpenAI and Anthropic APIs, set an environment variable for each API key. Create a `.env` file in the root directory of the project with the following content:

```bash
OPENAI_API_KEY=[Please replace this with your OpenAI API key]
ANTHROPIC_API_KEY=[Please replace this with your Anthropic API key]
```
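
Before a long run, it can help to confirm that the keys are actually visible to Python. The sketch below is not part of `d2d`; `missing_api_keys` is a hypothetical helper, and it assumes the variables from `.env` have already been loaded into the process environment (how the package loads them is not covered here). Depending on which LLM provider you use, you may only need one of the two keys.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_api_keys(env=None):
    """Return the names of required keys that are unset or empty.

    `env` defaults to the current process environment; passing a plain
    dict makes the helper easy to try out without touching real settings.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```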

### Data Placement

To ensure smooth operation of the processor, please organize your data as follows:

- **Interview Data Structure**:
  Each interview should have its own subdirectory. The name of this subdirectory is the **interview name** (e.g., `interview_1090`). While it is suggested to place these directories under `data/private_data/` for confidentiality, you may choose a different location if needed.

- **Transcript Files**:
  Place all transcript files directly inside the interview directory. For example:
  - `data/private_data/interview_1090/transcript.txt`

- **Guidelines CSV File**:
  The guidelines CSV file must be placed at the same level as the interview directory and named after the interview, with a `_guidelines.csv` suffix. For example:
  - `data/private_data/interview_1090_guidelines.csv`

- **Synthetic Data**:
  For demonstration purposes, synthetic data is provided in `data/synthetic_data/interview_food`. This can be used to test the processor without needing private data.
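
The layout rules above can be checked before running the processor. This is a minimal sketch and not part of the package: `expected_inputs` and `check_layout` are hypothetical helper names, and the check only mirrors the conventions described in this section (one subdirectory per interview, plus a sibling `<interview_name>_guidelines.csv`).

```python
from pathlib import Path


def expected_inputs(data_dir, interview_name):
    """Return (interview directory, guidelines CSV) for the documented layout."""
    base = Path(data_dir)
    return base / interview_name, base / f"{interview_name}_guidelines.csv"


def check_layout(data_dir, interview_name):
    """Return a list of human-readable problems; an empty list means the layout looks fine."""
    interview_dir, guidelines_csv = expected_inputs(data_dir, interview_name)
    problems = []
    if not interview_dir.is_dir():
        problems.append(f"missing interview directory: {interview_dir}")
    elif not any(p.is_file() for p in interview_dir.iterdir()):
        problems.append(f"no transcript files in: {interview_dir}")
    if not guidelines_csv.is_file():
        problems.append(f"missing guidelines CSV: {guidelines_csv}")
    return problems


if __name__ == "__main__":
    for problem in check_layout("data/synthetic_data", "interview_food"):
        print(problem)
```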

## How It Works (Processor part)

The processor follows these steps:
1. **Segmentation**: Divides the transcript into question-response pairs.
2. **Question Summarization**: Summarizes the questions in the transcript and the guideline questions.
3. **Embedding**: Uses a SentenceTransformer model to embed the summarized transcript questions and the guideline questions.
4. **Matching**: Matches segments to guideline questions via cosine similarity.
5. **Response Summarization**: Summarizes matched segments using an LLM.
6. **Output**: Generates a CSV with summaries, plus JSON metadata and a log file.
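
To illustrate the matching step, here is a small, dependency-free sketch of ranking segments by cosine similarity. It is not the package's implementation: the function names are hypothetical, and the two-dimensional vectors are toy stand-ins for the embeddings a SentenceTransformer model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_segments(guideline_vec, segment_vecs):
    """Return (segment index, similarity) pairs, most similar first."""
    scored = [
        (index, cosine_similarity(guideline_vec, vec))
        for index, vec in enumerate(segment_vecs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings: one guideline question and three transcript segments.
guideline = [1.0, 0.0]
segments = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

ranking = rank_segments(guideline, segments)
print([index for index, _ in ranking])  # segment 1 first, then 2, then 0
```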

## Usage

To run the processor on the synthetic data, use the following command after setting up your environment and data:

```bash
python examples/api_test/processor_test.py
```

**Note: For a more detailed, runnable example, see `examples/api_test/processor_test.py`. The example below will not work until the package is published on PyPI.**

### Example
Here’s how to process a sample transcript:

```python
# Step 0: Import the standard library module and the processor.
import os

from d2d.processor import D2DProcessor

# Step 1: Initialize the processor
processor = D2DProcessor(
    llm_model="gpt-4o-mini",
    embedding_model="multi-qa-mpnet-base-dot-v1",
    max_concurrent_calls=10,
    sampling_method=D2DProcessor.SamplingMethod.TOP_K,
)

# Step 2: Define paths relative to the root directory
root_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '../..'))
data_dir = os.path.join(root_dir, "data", "synthetic_data")
interview_name = "interview_food"
output_dir = os.path.join(root_dir, "results")

# Step 3: Process the transcripts
processor.process_transcripts(
    data_dir=data_dir,
    interview_name=interview_name,
    output_dir=output_dir,
    disable_logging=False,
)
# Processing is complete; results are written to output_dir
```



## Parameters

### Initialization Parameters `D2DProcessor`
**Note: You don't have to set any parameters; each one has a default value. You can simply run `processor = D2DProcessor()`.**

- **llm_model : str**  
  The Large Language Model to use for summarization (e.g., "gpt-4o-mini"). This specifies the model that processes the text data.

- **embedding_model : str**  
  The SentenceTransformer model used for embedding transcript segments and guideline questions (e.g., "multi-qa-mpnet-base-dot-v1"). This model converts text into vector representations for similarity matching.

- **max_concurrent_calls : int**  
  Maximum number of concurrent API calls to the LLM, enabling efficient parallel processing of transcript segments.

- **sampling_method : D2DProcessor.SamplingMethod**  
  Determines the method for sampling matched segments from the transcript:
  - `TOP_K`: Selects the top K segments based on similarity scores.
  - `TOP_P`: Selects segments until the cumulative probability reaches a specified threshold P.

- **top_k : int, optional**  
  Number of top segments to consider when using `TOP_K` sampling. Default is 5. Ignored if `sampling_method` is `TOP_P`.

- **top_p : float, optional**  
  Cumulative probability threshold for `TOP_P` sampling. Default is 0.5. Ignored if `sampling_method` is `TOP_K`.

- **custom_extract_prompt : str, optional**  
  Custom prompt template for extracting key phrases from matched segments. The template can include `{context}` (the transcript segment) and `{query}` (the guideline question). If not provided, a default extraction prompt is used. Example:  
  `"Using the dialogue: {context}, find a short phrase from the interviewee that answers '{query}'. Avoid pronouns and use explicit names. If no answer is found, return '[No answer found]'."`

- **custom_summarize_prompt : str, optional**  
  Custom prompt template for summarizing extracted phrases. The template can include `{extracted_phrase}` (the extracted text) and `{query}` (the guideline question). If not provided, a default summarization prompt is used. Example:  
  `"From the phrase: {extracted_phrase}, for the query '{query}', create a brief summary using only the original words, focusing on the main point."`
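
To make the difference between `TOP_K` and `TOP_P` sampling concrete, the sketch below selects segment indices from a list of similarity scores. It is an illustration only, not the code inside `D2DProcessor`: the function names are hypothetical, and turning scores into probabilities with a softmax is an assumption made here so that a cumulative probability can be computed.

```python
import math


def select_top_k(scores, k=5):
    """Return the indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def select_top_p(scores, p=0.5):
    """Return indices, best first, until their cumulative probability reaches p.

    Assumption: scores are converted to probabilities with a softmax.
    """
    if not scores:
        return []
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    ranked = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    selected, cumulative = [], 0.0
    for index in ranked:
        selected.append(index)
        cumulative += probs[index]
        if cumulative >= p:
            break
    return selected


scores = [0.1, 0.9, 0.5, 0.3]
print(select_top_k(scores, k=2))    # [1, 2]
print(select_top_p(scores, p=0.5))  # [1, 2]
print(select_top_p(scores, p=0.3))  # [1]
```

With `TOP_K` the number of selected segments is fixed, while with `TOP_P` it varies with how concentrated the scores are.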

### Processing Parameters `process_transcripts`

- **data_dir : str**  
  Directory containing the interview data, including transcript files and a guideline CSV file. This is the root path for input data.

- **interview_name : str**  
  Name of the interview folder within `data_dir`. Specifies which interview’s data to process.

- **output_dir : str**  
  Directory where output files (e.g., CSV, JSON, log files) will be saved. This is the destination for processed results.

- **disable_logging : bool, optional**  
  If set to `True`, disables logging to both the console and log file. Default is `False`, enabling logging for debugging and tracking purposes.


## Additional Notes

1. **Edge Cases**:  
   - Empty transcripts or missing guidelines will raise errors.  
   - Ensure API keys are set for LLM access.

2. **Performance**:  
   - Processing time depends on transcript length and the number of guideline questions. A GPU can speed up the embedding step, which runs locally.
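
Throughput also depends on `max_concurrent_calls`, which caps how many LLM requests are in flight at once. As a rough illustration of what such a cap means, the sketch below limits concurrency with an `asyncio.Semaphore`. This is an assumption about the general technique, not a description of how `d2d` implements it; `run_bounded` and `fake_llm_call` are hypothetical names, and the fake call just sleeps instead of contacting an API.

```python
import asyncio


async def run_bounded(items, worker, max_concurrent_calls=10):
    """Run worker(item) for every item, with at most max_concurrent_calls running at once."""
    semaphore = asyncio.Semaphore(max_concurrent_calls)

    async def guarded(item):
        # Each task waits here until one of the limited slots is free.
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_llm_call(segment):
    """Stand-in for a real LLM request."""
    await asyncio.sleep(0.01)
    return f"summary of {segment}"


results = asyncio.run(
    run_bounded(["seg-1", "seg-2", "seg-3"], fake_llm_call, max_concurrent_calls=2)
)
print(results)
```

A higher cap shortens wall-clock time only up to the rate limits of the LLM provider.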

## Summary

The `d2d` package simplifies the transformation of interview transcripts into structured data, offering:

- **Automated Segmentation** and matching to guideline questions.
- **LLM-Powered Summarization** for concise insights.
- **Structured Outputs** for easy analysis.

Explore D2D for your qualitative data needs!

The following section will replace the corresponding section above once the package is published on PyPI.

## Installation
To start using the `d2d` package, install it via pip in your terminal:
```bash
pip install d2d
```