# Dialogue2Data (D2D) Package Documentation

Welcome to the documentation for the `d2d` package! Dialogue2Data (D2D) is a Python-based, open-source tool designed to transform unstructured interview transcripts into structured data. By leveraging Natural Language Processing (NLP) and Large Language Models (LLMs), D2D automates the extraction, matching, and summarization of responses based on predefined guideline questions. This package is ideal for researchers, analysts, and organizations aiming to derive insights from qualitative data efficiently.

---

## Installation
1. Clone the repository:
```bash
git clone https://github.com/avalanche-strategy/D2D.git
cd D2D
```
2. Create and activate the Conda environment:
```bash
conda env create -f environment.yml
conda activate d2d
```

---

## Environment Configuration
To use the OpenAI and Anthropic APIs, set an environment variable for each API key. Create a `.env` file in the root directory of the project with the following content:

- **Example:**  
```bash
OPENAI_API_KEY=sk-abc123XYZ789pqr456STU012vwx789YZ
ANTHROPIC_API_KEY=sk-ant-987ZYX654WVU321TSR098qwe456PLM
```

**Note: These are fictional keys.**
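
Before running the pipeline, it can help to confirm that the keys are visible to Python. The sketch below uses only the standard library; `missing_api_keys` is a hypothetical helper, not part of `d2d`. It inspects the process environment and does not read the `.env` file itself, so it assumes the file has already been loaded. Whether both keys are needed for a given model is also an assumption here.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_api_keys(environ=None, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```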

---

## Data Placement

To ensure smooth operation, please organize your data as follows:

- **Interview Data Structure (for processor)**:
  Each interview should have its own subdirectory. The name of this subdirectory is the **interview name**, which should follow the format `interview_XXXX`, where `XXXX` is a unique identifier for the interview (e.g., `interview_food` is a folder containing the transcript files for the food theme). Placing these directories under `data/private_data/` is suggested for confidentiality, but you may choose a different location if needed.

- **Transcript TXT Files (for processor)**:
  Transcript files can have any name, as long as they are all placed directly inside the interview directory. For example:
  - `data/private_data/interview_food/transcript1.txt`
  - `data/private_data/interview_food/transcript2.txt`
  - etc.  

- **Guidelines CSV File (for processor)**:
  A CSV file named `interview_xxx_guidelines.csv` containing the guideline questions. It should have a column named `guide_text` that holds the questions. For example:
  - `interview_food_guidelines.csv` contains the guideline questions for the food-theme interviews.

- **Reference Answer (for evaluation)**: A CSV file named `response_xxx.csv` containing the reference answers for the guideline questions. The first column should be `respondent_id`, and the remaining columns should be the reference answers to the corresponding guideline questions. For example:
  - `response_food.csv` contains the reference answers for the food theme.
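
The layout rules above can be checked before a run. The following is a minimal sketch using only the standard library; `check_interview_inputs` is a hypothetical helper, not part of `d2d`. The location of the guidelines CSV is passed in explicitly because its exact placement is not assumed here.

```python
import csv
from pathlib import Path


def check_interview_inputs(interview_dir, guidelines_csv):
    """Return (transcript_paths, guideline_questions) or raise ValueError."""
    interview_dir = Path(interview_dir)
    if not interview_dir.is_dir():
        raise ValueError(f"Interview directory not found: {interview_dir}")

    # Transcripts must sit directly inside the interview directory.
    transcripts = sorted(interview_dir.glob("*.txt"))
    if not transcripts:
        raise ValueError(f"No .txt transcripts in {interview_dir}")

    with open(guidelines_csv, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if "guide_text" not in (reader.fieldnames or []):
            raise ValueError("Guidelines CSV needs a 'guide_text' column")
        # Skip rows whose guide_text cell is missing or empty.
        questions = [row["guide_text"].strip() for row in reader if row["guide_text"]]

    if not questions:
        raise ValueError("Guidelines CSV contains no questions")
    return transcripts, questions
```

A call such as `check_interview_inputs("data/private_data/interview_food", "path/to/interview_food_guidelines.csv")` would then fail early with a readable message instead of partway through processing.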

---

## Data Format and Sample Data Output for D2D Pipeline

### Data Input
The D2D pipeline (processor part) processes two types of input files to extract and structure responses from unstructured interview transcripts based on provided guidelines.

### 1. Guidelines
- **Description**: A structured file listing the questions or prompts to guide the interview and extraction process.
- **Format**: Comma-separated values (`.csv`)
- **Structure**:
  - Single column named `guide_text` with each row containing a question or prompt.
  - Questions align with those asked in transcripts for matching purposes.
- **Example**:
  - **File**: `interview_food_sample_guidelines.csv`
    > `guide_text`  
    > What’s a dish that reminds you of your childhood?  
    > Can you describe a meal that has a special meaning for you?
    > ...

### 2. Transcripts
- **Description**: Raw text files containing conversational interview data, with alternating lines or labeled segments for interviewers and interviewees.
- **Format**: Plain text (`.txt`)
- **Structure**:
  - Each file represents one interview. The examples here are named sequentially (e.g., `001.txt`, `002.txt`), although any file name is accepted.
  - Content includes dialogue, with questions from interviewers and responses from interviewees.
- **Example**:
  - **File**: `001.txt`
    > Interviewer: Let’s talk food. What’s a dish that reminds you of your childhood?  
    > Interviewee: Definitely my grandma’s chicken and rice. She used to make it every Sunday, and the smell would just take over the whole house. It was simple—nothing fancy—but it was filled with love.  
    > Interviewer: Can you describe a meal that has a special meaning for you?  
    > Interviewee: Yeah, actually. My 18th birthday dinner. My parents surprised me by cooking all my favorite dishes—pad thai, roasted veggies, and this chocolate lava cake I was obsessed with. I remember feeling really seen, you know?
    > ...
  - **File**: `002.txt`
    > Interviewer: Alright, diving into food and memories—what dish instantly brings your childhood back?  
    > Interviewee: Oh man, my mom’s arroz con leche. She’d make it every time I was sick, or honestly, just when I needed cheering up. The cinnamon smell still makes me emotional sometimes.  
    > Interviewer: Can you describe a meal that holds special meaning for you?  
    > Interviewee: Our Christmas Eve dinner. It’s this big spread—tamales, roasted pork, rice, beans. It’s loud and chaotic and full of stories. It’s more than food—it’s our whole culture on a table.
    > ...
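
To make the transcript structure concrete, here is a minimal sketch that groups a transcript in the format shown above into question-response pairs. It assumes every turn starts with an `Interviewer:` or `Interviewee:` label; `split_into_pairs` is a hypothetical helper, and the package's own segmentation step may work differently.

```python
def split_into_pairs(transcript_text):
    """Group a speaker-labelled transcript into (question, response) pairs."""
    pairs = []
    question, response = None, []
    for raw_line in transcript_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("Interviewer:"):
            # A new question closes the previous pair, if there is one.
            if question is not None:
                pairs.append((question, " ".join(response)))
            question = line[len("Interviewer:"):].strip()
            response = []
        elif line.startswith("Interviewee:"):
            response.append(line[len("Interviewee:"):].strip())
        elif response:
            # Unlabelled line after an answer: treat it as a continuation.
            response.append(line)
    if question is not None:
        pairs.append((question, " ".join(response)))
    return pairs
```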



### Sample Data Output
The D2D pipeline produces structured output by matching interviewee responses to guideline questions, consolidating results for analysis.
- **Format**: Comma-separated values (`.csv`)
- **Structure**:
  - Columns:
    - `Interview File`: Identifier of the source transcript file (e.g., `001`, `002`).
    - Additional columns named after guideline questions (e.g., "What’s a dish that reminds you of your childhood?").
  - Each row corresponds to one interview, with cells containing the extracted response text.
  - Responses are concise, summarizing key points from the transcript.  
- **Example**:

| Interview File | What’s a dish that reminds you of your childhood? | Can you describe a meal that has a special meaning for you?                          |...|
|----------------|--------------------------------------------------|------------------------------------------------------------------------------------|---|
| 001            | Grandma’s chicken and rice                       | 18th birthday dinner with favorite dishes cooked by parents.                        |...|
| 002            | Mom’s arroz con leche                            | Christmas Eve dinner with tamales, roasted pork, rice, beans; loud, chaotic, full of stories. |...|
|...|...|...|...|

---

## How It Works (Processor part)

The processor follows these steps:
1. **Segmentation**: Divides the transcript into question-response pairs.
2. **Question Summarization**: Summarizes the questions in the transcript and the guideline questions.
3. **Embedding**: Uses a SentenceTransformer model to embed the summarized transcript questions and guideline questions.
4. **Matching**: Matches segments to guideline questions via cosine similarity.
5. **Response Summarization**: Summarizes matched segments using an LLM.
6. **Output**: Generates a CSV with summaries, plus JSON metadata and a log file.
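
The matching step can be illustrated with a small, dependency-free sketch. The toy 3-dimensional vectors below stand in for real sentence embeddings, and `cosine_similarity` and `rank_segments` are hypothetical names used only for illustration; the package's internal implementation is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as matching nothing.
    return dot / (norm_a * norm_b)


def rank_segments(guideline_vec, segment_vecs, top_k=5):
    """Return (segment_index, score) for the top_k most similar segments."""
    scored = [
        (index, cosine_similarity(guideline_vec, vec))
        for index, vec in enumerate(segment_vecs)
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy vectors: segment 1 points the same way as the guideline question.
guideline = [1.0, 0.0, 0.0]
segments = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
best = rank_segments(guideline, segments, top_k=2)
print(best)  # Segment 1 ranks first, then segment 2.
```

Cosine similarity ignores vector length, which is why segment 1 scores highest even though it is twice as long as the guideline vector.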

---

## Usage

To run the processor on the synthetic data, use the following command after setting up your environment and data:

```bash
python examples/api_test/processor_test.py
```
---

## Data Output

Upon processing, the D2D pipeline generates four output files, which are stored in a user-designated directory. In the examples, this directory is `results/`, but users can specify any preferred path when running the processor. For the synthetic interview dataset `interview_food`, processed at 11:13 AM PDT on May 26, 2025, the following files are generated:

- **`D2D_survey_food_generator_log_202505261113.txt`**
  - **Description**: This plain text file contains the retrieval information passed into the generator module for the `interview_food` dataset, specifically for use in the evaluation process. It logs the data or queries that the generator utilizes to produce synthetic or processed outputs.
  - **Purpose**: It supports the evaluation module by providing transparency into the inputs fed to the generator, aiding in assessing its performance or accuracy.

- **`D2D_survey_food_log_202505261113.log`**
  - **Description**: This log file records the processing steps applied to the `interview_food` dataset within the D2D pipeline. It captures details such as segmentation, embedding, matching, and summarization, along with any informational messages, warnings, or errors encountered during execution.
  - **Purpose**: It acts as a comprehensive record of the workflow, facilitating debugging and ensuring that each processing step is documented for review or troubleshooting.

- **`D2D_survey_food_references_202505261113.json`**
  - **Description**: This JSON file provides reference information for the `interview_food` dataset, specifying the exact lines in the transcripts from which responses were extracted for each interview. It serves as a mapping to link summarized responses back to their original context within the source transcripts.
  - **Purpose**: It enables users to trace and verify the origins of the extracted responses, offering insight into the context and accuracy of the summarization process.

- **`D2D_survey_food_responses_202505261113.csv`**
  - **Description**: This CSV file delivers the structured output of the D2D pipeline for the `interview_food` dataset. Each row typically corresponds to a single transcript, with columns containing summarized responses aligned with the guideline questions from the interview guidelines.
  - **Purpose**: It provides a concise, tabular summary of key insights from the `interview_food` transcripts, making the data readily accessible for analysis or further use.

**Note**: The timestamp in the file names (e.g., `202505261113`) is based on the date and time when the processing was run, ensuring unique file names for each execution and preventing overwrites.
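
As an illustration of the naming pattern, the sketch below rebuilds the four file names from a prefix and a run time. `output_file_names` is a hypothetical helper, and the `%Y%m%d%H%M` format is inferred from the example stamp `202505261113`, not taken from the package source.

```python
from datetime import datetime


def output_file_names(prefix, run_time):
    """Build the four output file names for one run."""
    stamp = run_time.strftime("%Y%m%d%H%M")  # Year, month, day, hour, minute.
    return {
        "generator_log": f"{prefix}_generator_log_{stamp}.txt",
        "log": f"{prefix}_log_{stamp}.log",
        "references": f"{prefix}_references_{stamp}.json",
        "responses": f"{prefix}_responses_{stamp}.csv",
    }


names = output_file_names("D2D_survey_food", datetime(2025, 5, 26, 11, 13))
print(names["responses"])  # D2D_survey_food_responses_202505261113.csv
```

If the stamp really has minute resolution, as the example suggests, two runs started within the same minute and written to the same directory would presumably share file names.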

---

## Explanation of the API
**Note: For more detailed, executable examples, please refer to `examples/api_test/processor_test.py`. The import statement in the following example will not work until the package is published to PyPI.**  

### API Example
Here’s how to process a sample transcript:

```python
# Step 0: Import the standard library module used for paths, then the processor.
import os

from d2d.processor import D2DProcessor

# Step 1: Initialize the processor.
processor = D2DProcessor(
    llm_model="gpt-4o-mini",
    embedding_model="multi-qa-mpnet-base-dot-v1",
    max_concurrent_calls=10,
    sampling_method=D2DProcessor.SamplingMethod.TOP_K,
)

# Step 2: Define paths relative to the repository root.
# This assumes the script sits two levels below the root (e.g., examples/api_test/).
root_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), "../.."))
data_dir = os.path.join(root_dir, "data", "synthetic_data")
interview_name = "interview_food"
output_dir = os.path.join(root_dir, "results")

# Step 3: Process the transcripts.
processor.process_transcripts(
    data_dir=data_dir,
    interview_name=interview_name,
    output_dir=output_dir,
    disable_logging=False,
)
# Processing is complete once this call returns; results are in output_dir.
```

### Parameters

#### Initialization Parameters `D2DProcessor`
**Note: Every parameter has a default value, so you do not have to set any of them. You can simply run `processor = D2DProcessor()`.**

- **llm_model : str**
  The Large Language Model to use for summarization (e.g., "gpt-4o-mini"). This specifies the model that processes the text data.

- **embedding_model : str**
  The SentenceTransformer model used for embedding transcript segments and guideline questions (e.g., "multi-qa-mpnet-base-dot-v1"). This model converts text into vector representations for similarity matching.

- **max_concurrent_calls : int**
  Maximum number of concurrent API calls to the LLM, enabling efficient parallel processing of transcript segments.

- **sampling_method : D2DProcessor.SamplingMethod**
  Determines the method for sampling matched segments from the transcript:
  - `TOP_K`: Selects the top K segments based on similarity scores.
  - `TOP_P`: Selects segments until the cumulative probability reaches a specified threshold P.

- **top_k : int, optional**
  Number of top segments to consider when using `TOP_K` sampling. Default is 5. Ignored if `sampling_method` is `TOP_P`.

- **top_p : float, optional**
  Cumulative probability threshold for `TOP_P` sampling. Default is 0.5. Ignored if `sampling_method` is `TOP_K`.

- **custom_extract_prompt : str, optional**
  Custom prompt template for extracting key phrases from matched segments. The template can include `{context}` (the transcript segment) and `{query}` (the guideline question). If not provided, a default extraction prompt is used. Example:
  `"Using the dialogue: {context}, find a short phrase from the interviewee that answers '{query}'. Avoid pronouns and use explicit names. If no answer is found, return '[No answer found]'."`

- **custom_summarize_prompt : str, optional**
  Custom prompt template for summarizing extracted phrases. The template can include `{extracted_phrase}` (the extracted text) and `{query}` (the guideline question). If not provided, a default summarization prompt is used. Example:
  `"From the phrase: {extracted_phrase}, for the query '{query}', create a brief summary using only the original words, focusing on the main point."`
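
The difference between the `TOP_K` and `TOP_P` options listed above can be sketched as follows. How D2D converts similarity scores into probabilities is not stated in this documentation, so the softmax used here is an assumption, and `select_top_k` and `select_top_p` are hypothetical helpers.

```python
import math


def select_top_k(scores, k=5):
    """Indices of the k highest-scoring segments."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def select_top_p(scores, p=0.5):
    """Indices of the best segments whose cumulative softmax mass reaches p."""
    if not scores:
        return []
    peak = max(scores)
    # Assumption: scores become probabilities through a softmax.
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen, cumulative = [], 0.0
    for index in order:
        chosen.append(index)
        cumulative += weights[index] / total
        if cumulative >= p:
            break
    return chosen


scores = [0.2, 0.9, 0.5, 0.1]
print(select_top_k(scores, k=2))    # [1, 2]
print(select_top_p(scores, p=0.5))  # [1, 2]
```

`TOP_K` always returns a fixed number of segments, while `TOP_P` returns fewer segments when one match dominates and more when the scores are close together.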

#### Processing Parameters `process_transcripts`

- **data_dir : str**
  Directory containing the interview data, including transcript files and a guideline CSV file. This is the root path for input data.

- **interview_name : str**
  Name of the interview folder within `data_dir`. Specifies which interview’s data to process.

- **output_dir : str**
  Directory where output files (e.g., CSV, JSON, log files) will be saved. This is the destination for processed results.

- **disable_logging : bool, optional**
  If set to `True`, disables logging to both the console and log file. Default is `False`, enabling logging for debugging and tracking purposes.

---

## Evaluator
---

## Data Format and Sample Data Output

### Data Input
The D2D pipeline (evaluator part) evaluates the performance of the processor, scoring the pipeline result with five metrics and a weighted joint score.
The evaluator takes three inputs: two outputs of the processor and a reference answer file.

### 1. CSV Output from Processor
- **Description**: A CSV file of the structured output of the D2D pipeline for the given dataset. Each row typically corresponds to a single transcript,
  with columns containing summarized responses aligned with the guideline questions from the interview guidelines.
- **Format**: Comma-separated values (`.csv`)
- **Structure**:
  - Columns:
    - `Interview File`: Identifier of the source transcript file (e.g., `001`, `002`).
    - Additional columns named after guideline questions (e.g., "What’s a dish that reminds you of your childhood?").
  - Each row corresponds to one interview, with cells containing the extracted response text.
  - Responses are concise, summarizing key points from the transcript.  
- **Example**:

| Interview File | What’s a dish that reminds you of your childhood? | Can you describe a meal that has a special meaning for you?                          |...|
|----------------|--------------------------------------------------|------------------------------------------------------------------------------------|---|
| 001            | Grandma’s chicken and rice                       | 18th birthday dinner with favorite dishes cooked by parents.                        |...|
| 002            | Mom’s arroz con leche                            | Christmas Eve dinner with tamales, roasted pork, rice, beans; loud, chaotic, full of stories. |...|
|...|...|...|...|

### 2. Log (TXT) Output from Processor
- **Description**: The plain text generator log (the `..._generator_log_<timestamp>.txt` file) contains the retrieval information passed into the generator module.
  For each guideline question in each interview file, it lists the transcript segments that were retrieved as
  candidates for containing the response.
- **Format**: Plain text (`.txt`)
- **Structure**:
  - Each chunk of text, delimited by `===Start===` and `===End===`, covers one guideline question for a single interview file.
  - Each chunk consists of the file name, the guideline question, and the relevant question-and-answer segments that may contain the response.
- **Example**:
  > ===Start===  
  > Processing file: 002  
  > Processing guide question: How do food and family traditions connect for you?  
  > Relevant Interviewee Responses:  
  > Interviewer: What about food and family traditions—how do they connect?  
  > Interviewee: They’re basically the same thing in my family. Recipes are sacred. Like, if you try to tweak my aunt’s flan recipe, you might start a family feud. [laughs]  
  > Interviewer: Any food tied to a place or person for you?  
  > Interviewee: Yeah—empanadas always remind me of my grandma in Buenos Aires. She’d let me help fold the dough, and I’d sneak bits of the filling when she wasn’t looking.  
  > Interviewer: Favorite dish from another culture?  
  > Interviewee: Japanese ramen. The broth, the noodles, the toppings—it’s like a bowl of magic. I tried making it once. Total disaster. [laughs]  
  > Interviewer: Alright, diving into food and memories—what dish instantly brings your childhood back?  
  > Interviewee: Oh man, my mom’s arroz con leche. She’d make it every time I was sick, or honestly, just when I needed cheering up. The cinnamon smell still makes me emotional sometimes.  
  > Interviewer: What’s the first thing you learned to cook?  
  > Interviewee: French toast! I was like nine, and I made it for my dad on Father’s Day. I used way too much cinnamon, but he ate it like it was gourmet. I’ll never forget that.  
  > ===End===
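
A file in this format can be split back into its chunks with a few lines of Python. This is a minimal sketch that assumes the markers and the `Processing file:` and `Processing guide question:` prefixes appear exactly as in the example above; `parse_generator_log` is a hypothetical helper, not the evaluator's own parser.

```python
def parse_generator_log(text):
    """Split log text into one dict per ===Start===/===End=== chunk."""
    chunks, current = [], None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == "===Start===":
            current = {"file": None, "guide_question": None, "context": []}
        elif line == "===End===":
            if current is not None:
                chunks.append(current)
            current = None
        elif current is None or not line:
            continue  # Ignore blank lines and text outside any chunk.
        elif line.startswith("Processing file:"):
            current["file"] = line.split(":", 1)[1].strip()
        elif line.startswith("Processing guide question:"):
            current["guide_question"] = line.split(":", 1)[1].strip()
        elif line != "Relevant Interviewee Responses:":
            # Everything else is a retrieved interviewer or interviewee turn.
            current["context"].append(line)
    return chunks
```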
  
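### 3. Reference Answers
- **Description**: A CSV file named `response_xxx.csv` containing the reference answers for the guideline questions (e.g., `response_food.csv` for the food theme).
- **Format**: Comma-separated values (`.csv`)
- **Structure**:
  - The first column is `respondent_id`.
  - The remaining columns hold the reference answers to the corresponding guideline questions.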

---

## Additional Notes

1. **Edge Cases**:
   - Empty transcripts or missing guidelines will raise errors.
   - Ensure API keys are set for LLM access.

2. **Performance**:
   - Processing time depends on transcript length and the number of guideline questions. Running the embedding model on a GPU can speed up computation.

## Summary

The `d2d` package simplifies the transformation of interview transcripts into structured data, offering:

- **Automated Segmentation** and matching to guideline questions.
- **LLM-Powered Summarization** for concise insights.
- **Structured Outputs** for easy analysis.

Explore D2D for your qualitative data needs!

**The following sections will replace the upper part when the package is published to PyPI**


## Installation
To start using the `d2d` package, install it via pip in your terminal:
```bash
pip install d2d
```