# 📦 **Step 1: Install Required Packages + Configure API Keys**

Before you start using this notebook, ensure all required Python packages are installed **and** that your API keys are properly set up in a `.env` file. This prepares your environment for accessing services like OpenAI or Google Gemini.

---

## 💻 How to Install Required Packages

Open your terminal or command prompt, navigate to the project directory (where `requirements.txt` is located), and run:

```bash
pip install -r requirements.txt
```

> ✅ **Tip:** Use a virtual environment (e.g., `venv`, `conda`) to keep your dependencies clean and isolated.

If you're running this from a Jupyter notebook:

```python
%pip install -r requirements.txt
```

---

## 🔐 Set Up API Keys in `.env` File

To keep your API keys safe and reusable, store them in a `.env` file at the root of your project.

### 📄 Example `.env` File

```env
GEMINI_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
```

> 🔑 **Get your API keys from:**
>
> * **Gemini (Google AI Studio):** [aistudio](https://aistudio.google.com/apikey)
> * **OpenAI (ChatGPT/GPT-4):** [openai.com](https://platform.openai.com/settings/organization/api-keys)

This `.env` file is read by the `python-dotenv` package (imported in Python as `dotenv`), so make sure `python-dotenv` is listed in your `requirements.txt`.
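
As a minimal sketch of how the keys can be read back in Python, assuming the `python-dotenv` package and the two key names from the example above. The helper names `load_env_file` and `missing_api_keys` are illustrative and not part of this project:

```python
import os

REQUIRED_KEYS = ("GEMINI_API_KEY", "OPENAI_API_KEY")


def load_env_file(path=".env"):
    """Load variables from a .env file into os.environ (needs python-dotenv)."""
    from dotenv import load_dotenv  # imported lazily; provided by python-dotenv
    return load_dotenv(dotenv_path=path)


def missing_api_keys(required=REQUIRED_KEYS, environ=None):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]
```

Calling `load_env_file()` once at the top of the notebook and then checking that `missing_api_keys()` returns an empty list is one way to catch a misnamed key early.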

---

## 🛠️ Troubleshooting

* If you get a "permission denied" error:

  ```bash
  pip install --user -r requirements.txt
  ```

* Ensure `.env` is not shared or committed to version control (e.g., use `.gitignore`):

  ```
  .env
  ```


# 🔧 Second Step: Create and Register Your Project Folder

In this step, you'll register the **main project folder**, so the system knows where to store and retrieve all files related to your project.

This setup automatically creates a structured folder system under your specified `main_folder`, and stores the configuration in a central JSON file. This ensures your projects remain organized, especially when working with multiple studies or datasets.

---

## 🗂️ Project Folder Structure

On the first run, the following folder structure will be automatically created under your specified `main_folder`:

```
main_folder/
├── database/
│   ├── scopus/
│   └── <project_name>_database.xlsx  ← auto-generated path (not file creation)
├── pdf/
└── xml/
```

---

## 💡 Example

Suppose your project name is `corona_discharge`, and you want to store all project files under:

```
G:\My Drive\research_related\ear_eog
```

You can register this setup by running:

```python
project_folder(
    project_review='corona_discharge',
    main_folder=r'G:\My Drive\research_related\ear_eog'
)
```

✅ This will:

* Save the project path in `setting/project_folders.json`
* Create the full folder structure: `database/scopus`, `pdf`, and `xml`

---

## 🔁 Loading the Project Later

Once registered, you can access the project folders in future sessions by providing just the `project_review` name:

```python
paths = project_folder(project_review='corona_discharge')

print(paths['main_folder'])  # Main project folder
print(paths['csv_path'])     # Path to the Excel database file
```

---

## ⚙️ What Happens in the Background?

* A file named `project_folders.json` is stored in the `setting/` directory within your project.
* It maps each `project_review` to its corresponding `main_folder`.
* Folder structure is created automatically on the first run.
* On subsequent runs, the system reads the JSON to locate your project — no need to re-enter paths.
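
The behaviour described above can be sketched roughly as follows. This is not the actual `project_folder` implementation: `register_project`, `lookup_project`, and the flat name-to-path JSON layout are assumptions made for illustration.

```python
import json
import os

SUBFOLDERS = (os.path.join("database", "scopus"), "pdf", "xml")


def register_project(name, main_folder, registry_path):
    """Create the folder tree and store name -> main_folder in a JSON file."""
    for sub in SUBFOLDERS:
        os.makedirs(os.path.join(main_folder, sub), exist_ok=True)
    registry = {}
    if os.path.exists(registry_path):
        with open(registry_path, encoding="utf-8") as fh:
            registry = json.load(fh)
    registry[name] = main_folder
    os.makedirs(os.path.dirname(registry_path) or ".", exist_ok=True)
    with open(registry_path, "w", encoding="utf-8") as fh:
        json.dump(registry, fh, indent=2)
    return registry


def lookup_project(name, registry_path):
    """Return the main folder registered under `name` (KeyError if unknown)."""
    with open(registry_path, encoding="utf-8") as fh:
        return json.load(fh)[name]
```

Keeping the registry as a plain JSON mapping means several projects can share one file, and a later session only needs the project name to recover its paths.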


In [None]:
from setting.project_path import project_folder

project_name = 'corona_discharge'
main_folder = r"D:\my_project"

# Register the project and create its folder structure
project_folder(project_review=project_name, main_folder=main_folder)

# 📥 Third Step: Download the Scopus BibTeX File

In this step, you'll use the **Scopus database** to find and download relevant papers for your project. Scopus is a comprehensive and widely-used repository of peer-reviewed academic literature.

---

## 🔍 Using Scopus Advanced Search

To retrieve high-quality and relevant papers, we recommend using **Scopus' Advanced Search** feature. This powerful tool lets you refine your search based on:

* Keywords
* Authors
* Publication dates
* Document types
* And more...

This ensures that your literature collection is both targeted and comprehensive.

---

## 💡 Get Keyword Ideas with a Prompt

To help you formulate effective search queries, you can use the following **prompt-based suggestion tool**:

👉 [Keyword Search Prompt](https://gist.github.com/balandongiv/886437963d38252e61634ddc00b9d983)

You may need to modify the prompt to better suit your research domain. Here are some example domains:

* `"corona discharge"`
* `"fatigue driving EEG"`
* `"wafer classification"`

Feel free to add, remove, or tweak keywords as needed to refine your search results.

---

## 💾 Save and Organize Your Results

Once you've finalized your search:

1. **Select all available attributes** when exporting results from Scopus.

   ![Scopus CSV Export](image/scopus_csv_export.png)

2. Choose the **BibTeX** format when saving the export file.
3. Save the file inside the `database/scopus/` folder of your project.

The resulting folder structure might look like this:

```
main_folder/
├── database/
│   └── scopus/
│       ├── scopus(1).bib
│       ├── scopus(2).bib
│       ├── scopus(3).bib
```

Make sure the BibTeX files are correctly named and stored to ensure smooth integration in later steps.
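
A quick sanity check on the downloaded files might look like the sketch below. It only counts lines that open a BibTeX entry (`@TYPE{`), so it is a rough check rather than a real parser, and `count_bib_entries` is a hypothetical helper, not a function of this project.

```python
import os
import re

# Matches the start of a BibTeX entry such as "@ARTICLE{" at the start of a line
ENTRY_START = re.compile(r"^\s*@\w+\s*\{", re.MULTILINE)


def count_bib_entries(scopus_folder):
    """Map each .bib file name in the folder to its number of '@type{' entries."""
    counts = {}
    for name in sorted(os.listdir(scopus_folder)):
        if not name.lower().endswith(".bib"):
            continue
        with open(os.path.join(scopus_folder, name), encoding="utf-8") as fh:
            counts[name] = len(ENTRY_START.findall(fh.read()))
    return counts
```

Comparing these counts with the number of results shown in Scopus is a simple way to spot a truncated or empty export.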

# 📊 Fourth Step: Combine Scopus BibTeX Files into Excel

Once you've downloaded multiple `.bib` files from Scopus, the next step is to **combine and convert** them into a structured Excel file. This makes it easier to filter, sort, and review the metadata of all collected papers.

---

## 🧰 What This Step Does

* Loads all `.bib` files from your project's `database/scopus/` folder
* Parses the relevant metadata (e.g., title, authors, year, source, DOI)
* Combines the results into a single Excel spreadsheet
* Saves the spreadsheet in the `database/` folder as `combined_filtered.xlsx`
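
To make the idea concrete, here is a simplified stand-in for the parsing and combining part, written with the standard library only. It is not the project's `combine_scopus_bib_to_excel`: the regular expressions assume the one-field-per-line layout of typical Scopus exports, and dropping repeated DOIs is an assumption about what "combining" should do.

```python
import re

# One entry: "@TYPE{key," ... up to a closing brace on its own line
ENTRY = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,(.*?)\n\}", re.DOTALL)
# One field: "name = {value}" ending at the end of a line
FIELD = re.compile(r"(\w+)\s*=\s*\{(.*?)\}[ \t]*,?[ \t]*$", re.DOTALL | re.MULTILINE)


def parse_bibtex(text):
    """Return one dict per entry with lower-cased field names."""
    rows = []
    for entry_type, key, body in ENTRY.findall(text):
        row = {"entry_type": entry_type.lower(), "bibtex_key": key}
        for name, value in FIELD.findall(body):
            row[name.lower()] = " ".join(value.split())
        rows.append(row)
    return rows


def combine_entries(texts):
    """Parse several BibTeX strings and drop entries whose DOI was already seen."""
    seen, combined = set(), []
    for text in texts:
        for row in parse_bibtex(text):
            doi = row.get("doi", "").lower()
            if doi and doi in seen:
                continue
            if doi:
                seen.add(doi)
            combined.append(row)
    return combined
```

The resulting list of dicts could then be written to a spreadsheet, for example with pandas (`DataFrame(rows).to_excel(...)`), if that library is available.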



## 📁 Folder Structure Example

After running the script, your folder might look like this:

```
main_folder/
├── database/
│   ├── scopus/
│   │   ├── scopus(1).bib
│   │   ├── scopus(2).bib
│   │   ├── scopus(3).bib
│   └── combined_filtered.xlsx
```

This Excel file will serve as your primary reference for filtering papers before downloading PDFs.


In [None]:
from download_pdf.database_preparation import combine_scopus_bib_to_excel
from setting.project_path import project_folder

# project_review = 'corona_discharge'
project_review = 'wafer'

# Look up the registered project paths
path_dic = project_folder(project_review=project_review)
folder_path = path_dic['scopus_path']      # folder holding the Scopus .bib files
output_excel = path_dic['database_path']   # Excel file to write

# Combine all .bib files into a single Excel file
combine_scopus_bib_to_excel(folder_path, output_excel)

# 🧾 **Step 5: Automatically Filter Relevant References Using LLM**

After retrieving thousands of BibTeX references from Scopus (**Step 3**) and combining them into an Excel file (`combined_filtered.xlsx`) in **Step 4**, you'll likely find many entries irrelevant to your specific research focus.

In this step, we use a **Large Language Model (LLM)** to classify abstracts based on a defined research topic and context, eliminating the need for tedious manual filtering.

---

## ✨ **What This Step Does**

* Loads abstracts from `combined_filtered.xlsx`
* Applies a **custom LLM prompt** to assess relevance
* Adds a new column to the Excel file with `True`/`False` labels:

  * ✅ `True` → Relevant
  * ❌ `False` → Not relevant

This process is based on a topic-specific prompt defined by you.

---

## 🧠 **LLM Prompt Customization**

For the automated relevance filtering to work effectively, you **must provide two key inputs** to guide the LLM:

* `topic`: A concise statement of your **specific research goal or area of interest**.
  *Example:* `"wafer defect classification"`

* `topic_context`: A description of the **broader scientific or industrial context** where your topic belongs.
  *Example:* `"semiconductor manufacturing and inspection"`

These inputs help the LLM understand what kinds of abstracts are considered relevant and which ones should be filtered out.
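
The sketch below shows how two such placeholders could be substituted into a prompt and how a one-word reply could be mapped to a label. The template text, `build_prompt`, and `parse_relevance` are hypothetical; the real prompt lives in the project's YAML file and may be structured differently.

```python
# Hypothetical template; the project's actual prompt is defined in YAML
PROMPT_TEMPLATE = (
    "You are screening abstracts for a review on {topic} "
    "in the context of {topic_context}.\n"
    "Answer with a single word, True or False.\n\n"
    "Abstract: {abstract}"
)


def build_prompt(abstract, placeholders, template=PROMPT_TEMPLATE):
    """Fill the topic placeholders and the abstract into the template."""
    return template.format(abstract=abstract, **placeholders)


def parse_relevance(reply):
    """Map an LLM reply to True/False, or None if it is neither."""
    word = reply.strip().strip(".\"'`").lower()
    return {"true": True, "false": False}.get(word)
```

Returning `None` for anything other than a clean `True`/`False` makes ambiguous replies easy to find and review by hand.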

---

### 💡 *Not sure how to define your `topic` and `topic_context`?*

You can get help from an LLM to generate these values. Here’s how:

1. **Open the filtering prompt** template defined in the YAML file:

   ```
   research_filter/agent/abstract_check.yaml
   ```

2. **Copy the prompt structure** (including the placeholders for `topic` and `topic_context`).

3. **Ask any LLM to assist**, by providing it the YAML prompt and a description of your research.
   For example, you could say:

   > 🧠 *"Given the following prompt structure, and knowing that my research is about identifying AI-generated academic papers, can you help me fill in the `topic` and `topic_context` placeholders?"*

4. The LLM might then suggest:

   ```yaml
   topic: "AI-generated academic paper detection"
   topic_context: "scientific publishing and machine learning ethics"
   ```

This approach ensures your filtering prompt is both precise and contextually grounded, improving the accuracy of the classification.

---

## ⚙️ **Configuration via `agentic_setting`**

The entire filtering behavior is controlled by a configuration dictionary called `agentic_setting`:

```python
agentic_setting = {
    "agent_name": "abstract_filter",     # The identifier for the agent logic (must match YAML)
    "column_name": "abstract_filter",    # Name of the column added to the Excel file
    "yaml_path": yaml_path,              # Path to the YAML file defining agent behavior
    "model_name": model_name             # Name of the LLM model to use
}
```

### 🔍 Parameter Breakdown:

| Key           | Description                                                             |
| ------------- | ----------------------------------------------------------------------- |
| `agent_name`  | Matches the name of the agent defined in the YAML configuration file.   |
| `column_name` | The name of the new column in the Excel file where results are saved.   |
| `yaml_path`   | Path to the YAML file containing the agent's logic and prompt template. |
| `model_name`  | The specific LLM model used (e.g., `"gpt-4o-mini"` or `"gemini-1.5-pro"`). |

---

## 📂 **File and Folder Structure**

To avoid redundant LLM calls (which can be costly), results are cached as JSON files:

```
wafer_defect/
├── database/
│   ├── scopus/
│   └── combined_filtered.xlsx               ← Updated with filtering results
├── abstract_filter/
│   └── json_output/
│       ├── kim_2019.json
│       ├── smith_2020.json
│       └── ...
```

* Each abstract’s filtering result is saved individually.
* If the run encounters an error, it can resume without reprocessing previous abstracts.
* These files are later reused for **cross-checking** and **final review**.
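
A per-abstract cache of this kind can be sketched as follows. `cached_result` is a hypothetical helper, and the `compute` callback stands in for whatever function performs the LLM call.

```python
import json
import os


def cached_result(bibtex_key, cache_folder, compute):
    """Return the cached JSON result for a key, calling compute() only on a miss."""
    os.makedirs(cache_folder, exist_ok=True)
    path = os.path.join(cache_folder, f"{bibtex_key}.json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    result = compute()  # e.g. the LLM call for this abstract
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(result, fh)
    return result
```

Because each result is written as soon as it is computed, a run that stops halfway can be restarted and will skip every key that already has a JSON file.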

---

## 📤 **Excel Output Behavior**

Depending on the `overwrite_csv` setting:

* `True` → Updates the original `combined_filtered.xlsx`
* `False` → Creates a new file (e.g., `combined_filtered_updated.xlsx`)

---

## ⚠️ **Caution: Manual Review is Still Required**

> ❗ **LLMs are powerful but not perfect**. They may misclassify edge cases or ambiguous abstracts.
>
> 🔍 Always **manually inspect** the final results before using them for publication or decision-making.

In [None]:
from research_filter.auto_llm import run_pipeline
from setting.project_path import project_folder
import os

# Define your project
project_review = 'wafer'
path_dic = project_folder(project_review=project_review)
main_folder = path_dic['main_folder']

# Choose your LLM. You can also try cheaper or more expensive models and compare their performance.
model_name = "gpt-4o-mini"
base_dir = os.getcwd()
yaml_path = os.path.join(base_dir, "research_filter", "agent", "abstract_check.yaml")
# Agent configuration
agentic_setting = {
    "agent_name": "abstract_filter",  # Agent name, as defined in the YAML file
    # Name of the Excel column where the result (True or False) is saved.
    # You may choose another name, but keeping "abstract_filter" is recommended:
    # some tasks rely on this naming convention to clean up the LLM output
    # stored in this column. This applies to this step only.
    "column_name": "abstract_filter",
    "yaml_path": yaml_path,  # Path to the YAML file that defines the agent
    "model_name": model_name,  # Model name; choose from the supported models
}

# Paths and folders
csv_path = path_dic['database_path']
methodology_json_folder = os.path.join(main_folder, agentic_setting['agent_name'], 'json_output')
multiple_runs_folder = os.path.join(main_folder, agentic_setting['agent_name'], 'multiple_runs_folder')
final_cross_check_folder = os.path.join(main_folder, agentic_setting['agent_name'], 'final_cross_check_folder')
placeholders = {
    "topic": "wafer defect classification",
    "topic_context": "semiconductor manufacturing and inspection"
}
# LLM runtime settings. If unsure, use the default settings
process_setup = {
    'batch_process': False,
    'manual_paste_llm': False,
    'iterative_confirmation': False,
    'overwrite_csv': True,
    'cross_check_enabled': False,
    'cross_check_runs': 3,
    'cross_check_agent_name': 'agent_cross_check',
    'cleanup_json': False
}

# Run the LLM-based filtering
run_pipeline(
    agentic_setting,
    process_setup,
    placeholders=placeholders,
    csv_path=csv_path,
    main_folder=main_folder,
    methodology_json_folder=methodology_json_folder,
    multiple_runs_folder=multiple_runs_folder,
    final_cross_check_folder=final_cross_check_folder,
)


# 🧾 **Sixth Step: Download PDFs**

Now that your reference list has been filtered to include only **relevant papers** (based on the abstract analysis in **Step 5**), you're ready to automatically download their corresponding PDFs.

This step uses the filtered Excel file (updated in Step 5) to retrieve and save PDFs for each BibTeX entry. The script is powered by Selenium and supports fallback strategies for sources like IEEE, MDPI, and ScienceDirect.

> 🛑 **Note:** This step launches a full browser window during execution. Some publishers may block headless downloads — using a visible browser avoids this issue.

> 📝 **Important:** By default, only **Sci-Hub** is enabled. To use fallback sources like IEEE, MDPI, or ScienceDirect, you must **manually uncomment the relevant function calls** in the script. This allows you to selectively control which sources to attempt.

---

## ✨ What This Step Does

* Loads metadata from the filtered Excel file (`combined_filtered.xlsx` or the updated version from Step 5).
* Attempts to download each paper from **Sci-Hub** first.
* If Sci-Hub fails for a paper, you can optionally enable **fallback downloads** from:

  * IEEE
  * IEEE Search
  * MDPI
  * ScienceDirect (note: may fail due to access restrictions)
* Saves each PDF as `{bibtex_key}.pdf` in the `pdf/` directory for easy tracking and consistent file naming.
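
Because every PDF is named after its BibTeX key, checking which papers still lack a file is straightforward. The helper below is a hypothetical sketch, not part of `download_pdf`:

```python
import os


def split_by_pdf_availability(bibtex_keys, pdf_folder):
    """Return (downloaded, missing) lists of keys based on {key}.pdf files."""
    downloaded, missing = [], []
    for key in bibtex_keys:
        target = os.path.join(pdf_folder, f"{key}.pdf")
        (downloaded if os.path.isfile(target) else missing).append(key)
    return downloaded, missing
```

The `missing` list is a natural input for deciding which fallback sources are worth enabling.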

---

## 🧰 Code Snippet

```python
from download_pdf.download_pdf import (
    run_pipeline,
    process_scihub_downloads,
    process_fallback_ieee,
    # process_fallback_ieee_search,
    # process_fallback_mdpi,
    # process_fallback_sciencedirect,
)
from setting.project_path import project_folder
import os

# Project setup
project_review = 'wafer_defect'
path_dic = project_folder(project_review=project_review)
main_folder = path_dic['main_folder']

# Use the filtered Excel from Step 5
file_path = path_dic['csv_path']
output_folder = os.path.join(main_folder, 'pdf')

# Load and categorize data
categories, data_filtered = run_pipeline(file_path)

# Step 1: Attempt to download from Sci-Hub
process_scihub_downloads(categories, output_folder, data_filtered)

# Optional fallback sources — uncomment as needed:
# Step 2: Try IEEE fallback if Sci-Hub fails
# process_fallback_ieee(categories, data_filtered, output_folder)

# Step 3: Use IEEE Search-based fallback
# process_fallback_ieee_search(categories, data_filtered, output_folder)

# Step 4: Try MDPI fallback
# process_fallback_mdpi(categories, data_filtered, output_folder)

# Step 5: Try ScienceDirect fallback (limited due to restrictions)
# process_fallback_sciencedirect(categories, data_filtered, output_folder)

# Optional: Save updated Excel with download statuses
# save_data(data_filtered, file_path)
```

---

## 🗂️ Example Folder Structure

```
wafer_defect/
├── database/
│   ├── scopus/
│   └── combined_filtered.xlsx         ← Filtered metadata with BibTeX keys (updated in Step 5)
├── pdf/
│   ├── smith_2020.pdf                 ← Saved using bibtex_key
│   ├── kim_2019.pdf
│   └── ...
```

---

In [None]:
import os

from download_pdf.download_pdf import run_pipeline, process_scihub_downloads, process_fallback_ieee
from setting.project_path import project_folder

# Define your project
project_review = 'corona_discharge'

# Load project paths
path_dic = project_folder(project_review=project_review)
main_folder = path_dic['main_folder']

file_path = path_dic['database_path']
output_folder = os.path.join(main_folder, 'pdf')

# Run the main pipeline to load and categorize the data
categories, data_filtered = run_pipeline(file_path)

# First, always attempt the PDF downloads through Sci-Hub
process_scihub_downloads(categories, output_folder, data_filtered)

# Fallback options for entries not available via Sci-Hub:
# Uncomment the following lines one by one to try specific sources,
# and import the matching function from download_pdf.download_pdf first

# Uncomment to attempt fallback download from IEEE
# process_fallback_ieee(categories, data_filtered, output_folder)

# Uncomment to attempt fallback download using IEEE Search
# process_fallback_ieee_search(categories, data_filtered, output_folder)

# Uncomment to attempt fallback download from MDPI
# process_fallback_mdpi(categories, data_filtered, output_folder)

# Uncomment to attempt fallback download from ScienceDirect
# Note: ScienceDirect URLs can be extracted but PDFs may not be downloadable due to security restrictions
# process_fallback_sciencedirect(categories, data_filtered, output_folder)

# Uncomment to save the updated data to Excel after processing
# save_data(data_filtered, file_path)

# 🧾 **Seventh Step: Convert PDFs to XML using GROBID (OPTIONAL)**

After downloading the PDFs, the next step is to convert them into structured **TEI XML** format using [**GROBID**](https://grobid.readthedocs.io). This step enables downstream tasks like metadata extraction, reference parsing, and full-text analysis.

---

## 🧰 What This Step Does

* Processes all PDF files in your `pdf/` directory.
* Sends the PDFs to GROBID's REST API (`/api/processFulltextDocument`).
* Saves the resulting XML files into the `xml/` folder (one `.xml` per `.pdf`).
* Leverages Docker for fast, isolated execution.

---

## ⚙️ Setup Requirements

> 🛠️ **GROBID requires WSL + Docker on Windows**

* You must have **WSL** installed (tested on **WSL2 with Ubuntu 22.04**).
* You must have **Docker** installed and **running** before launching GROBID.

---

## 🐳 How to Install & Run GROBID

1. **Pull the Docker image from Docker Hub**
   Check for the [latest version](https://hub.docker.com/r/grobid/grobid/tags), or use the stable one:

   ```bash
   docker pull grobid/grobid:0.8.1
   ```

2. **Start the GROBID container in Ubuntu (WSL)**

   Open your Ubuntu terminal and run:

   ```bash
   docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.1
   ```

   > ✅ This exposes GROBID's REST API on `http://localhost:8070/`. On a machine without an NVIDIA GPU, omit `--gpus all`.

3. **Test it in your browser**

   Open a browser (e.g., Firefox or Chrome) and navigate to:

   ```
   http://localhost:8070/
   ```

   You should see the GROBID interface.

---

## 🚀 Batch Conversion Command

Once the GROBID service is running, you can convert all PDFs in your `pdf/` folder to XML using:

```bash
# From your project root (in WSL/Ubuntu)
cd path/to/your/project

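# Create the output folder if it does not exist
mkdir -p xml

# Send each PDF to GROBID and save the TEI XML under the same base name
for f in pdf/*.pdf; do
  curl -s --form "input=@${f}" localhost:8070/api/processFulltextDocument -o "xml/$(basename "${f}" .pdf).xml"
done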
```

Or, use a Python wrapper or script to iterate over PDFs and call GROBID’s REST API for more control.
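
A Python version of that loop could look like the sketch below. It assumes the third-party `requests` package and a GROBID server at `http://localhost:8070`; the function names and the choice to skip PDFs that already have an XML file are illustrative.

```python
import os


def grobid_fulltext(pdf_path, base_url="http://localhost:8070"):
    """POST one PDF to GROBID and return the TEI XML text (needs `requests`)."""
    import requests  # third-party; imported lazily
    with open(pdf_path, "rb") as fh:
        response = requests.post(
            f"{base_url}/api/processFulltextDocument",
            files={"input": fh},
            timeout=300,
        )
    response.raise_for_status()
    return response.text


def convert_folder(pdf_folder, xml_folder, convert=grobid_fulltext):
    """Convert every PDF that has no XML yet; return the list of files written."""
    os.makedirs(xml_folder, exist_ok=True)
    written = []
    for name in sorted(os.listdir(pdf_folder)):
        stem, ext = os.path.splitext(name)
        target = os.path.join(xml_folder, f"{stem}.xml")
        if ext.lower() != ".pdf" or os.path.exists(target):
            continue
        xml_text = convert(os.path.join(pdf_folder, name))
        with open(target, "w", encoding="utf-8") as fh:
            fh.write(xml_text)
        written.append(target)
    return written
```

Passing the converter as an argument keeps the folder loop testable without a running server, and skipping existing XML files lets an interrupted run resume where it stopped.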

---

## 🗂️ Folder Structure After Conversion

```
wafer_defect/
├── pdf/
│   ├── smith_2020.pdf
│   └── kim_2019.pdf
├── xml/
│   ├── smith_2020.xml
│   └── kim_2019.xml
```


# 🧾 **Eighth Step (Optional): Convert XML to JSON**

This step converts GROBID-generated TEI XML files into structured JSON format. While optional, it can be helpful for reviewing document content, integrating into other tools, or preparing data to feed into an LLM.

> 📝 **Note:** This step is **optional** — the main pipeline (`run_llm`) reads directly from XML. Use this conversion if you want to inspect or process JSON files instead.

---

## ✨ What This Step Does

* Reads all `*.xml` files from the `xml/` directory.
* Converts each into a corresponding `*.json` file (preserving the **BibTeX key as filename** for consistency).
* Stores all JSON outputs in `xml/json/`.

In addition, it handles and organizes special cases:

* 📁 **`xml/json/no_intro_conclusion/`**: XML files where GROBID could not detect an *introduction* or *conclusion* section.
* 📁 **`xml/json/untitled_section/`**: XML files where GROBID could not detect any section titles at all — these require manual checking.
* 📄 Other successfully processed files are stored directly in `xml/json/`.
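
The sorting into these folders can be pictured with the sketch below, which reads section headings from a TEI document and picks a destination. It is not the project's `xml_json` code: the rule that both an introduction and a conclusion heading must be present, and the keyword matching itself, are assumptions.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def section_titles(tei_xml):
    """Return the text of every <head> directly under a body <div>."""
    root = ET.fromstring(tei_xml)
    heads = root.findall(".//tei:body//tei:div/tei:head", TEI_NS)
    return [" ".join("".join(h.itertext()).split()) for h in heads]


def classify_sections(titles):
    """Return the assumed output subfolder ('' means xml/json/ itself)."""
    if not titles:
        return "untitled_section"
    lowered = [t.lower() for t in titles]
    has_intro = any("introduction" in t for t in lowered)
    has_conclusion = any("conclusion" in t for t in lowered)
    return "" if (has_intro and has_conclusion) else "no_intro_conclusion"
```

Files that land in either special folder are the ones worth opening by hand, since the section structure GROBID recovered is incomplete.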

---

## ▶️ Example Code

```python
from setting.project_path import project_folder
from grobid_tei_xml.xml_json import run_pipeline

project_review = 'corona_discharge'

# Load project paths
path_dic = project_folder(project_review=project_review)

# Convert all XML files in the specified folder to JSON
run_pipeline(path_dic['xml_path'])
```

---

## 🗂️ Example Folder Structure After Conversion

```
corona_discharge/
├── pdf/
│   ├── smith_2020.pdf
│   └── kim_2019.pdf
├── xml/
│   ├── smith_2020.xml
│   ├── kim_2019.xml
│   └── json/
│       ├── smith_2020.json
│       ├── kim_2019.json
│       ├── no_intro_conclusion/
│       │   └── failed_paper1.json
│       └── untitled_section/
│           └── failed_paper2.json
```

In [None]:
from setting.project_path import project_folder
from grobid_tei_xml.xml_json import run_pipeline

project_review = 'corona_discharge'

# Load project paths
path_dic = project_folder(project_review=project_review)
run_pipeline(path_dic['xml_path'])

# 🧾 **Ninth Step: Extract Methodology Details Using LLM**

At this stage, your reference list is filtered and the corresponding PDFs (or abstracts) are available. Now, the focus shifts to **extracting key methodological insights** from each paper, such as:

* 🧠 Classification algorithms
* 🛠️ Feature engineering approaches
* 📏 Evaluation metrics

This is achieved using a specialized **LLM agent** with a targeted prompt for methodology extraction.

> 📌 This step works with **full manuscripts (PDF)** when available, and **falls back to abstracts** if no PDF exists. This flexibility ensures comprehensive analysis even with incomplete data.

---

## ✨ What This Step Does

* Loads your filtered Excel or CSV file from Step 5 or 6.
* For each relevant paper:

  * If the PDF is available: extracts methodology from the full text.
  * If the PDF is **not** available: extracts from the **abstract** instead.
* Uses a domain-specific LLM prompt to analyze methodological content.
* Appends the results to the existing Excel or CSV file.
* Saves per-paper structured JSON files for advanced or customized usage.
* Handles backups automatically if overwriting the metadata file.

> ⛑️ **Safety Note**: If `'overwrite_csv': True`, a **timestamped backup** of the original `.csv` or `.xlsx` file is automatically created **in the same folder** before any updates are made. This prevents accidental corruption and allows for recovery or version tracking.
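
A timestamped backup of this kind can be sketched in a few lines. The helper name `backup_file` is hypothetical, and the name pattern simply mirrors the example file `combined_filtered_backup_20250508_1530.xlsx` shown in this step.

```python
import os
import shutil
from datetime import datetime


def backup_file(path, now=None):
    """Copy path to '<stem>_backup_<YYYYmmdd_HHMM><ext>' in the same folder."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M")
    stem, ext = os.path.splitext(path)
    backup_path = f"{stem}_backup_{stamp}{ext}"
    shutil.copy2(path, backup_path)  # copy2 also preserves file metadata
    return backup_path
```

Calling it right before any in-place write gives a restore point for each run.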

---

## 📄 Prompt Purpose

This step uses a **domain-aware analytical agent** designed to:

> “Extract methodological details (e.g., classification algorithms, feature engineering, evaluation metrics) from filtered papers relevant to a specific machine learning task.”

The prompt is defined in a YAML config file (`agent_ml.yaml`) and is tailored by the agent name you provide.

---

## 🗂️ Folder and Output Structure

Your project directory may look like this after completing this step:

```
wafer_defect/
├── database/
│   ├── scopus/
│   ├── combined_filtered.xlsx                  ← Updated metadata with extraction results
│   └── combined_filtered_backup_20250508_1530.xlsx  ← Auto-generated backup (if overwrite_csv=True)
├── pdf/
│   ├── smith_2020.pdf                              ← Saved using bibtex_key
│   ├── kim_2019.pdf
│   └── ...
├── methodology_gap_extractor/
│   ├── json_output/
│   │   └── smith_2020.json
│   ├── multiple_runs_folder/
│   └── final_cross_check_folder/
```

---

## 🧠 Supported Models

Choose from the following supported LLMs:

* `"gpt-4o"`
* `"gpt-4o-mini"`
* `"gpt-o3-mini"`
* `'gemini-1.5-pro'`
* `'gemini-exp-1206'`
* `'gemini-2.0-flash-thinking-exp-01-21'`

---

## ⚠️ Important Reminders

* ✅ Set `'used_abstract': True` in `process_setup` if some papers lack full PDFs.
* 🛑 **Always verify** extracted methodologies manually before using them in analysis, models, or publication. LLMs can hallucinate or misinterpret technical details.

---

In [None]:
from research_filter.auto_llm import run_pipeline
from setting.project_path import project_folder
import os

# Define the project
project_review = 'wafer'
# Select your LLM model. For this step, a more powerful model with reasoning capability,
# such as `gpt-4o` or `gemini-2.0-flash-thinking-exp-01-21`, is recommended for better results.


model_name = 'gpt-4o-mini'
agent_name = "methodology_gap_extractor"
path_dic = project_folder(project_review=project_review)
main_folder = path_dic['main_folder']

base_dir = os.getcwd()
# Construct the relative path
yaml_path = os.path.join(base_dir, "research_filter", "agent", "agent_ml.yaml")

# Agent and config setup
agentic_setting = {
    "agent_name": agent_name,
    "column_name": "methodology_gap",
    "yaml_path": yaml_path,
    "model_name": model_name
}

# Output directories
methodology_json_folder = os.path.join(main_folder, agent_name, 'json_output', model_name)
multiple_runs_folder = os.path.join(main_folder, agent_name, 'multiple_runs_folder')
final_cross_check_folder = os.path.join(main_folder, agent_name, 'final_cross_check_folder')

# Excel input
csv_path = path_dic['database_path']

# Topic placeholders (adjust per project)
placeholders = {
    "topic": "wafer defect classification",
    "topic_context": "semiconductor manufacturing and inspection"
}

# LLM runtime configuration
process_setup = {
    'batch_process': False,
    'manual_paste_llm': False,
    'iterative_confirmation': False,
    'overwrite_csv': True,
    'cross_check_enabled': False,
    'cross_check_runs': 3,
    'cross_check_agent_name': 'agent_cross_check',
    'cleanup_json': False,
    'used_abstract': True  # Always True to enable fallback to abstract if PDF is missing
}

# Run the pipeline
run_pipeline(
    agentic_setting,
    process_setup,
    placeholders=placeholders,
    csv_path=csv_path,
    main_folder=main_folder,
    methodology_json_folder=methodology_json_folder,
    multiple_runs_folder=multiple_runs_folder,
    final_cross_check_folder=final_cross_check_folder,
)

# 🧾 **Tenth Step: Draft Literature Review (Chapter 2) Using Combined JSON**

Once you've extracted methodological insights in Step 9, the next logical move is to **start organizing and drafting your literature review** (typically Chapter 2 of a thesis or paper). This step outlines how to **combine extracted results** into a single, structured file — and how to use that file to produce high-quality summaries, tables, and narrative drafts using an LLM.

> 📌 This is a branching step — not everyone will follow this exact path — but it's a powerful way to move from **structured data → written draft** efficiently.

---

## ✨ What This Step Does

* Collects individual methodology JSON files.
* Merges them into a single, unified JSON (`combined_output.json`).
* Prepares this JSON as input for:

  * Google AI Studio (e.g., Gemini Pro)
  * GPT-based agents
  * Jupyter notebooks or drafting pipelines
* Enables:

  * Narrative generation for Chapter 2
  * Thematic clustering of methods
  * Structured tables summarizing key findings
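
The merge itself is simple. The sketch below is a hypothetical stand-in for the project's `combine_json_files`, assuming one JSON file per paper and an output keyed by file name without extension:

```python
import json
import os


def combine_json_folder(input_directory, output_file):
    """Merge every *.json file in a folder into one JSON object keyed by file stem."""
    combined = {}
    for name in sorted(os.listdir(input_directory)):
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".json":
            continue
        with open(os.path.join(input_directory, name), encoding="utf-8") as fh:
            combined[stem] = json.load(fh)
    with open(output_file, "w", encoding="utf-8") as fh:
        json.dump(combined, fh, indent=2, ensure_ascii=False)
    return combined
```

Writing the output outside the input folder avoids the combined file being picked up as an input on the next run.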

---

## 🧰 Code Example: Combine JSONs

```python
import os
from setting.project_path import project_folder
from helper.combine_json import combine_json_files

project_review = 'corona_discharge'
path_dict = project_folder(project_review=project_review)

input_dir = os.path.join(
    path_dict['main_folder'],
    "methodology_gap_extractor_partial_discharge",
    "json_output",
    "gemini-2.0-flash-thinking-exp-01-21_updated",
)
output_file = os.path.join(path_dict['main_folder'], 'combined_output.json')

if not os.path.exists(input_dir):
    raise FileNotFoundError(f"Input directory not found: {input_dir}")

combine_json_files(
    input_directory=input_dir,
    output_file=output_file
)

print(f"Combined JSON saved to: {output_file}")
```

---

## 🗂️ Output Structure Example

```
corona_discharge/
├── combined_output.json      ← ✅ Master file for summarization and drafting
├── methodology_gap_extractor_partial_discharge/
│   └── json_output/
│       └── *.json            ← Individual extracted documents
```

---

## 📊 Using LLM to Generate Summary Tables

With the `combined_output.json`, you can prompt an LLM to create structured tables summarizing key findings. For example:

> “Using the combined JSON, generate a table with the following columns:
> **Author**, **Machine Learning Algorithm**, **Feature Types**, **Dataset Used**, and **Performance Metrics**.”

This gives you a **quick-glance overview** of the landscape, helpful both for understanding trends and citing clearly in your literature review.

### ✅ Sample Columns for Summary Table:

* **Author / Year**
* **ML Algorithm(s) Used**
* **Features**
* **Dataset / Source**
* **Performance (Accuracy, F1, etc.)**

---

## ✍️ Drafting Strategy Tips

You can also use LLM prompts like:

> “Based on the combined JSON, write a summary paragraph comparing the top 3 machine learning techniques used for partial discharge classification.”

Or:

> “Generate an introduction section discussing the evolution of feature engineering techniques in this domain.”

---

## ⚠️ Important Reminders

* ✅ Make sure all JSON structures are consistent before combining.
* 📉 Segment or cluster the JSON by subdomain if the file becomes too large.
* 🔍 Review all generated tables and text — LLMs are **tools**, not final authorities.
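
When the combined file is too large for a single prompt, one simple option is to split it into fixed-size batches. Note that this splits by count, not by subdomain, and that `chunk_records` and the dict-of-papers layout are assumptions:

```python
def chunk_records(combined, size):
    """Split a dict of per-paper records into smaller dicts of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    keys = list(combined)
    return [
        {key: combined[key] for key in keys[start:start + size]}
        for start in range(0, len(keys), size)
    ]
```

Each batch can then be sent to the LLM separately and the partial tables merged afterwards.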

---

In [None]:
import os
from setting.project_path import project_folder
from helper.combine_json import combine_json_files

project_review = 'corona_discharge'
path_dict = project_folder(project_review=project_review)

input_dir = os.path.join(
    path_dict['main_folder'],
    "methodology_gap_extractor_partial_discharge",
    "json_output",
    "gemini-2.0-flash-thinking-exp-01-21_updated",
)
output_file = os.path.join(path_dict['main_folder'], 'combined_output.json')

if not os.path.exists(input_dir):
    raise FileNotFoundError(f"Input directory not found: {input_dir}")

combine_json_files(
    input_directory=input_dir,
    output_file=output_file
)

print(f"Combined JSON saved to: {output_file}")


# 🧾 **Eleventh Step: Export Filtered Excel to BibTeX**

After reviewing and filtering your `combined_filtered.xlsx` file, you can convert the refined list of papers back into a **BibTeX file**. This can be helpful for citation management or integration with tools like LaTeX or reference managers.

---

## ✨ What This Step Does

* Loads the filtered Excel file containing your selected papers
* Converts the data back into BibTeX format
* Saves the result to a `.bib` file for easy reuse or citation


## 📁 File Structure Example

```
wafer_defect/
├── database/
│   ├── scopus/
│   ├── combined_filtered.xlsx                  ← Filtered Excel file with selected papers
│   ├── combined_filtered_backup_20250508_1530.xlsx  ← Auto backup if overwrite is enabled
│   └── filtered_output.bib                     ← Newly generated BibTeX file
├── pdf/
│   ├── smith_2020.pdf                              ← Full-text PDFs saved by bibtex_key
│   ├── kim_2019.pdf
│   └── ...
├── methodology_gap_extractor_wafer_defect/
│   ├── json_output/
│   │   └── smith_2020.json                         ← Extracted methodology as structured JSON
│   ├── multiple_runs_folder/
│   └── final_cross_check_folder/
```

---

## 🛠️ Notes

* Ensure the Excel file has standard bibliographic columns like: `title`, `author`, `year`, `journal`, `doi`, etc.
* The function `generate_bibtex()` maps these fields into valid BibTeX entries.
* You can open the `.bib` file in any text editor or reference manager to confirm the results.
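
For orientation, mapping one record to a BibTeX entry can be as small as the sketch below. It is not the project's `generate_bibtex`: the `bibtex_key` column name, the fixed `article` entry type, and the rule of skipping empty fields are assumptions.

```python
def row_to_bibtex(row, key_field="bibtex_key", entry_type="article"):
    """Format one record (a dict) as a BibTeX entry string."""
    fields = [
        f"  {name} = {{{value}}}"
        for name, value in row.items()
        if name != key_field and value not in (None, "")
    ]
    return f"@{entry_type}{{{row[key_field]},\n" + ",\n".join(fields) + "\n}"
```

Rows read from Excel with pandas may contain `NaN` for empty cells, so those would need to be replaced (for example with empty strings) before formatting.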

In [None]:
import os
import pandas as pd
from post_code_saviour.excel_to_bib import generate_bibtex
from setting.project_path import project_folder

# Define your project
project_review = 'corona_discharge'

# Load project paths
path_dic = project_folder(project_review=project_review)
main_folder = path_dic['main_folder']

# Define input and output paths
input_excel = os.path.join(main_folder, 'database', 'combined_filtered.xlsx')
output_bib = os.path.join(main_folder, 'database', 'filtered_output.bib')

# Load the filtered Excel file
df = pd.read_excel(input_excel)

# Generate BibTeX file
generate_bibtex(df, output_file=output_bib)