# 📦 **Step 1: Install Required Packages**

Before running the notebook, make sure to **clone the repository** and **install all required dependencies** directly in your Colab environment.

---

## 📥 Clone the Repository

```python
!git clone https://github.com/balandongiv/academic_paper_maker.git
%cd academic_paper_maker
```

---

## 🧰 Install Dependencies

Install all the necessary Python packages listed in `requirements.txt`:

```python
!pip install -r requirements.txt
!pip install openai
```

> ✅ **Note:** `openai` is installed separately in case it’s not included in the `requirements.txt`.
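
To see whether a package such as `openai` is already available before installing it again, a quick check along these lines can help. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name and is not part of the repository.

```python
from importlib.util import find_spec


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return find_spec(package_name) is not None


# Report which packages still need a `pip install`.
# "dotenv" is the import name of the python-dotenv distribution.
for name in ("openai", "dotenv"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```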

---


In [None]:
!git clone https://github.com/balandongiv/academic_paper_maker.git
%cd academic_paper_maker
!pip install -r requirements.txt
# !pip install python-dotenv
!pip install openai

# 🔧 Step 2: Create and Register Your Project Folder

In this step, you'll register the **main project folder**, so the system knows where to store and retrieve all files related to your project.

This setup automatically creates a structured folder system under your specified `main_folder`, and stores the configuration in a central JSON file. This ensures your projects remain organized, especially when working with multiple studies or datasets.

---

## 🗂️ Project Folder Structure

On the first run, the following folder structure will be automatically created under your specified `main_folder`:

```
main_folder/
├── database/
│   ├── scopus/
│   └── <project_name>_database.xlsx  ← auto-generated path (the file itself is not created)
├── pdf/
└── xml/
```

---

## 💡 Example

Suppose your project name is `corona_discharge`, and you want to store all project files under:

```
G:\My Drive\research_related\ear_eog
```

You can register this setup by running:

```python
project_folder(
    project_review='corona_discharge',
    main_folder=r'G:\My Drive\research_related\ear_eog'
)
```

✅ This will:

* Save the project path in `setting/project_folders.json`
* Create the full folder structure: `database/scopus`, `pdf`, and `xml`

---

## 🔁 Loading the Project Later

Once registered, you can access the project folders in future sessions by providing just the `project_review` name:

```python
paths = project_folder(project_review='corona_discharge')

print(paths['main_folder'])  # Main project folder
print(paths['database_path'])  # Path to the Excel database file
```

---

## ⚙️ What Happens in the Background?

* A file named `project_folders.json` is stored in the `setting/` directory of the cloned repository.
* It maps each `project_review` to its corresponding `main_folder`.
* Folder structure is created automatically on the first run.
* On subsequent runs, the system reads the JSON to locate your project — no need to re-enter paths.
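
The behaviour listed above can be pictured with a small registry sketch. This is not the repository's `project_folder` implementation; `register_project` and its return value are hypothetical, and only the folder names and the JSON mapping come from the description above.

```python
import json
from pathlib import Path


def register_project(config_file, project_review, main_folder=None):
    """Store or look up a project's main folder and create its subfolders."""
    config_path = Path(config_file)
    registry = json.loads(config_path.read_text()) if config_path.exists() else {}

    if main_folder is not None:
        # First run: remember where this project lives
        registry[project_review] = str(main_folder)
        config_path.parent.mkdir(parents=True, exist_ok=True)
        config_path.write_text(json.dumps(registry, indent=2))
    elif project_review not in registry:
        raise KeyError(f"Project '{project_review}' has not been registered yet")

    root = Path(registry[project_review])
    # Create the documented layout; existing folders are left untouched
    for sub in ("database/scopus", "pdf", "xml"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    return {"main_folder": str(root)}
```

Calling it once with `main_folder` registers the project; later calls with only the project name read the path back from the JSON file.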


In [None]:
from academic_paper_maker.setting.project_path import project_folder

cfg = {
    "project_review": "corona_discharge",
    "env_path": "/content/my_project/.env",
    "main_folder": "/content/my_project",  # Use a Unix-style path for Colab
    "project_root": "/content/my_project/database",
    "yaml_path": "/content/academic_paper_maker/research_filter/agent/abstract_check.yaml",
    "config_file": "/content/academic_paper_maker/setting/project_folders.json",
    "methodology_gap_extractor_path": "/content/my_project/methodology_gap_extractor/json_output",
}


project_folder(project_review=cfg['project_review'], main_folder=cfg['main_folder'])

---

# 🔐 **Step 3: Create a `.env` File for API Keys**

To securely manage API access, you'll create a `.env` file inside your project directory. This file stores your credentials for services like **OpenAI** and **Gemini (Google)** — without exposing them directly in your code.

---

## 🧾 What This Step Does

* Automatically creates a `.env` file at the location defined by `cfg["env_path"]`
* Supports **both** `OPENAI_API_KEY` and `GEMINI_API_KEY` (a.k.a. `GOOGLE_API_KEY`)
* Adds clear comments so users understand the purpose of each key
* Compatible with libraries like `python-dotenv` for seamless key loading

You can:

* Create a **blank template** for manual editing
* Or **prefill** the keys you already have when calling the function

---

## ✍️ How to Edit or Prefill Keys

After running the code cell:

1. Click the 📁 **folder icon** on the left sidebar in Colab
2. Navigate to your project folder (e.g., `/content/my_project`)
3. Locate the `.env` file and **right-click → Edit**
4. Enter your API keys like this:

```env
# OpenAI GPT-4 / GPT-4o
OPENAI_API_KEY=sk-...

# Google Gemini (AI Studio)
GEMINI_API_KEY=AIzaSy...
```

> 💡 Some third-party SDKs read this key from `GOOGLE_API_KEY` instead of `GEMINI_API_KEY`; both names refer to the same Google AI Studio key.

---

## 🛡️ Best Practices

* 🔒 **Never share or commit** your `.env` file
* 📂 Make sure `.env` is included in `.gitignore`
* ✅ Use `load_dotenv()` to load these variables at runtime without hardcoding

> This step ensures your API keys are securely stored, portable across environments, and easy to manage.
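
In practice `load_dotenv()` from `python-dotenv` does the loading. To show what happens underneath, here is a simplified standard-library stand-in; `read_env_file` and `load_keys` are hypothetical names, and the parser only handles plain `KEY=VALUE` lines, not the full `.env` syntax.

```python
import os
from pathlib import Path


def read_env_file(env_path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in Path(env_path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_keys(env_path, names=("OPENAI_API_KEY", "GEMINI_API_KEY")):
    """Copy the listed keys into os.environ; return the names that are still empty."""
    values = read_env_file(env_path)
    missing = []
    for name in names:
        if values.get(name):
            os.environ[name] = values[name]
        else:
            missing.append(name)
    return missing
```

A call such as `load_keys(cfg["env_path"])` returns the names of keys that still need to be filled in, which is a convenient check before running any LLM step.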

---

In [None]:

from academic_paper_maker.helper.google_colab import create_env_file

# Option 1: create a blank template and fill in the keys manually afterwards
create_env_file(cfg["env_path"])

# Option 2: prefill keys you already have (uncomment and replace the placeholders)
# create_env_file(cfg["env_path"],
#                 openai_api_key="sk-abc123",
#                 gemini_api_key="AIzaSy...")



---

# ⚙️ **Step 3a: Load API Key and Test OpenAI Connection**

After setting up your `.env` file, it's important to **verify that your OpenAI API key is working correctly**. This step uses a built-in helper function to test the connection.

---

## 🔄 What This Step Does

* Loads your API key from the `.env` file
* Initializes an OpenAI client with that key
* Sends a simple test message to the GPT model (e.g., `gpt-4o`)
* Confirms success or provides a clear error message if something goes wrong

---

## ✅ How to Run It

Just call the helper function:

```python
from academic_paper_maker.helper.google_colab import test_openai_connection

test_openai_connection()
```

---

## 📌 Expected Outcome

If everything is set up correctly, you'll see output like:

```
✅ API call successful!
🤖 Response: 2 + 2 is 4.
```

---

## ⚠️ If Something Goes Wrong

Common errors and their meanings:

| Error Type             | What It Means                                 |
| ---------------------- | --------------------------------------------- |
| ❌ AuthenticationError  | API key is missing or incorrect               |
| ⚠️ RateLimitError      | You’re sending too many requests too quickly  |
| 📡 APIConnectionError  | Network issue or OpenAI server is unreachable |
| 🚫 InvalidRequestError | Incorrect model name or bad request structure (raised as `BadRequestError` in `openai` ≥ 1.0) |

Make sure your `.env` file includes a valid key in this format:

```env
OPENAI_API_KEY=sk-...
```

> 🔁 Re-run the test after fixing the `.env` or internet connection if needed.
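
If you call the API yourself rather than through the helper, the table above can be turned into a small lookup. This is a sketch only: `explain_error` is a hypothetical function, it matches exceptions by class name so that it runs without importing `openai`, and it is not how `test_openai_connection` reports errors.

```python
ERROR_HINTS = {
    "AuthenticationError": "API key is missing or incorrect; check OPENAI_API_KEY in your .env file.",
    "RateLimitError": "Too many requests in a short time; wait a moment and retry.",
    "APIConnectionError": "Network issue or the OpenAI server is unreachable.",
    "InvalidRequestError": "Incorrect model name or bad request structure.",
    # openai >= 1.0 raises BadRequestError where older versions raised InvalidRequestError
    "BadRequestError": "Incorrect model name or bad request structure.",
}


def explain_error(exc: Exception) -> str:
    """Map an exception to the matching hint from the table, by class name."""
    name = type(exc).__name__
    return ERROR_HINTS.get(name, f"Unexpected error ({name}): {exc}")


# Typical use (left as a comment so the sketch runs without network access):
# try:
#     test_openai_connection()
# except Exception as exc:
#     print(explain_error(exc))
```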

---


In [None]:
from academic_paper_maker.helper.google_colab import test_openai_connection

print("🔍 Testing OpenAI connection... Please wait.\n")

# Run the test
test_openai_connection()

print("\n✅ If the connection is successful, you should see a response from the assistant above.")
print("⚠️ If not, please check your `.env` file and verify that your `OPENAI_API_KEY` is correct.")


In [None]:
from academic_paper_maker.setting.project_path import project_folder

project_name = 'corona_discharge'
main_folder = '/content/my_project'  # Use a Unix-style path for Colab

project_folder(project_review=project_name, main_folder=main_folder)

# 📥 **Step 4: Download the Scopus BibTeX File**

In this step, you'll use the **Scopus database** to find and download relevant papers for your project. Scopus is a comprehensive and widely-used repository of peer-reviewed academic literature.

---

## 🔍 Using Scopus Advanced Search

To retrieve high-quality and relevant papers, we recommend using **Scopus' Advanced Search** feature. This powerful tool lets you refine your search based on:

* Keywords
* Authors
* Publication dates
* Document types
* And more...

This ensures that your literature collection is both targeted and comprehensive.

---

## 💡 Get Keyword Ideas with a Prompt

To help you formulate effective search queries, you can use the following **prompt-based suggestion tool**:

👉 [Keyword Search Prompt](https://gist.github.com/balandongiv/886437963d38252e61634ddc00b9d983)

You may need to modify the prompt to better suit your research domain. Here are some example domains:

* `"corona discharge"`
* `"fatigue driving EEG"`
* `"wafer classification"`

Feel free to add, remove, or tweak keywords as needed to refine your search results.

---

## 💾 Save and Organize Your Results

Once you've finalized your search:

1. In Scopus, **select all available attributes** when exporting the results.

   ![Scopus CSV Export](image/scopus_csv_export.png)

2. Choose the **BibTeX** format when saving the export file.
3. Save the file inside the `database/scopus/` folder of your project.

The resulting folder structure might look like this:

```
main_folder/
├── database/
│   └── scopus/
│       ├── scopus(1).bib
│       ├── scopus(2).bib
│       └── scopus(3).bib
```

Make sure the BibTeX files are correctly named and stored to ensure smooth integration in later steps.
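
Before combining the exports, it can be useful to confirm that every `.bib` file was saved in the right folder and actually contains entries. The sketch below gives a rough count per file with a regular expression; `count_bib_entries` is a hypothetical helper, and the count also includes special blocks such as `@comment{...}` if a file has them.

```python
import re
from pathlib import Path

# An entry starts with "@type{" at the beginning of a line
ENTRY_START = re.compile(r"^\s*@\w+\s*\{", re.MULTILINE)


def count_bib_entries(scopus_folder):
    """Return {file name: number of BibTeX entries} for every .bib file in the folder."""
    counts = {}
    for bib_file in sorted(Path(scopus_folder).glob("*.bib")):
        text = bib_file.read_text(encoding="utf-8", errors="replace")
        counts[bib_file.name] = len(ENTRY_START.findall(text))
    return counts
```

An empty dictionary means no `.bib` files were found in the folder, and a count of zero points to an empty or malformed export.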

In [None]:

from academic_paper_maker.download_pdf.database_preparation import combine_scopus_bib_to_excel
from academic_paper_maker.setting.project_path import project_folder

# project_review='corona_discharge'
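# Look up the registered paths for this project
path_dic = project_folder(project_review=cfg['project_review'])
folder_path = path_dic['scopus_path']      # folder holding the exported .bib files
output_excel = path_dic['database_path']   # path of the Excel database file

# Merge every Scopus .bib file into a single Excel file
combine_scopus_bib_to_excel(folder_path, output_excel)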

In [None]:
import os
from academic_paper_maker.download_pdf.database_preparation import combine_scopus_bib_to_excel
from academic_paper_maker.setting.project_path import project_folder

project_review='corona_discharge'
path_dic=project_folder(project_review=project_review)
folder_path=path_dic['scopus_path']
output_excel =  path_dic['database_path']
combine_scopus_bib_to_excel(folder_path, output_excel)

In [None]:
import os
import sys

# Directory that contains the "research_filter" package, i.e. the cloned repository.
# Assumes the clone location used in Step 1.
repo_root = "/content/academic_paper_maker"

# Add it to Python's module search path before importing from it
if repo_root not in sys.path:
    sys.path.append(repo_root)

from academic_paper_maker.setting.project_path import project_folder
from research_filter.auto_llm import run_pipeline

path_dic = project_folder(project_review=cfg['project_review'], config_file=cfg['config_file'])
main_folder = path_dic['main_folder']

# Choose your LLM. You can try cheaper or more capable models and compare the results.
model_name = "gpt-4o-mini"

placeholders = {
    "topic": "wafer defect classification",
    "topic_context": "semiconductor manufacturing and inspection"
}
base_dir = os.getcwd()
# yaml_path ='/content/academic_paper_maker/research_filter/agent/abstract_check.yaml'
# Agent configuration
agentic_setting = {
    "agent_name": "abstract_filter",   # Agent name as defined in the YAML file
    # Excel column that stores the result (True or False). Keep this name for
    # this step: later clean-up tasks look for the LLM output in this column.
    "column_name": "abstract_filter",
    "yaml_path": cfg['yaml_path'],     # Path to the YAML file that defines the agent
    "model_name": model_name           # Model to use; pick one of the supported models
}

# Paths and folders
csv_path = path_dic['database_path']
methodology_json_folder = os.path.join(main_folder, agentic_setting['agent_name'], 'json_output')
multiple_runs_folder = os.path.join(main_folder, agentic_setting['agent_name'], 'multiple_runs_folder')
final_cross_check_folder = os.path.join(main_folder, agentic_setting['agent_name'], 'final_cross_check_folder')
# LLM runtime settings. If unsure, use the default settings
process_setup = {
    'batch_process': False,
    'manual_paste_llm': False,
    'iterative_confirmation': False,
    'overwrite_csv': True,
    'cross_check_enabled': False,
    'cross_check_runs': 3,
    'cross_check_agent_name': 'agent_cross_check',
    'cleanup_json': False
}

# Run the LLM-based filtering
run_pipeline(
    agentic_setting,
    process_setup,
    placeholders=placeholders,
    csv_path=csv_path,
    main_folder=main_folder,
    methodology_json_folder=methodology_json_folder,
    multiple_runs_folder=multiple_runs_folder,
    final_cross_check_folder=final_cross_check_folder,
)


In [None]:
import sys
import os

# Path to the directory where "research_filter" lives
project_root = "/content/academic_paper_maker"

# Add it to Python's module search path
if project_root not in sys.path:
    sys.path.append(project_root)

from research_filter.auto_llm import run_pipeline

A shortcut for the cell below is `run_llm_filtering_pipeline_colab`.

In [None]:
from research_filter.auto_llm import run_pipeline
from setting.project_path import project_folder
import os

# Define the project
# project_review = 'corona_discharge'
# Select your LLM. A more capable model with reasoning ability, such as `gpt-4o` or
# `gemini-2.0-flash-thinking-exp-01-21`, is recommended for this step.


model_name_method_extractor = 'gpt-4o-mini'
agent_name = "methodology_gap_extractor"

# config_file='/content/academic_paper_maker/setting/project_folders.json'


path_dic = project_folder(project_review=cfg['project_review'],config_file=cfg['config_file'])


# path_dic = project_folder(project_review=project_review)
main_folder = path_dic['main_folder']

base_dir = os.getcwd()
# The methodology agent is defined in agent_ml.yaml; cfg['yaml_path'] points to
# abstract_check.yaml, which belongs to the abstract-filter step.
yaml_path = '/content/academic_paper_maker/research_filter/agent/agent_ml.yaml'

# Agent and config setup
agentic_setting = {
    "agent_name": agent_name,
    "column_name": "methodology_gap",
    "yaml_path": yaml_path,
    "model_name": model_name_method_extractor
}

# Output directories
methodology_json_folder = os.path.join(main_folder, agent_name, 'json_output', model_name_method_extractor)
multiple_runs_folder = os.path.join(main_folder, agent_name, 'multiple_runs_folder')
final_cross_check_folder = os.path.join(main_folder, agent_name, 'final_cross_check_folder')

# Excel input
csv_path = path_dic['database_path']

# Topic placeholders (adjust per project)
placeholders = {
    "topic": "wafer defect classification",
    "topic_context": "semiconductor manufacturing and inspection"
}

# LLM runtime configuration
process_setup = {
    'batch_process': False,
    'manual_paste_llm': False,
    'iterative_confirmation': False,
    'overwrite_csv': True,
    'cross_check_enabled': False,
    'cross_check_runs': 3,
    'cross_check_agent_name': 'agent_cross_check',
    'cleanup_json': False,
    'used_abstract': True  # Always True to enable fallback to abstract if PDF is missing
}

# Run the pipeline
run_pipeline(
    agentic_setting,
    process_setup,
    placeholders=placeholders,
    csv_path=csv_path,
    main_folder=main_folder,
    methodology_json_folder=methodology_json_folder,
    multiple_runs_folder=multiple_runs_folder,
    final_cross_check_folder=final_cross_check_folder,
)

The long cell above can be shortened by importing the Colab helper under an alias. Here's the **shortcut version**:

```python
from academic_paper_maker.helper.google_colab import run_llm_filtering_pipeline_colab as run_llm
```

Now you can call the function like this:

```python
run_llm(model_name="gpt-4o-mini", cfg=cfg, placeholders=placeholders)
```

✅ This keeps your code cleaner while still referencing the full functionality.


In [None]:
from academic_paper_maker.helper.google_colab import run_llm_filtering_pipeline_colab as run_llm

placeholders = {
    "topic": "wafer defect classification",
    "topic_context": "semiconductor manufacturing and inspection"
}

run_llm(model_name="gpt-4o-mini", cfg=cfg, placeholders=placeholders)


# 🧾 **Step 10: Draft Literature Review (Chapter 2) Using Combined JSON**

Once you've extracted methodological insights in Step 9, the next logical move is to **start organizing and drafting your literature review** (typically Chapter 2 of a thesis or paper). This step outlines how to **combine extracted results** into a single, structured file — and how to use that file to produce high-quality summaries, tables, and narrative drafts using an LLM.

> 📌 This is a branching step — not everyone will follow this exact path — but it's a powerful way to move from **structured data → written draft** efficiently.

---

## ✨ What This Step Does

* Collects individual methodology JSON files.
* Merges them into a single, unified JSON (`combined_output.json`).
* Prepares this JSON as input for:

  * Google AI Studio (e.g., Gemini Pro)
  * GPT-based agents
  * Jupyter notebooks or drafting pipelines
* Enables:

  * Narrative generation for Chapter 2
  * Thematic clustering of methods
  * Structured tables summarizing key findings


---

## 🗂️ Output Structure Example

```
corona_discharge/
├── combined_output.json      ← ✅ Master file for summarization and drafting
├── methodology_gap_extractor_partial_discharge/
│   └── json_output/
│       └── *.json            ← Individual extracted documents
```

---

## 📊 Using LLM to Generate Summary Tables

With the `combined_output.json`, you can prompt an LLM to create structured tables summarizing key findings. For example:

> “Using the combined JSON, generate a table with the following columns:
> **Author**, **Machine Learning Algorithm**, **Feature Types**, **Dataset Used**, and **Performance Metrics**.”

This gives you a **quick-glance overview** of the landscape, helpful both for understanding trends and citing clearly in your literature review.

### ✅ Sample Columns for Summary Table:

* **Author / Year**
* **ML Algorithm(s) Used**
* **Features**
* **Dataset / Source**
* **Performance (Accuracy, F1, etc.)**

---

## ✍️ Drafting Strategy Tips

You can also use LLM prompts like:

> “Based on the combined JSON, write a summary paragraph comparing the top 3 machine learning techniques used for partial discharge classification.”

Or:

> “Generate an introduction section discussing the evolution of feature engineering techniques in this domain.”

---

## ⚠️ Important Reminders

* ✅ Make sure all JSON structures are consistent before combining.
* 📉 Segment or cluster the JSON by subdomain if the file becomes too large.
* 🔍 Review all generated tables and text — LLMs are **tools**, not final authorities.
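
The first reminder can be checked programmatically before combining. The sketch below compares the top-level keys of every JSON file in a folder and reports the files that deviate from the most common key set. `find_inconsistent_json` is a hypothetical helper, and it assumes each file holds a single JSON object, which may not match your extractor's output.

```python
import json
from collections import Counter
from pathlib import Path


def find_inconsistent_json(input_dir):
    """List JSON files whose top-level keys differ from the most common key set."""
    key_sets = {}
    for path in sorted(Path(input_dir).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            key_sets[path.name] = None  # unreadable file
            continue
        # Anything that is not a JSON object is treated as inconsistent
        key_sets[path.name] = frozenset(data) if isinstance(data, dict) else None

    valid = [keys for keys in key_sets.values() if keys is not None]
    if not valid:
        return sorted(key_sets)
    expected, _ = Counter(valid).most_common(1)[0]
    return sorted(name for name, keys in key_sets.items() if keys != expected)
```

An empty list means all files share the same top-level structure; any file it names is worth opening before it goes into `combined_output.json`.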

---

In [None]:
import os
from setting.project_path import project_folder
from helper.combine_json import combine_json_files
from pathlib import Path
# project_review = 'corona_discharge'
# config_file='/content/academic_paper_maker/setting/project_folders.json'


path_dict = project_folder(project_review=cfg['project_review'],config_file=cfg['config_file'])

# model_name_method_extractor is the model used in the methodology-gap extraction step.
# Its name selects the subfolder that holds the JSON files to combine, so this path
# changes with the model.
input_dir = Path(cfg['methodology_gap_extractor_path']) / model_name_method_extractor

output_file = os.path.join(path_dict['main_folder'], 'combined_output.json')

if not os.path.exists(input_dir):
    raise FileNotFoundError(f"Input directory not found: {input_dir}")

combine_json_files(
    input_directory=input_dir,
    output_file=output_file
)

print(f"Combined JSON saved to: {output_file}")


The long cell above can be shortened by importing the Colab helper under an alias. Here's the **shortcut version**:

```python
from academic_paper_maker.helper.google_colab import combine_methodology_json_colab as combine_json
```

Now you can call the function like this:

```python
combine_json(cfg=cfg, model_name_method_extractor="gpt-4o-mini")
```

✅ This keeps your code cleaner while still referencing the full functionality.


In [None]:
from academic_paper_maker.helper.google_colab import combine_methodology_json_colab as combine_json
combine_json(cfg=cfg, model_name_method_extractor="gpt-4o-mini")


In [None]:
import os

from download_pdf.download_pdf import run_pipeline, process_scihub_downloads, process_fallback_ieee
from setting.project_path import project_folder

# Define your project
project_review = 'corona_discharge'

# Load project paths
path_dic = project_folder(project_review=project_review)
main_folder = path_dic['main_folder']

file_path = path_dic['database_path']
output_folder = os.path.join(main_folder, 'pdf')

# Run the main pipeline to load and categorize the data
categories, data_filtered = run_pipeline(file_path)

# First, always attempt the PDF downloads through Sci-Hub
process_scihub_downloads(categories, output_folder, data_filtered)

# Fallback options for entries not available via Sci-Hub:
# uncomment the lines below one at a time to try a specific source.
# Only process_fallback_ieee is imported above; import any other function before using it.

# Uncomment to attempt fallback download from IEEE
# process_fallback_ieee(categories, data_filtered, output_folder)

# Uncomment to attempt fallback download using IEEE Search
# process_fallback_ieee_search(categories, data_filtered, output_folder)

# Uncomment to attempt fallback download from MDPI
# process_fallback_mdpi(categories, data_filtered, output_folder)

# Uncomment to attempt fallback download from ScienceDirect
# Note: ScienceDirect URLs can be extracted but PDFs may not be downloadable due to security restrictions
# process_fallback_sciencedirect(categories, data_filtered, output_folder)

# Uncomment to save the updated data to Excel after processing
# save_data(data_filtered, file_path)

In [None]:
import os
import pandas as pd
from post_code_saviour.excel_to_bib import generate_bibtex
from setting.project_path import project_folder


# Load project paths
path_dic = project_folder(project_review=cfg['project_review'])
main_folder = path_dic['main_folder']

# Define input and output paths
input_excel = os.path.join(main_folder, 'database', 'combined_filtered.xlsx')
output_bib = os.path.join(main_folder, 'database', 'filtered_output.bib')

# Load the filtered Excel file
df = pd.read_excel(input_excel)

# Generate BibTeX file
generate_bibtex(df, output_file=output_bib)

### 📄 BibTeX Generator: Use This Shortcut

You can simplify the import with an alias. Here's the **shortcut version**:

```python
from academic_paper_maker.helper.google_colab import generate_bibtex_from_excel_colab as gen_bib
```

Now you can call the function like this:

```python
gen_bib(cfg=cfg)
```

✅ This will read the filtered Excel file and generate a `.bib` file for your bibliography—keeping everything clean and automatic.
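
To give a feel for what such a conversion involves, the sketch below formats one record as a BibTeX entry. It is not the repository's `generate_bibtex`; `row_to_bibtex` is a hypothetical helper, and apart from `bibtex_key` the field names in the usage note are placeholders.

```python
def row_to_bibtex(row, entry_type="article"):
    """Format one record (a dict) as a BibTeX entry, skipping empty fields."""
    fields = [
        f"  {name} = {{{value}}}"
        for name, value in row.items()
        if name != "bibtex_key" and value not in (None, "")
    ]
    return f"@{entry_type}{{{row['bibtex_key']},\n" + ",\n".join(fields) + "\n}"
```

For a record such as `{"bibtex_key": "smith_2020", "title": "A Title", "year": 2020}` this yields an `@article{smith_2020, ...}` block with one line per non-empty field. Rows read with pandas may contain `NaN` for missing cells, which this sketch does not filter out.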

In [None]:
from academic_paper_maker.helper.google_colab import generate_bibtex_from_excel_colab as gen_bib
gen_bib(cfg=cfg)


# 🧾 **Eighth Step (Optional): Convert XML to JSON**

This step converts GROBID-generated TEI XML files into structured JSON format. While optional, it can be helpful for reviewing document content, integrating into other tools, or preparing data to feed into an LLM.

> 📝 **Note:** This step is **optional** — the main pipeline (`run_llm`) reads directly from XML. Use this conversion if you want to inspect or process JSON files instead.

---

## ✨ What This Step Does

* Reads all `*.xml` files from the `xml/` directory.
* Converts each into a corresponding `*.json` file (preserving the **BibTeX key as filename** for consistency).
* Stores all JSON outputs in `xml/json/`.

In addition, it handles and organizes special cases:

* 📁 **`xml/json/no_intro_conclusion/`**: XML files where GROBID could not detect an *introduction* or *conclusion* section.
* 📁 **`xml/json/untitled_section/`**: XML files where GROBID could not detect any section titles at all — these require manual checking.
* 📄 Other successfully processed files are stored directly in `xml/json/`.
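
The routing into these folders can be sketched as a small decision function. This illustrates the rules listed above and is not the code in `grobid_tei_xml.xml_json`; `choose_output_subfolder` is a hypothetical name, and it assumes a paper is flagged only when neither an introduction nor a conclusion heading is found, which is one possible reading of the description.

```python
def choose_output_subfolder(section_titles):
    """Pick the subfolder under xml/json/ for a paper, given its section titles."""
    titles = [t.strip().lower() for t in section_titles if t and t.strip()]
    if not titles:
        # No section titles at all: needs manual checking
        return "untitled_section"
    has_intro = any("introduction" in t for t in titles)
    has_conclusion = any("conclusion" in t for t in titles)
    if not (has_intro or has_conclusion):
        return "no_intro_conclusion"
    return ""  # stored directly in xml/json/
```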

---

## ▶️ Example Code

```python
from setting.project_path import project_folder
from grobid_tei_xml.xml_json import run_pipeline

project_review = 'corona_discharge'

# Load project paths
path_dic = project_folder(project_review=project_review)

# Convert all XML files in the specified folder to JSON
run_pipeline(path_dic['xml_path'])
```

---

## 🗂️ Example Folder Structure After Conversion

```
corona_discharge/
├── pdf/
│   ├── smith_2020.pdf
│   └── kim_2019.pdf
├── xml/
│   ├── smith_2020.xml
│   ├── kim_2019.xml
│   └── json/
│       ├── smith_2020.json
│       ├── kim_2019.json
│       ├── no_intro_conclusion/
│       │   └── failed_paper1.json
│       └── untitled_section/
│           └── failed_paper2.json
```

In [None]:
from setting.project_path import project_folder
from grobid_tei_xml.xml_json import run_pipeline

project_review = 'corona_discharge'

# Load project paths
path_dic = project_folder(project_review=project_review)
run_pipeline(path_dic['xml_path'])

# 🧾 **Ninth Step: Extract Methodology Details Using LLM**

At this stage, your reference list is filtered and the corresponding PDFs (or abstracts) are available. Now, the focus shifts to **extracting key methodological insights** from each paper, such as:

* 🧠 Classification algorithms
* 🛠️ Feature engineering approaches
* 📏 Evaluation metrics

This is achieved using a specialized **LLM agent** with a targeted prompt for methodology extraction.

> 📌 This step works with **full manuscripts (PDF)** when available, and **falls back to abstracts** if no PDF exists. This flexibility ensures comprehensive analysis even with incomplete data.
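
The fallback rule can be sketched as follows. This is an illustration, not the pipeline's own code: `pick_text_source` is a hypothetical helper, and it assumes PDFs are saved as `<bibtex_key>.pdf` in the `pdf/` folder.

```python
from pathlib import Path


def pick_text_source(bibtex_key, pdf_folder, abstract):
    """Return ('pdf', path) when the PDF exists, else ('abstract', text), else (None, None)."""
    pdf_path = Path(pdf_folder) / f"{bibtex_key}.pdf"
    if pdf_path.is_file():
        return "pdf", str(pdf_path)
    if abstract and abstract.strip():
        return "abstract", abstract.strip()
    # Neither a PDF nor an abstract is available for this paper
    return None, None
```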

---

## ✨ What This Step Does

* Loads your filtered Excel or CSV file from Step 5 or 6.
* For each relevant paper:

  * If the PDF is available: extracts methodology from the full text.
  * If the PDF is **not** available: extracts from the **abstract** instead.
* Uses a domain-specific LLM prompt to analyze methodological content.
* Appends the results to the existing Excel or CSV file.
* Saves per-paper structured JSON files for advanced or customized usage.
* Handles backups automatically if overwriting the metadata file.

> ⛑️ **Safety Note**: If `'overwrite_csv': True`, a **timestamped backup** of the original `.csv` or `.xlsx` file is automatically created **in the same folder** before any updates are made. This prevents accidental corruption and allows for recovery or version tracking.
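
If you want the same protection in your own scripts, a timestamped copy takes only a few lines. This is a sketch of the idea and not the pipeline's implementation; `backup_before_overwrite` is a hypothetical helper, and the `_backup_YYYYMMDD_HHMM` naming is an assumption modelled on the example file name shown in the folder structure for this step.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_before_overwrite(file_path, now=None):
    """Copy file_path to <stem>_backup_<YYYYMMDD_HHMM><suffix> in the same folder."""
    source = Path(file_path)
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M")
    target = source.with_name(f"{source.stem}_backup_{stamp}{source.suffix}")
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target
```

Call it on the metadata file just before writing the updated version, and keep the returned path in case you need to restore the original.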

---

## 📄 Prompt Purpose

This step uses a **domain-aware analytical agent** designed to:

> “Extract methodological details (e.g., classification algorithms, feature engineering, evaluation metrics) from filtered papers relevant to a specific machine learning task.”

The prompt is defined in a YAML config file (`agent_ml.yaml`) and is tailored by the agent name you provide.

---

## 🗂️ Folder and Output Structure

Your project directory may look like this after completing this step:

```
wafer_defect/
├── database/
│   └── scopus/
│       ├── combined_filtered.xlsx                  ← Updated metadata with extraction results
│       ├── combined_filtered_backup_20250508_1530.xlsx  ← Auto-generated backup (if overwrite_csv=True)
├── pdf/
│   ├── smith_2020.pdf                              ← Saved using bibtex_key
│   ├── kim_2019.pdf
│   └── ...
├── methodology_gap_extractor/
│   ├── json_output/
│   │   └── smith_2020.json
│   ├── multiple_runs_folder/
│   └── final_cross_check_folder/
```

---

## 🧠 Supported Models

Choose from the following supported LLMs:

* `"gpt-4o"`
* `"gpt-4o-mini"`
* `"gpt-o3-mini"`
* `"gemini-1.5-pro"`
* `"gemini-exp-1206"`
* `"gemini-2.0-flash-thinking-exp-01-21"`

---

## ⚠️ Important Reminders

* ✅ Set `'used_abstract': True` in `process_setup` if some papers lack full PDFs.
* 🛑 **Always verify** extracted methodologies manually before using them in analysis, models, or publication. LLMs can hallucinate or misinterpret technical details.

---

In [None]:
from research_filter.auto_llm import run_pipeline
from setting.project_path import project_folder
import os

# Define the project
project_review = 'corona_discharge'
# Select your LLM. A more capable model with reasoning ability, such as `gpt-4o` or
# `gemini-2.0-flash-thinking-exp-01-21`, is recommended for this step.


model_name = 'gpt-4o-mini'
agent_name = "methodology_gap_extractor"

config_file='/content/academic_paper_maker/setting/project_folders.json'


path_dic = project_folder(project_review=project_review,config_file=config_file)


# path_dic = project_folder(project_review=project_review)
main_folder = path_dic['main_folder']

base_dir = os.getcwd()
# Construct the relative path
yaml_path = '/content/academic_paper_maker/research_filter/agent/agent_ml.yaml'

# Agent and config setup
agentic_setting = {
    "agent_name": agent_name,
    "column_name": "methodology_gap",
    "yaml_path": yaml_path,
    "model_name": model_name
}

# Output directories
methodology_json_folder = os.path.join(main_folder, agent_name, 'json_output',model_name)
multiple_runs_folder = os.path.join(main_folder, agent_name, 'multiple_runs_folder')
final_cross_check_folder = os.path.join(main_folder, agent_name, 'final_cross_check_folder')

# Excel input
csv_path = path_dic['database_path']

# Topic placeholders (adjust per project)
placeholders = {
    "topic": "wafer defect classification",
    "topic_context": "semiconductor manufacturing and inspection"
}

# LLM runtime configuration
process_setup = {
    'batch_process': False,
    'manual_paste_llm': False,
    'iterative_confirmation': False,
    'overwrite_csv': True,
    'cross_check_enabled': False,
    'cross_check_runs': 3,
    'cross_check_agent_name': 'agent_cross_check',
    'cleanup_json': False,
    'used_abstract': True  # Always True to enable fallback to abstract if PDF is missing
}

# Run the pipeline
run_pipeline(
    agentic_setting,
    process_setup,
    placeholders=placeholders,
    csv_path=csv_path,
    main_folder=main_folder,
    methodology_json_folder=methodology_json_folder,
    multiple_runs_folder=multiple_runs_folder,
    final_cross_check_folder=final_cross_check_folder,
)

# 🧾 **Tenth Step: Draft Literature Review (Chapter 2) Using Combined JSON**

Once you've extracted methodological insights in Step 9, the next logical move is to **start organizing and drafting your literature review** (typically Chapter 2 of a thesis or paper). This step outlines how to **combine extracted results** into a single, structured file — and how to use that file to produce high-quality summaries, tables, and narrative drafts using an LLM.

> 📌 This is a branching step — not everyone will follow this exact path — but it's a powerful way to move from **structured data → written draft** efficiently.

---


## 🧰 Code Example: Combine JSONs

```python
import os
from setting.project_path import project_folder
from helper.combine_json import combine_json_files

project_review = 'corona_discharge'
path_dict = project_folder(project_review=project_review)

# Build the path from separate parts so it works on Colab (Linux) as well as Windows
input_dir = os.path.join(
    path_dict['main_folder'],
    "methodology_gap_extractor_partial_discharge",
    "json_output",
    "gemini-2.0-flash-thinking-exp-01-21_updated",
)
output_file = os.path.join(path_dict['main_folder'], 'combined_output.json')

if not os.path.exists(input_dir):
    raise FileNotFoundError(f"Input directory not found: {input_dir}")

combine_json_files(
    input_directory=input_dir,
    output_file=output_file
)

print(f"Combined JSON saved to: {output_file}")
```
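
The internals of `combine_json_files` are not shown in this guide. As a rough mental model only, a merge step of this kind could look like the sketch below. `merge_json_dir` is a hypothetical helper, and keying the result by file name is an assumption, not necessarily what the repository does.

```python
import json
from pathlib import Path


def merge_json_dir(input_directory, output_file):
    """Merge every *.json file in a folder into one dict keyed by file stem.

    Hypothetical illustration; not the repository's combine_json_files.
    """
    combined = {}
    # sorted() gives a stable order and lists the files before anything is written
    for path in sorted(Path(input_directory).glob("*.json")):
        with open(path, encoding="utf-8") as fh:
            combined[path.stem] = json.load(fh)

    with open(output_file, "w", encoding="utf-8") as fh:
        json.dump(combined, fh, indent=2, ensure_ascii=False)
    return combined
```

Keying by file stem (for example the `bibtex_key`) keeps each paper traceable back to its source file once everything sits in one document.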

---

## 🗂️ Output Structure Example

```
corona_discharge/
├── combined_output.json      ← ✅ Master file for summarization and drafting
└── methodology_gap_extractor_partial_discharge/
    └── json_output/
        └── *.json            ← Individual extracted documents
```

---

## 📊 Using LLM to Generate Summary Tables

With the `combined_output.json`, you can prompt an LLM to create structured tables summarizing key findings. For example:

> “Using the combined JSON, generate a table with the following columns:
> **Author**, **Machine Learning Algorithm**, **Feature Types**, **Dataset Used**, and **Performance Metrics**.”

This gives you a **quick-glance overview** of the landscape, helpful both for understanding trends and citing clearly in your literature review.

### ✅ Sample Columns for Summary Table:

* **Author / Year**
* **ML Algorithm(s) Used**
* **Features**
* **Dataset / Source**
* **Performance (Accuracy, F1, etc.)**
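
If you prefer to assemble the prompt in code rather than by hand, a minimal sketch follows. `build_table_prompt` is a hypothetical helper, and the sample record is only a stand-in for the real contents of `combined_output.json`, whose structure is assumed here.

```python
import json


def build_table_prompt(combined, columns):
    """Return a prompt asking an LLM for a Markdown table with the given columns."""
    column_list = ", ".join(columns)
    return (
        "Using the combined JSON below, generate a Markdown table with the "
        f"following columns: {column_list}.\n"
        "Leave a cell empty if the information is not present in the JSON.\n\n"
        f"{json.dumps(combined, indent=2, ensure_ascii=False)}"
    )


columns = [
    "Author / Year",
    "ML Algorithm(s) Used",
    "Features",
    "Dataset / Source",
    "Performance (Accuracy, F1, etc.)",
]

# Stand-in data; in practice, load combined_output.json with json.load()
sample = {"smith_2020": {"placeholder_field": "placeholder value"}}
prompt = build_table_prompt(sample, columns)
print(prompt)
```

The resulting string can be pasted into Google AI Studio or sent to whichever model you use for drafting.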

---

## ✍️ Drafting Strategy Tips

You can also use LLM prompts like:

> “Based on the combined JSON, write a summary paragraph comparing the top 3 machine learning techniques used for partial discharge classification.”

Or:

> “Generate an introduction section discussing the evolution of feature engineering techniques in this domain.”

---

## ⚠️ Important Reminders

* ✅ Make sure all JSON structures are consistent before combining.
* 📉 Segment or cluster the JSON by subdomain if the file becomes too large.
* 🔍 Review all generated tables and text — LLMs are **tools**, not final authorities.
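
The first two reminders can be checked programmatically. The sketch below assumes the combined file maps a paper ID to a dictionary of fields, which may not match your actual schema. Both function names are hypothetical, and the split uses a fixed batch size rather than clustering by subdomain.

```python
from collections import Counter


def find_inconsistent_records(records):
    """Return {record_id: (missing_keys, extra_keys)} relative to the most common key set."""
    key_sets = {rid: frozenset(rec) for rid, rec in records.items()}
    if not key_sets:
        return {}
    # Treat the key set shared by most records as the reference structure
    reference, _ = Counter(key_sets.values()).most_common(1)[0]
    return {
        rid: (sorted(reference - keys), sorted(keys - reference))
        for rid, keys in key_sets.items()
        if keys != reference
    }


def split_records(records, batch_size):
    """Split the records into smaller dicts of at most `batch_size` entries."""
    items = list(records.items())
    return [dict(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]
```

Running `find_inconsistent_records` before drafting flags papers whose extraction output deviates from the rest, and `split_records` gives you smaller pieces to feed to a model with a limited context window.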

---

In [None]:
import os
from setting.project_path import project_folder
from helper.combine_json import combine_json_files

project_review = 'corona_discharge'
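# Location of the JSON file where registered project folders are stored
config_file = '/content/academic_paper_maker/setting/project_folders.json'

path_dict = project_folder(project_review=project_review, config_file=config_file)

# Folder holding the per-paper JSON files from Step 9; adjust to your own project and model
input_dir = '/content/my_project/methodology_gap_extractor/json_output/gpt-4o-mini'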
output_file = os.path.join(path_dict['main_folder'], 'combined_output.json')

if not os.path.exists(input_dir):
    raise FileNotFoundError(f"Input directory not found: {input_dir}")

combine_json_files(
    input_directory=input_dir,
    output_file=output_file
)

print(f"Combined JSON saved to: {output_file}")


# 🧾 **Eleventh Step: Export Filtered Excel to BibTeX**

After reviewing and filtering your `combined_filtered.xlsx` file, you can convert the refined list of papers back into a **BibTeX file**. The resulting `.bib` file can be cited directly from LaTeX or imported into a reference manager.

---

## ✨ What This Step Does

* Loads the filtered Excel file containing your selected papers
* Converts the data back into BibTeX format
* Saves the result to a `.bib` file for easy reuse or citation


---

## 📁 File Structure Example

```
wafer_defect/
├── database/
│   └── scopus/
│       ├── combined_filtered.xlsx                  ← Filtered Excel file with selected papers
│       ├── combined_filtered_backup_20250508_1530.xlsx  ← Auto backup if overwrite is enabled
│       └── filtered_output.bib                     ← Newly generated BibTeX file
├── pdf/
│   ├── smith_2020.pdf                              ← Full-text PDFs saved by bibtex_key
│   ├── kim_2019.pdf
│   └── ...
└── methodology_gap_extractor_wafer_defect/
    ├── json_output/
    │   └── smith_2020.json                         ← Extracted methodology as structured JSON
    ├── multiple_runs_folder/
    └── final_cross_check_folder/
```

---

## 🛠️ Notes

* Ensure the Excel file has standard bibliographic columns like: `title`, `author`, `year`, `journal`, `doi`, etc.
* The function `generate_bibtex()` maps these fields into valid BibTeX entries.
* You can open the `.bib` file in any text editor or reference manager to confirm the results.
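
Before converting, it can help to confirm that the expected columns are present and to see roughly what an entry looks like. The sketch below is illustrative only: `missing_columns` and `row_to_bibtex` are hypothetical helpers, the required column list is taken from the note above, and the actual output format of `generate_bibtex()` may differ.

```python
# Column names assumed from the note above; extend as needed
REQUIRED_COLUMNS = ("title", "author", "year", "journal", "doi")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names absent from `columns` (case-insensitive)."""
    present = {str(c).strip().lower() for c in columns}
    return [name for name in required if name not in present]


def row_to_bibtex(key, row):
    """Format one record as an @article entry, skipping empty fields."""
    fields = [
        f"  {name} = {{{value}}}"
        for name, value in row.items()
        if value not in (None, "")
    ]
    return "@article{" + key + ",\n" + ",\n".join(fields) + "\n}"
```

With a DataFrame loaded from the Excel file, `missing_columns(df.columns)` returns an empty list when all of the listed columns exist.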

In [None]:
import os
import pandas as pd
from post_code_saviour.excel_to_bib import generate_bibtex
from setting.project_path import project_folder

# Define your project
project_review = 'corona_discharge'

# Load project paths
path_dict = project_folder(project_review=project_review)
main_folder = path_dict['main_folder']

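# Define input and output paths
# NOTE: the file structure example above places these files under database/scopus;
# adjust the paths below to wherever your combined_filtered.xlsx actually lives.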
input_excel = os.path.join(main_folder, 'database', 'combined_filtered.xlsx')
output_bib = os.path.join(main_folder, 'database', 'filtered_output.bib')

# Load the filtered Excel file
df = pd.read_excel(input_excel)

# Generate BibTeX file
generate_bibtex(df, output_file=output_bib)