# Notebook Summary: Academic Paper Maker Workflow

This notebook guides you through a step-by-step process for setting up your environment, collecting academic papers, filtering them using Large Language Models (LLMs), extracting key information, and preparing files for literature review drafting and citation.

---

## How to Run This Notebook

To run this notebook, follow these instructions:

1.  **Read each step** below first to understand the process.
2.  **Run the code cells** one by one, in the order they appear.
3.  Click on a code cell and press `Shift + Enter`, or click the play button (`▶`) on the left side of the cell.
4.  A `[*]` next to a cell indicates it's currently running. It will change to a number (e.g., `[1]`, `[2]`) when execution is complete.

---

## Summary of Steps

This notebook consists of **12 distinct steps** presented sequentially through Markdown headers, guiding you through the automated academic paper review process.

Here is a summary of each step and its purpose:

1.  **Step 1: Install Required Packages**
    *   **Purpose:** To set up the necessary Python environment by cloning the `academic_paper_maker` repository and installing all required dependencies from `requirements.txt`.

2.  **Step 2: Utility Functions for Colab Execution**
    *   **Purpose:** To define helper functions specifically tailored for running the various steps within a Google Colab environment, simplifying common operations like API testing, filtering, JSON merging, and BibTeX generation. (Code cell implements these functions).

3.  **Step 3: Obtain Your API Keys**
    *   **Purpose:** To instruct the user on how to acquire necessary API keys (e.g., OpenAI, Gemini) from their respective platforms, which are required for using LLM-based features later in the workflow. Keys are saved externally at this stage.

4.  **Step 4: Create and Register Your Project Folder**
    *   **Purpose:** To define and initialize the main directory for your project, setting up a standard folder structure (`database/scopus`, `pdf`, `xml`) and registering the project's location for easy access in future sessions.

5.  **Step 5: Create a `.env` File to Store API Keys**
    *   **Purpose:** To create a `.env` file in the project directory. This file serves as a secure location to store your obtained API keys (manually added by the user after creation), preventing credentials from being hardcoded.

6.  **Step 5a: Load API Key and Test OpenAI Connection**
    *   **Purpose:** To verify that the OpenAI API key stored in the `.env` file is correctly loaded and can successfully connect and communicate with the OpenAI API by sending a simple test query.

7.  **Step 6: Download the Scopus BibTeX File**
    *   **Purpose:** To guide the user on how to use the Scopus database (or a similar source) to perform advanced searches for relevant academic papers and download the results in BibTeX format (`.bib`) into the designated `database/scopus` folder.

8.  **Step 7: Combine Scopus BibTeX Files into Excel**
    *   **Purpose:** To process the downloaded BibTeX (`.bib`) files from the `database/scopus` folder, extract key metadata from all files, and merge them into a single, structured Excel file (`combined_filtered.xlsx`) for subsequent filtering and review.

9.  **Step 8: Automatically Filter Relevant References Using LLM**
    *   **Purpose:** To employ a Large Language Model (LLM) to automatically read the abstracts from the `combined_filtered.xlsx` file and classify their relevance based on a user-defined `topic` and `topic_context`, adding a `True`/`False` filter column to the Excel.

10. **Step 7: Extract Methodology Details Using LLM** (Note: Header numbers are inconsistent, but this is a distinct step logically)
    *   **Purpose:** To use an LLM agent to extract detailed methodological information (e.g., algorithms, datasets, metrics) from the filtered papers, utilizing either the full PDF text (if available) or the abstract as a fallback, and saving the results in structured JSON files and updating the Excel.

11. **Step 8: Draft Literature Review (Chapter 2) Using Combined JSON** (Note: Header numbers are inconsistent, but this is a distinct step logically)
    *   **Purpose:** To combine the individual JSON files containing extracted methodology details into a single `combined_output.json`. This file then serves as structured input for LLMs or scripts to assist in drafting literature review sections, generating summary tables, or performing thematic analysis.

12. **Step 9: Export Filtered Excel to BibTeX**
    *   **Purpose:** To convert the final, manually reviewed and filtered Excel file (`combined_filtered.xlsx`) back into a BibTeX (`.bib`) file, facilitating easy import into citation managers or integration with LaTeX documents for generating bibliographies.

---

This sequence of steps automates significant portions of the literature review process, from collecting initial references to extracting detailed insights and preparing final outputs for writing.

# 📦 **Step 1: Install Required Packages + Configure API Keys**

Before you start using this notebook, ensure all required Python packages are installed **and** that your API keys are properly set up in a `.env` file. This prepares your environment for accessing services like OpenAI or Google Gemini.

---

## 💻 How to Install Required Packages

Open your terminal or command prompt, navigate to the project directory (where `requirements.txt` is located), and run:

```bash
pip install -r requirements.txt
```

> ✅ **Tip:** Use a virtual environment (e.g., `venv`, `conda`) to keep your dependencies clean and isolated.

If you're running this from a Jupyter notebook:

```python
!pip install -r requirements.txt
```

## 🧠 **LLM Prompt Customization**

For the automated relevance filtering to work effectively, you **must provide two key inputs** to guide the LLM:

* `topic`: A concise statement of your **specific research goal or area of interest**.
  *Example:* `"wafer defect classification"`

* `topic_context`: A description of the **broader scientific or industrial context** where your topic belongs.
  *Example:* `"semiconductor manufacturing and inspection"`

These inputs help the LLM understand what kinds of abstracts are considered relevant and which ones should be filtered out.

---

### 💡 *Not sure how to define your `topic` and `topic_context`?*

You can get help from an LLM to generate these values. Here’s how:

1. **Open the filtering prompt** template defined in the YAML file:

   ```
   research_filter/agent/abstract_check.yaml
   ```

2. **Copy the prompt structure** (including the placeholders for `topic` and `topic_context`).

3. **Ask any LLM to assist**, by providing it the YAML prompt and a description of your research.
   For example, you could say:

   > 🧠 *"Given the following prompt structure, and knowing that my research is about identifying AI-generated academic papers, can you help me fill in the `topic` and `topic_context` placeholders?"*

4. The LLM might then suggest:

   ```yaml
   topic: "AI-generated academic paper detection"
   topic_context: "scientific publishing and machine learning ethics"
   ```

This approach ensures your filtering prompt is both precise and contextually grounded, improving the accuracy of the classification.
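
To illustrate how the two placeholders end up inside a prompt, here is a minimal sketch that fills a small template with `str.format`. The template text is a made-up stand-in, not the contents of `abstract_check.yaml`, and `fill_prompt` is a hypothetical helper that is not part of the repository.

```python
# Hypothetical template: the real prompt lives in abstract_check.yaml and may differ.
PROMPT_TEMPLATE = (
    "role: An expert evaluator of research abstracts related to {topic}, "
    "with knowledge of {topic_context}.\n"
    'goal: Decide whether the abstract is directly relevant to "{topic}".'
)


def fill_prompt(template: str, placeholders: dict) -> str:
    """Substitute every {name} in the template; raise if a placeholder is missing."""
    try:
        return template.format(**placeholders)
    except KeyError as missing:
        raise ValueError(f"Missing placeholder: {missing}") from None


placeholders = {
    "topic": "wafer defect classification",
    "topic_context": "semiconductor manufacturing and inspection",
}
prompt = fill_prompt(PROMPT_TEMPLATE, placeholders)
print(prompt)
```

Failing loudly on a missing placeholder is a deliberate choice here: a prompt with an unfilled `{topic}` would still run, but the classification would be meaningless.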


In [1]:
# Click me to execute
# You can define your topic and context here, or use the LLM to help you fill in these placeholders.

placeholders = {
    "topic": "wafer defect classification",
    "topic_context": "semiconductor manufacturing and inspection"
}

project_theme = "wafer"

# 🔧 Step 2: Create and Register Your Project Folder

In this step, you'll register the **main project folder**, so the system knows where to store and retrieve all files related to your project.

This setup automatically creates a structured folder system under your specified `main_folder`, and stores the configuration in a central JSON file. This ensures your projects remain organized, especially when working with multiple studies or datasets.

---

## 🗂️ Project Folder Structure

On the first run, the following folder structure will be automatically created under your specified `main_folder`:

```
main_folder/
├── database/
│   ├── scopus/
│   └── <project_name>_database.xlsx  ← auto-generated path (not file creation)
├── pdf/
└── xml/
```

---

## 💡 Example

Suppose your project name is `corona_discharge`, and you want to store all project files under:

```
G:\My Drive\research_related\ear_eog
```

You can register this setup by running:

```python
project_folder(
    project_review='corona_discharge',
    main_folder=r'G:\My Drive\research_related\ear_eog'
)
```

✅ This will:

* Save the project path in `setting/project_folders.json`
* Create the full folder structure: `database/scopus`, `pdf`, and `xml`

---

## 🔁 Loading the Project Later

Once registered, you can access the project folders in future sessions by providing just the `project_review` name:

```python
paths = project_folder(project_review='corona_discharge')

print(paths['main_folder'])  # Main project folder
print(paths['csv_path'])     # Path to the Excel database file
```

---

## ⚙️ What Happens in the Background?

* A file named `project_folders.json` is stored in the `setting/` directory within your project.
* It maps each `project_review` to its corresponding `main_folder`.
* Folder structure is created automatically on the first run.
* On subsequent runs, the system reads the JSON to locate your project — no need to re-enter paths.
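
The registry behaviour described above can be sketched roughly as follows. This is not the actual `project_folder` implementation (which is not shown in this notebook); `register_project` and its signature are assumptions made for illustration, and the demo runs in a temporary directory.

```python
import json
import tempfile
from pathlib import Path

SUBFOLDERS = ("database/scopus", "pdf", "xml")


def register_project(registry_file: Path, project_review: str, main_folder=None) -> Path:
    """Store or look up a project's main folder in a JSON registry (hypothetical helper)."""
    registry = json.loads(registry_file.read_text()) if registry_file.exists() else {}
    if main_folder is not None:
        # First run: remember the location and create the folder structure.
        registry[project_review] = str(main_folder)
        registry_file.parent.mkdir(parents=True, exist_ok=True)
        registry_file.write_text(json.dumps(registry, indent=2))
        for sub in SUBFOLDERS:
            (Path(main_folder) / sub).mkdir(parents=True, exist_ok=True)
    elif project_review not in registry:
        raise KeyError(f"Project '{project_review}' is not registered yet")
    return Path(registry[project_review])


# Demo in a temporary directory so no real project folder is touched.
demo_root = Path(tempfile.mkdtemp())
registry_file = demo_root / "setting" / "project_folders.json"
register_project(registry_file, "corona_discharge", demo_root / "ear_eog")
looked_up = register_project(registry_file, "corona_discharge")  # later session, name only
print(looked_up)
```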


In [5]:
# === User-Defined Base Directory ===
# Recommended: Set the base directory where all project files will live
base_dir = r"D:\test_me"  # Change this to your preferred location


# ⚙️ **Step 2: Utility Functions for Colab Execution**

To streamline your workflow, we've wrapped common operations into **modular utility functions**. These abstract away repetitive code and let you focus on your analysis—not setup.

In [3]:
from notebook.helper import (
    combine_methodology_output_colab,
    create_env_file,
    generate_bibtex,
    generate_bibtex_from_excel_colab,
    run_abstract_filtering_colab,
    run_llm_filtering_pipeline_colab,
    run_methodology_extraction_colab,
    test_openai_connection,
)

# 🔑 **Step 3: Obtain Your API Keys**

Before running any LLM-based tools, you'll need to obtain API keys for **OpenAI** and optionally **Google Gemini**, if you plan to use both.

---

## 🧠 Why You Need This

API keys are required to authenticate your access to models like `gpt-4o` or `gemini-pro`. Without them, the system won’t be able to connect to the services.

---

## 🔐 Where to Get Your Keys

| Provider   | Key Name         | Get It From                                                          |
| ---------- | ---------------- | -------------------------------------------------------------------- |
| **OpenAI** | `OPENAI_API_KEY` | [platform.openai.com/api-keys](https://platform.openai.com/api-keys) |
| **Gemini** | `GEMINI_API_KEY` | [aistudio.google.com](https://aistudio.google.com/apikey)            |

---

## 📋 What To Do With Them Now

Just **copy and save your API keys in a safe place** (e.g., a password manager or text file on your machine).

> 📝 You do **not** need to put them in a `.env` file yet.

In **Step 5**, you’ll store them in a `.env` file using a helper function.

---

## ✅ Example (Save for Later)

```
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=AIzaSy...
```

> ⚠️ Do not share these keys with anyone or post them online.


# ⚙️ **Step 4: Create and Register Your Project Folder**

Just execute the code below to set up your project folder. This will create a structured directory for storing all your project files, including databases, PDFs, and XML files.

In [8]:
import os
from os.path import join
from setting.project_path import project_folder
from notebook.helper import build_project_config


# === Optional: Define Custom YAML Configuration Paths ===
# These paths point to YAML files used for abstract and methodology filtering
notebook_dir = os.getcwd()
yaml_dir = join(notebook_dir, "research_filter", "agent")
custom_abstract = join(yaml_dir, "abstract_check.yaml")
custom_methodology = join(yaml_dir, "agent_ml.yaml")

# === Build Project Configuration ===
cfg = build_project_config(
    project_name="wafer",
    base_dir=base_dir,
    yaml_path_abstract=custom_abstract,
    yaml_path_methodology=custom_methodology
)

# === Initialize Folder Structure ===
# Registers and prepares the folder structure for this project
project_folder(
    project_review=cfg["project_review"],
    main_folder=cfg["main_folder"],
    config_file=cfg["config_file"]
)

print("✅ Configuration loaded and project folder initialized successfully.")


✅ Configuration loaded and project folder initialized successfully.


In [9]:
# Minimal (user will edit manually later)
create_env_file(cfg["env_path"])

# Or with keys already known
create_env_file(cfg["env_path"],
                openai_api_key="go_and_replace with_your_openai_key",
                # gemini_api_key="AIzaSy..."
                )

✅ `.env` file created at: D:\test_me\project_files\.env
👉 In Colab, click the folder icon on the left, go to the appropriate folder, right-click `.env`, and choose 'Edit' to enter your keys.
✅ `.env` file created at: D:\test_me\project_files\.env
👉 In Colab, click the folder icon on the left, go to the appropriate folder, right-click `.env`, and choose 'Edit' to enter your keys.


---

# ⚙️ **Step 5a: Load API Key and Test OpenAI Connection**

After setting up your `.env` file, it's important to **verify that your OpenAI API key is working correctly**. This step uses a built-in helper function to test the connection.

---

## 🔄 What This Step Does

* Loads your API key from the `.env` file
* Initializes an OpenAI client with that key
* Sends a simple test message to the GPT model (e.g., `gpt-4o`)
* Confirms success or provides a clear error message if something goes wrong

---

## ✅ How to Run It

Just call the helper function:

```python
from notebook.helper import test_openai_connection

test_openai_connection()
```

---

## 📌 Expected Outcome

If everything is set up correctly, you'll see output like:

```
✅ API call successful!
🤖 Response: 2 + 2 is 4.
```

---

## ⚠️ If Something Goes Wrong

Common errors and their meanings:

| Error Type             | What It Means                                 |
| ---------------------- | --------------------------------------------- |
| ❌ AuthenticationError  | API key is missing or incorrect               |
| ⚠️ RateLimitError      | You’re sending too many requests too quickly  |
| 📡 APIConnectionError  | Network issue or OpenAI server is unreachable |
| 🚫 InvalidRequestError | Incorrect model name or bad request structure (raised as `BadRequestError` in `openai` ≥ 1.0) |

Make sure your `.env` file includes a valid key in this format:

```env
OPENAI_API_KEY=sk-...
```

> 🔁 Re-run the test after fixing the `.env` or internet connection if needed.
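
Before re-running the test, it can help to confirm that the `.env` file really contains the key in `KEY=value` form. The sketch below uses only the standard library; `read_env_file` and `has_openai_key` are hypothetical helpers, not part of the repository, and they check that the key is present, not that it is valid.

```python
import tempfile
from pathlib import Path


def read_env_file(env_path: Path) -> dict:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_openai_key(env_path: Path) -> bool:
    """True when OPENAI_API_KEY is present and non-empty."""
    return bool(read_env_file(env_path).get("OPENAI_API_KEY"))


# Demo with a throwaway file and a fake key.
demo_env = Path(tempfile.mkdtemp()) / ".env"
demo_env.write_text("# keys\nOPENAI_API_KEY=sk-fake-example\n", encoding="utf-8")
print(has_openai_key(demo_env))
```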

---


In [10]:
print("🔍 Testing OpenAI connection... Please wait.\n")

# Run the test
test_openai_connection()

print("\n✅ If the connection is successful, you should see a response from the assistant above, and the answer should be something like: Response: Hello! The sum of 2 + 2 is 4.")
print("⚠️ If not, please check your `.env` file and verify that your `OPENAI_API_KEY` is correct.")

🔍 Testing OpenAI connection... Please wait.



2025-06-10 08:54:42,912 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


✅ API call successful!
🤖 Response: Hello! The sum of 2 + 2 is 4.

✅ If the connection is successful, you should see a response from the assistant above, and the answer should be something like: Response: Hello! The sum of 2 + 2 is 4.
⚠️ If not, please check your `.env` file and verify that your `OPENAI_API_KEY` is correct.


# 📥 Step 6: Download the Scopus BibTeX File

In this step, you'll use the **Scopus database** to find and download relevant papers for your project. Scopus is a comprehensive and widely used repository of peer-reviewed academic literature.

---

## 🔍 Using Scopus Advanced Search

To retrieve high-quality and relevant papers, we recommend using **Scopus' Advanced Search** feature. This powerful tool lets you refine your search based on:

* Keywords
* Authors
* Publication dates
* Document types
* And more...

This ensures that your literature collection is both targeted and comprehensive.

---

## 💡 Get Keyword Ideas with a Prompt

To help you formulate effective search queries, you can use the following **prompt-based suggestion tool**:

👉 [Keyword Search Prompt](https://gist.github.com/balandongiv/886437963d38252e61634ddc00b9d983)

You may need to modify the prompt to better suit your research domain. Here are some example domains:

* `"corona discharge"`
* `"fatigue driving EEG"`
* `"wafer classification"`

Feel free to add, remove, or tweak keywords as needed to refine your search results.

---

## 💾 Save and Organize Your Results

Once you've finalized your search:

1. **Select all available attributes** when exporting results from Scopus.

   ![Scopus CSV Export](image/scopus_csv_export.png)

2. Choose the **BibTeX** format when saving the export file.
3. Save the file inside the `database/scopus/` folder of your project.

The resulting folder structure might look like this:

```
main_folder/
├── database/
│   └── scopus/
│       ├── scopus(1).bib
│       ├── scopus(2).bib
│       ├── scopus(3).bib
```

Make sure the BibTeX files are correctly named and stored to ensure smooth integration in later steps.

# 📊 Step 7: Combine Scopus BibTeX Files into Excel

Once you've downloaded multiple `.bib` files from Scopus, the next step is to **combine and convert** them into a structured Excel file. This makes it easier to filter, sort, and review the metadata of all collected papers.

---

## 🧰 What This Step Does

* Loads all `.bib` files from your project's `database/scopus/` folder
* Parses the relevant metadata (e.g., title, authors, year, source, DOI)
* Combines the results into a single Excel spreadsheet
* Saves the spreadsheet in the `database/` folder as `combined_filtered.xlsx`
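
The parse-and-combine idea can be sketched with the standard library alone. This is a simplified illustration, not the implementation of `combine_scopus_bib_to_excel` (which is not shown here): `parse_bibtex` is a hypothetical helper, and it assumes one brace-delimited field per line, so multi-line or quoted field values are not handled.

```python
import re

# Simplified patterns: one brace-delimited field per line.
ENTRY_HEADER = re.compile(r"(\w+)\s*\{\s*([^,\s]+)\s*,")
FIELD = re.compile(r"^\s*(\w+)\s*=\s*\{(.*?)\}\s*,?\s*$", re.MULTILINE)


def parse_bibtex(text: str) -> list:
    """Return one dict per entry with its type, key, and lower-cased field names."""
    rows = []
    for chunk in re.split(r"^@", text, flags=re.MULTILINE)[1:]:
        header = ENTRY_HEADER.match(chunk)
        if header is None:
            continue  # not a well-formed entry; skip it
        row = {"entry_type": header.group(1).lower(), "bibtex_key": header.group(2)}
        row.update({name.lower(): value for name, value in FIELD.findall(chunk)})
        rows.append(row)
    return rows


# Two made-up files standing in for scopus(1).bib and scopus(2).bib.
file_a = """@ARTICLE{Kim2019,
  author = {Kim, J.},
  title = {Wafer map defect detection with a {CNN}},
  year = {2019},
}
"""
file_b = """@CONFERENCE{Smith2020,
  author = {Smith, A.},
  title = {Defect clustering on wafers},
  year = {2020},
}
"""
combined = [row for text in (file_a, file_b) for row in parse_bibtex(text)]
for row in combined:
    print(row["bibtex_key"], row["year"], row["title"])
```

A list of dictionaries like `combined` maps naturally onto spreadsheet rows; with pandas and an Excel writer installed, `pandas.DataFrame(combined).to_excel(path, index=False)` would be one way to save it.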



## 📁 Folder Structure Example

After running the script, your folder might look like this:

```
main_folder/
├── database/
│   ├── scopus/
│   │   ├── scopus(1).bib
│   │   ├── scopus(2).bib
│   │   ├── scopus(3).bib
│   └── combined_filtered.xlsx
```

This Excel file will serve as your primary reference for filtering papers before downloading PDFs.


In [21]:
from download_pdf.database_preparation import combine_scopus_bib_to_excel
from setting.project_path import project_folder

path_dic = project_folder(project_review=cfg["project_review"], config_file=cfg["config_file"])
combine_scopus_bib_to_excel(path_dic['scopus_path'], path_dic['database_path'])

Found 2 .bib files in D:\test_me\project_files\database\scopus


KeyboardInterrupt: 

# 🧾 **Step 8: Automatically Filter Relevant References Using LLM**

After retrieving thousands of BibTeX references from Scopus (**Step 6**) and combining them into an Excel file (`combined_filtered.xlsx`) in **Step 7**, you'll likely find many entries irrelevant to your specific research focus.

In this step, we use a **Large Language Model (LLM)** to classify abstracts based on a defined research topic and context, which greatly reduces the amount of manual filtering.

---

## ✨ **What This Step Does**

* Loads abstracts from `combined_filtered.xlsx`
* Applies a **custom LLM prompt** to assess relevance
* Adds a new column to the Excel file with `True`/`False` labels:

  * ✅ `True` → Relevant
  * ❌ `False` → Not relevant

This process is based on a topic-specific prompt defined by you.

---

## ⚙️ **Configuration via `agentic_setting`**

The entire filtering behavior is controlled by a configuration dictionary called `agentic_setting`:

```python
agentic_setting = {
    "agent_name": "abstract_filter",     # The identifier for the agent logic (must match YAML)
    "column_name": "abstract_filter",    # Name of the column added to the Excel file
    "yaml_path": yaml_path,              # Path to the YAML file defining agent behavior
    "model_name": model_name             # Name of the LLM model to use
}
```

### 🔍 Parameter Breakdown:

| Key           | Description                                                             |
| ------------- | ----------------------------------------------------------------------- |
| `agent_name`  | Matches the name of the agent defined in the YAML configuration file.   |
| `column_name` | The name of the new column in the Excel file where results are saved.   |
| `yaml_path`   | Path to the YAML file containing the agent's logic and prompt template. |
| `model_name`  | The specific LLM model used (e.g., `"gpt-4"` or `"claude-3-opus"`).     |

---

## 📂 **File and Folder Structure**

To avoid redundant LLM calls (which can be costly), results are cached as JSON files:

```
wafer_defect/
├── database/
│   ├── scopus/
│   └── combined_filtered.xlsx           ← Updated with filtering results
└── abstract_filter/
    └── json_output/
        ├── kim_2019.json
        ├── smith_2020.json
        └── ...
```

* Each abstract’s filtering result is saved individually.
* If the run encounters an error, it can resume without reprocessing previous abstracts.
* These files are later reused for **cross-checking** and **final review**.
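
The caching and resume behaviour can be sketched as below. The JSON layout, the `filter_with_cache` helper, and the keyword-based `classify_stub` (which stands in for the LLM call so the demo runs offline) are all assumptions for illustration; the actual files written by the pipeline may be structured differently.

```python
import json
import tempfile
from pathlib import Path


def classify_stub(abstract: str) -> bool:
    """Stand-in for the LLM call: a keyword check, used only to keep the demo offline."""
    return "wafer" in abstract.lower()


def filter_with_cache(abstracts: dict, cache_dir: Path, classify=classify_stub) -> dict:
    """Classify each abstract once; reuse the saved JSON result on later runs."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for key, abstract in abstracts.items():
        cache_file = cache_dir / f"{key}.json"
        if cache_file.exists():
            # Already processed in an earlier (possibly interrupted) run.
            results[key] = json.loads(cache_file.read_text())["abstract_filter"]
            continue
        results[key] = classify(abstract)
        cache_file.write_text(json.dumps({"abstract_filter": results[key]}))
    return results


calls = []


def counting_classifier(abstract: str) -> bool:
    """Wrap the stub so the demo can show how many classifications actually ran."""
    calls.append(abstract)
    return classify_stub(abstract)


abstracts = {
    "kim_2019": "Wafer map defect patterns classified with a CNN.",
    "smith_2020": "A survey of protein folding methods.",
}
cache_dir = Path(tempfile.mkdtemp()) / "abstract_filter" / "json_output"
first = filter_with_cache(abstracts, cache_dir, counting_classifier)
second = filter_with_cache(abstracts, cache_dir, counting_classifier)  # served from cache
print(first, len(calls))
```

Writing one file per abstract, rather than one file at the end, is what makes an interrupted run resumable: everything finished before the failure is already on disk.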

---

## 📤 **Excel Output Behavior**

Depending on the `overwrite_csv` setting:

* `True` → Updates the original `combined_filtered.xlsx`
* `False` → Creates a new file (e.g., `combined_filtered_updated.xlsx`)
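
A minimal sketch of this choice, assuming the new file simply gets an `_updated` suffix next to the original (`resolve_output_path` is a hypothetical helper, not a function from the repository):

```python
from pathlib import Path


def resolve_output_path(excel_path: str, overwrite_csv: bool) -> Path:
    """Return the original path when overwriting, else a sibling with an '_updated' suffix."""
    path = Path(excel_path)
    if overwrite_csv:
        return path
    return path.with_name(f"{path.stem}_updated{path.suffix}")


print(resolve_output_path("database/combined_filtered.xlsx", overwrite_csv=False))
```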

---

## ⚠️ **Caution: Manual Review is Still Required**

> ❗ **LLMs are powerful but not perfect**. They may misclassify edge cases or ambiguous abstracts.
>
> 🔍 Always **manually inspect** the final results before using them for publication or decision-making.

In [12]:
run_abstract_filtering_colab(model_name="gpt-4o-mini", placeholders=placeholders, cfg=cfg)

--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\balan\anaconda3\envs\pyblinker\Lib\logging\__init__.py", line 1163, in emit
    stream.write(msg + self.terminator)
  File "C:\Users\balan\anaconda3\envs\pyblinker\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f680' in position 33: character maps to <undefined>
Call stack:
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\balan\anaconda3\envs\pyblinker\Lib\site-packages\ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "C:\Users\balan\anaconda3\envs\pyblinker\Lib\site-packages\traitlets\config\application.py", line 1075, in launch_instance
    app.start()
  File "C:\Users\balan\anaconda3\envs\pyblinker\Lib\site-p

The agent instruction is: role: An expert evaluator specializing in the relevance of research abstracts  related to wafer defect classification.  You possess advanced knowledge of machine learning applications  specifically tailored to semiconductor manufacturing and inspection.

goal: Determine whether the provided abstract is directly relevant to the  research topic: "wafer defect classification".  The evaluation should focus on identifying: - The use of machine learning techniques. - A specific application to wafer defect classification.

backstory: You are a seasoned researcher with extensive expertise in the intersection  of machine learning and semiconductor manufacturing and inspection, particularly in identifying  and classifying wafer defect classification. Your task is to filter abstracts,  ensuring that only those that contribute significantly to the specified  research topic are considered relevant.

evaluation_criteria: The abstract explicitly or implicitly focuses on wafe

2025-06-10 08:57:40,822 - INFO - Loading DataFrame from D:\test_me\project_files\database\wafer_database.xlsx
  0%|          | 0/8 [00:00<?, ?it/s]2025-06-10 08:57:40,842 - INFO - Using abstract text for Tsai_2025.
2025-06-10 08:57:40,842 - INFO - Processing Tsai_2025 with AI agent gpt-4o-mini.
2025-06-10 08:57:41,657 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-06-10 08:57:41,664 - INFO - Successfully processed Tsai_2025, saving at D:\test_me\project_files\abstract_filter\json_output\Tsai_2025.json
 12%|█▎        | 1/8 [00:00<00:05,  1.22it/s]2025-06-10 08:57:41,664 - INFO - Using abstract text for Ingle_2025.
2025-06-10 08:57:41,664 - INFO - Processing Ingle_2025 with AI agent gpt-4o-mini.
2025-06-10 08:57:42,442 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-06-10 08:57:42,452 - INFO - Successfully processed Ingle_2025, saving at D:\test_me\project_files\abstract_filter\json_output\Ingle_

# 🧾 **Step 8b: Review The Excel**

As mentioned in the previous step, the LLM-based filtering adds a new column (`abstract_filter`) to your Excel file, indicating whether each abstract is relevant (`True`) or not (`False`) to your research topic. Review this file before proceeding to the next step. It is recommended to manually delete any entries that are not relevant to your research topic, even if the LLM marked them as relevant.

By dropping these entries, you ensure that the next step (downloading PDFs) only includes papers that are truly pertinent to your study.

# 🧾 **Step 9: Download PDFs (you may skip this)**

Now that your reference list has been filtered to include only **relevant papers** (based on the abstract analysis in **Step 8**), you're ready to automatically download their corresponding PDFs.

This step uses the filtered Excel file (updated in Step 8) to retrieve and save PDFs for each BibTeX entry. The script is powered by Selenium and supports fallback strategies for sources like IEEE, MDPI, and ScienceDirect.

> 🛑 **Note:** This step launches a full browser window during execution. Some publishers may block headless downloads — using a visible browser avoids this issue.

> 📝 **Important:** By default, only **Sci-Hub** is enabled. To use fallback sources like IEEE, MDPI, or ScienceDirect, you must **manually uncomment the relevant function calls** in the script. This allows you to selectively control which sources to attempt.

---

## ✨ What This Step Does

* Loads metadata from the filtered Excel file (`combined_filtered.xlsx` or the updated version from Step 8).
* Attempts to download each paper from **Sci-Hub** first.
* If Sci-Hub fails for a paper, you can optionally enable **fallback downloads** from:

  * IEEE
  * IEEE Search
  * MDPI
  * ScienceDirect (note: may fail due to access restrictions)
* Saves each PDF as `{bibtex_key}.pdf` in the `pdf/` directory for easy tracking and consistent file naming.
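
The "try one source, then fall back to the next" logic can be sketched generically as follows. The fetchers here are stubs that return bytes or raise, standing in for the browser-driven downloaders; `download_with_fallbacks` is a hypothetical helper and does not reflect how `download_pdf.download_pdf` is actually organised.

```python
import tempfile
from pathlib import Path


def download_with_fallbacks(bibtex_key: str, sources: list, output_folder: Path):
    """Try each (name, fetch) pair in order; save the first PDF obtained as {bibtex_key}.pdf."""
    output_folder.mkdir(parents=True, exist_ok=True)
    for name, fetch in sources:
        try:
            pdf_bytes = fetch(bibtex_key)
        except Exception:
            pdf_bytes = None  # treat any failure as "not available from this source"
        if pdf_bytes:
            target = output_folder / f"{bibtex_key}.pdf"
            target.write_bytes(pdf_bytes)
            return name, target
    return None, None


# Stub fetchers standing in for the real downloaders.
def primary_stub(key):
    raise ConnectionError("source unavailable")


def fallback_stub(key):
    return b"%PDF-1.4 demo"


sources = [("primary", primary_stub), ("fallback", fallback_stub)]
used, saved = download_with_fallbacks("smith_2020", sources, Path(tempfile.mkdtemp()) / "pdf")
print(used, saved.name)
```

Keeping the sources in an ordered list mirrors the "uncomment what you need" approach of the notebook: enabling or disabling a source is a one-line change.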

---

## 🧰 Code Snippet

```python
from download_pdf.download_pdf import (
    run_pipeline,
    process_scihub_downloads,
    process_fallback_ieee,
    # process_fallback_ieee_search,
    # process_fallback_mdpi,
    # process_fallback_sciencedirect,
)
from setting.project_path import project_folder
import os

# Project setup
project_review = 'wafer_defect'
path_dic = project_folder(project_review=project_review)
main_folder = path_dic['main_folder']

# Use the filtered Excel from Step 5 (same key as the runnable cell below)
file_path = path_dic['database_path']
output_folder = os.path.join(main_folder, 'pdf')

# Load and categorize data
categories, data_filtered = run_pipeline(file_path)

# Step 1: Attempt to download from Sci-Hub
process_scihub_downloads(categories, output_folder, data_filtered)

# Optional fallback sources — uncomment as needed:
# Step 2: Try IEEE fallback if Sci-Hub fails
# process_fallback_ieee(categories, data_filtered, output_folder)

# Step 3: Use IEEE Search-based fallback
# process_fallback_ieee_search(categories, data_filtered, output_folder)

# Step 4: Try MDPI fallback
# process_fallback_mdpi(categories, data_filtered, output_folder)

# Step 5: Try ScienceDirect fallback (limited due to restrictions)
# process_fallback_sciencedirect(categories, data_filtered, output_folder)

# Optional: save the updated Excel with download statuses
# (`save_data` is not imported above; import it before uncommenting)
# save_data(data_filtered, file_path)
```

---

## 🗂️ Example Folder Structure

```
wafer_defect/
├── database/
│   └── scopus/
│       └── combined_filtered.xlsx     ← Filtered metadata with BibTeX keys (updated in Step 5)
├── pdf/
│   ├── smith_2020.pdf                 ← Saved using bibtex_key
│   ├── kim_2019.pdf
│   └── ...
```

---

In [None]:
import os

from download_pdf.download_pdf import run_pipeline, process_scihub_downloads, process_fallback_ieee
from setting.project_path import project_folder

# Define your project
project_review = 'corona_discharge'

# Load project paths
path_dic = project_folder(project_review=project_review)
main_folder = path_dic['main_folder']

file_path = path_dic['database_path']
output_folder = os.path.join(main_folder, 'pdf')

# Run the main pipeline to load and categorize the data
categories, data_filtered = run_pipeline(file_path)

# First step, we will always use Sci-Hub to attempt PDF downloads
process_scihub_downloads(categories, output_folder, data_filtered)

# Fallback options for entries not available via Sci-Hub:
# Uncomment the following calls one at a time to try specific sources.
# Only process_fallback_ieee is imported above; add the others to the import before using them.

# Uncomment to attempt fallback download from IEEE
# process_fallback_ieee(categories, data_filtered, output_folder)

# Uncomment to attempt fallback download using IEEE Search
# process_fallback_ieee_search(categories, data_filtered, output_folder)

# Uncomment to attempt fallback download from MDPI
# process_fallback_mdpi(categories, data_filtered, output_folder)

# Uncomment to attempt fallback download from ScienceDirect
# Note: ScienceDirect URLs can be extracted but PDFs may not be downloadable due to security restrictions
# process_fallback_sciencedirect(categories, data_filtered, output_folder)

# Uncomment to save the updated data to Excel after processing
# save_data(data_filtered, file_path)

# 🧾 **Step 10: Convert PDFs to XML using GROBID (you may skip this)**

After downloading the PDFs, the next step is to convert them into structured **TEI XML** format using [**GROBID**](https://grobid.readthedocs.io). This step enables downstream tasks like metadata extraction, reference parsing, and full-text analysis.

---

## 🧰 What This Step Does

* Processes all PDF files in your `pdf/` directory.
* Sends each PDF to GROBID's REST API (`/api/processFulltextDocument`), one file per request.
* Saves the resulting XML files into the `xml/` folder (one `.xml` per `.pdf`).
* Leverages Docker for fast, isolated execution.

---

## ⚙️ Setup Requirements

> 🛠️ **GROBID requires WSL + Docker on Windows**

* You must have **WSL** installed (tested on **WSL2 with Ubuntu 22.04**).
* You must have **Docker** installed and **running** before launching GROBID.

---

## 🐳 How to Install & Run GROBID

1. **Pull the Docker image from Docker Hub**
   Check for the [latest version](https://hub.docker.com/r/grobid/grobid/tags), or use the stable one:

   ```bash
   docker pull grobid/grobid:0.8.1
   ```

2. **Start the GROBID container in Ubuntu (WSL)**

   Open your Ubuntu terminal and run:

   ```bash
   docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.1
   ```

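   > ✅ This exposes GROBID's REST API on `http://localhost:8070/`. The `--gpus all` flag requires an NVIDIA GPU with the NVIDIA Container Toolkit; omit it to run on CPU only.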

3. **Test it in your browser**

   Open a browser (e.g., Firefox or Chrome) and navigate to:

   ```
   http://localhost:8070/
   ```

   You should see the GROBID interface.
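
If you prefer to check the service from a notebook cell instead of a browser, the sketch below polls GROBID's health-check endpoint (`/api/isalive`). The function name `grobid_is_alive` and its defaults are illustrative assumptions, not part of this project.

```python
import http.client
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5):
    """Return True if the GROBID health-check endpoint answers with HTTP 200.

    Illustrative helper; adjust base_url if you mapped a different port.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed reply
        return False
```

A `False` result usually means the container is not running yet or the port mapping differs from `8070`.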

---

## 🚀 Batch Conversion Command

Once the GROBID service is running, you can convert all PDFs in your `pdf/` folder to XML using:

```bash
# From your project root (in WSL/Ubuntu)
cd path/to/your/project

# Create the output folder if it does not exist
mkdir -p xml

# GROBID processes one PDF per request, so loop over the folder
for pdf in pdf/*.pdf; do
  curl -s --form "input=@${pdf}" \
    localhost:8070/api/processFulltextDocument \
    -o "xml/$(basename "${pdf}" .pdf).xml"
done
```

Or, use a Python wrapper or script to iterate over PDFs and call GROBID’s REST API for more control.
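
One way such a script might look is sketched below. It assumes the third-party `requests` package is installed and that GROBID is listening on `localhost:8070`; `convert_pdfs` and its `post` parameter are hypothetical names introduced here so the upload call can be swapped out when testing.

```python
from pathlib import Path

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"


def convert_pdfs(pdf_dir, xml_dir, url=GROBID_URL, post=None):
    """Send each PDF to GROBID and save the TEI XML as `<stem>.xml`.

    Returns the names of the PDFs that GROBID did not convert.
    """
    if post is None:
        import requests  # third-party; imported lazily so `post` can be replaced

        post = requests.post

    xml_dir = Path(xml_dir)
    xml_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        with pdf.open("rb") as fh:
            # GROBID expects the PDF in a multipart form field named "input"
            resp = post(url, files={"input": fh}, timeout=300)
        if resp.status_code == 200:
            (xml_dir / f"{pdf.stem}.xml").write_bytes(resp.content)
        else:
            failed.append(pdf.name)
    return failed
```

Keeping the PDF stem as the XML file name preserves the BibTeX key, so later steps can match each XML file to its row in the Excel file.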

---

## 🗂️ Folder Structure After Conversion

```
wafer_defect/
├── pdf/
│   ├── smith_2020.pdf
│   └── kim_2019.pdf
├── xml/
│   ├── smith_2020.xml
│   └── kim_2019.xml
```


# 🧾 **Step 11: Convert XML to JSON (you may skip this)**

This step converts GROBID-generated TEI XML files into structured JSON format. While optional, it can be helpful for reviewing document content, integrating into other tools, or preparing data to feed into an LLM.

> 📝 **Note:** This step is **optional** — the main pipeline (`run_llm`) reads directly from XML. Use this conversion if you want to inspect or process JSON files instead.

---

## ✨ What This Step Does

* Reads all `*.xml` files from the `xml/` directory.
* Converts each into a corresponding `*.json` file (preserving the **BibTeX key as filename** for consistency).
* Stores all JSON outputs in `xml/json/`.

In addition, it handles and organizes special cases:

* 📁 **`xml/json/no_intro_conclusion/`**: XML files where GROBID could not detect an *introduction* or *conclusion* section.
* 📁 **`xml/json/untitled_section/`**: XML files where GROBID could not detect any section titles at all — these require manual checking.
* 📄 Other successfully processed files are stored directly in `xml/json/`.
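
The routing described above can be pictured with a small sketch. This is not the logic of `grobid_tei_xml.xml_json`; `target_subfolder` is a hypothetical function, and it assumes that a paper missing *either* an introduction or a conclusion heading goes to `no_intro_conclusion/`.

```python
def target_subfolder(section_titles):
    """Pick the output subfolder for one paper from its section titles.

    Returns "" for the main `xml/json/` folder. Hypothetical illustration only.
    """
    titles = [t.strip().lower() for t in section_titles if t and t.strip()]
    if not titles:
        # No usable headings at all: needs manual checking
        return "untitled_section"
    has_intro = any("introduction" in t for t in titles)
    has_conclusion = any("conclusion" in t for t in titles)
    if not (has_intro and has_conclusion):
        return "no_intro_conclusion"
    return ""
```

Matching on substrings means headings such as "1. Introduction" or "Conclusions" are still recognized.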

---

## ▶️ Example Code

```python
from setting.project_path import project_folder
from grobid_tei_xml.xml_json import run_pipeline

project_review = 'corona_discharge'

# Load project paths
path_dic = project_folder(project_review=project_review)

# Convert all XML files in the specified folder to JSON
run_pipeline(path_dic['xml_path'])
```

---

## 🗂️ Example Folder Structure After Conversion

```
corona_discharge/
├── pdf/
│   ├── smith_2020.pdf
│   └── kim_2019.pdf
├── xml/
│   ├── smith_2020.xml
│   ├── kim_2019.xml
│   └── json/
│       ├── smith_2020.json
│       ├── kim_2019.json
│       ├── no_intro_conclusion/
│       │   └── failed_paper1.json
│       └── untitled_section/
│           └── failed_paper2.json
```

In [None]:
from setting.project_path import project_folder
from grobid_tei_xml.xml_json import run_pipeline

project_review = 'corona_discharge'

# Load project paths
path_dic = project_folder(project_review=project_review)
run_pipeline(path_dic['xml_path'])

# 🧾 **Step 12: Extract Methodology Details Using LLM**

At this stage, your reference list is filtered and the corresponding PDFs (or abstracts) are available. Now, the focus shifts to **extracting key methodological insights** from each paper, such as:

* 🧠 Classification algorithms
* 🛠️ Feature engineering approaches
* 📏 Evaluation metrics

This is achieved using a specialized **LLM agent** with a targeted prompt for methodology extraction.

> 📌 This step works with **full manuscripts (PDF)** when available, and **falls back to abstracts** if no PDF exists. This flexibility ensures comprehensive analysis even with incomplete data.

---

## ✨ What This Step Does

* Loads your filtered Excel or CSV file from Step 5 or 6.
* For each relevant paper:

  * If the PDF is available: extracts methodology from the full text.
  * If the PDF is **not** available: extracts from the **abstract** instead.
* Uses a domain-specific LLM prompt to analyze methodological content.
* Appends the results to the existing Excel or CSV file.
* Saves per-paper structured JSON files for advanced or customized usage.
* Handles backups automatically if overwriting the metadata file.

> ⛑️ **Safety Note**: If `'overwrite_csv': True`, a **timestamped backup** of the original `.csv` or `.xlsx` file is automatically created **in the same folder** before any updates are made. This prevents accidental corruption and allows for recovery or version tracking.
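
If you want the same kind of safety net before editing the metadata file by hand, a timestamped copy can be made with the standard library. The sketch below is an assumption about how such a backup could be produced, not the project's own code; `backup_file` is a hypothetical name, and the file-name pattern mirrors the `combined_filtered_backup_20250508_1530.xlsx` example shown in this step's folder structure.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_file(path, now=None):
    """Copy `path` to `<stem>_backup_<YYYYMMDD_HHMM><suffix>` in the same folder."""
    path = Path(path)
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M")
    backup = path.with_name(f"{path.stem}_backup_{stamp}{path.suffix}")
    shutil.copy2(path, backup)  # copy2 also preserves file timestamps
    return backup
```

The function returns the backup path so you can log it or restore from it later.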

---

## 📄 Prompt Purpose

This step uses a **domain-aware analytical agent** designed to:

> “Extract methodological details (e.g., classification algorithms, feature engineering, evaluation metrics) from filtered papers relevant to a specific machine learning task.”

The prompt is defined in a YAML config file (`agent_ml.yaml`) and is tailored by the agent name you provide.

---

## 🗂️ Folder and Output Structure

Your project directory may look like this after completing this step:

```
wafer_defect/
├── database/
│   └── scopus/
│       ├── combined_filtered.xlsx                  ← Updated metadata with extraction results
│       ├── combined_filtered_backup_20250508_1530.xlsx  ← Auto-generated backup (if overwrite_csv=True)
├── pdf/
│   ├── smith_2020.pdf                              ← Saved using bibtex_key
│   ├── kim_2019.pdf
│   └── ...
├── methodology_gap_extractor/
│   ├── json_output/
│   │   └── smith_2020.json
│   ├── multiple_runs_folder/
│   └── final_cross_check_folder/
```

---

## 🧠 Supported Models

Choose from the following supported LLMs:

* `"gpt-4o"`
* `"gpt-4o-mini"`
* `"gpt-o3-mini"`
* `"gemini-1.5-pro"`
* `"gemini-exp-1206"`
* `"gemini-2.0-flash-thinking-exp-01-21"`

---

## ⚠️ Important Reminders

* ✅ Set `used_abstract = True` if some papers lack full PDFs.
* 🛑 **Always verify** extracted methodologies manually before using them in analysis, models, or publication. LLMs can hallucinate or misinterpret technical details.

---

In [13]:
model_for_method_extractor = "gpt-4o"

run_methodology_extraction_colab(
    model_name_method_extractor=model_for_method_extractor,
    agent_name="methodology_gap_extractor",
    placeholders=placeholders,
    cfg=cfg
)


The agent instruction is: role: An analytical agent specialized in extracting methodological details  (e.g., classification algorithms, feature engineering approaches, evaluation metrics)  from the filtered set of papers deemed relevant to partial discharge classification using machine learning.

goal: From each selected paper, identify the core methods and rationales: - Which machine learning techniques or algorithms are used (if applicable). - Why those techniques were chosen. - Performance metrics employed.

backstory: As a meticulous methodology researcher, you have deep knowledge of various  analysis techniques and their typical application domains. You leverage NLP-based parsing  to summarize each paper's approach and methodological framework efficiently.

evaluation_criteria: Accurately identifies and lists the machine learning techniques employed (e.g., SVM, LSTM, Random Forest).
Extracts the exact text or rationale stated in the paper for selecting the methods used.
Notes impo

2025-06-10 08:58:22,453 - INFO - Loading DataFrame from D:\test_me\project_files\database\wafer_database.xlsx
  0%|          | 0/5 [00:00<?, ?it/s]2025-06-10 08:58:22,463 - INFO - Using abstract text for Tsai_2025.
2025-06-10 08:58:22,463 - INFO - Processing Tsai_2025 with AI agent gpt-4o.
2025-06-10 08:58:31,142 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-06-10 08:58:31,149 - INFO - Successfully processed Tsai_2025, saving at D:\test_me\project_files\methodology_gap_extractor\json_output\gpt-4o\Tsai_2025.json
 20%|██        | 1/5 [00:08<00:34,  8.69s/it]2025-06-10 08:58:31,152 - INFO - Using abstract text for Ingle_2025.
2025-06-10 08:58:31,152 - INFO - Processing Ingle_2025 with AI agent gpt-4o.
2025-06-10 08:58:41,222 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-06-10 08:58:41,222 - INFO - Successfully processed Ingle_2025, saving at D:\test_me\project_files\methodology_gap_extractor\j

# 🧾 **Step 12: Draft Literature Review (Chapter 2) Using Combined JSON**

Once you've extracted methodological insights in the previous step, the next logical move is to **start organizing and drafting your literature review** (typically Chapter 2 of a thesis or paper). This step outlines how to **combine extracted results** into a single, structured file, and how to use that file to produce high-quality summaries, tables, and narrative drafts using an LLM.

> 📌 This is a branching step — not everyone will follow this exact path — but it's a powerful way to move from **structured data → written draft** efficiently.

---

## ✨ What This Step Does

* Collects individual methodology JSON files.
* Merges them into a single, unified JSON (`combined_output.json`).
* Prepares this JSON as input for:

  * Google AI Studio (e.g., Gemini Pro)
  * GPT-based agents
  * Jupyter notebooks or drafting pipelines
* Enables:

  * Narrative generation for Chapter 2
  * Thematic clustering of methods
  * Structured tables summarizing key findings
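
The merge itself is conceptually simple. The sketch below shows one possible layout, where each paper's JSON is stored under its file stem (the BibTeX key); the real `combine_methodology_output_colab` helper may structure the output differently, and `combine_json_folder` is a hypothetical name.

```python
import json
from pathlib import Path


def combine_json_folder(json_dir, output_file):
    """Merge every `*.json` in json_dir into one file keyed by file stem."""
    combined = {}
    for path in sorted(Path(json_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            combined[path.stem] = json.load(fh)
    with Path(output_file).open("w", encoding="utf-8") as fh:
        json.dump(combined, fh, indent=2, ensure_ascii=False)
    return combined
```

Write the combined file outside `json_dir`, otherwise a second run would pick it up as if it were another paper.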


---

## 🗂️ Output Structure Example

```
corona_discharge/
├── combined_output.json      ← ✅ Master file for summarization and drafting
├── methodology_gap_extractor_partial_discharge/
│   └── json_output/
│       └── *.json            ← Individual extracted documents
```

---

## 📊 Using LLM to Generate Summary Tables

With the `combined_output.json`, you can prompt an LLM to create structured tables summarizing key findings. For example:

> “Using the combined JSON, generate a table with the following columns:
> **Author**, **Machine Learning Algorithm**, **Feature Types**, **Dataset Used**, and **Performance Metrics**.”

This gives you a **quick-glance overview** of the landscape, helpful both for understanding trends and citing clearly in your literature review.

### ✅ Sample Columns for Summary Table:

* **Author / Year**
* **ML Algorithm(s) Used**
* **Features**
* **Dataset / Source**
* **Performance (Accuracy, F1, etc.)**

---

## ✍️ Drafting Strategy Tips

You can also use LLM prompts like:

> “Based on the combined JSON, write a summary paragraph comparing the top 3 machine learning techniques used for partial discharge classification.”

Or:

> “Generate an introduction section discussing the evolution of feature engineering techniques in this domain.”

---

## ⚠️ Important Reminders

* ✅ Make sure all JSON structures are consistent before combining.
* 📉 Segment or cluster the JSON by subdomain if the file becomes too large.
* 🔍 Review all generated tables and text — LLMs are **tools**, not final authorities.
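
When the combined file is too large for a single prompt, a simple stand-in for subdomain clustering is to split it into fixed-size batches. The sketch below assumes the combined JSON is a dictionary keyed by paper; `split_combined` and `max_papers` are hypothetical names chosen for illustration.

```python
from itertools import islice


def split_combined(combined, max_papers=20):
    """Yield successive dicts holding at most max_papers entries each."""
    if max_papers < 1:
        raise ValueError("max_papers must be at least 1")
    items = iter(combined.items())
    while True:
        batch = dict(islice(items, max_papers))
        if not batch:
            return
        yield batch
```

Each batch can then be pasted or uploaded separately, with the partial summaries merged in a final prompt.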

---

In [17]:
combine_methodology_output_colab(model_name_method_extractor=model_for_method_extractor, cfg=cfg)

# 🧾 **Step 13: Export Filtered Excel to BibTeX**

After reviewing and filtering your `combined_filtered.xlsx` file, you can convert the refined list of papers back into a **BibTeX file**. This can be helpful for citation management or integration with tools like LaTeX or reference managers.

---

## ✨ What This Step Does

* Loads the filtered Excel file containing your selected papers
* Converts the data back into BibTeX format
* Saves the result to a `.bib` file for easy reuse or citation


## 📁 File Structure Example

```
wafer_defect/
├── database/
│   └── scopus/
│       ├── combined_filtered.xlsx                  ← Filtered Excel file with selected papers
│       ├── combined_filtered_backup_20250508_1530.xlsx  ← Auto backup if overwrite is enabled
│       └── filtered_output.bib                     ← Newly generated BibTeX file
├── pdf/
│   ├── smith_2020.pdf                              ← Full-text PDFs saved by bibtex_key
│   ├── kim_2019.pdf
│   └── ...
├── methodology_gap_extractor_wafer_defect/
│   ├── json_output/
│   │   └── smith_2020.json                         ← Extracted methodology as structured JSON
│   ├── multiple_runs_folder/
│   └── final_cross_check_folder/

```

---

## 🛠️ Notes

* Ensure the Excel file has standard bibliographic columns like: `title`, `author`, `year`, `journal`, `doi`, etc.
* The function `generate_bibtex()` maps these fields into valid BibTeX entries.
* You can open the `.bib` file in any text editor or reference manager to confirm the results.
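
To make the column-to-field mapping concrete, the sketch below formats a single record as a BibTeX entry. It is not the implementation of `generate_bibtex()`; `row_to_bibtex` is a hypothetical function, and it assumes the row is a plain dictionary with a `bibtex_key` entry plus the columns listed above.

```python
import math


def _is_blank(value):
    """Treat None, NaN (empty Excel cells read by pandas), and "" as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() == ""


def row_to_bibtex(row, entry_type="article"):
    """Format one record (a dict) as a BibTeX entry string."""
    fields = ("title", "author", "year", "journal", "doi")
    lines = [f"@{entry_type}{{{row['bibtex_key']},"]
    for name in fields:
        value = row.get(name)
        if _is_blank(value):
            continue  # skip empty cells instead of writing blank fields
        lines.append(f"  {name} = {{{str(value).strip()}}},")
    lines.append("}")
    return "\n".join(lines)
```

Skipping blank cells keeps the `.bib` file free of empty fields, which some reference managers flag as warnings.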

In [19]:
print(cfg)

{'project_review': 'wafer', 'env_path': 'D:\\test_me\\project_files\\.env', 'main_folder': 'D:\\test_me\\project_files', 'project_root': 'D:\\test_me\\project_files\\database', 'yaml_path_abstract': 'C:\\Users\\balan\\IdeaProjects\\academic_paper_maker\\research_filter\\agent\\abstract_check.yaml', 'yaml_path_methodology': 'C:\\Users\\balan\\IdeaProjects\\academic_paper_maker\\research_filter\\agent\\agent_ml.yaml', 'config_file': 'D:\\test_me\\project_files\\project_folders.json', 'methodology_gap_extractor_path': 'D:\\test_me\\project_files\\methodology_gap_extractor\\json_output'}


In [25]:
import os
import pandas as pd
from post_code_saviour.excel_to_bib import generate_bibtex
from setting.project_path import project_folder


# Load project paths
path_dic = project_folder(project_review=cfg['project_review'], config_file=cfg['config_file'])

# Write the .bib file into the project's database folder
output_bib = os.path.join(path_dic['main_folder'], 'database', 'filtered_output.bib')

# Load the filtered Excel file
df = pd.read_excel(path_dic['database_path'])

# Generate BibTeX file
generate_bibtex(df, output_file=output_bib)

BibTeX file generated: D:\test_me\project_files\database\filtered_output.bib
