# Notebook Summary: Academic Paper Maker Workflow

This notebook guides you through a step-by-step process for setting up your environment, collecting academic papers, filtering them using Large Language Models (LLMs), extracting key information, and preparing files for literature review drafting and citation.

---

## How to Run This Notebook

To run this notebook, follow these instructions:

1.  **Read each step** below first to understand the process.
2.  **Run the code cells** one by one, in the order they appear.
3.  Click on a code cell and press `Shift + Enter`, or click the play button (`▶`) on the left side of the cell.
4.  A `[*]` next to a cell indicates it's currently running. It will change to a number (e.g., `[1]`, `[2]`) when execution is complete.

---

## Summary of Steps

This notebook consists of **12 distinct steps** presented sequentially through Markdown headers, guiding you through the automated academic paper review process.

Here is a summary of each step and its purpose:

1.  **Step 1: Install Required Packages**
    *   **Purpose:** To set up the necessary Python environment by cloning the `academic_paper_maker` repository and installing all required dependencies from `requirements.txt`.

2.  **Step 2: Utility Functions for Colab Execution**
    *   **Purpose:** To define helper functions specifically tailored for running the various steps within a Google Colab environment, simplifying common operations like API testing, filtering, JSON merging, and BibTeX generation. (Code cell implements these functions).

3.  **Step 3: Obtain Your API Keys**
    *   **Purpose:** To instruct the user on how to acquire necessary API keys (e.g., OpenAI, Gemini) from their respective platforms, which are required for using LLM-based features later in the workflow. Keys are saved externally at this stage.

4.  **Step 4: Create and Register Your Project Folder**
    *   **Purpose:** To define and initialize the main directory for your project, setting up a standard folder structure (`database/scopus`, `pdf`, `xml`) and registering the project's location for easy access in future sessions.

5.  **Step 5: Create a `.env` File to Store API Keys**
    *   **Purpose:** To create a `.env` file in the project directory. This file serves as a secure location to store your obtained API keys (manually added by the user after creation), preventing credentials from being hardcoded.

6.  **Step 5a: Load API Key and Test OpenAI Connection**
    *   **Purpose:** To verify that the OpenAI API key stored in the `.env` file is correctly loaded and can successfully connect and communicate with the OpenAI API by sending a simple test query.

7.  **Step 6: Download the Scopus BibTeX File**
    *   **Purpose:** To guide the user on how to use the Scopus database (or a similar source) to perform advanced searches for relevant academic papers and download the results in BibTeX format (`.bib`) into the designated `database/scopus` folder.

8.  **Step 7: Combine Scopus BibTeX Files into Excel**
    *   **Purpose:** To process the downloaded BibTeX (`.bib`) files from the `database/scopus` folder, extract key metadata from all files, and merge them into a single, structured Excel file (`combined_filtered.xlsx`) for subsequent filtering and review.

9.  **Step 8: Automatically Filter Relevant References Using LLM**
    *   **Purpose:** To employ a Large Language Model (LLM) to automatically read the abstracts from the `combined_filtered.xlsx` file and classify their relevance based on a user-defined `topic` and `topic_context`, adding a `True`/`False` filter column to the Excel.

10. **Step 7: Extract Methodology Details Using LLM** (Note: Header numbers are inconsistent, but this is a distinct step logically)
    *   **Purpose:** To use an LLM agent to extract detailed methodological information (e.g., algorithms, datasets, metrics) from the filtered papers, utilizing either the full PDF text (if available) or the abstract as a fallback, and saving the results in structured JSON files and updating the Excel.

11. **Step 8: Draft Literature Review (Chapter 2) Using Combined JSON** (Note: Header numbers are inconsistent, but this is a distinct step logically)
    *   **Purpose:** To combine the individual JSON files containing extracted methodology details into a single `combined_output.json`. This file then serves as structured input for LLMs or scripts to assist in drafting literature review sections, generating summary tables, or performing thematic analysis.

12. **Step 9: Export Filtered Excel to BibTeX**
    *   **Purpose:** To convert the final, manually reviewed and filtered Excel file (`combined_filtered.xlsx`) back into a BibTeX (`.bib`) file, facilitating easy import into citation managers or integration with LaTeX documents for generating bibliographies.

---

This sequence of steps automates significant portions of the literature review process, from collecting initial references to extracting detailed insights and preparing final outputs for writing.

# 📦 **Step 1: Install Required Packages**

Before running the notebook, you need to **set up your environment** by cloning the repository and installing all the required Python packages.

---

## 📥 What This Step Does

* **Clones** the GitHub repository `academic_paper_maker` into your Colab environment
* Changes into the project directory so all subsequent code runs in the right context
* Installs dependencies listed in the project's `requirements.txt` file
* Installs `openai` explicitly in case it’s not listed in `requirements.txt`

> ✅ **Tip:** This ensures that all required libraries are available before continuing with API setup or running the app logic.

In [None]:
!git clone https://github.com/balandongiv/academic_paper_maker.git
%cd academic_paper_maker
!pip install -r requirements.txt
# Install openai explicitly in case requirements.txt does not list it
!pip install openai

print("✅ Packages installed successfully. You can now proceed to the next steps.")

---
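After the install cell above has finished, a quick import check can confirm that the packages used later in this notebook are actually available. The sketch below is a hypothetical helper (`missing_modules` is not part of the repository); the module names are the ones imported in Step 2 (`pandas`, `dotenv`, `openai`).

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Module names taken from the imports used in Step 2 of this notebook.
required = ["pandas", "dotenv", "openai"]
missing = missing_modules(required)

if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

If anything is reported as missing, re-run the install cell before continuing.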

# ⚙️ **Step 2: Utility Functions for Colab Execution**

To streamline your workflow, we've wrapped common operations into **modular utility functions**. These abstract away repetitive code and let you focus on your analysis—not setup.

---

## 🧠 Why Use These Functions?

Instead of writing long, error-prone blocks of boilerplate, you can execute key tasks using concise, high-level commands. These functions handle:

* Environment setup
* API connection testing
* LLM-based filtering
* JSON merging
* BibTeX export

> ✅ You only need to supply minimal inputs (like `cfg` or model name), and the functions will take care of the rest.

---

## 🛠️ Available Functions

| Function | Purpose |
| -------- | ------- |
| `create_env_file(env_path, ...)` | Creates a `.env` file with placeholders or prefilled API keys |
| `test_openai_connection(model='gpt-4o')` | Tests your OpenAI key and prints a basic response |
| `run_llm_filtering_pipeline_colab(model_name, cfg, placeholders)` | Executes the LLM-based abstract filtering pipeline |
| `run_abstract_filtering_colab(model_name, placeholders, cfg)` | Runs the same abstract filtering pipeline; takes `placeholders` before `cfg` |
| `run_methodology_extraction_colab(model_name_method_extractor, agent_name, placeholders, cfg)` | Extracts methodology details with an LLM, falling back to the abstract when needed |
| `combine_methodology_output_colab(model_name_method_extractor, cfg)` | Merges the individual JSON outputs into one `combined_output.json` |
| `generate_bibtex_from_excel_colab(cfg)` | Converts a filtered Excel sheet into a `.bib` file for references |

---
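Most of the helpers listed above take a `cfg` dictionary and a `placeholders` dictionary. As a rough sketch of what those inputs look like, the example below checks that `cfg` contains the keys named in the helpers' docstrings. `missing_cfg_keys` is a hypothetical helper, the paths mirror the `cfg` defined in Step 4, and the placeholder text is an invented example.

```python
# Keys listed in the docstring of the abstract filtering helper.
REQUIRED_CFG_KEYS = ("project_review", "config_file", "project_root", "yaml_path")


def missing_cfg_keys(cfg, required=REQUIRED_CFG_KEYS):
    """Return the required keys that are absent or empty in cfg."""
    return [key for key in required if not cfg.get(key)]


# Values copied from the cfg used in Step 4 of this notebook.
example_cfg = {
    "project_review": "wafer",
    "config_file": "/content/setting/project_folders.json",
    "project_root": "/content/my_project/database",
    "yaml_path": "/content/academic_paper_maker/research_filter/agent/abstract_check.yaml",
}

# Illustrative placeholder text only; replace with your own topic and context.
example_placeholders = {
    "topic": "wafer classification",
    "topic_context": "defect pattern recognition on semiconductor wafer maps",
}

print("Missing cfg keys:", missing_cfg_keys(example_cfg))
```

Checking the dictionary up front gives a clearer message than the `KeyError` that would otherwise surface inside a helper.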

## 📌 Reminder

These functions are preloaded into the `academic_paper_maker.helper.google_colab` module to make your notebook cleaner, more modular, and easier to maintain.

> ⚡ Use them as building blocks in your Colab workflow — no need to copy-paste long setup code every time.

---
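The helper functions defined in this step report progress through `logging.info` and failures through `logging.error`. Python's root logger only shows `WARNING` and above by default, so the success messages may not appear unless logging is configured first. A minimal sketch, assuming you want `INFO` messages printed in the notebook output:

```python
import logging

# force=True replaces any handlers that were installed earlier in the session,
# so this call takes effect even if logging was already configured.
logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s: %(message)s",
    force=True,
)

logging.info("INFO-level messages from the helper functions will now be shown.")
```

Because the pipeline helpers catch exceptions and log them rather than re-raising, the log output is also where errors from those steps will show up.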


In [None]:
import logging
import os
import sys
from pathlib import Path

import pandas as pd
from dotenv import load_dotenv
from openai import OpenAI, AuthenticationError, RateLimitError, APIConnectionError

from academic_paper_maker.setting.project_path import project_folder
from academic_paper_maker.helper.combine_json import combine_json_files
from academic_paper_maker.post_code_saviour.excel_to_bib import generate_bibtex



# Path to the directory where "research_filter" lives
project_root = "/content/academic_paper_maker"

# Add it to Python's module search path
if project_root not in sys.path:
    sys.path.append(project_root)

from research_filter.auto_llm import run_pipeline




def generate_bibtex_from_excel_colab(cfg: dict) -> None:
    """
    Generate a BibTeX file from a filtered Excel database.

    Parameters:
        cfg (dict): Configuration dictionary containing at least:
            - project_review (str)

    Returns:
        None
    """
    # Load project paths
    path_dic = project_folder(project_review=cfg['project_review'])
    main_folder = path_dic['main_folder']

    # Define input and output paths
    input_excel = os.path.join(main_folder, 'database', 'combined_filtered.xlsx')
    output_bib = os.path.join(main_folder, 'database', 'filtered_output.bib')

    # Load the filtered Excel file
    df = pd.read_excel(input_excel)

    # Generate BibTeX file
    generate_bibtex(df, output_file=output_bib)

    print(f"✅ BibTeX file saved to: {output_bib}")


def create_env_file(env_path, openai_api_key=None, gemini_api_key=None):
    """
    Creates a .env file with placeholders or provided API keys.

    Parameters:
    - env_path (str): Full path to the .env file (e.g., "/content/my_project/.env").
    - openai_api_key (str, optional): OpenAI API key (e.g., "sk-...").
    - gemini_api_key (str, optional): Gemini (Google AI) API key.
    """
    lines = [
        "# Paste your API keys below.",
        "# OpenAI example: OPENAI_API_KEY=sk-...",
        "# Gemini (Google AI) example: GEMINI_API_KEY=your-google-api-key",
        "",
    ]

    lines.append(f"OPENAI_API_KEY={openai_api_key}" if openai_api_key else "# OPENAI_API_KEY=")
    lines.append(f"GEMINI_API_KEY={gemini_api_key}" if gemini_api_key else "# GEMINI_API_KEY=")

    try:
        with open(env_path, "w") as f:
            f.write("\n".join(lines) + "\n")
        print(f"✅ `.env` file created at: {env_path}")
        print("👉 In Colab, click the folder icon on the left, go to the appropriate folder, right-click `.env`, and choose 'Edit' to enter your keys.")
    except Exception as e:
        print(f"❌ Failed to create `.env`: {e}")




def test_openai_connection(model: str = "gpt-4o") -> None:
    """
    Loads the OpenAI API key from a .env file and sends a test message
    to verify the connection to the OpenAI Chat Completion API.

    Parameters:
        model (str): The model name to test (default: "gpt-4o")

    Expected Output:
        - ✅ Confirmation that the API call succeeded
        - 🤖 The assistant's response to "Hello, what is 2 + 2?"
    """

    # 🔄 Load environment variables from the .env file
    load_dotenv()

    # 🔑 Fetch the OpenAI API key
    api_key = os.getenv("OPENAI_API_KEY")

    if not api_key:
        raise ValueError("❌ OPENAI_API_KEY not found. Please make sure it is set in your .env file.")

    # ✅ Initialize the OpenAI client
    client = OpenAI(api_key=api_key)

    try:
        # 📤 Send a test prompt to the model
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Hello, what is 2 + 2?"}
            ]
        )

        # 🟢 Print the result if successful
        print("✅ API call successful!")
        print("🤖 Response:", response.choices[0].message.content)

    except AuthenticationError:
        print("❌ Authentication failed. Please check your OPENAI_API_KEY.")
    except RateLimitError:
        print("⚠️ Rate limit exceeded. Wait and try again later.")
    except APIConnectionError:
        print("📡 Network issue or OpenAI server unavailable. Check your connection.")
    # except InvalidRequestError as e:
    #     print(f"🚫 Invalid request to the API: {e}")
    except Exception as e:
        print(f"❗ An unexpected error occurred: {e}")




def run_llm_filtering_pipeline_colab(
        model_name: str,
        cfg: dict,
        placeholders: dict
) -> None:
    """
    Run the LLM-based abstract filtering pipeline in Google Colab.

    Parameters:
        model_name (str): The LLM to use (e.g., "gpt-4o", "gpt-4o-mini")
        cfg (dict): Configuration dictionary with keys:
            - project_review
            - config_file
            - project_root
            - yaml_path
            - main_folder
        placeholders (dict): Dictionary with placeholder text for the LLM (e.g., topic and context)

    Returns:
        None
    """
    try:
        logging.info("Initializing LLM filtering pipeline...")

        # Ensure project_root is in sys.path
        if cfg['project_root'] not in sys.path:
            sys.path.append(cfg['project_root'])

        # Get project paths
        path_dic = project_folder(
            project_review=cfg['project_review'],
            config_file=cfg['config_file']
        )
        main_folder = path_dic['main_folder']
        csv_path = path_dic['database_path']

        # Agent configuration
        agentic_setting = {
            "agent_name": "abstract_filter",
            "column_name": "abstract_filter",
            "yaml_path": cfg['yaml_path'],
            "model_name": model_name
        }

        # Output folders
        methodology_json_folder = os.path.join(main_folder, agentic_setting['agent_name'], 'json_output')
        multiple_runs_folder = os.path.join(main_folder, agentic_setting['agent_name'], 'multiple_runs_folder')
        final_cross_check_folder = os.path.join(main_folder, agentic_setting['agent_name'], 'final_cross_check_folder')

        # Processing configuration
        process_setup = {
            'batch_process': False,
            'manual_paste_llm': False,
            'iterative_confirmation': False,
            'overwrite_csv': True,
            'cross_check_enabled': False,
            'cross_check_runs': 3,
            'cross_check_agent_name': 'agent_cross_check',
            'cleanup_json': False
        }

        # Run the LLM-based filtering
        run_pipeline(
            agentic_setting,
            process_setup,
            placeholders=placeholders,
            csv_path=csv_path,
            main_folder=main_folder,
            methodology_json_folder=methodology_json_folder,
            multiple_runs_folder=multiple_runs_folder,
            final_cross_check_folder=final_cross_check_folder,
        )

        logging.info("✅ LLM filtering pipeline completed successfully.")

    except Exception as e:
        logging.error(f"❌ An error occurred while running the LLM filtering pipeline: {e}")


#
#
# def combine_methodology_json_colab(cfg: dict, model_name_method_extractor: str) -> None:
#     """
#     Combine individual JSON files from the methodology gap extractor into a single combined JSON file.
#
#     Parameters:
#         cfg (dict): Configuration dictionary with keys:
#             - project_review
#             - config_file
#             - main_folder
#             - methodology_gap_extractor_path
#         model_name_method_extractor (str): Name of the model used in the methodology gap extractor step.
#                                            This determines the subdirectory to read JSONs from.
#
#     Raises:
#         FileNotFoundError: If the input directory does not exist.
#
#     Returns:
#         None
#     """
#
#     # Load paths from registered project
#     path_dict = project_folder(
#         project_review=cfg['project_review'],
#         config_file=cfg['config_file']
#     )
#
#     # Define input and output paths
#     input_dir = Path(cfg['methodology_gap_extractor_path']) / model_name_method_extractor
#     output_file = os.path.join(path_dict['main_folder'], 'combined_output.json')
#
#     # Validate input directory
#     if not input_dir.exists():
#         raise FileNotFoundError(f"Input directory not found: {input_dir}")
#
#     # Combine JSON files into one
#     combine_json_files(
#         input_directory=input_dir,
#         output_file=output_file
#     )
#
#     print(f"✅ Combined JSON saved to: {output_file}")



def run_abstract_filtering_colab(model_name: str, placeholders: dict, cfg: dict) -> None:
    """
    Run the abstract filtering pipeline using a specified LLM in a Colab environment.

    Parameters:
        model_name (str): The LLM to use (e.g., "gpt-4o", "gpt-4o-mini")
        placeholders (dict): Dictionary with keys like "topic" and "topic_context"
        cfg (dict): Configuration dictionary containing:
            - project_review
            - config_file
            - project_root
            - yaml_path

    Returns:
        None
    """
    try:
        logging.info("🚀 Starting abstract filtering pipeline...")

        # Ensure project root is available
        if cfg['project_root'] not in sys.path:
            sys.path.append(cfg['project_root'])

        # Load project folder structure
        path_dic = project_folder(
            project_review=cfg['project_review'],
            config_file=cfg['config_file']
        )
        main_folder = path_dic['main_folder']
        csv_path = path_dic['database_path']

        # Agent configuration
        agentic_setting = {
            "agent_name": "abstract_filter",
            "column_name": "abstract_filter",
            "yaml_path": cfg['yaml_path'],
            "model_name": model_name
        }

        # Define output folders
        methodology_json_folder = os.path.join(main_folder, agentic_setting['agent_name'], 'json_output')
        multiple_runs_folder = os.path.join(main_folder, agentic_setting['agent_name'], 'multiple_runs_folder')
        final_cross_check_folder = os.path.join(main_folder, agentic_setting['agent_name'], 'final_cross_check_folder')

        # Runtime options
        process_setup = {
            'batch_process': False,
            'manual_paste_llm': False,
            'iterative_confirmation': False,
            'overwrite_csv': True,
            'cross_check_enabled': False,
            'cross_check_runs': 3,
            'cross_check_agent_name': 'agent_cross_check',
            'cleanup_json': False
        }

        # Run the pipeline
        run_pipeline(
            agentic_setting,
            process_setup,
            placeholders=placeholders,
            csv_path=csv_path,
            main_folder=main_folder,
            methodology_json_folder=methodology_json_folder,
            multiple_runs_folder=multiple_runs_folder,
            final_cross_check_folder=final_cross_check_folder,
        )

        logging.info("✅ Abstract filtering completed successfully.")

    except Exception as e:
        logging.error(f"❌ An error occurred during the abstract filtering process: {e}")



def run_methodology_extraction_colab(
        model_name_method_extractor: str,
        agent_name: str,
        placeholders: dict,
        cfg: dict
) -> None:
    """
    Run the methodology gap extraction pipeline using a specified model and agent in Colab.

    Parameters:
        model_name_method_extractor (str): LLM model to use (e.g., "gpt-4o-mini")
        agent_name (str): Name of the agent configuration (e.g., "methodology_gap_extractor")
        placeholders (dict): Dictionary with keys like "topic" and "topic_context"
        cfg (dict): Configuration dictionary with keys:
            - project_review
            - config_file
            - yaml_path

    Returns:
        None
    """
    try:
        logging.info("🚀 Starting methodology gap extraction pipeline...")

        # Load project folder paths
        path_dic = project_folder(
            project_review=cfg['project_review'],
            config_file=cfg['config_file']
        )
        main_folder = path_dic['main_folder']
        csv_path = path_dic['database_path']

        # Set up the agent
        agentic_setting = {
            "agent_name": agent_name,
            "column_name": "methodology_gap",
            "yaml_path": cfg['yaml_path'],
            "model_name": model_name_method_extractor
        }

        # Define output folders
        methodology_json_folder = os.path.join(
            main_folder, agent_name, 'json_output', model_name_method_extractor
        )
        multiple_runs_folder = os.path.join(main_folder, agent_name, 'multiple_runs_folder')
        final_cross_check_folder = os.path.join(main_folder, agent_name, 'final_cross_check_folder')

        # Runtime options
        process_setup = {
            'batch_process': False,
            'manual_paste_llm': False,
            'iterative_confirmation': False,
            'overwrite_csv': True,
            'cross_check_enabled': False,
            'cross_check_runs': 3,
            'cross_check_agent_name': 'agent_cross_check',
            'cleanup_json': False,
            'used_abstract': True  # Always enable abstract fallback
        }

        # Run the pipeline
        run_pipeline(
            agentic_setting,
            process_setup,
            placeholders=placeholders,
            csv_path=csv_path,
            main_folder=main_folder,
            methodology_json_folder=methodology_json_folder,
            multiple_runs_folder=multiple_runs_folder,
            final_cross_check_folder=final_cross_check_folder,
        )

        logging.info("✅ Methodology extraction completed successfully.")

    except Exception as e:
        logging.error(f"❌ An error occurred during the methodology extraction: {e}")



def combine_methodology_output_colab(model_name_method_extractor: str, cfg: dict) -> None:
    """
    Combine individual JSON files from the methodology gap extractor into a single combined JSON.

    Parameters:
        model_name_method_extractor (str): Name of the model used in the methodology gap extractor step
        cfg (dict): Configuration dictionary containing:
            - project_review
            - config_file
            - methodology_gap_extractor_path

    Returns:
        None
    """
    try:
        logging.info("🔄 Starting to combine methodology gap JSON files...")

        # Load project paths
        path_dict = project_folder(
            project_review=cfg['project_review'],
            config_file=cfg['config_file']
        )

        # Define input and output paths
        input_dir = Path(cfg['methodology_gap_extractor_path']) / model_name_method_extractor
        output_file = os.path.join(path_dict['main_folder'], 'combined_output.json')

        # Validate input directory
        if not input_dir.exists():
            raise FileNotFoundError(f"❌ Input directory not found: {input_dir}")

        # Combine JSONs into one file
        combine_json_files(
            input_directory=input_dir,
            output_file=output_file
        )

        logging.info(f"✅ Combined JSON saved to: {output_file}")

    except Exception as e:
        logging.error(f"❌ Failed to combine JSON files: {e}")


# 🔑 **Step 3: Obtain Your API Keys**

Before running any LLM-based tools, you'll need to obtain API keys for **OpenAI** and optionally **Google Gemini**, if you plan to use both.

---

## 🧠 Why You Need This

API keys are required to authenticate your access to models like `gpt-4o` or `gemini-pro`. Without them, the system won’t be able to connect to the services.

---

## 🔐 Where to Get Your Keys

| Provider   | Key Name         | Get It From                                                          |
| ---------- | ---------------- | -------------------------------------------------------------------- |
| **OpenAI** | `OPENAI_API_KEY` | [platform.openai.com/api-keys](https://platform.openai.com/api-keys) |
| **Gemini** | `GEMINI_API_KEY` | [aistudio.google.com](https://aistudio.google.com/apikey)            |

---

## 📋 What To Do With Them Now

Just **copy and save your API keys in a safe place** (e.g., a password manager or text file on your machine).

> 📝 You do **not** need to put them in a `.env` file yet.

In **Step 5**, you’ll insert them into a `.env` configuration file using a helper function.

---

## ✅ Example (Save for Later)

```
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=AIzaSy...
```

> ⚠️ Do not share these keys with anyone or post them online.


# 🔧 Step 4: Create and Register Your Project Folder

In this step, you'll register the **main project folder**, so the system knows where to store and retrieve all files related to your project.

This setup automatically creates a structured folder system under your specified `main_folder`, and stores the configuration in a central JSON file. This ensures your projects remain organized, especially when working with multiple studies or datasets.

---

## 🗂️ Project Folder Structure

On the first run, the following folder structure will be automatically created under your specified `main_folder`:

```
main_folder/
├── database/
│   ├── scopus/
│   └── <project_name>_database.xlsx  ← auto-generated path (not file creation)
├── pdf/
└── xml/
```

---

## 💡 Example

Suppose your project name is `corona_discharge`, and you want to store all project files under:

```
G:\My Drive\research_related\ear_eog
```

You can register this setup by running:

```python
project_folder(
    project_review='corona_discharge',
    main_folder=r'G:\My Drive\research_related\ear_eog'
)
```

✅ This will:

* Save the project path in `setting/project_folders.json`
* Create the full folder structure: `database/scopus`, `pdf`, and `xml`

---
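The registration behaviour described above (create the subfolders once, then remember the project's location in a JSON file) can be sketched with the standard library alone. This is an illustration, not the repository's `project_folder` implementation: `register_project` and `lookup_project` are hypothetical names, and the real function may store additional paths.

```python
import json
import tempfile
from pathlib import Path

# Subfolders named in the folder structure above.
SUBFOLDERS = ("database/scopus", "pdf", "xml")


def register_project(project_review, main_folder, config_file):
    """Create the folder structure and record project_review -> main_folder in JSON."""
    main = Path(main_folder)
    for sub in SUBFOLDERS:
        (main / sub).mkdir(parents=True, exist_ok=True)

    config = Path(config_file)
    registry = {}
    if config.exists():
        registry = json.loads(config.read_text(encoding="utf-8"))
    registry[project_review] = str(main)

    config.parent.mkdir(parents=True, exist_ok=True)
    config.write_text(json.dumps(registry, indent=2), encoding="utf-8")
    return registry


def lookup_project(project_review, config_file):
    """Return the registered main folder for a project name."""
    registry = json.loads(Path(config_file).read_text(encoding="utf-8"))
    return Path(registry[project_review])


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    cfg_file = Path(tmp) / "setting" / "project_folders.json"
    register_project("corona_discharge", Path(tmp) / "ear_eog", cfg_file)
    found = lookup_project("corona_discharge", cfg_file)
    print(found.name, sorted(p.name for p in found.iterdir()))
```

Keeping the mapping in one JSON file is what lets later sessions locate the project from its name alone.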

## 🔁 Loading the Project Later

Once registered, you can access the project folders in future sessions by providing just the `project_review` name:

```python
paths = project_folder(project_review='corona_discharge')

print(paths['main_folder'])  # Main project folder
print(paths['database_path'])  # Path to the Excel database file
```

---

## ⚙️ What Happens in the Background?

* A file named `project_folders.json` is stored in the `setting/` directory within your project.
* It maps each `project_review` to its corresponding `main_folder`.
* Folder structure is created automatically on the first run.
* On subsequent runs, the system reads the JSON to locate your project — no need to re-enter paths.


In [None]:
from academic_paper_maker.setting.project_path import project_folder

cfg = {
    "project_review": "wafer",
    "env_path": "/content/.env",
    "main_folder": "/content/my_project",  # Use a Unix-style path for Colab
    "project_root": "/content/my_project/database",
    "yaml_path": "/content/academic_paper_maker/research_filter/agent/abstract_check.yaml",
    "config_file": "/content/setting/project_folders.json",
    "methodology_gap_extractor_path": "/content/my_project/methodology_gap_extractor/json_output",
}


project_folder(project_review=cfg['project_review'], main_folder=cfg['main_folder'])

print('Success')


# 🔐 **Step 5: Create a `.env` File to Store API Keys**

Now that you’ve obtained your API keys, the next step is to securely store them in a `.env` file inside your project directory. This ensures your credentials are protected and accessible by your code—without hardcoding them.

---

## 🛠️ What This Step Does

This step will:

* ✅ Create a `.env` file at the path defined by `cfg["env_path"]`
* ✅ Add helpful comments inside the file so you know where to paste your keys
* ✅ Support both `OPENAI_API_KEY` and `GEMINI_API_KEY` (also known as `GOOGLE_API_KEY`)
* ✅ Ensure compatibility with tools like `python-dotenv` for easy runtime access

---

## ✍️ How to Manually Edit the File

1. Click the 📁 **folder icon** on the left sidebar
2. Find `.env` at the path you set in `cfg["env_path"]` (e.g., `/content/.env`)
3. Right-click it → **Edit**
4. Paste your keys in this format:

```env
# 🔑 OpenAI API key
OPENAI_API_KEY=sk-...

# 🔑 Google Gemini API key
GEMINI_API_KEY=AIzaSy...
```

> 💡 Some SDKs read the Gemini key from `GOOGLE_API_KEY` instead of `GEMINI_API_KEY`. The key value is the same, so use whichever variable name your SDK expects.

---

## 🔒 Best Practices

* ❌ Never share or commit your `.env` file
* ✅ Add `.env` to `.gitignore` if using Git
* 🔄 Use `load_dotenv()` in your scripts to access the keys safely

> This step sets up a secure foundation for accessing LLMs in your project.

---
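Once the `.env` file exists and you have pasted your keys, you may want to confirm which keys are set without printing their values. The sketch below assumes the simple `KEY=value` layout shown above. `keys_set_in_env_file` is a hypothetical helper, and `python-dotenv` (for example its `dotenv_values` function) remains the more robust parser for real use.

```python
from pathlib import Path

EXPECTED_KEYS = ("OPENAI_API_KEY", "GEMINI_API_KEY")


def keys_set_in_env_file(env_path, expected=EXPECTED_KEYS):
    """Map each expected key to True if the file assigns it a non-empty value."""
    found = {key: False for key in expected}
    for raw_line in Path(env_path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments (including commented-out keys), and non-assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        name = name.strip()
        if name in found:
            found[name] = bool(value.strip())
    return found
```

For example, `keys_set_in_env_file(cfg["env_path"])` returns `{"OPENAI_API_KEY": True, "GEMINI_API_KEY": False}` when only the OpenAI key has been filled in. The key values themselves are never printed.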

In [None]:

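# Option 1 (default): create a template, then paste your keys in manually
create_env_file(cfg["env_path"])

# Option 2: write a key you already have (uncomment and replace the placeholder).
# Note: running this after Option 1 overwrites the template.
# create_env_file(
#     cfg["env_path"],
#     openai_api_key="sk-...",
#     # gemini_api_key="AIzaSy...",
# )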

---

# ⚙️ **Step 5a: Load API Key and Test OpenAI Connection**

After setting up your `.env` file, it's important to **verify that your OpenAI API key is working correctly**. This step uses a built-in helper function to test the connection.

---

## 🔄 What This Step Does

* Loads your API key from the `.env` file
* Initializes an OpenAI client with that key
* Sends a simple test message to the GPT model (e.g., `gpt-4o`)
* Confirms success or provides a clear error message if something goes wrong

---

## ✅ How to Run It

Just call the helper function:

```python
from academic_paper_maker.helper.google_colab import test_openai_connection

test_openai_connection()
```

---

## 📌 Expected Outcome

If everything is set up correctly, you'll see output like:

```
✅ API call successful!
🤖 Response: 2 + 2 is 4.
```

---

## ⚠️ If Something Goes Wrong

Common errors and their meanings:

| Error Type             | What It Means                                 |
| ---------------------- | --------------------------------------------- |
| ❌ AuthenticationError  | API key is missing or incorrect               |
| ⚠️ RateLimitError      | You’re sending too many requests too quickly  |
| 📡 APIConnectionError  | Network issue or OpenAI server is unreachable |
| 🚫 BadRequestError / NotFoundError | Malformed request or unknown model name (`InvalidRequestError` in pre-1.0 SDKs); the helper reports these as an unexpected error |

Make sure your `.env` file includes a valid key in this format:

```env
OPENAI_API_KEY=sk-...
```

> 🔁 Re-run the test after fixing the `.env` or internet connection if needed.

---


In [None]:
print("🔍 Testing OpenAI connection... Please wait.\n")

# Run the test
test_openai_connection()

print("\n✅ If the connection is successful, you should see a response from the assistant above, something like: 'Hello! The sum of 2 + 2 is 4.'")
print("⚠️ If not, please check your `.env` file and verify that your `OPENAI_API_KEY` is correct.")

# 📥 Step 6: Download the Scopus BibTeX File

In this step, you'll use the **Scopus database** to find and download relevant papers for your project. Scopus is a comprehensive and widely-used repository of peer-reviewed academic literature.

---

## 🔍 Using Scopus Advanced Search

To retrieve high-quality and relevant papers, we recommend using **Scopus' Advanced Search** feature. This powerful tool lets you refine your search based on:

* Keywords
* Authors
* Publication dates
* Document types
* And more...

This ensures that your literature collection is both targeted and comprehensive.

---
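An Advanced Search query of the kind described above can be assembled programmatically before pasting it into Scopus. The sketch below is a hypothetical helper (`build_scopus_query` is not part of the repository). It uses the Scopus field codes `TITLE-ABS-KEY`, `PUBYEAR`, and `DOCTYPE`; check Scopus's own search tips for the full syntax before relying on it.

```python
def build_scopus_query(keyword_groups, year_from=None, doc_type=None):
    """Build a Scopus Advanced Search string.

    keyword_groups: list of lists. Terms inside a group are OR-ed,
    and the groups themselves are AND-ed.
    year_from: earliest publication year to include (inclusive), or None.
    doc_type: Scopus document type code such as "ar" (article), or None.
    """
    groups = []
    for group in keyword_groups:
        terms = " OR ".join(f'"{term}"' for term in group)
        groups.append(f"({terms})")

    query = "TITLE-ABS-KEY(" + " AND ".join(groups) + ")"
    if year_from is not None:
        # PUBYEAR > N excludes year N itself, so subtract one to keep year_from.
        query += f" AND PUBYEAR > {year_from - 1}"
    if doc_type:
        query += f" AND DOCTYPE({doc_type})"
    return query


print(build_scopus_query(
    [["wafer"], ["classification", "defect detection"]],
    year_from=2018,
    doc_type="ar",
))
```

Grouping synonyms with `OR` and combining concepts with `AND` keeps the query readable as you add or remove keywords.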

## 💡 Get Keyword Ideas with a Prompt

To help you formulate effective search queries, you can use the following **prompt-based suggestion tool**:

👉 [Keyword Search Prompt](https://gist.github.com/balandongiv/886437963d38252e61634ddc00b9d983)

You may need to modify the prompt to better suit your research domain. Here are some example domains:

* `"corona discharge"`
* `"fatigue driving EEG"`
* `"wafer classification"`

Feel free to add, remove, or tweak keywords as needed to refine your search results.

---

## 💾 Save and Organize Your Results

Once you've finalized your search:

1. **Select all available attributes** when exporting results from Scopus.

   ![Scopus CSV Export](image/scopus_csv_export.png)

2. Choose the **BibTeX** format when saving the export file.
3. Save the file inside the `database/scopus/` folder of your project.

The resulting folder structure might look like this:

```
main_folder/
├── database/
│   └── scopus/
│       ├── scopus(1).bib
│       ├── scopus(2).bib
│       ├── scopus(3).bib
```

Make sure the BibTeX files are correctly named and stored to ensure smooth integration in later steps.
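
As a quick sanity check before moving on, you could count how many entries each export contains. The sketch below is only an illustration: `count_bib_entries` is a hypothetical helper, and it assumes every BibTeX entry begins with `@type{` at the start of a line, which is how Scopus exports are normally laid out.

```python
import re
from pathlib import Path

# A BibTeX entry starts with "@type{" at the beginning of a line.
ENTRY_START = re.compile(r"^\s*@(\w+)\s*\{", re.MULTILINE)
# Blocks that are not bibliographic records.
NON_ENTRIES = {"comment", "preamble", "string"}


def count_bib_entries(scopus_dir):
    """Map each .bib file name in `scopus_dir` to the number of entries it holds."""
    counts = {}
    for bib_file in sorted(Path(scopus_dir).glob("*.bib")):
        text = bib_file.read_text(encoding="utf-8", errors="replace")
        kinds = [match.group(1).lower() for match in ENTRY_START.finditer(text)]
        counts[bib_file.name] = sum(kind not in NON_ENTRIES for kind in kinds)
    return counts
```

Calling `count_bib_entries("main_folder/database/scopus")` returns a dictionary keyed by file name, so an empty or truncated export shows up as a zero or an unexpectedly small count.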

# 📊 Step 7: Combine Scopus BibTeX Files into Excel

Once you've downloaded multiple `.bib` files from Scopus, the next step is to **combine and convert** them into a structured Excel file. This makes it easier to filter, sort, and review the metadata of all collected papers.

---

## 🧰 What This Step Does

* Loads all `.bib` files from your project's `database/scopus/` folder
* Parses the relevant metadata (e.g., title, authors, year, source, DOI)
* Combines the results into a single Excel spreadsheet
* Saves the spreadsheet in the `database/` folder as `combined_filtered.xlsx`



## 📁 Folder Structure Example

After running the script, your folder might look like this:

```
main_folder/
├── database/
│   ├── scopus/
│   │   ├── scopus(1).bib
│   │   ├── scopus(2).bib
│   │   ├── scopus(3).bib
│   └── combined_filtered.xlsx
```

This Excel file will serve as your primary reference for filtering papers before downloading PDFs.


In [None]:
from academic_paper_maker.download_pdf.database_preparation import combine_scopus_bib_to_excel
from academic_paper_maker.setting.project_path import project_folder

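# Resolve the project's standard folders from the registered project name
# (for example, project_review='corona_discharge').
path_dic = project_folder(project_review=cfg['project_review'])
folder_path = path_dic['scopus_path']     # folder holding the Scopus .bib exports
output_excel = path_dic['database_path']  # where the combined Excel output is written

# Parse every .bib file and write the combined spreadsheet.
combine_scopus_bib_to_excel(folder_path, output_excel)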

# 🧾 **Step 8: Automatically Filter Relevant References Using LLM**

After retrieving thousands of BibTeX references from Scopus (**Step 6**) and combining them into an Excel file (`combined_filtered.xlsx`) in **Step 7**, you'll likely find many entries irrelevant to your specific research focus.

In this step, we use a **Large Language Model (LLM)** to classify abstracts based on a defined research topic and context, eliminating the need for tedious manual filtering.

---

## ✨ **What This Step Does**

* Loads abstracts from `combined_filtered.xlsx`
* Applies a **custom LLM prompt** to assess relevance
* Adds a new column to the Excel file with `True`/`False` labels:

  * ✅ `True` → Relevant
  * ❌ `False` → Not relevant

This process is based on a topic-specific prompt defined by you.

---

## 🧠 **LLM Prompt Customization**

For the automated relevance filtering to work effectively, you **must provide two key inputs** to guide the LLM:

* `topic`: A concise statement of your **specific research goal or area of interest**.
  *Example:* `"wafer defect classification"`

* `topic_context`: A description of the **broader scientific or industrial context** where your topic belongs.
  *Example:* `"semiconductor manufacturing and inspection"`

These inputs help the LLM understand what kinds of abstracts are considered relevant and which ones should be filtered out.
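
The sketch below illustrates how two such values might be substituted into a prompt. The template text, the `{name}` placeholder syntax, and the `fill_prompt` helper are all assumptions made for illustration; the real prompt lives in the project's YAML file and may be structured differently.

```python
import string

# Hypothetical template: the real prompt is defined in the project's YAML file
# and may use a different placeholder syntax.
PROMPT_TEMPLATE = (
    "You are screening abstracts for a review on {topic} "
    "within the field of {topic_context}. "
    "Answer True if the abstract is relevant, otherwise False.\n\n"
    "Abstract: {abstract}"
)


def fill_prompt(template, placeholders, abstract):
    """Substitute placeholders, failing loudly if any value is missing."""
    values = dict(placeholders, abstract=abstract)
    # Collect every field name the template expects.
    needed = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    missing = needed - set(values)
    if missing:
        raise KeyError(f"Missing placeholder values: {sorted(missing)}")
    return template.format(**values)
```

Checking for missing names up front gives a clearer error than a half-filled prompt silently reaching the model.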

---

### 💡 *Not sure how to define your `topic` and `topic_context`?*

You can get help from an LLM to generate these values. Here’s how:

1. **Open the filtering prompt** template defined in the YAML file:

   ```
   research_filter/agent/abstract_check.yaml
   ```

2. **Copy the prompt structure** (including the placeholders for `topic` and `topic_context`).

3. **Ask any LLM to assist**, by providing it the YAML prompt and a description of your research.
   For example, you could say:

   > 🧠 *"Given the following prompt structure, and knowing that my research is about identifying AI-generated academic papers, can you help me fill in the `topic` and `topic_context` placeholders?"*

4. The LLM might then suggest:

   ```yaml
   topic: "AI-generated academic paper detection"
   topic_context: "scientific publishing and machine learning ethics"
   ```

This approach ensures your filtering prompt is both precise and contextually grounded, improving the accuracy of the classification.

---

## ⚙️ **Configuration via `agentic_setting`**

The entire filtering behavior is controlled by a configuration dictionary called `agentic_setting`:

```python
agentic_setting = {
    "agent_name": "abstract_filter",     # The identifier for the agent logic (must match YAML)
    "column_name": "abstract_filter",    # Name of the column added to the Excel file
    "yaml_path": yaml_path,              # Path to the YAML file defining agent behavior
    "model_name": model_name             # Name of the LLM model to use
}
```

### 🔍 Parameter Breakdown:

| Key           | Description                                                             |
| ------------- | ----------------------------------------------------------------------- |
| `agent_name`  | Matches the name of the agent defined in the YAML configuration file.   |
| `column_name` | The name of the new column in the Excel file where results are saved.   |
| `yaml_path`   | Path to the YAML file containing the agent's logic and prompt template. |
| `model_name`  | The specific LLM model used (e.g., `"gpt-4o-mini"` or `"gemini-1.5-pro"`). |

---

## 📂 **File and Folder Structure**

To avoid redundant LLM calls (which can be costly), results are cached as JSON files:

```
wafer_defect/
├── database/
│   ├── scopus/
│   └── combined_filtered.xlsx               ← Updated with filtering results
├── abstract_filter/
│   └── json_output/
│       ├── kim_2019.json
│       ├── smith_2020.json
│       └── ...
```

* Each abstract’s filtering result is saved individually.
* If the run encounters an error, it can resume without reprocessing previous abstracts.
* These files are later reused for **cross-checking** and **final review**.
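
The resume behaviour can be pictured with a small caching wrapper. This is a sketch under stated assumptions, not the project's code: `classify_with_cache` is a hypothetical name, `classify` stands in for whatever function calls the LLM, and the JSON layout shown here is invented for the example.

```python
import json
from pathlib import Path


def classify_with_cache(bibtex_key, abstract, classify, cache_dir):
    """Return the cached result for `bibtex_key`, calling `classify` only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{bibtex_key}.json"
    if cache_file.is_file():
        # Already processed in an earlier run: no LLM call needed.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = {"bibtex_key": bibtex_key, "abstract_filter": bool(classify(abstract))}
    cache_file.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result
```

Because each result is written as soon as it is produced, an interrupted run loses at most the abstract that was in flight.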

---

## 📤 **Excel Output Behavior**

Depending on the `overwrite_csv` setting:

* `True` → Updates the original `combined_filtered.xlsx`
* `False` → Creates a new file (e.g., `combined_filtered_updated.xlsx`)
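
A minimal sketch of that choice follows. The helper name `resolve_output_path` and the `_updated` suffix default are assumptions based on the example file name above; the actual pipeline may name the new file differently.

```python
from pathlib import Path


def resolve_output_path(input_path, overwrite_csv, suffix="_updated"):
    """Return the path results are written to for a given `overwrite_csv` value."""
    path = Path(input_path)
    if overwrite_csv:
        return path  # write back into the original file
    # Otherwise keep the original untouched and write next to it.
    return path.with_name(f"{path.stem}{suffix}{path.suffix}")
```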

---

## ⚠️ **Caution: Manual Review is Still Required**

> ❗ **LLMs are powerful but not perfect**. They may misclassify edge cases or ambiguous abstracts.
>
> 🔍 Always **manually inspect** the final results before using them for publication or decision-making.

In [None]:
placeholders = {
    "topic": "wafer defect classification",
    "topic_context": "semiconductor manufacturing and inspection"
}

run_abstract_filtering_colab(model_name="gpt-4o-mini", placeholders=placeholders, cfg=cfg)

# 🧾 **Step 9: Extract Methodology Details Using LLM**

At this stage, your reference list is filtered and the corresponding PDFs (or abstracts) are available. Now, the focus shifts to **extracting key methodological insights** from each paper, such as:

* 🧠 Classification algorithms
* 🛠️ Feature engineering approaches
* 📏 Evaluation metrics

This is achieved using a specialized **LLM agent** with a targeted prompt for methodology extraction.

> 📌 This step works with **full manuscripts (PDF)** when available, and **falls back to abstracts** if no PDF exists. This flexibility ensures comprehensive analysis even with incomplete data.
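
The fallback rule can be sketched as a small selection function. This is an illustration only: `choose_text_source` is a hypothetical helper, and it assumes PDFs are stored as `<bibtex_key>.pdf`, as in the folder layout used in this notebook.

```python
from pathlib import Path


def choose_text_source(bibtex_key, abstract, pdf_dir, used_abstract=True):
    """Pick the full-text PDF when it exists, otherwise fall back to the abstract."""
    pdf_path = Path(pdf_dir) / f"{bibtex_key}.pdf"
    if pdf_path.is_file():
        return {"source": "pdf", "payload": str(pdf_path)}
    if used_abstract and abstract and abstract.strip():
        return {"source": "abstract", "payload": abstract.strip()}
    # Neither a PDF nor a usable abstract: nothing to analyse for this paper.
    return {"source": "none", "payload": None}
```

Recording which source was used makes it easier to judge later how much weight an extracted methodology deserves.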

---

## ✨ What This Step Does

* Loads your filtered Excel or CSV file from Step 8.
* For each relevant paper:

  * If the PDF is available: extracts methodology from the full text.
  * If the PDF is **not** available: extracts from the **abstract** instead.
* Uses a domain-specific LLM prompt to analyze methodological content.
* Appends the results to the existing Excel or CSV file.
* Saves per-paper structured JSON files for advanced or customized usage.
* Handles backups automatically if overwriting the metadata file.

> ⛑️ **Safety Note**: If `'overwrite_csv': True`, a **timestamped backup** of the original `.csv` or `.xlsx` file is automatically created **in the same folder** before any updates are made. This prevents accidental corruption and allows for recovery or version tracking.
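
For readers curious what such a backup involves, here is a minimal sketch. `backup_before_overwrite` is a hypothetical helper, and the file-name pattern is inferred from the example backup name shown in this notebook rather than taken from the project's source.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_before_overwrite(file_path, now=None):
    """Copy `file_path` to a timestamped sibling and return the backup path."""
    path = Path(file_path)
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M")
    backup = path.with_name(f"{path.stem}_backup_{stamp}{path.suffix}")
    shutil.copy2(path, backup)  # copy2 also preserves file timestamps
    return backup
```

The optional `now` argument exists only to make the function easy to test with a fixed timestamp.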

---

## 📄 Prompt Purpose

This step uses a **domain-aware analytical agent** designed to:

> “Extract methodological details (e.g., classification algorithms, feature engineering, evaluation metrics) from filtered papers relevant to a specific machine learning task.”

The prompt is defined in a YAML config file (`agent_ml.yaml`) and is tailored by the agent name you provide.

---

## 🗂️ Folder and Output Structure

Your project directory may look like this after completing this step:

```
wafer_defect/
├── database/
│   ├── scopus/
│   ├── combined_filtered.xlsx                  ← Updated metadata with extraction results
│   └── combined_filtered_backup_20250508_1530.xlsx  ← Auto-generated backup (if overwrite_csv=True)
├── pdf/
│   ├── smith_2020.pdf                              ← Saved using bibtex_key
│   ├── kim_2019.pdf
│   └── ...
├── methodology_gap_extractor/
│   ├── json_output/
│   │   └── smith_2020.json
│   ├── multiple_runs_folder/
│   └── final_cross_check_folder/
```

---

## 🧠 Supported Models

Choose from the following supported LLMs:

* `"gpt-4o"`
* `"gpt-4o-mini"`
* `"gpt-o3-mini"`
* `"gemini-1.5-pro"`
* `"gemini-exp-1206"`
* `"gemini-2.0-flash-thinking-exp-01-21"`

---

## ⚠️ Important Reminders

* ✅ Set `used_abstract = True` if some papers lack full PDFs.
* 🛑 **Always verify** extracted methodologies manually before using them in analysis, models, or publication. LLMs can hallucinate or misinterpret technical details.

---

Shortcut for the cell below: `run_llm_filtering_pipeline_colab`.

In [None]:
run_methodology_extraction_colab(
    model_name_method_extractor="gpt-4o-mini",
    agent_name="methodology_gap_extractor",
    placeholders=placeholders,
    cfg=cfg
)


# 🧾 **Step 10: Draft Literature Review (Chapter 2) Using Combined JSON**

Once you've extracted methodological insights in Step 9, the next logical move is to **start organizing and drafting your literature review** (typically Chapter 2 of a thesis or paper). This step outlines how to **combine extracted results** into a single, structured file, and how to use that file to produce high-quality summaries, tables, and narrative drafts using an LLM.

> 📌 This is a branching step, and not everyone will follow this exact path, but it's a powerful way to move from **structured data → written draft** efficiently.

---

## ✨ What This Step Does

* Collects individual methodology JSON files.
* Merges them into a single, unified JSON (`combined_output.json`).
* Prepares this JSON as input for:

  * Google AI Studio (e.g., Gemini Pro)
  * GPT-based agents
  * Jupyter notebooks or drafting pipelines
* Enables:

  * Narrative generation for Chapter 2
  * Thematic clustering of methods
  * Structured tables summarizing key findings
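
The merge itself is conceptually simple. The sketch below shows one way it might be done; it is not the code behind `combine_methodology_output_colab`. The helper name and the choice to key each record by its file name (the `bibtex_key`) are assumptions.

```python
import json
from pathlib import Path


def combine_json_outputs(json_dir, output_file):
    """Merge every per-paper JSON file in `json_dir` into one file keyed by file name."""
    combined = {}
    for json_file in sorted(Path(json_dir).glob("*.json")):
        try:
            combined[json_file.stem] = json.loads(json_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            # Skip unreadable files rather than aborting the whole merge.
            print(f"Skipping invalid JSON: {json_file.name}")
    Path(output_file).write_text(
        json.dumps(combined, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return combined
```

Writing the combined file outside `json_dir` keeps it from being picked up as an input on the next run.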


---

## 🗂️ Output Structure Example

```
corona_discharge/
├── combined_output.json      ← ✅ Master file for summarization and drafting
├── methodology_gap_extractor_partial_discharge/
│   └── json_output/
│       └── *.json            ← Individual extracted documents
```

---

## 📊 Using LLM to Generate Summary Tables

With the `combined_output.json`, you can prompt an LLM to create structured tables summarizing key findings. For example:

> “Using the combined JSON, generate a table with the following columns:
> **Author**, **Machine Learning Algorithm**, **Feature Types**, **Dataset Used**, and **Performance Metrics**.”

This gives you a **quick-glance overview** of the landscape, helpful both for understanding trends and citing clearly in your literature review.

### ✅ Sample Columns for Summary Table:

* **Author / Year**
* **ML Algorithm(s) Used**
* **Features**
* **Dataset / Source**
* **Performance (Accuracy, F1, etc.)**

---

## ✍️ Drafting Strategy Tips

You can also use LLM prompts like:

> “Based on the combined JSON, write a summary paragraph comparing the top 3 machine learning techniques used for partial discharge classification.”

Or:

> “Generate an introduction section discussing the evolution of feature engineering techniques in this domain.”

---

## ⚠️ Important Reminders

* ✅ Make sure all JSON structures are consistent before combining.
* 📉 Segment or cluster the JSON by subdomain if the file becomes too large.
* 🔍 Review all generated tables and text — LLMs are **tools**, not final authorities.
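
If the combined file grows too large for a single prompt, one simple option is to split it by size. The sketch below assumes the combined JSON is a mapping from `bibtex_key` to a record; `split_by_size` and its `max_chars` default are hypothetical, and character count is only a rough stand-in for a model's token limit.

```python
import json


def split_by_size(combined, max_chars=50_000):
    """Split a {bibtex_key: record} mapping into chunks of roughly `max_chars` of JSON each."""
    chunks, current, current_size = [], {}, 0
    for key, record in combined.items():
        size = len(json.dumps({key: record}, ensure_ascii=False))
        # Start a new chunk when adding this record would exceed the budget.
        if current and current_size + size > max_chars:
            chunks.append(current)
            current, current_size = {}, 0
        current[key] = record
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

A single record larger than `max_chars` still gets its own chunk, so nothing is dropped. Clustering by subdomain, as suggested above, would need a grouping key that this sketch does not assume.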

---

In [None]:
combine_methodology_output_colab(model_name_method_extractor="gpt-4o-mini", cfg=cfg)

# 🧾 **Step 11: Export Filtered Excel to BibTeX**

After reviewing and filtering your `combined_filtered.xlsx` file, you can convert the refined list of papers back into a **BibTeX file**. This can be helpful for citation management or integration with tools like LaTeX or reference managers.

---

## ✨ What This Step Does

* Loads the filtered Excel file containing your selected papers
* Converts the data back into BibTeX format
* Saves the result to a `.bib` file for easy reuse or citation


## 📁 File Structure Example

```
wafer_defect/
├── database/
│   ├── scopus/
│   ├── combined_filtered.xlsx                  ← Filtered Excel file with selected papers
│   ├── combined_filtered_backup_20250508_1530.xlsx  ← Auto backup if overwrite is enabled
│   └── filtered_output.bib                     ← Newly generated BibTeX file
├── pdf/
│   ├── smith_2020.pdf                              ← Full-text PDFs saved by bibtex_key
│   ├── kim_2019.pdf
│   └── ...
├── methodology_gap_extractor_wafer_defect/
│   ├── json_output/
│   │   └── smith_2020.json                         ← Extracted methodology as structured JSON
│   ├── multiple_runs_folder/
│   └── final_cross_check_folder/

```

---

## 🛠️ Notes

* Ensure the Excel file has standard bibliographic columns like: `title`, `author`, `year`, `journal`, `doi`, etc.
* The function `generate_bibtex()` maps these fields into valid BibTeX entries.
* You can open the `.bib` file in any text editor or reference manager to confirm the results.
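
To show what mapping a row into a BibTeX entry can look like, here is a minimal sketch. It is not the implementation of `generate_bibtex()`: the helper `row_to_bibtex`, the fixed field list, and the `bibtex_key` column name are assumptions for illustration.

```python
def row_to_bibtex(row, entry_type="article"):
    """Render one record (a dict of column values) as a BibTeX entry string."""
    key = row.get("bibtex_key") or "unknown_key"
    fields = ["author", "title", "journal", "year", "doi"]
    lines = [f"@{entry_type}{{{key},"]
    for name in fields:
        value = row.get(name)
        # Skip empty cells: None, "", or NaN (the only value not equal to itself).
        if value is None or value == "" or value != value:
            continue
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)
```

With a pandas DataFrame, each row could be passed in as `row.to_dict()`; empty Excel cells arrive as NaN, which is why the sketch filters them out.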

In [None]:
import os
import pandas as pd
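# Use the same package-qualified import style as the earlier cells.
from academic_paper_maker.post_code_saviour.excel_to_bib import generate_bibtex
from academic_paper_maker.setting.project_path import project_folder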


# Load project paths
path_dic = project_folder(project_review=cfg['project_review'])
main_folder = path_dic['main_folder']

# Define input and output paths
input_excel = os.path.join(main_folder, 'database', 'combined_filtered.xlsx')
output_bib = os.path.join(main_folder, 'database', 'filtered_output.bib')

# Load the filtered Excel file
df = pd.read_excel(input_excel)

# Generate BibTeX file
generate_bibtex(df, output_file=output_bib)

### 📄 BibTeX Generator: Use This Shortcut

You can simplify the import with an alias. Here's the **shortcut version**:

```python
from academic_paper_maker.helper.google_colab import generate_bibtex_from_excel_colab as gen_bib
```

Now you can call the function like this:

```python
gen_bib(cfg=cfg)
```

✅ This reads the filtered Excel file and generates a `.bib` file for your bibliography, keeping everything clean and automatic.

In [None]:
from academic_paper_maker.helper.google_colab import generate_bibtex_from_excel_colab as gen_bib
gen_bib(cfg=cfg)