# Introduction
This notebook demonstrates a workflow for economics research papers: fetching a single paper from arXiv, assessing it for inclusion in the database, generating a false abstract, and computing perplexity scores for the original and false abstracts. The goal is to show how these steps fit together in a research data analysis pipeline.

### Objectives
- **Scrape a single economics research paper** from a specified source.
- **Assess the paper** for inclusion in the database.
- If accepted, **generate a false abstract** for the paper.
- **Calculate perplexity scores** for both the original and false abstracts.

### Workflow Overview
1. **Scraping**: Retrieve a single economics research paper from arXiv, filtered by economics category and date range.
2. **Assessment**: Use the inclusion assessment logic to determine if the paper should be included in the database.
3. **False Abstract Generation**: If the paper is accepted, generate a false abstract using a predefined prompt.
4. **Perplexity Scoring**: Calculate perplexity scores for both the original and false abstracts to compare how predictable each text is to a language model.
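
The control flow of the four steps can be sketched in a few lines. This is only an illustration: `PaperRecord`, `run_pipeline`, and the toy callables are hypothetical stand-ins, not the project's `ResearchPaper`, `AbstractModifier`, or `TogetherClient`. The point it shows is that steps 3 and 4 run only when the assessment accepts the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class PaperRecord:
    """Hypothetical stand-in for a database row holding one paper."""
    abstract: str
    accepted: Optional[bool] = None
    false_abstract: Optional[str] = None


def run_pipeline(
    paper: PaperRecord,
    assess: Callable[[str], bool],
    falsify: Callable[[str], str],
    score: Callable[[str], float],
) -> Optional[Tuple[float, float]]:
    """Assess the paper; if accepted, generate a false abstract and score both."""
    paper.accepted = assess(paper.abstract)
    if not paper.accepted:
        return None  # rejected papers stop after the assessment step
    paper.false_abstract = falsify(paper.abstract)
    return score(paper.abstract), score(paper.false_abstract)


def word_count(text: str) -> float:
    """Toy scoring function used only to make the sketch runnable."""
    return float(len(text.split()))


paper = PaperRecord(abstract="inflation rose")
scores = run_pipeline(
    paper,
    assess=lambda text: True,
    falsify=lambda text: text + " sharply",
    score=word_count,
)
print(scores)
```

In the notebook itself, the three callables correspond to the inclusion assessment, the false abstract generation, and the perplexity scoring described above.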

### Prerequisites
- Ensure the database is set up and accessible.
- Verify that the necessary Python modules and dependencies are installed.
- Confirm access to the required APIs (e.g., OpenAI, Together, arXiv).


Note: Run this notebook from the project root directory, not from the containing `notebooks` directory.

In [None]:
import logging
from data.db import get_db_session, close_db_session
from data.models import ResearchPaper
from data.ingest import download_preprints_with_publication_info
from data.generate import assess_papers_by_category, assess_paper_for_inclusion, AbstractModifier
from services.llm_services import BasicOpenAI, TogetherClient
from services.arxiv import arxiv_taxonomy
from prompts_library.data_creation import economics_abstract_modification_prompt
from data.utils import zlib_compression_size

In [None]:


def econ_papers_inclusion_assessment(n=10):
    """
    Assesses unreviewed economics papers from the database for inclusion.

    Args:
        n: Number of papers to assess
    """
    # Get the economics category codes from the arXiv taxonomy
    try:
        from services.arxiv import arxiv_taxonomy
        economics_categories = list(arxiv_taxonomy.get("Economics", {}).keys())
        if not economics_categories:
            logging.warning("No economics categories found in arxiv_taxonomy.")
            # Nothing can be assessed without categories, so stop here
            return
    except ImportError:
        logging.error("Could not import arxiv_taxonomy from services.arxiv.")
        return
    except Exception as e:
        logging.error(f"Error loading economics categories: {e}")
        return

    logging.info(f"Using categories for economics assessment: {economics_categories}")

    # Call the generic assessment function
    assess_papers_by_category(
        domain="economics",
        categories=economics_categories,
        limit=n
        # llm_client can be omitted to let the generic function create one
    )

def download_economics_papers(start_date, end_date, n_samples, output_format='json', published_in_journal=None):
    """
    Downloads ResearchPaper objects for the domain of economics using arXiv taxonomy.

    Args:
        start_date (str): Start date in YYYY-MM-DD format.
        end_date (str): End date in YYYY-MM-DD format.
        n_samples (int): Total number of papers to download.
        output_format (str): Output format (default is 'json').
        published_in_journal (str): Optional - Filter for papers published in a specific journal.
    """
    # Get economics-related categories from arXiv taxonomy
    economics_categories = [
        f"cat:{code}" for code in arxiv_taxonomy.get("Economics", {}).keys()
    ]
    if not economics_categories:
        logging.warning("No economics categories found in arXiv taxonomy.")
        return

    # Join the categories with OR for the search query
    economics_category_query = " OR ".join(economics_categories)
    logging.info(f"Using arXiv categories for economics: {economics_category_query}")

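    # Calculate the number of pagination cursors, rounding up so that a request
    # for fewer than 100 papers still fetches one page (this assumes n_cursor
    # counts the pages to fetch; floor division gave 0 for n_samples=1)
    arxiv_fixed_n_samples_per_request = 100
    n_cursor = (n_samples + arxiv_fixed_n_samples_per_request - 1) // arxiv_fixed_n_samples_per_request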

    # Call the download function for arXiv
    return download_preprints_with_publication_info(
        start_date=start_date,
        end_date=end_date,
        n_cursor=n_cursor,
        output_format=output_format,
        source="arxiv",
        category=economics_category_query,
        published_in_journal=published_in_journal
    )

economics_categories = list(arxiv_taxonomy.get("Economics", {}).keys())

# Create extended list that includes related economics categories
extended_econ_categories = economics_categories + [
    "q-fin.EC",       # Economics in Quantitative Finance
    "q-fin.GN",       # General Finance
    "q-fin.PM",       # Portfolio Management
    "q-fin.RM",       # Risk Management
    "q-fin.ST",       # Statistical Finance
    "q-fin.TR",       # Trading and Microstructure
    "economics",      # Generic category
    "finance",        # Generic category
    "econometrics"    # Generic category
]

## Step 1: Fetching a Research Paper
In this step, we download research papers from arXiv using the `download_economics_papers` function, then assess unreviewed papers for inclusion with `econ_papers_inclusion_assessment`. Finally, we retrieve from the database the latest paper (highest `id`) that was accepted for inclusion but does not yet have a false abstract generated for downstream use.
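
`download_economics_papers` builds its arXiv search string by prefixing each category code with `cat:` and joining the results with `OR`. The sketch below reproduces that step in isolation. `sample_taxonomy` and `build_category_query` are hypothetical names, and the dictionary is an assumed shape for `arxiv_taxonomy` (a domain name mapped to category codes), filled with arXiv's three economics categories.

```python
# Hypothetical stand-in for services.arxiv.arxiv_taxonomy (shape is an assumption)
sample_taxonomy = {
    "Economics": {
        "econ.EM": "Econometrics",
        "econ.GN": "General Economics",
        "econ.TH": "Theoretical Economics",
    }
}


def build_category_query(taxonomy: dict, domain: str) -> str:
    """OR-join the domain's category codes, each prefixed with 'cat:'."""
    codes = taxonomy.get(domain, {}).keys()
    return " OR ".join(f"cat:{code}" for code in codes)


query = build_category_query(sample_taxonomy, "Economics")
print(query)
```

A domain that is missing from the taxonomy yields an empty string, which is why the notebook's helper checks for an empty category list before calling the download function.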

In [None]:
# Download a single economics paper into the database
session = get_db_session()
try:
    download_economics_papers(start_date="2024-01-01", end_date="2024-01-20", n_samples=1, output_format="json")
    econ_papers_inclusion_assessment(n=1)
    # Retrieve the inserted paper
    paper = session.query(ResearchPaper).where(
        ResearchPaper.inclusion_decision == True,
        ResearchPaper.category.in_(extended_econ_categories),
        # Doesn't have ResearchPaper.gpt4_incorrect_abstract
        ResearchPaper.gpt4_incorrect_abstract.is_(None)
    ).order_by(ResearchPaper.id.desc()).first()
finally:
    close_db_session(session)

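# Paper retrieved (the query returns None when no paper matches the filters)
if paper is None:
    raise RuntimeError("No accepted economics paper without a false abstract was found.")
print(f"Paper ID: {paper.id}")
print(f"Title: {paper.title}")
print(f"Authors: {paper.authors}")
print(f"Abstract: {paper.abstract}")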

## Step 2: Generate a False Abstract
We use the OpenAI API, through `AbstractModifier` with a `BasicOpenAI` client, to generate a false abstract for the paper. The prompt (`economics_abstract_modification_prompt`) is designed to elicit text that mimics the style and content of a typical research abstract but contains fabricated information. This lets us test the model's ability to generate coherent and contextually relevant text.

In [None]:

modifier = AbstractModifier(prompt_template=economics_abstract_modification_prompt, domain="economics", llm_client=BasicOpenAI())
false_paper, event = modifier.modify_abstract(paper, save_event=True)
print('Generated false abstract:', false_paper.gpt4_incorrect_abstract)

## Step 3: Computing Perplexity Scores
Perplexity scores are calculated for both the original and false abstracts. Perplexity measures how predictable a text is to a language model: a lower score means the model found the text less surprising. Comparing the two scores gives a rough signal of how natural the generated abstract reads relative to the original.
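
For reference, the standard definition of perplexity is the exponential of the mean negative log-probability per token. The sketch below applies that definition to a list of token log-probabilities. How `TogetherClient.perplexity_score` obtains and aggregates its log-probabilities is not shown in this notebook, so treat this as an illustration of the metric rather than of that method; the `perplexity` function name is hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_neg_logprob)


# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among four options, so its perplexity is (approximately) 4.
uniform_logprobs = [math.log(0.25)] * 8
print(round(perplexity(uniform_logprobs), 6))

# Higher per-token probabilities give a lower perplexity.
confident_logprobs = [math.log(0.9)] * 8
print(perplexity(confident_logprobs) < perplexity(uniform_logprobs))
```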

In [None]:
# Compute perplexity scores
client = TogetherClient()
orig_text = paper.abstract
false_text = false_paper.gpt4_incorrect_abstract if paper.inclusion_decision else None
orig_perp = client.perplexity_score(orig_text) if orig_text else None
false_perp = client.perplexity_score(false_text) if false_text else None
print(f'Original Abstract Perplexity: {orig_perp:.2f}' if orig_perp is not None else 'No original abstract')
print(f'False Abstract Perplexity: {false_perp:.2f}' if false_perp is not None else 'No false abstract')