# everyrow SDK Basic Usage

This notebook demonstrates all the core operations in the everyrow SDK:

1. **Screen** - Filter rows based on criteria that need judgment
2. **Rank** - Score rows by qualitative factors
3. **Dedupe** - Deduplicate when fuzzy matching fails
4. **Merge** - Join tables when keys don't match exactly
5. **Derive** - Add computed columns (no AI needed)
6. **Single Agent** - Web research on a single input
7. **Agent Map** - Web research on every row of a dataframe

Get an API key at [everyrow.io/api-key](https://everyrow.io/api-key) to run this notebook.

In [5]:
import os
from textwrap import dedent

from dotenv import load_dotenv
from pandas import DataFrame
from pydantic import BaseModel, Field

from everyrow.session import create_session

load_dotenv()

True

## 1. Screen

Filter rows on criteria you can't express in a WHERE clause. This example performs a vendor risk assessment: evaluating security track records and financial stability requires judgment, not pattern matching.
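
For contrast, here is a minimal sketch of what an ordinary WHERE clause can do with the five vendors used in this section. It uses only the standard library's `sqlite3`; the table name `vendors` and the `LIKE` filter are illustrative assumptions, not part of the everyrow SDK. SQL can filter on values stored in a column, but no column here holds breach history or financial health.

```python
import sqlite3

# The five vendors used in this section's example.
vendors = [
    ("Okta", "Identity Management", "okta.com"),
    ("LastPass", "Password Management", "lastpass.com"),
    ("Snowflake", "Data Warehouse", "snowflake.com"),
    ("Cloudflare", "CDN & Security", "cloudflare.com"),
    ("MongoDB", "Database", "mongodb.com"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vendors (company TEXT, category TEXT, website TEXT)")
conn.executemany("INSERT INTO vendors VALUES (?, ?, ?)", vendors)

# A WHERE clause can only test values that are already stored in a column.
security_vendors = [
    row[0]
    for row in conn.execute(
        "SELECT company FROM vendors WHERE category LIKE '%Security%' ORDER BY company"
    )
]

# The only columns available to filter on:
columns = [d[0] for d in conn.execute("SELECT * FROM vendors LIMIT 0").description]

print(security_vendors)  # ['Cloudflare']
print(columns)           # ['company', 'category', 'website']
conn.close()
```

Anything beyond these three columns, such as "no unresolved critical security incidents", has to be researched per row, which is what the `screen` call in this section is for.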

In [6]:
from everyrow.ops import screen


vendors = DataFrame(
    [
        {"company": "Okta", "category": "Identity Management", "website": "okta.com"},
        {"company": "LastPass", "category": "Password Management", "website": "lastpass.com"},
        {"company": "Snowflake", "category": "Data Warehouse", "website": "snowflake.com"},
        {"company": "Cloudflare", "category": "CDN & Security", "website": "cloudflare.com"},
        {"company": "MongoDB", "category": "Database", "website": "mongodb.com"},
    ]
)

print("Input vendors:")
print(vendors.to_string())

Input vendors:
      company             category         website
0        Okta  Identity Management        okta.com
1    LastPass  Password Management    lastpass.com
2   Snowflake       Data Warehouse   snowflake.com
3  Cloudflare       CDN & Security  cloudflare.com
4     MongoDB             Database     mongodb.com


In [None]:
# Basic screen: returns only a pass/fail boolean per row
basic_screen_result = await screen(
    task=dedent("""
        Perform vendor risk assessment for each company. Research and evaluate:

        1. Security track record: Have they had any significant data breaches or security
        incidents in the past 3 years? How did they respond?

        2. Financial stability: Are there signs of financial distress (major layoffs,
        funding difficulties, declining revenue)?

        3. Overall recommendation: Based on your research, should we proceed with
        this vendor for enterprise use?

        Only approve vendors with low or medium risk and no unresolved critical security incidents.
    """),
    input=vendors,
)

print("Basic Screen Results (passes/fails only):")
print(basic_screen_result.data.to_string())

In [None]:
# Screen with response_model for additional structured context
class VendorRiskAssessment(BaseModel):
    passes: bool = Field(description="Whether the vendor passes risk assessment")
    risk_level: str = Field(description="Risk level: Low, Medium, or High")
    security_summary: str = Field(description="Brief summary of security track record")
    financial_summary: str = Field(description="Brief summary of financial stability")


detailed_screen_result = await screen(
    task=dedent("""
        Perform vendor risk assessment for each company. Research and evaluate:

        1. Security track record: Have they had any significant data breaches or security
        incidents in the past 3 years? How did they respond?

        2. Financial stability: Are there signs of financial distress (major layoffs,
        funding difficulties, declining revenue)?

        3. Overall recommendation: Based on your research, should we proceed with
        this vendor for enterprise use?

        Only approve vendors with low or medium risk and no unresolved critical security incidents.
    """),
    input=vendors,
    response_model=VendorRiskAssessment,
)

print("Detailed Screen Results (with structured context):")
print(detailed_screen_result.data.to_string())

## 2. Rank

Score rows on qualities that aren't stored in a database field. This example ranks AI research organizations by the citation counts of their leaders, which requires researching each organization's leaders and their publications.
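
As a point of comparison, the sketch below sorts this section's organizations by the one numeric column the input actually contains, `founded`. It is plain Python with no SDK calls, and only illustrates that a stored field is easy to sort on while citation counts are not in the table at all.

```python
# The input rows from this section: (organization, type, founded).
orgs = [
    ("OpenAI", "Private lab", 2015),
    ("Google DeepMind", "Corporate lab", 2010),
    ("Anthropic", "Private lab", 2021),
    ("Meta FAIR", "Corporate lab", 2013),
    ("Microsoft Research", "Corporate lab", 1991),
    ("Stanford HAI", "Academic", 2019),
]

# Sorting works for a stored field such as the founding year.
by_founded = sorted(orgs, key=lambda org: org[2])
oldest_first = [name for name, _type, _year in by_founded]

print(oldest_first)
# ['Microsoft Research', 'Google DeepMind', 'Meta FAIR', 'OpenAI', 'Stanford HAI', 'Anthropic']

# No stored field holds citation counts, so that score has to be
# researched for each row before any sort can happen.
```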

In [16]:
from everyrow.ops import rank


class ContributionRanking(BaseModel):
    contribution_score: int = Field(description="Total citation count")
    most_significant_contribution: str = Field(
        description="Single most important paper authored by a firm leader"
    )

print(ContributionRanking.model_json_schema())


ai_research_orgs = DataFrame(
    [
        {"organization": "OpenAI", "type": "Private lab", "founded": 2015},
        {"organization": "Google DeepMind", "type": "Corporate lab", "founded": 2010},
        {"organization": "Anthropic", "type": "Private lab", "founded": 2021},
        {"organization": "Meta FAIR", "type": "Corporate lab", "founded": 2013},
        {"organization": "Microsoft Research", "type": "Corporate lab", "founded": 1991},
        {"organization": "Stanford HAI", "type": "Academic", "founded": 2019},
    ]
)

print("AI Research Organizations:")
print(ai_research_orgs.to_string())

{'properties': {'contribution_score': {'description': 'Total citation count', 'title': 'Contribution Score', 'type': 'integer'}, 'most_significant_contribution': {'description': 'Single most important paper authored by a firm leader', 'title': 'Most Significant Contribution', 'type': 'string'}}, 'required': ['contribution_score', 'most_significant_contribution'], 'title': 'ContributionRanking', 'type': 'object'}
AI Research Organizations:
         organization           type  founded
0              OpenAI    Private lab     2015
1     Google DeepMind  Corporate lab     2010
2           Anthropic    Private lab     2021
3           Meta FAIR  Corporate lab     2013
4  Microsoft Research  Corporate lab     1991
5        Stanford HAI       Academic     2019


In [None]:
rank_task = dedent("""
    Research the total citation count of all leaders of the given AI research organization.

    A leader is defined as a C-suite executive or a founder of the organization.
    The citation count should cover all major publications; the top ten per person is sufficient.
""")

# Basic ranking - returns only the score field
basic_rank_result = await rank(
    task=rank_task,
    input=ai_research_orgs,
    field_name="contribution_score",
    field_type="int",
    ascending_order=False,
)

print("Basic Rankings (score only):")
print(basic_rank_result.data.to_string())

In [None]:
# Ranking with response_model for additional structured context
detailed_rank_result = await rank(
    task=rank_task + "\n\nAlso identify their single most significant contribution.",
    input=ai_research_orgs,
    field_name="contribution_score",
    response_model=ContributionRanking,
    ascending_order=False,
)

print("Detailed Rankings (with structured context):")
print(detailed_rank_result.data.to_string())

## 3. Dedupe

Deduplicate when fuzzy matching falls short. This example deduplicates academic papers where the same paper may appear with different identifiers (arXiv ID vs DOI), different title formats, or as preprint vs published versions.
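
To make the limitation concrete, here is a rule-based baseline on the eight titles used in this section: normalize each title and drop exact repeats. The `normalize` helper is a hypothetical illustration using only the standard library, not part of the everyrow SDK.

```python
import re

# The eight paper titles from this section's example.
titles = [
    "Attention Is All You Need",
    "Attention Is All You Need",
    "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",
    "BERT: Pre-training of Deep Bidirectional Transformers",
    "Language Models are Few-Shot Learners",
    "GPT-3: Language Models are Few-Shot Learners",
    "LLaMA: Open and Efficient Foundation Language Models",
    "Llama 2: Open Foundation and Fine-Tuned Chat Models",
]


def normalize(title: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return " ".join(cleaned.split())


unique_titles = {normalize(t) for t in titles}

print(len(titles), "entries ->", len(unique_titles), "after exact matching")
# 8 entries -> 7 after exact matching
```

Exact matching on normalized titles merges only the identical "Attention Is All You Need" pair. The shortened BERT title and the "GPT-3:" prefixed title stay separate, even though the preprint and published versions describe the same work.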

In [None]:
from everyrow.ops import dedupe

papers = DataFrame(
    [
        {
            "title": "Attention Is All You Need",
            "authors": "Vaswani et al.",
            "venue": "NeurIPS 2017",
            "identifier": "10.5555/3295222.3295349",
        },
        {
            "title": "Attention Is All You Need",
            "authors": "Vaswani, Shazeer, Parmar et al.",
            "venue": "arXiv",
            "identifier": "1706.03762",
        },
        {
            "title": "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",
            "authors": "Devlin et al.",
            "venue": "NAACL 2019",
            "identifier": "10.18653/v1/N19-1423",
        },
        {
            "title": "BERT: Pre-training of Deep Bidirectional Transformers",
            "authors": "Devlin, Chang, Lee, Toutanova",
            "venue": "arXiv",
            "identifier": "1810.04805",
        },
        {
            "title": "Language Models are Few-Shot Learners",
            "authors": "Brown et al.",
            "venue": "NeurIPS 2020",
            "identifier": "GPT-3",
        },
        {
            "title": "GPT-3: Language Models are Few-Shot Learners",
            "authors": "Brown, Mann, Ryder et al.",
            "venue": "arXiv",
            "identifier": "2005.14165",
        },
        {
            "title": "LLaMA: Open and Efficient Foundation Language Models",
            "authors": "Touvron et al.",
            "venue": "arXiv",
            "identifier": "2302.13971",
        },
        {
            "title": "Llama 2: Open Foundation and Fine-Tuned Chat Models",
            "authors": "Touvron et al.",
            "venue": "arXiv",
            "identifier": "2307.09288",
        },
    ]
)

print(f"Input papers ({len(papers)} rows):")
print(papers.to_string())

Input papers (8 rows):
                                                                              title                          authors         venue               identifier
0                                                         Attention Is All You Need                   Vaswani et al.  NeurIPS 2017  10.5555/3295222.3295349
1                                                         Attention Is All You Need  Vaswani, Shazeer, Parmar et al.         arXiv               1706.03762
2  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding                    Devlin et al.    NAACL 2019     10.18653/v1/N19-1423
3                             BERT: Pre-training of Deep Bidirectional Transformers    Devlin, Chang, Lee, Toutanova         arXiv               1810.04805
4                                             Language Models are Few-Shot Learners                     Brown et al.  NeurIPS 2020                    GPT-3
5                                      GP

In [None]:
dedupe_result = await dedupe(
    input=papers,
    equivalence_relation=dedent("""
        Two entries are duplicates if they represent the same research work, which requires
        verifying through research:

        - An arXiv preprint and its published conference/journal version are duplicates
        - Papers with slightly different titles but same core contribution are duplicates
        - Different author list formats (et al. vs full list) don't matter
        - Papers with different identifiers (arXiv ID vs DOI) may still be duplicates

        However, genuinely different papers (e.g., LLaMA 1 vs LLaMA 2) are NOT duplicates,
        even if authors and topics overlap. Research each paper to determine if they
        report the same findings or are distinct works.
    """),
)

print("Deduplicated Paper List:")
print(dedupe_result.data.to_string())
print(f"\nOriginal entries: {len(papers)}")
print(f"Unique papers: {len(dedupe_result.data)}")
print(f"Duplicates removed: {len(papers) - len(dedupe_result.data)}")

Deduplicated Paper List:
                                                                              title                          authors         venue               identifier                  equivalence_class_id  selected                                 equivalence_class_name
0                                                         Attention Is All You Need                   Vaswani et al.  NeurIPS 2017  10.5555/3295222.3295349  7fabd3d4-68ba-4176-907f-021cda42bfe6      True                              Attention Is All You Need
1                                                         Attention Is All You Need  Vaswani, Shazeer, Parmar et al.         arXiv               1706.03762  7fabd3d4-68ba-4176-907f-021cda42bfe6     False                              Attention Is All You Need
2  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding                    Devlin et al.    NAACL 2019     10.18653/v1/N19-1423  a75771a8-5631-45bd-915f-651f5e616e60     

## 4. Merge

Join two tables when the keys don't match exactly. This example merges clinical trial data with pharmaceutical company information. The challenge is that trial sponsors are often subsidiaries or use abbreviated names (e.g., "MSD" instead of "Merck").
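
The sketch below shows how far conventional key matching gets on this section's sponsor and company names. It is plain Python; `prefix_match` is a hypothetical helper written for this illustration, not an SDK function.

```python
from typing import Optional

# Join keys from this section's two tables.
sponsors = ["Genentech", "Janssen Pharmaceuticals", "MSD", "AbbVie Inc", "BMS"]
companies = [
    "Roche Holding AG",
    "Johnson & Johnson",
    "Merck & Co.",
    "AbbVie",
    "Bristol-Myers Squibb",
]

# Exact join: a sponsor matches only if the identical string is a company name.
exact_matches = [s for s in sponsors if s in companies]


def prefix_match(sponsor: str) -> Optional[str]:
    """Return the first company whose name is a prefix of the sponsor, or vice versa."""
    s = sponsor.lower()
    for company in companies:
        c = company.lower()
        if s.startswith(c) or c.startswith(s):
            return company
    return None


# Looser join: case-insensitive prefix comparison in either direction.
loose_matches = {}
for sponsor in sponsors:
    match = prefix_match(sponsor)
    if match is not None:
        loose_matches[sponsor] = match

print(exact_matches)  # []
print(loose_matches)  # {'AbbVie Inc': 'AbbVie'}
```

An exact join finds nothing, and a prefix rule recovers only "AbbVie Inc". Linking Genentech to Roche or MSD to Merck depends on knowing corporate ownership, which string comparison cannot supply.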

In [None]:
from everyrow.ops import merge

clinical_trials = DataFrame(
    [
        {
            "trial_id": "NCT05432109",
            "sponsor": "Genentech",
            "indication": "Non-small cell lung cancer",
            "phase": "Phase 3",
        },
        {
            "trial_id": "NCT05891234",
            "sponsor": "Janssen Pharmaceuticals",
            "indication": "Multiple myeloma",
            "phase": "Phase 2",
        },
        {
            "trial_id": "NCT05567890",
            "sponsor": "MSD",
            "indication": "Melanoma",
            "phase": "Phase 3",
        },
        {
            "trial_id": "NCT05234567",
            "sponsor": "AbbVie Inc",
            "indication": "Rheumatoid arthritis",
            "phase": "Phase 3",
        },
        {
            "trial_id": "NCT05678901",
            "sponsor": "BMS",
            "indication": "Acute myeloid leukemia",
            "phase": "Phase 2",
        },
    ]
)

pharma_companies = DataFrame(
    [
        {
            "company": "Roche Holding AG",
            "hq_country": "Switzerland",
            "2024_revenue_billions": 58.7,
        },
        {
            "company": "Johnson & Johnson",
            "hq_country": "United States",
            "2024_revenue_billions": 85.2,
        },
        {
            "company": "Merck & Co.",
            "hq_country": "United States",
            "2024_revenue_billions": 60.1,
        },
        {
            "company": "AbbVie",
            "hq_country": "United States",
            "2024_revenue_billions": 56.3,
        },
        {
            "company": "Bristol-Myers Squibb",
            "hq_country": "United States",
            "2024_revenue_billions": 45.0,
        },
    ]
)

print("Clinical Trials:")
print(clinical_trials.to_string())
print("\nPharma Companies:")
print(pharma_companies.to_string())

Clinical Trials:
      trial_id                  sponsor                  indication    phase
0  NCT05432109                Genentech  Non-small cell lung cancer  Phase 3
1  NCT05891234  Janssen Pharmaceuticals            Multiple myeloma  Phase 2
2  NCT05567890                      MSD                    Melanoma  Phase 3
3  NCT05234567               AbbVie Inc        Rheumatoid arthritis  Phase 3
4  NCT05678901                      BMS      Acute myeloid leukemia  Phase 2

Pharma Companies:
                company     hq_country  2024_revenue_billions
0      Roche Holding AG    Switzerland                   58.7
1     Johnson & Johnson  United States                   85.2
2           Merck & Co.  United States                   60.1
3                AbbVie  United States                   56.3
4  Bristol-Myers Squibb  United States                   45.0


In [None]:
merge_result = await merge(
    task=dedent("""
        Merge clinical trial data with parent pharmaceutical company information.

        The sponsor names in the trials table are often subsidiaries or abbreviations:
        - Research which parent company owns each trial sponsor
        - Match trials to their parent company's financial data

        For example, Genentech is a subsidiary of Roche, Janssen is part of J&J,
        MSD is Merck's name outside the US, BMS is Bristol-Myers Squibb.
    """),
    left_table=clinical_trials,
    right_table=pharma_companies,
    merge_on_left="sponsor",
    merge_on_right="company",
)

print("Clinical Trials with Parent Company Data:")
print(merge_result.data.to_string())

Clinical Trials with Parent Company Data:
      trial_id                  sponsor                  indication    phase               company     hq_country  2024_revenue_billions                                                                                                                                                                                                                                      research
0  NCT05432109                Genentech  Non-small cell lung cancer  Phase 3      Roche Holding AG    Switzerland                   58.7  {'company': 'This row was matched due to the information in both tables', 'hq_country': 'This row was matched due to the information in both tables', '2024_revenue_billions': 'This row was matched due to the information in both tables'}
1  NCT05891234  Janssen Pharmaceuticals            Multiple myeloma  Phase 2     Johnson & Johnson  United States                   85.2  {'company': 'This row was matched due to the information in both table

## 6. Single Agent

Run web research on a single input. Agents can generate tabular data or answer questions based on research. This example first generates a competitor dataset, then analyzes it for market gaps.

In [None]:
from everyrow.ops import single_agent


class Competitor(BaseModel):
    company: str = Field(description="Company name")
    pricing_tier: str = Field(description="Pricing model, e.g. 'Freemium, $10-50/user/mo'")
    target_market: str = Field(description="Primary customer segment")
    key_features: str = Field(description="Top 3 features or differentiators")


# Step 1: Generate a dataset of competitors
print("Step 1: Research competitors")
competitors = await single_agent(
    task="Find the top 10 competitors in the B2B expense management software market",
    response_model=Competitor,
    return_table=True,
)
print(competitors.data.to_string())

Step 1: Research competitors


EveryrowError: Expected table result (list of records), but got scalar or null

In [None]:
# Step 2: Distill insights from the dataset
print("Step 2: Identify market gaps")
insights = await single_agent(
    task="""
        What gaps exist in the B2B expense management software market
        that these competitors aren't addressing?
    """,
    input=competitors,
)
print(insights.data.answer)

## 7. Agent Map

Run web research on every row of a dataframe. This example researches financial information for tech companies, which requires looking up each company individually.
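
Conceptually, this is one research task mapped over the rows. The sketch below shows that shape with `asyncio.gather` and a stub in place of the agent. `research_row` and `map_rows` are hypothetical names for illustration; this is an assumption about the general pattern, not a description of how the SDK is implemented. The stub does no web research and returns no financial data.

```python
import asyncio

# The input rows from this section's example.
companies = [
    {"company": "Stripe", "industry": "Payments"},
    {"company": "Databricks", "industry": "Data & AI"},
    {"company": "Canva", "industry": "Design"},
    {"company": "Figma", "industry": "Design"},
    {"company": "Notion", "industry": "Productivity"},
]


async def research_row(task: str, row: dict) -> dict:
    """Stub standing in for an agent: echoes the row with the task attached."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {**row, "task": task}


async def map_rows(task: str, rows: list) -> list:
    """Run the same task on every row concurrently; results keep input order."""
    return await asyncio.gather(*(research_row(task, row) for row in rows))


task = "Find the company's most recent annual revenue in USD"
results = asyncio.run(map_rows(task, companies))

print([r["company"] for r in results])
# ['Stripe', 'Databricks', 'Canva', 'Figma', 'Notion']
```

Inside a notebook an event loop is already running, so `await map_rows(task, companies)` would replace the `asyncio.run(...)` call.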

In [None]:
from everyrow.ops import agent_map


class CompanyFinancials(BaseModel):
    annual_revenue_usd: int = Field(description="Most recent annual revenue in USD")
    employee_count: int = Field(description="Current number of employees")
    founded_year: int = Field(description="Year the company was founded")


companies = DataFrame(
    [
        {"company": "Stripe", "industry": "Payments"},
        {"company": "Databricks", "industry": "Data & AI"},
        {"company": "Canva", "industry": "Design"},
        {"company": "Figma", "industry": "Design"},
        {"company": "Notion", "industry": "Productivity"},
    ]
)

print("Companies to research:")
print(companies.to_string())

Companies to research:
      company      industry
0      Stripe      Payments
1  Databricks     Data & AI
2       Canva        Design
3       Figma        Design
4      Notion  Productivity


In [None]:
# Basic usage with the default response (no response_model)
basic_result = await agent_map(
    task="Find the company's most recent annual revenue in USD",
    input=companies,
)

print("Basic Results:")
print(basic_result.data.to_string())


In [None]:
# Structured output with a response model
structured_result = await agent_map(
    task=dedent("""
        Research the company's financials. Find:
        1. Their most recent annual revenue (in USD)
        2. Current employee count
        3. Year founded

        If the company is a subsidiary, report figures for the subsidiary
        specifically, not the parent company.
    """),
    input=companies,
    response_model=CompanyFinancials,
)

print("Structured Results:")
print(structured_result.data.to_string())

Structured Results:
      company      industry  annual_revenue_usd  employee_count  founded_year