# MVP author search with OpenAlex data source

The main goal is to return results at the author (and institution) level for a given search query.

### Groupby and aggregate

Borrowing (stealing?) the design language from pandas, let's think in terms of `groupby` and `aggregate`.

- groupby: defines the return object level (`author` for now)
- aggregate: the formula used to `reduce` multiple records into a single metric. For example:
    - basic count of hits
    - custom `reranker`, borrowing from the retrieval-augmented generation (RAG) field: we weight the things people like to see to obtain a score. More specifically, we can take the `cited_by_count` of each relevant paper and sum them within an author.
    - a `reranker` can also draw on multiple underlying metrics, as we do in [faculty search](https://github.com/UW-Madison-DSI/faculty-search/blob/eff2ecfedcf5e817e70e3f3541b91d4cceeabb27/api/core.py#L387)

TODOs:

1. Make an MVP groupby/aggregate interface for search with basic hit counts
1. Implement the faculty search `reranker` as an aggregate


Mock interface:

```python
search_results = search("effect of fungicide on corn")
search_results.groupby("author").aggregate("count")  # hit counts, returns authors
search_results.groupby("author").aggregate("reranker_v0")  # fully custom reranker, returns authors
search_results.groupby("institution").aggregate("count")  # hit counts, returns institutions
```
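To make the mock concrete, here is a minimal sketch of how a groupby/aggregate pair could be wired together with only the `count` aggregate. Everything in it is hypothetical and for illustration only: the `Hit` record, `MockSearchResults`, `GroupedHits`, and the `AGGREGATORS` registry are not part of `openalex_search`, and the sample hits are made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


# Hypothetical record: one (work, author, institution) hit from a search.
@dataclass(frozen=True)
class Hit:
    work_id: str
    author: str
    institution: str
    similarity: float


# Registry of aggregate functions: each reduces a group's hits to one score.
AGGREGATORS: dict[str, Callable[[list[Hit]], float]] = {
    "count": lambda hits: float(len(hits)),
}


class GroupedHits:
    def __init__(self, groups: dict[str, list[Hit]]) -> None:
        self.groups = groups

    def aggregate(self, name: str) -> list[tuple[str, float]]:
        """Score every group and return (key, score) pairs, best first."""
        reducer = AGGREGATORS[name]
        scores = [(key, reducer(hits)) for key, hits in self.groups.items()]
        return sorted(scores, key=lambda pair: pair[1], reverse=True)


class MockSearchResults:
    def __init__(self, hits: list[Hit]) -> None:
        self.hits = hits

    def groupby(self, level: str) -> GroupedHits:
        """Bucket hits by the requested level ("author" or "institution")."""
        groups: dict[str, list[Hit]] = defaultdict(list)
        for hit in self.hits:
            groups[getattr(hit, level)].append(hit)
        return GroupedHits(dict(groups))


# Made-up sample data, only to exercise the interface.
results = MockSearchResults([
    Hit("W1", "Ada", "UW-Madison", 0.9),
    Hit("W2", "Ada", "UW-Madison", 0.7),
    Hit("W2", "Ben", "MIT", 0.7),
])
author_counts = results.groupby("author").aggregate("count")
institution_counts = results.groupby("institution").aggregate("count")
```

Keeping the aggregates in a name-keyed registry means a new formula such as `reranker_v0` could be added as one more entry without touching the `groupby` side.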

With what we have now, and without pulling a lot of author data, we should build `reranker_v0` as follows:

1. Get a list of relevant articles w.r.t. the search query
2. Group by author
3. Weight each author by the sum of the similarity scores of their articles

This should be noticeably better than the vanilla count without much extra work, because an author with a few highly relevant articles can outrank one with many marginal hits.
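The three steps can be sketched in plain Python. This is only an illustration under stated assumptions: the `(work_id, similarity, authors)` tuple layout and the sample values are invented here and are not the actual output of `search`.

```python
from collections import defaultdict

# Hypothetical search output: (work_id, similarity to the query, author names).
relevant_works = [
    ("W1", 0.92, ["Ada", "Ben"]),
    ("W2", 0.85, ["Ada"]),
    ("W3", 0.40, ["Cy"]),
    ("W4", 0.35, ["Cy"]),
    ("W5", 0.30, ["Cy"]),
]


def reranker_v0(works):
    """Sum each author's similarity scores and rank authors by that total."""
    scores = defaultdict(float)
    for _work_id, similarity, authors in works:
        for author in authors:
            # Every co-author gets the full similarity of the article.
            scores[author] += similarity
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


ranking = reranker_v0(relevant_works)
```

In this made-up data, Cy has the most hits (three) but a lower similarity total than Ada, so Ada ranks first. A plain hit count would put Cy on top.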

In [None]:
from openalex_search.search import _search, SearchResults

In [None]:
y = _search("climate change", 3)

In [None]:
SearchResults.from_raw(y)

In [None]:
y

In [None]:
from openalex_search import search

search("Higgs field")

### Snippets

### Reset DB

In [1]:
from sqlmodel import SQLModel
from pathlib import Path
from openalex_search.db import init, ENGINE
from openalex_search.ingest import ingest

SQLModel.metadata.drop_all(ENGINE)
init()
# ingest(Path("local_data/test_authors.parquet"))
# ingest(Path("local_data/test_articles.parquet"))

### Rebuild Index

In [None]:
from sqlmodel import Session, text, Index
from openalex_search.db import ENGINE, Work
from openalex_search.common import CONFIG

with Session(ENGINE) as session:
    session.connection().execute(text("DROP INDEX IF EXISTS work_index;"))
    session.commit()

index = Index(
    "work_index",
    Work.embedding,
    postgresql_using="hnsw",
    postgresql_with={
        "m": CONFIG.HNSW_M,
        "ef_construction": CONFIG.HNSW_EF_CONSTRUCTION,
    },
    postgresql_ops={"embedding": "vector_ip_ops"},
)
index.create(bind=ENGINE, checkfirst=True)

In [2]:
import pandas as pd

df = pd.read_parquet("local_data/uw-works-27-last.parquet")

In [None]:
len(df)

In [None]:
df.head()