## Dependencies

1. Git
2. Python >= 3.13
3. Requests - `pip install requests`
4. SAIM - `pip install saim@git+https://github.com/LeibnizDSMZ/saim.git@v0.9.1`
5. CAFI - `pip install cafi@git+https://github.com/LeibnizDSMZ/cafi.git@v0.9.3`

## Installation - Linux

1. **Run `git clone https://github.com/artdotlis/N4M_STRAININFO_2025.git`** to download the repository.
2. **Run `make install`** to install dependencies and set up the environment.
3. **Run `make activate`** to activate the environment.


## Installation - Windows/Linux/Mac

1. **Run `git clone https://github.com/artdotlis/N4M_STRAININFO_2025.git`** to download the repository.
2. **Run `python -m venv .venv`** to set up the environment.
3. Activate environment
   - Windows - **Run `.\.venv\Scripts\activate`**
   - Linux/Mac - **Run `source .venv/bin/activate`**
4. **Run `pip install .`** to install dependencies.

## Resources

1. **Implement your solution**  
   The main implementation file for the solution can be found here:  
   `src/n4m_straininfo_2025/main.py`

2. **Recommended functions to use**  
   To save time, refer to the recommended functions in the library folder:  
   `src/n4m_straininfo_2025/lib`

3. **Tips for each step**  
   Helpful tips for each step are provided here (spoiler alert!):  
   `src/n4m_straininfo_2025/tips`

## 1. Get abstract for DOI: 10.1093/femsle/fnad030

- Use the **OpenAlex API** and **requests** to fetch the abstract.
- **OpenAlex API** - https://api.openalex.org/works/https://doi.org/10.1093/femsle/fnad030
- Use the function **`create_abstract_from_inverted`** in `lib.helper` to process the response.
- (Alternative): You can simply copy the abstract directly from [DOI: 10.1093/femsle/fnad030](https://doi.org/10.1093/femsle/fnad030).

In [None]:
import requests

OPEN_ALEX = "https://api.openalex.org/works/https://doi.org/10.1093/femsle/fnad030"


def create_abstract_from_inverted(r_ind: dict[str, tuple[int, ...]], /) -> str:
    last_ind = 0
    for ind_v in r_ind.values():
        new_m = max(ind_v)
        last_ind = new_m if new_m > last_ind else last_ind
    reversed_abs = ["" for _ in range(0, last_ind + 1)]
    for key, ind_v in r_ind.items():
        for pos in ind_v:
            reversed_abs[pos] = key
    return " ".join(reversed_abs)


def get_abstract_from_open_alex() -> str:
    # A timeout keeps the call from hanging if the API does not respond.
    res = requests.get(OPEN_ALEX, timeout=30)
    if res.status_code == 200:
        return create_abstract_from_inverted(res.json()["abstract_inverted_index"])
    return f"Error: status code {res.status_code}"


if __name__ == "__main__":
    print(get_abstract_from_open_alex())
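
OpenAlex delivers the abstract as an inverted index (`abstract_inverted_index`): each word maps to the positions where it occurs. The following minimal sketch rebuilds a text from a toy index to show what `create_abstract_from_inverted` has to do. The index contents and the name `rebuild_text` are made up for illustration.

```python
# Toy inverted index in the shape OpenAlex uses: word -> positions in the text.
# The words and positions are made up for illustration.
toy_index: dict[str, list[int]] = {
    "strain": [0, 3],
    "ATCC": [1],
    "8486": [2],
    "grows": [4],
}


def rebuild_text(inverted: dict[str, list[int]]) -> str:
    # Place every word at each of its positions, then join them in order.
    size = max(pos for positions in inverted.values() for pos in positions) + 1
    words = [""] * size
    for word, positions in inverted.items():
        for pos in positions:
            words[pos] = word
    return " ".join(words)


print(rebuild_text(toy_index))  # strain ATCC 8486 strain grows
```

A word that occurs several times, such as `strain` here, is stored once with all of its positions.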

## Best practices - API

- Retry on failure
- Schema checks - `pydantic`
- Timeouts and rate limits - set a request timeout and respect the requests-per-second limits imposed by APIs
- Cache responses - `requests-cache`
- Thread-/Process-based parallelism or asynchronous calls
- Data license
- (Optional) robots.txt
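
Two of the points above, retrying on failure and respecting rate limits, can be sketched with the standard library alone. The helper names `call_with_retry` and `Throttle` and all default values are assumptions chosen for illustration. They are not part of `requests` or of the workshop library.

```python
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T], /, attempts: int = 3, base_delay: float = 0.5
) -> T:
    # Retry func on any exception, doubling the wait after each failure.
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2**attempt)
    raise ValueError("attempts must be at least 1")


class Throttle:
    # Enforce a minimum interval between consecutive calls to wait().
    def __init__(self, min_interval: float) -> None:
        self._min_interval = min_interval
        self._last = float("-inf")

    def wait(self) -> None:
        remaining = self._min_interval - (time.monotonic() - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()
```

A request could then be wrapped as `call_with_retry(lambda: requests.get(url, timeout=10))`, with `Throttle(0.2).wait()` called before each request to stay at roughly five requests per second. In real code it is usually better to catch only network-related exceptions instead of every `Exception`.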

## 2. Find all culture collection numbers in the abstract

- Extract culture collection numbers from the abstract using the function **`get_ccno_from_abstract`** in `lib.helper`.
- (Optional): Try creating a **RegEx** pattern to extract culture collection numbers with the Acronym-Number structure (e.g. **DSM 72**).

In [None]:
import re
from typing import Final

from n4m_straininfo_2025.tips.abstract_1 import get_abstract_from_open_alex


# Matches ATCC numbers only, e.g. "ATCC 8486".
CCNO_RE: Final[re.Pattern[str]] = re.compile(r"\b(ATCC\s*\d+)\b")


def _get_ccno_from_abstract() -> set[str]:
    # Group 1 is always a non-empty string when the pattern matches.
    return {sea.group(1) for sea in CCNO_RE.finditer(get_abstract_from_open_alex())}


if __name__ == "__main__":
    print(_get_ccno_from_abstract())
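
The pattern above only covers ATCC. A more general Acronym-Number pattern could look like the sketch below. The assumed shape (two to six uppercase letters, an optional whitespace character, then digits) and the sample sentence are illustrative choices. They are not a specification of real culture collection numbers.

```python
import re

# Assumed shape: 2-6 uppercase letters, optional whitespace, then digits.
GENERIC_CCNO_RE = re.compile(r"\b([A-Z]{2,6})\s?(\d+)\b")

# Made-up sentence for illustration only.
text = "Strains DSM 72 and ATCC 8486 were compared using PCR 16 cycles."

found = {f"{m.group(1)} {m.group(2)}" for m in GENERIC_CCNO_RE.finditer(text)}
print(found)
```

Besides `DSM 72` and `ATCC 8486`, the result also contains `PCR 16`. A purely structural pattern cannot tell a collection acronym from any other uppercase abbreviation.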

## Issue - CCNo

**Complexity escalation**:

- Maintainability
- Performance
- High resource consumption

**Best practices**

- Use the number as an anchor - `(\d+(?:\D\d+)*)` matches, for example, `123-1.42`
- Look left to search for an acronym and prefix
- (Optional) Look right to search for suffix

-> Requires knowledge about acronyms - CAFI

In [None]:
from n4m_straininfo_2025.tips.abstract_1 import get_abstract_from_open_alex
from saim.designation.manager import AcronymManager


def get_ccno_from_abstract(abstract: str, /) -> set[str]:
    acronym_manager = AcronymManager("v0.9.3")
    res_ccno = acronym_manager.extract_all_valid_ccno_from_text(abstract)
    return set(
        ccno.designation
        for ccno in res_ccno if ccno.acr != ""
    )


def _get_ccno_from_abstract() -> set[str]:
    return get_ccno_from_abstract(get_abstract_from_open_alex())


if __name__ == "__main__":
    print(_get_ccno_from_abstract())
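
To illustrate the idea behind the best practices (number as anchor, then look left for an acronym), here is a minimal sketch that does not use SAIM. The acronym set, the helper name `find_ccnos`, and the sample sentence are hypothetical. A real acronym list would come from a curated source such as CAFI.

```python
import re

# Hypothetical acronym list; a real one would come from a curated catalogue.
KNOWN_ACRONYMS = {"DSM", "ATCC", "JCM"}

# Number anchor, same pattern as in the best practices.
NUMBER_RE = re.compile(r"\d+(?:\D\d+)*")
# Token directly left of the number, separated by at most one space or hyphen.
LEFT_RE = re.compile(r"([A-Za-z]+)[ -]?$")


def find_ccnos(text: str, /) -> set[str]:
    results: set[str] = set()
    for num in NUMBER_RE.finditer(text):
        # Look left: inspect only the text that ends where the number starts.
        left = LEFT_RE.search(text[: num.start()])
        if left is not None and left.group(1).upper() in KNOWN_ACRONYMS:
            results.add(f"{left.group(1).upper()} {num.group(0)}")
    return results


# Made-up sentence for illustration only.
sample = "Compared DSM 72 with ATCC-8486 after PCR 16 times; see also JCM 123-1.42."
print(find_ccnos(sample))
```

Here `PCR 16` is rejected because `PCR` is not in the acronym set. Slicing the text for every number is simple but not efficient, so this is only meant to show the idea.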

## 3. Find *Eubacterium limosum* in the abstract

- Search for occurrences of **`Eubacterium limosum`** within the abstract text.
- (Recommended): Use a simple string search approach, or refer to `tips.taxa_3`
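
A simple string search could look like the following sketch. The helper name `find_taxon` and the sample sentence are made up for illustration.

```python
def find_taxon(text: str, name: str, /) -> list[int]:
    # Return the start offset of every case-insensitive occurrence of name.
    haystack, needle = text.lower(), name.lower()
    hits: list[int] = []
    pos = haystack.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = haystack.find(needle, pos + 1)
    return hits


# Made-up sentence for illustration only.
sample = "Eubacterium limosum was grown; E. limosum was then harvested."
print(find_taxon(sample, "Eubacterium limosum"))  # [0]
```

The abbreviated form `E. limosum` is not found by this search. Abbreviations therefore need extra handling if they should count as occurrences.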

In [None]:
import re
from collections.abc import Iterable
from typing import Final

from n4m_straininfo_2025.tips.abstract_1 import get_abstract_from_open_alex

GEN_RE: Final[re.Pattern[str]] = re.compile(r"(\b[A-Z][a-z]+\b)")
SPE_RE: Final[re.Pattern[str]] = re.compile(r"^\s([a-z]+)")


def search_regex(text: str, /) -> Iterable[tuple[int, str]]:
    for sea in re.finditer(GEN_RE, text):
        res = sea.group(1)
        if isinstance(res, str) and res != "":
            yield sea.start(), res


def get_spe_name(spe_s: set[str] | None, gen: str, pos: int, full: str, /) -> str:
    if spe_s is None:
        return gen
    spe_m = SPE_RE.match(full[pos + len(gen) :])
    if spe_m is not None and spe_m.group(1) in spe_s:
        return f"{gen} {spe_m.group(1)}"
    return gen


def extract_taxa(
    tax_man: dict[str, set[str] | None], abstract: str, /
) -> Iterable[str]:
    if abstract != "":
        for pos_start, match in search_regex(abstract):
            if (gid := match.lower()) in tax_man:
                yield get_spe_name(tax_man[gid], match, pos_start, abstract)


def get_e_limosum_from_abstract() -> set[str]:
    return set(
        extract_taxa({"eubacterium": {"limosum"}}, get_abstract_from_open_alex())
    )


if __name__ == "__main__":
    print(get_e_limosum_from_abstract())

## Issue - Taxonomy

- Requires knowledge about taxonomy (known genus and species names)
- Requires searching the text word by word

## Issue - False positives

**Found**

- ATCC 8486
- *Eubacterium limosum*

**Could these be false positives?**

- Verify with the StrainInfo API

## 4. Find all strains in StrainInfo with the same CCNos found in the previous task

- Use the **StrainInfo API** to search for all matching strain IDs (`SI-ID`).
- Documentation: https://straininfo.dsmz.de/service

In [None]:
import requests
from n4m_straininfo_2025.lib.helper import get_ccno_from_abstract
from n4m_straininfo_2025.tips.abstract_1 import get_abstract_from_open_alex

SI_IDS = "https://api.straininfo.dsmz.de/v2/search/strain/cc_no/"


def get_si_ids_for_ccnos() -> set[str]:
    ccnos = get_ccno_from_abstract(get_abstract_from_open_alex())
    # One bundled request for all CCNos; the timeout avoids hanging forever.
    res = requests.get(SI_IDS + ",".join(ccnos), timeout=30)
    if res.status_code == 200:
        return set(res.json())
    return set()


if __name__ == "__main__":
    print(get_si_ids_for_ccnos())

## Best practices - StrainInfo API

- Bundle requests - make one request instead of multiple
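
As a sketch of bundling, the helper below builds a single search URL for several CCNos, joined with commas as in the request shown earlier. The helper name `build_bundled_url` is hypothetical. The percent-encoding and the sorting are illustrative choices and not requirements stated by the API.

```python
from urllib.parse import quote

# Base URL taken from the search request shown earlier.
SI_SEARCH = "https://api.straininfo.dsmz.de/v2/search/strain/cc_no/"


def build_bundled_url(ccnos: set[str], /) -> str:
    # Sort for a stable URL (helps caching) and percent-encode each CCNo.
    return SI_SEARCH + ",".join(quote(ccno, safe="") for ccno in sorted(ccnos))


print(build_bundled_url({"DSM 72", "ATCC 8486"}))
# https://api.straininfo.dsmz.de/v2/search/strain/cc_no/ATCC%208486,DSM%2072
```

One request for two CCNos replaces two separate round trips. A stable, sorted URL also makes cached responses easier to reuse.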

## 5. Check if the strains overlap with detected taxonomy

- Use the **StrainInfo API** to retrieve information about the strain.
- Compare each strain's taxonomy with the detected taxonomy.
- Print the following for each matched strain:
  - **SI-ID**
  - **Strain deposit designation**
  - **Taxonomy**

In [None]:
import requests
from typing import Iterable
from n4m_straininfo_2025.tips.si_ids_4 import get_si_ids_for_ccnos
from n4m_straininfo_2025.tips.taxa_3 import get_e_limosum_from_abstract


STRAIN = "https://api.straininfo.dsmz.de/v2/data/strain/min/"


def get_matched_strains() -> Iterable[tuple[int, str, str]]:
    taxa = get_e_limosum_from_abstract()
    si_ids = get_si_ids_for_ccnos()
    res = requests.get(STRAIN + ",".join(str(si_id) for si_id in si_ids), timeout=30)
    if res.status_code == 200:
        for strain in res.json():
            # Keep only strains whose taxon was also detected in the abstract.
            if strain["strain"]["taxon"]["name"] in taxa:
                yield (
                    strain["strain"]["siID"],
                    ",".join(
                        dep["designation"]
                        for dep in strain["strain"]["relation"]["deposit"]
                    ),
                    strain["strain"]["taxon"]["name"],
                )


if __name__ == "__main__":
    for line in get_matched_strains():
        print("STRAIN:")
        # Indent every field, including the first one, below the header.
        print("\t" + "\n\t".join(str(ele) for ele in line))


## Best practices - Verification

- Links between independent findings (here: a CCNo and a taxon that point to the same strain) reduce false positives
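
The verification step can be tried offline with placeholder data. In the sketch below only the keys mirror the response fields used in step 5. All values, the variable names, and the helper name `keep_linked` are hypothetical.

```python
# Hypothetical records: only the keys mirror the response fields used in step 5;
# the values are placeholders, not real StrainInfo data.
records = [
    {"strain": {"siID": 1, "taxon": {"name": "Eubacterium limosum"},
                "relation": {"deposit": [{"designation": "ABC 1"}]}}},
    {"strain": {"siID": 2, "taxon": {"name": "Genus species"},
                "relation": {"deposit": [{"designation": "ABC 2"}]}}},
]
detected_taxa = {"Eubacterium limosum"}


def keep_linked(recs: list[dict], taxa: set[str], /) -> list[int]:
    # Keep only strains whose taxon name was also detected in the text.
    return [
        rec["strain"]["siID"]
        for rec in recs
        if rec["strain"].get("taxon", {}).get("name") in taxa
    ]


print(keep_linked(records, detected_taxa))  # [1]
```

A strain whose taxon does not match the text, here the second record, is dropped even though its CCNo was found.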