## Testing code used in my ANSP Streamlit app

Below is mostly the _exact same code_ as in my Streamlit app file `app.py` (except for the displaCy rendering at the end). I tested pieces of the Python code here before adding them to the Streamlit file.

In [1]:
# load packages 

import spacy
from pathlib import Path
import srsly
import importlib
import random
from spacy.pipeline import EntityRuler # Import the Entity Ruler for making custom entities
from spacy.language import Language  # type: ignore 
import json
# Import the spaCy visualizer
from spacy import displacy
import pandas as pd

In [2]:
# streamlit setup stuff
MODELS = srsly.read_json("/Users/thalassa/streamlit/streamlit-ansp/models.json")
DEFAULT_MODEL = "en_core_web_sm"
DEFAULT_TEXT =  "Frances Naomi Clark was an American ichthyologist born in 1894, and was one of the first woman fishery researchers to receive world-wide recognition. Frances Naomi Clark was an American ichthyologist born in 1894, and was one of the first woman fishery researchers to receive world-wide recognition. Seven Ampelis cedrorum specimens were collected in a meadow near lowland fruit trees. Some habitats we know are in the json file are near large rocks, near river mouths, near the bottom and near the ocean. Some species names are Hemigrapsus affinis, Hemigrapsus crassimanus, Hendersonia alternifoliae and Hendersonia celtifolia."
DESCRIPTION = """**Explore trained [spaCy v3.0](https://nightly.spacy.io) pipelines with the Proceedings of the Academy of Natural Sciences of Philadelphia**"""


In [3]:
text = DEFAULT_TEXT
# doc = spacy_streamlit.process_text(NLP, text)

nlp = spacy.load(DEFAULT_MODEL)
ruler = nlp.add_pipe("entity_ruler", before='ner')
ruler.from_disk("/Users/thalassa/streamlit/streamlit-ansp/ansp-patterns.jsonl")


<spacy.pipeline.entityruler.EntityRuler at 0x16eebd2c0>
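The contents of `ansp-patterns.jsonl` aren't shown here. As a rough sketch, assuming the file follows the entity ruler's usual JSONL layout (one JSON object per line with `label` and `pattern` keys), this is what writing and reading such a file looks like. The two patterns and the `demo-patterns.jsonl` file name are hypothetical, chosen to match labels that appear in the output further down.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical patterns; the real ansp-patterns.jsonl is not reproduced here.
patterns = [
    # Phrase pattern: matches the exact string
    {"label": "TAXA", "pattern": "Ampelis cedrorum"},
    # Token pattern: one dict per token, matched case-insensitively
    {"label": "HABITAT", "pattern": [{"LOWER": "near"}, {"LOWER": "the"}, {"LOWER": "ocean"}]},
]


def write_patterns(path, patterns):
    """Write one JSON object per line (JSONL)."""
    with Path(path).open("w", encoding="utf8") as f:
        for pattern in patterns:
            f.write(json.dumps(pattern) + "\n")


def read_patterns(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo-patterns.jsonl"
    write_patterns(demo_path, patterns)
    loaded = read_patterns(demo_path)

labels_in_file = sorted({p["label"] for p in loaded})
print(labels_in_file)
```

Reading the file back like this is a quick way to see which labels a patterns file defines before handing it to `ruler.from_disk`.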

In [4]:
# Check the pipeline. The entity_ruler should happen *before* ner.
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x170973f90>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1709af1d0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1706c3e80>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x170959f80>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x170952c80>),
 ('entity_ruler', <spacy.pipeline.entityruler.EntityRuler at 0x16eebd2c0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1706c3fa0>)]
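The order matters because spaCy's `ner` component is documented to respect entity spans that are already set on the doc, so a ruler placed before it gets first claim on the text. Here is a small sketch of checking the order in code rather than by eye. It assumes a blank English pipeline with an untrained `ner` placeholder so that no model download is needed; `demo_nlp` is an illustration name, not something from `app.py`.

```python
import spacy

# Blank pipeline as a stand-in for en_core_web_sm (assumption: keeps the sketch download-free).
demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("ner")  # untrained placeholder, only used to check ordering
demo_nlp.add_pipe("entity_ruler", before="ner")

# pipe_names lists component names in the order they run
names = demo_nlp.pipe_names
ruler_first = names.index("entity_ruler") < names.index("ner")
print(names, ruler_first)
```

The same `pipe_names` check should work on the real `nlp` object from the cell above.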

In [5]:
doc = nlp(text)

In [6]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Frances Naomi Clark 0 19 PERSON
American 27 35 NORP
1894 58 62 DATE
one 72 75 CARDINAL
Frances Naomi Clark 150 169 PERSON
American 177 185 NORP
1894 208 212 DATE
one 222 225 CARDINAL
Seven 300 305 CARDINAL
Ampelis cedrorum 306 322 TAXA
lowland fruit trees 365 384 HABITAT
near large rocks 433 449 HABITAT
near river mouths 451 468 HABITAT
near the bottom 470 485 HABITAT
near the ocean 490 504 HABITAT
Hemigrapsus affinis 529 548 TAXA
Hemigrapsus crassimanus 550 573 TAXA
Hendersonia alternifoliae 575 600 TAXA
Hendersonia celtifolia 605 627 TAXA


In [7]:
displacy.render(doc, style="ent", jupyter=True)

In [8]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.head, 
            token.ent_type_, token.is_alpha, token.is_punct, token.is_stop)

Frances Frances PROPN NNP compound Clark PERSON True False False
Naomi Naomi PROPN NNP compound Clark PERSON True False False
Clark Clark PROPN NNP nsubj was PERSON True False False
was be AUX VBD ROOT was  True False True
an an DET DT det ichthyologist  True False True
American american ADJ JJ amod ichthyologist NORP True False False
ichthyologist ichthyologist NOUN NN attr was  True False False
born bear VERB VBN acl ichthyologist  True False False
in in ADP IN prep born  True False True
1894 1894 NUM CD pobj in DATE False False False
, , PUNCT , punct was  False True False
and and CCONJ CC cc was  True False True
was be VERB VBD conj was  True False True
one one NUM CD attr was CARDINAL True False True
of of ADP IN prep one  True False True
the the DET DT det researchers  True False True
first first ADJ JJ amod woman  True False True
woman woman NOUN NN compound researchers  True False False
fishery fishery NOUN NN compound researchers  True False False
researchers researcher NOUN N

In [12]:
rows = []

for token in doc:
    rows.append(
        {
            'Token': token.text, 
            'Lemma': token.lemma_,
            'POS': token.pos_,
            'Tag': token.tag_,
            'Dependency': token.dep_,
            'Head': token.head.text,  # store the head's text, not the Token object
            'Ent Type': token.ent_type_,
            'IsAlpha': token.is_alpha,
            'IsPunct': token.is_punct,
            'IsStop': token.is_stop
        }
    )   
tokes = pd.DataFrame(rows)

In [11]:
print(tokes)

             Token          Lemma    POS  Tag Dependency           Head  \
0          Frances        Frances  PROPN  NNP   compound          Clark   
1            Naomi          Naomi  PROPN  NNP   compound          Clark   
2            Clark          Clark  PROPN  NNP      nsubj            was   
3              was             be    AUX  VBD       ROOT            was   
4               an             an    DET   DT        det  ichthyologist   
..             ...            ...    ...  ...        ...            ...   
105  alternifoliae  alternifoliae  PROPN  NNP      appos    crassimanus   
106            and            and  CCONJ   CC         cc  alternifoliae   
107    Hendersonia    Hendersonia  PROPN  NNP   compound     celtifolia   
108     celtifolia     celtifolia   NOUN  NNS       conj  alternifoliae   
109              .              .  PUNCT    .      punct            are   

    Ent Type  IsAlpha  IsPunct  IsStop  
0     PERSON     True    False   False  
1     PERSON     

The tokens DataFrame doesn't seem to be catching the entity types for the custom entities (`TAXA`, `HABITAT`). Need to fix that somehow...
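One way to narrow this down is to check whether `token.ent_type_` is set for entity ruler matches at all. The sketch below assumes a blank English pipeline and a single hypothetical `TAXA` pattern (not the real patterns file), and lists only the tokens that carry an entity type.

```python
import spacy

# Blank pipeline with only an entity ruler (assumption: no trained model needed for this check).
demo_nlp = spacy.blank("en")
demo_ruler = demo_nlp.add_pipe("entity_ruler")
demo_ruler.add_patterns([{"label": "TAXA", "pattern": "Ampelis cedrorum"}])  # hypothetical pattern

demo_doc = demo_nlp("Seven Ampelis cedrorum specimens were collected.")

# Keep only tokens whose ent_type_ is non-empty
tagged = [(token.text, token.ent_type_) for token in demo_doc if token.ent_type_]
print(tagged)
```

If the custom label shows up on the tokens here, the pipeline side is probably fine and the gap may be in how the DataFrame is displayed, since `print(tokes)` truncates long frames. In the notebook, something like `tokes[tokes['Ent Type'] != '']` would show only the rows that have an entity type.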

In [None]:
# To keep the labels for the custom entities, we have to make a new,
# combined list from both the ner and entity_ruler components.
labels = list(nlp.get_pipe("ner").labels)
labels.extend(nlp.get_pipe("entity_ruler").labels)

In [None]:
displacy.render(doc, style="dep")