# Using a Regex To Extract More Candidates for an Entity Classification Approach

Model 2 uses the Schwartz-Hearst algorithm to extract candidates. However, if
the classifier works well, there is no reason to limit the candidates to those
that fit the pattern `LONG NAME (ACRONYM)`. Instead, can we use a regex to
produce a high-recall set of candidates and then use the classifier to raise
the precision?

In [3]:
import json
import regex as re
from unidecode import unidecode

In [4]:
text_id = "0008656f-0ba2-4632-8602-3017b44c2e90"
with open("../data/kaggle/train/{}.json".format(text_id)) as f:
    text = unidecode(" ".join([sec["text"].replace("\n", " ") for sec in json.load(f)]))

The pattern below is built from three parts, wrapped in one outer group so that
the first group of every match is the full candidate:

- `([A-Z][a-z]{2,}\ )`: matches a word that starts with a capital letter followed
by at least two lowercase letters, then a space.
- `((for\ |in\ |and\ |the\ |[A-Z][a-z]{2,})[\ ]?){2,}`: matches at least two
consecutive tokens, each optionally followed by a space, where a token is one of
the following:
  - `for `
  - `in `
  - `and `
  - `the `
  - `[A-Z][a-z]{2,}`: a capitalized word, as in the first part
- `(\([A-Z]{3,}\))?`: matches an acronym in parentheses, that is, a `(`, at least
three capital letters, and a `)`. This part is optional.
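To make the three parts easier to inspect, here is a minimal sketch of the same pattern written in verbose mode and applied to a made-up sentence. It assumes the standard library `re` module is enough for this pattern (nothing in it appears to need `regex`-only features); the names `ENTITY_PATTERN_VERBOSE`, `extract_candidates` and `sample` are illustrative and not part of the notebook.

```python
import re

# Same pattern as in the notebook, spread over several lines with re.VERBOSE.
# Escaped spaces ("\ ") are kept in verbose mode; unescaped whitespace is ignored.
ENTITY_PATTERN_VERBOSE = re.compile(
    r"""
    (                                            # group 1: the whole candidate
      ([A-Z][a-z]{2,}\ )                         # capitalized word plus a space
      (
        (for\ |in\ |and\ |the\ |[A-Z][a-z]{2,})  # connector or capitalized word
        [\ ]?                                    # optional space after the token
      ){2,}                                      # at least two such tokens
      (\([A-Z]{3,}\))?                           # optional "(ACRONYM)"
    )
    """,
    re.VERBOSE,
)


def extract_candidates(text):
    """Return the set of full candidate strings (group 1) found in text."""
    return {m.group(1) for m in ENTITY_PATTERN_VERBOSE.finditer(text)}


# Made-up sentence, used only for illustration.
sample = (
    "Scores on the Programme for International Student Assessment (PISA) "
    "were lower in Eastern Finland than elsewhere."
)
print(extract_candidates(sample))
```

On this sample the sketch returns two candidates: the full `Programme for International Student Assessment (PISA)` string and `Eastern Finland ` (with a trailing space). The second one is worth noting: the pattern has no word boundaries, so the engine can split `Finland` into `Finl` plus the connector `and `, which is how a two-word phrase satisfies the "at least two tokens" requirement.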

In [54]:
entity_pattern = re.compile(r"(([A-Z][a-z]{2,}\ )((for\ |in\ |and\ |the\ |[A-Z][a-z]{2,})[\ ]?){2,}(\([A-Z]{3,}\))?)")

In [56]:
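# findall returns one tuple of groups per match; index 0 is the outer group,
# i.e. the full candidate string.
print(set(match[0] for match in entity_pattern.findall(text)))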


{'Wang and Degol ', 'Like in the ', 'European Union and the United States ', 'David Wade Chambers in ', 'Trends in International Mathematics and Science Study (TIMMS)', 'Cejka and Eagly ', 'Southern Finland ', 'Eastern Finland ', 'National Science Foundation', 'Dick and Rallis ', 'Miller and Hayward', 'Picker and Berry ', 'Maltese and Tai ', 'Programme for International Student Assessment (PISA)'}
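Several candidates in the output carry a trailing space, and some end in a parenthesized acronym. Before passing candidates to a classifier, it may help to normalize them. The following is a minimal sketch of one possible clean-up step, not something the notebook does; `ACRONYM_SUFFIX` and `split_candidate` are hypothetical names introduced here.

```python
import re

# Hypothetical helper pattern: a trailing "(ACRONYM)" with optional leading spaces.
ACRONYM_SUFFIX = re.compile(r"\s*\(([A-Z]{3,})\)$")


def split_candidate(candidate):
    """Return (long_name, acronym); acronym is None when there is none."""
    candidate = candidate.strip()  # drop the trailing space left by the regex
    found = ACRONYM_SUFFIX.search(candidate)
    if found is None:
        return candidate, None
    return candidate[: found.start()], found.group(1)


# A few candidates copied from the output above.
raw_candidates = {
    "Wang and Degol ",
    "National Science Foundation",
    "Programme for International Student Assessment (PISA)",
}

for long_name, acronym in sorted(
    (split_candidate(c) for c in raw_candidates), key=lambda pair: pair[0]
):
    print(long_name, "|", acronym)
```

Keeping the long name and the acronym separate lets a downstream classifier score either form, while false positives such as author pairs remain in the set for the classifier to reject.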
