### Find vendor names in emails

This script finds vendor names in a set of emails by:
1. Finding emails relevant to payment, invoices, etc. 
2. Extracting named organization entities from them.
3. Filtering out the most frequent entities in the whole set of emails. 
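
Step 3 can be illustrated with a minimal sketch. It assumes that "most frequent" means the share of emails an entity appears in, and that entities above a cutoff are dropped (presumably because names that show up everywhere, such as the mailbox owner's own organization, are unlikely to be vendors). The function name `drop_frequent_entities`, the `max_doc_share` parameter, and the sample data are hypothetical and are not taken from `get_entities`.

```python
from collections import Counter


def drop_frequent_entities(relevant_entities, all_email_entities, max_doc_share=0.5):
    """Drop entities that occur in more than max_doc_share of all emails.

    relevant_entities: entities found in the relevant emails (step 2).
    all_email_entities: one collection of entities per email in the whole set.
    """
    # Count each entity at most once per email (document frequency).
    doc_freq = Counter()
    for entities in all_email_entities:
        doc_freq.update(set(entities))

    n_emails = len(all_email_entities)
    return [
        entity
        for entity in relevant_entities
        if n_emails == 0 or doc_freq[entity] / n_emails <= max_doc_share
    ]


# Hypothetical data: "Acme Corp" appears in every email, the others in one each.
all_email_entities = [
    ["Acme Corp", "Contoso"],
    ["Acme Corp"],
    ["Acme Corp", "Globex"],
    ["Acme Corp"],
]
candidates = ["Acme Corp", "Contoso", "Globex"]
vendors = drop_frequent_entities(candidates, all_email_entities)
print(vendors)  # ['Contoso', 'Globex']
```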

#### Setup

JS setup
- Run `npm install pst-extractor` in the `process_pst` folder.

- Comment out lines 162-165 in *process_pst/node_modules/pst-extractor/dist/PSTFolder.class.js*: 

  ```
  if ((emailRow && emailRow.itemIndex == -1) || !emailRow) {
    // no more!
    return null;
  }
  ```

Process PST files
- `node process_pst.js <input folder or file> <output folder>`

Python dependencies

- `pip install beautifulsoup4 bertopic flair keybert keyphrase_vectorizers scikit-learn`

#### Imports

In [1]:
from sentence_transformers import util
from get_entities import main, semantic_search_kw, similarity_wordnet, semantic_search_doc

#### Specify input and output, optional args, and run

**TL;DR:** 
- Best parameters to play around with are:

    - **filter_terms**: Terms used to find the relevant emails from which named organization entities are extracted, e.g. ['invoice', 'payment', 'vendor'] to find emails containing vendor names.

    - **filter_func_args['threshold']**: Similarity threshold used to decide whether an email is relevant.
- The default *semantic_search_doc* models save document embeddings after the first run, so re-runs with different configs (but the same model) will be faster.
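
How **filter_terms** and the threshold interact can be sketched as follows. The sketch assumes an email counts as relevant when its best similarity score against any filter term reaches the threshold; this is an assumption about the behavior, not a description of the code in `get_entities`. The function `select_relevant` and all scores are hypothetical.

```python
def select_relevant(scores_by_email, threshold):
    """Return ids of emails whose best score over all filter terms meets the threshold."""
    return [
        email_id
        for email_id, term_scores in scores_by_email.items()
        if max(term_scores.values()) >= threshold
    ]


# Hypothetical similarity scores per email and filter term.
scores_by_email = {
    "email_1": {"invoice": 0.62, "payment": 0.41, "vendor": 0.30},
    "email_2": {"invoice": 0.12, "payment": 0.08, "vendor": 0.15},
    "email_3": {"invoice": 0.35, "payment": 0.48, "vendor": 0.22},
}

loose = select_relevant(scores_by_email, threshold=0.4)
strict = select_relevant(scores_by_email, threshold=0.6)
print(loose)   # ['email_1', 'email_3']
print(strict)  # ['email_1']
```

Under this assumption, lowering the threshold lets more emails through (more candidate entities, more noise), while raising it keeps only the closest matches.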


**Args for get_entities.main**

- **input** : str

    - Path to folder containing extracted PST file contents.
    
- **output** : str

    - Path to the folder where results are written.
- **filter_terms** : list, optional

    - Terms used to find relevant emails, 
        default ['invoice', 'payment', 'vendor']
        
- **filter_func** : func, optional

    - Function used to compare filter_terms and email text/keywords:
        - *semantic_search_doc* (default) : Asymmetric semantic search comparing filter_terms with email text.

        - *semantic_search_kw* : Symmetric semantic search comparing filter_terms with email keywords.

        - *similarity_wordnet* : Compares filter_terms with WordNet synsets of email keywords. filter_terms must be a list of [WordNet synset](https://www.nltk.org/howto/wordnet.html) names, e.g. ['lunch.n.01'].

- **filter_func_args** : dict, optional

    - Each filter_func has its own default args (see the beginning of each function). Use this only to override them:

        - *model_name* (semantic_search_*) : Use asymmetric semantic search models for semantic_search_doc and symmetric ones for semantic_search_kw. Make sure to use the appropriate metric for the specified model. See the [sentence-transformers documentation](https://www.sbert.net/docs/pretrained_models.html#semantic-search) for more info.

        - *metric* : Similarity metric.

            - wn.wup_similarity (similarity_wordnet) : 0 to 1

            - util.cos_sim (semantic_search_*) : -1 to 1

            - util.dot_score (semantic_search_*) : No fixed range. Test the model on Hugging Face to determine a reasonable threshold.

        - *threshold* : Threshold to determine relevance.

- **kw_kwargs** : dict, optional

    - Args for keyword extraction, default {'top_n': 10}

- **kw_batch_size** : int, optional

    - Batch size for keyword extraction, default None
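
The note on *metric* ranges can be made concrete with a small sketch. It uses plain-Python stand-ins for cosine similarity and dot product (named `cosine_similarity` and `dot_product` here; they are not the `sentence_transformers.util` functions) and made-up vectors to show why a threshold chosen for one metric does not carry over to the other.

```python
import math


def dot_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the two vectors after normalizing each to unit length."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


# Hypothetical embeddings: both documents point in the same direction as the
# query and differ only in length.
query = [1.0, 2.0, 0.5]
doc = [2.0, 4.0, 1.0]
doc_scaled = [20.0, 40.0, 10.0]

cos_a = cosine_similarity(query, doc)
cos_b = cosine_similarity(query, doc_scaled)
dot_a = dot_product(query, doc)
dot_b = dot_product(query, doc_scaled)

print(round(cos_a, 6), round(cos_b, 6))  # 1.0 1.0
print(dot_a, dot_b)                      # 10.5 105.0
```

Cosine similarity ignores vector length and always stays within -1 to 1, so a threshold such as 0.5 has a stable meaning. The dot product grows with vector length, so a sensible threshold depends on the scale of the embeddings a given model produces.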

In [2]:
input_folder = 'data/extracted emails/process_pst_js/priorityservices/'
output_folder = 'data/results/priorityservices/'

use_custom = False # Set to True to use custom_kwargs

custom_kwargs = {
    'filter_terms': ['lunch.n.01'],
    'filter_func': similarity_wordnet,
    'filter_func_args': {'threshold': .90}
}

##########################################

kwargs = custom_kwargs if use_custom else {}

main(input_folder, output_folder, **kwargs)

2024-02-08 17:12:21,706 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>


Getting entities...: 100%|██████████| 37758/37758 [00:00<00:00, 1739376.93it/s]



Finished.
